Fog Creek Software
Discussion Board

Equivalent of mshtml for Linux?

Is there an equivalent of mshtml for Linux? Preferably one I can use from Python.

I need to parse malformed HTML documents quickly and access any part of them through a DOM-like interface.

Banana Man
Wednesday, September 1, 2004

GNOME's libxml2 might help you.

Wednesday, September 1, 2004

You can do this with a combination of TidyLib (to clean up malformed HTML and convert it to XHTML) and libxml2.  TidyLib itself uses a DOM-like structure as well, so you may not need libxml2, depending on what you're trying to do.

These are cross-platform, and will run on pretty much any OS.  Since they are C-based, there are many "wrappers" available for higher-level languages like Python and Tcl.

Wednesday, September 1, 2004

There's also Tag Soup and NekoHTML.

Wednesday, September 1, 2004

There are many native Python libraries you can use, starting with the built-in SGMLParser and HTMLParser modules.  You might also take a look at Beautiful Soup.
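
For the built-in HTMLParser route, here is a rough sketch of how a small DOM-like tree could be built from tag soup.  It assumes a current Python, where the module lives at html.parser (it was a top-level HTMLParser module in 2004).  Node, TreeBuilder, parse and VOID_TAGS are names made up for this illustration, not part of any library, and the sketch does not try to reproduce a browser's implicit tag-closing rules.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (assumed short list, not exhaustive).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    """Hypothetical minimal DOM-like node; not a standard-library class."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []  # child Node objects and plain text strings

    def find_all(self, tag):
        """Return all descendant nodes with the given tag, in document order."""
        found = []
        for child in self.children:
            if isinstance(child, Node):
                if child.tag == tag:
                    found.append(child)
                found.extend(child.find_all(tag))
        return found

    def text(self):
        """Concatenate all text beneath this node."""
        return "".join(
            c if isinstance(c, str) else c.text() for c in self.children
        )


class TreeBuilder(HTMLParser):
    """Builds a Node tree, tolerating unclosed and stray tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent
        # Otherwise it is a stray closing tag and is ignored.

    def handle_data(self, data):
        self.current.children.append(data)


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


# Unclosed <p> and <b>, an unquoted attribute, and a stray </i>.
doc = parse("<p>Hello <b>world<p>Second <a href=/x>link</a></i>")
print(len(doc.find_all("p")))            # 2
print(doc.find_all("a")[0].attrs["href"])  # /x
```

The tree this produces nests the second p inside the unclosed b, which is not what a browser would do.  If you need browser-like recovery, that is where a dedicated library earns its keep.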

Abe Fettig
Wednesday, September 1, 2004

If I recall correctly, KHTML (Konqueror and Safari's rendering engine) has Python bindings.

Simon Perreault
Wednesday, September 1, 2004

Can you use MSHTML to parse a document without hosting IE?

Wednesday, September 1, 2004
