Fog Creek Software
Discussion Board




Equivalent of mshtml for Linux?

Is there an equivalent of mshtml for Linux? Preferably one I can use from Python.

I need to parse HTML documents that are not well-formed, do it quickly, and access any part of them through a DOM-like interface.

Banana Man
Wednesday, September 01, 2004

Gnome's libxml might help you.

ice
Wednesday, September 01, 2004

You can do this with a combination of TidyLib (to clean up non-well formed HTML and convert it to XHTML) and libxml2.  TidyLib itself uses a DOM-like structure as well, so you may not need libxml2, depending on what you're trying to do.

http://tidy.sourceforge.net/
http://www.xmlsoft.org/

These are cross-platform, and will run on pretty much any OS.  Since they are C-based, there are many "wrappers" available for higher-level languages like Python and Tcl.

joev
Wednesday, September 01, 2004

There's also Tag Soup ( http://mercury.ccil.org/~cowan/XML/tagsoup/ ) and NekoHTML ( http://www.apache.org/~andyc/neko/doc/html/ )

matt
Wednesday, September 01, 2004

There are many native Python libraries you can use, starting with the built-in SGMLParser and HTMLParser modules.  You might also take a look at Beautiful Soup:

http://www.crummy.com/software/BeautifulSoup/
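
For the built-in route, here is a rough sketch of how a DOM-like tree could be built on top of HTMLParser. It assumes a current Python 3, where the module lives at html.parser; the Node and TreeBuilder classes, the parse helper, and the VOID_TAGS set are made up for illustration and are not part of the standard library. Recovery is deliberately simple: unclosed tags just stay open and stray end tags are ignored, which is far less than what a browser engine does.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (illustrative subset, not exhaustive).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    """A tiny DOM-like element (hypothetical helper, not a library class)."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""

    def find_all(self, tag):
        """Return every descendant element with the given tag name."""
        found = []
        for child in self.children:
            if child.tag == tag:
                found.append(child)
            found.extend(child.find_all(tag))
        return found


class TreeBuilder(HTMLParser):
    """Builds a Node tree, tolerating unclosed and stray tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this name;
        # ignore the end tag if no such element is open.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


# Malformed input: unclosed <p> and <b>, plus a stray </i>.
doc = parse('<p>Hello <b>world<p>Second <a href="/x">link</a></i>')
links = doc.find_all("a")
```

After this, links[0].attrs["href"] and links[0].text give you the link target and its text. For anything beyond simple lookups, Beautiful Soup or Tidy plus libxml2 will probably hold up better on really ugly markup.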

Abe Fettig
Wednesday, September 01, 2004

If I recall correctly, KHTML (Konqueror and Safari's rendering engine) has Python bindings.

Simon Perreault
Wednesday, September 01, 2004

Can you use MSHTML to parse a document without hosting IE?


Wednesday, September 01, 2004
