Fog Creek Software
Discussion Board




Real-world XML DOM?

Hiya all
With much disappointment, I've found that Microsoft's XML DOM library is exceedingly picky, even in HTML mode. It complains about unterminated tags, non-standard attributes, and so on.
But I need to parse HTML files, and you know that nobody has ever paid much attention to creating pages in standard HTML, or even *valid* HTML for that matter. I don't consider writing my own parser an option (the less I have to handle strings, the better). So I ask you: do you know of any XML DOM that can be used, let's say, for a real-world browser?
FWIW: I'm writing a download manager + web spider (kind of like a cross between GetRight and Teleport Pro), and for the "web spider" part I need to download and parse HTML to find all linked and embedded files.

KJK::Hyperion
Sunday, July 21, 2002

There are quite a few nice Perl modules for spidering and parsing HTML.

http://search.cpan.org/search?mode=module&query=robot

Given a URL, the robot modules can extract and visit links.

For HTML parsing:
http://search.cpan.org/search?dist=HTML-Parser
http://search.cpan.org/doc/GAAS/HTML-Parser-3.26/Parser.pm

And for C/C++, HTML Tidy does HTML parsing: http://www.w3.org/People/Raggett/tidy/
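
Driving Tidy as a library is only a few calls. Here's a sketch against the current libtidy C API (header names have shifted between releases, so treat the includes as approximate):

    #include <stdio.h>
    #include <tidy.h>
    #include <tidybuffio.h>   /* named buffio.h in older TidyLib releases */

    int main(void)
    {
        /* misnested markup, exactly the kind real pages are full of */
        const char* input = "<b>bold <i>both</b> italic</i>";

        TidyBuffer output = {0};
        TidyBuffer errbuf = {0};
        TidyDoc tdoc = tidyCreate();

        tidyOptSetBool(tdoc, TidyXhtmlOut, yes);   /* emit well-formed XHTML */
        tidySetErrorBuffer(tdoc, &errbuf);         /* collect warnings quietly */

        if (tidyParseString(tdoc, input) >= 0 &&
            tidyCleanAndRepair(tdoc) >= 0 &&
            tidySaveBuffer(tdoc, &output) >= 0)
            printf("%s", (char*)output.bp);        /* repaired markup, fit for an XML DOM */

        tidyBufFree(&output);
        tidyBufFree(&errbuf);
        tidyRelease(tdoc);
        return 0;
    }

The output is well-formed XHTML, which you could then hand to MSXML or any strict XML DOM.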


Matthew Lock
Sunday, July 21, 2002

Thanks! I'd forgotten about Tidy; I hope it's reusable enough.

(and I knew that there were Perl modules to do it - there's *always* a Perl module to do something - but I need this for C/C++/Delphi)

KJK::Hyperion
Sunday, July 21, 2002

If you're only looking for links, I'd just use a regular expression; otherwise you have a lot of work ahead of you. I've just finished something similar to what you're talking about, and my next project was going to be an HTML parser, and let me tell you, it's MUCH harder than XML. Definitely not worth the effort unless you're going to be parsing through the content and need to keep stuff grouped by tables, etc. As for an HTML parsing library, I haven't found one either, other than a couple of Perl modules.
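
For instance, something like this (sketched with std::regex purely for illustration; PCRE or boost::regex work the same way):

    #include <iostream>
    #include <regex>
    #include <string>

    int main()
    {
        std::string html = "<a HREF='files/a.zip'>a</a> <img src=\"pic.gif\"> <frame src=menu.htm>";

        // Match href=/src= values in double quotes, single quotes, or bare.
        std::regex link(R"rx((?:href|src)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+)))rx",
                        std::regex::icase);

        for (std::sregex_iterator it(html.begin(), html.end(), link), end; it != end; ++it) {
            const std::smatch& m = *it;
            std::cout << (m[1].matched ? m.str(1) : m[2].matched ? m.str(2) : m.str(3)) << '\n';
        }
    }

It will misfire on href= inside comments or script blocks, but for a spider that just wants candidate URLs, that's usually an acceptable trade.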

Vincent Marquez
Sunday, July 21, 2002

Why don't you have a look at wget? It's a parser/downloader; it works, it's fast, and it has more options than you could wish for. And the code is available for you to see :)

http://www.gnu.org/software/wget/wget.html

http://www.wget.org/

Also, have a look at:
http://directory.google.com/Top/Computers/Software/Internet/Site_Management/Mirroring/

HTH. Javier

Javier JJ
Monday, July 22, 2002

Maybe "Programming Bots, Spiders, and Intelligent Agents in Microsoft Visual C++ (Microsoft Programming Series)"
by David Pallmann could be of interest to you?

Just me (Sir to you)
Monday, July 22, 2002

KJK,

MSXML is designed for parsing XML documents, so it will only read valid XHTML. Even valid HTML will be no good.

TjanXMLTree is a fast and compact XML parser.  You can download the .pas file from here:

http://home.planet.nl/~verho037/jfdelphi.htm

You could modify it for your own use.

Ged Byrne
Monday, July 22, 2002

Since you're on Windows, why not use the WebBrowser control to do the HTML parsing? It's possible to use it without actually displaying anything. Once you've loaded the document you can walk the HTML DOM fairly easily.

MSXML is an XML parser - HTML is *not* well-formed XML, so it typically won't parse that well.
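
A rough sketch of that approach, assuming ATL smart pointers (error handling mostly trimmed; this is the pattern, not production code):

    #include <atlbase.h>
    #include <mshtml.h>
    #include <cstdio>

    // Feed an HTML string to MSHTML and walk document.links.
    void DumpLinks(const wchar_t* html)
    {
        CComPtr<IHTMLDocument2> doc;
        if (FAILED(doc.CoCreateInstance(CLSID_HTMLDocument)))
            return;

        CComQIPtr<IPersistStreamInit> init(doc);
        if (init) init->InitNew();                    // blank document to write into

        // IHTMLDocument2::write takes a SAFEARRAY of one BSTR VARIANT
        SAFEARRAY* sa = SafeArrayCreateVector(VT_VARIANT, 0, 1);
        VARIANT* v;
        SafeArrayAccessData(sa, (void**)&v);
        v->vt = VT_BSTR;
        v->bstrVal = SysAllocString(html);
        SafeArrayUnaccessData(sa);
        doc->write(sa);
        doc->close();
        SafeArrayDestroy(sa);                         // frees the BSTR too

        CComPtr<IHTMLElementCollection> links;
        if (FAILED(doc->get_links(&links)) || !links) // <a> and <area> elements
            return;

        long count = 0;
        links->get_length(&count);
        for (long i = 0; i < count; ++i) {
            CComPtr<IDispatch> item;
            links->item(CComVariant(i), CComVariant(i), &item);
            CComQIPtr<IHTMLAnchorElement> a(item);
            CComBSTR href;
            if (a && SUCCEEDED(a->get_href(&href)) && href)
                wprintf(L"%s\n", (wchar_t*)(BSTR)href);   // already resolved to absolute
        }
    }

    int main()
    {
        CoInitialize(NULL);
        DumpLinks(L"<html><body><a href='file.zip'>get</a></body></html>");
        CoUninitialize();
    }

The nice part is that MSHTML applies the same error recovery IE itself uses, so tag soup comes out as a sane tree, and get_images() gives you embedded files the same way.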

Chris Tavares
Tuesday, July 23, 2002

About half of the complexity in parsing real world HTML documents is in the lexer and the other half is in the parser.  You only need the lexer to find links. Focus on the lexer.
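
For illustration, here's a hypothetical bare-bones lexer that pulls href/src values without ever building a tree (it skips comments, but punts on <script> bodies, entities, and resolving relative URLs):

    #include <cctype>
    #include <iostream>
    #include <string>
    #include <vector>

    // Scan tag by tag; never build a tree.
    std::vector<std::string> ExtractLinks(const std::string& html)
    {
        std::vector<std::string> urls;
        size_t i = 0;
        while ((i = html.find('<', i)) != std::string::npos) {
            if (html.compare(i, 4, "<!--") == 0) {            // skip comments whole
                size_t end = html.find("-->", i + 4);
                if (end == std::string::npos) break;
                i = end + 3;
                continue;
            }
            size_t close = html.find('>', i + 1);
            if (close == std::string::npos) break;
            std::string tag = html.substr(i + 1, close - i - 1);
            std::string lo = tag;                             // lowered copy for matching;
            for (char& c : lo)                                // original keeps URL case
                c = (char)std::tolower((unsigned char)c);
            for (const char* name : {"href", "src"}) {
                std::string a(name);
                size_t pos = 0;
                while ((pos = lo.find(a, pos)) != std::string::npos) {
                    if (pos > 0 && !std::isspace((unsigned char)lo[pos - 1])) {
                        pos += a.size();                      // "xhref" etc.: not ours
                        continue;
                    }
                    size_t eq = lo.find_first_not_of(" \t\r\n", pos + a.size());
                    if (eq == std::string::npos || lo[eq] != '=') { pos += a.size(); continue; }
                    size_t v = lo.find_first_not_of(" \t\r\n", eq + 1);
                    if (v == std::string::npos) break;
                    size_t end;
                    if (tag[v] == '"' || tag[v] == '\'') {    // quoted value
                        end = tag.find(tag[v], v + 1);
                        if (end == std::string::npos) end = tag.size();
                        urls.push_back(tag.substr(v + 1, end - v - 1));
                    } else {                                  // bare value
                        end = tag.find_first_of(" \t\r\n", v);
                        if (end == std::string::npos) end = tag.size();
                        urls.push_back(tag.substr(v, end - v));
                    }
                    pos = end;
                }
            }
            i = close + 1;
        }
        return urls;
    }

    int main()
    {
        std::string page =
            "<A HREF='files/a.zip'>a</A> <img src=pic.gif> <!-- <a href=no.zip> -->";
        for (const std::string& u : ExtractLinks(page))
            std::cout << u << '\n';
    }

Tokenizing at the tag level like this sidesteps everything that makes full HTML parsing hard: misnesting, implicit closes, and error recovery all live in the parser, not the lexer.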

Ex IE programmer
Thursday, July 25, 2002

How about using Regular Expressions instead?

Dunc
Friday, July 26, 2002

I have a related question. What about parsing HTML documents containing errors? E.g. <B> Text1 <I> Text2 </B> </I>
Have you seen any recommendations?

Igor Zhbanov
Monday, July 29, 2002

I mean: <B> Text1 <I> Text2 </B> Text3 </I>

Igor Zhbanov
Monday, July 29, 2002

This is an old thread but it's my current vexation.

IE's rich text field produces fugly HTML when you paste from Word into it. But since IE has it in a DOM, it has an opportunity to clean it up.

I'm trying to get something like MSXML.DOMDocument to eat its own serialised HTML so that I can traverse it and regenerate some sane XML.

I'm not having any luck; can anyone shed light?

Oliver George
Thursday, July 01, 2004
