Fog Creek Software
Discussion Board




Parsing HTML pages

Hi all,

I developed a meta-search application in 1998 to query all the important search engines at the same time and show only the resulting pages. At the time I had Excite, AltaVista, and Yahoo! working.
Trouble is, how do you parse an HTML page generated by one of these services? As I could not do better, I simply looked for reference keys in the result, like start at "results:" and stop at "more pages". Then I would extract the data, also using string-handling routines.
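Roughly, the idea was something like this (sketched here in Perl; the marker strings and patterns are only illustrative, not the exact ones I used):

my @hits;
my ($block) = $page =~ m{results:(.*?)more pages}s;      # $page holds the fetched page; grab the text between the two markers
while ($block =~ m{<a href="([^"]+)"[^>]*>(.*?)</a>}gis) {
    push @hits, { url => $1, title => $2 };              # pull out each link and its title
}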
It isn't an elegant solution, but it works. Now I'm thinking about updating the application to query Google, Yahoo!, and AlltheWeb, and I'm wondering if there is an easier way to parse HTML.
I know there are components that build a virtual tree from an HTML page, but the trouble is that search engines can change their HTML at any time, and advertising insertions also change the page source.
It would be easier if all of them had XML-based web services, but I understand this would be against their business plans. :-)

Mauricio Macedo
Wednesday, March 31, 2004

I sure don't know of a magic bullet. You're building a screen-scraping application, therefore you're pretty much at the mercy of the website that you're scraping...

I've had luck parsing pages at the Yahoo business site because some wonderful programmer there was kind enough to do things like this:

<td class=headline>blah blah</td>
<td class=date>somedate</td>
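
With markup like that, even a throwaway regex gets you most of the way (sketched in Perl; the class names are just the ones above):

while ($html =~ m{<td class=headline>(.*?)</td>\s*<td class=date>(.*?)</td>}gis) {
    print "$2: $1\n";    # date: headline
}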

Maybe you'll find that some of the search engines do something similar?

Rob VH
Wednesday, March 31, 2004

> It could be easier if all of them had XML based web services

Actually...
http://www.google.com/apis/download.html

TomA
Wednesday, March 31, 2004

Mauricio, head to CPAN: http://search.cpan.org/ There are parsing modules available for all the major search engines; they're in the WWW:: namespace.

Egor Shipovalov
Wednesday, March 31, 2004

I visited the CPAN site (good stuff!), but those modules are also just parsing strings. The Google API has its own problems: a low limit on queries per day and a non-commercial-use restriction (IIRC).
As I said before, the search engines have no interest in programs automagically getting only the results. I see that little has changed since 1998: I will have to look at the page source, lie that my program is IE 6.0, and string-parse the results. *sigh*
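
At least the lying part is easy, since it's just the User-Agent header. With Perl's LWP, for example (only a sketch; the query URL is made up), it would be something like:

use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
$ua->agent('Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)');   # claim to be IE 6
my $response = $ua->get('http://www.google.com/search?q=test');     # made-up example query
die $response->status_line unless $response->is_success;
my $html = $response->content;                                       # now string-parse this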

Mauricio Macedo
Wednesday, March 31, 2004

Though someone's already pointed it out to you, if you can use it, CPAN is the way to go.

For interacting with search engines, try WWW::Search: http://search.cpan.org/dist/WWW-Search
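
A minimal WWW::Search run looks roughly like this (the backend name and query are just placeholders):

use WWW::Search;

my $search = WWW::Search->new('AltaVista');                        # pick whichever backend you need
$search->native_query(WWW::Search::escape_query('fog creek'));
while (my $result = $search->next_result()) {
    print $result->url, "\n";
}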

For Web screen scraping in general, try WWW::Mechanize: http://search.cpan.org/dist/WWW-Mechanize
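
And a rough WWW::Mechanize sketch for grabbing every link off a results page (the URL is only an example):

use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
$mech->get('http://www.alltheweb.com/search?q=fog+creek');   # example URL only
for my $link ($mech->links()) {
    print $link->url, "\n";
}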

If you want to roll your own HTML parsing, well, that's hard. You can try something like HTML::Parser, or, if you're doing this in C or C++, a custom C/C++ HTML parser. The only one I'm vaguely familiar with is libxml2's HTML parser, but that's a tree-based parser and you probably want an event stream, and also, libxml is huge.
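
If you do go the HTML::Parser route, the event-driven style looks roughly like this (just collecting hrefs as they stream past):

use HTML::Parser;

my @hrefs;
my $parser = HTML::Parser->new(
    api_version => 3,
    start_h     => [ sub {
        my ($tagname, $attr) = @_;
        push @hrefs, $attr->{href} if $tagname eq 'a' && $attr->{href};
    }, 'tagname, attr' ],
);
$parser->parse($html);   # $html is the page you already fetched
$parser->eof;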

The iron-fisted hot dog and pretzel Baron of Fulton County.
Wednesday, March 31, 2004

I'd probably start off with a disclaimer that most search engines make it difficult (illegal, if you like) to screen-scrape their results. Google, AFAIK, explicitly forbids automated result collection.

Having said that, I've needed to do this myself (for my own use), and a few options come to mind: Jericho and NekoHTML for Java, and the WWW:: namespace for Perl (earlier posters have mentioned the WWW::Search modules and Mechanize, which are useful).

My solution, ultimately, was to use HTML::TokeParser to get a list of all the anchors (href tags) on the page and then do some client-side processing to figure out which ones are valid (and which aren't). Since my application just needed the links (and didn't require the excerpt), it worked well for me.

For example, for a standard Google search, you'd eliminate all links that point to somewhere inside Google (definitions and other Google services) and the Google advertising service (it's a separate domain, but easily recognizable ;)
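
A rough sketch of the idea (the filter patterns here are approximations, not my exact rules):

use HTML::TokeParser;

my $p = HTML::TokeParser->new(\$html);        # $html is the fetched results page
my @hits;
while (my $tag = $p->get_tag('a')) {
    my $href = $tag->[1]{href} or next;
    next if $href =~ m{google\.}i;            # rough heuristic: skip Google's own services and ads
    next if $href =~ m{^/};                   # skip relative links back into the site
    push @hits, $href;
}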

Everything else is part of your search result. YMMV for other engines, but most are adopting the clean-cut look, so you won't get many cluttered pages and the non-search-result links should generally be easy to identify.

Good luck :)

deja vu
Wednesday, March 31, 2004

I'm working on that very thing right now: a program that runs Google, Yahoo, etc. queries and then retrieves the links.  The way we did it is by defining regular expressions that uniquely describe where things are in the results.  For example, Yahoo delimits the results (as opposed to the ads, header, footer, etc.) by opening a UL tag (enclosed in angle brackets, of course, but I don't want to fight the forum engine).  So the first step is a regular expression that takes all the text from UL to /UL.  Next, the individual links are delimited by LI tags (another regex).

We define the whole thing using regexes: how to find links (direct, cached, and "view as HTML"), descriptions, titles, the next-page button, the "found 560,123 links" text.  I have about a dozen search engines described like this, and it works well.  BTW, I'm programming in Python, but any regex-equipped language would work.
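
To give the flavor in Perl terms, since any regex-equipped language works (the markup details here are approximate, not Yahoo's exact HTML):

my ($list) = $page =~ m{<ul>(.*?)</ul>}is;                        # the block holding the results
my @items  = split /<li>/i, $list;                                # one chunk per result
shift @items;                                                     # drop anything before the first LI
for my $item (@items) {
    my ($url, $title) = $item =~ m{<a href="([^"]+)"[^>]*>(.*?)</a>}is;
    print "$title -> $url\n" if $url;
}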

Sven
Friday, April 02, 2004
