Fog Creek Software
Discussion Board

Perl, IE and invalid HTML

Hey guys (and hopefully Perl gurus),

I'm trying to get up to speed on Perl, and am finding that some Perl modules, such as HTML::TableExtract, are completely intolerant of invalid HTML (i.e. missing end tags, etc.). The process I'm trying to automate with Perl needs to parse lots of tabular data from various websites. But almost all of those sites' pages contain tons of invalid HTML (invalid from the standpoint of various HTML validation tools), and hence my Perl code doesn't work.

These pages *display* just fine in IE, which is obviously VERY tolerant of bad HTML.

I'll probably just resort to brute force regex parsing, but shouldn't Perl be more tolerant of invalid HTML?
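
For reference, the kind of thing I'm attempting looks roughly like this (a minimal sketch; the module's CPAN name is HTML::TableExtract, and the URL and column headers here are made up):

    use strict;
    use warnings;
    use LWP::Simple qw(get);
    use HTML::TableExtract;

    # Fetch a pricing page (hypothetical URL).
    my $html = get('http://www.example.com/parts.html')
        or die "Couldn't fetch the page\n";

    # Pull out the table whose header row contains these column names.
    my $te = HTML::TableExtract->new( headers => [ 'Part', 'Price' ] );
    $te->parse($html);

    for my $table ($te->tables) {       # older versions call this table_states()
        for my $row ($table->rows) {
            print join(',', map { defined $_ ? $_ : '' } @$row), "\n";
        }
    }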

Anon123
Friday, July 09, 2004

Being tolerant of bad HTML is actually pretty difficult, and parsing HTML with regular expressions is generally a bad idea, but it might be the best you can do in that situation.

Is there some library you can leverage, such as an htmltidy package, that will clean up the illegal HTML so your Perl scripts can parse it?
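
Something along these lines might work, shelling out to the command-line tidy tool before parsing (a rough sketch; it assumes tidy is installed, and the flags can vary between versions):

    use strict;
    use warnings;

    my $dirty_file = 'page.html';   # the fetched HTML, saved to disk

    # -q                  : suppress tidy's commentary
    # -asxhtml            : rewrite the page as well-formed XHTML
    # --force-output yes  : still emit output when tidy finds errors
    my $clean_html = `tidy -q -asxhtml --force-output yes $dirty_file`;

    die "tidy produced no output\n"
        unless defined $clean_html and length $clean_html;

    # Now hand $clean_html to the parser that was choking before.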

Lou
Friday, July 09, 2004

First, separate "Perl" from the Perl module you're trying to use, which undoubtedly uses regexps and likely was not written by the Perl language developers.

"Perl" itself doesn't know a THING about HTML.  Your "problem" is that the module's code isn't tolerant of bad HTML.

Your options are: roll your own module, find one that better fits your predicament, or write your own parser.

There may be instances where it's IMPOSSIBLE to parse truly horribly written HTML.  In any case, what are you screen scraping for anyway?  Seems fishy.

If you haven't got the skill to use the tool you're attempting to use, don't blame the tool.

muppet from madebymonkeys.net
Friday, July 09, 2004

Perl is perfectly tolerant of any string you want to give it.  Apparently, the HTML::TableExtract code (an added library, NOT native Perl) is intolerant of things IE thinks are just fine.

That's a problem with the library, not Perl.

Sorry, but you've stepped on a pet peeve of mine.  Don't blame the tool you are using when what isn't working isn't the tool -- it's something you've added to the tool.

AllanL5
Friday, July 09, 2004

Thanks for the fast responses guys!

>> Is there some library you can leverage, such as an htmltidy package, that will clean up the illegal HTML so your Perl scripts can parse it?

I'll have to check into that.

>> In any case, what are you screen scraping for anyway?  Seems fishy.

What's fishy about screen scraping?  Isn't that what the LWP module was partially designed to do?  Anyway, I've got users who are tired of visiting various web sites manually to fetch current pricing data on specific auto parts and just want to have the process automated, a seemingly perfect fit for Perl and its LWP module.
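
For what it's worth, the fetching side is already straightforward with LWP (a minimal sketch; the URL is made up):

    use strict;
    use warnings;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new( timeout => 30 );
    $ua->agent('PartsPriceFetcher/0.1');   # identify the bot politely

    # Hypothetical pricing page.
    my $response = $ua->get('http://www.example.com/parts/pricing.html');
    die 'Fetch failed: ' . $response->status_line . "\n"
        unless $response->is_success;

    my $html = $response->content;   # this is what needs the tolerant parser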

>> Apparently, the HTML::TableExtract code (an added library, NOT native Perl) is intolerant of things IE thinks are just fine.  That's a problem with the library, not Perl.

Agree 100%.  I love Perl, even though I'm relatively new to it.  The things I've been able to accomplish in a short amount of time are truly astounding, and a testament to the power of Perl and all the various modules.

Anon123
Friday, July 09, 2004

Pretty touchy Perl lovers around here.  The OP didn't come off to me as attacking Perl, but whateva.

I think what his issue brings to light is exactly why Internet Explorer is the #2 worst thing that happened to the internet (#1 being ad banners, popups, and other breeds of spam).  If IE had not tolerated bad HTML, we would not have legions of bad HTML websites and web "developers" out there.  Too many people open their site in IE, see that it looks OK, and stop there.

Clay Whipkey
Friday, July 09, 2004

Clay,

It's really a topic for a whole other discussion, but I've always been curious why ANY browser accepts invalid HTML.

Anon123
Friday, July 09, 2004

I use a tool in Python that handles bad HTML gracefully. It is called BeautifulSoup. Essentially it takes the document and breaks it down into a data structure that represents the nesting of the tags. If something is broken, you end up with more content inside the surrounding tag than you should. When that happens I just dump out to some simple error handling, which I replace with special-case code for that one area if I really need what's inside the tags. I imagine a system like this could easily be written in Perl.
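
The closest Perl analogue I know of is HTML::TreeBuilder, which likewise turns sloppy HTML into a nested data structure you can walk (a quick sketch):

    use strict;
    use warnings;
    use HTML::TreeBuilder;

    # TreeBuilder copes with sloppy markup by closing elements implicitly.
    my $tree = HTML::TreeBuilder->new_from_file('page.html');

    $tree->dump;     # print the nesting, indented, for a quick look
    $tree->delete;   # free the tree when done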

Jeff
Friday, July 09, 2004

I'd suggest building your tool around HTML::Parser. As the POD explains:

>>We have tried to make it able to deal with the HTML that is actually "out there", and it normally parses as closely as possible to the way the popular web browsers do it instead of strictly following one of the many HTML specifications from W3C. Where there is disagreement, there is often an option that you can enable to get the official behaviour.<<

I've found it does an impressive job of taking nasty HTML that exists in the wild and parsing it into something reasonable. It's not always "perfect" about how it handles things like imputing missing close tags, but it is resilient and almost always does something that's at least manageable, in my experience.

It's quite straightforward to use HTML::Parser to generate a parse tree then traverse the tree to find the <table> element and its children and extract the information you want.
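
In outline, that looks something like this (a sketch; HTML::TreeBuilder is built on top of HTML::Parser and is one convenient way to get the tree, and I'm assuming the first table on the page is the one you want):

    use strict;
    use warnings;
    use HTML::TreeBuilder;   # builds the parse tree on top of HTML::Parser

    my $tree = HTML::TreeBuilder->new_from_file('page.html');

    # Grab the first <table> on the page; tighten the criteria as needed.
    my $table = $tree->look_down( _tag => 'table' )
        or die "No table found\n";

    for my $tr ( $table->look_down( _tag => 'tr' ) ) {
        # Accept both <td> and <th> cells.
        my @cells = map { $_->as_trimmed_text }
                    $tr->look_down( _tag => qr/^(?:td|th)$/ );
        print join(' | ', @cells), "\n";
    }

    $tree->delete;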

John C.
Friday, July 09, 2004

John,

>> It's quite straightforward to use HTML::Parser to generate a parse tree then traverse the tree to find the <table> element and its children and extract the information you want.

Ah, excellent advice, thanks.

Anon123
Friday, July 09, 2004

Oh, I should add that on a past project we also tried piping things through htmltidy and then parsing them (using a fairly standard Java-based HTML parser), but it turned out that just using the HTML::Parser Perl class by itself was more effective, at least for the corpus we were dealing with. But it's probably not a bad idea to test it for yourself.

John C.
Friday, July 09, 2004

Screen scraping is not a good idea unless it's just amateur stuff. Also, it's one thing to view data on a page in your browser, but another to pull it and put it on your page for others to view.  It sounds like a question of use. The people you are scraping from are probably morons, judging from their HTML, so they probably do not have a clue who is using their site or how. When and if they do get a clue, they can block you, or they can just add or modify a robots.txt file, which I'm sure you read and respect.

If you are going to get data from a remote site, do it through XML. I once worked on a project where we got bids from Overture and the like via screen scraping; there was a lot of it, and it constantly had to change. Imagine what happens when they make changes to their layout. Do you know what to expect, and do you throw errors or warnings when you don't receive the patterns you expect? I seriously suggest talking XML with those people, but again, if they can't even get their HTML right (right enough, anyway; I used Perl modules to do the parsing and never had to resort to my own regexes), then they probably don't have a clue.
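
Concretely, even a crude sanity check helps when their layout shifts under you (a sketch; the column names are made up):

    use strict;
    use warnings;

    # @rows is whatever the scraper produced: one arrayref of cells per row.
    sub check_scrape {
        my @rows = @_;

        die "Scrape returned no rows -- did the page layout change?\n"
            unless @rows;

        # Expect a header row containing the columns we rely on (made-up names).
        my $header = join '|', @{ $rows[0] };
        warn "Unexpected header row: $header\n"
            unless $header =~ /Part/i and $header =~ /Price/i;

        # Every data row's last cell should look like a price.
        for my $i ( 1 .. $#rows ) {
            my $price = $rows[$i][-1];
            warn "Row $i has a suspicious price field\n"
                unless defined $price and $price =~ /^\s*\$?\d+(?:\.\d{2})?\s*$/;
        }
        return 1;
    }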

Nonetheless, what you are doing is probably a violation of fair use, which is why someone called it fishy, though you did not ask about the legalities. I only mention it because you seemed taken aback by the suggestion.

me
Friday, July 09, 2004

automate IE to open to the pages you want to view, take  screenshots, define an algorithm to go pixel by pixel and pick out color variance, shading, patterns, etc....

like pie it is, easy.

josheli
Friday, July 09, 2004

"It's really a topic for a whole other discussion, but I've always been curious why ANY browser accepts invalid HTML"

Simple, which product do you want:
  1. Browser that chokes on a large percentage of the pages on the web.

  or

  2. Browser that does a respectable job of rendering those pages in spite of their errors?

Most users want #2.

sgf
Friday, July 09, 2004

if no browser accepted poorly formed HTML, then no one would need #2, because no one would write poorly formed web pages; it'd be detrimental to their business.

muppet from forums.madebymonkeys.net
Friday, July 09, 2004

Your wishful thinking about valid HTML is all well and good.

But in the real world, there are already tons of bad HTML pages out there. So now what?

JD

JD
Friday, July 09, 2004

If nobody was "allowed" to write bad HTML, the web never would have gotten past the university doors.  It'd be the exclusive domain of physics researchers wanting to share documents.


Friday, July 09, 2004

come on.  it's not as though html is that difficult to write properly.

muppet from forums.madebymonkeys.net
Friday, July 09, 2004

yes it is. people can't even write XML in a well-formed manner.
the computer is supposed to HELP you, not make your life more difficult.
ideally the user agent would present a warning to the user, much like the script warning (that is to say, unobtrusive) saying "this page has HTML errors". then someone who cared could fix it

mb
Friday, July 09, 2004

"come on.  it's not as though html is that difficult to write properly. "

I think you're overlooking the fact that lots of HTML is written by people whose technical ability is way below the average on this board. People with absolutely NO prior programming experience of any kind.

'Oh, you know how to create a three line table? You're our new HTML expert! Make us a whole site!'

sgf
Friday, July 09, 2004

And BTW:
"if no browser accepted poorly formed HTML,"

If that was the case, I'd write one and take over the market! :)

sgf
Friday, July 09, 2004

The idea of 'good' HTML keeps changing. Don't you remember when all the HTML books told you it aided readability to put tag names in capitals, and then XHTML became the standard and made that illegal?

Stephen Jones
Saturday, July 10, 2004

That's interesting, 'me'; I worked on a project that involved reading data from Overture too.  I don't think it's only good for "amateur" stuff, though.  At the time I was working on this project, Overture hadn't put up their XML interface yet.  Even though it was a suboptimal solution, the screen scraping was the simplest part of the whole project.  The rest of it had to do with automating the process of analyzing ad performance and deciding on bid prices for thousands of people.  It was tremendously valuable, even though at the base of it the thing was messy (and had to be updated frequently to adapt to breaking site changes).

Kalani
Sunday, July 11, 2004

why rejecting bad HTML (or XML) is a bad idea

http://diveintomark.org/archives/2004/01/14/thought_experiment

Gregg Tavares
Monday, July 12, 2004
