Fog Creek Software
Discussion Board




Looking for advice on web spidering app

Dear JOS Readers,

I need to create a web spider app.  The spider will need to visit various predetermined web sites, use each site's search facilities, accumulate the search results, and then reformat and present the final results to the user in a browser window.

After some preliminary research, it looks like Perl and the LWP toolkit are the best way of accomplishing this task.
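For what it's worth, here is the sort of thing I have in mind with LWP (untested; the search URL and query parameter are placeholders for one of the target sites):

#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use URI;

my $ua = LWP::UserAgent->new;
$ua->agent('RecyclerSpider/0.1');   # identify the spider
$ua->timeout(30);

# Placeholder search URL and query parameter for one of the target sites
my $search = URI->new('http://www.example.com/search');
$search->query_form( q => 'widgets' );

my $response = $ua->get($search);
if ( $response->is_success ) {
    my $html = $response->content;   # raw results page, to be reformatted later
    print length($html), " bytes fetched\n";
}
else {
    warn "Fetch failed: ", $response->status_line, "\n";
}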

Are there any other tools I should evaluate?

Recycler
Monday, June 14, 2004

DIY - http://books.perl.org/book/206

sethx9
Monday, June 14, 2004

Let me guess... this spider will harvest email addresses?

Anti-Spammer
Monday, June 14, 2004

This is easy in most environments. It usually only takes a few lines of code. I've used C#.NET for this. Just use whatever you are familiar with.

Just me (Sir to you)
Monday, June 14, 2004

>> Let me guess... this spider will harvest email addresses?

Your guess is not only wrong, it's dumb.  I can *buy* software that does that.  For the purposes of this discussion, however, the data to be gathered is irrelevant.

Please try to be more helpful next time.

Recycler
Monday, June 14, 2004

Use mshtml from within C#. It helps you parse HTML documents.

Green Pajamas
Monday, June 14, 2004

It's pretty quick to do this with Perl and LWP, plus HTML::Parser if you need to extract information from the HTML pages.
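Roughly like this, using HTML::TokeParser from the HTML-Parser distribution (untested, and the URL is a placeholder):

use strict;
use warnings;
use LWP::UserAgent;
use HTML::TokeParser;   # ships with the HTML-Parser distribution

my $ua   = LWP::UserAgent->new;
my $resp = $ua->get('http://www.example.com/results');   # placeholder URL
die $resp->status_line unless $resp->is_success;

my $html = $resp->content;
my $p    = HTML::TokeParser->new( \$html );

# Pull every link and its text out of the results page
while ( my $tag = $p->get_tag('a') ) {
    my $href = $tag->[1]{href} or next;
    my $text = $p->get_trimmed_text('/a');
    print "$text => $href\n";
}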

If performance is a consideration, though, LWP is single-threaded (if memory serves) and will block while it retrieves pages. You might want to look into LWP::Parallel (which I haven't used, but appears to let you issue multiple simultaneous requests).
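Going by its documented synopsis (again, I haven't used it, so treat this as a sketch; the URL list is a placeholder), it looks roughly like this:

use strict;
use warnings;
use LWP::Parallel::UserAgent;
use HTTP::Request;

# Placeholder list of the sites to be searched
my @urls = ('http://www.example.com/search?q=widgets');

my $pua = LWP::Parallel::UserAgent->new;
$pua->timeout(30);

# Queue up one request per site, then wait for them all
foreach my $url (@urls) {
    $pua->register( HTTP::Request->new( GET => $url ) );
}

my $entries = $pua->wait;   # blocks until every request has finished

foreach my $key ( keys %$entries ) {
    my $res = $entries->{$key}->response;
    print $res->request->uri, " => ", $res->code, "\n";
}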

John C.
Monday, June 14, 2004

I was messing around with a python script the other day ... that might be interesting. I think you'll find a LOT of examples in Perl though ... it's what I always use.

seekingDataModels
Monday, June 14, 2004

>> If performance is a consideration, though, LWP is single-threaded (if memory serves) and will block while it retrieves pages. You might want to look into LWP::Parallel (which I haven't used, but appears to let you issue multiple simultaneous requests).

That's good to know (I didn't realize that).  I absolutely need the ability to search multiple sites at the same time.

Recycler
Monday, June 14, 2004

I use Perl & LWP with a MySQL backend for my crawling. If speed is important, just run multiple copies of the crawler (and partition the URLs to be fetched).
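The partitioning can be as simple as hashing each URL and letting copy N take only its share; something like this (just a sketch, the real fetch and MySQL insert go where the print is):

use strict;
use warnings;

# Run as: crawl.pl <copy-number> <total-copies>, with one URL per line on stdin
my ( $my_id, $total ) = @ARGV;

while ( my $url = <STDIN> ) {
    chomp $url;
    # 32-bit checksum of the URL decides which copy owns it
    next unless ( unpack( '%32C*', $url ) % $total ) == $my_id;
    print "copy $my_id fetches: $url\n";   # replace with the actual fetch + insert
}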

bhagwaan
Monday, June 14, 2004

PLEASE be sure to fetch the "robots.txt" file and respect the site's wishes. It is *such a pain* to add various hacks for detecting and abusing badly behaved spiders...
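If you're using LWP anyway, LWP::RobotUA will do it for you; something like this (the bot name, contact address and URL are placeholders):

use strict;
use warnings;
use LWP::RobotUA;

# LWP::RobotUA fetches robots.txt itself, honours it, and throttles
# requests per host. Bot name and contact address are placeholders.
my $ua = LWP::RobotUA->new( 'RecyclerSpider/0.1', 'you@example.com' );
$ua->delay(1);   # minimum delay per host, in minutes

my $resp = $ua->get('http://www.example.com/search?q=widgets');   # placeholder URL
if ( $resp->is_success ) {
    print $resp->content;
}
else {
    # a page disallowed by robots.txt comes back as a 403 like any other failure
    warn "Blocked or failed: ", $resp->status_line, "\n";
}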

Tyrannosaurus Rant
Monday, June 14, 2004

IPWorks has a great toolkit for $299 that includes a nice COM control that can really help speed the development of such a program.

They also have a .NET component pack, I believe.

http://www.ipworks.com/products/controls/?ctl=HTTP&sku=IPA6-A

The HTTP component is one of 25 or so components included in the pack.

Wayne
Monday, June 14, 2004

Sure, a full HTML DOM is great for the more complex projects of this type, but in my experience you can get a long way very quickly with some regular expressions and a few templates.
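For example, something along these lines (the comment markers and the link pattern are made up; they would have to be tuned per site):

use strict;
use warnings;

# "Template" = everything between two known landmarks on the results page,
# then one regex per result row. HTML is read from stdin here.
my $html = do { local $/; <STDIN> };

my ($results) = $html =~ m{<!--\s*begin results\s*-->(.*?)<!--\s*end results\s*-->}s;
exit unless defined $results;

while ( $results =~ m{<a\s+href="([^"]+)"[^>]*>([^<]+)</a>}g ) {
    print "$2\t$1\n";   # title, then URL, ready for reformatting
}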

Just me (Sir to you)
Tuesday, June 15, 2004
