Looking for advice on web spidering app

Dear JOS Readers,
Recycler
DIY - http://books.perl.org/book/206
sethx9
Let me guess... this spider will harvest email addresses?
Anti-Spammer
This is easy in most environments. It usually only takes a few lines of code. I've used C#.NET for this. Just use whatever you are familiar with.
Just me (Sir to you)
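The poster used C#.NET, but the "few lines of code" point holds just as well in Perl, the language most of this thread leans on. A minimal sketch with LWP::Simple; the URL is a placeholder:

```perl
use strict;
use warnings;
use LWP::Simple qw(get);

# Fetching a page really is only a few lines; get() returns undef on failure.
my $html = get('http://example.com/')
    or die "fetch failed\n";
print length($html), " bytes fetched\n";
```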
>> Let me guess... this spider will harvest email addresses?
Recycler
Use mshtml from within C#. It helps you parse HTML documents.
Green Pajamas
It's pretty quick to do this with Perl and LWP, plus HTML::Parser if you need to extract information from the HTML pages.
John C.
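A rough sketch of the LWP-plus-HTML::Parser combination John C. describes, collecting the href of every &lt;a&gt; tag on one page; the URL and user-agent string are made up for the example:

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::Parser;

my $ua  = LWP::UserAgent->new(agent => 'example-spider/0.1');
my $res = $ua->get('http://example.com/');
die $res->status_line, "\n" unless $res->is_success;

# Event-driven parse: the start_h handler fires once per opening tag.
my @links;
my $p = HTML::Parser->new(
    api_version => 3,
    start_h     => [
        sub {
            my ($tag, $attr) = @_;
            push @links, $attr->{href} if $tag eq 'a' && defined $attr->{href};
        },
        'tagname, attr',
    ],
);
$p->parse($res->decoded_content);
$p->eof;

print "$_\n" for @links;
```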
I was messing around with a Python script the other day ... that might be interesting. I think you'll find a LOT of examples in Perl, though ... it's what I always use.
seekingDataModels
>> If performance is a consideration, though, LWP is single-threaded (if memory serves) and will block while it retrieves pages. You might want to look into LWP::Parallel (which I haven't used, but appears to let you issue multiple simultaneous requests).
Recycler
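Since the poster hasn't used LWP::Parallel either, here is an unverified sketch based on that module's documented interface: register() queues requests and wait() blocks until they have all completed. The URLs and limits are placeholders.

```perl
use strict;
use warnings;
use LWP::Parallel::UserAgent;
use HTTP::Request;

my $pua = LWP::Parallel::UserAgent->new();
$pua->max_hosts(5);  # talk to at most 5 servers at once
$pua->max_req(2);    # at most 2 simultaneous requests per server

for my $url (qw(http://example.com/a http://example.com/b)) {
    # register() returns an error response if the request can't be queued
    if (my $err = $pua->register(HTTP::Request->new(GET => $url))) {
        warn $err->error_as_HTML;
    }
}

my $entries = $pua->wait();  # blocks until every registered request finishes
for my $entry (values %$entries) {
    my $res = $entry->response;
    print $res->request->url, ' => ', $res->code, "\n";
}
```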
I use Perl & LWP with a MySQL backend for my crawling. If speed is important, just run multiple copies of the crawler (and partition the fetched URLs).
bhagwaan
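One way to do the partitioning bhagwaan mentions is to hash each URL's host and let each copy of the crawler claim one bucket. The helper below is hypothetical; it keys on the host so that one site's pages (and its robots.txt) stay with a single crawler:

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);
use URI;

# True when this crawler (copy $n of $N) owns the given URL.
sub mine {
    my ($url, $n, $N) = @_;
    my $host = URI->new($url)->host || '';
    return (unpack('N', md5($host)) % $N) == $n;
}

my ($n, $N) = (0, 4);  # e.g. this process is copy 0 of 4
my @queued_urls = ('http://example.com/x', 'http://example.org/y');  # would come from the MySQL queue
for my $url (@queued_urls) {
    next unless mine($url, $n, $N);
    # ... fetch and parse $url here ...
}
```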
PLEASE be sure to fetch the "robots.txt" file and respect the site's wishes. It is *such a pain* to have to add various hacks for detecting and blocking badly behaved spiders...
Tyrannosaurus Rant
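A sketch of the polite approach using LWP::RobotUA, which fetches and caches each site's robots.txt itself and refuses disallowed URLs; the agent name, contact address, and URL are placeholders:

```perl
use strict;
use warnings;
use LWP::RobotUA;

my $ua = LWP::RobotUA->new('example-spider/0.1', 'you@example.com');
$ua->delay(1);  # wait at least 1 minute between requests to the same host

my $res = $ua->get('http://example.com/some/page.html');
if ($res->is_success) {
    print $res->decoded_content;
} else {
    # a robots.txt disallow comes back as "403 Forbidden by robots.txt"
    print $res->status_line, "\n";
}
```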
IPWorks has a $299 toolkit that includes a nice COM control which can really speed up development of this kind of program.
Wayne
Sure, a full HTML DOM is great for the more complex projects of this type, but in my experience you can get a long way very quickly with some regular expressions and a few templates.
Just me (Sir to you)
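A sketch of that quick-and-dirty style: pull the title and the links out of a page with two regular expressions. Fine when you've inspected the markup you're scraping, but it will miss unquoted attributes, comments, and other oddball HTML.

```perl
use strict;
use warnings;

my $html = q{<html><head><title> Example </title></head>
             <body><a href="/a">A</a> <a href='/b'>B</a></body></html>};

my ($title) = $html =~ m{<title>\s*(.*?)\s*</title>}is;
my @hrefs   = $html =~ m{<a\s[^>]*href=["']([^"']+)["']}gi;

print "title: $title\n";
print "link:  $_\n" for @hrefs;
```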