Fog Creek Software
Discussion Board




Search Engine spidering a database driven site?

I'm trying to figure out how a search engine spiders a dynamically generated website. Let's say I'm building a travel web site, and all of my destinations are stored in Oracle; is the spider able to find something like that, or can it only find and index static pages? (I am, of course, using GET requests in my page generation tools, to facilitate bookmarkability.)

Thanks in advance,
Mark

Mark
Wednesday, October 23, 2002

Maybe things have changed; this is mostly beyond me.  I remember Greenspun discussing the same problem and solving it by periodically generating static pages from the database for the search engines to find.  I guess you have to be careful about picking which static pages and what they say.

tk
Wednesday, October 23, 2002

I can't offer much help, but I hear friendly URLs may help search bots. 

Example:
/destinations.asp?d=4567 
to
/destinations/US/NYC/

Then set up a way to handle the URL and change it to something your pages can use.

http://www.alistapart.com/stories/succeed/ 
That article suggests bots have an easier time with friendlier URLs.  It is aimed at PHP with a link on how to do it in Apache using mod_rewrite, but there are alternatives around the net for other environments.
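
For the "set up a way to handle the URL" part, here is a rough sketch of what the receiving end could look like as a Java servlet. The servlet name, the /destinations/* mapping, and the country/city path layout are made-up assumptions for illustration, not the one right way to do it:

import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Hypothetical servlet mapped to /destinations/* in web.xml.
// It turns a friendly path like /destinations/US/NYC/ into the same
// lookup a destinations.asp?d=4567 style URL would have done.
public class DestinationServlet extends HttpServlet {
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        String path = req.getPathInfo();          // e.g. "/US/NYC/"
        if (path == null) path = "/";
        String[] parts = path.split("/");         // ["", "US", "NYC"]
        String country = parts.length > 1 ? parts[1] : "";
        String city    = parts.length > 2 ? parts[2] : "";

        resp.setContentType("text/html");
        PrintWriter out = resp.getWriter();
        // In a real site this is where you'd run the database query
        // keyed on country/city instead of on the numeric d=4567 id.
        out.println("<html><body>");
        out.println("<h1>Destination: " + city + ", " + country + "</h1>");
        out.println("</body></html>");
    }
}

Whether you do the translation in the container like this or with mod_rewrite in front, the spider only ever sees the friendly URL.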

But the pages have to be linked to. So neither regular nor friendly URLs will be indexed if no page links to them. This makes sense, but I am not an expert.

Diego
Wednesday, October 23, 2002

Google seems to be able to index dynamic pages. For instance, if you search Google for "Keeping old active discussion alive" you will get this discussion forum as the first hit, with a dynamic query string:

http://www.google.com/search?hl=en&ie=UTF-8&oe=UTF-8&q=Keeping+old+active+discussion+alive

Ben
Wednesday, October 23, 2002

A search engine with the capabilities of Google can index generated content. Type "Walter Rumsby" into Google and you get a few references to this forum. Now, think for a second - is Joel manually creating static pages with our posts? No way, this is all stuff coming out of a database.

All you should need is some way of getting to the content (eg. via links on a publicly accessible page on your site). A list of destinations should be accessible from a straightforward GET (eg. www.*.com/destinations/list/) and so can be exposed as a link. Indexing results from more complex searches is probably not feasible unless you provide links for things like "Today's Specials to Hawaii" which include your search parameters in the URI.

Then again, I remember hearing that Google was turning up content that was supposed to be secure on some sites - :) or :( (depending on your point of view).

Walter Rumsby
Wednesday, October 23, 2002

Here's a newsletter article about this topic. According to this article, Google and Inktomi are able to index dynamically generated web pages.

http://www.axandra.com/archive/newsletter9.htm

El Sastre
Thursday, October 24, 2002

Pages are indexed as long as they are statically linked, like a list of sections on the main page.

I use mod_rewrite to make my sites user- and spider-friendly too. So the users see www.mysite.com/category/product, and the backend sees mysite.com/engine/showp.php?pid=product

Everyone happy.

Napoleon Hill :)
Thursday, October 24, 2002

Thanks a lot everyone, this has been really helpful. I've decided to keep everything dynamic, but put a text link at the bottom of the home page to an index with links that query my servlet for each record. It sounds like a lot, but it's not even a thousand pages. I'm pretty sure I can rig something that updates the index every time I add a record. That should help the search engine "click through" to each individual page and catalog it.
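
Something along these lines is what I have in mind. It's only a sketch: the table and column names, the JDBC URL, and the /destination?id= link format are placeholders, not my real schema.

import java.io.IOException;
import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Hypothetical "site index" servlet: one plain <a href> per record,
// so a spider that reaches this page can click through to every destination.
public class DestinationIndexServlet extends HttpServlet {
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        resp.setContentType("text/html");
        PrintWriter out = resp.getWriter();
        out.println("<html><body><h1>All destinations</h1><ul>");
        try (Connection con = DriverManager.getConnection(
                 "jdbc:oracle:thin:@dbhost:1521:travel", "user", "password"); // placeholder
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(
                 "SELECT id, name FROM destinations ORDER BY name")) {        // placeholder table
            while (rs.next()) {
                out.println("<li><a href=\"/destination?id=" + rs.getInt("id")
                        + "\">" + rs.getString("name") + "</a></li>");
            }
        } catch (SQLException e) {
            throw new ServletException(e);
        }
        out.println("</ul></body></html>");
    }
}

Regenerating or caching that page whenever a record is added should cover the updating part.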

Thanks again.

Mark
Thursday, October 24, 2002

This hit home. Wanted to respond earlier but was too busy (in today's world, that's a good thing! ;-)

I often have a need to spider sites. I've done both static and dynamically generated pages, and it basically makes no difference to the spider in most cases.

How the page that gets sent down the wire to the user agent originally got generated doesn't really matter -- when it hits your user agent, it's all still a mix of html and script (not always, but very, very common).

The spiders, to one degree or another, parse the page, find things that look like links and then get or post those, get the resulting page back, etc...

So my experience has been that what happens to a page on the server is basically independent of the spider's ability to walk through the links -- the spider just doesn't care.
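
To make that concrete, here's a bare-bones version of that loop as a Java sketch. The href regex, the same-host filter, and the class name are illustrative assumptions only; real spiders add politeness delays, robots.txt handling, and much better link extraction.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Bare-bones spider: fetch a page, scrape anything that looks like an href,
// resolve it against the current page, and queue it if it's new and on the
// same host. It has no idea (and no need to know) whether the server built
// the page from a database or read it off disk.
public class TinySpider {
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*[\"']([^\"'#]+)", Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) throws Exception {
        URL start = new URL(args[0]);                 // e.g. http://www.example.com/
        Set<String> seen = new HashSet<>();
        Deque<URL> queue = new ArrayDeque<>();
        queue.add(start);
        seen.add(start.toString());

        while (!queue.isEmpty()) {
            URL page = queue.poll();
            StringBuilder html = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                     new InputStreamReader(page.openStream()))) {
                String line;
                while ((line = in.readLine()) != null) html.append(line).append('\n');
            } catch (Exception e) {
                continue;                             // dead link, non-HTML, etc.
            }
            System.out.println("fetched " + page);

            Matcher m = HREF.matcher(html);
            while (m.find()) {
                try {
                    URL link = new URL(page, m.group(1));  // resolve relative links
                    if (link.getHost() != null
                            && link.getHost().equals(start.getHost())
                            && seen.add(link.toString())) {
                        queue.add(link);
                    }
                } catch (Exception ignored) { }       // mailto:, javascript:, junk
            }
        }
    }
}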

Having said that, here are some general issues I have found in spidering sites, and mostly these are dependent on the particular useragent you're doing the spidering with. My experience has been that the following are much more significant issues regarding site-spidering than simply whether it's database driven or totally static:

1) Can't get to secure pages. Two problems: if your spider doesn't do SSL, then you've got a problem (if somebody built a "secure" site **without** SSL, then everybody on the site has a problem!). The other issue is a logon: some spiders may not support supplying a userid/password, and will not be able to access pages behind the logon. Many spiders (lwp, wget, LinkBot (WatchFire), etc.) do handle both situations. In some cases, a web page doesn't use a "regular" html form to capture the userid/password, and instead pops up an OS-specific dialog box for you to type in your logon info. Not all spiders provide an interface that lets you enter and store such info for the OS-based logon. Sometimes you just can't authenticate via some spiders at all and you're hosed. (The plain HTTP-authentication case is in the sketch after this list.)

2) Can't handle forms. Some spiders can't effectively or efficiently handle submitting data with a request/post of a form, meaning that you won't be able to walk through an interactive workflow, which means there are pages you won't be able to get to. Some spiders do handle providing form data, but they don't have any practical or usable way to input all the data you might need, especially if repeated data is not allowed in the workflow from one crawl to another, so for all intents and purposes they don't really handle forms. In general, I've found that spiders don't do forms really conveniently. (I've written some stuff in Perl, thanks to the work of G. Aas on the LWP modules, that handles the situation better.)

3) Heavily interlinked sites leave the spider in an infinite loop. Had this problem not too long ago on an inherited site we didn't know much about. It plugged into all manner of 3rd-party content providers (we didn't know it at the time), and it turned out that they had a reference section where each article had a link to a search engine (didn't know this at the time, either) that returned "similar" articles. Well, the spider ended up getting trapped in these library pages, continuously running search queries for similar pages, then going through those results running queries again on each of those articles, etc. The spider never got out of the library. Badly written spider, BTW. It's a commercial spider, and I found during this that it doesn't persist **any info** until the scan is stopped/done, and it keeps the entire scanned tree in memory, so memory usage grows with the size of the site being scanned and the thing simply doesn't scale. Basically the machine was just shy of dying for lack of memory when I stopped it. If it dies on a scan, you lose everything. After adjusting the scanning filters to not traverse these links in the library, it worked OK and got to other pages it had never seen.

4) Spider can't understand / see / resolve client-side dynamically generated links. Spiders that are just parsers and user agents look for patterns in the page code that "look" like links, based on rules/regexes they've been given. Some spiders let you configure what they will interpret as a link. However, I have seen some javascript where there was no way in hell anybody could have built a valid url out of the base url and any string pattern found in the raw code -- the actual url for a link was dynamically generated client-side at runtime, in memory. It never existed in any usable form in any parsable code. If such links are the only way to get to another part of a site (which is bad site/page design, BTW; javascript "DHTML" is supposed to **augment** navigation and user experience, not be the primary way to access content), then the spider won't ever see the destination pages. Your spider would need some flavor of VM to execute the script to be able to get to such links.

5) counting / estimating pages is probably a useless exercise, in general. I've often gotten asked to spider a site and then give somebody an estimate of the number of pages in a site. I've had to explain that it's next to impossible to do this automatically for a number of reasons. Won't go into them here, but if anyone's interested in that, let me know. It's not very interesting, really, but if you get the same question...

6) Of course, you've got to have a filtering ability to limit what the spider crawls. I haven't messed with any actual spiders that didn't have the ability to set such limitations. Another couple of simple but important must-have abilities are customizing the useragent string and ignoring robots.txt (the UA override is in the sketch after this list). Had one site that was harsh about rejecting the default UA string of the commercial spider I was using, but it let me in when I told it I was IE coming off a Windows box. Obviously, you may need to ignore the robots.txt file depending on your requirements at the time and on how limiting the robots.txt file is.

7) Oh, and if you're spidering somebody else's site, be careful, sometimes folks don't like it, and with good reason.
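
For the easy parts of 1) and 6) above, overriding the UA string and supplying a userid/password for plain HTTP authentication are just request headers. A rough sketch, with the URL and credentials obviously made up; form logons and OS dialog boxes need more than this:

import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Base64;

// Per-request setup a spider can do for the simple cases in points 1 and 6:
// present a browser-looking UA and supply a userid/password for HTTP Basic auth.
public class SpiderRequest {
    public static InputStream open(String address, String user, String pass)
            throws IOException {
        HttpURLConnection conn =
            (HttpURLConnection) new URL(address).openConnection();
        // Sites that reject an unfamiliar spider UA often accept a browser one.
        conn.setRequestProperty("User-Agent",
            "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)");
        // HTTP Basic authentication: base64 of "user:pass" in the header.
        String token = Base64.getEncoder()
            .encodeToString((user + ":" + pass).getBytes("UTF-8"));
        conn.setRequestProperty("Authorization", "Basic " + token);
        return conn.getInputStream();
    }

    public static void main(String[] args) throws IOException {
        // Placeholder URL and credentials, for illustration only.
        open("https://www.example.com/members/", "userid", "secret").close();
    }
}

Ignoring robots.txt is simply a matter of never fetching it, so there's nothing to show for that part.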

That's all that comes to mind right off.  As I said, all these issues have been way more significant than whether the site's static or DB driven. Back to work, now...

anonQAguy
Thursday, October 24, 2002
