Fog Creek Software
Discussion Board




what is the best language to build a search engine

I'm heading a pseudo school project to revamp a nonprofit organization's website, and one of the things I want to do is to code a search engine from scratch (our team is not getting paid, so I need to provide as much of a learning experience as possible). I do not want to buy any pre-designed engines. The site is coded in ASP. I've read a ton of different stuff, but cannot seem to find a clear answer to which language would be the best for something like this.

grover cleveland the 7th
Saturday, January 17, 2004

Depends how powerful you really want this to be, and whether you have vast amounts of info to trawl through, but off the top of my head I would suggest Perl. Its regular expressions and text handling features lend themselves quite well to a simple text search engine, I'd imagine.
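
A minimal sketch of the kind of regular-expression text search suggested here. It is written in Python rather than Perl, only to keep one language for the examples in this thread. The `pages` dictionary, the page names, and `regex_search` are hypothetical, and a real site would read the text from files or a database.

```python
import re

def regex_search(pages, term):
    """Return (name, snippet) pairs for pages containing `term` as a whole word."""
    # re.escape stops user input from being interpreted as a pattern.
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    hits = []
    for name, text in pages.items():
        match = pattern.search(text)
        if match:
            # Keep a little context around the first match as a snippet.
            start = max(match.start() - 20, 0)
            hits.append((name, text[start:match.end() + 20]))
    return hits

# Hypothetical page contents, keyed by file name.
pages = {
    "about.asp": "Our nonprofit helps local volunteers find projects.",
    "contact.asp": "Write to us or call the office.",
}
results = regex_search(pages, "volunteers")
```

This scans every page on every query, which is probably fine for a small site but is the part that stops scaling first.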

//Disclaimer: no, I haven't tried writing a search engine in Perl.

Andrew Cherry
Saturday, January 17, 2004

Depends on your website. Will you be using a database to store any data? Is all the content on one machine? In one language?

If you want to offer a learning experience, another possibility, instead of getting them to write the thing from scratch, might be to get them to select and install an open-source search engine. I believe there are a few at SourceForge that may interest you.

FullNameRequired
Saturday, January 17, 2004

1. The language you're most comfortable with.

2. The lower level the better if you're talking about scalability beyond your personal website or local intranet.

Full name:
Saturday, January 17, 2004

You don't have to buy any predesigned engines - there are plenty you can use for free, many of them with speed or features that will take you a lot of work to accomplish:

WebGlimpse, whose features are described in [ http://webglimpse.net/index.php?dir=subfeatures&page=features.html ] offers a free commercial license for some nonprofits (I don't know their criterion for deciding).  It can do _approximate_ searches - you can misspell your search terms, and it will still find them, and do that rather quickly.

Estraier [ http://estraier.sourceforge.net/ ] has google-like output.

You can find various others in http://freshmeat.net

Ori Berger
Saturday, January 17, 2004

Learning issues aside, sometimes the best lesson is in economy. If the search is against publicly accessible web pages, make use of a search engine already in existence.

Search google for:
"Jamie Oliver" site:joelonsoftware.com

Google even has an API.  :)

m
Saturday, January 17, 2004

I also recommend using Google's API. But if you really want to create your own search engine, Perl would be a good choice, as it probably has the largest number of search and web spidering modules.

Some examples:
* WWW::Robot http://search.cpan.org/~awrigley/WWW-Robot-0.023/lib/WWW/Robot.pm
* LWP http://search.cpan.org/~gaas/libwww-perl-5.76/lib/LWP.pm
* LWP Cookbook http://www.perldoc.com/perl5.6/lib/lwpcook.html

Matthew Lock
Saturday, January 17, 2004

You can also get some inspiration for creating a crawler from the Wayback Machine's open source code: http://crawler.archive.org/

Matthew Lock
Saturday, January 17, 2004

Google's spider is done in Python I believe

Damian
Saturday, January 17, 2004

Start by reading this:

http://www.tbray.org/ongoing/When/200x/2003/07/30/OnSearchTOC

Joel Spolsky
Saturday, January 17, 2004

The H2.ro team has built a search engine very quickly using Perl. It took them about one week, but they were excellent Perl programmers.

The team's web site is at http://studio.h2.ro/

The search engine is at http://www.h2.ro/ - it's in Romanian language.

I was a member of the team, but had nothing to do with the search engine as I was involved in other projects at the time.

Many members from the original team who built the search engine and web spider are now working in the US.

James
Saturday, January 17, 2004

The problem with using Google as your search engine is that it will only work for "important" sites (i.e. those that have a high page rank).

That's why Joel can use google search on his site. After all, Google thinks he is more important than Billy Joel :) But, for the average Joe, I don't think it works.

For example, take this page:
http://www.callisto.si.usherb.ca/~98701364/

and try this search on google:
David site:http://www.callisto.si.usherb.ca/~98701364/

It doesn't work, even though the name of the page's author appears in several places on it.

Eric V
Sunday, January 18, 2004

Your search should be

David site:callisto.si.usherb.ca

or

David site:si.usherb.ca

or even

David site:usherb.ca

i.e. just the domain, not the full url.
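
A small sketch of the point above: keep only the host part of a URL when building a `site:` query. The helper name `site_query` is hypothetical, and stripping a leading "www." is an assumption made here so that the result matches the first form suggested above.

```python
from urllib.parse import urlparse

def site_query(term, url):
    """Build a 'term site:host' query, dropping the scheme and path of `url`."""
    host = urlparse(url).hostname or ""
    # Assumption: drop a leading "www." so the query covers the bare domain.
    if host.startswith("www."):
        host = host[len("www."):]
    return f"{term} site:{host}"

query = site_query("David", "http://www.callisto.si.usherb.ca/~98701364/")
```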

Alex.ro
Sunday, January 18, 2004

Java, of course.

The Artist Formerly Known as Prince
Sunday, January 18, 2004

You might want to consider languages like Erlang for the special characteristics they offer. But a general-purpose language like Python with a clean interface to C++ would work too.

Li-fan Chen
Sunday, January 18, 2004

You'll want to start with something like VB.Net or Python, with a little less formality than the final working version of the search engine (which probably consists of a Crawler, an Indexer, an Optimizer, a Query Engine, a Presenter, and a Presenter Cache Engine).

Li-fan Chen
Sunday, January 18, 2004

Did you know that MySQL implements full text search?

See http://www.mysql.com/doc/en/Fulltext_Search.html

You might find that by depending on this feature the exact language you use is irrelevant as long as it supports MySQL.  What do you think?

Seun Osewa
Sunday, January 18, 2004

Seun, you are right, MySQL has some support for full text searches, but that doesn't necessarily make it suitable for large search engines.

Li-fan Chen
Sunday, January 18, 2004

I have some experience in implementation of search engines (I'm no specialist though).
Frankly, the language does not matter so much. Lucene is written in Java, whereas most others are probably written in C, C++, or Perl.
I work regularly on an implementation in C, but have also written toy implementations in Java and Python. I'd rather use a high-level language.
Anyway, language is not the main concern; scalability is. Writing an indexer + search engine for 50 MB of text is rather easy (remember, that's 25 times Don Quixote, or 300 times Alice in Wonderland). For 1 GB, it's a different project, let alone 1 TB.
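
For the small case, an indexer plus search can be sketched as an in-memory inverted index: a map from each word to the set of documents containing it. This is only an illustration under that assumption, not how Lucene or any particular engine does it; `tokenize`, `build_index`, `search`, and the sample `docs` are hypothetical names and data.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents containing every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical sample documents.
docs = {
    1: "Alice was beginning to get very tired",
    2: "Don Quixote was a tired knight",
    3: "The knight rode on",
}
index = build_index(docs)
```

With everything held in memory this works for megabytes of text; at gigabytes the index has to live on disk and be compressed, which is where the project changes character.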

Pakter
Sunday, January 18, 2004

Pakter, excellent point.  It echoed my sentiments exactly. 

vince
Sunday, January 18, 2004

Lucene is being ported to other languages (I lurk on the mailing lists). A Perl port is almost done, and I think Ruby is also on the cards..

As far as building a search engine goes, Joel's link above is a good place to start. If you have access to a library, you might also want to check out books on IR (Information Retrieval) ..

There are three parts to building a good search engine (I study in the field, so I think I feel qualified to give an introduction :) ): an efficient spider, a good indexing mechanism and a good means of querying your indices.

Google's spider is written in Python, as someone noted. The main aspect here is speed. A language that is capable of multiple threads and / or instances is essential (since your constraint for large sites is likely your network bandwidth). No, I wouldn't necessarily use Perl, because of its limitations on threads. YMMV based on how much data you plan on indexing ... a few megabytes worth of text is manageable by any language you feel most comfortable with ...

A few gigabytes and some constraints on how fast the data must be searched ... then you're looking at a bit of tuning ..
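
A rough sketch of why threads help a network-bound spider: while one request waits on the network, others can proceed. To stay runnable without network access, `fake_fetch` below is a stand-in that only sleeps; it, `crawl`, and the example.org URLs are hypothetical, and a real spider would also need politeness delays, robots.txt handling, and error handling.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_fetch(url):
    """Stand-in for a real HTTP request: sleeps to mimic network latency."""
    time.sleep(0.05)
    return url, f"<html>content of {url}</html>"

def crawl(urls, fetch=fake_fetch, workers=8):
    """Fetch all urls concurrently and return a {url: body} dictionary."""
    # The threads overlap their waiting time, which dominates a network-bound spider.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(fetch, urls))

urls = [f"http://example.org/page{i}" for i in range(8)]
pages = crawl(urls)
```

Swapping `fake_fetch` for a function that performs a real request keeps the same structure.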

One tip, though.. pay attention to metadata. Google uses pagerank, it can also use formatting information to determine which parts of a document are more important than others ... this means some analysis of your content is necessary, if you must go beyond the simple "keyword search"
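
One way to use formatting information, sketched under loud assumptions: count each word with a weight that depends on the tag it appears in. The weights in `TAG_WEIGHTS` are invented for illustration, and this is not a description of how Google scores pages; `WeightedTerms` and `score_terms` are hypothetical names.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical weights: words in titles and headings count more than body text.
TAG_WEIGHTS = {"title": 5, "h1": 3, "b": 2}

class WeightedTerms(HTMLParser):
    """Accumulate per-word scores, weighted by the enclosing tag."""

    def __init__(self):
        super().__init__()
        self.open_tags = []      # weighted tags currently open
        self.scores = Counter()  # word -> accumulated weight

    def handle_starttag(self, tag, attrs):
        if tag in TAG_WEIGHTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            self.open_tags.remove(tag)

    def handle_data(self, data):
        # Use the largest weight among the open tags, or 1 for plain text.
        weight = max((TAG_WEIGHTS[t] for t in self.open_tags), default=1)
        for word in re.findall(r"[a-z0-9]+", data.lower()):
            self.scores[word] += weight

def score_terms(html):
    parser = WeightedTerms()
    parser.feed(html)
    parser.close()
    return parser.scores

html = ("<html><head><title>Search engines</title></head><body>"
        "<h1>Indexing</h1><p>A search engine needs <b>indexing</b> "
        "and a spider.</p></body></html>")
scores = score_terms(html)
```

The resulting scores could be stored alongside an index and used to rank matching documents instead of returning them unordered.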

deja vu
Monday, January 19, 2004

What will happen to the site once your "team" pulls out? Will they have to find other "volunteers" to rip out your fancy hand-coded search facility, and replace it with something minimal but no-maintenance?
I don't want to sound too cynical, but I have met the corpses of this type of "help" before. You have told them that you aren't doing them any favours, and that if they want a website for the future they should look elsewhere since you are just using their needs as a pretext for trying out a few things that will be abandoned, right?

Just me (Sir to you)
Monday, January 19, 2004

Well... I can do some volunteer work there as long as I have time to do so... I probably won't stay there forever though... the management is just the type that drives you insane (5 IT guys gone in 2 years... what does that tell you... especially in this economic climate). But I am in the process of basically getting other students to take the reins when I am gone. Maybe I am misreading this post, but how does doing volunteer work make me a corpse? Wouldn't a corpse take the easy way out and just install a pre-written engine? I probably am just misreading this though, 'cause I have no clue what you guys mean by trolling either.

grover cleveland the 7th
Monday, January 19, 2004

The corpse I was referring to is the abandonware that will be running at the not-for-profit.  Most of the cost of a software system is in operations/maintenance.
The fact that students are recruited for the site indicates that the budget is probably close to $0. If the student project is writing fancy home-grown code, what will happen after the students leave? There will be no budget for maintenance, and believe me, no one else will want to support this code. New volunteers will either, if the NFP is lucky, rip and replace it with something off-the-shelf, or, if they are unlucky, rip and replace it with another round of new home-grown abandonware. Either way your team's effort is wasted.
The goal in a situation like this should be to create the least maintenance possible: 0 LOC if possible. This usually means getting some low cost stuff from a COTS software vendor that has a track record of easy maintenance, or maybe a low cost ASP service.
Your students can get a valuable lesson out of this. Have them propose different ways of tackling this thing and then make them work out the complete lifecycle of the solution proposed.

Just me (Sir to you)
Tuesday, January 20, 2004

Hmmm, I stumbled upon this discussion by accident. I noticed you guys were proficient with writing web pages in advanced languages. I have a little HTML experience, but it is very limited. Writing a search engine seems like it would be impossible for me, but it is an interesting subject. As to the comment by Eric V. that Google will never pick up a website or page, that is not entirely accurate. I run WWW.bumcity.com on a free hosting site and it is listed on Google. Perhaps that's because someone else started it and I picked it up as a "second owner", so to speak? Anyway, I submitted it to free submission services and Google picked it up. Also, the site Eric asked us to look up IS listed on Google.....

BTW, I have a few questions for you all, maybe you can answer them for me? Here goes:

What is a good program for simple website design?

If I pay for a hosting plan, what should I look out for? I am considering this plan here: http://win2000hoster.com/ because, like me, it's cheap. However, it's a shared server.

I have been running bumcity.com for some time now. I bought the domain name because I was out of work and was basically being called a bum by my friends & relatives. It was (and still is!) very tough to find employment here in Connecticut, even with a college degree. So the name seemed to fit and is humorous (to me, anyway). But I am thinking of revamping the site and making different products and articles for it. I feel it's a good way to learn new skills and implement new ideas. Would you guys have any suggestions for me?


thanks and best regards,

JC

joe carlis
Monday, February 16, 2004
