Fog Creek Software
Discussion Board




Database Search

I'm developing a database search system for a client but I'm a little confused with the type of architecture I should base the system design on. Some facts about the content before I proceed with the problem:

1. The search scope will be text only, stored in a table
2. The text content will be updated every day - new records will be added and old will be deleted

Which of the following options do you think will be the best approach for a database search system considering the facts I've just mentioned:

1. Full-text - It's flexible and fast but considering fact #2 the index population process will be a bottleneck.

2. SQL - A simple LIKE-based query may do the search job just fine. But I fear that it may get slow, result in poor performance and won't provide advanced features (not so important though) like boolean expressions, ranking etc.

3. Custom Indexing - Here I can create a words table with single words split from the searchable text contents. Then another table can link these words & the text records. The only problem with this option being that the indexing process for some 50,000 text records may take quite a lot of time.

FYI, The database server will run SQL Server 2000.

Guys, your ideas and thoughts on an efficient search technique are requested. Thank you.

Josh_from_3rd_floor
Tuesday, September 16, 2003

How much are you actually planning on updating it? I mean, index generation might be expensive, but a single daily batch of 1000 inserts would probably not take more than a few seconds. Sure, it's slow, but does it _matter_?

In any case, though I am by no means a DB expert, I'd lean towards the full-text index: it is intended to do what you want, and rolling your own is generally slower and harder to maintain than using an existing feature.

Mike Swieton
Tuesday, September 16, 2003

Never used Full-Text indexeing on SQL Server - but are you search the indexing would be slow??  What kind of volume of updates / inserts are you going to have?

DJ
Tuesday, September 16, 2003

Search in books online for the text "start_background_updateindex". This will propagate changes to the full-text service as they occur. Alternately you could do incremental updates every half hour or whatever.

Dennis Forbes
Tuesday, September 16, 2003

The database will have some 5,000 inserts per batch.

The full-text incremental updates (as suggested by Dennis Forbes) sounds like a good idea. But will these updates be real-time and can anyone share their experience with full-text search. My only worry with full-text is that the [re]indexing after each batch insert should not slow down the data access and it should be instantly reflected in any search results.

Josh_from_3rd_floor
Tuesday, September 16, 2003

We've used full text search on a volume of about 3K changes per day and haven't noticed any performance problems.  We have come across some issues with the search itself, though.  The algorithm seems to be very word-oriented so sometimes the result aren't as you might expect when you search for partial words, numbers, and other non-word stuff.  We had to augment the full text search with a "like" search in order to effectively handle searching some fields for part numbers.

anon
Tuesday, September 16, 2003

Josh,

do yourself a favor and subscribe to the SQL2K list on sswug (just send a mail to subscribe-sql2k@ls.sswug.org).
They have top notch Full Text experts on there running installations far larger than you describe, and are always eager to help.

Just me (Sir to you)
Tuesday, September 16, 2003

I'll chime in that 5000 records a day can be indexed in a very short amount of time indeed. I've seen acceptable performance from a flat-file word database and a perl scripts on mailing list archives, running on old pentiums.

I would guess that the FT search capability of SQL Server would be very much capable of handling this and that a 1Ghz processor should be able to index at least 10-20 megabytes of text per second without breaking a sweat or affecting normal operations in way that would be visible to the user.

Saruman
Tuesday, September 16, 2003

"[re]indexing after each batch insert should not slow down the data access and it should be instantly reflected in any search results."

What is your definition of "instantly"? From the second the transaction is committed? If that's what you're looking for, you won't find it with full-text search in SQL Server. The reason is that FTS actually isn't actually a part of SQL Server, and doesn't take part in SQL Server transactions : It's a generic service theoretically used by many systems.

I think you really need to define your actual requirements, such as whether, for instance, a one minute lag between an item being added and it returning from full-text searches is sufficient.

One thing I will disagree with, though: I doubt it could populate a corpus of 20MB of text per second.  Doing so (building a searchable corpus) isn't just copying the text into a big buffer, but as you mentioned is the deconstruction of the text is a searchable, err, graph. I've had tables with maybe 100KB of data take dozens of seconds to index.

Dennis Forbes
Tuesday, September 16, 2003

I don't use Full Text Search, I implement a word index table which is updated at the time of record creation, I tend to have some kind of minimum size of word, typically 5 characters with some kind of programmatic override for things like keywords and initials.

Along with the document or whatever the word is found in you can store the relative position and so get near and proximity searches. 

If you need to you can add an index to the same table which is the root of the word (discovered using a grammar).  Similarly you can add a soundex index.

As the table and indexes are created at the time of the source data creation there's no appreciable delay in updating nor are you subject to an external system running its own set of indices (which are pretty much the same as the above ones).

Sometimes the wheel really is the best use of an axle.

Simon Lucy
Tuesday, September 16, 2003

*  Recent Topics

*  Fog Creek Home