Fog Creek Software
Discussion Board




Full text retrieval

What would you recommend for the following:
For an Intranet site I need to be able to do full text retrieval (Google style) on a collection of HTML reports. The size of the collection is in the few GB order. The Intranet platform is W2k3-IIS 6.0-.NET-SQL Server 2000. I am currently leaning towards looking into SQL Servers full text facilities (have not used those before).
I can store the relevant text portions in SQL Server or just on the file system, whatever. Since the reports are structured, simple regex operations can take care of any HTML artifacts etc., so you can think of the HTML files as really just text files if needed.
Does thie SS2K full text sound like a good approach? What would you do?

Just me (Sir to you)
Friday, April 09, 2004

This is one I heard of recently:

Store the full texts in one table.
Split up the texts into words and put the words into a second table.
Create a third table linking texts to the words used in them.

This enables you to search on words used in a certain text.
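The three-table layout above can be sketched in a few lines. This is a minimal illustration using SQLite as a stand-in for SQL Server; all table and column names are invented for the example:

```python
# Inverted-index sketch: texts, words, and a link table between them.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE texts (text_id INTEGER PRIMARY KEY, body TEXT);
    CREATE TABLE words (word_id INTEGER PRIMARY KEY, word TEXT UNIQUE);
    CREATE TABLE text_words (
        text_id INTEGER REFERENCES texts(text_id),
        word_id INTEGER REFERENCES words(word_id),
        PRIMARY KEY (text_id, word_id)
    );
""")

def index_text(body):
    """Store a text and link it to every distinct word it contains."""
    cur = conn.execute("INSERT INTO texts (body) VALUES (?)", (body,))
    text_id = cur.lastrowid
    for word in set(body.lower().split()):
        conn.execute("INSERT OR IGNORE INTO words (word) VALUES (?)", (word,))
        (word_id,) = conn.execute(
            "SELECT word_id FROM words WHERE word = ?", (word,)).fetchone()
        conn.execute("INSERT INTO text_words VALUES (?, ?)", (text_id, word_id))
    return text_id

def search(word):
    """Return the ids of all texts containing the given word."""
    return [row[0] for row in conn.execute(
        "SELECT tw.text_id FROM text_words tw "
        "JOIN words w ON w.word_id = tw.word_id "
        "WHERE w.word = ?", (word.lower(),))]

a = index_text("the quick brown fox")
b = index_text("the lazy dog")
print(search("quick"))  # [1]
```

The query never scans the text bodies; it only joins the word table to the link table, which is what makes this layout fast for lookups at the cost of bigger storage and slower inserts.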

Geert-Jan Thomas
Friday, April 09, 2004

It's surprisingly hard to write a good regex to strip HTML. Three common problems are tags like this:

<A HREF="ptag.html" TITLE="How to use the <P> tag">P tag</A>

and tags split across multiple lines, like this:

<P CLASS="header">My
Website</P>

and the use of HTML comments. See:

http://www.4guysfromrolla.com/webtech/073000-1.shtml

http://www.aota.net/forums/printthread.php?threadid=12016
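A quick Python demonstration of the first pitfall: the obvious "delete everything between angle brackets" regex chokes on a `>` inside an attribute value, while a real parser tracks quoted attributes and gets it right.

```python
import re
from html.parser import HTMLParser

sample = '<A HREF="ptag.html" TITLE="How to use the <P> tag">P tag</A>'

# The "obvious" regex: remove everything from a < to the next >.
naive = re.sub(r"<[^>]*>", "", sample)
# The first match ends at the > of the <P> inside the TITLE attribute,
# so half the tag leaks into the output:
print(repr(naive))   # ' tag">P tag'

# A real parser respects quoted attribute values.
class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []
    def handle_data(self, data):
        self.parts.append(data)

extractor = TextExtractor()
extractor.feed(sample)
print("".join(extractor.parts))  # 'P tag'
```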

Matthew Lock
Friday, April 09, 2004

Matthew,

this is not for general HTML, it is for a specific type of HTML report, and as luck would have it these report files are rather regex-friendly.

Just me (Sir to you)
Friday, April 09, 2004

> Split up the texts into words an put the words into a second table.

I quite like this approach. IBM's Infoman product performed extremely well because it adopted this design (quite a few years ago). The database was much bigger (though only about twice the size or so) and inserts were slower, but queries were very quick.

I don't, however, know what SQL Server offers on the full-text searching front, or how good it is.

My only addition would be a dictionary used to exclude common words like 'the, at, in, on, to, a, I, be, my', etc. from the index.
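In code, that exclusion dictionary is just a membership test applied before a word is indexed. A small sketch (the stop-word list here is illustrative only and would be tuned to the corpus):

```python
# A tiny stop-word filter for an indexer.
STOP_WORDS = {"the", "at", "in", "on", "to", "a", "i", "be", "my"}

def indexable_words(text):
    """Yield lowercase words worth indexing, skipping stop words."""
    for word in text.lower().split():
        if word not in STOP_WORDS:
            yield word

print(list(indexable_words("The cat sat on my mat")))
# ['cat', 'sat', 'mat']
```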

Gwyn
Friday, April 09, 2004

I don't think I have enough information to answer your question, but making a few assumptions (including that you want high performance), here is what I would do.
Assuming these are lots of normal-sized files rather than a few gigabyte-sized files, I would store them in the file system and set up a hierarchy such that each folder contained only a few files on average, and the folder location of a given file could be deduced easily from its name.

I would index each word in BerkeleyDB. The keys would be the words, and the values would indicate the filename and the position where the word begins within the file.
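Both ideas can be sketched briefly. This uses the Python standard library's `dbm` module as a stand-in for BerkeleyDB; the hash-based folder scheme, file names, and entry format are all invented for illustration:

```python
import dbm
import hashlib
import os
import tempfile

def shard_path(root, filename):
    """Derive a nested folder from a hash of the file name, so the
    location is deducible and no single folder grows too large."""
    digest = hashlib.md5(filename.encode()).hexdigest()
    return os.path.join(root, digest[:2], digest[2:4], filename)

workdir = tempfile.mkdtemp()
index = dbm.open(os.path.join(workdir, "words"), "c")

def index_file(filename, text):
    """Record, for each word, the file and byte offset where it starts."""
    pos = 0
    for word in text.split():
        pos = text.index(word, pos)
        entry = f"{filename}:{pos}\n".encode()
        key = word.lower().encode()
        if key in index:
            index[key] = index[key] + entry
        else:
            index[key] = entry
        pos += len(word)

index_file("report1.html", "quarterly sales figures")
index_file("report2.html", "sales projections")

print(index[b"sales"].decode().split())
# ['report1.html:10', 'report2.html:0']
```

A lookup is then a single key fetch followed by reads at the recorded offsets, which is what makes this scheme fast even for a few GB of files.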

name withheld out of cowardice
Friday, April 09, 2004

Files are around 50 KB each.
Given that no one seems to jump out and shout "SS2K full-text is a nightmare", I think I will try it out.

What it promises over the DIY suggestions is:
- Catalog management and background indexing
- Proximity queries (A near B)
- Inflectional forms of verbs and nouns (drives, drove, driving, and driven all match queries for "drive")

It has more features, like multilanguage support, ranking, and indexing of different file types (.doc, .xls, etc.), but those do not apply in my case.

It also looks like it comes with its own built-in HTML indexer, so that saves me a few lines of code.

Of course, what is appealing about the basic DIY cases is the simplicity. SS2K full-text still feels a bit "black box" to me, but at least it does not have the "voodoo" feel of SS2K English Query.
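For reference, the proximity and inflectional queries listed above look roughly like this with SQL Server's CONTAINS predicate (table and column names are invented; an untested sketch):

```sql
-- Proximity: reports where 'catalog' occurs near 'index'
SELECT ReportId
FROM Reports
WHERE CONTAINS(ReportText, '"catalog" NEAR "index"');

-- Inflectional forms: matches drive, drives, drove, driving, driven
SELECT ReportId
FROM Reports
WHERE CONTAINS(ReportText, 'FORMSOF(INFLECTIONAL, "drive")');
```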

Just me (Sir to you)
Friday, April 09, 2004

MySQL also has full-text indexing and full-text searches.

MX
Friday, April 09, 2004

MX,

does it have any advantages over SS2K full-text?

Just me (Sir to you)
Friday, April 09, 2004

I think most people missed the point.
The OP is planning to use SQL Server 2000 to store and index full text.
I've used it. It is suitable, though slower than specialized tools. In theory, you should be able to test it easily (check out the TEXTCOPY tool).
But a few GB of text can mean a big number of files, and you'll probably need plenty of RAM.

Pakter
Friday, April 09, 2004

Guys suggesting to "store words in another table": this is essentially how *every* modern full-text search subsystem works, MS SQL's built-in one or not. The "table of words" is just handled behind the scenes, and is called an inverted index. If a DBMS can't do that, it is said not to have full-text search capability ("LIKE" isn't full text, as it operates on strings, not on texts made of words).

Just me, if your version of MS SQL supports full-text search (some cheap versions seem not to), just go ahead and use it - you will hardly invent anything better. MySQL has this facility included by default, and it's fairly versatile and fast.

Egor
Friday, April 09, 2004

Pakter,

that is correct. I expect the load on the system to be really low. New reports are pulled in overnight, and I have the luxury of sequencing the retrievals (set up with persistent query profiles and email delivery of reports). The DBMS machine currently has only 512 MB installed. I'll see how it fares. Thanks for the TEXTCOPY lead. Google turned up some interesting stuff on that.

Just me (Sir to you)
Friday, April 09, 2004

Thanks Egor,

My SQL Server setup does have full-text capability.

Just me (Sir to you)
Friday, April 09, 2004

SQL Server's full-text retrieval is external to the database itself, which has always led me to believe that it doesn't have an inverted index.

In any event, it is pretty trivial to create a word index table; it's something I routinely add to every system I build that handles text.

Simon Lucy
Friday, April 09, 2004

Simon Lucy, I wonder what line of thought could lead you to believe that the system is unlikely to have an inverted index because it is an add-on? How else in the world could a full-text search engine function?

FYI, full-text engines that use an RDBMS to store inverted indexes are *times* slower on any substantial dataset. The few that have half-decent performance use sophisticated multi-table structures to work around this.

Egor
Friday, April 09, 2004

Lucene.NET (http://lucenedotnet.com) is a high-performance, full-featured text search engine.

Jagger.NET
Saturday, April 10, 2004

I've not used Lucene.NET, but the original Java version is amazing. Have a look at this post:

http://madbean.com/blog/66/

Koz
Sunday, April 11, 2004

Believe me, I have experienced the speed of Lucene first hand. I had developed a search based on a SQL LIKE clause over an .mdb database, and it was just not holding up, taking about 2-3 minutes per search. Then I did a small prototype with Lucene against the same huge database, and it was almost instant! I couldn't believe my eyes. Up until then I had it in my mind that managed execution (the Java Virtual Machine) would be slow; boy, was I wrong. Plus it enables a lot of things that are impractical to implement with the SQL-and-.mdb approach, like proximity search. Well, I left that project and company 1.5 years ago, so I guess it hasn't been implemented. But when I showed my prototype to my boss, they jumped with joy. And me too.

Richard Hsu
Tuesday, June 01, 2004
