Buffer Search Algorithm
1. You have a large nul-terminated string (the haystack buffer),
2. You have a very large array* of short-ish nul-terminated strings (needles).
* (It's not quite an array in my problem, as I sort of have a window onto a few at a time, and can't see the next items until I have completed the current ones.)
3. You need to find out, as quickly as possible, how many times each needle occurs in the haystack.**
** (It's not quite like that in my actual problem: I only care provided a needle occurs more than some minimum number of times; if it occurs less than the minimum, I don't actually care.)
QUESTION: What is the fastest way to do it?
SOLUTION A (terrible):
A linear search of the haystack buffer for each needle: for example, use strstr to find the first occurrence of the needle in the haystack, then loop, calling strstr again starting one position after the last occurrence, until we have a count.
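A minimal sketch of this, assuming a non-empty needle (overlapping matches are counted by restarting one character past each hit):

#include <cstddef>
#include <cstring>

// Count occurrences of needle in haystack by repeated strstr.
// Overlapping matches are counted because we restart one character
// past each hit. Assumes needle is non-empty.
std::size_t count_occurrences(const char *haystack, const char *needle)
{
    std::size_t count = 0;
    const char *p = haystack;
    while ((p = std::strstr(p, needle)) != nullptr) {
        ++count;
        ++p;    // restart just after the start of the last hit
    }
    return count;
}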
SOLUTION BI (still pretty terrible):
Pre-scan the buffer and count the occurrences of each character, e.g. A occurs 237 times, B occurs 124 times, C occurs 68 times.
When testing a needle, the maximum number of times it can occur is the count of the least popular character in the needle. Continuing the example (A occurs 237 times, B occurs 124 times, C occurs 68 times), ABC can occur at *most* 68 times, although it may be less.
The previous paragraph potentially allows:
(i) earlier termination of the linear search, e.g. once we've hit 68 ABCs there can't be any more, so it's time to stop searching;
(ii) avoiding unnecessary searches, e.g. if searching for ABCZ and we know Z occurs 0 times, there is no need for a linear search at all;
(iii) because of my rule 3** (above), in my particular case, eliminating additional linear searches (a sketch of this pre-scan and bound follows below).
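A rough sketch of the BI pre-scan and bound, purely illustrative (the returned bound is what gets compared against rule 3's minimum):

#include <algorithm>
#include <array>
#include <cstddef>
#include <limits>
#include <string>

// Pre-scan: count how often each byte value occurs in the haystack.
std::array<std::size_t, 256> scan_char_counts(const std::string &haystack)
{
    std::array<std::size_t, 256> counts{};
    for (unsigned char c : haystack)
        ++counts[c];
    return counts;
}

// Upper bound: a needle cannot occur more often than its rarest character.
// If this falls below the rule-3 minimum, the linear search can be skipped.
std::size_t max_possible(const std::string &needle,
                         const std::array<std::size_t, 256> &counts)
{
    std::size_t bound = std::numeric_limits<std::size_t>::max();
    for (unsigned char c : needle)
        bound = std::min(bound, counts[c]);
    return bound;
}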
SOLUTION BII (getting better):
As with BI, but work with (say) 2-character sequences, e.g. AB occurs 15 times, BC occurs 7 times.
When testing a needle, the maximum number of times it can occur is the count of the least popular 2-character sequence in the needle. Continuing the example (AB occurs 15 times and BC 7 times), ABC can occur at *most* 7 times, although it may be less.
The previous paragraph potentially allows:
(i) earlier termination of the linear search, e.g. once we've hit 7 ABCs there can't be any more, so it's time to stop searching;
(ii) avoiding unnecessary searches, e.g. if searching for ABCZ and we know CZ occurs 0 times, there is no need for a linear search at all;
(iii) because of my rule 3** (above), in my particular case, eliminating additional linear searches (a sketch follows below).
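The corresponding sketch for BII, using a 256x256 table of 2-character counts (again just illustrative):

#include <algorithm>
#include <cstddef>
#include <limits>
#include <string>
#include <vector>

// Pre-scan: count every adjacent 2-character sequence in the haystack.
std::vector<std::size_t> scan_pair_counts(const std::string &haystack)
{
    std::vector<std::size_t> counts(256 * 256, 0);
    for (std::size_t i = 0; i + 1 < haystack.size(); ++i)
        ++counts[(unsigned char)haystack[i] * 256 + (unsigned char)haystack[i + 1]];
    return counts;
}

// Upper bound: a needle cannot occur more often than its rarest 2-character pair.
std::size_t max_possible_pairs(const std::string &needle,
                               const std::vector<std::size_t> &counts)
{
    std::size_t bound = std::numeric_limits<std::size_t>::max();
    for (std::size_t i = 0; i + 1 < needle.size(); ++i)
        bound = std::min(bound, counts[(unsigned char)needle[i] * 256 +
                                       (unsigned char)needle[i + 1]]);
    return bound;
}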
Here's the challenge: anyone got a SOLUTION BIII or C -- even better ones?
S. Tanna
Friday, February 27, 2004
Maybe I'm missing something, but surely you can just do one pass through the buffer and keep the results in a map of the form map<string, count> like so:
1) Get string from buffer
2) Check the map for the string; if it's not there, create an entry with a count of 1. If it is there, increment the count.
3) Repeat for next string from buffer until you reach the end of the buffer.
4) Your map now contains a count of occurrences for each string in the buffer.
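Roughly like this, assuming for the sketch that the buffer splits into whitespace-delimited words:

#include <cstddef>
#include <map>
#include <sstream>
#include <string>

// One pass over the buffer, counting each word in a map<string, count>.
// Assumes the buffer splits into whitespace-delimited words.
std::map<std::string, std::size_t> count_words(const std::string &buffer)
{
    std::map<std::string, std::size_t> counts;
    std::istringstream in(buffer);
    std::string word;
    while (in >> word)
        ++counts[word];   // creates the entry at 0 then increments, or bumps an existing one
    return counts;
}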
R1ch
Friday, February 27, 2004
What you're missing is the vast number of needles (I probably didn't emphasize that enough)
What if I have 100,000,000 needles?
What you are proposing is comparing every needle against every character position in the buffer. That is the same number of comparisons as the linear search, solution A -- it is just that you are doing them in a different order.
S. Tanna
Friday, February 27, 2004
ah, right - so you only need to know the results for a small number of needles?
R1ch
Friday, February 27, 2004
You need the result for every needle.
You just don't need to keep it in memory.
For a simple version, imagine you write the result for each needle to an output file or something, and can then throw it away.
Actually, a more accurate characterization of my particular problem: imagine you have 100,000,000 needles, and you need to find the 1,000 most frequently occurring needles.
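A minimal sketch of the "keep only the top 1,000" part, assuming per-needle counts arrive one at a time and everything below the current top k can be discarded (names here are illustrative):

#include <cstddef>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

using Entry = std::pair<std::size_t, std::string>;   // (count, needle)
using TopHeap = std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>>;

// Offer one (count, needle) result to a min-heap holding the k best so far.
// Everything else can be written out or discarded immediately.
void keep_top_k(TopHeap &top, std::size_t k, std::size_t count, const std::string &needle)
{
    if (top.size() < k) {
        top.emplace(count, needle);
    } else if (count > top.top().first) {
        top.pop();                       // evict the current smallest count
        top.emplace(count, needle);
    }
}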
S. Tanna
Friday, February 27, 2004
My favourite searching shortcut involves skipping ahead whilst doing a linear search.
If you are searching for ABCZ and you encounter an A, next check for the Z at the far end. If there's no Z then there's no match. If your search is in memory this will be quick.
If you produce a table of start char and end char + offset, you could perform one linear search, and trigger a reverse offset check on hits on the start character.
With quick supporting structures this could fly!
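A sketch of the far-end check, assuming everything is in memory (the cheap first/last character tests run before the full comparison):

#include <cstddef>
#include <cstring>

// Linear scan, but on a first-character hit check where the needle's last
// character would have to be before doing the full comparison.
std::size_t count_with_end_check(const char *haystack, const char *needle)
{
    std::size_t n = std::strlen(needle);
    std::size_t h = std::strlen(haystack);
    if (n == 0 || h < n)
        return 0;

    std::size_t count = 0;
    for (std::size_t i = 0; i + n <= h; ++i) {
        if (haystack[i] != needle[0])
            continue;                                  // no first-character hit
        if (haystack[i + n - 1] != needle[n - 1])
            continue;                                  // far end doesn't match: skip
        if (std::memcmp(haystack + i, needle, n) == 0)
            ++count;
    }
    return count;
}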
Good luck!
Tim H
Friday, February 27, 2004
I think I'd want to pick up a book or two on how gene sequencers work. IIRC, that's a pretty similar thing - how to match up a ton of small needles to a huge haystack. In the worst case, you've read about an interesting topic :)
Robert 'Groby' Blum
Friday, February 27, 2004
Does the haystack change frequently? Or do the needles?
If the haystack doesn't change, you could run a slow process to somehow index the haystack. The obvious one would be to make a list of all of the possible strings in the haystack, but there is probably a much better way.
pdq
Friday, February 27, 2004
Yeah, the haystack and needles are based on user-supplied files.
S. Tanna
Friday, February 27, 2004
If the number of needles is very large then precomputing some hashes from the buffer may help.
Say all of your needles are at least 8 characters long. Generate a hash of the 8 characters at each offset and store these in a map of (hash value -> offset).
For each needle compute the hash of the first 8 characters and look this up in the map. You get a list of *possible* offsets to verify.
Depending on the distribution of the needle lengths, etc you could tweak this algorithm. For example, compute hashes for all prefixes 1 ... N (giving N maps).
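A sketch of the fixed-prefix version, assuming every needle is at least 8 characters long; the table maps the hash of each 8-character window to its offset, and candidates still have to be verified:

#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>

constexpr std::size_t PREFIX_LEN = 8;   // assumes all needles are at least this long

// hash of the 8 characters starting at an offset -> that offset
using PrefixIndex = std::unordered_multimap<std::size_t, std::size_t>;

PrefixIndex build_index(const std::string &haystack)
{
    PrefixIndex index;
    std::hash<std::string> hasher;
    for (std::size_t i = 0; i + PREFIX_LEN <= haystack.size(); ++i)
        index.emplace(hasher(haystack.substr(i, PREFIX_LEN)), i);
    return index;
}

std::size_t count_needle(const std::string &haystack, const std::string &needle,
                         const PrefixIndex &index)
{
    std::size_t key = std::hash<std::string>{}(needle.substr(0, PREFIX_LEN));
    std::size_t count = 0;
    auto range = index.equal_range(key);
    for (auto it = range.first; it != range.second; ++it)
        if (haystack.compare(it->second, needle.size(), needle) == 0)  // verify candidate
            ++count;
    return count;
}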
Rob Walker
Friday, February 27, 2004
BTW, I forgot to mention I did a B.IIb.
Which is to use 4-character sequences as well as 2. I could, I suppose, do 3, 5, 6, 7, 8, etc. too.
Storing counts for every possible sequence is impractical, as the size of the array required for these counts grows massive (e.g. 5-character sequences would potentially require, with 256 possible characters, 2^40 array elements). So I do a bit of hashing, where a single element in a reasonably sized array (say 2^16) represents the highest count among a group of elements of the huge conceptual array (e.g. each element in a 2^16 array represents the highest number across 2^24 of the full array's elements).
I haven't explained the last para very well, but the principle is the same as doing 2 characters.
And this does make a dramatic (but not enough) difference.
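For what it's worth, one way to read B.IIb as code is below: hash each 4-character window into a 2^16-entry count table. This sketch sums per bucket rather than keeping a per-group maximum, but a bucket's total is still a valid upper bound for any single sequence hashed into it.

#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

constexpr std::size_t TABLE_BITS = 16;                // 2^16 buckets, as above
constexpr std::size_t TABLE_SIZE = std::size_t(1) << TABLE_BITS;

// Simple multiplicative hash of a packed 4-byte window down to a bucket index.
static std::size_t bucket_of(std::uint32_t gram)
{
    return (gram * 2654435761u) >> (32 - TABLE_BITS);
}

// Count every 4-character window of the haystack into the hashed table.
// A bucket's total is >= the true count of any one 4-gram mapped to it,
// so it still serves as an (over-)estimate for early termination and skipping.
std::vector<std::size_t> scan_4gram_counts(const std::string &haystack)
{
    std::vector<std::size_t> counts(TABLE_SIZE, 0);
    for (std::size_t i = 0; i + 4 <= haystack.size(); ++i) {
        std::uint32_t gram = 0;
        for (std::size_t j = 0; j < 4; ++j)
            gram = (gram << 8) | (unsigned char)haystack[i + j];
        ++counts[bucket_of(gram)];
    }
    return counts;
}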
S. Tanna
Friday, February 27, 2004
Rob, I posted at same time as you.
Needles are not all the same size, but interesting idea. And the kind of thing I'm looking for.
S. Tanna
Friday, February 27, 2004
A minor improvement you could add to your BII idea: if you are already keeping a count of each letter or 2-letter combo in the buffer, you could also keep a list of the index of each occurrence. Then when matching your needle you only need to check the indexes that match the start of the needle, instead of the whole buffer.
I don't know what your memory constraints are, though; this could be too much data to hold around.
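A sketch of that, keeping a position list per 2-character combination (memory is indeed the catch: roughly one stored offset per character of the buffer):

#include <cstddef>
#include <string>
#include <vector>

// positions[a*256 + b] lists every offset where the bytes a,b appear consecutively.
std::vector<std::vector<std::size_t>> build_pair_positions(const std::string &haystack)
{
    std::vector<std::vector<std::size_t>> positions(256 * 256);
    for (std::size_t i = 0; i + 1 < haystack.size(); ++i)
        positions[(unsigned char)haystack[i] * 256 +
                  (unsigned char)haystack[i + 1]].push_back(i);
    return positions;
}

// Only verify the needle at offsets where its first two characters occur.
std::size_t count_at_pair_positions(const std::string &haystack, const std::string &needle,
                                    const std::vector<std::vector<std::size_t>> &positions)
{
    if (needle.size() < 2)
        return 0;   // sketch only: needles shorter than 2 chars not handled here
    std::size_t key = (unsigned char)needle[0] * 256 + (unsigned char)needle[1];
    std::size_t count = 0;
    for (std::size_t off : positions[key])
        if (haystack.compare(off, needle.size(), needle) == 0)
            ++count;
    return count;
}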
MikeMcNertney
Friday, February 27, 2004
If I'm reading this correctly, it sounds like a SolvedProblem -- namely the longest substring problem?
http://www.cs.sunysb.edu/~algorith/files/longest-common-substring.shtml
Search Google for it; I distinctly remember learning a great algorithm for it in college, but my textbook is at home.
MR
Friday, February 27, 2004
Ooops, more like:
http://www.cs.sunysb.edu/~algorith/files/string-matching.shtml
MR
Friday, February 27, 2004
Yes, finding all instances of a string in a text is more or less a solved problem. However, in this case we have the added knowledge that we are not searching for one string, but many, many strings. This changes the parameters of the question and makes pre-processing of the buffer more viable.
MikeMcNertney
Friday, February 27, 2004
I like Rob's algorithm: to generalize it to work with needles of arbitrary length would just require a hash table for each possible needle length. If the needles are between, say, 3 and 10 chars in length, this only requires 8 hash tables -- and 8 passes through the haystack to generate them.
Ken
Friday, February 27, 2004
Ah, actually Rob's doing something different, by storing the list of offsets in the table as opposed to the counts. My strategy won't work if the needles can be arbitrarily long. I like his idea even more now!
Ken
Friday, February 27, 2004
I am a db guy, so this is how I think. Provided the haystack is gigabytes, and the needles are, say, 5-1000 long:
If the needles are long (512K) and you need to match the entire breadth of the needles, find a unique signature for each needle (say a reasonable length, 32 bytes) and sort them... the signatures will fill only a thin slice of the space of that many bytes... then do a first pass over the long haystack. This should reduce it to the likely occurring strings, because you don't want to compare long strings; you want to compare short strings.
After that, do the most straightforward search you can...
That's roughly it... not sure if the optimization will work for you.
Li-fan Chen
Friday, February 27, 2004
Have you tried citations archive? There must be a trillion ways to solve this problem :D
Li-fan Chen
Friday, February 27, 2004
Your volume of search strings complicates matters immensely, but you might wish to start by looking at the standard single-string search algorithm, Boyer-Moore, which Tim H. described above. Sunday and Hume made some refinements, particularly in the area of searching for natural language words. I didn't quite catch whether your list was more static than your haystack, but if so, you could precompute and store the table for each needle. At any rate, here are some references that might lead you to some more relevant work for your particular problem.
Boyer, R.S., Moore, J.S. 1977. A fast string searching algorithm. Communications of the ACM 20:762-772.
Sunday, D.M. 1990. A very fast substring search algorithm. Communications of the ACM 33(8):132-142.
Hume, A., Sunday, D.M. 1991. Fast string searching. Software - Practice and Experience 21(11):1221-1248.
veal
Friday, February 27, 2004
The needles can be represented as a regular expression consisting of the logical-OR of the individual strings.
Now research regular expression processing algorithms.
There may be a way of combining the DFSM generated by a regexp compiler and the Boyer-Moore algorithm to produce a high-performance string searcher.
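Just to make the shape of the idea concrete, a sketch that ORs the needles into one std::regex pattern; purely illustrative, since a general-purpose regex engine won't cope with millions of alternatives and sregex_iterator reports only non-overlapping matches:

#include <cstddef>
#include <map>
#include <regex>
#include <string>
#include <vector>

// Build one alternation pattern from all needles and count the matches found.
// Note: finds non-overlapping matches only, and the leftmost alternative wins,
// so this is a rough illustration rather than a drop-in solution.
std::map<std::string, std::size_t>
count_all(const std::string &haystack, const std::vector<std::string> &needles)
{
    const std::string meta = R"(\^$.|?*+()[]{})";
    std::string pattern;
    for (const std::string &n : needles) {
        if (!pattern.empty())
            pattern += '|';
        for (char c : n) {
            if (meta.find(c) != std::string::npos)
                pattern += '\\';             // escape regex metacharacters
            pattern += c;
        }
    }
    std::map<std::string, std::size_t> counts;
    std::regex re(pattern);
    for (auto it = std::sregex_iterator(haystack.begin(), haystack.end(), re);
         it != std::sregex_iterator(); ++it)
        ++counts[it->str()];
    return counts;
}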
David Jones
Saturday, February 28, 2004
I think Li-fan Chen is on the right lines. You need to reduce the number of searches you do, however you finally happen to do those searches for each string/substring. Hmm. Maybe there's something that correlates to compression...?
Monday, March 1, 2004