Fog Creek Software
Discussion Board




Welcome! and rules

Joel on Software

Longest Common Substring code

I'm developing a VB.NET application that needs to find the longest common substring in a list of filenames...

That being said, I've spent countless hours searching the web finding tons of C code to do it the "brute force" way and using generalized suffix trees.  Being that I haven't coded in C for many years, I was hoping someone here might have a library that can be used in VB.NET to accomplish this task.

Thanks for your help.

James

James F
Tuesday, March 08, 2005

Take a look at the collection classes in .NET.  They're pretty good.

Taking a guess at what you're trying to do, perhaps a Hashtable using your substrings as keys and another collection (ArrayList?) as the value would work, so that each time you permute a filename into substrings, the original filename is added to the collection for that particular substring.  Then you can iterate over the keys and/or values and pick the "longest" one.

There's probably a better way.  One of the times I regret not having a CS education.

I don't really have a clear idea of what you're trying to accomplish here, and I bet I'm not the only one.  A bit more detail might smooth the way for more helpful responses.

Matt Conrad
Thursday, March 10, 2005

Don't "brute force" it, you'll be there all day.  Use a dynamic programming algorithm.  There's some pseudocode here:

http://www.ics.uci.edu/~dan/class/161/notes/6/Dynamic.html

for finding the longest common substrings of any pair of strings.  That should be easily convertible into VB.NET.

Once you've got a two-string version, you'll need to generalize it to multiple strings.  Here's what I'd do:

Create two arraylists called "possibilities1" and "possibilities2".  Run your LCS algorithm on the first two strings in your file list.  There may be a tie for first place, so add all ties to the list of possibilities1.

Now take the third string on the file list.  For each item on the possibilities1 list, run it with the third file, and put each resulting LCS into possibilities2.

Blow away possibilities1, rename possibilities2 to possibilities1, and repeat with the 4th, 5th, etc, items on the file list.

When you're done, find the longest item left in the possibilities list, and you've got the longest common subsequence of all the files.

There's probably a more time-efficient way to do this than the algorithm I just laid out, but this way is pretty space-efficient.

example:  you've got files

1: abcdefghijklm
2: abcd11111def
3: def22222bc
4: bc333333f

First time through, possibilities are abcd, def.
Second time through, possibilities are bc, def
Third time through, possibilities are bc, f.

So bc wins.

Make sense?

Eric Lippert
Friday, March 18, 2005

*  Recent Topics

*  Fog Creek Home