Fog Creek Software
Discussion Board


I know one-way hash functions give you a unique number sequence for every input. So they are a good way to sign files for example for intergrity checking.

If I wanted to figure out how similar two files are based just on their hashes, would it be possible to compare the hashes to figure that out?

If not (which is what I am guessing to be the answer), is there any other (reliable) way of comparing to files and figuring out how similar/different they are?


Thursday, April 29, 2004

Compare some large number of bytes at random positions in the two files.

Thursday, April 29, 2004

You could take a slice of the file (maybe 4k chunks) and perform a sliding window comparison against the target file.  Perform a simple xor of the bits, and count the one's.  The greater the count the greater the difference.  A rough outline for sure, but with some fine tuning, it would probably work sufficently enough.  You may get some false positives, but the signal to noise ratio would probably be low enough to prove useful.

Thursday, April 29, 2004

The question is only meaningful if you have some sort of model of the kinds of changes that might have happened. Corruption of random bits? Misspellings? Insertion or deletion of occasional characters? Reordering of chunks 4k in size? A similarity measure designed for any one of these situations is likely to misbehave badly when faced with any other.

Detecting similar bodies of text is a handy thing to be able to do for spam detection. Unfortunately, the model there is "stuff that results in something that reads kinda-sorta the same to typical naive human readers", which isn't easy to express precisely and therefore isn't easy to measure.

If you can describe what sort of similarities you're interested in, then it may well be possible to come up with efficient and reliable ways of detecting them. If not, you're just guessing :-).

Gareth McCaughan
Thursday, April 29, 2004

For example, a "diff' tool (like Windiff) can tell you which lines in a text file have changed (are different) and which are unchanged.

Christopher Wells
Thursday, April 29, 2004

One-way hashes guarantee nothing of the sort.

A cryptographically strong hash function reduces a document to a sequence of bits.  e.g. MD5 outputs 128 bits.

Obviously, the set of all possible 20-byte documents cannot map 1:1 to the 2**128 possible hashes, so the hash is by no means "unique".  Uniqueness isn't even guaranteed for documents smaller than the hash output.

For cryptographically strong hashes, two very similar documents will hash quite differently. The intent is for the probability of two documents having the same hash to be on the order of the size of the hash output itself.

If you want to compare two documents for SIMILARITY, then you need to use an algorithm (which may involve hashes) that can do that.  But plain MD5 or SHA-1 ain't it.
Look at some of the "diff" algorithms.

Aside: if you look at ordinary FlexLM license keys in a large license file, the hashes look similar. This indicates that basic FlexLM is very weak. Globetrotter likely used a weak hash function instead of MD5 because the latter was at the time, exposed to red-tape due to export controls.

David Jones
Thursday, April 29, 2004

Google for "edit distance."  Basically, it's the amount of change you'd have to make to string 0 to make it equal to string 1.  In our data mining apps we sometimes use it to find out which 2 strings are "closest" to each other.

Thursday, April 29, 2004

There's a function in GNU libc that will calculate edit distance for you - it's called fstrcmp. There's a module on CPAN based on this function as well. I used it to measure similarity between small text documents (job ads) - worked VERY well.

Thursday, April 29, 2004

Depends on what you mean by "similar"...

For text, I suspect you could make a markov chain kind of affair but instead of using the table to generate amusing nonsense you could compare it to the equivalent table generated from the other file. This might be sensitive to the insertion of words, but you could probably tweak the tables (skipping words etc.) for the second file if you find anything wildly different. Of course you'd need some kind of heuristic to ensure this didn't result in two radically different files appearing to be rather similar.

Thursday, April 29, 2004

> If you can describe what sort of similarities you're
> interested in, then it may well be possible to come up
> with efficient and reliable ways of detecting them. If not,
> you're just guessing :-).

I have a few scenarios in mind. One of them involves two text files. I am looking for changes in the words used, number of words used, location of words (insertion/deletions).. I am not looking for a smart diff algorithm which will know that "great" and "wonderful" are similar or anything.

I use diff tools all the time probably just like many of you. They are good at telling you if your files are different or not which is a true/false kinda answer and then they show you the differences. I need an alogrithm which will tell me if the two files are different or not, and if they are different it'll tell me a percentage of how much similar they are... It should say "File A is 80% similar to File B". I can't quite come up with a way to compare files in a way which will allow me to calculate the percentage.

The other scenario is with two binary files. I can't think of anyways to compare two binary files without knowing what exactly they contain. If the files had some kinda structure, I suppose it would be easier to do a diff.

I'll look into what has been suggested so far.

Thank you!

Thursday, April 29, 2004

*  Recent Topics

*  Fog Creek Home