Fog Creek Software
Discussion Board




dictionary

Suggest an efficient algorithm to find all sets of anagrams (e.g., stop, post, spot are anagrams of each other) in a dictionary of words.

There is a well known way to solve this problem (I won't give it away for those who haven't seen this problem before) ... can you suggest an even faster algorithm?

Vin
Wednesday, May 01, 2002

I don't know abt the well known method or the most effecient method. But one way to do is the following psuedocode:

struct Dict{
String orig_word;
String ord_word;
} NewDict[Dictionary size];

Struct Dict temp;

For(each word w in dictionary){
temp.orig_word=w;
temp.ord_word=sortbyalphabets(w);
Insert temp in NewDict using insertion sort, ordering by temp.ord_word;
}

The array NewDict will now contain all the sets of anagrams together(looking at the orig_word part of each element).

vivek gupta
Wednesday, May 01, 2002

That's the well-known solution I was referring to.  So, if the dictionary has n words of atmost m characters each, this algorithm run in n.O(m lg m) + O(n lg n) time assuming we use quicksort to sort each time.

Can you make it faster?  State any assumptions (be reasonable!)

Vin
Wednesday, May 01, 2002

this is a dumb optimization, but it still works...

you know that if they are anagrams they have to be the same length, so instead of inserting them all into the same list, insert them into a list only for those words that have the same length.. so "post" would go into the 4 letter list... and "bread" would go into the 5 letter list..

this makes the insertion a little quicker because we can assume there will be less items in the list ... then again, if the dictionary has all words that are the same length this doesn't add any optimization...

also, instead of sorting the letters in each word, maybe you could do something which is faster than a sort, something that only takes O(m) time, like you could take each character's number value (a=1, z=26) and multiply it to get a product for each word.  i think the product will be the same only for anagrams (as long as you only compare the products of words that are the same length). 

example: ant  - a=1; n=14; t=20 so the id is 1*14*20 = 280.  any other word that is the same length will give the same id (for example tan).

Michael H. Pryor
Wednesday, May 01, 2002

I was thinking of something along the lines of mapping each word to a unique numeric 'signature' as well. Tell me if you think this will work ...

Instead of sorting each word to obtain the 'signatures', use a 32-bit integer to convert the words to their number equivalents (or 'signatures').  Initialize the integer to zero.  Then, do a one-pass through the word and for each character 'x' encountered, add 2^I(x) to the integer (where I(x) is the position of the character in the alphabet – so I(A) is 0 & I(Z) is 25).  So, the C++ code for this operation would look something like:

int getSignature(char *pword)
{
  int retval = 0;    //assume 32-bit integer
  int I = 0;
  int nLen = strlen(pword);

  for (I = 0; I<nLen; I++)
  {
      //assume all characters are in lowercase
      retval += (1 << (pword[I] – 'a'));
  }

  return retval;
}

Since no two integers can ever be generated equal from a summation of an equal number of different sets of powers of 2 (the whole binary representation of numbers is based on this idea), every such (signature, length) pair for words in the dictionary must be unique.  Moreover, since we store 26 alphabets in 32 bits, we have room for [2^(31-I(x))–1] repititions of character 'x' within a word, with I(x) defined as before.  For example:

'a' can be repeated [2^31–1] times,
'b' can be repeated [2^30–1] times and so on till
'z' which can be repeated [2^6–1] = 63 times. 

Also, this representation will handle at least [2^32–2^26] words that contain at least one repeated character.  These provisions are more than sufficient for a realistic dictionary of words. 

So, computing the signatures for every word now takes only O(m) time now rendering the whole algorithm faster.  We'll have to do a primary sort on the signatures & secondary sort on length.  Even so, the time taken to find anagrams will now be:
    n.O(m) + O(n lg n)

Vin
Wednesday, May 01, 2002

>> I think the product will be the same only for anagrams (as long as you only compare the products of words that are the same length).

Not true:
  cut = 3 * 21 * 20 = 1260.
  log = 12 * 15 * 7 = 1260.

Of course, this class of optimization is only useful if the size of the word might be large, which in practice is never the case.

I would do something very similar to the previous suggestions, but using a hash table whose initial size is approximately the number of words to be dealt with.

Jim Lyon
Wednesday, May 01, 2002

doh. you're right, that won't work.

Michael H. Pryor
Thursday, May 02, 2002

Michael,
You can fix the problem with your method if you only use prime numbers
a=2,b=3,c=5,d=7,e=11,...
That also gets rid of the problem of counting any number of repetitions of the letter "a" as identical.

A.T.
Friday, May 03, 2002

i like your idea of using primes, AT!

Vin
Friday, May 03, 2002

*  Recent Topics

*  Fog Creek Home