dictionary Suggest an efficient algorithm to find all sets of anagrams (e.g., stop, post, spot are anagrams of each other) in a dictionary of words. There is a well known way to solve this problem (I won't give it away for those who haven't seen this problem before) ... can you suggest an even faster algorithm? Vin Wednesday, May 01, 2002 I don't know abt the well known method or the most effecient method. But one way to do is the following psuedocode: struct Dict{ String orig_word; String ord_word; } NewDict[Dictionary size]; Struct Dict temp; For(each word w in dictionary){ temp.orig_word=w; temp.ord_word=sortbyalphabets(w); Insert temp in NewDict using insertion sort, ordering by temp.ord_word; } The array NewDict will now contain all the sets of anagrams together(looking at the orig_word part of each element). vivek gupta Wednesday, May 01, 2002 That's the well-known solution I was referring to.  So, if the dictionary has n words of atmost m characters each, this algorithm run in n.O(m lg m) + O(n lg n) time assuming we use quicksort to sort each time. Can you make it faster?  State any assumptions (be reasonable!) Vin Wednesday, May 01, 2002 this is a dumb optimization, but it still works... you know that if they are anagrams they have to be the same length, so instead of inserting them all into the same list, insert them into a list only for those words that have the same length.. so "post" would go into the 4 letter list... and "bread" would go into the 5 letter list.. this makes the insertion a little quicker because we can assume there will be less items in the list ... then again, if the dictionary has all words that are the same length this doesn't add any optimization... also, instead of sorting the letters in each word, maybe you could do something which is faster than a sort, something that only takes O(m) time, like you could take each character's number value (a=1, z=26) and multiply it to get a product for each word.  i think the product will be the same only for anagrams (as long as you only compare the products of words that are the same length).  example: ant  - a=1; n=14; t=20 so the id is 1*14*20 = 280.  any other word that is the same length will give the same id (for example tan). Michael H. Pryor Wednesday, May 01, 2002 I was thinking of something along the lines of mapping each word to a unique numeric 'signature' as well. Tell me if you think this will work ... Instead of sorting each word to obtain the 'signatures', use a 32-bit integer to convert the words to their number equivalents (or 'signatures').  Initialize the integer to zero.  Then, do a one-pass through the word and for each character 'x' encountered, add 2^I(x) to the integer (where I(x) is the position of the character in the alphabet – so I(A) is 0 & I(Z) is 25).  So, the C++ code for this operation would look something like: int getSignature(char *pword) {   int retval = 0;    //assume 32-bit integer   int I = 0;   int nLen = strlen(pword);   for (I = 0; I> I think the product will be the same only for anagrams (as long as you only compare the products of words that are the same length). Not true:   cut = 3 * 21 * 20 = 1260.   log = 12 * 15 * 7 = 1260. Of course, this class of optimization is only useful if the size of the word might be large, which in practice is never the case. I would do something very similar to the previous suggestions, but using a hash table whose initial size is approximately the number of words to be dealt with. Jim Lyon Wednesday, May 01, 2002 doh. you're right, that won't work. Michael H. Pryor Thursday, May 02, 2002 Michael, You can fix the problem with your method if you only use prime numbers a=2,b=3,c=5,d=7,e=11,... That also gets rid of the problem of counting any number of repetitions of the letter "a" as identical. A.T. Friday, May 03, 2002 i like your idea of using primes, AT! Vin Friday, May 03, 2002   Fog Creek Home