Fog Creek Software
Discussion Board




Simple task but on a large scale

Given a plain ASCII text file containing roughly 10 million English words, what would be the simplest or most elegant way to count each word in the file and rank them from most common to least common?

I've thought of a few ways to do this with 10 words, but 10 million is the kicker.  Any ideas?  Perl script to MySQL database, etc.?

Kpal
Wednesday, June 02, 2004

Bulk load it into SQL Server (or other database) and then do a SELECT Word, Count(*) GROUP BY Word ORDER BY Count(*)
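A rough sketch of that route, assuming the Perl-to-MySQL setup the original post mentions (the database, table and column names below are invented for illustration); note that the ORDER BY needs DESC to run from most common to least common, and that for 10 million rows you would want LOAD DATA INFILE or batched inserts rather than the row-at-a-time insert shown here:

#!/usr/bin/perl
# Sketch: load words into a table, let the database do the counting.
use strict;
use warnings;
use DBI;

# Hypothetical connection details -- adjust database, user and password.
my $dbh = DBI->connect('DBI:mysql:database=wordcount', 'someuser', 'somepass',
                       { RaiseError => 1 });

$dbh->do('CREATE TABLE IF NOT EXISTS words (word VARCHAR(64) NOT NULL)');

# One row per word.  For a real 10-million-word file, bulk load instead.
my $insert = $dbh->prepare('INSERT INTO words (word) VALUES (?)');
while (my $line = <>) {
    $insert->execute(lc $1) while $line =~ /([A-Za-z']+)/g;
}

# Rank from most common to least common.
my $rows = $dbh->selectall_arrayref(
    'SELECT word, COUNT(*) AS n FROM words GROUP BY word ORDER BY n DESC');
printf "%7d %s\n", $_->[1], $_->[0] for @$rows;

$dbh->disconnect;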

DJ
Wednesday, June 02, 2004

Are you sure there are 10 million different words in your text file?

My Name
Wednesday, June 02, 2004

"your are sure there are 10 million different words in your text file?"

I'm sure there *aren't*, since he wants to rank by how common a given word is.

anon
Wednesday, June 02, 2004

You don't need no steenkin' database, just write a Perl script and worry about the logic; Perl will handle what has to be done just fine.
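A minimal sketch of such a script, assuming a plain hash keyed on the lowercased word (the word-splitting regex is only a guess at what should count as a word):

#!/usr/bin/perl
# Sketch: count words in a hash, print them most common first.
use strict;
use warnings;

my %count;
while (my $line = <>) {
    # Treat runs of letters (and apostrophes) as words, ignoring case.
    $count{lc $1}++ while $line =~ /([A-Za-z']+)/g;
}

for my $word (sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count) {
    printf "%7d %s\n", $count{$word}, $word;
}

Run it as "perl wordcount.pl bigfile.txt"; even with 10 million words of input the hash only holds one entry per unique word.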

GG
Wednesday, June 02, 2004

On Linux:
sort $file | uniq -c | sort -n -k 1 -r
No Perl needed.

Fredrik Svensson
Wednesday, June 02, 2004

I checked with some dictionaries.

wc *.dic 2> /dev/null ; time sort *.dic | uniq -c | sort -n -k 1,1 -r > /tmp/dmp
  81274  81274 1201499 de_DE.dic
  81274  81274 1201499 de-DE.dic
  46148  46155  526276 en_GB.dic
  46148  46155  526276 en-GB.dic
  62076  62076  695734 en_US.dic
  62076  62076  695734 en-US.dic
  24491  24490  321123 sv.dic
403487  403500 5168141 total

real    0m8.537s
user    0m8.380s
sys    0m0.150s

Looking at the result, it seems to be alright:
head /tmp/dmp
      7 USA
      7 PS
      7 mm
      7 MHz
      7 km
      7 kg
      7 ISBN
      7 Hz
      7 EU
      7 cm

Even if 403,487 words is not 10,000,000, it will probably be faster to run the above than to sit down and write a script doing the same thing.
Given that sorting is O(N log N), the time will be longer, but I do not think by much.

Is this some class assignment or what?

Fredrik Svensson
Wednesday, June 02, 2004

You need the 'trie' structure.
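If you want to go that way, a trie can be sketched in Perl as nested hashes, one level per character, with the count kept under a sentinel key at the node where a word ends. For plain word counting a flat hash (as in the sketch above) is simpler and usually faster, so this is only an illustration:

#!/usr/bin/perl
# Sketch: a character trie built from nested hashes.
use strict;
use warnings;

my %trie;   # each node maps a character to a child node; '#' holds a count

while (my $line = <>) {
    while ($line =~ /([A-Za-z']+)/g) {
        my $word = lc $1;
        my $node = \%trie;
        $node = $node->{$_} ||= {} for split //, $word;
        $node->{'#'}++;                     # a word ends at this node
    }
}

# Walk the trie, collect (word, count) pairs, then rank by count.
my (@pairs, @stack);
push @stack, [ \%trie, '' ];
while (my $item = pop @stack) {
    my ($node, $prefix) = @$item;
    for my $key (keys %$node) {
        if ($key eq '#') { push @pairs, [ $prefix, $node->{$key} ]; }
        else             { push @stack, [ $node->{$key}, $prefix . $key ]; }
    }
}
printf "%7d %s\n", $_->[1], $_->[0]
    for sort { $b->[1] <=> $a->[1] } @pairs;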


Wednesday, June 02, 2004

10 mil isn't large scale ;) At least, not on today's computers. Consider RAM: if you have 16 bytes per word (i.e. 16 chars), which is an out-of-my-ass guess at the mean, that gives you what, like 160 megabytes? What kind of machine do you have that can't allocate that kind of memory?

Or do you think I'm off? Maybe I'm off by a factor of ten. That's not a big deal. So you need a gig of RAM. It's called virtual memory, and I'll bet you $20 that even with sequential searches on a reasonably modern machine it'll run fine, i.e. in minutes or hours.

If performance really is critical, that changes things, but I've yet to see very many cases where performance was more important than cost. Sure, there are a lot of apps that are time-critical. But the vast majority simply are not.

Mike Swieton
Wednesday, June 02, 2004

I had a homework assignment similar to this in my second year at university, so I thought I might chip in... hopefully what I have to say isn't totally obvious.

There are about 100,000 or so English words, excluding proper names and so on. The vast majority of the text is going to consist of these words, so I'd be surprised to see more than 200K unique words. That is a small enough number that you could just use a straightforward hash table to count them.

Jimmy Jo-Jo
Wednesday, June 02, 2004

The average number of letters in a word in a normal piece of prose is a lot less than the 16 you suggest for ASCII, or even the 8 for Unicode. Hopefully the OP will give us the answer later :)

How much RAM is needed would surely depend on the way the sort process works.

Stephen Jones
Wednesday, June 02, 2004

Interesting. I still regret that the old post about "needles in a haystack" faded away before it got anywhere.

Alex
Wednesday, June 02, 2004

This wouldn't be a large mbox file, would it? And you're not trying to come up with words to defeat spam filters, are you?

questions
Wednesday, June 02, 2004

Thanks for all the great feedback.  Here are some more details:

1. The 10 million words will be random, as in: dump "War and Peace" into a text file and process it.
2. No, not a class assignment.  It's an idea for a social networking utility.
3. Sorting in memory is fine - the machine has enough it appears.
4. No, it's not being used to create a spam filter, but I have heard that some folks make available large files of spam to tune filters.  I could use one to test it with.

Kpal
Thursday, June 03, 2004

Could not help it...
I downloaded War and Peace from here:
http://onlinebooks.library.upenn.edu/webbin/gutbook/lookup?num=2600
Then ran:
time cat /tmp/wrnpc11.txt | tr -cs "[:alpha:]" "[\n*]" | tr "[:upper:]" "[:lower:]" | sed -e "s/^'//" -e "s/'$//" -e '/^$/d' | sort | uniq -c | sort -n -k 1,1 -r | head
  34629 the
  22274 and
  16743 to
  14938 of
  10574 a
  9998 he
  9001 in
  8200 that
  7984 his
  7356 was

real    0m3.196s
user    0m3.030s
sys    0m0.120s

The statistics for War and Peace (plus some Project Gutenberg licence comments) are:
wc /tmp/wrnpc11.txt
  67414  566532 3284649 /tmp/wrnpc11.txt

Oh well, good luck with your project.

Fredrik Svensson
Friday, June 04, 2004

Sweet!  Thanks for letting me see this idea in action.

Kpal
Friday, June 04, 2004
