Simple task but on a large scale
Given a plain ASCII text file containing roughly 10 million English words, what would be the simplest or most elegant way to count each word in the file and rank them from most common to least common?
I've thought of a few ways to do this with 10 words, but 10 million is the kicker. Any ideas? Perl script to a MySQL database, etc.?
Kpal
Wednesday, June 2, 2004
Bulk load it into SQL Server (or another database) and then do a SELECT Word, COUNT(*) GROUP BY Word ORDER BY COUNT(*) DESC
DJ
Wednesday, June 2, 2004
Are you sure there are 10 million different words in your text file?
My Name
Wednesday, June 2, 2004
"your are sure there are 10 million different words in your text file?"
I'm sure there *aren't*, since he wants to rank by how common a given word is.
anon
Wednesday, June 2, 2004
You don't need no steenkin' database; just write a Perl script and worry about the logic. Perl will handle what has to be done just fine.
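For example, a minimal sketch of what such a script might look like (assuming plain ASCII input, and treating anything other than letters and apostrophes as a word separator):

#!/usr/bin/perl
# Count word frequencies: split each line on non-letters, tally in a hash,
# then print from most common to least common.
use strict;
use warnings;

my %count;
while (my $line = <>) {
    for my $word (split /[^A-Za-z']+/, $line) {
        next unless length $word;
        $count{ lc $word }++;
    }
}
for my $word (sort { $count{$b} <=> $count{$a} } keys %count) {
    print "$count{$word}\t$word\n";
}

Run it as perl count_words.pl bigfile.txt; even with a huge file, the hash only grows with the number of distinct words.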
GG
Wednesday, June 2, 2004
In Linux:
sort $file | uniq -c | sort -n -k 1 -r
No Perl is needed.
Fredrik Svensson
Wednesday, June 2, 2004
I checked with some dictionaries.
wc *.dic 2> /dev/null ; time sort *.dic | uniq -c | sort -n -k 1,1 -r > /tmp/dmp
81274 81274 1201499 de_DE.dic
81274 81274 1201499 de-DE.dic
46148 46155 526276 en_GB.dic
46148 46155 526276 en-GB.dic
62076 62076 695734 en_US.dic
62076 62076 695734 en-US.dic
24491 24490 321123 sv.dic
403487 403500 5168141 total
real 0m8.537s
user 0m8.380s
sys 0m0.150s
Looking at the result, it seems to be alright:
head /tmp/dmp
7 USA
7 PS
7 mm
7 MHz
7 km
7 kg
7 ISBN
7 Hz
7 EU
7 cm
Even if 403,487 words is not 10,000,000, it will probably be faster to run the above than to sit down and write a script that does the same thing.
Given that sorting is O(N log N), the time for 10 million words will be longer, but I do not think by much.
Is this some class assignment, or what?
Fredrik Svensson
Wednesday, June 2, 2004
You need the 'trie' structure.
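If it helps, here is a rough sketch of that idea in Perl (a hypothetical illustration, not the poster's code; the trie is faked with nested hashes, and the special key '#' marks where a word ends):

#!/usr/bin/perl
# Trie-based word counting: each node is a hash keyed by letter; the
# special key '#' holds the count of the word that ends at that node.
use strict;
use warnings;

my %trie;

sub trie_add {
    my ($word) = @_;
    my $node = \%trie;
    $node = ($node->{$_} ||= {}) for split //, $word;
    $node->{'#'}++;
}

sub trie_dump {
    my ($node, $prefix, $out) = @_;
    for my $key (keys %$node) {
        if   ($key eq '#') { $out->{$prefix} = $node->{$key} }
        else               { trie_dump($node->{$key}, $prefix . $key, $out) }
    }
}

while (<>) {
    trie_add(lc $_) for grep { length } split /[^A-Za-z]+/;
}

my %count;
trie_dump(\%trie, '', \%count);
print "$count{$_}\t$_\n" for sort { $count{$b} <=> $count{$a} } keys %count;

For plain frequency counting a flat hash is simpler and probably faster; a trie mostly earns its keep when you also need prefix lookups.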
Wednesday, June 2, 2004
10 mil isn't large scale ;) At least, not on today's computers. Consider RAM: if you have 16 bytes per word (i.e. 16 chars), which is an out-of-my-ass guess at the mean, that gives you what, like 160 megabytes? What kind of machine do you have that can't allocate that kind of memory?
Or do you think I'm off? Maybe I'm off by a factor of ten. That's not a big deal. So you need a gig of RAM. It's called virtual memory, and I'll bet you $20 that even with sequential searches on a reasonably modern machine it'll run fine, i.e. in minutes or hours.
If performance really is critical, that changes things, but I've yet to see very many cases where performance was more important than cost. Sure, there's a lot of apps that are time-critical. But the vast majority simply are not.
Mike Swieton
Wednesday, June 2, 2004
I had a homework assignment similar to this in my second year at university, so I thought I might chip in... hopefully what I have to say isn't totally obvious.
There are about 100,000 or so English words, excluding proper names and so on. The vast majority of the text is going to consist of these words, so I'd be surprised to see more than 200K unique words. That is a small enough number that you could just use a straightforward hash table to count them.
Jimmy Jo-Jo
Wednesday, June 2, 2004
The average number of letters in a word in a normal piece of prose is a lot less than the 16 you suggest for ASCII, or even the 8 for Unicode. Hopefully the OP will give us the answer later :)
How much RAM is needed would surely depend on the way the sort process works.
Stephen Jones
Wednesday, June 2, 2004
Interesting. I still regret that the old post about "needles in a haystack" petered out before it got anywhere.
Alex
Wednesday, June 2, 2004
This wouldn't be a large mbox file, would it? And you're not trying to come up with words to defeat spam filters, are you?
questions
Wednesday, June 2, 2004
Thanks for all the great feedback. Here are some more details:
1. The 10 million words will be random, as in: dump "War and Peace" into a text file and process it.
2. No, not a class assignment. It's an idea for a social networking utility.
3. Sorting in memory is fine; the machine appears to have enough.
4. No, it's not being used to create a spam filter, but I have heard that some folks make large files of spam available for tuning filters. I could use one to test with.
Kpal
Thursday, June 3, 2004
Could not help it..
I downloaded War and Peace from here :
http://onlinebooks.library.upenn.edu/webbin/gutbook/lookup?num=2600
Then ran:
time cat /tmp/wrnpc11.txt | tr -cs "[:alpha:]" "[\n*]" | tr "[:upper:]" "[:lower:]" | sed -e "s/^'//" -e "s/'$//" -e '/^$/d' | sort | uniq -c | sort -n -k 1,1 -r | head
34629 the
22274 and
16743 to
14938 of
10574 a
9998 he
9001 in
8200 that
7984 his
7356 was
real 0m3.196s
user 0m3.030s
sys 0m0.120s
The statistics of War and Peace (plus some Project Gutenberg licence comments) are:
wc /tmp/wrnpc11.txt
67414 566532 3284649 /tmp/wrnpc11.txt
Oh well, good luck with your project.
Fredrik Svensson
Friday, June 4, 2004
Sweet! Thanks for letting me see this idea in action.
Kpal
Friday, June 4, 2004