Fog Creek Software
Discussion Board




SORT

WHAT IS THE BEST WAY TO SORT A VERY LARGE FILE, IF THIS FILE IS TOO LARGE FOR THE MEMORY?

JOE
Wednesday, August 04, 2004

Send the output to the card punch. Take your stack of cards to the machine operator and tell her to run them through the sort process. Wait about 4 hours and go back to get your card stack now in sorted order. Feed them back into the punch card reader and print it all out on 11x17 fanfold paper to verify.

old_timer
Wednesday, August 04, 2004

Use the Bayes-Knorr algorithm with a Inverse-Tree-Traverse pattern. Google should give some more info on this.

Dennis Forbes
Wednesday, August 04, 2004

How old is your computer that it doesn't even type in lower case?

Bill Rushmore
Wednesday, August 04, 2004

Don't listen to Dennis, he's just trying to trick you.

Everybody knows you'll need to first run it through a Hayes-Bozon Leaf Pruning filter in order to remove duplicates before you apply a recursive Radix sort to achieve O(n) performance.

Captain McFly
Wednesday, August 04, 2004

What does your data look like?  Is it random or does it follow some known distribution?  Does the sort need to be stable or not?  You can probably use something like bucket sort (or a recursive bucket sort like quicksort).  You just need to work out a reasonable scheme for storing the buckets (since they can be larger than your memory capacity).

Kalani
Wednesday, August 04, 2004

A toolshed or an abandoned warehouse are both excellent places to stash excess buckets until you need them again.

muppet
Wednesday, August 04, 2004

The great implementation advice on JOS is the main reason I keep coming back.

Bored Bystander
Wednesday, August 04, 2004

Don't let anyone fool you!  There is only one sort: BUBBLE!  But make sure to run it twice in case one slips by.

n^2 > n log n > n

How can you dispute that!  Bubble rocks!

Capn' Kirk
Wednesday, August 04, 2004

I believe the method of superior sorting algorithm you seek is "bogosort". Here is a link for more information: http://en.wikipedia.org/wiki/Bogosort

Thank you.

anon-y-mous cow-ard
Wednesday, August 04, 2004

I thought the obvious answer would be to buy more memory?

Ankur
Wednesday, August 04, 2004

Undo your Caps Lock and everything will sort fine, my friend.

 
Wednesday, August 04, 2004

If you need to sort ONE very large file, then it means it has only ONE element, which means it's SORTED in any way you want.


Wednesday, August 04, 2004

"Bogosort"

We had a programmer who to get a random number from 1 to 10 kept would keep calling our random function (which returned 0-65535) until the return value was between 1 and 10.

Also, he was an Indian on an H1-B visa. Maybe that and the bogosort is just what they teach over there?

Babu Jones
Wednesday, August 04, 2004

What does it mean to sort a very large file?  Are you trying to arrange the bytes in the file in ascending order?

name withheld out of cowardice
Wednesday, August 04, 2004

As Bored Bystander said, it's also a good idea to examine how you use the file in the first place.  If you write infrequently but read frequently, for example, you should store the file in sorted form (in this case your time to sort will obviously be O(1) but your insert time will be something like O(log n)).

If you have any control over it, use a RDBMS so that you can worry about your application's primitives instead of partially reimplementing an RDBMS.

Kalani
Wednesday, August 04, 2004

JOE -

In case you haven't figured it out yet, no one is answering your post because (1) it looks like a homework assignment and (2) the ALL CAPS screams "I'm a clueless asshat!"

But, being that I'm in a charitable mood, Google for external mergesort.

Yet another anon
Wednesday, August 04, 2004

Pfft. O(1) is weak. I'm working on a O( n - n - 1 ) sort which actually sorts the file before you create it. I'm having a little bit of trouble with duplicate values, though.

Captain McFly
Wednesday, August 04, 2004

Note that JOE has not checked back. Guys like JOE expect the phone to ring with a specific answer in response.


Wednesday, August 04, 2004

To give a somewhat serious answer, if faced with such a situation I would chop the file into X parts and then sort and persist each of the X parts using conventional high efficiency sorting algorithms. Imagine having 5 sorted decks of cards with the lowest cards up - it's fairly easy and efficient from there to build it into one sorted deck.

Dennis Forbes
Wednesday, August 04, 2004

External mergesort? Clearly some people have retroactively stolen my idea. I'm going to have to go get a patent and seek justice for this wrong.

Dennis Forbes
Wednesday, August 04, 2004

The obvious answer is not to sort it and go home and drink a beer - it will sort itself out, given enough time.

coward
Wednesday, August 04, 2004

"retroactively stolen my idea"

Dude, I hear ya. Remember that math formula, sum 1...n = (n*(n+1))/2.  I invented that.  Only I did it about 200 or 300 years too late.

If there's ever a time machine invented, first thing I'm gonna do is go back in time and whack that motherfucker whose name is in all the textbooks.  Just a low-life retro-thief is what he is.

Yet another anon
Wednesday, August 04, 2004

If you're sorting lines in the file, maybe you have enough memory to store pointers to the start of each line in the file (if not the actual lines themselves). In this case you could use any decent sort algorithm just reading the relevant lines from disk when you need to compare them...

Also, like someone said, you might want to store the data in the file in such a way that it's always sorted, or build some kind of index or hash table to make sorting and lookups quicker.

So yeah, essentially reimplementing a DBMS. Waste of time! been done! just embed something like SQLLite (which can store the data in a specified file and doesn't need any kind of database server daemon running).

Perhaps even something simple like BDB could handle this?

Matt
Wednesday, August 04, 2004

The OP's question reeks of some sort of low-level computer assignment - maybe a summer school high school computer class.

Dennis Forbes
Wednesday, August 04, 2004

Bubble sort, of course, as invented by my acquaintance Keith, the parking lot attendant, who assures me that it is the most efficient sort algorithm of all time.

"But everybody invents the bubble sort," I told him.

"Yes, but I invented it first."

How can you argue with logic like that?

E. Naeher
Wednesday, August 04, 2004

I have a book at home about mathematical cranks -- it is quite amazing the number of people who are thoroughly convinced that (a) they invented "casting out nines", and (b) they could make millions of dollars by patenting their invention and selling it to... someone. 

Eric Lippert
Wednesday, August 04, 2004

Mr. Lippert, how scriptin' going?

 
Wednesday, August 04, 2004

Sorry, I didn't realize that this was probably a joke until I saw the previous post on sorting.

Kalani
Wednesday, August 04, 2004

+++We had a programmer who to get a random number from 1 to 10 kept would keep calling our random function (which returned 0-65535) until the return value was between 1 and 10+++

we had a similar colleague who wrote this function:

while (1)
{
    x = rand(100);
    if (x < 3) { break; }
}
show_random_picture(x);

 
Wednesday, August 04, 2004

Did you know, your caps lock uses up CPU and ram? So things will go much faster if you capitalize properly.

Anon to protect the guilty
Wednesday, August 04, 2004

Assuming this question is for real....

Donald Knuth wrote some books called The Art of Computer Programming. You want volume 3. I don't know if the second edition has what you are looking for, but I know my first edition does (its an f*ing antique too), with all kinds of fold out graphs showing samples of how to do merge-sorts with tape drives or punch cards. From the bad old days when computers had like 16k of core memory and operating systems used like 2k of memory.

Peter
Wednesday, August 04, 2004

JOE, just a quick idea.

Split the file into N files, each small enough to sort in memory.

Sort each, one at a time.

"Merge" the files -- go through each, pick the top strings, figure out which is the minimum, add to output file, repeat.

If you can't afford to have 10,000 file handles open, merge two at a time.

Alex
Wednesday, August 04, 2004

Sorry, Dennis, I just swooped right down and replied. Missed your answer.

Alex
Wednesday, August 04, 2004

>> if (x < 3) { break; }

Really appalling is that he put the break between braces! ;)

Alex
Wednesday, August 04, 2004

SORT. I JUST WANT TO SORT.

YOU GUYS MAKE THIS STUFF WAY TOO CONFUSING. I THINK YOUR CODING ABILITIES HAVE GONE TO YOUR HEADS.

I HAVE A BUNCH OF DATA IN A REALLY LARGE FILE. WHO CARES IF ITS LINES OF TEXT OR WHATEVER.

I'D USE A DAMNED BUCKET IF I KNEW WHAT IT WAS.

DAMNIT.

SORT! I WANT TO SORT!

WHERES LOTHAR OF THE HILL PEOPLE WHEN YOU NEED HIM TO MAKE THINGS SIMPLER.

WOULD BE SORTER
Wednesday, August 04, 2004

Buy a Mac.

S. Jobs
Wednesday, August 04, 2004

Would be sorter:

You do realize that you've not told us anything useful. What do you mean by sort? Do you want the lines in sorted order? If they're not text, as your last post seems to imply, then what criteria to sort by, as lexical sorting will obviously not work.

What kind of system are you on? If you're on a Unix system, and you're dealing with text files, you can use the sort command. But if your OS is a secret, it's kind of hard for us to know the easiest way.

If you expect an answer, you should stop being an asshole. We don't know everything about your problem, and you won't tell us. You won't be polite, you type in all caps, and generally are acting oafish. Our abilities haven't gone to our heads; rather, our abilities allow us to see a huge number of variations on your problem, and we don't know which applies.

Furthermore, it sounds like a homework question. If it is, do your own damned homework. If not, why the secrecy?

Anon to protect the guilty
Wednesday, August 04, 2004

The right sort for your sort is a good sort. Google "porn" and don't come back.

caltrop
Thursday, August 05, 2004

This would be childsplay in Delphi.

Just me (Sir to you)
Thursday, August 05, 2004

HELP ME SORT.

WOULD BE SORTER
Thursday, August 05, 2004

"HELP ME SORT."

No.

Anon to protect the guilty
Thursday, August 05, 2004

*  Recent Topics

*  Fog Creek Home