Fog Creek Software
Discussion Board




sorting

This is not to request advice; I just wonder if others do this the same way I do.
If you were going to sort a list of items, each associated with an integer, where many of the integers were unique and the largest integer was under, let's say, 10,000, what is the best way?
I am thinking mainly of Java or Perl.

The Real PC
Monday, December 29, 2003

I would outsource it overseas.

Alyosha`
Monday, December 29, 2003

Excel?

--
ee

eclectic_echidna
Monday, December 29, 2003

How are they stored?  What is the goal?  When is the assignment due?


Monday, December 29, 2003

Sounds like a candidate for a radix sort.

Mark Smith
Monday, December 29, 2003

A bubble sort in Java is definitely the way to go.

Don Knuth
Monday, December 29, 2003

Sounds like a candidate for "SELECT id,data FROM myTable ORDER BY id".

Alyosha`
Monday, December 29, 2003

I don't really understand.  By "associated with", do you mean a Hash, or a HashTable or whatnot? 

vince
Monday, December 29, 2003

Let's say it was Perl, and you had a data structure where the keys are integers, and the values are some kind of arrays or whatever, and you wanted to sort by the keys. I used to do it one way, and then I started doing it a different way that I have never seen in a book, and it seems better. I have not done a benchmark comparison or anything, however.

The Real PC
Monday, December 29, 2003

Are there any duplicate values in your sort list? If not, this is tailor-made for a counting sort (sometimes called a pigeonhole sort).  Just create an integer array with a size of (in your case) 10,000, and then push the list of values to be sorted into the array, where each value represents an array element. So an item with a value of 5,125 goes into array element 5,125.

When demonstrating the benefits and drawbacks of JIT compiler optimization as part of my recent book, I created a small .NET program that compared the bubble sort, selection sort, quicksort, and counting sort algorithms. In most cases, the counting sort won easily. As an example, 100K integer items with a maximum item value of 100K were sorted in 0.016 seconds using the default release build configuration.

Mark

Author of "Comprehensive VB .NET Debugging"
http://www.apress.com/book/bookDisplay.html?bID=128

Mark Pearce
Monday, December 29, 2003
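A minimal sketch of the counting-sort idea Mark describes, in Java. This is not his benchmark code; the 10,000 bound and the unique-key assumption come from the thread.

```java
import java.util.Arrays;

// Counting sort for distinct non-negative integer keys below a known maximum.
// A sketch of the approach Mark describes, not his actual .NET benchmark code.
public class CountingSortDemo {
    static int[] countingSort(int[] keys, int maxKey) {
        boolean[] present = new boolean[maxKey + 1];
        for (int k : keys) {
            present[k] = true;          // an item with value 5,125 lands in slot 5,125
        }
        int[] sorted = new int[keys.length];
        int out = 0;
        for (int k = 0; k <= maxKey; k++) {
            if (present[k]) {
                sorted[out++] = k;      // one linear pass reads the slots in order
            }
        }
        return sorted;
    }

    public static void main(String[] args) {
        // prints [3, 42, 5125, 9999]
        System.out.println(Arrays.toString(countingSort(new int[] {5125, 3, 9999, 42}, 10000)));
    }
}
```

Both passes are linear: one over the n input keys, one over the 10,000 slots, which is the a*n + b*10000 cost discussed later in the thread.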

Of course, a 100K array that has a large percentage of unused elements is a huge waste of space...

--
Monday, December 29, 2003

In Java:

(Note: All of the below use a merge sort. Implementing your own sort algorithm, should you want one, is straightforward.)

Option 1:
The class (that type your object instances are) must implement the Comparable interface.
Collections.sort( yourList );

-- or --
Option 2:
Implement a Comparator.

Collections.sort( yourList, new YourComparator() );

-- or --
Option 3:
Use an ordered Collection in conjunction with one of the above. 

Burninator
Monday, December 29, 2003
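Option 2 in concrete form. The `Item` class and its fields here are made up for illustration; nothing in the thread specifies what the objects look like.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical item type: an object tagged with an integer key.
class Item {
    final int key;
    final String data;
    Item(int key, String data) { this.key = key; this.data = data; }
}

public class ComparatorDemo {
    // Option 2: supply a Comparator rather than making Item implement Comparable.
    static void sortByKey(List<Item> items) {
        Collections.sort(items, new Comparator<Item>() {
            public int compare(Item a, Item b) {
                return Integer.compare(a.key, b.key);
            }
        });
    }

    public static void main(String[] args) {
        List<Item> items = new ArrayList<>();
        items.add(new Item(512, "b"));
        items.add(new Item(7, "a"));
        items.add(new Item(9001, "c"));
        sortByKey(items);
        for (Item it : items) System.out.println(it.key + " " + it.data);
    }
}
```

Option 1 would instead put the `compare` logic into `Item` itself as `compareTo`, which is preferable when there is one natural ordering.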

When sorting, bear in mind the distribution of the raw data, and how it will be looked up.  No single algorithm is the best for all cases.  For example, bubble sort which normally runs in O(n^2) can often be faster than an O(nlogn) algorithm like merge sort or quick sort, if the data is almost completely sorted already.

T. Norman
Monday, December 29, 2003
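The nearly-sorted case T. Norman mentions relies on the early-exit variant of bubble sort, which stops as soon as a full pass makes no swaps. A sketch of that variant (mine, not from the thread):

```java
public class EarlyExitBubble {
    // Bubble sort with an early exit: a pass with no swaps means the array
    // is sorted, so already-sorted input costs a single O(n) pass.
    // Returns the number of passes performed, to make the effect visible.
    static int bubbleSort(int[] a) {
        int passes = 0;
        boolean swapped = true;
        while (swapped) {
            swapped = false;
            passes++;
            for (int i = 1; i < a.length; i++) {
                if (a[i - 1] > a[i]) {
                    int t = a[i - 1]; a[i - 1] = a[i]; a[i] = t;
                    swapped = true;
                }
            }
        }
        return passes;
    }

    public static void main(String[] args) {
        // Already-sorted input: one pass, no swaps.
        System.out.println(bubbleSort(new int[] {1, 2, 3, 4, 5})); // prints 1
    }
}
```

Without the `swapped` flag, a bubble sort does its full O(n^2) comparisons regardless of how sorted the input already is.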

Oh, you said Java?

Then just make use of what is available in the language and do Collections.sort().

T. Norman
Monday, December 29, 2003

I agree with Mark's 1n solution given the problem space of under 10,000 elements.

Regarding the wasted space argument, do a second pass to compress the data if needed. Total sort time - 2n instead of 1n. That's as good as it gets.

Don Knuth
Monday, December 29, 2003

This question is poorly specified. "Best" in terms of implementation time, maintainability, object code size, run-time performance, memory usage?

Personally I would use the language's provided sorting capabilities (e.g. "sort" in Perl) until I had a concrete reason to think that was a problem.

John C.
Monday, December 29, 2003

Not enough information.

Do you want to optimize for (a) The best case, (b) the worst case, (c) the 'average' case (if there is such a thing)

What are your constraints? Do you need (1) minimum memory, (2) minimum disk I/O, (3) minimum CPU cycles, or (4) minimum time? For each of (1), (2), (3), and (4), are you primarily concerned with (a), (b), or (c) above?

Does none of it matter? Do you just want simple, clean source code that a future maintenance programmer can actually understand -- as opposed to the fastest solution, best optimized for your choices of (a) through (c) and (1) through (4)? The fastest, best-optimized solution may be a bit too complex for your average grunt maintenance programmer.

Answer these questions, and we might have a chance at answering yours that maybe, just maybe, might make sense for your particular situation.

For anyone who has submitted an answer already: SHAME ON YOU!

The problem has not been sufficiently defined to even hazard an educated guess at this one.

Sgt. Sausage
Monday, December 29, 2003

Disagree. He's sorting by integer keys and there are fewer than 10,000 keys. The only question is whether the keys are unique. If they are, Mark's solution is a great one. It's not like he said 'how do I sort something'; he gave plenty enough detail for a first response like Mark's.

Sgt, don't tell me - you are only able to think about what possible algorithms might work for a task if you have a complete enough specification that there is no thinking involved?

Don Knuth
Monday, December 29, 2003

10,000 is small enough that any nlog(n) algorithm will do in almost all cases.  So just use the already-tuned, already tested facilities that are available with the language.  No sense spending so much time to figure out what is the ultimate fastest performing way if you're only talking about the difference between 1.5 and 1.9 seconds.  Remember, developer performance is often more important than program performance.

It's only if you have to roll your own because it isn't available in the language, or you have requirements for hyper-fast performance which aren't met by the existing languages/libraries, or your data set is too large to hold entirely in memory, that it makes sense to spend the time to choose or create the best algorithm.

NoName
Monday, December 29, 2003

==>Sgt, don't tell me - you are only able to think about what possible algorithms might work for a task if you have a complete enough specification that there is no thinking involved?

No, I won't tell you that. You're making my point *exactly*.

I've thought about it.

I know that there are dozens of sorting algorithms I can readily choose from. All of them with different behaviors.
I thought about what makes one sorting algorithm better than another. I thought about what it means to be a "better" algorithm.

As stated, his spec was "Sort this short list" (paraphrasing) -- without respect to any of my earlier questions.

That's a complete enough spec for me.

My users couldn't care less if I used a radix sort or a bubble sort. I'd take it and run, but I'd certainly *think* about it first, and consider the options based on the answers to the questions in my earlier post.

In case you missed it, my earlier post was an attempt at getting the original poster to actually *think* about the problem first -- like the rest of us do -- and wrap some parameters around a vague "Sort this list" (non)spec.

I thought about it enough to ask (at least some of) the right questions.

The "possible algorithms [that] might work" are irrelevant to me.

We want the "best" algorithm that will work, given the constraints we have to work with. Without clarifying what "best" is, it's only a guess and any suggestion is an answer.

I don't know about you, but I don't want just *any* answer, I want the *best* answer, given the constraints I'm faced with and to do that, in this case, requires making choices, as outlined in the questions in my earlier post.

I've thought about it.

You've made my point. Those of us who have a legacy of success in this line of work are exactly those of us who spend the time to think, and consider the options. Those who don't are doomed to mediocrity ... or worse!

I say again: If you didn't consider at least some of the questions in my earlier post: SHAME ON YOU.

You know better than that. (or you should)

Sgt. Sausage
Tuesday, December 30, 2003

Sgt, in the time it takes to debate this, let's just write the darn code, which is faster than even looking up the parameters for qsort.

Object * SortedObjectReferences[10000];
memset(SortedObjectReferences, 0, 10000 * sizeof(Object *));
num_items_to_sort = UnsortedObjectSet.size();
for (int i = 0; i < num_items_to_sort; i++) {
  key = UnsortedObjectSet.item[i].key;
  assert(key < 10000);
  SortedObjectReferences[i] = & UnsortedObjectSet.item[i];
}
// SortedObjectReferences is now sorted in time 10000.
// can now compact the set if desired

--

Linear time to sort -- much, much faster than your n log n sorts.

1.5 seconds may not be long to you but I need this data set sorted 50,000 times a second.

Don Knuth
Tuesday, December 30, 2003

Actually, it's sorted in a*n, where n < 10000. With compaction, sorting will take a*n + b*10000.

That's way faster than O(n log n), and the memory usage is pretty inconsequential for any situation in which you have 10,000 items to begin with.

Don Knuth
Tuesday, December 30, 2003

Correction -- the placement line above should have been:

SortedObjectReferences[key] = & UnsortedObjectSet.item[i];

Don Knuth
Tuesday, December 30, 2003
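Folding that correction in, the whole placement idea can be sketched in Java (the OP's language); the names and the unique-key assumption are taken from the thread, not from any poster's actual code.

```java
// Place each item at the array slot matching its key, then compact.
// Valid only while keys are unique, non-negative, and below MAX_KEY --
// the thread's working assumption. Names here are hypothetical.
public class PlacementSort {
    static final int MAX_KEY = 10000;

    static int[] sortByKey(int[] keys) {
        Integer[] slots = new Integer[MAX_KEY];
        for (int k : keys) {
            assert k >= 0 && k < MAX_KEY;
            assert slots[k] == null;     // a duplicate key would silently collide
            slots[k] = k;                // index by key, not by loop counter
        }
        int[] out = new int[keys.length];
        int n = 0;
        for (Integer s : slots) {
            if (s != null) out[n++] = s; // compaction pass: total cost a*n + b*MAX_KEY
        }
        return out;
    }
}
```

In a real program the slots would hold object references rather than the bare keys, as in the C version, but the control flow is identical.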

Once upon a time, The Real PC summoned two advisors for a test. He showed them a metal box with two slots in the top, a control knob and a lever. "What do you think this is?" he asked them.

One advisor, Don Knuth, said, "It is a toaster." The king asked, "How would you design an embedded computer for it?" The engineer replied, "Using a 4-bit microcontroller, write a simple program that reads the darkness knob and quantizes its position to one of 16 shades of darkness. The program would use that as the index to a 16-element table of initial timer values. It would turn on the heat and start the timer with the initial value selected from the table. At the end of the time delay, it would turn off the heat and pop up the toast."

The second advisor, Sgt. Sausage, said, "But toasters don't just turn bread into toast. They are also used to toast muffins, bagels and frozen waffles. What you see before you is a breakfast food cooker. As the subjects of your kingdom become more sophisticated, they will demand more capabilities. They will need a breakfast food cooker that can also cook sausage, fry bacon and make scrambled eggs. A toaster that only makes toast will soon be obsolete.

"First," he continued, "create a class of breakfast foods. Specialize this class into subclasses: grains, pork and poultry. The process is repeated with grains divided into toast, muffins, pancakes and waffles; pork is divided into sausages and bacon; poultry divided into scrambled eggs, hardboiled eggs, poached eggs, fried eggs and omelette classes.

"The ham and cheese omelette class is worth special attention because it must inherit characteristics from the pork, dairy and poultry classes. Thus, we see that the problem cannot be properly solved without multiple inheritance. At run time, the program must create the proper object and send a message to the object that says, 'Cook yourself.' The semantics of this message depend on the kind of object, so it means something different to a piece of toast than to scrambled eggs.

"Reviewing the process so far, we see that the analysis phase has revealed that the primary requirement is to cook any kind of breakfast food. In the design phase, we have discovered some derived requirements. Specifically, we need an object-oriented language with multiple inheritance. Of course, users don't want the eggs to get cold while the bacon is frying, so concurrent processing is required as well.

"We must not forget the user interface. The lever that lowers the food lacks versatility and the darkness knobs are confusing. Users won't buy the product unless it has a user-friendly, graphical interface. When the breakfast cooker is plugged in, users should see a cowboy boot on the screen. Users click on it, and the message 'Booting Unix v.8.3' appears on the screen. (Unix 8.3 should be out by the time the product gets to the market.) Users pull down a menu and click what they want to cook.

"All that remains is a hardware platform. An 80386 with 8MB memory, 30MB hard disk and a VGA monitor should do it. With a multitasking, object-oriented language that supports multiple inheritance and has a built-in GUI, writing the program will be a snap. Imagine the difficulty if we had allowed a hardware-first design strategy to lock us into a four-bit microcontroller."

The king thought for a moment and then had Sgt. Sausage beheaded.

And they all lived happily ever after.

The End

Don Knuth
Tuesday, December 30, 2003

==>The king thought for a moment and then had Sgt. Sausage beheaded.

Nice post, but you still don't get it. You probably never will.

You're still illustrating my point.

To boldly suggest your solution:

==>"Using a 4-bit microcontroller, write a simple program that reads the darkness knob and quantizes its position to one of 16 shades of darkness. The program would use that as the index to a 16-element table of initial timer values. It would turn on the heat and start the timer with the initial value selected from the table. At the end of the time delay, it would turn off the heat and pop up the toast."


Is asinine. You've got to ask questions:

What are my concerns? I've chosen (for no apparent reason) to use a particular 4-bit microcontroller. Is this a good choice? Who's going to supply those, and what are their costs? Is my assembly line tooled for this particular component? If not, what will it cost to retool the line/plant for this particular microcontroller? How's it going to interface with the rest of the system? What's the future availability of these microcontrollers? Will my supplier be able to keep supplying them when I ship a gazillion toasters a year? Can someone else make the chip cheaper? What are my costs, and what am I optimizing for?
Am I optimizing for the cheapest possible production costs? Am I maximizing quality, as measured by MTBF? If so, is microcontroller X a better choice than controller Y, even though it's a bit more expensive, because of a lower failure rate? What if I can get an 8-bit chip on the cheap? Sure, I'll be wasting it (I only need 4 bits!), but it's cheaper than the 4-bit. Is the chip easily and cheaply replaced when it fails, or will the user have to scrap the thing and buy a new toaster? ...

I go back to my previous post. It's about choosing the "best" fit for the task at hand, where you've got to think a bit before hand to define exactly what you mean by "best".

At the end of the day, if you just jump in without thinking, you're gonna get burned.

BTW - A toaster is a particularly bad example. How many toasters have you bought that needed to be changed, in some fundamental way, after you brought them home? How many toasters have you needed to change the functionality of after you pulled them out of the box and started using them?

Now, how much code have you written that needed to be modified after it was put into production? One is inherently static, the other inherently dynamic. Apples and oranges.

Sgt. Sausage
Tuesday, December 30, 2003

Umm, I wonder, Sgt. Sausage: how do you ever manage to get up in the morning if you have to consider nth-order causality each time before doing so?

The original question was posed in an odd sort of way: 'I wonder if y'all do it the same way I do' and 'what is the best way?', without saying in the first place which method he uses. Then later he says 'I used to do it one way, and then I started doing it a different way that I have never seen in a book, and it seems better', again without specifying this novel method.

As has been said, for this population of records any method will suffice, and most will achieve reasonably similar results.  Personally, I quite like Alyosha's, but I'd index the table by the id field, which would sort it anyhow.

In Perl you can do the same thing, either with some database or with an index array.  Walk the data ids looking for the next id in sequence, and stick each id into another array along with its position in the first array; if you continue the search from where you left off rather than starting at the beginning each time, you'll get a slight improvement.

Mostly though, I'm curious about Real PC's novel method.

Simon Lucy
Tuesday, December 30, 2003

The first answer is: use the language you already know. How many sort operations must you execute per second, per minute, etc.? Java is fast and offers APIs (and sorting examples) to help implement any sorting you want, but you need a running JVM; you should not use Java if all you need is to sort something. Perl is likely to start a new process each time it is called, but maybe you are more skilled in it.

Evgeny Gesin /Javadesk.com/
Tuesday, December 30, 2003

I did something like what Mark Pearce called a counting sort. However, I didn't remember having read it in an algorithms book, so I thought maybe no one else did it.
I had lists of MIDI events, each one associated with a number of time ticks. There were some duplicates, but not a lot. The highest time value wasn't going to be tremendous, and neither was the number of events.
Rather than describe it in English, this is what I did:

# Bucket each event by its tick count ($event[1]), then flatten the buckets.
for (@{$inevents}) {
    my @event = @{$_};
    push(@{$tmp[$event[1]]}, $_);
}

for (@tmp) {
    push(@out, @$_) if ref;
}

The Real PC
Tuesday, December 30, 2003
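Since the OP mentions reaching for the same trick in Java programs, here is the bucketing idea translated to Java. The event representation (an int pair of data and tick) and all names are my own invention, not the OP's code.

```java
import java.util.ArrayList;
import java.util.List;

// Bucket MIDI-style events by tick, then concatenate the buckets in order.
// Duplicate ticks are fine: each slot holds a list, mirroring the Perl above,
// where @tmp holds array references. Names and layout are illustrative.
public class TickBucketSort {
    // Each event is an int[] of {data, tick}, so e[1] plays the role of $event[1].
    static List<int[]> sortByTick(List<int[]> events, int maxTick) {
        List<List<int[]>> buckets = new ArrayList<>();
        for (int i = 0; i <= maxTick; i++) buckets.add(null);

        for (int[] e : events) {
            int tick = e[1];
            if (buckets.get(tick) == null) buckets.set(tick, new ArrayList<>());
            buckets.get(tick).add(e);     // push(@{$tmp[$event[1]]}, $_)
        }

        List<int[]> out = new ArrayList<>();
        for (List<int[]> b : buckets) {
            if (b != null) out.addAll(b); // push(@out, @$_) if ref
        }
        return out;
    }
}
```

Like the Perl, this is stable: events sharing a tick come out in their original order, which matters for MIDI events that must fire together.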

I always thought Don Knuth was a dead computing legend.  Guess I'm wrong.  Are you really *the* Don Knuth?  If so that's cool.


Tuesday, December 30, 2003

Knuth is dead? So TAOCP will never be finished? Was he killed in the same accident as Robert Jordan?

MugsGame
Tuesday, December 30, 2003

Brilliant story, Don.

Dennis Forbes
Tuesday, December 30, 2003

I'm not much for C, but won't Don Knuth's algorithm fail when keys are not unique?  And didn't the original question only state that "many" keys were unique, and not all of them?

Kevin
Tuesday, December 30, 2003

Credit where due -- I stole the story from one of Philip Greenspun's articles.

So, the data does need to be sorted a thousand times a second. 1.9 seconds to sort is not going to be OK for a real-time program in which anything greater than 1 ms of latency is unacceptable.

But yeah, there are going to be multiple ids with the same index, so the method won't work after all. It probably has to run at the interrupt level, so a hash table with lists hanging off the indices won't work if the lists need the system's memory allocator. Instead, one should preallocate the array plus a pool of 10,000 nodes, and then use the aforementioned method, but allowing for collisions. The order of events within a collision may not matter, so they can just be appended or prepended to each hash list.

Don Knuth
Tuesday, December 30, 2003

Knuth's not dead, and the poster here wasn't Knuth.

If he was, the algorithm would be written in MIX assembly, and half the discussion would be "exercises for the student."

Chris Tavares
Tuesday, December 30, 2003

Chris is right. I'm just A Don, not The Don.

Just A Don
Tuesday, December 30, 2003

Being a bit pedantic here, I can't help but note that the original post omitted information that is necessary to ensure that the counting sort will work. It specifies *integers*, not *positive* or *non-negative* integers. Only later, when the OP points out that he is using time ticks, do we have enough information to surmise that negative integers are not a possibility.

Of course, the way the problem was stated I suspected that the OP was using a counting sort. Still, I stand by my assertion that the original post contained too little information to be meaningful. It's about as useful as asking "What's the best car I can buy for under $30,000?" Does "best" mean fastest? Highest resale value? Most comfortable? Safest? Largest cargo capacity?

Incidentally, I find it amusing that "Don Knuth" is posting on this thread, considering that Knuth is often said to be the source of the quip that "premature optimization is the root of all evil" (though I've read that its true originator was Tony Hoare).

Not Don Knuth
Tuesday, December 30, 2003

I second that!

Brian Kernighan
Tuesday, December 30, 2003

Could you guys please keep it quiet in here?

Bjarne Stroustrup
Tuesday, December 30, 2003

[The only question is are the keys unique. If they are, Mark's solution is a great one.]

Dear "Dr. Knuth,"
I'm sorry, but I must disagree. The keys do not have to be unique, since you can hang any number of list references off of each integer key.
In the past I would have done a numerical sort, but then I thought what a waste that is in a case like this.
Since then, when sorting lists of dates in Java programs, I have started to reach for the Java sort, but then I remembered my nice, quick, simple numerical sort.

The Real PC
Tuesday, December 30, 2003

Real PC, I think both Mark and I stated each time that the solution was contingent on unique ids, something that was not made clear in your first two posts and which you were not available to clarify.

No problem, since the modified solution I posted after you clarified your task will work and is identical to the (Perl?) code you posted.

Don Knuth
Tuesday, December 30, 2003

What's this silliness about unique ids?

Insertion sort doesn't give a damn about unique ids.  It also has the beauty of being short in code and fast in execution.

void insertionSort(double numbers[], int array_size)
{
    int i, j;
    double index;

    for (i = 1; i < array_size; i++) {
        index = numbers[i];
        j = i;
        while ((j > 0) && (numbers[j - 1] > index)) {
            numbers[j] = numbers[j - 1];
            j = j - 1;
        }
        numbers[j] = index;
    }
}

Taken from working code, where non-unique values are not only possible but pretty much assured.  Valid values in my application are from 0 to 167, and there are typically several thousand values.

Clay Dowling
Tuesday, December 30, 2003

Why not.

Scan through the list and see if it's already sorted. If not, randomly change the positions of two numbers.

Repeat.

Should work.

Schmiel the painter
Tuesday, December 30, 2003

Clay's method theoretically runs in O(n^2).

However, if the data has many duplicates, the actual time would be less than that because the duplicates would reduce the number of items that needed to be moved around.

This is why performance requirements are important.  Find out what sort of speed is needed (present and foreseeable future), and reuse whatever is already available if it will satisfy those bounds.  It's a waste of time to roll your own sort routine to reduce the time to less than 1.9 seconds if 5 seconds has already been determined to be acceptable.

T. Norman
Tuesday, December 30, 2003
