Fog Creek Software
Discussion Board




A sorting challenge

I have the frequent need to sort alphanumeric data in a sensible format and standard sort methods do not achieve this.

When we sort numbers we treat the number as right justified and sort, so 2 comes before 10 (0...002 is less than 0...010)

When we sort strings we treat the string as left justified and sort (2 comes before 10).

So, let's say I have a mixed alphanumeric string representing, say, a change record id and sort it.. I get the following result:

CR-1
CR-100340534
CR-2

Now this is not useful. I could pad all numbers to be a certain size but CR-000000001 is not as visually fetching as CR-1

In the same list I could get other format ids; e.g. BUG1, BUG2 etc. so I don't want to be applying some logic based on fixed formatting.

I can further complicate this by wanting to sort, say, version or release numbers. So now I want:

1.0.1
2.0a
10.2
10.2.2
10.2.2a
10.2.2b
10.2.2.1

The only way I can think is to break the string down into a number of parts at each change of character type.. for example 2.0.14 becomes an array of typed parts:
(number) 2
(alpha) .
(number) 0
(alpha) a

and then I can try comparing two values on a part by part basis, using an appropriate method, and if comparing a number against an alpha that the alpha always comes first.

Anyone come across a standard algorithm for this sort of thing?

gwyn
Saturday, February 07, 2004

I dunno, treat each period or other puctuation as a delimiter and then a change of data type as if it was delimited also.

10.2.3a.4  would become 10,2,3,a,4

I'm not an Algorithm master by any means but I'm first this time.  Actually I came across this kind of things a long time ago when I was doing Bus Parts.  I sorted them wrong and messed up the mechanics.  If only I had read the book first!
Unfortunately it was a card system in books, well before I had a computer.  <sigh>

Greg Kellerman
Saturday, February 07, 2004

separate the data used for sorting and the data used for display.

FullNameRequired
Saturday, February 07, 2004

I think what FullNameRequired means is (here goes the contrived example):

You have data in the form number.shortstring.number.number. E.g.:

12.NYC.100.1

what you do is construct an integer value from it, in this case, the 12 goes into the uppermost byte, NYC somehow goes into the next byte, etc.

So now you can sort based on numbers instead of having a complicated "compare" function for your data.

Of course, cramming an arbitrary string into an integer of limited size is a different story. Perhaps you have a finite dictionary, this would work fine.

(Sorry if I'm being amateurish, probably am.)

Alex.ro
Saturday, February 07, 2004

I thought he meant copy the field into another one, then add all the leading zero's you want in the 2nd one that nobody ever sees but you sort on.

record_id | record_id_sort
1          | 00000000001
123123 | 00000123123

www.MarkTAW.com
Saturday, February 07, 2004

If the data is stored in a database and you can do your sorting in a SQL query, you might get some ideas from this previous topic:
http://discuss.fogcreek.com/joelonsoftware/default.asp?cmd=show&ixPost=97221
that discussed sorting data that is mixed alpha and numeric.

Philip Dickerson
Saturday, February 07, 2004

You could also store the data internally as, e.g. "CR-000000001" and use a regex or function to convert it to "CR-1" for display purposes.  (Just trim any zeroes that follow a nonnumeric character.)  That way you don't have store two separate values or worry about keeping the two values in sync.

For more complicated numbers like "10.2.2a," you could split the value into array of strings, with one string for each alpha sequence or numeric sequence (treating the punctuation as a delimiter.)  For example, "10.2.2a" would map to an array with four elements, {"10", "2", 2", "a"}.  Then, have a function that determines which of two values has the lower sort order by stepping through the elements in the arrays. 

For example, {"1", "0", "1"} comes before {"2", "0", "a"} based on the first elements in the array (one comes before two.)  Similarly, {"10", "2", "2"} comes before {"10", "2", "2", "1"} because the first three elements are identical, but the second value has a fourth element.

Conceptually, it's similar to the traditional problem of sorting by lastname/firstname, where you first compare the lastnames of two Person objects, and then compare the firstnames if necessary (if the lastnames are identical.)

If you're using a Dotnet language, you could encapsulate this sort-order logic by using the IComparable interface.  (The Comparable interface in Java looks similar.)  Here's an example in C#:

http://builder.com.com/5100-6373-1058827.html

Robert Jacobson
Sunday, February 08, 2004

Separate the string into its parts and do a bucket sort, e.g. a radix sort.

Nick
Sunday, February 08, 2004

Thanks for your comments guys.

I'll have a look at the Radix sort..

It is .NET and I'm already using the IComparable interface for other things.

And no I can't separate the display and stored data as it will often be something supplied as a string by the user (which has some implicit meaning to the user, but not to the computer)

gwyn
Sunday, February 08, 2004

"(which has some implicit meaning to the user, but not to the computer) "

I'm sorry but doesn't the computer need to know the meaning if it is to sort it properly??  Or do you what some generic number-text parser / sorter thing?

DJ
Sunday, February 08, 2004

Umm 'visually fetching' has nothing to do with sorting.  The rule is that for numeric portions of alphanumeric keys which are to be treated as numbers the field width must be constant.  In other words you have to pad the numeric field.

I wouldn't worry about compressing zeroes out of such a field if it really is a key then its going to be a serial number of some kind and people would be more confused by variable length serial numbers than they would some kind of display protocol.

If its a compound key then you pad in creating the key (and have to pad when searching and so on), this isn't  that unusual either.

Simon Lucy
Monday, February 09, 2004

As for version number sorting, numbers preceed characters in ASCII (and its equivalents), so if I wanted to create a single string and sort on it taking into account letter versions I'd descend to the next level and keep the width of the entire string fixed.

10.2.0.0
10.2.0.a  (for the first letter release)
10.2.1.0  (for the next dot version)
10.2.9.a
10.3.0.0

and so on...

Simon Lucy
Monday, February 09, 2004

"I'm sorry but doesn't the computer need to know the meaning if it is to sort it properly??  Or do you what some generic number-text parser / sorter thing?"

No the computer doesn't need to know and, yes, that's exactly what I want.

If I give you the list:
EUR-1
PRR-134
ALIS 4.36.4.1a
PRR-120
EUR-10
EUR-2
ALIS 4.36.4.1

you can proably do a really good job of sorting this, without needing to know what any of it means. And that's what I want

And Simon, your use of extra periods may fix the sorting of the final alpha(s) but it still doesn't address sorting 2.0.0 before 10.0.0

Gwyn
Monday, February 09, 2004

Yes it does, if you know its going to go through two digits of release then you start with the 0.

01.01.00
10.00.00

And so on.

Simon Lucy
Monday, February 09, 2004

Oh, if these strings are from dissimilar systems so you have no control over them then I would parse the numerics from the strings, applying them in the order they appear in the string and hash the results from left to right, probably dividing by 13 and then using the result as the key.

You may get collisions and you should leftmost numerics to get the same field width but it will give you a reasonable sort.

Simon Lucy
Monday, February 09, 2004

A simple implementation in Python:

import re
separator_re = re.compile("[.,:_/ -]+")

def split(s):
  """Split s into fields separated by strings of separator characters. Shuffle the fields so that actual fields come first, followed by the separating strings."""
  fields = separator_re.split(s)
  first,last = [],[]
  a,b = first,last
  for field in fields:
    a.append(field)
    a,b = b,a
  return first + last

def compare_field(a,b):
  """Return <,=,> 0 depending on the relative ordering of a,b. If a,b are both digit strings, compare first by numerical value, falling back to character-for-character comparison if they're numerically equal. If exactly one is a digit string, it compares earlier than the other."""
  try: p = int(a)
  except: p = None
  try: q = int(b)
  except: q = None
  if p is None and q is None: return cmp(a,b)
  if p is None: return -1
  if q is None: return +1
  return p-q or cmp(a,b)

def heuristic_compare(a,b):
  """Given two strings, split them into fields and compare numerically or string-wise, as seems appropriate."""
  a_fields = split(a)
  b_fields = split(b)
  for x,y in map(None,a_fields,b_fields):
    if x is None: return -1
    if y is None: return +1
    delta = compare_field(x,y)
    if delta: return delta
  # If we get here, then I think we must have a==b,
  # but let's be paranoid:
  return cmp(a,b)

sorts the list above into the order I think you want. However, I suspect it's still not quite what you have in mind, because it will sort "26a" before "5a". It would be possible to tweak compare_field so that it detects alternating digit and non-digit parts or something, or so that it notices if a field is of the form <digits><non-digits> ... but here we're getting *very* heuristic, and frankly it all seems rather wrong-headed to me.

The right way to deal with this sort of situation is to understand where the things being sorted come from. There simply *isn't* a reliable way for the computer to tell when "5kr1pt" is a serial number to be treated as an opaque lump and sorted as a bag o'bytes, when it's a version code (5/kr/1/pt) to be sorted chunk-by-chunk, and when it's 1337-speak for "script" and should therefore trigger the klaxon and the big neon sign saying "loser".

The more guesswork your algorithm does, the more fragile it will be and the more incomprehensible its failures will be. Sometimes that's a good tradeoff, but  there needs to be a good *reason* why it is. In this case, no reason has been given so far...

Gareth McCaughan
Monday, February 09, 2004

*  Recent Topics

*  Fog Creek Home