Fog Creek Software
Discussion Board




using opaque primary keys in data model

I sometimes encounter databases where every table has an "ID" column that is what I call "opaque" -- i.e. it has no real world meaning, it's just a "pointer" to that row in the table.

From what I know of the relational model, primary keys should be based on actual attributes, not some made-up meaningless value.  On the other hand the "pointer" approach can offer certain conveniences.

I felt rather sure I could find this discussed somewhere here in this forum, but it did not pop up near the top of any of my searches.  Do others see this practice a lot?  Can anyone point me to any lucid discussion of this issue, or just share his opinion?

AMS
Thursday, June 03, 2004

> ... just share his opinion?

I think they're useful. Joe Celko thinks they suck.

Christopher Wells
Thursday, June 03, 2004

Another view is that primary keys should *never* correspond to data in the real world.  That's because real-world data tends to change, more often than you'd think.  And when the real-world data changes, you start to have to update a bunch of related tables, or stick with the old (incorrect) data.

For example, you might feel safe using ISO codes to represent countries.  But what happens when Zaire (ZR) changes its name to Congo (CD)?  You'll either have to do many cascaded updates across your related tables, or continue to use ZR even though it's no longer valid.

The same types of problems happen for many other kinds of real-world values: usernames, email addresses, telephone numbers, or social security numbers.  They may not be permanent, and they may not even be unique in the long-run.

For these reasons I've found it best to use some type of arbitrary unique value as a primary key, either GUIDs or auto-incrementing numbers.  Then you can set up the "real world" values as an alternate key (unique index).  This way, your application's users can modify those values when needed, without impacting the rest of the database.

Ryan
Thursday, June 03, 2004

"primary keys should be based on actual attributes"

Where did you learn that?  It is actually a bad practice because actual attributes DO change.

DBA
Thursday, June 03, 2004

I almost always use "opaque" ID's, unless it's something that's unlikely to ever change, such as an employee ID #.  Why does Celko think they suck?

yet another anon
Thursday, June 03, 2004

"From what I know of the relational model, primary keys should be based on actual attributes, not some made-up meaningless value."

I believe Codd, when he formalized the relational model, intended the primary key to be meaningless.  In the early days, the use of "real" values for primary keys was mostly just a size optimization.

There are many many reasons not to use actual data attributes for primary keys.  All the points that Ryan makes are valid -- data attributes change, your primary key shouldn't.  In unversity, it was drilled into our heads constantly.

Almost Anonymous
Thursday, June 03, 2004

http://www.google.com/search?hl=en&lr=&ie=UTF-8&q=celko+%22artificial+key%22&btnG=Search

Christopher Wells
Thursday, June 03, 2004

I think that what you're calling opaque keys are called surrogate keys.

John Topley (www.johntopley.com)
Thursday, June 03, 2004

The relational model merely requires a key -- any key will do. So, there's no reason why you can't use 'meaningless' keys -- it depends on how you want to model it.

The point of using a country abbreviation as a key is an example of the choice of simplicity over familiarity. The most simple is the integer, but the most familiar might be country name or abbreviation. There are pros and cons to each and ultimately the decision is left to the modeler as dictated by the circumstances.

Strictly using one over the other without consideration of the model is a hallmark of a poor design (and designer).

MR
Thursday, June 03, 2004

One could argue that any table using a surrogate primary key by definition cannot be in 3rd normal form.  This is because the table's columns are truly dependent on the natural key (the column others are recommending you put a unique contstraint on) not the actual primary key (some arbitrary number or guid). 

I'm interested to hear what others think about this?

tool
Thursday, June 03, 2004

Actually make that 2nd normal form...

tool
Thursday, June 03, 2004

I think it would still be in 3rd (or 2nd) normal form -- the key difference (no pun intended) is that there is a one-to-one relationship between the artifical key and the natural key.

Almost Anonymous
Thursday, June 03, 2004

What alternative does Celko propose? I mean, if you want to do something straightforward like track a doctor's patients, you have to come up with a unique ID somehow, and there's not likely anything extrinsic that you can use (names are obviously not unique, etc.)...

John C.
Thursday, June 03, 2004

Celko wasn't able to convince me. It was years ago that I discussed it with him, when I was learning/using SQL for the first time. Therefore I'm sure that I wouldn't be able to do his argument justice. You might find out more about his reasoning by searching the Web and/or Usenet and/or his books for Celko+"artificial key".

Christopher Wells
Thursday, June 03, 2004

"Where did you learn that?"

It's a recollection from a course in relational modeling.  The argument being that if an entity does not have a natural unique identifer (potentially a composite of several attributes) then it is incompletely modeled.

Perhaps it's one of those academic purist ideas that often fails in the real world, where it is not always possible (or perhaps just not always practical) to identify a unique, immutable natural key, for many of the reasons discussed here.

"It is actually a bad practice because actual attributes DO change. "

Yep, I remember a case where gender was part of the primary key on a table and everything worked fine for a long time until someone's gender changed.

Thanks for the discussion.

AMS
Thursday, June 03, 2004

AMS,

"It's a recollection from a course in relational modeling."
"Perhaps it's one of those academic purist ideas..."

Just to let you know -- as I said it has nothing to do with the RM. Your instructor must have either misinterpreted the relational model or (most likely) had no clue what he/she was talking about.

MR
Thursday, June 03, 2004

"The argument being that if an entity does not have a natural unique identifer (potentially a composite of several attributes) then it is incompletely modeled."

I can somewhat agree with this (although I can think of a few exceptons).  For a person table, it's probably not completely modeled if you just have first/last name (people can have the same name.  Maybe if you add SSN number in there then it's reasonably unique.  However, you wouldn't have to have a primary key made up of first name, last name, and SSN -- that's horribly ugly.  (SSN numbers are not unique so you cannot use those alone)

Almost Anonymous
Thursday, June 03, 2004

The only thing that matters in *RELATIONAL* databases is...RELATIONS! The arbitrary data you decide to associate with these relationships/objects is up to you! In a perfect world, where humans were better at associating things with numbers, you would have no need for this "natural" arbitrary data!

anon-y-mous cow-ard
Thursday, June 03, 2004

A few months ago I read an article advocating the use of "natural keys":

http://sqlteam.com/item.asp?ItemID=2599

It generated quite a bit of discussion in the "Comments" section; there was even an appearance by Ron Soukup, who corrected the incorrect impression left by the article about his own views on the subject.

Some of the commentary is juvenile.  But on page 3 a guy going by "quazibubble" posted a long series of posts explaining why random, meaningless primary keys should be used instead of "natural" ones.  It's very thorough (albeit a bit manic) and worth spending the time to read.

Kyralessa
Thursday, June 03, 2004

Even if it weren't awkward, Name+SSN wouldn't work well.

In the general case, an SSN is not guaranteed to be present at all. And not only are people's names far from unique, they're far from immutable. In fact, there's very little about a person that I would feel comfortable characterizing as truly immutable (date of birth possibly being an exception, although I bet there are even cases where that's malleable).

John C.
Thursday, June 03, 2004

Joe Celko's articles:

http://www.intelligententerprise.com/030320/605celko1_1.jhtml
http://www.intelligententerprise.com/030810/613celko1_1.jhtml

http://www.intelligententerprise.com/authors/last_name.jhtml?author_id=1328&lname=Celko&fname=Joe

DataMiner
Thursday, June 03, 2004

I use unique keys that may, or may not have some external representation.  For instance an Order does not need two unique keys, the order number and a unique key to identify it.

Similarly an order line doesn't need anything more to tie it to the order than a foreign key for order number.

For those entities that don't have some external unique referent I'll create the keys in the usual counter type way but I'll include a prefix which is unique to that table.

This is not necessary, it does nothing to improve the validation of data, its purely for my own use.  If you've ever had to play the electronic equivalent of pick up sticks you'll thank whoever did the same thing so you can pattern match grids of data and understand the connections.

Simon Lucy
Thursday, June 03, 2004

I can't think of a reason for someone's birthdate to change, but needing to change a DOB in the database due to data entry error is likely.

MilesArcher
Thursday, June 03, 2004

I'd hazard a guess that most RDBMs are optimised to work best/fastest with numeric primary keys anyhow? I mean, comparing two numeric keys when doing a join has got to be faster than comparing strings?

I don't buy the argument, anyway. Perhaps it would be a teensy bit more elegant if we could avoid 'artificial' keys. but really, they don't do any harm, and are damn useful, being pretty much guaranteed never to need changing, whereas it can be pretty hard to come up with other unique/immutable keys derived from the actual data.

Matt
Thursday, June 03, 2004

Argh!!! The only thing "artificial" is the data that is associated with the "surrogate" key! The numeric primary key is the most "elegant" since, again in a perfect world, your database would consist entirely of non-string data!

anon-y-mous cow-ard
Thursday, June 03, 2004

I have to say, I've never used real data for a PK/FK, not even when the data is defined solely by the system.  Recently I've even gotten away from using auto-incrementing integers and switched to GUID's (since they can be generated anywhere, not just on the server).

Also, I pretty much work under the assumption that no table in my DB is ever "completely modeled."  There is an infinite amount of data you could collect about a 'person' object, and pretty much none of it is guarenteed to be 100% unique.

Joe
Thursday, June 03, 2004

One more thought for the pro-generic-id cause, since it's the end of the day and I'm just killing time before I get to go home :)

If I give the user a grid containing data that partially represents a table (ie, there are more columns and relations than they get in the quick-view grid), and they select an item to drill down into it, it's much more convenient and efficient to pull the complete data set by querying the artificial PK.  The code ends up being much cleaner, since the simpler query didn't require the app to know as much about the DB.

Joe
Thursday, June 03, 2004

BTW, another way that real life may interfere with academic purity - using an artificial PK for a person table will help insulate against privacy concerns - "Patient #12345234" is nice and anonymous.

Regarding the idea that "if you don't have a unique value, you haven't modeled it completely" - that's pedantic rubbish. We've already gone over the difficulties with people - maybe a DNA map would be a good PK? (Unless you have twins...)

How about vehicles? The VIN is *guaranteed* to be unique. Unless a chop shop forges them...

MAC addresses? Fine, except that many modern routers allow you to set the MAC address. Oops.

It just doesn't work - anything people can screw with, they *will* screw with. So an artificial numeric PK is just safer.

Philo

Philo
Thursday, June 03, 2004

Two observations based on 13 years of designing and maintaining relational databases:

1. Surrogate keys are way easier to deal with in the long run. When you don't use surrogate keys, and force yourself to use the natural keys, you wind up with joins that can sometimes require three or four (or more) fields to link two tables. In one db I worked on, four commonly joined tables resulted in 13 fields for the joins. I know about views; that's not the point. At least with surrogate keys, you can make the joins easier, i.e. ID to ID. When you think ahead and name the ID fields the same thing, you can type out the joins with hardly any thought involved.

2. Auto-incremented surrogate keys are the devil. When you use auto-incremented surrogate keys, like IDENTITY fields in MS-Sql, you're in for a world of hurt if something happens to your DB and you have to tie child-tables back to the rows in the original table, but now those rows have changed IDs thanks to awful RESTORE or data-copying restrictions on IDENTITY columns. I know this is easier to deal with now (SET IDENTITY INSERT ON with bulk loads since v6 or so), but it wasn't always.

The greatest thing MS-Sql 2000 added to work around this issue was GUID columns with the NEWID() function. They make nice, mindless ID columns that are still easy to manage.

Troy King
Thursday, June 03, 2004

Those of you that use GUIDs:

I've read that GUIDs can give worse performance than int autoincrementing IDs.  I'm curious what your real-world experience has been.  Have you done any comparison?  Do your joins feel slower?

Kyralessa
Thursday, June 03, 2004

I'd imagine GUIDs are less performant than int's, since there's that much more data to compare.  But the question is, what are your priorities?  Does it matter if your join takes 125ms vs 175ms?  Can you negate that by throwing a slightly faster processor at the DB server?  Not quoting real numbers here, just saying...

For me, the advantages GUIDs offer are more important, since I write mostly line-of-business apps that manage a lot of records, but rarely if ever need to support more than 10 users running queries at the exact same time.

Joe
Thursday, June 03, 2004

Aren't GUIDs useful, when you eventually need to combine data that was generated by more than one database/server installation? Doing this (merging data from more than one DB) using an artifical key that isn't a GUID, you would need to change your "int recordid" keys to "int serverid"+"int recordid".

Christopher Wells
Thursday, June 03, 2004

> I can't think of a reason for someone's birthdate to change

My wife's birthdate changed.  She had no birth certificate (born during a short war in a 3rd world country).  When she immigrated to the US, 3 official documents all had different birthdates (2 or 3 days apart).  We'd pick one every year and celebrate it.  (When she was annoyed she'd accuse me of picking the one that was most convenient around my busy work schedule).

We ignored this for years.  At one point she was denied a driver's license due to the mismatch on the IDs, but we just got the license somewhere else.  After she was naturalized the IRS demanded we clean this up, so we got the social security card & drivers license changed to match the passport.  I think that's the actual date too, but I never can remember it exactly.

Will
Friday, June 04, 2004

Will, it's funny, your wife's situation is exactly the kind of thing I was thinking of. It seemed at first glance that birthdate was the single attribute of people that was guaranteed to be defined at "the point of manufacture" (like a VIN or UPC)... then I realized I had a friend whose birthdate might not be immutable. He was born in Korea and immigrated to the U.S. as a child, and apparently when he was naturalized the authorities here got his birthdate wrong. I don't think he's ever gotten around to having it corrected (probably much more trouble than it's worth).

But I think your wife's story is even cooler. I'm surprised you didn't "get" to celebrate three times every year ;-)

John C.
Friday, June 04, 2004

>My wife's birthdate changed. <

No, she was born only once, but her officially-recorded date of birth changed. This may seem like nit-picking, but it is exactly the point of discussions about "natural keys".

DataMiner
Friday, June 04, 2004

I find the misuse of GUIDs as record keys unfathomable.  Why have an enormous key which is going to be the largest single field, other than free text which wouldn't be used as a key.

For systems with rapid transactions the differences between the GUIDs is going to be 1 or 2 digits, all for the sake of psuedo-randomness.

Reduce your keys to some reasonable population size, why leak performance when its unnecessary?

Simon Lucy
Friday, June 04, 2004

Oh the point about merging data from different servers is well taken.  I solve that with a single character out of the prefix for the key for data that needs to be preserved as to which server it was created on.  That gives me 36 databases without getting into non-alpha characters.

As for the equivalent efficiency of numerical (binary) over alphanumeric keys.  Its all bits.  Using only numerical (as in ints, longs, double words) will give the same span of data for a wider alphanumeric key but will not be human readable in any sense.

I've had to fix too many sets of data, for a variety of reasons, to ever want to use binary keys.

Simon Lucy
Friday, June 04, 2004

Simon L said: Reduce your keys to some reasonable population size, why leak performance when its unnecessary?

I'll pit my database's performance against anyone's on five times the equipment. I have two PII servers that see 2 TB of data per week with 100% churn, and a Quad PIII that sees 6TB per week with 80% churn. None of them average over 20% CPU. The one serious skill I have is SQL Server performance. A coworker and I designed and maintain a database that runs 24/7 and provides services to 1.7 million distinct users, with half a million of them active at any given time (with sessions averaging 7.5 hours). We did this with off-the-shelf hardware and software.

After spending 3 weeks doing extensive SQL Server performance tests, I found that varchars under 70 characters are faster than ints for all key searches used in my databases. I challenge everyone to do their own tests. It was a shock to me to find them faster, because it certainly goes against "common knowlege". This was a very detailed test, covering lots of situations, using real data numbering in the millions of rows. (I did also test smaller table sizes).

In addition, the 36 characters for the GUID isn't that big, and the selectivity is ideal (which is the purpose of the key anway, of course). I pay very close attention to row and page sizes to make sure I'm not wasting time and effort with bad design decisions. I test before and after attempts at improvement. The query design and physical data placement (and many other things) affect performance far more than choice of a single column (within reasonable limits). In addition, I research the hell out of my data when making a decision. Example: What is the correct size and type for a first name field? An email address? A url? When choosing the type for a given usage pattern, is the CPU time saved looking up fixed-width offsets rather than variable offsets (i.e. char vs varchar) undone by the additional disk I/O and network transmission for all those spaces? I made a list of about 80 questions like this when doing my research and did lots of tests to find out.

On a daily basis, I fight more more for my own coding and maintenance convenience and performance. GUIDs and NEWID() save me quite a bit of time and effort, and do not impact the performance of my databases.

Troy King
Friday, June 04, 2004

That's interesting, any idea as to why the character based keys were faster? on the face of it I can't think of any reason why they would be, apart from the case for numeric/binary keys just being very poorly optimised?

Those really long GUID type fields tend to annoy me, although I can where they're useful, for the relatively small database tables I'm dealing with I'd far rather be able to look something up easily by an ID number of 1523, than some 32 character hexadecimal string or whatever. Especially if you're going to present the ID to the user in any way, say as a quick reference to bring up that record again.

Also which looks better

http://site.com/article/1234/about_foo

http://site.com/article/32AC-143D-FFE6-0234-DE46-AC5D

I hate web apps *cough* coldfusion *cough* that throw those kind of IDs around all the time.

Matt
Saturday, June 05, 2004

Matt, I've got tables where I'd have overflowed regular int columns by now with an autonumber field (if autonumbers didn't present other problems already mentioned). I don't actually keep 2 billion+ rows in a given table, but some tables have had more rows that that go through them.

As for presenting it to the user, I'd never make them have to know or pay attention to a GUID. Surrogate keys are for housekeeping purposes... what I give to the user may use a GUID one way or the other, but I'd certainly never make them type it in or have to look at it.

Troy King
Saturday, June 05, 2004

Matt, it just occurred to me your question is a red herring. You said "which looks better", and showed the urls
http://site.com/article/1234/about_foo
http://site.com/article/32AC-143D-FFE6-0234-DE46-AC5D
It doesn't matter which looks better. The correct link would be one with descriptive text. The url is an implementation detail. A link should tell where you're about to navigate, so it doesn't matter what the url itself is and whether it contains a guid or not.

Troy King
Sunday, June 06, 2004

Dataminer (says my wife's birthdate didn't change):

Eh - I guess it's semantics.  Sure, the day she was born (as some absolute truth) was always the same.  But the database field "birthdate" was recorded differently in multiple databases and was eventually modified. 

Incidentally, we also had a similar situation with the last name (a common Arabic name which was spelled differently on different IDs) The INS got to be the arbiter of this one when the passport was issued.  Here there was definitely no right answer as they were all legimitate transliterations.

My point is that recorded knowledge of almost every type is subject to change in special situations.

Will
Sunday, June 06, 2004

*  Recent Topics

*  Fog Creek Home