Fog Creek Software
Discussion Board




Lookup tables vs. explicit values?

I'm building a web app that is essentially a specialized registry of people. Simply put, there's table 'person' that has fields like 'skin_tone', 'hair_color', etc, each with a known, small number of valid options.

I prefer to store options for each field in their own table, and tie that table by foreigh key to the 'person' table.

The client suggests that it's a better design to not store information about valid values in the DB, and instead code it into the application. He argues that my approach hurts scalability (ie, more queries to the DB), and maintainability (mixing up the app layer and data layer).

I think that both approaches have a right to exist, but mine is more appropriate for larger apps (this one is not small, and my example is greatly simpified).

Which of the approaches do you choose, and when?

Perl Casuality
Wednesday, June 23, 2004

I'm strongly in favour of storing the information in the database. If well-coded, then a new option can be added simply by inserting it in the database.

Scalability issues:
* You would need to be making a heck of a lot of database accesses on simple lookup tables to degrade performace. Database degradation comes from complex queries with lots of joins and sub-queries, not from simply "SELECT ID, Value FROM foobar ORDER BY ID" type queries.
* Make sure your database is indexed correctly. If you want the data from your lookup tables ordered by the primary key, then they will blindingly fast in every database engine I have used. If you want the results ordered by value or another column, add an index to that column.
* An advanced technique to be aware of in extreme circumstances is caching. You can cache the database information in memory to reduce database accesses. I'm not a Perl user so I don't know the specific techniques available to yo.u
* Some database engines perform good caching automatically.

However if the client says to do it his way, don't bite the hand that feeds you!

Herr Herr
Wednesday, June 23, 2004

"However if the client says to do it his way, don't bite the hand that feeds you! "

Wrong!

You are the professional developer. It is your responsability to do things the best way possible.

You wouldn't tell your doctor how to do surgery, would you?

In the long term, when things go unmaintenable because they were built the way the client wanted, and not the correct way, you are going to be the one to blame, anyway.

.NET Developer
Wednesday, June 23, 2004

The killer to arguments about scalability and performance is to ask:

"Oh really? How many less clients will it be able to support if we use lookup tables vs explicit values?"

Since your client will have no answer to that you can then proceed to use lookup tables.

Matthew Lock
Wednesday, June 23, 2004

Do what Herr Herr said, but make sure the decision is documented (i.e. make him sign something to say that you are doing it that way because that is what he wants).


Wednesday, June 23, 2004

So do they propose that you store enumerated values (int's) and the only key to these values is somewhere in the code?


m's database bias rule #1: The Data lives longer than the code. Be good to the data.


Sounds like a learning opportunity. Ask them what other pearls of wisdom they have for system scalibility. :)

m
Wednesday, June 23, 2004

No, they propose to just store string values (color names, for example) in VARCHAR fields.

Perl Casuality
Wednesday, June 23, 2004

Lets say you use HAIR_COLOUR : VARCHAR(20) instead of HAIR_COLOUR : INT. 

The Int is probably 4 bytes long.  Thats all you need to store and pass around.  The varchar is 20 bytes long.  If you have 1000 records that 20,000 bytes, and that is just one field.  Double that if your using unicode.

This isn't just bytes on the disk, these are also bytes being passed around on the network.  Not a problem if your server isn't under strain, but this is a scalablity argument.

With a look up table you only have to store and pass around the 4 byte integers.  You can use an even narrower type if you really have to. 

Any decent DBMS will optimise the join away so it won't cost anything.

Ged Byrne
Wednesday, June 23, 2004

Are you saying the client just wants to put 'Blonde' in the hair_color field in the data table?

That's easy to fix - just get a book on database normalisation and teach him how databases are normalized. Point out that there's a reason most mature developers create databases in the third normal form.

More to the point - if he wants a drop down in the application, what if your application has "blonde" while another application touching the database has "blond"? Now reporting is problematic. (Same problem if you let users type in directly)

Finally, if you need to search by value, integer indexes are faster and more efficient than varchar indexes.

Philo

Philo
Wednesday, June 23, 2004

About 2 years ago I ported an Access application to PHP and Postgresql. The Access database contained 10 years worth of historical market data, where there was a text field to record the daily condition of the market for each product. (good, poor, average etc) Over 10 years each of those words had been spelt in almost every possible misspelling in existance, for example:

good, godo, goop, gooo

Etc. Made reporting impossible until everyone of those 500,000 entries was normalised.

Matthew Lock
Wednesday, June 23, 2004

You must not work in the financial sector much.  "Goop" is a PERFECTLY VALID condition for the market to be in.  ;-)

muppet is now from madebymonkeys.net
Wednesday, June 23, 2004

Third normal form is your friend

Tapiwa
Wednesday, June 23, 2004

I would have thought that it would be better to have a fully normalized table structure than an atrificial data/app split.

The table structure can be exported wholesale. I see no lack of portability in having the look-up tables correct. With SQL statements and views it would be a different matter.

Stephen Jones
Wednesday, June 23, 2004

Using a lookup table is definately the way to go.  It sounds like you'll be using these values to populate a set of drop-down boxes on the interface.  So ask the client if they want future changes (blonde to blond and add red) to take 30 minutes or 3 days (more like 5 minutes or three days - oh and don't forget the re-deployment of the interface to all users).

Also, if internationalization is a concern/possibility you could set up the lookup to have (id int, english_color varchar(20), german_color varchar(20)...) and point to different columns for the drop down without changing the data that's stored on each record.

The decision to put this data locally and repetitively in each record is almost always a bad idea, and if you need to drive an interface with drop-downs, it is a bad idea (select distinct hair_color from people vs select hair_color from hair_colors ... which is faster/more scalable).

Lou
Wednesday, June 23, 2004

Thanks to everyone for the suggestions. It's interesting that many suggested storing INT, which in fact isn't what I meant.

Our argument isn't INT vs. VARCHAR, but rather application-supplied VARCHAR (them) vs. database constrained VARCHAR (me).

Them:

table person (
hair_color VARCHAR(30)
);
And that's it. Stuffing proper values in is application responsibility.

Me:
table person (
hair_color VARCHAR(30),
foreign key hair_color references hair_color_lookup (hair_color)
);

There're some more fields in hair_color_lookup table (for example, the priority field that defines sequence in which to display values in a drop-down box).

This therefore begs the next question: is anything seriously wrong with the second setup, and I should use INT as foreign key? I dislike this idea because of the necessity to do a lot of JOINs every time I select from 'person'. I think that speed differences of INT vs. VARCHAR(30) are negligeble when the fields are indexed, are they?

Perl Casuality
Wednesday, June 23, 2004

We've found that most "lookup" values are kept in SQL Server's memory anyway -- there's no disk I/O involved (DB traces show 0 milliseconds).  Your only cost is network time, and most of the time you'll be doing a join on them anyway, so the time spent on networking is a given.

The other poster has a good point about I18N -- you can either put multi-locale values in a table, or you can write satellite assemblies -- both have a cost in development time, you just have to decide which is easier for your organization to maintain.

chiph
Wednesday, June 23, 2004

You method is the only one that will produce normalized table structures.

Stephen Jones
Wednesday, June 23, 2004

Being a Perl programmer for a few years now, I would be sorely tempted to do a hybrid approach.

That would be:  Yes, have the additional tables of allowed values.  Do a simple query on program start up to load these values into a Perl hash for each set of allowed values (ie one for hair_color, one for something else).  An alternative is to create and load the hash on the first query that uses it.

Then, use the 'hash' (sorted as needed -- Perl does this stuff in a single command) to display drop-downs.  Use the selected options as SQL query parameters against the table.

If they want to add an allowed value, only then do you need to hit the allowed values tables in the database with an insert to add it.  This lets you qualify the additions as you go.

You get the speed of in-memory parameter look-up, without the performance hit of large JOIN sets.

AllanL5
Wednesday, June 23, 2004

Oh, BTW, I think the 'INT' suggestions come out of your 'foreign key' statement.  If you have a small set of options, then using an 'int' as the foreign key to the lookup table makes sense.

You then organize the lookup table so each 'hair_color' has an int 'key' and a VARCHAR 'value'.

BTW again, if you store both the string 'hair_color' and the int index into the 'hair_color_table' where you store the string 'hair_color' again, you are adding redundancy, and duplicate storage of strings.  That's another reason people say "just store the INT index into the 'hair_color_table'."  This is what 'normalization' is all about, after all.

Redundancy is not completely bad though, as it can speed up queries.  However, once it rears its ugly head, you need to take steps to insure redundant things continue to agree with each other, ALWAYS.

AllanL5
Wednesday, June 23, 2004

>You are the professional developer. It is your >responsability to do things the best way possible.

Always get a requirement. What is too slow? Only
then can you make design tradefoffs. Arguing about
this crap is a waste of time.

Always benchmark. Do it both ways and look at the
performance. That way you will know and you don't
have to argue tradefoffs. If it is too slow try indexing,
assuming you can define this, then try server side
caching, then client side caching, then embedding.

son of parnas
Wednesday, June 23, 2004

"Me:
table person (
hair_color VARCHAR(30),
foreign key hair_color references hair_color_lookup (hair_color)
);
"

I'd modify your design to only use a single lookup table, but add another column to identify the code type (hair_color, skin_tone), etc. That way you can add attributes without the need for a new table everytime. 

Tom H
Wednesday, June 23, 2004

Philo (and others indicating a problem with "normalization"):

==>That's easy to fix - just get a book on database normalisation and teach him how databases are normalized. Point out that there's a reason most mature developers create databases in the third normal form.

Maybe *you* should get those books too -- and read them! <grin>

I fail to see how the value "blonde" -vs- some integer value (say 1) in a table makes it any less normalized.

the record (row, tuple, whatever) :

EMPLOYEE (EMP_ID, FNAME, LNAME, HAIR_COLOR)

        101, Jane, Doe, blonde

is just as normalized as

        101, Jane, Doe, 1

QUESTION: By what definition of normalization is the second record any more "normalized" than the first?

If the second is, as you seem to think, more normalized -- then why not take it to the extreme(?):

EMPLOYEE

    101, 3, 6, 1


Where you have a FName and LName lookup table. Surely under your definition of "normalization" this is a better way to go (not!)

It's more "normalized" isn't it? (not)


It's been about 15 years since my university days, but here's what I remember:

1NF: requires that all column values in a table are atomic
2NF: satisfies 1NF and every non-key attribute is irreducibly dependent on the primary key.
3NF: satisfies 2NF and all non-key attributes are mutually independent.

Seems to me like both of my rows above are in 3NF. Maybe I'm misremembering or not understanding your "normalization" argument.

The normalization argument, while a good and valid point, is, as I see it,  irrelevant to this discussion.

FYI -- I'm generally a fan of the "lookup table" approach, but the normalization argument is not relevant to the title of and discussion within this thread.

I use it for optimization purposes. I typically deal with large databases with hundreds of millions of records. The space savings is huge. I find that it's, in most cases, actually quicker to do the lookup too (even though the DB engine has to do more work), because the integer data is much smaller, leading to data and index pages being much smaller, resulting in less physical I/O when pulling those pages off the disk. YMMV

Sgt. Sausage
Wednesday, June 23, 2004

==>I'd modify your design to only use a single lookup table, but add another column to identify the code type (hair_color, skin_tone), etc. That way you can add attributes without the need for a new table everytime.

I've done this also, within dozens of apps over the years, but often wondered about it -- the "correctness" of it.

here's a few breif discussions:

http://www.tdan.com/i016ht03.htm

Your thoughts or comments on this?

Sgt. Sausage
Wednesday, June 23, 2004

"Your thoughts or comments on this"

The article makes a good point, on the other hand the app I spend most time supporting has over 130 different types of codes. Do I really want another 130 tables in the schema? No. Do I want to update the DB schema every time I need another code? No. Do I give up some purity by cheating a little here? Yes.

Tom H
Wednesday, June 23, 2004

If you ever want to access the lookup values from outside the application - for example a report - you'll appreciate having the values in the database so that you can join directly to them.

Big B
Wednesday, June 23, 2004

Was written:

>the record (row, tuple, whatever) :

>EMPLOYEE (EMP_ID, FNAME, LNAME, HAIR_COLOR)

>        101, Jane, Doe, blonde

>is just as normalized as

>        101, Jane, Doe, 1

>QUESTION: By what definition of normalization is the
>second record any more "normalized" than the first?

I think you missed the date factor. Hair color changes over time so the table should be like this

EMPLOYEE (EMP_ID, FNAME, LNAME)

employeeHairTable
EMP_ID,HairId, hairDate

theHairTable
HairId,HairDescription

I really think that's what they were talking about.

me
Wednesday, June 23, 2004

I think the right answer is 'both'.

Either:

Put the lookup tables into the database, and cache them in memory-mapped files maintained by a daemon.

Or:

Put the full values in the database and constrain them with (in Oracle they're called) check constraints.

The advantage of the first one is that you can get an express list of values to show in a user interface.

James
Wednesday, June 23, 2004

Normalise, nor-norrr-malise-alise

Matt
Wednesday, June 23, 2004

>>If the second is, as you seem to think, more normalized -- then why not take it to the extreme(?):

Names are free text; hair color is (presumably) chosen from a finite set of enumerated values.

John C.
Wednesday, June 23, 2004

James has the right idea. You should have separate lookup tables with defined relationships for the sole purpose of ensuring relational integrity (ensuring that someone doesn't put in a misspelled hair colour, and so on), and given that this should generally be a fixed list you can simply load and cache it on startup (I believe with SQL Server 2005 and .NET, and I mention this as an aside as I realize these aren't your platform, you'll be able to set a cache dependency on a table, so if the table changes it'll invalidate the cache and reload on the next access. If the same front end is the only method the back end data gets changed, obviously you could manually invalidate the cache whenever you do an update).

People often use artificial keys, such as an autonumber integer, for space conservation reasons. Many advocate against that because it's unnecessarily complex and unintuitive to maintain ("What is hair colour 77 again?" Even better use guids and ask yourself "What was {6DCE77E6-883F-455a-9020-084F77632E78}
again?") and instead do what you are advocating (having a natural key, such as the fulltext description of the hair colour, with a lookup table existing purely as a existence relational integrity check). If your attributes are overly verbose (or if you plan on them being multilingual), it's worthwhile splitting the difference -- have a 4 char acronym for each hair colour (BLND, BRWN, BRNT, RED , and so on), obviously not all will be completely intuitive, and use that as the relationship. If the column is created with case-insensitivity it's just as efficient as an int to query upon, but it's much more programmer friendly.

Dennis Forbes
Wednesday, June 23, 2004

Sgt Sausage appears to be suffering under a delusion here. Nobody is saying that using a foreign key instead of the value makes any difference to normalization. There is considerable argument in the Acess world whether you should link to foreign keys or directly to the value.

The second point is that people are confusing validation with normalization. Data validation is business logic but that is not the same as normalization. If you have a SS number you may have rules as to how many alphanumeric characters can be entered and in what format but the actual field is an number field, not a look-up table going to all possible numbers.

If you have a limited number of entries that are going to be oft-repeated you want to keep them in a separate table and have a look-up field. Imagine you were dealing with a list of products instead; nobody would be suggesting you didn't have a separate look-up table for products. The same for countries (even though the list will change over time).

Stephen Jones
Thursday, June 24, 2004

==> Sgt Sausage appears to be suffering under a delusion here. Nobody is saying that using a foreign key instead of the value makes any difference to normalization

Excuse me? That's exactly what Philo was saying and exactly who I was responding to:

==>Are you saying the client just wants to put 'Blonde' in the hair_color field in the data table? That's easy to fix - just get a book on database normalisation and teach him how databases are normalized

Doesn't look delusional to me! Looks like Philo's saying that 'Blonde' in the field is a no-no and he recommends a book on normalisation as the correction to that no-no.  Seems he's implying that using 'Blonde' rather than the foreign key is violating the rules of normalization in some way. It's not.


==> The second point is that people are confusing validation with normalization. Data validation is business logic but that is not the same as normalization.

That's *exactly* the point I was trying to get across, although you said it more succinctly than I.  Thanks for the assist.

Regardless of whether I'm delusional or not (I'll admit, sometimes I am <vbgrin>), you've made my point.

So we're in agreement!

Sgt. Sausage
Thursday, June 24, 2004

*  Recent Topics

*  Fog Creek Home