I'm saving this cached from google because the original site is gone, and who knows when google will purge its cache.
February 15, 2005
Relational Botox: The neverending key debate.
I've personally endorsed RDBMS extreme makeovers of legacy systems to use synthetic keys over natural keys for quite some time now. However, for one reason or another this debate always pops up, particularly during the early stages of new projects.
I recall reading this somewhere in space:
" many legacy SQL data models use Natural primary keys. A Natural key is a key with business meaning that is an attribute (or attributes) that is unique by virtue of its business semantics. Examples of natural keys might be the Social Insurance Number. The rule is simple: If a candidate key attribute has meaning outside of the database context, it is a Natural key, no matter if it is automatically generated or not."
Many would argue that natural keys almost always cause problems in the long run. A good primary key must be unique, constant, and required (never null or unknown). Very few entity attributes satisfy these requirements, and some that do are not efficiently indexable by SQL databases. They'd also argue that we should be absolutely sure that a candidate key attribute never changes through the data's entire lifetime before promoting it to a primary key.
In terms of uniqueness, a natural key has, by definition, semantics outside of the database. How do you guarantee unique values? Is there a single entity/generator? Will it be the same trustworthy entity for as long as the data lives?
In terms of continuity, as the natural key has semantics in the "real world", will these semantics be constant? Can you guarantee that existing values never change?
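One common answer to both questions is to do both: give the table a synthetic primary key and keep a UNIQUE constraint on the natural candidate key. Here is a minimal sqlite3 sketch of that pattern (the person table, its columns, and the SIN values are all illustrative, not from the original post):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE person (
        id   INTEGER PRIMARY KEY,   -- synthetic key: stable, meaningless
        sin  TEXT NOT NULL UNIQUE,  -- natural candidate key, still enforced unique
        name TEXT NOT NULL
    )
""")
conn.execute("INSERT INTO person (sin, name) VALUES ('123-456-789', 'Alice')")

# Duplicate SINs are still rejected, even though SIN is not the primary key.
try:
    conn.execute("INSERT INTO person (sin, name) VALUES ('123-456-789', 'Bob')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)

# If the issuing agency ever corrects or reissues a SIN, only one row changes;
# nothing that references person.id is touched.
conn.execute("UPDATE person SET sin = '987-654-321' WHERE id = 1")
print(conn.execute("SELECT id, sin FROM person").fetchall())
```

This keeps uniqueness a database-enforced fact while leaving the natural value free to change over the data's lifetime.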
I've blogged this so I remember what to say the next time a DBA sparks this heated discussion with me.
What do you think?
Posted by rchan at February 15, 2005 04:16 PM
I'm a natural key man, personally.
I don't know how many times I've seen data get corrupted because someone was shuffling data around carelessly and didn't notice that the values clause of (2, 38, 44552) should have been (2, 33, 44552). Mistakes like that don't become obvious until the data has intrinsic meaning.
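A small sqlite3 sketch of that failure mode (table names and values are invented for illustration): with arbitrary ids, a mistyped foreign key often points at some other row that happens to exist, so the database accepts it silently; with a natural key, the same typo usually produces a value that exists nowhere and fails loudly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # sqlite enforces FKs only when asked

# Arbitrary-key version: order lines point at opaque ids.
conn.execute("CREATE TABLE product_a (id INTEGER PRIMARY KEY, code TEXT)")
conn.execute("CREATE TABLE line_a (product_id INTEGER REFERENCES product_a(id))")
conn.executemany("INSERT INTO product_a VALUES (?, ?)",
                 [(33, 'WIDGET'), (38, 'GADGET')])
# Mistyping 33 as 38 silently succeeds: both ids exist.
conn.execute("INSERT INTO line_a VALUES (38)")

# Natural-key version: the same kind of typo has nowhere to hide.
conn.execute("CREATE TABLE product_n (code TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE line_n (code TEXT REFERENCES product_n(code))")
conn.execute("INSERT INTO product_n VALUES ('WIDGET')")
try:
    conn.execute("INSERT INTO line_n VALUES ('WIDGTE')")  # typo is caught
except sqlite3.IntegrityError as e:
    print("caught:", e)
```

The arbitrary-id insert goes through with no error; the natural-key insert is rejected by the foreign-key constraint.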
I usually find the divide is between the unix/oracle camp (natural keys) and the ms/sqlserver camp (arbitrary auto-increment keys).
The arbitrary-key camp generally claims that its keys are faster, and that they let you change the values of what would have been the natural key.
I have this to say about that:
- to a computer, characters are numbers. Let the database makers worry about indexing algorithms. That's what they are there for.
- early optimisations are almost always folly. You generally should not worry about making something fast until you know you have it working at all.
- lastly, if a natural key "changes" on you, then either it is a new "thing" (it has changed, after all), or it wasn't a natural key after all.
For instance, consider these tables:
Fruit (id, name)
Sales (timestamp, fruit_id, price)
Now if today an ID of 2 is an apple, you record some sales, and then tomorrow you update the fruit table so the name associated with 2 is orange, what just happened to yesterday's apple sales?
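That anomaly can be run end to end in a few lines of sqlite3, using the two tables above (column ts stands in for timestamp to avoid keyword confusion; the dates and prices are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fruit (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE sales (ts TEXT, "
             "fruit_id INTEGER REFERENCES fruit(id), price REAL)")

conn.execute("INSERT INTO fruit VALUES (2, 'apple')")
conn.execute("INSERT INTO sales VALUES ('2005-02-14', 2, 0.40)")  # apple sold

# Tomorrow someone repurposes id 2.
conn.execute("UPDATE fruit SET name = 'orange' WHERE id = 2")

# Yesterday's apple sale now reads back as an orange sale.
row = conn.execute("""
    SELECT s.ts, f.name, s.price
    FROM sales s JOIN fruit f ON f.id = s.fruit_id
""").fetchone()
print(row)  # ('2005-02-14', 'orange', 0.4)
```

The history quietly rewrites itself: no constraint is violated, yet the recorded sale now describes a fruit that was never sold that day.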
As a long-time developer (22 years), I also note that most of the people in the arbitrary-key camp like writing code to maintain the model's integrity. If you talk to enough of the natural-key folk, you'll see they're interested in a stable, inviolable data model wherein multiple developers can write applications in a variety of languages and technologies and know that they can trust the data.
Posted by: glenn at February 15, 2005 08:29 PM