Tuesday, 16 February 2010

SQL NULL Rage

Just don’t allow nulls in your relational database tables, just don’t do it. No. No, not even then, no. No. NO.

Null, it doesn’t mean empty string, it doesn’t mean zero, it doesn’t mean missing relationship, it means undefined, it is a mark for missing data. Why would you want to store a mark indicating that data isn’t there instead of just not storing anything?

At what point does using null become overuse? When a couple of columns in a row contain nulls? When most of them contain nulls? When all of them contain nulls and foreign key columns contain nulls? Why don’t you just fill your entire database to capacity with nulls because you expect it to contain real data at some point in the future?

Yeah that's right, because that would be stupid, wouldn’t it?

E. F. Codd, the daddy of the relational model, suggests in his The Relational Model for Database Management Version 2 (go buy it here) that the null mark should be further refined into two marks, one for information that is applicable but absent (A-marks) and another for inapplicable information (I-marks).

He goes on to recommend that A-marks not be allowed in foreign key columns:

"I strongly recommend that database administrators or users consider very carefully the question of whether to permit or prohibit A-marks in foreign-key columns, and also that they document how and why that decision was made."

Codd then gives us two rules for all databases, the first, don’t allow either mark for a primary key and don’t allow I-marks for foreign keys. And the second rule, don’t have foreign keys point to stuff that’s not there:

"There are two integrity rules that apply to every relational database

  1. Type E, entity integrity. No component of a primary key is allowed to have a missing value of any type. No component of a foreign key is allowed to have an I-marked value (missing-and-inapplicable).
  2. Type R, referential integrity. For each distinct, unmarked foreign-key value in a relational database, there must exist in the database an equal value of a primary key from the same domain. If the foreign key is composite, those components that are themselves foreign keys and un-marked must exist in the database as components of at least one primary-key value drawn from the same domain."

This is all good for maintaining referential integrity, but for me, it doesn’t really address the issues. The quad-value logic seems to me to make the problem worse. How do I do arithmetic and aggregation on these two kinds of null mark? How would these two marks eventually manifest themselves in the code? Would I treat them differently? Why do I even care what kind of null it is, I just don’t want nulls to bite me on the arse later.

We should treat null marks as seriously in all columns as Codd proposes we treat them in relation defining columns. Why not keep partitioning the data until we get rid of the nulls? Normalizing to 5th and 6th normal form if we have to!

Much smarter people than me have thought about this and if you are at all bothered in relational database design then check out The Third Manifesto. Especially this PDF, Missing info without nulls.

From the Missing info without nulls abstract:

""The Third Manifesto", by C.J. Date and Hugh Darwen (3rd edition, Addison-Wesley, 2005), contains a categorical proscription against support for anything like SQL’s NULL, in its blueprint for relational database language design."

Sounds good to me.