Wednesday, February 8, 2017

Database Architecture

The most popular database is Excel. Knowledge workers use Excel spreadsheets as a simple, non-relational database more often than as an accounting worksheet. That's OK, because it works. The big boys (software engineers) use relational databases, which is like tying together each worksheet in Excel; the tying together (joining) of database tables is the relational part. And, about a decade ago, the cool kids started using NoSQL databases which operate more or less like a hash table for blazing fast performance sans the relationships.

In my computer science college classes I was taught that each row in a database requires a primary key. Just like each row on your Excel spreadsheet has a unique number: 1, 2, 3; each row in a relational database needs a way to find that row of data, quickly. In academia – in other words, in theory – I was frequently taught to use something like a person's social security number (SSN) as the primary key for a database table. But, in practice, that doesn't work nicely. In theory, there's no difference between theory and practice. In practice there is. One of the key tenets of primary keys is that they must never change unless every single reference to that primary key (the relationship joined by a foreign key) is also updated which is not a trivial task

There are a few issues with using a person's SSN as a primary key. One issue is that it's a long number, much longer than a 1 – 5 digit integer typically used as a primary key, so it takes a database engine longer to compare numbers.

More importantly, what happens if a person's SSN changes? SSNs can change in domestic violence cases, identity theft, when two people are mistakenly issued the same SSN, etc. You may think, since these are such rare cases, that it's not a big deal for a database; but software systems have to be developed to handle every possible case to work effectively. Good software should reflect a true reality, not a theoretical reality. So, to solve this issue, most databases issue a sequence of integers for each row. In the case where databases are in a cluster, each database might tag a digit on the end representing which database server issued a specific primary key. In other words, instead of 1, 2, 3; database server #4 in a cluster might issue 14, 24, 34, etc. as primary keys.

Database Best Practices

Having done a bit of database work at Apple, I've learned some best practices that are worth sharing.

1. Primary keys: Never build intelligence into a primary key – a primary key is simply an artifact of the database and it should represent nothing more than a way to access a row in a database table (i.e., don't use SSN as a primary key). Just like a software developer doesn't care what GUID is assigned to an object in memory, neither should a database designer care about the contents of a primary key. Creating a primary key that's a simple integer is highly efficient since a computer can quickly find and compare numbers (in the case of integers) much faster than a string of nine characters (in the case of SSNs).

[Side note: Do not use a numeric type for defining data fields which won't be used for "math." In other words, credit card numbers, phone numbers, SSNs, etc., should be text types (strings) in a database model. One specific problem I've encountered on a production system is when storing the credit card security code (CSC). Although this code is always numeric, it can contain an important leading zero. When I saw my CSC repeatedly failing, I knew what the issue was and confirmed it by reaching out to the DBA.]

2. Table Names: Database table names should be singular (Employee, Order, Transaction, Statistic etc); they should be named for what each row in the table represents, not the entire collection. The reason is that, typically, there's a one-to-one mapping between a row in a database table and an object used in code. For example, in code, an instance variable referencing an Employee object should represent a single employee from the database while an an instance variable that's plural, such as Employees, should represent a collection of objects such as an array or dictionary.

3. Lookup Tables: A lookup table is a simple static database table that's used to populate a list or collection. For example, a list of countries that your company ships to. Perhaps, your company only ships to the U.S. and Canada. Later, when you start shipping to more countries, how do you update the pop-up list of countries on your website or mobile app? With a lookup table, you simply add another row to the table with the new country that you ship to. Updating the database table is easier than changing your code, recompiling, and deploying. Typically, a look up table also has a column representing a sort order. This is done so the list can be displayed in a specific order with, say, the U.S. listed first, instead of Afghanistan, since most of your customers are located in America.

4. Compound Primary Keys: A database table should have a single primary key for a typical one-to-many relationship to another table (join). Sometimes, you need a many-to-many relationship. For example, a Person table/object joined to an Address. A person might have multiple addresses (homes), and an address might belong to multiple people. In these cases, where a many-to-many relationship is needed then a simple middle table is set up which contains only two columns which are both primary keys. One of the primary keys in the middle table is the primary key of the Person and the other is the primary key of the Address. Technically speaking, the two primary keys of the middle table are propagated foreign keys. One of the middle table's primary keys is a foreign key from the Person table and the other is a foreign key from the Address table.

I'm not aware of a practical case where more than two primary keys are needed in a database table. In cases where I've seen three (or more) primary keys in a database table I realized that the database designer didn't have a good understanding of relational databases. What they typically needed was a single primary key and indices created for their other columns to optimize their lookup speeds.

No comments: