Wednesday, February 8, 2017

Database Architecture & RDBMS Best Practices

The most popular database is Excel. Knowledge workers use Excel spreadsheets as simple, non-relational databases more often than as an accounting worksheet. That's OK, because it works. The big boys (software developers) use relational databases, which is like tying together each worksheet in Excel; the tying together (joining) of database tables is the relational part.

In my computer science college classes I was taught that each row in a database requires a primary key. Just like each row on your Excel spreadsheet has a unique number: 1, 2, 3; each row in a relational database needs a way to find that row of data, quickly. In academia – in other words, in theory – I was frequently taught to use something like a person's social security number (SSN) as the primary key for a database table. But, in practice, that doesn't work nicely. In theory, there's no difference between theory and practice. In practice there is.

One of the key tenets of primary keys is that they must never change unless every single reference to that primary key (the relationship joined by a foreign key) is also updated which is not a trivial task

There are a few issues with using a person's SSN as a primary key. One issue is that it's a long number, much longer than a 1 – 5 digit integer typically used as a primary key, so it takes a database engine longer to compare numbers during a lookup.

More importantly, what happens if a person's SSN changes? SSNs can change in domestic violence cases, identity theft, when two people are mistakenly issued the same SSN, etc. You may think, since these are such rare cases, that it's not a big deal for a database; but software systems have to be developed to handle every possible case to work effectively. Good software should reflect a true reality, not a theoretical reality. So, to solve this issue, most databases issue a sequence of integers for each row. In the case where databases are in a cluster, each database might tag a digit on the end representing which database server issued a specific primary key. In other words, instead of 1, 2, 3; database server #4 in a cluster might issue 14, 24, 34, etc. as primary keys to avoid any primary key collisions.


Database Best Practices

Having done a bit of database work at Apple, I've learned some best practices that are worth sharing. A RDBS should start off in third normal form which means only primary/foreign keys are duplicated in separate tables – the rest of the data should be unique to each table. Later, when there's a performance issue, the database can be optimized, as needed, by denormalizing the data.

1. Primary keys: Never build intelligence into a primary key – a primary key is simply an artifact of the database and it should represent nothing more than a way to access a row in a database table (i.e., don't use SSN as a primary key). Just like a software developer doesn't care what GUID is assigned to an object in memory, neither should a database designer care much about the contents of a primary key. Creating a primary key that's a simple integer is highly efficient since a computer can quickly find and compare numbers (in the case of integers) much faster than a string of nine characters (in the case of SSNs).

I once consulted with a retried Army first sergeant at DTIC, around 1999, who made her RDBS primary keys positive and negative integers. The primary keys for active personnel were positive integers and the retired personnel primary keys were negative integers. While this sounds smart, it's actually too clever. This was pre-9/11 so it was rare for a retired person to come back on active duty. The reason this was a bad decision is that it can be dangerous to change primary keys, and then update all related tables' foreign keys.

2. Table Names: Database table names should be singular (Employee, Order, Transaction, Statistic, etc); they should be named for what each row in the table represents, not the entire collection. The reason is that, typically, there's a one-to-one mapping between a row in a database table and an object used in code. For example, in code, an instance variable referencing an Employee object should represent a single employee from the database while an instance variable that's plural, such as Employees, should represent a collection of objects such as an array or dictionary.

3. Lookup Tables: A lookup table is a simple static database table that's used to populate a list or collection. For example, a list of countries that your company ships to. Perhaps, your company only ships to the U.S. and Canada. Later, when you start shipping to more countries, how do you update the pop-up list of countries on your website or mobile app? With a lookup table, you simply add another row to the table with the new country that you ship to. Updating the database table is easier than changing your code, recompiling, and deploying. Typically, a look up table also has a column representing a sort order. This is done so the list can be displayed in a specific order with, say, the U.S. listed first, instead of Afghanistan, since most of your customers are located in America.

4. Compound Primary Keys: A database table should have a single primary key for a typical one-to-many relationship to another table (join). Sometimes, you need a many-to-many relationship. For example, a Person table/object joined to an Address. A person might have multiple addresses (homes), and an address might belong to multiple people. In these cases, where a many-to-many relationship is needed then a simple middle table is set up that contains only two columns which contain two primary keys propagated from the two joining tables. One of the primary keys in the middle table is the primary key of the Person and the other is the primary key of the Address. Technically speaking, the two primary keys of the middle table are propagated foreign keys. One of the middle table's primary keys is a foreign key from the Person table and the other is a foreign key from the Address table.

I'm not aware of a practical case where more than two primary keys are needed in a database table. In cases where I've seen three (or more) primary keys in a database table I realized that the database designer didn't have a good understanding of relational databases. What they typically needed was a single primary key and indices created for their other columns to optimize their lookup speeds.


5. Number vs Varchar: 

Do not use a numeric type for defining data fields which won't be used for calculations (i.e. "math"). In other words, credit card numbers, phone numbers, SSNs, etc., should be string types (i.e. varchars) in a database model. One specific problem I've encountered on a production system is when a developer stored the credit card security code (CSC) as a numeric data type. Although this credit card code is always numeric, it can contain an important leading zero. When I saw my CSC repeatedly failing at checkout on an e-commerce web site, I immediately knew the problem and confirmed it by reaching out to the DBA.

No comments: