Wednesday, February 8, 2017

Database Architecture

The most popular database is Excel. Knowledge workers use Excel spreadsheets as a simple, non-relational database more often than as an accounting worksheet. That's OK, because it works. The big boys (software engineers) use relational databases, which is like tying together each worksheet in Excel; the tying together of database tables is the relational part. And, about a decade ago, the cool kids started using NoSQL databases which operate more or less like a hash table for blazing fast performance sans the relationships.

In my computer science college classes I was taught that each row in a database requires a primary key. Just like each row on your Excel spreadsheet has a unique number: 1, 2, 3; each row in a relational database needs a way to find that row of data, quickly. In academia – in other words, in theory – I was frequently taught to use something like a person's social security number (SSN) as the primary key for a database table. But, in practice, that doesn't work nicely. In theory, there's no difference between theory and practice. In practice there is. One of the key tenets of primary keys is that they must never change unless every single reference to that primary key (relationship) is also updated which is not a trivial task

There are a few issues with using a person's SSN as a primary key. One issue is that it's a long number, much longer than a 1 – 5 digit integer typically used as a primary key, so it takes a database engine longer to compare numbers.

More importantly, what happens if a person's SSN changes? SSNs can change in domestic violence cases, identity theft, when two people are mistakenly issued the same SSN, etc. You may think, since these are such rare cases, that it's not a big deal for a database; but software systems have to be developed to handle every possible case to work effectively. So, to solve this, most databases issue a sequence of integers for each row. In the case where databases are in a cluster, each database might tag a digit on the end representing which database server issued a specific primary key. In other words, instead of 1, 2, 3; database server #4 in a cluster might issue 14, 24, 34, etc. as primary keys.

Database Best Practices

Having done a bit of database work, I've learned some best practices that are worth sharing.

1. Primary keys: Never build intelligence into a primary key – a primary key is simply an artifact of the database and it should represent nothing more than a way to access a row in a database table (i.e., don't use SSN as a primary key).

2. Table Names: Database table names should be singular (Employee, Order, Transaction, etc); they should be named for what each row in the table represents, not the entire collection. The reason is that, typically, there's a one-to-one mapping between a row in a database table and an object used in code. For example, in code, an instance variable referencing an Employee object should represent a single employee from the database while an an instance variable that's plural, such as Employees, should represent a collection of objects such as an array or dictionary.

3. Lookup Tables: A lookup table is usually a simple static database table that's used to populate a list or collection. For example, a list of countries that your company ships to. Perhaps, your company only ships to the U.S. and Canada.; later, when you start shipping to more countries, how do you update the pop-up list of countries on your website or mobile app? With a lookup table, you simply add another row to the table with the new country that you ship to. Typically, a look up table also has a column representing a sort order so it can be displayed in a specific order with, say, the U.S. listed first, instead of Afghanistan, since most of your customers come from America.






No comments: