Thursday, August 14, 2014

Monitoring How Things Fail

As our world becomes more complex, things fail in ways we never expected. In the military, we trained for different scenarios so we had documented responses. I see much less of this scenario-based training in high tech simply because it's too complex to cover critical, unimagined failures. Grace Hopper captured this sentiment best when she said, "Life was simple before World War II. After that, we had systems."

This week I started reading a book I heard about on NPR, The Checklist Manifesto. The author, Atul Gawande, is an accomplished surgeon. He noticed that seemingly small, yet critical, bits of information were overlooked in the operating room. Borrowing from the experience of airplane pilots, Dr. Gawande began using checklists before operating on patients, resulting in fewer mistakes in the OR.

When I originally launched Adjix, I encountered different ways that servers could fail. A few incidents stick out in my mind.

Don't Back Up Onto Your Only Backup

Always have a working backup. This seems obvious. I noticed one of my Adjix servers had slow disk I/O – in other words, the hard drive seemed to be failing, so I backed it up onto my only backup drive. Unfortunately, the backup never completed. I was left with a failed server and an unstable, corrupted backup copy. The important lesson I learned here was to rotate backups. Nowadays, this should be less of a problem with services like AWS.

A couple of years later I saw a similar issue while consulting for a startup that maintained two database servers in a master/slave cluster. The hard drive on the master server filled up, taking their website down. Their lead developer logged into the master server and started freeing up space by deleting files and folders. In his haste, he deleted the live database. When he logged into the slave server, he discovered that his delete command had replicated, wiping out the slave database as well. Their last offline backup was a week old. He was fired, and the rest of the team took spreadsheets from the operations and sales departments and did their best to rebuild the live database.

How Do You Define Failure?

Apple uses WebObjects, the web application server that has powered the iTunes Store and Apple's online store since they were created. WebObjects included one of my favorite RDBMSs, OpenBase. The beauty of OpenBase was that it could handle up to 100 servers in a cluster, with no concept of a master or a slave. Any SQL written to one server would be replicated to the others within five seconds, which is very handy for load balancing.

OpenBase's approach to clustering was elegantly simple. Each database server in the cluster was numbered: 1, 2, 3, and so on. Each database also generated its own primary keys: 1, 2, 3, and so on. These two numbers were combined so that the server's number became the least significant digit. For example, database server #8 would generate primary keys like 18, 28, 38, 48, etc. This ensured that each database server's primary keys were unique across the cluster. The SQL was then shared with all the other databases in the cluster.
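
A minimal sketch of that key scheme in Java might look like the following. The class and method names are hypothetical, not anything from OpenBase itself, and it assumes single-digit server numbers for simplicity (a 100-server cluster would reserve two digits instead).

// Hypothetical sketch of cluster-unique primary keys, in the spirit of the
// scheme described above: the server number becomes the least significant digit.
public class ClusterKeyGenerator {
    private final int serverNumber;   // 1-9 in this simplified, single-digit sketch
    private long sequence = 0;        // each server's own local counter: 1, 2, 3, ...

    public ClusterKeyGenerator(int serverNumber) {
        this.serverNumber = serverNumber;
    }

    public synchronized long nextPrimaryKey() {
        sequence++;
        return sequence * 10 + serverNumber;   // server #8 yields 18, 28, 38, 48, ...
    }

    public static void main(String[] args) {
        ClusterKeyGenerator server8 = new ClusterKeyGenerator(8);
        for (int i = 0; i < 4; i++) {
            System.out.println(server8.nextPrimaryKey());   // prints 18, 28, 38, 48
        }
    }
}

Because no two servers ever share a least significant digit, the keys never collide, and each server can keep generating keys even when it can't reach the rest of the cluster.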

Here's where something looked better on paper than in the real world. If one of the servers failed, it would be removed from the cluster. The problem was, how do you define failure?

If one of the database servers was completely offline, then that was clearly a failure. But what if the hard drive was beginning to fail – to the point that a read or write operation might take 20 or 30 seconds to complete successfully? Technically, it hadn't failed, but the user experience on the website would be horrible. One solution would be to set a timeout for the longest you'd expect an operation to take, say five seconds, and then alert a system admin whenever that timeout is exceeded.
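
As a rough illustration – not anything from the Adjix or OpenBase code – here's how such a timeout-plus-alert might look in Java, with runQuery and alertAdmin standing in for your real database call and paging hook:

import java.util.concurrent.*;

public class SlowQueryMonitor {
    private static final long TIMEOUT_SECONDS = 5;
    private static final ExecutorService executor = Executors.newCachedThreadPool();

    // Run the query on a worker thread so we can give up if it takes too long.
    public static String queryWithTimeout(Callable<String> runQuery) throws Exception {
        Future<String> future = executor.submit(runQuery);
        try {
            return future.get(TIMEOUT_SECONDS, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);   // stop waiting on the slow server
            alertAdmin("Query exceeded " + TIMEOUT_SECONDS + "s; disk may be failing");
            throw e;
        }
    }

    private static void alertAdmin(String message) {
        // In a real system this might send an e-mail or a page; here we just log it.
        System.err.println("ALERT: " + message);
    }
}

The point isn't the exact mechanism; it's that "failure" becomes something you define explicitly – an operation slower than five seconds – rather than something you only notice when a server disappears entirely.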

Who Watches Who?

When I launched Epics3, I had to monitor an e-mail account for photo attachments. I used a Java library that implemented IMAP IDLE, which is basically a push notification standard for e-mail. Perhaps there was a limitation in the Java library I was using, but IDLE simply wasn't reliable in production. It would hang, and my code had no way to detect the problem. My solution was simply to check the mail server for new e-mail every ten seconds. This was a luxury I had since my bandwidth wasn't metered and Gmail didn't mind my code frequently checking for new e-mail.
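
For illustration, a stripped-down polling loop using JavaMail might look like this. The host, account, and password are placeholders, and the real Epics3 code also had to fetch attachments and handle errors:

import java.util.Properties;
import javax.mail.*;
import javax.mail.search.FlagTerm;

public class MailPoller {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("mail.store.protocol", "imaps");
        Session session = Session.getInstance(props);
        Store store = session.getStore("imaps");
        store.connect("imap.gmail.com", "user@example.com", "app-password");

        Folder inbox = store.getFolder("INBOX");
        while (true) {
            inbox.open(Folder.READ_ONLY);
            // Look for messages that haven't been seen yet.
            Message[] unread = inbox.search(new FlagTerm(new Flags(Flags.Flag.SEEN), false));
            System.out.println("Unread messages: " + unread.length);
            inbox.close(false);
            Thread.sleep(10_000);   // poll every ten seconds
        }
    }
}

Polling is less elegant than push, but a dumb loop that runs every ten seconds is far easier to reason about when things go wrong than a push connection that silently hangs.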

Like Adjix, Epics3 was a WebObjects Java app. WebObjects uses a daemon, wotaskd, that checks for lifebeats from my app; if the app stops responding, wotaskd kills it and restarts it. The problem I had was that my Java thread would sometimes hang while checking for new e-mail. The app was alive and well, but the e-mail check thread was hung. The solution was to have the e-mail check thread update a timestamp in the application each time it checked for new e-mail. A separate thread would then check that timestamp every few minutes. If it found that the timestamp was more than a few minutes old, the app would simply kill itself, and wotaskd would automatically restart it. This process worked perfectly, which was a relief.
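
A bare-bones version of that watchdog pattern could look like the sketch below. The names are hypothetical, and any process supervisor (wotaskd in my case, or systemd, etc.) plays the role of restarting the app after it exits:

import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicLong;

public class MailCheckWatchdog {
    private static final AtomicLong lastCheck = new AtomicLong(System.currentTimeMillis());
    private static final long STALE_AFTER_MS = 3 * 60 * 1000;   // three minutes

    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(2);

        // The worker: checks mail every ten seconds and stamps the time on success.
        scheduler.scheduleAtFixedRate(() -> {
            checkForNewMail();                            // may hang in the bad case
            lastCheck.set(System.currentTimeMillis());    // only reached if it didn't
        }, 0, 10, TimeUnit.SECONDS);

        // The watchdog: if the timestamp goes stale, exit and let the supervisor restart us.
        scheduler.scheduleAtFixedRate(() -> {
            long age = System.currentTimeMillis() - lastCheck.get();
            if (age > STALE_AFTER_MS) {
                System.err.println("Mail check hung for " + age + " ms; exiting.");
                System.exit(1);
            }
        }, 1, 1, TimeUnit.MINUTES);
    }

    private static void checkForNewMail() {
        // Placeholder for the real IMAP polling code.
    }
}

The design choice worth noting is that the app doesn't try to untangle the hung thread; it just dies cleanly and lets the thing watching it bring it back.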

Things don't always fail as we imagined, so it's important to avoid a failure of imagination.
