Tuesday, January 29, 2019

Rolling Out a New Version of Your Website

Tricks I Learned at Apple: Steve Jobs Load Testing is an excellent precursor to this post.

When launching a completely new version (update) of a website, it's best to have a rollout and a rollback plan. Very few brand new websites will have the problems that HealthCare.gov had in 2013 because new websites typically start with zero traffic. HealthCare.gov was a unique case since it went from zero to millions of users, overnight. 

Typically, as a website grows, servers will be added and optimized to handle the additional traffic. But, if growth happens too quickly, then the company can prevent new users from creating new accounts on the website while they manage their growth and scale up their infrastructure. Facebook was able to manage their growth by rolling out across college campuses, one at a time, whereas Twitter had no way to control their growth since they were open to the public, resulting in the fail whale. Again, these are rare cases; the typical problem with websites occur when rolling out a major update.


Rolling out the New Website Version

While growing from zero to millions of users is a high quality problem, it's actually very rare. A more likely problem is encountered when an entirely new version of a website is rolled out since it will probably have critical bugs or scaling issues. 

When I worked at Apple and Wyndham, we had to handle both bugs and scaling issues. At Apple, we switched from using RDBMs to memory caches for read-only data. At Wyndham, we had to roll out more than a dozen different websites at once for brands like Days Inn, Ramada, Howard Johnson's, Super 8, Hawthorn Suites, etc.


Managing Risk

Initially, Wyndham wanted to switch from the old website to the new one, all at once. My boss, who's a particularly sharp guy, had enough experience to immediately recognize the risk of doing this. Specifically, what if the new website was broken (what if it had too many bugs, preventing customers from booking rooms)? Instead, he suggested a very simple plan. Rather than making the switch, overnight, he suggested we keep the old version of the website running while rolling out the new website over the course of a week or so.

Since both the old and new versions of the website talked to the same database, it was a simple process, at a high level. We'd have an all-hands meeting, on Monday morning, in our war room (dedicate conference room). During Monday's meeting, all of the departments (marketing, product management, development, and QA) would give a thumbs up to move forward. Then, we'd have our load balancers begin to randomly send 1% of the traffic to the new version of our website. We'd place a cookie on the customer's browser so, if they came back later, they'd automatically be directed to the new version of the website otherwise they'd end up the old version. 


Staging the Rollout 

Just before the close of business on Monday, we'd meet again to confirm that everything was running as expected. On Tuesday morning, we'd meet and give a thumbs up to increase the traffic to the new website to 5%, etc. It looked like this:
Monday: 1%
Tuesday: 3% – 5% (based on Monday's performance)
Wednesday: 10%
Thursday: 50%
Friday: 100%

The beauty of starting at 1% and then 3 % – 5% is that's the most revenue you'll risk losing (in theory) if something goes wrong.

By using this week-long rollout process, we all kept our jobs. I only recall one time, when there was a major bug, that we had to stop after the first day or two, which wasn't a big deal; we simply sent all traffic to the old website while the new one was fixed and we got it right on our next rollout.

No comments: