Thursday, June 9, 2011

Tricks I Learned At Apple: Steve Jobs Load Testing

When I worked at the Apple Online Store, we would never load test the live website. There was rarely a need. However, it was always an interesting experience to turn the store back on after Steve Jobs walked off stage following one of his keynote presentations. As part of our postmortem, once the store was back online, we'd ask ourselves where the servers were constrained: CPU, network bandwidth, disk I/O, or memory? While it's hard to predict exactly how the entire system would behave in the real world, we had a good idea, before we flipped the switch, thanks to our thorough testing strategies.

Load Testing
Many companies use load testing software to see what kind of load their web app can handle. A common but flawed way to load test a website is to bring it online and then turn on the load testers. The problem with this technique is that it doesn't tell you how the website will perform if it goes down while it's live. When a production website goes down, it must be brought back up while under load, and things will behave very, very differently. For example, when the iTunes Store first launched, it was discovered that one of its trusted WebObjects components wasn't thread safe, and this bug only presented itself under very heavy load.

Cutting My Teeth
When I first joined the Apple online store, I was paired up with an experienced software engineer so that I could get up to speed on the code repository, build process, and unit and component testing. Since the online store was already live, we would never roll out new code without first testing it and gathering detailed metrics.

My first task with my coworker was to implement a simple web service which retrieved product information, in the form of a plist, over the network. A simple service like this could normally be written in a day or two, but it took us most of the week as my mentor explained each step to me while we pair programmed our way through the process. (Although we pair programmed, our software methodology was Agile/Scrum, not Extreme Programming. Each team used whichever development technique they agreed on as long as they could stay on schedule. The teams I worked with were fortunate to have formally trained scrum masters who were supported by management.)

Before writing any production code, we'd write our unit tests. All software engineers should be taught to write their API unit tests first – it's a good discipline to learn. Next, we coded using WebObjects/Java with Eclipse/WOLips, and we always ran the app in debug mode with key breakpoints so that we could step through the code. Elsewhere, I've seen too many software engineers who just code away as if they're throwing something against the wall to see what sticks.

As soon as we checked in our code, the repository would automatically build all of the applications and run the unit tests against them. If you broke the build, everyone on the team, plus a project manager or two, would receive a notification e-mail identifying you as the culprit.

Token
We had one highly specialized piece of code which could only be checked out, worked on, and checked in by a single engineer at a time. You were only allowed to touch this piece of code if you possessed a physical token. In our case, the token was a Darth Tater doll, which had to be conspicuously displayed on the top of your cube or bookcase.

Gathering Metrics
Once our service was code complete, bug free, and checked into the repository, we began component testing to gather metrics on the new code. This is another step that's commonly overlooked by novice teams. I suspect that this "gather metrics" step isn't included in The Joel Test because Joel Spolsky's product was a desktop app and not a web app under heavy load (or, perhaps it's implicit in Spolsky's "Do you have testers?" step).

Before we could even consider including our code in the live code branch, we would hit it with many millions of requests. At Apple, we had very sophisticated caching algorithms which could store any number of entries we wanted, depending on our goals. Did we need a cache with only 500 products in it or 50,000? After a cold start, would we need to "warm up" the cache with specific products? How long should we wait, after no hits, before removing a product from the cache to free up memory?
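The three questions above (how many entries, whether to warm up, how long to keep an idle entry) map onto three knobs in a cache. The following is only a minimal sketch of that idea, not the cache we used at Apple: the class name `ProductCache`, its methods, and the loader function are all hypothetical. It relies on `java.util.LinkedHashMap` in access order to drop the least recently used entry, and it takes the current time as a parameter so the idle timeout is easy to exercise in a test.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: a size-bounded cache that also drops entries nobody has hit lately.
class ProductCache<K, V> {
    // A cached value plus the time it was last requested.
    private static final class Slot<V> {
        final V value;
        long lastHitMillis;
        Slot(V value, long now) { this.value = value; this.lastHitMillis = now; }
    }

    private final int maxEntries;
    private final long maxIdleMillis;
    private final Function<K, V> loader;   // stands in for the real database or service call
    private final LinkedHashMap<K, Slot<V>> entries;

    ProductCache(int maxEntries, long maxIdleMillis, Function<K, V> loader) {
        this.maxEntries = maxEntries;
        this.maxIdleMillis = maxIdleMillis;
        this.loader = loader;
        // accessOrder = true keeps the least recently used entry at the front.
        this.entries = new LinkedHashMap<K, Slot<V>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, Slot<V>> eldest) {
                return size() > ProductCache.this.maxEntries;
            }
        };
    }

    // Returns the cached value, loading it on a miss.
    synchronized V get(K key, long now) {
        Slot<V> slot = entries.get(key);
        if (slot == null) {
            slot = new Slot<>(loader.apply(key), now);
            entries.put(key, slot);
        } else {
            slot.lastHitMillis = now;
        }
        return slot.value;
    }

    // Pre-loads specific keys after a cold start.
    synchronized void warmUp(List<K> keys, long now) {
        for (K key : keys) {
            get(key, now);
        }
    }

    // Removes entries that have had no hits for longer than maxIdleMillis.
    synchronized int evictIdle(long now) {
        int removed = 0;
        Iterator<Slot<V>> it = entries.values().iterator();
        while (it.hasNext()) {
            if (now - it.next().lastHitMillis > maxIdleMillis) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }

    synchronized int size() { return entries.size(); }

    synchronized boolean contains(K key) { return entries.containsKey(key); }

    public static void main(String[] args) {
        ProductCache<String, String> cache =
            new ProductCache<>(500, 60_000L, sku -> "plist for " + sku);
        cache.warmUp(Arrays.asList("SKU-1", "SKU-2"), 0L);
        System.out.println(cache.get("SKU-1", 1_000L));
        System.out.println("evicted after idle: " + cache.evictIdle(61_500L));
    }
}
```

In a real service the timestamps would come from a clock rather than a parameter, and the eviction pass would run on a timer. The metrics then tell you which combination of size and idle timeout is worth the memory.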

As a side note, our caches were always hash tables. The beauty of a hash table is that its average lookup time is constant: O(1) in Big O notation. When you're asked, during a job interview, which data structure has the fastest lookup, don't say, as is very common, "a B-tree" or "a binary tree." Lookups in a balanced tree take O(log n). A perfect hash table, one with no collisions, wins hands down.
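To make the difference concrete, here is a small sketch using the JDK's own collections: `java.util.TreeMap` (a red-black tree) and `java.util.HashMap`. The class `LookupDemo` and its methods are hypothetical names for illustration. It counts how many key comparisons a single tree lookup needs, which grows with the logarithm of the number of keys, while the hash lookup goes to a bucket computed from the key's hash code.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo: comparisons needed by a balanced tree lookup versus a hash lookup.
class LookupDemo {
    // Counts the key comparisons made by one lookup in a tree holding n keys.
    static int treeComparisons(int n, int key) {
        AtomicInteger comparisons = new AtomicInteger();
        Comparator<Integer> counting = (a, b) -> {
            comparisons.incrementAndGet();
            return Integer.compare(a, b);
        };
        TreeMap<Integer, String> tree = new TreeMap<>(counting);
        for (int i = 0; i < n; i++) {
            tree.put(i, "product-" + i);
        }
        comparisons.set(0);   // ignore the comparisons made while inserting
        tree.get(key);
        return comparisons.get();
    }

    // Looks up one key in a hash table holding n keys.
    static String hashLookup(int n, int key) {
        Map<Integer, String> table = new HashMap<>();
        for (int i = 0; i < n; i++) {
            table.put(i, "product-" + i);
        }
        return table.get(key);
    }

    public static void main(String[] args) {
        for (int n : new int[] {1_000, 100_000}) {
            System.out.println(n + " keys: tree lookup used "
                + treeComparisons(n, n - 1) + " comparisons, hash lookup returned "
                + hashLookup(n, n - 1));
        }
    }
}
```

Note that `HashMap` is not a perfect hash table: with many collisions its lookups slow down, so the O(1) figure is an average, not a guarantee.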

Tweaking and Done
We would tweak our code until we had acceptable metrics. Our metrics would measure how much memory the cache used and how long it took for each service request/response to be fulfilled. Depending on our needs, we might shoot for a goal to have 99.7% of our service requests returned within 35 ms, while 95% were returned within 10 ms with no single request taking longer than 50 ms.
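A goal like the one above is easy to check mechanically once the response times have been recorded. The sketch below is an assumption about how such a check could look, not our actual tooling: `LatencyGoals`, `fractionWithin`, and `meetsGoals` are hypothetical names, and the thresholds are simply the example numbers from the paragraph above.

```java
import java.util.Arrays;

// Hypothetical check of recorded response times against the example goals.
class LatencyGoals {
    // Fraction of samples that completed at or under the threshold.
    static double fractionWithin(long[] latenciesMs, long thresholdMs) {
        if (latenciesMs.length == 0) {
            throw new IllegalArgumentException("no samples recorded");
        }
        long within = Arrays.stream(latenciesMs).filter(ms -> ms <= thresholdMs).count();
        return (double) within / latenciesMs.length;
    }

    // 95% within 10 ms, 99.7% within 35 ms, and nothing slower than 50 ms.
    static boolean meetsGoals(long[] latenciesMs) {
        long slowest = Arrays.stream(latenciesMs).max().orElse(0L);
        return fractionWithin(latenciesMs, 10) >= 0.95
            && fractionWithin(latenciesMs, 35) >= 0.997
            && slowest <= 50;
    }

    public static void main(String[] args) {
        long[] samples = {3, 4, 4, 5, 6, 7, 8, 9, 9, 12};
        System.out.println("within 10 ms: " + fractionWithin(samples, 10));
        System.out.println("meets goals: " + meetsGoals(samples));
    }
}
```

With millions of requests you would likely keep a histogram rather than every sample, but the comparison against the goals stays the same.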

These tests were run against a copy of the live database in a production environment. It's not a perfect indication of how the web app would perform once it was live, but it quickly proved to be a great way to set expectations.

At the end of our sprint, these metrics would be demoed as part of the Agile definition of "done." The code was now ready to be checked into the QA branch for functional testing before going live.

18 comments:

Stefano Ricciardi said...

Interesting insights on your internal process, thank you for sharing.

Loved the physical token idea, by the way :)

Matt said...

I'm sure Apple's products have their own dark code corners, but how does it compare to elsewhere? I'm wondering if the design integrity is partially a function of the engineering that went into it.

Chethan R Vasishta said...

I recently learned that they used a physical token in my company too. It was an orange bowl and since then, the process of checking in is termed as bowling! :)

Jason said...

We had a decoy hunting duck named "merge mallard".

Joe Moreno (@JoeMoreno) said...

Matt,

Hmm, that's a very good question.

All code develops some dark corners when coding quickly if you're not careful and experienced.

I think part of Apple's success is due to a combination of several things such as experienced and enthusiastic engineers who, thanks to Steve Jobs' focus, were very results oriented.

I believe the development tools, environment, and Apple's/NeXT's philosophies were key.

A big part of software engineering is managing complexity. Apple's technologies have an amazingly appropriate level of abstraction to really optimize development.

Mike said...

Great article -- thanks for sharing.

Two questions:

1) How long were your sprints?

2) Any thoughts on how to bring a team without this level of process sophistication up to speed?

Joe Moreno (@JoeMoreno) said...

Mike,

1. Our sprints started off at two weeks long, but, after a few iterations, we lengthened them to three weeks. Three weeks worked great for our team.

2. In my experience, nothing brings an engineer up to speed faster than their enthusiastic personal initiative and motivation. When we conducted our job interviews, one of the things we looked for was enthusiasm about the company and technology.
To bring an inexperienced team up to speed, you're going to need a strong, experienced leader with good communication skills.

sostler said...

Great article. How did your team ensure that your automated performance tests matched the characteristics of the production load?

Joe Moreno (@JoeMoreno) said...

sostler,

We had similar servers (database, app, web, etc.) in dev as in the deployment environment. Based on our metrics in dev, we could extrapolate how much load we could handle in deployment. This was only an approximation, though. What we never knew was how big the real load would be once Steve Jobs walked off stage and we went live. That's why we'd match up the postmortem results to our estimates.

As a side note, I didn't work, personally, on every aspect of every step that I wrote about. But, we'd have debrief meetings so everyone would know what was going on.

sostler said...

Thanks. When you simulated the traffic load, though, how did you predict the space of actions that your visitors would take? A user hitting the landing page then leaving takes fewer resources than one who loads up a shopping cart, performs lots of product searches, etc.

Joe Moreno (@JoeMoreno) said...

sostler,

We wouldn't load test an entire web page to gather metrics. Rather, we'd load test the components that made up a web page, especially components that need to go over the network. For example, we'd hit the database millions of times to measure the results. If a component needed to fetch data from SAP, we'd see if we could cache it since that would be faster.

We'd ask ourselves: how many concurrent requests can we send to a service or database before it bottlenecks?
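One way to answer that question is to call the same component from an increasing number of threads and watch where throughput stops climbing or the worst response time jumps. This is only a rough sketch under stated assumptions: `ConcurrencyProbe` and `Result` are hypothetical names, and the `Runnable` stands in for whatever service or database call is being measured. It uses the JDK's `ExecutorService.invokeAll` to run the workers and wait for all of them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical probe: measures one component at a fixed number of concurrent callers.
class ConcurrencyProbe {
    static final class Result {
        final int concurrency;
        final int requests;
        final long worstMillis;
        final double requestsPerSecond;
        Result(int concurrency, int requests, long worstMillis, double requestsPerSecond) {
            this.concurrency = concurrency;
            this.requests = requests;
            this.worstMillis = worstMillis;
            this.requestsPerSecond = requestsPerSecond;
        }
    }

    static Result run(Runnable service, int concurrency, int requestsPerWorker) {
        ExecutorService pool = Executors.newFixedThreadPool(concurrency);
        try {
            List<Callable<Long>> workers = new ArrayList<>();
            for (int w = 0; w < concurrency; w++) {
                workers.add(() -> {
                    long worst = 0;
                    for (int i = 0; i < requestsPerWorker; i++) {
                        long start = System.nanoTime();
                        service.run();
                        worst = Math.max(worst, System.nanoTime() - start);
                    }
                    return worst;   // slowest call seen by this worker, in nanoseconds
                });
            }
            long begin = System.nanoTime();
            long worstNanos = 0;
            for (Future<Long> done : pool.invokeAll(workers)) {   // blocks until all workers finish
                worstNanos = Math.max(worstNanos, done.get());
            }
            long elapsed = Math.max(1, System.nanoTime() - begin);
            int total = concurrency * requestsPerWorker;
            return new Result(concurrency, total, worstNanos / 1_000_000, total * 1e9 / elapsed);
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("probe failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Stub service that just sleeps; replace it with the real call under test.
        Runnable stub = () -> {
            try {
                Thread.sleep(2);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        for (int c : new int[] {1, 4, 16}) {
            Result r = run(stub, c, 50);
            System.out.printf("%d threads: %.0f req/s, worst %d ms%n",
                r.concurrency, r.requestsPerSecond, r.worstMillis);
        }
    }
}
```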

If each component returns within, say, 50 ms, and they are all running in parallel, you'll know that you have a solid page, in theory.

The key is to know the max load that a service can handle. If you're going to be anywhere near that load, then you know to fire up more instances, use more caches, and/or optimize your code.

When everything is humming along, I personally like to see the servers' CPU and memory utilization at less than 10%.

Also, it's important not to over-engineer your code. You shouldn't do any premature optimization without first analyzing where the bottlenecks are.

We'd know what actions our visitors would take based on experience. But, it was usually obvious, if a new product was announced, that most users would either look at the static product page or they'd visit the online store, put it into their cart, and configure it to see the cost.

sostler said...

Thanks for the explanation, much appreciated.

L-heezy said...

Other than response speed, what other metrics did you gather during component testing?

Joe Moreno (@JoeMoreno) said...

L-heezy,

Response time was, by far, the number one metric that we looked at. We also looked at memory footprint to see if that made sense - we could tweak the memory footprint by optimizing the size of the cache and how long things lived in the cache.

I don't recall ever looking too closely at network bandwidth in dev - that usually wasn't a big concern. But, we did look at bandwidth during the postmortem.

Aaron Dufour said...

A physical token sounds to me like nothing more than a false belief that your source control is working. If you're using reasonable source control, merges are almost always painless. I'm not really sure what good a physical token does when merging works.

Jon Hendry said...

At one contract gig where I was the 'build master' the check in totem was a viking helmet.

Vicky Sitaraman said...

Hi Joe, how can the markup of the page be changed without a server restart? Can we store server scripts in a database and change them dynamically?

Vicky Sitaraman said...

Hi Joe, how can the dynamic portion of the web page be changed without restarting the server? Can the dynamic server web scripts be stored in a database?