On August 1st 2012, Knight Capital Group lost $440 million in 45 minutes of high-frequency trading.
The deployment of a software update to their trading platform was faulty. Their subsequent efforts to manually roll back the deployment were catastrophic. There’s a detailed Securities and Exchange Commission filing that describes in lurid technical detail the full scale of the disaster. Triggering an investigation by the SEC is a great way to get your system deeply audited.
Before you start feeling too smug, let’s look at how Knight Capital managed software deployments. The dev-ops team were good guys working under pressure to get things done. Their house was messy, but it wasn’t a dump. What does your own house look like? We’ve taken some steps to clean up ours, and we’re open sourcing them today. This is our take on the next evolution of deployment.
In order to appreciate how the Knight Capital deployment failure played out, you need to understand what the software update was supposed to do. August 1st 2012 introduced new regulations into the market to make it more efficient for retail investors (better prices for you and me). The update to Knight Capital’s systems kept them compliant with this new regulation.
In the lead-up to August 1st, the software development team tested and verified the update.
All tests passed. The dev-ops team then manually copied and installed the update to the 8 servers running the live system in preparation for a switchover on the morning of the 1st. But one of the servers was not updated correctly, and on that day, continued to run the old code. This error was not caught because the manual deployment process had no verification step, human or otherwise.
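The missing verification step does not need to be elaborate. Here is a minimal sketch, assuming you can read back the deployed artifact from each server: fingerprint what is actually installed and compare it with the build you meant to ship. The server names and build contents are hypothetical, and this is an illustration only, not a reconstruction of Knight Capital's tooling.

```python
import hashlib


def fingerprint(artifact: bytes) -> str:
    """Return a SHA-256 hex digest that identifies an artifact by its content."""
    return hashlib.sha256(artifact).hexdigest()


def find_stale_servers(expected: bytes, deployed: dict[str, bytes]) -> list[str]:
    """Return the names of servers whose artifact differs from the expected build."""
    want = fingerprint(expected)
    return sorted(
        name for name, artifact in deployed.items() if fingerprint(artifact) != want
    )


# Eight hypothetical servers; one was missed during the manual copy.
new_build = b"trading-platform v2"
old_build = b"trading-platform v1"
servers = {f"server-{i}": new_build for i in range(1, 9)}
servers["server-8"] = old_build

stale = find_stale_servers(new_build, servers)
print(stale)  # ['server-8']
```

A check like this only tells you that something is wrong. It is a gate before the switchover, not a fix for the underlying fragility of copying code onto live machines.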
The new code and the old code had a nasty interaction. The old code stopped the buying and selling of shares once the order value was reached. The new code relied on another part of the system to do this. The system configuration also changed with the update. The new configuration had a fatal side-effect – the old code would never reach the order value! The old code would just keep trading and trading.
The market opened at 9:30AM. Knight Capital had $365 million in the bank. By 9:31AM everybody knew something was terribly wrong. The dev-ops team took a decision to manually roll back to the old code. This was the only card they had to play because they had no automation in place, not even a “kill switch”. They did this as quickly as possible. Of course, the new configuration remained in place. Can you see the problem? They now had 8 machines burning money. By 10:15AM they had taken the whole system down and stopped trading. Knight Capital’s final trading position was $460 million in the red. And they “only” had $365 million in the bank. Bring on the SEC!
Maybe your company isn’t able to kill itself quite so quickly. Fast, safe and reliable deployments are still directly linked to your company’s survival, and your continued employment. There’s a lot of activity around deployment at the moment. This is driven by the move towards continuous delivery of software, the scale of our modern systems, and the need to use cloud resources effectively. Despite all the noise, most companies are still just using an ad hoc collection of scripts and basic tooling. There are lots of people still copying code over. You might be one of the lucky ones using a proper deployment tool – fantastic! But you can still do better.
What went wrong at Knight Capital?
Deployment by copying code is pretty much asking for trouble. Any deployment system that does this is going to be fragile, and it doesn’t matter how the “copying” happens. Perhaps you’re pulling a tarball out of your build system. Perhaps you’re using git hooks. Or a tool like Chef or Ansible that runs configuration recipes on your machines. The key problem is that you are changing files on a live production server. There are just too many ways that can fail.
The configuration of the system is also a weak point. It needs to move in lock-step with the code. But it’s easy for the configuration to “drift”. Team members occasionally log into boxes and make a few tweaks, but forget to tell anyone. With Knight Capital there was no oversight or review of such changes. While the “Vier-Augen-Prinzip” (the four-eyes principle, where a second person reviews every change) is great in theory, it’s pretty hard to implement in practice, especially for frequent daily tasks. Relying on humans to behave correctly is … optimistic.
Knight Capital was unable to roll back their system correctly. There was no definition of the desired system state. Tools like Puppet and CFEngine try to get you to think in this way, in terms of the way you want the system to be – the ideal “model” of the system. The trouble is that they are still hostages to fortune – long-lived production machines can end up in very strange states that can’t be corrected.
Which brings us to “immutability”.
This is a key piece of the solution. Instead of allowing changes to your production machines, prevent them! Machines are not allowed to change. This removes an entire class of failures and errors. If Knight Capital had commissioned 8 new machines, but still made the same deployment error, they would have been able to bring the old machines back online, and keep going. You can see how powerful this is when you realize that it protects you even when you have a manual deployment process.
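To make the rollback argument concrete, here is a minimal sketch, assuming a router that points live traffic at exactly one fleet of machines. The `Fleet` and `Router` names are hypothetical and exist only for this illustration. The point is that a deployment never touches the old fleet, so rolling back is just a pointer swap.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Fleet:
    """An immutable set of machines, all built from a single release."""
    release: str
    machines: tuple[str, ...]


class Router:
    """Points live traffic at one fleet and remembers the previous one."""

    def __init__(self, live: Fleet) -> None:
        self.live = live
        self.previous: Optional[Fleet] = None

    def deploy(self, new_fleet: Fleet) -> None:
        # The old fleet is never modified; it is set aside untouched.
        self.previous, self.live = self.live, new_fleet

    def rollback(self) -> None:
        if self.previous is None:
            raise RuntimeError("no previous fleet to roll back to")
        self.live, self.previous = self.previous, None


old = Fleet("v1", tuple(f"old-{i}" for i in range(1, 9)))
new = Fleet("v2", tuple(f"new-{i}" for i in range(1, 9)))

router = Router(old)
router.deploy(new)      # switch traffic to the 8 new machines
router.rollback()       # something went wrong: switch straight back
print(router.live.release)  # v1
```

Because `Fleet` is frozen, any attempt to alter a fleet after it is built raises an error, which is the code-level equivalent of "machines are not allowed to change".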
There are two schools of thought here: machines vs. containers. The Machine school says you should build a new machine instance for each deployment. The Container school says you should move up a level, and run containers on your machine instances. The Container school is exemplified by Docker, the Machine school by Netflix.
The Machine school arose first, and was a natural way to maximize the capabilities of hosting platforms like Amazon Web Services. Each time you deploy, you create and configure (by automation!) new machine instances. You can bring these online in a staged manner. You keep the old machines around in case you need to roll back (which is going to be very fast). There is overhead. It does take (variable!) time to spin up instances. And to get the full benefits, you really should only run one system component per machine. Adrian Cockcroft, previously Netflix Director of Architecture, talks about this, among other things.
The Container school is an optimization of the Machine school. Use containerization technology to run many services on one machine, but each service thinks it has its own machine (at scale you’ll still go down to one service per machine). Containerization strips down the virtual machine idea to the smallest feature set that can possibly work. The benefit is far lower resource usage, so you can run hundreds of “machines” on one instance. Michael Bryzek, CTO of Gilt Group, talks about this approach. Docker is the key piece of technology that makes containers easy to use. The Docker ecosystem has exploded, with a great flowering of automation tools based on Docker. But many of these tools still suffer from the same fundamental problem that plagued Knight Capital.
What is the fundamental problem?
The mutability of systems. The fact that they can change at all means they will break and fail. A running system is best left alone, because it works. Making machines (or containers) immutable does not make the system as a whole immutable. If you can somehow take the same principles and apply them to the entire system, you’d never have downtime!
This is an issue that has been on our minds at nearForm. We’ve thought about it a lot. It affects us, because we build so many systems, for so many people. We can’t rely on the company culture of our clients as a safety mechanism – it’s hard enough building our own culture!
So do you make systems themselves immutable? That’s the answer, isn’t it?
Well, you can’t. But you can deal with the problem by thinking about it in another way.
First, we need control over change.
Can the effectiveness of the git version control system teach us anything? How about making the description of the desired state itself immutable? That solves many problems. You capture the knowledge of how the system is supposed to be. You capture the changes over time. You prevent configuration drift because all changes need to be “committed”. You can’t place the system itself under version control (that would bankrupt you), but you can control its description. Not as a by-product of throwing some recipe files into git, but as an integral part of how your deployment works.
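Here is a minimal sketch of what a git-like, immutable system description could look like, assuming the description is plain JSON-serializable data. The `DescriptionLog` class and the `web` service are hypothetical names for this illustration; this is not how any particular tool stores its state. Each commit is addressed by the hash of its content and points at its parent, so history can only be appended to.

```python
import hashlib
import json
from typing import Optional


class DescriptionLog:
    """Append-only history of system descriptions, addressed by content hash."""

    def __init__(self) -> None:
        self._commits: dict[str, dict] = {}
        self.head: Optional[str] = None

    def commit(self, description: dict, message: str) -> str:
        record = {"parent": self.head, "message": message, "description": description}
        payload = json.dumps(record, sort_keys=True)
        commit_id = hashlib.sha256(payload.encode()).hexdigest()
        # Store a decoded copy, so later changes to the caller's dict
        # cannot alter what was committed.
        self._commits[commit_id] = json.loads(payload)
        self.head = commit_id
        return commit_id

    def description_at(self, commit_id: str) -> dict:
        # Hand back a fresh copy each time; the stored record stays untouched.
        return json.loads(json.dumps(self._commits[commit_id]["description"]))

    def history(self) -> list[str]:
        """Return commit messages from newest to oldest."""
        messages = []
        cursor = self.head
        while cursor is not None:
            messages.append(self._commits[cursor]["message"])
            cursor = self._commits[cursor]["parent"]
        return messages


log = DescriptionLog()
v1 = log.commit({"web": {"image": "web:1", "count": 2}}, "initial system")
v2 = log.commit({"web": {"image": "web:2", "count": 2}}, "upgrade web")

print(log.history())                           # ['upgrade web', 'initial system']
print(log.description_at(v1)["web"]["image"])  # web:1
```

Rolling back then means deploying the description at an earlier commit, code and configuration together, rather than trying to undo changes on a live machine.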
Second, homeostasis is cool.
Yes, this is the point where we argue that software should be more like biology! It’s an old one, but a good one. The deployment system should be like your body, constantly regulating your core temperature to keep it where it’s meant to be. This is not a new idea, but the implementations to date have not fully exploited the reliability of containers, and are instead based on mutating live machines in production. With the power of a full system description, you can monitor the live production environment, and return it to the desired state if it deviates. To determine the deviation, you have to “diff” against the desired state – that’s a hard, but feasible, problem to solve.
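A toy version of that “diff” might look like the following sketch. It assumes the system state can be reduced to a flat map of container name to image, which is a big simplification of a real environment; the names `diff_state` and `reconcile` and the example services are hypothetical. Note that deviating containers are replaced from the description, never patched in place.

```python
def diff_state(desired: dict[str, str], actual: dict[str, str]) -> dict[str, list[str]]:
    """Compare two maps of container name -> image and report the deviation."""
    return {
        "missing": sorted(desired.keys() - actual.keys()),
        "unexpected": sorted(actual.keys() - desired.keys()),
        "changed": sorted(
            name for name in desired.keys() & actual.keys()
            if desired[name] != actual[name]
        ),
    }


def reconcile(desired: dict[str, str], actual: dict[str, str]) -> dict[str, str]:
    """Return a new state that matches the desired one, leaving `actual` untouched."""
    delta = diff_state(desired, actual)
    result = dict(actual)
    for name in delta["unexpected"] + delta["changed"]:
        del result[name]              # stop the deviating container
    for name in delta["missing"] + delta["changed"]:
        result[name] = desired[name]  # start a fresh one from the description
    return result


desired = {"web": "web:2", "api": "api:5", "db": "db:1"}
actual = {"web": "web:1", "api": "api:5", "cache": "cache:3"}

print(diff_state(desired, actual))
# {'missing': ['db'], 'unexpected': ['cache'], 'changed': ['web']}

healed = reconcile(desired, actual)
print(healed == desired)  # True
```

Run in a loop against the live environment, this is the thermostat: observe, diff, correct, repeat.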
And finally, containers.
What a wonderful idea. We really love them. So let’s generalize. A Docker container is an instantiation of an immutable part of the system. Why does that have to be restricted to machine-like things? Surely we can “containerize” entire sub-systems. We can have containers that represent multiple services, or security configurations, partial deployments, or traffic flows. These containers contain child containers that define more detailed aspects of the system. Let’s make it containers all the way down. All the container definitions are under version control. With this new concept of containers, you can implement that second part of homeostasis – returning the system to the desired state. Once you have the “diff”, you can be clever about how you instantiate containers at all levels of the system. There are lots of algorithmic possibilities, from fast deploys, to ultra-safe “risk minimizers”.
So this is our vision at nearForm.
We’ve been building this new deployer and using it on client projects for most of 2014. We’ve taken the decision to open-source our work. It’s early days, and we’re going to need help to achieve the full vision. We think it’s the right combination of philosophy and practicality to really make a difference to the real-world problems that dev-ops face day-to-day.
The nearForm Deployer has changed our lives. It’s made deployments fun, and it helps us move really, really fast. We’ve decided to keep building it out as a full open-source deployment solution.