Enterprise software tends to be subject to hugely unrealistic expectations. Even software engineers often underestimate the true cost (in terms of both finance and time) of trying to build near-perfect software. Perhaps this is because the world’s best-known examples of great software systems, such as the Apollo guidance system, or the space shuttle flight computer, were not built for the enterprise. They had the luxury of not having to think about things like return on investment.
In the enterprise software world, we do not have that luxury. Successful business requires risk assessment, risk management, and return on investment analysis. Enterprise software development is no exception. Aiming to build defect-free software is almost never a good business decision, because achieving perfection is exponentially expensive.
Yet, in the enterprise, we suffer from the collective delusion that the software we build must be perfect.
The classic risk management approach to software is based on carefully controlled, big releases. Responsibility is spread over all departments, so that neither engineering, nor QA, nor operations can really be blamed for failure. Now, don’t get me wrong – slow, infrequent, large-scale releases do work. They are painful, but they do get software out the door. But is this approach good business sense? Does our collective fear of failure leave us stuck in a local maximum?
To make progress, we need to accept that the process of software development is about accepting, not conquering, failure. Once you accept that failure is real, that failure happens, that failure is normal, you can get beyond the politics of risk management, and focus on business value.
There is tremendous scope for innovation in the development of large-scale software systems. Companies like Google, Netflix, and Facebook have shown that it is possible to achieve extremely high reliability and still deliver new code to production every day. If your company is not doing that, you are costing your company money. This cost comes from two things: the higher expense of delivering code, and the opportunity cost of lost business.
So how do you deliver software, live, to production, every day, without everything falling apart? By accepting that everything is falling apart, all the time.
The real insight is understanding by how much. What is the acceptable rate of errors in the system, on a continuous basis? Answer that question and you can liberate your software development process. Why? Because as long as you can remain below the threshold of acceptable failure, you are free to deploy!
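As a minimal sketch of that idea, the check below compares an observed error rate against an agreed threshold. The names `WindowStats`, `error_rate`, and `free_to_deploy`, and the example figure of 0.1%, are assumptions made for illustration; your own threshold is a business decision.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Request counts observed over a recent measurement window (hypothetical shape)."""
    total_requests: int
    failed_requests: int


def error_rate(stats: WindowStats) -> float:
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    if stats.total_requests == 0:
        return 0.0
    return stats.failed_requests / stats.total_requests


def free_to_deploy(stats: WindowStats, acceptable_error_rate: float) -> bool:
    """True while the observed error rate stays below the acceptable threshold."""
    return error_rate(stats) < acceptable_error_rate


if __name__ == "__main__":
    # Assumed numbers: 12 failures in 20,000 requests, threshold of 0.1%.
    recent = WindowStats(total_requests=20_000, failed_requests=12)
    print(free_to_deploy(recent, acceptable_error_rate=0.001))
```

The point of the sketch is that the deploy decision becomes a comparison against a number you have chosen, rather than an argument about whether the system is perfect.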
And your architectural mindset can change. A system that is designed to experience constant failure is strong, not weak, because it is tested and exercised on an ongoing basis. This enables continuous small releases. You update only one component at a time. Small releases are less risky. They have small uncertainties, enabling them to stay under the failure threshold. The microservices architecture, where your system is naturally composed of small, independent components, is perfectly suited to this approach. More importantly, it becomes easier for you to estimate the level of risk you are taking on.
So what is the secret to making changes to enterprise software systems while protecting the reliability threshold?
The answer is the process and practice of continuous delivery. In a microservices context, continuous delivery means the ability to create a specific version of a microservice, and to run one or more instances of that version in production, on demand.
If everything is defined in terms of primitive operations on the production system, and you can control the composition of the primitives, then you can control the system. The activation and deactivation of a single microservice instance is your primitive. A single microservice instance is the unit with which you build your system.
The sequence of stages in the journey of that unit from development to production – local or dev, staging, control, production, monitoring and diagnostics – is called the pipeline. The immutable, packaged builds of the microservice that can be deployed into production are called artifacts. The pipeline assumes that the generation of defective artifacts is a common occurrence and attempts to filter them out at each stage, including production.
When a defective artifact makes it to production, this is handled as a normal event rather than an emergency. Each new artifact in deployment is continuously verified, and removed if necessary. Risk is controlled by progressively increasing the proportion of activity that the new artifact handles.
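One way to picture this progressive increase is a loop that raises the new artifact's share of traffic step by step, and removes it as soon as verification fails. This is only a sketch under stated assumptions: `set_traffic_share` and `is_healthy` are hypothetical hooks that you would wire to your own routing and monitoring, and the step sizes are arbitrary.

```python
from typing import Callable, Sequence


def progressive_rollout(
    traffic_steps: Sequence[float],
    set_traffic_share: Callable[[float], None],
    is_healthy: Callable[[], bool],
) -> bool:
    """Raise the new artifact's traffic share one step at a time.

    Returns True if every step passed verification. On the first failed
    check, the share is set back to 0.0 (the artifact is removed) and
    False is returned.
    """
    for share in traffic_steps:
        set_traffic_share(share)
        if not is_healthy():
            set_traffic_share(0.0)  # remove the defective artifact
            return False
    return True


if __name__ == "__main__":
    applied = []
    ok = progressive_rollout(
        traffic_steps=[0.01, 0.10, 0.50, 1.00],  # assumed steps
        set_traffic_share=applied.append,        # stand-in for a real router
        is_healthy=lambda: True,                 # stand-in for real metrics
    )
    print(ok, applied)
```

Because removal is just another step in the loop, a defective artifact in production is handled by the same routine code path as a healthy one.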
Continuous delivery in a microservices-based architecture delivers the following business benefits:
The tooling to support continuous delivery in the microservice architecture is still in the early stages of development. There is no comprehensive solution at present, although there are many early contenders.
Different teams and companies select their own combination of third-party and in-house tools. You need to put together a context-specific tool-set for your microservice system. And you will almost certainly also need to invest in the development of some of your own tooling.
Despite these challenges, the benefits of microservices and continuous delivery are multiplicative, and make business sense.
The path to production
Local validation (such as unit testing and code review) is the first stage of risk management. Once the developer is satisfied that a viable version of the microservice is ready, it is the developer who initiates the pipeline to production. This is an important aspect of the continuous delivery process. Fast deployment to production needs empowered developers.
You have to trust every developer on your team. Everybody can push to production. That doesn’t mean that everybody gets to push a big red button and blow up the world. It means that everybody gets to press the green button that starts the conveyor belt.
The staging environment repeats the validation done in development, but in a controlled environment that is not subject to the normal variances of local developer machines. Staging’s core responsibility is to generate an artifact with an estimated failure risk that is within a defined tolerance. This is what makes it safe for individual developers to press that green button. The steps of the staging process estimate the probability of failure for each artifact.
Production is the live, revenue-generating part of the pipeline. Production is updated when the production system accepts an artifact and a deployment plan, applies the deployment plan under measurement of risk, and rolls back if failure thresholds are crossed.
This pipeline is not so very different from a traditional process. It contains three key innovations: intense automation instead of manual processes, measurement of the reliability of the artifact at each stage, and small – very small – artifacts.
It’s important to distinguish continuous delivery from continuous deployment. Continuous deployment is a form of continuous delivery, where commits, while they may be automatically verified, are pushed directly and immediately to production. Continuous delivery operates at a coarser grain, where sets of commits are packaged into immutable artifacts.
In both cases, deployments can be effectively real-time and occur multiple times per day. From a microservices perspective, it is continuous delivery that you want. The microservice is the unit of deployment.
When continuous delivery is understood to mean continuous delivery of microservice instances, this understanding drives other virtues. Microservices should be kept small, so that verification – especially human verification, such as code reviews – is possible within the desired timeframes of multiple deployments per day.
The pipeline protects you from exceeding failure thresholds by providing measures of risk at each stage to production. It is not necessary to develop complex mathematics to print out a specific probability of crossing the failure threshold. Instead, you can use your knowledge of the system to develop simple scoring metrics that capture the same benefits without excessive work. This is what makes the approach practical.
In development, the key risk measurement tools are code reviews and unit testing. A code review is a binary measure – either it has been done, or it has not. Regardless of the actual probability of failure, you can be absolutely certain that a code review improves your chances. The trick is to make code reviews practical and effective. A code review of the small code volume in a microservice is much more valuable than a code review of a component in a large monolithic code base, simply because there are fewer things to worry about.
Unit testing is another obvious risk management technique. Again, you may score by using a level of coverage that your team is comfortable with. There is no shame in coverage below 100% – it increases your risk, sure, but perhaps that extra velocity makes sense in your business context.
You can introduce other verifications into the development environment, such as code linting, simulations, standardized small test data sets, and so on. Don’t forget the human factor – do a weekly survey of developer happiness (this is very useful and highly recommended).
The techniques you use are not as important as using them to get quantification. Each quality measure generates a score, and you only accept code from development into staging if the aggregate score is high enough. Over time, you can adjust as necessary. You can also adjust for business needs, tightening and easing the score as required to reduce risk or increase velocity, respectively.
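A simple scoring scheme of this kind might look like the sketch below: each quality measure produces a score between 0 and 1, the scores are combined as a weighted average, and the artifact moves to staging only if the aggregate clears a threshold. The measure names, weights, and thresholds are all assumptions chosen for illustration, not recommended values.

```python
from typing import Mapping


def aggregate_score(scores: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Weighted average of quality scores; a measure with no score counts as 0.0."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted = sum(weight * scores.get(name, 0.0) for name, weight in weights.items())
    return weighted / total_weight


def accept_into_staging(
    scores: Mapping[str, float],
    weights: Mapping[str, float],
    threshold: float,
) -> bool:
    """Accept code from development only if the aggregate score is high enough."""
    return aggregate_score(scores, weights) >= threshold


if __name__ == "__main__":
    # Hypothetical measures: a binary code review, unit test coverage, lint result.
    scores = {"code_review": 1.0, "unit_coverage": 0.8, "lint": 1.0}
    weights = {"code_review": 3.0, "unit_coverage": 2.0, "lint": 1.0}
    print(accept_into_staging(scores, weights, threshold=0.90))
```

Tightening or easing the pipeline then amounts to changing `threshold` (or the weights), which is the adjustment for risk versus velocity described above.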
In the staging system, you can verify that the score from development is correct. You can then move on to a further set of validations. Measure the behavior of a microservice in terms of its adherence to the message flows of the system. Measure against a small network of machines that replicate a subset of production. Gather performance data. Implement a stricter code review process. Meet legal and regulatory requirements by providing workflows for manual review and sign-off. These are all possible, and can be done quickly and efficiently, enabling you to retain your ability to deploy quickly.
Again, the key to making this work is to focus on one microservice at a time. That is what makes the difference.
Staging provides you with a critical set of risk scores that let you make a go/no-go decision: can you deploy this new version of a microservice to production? This is where you will need to put in the most effort to build out your own tooling and your own risk measures.
Finally, even in production, the risk of failure continues to be measured. Key metrics, especially those relating to message flow rates, can be used to determine service and system health. You can use incremental deployment techniques to slowly introduce a microservice into production. Instead of sharp, discontinuous changes, you have slow, measured changes. For example, if one microservice normally runs with 10 instances, to upgrade that microservice, you don’t replace all of them at once. Instead, replace them one by one, each time verifying that the system is still healthy.
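The one-by-one replacement can be sketched as a small loop. Everything here is illustrative: the fleet is modeled as a list of version labels, `system_healthy` is a hypothetical check standing in for your real message-flow metrics, and reverting the whole fleet on a failed check is an assumed policy, not the only reasonable one.

```python
from typing import Callable, List, Tuple


def rolling_upgrade(
    fleet: List[str],
    new_version: str,
    system_healthy: Callable[[List[str]], bool],
) -> Tuple[List[str], bool]:
    """Replace instances one at a time, verifying health after each replacement.

    Returns the resulting fleet and a success flag. If any health check
    fails, the original fleet is returned unchanged (assumed rollback policy).
    """
    updated = list(fleet)  # work on a copy so the caller's list is untouched
    for index in range(len(updated)):
        updated[index] = new_version
        if not system_healthy(updated):
            return list(fleet), False
    return updated, True


if __name__ == "__main__":
    fleet = ["v1"] * 10  # the 10 running instances from the example above
    result, ok = rolling_upgrade(fleet, "v2", system_healthy=lambda f: True)
    print(ok, result)
```

In practice the health check would wait for enough traffic to flow through the new instance before judging it, which is what makes the change slow and measured rather than sharp.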
The fact that microservice artifacts all go through the same pipeline, and are all managed in the same way, makes this possible. In particular, you can use approaches such as containerization to make this much easier. The tooling for production management of containers is ideal for production management of microservices – it is the same problem space.
A last word
A common criticism of the microservice architecture is that it is too complex, and requires too much investment in a continuous delivery pipeline. In other words, that it is too much effort at the start of a project. This is plain nonsense. Even if you choose to build using the monolithic architecture, you should still be putting a continuous delivery pipeline in place. The argument that you should go monolith first, and then maybe move to microservices, is an excuse for not building a continuous delivery pipeline first, and is not credible.
If you’d like to read more detail about any of the topics raised here, my book The Tao of Microservices is due for publication in early 2017. In the meantime, the first four chapters are available now from the publisher, Manning.