System reliability: when 99% is better than 100%

It’s natural to aim for perfect reliability, but striving for this can actually be damaging

Luca Lanziani, Damo Girling 19 Jan 2024

Organisations naturally aim for 100% system reliability, but users don’t require this extreme level of uptime. The right approach is to target high reliability, so users get the uptime they actually need.

System reliability is a basic requirement demanded by users. They want to be able to use the services provided by organisations and if the systems aren’t reliable enough then they’ll switch to a competitor.

It's natural for organisations to think their systems need to be 100% reliable 100% of the time to ensure they meet their users' requirements. However, seeking this perfect outcome isn't the best approach. This is because it goes beyond the level of uptime users require and creates a barrier to organisational progress. The key is for organisations to determine the optimal percentage of reliability for the specific systems they have.

SRE delivers the reliability users expect

An extension of modern DevOps (development and operations) practice, Site Reliability Engineering (SRE) plays a crucial role in ensuring organisations deliver the reliability expected by their users.

SRE achieves this by applying software engineering principles to infrastructure and operations processes. It places particular emphasis on improving availability, efficiency, latency, capacity, performance and incident response.

High reliability is better than extreme reliability

Service Level Objectives (SLOs) specify a target for the reliability of systems. Each increase in that target above 99% (be it 0.900 or 0.999 of a percentage point) has a significant impact on how much downtime is allowed:

Daily downtime allowed for 99.000% service reliability: 14m 24s — high reliability
Daily downtime allowed for 99.900% service reliability: 1m 26s — very high reliability
Daily downtime allowed for 99.999% service reliability: 0.86s — extreme reliability
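
The figures above follow from simple arithmetic: the allowed downtime is the length of the period multiplied by the fraction of time the SLO permits the service to be unavailable. A minimal sketch in Python, where the function names are illustrative rather than taken from any particular tool:

```python
SECONDS_PER_DAY = 24 * 60 * 60


def allowed_downtime_seconds(slo_percent: float, period_seconds: float = SECONDS_PER_DAY) -> float:
    """Return the downtime (in seconds) that an SLO permits over a period."""
    if not 0.0 <= slo_percent <= 100.0:
        raise ValueError("slo_percent must be between 0 and 100")
    return period_seconds * (100.0 - slo_percent) / 100.0


def format_duration(seconds: float) -> str:
    """Render seconds as e.g. '14m 24s', or '0.86s' for sub-minute values."""
    if seconds < 60:
        return f"{seconds:.2f}s"
    minutes, rest = divmod(round(seconds), 60)
    return f"{minutes}m {rest}s"


if __name__ == "__main__":
    # Reproduce the three daily figures listed above.
    for slo in (99.0, 99.9, 99.999):
        print(f"{slo:.3f}%: {format_duration(allowed_downtime_seconds(slo))}")
```

Passing a different period_seconds (for example 30 days) gives the equivalent monthly allowance.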

The key is to aim for high reliability, rather than extreme reliability. There are two core reasons for this.

First, users won’t be able to tell the difference between high and extreme service reliability. The difference is too small to register in their overall experience, and other factors (like their internet connection) come into play. However, a user will notice if a service is offline because it's being brought in line with an extreme reliability target.

Second, setting an extreme target will block progress and freeze the organisation. The reason for this is that the effort to maintain such a high uptime will force developers to be extra cautious and mainly focus on keeping the system up. This creates a culture of fear among developers and diverts them away from working on tasks that can accelerate growth for the organisation — such as developing new features and products that bring noticeable user benefits.

However, it’s not so simple as to say all organisations should aim for a flat rate of 99%. The definition of “high” reliability depends on what a user expects from that particular system. Organisations need to set a realistic target — a number that balances keeping their customers happy while also allowing their developers to evolve the service offering.

There are many ways to determine the optimal number, and it all depends on the scale of the business, the industry it operates in, the regions it serves and a myriad of other factors. And it's not necessarily about the number itself but rather how organisations calculate it — such as differentiating business hours versus off-hours when maintenance may be appropriate.

In general, though, error budgets are the simple answer. The error budget is the maximum amount of time a system can fail over a defined period. An organisation’s development team can safely burn through that budget during normal operations, but should stop making changes that put reliability at risk if the error budget is depleting too fast.
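
As a rough illustration of how an error budget and its rate of depletion might be tracked, here is a minimal Python sketch. The ErrorBudget class, the 30-day window and the pause threshold of 1.0 are assumptions made for the example, not values prescribed in this article; each organisation would choose its own.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical helper: an error budget derived from an SLO and a time window."""
    slo_percent: float     # e.g. 99.9
    window_seconds: float  # e.g. 30 days expressed in seconds

    def __post_init__(self) -> None:
        if not 0.0 <= self.slo_percent < 100.0:
            raise ValueError("slo_percent must be >= 0 and < 100 (a 100% SLO leaves no budget)")
        if self.window_seconds <= 0:
            raise ValueError("window_seconds must be positive")

    @property
    def total_seconds(self) -> float:
        """Maximum time the system may fail within the window."""
        return self.window_seconds * (100.0 - self.slo_percent) / 100.0

    def remaining_seconds(self, downtime_so_far: float) -> float:
        """Budget left after the downtime recorded so far."""
        return self.total_seconds - downtime_so_far

    def burn_rate(self, downtime_so_far: float, elapsed_seconds: float) -> float:
        """Fraction of budget spent divided by fraction of window elapsed.

        A value above 1.0 means the budget would run out before the window ends
        if failures continued at the same pace.
        """
        if elapsed_seconds <= 0:
            raise ValueError("elapsed_seconds must be positive")
        budget_fraction = downtime_so_far / self.total_seconds
        window_fraction = elapsed_seconds / self.window_seconds
        return budget_fraction / window_fraction

    def should_pause(self, downtime_so_far: float, elapsed_seconds: float,
                     threshold: float = 1.0) -> bool:
        """Assumed policy: pause risky changes when the burn rate exceeds the threshold."""
        return self.burn_rate(downtime_so_far, elapsed_seconds) > threshold


if __name__ == "__main__":
    day = 24 * 60 * 60
    budget = ErrorBudget(slo_percent=99.9, window_seconds=30 * day)
    # Ten days into the window with 1,296 seconds of downtime recorded.
    print(f"total budget: {budget.total_seconds:.0f}s")
    print(f"burn rate: {budget.burn_rate(1296.0, 10 * day):.2f}")
    print(f"pause releases: {budget.should_pause(1296.0, 10 * day)}")
```

In this example half the budget has been spent in a third of the window, so the burn rate is above 1.0 and the assumed policy would pause risky changes until the rate recovers.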

Elements of a reliable and resilient system

1. Build systems with reliability in mind

Reliability doesn't start in production. Organisations need to build their systems from the ground up with it in mind. The product owner, the software, testing, data and infrastructure engineers, and all the other functions should work together to ensure each component is as reliable and resilient as possible. On resiliency, Donald Firesmith, a trainer at the Software Engineering Institute, explains: “A system is resilient to the degree to which it rapidly and effectively protects its critical capabilities from disruption caused by adverse events and conditions.”

2. Measure the reliability of systems

Observability is a measure of how quickly organisations can pinpoint an issue by looking at their logs, metrics, traces and alerts. The faster and more accurately the SRE team finds the root cause of an issue, the more observable the system is. Having the right tools is the first step on the journey. However, a tool is just a tool.

Organisations need tracing to enable end-to-end visibility, persona-based dashboards, actionable alerts and reduced noise. Together, these enable targeted responses and save their people from notification overload.

This is what Nearform did for a telecoms giant. We developed dashboards that enabled the industry-standard observability our client needed, giving them the resilience they require to protect against the loss of revenue from major outages.

3. Recover fast from incidents

It's 5 AM, and an organisation’s on-call engineer is woken by an alert that says more than X% of card payments are failing… What's the priority? The more time we spend looking for the root cause, the more sales we lose.

The focus should be on recovering from the issue as quickly as possible and stabilising the system, so that it can return to processing payments, while tracking the issue resolution process and reporting updates to customers. All of this should happen following a well-defined incident management process.

And the work doesn’t stop when the system is back up and running. We should track the incident in our ticketing system with all relevant information and start an investigation as part of the problem management process.

4. Hold regular system and process reviews

How often do incidents occur? Why do they happen? Why does it take so long to discover an issue? Why does it take so long to recover from them?

Here, we are zooming out of the single incident and looking at the overall picture. Technology is just one part of the puzzle; we should also focus on people and processes when assessing our posture.

Organisations should run regular postmortems and retrospectives, create a blameless culture to empower everyone to report issues, set clear remediation strategies, and follow up to ensure they are implemented correctly.

Keeping track of incidents and implementing problem management helps us avoid the same technical issues recurring.

Systems should be as reliable as needed

It requires a change of mindset for organisations to aim for anything less than perfection. But in the context of system reliability, ‘perfection’ should be redefined. Instead of being 0% downtime, perfection is the level of uptime that users expect. Users have a clear idea of how much downtime they will tolerate, and it’s missing this target that will be unacceptable. Operating in this way means organisations will have the resilience they need to deliver sustained business impact.