Service Level Objectives (SLOs) specify a target for the reliability of a system. Even a fractional increase in that target (be it 0.900 or 0.999 percentage points above 99%) has a significant impact on how much downtime is allowed:
Daily downtime allowed for 99.000% service reliability: 14m 24s — high reliability
Daily downtime allowed for 99.900% service reliability: 1m 26s — very high reliability
Daily downtime allowed for 99.999% service reliability: 0.86s — extreme reliability
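The figures above come from multiplying the length of a day by the fraction of time the target allows the service to be down. A minimal sketch of that arithmetic follows; the helper name `daily_downtime_seconds` is made up for illustration and is not part of any library.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds in a day


def daily_downtime_seconds(slo_percent: float) -> float:
    """Return the downtime, in seconds, that one day allows at the given SLO."""
    if not 0 <= slo_percent <= 100:
        raise ValueError("slo_percent must be between 0 and 100")
    # The allowed failure fraction is whatever the target leaves over.
    return SECONDS_PER_DAY * (100 - slo_percent) / 100


for slo in (99.0, 99.9, 99.999):
    seconds = daily_downtime_seconds(slo)
    minutes, rest = divmod(seconds, 60)
    print(f"{slo:.3f}% -> {int(minutes)}m {rest:.2f}s")
```

Changing `SECONDS_PER_DAY` to the length of a week or month gives the equivalent allowance over a longer period.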
The key is to aim for high reliability, rather than extreme reliability. There are two core reasons for this.
First, users can’t tell the difference between high and extreme service reliability. The gap is too small to register in their overall experience, and other factors (like their internet connection) have a larger effect. However, a user will notice if a service is offline because it's being brought in line with an extreme reliability target.
Second, setting an extreme target will block progress and freeze the organisation. The effort of maintaining such high uptime forces developers to be extra cautious and to focus mainly on keeping the system up. This creates a culture of fear among developers and diverts them from work that can accelerate the organisation's growth, such as developing new features and products that bring noticeable user benefits.
However, it’s not as simple as saying that all organisations should aim for a flat rate of 99%. The definition of “high” reliability depends on what a user expects from that particular system. Organisations need to set a realistic target: a number that keeps their customers happy while still allowing their developers to evolve the service offering.
There are many ways to determine the optimal number, depending on the scale of the business, the industry it operates in, the regions it serves and a myriad of other factors. What matters is not necessarily the number itself but how organisations calculate it, for example by differentiating business hours from off-hours, when maintenance may be appropriate.
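One way to treat business hours and off-hours differently is to count only the part of an outage that overlaps business hours against the target. The sketch below assumes a 09:00 to 17:00 working day, ignores weekends and time zones, and uses a made-up helper name, `business_downtime`; a real calculation would need an organisation's own calendar.

```python
from datetime import datetime, time, timedelta

BUSINESS_START = time(9, 0)   # assumed opening time
BUSINESS_END = time(17, 0)    # assumed closing time


def business_downtime(start: datetime, end: datetime) -> timedelta:
    """Return the portion of the outage [start, end) inside business hours."""
    total = timedelta()
    day = start.date()
    # Walk each calendar day the outage touches and add up the overlap.
    while day <= end.date():
        window_start = datetime.combine(day, BUSINESS_START)
        window_end = datetime.combine(day, BUSINESS_END)
        overlap = min(end, window_end) - max(start, window_start)
        if overlap > timedelta():
            total += overlap
        day += timedelta(days=1)
    return total


# An outage from 16:30 to 17:30 only counts for its first half hour.
outage = business_downtime(datetime(2024, 1, 8, 16, 30),
                           datetime(2024, 1, 8, 17, 30))
print(outage)
```

Under this approach, a maintenance window scheduled overnight would not count against the target at all.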
In general, though, error budgets are the simple answer. An error budget is the maximum amount of time a system can fail over a defined period. An organisation’s development team can safely burn through that budget during normal operations, but should pause risky changes if the budget is depleting too fast.
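As a rough illustration, an error budget and the pace at which it is being spent can be computed as follows. The class `ErrorBudget`, the 30-day period and the threshold of 1.0 are assumptions chosen for this sketch, not a prescribed method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    slo_percent: float     # reliability target, e.g. 99.9
    window_seconds: float  # length of the defined period

    def __post_init__(self) -> None:
        if not 0 <= self.slo_percent < 100:
            raise ValueError("slo_percent must be at least 0 and below 100")
        if self.window_seconds <= 0:
            raise ValueError("window_seconds must be positive")

    @property
    def total_seconds(self) -> float:
        """Maximum time the system may fail during the period."""
        return self.window_seconds * (100 - self.slo_percent) / 100

    def remaining_seconds(self, downtime_seconds: float) -> float:
        """Budget left after the downtime recorded so far."""
        return self.total_seconds - downtime_seconds

    def burn_rate(self, downtime_seconds: float, elapsed_seconds: float) -> float:
        """Share of budget spent divided by share of the period elapsed."""
        if elapsed_seconds <= 0:
            raise ValueError("elapsed_seconds must be positive")
        budget_used = downtime_seconds / self.total_seconds
        period_used = elapsed_seconds / self.window_seconds
        return budget_used / period_used


THIRTY_DAYS = 30 * 24 * 60 * 60  # assumed period length
TEN_DAYS = 10 * 24 * 60 * 60

budget = ErrorBudget(slo_percent=99.9, window_seconds=THIRTY_DAYS)
rate = budget.burn_rate(downtime_seconds=1296, elapsed_seconds=TEN_DAYS)

# A rate above 1.0 means the budget would run out before the period ends.
if rate > 1.0:
    print(f"Burn rate {rate:.2f}: budget is depleting too fast")
else:
    print(f"Burn rate {rate:.2f}: within budget")
```

In this example, half the budget is gone after a third of the period, so the rate is about 1.5 and the team would have reason to slow down.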