Do you need all those nines?

Written by Matt Poole | Nov 11, 2020 2:42:22 AM

Did you know that automation can reduce the amount of engineered resilience you need to keep unscheduled outages short?

About the nines of availability

When designing a computer system, a product owner will frequently assert that the system must be available 100% of the time during working hours.

Usually, this can be brought down to a number that starts with ninety – the famed “nines of availability”, such as 99.999% (5 minutes and 15 seconds of downtime per year of 24x7 operation), or 99.9% (about 43 minutes per 30-day month) – but whatever the number, there’s a cost.
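To check figures like these for your own target, the arithmetic is simple enough to script. Here is a minimal Python sketch; the function name `allowed_downtime` and the period lengths (a 365-day year and a 30-day month, both 24x7) are assumptions made for illustration.

```python
from datetime import timedelta


def allowed_downtime(availability_pct: float, period: timedelta) -> timedelta:
    """Maximum downtime permitted within `period` at the given availability."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    # The unavailable fraction of the period is simply (1 - availability).
    return period * (1 - availability_pct / 100)


# Assumed period lengths: a 365-day year and a 30-day month, both 24x7.
YEAR = timedelta(days=365)
MONTH = timedelta(days=30)

for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}%: {allowed_downtime(pct, YEAR)} per year, "
          f"{allowed_downtime(pct, MONTH)} per month")
```

Under those assumptions, five nines works out to a little over 5 minutes 15 seconds a year, and three nines to about 43 minutes a month. If the system only has to be up during working hours, pass a shorter period and the budget shrinks accordingly.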

Engineers have a rule of thumb that each extra nine adds a factor of 10 to the build cost: 99% costs 10x more than 90%, 99.9% costs 10x more than 99%, or 100x more than 90%, etc.

There’s room to argue and wiggle, but it’s a useful guideline. Although it may not hold true at the 90%-99% threshold, engineering more nines into your availability does start to get very, very expensive.
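The rule of thumb can be written down as a quick calculation. This is only a sketch of the guideline as stated above, not a cost model: the function names `nines` and `relative_cost` are invented for this example, and the 90% baseline is an assumption.

```python
import math


def nines(availability_pct: float) -> float:
    """Number of nines in an availability figure: 90% -> 1, 99% -> 2, 99.9% -> 3."""
    if not 0 <= availability_pct < 100:
        raise ValueError("availability_pct must be at least 0 and below 100")
    return -math.log10(1 - availability_pct / 100)


def relative_cost(target_pct: float, baseline_pct: float = 90.0) -> float:
    """Rule-of-thumb build cost of `target_pct` relative to `baseline_pct`.

    Assumes each extra nine multiplies the cost by 10, which is a
    guideline rather than a measured figure.
    """
    return 10 ** (nines(target_pct) - nines(baseline_pct))


for pct in (99.0, 99.9, 99.99):
    print(f"{pct}% costs roughly {relative_cost(pct):.0f}x a 90% build")
```

The point is the shape of the curve rather than the exact multipliers: the downtime you buy back shrinks by a factor of ten with each nine, while the bill grows by the same factor.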

Disaster recovery, a cheap alternative

A cheaper way to get more nines is a good disaster recovery (“DR”) position. Rather than engineering the system itself to a very high level of resilience, you build an automated path to recovery from failure, which can shorten an outage to well under an hour.
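To see why recovery time matters so much, it helps to estimate availability from how often a system fails and how long each recovery takes. The sketch below does that in Python. The failure rate and recovery times are made-up numbers for illustration, not measurements from any real system, and the function name `availability_pct` is hypothetical.

```python
def availability_pct(failures_per_year: float, hours_to_recover: float) -> float:
    """Estimated availability over a 365-day, 24x7 year.

    Assumes every failure causes a full outage lasting `hours_to_recover`.
    """
    hours_per_year = 365 * 24
    downtime_hours = failures_per_year * hours_to_recover
    if downtime_hours < 0 or downtime_hours > hours_per_year:
        raise ValueError("total downtime must fit within one year")
    return 100 * (1 - downtime_hours / hours_per_year)


# Hypothetical inputs: two failures a year, recovered either by a manual
# rebuild (assumed 8 hours) or an automated restore (assumed 30 minutes).
manual = availability_pct(failures_per_year=2, hours_to_recover=8)
automated = availability_pct(failures_per_year=2, hours_to_recover=0.5)

print(f"manual rebuild:    {manual:.3f}%")
print(f"automated restore: {automated:.3f}%")
```

With these invented inputs, the automated restore stays above 99.98% while the manual rebuild falls below 99.9%, on exactly the same single server. The failure rate has not changed; only the time to recover has.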

An OSS Group client, a leading life insurance company in New Zealand, approached us for help with the infrastructure design for deploying Tableau Server into AWS. This was not going to be a system of record or otherwise subject to significant availability requirements, but the question of high availability (“HA”) or clustering came up all the same.

Tableau Server has native support for cluster operation, which gives extra resilience. But more servers mean more operating cost, and it wasn’t clear whether the added complexity was necessary.

Once the product owner consulted with the business further, it became apparent that this was definitely not a high-nines system, at least in the starting phases. This meant the recovery-time objective (“RTO”), or the maximum time allowed to return the system to operation, was several hours.

Part of the managed service offering OSS Group provides for this client is fully automated creation of backup images of EC2 virtual servers.

By coupling these images with monitoring and alerting, and an automated recovery process, a failed server can be restored within minutes and then brought back into service.

Some business process tweaks ensure that creators of reports keep a local copy of any changes for at least a day after publishing, so work published since the last backup image can be republished after a restore. With this in place, we deliver a much-reduced outage window without the client having to spend thousands of dollars a year running multiple servers.

