Redundancy Planning, more work than adding one of everything

Since I started my career, redundancy has come up in almost every deployment discussion. The general best practice is to add one additional element to each service tier, also known as N+1 redundancy. This approach is straightforward, but many people would be surprised by how often these schemes fail. In one very famous incident in San Francisco, a data center lost power in the majority of its co-location suites because its N+2 (one better than N+1) backup power generation scheme failed.

Start with the easy things

Start by looking at each individual component in your stack and asking: if this system fails, does it fail independently? If you do this with your stack, you’ll generally find that pieces that scale easily horizontally have failure boundaries isolated to the system itself. For example, if a web server fails, it generally has no impact on service, because concurrency is maintained elsewhere, but it will reduce capacity. This is the easiest case to plan for, because an extra server will typically take care of the issue.
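As a rough illustration of the capacity side of this, here is a minimal sketch of N+1 sizing for a horizontally scaled tier. The function names and the traffic figures are hypothetical, and it assumes every server handles the same load and that failures do not affect the remaining servers.

```python
import math


def servers_needed(peak_rps: float, rps_per_server: float, spares: int = 1) -> int:
    """Servers to provision so peak load is still served after `spares` failures.

    Assumes identical servers and evenly spread load (a simplification).
    """
    if rps_per_server <= 0:
        raise ValueError("rps_per_server must be positive")
    base = math.ceil(peak_rps / rps_per_server)  # N: enough to carry peak load
    return base + spares                         # N + spares


def surviving_capacity(total_servers: int, rps_per_server: float, failed: int) -> float:
    """Capacity left once `failed` servers are out of rotation."""
    return max(total_servers - failed, 0) * rps_per_server


if __name__ == "__main__":
    # Hypothetical numbers: 900 requests/s at peak, 200 requests/s per server.
    total = servers_needed(900, 200)  # N = 5, plus one spare
    print(total, surviving_capacity(total, 200, failed=1))
```

With these made-up numbers, losing one server still leaves enough capacity for peak load; losing two does not, which is the limit of what N+1 promises.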

Now with the hard things

When you look at components such as database masters or storage nodes, the story becomes more complex. This type of equipment generally has failure boundaries that extend beyond itself. A rack full of application servers may become useless when they can no longer reach a database for writes. You don’t truly have redundancy here until you have a scheme for fail-over. Without planning, you may find yourself trying to figure out slave promotion at 2:32 am.
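One way to make a failure boundary concrete is to write down hard dependencies and walk them. The sketch below is only an illustration: the component names and the `failure_boundary` helper are hypothetical, and it models only hard dependencies with no fail-over.

```python
from collections import deque

# Hypothetical stack: each component lists the components it cannot work without.
STACK = {
    "web-1": [],
    "web-2": [],
    "app-1": ["db-master"],
    "app-2": ["db-master"],
    "db-master": [],
}


def failure_boundary(failed: str, depends_on: dict[str, list[str]]) -> set[str]:
    """Return every component that stops working when `failed` goes down."""
    # Invert the map: component -> components that depend on it.
    dependents: dict[str, set[str]] = {}
    for component, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(component)

    # Breadth-first walk outward from the failed component.
    affected = {failed}
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected


if __name__ == "__main__":
    print(sorted(failure_boundary("web-1", STACK)))
    print(sorted(failure_boundary("db-master", STACK)))
```

In this toy model a web server's boundary is just itself, while the database master takes both application servers with it, which is why an extra box alone is not redundancy for that tier.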

Then with the hard and really expensive things

Core infrastructure needs love too. Things like rack power, networks, carriers, cloud providers, and buildings have failure boundaries as well. Unfortunately, they extend to several portions of your stack at once. They are very difficult to plan around, and redundancy for them often takes a significant investment. The datacenter mentioned above used 2 spare generators as redundancy for all of its co-location suites; when 3 of the primary generators failed, so did the redundancy plan. They had let each suite become dependent on all of the other suites having normal power operations.
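The arithmetic behind a shared spare pool is simple enough to sketch. The function below is a hypothetical illustration, not a model of how that datacenter was actually wired: it assumes any spare can stand in for any failed primary, and nothing more.

```python
def uncovered_failures(failed_primaries: int, shared_spares: int) -> int:
    """Number of failed primaries left with no spare to take over.

    Assumes a single shared pool where any spare can replace any primary.
    """
    if failed_primaries < 0 or shared_spares < 0:
        raise ValueError("counts must be non-negative")
    return max(failed_primaries - shared_spares, 0)


if __name__ == "__main__":
    # Generator counts taken from the incident described above.
    for failed in (1, 2, 3):
        print(failed, "failed ->", uncovered_failures(failed, shared_spares=2), "uncovered")
```

A pool of 2 spares covers any one or two failures anywhere, but the third failure is uncovered no matter which suite it hits, so every suite's safety margin depends on how the others are doing.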

Finally, figure out what you have to do

Once you’ve identified all of your failure boundaries, it’s time for the fun part: financial discussions! Remember, while it’s important to have backups of all data, redundancy is a financial hedge. When planning, try to figure out what downtime costs and to what extent the business is willing to fund protection against it. It’s not uncommon for multi-datacenter redundancy to require an application change, but it’s probably not worth the investment if you have no customers. Set a target and engineer a system that meets it within the budget.
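To frame that financial discussion, a back-of-the-envelope comparison can help. This is a minimal sketch with hypothetical function names and made-up figures; real estimates would need your own outage history and revenue numbers.

```python
def expected_annual_downtime_cost(outages_per_year: float,
                                  hours_per_outage: float,
                                  cost_per_hour: float) -> float:
    """Expected yearly cost of downtime: frequency x duration x hourly cost."""
    return outages_per_year * hours_per_outage * cost_per_hour


def redundancy_pays_off(cost_without: float,
                        cost_with: float,
                        annual_redundancy_cost: float) -> bool:
    """True when the downtime cost avoided exceeds what the redundancy costs."""
    return (cost_without - cost_with) > annual_redundancy_cost


if __name__ == "__main__":
    # Hypothetical: 2 outages a year at 10,000 per hour of downtime.
    without = expected_annual_downtime_cost(2, 4.0, 10_000)    # 4 hours each today
    with_ha = expected_annual_downtime_cost(2, 0.25, 10_000)   # 15 minutes with fail-over
    print(without, with_ha, redundancy_pays_off(without, with_ha, 50_000))
```

If the hourly cost of downtime is close to zero, as it is with no customers, the avoided cost is also close to zero and almost no redundancy spend clears the bar.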