
Your disaster recovery strategy has a hole

Disaster recovery (DR) plans are almost always about physical or virtual infrastructure, but this misses where most of today’s real disasters happen: in the software.

The software says it's fine!

The infrastructure DR plan typically doubles everything (at least), so that if a “thing” fails you have another one of those “things” to pick up the work. This is a huge oversight: if you think back over where issues have actually happened, and where they are more likely to happen now that we live in a virtualised environment, software is the source of far more problems.

While hardware infrastructure is hard to plan around, it has some well-known principles. Software, by contrast, tends to be the poor cousin and lacks a DR plan. This is more of a risk than most businesses admit: by the very nature of the environment, most errors happen in software rather than hardware.

Why should we care about having a software DR plan? First and foremost, disasters have an impact on your brand. People notice disasters, even minor ones, whether they are internal customers who lose faith in your system, key clients who slowly move away from you, or the press latching onto your aggressive cost cutting (e.g. a certain global airline). Not having a DR plan at the software level is a risk.

This is best summarised as: measure a disaster by the potential impact it could have, not by the likelihood of it happening.

So, how do you go about creating a software DR plan?

Let’s look at how it’s done for the hardware and infrastructure. Picture the scene:

You have just launched a new product and the press loves it. You are pushing nightly builds to your cloud infrastructure, with weekly updates for your IoT devices delivered over the air. You have continuous delivery down to a fine art.

Then two problems happen. The first in the hardware; the second in the software.

The first is that a key load balancer dies. Or the database replication stops. Or the subnet config for part of the DMZ gets overwritten and now your database can’t see your analytics platform.

"Spin up another one, kill the old one"

What do you do? Simple: you spin up another one and kill the old one. You might leave the old one running for a while to diagnose the root cause, but fundamentally you know the fix: replace it.
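That replace step is mechanical enough to script ahead of time. Here is a minimal sketch, assuming AWS and the boto3 library; the image ID, instance ID and instance type are hypothetical placeholders rather than anything from a real plan:

```python
import boto3

# Hypothetical IDs: substitute your own known-good image and the failed instance.
FAILED_INSTANCE_ID = "i-0123456789abcdef0"
REPLACEMENT_IMAGE_ID = "ami-0123456789abcdef0"

ec2 = boto3.client("ec2")

# Spin up a replacement from the known-good image...
ec2.run_instances(
    ImageId=REPLACEMENT_IMAGE_ID,
    InstanceType="m5.large",
    MinCount=1,
    MaxCount=1,
)

# ...and kill the old one (or stop it instead, and keep it for root-cause analysis).
ec2.terminate_instances(InstanceIds=[FAILED_INSTANCE_ID])
```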

The second is a “weird bug” in the software. The kind which ends up on email threads as “that missing record issue” or “the thing from last Thursday”. These come in two flavours: slow burners and sudden flashes. These are disasters, just not the lights-out version you get with hardware. They are critical to your customers, your brand will be hurt, and your dev team, who are hard at work on the release due in four weeks’ time, is distracted.

What do you do?

A software disaster, unlike a hardware disaster, can’t usually be rolled back or replaced with a “working one”. Worse still, a software disaster can corrupt your business data, and as more data flows through the corrupt code, things only get worse.

The fact is this: as more of the infrastructure becomes software, and as more systems become interconnected, the cause of disasters is moving around the stack, away from the well-understood kill-and-replace methodology and towards the difficult what-the-hell-is-happening panic. This is where your software DR planning needs to live.

The solutions employed to address the what-the-hell-is-happening panic boil down to gathering as much information as possible (and coffee). And this is why debugging in production is such an essential (yet often overlooked) part of an organisation’s disaster recovery strategy.

Unlike infrastructure, which commands budgets that allow for double or more the required capacity (double the routers, failover databases), software relies on good logs and other sources of information. That means developers must exercise good judgement about logging, which includes logging things they aren’t thinking about, that aren’t in the spec, and that don’t seem to matter to anyone at the time the software is written.
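As a rough sketch of what that kind of logging judgement can look like in practice (plain Python with the standard logging module; the orders logger and process_order function are made up for illustration):

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
log = logging.getLogger("orders")

def process_order(order):
    # Record the context that seems unimportant today but is gold during a
    # disaster: identifiers, sizes and branch decisions, not just errors.
    log.info("processing order id=%s items=%d", order["id"], len(order["items"]))
    if not order["items"]:
        log.warning("order id=%s has no items, skipping", order["id"])
        return None
    total = sum(item["price"] for item in order["items"])
    log.info("order id=%s priced total=%s", order["id"], total)
    return total

process_order({"id": "A123", "items": [{"price": 10}, {"price": 5}]})
```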

"We need to rely on software to help us build quality software"

If the infrastructure fails, you replace it because you have a methodology for it. But if the software fails, the disaster not only distracts the dev team from their deadlines (and their weekends), but also pulls in the customer service team, the support team and potentially account managers and the press. Avoidable problems are always the most annoying.

The deadline will be missed, or met at a stretch, which probably means letting the quality slip. And the cycle continues.

The lesson which comes from every disaster is to include software monitoring as part of your disaster recovery plan, and not to leave it as something for the coders to think about while they’re writing the code. You need detailed in-application monitoring which records as much detail as possible, so that when, not if, you’re hit by a disaster, the development team has access to all the information on what happened and can cut the time the disaster lasts.

We can’t rely on developers to plug in extra statements printing out events on every line of code. We need to rely on software to help us build quality software.
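As one illustration of what “software helping us” can mean, here is a minimal sketch, not a prescription, of tooling that captures detail automatically: a decorator that records the arguments, results and exceptions of any function it wraps. The record_calls and apply_discount names are hypothetical.

```python
import functools
import logging

logging.basicConfig(level=logging.DEBUG, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("instrumentation")

def record_calls(func):
    """Automatically record arguments, results and exceptions for a function."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        log.debug("call %s args=%r kwargs=%r", func.__name__, args, kwargs)
        try:
            result = func(*args, **kwargs)
        except Exception:
            # Capture the failure with full context, instead of relying on a
            # developer having added a print statement on this exact line.
            log.exception("exception in %s args=%r kwargs=%r", func.__name__, args, kwargs)
            raise
        log.debug("return %s -> %r", func.__name__, result)
        return result
    return wrapper

@record_calls
def apply_discount(total, code):
    rates = {"SAVE10": 0.10}
    return total * (1 - rates[code])  # an unknown code raises, and gets recorded

apply_discount(100.0, "SAVE10")
```

Dedicated debugging-in-production and in-application monitoring tools take this idea much further, but the principle is the same: the recording happens systematically, not because a developer remembered to add a print statement.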