A focus on Business Continuity - Business Continuity should be a tested part of holistic system design, not a bolt-on afterthought

Lots of people never think about the possibility of things going wrong, but some do, and it is part of the governance of an enterprise that someone has thought about business continuity.

However, even when they have, they perhaps haven’t thought about the holistic problem – business continuity (keeping the business up and running and available to customers) is a lot more important than merely having backups available, without much idea of how you’ll use them in any specific contingency. And risk management is necessary but not always present – does the CEO really want 0% downtime, no matter how much it costs and regardless of the impact on business agility? People’s perception of risk is also often unrealistic – people are increasingly scared by “ransomware” (nasty and new and external) but most system outages are still caused by system failures and human error (and, possibly, systems that aren’t designed with appropriate resilience in mind). Overall, too, most people are much more optimistic about their chances of successful business recovery than turns out to be justified when disaster actually happens.

All this was underlined by a professional survey I’ve just seen, of 250 UK people responsible for the IT Disaster Recovery (DR) plan for a company with over 500 staff. It was carried out by Opinion Matters for DRaaS (Disaster Recovery as a Service) specialists iland. I prefer the term “Business Continuity”, which has a wider scope and covers slow-downs etc. as well as disasters, to “Disaster Recovery” – but the terms are near-enough equivalent for the purposes of this blog. Key results from the iland survey include:

95% of companies faced an IT outage in the last 12 months; System Failure and Human Error are the most likely causes, ahead of cyber attack and the dreaded “unexplained downtime”, with environmental threats (lightning, earthquake etc.) coming last.
87% of respondents had executed a failover in the last year; although 82% were confident of success, 55% had problems nonetheless.
About a third of respondents don’t have a fully trained Disaster Recovery team and aren’t testing their DR Plan often, if at all (which means that their DR Plan is probably somewhat illusory).
Cloud-based DR is increasing (over a third of respondents use it) but security is still a major concern. In fact, cloud-based security (based on process and documented SLAs) is likely to be better than on-premise security (often, on-premise security policies aren’t enforced effectively and there are often people-based security loopholes), in my opinion.

The recommendations from iland, based on the results of the full survey, with my comments, are:

Balance downtime and cost (ensure that you can achieve required recovery times without exceeding the budget – that is, it is about risk management, not risk elimination);
Ensure that DR testing is easy and cost-effective (choose the right solution, one that supports non-intrusive testing, not just any solution; and make sure that any cloud SLAs allow you to test DR and let you have access to your DR management data);
Address security and compliance (this means build it in, not bolt it on; and discuss the security implications for DR/Business Continuity with in-house or external experts before designing your business continuity plans, not afterwards).
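
To make the first recommendation concrete, here is a minimal Python sketch of the trade-off between downtime and cost. It is only an illustration of the risk-management arithmetic, not anything taken from the survey: the option names, costs, recovery times, outage frequency and hourly cost of downtime are all made-up assumptions, as are the names `DrOption`, `expected_annual_total` and `cheapest_acceptable_option`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DrOption:
    """A hypothetical recovery option; every figure used with it is illustrative."""
    name: str
    annual_cost: float      # yearly spend on this option
    recovery_hours: float   # expected time to recover from a single outage


def expected_annual_total(option: DrOption,
                          outages_per_year: float,
                          downtime_cost_per_hour: float) -> float:
    """Yearly spend on the option plus the expected yearly loss from downtime."""
    downtime_loss = outages_per_year * option.recovery_hours * downtime_cost_per_hour
    return option.annual_cost + downtime_loss


def cheapest_acceptable_option(options: List[DrOption],
                               outages_per_year: float,
                               downtime_cost_per_hour: float,
                               max_recovery_hours: float) -> Optional[DrOption]:
    """Pick the lowest total-cost option that still meets the required recovery time."""
    acceptable = [o for o in options if o.recovery_hours <= max_recovery_hours]
    if not acceptable:
        return None  # nothing meets the recovery target: revisit the target or the budget
    return min(
        acceptable,
        key=lambda o: expected_annual_total(o, outages_per_year, downtime_cost_per_hour),
    )


# Assumed, purely illustrative figures.
OPTIONS = [
    DrOption("backups only", annual_cost=10_000, recovery_hours=48),
    DrOption("cloud-based DR", annual_cost=60_000, recovery_hours=4),
    DrOption("near-zero downtime", annual_cost=500_000, recovery_hours=0.5),
]

best = cheapest_acceptable_option(
    OPTIONS, outages_per_year=2, downtime_cost_per_hour=5_000, max_recovery_hours=8
)
print(best.name if best else "no option meets the recovery target")
```

With these assumed numbers, the near-zero-downtime option costs far more in total than the downtime it saves, which is the point of managing risk rather than trying to eliminate it. Real figures would, of course, have to come from your own business impact analysis.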

These recommendations seem very sensible to me. I’d just like to repeat, for emphasis, that appropriate business continuity should be baked into the design of systems from the first, not bolted on afterwards – and you should make sure that you take a holistic view of the system, including manual processes and people issues.

It is not unknown for a company to recover all its IT and data after a disaster, only to find that the business still can’t operate, because key people or communications links aren’t available (of course, this is less likely if you actually test your business continuity plans regularly). I’d also note that customer communication is important, if customers might be impacted by an incident (or hear about it on Twitter) – letting your customers know that you are still around and in control of the situation (especially after a visible or newsworthy disaster) is probably even more important than getting a broken database back on-line.

I also think that business continuity (or disaster recovery and so on) is one area where a company might not (I hope) have a lot of internal practical experience to call on. Engaging with a third party specialist may make economic sense, not least because a third party can “speak truth unto power”. Telling the CEO that his demand for 0% downtime and 100% security, at zero additional cost, is unrealistic may be seen as “negative thinking” and will probably be career limiting. Nevertheless, business continuity is well worth investing in – I believe that a significant number of companies suffering a major and visible outage go out of business further down the line, without necessarily realising that, whatever the immediate cause, their problems started when customers lost trust in the company and started looking at the competition, back when a major outage was mishandled.