Content Copyright © 2021 Bloor. All Rights Reserved.
Also posted on: Bloor blogs
Fear not, this is not going to be a nerdy deep dive into the technical minutiae of this week’s Facebook outage. Plenty of others have done that to death, but if you are interested, the clearest and most detailed description of what occurred can be found in the blog and follow-up video podcast from the folks at ThousandEyes. Rather, I am going to focus on the implications for business and IT leaders that flow from the conclusions of their analysis.
Given the importance of IT in general, and networks and the internet in particular, to Facebook’s business model, you have to ask whether this particular failure scenario was high up on, if not at the top of, their corporate risk register. Even if it was, the inclination of many people to ignore or downplay a risk that is unlikely to happen very often (if at all), even though the impact of the failure would be catastrophic, is all too common. The challenge is that you can spend enormous amounts of time, energy and money on minimising the chance of it happening, and on putting mitigations in place to deal with it if it does, and then it never happens. Nobody likes spending money on insurance until the disaster strikes.
You will know which applications and services are important in driving customer experience, revenue and profitability. As a first step, you should be challenging your IT people to prove they have identified, and can seamlessly monitor, all the various layers and elements of the IT and network infrastructure that those critical applications and services use. Only then can you assess all your application dependencies and, therefore, your risks. There is no excuse: good tools are out there. Acquire them and use them.
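To make the point concrete, even the first layer of that dependency monitoring can be automated very simply. Below is a minimal sketch in Python (the dependency names and hostnames are hypothetical placeholders, not a real product); it checks whether each external dependency still resolves in DNS, which is precisely the layer that vanished during the Facebook outage when BGP withdrawals took Facebook’s DNS servers off the internet.

```python
import socket

# Hypothetical examples -- replace with the external services your
# critical applications actually depend on.
DEPENDENCIES = {
    "payment-gateway": "payments.example.com",
    "social-login": "graph.facebook.com",
}

def check_dns(hostname, resolve=socket.gethostbyname):
    """Return True if the hostname still resolves in DNS.

    The resolver is injectable so the check can be exercised in tests
    without touching the network.
    """
    try:
        resolve(hostname)
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False

def dependency_report(deps, resolve=socket.gethostbyname):
    """Map each named dependency to its current DNS health."""
    return {name: check_dns(host, resolve) for name, host in deps.items()}
```

A real monitoring tool would of course go much further (BGP visibility, HTTP reachability, latency), but the sketch illustrates the principle: each layer your critical services depend on should be observable, so a failure in any one of them is detected rather than inferred after the fact.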
Most large organisations have a business continuity plan, and most of those plans will claim that they have no single points of failure, or have mitigated them. There is much talk about designing in resilience and business continuity from the start. I’m sure most organisations believe that is being done. I’m sure Facebook did, but the fact of the matter is that there was clearly more than one “single point of failure” in this incident (follow the ThousandEyes article and podcast and you will spot them). To be fair, the Facebook IT and network environment is enormous and highly complex, and spotting a single point of failure may not be that simple. That is why testing your business continuity plans, at least annually, is so critical. In large, complex environments this can be a costly burden, but it comes back to risk assessment and risk management. If a failure can knock $4bn off your share value, testing might just be worth the effort.
An unfortunate by-product of the Facebook outage, identified by Rory Cellan-Jones on his “5 minutes on” podcast on BBC Sounds, was the adverse impact it had on small businesses who rely on Facebook to drive traffic to their websites. It might just be that your critical single point of failure is not your own IT infrastructure but the IT infrastructure of a partner you rely on. Mind where you put your eggs.