Content Copyright © 2015 Bloor. All Rights Reserved.
Last week I had a tour of the Equinix LD5 datacentre in Slough. It’s an impressive facility with a six nines availability record. One contributory factor is clearly the way in which all critical procedures are doubled up: two people check that a particular procedure has been completed properly, not unlike the pre-flight checks in an aeroplane cockpit.
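To put "six nines" in perspective, the arithmetic can be sketched in a few lines of Python. This is only a rough illustration: it assumes a 365-day year and treats availability as a simple fraction of elapsed time, and the function name is made up for the example rather than taken from any Equinix figure or tool.

```python
# Assumption: a non-leap year of 365 days.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def allowed_downtime_seconds(nines: int) -> float:
    """Yearly downtime budget for an availability of `nines` nines.

    Six nines means 99.9999% availability, so the unavailable
    fraction of the year is 10 to the power of minus six.
    """
    unavailability = 10.0 ** -nines
    return unavailability * SECONDS_PER_YEAR


for n in (5, 6):
    seconds = allowed_downtime_seconds(n)
    print(f"{n} nines: about {seconds:.1f} seconds of downtime per year")
```

On these assumptions, five nines allows a little over five minutes of downtime a year, while six nines allows roughly half a minute, which leaves very little room for a single human slip.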
Over recent years a number of studies into the causes of datacentre outages have attributed between 57% and 75% of them to human error. Look a little further and issues like “improper failover” probably have a human element as well, so the percentages may even be a little conservative. Indeed, a datacentre manager recently admitted privately to me that “most outages are down to human error”. So is doubling up the only, or the best, solution?
One immediate challenge to that is the scarcity of people, or at least of good, qualified people. A recent Gartner report estimated that, across the IT industry in Europe, there would be a shortage of 1.3 million qualified people by 2020. Universities like Leeds and Anglia Ruskin in the UK are beginning to address this issue with specific master’s degrees in datacentre design, datacentre leadership and management, and a focus on datacentre research projects. But the challenge is likely to remain for the foreseeable future, and doubling up may become a very expensive option for all but the most critical procedures.
Many datacentres seem to have an interesting mix of “old lags” and young, inexperienced staff. According to Owen Ashby at people risk experts Cognisco, this can lead to situations where the older, very experienced staff member becomes overconfident about their knowledge and capability, while the new, inexperienced staff member may be capable but lacks the confidence to challenge upwards. Many HR systems claim to help you manage staff competency, but in reality the challenge is to maintain it, and to be able to assess how staff are likely to act and behave in the real working environment or when under pressure from peers.
Much of the orthodoxy around managing people focuses on process and training. Develop and document well thought-out processes, ensure comprehensive training takes place, and the risk of failure, or outages, will be minimised…perhaps. But despite all the training, mistakes still happen. Being able to identify staff who are consciously competent, unconsciously competent, consciously incompetent or, worst of all, unconsciously incompetent enables you to tailor appropriate remedial actions. Training isn’t always needed; sometimes mentoring and coaching are more appropriate, or risky staff can be re-assigned. Indeed, firms that have used this approach often find that they save significant sums of money by not simply sheep dipping everyone with training.
Risk is generally well understood by businesses. For example, IT security is all about assessing how likely something is to happen and, if it does, how serious the impact will be, and then implementing the right level of security. It is strange, then, that few organisations assess people risk in the same way.
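The likelihood-and-impact assessment described above is often reduced to a simple score, and the same arithmetic could in principle be applied to people risk. The Python sketch below is purely illustrative: the 1 to 5 scales, the thresholds, the `PeopleRisk` class and the example entries are all invented for this post, not drawn from Cognisco or any real assessment.

```python
from dataclasses import dataclass


@dataclass
class PeopleRisk:
    description: str
    likelihood: int  # 1 (rare) to 5 (almost certain); hypothetical scale
    impact: int      # 1 (negligible) to 5 (severe); hypothetical scale

    @property
    def score(self) -> int:
        # Simple risk-matrix scoring: likelihood multiplied by impact.
        return self.likelihood * self.impact


def rating(score: int) -> str:
    # Thresholds are illustrative assumptions, not an industry standard.
    if score >= 15:
        return "high"
    if score >= 8:
        return "medium"
    return "low"


# Invented example entries, for illustration only.
risks = [
    PeopleRisk("Experienced engineer skips a checklist step", 3, 5),
    PeopleRisk("New starter does not challenge a senior colleague", 4, 4),
    PeopleRisk("Procedure document is out of date", 2, 3),
]

# Rank the risks so the highest scores are dealt with first.
ranked = sorted(risks, key=lambda r: r.score, reverse=True)
for risk in ranked:
    print(f"{risk.score:>2}  {rating(risk.score):<6}  {risk.description}")
```

The point of a ranking like this is to direct effort: the highest-scoring behaviours might justify doubling up or mentoring, while the lowest may need nothing more than a periodic review.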
Given the potential impact of outages on the reputation of a datacentre, people risk should be on the agenda of the Board. In a highly competitive datacentre market, achieving that elusive sixth nine could be as simple as identifying and mitigating poor behaviours.
Image: Spinster Cardigan
This post first appeared on the old Cassini Reviews website.