Building TRUST in your infrastructure - it's easy to lose, but harder to build and much harder to restore

For the average man or woman on the Clapham Omnibus there is very little consideration about whether they TRUST the IT infrastructure on which they depend for so many of their daily interactions. It is only when major incidents occur, like the recent WannaCry attack that shut down large sections of the NHS infrastructure, the data centre outage at British Airways or the data security breach at TalkTalk, that they begin to question their TRUST that it just works.

For the CIO, the resilience and security of his or her IT infrastructure should be a major priority. After all, damage to the reputation of the business and TRUST in the brand is often the result of IT failures. In reality, the CIO has an unenviable position, caught between customers who expect 100% uptime and a Board that is often unable or unwilling to grasp the importance and increasing complexity of the underlying IT infrastructure that supports their business.

The debate around Board assurance and customer expectations will be for another time, although, in the meantime, you can get a quick start by reading this article from Kevin Borley. Rather, I want to try and lay out and describe the various elements that go to make up TRUST and how and where they play in an increasingly complex IT infrastructure environment. At Bloor, we see TRUST being made up of 5 key elements: security, governance, risk, compliance and last, but not least, resilience. Let’s take a look at each one in a little more detail.

Security

IT Infrastructure is physical. There are data centres, servers, disk drives, network cables, comms towers, wi-fi base stations to name just some elements. Go to any modern large data centre and you will see double chain-link fences with barbed wire tops, security gates with protection against ram raiders, entrances that prevent tail-gating, large scale CCTV surveillance, biometric access to data halls and individual racks. It is perhaps no surprise that the Docklands data centre that houses the London Internet Exchange is seen by the Government’s COBRA committee as a critical UK infrastructure asset. Stopping the bad guys damaging or getting access to the infrastructure is a key element of security, and you should also remember that the insider threat is often greater than any from outside.

The rapid development of new technologies and new deployment models continues to create new and different cyber risks that need to be balanced and prioritised against older risks. The Consumerisation of IT brought with it challenges around perimeter security and user authentication. Server virtualisation combined with multi-tenant hosting meant that there needed to be a rethink around traditional IT network security. Now the rapid development and deployment of IoT devices has highlighted once again the vulnerability to cyber-attack of new devices and networks that have not had adequate security built in from the start. Firewalls, server hardening, encryption and authentication have not gone away, but they are no longer enough. DDoS attacks in particular require excellent intrusion detection facilities, agile and resilient load balancing capabilities and, above all, an attitude of threat assessment and prevention.

The main point here is that your security is only as strong as your weakest link… and today’s complex infrastructure environments have plenty of links.

Governance

In IT, governance is often associated with information and is seen as a series of legal and best practice standards that govern the way in which an organisation handles its information – in particular, the personal and sensitive information of customers and employees. However, in the broader context of the management of the whole of an organisation’s infrastructure, IT governance is about the way rules, norms and actions are structured, sustained, regulated and held accountable. It is a subset of the Corporate governance regime that the Board has to have in place. The key element in this is accountability. Governance should not be a tick-box exercise. It is the way in which an organisation’s Board ultimately assures itself about the resilience and security of the IT infrastructure to deliver on the organisation’s business objectives.

Despite many years of IT outsourcing, hosting and now Cloud services, there remains a lack of understanding that when you outsource elements of your IT infrastructure you are not giving up accountability. This lack of understanding blinds Boards to questions about who is responsible and accountable for certain functions. An Infrastructure as a Service provider will guarantee the security of their hardware and systems, but they will not, in most cases, take responsibility for securing the data you hold on their systems. A recent Barracuda survey (registration required) of 550 IT decision makers in EMEA shows that 64% of those surveyed believed their IaaS provider was responsible for securing their data in the Cloud. Being aware of these issues and putting mitigations in place is a key governance issue for Boards – if your service provider loses personally identifiable data, the GDPR regulators will come after you, waving their multimillion-euro fines, because you carelessly chose the wrong service provider.

Risk

The focus here is on Risk Management. Every organisation will have a different appetite for risk, and understanding that appetite is important. The critical part is, having set that level, to have a very clear way of managing the risks. There are two key components to risk: how likely is the risk to occur and, if it does, what impact will it have?

The Board needs to see the IT risks that are likely to impact the business. Those that have the potential to seriously disrupt a business should all receive close attention, with the serious ones that are more likely to occur receiving particular focus. What is being done to reduce the likelihood of the risk occurring and, if it does occur, what has been done to mitigate its impact?
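The likelihood and impact view described above can be sketched as a very small risk register. This is only an illustration: the 1 to 5 scales, the impact threshold, the multiplied score and the example risks are all assumptions made for the sketch, not a Bloor method or an industry standard.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (almost certain)
    impact: int      # assumed scale: 1 (negligible) to 5 (severe)

    @property
    def score(self) -> int:
        # Likelihood multiplied by impact: a common convention, not the only one
        return self.likelihood * self.impact


def board_view(risks, impact_threshold=4):
    """Keep only risks with serious impact, listing the most likely first."""
    serious = [r for r in risks if r.impact >= impact_threshold]
    return sorted(serious, key=lambda r: (r.likelihood, r.impact), reverse=True)


# Hypothetical register entries, for illustration only
register = [
    Risk("Ransomware on unpatched servers", likelihood=4, impact=5),
    Risk("Data centre power failure", likelihood=2, impact=5),
    Risk("Printer queue outage", likelihood=4, impact=1),
]

for risk in board_view(register):
    print(f"{risk.name}: score {risk.score}")
```

The low-impact printer risk is filtered out of the Board view even though it is likely to occur, while the two serious risks are ordered so that the more likely one comes first.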

The CIO needs to ensure that s/he can drill down to, and measure accurately, the individual elements in the infrastructure to understand the risks of individual component failures, security breaches or capacity issues and how they ultimately affect the risks that the Board sees. Then, the CIO can assess how much time and effort are put into ensuring the problems don’t occur and/or mitigating against their impact.

Compliance

According to the International Compliance Association “the term compliance describes the ability to act according to an order, set of rules or request.” The important word in there is “act”. It isn’t just about obeying a set of externally imposed rules. It is also about complying with the internal systems of control that have been put in place to ensure those external rules are met, and being able to evidence that you act upon them.

It is also important to understand that compliance does not automatically equate to “good practice”. Often the regulations that are imposed are a minimum legal requirement to protect the interests and safety of people and organisations. Nor are regulations necessarily all-encompassing. A financial or retail organisation may have PCI compliance, but this does not necessarily mean that their financial data is secure. Compliance should not be mistaken for security, as customers of Target in the US discovered to their cost in 2013, when around 40 million payment card records and the personal details of up to 70 million customers were stolen.

Resilience

If your infrastructure isn’t resilient, in other words, it can’t cope with unexpected demand, security attacks, power outages and the like, there is little likelihood that it will engender a high level of TRUST. All the elements we have discussed so far in the article contribute towards building resilience, but by themselves they are not enough.

High levels of process automation in systems administration and application workflow that are reviewed at least monthly and that release operations personnel to focus primarily on exceptions will reduce the risks of human error and increase agility.

Capacity management and performance reporting with automated links back into other system management tools will help early identification of potential issues such as bottlenecks. The ability to make dynamic adjustments to capacity to meet fluctuating demand is also a critical element in ensuring system resilience. Ultimately, can you say that you almost always meet your service level agreements and that you rarely experience unexpected capacity problems?
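As a rough sketch of the kind of check that capacity reporting might feed, the following flags resources whose peak utilisation crosses a threshold and works out how often a service level was met. The 80% threshold, the resource names and the sample figures are hypothetical, and a real capacity tool would take its readings from monitoring data rather than a hand-written dictionary.

```python
def flag_bottlenecks(utilisation, threshold=0.8):
    """Return resources whose peak utilisation meets or exceeds the threshold."""
    return sorted(
        name for name, samples in utilisation.items()
        if samples and max(samples) >= threshold
    )


def sla_attainment(periods_met, periods_total):
    """Fraction of reporting periods in which the SLA was met."""
    if periods_total <= 0:
        raise ValueError("periods_total must be positive")
    return periods_met / periods_total


# Hypothetical utilisation samples (fraction of capacity in use)
samples = {
    "web-tier-cpu": [0.42, 0.55, 0.61],
    "db-storage": [0.78, 0.83, 0.91],
    "wan-link": [0.30, 0.35, 0.33],
}

print(flag_bottlenecks(samples))
print(f"SLA met in {sla_attainment(11, 12):.1%} of months")
```

Here only the storage resource is flagged, which is the sort of early warning that gives time to add capacity before users notice a problem.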

Then, if the worst happens, do you have business continuity policies and procedures in place that reflect the needs of the business? The recovery mechanisms should reflect those business continuity policies. They should be fully automated, apart possibly from the decision to move to another data centre, and they should be tested in part regularly and in whole at least once a year.
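The testing cadence above lends itself to a simple automated reminder. The sketch below assumes that "regularly" means every 90 days for partial tests, which is an illustrative choice and not a figure from this article; the one-year interval for full tests follows the text. The dates are made up.

```python
from datetime import date, timedelta


def overdue_tests(last_partial, last_full, today,
                  partial_interval_days=90, full_interval_days=365):
    """Return which recovery tests are overdue under the assumed intervals."""
    overdue = []
    if today - last_partial > timedelta(days=partial_interval_days):
        overdue.append("partial")
    if today - last_full > timedelta(days=full_interval_days):
        overdue.append("full")
    return overdue


# Hypothetical test history: a recent partial test, but no full test for over a year
print(overdue_tests(
    last_partial=date(2017, 5, 1),
    last_full=date(2016, 3, 15),
    today=date(2017, 7, 1),
))
```

With these dates only the full test is reported as overdue.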

The unspoken risk

Most of what we have discussed here revolves around the building of policies, procedures and products that combine to strengthen TRUST in the underlying infrastructure. Often, commentators will talk about the importance of people and process in building TRUST. Yet the incidence of breaches in security, availability or governance caused by human error remains stubbornly high. Every 3 years the Ponemon Institute carries out a survey of the cost of data centre outages. Its 2016 survey highlights that 22% of data centre outages are caused by human error. This hasn’t changed from the previous survey in 2013. Some commentators take the view that this figure understates the position and that many of the generator, UPS and other equipment failures were themselves down to some sort of human error. Depending on how much credence you give to these views, the percentages range from just over 50% to almost 75%.

A greater understanding of the competencies, capabilities and confidence of your staff often highlights the inadequacies or inappropriateness of much of the training provided, as well as cultural issues that prevent organisations dealing with issues such as unconscious incompetence or an inability to challenge upwards.

People risk is a major issue that runs through all 5 components of TRUST that we have covered. It is not a separate item. For more information on what is meant by people risk, and how the risks can be addressed, have a look at some of the case studies from a company called Cognisco, for example: http://www.cognisco.com/case-studies/nhs/ ; http://www.cognisco.com/case-studies/bt/; and, http://www.cognisco.com/case-studies/eurostar/.

Conclusion

TRUST can be an ephemeral object. It is easy to lose, but harder to build and much harder to restore. Many factors can combine to cause a lack of TRUST. Not all the components and issues that combine to build TRUST are created equal. Learn to identify those components and issues that have the most potential impact on TRUST. Design the problems out where possible, mitigate against the worst impacts where that is not possible and monitor them closely. Finally, ensure you act and don’t just tick boxes… and remember your people risk.