The boundaries of what we view, broadly, as the Cloud Management market have expanded. It encompasses the performance of private, public and hybrid-cloud infrastructure as well as traditional on-premises configurations that, increasingly, are virtualised and cloud-like. It also includes a strong focus on cloud cost optimisation. Cloud Management enables IT organisations to deliver relevant, timely performance, health, cost and resource utilisation metrics across all levels of IT and the business. It includes elements of Digital Experience Management (DEM) and IT Service Management (ITSM), as well as the more traditional Data Centre, Networking and Application Performance Management disciplines. Changing development and deployment models, characterised by DevOps and Containerisation, require management and monitoring tools to provide observability into, and tools for, new development environments.
Why is it important?
Business relies on IT for the delivery of its customer propositions. Indeed, for many, IT is the business. The growth and scale of cloud computing, the agility provided by micro-services and new deployment technologies, the power of business intelligence (BI) and analytics, and the immediacy of social media have brought forth new, digital-only business models that operate at a global level, creating new levels of expectation and scale for IT performance in old and new businesses alike.
In this environment, customer experience of the application is critically important. Instrumentation across all key applications and infrastructure is the only way to measure and ensure end-to-end performance and availability. Legacy, silo-specific monitoring tools are no longer adequate as they can’t communicate or relate to one another and have no understanding of the applications that are running on their components. Application Performance Management (APM) tools alone can’t ensure performance and frequently can’t identify the root cause of performance degradation, especially when it is rooted in some part of the I/O path, such as the network or storage infrastructure. With the Internet of Things (IoT), the collaborative and global nature of business, and the specific hardware demands of Artificial Intelligence (AI), Machine Learning and Data Science, that infrastructure is getting more and more complex.
Therefore, being able to monitor and react to performance issues in near real-time and being aware of what parts of this complex IT environment are being used by individual applications is a critical business requirement.
There is also a very important side-effect of a comprehensive Cloud Management system: the ability to benchmark and profile the performance characteristics of applications under different workloads and in different environments. This can then be used to develop sophisticated capacity planning models to help reduce the incidence of expensive over-provisioning in public and hybrid-cloud environments.
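To make the capacity planning point concrete, here is a minimal sketch of one common right-sizing heuristic: size a workload to a high percentile of its observed demand plus some headroom. The sample figures, the 95th percentile and the 20% headroom are all illustrative assumptions, not values taken from any particular product.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def recommended_vcpus(cpu_used_vcpus, pct=95, headroom=0.2):
    """Size to the pct-th percentile of observed demand plus headroom.

    Both pct and headroom are assumed policy choices for this sketch.
    """
    demand = percentile(cpu_used_vcpus, pct)
    return math.ceil(demand * (1 + headroom))


# Hypothetical hourly samples of vCPUs actually consumed by one workload.
observed = [2.1, 2.4, 1.9, 3.0, 2.8, 2.2, 6.5, 2.5, 2.7, 2.3]
provisioned = 16  # hypothetical current allocation

target = recommended_vcpus(observed)
print(f"provisioned={provisioned} vCPUs, recommended={target} vCPUs")
```

A real capacity model would also account for seasonality, growth and the risk appetite of the business, but even this simple comparison shows how profiling data exposes over-provisioning.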
Cloud Management is different from traditional application and network performance management tools in its ability to capture and correlate data from a wide range of sources across an entire hybrid IT infrastructure that includes public and private cloud deployments, irrespective of vendor. Cloud Management comprises monitoring, cross-domain correlation and AI-based analytics. It captures granular information, in real-time, on transaction flows from storage arrays, from network devices and between server VMs, across both on-premises and cloud environments, using a combination of hardware and software probes to capture packet and flow data, agents, logs and events.
The hardware probes tap physically into storage and network fibre connections, while software probes capture information from physical and virtualised devices in an agentless, non-intrusive manner. They ingest huge amounts of real-time data which is correlated, analysed and presented to IT operations and business managers in user-defined custom views on a single pane of glass.
While Cloud Management systems should be application-aware, it is important to understand that, in a virtualised and micro-services environment, the end-user might be another micro-service. Cloud Management focuses on understanding the infrastructure associated with those service-to-service transactions as well as the overall application performance that the end-user experiences. This is now generally categorised by the term Observability.
With the huge amount of data collected, Cloud Management solutions make use of advanced correlation algorithms, machine learning and predictive analytics to provide sophisticated management information dashboards.
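As a simple illustration of cross-domain correlation, the sketch below ranks infrastructure metrics by how strongly they move with application latency, using a plain Pearson coefficient. This is a deliberately basic stand-in: commercial solutions use far more sophisticated techniques, correlation does not prove causation, and the metric names and sample values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / math.sqrt(var_x * var_y)


def rank_suspects(app_latency_ms, infra_metrics):
    """Return (metric name, correlation) pairs, strongest first."""
    scores = {
        name: pearson(app_latency_ms, series)
        for name, series in infra_metrics.items()
    }
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical samples taken at the same six points in time.
app_latency_ms = [120, 125, 118, 300, 310, 122]
infra_metrics = {
    "storage_latency_ms": [2, 2, 2, 9, 10, 2],
    "network_retransmits": [5, 4, 6, 5, 4, 6],
    "host_cpu_percent": [40, 42, 41, 43, 40, 41],
}

for name, score in rank_suspects(app_latency_ms, infra_metrics):
    print(f"{name}: {score:+.2f}")
```

In this made-up data the storage metric tracks the application slowdown most closely, which is the kind of pointer that helps an operations team decide where to look first.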
Business leaders need to care because the only real reason to care about infrastructure performance and availability is that it has a significant impact on the business. British Airways and TSB recently suffered IT problems that had financial impacts of £150 million and £50 million respectively in direct costs. While it is by no means clear that infrastructure performance issues were the major factor here, poor performance can have serious, direct implications. Ten years ago, Amazon was already showing that every 100ms of latency cost them 1% in sales, while Google found an extra 0.5 seconds in search page generation time dropped traffic by 20%.
Performance problems may be attributable to a variety of different components within a hybrid Cloud infrastructure, or in applications themselves. Therefore, CIOs, CTOs and IT Operations leaders need to care about the significant problems with siloed operational teams and vendor or technology specific tools. The business does not want to hear “not my problem, try that lot over there”. It wants the business issue fixed, quickly and effectively. It wants the root cause of an issue identified quickly and collaboratively, and to see someone put onto addressing it. It doesn’t want to see developers and technicians playing the “blame game”. Ultimately, Cloud Management enables IT to reduce Mean Time to Resolution (MTTR), or, as we have jokingly heard it called, “Mean Time to Proving Innocence”.
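For readers less familiar with the metric, MTTR is simply the average elapsed time between an incident being detected and it being resolved. A minimal sketch follows; the incident timestamps are invented for illustration.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """Average of (resolved - detected) over a list of incident pairs."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Hypothetical (detected, resolved) timestamps for three incidents.
incidents = [
    (datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 11, 30)),
    (datetime(2024, 1, 12, 14, 0), datetime(2024, 1, 12, 14, 45)),
    (datetime(2024, 1, 20, 22, 15), datetime(2024, 1, 21, 1, 0)),
]

print(f"MTTR: {mean_time_to_resolution(incidents)}")
```

Tracking this figure before and after tooling changes is one straightforward way to test whether faster root cause identification is actually being achieved.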
In the last few years there has been a growth in tools designed to help companies control their Cloud spend. By and large, these tools focus on one of the big three cloud providers and provide information about cloud usage to help identify where, for example, cloud services are left running when they are not needed or where purchased capacity is not being used effectively. AWS, Azure and GCP do provide their own tools that customers can use to help manage their cloud bills.
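The basic idea behind this kind of cost tooling can be sketched in a few lines: flag resources whose utilisation is below a threshold and total what they cost while running. The record structure, the 5% CPU threshold and the figures below are assumptions made for illustration and do not reflect any provider's billing API.

```python
from dataclasses import dataclass


@dataclass
class ResourceUsage:
    """Hypothetical usage record for one cloud resource."""
    name: str
    hourly_cost: float
    avg_cpu_percent: float
    hours_running: float


def find_idle(resources, cpu_threshold=5.0):
    """Resources whose average CPU is below the (assumed) idle threshold."""
    return [r for r in resources if r.avg_cpu_percent < cpu_threshold]


def wasted_spend(resources, cpu_threshold=5.0):
    """Total cost of the resources flagged as idle."""
    idle = find_idle(resources, cpu_threshold)
    return sum(r.hourly_cost * r.hours_running for r in idle)


# Hypothetical month of usage data.
fleet = [
    ResourceUsage("web-1", 0.20, 45.0, 720),
    ResourceUsage("batch-old", 0.50, 1.2, 720),
    ResourceUsage("dev-test", 0.25, 0.5, 300),
]

for resource in find_idle(fleet):
    print(f"idle candidate: {resource.name}")
print(f"estimated wasted spend: {wasted_spend(fleet):.2f}")
```

CPU alone is a crude signal, and a resource that looks idle may still be needed for resilience, so real tools combine several metrics with ownership and scheduling information.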
The development of hybrid and multi-cloud infrastructures and an increased focus on digital transformation and migrating more legacy applications to the Cloud require a more sophisticated approach. We are now seeing a move towards tools and platforms that look at performance, cost and risk across multiple clouds, with capabilities in understanding how legacy workloads would run in the cloud and the ability to factor in risk when managing cloud capacity, as well as identifying wasted spend. This is giving rise to the term FinOps, which we are focusing on in more depth.
We are also seeing an emerging trend where Observability tools, which originated to help DevOps teams and Site Reliability Engineers (SRE) overcome some of the visibility challenges inherent in containerised micro-services deployments in a public cloud environment, are evolving into a broader set of operational performance monitoring capabilities. This involves a greater focus on DEM and the use of synthetic data to drive active testing to compensate for gaps in visibility caused by the increasing use of public networks outside the direct control of in-house IT operations teams.
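Active, synthetic testing can be pictured as a scripted transaction that is run on a schedule and timed from the outside. The sketch below assumes a caller-supplied `probe` function standing in for that transaction; the function names, attempt count and response-time target are all hypothetical.

```python
import time


def run_synthetic_check(probe, attempts=3, slo_seconds=0.5):
    """Call probe() repeatedly, recording success and elapsed time.

    probe is any zero-argument callable that raises on failure.
    """
    results = []
    for _ in range(attempts):
        start = time.perf_counter()
        try:
            probe()
            ok = True
        except Exception:
            ok = False
        elapsed = time.perf_counter() - start
        results.append({
            "ok": ok,
            "seconds": elapsed,
            "within_slo": ok and elapsed <= slo_seconds,
        })
    return results


def availability(results):
    """Fraction of attempts that succeeded."""
    return sum(1 for r in results if r["ok"]) / len(results)


def example_probe():
    """Stand-in transaction; a real probe might request a health page."""
    time.sleep(0.01)


report = run_synthetic_check(example_probe)
print(f"availability: {availability(report):.0%}")
```

In practice the probe would exercise a real user journey over the public network, for example by wrapping an HTTP request to the service, so that problems outside the in-house estate still show up in the results.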
Unsurprisingly, the three largest Public Cloud providers have been slow in providing observability solutions that enable IT Operations teams to gain a single, end-to-end view of application and business services performance in genuine hybrid and multi-cloud architectures. Microsoft Azure and GCP have started to open up their management tools to allow monitoring of hybrid-cloud applications running on their own cloud and in the customer’s data centres. AWS has been the most reluctant to open up its management tools. However, there were some fairly quiet announcements at re:Invent in 2019 about AWS CloudFormation and AWS Config being opened up to support resources outside the AWS Cloud.
Traditional vendors like IBM, BMC and Broadcom have a genuine ability to monitor performance across all the major Cloud providers and a much wider range of on-premises systems. We should also note here that there is a very active and competitive market for Hybrid and Multi-Cloud Management tools from independent vendors such as Dynatrace, Datadog and Virtana, to name but three.
We note that some IT Operations and development departments have built their own monitoring solutions using open source tools like Prometheus and Grafana. However, this would be a non-trivial exercise in any large-scale hybrid or multi-cloud scenario, and we can’t see the point of trying to build your own when so many commercial solutions are available, particularly when some of them, notably IBM, use key elements of open source monitoring and logging tools.