The concept of continuous availability is not new. But now, with AIOps, the technology is at hand to make continuous availability happen.
In today’s 24/7 business environments, applications and services need to be online and performing optimally all the time. As such, continuous availability is becoming a cornerstone of the way IT operations, site reliability engineers (SREs), and security teams work. This will especially be the case as more aspects of business operations are digitally transformed.
Continuous availability is not new. Gartner defined it in 2012 as an approach that “focuses on eliminating the downtime required for standard maintenance tasks, but more importantly, removing any disruption to IT operations regardless of events that may occur.”
Two recent changes make continuous availability both more important and more attainable for organizations today. First, as business operations’ dependency on IT infrastructure continues to increase in response to the need for faster and leaner digital business processes, the need for continuous availability also increases.
And second, the technology that can help make continuous availability happen is finally available. That technology is AIOps (artificial intelligence for IT operations). AIOps applies machine learning to data from sensors, traces, logs, and more to perform event correlation, anomaly detection, and causality determination. It can play a role in continuous availability in three distinct ways.
Spot problems before they have an impact
The traditional approach to IT management is to wait for an angry call from customers or internal users about a service disruption or poor service quality. AIOps offers a more predictive mode of operation. It enables a proactive approach that could spot, for example, an increase in dropped or re-sent packets and other indicators of poor performance, and take corrective action in real time.
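To make the idea of spotting a rise in re-sent packets concrete, here is a minimal sketch of one simple anomaly-detection technique: a rolling z-score that compares each new sample with the samples just before it. It is an illustration only, not a description of how any particular AIOps product works. The function name `find_anomalies`, the window size, the threshold, and the sample data are all assumptions made for this example.

```python
from statistics import mean, stdev


def find_anomalies(samples, window=5, threshold=3.0):
    """Return indexes of samples that spike well above the preceding window.

    Hypothetical helper for illustration: a sample is flagged when it sits
    more than `threshold` standard deviations above the mean of the
    `window` samples that came before it.
    """
    anomalies = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # Perfectly flat baseline: treat any increase as anomalous.
            if samples[i] > mu:
                anomalies.append(i)
            continue
        if (samples[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Made-up retransmission rates (percent of packets re-sent), one per minute.
retransmit_pct = [0.4, 0.5, 0.4, 0.6, 0.5, 0.5, 0.4, 3.2, 0.5]
print(find_anomalies(retransmit_pct))  # [7]
```

In practice the flagged index would feed an alert or an automated remediation step. Production systems typically use richer models that account for seasonality and multiple correlated signals.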
Similarly, a security team could use AIOps to spot anomalies that are precursors to a cyber attack or indicative of a data breach. For example, AIOps might be used to alert the security team that an unusually large amount of data is being sent out of the organization via a normally lightly used port.
The bottom line: AIOps is about continuously updating the understanding of the state of a company’s systems in real time and deducing that there is a problem before systems are impacted.
Reduce time to problem resolution
Modern applications and services are cloud-based and built using microservices and containers. Even a simple application, such as a mobile front end to a user’s account, would involve backend elements maintained by the organization, a database on a public cloud, connectivity via the user’s provider, and any one of the major mobile operating systems. There are many interdependencies between the various elements, and the business has little control over most elements that could impact performance or availability.
When a problem happens, it can take a long time to determine the source of the outage. AIOps can help automate root-cause analysis, which can significantly reduce the mean time to repair (MTTR) for an outage or other problem.
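One simple way to picture automated root-cause analysis is as a walk over a service dependency map: among all the services currently raising alerts, the likeliest culprits are those whose own dependencies are healthy. The sketch below illustrates that heuristic under stated assumptions. The service names, the `depends_on` map, and the function `likely_root_causes` are hypothetical, and only direct dependencies are checked.

```python
def likely_root_causes(depends_on, alerting):
    """Return alerting services none of whose direct dependencies are alerting.

    `depends_on` maps each service to the services it calls; `alerting` is
    the set of services currently in an alert state. Both are illustrative
    inputs, not the data model of any specific tool.
    """
    roots = []
    for service in sorted(alerting):
        upstream = depends_on.get(service, [])
        if not any(dep in alerting for dep in upstream):
            roots.append(service)
    return roots


# Hypothetical dependency map for a mobile account application.
depends_on = {
    "mobile-frontend": ["account-api"],
    "account-api": ["accounts-db", "auth-service"],
    "accounts-db": [],
    "auth-service": [],
}
alerting = {"mobile-frontend", "account-api", "accounts-db"}
print(likely_root_causes(depends_on, alerting))  # ['accounts-db']
```

Here three services are alerting, but the heuristic narrows attention to the database, since the other two depend on it directly or indirectly. Real systems would combine this kind of topology information with timing and log evidence.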
Analyze historical data to ensure continuous availability
Many envisioned applications for AIOps deal with the real-time analysis of logs, traces, alerts, and more to address operational issues. However, the ability of AIOps to assimilate, analyze, and derive insights can also be used on historical data.
The idea here is that the intelligent capabilities of an AIOps system could spot relationships, anomalies, and other issues that are causing performance or availability problems. In this role, AIOps could serve in a predictive capacity, providing IT operations, SREs, and security teams with insights to improve operations.
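As a rough illustration of mining historical data for relationships, the sketch below counts how often each type of change was made shortly before an incident began. The record layout, the one-hour lookback, the function name `changes_preceding_incidents`, and all timestamps are assumptions invented for the example. Co-occurrence of this kind is a hint worth investigating, not proof of cause.

```python
from collections import Counter
from datetime import datetime, timedelta


def changes_preceding_incidents(changes, incidents, lookback=timedelta(hours=1)):
    """Count, per change type, the incidents that began within `lookback` after it.

    `changes` is a list of (timestamp, change_type) pairs and `incidents` is
    a list of incident start times. Each change type is counted at most once
    per incident.
    """
    counts = Counter()
    for incident_start in incidents:
        seen = set()
        for change_time, change_type in changes:
            gap = incident_start - change_time
            if timedelta(0) <= gap <= lookback:
                seen.add(change_type)
        counts.update(seen)
    return counts


# Made-up historical records.
changes = [
    (datetime(2024, 1, 3, 9, 0), "config-push"),
    (datetime(2024, 1, 3, 14, 0), "deploy"),
    (datetime(2024, 1, 10, 9, 5), "config-push"),
    (datetime(2024, 1, 17, 11, 0), "deploy"),
]
incidents = [
    datetime(2024, 1, 3, 9, 20),
    datetime(2024, 1, 10, 9, 40),
    datetime(2024, 1, 17, 16, 0),
]
print(changes_preceding_incidents(changes, incidents).most_common(1))
# [('config-push', 2)]
```

In this invented data set, two of the three incidents started within an hour of a configuration push, which would suggest reviewing how those pushes are validated and rolled out.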