Given the complexity of today’s digital organizations, continuous app availability matters more than ever.
As the world becomes more reliant on digital applications and services, there is little tolerance for downtime. That is driving businesses to push for continuous app availability, a concept with roots dating back a decade when Gartner brought attention to the topic.
Back then, seeing how detrimental outages were to companies, Gartner coined the term to describe an approach that “focused on eliminating the downtime required for standard maintenance tasks, but more importantly, removed any disruption to IT operations regardless of events that occur.”
Fast-forward to today, and continuous app availability matters even more.
Infrastructures have grown larger and more complicated. Many applications are cloud-native and essentially cobbled together from smaller components and microservices via APIs. At the same time, the speed of application development has accelerated, with many new applications and updates rolled out every day. Taken together, these factors make it harder to guarantee uptime in today’s complex systems.
Basic AIOps addresses some issues
AIOps is the deployment of machine learning to track data from sensors, traces, logs, and other sources to prevent internal and external disruption, whether that be through event correlation or anomaly detection. It can also provide a better analysis of why an event happened through causality determination.
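To give a sense of what event correlation means in practice, here is a minimal Python sketch that groups alerts arriving close together in time into a single incident. The `Alert` fields, the `correlate` function, the 60-second window, and the sample alerts are all hypothetical and chosen for illustration. Real AIOps platforms typically also use topology and learned relationships, which this sketch does not attempt.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Alert:
    timestamp: float  # seconds since some reference point (hypothetical field)
    service: str      # name of the emitting service (hypothetical field)
    message: str


def correlate(alerts: List[Alert], window_seconds: float = 60.0) -> List[List[Alert]]:
    """Group alerts into incidents by time proximity.

    An alert joins the current incident if it arrives within
    window_seconds of the previous alert; otherwise it starts a new one.
    """
    incidents: List[List[Alert]] = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        if incidents and alert.timestamp - incidents[-1][-1].timestamp <= window_seconds:
            incidents[-1].append(alert)
        else:
            incidents.append([alert])
    return incidents


# Illustrative data only: three alerts in quick succession, one much later.
sample_alerts = [
    Alert(0.0, "database", "connection pool exhausted"),
    Alert(20.0, "api", "upstream timeout"),
    Alert(45.0, "web", "HTTP 503 rate rising"),
    Alert(600.0, "cache", "eviction rate high"),
]

incidents = correlate(sample_alerts)
for number, incident in enumerate(incidents, start=1):
    services = ", ".join(a.service for a in incident)
    print(f"Incident {number}: {len(incident)} alert(s) from {services}")
```

Collapsing a burst of related alerts into one incident is what lets an operator look at a single probable cause (here, the database) rather than three separate pages.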
Frequently, businesses turn to AIOps due to the complexity of today’s applications. Specifically, there are many inter-dependencies between the various elements of modern applications. Making matters worse, many businesses have little control over most elements that could impact performance or availability.
When a problem happens, it can take a long time to determine the source of the outage. AIOps can help automate root-cause analysis, which can significantly reduce the mean time to repair (MTTR) for an outage or other problem. And this, in turn, can help improve availability.
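The link between repair time and availability can be made concrete with the standard steady-state formula, availability = MTBF / (MTBF + MTTR), where MTBF is the mean time between failures. The short Python sketch below applies it. The figures (one failure every 30 days on average, and repair times of four hours versus 30 minutes) are illustrative assumptions, not measurements from any real system.

```python
HOURS_PER_YEAR = 8760  # 365 days * 24 hours


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time a service is up."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def annual_downtime_hours(avail: float) -> float:
    """Expected hours of downtime per year at a given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR


# Hypothetical numbers for illustration only.
MTBF_HOURS = 720.0          # one failure every 30 days on average
manual_mttr_hours = 4.0     # assumed repair time with manual root-cause analysis
automated_mttr_hours = 0.5  # assumed repair time with automated root-cause analysis

manual = availability(MTBF_HOURS, manual_mttr_hours)
automated = availability(MTBF_HOURS, automated_mttr_hours)

print(f"Manual:    {manual:.4%} available, "
      f"{annual_downtime_hours(manual):.1f} h downtime per year")
print(f"Automated: {automated:.4%} available, "
      f"{annual_downtime_hours(automated):.1f} h downtime per year")
```

With the failure rate held constant, every reduction in MTTR translates directly into higher availability, which is why faster root-cause analysis matters.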
Given the complexity of your average digital organization in 2022, with layers of microservices and ephemeral architectures, the need for AIOps is even more critical than it was a decade ago.
Key to app availability: A move to predictive over reactive
AIOps has proven to be a useful tool to get to the source of problems in today’s complex application environments. But its typical use is as a reactive tool that kicks in once a problem occurs.
Increasingly, there is interest in using this tool in a more predictive way. That point was raised in a recent industry blog:
“AIOps was born in the world of event data, traditionally monitoring infrastructures by focusing on changes in topology. But there’s a seismic shift brewing as more businesses migrate their IT assets to the cloud and operate like software companies to satiate consumers’ appetites for bigger and better digital apps. These digital-first companies need to shift from monitoring their infrastructures to monitoring their applications. After all, apps now determine the user experience.
Enterprises simply can’t afford to spot an incident after an app has crashed. Incidents must be detected faster and earlier in the incident lifecycle, and resolution must occur before impacting customers, partners, or internal stakeholders.”
With these thoughts in mind, many businesses are looking to use AIOps to spot problems in the making, not after they happen. The idea is to do real-time analysis of logs, traces, alerts, and more. An AIOps tool would also need to assimilate, analyze, and derive insights from historical data.
The idea here is that the intelligent capabilities of an AIOps system could spot relationships, anomalies, and other issues that are causing performance or availability problems. In this role, AIOps could serve in a predictive capacity, providing IT, operations, site reliability engineers (SREs), and other teams with insights to improve operations.
In particular, an AIOps platform can play a role in automatically learning the “normal state” of each critical business service and application. It can assess the underlying behavior of the supporting hardware and software services, making it easy to flag anomalies automatically. Using big data analytics, machine learning, and other artificial intelligence technologies, it can then automate the identification (in advance) and resolution of infrastructure issues, ranging from performance or availability problems to an all-out outage of the infrastructure.
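As a rough illustration of learning a “normal state” and flagging deviations from it, the Python sketch below keeps a rolling baseline of a metric and flags values that fall more than a set number of standard deviations from the baseline mean. This is a deliberately simple z-score approach, not how any particular AIOps product works. The `BaselineDetector` class, its parameters, and the latency figures are all hypothetical.

```python
from collections import deque
from statistics import mean, stdev


class BaselineDetector:
    """Learns a rolling baseline for one metric and flags outliers."""

    def __init__(self, window: int = 30, threshold: float = 3.0, min_samples: int = 5):
        self.history = deque(maxlen=window)  # recent "normal" observations
        self.threshold = threshold           # allowed deviation, in standard deviations
        self.min_samples = min_samples       # observations needed before judging

    def observe(self, value: float) -> bool:
        """Return True if value is anomalous relative to the learned baseline."""
        is_anomaly = False
        if len(self.history) >= self.min_samples:
            mu = mean(self.history)
            sigma = stdev(self.history)
            # A perfectly flat baseline (sigma == 0) is skipped to avoid
            # dividing by zero; a real system would need a better rule here.
            if sigma > 0 and abs(value - mu) / sigma > self.threshold:
                is_anomaly = True
        if not is_anomaly:
            # Only normal values update the baseline, so a spike
            # does not distort what counts as "normal".
            self.history.append(value)
        return is_anomaly


# Illustrative response times in milliseconds, with one obvious spike.
latencies_ms = [100, 102, 98, 101, 99, 103, 97, 100, 250, 101]

detector = BaselineDetector()
flagged = [i for i, value in enumerate(latencies_ms) if detector.observe(value)]
print("Anomalous sample indices:", flagged)
```

In a predictive setting, the same idea would be applied to leading indicators (for example, queue depth or error rates creeping upward) so that a deviation is flagged before users see an outage.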