AI and machine learning can help automate incident response by assessing situations, prioritizing alerts, and aiding human operators.
Historically, incident response has been a people-based workflow, but these days people don’t have to spend as much time staring at screens. Automating incident response has three parts: building automated workflows to route incidents, to remediate them, and to ensure a clean system of record. Automation gives humans a less hands-on but more crucial role. So what happens when AI gets involved?
RTInsights recently sat down with Richard Whitehead, CTO and Evangelist-in-Chief at Moogsoft, to discuss the growing need for automated incident response, past limitations in the field, and how the application of AI is helping. Here is a summary of our conversation.
RTInsights: What are the ultimate goal and benefits of automating incident response?
Whitehead: The one that’s most front-of-mind for people today, to steal a word out of the DevOps vernacular, is to eliminate toil. We’re trying to get more done with less effort. The primary goal for automation is to eliminate the boring, repetitive things that don’t add any real value to the project, and automating them helps free up engineers to focus on more interesting and impactful tasks.
Another aspect of automating incident response is that it allows a more consistent approach to the problem. It reduces the likelihood of mistakes made out of boredom or lack of attention, which can occur with tasks that are perceived as low value. This is very similar to the concept behind Robotic Process Automation, automated testing, etc.
RTInsights: What have been the limitations and issues with automating incident response in the past?
Whitehead: The biggest challenge has always been the amount of time and effort required to set up automation. I would argue that’s probably even more difficult today because there’s less consistency. The rate of change has increased. So, you have to weigh the cost of putting the time and effort into automation against the likelihood of that scenario recurring. In environments with a degree of stability, it’s definitely worth doing. If you have a high rate of change, you may be automating something that never recurs.
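The trade-off Whitehead describes can be put as simple arithmetic. The sketch below is only an illustration and is not from the interview: the function name, its parameters, and the example figures are all hypothetical, and a real estimate would also account for maintenance and for how uncertain the recurrence forecast is.

```python
def automation_pays_off(setup_hours, manual_hours_per_incident, expected_recurrences):
    """Return True if the expected manual effort saved exceeds the setup effort.

    All three inputs are estimates supplied by the team (hypothetical parameters).
    """
    expected_savings = manual_hours_per_incident * expected_recurrences
    return expected_savings > setup_hours


# Stable environment: the same incident is expected about 30 more times.
stable = automation_pays_off(setup_hours=20, manual_hours_per_incident=1.5, expected_recurrences=30)

# High rate of change: the scenario may recur only a handful of times.
volatile = automation_pays_off(setup_hours=20, manual_hours_per_incident=1.5, expected_recurrences=4)
```

With these made-up numbers, the stable case saves an expected 45 hours against 20 hours of setup, while the volatile case saves only 6.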
RTInsights: Given those issues, what elements should go into automating incident response?
Whitehead: When you are trying to automate incident response, what you want to do is create an environment that’s as robust, as flexible, and as tolerant to change as possible. In that way, it doesn’t get caught out by slight differences between incidents. That’s where we’re seeing things like machine learning really being applied because of the flexibility of the algorithms to adapt to changing environments.
RTInsights: Can you give some examples of how automating incident response is used in various settings, application areas, or industries?
Whitehead: A good example, in this case from a Cloud Service Provider, is the ability to automate the creation and documentation of an incident in a ticketing system of record, even if the incident resolves itself. We’ve seen environments where problems are essentially auto-healing because the healing is built into the infrastructure. However, because a problem can impact a customer, it still needs to be recorded, even though it might not require an operator to take any action.
You want to have a system capable of identifying and recording an issue impacting a user or customer or violating a service level objective. It becomes a matter of record that the incident happened, but it is done so without any human intervention. That’s incredibly powerful.
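As a rough illustration of that idea, the sketch below records a ticket for any customer-impacting event, including one that healed itself, without paging an operator. The `Ticket` fields and the `record_incident` function are hypothetical and not tied to any real ticketing product, which would have its own schema and API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Ticket:
    """Hypothetical system-of-record entry, used only for this sketch."""
    service: str
    summary: str
    opened_at: datetime
    resolved_at: Optional[datetime] = None
    needs_operator: bool = True
    notes: List[str] = field(default_factory=list)


def record_incident(service, summary, impacts_customer_or_slo, self_healed, now=None):
    """Create a ticket for an impacting event, even if it resolved itself."""
    if not impacts_customer_or_slo:
        return None  # no user impact and no SLO violation, so nothing to record
    now = now or datetime.now(timezone.utc)
    ticket = Ticket(service=service, summary=summary, opened_at=now)
    if self_healed:
        # Keep the record for later problem management, but do not page anyone.
        ticket.resolved_at = now
        ticket.needs_operator = False
        ticket.notes.append("Auto-healed by infrastructure; recorded without human intervention.")
    return ticket


healed = record_incident("checkout", "latency spike", impacts_customer_or_slo=True, self_healed=True)
ongoing = record_incident("checkout", "outage", impacts_customer_or_slo=True, self_healed=False)
```

Here `healed` is closed on creation and flagged as needing no operator, while `ongoing` stays open for a human to pick up.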
A more significant issue to deal with beyond incident management is problem management. You want to look into why problems occur after the fact and try to eliminate them. The task of responding to an ongoing incident becomes a lot less demanding because the system is doing the heavy lifting of the identification of the issue and documentation of it in the form of a ticket.
We’re also seeing examples where automation can be used to streamline the process. Think of the tasks that somebody would do when an incident occurs. In general terms, you would first triage the issue to determine the severity or impact of the problem. Then you would start gathering information to determine how to proceed and resolve the issue. These are steps that can actually be automated. By the time the operator has been notified and starts to look at the problem, many of the mundane steps that you would do manually have been completed.
This significantly reduces the mean time to recover from a problem by eliminating some of the triage and diagnostic steps. Depending on the environment, that can be as simple as just providing information about the asset, service, or process that’s causing the incident. You can also automate the presentation of detailed diagnostic information to the operator if you have the right automation tools.
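A minimal sketch of that kind of automated triage and enrichment might look like the following. Everything in it is assumed for illustration: the `ASSET_INVENTORY` and `SEVERITY_BY_TIER` tables stand in for whatever service catalog and severity policy an organization actually uses, and the alert is a plain dictionary rather than any particular tool’s format.

```python
# Hypothetical severity policy: map a service tier to an incident severity.
SEVERITY_BY_TIER = {"critical": "sev1", "standard": "sev2", "internal": "sev3"}

# Hypothetical stand-in for a service catalog or asset database.
ASSET_INVENTORY = {
    "payments-api": {"tier": "critical", "owner": "payments-team", "runbook": "runbooks/payments-api"},
    "report-batch": {"tier": "internal", "owner": "data-team", "runbook": "runbooks/report-batch"},
}


def triage(alert):
    """Attach severity, owner, and runbook to an alert before a human sees it."""
    asset = ASSET_INVENTORY.get(alert["service"], {})
    # Unknown services get a middle severity so that a person still reviews them.
    severity = SEVERITY_BY_TIER.get(asset.get("tier"), "sev2")
    return {
        "service": alert["service"],
        "message": alert["message"],
        "severity": severity,
        "owner": asset.get("owner", "unassigned"),
        "runbook": asset.get("runbook"),
    }


enriched = triage({"service": "payments-api", "message": "5xx error rate above threshold"})
unknown = triage({"service": "mystery-service", "message": "disk usage high"})
```

The operator who opens `enriched` already sees the severity, the owning team, and the runbook, which are the mundane lookups that would otherwise be done by hand.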
RTInsights: What about auto-remediation?
Whitehead: We get asked a lot about auto-remediation because it seems to be very topical. I wouldn’t say I’m skeptical. I would say I’m realistic about auto-remediation because it’s something we’ve been looking at, or we’ve been asked for, for three decades. It’s been “just around the corner” for a long time.
The challenge, I think it’s called the paradox of automation, is that the more you automate something, the more crucial a human being becomes in trying to decide whether or not that should be automated. I feel like the state of the art when it comes to auto-remediation is the system recommends what needs to be done to resolve an issue, but it’s a human that decides to do it.
That’s where we are today. AI and machine learning are helping by operating in the background, looking at the operator’s actions, and saying, “The last ten times this problem occurred, you took the following action nine times. Would you like to do that again?” The theory is that if you answer yes often enough, the system will learn, and you could, hypothetically, let the system take over. But there’s always that one occasion where you do something different, and the human becomes more crucial in the process to make that one decision. This is a matter of opinion, but I feel that if you can be 100% confident in an auto-remediation action, you’re probably dealing with something that needs to be fixed at the root. If you’re that confident, the effort should be put into making sure it never happens in the first place.
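The recommend-then-confirm pattern described here can be sketched with a simple frequency count. This is a deliberately simplified stand-in for the machine learning Whitehead mentions, not a description of any product: the function names, the `min_share` threshold, and the `confirm` callback are all hypothetical.

```python
from collections import Counter


def recommend_action(past_actions, min_share=0.9):
    """Suggest the operator's most frequent past action if it dominates the history."""
    if not past_actions:
        return None
    action, count = Counter(past_actions).most_common(1)[0]
    if count / len(past_actions) < min_share:
        return None  # no clear habit, so make no suggestion
    return {"action": action, "count": count, "total": len(past_actions)}


def remediate(past_actions, confirm):
    """The system recommends, but a human decides whether to act."""
    suggestion = recommend_action(past_actions)
    if suggestion is None:
        return None
    # `confirm` stands in for the operator answering "Would you like to do that again?"
    return suggestion["action"] if confirm(suggestion) else None


history = ["restart service"] * 9 + ["fail over to standby"]
suggestion = recommend_action(history)
accepted = remediate(history, confirm=lambda s: True)
declined = remediate(history, confirm=lambda s: False)
```

With this made-up history, the suggestion is "restart service" (9 of the last 10 times), and nothing runs unless the operator says yes, which keeps the human in the loop for the one occasion that calls for something different.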