Smart Talk Episode 6: AIOps and the Future of IT Monitoring

Fred Koopmans, BigPanda

Observability tools watch millions of real-time data points. Thanks to AIOps, these can be turned into actionable information or real-time automation.

AIOps is closely connected to observability. So, real-time data such as metrics, events, logs, and traces (MELT) is equally relevant to AIOps. If observability tools help with instrumentation, data capture, and storage, AIOps tools help with alert noise reduction, faster troubleshooting, and issue remediation. In this episode, Dinesh Chandrasekhar, Chief Analyst and Founder, Stratola, and Fred Koopmans, CPO, BigPanda, discuss the topic of AIOps, the relevance of the freshness of MELT data, and the future of IT monitoring with AI.

This engaging conversation starts from the origins and relevance of AIOps tooling but quickly navigates to the why, who, and how of AIOps. Data freshness is discussed in detail, as real-time data is extremely pertinent to detecting problems and averting downtime. Do not miss Fred's insightful responses about the future of this space and where he sees innovation bound to happen.

Guest
Fred Koopmans, Chief Product Officer, BigPanda
Fred is dedicated to driving innovation and collaboration, building trusted customer partnerships, creating product roadmaps, and empowering individuals to achieve the extraordinary. He leads product strategy, product management, product marketing, and user experience teams at BigPanda. Previously, at Cloudera, a hybrid data cloud company for enterprise analytics and machine learning, Fred served as senior vice president of product management and scaled the business from $300 million to nearly $1 billion in annual revenue. As architect and director of product management at Bytemobile, his innovative strategies increased revenue by 200%. At Citrix, he revamped the service-provider product development lifecycle, boosting customer adoption by 200%.

Host: Dinesh Chandrasekhar is a technology evangelist, a thought leader, and a seasoned IT industry analyst. With close to 30 years of experience, Dinesh has worked on B2B enterprise software as well as SaaS products, delivering and marketing sophisticated solutions for customers with complex architectures. He has also defined and executed highly successful GTM strategies to launch several high-growth products into the market at companies like LogicMonitor, Cloudera, Hortonworks, CA Technologies, Software AG, and IBM. He is a prolific speaker, blogger, and weekend coder. Dinesh holds an MBA degree from Santa Clara University and a Master's degree in Computer Applications from the University of Madras. Currently, Dinesh runs his own company, Stratola, a customer-focused business strategy consulting and full-stack marketing services firm.

Resources
Watch Smart Talk Episode 1: The Data-in-Motion Ecosystem Landscape
Watch Smart Talk Episode 2: The Rise of GenAI Applications with Data-in-Motion
Watch Smart Talk Episode 3: Modern Data Pipelines and LLMs
Watch Smart Talk Episode 4: Real-Time Data and Vector Databases
Watch Smart Talk Episode 5: Disaggregation of the Observability Stack
View the data-in-motion ecosystem map here
Learn more about data-in-motion on RTInsights here


Dinesh Chandrasekhar

Hello and welcome to this episode of Smart Talk, the Data-in-Motion Leadership series. Joining me today is Fred Koopmans, CPO of BigPanda. Fred has a fantastic history in product management, working across companies like Citrix, Cloudera, and now BigPanda. He has a master's in computer engineering, and he's also an ex-colleague and good friend that I've worked with in the past as well. So thank you for joining us today, Fred.

Fred Koopmans

Thank you, Dinesh. It’s good to reunite with you.

Dinesh Chandrasekhar

Of course. So for folks in the audience that do not know about BigPanda, help us understand who BigPanda is, and this topic that we are going to be talking about today is about AIOps and the relevance of AIOps in the context of data-in-motion. So maybe help us understand what AIOps is as well.

Fred Koopmans (01:08)

Sure. And again, thank you for having me on the podcast. So AIOps is the concept of using AI to accelerate and improve IT operations. Now that's pretty broad, and I think different vendors define it a bit differently. So for some people it's about being predictive: predicting the future, predicting my next outage or my next hard drive failure, that kind of thing. For others, it's about accelerating the reactive or responsive nature of an operations team. I think we're all probably aware of the massive growth of observability and APM [application performance monitoring] tools over the last decade or so, especially as applications and infrastructure have shifted to the cloud. Well, those tools are predominantly bought and deployed by developers so that when something goes wrong, they can log in and have all the observability of what's going on: I can look at the logs, metrics, traces, etc.

Our definition at BigPanda of AIOps is harnessing that machine data to automatically and proactively inform IT operations that "Hey, there's an issue, or there's a potential issue that someone should investigate." So you stop getting embarrassed by your customers calling and saying, "Hey, did you know you're down? Because you're down right now." So it's taking that same observability data, but putting it to a different kind of purpose. And, well, you can imagine that's a lot of data. So it's using AI to automatically sift through all of that data and find the needles in the haystack.

Dinesh Chandrasekhar (02:51)

Fantastic, thank you. And in this talk series, we get excited about things like proactive and predictive and all that kind of stuff. That's what we talk about here when it comes to real-time data. So you said observability platforms galore, and then there are APM solutions out there, and then AIOps kind of extends that paradigm into making it more proactive and predictive. I want to tell the audience as well as bring up the topic of MELT, or metrics, events, logs, and traces, which are primarily the data streams that get ingested into these kinds of platforms so that the right kind of predictions can be made about when certain systems are going to go down or when a particular issue is going to happen and so forth. So how do you handle, as an AIOps platform vendor, this kind of volume of data streams? Because we are talking about collecting data from infrastructure, cloud, containers, all different types of logs and whatnot. So when you are ingesting this level of volume across the enterprise, what are some of the challenges that you face, and how do you handle that as an AIOps vendor?

Fred Koopmans (04:10)

Yeah, so we kind of segment the workflow into a few different components, or layers, of the stack. So there are the elements that generate the raw data: instrumenting the product, generating metrics, logs, and traces. This is the highest volume from a gigabytes kind of perspective. So that's the first layer of the stack. The second layer of the stack processes those to search for threshold breaches or anomalous patterns. The third layer of the stack is where BigPanda sits, and it processes those pieces of information; in other words, the events. So BigPanda takes as input the real-time events. If something has crossed a pre-configured threshold or a predictive threshold, anomalous or otherwise, that's the event where BigPanda comes in. So the data that comes into BigPanda is typically a kilobyte, a couple of kilobytes per event, but our largest customers, and think very large enterprises, banks, airlines, the healthcare industry, our largest customers literally receive 4 million events a day.


So it's just an enormous amount of information, but it's not a lot of data per piece of information. Now, as it turns out, in some cases about 99.99% of that data is duplicative. It's redundant, it's just noise. Somebody configured a threshold 10 years ago, forgot about the application, and it's not useful. It's not productive for operations. So our design principle comes, first off, out of the TCP/IP layering idea: do things at the layer of the stack where they make the most sense, and don't send data beyond that. Don't send BigPanda all of your metrics, logs, and traces when you already have tools that are processing those and searching for those anomalies.
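To make that scale concrete, here is a minimal sketch, not BigPanda's actual implementation, of the kind of fingerprint-based suppression that collapses a flood of duplicative events into an actionable stream. The field names and the 15-minute window are illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class RawEvent:
    source: str        # e.g., "apm-tool", "snmp", "email" (hypothetical labels)
    host: str
    check: str         # the threshold or check that fired
    severity: str
    timestamp: datetime

class Deduplicator:
    """Suppress events that repeat the same fingerprint within a time window."""

    def __init__(self, window_minutes: int = 15):
        self.window = timedelta(minutes=window_minutes)
        self.last_seen = {}  # fingerprint -> timestamp of the last event forwarded

    def is_actionable(self, event: RawEvent) -> bool:
        fingerprint = (event.source, event.host, event.check, event.severity)
        last = self.last_seen.get(fingerprint)
        if last is None or event.timestamp - last > self.window:
            # First occurrence inside the window moves forward.
            self.last_seen[fingerprint] = event.timestamp
            return True
        # Repeats inside the window are treated as noise and dropped.
        return False
```

Run millions of raw events a day through a filter like this and, in the 99.99%-duplicative case described above, only a few hundred distinct signals would survive to the correlation layer.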

Dinesh Chandrasekhar (06:16)

Understood. So, if I were to summarize what you just said, the first and second layers of the stack could be the players that are ingesting and processing all these voluminous events, logs, traces, and whatnot, but the events that get bubbled up from anomalous behavior, or any kind of specific false positives or false negatives, those pretty much go into the third layer of the stack, which is where BigPanda is. So do you actually really care about how that data is collected? I've seen open-source types of instrumentation methodologies like OTel [OpenTelemetry] and metrics and all that kind of stuff. That could be one way of doing it. Or closed-source vendors, like the traditional observability vendors out there, who are also ingesting all these types of data from various sources and escalating or elevating those specific events into an AIOps solution like yours. Do you really care about how that event comes up, or do you prefer or care about a specific architecture?

Fred Koopmans (07:31)

We don't really care. We care that it comes up and that we're able to take that data as input. It's one of the key inputs to BigPanda. I'll talk about some other inputs here in a second, but we don't particularly care how it's captured, as long as the customer has decent coverage and they have sort of a well-defined alert-quality definition, and they can say, "Hey, don't send me an alert unless it has the application it relates to, the business impact, the location, the runbook that you want me to execute," and so on and so forth. So we care about alert quality more than how it originates. Now, I'll observe that my average customer has 20 to 30 different monitoring sources. Some of them have been deployed recently, but they have decades of monitoring. People are still sending us SNMP traps, if you can believe it. So there's a large variety of things. So that third layer is really the aggregation and sort of the filtering layer. And I'll add one more observation, which is that I do see a little bit of a trend in the market toward OpenTelemetry. So in the monitoring space, there's a shift toward standardization, open source, cost control, all kinds of things from that perspective.

Dinesh Chandrasekhar (08:54)

Very cool. And I've seen the same trend as well when it comes to OpenTelemetry. I think a lot of companies are adopting it and IT vendors are starting to support it, which makes the ecosystem a lot more cooperative to work with, I suppose.

Fred Koopmans (09:09)

Exactly, yeah, and that's good. We just kind of assume a heterogeneous environment, and we kind of thrive on whatever it is. However you get it to me, it can literally be email, SNMP, API, whatever; we will standardize it and turn it into a more homogenous stream for the operations team on the outbound side.
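A rough illustration of that inbound normalization, with a hypothetical schema and field mapping: whatever the source format, each payload is mapped onto one common alert shape and checked against the quality bar Fred described (application, business impact, location, runbook).

```python
from dataclasses import dataclass

@dataclass
class NormalizedAlert:
    application: str
    location: str
    business_impact: str
    runbook: str
    description: str

REQUIRED_FIELDS = ("application", "location", "business_impact", "runbook")

def normalize_webhook(payload: dict) -> NormalizedAlert:
    """Map one vendor-specific webhook payload onto the common schema.

    The source keys here ("app", "region", ...) are made up for illustration;
    a real integration would have one such mapper per monitoring source.
    """
    return NormalizedAlert(
        application=payload.get("app", ""),
        location=payload.get("region", ""),
        business_impact=payload.get("impact", ""),
        runbook=payload.get("runbook_url", ""),
        description=payload.get("message", ""),
    )

def meets_quality_bar(alert: NormalizedAlert) -> bool:
    """Reject alerts missing the fields an operator needs in order to act."""
    return all(getattr(alert, field) for field in REQUIRED_FIELDS)
```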

Dinesh Chandrasekhar (09:34)

Got it. So, a quick side conversation here on the people that are actually benefiting from something like this. You talk about these homogenous streams that you generate for the operations folks, right? So are these operations folks typically suffering from getting these alerts on a real-time basis, or are they suffering from an overload of events coming through, or what is it that we are trying to solve for these operations folks?

Fred Koopmans (10:02)

All of the above. So let's take a typical customer, typical here being BigPanda customers, the typical IT operations team. Unless they have an AIOps platform like BigPanda, they're going to be suffering from alert fatigue. So they may be receiving 4 million raw events. Now, they don't have 4 million actions to take on those raw events. They may have a couple hundred, maybe a couple thousand at most, actions to take. But sifting the raw from the actionable is the first thing. And if you go back maybe a decade or so, that scale wasn't so far out of whack, and what they had actually done is built up human IT operations teams to manage the scale. But the machine-generated data from the observability tools has grown and grown and grown, and cloud, CI/CD, and DevOps, the pace of change and everything else, has just overwhelmed those teams.


So I think the first thing they need is just to reduce the noise, get it down to a sane, human-manageable level so that they're not missing things. They're always afraid they're going to miss an outage. And if you end up with a painful outage and you go back and realize my observability tool told me I had an issue, told me where the issue was, but it didn't get to me, it got lost in the noise. That's the first problem to solve for: how do we help them not get that signal lost in the noise. The second thing is to help with automated analysis. Like, okay, great, I have an issue. Can you tell me about that issue? How can you accelerate the triage, in particular, of that? It turns out that in the enterprise, more than half of incidents are triggered by changes.


And if you think about it, that's the first thing an operator does: okay, something broke. What changed? Is somebody upgrading something, is somebody doing maintenance? Is somebody pushing some code? Something must have changed. This worked yesterday. It's worked for the last six months. Well, that's a great segue to the next data source for BigPanda, which is change events. So change tickets, outputs from CI/CD tools like Jenkins, GitHub, JIRA. As people are making changes in a continuous manner, we feed that in as another real-time data stream, and then we obviously do some correlation between the two, saying, "Hey, things are breaking exactly where things are changing." That's relevant context for that operations team. So once you get the noise under control, you do a correlation there for what's changed and put that right at their fingertips. In Case A it says, "there's an issue and the following things have changed in the exact same area. I think you should start your investigation there." Or in Case B it says nothing has changed, so this is probably more of a hardware failure or something external. Either way, don't waste your time triaging by going and searching all your change tools; the AI has already automated that for you.
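As a sketch of that change correlation, purely illustrative and with hypothetical field names: join the open incident against recent change events on the affected application and a short lookback window, and hand the operator whatever matches.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class ChangeEvent:
    application: str
    summary: str       # e.g., "Jenkins deploy of build 4512"
    timestamp: datetime

def related_changes(incident_app: str,
                    incident_start: datetime,
                    changes: list,
                    lookback_hours: int = 6) -> list:
    """Return changes on the same application shortly before the incident."""
    window_start = incident_start - timedelta(hours=lookback_hours)
    return [c for c in changes
            if c.application == incident_app
            and window_start <= c.timestamp <= incident_start]
```

A non-empty result is Fred's Case A (start the investigation with those changes); an empty one is Case B (look toward hardware failures or external causes instead).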


I'll add to that, which is, again, what are you solving for? Again, think about that first-line IT operator: what's going on, what changed? The next thing that they ask is, have we seen this before? And, well, it turns out AI is really good at that as well: doing a search for an incoming incident to say, have I seen an incident like this before? Maybe not on that exact same node, maybe not in that exact same application, but somewhere in the overall environment. Giving them that and putting it right at their fingertips too, that's really useful. It's obviously a capability that requires AI, because it's a kind of semantic search; the incident is not going to come in with the exact same keywords as before, but oh yeah, we've seen an issue like that impact the billing system, and it turned into a P1 outage, so you should really prioritize this one. That's the third kind of use case.
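The "have we seen this before?" step is essentially a nearest-neighbor search over embeddings of historical incidents. A minimal sketch follows; the `embed` function is a placeholder for whatever embedding model is in use, not a real API, and at production scale this lookup would be backed by a vector database rather than a brute-force loop.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: in practice this calls an embedding model."""
    raise NotImplementedError

def most_similar_incidents(new_incident: str,
                           past_incidents: list,
                           top_k: int = 3) -> list:
    """Rank historical incidents by cosine similarity to the new one."""
    query = embed(new_incident)
    scored = []
    for text in past_incidents:
        vec = embed(text)
        score = float(np.dot(query, vec) /
                      (np.linalg.norm(query) * np.linalg.norm(vec)))
        scored.append((text, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]
```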

Dinesh Chandrasekhar (14:12)

So that brings up a really important point about how quickly these people need to know about what happened and be able to triage that particular issue, as you said, and figure out what the root cause was, how to solve it, and whether we have seen it in the past and all that. It seems like time is of the essence here. Which begs the question: when you talk about that three-layered stack kind of model, how quickly do you get that information, from the point when the issue actually occurred to the point where it is elevated to an AIOps platform? We talk about data freshness all the time here in the series, and to me this brings up, or begs, that question of how fresh the data is when it actually reaches you.

Fred Koopmans (15:06)

You hit the nail on the head. Time is money. So if you are having an outage, every second counts. We have some analyst reports estimating, for the average enterprise, that the cost of an outage is something like $20,000 per minute. If you have a one-hour outage, that's a lot of money, and if you can shave even 5, 10, 20, 30 minutes off of that, you save a lot of money for that organization. The reasons for the costs are varied in nature. There are revenue implications: if you are an online retailer and you're down, you're not making money. There are brand reputation implications: everybody remembers when their favorite service went down for a long time, and your customers punish you when that happens. But there's also just the operational overhead. Without this proactive signal, what tends to happen is everyone kind of panics and says, "Shoot, we're down! Don't know how. Let's get all the smart people on a call and let's figure it out. Let's get the storage team, the network team, the database team, the application teams."

Next thing you know, you literally have 300 people on a bridge call. Now, if this observability data could have, maybe 10 minutes before that, been able to identify an issue and highlight it at the right priority to the right assignment group, you could have saved those 300 people and just notified maybe the five people that really need to take a first look. If it's not their issue, they can hand it off. So real time matters. This isn't millisecond kind of real time. What tends to happen is, from a threshold breach to a ticket being created, we're talking 30 seconds to a minute, something like that. But faster is better, so we're always looking for ways, both outside of our platform and inside our platform, to cut that latency down as much as possible. But there's also the scenario where, when something happens, you get a network event and then there's a database event, an application event. You get all of these signals; basically, when there's a significant issue, you're going to get more than one event. All of that flows in in different time sequences. We correlate all of it together and create a single ticket.
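For a rough sense of what those minutes are worth, using the $20,000-per-minute analyst estimate Fred cites:

```python
COST_PER_MINUTE = 20_000                   # dollars, analyst estimate cited above

one_hour_outage = 60 * COST_PER_MINUTE     # $1,200,000 for a one-hour outage
savings_30_min = 30 * COST_PER_MINUTE      # $600,000 saved by resolving 30 minutes sooner
print(one_hour_outage, savings_30_min)
```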

Dinesh Chandrasekhar (17:29)

Got it. I kind of want to hit on the AI point that you mentioned in your earlier answer, where you said AI helps with pattern recognition, that's what it does well, and the semantic searches that you brought up and all that. But before we even started talking about all this AI fervor that's been happening around us for the last couple of years, AIOps existed as AIOps even before that; this particular space has been called AIOps for a while. So how have things changed? How have things changed from when it was first known as AIOps to what AIOps is today? Has there been a shift in the AI methodologies inside, has the technology itself changed, or have you started incorporating some of these GenAI types of tooling inside your offerings? What has changed?

Fred Koopmans (18:28)

Yeah, I would say you could even compartmentalize it into a Gen one and a Gen two from an AI perspective. So Gen one kind of AI, let's call it the pre-ChatGPT-launch (November of 2022) kind of version of things. AI in this case was about anomaly detection, intelligent correlation, compression techniques, pattern recognition, these sorts of things. It turns out that for change events and monitoring events, AI is really good at that kind of stuff. And for looking at metrics and logs and traces, the AI techniques that were applied toward that problem were kind of perfect for those use cases, but there were a lot of things that were out of reach. So it wasn't until Gen two AI became available that you could start going after different kinds of use cases. The ability to quickly, and at scale, do a semantic search to realize that I've seen an incident that sounded a lot like this before, that really wasn't possible with Gen one technology.


It was generative AI and large language models and vector databases that have made that possible. I think we're in the Gen two era now. You still need the Gen one, right? That's the proactive detection and noise reduction that tells you, hey, you're having an issue. But in terms of helping that operations team, those frontline folks that are doing triage, taking a quick look and deciding, is this an issue? Is it not? And who do I escalate to? They can use it to read through a summary of all of that data. A classic BigPanda ticket looks like 40 different alerts happening in different locations. Here are all the change events, here are all the similar incidents. Here you go. Now, if you're a frontline operator, that might be 10 to 20 minutes just to read the ticket. That's a lot of data that's been compressed into it. But generative AI allows us to summarize and say, hey, here's the TL;DR: this is really important, pay attention. Or, here's just a P3: it's not that important, it's just one more of these storage-threshold kinds of issues. It's still important, but it's not urgent in the same way. So that's one unlock. Here's the other really big unlock, and this is where I see the market going over the next year or two.
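A minimal sketch of that summarization step: flatten a correlated ticket (alerts, change events, similar incidents) into a prompt and ask a language model for the TL;DR and a priority call. The `complete` function is a stand-in for whichever LLM client is actually in use, and the prompt wording is an assumption for illustration.

```python
def complete(prompt: str) -> str:
    """Placeholder for a call to a large language model."""
    raise NotImplementedError

def summarize_ticket(alerts, changes, similar_incidents) -> str:
    """Build a TL;DR for a frontline operator from a correlated ticket."""
    prompt = (
        "You are assisting a frontline IT operator.\n"
        "Summarize the incident below in three sentences, state the likely "
        "priority (P1-P4), and suggest where to start the investigation.\n\n"
        "Alerts:\n" + "\n".join(alerts) + "\n\n"
        "Recent changes:\n" + "\n".join(changes) + "\n\n"
        "Similar past incidents:\n" + "\n".join(similar_incidents)
    )
    return complete(prompt)
```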


If I think about the Gen one era and what data sets were providing the input into the AIOps platform, it was events: monitoring events, change events, observability events, some threshold got breached. That was the kind of data, but now it's expanded. Those are still important, but in Gen two it's expanded to words, all kinds of different words flowing from different sources. Some of them real time in nature, some of them not. I'll give you a few examples.

I'll start with the non-real-time categories. So every historical ticket, after-action review, postmortem, remediation steps, runbooks, standard operating procedures, wikis. That's a lot of information. It's a lot of context that is buried somewhere in some operational system on how to maintain and troubleshoot and repair issues. All of those words being fed into an AIOps system unlock a lot more context that allows you to say, yeah, we've seen this issue before, and by the way, here's how it was fixed. Here's who fixed it. Here's the after-action review that we never followed up on. Okay, that's pretty good context. So that's category one. What's category two of words? It's real-time words. What are people chatting about right now? Who's calling the help desk? The help desk may be starting to get flooded with incoming queries about something being down, or there are people on a Zoom call that are speaking over Zoom, like we're doing now, and that's being transcribed automatically through AI into words. That becomes a real-time data feed.
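One way to picture "real-time words" as a signal, with an admittedly arbitrary threshold: watch the rate of help-desk tickets mentioning an application and flag a surge as an early indicator that something is down.

```python
from collections import deque
from datetime import datetime, timedelta

class HelpDeskSpikeDetector:
    """Flag when tickets mentioning a keyword surge within a short window."""

    def __init__(self, keyword: str, window_minutes: int = 10, threshold: int = 25):
        self.keyword = keyword.lower()
        self.window = timedelta(minutes=window_minutes)
        self.threshold = threshold
        self.recent = deque()  # timestamps of tickets that matched the keyword

    def observe(self, ticket_text: str, timestamp: datetime) -> bool:
        if self.keyword in ticket_text.lower():
            self.recent.append(timestamp)
        # Drop matches that have aged out of the window.
        while self.recent and timestamp - self.recent[0] > self.window:
            self.recent.popleft()
        return len(self.recent) >= self.threshold
```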

Dinesh Chandrasekhar (22:44)

People talking on chats and social streams and whatnot as well, I suppose.

Fred Koopmans (22:49)

Exactly. Yeah, you just mentioned another good one. What are people chatting about online, out in the metaverse, whether it's Downdetector or other sites like that? These are other important signals. So all of these words now become possible real-time data you can bring into an AIOps platform, giving more context. Again, the goal is to help people detect what's going on as quickly as possible, both proactively and predictively: "Hey, we don't have an issue yet, but this is heading in the wrong direction." And then triage and remediate it. So I would say that Gen two here is a game changer for the whole industry, allowing you to connect machine-generated data and human-generated words in the same platform.

Dinesh Chandrasekhar (23:38)

Very cool. And I'm totally excited about some of the innovations that are going to be coming, and let's probably end this conversation, which has been fantastic, on that particular note: as the CPO of a fantastic company like BigPanda, what is your outlook? What are some of the key trends and innovations that you're seeing in the AIOps space that are bound to come in the next couple of years, with, again, all this excitement around GenAI and everybody coming out with an innovation or a tool almost every day? What is it that you are excited about? What are we going to be seeing in the next couple of years?

Fred Koopmans (24:12)

Yeah, I'll give you two. And the first is intended to ground all of us product people in reality. So I was meeting with a BigPanda customer last week, one of the largest banks in the world, and he was giving me an overview of his strategy for the next 12 months. This was ahead of me going in and presenting the roadmap, and by the way, our roadmap was going to be all about BigPanda's AI. And at the end of this session, I was like, "Hey, you didn't talk about AI. What's your AI strategy at the bank?" And he says, "My AI strategy is to make sure that vendors don't charge me 3X the price for delivering no additional value because they're excited about AI." And I was like, oh, okay. Well, let's make sure that our AI delivers value. So the first thing I want to say is, while AI is amazing from a technology perspective, if you don't find a way to connect it to provable ROI, you're not going to sell it.


You're not going to be able to charge any more for it. I know there's a lot of exuberance and excitement about, man, we're going to double our price, we're going to triple our price with all of it. I don't think that's going to last unless you can actually follow it through and say, here are exactly the benefits that you got, the ROI models, and so on and so forth. Now, having said that, we had a really productive roadmap conversation with this bank, so it went well. What we talked about was how you go from using AI for triage to using it to drive remediation, and specifically we talked about the forthcoming BigPanda copilot. Our project name for this is Biggie: give me all your data. It's like a big data kind of application. So we talked about Project Biggie with them, and one of his observations, and I agree with this, is that lots of vendors have copilots, and a lot of them have unique data sets and unique use cases.


So my prediction is that you're going to see a scenario where a developer has a copilot, a frontline operator has a copilot, the help desk has a copilot, all of which have been uniquely customized and trained for them, and they're all going to need to talk to each other. So we were joking that we are not far from the day where, instead of needing those 300 people on a phone call to triage what's going on, you're going to get 10 copilots on an AI call, and they're going to talk amongst themselves. They're going to argue, and they're going to say, "Okay, go talk to Biggie. Biggie knows the answer here. None of us know." And that goes even beyond IT. Legal is going to have copilots, etc. So I think the copilot era is coming where everyone has an assistant. It mirrors the organizational chart, the people org chart; you're going to have a mirror image of copilots, where not only will they exist where they add value, but they'll talk to each other too. It'll be really cool when our AI starts communicating and root-causing things on our behalf.

Dinesh Chandrasekhar (27:17)

It's almost like a couple of years ago, when I spoke about autonomous enterprises and self-feeding enterprises as the future to look forward to as technology in this particular space emerges. And I think the way you described it, copilots coexisting, talking with each other, taking orders from one another along the organizational chart and so forth, is the future that we are all probably, I don't know, excited about, but at the same time a little concerned about as well, if they can make all the decisions for us, run the company, and make some of those key decisions too. But definitely exciting days ahead. Fred, thank you so much for joining us today. I thoroughly enjoyed the conversation. I hope the audience got a lot of information from you about the AIOps space and how data-in-motion, real time, and all that factor into it as well. So I appreciate your time today, and thank you so much for being here.

Fred Koopmans (28:13)

Absolutely. My pleasure, Dinesh. Thank you for having me on the podcast. Big fan of the space and it’s been great. Thanks.


About Dinesh Chandrasekhar

Dinesh Chandrasekhar is a seasoned marketing executive, a technology evangelist, and a thought leader with close to 30 years of industry experience. He has an impressive track record of taking new integration/mobile/IoT/Big Data products to market with a clear GTM strategy of pre- and post-launch activities. He is the founder and CEO of Stratola, a top-notch business strategy consulting and full-stack marketing services company. Dinesh has extensive experience working on as well as marketing enterprise software and SaaS products, delivering sophisticated solutions for customers with complex architectures. As a Lean Six Sigma Green Belt, he has been the champion for Digital Transformation at companies like LogicMonitor, Cloudera, Hortonworks, Software AG, CA Technologies, and IBM. Dinesh has been pivotal at many companies in creating new categories, identifying new growth areas, championing new sales plays, and being a vocal supporter of the brand and its cause.
