Smart Talk Episode 7: Cardinality, Control and Costs in Observability

Cardinality, control, and costs are the three Cs of understanding and managing observability data, as Krishna Yadappanavar, CEO of Kloudfuse, explains with Smart Talk’s host, Dinesh Chandrasekhar, founder and principal analyst of Stratola.

We have discussed observability and the complementary space of AIOps a couple of times on this series, but this time we take a more pragmatic dive into the topic to understand the mentality of the buyer. What should a CIO of an organization look for when they are in the market for an observability solution? Join us on this episode as Dinesh Chandrasekhar, Chief Analyst and Founder of Stratola, talks to Krishna Yadappanavar, CEO of Kloudfuse. Krishna explains observability through the lens of three factors: Cardinality, Control, and Costs.
These three Cs are key to understanding and managing ever-increasing observability data. They are important not only for managing the data but also for leveraging the data and metadata for additional analytics.
A new development in the field of observability is model observability, especially of the LLMs that drive generative AI. The three Cs also apply to this emerging use case.

Some of the topics covered in this episode of Smart Talk are:

  • Observability data’s potential
  • Tools proliferation
  • Overcoming siloed observability data
  • Considering the metadata
  • Insight into cost

Guest
Krishna Yadappanavar, CEO, Kloudfuse
Krishna Yadappanavar is the Co-Founder and CEO of Kloudfuse, a unified observability platform. He previously co-founded SpringPath, securing $94 million in funding and leading the company to a $320 million acquisition by Cisco. With over 20 patents, Krishna has significantly impacted data, virtualization, and storage technologies at Veritas, Commvault, EMC, VMware, and Cisco. He co-authored VMware’s VMFS and designed critical components of the storage virtualization stack for ESX Server. Additionally, Krishna advises and invests in emerging startups across Data, Virtualization, Cloud, Security, and AI/ML, contributing to vision, product strategy, engineering, and go-to-market efforts.

Host: Dinesh Chandrasekhar is a technology evangelist, a thought leader, and a seasoned IT industry analyst. With close to 30 years of experience, Dinesh has worked on B2B enterprise software as well as SaaS products delivering and marketing sophisticated solutions for customers with complex architectures. He has also defined and executed highly successful GTM strategies to launch several high-growth products into the market at various companies like LogicMonitor, Cloudera, Hortonworks, CA Technologies, Software AG, IBM etc. He is a prolific speaker, blogger, and a weekend coder. Dinesh holds an MBA degree from Santa Clara University and a Master’s degree in Computer Applications from the University of Madras. Currently, Dinesh runs his own company, Stratola, a customer-focused business strategy consulting and full-stack marketing services firm.

Resources
Smart Talk Episode 6: AIOps and the Future of IT Monitoring
Smart Talk Episode 5: Disaggregation of the Observability Stack
Smart Talk Episode 4: Real-Time Data and Vector Databases
Smart Talk Episode 3: Modern Data Pipelines and LLMs
Smart Talk Episode 2: The Rise of GenAI Applications with Data-in-Motion
Smart Talk Episode 1: The Data-in-Motion Ecosystem Landscape
View the data-in-motion ecosystem map here
Learn more about data-in-motion on RTInsights here

Transcript
Dinesh Chandrasekhar

Hello and welcome to this episode of Smart Talk, a Data in Motion leadership series. On this episode we have a special guest, Krishna Yadappanavar. He’s the CEO of Kloudfuse. He’s no stranger to the startup ecosystem; he is a serial entrepreneur who has built a couple of companies before this. So we warmly welcome Krishna to have this conversation about observability, which is, again, a favorite theme in this series.

Krishna Yadappanavar

Thank you.

Dinesh Chandrasekhar

So Krishna, as a way of introducing yourself, why don’t you tell us about Kloudfuse and your drive to start the company?

Krishna Yadappanavar (01:01):

Okay, absolutely. Thanks, Dinesh. Thanks for the warm introduction. Hi folks, my name is Krishna. I’ve been living in the Valley for almost two-plus decades and have been working with a bunch of startups and big companies. The claim to fame is VMware when it was an early startup. I joined and then saw it growing from literally close to a million in ARR to a 64 billion company, and I’ve been associated with different data-related technologies, whether it is writing file systems, distributed systems, or databases, OLAP or OLTP. Throughout this journey, what I noticed is that data is the secret of all the insights, whether it is on the product analytics side or in providing a solution like virtualization, backup, or disaster recovery. Then I did my own startup, SpringPath, which was in hyperconvergence, trying to bring together storage, networking, and security in a box, and I sold it to Cisco.

And after spending some time at Cisco, I was thinking about what the next big trends were going to be. This was back in early 2020. I came across a couple of trends: how the data related to developers, DevOps, or SecOps is growing exponentially; how the new trends in machine learning, AI, and LLM models, which were in their early stages back then, are going to disrupt the market; and then, just as the human brain thinks and reacts to certain incidents, you want the machines to act in a similar way. These were some of the problems we came across, and at the intersection of all three, I found that solving the problem of not only observability, but observability plus analytics and automation on top of the data, focused on the developers and the DevOps, is very critical. That led to the beginning of Kloudfuse. One thing led to another, and now we’re a team of around 40-plus people.

Dinesh Chandrasekhar (03:16):

Well, congratulations, this is a great start. So wishing you good luck on that journey. Talking about observability, this wasn’t something that just came about yesterday. I worked in that space as well for a fair amount of time, and the concept of observability has evolved over the years. Originally, 10 or 12 years ago, people were hyped up about infrastructure monitoring, network monitoring, and all that, and then slowly one thing led to another, and cloud monitoring and container monitoring capabilities got added on. And today we have this notion of observability that is quite popular. Most of the companies that used to tout monitoring are now observability companies. And I know you started off afresh in the observability space wanting to create a difference. How would you describe this evolution? What was before compared to what we have today? How have you seen this evolution?

Krishna Yadappanavar (04:09):

Yeah, great question, Dinesh. I have seen this as a developer myself, writing a monolithic application running on physical machines. Then I saw the advent of virtualization, whether it’s VMware or Hyper-V or the open-source virtualization technologies, and then containerization came in. If you look at the core problems when it comes to the data for observability as these environments have evolved, we have seen that the attributes associated with the data keep on increasing, and when you take the Cartesian product of those attributes, it becomes really large, on the order of multiple millions to billions. That is what they call cardinality. Associated with that cardinality is the data volume. As the data volume increases, people want to transform data A to data B for better analytics. They want to automate certain workflows on top of the data.

They want to slice and dice the data so that it can give you better insights. So in short, as the data volume increases, the traditional way of monitoring the so-called known knowns goes away. Then you are looking at the known unknowns, which is the beginning of observability, and then there are the complete unknown unknowns, where you don’t know anything and you are thrown multiple terabytes to petabytes worth of data, and you have to dissect that data and get to, for lack of a better word, where exactly the problem is, how it is correlated to the incident, what the root cause analysis is, what the impact analysis is. As long as developers write code, more and more services are coming, and this complexity is going to go higher and higher. Hence this continues to be an evolving space where new challenges emerge.

Dinesh Chandrasekhar (06:14):

Fantastic. So observability obviously is a difficult problem to solve, and I would love to explore why that is. I think you touched on it a little bit just now, but it’s also that we have a crowded marketplace, with dime-a-dozen vendors out there saying, “We solve this part of observability, and that sort of observability,” and all that, but there’s still a search for the ideal solution. Every CIO that I talk to is always looking for this one magic bullet that solves their problems. Why is that? Is there a different lens through which you have to look at it to understand why there is such an urge to get that ideal solution?

Krishna Yadappanavar (07:04):

So, as I was alluding to earlier, let me step back a little bit, right? What is the customer looking for when they are thinking of an ideal observability solution? Let’s start from the problem. I classify this problem as the three Cs: Cardinality, Control, and Cost. Let me go into the next level of detail. What do these three Cs mean? Cardinality is all about how certain data, whether it’s a metric point or a log line or an event or a trace or a span coming from your distributed tracing, or a profile from a continuous profiler, gets attached with additional, for lack of a better word, labels or tags. When you take the Cartesian product of the potential values those labels can take, it’ll grow really, really high. So now every data point needs to be associated with its tags.

So let’s call the tags the metadata, and then there is the data. Different streams have different problems. Some, like the metrics side, are metadata heavy. When you come to the logs and the spans, like distributed tracing, they are data heavy. But in effect, it’s the tremendous increase in the volume of the observability data because of the cardinality. Nowadays I have seen the reverse trend. Back in the day, people were thinking, “Hey, let me send my data to a SaaS portal and then the SaaS vendor will manage all of that data.” But when I talk to the CTO or the CIO or a head of engineering or the developers or the architects, and even the CFO, they’re thinking, “Let me have control of my data.” What do they mean by that? There’s a reverse trend going on: they have so much data, and for various reasons, whether it is the egress costs, the security aspects of it, or the volume of the data itself.

They don’t want to send that data outside their VPC. And there is another angle to it: they want to bring in all the possible interfaces they can think of, whether it is a traditional observability kind of interface, like creating your dashboards, alerts, and SLOs, or analytics functions written in traditional SQL or GraphQL, or even advanced Spark jobs to run analytics on top of the observability data, because observability has become that fundamental a pillar. That means they have to own the data. The data should not leave the VPC: the data which is getting ingested, the data getting queried, and the data which is getting analyzed. And last but not least is the cost. If you go to any vendor out there, whether it is a traditional commercial SaaS vendor or an open-source component, and there are a lot of open-source solutions out there, the infrastructure cost, the cost of the vendor, is directly proportional to the volume of data, directly proportional to the number of queries, and directly proportional to the number of users. Those three things are the problems an organization looking for an ideal observability solution is trying to solve.
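The cardinality explosion Krishna describes is just the size of the Cartesian product of a metric’s label values. A minimal sketch (the label names and value counts below are hypothetical, chosen only to illustrate the arithmetic) shows how quickly it compounds:

```python
from math import prod

# Hypothetical labels attached to a single metric, and the number
# of distinct values each label can take in a mid-sized deployment.
label_values = {
    "pod": 500,          # Kubernetes pods
    "endpoint": 50,      # API routes
    "status_code": 8,    # HTTP status classes observed
    "region": 6,         # cloud regions
    "customer_id": 1000, # tenant identifier
}

# The worst-case number of distinct time series for this one metric
# is the product of the per-label value counts.
cardinality = prod(label_values.values())
print(f"{cardinality:,}")  # 1,200,000,000 potential series
```

Five modest-looking labels already yield over a billion potential series for a single metric, which is why high-cardinality labels like `customer_id` are the usual culprit when observability bills spike.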

Dinesh Chandrasekhar (10:24):

Cardinality, control, and costs. I love that. The three Cs are a great way to look at the observability space and how you infer what is important to the actual users and so forth. Talking about costs, since we touched on it, I want to ask you this question as well. From my own personal experience, when I’ve talked with customers who are looking for an observability solution, what they often complain about is, “Hey, I have at least eight to 10 different tools in each department. I’m looking at maybe 30 to 40 tools across the organization today. I’m already paying a lot for these licenses year after year. Why do I want one more observability solution?” That’s a pushback I used to get, right? So I’m going to pose that same question to you now that you’ve touched on the cost aspect. How do you approach that question with a CIO and convince them as to why this is better than having 30 or 40 different tools?

Krishna Yadappanavar (11:23):

Okay, great question. To answer that, let’s start with the problem of why there is tool proliferation. If you look at the entire ecosystem, traditionally some commercial vendors were pretty good in certain streams. If you go to logs, you think of Splunk. If you think of metrics, you think of Datadog. Then, inside Google and all the FAANGs of the world, this whole movement of open source started, especially with the advent of Kubernetes, and then came things like Prometheus, OpenTelemetry, and whatnot.

And there is this whole shift going on toward open-source-based solutions. What does that mean? It means the developers, the architects, the DevOps guys want to ingest their observability data in an open format. Meaning, even if I pick any instrumentation to instrument my code, or put in any agent to collect my data, it should be a hundred percent open-source compatible. That’s why even the commercial vendors started putting their agents in open source. Then, on the query side, they want the whole visualization, the dashboarding, the alerting, all these capabilities, to be driven by open-source query languages. That’s why you see the emergence of PromQL, LogQL, and TraceQL, and OpenTelemetry is now trying another open-source query language.

So now you’re in this world where you have many options. The customers have already picked certain vendors for certain streams.

Then there’s the open-source movement, and different teams are using different infrastructures. Some are based on Kubernetes, some on serverless, some on ECS, Fargate, whatnot. So that is adding another dimension. And to get to the speed and agility of the whole product delivery, CI/CD has evolved at this intersection. To solve their problems very quickly, teams look for a pointed solution and hence end up choosing very pointed solutions. Whereas if I were to start the observability stack for my company, I would step back and say, hey, if I want to reduce my MTTR and MTTD, I need to collect all the n streams of observability data. Do I go to n different vendors and pick n different streams, or do I go to an observability data lake where I can put all the streams together, so that correlation and the advanced functions, like outlier detection, anomalies, and causation, become relatively easier? That would be an ideal solution: consolidate everything in a data lake where you can preserve your data within your premises.

Dinesh Chandrasekhar (14:18):

Fantastic. And I would also add that the cost of tool proliferation is largely because developers want to build their own thing, and they have added a lot of open-source tooling to the mix, but it’s also department-level purchases. An IT department feels like, “I can solve this problem, let me get a bandaid solution, let me buy this tool off the shelf and use it.” And then over time they realize they have added one more tool to the arsenal, without realizing that they’re not seeing the forest for the trees. So the CIO conversations are always interesting, about how you can reduce the number of tools you have across the enterprise and have one observability platform looking across departments, across applications, infrastructure, containers, and whatnot.

Krishna Yadappanavar (15:08):

Absolutely. And to add on to that, now different personas in the company are also looking at the same data. DevOps, developers, and architects are all looking at the observability data with respect to infrastructure, containerization, applications, and whatnot, using the same logs. The SecOps guys are trying to dissect the data to look for security threats, looking at similar data coming from the logs or the traces. Even the DataOps guys are asking, “Hey, how good are my data operations?” And now, with the advent of LLMs, even the LLMOps guys are looking at similar data to do their kind of analytics. So there is another consolidation that needs to happen. That’s one thing I would look for in an ideal observability solution: how do I bring in all the different personas in an organization so that they can leverage the data from the same so-called data lake?

Dinesh Chandrasekhar (16:05):

Truly the proverbial single pane of glass that we’ve all been striving for. So that’s a good thing. I want to touch on one other thing that you mentioned briefly in your previous response, which was reducing MTTR, right? The primary crux of observability is not only troubleshooting but also reducing MTTR, reducing alert noise, and those kinds of metrics as well. It definitely saves SREs and IT Ops from having to split hairs to figure out where the trouble is. As a key fundamental requirement for solving this, you need access to events in real time, as they’re happening. If a log got written in a particular application or on a particular server about malicious activity or something like that, you need access to that event right away, then and there, so that you can understand where the anomaly is, what’s happening across your infrastructure, why there is a spike in a particular memory thread or whatever.

So you need to figure that out, and in order for that to happen, you need the ability to ingest all of this in real time as well. Data immediacy, a favorite term of mine that I’ve been talking about over the last year, and data freshness are of prime importance here. How fresh the data is determines how quickly you can resolve that particular problem, or maybe even avert something that’s about to happen, particularly when you’re monitoring hundreds and thousands of servers. So where do you pinpoint the trouble in terms of getting the data to be as fresh as possible? Is it largely dependent on the ingestion mechanisms? Because you spoke about OTel and other types of instrumentation techniques too. How would you think about it from this perspective of how quickly I can get access to real-time data?

Krishna Yadappanavar (18:03):

Okay, another great aspect of observability. You’re absolutely right, Dinesh. The key dimension of how observability data gets consumed is how quickly I can have the data from the moment it has left the source, whether that is your application, your infrastructure components, or your platform, like open-source components and things like that. If you look at how the industry has evolved in the last five to 10 years, real-time streaming services and real-time databases have come along. The traditional observability solutions couldn’t leverage that functionality because their technology was relatively older. With the advent of real-time streaming and real-time databases, you can get access to the data as quickly as possible. That is the measure of what is called the freshness of the data: from the moment it has left the application until it is readily queryable, that’s all that matters.

Then there is the aspect of, hey, I have all the data; how do I compartmentalize it? How do I find the relevant patterns I need to get to the root cause? That is the next set of problems. It means I should be able to transform one kind of data into another. I’m getting a series of logs; can I quickly derive a metric out of the logs? I’m getting a series of spans; can I look at some attribute within a span or a trace to analyze that data? Because these attributes are typically correlated, and that’s how the debugging happens. So that is the next dimension. And then the third dimension is advanced analytics on top of that data. Can I bring in some interesting statistical models, or the large language models, to analyze the data and find the outliers in my system?

Can I find the anomalies in my system? Can I look into the seasonality of my data? Can I forecast my data based on what I have seen in the past, the seasonal aspect of the data? All of this is what I call the package of advanced analytics. So when you think of the overall solution, after the freshness of the data is solved, you need to think about the data as a unit, a brick, how each brick fits with the others, and then a set of analytics functions. And then the natural next thing is, hey, I have analyzed this once; can I automate it? That becomes the natural extension of the whole thing. That’s why, along with the three Cs problem, we have seen customers asking how they can observe, analyze, and automate their observability data.
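The log-to-metric transformation and outlier detection Krishna describes can be sketched roughly as follows. The log format, the extracted latency metric, and the z-score threshold here are illustrative assumptions for the technique in general, not a description of Kloudfuse’s implementation:

```python
import re
from statistics import mean, stdev

# Illustrative raw log lines from a hypothetical service.
logs = [
    "2024-05-01T10:00:01 GET /api/orders 200 123ms",
    "2024-05-01T10:00:02 GET /api/orders 200 131ms",
    "2024-05-01T10:00:03 GET /api/orders 200 118ms",
    "2024-05-01T10:00:04 GET /api/orders 500 2450ms",  # the anomaly
    "2024-05-01T10:00:05 GET /api/orders 200 127ms",
]

# Step 1: transform logs into a metric by extracting the
# trailing latency value (in milliseconds) from each line.
latencies = [int(m.group(1)) for line in logs
             if (m := re.search(r"(\d+)ms$", line))]

# Step 2: flag outliers with a simple z-score test. Real systems
# use far more robust statistics (and seasonality-aware models),
# but the shape of the computation is the same.
mu, sigma = mean(latencies), stdev(latencies)
outliers = [x for x in latencies if abs(x - mu) / sigma > 1.5]
print(outliers)  # [2450]
```

Once a check like this is codified, automating it (the "I have analyzed this once; can I automate it?" step) is just running the same transformation continuously against the fresh data and alerting on the flagged values.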

Dinesh Chandrasekhar (20:57):

Very cool. And in your response, you also mentioned the magic word: LLMs, large language models. These days you cannot have a conversation without talking about GenAI and LLMs, so I’m glad you mentioned it, because I definitely want to ask you about LLM observability. It looks like that’s an emerging space, given that we have a proliferation of LLMs everywhere and people are struggling to understand how they’re performing. So tell us about that. It looks like Kloudfuse is also building something on that front, right? Tell us more about it.

Krishna Yadappanavar (21:32):

Yes. Fundamentally, LLM models are getting deployed in various use cases, right? And, for lack of a better word, the dynamism of the data changes; especially in the observability world, the data is very dynamic. Building the right LLM model to do certain operations is always hard. So we have looked at the problem in two ways. One aspect is, how can I leverage certain LLM models on top of the existing observability data, which is getting consumed by all the different personas I talked about, whether it is the DevOps, SecOps, or DataOps guys or the LLMOps guys. And there is the other aspect: hey, I’m developing an application where the LLM is a very, very critical component. How do I get complete observability of an application which is producing the data that gets fed into the LLM, with a lot of consumers then consuming the data from those LLM models?

In fact, I can say that we are the first to think end to end about what true observability means for all the applications developed using LLM models. I have come across a lot of solutions looking just at model observability, like drift and things like that, but we are looking end to end. That’s a very interesting aspect, because a lot of infrastructure and application observability goes along with the model observability and the rest of it. And last but not least, if you go and ask any CIO or CFO about the cost of the LLM with the current solutions, cost is another key dimension. How do you keep that cost down, or even provide analytics on the cost metrics of the LLM models themselves, is another aspect of it. So you have to look into everything: performance, let’s call it an APM for LLM applications, and then the cost. These are the typical dimensions you would look at.

Dinesh Chandrasekhar

Very cool. It’s an exciting space, and I’m definitely excited about learning what’s coming in it over the next few months. So thank you so much, Krishna. This has been a wonderful conversation. Loved having you on our show, and loved talking about observability. I’m going to remember your three Cs: Cardinality, Control, and Costs. I think it’s a great mantra to look at observability through. So thank you for all the insights. Appreciate you being here. Thank you.

Krishna Yadappanavar

Thanks a lot, Dinesh. Thanks for having me on your web chat.

About Dinesh Chandrasekhar

Dinesh Chandrasekhar is a seasoned marketing executive, a technology evangelist, and a thought leader with close to 30 years of industry experience. He has an impressive track record of taking new integration/mobile/IoT/Big Data products to market with a clear GTM strategy of pre-and-post launch activities. He is the founder and CEO of Stratola, a top-notch business strategy consulting and full-stack marketing services company. Dinesh has extensive experience working on as well as marketing enterprise software and SaaS products delivering sophisticated solutions for customers with complex architectures. As a Lean Six Sigma Green Belt, he has been the champion for Digital Transformation at companies like LogicMonitor, Cloudera, Hortonworks, Software AG, CA Technologies, and IBM. Dinesh has been pivotal at many companies in creating new categories, identifying new growth areas, championing new sales plays, and being a vocal supporter of the brand and its cause.
