Enterprise data quality needs sophisticated, proactive data management approaches that leverage observability data (e.g., data lineage) to automatically apply monitoring, identify data quality issue impacts, and summarize observability insights using Generative AI.
Organizations face data quality issues whose impact grows as they increasingly embed data in their products and services. These issues stem from problems with data entry, data capture, and data pipeline failures. Traditional approaches to data monitoring and observability fall short: they are not scalable, not proactive, and provide only partial visibility into the data pipeline.
RTInsights recently sat down with Kyle Kirwan, CEO and Co-founder of Bigeye, to discuss these issues and more. What came out of our talk is the recognition of the growing need for more sophisticated, proactive data management approaches that leverage observability data (e.g., data lineage) to automatically apply monitoring, identify data quality issue impacts, and summarize observability insights using Generative AI.
Here is a lightly edited summary of our conversation.
RTInsights: At a high level, what data quality issues are organizations facing today that are impacting their operations?
Kirwan: Data quality has been a problem for quite a long time. The problem itself has not fundamentally changed, but its impact has, because of the way organizations now leverage their data.
Historically, companies have leveraged their data for analytics and reporting. And now, more and more companies are leveraging data directly in their products and services. Sometimes, that’s powered by a predictive model or a machine learning model. So, the impact of data quality issues has been magnified by the differences in the applications of data. But the fundamental reasons for data quality issues are pretty old at this point. It could be issues with data entry. It could be issues with the way that data was captured through an automated mechanism or not being captured correctly. Those would be problems with the sources.
Data quality issues can also arise with the infrastructure that’s being used to process the data. You could have a problem with an ETL job not completing or failing to run. Issues can also arise with the code that’s used in the data pipeline itself. For example, let’s say somebody makes a mistake when writing an ETL job, which could transform the data in a way that was unintended.
These are all very old reasons for data quality issues to occur, and they are still largely the same things that happen today. What has changed over time is, again, the impact they have on organizations. In particular, the scale of the impact a data quality problem can have nowadays is much larger, and the cost is much bigger than it used to be. These issues have been driving a lot of the investment in data quality over the last five or so years.
RTInsights: What are the issues with and shortcomings of traditional approaches to data monitoring and observability?
Kirwan: Pre-observability, the obvious data quality challenges were scalability and the lack of proactivity. If you're used to writing data quality rules, you have basically two major issues. One is that producing the rules is expensive because a human being has to think of what's going to be encoded in each rule. And then, because each rule is so expensive to create, you can only really write rules for a limited set of things that could go wrong with the data.
That means you have to think in advance about what you want to test for. Lots of different things can go wrong in a modern complex data pipeline or architecture. And so that means it’s basically impossible to create rules for every possible thing that could go wrong.
Those are the challenges with traditional data quality.
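The scalability problem above can be made concrete with a minimal sketch. The table and rules below are invented for illustration; the point is that each rule encodes exactly one failure mode a human anticipated in advance, so coverage is limited to what someone thought to write down.

```python
# A minimal sketch of hand-written data quality rules, assuming a
# hypothetical `orders` table loaded as a list of dicts. Each rule
# covers only the one failure mode its author anticipated.

def check_no_null_customer_id(rows):
    """Rule 1: every order must reference a customer."""
    return all(row.get("customer_id") is not None for row in rows)

def check_positive_amounts(rows):
    """Rule 2: order amounts must be positive."""
    return all(row["amount"] > 0 for row in rows)

orders = [
    {"customer_id": 17, "amount": 42.50},
    {"customer_id": None, "amount": 9.99},  # caught only because Rule 1 exists
]

results = {
    "no_null_customer_id": check_no_null_customer_id(orders),
    "positive_amounts": check_positive_amounts(orders),
}
print(results)  # {'no_null_customer_id': False, 'positive_amounts': True}
```

Any failure mode without a corresponding rule, say a duplicated order, a stale load, or a schema drift, passes silently, which is why rule-writing alone cannot keep up with a modern pipeline.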
And then, with data monitoring, you get more visibility than with data quality rules, but still not the level that observability promises. Observability requires monitoring as well as other elements like lineage, which is the ability to see each step the data goes through along the data pipeline. That's not an element of monitoring, but it is an element of observability.
Observability gives you the ability to fully introspect the data itself for quality issues, see where it has come from and where it is going, and the ability to see information about the infrastructure on which the data is being processed and whether it is running as expected. All of that combined is what we call observability.
So, the shortcoming with just monitoring is that it’s a piece of the bigger observability puzzle, but it’s not the full picture. If you have data quality monitoring but you lack things like lineage, you don’t have information about where the data came from and where it’s going.
Talking about shortcomings in observability itself, I think for most people, it’s that observability tells you what’s going on in your system, and it allows you to understand why that’s happening. However, most people want to use observability to achieve some outcome, such as higher data quality. That’s where you really need to bring in best practices and processes, such as things like incident management.
Observability can tell you that something went wrong, it can help you see where it went wrong, and it can help you understand why it went wrong, but then your organization still has to decide what to do about it. That is not something that observability traditionally solves for, so you need to bring in additional things like incident management or some sort of a response process to actually take action on what your observability system’s telling you.
I don’t know if that’s exactly a shortcoming, but when people think of observability, what they’re really looking for is the full solution. Observability itself only gets you part of the way there, and it’s that missing chunk at the end that I think many would expect to be included. That’s what people are really looking for when they think of observability overall.
See also: The Essential Role of Data Lineage in Enterprise Data Observability
RTInsights: Could we talk about the need for proactive data management and a more sophisticated approach to data observability, a so-called data observability 2.0?
Kirwan: Going back to what I said earlier about data monitoring and lineage, what we’ve seen from a lot of the current data observability offerings is that they’ve gotten fairly good at gathering signals and metadata or information about what’s going on in your systems. Then, the challenge is how do you activate that? How do you take that information that you’re getting from your observability platform and use it in some way? For example, in Bigeye, we leverage our lineage graph to help automate various actions that a data team might want to take when it comes to data management.
One example is if we want to add monitoring to our data environment, one way to do that is to have a human being go and say, “Okay, I want to add monitoring to this table and that table and that table.” Another way to do it would be to look at your lineage graph and say, “Okay, which users are using what data?” And if I want to be able to tell those specific users that their data is healthy, then looking at the lineage leading to that user and the data that they’re consuming, what are all of the places along my data pipeline where monitoring needs to be added? And can the system add that monitoring for me automatically? That’s something that Bigeye can do.
Now, the observability pieces are the lineage graph itself and the monitoring that’s going to be added. But Bigeye can add automation by saying, “Well, if we already know the lineage graph, we can automatically determine where the monitoring needs to be applied.” That is a level of sophistication that we haven’t seen so far in the category generally.
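The idea of deriving monitoring placement from lineage can be sketched as an upstream graph walk. The table names and graph shape below are invented for illustration, and this is not Bigeye's implementation, just the general technique: start from the assets a user consumes and walk upstream so every feeding table gets covered.

```python
# A minimal sketch, with hypothetical table names, of using a lineage
# graph to decide where monitoring belongs: start from the tables a
# user consumes and walk upstream (BFS) over inverted edges.
from collections import deque

# Edges point downstream: upstream_table -> [downstream_tables].
lineage = {
    "raw_events": ["stg_events"],
    "stg_events": ["orders_daily"],
    "raw_customers": ["dim_customers"],
    "dim_customers": ["orders_daily"],
    "orders_daily": ["exec_dashboard"],
}

def upstream_of(consumed, lineage):
    """Return the consumed tables plus everything that feeds them."""
    parents = {}  # invert the edges so we can walk upstream
    for src, dests in lineage.items():
        for dest in dests:
            parents.setdefault(dest, []).append(src)
    seen, queue = set(consumed), deque(consumed)
    while queue:
        table = queue.popleft()
        for parent in parents.get(table, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

# Monitoring is applied along the dashboard's whole upstream path.
to_monitor = upstream_of({"exec_dashboard"}, lineage)
print(sorted(to_monitor))
```

Because placement is computed from the graph rather than chosen table by table, adding a new consumer or a new upstream source changes the monitored set automatically.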
So far, the observability space has been focused on how many sources we can connect to. How much metadata can we pull out? How good can anomaly detection be? Bigeye has been the first platform to really push the automation barrier. Now that we have that observability information, how do we actually leverage it? And the lineage graph is one of the big places that we’ve looked to first to be able to create automation on top of the observability data.
RTInsights: How does Bigeye help in this area?
Kirwan: Lineage is the big area that we’ve been investing in over the last year or two. My team has spent a lot of time looking at where we can automate tasks for our customers on top of that lineage graph that we have for them. The reason that we’ve decided to focus on that lineage graph is because, for a lot of our customers, lineage is a big piece of metadata that they don’t have a great picture of.
When they deploy Bigeye and, for the first time, get that really detailed map of their lineage, it opens up a lot of opportunities for us to automate things for them because we know that lineage graph.
I mentioned the example of automatically applying monitoring to the right places in the data model. By way of contrast, a naive approach would be: okay, we hook up the data observability system to our various databases and to the data warehouse, but then someone’s got to decide what monitoring to turn on. That’s still a manual process in most data observability offerings on the market. We said, “Okay, that’s a very tedious task when you have 10,000, 20,000, or 100,000 tables. So where can we add automation there?”
Another example would be to do a root cause analysis or impact analysis. For example, when we flag a data quality problem for our customers automatically, we then take that issue, look at their lineage graph, and automatically summarize for them that this is the blast radius that this issue is going to have in your environment.
We can do this because we can see the lineage graph, including all of the downstream tables and reports that are going to be impacted. We can roll all of that up for them, and we can assign a severity level to the data quality issue. Normally, that would be a human judgment call. Someone would have to say, “Okay, we have an issue on the users table,” and decide whether that seems important or not. Instead, we’re able to automatically assess the severity of the problem for them because we have all that lineage metadata.
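The blast-radius idea described above is essentially a downstream reachability computation. The sketch below uses invented tables and an invented severity threshold, and is a general illustration of the technique rather than Bigeye's actual logic: walk downstream from the affected table, collect everything reachable, and map the size of that set to a severity label.

```python
# A hedged sketch, with hypothetical tables and thresholds, of turning
# a lineage graph into a blast radius and a severity level for an issue.
from collections import deque

lineage = {  # upstream -> downstream edges
    "users": ["sessions_enriched", "billing_rollup"],
    "sessions_enriched": ["weekly_kpi_report"],
    "billing_rollup": ["revenue_dashboard", "finance_export"],
}

def blast_radius(table, lineage):
    """All downstream tables/reports reachable from the affected table."""
    impacted, queue = set(), deque([table])
    while queue:
        for child in lineage.get(queue.popleft(), []):
            if child not in impacted:
                impacted.add(child)
                queue.append(child)
    return impacted

def severity(impacted):
    """Map blast-radius size to a label (thresholds are invented)."""
    if len(impacted) >= 4:
        return "critical"
    if len(impacted) >= 2:
        return "high"
    return "low"

impacted = blast_radius("users", lineage)
print(sorted(impacted), severity(impacted))
```

In practice a real severity score would also weigh what the downstream assets are, such as an executive dashboard versus a scratch table, but even this size-based version replaces a manual judgment with a computed one.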
RTInsights: Are there any other things we should discuss?
Kirwan: There has been a big increase in questions about the use of AI and GenAI this past year. There are some low-cost, low-impact places to drop GenAI into observability products, and we have so far avoided those. Instead, we are actively exploring with our customers where GenAI can best be leveraged in the product to yield an actual improvement. One thing GenAI has shown it is very capable of is summarizing a volume of information that would be difficult for a human to parse, boiling it down into something a human being can understand.
An example where that might be applied is, again, if we detected an anomaly in the customer data. We can look at the entire lineage graph and look at the other monitors that we have running in their environment. Then, we can start to boil that signal down for them and give them a summary of what’s going on in their system.
Now, they would be able to do that manually by looking at the observability data. They can look at what alerts are firing. They can look at the lineage graph. It’s not that a human can’t do it; it’s that it’s a laborious task at scale. GenAI gives us an opportunity to say, “Great. We have all this really rich metadata about what’s happening in the customer’s environment. Can we use a GenAI model to boil all of that down into something that’s easier for a human being to consume?” That gives a direction for where the company is investing in GenAI over the next few months and quarters.
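The summarization step described above amounts to condensing firing alerts plus lineage context into a single input for a model. In the sketch below, the alert messages and table names are invented, and the model call itself is left as a commented placeholder since no specific API is described in the conversation.

```python
# A sketch of assembling observability signals into one prompt for a
# GenAI summarizer. Alerts and table names are hypothetical; the model
# call at the end is a placeholder, not a real client API.

def build_incident_summary_prompt(alerts, downstream_assets):
    """Condense raw observability signals into a single prompt string."""
    lines = ["Summarize this data incident for an on-call engineer.", "",
             "Firing alerts:"]
    lines += [f"- {a['table']}: {a['message']}" for a in alerts]
    lines += ["", "Downstream assets at risk:"]
    lines += [f"- {t}" for t in downstream_assets]
    return "\n".join(lines)

alerts = [
    {"table": "stg_events", "message": "row count dropped 80% vs. 30-day median"},
    {"table": "orders_daily", "message": "freshness SLA missed by 6 hours"},
]
prompt = build_incident_summary_prompt(alerts, ["exec_dashboard", "weekly_kpi_report"])
print(prompt)

# In production, this prompt would be sent to a model, e.g.:
# summary = llm_client.summarize(prompt)  # hypothetical client
```

The value is in the gathering step: the lineage graph and monitor results determine what goes into the prompt, so the model only has to compress signals that are already scoped to the incident.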