Instead of looking at data as just information somewhere in silos, you should look at data as an asset. In other words, treat data as a product.
Many businesses are pouring significant resources into digital transformation and migration projects. Unfortunately, these rote efforts simply carry over the same complexities and issues that have prevented them from getting real value from their data.
RTInsights recently sat down with Animesh Kumar, co-founder and CTO of The Modern Data Company, to talk about the issues in this space, how a new type of thinking centered on data as a product needs to be embraced, and what benefits can be achieved when this approach is taken.
Here is a summary of our conversation.
RTInsights: What are the issues that make data management and migration so complex today?
Animesh: Fundamentally, the way we approach this problem is the problem itself. For example, suppose you start to migrate data from your legacy systems into a cloud-based system or the modern stack. You are just moving it as if you bought a new tool. You’re moving every piece of information you have into the new tool without understanding its complexities, nuances, or benefits.
If you look back a couple of years, when Azure first entered the big data space, it introduced a service called HDInsight, which was Hadoop on Azure Cloud. It installed exactly the way you would deploy Hadoop in your on-premises data center, and people started moving their Hadoop ecosystems from Cloudera and Hortonworks to Azure HDInsight.
But then, they were not able to realize any value. They went into the cloud, but they could not leverage any of the cloud’s basic tenets, like on-demand compute, scalability, and elastic clusters. All of those benefits were missed.
That is a fundamental problem. If I take everything from my, let’s say, Teradata instance and move everything into Snowflake, that movement itself is going to take me a couple of years. You have to build all the pipes and remove the dependencies. But I’m treating Snowflake as just another Teradata on the cloud.
What we are trying to emphasize here is if you can introduce a concept of product thinking or right-to-left engineering, then you first decide what you want to achieve with this migration. Once you have defined the outcomes or the right-hand side, you start working backwards and figure out what data needs to be moved instead of expensively shifting the entire foundation. And if any data has to be moved, is there any transformation that needs to happen so I can leverage the new ecosystem better and drive my outcomes? The product mindset is all about optimizing outcomes without stressing existing foundations.
RTInsights: What is the impact of this complexity on a business?
Animesh: Businesses are supposed to do digital transformations faster so that they become data-driven and start making better choices and decisions that will impact their business goals. But it takes them two years to do the migration.
And after the migration is done, they realize that they haven’t really changed much or anything at all. They have just adopted a new tool that the herd is running to, which has carried forward most of the challenges they had with the legacy systems, along with a few new ones. As a result, they’re not able to make their data-based decisioning any faster than their earlier pace.
So, businesses are losing out first on time, and second on the value that migration promised. We have a couple of customers who have done three to five years’ worth of migration and spent upwards of $50 to $70 million on it, yet they’re still running exactly the same way they were four or five years ago.
RTInsights: What is the data development platform, and how does it help?
Animesh: Instead of looking at data as just information somewhere in silos, you should look at data as an asset, data as a product, as if you are building a mobile application or any tool. So, you have to have some semblance of the outcomes, like, what kind of metrics, KPIs, or information you want to deduce from this data. And then, you work backwards to specifically meet the goals and prevent efforts on tangents that diverge from these goals.
So, you introduce the product thinking, and when you start to look at the problem from the product perspective, you start to define, who’s the persona of this product. Who’s going to use this product? Who’s going to maintain this product? What are the metrics that are essential for this product’s upkeep? What are the basic KPIs that this product has to expose? What is the usability aspect of this product? What kind of users are going to use this information, and how are they going to consume it?
All these details have to be thought through, and these details combined make up the data product. Once the product has been defined, you need a platform for the data developer to build these products. When we say data developer, it’s not only the data engineer. It could be the product manager, business analyst, data analyst, SQL developer, or any persona developing the data. A data developer platform is a data infrastructure specification that enables self-service for each of these personas, with enough tooling support and boilerplate to quickly experiment, productize, and make the data product available to data consumers.
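To make the product definition above concrete, here is a minimal, hypothetical sketch in Python. The field names mirror the questions raised in the interview (persona, ownership, KPIs, consumption modes); they are illustrative and do not reflect any actual DataOS or Data Developer Platform schema.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative only: a data product declared as a first-class object,
# capturing the questions product thinking forces you to answer.
@dataclass
class DataProduct:
    name: str
    owner: str                          # who maintains this product
    personas: List[str] = field(default_factory=list)   # who consumes it
    kpis: List[str] = field(default_factory=list)       # metrics it must expose
    output_formats: List[str] = field(default_factory=list)  # how it is consumed

orders = DataProduct(
    name="orders",
    owner="commerce-domain-team",
    personas=["ml-team", "analyst", "cdo"],
    kpis=["order_volume", "cart_abandonment_rate"],
    output_formats=["dataframe", "sql", "dashboard"],
)

print(orders.name, orders.personas)
```

The point of the sketch is that a data product is defined before it is built: its consumers, owners, and KPIs are explicit metadata, not afterthoughts.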
So, instead of looking at these problems as migration, ETL, observability, or governance problems, you have to look at the problem holistically. For example, let’s say you have orders data or transactional information somewhere that you want to productize, and your primary persona is the ML team, who would use this data for cart abandonment use cases. They’ll create a model out of the data. Now, if the data scientist is my persona, they need to be able to consume this data as a pandas.DataFrame or through a Python framework.
They don’t really care about writing SQL on top of the data. They care about writing Python programs on top of it. But that’s not all; the same data might be consumed by, say, the consultant on the case to run queries, and the CDO might want to look at the same data through a dashboard. To achieve this, you have to make the data product consumable by a wide variety of consumers. The Data Developer Platform plays a key role here by enabling consumption-agnostic outputs. In other words, high-quality and optimally governed data presented by the data product can be materialized by any chosen consumer endpoint without the complexities of managing multiple access patterns.
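The idea of consumption-agnostic outputs can be sketched with Python’s standard library: one governed dataset behind a single store, materialized differently for two personas. All names here are illustrative, not DataOS APIs, and the in-memory SQLite table stands in for whatever the product’s underlying storage actually is.

```python
import sqlite3

# One canonical, governed dataset for the "orders" data product.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "completed"), (2, "abandoned"), (3, "completed")],
)

# Persona 1: the analyst writes SQL directly against the product.
abandoned = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE status = 'abandoned'"
).fetchone()[0]

# Persona 2: the data scientist pulls the same data as Python records
# (in practice this would be materialized as a pandas.DataFrame).
records = [
    {"id": r[0], "status": r[1]}
    for r in conn.execute("SELECT id, status FROM orders ORDER BY id")
]

print(abandoned)     # -> 1
print(len(records))  # -> 3
```

Both consumers see the same governed data; only the materialization differs, which is the access-pattern problem the platform is meant to absorb.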
RTInsights: How does Modern help in this area?
Animesh: Modern’s flagship product, DataOS, is a direct implementation of the Data Developer Platform Specification.
DataOS is a data operating system that has been developed on the ethos of a data development platform. It connects all the aspects of data development, from data processing and lifecycle management to discoverability, observability, and governance. DataOS gives you a way to start working with your data immediately with productized qualities.
RTInsights: Can you talk about some of the examples of people using DataOS or user success stories?
Animesh: We have a large liquor distribution company that had fundamental issues with distributing liquor from business to business. They were the hub in the center. During Covid, when people were at home, their drinking preferences changed from beer to wine. These guys were not able to detect such changes. They lost almost $250 million because of this lack of insight.
They were very rudimentary in terms of the digitization of their process chain. A lot of the time, if, say, Walmart placed an order, it took the sales people responsible for that account two days to learn that a new order had been placed.
Sometimes the order used to be accepted, and in the distribution center, they didn’t have enough stock to fulfill those orders. So, there was a lot of to and fro. We started helping them to look at this problem from the product perspective. They started building data products around transaction history, user retention, pattern identification, and more. The data products that we have built in collaboration with them helped them drive their businesses more efficiently.
RTInsights: How should data teams be thinking about data as a product or data mesh?
Animesh: There’s a nuance here. Data mesh is basically a data design framework: the technological alignment of a company’s organizational structure.
Imagine a company that has a central IT group, and all problems related to data go to that one mammoth team. This company is not designed for the data mesh architectural pattern, because data mesh requires decentralized teams. In a decentralized environment, domains own their own data and expose sets of data products for other people or other teams to consume.
And then there are companies that are designed in this decentralized way. So, there’s a marketing department with enough capability to produce and manage its own data products. And this is typically how any decent-sized company works. One team has its own Oracle database, and somebody else has their own Snowflake instance. These teams can adopt a data mesh architectural paradigm to start exposing data products from their domain, which can then be consumed by other domains.
So, data mesh is basically the design framework, and the data product is the fundamental unit powering this design. Data products are not dependent on data mesh, but data mesh depends on data products as its fundamental building block. Similarly, Data Products can power other data design patterns, depending on the organization’s approach. You can have a data fabric, you can have a central Snowflake, a warehouse, and then multiple people can produce multiple data products on top of Snowflake and let everybody else use them.
RTInsights: Could you talk about how various architectures are difficult to deploy and how you are helping in this area?
Animesh: We should be looking at the data infrastructure like a set of building blocks. Think of DataOS as AWS, where you bring together different building blocks to create a plethora of different data solutions.
You can go to Amazon and request raw compute, raw storage, and raw networking, and you can compose them together into any kind of infrastructure or architecture that would meet your needs. You have all the necessary building blocks, and you can build a data center experience as Azure did for HDInsight, but it can be completely cloud-driven using Terraform, Ansible, and whatnot.
In a similar spirit, you should look at DataOS as providing the building blocks necessary for data operations: how you transform, how you ingest, how you activate, how you query, and so on. You compose a couple of DataOS resources and deploy the solution.
If we go up another level, these building blocks are composed together to build data design frameworks such as data products and data mesh. So, you can have DataOS, and you define, let’s say, different workspaces for different domains. These domains could have complete infrastructure isolation with containerized resources. As a result, the data in these domains would also be completely isolated, without corrupting each other. All the domains can be on DataOS and work in the data mesh fashion as separate entities.
So, the idea is that once you have DataOS, which takes roughly 30 minutes to install on your cloud account, adopting a data fabric-based or data mesh-based architectural pattern becomes a matter of days rather than the months we typically see in the industry today.