Today’s data lakehouse requirements can include real-time data sources and high-performance queries to provide almost real-time insights. Learn how Kafka streams and powerful query engines can extend a lakehouse’s capabilities.
The data lakehouse has emerged as a flexible, multi-use repository. In this Smart Talk episode, Dinesh Chandrasekhar, CEO of Stratola, and his guest, Justin Borgman, CEO and Chairman of Starburst, discuss how to stretch the capabilities of a data lakehouse to include real-time data and high-performance queries that can provide almost real-time insights, an increasingly common use case. Two key technologies are required: Kafka streams and a powerful query engine.
Especially interesting are their perspectives on the importance of open-source software and open formats, a view validated by Snowflake and Databricks announcing support for Apache Iceberg. Justin also shares his advice for benchmarking solutions: use your enterprise data, run your actual queries, simulate scale, and finally, calculate costs.
Topics covered include:
- Kafka for streaming real-time data into data lakehouses (4:22)
- Open formats’ advantages (5:56)
- SQL’s supporting role for GenAI (8:53)
- Snowflake, Databricks, and Iceberg (11:56)
- Flexible data repository strategy (17:21)
Guest
Justin Borgman, CEO and Chairman, Starburst
Justin Borgman is a subject matter expert on all things big data & analytics. Prior to founding Starburst, he was Vice President & GM at Teradata (NYSE: TDC), where he was responsible for the company’s portfolio of Hadoop products. Justin joined Teradata in 2014 via the acquisition of his company Hadapt where he was co-founder and CEO. Hadapt created “SQL on Hadoop” turning Hadoop from a file system to an analytic database accessible by any BI tool. He founded Starburst in 2017, seeking to give analysts the freedom to analyze diverse data sets wherever their location, without compromising on performance.
Host
Dinesh Chandrasekhar is a technology evangelist, a thought leader, and a seasoned IT industry analyst. With close to 30 years of experience, Dinesh has worked on B2B enterprise software as well as SaaS products delivering and marketing sophisticated solutions for customers with complex architectures. He has also defined and executed highly successful GTM strategies to launch several high-growth products into the market at various companies like LogicMonitor, Cloudera, Hortonworks, CA Technologies, Software AG, IBM etc. He is a prolific speaker, blogger, and a weekend coder. Dinesh holds an MBA degree from Santa Clara University and a Master’s degree in Computer Applications from the University of Madras. Currently, Dinesh runs his own company, Stratola, a customer-focused business strategy consulting and full-stack marketing services firm.
Resources
Smart Talk Episode 7: Cardinality, Control and Costs in Observability
Smart Talk Episode 6: AIOps and the Future of IT Monitoring
Smart Talk Episode 5: Disaggregation of the Observability Stack
Smart Talk Episode 4: Real-Time Data and Vector Databases
Smart Talk Episode 3: Modern Data Pipelines and LLMs
Smart Talk Episode 2: The Rise of GenAI Applications with Data-in-Motion
Smart Talk Episode 1: The Data-in-Motion Ecosystem Landscape
View the data-in-motion ecosystem map here
Learn more about data-in-motion on RTInsights here
Transcript
Dinesh Chandrasekhar:
Hello and welcome to this episode of Smart Talk, the Data-in-Motion Leadership series. I’m your host, Dinesh Chandrasekhar, chief analyst and founder of Stratola. Our guest today is Justin Borgman, CEO and chairman of Starburst. Justin has had a stellar career in security and data analytics companies, and prior to founding Starburst in 2017, he had founded a company called Hadapt, which was later acquired by Teradata, where he served as a VP and GM for quite a number of years. Welcome, Justin. And so let’s start with Starburst, right? I think a lot of people know Starburst as a brand, but there are quite a lot of people that are also eager to learn a little bit more about Starburst. Tell us about Starburst, particularly its origins and your drive to start the company.
Justin Borgman:
Yeah, my pleasure. So as you mentioned in the introduction, I’ve been in the data analytics space for about 15 years now, going all the way back to that first startup, which was acquired by Teradata. Of course, as I’m sure your audience knows, Teradata for many decades frankly, was the leader in data warehousing analytics. And that model really necessitated moving all of your data into a proprietary database, which was your enterprise data warehouse. And from there you could run fast analytics and understand your business. I think what we saw was an opportunity to basically turn that model on its head, particularly in two ways. Number one, the ability to leverage open table formats in a data lake, so giving you data warehousing performance. But in a data lake, sometimes people call this a lakehouse architecture today, as well as being able to reach out to other data sources and join tables that live in another database with tables sitting in that data lake.
So for example, you might have an Oracle database or SQL Server database, and you want to join a table in one of those systems with a table in an Iceberg file format in a data lake. And that’s essentially what our technology does. It’s the underlying technology called Trino. It’s an open source project. It was originally born out of Facebook, and it is how many of the largest internet companies, LinkedIn, Airbnb, Netflix, Apple, etc. do their own data warehousing analytics. Again, in that model where the data lake is the central repository where they can get very low cost of ownership, storing data in these data lakes, as well as being able to join other tables as well. And so really Starburst is just the commercialization of that open source project. We provide an enterprise version of Trino that has extra security features, extra connectors, extra performance benefits, and a whole host of other features and functionality.
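The cross-catalog join Justin describes can be sketched in miniature. This is a toy illustration, not the Trino engine: the catalog names, table names, and data below are all made up, and the hash join stands in for what a distributed engine does across connectors.

```python
# Conceptually, a federated query engine combines rows from two different
# catalogs on a shared key, the way a Trino query like this would
# (hypothetical catalogs/tables):
#
#   SELECT o.customer_id, o.total, c.segment
#   FROM oracle.sales.orders o
#   JOIN iceberg.lake.customers c ON o.customer_id = c.customer_id;

# Rows as they might come back from each connector (made-up data)
oracle_orders = [
    {"customer_id": 1, "total": 120.0},
    {"customer_id": 2, "total": 75.5},
]
iceberg_customers = [
    {"customer_id": 1, "segment": "enterprise"},
    {"customer_id": 2, "segment": "smb"},
]

def federated_join(left, right, key):
    """Simple hash join: build a lookup on one side, probe with the other."""
    lookup = {row[key]: row for row in right}
    return [{**l, **lookup[l[key]]} for l in left if l[key] in lookup]

result = federated_join(oracle_orders, iceberg_customers, "customer_id")
print(result)
```

The point is only that neither table had to be moved into the other system first; the engine fetches rows from each source and joins them in flight.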
Dinesh Chandrasekhar:
Thank you. And I definitely want to dive a little deeper into Trino and Iceberg and all that. I think those are all great topics for today, but can I step back a little bit and ask you about if you were to look at the evolution of data architectures, we had the traditional databases and then data warehouses come about, and with the explosion of data and the need for processing more real-time data, lakehouse architectures and others came about. So in your world, as you look at the evolution of data architectures, data lakehouse, and in your case I think you have a concept called the Icehouse as well, how has that impacted organizations’ ability to handle real-time data effectively?
Justin Borgman:
Yeah, great question. And just to clarify for your listeners, the Icehouse concept is really just an Iceberg-based lakehouse. So the data’s stored in an Iceberg table format and you’re able to do data warehousing-style analytics on top of that. The net result provides really low total cost of ownership as well as the ability to handle near real-time data as you described. And the way that we think about that is we see a tremendous increase in the amount of streaming data technologies in the market, like Kafka for example, where customers are increasingly using them to stream data in near real time into a data lake.
And from our standpoint, that’s where we want to pick it up. We’ve built something we call streaming ingest where you can connect to a Kafka stream and we will automatically turn that into Iceberg tables and make them available for querying almost instantaneously. So this does enable a business now to have much faster fresher insights on their data as a result of this architecture.
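The micro-batching idea behind streaming ingest can be sketched as follows. This is a hypothetical simplification, not Starburst’s actual implementation: a list stands in for the Kafka topic, and each committed batch stands in for an Iceberg append snapshot that becomes queryable right away.

```python
# Toy sketch of stream-to-table ingest: consume records, buffer them, and
# commit small batches as append-only snapshots so new data is queryable
# almost immediately (all names and sizes here are illustrative).

stream = [{"event": "click", "ts": t} for t in range(7)]  # stands in for a Kafka topic

BATCH_SIZE = 3
table_snapshots = []   # each committed batch becomes a new queryable snapshot
buffer = []

for record in stream:
    buffer.append(record)
    if len(buffer) == BATCH_SIZE:
        table_snapshots.append(list(buffer))  # "commit" an append snapshot
        buffer.clear()
if buffer:                                    # flush the partial tail batch
    table_snapshots.append(list(buffer))

queryable_rows = [row for snap in table_snapshots for row in snap]
print(len(table_snapshots), len(queryable_rows))
```

The trade-off a real system tunes is the commit interval: smaller batches mean fresher data but more snapshot overhead.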
Dinesh Chandrasekhar:
Thank you. So the lakehouse definitely promises to be a very unified architecture for batch and real-time analytics. Could we say that? I mean, how do you see this architectural shift transforming BI and traditional decision-making across industries today? How has that changed?
Justin Borgman:
Yeah, I do see it changing things pretty dramatically. I think one of the drivers and one of the benefits of this architecture is as simple as economics. At the end of the day, those traditional data warehouses could get very expensive. That was actually probably one of the number one complaints during my time at Teradata. Nobody ever said Teradata was a bad database. It’s actually a great database system. It just happens to be extremely expensive and once you’re in, you’re in and you’re sort of committed.
And so this data lake allows you a greater flexibility because you’re using open formats, which allows for the customer to choose what is the right engine to access my data. It gives you a lot of flexibility, reduces the lock-in, but also allows you to store your data in really inexpensive commodity storage, which in the cloud context is increasingly S3 or Google GCS or Azure Data Lake storage. And even in the on-prem world, we see S3 compatible object storage from companies like Dell or IBM or what have you, where you can basically get S3. So that becomes the sort of common foundation layer for storing data very, very cost effectively, and that is part of what’s driving this transformation.
Dinesh Chandrasekhar:
Okay, so let’s maybe now get into Trino, since I think that’s kind of the whole driver behind your offering. It has gained popularity over the years as a very powerful query engine in the real-time data space. How do you see its role evolving in the modern data ecosystem? Especially as you mentioned, there are other open source technologies like Apache Iceberg, which are also offering a lot of interoperability between different data systems and so forth. So how is Trino, combined with some of these other open source technologies, changing the modern data ecosystem?
Justin Borgman:
I think it’s becoming really the sort of Postgres of data warehousing. Postgres of course is a widely deployed, extremely popular open source database, a traditional single-node RDBMS. Trino is sort of like the MPP (massively parallel processing) data warehousing analytics equivalent. And so for your big data, for your data warehousing-style activities, this is now becoming the de facto open source choice.
Now sometimes people ask, well, what about Spark by comparison? Spark is a great general-purpose processing engine, but not really optimized for SQL analytics. And I think to your point earlier about business intelligence and decision-making, SQL is still the language of those types of use cases, whether it’s connecting a BI tool, running reporting, or even building data-driven applications, SQL continues to be a really important language to interface, and Trino is the number one engine for that in the market today.
When you combine it with something like Iceberg, as you said, you now have a complete data warehouse essentially. You have the query engine portion, you have the storage portion, and now you have a complete open data warehouse. They can also run anywhere, it can run on-prem, it can run in the cloud. So you have a lot of flexibility with that stack.
Dinesh Chandrasekhar:
Can I ask you a little offshoot question? Since you mentioned SQL as kind of the go-to for a lot of these data stores these days, and I believe that in the last 30, 40 years, nothing has been able to shake that for sure. But with the advent of gen AI technologies and natural language processing everywhere, people are now talking about data democratization, where you can distribute data even to business analysts that probably don’t have equal knowledge, but who can use natural language to say, get me the last three months of sales within this particular region and so forth.
And internally, the system obviously translates that to SQL and then queries the engine or whatever, right? So do you see a shift in that as well? Is SQL going to thrive and survive, or is there going to be a shift in how we query data going forward?
Justin Borgman:
That’s a really great question and I do think you’re onto something there. I think gradually over time, I think generative AI as an interface will become super popular because to your point, it sort of dumbs it down for anybody frankly to use. So now it’s more of a Google experience on all of the data in an enterprise, and that’s very exciting. In fact, we’ve incorporated an early version of that in our own product and I think everybody will, it will become table stakes.
I do think though, behind the scenes, those technologies will really just be converting that natural language into a SQL syntax for the engine to actually execute. So I think the language will still be important, but it may become more of an implementation detail behind a generative AI natural language style interface. I think you’re spot on. It kind of reminds me of when calculators or even graphing calculators were invented, suddenly we didn’t need to know all the formulas and exactly how to do long division because our calculator took care of that. I think that’s kind of what generative AI is going to do for us here.
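The “implementation detail” Justin describes, natural language in front and SQL underneath, can be illustrated with a deliberately tiny stand-in. A real system would use an LLM; this regex-based translator is purely hypothetical and exists only to show that the engine still receives plain SQL in the end.

```python
# Toy NL-to-SQL layer (hypothetical): parse a narrow question shape and
# emit the SQL the query engine would actually execute.
import re

def to_sql(question: str) -> str:
    # handles only questions like:
    #   "get me the last 3 months of sales within region EMEA"
    m = re.search(r"last (\d+) months of (\w+) .*region (\w+)", question)
    months, table, region = m.group(1), m.group(2), m.group(3)
    return (
        f"SELECT * FROM {table} "
        f"WHERE region = '{region}' "
        f"AND order_date >= date_add('month', -{months}, current_date)"
    )

sql = to_sql("get me the last 3 months of sales within region EMEA")
print(sql)
```

An LLM replaces the brittle regex with something far more general, but the contract is the same: natural language in, SQL out, engine unchanged.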
Dinesh Chandrasekhar:
Easier access to data, definitely for sure. I think that’s where we are headed. So definitely an exciting space. So we spoke about Trino. Can I shift gears and ask you about Iceberg again? That’s becoming very, very popular. I see the bigger behemoths in the industry starting to adopt Iceberg as a very natural way to say we are interoperable, we support it, and so forth. So as organizations increasingly adopt real-time analytics, what is the role of Iceberg in enabling more efficient and scalable data management? What is your opinion on that?
Justin Borgman:
Yeah, I think it’s a big deal. I think it’s the biggest story of 2024 other than AI. And the reason I say that is the format’s been around for a few years, but really this year the market sort of settled the debate on which format is going to win. There was a brief period where there were sort of three popular competing formats, and it was a question of who was going to win.
Our bet was always Iceberg; I guess I would say we predicted it would go this way. But I think the market has sort of agreed, really this summer, when both Snowflake and Databricks announced their own intentions to support it, and that sort of just killed the debate. Iceberg is the de facto standard, and customers are the real winners in this by far. And that’s because they can now store the data in a format that they own, that they control, that is portable for them, that is not in the hands of some database vendor that’s going to hold them hostage for decades to come.
They own that and that means they can play the engines off of each other. They can say, okay, Starburst is going to do this workload that’s going to give me the best cost performance for that. Maybe Snowflake is better for this workload. Maybe Databricks is better for that workload and the customer has the choice between these engines, which is amazing. When engines compete, you win as a customer and I think that’s really what Iceberg makes available.
Dinesh Chandrasekhar:
That was a great summary. I think that made clear the importance of Iceberg looking forward, as companies are standardizing on a model where everybody’s more interoperable. And I think it benefits the customer, as you said, without being tied down to a particular vendor, but allows them to be a little bit more open and flexible. That is a great point for sure.
Justin Borgman:
Exactly.
Dinesh Chandrasekhar:
Justin, why don’t we talk about maybe a customer example here because Trino and Iceberg are the center of the conversation today, tell us about maybe a customer case study where you’ve seen this practically put to use and what are the kind of benefits that they’ve seen by adopting Trino and Iceberg?
Justin Borgman:
Happy to. There are a number of examples, both from leading internet companies like a DoorDash to more traditional enterprises like Comcast that have been around a long time, that in both cases are moving workloads off of what I would call traditional data warehouse platforms onto Starburst.
In the case of Comcast, very traditional on-prem data warehouse. In the case of DoorDash, I would call it a very traditional cloud data warehouse. And in either case, what they’re trying to do ultimately is get better TCO on their SQL analytics and provide the flexibility to work with the latest cutting-edge technologies that can interface off of this one common format.
Again, to our previous point, I think what they’re also trying to do, and this relates to the AI topic, is they’re laying the foundations of getting their data architecture into place where they can now have easy access to the data that they need to train their own models or perform RAG workflows ultimately to support their own AI ambitions. And I think a lot of enterprises are in those early days of sort of figuring out what can AI do for me? How can this give me a competitive advantage?
And while they’re figuring that out, one thing I think that they’re all very clear on is that their own proprietary data is going to be central to giving them competitive advantage. And so setting up a data infrastructure that gives you access to what you need in a low cost high performant way is a core step in that process.
Dinesh Chandrasekhar:
Can I double-click on that and ask you about real-time data particularly? It often introduces challenges like schema evolution, where as the sources change, the target needs to adapt and so forth, and data versioning as well. How does Apache Iceberg help address some of these challenges in modern data platforms like this?
Justin Borgman:
So there is the concept of versioning and doing time travel and being able to sort of see how data has evolved within our platform. We’ve also added data lineage, data quality metrics that we’re able to capture and present to our users so that you can really understand where did that data come from, how has it evolved, how has it iterated and provide that visibility again ultimately to the end user.
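The time-travel idea can be sketched with a toy snapshot model. This is conceptual only: real Iceberg tracks snapshots in table metadata files and engines expose them through SQL (for example, Trino’s `FOR VERSION AS OF` syntax on Iceberg tables), whereas here a plain Python list plays that role.

```python
# Toy model of snapshot-based versioning: every write creates a new
# immutable snapshot, and a read can target any past snapshot id.

snapshots = []            # each commit appends a new immutable table version

def commit(rows):
    previous = snapshots[-1] if snapshots else []
    snapshots.append(previous + rows)   # append-only: old versions survive
    return len(snapshots) - 1           # return the new snapshot id

def read(snapshot_id=None):
    if snapshot_id is None:             # default: the latest version
        snapshot_id = len(snapshots) - 1
    return snapshots[snapshot_id]

v0 = commit([{"id": 1, "qty": 10}])
v1 = commit([{"id": 2, "qty": 5}])

# "Time travel": reading at v0 sees the table as it was before the second commit
print(len(read(v0)), len(read()))
```

Because old snapshots are never rewritten, lineage questions like “how did this data evolve?” reduce to walking the snapshot history.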
Dinesh Chandrasekhar:
Okay. Then with Trino, you spoke about how you can combine diverse data sources and do some joint querying and all that. Is the architecture moving more towards a centralized data source or data store, or is it keeping them where they are, but providing the ability to combine them and giving that visibility to consumers? What’s the end-state architecture that we are looking at here?
Justin Borgman:
Yeah, great question. There are elements of both, and I think that’s what’s always made it challenging for us to even articulate our own value proposition because people are used to one model and one frame of mind, which is centralize everything in a traditional data warehouse or you just don’t have access to it. And I think the way that we see the world evolving is that there will be a central repository which is going to be a data lake unquestionably, that is going to store the majority of the data or as much of the data as possible because you’re going to get economic benefits, you’re going to get performance benefits of storing as much as you can in iceberg formats in your lake. So we think that’s a great strategy for a lot of your data, but we also think there’s always going to be use cases where you’re going to want to reach out to some other data source.
Maybe it’s exploratory analytics. I have just a hypothesis that I want to go test that I think could be really big for our business, but I don’t want to develop all the ETL pipelines and go through all that process just for an idea, just for a hunch that I have. Well, that’s a great use case where being able to join a table that lives somewhere else with what you have is a game changer. It might actually allow you to go and test that hypothesis in a matter of minutes rather than the weeks it would take to get the teams to move the data. And so I think both are valuable, but we think of it as majority in the lake, and then reaching beyond that lake, is the way we think about it.
Dinesh Chandrasekhar:
So if I am an enterprise that is, let’s say, looking for a modern data platform, what are some of the critical performance considerations that I would want to have in my checklist when I’m looking at Trino versus a bunch of other alternatives? And my priority is, let’s say, handling real-time data queries, making sure that there is low latency, and things like that. So those are my requirements. What are some of the considerations that I would want to have in my checklist?
Justin Borgman:
Yeah. Well, the top two pieces of advice I would give are, number one, use real queries that you actually use. I think it’s very common for people to use industry benchmarks, and that’s fine as maybe a very cursory step, but it’s not going to be reflective of your workloads. It just never is. Every company has their own things that they’re trying to do. So it’s always best to try to simulate your end state as best as you can.
And that means leveraging your own queries and your own data as you’re putting together your own proof of concept and doing benchmarking. You just should never trust other vendors’ benchmarks exclusively, even our own. We have them, you can look at them, but you should really test this yourself with your own queries and your own data.
The second thing I would say is also make sure that you’re simulating scale and scale is important because this is where at least we find some of our own opportunities with customers to let’s say replace a vendor that they’ve purchased, where in the POC process, they thought that vendor met their needs, but as they got to real production scale, it just couldn’t handle it.
And this is where I think also there’s a great benefit to leveraging open source technologies like Trino, which have been proven at the largest scale imaginable. Apple is running it at insane scale, obviously Facebook at insane scale. So this stuff can work. It works at that scale. That should give you some peace of mind. But even still, I would say simulate it yourself in your own benchmarking process to really ensure that these different technologies are going to meet the needs that you have in production.
And then the third piece maybe that I’ll add is cost. Cost is also so important, right? Cost and performance are really just two sides of the same coin, and you need to factor that into your benchmarking too. You’re not just going to choose the fastest one; you want to choose the best cost-performance. And so it is an important component as well.
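Justin’s three-part checklist, real queries, production scale, cost-performance, can be condensed into a small scoring sketch. Engine names and all numbers below are made up for illustration; the point is only that the selection metric is dollars per query, not raw speed.

```python
# Toy benchmark scoring: pick the engine with the best cost-performance,
# not the fastest one (all figures are invented for illustration).

candidates = {
    # engine: (avg seconds per query at production scale, dollars per hour)
    "engine_a": (12.0, 4.00),
    "engine_b": (9.0, 8.00),   # the fastest, but twice the hourly price
    "engine_c": (15.0, 2.50),
}

def cost_per_query(seconds, dollars_per_hour):
    """Convert runtime and hourly price into an effective dollars-per-query."""
    return seconds / 3600 * dollars_per_hour

best = min(candidates, key=lambda name: cost_per_query(*candidates[name]))
print(best)  # the slowest engine wins here because it is the cheapest per query
```

Here the slowest engine actually has the lowest cost per query, which is exactly why “just pick the fastest” is the wrong rule.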
Dinesh Chandrasekhar:
I agree. I think that’s a major checklist item for a lot of people that are evaluating solutions out there for sure. So maybe let’s bring this to close from a trends perspective. I just want to ask you, there is a lot that’s happening in the data space today, right? So there are data warehouse vendors, lakehouse vendors, data lake vendors, and several alternatives, real-time analytics databases and whatnot.
The choices are definitely wide and confusing for the buyer. So from an emerging trends perspective, do you see some kind of convergence happening when it comes to real-time data processing, the data lakehouse architectures that we just spoke about, and the open source ecosystem in general? Is there any kind of convergence that you see happening that will make things clearer for the buyer in the near future?
Justin Borgman:
I do. I think we’re starting to see very popular patterns emerge. Very often these patterns originate with the internet hyperscalers and then translate to the enterprise over time. And I think we’re now at that point where it is making its way into the enterprise. And the patterns that I see are leveraging technologies like Kafka for the streaming portion. And of course you have multiple choices there; you can do Confluent, you can do Amazon’s version. You have choice in all of these open source platforms, which is great. I think Iceberg for sure for the format to store your data; that to me seems like the safest bet you could possibly make. And then on the engine side, again, finding the right engine for the right job. I think if it’s SQL analytics, we would say Trino and Starburst are the best bet, but you should prove that to yourself.
If you’re training a machine learning model, you’d probably use Spark for that. And those are the patterns that we see. I think all four of those technologies will be incredibly popular in open-source-derived data architectures for years to come. And again, open source gives you that flexibility to be able to mix and match components over time, which is going to make your architecture stand the test of time. And I think that’s really what you want to do is not create technical debt that you’re going to have a really hard time replacing 10 years from now. And open source gives you that flexibility.
Dinesh Chandrasekhar:
Love that point. Thank you. I think we should wrap this up on that great note. Justin, thank you so much for joining us today. It was a great conversation, understanding more about Trino and Iceberg and how Starburst offers this fantastic platform that combines the best of both worlds. Thank you so much, and appreciate you joining us.
Justin Borgman:
Thank you, Dinesh. It was my pleasure.