Smart Talk Episode 9: Apache Iceberg and Streaming Data Architectures

Tomer Shiran, Co-Founder of Dremio, provides an insider’s perspective about Apache Iceberg, its emergence as a standard table format, and how AI is optimizing query performance.

While data lakes seemed an innovative and relevant solution to many data challenges a few years ago, they were still rife with issues pertaining to data management, ACID compliance for transactions, query optimization, etc. A need for a common table format arose, and several open standards started competing for universal acceptance. Apache Iceberg emerged as the clear winner, in spite of Delta Lake and Apache Hudi having huge fan bases as well. Dremio is one of the companies that has backed Iceberg since the early days. In this episode, Dinesh Chandrasekhar, Chief Analyst and Founder of Stratola, interviews Tomer Shiran, Co-founder of Dremio, to get an insider’s perspective on Apache Iceberg and how it is adopted in the industry today. Join this session to learn more about:

  • Importance of Apache Iceberg in today’s data architectures
  • Access to real-time streaming data enabled by such architectures
  • Customer implementations and key benefits
  • Innovations in areas like query optimizations
  • A data leader’s considerations when evaluating table formats for their lakehouse

Guest

Tomer Shiran, Co-Founder and CPO, Dremio

Tomer Shiran served as Dremio’s CEO for the first 4.5 years, overseeing the development of the company’s core technology and growing the team to 100 employees. Previously, he was the fourth employee and VP Product of MapR, a Big Data analytics pioneer. Tomer held numerous product management and engineering roles at IBM Research and Microsoft. He is the founder of two websites that have served millions of users and 100K+ paying customers. He holds an MS in Computer Engineering from Carnegie Mellon University and a BS in Computer Science from Technion – Israel Institute of Technology and is the author of numerous U.S. patents.

Host

Dinesh Chandrasekhar is a technology evangelist, a thought leader, and a seasoned IT industry analyst. With close to 30 years of experience, Dinesh has worked on B2B enterprise software as well as SaaS products delivering and marketing sophisticated solutions for customers with complex architectures. He has also defined and executed highly successful GTM strategies to launch several high-growth products into the market at various companies like LogicMonitor, Cloudera, Hortonworks, CA Technologies, Software AG, IBM etc. He is a prolific speaker, blogger, and a weekend coder. Dinesh holds an MBA degree from Santa Clara University and a Master’s degree in Computer Applications from the University of Madras. Currently, Dinesh runs his own company, Stratola, a customer-focused business strategy consulting and full-stack marketing services firm.

Resources

Smart Talk Episode 8: Enabling Real-time Queries on Data Lakehouses

Smart Talk Episode 7: Cardinality, Control and Costs in Observability

Smart Talk Episode 6: AIOps and the Future of IT Monitoring

Smart Talk Episode 5: Disaggregation of the Observability Stack

Smart Talk Episode 4: Real-Time Data and Vector Databases

Smart Talk Episode 3: Modern Data Pipelines and LLMs

Smart Talk Episode 2: The Rise of GenAI Applications with Data-in-Motion

Smart Talk Episode 1: The Data-in-Motion Ecosystem Landscape

View the data-in-motion ecosystem map here

Learn more about data-in-motion on RTInsights here

Transcript

Dinesh Chandrasekhar (00:12):

Hello and welcome to this episode of Smart Talk, the Data-in-Motion Leadership series. I’m your host, Dinesh Chandrasekhar, Chief Analyst and Founder at Stratola, in partnership with RTInsights. We are pleased to bring Tomer Shiran to this episode. We are going to be talking about Dremio and Apache Iceberg and their importance in today’s modern data architectures. Tomer is a well-renowned technology leader, having worked at companies like MapR and Microsoft. He started Dremio almost nine years ago, and we are excited to have him on this particular episode. Welcome, Tomer.

Tomer Shiran (00:54):

Thank you. I’m excited to be here.

Dinesh Chandrasekhar (00:55):

So let’s begin with Dremio. Tell us about Dremio. What was your vision behind starting Dremio nine years ago?

Tomer Shiran (01:05):

So Dremio is the hybrid lakehouse for the business, and I think we’ll have time to unpack that, but at a high level, we allow companies to analyze their data regardless of where it is so they don’t have to put it into a proprietary and expensive cloud data warehouse. They can query the data wherever it is, and we really appeal to that kind of business persona. So a business analyst, a business user, folks that use BI tools or write SQL queries, and that’s in contrast to others in the market, which are more focused on a very technical user that is using Python or Spark, Scala, things like that. And there’s a lot of things we do in the product to make it appropriate for a less technical user. And that’s everything from the entire experience, all the way to taking away some of the things such as data optimization. We make that fully autonomous; it all happens behind the scenes. The data gets partitioned in the right way and organized in the right way based on the workloads. And so we do a lot of work and a lot of innovation to make it so that anybody can really take advantage of data.

Dinesh Chandrasekhar (02:14):

Awesome. And there is the undeniable elephant in the room, Apache Iceberg, that we have to talk about, because it is largely associated with Dremio. So tell us about Apache Iceberg and your driver behind supporting it for so many years.

Tomer Shiran (02:32):

Sure. Yeah. So actually Iceberg and other comparable, what we call table formats, didn’t exist when we started Dremio, but really I’d say (I could get this wrong) around five years ago was the beginning of this new era of the data lake. And prior to that, Dremio was a query engine. You could query data, but it had to already be there, in your lake, in object storage, ideally in a format such as Parquet, so you could query it with high performance. But increasingly what we kept hearing in the market is that companies were really excited about being able to use an open data architecture where data is in their own S3 bucket or their own Azure storage account, but they wanted the all-in-one simplicity of a data warehouse where they can do inserts, updates, and deletes as well, not just read-only queries.

And so table formats provide that capability. They enable various engines such as Dremio to not just be able to read data, but also to be able to update it. And so think of it as kind of a metadata layer on top of files. So historically, data lakes were just a bunch of files in object storage, but the problem is files are immutable, you can’t change them. And so a lot of use cases were just too complicated. Once you think of data at this higher-level abstraction of a table, you can start to interact with it. You can change things, you can update records, you can delete records, you can change schemas, and do all of that just the way you would do it with any kind of database. And so we identified Iceberg as a potential way to solve this problem. At the time that we did that, it was just a project, an open source project that Netflix was working on, and Apple as well.
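To make the metadata-layer idea concrete, here is a minimal Python sketch. It is a toy model, not the real Iceberg spec or API: data files stay immutable, and every change commits a new snapshot by atomically swapping a pointer, which is what lets a table built from unchangeable files support updates and deletes.

```python
# Toy model of a table format: immutable data files plus a metadata
# pointer that is swapped atomically on each commit. All names here
# (ToyTable, snapshots, current) are illustrative, not Iceberg's.

class ToyTable:
    def __init__(self):
        self.snapshots = [[]]     # each snapshot: a list of immutable data files
        self.current = 0          # "catalog pointer" to the live snapshot

    def scan(self):
        files = self.snapshots[self.current]
        return [row for f in files for row in f]

    def append(self, rows):
        files = list(self.snapshots[self.current])
        files.append(tuple(rows))              # new immutable file; old ones untouched
        self.snapshots.append(files)
        self.current = len(self.snapshots) - 1 # atomic pointer swap = the commit

    def delete_where(self, pred):
        # copy-on-write: rewrite affected files rather than mutating them
        files = []
        for f in self.snapshots[self.current]:
            kept = tuple(r for r in f if not pred(r))
            if kept:
                files.append(kept)
        self.snapshots.append(files)
        self.current = len(self.snapshots) - 1

t = ToyTable()
t.append([{"id": 1}, {"id": 2}])
t.append([{"id": 3}])
t.delete_where(lambda r: r["id"] == 2)
print(t.scan())   # [{'id': 1}, {'id': 3}]
```

Because old snapshots are never mutated, this design also gives you time travel for free: reading an earlier snapshot index returns the table exactly as it was at that commit.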

And some of these companies use Dremio. And so we had a lot of inside knowledge of how that was going and their experience with it. And so we decided that we were going to really focus on making these technologies a reality. We initially started evangelizing Iceberg right when it was entering the Apache Software Foundation as an open source project. We put some evangelists on the project to really get the word out. And long story short, within a couple of years, Iceberg became broadly adopted, not just by our engine, but by all the major players in the market, whether it’s Amazon, Google, Snowflake, Confluent. Basically everybody decided to embrace Iceberg as their standard table format. And I think that’s really important also, when you think about the value of an open data architecture and data lakes, or lakehouses as they’re called now: having the data in an open format that different engines can operate on is really critical. One company can’t just create their own format that nobody else supports, because that defeats the whole purpose of an open architecture.

Dinesh Chandrasekhar (05:54):

And then it becomes a proprietary thing by itself and without any kind of…

Tomer Shiran (05:58):

Just like a data warehouse.

Dinesh Chandrasekhar (06:00):

I get that. And congratulations on the support that the entire community has been showing for Iceberg. Every single conference, every single vendor that I talk to has been widely appreciative of Iceberg and supportive of it. So I think it’s definitely growing for sure. But I would also say that we have seen an emergence of other such table formats and catalogs. So what has led to these different variations? And tell us about interoperability from your vantage point. How do you see that?

Tomer Shiran (06:32):

Well, I think we should separate the discussion into table formats and catalogs, and I’ll address each of them separately. When it comes to table formats, Iceberg was not the only game in town. Databricks had actually created their own table format called Delta Lake, and for a while it was kind of like this two-horse race. There was actually a third project called Hudi, which Uber had created, but that faded more into the background, I think, and didn’t get as much traction in the community. So it was really a two-horse race between Iceberg and Delta Lake, and we were all over Iceberg from an evangelism standpoint on social media, blogs, conferences, and so forth. But Databricks has a lot of customers, and many of them were using Delta Lake. And I think what ended up tipping the scale here is the fact that the rest of the technology providers wanted something that was open.

That was a lot of the value, again, of this modern data architecture: having the data in open formats. And so going with something that is an Apache project, that’s open source, that different companies can contribute to, I think that was super valuable. And once all the key major players, Amazon, Google, Snowflake, etc., decided that Iceberg was their format as well, I think it became clear that it was winning. Databricks then went and bought a company that was founded by some of the creators of Iceberg. They spent something like $2 billion on that, and I think that was the final stamp that sealed that two-horse race and declared Iceberg the winner. There’s still Delta Lake out there, but I think what’s going to happen in the long term is that the two will end up closer and closer to each other, where at some point most people will just think about it as Iceberg, and it’ll be kind of one format for everybody.

The catalog is an important ingredient in all of this, because when you are thinking about table formats, there has to be some service that has the list of tables in the environment and can point to the Iceberg metadata for each of these tables. That’s the core foundational principle of these catalogs, which are also called metastores. They’re kind of technical catalogs; data discovery catalogs and everything else, I’d say, sit at an even higher level beyond that. But even at the technical level, you need something that can manage the transactions, the updates to these tables. And there were different options there. Some people still use the Hive metastore from back in the Hadoop days. That’s still pretty popular. Amazon has a hosted version of it called Glue. There’s just different options out there.

Google had a couple of these metastores, one as part of BigQuery, one as part of Dataproc, so all sorts of options. What we ended up doing is partnering with Snowflake to create an open source option, and that’s called Apache Polaris. And so Apache Polaris is an open source Iceberg catalog, basically, that anybody can download for free. It’s an open source project, and it’s not just us and Snowflake contributing; there are now committers from other companies as well, including large cloud providers. And I think what’s going to happen is that Polaris will ultimately become the standard Iceberg catalog that everybody uses.
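The transactional job a catalog does, managing which metadata is current for each table, can be sketched in a few lines of Python. The class and method names below are hypothetical, not the Polaris or Iceberg REST API: the catalog maps each table name to its current metadata location, and a commit only lands if the writer saw the latest pointer, a compare-and-swap that keeps concurrent writers from clobbering each other.

```python
# Toy sketch of a lakehouse catalog: table name -> current metadata
# pointer, with a compare-and-swap commit. Illustrative names only.

class ToyCatalog:
    def __init__(self):
        self.tables = {}   # table name -> current metadata location

    def create(self, name, metadata):
        self.tables[name] = metadata

    def commit(self, name, expected, new):
        # compare-and-swap: refuse the commit if someone else got there first
        if self.tables[name] != expected:
            raise RuntimeError("conflict: re-read metadata and retry")
        self.tables[name] = new

cat = ToyCatalog()
cat.create("sales", "s3://bucket/sales/metadata/v1.json")
cat.commit("sales",
           "s3://bucket/sales/metadata/v1.json",   # what this writer last saw
           "s3://bucket/sales/metadata/v2.json")   # the new snapshot it built
print(cat.tables["sales"])   # s3://bucket/sales/metadata/v2.json
```

A second writer still holding the v1 pointer would get the conflict error and have to rebase on v2, which is how multiple engines can safely share one table.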

Dinesh Chandrasekhar (09:58):

Awesome. So one of the things we talk about in this particular Smart Talk series is real-time data. We understand that one of Iceberg’s promises is about unifying all different types of data, batch, streaming, everything, into this common table format. How do you think it helps technology leaders look at data architectures differently from how they were looking at them before, and does it even reduce operational overhead and things like that? Tell us about some benefits of Iceberg in this particular context of bringing together all the data in one single view.

Tomer Shiran (10:45):

Sure. Well, I think there’s a couple aspects here. One is, in general, if you’re going to create a platform where you’re going to centralize data, you probably want something that is going to stand the test of time, and Iceberg, and open formats in general, are much more likely to do that. If you think about proprietary technologies and vendors, every five or 10 years there’s something new, and people want to adopt that and they end up moving to it. But if you can have your data in an open format that is supported by a variety of different tools and engines, you have Dremio and Spark and Databricks and Snowflake and Confluent, lots and lots of tools that support Iceberg, and tomorrow there will be new startups and new technologies that support it, well, then you continue to enjoy all this innovation over time. And so you’re future-proofing yourself by choosing something like this and not putting your data inside of a warehouse like a Snowflake or a Redshift or something of that nature.

And so I think that’s one thing: just having this infinitely scalable way to store data in an open format that encourages you to centralize and standardize. I think the second thing is, you mentioned streaming data. For a very long time, data lakes couldn’t really play nicely with streaming data, because the data lake was kind of defined by just storing a bunch of files, and there’s no transactionality to those files, no consistency that’s easy to manage between readers and writers. And with Iceberg, that’s different, because an operation such as updating a table or adding records is an atomic operation. And so you can stream data into an Iceberg table, let’s say, or do something like CDC (change data capture) from some database into Iceberg, and somebody else could be reading these records as they’re being written. Having that clear transactionality in the format unlocks these types of use cases. And I think that’s one of the reasons that you see someone like Confluent, which is the company behind Kafka, adopt Iceberg as their storage format now as well.
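Why atomic snapshots make streaming safe can be shown with a toy Python sketch; the names below are illustrative, not Iceberg's actual incremental-read API. A reader pins a snapshot while a writer keeps committing: the reader's view never changes mid-query, and it can later read just the delta, which is the CDC-style pattern Tomer describes.

```python
# Toy model of snapshot-based reads over an append-only table.
# snapshot id -> list of (file name, rows); all names are made up.

snapshots = {0: [("file-a", [1, 2])]}
latest = 0

def commit(new_file):
    global latest
    files = list(snapshots[latest]) + [new_file]  # old files untouched
    latest += 1
    snapshots[latest] = files                     # atomic pointer bump

def read(snapshot_id):
    return [row for _, rows in snapshots[snapshot_id] for row in rows]

def incremental(frm, to):
    # rows added between two snapshots = files in `to` but not in `frm`
    old = {name for name, _ in snapshots[frm]}
    return [row for name, rows in snapshots[to] if name not in old
            for row in rows]

pinned = latest                      # a reader starts its query at snapshot 0
commit(("file-b", [3, 4]))           # a writer streams in more data meanwhile
print(read(pinned))                  # [1, 2]  -> unchanged, consistent view
print(incremental(pinned, latest))   # [3, 4]  -> just the newly arrived rows
```

The reader and writer never coordinate directly; consistency falls out of the fact that a commit is a single pointer update over immutable files.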

Dinesh Chandrasekhar:

And I feel that it brings about that notion of data democratization in a way, where suddenly this becomes the single central place that people can push data into. And of course, as you said, others that need to read data off such tables can do so as well. And you’re bringing about a diverse set of tools and technologies being able to intercommunicate so easily because of this one common open standard that has been put in place.

Tomer Shiran (13:22):

Yeah, that’s exactly it.

Dinesh Chandrasekhar (13:24):

But then that also begs the question about data governance. It becomes almost an instant question: if I can push all my data into this from all these different tools, and then I can also read from it and all that, then questions about data lineage, consistency, auditability, and compliance all become of prime importance. How do you advise organizations when it comes to Iceberg and, subsequently, data governance?

Tomer Shiran (13:56):

Yeah. Well, one of the roles of the catalog, the lakehouse catalog such as Polaris, is to provide capabilities such as access control. And so of course you have to make sure, when you have a platform like this with all this data, that the data is only being accessed by the people that should be accessing it. Now, what we’ve done in Dremio with our lakehouse engine is we provide all these capabilities for the enterprise. So enterprise authentication, enterprise access control, integrated with all the different systems, Azure Active Directory, Okta, et cetera, and auditing, for example: who did what. And you can query these audit logs and get a really deep understanding of who’s touching what data, whether somebody is accessing data where they shouldn’t be, and you can understand lineage across different data assets. And so there are all these enterprise and governance requirements that need to be solved. And I think that is addressed through the catalog layer, and Dremio provides a Polaris-powered enterprise catalog, and also through the engine itself, which obviously is responsible for things like the auditing, for example.

Dinesh Chandrasekhar (15:10):

Okay. You spoke about efficiencies when it comes to, again, from a data architecture perspective and all that, but when you talk about storage and compute in the context of Iceberg, how do you see this? How should data leaders consider this in their larger data architecture designing, how do they fit in this so that it brings in cost efficiencies, particularly when it comes to storage and compute?

Tomer Shiran (15:42):

Yeah, look, we see when customers move to Dremio from something else, such as a cloud data warehouse, often their costs go down by at least 2x, like 50%. And so there’s a huge cost saving just from the efficiency of the engine, and the fact that it’s an open architecture already makes it much more efficient and provides a much lower TCO. And storage is cheap; at the end of the day, that’s not where most of the costs are. And so what we’ve also done at Dremio, for example, realizing that storage is cheap and compute is expensive, is we will actually optimize the data in the storage in different shapes and forms. So you might have a table of transactions or some kind of event data. We might maintain different aggregations of that data and different sorts or partitioning schemes of that data, so that different workloads can actually come and get much higher performance and not have to scan the entire dataset for every query that’s being run. And so with some storage overhead, let’s say 10% storage overhead, we can give people 10x faster performance on average for their entire workload just by doing things like that. And so there’s, I think, a lot of opportunity out there to optimize performance. And usually when you’re optimizing performance, you’re really optimizing cost, especially in this era where you can scale as much as you want. The question is how much is it costing you?
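The trade Tomer describes, a little extra storage for much less compute, can be illustrated with a tiny Python sketch (deliberately simplified, not Dremio's implementation): keep a small rollup alongside the raw events and answer aggregate queries from the rollup instead of scanning everything.

```python
# Toy illustration: a per-day rollup answers aggregate queries in one
# lookup instead of a full scan of the raw event table.
from collections import defaultdict

raw = [("2024-01-01", "dev-1", 10), ("2024-01-01", "dev-2", 5),
       ("2024-01-02", "dev-1", 7),  ("2024-01-02", "dev-2", 8)]

# build the rollup once: this is the "10% storage overhead"
rollup = defaultdict(int)
for day, device, amount in raw:
    rollup[day] += amount

# a per-day total now reads 1 row instead of scanning every event
print(rollup["2024-01-01"])   # 15
```

At four rows the saving is invisible, but the same rollup over billions of events turns a full scan into a handful of lookups, which is where the "10x faster for 10% more storage" arithmetic comes from.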

Dinesh Chandrasekhar (17:18):

Can you maybe share a specific example of a company, a customer of yours maybe, that has leveraged Iceberg and transformed their data operations, and some of the business outcomes from such an implementation?

Tomer Shiran (17:32):

Yeah, I mean, there are hundreds of these companies that we work with, enterprises for the most part, leveraging Iceberg and so forth. One recent one: I was talking to a large bank. They had a combination of data in a cloud data warehouse, and the costs of that had been growing year over year to a point where they just realized, hey, this is just not sustainable anymore; we can’t keep growing these costs. And at the same time, they also had data in other data sources, including some that was on-prem. And so they brought in Dremio, deployed that in the cloud in a way that the compute actually runs in their own VPC, and they connected that to their various sources. And so for the data that was in the cloud data warehouse, they just moved that into Iceberg tables in S3 object storage.

And then, because Dremio has connectors to all these other data sources as well, including their on-prem data, they’re able to bring all that together, define views that even include joins across different data sources. And now they have one pane of glass, and much lower cost as well, for all their analytics. And so that’s a great one, because they’re taking advantage of all sorts of capabilities in the Dremio platform. Often when we’re starting with a new customer, they’re bringing Dremio in as a unified analytics layer, like data virtualization. They can federate across different sources, and then gradually they’ll take data that is sitting in more expensive systems and move that into a more modern open data architecture. You see a lot of migrations from Redshift, for example, into Iceberg tables now; it just doesn’t scale, it gets really expensive and hard to manage. And then you bring the data into Iceberg, you put Dremio on top, and you’re able to serve all those use cases with far fewer operational challenges and 10x lower costs.

Dinesh Chandrasekhar (19:28):

Okay. I had a question about siloed data systems, which are very common in enterprises, and they all struggle with them. And you spoke about connectors that allow you to plug various different data sources very easily into Dremio. My question would be more about the flexibility or the diversity of such data connectors. Maybe you can talk about that, but also from a query optimization and effectiveness perspective, because there’s always that trade-off: you optimize the query, then you compensate on something else, and so forth. Are there benchmarks that the audience can look up to find out more about that?

Tomer Shiran (20:14):

Yeah. Historically, people have thought of data warehouses, and data lakehouses maybe, and data virtualization or federation as two separate systems. And what we’ve done at Dremio is we’ve kind of blurred the lines between those two things. We started with just building the best query engine. We invented Apache Arrow, which is now downloaded like 70 or 80 million times a month, for really high-performance in-memory processing. So that’s the foundation of our engine: being able to run queries extremely fast on object storage. But we also built the ability to connect to these other databases and to be able to push down query processing into those systems. And so you might have most of your data now, say, in Iceberg tables or Parquet files in object storage, but you still have data sitting in Oracle and MongoDB and these other systems, some data warehouses, and you want to incorporate all of that into one shared semantic layer.

And so Dremio can do that, and when a query comes in, we automatically figure out what part of it can be pushed down into the underlying system. And that includes really complex SQL queries, right? Joins and correlated subqueries and all sorts of constructs that most systems can’t push down into an underlying system. So we take the query, we break it apart. A simple example would be: I have two tables in Postgres and two tables in object storage, and I’m doing a four-way join. We’ll push the two-way join of the Postgres tables down into Postgres, let Postgres do that join there, and, if possible, push down filters as well, so that the minimum amount of data actually has to come back from the Postgres system. And so that’s the default mode of operation. But we give customers the flexibility, because sometimes they don’t want to put too much load on that database.

Or maybe that system, just by its nature, say Elasticsearch, is just really slow at scanning data, and so asking it to do a lot of processing is actually counterproductive. And so the system has this caching technology called data reflections, which allows us to maintain different caches of the data that’s in these different systems. So we can cache that, and the user is not aware of it. The user is running the query from a BI tool or just running a SQL query; they’re not aware of these caches. The caches are actually stored as Iceberg tables inside of the object storage, but to the user, they’re just interacting with data from all these systems. And we will go periodically and update the cache based on those systems. And the cache doesn’t even have to be as simple as just a raw copy of a table somewhere.

It could be maybe that raw copy, but also different aggregations on that data. So we’re constantly analyzing the workloads and figuring out, okay, this is what we should be persisting, at least for now, in object storage. We’re constantly reevaluating those decisions so that users are just getting really amazing subsecond response times when they’re dragging the mouse in Tableau or Power BI or something like that. We don’t want them to have to go and create extracts in their BI tools, because those are a nightmare to manage, and they really restrict the kind of analysis you can do.
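The pushdown planning Tomer described, two Postgres tables and two object-storage tables in a four-way join, can be caricatured in a few lines of Python. This is illustrative only; a real planner like Dremio's works on query plan trees, not table lists. The idea is simply that joins between tables living in the same source are grouped so they run inside that source, and only their smaller results are shipped to the engine for the cross-source join.

```python
# Toy "planner": group join inputs by source so same-source joins are
# pushed down; table and source names are made up for illustration.
from itertools import groupby

tables = [("orders", "postgres"), ("customers", "postgres"),
          ("events", "s3"), ("devices", "s3")]

def plan(tables):
    key = lambda t: t[1]
    subplans = {}
    for source, group in groupby(sorted(tables, key=key), key):
        names = [name for name, _ in group]
        # a multi-table group becomes one join executed inside that source
        subplans[source] = " JOIN ".join(names)
    return subplans

print(plan(tables))
# {'postgres': 'orders JOIN customers', 's3': 'events JOIN devices'}
```

The engine then only joins the two subplan results, so Postgres ships back one joined result set instead of two raw tables.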

Dinesh Chandrasekhar (23:20):

And it’s about data freshness as well. You also touched on a very important topic, which I wanted to bring up. When you talk about these multiple data silos, the concept of the data caches really resonates with me, so that you’re not doing the data hop every time, trying to get data back and forth and all that; if there is a way to cache it closer to where the access is, I think that makes it fresher in a way. But at the same time, there are probably data sources that keep updating, as I said, a real-time source, for example. How do you compensate for that? How do you make sure that the data still refreshes in a timely manner, so that there is no latency from where Dremio fetches the data?

Tomer Shiran (24:07):

Well, first of all, we give users the ability to control that. At some level, it’s a business decision: am I okay with the results being 15 minutes stale, or am I not okay with that? And that depends on the source of data and the use cases. For some things it might be okay; for others it’s not. And so it’s very easy to just set that in the system. But the second thing is, in the last year we’ve actually released something called Dremio Live Reflections. The idea with Live Reflections is that we can update these different cached elements, these reflections, live. As soon as some data changes in an Iceberg table, we immediately know what changed, and all the reflections that are derived from that table are automatically refreshed in an incremental way. And so we spent a lot of time building the ability to do these incremental updates in all sorts of different situations and all sorts of crazy scenarios, because the ability to update caches live is really dependent on how well you can do that incrementally. The moment you have to do a full update, it just takes time on a large data set, right?

If we’re talking about a million records, sure, you can do the full thing every time. But when we’re talking about billions or trillions of records, and many of our customers run at that kind of scale, then you have to make sure that you can do this stuff much more intelligently. And so we spent a lot of time and built a lot of IP in this area of incremental reflection updates, and also in the query optimizer’s ability to understand, when a new query comes in, what is okay to use and what is not okay to use. And so you can imagine, sometimes our support team shares with me queries that are like 10 pages long. Our optimizer has to be able to take this massive SQL query, break it apart, and figure out: okay, if I rewrite the query plan in this way, it’s still logically the same, but then I can replace this part of the tree with some data that I’ve precalculated already, and this other part of the tree with data that I’ve also precalculated, and it’s all going to be consistent and the user’s going to get the same result. They’ll just get it a hundred times faster. And so, yeah, there’s a huge amount of IP and patents that we’ve built in this area.
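The incremental-refresh principle behind this can be shown with a trivial Python sketch (a toy model, not Dremio's reflection machinery): when new rows land, the cached aggregate is updated from the delta alone, so the cost scales with what changed rather than with the full table.

```python
# Toy incremental refresh of a cached aggregate: apply only the delta,
# never recompute over the full base data. Names are illustrative.
from collections import Counter

base = [("dev-1", 3), ("dev-2", 4)]
cache = Counter()
for k, v in base:
    cache[k] += v                     # initial full build of the "reflection"

delta = [("dev-1", 2), ("dev-3", 1)]  # newly committed rows only
for k, v in delta:
    cache[k] += v                     # O(|delta|) work, not O(|base| + |delta|)

print(dict(cache))   # {'dev-1': 5, 'dev-2': 4, 'dev-3': 1}
```

Sums and counts compose this way trivially; the hard engineering Tomer alludes to is making joins, deletes, and more complex aggregations refreshable incrementally too.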

Dinesh Chandrasekhar (26:14):

I think you brought up another point that is very appropriate to talk about today: when you talk about optimizing a query that is 10 pages long, that also begs the question, how much have you incorporated some of the newer AI advancements and innovations into these offerings, and how has that helped in speeding up some of these things?

Tomer Shiran (26:39):

Yeah, you know, AI creates a tremendous amount of opportunity, I think, at multiple different layers: in terms of how we build the system and how some of these things work, which is what you’re talking about; in terms of how users can interact with the system, where we now allow them to ask questions in natural language and give them responses based on our understanding of the metadata, and actually the values and so forth as well; and also in allowing users to build their own AI applications, utilizing unstructured data and so forth. And so we have a lot of investments in all these different areas. Specifically in the area of the core capabilities of the system and how they operate, one of the things we launched recently is reflection recommendations, where we do analysis of the query workload and we learn from that, and the system gets smarter and smarter as it sees more queries.

And it figures out, through this kind of learning, what the right set of materializations is to create automatically in the system that will maximize the amount of cost saving or the amount of performance increase. And then, as the workload changes, because a workload is not a static thing, it’s not like whatever you ran today is guaranteed to be what you run tomorrow, those decisions are constantly reevaluated. And so we look at, okay, each of those reflections that were created, what is the actual value of it? What has it been over the last 24 hours? What has it been over the last seven days? And we can then figure out, okay, well, maybe we shouldn’t be doing this anymore; we should be doing some other set of things instead. And so that constant learning, that’s just something people had to do in the past, and it was super complicated.

Of course, people could never get it right. In most of our customers, most organizations, if you look at it, the team that’s running the infrastructure, the data platform, is not the team that’s querying the data. They’re not the ones using the data. And so there’s this disconnect where the platform team doesn’t understand the workloads; they can’t, because there are all these different people and different teams and use cases. And so they had no really good way to figure out how to optimize the data, right? Because they just couldn’t know what people were doing, other than looking through a big log of SQL queries, and they certainly couldn’t anticipate things that were going to happen in the future. And so, for the most part, data just didn’t get optimized: the data would be partitioned by a date column, and then 80% of the workload was actually querying by the device ID column or something like that. Just terrible results. And so by making this all intelligent and automated, you’re not just taking work away from people that had to do it manually, you’re actually getting to an outcome that just couldn’t be achieved before that kind of automation.

Dinesh Chandrasekhar (29:23):

Very cool. I'm glad you spoke about that part of the innovations you've brought into the product, because almost every vendor today talks about natural language to SQL and that kind of thing, which is appreciated, but it doesn't quite differentiate them. The real problems are what you described with the reflection recommendations. I think that's where, operationally, there's a ton of optimization to be done, and rather than manual intervention, something that can automatically analyze the workloads and make those kinds of recommendations is absolutely beneficial. So thank you for sharing that.

One last question, maybe: if I were an enterprise data leader out there looking at revamping my data architecture, and I'm considering Iceberg as a strategic part of it, what strategic considerations should I prioritize? Maybe the top three that would maximize the benefits of the data architecture, particularly in terms of future-proofing it, since we're betting this is going to be an open standard for the future. How would you advise a data leader?

Tomer Shiran (30:45):

Yeah. Well, I think you hit the nail on the head that the data has to be open. And when I say open, I mean open source, an Apache project. So I would encourage anybody to make sure you're using Iceberg as the table format. Underneath that, Parquet is the actual file format behind Iceberg. And then, going forward, use Polaris, which is an Apache project, as the metastore, as the catalog. I think those are the three core pillars of being open. Then the next step is you need a SQL engine; that's fundamentally the key piece of Dremio, providing the ability to run SQL queries on top of Iceberg. And I'd say use something like dbt on top of that, with dbt scheduling the Dremio workloads, the queries, to do the transformations you need to do. That alone gets you pretty far: you can replace your data warehouses, run all these workloads, and have a very modern data architecture.
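The three pillars Tomer lists can be pictured as layers that resolve a table name down to scannable files. The following is a toy model, not real Polaris or Iceberg code: the catalog (the role Apache Polaris plays) maps a table name to Iceberg table metadata, and the metadata tracks the underlying Parquet data files that a SQL engine such as Dremio would actually scan. All paths and names are hypothetical.

```python
# Catalog layer (role played by Apache Polaris): table name -> metadata.
catalog = {
    "sales.orders": "s3://lake/sales/orders/metadata/v3.metadata.json",
}

# Table-format layer (Apache Iceberg): metadata tracks snapshots and
# the data files that make up the table.
table_metadata = {
    "s3://lake/sales/orders/metadata/v3.metadata.json": {
        "current_snapshot": 3,
        "data_files": [
            # File-format layer (Apache Parquet).
            "s3://lake/sales/orders/data/part-000.parquet",
            "s3://lake/sales/orders/data/part-001.parquet",
        ],
    },
}

def plan_scan(table_name):
    """Resolve a table name down to the Parquet files a SQL engine
    would scan, walking catalog -> table metadata -> data files."""
    meta = table_metadata[catalog[table_name]]
    return meta["data_files"]

files = plan_scan("sales.orders")
```

Because each layer is an open Apache standard, any engine that speaks the catalog and table format can perform this same resolution, which is what makes the stack swappable rather than locked to one vendor.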

Dinesh Chandrasekhar (31:49):

Maybe one more question then, because I think I’ll be remiss if I didn’t ask you this. Given the kind of technology leader you are, and you’ve been in this space building up some of these open source innovations and so forth, what are you more excited about in the next couple of years in this particular data space? What are some of the things that are cooking that we as outsiders probably don’t know that you as an insider know and are excited about?

Tomer Shiran (32:21):

Look, I think there are a number of different areas, but we kind of touched on it: AI is creating a new opportunity. The AI models out there, everything from GPT to Claude to Gemini, are all getting really, really smart. And that creates opportunities to automate things that in the past required people, often doing work they didn't really want to do. I think that's going to allow us as an industry to get closer and closer to that holy grail of letting people really take advantage of their data, not constantly stuck waiting on some transformation before they can access the data, not constantly questioning the quality of the data. There's just so much in data that needs to get better. Despite decades of innovation and new technologies, we're not far enough along today; it's still kind of a mess. And I think there are a lot of opportunities that we, and I'm sure others, are working on to really change the game.

Dinesh Chandrasekhar (33:36):

Very well said. Well, Tomer, thank you so much for joining us today. This was a fantastic conversation. We like to keep our episodes short and succinct, but sometimes I go a little overboard, primarily when the conversation is as exciting as this one. So thank you for sharing all your insights about Dremio, Iceberg, and how you see the world of data and beyond. Thank you so much.

Tomer Shiran (33:56):

Yeah, for sure. Thanks for having me here.


About Dinesh Chandrasekhar

Dinesh Chandrasekhar is a seasoned marketing executive, a technology evangelist, and a thought leader with close to 30 years of industry experience. He has an impressive track record of taking new integration/mobile/IoT/Big Data products to market with a clear GTM strategy of pre- and post-launch activities. He is the founder and CEO of Stratola, a top-notch business strategy consulting and full-stack marketing services company. Dinesh has extensive experience both building and marketing enterprise software and SaaS products that deliver sophisticated solutions for customers with complex architectures. As a Lean Six Sigma Green Belt, he has been the champion for Digital Transformation at companies like LogicMonitor, Cloudera, Hortonworks, Software AG, CA Technologies, and IBM. Dinesh has been pivotal at many companies in creating new categories, identifying new growth areas, championing new sales plays, and being a vocal supporter of the brand and its cause.
