Smart Talk Episode 3: Modern Data Pipelines and LLMs


GenAI will finally let organizations harness their unstructured data, provided they redesign their data pipelines with different transformations and built-in governance.

In the last three decades, data pipelines have evolved from traditional ETL to ELT. But in today’s world of LLMs and GenAI applications, the need to move massive loads of unstructured data is critical. Enter multimodal data pipelines. While they address the need to get data from varied sources into LLMs, they also raise multiple questions around data trustworthiness, governance, real-time decision-making, convergence with data-in-motion, and more. Join Dinesh Chandrasekhar, Chief Analyst & Founder of Stratola, as he talks with special guest Luke Roquet, Co-founder and COO of Datavolo.

Luke and Dinesh distill years of experience building data platforms and products into their discussion which covers:

  • current data architecture gaps for GenAI 
  • GenAI models for harnessing unstructured data
  • evolution of data pipelines
  • vectorizing and parsing unstructured data (PDFs) for GenAI
  • pipelines with built-in data governance
  • low-risk GenAI use cases

Guest
Luke Roquet, Co-founder and COO, Datavolo
Luke Roquet has been involved in the Big Data world since its inception. He worked at Oracle before joining Hortonworks at a very early stage, where he helped build out the Central Region, and he has since run global sales and marketing teams at both Unravel Data and Cloudera. He has a passion for innovation and for creating successful, happy customers.

Host: Dinesh Chandrasekhar is a technology evangelist, a thought leader, and a seasoned IT industry analyst. With close to 30 years of experience, Dinesh has worked on B2B enterprise software as well as SaaS products, delivering and marketing sophisticated solutions for customers with complex architectures. He has also defined and executed highly successful GTM strategies to launch several high-growth products into the market at companies such as LogicMonitor, Cloudera, Hortonworks, CA Technologies, Software AG, and IBM. He is a prolific speaker, blogger, and weekend coder. Dinesh holds an MBA from Santa Clara University and a Master’s degree in Computer Applications from the University of Madras. Currently, Dinesh runs his own company, Stratola, a customer-focused business strategy consulting and full-stack marketing services firm.

Resources

Watch Smart Talk Episode 1: The Data-in-Motion Ecosystem Landscape
Watch Smart Talk Episode 2: The Rise of GenAI Applications with Data-in-Motion
View the data-in-motion ecosystem map here
Learn more about data-in-motion on RTInsights here

Dinesh Chandrasekhar:

Hello and welcome to the Smart Talk: Data-in-Motion Leadership series. This is a brand new episode with our guest Luke Roquet, Chief Operating Officer at Datavolo. Luke is no stranger to the data world. He has been in senior operations and sales leadership roles at a variety of data management companies, like Cloudera, Hortonworks, Unravel Data, AWS, and so forth. So we are excited to have you here. Luke, welcome.

Luke Roquet: Thanks. I’m really excited to be here. 

Dinesh Chandrasekhar: Before we start, why don’t we give you an opportunity to quickly explain who Datavolo is and what you do. Why don’t you take the floor? 

Luke Roquet (1:00):

It’s pretty simple. Our mission is to help folks become 10x data engineers for AI. AI presents a whole new set of challenges and opportunities with regard to how to manage data pipelines. Fundamentally, when we talk to companies doing AI, we find that data engineering is a major bottleneck for generative AI use cases. So our mission is to help people easily harness all their multimodal data for their AI applications. 

Dinesh Chandrasekhar (1:26):

Fantastic, and I think that’s exactly why we invited you to the show today, because we’ve been talking about data-in-motion and its role in today’s world, particularly when it comes to GenAI. It absolutely makes sense to focus specifically on data pipelines as a topic for today in the context of GenAI, because “it all starts there,” as I like to say. We talked about data-in-motion in the last episode. We spoke about how it’s applicable in the context of GenAI and how we need to not just look at one side of data alone; data-in-motion also plays a very relevant role in all of that. But what we didn’t do was double-click into the architecture of it. From your vantage point, particularly as you start embarking on this journey with data, how do you see the current data architectures? Do you see gaps in how they are addressing the GenAI applications of today, and what are some of these gaps that should be addressed?

Luke Roquet (2:30):

Yeah, it’s very interesting because, like you, Dinesh, I have been in this data space for a very long time. Back in the early days of Hortonworks, let’s call it 2012, 2013, when I first started there, there was all this promise about big data letting you harness all of your data: your structured data, your semi-structured data, your unstructured data. But the reality is, if you look at the market today, the vast majority of workloads in big data environments are highly structured data. And so, as a result, the data architectures have evolved to suit that type of world. The leading data pipeline tools of today are built around row-oriented constructs, around a highly structured view of how they process data. They’ve even changed from an ETL to an ELT type of construct, because that’s what the demands of the enterprise have been for the last decade. But the reality is, with generative AI, these models can now really, truly harness this unstructured data for the first time.

So the 90% of data that companies have, which they’ve previously never been able to tap, they can now tap with generative AI. That opens up a tremendous amount of possibility but also a tremendous challenge in terms of how we manage the fundamental data that enables these AI applications. And if you read anything on the internet about why generative AI has been a little bit slower in the enterprise than we expected, the number one reason is the data. Companies don’t have the systems in place to handle the data to make the AI successful. And your AI is only as good as your data.

Dinesh Chandrasekhar (4:00):

There you go. I think you nailed it with that last punchline. Data has always been the most relevant thing in pretty much any setup you have, and in this context in particular, I think you are right: data is where it all starts. With that said, and particularly since you started talking about ETL and ELT, I want to highlight that you and I have had the same pedigree when it comes to data management companies. We’ve been through this evolution of data pipelines, as I like to call it. A couple of decades ago we were talking about ETL-type pipelines, where we would get data from different types of sources, do all kinds of transformations, and then load it somewhere else. Then came this whole moment with big data and the data warehouses of the world, and people started thinking ELT is a much better way, where we can load the data up and then do whatever transformations we want. Do you see a change even in that when it comes to the current needs of GenAI applications, particularly as we start building LLMs left, right, and center? How are data pipelines changing? What is the model currently that is favorable?

Luke Roquet (05:10):

I think it absolutely demands a different construct, and that’s because the T, the transformation, is fundamentally different for structured data and unstructured data. In a structured data world, the transformation that’s happening is, by and large, advanced complex joins, windowing, and aggregation. So you want to get all the data in one place, a large data lake or lakehouse, so that you can do advanced transformations and put it in a form that is more semantically accessible by your SQL engine. That’s the world the enterprise has been living in. And that’s why all of the leading data pipeline tools today, and you know who they are, are ELT-based. But what happens now with generative AI is that we need to fundamentally transform the unstructured data into a numerical representation that can be understood by a model.

And doing that doesn’t require complex aggregations and joins. It truly is a simple transformation of an unstructured file or document or stream into a numerical, vectorized representation that goes into a vector store. And so, as a result, you see kind of kludgy architectures being proposed all over the place, things like ELTL or ELTP, where essentially, because we have these point-to-point notions of data movement, before you can transform the data you have to move it to the lakehouse, then you do the vectorization, then you have to pick it up and move it again. And it just simply doesn’t make sense in an unstructured, multimodal world. So the reality is that the proper way to do pipelines on this multimodal data is to do that transformation on the fly and put the data where it needs to go.

So I’ll give you an example, a very simple one. You’re bringing in a large set of PDF documents. You need to parse them, chunk them, vectorize them, and put them in a vector database, and then also store the original document, probably in your lakehouse, for reference. But you’ve done all of that in one flow that can then be easily updated. And look, this world is evolving rapidly. The embedding engines keep changing, right? The best parsers change, the best chunking strategies change. So you need an ETL tool that lets you handle and orchestrate that process in a way that’s very modular and very adaptable.
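For readers who want to see the shape of such a flow, here is a minimal sketch of the parse, chunk, vectorize, and store pattern described above. It is not Datavolo’s implementation: it assumes the pypdf and sentence-transformers packages, and the vector-store and lakehouse writes are shown as hypothetical placeholders rather than any specific product API.

```python
# Minimal sketch of a multimodal "parse -> chunk -> embed -> store" flow.
# Assumes the pypdf and sentence-transformers packages; the vector-store and
# lakehouse writes at the end are hypothetical placeholders, not a product API.
from pypdf import PdfReader
from sentence_transformers import SentenceTransformer


def chunk_text(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Naive fixed-size chunking with overlap; real strategies vary and change often."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks


def ingest_pdf(path: str) -> list[dict]:
    # Parse: extract raw text from every page of the PDF.
    reader = PdfReader(path)
    full_text = "\n".join(page.extract_text() or "" for page in reader.pages)

    # Chunk: split the text into overlapping pieces sized for embedding.
    chunks = chunk_text(full_text)

    # Vectorize: the "transformation" here is embedding, not joins or aggregation.
    model = SentenceTransformer("all-MiniLM-L6-v2")  # swappable embedding model
    vectors = model.encode(chunks)

    # One record per chunk: the vector plus metadata pointing back to the source.
    records = [
        {"id": f"{path}#{i}", "vector": vec.tolist(), "text": chunk, "source": path}
        for i, (chunk, vec) in enumerate(zip(chunks, vectors))
    ]

    # In a real flow both writes happen in the same pass:
    # vector_store.upsert(records)                         # hypothetical vector-DB client
    # lakehouse.put_object(path, open(path, "rb").read())  # keep the original for reference
    return records


if __name__ == "__main__":
    print(len(ingest_pdf("example.pdf")), "chunks ingested")
```

Keeping the parser, the chunking function, and the embedding model as separate, swappable pieces is the modularity point made above: any one of them can be replaced as better options appear without rebuilding the rest of the flow.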

Dinesh Chandrasekhar (07:34):

From my vantage point, what I also see, as you start laying out this architecture of the vector database, the embeddings, the chunking, and everything that happens between that and the actual LLM, is that decision-making becomes a lot more complex. It’s not just this kind of data arriving alongside all the structured data; suddenly there are also data streams that have to be added in for context and correlation in order to make those key decisions. So what I’m also seeing is a tremendous rise in decision-making engines or platforms that need data from all these different pockets and have to bring it all together, so they can make the LLM speak the truth, in a way.

Which brings us to a question that I have to ask you. Let’s talk about the pink elephant in the room: hallucinations.

Every time we bring up GenAI applications or LLMs these days, people worry about a few things. I think 54% of companies are hesitant to put GenAI to work, primarily because of fears around security, hallucinations, data quality, and things like that. So where does it start, in your opinion, when you think of the multimodal pipeline you put together when building a GenAI application? When do you start controlling data quality: before the data gets into the LLM’s hands, or before the LLM starts spewing out whatever it thinks is fact and misleading the user? How do we prevent these kinds of hallucinations from happening, in your opinion?

Luke Roquet (09:19):

Yeah, it’s a great question. Fundamentally, it should be noted that LLMs are built to hallucinate, right? That’s what makes them novel and unique: they’re built and designed in such a way that they can create new, novel things that we’ve never considered. So they’re built to hallucinate, but we want them to hallucinate in a factual way. And there are really two primary ways to do that, both completely centered around the data. One is that you can spend a significant amount of time and money fine-tuning your own model with all of your own data, heavily tuning that model and setting its temperature very low so that it doesn’t hallucinate anything it can’t factually state. But then you lose some of the creativity and the power of the model by doing so.

And then, certainly because of the cost of fine-tuning models, the approach most organizations are taking is a RAG-based approach, a retrieval-augmented generation approach, whereby you use a pre-trained, pre-tuned model but augment it with the contextual data from your organization. And so it becomes paramount that you are transforming the data in the right way, using the right parsing and the right chunking strategies, and putting the right data in the vector database for the model to draw on, so that it’s not hallucinating with falsities, so that it’s hallucinating, or generating novel content, based solely on the facts at hand within your organization and your business. So it really goes back to that point we made earlier: you can only trust your AI if you can trust your data, and your AI is only as good as your data.
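As a rough illustration of the RAG pattern just described, here is a minimal sketch that retrieves the most relevant chunks for a question and grounds a low-temperature model call in them. It assumes the openai and sentence-transformers packages and chunk records like those in the earlier ingestion sketch; the model name is an assumption, and retrieval here is brute-force cosine similarity rather than a real vector database.

```python
# Minimal RAG sketch: retrieve relevant chunks, then ask the model to answer
# using only that context, with a low temperature to curb free-wheeling output.
# Assumes the openai and sentence-transformers packages and chunk records shaped
# like {"vector": [...], "text": "..."}; the model name is an assumption.
import numpy as np
from openai import OpenAI
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
client = OpenAI()  # reads OPENAI_API_KEY from the environment


def retrieve(question: str, records: list[dict], k: int = 4) -> list[dict]:
    """Brute-force cosine similarity over the stored chunk vectors."""
    q = embedder.encode([question])[0]

    def score(rec: dict) -> float:
        v = np.asarray(rec["vector"])
        return float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))

    return sorted(records, key=score, reverse=True)[:k]


def answer(question: str, records: list[dict]) -> str:
    # Augment: paste the retrieved chunks into the prompt as grounding context.
    context = "\n\n".join(rec["text"] for rec in retrieve(question, records))
    messages = [
        {"role": "system",
         "content": "Answer using only the provided context. If the context "
                    "does not contain the answer, say you don't know."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any chat-completion model works here
        messages=messages,
        temperature=0.1,      # low temperature keeps answers close to the retrieved facts
    )
    return resp.choices[0].message.content
```

The system prompt and the retrieved context constrain what the model can claim, which is the point of reaching for RAG instead of fine-tuning: the grounding lives in the data pipeline, not in the model weights.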

At Datavolo, we primarily work with large, heavily regulated enterprises that are very concerned about the risks you talked about. The major thing holding back generative AI in the market right now is the unknowns: how models are trained, hallucination, the risks to data privacy and security. Those are fundamentally the risks, and they can only be solved by a solid data architecture and a really solid data pipeline strategy that makes sure all of your data (because, look, it’s hard to secure this type of data that lives in documents and video files and audio files and imagery) has the right pipeline in place to handle it, so that you have the governance, you have the end-to-end lineage, you have that audit trail that’s fundamental to being able to do secure and trusted AI.

Dinesh Chandrasekhar (11:43):

Phenomenal. I think that’s a very comprehensive response to what I asked, and it raises a whole bunch of questions, starting with: what are some of the use cases that you typically see, maybe taking your own company and the kinds of customers you work with as an example? Because that’s another question that keeps coming up in the field: I know GenAI is on the rise, and everybody wants to do cool and exciting things, but are the cool and exciting things being put to real, practical use? One of the most common things I keep hearing is customer success: “Hey, I have a customer success application that I want to build. GenAI is an absolute wonder because it can take in the tremendous amount of documentation that I have, all my previous customer case studies, troubleshooting guides, everything, and then sum it up and make really clever responses to my customers for their problems, and I can implement chatbots and whatnot, and the possibilities are endless when it comes to something like customer success.” But what other use cases are you hearing about that make this feel more lucrative?

Luke Roquet (12:54):

Yeah, I think most organizations are starting with really low-risk use cases, and you nailed one, which is a customer support, customer success chatbot. In fact, we have one that is part of our open community forums. People can come and ask questions. It sources both our documentation and all of our chat history. We have an open public Slack forum where experts are weighing in and answering questions. So when a new question is asked of the chatbot, it has not only the documentation, which is kind of a point-in-time record, but is also augmented with all the conversations happening among experts in our Slack channels. So that’s a very common use case. Certainly document summarization is a very, very common use case. Organizations want to load in all of their manuals and all of their documents and be able to ask questions about them.

It’s very common for legal documents. Any organization that has a lot of large PDFs is a very common one. And then another really common one is actually around help desk calls. You want to understand the tone and sentiment of the folks calling into your help desk, and then understand whether your customer support folks are handling those calls appropriately, giving the right answers, and keeping the customers happy. Those are all examples of very low-risk use cases where organizations are starting.

The other place they’re starting is with ISV products. A lot of organizations say they’re doing generative AI, and when you dig into it, they’re really using independent software vendor products that use generative AI, right? So I think that’s where we’re at today in the enterprise. We’re in the stage where we’re doing very low-risk testing of the waters of what’s possible, and combining that with solutions from third parties that have built-in AI capabilities. Where we’re heading, though, is leaps and bounds beyond that. It’s a complete and fundamental disruption of how analytics is done, how forecasting is done, and what business innovation and capabilities are possible. That’s what’s coming, but we’re still in the groundwork phase right now, where everybody’s laying down the foundational architectures, testing and implementing low-risk use cases, still high value but low risk, and laying that foundation for the really transformational disruption, which is a little bit down the road.

Dinesh Chandrasekhar (15:12):

Awesome. Based on your last responses, I also want to ask you about the technology side of things. You and I come from the open source world, and we are very familiar with it. How do you see the current state of affairs when it comes to LLMs and the openness that may or may not exist here? Do you see that as a challenge, or more as a support vehicle to get this into the hands of customers so they can build more GenAI applications?

Luke Roquet (15:48):

Yes, it’s such an interesting question, and one of frankly great debate across my network, internally at Datavolo, with customers, and with thought leaders. That debate is around whether the open source models can catch up or keep up with these companies that have billions of dollars of funding, right? You look at companies like OpenAI, Anthropic, etc. They have tons of money and resources, but shockingly, and no surprise to you or me, Dinesh, because we’ve been in this world for a long time, the open community is doing an incredible job of staying pretty darn close. And so what I’m seeing now in the enterprises is that they’re certainly experimenting with a lot of the public models, but at the same time, especially in financial services, which we do a lot of work in, there’s a reticence to put your data out where it’s accessible by those models, and they want open source models that they can use within their own environments and their own constraints. And if the open source models are almost as good, but it’s fully known how they’re trained and on what training set, and the model is controlled and owned within their own environment, that becomes very, very compelling to organizations that are highly sensitive to security. So I think it’s a very interesting world. The amount of money invested in some of these proprietary vendors like OpenAI will cause them to continually be ahead of the curve. But I think you’re going to see open source continue to push the boundaries and stay right on the heels of the proprietary solutions.

Dinesh Chandrasekhar (17:17):

Very cool. I’m very excited about the promise of the stage that has been set and how it’s going to evolve in the next phase. I think it’s going to bring more innovation, competition, and co-competition. It’s a really exciting time.

Luke Roquet (17:33):

Yeah, and if you look at it from a more big-picture, visionary standpoint, the world that we’re talking about is so exciting today, and it’s going to continue to get more exciting. One of the big challenges that we deal with every day with customers is: what’s the right parsing strategy? What’s the right chunking strategy? Right now we use different libraries from different vendors for those, and we might use one vendor today, and then next week Google might come up with something and we want to switch to that. But at some point, we’re going to have a model where we can just say, here’s my source data and here’s what I want to do with it, go figure out the right parsing strategy and the right chunking strategy, and we just hand it off. So you’re going to continue to see massive evolutions there.

You’re going to continue to see massive evolutions. You and I talked about this the other day, Dinesh: the way analytics looks today will be fundamentally different in five years. Right now we spend tremendous amounts of time and money getting data through multiple zones (bronze, silver, gold, platinum, production) so that it’s in its most refined state and the SQL analysts can access it. I don’t believe that will be the case five years from now. There will certainly be data sets like that, but the end state five years from now is that all of our data comes in raw, it’s fed into a model, and then we just speak our own natural language to that model. And that’s any language. We have it at Datavolo. We have a flow builder where someone can go in and say, in English, “Hey, build me a flow that does this,” and it builds them the flow.

Somebody from Turkey recently asked, in Turkish, to build a flow. We were quite curious at Datavolo to see what would happen. We saw the request come in, and what do you know, 60 seconds later a flow was generated. So these models are so good at understanding people’s natural language, and that’s the evolution. We won’t need to have all of these highly semantically structured data sets with highly semantically skilled people to access that data. We’ll be able to ingest data as it comes and be able to speak and interact with it in our own language.

Dinesh Chandrasekhar (19:38):

That was great, and that was going to be my summary question: asking you what you thought about the space, and I think you touched on a great variety of topics. So thank you so much for that. And you mentioned flow. I just want to clarify for the audience here; I know what a flow is, I’ve worked on them, but just for clarity, would you call a flow a pipeline, or how are you defining it?

Luke Roquet (20:02):

Yeah, they’re different terms for the same thing. We often use those terms interchangeably, but yes, the flow, the pipeline, it’s really the orchestration of data from where it originates to where it needs to go. And that’s one of our fundamental theses and beliefs at Datavolo: whatever advances happen, and there will continue to be more and more advances, right? Today it’s generative AI; next it will be interactive AI or some other form. It will continue to evolve, but what will always be constant is the data and the need to get the data from where it originates to where it is consumed. And fundamentally, we believe those will always be two different places. So the orchestration and governance of how that happens, the chain of custody, the lineage, making sure the data is securely transmitted from point of origin to point of consumption, is a fundamental thing that we will always be solving for.

Dinesh Chandrasekhar:

Well, Luke, thank you so much for your time. I think we had a great conversation about understanding data, the space that you’re in, and the possibilities opening up in this world of GenAI right now. So I appreciate your time and thank you for joining us.

Luke Roquet: Thanks, Dinesh. Really appreciate it. It was great to speak with you.
