Once you have intelligently implemented your data pipeline, it’s easy to multiply the outcomes and complement a pipeline with other processes.
Businesses today are data-driven. Those that excel use data to make fast, intelligent decisions and take quick action. Reaping the benefits that data has to offer requires data pipelines that are automated and make data easy to access. This, in turn, helps speed up and automate business processes.
RTInsights recently sat down with Guillaume Moutier, Senior Principal Technical Evangelist at Red Hat, to talk about data pipelines. We discussed why it is important to automate them, issues that arise when implementing them, tools that help, and the benefits such pipelines deliver. Here is a summary of our conversation.
RTInsights: We hear a lot about data pipelines. Could you define what they are or explain what they do?
Moutier: Data pipelines describe all the steps that data can go through over its life cycle. That includes everything from ingestion to transformation, including processing, storing, and archiving. With this definition, even the simple copying of data from point A to point B could be considered a data pipeline, though in that case it might just be considered a small pipe. Usually, with data pipelines, we are talking about more complicated scenarios: different sources being merged or split into different destinations, with multiple transformation steps happening along the way. It comes down to this: I have some data at point A. I want to have something at point B. In between, the data must go through some transformation or processing steps. That’s the definition of a data pipeline.
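To make the idea concrete, here is a minimal sketch of the “point A, transformations, point B” pattern using plain Python generators. The file names and transformation steps are illustrative assumptions, not part of any specific product.

```python
# Minimal sketch: ingest from point A, apply transformation steps, deliver to point B.

def read_source(path):
    """Ingest: stream raw records from point A (here, a CSV-like text file)."""
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n")

def parse(records):
    """Transformation step 1: split each line into fields."""
    for record in records:
        yield record.split(",")

def keep_valid(rows):
    """Transformation step 2: drop rows with empty fields."""
    for row in rows:
        if all(field.strip() for field in row):
            yield row

def write_sink(rows, path):
    """Deliver: write the processed rows to point B."""
    with open(path, "w") as f:
        for row in rows:
            f.write(";".join(row) + "\n")

# Wire the steps together: A -> parse -> validate -> B
write_sink(keep_valid(parse(read_source("input.csv"))), "output.csv")
```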
RTInsights: Why is it important to automate data pipelines?
Moutier: As with any processing we do now in modern application development, automation brings aspects that are really important from a business perspective. First, there is the management of the data itself. With automation comes reproducibility. Usually, you would want to implement this automation as code so that you can review everything and keep different versions of everything.
Automation makes the difference when you need to understand why a data pipeline is behaving differently now than it did before. Without automation, we often cannot determine who touched it or who made some configuration change. These are the same issues you would consider with an application: who made the change? The business improvement comes from coding everything, being able to replicate it, and being able to reproduce things from development to production across different stages.
Plus, automation brings scalability and the ability to reapply the same recipes to different business cases. It’s a better way to do things. As I often say, it’s the equivalent of going from an artisanal mode to an industrial mode. Maybe the things you are producing are really good when you’re working as an artisan. You are in your workshop, and you are crafting things. That’s fantastic. But if you have to produce hundreds of those items, you must automate the process, whether it is preparing your parts, finishing, or anything else. It’s exactly the same for your data pipelines.
RTInsights: What are some industry-specific use cases that especially benefit from automating data pipelines?
Moutier: Automating pipelines offers benefits to every industry. Whenever you have data to process, an automated data pipeline helps. I will cite some examples. I’m working with people in healthcare who want to automatically process patient data for different purposes. It can be to automate image recognition to speed up a diagnosis process. It can be to pre-process MRI scans or X-rays to get to the diagnosis faster. It’s not only about getting the different data. You have to pre-process it, transform it, and then apply some machine learning process to generate a prediction: is there a risk here, or can we detect that disease in the image? That is something you can automate to be processed in real time.
I’m also working with another group in healthcare to speed up some ER processes. Instead of waiting to see a doctor who would prescribe a treatment, further analysis, a blood sample, or another exam, we are trying to implement a data-driven model. Here, a machine learning model uses preliminary exams, patient history, and other information like that. It can automatically predict the next exam these patients should take. Instead of waiting maybe two hours at the ER to see a doctor, the nurse can now directly send the patient to take those further tests, which is what the doctor would have done. Of course, these models are trained by doctors and endorsed by them. It’s just a way to speed up the process at the ER.
In insurance, you might set up an automated pipeline to analyze an incoming claim. For example, you might have an email with some pictures attached to it. You might do some pre-processing to analyze the sentiment in the letter. Is your client really upset, just complaining, or making a simple request? You can use natural language processing to automate this analysis. If this is a claim about a car accident, the attached pictures are supposed to be of the damaged car. You can automatically detect whether a picture is indeed a picture of a car and not a banana or a dog or a cat. More seriously, you can tag it with information gathered from the image: location, weather conditions, and more.
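A rough sketch of that claim-handling flow might look like the following. The model calls are stubbed out with placeholders; in a real pipeline they would be NLP and image-classification services, which are not specified here.

```python
# Illustrative sketch of an automated claim-analysis step.
# analyze_sentiment and classify_image are placeholders standing in for
# real NLP and computer-vision models (assumptions for illustration only).

def analyze_sentiment(text):
    """Placeholder for an NLP model classifying the message tone."""
    return "complaint"  # e.g., "upset", "complaint", or "simple request"

def classify_image(image_path):
    """Placeholder for an image model that checks the photo really shows a car
    and extracts tags such as damage type or weather conditions."""
    return {"is_car": True, "tags": ["rear-end damage", "rain"]}

def process_claim(email_body, attachments):
    claim = {"sentiment": analyze_sentiment(email_body), "photos": []}
    for path in attachments:
        result = classify_image(path)
        if result["is_car"]:
            claim["photos"].append({"file": path, "tags": result["tags"]})
    return claim

print(process_claim("My car was hit from behind...", ["photo1.jpg"]))
```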
Those kinds of automated pipelines can speed up any kind of business process. As a result, automating data pipelines applies to any industry that wants to speed up or tighten its business processes.
RTInsights: What are some of the challenges in automating data pipelines?
Moutier: The first challenge that comes to mind is the tooling. You must find the right tools to do the right job, whether the tools are used for ingestion, processing, or storing. Where it gets difficult is that nowadays there are so many different tools and projects, especially when considering open source, and the number is growing at a really fast pace. If you look at common tools today, and here I’m thinking, for example, of Airflow or Apache NiFi or similar tools that help you automate those processes, their use changes rapidly. Often a tool is mainstream for only a year or a year and a half before it is replaced by something else. The pace at which you must track all the tooling is a real challenge.
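For readers who have not used such tools, here is a minimal sketch of what pipeline-as-code looks like in Airflow. The DAG name, schedule, and task bodies are assumptions chosen only to show that the pipeline lives in reviewable, versionable code.

```python
# Minimal Airflow DAG sketch: ingest -> transform -> store.
# The task functions are placeholders; the point is that the whole
# pipeline is declared as code that can be versioned and reviewed.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest():
    print("pull data from the source")

def transform():
    print("clean and enrich the data")

def store():
    print("write the result to the destination")

with DAG(
    dag_id="example_data_pipeline",   # illustrative name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    ingest_task = PythonOperator(task_id="ingest", python_callable=ingest)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    store_task = PythonOperator(task_id="store", python_callable=store)

    ingest_task >> transform_task >> store_task
```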
On top of that, another challenge I’ve seen people struggle with is a good understanding of the data itself: its nature, its format, its frequency. Does it change often? Especially with real-time data, you must understand the cycles at which the data may vary.
Also, the volume of the data can be an issue. Sometimes people design data pipelines that look fantastic on paper: “Oh, I’m going to take this and apply that.” And it works perfectly in the development environment because they are only using 100 megabytes of data. But when your pipeline is handling terabytes, even petabytes, of data, it may behave differently. So, you need a good understanding of the nature of the data. That will help you face the challenges that come with it.
RTInsights: Are there different issues with pipelines for batch vs. real-time vs. event-driven applications?
Moutier: The issues, of course, can be seen as different. But from my perspective, they differ mainly in the way they are tackled. The root causes may be the same. It comes down to scalability and reliability. Let’s take an example. What do I do when there is a loss in the transmission or a broken pipe in my pipeline? That can happen in batch processing or real-time processing, but you do not use the same approach to solve the problem in each case. If there is a broken pipe in batch processing, it’s usually not a major problem. We just restart the batch process.
You need a different approach in a real-time infrastructure or event-driven data pipeline. You have to address the issues from the business perspective. What happens when a server goes down? What happens when there is a security breach? It’s all those root causes that will help you identify exactly the challenges you face in those different cases and what solution to apply to them. It must be driven with these considerations in mind. It is not only that there is a specific issue, but where does it come from? And never forget what you want to achieve because that’s the only way to apply the proper mitigation solution to those issues.
RTInsights: What technologies can help?
Moutier: A data streaming platform like Kafka is almost ubiquitous now. More and more, it is used in almost all types of pipelines: event-driven, real-time, or batch. You do not see modern application development or data pipelines without Kafka at some step. It’s a great tool that allows many different architectures to be built on top of it.
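As a small illustration, a pipeline stage can publish its events to a Kafka topic with just a few lines of Python using the kafka-python client. The broker address, topic name, and event payload are assumptions for the sake of the example.

```python
# Sketch of publishing pipeline events to a Kafka topic with kafka-python.
import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",               # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Each record a pipeline stage emits becomes an event other stages can consume.
producer.send("pipeline-events", {"source": "ingest", "status": "new-batch", "records": 1200})
producer.flush()
```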
I’ve also seen more and more serverless functions being used. They are fantastic, especially if you are working in a Kubernetes or OpenShift environment. OpenShift Serverless, which is the Knative implementation in OpenShift, is a perfect match for an event-driven architecture. It lets you scale down to zero resource consumption and scale up to whatever you need, depending on the flow of data coming in.
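Knative can scale any HTTP container to zero, so a serverless pipeline step can be as simple as a small HTTP handler. Here is one possible sketch using Flask; the endpoint path, port, and payload shape are assumptions, not a prescribed Knative API.

```python
# Sketch of a function-style HTTP handler that a Knative service could run.
# Knative scales the container to zero when no events arrive and back up
# as traffic flows in; this Flask app is just one way to package the handler.
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/", methods=["POST"])
def handle_event():
    event = request.get_json(force=True)   # the incoming pipeline event (assumed JSON)
    # ... process the event: transform, enrich, forward downstream ...
    return jsonify({"processed": True, "received": event})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```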
If we go back a few years, we had batch processing servers that sat idle most of the day and woke up at 2:00 a.m. to do some processing on data. That’s a huge consumption of resources for something that runs 10 minutes a day. That doesn’t make any sense. Today you would use cloud-native architectures and tools developed for Kubernetes for your data pipelines.
Another technology that helps is an intelligent storage solution such as Ceph. In the latest releases of Ceph, you have, for example, bucket notifications. That means your storage is no longer a dumb dumpster that your data falls into. Now it can react to the data, and it can react in many different ways. With bucket notifications in Ceph, you can send a message to a Kafka topic or to an API endpoint saying, “This event happened on the storage: this file has been uploaded, modified, or deleted.”
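A downstream stage can then react to those notifications by reading them off the topic. The sketch below consumes such events with kafka-python; the topic name is an assumption, and the payload shape follows the S3-style event record format that Ceph bucket notifications emit, so check your Ceph release’s documentation for the exact fields.

```python
# Sketch of reacting to Ceph bucket notifications delivered to a Kafka topic.
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "bucket-notifications",                            # assumed topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    # Each notification describes an object that was uploaded, modified, or deleted.
    for record in message.value.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        print(f"{record.get('eventName')}: s3://{bucket}/{key}")
        # ... trigger the next pipeline step for this object ...
```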
Also coming in Ceph, a feature like S3 Select is a fantastic add-on for analytics workloads. Instead of bringing all the data up to the processing cluster, you only retrieve the data you are interested in. You run the selection directly at the source and retrieve only the data you want for processing. That’s the kind of feature that makes storage a much more interesting part of the pipeline, something you can play with. It’s part of the architecture and, again, it makes storage much more than just a simple repository of data.
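To show what pushing the selection down to the storage looks like, here is a sketch of an S3 Select query through boto3 against an S3-compatible endpoint such as a Ceph RADOS Gateway. The endpoint URL, credentials, bucket, key, and CSV layout are all assumptions for illustration.

```python
# Sketch of S3 Select: the filter runs at the storage side, so only the
# matching rows travel over the network, not the whole object.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://rgw.example.com:8080",   # assumed Ceph RGW endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

response = s3.select_object_content(
    Bucket="sensor-data",
    Key="readings.csv",
    ExpressionType="SQL",
    Expression="SELECT s.device_id, s.temperature FROM S3Object s "
               "WHERE CAST(s.temperature AS FLOAT) > 30",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}},
)

for event in response["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"), end="")
```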
RTInsights: Once you automate the data pipelines, what are the benefits?
Moutier: Scalability, reliability, security, and always being able to know what’s happening, because it’s all defined in code. If you have automated your pipeline, that means at some point you have coded it. What runs is exactly what is supposed to run. From a business perspective, that gives you a great advantage.
Once you have intelligently implemented your data pipeline, it’s easy to multiply the outcomes. For example, you could send data directly from your storage to processing, or from an event stream to processing. That’s great. But suppose you have taken the step, for example, of putting Kafka in the middle. In that case, even if you started with only one consumer for the data, it’s easy to add another one, or hundreds of different consumers, to the topics.
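Adding another consumer can be as small a change as starting a new consumer group on the existing topic: the producer and the first consumer are untouched. The topic and group names below are assumptions carried over from the earlier sketch.

```python
# Sketch of adding a second, independent consumer to an existing topic.
# A new group_id receives its own copy of the stream, so new branches of
# the pipeline can be attached without changing the producer.
import json

from kafka import KafkaConsumer

analytics_consumer = KafkaConsumer(
    "pipeline-events",                       # the same topic the producer writes to
    bootstrap_servers="localhost:9092",
    group_id="analytics-dashboard",          # a brand-new consumer group
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in analytics_consumer:
    # ... feed the same events into a dashboard, an archive, or an ML job ...
    print(message.value)
```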
If you architect from the start with automation and scaling in mind, it gets easier to complement a pipeline with other processes, branches, or any other processing you want to do with the data.