Change Data Capture makes your data stores and existing databases first-class citizens in an event-driven architecture (EDA).
Event-driven architectures (EDAs) play an increasingly important role as organizations incorporate more events and streaming data into their applications. They employ EDA to capture insights and communicate changes instantly, helping to enhance customer experiences and improve organizational efficiency. Quite often, Change Data Capture (CDC) is used to increase the number and variety of information sources feeding EDA applications.
RTInsights recently sat down with Gunnar Morling, a Red Hat software engineer who leads the Debezium project, an open-source platform for change data capture. We discussed use cases for CDC, the role of CDC in EDA applications, where Debezium fits in, and more. Here is a summary of our conversation.
RTInsights: What type of applications need change data capture?
Morling: The beauty of change data capture (CDC) is that it’s not bound to a particular type of application. Any type of application can take advantage of reacting to data changes with low latency. The way I like to think about CDC is that it’s an enabler for your application. It enables specific capabilities.
Let’s say your application would like to send updates to another system or a data warehouse. Or maybe your application would like to keep a search index up to date so people can do a full-text search. Or maybe your application needs something like an audit log, so you have an audit trail of all your data changes. CDC would let you do these things. That’s what I mean by enabling specific capabilities.
You can also use it to enable specific non-functional requirements. For instance, you might have requirements for a particular use case where you need to access a dataset very efficiently. You would use a cache for that, and you could use CDC to update that cache. Or maybe you want to join the data from multiple tables into a single representation. Again, you could use CDC to drive the updates to such a materialized view. Another thing would be something like polyglot persistence. There may be specific requirements or specific use cases in your application that would benefit from a graph data store, whereas all the other parts of your data model may be in a relational database system (RDBMS). You could use CDC to propagate the changes from the RDBMS over to that graph data store.
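To make the cache example concrete, here is a minimal sketch of a consumer that keeps an in-memory cache in sync from a Debezium change stream. It assumes the default JSON converter with schemas enabled (hence the "payload" envelope), a hypothetical dbserver1.inventory.customers topic, and a table with a numeric id primary key:

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;

public class CustomerCacheUpdater {
    // In-memory cache keyed by primary key; in practice this might be Redis, etc.
    private static final Map<Long, JsonNode> CACHE = new ConcurrentHashMap<>();
    private static final ObjectMapper MAPPER = new ObjectMapper();

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "customer-cache");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Topic name follows Debezium's <prefix>.<schema>.<table> convention (assumed here).
            consumer.subscribe(List.of("dbserver1.inventory.customers"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // Skip tombstones (emitted after deletes) and keyless records.
                    if (record.key() == null || record.value() == null) continue;
                    long id = MAPPER.readTree(record.key()).path("payload").path("id").asLong();
                    JsonNode payload = MAPPER.readTree(record.value()).path("payload");
                    switch (payload.path("op").asText()) {
                        case "c", "u", "r" -> CACHE.put(id, payload.path("after")); // create/update/snapshot
                        case "d" -> CACHE.remove(id);                               // delete
                    }
                }
            }
        }
    }
}
```

The same loop, pointed at a different sink, would drive a search index or a materialized view instead of a cache.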
So, the way I like to think about it is that CDC is an enabler for all kinds of use cases. And it’s a little bit like liberating your data.
RTInsights: How is CDC used to bridge traditional data stores and new cloud-native, event-driven architectures?
Morling: CDC makes your data stores and existing databases first-class citizens in an event-driven architecture (EDA). The nice thing about this is that you don’t have to modify your applications, which are based on those existing databases. It’s transparent to those applications. Your database essentially acts as an event producer, and you can set up all sorts of consumers that react to those data changes.
One scenario is to use CDC as an on-ramp to the cloud for your on-premises legacy databases. For example, suppose you want to take your data from an on-prem Oracle database into a cloud data warehouse, such as Snowflake. CDC lets you set up such a data pipeline, and it would have a very low end-to-end latency.
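As an illustration, registering a Debezium Oracle source connector with a Kafka Connect cluster could look roughly like this; the hostnames, credentials, and table names are placeholders, and some property names vary across Debezium versions (older releases use database.server.name instead of topic.prefix, for example):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterOracleConnector {
    public static void main(String[] args) throws Exception {
        // Connector config as accepted by the Kafka Connect REST API (POST /connectors).
        String body = """
            {
              "name": "inventory-oracle-connector",
              "config": {
                "connector.class": "io.debezium.connector.oracle.OracleConnector",
                "database.hostname": "oracle.internal.example.com",
                "database.port": "1521",
                "database.user": "c##dbzuser",
                "database.password": "dbz",
                "database.dbname": "ORCLCDB",
                "topic.prefix": "oracle01",
                "table.include.list": "INVENTORY.ORDERS",
                "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
                "schema.history.internal.kafka.topic": "schema-changes.inventory"
              }
            }
            """;

        HttpResponse<String> response = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(URI.create("http://connect:8083/connectors"))
                        .header("Content-Type", "application/json")
                        .POST(HttpRequest.BodyPublishers.ofString(body))
                        .build(),
                HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

The captured changes land in Kafka topics, from which a separate sink connector, such as the Snowflake connector for Kafka, can load them into the warehouse.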
Essentially, it means a democratization of data. Instead of having to go to your database and query for it, the data comes to you, and you can react to it as it changes. This gives you many interesting options. Something we see a lot is that once someone sets up a CDC stream, more and more use cases emerge that subscribe to the same data change stream, often use cases they did not initially think about. That’s the beauty of this model.
But let me also point out that CDC is not only for integrating legacy or existing data sources. It is also part of new, cloud-native data stores; for instance, modern data stores like Apache Cassandra, ScyllaDB, or Yugabyte come with CDC out of the box. And I would expect it to be an integral part of most databases down the road.
RTInsights: Where does Debezium come into play? How does it help?
Morling: Debezium is the leading open-source CDC platform. It is useful because there is no one API that enables CDC for all sorts of databases. There are different APIs, different formats, and different mechanisms needed for implementing CDC for different databases, like Postgres, MySQL, SQL Server, and so on. Debezium provides you with one unified way for getting changes out of all those databases.
You don’t have to care about database specifics, such as each database’s native change formats and mechanisms; Debezium takes care of that for you. It exposes changes to you in one abstract, unified event format, so consumers don’t have to care about the particular source of a data change. Debezium is based on Apache Kafka, which allows you to go about this in an asynchronous way and connect all kinds of consumers. Users can react to all the inserts, updates, and deletes on the source databases.
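To show what that unified access looks like in code, here is a minimal sketch using Debezium’s embedded engine, which delivers change events straight to a Java callback; the Postgres source and connection details are assumptions, and swapping in another connector class leaves the consuming code unchanged:

```java
import io.debezium.engine.ChangeEvent;
import io.debezium.engine.DebeziumEngine;
import io.debezium.engine.format.Json;

import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class EmbeddedCdcExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.setProperty("name", "embedded-engine");
        // Swap in MySqlConnector, SqlServerConnector, etc. -- the event shape stays the same.
        props.setProperty("connector.class", "io.debezium.connector.postgresql.PostgresConnector");
        props.setProperty("offset.storage", "org.apache.kafka.connect.storage.FileOffsetBackingStore");
        props.setProperty("offset.storage.file.filename", "/tmp/offsets.dat");
        props.setProperty("database.hostname", "localhost");
        props.setProperty("database.port", "5432");
        props.setProperty("database.user", "postgres");
        props.setProperty("database.password", "postgres");
        props.setProperty("database.dbname", "inventory");
        props.setProperty("topic.prefix", "dbserver1"); // database.server.name on older versions

        ExecutorService executor = Executors.newSingleThreadExecutor();
        // Each event carries op (c/u/d/r), before, after, and source metadata,
        // regardless of which database produced it.
        try (DebeziumEngine<ChangeEvent<String, String>> engine =
                     DebeziumEngine.create(Json.class)
                             .using(props)
                             .notifying(event -> System.out.println(event.value()))
                             .build()) {
            executor.execute(engine);
            Thread.sleep(60_000); // run for a minute, then close via try-with-resources
        }
        executor.shutdown();
    }
}
```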
RTInsights: What does Red Hat offer in this area?
Morling: Debezium is part of a supported Red Hat product called Red Hat Integration. Red Hat Integration is a portfolio product that comes with all kinds of features and capabilities. For instance, it contains Red Hat AMQ Streams, our Apache Kafka distribution, as well as a schema and API registry. It’s an on-prem, self-managed product, which people can deploy on Red Hat OpenShift or on RHEL [Red Hat Enterprise Linux] if that’s what they prefer. They can use Red Hat Integration to run and operate Kafka, and also Debezium connectors, by themselves.
Another offering is Red Hat OpenShift Streams for Apache Kafka (RHOSAK), which is a fully managed cloud service for Apache Kafka. And we plan to offer a cloud service for Change Data Capture along with RHOSAK.
RTInsights: Can you give us some use cases and examples of CDC being used to implement EDAs?
Morling: We spoke about some of the use cases before, but I want to dive a little bit deeper into two particular usages of change data capture in the context of microservices.
The first one is data exchange between microservices. If you follow the ideas of microservices, you want to make them self-contained, and they should be as loosely coupled as possible. But still, no matter how hard you try, services will interact with other services. They are not isolated. They need to exchange data with each other.
For example, let’s say you have an e-commerce service that receives purchase orders. One first thought could be for this service to persist data, like a new purchase order, in its own database. And then, it also could send out a message to other services, such as a fulfillment service, over Apache Kafka.
The problem is those two activities (updating the service’s own database and sending that message to the other services via Kafka) cannot happen as part of one shared transaction. Kafka just doesn’t support that. So, you would end up with something that is called dual writes. You update those two resources without having shared transactional guarantees. And this means you will be prone to inconsistencies. One of the operations might fail. You may have applied the change to the database, but you have not sent the message to Kafka. Obviously, this sort of data inconsistency is not very desirable.
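In code, the dual-write anti-pattern looks something like the following sketch (table, topic, and connection details are made up for illustration):

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Properties;

public class DualWriteAntiPattern {
    // DON'T do this: the database commit and the Kafka send are two separate
    // operations with no shared transactional guarantee.
    public static void placeOrder(long id, String orderJson) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/orders", "app", "secret");
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO purchase_order (id, payload) VALUES (?, ?::jsonb)")) {
            ps.setLong(1, id);
            ps.setString(2, orderJson);
            ps.executeUpdate(); // write 1: committed to the service's own database
        }

        // If the process crashes right here (or send() fails), the order is in
        // the database, but the fulfillment service never hears about it.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders", Long.toString(id), orderJson)); // write 2
        }
    }
}
```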
Change data capture can be a way out of the problem. One way to approach this would be to just capture the changes in the first service’s database and send them over to the fulfillment service. But if you think about it, this might expose the internal data model of your service, and you might not find this very attractive.
The way around this is to use what is called the outbox pattern, which is based on CDC but applies a specific spin to it. The idea is instead of capturing your actual business tables, like purchase orders and so on, you would have a dedicated table, an outbox table. Whenever your application or service wants to send a message to another service, it will insert a record into this outbox table.
You could use Debezium to capture just the inserts to this outbox table. Whenever there’s a new message, there will be an insert in this outbox table, and Debezium will relay those messages to the other services. This avoids the dual writes issue because the service only updates its own database. It would update its business tables and also insert the outbox message within one transaction.
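Here is a minimal sketch of the outbox pattern with plain JDBC, assuming a Postgres database; the business table is made up, while the outbox columns follow the convention expected by Debezium’s outbox event router (id, aggregatetype, aggregateid, type, payload):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.UUID;

public class OrderService {
    // Both inserts happen in ONE local transaction, so there is no dual write:
    // the order and its outbox message are committed together, or neither is.
    public void placeOrder(long orderId, String customerId, String orderJson) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/orders", "app", "secret")) {
            conn.setAutoCommit(false);
            try {
                try (PreparedStatement ps = conn.prepareStatement(
                        "INSERT INTO purchase_order (id, customer_id, payload) VALUES (?, ?, ?::jsonb)")) {
                    ps.setLong(1, orderId);
                    ps.setString(2, customerId);
                    ps.setString(3, orderJson);
                    ps.executeUpdate();
                }
                try (PreparedStatement ps = conn.prepareStatement(
                        "INSERT INTO outbox (id, aggregatetype, aggregateid, type, payload) "
                                + "VALUES (?, ?, ?, ?, ?::jsonb)")) {
                    ps.setObject(1, UUID.randomUUID());
                    ps.setString(2, "order");
                    ps.setString(3, Long.toString(orderId));
                    ps.setString(4, "OrderCreated");
                    ps.setString(5, orderJson);
                    ps.executeUpdate();
                }
                conn.commit();
            } catch (Exception e) {
                conn.rollback();
                throw e;
            }
        }
    }
}
```

Debezium then captures only the inserts to the outbox table and routes each record to a topic derived from aggregatetype, so consumers never see the service’s internal tables.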
Another example of CDC’s use is in cases where you don’t have microservices yet. Maybe you are still working with an application based on a monolithic architecture, and you would like to split up the monolith into microservices. You don’t typically want to do one big bang migration because it’s too risky. You would rather take baby steps and extract small parts of your monolith into new services, step by step, to minimize the risk. CDC can help you do that.
The idea is to implement what is called the strangler fig pattern. You put a routing component in front of your monolith, and you start to take some components out of the monolith and put them into a new microservice. You configure the routing component to send all the reads for that particular part of the domain over to the new service. You then use CDC and Debezium to propagate all the writes from the monolith’s database over to the new microservice and its own database, so that the new microservice can satisfy all the read requests made to it.
You can keep iterating on this process, using CDC to gradually extract functionality out of the monolith into microservices while avoiding the risk of a big bang migration. At some point, you will have moved all the logic pertaining to a particular part of the domain over to the microservice. And then, for instance, you can use CDC to propagate changes back to the monolith. That is useful in cases where some data is now owned by the microservice but is still referenced by logic that remains in the monolith, i.e., that data needs to be propagated from the microservice back to the monolith. So you can use CDC in a bi-directional manner; you just have to make sure you don’t propagate the same data around in an infinite loop.
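As a sketch of the routing component in the strangler fig setup, assuming plain HTTP services and a made-up /orders domain that has already been extracted (GET-only, to keep it short):

```java
import com.sun.net.httpserver.HttpServer;

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class StranglerRouter {
    private static final String MONOLITH = "http://monolith.internal:8080";
    private static final String ORDERS_SERVICE = "http://orders-service.internal:8080";
    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    public static void main(String[] args) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(8000), 0);
        server.createContext("/", exchange -> {
            String path = exchange.getRequestURI().getPath();
            // Reads for the extracted part of the domain go to the new
            // microservice; everything else still hits the monolith.
            String backend = path.startsWith("/orders") ? ORDERS_SERVICE : MONOLITH;
            try {
                HttpResponse<byte[]> upstream = CLIENT.send(
                        HttpRequest.newBuilder(URI.create(backend + path)).GET().build(),
                        HttpResponse.BodyHandlers.ofByteArray());
                exchange.sendResponseHeaders(upstream.statusCode(),
                        upstream.body().length == 0 ? -1 : upstream.body().length);
                exchange.getResponseBody().write(upstream.body());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                exchange.sendResponseHeaders(502, -1);
            } finally {
                exchange.close();
            }
        });
        server.start();
    }
}
```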
Bio
Gunnar Morling (@gunnarmorling) is a software engineer and open-source enthusiast at heart. He leads the Debezium project, a distributed platform for change data capture. He is a Java Champion, the spec lead for Bean Validation 2.0 (JSR 380), and has founded multiple open-source projects such as JfrUnit, kcctl, and MapStruct. Gunnar is an avid blogger (morling.dev) and has spoken at a wide range of conferences, including QCon, JavaOne, Devoxx, and JavaZone. He’s based in Hamburg, Germany.