Change data capture enables applications to react every time data changes. And it does this without having to change the apps that modify the data.
Businesses today need the ability to respond faster, more efficiently, and more intelligently to the world around them. To do this, many are building out cloud-based, event-driven solutions that rely on streams of data flowing through their systems, which means they must capture and process information as it's created. That's where change data capture (CDC) in general, and Debezium in particular, are increasingly playing a critical role.
So, why the need for CDC? What does it do, and how does it help? Gunnar Morling, a Red Hat software engineer who is leading the Debezium project, noted in an interview with RTInsights that “CDC makes your data stores and existing databases first-class citizens in an event-driven architecture (EDA). The nice thing about this is that you don’t have to modify your applications, which are based on those existing databases. It’s transparent to those applications. Your database essentially acts as an event producer, and you can set up all sorts of consumers that react to those data changes.”
CDC lets any type of application react to data changes with low latency, an essential element in real-time systems. One specific example is keeping a search index up to date so users can run full-text searches that reflect updates as they happen. CDC can also serve as an on-ramp to the cloud for on-premises legacy databases, for example, by feeding data from a legacy database into a cloud data warehouse. In this use case, CDC lets a business set up such a data pipeline with very low end-to-end latency.
When updating a source database, which could be any relational database such as Oracle, Microsoft SQL Server, Postgres, or MySQL, a business may need to update multiple related resources such as a cache and a search index. A simple approach would be to modify the applications so they update those resources at the same time. However, trying to consistently write the same change to more than one target introduces many challenges and coordination overhead. CDC avoids the pitfalls of such dual writes: applications keep writing only to the database, and the other resources are updated concurrently and accurately from the resulting stream of changes.
CDC accomplishes this by tracking row-level changes in database source tables – categorized as insert, update, and delete events – and making those change notifications available to any other systems or services that rely on the same data. The change notifications are emitted in the same order in which the changes were made in the original database. In this way, CDC ensures that all parties interested in a particular data set are accurately informed of the change and can react accordingly, either by refreshing their own version of the data or by triggering business processes.
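As a rough illustration, each change notification can be thought of as a small envelope carrying the kind of change, the row state before and after it, and some ordering metadata. The Java sketch below is a simplified model, not a formal schema; the single-letter operation codes follow Debezium's convention, while the field and type names are only illustrative.

```java
import java.util.Map;

// Simplified model of a row-level change event. Real CDC tools (including
// Debezium) emit a richer, schema-described envelope; field names here are
// illustrative only.
public record RowChangeEvent(
        String op,                  // "c" = insert (create), "u" = update, "d" = delete
        Map<String, Object> before, // row state before the change (null for inserts)
        Map<String, Object> after,  // row state after the change (null for deletes)
        long sourceTsMs) {          // when the change happened in the source database

    // A consumer typically dispatches on the change type and refreshes its own
    // copy of the data, such as a cache entry or a search-index document.
    public void applyTo(Map<Object, Map<String, Object>> localCopy, Object key) {
        switch (op) {
            case "c", "u" -> localCopy.put(key, after); // upsert the new row state
            case "d"      -> localCopy.remove(key);     // drop the deleted row
            default       -> { }                        // ignore anything else (e.g. snapshot reads)
        }
    }
}
```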
A typical CDC pipeline starts with an event producer application, for example, a shopping app where a user creates or updates an account. The app makes those changes in a database, and CDC then makes them available to a stream engine. Because it deals only with the changes rather than with full data sets, CDC keeps the processing resources required to a minimum.
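On the consuming side, the "stream engine" step of such a pipeline often looks like an ordinary Kafka consumer subscribed to a change topic. The sketch below assumes change events are published to Kafka as strings on a per-table topic; the broker address, topic, and group names are placeholders, not anything prescribed by CDC itself.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AccountChangeListener {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address
        props.put("group.id", "account-change-listener");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // One topic per captured table is a common convention in CDC pipelines.
            consumer.subscribe(List.of("shop.accounts.changes"));
            while (true) {
                // Each record is one row-level change; within a partition, records
                // arrive in the order they were written.
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    System.out.printf("account %s changed: %s%n", record.key(), record.value());
                }
            }
        }
    }
}
```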
See also: Application Modernization and Change Data Capture
Why change data capture is increasingly in demand
Major industry changes are driving the need for CDC in real-time applications. To start, many more operations and business processes must transition from batch and reactive modes to real time. Financial institutions can no longer afford to analyze transactions once a week or month looking for fraud; they must react to real-time data about user behavior and transactions in progress and stop fraud as it happens. Similarly, a modern retailer might use dynamic input, such as an in-store beacon alerting that a loyalty-program customer is in the cosmetics section, to instantly deliver in-person or text promotional offers. A CDC system can serve as the intermediary between the systems where those changes originate and the analytics engines that process them to derive insights and take action.
A second transformation underway is the need for rapid application development and constant updates to meet user expectations and demands. Batch-based, monolithic applications of old are not up to the task. They cannot be easily modified. New features take forever to roll out. And any small change requires updating the entire application. Organizations do not have the luxury of taking workhorse production applications offline for any period to revise them.
The way to address this issue is to break apart the old monoliths and rebuild on a cloud-native architecture. To that end, many organizations are moving to application architectures based on microservices and containers. The services and data provided by the core systems are encapsulated and presented to the other elements of an application as microservices via APIs.
In that way, new front-end applications can be developed easily, and new analytics algorithms and models can be used to extract insight from the data, while the core application and the services it provides remain untouched. CDC sits between the core application and the downstream processes, capturing the data changes and making them available to a stream engine.
How Debezium fits in
Debezium is an open-source distributed platform for change data capture. It enables applications to react every time data changes, without requiring any changes to the apps that modify the data. Debezium continuously monitors databases and lets any application stream every row-level change in the same order the changes were committed to the database.
Debezium is useful because there is no single API that enables CDC across all the different databases a business might use; each database requires its own APIs, formats, and mechanisms for implementing CDC. Debezium provides one unified way of getting changes out of all of them.
That means a business does not have to worry about database specifics such as database-specific change formats. Debezium takes care of that and exposes changes in one abstract, unified event format, so consumers don't have to care about the particular source of a data change.
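In practice, this uniformity shows up in the connector configuration: each source database gets its own Debezium connector, but the configuration has the same shape, and the change events it produces share the same envelope. A minimal sketch of such a configuration follows; the property names follow Debezium's documented connector options and can vary between versions, and the host, credentials, and table names are placeholders.

```java
import java.util.Properties;

public class ConnectorConfigSketch {
    // Configuration for capturing one MySQL table. Switching databases is mostly
    // a matter of changing the connector class and the connection details;
    // downstream consumers keep seeing the same event format.
    static Properties mysqlConnector() {
        Properties config = new Properties();
        config.put("name", "shop-connector");
        config.put("connector.class", "io.debezium.connector.mysql.MySqlConnector");
        // For PostgreSQL, the connector class would instead be
        // "io.debezium.connector.postgresql.PostgresConnector".
        config.put("database.hostname", "mysql.internal.example.com");
        config.put("database.port", "3306");
        config.put("database.user", "debezium");
        config.put("database.password", "secret");
        config.put("topic.prefix", "shop");                 // prefix for the emitted Kafka topics
        config.put("table.include.list", "shop.accounts");  // capture only the accounts table
        return config;
    }
}
```

In a typical deployment, a configuration like this is submitted to Kafka Connect, and consumers of the resulting topics cannot tell which connector produced the events.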
Additionally, Debezium is based on Apache Kafka, which lets these change events flow asynchronously to all kinds of consumers. As a result, applications can react to all the inserts, updates, and deletes on the source databases in real time.
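Because the change events land on Kafka topics, any number of consumers can subscribe to them independently. As a minimal sketch of that fan-out (again with placeholder broker and group names), the snippet below builds two consumers in different consumer groups; each tracks its own position in the change stream, so, for example, a cache updater and a search indexer can process the same inserts, updates, and deletes at their own pace.

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class FanOutSketch {
    // Consumers in different groups each keep their own offset in the change
    // topic, so they react to the same change events independently and
    // asynchronously, without coordinating with each other.
    static KafkaConsumer<String, String> consumerFor(String groupId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address
        props.put("group.id", groupId);
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        return new KafkaConsumer<>(props);
    }

    public static void main(String[] args) {
        // Both consumers would subscribe to the same change topic and poll it in
        // their own threads; neither needs to know about the other or about the
        // internals of the source database.
        var cacheUpdater = consumerFor("cache-updater");
        var searchIndexer = consumerFor("search-indexer");
    }
}
```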