At a time when there’s a tsunami of new data, one of the fastest-growing categories of data storage is the transactional DBMS. The most popular databases in this group include Oracle, MySQL, SQL Server, and Postgres. As your data volumes grow and data arrives ever faster, there is increasing pressure to make that data available for fast and efficient analytics. The most important process you can implement to capture data from DBMS systems and integrate it into analytical systems is change data capture (CDC).
Understanding how CDC works is important for improving the functionality of your data infrastructure. It also helps you identify the steps you need to take to make the freshest possible information available to every query engine you use, from the data warehouse to machine learning models and AI.
In this blog post we describe what CDC is, why it’s important, how it works – and why it’s sometimes hard to implement and manage. And we introduce the data lakehouse, a relatively new architectural pattern, as a partner to CDC.
We describe how the data lakehouse can match the speed and efficiency of CDC while keeping all of your data available for the full range of analytics use cases, from business intelligence to AI/ML. Together, CDC and the data lakehouse can help you to move data faster, simplify your data infrastructure and data operations, and reduce costs.
Change data capture moves data quickly and efficiently from the transactional domain, known as online transaction processing (OLTP), to the analytics domain, known as online analytical processing (OLAP). (This article explains the difference.) CDC has evolved to serve a wide range of use cases.
It’s common today for CDC to be used for urgent use cases, while other use cases remain in batch mode. However, just as a chain is only as strong as its weakest link, a data architecture is only as responsive as its slowest processes. More organizations are seeing the value of moving most or all of their data integration use cases to CDC, often in tandem with streaming technologies such as Kafka, simplifying the entire data estate and making it more responsive.
When you run services such as an online marketplace, a ride-sharing app, or supply chain logistics, you are likely to have large deployments of transactional databases feeding data to your applications. As you rapidly collect data from online purchases, user interactions, and real-time deliveries, these databases are robust and purpose-built to store and process the updates coming in from your applications.
As your business develops, you want to leverage these large and valuable sources of data for analytics. You want to understand which items are selling fastest over the last 30 days, the last week, and the last hour; how busy your drivers are, and whether load is likely to increase or decrease in the next 30 minutes; or which shipping routes might be delayed due to incoming inclement weather.
You might be tempted to build analytics directly against your operational database. While you may get by with some simple analytics, you will quickly run into problems as you scale. Relational databases are not designed to process analytical-style queries efficiently. Your queries will start to put pressure on production resources, risking impact to both line-of-business applications and analytics workloads.
With change data capture techniques, you can capture all changes that happen in transactional databases and quickly integrate them into an analytics-ready system such as a data warehouse or a data lakehouse.
This can drive tremendous impact, as reflected in systems that we all take for granted today. When we order from an online marketplace with a “next day delivery” guarantee, access a ride-hailing app for transportation, or use an AI-powered system for a weather forecast, we are likely to be seeing CDC at work, keeping analytics systems up to date with current orders, inventory levels, and environmental conditions.
Organizations move to CDC when they can no longer afford to have out-of-date data for query responses. The need to change now may be driven by competitive pressure or by the desire to offer state-of-the-art services to internal and/or external customers.
Change data capture is designed to impose the lightest possible burden on the source database while efficiently delivering updates. There are three main approaches to capturing changes: query-based CDC, which polls source tables for rows changed since the last run; trigger-based CDC, which uses database triggers to record each change in a shadow table; and log-based CDC, which reads changes directly from the database’s transaction log, such as the MySQL binlog or the Postgres write-ahead log. The last of these, log-based CDC, is usually recommended as the most efficient, when the underlying database supports it.
Different databases support different approaches to CDC, but most databases in use today create logs that can be used as the basis for a low-impact implementation of CDC. Determining what approaches are available and best to use in each use case is part of the effort involved in implementing CDC.
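To make the trade-off concrete, here is a minimal sketch of the query-based approach in Python, using psycopg2 against a hypothetical Postgres orders table with an updated_at column. It is an illustration rather than a production pipeline, and it shows why polling adds load to the source: every cycle runs a fresh query against production, and deletes are never seen at all.

```python
# A minimal sketch of query-based CDC against Postgres, polling a hypothetical
# "orders" table by its "updated_at" column. Connection settings and schema are
# assumptions; a real pipeline would persist the watermark between runs and
# handle deletes, which timestamp polling cannot see.
import time
from datetime import datetime, timezone

import psycopg2

conn = psycopg2.connect("dbname=shop user=etl")          # hypothetical connection
conn.autocommit = True                                   # each poll sees the latest committed data
watermark = datetime(1970, 1, 1, tzinfo=timezone.utc)    # last change already captured

while True:
    with conn.cursor() as cur:
        # Every poll runs a query on the production database, which is why
        # non-log-based CDC adds load to the source as tables grow.
        cur.execute(
            "SELECT id, status, total, updated_at FROM orders "
            "WHERE updated_at > %s ORDER BY updated_at",
            (watermark,),
        )
        rows = cur.fetchall()
    for row in rows:
        print("changed row:", row)   # hand the change off to the target system
        watermark = row[3]           # advance the high-water mark
    time.sleep(10)                   # poll interval trades freshness for load
```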
CDC has its challenges. Non-log-based approaches impose a performance hit on source databases. Because it runs continuously, CDC needs to be monitored and tuned. And a typical CDC pipeline requires a wide range of components, often including Debezium, Kafka, Spark, Hudi, Airflow, and others.
This requires the team to have, or develop, expertise not only in several different tools, but also in a range of technology domains and in how to make them work together in practice. There is an initial burden on the engineering team to implement CDC and related technologies, and an ongoing burden to maintain it, as described in our blog post about infrastructure.
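As an illustration of one link in that chain, here is a minimal Python sketch that consumes Debezium change events from Kafka using kafka-python. The broker address and topic name are hypothetical, and the field layout assumes Debezium’s default JSON envelope, in which each event carries an op code plus before and after images of the row.

```python
# A minimal sketch of consuming Debezium change events from Kafka. The topic
# name and broker address are hypothetical; the envelope layout assumes
# Debezium's default JSON converter, which wraps each event in a "payload"
# containing "op", "before", and "after".
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "dbserver1.public.orders",             # hypothetical Debezium topic
    bootstrap_servers="localhost:9092",    # hypothetical broker
    value_deserializer=lambda v: json.loads(v.decode("utf-8")) if v else None,
    auto_offset_reset="earliest",
)

for message in consumer:
    if message.value is None:              # skip tombstone records
        continue
    payload = message.value.get("payload", {})
    op = payload.get("op")                 # c = insert, u = update, d = delete, r = snapshot read
    if op in ("c", "r", "u"):
        row = payload["after"]             # the new state of the row
    elif op == "d":
        row = payload["before"]            # the last state before deletion
    else:
        continue
    print(op, row)                         # in practice: apply the change to the target table
```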
CDC is, by definition, about data mutability: it captures data changes and sends them rapidly along to targets. In a typical data lake, however, changes can only be appended to existing data, not applied as updates to the target. So tables, or partitions of tables, must either be rewritten constantly, which is expensive, or left in fragmented form, which saves some money but slows query processing.
This is one of the core reasons Apache Hudi was created in 2016, pioneering the path to the data lakehouse. Hudi was optimized from the beginning for mutable data, including ACID transactions: fast, easy, reliable updates.
Among lakehouse projects, Hudi is best adapted for mutable data. In particular, Hudi employs two distinct update strategies: copy on write (CoW), a commonly used strategy that merges and rewrites columnar files; and merge on read (MoR), only fully implemented in Hudi, which amortizes updates so that the entire file is not rewritten for each one. This makes Hudi highly responsive to the updates continuously delivered by CDC.
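As a rough sketch of what that looks like in practice, the PySpark snippet below upserts a batch of CDC changes into a Hudi merge-on-read table. It assumes a Spark session with the Hudi bundle available; the table name, record key, precombine field, and storage path are hypothetical.

```python
# A minimal PySpark sketch of applying CDC-style upserts to a Hudi merge-on-read
# table. It assumes a Spark session with the Hudi bundle on the classpath; the
# table path, record key, and precombine field are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cdc-upserts").getOrCreate()

# Pretend these rows arrived from a CDC feed (latest state of each order).
changes = spark.createDataFrame(
    [(101, "shipped", 42.50, "2024-05-01 10:15:00")],
    ["order_id", "status", "total", "updated_at"],
)

hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",      # amortize updates (MoR)
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.precombine.field": "updated_at",   # latest change wins
}

# Each batch of changes is upserted rather than appended, so the table stays
# current without rewriting whole files on every commit.
changes.write.format("hudi").options(**hudi_options).mode("append").save(
    "s3://my-bucket/lakehouse/orders"   # hypothetical table location
)
```

Using the change timestamp as the precombine field means that when several changes to the same order arrive in one batch, only the latest one is applied.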
Building CDC ingestion services is a daily focus for us here at Onehouse and it is a core feature of our universal data lakehouse. We use CDC to seamlessly ingest data from databases such as MySQL, Mongo, and Postgres, and streaming tools such as Debezium and Kafka, creating a turnkey system ready for the widest range of analytics use cases.
If you want to talk to some of our in-house experts who have built and operated some of the largest CDC-powered systems in the world, reach out to gtm@onehouse.ai.