Data engineers need a new approach for using change data capture (CDC) to extract updates from transactional databases and keep analytics tables up to date, with end-to-end times measured in seconds. The data warehouse is flexible and can be updated fast enough, but it is proprietary, expensive, and creates an extra, captive copy of incoming data. Traditional data lakes can accommodate the data volumes involved on affordable object storage, such as Amazon S3 or Google Cloud Storage. But data lakes operate best in batch mode, making real-time and near-real-time updates difficult or impossible to achieve.
Now there is a new solution that offers the best of both worlds: the flexibility and update speed of a data warehouse and the openness, capacity, and affordability of a data lake. This solution is the universal data lakehouse, based on Hudi technology and offered as a managed service by Onehouse.
Our previous blog post described the benefits of CDC and why it is used more and more frequently in modern data architectures. However, CDC only captures changes and sets them in motion toward a destination. For the data warehouse, updating the destination is a solved problem, though this comes at a price. But there’s no uniform, easy-to-implement, performant, and reliable way to update destination tables on a data lake architecture.
In this blog post we describe the problem, current solutions, and why those solutions are unsatisfactory. We then describe how the Onehouse managed service, based on the Hudi open source project, offers a fast, easy-to-use, serverless solution to this challenge, running on inexpensive object storage in the cloud.
In addition, using Onehouse to create and maintain the target data table doesn’t only solve the problem for the original use case. Once the target data table is created, it’s in open table formats and open data formats on object storage in the cloud, in your own virtual private cloud (VPC) account. From there, you can:
The initial data table can serve as a bronze data table in a medallion architecture. Open compute services can then be used to create a cleansed and deduped silver data table, still resident on the data lake. In addition, data from CDC can be augmented with data from multiple sources. The bronze and silver data tables serve as a source of truth for multiple purposes, simplifying the data architecture, as shown in Figure 1.
Silver data tables can then be processed to gold, using open or proprietary compute services, on the data lake or within one or more data warehouses. This approach reserves expensive proprietary compute services for the use cases where they add the most value.
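To make the bronze-to-silver step concrete, here is a minimal sketch in plain Python. It is not Onehouse or Hudi code; the record fields (`id`, `email`, `updated_at`) and the helper `to_silver` are hypothetical names chosen for illustration. It assumes that "cleansed" means normalizing a field and "deduped" means keeping the most recent record per key.

```python
from typing import Any, Dict, List

# Hypothetical bronze records: raw rows as they landed from CDC.
bronze: List[Dict[str, Any]] = [
    {"id": 1, "email": " Ada@Example.com ", "updated_at": 1},
    {"id": 1, "email": "ada@example.com", "updated_at": 3},
    {"id": 2, "email": "bob@example.com", "updated_at": 2},
    {"id": 2, "email": "bob@example.com", "updated_at": 2},  # exact duplicate
]


def to_silver(records: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Cleanse each record, then keep only the newest record per id."""
    latest: Dict[Any, Dict[str, Any]] = {}
    for rec in records:
        # Cleansing step (assumed): trim whitespace and lowercase the email.
        cleaned = {**rec, "email": str(rec["email"]).strip().lower()}
        current = latest.get(cleaned["id"])
        # Dedup step: a record replaces the stored one only if it is newer.
        if current is None or cleaned["updated_at"] > current["updated_at"]:
            latest[cleaned["id"]] = cleaned
    return sorted(latest.values(), key=lambda r: r["id"])


silver = to_silver(bronze)
for row in silver:
    print(row)
```

In a real pipeline this logic would run in a compute engine over tables on object storage rather than over Python lists, but the shape of the transformation is the same.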
CDC is largely a solved problem when it comes to using change logs from transactional databases to update data warehouse destinations. The problem here is that data warehouses are not an ideal solution for many use cases, for reasons that include:
Ideally, we would use a data lake solution instead. Data lakes tend to be open and to run on inexpensive object stores, making it practical to store very large amounts of data and to keep versions of data tables for re-use and for governance purposes, as is done with the medallion architecture.
However, keeping a data lake updated against the changes delivered by CDC has, until now, been far more difficult than with a data warehouse. To understand why it’s so challenging, we need to break the process down into its component parts.
The term CDC only refers to capturing changes from the source database, but that’s actually only the first part of the problem that data engineers and developers face. To move changes through your data infrastructure efficiently, you need to complete several steps. For log-based CDC, these steps are:
The result of these steps is that the analytics-ready destination data table is a consistent, exact, and reliable representation of the current state of the transactional database that it reflects.
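The final stage, applying captured changes so that the destination mirrors the source, can be sketched in a few lines of Python. This is an illustrative model only: the `ChangeEvent` class, its field names, and the `apply_changes` helper are assumptions for this example, not the format of any particular CDC tool. The target table is modeled as a dictionary keyed by primary key.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Optional


@dataclass(frozen=True)
class ChangeEvent:
    """One hypothetical log-based CDC record (field names are illustrative)."""
    lsn: int                               # position in the source change log
    op: str                                # "insert", "update", or "delete"
    key: Any                               # primary key of the affected row
    row: Optional[Dict[str, Any]] = None   # full row image; None for deletes


def apply_changes(table: Dict[Any, Dict[str, Any]],
                  events: Iterable[ChangeEvent]) -> Dict[Any, Dict[str, Any]]:
    """Replay events in log order so the target mirrors the source state."""
    for event in sorted(events, key=lambda e: e.lsn):
        if event.op in ("insert", "update"):
            table[event.key] = dict(event.row)   # upsert: last write wins
        elif event.op == "delete":
            table.pop(event.key, None)           # tolerate already-missing keys
        else:
            raise ValueError(f"unknown operation: {event.op!r}")
    return table


# Events may arrive out of order; the log position restores the true order.
events = [
    ChangeEvent(lsn=3, op="delete", key=2),
    ChangeEvent(lsn=1, op="insert", key=1, row={"name": "Ada", "balance": 10}),
    ChangeEvent(lsn=2, op="insert", key=2, row={"name": "Bob", "balance": 5}),
    ChangeEvent(lsn=4, op="update", key=1, row={"name": "Ada", "balance": 25}),
]
target = apply_changes({}, events)
print(target)
```

Updates and deletes are trivial on a mutable dictionary. The difficulty on a data lake is that the underlying files are immutable, so the same upsert and delete semantics have to be built on top of append-only storage.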
This problem is challenging for data lakes for several reasons:
This leads to serious operational problems when trying to implement a CDC process on the data lake. Until recently, it’s been impossible to get the best of both worlds: the openness, flexibility, lower cost, and ease of use of the data lake, and the data warehouse’s ability to work with mutable data.
It’s well known that distributed data stores are limited by the CAP theorem. An ideal data store would be consistent, available, and partition tolerant; yet, according to the theorem, no distributed data store can guarantee all three of these attributes at once.
Similarly, a data lake-based solution for CDC needs to be open, flexible, and easy to use, but available solutions have offered one or two of these attributes, not all three.
Figure 2 shows some of the important operational databases we want to use CDC for on the left, and the kind of targets that different solutions support on the right: either data warehouses or data lakes.
Solutions tend to fall into five categories with regard to their openness, flexibility, and ease of use:
Why are the Hudi open source project, and the Onehouse managed service based on Hudi, the only solutions that are able to support inserts, updates, and deletions on a data lake?
Hudi is the original lakehouse open source project and was designed from the beginning to support mutable data. Hudi accomplishes this by supporting a rich set of services that include the needed update and deletion capabilities, with performance that approaches that of a transactional database.
What’s different about Hudi? The core difference is that Hudi uses metadata in a clever and original way to work around the append-only nature of updates to data lakes residing on object storage. Data files, which are immutable, are kept up to date in four steps:
Figure 3 shows the tradeoffs between using the CoW and MoR updating approaches. Adroit use of CoW and MoR allows for an optimal balance between write and read performance, maintaining a high level of overall system performance. Overall performance approaches the performance of a relational database, at far less cost.
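The write/read tradeoff between copy-on-write (CoW) and merge-on-read (MoR) can be illustrated with a toy model. The sketch below is conceptual Python, not Hudi's implementation or API; the class names and the use of frozen tuples to stand in for immutable files on object storage are assumptions made for the example.

```python
from typing import Dict, List, Tuple

Row = Dict[str, int]
BaseFile = Tuple[Tuple[str, int], ...]   # immutable snapshot, like a file on object storage


def freeze(rows: Row) -> BaseFile:
    """Turn a mutable dict into an immutable, sorted tuple of items."""
    return tuple(sorted(rows.items()))


class CopyOnWriteTable:
    """Every write rewrites the base file; reads just open the newest version."""

    def __init__(self, rows: Row) -> None:
        self.versions: List[BaseFile] = [freeze(rows)]

    def upsert(self, changes: Row) -> None:
        merged = dict(self.versions[-1])
        merged.update(changes)
        self.versions.append(freeze(merged))     # write cost: full rewrite

    def read(self) -> Row:
        return dict(self.versions[-1])           # read cost: no merging


class MergeOnReadTable:
    """Writes append small delta logs; reads merge them until compaction."""

    def __init__(self, rows: Row) -> None:
        self.base: BaseFile = freeze(rows)
        self.delta_logs: List[BaseFile] = []

    def upsert(self, changes: Row) -> None:
        self.delta_logs.append(freeze(changes))  # write cost: append only

    def read(self) -> Row:
        merged = dict(self.base)
        for log in self.delta_logs:              # read cost: merge on the fly
            merged.update(dict(log))
        return merged

    def compact(self) -> None:
        self.base = freeze(self.read())          # fold deltas into a new base file
        self.delta_logs = []


cow = CopyOnWriteTable({"a": 1, "b": 2})
mor = MergeOnReadTable({"a": 1, "b": 2})
for table in (cow, mor):
    table.upsert({"b": 20})
    table.upsert({"c": 3})

assert cow.read() == mor.read()   # both strategies expose the same logical table
mor.compact()
print(cow.read(), len(cow.versions), len(mor.delta_logs))
```

Both tables return the same rows; they differ in where the work happens. The CoW table pays on every write by producing a full new file version, while the MoR table defers that cost to read time and to periodic compaction.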
Hudi has specific advantages over other lakehouse projects that enable Hudi users (and the Onehouse managed service) to achieve these feature and performance benefits. The differences are described in our blog post, Apache Hudi vs Delta Lake vs Apache Iceberg - Data Lakehouse Feature Comparison, and explained in our webinar on the same topic.
Creating a data lakehouse with any of the major lakehouse open source projects, including Hudi, is a large undertaking. Onehouse delivers a data lakehouse as a managed service, so you avoid the challenges of implementing a data lakehouse yourself.
Onehouse also makes full use of the new Onetable project, which provides interoperability across Hudi, Iceberg, and Delta Lake. As a result, the Onehouse user gets all the capabilities of Hudi, which is highly open, highly performant, and has a rich services layer, while maintaining full interoperability with all lakehouse table formats.
With Onehouse, the user saves time and effort in two important (and related) ways:
Onehouse offers end-to-end CDC as part of a managed service. The Onehouse offering is serverless; you don’t have to instantiate, provision, and operate servers, nor do you have to scale, deal with errors and faults, or handle security issues. You simply call the services needed to extract data and load the changes into an analytics-ready data table.
The Hudi project and the Onehouse managed service treat CDC as a first-class problem. Onehouse dedicates development and operational resources to supporting end-to-end CDC, including a close partnership with Confluent for managed Kafka and continuous interaction with relevant open source projects. Onehouse customers frequently use the service for end-to-end CDC, which is a primary driver for many customers to begin an engagement with Onehouse.
A Onehouse-type solution to end-to-end CDC is not new; in fact, it’s a widely used architecture that is implemented at hundreds of companies, mostly enterprise-scale organizations with large engineering staffs. These organizations have spent a great deal of engineering time and effort implementing semi-custom solutions based on open source lakehouse software.
Onehouse is gaining traction as a solution that organizations seriously consider when implementing CDC. If you have such a workload on the horizon – or if you want to save time, money, and hassle on existing workloads – contact Onehouse today.