May 17, 2023

Instantly unlock your CDC PostgreSQL data on a data lakehouse using Onehouse

You can now use the new Onehouse PostgreSQL Change Data Capture (CDC) feature to stream data continuously into a data lakehouse. In this post, I will dive deep into the CDC architecture and pipelines using the Onehouse product, which is built on open source systems such as Apache Kafka, Debezium, and Apache Hudi.

To enable effective data-driven decision making, organizations recognize the need to leverage the wealth of data hosted within various OLTP databases. A solution that has gained popularity is replicating this data in near real-time to a data lakehouse, which serves as a powerful foundation for many critical use cases across industries.

Building a reliable and efficient data replication pipeline from RDBMS to a data lakehouse requires a comprehensive understanding of both the OLTP systems and the analytical requirements. It demands expertise in data integration, database technologies, data governance, and scalable architecture design. By successfully implementing such a solution, organizations can unlock the power of their high-value data, facilitate data-driven decision making, and unleash a wide array of critical use cases to drive business success.

Onehouse architecture overview 

By automating data management tasks and providing a fully managed, cloud-native platform, Onehouse allows customers to build data lakes quickly and easily, which can result in significant time and cost savings compared to a DIY approach. With Onehouse, customers can focus on deriving insights and value from their data. This can be particularly valuable for organizations that don't have the resources or expertise to design, build, and manage their own data infrastructure.

Onehouse also provides a range of enterprise security features, such as encryption, access controls, and monitoring, to ensure that data is protected at all times. This is essential for organizations that need to comply with industry regulations or data privacy laws. By using Onehouse, customers can have peace of mind that their data is secure and compliant.

Onehouse's modern architecture gives customers full control over their own data and over the compute resources used by the Onehouse data plane: the data plane runs within the customer's cloud account and VPC, so the customer's data remains within their account and never leaves it. This is in contrast to some other SaaS cloud architectures, where data is processed and stored outside of the customer's account, often raising questions and concerns around data privacy and security. With this architecture, Onehouse provides the benefits of a cloud-based solution while allowing customers to maintain the security and control of their data.

Pitfalls with using PostgreSQL as a data lake

Relational database management systems (RDBMS) like PostgreSQL play a pivotal role in storing a significant portion of the world's valuable data. These databases primarily cater to online transaction processing (OLTP) systems, which are crucial for day-to-day business operations. However, running complex analytics directly on these OLTP databases such as PostgreSQL can be challenging as it may disrupt the high-volume transactional activities that are essential for smooth business functioning.

To address this challenge, it is critical to develop a change data capture (CDC) and replication framework, which orchestrates the movement of data from the OLTP databases to the data lake. This framework should be designed to handle complex data structures, maintain data consistency, minimize latency, and achieve near real-time synchronization. It must also account for the inherent challenges of replicating data from OLTP systems, such as concurrent updates, schema changes, and transactional dependencies.

Additionally, the architecture supporting the data lakehouse must be scalable, fault-tolerant, and optimized for analytical workloads. It should leverage distributed computing technologies and parallel processing capabilities to efficiently handle large datasets and complex analytics tasks. This enables organizations to derive valuable insights from the replicated data without impacting the operational efficiency of the OLTP databases.

CDC process, architecture and components

Setting up change data capture (CDC) streaming and replication to a data lake using PostgreSQL as a source involves multiple components. Let's break down the common components in a typical architecture:

  1. PostgreSQL is the source database where the data resides.
  2. Debezium is a popular open-source CDC tool. It connects to the PostgreSQL database and captures the changes happening in real-time, including inserts, updates, and deletes. Debezium transforms these changes into a stream of events.
  3. Apache Kafka is a distributed streaming platform. Debezium publishes the captured database changes as events into Kafka topics. Kafka ensures reliable and scalable event streaming, acting as a buffer between the data source and downstream processing.
  4. Apache Spark is a powerful distributed data processing framework. In this use case, Spark reads the change events from Kafka and performs further processing and transformations on the data. Spark can handle large-scale data processing and is often used for analytics, ETL (Extract, Transform, Load), and machine learning tasks.
  5. Apache Hudi is an open-source data storage technology. It specializes in efficiently managing incremental updates and providing fast data ingestion and query capabilities on large datasets. In this context, Hudi upserts the changes into tables stored in a cloud storage system like Amazon S3 (a code sketch of this Kafka-to-Hudi step follows the list).
  6. Apache Airflow is an open-source platform for orchestrating and scheduling workflows. It enables you to define and manage the dependencies between different tasks or components in the data pipeline. In this use case, Airflow orchestrates the execution of Spark and Hudi, ensuring the proper sequencing and coordination of these steps.
  7. AWS Glue Data Catalog is a fully managed metadata catalog service provided by Amazon Web Services (AWS). It enables you to store, organize, and manage metadata about your data assets. In this use case, Glue Data Catalog is used to update the state of the tables and make them discoverable by query engines, allowing efficient querying and analysis of the data stored in the data lake.
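
To make these moving parts concrete, here is a minimal PySpark Structured Streaming sketch of the Kafka-to-Spark-to-Hudi leg referenced in the list above. It is illustrative only and is not Onehouse's implementation: the Kafka topic, broker endpoint, S3 paths, column names, and the hypothetical customers table are all assumptions.

# Minimal sketch: read Debezium change events from Kafka and upsert them into a Hudi table.
# Requires the Spark Kafka connector and the Hudi Spark bundle on the classpath.
from pyspark.sql import SparkSession
from pyspark.sql.functions import coalesce, col, from_json
from pyspark.sql.types import LongType, StringType, StructField, StructType

spark = SparkSession.builder.appName("postgres-cdc-to-hudi").getOrCreate()

# Row image of a hypothetical customers table.
row_schema = StructType([
    StructField("id", LongType()),
    StructField("name", StringType()),
    StructField("email", StringType()),
])

# Debezium wraps each change in an envelope: before/after row images, an op code, and a timestamp.
envelope_schema = StructType([
    StructField("before", row_schema),
    StructField("after", row_schema),
    StructField("op", StringType()),   # c = create, u = update, d = delete, r = snapshot read
    StructField("ts_ms", LongType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")  # assumption: MSK/Kafka endpoint
       .option("subscribe", "pg.public.customers")        # assumption: Debezium topic name
       .option("startingOffsets", "earliest")
       .load())

changes = (raw
           .select(from_json(col("value").cast("string"), envelope_schema).alias("evt"))
           # Deletes carry only the "before" image, so fall back to it when "after" is null.
           .withColumn("row", coalesce(col("evt.after"), col("evt.before")))
           .select("row.*", "evt.op", "evt.ts_ms")
           # Hudi treats rows with _hoodie_is_deleted = true as deletes during upsert.
           .withColumn("_hoodie_is_deleted", col("op") == "d"))

hudi_options = {
    "hoodie.table.name": "customers",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "ts_ms",
    "hoodie.datasource.write.operation": "upsert",
    # Non-partitioned table for simplicity.
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
}

(changes.writeStream
 .format("hudi")
 .options(**hudi_options)
 .outputMode("append")
 .option("checkpointLocation", "s3://my-bucket/checkpoints/customers")  # assumption
 .start("s3://my-bucket/lake/customers"))                               # assumption

spark.streams.awaitAnyTermination()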

Implementing a data pipeline using the components mentioned above is a complex task that requires a strong understanding of data engineering principles and experience in integrating and configuring the different tools. Data engineers need to have a deep understanding of each tool's capabilities, limitations, and integration points to ensure smooth interoperability and efficient processing. They also need to be familiar with best practices in data governance, security, and performance optimization to ensure the data pipeline is reliable, scalable, and cost-effective. 

In addition, they need to be proficient in scripting and coding to automate tasks and reduce manual errors. Building a data pipeline from these components requires a combination of skills, knowledge, and experience in data engineering and related fields that many customers simply do not have in-house.

As an example, let's take a look at a common use case: CDC ingestion from a PostgreSQL database with 100+ tables into the data lake.

Based on our experience, it typically takes three data engineers around three to six months to set up a pipeline of this nature. However, if you require the same data pipeline for four distinct business units, you would need to repeat the same laborious process four times, consuming even more precious engineering time. By leveraging Onehouse's Managed Lakehouse solution for the same use case, a single data engineer can productionize all four deployments in a couple of weeks, resulting in a 10x or greater cost savings. Beyond the cost savings, this approach also offers intangible benefits, including simplified maintenance and quicker business insights.

Onehouse built-in CDC feature

Onehouse now has an exciting new feature that can perform CDC ingestion from a PostgreSQL database into an analytics-ready data lakehouse. Several customers, such as Apna, have been running this pipeline in production. Under the hood, Onehouse automates the data infrastructure in your account to leverage Debezium, Kafka, Spark, and Hudi without exposing any of this complexity or maintenance to you. If you are currently designing your CDC ingestion pipelines to a data lake, or if you are tired of the maintenance and on-call monitoring, Onehouse can make this a seamless experience for you.

In the following sections, I will walk through an example that uses the Onehouse product to ingest data continuously from an Amazon RDS for PostgreSQL database into an Amazon S3 data lake.

Enable CDC on the source PostgreSQL database

To configure change data capture (CDC) on the source PostgreSQL database, you need to turn on logical replication using write-ahead logging (WAL). For details, see How do I replicate tables in RDS for PostgreSQL using logical replication and Setting up logical replication for your Aurora PostgreSQL DB cluster.

To verify the status of logical replication, log in to the source PostgreSQL database, and run the following query:

cdc_db=> SELECT name,setting FROM pg_settings WHERE name IN ('wal_level','rds.logical_replication');
          name           | setting
-------------------------+---------
 rds.logical_replication | on
 wal_level               | logical
(2 rows)

Create a schema registry in AWS Glue

In this example, we will leverage the AWS Glue Schema Registry to record and validate real-time streaming data schemas, similar to the functionality provided by Confluent Schema Registry (a scripted alternative is sketched after the steps below).

  1. In the AWS Management Console, search for and click on Glue.
  2. On the left tab, click Schema Registries.
  3. Click on Add registry.
  4. Enter a name, an optional description, and any tags, then click on Add registry.
  5. Note the name of the registry, because it is required when creating the PostgreSQL CDC data source in the Onehouse Management Console.
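
If you prefer to automate this step, the same registry can be created with the AWS SDK for Python. The sketch below is illustrative only; the registry name, description, region, and tags are placeholders.

# Minimal sketch: create the Glue schema registry programmatically instead of via the console.
import boto3

glue = boto3.client("glue", region_name="us-east-1")  # placeholder region

response = glue.create_registry(
    RegistryName="onehouse-cdc-registry",  # placeholder name; enter it later in the Onehouse console
    Description="Schema registry for PostgreSQL CDC topics",
    Tags={"team": "data-platform"},
)
print(response["RegistryArn"])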

Create a new CDC source in Onehouse

To create a new PostgreSQL CDC source, log in to the Onehouse Management Console; go to Connections -> Sources and click “Add Data Source”.  Choose “Postgres CDC”. 

Onehouse will perform some network connectivity validation tests before a new source can be added. 

Set up a CDC stream capture in Onehouse

In the Onehouse Management Console, go to Services -> Capture on the left side and choose “Add a stream”. Give the new stream a name and choose the PostgreSQL CDC source created previously as the data source. Because this is a CDC use case, select the Write Mode “Mutable (Updates/deletes)”.

For the test source table, specify the target table name, then click “Configure” and fill in the table details.

In the Transformations section, choose “Convert data from CDC format” and then click “Add Transform”; for CDC Format, pick “Postgres (Debezium)”.
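
Conceptually, this transform unwraps the standard Debezium envelope (before/after row images, an op code, and a timestamp) into a flat, analytics-ready row. The toy sketch below only illustrates the idea; the table columns and the _is_deleted/_event_ts_ms field names are hypothetical and not Onehouse's actual output schema.

# Toy illustration of unwrapping a Debezium change event into a flat row.
debezium_event = {
    "before": None,
    "after": {"id": 42, "name": "Ada", "email": "ada@example.com"},
    "op": "c",            # c = create, u = update, d = delete, r = snapshot read
    "ts_ms": 1684300000000,
}

def flatten(event: dict) -> dict:
    """Return the row image plus delete and timestamp markers, as a CDC-format transform would."""
    row = event["after"] if event["op"] != "d" else event["before"]
    return {**row, "_is_deleted": event["op"] == "d", "_event_ts_ms": event["ts_ms"]}

print(flatten(debezium_event))
# {'id': 42, 'name': 'Ada', 'email': 'ada@example.com', '_is_deleted': False, '_event_ts_ms': 1684300000000}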

For this simple example, you may keep the other settings at their default values. In the final step, provide the target data lake, database, and data catalog information.

That is it! You can now click “Start Capturing” and go back to Services -> Capture to confirm the new stream capture is starting up. The initial setup may take approximately 30-40 minutes because Onehouse needs to provision an Amazon MSK cluster and configure Debezium in the background. After that, all incremental CDC data is processed, transformed into the Hudi format, and loaded into the data lake without further delay.
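
While the capture is starting up, you can optionally confirm on the source database that an active logical replication slot has been created for the connector. A minimal check using psycopg2 follows; the connection details are placeholders.

# Minimal sketch: verify that a logical replication slot exists and is active on the source database.
import psycopg2

conn = psycopg2.connect(
    host="my-rds-endpoint.rds.amazonaws.com",  # placeholder
    dbname="cdc_db",
    user="postgres",
    password="********",
)

with conn, conn.cursor() as cur:
    cur.execute("SELECT slot_name, plugin, active FROM pg_replication_slots;")
    for slot_name, plugin, active in cur.fetchall():
        print(slot_name, plugin, active)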

Built-in table optimizations and interoperability

Onehouse’s role does not stop there. Onehouse offers comprehensive data management and built-in support for Apache Hudi’s advanced table optimization services such as clustering, compaction, and cleaning. These services are critical for optimizing the performance and cost-effectiveness of Hudi tables, especially in OLTP CDC and replication use cases, where table data can mutate very rapidly.

Hudi’s industry-leading asynchronous compaction feature is a game changer in cloud data analytics. Compaction merges small log files (Avro format) into larger base files (Parquet format) to reduce the number of files and improve query performance, without ever blocking writers.
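
For readers who run Hudi pipelines themselves, these table services map to a handful of writer configurations. The sketch below shows a representative, non-exhaustive set with illustrative values; Onehouse manages and tunes these services for you automatically.

# Representative Hudi table-service options a self-managed pipeline might tune (values are illustrative).
table_service_options = {
    # Merge-on-read compaction: asynchronously merge Avro log files into Parquet base files.
    "hoodie.datasource.compaction.async.enable": "true",
    "hoodie.compact.inline.max.delta.commits": "5",
    # Clustering: rewrite small files into larger, better-organized files in the background.
    "hoodie.clustering.async.enabled": "true",
    # Cleaning: bound storage growth by retaining only a limited number of older commits.
    "hoodie.cleaner.commits.retained": "10",
}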

With the data residing in a data lake that runs on Apache Hudi, customers have the flexibility to choose their preferred query engines, such as Amazon Athena, Presto, Trino, Databricks, or Snowflake, based on their specific requirements and cost/performance tradeoffs. By leveraging the Onetable feature, interoperable metadata can be easily generated for Delta Lake and Apache Iceberg without touching the underlying data.
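
As a simple illustration, the Hudi table can be read back directly with Spark; the S3 path below is the same placeholder used in the earlier write sketch, while engines such as Athena or Trino would instead query the table through its Glue Data Catalog entry.

# Minimal sketch: read the Hudi table back with Spark for ad-hoc analysis.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("query-hudi").getOrCreate()

df = spark.read.format("hudi").load("s3://my-bucket/lake/customers")  # placeholder path
df.createOrReplaceTempView("customers")
spark.sql("SELECT COUNT(*) AS total_customers FROM customers").show()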

Conclusion

In this blog post, we have demonstrated how simple it is to use the Onehouse PostgreSQL Change Data Capture (CDC) feature to set up an end-to-end data pipeline. By leveraging this feature, organizations can replicate data from their PostgreSQL databases to a data lakehouse with near real-time latency in just a few clicks! This enables them to keep up with the fast-changing demands of the modern business world, where data-driven decision-making is essential for success.

If you want to learn more about Onehouse and would like to give it a try, please visit the Onehouse listing on the AWS Marketplace or contact gtm@onehouse.ai.

I would like to thank my fellow Onehouse team members Vinish Reddy Pannala and Daniel Lee for their contributions to the Onehouse PostgreSQL CDC feature.
