July 18, 2023

The Ultimate Data Lakehouse for Streaming Data Using Onehouse + Confluent

Announcing our Confluent Partnership

We are excited to announce that Onehouse has partnered with Confluent as part of the Connect with Confluent program to provide deep integrations between Confluent's data streaming platform and Onehouse's managed data lakehouse. 🎉

We previously launched our Confluent Cloud Kafka connector, which customers are using to ingest data from Confluent Kafka topics into Onehouse. This connector makes it easy for Confluent users to centralize and transform their data in a data lakehouse, unlocking key analytics capabilities.

Alongside our partnership announcement today, we are unveiling a new feature that deepens our integration with Confluent, completely automating a critical workflow for Confluent users: real-time database replication with Change Data Capture. This blog post will showcase a full demo of the new feature, but first let's review some core concepts:

What is Onehouse?

Onehouse is a fully-managed data lakehouse built on Apache Hudi. Onehouse blends the ease of use of a warehouse with the cost-efficiency and scale of a data lake. Using Onehouse, organizations build their lakehouse in minutes, process their data in seconds, and own their data in open formats instead of being locked into individual vendors.

What is Confluent Cloud?

Confluent Cloud is a fully-managed data streaming platform that helps you move your data in real time. Confluent Cloud makes data movement effortless with solutions like managed Apache Kafka streaming.

What is Change Data Capture?

Change Data Capture (CDC) is the real-time process of detecting, capturing, and forwarding relational database changes to a downstream service like Onehouse. CDC is the state-of-the-art approach for replicating a database into a centralized data lakehouse for analyzing data, building dashboards, and more.
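
For concreteness, here is a minimal sketch of what a single change event can look like, loosely following the envelope convention of Debezium (the CDC connector used later in this post). All field names and values here are illustrative:

```python
# A simplified, illustrative CDC event for an UPDATE, loosely following
# Debezium's envelope convention ("before"/"after" row images plus metadata).
# Table and field values are hypothetical.
change_event = {
    "op": "u",  # c = create (insert), u = update, d = delete
    "ts_ms": 1689692400000,  # when the change was captured
    "source": {"db": "finco", "table": "credit_card_txns", "lsn": 123456789},
    "before": {"txn_id": 42, "amount_cents": 1099, "status": "PENDING"},
    "after":  {"txn_id": 42, "amount_cents": 1099, "status": "SETTLED"},
}
```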

How do Onehouse and Confluent Cloud work with CDC?

Onehouse and Confluent Cloud are both designed for streaming data, forming a powerful combination for ingesting CDC data in real time. As changes happen in your database, you can perform analytics on them live in the data lakehouse.

Database Replication Made Open & Easy

To understand the power of real-time database replication with Onehouse and Confluent, we'll follow the story of a fictitious financial company called FinCo, which stores anonymized/encrypted credit card data in a Postgres database.

In order to identify new opportunities for their business, FinCo wants to analyze and build dashboards using credit card data stored in their Postgres database. However, relational databases like Postgres and MySQL are not a good fit for analytics at scale due to challenges with costs, storage limitations, and analytical workload performance.

Realizing they must expand beyond their Postgres database, FinCo starts exploring analytics-oriented OLAP systems like data warehouses, data lakes, and data lakehouses. They opt for the data lakehouse to get the benefits of the data lake (low cost, low-latency data streaming) combined with the structure and ease of use of the data warehouse. Now they are ready to replicate their Postgres database into their lakehouse to access the full power of analytics for the credit card data.

Normally, FinCo would spend several months setting up their data lakehouse and database replication pipeline, but with Onehouse and Confluent, they can launch a fully-functional analytics streaming system within hours. FinCo simply creates a Confluent account and connects it to Onehouse along with their Postgres database. Onehouse handles the rest, automatically building and maintaining data ingestion pipelines from Postgres into the data lakehouse. Now that their data is in the lakehouse, FinCo's analysts can query and visualize the data in any tool of their choice with seconds-to-minutes data freshness.

As shown in the diagram above, this effortless experience is made possible by Onehouse's deep integration with Confluent. Onehouse automatically creates and manages resources (e.g., Kafka topics, CDC source connectors like Debezium, and Schema Registry) within FinCo's Confluent account so FinCo can have maintenance-free database replication running around the clock.
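
To give a sense of what Onehouse automates away, here is a hand-written sketch of the kind of CDC source connector configuration involved. The keys follow the open-source Debezium Postgres connector; the managed connector Onehouse provisions in Confluent uses equivalent settings, and you never have to write them yourself. All values below are hypothetical:

```python
# Sketch of a Debezium Postgres CDC connector configuration, of the kind
# Onehouse creates and manages automatically. Hostnames, credentials, and
# table names are hypothetical.
debezium_config = {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "finco-db.example.com",
    "database.port": "5432",
    "database.user": "replication_user",
    "database.password": "********",
    "database.dbname": "finco",
    "table.include.list": "public.credit_card_txns",
    "plugin.name": "pgoutput",  # Postgres logical decoding plugin
    "topic.prefix": "finco",    # change topics become finco.public.<table>
}
```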

Unlike opaque ingestion services, Onehouse keeps this infrastructure within FinCo's Confluent account, so FinCo can extend it and seamlessly leverage any of the other streaming services Confluent offers, like Stream Designer or ksqlDB.

CDC Ingestion Workflow

Let's explore the feature hands-on and see how easy it is to set up Postgres database replication with Onehouse and Confluent! We'll show a walkthrough of the flow in Onehouse; you can also view the full solution guide here.

In the Onehouse console, create a new Confluent Cloud CDC Source.

Enter the details for your Postgres database, followed by the details for your Confluent Cloud account. Onehouse will use these credentials to read updates from your Postgres database and configure resources in your Confluent Cloud account for transporting these updates as CDC data.
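
For context, CDC from Postgres relies on logical replication being enabled on the source database. Below is a minimal sketch of those prerequisites, using psycopg2 and hypothetical connection details; your DBA may manage these settings differently:

```python
# Minimal sketch of the Postgres prerequisites for CDC: logical replication
# enabled and the source tables published. Host, database, credentials, and
# table names are hypothetical.
import psycopg2

conn = psycopg2.connect(host="finco-db.example.com", dbname="finco",
                        user="admin", password="********")
conn.autocommit = True
with conn.cursor() as cur:
    # CDC via pgoutput requires wal_level = logical (a server-level setting).
    cur.execute("SHOW wal_level;")
    print("wal_level =", cur.fetchone()[0])  # expect 'logical'

    # A publication tells Postgres which tables to stream changes for.
    cur.execute("CREATE PUBLICATION finco_cdc FOR TABLE public.credit_card_txns;")
conn.close()
```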

Create a Stream Capture in Onehouse using your Confluent Cloud CDC source. Here we choose the Mutable Write Mode so we can replicate inserts, updates, and deletes from the Postgres tables to our Onehouse tables.
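
As a toy illustration of what Mutable Write Mode means, consider applying a stream of CDC operations to a keyed table. Onehouse and Hudi do this at lakehouse scale; this Python dict only demonstrates the end result, and the values are hypothetical:

```python
# Toy illustration of mutable writes: replay CDC ops against a keyed table.
table = {}  # record key -> current row
events = [
    ("c", 1, "PENDING"),   # insert txn 1
    ("u", 1, "SETTLED"),   # update txn 1
    ("c", 2, "PENDING"),   # insert txn 2
    ("d", 2, None),        # delete txn 2
]
for op, key, status in events:
    if op == "d":
        table.pop(key, None)  # deletes remove the record
    else:
        table[key] = {"txn_id": key, "status": status}  # inserts/updates upsert
print(table)  # {1: {'txn_id': 1, 'status': 'SETTLED'}}
```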

Select the source table(s) from Postgres to replicate as Onehouse tables. Here we enable Pipeline Quarantine so we can continue ingestion in case the source sends invalid data. Add a transformation to convert the data from the Postgres CDC format to a record in the Onehouse table (essentially, this decodes messages from Postgres about how the database has changed).
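
Conceptually, this transformation turns a CDC envelope (like the event shown earlier) into a flat table row plus a delete marker. Onehouse ships the transformation built-in; the hypothetical sketch below is only for intuition:

```python
# Minimal sketch of decoding a Debezium-style CDC envelope into a flat row.
# The _is_deleted and _event_ts_ms field names are hypothetical.
def decode_cdc_event(event: dict) -> dict:
    op = event["op"]
    if op == "d":                  # delete: only the "before" image exists
        row = dict(event["before"])
        row["_is_deleted"] = True  # soft-delete marker
    else:                          # create ("c") or update ("u")
        row = dict(event["after"])
        row["_is_deleted"] = False
    row["_event_ts_ms"] = event["ts_ms"]  # keep ordering info for upserts
    return row
```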

While creating the Stream Capture, you can also add Data Quality Validations and set Key Configurations to optimize the table. The "Record Key Field" in Onehouse is similar to a Primary Key in Postgres, so you can use the same field as you have in Postgres.
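
For intuition, if you wrote the equivalent table yourself with Apache Hudi on Spark, the Record Key and mutable-write behavior would map to write options like the following. Onehouse configures the equivalents for you; the field names here are hypothetical:

```python
# Sketch of the Apache Hudi write options that correspond to the Record Key
# and mutable-write settings. Field and table names are hypothetical.
hudi_options = {
    "hoodie.table.name": "credit_card_txns",
    "hoodie.datasource.write.recordkey.field": "txn_id",          # ~ Postgres primary key
    "hoodie.datasource.write.precombine.field": "_event_ts_ms",   # latest change wins
    "hoodie.datasource.write.operation": "upsert",                # applies inserts/updates/deletes
}
# df.write.format("hudi").options(**hudi_options).mode("append").save(table_path)
```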

After creating the Stream Capture, you can monitor the ingestion pipeline within Onehouse and receive automated alerts for events of interest.

Onehouse creates the tables in your S3 or Google Cloud Storage account, where you can query them as Apache Hudi, Apache Iceberg, or Delta Lake tables using the catalogs and query engines of your choice. Below, we show an example querying a Onehouse table in Athena.
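
For instance, here is a query against the replicated table via the Athena API with boto3; the database name, columns, and S3 output location are hypothetical, and the table must already be registered in your Glue/Athena catalog:

```python
# Sketch of running an Athena query over the replicated Onehouse table.
import boto3

athena = boto3.client("athena", region_name="us-east-1")
resp = athena.start_query_execution(
    QueryString="""
        SELECT status, COUNT(*) AS txns, SUM(amount_cents) / 100.0 AS total_usd
        FROM credit_card_txns
        GROUP BY status
    """,
    QueryExecutionContext={"Database": "finco_lakehouse"},
    ResultConfiguration={"OutputLocation": "s3://finco-athena-results/"},
)
print("Query started:", resp["QueryExecutionId"])
```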

Behind the scenes, Onehouse creates and manages resources in your Confluent account to facilitate the database replication. You can open your Confluent account to see these auto-created resources, such as the CDC source connector (Debezium in this case).

FinCo's credit card data is now replicated from Postgres to their data lakehouse! If FinCo later decides they want to transform this data to curate business-level tables (e.g., a table for credit card transactions per household), they can easily do so within Onehouse, keeping their CDC ingestion pipeline within a single fully-managed pane of glass.
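
For example, a curation job of that shape might look like the Spark sketch below; the table path and column names are hypothetical, and it assumes a Spark session with the Hudi bundle available:

```python
# Sketch of a downstream business-level transformation: credit card
# transactions rolled up per household. Paths and columns are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("finco-curation").getOrCreate()
txns = spark.read.format("hudi").load("s3://finco-lakehouse/credit_card_txns")

txns.createOrReplaceTempView("txns")
per_household = spark.sql("""
    SELECT household_id,
           COUNT(*)                  AS txn_count,
           SUM(amount_cents) / 100.0 AS total_usd
    FROM txns
    GROUP BY household_id
""")
per_household.show()
```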

Wrapping Up

That's all! You've now learned how to replicate your Postgres database to a data lakehouse in just a few steps, enabling you to leverage your most valuable data for real-time analytics use cases. We are excited to help Confluent and Onehouse users adopt this state-of-the-art architecture for integrating databases with the data lakehouse.

We are excited to continue expanding our partnership with Confluent to build deeper automation for all your data streaming use cases.

Want to try this feature or have ideas for more Confluent + Onehouse use cases? Reach out to us at gtm@onehouse.ai.

Special thanks to my Onehouse colleagues Lokesh Lingarajan and Yongkyun (Daniel) Lee, who built this incredible feature and helped create the demo, and to Paul Earsy at Confluent, who has been instrumental to our partnership!
