February 13, 2025

Using Apache Hudi Data with Apache Iceberg and Delta Lake

Modern data lakes demand flexibility across table formats, each solving distinct data challenges. The reality of building scalable data platforms has taught us that different use cases often require different approaches. Let's dive into how organizations can effectively leverage Apache Hudi alongside Iceberg and Delta Lake to build a truly format-agnostic stack that delivers the best of each format's capabilities.

Why Multiple Formats? 

Real talk: no single table format solves every data challenge. Some teams need lightning-fast updates for real-time analytics or data applications, others require flexible schema evolution for evolving data models, and many need both. Engineering teams naturally gravitate toward whatever best fits their specific problem, leading to an organic accumulation of different formats across an organization. Instead of fighting this reality, smart teams are embracing it, turning what could be a source of fragmentation into a strategic advantage.

The Format Landscape 

The story of open table formats on the data lakehouse is one of solving real-world problems at scale and democratizing data by opening it up to multiple engines and ecosystems. Apache Hudi emerged from Uber's trenches, tackling massive-scale ingestion and ETL with record-level updates. Its architecture supports atomic updates and incremental processing while maintaining exceptional query performance. When you need efficient, incremental updates at high scale and/or low latency, Hudi delivers unparalleled performance. Furthermore, Hudi offers open source table maintenance services to help you accelerate queries and clean up expired files.
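To make the incremental-processing point concrete, here is a minimal PySpark sketch of a Hudi incremental read that pulls only records committed after a given instant. The table path and begin instant are illustrative placeholders, and the snippet assumes a Spark session configured with the Hudi bundle.

```python
# Minimal sketch of a Hudi incremental read (illustrative path and instant time).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-incremental-read").getOrCreate()

incremental_df = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20250213000000")  # placeholder commit instant
    .load("s3://my-bucket/lake/events")  # placeholder table path
)
incremental_df.show()
```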

Delta Lake, Databricks' brainchild, was built to take advantage of Databricks' advanced compute capabilities through a tight integration with Apache Spark. This makes it particularly effective when you want to leverage the latest functionality on the Databricks platform while storing data in a simple, performant table format.

Apache Iceberg, invented at Netflix, provides a straightforward spec that makes it easy for beginners, along with hidden partitioning that adapts to evolving workloads. Those two qualities make it useful for batch workloads that need to maintain historical data access while adapting to changing requirements.
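As a quick illustration of hidden partitioning, the sketch below creates an Iceberg table whose partitioning is derived from a timestamp transform, so queries filter on the column itself rather than on a separate partition field. The catalog, namespace, and column names are assumptions, and the snippet presumes an Iceberg catalog is already configured in the Spark session.

```python
# Sketch of Iceberg hidden partitioning via a partition transform (names are placeholders).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-hidden-partitioning").getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS my_catalog.analytics.events (
        event_id STRING,
        event_ts TIMESTAMP,
        payload  STRING
    )
    USING iceberg
    PARTITIONED BY (days(event_ts))
""")

# Readers filter on event_ts directly; Iceberg prunes partitions behind the scenes.
spark.sql(
    "SELECT count(*) FROM my_catalog.analytics.events WHERE event_ts >= '2025-02-01'"
).show()
```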

As the three projects grew up alongside their respective OSS communities, the looming question has been which one is the right choice for different needs. As the open data lakehouse ecosystem evolved, different warehouse/query engines have largely distracted the market with table format choice conversations, when what really counts is open vs. closed.

Breaking Down Barriers

Enter Apache XTable (incubating), which tackles this problem head-on by making table formats interoperable at the metadata level. It preserves ACID guarantees and time travel capabilities while eliminating any lock-in built by compute engines or data catalogs on top of specific formats. This innovation enables a powerful pattern: choosing your table format based on write workload characteristics, while maintaining flexibility for read access across different compute engines.

Consider a scenario where your data scientists are running complex machine learning training jobs using Databricks, which excels with Delta Lake format, while your analytics team needs to query the same datasets through Snowflake, which works optimally only with Apache Iceberg. Instead of maintaining duplicate datasets or complex ETL pipelines, XTable allows you to write data once in your preferred format based on your primary workload requirements, then seamlessly access it in other formats as needed.

For instance, you might choose to write your real-time streaming data in Hudi format to leverage its efficient upsert capabilities, while still making this data immediately queryable by your Snowflake users in Iceberg format and your Databricks users in Delta format. 
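A minimal PySpark sketch of that write-once pattern might look like the following. The paths, field names, and table name are placeholders, and the snippet assumes the Hudi and Delta Spark packages are configured and that an XTable sync has generated the target-format metadata alongside the shared Parquet files.

```python
# Illustrative sketch: write once as Hudi, then read the XTable-synced Delta metadata.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("write-once-read-many").getOrCreate()

table_path = "s3://my-bucket/lake/events"  # placeholder location

# 1. Upsert incoming records in Hudi format, the write-optimized choice for streaming updates.
(
    spark.read.json("s3://my-bucket/raw/events/")  # placeholder source
    .write.format("hudi")
    .option("hoodie.table.name", "events")
    .option("hoodie.datasource.write.recordkey.field", "event_id")
    .option("hoodie.datasource.write.precombine.field", "event_ts")
    .option("hoodie.datasource.write.operation", "upsert")
    .mode("append")
    .save(table_path)
)

# 2. After XTable generates Delta metadata over the same Parquet files,
#    Delta-native engines can read the table without copying any data.
delta_df = spark.read.format("delta").load(table_path)
delta_df.show()
```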

This flexibility delivers substantial compute cost savings while reducing data redundancy and storage costs. It allows teams to use their preferred tools without compromise, breaking down the silos that traditionally exist between different data platforms and formats.

The Onehouse Approach 

Rather than forcing choices between formats, Onehouse flips the script entirely. Onehouse uses Hudi as the primary writing format but makes data simultaneously available in Iceberg and Delta formats through an elegant approach. This architecture leverages Hudi's highly-efficient write performance across complex workloads, enabling Onehouse to deliver best-in-class cost-performance and data freshness for customers. All three open table formats (Hudi, Iceberg, and Delta Lake) store their underlying data in Parquet format. Onehouse leverages this common foundation by maintaining a single copy of the Parquet files, while only generating the lightweight metadata layer for each table format. This approach minimizes storage overhead while enabling seamless multi-format access.

The magic lies in the implementation: table optimizations are applied at the Parquet level, automatically benefiting queries across all formats. When data changes, whether through batch updates or streaming ingestion, changes propagate automatically through an intelligent change data capture mechanism. Teams can tune synchronization frequency based on their needs, from near real-time updates to daily syncs.
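Onehouse manages this propagation as part of its service, but the idea can be approximated in the open source world by re-running an Apache XTable metadata sync on a schedule. The sketch below is only illustrative; the jar and config file names are placeholders, though XTable's bundled sync utility does accept a --datasetConfig YAML describing the source table and target formats.

```python
# Hypothetical scheduling loop: periodically re-run an XTable metadata sync.
import subprocess
import time

SYNC_INTERVAL_SECONDS = 300  # tune from near real-time to daily, as described above

while True:
    subprocess.run(
        [
            "java", "-jar",
            "xtable-utilities-bundled.jar",            # placeholder jar name
            "--datasetConfig", "xtable_config.yaml",   # placeholder config listing source/target formats
        ],
        check=True,
    )
    time.sleep(SYNC_INTERVAL_SECONDS)
```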

Performance concerns? The overhead is minimal thanks to shared Parquet files and optimized metadata handling. The system efficiently handles translation across formats, and those shared optimizations keep paying dividends across every access pattern. For organizations wrestling with existing data in various formats, this approach offers a pragmatic path forward, enabling gradual adoption without disrupting existing workflows.

The catalog synchronization capability further sets this approach apart. Traditional setups often confine different table formats to their own catalog silos, paired with engines that plan and execute queries fastest against their own favorite catalogs (e.g., Databricks and Unity Catalog). Onehouse breaks down these barriers by enabling a single copy of data to sync across multiple catalogs simultaneously. This flexibility lets organizations leverage the best query engines and tools for each use case with ease, while maintaining a single source of truth.

Conclusion 

The future of data lakes isn't about choosing between formats – it's about leveraging their complementary strengths intelligently. By embracing format diversity while maintaining operational simplicity, organizations can build more flexible, powerful data platforms that serve diverse use cases without compromising on performance or maintainability. The key lies not in forcing standardization, but in creating systems that turn format diversity from a challenge into a strategic advantage.

Ready to explore how your organization can leverage the power of multiple table formats without the operational overhead? Visit onehouse.ai to learn more about building a truly unified data lake that harnesses the best of Hudi, Iceberg, and Delta Lake. Our team of data infrastructure experts is ready to help you navigate the journey to a more flexible, powerful data platform.

Authors
Bhavani Sudha Saktheeswaran
Head of Open Source

Sudha is the Head of Open Source at Onehouse and a PMC member of the Apache Hudi project. She brings vast experience in real-time and distributed data systems from her work on the data infrastructure teams at Moveworks, Uber, and LinkedIn. She was a key contributor to the early Presto integrations of Hudi. She is passionate about engaging with and driving the Hudi community.
