July 10, 2023

The Road to an Open and Interoperable Lakehouse

Written by:

Kyle Weller

Snowflake Summit

At Snowflake Summit there was an impressive lineup of features, including Document AI, Snowpark Container Services, Dynamic Tables, and updates to Iceberg support. Several of our customers want to query data in Onehouse from Snowflake’s SQL warehouse. Because Snowflake’s private preview of Iceberg tables is currently the available method for lakehouse integration, the Iceberg updates at the summit were of particular interest to us.

As we documented in the past, performance was notably poor for external tables, and some of our users felt they had to revert to Snowflake native tables. It was very exciting to see in the keynote that Snowflake has now improved performance with unmanaged Iceberg tables by 2x and now offers near-parity of performance between managed Iceberg tables and Snowflake native tables. While it is puzzling that the feature has remained in private preview for so long, the announcement is still a promising step forward for the community. More details are covered in this blog from Snowflake.

Databricks Data+AI Summit

At the Data + AI Summit, Databricks unveiled some captivating new advances for the Lakehouse, including LakehouseIQ, LakehouseAI, and notably Delta Lake 3.0 with the announcement of Delta UniForm. The Delta UniForm launch reaffirms an urgent need in the community to build bridges between the Lakehouse table formats of Delta Lake, Apache Hudi, and Apache Iceberg. We were hoping that they would choose to collaborate on projects already in motion to further unify the communities, as opposed to embedding this project inside the Delta Lake code repository. Nonetheless, it's a welcome move from an industry-leading data vendor to support more open data formats than their own.

Delta UniForm: What it Is (and What it Isn’t)

Let’s understand what UniForm is and what it isn’t. In the keynote, Databricks described how Delta, Hudi, and Iceberg are all Parquet files under the hood, each with its own format for metadata. Databricks CEO Ali Ghodsi announced that, with UniForm, “whenever you write data to Delta, it just will generate metadata in all three formats: Hudi, Iceberg, and Delta”. From reading the announcement blog, it appears UniForm is primarily aimed at existing Delta Lake and Databricks users, rather than being an attempt to solve format fractures across the three projects. The goals described include making it easier to choose Delta Lake as the standard format within Databricks, even as the other formats are also being rapidly adopted.

With Delta UniForm, you consolidate your data lakehouse on the Delta Lake project and get one-way read replicas for Hudi and Iceberg. If you are already a Databricks user with existing Delta Lake deployments, then UniForm is excellent news for you. But if you are building out a new data lakehouse, you may want to take advantage of the technical differentiators that have been documented at length for Hudi, including Merge-On-Read writers, advanced indexing subsystems, and asynchronous table services, or of differentiators from Iceberg such as partition evolution. In that case you would want broader interoperability with Hudi and Iceberg. And if you want to keep your options on query engines or table formats flexible, you may want to consider Onetable more seriously.

Can we work together?

Could there be another approach - one which doesn’t involve one-way conversions within a project governed by a single vendor? In February this year we at Onehouse kickstarted the conversation by announcing Onetable and inviting the community to collaborate on a new project to bring three-way, omni-directional interoperability between Delta Lake, Apache Hudi, and Apache Iceberg. Since then, we have been hard at work on Onetable, hardening it with production use cases and building a network of partners and co-owners before it is open sourced. While Onehouse is born from roots in the Hudi project, Onetable will be a separate GitHub project, co-owned by other companies who also have interests in Delta and Iceberg.

Databricks went fast by going alone for its first release; now it’s time to go far by working together and extending that work into the full solution that the community needs. We encourage Databricks (and any others) to join as co-owners in Onetable, where we can collaborate in a vendor-neutral environment and bridge the gap together. Most importantly, this helps us govern the project in vendor-neutral ways to ensure this compatibility is maintained for the community, as the three projects are sure to evolve and innovate further in the coming years.

Omni-Directional Interoperability

Onetable takes a source metadata format and extracts the table metadata, which can then be synced into one or more target formats. Why does omni-directional interoperability matter? Today all three data lakehouse projects have vibrant and fast-growing communities, each project has its own technical differentiators, and vendor support is scattered across the ecosystem. Leading organizations in the industry routinely share why their independent evaluations led them to choose Apache Hudi (e.g., Walmart and Zoom), Delta Lake (e.g., Coinbase), or Apache Iceberg (e.g., Dell). One-way conversions simply function as retention attempts to grow a single lakehouse project at the expense of others - and of the community’s interest in interoperability.
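To make the extract-then-sync idea concrete, here is a minimal Python sketch. It is not Onetable's code or API: the function names (`extract`, `emit`, `sync`), the `TableState` class, and the metadata key names are all hypothetical and chosen only for illustration, and real Hudi, Delta Lake, and Iceberg metadata is far richer. The sketch assumes only what is described above: the data files are shared Parquet files, and each format differs in how it records metadata about them.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical, heavily simplified key names for each format's metadata.
# These are illustrative placeholders, not the real on-disk layouts.
FORMAT_KEYS = {
    "hudi": ("hudi_schema", "hudi_base_files"),
    "delta": ("delta_schema", "delta_add_files"),
    "iceberg": ("iceberg_schema", "iceberg_data_files"),
}


@dataclass(frozen=True)
class TableState:
    """Format-neutral view of a table: column names plus Parquet data files."""
    columns: tuple[str, ...]
    data_files: tuple[str, ...]


def extract(source_format: str, metadata: dict) -> TableState:
    """Read format-specific metadata into the neutral representation."""
    schema_key, files_key = FORMAT_KEYS[source_format]
    return TableState(tuple(metadata[schema_key]), tuple(metadata[files_key]))


def emit(target_format: str, state: TableState) -> dict:
    """Write the neutral representation out as format-specific metadata."""
    schema_key, files_key = FORMAT_KEYS[target_format]
    return {schema_key: list(state.columns), files_key: list(state.data_files)}


def sync(source_format: str, metadata: dict, target_formats: list[str]) -> dict[str, dict]:
    """Extract once from the source, then emit to every other requested format."""
    state = extract(source_format, metadata)
    return {t: emit(t, state) for t in target_formats if t != source_format}


if __name__ == "__main__":
    # Any of the three formats can be the source; here it is Iceberg.
    iceberg_metadata = {
        "iceberg_schema": ["id", "ts"],
        "iceberg_data_files": ["part-0.parquet"],
    }
    for name, meta in sync("iceberg", iceberg_metadata, ["hudi", "delta"]).items():
        print(name, meta)
```

Because every format goes through the same neutral representation, any of the three can act as the source and any subset of the others as targets, and the Parquet data files themselves are never rewritten. A one-way design, by contrast, fixes the source format up front.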

To make this clear, the diagram below describes the additional use case support available through the new Lakehouse interoperability spec defined in Onetable, beyond what is present in Delta UniForm.

In Conclusion

As the Data Lakehouse vs Data Warehouse race heats up between Databricks and Snowflake, wars are being created around table formats, fracturing the ecosystem and slowing innovation in the industry with “paralysis by analysis”. This ultimately hurts the potential of the data lakehouse as an alternative, and at the same time a complement, to the data warehouse. Wouldn't it be great if we could avoid some of the unnecessary burden of technical evaluations between these projects, while also being able to mix and match them across different workloads?

Releasing Delta UniForm inside the Delta Lake project makes it hard for the community to collaborate in a vendor-neutral zone. As we approach the open source launch of Onetable, we invite anyone interested in being an early committer, or in getting early access to test real use cases, to reach out to info@onehouse.ai. Let’s build the next generation of the open and interoperable data lakehouse together.
