February 2, 2023

Announcing Onetable

Summary

Onehouse customers can now query their Apache Hudi tables as Apache Iceberg and/or Delta Lake tables, unlocking native performance optimizations for the leading data lakehouse projects across everything from popular cloud query engines to cutting-edge open source projects.

At the base of a data platform's hierarchy of needs sits the fundamental need to ingest, store, manage, and transform data. Onehouse provides this foundational data infrastructure as a service to ingest and manage data in our customers' lakes. As these data lakes continue to grow in size and variety within an organization, it becomes imperative to decouple your foundational data infrastructure from the compute engines that process the data. Different teams can leverage specialized compute frameworks, e.g. Apache Flink (stream processing), Ray (machine learning), or Dask (Python data processing), to solve the problems important to their organization. Decoupling allows developers to use one or more of these frameworks on a single instance of their data stored in an open format, without the tedium of copying it into another service where compute and storage are tightly coupled. Apache Hudi, Apache Iceberg, and Delta Lake have emerged as the leading open-source projects providing this decoupled storage layer, with a powerful set of primitives that provide transaction and metadata layers (popularly referred to as table formats) in cloud storage, around open file formats like Apache Parquet.

Background

AWS and Databricks created the original commercial momentum for this technology category in 2019, with their support for Apache Hudi and Delta Lake respectively. Today, most cloud data vendors support one or more of these formats. However, they continue to build vertically optimized platforms that drive stickiness to their own query engines, where data optimizations are locked to certain storage formats; for example, to unlock the power of Databricks' Photon engine, you need to use Delta Lake. AWS has pre-installed Apache Hudi in all of its analytics services for multiple years and continues to support more advanced workloads in near-real time. Snowflake has announced stronger integration with Iceberg for external tables, and even the ability to query Delta tables as external tables. BigQuery announced integrations with all three formats, starting with Iceberg.

All of these different options offer a mixture of support, and we haven't even started listing the support in various open source query engines, data catalogs, or data quality products. This increasingly large compatibility matrix can leave organizations worried that they will be locked into a particular set of vendors or a subset of the available tools, creating uncertainty and anxiety as they begin their data lake journey.

Why build Onetable?

Over the past year, we have published a thorough comparison between the open-source projects that shows how Hudi has significant technical differentiators, especially for the update-heavy workloads that power Hudi's and Onehouse's incremental data services. Further, Hudi's automated table services for managing and optimizing your tables lay a comprehensive foundation for your data infrastructure, while being completely open source. When choosing a table format, engineers currently face a tough choice about which benefits matter most to them: for example, choosing between Hudi's table services and a fast Spark engine like Databricks Photon. At Onehouse we simply ask: do we really need to choose? We want our customers to have the best experience possible when working with their data, which means supporting formats other than Hudi to take advantage of the ever-growing set of tools and query engines in the data ecosystem. As a company that advocates for interoperability across query engines, it would be hypocritical of us not to apply the same standards to metadata formats and help avoid fracturing data into silos. Today, we are taking a big step in that direction with Onetable.

What is Onetable?

Onehouse is committed to openness and wants to help organizations enjoy the cost efficiency and advanced capabilities Hudi unlocks, without being constrained by the current market offerings. Onetable, a new feature of our cloud product, delivers exactly that. When your data is at rest in your lake, the three formats are not so different: they all provide a table abstraction over a set of files, along with a schema, commit history, partitions, and column stats. Onetable takes a source metadata format and extracts the table metadata into a common format that can then be synced into one or more target formats. This allows us to expose tables ingested with Hudi as Iceberg and/or Delta Lake tables, without requiring a user to copy or move the underlying data files, while maintaining a similar commit history to enable proper point-in-time queries.
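To make the idea concrete, here is a minimal, hypothetical sketch of what such a translation layer could look like. None of these names are the actual Onetable API; they simply illustrate extracting source metadata into a common model and syncing it to target formats, without ever touching the data files.

from dataclasses import dataclass
from typing import Dict, List

@dataclass
class DataFile:
    path: str           # existing Parquet file; never copied or moved
    size_bytes: int
    record_count: int

@dataclass
class CommonTableMetadata:
    schema: Dict[str, str]        # column name -> type
    partition_fields: List[str]
    commit_time: str              # instant of the source commit
    files: List[DataFile]

def extract_from_hudi(commit: dict) -> CommonTableMetadata:
    # Hypothetical: translate one Hudi commit into the common model.
    return CommonTableMetadata(
        schema=commit["schema"],
        partition_fields=commit["partition_fields"],
        commit_time=commit["instant_time"],
        files=[DataFile(**f) for f in commit["files"]],
    )

def sync_to_targets(meta: CommonTableMetadata, targets: List[str]) -> None:
    # Hypothetical: emit Iceberg manifests or Delta log entries that point
    # at the same Parquet files, preserving the commit history.
    for target in targets:
        print(f"sync commit {meta.commit_time} ({len(meta.files)} files) -> {target}")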

This approach is similar to how Snowflake retains its own internal metadata for Iceberg tables while creating Iceberg metadata for external interoperability. Hudi also already supports an integration with BigQuery, which is being leveraged by large open-source users and Onehouse customers.

Why are we excited?

Onehouse customers will have the option to enable Onetable as a catalog to automatically expose their data not only as a Hudi table, but also as an Iceberg and/or Delta Lake table. Exposing tables in these different metadata formats allows customers to easily onboard with Onehouse and enjoy the benefits of a managed data lakehouse, while maintaining their existing workflows with the tools and query engines they love.
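For instance, here is a hedged PySpark sketch of what this enables: once Onetable has synced Delta metadata alongside a Hudi table, the same files can be read through either format. The table path is a placeholder, and the sketch assumes a Spark session with both the Hudi and Delta bundles on the classpath.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("onetable-demo").getOrCreate()
path = "s3://your-bucket/warehouse/trips"  # placeholder table location

# Native Hudi read of the table.
hudi_df = spark.read.format("hudi").load(path)

# The same Parquet files, read through the Delta metadata Onetable synced.
delta_df = spark.read.format("delta").load(path)

# One copy of the data backs both views.
assert hudi_df.count() == delta_df.count()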

As an example, Databricks is a very popular choice for running Apache Spark workloads; its proprietary Photon engine delivers performance accelerations when using the Delta Lake table format. To ensure customers using Onehouse and Databricks get a great experience without any performance pitfalls, we benchmarked query performance on a 1TB TPC-DS dataset. We compared Apache Hudi and Delta Lake tables, with and without Onetable, and with Databricks' platform accelerations like disk caching and Photon. The following chart shows how Onetable unlocks higher performance inside Databricks on Onehouse/Hudi tables by translating metadata based on the Delta Lake protocol.

Further, we exposed this same table as an external table inside Snowflake, which is popularly used for easy warehousing. We performed a similar 1TB TPC-DS benchmark comparing Snowflake's native/proprietary tables, external Parquet tables, and a Hudi table exposed through Onetable. The chart below shows how Onetable exposes a consistent snapshot of a Hudi table to Snowflake queries while offering performance similar to Snowflake's Parquet tables.

While the external tables above are not nearly as fast as native Snowflake tables, Onetable provides the capability to expose an up-to-date view of the data lake inside Snowflake, to help power downstream ETLs/transformations or keep queries running while an organization transitions to building a data lakehouse to complement its Snowflake data warehouse. This approach avoids copying the full set of data into your warehouse and doubling storage costs, while still allowing engineers and analysts to derive more meaningful, aggregated native tables for quickly serving reports and dashboards, taking full advantage of Snowflake's powerful capabilities.
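As a rough illustration, the snippet below uses the snowflake-connector-python package to create and refresh a Parquet external table over the lake; the stage name, table name, and credentials are placeholders, and your external-table DDL may differ. With Onetable keeping the lake metadata current, a refresh brings the latest commit into view for Snowflake queries.

import snowflake.connector

conn = snowflake.connector.connect(user="...", password="...", account="...")
cur = conn.cursor()

# Placeholder DDL: an external table over Parquet files in a stage.
cur.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS trips_ext
    LOCATION = @lake_stage/trips
    FILE_FORMAT = (TYPE = PARQUET)
    AUTO_REFRESH = FALSE
""")

# Pick up the newest files/metadata, then query the up-to-date view.
cur.execute("ALTER EXTERNAL TABLE trips_ext REFRESH")
cur.execute("SELECT COUNT(*) FROM trips_ext")
print(cur.fetchone())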

Most of all, we are excited about how this sets users up for success with a flexible, tiered data architecture that is already prevalent in many large-scale data organizations. Apache Hudi offers industry-leading speed and cost-efficiency for incremental ingest/ETL on the data lake, which is the bedrock Onehouse is built on. Users leverage Hudi for efficient, cost-optimized data ingestion into raw/bronze and silver tables. Onehouse's table management services can optimize the layout of this data directly at the lake level for better query performance. Users can then transform these optimized tables using a warehouse engine like BigQuery, Redshift, or Snowflake, or a lake engine like Databricks, AWS EMR, Presto, or Trino. The derived data is then served to end users to build data applications like personalization, near-real-time dashboards, and more. Onetable provides the much-needed portability for users to pick the query engine they love based on their needs and cost/performance tradeoffs. At the same time, users can lower compute costs with Hudi's proven efficiency for challenging change-data-capture scenarios, combined with Onehouse's table optimization/management services.
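To ground the incremental ingest/ETL point, here is a hedged PySpark sketch of a Hudi incremental query, which reads only the records committed after a checkpoint instead of rescanning the whole bronze table. The path and begin instant are placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("incremental-etl").getOrCreate()

# Read only the changes committed after the last checkpointed instant.
changes = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20230201000000")
    .load("s3://your-bucket/bronze/orders")  # placeholder bronze table
)

# Transform just the changed rows before upserting into a silver table.
changes.createOrReplaceTempView("order_changes")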

Future Work

The landscape of query engines, open-source tools, and new products in the data space is constantly evolving. The existing and new services that spring up every year come with varying levels of support for these table formats. Onetable allows our customers to use any service that integrates with at least one of the three formats, giving them the largest set of options possible.

Onehouse is committed to open source, and that includes Onetable. Initially this will be an internal feature reserved for Onehouse customers as we iterate on the design and implementation. We are looking for partners from the other projects and the community to iterate on this shared standard representation of tables and ultimately open source the project for the entire ecosystem. For example, Hudi's catalog sync table service incrementally maintains catalog metadata as changes happen to the underlying Hudi tables. A similar implementation with Onetable would create immense value for data lake users by keeping metadata across different engines in sync with a single integration.

Please email info at onehouse dot ai to get in touch about collaborating on Onetable. Finally, I would like to thank my fellow Onehouse team members Vamshi Gudavarthi and Vinish Reddy Pannala for helping make this vision a reality.
