Apache XTable is now an Incubating project at the Apache Software Foundation. The project’s open source status was announced during a session held at Open Source Data Summit 2023. You can read the overview that follows and view the full talk.
Apache XTable (previously known as OneTable) was announced as an open source project in incubation at the Apache Software Foundation during a talk last fall. Tim Brown, a software engineer at Onehouse, provided background on the rationale for the project's formation and its open source release. He was joined by Ashwin Agrawal, a senior researcher with the Gray Systems Lab (GSL) at Microsoft, and Anoop Johnson, a senior staff software engineer at Google, who discussed how XTable, then called OneTable, adds interoperability to the projects they each work on: Microsoft's Fabric and Google's BigLake, respectively. Agrawal concluded the talk by announcing the OSS release of OneTable (now XTable - Ed.), reviewing its planned roadmap, and inviting the audience to join the project and contribute.
“The OneTable project is not a separate data lakehouse table format. It is an omnidirectional interop for existing table formats.” — Tim Brown
At the implementation level, the three primary data lakehouse formats — Hudi, Delta Lake, and Iceberg — are very similar to each other, explained Tim Brown. All three pair Parquet data files with format-specific metadata. Because the technical implementations are so similar, the work required to make them interoperable is well-defined; yet each format provides a different performance profile and set of data processing features. As a result, evaluating which of the three to use can be costly and time-consuming, and in some cases a specific use case may call for a mix of two or even all three formats.
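The shared structure Brown describes is visible in each format's on-disk layout: the data files are plain Parquet, and only the metadata location differs. A rough sketch (directory names are the formats' standard defaults; a real table would contain only one of the three metadata directories):

```
my_table/                      # one table; three possible metadata layouts
├── file-0001.parquet          # data files: plain Parquet in every format
├── file-0002.parquet
├── .hoodie/                   # Hudi: timeline and table metadata
├── _delta_log/                # Delta Lake: JSON transaction log
└── metadata/                  # Iceberg: manifests and snapshot metadata
```

Because the data files themselves are identical, translating between formats is largely a matter of regenerating the metadata.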
To meet this challenge, OneTable (now XTable - Ed.) was born. “The OneTable project is not a separate data lakehouse table format. It is an omnidirectional interop for existing table formats,” Brown explained. As an omnidirectional, real-time interoperability layer, it saves time and increases confidence in choosing a format for storing data in a data lake.
XTable lets data consumers use whichever format makes sense for a specific data processing use case or for working with a given vendor, independent of how the data is originally stored in the data lake. Data consumers no longer need to spend significant time up front evaluating formats, or risk vendor lock-in later on. XTable users' data can be truly universal: they can write it once and query it everywhere.
Brown proceeded to demonstrate the full range of functionality one can expect from XTable. He set up an example with two source tables, one in the Hudi format and the other in the Delta Lake format, and then showed how lightweight, XTable-provided methods can sync both tables to equivalent Iceberg tables and back into their original formats, all without rewriting the underlying data.
So, for example, a user who prefers using Trino for data querying can rely on XTable to provide all of the data in the performant Iceberg format, regardless of how it is initially stored in the data lakehouse. And, if the same user wants to write some results back to a format that Trino doesn’t support write access for (e.g., Hudi), they can rely on XTable to help perform that write. Brown focused on incremental transformations, which have traditionally been challenging in a data lake, though XTable also supports full-table syncs.
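The kind of sync Brown demonstrated maps onto the flow exposed by XTable's bundled utilities: a small YAML file names the source format, the target formats to generate, and the tables to sync. A sketch of such a config, based on the project quickstart (the bucket path and table name here are hypothetical, and exact keys may differ between releases):

```yaml
sourceFormat: HUDI             # format the table is currently written in
targetFormats:
  - ICEBERG                    # metadata to generate alongside the source
datasets:
  - tableBasePath: s3://my-bucket/warehouse/orders   # hypothetical path
    tableName: orders
```

The sync is then run with the project's bundled utilities jar (invoked along the lines of `java -jar xtable-utilities-bundled.jar --datasetConfig config.yaml`; the jar name varies by release), after which the same Parquet files can be read as an Iceberg table by engines such as Trino.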
“Microsoft Fabric presents users with a unified, logical data lake, in which users are able to get data from multiple formats and lakes and run queries on top of it.” — Ashwin Agrawal
“Our research shows that confining the data estate to one data lake is itself a big problem,” Agrawal said. He introduced the audience to Microsoft Fabric, a project that uses XTable-powered interoperability to build a single, unified, logical data lake on top of any and all of the data lakes available in a company. To demonstrate the power of Microsoft Fabric, Agrawal set up an example with two source data lakes storing data in all three of the formats supported by XTable.
“Microsoft Fabric presents users with a unified, logical data lake, in which users are able to get data from multiple formats and lakes and run queries on top of it,” Agrawal said. He showed us how Fabric users can view the data in a single interface, abstracting away the choice of data lake and format. Users can then perform all the kinds of operations they might want to perform on the data from within Microsoft Fabric or other tools.
For example, users can define cross-lake table relationships, such as foreign keys; query across all of the tables in all of the lakes available in Fabric; quickly iterate on unified data analysis; and generate fast insights. All of these, as well as other standard data manipulation operations, are available through Microsoft Fabric, regardless of which data lake the underlying data is stored in.
“BigLake makes open data a first-class citizen in BigQuery. BigLake allows customers to keep their data in open file formats in cloud storage while performing data analysis in BigQuery.” — Anoop Johnson
Anoop Johnson discussed BigLake Managed Tables, a recently announced feature currently in private preview in Google's BigLake. Managed Tables bring read and write capabilities for Iceberg tables that live outside BigLake, and interoperate with tables in any supported format, including Iceberg, Apache Hudi, and Delta Lake. “BigLake makes open data a first-class citizen in BigQuery. BigLake allows customers to keep their data in open file formats in cloud storage while performing data analysis in BigQuery,” Johnson said, providing context.
The feature natively integrates with Parquet-formatted files for import, allows BigQuery to be used for data analysis, and provides Iceberg-compliant formats for export of the results. XTable significantly improves BigLake’s interoperability. It extends the existing native Iceberg connectivity by seamlessly synchronizing data into and out of Apache Hudi and Delta Lake formats. As a result, data that has been processed by BigQuery and stored in BigLake can be queried with any existing analysis tools.
“You don't want to be lying awake at night, wondering what life would have been like if you'd just chosen the other format.” — Tim Brown
Agrawal concluded the session by announcing that OneTable (now XTable - Ed.) is now an open source project. With strong support from across the community, the project has seen significant interest and has been growing quickly.
“We are super-focused on improving the performance footprint for OneTable, so that it can live hidden in infrastructure and automatically generate target table formats.” — Ashwin Agrawal
The project’s goals are well aligned with the Apache Software Foundation’s model. The project will continue to provide seamless, industry-wide interoperability across all data lake formats, eliminating silos while aiming for sustainability and ongoing evolution.
“We are super-focused on improving the performance footprint for OneTable (now XTable - Ed.), so that it can live hidden in infrastructure and automatically generate target table formats,” Agrawal said in his announcement. The roadmap also includes a focus on increased interoperability (adding more data formats, including Apache Paimon), native engine integrations, reaching feature parity with all lakehouse formats, real-time replication in any direction, and much more.
You can try the project’s 5-minute quickstart, star the GitHub repo, and find up-to-date information about future plans on the project page and docs.
This has been a summary of a presentation given at Open Source Data Summit 2023; if interested in more details, you can view the full talk.