May 20, 2024

Announcing Multi-Catalog Sync with Snowflake, Databricks, BigQuery

Written by:

Kyle Weller

Today we are announcing that our multi-catalog synchronization feature now integrates with the Snowflake catalog, Databricks Unity Catalog, and Google Data Catalog. With the world’s most open data lakehouse - what we call the Universal Data Lakehouse - Onehouse is making it possible for anyone to query a single copy of data from almost any cloud query engine of their choice.

With this announcement, the Onehouse multi-catalog synchronization feature now syncs table metadata to all the following catalogs:

Snowflake, used for a number of cloud-based data warehousing workloads such as BI and reporting, and rapidly expanding into AI and machine learning.
Databricks Unity Catalog, a governance layer for data and AI on Databricks.
Google Data Catalog, used by popular Google Cloud Platform engines such as Google BigQuery and Google BigLake.
AWS Glue Data Catalog, widely used throughout the AWS ecosystem.
DataHub, an open source metadata platform popular in the modern data stack ecosystem.
Hive Metastore, a critical component of Hadoop ecosystems.
Apache XTable, an open source tool announced by Google, Microsoft, and Onehouse last year to provide cross-table interoperability between open lakehouse formats via metadata translations for Apache Hudi, Apache Iceberg, and Delta Lake.

This news has made, well, the news; Datanami has a solid write-up, posted this morning.

The Importance of Open Data Architectures

Onehouse is dedicated to making data open. Many of today’s popular data platforms are fully integrated analytics systems that lock user data into their systems for data storage, management, processing, and querying. Cloud data warehouses are good examples of these integrated systems. Snowflake, for instance, was built from the ground up to tightly integrate storage, management, and processing with its SQL layer.

Every day, customers and prospects share with us their desire to avoid lock-in to individual tools and platforms. More importantly, customers are investing more in data use cases outside of traditional business analytics and reporting. For example, SiliconAngle highlighted last year that nearly 40% of Snowflake customers are also running Databricks, and nearly 50% of Databricks customers are also Snowflake customers. This overlap indicates that a significant number of organizations are ultimately duplicating their data across multiple proprietary platforms, with all the associated governance and maintenance burdens.

“The increasing overlap between Snowflake and Databricks can be seen as a response to these companies’ realization that to extract maximum value from their data, they need to address both business intelligence and AI/ML workloads,” Dave Vellante and George Gilbert wrote at the time.

To enable best-of-breed data tools for such diverse data workloads, organizations need an open data platform that seamlessly interoperates with different data formats, catalogs, and compute engines. With an open platform, they can unlock universal data access from any of the many new and emerging downstream engines, while avoiding painful proprietary data silos. They can reuse data across use cases and processing frameworks without duplicating it, and they can migrate in and out of commercial offerings. Imagine using Snowflake for BI and reporting, Databricks for AI/ML, and Google Cloud Platform for data engineering - all on a single copy of data that is transformed, managed, and optimized in a single “source of truth.”

Without this approach, governing and maintaining pipelines and silos across multiple platforms becomes very complex.

Open Data Requires Open Interfaces and Interoperability Across the Stack

Open data architectures require the use of open table formats such as Apache Hudi, Apache Iceberg, or Delta Lake. These open table formats, along with open file formats such as Apache Parquet, are what free data from proprietary storage formats and allow the use of multiple query engines.

Yet, while open formats are necessary, they are not sufficient:

We need format interoperability. Each format has an affinity to specific ecosystems and tools. For example, Hudi has a strong affinity with Spark, Flink, and Presto. Delta Lake is tightly integrated with the Databricks ecosystem. And Apache Iceberg has a stronger affinity with cloud data warehouses as an open table format option. So, to truly open up access to the entire ecosystem, we also need interoperability between formats. For example, Snowflake has introduced support for Iceberg, which is a step in the right direction, while vendors such as Databricks prefer other table formats, making it challenging for the large number of organizations that use both products. That’s the motivation for building the Onetable project, which we co-launched late last year with Google and Microsoft - enabling customers to write and store data just once, and consume that data through any of the three leading table formats. Onetable has since been donated to the Apache Software Foundation as Apache XTable (Incubating).

We need catalog interoperability. Now that open table formats have enabled freedom of data in storage, catalogs are becoming the new lock-in point. For example, Snowflake requires the use of its own catalog for full platform support for Iceberg. This restriction limits interoperability with query engines outside of Snowflake and requires that data continue to be managed by Snowflake.

We need open data services. Platform services are often tied to proprietary vendor offerings, locking users in. By contrast, Onehouse draws on open services for functions such as clustering, compaction, and data cleansing. For example, a powerful, open catalog sync service makes data in open formats available through various catalogs such as Snowflake, Databricks Unity Catalog, Google Data Catalog, Hive Metastore, AWS Glue Data Catalog, and more.

Multi-Catalog Synchronization: Free Your Data from Lock-in

Multi-catalog synchronization is an ideal solution to complement the open format, open data services, and catalog interoperability principles on which Onehouse is built. In short, it makes it simple for Onehouse users to set up a data pipeline once, while making the data from that pipeline accessible across a number of query engines. This is a significant differentiator against other ETL/ELT solutions, which often integrate with only a single catalog, limiting the data owner's choices.

Once you onboard to Onehouse, you can leverage the Multi-Catalog Sync Tool with only a few clicks.

First, navigate to the “Catalogs” page in the left panel.
Next, choose the catalogs you want to enable and submit their configuration details.

When creating your pipelines in Onehouse, you now have a “multi-select” option for choosing the catalogs you want to synchronize. Under the hood, Onehouse will automatically trigger sync jobs to each catalog as your data is written to your lakehouse. Onehouse will seamlessly handle schema evolution, deletions, and other table management tasks along the way.
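To make the fan-out idea concrete, here is a minimal Python sketch of syncing one table's metadata to several catalogs. It is not Onehouse's implementation: `TableMetadata`, `InMemoryCatalog`, `UnreachableCatalog`, and `sync_to_catalogs` are hypothetical names invented for illustration, and the in-memory catalogs stand in for real catalog clients. The sketch assumes each catalog is synced independently, so a failure in one does not block the others.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TableMetadata:
    """Metadata a catalog needs to expose a table (hypothetical shape)."""
    name: str
    base_path: str
    schema: Dict[str, str]  # column name -> type name


@dataclass
class InMemoryCatalog:
    """Stand-in for a real catalog client; it only records registrations."""
    name: str
    tables: Dict[str, TableMetadata] = field(default_factory=dict)

    def register(self, table: TableMetadata) -> None:
        # Registering an existing table name overwrites the old entry,
        # which is how this sketch models a schema change.
        self.tables[table.name] = table


class UnreachableCatalog(InMemoryCatalog):
    """Simulates a catalog that is down, to show failure isolation."""

    def register(self, table: TableMetadata) -> None:
        raise ConnectionError(f"{self.name} is unreachable")


def sync_to_catalogs(
    table: TableMetadata, catalogs: List[InMemoryCatalog]
) -> Dict[str, str]:
    """Push one table's metadata to every catalog and report per-catalog status."""
    results: Dict[str, str] = {}
    for catalog in catalogs:
        try:
            catalog.register(table)
            results[catalog.name] = "ok"
        except Exception as exc:
            # Record the failure and keep going with the remaining catalogs.
            results[catalog.name] = f"failed: {exc}"
    return results


if __name__ == "__main__":
    orders = TableMetadata(
        name="orders",
        base_path="s3://example-bucket/orders",  # placeholder path
        schema={"id": "bigint"},
    )
    targets = [InMemoryCatalog("catalog_a"), UnreachableCatalog("catalog_b")]
    for catalog_name, status in sync_to_catalogs(orders, targets).items():
        print(catalog_name, status)
```

The point of the design is that the table is written once and only its metadata is fanned out. Each catalog gets a pointer to the same files, so adding a catalog never adds a copy of the data.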

Now that the catalogs are synchronized, you can fire up Databricks, Snowflake, BigQuery, or any engine of your choice, and just start writing SQL.

Hopefully it’s clear that Onehouse is committed to open data across all three critical pillars: format interoperability, catalog interoperability, and open data services. And we are committed to making an open data architecture widely available and easy to use. Working with data should be about creating insights - not about padding vendors’ pockets. Interested in an open alternative to vendor lock-in? Give Onehouse a try.
