August 29, 2024

Breaking Free from Data Lock-In with Snowflake & Onehouse

Many organizations use a cloud data warehouse as their default platform for storing and analyzing data. Snowflake is a leader here, along with other options such as Amazon Redshift and Google BigQuery. A data warehouse such as Snowflake is designed as a tightly integrated system. Data typically stays within its ecosystem for storage, management, and processing. This tight integration provides a simple, reliable solution that "just works," but it also leads to data lock-in.

While Snowflake offers various ways to access and manipulate data, these interfaces ultimately rely on the underlying SQL query engine to retrieve data from the Snowflake storage layer. This means that all access to Snowflake data incurs a cost. For example, if you need to employ other engines, such as Apache Spark, distributed SQL databases, AI/ML platforms, or real-time analytics platforms, there is no direct access to the raw data; everything must be handled by Snowflake, and even downloading the data has a cost. This can limit the flexibility with which product teams support the needs of the business.

Snowflake’s Open Options 

Lower-cost data access opens the door to more data processing options. To address this, Snowflake recently introduced support for Apache Iceberg, one of the three leading table formats used for data lakehouses, along with Apache Hudi and Delta Lake.

Iceberg support is a positive development for Snowflake users. They can now store and manage their data in an open table format, and still query the data using the Snowflake query engine. 

Snowflake users have to choose one of two options for Iceberg support:

  1. Iceberg External Tables: Data is stored outside of Snowflake in cloud object storage such as S3, GCS, or ADLS. Snowflake gets read-only access to Iceberg tables managed by an external catalog, such as AWS Glue; storage maintenance is left to the user to perform manually. These tables are accessible to any query engine capable of reading Iceberg tables. Unfortunately, Snowflake query performance is often slower when using external tables, because they are typically not optimized for Snowflake. (A setup sketch follows this list.)
  2. Native Iceberg Tables: This option allows read-write access to Iceberg tables managed by Snowflake’s internal catalog. Snowflake handles all maintenance operations, and the tables benefit from Snowflake’s optimized performance. External compute engines, such as Apache Spark, can connect to the Snowflake Iceberg catalog through the Iceberg Catalog SDK and query data directly.
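
As a rough illustration of option 1, here is a minimal sketch using the snowflake-connector-python package to register an AWS Glue catalog integration and create an externally managed Iceberg table. All credentials, identifiers, and the external volume name are placeholders, and the exact DDL options may vary with your Snowflake account and cloud setup:

```python
# pip install snowflake-connector-python
# Sketch: expose an externally cataloged Iceberg table (read-only) to
# the Snowflake engine. All names and credentials are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="<account_identifier>",
    user="<user>",
    password="<password>",
    warehouse="<warehouse>",
    database="ANALYTICS",
    schema="PUBLIC",
)
cur = conn.cursor()

# Point Snowflake at an Iceberg catalog it does not manage (here, Glue).
cur.execute("""
    CREATE CATALOG INTEGRATION glue_catalog_int
      CATALOG_SOURCE = GLUE
      CATALOG_NAMESPACE = 'my_glue_db'
      TABLE_FORMAT = ICEBERG
      GLUE_AWS_ROLE_ARN = 'arn:aws:iam::<account_id>:role/<role_name>'
      GLUE_CATALOG_ID = '<account_id>'
      ENABLED = TRUE
""")

# Create the externally managed (read-only) Iceberg table.
cur.execute("""
    CREATE ICEBERG TABLE orders
      EXTERNAL_VOLUME = 'my_external_volume'
      CATALOG = 'glue_catalog_int'
      CATALOG_TABLE_NAME = 'orders'
""")

# Query it with the Snowflake engine like any other table.
print(cur.execute("SELECT COUNT(*) FROM orders").fetchone())
```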

The Snowflake implementation of Native Iceberg Tables is a step toward the goal users share: openness with full performance. However, it still imposes some important restrictions:

  • Limited Support for Other Query Engines: Native Iceberg tables can only be written to by Snowflake. Other query engines have read access, as long as they support the Iceberg format. But these other query engines cannot write to Snowflake Native Iceberg Tables, nor can they access these tables if they don’t support the Iceberg format.
  • Catalog Management: For Native Iceberg Tables, Snowflake manages the catalog, which includes metadata and table management. While this provides full Snowflake platform support, users who want to access the tables with another engine must go through the Snowflake Iceberg Catalog SDK. The SDK lets tools such as Apache Spark read the metadata and access the data directly from cloud storage, without relying on Snowflake compute resources, as sketched below.
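
For concreteness, here is a hedged PySpark sketch of reading a Snowflake-managed Native Iceberg Table through the Iceberg Catalog SDK. Package versions, the account URL, credentials, and table names are placeholders, and the cloud storage credentials Spark needs to scan the underlying files are omitted; reads through this catalog are read-only:

```python
# Sketch: read a Snowflake-managed Native Iceberg Table via the Iceberg
# Catalog SDK, so the data scan bypasses Snowflake compute.
# Versions, URLs, and credentials below are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2,"
            "net.snowflake:snowflake-jdbc:3.16.1")
    .config("spark.sql.catalog.snowflake_catalog",
            "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.snowflake_catalog.catalog-impl",
            "org.apache.iceberg.snowflake.SnowflakeCatalog")
    .config("spark.sql.catalog.snowflake_catalog.uri",
            "jdbc:snowflake://<account>.snowflakecomputing.com")
    # Properties prefixed with "jdbc." are forwarded to the JDBC driver.
    .config("spark.sql.catalog.snowflake_catalog.jdbc.user", "<user>")
    .config("spark.sql.catalog.snowflake_catalog.jdbc.password", "<password>")
    .getOrCreate()
)

# Namespaces map to <database>.<schema> in Snowflake; the scan reads the
# Parquet files directly from cloud storage.
spark.sql(
    "SELECT * FROM snowflake_catalog.analytics.public.orders LIMIT 10"
).show()
```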

Because of these limitations, businesses that rely on a multitude of data tools, frameworks, and platforms may be interested in a more open, more powerful, and more flexible approach. 

The Onehouse Advantage: Fastest and Most Open Lakehouse for Snowflake

Onehouse provides the Universal Data Lakehouse, a fully managed cloud data lakehouse that ensures full interoperability across all three table formats: Apache Hudi, Apache Iceberg, and Delta Lake. The Universal Data Lakehouse uses Apache XTable (Incubating) to provide omnidirectional metadata translation across all three formats, as shown in Figure 1.

Figure 1. XTable powering the Universal Data Lakehouse for use with Snowflake

With the Universal Data Lakehouse, you create a single source of truth for each of your data tables. All tables are maintained in your own virtual private cloud (VPC) on object storage, such as Amazon S3 or Google Cloud Storage. You can access each table with whichever query engine fits your use case, without incurring excessive costs.

Onehouse provides an open, economical, high-performance alternative to Snowflake Native Iceberg Tables: 

  • Openness: Query engines. Use any query engine, including Snowflake (via Iceberg External Tables) and Databricks (via Delta Lake), with XTable providing interoperability where needed (see the sketch after this list).
  • Openness: Catalogs. Take advantage of built-in catalog sync with many of the leading catalogs, including AWS Glue, Snowflake, Databricks Unity, Google Data Catalog, DataHub, Hive Metastore, and the XTable catalog. 
  • Speed. Onehouse includes automated table optimizations, so tables are always optimized for performance with any query engine. When used with Snowflake, for instance, query performance matches that of Snowflake-managed Native Iceberg Tables, but as a fully open solution.
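
To make the XTable interoperability concrete, here is a rough sketch of running the open-source Apache XTable utility by hand. The jar name and paths are placeholders (check the XTable documentation for the exact bundled-jar coordinates); Onehouse runs this kind of metadata sync continuously as part of the managed service:

```python
# Sketch: translate one table's metadata with Apache XTable (Incubating).
# The jar name, bucket, and table names are placeholders.
import subprocess
from pathlib import Path

config = """\
sourceFormat: HUDI            # format the table is written in
targetFormats:                # formats to expose the same data files as
  - ICEBERG
  - DELTA
datasets:
  - tableBasePath: s3://my-bucket/lake/orders
    tableName: orders
"""
Path("xtable_config.yaml").write_text(config)

# Metadata-only translation: no data files are copied or rewritten.
subprocess.run(
    ["java", "-jar", "xtable-utilities-bundled.jar",
     "--datasetConfig", "xtable_config.yaml"],
    check=True,
)
```

After a run like this, the same base path carries Hudi, Iceberg, and Delta Lake metadata side by side, so each engine can read the single copy of the data in its preferred format.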

The Onehouse approach replaces the tradeoffs of Snowflake’s two options with a fully managed, fully optimized solution, as shown in Table 1. 

Table 1: Comparing Iceberg table options

Snowflake & Databricks: Turning Rivals into Roommates with Onehouse

In many cases, users are already running multiple processing and query engines on the same data, or want and need to do so.

The most prominent example? It’s becoming the norm for users to share data between Snowflake and Databricks. Research shows an overlap of more than 40% between the two platforms’ user bases, and that number is likely to grow, especially as open options such as Onehouse are adopted. Figure 2 shows the overlap.

Figure 2. Increasing overlap across Databricks and Snowflake use (source: Kyle Weller on LinkedIn; data from SiliconANGLE)

Until now, users have had to ingest, store, and manage much of their data twice across the two platforms, an expensive and time-consuming proposition. Regulations such as GDPR complicate this further by requiring that all copies of specific information be deleted quickly and seamlessly upon request, leaving users with the additional burden of ensuring compliance and maintaining data consistency across two platforms.

While Snowflake’s Native Iceberg Tables let users ingest and manage data in Iceberg with Snowflake and access that data from Databricks, this remains suboptimal for Databricks users, who get better integration and performance with Delta Lake. Conversely, querying Delta Lake data with Snowflake requires a multi-step process: creating an external stage, defining a suitable file format (typically Parquet), and creating and querying an external table that references the Delta Lake data (sketched below). This setup, though functional, is not as seamless as using Delta Lake natively within Databricks.
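
For reference, that multi-step setup looks roughly like the following sketch, again using snowflake-connector-python. All names are placeholders, and the exact options depend on your storage integration and Snowflake edition:

```python
# Sketch: query Delta Lake data from Snowflake via
# external stage -> Parquet file format -> external table.
import snowflake.connector

# Placeholder connection, as in the earlier sketch.
conn = snowflake.connector.connect(
    account="<account_identifier>", user="<user>", password="<password>",
    warehouse="<warehouse>", database="ANALYTICS", schema="PUBLIC",
)
cur = conn.cursor()

# 1. Stage pointing at the Delta table's location in object storage.
cur.execute("""
    CREATE STAGE delta_stage
      URL = 's3://my-bucket/lake/orders'
      STORAGE_INTEGRATION = my_storage_int
""")

# 2. Delta data files are Parquet, so define a Parquet file format.
cur.execute("CREATE FILE FORMAT my_parquet TYPE = PARQUET")

# 3. External table that interprets the staged files as Delta Lake data.
cur.execute("""
    CREATE EXTERNAL TABLE orders_delta
      LOCATION = @delta_stage
      FILE_FORMAT = (FORMAT_NAME = 'my_parquet')
      TABLE_FORMAT = DELTA
      AUTO_REFRESH = FALSE
""")

print(cur.execute("SELECT COUNT(*) FROM orders_delta").fetchone())
```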

This scenario highlights the trade-off between using Delta Lake for superior integration with Databricks and using Iceberg for better integration with Snowflake. Delta Lake's tight coupling with Databricks ensures optimal performance when used together, while Snowflake’s Native Iceberg Tables offer better performance within Snowflake.

With the Onehouse approach, a single copy of the data can be presented to Snowflake as Iceberg External Tables and to Databricks as Delta Lake tables, as the sketch below illustrates. Users can sync metadata to their preferred catalog as needed, enabling seamless data sharing between Snowflake and Databricks while allowing each platform to use its native table format.
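
Here is what that looks like from the engine side, as a hedged sketch. Paths and package versions are placeholders, storage credentials are omitted, and it assumes XTable has already produced both Delta Lake and Iceberg metadata (including an Iceberg version hint) at the table's base path:

```python
# Sketch: two engines read one copy of the data through different
# metadata layers at the same base path. Names are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.jars.packages",
            "io.delta:delta-spark_2.12:3.2.0,"
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2")
    .config("spark.sql.extensions",
            "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

base_path = "s3://my-bucket/lake/orders"  # the single copy of the data

# Databricks/Spark sees the table through its Delta Lake metadata...
delta_df = spark.read.format("delta").load(base_path)

# ...while the Iceberg metadata maintained at the same path serves
# Snowflake (as an Iceberg External Table) or any other Iceberg reader.
iceberg_df = spark.read.format("iceberg").load(base_path)
```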

This unified data management approach supports the three leading table formats: Apache Hudi, Apache Iceberg, and Delta Lake. It simplifies data sharing and usage across ecosystems, letting users access their data in the preferred native format of their chosen query engines without duplicating it.

Beyond Snowflake and Databricks, this level of integration is available to most popular query engines, nearly all of which support Apache Hudi, Iceberg, or Delta Lake.

In our next blog post, "Transform Your Snowflake Experience with Onehouse: The Fastest Iceberg Tables, Affordably," we will delve deeper into how Onehouse optimizes the performance of Iceberg tables within Snowflake; why using Snowflake with Onehouse is more cost-efficient; and how Onehouse seamlessly integrates with various platforms. Stay tuned!

Stay Connected

Follow Onehouse on LinkedIn and X/Twitter. Reach out directly at gtm@onehouse.ai for personalized consultations and more information on how Onehouse can transform your data architecture into a streamlined, cost-efficient, and open system.
