Many organizations use a cloud data warehouse as their default platform for storing and analyzing data. Snowflake is a leader here, along with other options such as Amazon Redshift and Google BigQuery. A data warehouse such as Snowflake is designed as a tightly integrated system. Data typically stays within its ecosystem for storage, management, and processing. This tight integration provides a simple, reliable solution that "just works," but it also leads to data lock-in.
While Snowflake offers various ways to access and manipulate data, these interfaces ultimately rely on the underlying SQL query engine to retrieve data from the Snowflake storage layer. This means that all access to Snowflake data incurs a cost. For example, if there is a need to employ other engines such as Apache Spark, distributed SQL databases, AI/ML platforms, or real-time analytics platforms, those engines cannot read the raw data directly; every request must be handled by Snowflake. Even downloading the data has a cost. This can limit the flexibility with which product teams support the needs of the business.
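For instance, a Spark job cannot open the files behind a standard Snowflake table directly; it must push a query through a connector, and a Snowflake warehouse runs that query and bills for it. A minimal PySpark sketch, assuming the Snowflake Spark connector is on the classpath and using placeholder connection values:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("snowflake-read").getOrCreate()

# All values below are illustrative placeholders.
sf_options = {
    "sfURL": "myaccount.snowflakecomputing.com",
    "sfUser": "analyst",
    "sfPassword": "********",
    "sfDatabase": "SALES",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "ANALYTICS_WH",  # this warehouse runs (and bills for) the query
}

# The connector pushes the query to Snowflake's SQL engine; Spark never touches
# the underlying storage directly, so every read consumes Snowflake compute.
df = (
    spark.read.format("snowflake")
    .options(**sf_options)
    .option("query", "SELECT * FROM orders WHERE order_date >= '2024-01-01'")
    .load()
)
df.show()
```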
Lower-cost data access would open up more processing options for users. To address this need, Snowflake recently introduced support for Apache Iceberg. Iceberg is one of the three leading table formats used for data lakehouses, along with Apache Hudi and Delta Lake.
Iceberg support is a positive development for Snowflake users. They can now store and manage their data in an open table format, and still query the data using the Snowflake query engine.
Snowflake users have to choose one of two options for Iceberg support:
- Iceberg External Tables, which are created and managed outside Snowflake; Snowflake can read them but not write to them.
- Native Iceberg Tables, which Snowflake itself creates and manages, with full read and write support.
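To make the native option concrete, here is a minimal sketch of creating a Snowflake-managed Iceberg table through the Snowflake Python connector. The connection values, external volume, and table definition are illustrative placeholders, and exact syntax can vary by Snowflake release:

```python
import snowflake.connector

# Placeholder connection values.
conn = snowflake.connector.connect(
    account="myaccount", user="analyst", password="********",
    database="SALES", schema="PUBLIC", warehouse="ANALYTICS_WH",
)
cur = conn.cursor()

# A Snowflake-managed ("native") Iceberg table: Snowflake acts as the Iceberg
# catalog and writes both data files and Iceberg metadata to your external volume.
cur.execute("""
    CREATE ICEBERG TABLE orders_iceberg (
        order_id BIGINT,
        order_date DATE,
        amount NUMBER(10, 2)
    )
    CATALOG = 'SNOWFLAKE'               -- Snowflake manages the Iceberg catalog
    EXTERNAL_VOLUME = 'my_s3_volume'    -- pre-configured volume on your object storage
    BASE_LOCATION = 'orders_iceberg/'
""")
```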
The Snowflake implementation of Native Iceberg Tables is a step toward the goal users share: openness with full performance. However, it still imposes some important restrictions:
Because of these limitations, businesses that rely on a multitude of data tools, frameworks, and platforms may be interested in a more open, more powerful, and more flexible approach.
Onehouse provides the Universal Data Lakehouse, a fully managed cloud data lakehouse that ensures full interoperability across all three leading table formats: Apache Hudi, Apache Iceberg, and Delta Lake. The Universal Data Lakehouse uses Apache XTable (Incubating) to provide omnidirectional metadata translation across all three formats, as shown in Figure 1.
With the Universal Data Lakehouse, you create a single source of truth for each of your data tables. All data tables are maintained in your own virtual private cloud (VPC) on object storage, such as Amazon S3 or Google Cloud Storage. You can then access each table with whichever query engine your use case calls for, without incurring excessive costs.
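To make the translation step concrete: Apache XTable is typically driven by a small config file that names a source format and one or more targets, then run as a batch sync job against the table's storage location. A minimal sketch, with an illustrative bucket path and a jar name that varies by XTable release:

```yaml
# xtable_sync.yaml: translate one copy of the table's metadata into the other formats.
# Run with the bundled utilities jar (jar and class names vary by release), e.g.:
#   java -jar xtable-utilities-bundled.jar --datasetConfig xtable_sync.yaml
sourceFormat: HUDI            # format the table is written in
targetFormats:
  - ICEBERG                   # generate Iceberg metadata alongside the data
  - DELTA                     # generate a Delta transaction log as well
datasets:
  - tableBasePath: s3://my-bucket/tables/orders   # illustrative storage path
    tableName: orders
```

Note that only metadata is translated; the underlying Parquet data files are written once and shared by all three formats.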
Onehouse provides an open, economical, high-performance alternative to Snowflake Native Iceberg Tables:
The Onehouse approach replaces the tradeoffs of Snowflake’s two options with a fully managed, fully optimized solution, as shown in Table 1.
In many cases, users are already running multiple processing and query engines on the same data, or want and need to do so.
The most prominent example? It's becoming the norm for users to share data between Snowflake and Databricks. Research shows an overlap of more than 40% in the user bases of the two platforms, and this number is likely to increase, especially as open options such as Onehouse are adopted. Figure 2 shows the overlap.
Until now, users have had to ingest, store, and manage much of their data twice across the two platforms, an expensive and time-consuming proposition. The duplication is further complicated by regulations such as GDPR, which require that all copies of specific information be deleted quickly and reliably upon request. Users must therefore take the additional step of ensuring GDPR compliance and maintaining data consistency across both platforms.
While Snowflake’s Native Iceberg Tables allow users to ingest and manage their data in Iceberg using Snowflake and to access that Iceberg data from Databricks, this remains suboptimal for Databricks users, who get better integration and performance with Delta Lake. Going the other way, querying Delta Lake data from Snowflake requires a multi-step process: creating an external stage, defining a suitable file format (typically Parquet), and creating and querying an external table that references the Delta Lake data, as shown in the sketch below. This setup, though functional, is not as seamless as using Delta Lake natively within Databricks.
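Here is roughly what that multi-step setup looks like, sketched with the Snowflake Python connector. The stage URL, credentials, and table names are placeholders, and Delta support for external tables (TABLE_FORMAT = DELTA) may be a preview feature depending on your Snowflake release:

```python
import snowflake.connector

# Placeholder connection values.
conn = snowflake.connector.connect(
    account="myaccount", user="analyst", password="********",
    database="SALES", schema="PUBLIC", warehouse="ANALYTICS_WH",
)
cur = conn.cursor()

# 1. An external stage pointing at the Delta table's storage location.
cur.execute("""
    CREATE OR REPLACE STAGE delta_stage
      URL = 's3://my-bucket/tables/orders/'
      CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...')
""")

# 2. A Parquet file format, since Delta Lake stores its data as Parquet.
cur.execute("CREATE OR REPLACE FILE FORMAT delta_parquet TYPE = PARQUET")

# 3. An external table that honors the Delta transaction log when listing files.
cur.execute("""
    CREATE OR REPLACE EXTERNAL TABLE orders_ext
      LOCATION = @delta_stage
      FILE_FORMAT = (FORMAT_NAME = 'delta_parquet')
      TABLE_FORMAT = DELTA
      AUTO_REFRESH = FALSE
""")

# Only now can the Delta data be queried from Snowflake.
cur.execute("SELECT COUNT(*) FROM orders_ext")
print(cur.fetchone())
```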
This scenario highlights the trade-off between using Delta Lake for superior integration with Databricks and using Iceberg for better integration with Snowflake. Delta Lake's tight coupling with Databricks ensures optimal performance when used together, while Snowflake’s Native Iceberg Tables offer better performance within Snowflake.
With the Onehouse approach, a single copy of the data can be presented to Snowflake as Iceberg External Tables and to Databricks as Delta Lake tables. Users can sync metadata to their preferred catalog as needed, enabling seamless data sharing between Snowflake and Databricks, while allowing each platform to use its native table format.
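A sketch of what this looks like from each platform's side, with illustrative names and paths: Snowflake registers the Iceberg metadata maintained at the table's storage location, while Databricks reads the very same files natively as Delta Lake. Exact Snowflake syntax for externally managed Iceberg tables varies by release:

```python
import snowflake.connector

# Snowflake side (placeholder credentials): register the Iceberg metadata
# maintained at the shared table's object storage location.
conn = snowflake.connector.connect(account="myaccount", user="analyst", password="********")
cur = conn.cursor()

# A catalog integration that reads Iceberg metadata straight from object storage.
cur.execute("""
    CREATE CATALOG INTEGRATION object_store_catalog
      CATALOG_SOURCE = OBJECT_STORE
      TABLE_FORMAT = ICEBERG
      ENABLED = TRUE
""")

# Point Snowflake at the current Iceberg metadata file for the shared table.
cur.execute("""
    CREATE ICEBERG TABLE orders
      EXTERNAL_VOLUME = 'my_s3_volume'
      CATALOG = 'object_store_catalog'
      METADATA_FILE_PATH = 'orders/metadata/v42.metadata.json'
""")

# Databricks side: the same files at the same path, read natively as Delta Lake
# (in a Databricks notebook, `spark` is provided by the runtime).
df = spark.read.format("delta").load("s3://my-bucket/tables/orders")
df.createOrReplaceTempView("orders")
```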
This unified data management approach supports the three leading table formats (Apache Hudi, Apache Iceberg, and Delta Lake), simplifying data sharing and usage across ecosystems and allowing users to access their data in the preferred native format of their chosen query engines without duplicating data.
Beyond Snowflake and Databricks, this level of integration is available with most popular query engines, nearly all of which support Apache Hudi, Iceberg, or Delta Lake.
In our next blog post, "Transform Your Snowflake Experience with Onehouse: The Fastest Iceberg Tables, Affordably," we will delve deeper into how Onehouse optimizes the performance of Iceberg tables within Snowflake; why using Snowflake with Onehouse is more cost-efficient; and how Onehouse seamlessly integrates with various platforms. Stay tuned!
Stay Connected
Follow Onehouse on LinkedIn and X/Twitter. Reach out directly at gtm@onehouse.ai for personalized consultations and more information on how Onehouse can transform your data architecture into a streamlined, cost-efficient, and open system.