In working with hundreds of customers, large and small, I've encountered a common pitfall: picking a query engine before establishing a data architecture.
In this post, we will emphasize the importance of developing a robust data strategy before selecting a query engine. This will help you avoid sub-optimal architectures and vendor lock-in.
I recommend that you start by establishing a versatile modern data lakehouse foundation, one that is flexible and interoperable with any query engine. Ingest data into any open data table format - Apache Hudi, Apache Iceberg, or Delta Lake. Then use Apache XTable for interoperability, making your data accessible to any query engine. You now have the freedom to use and re-use the same “source of truth” data tables with the best engine for each use case.
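To make this concrete, here is a minimal PySpark sketch of landing mutable data in an open table format (Apache Hudi in this example). The bucket path, table name, keys, and package version are illustrative assumptions, not part of any specific Onehouse setup.

```python
# Minimal sketch: upsert mutable records into an open table format (Apache Hudi)
# on object storage. Paths, table names, and the Hudi bundle version are
# hypothetical placeholders -- adjust to your Spark and storage environment.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lakehouse-ingest-sketch")
    # The Hudi Spark bundle must match your Spark/Scala versions.
    .config("spark.jars.packages", "org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.1")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# Example mutable records; upserts are keyed on order_id.
orders = spark.createDataFrame(
    [(1, "2024-05-01", 120.0), (2, "2024-05-01", 75.5)],
    ["order_id", "order_date", "amount"],
)

hudi_options = {
    "hoodie.table.name": "orders_raw",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.precombine.field": "order_date",
    "hoodie.datasource.write.partitionpath.field": "order_date",
    "hoodie.datasource.write.operation": "upsert",
}

# The result is an open table on object storage, not data locked inside one engine.
(orders.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3a://my-bucket/lakehouse/orders_raw"))
```

Because the result is an open table on object storage, an Apache XTable sync can later expose the same files through Iceberg or Delta Lake metadata, so any compatible engine can query them without creating another copy of the data.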
In an era of exponential data growth, traditional data warehouses and data lakes are hitting their limits. Enter the Onehouse Universal Data Lakehouse – a next-generation architecture that combines the best of both worlds. This innovative architecture is reshaping how businesses such as Uber, Walmart, and ByteDance handle vast and diverse data with unparalleled efficiency.
The Universal Data Lakehouse (UDL) architecture is vendor-independent. Onehouse supports it in several ways:
The Onehouse UDL architecture leverages the best features of the data lake, such as open source data formats and object storage, while supporting incremental updates similar to a data warehouse. This approach has the following benefits:
Figure 1 shows today’s most-used query engines. With the UDL architecture, it’s easy to add new use cases, supported by the query engine of your choice. For more details, please refer to Building a Universal Data Lakehouse.
Many customers make common mistakes when selecting a query engine before establishing a solid data strategy, including:
These limitations severely hamper your ability to make fast, data-driven decisions.
A Onehouse customer initially chose Snowflake as their cloud data warehousing solution for both ELT and BI workloads. They used a third-party cloud ingestion tool to load approximately 25-30 TB of mutable data per year into Snowflake, then ran SQL MERGE operations within Snowflake to populate the staging/raw layer. This process cost around $1.16 million a year, with Snowflake accounting for the majority of the cost (~85%) and the third-party ingestion tool accounting for the rest.
By replacing the third-party ingestion tool and the SQL merge process in Snowflake with a Onehouse data lakehouse solution, the customer achieved:
In addition to cost savings and performance improvements, the new Onehouse-powered data lakehouse provided the freedom to use any analytics tools they preferred. They could continue using Snowflake for BI and reporting, while also gaining the flexibility to use Trino and Apache Spark for other use cases, all against the same (much fresher) data.
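As a rough illustration of reusing the same tables across engines, the sketch below reads the hypothetical Hudi table from the earlier example with Spark SQL; an equivalent query could be issued from Trino, or from Snowflake against Iceberg metadata produced by Apache XTable. The table path is an assumption for illustration, and the session is assumed to be configured with the same Hudi bundle as the ingest sketch.

```python
# One copy of the data, many engines: read the same open table back with Spark SQL.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("multi-engine-read-sketch")
    .config("spark.jars.packages", "org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.1")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

spark.read.format("hudi").load("s3a://my-bucket/lakehouse/orders_raw") \
    .createOrReplaceTempView("orders_raw")

# BI-style aggregation in Spark; a comparable statement could run in Trino
# (or in Snowflake via Iceberg metadata) against the same underlying files.
spark.sql("""
    SELECT order_date, SUM(amount) AS daily_revenue
    FROM orders_raw
    GROUP BY order_date
    ORDER BY order_date
""").show()
```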
This example illustrates the importance of developing a robust data strategy before selecting a query engine, ensuring flexibility, cost efficiency, and optimal performance.
For more details, see Optimize Costs by Migrating ELT from Cloud Data Warehouses to Data Lakehouses.
To build an effective and scalable data strategy, we recommend that customers follow these best practices.
Adopt Open Data Formats:
Keep Your Query Engine Choices Open:
Adopt Open Source Data Table Formats:
Achieve Universal Interoperability with Apache XTable:
Ensure Multi-Catalog Integration:
This complements the open data and table formats, open data services, and catalog interoperability principles on which a solution like Onehouse is built.
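As a sketch of the Apache XTable recommendation above, the following Python snippet writes a dataset config and invokes the XTable utilities jar to add Iceberg and Delta Lake metadata to an existing Hudi table. The jar filename, config keys, and table path are illustrative; check the Apache XTable documentation for the exact artifact and options for your version.

```python
# Hedged sketch: run an Apache XTable metadata sync by writing a dataset config
# and shelling out to the XTable utilities jar. All names and paths are
# illustrative assumptions, not a verified command line for your environment.
import subprocess

config = """\
sourceFormat: HUDI
targetFormats:
  - ICEBERG
  - DELTA
datasets:
  - tableBasePath: s3://my-bucket/lakehouse/orders_raw
    tableName: orders_raw
"""

with open("xtable_config.yaml", "w") as f:
    f.write(config)

# After the sync, the same files carry Hudi, Iceberg, and Delta metadata,
# so each query engine can use whichever format it reads natively.
subprocess.run(
    ["java", "-jar", "xtable-utilities-bundled.jar", "--datasetConfig", "xtable_config.yaml"],
    check=True,
)
```

After a sync like this, each engine can register the table through its preferred catalog, which is where the multi-catalog integration recommendation comes in.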
We've summed up these recommendations in an infographic you can share (Figure 2).
To build a robust data strategy that maximizes flexibility, performance, and cost-efficiency, avoid selecting a query engine first. Instead, establish a solid data foundation, such as the Onehouse Universal Data Lakehouse platform, which provides a flexible, interoperable data architecture. You can then choose the best query engine for each specific use case, avoiding the limitations and risks of vendor lock-in and sub-optimal data architectures, and making your data useful at minimum cost for each new use case.
If you are interested in learning more, please reach out to gtm@onehouse.ai or sign up for a free trial with $1000 in credits.