In working with hundreds of customers, large and small, I've encountered a common pitfall: picking a query engine before establishing a data architecture.
In this post, we will emphasize the importance of developing a robust data strategy before selecting a query engine. This will help you avoid sub-optimal architectures and vendor lock-in.
I recommend that you start by establishing a versatile modern data lakehouse foundation, one that is flexible and interoperable with any query engine. Ingest data into any open data table format - Apache Hudi, Apache Iceberg, or Delta Lake. Then use Apache XTable for interoperability, making your data accessible to any query engine. You now have the freedom to use and re-use the same “source of truth” data tables with the best engine for each use case.
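To make this concrete, here is a minimal PySpark sketch of landing mutable data in an open table format (Apache Hudi in this example). The bucket path, table name, keys, and package version are illustrative assumptions, not part of any specific Onehouse setup.

```python
# Minimal sketch: upsert mutable records into an open table format (Apache Hudi)
# on object storage. Paths, table names, and the Hudi bundle version are
# hypothetical placeholders -- adjust to your Spark and storage environment.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lakehouse-ingest-sketch")
    # The Hudi Spark bundle must match your Spark/Scala versions.
    .config("spark.jars.packages", "org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.1")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# Example mutable records; upserts are keyed on order_id.
orders = spark.createDataFrame(
    [(1, "2024-05-01", 120.0), (2, "2024-05-01", 75.5)],
    ["order_id", "order_date", "amount"],
)

hudi_options = {
    "hoodie.table.name": "orders_raw",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.precombine.field": "order_date",
    "hoodie.datasource.write.partitionpath.field": "order_date",
    "hoodie.datasource.write.operation": "upsert",
}

# The result is an open table on object storage, not data locked inside one engine.
(orders.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3a://my-bucket/lakehouse/orders_raw"))
```

Because the result is an open table on object storage, an Apache XTable sync can later expose the same files through Iceberg or Delta Lake metadata, so any compatible engine can query them without creating another copy of the data.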
In an era of exponential data growth, traditional data warehouses and data lakes are hitting their limits. Enter the Onehouse Universal Data Lakehouse – a next-generation architecture that combines the best of both worlds. This innovative architecture is reshaping how businesses such as Uber, Walmart, and ByteDance handle vast and diverse data with unparalleled efficiency.
The Universal Data Lakehouse (UDL) architecture is vendor-independent. Onehouse supports it in several ways:
The Onehouse UDL architecture leverages the best features of the data lake, such as open source data formats and object storage, while supporting incremental updates similar to a data warehouse. This approach has the following benefits:
Figure 1 shows today’s most-used query engines. With the UDL architecture, it’s easy to add new use cases, supported by the query engine of your choice. For more details, please refer to Building a Universal Data Lakehouse.
Many customers make common mistakes when selecting a query engine before establishing a solid data strategy, including:
These limitations severely hamper your ability to make fast, data-driven decisions.
A Onehouse customer initially chose Snowflake as their cloud data warehousing solution for both ELT and BI workloads. They used a third-party cloud ingestion tool to load approximately 25-30 TB of mutable data per year into Snowflake, then ran SQL MERGE operations within Snowflake to populate the staging/raw layer. This process cost around $1.16 million a year, with Snowflake accounting for the majority of the cost (~85%) and the third-party ingestion tool accounting for the rest.
By replacing the third-party ingestion tool and the SQL merge process in Snowflake with a Onehouse data lakehouse solution, the customer achieved:
In addition to cost savings and performance improvements, the new Onehouse-powered data lakehouse provided the freedom to use any analytics tools they preferred. They could continue using Snowflake for BI and reporting, while also gaining the flexibility to use Trino and Apache Spark for other use cases, all against the same (much fresher) data.
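As a rough illustration of reusing the same tables across engines, the sketch below reads the hypothetical Hudi table from the earlier example with Spark SQL; an equivalent query could be issued from Trino, or from Snowflake against Iceberg metadata produced by Apache XTable. The table path is an assumption for illustration, and the session is assumed to be configured with the same Hudi bundle as the ingest sketch.

```python
# One copy of the data, many engines: read the same open table back with Spark SQL.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("multi-engine-read-sketch")
    .config("spark.jars.packages", "org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.1")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

spark.read.format("hudi").load("s3a://my-bucket/lakehouse/orders_raw") \
    .createOrReplaceTempView("orders_raw")

# BI-style aggregation in Spark; a comparable statement could run in Trino
# (or in Snowflake via Iceberg metadata) against the same underlying files.
spark.sql("""
    SELECT order_date, SUM(amount) AS daily_revenue
    FROM orders_raw
    GROUP BY order_date
    ORDER BY order_date
""").show()
```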
This example illustrates the importance of developing a robust data strategy before selecting a query engine, ensuring flexibility, cost efficiency, and optimal performance.
For more details, see Optimize Costs by Migrating ELT from Cloud Data Warehouses to Data Lakehouses.
To build an effective and scalable data strategy, we recommend that customers follow these best practices.
Adopt Open Data Formats:
Keep Your Query Engine Choices Open:
Adopt Open Source Data Table Formats:
Achieve Universal Interoperability with Apache XTable:
Ensure Multi-Catalog Integration:
This complements the open data and table formats, open data services, and catalog interoperability principles on which a solution like Onehouse is built.
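As a sketch of the Apache XTable recommendation above, the following Python snippet writes a dataset config and invokes the XTable utilities jar to add Iceberg and Delta Lake metadata to an existing Hudi table. The jar filename, config keys, and table path are illustrative; check the Apache XTable documentation for the exact artifact and options for your version.

```python
# Hedged sketch: run an Apache XTable metadata sync by writing a dataset config
# and shelling out to the XTable utilities jar. All names and paths are
# illustrative assumptions, not a verified command line for your environment.
import subprocess

config = """\
sourceFormat: HUDI
targetFormats:
  - ICEBERG
  - DELTA
datasets:
  - tableBasePath: s3://my-bucket/lakehouse/orders_raw
    tableName: orders_raw
"""

with open("xtable_config.yaml", "w") as f:
    f.write(config)

# After the sync, the same files carry Hudi, Iceberg, and Delta metadata,
# so each query engine can use whichever format it reads natively.
subprocess.run(
    ["java", "-jar", "xtable-utilities-bundled.jar", "--datasetConfig", "xtable_config.yaml"],
    check=True,
)
```

After a sync like this, each engine can register the table through its preferred catalog, which is where the multi-catalog integration recommendation comes in.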
We've summed up these recommendations in an infographic you can share (Figure 2).
To build a robust data strategy that maximizes flexibility, performance, and cost-efficiency, avoid selecting a query engine first. Instead, establish a solid data foundation, such as the Onehouse Universal Data Lakehouse platform, which provides a flexible, interoperable data architecture. You can then choose the best query engine for each specific use case, avoiding the limitations and risks of vendor lock-in and sub-optimal data architectures, and making your data useful at minimum cost for each new use case.
If you are interested in learning more, please reach out to gtm@onehouse.ai or sign up for a free trial with $1000 in credits.