Raising our Series B and launching LakeView and Table Optimizer: Read the announcement
August 7, 2024
Onehouse Analytics Engine Guide
Written by:
Po Hong and Floyd Smith
Introduction
Life requires choices, and data architecture is certainly no exception. There is a very wide range of analytics engines out there - including data warehouses, data lake engines, and data processing frameworks, each with its own strengths, weaknesses, and community support. One informal, live survey even found usage spread smoothly across 10 different query engines, with Spark, Databricks, and Snowflake as the top three.
You should avoid the traditional approach of choosing a query engine before deciding on the data architecture. Instead, base your selection(s) on your specific workloads. New trends offer you freedom of choice:
Efforts around interoperability, such as Apache XTable (Incubating), allow flexibility across data lakehouse formats, seamlessly integrating with any data catalog.
We call the resulting open, flexible architecture the Universal Data Lakehouse (UDL), where you can expose data in an open, interoperable manner across all query engines. The Universal Data Lakehouse levels the playing field regarding query engine selection.
Onehouse as the Universal Data Lakehouse
Onehouse implements the Universal Data Lakehouse architecture, providing a single, high-quality copy of your data in the bronze and silver stages of the medallion architecture, as shown in Figure 1. The UDL completely decouples storage from compute, while centralizing data management.
The UDL is fully open, extremely cost-efficient, supports mutable data, and is scalable without limits. Your downstream data tables can serve any purpose, from business intelligence (BI) analytics and reporting to data science, machine learning, and AI. You can add new use cases at any time - without affecting existing workloads.
Different use cases often require different query engines. With the Universal Data Lakehouse, you have full freedom of choice. So with the UDL, the key question is: which query engine is best for your specific use case? You can now leverage the unique strengths of multiple purpose-built engines, without finding yourself limited and having to pay more, suffer poor performance, or both.
We will explore several popular cloud-based query engines suitable for analyzing data stored in open file formats like Apache Parquet or open table formats - such as Apache Hudi, Apache Iceberg, and Delta Lake - in a cloud data lakehouse. Since you already have the data stored in the data lakehouse (such as Onehouse’s UDL), we'll focus on their capabilities for querying data from a cloud storage layer such as Amazon S3 or Google Cloud Storage (GCS). This architecture decouples compute from storage and avoids the need to make extra copies of data, enabling you to keep data securely stored as a single source of truth in your virtual private cloud (VPC).
To help you navigate in this new environment, we have created this guide. Though Onehouse is interoperable with any query engine, for brevity and focus, we will only highlight the key features and capabilities of some of the most popular query engines, including Amazon Athena, Amazon Redshift, Databricks, Google BigQuery, and Snowflake, and open source analytics engines such as ClickHouse, Presto, StarRocks, and Trino.
Query Engine Selection Criteria
When choosing a query/analytics engine, it is important to consider the following factors:
Manageability
Scalability
Cost
Performance
SQL support
The next sections provide a broad, but non-comprehensive, assessment of some of the most popular query and analytics engines adopted by UDL users. The analytics engine ecosystem is rapidly evolving, so you may find that functionality and performance change dramatically as these engines improve and new engines emerge.
These engines fall into two main categories (Figure 2):
Proprietary: Amazon Athena, Amazon Redshift, Databricks , Google BigQuery, and Snowflake.
Open source: ClickHouse, Presto, StarRocks, and Trino.
Choose the Right Engine
Here's a breakdown of each category to help you compare analytics engines, and choose the right one for your needs, with a summary shown in Table 1.
Manageability
Manageability is a critical concern for the use of these solutions. You may lean toward a managed solution, for ease of implementation and operation, or self-managed open source software, for flexibility and control. In an increasing number of cases, for a given technology, you will find that you have available both a robust open source solution, with its community, as well as a managed service offering, which reduces startup time and operational hassle.
The solutions we are describing here give you a range of options:
Managed only: Amazon Athena, Amazon Redshift, Databricks, Google BigQuery, and Snowflake are fully managed services, with minimal setup required. They handle infrastructure provisioning and maintenance for you. Amazon Athena and Google BigQuery are serverless by design; Amazon Redshift and Databricks each have a serverless option.
Managed optional/self-managed: Open source software (OSS) options ClickHouse, Apache Presto, Apache Trino, and StarRocks require some level of self-management or orchestration for deployment and configuration. Several vendors provide managed solutions for these OSS engines, such as ClickHouse Cloud, CelerData for StarRocks, and Starburst for Trino.
Scalability
Scalability or elasticity to adapt to variable workloads via horizontal scaling is perhaps the greatest single advantage offered by the cloud, and services that more fully support scalability are highly desirable. All of the query engines here are scalable, but some have limitations or impose operational burdens when taking advantage of scalability:
Highly scalable: Databricks, Google BigQuery, Snowflake, and StarRocks offer excellent horizontal scaling for both compute and storage resources. These platforms are designed to efficiently manage large datasets and dynamic workloads, making them ideal for enterprises with demanding data processing needs. They can scale out by adding more nodes and resources seamlessly, ensuring consistent performance even as data volumes grow.
Scalable with caveats:
Elastic resizing for Amazon Redshift requires migrating cluster metadata, which makes the cluster temporarily unavailable. Most queries are paused and connections are held open, though some active queries may be dropped. Elastic resizing redistributes data to node slices in the background. While the Redshift cluster remains available for read and write operations during this process, some queries may take longer to run.
ClickHouse scales by adding replicas and shards; you need to carefully choose the shard key for efficient data distribution and manage shard and replica configuration across multiple ClickHouse servers. This process can be complex and might not provide the same level of elasticity as the highly scalable platforms.
Presto and Trino achieve scalability through distributed deployments on clusters. They distribute queries across multiple nodes, allowing for high concurrency and large-scale data processing. However, managing and optimizing these deployments can require significant expertise and resources.
Amazon Athena offers fully managed scaling by AWS, meaning users do not need to worry about provisioning resources. However, this also comes with inherent limitations that users cannot modify, such as maximum query execution times and concurrency limits. While Athena handles many scalability aspects automatically, it may not provide the same level of control and customization as other platforms.
Cost
Cloud options allow you to pay only for what you use. However, costs can quickly add up once solutions are deployed to production. Each of the query engines here works differently, and is more efficient with some workloads than others, so comparing costs across them is challenging; however, extensive use of these engines by a wide range of customers allows us to give some general guidelines as to cost:
Premium: Databricks, Google BigQuery, and Snowflake are considered premium query engines. They have many advanced features, but can be expensive.
Balanced: Amazon Athena and Amazon Redshift can be more cost-effective alternatives to premium engines, especially for workloads with predictable compute needs.
Cost-effective: ClickHouse, Presto, StarRocks, and Trino have open-source options, and can be cost-effective if you can manage your own infrastructure.
Performance
As with cost, comparing performance across query engines is challenging, as each is better suited for some use cases than others, and operations can always be tuned for optimum performance, if you consider it worthwhile to put in the effort for that use case. However, similar to cost, we can categorize query engines based on their performance when querying data stored in a cloud-based data lakehouse, rather than in each engine’s native storage:
Optimized for performance: ClickHouse, Google BigQuery, Snowflake, and StarRocks are known for their fast query speeds, particularly when dealing with complex queries and large datasets:
ClickHouse: This columnar database management system is optimized for high-performance queries. Its architecture allows for real-time data ingestion and fast query execution, making it ideal for environments where low-latency performance is crucial.
Google BigQuery: A fully managed, serverless data warehouse that excels in handling petabyte-scale data sets. BigQuery's use of massively parallel processing (MPP) and its unique architecture enables it to deliver fast query performance even for very large and complex queries.
Snowflake: Known for its unique multi-cluster shared data architecture, Snowflake provides excellent performance for both read and write operations. It separates storage and compute, allowing for independent scaling of each, optimizing performance and reducing latency.
StarRocks: This next-generation analytics database is designed for high-concurrency and low-latency queries. Its vectorized execution engine and optimizations for OLAP workloads ensure that it performs well on complex queries with multiple joins.
Performance considerations: Amazon Athena, Amazon Redshift, Databricks, Presto, and Trino offer good performance, but there are considerations to keep in mind:
Amazon Athena: This serverless query service leverages the Trino (formerly PrestoSQL) engine. While it offers the convenience of a fully managed service, there can be additional overhead due to its serverless architecture. Performance may be impacted by the underlying infrastructure management, though it remains a powerful tool for ad hoc querying.
Amazon Redshift: Performance can vary depending on query complexity and cluster configuration. While Redshift can handle large-scale data warehousing tasks effectively, optimal performance often requires careful tuning of the cluster and query optimization strategies. Additionally, during elastic resizing operations, some queries may be delayed or dropped.
Databricks: A unified analytics platform built on Apache Spark, Databricks can achieve high performance for both batch and streaming data. However, to fully leverage its capabilities, it’s often necessary to tune and optimize Spark for performance.
Presto and Trino: These distributed SQL query engines are designed for fast analytics on large datasets. They offer good performance out of the box, but achieving optimal results may require fine-tuning the cluster setup, query optimization, and careful resource management.
SQL Support
The introduction of structured query language (SQL) for queries against the relational database management system (RDBMS) offerings from different vendors was a revolution in standardization, ease of use, and performance many years ago. Today, SQL is used for data management functions as well as for queries, and a high degree of SQL compliance is highly desirable for a data lakehouse query engine:
Standards-compliant: Amazon Athena, Amazon Redshift, Google BigQuery, Snowflake, and StarRocks support most standard SQL features with minimal extensions.
Variations: ClickHouse, Databricks, Presto, and Trino generally support standard SQL, with some limitations for advanced features. Examples include:
ClickHouse offers a limited set of window functions compared to standard SQL. Some advanced window functions like LEAD or LAG are not supported.
Databricks does NOT support certain advanced SQL window functions such as CUME_DIST or NTH_VALUE, lateral joins, and recursive common table expressions (CTEs).
Presto/Trino have limitations on nested subqueries, particularly for complex nesting levels, and lack support for lateral joins and recursive common table expressions (CTEs).
Summary
In this guide, we summarized several popular cloud-based query engines suitable for analyzing data stored in a cloud data lake using Apache Parquet format and managed by a data lakehouse. The optimal choice depends on your specific use cases:
Ease of use and manageability: Amazon Athena and Amazon Redshift (if you are on AWS), Databricks (if you use Delta Lake), and Snowflake.
High scalability and performance: Google BigQuery, Snowflake, and StarRocks.
Cost-effectiveness: Consider Amazon Athena (for serverless and low-compute requirements) and Amazon Redshift, or open-source options such as ClickHouse, Presto, StarRocks, and Trino if you prefer to manage the infrastructure yourself.
Real-time analytics: ClickHouse and StarRocks.
The real choice depends on your workloads, and we advise you to run some representative workloads on these engines before you decide. By quickly ingesting and managing data in the most open and interoperable manner possible, we at Onehouse can help you perform such evaluations on a level playing field across all these engines discussed. In fact, our team has helped many companies, large and small, employ an intelligent mix of these query engines to balance the trade-offs described above, while eliminating traditional lock-in points.
Onehouse offers a managed data lakehouse solution for the highest degree of ease of use, interoperability, manageability, performance, and scalability. The Onehouse managed service offers you freedom of choice in deploying any or all of the query engines described in this guide, as well many more.
In addition, Onehouse offers the free Onehouse LakeView service and Onehouse Table Optimizer, a managed service for use with OSS data lakehouses. If you are interested in learning more, please reach out to gtm@onehouse.ai, or sign up for a free trial with $1000 in credits.