June 26, 2024

Introducing Onehouse Table Optimizer

Written by:

Kyle Weller and Lin Liu

When building a data lakehouse, engineering teams are often brimming with excitement and expectations for an open and interoperable architecture. When organizations begin to query and utilize the data within their lakehouse, the initial momentum can quickly give way to frustration over slow query performance and the difficulty of deriving meaningful insights.

By properly optimizing their data lakehouse, users can achieve performance on par with or better than data warehouses. This starts with acknowledging that a data lakehouse's roots lie in cloud storage: a vast collection of files. Efficient data processes such as retrieval, filter pushdown, and joins depend on optimizing these underlying files. Critical data lakehouse optimizations include proper file sizing, clustering related data, compacting log files, and other data layout techniques. If you implement these optimizations manually, your teams take on an ongoing operational burden: scheduling and running the optimizations, diagnosing failures, and managing the underlying infrastructure.

All three major table formats - Apache Hudi™, Apache Iceberg™, and Delta Lake - help by adding metadata abstraction layers around files, making many optimizations possible. Apache Hudi goes further, offering open table services that help users operate these optimizations at scale.

As the original creators and major supporters of Hudi, and as operators of a leading managed data lakehouse service, we at Onehouse have decades of collective experience designing and operating these advanced optimizations at scale in some of the largest data lakes on the planet. Today, we are excited to pour our unique understanding and innovations into a brand-new product: Onehouse Table Optimizer.

What We Are Launching

Onehouse Table Optimizer intelligently optimizes your table data layouts and automates away the tedious chore of manually tuning infrastructure and operations. Our current customers, who use Onehouse for data ingestion and ETL pipelines every day, already have Table Optimizer running with great results across Apache Hudi, Apache Iceberg, and Delta Lake.

Today, we are making Table Optimizer available for teams who choose to run their own ingestion pipelines. Whether you build your pipelines with Onehouse or on your own, Table Optimizer can run automated optimizations on those tables.

We have observed these optimizations delivering 2-10x improvements in query performance for customer tables, while also driving a sharp decrease in costs.

Some organizations are missing out on massive potential performance and cost savings by not applying intelligent optimizations to their lakehouse tables. On the flip side, other organizations may be spending precious engineering resources applying inefficient tuning and chasing never-ending rounds of optimization.

When a team chooses to operate these advanced features on their own, the services either run in-line, competing with their writer resources, or complex infrastructure must be created and managed to trigger and schedule separate pipelines with technologies such as Apache Airflow and Apache Spark™. Table Optimizer, on the other hand, avoids the inefficiencies and complexities of DIY table management.


Figure: Before Table Optimizer

Figure: After Table Optimizer

Key Capabilities of Onehouse Table Optimizer

Auto File-Sizing

Small files are a leading cause of poor query performance and are often caused by variance in write patterns or poor selection of write parallelism. Table Optimizer automatically clusters small files to maintain optimal file sizes across your tables for fast query performance.
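To make the idea concrete, here is a minimal sketch of how small files could be grouped into merge batches that approach a target size. It is an illustration only, not Onehouse's implementation: the `DataFile` type, the `plan_file_merges` function, and both thresholds are hypothetical names chosen for this example, and the grouping uses a simple first-fit-decreasing heuristic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    """Hypothetical descriptor for one data file in a table."""
    path: str
    size_bytes: int


def plan_file_merges(files, target_bytes, small_file_limit_bytes):
    """Group small files into batches of at most target_bytes each.

    Files at or above small_file_limit_bytes are left alone. The remaining
    files are packed with a first-fit-decreasing heuristic.
    """
    small = sorted(
        (f for f in files if f.size_bytes < small_file_limit_bytes),
        key=lambda f: f.size_bytes,
        reverse=True,
    )
    groups = []  # each entry is [total_size, [files]]
    for f in small:
        for group in groups:
            if group[0] + f.size_bytes <= target_bytes:
                group[0] += f.size_bytes
                group[1].append(f)
                break
        else:
            # No existing group has room, so start a new one.
            groups.append([f.size_bytes, [f]])
    # A group holding a single file has nothing to merge with.
    return [members for _, members in groups if len(members) > 1]


if __name__ == "__main__":
    MB = 1024 * 1024
    files = [DataFile("a", 10 * MB), DataFile("b", 20 * MB), DataFile("c", 200 * MB)]
    for batch in plan_file_merges(files, 120 * MB, 100 * MB):
        print([f.path for f in batch])
```

A production service would also need to respect partition boundaries and concurrent writers, which this sketch ignores.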

Intelligent Incremental Clustering

Partitioning is useful for co-locating data frequently queried together for better read performance. Clustering allows you to further organize data within a partition and evolve the layout as new data is written to the table.

Onehouse runs clustering incrementally to cluster only new data and to save on write costs while optimizing query performance. You can configure your data layout strategy or let Onehouse run adaptive clustering based on your query patterns, enabling you to leverage various clustering strategies - from simple sorting to advanced multi-dimensional Z-Order/Hilbert space-filling curves.
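As an illustration of the multi-dimensional strategies mentioned above, the following sketch computes a Z-order (Morton) key by interleaving the bits of two integer columns and then sorts rows by that key. The function names `z_order_key` and `cluster_by_z_order` are hypothetical, and the sketch assumes non-negative integer columns; it is not a description of how Onehouse or any table format implements clustering.

```python
def z_order_key(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Morton key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)        # x fills even bit positions
        key |= ((y >> i) & 1) << (2 * i + 1)    # y fills odd bit positions
    return key


def cluster_by_z_order(rows, x_col, y_col):
    """Return rows sorted so that rows close in (x, y) tend to sit together."""
    return sorted(rows, key=lambda r: z_order_key(r[x_col], r[y_col]))


if __name__ == "__main__":
    rows = [
        {"x": 2, "y": 0},
        {"x": 0, "y": 1},
        {"x": 1, "y": 1},
        {"x": 0, "y": 0},
    ]
    for row in cluster_by_z_order(rows, "x", "y"):
        print(row, z_order_key(row["x"], row["y"]))
```

Because both columns contribute bits to the sort key, files written in this order tend to cover narrow ranges of both columns, which helps queries that filter on either one skip more files than a sort on a single column would.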

Adaptive Compaction

Merge on Read tables write incoming changes to log files and leverage compaction to periodically merge those recent changes into the table's base files, enabling industry-leading write performance on high-throughput workloads. However, orchestrating and tuning compaction can be challenging, especially when running it across hundreds or thousands of tables.

To avoid log file build-ups, you’ll need to run compaction with the right frequency and choose an optimal strategy for which files to prioritize for compaction. With Table Optimizer, Onehouse handles orchestration and intelligently balances compaction across active partitions for optimal cost and performance.
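One simple way to think about prioritization is to compact the file groups with the largest log build-up first, within a fixed I/O budget per run. The sketch below assumes exactly that policy; the `FileGroup` type, the `plan_compaction` function, and the budget parameter are hypothetical and are not taken from Onehouse or Hudi.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileGroup:
    """Hypothetical view of one file group: a base file plus its log files."""
    partition: str
    file_id: str
    base_bytes: int
    log_bytes: int


def plan_compaction(file_groups, io_budget_bytes):
    """Pick file groups to compact, largest log build-up first.

    Compacting a group is assumed to read its base file and all of its
    logs, so its cost is base_bytes + log_bytes. Groups that would push
    the run over io_budget_bytes are skipped.
    """
    candidates = sorted(
        (g for g in file_groups if g.log_bytes > 0),
        key=lambda g: g.log_bytes,
        reverse=True,
    )
    plan, spent = [], 0
    for group in candidates:
        cost = group.base_bytes + group.log_bytes
        if spent + cost > io_budget_bytes:
            continue  # too expensive for this run, try a cheaper group
        plan.append(group)
        spent += cost
    return plan


if __name__ == "__main__":
    groups = [
        FileGroup("2024-06-25", "A", base_bytes=100, log_bytes=50),
        FileGroup("2024-06-26", "B", base_bytes=100, log_bytes=80),
        FileGroup("2024-06-26", "C", base_bytes=100, log_bytes=0),
    ]
    print([g.file_id for g in plan_compaction(groups, io_budget_bytes=400)])
```

Other policies are equally plausible, for example weighting recently active partitions more heavily; the point is only that the choice of which files to compact is a ranking problem under a cost constraint.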

Automatic Cleaning

Cleaning reclaims storage space by removing old file/data versions that are no longer used in your table. Onehouse will automatically run cleaning based on your data retention requirements to reclaim storage space and save costs.
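A retention rule of the form "keep the newest N versions of each file group" can be sketched as follows. The tuple layout, the `files_to_clean` function, and the `versions_retained` parameter are assumptions made for this example rather than an actual Onehouse or Hudi interface.

```python
from collections import defaultdict


def files_to_clean(file_versions, versions_retained):
    """Return paths of file versions that fall outside the retention window.

    file_versions is an iterable of (file_group_id, commit_time, path)
    tuples, where commit_time sorts chronologically (for example a
    timestamp string such as "20240626"). The newest versions_retained
    versions of each file group are kept; older ones are returned.
    """
    if versions_retained < 1:
        raise ValueError("versions_retained must be at least 1")
    by_group = defaultdict(list)
    for group_id, commit_time, path in file_versions:
        by_group[group_id].append((commit_time, path))
    removable = []
    for versions in by_group.values():
        versions.sort(reverse=True)  # newest first
        removable.extend(path for _, path in versions[versions_retained:])
    return sorted(removable)


if __name__ == "__main__":
    versions = [
        ("fg1", "20240601", "fg1_v1.parquet"),
        ("fg1", "20240610", "fg1_v2.parquet"),
        ("fg1", "20240620", "fg1_v3.parquet"),
        ("fg2", "20240605", "fg2_v1.parquet"),
    ]
    print(files_to_clean(versions, versions_retained=2))
```

A real cleaner must also avoid deleting versions that running queries or incremental readers still depend on, which this sketch does not model.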

Asynchronous Services

Onehouse runs optimizations independently of the data ingestion process to improve write performance by removing bottlenecks. This is crucial for large-scale workloads, as running table services in-line with ingestion can block the ingestion writer and reduce throughput.

By handling optimization tasks asynchronously, Table Optimizer ensures that data ingestion remains fast and efficient. Async table services typically require complex orchestration, which is all handled by Onehouse when you use Table Optimizer.
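The difference between in-line and asynchronous services can be shown with a toy model: the writer hands each commit to a queue and moves on, while a separate worker runs the optimization. Everything here is hypothetical (`run_ingestion`, the `optimize` callback, the string "commits") and stands in for real write and table-service logic; real deployments also need concurrency control between writers and services.

```python
import queue
import threading


def run_ingestion(batches, optimize):
    """Write batches while a background worker optimizes each commit.

    The writer loop never calls optimize() itself, so slow optimizations
    do not delay the next write. Returns (written, optimized) lists.
    """
    commits = queue.Queue()
    optimized = []

    def table_service_worker():
        while True:
            commit = commits.get()
            if commit is None:  # sentinel: ingestion has finished
                break
            optimized.append(optimize(commit))

    worker = threading.Thread(target=table_service_worker)
    worker.start()

    written = []
    for batch in batches:
        written.append(batch)  # stand-in for writing and committing data
        commits.put(batch)     # hand off to the worker without waiting
    commits.put(None)
    worker.join()
    return written, optimized


if __name__ == "__main__":
    written, optimized = run_ingestion(
        ["commit-1", "commit-2"], lambda c: f"optimized:{c}"
    )
    print(written, optimized)
```

In the in-line alternative, the call to `optimize` would sit inside the writer loop, and every write would wait for the previous optimization to finish.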

No Migration Needed

You can keep your existing pipelines and simply point Onehouse at your tables for automatic management and optimizations. This allows Table Optimizer to seamlessly integrate with your existing workloads, providing immediate benefits without any migration.

Conclusion

Unlock the full potential of your data lakehouse with Onehouse Table Optimizer. By automating critical processes like clustering, compaction, and cleaning, Onehouse ensures your data is always up-to-date, performant, and managed in a cost-efficient manner. With Onehouse, you can focus on building and innovating, and let us take care of the heavy lifting.

Ready to supercharge your data lakehouse? Schedule a call with Onehouse today and try out the difference for yourself. Also, check out the launch webinar to see a live demo of Table Optimizer.

Authors

Kyle Weller

VP of Product

Experience includes Azure Databricks, Azure ML, Cortana AI Agent, global scale data and experimentation platforms for Bing Search, and Office 365. Onehouse author and speaker.
