January 16, 2025

Introducing Onehouse Compute Runtime to Accelerate Lakehouse Workloads Across All Engines

An Independent and Universal Foundation

Data lakehouse open table formats (OTFs) have ignited a powerful paradigm shift in data infrastructure toward open, independent, decoupled data storage. Since their inception in 2017, the three communities behind Apache Hudi, Apache Iceberg, and Delta Lake have worked tirelessly to level the playing field for independent storage. As ecosystem adoption has rapidly gained momentum, in 2025 we have finally reached a consensus that storing your data independently with an OTF should be the default choice when designing your data infrastructure.

With this consensus, do we now have a panacea? At Onehouse, we see this consensus as just the beginning (or, dare we say, “tip of the iceberg”) of reaching truly independent and universal storage. Let’s examine. Across the ecosystem, services from every category have rapidly added read/write support for OTFs. Snowflake, BigQuery, Redshift, and all major data warehouses now support “external OTF tables” as a data lake strategy. Event streaming products, databases, catalogs, OLAP engines, data frames, AI/ML services, data integration tools, and nearly every other category in the data landscape now have an option with read/write support for at least one (and often all three) of Hudi, Iceberg, and Delta Lake.

Basic read/write support is a commendable start toward independence, but new friction points have emerged that once again challenge how interoperable and universal storage can be: data catalogs, table maintenance, and workload optimizations. Almost every vendor that supports an OTF now also offers their own catalog and maintenance services, which often restrict which tools can read/write to the tables. For example, vendor innovations are typically delivered with storage management tied to proprietary catalogs, which are tightly integrated with their compute runtimes. To ensure that control of data remains firmly in the users' hands, the industry needs not only decentralized storage but also a carefully crafted decentralized compute platform that can perform table maintenance and optimize typical workloads universally across these different cloud data warehouses and vendors.

With decades of collective experience inventing, pioneering, building, and operating some of the largest data lakehouses on this planet, the Onehouse engineering team has quietly been crafting a specialized execution runtime that we are finally ready to reveal to the world. Not only is this runtime truly independent, but it also delivers powerful reimplementations of core data lakehouse features to ensure best-in-class cost/performance across all engines.

Onehouse Compute Runtime

Today, we proudly announce Onehouse Compute Runtime (OCR), a high-performance data processing runtime designed explicitly for the unique workload challenges of data lakehouses. Apache Spark is the most popular compute engine for the lakehouse, but it is a generic data processing framework that requires specialized skills to optimize and tune when operating production workloads with open table formats. OCR is a specialized integration of lakehouse compute such as Apache Spark, lakehouse storage such as Apache Hudi, and everything else needed to optimize your lakehouse workloads. Our customers have proven that OCR delivers up to 30x faster queries and 10x faster writes beyond what is available from open source, without the need to hand-tune.

A performance impact of this magnitude from OCR not only leads to dramatic cost savings but also makes it possible to realize use cases that were too expensive to try on a lakehouse/warehouse before. Onehouse customers have used OCR to fuel near-real-time clickstream and telecom analytics, blockchain ledger transactions, IoT utility sensor monitoring, retail supply chains, and PB-scale marketing analytics platforms.

“With automated scaling and resources that adapt to our workloads, Onehouse helps us dedicate our teams to building out our core platform differentiators rather than keeping the data stack continuously optimized” - Emil Emilov, Principal Software Engineer, Conductor 
https://www.onehouse.ai/blog/conductor-unlocks-the-power-of-open-data-architectures-with-onehouse
 

Onehouse Compute Runtime is the industry's first compute runtime that works across all table formats, all query engines, and all catalogs. OCR is the execution foundation that runs all Onehouse workloads and services, including ingestion, incremental ETL, and table optimizations. OCR is 100% compatible with OSS Apache Spark, so it is easy to migrate on and off our platform, keeping your workloads free from lock-in and future-proof. Your data is written to the OTF of your choice, and with our OneSync™ multi-catalog integration, your storage will be optimized and universally available for virtually any compute engine on the market.

OCR offers industry-leading performance and cost savings with unique features that fall under three main pillars: 

  1. Adaptive Workload Optimizer: runtime features that intelligently react to your workloads.
  2. Serverless Compute Manager: compute infrastructure optimized for the most challenging lakehouse workloads.
  3. High-Performance Lakehouse I/O: a radical rethinking of some foundational lakehouse operations.

Let’s take a peek under the hood of these pillars to appreciate the leading-edge innovations in OCR.

Adaptive Workload Optimizer

When designing data lake ingestion, ETL, and table optimization workloads, engineers face challenging problems in monitoring and tuning their jobs for concurrency and parallelism while ensuring the Spark resources they deploy are efficiently utilized. Through trial and error, highly skilled engineers customize configurations and constantly monitor their pipelines to strike a tricky balance between meeting data latency SLAs and not exceeding their team budgets. For example, data lakehouses face a unique challenge in balancing write latency and query performance when using lakehouse storage such as Apache Hudi, Apache Iceberg, and Delta Lake. The Adaptive Workload Optimizer in OCR solves all these challenges while enabling simple but massive workload accelerations. 

Performance profiles

Performance profiles are available out of the box in OCR, so users can simply select the fastest write performance, the fastest query performance, or let Onehouse automatically set a balanced profile. OCR then automates many advanced Spark and OTF configurations behind the scenes to achieve the desired workload profile, which we have observed results in up to 5x faster writes and 10x faster reads.
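
As a rough illustration of the idea (not the actual OCR implementation), the sketch below shows how a profile choice could map to a handful of Spark/Hudi write options; the specific configs and profile names here are assumptions for demonstration only.

```python
# Illustrative sketch only: the profile names and config choices below are
# assumptions, not the proprietary tuning logic inside OCR.

# Merge-on-read tables absorb updates as deltas, favoring write latency.
WRITE_OPTIMIZED = {
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.compact.inline": "false",            # keep compaction off the write path
    "hoodie.metadata.enable": "true",
}

# Copy-on-write tables keep fully merged base files, favoring read latency.
QUERY_OPTIMIZED = {
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.clustering.inline": "true",          # co-locate data for faster scans
    "hoodie.parquet.max.file.size": str(512 * 1024 * 1024),
}

def apply_profile(writer, profile):
    """Attach the chosen profile's options to a Spark DataFrameWriter."""
    options = WRITE_OPTIMIZED if profile == "fastest_writes" else QUERY_OPTIMIZED
    for key, value in options.items():
        writer = writer.option(key, value)
    return writer
```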

Lag-aware execution

Lag-aware execution in OCR ensures your data lakehouse workloads hit consistent data latency targets. OCR combines knowledge of your sync frequency, latency goal, the amount of data available at the source, and past execution performance to adapt how your job is scheduled and resourced. With advanced monitoring and alerting in our lag-aware execution, your team is always up to date on detailed metrics related to your latency goals.
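
To make the idea concrete, here is a minimal sketch (with illustrative names and thresholds, not OCR's actual logic) of a lag-aware scheduling decision that combines pending data, a latency goal, and observed throughput:

```python
from dataclasses import dataclass

@dataclass
class TableLagState:
    pending_bytes: int          # data waiting at the source
    latency_goal_s: float       # target end-to-end data latency (SLA)
    observed_mb_per_s: float    # throughput measured from recent executions

def should_run_now(state: TableLagState, seconds_since_last_sync: float) -> bool:
    # Estimate how long the current backlog would take at the observed throughput.
    est_runtime_s = (state.pending_bytes / (1024 * 1024)) / max(state.observed_mb_per_s, 0.1)
    # Trigger early enough that the backlog plus the run itself still lands within the SLA.
    projected_latency_s = seconds_since_last_sync + est_runtime_s
    return projected_latency_s >= 0.8 * state.latency_goal_s
```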

Multiplexed job scheduler

Our multiplexed job scheduler makes compute resource sharing extremely efficient by virtualizing jobs across Spark clusters. Onehouse has developed unique mechanisms to interleave and distribute resources to parallel tasks during workload execution. With production customers, we have proven reliable pipelines that can efficiently stream hundreds of jobs through a small footprint of shared cluster resources. This ultimately results in a lowered compute footprint needed to run your workloads due to efficient resource sharing across jobs.
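
As a toy illustration of the interleaving idea (not the actual scheduler), the sketch below treats each job as a stream of small work increments and round-robins those increments over one shared cluster instead of parking each job on its own idle cluster:

```python
from collections import deque

def run_interleaved(jobs):
    """jobs: iterables that yield small work increments (e.g. one micro-batch each)."""
    ready = deque(iter(job) for job in jobs)
    while ready:
        job = ready.popleft()
        try:
            next(job)            # run one increment on the shared cluster
            ready.append(job)    # requeue so every other job also gets a turn
        except StopIteration:
            pass                 # job finished; its slot is freed for the rest
```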

Serverless Compute Manager

When developing your ingestion, ETL, or table optimization workloads, sometimes deploying and managing the compute resources is even more difficult than writing the code. Seasoned engineers constantly wrangle with monitoring, operating, debugging, and tuning their Spark/EMR/Kubernetes clusters. Because Spark tuning remains a dark art, many teams end up with over-provisioned and wasted resources as they try to make their pipelines run faster or avoid dreaded out-of-memory (OOM) errors. Managed solutions for Spark are not new, but most offer generic Spark runtimes that are unaware of the specific characteristics of lakehouse workloads. Our Serverless Compute Manager is purpose-built to cater to the needs and nuances of Spark jobs with lakehouse storage, such as Apache Hudi.

Elastic cluster scaling

Elastic cluster scaling in OCR goes beyond the generic Spark dynamic allocation used in other managed Spark platforms. OCR combines CPU/memory/IO usage at runtime with intelligence about your workloads to make scaling decisions. OCR knows how much data is pending to be written and whether specific lakehouse table optimizations must run inline or asynchronously. While multiplexing jobs, Onehouse can predict workload spikes and scale up or down more rapidly to optimize for cost or hit performance targets.
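
A hedged sketch of what a workload-aware scaling decision could look like follows; the thresholds and sizing factors are invented for illustration, the point being that backlog and inline table services factor into the decision alongside utilization:

```python
def desired_executors(current, cpu_util, pending_gb, inline_compaction, gb_per_executor=8.0):
    """Illustrative scaling heuristic: size for the backlog, not just current busyness."""
    backlog_target = max(1, round(pending_gb / gb_per_executor))
    if inline_compaction:
        backlog_target = round(backlog_target * 1.5)   # compaction rides the write path
    if cpu_util < 0.3 and backlog_target < current:
        return backlog_target                          # scale down once idle and drained
    return max(current, backlog_target)                # scale up ahead of the spike
```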

Multi-cluster management

Multi-cluster management offers a way to dedicate resources to certain pipelines. You can achieve complete compute isolation between clusters so that high-priority workloads never sacrifice resources to others with more flexible SLAs. The OCR Compute Manager allows you to set fine-grained size maximums to guarantee workloads operate within budget constraints. All you have to do is set a maximum allowed budget, and OCR takes care of efficiently scaling multiple clusters up and down to match your workload demands. OCR comes out of the box with rich monitoring and alerting, with detailed cost attribution so you can identify precisely how much each pipeline costs. Onehouse provides different cluster types, specifically built for different lakehouse workloads such as ingestion or table optimization.
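
Shape-wise, per-cluster isolation and budget caps could be expressed along the lines of the following sketch; the cluster types, field names, and units are assumptions for illustration, not Onehouse's actual configuration schema:

```python
clusters = {
    "ingestion-high-priority": {
        "type": "ingestion",          # dedicated to latency-sensitive pipelines
        "min_workers": 2,
        "max_workers": 20,            # hard ceiling keeps spend within budget
        "monthly_budget_usd": 5000,
    },
    "table-optimization": {
        "type": "table_services",     # clustering/compaction never steal ingest capacity
        "min_workers": 0,             # scales to zero when no maintenance is pending
        "max_workers": 8,
        "monthly_budget_usd": 1500,
    },
}
```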

Serverless bring-your-own-cloud (BYOC) 

When it comes to deploying compute as a service, most vendors offer either a serverless or a bring-your-own-cloud (BYOC) model, and each comes with tough tradeoffs. Serverless offerings today require compute to be hosted in the vendor’s account, forcing your data to leave your security and compliance boundary. BYOC, while deployed securely in your VPC, often sacrifices the serverless experience by giving you the burden of sizing, monitoring, upgrading, and operating clusters.

OCR delivers the BYOC model in your VPC without losing the benefits of a fully managed serverless experience. With OCR in your VPC, you have complete control over the network and security requirements to ensure no data leaves your compliance or residency boundaries. This also gives you the opportunity to work with your cloud provider directly to negotiate discounts, purchase reserved instances, or leverage spot nodes to pocket those extra savings. Because the experience is serverless, you simply define maximum limits for budget constraints and we handle the rest: efficient scaling as well as the tedious upgrades, security patches, and on-call monitoring.

High-Performance Lakehouse I/O

Onehouse Compute Runtime also ships with optimized software that reimagines the core I/O building blocks in the data lakehouse. The most common operations in a data lakehouse include reading an existing object/file from cloud storage, compressing/decompressing, serializing/deserializing, merging changes, performing index lookups, and writing the result back to cloud storage. While file formats are columnar, much of the remaining stack does not work at column/cell granularity based on which columns each operation actually touches. Our engineering team noticed that this makes a huge difference when running these lakehouse workloads, and as a result, we have reimplemented some core parts of our runtime to take full advantage of the compute, memory, and network resources within a given cluster.

Optimized storage access

Accessing data stored in lakehouse table formats is trivial in principle, but under the hood, object storage list, get, put, and other operations quickly add cost and latency to workloads. OCR’s High-Performance Lakehouse I/O offers optimized storage access, which reduces the number of cloud storage requests compared to the Parquet readers found in OSS table format APIs or other Spark runtimes. These optimizations work hand-in-glove with table optimization capabilities to correctly partition and size files within a table, minimizing Parquet metadata processing overhead during query execution.
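
One common request-reduction technique, sketched below purely as an illustration (the gap threshold and structure are assumptions, not the OCR internals), is to coalesce the byte ranges a reader needs, such as several Parquet column chunks, into fewer cloud storage GETs:

```python
def coalesce_ranges(ranges, max_gap=1024 * 1024):
    """Merge nearby (start, end) byte ranges so one GET can serve several chunks."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# Four column-chunk reads collapse into two GET requests:
print(coalesce_ranges([(0, 100), (200, 4096), (50_000_000, 50_004_096), (50_010_000, 50_020_000)]))
```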

Parallel pipelined execution

Fully utilizing a given cloud VM/server calls for efficient use of the various resources (network I/O, compute CPU cycles, storage IOPS) concurrently over time, minimizing the total task duration. OCR’s High-Performance Lakehouse I/O introduces parallel pipelined execution (built on classic principles around event queuing such as SEDA), which can activate these different resources simultaneously, such that each operates near peak throughput during task execution. For example, we don’t want the CPU to sit idle while network I/O is in progress, and vice versa. To avoid this situation, OCR starts processing parts of the object/file while it is still being read from cloud storage.
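
Here is a minimal sketch of the SEDA-style idea under illustrative assumptions: a download stage and a decode stage connected by a bounded queue, so network I/O for the next chunk overlaps with CPU work on the current one (fetch_chunk and decode_chunk are placeholder callables supplied by the caller):

```python
import queue
import threading

def pipelined_read(chunk_ids, fetch_chunk, decode_chunk, depth=4):
    """Overlap fetching (network-bound) and decoding (CPU-bound) across chunks."""
    buf = queue.Queue(maxsize=depth)      # bounded queue gives backpressure between stages
    DONE = object()

    def downloader():
        for cid in chunk_ids:
            buf.put(fetch_chunk(cid))     # network I/O for chunk N...
        buf.put(DONE)

    threading.Thread(target=downloader, daemon=True).start()

    results = []
    while (chunk := buf.get()) is not DONE:
        results.append(decode_chunk(chunk))   # ...while the CPU decodes chunk N-1
    return results
```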

Vectorized columnar merging 

High-Performance Lakehouse I/O also introduces advanced vectorized columnar merging techniques that accelerate write operations by up to 4x. Writes and updates are widespread in data lakehouse workloads. The popular Redset paper, published by AWS from an analysis of their Redshift fleet, reports that ~40% of queries are updates and ~70% of query runtime is dedicated to updates. Whether you are applying an update/delete, or even compacting small files or sorting data within files, a merge-style operation occurs. To handle these operations more efficiently, we developed a highly optimized partial columnar merge algorithm that taps into the SIMD capabilities of the underlying hardware. With this, the cost of these operations now scales with the number of records or the volume of data actually modified. Without the vectorized columnar merge, the cost of compacting a file is the same whether a single record has changed or most records have been updated.
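
To illustrate the concept only (using NumPy's array operations as a stand-in for hand-tuned SIMD, and nothing resembling the actual OCR code), a partial columnar merge touches just the columns affected by an update and applies the changes as bulk array operations rather than a per-record loop:

```python
import numpy as np

def merge_column(base: np.ndarray, updated_positions: np.ndarray, new_values: np.ndarray) -> np.ndarray:
    """Apply updates to one column in bulk; the scatter below is vectorized, not a Python loop."""
    merged = base.copy()
    merged[updated_positions] = new_values
    return merged

def merge_file(base_columns: dict, updates: dict, updated_positions: np.ndarray) -> dict:
    # Only columns present in `updates` are merged; untouched columns pass through as-is,
    # so the cost tracks how much data actually changed.
    return {
        name: merge_column(col, updated_positions, updates[name]) if name in updates else col
        for name, col in base_columns.items()
    }
```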

Summary

Developing this specialized execution runtime demanded inventive solutions to complex challenges—exactly the kind we thrive on at Onehouse. Amidst the complexity of the techniques, the beauty of the new Onehouse Compute Runtime is how it seamlessly operates in the background. Bring any data ingestion, ETL, table service, or any other lakehouse workload to Onehouse, and you will see guaranteed performance and cost savings.

Onehouse delivers a truly independent and universal foundation for your data lakehouse. Your organization will benefit from well-organized Hudi, Iceberg, and/or Delta Lake tables that are performance-certified and ready to be used by Snowflake, Databricks, BigQuery, Redshift, Athena, Starburst, CelerData... you name it. Perhaps more important than your tables being ready to use, a single copy of your data is now portable, interchangeable, future-proof, and available to the various tools that support your diverse use cases.

If you are ready to decouple your query engines from data management and make your data lakehouse universal, or you want to see the blazing fast speed of Onehouse Compute Runtime in action, reach out to us at gtm@onehouse.ai for a generous free trial. If you want to join the team working on leading edge innovations like this, we are hiring top talent.
