At Onehouse, we’ve always believed that the future of data infrastructure is open. Openness delivers freedom of choice and interoperability, levels the playing field, and speeds up innovation. With the expanding support for open formats in recent years, the first wave of this movement centered around liberating the data. But if we zoom out, it becomes clear that it was never just about data—it’s also about the control of compute used to process and consume the data.
We’ve made strides towards making data open with file formats such as Apache Parquet, open lakehouse storage frameworks such as Apache Hudi and Delta Lake, open table formats such as Apache Iceberg, and data interoperability initiatives such as Apache XTable (Incubating) and Delta Lake UniForm, along with a budding renaissance of open data catalogs. But we are still often restricted to closed compute, because achieving open compute on open data is not as easy as it should be.
That’s why today we are excited to announce Open Engines™. This addition to our open data lakehouse platform brings the ability to deploy open source engines on top of open data (inside or outside Onehouse) with a few clicks, at lower maintenance cost than self-installed deployments. The first engines available today are Apache Flink™ (stream processing), Trino (BI and analytics), and Ray (AI/ML and data science).
Onehouse is already the industry’s most open data platform, providing interoperability across all table formats and synchronizing metadata across all the leading catalogs. Open Engines now complements this interoperability with the ability to quickly and reliably deploy and run a best-of-breed open source engine against any table managed by Onehouse.
With Open Engines, Onehouse is now removing the final barrier to realizing a truly universal data lakehouse and finally flipping the defaults—for both data and compute—to open.
Open table formats have captured the industry’s imagination over the past few years. They’ve unbundled data from any compute engine and turned the data lakehouse architecture into a viable alternative to data warehouses. Even with robust open table formats or lakehouse storage, users still have to manually select, deploy, configure, and manage compute engines, with each compute engine vendor claiming to be the absolute best for every workload under the sun.
Spoiler alert: no engine is excellent for all data workloads! This statement is backed by years of hands-on experience building and operating planet-scale data infrastructure at Uber and LinkedIn, and by supporting an open source community with thousands of data lakehouse users. Check out our deep-dive blogs comparing analytics, data science and machine learning, and stream processing engines, which show, for example, that Apache Spark™ is well-rounded but not necessarily the best engine in any of these categories.
There are broadly two categories of engines in the market: cloud data warehouses with proprietary compute layers and formats, and open source engines with commercial offerings from a handful of vendors. Typically, users start by choosing what they believe to be the best engine for their immediate need and then settle on a combination of file, storage, and table formats that supports that engine well. But in a world where open data formats are first-class citizens, that’s a backward approach that can compromise your architecture from the start.
It’s time to break this cycle. And it starts by acknowledging a critical inversion in how most teams build their data stacks today.
We’ve often seen organizations start with a cloud warehouse such as Snowflake, which defaults to a proprietary storage format and catalog that makes it very difficult to bring data science engines such as Apache Spark™ and machine learning engines such as Ray to the same data. We have also routinely encountered users who tried to fit large-scale ETL workloads into interactive query engines such as Trino or Athena and suffered painful data freshness and pipeline efficiency issues. This tunnel vision of focusing on the engine first imposes severe limitations on downstream users:
Pick the wrong engine or its variant: Engine vendors often specialize around their own use cases, and their recommendations may not carry weight beyond them. Starting engine-first complicates even basic tasks like performance benchmarking: users can end up paying for managed versions of OSS engines or for closed data warehouses because the underlying data is optimized (or not optimized) differently, even when compute performance is comparable.
Limited flexibility: Users mold data to fit the assumptions and constraints of the chosen engine. This often results in tight coupling—for example, using a warehouse-specific table format, storage layout, and data catalog where they may not be the best fit. Users may be unable to choose optimal write/read formats. The outcome? Data could become siloed and less reusable outside that engine. These risks are more pronounced with cloud data warehouses, which all default to closed data formats and proprietary data catalogs.
Data optimization is ignored: Most compute vendors excel at optimizing their engines’ compute runtimes, not the data in storage. The subject matter expertise on core data lakehouse technology required to deeply optimize storage is concentrated within one or two companies, including Onehouse. Also, each engine typically optimizes compute only for its own workloads (e.g., BI), with minimal optimization for other classes of queries executed outside it. While engine vendors have little incentive to speed up a competitor’s queries, users who want open data platforms are left with the painful task of optimizing across use cases.
Performance tax: Data scale is ever-growing, which makes it essential to run purpose-built specialized engines to handle different use cases. The wrong engine choice can slow you down and inflate costs—just a 20% difference in performance could mean millions of dollars over time in wasted cloud spend. If you are unable to bring other engines to your data, you are limiting your purchasing power with existing vendors.
Fortunately, there is a battle-tested, well-trodden alternative. Companies with best-in-class data infrastructure do the opposite: they make the engines fit their data and future-proof their data in open formats. The table below presents the open data stack at some of these forward-thinking companies. The astute reader may notice how the dominant engines changed over a decade, from 2014 (Pig, MapReduce, etc.) to 2024 (Flink, Ray, Spark). Still, the open data formats have ensured that the new engines support the older data, avoiding endless migration projects.
A common misconception is that such a rich data platform is only helpful at the extreme data scale at which these companies operate. These companies do have a greater concentration of data and infrastructure engineering talent. However, picking different engines for different use cases is more about features (e.g., stream processing engines have unique windowing capabilities) and ergonomics (e.g., Ray can natively run Python libraries across the ML ecosystem), as the sketch below illustrates.
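As a concrete illustration of that ergonomics point, here is a minimal sketch using the open source Ray API, with a trivial function standing in for a real ML workload:

```python
# Ray's ergonomics: one decorator turns an ordinary Python function --
# and any Python ML library it calls -- into a distributed task.
import ray

ray.init()  # starts a local cluster, or connects to an existing one

@ray.remote
def score(batch):
    # stand-in for real work with pandas, scikit-learn, or torch
    return [x * 2 for x in batch]

futures = [score.remote(b) for b in [[1, 2, 3], [4, 5, 6]]]
print(ray.get(futures))  # [[2, 4, 6], [8, 10, 12]]
```

No cluster rewrite and no new DSL: the same Python that runs on a laptop scales out across a Ray cluster.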
In the commercial engine ecosystem, vendors often limit the open source experience of their engines to nudge users towards their managed offerings. This raises the barrier to entry for deploying and operating open compute engines. As a result, companies shy away from an “open compute on open data” baseline, purely because of operational complexity and a lack of engineering resources to wrangle three or four engines. Instead, they settle for a default engine.
At Onehouse, we’ve spent more than six years supporting lakehouse workloads across all these open source engines, resolving more than 2,700 support issues along the way. We bring that experience with us as we launch Open Engines to break this catch-22 and make it easier and cheaper than ever to bring open compute engines to your open data tables.
Open Engines is a new Onehouse platform component that enables users to quickly and reliably deploy their favorite open source engines to process or query data for all their open tables. Open Engines automates the infrastructure deployment of open source engines on top of Onehouse Compute Runtime and connects them to the tables created or managed inside or outside Onehouse. Open Engines enables a more natural data-centric approach to building your data platform, with the freedom to choose between “self-installed open-source” and “managed” solutions at every step.
Open Engines aims to unlock a level of choice and flexibility often only within reach of the most advanced data teams. It opens up new options for processing, modeling, or analyzing data without requiring organizations to provision tools and infrastructure, while keeping their stack free from redundant data copies and the vendor lock-in of proprietary engines and catalogs.
Open Engines is not about adding one more compute vendor to your decision matrix. It’s about ensuring that users are not prevented from adopting OSS engines by a lack of engineering resources or operational bandwidth. Thus, Onehouse’s preview pricing is carefully designed to save time and money for users deploying their own open source compute engines. The table below summarizes what we have heard from our customers about the deployment, maintenance, and operational overhead required to self-manage these OSS engines at different scales. This self-managed overhead carries a substantial monthly cost in engineering time expended by data teams.
Onehouse Open Engines will lower your self-management costs by more than 10x, and Onehouse customers get a free cluster with unlimited queries that can scale up to 20 OCUs (Onehouse Compute Units). This means customers can get started on an open data lakehouse and spin up a cluster large enough to run serious production workloads right within Onehouse.
Now, let’s dive deeper into the additional benefits and capabilities that Open Engines unlocks for users. While there are many open source engines on the market, we have selected three to integrate at launch, based on the most common use cases Onehouse addresses.
Several challenges are associated with deploying, configuring, testing, running, and managing self-installed compute engines. Allocating resources, setting up infrastructure, and implementing security controls can require weeks or months of setup, followed by constant maintenance and on-call rotations for your teams. With Open Engines, we provide production-ready, optimized compute engines with a few clicks, and they run securely in your private cloud. Engines run on compute clusters on top of the Onehouse Compute Runtime, creating a single pane of glass for all your clusters with built-in cost monitoring and controls. You can set limits for each compute engine cluster and watch clusters elastically scale up and down thanks to the Onehouse runtime, as the sketch below illustrates.
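The field names in the following sketch are invented for illustration and are not the actual Onehouse API; it only conveys the shape of a per-cluster definition with elastic scaling limits that double as cost controls:

```python
# Hypothetical sketch only: these field names are invented to convey
# the idea of per-cluster limits, not the real Onehouse API.
trino_cluster = {
    "engine": "trino",
    "min_ocus": 2,             # floor the cluster can scale down to
    "max_ocus": 20,            # hard ceiling, doubling as a cost control
    "network": "private-vpc",  # the engine runs in your own cloud account
}
```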
By simplifying the deployment and management of open source engines that default to open formats, we aim to eliminate steep learning curves and encourage adopting open data architectures as the default. That will, in turn, promote adoption of these engines and expose them to a broader range of users who might otherwise not have gone through the effort of deploying them and would instead have rushed into closed, proprietary offerings.
A tactical hurdle users face when attempting to mix and match different engines on the same data is catalog management: each engine supports only one or two catalogs to discover tables and enforce access controls on them. Compute engines spun up using Open Engines seamlessly integrate with the catalogs supported by Onehouse’s multi-catalog capabilities.
We are actively working on some exciting enhancements to this functionality, including synchronizing permissions across these engines’ catalogs and helping enforce a uniform governance structure across them. Given that Onehouse also supports all major commercial catalogs and engines, it’s as simple as adding a new catalog sync when you choose a managed compute engine or cloud warehouse for your data. The combination of open data formats, Open Engines, and Onehouse’s multi-catalog sync makes it easy to switch engines with a few clicks, with no downtime or data migration.
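For example, once table metadata is synced to a catalog, pointing a new engine at the same data is little more than a connection string. Below is a minimal sketch using the open source `trino` Python client (pip install trino); the endpoint, catalog, schema, and table names are placeholders for your deployment:

```python
# Querying a lakehouse table through a synced catalog with the open
# source Trino client. All names below are placeholders.
import trino

conn = trino.dbapi.connect(
    host="trino.example.internal",  # placeholder Trino endpoint
    port=8080,
    user="analyst",
    catalog="hive",      # the same catalog other engines sync to
    schema="analytics",
)
cur = conn.cursor()
cur.execute("SELECT order_id, amount FROM orders LIMIT 10")
for row in cur.fetchall():
    print(row)
```

Switching engines changes the client and the endpoint, not the data or the catalog entries underneath.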
Even the world’s fastest SQL engine cannot deliver premium compute performance against poorly maintained tables. While warehouse and query engine vendors tend to attribute the superior performance of their products to compute optimizations, well-optimized tables in open table and file formats are a prerequisite to achieving such performance. Proprietary warehouse vendors typically make their storage optimizations available only to their own engines. Data lake query engines, on the other hand, have little control over how data is stored and how tables are optimized on top of open table formats.
Tables accessed by Open Engines can be optimized by OSS frameworks such as Hudi, or by our engine-neutral, best-in-class table optimization service, Table Optimizer. With well-optimized tables under the hood, query engines such as Trino benefit from blazing-fast reads. Onehouse will rapidly deliver the latest lakehouse features, such as secondary indexes, to Open Engines, improving read and write speeds. For the Hudi user community, this means faster and more nimble delivery of fixes to the Trino connector through Open Engines.
The figure below illustrates this by clearly distinguishing performance optimizations based on whether they are compute optimizations (e.g., fast SQL planning and execution) or storage optimizations (e.g., compaction and clustering). Storage optimizations often provide 10-100x performance improvements, especially on poorly maintained tables.
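For a sense of what storage optimizations look like in the open source layer, here is a minimal sketch of enabling inline compaction and clustering with standard Hudi write configs from PySpark. The table, field names, and path are placeholders, and the sketch assumes the Hudi Spark bundle is already on the classpath (e.g., supplied via spark-submit --packages):

```python
# Storage-side optimization with open source Hudi write configs:
# inline compaction and clustering keep the table fast to read no
# matter which engine queries it. Names and paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("o1", "2024-01-01T00:00:00", 42.0)],
    ["order_id", "updated_at", "amount"],
)

hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    # compaction: periodically merge delta logs into base files
    "hoodie.compact.inline": "true",
    "hoodie.compact.inline.max.delta.commits": "5",
    # clustering: rewrite small files into well-sized, sorted files
    "hoodie.clustering.inline": "true",
    "hoodie.clustering.inline.max.commits": "4",
}

(df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://my-bucket/lake/orders"))  # placeholder path
```

Table Optimizer runs equivalent table services for you, so Open Engines users get these benefits without hand-tuning such configs.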
Open Engines raises the bar for baseline performance by enabling users to unlock advanced storage optimizations for open engines without being forced to pay for premium compute engines when compute optimizations are not critical for a workload. This enables users to reduce their overall cloud spend by using open engines (e.g., OSS Trino) while being able to seamlessly upgrade to closed engines (e.g., Starburst) when they desire a higher level of performance.
Building an open data stack by stitching together multiple independent libraries (e.g., Hudi, XTable, Iceberg, Delta, Spark, Kafka…) is not for everyone. Keeping the stack upgraded and maintained as newer versions of all these dependencies appear is not for the faint of heart. This is a mundane but acute pain that impedes users from operating open data stacks and enjoying the latest features. Onehouse maintains, tests, packages, and certifies binary versions of open table formats (Hudi, Delta, Iceberg, XTable) and compatible open engine versions, eliminating the hair-pulling needed to assemble an open compute stack on top of lakehouse storage.
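As a small taste of that version wrangling, the sketch below pins a Hudi bundle that must match the engine build's exact Spark and Scala versions. The artifact coordinates follow Hudi's published naming scheme (hudi-spark<X.Y>-bundle_<scala-version>), but always check the compatibility matrix in the Hudi docs before pinning versions for your own stack:

```python
# Pinning a table-format bundle to match the engine's Spark and Scala
# versions -- one cell of the compatibility matrix users must maintain.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Spark 3.5 + Scala 2.12 -> the spark3.5 bundle built for Scala 2.12
    .config("spark.jars.packages",
            "org.apache.hudi:hudi-spark3.5-bundle_2.12:0.15.0")
    # serializer setting recommended by the Hudi docs
    .config("spark.serializer",
            "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)
```

Multiply that pinning exercise by every engine, format, and connector in the stack, and the value of certified, pre-tested binaries becomes clear.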
Open Engines will be automatically upgraded to the latest stable versions that work with your data in storage, returning a lot of time (and maybe also joy in life) to your data teams to focus on high-impact business projects.
In summary, by making open source engines more accessible, Onehouse hopes to expand their use and encourage innovation in the ecosystem. We have consistently urged users to give more consideration to their architectures up front and to think through requirements such as performance, cost, queryability, and interoperability. Open Engines removes the artificial pressures that rush users into an engine, instead enabling them to focus on understanding the data at hand and to try different engines to pick the right one for their workloads.
Open Engines is available in private preview starting today. Get in touch with us to request access. If you want to learn more, register for our upcoming webinar to go deeper and see demos of how it all comes together. Don’t forget to check out Open Engines on our site, which also links to some comprehensive comparisons between the leading ML/DS, analytics, and stream processing engines for your workloads.
Approach your data platform correctly: data first, engines next.