June 26, 2024

Raising our Series B and Our Quest For the Most Open and Interoperable Data Platform

Written by:

Vinoth Chandar

If you’ve been following Onehouse, you’ll know that the past year has marked many important milestones on our journey to help every organization build an open data architecture. From kickstarting an industry conversation around data interoperability with last year’s release of OneTable, now Apache XTable (Incubating), to keynoting the first Open Source Data Summit, to sharing some of our customer journeys, it is clear that we’ve made great progress towards our vision.

Today, I’m excited to share that we have raised a $35M Series B led by Craft Ventures, with participation from existing investors Addition and Greylock Partners, which will help us accelerate our pace of innovation and product development. Along with this announcement, we’re also launching two new products: LakeView, a free lakehouse observability tool for the OSS community, and Table Optimizer, which automates data lakehouse optimizations.

Vision

But first, let’s go back to the vision. We believe users need an open data architecture. Why? Because it’s the only way for organizations to unlock their most valuable asset - their data - and put it to productive use in support of existing and emerging use cases. Whether Generative AI, predictive ML, real-time analytics, or traditional BI - there’s too much innovation in the data space to lock data into a single, vertically integrated platform. Users must be free to choose the right tool for the use case. Over almost two decades, it's been proven that “one size fits all” is an idea whose time has come and gone.

Traditional databases and data warehouses have always been tightly integrated systems. Storage and processing are tightly coupled and optimized for performance. However, there’s a significant downside to this coupling. The biggest decision we see users regretting is choosing their compute engine vendor first and then having that choice dictate, “top-down,” their storage and data management options. An integrated data platform locks ingestion, storage, ELT pipelines, catalog and data services to the chosen engine - a cloud warehouse, database, or data lake engine. Reusing or sharing data with other processing engines is difficult or impossible.

Instead, we should think “bottom-up,” starting from the data, which is, after all, the enduring asset in the stack. This requires an open data architecture, where storage and data management are decoupled from the query engine. In this scenario, data is ingested, transformed, and managed just once. Any engine can then query it, and users can take the time to evaluate multiple engines properly, apply the right engine for the right use case, and even migrate between engines easily. Users free their data from lock-in, create a single source of truth, and can then bring it to any use case. Perhaps most importantly, this approach future-proofs your data architecture. For example, who would have guessed five years ago that vector databases would become so important to support RAG applications?

That’s our vision for the data lakehouse and our original motivation for building the first lakehouse nearly a decade ago at Uber. Though it has taken a while, thanks to all the shifts and twists in the data ecosystem, this vision can now be a reality broadly across the industry. Compared to 2021, when the world was divided into two camps—data lakes and data warehouses—we’re entering a great convergence into an open data lakehouse as the center of all data gravity.

Building the Vision

While the vision is clear, getting from our legacy state to that vision is much harder. It requires openness and interoperability at all layers of the stack. While openness is absolutely necessary, it's not sufficient. Without a strong commitment from vendors on interoperability and compatibility, we risk fracturing data into silos again, even on top of open formats. While data lock-in is at the top of everyone’s mind, compute lock-in can be even worse.

We specifically still need to solve for the following:

Table format interoperability. We have three widely used data table formats today: Apache Hudi, Apache Iceberg, and Delta Lake. These projects all have specific design points and strengths. Hudi provides a feature-rich open platform with the best performance for incremental processing with fast-changing data, and is the format of choice for managing streaming or CDC data at scale. Delta Lake is best integrated with Databricks, has great features, and is very popular amongst Apache Spark users. Iceberg is best integrated with Snowflake and has found favor as an easy way to represent table snapshots/statistics between engines. Just like users shouldn’t lock themselves into a particular data platform, we believe that users shouldn’t be locked into a specific table format. In fact, the mix-and-match of table formats unlocks some exciting possibilities, which is why we contribute to Apache XTable (Incubating) and worked across the aisle to help make Delta Lake and Hudi compatible. We expect this healthy trend to continue with the recent Databricks/Tabular development. We are at least a few years away from total unification, since new table formats are emerging, feature sets are still divergent, and existing pipelines still pump many exabytes of data into these three formats.

Catalog interoperability. All the progress around table format interoperability has now shed light on data catalogs as a new point of lock-in. Each compute engine manages table metadata and access control policies/enforcement using its own catalog. The industry doesn’t have a single, de facto standard catalog that integrates well with all data platforms and query engines, such that users can manage their data permissions centrally. Over the past few weeks, Snowflake announced Polaris Catalog and Databricks open sourced Unity Catalog, with support for all three projects. We are eager to participate in efforts to define a new common catalog API that all engines can agree on, along with a production-grade OSS replacement for Hive metastore. Even with an API standard and OSS catalogs, each compute engine will still run its own catalog instance for valid technical reasons. So, we believe the right approach is to support and sync data to all the leading catalogs.
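To make the "sync to all the leading catalogs" idea concrete, here is a minimal sketch in Python. It is not how Onehouse or any particular catalog implements syncing; every name in it (TableMetadata, Catalog, InMemoryCatalog, sync_to_catalogs) is hypothetical, and the in-memory class merely stands in for a real catalog client. The only point it illustrates is the shape of the approach: one engine-neutral table description, registered in several catalogs, with a failure in one catalog not blocking the others.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass(frozen=True)
class TableMetadata:
    """Hypothetical, engine-neutral description of one lakehouse table."""
    name: str
    location: str      # storage path of the table, e.g. an object-store URI
    table_format: str  # for example "hudi", "iceberg", or "delta"


class Catalog(Protocol):
    """Hypothetical interface; real catalogs each expose their own API."""

    def register(self, table: TableMetadata) -> None: ...


@dataclass
class InMemoryCatalog:
    """Stand-in for a real catalog client, used only for illustration."""
    label: str
    tables: dict[str, TableMetadata] = field(default_factory=dict)

    def register(self, table: TableMetadata) -> None:
        # Idempotent upsert: re-syncing the same table overwrites the entry.
        self.tables[table.name] = table


def sync_to_catalogs(table: TableMetadata, catalogs: list[Catalog]) -> dict[str, bool]:
    """Register one table in every catalog and report per-catalog success."""
    results: dict[str, bool] = {}
    for catalog in catalogs:
        label = getattr(catalog, "label", type(catalog).__name__)
        try:
            catalog.register(table)
            results[label] = True
        except Exception:
            # One failing catalog should not block the others.
            results[label] = False
    return results
```

In this sketch the table is described once and each engine's catalog receives the same entry, so permissions and discovery can stay consistent while every engine keeps running its own catalog instance.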

Open compute services for ingestion, transformation, and management. For years, organizations have relied on closed, point-to-point tools to ingest data into their data stores, whether we go back decades or look at relatively modern solutions. We should give users the option to build on open ingest services that load data into the lakehouse from practically any data source with up-to-the-minute data freshness, all built on the same strengths in incremental processing. Similarly, users should be able to manage their tables using OSS, keeping them optimized for queries so that their open tables can match a proprietary data warehouse table in performance. Finally, users can use popular open-source data processing frameworks such as Apache Spark or Apache Flink to transform data. This optionality to build on top of your open data without being forced into closed compute services is key to avoiding compute lock-in. Otherwise, your data could be imprisoned by closed compute services, in spite of open data formats.

Easy-to-use Data Lakehouse as a Service. Advanced engineering teams may have all the means they need to build their own open data lakehouse. And this is precisely what my team did at Uber years ago. Unfortunately, not every organization has all the resources or time to undertake a project like this. Because of this, many teams end up choosing closed solutions that also close off their data, despite fully realizing the value of an open data architecture. That’s why we need vendors to step up and deliver a fully managed data lakehouse as a cloud service. This cloud service should provide all the common needs like managed ingestion, table optimization, transformations, interoperability, security, performance, and reliability, on top of open foundations as described above, without creating lock-in. Such a service could eliminate the artificial choice to be made between an open foundation for your data and getting projects done for your business. Users would be able to mix-and-match managed services and open-source solutions seamlessly for various needs, in a flexible manner. That’s exactly the product we are building at Onehouse.

Figure 2. The Onehouse managed data lakehouse service

Onward and Upward

So, what does this new funding mean? First, I’d like to thank our investors at Craft Ventures, Addition, and Greylock Partners for believing in our vision, along with our early customers, who have given this vision meaning.

As Michael Robinson of Craft Ventures said, “Onehouse enables organizations to deploy a lakehouse in a matter of minutes – a critical function as the data lakehouse has become the standard architecture to centralize data and power new services like real-time analytics, predictive ML, and GenAI. One day, every organization will be able to take advantage of truly open data platforms, and Onehouse is at the center of this transformation.”

We will continue to invest in the core Onehouse platform and open-source projects such as Hudi and XTable. Expect to see our development team grow quickly but carefully. You can also expect to see more market presence from Onehouse, as we assemble a world-class go-to-market team to bring this unique product to our customers.

If it sounds like an exciting time to be part of Onehouse, it is. It's also a great time to join us on this mission to make data open.
