January 2, 2025

Unbundling Your Data Platform: How Open Data Lakehouses are Changing the Game

At this year’s Open Source Data Summit, Vinoth Chandar, founder and CEO of Onehouse, originator of the data lakehouse architecture, and long-time expert in open source data infrastructure, shared insights on the rising trend of “unbundling” data platforms: a shift that lets organizations build modular, interoperable data ecosystems tailored to their needs. Drawing on his experience with data infrastructure during the hyper-growth stages of LinkedIn and Uber, Vinoth discussed the limitations of traditional data systems, laid out the advantages of an open data lakehouse approach, and shared examples of companies that have embraced it, such as Walmart and Notion.

The Unbundling Movement in Data Platforms

Unbundling is the practice of breaking a product’s independent modules into multiple products, each specialized in its function. It’s an approach that has reshaped industries over the years; broad service platforms such as Craigslist, for example, gave way to tailored products like Zillow for real estate and Upwork for freelancing. Unbundling a data platform, Chandar explained, “means decoupling storage, compute, query engines, and all the different data tools that you're using to interoperate seamlessly with one another.”

“It's like building your data platform with Lego blocks—each block representing a trend or technology you should consider to create a flexible, custom fit.” - Vinoth Chandar, Founder of Onehouse 

This modular approach is a major shift from the traditional, all-in-one “bundled” data warehouse model, which typically groups storage, compute, and query features into a single, proprietary platform. By decoupling these components, data platforms can reduce or eliminate dependence on proprietary vendor offerings, allowing organizations to select the best tools for each use case.

The Problem With Bundled Data Platforms

But if bundled data platforms worked fine for years, why are they no longer a valid solution? Bluntly stated, bundled data platforms lock you in. 

It wouldn’t be so bad if they were perfect for every use case. Unfortunately, they are not. For example, the data warehouse has decades of optimizations built in for traditional reporting. But if your use case is around real-time analytics, machine learning, or GenAI, you will need a platform optimized for fresher data at a much larger scale.

Some organizations facing these use cases and their inherent challenges then turn to multiple bundled data platforms. For example, it is not surprising to see engineering or IT organizations working with a data warehouse such as Snowflake, a real-time analytics platform such as ClickHouse, a vector database such as Pinecone, and a catch-all data lake on Amazon S3.

“With significant investments in AI today, choosing a data platform that doesn't scale well for vector embeddings and requires constant updating can quickly burn a big hole in your budget.” - Vinoth Chandar, Founder of Onehouse. 

In large organizations with varied services, each with its own querying and analytics tools, the use of bundled platforms leads to duplicate pipelines, extra integration work, and significant management overhead. Traditional bundled platforms also pose risks around vendor lock-in. Proprietary solutions can tie organizations to a single query engine and storage platform, making it hard to adopt new tools as needs evolve.

An open, unbundled platform enables teams to choose from various storage formats (e.g., Apache Hudi, Apache Iceberg) and computing environments (e.g., Kubernetes), ensuring that organizations can adopt new engines or frameworks as they become available and necessary for future offerings.

Blueprint for The Next Generation of Data Platforms

Vinoth outlined a blueprint for an unbundled internal data platform with several defining goals:

  1. Open Storage for Diversified Data Types: A next-generation platform should support structured, semi-structured, and unstructured data in open file formats, making it interoperable across compute and query engines.
  2. Flexible Compute: By avoiding compute lock-in via a single vendor, teams can blend open source and proprietary components, meaning developers can build cloud-agnostic data pipelines.
  3. Scalability and Efficiency: As data volumes grow exponentially, platforms should optimize for cost-effective scaling, avoiding runaway storage and compute costs.
  4. Modular Architecture: Data stack architectures should be composed of well-defined layers or modules, allowing individual solutions within each layer (such as storage, compute, or query engines) to be easily swapped as needs change. This modular setup helps teams avoid the costly, complex migrations that can come with traditional, tightly integrated architectures.

Building an Unbundled Data Platform

The flexibility of an unbundled data platform allows companies to build custom data infrastructures, combining tools based on specific needs and optimizing cost and performance without major vendor dependencies. Chandar broke down the specific components that make up an unbundled platform, from storage to the analytics engine.

  1. Storage and Table Formats: Open formats like Parquet and ORC, as well as table formats such as Hudi and Iceberg, provide flexible data storage on cloud platforms such as S3, with efficient, incremental processing options.
  2. Metadata and Catalogs: Operational catalogs, such as Hive Metastore, help manage data visibility and accessibility, with new catalog standards allowing for compatibility across different query engines.
  3. Data Management Tools: Tools such as Apache Spark and Apache Flink provide ingestion, transformation, and optimization capabilities to automate processes and ensure fast, low-latency access to data.
  4. Query and Analytics Engines: Open-source query engines such as Trino, ClickHouse, and StarRocks offer flexible analytics capabilities, while specialized machine learning (ML) frameworks support advanced data science tasks.

Unbundling in the Real World

Several prominent companies illustrate the benefits of unbundling data platforms through the use of an open data lakehouse architecture:

  • Uber: Uber’s data stack leverages an unbundled lakehouse with Apache Hudi, storing more than 250 petabytes in open formats. Uber uses Spark, Flink, and Presto as query engines, while an extended Hive Metastore handles metadata across tools.
  • Walmart: Walmart has developed an open data lakehouse that integrates Hudi for data freshness and supports BigQuery, Presto, and Spark to create a flexible and efficient analytics platform.
  • Notion: By unbundling with an open lakehouse, Notion saved more than $1.2 million annually while streamlining data consistency, reducing pipeline failures, and enabling vector search over AI embeddings for model development.

The Path Forward: Creating Future-Ready Data Platforms

The future of data platforms will see more support for unstructured data and catalog interoperability. Onehouse, for example, is developing efficient data storage formats optimized for new AI needs. “We need to blend and add a lot of support for unstructured data formats. And I think that'll complete the picture and make this layer, the storage layer, support all kinds of data,” Vinoth explained. 

Vinoth encouraged attendees to embrace open data lakehouses for the next generation of their data platform. 

“If you're considering your next data platform, build it on an open data lakehouse. Unbundle your storage from any one engine. Engines will change all the time. Provide a modular architecture where the single source of truth on cloud storage is accessible from any engine that you may need now or in the future.” - Vinoth Chandar, Founder of Onehouse. 

The shift to unbundling is about creating a foundation that supports innovation and interoperability without sacrificing flexibility or incurring unnecessary costs. With unbundled platforms, companies can take advantage of best-in-class tools for each of their diverse use cases, while maintaining the adaptability to evolve with the data landscape. “It's very important to invest the time early on,” Vinoth emphasized, “so we can advance the technology, building the data platform without going through peaks and troughs.”

Authors
Ryan Garrett
Director, Product Marketing

Ryan is Director of Product Marketing at Onehouse, and an Onehouse author and speaker. His experience includes Upsolver, ThoughtSpot, Teradata, Treasure Data, and Aster Data, and he was educated at the University of Kentucky and Boston University.

