June 4, 2024

The Universal Data Lakehouse: User Journeys From the World's Largest Data Lakehouse Builders

This post is a brief summary of an expert panel hosted by Onehouse on LinkedIn on April 9, 2024. You can view the full recording and read the highlights below. 

On April 9, 2024, Vinoth Chandar, Founder and CEO of Onehouse, and creator and PMC Chair of the Apache Hudi project, moderated a panel discussion with the builders and operators of three of the largest data lakehouses in the world. In order of appearance in the webinar, the other three panelists were:

  • Satya Narayan, Senior Distinguished Architect, Data Engineering at Walmart Global Tech
  • Karthik Natarajan, Engineering Manager for the Hudi team at Uber
  • Balaji Varadarajan, Senior Staff Software Engineer and Tech Lead for the Data Infrastructure Team at Robinhood and PMC member for Apache Hudi

The panelists described their journeys of adopting data lakehouses. They provided deep insights and historical anecdotes, recounting how they were introduced to lakehouses and describing the challenges they faced while working with the largest lakehouses on the planet. They also discussed the significance and history of Apache Hudi-based systems and how they expect the lakehouse ecosystem to evolve. 

The Universal Data Lakehouse Origin Story

“I remember many days and nights that we were up just to make sure that things were working. It was a pretty interesting time.” — Karthik Natarajan

Companies such as Uber and Onehouse have been pioneers and leaders, inventing, evolving, and pushing the boundaries of data lakehouse software, systems, and architecture. Chandar filled us in on some of the history: “I was part of the team that built the world's first data lakehouse at Uber back in 2016 to address challenges related to real-time transactional data,” he said. Uber donated the project to the Apache Software Foundation, and Onehouse was founded to continue growing the community, build out the underlying open source projects, and provide support for a universal lakehouse. Apache XTable is a recent result of that work, bringing a universal storage layer to the industry at large.

Natarajan’s team at Uber has expanded Uber’s adoption of lakehouse technologies to the levels in use today: they are now using a lakehouse as a universal source of truth across the company. The transition was not easy, and it took many late nights and long days of solving hard problems that hadn’t been solved before. In support of company-wide adoption, Natarajan’s team is pushing the boundaries of what their lakehouse can do: “We want to abstract many of the details of how the lakehouse works away from the customers, similar to how databases hide many of their internal details,” he explained.

“Just because of the size and scale of our systems, the technology we invest in has a significant impact. So we performed a lot of due diligence before deciding to use a data lakehouse.” — Satya Narayan

Walmart Global Tech has been migrating long-established, large-scale processing engines to data lakehouses. At Walmart’s scale, decisions are very visible and carry significant weight, internally as well as across the industry. So, Satya Narayan shared, Walmart goes through a strict and detailed due diligence process when making decisions about data system architecture.

By the time they were considering a data lakehouse, the data team was already supporting multiple disparate lines of business at a large scale. The complexity of Walmart’s business translated into workloads that ranged from real-time transactional processing to slower-moving ETLs—all of which were hosted across multiple hybrid (multi-cloud and on-premises) data systems. 

To cover their full range of internal use cases, Walmart’s team decided to test two extremes from a performance perspective. The first process was more traditional: time-partitioned, unbounded, with a high load of insert operations. The second process sounded much more difficult to scale: not partitioned, bound to real-time requirements, with a heavy dependence on upsert operations. Hudi handled both workloads, and it also performed well across many other dimensions, including cost, community support, compatibility, ecosystem health, and tool integration. Much of this work has been done in public, and findings can be found on the Walmart engineering blog.
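For illustration only, here is a minimal PySpark sketch of how those two extremes might be expressed as Apache Hudi write configurations. It is not taken from the webinar; the table names, fields, and paths are hypothetical, and it assumes a Spark session launched with the Hudi bundle on the classpath.

```python
from pyspark.sql import SparkSession

# Assumes the Hudi Spark bundle is available, e.g. launched with
# --packages org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.1
spark = SparkSession.builder.appName("hudi-workload-sketch").getOrCreate()

# Extreme 1: time-partitioned, unbounded, insert-heavy (hypothetical names).
insert_opts = {
    "hoodie.table.name": "events_by_day",
    "hoodie.datasource.write.operation": "insert",
    "hoodie.datasource.write.recordkey.field": "event_id",
    "hoodie.datasource.write.partitionpath.field": "event_date",
    "hoodie.datasource.write.precombine.field": "event_ts",
}
spark.read.parquet("s3://bucket/raw/events/") \
    .write.format("hudi").options(**insert_opts).mode("append") \
    .save("s3://bucket/lake/events_by_day/")

# Extreme 2: non-partitioned, latency-sensitive, upsert-heavy (hypothetical names).
upsert_opts = {
    "hoodie.table.name": "rt_orders",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.keygenerator.class":
        "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
}
spark.read.parquet("s3://bucket/raw/order_changes/") \
    .write.format("hudi").options(**upsert_opts).mode("append") \
    .save("s3://bucket/lake/rt_orders/")
```

The key difference between the two is the write operation: plain inserts append new records into time partitions, while upserts must locate and merge changes into existing records, which is the harder pattern to scale.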

“Open formats are actually not new. We were using Parquet for a decade.” — Balaji Varadarajan

Robinhood’s Balaji Varadarajan took an entirely different path to using a lakehouse. While his team’s switch to an Apache Hudi-based system paid off in terms of efficiency and cost, their initial decision hinged on regulatory and compliance requirements. As a financial institution that has expanded into the EU and UK, Robinhood has to comply with many regulations and compliance regimes, including the General Data Protection Regulation (GDPR). So they were drawn to the fact that Hudi is open source software, meaning Robinhood engineers can audit and extend the source code as necessary. They were also attracted to Hudi’s support for core functionality required to meet compliance needs, such as its very efficient implementation of data deletion at scale.
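As a rough sketch of that deletion capability (hypothetical table, field, and path names; not Robinhood’s actual pipeline), Hudi exposes a dedicated delete write operation, so erasure requests can be applied as an ordinary write instead of a full-table rewrite:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-gdpr-delete-sketch").getOrCreate()

# A DataFrame holding the record keys (and partition values, if the table is
# partitioned) of the rows that must be erased; the source path is hypothetical.
erasure_requests = spark.read.parquet("s3://bucket/compliance/erasure_requests/")

delete_opts = {
    "hoodie.table.name": "user_profiles",
    "hoodie.datasource.write.operation": "delete",
    "hoodie.datasource.write.recordkey.field": "user_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
}

# Hudi only rewrites the file groups that contain the matching keys, which is
# part of what makes deletion tractable at scale.
erasure_requests.write.format("hudi").options(**delete_opts).mode("append") \
    .save("s3://bucket/lake/user_profiles/")
```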

Beyond compliance requirements, Robinhood considered many other dimensions when evaluating whether to move to a data lakehouse. Apache Hudi-based data lakehouses are fully featured; they provide support for checkpointing and incremental processing, and offer a rich set of connectors. Hudi is also a battle-tested system, used in production and at scale by many high-profile companies. These advantages, as well as Hudi’s support for a wide range of services relevant to running a production lakehouse, all had a part to play in Robinhood’s lakehouse origin story. 

Current Challenges and Benefits of Using Apache Hudi Lakehouses

“Just to give a glimpse of the size and scale, we are processing hundreds of petabytes. We use more than 600K cores to just process the data. It's a massive amount of data.” — Satya Narayan

At Walmart’s scale, one would expect that handling the volume of data and the attendant complexity might be hard or even intractable. However, Narayan reported that, unlike the previous three-tiered ETL architecture, Walmart’s modern data lake is both meeting demands and opening up new opportunities. 

For example, a data lake can expose data at all points in the processing layer. And incremental processing, along with support for fast querying, means that data scientists and analysts can produce analytics much more quickly than before, despite the increased data size. The systems continue to perform and scale, and old bottlenecks, such as read and write amplification, have either disappeared or diminished to the point where they are no longer showstoppers. 
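To make the incremental-processing pattern concrete, here is a minimal sketch (hypothetical path and commit instant) of a Hudi incremental query, which lets downstream jobs read only the records that changed after a given commit rather than rescanning the whole table:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-incremental-sketch").getOrCreate()

# Read only the records written after a known commit instant (exclusive).
incremental_opts = {
    "hoodie.datasource.query.type": "incremental",
    "hoodie.datasource.read.begin.instanttime": "20240401000000",
}

changes = (
    spark.read.format("hudi")
    .options(**incremental_opts)
    .load("s3://bucket/lake/rt_orders/")  # hypothetical table path
)

# Downstream jobs process just the delta, so pipelines stay fast as tables grow.
changes.createOrReplaceTempView("order_changes")
spark.sql("SELECT count(*) AS changed_rows FROM order_changes").show()
```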

Most importantly, data formatting and schema now evolve at the speed of business. “I think this was our watershed moment at Uber as well,” Chandar said. Data lakes applied at a large scale unlock straightforward solutions to common problems related to financial transaction processing.  “[Hard] analytics counting problems become natural processing problems, where you need to join things to each other correctly,” Chandar explained. This solves a “reconciliation activity that is ultra-critical to the entire business.”

“We have been iterating on the lakehouse just to keep in step with the business, with asset transactions being a part of it.” — Karthik Natarajan

Natarajan explained how, at Uber, the engineering and data teams are continuing to push the boundaries of what is possible with data lakehouses in a way that’s tied directly to their business needs. The teams are investing in lakehouse technology across the board, including building interoperability into the software and expanding and maintaining lakehouse integrations for all the query engines in use at Uber. They are also building tools to help the ecosystem be more reliable, and finding ways to support how teams interact with both structured and unstructured data in a way that scales. Data freshness continues to be a core concern, and they’ve blogged about innovative ways to provide freshness guarantees. 

The teams at Uber are also focusing on improving various core areas of functionality. For example, they are experimenting with multi-writer support, new ways to query data, and query optimization techniques. Often, such core improvements take one or two years to build and release. “At Uber, we have always been following the business and we have been democratizing data across engineering, data science, and ML communities, whose use cases are very different,” Natarajan said. This translates into adopting and maintaining open source products where they are available, addressing needs (such as those around HDFS), and innovating in-house to fill gaps. Uber continues to contribute its in-house innovations to the open source community.

“We are no longer bottlenecked on the source database side. Compute resources only need to be provisioned for change volumes, and the overall cost of running our process is greatly reduced.” — Balaji Varadarajan

At Robinhood, Varadarajan explained that their adoption of an Apache Hudi-based data lakehouse allows them to continuously increase performance and reduce costs. They do their work at a very large data scale: Varadarajan’s team supports more than 2,000 tables across more than 100 databases, with as many as 10% having a fifteen-minute data freshness SLA. 

In one example problem Varadarajan described, Robinhood is required to snapshot table data for compliance and audit purposes. In a pre-lakehouse world, that would typically be implemented using full table copies or overwrites. 

That approach, though, doesn’t work well at Robinhood’s scale and complexity. They run into low-level technical issues around storage, I/O bandwidth limits on source databases, complicated key-space partitioning, and auto-provisioning of compute and other resources. They also face algorithmic problems such as skew (a result of working with partitioned key spaces) and read/write amplification in the system. All of these problems extend over millions of rows, and sometimes more than one hundred million rows per database. 

The team at Robinhood regularly solves these kinds of problems by using standard lakehouse-supported tooling, such as change data capture (CDC), with a mix of Apache Hudi, DeltaStreamer, Debezium, and other lakehouse components. We share more about Robinhood’s lakehouse journey in our recent Robinhood blog post.
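In production, pipelines like this are typically driven end to end by tooling such as Hudi’s DeltaStreamer consuming Debezium change events. Purely as an illustration of the underlying pattern (with hypothetical table, field, and path names), the apply step boils down to upserting the change records into the target table, keyed and ordered by fields from the source database:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cdc-apply-sketch").getOrCreate()

# Hypothetical: Debezium change events for one source table, already landed in
# object storage. (Hudi's DeltaStreamer can consume them directly from Kafka.)
changes = spark.read.parquet("s3://bucket/cdc/accounts_changes/")

cdc_apply_opts = {
    "hoodie.table.name": "accounts",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "account_id",
    "hoodie.datasource.write.partitionpath.field": "created_date",
    # An ordering field (e.g. the source database's log sequence number) resolves
    # out-of-order change events so the latest version of each row wins.
    "hoodie.datasource.write.precombine.field": "source_lsn",
}

# Only the changed rows are written, so compute is provisioned for the change
# volume rather than for full-table snapshots or overwrites.
changes.write.format("hudi").options(**cdc_apply_opts).mode("append") \
    .save("s3://bucket/lake/accounts/")
```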

What the Future Holds

The panelists see lakehouses becoming much more interoperable, providing better support for traditional relational database functionality, and providing improved support for AI- and ML-related use cases. 

“CDC is change data capture, but there is also CDA, change data apply. In lakehouses we just talk about the capture, but everything depends on the apply.” — Vinoth Chandar

Both Natarajan and Chandar told us they see a future where lakehouses continue to move closer to approximating relational databases on top of data lakes. Besides adding more basic relational-like functionality, lakehouses will also provide better abstractions, such that consumers won’t have to understand the details of a lakehouse’s internal structure to use it. They will also catch up to the relational world’s support for change data apply (CDA), moving beyond the current focus on CDC, leading to lakehouses that can change or update keys, create indexes, and offer other efficient “apply” operations that historically have only been available in relational databases.

Projects such as Apache XTable are explicitly dedicated to making lakehouses more interoperable, and they build the foundations for an interoperable future. The push for increased interoperability will also support newer socio-technical approaches to thinking about data processing (such as data mesh).

Narayan mentioned that AI/ML training data is already being stored in lake storage and is naturally supported by lakehouses. The next step will be to provide better and more integrated support for data vectors and other approaches that are commonly used for data science and AI/ML use cases. Chandar agreed: “In the community, there are people modeling vector searches using our current index systems,” he said. 

Wrap-up

An exclusive panel made up of four industry veterans took the audience on a journey describing how they adopted and built some of the largest data lakehouses in the world. The session started with a few very different introductions to the world of lakehouses. The speakers then explored the wide spectrum of challenges relevant to operating at the largest scales in the world today. And they described where the lakehouse ecosystem is likely headed next. For a more complete account, please see the full recording.
