In a detailed and deeply technical talk given at the Open Source Data Summit 2023, we learned how Apache Hudi supports critical processes and governance at scale in Robinhood’s data lakehouse. View the full talk or read below for an overview.
In their talk, Robinhood team members Balaji Varadarajan, Senior Staff Engineer, and Pritam Dey, Tech Lead, described their company’s data lakehouse implementation. We learned how Robinhood’s lean data team has been relying heavily on Apache Hudi and related OSS services to handle exponential growth into the multi-petabyte scale.
Key takeaways include details of their tiered architecture implementation; how the same architecture can be applied to track metadata and meet related SLAs (such as for data freshness); and how GDPR compliance and other data governance processes can be efficiently implemented at scale.
View highlights below or on YouTube, and visit the Open Source Data Summit site for the full talk.
Pritam Dey started by giving us an overview of the Robinhood Data Lakehouse and overarching data ecosystem. The ecosystem supports more than ten thousand data sources, processes multi-petabyte datasets, and handles use cases that vary wildly in data freshness patterns (from near-real-time streams to static), data criticality, traffic patterns, and other factors.
The Robinhood Data Lakehouse ingests data from many disparate sources: streams of real-time app events and experiments, third-party data available on various schedules via API, and online RDBMSes such as Postgres. This data must then be made available to many consumer types and use cases—including for both high-criticality use cases, such as fraud detection and risk evaluation, and for lower-criticality ones, such as analytics, reporting, and monitoring.
Dey explained that support for all of the various use cases at Robinhood is built on top of a multi-tiered architecture, with highest-criticality data being processed in Tier 0, and subsequent tiers used to process data with lower constraints. To illustrate how the architecture addresses Robinhood’s needs, Dey drilled down to the implementation of a critical tier within Robinhood’s lakehouse.
Data processing in each tier starts with a data source—for this example, Debezium monitoring a relational database service (RDS) such as Postgres. Before a tier can start, a one-time bootstrap process defines the initial target tables and schemas in the data lakehouse in anticipation of the Debezium-driven change data capture (CDC) stream. Once the tables are in place, a multistep process is fired up and kept alive for the lifetime of the tier:

1. Debezium captures row-level changes from the source database.
2. The change events are published to Kafka streams.
3. Hudi’s DeltaStreamer consumes the streams and upserts the changes into Hudi tables in the lakehouse.
4. Hive metadata stores are updated so the fresh data is immediately queryable downstream.
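To make the ingestion step concrete, here is a minimal PySpark sketch (not Robinhood’s code; the table name, columns, and S3 path are hypothetical) of the kind of keyed upsert into a Hudi table that the DeltaStreamer performs continuously for each tier:

```python
# Minimal sketch: upserting CDC-style records into a Hudi table with PySpark.
# Table name, columns, and the S3 path are hypothetical, not Robinhood's.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cdc-to-hudi-sketch")
    # Hudi requires its Spark bundle on the classpath and the Kryo serializer.
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# Pretend these rows arrived on a Debezium-fed Kafka topic for a `users` table.
cdc_batch = spark.createDataFrame(
    [(1, "alice@example.com", "2023-10-01 12:00:00"),
     (2, "bob@example.com", "2023-10-01 12:05:00")],
    ["user_id", "email", "updated_at"],
)

hudi_options = {
    "hoodie.table.name": "users",
    # Record key plus precombine field let Hudi keep the latest version of each row.
    "hoodie.datasource.write.recordkey.field": "user_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
}

(cdc_batch.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://example-lakehouse/bronze/users"))
```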
To show us how the architecture above generalizes and can be expanded on, Dey demonstrated how a critical metadata property—freshness—is maintained.
Internally produced metadata used for tracking data freshness (from Debezium and Apache Hudi sources) is looped back through the infrastructure mentioned at steps 2 and 3 in the process above (i.e., the Debezium-fed Kafka streams and the DeltaStreamer) and then fed to step 4. That is, Hive metadata stores are updated with changes in Debezium state and other freshness metrics produced by the DeltaStreamer.
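As an illustration of one way such a freshness metric could be computed (a sketch, not Robinhood’s implementation; the table path is hypothetical), the latest Hudi commit time exposed in the `_hoodie_commit_time` metadata column can be compared against the current clock:

```python
# Sketch: derive a freshness lag for a Hudi table from its commit metadata.
# The table path is hypothetical; _hoodie_commit_time is a standard Hudi meta column.
from datetime import datetime

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("freshness-check-sketch").getOrCreate()

users = spark.read.format("hudi").load("s3://example-lakehouse/bronze/users")

# Hudi encodes commit times as yyyyMMddHHmmss (newer releases append milliseconds).
latest_commit = users.agg(F.max("_hoodie_commit_time")).first()[0]
committed_at = datetime.strptime(latest_commit[:14], "%Y%m%d%H%M%S")
lag_minutes = (datetime.utcnow() - committed_at).total_seconds() / 60

# Emit the metric to whatever monitoring tracks the tier's freshness SLA;
# printing here is just a placeholder.
print(f"users table freshness lag: {lag_minutes:.1f} minutes")
```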
Dey demonstrated that support for this kind of tiered architecture is built on a mix of core features of Apache Hudi and other OSS components of the lakehouse. Key features that a tiered architecture relies on include:
For his part, Balaji Varadarajan, Senior Staff Engineer at Robinhood, gave us a deep and nuanced look at how Robinhood uses the tiered architecture of its data lakehouse to address data governance and GDPR-related requirements.
Data governance at scale is complex, with multiple objectives:
Robinhood addresses these objectives at scale (the Robinhood Data Lakehouse stores more than 50,000 datasets) by organizing the lakehouse into disparate zones.
Zone tags and related metadata are used to track and propagate information about the disparate zones throughout the lakehouse. Robinhood’s team has implemented a central metadata service to support the zones. The service is built on the same kind of tiered architecture we saw above for the freshness metadata.
Varadarajan explained that tagging is done both by hand and automatically in the system (including programmatically at the source-code level), with tag creation co-located with schema management work. Any changes to tagging are enforced, tracked, and monitored via lint checks, as well as with automated data classification tools that help cross-check the tags and detect data leaks or aberrations.
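As a purely illustrative sketch (Robinhood’s actual metadata service and tag format are not public), one can imagine column-level classification tags declared alongside a table’s schema, with a lint-style check that fails a schema change if any column is untagged or a PII column lacks a masking policy:

```python
# Illustrative only: a hypothetical schema-with-tags structure and a lint-style check.
from dataclasses import dataclass
from typing import Optional

ALLOWED_TAGS = {"public", "internal", "pii"}

@dataclass
class Column:
    name: str
    dtype: str
    tag: Optional[str] = None   # classification tag, e.g. "pii"
    mask: Optional[str] = None  # masking policy, required for PII columns

def lint_schema(table: str, columns: list[Column]) -> list[str]:
    """Return lint errors; an empty list means the schema change can merge."""
    errors = []
    for col in columns:
        if col.tag not in ALLOWED_TAGS:
            errors.append(f"{table}.{col.name}: missing or unknown classification tag")
        if col.tag == "pii" and not col.mask:
            errors.append(f"{table}.{col.name}: PII column has no masking policy")
    return errors

# Example: `email` is tagged as PII but has no masking policy, so the check fails.
print(lint_schema("users", [
    Column("user_id", "bigint", tag="internal"),
    Column("email", "string", tag="pii"),
]))
```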
To demonstrate how powerful the resulting system is, Varadarajan walked us through Robinhood’s efficient implementation of the PII delete operation, which implements the “right to be forgotten” required by the European General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA).
Support for PII tracking and masking, as required for an efficient, GDPR-compliant implementation of PII deletion, is difficult to deliver in a lakehouse as large and complex as Robinhood’s. It needs to be possible to delete all of a single user’s PII on demand across the entire multi-petabyte data lakehouse. And this has to be done quickly, efficiently, and without impacting other users. Varadarajan explained that Robinhood’s implementation relies on just two (tricky to implement) metadata services: one tracking user IDs and one tracking PII masks.
These two pieces of metadata (ID and mask) are applied and tracked ubiquitously across the lakehouse. As a result, the PII delete operation can be implemented via a standard Apache Hudi delete operation, which is efficient, fast, and operates over the entire lakehouse.
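To show what that looks like in practice, here is a minimal PySpark sketch (not Robinhood’s code; the table, record key, and path are hypothetical) of a standard Hudi delete keyed on a user ID:

```python
# Minimal sketch: a standard Hudi delete by record key (hypothetical table and path).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("pii-delete-sketch")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

table_path = "s3://example-lakehouse/bronze/users"

# Select the records belonging to the user exercising their right to be forgotten.
to_delete = (
    spark.read.format("hudi")
    .load(table_path)
    .where("user_id = 42")
)

# Writing those records back with operation=delete removes them from the Hudi table.
(to_delete.write.format("hudi")
    .option("hoodie.table.name", "users")
    .option("hoodie.datasource.write.recordkey.field", "user_id")
    .option("hoodie.datasource.write.precombine.field", "updated_at")
    .option("hoodie.datasource.write.operation", "delete")
    .mode("append")
    .save(table_path))
```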
“Apache Hudi is the central component of our data lakehouse. It enabled us to operate efficiently, meet our SLAs, and achieve GDPR compliance.” — Balaji Varadarajan, Senior Staff Engineer, Robinhood
We were shown the many benefits of building a tiered architecture on top of an Apache Hudi-driven, OSS-powered data lakehouse.
Dey and Varadarajan made a convincing case that Robinhood’s reliance on Apache Hudi and related OSS projects gives its Data Lakehouse a strategic advantage over competitors. They’ve implemented reliable data governance mechanisms, are staying up to date with GDPR compliance efficiently and at scale, and have been able to handle exponential data and processing growth. They can also support various metadata, tracking, and other SLAs, such as for data freshness. For an in-depth understanding, view their complete talk.