One of the most intriguing sessions at Open Source Data Summit was the presentation by Ankur Ranjan, Data Engineer III, alongside Ayush Bijawat, Senior Data Engineer, about their use of Apache Hudi at leading retailer Walmart. You can view the full presentation or check out the summary that follows.
In the talk, Ankur and Ayush shared their motivations and learnings from the strategic shift from a data lake to a data lakehouse architecture at Walmart, with a focus on the importance of the Apache Hudi lakehouse format in making this change.
Among their key takeaways were the challenges that prompted the move to a data lakehouse and the benefits of adopting a universal data lakehouse architecture. These benefits, which combine the best of the data lake and data warehouse architectures, include faster row-level operations, strong schema enforcement and versioning, better transaction support, effective handling of duplicates, and more.
Ankur kicked off the talk with a history of data storage methodologies, including the motivations, strengths, and weaknesses of each. Initially, he explained, data warehouses were the go-to solution for structured data, connecting efficiently with business intelligence (BI) tools to generate insights. However, their high operational costs and the complexity of maintaining them signaled a need for innovation.
Enter: the era of data lakes. Propelled by the evolution of Apache Hadoop and proliferation of cloud storage around 2012–2013, data lakes gained traction for their ability to handle not just structured but also voluminous semi-structured and unstructured data. Data lakes became a staple in large organizations for their scalability and versatility. Despite their advantages, data lakes posed notable challenges in maintaining data integrity and in preventing data from turning into a chaotic “data swamp.” The solution to the data swamp? According to Ankur, it needed to be a best-of-both-worlds approach — a data lakehouse. He explained, “...the data warehouse was great for management features, [and the] data lake was scalable and agile … we are combining [their benefits] and creating the data lakehouse.”
With this natural evolution, the next step of Ankur and Ayush’s journey was picking the right data lakehouse architecture for Walmart. While there are three open table formats in mainstream use (Apache Hudi, Apache Iceberg, and Delta Lake), Walmart chose to go with Apache Hudi for two key reasons:
At the core of Apache Hudi, explained Ankur, is its innovative structure, which combines data files (stored in Parquet format) with a metadata layer to enable a slew of advantages. This design enables efficient data management and supports important features, such as record keys and precombine keys.
To explain precisely how Hudi works, Ankur first walked through the core concepts and terminology:
To help build some intuition around the system, Ankur described how it could work using a hypothetical database of students. In his example, the student ID acts as the record key (the primary key), a “created” column is the partition path, and an “update timestamp” field on the record serves as the precombine key.
With this setup, when an upsert (i.e., the operation to update a record, or insert it if the record does not yet exist) arrives from the source for a student record, a few things happen. First, Hudi checks whether the incoming record has a greater value for the precombine key, the “update timestamp” field in our example, than the record already in the target. If it does, Hudi simply upserts the data, ensuring the latest version lands in the target without comparing against all other records; the precombine field alone resolves which version wins, which significantly speeds up the operation.
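To make the mechanics concrete, here is a minimal PySpark sketch of such an upsert using Hudi’s Spark datasource options. The table name, column names (student_id, created, update_timestamp, name, email), and storage path are hypothetical stand-ins for the student example above; the hoodie.* write options shown are the standard ones from Hudi’s Spark quickstart, not the specific configuration used at Walmart.

```python
# Minimal PySpark sketch of the student upsert described above.
# Requires the Hudi Spark bundle on the classpath (e.g., via --packages);
# the exact artifact depends on your Spark/Hudi versions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-student-upsert")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

hudi_options = {
    "hoodie.table.name": "students",
    "hoodie.datasource.write.recordkey.field": "student_id",         # record (primary) key
    "hoodie.datasource.write.partitionpath.field": "created",        # partition path
    "hoodie.datasource.write.precombine.field": "update_timestamp",  # precombine key
    "hoodie.datasource.write.operation": "upsert",
}

# Incoming batch: for any collision on the same student_id, Hudi keeps the
# row with the greatest update_timestamp, without scanning unrelated records.
incoming = spark.createDataFrame(
    [(1, "2023-01-01", "2023-06-01 10:00:00", "Alice", "alice@example.com"),
     (2, "2023-01-05", "2023-06-01 11:30:00", "Bob", "bob@example.com")],
    ["student_id", "created", "update_timestamp", "name", "email"],
)

(incoming.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://example-bucket/lakehouse/students"))  # hypothetical path
```

With the precombine field set, any two versions of the same student_id are resolved by keeping the one with the latest update_timestamp, which is exactly the shortcut described above.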
Hudi also supports two types of tables: Copy on Write (CoW) and Merge on Read (MoR). Copy on Write is optimal for read-heavy environments, because it applies merges during the write phase, keeping the data files read-ready. Merge on Read, in contrast, writes changes to delta logs and merges them at query time, making it better suited to write-heavy scenarios.
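The table type itself is just another write option. Continuing the hypothetical sketch above, switching between the two looks something like this:

```python
# The table type is a single Hudi write option (reusing hudi_options from above).
# COPY_ON_WRITE (the default): merging happens at write time, keeping reads fast.
# MERGE_ON_READ: writes land in delta logs and are merged when the table is queried.
hudi_options["hoodie.datasource.write.table.type"] = "MERGE_ON_READ"  # or "COPY_ON_WRITE"
```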
With the working intuition of Apache Hudi that Ankur provided, Ayush dove into the actual enablement of Apache Hudi in organizations, addressing a question he gets a lot: “How easy is it to enable Hudi in my data lake architecture?”
Fairly easy, it turns out. And that’s because of how Hudi interacts with the storage below it and the compute or query engines above it, Ayush explained. Every data lake already has some storage layer (S3 on AWS, for example) with file formats (Parquet, CSV, etc.) holding the data on top of it; Hudi slots into the layer between those raw file formats and the compute engine. “[Hudi’s] compatibility with the compute engines, whether it's Spark, BigQuery, or Flink, is phenomenal, and we can simply continue to use our existing file system,” Ayush said.
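As a rough illustration of that layering, reading the hypothetical student table back with Spark leaves the storage and Parquet files untouched; Hudi simply sits between them and the engine:

```python
# Reusing the SparkSession from the earlier sketch; the path is hypothetical.
# The Parquet files on S3 are unchanged -- Hudi is the layer Spark reads through.
students = spark.read.format("hudi").load("s3://example-bucket/lakehouse/students")
students.createOrReplaceTempView("students")
spark.sql("SELECT student_id, name FROM students WHERE created = '2023-01-01'").show()
```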
In summary, Hudi led to a broad swath of benefits that Ayush, Ankur, and the team saw firsthand in their implementation at Walmart:
To provide better intuition for the improved upsert and merge operations they saw, Ayush explained how a librarian might organize physical library materials under the data lake and the data lakehouse paradigms. In this comparison, the “librarian” is functionally our compute engine, which does the computational heavy lifting in these scenarios.
In the data lake paradigm, a new batch of papers arrives to be filed amongst many loosely organized existing papers. Because the existing papers aren’t particularly organized, the librarian must check every previous set of papers, combine them, and only then insert the new ones, touching every single document just to keep the collection consistent.
In the new data lakehouse paradigm, however, things happen much more efficiently, because the loose papers have become a well-organized shelf of books. When a new batch of books comes in to be filed away, the enhanced organization lets our librarian interact with only the relevant spots on the bookshelves.
In actual implementation, the lakehouse approach brings some additional advantages: reduced developer overhead and reduced data bifurcation. Reducing developer overhead matters across organizations because it minimizes potential error vectors and cost. One major load taken off developers in the lakehouse paradigm is the read-and-combine step (step 2 in Figure 4), which in the data lake rests entirely on their shoulders to implement and manage. Data deletion is another example: in the lake paradigm, where data is not clearly organized, incorrect deletes across partitions and joins can easily lead to incorrect or out-of-date data.
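For a sense of what record-level deletes look like when Hudi handles the bookkeeping, here is a hedged sketch continuing the hypothetical student example; the developer supplies only the keys to remove, rather than rewriting partitions or hand-coding joins:

```python
# Record-level delete sketch (reusing spark and hudi_options from above).
# The DataFrame only needs to identify the records (key, partition path,
# precombine field); Hudi locates and removes them.
to_delete = spark.createDataFrame(
    [(2, "2023-01-05", "2023-06-02 09:00:00")],
    ["student_id", "created", "update_timestamp"],
)

(to_delete.write.format("hudi")
    .options(**{**hudi_options, "hoodie.datasource.write.operation": "delete"})
    .mode("append")
    .save("s3://example-bucket/lakehouse/students"))  # hypothetical path
```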
The lakehouse reduces data bifurcation due to its partial update support (step 2 in Figure 5). Before, teams would often use a separate NoSQL database, such as MongoDB, to support this important use case. Hudi allows developers to instead keep this data in the filesystem as a single source of truth, while still enabling partial updates. This saves money and also keeps data clean and up-to-date by reducing duplication.
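The exact mechanism for partial updates depends on the Hudi version in use. As one hedged sketch, recent releases (0.13+) ship a partial-update payload class that can be swapped in via a write option, so an incoming batch that carries only a changed column merges into the existing records rather than replacing them; the column names and path remain the hypothetical ones from the student example.

```python
# Hedged partial-update sketch (reusing spark and hudi_options from above).
# Assumes a Hudi release that ships PartialUpdateAvroPayload (0.13+); fields
# left null in the incoming batch (name here) keep their existing values.
partial_options = {
    **hudi_options,
    "hoodie.datasource.write.payload.class":
        "org.apache.hudi.common.model.PartialUpdateAvroPayload",
}

email_updates = spark.createDataFrame(
    [(1, "2023-01-01", "2023-06-03 08:00:00", None, "alice@new.example.com")],
    schema="student_id BIGINT, created STRING, update_timestamp STRING, "
           "name STRING, email STRING",
)

(email_updates.write.format("hudi")
    .options(**partial_options)
    .mode("append")
    .save("s3://example-bucket/lakehouse/students"))  # hypothetical path
```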
Through illustrative, layperson-friendly examples that built clear intuition for the Apache Hudi data lakehouse, and by walking through the concrete benefits it delivered for Walmart’s data organization, Ayush and Ankur gave a thorough explanation of how the system works and what it can offer other data teams. To see all the insights they had to offer, check out their full talk from the conference.