Apache Hudi was originally developed by Uber in 2016 to bring to life a transactional data lake that could quickly and reliably absorb updates to support the massive growth of the company's ride-sharing platform. Apache Hudi is now widely used across the industry to build some very large-scale data lakes. True to its origins, Apache Hudi offers a promising solution for managing data in a rapidly changing environment.
Hudi enables users to track changes to individual records over time using the record-level metadata it stores with each record; this is a fundamental design choice in Hudi. However, the choice is also a common cause of debate, because it is unique among Hudi's peers, so it is crucial to clearly understand both the value record-level metadata provides and the additional costs it incurs. This blog discusses the significance of the five record-level meta fields in Hudi and the storage overhead they add, so you can fully appreciate their benefits for Apache Hudi workloads.
The record key meta field is used to uniquely identify a record within a Hudi table or partition. With the help of record keys, Hudi can ensure there are no duplicate records and enforce uniqueness constraints at write time. As in databases, record keys are also used to index records for faster, targeted updates and deletes, as well as to generate CDC change logs from a Hudi table. Most source data already contains a natural record key, although Hudi can also automatically generate record keys (in the upcoming major release) for use cases such as log events that may not contain such a field.
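To make this concrete, here is a minimal PySpark sketch of configuring a record key when writing a Hudi table. The option keys are standard Hudi write configs, but everything else (table name, base path, columns such as ride_id, ds, event_ts) is purely illustrative, and a SparkSession launched with the Hudi Spark bundle is assumed.

```python
# Minimal sketch: writing a Hudi table keyed on a natural record key.
# Assumes `spark` is a SparkSession started with the Hudi Spark bundle, and `df` is a
# DataFrame with illustrative columns: ride_id, ds, event_ts, status.
hudi_options = {
    "hoodie.table.name": "rides",
    "hoodie.datasource.write.recordkey.field": "ride_id",      # natural record key
    "hoodie.datasource.write.partitionpath.field": "ds",       # date partition
    "hoodie.datasource.write.precombine.field": "event_ts",    # ordering field for merges
    "hoodie.datasource.write.operation": "upsert",
}

(df.write.format("hudi")
   .options(**hudi_options)
   .mode("append")
   .save("s3://bucket/tables/rides"))   # hypothetical base path
```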
In mutable workloads, data changes after it has been ingested or stored. Typically these changes are a) delete requests to comply with data protection regulations and b) update requests trickling down from an upstream system. Without record keys to link change records together, duplicates can creep into the system. For example, suppose we are receiving change logs from an upstream OLTP database. These logs can contain updates to the same key multiple times within a time window. To prevent duplicates, we must merge the records within the same commit, and also against the records in storage, consistently based on the same key definition.
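As a hedged illustration of this merge behavior, the sketch below reuses the hudi_options from the previous snippet: two change records for the same (hypothetical) key arrive in one batch, and the upsert merges them within the batch and against storage using the configured record key and ordering field.

```python
# Two changes to the same key within one batch (values illustrative).
updates = spark.createDataFrame(
    [("ride-1001", "2023-04-01", "2023-04-01 10:00:00", "REQUESTED"),
     ("ride-1001", "2023-04-01", "2023-04-01 10:05:00", "COMPLETED")],
    ["ride_id", "ds", "event_ts", "status"])

# With recordkey = ride_id and precombine = event_ts, Hudi keeps only the latest change.
(updates.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://bucket/tables/rides"))

# The table now serves a single row for ride-1001, with status = COMPLETED.
spark.read.format("hudi").load("s3://bucket/tables/rides") \
    .filter("ride_id = 'ride-1001'").show()
```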
If you are now thinking that record keys are not very helpful for immutable data, here is an example. Consider a scenario where new data is continuously added to a table, while a backfill is required to fix a past data quality issue or roll out new business logic. Backfills can span any time period, and there is no guarantee that the data being backfilled will not overlap with active writes. Without record keys, backfills must be performed strictly partition-by-partition, while coordinating with active writers to stay off the backfilled partitions, to avoid inaccurate data or duplicates. With record keys, however, users can identify and backfill individual records instead of dealing with them at the coarser partition level. When combined with Hudi's concurrency control mechanisms and support for ordering fields, the active and backfill writers can seamlessly write to the same table without the backfill writer overwriting active writes, which would take the table back to an older state. Note that these issues can occur even with strictly serialized transactions.
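Here is a sketch of such a record-level backfill, under the same assumptions as the earlier snippets. The staging path is hypothetical, and the payload class shown is one way (not the only way) to have Hudi honor the ordering field when merging against records already in storage; a true multi-writer setup would additionally need Hudi's concurrency control and lock provider configs, omitted here for brevity.

```python
# Backfill corrected historical records by key. Because event_ts is the ordering field
# and DefaultHoodieRecordPayload compares it against what is already stored, a backfilled
# record does not overwrite a fresher record written by the active pipeline for the same key.
backfill_df = spark.read.parquet("s3://bucket/staging/corrected_rides")   # hypothetical input

(backfill_df.write.format("hudi")
    .options(**hudi_options)   # same key/ordering configuration as the active writer
    .option("hoodie.datasource.write.payload.class",
            "org.apache.hudi.common.model.DefaultHoodieRecordPayload")
    .mode("append")
    .save("s3://bucket/tables/rides"))
```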
Now that we have established why we need keys, let's understand why they also need to be stored in a persistent form along with the actual record, even though Hudi supports virtual keys. There are obvious benefits in the case of compound keys, where recalculating or re-deriving the record key every time can be expensive, since it requires reading multiple columns out of storage. Accidents happen, and in data engineering unintentional config changes are common, often leading multiple teams to spend hours identifying and resolving the root cause. One example is a record key configuration being accidentally altered, producing two records that look like duplicates but are treated as separate records by the system. When the key field changes (say from A to B), there is no guarantee that all the historical data in the table is unique with respect to the new key field B, since all uniqueness enforcement so far was performed with A. Materializing the record key is therefore a simple yet effective technique to avoid these tricky data quality issues: the change in record keys is logged right alongside the data, and uniqueness constraints are never silently violated.
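Since the key is materialized by default (controlled by the hoodie.populate.meta.fields config), it can be audited directly from the table. A small sketch, against the same illustrative table as above:

```python
# The written key travels with each record, so uniqueness can be audited without
# re-deriving keys from (possibly re-configured) key columns.
tbl = spark.read.format("hudi").load("s3://bucket/tables/rides")

tbl.select("_hoodie_record_key", "ride_id", "ds").show(5)

# Spot-check: does the stored key still match what the current key config would produce?
tbl.filter("_hoodie_record_key != ride_id").count()   # > 0 would flag a key config change
```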
Databases often consist of several internal components working in tandem to deliver efficiency, performance, and great operability to their users. Similarly, Hudi is designed with built-in table services and indexing mechanisms that ensure a performant table storage layout and faster queries.
These services rely on record keys to correctly and efficiently carry out their intended goals. Let's consider the compaction service, for instance. Compaction merges the delta logs with base files to produce the latest version of a file containing the most recent snapshot of the data. It would be inefficient for the compaction process to inspect the data every time just to extract the record keys from older files: the deserialization cost adds up quickly, since it has to be paid for every record, every time compaction runs. As seminal database work has called out, record keys are the fundamental construct that ties together techniques like indexing, which speed up writes and queries, and mechanisms like clustering, which move records across files within a table.
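For example, here is a hedged sketch of a Merge-On-Read table with inline compaction enabled. The option keys are real Hudi configs, while the table name, path, and threshold value are illustrative, and hudi_options carries over from the earlier snippets.

```python
# Merge-On-Read table with inline compaction: delta log files are merged into new base
# files after every N delta commits, matching log records to base records by record key.
mor_options = {
    **hudi_options,
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.compact.inline": "true",
    "hoodie.compact.inline.max.delta.commits": "5",   # compact after 5 delta commits
}

(df.write.format("hudi")
   .options(**mor_options)
   .mode("append")
   .save("s3://bucket/tables/rides_mor"))
```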
These fields capture the physical/spatial distribution of a record in a Hudi table. The _hoodie_partition_path field represents the relative partition path under which a record exists, and the _hoodie_file_name field represents the actual data file name where the record exists. Going back to Hudi's roots in incremental data processing, the partition path field is routinely used to further filter records in an incremental query; for example, a downstream ETL job that is only interested in changes to the last N day partitions of a table can do so by simply adding a _hoodie_partition_path filter.
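A minimal sketch of such an incremental query, assuming the illustrative table from the earlier snippets with a date-based partition path; the begin instant shown is a placeholder.

```python
# Incremental query: only records that changed after the given instant, further
# narrowed to recent date partitions via the _hoodie_partition_path meta field.
incr_df = (spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20230401000000")  # placeholder
    .load("s3://bucket/tables/rides"))

incr_df.filter("_hoodie_partition_path >= '2023-04-01'").show()
```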
These fields are also a means to quickly debug data quality issues in a production environment. Imagine debugging a duplicate-record issue caused by duplicate jobs or a misconfigured lock provider. You notice duplicate entries in your table, but you're not sure how they got there; you also need to locate the affected records and determine when the problem occurred. Without the necessary meta fields, identifying the root cause would be like looking for a needle in a haystack. In Hudi, a simple "select _hoodie_partition_path, _hoodie_file_name, columns from <hudi_table> where <predicates>;" emits the partition path and file name from which the duplicate records are being served, so you can investigate further. Since these two fields have the same value for all records in a single file, they compress very well and carry virtually no overhead.
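The same investigation written with the DataFrame API, as a sketch against the illustrative table used earlier: it first finds keys that appear more than once, then pulls the partitions and files serving them.

```python
# Locate duplicate keys and the exact partitions/files serving them.
tbl = spark.read.format("hudi").load("s3://bucket/tables/rides")

dupe_keys = tbl.groupBy("_hoodie_record_key").count().filter("count > 1")

(tbl.join(dupe_keys, "_hoodie_record_key")
    .select("_hoodie_record_key", "_hoodie_partition_path", "_hoodie_file_name")
    .show(truncate=False))
```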
These two fields represent the temporal distribution of a record in a Hudi table, making it possible to track the history of changes to each record. The _hoodie_commit_time field indicates the commit time at which the record was created, similar to a database commit. The _hoodie_commit_seqno field is a unique sequence number for each record within a commit, similar to offsets in Apache Kafka topics. In Kafka, offsets help streaming clients keep track of messages and resume processing from the same position after a failure or shutdown. Similarly, _hoodie_commit_seqno can be used to generate streams from a Hudi table.
To better understand this capability, let's consider a Copy-On-Write (CoW) table, where new writes produce versioned base files by merging against the latest existing base file. Merely tracking versions at the file level can be inadequate here, since not all records in a file may have been updated during a commit. To get this type of record-level change from other data lakehouse systems, one must join every two adjacent snapshots of the table, which can be very costly and imprecise, for example when metadata about a table snapshot has been lost.
In contrast, Hudi treats record-level change streams as a first-order design goal and encodes this information at all levels: commit times are recorded in files, log blocks, and the records themselves. By storing this change-tracking information efficiently alongside the data, even incremental queries benefit from all of the storage organization, sorting, and layout optimizations performed on the table. Using record-level meta fields to retrieve change streams is not an uncommon pattern; refer to Snowflake's incremental processing paper, which discusses the concept Hudi embraced in its design back in 2016.
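A sketch of this offset-like usage, with a hypothetical checkpoint value: the next run reads only commits after the last instant it processed, and records the new high-water mark from _hoodie_commit_time.

```python
# Resume-style incremental processing, analogous to committing Kafka offsets.
last_checkpoint = "20230401120000"   # hypothetical instant persisted by the previous run

changes = (spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", last_checkpoint)
    .load("s3://bucket/tables/rides"))

# New high-water mark to persist for the next run.
new_checkpoint = changes.agg({"_hoodie_commit_time": "max"}).collect()[0][0]
```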
Another powerful capability that Hudi unlocks with this meta field is the ability to retain near-infinite history for records. One Hudi community user, a large bank, successfully leverages this feature to support time-travel queries on historical data going back as far as 5 or 6 years. In practice, this is achieved simply by managing file-sizing configs, enabling scalable metadata, and disabling the cleaner. Without persisting the commit time along with each record, it would be impossible to see a record's history right from when it was created. Combined with Hudi's scalable table metadata, this unlocks near-infinite history retention and lets you tap time-travel abilities on a historical table holding many years of data.
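Here is a sketch of a time-travel read as of an older instant; the timestamp and filter are illustrative, and reading this far back assumes the table's cleaner and metadata have been configured to retain that history.

```python
# Query the table as of an older instant using the commit time tracked per record/commit.
old_snapshot = (spark.read.format("hudi")
    .option("as.of.instant", "20190401000000")   # illustrative historical instant
    .load("s3://bucket/tables/rides"))

old_snapshot.filter("ride_id = 'ride-1001'").show()
```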
So far, we have discussed the essential capabilities that meta fields unlock in Hudi. If you are still concerned about the storage cost of meta fields, we want to conclude with a small benchmark that estimates this overhead and puts it into context. The benchmark was run against Hudi master. We generated sample data for tables of different widths and compared the cost of storing the additional meta fields in a Hudi table against a vanilla parquet table written via Spark. Here is the benchmark setup if you are interested in the details.
The benchmark compares vanilla parquet, Hudi CoW bulk insert with the default gzip compression, and Hudi CoW bulk insert with snappy compression, on tables of three different widths: 10 columns, 30 columns, and 100 columns. Hudi uses gzip compression by default, which compresses better than vanilla Spark parquet writing (which defaults to snappy). You can see that the actual data, including the meta fields, compresses very well (the record key meta field compresses about 11x, while the others compress even more, sometimes compressing away completely) and takes less storage than the vanilla parquet data without meta fields. Hudi's default is moving towards zstd in future releases, which will offset the compute overhead of gzip over snappy. Even if you use the snappy codec in Hudi, the extra space taken by the meta fields, estimated for a 100 TB table, shrinks as the table becomes wider. For a 100 TB table with 30 columns, like a standard TPC-DS table, you would pay only about $8 to add record-level meta fields. If the table were wider, say 100 or even 1000 columns, adding the meta fields would cost no more than $1.
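As a back-of-the-envelope check on that figure, under pricing we are assuming purely for illustration (standard object storage at roughly $0.023 per GB-month, not a number from the benchmark itself), $8 corresponds to only a few hundred gigabytes of extra data on a 100 TB table:

```python
# Rough arithmetic under an assumed storage price; only the $8 figure comes from the post.
price_per_gb_month = 0.023   # assumed standard S3 pricing, USD per GB-month
extra_cost = 8.0             # USD, estimated meta field overhead from the benchmark above
extra_gb = extra_cost / price_per_gb_month
overhead_pct = extra_gb / (100 * 1024) * 100
print(f"~{extra_gb:.0f} GB of meta fields, ~{overhead_pct:.2f}% of a 100 TB table")
# -> roughly 348 GB, about 0.3% of the table
```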
In conclusion, the meta fields Hudi tracks at the record level serve greater purposes. They play a pivotal role in maintaining data integrity by upholding uniqueness constraints in a table, supporting faster targeted updates and deletes, enabling incremental processing and time travel, allowing table services to run accurately and efficiently, handling duplicates safely, and making it possible to track any given record years back in history. They aid debugging and prevent pipeline-cleanup nightmares caused by data quality issues. If you use a table format like Delta or Iceberg without these kinds of meta fields, many of these benefits are not simple to achieve. For instance, something as basic as duplicate detection would require multiple joins against the source data and assumptions about the data model, or would be left as the user's responsibility to handle before ingesting into the data lake. Before we wrap up, we would like readers to consider this question: if the cost of adding the meta fields to a 30-column, 100 TB table at rest is about $8, isn't that a small price for all the goodness the record-level meta fields offer?
If you are still unsure, take a look at this blog from Uber. Uber leveraged the combination of Hudi's record-level meta fields and incremental processing capabilities to achieve an 80% reduction in the compute cost of their pipelines, which easily covers the meta field overhead many times over.
We would love to hear your thoughts. Please reach out to us on the Hudi community or the community Slack.