The Apache Hudi community has been growing rapidly, with developers seeking ways to leverage its powerful capabilities for ingesting and managing large-scale datasets efficiently. Every week I receive common questions about a set of topics related to tips and tricks for maximizing their Hudi experience. One of the top questions I get is related to how Hudi can perform upserts, ensuring low-latency access to the latest data. That’s what we’ll cover in this blog!
One of the major considerations to factor in for fast upserts is choosing the right storage table type. Hudi supports two different storage table types - Copy-On-Write (COW) and Merge-On-Read (MOR). Each table type can impact upsert performance differently due to their distinct approaches to handling data updates.
COW tables are operational simpler compared to MOR tables because all updates are written to base files, which are in Apache Parquet format. You do not need to run a separate service like compaction to manage any log files to improve the read or storage efficiency. Updates are handled by entirely rewriting the file to generate a new version of the base file. Consequently, COW tables exhibit higher write amplification because there is a synchronous merge that occurs in order to create the new base file version. However, a key advantage of COW tables is their zero read amplification because all data is available in the base file, ready to read. The disk reads required for a query are minimal because they do not need to read multiple locations or merge data.
In contrast to COW tables, MOR tables have more operational complexity. Rather than re-writing the entire file, MOR writes updates to separate log files, then these log files are merged with the base file into a new file version at a later time. If you’re familiar with Hudi, this is done with the compaction service. Compaction is needed to bound the growth of log files so query performance doesn’t deteriorate and storage is optimized.
Writing to log files directly avoids re-writing the entire base file multiple times lowering the write amplification - and if you are working with streaming data, this difference becomes apparent. This makes MOR tables to be write-optimized. However, MOR tables have a higher read amplification for snapshot queries, in between compactions, due to the need to read base files and log files and merge the data on the fly.
If you have a high update:insert ratio and are sensitive to ingestion latency, then MOR tables might be a good option for a table type. One example is with streaming sources– usually, you’ll want to act on insights faster to provide relevant and timely information to your users. However, if your workload is more insert-based and you can tolerate reasonable ingestion latencies, then COW tables are a good option.
By leveraging indexes, Hudi avoids full-table scans when locating for records during upserts, which can be expensive in terms of time and resources. Hudi’s indexing layer maps record keys to their corresponding file locations. The indexing layer is pluggable and there are several index types to choose from. One thing to consider is that indexing latency is dependent on multiple factors like how much data is being ingested, how much data is in your table, whether you have partition or non-partitioned tables, type of index chosen, how update-heavy the workloads are and the record key’s temporal characteristics. Depending on the performance needed and uniqueness guarantees, Hudi provides different indexing strategies out-of-the-box that can be categorized to either global or non-global indexes.
One of the main considerations between global vs. non-global is related to index lookup latency due to the differences in uniqueness guarantees:
Non-global indexes only look up matched partitions: For example, if you have 100 partitions and the incoming batch has records for only for the last 2 partitions, only file groups belonging to those 2 partitions will be looked up. For upsert workloads at scale, you might want to consider non-global indexes, like non-global bloom, non-global-simple and bucket.
Global indexes look at all file groups in all partitions: For example, if you have 100 partitions and the incoming batch of records has records for the last 2 partitions, all the file groups in all 100 partitions will be looked up (since hudi has to guarantee there exists only one version of the record key across the entire table). This can cause increased latency for upsert workloads at scale.
Bloom index: This is a good index strategy for update-heavy workloads if the record keys are ordered by some criteria (eg, timestamp-based) and updates are related to the recent set of data. For example, if the record keys are ordered based on timestamp and we’re updating data within the last few days.
Simple index: This is a good index strategy for update-heavy workloads if you’re sporadically updating files across the entire span of the table and the records keys are random i.e., non-timestamp-based.
Bucket index: This is a good index strategy if the total amount of data stored per partition is similar across all partitions. The number of buckets (or file groups) per partition has to be defined upfront for a given table. Here’s an article Sivabalan wrote that talks more about it.
Partitioning is a technique used to split a large dataset into smaller, manageable pieces based on certain attributes or columns in the dataset. This can greatly enhance query performance as only a subset of the data needs to be scanned during queries. However, the effectiveness of partitioning largely depends on the granularity of the partitions.
A common pitfall is setting partitions too granularly, such as dividing partitions by <city>/<day>/<hour>. Depending on your workload, there might not be enough data at the hourly granularity, resulting in many small files of only a few kilobytes. If you’re familiar with the small file problem, more small files cost more disks seeks and degrades query performance. Secondly, on the ingestion side, small files also impact the index lookup because it will take longer to prune irrelevant files. Depending on what index strategy you’re implementing, this may negatively affect write performance. I recommend users always start with a coarser partitioning scheme like <city>/<day> to avoid the pitfalls of small files. If you still feel the need to have granular partitions, I recommend re-evaluating your partitioning scheme based on query patterns and/or you can potentially take advantage of the clustering service to balance ingestion and query performance.
I hope this blog helps tour guide you into some specific parts of Hudi you can tune in order to get fast upserts. There is no specific formula to use- much of how you configure your applications and tune it will be specific to the workload types. If you have more specific questions about upsert performance related to your workload, you can find me in the Hudi community--especially in the community slack!
Be the first to read new posts