Indexing has been an integral part of Apache Hudi, much like in many other transactional data systems, and unlike plain table format abstractions. In this blog, we discuss how we have reimagined indexing and built a new multi-modal index in the Apache Hudi 0.11.0 release, a first-of-its-kind high-performance indexing subsystem for the Data Lakehouse architecture, to optimize the performance of queries and write transactions, especially for large and wide tables.
Indexing is widely employed in database systems, such as relational databases and data warehouses, to reduce I/O cost and improve query efficiency. Similar to how the index at the end of a book helps you locate information quickly, a database index contains auxiliary data structures to quickly locate the records needed, without reading unnecessary data from storage. Because Hudi's design has been heavily optimized for handling mutable change streams with varying write patterns, Hudi has uniquely supported indexing capabilities from its inception to speed up upserts on the Data Lakehouse.
In fact, dozens of indexing techniques exist in the literature, and most popular database systems, e.g., PostgreSQL, MySQL, Spanner, and CockroachDB, provide a powerful toolbox that supports many of them. While Hudi's indexing is now industry-proven for fast upserts, these benefits have remained untapped for queries. Given the 10-100x data scale of data lakes over traditional databases and warehouses, a generalized indexing subsystem can bring game-changing performance gains to the lake. In the Hudi 0.11.0 release, we reimagined what a general-purpose multi-modal index for lakes should look like. Hudi's multi-modal index is implemented by enhancing the metadata table with the flexibility to extend to new index types, along with an asynchronous index building mechanism.
This blog touches upon the core design principles and how the multi-modal index serves all existing indexing mechanisms, while follow-up blogs cover the remaining aspects in more detail.
The multi-modal index needs to satisfy the following requirements: scalable metadata that keeps pace with the data table, the flexibility to add new index types over time, ACID transactional guarantees in sync with the data table, and fast point and range lookups.
Building on these requirements, we designed and implemented the multi-modal index to realize a generalized indexing subsystem for Hudi.
All the indexes containing table metadata are stored as a single internal Hudi Merge-On-Read (MOR) table, i.e., the metadata table, within the data table. This follows a common practice: databases store metadata as internal views, and Apache Kafka stores its metadata as internal topics. The metadata table is serverless and independent of compute and query engines. The MOR table layout enables lightning-fast writes by avoiding synchronous merges of data, with reduced write amplification. This is extremely important for large datasets, as the size of updates to the metadata table can otherwise grow to be unmanageable. This helps Hudi scale its metadata to TBs in size, like other data systems such as BigQuery.
We already have the files, column_stats, and bloom_filter indexes to boost performance on multiple fronts, as shown later in this blog. The foundational framework is built to be extensible and scalable to new index types such as bitmap indexes, R-tree-based indexes, a record-level index, and more. Any such index can be enabled and disabled as needed, without coordinating with other indexes. In addition, Hudi delivers asynchronous indexing, a first of its kind, to support index building alongside regular writers without impacting write latency (a follow-up blog will discuss async indexing in detail).
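As a concrete illustration, here is a minimal spark-shell sketch of a write that enables the metadata table along with the column_stats and bloom_filter index partitions. The table name, field names, and paths are hypothetical, and the configuration keys reflect our reading of the 0.11.0 release; consult the configuration reference for your version before relying on them.

```scala
import org.apache.spark.sql.SaveMode

// Hypothetical source data; in spark-shell, `spark` is the active SparkSession.
val df = spark.read.parquet("s3a://my-bucket/source-data")

df.write.format("hudi").
  option("hoodie.table.name", "my_hudi_table").                 // hypothetical table name
  option("hoodie.datasource.write.recordkey.field", "uuid").    // hypothetical record key field
  option("hoodie.datasource.write.partitionpath.field", "ds").  // hypothetical partition field
  option("hoodie.datasource.write.precombine.field", "ts").     // hypothetical ordering field
  option("hoodie.metadata.enable", "true").                     // metadata table with the files index
  option("hoodie.metadata.index.column.stats.enable", "true").  // build the column_stats index
  option("hoodie.metadata.index.bloom.filter.enable", "true").  // build the bloom_filter index
  mode(SaveMode.Append).
  save("s3a://my-bucket/my_hudi_table")                         // hypothetical base path
```

Index partitions can also be built asynchronously, without blocking the regular writer; the follow-up blog on async indexing covers that path.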
The metadata table guarantees ACID with transactional updates. All changes to the data table are translated into metadata records committed to the metadata table. We have designed this as a multi-table transaction, so that every write to the Hudi table succeeds only when both the data table and the metadata table are committed. The multi-table transaction ensures atomicity and is resilient to failures, so that partial writes to either the data or the metadata table are never exposed to other read or write transactions. The metadata table is built to be self-managed, so users do not need to spend operational cycles on any table services, including compaction and cleaning. In the future, we plan to augment updates on MOR tables with a log compaction service that can further reduce write amplification.
To boost read and write performance, the processing layers need point lookups to find the necessary entries in the files of the metadata table. Parquet is columnar and Avro is row-based, and neither is well suited for efficient point lookups. The HFile format from HBase, on the other hand, is designed specifically for efficient point lookups.
We ran experiments to measure the latency of point lookups of N entries among 10 million (10M) entries in a single file, across different file formats. HFile shows a 10x to 100x improvement compared to Parquet or Avro, which are still used for table metadata by other formats such as Delta Lake and Iceberg.
Since most accesses to the metadata table are point and range lookups, the HFile format was chosen as the base file format for the internal metadata table. Because the metadata table stores auxiliary data at the partition level (files index) or the file level (column_stats index), lookups keyed on a single partition path and file group are very efficient with HFile. Both the base and log files in Hudi's metadata table use the HFile format, and each log file can contain multiple log blocks. As shown in the figure below, Hudi uses a novel approach of leveraging the Inline File System to read the content of the actual data blocks as HFiles, preserving HFile's fast lookups. This design was chosen deliberately to reduce remote GET calls to cloud storage, since point lookups need not download the entire file.
In addition, these metadata table indexes are served via a centralized timeline server, which caches the metadata and further reduces lookup latency from executors.
The metadata table offers several benefits that improve performance for Hudi users. Let's take a look at how the multi-modal index can speed up Hudi's file listing by up to 10x and reduce read latency by 10x to 30x or more through data skipping.
Large deployments of analytical pipelines on cloud storage usually span 100k or more files across thousands of partitions. Direct file listing at such a scale is often a bottleneck due to throttling and the high number of I/O operations, leading to scalability issues. To improve file listing performance, Hudi stores this information in a partition named files in the metadata table, avoiding file system calls such as exists, listStatus, and listFiles. The files partition stores file information such as file name, size, and active state for each partition in the data table.
We showcase the performance improvement of file listing using Hudi tables of various scales, containing different numbers of files and partitions, on Amazon S3. By using the files index in the metadata table, the file listing latency is drastically reduced compared to direct listing on S3, providing a 2-10x speedup (including for a non-partitioned table with 1M files, not shown in the figure). Since cloud storage such as S3 rate-limits and throttles file system calls on very large datasets, direct file listing does not scale well with an increasing number of files per partition, and in some cases the file system calls may not complete at all. In contrast, the files index removes such bottlenecks and provides fast access to the file listing. Even better, by reusing the metadata table reader and caching the index at the timeline server, the file listing latency is reduced further.
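For illustration, the spark-shell sketch below runs a snapshot query that lists files through the metadata table rather than directly against S3. The base path is hypothetical, and the configuration key is the same one used on the write path in 0.11.0; verify it against the configuration reference for your version.

```scala
// In spark-shell, `spark` is the active SparkSession; the base path is hypothetical.
val snapshotDf = spark.read.format("hudi").
  option("hoodie.metadata.enable", "true").   // list files via the metadata table instead of direct S3 listing
  load("s3a://my-bucket/my_hudi_table")

snapshotDf.createOrReplaceTempView("my_hudi_table")
spark.sql("select count(*) from my_hudi_table").show()
```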
Another major benefit of the metadata table is assisting with data skipping while serving read queries. The column_stats partition stores statistics, such as min and max values, value counts, null counts, and size, for the columns of interest across all data files. These statistics are used when serving read queries with predicates on those columns. This can boost query performance by a large factor, since unmatched files are filtered out without being read from the file system, which also reduces the I/O burden. Further, if the user has configured clustering, Z-ordering, or any other layout optimization, data skipping can reduce query latency by an order of magnitude, as the files are laid out to match the access patterns of commonly queried columns.
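As a hedged example, the spark-shell sketch below enables data skipping on the read path so that the column_stats index can prune files before scanning. The configuration keys reflect our reading of the 0.11.0 release, and the base path, column names, and literal values are hypothetical.

```scala
// In spark-shell; base path, column names, and literal values are hypothetical.
val prunedDf = spark.read.format("hudi").
  option("hoodie.metadata.enable", "true").       // column_stats lives in the metadata table
  option("hoodie.enable.data.skipping", "true").  // prune files using column_stats before scanning
  load("s3a://my-bucket/my_hudi_table").
  where("col1 > 100 and ds = '2022-10-01'")       // these predicates drive the column_stats lookups

prunedDf.show()
```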
In the column_stats partition, the record key is composed by concatenating the column name, partition name, and data file name, in that order, so that we can do both point lookups and range reads. This record key design also unlocks prefix lookups on the column_stats index. For example, as shown above, Query1 specifies col1 and a partition in its predicates, while Query2 specifies col2. The predicates are used to construct prefix lookups against the column_stats index without having to provide the complete record keys. This drastically reduces the index lookup cost for large datasets with hundreds or even thousands of columns, as the number of index entries to look up is on the order of O(num_query_columns), which is usually small (e.g., 5 to 10), instead of O(num_table_columns), which can be huge (e.g., more than 100 or 1,000).
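To make the prefix lookup concrete, here is a deliberately simplified Scala model of the key layout described above. Hudi's actual on-disk key encoding differs, so treat the separator and the helper class as illustrative only.

```scala
// Deliberately simplified: Hudi's real column_stats key is a structured,
// serialized key, not this string form; the "/" separator below is made up.
case class ColumnStatsKey(columnName: String, partitionName: String, fileName: String) {
  def encoded: String = s"$columnName/$partitionName/$fileName"
}

// Query1: predicate on col1 within partition 2022-10-01 -> one prefix per queried column
val query1Prefix = "col1/2022-10-01/"
// Query2: predicate on col2 with no partition filter -> a shorter prefix
val query2Prefix = "col2/"

// Each prefix maps to a single range scan over the sorted keys in the HFile, so the
// lookup cost scales with O(num_query_columns) rather than O(num_table_columns).
```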
We ran an experiment with prefix-based lookups on a file of 10M entries, where each column lookup is expected to match 10k entries. HFile shows at least 3x better latency than the next best format, Parquet, in all cases. This also greatly benefits performance on cloud storage, since it drastically reduces the number of remote GET calls. With this design, data skipping brings 10x to 30x gains in query latency compared to no data skipping. Expect more details on data skipping with Hudi in a follow-up blog.
One of the most widely used indexes in Hudi is the bloom-filter-based index. This index employs range-based pruning on the minimum and maximum values of the record keys, plus bloom-filter-based lookups, to tag incoming records. For large tables, this involves reading the footers of all matching data files to fetch bloom filters, which can be expensive in the case of random updates spread across the entire dataset. The bloom_filter partition in the metadata table was introduced to store the bloom filters of all data files and avoid scanning footers across the dataset. The record key in this partition is composed of the partition name and the data file name. Like the column_stats index, it supports point and prefix lookups. Based on our analysis of a Hudi table with 100k files, reading bloom filters from the bloom_filter partition in the metadata table is 3x faster than reading them from individual data file footers.
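For completeness, the spark-shell sketch below shows an upsert configured to tag records with the bloom index while sourcing bloom filters from the metadata table instead of data file footers. The hoodie.bloom.index.use.metadata key is our reading of the 0.11.0 configuration reference, and the table name, field names, and paths are hypothetical.

```scala
import org.apache.spark.sql.SaveMode

// Hypothetical DataFrame of incoming updates; in spark-shell, `spark` is the active SparkSession.
val updatesDf = spark.read.parquet("s3a://my-bucket/incoming-updates")

updatesDf.write.format("hudi").
  option("hoodie.table.name", "my_hudi_table").
  option("hoodie.datasource.write.operation", "upsert").
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.datasource.write.partitionpath.field", "ds").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.index.type", "BLOOM").                         // bloom-filter-based index for tagging records
  option("hoodie.metadata.enable", "true").
  option("hoodie.metadata.index.bloom.filter.enable", "true").  // maintain bloom filters in the bloom_filter partition
  option("hoodie.bloom.index.use.metadata", "true").            // read bloom filters from the metadata table, not footers
  mode(SaveMode.Append).
  save("s3a://my-bucket/my_hudi_table")
```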
As noted above, we would like to enrich Hudi's metadata even further. We are adding a new record-level index, advancing scalable metadata for the Data Lakehouse, that maps record keys to the actual data files where they are stored. For very large datasets with 100 billion or more records, existing indexes may not meet the SLA for some types of workloads. With our multi-modal index framework and faster lookups, we should be able to locate records faster than with existing indexes. This can be very powerful for large deployments, where the index lookup itself can dominate the write latency. We are also looking to add bloom filters on secondary columns, bitmap indexes, and much more. We welcome more ideas and contributions from the community to add further indexes to the multi-modal index.
Hudi brings a novel multi-modal index, a serverless and high-performance indexing subsystem for the Data Lakehouse architecture, to store various types of auxiliary data that reduce read and write latencies. The foundation is designed to be scalable and largely self-managed, and it supports adding richer indexes to Hudi efficiently and easily. We plan to enhance the multi-modal index with new index types in the coming releases.