Clustering is a storage optimization method available in open table formats such as Apache Hudi, Apache Iceberg, and Delta Lake, focused on enhancing query efficiency by reducing unnecessary file scans. At its core, clustering aims to tackle the misalignment often found between the ingestion order (e.g., data arrival time) and the query-access patterns (e.g., event time). Organizing data based on frequently queried fields optimizes storage and enables engines to access data more selectively.
To establish the problem context, take a scenario in which data, initially ingested by arrival time, is spread across multiple files. Queries filtering by columns other than the arrival time (such as event date) would need to scan through a large number of files, which increases read latency. Clustering addresses this issue by reorganizing data and bringing similar records together, thus reducing the number of files a query must access and improving overall query performance.
Linear sorting is the simplest clustering technique, where data is ordered based on a specific column, such as Emp Name in our example, or another frequently queried field. The goal is to improve data locality, ensuring that records with similar values are grouped within the same file. This makes it easier for the query engine to locate and access relevant data, significantly enhancing performance by reducing the amount of data scanned.
Let’s take this simple example to understand the impact of sorting on query performance. Suppose we have employee data that’s sorted by the Emp Name column, as illustrated in the image. This data is split into two files:
Since the data is sorted by Emp Name, each file contains a unique range of names, ensuring that any query targeting a specific name will only need to scan the file containing that range.
Consider the following query:
SELECT * FROM Employee WHERE Emp Name = 'David'
In this case, the query engine checks the Emp Name column and sees that the name range for File 1 is Arynn-Dennis. Since David falls within this range, the query engine only needs to scan File 1 to locate the data. File 2 (with the range Diana-Joseph) is entirely skipped because it does not contain relevant records for the query.
By organizing the data using sorting, the layout of the data files aligns better with the query pattern, enabling the engine to access only half of the files. This selective reading of files minimizes I/O operations and reduces latency, which is especially beneficial for large datasets with multiple partitions.
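To make this concrete, here is a minimal sketch of how such a layout could be produced with Spark SQL. The employee table and Emp_Name column follow our running example (the names are illustrative), and the exact rewrite syntax varies by engine; the key idea is that a global ORDER BY before writing range-partitions the output, so each file ends up holding a distinct range of names:
-- Rewrite the data sorted by name so that each output file
-- covers a distinct, non-overlapping range of Emp_Name values
CREATE TABLE employee_sorted AS
SELECT * FROM employee
ORDER BY Emp_Name;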
Now, imagine that queries often filter not just on Emp Name but also on Emp Sal and Emp ID. In such cases, linear sorting by a single column like Emp Name will not help prune all the irrelevant records. To address this, we can use hierarchical sorting, where the data is sorted in a particular order: first by Emp Name, followed by Emp Sal, and then by Emp ID. For example, if we have a query with predicates like this:
SELECT * FROM Employee WHERE Emp Name='David' AND Emp Sal=25000 AND Emp ID=22
In this case, the query engine will take every predicate into consideration and compare the min-max ranges of Emp Name, Emp Sal, and Emp ID to filter files more precisely. However, hierarchical sorting is only effective when queries include a filter on the first sorted column (Emp Name in this case) and will not work well if predicates refer to just Emp Sal or Emp ID or a combination of both. For instance, if a query filters only on Emp Sal, such as:
SELECT * FROM Employee WHERE Emp Sal > 19000
The engine may still need to scan all files because the primary sorting by Emp Name could scatter the ranges of Emp Sal across all files. Thus, while hierarchical sorting helps to deal with multi-column queries to some extent, its effectiveness depends heavily on the order of the sorted columns and the query patterns.
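Expressed as a rewrite, hierarchical sorting is simply a multi-column sort key, with the column order deciding which predicates can prune files. A sketch against our hypothetical employee table:
-- The first sort key (Emp_Name) dominates the resulting file layout;
-- Emp_Sal and Emp_ID ranges are only tight within runs of equal names
CREATE TABLE employee_hier_sorted AS
SELECT * FROM employee
ORDER BY Emp_Name, Emp_Sal, Emp_ID;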
Overall, linear sorting proves effective for queries with single predicates (e.g., filtering by Emp Name alone) but struggles to handle queries involving multiple predicates efficiently. Its limitation lies in being confined to ordering data based on a single column, making it less suitable for complex query patterns. While hierarchical sorting offers an improvement by sorting across multiple columns, it too falls short when filters are applied to columns lower in the hierarchy, as the data layout is primarily driven by the first sorted column.
Multi-dimensional clustering is an advanced clustering technique that extends beyond simple sorting by organizing data across multiple columns simultaneously. This method is especially valuable for complex queries that filter on multiple fields, as it preserves data locality across multiple dimensions. By clustering data in this manner, query engines can skip more files and minimize I/O operations, leading to significant performance improvements.
Let’s consider a dataset with employee information, including attributes like Emp Sal (Employee Salary) and YOE (Years of Experience). The table below represents a few entries:
In this dataset, employees like Leo and Mia have similar values for both Emp Sal and YOE. If we visualize this data in a two-dimensional space with Emp Sal on one axis and YOE on the other, we can see that Leo and Mia are closer to each other in this 2D space, reflecting their similar salary and experience levels.
Now, let’s consider a query that filters by both salary and experience:
SELECT * FROM Employee WHERE Emp Sal > 10000 AND YOE > 4
This query involves filtering on both Emp Sal and YOE. Ideally, we want records with similar values for both attributes to be stored together in the same file; in our example, we would expect Leo and Mia to end up in a single file. However, if data is not clustered on multiple dimensions, these records could land in multiple files, and the query engine will again have to scan them all to fetch the results, increasing I/O and reducing performance (imagine this at scale). Sorting data using information from multiple dimensions is not possible with the simple sorting techniques discussed in the previous section. Multi-dimensional clustering tackles precisely this problem.
Z-order clustering is a popular technique for multi-dimensional clustering, addressing the challenge of keeping related data points together when mapping from higher to lower dimensions. A Z-order curve is a space-filling curve that arranges data points in a pattern resembling the letter "Z." It’s used to map data points from multiple dimensions (such as Emp Sal and YOE) to a single-dimensional file structure, preserving spatial locality.
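To see how this mapping works, here is a small worked example with illustrative 2-bit values (not taken from the table above). Suppose a record has Emp Sal = 2, which is 10 in binary, and YOE = 3, which is 11 in binary. Interleaving the bits of the two values, alternating between salary and experience bits, gives 1101 in binary, i.e., 13: this is the record’s Z-value. Computing a Z-value for every record and laying out files in Z-value order keeps records that are neighbors in the two-dimensional (Emp Sal, YOE) space close together on disk.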
Applying Z-order clustering to the example above, we can organize files to utilize both columns, Emp Sal and YOE, for clustering. By clustering along a Z-order curve, records with similar values across both columns are stored close to each other. When the query engine processes our example query for employees with a salary over 10,000 and more than 4 years of experience, it can quickly locate relevant data within a single file (say, File 1.parquet in this case), as shown in the image below.
Another effective technique for multi-dimensional clustering is the Hilbert curve. Like the Z-order curve, the Hilbert curve is a space-filling curve that maps data points from an N-dimensional space onto a one-dimensional line while preserving data locality.
Hilbert curves are particularly advantageous when data access patterns are complex and involve multiple dimensions, as they offer superior spatial locality preservation in high-dimensional clustering compared to Z-order. For example, scientific data involving attributes like time, depth, and location benefits significantly from the locality preservation offered by the Hilbert curve, as it allows for efficient access to data points that are close in multi-dimensional space.
In the context of our example, imagine if the employee dataset had additional attributes, such as Department or Location. With more dimensions, a Hilbert curve would be more effective at clustering data points that are similar across all attributes, ensuring that records with closely related values across Emp Sal, YOE, Department, and Location are stored together. This minimizes I/O and accelerates query performance.
It is important to note that while Hilbert curves provide better locality preservation in higher dimensions, making them ideal for datasets with complex access patterns across multiple attributes, they are also more computationally intensive. When applying such clustering algorithms, it is therefore crucial to weigh tradeoffs such as computational overhead, query patterns, and the specific dimensions being optimized.
Now that we have a fair understanding of the different clustering algorithms, let’s see how these apply to the open lakehouse formats.
While clustering is offered by all three of the most popular open table formats, Apache Hudi provides clustering capabilities through different strategies, deployment modes, and the flexibility to run the service on various open source compute engines. Most importantly, Hudi's clustering options are designed to balance ingestion performance with optimized data layout. These optimizations can be performed asynchronously without disrupting ongoing ingestion processes, allowing users to configure clustering according to their specific workloads without needing to find dedicated maintenance windows.
Hudi supports all three of the storage layout optimization strategies discussed above: linear sorting, Z-ordering, and Hilbert curves. Each strategy defines how records are reorganized during clustering, providing tailored optimizations to meet varied workload requirements. The default is linear sorting.
Hudi's clustering is managed through a table service, simplifying the clustering process for users. The clustering service involves two main phases: scheduling and execution. In the scheduling phase, Hudi identifies the file groups eligible for clustering and persists a clustering plan to the table's timeline. In the execution phase, that plan is picked up, the selected file groups are rewritten according to the configured strategy, and the new files are committed atomically, replacing the old ones. This two-step approach allows Hudi to reorganize data files while maintaining query performance and ensuring consistency.
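As a concrete sketch, recent Hudi releases expose clustering as a Spark SQL procedure. The table and column names below come from our running example, and the parameter names (such as order_strategy) may vary between Hudi versions, so treat this as illustrative rather than definitive:
-- Schedule and execute clustering, sorting records along a Z-order curve
CALL run_clustering(
  table => 'employee',
  order => 'Emp_Sal,YOE',
  order_strategy => 'z-order'
);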
Takeaway: Hudi enables running all three clustering algorithms: linear sorting, Z-ordering, and Hilbert curves. It supports both inline and asynchronous modes for clustering, providing flexibility to balance clustering with ingestion workloads. These modes are supported by the Spark DataSource writer, Flink, and Hudi Streamer.
Apache Iceberg’s clustering is achieved through the rewrite_data_files Spark procedure, which consolidates fragmented or smaller files into optimized groups. Clustering in Iceberg is not a dedicated service, so this approach requires users to trigger clustering manually, ensuring that optimizations are applied when needed. Since clustering in Iceberg is a manual and reactive process, there is no built-in mechanism to balance ingestion performance with optimized data layout via asynchronous methods. As a result, clustering operations are typically performed during planned maintenance windows, based on workload demands and optimization priorities.
Iceberg supports both linear sorting and Z-order clustering, which can be applied through the rewrite_data_files procedure. Linear sorting organizes data lexicographically by a single column, making it effective for simple, single-predicate queries. On the other hand, Z-order clustering uses a space-filling curve to optimize data across multiple dimensions, improving spatial locality for multi-predicate queries such as geospatial or time-series workloads. These strategies enable Iceberg to improve query performance by minimizing file scans, especially when filtering on sorted columns.
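For example, a Z-order rewrite of our employee table could be invoked as follows; the catalog, database, and column names are illustrative placeholders:
-- Rewrite data files, sorting records along a Z-order curve on two columns
CALL catalog.system.rewrite_data_files(
  table => 'db.employee',
  strategy => 'sort',
  sort_order => 'zorder(Emp_Sal, YOE)'
);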
Takeaway: Apache Iceberg’s clustering techniques (linear and Z-ordering) enable users to optimize data layout during planned maintenance windows, though it does not natively integrate with ingestion processes for seamless, asynchronous optimization.
Delta Lake supports clustering through linear sorting, Z-order, and, more recently, the Hilbert curve introduced with Liquid Clustering in Delta Lake 3.1.0. The OPTIMIZE ZORDER BY command enables Z-order clustering, optimizing query performance for predicates on the Z-order columns. Z-order clustering in Delta Lake requires users to manually manage clustering column configurations. Liquid Clustering, a new incremental clustering feature, addresses some of these issues and groups files with Hilbert-clustered data.
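For instance, Z-ordering our employee table by the two columns from the earlier query looks like this (table and column names follow our running example):
-- Compact files and co-locate records by salary and years of experience
OPTIMIZE employee ZORDER BY (Emp_Sal, YOE);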
Liquid Clustering is still triggered manually by invoking the OPTIMIZE command, but it rewrites only unclustered files in subsequent operations, significantly reducing write amplification. By using Hilbert curve clustering, it preserves spatial locality better than Z-order, enabling faster queries for multi-dimensional filters. Additionally, Liquid Clustering stores the clustering columns in the transaction log, eliminating the need for users to remember and specify them during each clustering operation, thus reducing user errors and streamlining the process.
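With Liquid Clustering, the clustering columns are declared once on the table, and later OPTIMIZE calls read them from the transaction log. A sketch, assuming Delta Lake 3.1.0 or later and our illustrative employee schema:
-- Declare clustering columns once at table creation
CREATE TABLE employee (Emp_ID INT, Emp_Name STRING, Emp_Sal INT, YOE INT)
USING DELTA
CLUSTER BY (Emp_Sal, YOE);
-- Later invocations incrementally cluster only files not yet clustered
OPTIMIZE employee;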
Takeaway: Delta Lake supports both linear and multi-dimensional clustering. Liquid Clustering enables incrementally clustering data files without the need to rewrite already clustered data. It does require users to manually trigger optimization through the OPTIMIZE command.
Clustering is a critical optimization technique in data lakehouse architectures, addressing the challenge of aligning data layout with query access patterns. Through strategies like linear sorting and multi-dimensional clustering, we can significantly improve query performance by reducing unnecessary file scans.
Apache Hudi’s clustering features provide flexible options for data reorganization, enabling users to choose between inline and asynchronous clustering based on their performance needs. By balancing ingestion and clustering operations, Hudi helps users maintain optimized storage and deliver fast, efficient queries, which is critical in large-scale data lakehouse environments.