January 23, 2025

What is Clustering in an Open Data Lakehouse?

Clustering is a storage optimization method available in open table formats such as Apache Hudi, Apache Iceberg, and Delta Lake focused on enhancing query efficiency by reducing unnecessary file scans. At its core, clustering aims to tackle the misalignment often found between the ingestion order (e.g., data arrival time) and the query-access patterns (e.g., event time). By organizing data based on frequently queried fields, storage is optimized, enabling engines to access data more selectively.

To establish the problem context, take a scenario in which data, initially ingested by arrival time, is spread across multiple files. Queries filtering on columns other than arrival time (such as event date) would need to scan a large number of files, which increases read latency. Clustering addresses this issue by reorganizing data to bring similar records together, thus reducing the number of files a query must access and improving overall query performance.

Types of Clustering Algorithms

Linear Sorting

Linear sorting is the simplest clustering technique, where data is ordered based on a specific column, such as Emp Name in our example, or another frequently queried field. The goal is to improve data locality, ensuring that records with similar values are grouped within the same file. This makes it easier for the query engine to locate and access relevant data, significantly enhancing performance by reducing the amount of data scanned.

Fig 1: Employee dataset sorted by Emp Name field in the two data files

Let’s take this simple example to understand the impact of sorting on query performance. Suppose we have employee data that’s sorted by the Emp Name column, as illustrated in the image. This data is split into two files:

  • File 1 has a name range of Arynn to Dennis with employee IDs ranging from 1 to 40 and salaries between 20,000 and 30,000.
  • File 2 has a name range of Diana to Joseph with employee IDs from 30 to 65 and salaries from 24,000 to 50,000.

Since the data is sorted by Emp Name, each file contains a unique range of names, ensuring that any query targeting a specific name will only need to scan the file containing that range.

Consider the following query:

SELECT * FROM Employee WHERE "Emp Name" = 'David'

In this case, the query engine checks the min-max range of the Emp Name column and sees that File 1 covers Arynn-Dennis. Since David falls within this range, the engine only needs to scan File 1 to locate the data. File 2 (covering Diana-Joseph) is skipped entirely because it cannot contain matching records.

By organizing the data using sorting, the layout of the data files aligns better with the query pattern, enabling the engine to access only half of the files. This selective reading of files minimizes I/O operations and reduces latency, which is especially beneficial for large datasets with multiple partitions.

Now, imagine that queries often filter not just on Emp Name but also on Emp Sal and Emp ID. In such cases, linear sorting by a single column like Emp Name will not help prune all the irrelevant records. To address this, we can use hierarchical sorting, where the data is sorted in a fixed order: first by Emp Name, then by Emp Sal, and finally by Emp ID. For example, consider a query with predicates like this:

SELECT * FROM Employee WHERE "Emp Name" = 'David' AND "Emp Sal" = 25000 AND "Emp ID" = 22

In this case, the query engine will take into consideration every predicate and compare the min-max ranges of Emp Name, Emp Sal, and Emp ID to filter files more precisely. However, hierarchical sorting is only effective when queries include a filter on the first sorted column (Emp Name in this case) and will not work well if predicates refer to just Emp Sal or Emp ID or a combination of both. For instance, if a query filters only on Emp Sal, such as:

SELECT * FROM Employee WHERE "Emp Sal" > 19000

The engine may still need to scan all files because the primary sorting by Emp Name could scatter the ranges of Emp Sal across all files. Thus, while hierarchical sorting helps to deal with multi-column queries to some extent, its effectiveness depends heavily on the order of the sorted columns and query patterns.

Overall, linear sorting proves effective for queries with single predicates (e.g., filtering by Emp Name alone) but struggles to handle queries involving multiple predicates efficiently. Its limitation lies in being confined to ordering data based on a single column, making it less suitable for complex query patterns. While hierarchical sorting offers an improvement by sorting across multiple columns, it too falls short when filters are applied to columns lower in the hierarchy, as the data layout is primarily driven by the first sorted column.
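
To make this concrete, here is a minimal sketch of how a sorted layout is typically produced at write time with Spark SQL; the table and underscore-style column names are illustrative, and the exact mechanics vary by table format:

-- Rewrite the table with a global sort so that each file's min-max
-- ranges become tight on the sorted columns (illustrative names)
INSERT OVERWRITE employee_sorted
SELECT * FROM employee
ORDER BY emp_name, emp_sal, emp_id;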

Multi-Dimensional Clustering

Multi-dimensional clustering is an advanced clustering technique that extends beyond simple sorting by organizing data across multiple columns simultaneously. This method is especially valuable for complex queries that filter on multiple fields, as it preserves data locality across multiple dimensions. By clustering data in this manner, query engines can skip more files and minimize I/O operations, leading to significant performance improvements.

Let’s consider a dataset with employee information that includes attributes like Emp Sal (Employee Salary) and YOE (Years of Experience).

In this dataset, employees like Leo and Mia have similar values for both Emp Sal and YOE. If we visualize this data in a two-dimensional space with Emp Sal on one axis and YOE on the other, we can see that Leo and Mia are closer to each other in this 2D space, reflecting their similar salary and experience levels.

Fig 2: Two employees (quite similar in attributes) projected in 2D

Now, let’s consider a query that filters by both salary and experience:

SELECT * FROM Employee WHERE "Emp Sal" > 10000 AND YOE > 4

This query filters on both Emp Sal and YOE. Ideally, we want records with similar values for both attributes to be stored together in the same file; for our example, we would expect Leo and Mia to land in a single file. However, if data is not clustered on multiple dimensions, these records could end up scattered across many files, and the query engine would again have to scan them all, increasing I/O and reducing performance (imagine this at scale). Simple sorting (discussed in the previous section) cannot organize data using information from multiple dimensions at once; multi-dimensional clustering tackles precisely this problem.

Z-Order Clustering

Z-order clustering is a popular technique for multi-dimensional clustering, addressing the challenge of keeping related data points together when mapping from higher to lower dimensions. A Z-order curve is a space-filling curve that arranges data points in a pattern resembling the letter "Z." It’s used to map data points from multiple dimensions (such as Emp Sal and YOE) to a single-dimensional file structure, preserving spatial locality.
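
To build some intuition for how a Z-value is derived, here is a small, self-contained Spark SQL sketch that interleaves the bits of two 2-bit coordinates; the column names and values are purely illustrative, and clustering engines perform this encoding internally over the actual column values:

-- Interleave the bits of x and y (each 0-3) into a single z_value;
-- points close in (x, y) space tend to receive nearby z_values
SELECT x, y,
       shiftleft(y & 2, 2) | shiftleft(x & 2, 1) | shiftleft(y & 1, 1) | (x & 1) AS z_value
FROM VALUES (0, 0), (1, 0), (0, 1), (1, 1), (2, 2), (3, 3) AS t(x, y)
ORDER BY z_value;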

Applying Z-order clustering to the example above, we can organize files using both columns, Emp Sal and YOE, for clustering. By clustering along a Z-order curve, records with similar values across both columns are stored close to each other. When the query engine processes our example query for employees with a salary over 10,000 and more than 4 years of experience, it can quickly locate the relevant data within a single file (say, File 1.parquet in this case), as shown in the image below.

Fig 3: Z-order clustering preserves locality (in 1 dimension) for the two employees

Hilbert Curves

Another effective technique for multi-dimensional clustering is the Hilbert curve. Like the Z-order curve, the Hilbert curve is a space-filling curve that maps data points from an N-dimensional space onto a one-dimensional line while preserving data locality.

Hilbert curves are particularly advantageous when data access patterns are complex and involve multiple dimensions, as they offer superior spatial locality preservation in high-dimensional clustering compared to Z-order. For example, scientific data involving attributes like time, depth, and location benefits significantly from the locality preservation offered by the Hilbert curve, as it allows for efficient access to data points that are close in multi-dimensional space.

In the context of our example, imagine if the employee dataset had additional attributes, such as Department or Location. With more dimensions, a Hilbert curve would be more effective at clustering data points that are similar across all attributes, ensuring that records with closely related values across Emp Sal, YOE, Department, and Location are stored together. This minimizes I/O and accelerates query performance.

It is important to note that while Hilbert curves provide better locality preservation in higher dimensions, making them ideal for datasets with complex access patterns across multiple attributes, they are also more computationally intensive. When applying such clustering algorithms, it is therefore crucial to weigh tradeoffs such as computational overhead, query patterns, and the specific dimensions being optimized.

Now that we have a fair understanding of the different clustering algorithms, let’s see how these apply to the open lakehouse formats.

Clustering in Apache Hudi

While clustering is offered by all three of the most popular open table formats, Apache Hudi provides clustering capabilities through different strategies, deployment modes, and the flexibility to run the service using various open source compute engines. Most importantly, Hudi's clustering options are designed to balance ingestion performance with optimized data layout. These optimizations can be performed asynchronously without disrupting ongoing ingestion processes, allowing users to configure clustering for their specific workloads without needing dedicated maintenance windows.

Hudi supports all three of the storage layout optimization strategies discussed above: linear sorting, Z-ordering, and Hilbert curves. Each strategy defines how records are reorganized during clustering, providing tailored optimizations for varied workload requirements. The default method is linear sorting.

Hudi’s Clustering Service

Hudi's clustering is managed through a table service, simplifying the clustering process for users. The clustering service in Hudi involves two main phases: scheduling and execution. This two-step approach allows Hudi to reorganize data files while maintaining query performance and ensuring consistency. Here’s a breakdown of each phase:

  1. Scheduling Clustering: In this phase, Hudi creates a clustering plan to specify which files will be reorganized:
    1. Identifying Eligible Files: Based on the chosen strategy, Hudi identifies files that could benefit from clustering, like small or fragmented files.
    2. Grouping Files: Files are grouped to meet the target file size, with an option to cap group size to enhance parallelism.
    3. Saving the Plan: The clustering plan is saved to the timeline in Avro format, ready for execution.
  2. Executing Clustering: This phase processes the clustering plan, applying the defined strategy to each file group:
    1. Retrieving Clustering Groups: The plan designates specific groups for reorganization.
    2. Applying Strategy: Hudi applies the execution strategy (e.g., sortColumns) and rewrites data into new files.
    3. Committing Changes: A "REPLACE" commit records the new layout, updating metadata accordingly.
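
As an illustration, Hudi exposes this two-phase service through a Spark SQL procedure in recent releases; a minimal sketch follows, assuming an illustrative table name and column names:

-- Schedule and execute clustering in one call, sorting records
-- by the given columns (table and column names are illustrative)
CALL run_clustering(table => 'employee', order => 'emp_name,emp_sal');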

Takeaway: Hudi enables running all three clustering algorithms: linear sorting, Z-ordering, and Hilbert curves. It supports both inline and asynchronous modes for clustering, providing flexibility to balance clustering with ingestion workloads. These modes are supported by the Spark DataSource Writer, Flink, and Hudi Streamer.

Clustering in Apache Iceberg

Apache Iceberg’s clustering is achieved through the rewrite_data_files Spark procedure, which consolidates small or fragmented files into optimized groups. Clustering in Iceberg is not a dedicated table service, so users must trigger it manually when optimization is needed. Because clustering in Iceberg is a manual, reactive process, there is no built-in mechanism to balance ingestion performance with optimized data layout asynchronously. As a result, clustering operations are typically performed during planned maintenance windows, based on workload demands and optimization priorities.

Iceberg supports both linear sorting and Z-order clustering, which can be applied through the rewrite_data_files procedure. Linear sorting organizes data lexicographically by a single column, making it effective for simple, single-predicate queries. On the other hand, Z-order clustering uses a space-filling curve to optimize data across multiple dimensions, improving spatial locality for multi-predicate queries such as geospatial or time-series workloads. These strategies enable Iceberg to improve query performance by minimizing file scans, especially when filtering on sorted columns.
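
For example, both strategies can be invoked through the same procedure; the catalog, database, table, and column names below are illustrative:

-- Linear sort on a single column
CALL catalog.system.rewrite_data_files(
  table => 'db.employee',
  strategy => 'sort',
  sort_order => 'emp_name ASC'
);

-- Z-order across two columns
CALL catalog.system.rewrite_data_files(
  table => 'db.employee',
  strategy => 'sort',
  sort_order => 'zorder(emp_sal, yoe)'
);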

Takeaway: Apache Iceberg’s clustering techniques (linear sorting and Z-ordering) enable users to optimize data layout during planned maintenance windows, though they do not natively integrate with ingestion processes for seamless, asynchronous optimization.

Clustering in Delta Lake

Delta Lake supports clustering through linear sorting, Z-ordering, and, more recently, Hilbert curves, introduced with Liquid Clustering in Delta Lake 3.1.0. The OPTIMIZE ZORDER BY command enables Z-order clustering, optimizing query performance for predicates on the Z-order columns. Z-order clustering in Delta Lake requires users to manually manage clustering column configurations. Liquid Clustering, a newer incremental clustering feature, addresses some of these issues and groups files using Hilbert-clustered data.

Liquid Clustering is still triggered manually by invoking the OPTIMIZE command, but it rewrites only unclustered files in subsequent operations, significantly reducing write amplification. By using Hilbert curve clustering, it preserves better spatial locality compared to Z-order, enabling faster queries for multi-dimensional filters. Additionally, Liquid Clustering stores clustering information in the transaction log, eliminating the need for users to remember and specify them during each clustering operation, thus reducing user errors and streamlining the process.
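
As a sketch (table and column names are hypothetical), the two approaches look like this in Spark SQL:

-- Z-order clustering: the columns must be specified on every run
OPTIMIZE employee ZORDER BY (emp_sal, yoe);

-- Liquid Clustering: clustering columns are declared once on the table...
CREATE TABLE employee_lc (emp_id INT, emp_sal INT, yoe INT)
USING delta
CLUSTER BY (emp_sal, yoe);

-- ...and each subsequent OPTIMIZE clusters only not-yet-clustered files
OPTIMIZE employee_lc;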

Takeaway: Delta Lake supports both linear and multi-dimensional clustering. Liquid Clustering enables incrementally clustering data files without the need to rewrite already clustered data. It does require users to manually trigger optimization through the OPTIMIZE command.

Conclusion

Clustering is a critical optimization technique in data lakehouse architectures, addressing the challenge of aligning data layout with query access patterns. Through strategies like linear sorting and multi-dimensional clustering, we can significantly improve query performance by reducing unnecessary file scans.

Apache Hudi’s clustering features provide flexible options for data reorganization, enabling users to choose between inline and asynchronous clustering based on their performance needs. By balancing ingestion and clustering operations, Hudi helps users maintain optimized storage and deliver fast, efficient queries, which is critical in large-scale data lakehouse environments.

Authors
Dipankar Mazumdar, Staff Developer Advocate

Dipankar is currently a Staff Developer Advocate at Onehouse, where he focuses on open-source projects such as Apache Hudi & XTable to help engineering teams build robust data platforms. Before this, he contributed to other critical projects such as Apache Iceberg & Apache Arrow. For most of his career, Dipankar worked at the intersection of Data Engineering and Machine Learning. He is also the author of the book "Engineering Lakehouses using Open Table Formats" and has been a speaker at numerous conferences such as Data+AI, ApacheCon, Scale By the Bay, Data Day Texas among others.
