In recent weeks, there has been growing interest in comparing the performance of the Apache Hudi, Delta Lake, and Apache Iceberg open source projects for the data lakehouse. We feel the community deserves a more transparent and reproducible analysis, and we want to add our perspective on how these benchmarks should be executed and presented, what value they bring, and how we should interpret them.
Recently, Databeans published a blog comparing the performance of Hudi, Delta, and Iceberg head-to-head using the TPC-DS benchmark. While it’s fantastic to see the community coming forward and taking action to raise awareness of the current state of the art in the industry, we identified a few issues with the way the experiments were conducted and the results were reported, which we want to share and discuss more broadly today.
As a community, we should strive to add more rigor when publishing benchmarks. We believe these are crucial tenets of any benchmarking efforts:
With respect to these fundamental principles, we believe the Databeans blog unfortunately fell short of sharing the complete picture of what results were achieved and how. For example:
We routinely run performance benchmarks to make sure that Hudi’s rich feature-set comes with the best possible performance for the exabytes of Hudi-powered data lakes out there. Our team has extensive experience benchmarking complex distributed systems such as Apache Kafka and Apache Pulsar, staying true to the principles outlined above.
To make sure the published benchmarks comply with these principles:
Hudi’s origins are rooted in incremental data processing, turning old-school batch jobs into incremental ones. As such, Hudi’s default configs are geared towards incremental upserts and generating change streams for incremental ETL pipelines, treating the initial load as a rare, one-time activity. Closer attention therefore needs to be paid to the configuration for the initial load times to be comparable with Delta’s.
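To make this concrete, here is a minimal, illustrative sketch of how an initial load can be written with Hudi’s bulk_insert operation instead of the default upsert. The table name, key fields, and paths below are hypothetical placeholders, not the exact settings used in the benchmark.

```scala
import org.apache.spark.sql.SaveMode

// Hypothetical source: a raw TPC-DS Parquet dataset
val df = spark.read.parquet("s3://my-bucket/tpcds/store_sales/")

df.write.format("hudi")
  .option("hoodie.table.name", "store_sales")
  .option("hoodie.datasource.write.recordkey.field", "ss_item_sk,ss_ticket_number")
  .option("hoodie.datasource.write.partitionpath.field", "ss_sold_date_sk")
  // bulk_insert targets the one-time initial load; the default (upsert)
  // is tuned for incremental workloads and performs extra indexing work.
  .option("hoodie.datasource.write.operation", "bulk_insert")
  .mode(SaveMode.Overwrite)
  .save("s3://my-bucket/hudi/store_sales/")  // hypothetical target path
```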
As can be clearly seen, Delta and Hudi are within 6% of each other for the 0.11.1 release and within 5% for Hudi’s current master* (we additionally benchmarked against Hudi’s master branch, since we recently discovered a bug in the Parquet encoding configuration that has since been promptly resolved).
To power the rich feature-set that Hudi provides on top of raw Parquet tables, such as:
and many more, Hudi internally stores a set of additional metadata along with every record, called meta-fields. Since TPC-DS is primarily concerned with snapshot queries, these fields were disabled (and not computed) in this particular experiment; however, Hudi still persists them as nulls, so they can be turned on in the future without evolving the schema. Adding five such fields as nulls carries a low, but still non-negligible, overhead.
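As a rough sketch of what disabling meta-field population looks like, the snippet below sets the hoodie.populate.meta.fields table config to false at write time; the table name and path are hypothetical, and the remaining benchmark settings are omitted.

```scala
// With hoodie.populate.meta.fields=false, Hudi skips computing the five
// per-record meta-fields (_hoodie_commit_time, _hoodie_commit_seqno,
// _hoodie_record_key, _hoodie_partition_path, _hoodie_file_name) but still
// writes the columns as nulls, so they can be enabled later without a
// schema change.
df.write.format("hudi")
  .option("hoodie.table.name", "store_sales")
  .option("hoodie.datasource.write.operation", "bulk_insert")
  .option("hoodie.populate.meta.fields", "false")
  .mode(SaveMode.Overwrite)
  .save("s3://my-bucket/hudi/store_sales/")  // hypothetical path
```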
As we can see, there’s practically no difference between Hudi 0.11.1 and Delta 1.2.0 performance, and Hudi’s current master is very slightly faster (~5%).
You can find raw logs in this directory on Google Drive:
To reproduce the results above, please use our branch in Delta’s benchmark repo and follow the steps in the README.
Summing up, we wanted to underscore the importance of openness and reproducibility in an area as sensitive and sophisticated as performance benchmarking. As we have seen repeatedly, obtaining reliable and trustworthy benchmarking results is tedious and challenging, requiring dedication, diligence, and rigor to back it up.
Going forward, we’re planning to release more internal benchmarks that highlight how Hudi’s rich feature-set reaches unmatched performance levels in other common industry workloads. Stay tuned!