May 22, 2024

Running Apache Hudi™ on Databricks

Apache Hudi was launched at Uber nearly a decade ago as an "incremental data lake." Today, Hudi is considered one of the three major open source data lakehouse projects, along with Apache Iceberg™ and Delta Lake. At Onehouse, we use Hudi and the Onehouse managed service to help our customers implement the Universal Data Lakehouse architectural pattern. 

Hudi serves as a streaming data lake platform, bringing core warehouse and database functionality directly to the data lake. It enables near real-time ingestion, incremental processing, and unified batch and streaming workloads, in addition to providing ACID transaction support, efficient upserts/deletes, and time travel in the data lake. Hudi also provides tables, transactions, advanced indexes, streaming ingestion services, data clustering/compaction optimizations, and concurrency control, all while keeping the data in open source file formats. 
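To make these capabilities concrete, here is a minimal sketch of what a time-travel read and an incremental read look like with Hudi's Spark datasource; the table path and commit timestamps are illustrative placeholders, and spark refers to an active SparkSession.

```python
# Minimal sketch of Hudi time travel and incremental reads
# (table path and commit instants are illustrative placeholders).

# Time travel: read the table as it existed at an earlier commit instant.
df_as_of = (
    spark.read.format("hudi")
    .option("as.of.instant", "20240101000000")
    .load("/tmp/trips_table")
)

# Incremental query: fetch only records changed after a given commit instant.
df_incremental = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20240101000000")
    .load("/tmp/trips_table")
)
```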

Databricks is a cloud-based data engineering platform that helps companies process and analyze large amounts of data using Apache Spark™ and also provides interfaces to build AI/ML models. Databricks also recently introduced the Photon engine, which is compatible with Apache Spark APIs, providing extremely fast query performance at low cost for ETL, streaming, and data science workloads directly on the data lake in cloud storage. 

You can get the “best of both worlds” for key capabilities by running Hudi in the Databricks environment. 

Running Hudi on Databricks

If you're a Databricks user and are wondering `How do I run Apache Hudi on Databricks?`, you’re not alone. Running Hudi on Databricks gives you the benefits of Hudi for faster, near real-time processing use cases on Databricks’ Photon engine, so you get the best of both platforms.

In this blog, we'll walk you through the straightforward process of setting up and configuring Hudi within your Databricks environment. All you need is your Databricks account, and you're ready to follow along. We'll cover everything from creating compute instances to installing the necessary libraries, ensuring you're set up for success.

We'll explore the essential configurations needed to leverage Hudi tables effectively. With these insights, you'll be equipped to tackle data management tasks with ease, all within the familiar Databricks environment.

Prerequisites:

  1. A Databricks account. Note that you can work with Databricks community edition to follow along with this blog.

Setup:

  1. From your Databricks console, click Compute and then Create Compute.
  2. Choose a descriptive compute name, such as hudi-on-databricks, and a runtime version that is compatible with the Hudi version you plan to use. For example, the Databricks 13.3 runtime ships with Spark 3.4.1, which is currently supported by Hudi.
  3. Once the compute is created, open the Libraries tab inside hudi-on-databricks and click Install new.
  4. In the pop-up, choose Maven and provide the Hudi-Spark Maven coordinates, such as org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.1.
  5. Provide the Databricks runtime environment with the minimal configuration needed to work with Hudi tables in Spark. This goes in the Spark tab under Configuration inside hudi-on-databricks; paste in configurations like the sample shown after this list.
  6. Confirm the changes, which restarts the compute. From then on, any notebook attached to this compute will be able to read and write Hudi tables.
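As a reference point for step 5, here is a minimal set of Spark configurations based on the Hudi-Spark quickstart guide, written as the space-separated key/value pairs the Databricks Spark config box expects. Treat this as a sketch: depending on your Databricks runtime you may need to adjust or omit the catalog setting.

```
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.sql.extensions org.apache.spark.sql.hudi.HoodieSparkSessionExtension
spark.sql.catalog.spark_catalog org.apache.spark.sql.hudi.catalog.HoodieCatalog
spark.kryo.registrator org.apache.spark.HoodieSparkKryoRegistrar
```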

Putting it all together:

Let’s create a notebook based on the example provided in the Hudi-Spark quickstart page. You can also download the notebook from this repository and upload it to Databricks for ease of use.

As we have already updated the compute with the Hudi-Spark bundled jar, it’s automatically added to the classpath and you can start reading and writing Hudi tables as you would in any other Spark environment.
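For instance, a notebook cell along these lines, adapted from the Hudi-Spark quickstart guide, creates a small trips table; the table name and path are illustrative, and spark is the notebook's SparkSession.

```python
# Adapted from the Hudi quickstart guide; spark is the notebook's SparkSession.
columns = ["ts", "uuid", "rider", "driver", "fare", "city"]
data = [
    (1695159649087, "334e26e9-8355-45cc-97c6-c31daf0df330", "rider-A", "driver-K", 19.10, "san_francisco"),
    (1695091554788, "e96c4396-3fad-413a-a942-4cb36106d721", "rider-C", "driver-M", 27.70, "san_francisco"),
    (1695046462179, "9909a8b1-2d15-4d3d-8ec9-efc48c536a00", "rider-D", "driver-L", 33.90, "san_francisco"),
]
inserts = spark.createDataFrame(data).toDF(*columns)

table_name = "trips_table"
base_path = "file:///tmp/trips_table"  # local path; see the note on cloud storage below

hudi_options = {
    "hoodie.table.name": table_name,
    "hoodie.datasource.write.recordkey.field": "uuid",
    "hoodie.datasource.write.partitionpath.field": "city",
    "hoodie.datasource.write.precombine.field": "ts",
}

# Write the DataFrame out as a Hudi table.
inserts.write.format("hudi").options(**hudi_options).mode("overwrite").save(base_path)
```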

While this example demonstrates writing data to a local path inside Databricks, production users would typically write to S3/GCS/ADLS. This pattern requires additional configuration in Databricks; for the steps to enable Amazon S3 reads and writes, you can follow this documentation.
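Once cloud storage access is configured, pointing the writer at an object store is just a matter of changing the base path. The bucket name below is hypothetical, and the snippet assumes the write example above has already been run in the notebook.

```python
# Hypothetical bucket; assumes the cluster already has S3 access configured
# per the Databricks documentation referenced above.
s3_base_path = "s3a://my-hudi-bucket/trips_table"
inserts.write.format("hudi").options(**hudi_options).mode("overwrite").save(s3_base_path)
```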

Note: Pay close attention to hoodie.file.index.enable being set to false. This option normally enables Hudi's Spark file index implementation, which speeds up listing of large tables; setting it to false turns that implementation off, and doing so is required if you're using Databricks to read Hudi tables.
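Continuing from the write example above, a read with that option set might look like this:

```python
# Read the Hudi table back; hoodie.file.index.enable=false is required on Databricks reads.
trips_df = (
    spark.read.format("hudi")
    .option("hoodie.file.index.enable", "false")
    .load(base_path)
)
trips_df.select("uuid", "rider", "driver", "fare", "city").show()
```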

You can also register this as a temporary view, which lets you run SQL queries from the same notebook.
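For example, continuing with the DataFrame from the read above (the view name is illustrative):

```python
# Register a temporary view and query it with Spark SQL from the same notebook.
trips_df.createOrReplaceTempView("trips_view")
spark.sql("""
    SELECT city, COUNT(*) AS trips, ROUND(SUM(fare), 2) AS total_fare
    FROM trips_view
    GROUP BY city
""").show()
```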

Conclusion:

Integrating Apache Hudi into your Databricks workflow offers a streamlined approach to data lake management and processing. By following the steps outlined in this blog, you've learned how to configure Hudi within your Databricks environment, empowering you to handle your data operations with efficiency and confidence.

With Hudi, you can leverage incremental data processing, simplified data ingestion, and efficient data management – all within the familiar Databricks ecosystem. Whether you're working with heavy workloads or complex data structures, Hudi on Databricks provides the tools you need to succeed. 

We hope this guide has equipped you with the knowledge and skills to unlock the full potential of Apache Hudi in your data projects. By harnessing the power of Hudi on Databricks, you can streamline your data operations, drive insights, and accelerate innovation.

To learn more about the Universal Data Lakehouse architectural pattern and its benefits, you can download the whitepaper for free.
