Apache Hudi was launched at Uber nearly a decade ago as an "incremental data lake." Today, Hudi is considered one of the three major open source data lakehouse projects, along with Apache Iceberg™ and Delta Lake. At Onehouse, we use Hudi and the Onehouse managed service to help our customers implement the Universal Data Lakehouse architectural pattern.
Hudi serves as a streaming data lake platform, bringing core warehouse and database functionality directly to the data lake. It enables near real-time ingestion, incremental processing, and unified batch and streaming processing, along with ACID transactions, efficient upserts/deletes, and time travel in the data lake. Hudi also provides tables, advanced indexes, streaming ingestion services, data clustering/compaction optimizations, and concurrency control, all while keeping data in open source file formats.
Databricks is a cloud-based data engineering platform that helps companies process and analyze large amounts of data using Apache Spark™, and it also provides interfaces to build AI/ML models. Databricks recently introduced the Photon engine, which is compatible with Apache Spark APIs and provides fast, low-cost query performance for ETL, streaming, and data science workloads directly on the data lake in cloud storage.
You can get the “best of both worlds” for key capabilities by running Hudi in the Databricks environment.
If you're a Databricks user and are wondering "How do I run Apache Hudi on Databricks?", you're not alone. By running Hudi on Databricks, you get the benefits of Hudi for faster real-time processing use cases on top of Databricks' Photon engine, combining the best of both platforms.
In this blog, we'll walk you through the straightforward process of setting up and configuring Hudi within your Databricks environment. All you need is your Databricks account, and you're ready to follow along. We'll cover everything from creating compute instances to installing the necessary libraries, ensuring you're set up for success.
We'll explore the essential configurations needed to leverage Hudi tables effectively. With these insights, you'll be equipped to tackle data management tasks with ease, all within the familiar Databricks environment.
Let’s create a notebook based on the example provided in the Hudi-Spark quickstart page. You can also download the notebook from this repository and upload it to Databricks for ease of use.
Because we have already updated the compute with the Hudi-Spark bundle jar, it is automatically added to the classpath, and you can start reading and writing Hudi tables as you would in any other Spark environment.
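As a rough sketch of what such a notebook cell could look like (the table name, schema, and path below are placeholders rather than the quickstart's exact example), you might write a small DataFrame into a Hudi table like this:

```python
# Minimal PySpark cell for a Databricks notebook; `spark` is provided by Databricks.
# Assumes the cluster Spark config already includes the settings from the Hudi quickstart:
#   spark.serializer org.apache.spark.serializer.KryoSerializer
#   spark.sql.extensions org.apache.spark.sql.hudi.HoodieSparkSessionExtension

table_name = "hudi_trips"          # placeholder table name
base_path = "/tmp/hudi_trips"      # local path used for this demo

# Build a tiny DataFrame to write
records = [
    (1, "rider-A", "driver-X", 19.10, "san_francisco"),
    (2, "rider-B", "driver-Y", 27.70, "sao_paulo"),
]
df = spark.createDataFrame(records, ["id", "rider", "driver", "fare", "city"])

hudi_options = {
    "hoodie.table.name": table_name,
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.partitionpath.field": "city",
    "hoodie.datasource.write.precombine.field": "fare",
    "hoodie.datasource.write.operation": "upsert",
}

# Write (or upsert into) the Hudi table
df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```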
While this example demonstrates writing data to a local path inside Databricks, in production users would typically write to S3/GCS/ADLS. That pattern requires additional configuration in Databricks; for steps to enable Amazon S3 reads/writes, you can follow this documentation.
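Purely for illustration, assuming cloud storage access has already been configured per that documentation, switching the write target is just a matter of pointing the save path at your bucket (the bucket name below is a placeholder):

```python
# Hypothetical example: the same write as above, targeting S3 instead of a local path.
# Assumes S3 access is already set up for the cluster per the linked documentation.
s3_base_path = "s3a://<your-bucket>/hudi/hudi_trips"

df.write.format("hudi").options(**hudi_options).mode("append").save(s3_base_path)
```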
Note: Pay close attention to hoodie.file.index.enable being set to false. By default, this config enables Hudi's Spark file index implementation, which speeds up listing of large tables; setting it to false is required when using Databricks to read Hudi tables.
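A minimal sketch of a read that applies this option, using the same placeholder path as the write above:

```python
# Read the Hudi table back, disabling Hudi's Spark file index as required on Databricks
trips_df = (
    spark.read.format("hudi")
    .option("hoodie.file.index.enable", "false")
    .load(base_path)
)
trips_df.show()
```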
You can also register this as a view, so you can run SQL queries from the same notebook.
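For example, continuing the sketch above, you could register the DataFrame read back from the table as a temporary view and query it with Spark SQL:

```python
# Register the DataFrame as a temporary view so it can be queried with SQL in the same notebook
trips_df.createOrReplaceTempView("hudi_trips_view")

spark.sql("""
    SELECT city, COUNT(*) AS trips, ROUND(AVG(fare), 2) AS avg_fare
    FROM hudi_trips_view
    GROUP BY city
""").show()
```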
In conclusion, integrating Apache Hudi into your Databricks workflow offers a streamlined approach to data lake management and processing. By following the steps outlined in this blog, you've learned how to configure Hudi within your Databricks environment, empowering you to handle your data operations with efficiency and confidence.
With Hudi, you can leverage incremental data processing, simplified data ingestion, and efficient data management – all within the familiar Databricks ecosystem. Whether you're working with heavy workloads or complex data structures, Hudi on Databricks provides the tools you need to succeed.
We hope this guide has equipped you with the knowledge and skills to unlock the full potential of Apache Hudi in your data projects. By harnessing the power of Hudi on Databricks, you can streamline your data operations, drive insights, and accelerate innovation.
To learn more about the Universal Data Lakehouse architectural pattern and its benefits, you can download the whitepaper for free.