At Onehouse, we help organizations rapidly implement a Universal Data Lakehouse, unlocking previously untapped potential for analytics and AI.
Here’s one example of the flexibility that the Universal Data Lakehouse model provides: The continuous ingestion services offered by Onehouse write data in your S3/GCS buckets as Apache Hudi, Delta Lake, and/or Apache Iceberg tables. Onehouse can even simultaneously write in all three formats - thanks to Apache XTable (Incubating), an open source project we originally launched as OneTable with Google and Microsoft.
When a data engineering team evaluates Apache Hudi, a number of unique characteristics stand out. Until recently, however, developers’ table format choices were constrained by their choice of query / processing engine.
Now, though, developers realize they can easily mix and match table formats with query engines using the likes of Apache XTable. So a question we hear more of is, “Can I use Apache Hudi with Databricks?” Some are even pleasantly surprised to hear that the answer is, “Yes, it’s easy!”
In this blog, we’ll guide you through the steps to use Apache Hudi on Databricks and demonstrate how to catalog the tables within Databricks’ Unity Catalog via Apache XTable.
Let’s walk through the straightforward process of setting up and configuring Hudi within your Databricks environment. All you need is your Databricks account and you're ready to follow along. We'll cover everything from creating compute instances to installing the necessary libraries, ensuring that you're set up for success.
We'll also cover the essential configurations needed to work with Hudi tables effectively. With these insights, you'll be equipped to tackle a wide range of data management tasks, all within the familiar Databricks environment. To get started, add the following Spark configurations under your compute cluster's advanced options:
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.sql.catalog.spark_catalog org.apache.spark.sql.hudi.catalog.HoodieCatalog
spark.sql.extensions org.apache.spark.sql.hudi.HoodieSparkSessionExtension
spark.kryo.registrator org.apache.spark.HoodieSparkKryoRegistrar
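If you'd like to confirm from a notebook that the cluster picked up these settings, you can read them back from the active session. A minimal sketch, assuming a `spark` session is already available (as it is in any Databricks notebook):

```python
# Minimal sketch: verify the cluster-level Hudi settings from a notebook.
# Assumes an active SparkSession named `spark` (standard in Databricks).
required = {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.hudi.catalog.HoodieCatalog",
    "spark.sql.extensions": "org.apache.spark.sql.hudi.HoodieSparkSessionExtension",
    "spark.kryo.registrator": "org.apache.spark.HoodieSparkKryoRegistrar",
}
for key, expected in required.items():
    actual = spark.conf.get(key, None)
    status = "OK" if actual == expected else f"missing or set to {actual}"
    print(f"{key}: {status}")
```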
Let’s create a notebook based on the example provided in the Hudi-Spark quickstart page. You can also download the notebook from this repository and upload it to Databricks for ease of use.
As we have already updated the compute with the Hudi-Spark bundled jar, it’s automatically added to the classpath. You can start reading and writing Hudi tables as you would in any other Spark environment.
# pyspark
from pyspark.sql.functions import lit, col

tableName = "trips_table"
basePath = "file:///tmp/trips_table"

columns = ["ts", "uuid", "rider", "driver", "fare", "city"]
data = [
    (1695159649087, "334e26e9-8355-45cc-97c6-c31daf0df330", "rider-A", "driver-K", 19.10, "san_francisco"),
    (1695091554788, "e96c4396-3fad-413a-a942-4cb36106d721", "rider-C", "driver-M", 27.70, "san_francisco"),
    (1695046462179, "9909a8b1-2d15-4d3d-8ec9-efc48c536a00", "rider-D", "driver-L", 33.90, "san_francisco"),
    (1695516137016, "e3cf430c-889d-4015-bc98-59bdce1e530c", "rider-F", "driver-P", 34.15, "sao_paulo"),
    (1695115999911, "c8abbe79-8d89-47ea-b4ce-4d224bae5bfa", "rider-J", "driver-T", 17.85, "chennai"),
]
inserts = spark.createDataFrame(data).toDF(*columns)
inserts.show()
While this example demonstrates writing data to a local path inside Databricks, for production use you would typically write to S3/GCS/ADLS, which requires additional configuration in Databricks. For the steps to enable Amazon S3 reads and writes, you can follow this documentation.
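As a sketch of what that production variant looks like, the only code change is the base path; the bucket name below is hypothetical, and the snippet assumes your Databricks cluster already has S3 access configured (for example, via an instance profile) per the documentation referenced above:

```python
# Hypothetical production path -- substitute your own bucket and prefix.
# Assumes the Databricks cluster already has S3 credentials configured
# (e.g., an instance profile), per the Databricks documentation.
basePath = "s3a://my-bucket/warehouse/trips_table"

inserts.write.format("hudi"). \
    options(**hudi_options). \
    mode("overwrite"). \
    save(basePath)
```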
hudi_options = {
'hoodie.table.name': tableName
}
inserts.write.format("hudi"). \
    options(**hudi_options). \
    mode("overwrite"). \
    save(basePath)
Note: Pay close attention to hoodie.file.index.enable being set to false. This disables Hudi's Spark file index implementation (which is enabled by default to speed up listing of large tables); disabling it is required when reading Hudi tables from Databricks.
tripsDF = spark.read.format("hudi").option("hoodie.file.index.enable", "false").load(basePath)
tripsDF.show()
You can also turn this into a view, so you can run SQL queries from the same notebook.
tripsDF.createOrReplaceTempView("trips_table")
%sql
SELECT uuid, fare, ts, rider, driver, city FROM trips_table WHERE fare > 20.0
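The Hudi quickstart also demonstrates updates. As a hedged sketch of the same pattern in this notebook, reusing the `basePath` and `hudi_options` defined above and assuming the same Hudi-enabled cluster (this will only run where the Hudi bundle is installed):

```python
# Read back current records, change rider-A's fare, and upsert the result.
# Follows the update pattern from the Hudi quickstart; the new fare value
# is illustrative.
from pyspark.sql.functions import col, lit

updatesDF = spark.read.format("hudi") \
    .option("hoodie.file.index.enable", "false") \
    .load(basePath) \
    .filter(col("rider") == "rider-A") \
    .withColumn("fare", lit(25.20))

updatesDF.write.format("hudi") \
    .option("hoodie.datasource.write.operation", "upsert") \
    .options(**hudi_options) \
    .mode("append") \
    .save(basePath)
```

Re-running the earlier read and SQL query should then reflect the updated fare.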
As Databricks’ Unity Catalog currently supports only the Delta Lake table format, if you need to use Unity Catalog you must translate the Hudi table metadata to Delta Lake format using Apache XTable.
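As a sketch, XTable's standalone sync utility is driven by a small YAML config. The file name below is illustrative, and you should verify field names and the bundle version against the current XTable documentation:

```yaml
# my_config.yaml -- illustrative config for XTable's RunSync utility.
# Translates the Hudi table written above into Delta Lake metadata.
sourceFormat: HUDI
targetFormats:
  - DELTA
datasets:
  - tableBasePath: file:///tmp/trips_table
    tableName: trips_table
```

You would then run something like `java -jar xtable-utilities-<version>-bundled.jar --datasetConfig my_config.yaml` (jar name and version are assumptions; check the XTable docs), after which the table can be read as Delta Lake and registered in Unity Catalog.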
In production scenarios, when you’re working with cloud object stores, you can translate the tables and add them to Unity Catalog as mentioned in the XTable documentation.
In conclusion, integrating Apache Hudi into your Databricks workflow offers a streamlined approach to data lake management and processing. By following the steps outlined in this blog, you've learned how to configure Hudi within your Databricks environment, equipping you to handle your data operations efficiently and confidently.
To learn more about how to mix-and-match table formats and query engines, and the value of doing so, you can download the Universal Data Lakehouse whitepaper today for free.