February 22, 2023
Getting Started: Manage your Hudi tables with the admin Hudi-CLI tool
Written by:
Sivabalan Narayanan
Motivation
Managing and operating data lakehouse tables and pipelines in an organization can be very challenging when you are working with thousands of tables. Users have a myriad of needs when managing such a large number of tables, including:
monitoring the commits/writes
understanding the layout optimizations
managing the cleaning cadence
inspecting the events timeline for partially failed commits
and much more…
Hudi comes with a CLI admin tool, named Hudi-CLI, to assist users in managing their data lakehouse tables. Hudi-CLI recently got a facelift with the introduction of a hudi-cli-bundle in the 0.13.0 release, which makes launching the CLI tool much easier than in older versions.
The Hudi-CLI is feature-rich and broadens the ways you can interact with your Hudi tables:
Inspect and monitor your Hudi tables
Perform table services
Assist with disaster and recovery
Here is a more detailed breakdown on some of the things you can do with the new Hudi-CLI:
Monitoring/Investigation
Get stats for each commit on a given Hudi table
Showing partition details for a specific commit
Showing file level details for a specific commit
Get statistics (percentiles) about file sizes in the table (this will give you a sense of whether there are a lot of small files in your table)
List files via the metadata table
Show the file system view
Show archived commits
Table Services:
Perform clustering for a Hudi table
Schedule compaction and execute compaction for a MOR table
Repair/Disaster Recovery:
Repair hoodie properties
Repair corrupted clean files
Savepoint and restore for disaster recovery purposes
Get started with launching the Hudi-CLI tool:
To get started and launch the Hudi-CLI tool, follow the steps listed below. Set up differs depending on whether you are using Spark 2 or Spark 3.
Spark 2:
Pre-requisite:
You need a local clone of the Hudi repo in order to use Spark 2 with Hudi-CLI.
Step 1: Build hudi with below command
Step 2: Navigate to the hudi-cli directory and launch hudi-cli.sh
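As a rough sketch of Steps 1 and 2: the helper name `launch_spark2_cli` and the checkout path are hypothetical, and `mvn clean package -DskipTests` is assumed here to be the default (Spark 2) build for the 0.13.0 release. Check the Hudi README for the exact profile flags that match your version.

```shell
# Hypothetical helper: build Hudi with its default Maven profile, then start the CLI.
launch_spark2_cli() {
  repo="${1:-}"   # path to your local Hudi checkout (placeholder)
  if [ ! -f "$repo/pom.xml" ]; then
    echo "No Hudi checkout found at: $repo" >&2
    return 1
  fi
  # Step 1: build (assumed default profile; tests skipped to save time)
  (cd "$repo" && mvn clean package -DskipTests) || return 1
  # Step 2: launch the CLI from the hudi-cli module
  (cd "$repo/hudi-cli" && ./hudi-cli.sh)
}
```

Once the clone exists, you would call it as `launch_spark2_cli ~/hudi`, replacing the path with your own.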
When you run the script, you should see this logo and a prompt:
Congrats! You’ve just launched the new Hudi-CLI tool. Check out the section below on getting started with some Hudi-CLI commands with a sample data set!
Spark 3:
Starting with 0.13.0, we have introduced a hudi-cli-bundle to make Hudi-CLI easier to use. You can use the CLI and Spark bundles from Maven Central or build them locally. We’ll go over both setups below:
Pre-requisite:
If you are launching the CLI directly in your production environment, you can point to the Spark installation there. If you are trying this out locally or elsewhere without Spark, you’ll need to download Spark 3.x. Take note of where you extract it, because you’ll export that location as the SPARK_HOME env variable later on.
Maven Central Set up
I’ll be showing examples for hudi-cli-bundle_2.12 and Spark 3.2. Please adjust the versions in the commands below to match your setup. To use alternate versions, check out this Apache org repo.
Step 1: Create an empty folder as a new directory anywhere on your computer, outside of the Apache Hudi cloned project. I just called the directory cli-staging.
Step 2: Curl the hudi-cli-bundle jar, hudi-spark-bundle jar and bash script into the cli-staging directory:
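Here is a minimal sketch of how the two jar downloads in Step 2 could be parameterized by version. It assumes the standard Maven Central repository layout under `repo1.maven.org`; the variable names and the helpers `bundle_url` and `fetch_bundles` are hypothetical. The launcher script lives in the Hudi GitHub repo rather than on Maven Central, so it is not covered here.

```shell
# Versions used in this post; change them to match your environment.
HUDI_VERSION="0.13.0"
SCALA_VERSION="2.12"
SPARK_VERSION="3.2"
MAVEN_BASE="https://repo1.maven.org/maven2/org/apache/hudi"

# Build the Maven Central URL for a Hudi artifact id (assumes the standard layout).
bundle_url() {
  echo "${MAVEN_BASE}/$1/${HUDI_VERSION}/$1-${HUDI_VERSION}.jar"
}

CLI_ARTIFACT="hudi-cli-bundle_${SCALA_VERSION}"
SPARK_ARTIFACT="hudi-spark${SPARK_VERSION}-bundle_${SCALA_VERSION}"

# Hypothetical helper: download both jars into the current directory.
fetch_bundles() {
  curl -fLO "$(bundle_url "$CLI_ARTIFACT")" &&
  curl -fLO "$(bundle_url "$SPARK_ARTIFACT")"
}

# Print the URLs so you can check them before downloading.
bundle_url "$CLI_ARTIFACT"
bundle_url "$SPARK_ARTIFACT"
```

Run `fetch_bundles` from inside cli-staging once the printed URLs look right. Keeping the versions in variables means switching Spark or Scala versions is a one-line change.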
Step 3: Make a conf directory inside cli-staging: cli-staging/conf/. Curl the hudi-env.sh and log4j2 properties:
Step 4: Navigate back to cli-staging. Export the SPARK_HOME, CLI_BUNDLE_JAR, and SPARK_BUNDLE_JAR env variables. Afterward, you can start the Hudi-CLI shell:
Setting your env variables should look like this:
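As a sketch, assuming you are in cli-staging with the 0.13.0 jars for Scala 2.12 and Spark 3.2: the Spark directory and the launcher name `hudi-cli-with-bundle.sh` are assumptions about a typical setup, so substitute your own paths.

```shell
# Placeholder locations: point these at your own Spark install and downloaded jars.
export SPARK_HOME="${SPARK_HOME:-$HOME/spark-3.2.3-bin-hadoop3.2}"
export CLI_BUNDLE_JAR="$PWD/hudi-cli-bundle_2.12-0.13.0.jar"
export SPARK_BUNDLE_JAR="$PWD/hudi-spark3.2-bundle_2.12-0.13.0.jar"

# Start the shell only when the launcher script is actually here.
if [ -x ./hudi-cli-with-bundle.sh ]; then
  ./hudi-cli-with-bundle.sh
else
  echo "hudi-cli-with-bundle.sh not found (or not executable) in $PWD" >&2
fi
```

If the script is reported as not executable, `chmod +x hudi-cli-with-bundle.sh` should fix that.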
When you run the script, you should see this logo and a prompt:
Congrats! You’ve just launched the new Hudi-CLI tool. Check out the section below on getting started with some Hudi-CLI commands with a sample data set!
Local Build Set up
Step 1: Build Hudi with the command below, using your Spark and Scala versions. In the examples below, I’m building with Spark 3.2 and Scala 2.12. You can find more details in the Hudi GitHub.
You should see a Build Success. The build does take a while, so don’t be alarmed.
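The build in Step 1 could be composed along these lines. This assumes the `-Dspark<version>` and `-Dscala-<version>` profile flags described in the Hudi README; the helper name `hudi_build_cmd` is hypothetical.

```shell
# Compose the Maven build command for a given Spark and Scala version.
hudi_build_cmd() {
  spark_version="${1:-}"   # e.g. 3.2
  scala_version="${2:-}"   # e.g. 2.12
  echo "mvn clean package -DskipTests -Dspark${spark_version} -Dscala-${scala_version}"
}

# Print the command for the versions used in this post.
hudi_build_cmd 3.2 2.12
# prints: mvn clean package -DskipTests -Dspark3.2 -Dscala-2.12
```

Run the printed command from the root of your Hudi clone.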
Step 2: Create an empty folder as a new directory anywhere on your computer, outside of the Apache Hudi cloned project. I just called the directory cli-staging.
Step 3: Copy the hudi-cli-bundle jar, hudi-spark*-bundle jar and bash script into the cli-staging directory:
Make sure you have 2 jar files and 1 bash script:
Step 4: Make a directory inside cli-staging and call it conf. Copy the hudi-cli-bundle/conf folder to cli-staging/conf:
Make sure you have a bash script and a properties file in the conf/ directory:
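The two "make sure" checks above can be scripted. This is a minimal sketch, assuming the launcher is named `hudi-cli-with-bundle.sh` and the conf directory holds `hudi-env.sh` plus a properties file; the helper name `check_staging` is hypothetical.

```shell
# Hypothetical helper: verify that a staging directory has everything the launcher needs.
check_staging() {
  dir="${1:-.}"
  status=0
  # Two jars: the CLI bundle and the Spark bundle
  ls "$dir"/hudi-cli-bundle*.jar >/dev/null 2>&1 || { echo "missing hudi-cli-bundle jar" >&2; status=1; }
  ls "$dir"/hudi-spark*-bundle*.jar >/dev/null 2>&1 || { echo "missing hudi-spark bundle jar" >&2; status=1; }
  # One launcher script
  [ -f "$dir/hudi-cli-with-bundle.sh" ] || { echo "missing hudi-cli-with-bundle.sh" >&2; status=1; }
  # conf/ needs a bash script and a properties file
  [ -f "$dir/conf/hudi-env.sh" ] || { echo "missing conf/hudi-env.sh" >&2; status=1; }
  ls "$dir"/conf/*.properties >/dev/null 2>&1 || { echo "missing conf/*.properties" >&2; status=1; }
  return "$status"
}
```

For example, `check_staging ./cli-staging && echo "staging looks complete"` prints the confirmation only when all five pieces are present.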
Step 5: In the terminal, export the SPARK_HOME, CLI_BUNDLE_JAR, and SPARK_BUNDLE_JAR env variables. Afterward, you can start the Hudi-CLI shell.
In my case, I set the env variables like this:
When you run the script, you should see this logo and a prompt:
Congrats! You’ve just launched the new Hudi-CLI tool. Check out the section below on getting started with Hudi-CLI commands with a sample data set!
Now that you’ve launched the new Hudi-CLI tool for either Spark 2 or Spark 3, you can get started with the different commands.
Pre-reqs:
You’ll need to have a Hudi table to connect to. If you don’t have one, you can download the Hudi table with sample data that’s provided.
Step 1: You’ll need to connect to a table. If you’re using the sample data, provide the absolute path to where the table exists.
You should see something like this:
Once the table is connected, you can run some commands.
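Since the connect step wants an absolute path, a small helper can resolve it for you. This is a sketch: it assumes the CLI's `connect --path` command, and the helper name `hudi_connect_cmd` is hypothetical. It only prints the line to type at the Hudi-CLI prompt; it does not talk to the CLI itself.

```shell
# Hypothetical helper: turn a (possibly relative) table directory into the
# `connect` command to type at the Hudi-CLI prompt.
hudi_connect_cmd() {
  table_dir="${1:-}"
  abs_path="$(cd "$table_dir" && pwd)" || return 1   # resolve to an absolute path
  echo "connect --path ${abs_path}"
}
```

For example, running `hudi_connect_cmd ./hudi_trips_cow` from the directory that contains the sample table prints a `connect --path ...` line you can paste into the CLI.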
Commands available with Hudi-CLI
Here is the list of top-level commands in Hudi-CLI. Each command has a few options; some are mandatory, while others are optional.
Autocomplete works with the [TAB] key to help you with the different options available for each command.
Step 2: Let’s see the available commands
If you run:
You should see this:
Step 3: Let’s describe the table
The command, desc, describes the table properties and values. It can also tell you whether a table is loaded properly.
You should see something like this:
Step 4: Let’s inspect commits
The command, commits show, shows you all the commits for the table.
You should see a list of commits for the table of interest, with stats about the writes in each commit:
Step 5: Show partition level stats for a given commit
The command, commit showpartitions --commit [CommitTime], shows partition-level stats for the given commit. You can grab the CommitTime from `commits show` under the CommitTime column.
You should see something like this:
Step 6: Show the file level stats for a given commit
The command, commit showfiles --commit [CommitTime], shows file-level stats for the given commit, including each FileId and its previous commit:
You should see a table like this:
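Steps 5 and 6 take the same CommitTime, so the two commands can be generated together. In this sketch the helper name `hudi_commit_cmds` is hypothetical, and the check that a commit time contains only digits is an assumption based on how values appear in the CommitTime column.

```shell
# Hypothetical helper: print the two per-commit commands for a given commit time.
hudi_commit_cmds() {
  commit_time="${1:-}"
  case "$commit_time" in
    ''|*[!0-9]*)
      echo "commit time must be digits only (copy it from the CommitTime column)" >&2
      return 1
      ;;
  esac
  echo "commit showpartitions --commit ${commit_time}"   # partition-level stats
  echo "commit showfiles --commit ${commit_time}"        # file-level stats
}
```

Pass it a value copied from `commits show` and paste the printed lines into the CLI one at a time. Note that the option is spelled with two plain hyphens and no space: `--commit`.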
Step 7: Show file sizes
The command, stats filesizes, gives you a distribution of file sizes across the table. It tells you whether there are a lot of small files and what the average file size is.
You should see a table like this:
Spark Dependencies:
While some of the commands don’t need a Spark engine to run, others do. Let’s take a look at a few of the commands that require the Spark engine:
Be sure to set the SPARK_HOME for the following commands:
Step 8: List files via Metadata Table
The Metadata Table can significantly improve read/write performance of your queries. The main purpose of the Metadata Table is to eliminate the requirement for the "list files" operation.
You should see some logs. Finally, you’ll see the partitions:
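For Step 8, the commands I would expect to use are `metadata list-partitions` and `metadata list-files`. Treat the `--partition` option and the helper name `hudi_metadata_cmds` as assumptions, and confirm the exact options with TAB completion in your CLI version. The partition value in the usage note is just a placeholder.

```shell
# Hypothetical helper: print the metadata-table listing commands.
# With no argument, only the partition listing is printed.
hudi_metadata_cmds() {
  partition="${1:-}"
  echo "metadata list-partitions"
  if [ -n "$partition" ]; then
    # Assumed option name; verify with TAB completion at the prompt.
    echo "metadata list-files --partition ${partition}"
  fi
}
```

For example, `hudi_metadata_cmds americas/brazil/sao_paulo` prints both commands for that partition. These only return results when the metadata table is enabled for the table you connected to.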
Step 9: Add a savepoint and restore the savepointed commit.
Savepoint saves the table as of the commit time, so that it lets you restore the table to this savepoint at a later point in time. This also ensures that the cleaner will not clean up any files that are savepointed.
When you run the command with a CommitTime, you should see that it was savepointed:
Now, if you run savepoints show, you can compare the table before and after it was savepointed:
You can see the change when a table was not savepointed and when it was:
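The savepoint lifecycle for Step 9, including the restore covered further down, boils down to three commands. This sketch assumes the command names `savepoint create`, `savepoints show` (plural), and `savepoint rollback` from the Hudi CLI; the helper name `hudi_savepoint_cmds` is hypothetical.

```shell
# Hypothetical helper: print the savepoint lifecycle commands for one commit time.
hudi_savepoint_cmds() {
  commit_time="${1:-}"
  [ -n "$commit_time" ] || { echo "usage: hudi_savepoint_cmds <CommitTime>" >&2; return 1; }
  echo "savepoint create --commit ${commit_time}"       # pin the table state at this commit
  echo "savepoints show"                                # list existing savepoints
  echo "savepoint rollback --savepoint ${commit_time}"  # restore the table to the savepoint
}
```

Run the first two now and keep the third for when you actually need to restore. Because these commands go through Spark, make sure SPARK_HOME is set before launching the CLI.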
Open another terminal and navigate to your Hudi table. When you inspect the properties of the .hoodie folder, you will see the savepoint meta files.
In a separate terminal type:
In my case, I navigated to the hudi_trips_cow directory with the sample dataset:
When you list the .hoodie directory:
You’ll see the savepoint meta files:
Restoring to the savepoint:
You can see the savepoint is rolled back:
If you inspect the .hoodie directory, you will see restore meta files.
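Both directory inspections above, for savepoint and restore meta files, can be done with one small helper. This sketch assumes the meta file names inside `.hoodie` contain `.savepoint` or `.restore`, as in the 0.13.0 timeline layout; the helper name `list_meta_files` is hypothetical.

```shell
# Hypothetical helper: list timeline meta files of one kind (e.g. savepoint, restore)
# inside a table's .hoodie directory.
list_meta_files() {
  table_dir="${1:-.}"
  kind="${2:-savepoint}"
  ls -1 "$table_dir/.hoodie" | grep -F ".${kind}" \
    || echo "no ${kind} meta files found in $table_dir/.hoodie" >&2
}
```

After creating the savepoint, `list_meta_files /path/to/hudi_trips_cow savepoint` should list its meta files. After the rollback, passing `restore` as the second argument should list the restore meta files.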
The CLI admin tool can be very helpful in understanding your Hudi tables. With it, you can monitor your tables, troubleshoot issues, decide whether you need to run table services such as compaction or clustering, and perform disaster recovery when needed.
You can try the Hudi-CLI with the sample data set.