February 22, 2023

Getting Started: Manage your Hudi tables with the admin Hudi-CLI tool

Motivation

Managing and operating data lakehouse tables and pipelines in an organization can be very challenging when working with thousands of tables. Users have a myriad of needs when managing such a large number of tables, from:

  • monitoring the commits/writes
  • understanding the layout optimizations
  • managing the cleaning cadence
  • inspecting the events timeline for partial failed commits

and much more… 

Hudi comes with a CLI admin tool, named Hudi-CLI, to assist users in managing their data lakehouse tables. Recently, Hudi-CLI got a facelift with the introduction of a hudi-cli-bundle in the 0.13.0 release, which makes launching the CLI tool much easier compared to older versions.

Hudi-CLI is feature-rich and expands the ways you can interact with your Hudi tables:

  • Inspect and monitor your Hudi tables
  • Perform table services 
  • Assist with disaster and recovery 

Here is a more detailed breakdown on some of the things you can do with the new Hudi-CLI: 

Monitoring/Investigation

  • Get stats for each commit on a given Hudi table
  • Show partition details for a specific commit
  • Show file-level details for a specific commit
  • Get statistics (percentiles) about file sizes in the table (this gives you a sense of whether there are a lot of small files in your table)
  • List files via the metadata table
  • Show the file system view
  • Show archived commits

Table Services

  • Perform clustering for a Hudi table
  • Schedule compaction and execute compaction for a MOR table

Repair/Disaster Recovery

  • Repair hoodie properties 
  • Repair corrupted clean files
  • Savepoint and restore for disaster recovery purposes

Get started with launching the Hudi-CLI tool:

To get started and launch the Hudi-CLI tool, follow the steps listed below. Setup differs depending on whether you are using Spark 2 or Spark 3.

Spark 2

Pre-requisite:

You need to clone the Hudi repo locally in order to use Spark 2 with Hudi-CLI.

Step 1: Build Hudi with the command below:
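A sketch of the build command, run from the root of your Hudi clone. The Spark 2 / Scala profile flags shown here are an assumption and depend on your Hudi version, so adjust them as needed:

    mvn clean package -DskipTests -Dspark2.4 -Dscala-2.11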

Step 2: Navigate to the hudi-cli directory and launch hudi-cli.sh
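From the root of the repo, that looks like:

    cd hudi-cli
    ./hudi-cli.sh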

When you run the script, you should see the Hudi logo and a prompt:

hudi ->

Congrats! We’ve just launched the new Hudi-CLI tool. Check out the section down below on getting started with some Hudi-CLI commands with a sample data set!

Spark 3

Starting with 0.13.0, we have introduced a hudi-cli-bundle for easy use of Hudi-CLI. You can use the CLI and Spark bundles from Maven Central or build them locally. We’ll go over these two setups down below:

Pre-requisite:

If you are directly launching the CLI in your production environment, you can point to the Spark in that environment. If you are trying this out locally or elsewhere where you don’t have Spark, you’ll need to download Spark 3.x. Be sure to take note of where you download it, because we’ll need to export that location as the SPARK_HOME env variable later on.

Maven Central Setup

I’ll be showing examples for hudi-cli-bundle_2.12 and Spark 3.2. Please adjust the versions in the commands below accordingly. To use alternate versions, check out this Apache org repo.

Step 1: Create an empty folder as a new directory anywhere on your computer, outside of the Apache Hudi cloned project. I just called the directory cli-staging.

Step 2: Curl the hudi-cli-bundle jar, hudi-spark-bundle jar and bash script into the cli-staging directory:
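A sketch of the downloads, assuming release 0.13.0, Scala 2.12, and Spark 3.2; the Maven Central coordinates and the script path are illustrative, so adjust them to your versions:

    cd cli-staging
    curl -OL https://repo1.maven.org/maven2/org/apache/hudi/hudi-cli-bundle_2.12/0.13.0/hudi-cli-bundle_2.12-0.13.0.jar
    curl -OL https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3.2-bundle_2.12/0.13.0/hudi-spark3.2-bundle_2.12-0.13.0.jar
    curl -OL https://raw.githubusercontent.com/apache/hudi/release-0.13.0/packaging/hudi-cli-bundle/hudi-cli-with-bundle.sh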

Step 3: Make a conf directory inside cli-staging: cli-staging/conf/. Curl the hudi-env.sh and log4j2 properties:
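Something along these lines. The conf files live under packaging/hudi-cli-bundle/conf in the Hudi repo; the exact log4j2 properties file name may differ, so check that directory:

    mkdir conf && cd conf
    curl -OL https://raw.githubusercontent.com/apache/hudi/release-0.13.0/packaging/hudi-cli-bundle/conf/hudi-env.sh
    # also fetch the log4j2 properties file from the same conf/ directory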

Step 4: Navigate back to cli-staging. Export the SPARK_HOME, CLI_BUNDLE_JAR, and SPARK_BUNDLE_JAR env variables. After, you can start Hudi-CLI shell:

Setting your env variables should look like this:
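(A sketch; all paths below are placeholders for your own environment, and the launch script is the hudi-cli-with-bundle.sh downloaded above.)

    export SPARK_HOME=/path/to/spark-3.2.x
    export CLI_BUNDLE_JAR=$(pwd)/hudi-cli-bundle_2.12-0.13.0.jar
    export SPARK_BUNDLE_JAR=$(pwd)/hudi-spark3.2-bundle_2.12-0.13.0.jar
    chmod +x hudi-cli-with-bundle.sh
    ./hudi-cli-with-bundle.sh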

When you run the script, you should see the Hudi logo and a prompt:

hudi ->  

Congrats! We’ve just launched the new Hudi-CLI tool. Check out the section down below on getting started with some Hudi-CLI commands with a sample data set!

Local Build Setup

Step 1: You can build Hudi with the command below, using your Spark and Scala versions. In the examples below, I’m building with Spark 3.2 and Scala 2.12. You can find more details in the Hudi GitHub.
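For example, run from the root of the Hudi clone; the profile flags below match Spark 3.2 / Scala 2.12 and may differ for other versions:

    mvn clean package -DskipTests -Dspark3.2 -Dscala-2.12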

You should see a Build Success. It does take a while to build, so don’t be alarmed.

Step 2: Create an empty folder as a new directory anywhere on your computer, outside of the Apache Hudi cloned project. I just called the directory cli-staging.

Step 3: Copy the hudi-cli-bundle jar, hudi-spark*-bundle jar and bash script into the cli-staging directory:
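A sketch of the copies, assuming the build above and that cli-staging sits next to your Hudi clone; adjust paths and version numbers as needed:

    cp packaging/hudi-cli-bundle/target/hudi-cli-bundle_2.12-0.13.0.jar ../cli-staging/
    cp packaging/hudi-spark-bundle/target/hudi-spark3.2-bundle_2.12-0.13.0.jar ../cli-staging/
    cp packaging/hudi-cli-bundle/hudi-cli-with-bundle.sh ../cli-staging/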

Make sure you have 2 jar files and 1 bash script:

Step 4: Make a directory inside cli-staging and call it conf. Copy the hudi-cli-bundle/conf folder to cli-staging/conf:
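Assuming the same relative layout as above:

    cp -r packaging/hudi-cli-bundle/conf ../cli-staging/conf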

Make sure you have a bash script and a properties file in the conf/ directory:

Step 5: In the terminal, export the SPARK_HOME, CLI_BUNDLE_JAR, and SPARK_BUNDLE_JAR env variables. After that, you can start the Hudi-CLI shell.

In my case, I set the env variables like this:
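(The paths below are placeholders; point them at your Spark download and the jars you copied into cli-staging.)

    export SPARK_HOME=/path/to/spark-3.2.x
    export CLI_BUNDLE_JAR=$(pwd)/hudi-cli-bundle_2.12-0.13.0.jar
    export SPARK_BUNDLE_JAR=$(pwd)/hudi-spark3.2-bundle_2.12-0.13.0.jar
    ./hudi-cli-with-bundle.sh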

When you run the script, you should see the Hudi logo and a prompt:

hudi -> 


Congrats! We’ve just launched the new Hudi-CLI tool. Check out the section down below on getting started with Hudi-CLI commands on a sample data set!

More details on the setup are covered here.

Get Started with the Hudi-CLI Commands:

Now that you’ve launched the new Hudi-CLI tool for either Spark 2 or Spark 3, you can get started with the different commands. 

Pre-reqs:

You’ll need to have a Hudi table to connect to. If you don’t have one, you can download the Hudi table with sample data that’s provided. 

Step 1: You’ll need to connect to a table. If you’re using the sample data, use the absolute path to where the table exists.
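At the hudi -> prompt, the connect command looks like this (substitute the absolute path to your own table, for example the downloaded hudi_trips_cow sample):

    connect --path /absolute/path/to/hudi_trips_cow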

You should see something like this:

Once the table is connected, you can run some commands.

Commands available with Hudi-CLI

Here is the list of top-level commands we have in Hudi-CLI. Each command has a few options; some are mandatory, while others are optional.

Auto-complete works with [TAB] on your keyboard to assist you with the different options we have for each command.

Step 2: Let’s see the available commands

If you run:
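(The built-in help command lists everything available; the exact output depends on your Hudi version.)

    help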

You should see this:

Step 3: Let’s describe the table

The command, desc, describes the table properties and values. It can also tell you whether a table is loaded properly.
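Run it at the prompt once you are connected to a table:

    desc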

You should see something like this:

Step 4: Let’s inspect commits 

The command, commits show, shows you all the commits for the table.
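At the prompt:

    commits show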

You should see a list of commits for the table of interest, with stats about the writes in each commit:

Step 5: Show partition level stats for a given commit 

The command, commit showpartitions --commit [CommitTime], shows partition-level write stats for a given commit. You can grab the CommitTime from `commits show` under the CommitTime column.
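For example, with <CommitTime> as a placeholder for a commit time taken from `commits show`:

    commit showpartitions --commit <CommitTime>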

You should see something like this:

Step 6: Show the file level stats for a given commit 

The command, commit showfiles --commit [CommitTime], shows file-level stats for a given commit, including the FileId and the previous commit:
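Again with <CommitTime> as a placeholder:

    commit showfiles --commit <CommitTime>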

You should see a table like this:

Step 7: Show file sizes 

The command, stats filesizes, gives you a distribution of file sizes across the table. You can get a sense if there are a lot of small files and what the average file size is. 
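At the prompt:

    stats filesizes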

You should see a table like this:

Spark Dependencies:

While some of the commands don’t need a Spark engine to be executed, others do. Let’s take a look at a few of the commands that require the Spark engine:

Be sure to set the SPARK_HOME for the following commands:

Step 8: List files via Metadata Table

The Metadata Table can significantly improve read/write performance of your queries. The main purpose of the Metadata Table is to eliminate the requirement for the "list files" operation.
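A sketch of the metadata table commands (names as in recent Hudi releases; <partition> is a placeholder for a partition path in your table):

    metadata list-partitions
    metadata list-files --partition <partition>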

You should see some logs. Finally, you’ll see the partitions:

Step 9: Add a savepoint and restore to the savepointed commit.

Savepoint saves the state of the table as of a given commit time, so that you can restore the table to that savepoint at a later point in time. It also ensures that the cleaner will not clean up any files that are savepointed.
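A sketch of the savepoint command; <CommitTime> is a placeholder taken from `commits show`, and --sparkMaster is set because this command runs through the Spark engine:

    savepoint create --commit <CommitTime> --sparkMaster local[2]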

When you run the command with a CommitTime, you should see that it was savepointed:

Now, if you run savepoint show, you can see the output from before and after the table is savepointed:

You can see the change when a table was not savepointed and when it was:

Open another terminal and navigate to your Hudi table. In my case, I navigated to the hudi_trips_cow directory with the sample dataset. When you list the contents of the .hoodie folder, you’ll see the savepoint meta files.
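For example (substitute the base path of your own table):

    cd /path/to/hudi_trips_cow
    ls .hoodie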

Restoring to the savepoint:
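A sketch of the restore, passing the savepointed commit time (again via the Spark engine):

    savepoint rollback --savepoint <CommitTime> --sparkMaster local[2]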

You can see the savepoint is rolled back:

If you inspect the .hoodie directory, you will see restore meta files. 

The CLI/admin tool can be very helpful in understanding your Hudi tables. With this tool, you can monitor your Hudi tables, troubleshoot issues, understand whether you need to perform table services like compaction and clustering, and handle disaster recovery when needed.

You can try the Hudi-CLI with the sample data set.

To get started on Apache Hudi, check out:
