February 22, 2023

Getting Started: Manage your Hudi tables with the admin Hudi-CLI tool

Motivation

Managing and operating data lakehouse tables and pipelines in an organization can be very challenging when working with thousands of tables. Users have a myriad of needs when managing such a large number of tables, from:

  • monitoring the commits/writes
  • understanding the layout optimizations
  • managing the cleaning cadence
  • inspecting the events timeline for partial failed commits

and much more… 

Hudi comes with an admin CLI tool, named Hudi-CLI, to assist users in managing their data lakehouse tables. Hudi-CLI recently got a facelift with the introduction of a hudi-cli-bundle in the 0.13.0 release, which makes launching the CLI tool much easier compared to older versions.

The Hudi-CLI is feature-rich and expands the ways you can interact with your Hudi tables:

  • Inspect and monitor your Hudi tables
  • Perform table services 
  • Assist with disaster and recovery 

Here is a more detailed breakdown of some of the things you can do with the new Hudi-CLI:

Monitoring/Investigation

  • Get stats for each commit on a given Hudi table
  • Show partition-level details for a specific commit
  • Show file-level details for a specific commit
  • Get statistics (percentiles) about file sizes in the table (this gives you a sense of whether your table has a lot of small files)
  • List files via the metadata table
  • Show the file system view
  • Show archived commits

Table Services

  • Perform clustering for a Hudi table
  • Schedule compaction and execute compaction for a MOR table

Repair/Disaster Recovery

  • Repair hoodie properties 
  • Repair corrupted clean files
  • Savepoint and restore for disaster recovery purposes

Get Started with Launching the Hudi-CLI Tool:

To get started and launch the Hudi-CLI tool, follow the steps listed below. The setup differs depending on whether you are using Spark 2 or Spark 3.

Spark 2

Pre-requisite:

You need a local clone of the Hudi repository in order to use Spark 2 with the Hudi-CLI.

Step 1: Build Hudi with the command below:
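A minimal sketch of the build, assuming you are at the root of your local Hudi clone and targeting the Spark 2.4 / Scala 2.11 profiles (adjust the profile flags to your Spark and Scala versions):

  # build Hudi from the repository root, skipping tests
  mvn clean package -DskipTests -Dspark2.4 -Dscala-2.11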

Step 2: Navigate to the hudi-cli directory and launch hudi-cli.sh
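From the root of the clone:

  cd hudi-cli
  ./hudi-cli.sh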

When you run the script, you should see the Hudi logo and a prompt:

hudi ->

Congrats! We’ve just launched the new Hudi-CLI tool. Check out the section down below on getting started with some Hudi-CLI commands and a sample data set!

Spark 3

Starting with 0.13.0, we have introduced a hudi-cli-bundle for easy use of the Hudi-CLI. You can use the CLI and Spark bundles from Maven Central or build them locally. We’ll go over these two setups down below:

Pre-requisite:

If you are directly launching the CLI in your production environment, you can point to your Spark in the production environment. If you are trying this out locally or elsewhere, where you don’t have Spark, you’ll need to download Spark 3.x. Be sure to take note of where you download this file because we’ll need to export it as the SPARK_HOME env variable later on.
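For example, one way to grab a Spark 3.2.x distribution locally (the exact version and mirror URL here are illustrative; pick whichever Spark 3.x release matches your bundle):

  curl -O https://archive.apache.org/dist/spark/spark-3.2.3/spark-3.2.3-bin-hadoop3.2.tgz
  tar -xzf spark-3.2.3-bin-hadoop3.2.tgz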

Maven Central Set up

I’ll be showing examples for hudi-cli-bundle_2.12 and Spark 3.2. Please adjust to the right versions in the commands below. To use alternate versions, check out this Apache org repo.

Step 1: Create an empty folder as a new directory anywhere on your computer, outside of the Apache Hudi cloned project. I just called the directory cli-staging.

Step 2: Curl the hudi-cli-bundle jar, hudi-spark-bundle jar and bash script into the cli-staging directory:
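Something like the following, assuming the 0.13.0 release artifacts for Scala 2.12 and Spark 3.2 (the Maven Central and GitHub paths below are assumptions based on that release's layout; adjust versions to your setup):

  cd cli-staging
  # Hudi CLI bundle jar from Maven Central
  curl -O https://repo1.maven.org/maven2/org/apache/hudi/hudi-cli-bundle_2.12/0.13.0/hudi-cli-bundle_2.12-0.13.0.jar
  # Hudi Spark bundle jar from Maven Central
  curl -O https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3.2-bundle_2.12/0.13.0/hudi-spark3.2-bundle_2.12-0.13.0.jar
  # launch script from the Hudi repository
  curl -O https://raw.githubusercontent.com/apache/hudi/release-0.13.0/packaging/hudi-cli-bundle/hudi-cli-with-bundle.sh
  chmod +x hudi-cli-with-bundle.sh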

Step 3: Make a conf directory inside cli-staging: cli-staging/conf/. Curl the hudi-env.sh and log4j2 properties:
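For example (the raw GitHub paths and the log4j2 file name are assumptions based on the packaging/hudi-cli-bundle/conf layout of the 0.13.0 release):

  mkdir -p conf && cd conf
  curl -O https://raw.githubusercontent.com/apache/hudi/release-0.13.0/packaging/hudi-cli-bundle/conf/hudi-env.sh
  curl -O https://raw.githubusercontent.com/apache/hudi/release-0.13.0/packaging/hudi-cli-bundle/conf/log4j2.properties
  cd ..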

Step 4: Navigate back to cli-staging. Export the SPARK_HOME, CLI_BUNDLE_JAR, and SPARK_BUNDLE_JAR env variables. After that, you can start the Hudi-CLI shell.

Setting your env variables should look like this:
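(the paths below are placeholders; point them at your Spark distribution and the jars you downloaded)

  export SPARK_HOME=/path/to/spark-3.2.3-bin-hadoop3.2
  export CLI_BUNDLE_JAR=$(pwd)/hudi-cli-bundle_2.12-0.13.0.jar
  export SPARK_BUNDLE_JAR=$(pwd)/hudi-spark3.2-bundle_2.12-0.13.0.jar
  # then launch the CLI
  ./hudi-cli-with-bundle.sh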

When you run the script, you should see the Hudi logo and a prompt:

hudi ->  

Congrats! We’ve just launched the new Hudi-CLI tool. Check out the section down below on getting started with some Hudi-CLI commands and a sample data set!

Local Build Set up

Step 1: Build Hudi with the command below, using your Spark and Scala versions. In the examples below, I’m building with Spark 3.2 and Scala 2.12. You can find more details in the Hudi GitHub.
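For example, from the root of your Hudi clone (the profile flags assume Spark 3.2 and Scala 2.12; swap them for your versions):

  mvn clean package -DskipTests -Dspark3.2 -Dscala-2.12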

You should see a Build Success message. It does take a while to build, so don’t be alarmed.

Step 2: Create an empty folder as a new directory anywhere on your computer, outside of the Apache Hudi cloned project. I just called the directory cli-staging.

Step 3: Copy the hudi-cli-bundle jar, hudi-spark*-bundle jar and bash script into the cli-staging directory:
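Something like this, run from the root of the Hudi clone (the target paths follow the repository’s packaging layout, and /path/to/cli-staging is a placeholder):

  cp packaging/hudi-cli-bundle/target/hudi-cli-bundle_2.12-*.jar /path/to/cli-staging/
  cp packaging/hudi-spark-bundle/target/hudi-spark*-bundle_2.12-*.jar /path/to/cli-staging/
  cp packaging/hudi-cli-bundle/hudi-cli-with-bundle.sh /path/to/cli-staging/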

Make sure you have 2 jar files and 1 bash script:

Step 4: Make a directory inside cli-staging and call it conf. Copy the hudi-cli-bundle/conf folder to cli-staging/conf:
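For example, from the root of the Hudi clone:

  mkdir -p /path/to/cli-staging/conf
  cp packaging/hudi-cli-bundle/conf/* /path/to/cli-staging/conf/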

Make sure you have a bash script and a properties file in the conf/ directory:

Step 5: In the terminal, export the SPARK_HOME, CLI_BUNDLE_JAR, and SPARK_BUNDLE_JAR env variables. After that, you can start the Hudi-CLI shell.

In my case, I set the env variables like this:
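(placeholder paths; point them at your Spark distribution and the jars you copied into cli-staging)

  export SPARK_HOME=/path/to/spark-3.2.3-bin-hadoop3.2
  export CLI_BUNDLE_JAR=/path/to/cli-staging/hudi-cli-bundle_2.12-<version>.jar
  export SPARK_BUNDLE_JAR=/path/to/cli-staging/hudi-spark3.2-bundle_2.12-<version>.jar
  # then launch the CLI from cli-staging
  ./hudi-cli-with-bundle.sh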

When you run the script, you should see the Hudi logo and a prompt:

hudi -> 


Congrats! We’ve just launched the new Hudi-CLI tool. Check out the section down below on getting started with Hudi-CLI commands and a sample data set!

More details on the setup are covered here.

Get Started with the Hudi-CLI Commands:

Now that you’ve launched the new Hudi-CLI tool for either Spark 2 or Spark 3, you can get started with the different commands. 

Pre-requisites:

You’ll need to have a Hudi table to connect to. If you don’t have one, you can download the Hudi table with sample data that’s provided. 

Step 1: You’ll need to connect to a table. If you’re using the sample data, provide the absolute path to where the table exists.
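For example, with the sample table (the path below is a placeholder for wherever you extracted it):

  hudi -> connect --path /absolute/path/to/hudi_trips_cow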

You should see something like this:

Once the table is connected, you can run some commands.

Commands available with Hudi-CLI

Here is the list of top-level commands available in Hudi-CLI. Each command has a few options; some are mandatory, while others are optional.

Auto-complete works with [TAB] on your keyboard to assist you with the different options available for each command.

Step 2: Let’s see the available commands

If you run:
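(help is the interactive shell's built-in command listing)

  hudi -> help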

You should see this:

Step 3: Let’s describe the table

The command, desc, describes the table properties and values. It can also tell you whether a table is loaded properly.
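Run it at the prompt:

  hudi -> desc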

You should see something like this:

Step 4: Let’s inspect commits 

The command, commits show, shows you all the commits for the table.
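Run:

  hudi -> commits show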

You should see a list of commits for the table of interest, with stats about the writes in each commit:

Step 5: Show partition level stats for a given commit 

The command, commit showpartitions --commit [CommitTime], shows partition-level write stats for the given commit. You can grab the CommitTime from `commits show` under the CommitTime column.
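For example (the CommitTime is a placeholder; use one from your table):

  hudi -> commit showpartitions --commit <CommitTime>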

You should see something like this:

Step 6: Show the file level stats for a given commit 

The command, commit showfiles --commit [CommitTime], shows file-level details for the given commit, including each FileId and its previous commit.
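For example:

  hudi -> commit showfiles --commit <CommitTime>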

You should see a table like this:

Step 7: Show file sizes 

The command, stats filesizes, gives you a distribution of file sizes across the table. You can get a sense if there are a lot of small files and what the average file size is. 
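Run:

  hudi -> stats filesizes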

You should see a table like this:

Spark Dependencies:

While some of the commands don’t need a Spark engine for them to be executed, others may. Let’s take a look at a few of the commands that require the Spark engine:

Be sure to set the SPARK_HOME for the following commands:
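(placeholder path; point it at the same Spark distribution you used to launch the CLI)

  export SPARK_HOME=/path/to/spark-3.2.3-bin-hadoop3.2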

Step 8: List files via Metadata Table

The Metadata Table can significantly improve read/write performance of your queries. The main purpose of the Metadata Table is to eliminate the requirement for the "list files" operation.
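A sketch of listing via the metadata table from the CLI prompt (the partition value is a placeholder):

  # list all partitions tracked in the metadata table
  hudi -> metadata list-partitions
  # list the files in one partition via the metadata table
  hudi -> metadata list-files --partition <partition>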

You should see some logs. Finally, you’ll see the partitions:

Step 9: Add a savepoint and restore to the savepointed commit.

Savepoint saves the table state as of the commit time, so that you can restore the table to this savepoint at a later point in time. It also ensures that the cleaner will not clean up any files that are savepointed.
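For example, from the CLI prompt (substitute a CommitTime from commits show; local[2] is just a local Spark master for this walkthrough):

  hudi -> savepoint create --commit <CommitTime> --sparkMaster local[2]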

When you run the command with a CommitTime, you should see that it was savepointed:

Now, if you run savepoints show, you can see the table before and after it is savepointed:
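At the CLI prompt:

  hudi -> savepoints show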

You can see the change when a table was not savepointed and when it was:

Open another terminal and navigate to your Hudi table. When you inspect the contents of the .hoodie folder, you will see the savepoint meta files.

In a separate terminal type:

In my case, I navigated to the hudi_trips_cow directory with the sample dataset:
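(placeholder path to wherever you placed the sample table)

  cd /absolute/path/to/hudi_trips_cow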

When you list the .hoodie directory:
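For example:

  # look for the savepoint meta files listed among the commit files
  ls .hoodie/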

You’ll see the savepoint meta files:

Restoring to the savepoint:
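For example (use the CommitTime you savepointed earlier):

  hudi -> savepoint rollback --savepoint <CommitTime> --sparkMaster local[2]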

You can see that the table has been restored to the savepoint:

If you inspect the .hoodie directory, you will see restore meta files. 

The CLI/admin tool can be very helpful in understanding your Hudi tables. With this tool, you can monitor your Hudi tables, troubleshoot issues, understand whether you need to perform table services like compaction and clustering, and assist with disaster recovery when needed.

You can try the Hudi-CLI with the sample data set.

To get started on Apache Hudi, check out:
