April 18, 2023

Getting Started: Incrementally process data with Apache Hudi™

Written by:

Raymond Xu

Getting Started: Incrementally process data with Apache Hudi™

Incremental processing is a key technique used in data systems to efficiently handle large volumes of data. Rather than handling all the data at once, it involves pulling subsets of data from a source and processing them separately. For example, a data system that does incremental processing pulls the changed data like inserts and updates spanning from a short time range and then incrementally updates the records in a downstream data system. By doing reads and writes in this way, it will result in better data freshness and lower compute resource usage, and in addition, diagnosing and pinpointing errors will be easier.

Constructing custom incremental pipelines is no easy task. It is important to thoughtfully design the processing stages while considering factors like data size, traffic pattern and latency requirements to maximize efficiency. Undoubtedly, you may encounter some of the most prevalent issues like ensuring data consistency across all stages within the pipelines. To swiftly detect problems, it is also crucial to set up proper monitoring and data validation mechanisms. Furthermore, you may deal with operational complications that originate from backfill requests and dependency issues of different pipeline stages. Building efficient tooling for managing the pipelines’ pause, resume, restart, and backfill will dramatically reduce the operational burdens.

An alternative to building custom incremental pipelines is using Apache Hudi, a data lakehouse platform designed explicitly with incremental processing capabilities. Hudi’s concepts of timeline and instants naturally support incremental processing by providing timestamps and file paths to the changed data. Hudi provides configurations for easily managing incremental queries’ required parameters, such as begin time and end time. In 0.13.0, Change-Data-Capture (CDC), a richer format of incremental processing, was added to return finer-granular data to enable more capable downstream processing. In the remaining sections of this blog, we will walk through some code examples to give you a quick overview of incremental processing with Hudi and, hopefully, give you a starting point for building your own incremental pipelines.

Medallion architecture with incremental processing

Example 1: Simple Incremental Query

Let’s start with a simple incremental query example using the sample stock data from the Apache Hudi repo. We’ll prepare a table with 2 commits, with the 2nd commit containing updates. From there, we’ll illustrate how to set up the query options and analyze the results.

‍

Step 1:

Clone the hudi repo and navigate to the root directory:

git clone git@github.com:apache/hudi.git
cd <path>/hudi

hudi-clone.sh — hosted with ❤ by GitHub

‍

Step 2:

Execute the code snippet below. Be sure to update the Hudi Spark Bundle version to what’s used in your Spark shell:

spark-shell \
  --packages org.apache.hudi:hudi-spark3.2-bundle_2.12:0.13.0 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
  --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'

spark-shell-hudi.sh — hosted with ❤ by GitHub

‍

Step 3:

Execute the snippet to create the table:

import org.apache.spark.sql.SaveMode._

val basePath = "/tmp/hudi/stocks"
val stocksDF1 = spark.read.json("docker/demo/data/batch_1.json")
val stocksDF2 = spark.read.option("multiline", "true").json("docker/demo/data/batch_2.json").limit(1)

stocksDF1.write.format("hudi").
  option("hoodie.datasource.write.recordkey.field", "symbol").
  option("hoodie.datasource.write.partitionpath.field", "date").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.table.name", "stocks").
  mode(Overwrite).
  save(basePath)

stocksDF2.write.format("hudi").
  option("hoodie.datasource.write.recordkey.field", "symbol").
  option("hoodie.datasource.write.partitionpath.field", "date").
  option("hoodie.datasource.write.precombine.field", "ts").
  mode(Append).
  save(basePath)

hudi-write-stocks.scala — hosted with ❤ by GitHub

‍

Step 4:

List the Hudi timeline to see the 2 commits in a separate terminal window:

ls -l /tmp/hudi/stocks/.hoodie/

hudi-hoodie-ls.sh — hosted with ❤ by GitHub

‍

Step 5:

From the 1st commit to the 2nd, there was 1 record updated. Now, we can run an incremental query to see what was changed between these two commits:

val basePath = "/tmp/hudi/stocks"

spark.read.format("hudi").load(basePath).createOrReplaceTempView("hudi_stocks")

val firstCommit = spark.
  sql("select distinct(_hoodie_commit_time) as commitTime from hudi_stocks order by commitTime").
  map(k => k.getString(0)).take(1)(0)

val incrementalDF = spark.read.format("hudi").
  option("hoodie.datasource.query.type", "incremental").
  option("hoodie.datasource.read.begin.instanttime", firstCommit).
  load(basePath)

incrementalDF.show()

hudi-incremental-read.scala — hosted with ❤ by GitHub

‍

The result is the changed record with the version matching the latest commit.

Note: We can also specify `hoodie.datasource.read.end.instanttime`, which will limit the version of changed records up to the specified commit.

If we set `hoodie.datasource.read.begin.instanttime=0` and omit `hoodie.datasource.read.end.instanttime`, it will effectively return all the records written to the table.

Example 2: Incremental Join

Once the changed records have been incrementally retrieved, we might want to join them with other dimension tables and update a broader table containing more information for analysis. Let’s proceed with processing the data from Example 1 to demonstrate an incremental join.

Copy and paste this code snippet in the terminal:

import org.apache.spark.sql.SaveMode._

// simulate a dimension table for stocks
case class Company(symbol: String, name: String, yearFounded: Int)
val records = Seq(
  Company("MSFT", "Microsoft", 1975))
records.toDF("symbol", "name", "yearFounded").createOrReplaceTempView("companies")

// join the changed records with the dimension table
incrementalDF.createOrReplaceTempView("changedStocks")
val result = spark.sql("""
  |SELECT s.*, c.name, c.yearFounded
  |FROM changedStocks s
  |JOIN companies c
  |ON s.symbol = c.symbol""".stripMargin)

val wideTablePath = "/tmp/hudi/stocks_wide"
result.write.format("hudi").
  option("hoodie.datasource.write.recordkey.field", "symbol").
  option("hoodie.datasource.write.partitionpath.field", "date").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.table.name", "stocks_wide").
  mode(Append).
  save(wideTablePath)

// query the wide table to see all the columns.
spark.read.format("hudi").load(wideTablePath).show()

hudi-incremental-join.scala — hosted with ❤ by GitHub

Example 3: Incremental Processing with CDC

In the preceding 2 examples, we could pull the changed records as of the latest version or up to the end of the commit time. However, we could not view the previous state of the records, or determine if some records were hard-deleted. To accomplish this, we need to leverage the full change-data-capture (CDC) capabilities in Hudi.

Starting with Hudi 0.13.0, the CDC feature was introduced, allowing the logging of before and after images of the changed records, along with the associated write operation type (insert or update or delete).

‍

Case study: employee headcount

Company XYZ has offices in the US, India and China. At time `1000`, there is 1 employee in each office. At time `1100`, there are 2 new hires in the US office and 1 new hire in both India and China offices. At time `1200`, a new office in Singapore was established and an employee from the US office moved to the new Singapore office.

The company's HR department wants to keep records of how many employees each office has.

We can model this scenario by having an input stream of employee id, office country and time. A key requirement for this example is to continuously update a table of the office country, headcount and time.

Case study: employee headcount processing flow

‍

The code snippet below shows how to simulate the input data, and perform incremental CDC processing and aggregation using Hudi and Spark streaming:

import org.apache.spark.sql.catalyst.expressions.{Add, If, Literal}
import org.apache.spark.sql.execution.streaming.MemoryStream
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.{Column, Dataset, Row}
import org.apache.spark.sql.SaveMode._

// initialize `employee_country` table with CDC enabled
val employeeCountryTablePath = "/tmp/hudi/employee_country"
val employeeCountrySchema = StructType(Seq(
  StructField("employeeId", IntegerType),
  StructField("country", StringType),
  StructField("ts", StringType)
))
spark.createDataFrame(spark.sparkContext.emptyRDD[Row], employeeCountrySchema).
  write.format("hudi").
  option("hoodie.datasource.write.recordkey.field", "employeeId").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.table.name", "employee_country").
  option("hoodie.table.cdc.enabled", "true").
  mode(Overwrite).
  save(employeeCountryTablePath)

// simulate input stream and write to `employee_country`
val inputStream = new MemoryStream[(Int, String, String)](100, spark.sqlContext)
inputStream.toDS().toDF("employeeId", "country", "ts").
  writeStream.format("hudi").
  foreachBatch { (batch: Dataset[Row], _: Long) =>
    batch.write.format("hudi").
      option("hoodie.datasource.write.recordkey.field", "employeeId").
      option("hoodie.datasource.write.precombine.field", "ts").
      mode(Append).
      save(employeeCountryTablePath)
  }.start()

// initialize `country_headcount` table
val countryHeadcountTablePath = "/tmp/hudi/country_headcount"
val countryHeadcountSchema = StructType(Seq(
  StructField("country", StringType),
  StructField("headcount", IntegerType),
  StructField("ts", StringType)
))
spark.createDataFrame(spark.sparkContext.emptyRDD[Row], countryHeadcountSchema).
  write.format("hudi").
  option("hoodie.datasource.write.recordkey.field", "country").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.table.name", "country_headcount").
  mode(Overwrite).
  save(countryHeadcountTablePath)

// create a CDC processing stream to aggregate the changed data and update `country_headcount` table
spark.readStream.format("hudi").
  option("hoodie.datasource.query.type", "incremental").
  option("hoodie.datasource.query.incremental.format", "cdc").
  load(employeeCountryTablePath).
  writeStream.format("hudi").
  foreachBatch { (batch: Dataset[Row], _: Long) =>
    val current = spark.read.format("hudi").load(countryHeadcountTablePath)
    batch.select(
      get_json_object(col("before"), "$.country").as("bf_country"),
      get_json_object(col("after"), "$.country").as("af_country"),
      get_json_object(col("after"), "$.ts").as("ts")
    ).
    withColumn("bf_ct", new Column(If(isnull(col("bf_country")).expr, typedLit(0).expr, typedLit(-1).expr))).
    withColumn("af_ct", new Column(If(isnull(col("af_country")).expr, typedLit(0).expr, typedLit(1).expr))).
    select(explode(array(Array(
      struct(col("bf_country").as("country"), col("bf_ct").as("ct"), col("ts")),
      struct(col("af_country").as("country"), col("af_ct").as("ct"), col("ts"))): _*))).
    select(col("col.country").as("country"), col("col.ct").as("ct"), col("col.ts").as("ts")).
    where("country is not null").
    groupBy("country").
    agg("ct" -> "sum", "ts" -> "max").
    join(current, Seq("country"), "left").
    select(
      col("country"),
      new Column(Add(col("sum(ct)").expr, If(isnull(col("headcount")).expr, Literal(0), col("headcount").expr))).as("headcount"),
      col("max(ts)").as("ts")
    ).
    write.format("hudi").
    option("hoodie.datasource.write.recordkey.field", "country").
    option("hoodie.datasource.write.precombine.field", "ts").
    mode(Append).
    save(countryHeadcountTablePath)
  }.start()

// simulate input data
inputStream.addData(Seq((1, "US", "1000"), (2, "IN", "1000"), (3, "CN", "1000")))
inputStream.addData(Seq((4, "US", "1100"), (5, "US", "1100"), (6, "IN", "1100"), (7, "CN", "1100")))
inputStream.addData(Seq((4, "SG", "1200")))

// read the latest country_headcount data
spark.read.format("hudi").load(countryHeadcountTablePath).
  select("country", "headcount", "ts").show(false)

/*
+-------+---------+----+
|country|headcount|ts  |
+-------+---------+----+
|US     |2        |1200|
|IN     |2        |1100|
|CN     |2        |1100|
|SG     |1        |1200|
+-------+---------+----+
Note that the result may not be immediately up-to-date given the streaming processing is async.
*/

hudi-cdc-streaming.scala — hosted with ❤ by GitHub

The above code snippet was taken from an existing test class in the Hudi repo, contributed by Bi Yan. For more introduction to Hudi CDC features, please refer to the guide on the official website.

Summary

In this blog, we briefly covered the motivations and challenges with building incremental processing pipelines, and illustrated how Hudi supports incremental processing in different scenarios with sample code. Uber’s recent blog “Setting Uber’s Transactional Data Lake in Motion with Incremental ETL Using Apache Hudi” is an awesome reference for operating production incremental pipelines. The Hudi community will continue to enhance the CDC feature, further advancing the incremental processing capabilities.

Authors

No items found.

Hudi

Apache Iceberg™ vs Delta Lake vs Apache Hudi™ - Feature Comparison Deep Dive

Kyle Weller

October 2, 2025

Getting Started: Incrementally process data with Apache Hudi™

Example 1: Simple Incremental Query

Step 1:

Step 2:

Step 3:

Step 4:

Step 5:

Example 2: Incremental Join

Example 3: Incremental Processing with CDC

Case study: employee headcount

Summary

Read More:

Choosing the Right Data Ingestion Method: Batch, Streaming, and Hybrid Approaches

Optimizing Performance in Open Source Data Warehouses: Query Tuning, Data Partitioning, and Caching Strategies

Data Lake vs. Warehouse vs. Lakehouse

Data Architecture Survey Report: The Lakehouse Is Your Data Foundation for AI

Apache Kafka® (Kafka Connect) vs. Apache Flink® vs. Apache Spark™: Choosing the Right Ingestion Framework

Apache Iceberg™ vs Delta Lake vs Apache Hudi™ - Feature Comparison Deep Dive

Getting Started: Incrementally process data with Apache Hudi™

Example 1: Simple Incremental Query

Step 1:

Step 2:

Step 3:

Step 4:

Step 5:

Example 2: Incremental Join

Example 3: Incremental Processing with CDC

Case study: employee headcount

Summary

Read More:

Choosing the Right Data Ingestion Method: Batch, Streaming, and Hybrid Approaches

Optimizing Performance in Open Source Data Warehouses: Query Tuning, Data Partitioning, and Caching Strategies

Data Lake vs. Warehouse vs. Lakehouse

Data Architecture Survey Report: The Lakehouse Is Your Data Foundation for AI

Apache Kafka® (Kafka Connect) vs. Apache Flink® vs. Apache Spark™: Choosing the Right Ingestion Framework

Apache Iceberg™ vs Delta Lake vs Apache Hudi™ - Feature Comparison Deep Dive

Subscribe to the Blog