Apache Hudi is a data lakehouse technology that provides an incremental processing framework to power business-critical data pipelines at low latency and high efficiency, along with an extensive set of table management services. With strong community growth and momentum, AWS has embraced Apache Hudi natively into its cloud ecosystem since 2019. The goal of this blog is to highlight the key integrations and show how you can build an open lakehouse on AWS with Apache Hudi.
First, let’s review some history to put the growth of Apache Hudi in perspective.
In 2019, Amazon EMR announced native support for Apache Hudi at re:Invent, and in 2020, Amazon Athena and Amazon Redshift followed with their own Hudi support announcements.
Since then, AWS has continued to upgrade to the latest Hudi releases, including the recent update to 0.9.0. With Apache Hudi natively integrated into these powerful and cost-effective AWS services, it is an easy choice to use Hudi to build transactional data lakes, serverless pipelines, low-latency streaming data platforms, and powerful open lakehouse solutions.
There are many success stories from the community available online. Here are a few that AWS has written blog posts about recently:
Let’s discuss each AWS service to understand how the integration works and what value Hudi adds to the ecosystem.
Amazon EMR is a service that helps you manage and scale Spark, Hive, Presto, Trino, and other big data workloads. As powerful and convenient as EMR is, it can still be challenging to deal with record-level inserts, updates, and deletes. Apache Hudi brings these basic data lakehouse capabilities and much more to EMR. The documentation for how to create an EMR cluster with Apache Hudi is incredibly brief because Hudi is pre-installed and ready to work out of the box.
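To make this concrete, here is a minimal sketch of a record-level upsert using the Hudi Spark datasource on EMR. The table name, field names, and S3 paths are illustrative placeholders rather than part of any real setup.

```python
# Minimal sketch of a record-level upsert with the Hudi Spark datasource on EMR.
# Table name, field names, and S3 paths are illustrative placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-upsert-example")
    # Hudi requires Spark's Kryo serializer
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# New and changed records landing from an upstream source
updates_df = spark.read.json("s3://my-bucket/incoming/trips/")

hudi_options = {
    "hoodie.table.name": "trips",
    "hoodie.datasource.write.recordkey.field": "trip_id",      # unique record key
    "hoodie.datasource.write.precombine.field": "updated_at",  # latest version wins
    "hoodie.datasource.write.partitionpath.field": "trip_date",
    "hoodie.datasource.write.operation": "upsert",
}

(
    updates_df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://my-bucket/lake/trips/")
)
```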
With Apache Hudi, you can now author incremental queries on EMR, allowing you to expose changesets and incrementally process only the data that has changed. This enables you to significantly reduce latency compared with old-school batch pipelines by processing smaller amounts of data more frequently and efficiently.
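As a hedged illustration of the idea, the snippet below reads only the records committed after a given instant; the commit timestamp, field names, and S3 path are placeholders.

```python
# Sketch of a Hudi incremental query: read only records committed after `begin_time`.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

begin_time = "20211015123000"  # a Hudi commit instant saved from a previous run

incremental_df = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", begin_time)
    .load("s3://my-bucket/lake/trips/")
)

# Downstream logic only ever sees the changeset, not the full table
incremental_df.createOrReplaceTempView("trips_changes")
spark.sql("SELECT trip_id, fare, _hoodie_commit_time FROM trips_changes").show()
```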
With EMR and Hudi you unlock two types of write operations: Copy-On-Write (COW) and Merge-On-Read (MOR). COW is how most other data lakehouse technologies operate. MOR is unique to Apache Hudi and allows you to write data in a combination of columnar (e.g., Parquet) and row-based (e.g., Avro) file formats. Hudi exposes configurations that let you balance write latency and write amplification, and control how frequently the data is automatically compacted for maximum read performance.
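For illustration, writing a Merge-On-Read table with inline compaction every few delta commits might look like the following sketch; the table name and compaction values are example choices, not recommendations.

```python
# Sketch of a Merge-On-Read write with inline compaction every 5 delta commits.
mor_options = {
    "hoodie.table.name": "trips_mor",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.recordkey.field": "trip_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",
    # Fold the row-based (Avro) log files into columnar (Parquet) base files
    # after every 5 delta commits
    "hoodie.compact.inline": "true",
    "hoodie.compact.inline.max.delta.commits": "5",
}

(
    updates_df.write.format("hudi")   # updates_df as in the earlier upsert snippet
    .options(**mor_options)
    .mode("append")
    .save("s3://my-bucket/lake/trips_mor/")
)
```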
Apache Hudi offers many table services, including cleaning, compaction, clustering, file sizing, indexing, Z-ordering, and more. All of these services help you efficiently build an industry-leading lakehouse data platform with EMR.
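To give a flavor of how these services are driven by configuration, here is a hedged sketch of a few write-time options; the retention counts, file sizes, and sort columns are arbitrary examples.

```python
# A few of the table-service knobs, set as ordinary write options; values are examples only.
table_service_options = {
    # Cleaning: retain file versions for the last 10 commits
    "hoodie.cleaner.commits.retained": "10",
    # File sizing: aim for ~120 MB base files, treat files under ~100 MB as "small"
    "hoodie.parquet.max.file.size": "125829120",
    "hoodie.parquet.small.file.limit": "104857600",
    # Clustering: periodically rewrite data sorted by these columns for better pruning
    "hoodie.clustering.inline": "true",
    "hoodie.clustering.inline.max.commits": "4",
    "hoodie.clustering.plan.strategy.sort.columns": "trip_date,trip_id",
}

# Merge into the write options from the earlier snippets before saving
all_options = {**mor_options, **table_service_options}
```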
Many people don’t realize it, but you can use Apache Hudi and EMR without having to convert the existing Parquet data in your data lake. The feature is called bootstrapping, and this blog details how you can use EMR to build separate metadata about existing Parquet data without having to touch or convert the raw Parquet files to Hudi format.
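A rough sketch of a metadata-only bootstrap with the Spark datasource is shown below; the paths are placeholders, and depending on your data additional bootstrap options (such as a key generator class) may be required, so treat this as an outline rather than a complete recipe.

```python
# Rough sketch of a metadata-only bootstrap: Hudi metadata is written under the
# table path while the original Parquet files stay untouched. Paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

bootstrap_options = {
    "hoodie.table.name": "trips_bootstrapped",
    "hoodie.datasource.write.operation": "bootstrap",
    # Existing Parquet dataset that should NOT be rewritten
    "hoodie.bootstrap.base.path": "s3://my-bucket/existing/parquet/trips/",
    "hoodie.datasource.write.recordkey.field": "trip_id",
    "hoodie.datasource.write.partitionpath.field": "trip_date",
}

# The bootstrap operation derives everything from the base path, so an empty
# dataframe is written; only Hudi metadata is produced.
(
    spark.range(0).write.format("hudi")
    .options(**bootstrap_options)
    .mode("overwrite")
    .save("s3://my-bucket/lake/trips_bootstrapped/")
)
```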
Apache Hudi will automatically sync your table metadata with the catalog of your choosing with minimal configuration. The natural choice for this on AWS is the AWS Glue Data Catalog.
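As a hedged example, adding the Hive sync options below to the write options registers the table through the Hive metastore, which resolves to the Glue Data Catalog when the EMR cluster is configured to use Glue as its metastore; the database, table, and partition names are placeholders.

```python
# Hive sync options that register/refresh the table in the Glue Data Catalog
# (assuming EMR is configured with Glue as its Hive metastore). Names are placeholders.
glue_sync_options = {
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.mode": "hms",            # sync through the metastore
    "hoodie.datasource.hive_sync.database": "hudi_db",
    "hoodie.datasource.hive_sync.table": "trips",
    "hoodie.datasource.hive_sync.partition_fields": "trip_date",
    "hoodie.datasource.hive_sync.partition_extractor_class":
        "org.apache.hudi.hive.MultiPartKeysValueExtractor",
}

# Combined with the write options from earlier, every commit keeps the catalog in sync
all_options = {**hudi_options, **glue_sync_options}
```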
You can also use Hudi connectors in Glue Studio if you want to write directly to Hudi tables with Glue instead of EMR.
Amazon Athena supports Hudi out of the box with no special configuration required. If you are writing to Hudi tables from EMR or Glue and have Glue/Hive catalog syncing enabled, you can simply point Athena at the registered database and write SQL queries with no extra setup required.
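For example, here is a minimal sketch that runs a query against a synced Hudi table through boto3; the database name, table, columns, and S3 output location are placeholders.

```python
import boto3

# Run a query against a synced Hudi table registered in the Glue catalog.
athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString="""
        SELECT trip_id, fare, trip_date
        FROM trips
        WHERE trip_date = DATE '2021-10-15'
    """,
    QueryExecutionContext={"Database": "hudi_db"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)

print(response["QueryExecutionId"])
```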
Hudi supports snapshot isolation, which means queries never pick up in-progress or not-yet-committed changes. You can write changes to the dataset from EMR at the same time users are querying it with Athena and Amazon Redshift Spectrum.
Another lesser-known scenario: if you have existing Parquet files that aren’t registered in your catalog, you can use Athena to bootstrap these Parquet files into Hudi tables without needing EMR, as described above. This operation leaves the Parquet files alone without rewriting them, but it writes Hudi metadata around the files and allows you to query the data as Hudi tables.
Redshift Spectrum enables you to define and query “external tables” that are stored on S3 outside of Redshift. This is valuable when you are already using Redshift and need to extend analysis into a large or rapidly changing dataset that you do not want to load into your warehouse. Since Hudi is natively supported by Redshift Spectrum, you can create external tables over Hudi data with a simple DDL statement.
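As an illustrative sketch, the DDL for a Copy-On-Write Hudi table points Spectrum at Hudi’s input format. It is executed here through the Redshift Data API, though any SQL client works; the external schema, columns, cluster, and S3 path are placeholders, and `spectrum_schema` is assumed to be an existing external schema.

```python
import boto3

# External schema, column definitions, cluster, and S3 path are placeholders.
ddl = """
CREATE EXTERNAL TABLE spectrum_schema.trips (
    trip_id    varchar,
    fare       double precision,
    updated_at timestamp
)
PARTITIONED BY (trip_date date)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS
  INPUTFORMAT  'org.apache.hudi.hadoop.HoodieParquetInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 's3://my-bucket/lake/trips/'
"""

redshift = boto3.client("redshift-data", region_name="us-east-1")
redshift.execute_statement(
    ClusterIdentifier="my-redshift-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql=ddl,
)
```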
Connecting all the pieces together, below is a sample architecture pattern that you can use as a starting point for your own design, depending on your use cases.
If you are new to Apache Hudi and looking to get started with Hudi on AWS, there are many places you can go to learn more and get help along the way. Here are some materials I recommend to help you get started: