Hudi-Presto Workshop
Building an Open Data Lakehouse on AWS S3 with Apache Hudi & Presto
April 24th, 2024 | 9 AM PT | 12 PM ET
Prerequisites
Technology Stack
Dataset
The workshop will use the TPC-DS dataset (roughly 10 GB) to demonstrate Hudi's read and write capabilities with Presto. The dataset will be made available at a common S3 location accessible to workshop attendees.
Environment Details
All required open-source software and dependencies will be pre-installed for this workshop session. Attendees will use Jupyter notebooks to run read and write queries on Apache Hudi tables using Presto and Spark SQL, and will also have access to the Spark UI and Presto UI for additional analysis and debugging.
Description
The lakehouse architecture combines the flexibility, scalability, and cost-efficiency of data lakes with the robust data management features of data warehouses. This workshop is designed to give data engineers and architects a comprehensive understanding of Apache Hudi and how to use it to build an open lakehouse architecture on AWS S3, with Presto as the engine for fast, interactive queries.
Attendees will learn:
- The open lakehouse architecture stack, with Hudi as the transactional layer and Presto as the compute engine.
- Hudi's table optimization services, clustering and the metadata table, and how they help improve query performance (see the last sketch after this list).
- Practical exercises: creating Copy-on-Write (CoW) and Merge-on-Read (MoR) Hudi tables on S3, ingesting data, performing upserts and deletes, and syncing with catalogs such as the Hive Metastore (see the first sketch after this list).
- Ways of querying data with Presto, including snapshot and read-optimized queries (see the second sketch after this list).
- Applying the clustering table service and metadata table to observe firsthand improvements in query speed on the Presto side.
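As a taste of the hands-on exercises, the sketch below shows how a Merge-on-Read Hudi table might be created and mutated with Spark SQL. The table name, S3 path, the extra `ts` ordering column, and the `store_sales_updates` staging table are illustrative placeholders rather than the workshop's actual assets; the other column names come from the TPC-DS `store_sales` fact table.

```sql
-- Create a Merge-on-Read (MoR) Hudi table on S3 (swap type = 'cow' for Copy-on-Write).
-- The bucket path and the 'ts' precombine column are illustrative.
CREATE TABLE hudi_store_sales_mor (
  ss_item_sk       BIGINT,
  ss_ticket_number BIGINT,
  ss_sold_date_sk  BIGINT,
  ss_quantity      INT,
  ss_sales_price   DOUBLE,
  ts               BIGINT
) USING hudi
LOCATION 's3://<your-bucket>/warehouse/hudi_store_sales_mor'
TBLPROPERTIES (
  type = 'mor',
  primaryKey = 'ss_item_sk,ss_ticket_number',
  preCombineField = 'ts'
);

-- Upsert changes from a staging table; Hudi matches rows on the primary key
-- and keeps the latest record per key using the preCombine field.
MERGE INTO hudi_store_sales_mor AS t
USING store_sales_updates AS s
ON  t.ss_item_sk = s.ss_item_sk
AND t.ss_ticket_number = s.ss_ticket_number
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

-- Delete records; Hudi records the deletes so readers see a consistent view.
DELETE FROM hudi_store_sales_mor WHERE ss_quantity = 0;
```

Syncing the table to a Hive Metastore is typically handled by the writer's hive-sync configuration (the hoodie.datasource.hive_sync.* options), which the workshop environment is expected to pre-configure.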
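On the Presto side, the queries below sketch the two read paths on a MoR table registered through the Hive connector: the `_ro` suffix serves read-optimized queries from compacted base files, while `_rt` serves snapshot queries that also merge the latest log files. The `hive.default` catalog and schema and the table name are assumptions for illustration.

```sql
-- Read-optimized query: scans only compacted base files, trading freshness for speed.
SELECT ss_item_sk, SUM(ss_sales_price) AS total_sales
FROM hive.default.hudi_store_sales_mor_ro
GROUP BY ss_item_sk
ORDER BY total_sales DESC
LIMIT 10;

-- Snapshot (real-time) query: merges base files with log files for the latest view.
SELECT ss_item_sk, SUM(ss_sales_price) AS total_sales
FROM hive.default.hudi_store_sales_mor_rt
GROUP BY ss_item_sk
ORDER BY total_sales DESC
LIMIT 10;
```

A Copy-on-Write table needs no suffix; its snapshot query already reads fully merged base files.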
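Finally, a sketch of how the clustering and metadata-table exercises might be driven from Spark SQL. The run_clustering call procedure and the hoodie.metadata.enable config are standard Hudi features; the ordering column is only an example, and any matching Presto-side settings depend on the connector and version used in the workshop.

```sql
-- Keep the Hudi metadata table maintained on the writer so readers can avoid
-- expensive S3 file listings (enabled by default in recent Hudi releases).
SET hoodie.metadata.enable = true;

-- Re-cluster the table, sorting records by a commonly filtered column so that
-- Presto scans fewer files for selective queries (column choice is illustrative).
CALL run_clustering(table => 'hudi_store_sales_mor', order => 'ss_sold_date_sk');
```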
Featured Speakers