Between billion-dollar funding rounds and major announcements from industry leaders including Microsoft, Google, Snowflake, and Databricks, it’s clear that 2024 was the year of artificial intelligence and open table formats. AWS further cemented this at their annual re:Invent conference last week, where they announced a boatload of new features to help you build a data lakehouse that powers your AI and analytics workloads. In this blog, I’ll cover the most important announcements data teams should follow.
Amazon made one thing clear at this year’s re:Invent – they are doubling down on open table formats. Open table formats, such as Apache Hudi, Apache Iceberg, and Delta Lake, bring data warehouse-like capabilities to the data lake, adding ACID transactions, fast performance, and more.
AWS’s recently launched S3 Tables introduce a new type of object storage bucket, natively integrated with open table formats. Initially, S3 Tables will support only Apache Iceberg, but will likely add support for other table formats down the road (notice the format field in the get-table API response). Now let’s see what S3 Tables offer, and where they fall short.
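Curious what that field looks like? Here’s a minimal sketch of fetching a table’s metadata with boto3, assuming a release recent enough to include the s3tables client; the bucket ARN, namespace, and table name are placeholders.

```python
import boto3

# Requires a boto3 release that includes the "s3tables" client.
# The bucket ARN, namespace, and table name below are placeholders.
s3tables = boto3.client("s3tables")

response = s3tables.get_table(
    tableBucketARN="arn:aws:s3tables:us-east-1:123456789012:bucket/my-table-bucket",
    namespace="analytics",
    name="events",
)

# The format field is ICEBERG at launch, but its presence hints that
# other table formats could be supported later.
print(response["format"])
```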
✅ Feature: Automated table maintenance
AWS will automatically run Iceberg table maintenance on S3 Tables, enabling up to 3x faster query performance and 10x higher TPS than unoptimized tables (according to AWS; baselines are unclear). This feature could make it easier for AWS users to run workloads in Iceberg, which lacks the open source table maintenance functionality that Hudi and Delta Lake offer to varying degrees.
Specifically, S3 Tables will compact small files into larger files to improve query performance, and remove expired table snapshots & unreferenced files to free up storage.
The initial price for this table maintenance is steep, with Amazon charging $0.05 per GB processed. This is ~50x more expensive than running table maintenance yourself with Spark on Amazon EMR, based on our team’s estimates.
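For comparison, here’s roughly what the DIY route looks like: a sketch of the standard Apache Iceberg Spark procedures you would schedule yourself on EMR. The catalog, database, and table names are placeholders, and the Glue catalog settings assume the Iceberg AWS integration jars are on the classpath.

```python
from pyspark.sql import SparkSession

# Rough sketch of self-managed Iceberg table maintenance with Spark on EMR.
# "glue_catalog", "analytics", and "events" are placeholder names.
spark = (
    SparkSession.builder
    .appName("iceberg-table-maintenance")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://my-bucket/warehouse/")
    .getOrCreate()
)

# Compact small files into larger ones for faster scans.
spark.sql("CALL glue_catalog.system.rewrite_data_files(table => 'analytics.events')")

# Expire old snapshots and the data files only they reference.
spark.sql(
    "CALL glue_catalog.system.expire_snapshots(table => 'analytics.events', retain_last => 5)"
)

# Delete files that no snapshot references anymore.
spark.sql("CALL glue_catalog.system.remove_orphan_files(table => 'analytics.events')")
```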
✅ Feature: Glue catalog integration
When you create S3 Tables, they will automatically sync to your Glue catalog so they can be accessed by AWS’s full suite of data products. This feature automates one of the steps you’d typically have to build yourself for a DIY data lake.
The Glue catalog integration can help teams keep data producers and consumers aligned within the AWS ecosystem. We’ve seen the benefits of such functionality with our own customers using Onehouse’s multi-catalog sync feature that syncs to catalogs including Glue, Snowflake, Databricks, and BigQuery.
⚠️ Limitation: Potential lock-in to AWS ecosystem
By tightly coupling storage with data management and catalog functionality, S3 Tables present a more closed ecosystem than pure S3. This could limit compatibility with other vendors such as Snowflake and Databricks, or even with your in-house solutions. For example, non-AWS Spark may not recognize the IAM permissions you set on your S3 Tables in AWS.
While S3 Tables lower the barrier to entry for working with open table formats, they introduce a potential lock-in point for your data. The storage bucket has historically been an extensible primitive, upon which customers and vendors can run their own services, but S3 Tables now muddy the waters between the storage and catalog layers.
⚠️ Limitation: Only works with Spark + Iceberg
Another limitation is the lack of interoperability across engines and formats. Initially, S3 Tables support only Iceberg for storage and Spark as the compute engine. This is a critical gap for many teams, such as Amazon Athena users who want to query data with Trino, or Databricks users who prefer to work with Delta Lake tables. I’d expect AWS to close the query engine and table format gaps over time, given their recent work with customers on Apache XTable (Incubating), but it’s unclear when that will happen.
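To make the Spark-only point concrete, here’s roughly how a Spark session attaches to a table bucket today. This sketch assumes AWS’s S3 Tables catalog implementation for Iceberg (software.amazon.s3tables.iceberg.S3TablesCatalog) is on the classpath; the catalog name, bucket ARN, namespace, and table are placeholders.

```python
from pyspark.sql import SparkSession

# Sketch of querying an S3 table bucket from Spark, assuming the
# S3 Tables Iceberg catalog jar is available. All names are placeholders.
spark = (
    SparkSession.builder
    .appName("s3-tables-query")
    .config("spark.sql.catalog.s3tables", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.s3tables.catalog-impl",
            "software.amazon.s3tables.iceberg.S3TablesCatalog")
    .config("spark.sql.catalog.s3tables.warehouse",
            "arn:aws:s3tables:us-east-1:123456789012:bucket/my-table-bucket")
    .getOrCreate()
)

# Read the table through the S3 Tables catalog like any other Iceberg table.
spark.sql("SELECT COUNT(*) FROM s3tables.analytics.events").show()
```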
💡 What if I’m already using S3 with open table formats?
S3 Tables are a new product line from AWS, which means they won’t affect your existing lakehouse deployments on S3. You can continue to self-manage your open table formats on pure S3 with EMR, or use other managed platforms for data ingestion and table optimization.
AWS Glue is a serverless tool for running Spark workloads on AWS. Glue 5.0 introduces fine-grained access control for Spark through AWS Lake Formation, and automated data lineage for Glue jobs through Amazon DataZone. Glue 5.0 also adds upgrades and new features for the open table formats (Apache Hudi, Apache Iceberg, and Delta Lake) to improve performance and reliability.
While this release did not include any groundbreaking changes, the integrations within the AWS product ecosystem should make life easier for teams running workloads with open table formats on Glue.
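As an illustration, below is a hedged sketch of creating a Glue 5.0 Spark job with Iceberg support enabled through the --datalake-formats job parameter; the job name, IAM role, and script location are placeholders.

```python
import boto3

# Hypothetical example of defining a Glue 5.0 Spark job that loads the
# Iceberg connector via the --datalake-formats parameter.
glue = boto3.client("glue")

glue.create_job(
    Name="iceberg-etl-job",                                   # placeholder
    Role="arn:aws:iam::123456789012:role/GlueJobRole",        # placeholder
    GlueVersion="5.0",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-bucket/scripts/iceberg_etl.py",  # placeholder
        "PythonVersion": "3",
    },
    DefaultArguments={
        # Tells Glue to load the libraries for the listed table format(s).
        "--datalake-formats": "iceberg",
    },
)
```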
Amazon just made it official – SageMaker is now the de facto interface for analytics and AI on AWS. The newly announced SageMaker Unified Studio unifies many of AWS’s data tools behind a single pane of glass.
With the new SageMaker, you can easily run business intelligence queries with Redshift, process data with Apache Spark, and train your AI models. This is all backed by SageMaker Lakehouse, allowing you to work with a single copy of data stored across S3 and Redshift, with centralized data governance (so long as you stay within the AWS product ecosystem).
This might sound familiar – Microsoft and Google recently announced similar product lines, Microsoft Fabric and Google’s BigLake, that unify their data warehouses, data processing engines, catalogs, and more into single product suites backed by open table formats.
This trend makes sense, as hyperscalers race to build the all-in-one data and AI platform that will take over the market. Hopefully these grand ambitions don’t interfere with the composable product ecosystem that has made AWS a top choice for data teams. Each of these AWS products has great standalone value and pairs well with technologies outside the AWS orbit, such as using StarRocks to query data in a Glue catalog, or using EMR to transform data that is later queried with Dremio or Starburst.
While all-in-one projects like SageMaker Lakehouse may improve ease-of-use, they can also limit the speed of innovation, and make it more difficult for customers to adopt new, best-in-class technologies in their data infrastructure. AWS customers should closely follow along as these products evolve, and pay attention to how well they integrate with other tools within and outside of the AWS ecosystem.
Onehouse offers a variety of solutions for customers on AWS, including high-performance ingestion & ETL, and advanced table optimization. These work hand in glove with AWS data analytics services through their native integrations with Apache Hudi and Apache Iceberg.
High-Performance Ingestion & Transformations
Many of our customers ingest data with Onehouse into open table formats in their S3 buckets. Onehouse offers best-in-class efficiency for ETL (extract, transform, load) workloads, in addition to helpful data pipeline features including data quality quarantine, low-code transformations, easy performance tuning, and deep observability.
Advanced Table Optimization
Onehouse also helps data teams who are already running their own data pipelines with Apache Hudi. Our Table Optimizer can attach to existing Hudi tables to accelerate performance by up to 10x with auto-optimized compaction, clustering, and cleaning. Table Optimizer leverages advanced optimization techniques we’ve built while supporting some of the world’s most complex workloads.
Open and Interoperable
All Onehouse products are built to be modular – they play well with the AWS ecosystem, and are compatible anywhere else you process or query your data. Onehouse uses Apache XTable to land data in Hudi, Iceberg, and Delta Lake, and syncs tables to multiple catalogs, so you are never locked into a single table format, catalog, or cloud vendor.
Whether you’re just starting out with open table formats, or you’re already running data pipelines at petabyte scale, we are here to help. Contact us to get tips from the Onehouse experts on how you can make the most of your AWS data platform.