October 7, 2024

Open Table Formats and the Open Data Lakehouse, In Perspective

There is a lot of buzz around data lakehouse architecture today, which unifies two mainstream data storage technologies - the data warehouse and the data lake - promising to do more with less. At the same time, all major data warehouse vendors have embraced open table metadata formats, driven by customer demand for the flexibility and openness that an open format promises.

Three projects - Apache Hudi, Apache Iceberg, and Delta Lake - are now at the center of all the attention and vendor chess moves in this space. These projects are pivotal in forging an open, adaptable foundation for your data that allows enterprises to choose appropriate compute engines tailored to their unique workloads, thus avoiding the constraints of proprietary storage formats. However, the proliferation of the terms open table format and open data lakehouse, used interchangeably across these projects, necessitates clarification and a deeper understanding. 

Adopting these table formats has laid the groundwork for openness. Still, it is crucial to recognize that an open data architecture needs more than just open table formats: it requires interoperability across formats and catalogs, and it requires that essential table management services such as clustering, compaction, and cleaning be delivered by open compute services rather than closed ones. Much of the advocacy today has focused on showing that replacing proprietary data storage formats with open table formats makes the data architecture open and interoperable. In reality, customers end up choosing a particular open table format based on a given vendor's support, while remaining tied to proprietary services and tools for other essential needs. This presents barriers to realizing a genuinely open data architecture within an organization.

The questions that we hope to answer through this blog are:

  • What are the differences between an open table format and an open data lakehouse platform?
  • Is an open table format enough to realize a truly open data architecture? 
  • How seamlessly can we move across different platforms today? 

To help answer these, we will explore the evolution of data architecture over the years, break down the lakehouse architecture into its components, and perform some comparative analysis to distinguish between what exactly qualifies as open and what does not. Let’s start with a bit of history.

Evolution of Data Architecture

Over the years, organizations have invested heavily in developing centralized, reliable, and scalable data architectures. The goal is to give more users access to data, so that more consumers can put it to work across analytical workloads ranging from business intelligence (BI) to machine learning (ML). As demand for these workloads has increased, data architectures have continually evolved, adapting to meet the complex and varied needs of modern data processing and storage. In the following sections, we will break down the evolution of data architecture from OLTP (Online Transactional Processing) to the modern data lakehouse, highlighting the key technological components and their structure in each kind of system.

Figure 1. Data architectures have evolved over several decades.

OLTP

OLTP systems have been foundational in handling transactional data. These systems are designed to efficiently handle one or a few rows of data at a time, enabling operations such as inserts, updates, and deletes, making them ideal for high-volume transactional environments.


Traditionally, there is no equivalent to the term "table format" in databases; they simply have a storage format, a lower-layer technical detail abstracted away from the user. However, in the spirit of our focus in this blog, we will break apart the storage formats used in OLTP into file and table formats. An OLTP database can be distilled into six technical components: storage, file format, table format, storage engine, compute engine, and the system catalog. These components are bundled together into a single system for transactional processing.

Figure 2. A generic representation of an OLTP-based architecture

OLTP systems are optimized for point lookups and real-time transactional data (for example, the storage format uses row-optimized files and indexes). They are not designed for complex analytical queries that require scanning large volumes of aggregated data.

OLAP/Data Warehouse

Online Analytical Processing (OLAP) systems were among the first data systems designed for efficient querying of aggregated data. Commonly referred to as data warehouses, these systems act as central repositories that store data from disparate sources and are optimized for querying large volumes of information. Traditionally, these systems were built on top of a row-store RDBMS with specialized optimizers tailored for analytical query patterns, compute engines, and storage mechanisms. However, the introduction of columnar storage, where data is organized and stored by columns rather than rows, has allowed OLAP databases to leverage efficient compression and enable faster access to specific attributes in analytical queries.

OLAP databases excel at managing structured data and are vital for the complex queries and analyses that are fundamental to business intelligence (BI) workloads. Similar to OLTP, an OLAP-based system bundles the six technical components into one unit; note that the file format is now column-oriented rather than row-oriented, and the compute engine is optimized for different purposes.

Figure 3. A generic representation of an OLAP-based architecture

OLAP is well-adapted for analytics against structured data, but its capacity to process semi-structured and unstructured data, essential for more advanced applications such as machine learning (ML), is limited. This limitation of OLAP highlighted the necessity for more versatile data architectures.

Data Lake

Data lakes started with the Hadoop era as a solution to the limitations of data warehouses, particularly their inefficiencies in handling varied types of data (structured, semi-structured, and unstructured) and their high operational costs. In a more technical sense, data lakes utilize a distributed file system or object storage to offer scalable, low-cost storage in open file formats such as Apache Parquet and Apache ORC. The data lake architecture also avoids long-running components in the data read/write path, so compute can scale elastically - typically 10-100x compared to reading and writing from data warehouses, which usually run on clusters of only a few nodes.

Unlike data warehouses, data lakes support a schema-on-read architecture that allows storing data at significantly lower costs. Data lakes introduced the separation of compute and storage components, enabling organizations to scale these resources independently, which enhances flexibility and optimizes costs.
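To make schema-on-read concrete, here is a minimal PySpark sketch that reads Parquet files directly from object storage; the bucket path and column name are hypothetical, and the S3 connector and credentials are assumed to be configured.

```python
from pyspark.sql import SparkSession

# Minimal sketch of schema-on-read against a data lake.
# The bucket, path, and column names below are hypothetical.
spark = SparkSession.builder.appName("lake-read").getOrCreate()

events = spark.read.parquet("s3a://my-data-lake/raw/events/")
events.printSchema()  # schema is discovered from the Parquet files at read time
events.filter(events["event_type"] == "purchase").count()
```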

Figure 4. A generic representation of a data lake architecture. All the elements are unbundled,
including storage and compute; there is no table format or storage engine.

With a data lake architecture, organizations typically adopt a two-tiered approach: utilizing open file formats such as Parquet in cloud data lakes for machine learning and advanced analytical workloads, while selectively exporting data to add to an existing data warehouse for business intelligence (BI) and ad-hoc query support. This two-tiered approach increases complexity and overhead, often creating data staleness and additional data copies.

Figure 5. A two-tiered architecture with data lakes and a data warehouse. ML-based workloads run on diverse data types
in the data lake (left), while BI and ad hoc queries run on separate data in the data warehouse (right).

Data Lakehouse

While data lakes provide scalable and cost-effective storage solutions, they lack many of the capabilities of a DBMS. In particular, lacking a storage engine and a table format, they are unable to support ACID transaction guarantees. This leads to reliability issues when writing and reading data, particularly with concurrent writes. To address this critical shortcoming, the data lakehouse architecture introduces a transactional database layer, while retaining the advantages of data lake architecture, namely infinite storage and elastic compute. 

Transactional capabilities in a lakehouse are enabled by combining the storage engine component with the storage format, now more commonly referred to as the table format, which acts as an open metadata layer on top of file formats like Parquet. The transactional database layer integrates critical DBMS features such as ACID transactions, indexing, concurrency control, and table maintenance services (such as clustering, compaction, cleaning), on top of existing data lake storage. 
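As a hedged illustration of this transactional layer in action, the sketch below performs an ACID upsert through Apache Hudi's Spark datasource (one of the three formats discussed here); the table name, fields, and S3 path are assumptions for illustration, and the Hudi Spark bundle is assumed to be on the classpath.

```python
from pyspark.sql import SparkSession

# Assumes a Spark session with the Hudi Spark bundle available.
spark = (SparkSession.builder.appName("lakehouse-upsert")
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate())

updates = spark.createDataFrame(
    [(101, "alice", "2024-10-07 10:00:00"), (102, "bob", "2024-10-07 10:05:00")],
    ["user_id", "name", "updated_at"],
)

hudi_options = {
    "hoodie.table.name": "users",
    "hoodie.datasource.write.recordkey.field": "user_id",     # primary key for upserts
    "hoodie.datasource.write.precombine.field": "updated_at", # latest record wins
    "hoodie.datasource.write.operation": "upsert",            # transactional upsert on lake storage
}

(updates.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3a://my-data-lake/lakehouse/users"))
```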

Most importantly, a lakehouse architecture deconstructs the tightly coupled components of traditional data warehouses into a more modular framework. Data is stored in open file and table formats in a lakehouse, allowing different compute engines to work together on the same data, while the transactional database layer provides the robust characteristics of a conventional DBMS, ensuring data integrity and transactional consistency across diverse workloads.

Figure 6. Generic representation of a lakehouse architecture. Each component is modular, not tightly coupled to others, unlike a warehouse/OLAP system.

Unpacking a Generic Lakehouse Architecture

Before we go into these components in depth, it’s worthwhile, for the purposes of this blog, to differentiate between a data lakehouse architecture and a data lakehouse platform. We suggest the following definitions for these terms. 

Lakehouse Architecture 

A lakehouse architecture provides a reference blueprint or design pattern that shows how the different layers and components of the system interact to integrate the capabilities of a data warehouse and a data lake into a single, unified system. The lakehouse architecture consists of multiple layers, including ingestion, storage, metadata, transactional database, and consumption services, each serving a specific role in managing and processing data. Think of it as the blueprint of a building, outlining how different rooms, utilities, and structural elements will be arranged and interact; a plan for how the building can be constructed.

Lakehouse Platform

A lakehouse platform refers to the actual implementation of the lakehouse architecture. The lakehouse platform is a functional system that realizes the architecture’s blueprint by integrating various tools, technical components, and services necessary to manage, process, and analyze data. Whereas the architecture is the blueprint, the platform is the actual building, put up using specific materials and techniques. 

Components in a Lakehouse Architecture

Let’s dive deeper into each layer/component of the lakehouse architecture.

Lake Storage: This layer comprises cloud object stores (storage that holds objects: files, plus any per-file metadata) where files from various operational systems are stored after they have been ingested via ETL/ELT processes. In essence, it takes bytes, combines them into a file, and saves the file to a designated path in the file system. The lake storage layer supports storing any data type and can scale as needed.

File Format: File formats hold the actual raw data, which is stored physically in the storage layer (object stores). These formats dictate how data is organized within each file—whether in structured, semi-structured, or loosely structured/unstructured forms. They also determine how the data is arranged, i.e. either in row-based or columnar fashion.

Storage/Table Format: Table format or storage format acts as an open metadata layer over the file format layer. It provides an abstraction that separates the logical representation of files from the physical data structure. The table format defines a schema on immutable data files, enabling transactional capabilities, and providing APIs for readers and writers to interact with the data.

Storage Engine: The storage engine layer is responsible for managing the underlying data within the lakehouse architecture, ensuring that all files and data structures, such as indexes, are well-maintained, well-optimized, and up-to-date. The storage engine ensures transactional integrity by supporting core characteristics: atomicity, consistency, isolation, and durability (ACID), essential for ensuring data accuracy and reliability, especially for concurrent transactions. It also handles critical table management services such as clustering, compaction, cleaning, and indexing, in combination with the table format to optimize the data layout for improved query performance and operational efficiency.

Catalog: The catalog layer is vital for enabling efficient search and data discovery within the lakehouse architecture. It keeps track of all tables and their metadata, cataloging table names, schemas, and references to specific metadata associated with each table’s format.

Compute Engine: The compute engine layer comprises engines that are responsible for processing data, ensuring that both read and write operations are executed efficiently. It uses the primitives and APIs provided by table formats to interact with the underlying data, managing the actual processing tasks.

Now that we've defined each component and its function, we can see that the open lakehouse architecture closely resembles that of a traditional data warehouse. The key distinction, however, is that each component in a data lakehouse is open and modular, offering the flexibility to mix and match these components to meet specific use case requirements. Another important point is that the transactional database layer in a lakehouse, formed by the combination of the storage engine and the table format, is what differentiates it from a data lake architecture. In fact, the services provided by the storage engine layer are critical to making lakehouses operational and optimizing them for better query performance.

As we dive deeper into the architecture of an open lakehouse, it's essential to unpack one of its most critical components: the table format. This term, along with others such as metadata format, storage format and open lakehouse format, often leads to confusion because these concepts are not usually discussed in isolation. Understanding the nuances of table formats will equip us to address the important questions raised earlier regarding how they differ from an open lakehouse platform and whether they are all that is required to realize a truly open data architecture. In the next section, we will clarify what a table format truly entails and its role in enabling an open data architecture.

Understanding the Table Format

Contrary to what some might think, table formats are not a new concept. They are similar to what databases have referred to as storage formats for many years. Their roots trace back to the early days of relational databases, with the implementation of Edgar Codd’s relational model in systems such as Oracle. Historically, OLTP database vendors such as Oracle and Microsoft allowed users to view a dataset (files) as a structured table, managed by the database’s storage engine, with the native compute engine interacting with those underlying files. These table formats are proprietary in nature and hence closed, i.e. accessible only by the native compute engine.

Then, in the big data world, with the advent of Hadoop-based data lakes, the MapReduce framework was initially the only way to access and process the data files stored in the Hadoop Distributed File System (HDFS). However, accessing data in HDFS required writing MapReduce-specific Java code, which limited it to a small set of specialized engineers.

As a result, compute engines such as Apache Hive were created so more people, such as data analysts, could have access to data stored in the data lake. This necessitated defining a dataset’s schema and how to refer to that dataset as a table, enabling engines to interact with the schema. This gave rise to the development of the Apache Hive table format. It allowed democratization of data using Hive Query Language (HiveQL) - a query language similar to SQL. The Hive table format marked the inception of a new generation of standalone table formats.

With that historical context, let’s go a bit deeper into what a table format is. In the previous section, we defined it as an open metadata layer over the file format layer - an abstraction that separates the logical representation of files from the physical data structure. To elaborate, a table format organizes a dataset’s files to present them as a single table. This abstraction allows multiple users and tools to interact with the data concurrently, whether for writing or reading purposes. The table format includes a set of reader and writer APIs and tracks metadata information.

Figure 7. Composition of table format metadata

This is a typical composition of the table format metadata:

  • Schema information: Describes the structure of the table, including column names, data types, and any nested structures (like arrays or structs).
  • Partition information: Lists the specific values or ranges of values for each partition, to quickly identify relevant partitions during query execution.
  • Statistical information: Includes information like row count, null count, and min/max values for each column, based on the Parquet data file. This helps in query optimization by providing details that can be used to filter data efficiently.
  • Commit History: Keeps track of all changes made to the table, including inserts, updates, deletes, and schema changes. This allows for time travel queries and version rollbacks.
  • Data File Path: Lists the paths to data files that are part of the table, often with details about which partitions they belong to.
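Pulling these pieces together, the sketch below is a rough, format-agnostic picture of what such metadata might contain; every name and value is hypothetical, and each real format (Hudi, Iceberg, Delta Lake) defines its own files and layout for this information.

```python
# Hypothetical, simplified view of the metadata a table format tracks
# for one snapshot of a table. Real formats store this across their own
# metadata/log files rather than in a single structure like this.
table_metadata = {
    "schema": [
        {"name": "user_id", "type": "long"},
        {"name": "country", "type": "string"},
    ],
    "partitions": {"country": ["US", "DE", "IN"]},
    "column_stats": {
        "user_id": {"row_count": 1000000, "null_count": 0, "min": 1, "max": 1000000},
    },
    "commit_history": [
        {"commit": "20241007100000", "operation": "upsert"},
        {"commit": "20241007110000", "operation": "delete"},
    ],
    "data_files": [
        "country=US/part-0001.parquet",
        "country=DE/part-0002.parquet",
    ],
}
```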

With that thorough understanding of table formats in general, let’s switch focus to lakehouse table formats. Hudi, Iceberg, and Delta Lake are the three widely used open table formats in a lakehouse architecture. These metadata formats bring capabilities such as:

  • Support for ACID-based transactions: Table formats, in combination with the storage engine, support ACID transactions, allowing for operations (such as inserts, upserts, and deletes) to be executed reliably, and ensuring concurrent operations do not result in conflicts or corruption.
  • Schema evolution: These formats support schema evolution, allowing users to change the schema of a table over time without breaking existing queries. Changes such as adding, renaming, or removing columns can be done while maintaining compatibility with previous versions of the data.
  • Time travel & data versioning: Lakehouse table formats allow users to query historical versions of data using time travel. This means you can access the state of the data as it existed at a specific point in time, which is critical for incremental and historical queries, as well as for auditing and debugging (see the sketch after this list).
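As one example of time travel, the sketch below reads a Hudi table as of an earlier instant through Spark; the path and timestamp are hypothetical, an existing SparkSession is assumed, and other formats expose similar options (for example, Delta Lake's timestampAsOf).

```python
# Time travel: read the table as it existed at a past instant.
# Path and timestamp are hypothetical; assumes an existing SparkSession `spark`.
snapshot = (spark.read.format("hudi")
            .option("as.of.instant", "2024-10-01 00:00:00")
            .load("s3a://my-data-lake/lakehouse/users"))
snapshot.show()
```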

These capabilities are essential for running analytical workloads within a lakehouse architecture. However, the key takeaway about modern table formats is their open nature. This openness ensures that data is accessible to any engine compatible with a specific table format, setting them apart from the proprietary (closed) table formats used in vendor-specific databases and data warehouses. 

Now that we've explored the nuances of table formats and differentiated between lakehouse architectures and platforms, it is a good time to address the first question we set out to answer: What are the differences between an open table format and an open lakehouse platform? Well, an open table format represents just one standalone component within the broader context of an open lakehouse architecture (Figure 8), which also encompasses additional elements such as file formats, storage engines, compute engines, and catalogs. This component primarily serves as a metadata layer situated atop the actual data files, facilitating an independent and open data tier that is accessible to various compute engines.

Figure 8. The table format is one standalone component in the data lakehouse architecture

On the other hand, an open lakehouse platform integrates every component—storage, file format, table format, storage engine, compute, and catalog—into a cohesive system. Because these components are open and modular, the platform offers the flexibility to switch between various options depending on specific use cases. 

Requirements for an Open Data Architecture

The reason we had to go a little deeper into the distinction between open table formats and open lakehouse platforms is that there's a prevailing notion suggesting that adopting an open table format automatically ensures openness and interoperability within data architectures. While these table formats have laid the foundation for an open ecosystem, it would not be right to simply claim that this is all that is required to achieve a truly open data architecture. There are several important factors that must also be taken into account. Here are some practical considerations.

  • Open Standards for Components: For a truly open data architecture, the technical components themselves must be open and interoperable. Adopting open standards does not necessarily mean using self-managed open source solutions. Many organizations, due to high engineering costs and maintenance requirements, will opt for vendor-managed solutions for their compute and storage needs, which makes sense. However, when selecting a vendor platform, it is crucial to ensure that there is no platform lock-in and that data remains accessible from multiple compute engines. This flexibility allows for mixing and matching “best of breed” solutions and the possibility of switching vendors in the future if needed.
  • Interoperability between Table Formats: Open table formats have provided the flexibility to build open data architectures, yet choosing the right format remains a challenging decision due to each format’s distinct advantages. For example, Hudi excels in update-heavy environments and integrates well with Spark, Flink, Presto, and Trino. Apache Iceberg is optimized for read operations, and Delta Lake is primarily utilized within the Databricks ecosystem with Spark. Additionally, as lakehouse systems evolve to support newer workloads that utilize multiple table formats, organizations require universal data accessibility. This means the data must be write-once, query-anywhere, across any compute platform. This is where open source projects such as Apache XTable (incubating) offer interoperability among various table formats through lightweight metadata translation. 
  • Interoperability Between Catalogs: While open table formats have broadened access to data, the data catalog is another component in the lakehouse architecture that is often controlled by vendors. Vendor platforms today require the use of a proprietary catalog to leverage full support for these open table formats, creating a new form of lock-in. This dependency restricts interoperability with other query engines not supported by the specific platform, compelling organizations to keep their data management confined to a single vendor’s ecosystem. There are also situations where different teams within an organization use distinct catalogs that follow the same specifications (e.g. Iceberg REST) and standards (such as using the same APIs). However, despite uniformity at some levels, there isn't a straightforward method to synchronize data between these catalogs without having to recreate or migrate the tables based on the metadata. Note that methods such as register_table in Apache Iceberg warn against registering in more than one catalog, as doing so could lead to missing updates, table corruption, and loss of data. Also, while this process may be manageable for smaller datasets, it is impractical at scale. Since each catalog has its own access control policies, which do not apply to other catalogs, it also prevents universal data access. Such scenarios underscore the need for open, interoperable catalogs that can bridge different platforms.
  • Open Platform and Table Services: In a lakehouse architecture, table management services such as compaction, clustering, indexing, and cleaning are essential for optimizing data layouts for efficient query processing. Complementing these, platform services provide specific functionality for data workflows and interface with data writers and readers, including tools for ingestion, import/export of open tables, and catalog syncing. Traditionally, customers have relied on closed, proprietary platforms for these functions, which limits flexibility. To truly take advantage of an open data architecture, there is a pressing need for these services to be open and interoperable, allowing users to manage data sources and maintain optimal table performance without being tethered to vendor-specific solutions. This openness ensures that data architectures can remain flexible, adapting to emerging needs without vendor lock-in.

Now, coming to our question - Is an open table format enough to realize a truly open data architecture?

Figure 9. Points of lock-in that remain even after replacing a proprietary storage format with an open table format.

To summarize, merely substituting the proprietary table format in a closed data architecture/platform with an open table format such as Hudi, Iceberg, or Delta Lake does not make for a fully open data architecture. An open data architecture requires ensuring that all components, including open table formats, catalogs, and table management services (part of the storage engine) are open and, most importantly, interoperable. A fully open approach allows organizations to seamlessly switch between specific components or the overall platform—whether it's a vendor-managed or self-managed open source solution—as new requirements emerge. This flexibility prevents new forms of lock-in and supports universal data accessibility.

Reference Implementation for An Open Data Lakehouse Platform 

This brings us to the final part of this blog. Given that we have spent a fair amount of time understanding the terminology around the lakehouse architecture, it makes sense to bring all of this together and see what an open and interoperable lakehouse platform looks like. We will use Hudi’s lakehouse platform as an example. However, this discussion is not limited to open source solutions. A vendor-managed lakehouse platform or another open source solution can also fit here, as long as the platform aligns with the four requirements listed above.

While many perceive Hudi as merely an open table format, this is a misconception. In reality, Hudi is also a comprehensive open lakehouse platform that integrates unique storage engine capabilities with a robust table format atop open file formats such as Apache Parquet. Unlike standalone table formats, Hudi offers native capabilities, including tools and utilities for data ingestion, recovery, rollback, and compliance tracking. This makes Hudi more than just a component of the lakehouse architecture - that is, more than just a table format; it is a robust platform that supports a broad range of analytical workloads.

Figure 10. Apache Hudi’s lakehouse platform

Below is a reference diagram of a typical database management system, showing how Hudi constitutes the bottom half of a database optimized for lake environments. Above Hudi, multiple clients are positioned, showing Hudi's integration with a variety of analytical tools.

Figure 11. Reference diagram highlighting Hudi components that are existing (green)
or planned or proposed (yellow), along with external components (blue).

Hudi's transactional layer functions like a database kernel, managing the file layout and schema through its table format and tracking changes with its timeline. This timeline (part of the Log Manager) acts as a critical event log that records all table actions in sequence, enabling features such as time travel, version rollbacks, and incremental processing. In terms of performance optimization, Hudi utilizes a variety of indexes (part of the Access Methods block) to minimize I/O and enhance query speeds; these include file listings, Bloom filters, and record-level indexes, which together facilitate faster writes and efficient query planning. Hudi’s built-in open table services (part of the Replication & Loading services)—such as clustering, compaction, and cleaning—are designed to keep the table storage layout performant, capable of running in inline, semi-asynchronous, or fully asynchronous modes, depending on operational requirements. 
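For instance, a writer can ask for these services to run inline with each commit or be deferred to asynchronous jobs via configuration along the lines sketched below; the keys shown are a small illustrative subset of Hudi's configuration surface, and the values are assumptions rather than recommendations.

```python
# Illustrative writer-side toggles for Hudi table services.
# "Inline" means the service runs as part of the write; setting it to
# false defers the work to separately scheduled/asynchronous jobs.
table_service_options = {
    "hoodie.clean.automatic": "true",          # clean up older file versions
    "hoodie.compact.inline": "false",          # do not compact inside the write
    "hoodie.compact.schedule.inline": "true",  # but still schedule compaction plans
    "hoodie.clustering.inline": "false",       # run clustering as a separate job
}
```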

To handle multiple concurrent workloads, Hudi uses several concurrency control techniques (part of the Lock Manager): multi-version concurrency control (MVCC) between writers and table services, and among table services themselves; optimistic concurrency control (OCC) for multiple writers; and non-blocking concurrency control (NBCC), which allows multiple writers to proceed without conflicting with one another. There is also work in progress on integrating a multi-tenant caching layer (part of the Buffer Manager) with Hudi’s transactional storage to balance fast data writing with optimal query performance, enabling shared caching across query engines and reducing storage access costs.

Moreover, Hudi integrates platform services into its open stack that manage tasks such as ingestion, catalog syncing, data quality checks, and table exports. Hudi Streamer, a natively available and commonly used ingestion tool within the Hudi stack, seamlessly integrates with Kafka streams and supports automatic checkpoint management, schema registry integration, and data deduplication. It also facilitates backfills, continuous mode operations with Spark/Flink, and incremental exports. Additionally, Hudi provides various catalog synchronization tools that sync Hudi tables with diverse catalogs such as Hive Metastore, AWS Glue, and Google BigQuery.

Figure 12 shows a reference implementation of an open lakehouse architecture using Hudi’s platform.

Figure 12. Reference implementation of an open lakehouse platform

Starting from the ingestion side, Hudi Streamer ingests data from various sources such as Kafka streams, databases, or file systems into Hudi. In this architecture, Amazon S3 serves as the data lake storage, holding Parquet data files where records are written. 
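Hudi Streamer itself ships as a Spark utility; as a rough Python analogue of that ingestion path, a Spark Structured Streaming job can move Kafka records into a Hudi table on S3. The broker, topic, fields, and paths below are hypothetical.

```python
# Hypothetical Kafka -> Hudi ingestion with Spark Structured Streaming.
raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "orders")
       .load())

orders = raw.selectExpr("CAST(key AS STRING) AS order_id",
                        "CAST(value AS STRING) AS payload",
                        "timestamp AS ingested_at")

(orders.writeStream.format("hudi")
    .options(**{
        "hoodie.table.name": "orders",
        "hoodie.datasource.write.recordkey.field": "order_id",
        "hoodie.datasource.write.precombine.field": "ingested_at",
    })
    .option("checkpointLocation", "s3a://my-data-lake/checkpoints/orders")
    .outputMode("append")
    .start("s3a://my-data-lake/lakehouse/orders"))
```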

Apache Hudi’s table format is used as the primary choice of open table format here, but the architecture also incorporates XTable to ensure interoperability between different formats. This flexibility allows users to switch between formats and read specific table format metadata with their choice of compute engine. This way, they are not forced to stick to a specific table format or compute engine. 
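Because XTable translates only table metadata, the same underlying Parquet files can then be read through more than one format's reader. A hedged sketch, assuming the target-format metadata has already been generated under the same base path and that the corresponding readers (for example, Delta Lake's) are installed:

```python
# After XTable has generated target-format metadata for this base path,
# different table-format readers can query the same underlying Parquet files.
base_path = "s3a://my-data-lake/lakehouse/orders"  # hypothetical

hudi_view = spark.read.format("hudi").load(base_path)    # native Hudi metadata
delta_view = spark.read.format("delta").load(base_path)  # translated Delta log

print(hudi_view.count(), delta_view.count())  # same data, two format views
```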

Hudi’s storage engine takes care of table management services such as clustering, compaction, and cleaning older versions of data, while keeping storage costs in check. Hudi’s platform services include catalog sync tools that make the data available across catalogs such as AWS Glue and Hive, enabling querying by interactive engines such as Trino and Presto. Once the data is available for querying, it can be consumed by the various tools in the analytics layer to run analytical workloads ranging from BI to machine learning.
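For example, Hudi's writer can sync a table to a Hive Metastore or AWS Glue catalog as part of each commit through hive-sync options along the lines below; the keys are a small subset and the values are illustrative.

```python
# Illustrative catalog-sync options added to a Hudi write, so the table
# becomes discoverable through the catalog by engines such as Trino and Presto.
catalog_sync_options = {
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.mode": "hms",          # sync via the metastore service
    "hoodie.datasource.hive_sync.database": "analytics",
    "hoodie.datasource.hive_sync.table": "orders",
}
```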

At this point we can answer our final question - How seamlessly can we move across different platforms?

An important aspect of our reference architecture is that all components and services are open, modular, and interoperable. This flexibility ensures that if there is a need to switch or move between platforms in the future, based on specific use cases, there are no inherent restrictions. Users in this architecture can select the tools and services that best fit their needs, ensuring optimal alignment with their operational goals. By committing to open standards and ensuring interoperability across critical components (data file formats, data table formats, data catalogs), this approach lays a solid foundation for a truly open data architecture.

Recap

As we reflect on the questions posed above, it becomes clear that simply adopting an open table format does not guarantee overall openness. True openness hinges on a commitment to open standards, interoperability, and comprehensive platform services that extend across table formats. We learned that, while open table formats provide foundational elements for building open architectures, an open lakehouse platform goes beyond just the metadata layer to encompass things like ingestion, the compute engine, the storage engine, and catalogs—all of which need to be open to make data accessible across multiple environments, not just restricted to a single vendor's platform. The flexibility to move between different platforms without disruption is crucial for modern data architectures. This is why interoperability within the various components of a lakehouse architecture plays a key role.


