March 12, 2025

The Open Table Format War: Merely a Battle on the Path to Engineering a Truly Open Data Platform

Written by:

Pauline Brown

The data lakehouse landscape is evolving past the tired debate over format supremacy. Many recent articles have argued that the larger war is between open and closed platforms. From the fallout, a new paradigm is emerging: the table format-agnostic, or open and unbundled, data stack. This approach transcends the traditional competition between Apache Hudi, Apache Iceberg, and Delta Lake and acknowledges a fundamental reality: while some organizations operate successfully on a single table format, each format brings capabilities that excel for different workloads. Organizations should be free to select the best format for each use case without being locked into a single solution. Different teams within the same company gravitate toward different solutions: the real-time streaming team might swear by Hudi, the data science group might champion Delta Lake, and the analytics team might rely heavily on Iceberg. The goal isn't to crown a winner (let's be honest, that's never going to happen), but to break down the silos these format choices inadvertently create.

The End of Forced Standardization

Organizations have long felt pressured to standardize on a single table format, pushed by vendor narratives along the lines of “pick our standard, and it will solve all your problems”. This has sparked endless debates between engineering teams about whether to choose Apache Hudi, Delta Lake, or Apache Iceberg. These "table wars" have led to suboptimal compromises, forcing teams to adapt their workflows to formats that weren't designed for their specific use cases. The real cost? Innovation paralysis, expensive cloud data bills, and technical debt.

These widespread format compatibility issues originally prompted us to create XTable at Onehouse, directly addressing challenges repeatedly voiced by our customers and open source users. Despite some skepticism about unified solutions, we built what many considered impossible. The project has since evolved into Apache XTable (Incubating), validating our approach and helping shift the industry perspective beyond format competition. This evolution, alongside parallel developments like Databricks UniForm, has fostered unexpected but healthy collaboration across previously rigid format boundaries.

The Power of Format Diversity

Modern data platforms demand flexibility. Apache Hudi's strength in near real-time ingestion and record-level updates makes it invaluable for teams processing millions of real-time events while maintaining historical accuracy. Delta Lake's seamless integration with Apache Spark serves teams running complex machine learning pipelines that combine historical and real-time data. Apache Iceberg's integration into cloud data warehouses offers an easy, performant path for teams working with data models and complex analytical queries.

The reality? Different teams within your organization have distinct needs better served by different formats. Instead of fighting this diversity, forward-thinking organizations are embracing it.

Architecting the Future

Building a table format-agnostic platform requires rethinking traditional data architectures. The foundation starts with a unified metadata layer that abstracts away format complexity from end users, on top of open file formats like Apache Parquet that can be read by all major compute engines. This architecture empowers organizations to select the ideal processing engine for each specific workload, whether Spark for batch processing, Flink for streaming, or Presto for interactive queries, without table format compatibility limitations dictating technology choices.
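To make the idea of a unified metadata layer concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not how any particular catalog is implemented: `LogicalCatalog`, `PhysicalTable`, the dataset name, and the bucket path are all hypothetical. It shows consumers asking for a logical dataset by name while the catalog decides which table format to hand back, based on what the calling engine can read.

```python
from dataclasses import dataclass
from enum import Enum


class TableFormat(Enum):
    HUDI = "hudi"
    DELTA = "delta"
    ICEBERG = "iceberg"


@dataclass(frozen=True)
class PhysicalTable:
    base_path: str      # where the shared Parquet data files live
    formats: frozenset  # table formats whose metadata is available at base_path


class LogicalCatalog:
    """Hypothetical catalog mapping logical dataset names to physical tables."""

    def __init__(self):
        self._tables = {}

    def register(self, name, base_path, formats):
        self._tables[name] = PhysicalTable(base_path, frozenset(formats))

    def resolve(self, name, engine_formats):
        """Return (base_path, format) for the first format the engine prefers."""
        table = self._tables[name]
        for fmt in engine_formats:
            if fmt in table.formats:
                return table.base_path, fmt
        raise LookupError(f"{name!r} is not exposed in any format this engine reads")


catalog = LogicalCatalog()
catalog.register(
    "sales.orders",
    "s3://example-bucket/lake/orders",  # hypothetical location
    [TableFormat.HUDI, TableFormat.ICEBERG],
)

# An engine that prefers Delta but can also read Iceberg gets the Iceberg view.
path, fmt = catalog.resolve("sales.orders", [TableFormat.DELTA, TableFormat.ICEBERG])
print(path, fmt.value)
```

The point of the design is that the consumer only ever names `sales.orders`; which format metadata sits next to the Parquet files is a detail the catalog owns.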

With table format compatibility barriers removed, the real innovation shifts to workload-specific engine selection and optimization. Ideally, a data platform lets organizations deliberately write and read each dataset in the format best suited to its access patterns and requirements, making the most of each format's strengths. Streaming event data might land in Hudi tables optimized for rapid ingestion, while transformed aggregates flow into Iceberg tables optimized for complex analytical queries. This happens transparently to end users, who interact with logical datasets rather than table format details.
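As a rough sketch of what choosing a format by access pattern could look like, the Python below encodes the rules of thumb from this article as a small function. The `Workload` fields and the `choose_format` helper are hypothetical names invented for illustration, and the rules are deliberately simplistic; a real platform would weigh many more signals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical description of how a dataset is written and read."""
    name: str
    streaming_ingest: bool = False
    record_level_updates: bool = False
    spark_ml_pipeline: bool = False
    warehouse_analytics: bool = False


def choose_format(workload):
    """Pick a write format from access patterns; None means no strong signal."""
    if workload.streaming_ingest or workload.record_level_updates:
        return "HUDI"     # near real-time ingestion and record-level updates
    if workload.spark_ml_pipeline:
        return "DELTA"    # Spark-centric machine learning pipelines
    if workload.warehouse_analytics:
        return "ICEBERG"  # analytical queries from cloud data warehouses
    return None           # keep whatever format the team already uses


events = Workload("clickstream_events", streaming_ingest=True)
aggregates = Workload("daily_revenue", warehouse_analytics=True)

print(choose_format(events))      # HUDI
print(choose_format(aggregates))  # ICEBERG
```

Returning `None` when nothing matches is a deliberate choice: the sketch avoids crowning a default format, in keeping with the argument above.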

Cross-format transaction management, enabled by projects like Apache XTable, maintains data integrity across different table formats by coordinating writes, preventing conflicts, and ensuring all systems reflect consistent information regardless of the underlying format. This enables advanced use cases like real-time customer 360 views, where different aspects of both static and dynamic customer data might reside in multiple table formats, yet applications receive a unified and consistent view.
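For a sense of what this looks like in practice, XTable works by translating table metadata from one format into others over the same data files, driven by a small dataset config. The Python sketch below renders such a config as text. The key names (`sourceFormat`, `targetFormats`, `datasets`, `tableBasePath`, `tableName`) follow the dataset config shown in the Apache XTable documentation as best recalled here, so check them against the version you run; the `render_xtable_config` helper and the bucket path are hypothetical.

```python
def render_xtable_config(source_format, target_formats, tables):
    """Render an XTable-style dataset config as YAML text.

    `tables` is a list of (base_path, table_name) pairs. Key names are an
    assumption based on the XTable docs; verify before relying on them.
    """
    lines = [f"sourceFormat: {source_format}", "targetFormats:"]
    lines += [f"  - {fmt}" for fmt in target_formats]
    lines.append("datasets:")
    for base_path, table_name in tables:
        lines.append(f"  - tableBasePath: {base_path}")
        lines.append(f"    tableName: {table_name}")
    return "\n".join(lines) + "\n"


config_text = render_xtable_config(
    "HUDI",
    ["DELTA", "ICEBERG"],
    [("s3://example-bucket/lake/orders", "orders")],  # hypothetical table
)
print(config_text)
```

The resulting file would then be passed to the XTable sync utility, which writes Delta Lake and Iceberg metadata alongside the existing Hudi table without copying the underlying Parquet files.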

Implementation within the Modern Enterprise

Success with a table format-agnostic approach demands thoughtful architecture that serves both builders and consumers. Data engineers need robust tools for managing schema evolution and ensuring data consistency. Data scientists and analysts require simple, consistent interfaces for accessing data quickly, regardless of its storage format.

Performance optimization in this environment becomes more nuanced. Engineers must carefully consider partition alignment, file sizes and layouts, and caching strategies that work effectively across formats. However, these technical considerations should remain invisible to end users, who experience consistent performance regardless of the underlying table format.

Data governance takes center stage in an unbundled platform’s stack. Organizations need unified security controls, audit logging, and lineage tracking that work consistently across formats, ensuring compliance without creating additional burden for data consumers.

At Onehouse, we're tackling this challenge head-on with a fresh perspective. Building on Apache Hudi's foundation alongside technologies like Apache XTable, Onehouse Compute Runtime (OCR), and Lakehouse Table Optimizer, we're delivering a practical solution to a problem that's been frustrating data teams for years. XTable serves as a universal translator for table formats, enabling seamless interoperability while preserving each format's unique capabilities. OCR is a high-performance data lakehouse runtime that accelerates all core lakehouse workloads to deliver significantly improved performance without specialized tuning requirements. Paired with Table Optimizer's centralized capabilities that dramatically accelerate performance through automated clustering and compaction while optimizing data layout and cleaning, this ecosystem represents the future of data infrastructure: one where teams achieve superior performance without wasting resources on unnecessary storage or processing, regardless of their chosen format.

Looking Ahead

The future of data lakehouses isn't about choosing sides in format wars – it's about building flexible, inclusive platforms that empower every team to work with data efficiently. By embracing a table format-agnostic approach, organizations can create data environments that truly serve both builders and consumers, leading to more innovative and successful data initiatives.

The key lies in shifting focus from underlying formats to the value they provide. When implemented thoughtfully, an unbundled, table format-agnostic stack gives engineers the flexibility to build robust data pipelines while ensuring data scientists and analysts can seamlessly access and analyze data without getting caught in technical details. This is the true promise of the modern data lakehouse: empowerment through choice rather than restriction through standardization.

Ready to explore how your organization can move beyond format wars and build a truly unified data platform? Visit onehouse.ai to learn more about our innovative approach to the modern data stack.
