March 4, 2024

What is a Data Lakehouse?

The need to analyze organizational data has deep roots. Thousands of years ago, writing was developed, largely as a way to keep track of resources and transactions. With increasing computing power and storage capacity, the relational database management system (RDBMS) and structured query language (SQL) were born in the 1970s. 

This led to successive waves of innovation, most prominently in the form of the data warehouse; NoSQL databases; the data lake; and now the data lakehouse. 

But just what is that last, most recent innovation? The data lakehouse is an important new paradigm in data architecture - so, having been present at the creation, we here at Onehouse would like to spell out our take on just what it is. 

The Data Warehouse Leads the Way

For decades, beginning in the 1980s, data used for analytics mostly lived inside of data warehouses. Think IBM, Oracle, Teradata, and so on. A data warehouse took structured data from operational systems and other sources, enriched it to make it more useful, reformatted it for easier querying, and made it available for analytics. 

These systems were great for the applications that they powered, but there were many limitations:

  • Data warehouses required complex hardware and software infrastructure to deliver reasonable performance, given the state of the art at the time. 
  • Buying a data warehousing system was difficult and running it was expensive, requiring some combination of internal expertise, consulting help, and paid support. 
  • Only structured data was supported. Semi-structured and unstructured data weren’t included. 
  • Traditional data warehouses scaled up, not out. Scaling meant buying a single, more powerful machine, specialized caching hardware, and so on - rather than spreading the load across many industry-standard servers. 
  • Data often lived in proprietary formats, so switching to a different vendor was either very expensive, or more or less impossible. 
  • Integrations, such as support for a new ETL tool or downstream analytics options, were developed and delivered at the whim of individual vendors.

With all this complexity and proprietary control, the data warehouse was only available to big companies making big investments. But the arrival of the cloud brought new approaches. 

Amazon Redshift brought the data warehouse to the cloud, and Snowflake brought new features, including the separation of storage and compute (originally, both from a single vendor). This allowed compute power to be turned up and down as needed, saving money between peaks. 

With the cloud, companies no longer had to buy their own hardware - a capital expense, known as CapEx. Instead, “pay as you go” became the norm - an operational expense, known as OpEx, which companies tend to prefer. And scaling up was out; scaling out was in. 

But a data warehouse was still a data warehouse - performant for specific use cases, but expensive to scale, and built only for highly structured data.

The Data Lake Offers an Alternative

In the mid-2010s, engineering leaders began to search for, and build, alternatives to the data warehouse. They wanted a centralized repository that could store a variety of data and make it available to multiple downstream tools. Early digital natives were maturing, and they were no longer interested in signing checks with more and more zeros for rigid data warehouses, even cloudy ones.

Data lakes arrived, based on open source standards, not proprietary technology like the data warehouse. Data lakes provided cheap, flexible storage for voluminous structured, semi-structured, and unstructured data. They continue to serve many purposes, including powering today’s revolution in machine learning and AI. 

However, data lakes lack key features of data warehouses, such as structure and easy accessibility. They are generally designed for immutable data - data that cannot easily or efficiently be updated or deleted. 

Many organizations today run two parallel data infrastructures: one for structured data, based on the data warehouse, and another for less-structured data, based on the data lake. These two largely separate systems require a great deal of data interchange between them, leading to slow updates, complex pipelines, duplicate data, and operational headaches. (Our blog post, It’s Time for the Universal Data Lakehouse, describes these parallel, but interconnected, systems.) 

Figure 1: Many organizations run data warehouse and data lake silos in parallel. 

Introducing the Data Lakehouse

Having built these powerful, but complex systems, data architects took a deep breath and considered what they had created so far. The challenge was to find a way to get the best of both worlds - to add the query responsiveness and updatability of the data warehouse to the low cost and flexibility of the data lake. 

A successful effort to bring these two kinds of systems together led to the development of Apache Hudi, the first data lakehouse, at Uber in 2016. 

While there are various flavors of data lakehouses - for example, some have built-in query engines, while others do not - they share several core technical factors, drawn from previous technologies: 

  • Open standards (data lake): A data lakehouse should be as open as possible, either using open data formats, software, and services (the preferred approach), or at least having open alternatives readily available when proprietary approaches are used. 
  • Scalability and flexibility (data lake): A data lakehouse should be able to scale to support massive volumes of data, and a variety of data, often independently scaling storage and compute.
  • Updatability (data warehouse): While a data warehouse allows updates to individual records, a data lake only allows you to add and delete entire files. The lakehouse can update records within files, without proprietary software. 
  • Performance (best of both): Adapting to massive scale and variety of data should not require a significant tradeoff in performance. The data lakehouse is not yet as fast as a data warehouse, but its performance is sufficient for a wide range of use cases. 
  • Affordability (data lake): This is a business requirement and a technical requirement in one, since low cost enables technical innovation (as in AI/ML). Organizations adopt the lakehouse to avoid paying outsized fees to data warehouse vendors. 

A data lakehouse is not a straight-up replacement for a data warehouse. But it can handle many - often, the majority - of the tasks that were previously handled by a data warehouse. In particular, the data lakehouse can easily handle the first two stages of a medallion architecture:

  • The bronze layer contains unaltered copies of ingested data, for governance and further processing.
  • The silver layer contains ingested data that has been deduped, cleansed, and otherwise prepared for queries and/or further processing (a brief sketch of this bronze-to-silver step follows the list).
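
To make the bronze-to-silver step concrete, here is a minimal PySpark sketch. It is an illustration under assumptions, not a prescribed implementation: the bucket paths, the orders table, and the order_id and amount columns are all hypothetical, and a real pipeline would typically write the silver layer as a lakehouse table (as in the upsert sketch further below) rather than plain Parquet files.

```python
# Minimal sketch: promote raw "bronze" records to a deduplicated, cleansed
# "silver" layer. Paths, table, and column names are illustrative only.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("bronze-to-silver").getOrCreate()

# Bronze: ingested data kept as-is (here, raw JSON landed on object storage).
bronze = spark.read.json("s3a://lake/bronze/orders/")

# Silver: dedupe on the business key, drop malformed rows, normalize types.
silver = (
    bronze
    .dropDuplicates(["order_id"])                  # keep one row per order
    .filter(F.col("amount").isNotNull())           # discard incomplete records
    .withColumn("amount", F.col("amount").cast("double"))
)

# Write the silver layer in an open columnar format for downstream engines.
silver.write.mode("overwrite").parquet("s3a://lake/silver/orders/")
```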

Further processing can then be applied using compute engines such as BigQuery, Databricks, and Snowflake; a broad range of query engines; and Python and other code for the models used in machine learning and AI. Where expensive proprietary solutions are applied, they can work on a smaller, focused dataset, and the cost can be borne by the specific use case that requires the proprietary solution. 
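
As one concrete illustration of that downstream access - a sketch under the same hypothetical paths and columns as above, with region as an assumed field - any SQL-capable engine can query the silver layer directly. Spark SQL is shown here; warehouse engines such as BigQuery or Snowflake would instead attach to the same files through their own external-table mechanisms.

```python
# Query the hypothetical silver layer from the previous sketch with Spark SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("silver-query").getOrCreate()

silver = spark.read.parquet("s3a://lake/silver/orders/")
silver.createOrReplaceTempView("orders_silver")

# `region` and `amount` are assumed columns in the illustrative orders data.
spark.sql("""
    SELECT region, SUM(amount) AS revenue
    FROM orders_silver
    GROUP BY region
    ORDER BY revenue DESC
""").show()
```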

Figure 2: The universal data lakehouse architecture eliminates data silos. 

A well-designed data infrastructure that uses a data lakehouse in a central role, with one or more flavors of data warehouse put to work for specific use cases, has many advantages over the parallel data warehouse / data lake architecture in widespread use today. 

Note: The core storage in a data lakehouse is still the data lake, living on cheap object storage such as Amazon S3, Azure Blob Storage, or Google Cloud Storage. This storage lives securely in the user’s virtual private cloud (VPC). The lakehouse adds a layer of metadata around the data lake to support upserts - updates to existing data records and inserts of new records - and deletes. Some lakehouses are more efficient than others at processing upserts and deletes, and at querying tables that have had upserts and deletes applied to them. 
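
To make the upsert point concrete, here is a minimal PySpark sketch of a record-level upsert into an Apache Hudi table on object storage. The table name, bucket path, and the order_id / updated_at columns are illustrative assumptions, and the snippet presumes a Spark environment with the Hudi Spark bundle on the classpath.

```python
# Minimal sketch: upsert a changed record into a Hudi table on object storage.
# Table name, path, and columns are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-upsert").getOrCreate()

# A single changed record arriving from an upstream system.
updates = spark.createDataFrame(
    [("order-123", "2024-03-01T12:00:00Z", 42.50, "shipped")],
    ["order_id", "updated_at", "amount", "status"],
)

hudi_options = {
    "hoodie.table.name": "orders_silver",
    "hoodie.datasource.write.recordkey.field": "order_id",     # record identity
    "hoodie.datasource.write.precombine.field": "updated_at",  # newest version wins
    "hoodie.datasource.write.operation": "upsert",
}

# Rows with an existing order_id are updated; new order_ids are inserted.
# No proprietary engine is required for this record-level change.
(updates.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3a://lake/silver/orders_hudi/"))
```

Existing rows with the same record key are replaced by the newer version (chosen via the precombine field); rows with new keys are simply inserted.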

What is the Universal Data Lakehouse?

We’ve explained what the data lakehouse is - and we know that understanding it may have taken some work. So why introduce yet another concept, the universal data lakehouse (UDL)?

The universal data lakehouse is a kind of Platonic ideal - a data architecture that has all the advantages, and none of the defects, of the data warehouse and data lake approaches that preceded it. But different versions of the data lakehouse have different advantages and trade-offs. The UDL helps us articulate what’s most important to someone using a version of the data lakehouse to achieve their specific goals. 

The three major data lakehouse projects, in order of creation, are Apache Hudi, Apache Iceberg, and Delta Lake. Architects also create their own lakehouse-type systems on top of data lakes, with or without the use of code from the three established projects. 

Note: Confusingly, the name “data lakehouse” was popularized by Databricks, the creators of Delta (their proprietary system) and Delta Lake (the open source version), beginning in 2020, but it applies fully to all three major lakehouse projects. 

We here at Onehouse build on Apache Hudi, and contribute strongly to it, so we do have an interest in which data lakehouse is seen as superior. With that in mind, we have done a detailed analysis of the three data lakehouses, available as a widely read blog post and as a white paper. Here’s a brief summary of today’s realities, as we see them:

  • Apache Hudi: Highly performant, especially for upserts and queries, and very open, with a broad range of users and project contributors. Hudi has open data formats and table formats, but also a very strong set of open services. Hudi has been evaluated and chosen by demanding organizations, including Walmart and many others. 
  • Apache Iceberg: Less performant, especially for upserts, but very open at the data format and table format level, with a broad range of users and project contributors; less open and less robust for services. Apache Iceberg has been chosen as an interchange format for Snowflake’s proprietary internal data storage system. 
  • Delta and Delta Lake: The internal, proprietary Delta project is highly performant, especially for upserts, and has a wide range of services. Delta Lake is less open and often lags noticeably behind the Delta project.  

For most purposes, we here at Onehouse see Apache Hudi as the lakehouse project that most closely approaches the UDL ideal. However, companies that have chosen to work with Snowflake, and other vendors that use Iceberg - or that have chosen to work with Databricks, and their Delta / Delta Lake combination - often find it convenient to use Iceberg or Delta Lake, respectively, for their internal projects, with varied success in approaching the UDL ideal. 

Two relatively new realities, however, militate against this seemingly comfortable, “dance with who brung ya” approach:

  • More and more companies use both Databricks (more for ML/AI) and Snowflake (more as a traditional data warehouse); in one survey, more than 40% of the customers of each company also used the other. 
  • OneTable, introduced by Onehouse in 2023, is an open source interoperability platform that reads and writes flexibly across all three projects. 

The universal data lakehouse architecture, as described in our blog post, encourages you to use your data lakehouse as a “single source of truth” for data that can then be processed by any query or compute engine, prominently including Databricks and Snowflake. 

So What’s a Data Architect or Data Engineer to Do?

Decisions are hard. Here at Onehouse, we seek to keep you “in choice” at all times:

  • We continue to help drive Hudi and OneTable to new heights of both functionality and openness. 
  • We use both Hudi and OneTable as foundational elements of our Onehouse managed service, which gets you a lakehouse asymptotically approaching the UDL ideal very quickly and easily. 
  • We avoid artificial “lock-in” to either Hudi or the Onehouse managed service, so you always have a full range of options. 
  • We are inviting a widening circle of industry players to consider a similar open, “best of all worlds” approach, centered around OneTable, with increasing success. 

If you want to do your own internal evaluation and make a strategic choice for the long term, the rigorous approach used by Walmart may be instructive. If you have a more specific use case, and want to go through a standard proof of concept (PoC) process, we encourage you to include Onehouse. 

In either case, we invite you to read more in our universal data lakehouse white paper or try Onehouse for free.
