The need to analyze organizational data has deep roots. Thousands of years ago, writing was developed largely as a way to keep track of resources and transactions. With increasing computing power and storage capacity, the relational database management system (RDBMS) and Structured Query Language (SQL) were born in the 1970s.
This led to successive waves of innovation, most prominently the data warehouse, NoSQL databases, the data lake, and now the data lakehouse.
But just what is that most recent innovation? The data lakehouse is an important new paradigm in data architecture - so, having been present at the creation, we here at Onehouse would like to spell out our take on just what it is.
For decades, beginning in the 1980s, data used for analytics mostly lived inside of data warehouses. Think IBM, Oracle, Teradata, and so on. A data warehouse took structured data from operational systems and other sources, enriched it to make it more useful, reformatted it for easier querying, and made it available for analytics.
These systems were great for the applications that they powered, but there were many limitations:
- Proprietary formats and technology, locking customers into a single vendor
- Storage and compute that scaled together, making growth expensive
- Support for highly structured data only
- Large up-front investments in hardware and licenses (CapEx)
With all this complexity and proprietary control, the data warehouse was only available to big companies making big investments. But the arrival of the cloud brought new approaches.
Amazon Redshift brought the data warehouse to the cloud, and Snowflake brought new features, including the separation of storage and compute (originally, both from a single vendor). This allowed compute power to be turned up and down as needed, saving money between peaks.
With the cloud, companies no longer had to buy their own hardware - a capital expense, known as CapEx. Instead, “pay as you go” became the norm - an operational expense, known as OpEx, which companies tend to prefer. And scaling up was out; scaling out was in.
But a data warehouse was still a data warehouse - performant for specific use cases, but expensive to scale, and built only for highly structured data.
In the mid-2010s, engineering leaders began to search for, and build, alternatives to the data warehouse. They wanted a centralized repository that could store a variety of data and make it available to multiple downstream tools. Early digital natives were maturing, and they were no longer interested in signing checks with more and more zeros for rigid data warehouses, even cloudy ones.
Data lakes arrived, based on open source standards, not proprietary technology like the data warehouse. Data lakes provided cheap, flexible storage for voluminous structured, semi-structured, and unstructured data. They continue to serve many purposes, including powering today’s revolution in machine learning and AI.
However, data lakes lack key features of data warehouses, such as structure and easy accessibility. They are generally designed for immutable data - data that cannot easily or efficiently be updated or deleted.
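To make that limitation concrete, here is a minimal sketch, assuming PySpark and a hypothetical Parquet dataset on object storage, of what a single-record "update" looks like without a lakehouse layer: the data has to be rewritten, not updated in place.

```python
# A minimal sketch, assuming PySpark; the bucket paths, columns, and record
# values are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("plain-parquet-update").getOrCreate()

# Plain Parquet files on object storage have no record-level UPDATE or DELETE,
# so a one-record change means rewriting whole files.
source_path = "s3://example-bucket/raw/events"
rewritten_path = "s3://example-bucket/raw/events_corrected"

events = spark.read.parquet(source_path)

# "Update" a single record by recomputing the column for every row...
corrected = events.withColumn(
    "status",
    F.when(F.col("event_id") == "evt-123", F.lit("refunded"))
     .otherwise(F.col("status")),
)

# ...and then writing the dataset (or at least entire partitions) out again,
# even though only one record changed; the old files must then be swapped out.
corrected.write.mode("overwrite").parquet(rewritten_path)
```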
Many organizations today run two parallel data infrastructures: one for structured data, based on the data warehouse, and another for less-structured data, based on the data lake. These two largely separate systems feature a great deal of interchange, leading to slow updating, complex pipelines, duplicate data, and operational headaches. (Our blog post, It’s Time for the Universal Data Lakehouse, describes these parallel, but interconnected, systems.)
Having built these powerful but complex systems, data architects took a deep breath and considered what they had created so far. The challenge was to find a way to get the best of both worlds - to add the query responsiveness and updatability of the data warehouse to the low cost and flexibility of the data lake.
A successful effort to bring these two kinds of systems together led to the development of Apache Hudi, the first data lakehouse, at Uber in 2016.
While there are various flavors of data lakehouses - for example, some have built-in query engines, while others do not - they share several core technical factors, drawn from previous technologies:
- Cheap, scalable object storage and open formats, inherited from the data lake
- A metadata layer around the data lake that supports upserts and deletes
- The query responsiveness and updatability long associated with the data warehouse
- A single copy of the data that multiple query and compute engines can work against
A data lakehouse is not a straight-up replacement for a data warehouse. But it can handle many - often, the majority - of tasks that were previously handled by a data warehouse. In particular, the data lakehouse can easily handle the first two stages of a medallion architecture (sketched in code below):
- Bronze: raw data ingested from source systems, largely as-is
- Silver: cleaned, deduplicated, and enriched data, ready for downstream use
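Here is a minimal sketch of those two stages on a lakehouse table, assuming PySpark with the Apache Hudi Spark bundle on the classpath; the bucket paths, table names, and columns are hypothetical, and other lakehouse projects support equivalent patterns.

```python
# A minimal sketch of bronze and silver tables on a lakehouse, assuming
# PySpark + Apache Hudi; paths, table names, and columns are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("medallion-bronze-silver")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# Bronze: land raw order events, largely as-is, in a lakehouse table.
raw_orders = spark.read.json("s3://example-bucket/ingest/orders/")
bronze_opts = {
    "hoodie.table.name": "orders_bronze",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.partitionpath.field": "order_date",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",
}
(raw_orders.write.format("hudi")
    .options(**bronze_opts)
    .mode("append")
    .save("s3://example-bucket/lakehouse/orders_bronze"))

# Silver: clean and deduplicate the bronze data, then upsert it into a
# curated table that downstream engines can query.
cleaned = (
    spark.read.format("hudi").load("s3://example-bucket/lakehouse/orders_bronze")
    .dropDuplicates(["order_id"])
    .filter("order_total IS NOT NULL")
)
silver_opts = {**bronze_opts, "hoodie.table.name": "orders_silver"}
(cleaned.write.format("hudi")
    .options(**silver_opts)
    .mode("append")
    .save("s3://example-bucket/lakehouse/orders_silver"))
```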
Further processing can then be applied using compute engines such as BigQuery, Databricks, Snowflake, and others; Python and other code for machine learning and AI; and a variety of query engines. Where expensive proprietary solutions are applied, they can work on a smaller, focused dataset, and the cost can be borne by the specific use case that requires the proprietary solution.
A well-designed data infrastructure that uses a data lakehouse in a central role, with one or more flavors of data warehouse put to work for specific use cases, has many advantages over the parallel data warehouse / data lake architecture in widespread use today.
Note: The core storage in a data lakehouse is still the data lake, living on cheap object storage such as Amazon S3, Azure Blob Storage, or Google Cloud Storage. This storage lives securely in the user’s virtual private cloud (VPC). The lakehouse adds a layer of metadata around the data lake to support upserts - updates to existing data records and inserts of new records - and deletes. Lakehouses differ in how efficiently they process upserts and deletes, and in how efficiently they query tables that have had upserts and deletes applied to them.
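As a concrete illustration of what that metadata layer enables, here is a minimal sketch of an upsert and a delete against a Hudi table on object storage, assuming PySpark with Apache Hudi; the table, paths, and columns are hypothetical, and other lakehouse projects expose similar operations.

```python
# A minimal sketch of record-level upserts and deletes on a lakehouse table,
# assuming PySpark + Apache Hudi; table, path, and columns are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakehouse-upsert-delete").getOrCreate()

table_path = "s3://example-bucket/lakehouse/orders_silver"
base_opts = {
    "hoodie.table.name": "orders_silver",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.partitionpath.field": "order_date",
    "hoodie.datasource.write.precombine.field": "updated_at",
}

# Upsert: rows whose keys already exist are updated; new keys are inserted.
changes = spark.createDataFrame(
    [
        ("ord-001", "2024-05-01", 99.50, "2024-05-01T10:00:00Z"),
        ("ord-999", "2024-05-01", 12.00, "2024-05-01T10:05:00Z"),
    ],
    ["order_id", "order_date", "order_total", "updated_at"],
)
(changes.write.format("hudi")
    .options(**base_opts)
    .option("hoodie.datasource.write.operation", "upsert")
    .mode("append")
    .save(table_path))

# Delete: remove records by key - something a plain data lake cannot do
# without rewriting files by hand.
to_delete = spark.createDataFrame(
    [("ord-042", "2024-05-01", 0.0, "2024-05-01T10:10:00Z")],
    ["order_id", "order_date", "order_total", "updated_at"],
)
(to_delete.write.format("hudi")
    .options(**base_opts)
    .option("hoodie.datasource.write.operation", "delete")
    .mode("append")
    .save(table_path))
```

In each case the lakehouse records the change in its metadata and rewrites only the affected files, rather than the whole dataset.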
We’ve explained what the data lakehouse is - and we know that understanding it may have required a lot of work. Having done all that, why introduce the concept of a universal data lakehouse (UDL)?
The universal data lakehouse is a kind of Platonic ideal - a data architecture that has all the advantages, and none of the defects, of the data warehouse and data lake approaches that preceded it. But different versions of the data lakehouse have different advantages and trade-offs. The UDL helps us articulate what’s most important to someone using a version of the data lakehouse to achieve their specific goals.
The three major data lakehouse projects, in order of creation, are Apache Hudi, Apache Iceberg, and Delta Lake. Architects also create their own lakehouse-type systems on top of data lakes, with or without the use of code from the three established projects.
Note: Confusingly, the name “data lakehouse” was popularized by Databricks, the creators of Delta (their proprietary system) and Delta Lake (the open source version), beginning in 2020, but it applies fully to all three major lakehouse projects.
We here at Onehouse build on Apache Hudi, and contribute strongly to it, so we do have an interest in which data lakehouse is seen as superior. With that in mind, we have done a detailed analysis of the three data lakehouses, available as a widely read blog post and as a white paper. Here’s a brief summary of today’s realities, as we see them:
For most purposes, we here at Onehouse see Apache Hudi as the lakehouse project that most closely approaches the UDL ideal. However, companies that have chosen to work with Snowflake, and other vendors that use Iceberg - or that have chosen to work with Databricks, and their Delta / Delta Lake combination - often find it convenient to use Iceberg or Delta Lake, respectively, for their internal projects, with varied success in approaching the UDL ideal.
Two relatively new realities, however, militate against this seemingly comfortable, “dance with who brung ya” approach:
The universal data lakehouse architecture, as described in our blog post, encourages you to use your data lakehouse as a “single source of truth” for data that can then be processed by any query or compute engine, prominently including Databricks and Snowflake.
Decisions are hard. Here at Onehouse, we seek to keep you “in choice” at all times:
If you want to do your own internal evaluation and make a strategic choice for the long term, the rigorous approach used by Walmart may be instructive. If you have a more specific use case, and want to go through a standard proof of concept (PoC) process, we encourage you to include Onehouse.
In either case, we invite you to read more in our universal data lakehouse white paper or try Onehouse for free.