This introductory article describes the difference between a data warehouse, a data lake, and a data lakehouse. For a deeper dive on the newest of these technologies, see It's Time for the Universal Data Lakehouse.
Data warehouses came first; data lakes have grown in importance; and the data lakehouse, which adds new capabilities to data lakes, is one of the most important emerging technologies in data management, right up there with generative AI. (And in fact, the data lakehouse is used to power generative AI; Notion’s Q&A feature is one example of many.)
But how are the data warehouse, the data lake, and the data lakehouse different? What are they good for? This article will show you.
All three technologies are useful for data management, data integration, and data analytics. As data volumes grow and data velocity accelerates, organizations want to do ever more with their data while still managing it all as efficiently as possible. Each technology plays an important role in doing so, but their relative importance is changing.
Note. There is a recent, regrettable tendency to confuse the meaning of two of the terms defined below: “data lake” and “data lakehouse.” The data lakehouse is a data management and data integration infrastructure for data stored on data lakes. So you can store data on a data lake, then access, update, and manage it using a data lakehouse. You can even, using the new Apache XTable (Incubating) project, manage data lake data using several data lakehouses interchangeably.
The first database management systems (DBMSs) emerged decades ago. They were first used as transactional databases, but the same databases were soon also used as data analytics engines. And at low volumes (of both transactions and queries), using a single database for both worked well enough.
However, as volumes grow for both transactions and queries, a database slows down - or even grinds to a halt. Adding analytics workloads to the transactional database becomes unworkable. This conflict between workloads led to the creation of the data warehouse.
Data from transactional databases is extracted, transformed, and loaded (or “ETLed”) to the data warehouse, where analytics processing takes place. As part of this processing, data can be reformatted for faster querying, enriched with external data, and more. A data warehouse is ideal for structured data that can be neatly organized into tables with specific schemas - that is, the type that comes out of transactional databases.
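As a rough illustration of that pattern, here is a minimal Python sketch of an ETL step. It is not any particular vendor's pipeline: sqlite3 stands in for both the transactional database and the warehouse, and the table and column names are invented.

```python
import sqlite3

# Stand-ins for illustration: sqlite3 plays both the transactional database
# and the warehouse; in practice these would be an OLTP database and a
# cloud data warehouse reached through their own drivers.
oltp = sqlite3.connect("orders_oltp.db")
warehouse = sqlite3.connect("warehouse.db")

# Extract: pull the last day's orders from the transactional system.
rows = oltp.execute(
    "SELECT order_id, customer_id, amount_cents, created_at "
    "FROM orders WHERE created_at >= date('now', '-1 day')"
).fetchall()

# Transform: reshape for analytics (cents to dollars, derive a date column
# that is convenient for grouping and reporting).
transformed = [
    (order_id, customer_id, amount_cents / 100.0, created_at[:10])
    for order_id, customer_id, amount_cents, created_at in rows
]

# Load: append into an analytics-friendly fact table in the warehouse.
warehouse.execute(
    "CREATE TABLE IF NOT EXISTS fact_orders "
    "(order_id INTEGER, customer_id INTEGER, amount_usd REAL, order_date TEXT)"
)
warehouse.executemany("INSERT INTO fact_orders VALUES (?, ?, ?, ?)", transformed)
warehouse.commit()
```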
As the world moved to the Internet, cloud-native data warehouse services appeared. These new services were able to break performance and scalability bottlenecks that had become very limiting in the on-premises world. A cloud data warehouse can also separate storage from compute, then scale each independently as needed.
The new cloud data warehouses, however, retained some of the disadvantages of their on-premises cousins: high costs, proprietary software, and support for structured data only.
As structured data became more manageable, volumes of semi-structured data (such as JSON data from event logs and sensors) and unstructured data (such as video and audio files) grew and grew. As companies gathered more and more of this less-structured data, and as data volumes (and therefore data-related costs) exploded, something had to change.
The data lake was developed to store large volumes of data that would cost too much to keep in a data warehouse, and to handle the semi-structured and unstructured data that a data warehouse does not support. Less-structured data also tends to be orders of magnitude more voluminous than structured data; compare accounting or census tables to log files and video files and you can imagine the difference. At that scale, the data warehouse's storage costs are simply too high.
Data lake software is mostly open source and runs on cloud object storage, which makes data lakes much cheaper than (usually proprietary) data warehouses. For a long time, though, open source software was not considered up to the challenge of offering relational database management system (RDBMS) capabilities, especially ACID (atomic, consistent, isolated, and durable - in short, reliable) transactions on rows and columns of data. And the open source software of the data lake world did not support the query standard of the RDBMS world, Structured Query Language (SQL).
Note. Data stores for voluminous, less-structured data took on the name NoSQL about 20 years ago, and NoSQL databases have tried to live down that name ever since it was invented. Many NoSQL databases now support variants of SQL for queries and data management. However, these variants tend to be more limited and slower to run than “true” SQL - partly because of the nature of the data and partly because they have not been optimized nearly as strenuously as SQL.
As the cloud became the go-to place for all things data, developers looked for more efficient, cost-effective ways to manage what was now called "big data," while taking advantage of the flexibility and scalability of the cloud. One of the most common approaches was a data lake built on Apache Hadoop and running on one of the popular public clouds such as AWS, Microsoft Azure, or Google Cloud Platform.
These data lakes were great for storing both structured and less-structured data relatively cheaply. And thanks to the scalability of cloud object storage, data volumes were less of a concern. The data lake was originally used mainly for big data processing and data science-type analytics workloads, including machine learning and AI.
But data lakes have their drawbacks. The lack of full SQL support means that the tools and processes business teams have long used to access and analyze their data often end up being walled off from the data lake. (The need to find, for instance, a Python programmer to get anything done with big data is a huge barrier for business types who are not comfortable going beyond SQL.)
Most importantly, data lakes do not support updates at the record level. (Because they don’t run that expensive, proprietary RDBMS software we mentioned above.) If you have a million (or a trillion) records in a data table on a data lake, and some records are updated, you have to make a whole new copy of the data table - yes, millions or trillions of records - to bring it up to date. This is tremendously wasteful, so organizations often update their lake-based data tables in batches, once a day or less.
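To make that cost concrete, here is a bare-bones Python sketch of the full-rewrite pattern, using a CSV file as a stand-in for a lake table (the file, column names, and record IDs are hypothetical). Even a two-record change forces every row to be read and written again.

```python
import csv
import os

def apply_updates(table_path: str, updates: dict) -> None:
    """Rewrite an entire CSV 'table' just to change a handful of records."""
    tmp_path = table_path + ".tmp"
    with open(table_path, newline="") as src, open(tmp_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:                     # every record is touched...
            if row["record_id"] in updates:    # ...though only a few change
                row["status"] = updates[row["record_id"]]
            writer.writerow(row)
    os.replace(tmp_path, table_path)           # swap in the full new copy

# Updating 2 records still rewrites all of them - the cost that pushes
# plain data lakes toward infrequent, batched updates.
apply_updates("events.csv", {"42": "shipped", "77": "cancelled"})
```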
In addition, the very success of the data lake - and ongoing growth in data warehouse use - has led to complex data pipelines reminiscent of “spaghetti code.” Users want access to all the data, all the time. So pipelines meander back and forth between the data warehouse side and the data lake side, accumulating complexity and risking data and operational errors at each additional step. This also limits data freshness to the slowest components of the entire, often rickety, data infrastructure.
Note. In particular, some use cases employ streaming to achieve near real-time latencies, while others use batch processing and are only updated once a day. Any query that cuts across these use cases is stuck delivering stale results, due to the infrequent updates on the batch side.
For the last decade, data-forward organizations have been aware of the need for new solutions. Cost, complexity, and “worst of both worlds” limitations, such as stale data, burden the entire data infrastructure. Traditional business intelligence (BI) and reporting are not going anywhere, but new advanced analytics and data science use cases are needed to compete and lead in business. These use cases need to operate on fresh data, which is very hard (or very expensive) to support on the data lake.
So emerged the data lakehouse. The data lakehouse works on data stored in data lakes - structured, semi-structured, and unstructured alike. And, like the data lake, the typical data lakehouse can run on open source software, reducing costs and improving choice (though there are proprietary versions as well).
In addition, the data lakehouse adds core functionality and services that allow it to be updated and queried with efficiency and speed that rival the data warehouse. And the data lakehouse supports all analytics use cases, from BI to predictive analytics and ML/AI, because it supports all data types. (Though you will still only be able to use true SQL on more or less structured data, and will be doing ML/AI work in Python or similar.)
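As a sketch of what record-level updates look like on a lakehouse, the PySpark snippet below upserts two changed records into a table stored in one of the open lakehouse formats (Apache Hudi is used here as one example). The bucket path, table name, and columns are placeholders, and the Spark session is assumed to already have the Hudi bundle for your Spark version on its classpath; check the Hudi documentation for the exact packages and options for your setup.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakehouse-upsert").getOrCreate()

table_path = "s3://my-bucket/lakehouse/orders"  # hypothetical location

# A small batch of changed records - only these rows are written.
updates = spark.createDataFrame(
    [(42, "shipped", "2024-05-01 10:00:00"),
     (77, "cancelled", "2024-05-01 10:05:00")],
    ["order_id", "status", "updated_at"],
)

# Upsert: the lakehouse merges these records into the existing table
# instead of rewriting the whole thing.
(updates.write.format("hudi")
    .option("hoodie.table.name", "orders")
    .option("hoodie.datasource.write.recordkey.field", "order_id")
    .option("hoodie.datasource.write.precombine.field", "updated_at")
    .option("hoodie.datasource.write.operation", "upsert")
    .mode("append")
    .save(table_path))

# The same table is then queryable with ordinary SQL through the engine.
spark.read.format("hudi").load(table_path).createOrReplaceTempView("orders")
spark.sql("SELECT status, COUNT(*) FROM orders GROUP BY status").show()
```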
Table 1 shows major differences between these data architectural patterns. Data services manage data and back-end tasks such as data table compression and compaction.
Note. Database technology is more than half a century old, and there are exceptions to almost every description you could give of these systems. The table shows the generally understood differences between most implementations of each approach.
Most organizations of any size, or with any ambition for the future, need to handle a full mix of analytics approaches - with a recent special focus on ML/AI - across a full range of data types. And they need to deliver results quickly and cost-effectively, while smoothly supporting legacy use cases. So no one can get by with the data warehouse only, or the data lake only.
What has emerged instead, in organizations that have not adopted a data lakehouse approach, is a kind of parallel data infrastructure. Legacy and new reporting and BI use cases run on a data flow centered on the data warehouse. Newer use cases that include less-structured data, advanced analytics, data science, and AI/ML run on a separate data flow, centered on the data lake, as shown in Figure 1.
Of course, many use cases want data cutting across these flows, so there is a great deal of data copying back and forth. The resulting data infrastructure is expensive, complex, and slow. Many data transfers take place in batch mode, and every query result is only as up-to-date as the least-frequently-updated pipeline that feeds into it.
To varying degrees, the three major data lakehouse projects - Apache Hudi, Delta Lake, and Apache Iceberg - offer a way out of this two-tier infrastructure (and the copying back and forth that goes with it). With a data lakehouse, you can put your “source of truth” data tables, for both the bronze (raw) and silver (cleansed) layers, on the data lake. Increasingly, when used with a data lakehouse, gold (aggregated, business-ready) tables can live on the lake as well.
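For a concrete (if simplified) picture of that layering, the PySpark sketch below derives a cleansed silver table from a raw bronze table, with both stored as lakehouse tables on object storage. Paths, table names, and columns are hypothetical, and the same caveats about Hudi setup as in the earlier snippet apply.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("bronze-to-silver").getOrCreate()

bronze_path = "s3://my-bucket/lakehouse/bronze/clickstream"  # raw, as ingested
silver_path = "s3://my-bucket/lakehouse/silver/clickstream"  # cleansed source of truth

# Read the raw bronze table, then de-duplicate and clean it.
bronze = spark.read.format("hudi").load(bronze_path)
silver = (bronze
    .dropDuplicates(["event_id"])
    .filter(F.col("event_type").isNotNull())
    .withColumn("event_date", F.to_date("event_ts")))

# Write the silver table back to the lake - no copy into a separate warehouse.
(silver.write.format("hudi")
    .option("hoodie.table.name", "clickstream_silver")
    .option("hoodie.datasource.write.recordkey.field", "event_id")
    .option("hoodie.datasource.write.precombine.field", "event_ts")
    .mode("append")
    .save(silver_path))
```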
Until recently, however, there has been one tiny little problem. When you chose one lakehouse format, you excluded the others. You could reduce lock-in across your system, but you still had lock-in to the lakehouse format you chose and the ecosystem around it.
In a world where everyone needs access to everything, this is a blocker. For example, more than 40% of Snowflake users also run Databricks, and vice versa. Lock-in at the lakehouse level is a rather forbidding prospect.
The universal data lakehouse (UDL) architecture, shown in Figure 2, solves this problem. The UDL includes Apache XTable (formerly OneTable), which provides full read/write interoperability across lakehouse formats. With XTable, the UDL can take its rightful place at the core of your data infrastructure.
Proprietary software plays a role in the UDL; the compute and query engines you use, for instance, may be either open or proprietary. But your “source of truth” data tables stay in object storage, in your own virtual private cloud (VPC), on your cloud provider(s) of choice. You use the tools you need, as you need them, without lock-in for your data infrastructure.
With Onehouse, the Universal Data Lakehouse is packaged up and delivered as a managed service. You get a data lakehouse in days, rather than months; ops work and maintenance are fully automated.
If you’re interested in trying Onehouse, sign up for a test drive. To learn more, check out this recorded webinar with Onehouse Founder and CEO and Apache Hudi founder and PMC Chair Vinoth Chandar.