This important talk by Onehouse Founder/CEO Vinoth Chandar was delivered at Data Council Austin in March 2022. Data Council Austin is “the world’s largest independent full-stack data conference.” It’s a community-driven event that includes data science, data engineering, analytics, machine learning (ML), artificial intelligence (AI), and more.
Vinoth Chandar originated the data lakehouse architecture while at Uber, and he is chair of the project management committee (PMC) for the Apache Hudi project. Hudi was originally described as a “transactional data lake,” since it predates the term “data lakehouse,” which Databricks introduced in 2020. It is now considered to be the first of the three leading data lakehouse projects.
In this talk, Vinoth compares the past, present, and future uses of data warehouses, data lakes, and data lakehouses. He concludes by calling for an open, lakehouse-first architecture, with most workloads served directly from a unified data lakehouse. In this architecture, the lakehouse serves a multiverse of different engines specializing in areas such as reporting, business intelligence, predictive analytics, data science, and ML/AI.
We have divided the talk into two blog posts:
The talk has been lightly edited for conciseness and flow.
If you’re interested in learning more about Onehouse, visit https://www.onehouse.ai/product.
I’m happy to see everybody here. This is my first live talk in years. I hope this talk is informative; let's dive in.
I consider myself a one-trick pony who's worked way too long on one thing: distributed databases. I started this journey at Oracle: change data capture, log replication, all that sort of stuff. And then I led the Voldemort key-value store project during the hypergrowth phases at LinkedIn. Then I had a brief stint at Box where we were building a real-time data store. Then came Uber, and most recently I was at Confluent, building the ksqlDB streaming database.
Hudi is a project that we started while I was at Uber, and I’ve been growing the community with the Apache Software Foundation for the last five years. With Onehouse, we are doing more of the same distributed database and data warehouse type of work as when I was at Uber.
There's a lot of good material out there for Hudi, and we can always connect over Hudi Slack. So instead of talking about Hudi today, I’ll set out the differences between the data warehouse - which everyone is familiar with - versus the data lake and the data lakehouse, which are newer. I’ll describe the overall architecture, how to think about things, and whether you should stay in your current architecture or move forward.
The goals of the talk include:
So why give a talk like this now? A few reasons:
We'll first start by sketching the evolution of these technologies, the arc of how we got here. Then we’ll try to understand and study them across these three different aspects: architecture, capabilities, and price/performance. Finally, I'll leave you with a pattern that I've seen emerge, distilling the patterns that I've seen in the Hudi community, for a more balanced and future-proof way to build your core data infrastructure and architecture.
“...a more balanced and future-proof way to build your core data infrastructure and architecture.”
On-premises data warehouses, the first arc in the slide below, can now be seen as simply specialized databases for analytics. For a long time, that's all we had. Then 2012 saw the launch of BigQuery - serverless queries - and now a lot of innovation has happened around the warehousing space.
Redshift, in my opinion, really brought cloud warehouses mainstream, as the cloud picked up a lot of momentum. And of course Snowflake famously decoupled storage and compute, with great usability. So today you have cloud warehousing options, which are pretty mature. A lot of people use them, as they provide a stable resource for analytics needs.
If you look at the other arc, data lakes really started out as an architectural pattern - not tangible software that you can download and use, as you can with an RDBMS or a data warehouse. Data lakes started by supporting search and social: high-scale data use cases.
I used to work at LinkedIn, back when we were using all of these approaches to process a lot of data and build data-driven products. Spark grew and revolutionized data processing. Cloud grew – for a lot of workloads, cloud storage is ubiquitous. Data lake slowly became pretty much synonymous with files on top of cloud storage.
At Uber in 2016 we were trying to build a textbook data architecture: a data lake architecture where we could get our operational data from databases and stream some external sources into a raw layer, then be able to do any kind of ETL or post-processing and derive more data sets out of it.
“The core… transactional capabilities that you needed were just simply lacking on the data lakes.”
We found that it was pretty impossible to build that textbook architecture because the core capabilities that we needed, such as transactions, were simply lacking on data lakes. And that's how the Hudi project was born.
We added updates, deletes, transactions; we even went as far as to add database CDC-style change streams on top of the lakes. And this is what we called a transactional data lake.
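To make the idea concrete, here is a minimal in-memory sketch of what "updates, deletes, transactions, and change streams" mean for a table. It is not Hudi's API or storage layout; the `ToyTable` class, its methods, and the `trip-*` keys are hypothetical names chosen for illustration. Each commit is an atomic batch of changes, the latest snapshot is obtained by replaying commits in order, and a CDC-style stream is simply "every change after commit N."

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ToyTable:
    """In-memory stand-in for a table with a commit timeline (hypothetical, not Hudi's API)."""
    # Each commit is one atomic batch of key -> value changes; a value of None marks a delete.
    commits: List[Dict[str, Optional[dict]]] = field(default_factory=list)

    def commit(self, upserts: Dict[str, dict], deletes: Tuple[str, ...] = ()) -> int:
        """Record upserts and deletes as a single commit and return its commit number."""
        batch: Dict[str, Optional[dict]] = dict(upserts)
        for key in deletes:
            batch[key] = None
        self.commits.append(batch)
        return len(self.commits)

    def snapshot(self) -> Dict[str, dict]:
        """Latest state of the table, built by replaying every commit in order."""
        state: Dict[str, dict] = {}
        for batch in self.commits:
            for key, value in batch.items():
                if value is None:
                    state.pop(key, None)
                else:
                    state[key] = value
        return state

    def changes_since(self, commit_no: int) -> List[Tuple[int, str, Optional[dict]]]:
        """CDC-style change stream: every change made after the given commit number."""
        out = []
        for number, batch in enumerate(self.commits[commit_no:], start=commit_no + 1):
            for key, value in batch.items():
                out.append((number, key, value))
        return out


table = ToyTable()
first = table.commit({"trip-1": {"fare": 10}, "trip-2": {"fare": 7}})
table.commit({"trip-1": {"fare": 12}}, deletes=("trip-2",))
latest = table.snapshot()
changes = table.changes_since(first)
```

A downstream job that remembers the last commit number it processed can call `changes_since` to pick up only what is new, instead of rescanning the whole table.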
The lakehouse is now technically proven, including use cases from:
The technology foundations - what lakehouse technology adds on top of the existing data lakes - are pretty proven out. So just from looking at the Apache Hudi community through that lens, you can see these are everyday, daily services. And the large enterprises that you see here, if you add them all together, you're looking at a few exabytes of data being managed using lakehouse tech like Apache Hudi.
“...you're looking at a few exabytes of data being managed using lakehouse tech like Apache Hudi.”
So the tech is really there. And you might have noticed all the recent developments. As someone who started my career watching the database wars, I’ve found it very interesting, to say the least; some headlines from last year:
How do you make sense of it all? Data warehouses are pretty well understood. Everybody understands what they're for; they're pretty mature. And data lakes have been in this trough of disillusionment from pretty much 2018 to 2020.
And then there's a lot of frustration around data engineering. There are too many moving pieces out there. And the lakehouse is an emerging category, which I believe can redeem data lakes by adding some of the core missing functionality that I’ve described.
But we need to be cautious, because this is not the first time that we've seen a silver bullet being prescribed to solve all problems. And I'm unfortunately getting old enough to have lived through the Hadoop-era promises, where the Hadoop enterprise data warehouse would take over data warehousing.
To me, it was a little bit ahead of its time, and the focus went away from solving core user problems. Things like Hudi should have been written perhaps five years earlier, in my opinion. So we need to approach this with cautious optimism.
So I'll share my honest evaluation of where things are today and how I see the landscape evolving. We'll cover three aspects:
Let's quickly look at the basics, starting with: what is the on-prem warehouse?
You have a bunch of nodes with beefy disks and CPU and you run SQL, which runs on the nodes and accesses local data; it's just a clustered database architecture. Again, it's hard to really picture the architecture of closed systems - but I think, high-level, they look like this. On the other hand, you have cloud storage, which is infinitely scalable.
The problem with on-prem warehouses is that the storage and the compute are coupled, so you cannot independently scale them. But in the cloud warehouse model, they solved it really well - you can store as much data as you want on cloud storage, with of course some caveats, but then you can spin up on-demand compute.
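A back-of-the-envelope sketch of why coupling hurts: when every node brings both disk and CPU, the cluster has to be sized for whichever need is larger. The function names and the capacity numbers below are hypothetical, chosen only to illustrate the arithmetic.

```python
import math


def coupled_nodes(data_tb: float, cores_needed: int, tb_per_node: float, cores_per_node: int) -> int:
    """Nodes for a coupled cluster: each node adds disk and CPU together, so size for the larger need."""
    for_storage = math.ceil(data_tb / tb_per_node)
    for_compute = math.ceil(cores_needed / cores_per_node)
    return max(for_storage, for_compute)


def decoupled_compute_nodes(cores_needed: int, cores_per_node: int) -> int:
    """Compute nodes when the data lives in cloud storage: size for the query load only."""
    return math.ceil(cores_needed / cores_per_node)


# Hypothetical workload: a lot of data, but a modest query load.
nodes_coupled = coupled_nodes(data_tb=500, cores_needed=64, tb_per_node=10, cores_per_node=32)
nodes_decoupled = decoupled_compute_nodes(cores_needed=64, cores_per_node=32)
print(nodes_coupled, nodes_decoupled)
```

With these made-up numbers, the coupled cluster is driven entirely by storage, while the decoupled setup only provisions compute for the queries and leaves the data in cloud storage.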
And then there are managed platform services, such as query optimization, transaction management, and metadata management. These help the services be more resilient. They can also do a lot of global optimizations: you may be spinning up several virtual warehouses on the same table, and the service can do a lot of cross-optimizations that way. So this is the architecture that we've seen.
If you now go over to the data lake, traditional data lakes have, interestingly, always had storage and compute separation. That's literally how they were born. But the traditional data lake has pretty much been files on top of cloud storage, in a mixture of JSON and anything else, really. You have SQL engines - essentially a bunch of nodes that can execute SQL on top of them - and, depending on the engine and the vendor you use, you either have a cache or not.
With lakehouses, what we've really done is to insert a layer in-between and track more metadata. And then you see that it's become more structured. The world is more structured now in the lakehouse.
“You can see how the architectures approach each other.”
Essentially, we added a transaction management layer, a bunch of things that can optimize your table, similar to what you find in the warehouse, like clustering your table and so on; we'll talk about some of these later. And also, schema management or statistics, just tracking more scalable file-level statistics for your tables. The rest of the architecture is pretty much the same. You can see how the architectures approach each other, from the cloud warehousing model and the lakehouse model.
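As a rough illustration of why file-level statistics matter, the sketch below keeps a minimum and maximum value per file and uses them to skip files that cannot match a query. This is a simplified model and not the metadata format of any particular lakehouse project; `FileStats`, `files_to_scan`, the `ts` column, and the file names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FileStats:
    """Per-file statistics for one column (here a hypothetical integer `ts` column)."""
    path: str
    min_ts: int
    max_ts: int


def files_to_scan(stats: List[FileStats], lo: int, hi: int) -> List[str]:
    """Keep only files whose [min_ts, max_ts] range overlaps the query range [lo, hi]."""
    return [s.path for s in stats if s.max_ts >= lo and s.min_ts <= hi]


table_stats = [
    FileStats("part-0001.parquet", 100, 199),
    FileStats("part-0002.parquet", 200, 299),
    FileStats("part-0003.parquet", 300, 399),
]

# A query for ts between 250 and 320 never has to open part-0001.
selected = files_to_scan(table_stats, lo=250, hi=320)
```

The same overlap test is what makes table optimizations like clustering pay off: the more tightly each file's value range is packed, the more files a query can skip.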
Openness is a big topic - and it doesn't stop with open data formats. It extends well beyond that; it cuts into architecture. So of course open data formats are interoperable and future-proof. Data warehouses, however, use more or less proprietary data formats.
Interoperability is a different thing. If your solution is really popular, then a lot of tools will interoperate with you. But with open file and table formats, what you really get is this: the ecosystem can evolve on its own. You're not beholden to a vendor adding support for X, Y, Z formats. That's the key power of having open data formats.
“You’re not beholden to a vendor adding support for X, Y, Z formats.”
Data location and access actually vary between vendors, even between warehouses. With some, you store data with the warehousing vendor. With others, such as warehouses operated by a cloud provider, the data sits within your account. So it varies. Lakes mostly store data in your own bucket, with some caveats that you need to be careful about, such as how you set up permissions on your bucket so that you remain the owner of the objects that you've written.
“Data services is where the key differences are.”
Data services is where the key differences are. Most of the things which operate on a warehouse that maintain or manage your tables are proprietary. Even on the lake, I think there's a pattern to keep open data formats, but lock everything else to a vendor runtime. And this is something where we've done, I would say, a much better job on the Hudi project, where you get ingest services, table optimization - all these services are in open source. This combination is what, at least in our opinion, locks in openness, because then a format is just a passive thing. The question is, what do you do with it?
So on code and community, again, can you influence the project? Even if you have a team, large enterprises are tied to vendor roadmaps. And even on the lake, it varies by the project and whether it's a grassroots open source project, or it's more vendor-driven by one company.
For a quick sidebar on data mesh: I get a lot of people telling me, "Oh, I'm building a mesh and not a lake." Just a quick thing here: it’s a pretty orthogonal concept. If you remember, I said the data lake is an architectural concept. Data mesh mostly talks about how you organize your data, not the data infrastructure. And if you look at even an introductory article on the topic, it advocates for data infrastructure standardization as a key point of how we can implement a data mesh.
To summarize our architectural discussion: what I see today is that data warehouses are really built for business intelligence (BI), where the problem is about scanning the least amount of data possible. And that's because they have performant metadata management, long-running servers. Generally, the data warehouse is geared towards supporting these more interactive workloads better.
If you look at data lakes, the lakes support scalable AI. What it really means is they're scan optimized. They can scan vast amounts of data. That's because all of them hit cloud storage - that is, there are no intermediate servers, so they can directly access cloud storage. And open formats mean that the kind of ecosystem that we just talked about in the keynote can evolve much more quickly, because smaller projects can come up and be built on top of this data, and it's easy to evolve and iterate as a community.
The lakehouse in some senses is trying to blend both. But there are challenges. I reckon there'll be a lot more work on query planning; these kinds of advances are needed. If you Google any “data lake versus data warehouse” comparison, it'll tell you that the difference is unstructured vs. structured data, but lakehouses are still mostly dealing with structured data. And there's still a lot more standardized data management to do; these are challenges to be solved.
Visit Part 2.