This post is a brief summary of an expert panel from Open Source Data Summit 2023. View the full recording here, or read the highlights below.
At Open Source Data Summit 2023, Vinoth Chandar, Founder and CEO of Onehouse, and creator and PMC chair of the Apache Hudi project, hosted a panel discussion with data technology all-stars. They appear in the following order in the accompanying video:
Raghu Ramakrishnan, CTO for Data at Microsoft
Justin Borgman, Chairman and CEO at Starburst
Kapil Surlaker, VP of Engineering at LinkedIn
Jay Kreps, Co-founder and CEO at Confluent
Justin Levandoski, Director of Engineering for BigQuery at Google
Praveen Naga, VP of Engineering and Data Science at Uber
The conversation was deep and wide-ranging, covering everything from the early days of Hadoop and Kafka, to why open source is an increasingly attractive alternative to proprietary solutions, to predictions about the future of data architecture.
Given the panelists’ vast experience in the industry, how have they seen the perception of open source change and become more mainstream—and what do they think is driving that change?
When Raghu was working on Hadoop at Yahoo, open source was far from being the default option for a new piece of architecture. But there was a real business need, a lot of scale and complexity, and the company wasn’t trying to monetize the solution—all reasons open source made sense. The benefits were immediately clear: “Many hands make light work. And many hands also means broader adoption,” Raghu said.
"Many hands also means broader adoption." - Raghu Ramakrishna, Microsoft
From those early days, the very openness of open source generated its own momentum within the engineering community. Praveen, for instance, was inspired to build open source systems at Uber after having a “front row seat” to Jay’s work building Kafka at LinkedIn. Contributors push each other to be better, and successful projects build resumes. “This is something engineers can be associated with over their entire career,” Kapil said.
As Raghu put it, open source is also “the best way we have today of creating standards: standard interfaces, standards for interoperability, standards for benchmarks.” That’s made it a boon to companies: customers love it because it prevents vendor lock-in, and adoption of open source projects as standards naturally encourages higher-quality code, with more users, more feedback, and more testing.
“When enterprises think about servicing their data teams, they want to provide the best tool for the job, whether this might be proprietary or an open source stack like Trino, Presto, or Spark,” Justin Levandoski said. And as Raghu explained, “They want to know that if they get unhappy with a vendor, they have optionality.”
Across compute and storage, companies today have many reasons to consider open source projects or managed services. Optimized columnar formats have closed the gap dramatically between the traditional warehouse and what you can do with a query engine accessing data stored in open formats. Kafka has democratized open data streams for everyone. And the need for real-time decision-making at Uber gave rise to Hudi, the first streaming transactional data lake. “The reason why we were able to succeed and scale is because of open source—specifically in data,” Praveen said. By decoupling storage and compute, projects like Hudi have been key to enabling companies to choose the right engines for their workloads without worrying about incompatibilities.
“The combination of something like Hudi and a query engine like Trino is effectively an alternative data warehouse.” - Justin Borgman, Starburst
These new capabilities have fueled the reaction against traditional, proprietary formats that lock customers in. “The combination of something like Hudi and a query engine like Trino is effectively an alternative data warehouse,” Justin Borgman said. Thinking back to Hadoop and Teradata, we’ve seen this movie before. The difference, Justin explained, is that now it’s playing out in the cloud with more than a decade of innovation in between: “You can credibly run an alternative data warehouse with performance that’s nearly identical, without the lock-in and without the cost—it’s really happening now.”
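To make the decoupling concrete, here is a minimal sketch of the pattern the panelists describe: write data once into an open table format on object storage, then point whichever engine fits the workload at it. The table name, schema, bucket path, and bundle version below are hypothetical illustrations; the option keys follow the Apache Hudi Spark datasource documentation.

```python
# Minimal sketch: writing an Apache Hudi table with Spark. Names and paths
# are hypothetical; option keys follow the Hudi Spark datasource docs.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("hudi-sketch")
    # The Hudi Spark bundle must be on the classpath; the version is an assumption.
    .config("spark.jars.packages", "org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.0")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

trips = spark.createDataFrame(
    [("t1", "d42", 9.50, "2023-10-01"), ("t2", "d17", 14.25, "2023-10-01")],
    ["trip_id", "driver_id", "fare", "ds"],
)

(trips.write.format("hudi")
    .option("hoodie.table.name", "trips")
    .option("hoodie.datasource.write.recordkey.field", "trip_id")
    .option("hoodie.datasource.write.partitionpath.field", "ds")
    .option("hoodie.datasource.write.precombine.field", "ds")
    .option("hoodie.datasource.write.operation", "upsert")
    .mode("overwrite")  # creates the table on first write; use append for later upserts
    .save("s3://my-lake/trips"))  # hypothetical bucket
```

Because the table lives in an open format on shared storage, a second engine such as Trino (via its Hudi connector and a shared metastore, assuming a catalog named hudi) can query the same files, e.g. SELECT driver_id, sum(fare) FROM hudi.default.trips GROUP BY driver_id, which is the “alternative data warehouse” combination Justin Borgman describes.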
Compared to a decade ago, or even just a few years ago, many more companies are running on the public cloud. The market has responded with an explosion of data tools that can be optimized for a company’s needs and fully operated on their behalf, while still using open standards. “The bar to build a team out internally and produce a quality user experience for internal users goes up every year as the tools that are available externally also go up,” Jay said. “Increasingly, there's efficiency pressure on these companies, and they're looking at: ‘Where are our best engineers spending time?’”
"The bar to build a team out internally and produce a quality user experience for internal users goes up every year." Jay Kreps, Confluent
There remains a small niche of tech companies that will build their own solutions, and ideally open-source them to keep pushing the industry forward. However, for enterprise customers that aren’t in tech, data is a means to an end. Even if they use an open source engine, they want a managed service: they’re often buying Confluent, not rolling their own solution using open source Kafka or Flink. “At some level,” Raghu said, “the conversation begins not with ‘should I buy or build?’ but with ‘what should I buy?’”
The answer to this question depends on a lot of factors. Are you willing to do infrastructure management such as provisioning clusters? How will you handle the fact that different solutions may have different standards, metadata, or access controls? What will the end-to-end experience look like for your team? More than ever, being able to do this analysis and think like a buyer has become a vital skill set for data engineers.
The discussion wrapped up with the panelists reflecting on what changes we might expect in data and open source over the next few years.
AI is, of course, a big one. “Besides just the models, there’s a lot of tooling around managing the entire lifecycle, and I think we’ll see a lot more open source activity there,” Kapil said. AI is also poised to change how we interact with data, finally opening up the possibility for natural language queries to replace SQL.
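As a rough illustration of that idea, here is a minimal sketch of natural-language-to-SQL, assuming an LLM completion call; llm_complete, the schema, and the canned answer are hypothetical stand-ins, not any specific product’s interface.

```python
# Minimal sketch: translate a natural language question into SQL for a known
# schema. All names here are hypothetical illustrations.
SCHEMA = "trips(trip_id STRING, driver_id STRING, fare DOUBLE, ds DATE)"

def llm_complete(prompt: str) -> str:
    # Hypothetical stand-in for a real LLM client; returns a canned answer so
    # the sketch runs end to end. Swap in an actual completion API call.
    return ("SELECT driver_id, SUM(fare) FROM trips "
            "WHERE ds = CURRENT_DATE - INTERVAL '1' DAY GROUP BY driver_id")

def question_to_sql(question: str) -> str:
    prompt = (
        f"Given the table {SCHEMA}, write one SQL query that answers:\n"
        f"{question}\nReturn only the SQL, no explanation."
    )
    return llm_complete(prompt)

sql = question_to_sql("What were total fares per driver yesterday?")
# In practice the generated SQL should be parsed and checked against an
# allow-list of tables before it is ever executed.
print(sql)
```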
Batch and stream processing are converging—a trend that seems unlikely to slow down. “Kafka's getting better at sending data out to object storage, so that stream-in-storage is happening. Systems like Hudi are doing a great job of some of the incremental ingestion and processing,” Jay said. “It brings a lot of the sophisticated use of data out of these old end-of-the-day batch jobs, right into the operation of the company, which is where it can have impact.”
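Hudi’s incremental queries are one concrete form of this convergence: rather than re-reading a whole table in an end-of-day batch job, a downstream job pulls only the records committed since its last checkpoint. Below is a minimal sketch, reusing the Spark session and hypothetical table path from the earlier example; the begin instant is an assumption, and the option keys follow the Apache Hudi Spark datasource documentation.

```python
# Minimal sketch: Hudi incremental query, reading only records committed
# after a checkpointed commit instant instead of rescanning the table.
incremental = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    # Commit instant saved by the previous run; the value is a hypothetical example.
    .option("hoodie.datasource.read.begin.instanttime", "20231001000000")
    .load("s3://my-lake/trips")  # hypothetical bucket from the earlier sketch
)
incremental.createOrReplaceTempView("trips_delta")
spark.sql("SELECT driver_id, SUM(fare) FROM trips_delta GROUP BY driver_id").show()
```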
“The need for open and cross-compatible data formats will continue to accelerate.” - Kapil Surlaker, LinkedIn
Convergence will also keep happening between data lakes and data warehouses, and between on-premises and cloud storage. That said, “there will always be purpose-built systems, and your architecture will always evolve—even if 80 or 90% lives in a lake,” Justin Borgman said. If anything, this just makes open source more important. “I think, and hope, that the need for open and cross-compatible data formats will continue to accelerate,” Kapil said.
To hear all of the panelists’ predictions and listen to the full discussion, watch the recording here. A full transcript of the highlights video shared in this blog post follows.
Speakers appear in the following order in the video above:
Vinoth Chandar, CEO at Onehouse
Raghu Ramakrishnan, CTO for Data at Microsoft
Justin Borgman, Chairman and CEO at Starburst
Kapil Surlaker, VP of Engineering at LinkedIn
Jay Kreps, Co-founder and CEO at Confluent
Justin Levandoski, Director of Engineering for BigQuery at Google
Praveen Naga, VP of Engineering and Data Science at Uber
Vinoth Chandar, CEO at Onehouse
I don’t think many companies were expecting to prepare an AI or LLM strategy when the year started, actually. What do you think are some of the most impactful technology shifts, like fundamental changes that you think are going to happen in the next five years?
Raghu Ramakrishnan, CTO for Data at Microsoft
Every time I've done a review with Bill G over the last many years, invariably he ends with this question: when can I actually query your database in English? And until now, I always said, "That's a great question, Bill," which is what you say when you don't have a great answer. This time around, I showed him an assistant, a copilot in Microsoft jargon. And that's something I will point to as a direction. The databases of tomorrow are going to be very, very different from the databases we have been used to. No one will ever write a SQL query from scratch anymore. No one will ever tune a database from scratch anymore. No one will build a data pipeline from scratch anymore.
Vinoth Chandar
Bringing all forms of data to a single platform, with universal access, federated, that's been a dream of yours. Is this going to happen this time? What do you think?
Justin Borgman, Chairman and CEO at Starburst
There are probably a couple of changes that are important between the first data lakes with Hadoop and where we are today. One of them is certainly that the formats themselves have greatly improved in terms of their level of optimization. We use columnar formats now to get great read performance, and they support updates and deletes. So the functionality gap between the traditional data warehouse and what you can do now with a query engine on open formats has closed dramatically. The other thing is the engines themselves, like Trino, have matured significantly. I mean, it's been almost a decade now since Trino was first created.
These are really battle-tested engines at this point that have all the bells and whistles, cost-based query optimizers and all of that, which just takes time.
I think in the early days of Hadoop we probably oversold the capabilities of those early Hadoop engines. And the reality is, building a database is hard and it takes years to build all those pieces. A query engine is all the hard parts of a database. But a lot of work has gone into these things, to the point where you can credibly run an alternative data warehouse with performance that's nearly identical, without the lock-in and without the cost and expense. And I think that's game-changing, and it's real, it's really happening now.
Vinoth Chandar
We touched upon this thing about the right query engine for the right workload. Do you think it's more of a need or a want? Is it just mostly things that you think the bigger tech company should do or what do you think is stopping people from picking the right engine for the right workload out there at large?
Kapil Surlaker, VP of Engineering at LinkedIn
So I think using the right query and compute engine for specific workloads is absolutely a need rather than just a want. You have a diversity of workloads: for some you need real-time data, for some you need more historical data, for some you need much faster query performance because it's site-facing. It's really hard to find a single engine that's just going to do it all and tick all your boxes.
Now of course, there is a cost to it. Every time you get a new engine into your ecosystem, you have the complexity of integration and the learning curve for your engineers. If it's a proprietary system, you have the potential risk of vendor lock-in. To me, one of the biggest blockers in enabling that is really the data consistency and compatibility problem.
So typically a lot of these query engines come with their data formats and expectations locked in, and therefore you have the situation where one engine can't read the data that is, maybe, created by another, or they're duplicating data, and therefore you sometimes get different answers based on which engine you go to.
And that's probably the reason why people are not able to pick and choose the right compute and query engines for their workloads. That's why we're hugely appreciative of projects like Apache Hudi, obviously. Disconnecting those data layers from the compute and query layers is, I think, the most critical factor in enabling that innovation.
Vinoth Chandar
What framework should one use when deciding between building in-house DIY and just off the shelf open source versus using a managed service?
Jay Kreps, Co-founder and CEO at Confluent
I think the world has changed a lot. At LinkedIn, when I was there, it was definitely a set of in-house things and things from open source we would bring in, and that was the best option available. But, you know, something that's fully operated for you, has a ton of functionality, and is heavily optimized is fantastic. And I think it's very appealing. It doesn't mean there's no point to the open source, both because people love an open standard and because it gives you a good exit to do it yourself if the vendor isn't doing a good job.
But yeah, I think the set of companies that are now addressed by managed services is just much larger, and the bar to build a team out internally and really produce a quality user experience for the internal users just kind of goes up every year.
I think the skillset in this area has changed. The data engineer or large-scale database person of the past was mostly about "how do I take an open source product and run it?" or "how do I build something from scratch?" Now I think the buy component of that is actually quite important: "Hey, how do I evaluate different existing services? How do I pick something I'm going to be comfortable running off of for many years, make sure I get a cost-effective deal out of that, and how do I trade that off against doing it myself?"
So I don't think there's a hard and fast rule of always build or always buy, but I think the buy is becoming a much bigger part of things and I think it's because of the cloud.
Vinoth Chandar
Billion-dollar question, do you think lakes and warehouses are going to converge?
Justin Levandoski, Director of Engineering for BigQuery at Google
I think warehouses and lakes have already converged in meaningful and interesting ways. I think it's standard today to look across the cloud offerings, analytics offerings in particular, and see an architecture that supports a serverless consumption model, an architecture that cleanly separates compute and storage, and maybe more components such as a catalog and so forth, with open source engines and open formats being used meaningfully across these ecosystems. And this really isn't by accident. If you look at how data warehousing and data lakes have evolved over the past, say, 10-plus years, there are several commonalities from the beginning: large data sizes, the need for scale, and so on. I think data lakes brought about interesting new workload types, but if you look at where customers want to go, especially the enterprises buying these services, it's natural that they want these things to converge in meaningful ways while still having the freedom to plug and play and choose at each layer of the stack. So at the end of the day, they have data assets that they want to manage and analyze, and ideally they see things under a single umbrella, not as disparate systems.
Vinoth Chandar
So Uber is a very unique data company. I also lived that era at Uber. It requires real-time data at massive scale. You do critical things like matching supply and demand. So how have you architected with open source data over the years? Are there tools or platforms that have been particularly impactful in getting Uber to where it is today?
Praveen Naga, VP of Engineering and Data Science at Uber
The reason why we are able to succeed and scale is because of open source. If we had decided, "Hey, we're going to do everything ourselves," I don't think this company would've existed, honestly.
The real-time needs are actually very real: operations on the ground need to be able to see what trips are happening and how drivers are positioned. And then we have to make payments to drivers, these millions of drivers, on time, and these folks are coming to our platform to earn money. So anyway, this is where Vinoth came in; he was very much involved in creating the transactional data lake as a concept to begin with, and that became a streaming transactional data lake, and that's where Hudi came in. And I would say that's been our biggest benefit, where we are able to run this data lake in an efficient manner and provide real-time insights for all the different domains that I talked about.