Exactly one year ago, we announced Onehouse - an open, fully-managed, cloud-native data lakehouse service - to radically improve time-to-value for state-of-the-art data lakes. Onehouse provides the core data infrastructure services needed to build and operate data lakes on top of open formats, making an otherwise arduous journey painless.
Since launching, we’ve worked with several early users to bring our product vision to life and power their production data lakes. Our goal is to deliver the ease of use and automation of the cloud data warehousing stack on top of data lakehouse technology, while also providing much-needed cost-effectiveness and performance benefits to users.
As an important milestone on this journey, I am thrilled to announce our $25M Series A funding led by Addition and Greylock Partners. I am honored to have Jerry Chen (Greylock) and Aaron Schildkrout (Addition) join our board.
During this time, we have also engaged with over 100 organizations on their data lake and data warehousing challenges. In the sections below, we share how they have helped shape our roadmap, along with industry trends and our longer-term vision for cloud data infrastructure.
Back in 2021, most of the organizations we talked to were curious about the new data lakehouse technologies. However, 2022 was, in many ways, the breakout year for the data lakehouse, where almost all organizations we talked to were actively evaluating this shift. As someone who has been working on data lakehouse technology even before it was called “lakehouse,” I couldn’t be more excited to see the incredible growth and momentum for the category. Apache Hudi saw record levels of engagement last year as companies, big and small, used the platform to build their lakes. Almost all major cloud warehouses and cloud data lake engines now have integrations for the three major data lakehouse storage projects. Many vendors like Ahana, Dremio and Starburst have rebranded their entire cloud product offerings around the data lakehouse.
All said, Gartner still rightfully places data lakehouses at the peak of the hype cycle, where buoyant optimism runs ahead of a clear understanding of which workloads the technology can and cannot support. At Onehouse, we have a team that has been in the trenches operating arguably the largest transactional data lake on the planet. We want to learn from the shortcomings of the original Hadoop EDW vision, where euphoria and optimism ultimately failed to translate into the mature, easy-to-use software services needed to adopt the technology quickly. We are approaching the inevitable platformization of the data lakehouse with a keen focus on the automation, operational excellence, and technology innovations necessary, informed by our hard-earned battle scars.
Almost unanimously, users were wary of moving from one vertical technology stack to another. Many of these users had migrated from an on-prem data warehouse to a cloud data warehouse just a few years ago and are now facing critical business problems. Costs are ballooning as their data volumes and number of pipelines grow. As these organizations start their data science efforts, they also need data in open formats with programmatic access, which is hard to achieve in these warehouses. Finally, even with an alternative technology stack built on open formats, there is fear of lock-in from various proprietary services. Most engine vendors focus only on their own engines' workloads or data formats when it comes to managing the underlying data. This leaves users fending for themselves, stuck between analysis paralysis and costly data migrations. We are already living in an evolving, multi-engine world where users pick different engines for traditional analytics, stream processing, data science, machine learning, and artificial intelligence. The most popular engines querying Apache Parquet/ORC today are different from those of five years ago.
Onehouse believes the right user-centric approach is to bring different engines to the data, not the other way around. Onehouse enables horizontal integration of different engines over a common, well-managed cloud data store, so that standard services like data ingestion, data clustering, indexing, and GDPR deletions can be performed once and leveraged across multiple engines. In fact, we can directly address some of the cost concerns cited above by judiciously moving workloads off cloud warehouses and reducing the use of high-cost proprietary managed services. Today, we are doubling down on our commitment to unlocking this multi-engine ecosystem by announcing the Onetable feature, which seamlessly interoperates across Apache Hudi, Delta Lake, and Apache Iceberg to unlock proprietary performance optimizations found inside vendor platforms like Databricks or Snowflake. Our follow-on efforts will highlight which engines are best suited for which workloads and why.
Many users we met were quite savvy and fully aware of the tradeoffs between a vertical cloud warehousing stack and a more disaggregated, flexible data lake architecture. They simply got started with a cloud warehouse because it is a fully managed service that reduces the operational burden on their data teams. The idea was that, at sufficient scale, they could justify expanding their data teams with experienced data engineers who could build out their data lakehouse using technologies like Apache Hudi. However, data has immense inertia, and successfully executing a company-wide data migration proves to be a daunting, multi-year project.
Even as the world was enamored with generative AI last year, basic data lake infrastructure remains painfully manual. One of the most important founding goals of our company is to change this status quo. We challenge ourselves daily to build products and technology that can get data lakes up and running within minutes with just a few easy clicks. Users should then be able to connect various query engines to their lake, perform workload evaluations, and go live within days. We were pleasantly surprised when our very first user was able to go live within a few days, with a use case as complex as near real-time CDC into a data lake on AWS. Finally, these benefits are available even to non-Onehouse users, by way of Apache Hudi’s comprehensive set of data services. Going forward, we would like to contribute even more toward making Onehouse/Apache Hudi the first stop on a user’s cloud data journey.
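To make that CDC-to-lake pattern concrete, here is a minimal PySpark sketch of applying a batch of change records to an Apache Hudi merge-on-read table. The paths, table name, and field names are hypothetical, and a managed pipeline would automate the scheduling and checkpointing around it; the sketch only shows the core upsert primitive that open source Hudi exposes.

```python
# Minimal sketch: upserting CDC records into an Apache Hudi merge-on-read table.
# Paths, table name, and field names below are hypothetical examples.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cdc-to-hudi-sketch")
    # The Hudi Spark bundle must be on the classpath, e.g. via --packages.
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# Pretend these are freshly captured change records (inserts/updates) landed
# by a CDC tool; in practice they might arrive via Kafka, Debezium, etc.
changes = spark.read.parquet("s3://my-bucket/cdc/orders/latest/")

hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",      # primary key
    "hoodie.datasource.write.precombine.field": "updated_at",   # latest record wins
    "hoodie.datasource.write.partitionpath.field": "order_date",
    "hoodie.datasource.write.operation": "upsert",              # mutate in place
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",      # log-based, low-latency writes
}

# Each batch of changes is applied as an upsert; Hudi handles indexing,
# file sizing, and compaction via its table services.
(
    changes.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://my-bucket/lake/orders/")   # hypothetical table base path
)
```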
While the idea of using open table formats to scale metadata for large immutable tables has garnered much attention in the last year, that barely scratches the surface of how transformational a technology like Apache Hudi can be. The original motivations for creating the project at Uber in 2016 centered on a bold new vision for replacing batch data processing on lakes and warehouses, which required faster mutability on data lakes. Back then, stuck in batch data processing, we at Uber could only dream of incrementally processing mutable transactional data streams in near real-time on a data lake. Today, Apache Hudi users can do this easily with a few commands on any cloud provider. Hudi has accomplished this by innovating on genuinely hard computer science problems around indexing, merge-on-read storage formats, asynchronous table services, scalable metadata, non-blocking concurrency control, and built-in support for change data capture, all of which optimize for use cases that need mutability.
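As a small illustration of that incremental processing primitive, the sketch below (the table path and checkpoint commit are hypothetical) shows a Hudi incremental query in PySpark that reads only the records changed since a previous commit, instead of rescanning the entire table.

```python
# Minimal sketch of a Hudi incremental query: pull only the records written
# after a given commit, rather than re-reading the whole table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-incremental-sketch").getOrCreate()

base_path = "s3://my-bucket/lake/orders/"   # hypothetical Hudi table path
last_processed_commit = "20230101000000"    # checkpoint saved from the previous run

incremental_df = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", last_processed_commit)
    .load(base_path)
)

# Downstream logic touches only the changed rows, e.g. to update a derived
# (silver) table, avoiding full recomputation.
incremental_df.createOrReplaceTempView("orders_changes")
spark.sql("SELECT count(*) AS changed_rows FROM orders_changes").show()
```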
Over the past year, we have put together thorough, peer-reviewed comparisons that call out these design choices and technical differentiators, and show that several large enterprises are running advanced incremental processing workloads on Hudi. We have brought the same incremental processing magic to Onehouse as well, where users can build the bronze and silver layers of the medallion architecture with virtually no code, in an entirely incremental fashion that avoids recomputing data. As market vendors and industry pundits make their 2023 predictions, we want to throw in our own. We predict that data lakehouse technology will chart a path similar to the major relational databases, following the Lindy effect: the longer a technology has been around, the more likely it is to remain around in the future. Onehouse aims to further advance the incremental processing vision through continued contributions to the Apache Hudi project and product innovations. We intend to share more on this later this year.
As we open a new chapter in the company’s life, I want to sincerely thank all our early design partners, users, and customers for their support and feedback. I also want to thank our small but super-talented team for their dedication and hard work. And finally, I want to thank the amazing Apache Hudi community for its continued collaboration and for pushing the edge of innovation in open source data lake platforms.
It’s not an exaggeration to say that Onehouse’s success could have a profound impact on the industry, finally decoupling data storage and management from the different compute engines that operate on the data and moving us past data lock-in forever. We will approach this vision with all sincerity and our first principles intact. Onwards.