As the volume of data that companies manage and optimize grows exponentially, the demand for efficient, scalable, and cost-effective data management solutions is more urgent than ever. Data lakehouses have emerged as a promising alternative to more rigid, traditional options such as data warehouses because of their open architecture, but their implementation remains complex for many organizations.
Onehouse offers a fully managed data lakehouse that simplifies this process, empowering companies to access the full potential of their data without latency or overly complex services. In a recent webinar, Emil Emilov, Principal Software Engineer at Conductor, shared how his team overcame the challenge of managing terabytes of data across multiple services by transitioning to Onehouse’s managed data lakehouse “basically overnight,” immediately simplifying Conductor’s data infrastructure and significantly enhancing operations.
Traditional closed data warehouse systems have many key scaling limitations, including:
Data lakehouses address these problems by offering the benefits of traditional storage systems and increasing data freshness. They are a relatively recent development, and prominent examples include Apache Hudi (2017), Apache Iceberg (2018), and Databricks’s Delta Lake (2019).
Data lakehouses can meet scaling needs by offering:
Despite these benefits, teams may be hesitant to transition to a data lakehouse because of the time-consuming, complex process of building and maintaining one internally. Onehouse, powered by Apache Hudi, solves this problem for teams by managing the lakehouse itself while providing users with a simple, highly customizable dashboard interface to visualize, model, and process their data.
Onehouse streamlines the creation of data lakehouses, offering features that abstract away setup and management. These services include:
Conductor, a leading SEO and content marketing platform, faced significant data management challenges. With terabytes of data spread across multiple services, such as Aurora and Snowflake, their data infrastructure was becoming increasingly fragmented and difficult to scale. They sought to build a unified, scalable data platform that could support analytics for AI, business intelligence (BI), and user-facing services.
“Onehouse has been a godsend for us.” - Emil Emilov, Conductor
Emil Emilov, Principal Software Engineer at Conductor, shared how Onehouse met this challenge. He said that Onehouse “has been a godsend” for Conductor, and that Onehouse has “covered 90-plus percent of our needs.”
Within a short time, Conductor implemented a cost-effective Hudi/S3-backed data lakehouse using Onehouse, which allows for:
In particular, Conductor was focused on improving their user-facing analytics. With a large user base relying on real-time data insights, they needed a custom solution that could handle high-concurrency and low-latency queries for their front-end applications.
“They actually taught us how to partition things properly.” - Emil Emilov, Conductor
Conductor also emphasized the “excellent support” that they receive from the Onehouse team. Emilov said, “We also get a lot of learning from Onehouse. They actually taught us how to partition things properly and whatnot.”
Onehouse worked with Conductor to solve their user-facing challenge by implementing a custom bucketing system. Emilov explained that standard partitioning, which typically divides data by day, led to uneven partitions: some days generated up to 1 TB of data while others produced only 20 GB. By switching to a custom system that groups data by year, week, and bucket number, Conductor reduced query times from 30 seconds to just 6 seconds, a major improvement, especially for user-facing analytics.
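To make the year/week/bucket idea concrete, here is a minimal Python sketch of how such a partition path could be derived. It is an illustration only, not Conductor's or Onehouse's actual implementation: the webinar recap does not say how bucket numbers are assigned, so the hash of a record key (here a hypothetical account ID), the bucket count of 16, the use of ISO weeks, and the `year=/week=/bucket=` path layout are all assumptions.

```python
import hashlib
from datetime import date

NUM_BUCKETS = 16  # assumption: the source does not state a bucket count


def bucket_for(key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a record key to a stable bucket number via a hash (assumed scheme)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def partition_path(event_date: date, key: str, num_buckets: int = NUM_BUCKETS) -> str:
    """Build a year/week/bucket partition path for one record."""
    # ISO year and week are used so that days around New Year land in a
    # consistent week; this is a choice made for the sketch.
    iso_year, iso_week, _ = event_date.isocalendar()
    bucket = bucket_for(key, num_buckets)
    return f"year={iso_year}/week={iso_week:02d}/bucket={bucket:03d}"


if __name__ == "__main__":
    # "account-42" is a hypothetical key used only for illustration.
    print(partition_path(date(2024, 1, 1), "account-42"))
```

One plausible reading of why this helps: grouping by week smooths out day-to-day swings in volume, and a fixed number of buckets splits each week into similarly sized chunks, so no single partition has to hold a 1 TB day.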
“Some of the features that Onehouse now has is because we asked for them. It's really cool working together.” - Emil Emilov, Conductor
Even with these improvements, Conductor still faced query latency challenges, the “speed bump from hell,” as Emilov described it. Their user-facing services pull a small subset of data from the larger database, making it crucial to avoid large data scans and optimize query pruning. Because many users start queries at the same time, query concurrency also became a problem, and managing the resulting workload was a complex task.
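The pruning requirement can be sketched in a few lines. Assuming a year/week/bucket layout like the one described above and a hypothetical hash-based bucket assignment (neither detail is confirmed by the source), a user-facing query for one key over a date range only needs to touch one bucket per week instead of every partition in the range. The function names, the key, and the bucket count below are illustrative assumptions.

```python
import hashlib
from datetime import date, timedelta
from typing import List

NUM_BUCKETS = 16  # assumption for illustration


def bucket_for(key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Stable hash-based bucket for a key (assumed scheme)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def partitions_to_scan(start: date, end: date, key: str,
                       num_buckets: int = NUM_BUCKETS) -> List[str]:
    """List the only partitions a single-key query over [start, end] must read."""
    weeks = []
    day = start
    while day <= end:
        iso_year, iso_week, _ = day.isocalendar()
        if (iso_year, iso_week) not in weeks:
            weeks.append((iso_year, iso_week))
        day += timedelta(days=1)
    bucket = bucket_for(key, num_buckets)
    return [f"year={y}/week={w:02d}/bucket={bucket:03d}" for y, w in weeks]


if __name__ == "__main__":
    paths = partitions_to_scan(date(2024, 1, 1), date(2024, 1, 14), "account-42")
    # Two ISO weeks are covered, so 2 partitions are read rather than 2 * 16.
    print(len(paths), "of", 2 * NUM_BUCKETS, "partitions scanned")
```

In practice the query engine performs this pruning from the table's partition metadata and the query's filter predicates; the sketch only shows why a layout that matches the access pattern keeps scans small.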
To meet their target of sub-second query times, Conductor explored various query engines, including Redshift and Athena. Ultimately, they opted for StarRocks, a distributed, open-source columnar database that is low-latency and Hudi-compatible. To reach their performance goals, they chose StarRocks’s shared-nothing architecture, a more costly option that relies on local storage with S3 spillover.
Onehouse took complete ownership of Spark management, handling all administrative and operational tasks for Conductor’s Spark clusters. Emilov highlighted that with Onehouse’s support, Conductor was able to “just treat [Spark] like a black box,” relying on Onehouse for all performance tuning and system upkeep instead of managing the complex Spark environment themselves. This freed Conductor to focus on query performance, using StarRocks’s advanced features, such as a cost-based optimizer (CBO) and a visual query planner, to fine-tune queries without the burden of running Spark operations internally.
“I'm super happy we don't have to manage Spark. That's one of the big wins that we have here.” - Emil Emilov, Conductor
Despite these successes, Conductor faces hurdles such as S3 latency and complex queries. By transitioning to a shared-nothing architecture and refining their ingestion processes, they are working toward minimizing latency and maximizing performance. They continue to evolve their data platform and plan to integrate more automation, streamline their APIs, and explore new tools and technologies, such as the upcoming StarRocks 3.3 release.
Emilov shared some key advice based on Conductor’s experience: “Start simple.” Avoid “big bang” solutions and instead focus on building clean, working pipelines first. Proper data modeling and benchmarking with real, application-based queries and data will help avoid future bottlenecks.
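As one way to act on the benchmarking advice, the sketch below replays a list of application queries concurrently and reports latency percentiles. It is a generic harness written for illustration, not something shown in the webinar: `run_query` is a hypothetical callable that you would wire to your own query engine client, and the concurrency level and percentile choices are assumptions.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence


def timed(run_query: Callable[[str], object], sql: str) -> float:
    """Run one query and return its wall-clock latency in seconds."""
    start = time.perf_counter()
    run_query(sql)
    return time.perf_counter() - start


def percentile(sorted_values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of an already sorted, non-empty sequence."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def benchmark(run_query: Callable[[str], object], queries: List[str],
              concurrency: int = 8) -> Dict[str, float]:
    """Replay queries with a thread pool and summarize their latencies."""
    if not queries:
        raise ValueError("queries must not be empty")
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(lambda q: timed(run_query, q), queries))
    return {
        "count": len(latencies),
        "p50": percentile(latencies, 50),
        "p95": percentile(latencies, 95),
        "max": latencies[-1],
    }


if __name__ == "__main__":
    # Stand-in for a real client call; replace with your engine's driver.
    fake_run_query = lambda sql: time.sleep(0.01)
    print(benchmark(fake_run_query, ["SELECT 1"] * 20, concurrency=4))
```

Feeding the harness queries captured from the real application, against production-sized data, is what makes the numbers meaningful; synthetic queries can hide the pruning and concurrency problems described earlier.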
Through partnerships like the one with Onehouse, by gradually building up complexity, and by optimizing along the way, organizations like Conductor can create scalable, high-performance platforms that meet the expanding needs of their business.
“You start with the simplest, but working, end-to-end pipeline and evolve from there.” - Emil Emilov, Conductor
The session ended with Q&A. Emilov was asked how much time Conductor would have spent to create a data lakehouse on their own. He wasn’t sure, but concluded that, in most cases: “You're better off using Onehouse, lest you spend way too much time. And there is never enough time and there is never enough money, obviously.”
As more organizations move toward open data architectures, the need for a managed solution that simplifies the complexities of data lakehouses is clear. Onehouse’s approach not only addresses the challenges of building and maintaining a lakehouse, but also enables efficient scaling while delivering real-time insights.
If you are interested in exploring how these technologies can transform your organization’s data management strategies, we invite you to watch the full webinar and discover the possibilities of a modern data landscape.