October 31, 2024

Conductor Unlocks the Power of Open Data Architectures with Onehouse

As the volume of data that companies manage and optimize grows exponentially, the demand for efficient, scalable, and cost-effective data management solutions is more urgent than ever. Data lakehouses have emerged as a promising alternative to more rigid, traditional options such as data warehouses because of their open architecture, but their implementation remains complex for many organizations. 

Onehouse offers a fully managed data lakehouse that simplifies this process, empowering companies to access the full potential of their data without latency or overly complex services. In a recent webinar, Emil Emilov, Principal Software Engineer at Conductor, shared how his team overcame the challenge of managing terabytes of data across multiple services by transitioning to Onehouse’s managed data lakehouse “basically overnight,” immediately simplifying Conductor’s data infrastructure and significantly enhancing operations.

Challenges with Modern Data Infrastructure

Traditional closed data warehouse systems have many key scaling limitations, including:

  1. Data Latency: Batch data processing means that organizations don’t always have the freshest data available, limiting their ability to provide real-time analytics.
  2. Cost and Inflexibility: Storing data in closed systems can be expensive, with vendors charging premium rates for the dual storage and compute requirements. Andy Walner, Product Manager at Onehouse, explained why organizations face these challenges with traditional closed systems, saying “Closed systems are inflexible … you end up duplicating pipelines to handle more use cases.”
  3. Managing Complexity: As organizations scale and expand their data services, particularly when adding AI/ML functionality, they often create multiple, nearly identical data pipelines. This adds costs and introduces risks around data consistency and governance.

Data lakehouses address these problems by offering the benefits of traditional storage systems while increasing data freshness. They are a relatively recent development, and the prominent open table formats behind them include Apache Hudi (2017), Apache Iceberg (2018), and Databricks’s Delta Lake (2019).

Benefits of Data Lakehouses

Data lakehouses can meet scaling needs by offering:

  1. Cost-Efficiency: As with traditional data lakes, data remains in your existing cloud storage (such as Amazon S3 or Google Cloud Storage), so you control your storage options while performing ingestion and ETL directly on files.
  2. Flexible Querying: Traditional databases often require duplicating data to support different query engines or custom use cases, which adds complexity and overhead. With a data lakehouse, organizations can eliminate that duplication while still leveraging various query engines. Lakehouses also retain traditional database features such as ACID (atomicity, consistency, isolation, durability) guarantees, so transactions are reliable and data integrity is maintained across all use cases without added operational burden.
  3. High-Performance ETL: Open source solutions such as Apache Hudi can be used to support fast queries and efficient data management while ensuring data freshness. Hudi’s incremental processing capabilities allow organizations to capture and query data in near-real time, ensuring that data pipelines serve up-to-date information without the need for full reprocessing.
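To make the idea of incremental processing concrete, here is a minimal sketch in plain Python. It is a toy model and not Hudi's API: the `Record` type, the integer `commit_time`, and the `incremental_pull` helper are all hypothetical names used only for illustration. The point is simply that a pipeline remembers the last commit it processed and reads only what arrived afterward, rather than reprocessing the whole table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    key: str
    value: int
    commit_time: int  # hypothetical, monotonically increasing commit marker


def incremental_pull(records, last_commit):
    """Return records committed after `last_commit`, plus the new checkpoint."""
    fresh = [r for r in records if r.commit_time > last_commit]
    # If nothing is new, the checkpoint stays where it was.
    new_checkpoint = max((r.commit_time for r in fresh), default=last_commit)
    return fresh, new_checkpoint


table = [
    Record("a", 1, commit_time=100),
    Record("b", 2, commit_time=100),
    Record("a", 3, commit_time=200),  # a later update to key "a"
]

# First run: nothing has been processed yet, so every record is new.
batch, checkpoint = incremental_pull(table, last_commit=0)

# A new commit lands; the second run reads only that commit.
table.append(Record("c", 4, commit_time=300))
batch2, checkpoint2 = incremental_pull(table, checkpoint)
```

In this sketch the second run touches one record instead of four, which is the property that keeps downstream pipelines fresh without full reprocessing.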

Despite these benefits, teams may be hesitant to transition to a data lakehouse because of the time-consuming, complex process of building and maintaining one internally. Onehouse, powered by Apache Hudi, solves this problem for teams by managing the lakehouse itself while providing users with a simple, highly customizable dashboard interface to visualize, model, and process their data.

Onehouse streamlines the creation of data lakehouses, offering features that abstract away setup and management. These services include:

  1. Continuous Ingestion: Connect to any data source for near-real-time data ingestion, ensuring your data is always current.
  2. Managed Table Services: Automatically optimize tables for enhanced ETL and query performance, without the burden of manual optimizations.
  3. Interoperability: Onehouse enables access to any query engine, which lets users choose the engine that best fits their needs. Whether working with Hudi, Delta, Iceberg, or other options, Onehouse ensures seamless integration and flexibility.

Conductor’s Data Journey

Conductor, a leading SEO and content marketing platform, faced significant data management challenges. With terabytes of data spread across multiple services, such as Aurora and Snowflake, their data infrastructure was becoming increasingly fragmented and difficult to scale. They sought to build a unified, scalable data platform that could support analytics for AI, business intelligence (BI), and user-facing services.

“Onehouse has been a godsend for us.” - Emil Emilov, Conductor

Emil Emilov, Principal Software Engineer at Conductor, shared how Onehouse met this challenge. He said that Onehouse “has been a godsend” for Conductor, and that Onehouse has “covered 90-plus percent of our needs.” 

Within a short time, Conductor implemented a cost-effective Hudi/S3-backed data lakehouse using Onehouse, which allows for:

  1. Cost-Effective and Scalable Architecture Without Migration: Onehouse’s “bring your own cloud” model allowed Conductor to maintain their data in Amazon S3 buckets while benefiting from the managed Hudi environment. 
  2. Managed Spark on EKS: Onehouse handled Spark clusters running within Conductor’s internal virtual private cloud (VPC), providing consistent support and simplifying what had previously been a complex environment. Onehouse implemented automatic Hudi table services and AWS Glue syncs, which streamlined Conductor’s workflow and enabled seamless integration with their existing infrastructure.

In particular, Conductor was focused on improving their user-facing analytics. With a large user base relying on real-time data insights, they needed a custom solution that could handle high-concurrency and low-latency queries for their front-end applications.

“They actually taught us how to partition things properly.” - Emil Emilov, Conductor

Conductor also emphasized the “excellent support” that they receive. Emil said, “We also get a lot of learning from Onehouse. They actually taught us how to partition things properly and whatnot,” crediting the Onehouse team.

Custom Bucketing for Performance Optimization 

Onehouse worked with Conductor to solve their user-facing challenge by implementing a custom bucketing system. Emil explained how standard partitioning, which typically divides data by day, led to uneven partitions, with some days generating up to 1TB of data while others produced only 20GB. By switching to a custom scheme that grouped data by year, week, and bucket number, Conductor reduced query times from 30 seconds to just 6 seconds, a fivefold improvement that matters most for user-facing analytics.
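The webinar did not spell out how the buckets were computed, so the following is only a sketch of what a year/week/bucket partition key could look like. The bucket count, the `customer_id` field, and the `partition_path` helper are assumptions made for illustration, not Conductor's or Onehouse's actual implementation.

```python
import zlib
from datetime import date

NUM_BUCKETS = 16  # hypothetical value; the real bucket count was not stated


def partition_path(event_date: date, customer_id: str,
                   num_buckets: int = NUM_BUCKETS) -> str:
    """Build a year/week/bucket partition path for one record."""
    # ISO calendar year and week keep a whole week in one partition group.
    iso_year, iso_week, _ = event_date.isocalendar()
    # crc32 is stable across runs, unlike Python's built-in hash() for strings.
    bucket = zlib.crc32(customer_id.encode("utf-8")) % num_buckets
    return f"year={iso_year}/week={iso_week:02d}/bucket={bucket:02d}"


# Two days in the same ISO week, same (hypothetical) customer.
monday = partition_path(date(2024, 10, 28), "customer-123")
thursday = partition_path(date(2024, 10, 31), "customer-123")
```

Grouping by week smooths out the day-to-day swings in volume, and hashing a key into a fixed number of buckets spreads each week's data across partitions of more even size. Both dates above fall in the same week, so records for the same key land in the same partition.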

“Some of the features that Onehouse now has is because we asked for them. It's really cool working together.” - Emil Emilov, Conductor

Query Latency and Sub-Second Performance

Even with these improvements, Conductor still faced query latency challenges — the “speed bump from hell,” as Emilov described it. Their user-facing services pull a small subset of data from the larger database, making it crucial to avoid large data scans and optimize query pruning. Since many users were starting queries at the same time, query concurrency became a problem. Managing the resulting workload was a complex task.
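As a rough illustration of why pruning matters for these small, user-facing lookups, the sketch below models a table as a list of partitions and selects only those whose partition columns match the query's filters. The `prune` helper and the partition layout are hypothetical; real query engines do this from table metadata rather than from an in-memory list.

```python
def prune(partitions, **wanted):
    """Keep only partitions whose partition columns match every requested value."""
    return [
        p for p in partitions
        if all(p.get(col) == val for col, val in wanted.items())
    ]


# Two weeks of data, four buckets each: eight partitions in total.
partitions = [
    {"year": 2024, "week": week, "bucket": bucket,
     "path": f"year=2024/week={week:02d}/bucket={bucket:02d}"}
    for week in (43, 44)
    for bucket in range(4)
]

# A lookup that filters on all three partition columns scans one partition.
to_scan = prune(partitions, year=2024, week=44, bucket=2)

# A lookup that filters only on the week still has to scan every bucket.
week_only = prune(partitions, week=44)
```

Here the fully filtered query reads one partition out of eight, while the week-only query reads four, which is why queries that line up with the partition scheme scan far less data.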

To meet their target of sub-second query times, Conductor explored various query engines, including Redshift and Athena. Ultimately, they opted for StarRocks, a distributed, open-source columnar database that is low-latency and Hudi-compatible. To achieve their performance goals, they chose the system’s shared-nothing architecture, a more costly option that uses local storage with S3 spillover.

Onehouse took complete ownership of Spark management, handling all administrative and operational tasks for Conductor’s Spark clusters. Emil highlighted that with Onehouse’s support, Conductor was able to “just treat [Spark] like a black box” without needing to manage the complex Spark environment, and rely on Onehouse for all performance tuning and system upkeep. Conductor was then able to focus on query performance improvements using StarRocks’s advanced features, such as a cost-based optimizer (CBO) and a visual query planner, to fine-tune query performance without the burden of managing Spark operations internally.

“I'm super happy we don't have to manage Spark. That's one of the big wins that we have here.” - Emil Emilov, Conductor

Overcoming Challenges and Future Directions

Despite these successes, Conductor faces hurdles such as S3 latency and complex queries. By transitioning to a shared-nothing architecture and refining their ingestion processes, they are working toward minimizing latency and maximizing performance. They continue to evolve their data platform and plan to integrate more automation, streamline their APIs, and explore new tools and technologies, such as the upcoming StarRocks 3.3 release.

Emilov shared some key advice based on Conductor’s experience: “Start simple.” Avoid “big bang” solutions and instead focus on building clean, working pipelines first. Proper data modeling and benchmarking with real, application-based queries and data will help avoid future bottlenecks. 

Through partnerships like the one with Onehouse, by gradually building up complexity, and by optimizing along the way, organizations like Conductor can create scalable, high-performance platforms that meet the expanding needs of their business. 

“You start with the simplest, but working, end-to-end pipeline and evolve from there.”  - Emil Emilov, Conductor

The Path Forward

The session ended with Q&A. Emilov was asked how much time Conductor would have spent to create a data lakehouse on their own. He wasn’t sure, but concluded that, in most cases: “You're better off using Onehouse, lest you spend way too much time. And there is never enough time and there is never enough money, obviously.”

As more organizations move toward open data architectures, the need for a managed solution that simplifies the complexities of data lakehouses is clear. Onehouse’s approach not only addresses the challenges of building and maintaining a lakehouse, but also enables efficient scaling while delivering real-time insights.

If you are interested in exploring how these technologies can transform your organization’s data management strategies, we invite you to watch the full webinar and discover the possibilities of a modern data landscape.
