October 24, 2024

Leveling Up the Data Lakehouse with LakeView and Table Optimizer

As part of its funding announcement on June 26, Onehouse announced two new products: LakeView and Table Optimizer. In a recent webinar, Kyle Weller and Andy Walner from Onehouse showed how these two products reduce costs and improve reliability for OSS-based data lakehouses. Read on for an overview, or view the full talk.   

Important requirements in areas such as observability, alerting, stability, and performance are often skipped over as teams iterate quickly on data analysis within large data lakehouses. When analysis is moving at an urgent pace, it is difficult to convince stakeholders to slow down and invest in the core infrastructure work that would make the lakehouse more reliable and productive over time.

The LakeView and Table Optimizer products are designed to alleviate this problem for Apache Hudi, with automatic monitoring, reporting, observability, and performance optimization services. They can improve data storage and processing performance by as much as 100x when used together, while reducing operational costs.

Kyle Weller, Head of Product at Onehouse, has worked in the data space for more than 10 years, including several years as product lead for Azure Databricks when it was new at Microsoft. Now, working at Onehouse since the first version of Onehouse Cloud was shipped, he’s seen firsthand how challenging it can be for lakehouse users to stay on top of operational and performance work. Andy Walner, Product Manager at Onehouse, previously worked on machine learning operations products at Google and Dashworks, where he saw analogous problems come up as data ingestion was scaled at the application layer. In this webinar, they demonstrate how LakeView and Table Optimizer can automate operations, reduce costs, and supercharge the performance of OSS-based data lakehouses.

Eliminate Engineering Overhead with Onehouse

Onehouse has been building out fully managed lakehouses at the largest scale since its inception. As the operator of a fully managed lakehouse service, Onehouse is constantly looking for ways to optimize the performance of the systems it is responsible for. Clients might have thousands of pipelines, process petabytes of data, and hit burst rates of terabytes processed per minute.

No matter where client data comes from—XML files shipped over SFTP, Oracle databases, PostgreSQL, Mongo, IBM DB2, or any other source—Onehouse offers a highly performant data ingestion solution that can be deployed in days. Once deployed, the platform will ingest data with low latency, and then store it in any needed lakehouse format. To support storage-agnostic services, last year Onehouse, alongside Microsoft and Google, launched and open-sourced Apache XTable (Incubating). XTable launched with support for Apache Hudi, Iceberg, and Delta Lake formats, and will soon support many others.
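
As a rough sketch of what a format sync looks like, the snippet below drives XTable from Python. The jar artifact name, config details, and bucket path are illustrative assumptions based on the XTable utilities' documented usage pattern, not Onehouse code; consult the XTable docs for the exact artifact and options.

```python
# Hedged sketch: expose a single Hudi table in Iceberg and Delta Lake formats
# by running the Apache XTable utilities bundle. Paths and the jar name are
# illustrative assumptions.
import subprocess
import textwrap

config = textwrap.dedent("""\
    sourceFormat: HUDI            # format the table is physically written in
    targetFormats:
      - ICEBERG                   # metadata layers XTable will generate
      - DELTA
    datasets:
      - tableBasePath: s3://my-bucket/lake/orders   # hypothetical table path
        tableName: orders
    """)

with open("xtable_config.yaml", "w") as f:
    f.write(config)

# Invoke the XTable utilities bundle (artifact name is an assumption).
subprocess.run(
    ["java", "-jar", "xtable-utilities-bundled.jar",
     "--datasetConfig", "xtable_config.yaml"],
    check=True,
)
```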

Onehouse has also built a managed transformation service that scales basic data transformations and includes multiple predefined data cleaning operations, such as for basic deduplication, masking, cleanup, and schema modification. The system also allows customers to write bespoke transformations that leverage Onehouse's managed support. 
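
This is not Onehouse's actual API, but a minimal PySpark sketch of the kinds of predefined cleaning operations described above (deduplication plus masking), assuming a hypothetical "events" table with an "email" column:

```python
# Hedged sketch of common data-cleaning transformations in PySpark.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cleaning-sketch").getOrCreate()

# Hypothetical source table path.
df = spark.read.format("hudi").load("s3://my-bucket/lake/events")

cleaned = (
    df.dropDuplicates(["event_id"])                      # basic deduplication on a key
      .withColumn("email", F.sha2(F.col("email"), 256))  # mask PII by hashing it
)

cleaned.write.mode("overwrite").parquet("s3://my-bucket/lake/events_clean")
```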

Data ingestion with Onehouse is often up to ten times faster than existing data pipelines. Onehouse-managed pipelines offer end-to-end change-data-capture (CDC) processing, moving data from popular databases to a shared data lakehouse, and then through any popular data processing engine. “We want to completely decouple ingestion and storage from processing, and make it so that you store and write your data once and it's well-optimized and well-managed,” says Weller.
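
To make the storage side of such a CDC pipeline concrete, here is a hedged sketch of upserting a batch of change records into an Apache Hudi table with PySpark. The paths and field names are hypothetical; the hoodie.* keys are standard Hudi write options, not Onehouse-specific configuration.

```python
# Hedged sketch: apply a CDC batch to a Hudi table via upsert.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cdc-sketch").getOrCreate()

# Hypothetical batch of change records captured from a source database.
changes = spark.read.json("s3://my-bucket/cdc/orders/batch.json")

hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",     # primary key
    "hoodie.datasource.write.precombine.field": "updated_at",  # latest change wins
    "hoodie.datasource.write.operation": "upsert",             # inserts and updates in place
}

(changes.write.format("hudi")
        .options(**hudi_options)
        .mode("append")
        .save("s3://my-bucket/lake/orders"))
```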

Why Build LakeView and Table Optimizer?

In conversations with customers and many other community members, Walner discovered some recurring pains in operating a lakehouse, pains that Onehouse, as a managed service provider, has experienced as well. When executing against tight deadlines, teams often don’t have the time or resources to build alerting, monitoring, observability, performance, and other production-grade infrastructure in advance. As a result, they end up in a reactive mode: building monitoring and alerting only after systems break, writing complicated custom queries for basic metadata analysis, fighting ongoing performance battles as the lakehouse composition shifts, and struggling to debug the system with the tools at hand, such as low-level scripts and command-line interface calls.

“You can’t just set up alerting and monitoring and then forget it. It requires constant monitoring and updates to remain relevant and keep the system performant,” explains Walner. So Onehouse built two products to address these problems for Hudi-based lakehouses: LakeView, which provides monitoring, observability, and reporting; and Table Optimizer, which automatically optimizes lakehouse storage and pipeline performance.

Introducing LakeView—Automatic Monitoring, Reporting, and Observability Support

LakeView is a free, out-of-the-box monitoring, observability, and reporting platform for Hudi-based data lakehouses and processing pipelines. Besides being free, it’s also extremely easy to set up, since the public GitHub repository contains a fully runnable version of the code. The GitHub version depends only on read access to a copy of a lakehouse’s Apache Hudi metadata, and comes with automation to get that metadata ingested quickly.

Automatic Monitoring

LakeView provides users with a bird’s-eye view of their tables. This includes a UI with charts for monitoring table state, curated dashboards built by lakehouse experts, and insights to help avoid problems before they compound.

It allows users to track changes, quickly notice if anything looks off, and proactively address issues before they affect downstream users.

Automatic Reporting

On a weekly cadence, LakeView emails an automatically compiled report that would otherwise have to be assembled by hand by an operations person or engineer.

The email is populated with the most relevant high-level metrics for a given lakehouse, along with configurable metrics, custom alerts, and other information, keeping everyone informed about how the lakehouse is doing without requiring manual updates.

LakeView will monitor and report on common performance metrics, such as data skew (across both files and partitions) and data distribution across partitions. This popular feature helps users identify bottlenecks that might develop as data fills tables. DataOps team members can adjust clustering and partitioning strategies over time to address problems before they affect performance.
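
For intuition about the skew metric, here is a hedged sketch of how file-size skew per partition could be computed by hand from storage. The table path is hypothetical, and in practice LakeView derives this from Hudi metadata rather than walking the filesystem.

```python
# Hedged sketch: file-size distribution per partition for a local table copy.
from collections import defaultdict
import os

base = "/data/lake/orders"  # hypothetical local table path
sizes = defaultdict(list)

for root, _dirs, files in os.walk(base):
    for name in files:
        if name.endswith(".parquet"):
            partition = os.path.relpath(root, base)
            sizes[partition].append(os.path.getsize(os.path.join(root, name)))

for partition, file_sizes in sorted(sizes.items()):
    total = sum(file_sizes)
    print(f"{partition}: {len(file_sizes)} files, {total / 1e6:.1f} MB, "
          f"largest file is {max(file_sizes) / total:.0%} of the partition")
```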

Debugging Support

LakeView provides the first-ever visual interpretation of the Apache Hudi event timeline. DataOps teams investigating missing data or other pipeline problems can now explore the event timeline and dive into individual events. This helps them pinpoint exactly when and where a problem was introduced to the system—whether in a processing service, such as for data compacting, or in the table commits themselves. With support for fast searching and filtering, debugging data problems is now a more pleasant and efficient experience.
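
To see what LakeView is visualizing, consider how the Hudi timeline looks on disk: each completed action is a file in the table's .hoodie/ directory, named with the instant timestamp and action type. The sketch below lists those instants with plain Python; the table path is hypothetical, and newer Hudi releases may lay the timeline out slightly differently.

```python
# Hedged sketch: list Hudi timeline instants from a table's .hoodie/ directory.
import os

timeline_dir = "/data/lake/orders/.hoodie"  # hypothetical table path
actions = (".commit", ".deltacommit", ".clean", ".compaction", ".replacecommit")

instants = sorted(
    f for f in os.listdir(timeline_dir)
    if f.endswith(actions)
)
for instant in instants:
    ts, _, action = instant.partition(".")
    print(f"{ts}  {action}")  # e.g. 20240624093015123  commit
```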

Merge-on-Read Admin Support

Merge-on-read tables are very efficient; Onehouse has run multiple benchmarks that show the benefits of using them. These tables are, however, difficult to manage effectively. LakeView simplifies the administration work with a visual representation of their performance data. Reports on log file build-up and other related metrics give DataOps administrators the visibility they need to be able to adopt, use, and maintain merge-on-read tables.  

LakeView can be deployed quickly, easily, and without friction: it doesn’t require complicated installs, additional costs, or permission grants. All it needs to function well is read access to a copy of the Apache Hudi metadata, and it includes tooling to ingest that metadata automatically on any schedule. Existing Onehouse customers already have free access to this functionality within their environment.

Table Optimizer—Automatic Table and Pipeline Optimization

Operating open table formats can be challenging. DataOps engineers must ensure that cleaning, clustering, compaction, file resizing, and many other processes are set up properly, monitored, and continuously updated just to keep the system running at optimal performance.

Performance tuning of the system requires finicky updates to many interrelated configurations, including budgets, triggers, concurrency limits, and data retention enforcement. 

Custom optimization services that run inline with data pipelines compete with write operations for resources, forcing performance tradeoffs. Still, the work is more than worthwhile: careful tuning can bring up to 100x performance gains for the system as a whole.
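
As a flavor of the kind of interrelated settings a team would otherwise tune by hand, here is a hedged sketch of standard Hudi table-service configuration. The hoodie.* keys are real Hudi configs, but the values are illustrative placeholders, not recommendations.

```python
# Hedged sketch: the sort of Hudi table-service knobs manual tuning involves.
table_service_options = {
    # Compaction: how many delta commits accumulate before merging log files.
    "hoodie.compact.inline.max.delta.commits": "5",
    # Clustering: run asynchronously so it doesn't block writers.
    "hoodie.clustering.async.enabled": "true",
    "hoodie.clustering.plan.strategy.small.file.limit": str(300 * 1024 * 1024),
    # File sizing: target size for base parquet files.
    "hoodie.parquet.max.file.size": str(120 * 1024 * 1024),
    # Cleaning / retention: how many commits of history to keep.
    "hoodie.cleaner.commits.retained": "10",
}
```

Getting any one of these values wrong, or failing to revisit them as table composition shifts, is exactly the kind of ongoing tuning burden Table Optimizer is built to remove.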

Table Optimizer addresses these problems and helps capture some of the potential performance gains automatically. It includes a slate of automatic optimization services, which Onehouse runs asynchronously in the background (that is, not inline) against tables and pipelines they manage for clients. These optimizations include adaptive compaction, which uses optimal strategies for merge-on-read table compaction; intelligent incremental clustering for automatically clustering data based on actual query performance; auto file-sizing for table storage optimization; and automatic TTL enforcement and cleaning, which reclaims storage space by deleting files and data that are no longer relevant.

The optimization work can be performed on demand or asynchronously, scales automatically, and is timed to run off-peak for your use cases, all running in side-car fashion on Onehouse infrastructure.

Your Universal Data Lakehouse is Waiting

LakeView and Table Optimizer are natural extensions to Onehouse’s managed data ingestion, storage, and processing offerings. All of Onehouse’s offerings are easy to deploy, have no maintenance requirements, and automatically optimize performance and cost. They are often more efficient, both in cost and performance, than equivalent offerings.

To learn more about Onehouse, its products, pricing, or anything related to working with data at scale, reach out to the team at gtm@onehouse.ai. For more detailed documentation and demos of how LakeView and Table Optimizer might be used, check out the webinar we summarized in this post.

LakeView is available for free on GitHub, and you can get started in your infrastructure in as little as 20 minutes.

For other deep data processing conversations, see our recorded webinar, Vector Embeddings in the Lakehouse: Bridging AI and Data Lake Technologies, delivered Tuesday, August 27.
