In 2017, Onehouse CEO Vinoth Chandar helped open-source Apache Hudi, the first open source data lakehouse technology. Since then, Apache Hudi has grown rapidly, reaching 5.2k GitHub stars and a community of 11k+ members, including industry leaders such as Amazon, ByteDance, Notion, Walmart, and Zoom.
At Onehouse, we’ve engaged closely with the Hudi community while building our cloud service, a fully managed data lakehouse. Through our work with the community, and as Hudi contributors, we’ve witnessed the challenges of operating and optimizing data lakehouses. Often, while helping community members with performance tuning or tracking down errors, we find that challenges around observability hinder progress.
Today, we are excited to release Onehouse LakeView, a new, free product for the Apache Hudi community. LakeView provides a user interface for Apache Hudi with metrics, charts, and insights to help you monitor your tables and identify optimization opportunities. LakeView works entirely by analyzing table metadata, so base data files are never accessed and no business data leaves your private cloud.
LakeView addresses key challenges in managing a data lakehouse by enhancing observability metrics from Hudi with context about workloads and data management functions.
When embarking on your lakehouse observability journey, you'll likely start by using Hudi’s built-in metrics. However, building a comprehensive observability system requires much more effort. You may spend time creating dashboards and alerts on top of these standard metrics. Challenges such as data skew or the creation of numerous small files may cause performance issues, prompting additional projects to calculate and monitor new metrics. Each observability project demands deep system knowledge to track the right metrics and build effective dashboards.
The effort needed to establish sufficient observability often leads teams to deprioritize it, leaving them to troubleshoot in the dark and optimize without clear metrics to guide them. LakeView is purpose-built to provide observability out of the box, leveraging the Onehouse team’s extensive experience in supporting large data lakes for our customers and the open-source community.
Now, let's discuss some challenges we’ve observed and how LakeView can help.
When operating a data lakehouse, it’s important to track the state of your tables and stay on top of significant or problematic changes. To accomplish this, you’ll need to write custom queries to answer questions such as how many records exist in a table, or how many updates vs. inserts a table had in the past three days. In many cases, you also need an understanding of the intricacies of how different table services are operating or performing.
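To give a sense of the custom work involved, here is a minimal Python sketch that answers the "updates vs. inserts in the past three days" question. It assumes the per-commit write counts have already been extracted into plain dictionaries; the `commits` records, their field names, and the `summarize_writes` helper are hypothetical and do not mirror Hudi's actual commit metadata format.

```python
from datetime import datetime, timedelta

# Hypothetical, simplified commit records. Real Hudi commit metadata is
# richer; these field names are assumptions made for illustration.
commits = [
    {"time": datetime(2024, 5, 1, 9, 0), "inserts": 1200, "updates": 300},
    {"time": datetime(2024, 5, 3, 9, 0), "inserts": 800, "updates": 950},
    {"time": datetime(2024, 5, 4, 9, 0), "inserts": 50, "updates": 2000},
]


def summarize_writes(commits, now, window=timedelta(days=3)):
    """Total inserts and updates for commits made within `window` before `now`."""
    cutoff = now - window
    recent = [c for c in commits if cutoff <= c["time"] <= now]
    return {
        "commits": len(recent),
        "inserts": sum(c["inserts"] for c in recent),
        "updates": sum(c["updates"] for c in recent),
    }


summary = summarize_writes(commits, now=datetime(2024, 5, 5, 0, 0))
print(summary)
```

Even this small example leaves out the hard parts: extracting the counts from the table in the first place, running the job on a schedule, and charting the results over time.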
LakeView provides a user interface with interactive charts and metrics so you can quickly scan through the state of your tables, and dive deeper if something looks off. This allows you to fix problems before they become serious.
Your data lakehouse is constantly evolving as folks across your organization introduce new data workloads and patterns. Staying on top of all your tables can feel like a chore.
LakeView sends a weekly email summary to help you stay on top of key metrics representing the health of your data lakehouse.
Achieving high performance for your data lakehouse tables is tricky and requires ongoing maintenance as data patterns evolve. Common pitfalls, such as data skew and small files, may result in slow reads or unnecessarily expensive writes to the table.
LakeView exposes a full view of data skew across files and partitions to help you diagnose performance issues and optimize your tables.
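As a rough illustration of what a skew check involves, the sketch below computes two simple indicators per partition: the ratio of the largest file to the mean file size, and the number of files under a small-file threshold. The input mapping, the 32 MB threshold, and the `partition_skew` helper are assumptions chosen for the example, not LakeView's actual method.

```python
from statistics import mean

# Assumed cutoff below which a file counts as "small"; tune for your tables.
SMALL_FILE_BYTES = 32 * 1024 * 1024

# Hypothetical file sizes in bytes, keyed by partition path.
file_sizes = {
    "date=2024-05-01": [120_000_000, 118_000_000, 125_000_000],
    "date=2024-05-02": [900_000_000, 4_000_000, 6_000_000, 5_000_000],
}


def partition_skew(sizes_by_partition, small_file_bytes=SMALL_FILE_BYTES):
    """Return per-partition size totals and two simple skew indicators."""
    report = {}
    for partition, sizes in sizes_by_partition.items():
        report[partition] = {
            "total_bytes": sum(sizes),
            # A ratio near 1 means evenly sized files; larger means skew.
            "max_to_mean": round(max(sizes) / mean(sizes), 2),
            "small_files": sum(1 for s in sizes if s < small_file_bytes),
        }
    return report


report = partition_skew(file_sizes)
for partition, stats in report.items():
    print(partition, stats)
```

In this sample data, the second partition stands out on both indicators: one file dominates the partition and the remaining three fall under the small-file threshold.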
Debugging issues often requires diving into the Hudi timeline to track down commits and table service jobs. This process can be difficult and tedious to navigate using a command-line interface (CLI). LakeView presents a searchable view of the timeline to help you quickly get to the bottom of issues, such as a clustering job that takes too long or is not getting triggered properly.
For teams leveraging Hudi’s Merge on Read tables, it’s critical that compaction runs on time to ensure timely access to fresh data for consumers. If you choose the wrong compaction strategy, your tables may accumulate many log files, potentially hurting read performance or exposing stale data. LakeView tracks log file accumulation across all filegroups and warns you at a specified threshold. You can also browse the log file counts across the table to help optimize your compaction strategy.
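The threshold idea can be sketched in a few lines of Python. This example assumes each log file has already been mapped to its partition and file group; the identifiers, the threshold of 5, and the `file_groups_over_threshold` helper are hypothetical and are not taken from LakeView or Hudi.

```python
from collections import Counter

# Assumed warning threshold: log files per file group.
LOG_FILE_THRESHOLD = 5

# Hypothetical input: one (partition, file_group_id) entry per log file.
log_files = (
    [("date=2024-05-01", "fg-001")] * 2
    + [("date=2024-05-01", "fg-002")] * 7
    + [("date=2024-05-02", "fg-003")] * 5
)


def file_groups_over_threshold(log_files, threshold=LOG_FILE_THRESHOLD):
    """Count log files per file group and keep those at or above the threshold."""
    counts = Counter(log_files)
    return {group: n for group, n in counts.items() if n >= threshold}


flagged = file_groups_over_threshold(log_files)
for (partition, file_group), n in sorted(flagged.items()):
    print(f"WARNING: {partition}/{file_group} has {n} uncompacted log files")
```

File groups that keep appearing in this list are a hint that compaction is running too rarely, or is not selecting those file groups at all.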
For proactive monitoring, LakeView allows you to configure alerts via email and Slack, triggered by metrics such as data skew or log file accumulation.
LakeView is now available for interested members of the Hudi community! Sign up here to get started, and also check out the launch webinar to see a live demo of LakeView. Setup is simple and can be completed in less than 20 minutes using the LakeView GitHub repository.