February 21, 2022

Automagic Data Lake Infrastructure

While a data warehouse can just be used, a data lake still needs to be built. Building one can be a daunting task that requires hard-to-find engineers who specialize in several different technology stacks: databases, streaming services, distributed compute frameworks like Apache Spark, cloud storage, Kubernetes, cloud networking and security, and much more. The Data + AI ecosystem is also rapidly evolving, with countless innovative new technologies and open source projects released every year.


Very few organizations have the time or talent required to make the upfront investment to build a data lake, so they sometimes turn to drop-in warehouse silos that they soon outgrow, resulting in costly churn on tedious infrastructure migrations down the road.

The DIY Data Lake

Let’s walk through an example scenario. Say I have a Postgres operational database and I want to perform some BI analytics. First, I configure my Postgres DB, install the necessary drivers, set up Debezium, deploy Kafka, and run Spark, Flink, or another compute engine for checkpoint-based streaming into a data lake. To obtain ACID transactions and efficient upserts into my lake, I will pick up a data Lakehouse technology like Apache Hudi and build an EL data pipeline for “raw” tables that mirror my upstream tables. I will need to study the layout and distribution of my data, craft strategies to handle late-arriving data, understand the downstream use cases, and customize performance tuning strategies across partitioning, indexing, clustering, compaction, and more. Then I will build some ETL pipelines to produce derived tables on the lake, or even in a warehouse, so I can run SQL queries.
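
To make the moving parts concrete, here is a minimal, hedged sketch of that “raw” EL step: a Spark Structured Streaming job that reads Debezium change events from Kafka and continuously upserts them into a Hudi table on cloud storage. The topic name, columns, and bucket paths are illustrative assumptions, and a real pipeline needs proper schema handling for the Debezium envelope.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, get_json_object

    spark = SparkSession.builder.appName("postgres-cdc-to-hudi").getOrCreate()

    # Read raw Debezium change events from Kafka (topic name is an assumption).
    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "kafka:9092")
              .option("subscribe", "pg.public.orders")
              .load())

    # Pull a few fields out of the Debezium JSON envelope (simplified parsing).
    orders = events.select(
        get_json_object(col("value").cast("string"), "$.payload.after.order_id").alias("order_id"),
        get_json_object(col("value").cast("string"), "$.payload.after.customer_id").alias("customer_id"),
        get_json_object(col("value").cast("string"), "$.payload.ts_ms").alias("ts_ms"),
    )

    # Continuously upsert into a Hudi table; the checkpoint makes the stream
    # restartable while Hudi provides ACID commits and record-level updates.
    (orders.writeStream
     .format("hudi")
     .option("hoodie.table.name", "raw_orders")
     .option("hoodie.datasource.write.recordkey.field", "order_id")
     .option("hoodie.datasource.write.precombine.field", "ts_ms")
     .option("hoodie.datasource.write.operation", "upsert")
     .option("checkpointLocation", "s3://my-lake/checkpoints/raw_orders")
     .outputMode("append")
     .start("s3://my-lake/raw/orders"))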


Read enough documentation yet? We just got to the fun part: operationalizing all of this infrastructure. Airflow scheduling; Jenkins CI/CD for dev/prod deployments; scaling Kubernetes compute; resource, job, and quality monitoring. Ensuring you meet compliance regulations and data retention policies and pass internal security reviews of your networking configuration. Building an on-call rotation and waking up in the night to backfill for some urgent executive dashboard…
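
For a flavor of just the scheduling slice, here is a hedged Airflow sketch; the DAG name, schedule, and spark-submit command are all placeholders for the nightly derived-table ETL that someone eventually gets paged about:

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="lake_derived_tables",
        start_date=datetime(2022, 1, 1),
        schedule_interval="0 2 * * *",  # nightly at 02:00 UTC
        catchup=False,
    ) as dag:
        # Placeholder spark-submit; real jobs also need retries, alerting, and backfill logic.
        build_orders_daily = BashOperator(
            task_id="build_orders_daily",
            bash_command="spark-submit etl/orders_daily.py",
        )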

During my time on the Azure Data team I helped hundreds of companies strategize the architecture for their data lakes in the cloud, and I routinely saw it take 3-6 engineers six months or more to fully operationalize a production-grade data lake for this scenario. Now say you want to accomplish the same feat in four other divisions of your company. Not only do you need to replicate this complex infrastructure, but you also need to hire and train another team to manage it…

Automagic Data Services

Onehouse provides automagic data infrastructure that automates away the tedious chores of building a data lake. The same scenario above can be accomplished by a single engineer in a fraction of the time, allowing you to refocus engineering efforts on higher-impact activities.

Point to where your data is and Onehouse will automate and manage all of the infrastructure required to incrementally and continuously deliver data into a ready-to-use data Lakehouse. Onehouse delivers automagic performance at scale without complex custom performance tuning. We monitor file statistics and the characteristics of your writes, automatically manage your file sizes, partitioning, indexing, and clustering, and can even apply advanced space-filling curve optimizations like Z-order or Hilbert curves. All of the Onehouse managed services are built on the open services offered by industry-proven Apache Hudi, which fundamentally transforms a data lake with transactions, concurrency control, schema evolution, and advanced performance tuning capabilities.
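
For context, this is roughly the class of Hudi tuning knobs that otherwise gets set and re-tuned by hand. The keys below are standard Hudi write/clustering options, but the values are purely illustrative and the right settings depend on your data (exact option names can also vary by Hudi version):

    # Illustrative only: file sizing and inline clustering options a team would
    # otherwise hand-tune per table and revisit as data volumes change.
    hudi_tuning = {
        "hoodie.parquet.max.file.size": str(128 * 1024 * 1024),        # target base file size
        "hoodie.clustering.inline": "true",                            # rewrite small files inline
        "hoodie.clustering.inline.max.commits": "4",                   # cluster every 4 commits
        "hoodie.clustering.plan.strategy.sort.columns": "customer_id,order_ts",  # sort key columns
    }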

Open Data Infrastructure

When building the data platform for your organization it is important to design a future-proof architecture that ensures your data is open and interoperable. There is a trend in open source data for query engine providers to treat open table formats as leverage to “lock in” data and then upsell proprietary services for their single query engine. Don’t limit your data Lakehouse to a single query engine! Onehouse offers a way for you to decouple your data lake infrastructure from query engines, so you can unlock interoperable, plug-and-play, mixed-mode compute.

Users routinely pick Hudi over other choices because open table formats on their own are not enough. Hudi ensures your data is portable by offering a rich set of open services like clustering, snapshot/restore, compaction, Z-order, file sizing, streaming ingest, incremental ETL, and CDC sources, along with a wide range of other technical differentiators best saved for a future blog.
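
As one small, concrete example of those open services, here is a hedged sketch of a Hudi incremental read in Spark, the primitive behind incremental ETL: it pulls only the records committed after a given instant instead of rescanning the whole table (the table path and commit timestamp are hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hudi-incremental-read").getOrCreate()

    # Incremental query: read only records committed after the given instant.
    changes = (spark.read
               .format("hudi")
               .option("hoodie.datasource.query.type", "incremental")
               .option("hoodie.datasource.read.begin.instanttime", "20220220000000")
               .load("s3://my-lake/raw/orders"))
    changes.createOrReplaceTempView("orders_changes")
    spark.sql("SELECT count(*) AS changed_rows FROM orders_changes").show()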

Read the Onehouse Commitment To Openness to learn more about our mission to future-proof your data.

If you are excited about our vision you can engage with us through one of the following channels:

  1. Join our pilot! We have a pilot program that kicks off with a small cohort of users every month. Click “Request A Demo” above or reach out directly to info@onehouse.ai if you are interested in becoming an early design partner.
  2. Join our team! We are hiring a diverse team of world-class talent and are looking for people passionate about our mission.
  3. Follow us on social media: LinkedIn, Twitter, Slack.
  4. Send any other questions to info@onehouse.ai.