November 26, 2024

Open Source Data Summit 2024 Draws Data Engineers and Data Architects

Written by:

Floyd Smith

Whether you are working on greenfield projects or trying to do more with complex infrastructure, open solutions for data infrastructure are likely to be on your mind. In October, the Open Source Data Summit 2024 virtual conference attracted a large and enthusiastic audience to more than a dozen sessions exploring open source alternatives to proprietary systems and the lock-in that results from them.

All sessions are now available on demand from the conference homepage, and you can link to key sessions from the highlights below. You can also check out recordings from the inaugural 2023 Open Source Data Summit.

Keynote Focus: Data Platform Unbundling

Organizations need to deliver a wide range of use cases on shared data, so duplicating data and core functionality across parallel, proprietary systems impedes progress. Vinoth Chandar, Founder and PMC Chair for Apache Hudi, and Founder and CEO of Onehouse, described a new “standard model” for data integration, with the open data lakehouse at its core.

Figure 1. Onehouse exemplifies the unbundled model

Open Catalog Exchange Draws Databricks, Snowflake, and More

The data catalog is sometimes the last point of lock-in for proprietary providers, so this discussion of open data catalog alternatives was both lively and productive. Practitioners from Databricks, Datastrato, Acryl Data, Onehouse, and Snowflake traded views and advocated for leading alternatives: Apache Gravitino, Apache Polaris, DataHub, and Unity Catalog.

Figure 2. Our panel tackles catalog interoperability

Dissecting the Lambda Architecture with an Apache Beam

The Lambda architecture begins by assuming that the same records are fed into a batch and a stream processing system in parallel. The open data lakehouse begins by assuming that the same records are fed into a single stream and processed incrementally, using the data lakehouse for storage and intermediate processing. Apache Beam works on the second premise. Watch as Beam founder David Regalado examines the Lambda architecture and shows how Beam can help replace it.

Figure 3. Apache Beam unifies multiple elements

Hudi 1.0: A Wave of Change on the Data Lake

Apache Hudi has always challenged traditional assumptions about what one can accomplish on the data lake. Hudi 1.0 takes that approach to a new level, creating a true alternative to database technology while preserving all the best aspects of the data lake. As Hudi 1.0 approaches general availability, Balaji Varadarajan and Y Ethan Guo unveil what’s new in Hudi 1.0.

Figure 4. Hudi 1.0 handles database responsibilities

“X Men” Xplain XTable

The key advantage of the open data lakehouse is its universal accessibility. But that advantage is blunted by the division of data lakehouse implementations amongst three OSS projects: Apache Iceberg, Apache Hudi, and Delta Lake. Now Apache XTable (Incubating) provides interoperability, bringing down barriers and opening up data architectures. Join Ashvin Agrawal and Dipankar Mazumdar as they carry out a live demo of XTable with leading open source query engines. (And see the XTable on AWS session for a deep dive on that platform.)

Figure 5. XTable ingesting and outputting metadata

Conductor Creates Low Latency on the Lake

Is it really possible to smoothly handle terabytes of data, and respond to complex queries in real time, on the data lake? On top of a managed service? Yes, but only with careful attention to detail at every step, including data pruning, data skipping, and data modeling best practices. Emil Emilov of Conductor shows how to deliver high-performance user-facing applications at scale.

Figure 6. Conductor accelerated queries with Onehouse

Conclusion

But wait, Bullwinkle, there’s more! Ten cool talks:

Shravana Krishnamurthy powers digital offers for banks from the data lake
Manfred Moser and Will Morrison juggle Trino clusters with the Trino Gateway
Sivanagaraju Gadiparthi delivers the best of stream and batch data processing
Ajit Panda scales compliance across an exabyte of systems globally
Lakshmana Yenduri, Sr. breaks down Flink vs. Hadoop vs. Spark for batch processing
Audra Montenegro, Bhavani Sudha Saktheeswaran, Priyanka Naik, and Hena Pawar describe how to manage OSS user communities for success
Joe Reis breaks down silos with artful data modeling
Tim Meehan shows how composable systems can replace the traditional data warehouse
Denis Krivenko leverages Kubernetes for a cloud native data platform
Shuguang Xiang and Shi Kai Ng build near real-time data analytics on Flink CDC and Hudi

Figure 7. The OSS communities panel shares best practices

For fuller descriptions, followed by links to all the talks, visit the OSDS 2.0 event page.
