November 26, 2024

Open Source Data Summit 2024 Draws Data Engineers and Data Architects

Whether you are working on greenfield projects or trying to do more with complex infrastructure, open solutions for data infrastructure are likely to be on your mind. In October, the Open Source Data Summit 2024 virtual conference attracted a large and enthusiastic audience to more than a dozen sessions exploring open source alternatives to proprietary systems and the lock-in they bring.

All sessions are now available on demand from the conference homepage, and you can link to key sessions from the highlights below. You can also check out recordings from the inaugural 2023 Open Source Data Summit.

Keynote Focus: Data Platform Unbundling

Organizations need to deliver a wide range of use cases on shared data, so duplicating data and core functionality across parallel, proprietary systems impedes progress. Vinoth Chandar, Founder and PMC Chair for Apache Hudi, and Founder and CEO of Onehouse, described a new “standard model” for data integration, with the open data lakehouse at its core. 

Figure 1. Onehouse exemplifies the unbundled model

Open Catalog Exchange Draws Databricks, Snowflake, and More

The data catalog is sometimes the last point of lock-in for proprietary providers, so this discussion of open data catalog alternatives was both lively and productive. Practitioners from Databricks, Datastrato, Acryl Data, Onehouse, and Snowflake traded views and advocated for leading alternatives: Apache Gravitino, Apache Polaris, DataHub, and Unity Catalog.

Figure 2. Our panel tackles catalog interoperability

Dissecting the Lambda Architecture with an Apache Beam 

The Lambda architecture begins by assuming that the same records are fed into a batch and a stream processing system in parallel. The open data lakehouse begins by assuming that the same records are fed into a single stream and processed incrementally, using the data lakehouse for storage and intermediate processing. Apache Beam works on the second premise. Watch as Beam founder David Regalado examines the Lambda architecture and shows how Beam can help replace it. 

Figure 3. Apache Beam unifies multiple elements

Hudi 1.0: A Wave of Change on the Data Lake

Apache Hudi has always challenged traditional assumptions about what one can accomplish on the data lake. Hudi 1.0 takes that approach to a new level, offering a true alternative to database technology while preserving all the best aspects of the data lake. As Hudi 1.0 approaches general availability, Balaji Varadarajan and Y Ethan Guo unveil what’s new in Hudi 1.0.

Figure 4. Hudi 1.0 handles database responsibilities

“X Men” Xplain XTable 

The key advantage of the open data lakehouse is its universal accessibility. But that advantage is blunted by the division of data lakehouse implementations amongst three OSS projects: Apache Iceberg, Apache Hudi, and Delta Lake. Now Apache XTable (Incubating) provides interoperability, bringing down barriers and opening up data architectures. Join Ashvin Agrawal and Dipankar Mazumdar as they carry out a live demo of XTable with leading open source query engines. (And see the XTable on AWS session for a deep dive on that platform.) 

Figure 5. XTable ingesting and outputting metadata

Conductor Creates Low Latency on the Lake

Is it really possible to smoothly handle terabytes of data, and respond to complex queries in real time, on the data lake? On top of a managed service? Yes, but only with careful attention to detail at every step, including data pruning, data skipping, and data modeling best practices. Emil Emilov of Conductor shows how to deliver high-performance user-facing applications at scale.

Figure 6. Conductor accelerated queries with Onehouse

Conclusion

But wait, Bullwinkle, there’s more! Ten cool talks:

  1. Shravana Krishnamurthy powers digital offers for banks from the data lake
  2. Manfred Moser and Will Morrison juggle Trino clusters with the Trino Gateway
  3. Sivanagaraju Gadiparthi delivers the best of stream and batch data processing
  4. Ajit Panda scales compliance across an exabyte of systems globally
  5. Lakshmana Yenduri, Sr. breaks down Flink vs. Hadoop vs. Spark for batch processing
  6. Audra Montenegro, Bhavani Sudha Saktheeswaran, Priyanka Naik, and Hena Pawar describe how to manage OSS user communities for success
  7. Joe Reis breaks down silos with artful data modeling
  8. Tim Meehan shows how composable systems can replace the traditional data warehouse
  9. Denis Krivenko leverages Kubernetes for a cloud native data platform
  10. Shuguang Xiang and Shi Kai Ng build near real-time data analytics on Flink CDC and Hudi

Figure 7. The OSS communities panel shares best practices

For fuller descriptions, followed by links to all the talks, visit the OSDS 2.0 event page.
