Whether you are working on greenfield projects or trying to do more with complex infrastructure, open solutions for data infrastructure are likely to be on your mind. In October, the Open Source Data Summit 2024 virtual conference attracted a large and enthusiastic audience to more than a dozen sessions exploring open source alternatives to proprietary systems and the lock-in they create.
All sessions are now available on demand from the conference homepage, and you can link to key sessions from the highlights below. You can also check out recordings from the inaugural 2023 Open Source Data Summit.
Organizations need to deliver a wide range of use cases on shared data, so duplicating data and core functionality across parallel, proprietary systems impedes progress. Vinoth Chandar, Founder and PMC Chair for Apache Hudi, and Founder and CEO of Onehouse, described a new “standard model” for data integration, with the open data lakehouse at its core.
The data catalog is sometimes the last point of lock-in for proprietary providers, so this discussion of open data catalog alternatives was both lively and productive. Practitioners from Databricks, Datastrato, Acryl Data, Onehouse, and Snowflake joined in exchanges and advocacy for leading alternatives - Apache Gravitino, Apache Polaris, DataHub, and Unity Catalog.
The Lambda architecture begins by assuming that the same records are fed into a batch and a stream processing system in parallel. The open data lakehouse begins by assuming that the same records are fed into a single stream and processed incrementally, using the data lakehouse for storage and intermediate processing. Apache Beam works on the second premise. Watch as Beam founder David Regalado examines the Lambda architecture and shows how Beam can help replace it.
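To make the contrast concrete, here is a toy sketch in plain Python. It does not use Apache Beam or any lakehouse API; the function and class names (`batch_layer`, `speed_layer`, `lambda_view`, `IncrementalTable`) are hypothetical and exist only for illustration. The Lambda-style path keeps two copies of the aggregation logic and merges their results at query time, while the incremental path applies each new set of records once to a single stored table.

```python
from collections import defaultdict


def batch_layer(history):
    """Lambda, batch side: recompute totals from the full history."""
    totals = defaultdict(int)
    for key, amount in history:
        totals[key] += amount
    return dict(totals)


def speed_layer(recent):
    """Lambda, streaming side: the same logic, duplicated for recent records."""
    totals = defaultdict(int)
    for key, amount in recent:
        totals[key] += amount
    return dict(totals)


def lambda_view(history, recent):
    """Serve a query by merging the batch view with the speed view."""
    merged = batch_layer(history)
    for key, amount in speed_layer(recent).items():
        merged[key] = merged.get(key, 0) + amount
    return merged


class IncrementalTable:
    """Single path: every set of new records is applied once to stored state."""

    def __init__(self):
        self.totals = {}

    def upsert(self, records):
        for key, amount in records:
            self.totals[key] = self.totals.get(key, 0) + amount
        return self


if __name__ == "__main__":
    history = [("a", 1), ("b", 2)]
    recent = [("a", 3)]

    print(lambda_view(history, recent))

    table = IncrementalTable()
    table.upsert(history).upsert(recent)
    print(table.totals)
```

Both paths arrive at the same totals, but the incremental version has one code path to maintain and never reprocesses the full history.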
Apache Hudi has always challenged traditional assumptions about what one can accomplish on the data lake. Hudi 1.0 takes that approach to a new level, offering a true alternative to database technology while preserving the best aspects of the data lake. As Hudi 1.0 approaches general availability, Balaji Varadarajan and Y Ethan Guo unveil what’s new in the release.
The key advantage of the open data lakehouse is its universal accessibility. But that advantage is blunted by the division of data lakehouse implementations amongst three OSS projects: Apache Iceberg, Apache Hudi, and Delta Lake. Now Apache XTable (Incubating) provides interoperability, bringing down barriers and opening up data architectures. Join Ashvin Agrawal and Dipankar Mazumdar as they carry out a live demo of XTable with leading open source query engines. (And see the XTable on AWS session for a deep dive on that platform.)
Is it really possible to smoothly handle terabytes of data - and respond to complex queries in real time - on the data lake? On top of a managed service? Yes, but only with careful attention to detail at every step, including data pruning, data skipping, and data modeling best practices. Emil Emilov of Conductor shows how to deliver high-performance user-facing applications at scale.
But wait, Bullwinkle, there’s more! Ten cool talks:
For fuller descriptions, followed by links to all the talks, visit the OSDS 2.0 event page.