Upgrading from the Modern Data Stack to the Modern Data Lake (ANT103)

Title

AWS re:Invent 2023 - Upgrading from the modern data stack to the modern data lake (ANT103)

Summary

  • The session, presented by Monica and Emma from Starburst, focused on transitioning from the modern data stack to the modern data lake.
  • The modern data stack, although originally intended to simplify data architecture, has grown complex, with separate components for ingestion, transformation, storage, visualization, testing, governance, and monitoring.
  • The presenters argue that the modern data stack is essentially a cloud-based version of legacy data architecture and does not represent true modernization.
  • They advocate for a modern data lake approach, emphasizing the separation of storage and compute, open data standards, and avoiding vendor lock-in.
  • The modern data lake should be organized into three layers: a land layer for raw data, a structure layer for structured data, and a consume layer for consumable data.
  • A performant, scalable query engine is essential, and the presenters mention Trino as an example.
  • Open table formats, along with open file formats such as ORC, Parquet, and Avro, are recommended for efficiency and simplicity.
  • A single point of access and governance is crucial, with a semantic layer to integrate the data lake with other data sources.
  • Starburst's Data Lake Analytics platform is introduced as a solution that embodies these principles, offering a unified analytics platform with a single point of access, an enhanced query engine, and a governance layer.
  • New features announced include streaming ingest, automatic data classification, data lake optimization, and data sharing capabilities.
  • The session concludes with an invitation to visit their booth for further discussion and demonstrations.
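
The land, structure, and consume organization described in the summary can be sketched as a path convention on object storage. This is a minimal illustration only: the bucket name, the `dt=` partition layout, and the `DatasetLocation` and `promote` names are hypothetical and do not come from the session.

```python
from dataclasses import dataclass
from datetime import date

# Layer names follow the summary; everything else here is a hypothetical layout.
LAYERS = ("land", "structure", "consume")


@dataclass(frozen=True)
class DatasetLocation:
    bucket: str
    layer: str
    dataset: str
    day: date

    def uri(self) -> str:
        """Build the object-store prefix for this dataset in its layer."""
        if self.layer not in LAYERS:
            raise ValueError(f"unknown layer: {self.layer!r}")
        return (
            f"s3://{self.bucket}/{self.layer}/{self.dataset}/"
            f"dt={self.day.isoformat()}/"
        )


def promote(loc: DatasetLocation) -> DatasetLocation:
    """Return the location of the same dataset in the next layer."""
    idx = LAYERS.index(loc.layer)
    if idx == len(LAYERS) - 1:
        raise ValueError("consume is the final layer")
    return DatasetLocation(loc.bucket, LAYERS[idx + 1], loc.dataset, loc.day)


raw = DatasetLocation("example-lake", "land", "orders", date(2023, 11, 28))
print(raw.uri())
print(promote(raw).uri())
```

Keeping the layer in the path means any engine that can read the storage can find each stage of the data without depending on a particular vendor's catalog.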

Insights

  • The modern data stack's complexity has led to a reevaluation of data architecture, with a shift towards data lakes that offer more flexibility and scalability.
  • The separation of storage and compute is a key principle in modern data architecture, allowing for more efficient resource management and cost savings.
  • Open data standards and formats are gaining traction as they facilitate interoperability and reduce the risk of vendor lock-in.
  • The concept of data centralization is challenged, with the presenters advocating for a federated approach to data management that accommodates the dynamic nature of modern businesses.
  • Starburst's approach to the modern data lake includes a unified analytics platform that integrates various components into a cohesive system, potentially simplifying the management and scaling of data lakes.
  • The introduction of new features by Starburst, such as streaming ingest and automatic data classification, reflects the ongoing innovation in data lake technology and the need for real-time data processing and enhanced security.
  • The session highlights the importance of governance and access control in data lakes, ensuring that data is not only stored and processed efficiently but also managed responsibly and securely.
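
The idea of a single point of access with governance in front of federated sources can be sketched in a few lines. This is a toy model under stated assumptions, not Starburst's or Trino's API: the `AccessPoint` class, its methods, and the role and catalog names are all hypothetical, and the sources are stand-in functions rather than real connectors.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class AccessPoint:
    # Logical catalog name -> function returning rows (stand-in for a real source).
    sources: Dict[str, Callable[[], List[dict]]] = field(default_factory=dict)
    # Role -> set of catalogs that role may read.
    grants: Dict[str, Set[str]] = field(default_factory=dict)

    def register(self, catalog: str, fetch: Callable[[], List[dict]]) -> None:
        """Attach a data source under a logical catalog name."""
        self.sources[catalog] = fetch

    def grant(self, role: str, catalog: str) -> None:
        """Allow a role to read a catalog."""
        self.grants.setdefault(role, set()).add(catalog)

    def read(self, role: str, catalog: str) -> List[dict]:
        """Check the grant first, then route the read to the underlying source."""
        if catalog not in self.grants.get(role, set()):
            raise PermissionError(f"role {role!r} may not read {catalog!r}")
        return self.sources[catalog]()


gateway = AccessPoint()
gateway.register("lake", lambda: [{"order_id": 1}])
gateway.register("crm", lambda: [{"customer": "example"}])
gateway.grant("analyst", "lake")

print(gateway.read("analyst", "lake"))
```

Because every read passes through one object, the access rule is enforced in one place regardless of where the data physically lives, which is the property the federated approach relies on.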