Building and Operating a Data Lake on Amazon S3 (STG302)

Title

AWS re:Invent 2022 - Building and operating a data lake on Amazon S3 (STG302)

Summary

  • Jorge Lopez, a principal specialist at AWS, and Huey Garcia, a senior product manager on the S3 team, discuss the value of data lakes on AWS, particularly on Amazon S3.
  • They highlight the importance of making data accessible, along with the return on investment (ROI) and lower total cost of ownership (TCO) that a modern data strategy can deliver.
  • The session covers the struggles organizations face with legacy solutions and data silos, and the benefits of moving to a modern data architecture.
  • A data lake is defined as a central repository for massive amounts of structured and unstructured data that makes it easy to categorize, process, and analyze.
  • Key characteristics of a data lake on the cloud include the separation of compute and storage, the ability to analyze data in place, and paying only for what you use.
  • The talk outlines the journey of building a data lake: moving data into a central repository, feeding the data lake with streaming data, securing the data, managing governance, and optimizing performance and cost.
  • Jyotsna Karki, a senior data engineer at Novo Nordisk, shares their experience transitioning from a monolithic data lake to a multi-lake, data mesh architecture.
  • The session also touches on AWS services and features relevant to data lakes, such as S3 Replication, S3 Multi-Region Access Points, AWS Lake Formation, and Amazon FSx for Lustre.
  • The speakers emphasize the importance of performance optimization, cost savings, and using the latest APIs and services like S3 Select and S3 Intelligent-Tiering.
  • A blueprint for building a data lake on AWS is presented, along with a real-life example from Novo Nordisk, showcasing their journey and the business impact of their data and analytics ecosystem.
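A small, hedged sketch of the cost-optimization step in the blueprint above: a lifecycle configuration that lands data in S3 Intelligent-Tiering and expires stale object versions, expressed as the JSON payload you would pass to S3's `PutBucketLifecycleConfiguration` API. The `raw/` prefix and the 90-day window are illustrative assumptions, not values from the talk.

```python
import json

# Hedged sketch of an S3 lifecycle configuration for a data lake bucket.
# The "raw/" prefix and retention values are illustrative assumptions.
lifecycle_configuration = {
    "Rules": [
        {
            # Move landing-zone objects to S3 Intelligent-Tiering so that
            # observed access patterns, not guesses, drive storage cost.
            "ID": "tier-raw-data",
            "Filter": {"Prefix": "raw/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 0, "StorageClass": "INTELLIGENT_TIERING"}
            ],
        },
        {
            # Expire old noncurrent versions to keep pay-as-you-go costs down.
            "ID": "expire-noncurrent-versions",
            "Filter": {"Prefix": ""},
            "Status": "Enabled",
            "NoncurrentVersionExpiration": {"NoncurrentDays": 90},
        },
    ]
}

print(json.dumps(lifecycle_configuration, indent=2))
```

With boto3, this dict would be applied via `put_bucket_lifecycle_configuration(Bucket=..., LifecycleConfiguration=lifecycle_configuration)`; keeping it as plain data makes the policy easy to review and version-control alongside the rest of the lake's infrastructure.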

Insights

  • Access to data and the ability to experiment quickly and at a lower cost are key drivers for innovation in organizations.
  • The separation of compute and storage in the cloud is a fundamental shift from traditional architectures, allowing for more flexibility and scalability.
  • The ability to analyze data in place without duplicating it is crucial for reducing overhead and maintaining data integrity.
  • Pay-as-you-go models in the cloud help organizations scale their resources efficiently based on demand.
  • Novo Nordisk's transition to a data mesh architecture reflects a growing trend among sophisticated AWS customers to move away from monolithic data lakes to more decentralized, domain-oriented data architectures.
  • AWS Lake Formation is highlighted as a service that accelerates building, managing, and operating data lakes through features such as governed tables, automatic performance optimization, and data sharing.
  • Performance and cost are closely related in the context of data lakes, and AWS provides various tools and best practices to optimize both.
  • The real-life example from Novo Nordisk demonstrates the tangible business benefits of a well-architected data lake, including faster data pipeline development, data reuse across the organization, and improved business outcomes in various domains such as R&D, manufacturing, sales, and distribution.
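The "analyze data in place" idea mentioned in the session can be illustrated with S3 Select, which the speakers call out by name: rather than downloading a whole object, you push a SQL expression to S3 and receive only the matching rows. The sketch below builds the request parameters as a plain dict; the bucket, key, and column names are illustrative assumptions.

```python
# Hedged sketch of an S3 Select request. Bucket, key, and column names
# are illustrative assumptions, not values from the talk.
select_request = {
    "Bucket": "example-data-lake",
    "Key": "raw/sales/2022/orders.csv",
    "ExpressionType": "SQL",
    # Project two columns and filter server-side, so only the matching
    # rows cross the network instead of the full object.
    "Expression": (
        "SELECT s.order_id, s.amount FROM S3Object s WHERE s.region = 'EU'"
    ),
    "InputSerialization": {"CSV": {"FileHeaderInfo": "USE"}},
    "OutputSerialization": {"JSON": {}},
}

# With boto3, this dict would be passed as:
#   boto3.client("s3").select_object_content(**select_request)
print(select_request["Expression"])
```

Because compute and storage are separate, the same object can also be queried in place by Athena or Redshift Spectrum without copying it, which is the duplication-avoidance point the insights above make.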