Title
AWS re:Invent 2022 - Building and operating a data lake on Amazon S3 (STG302)
Summary
- Jorge Lopez, a principal specialist at AWS, and Huey Garcia, a senior product manager on the S3 team, discuss the value of data lakes on AWS, particularly on Amazon S3.
- They highlight the importance of making data accessible, the return on investment (ROI), and the lower total cost of ownership (TCO) that a modern data strategy can deliver.
- The session covers the struggles organizations face with legacy solutions and data silos, and the benefits of moving to a modern data architecture.
- A data lake is defined as a central repository for massive amounts of structured and unstructured data that is easy to categorize, process, and analyze.
- Key characteristics of a data lake on the cloud include the separation of compute and storage, the ability to analyze data in place, and paying only for what you use.
- The talk outlines the journey of building a data lake: moving data into a central repository, feeding the data lake with streaming data, securing the data, managing governance, and optimizing performance and cost.
- Jyotsna Karki, a senior data engineer at Novo Nordisk, shares their experience transitioning from a monolithic data lake to a multi-lake, data mesh architecture.
- The session also touches on AWS services and features relevant to data lakes, such as S3 Replication, Multi-Region Access Points, Lake Formation, and Amazon FSx for Lustre.
- The speakers emphasize the importance of performance optimization, cost savings, and using the latest APIs and services like S3 Select and S3 Intelligent-Tiering.
- A blueprint for building a data lake on AWS is presented, along with a real-life example from Novo Nordisk, showcasing their journey and the business impact of their data and analytics ecosystem.
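The cost-optimization advice above (pay only for what you use, S3 Intelligent-Tiering) can be sketched with boto3. This is a minimal illustration, not from the talk itself: the bucket name and `raw/` prefix are hypothetical, and the rule builder is a plain helper so the applying step, which needs AWS credentials, is kept separate.

```python
# Sketch: move objects under a prefix to S3 Intelligent-Tiering with an
# S3 lifecycle rule. Bucket name and prefix below are illustrative only.

def intelligent_tiering_rule(prefix: str, days: int = 0) -> dict:
    """Build a lifecycle rule that transitions objects under `prefix`
    to the INTELLIGENT_TIERING storage class after `days` days."""
    return {
        "ID": f"tier-{prefix.strip('/') or 'all'}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": days, "StorageClass": "INTELLIGENT_TIERING"}
        ],
    }

def apply_rule(bucket: str, rule: dict) -> None:
    """Apply the rule via boto3 (requires AWS credentials; not run here)."""
    import boto3

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket,  # e.g. "my-data-lake-bucket" (hypothetical)
        LifecycleConfiguration={"Rules": [rule]},
    )
```

With a rule like this in place, S3 automatically shifts objects between access tiers based on usage, so infrequently read data stops incurring frequent-access prices.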
Insights
- Access to data and the ability to experiment quickly and at a lower cost are key drivers for innovation in organizations.
- The separation of compute and storage in the cloud is a fundamental shift from traditional architectures, allowing for more flexibility and scalability.
- The ability to analyze data in place without duplicating it is crucial for reducing overhead and maintaining data integrity.
- Pay-as-you-go models in the cloud help organizations scale their resources efficiently based on demand.
- Novo Nordisk's transition to a data mesh architecture reflects a growing trend among sophisticated AWS customers to move away from monolithic data lakes to more decentralized, domain-oriented data architectures.
- AWS Lake Formation is highlighted as a service that accelerates the process of building, managing, and operating data lakes by providing features like governed tables, automatic performance optimization, and data sharing capabilities.
- Performance and cost are closely related in the context of data lakes, and AWS provides various tools and best practices to optimize both.
- The real-life example from Novo Nordisk demonstrates the tangible business benefits of a well-architected data lake: faster data pipeline development, data reuse across the organization, and improved business outcomes in domains such as R&D, manufacturing, sales, and distribution.
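Analyzing data in place, as the insights above stress, is what S3 Select provides: a SQL filter runs server-side so only matching rows cross the network. A minimal sketch, assuming a hypothetical CSV object and filter; the request builder is a pure helper, and the call itself (which needs credentials) is kept in a separate function.

```python
# Sketch: query a CSV object in place with S3 Select, retrieving only
# the rows that match a filter instead of downloading the whole object.
# Bucket, key, and WHERE clause below are hypothetical.

def select_request(bucket: str, key: str, where: str) -> dict:
    """Build the keyword arguments for s3.select_object_content."""
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": f"SELECT * FROM S3Object s WHERE {where}",
        "InputSerialization": {"CSV": {"FileHeaderInfo": "USE"}},
        "OutputSerialization": {"CSV": {}},
    }

def run_select(params: dict) -> bytes:
    """Stream the filtered records (requires AWS credentials; not run here)."""
    import boto3

    s3 = boto3.client("s3")
    resp = s3.select_object_content(**params)
    # The response is an event stream; collect only the Records payloads.
    return b"".join(
        event["Records"]["Payload"]
        for event in resp["Payload"]
        if "Records" in event
    )
```

Because S3 scans and filters server-side, you pay for the bytes scanned and returned rather than for transferring the full object, which ties the performance and cost levers together as the session describes.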