Building and Operating a Data Lake on Amazon S3 (STG302)

Title

AWS re:Invent 2022 - Building and operating a data lake on Amazon S3 (STG302)

Summary

  • Jorge Lopez, a principal specialist at AWS, and Huey Garcia, a senior product manager on the S3 team, discuss the value of data lakes on AWS, particularly on Amazon S3.
  • They highlight the importance of making data accessible, along with the return on investment (ROI) and lower total cost of ownership (TCO) that a modern data strategy can deliver.
  • The session covers the struggles organizations face with legacy solutions and data silos, and the benefits of moving to a modern data architecture.
  • A data lake is defined as a central repository for massive amounts of structured and unstructured data that makes it easy to categorize, process, and analyze.
  • Key characteristics of a data lake on the cloud include the separation of compute and storage, the ability to analyze data in place, and paying only for what you use.
  • The talk outlines the journey of building a data lake: moving data into a central repository, feeding the data lake with streaming data, securing the data, managing governance, and optimizing performance and cost.
  • Jyotsna Karki, a senior data engineer at Novo Nordisk, shares their experience transitioning from a monolithic data lake to a multi-lake, data mesh architecture.
  • The session also touches on AWS services and features relevant to data lakes, such as S3 Replication, S3 Multi-Region Access Points, AWS Lake Formation, and Amazon FSx for Lustre.
  • The speakers emphasize the importance of performance optimization, cost savings, and using the latest APIs and services like S3 Select and S3 Intelligent-Tiering.
  • A blueprint for building a data lake on AWS is presented, along with a real-life example from Novo Nordisk, showcasing their journey and the business impact of their data and analytics ecosystem.
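A small, hedged sketch of the cost-optimization step in the blueprint above: a lifecycle configuration that lands data in S3 Intelligent-Tiering and expires stale object versions, expressed as the JSON payload you would pass to S3's `PutBucketLifecycleConfiguration` API. The `raw/` prefix and the 90-day window are illustrative assumptions, not values from the talk.

```python
import json

# Hedged sketch of an S3 lifecycle configuration for a data lake bucket.
# The "raw/" prefix and retention values are illustrative assumptions.
lifecycle_configuration = {
    "Rules": [
        {
            # Move landing-zone objects to S3 Intelligent-Tiering so that
            # observed access patterns, not guesses, drive storage cost.
            "ID": "tier-raw-data",
            "Filter": {"Prefix": "raw/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 0, "StorageClass": "INTELLIGENT_TIERING"}
            ],
        },
        {
            # Expire old noncurrent versions to keep pay-as-you-go costs down.
            "ID": "expire-noncurrent-versions",
            "Filter": {"Prefix": ""},
            "Status": "Enabled",
            "NoncurrentVersionExpiration": {"NoncurrentDays": 90},
        },
    ]
}

print(json.dumps(lifecycle_configuration, indent=2))
```

With boto3, this dict would be applied via `put_bucket_lifecycle_configuration(Bucket=..., LifecycleConfiguration=lifecycle_configuration)`; keeping it as plain data makes the policy easy to review and version-control alongside the rest of the lake's infrastructure.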

Insights

  • Access to data and the ability to experiment quickly and at a lower cost are key drivers for innovation in organizations.
  • The separation of compute and storage in the cloud is a fundamental shift from traditional architectures, allowing for more flexibility and scalability.
  • The ability to analyze data in place without duplicating it is crucial for reducing overhead and maintaining data integrity.
  • Pay-as-you-go models in the cloud help organizations scale their resources efficiently based on demand.
  • Novo Nordisk's transition to a data mesh architecture reflects a growing trend among sophisticated AWS customers to move away from monolithic data lakes to more decentralized, domain-oriented data architectures.
  • AWS Lake Formation is highlighted as a service that accelerates building, managing, and operating data lakes through features such as governed tables, automatic performance optimization, and data sharing.
  • Performance and cost are closely related in the context of data lakes, and AWS provides various tools and best practices to optimize both.
  • The real-life example from Novo Nordisk demonstrates the tangible business benefits of a well-architected data lake, including faster data pipeline development, data reuse across the organization, and improved business outcomes in various domains such as R&D, manufacturing, sales, and distribution.
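The "analyze data in place" idea mentioned in the session can be illustrated with S3 Select, which the speakers call out by name: rather than downloading a whole object, you push a SQL expression to S3 and receive only the matching rows. The sketch below builds the request parameters as a plain dict; the bucket, key, and column names are illustrative assumptions.

```python
# Hedged sketch of an S3 Select request. Bucket, key, and column names
# are illustrative assumptions, not values from the talk.
select_request = {
    "Bucket": "example-data-lake",
    "Key": "raw/sales/2022/orders.csv",
    "ExpressionType": "SQL",
    # Project two columns and filter server-side, so only the matching
    # rows cross the network instead of the full object.
    "Expression": (
        "SELECT s.order_id, s.amount FROM S3Object s WHERE s.region = 'EU'"
    ),
    "InputSerialization": {"CSV": {"FileHeaderInfo": "USE"}},
    "OutputSerialization": {"JSON": {}},
}

# With boto3, this dict would be passed as:
#   boto3.client("s3").select_object_content(**select_request)
print(select_request["Expression"])
```

Because compute and storage are separate, the same object can also be queried in place by Athena or Redshift Spectrum without copying it, which is the duplication-avoidance point the insights above make.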