Title
AWS re:Invent 2023 - Curate your data at scale (ANT205)
Summary
- Speakers: Mahesh Misra (AWS Lake Formation), Jerry Mozes (Amazon.com), Arthit Srinivasan (AWS Lake Formation), and Arti (AWS Managed Services).
- Data Curation: Defined as a five-stage process for breaking down data, system, and people silos, with stages including source identification, integration, governance, and sharing.
- Challenges: Data silos, complex ETL processes, and governance difficulties due to growing metadata and data volumes.
- Solutions: AWS managed services for metadata harvesting, data classification, ETL, data cataloging, and governance.
- Amazon.com's Journey: Jerry Mozes described Amazon.com's approach to fine-grained access control and data protection, including its use of AWS Lake Formation and Amazon DataShare.
- Demo: Arti demonstrated a data curation pipeline using AWS Glue and Lake Formation, showcasing sensitive data detection, data quality checks, and fine-grained access control.
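
The three steps named in the demo (sensitive data detection, data quality checks, and column-level access) can be illustrated with a minimal plain-Python sketch. This is not the AWS Glue or Lake Formation API, and it is not the pipeline shown in the session; the regex patterns, function names, and sample rows are all hypothetical and exist only to show what each step does conceptually.

```python
import re

# Hypothetical patterns for illustration; managed detectors cover many more entity types.
SENSITIVE_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def detect_sensitive_columns(rows):
    """Return {column: sorted pattern names} for columns whose values match a pattern."""
    found = {}
    for row in rows:
        for column, value in row.items():
            for name, pattern in SENSITIVE_PATTERNS.items():
                if isinstance(value, str) and pattern.search(value):
                    found.setdefault(column, set()).add(name)
    return {column: sorted(names) for column, names in found.items()}


def completeness(rows, column):
    """Data quality check: fraction of rows where the column is present and non-empty."""
    if not rows:
        return 0.0
    filled = sum(1 for row in rows if row.get(column) not in (None, ""))
    return filled / len(rows)


def project_allowed(rows, allowed_columns):
    """Column-level filter: keep only the columns a consumer is allowed to read."""
    return [{c: row[c] for c in allowed_columns if c in row} for row in rows]


# Hypothetical sample data.
rows = [
    {"order_id": "1001", "contact": "ana@example.com", "amount": "19.99"},
    {"order_id": "1002", "contact": "", "amount": "5.00"},
    {"order_id": "1003", "contact": "bo@example.com", "amount": "42.10"},
]

sensitive = detect_sensitive_columns(rows)
safe_columns = [c for c in rows[0] if c not in sensitive]
shared_view = project_allowed(rows, safe_columns)
```

Here `sensitive` flags the `contact` column, `completeness(rows, "contact")` reports how many rows actually carry a value, and `shared_view` is what a consumer without access to sensitive columns would see.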
Insights
- Data Modernization vs. Data-Driven Culture: Despite heavy investment in data modernization, fewer than 25% of companies consider themselves data-driven, indicating a gap between technology adoption and effective data utilization.
- Data Curation Importance: Effective data curation is critical for breaking down silos and enabling data-driven decision-making at scale.
- AWS Managed Services: AWS provides a suite of managed services that simplify the data curation process, including AWS Glue for ETL and data quality, and Lake Formation for governance and data sharing.
- Fine-Grained Access Control: Amazon.com's use of AWS Lake Formation and DataShare for fine-grained access control highlights the importance of managing data access at scale while maintaining compliance and governance.
- Data Discovery and Governance: Amazon DataZone uses machine learning features to automate data discovery and connect data with users and tools, underscoring the need for strong data governance mechanisms.
- Practical Application: The demo illustrated how AWS services can be used to curate data effectively, ensuring data quality and compliance with sensitive data handling, which is crucial for organizations dealing with large and complex datasets.
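
As a rough illustration of the fine-grained access control discussed above, the sketch below builds the arguments for a Lake Formation `GrantPermissions` request that grants `SELECT` on all columns of a table except a sensitive one. The role ARN, database, table, and column names are hypothetical, and this is an assumed usage pattern rather than the configuration Amazon.com or the demo actually used. The request is only constructed, not sent, so the code runs without AWS credentials.

```python
def build_column_grant(principal_arn, database, table, excluded_columns):
    """Build keyword arguments for a Lake Formation GrantPermissions call.

    Grants SELECT on every column except those listed in excluded_columns.
    """
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnWildcard": {"ExcludedColumnNames": list(excluded_columns)},
            }
        },
        "Permissions": ["SELECT"],
    }


# All identifiers below are hypothetical placeholders.
request = build_column_grant(
    principal_arn="arn:aws:iam::111122223333:role/analyst",
    database="sales_db",
    table="orders",
    excluded_columns=["contact"],
)

# With credentials configured, the request could be sent with boto3:
# import boto3
# boto3.client("lakeformation").grant_permissions(**request)
```

Keeping the request construction separate from the API call makes the grant easy to review or unit-test before it is applied to a real catalog.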