Title
AWS re:Invent 2022 - Scaling data processing w/Amazon EMR at the speed of market volatility (ANT338)
Summary
- Meenakshi Ponchankaran, a senior big data architect at AWS, and Sampath Kumar Velasamy, director of technology at FINRA, discuss scaling Amazon EMR to handle market volatility.
- FINRA processes over 600 billion records daily, with a storage footprint of over 500 petabytes in their data lake, creating over 9,000 clusters daily.
- The Consolidated Audit Trail (CAT) system, built by FINRA, helps regulators track market activities and analyze market manipulations.
- CAT faces challenges in performance, resiliency, scalability, and cost due to unpredictable data volumes and complex applications.
- FINRA implemented a progressive architecture, upgraded to Spark 3.1, optimized Spark Shuffle with NVMe-based instances, and improved resiliency with checkpointing and restarts.
- Infrastructure optimizations included migrating to Graviton instances, leveraging on-demand capacity reservations, and managing S3 throughput.
- Cost optimizations were achieved through Graviton migration, instance type diversification, and better scheduling of workloads.
- Future considerations include evaluating EMR on EKS, EMR Serverless, Apache Iceberg, Graviton 3, EBS GP3 volumes, and Apache Spark 3.2.
Insights
- Progressive architecture, where data is processed in batches before peak hours, can significantly reduce processing time during critical periods.
- Regular software upgrades can leverage performance improvements and bug fixes in newer versions.
- NVMe-based instances can provide better IOPS for shuffle-intensive Spark applications, leading to performance gains.
- Checkpointing and automatic restarts are crucial for maintaining resiliency in applications with tight SLAs.
- Choosing the right instance type and optimizing configurations can mitigate the impact of underperforming hardware.
- Managing S3 throughput and partitioning can address scalability challenges caused by high concurrency and throughput demands.
- On-demand capacity reservations and targeted ODCR can ensure capacity availability for critical workloads.
- Diversifying instance types in EMR fleet configurations can improve spot capacity availability and reduce costs.
- Continuous performance engineering and optimization are essential for applications that need to scale with increasing data volumes.