Title

AWS re:Invent 2022 - Scaling data processing w/Amazon EMR at the speed of market volatility (ANT338)

Summary

Meenakshi Ponchankaran, a senior big data architect at AWS, and Sampath Kumar Velasamy, director of technology at FINRA, discuss scaling Amazon EMR to handle market volatility.
FINRA processes over 600 billion records daily, maintains a data lake with a storage footprint of over 500 petabytes, and creates over 9,000 clusters daily.
The Consolidated Audit Trail (CAT) system, built by FINRA, helps regulators track market activities and analyze market manipulations.
CAT faces challenges in performance, resiliency, scalability, and cost due to unpredictable data volumes and complex applications.
FINRA implemented a progressive architecture, upgraded to Spark 3.1, optimized Spark shuffle performance with NVMe-based instances, and improved resiliency with checkpointing and restarts.
Infrastructure optimizations included migrating to Graviton instances, leveraging On-Demand Capacity Reservations (ODCRs), and managing S3 throughput.
Cost optimizations were achieved through Graviton migration, instance type diversification, and better scheduling of workloads.
Future considerations include evaluating EMR on EKS, EMR Serverless, Apache Iceberg, Graviton3, EBS gp3 volumes, and Apache Spark 3.2.
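
The checkpointing and restart approach mentioned above can be illustrated with a minimal sketch. This is not FINRA's implementation, which the talk summary does not detail; it only shows the general idea of recording completed stages so that a restarted job resumes where it left off. The function name `run_with_checkpoints`, the JSON checkpoint file, and the `max_attempts` parameter are all assumptions made for illustration.

```python
import json
from pathlib import Path


def run_with_checkpoints(stages, checkpoint_path, max_attempts=3):
    """Run (name, callable) stages in order, skipping stages already recorded as done.

    Hypothetical helper for illustration; progress is kept in a small JSON file.
    """
    path = Path(checkpoint_path)
    # Load the names of stages completed by an earlier run, if any.
    done = json.loads(path.read_text()) if path.exists() else []
    for name, stage in stages:
        if name in done:
            continue  # finished in a previous run; do not redo the work
        for attempt in range(1, max_attempts + 1):
            try:
                stage()
                break  # stage succeeded; stop retrying
            except Exception:
                if attempt == max_attempts:
                    raise  # give up; earlier progress stays in the checkpoint file
        done.append(name)
        path.write_text(json.dumps(done))  # persist progress after every stage
    return done
```

Calling the function a second time with the same checkpoint file skips every stage that already completed, so a restart after a failure repeats only the unfinished work.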

Progressive architecture, where data is processed in batches before peak hours, can significantly reduce processing time during critical periods.
Regular software upgrades can leverage performance improvements and bug fixes in newer versions.
NVMe-based instances can provide better IOPS for shuffle-intensive Spark applications, leading to performance gains.
Checkpointing and automatic restarts are crucial for maintaining resiliency in applications with tight SLAs.
Choosing the right instance type and optimizing configurations can mitigate the impact of underperforming hardware.
Managing S3 throughput and partitioning can address scalability challenges caused by high concurrency and throughput demands.
On-Demand Capacity Reservations (ODCRs), including targeted ODCRs, can ensure capacity availability for critical workloads.
Diversifying instance types in EMR instance fleet configurations can improve Spot capacity availability and reduce costs.
Continuous performance engineering and optimization are essential for applications that need to scale with increasing data volumes.
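
The points on capacity reservations and instance type diversification can be sketched as an EMR instance fleet definition. This is a hedged example, not FINRA's actual configuration: the helper name `build_core_fleet`, the instance types, weights, timeout, and ARN are placeholder assumptions. The dictionary keys follow the instance fleet structure accepted by boto3's EMR `run_job_flow` call, where the result would be one element of `Instances={"InstanceFleets": [...]}`.

```python
def build_core_fleet(instance_types, on_demand_units, spot_units, odcr_group_arn=None):
    """Build a CORE instance fleet definition spread across several instance types.

    Hypothetical helper; instance_types is a list of (type, weighted_capacity) pairs.
    """
    on_demand_spec = {"AllocationStrategy": "lowest-price"}
    if odcr_group_arn:
        # Targeted ODCR: prefer reservations in the given resource group first.
        on_demand_spec["CapacityReservationOptions"] = {
            "UsageStrategy": "use-capacity-reservations-first",
            "CapacityReservationResourceGroupArn": odcr_group_arn,
        }
    return {
        "Name": "core-fleet",
        "InstanceFleetType": "CORE",
        "TargetOnDemandCapacity": on_demand_units,
        "TargetSpotCapacity": spot_units,
        # Listing several types gives EMR more capacity pools to draw from.
        "InstanceTypeConfigs": [
            {"InstanceType": itype, "WeightedCapacity": weight}
            for itype, weight in instance_types
        ],
        "LaunchSpecifications": {
            "OnDemandSpecification": on_demand_spec,
            "SpotSpecification": {
                "TimeoutDurationMinutes": 10,  # assumed value for illustration
                "TimeoutAction": "SWITCH_TO_ON_DEMAND",
                "AllocationStrategy": "capacity-optimized",
            },
        },
    }


# Example with placeholder Graviton2 NVMe-backed types and a made-up ARN.
fleet = build_core_fleet(
    [("r6gd.4xlarge", 16), ("m6gd.4xlarge", 16), ("r6gd.8xlarge", 32)],
    on_demand_units=32,
    spot_units=64,
    odcr_group_arn="arn:aws:resource-groups:us-east-1:111122223333:group/example",
)
```

Weighting each type by its vCPU count, as assumed here, lets the target capacities be expressed in one unit regardless of which instance sizes are actually provisioned.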