Scaling Data Processing Wamazon Emr at the Speed of Market Volatility Ant338

Title

AWS re:Invent 2022 - Scaling data processing w/Amazon EMR at the speed of market volatility (ANT338)

Summary

  • Meenakshi Ponchankaran, a senior big data architect at AWS, and Sampath Kumar Velasamy, director of technology at FINRA, discuss scaling Amazon EMR to handle market volatility.
  • FINRA processes over 600 billion records daily, with a storage footprint of over 500 petabytes in their data lake, creating over 9,000 clusters daily.
  • The Consolidated Audit Trail (CAT) system, built by FINRA, helps regulators track market activities and analyze market manipulations.
  • CAT faces challenges in performance, resiliency, scalability, and cost due to unpredictable data volumes and complex applications.
  • FINRA implemented a progressive architecture, upgraded to Spark 3.1, optimized Spark Shuffle with NVMe-based instances, and improved resiliency with checkpointing and restarts.
  • Infrastructure optimizations included migrating to Graviton instances, leveraging on-demand capacity reservations, and managing S3 throughput.
  • Cost optimizations were achieved through Graviton migration, instance type diversification, and better scheduling of workloads.
  • Future considerations include evaluating EMR on EKS, EMR Serverless, Apache Iceberg, Graviton 3, EBS GP3 volumes, and Apache Spark 3.2.

Insights

  • Progressive architecture, where data is processed in batches before peak hours, can significantly reduce processing time during critical periods.
  • Regular software upgrades can leverage performance improvements and bug fixes in newer versions.
  • NVMe-based instances can provide better IOPS for shuffle-intensive Spark applications, leading to performance gains.
  • Checkpointing and automatic restarts are crucial for maintaining resiliency in applications with tight SLAs.
  • Choosing the right instance type and optimizing configurations can mitigate the impact of underperforming hardware.
  • Managing S3 throughput and partitioning can address scalability challenges caused by high concurrency and throughput demands.
  • On-demand capacity reservations and targeted ODCR can ensure capacity availability for critical workloads.
  • Diversifying instance types in EMR fleet configurations can improve spot capacity availability and reduce costs.
  • Continuous performance engineering and optimization are essential for applications that need to scale with increasing data volumes.