How Amazon Uses Better Metrics for Improved Website Performance (AMZ302)

Title

AWS re:Invent 2022 - How Amazon uses better metrics for improved website performance (AMZ302)

Summary

  • Jim Roskind and Frank Stone presented on the importance of using the right metrics to improve website performance, focusing on latency.
  • Roskind discussed the pitfalls of percentile latency goals and introduced organizational dynamics: the way groups change their behavior in response to the goals they are given.
  • He explained why percentile goals such as P50 and P90 are problematic: once teams optimize for the target itself, performance outside the targeted percentile can get worse over time without the metric showing it.
  • Roskind proposed the trimmed mean (TM99) as a better latency metric. It discards the slowest 1% of samples and averages the rest, so it reflects the experience of nearly all users while ignoring extreme outliers.
  • He also emphasized the value of histograms for visualizing data and identifying opportunities for improvement.
  • Frank Stone discussed how to measure latency using Amazon CloudWatch, including capturing metrics, measuring with the trimmed mean, monitoring progress, and using histograms to analyze data.
  • Stone outlined methods for collecting latency metrics in CloudWatch, setting up alarms and dashboards, and using histograms to uncover patterns and opportunities for optimization.
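
The contrast between percentile goals and the trimmed mean described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the function names, the sample latencies, and the 100 ms bucket width are all assumptions, and the percentile uses a simple nearest-rank rule that may differ from how CloudWatch interpolates its statistics.

```python
import math
from collections import Counter


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def trimmed_mean(samples, keep_percent=99):
    """Average the fastest keep_percent% of samples, discarding the slowest remainder."""
    ordered = sorted(samples)
    keep = max(1, len(ordered) * keep_percent // 100)
    return sum(ordered[:keep]) / keep


def histogram(samples, bucket_ms=100):
    """Count samples per fixed-width latency bucket, keyed by the bucket's lower bound."""
    counts = Counter((int(s) // bucket_ms) * bucket_ms for s in samples)
    return dict(sorted(counts.items()))


# Hypothetical latencies in milliseconds: 90 fast requests and a slow tail of 10.
before = [100] * 90 + [400] * 10
# The same service after a regression that doubles the latency of the slow tail.
after = [100] * 90 + [800] * 10

for label, data in (("before", before), ("after", after)):
    print(label,
          "P50:", percentile(data, 50),
          "P90:", percentile(data, 90),
          "TM99:", round(trimmed_mean(data), 1),
          "histogram:", histogram(data))
```

In this made-up data set P50 and P90 are identical before and after the regression, because the slow tail sits just beyond the 90th percentile, while TM99 rises and the histogram shows the tail moving to a slower bucket.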

Insights

  • Percentile goals like P50 and P90 can inadvertently encourage developers to focus on meeting specific targets rather than genuinely improving performance, leading to a clustering of performance around these targets.
  • Organizational dynamics play a significant role in how metrics are pursued and achieved. Metrics can drive behavior in unintended ways: when a measure becomes a target, it can cease to be a good measure (an observation often called Goodhart's law).
  • The trimmed mean (TM99) is a robust latency metric: it averages the fastest 99% of requests and excludes only the slowest 1%, so it covers almost every user's experience without being dominated by extreme outliers.
  • Histograms are powerful tools for visualizing the distribution of latency data, revealing outliers, and identifying areas for potential optimization that might not be apparent when focusing solely on percentile targets.
  • Amazon CloudWatch supports the trimmed mean and other advanced statistics, so AWS users can apply the same strategies discussed by Roskind and Stone to improve their own website performance.
  • CloudWatch RUM (Real User Monitoring) collects latency metrics directly from end users' browsers, giving a more accurate picture of user experience than server-side measurements alone.
  • Automating responses to latency issues using CloudWatch alarms and actions can help maintain performance standards and quickly address emerging problems.
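
As a rough sketch of the alarm setup mentioned above, the snippet below builds the arguments for a CloudWatch alarm on the TM99 statistic. The namespace, metric name, alarm name, threshold, and evaluation settings are hypothetical placeholders rather than values from the talk. The request is only constructed, not sent: sending it assumes the third-party boto3 SDK is installed and AWS credentials and a region are configured.

```python
def tm99_alarm_request(namespace, metric_name, threshold_ms):
    """Build keyword arguments for CloudWatch's PutMetricAlarm API using TM99."""
    return {
        "AlarmName": f"{metric_name}-tm99-high",   # hypothetical naming scheme
        "Namespace": namespace,
        "MetricName": metric_name,
        # Trimmed mean goes in ExtendedStatistic; Statistic must then be omitted.
        "ExtendedStatistic": "tm99",
        "Period": 300,                              # seconds per data point (assumed)
        "EvaluationPeriods": 3,                     # consecutive breaches required (assumed)
        "Threshold": float(threshold_ms),
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }


def create_alarm(request):
    """Send the request to CloudWatch. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without the SDK

    boto3.client("cloudwatch").put_metric_alarm(**request)


# Hypothetical namespace and metric name, for illustration only.
request = tm99_alarm_request("MyWebsite", "PageLatency", threshold_ms=250)
print(request["AlarmName"], request["ExtendedStatistic"], request["Threshold"])
# To actually create the alarm: create_alarm(request)
```

Keeping the request builder separate from the API call makes the alarm definition easy to review and test before anything is created in an AWS account.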