Title
AWS re:Invent 2022 - How Amazon uses better metrics for improved website performance (AMZ302)
Summary
- Jim Roskind and Frank Stone presented on the importance of using the right metrics to improve website performance, focusing on latency.
- Roskind discussed the pitfalls of percentile latency goals and introduced the concept of organizational dynamics, which affects how groups respond to goals.
- He explained why percentile goals like P50 and P90 are problematic, as they can lead to worse performance over time due to organizational behavior.
- Roskind proposed the trimmed mean (TM99) as a better latency metric: it discards the slowest 1% of requests and averages the remaining 99%, so nearly every user's experience counts toward the number.
- He also emphasized the value of histograms for visualizing data and identifying opportunities for improvement.
- Frank Stone discussed how to measure latency with Amazon CloudWatch: capturing metrics, measuring with the trimmed mean, monitoring progress, and analyzing data with histograms.
- Stone outlined methods for collecting latency metrics in CloudWatch, setting up alarms and dashboards, and using histograms to uncover patterns and opportunities for optimization.
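The contrast between percentile goals and TM99 described above can be illustrated with a minimal sketch. The latency values are hypothetical, and the helpers `percentile` and `trimmed_mean` are illustrative names, not CloudWatch APIs. The percentile uses the simple nearest-rank rule, so CloudWatch's own calculation may differ in detail.

```python
def percentile(samples, p):
    """Nearest-rank percentile: the value at rank ceil(p% of n) in sorted order."""
    ordered = sorted(samples)
    rank = max(1, (p * len(ordered) + 99) // 100)  # integer ceiling of p*n/100
    return ordered[rank - 1]


def trimmed_mean(samples, keep_percent=99):
    """Mean of the fastest keep_percent% of samples (TM99 when keep_percent=99)."""
    ordered = sorted(samples)
    keep = max(1, (len(ordered) * keep_percent) // 100)
    kept = ordered[:keep]
    return sum(kept) / len(kept)


# Hypothetical latencies in milliseconds for 100 requests.
before = [100] * 50 + [150] * 30 + [200] * 10 + [400] * 9 + [5000]
# A regression: the 30 requests that took 150 ms now take 200 ms.
after = [100] * 50 + [200] * 40 + [400] * 9 + [5000]

for name, data in (("before", before), ("after", after)):
    print(name,
          "P50 =", percentile(data, 50),
          "P90 =", percentile(data, 90),
          "TM99 =", round(trimmed_mean(data), 1))
```

In this made-up data set, P50 and P90 are identical before and after the regression because the slowdown stays below the P90 value. TM99 rises because it averages every request except the single slowest one.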
Insights
- Percentile goals like P50 and P90 can inadvertently encourage developers to focus on meeting specific targets rather than genuinely improving performance, leading to a clustering of performance around these targets.
- Organizational dynamics shape how metrics are pursued and achieved. Metrics can drive behavior in unintended ways: when a measure becomes a target, it can cease to be a good measure (an observation commonly known as Goodhart's law).
- Trimmed mean (TM99) is a robust latency metric: it reflects almost every user's experience, excludes only the extreme outliers, and so represents typical latency better than a single percentile.
- Histograms are powerful tools for visualizing the distribution of latency data, revealing outliers, and identifying areas for potential optimization that might not be apparent when focusing solely on percentile targets.
- Amazon CloudWatch supports trimmed mean and other advanced statistics, so AWS users can apply the same strategies Roskind and Stone described to improve their own website performance.
- Real User Monitoring (RUM) in CloudWatch allows for the collection of latency metrics directly from end users, providing a more accurate picture of user experience.
- Automating responses to latency issues using CloudWatch alarms and actions can help maintain performance standards and quickly address emerging problems.
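The point about histograms above can be sketched in a few lines. This is a minimal illustration with hypothetical latency samples and fixed-width buckets. The helper names `latency_histogram` and `render` are assumptions for this example and are not part of CloudWatch.

```python
from collections import Counter


def latency_histogram(samples_ms, bucket_width_ms=100):
    """Count samples per fixed-width bucket, keyed by each bucket's lower bound."""
    counts = Counter(
        (int(s) // bucket_width_ms) * bucket_width_ms for s in samples_ms
    )
    return dict(sorted(counts.items()))


def render(histogram, bucket_width_ms=100):
    """Render the histogram as text, one row per non-empty bucket."""
    rows = []
    for lower, count in histogram.items():
        upper = lower + bucket_width_ms - 1
        rows.append(f"{lower:>5}-{upper:<5} ms | {'#' * count} ({count})")
    return "\n".join(rows)


# Hypothetical samples: a fast cluster, a second cluster near 1000 ms,
# and a single extreme outlier.
samples = [80, 90, 95, 110, 120, 130, 140, 1010, 1020, 1030, 4800]

hist = latency_histogram(samples, bucket_width_ms=100)
print(render(hist, bucket_width_ms=100))
```

A single percentile or average would blend these samples into one number. The bucketed view keeps the second cluster and the outlier visible as separate rows, which is the kind of pattern worth investigating.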