Observability Best Practices at Amazon (COP343)

Title

AWS re:Invent 2022 - Observability best practices at Amazon (COP343)

Summary

  • The session covers Amazon's approach to observability: using metrics to understand customer experience, troubleshooting, and measuring from multiple perspectives.
  • Observability is framed as a way to empathize with customers, not just as a tool for system monitoring.
  • Amazon's DevOps model means builders are on-call and closely involved with the operations of the services they create.
  • The talk discusses how alarms, metrics, logs, traces, and post-mortem analysis (the Correction of Error, or COE, process) are used to continuously improve operations.
  • Amazon uses a "five whys" technique for root cause analysis, originating from Toyota's manufacturing processes.
  • Observability data is used in operational reviews at all levels, from small teams to AWS-wide meetings.
  • CloudWatch is a key tool for Amazon's observability, ingesting vast amounts of metric observations and log data.
  • The session highlights simple, continuous instrumentation as the way to make telemetry collection easy.
  • The session delves into the use of dashboards, tracing, and profiling to identify and resolve issues.
  • Amazon's approach to alarm management and reduction of alarm fatigue is discussed, including the use of Composite Alarms.
  • CloudWatch Synthetics and Real User Monitoring (RUM) are introduced as tools to measure end-to-end customer experience.
  • The session concludes by emphasizing a culture of continuous improvement and ongoing refinement of observability practices.
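
The composite-alarm idea mentioned above can be sketched as follows. This is a minimal illustration, not the configuration shown in the session: the alarm names, the helper functions `composite_rule` and `should_page`, and the policy of paging only when a customer-facing symptom alarm fires together with at least one underlying cause alarm are all assumptions. The rule string follows the CloudWatch composite alarm expression syntax (`ALARM("name")` combined with `AND` / `OR`).

```python
from typing import Dict, List


def composite_rule(symptom_alarm: str, cause_alarms: List[str]) -> str:
    """Build a composite alarm rule expression (hypothetical helper).

    The rule fires only when the symptom alarm AND at least one cause
    alarm are in ALARM state, so a noisy cause alarm alone pages no one.
    """
    if not cause_alarms:
        raise ValueError("at least one cause alarm is required")
    causes = " OR ".join(f'ALARM("{name}")' for name in cause_alarms)
    return f'ALARM("{symptom_alarm}") AND ({causes})'


def should_page(states: Dict[str, str], symptom_alarm: str,
                cause_alarms: List[str]) -> bool:
    """Evaluate the same policy locally against a map of alarm states."""
    symptom_firing = states.get(symptom_alarm) == "ALARM"
    any_cause_firing = any(states.get(n) == "ALARM" for n in cause_alarms)
    return symptom_firing and any_cause_firing


# Hypothetical alarm names, for illustration only.
rule = composite_rule("checkout-errors", ["db-latency", "cache-misses"])
print(rule)
```

The resulting string could be supplied as the `AlarmRule` of a composite alarm. The local `should_page` function exists only to make the policy easy to reason about and test.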

Insights

  • Observability at Amazon is deeply integrated into its operational culture, with a strong focus on customer experience rather than just system health.
  • Amazon's observability practices are underpinned by a comprehensive DevOps model, where developers are responsible for the full lifecycle of their services, including on-call duties.
  • Structured logging and metrics let Amazon troubleshoot and understand system behavior at a granular level, so teams can respond to incidents effectively.
  • Amazon's observability tools, such as CloudWatch, are not only used internally but also offered to customers, indicating a mature and tested suite of tools.
  • The session highlights the importance of alarm management and the use of composite alarms to avoid alarm fatigue, which is a common challenge in large-scale systems.
  • The introduction of CloudWatch Synthetics and RUM shows Amazon's commitment to measuring real user interactions and experiences, not just system metrics.
  • The culture of continuous improvement, exemplified by the "five whys" technique and post-mortem analysis, is a key factor in Amazon's ability to maintain high operational standards.
  • The talk emphasizes the need for simplicity in instrumentation and the importance of making observability data easily accessible to facilitate quick and informed decision-making during incidents.
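
The point about structured logging and simple instrumentation can be illustrated with a small sketch. It assumes a pattern of emitting one JSON record per unit of work and deriving metrics from those records afterwards. The field names (`operation`, `fault`, `latency_ms`) and the helpers `instrumented` and `fault_rate` are hypothetical and are not taken from the session or from any Amazon library.

```python
import json
import time
from typing import Callable, Iterable


def instrumented(operation: str, work: Callable[[], object],
                 emit: Callable[[str], object] = print):
    """Run `work` and emit one structured log line describing the request."""
    record = {"operation": operation, "fault": 0}
    start = time.perf_counter()
    try:
        return work()
    except Exception:
        record["fault"] = 1  # mark the request as failed, then re-raise
        raise
    finally:
        # Always record latency, whether the request succeeded or failed.
        record["latency_ms"] = round((time.perf_counter() - start) * 1000, 3)
        emit(json.dumps(record, sort_keys=True))


def fault_rate(log_lines: Iterable[str]) -> float:
    """Derive a metric (fraction of faulted requests) from the log lines."""
    records = [json.loads(line) for line in log_lines]
    if not records:
        return 0.0
    return sum(r["fault"] for r in records) / len(records)


# Usage: collect lines in memory instead of printing them.
lines = []
instrumented("GetItem", lambda: "ok", emit=lines.append)
try:
    instrumented("PutItem", lambda: 1 / 0, emit=lines.append)
except ZeroDivisionError:
    pass
print(fault_rate(lines))
```

Because every request produces the same machine-readable record, the same data can serve both purposes: individual lines for troubleshooting a specific request, and aggregates such as a fault rate for dashboards and alarms.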