Title

AWS re:Invent 2022 - Observability best practices at Amazon (COP343)

Summary

The session covers Amazon's approach to observability: understanding customer experience through metrics, troubleshooting, and measuring from multiple perspectives.
Observability is framed as a way to empathize with customers, not just as a tool for system monitoring.
Amazon's DevOps model means builders are on-call and closely involved with the operations of the services they create.
The talk discusses the use of alarms, metrics, logs, traces, and post-mortem analysis (Correction of Errors process) to continuously improve operations.
For root cause analysis, Amazon uses the "five whys" technique, which originated in Toyota's manufacturing processes.
Observability data is used in operational reviews at all levels, from small teams to AWS-wide meetings.
CloudWatch is a key tool for Amazon's observability, ingesting vast amounts of metric observations and log data.
Simple, continuous instrumentation is highlighted as the way to make telemetry data easy to collect.
The session delves into the use of dashboards, tracing, and profiling to identify and resolve issues.
The session discusses Amazon's approach to managing alarms and reducing alarm fatigue, including the use of composite alarms.
CloudWatch Synthetics and Real User Monitoring (RUM) are introduced as tools to measure end-to-end customer experience.
The session concludes by emphasizing a culture of continuous improvement and ongoing refinement of observability practices.
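
As an illustration of the composite alarm idea mentioned in the summary, the sketch below builds a CloudWatch composite alarm rule expression that fires only when a customer-facing symptom alarm and at least one supporting alarm are both in ALARM. The alarm names and the choice of rule are hypothetical and not taken from the talk; only the rule syntax (ALARM("name") combined with AND, OR, and parentheses) is CloudWatch's own.

```python
# Minimal sketch: compose a CloudWatch composite alarm rule as a string.
# All alarm names below are hypothetical placeholders.


def in_alarm(name: str) -> str:
    """Rule fragment that is true while the named child alarm is in ALARM."""
    return f'ALARM("{name}")'


def any_of(names) -> str:
    """Rule fragment that is true while at least one named alarm is in ALARM."""
    return "(" + " OR ".join(in_alarm(n) for n in names) + ")"


# Assumption: one alarm tracks the customer-visible symptom, and the
# others track likely contributing signals.
SYMPTOM = "checkout-availability-low"
CAUSES = ["checkout-5xx-high", "checkout-latency-p99-high"]

# Notify only when the symptom and at least one contributing signal agree.
RULE = f"{in_alarm(SYMPTOM)} AND {any_of(CAUSES)}"
print(RULE)

# With boto3 installed and AWS credentials configured, the rule could be
# registered roughly like this (not executed here):
#   import boto3
#   boto3.client("cloudwatch").put_composite_alarm(
#       AlarmName="checkout-customer-impact",  # hypothetical name
#       AlarmRule=RULE,
#   )
```

Requiring agreement between several child alarms is one way to cut down on notifications from a single noisy metric; the right combination depends on the service.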

Observability at Amazon is deeply integrated into its operational culture, with a strong focus on customer experience rather than just system health.
Amazon's observability practices are underpinned by a comprehensive DevOps model, where developers are responsible for the full lifecycle of their services, including on-call duties.
Structured logging and metrics let Amazon troubleshoot and understand system behavior at a granular level, so that teams can respond to incidents effectively.
Amazon's observability tools, such as CloudWatch, are used internally and also offered to customers, which suggests a mature, well-tested suite of tools.
The session highlights the importance of alarm management and the use of composite alarms to avoid alarm fatigue, which is a common challenge in large-scale systems.
The introduction of CloudWatch Synthetics and RUM shows Amazon's commitment to measuring real user interactions and experiences, not just system metrics.
The culture of continuous improvement, exemplified by the "five whys" technique and post-mortem analysis, is a key factor in Amazon's ability to maintain high operational standards.
The talk emphasizes the need for simplicity in instrumentation and the importance of making observability data easily accessible to facilitate quick and informed decision-making during incidents.
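
The points above about structured logging and simple instrumentation can be sketched as a small wrapper that emits exactly one JSON log line per unit of work. This is an assumption-laden illustration, not Amazon's internal tooling: the function name `instrumented`, the field names, and the example operation are all hypothetical.

```python
import json
import sys
import time


def instrumented(operation, work, stream=sys.stdout, **dimensions):
    """Run work() and write one JSON log line describing the request.

    `operation` and `dimensions` are free-form labels (hypothetical names)
    that make the line easy to filter and aggregate later.
    """
    record = {"operation": operation, **dimensions}
    start = time.perf_counter()
    try:
        result = work()
        record["fault"] = 0
        return result
    except Exception as exc:
        # Record the failure, then let the caller see the original error.
        record["fault"] = 1
        record["error"] = type(exc).__name__
        raise
    finally:
        # Runs on success and on failure, so every request leaves a line.
        elapsed_ms = (time.perf_counter() - start) * 1000
        record["latency_ms"] = round(elapsed_ms, 3)
        stream.write(json.dumps(record, sort_keys=True) + "\n")


if __name__ == "__main__":
    # Hypothetical usage: wrap any callable that represents one request.
    instrumented("GetItem", lambda: {"id": 1}, customer_tier="free")
```

Because each line carries both measurements (latency, fault) and context (operation, dimensions), the same data can feed metrics and support per-request troubleshooting.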