Title

AWS re:Invent 2023 - Building observability to increase resiliency (COP343)

Summary

The talk focused on how observability supports resiliency when operating services.
Techniques for diagnosing issues were discussed, including slicing metrics by dimension, handling high-cardinality dimensions, and distributed tracing.
The importance of measuring from different perspectives was emphasized, using synthetics and real user monitoring to uncover hidden issues.
Observability was also linked to preventing future issues by monitoring resource utilization and running game days or controlled experiments.
The speaker highlighted the use of AWS services like CloudWatch, X-Ray, and the Fault Injection Service to aid in observability.
The session concluded with a call to check out the Amazon Builders' Library for in-depth articles on operations and architecture.
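
As an illustration of slicing a metric by dimension, here is a minimal Python sketch. It does not call CloudWatch. The request records, the dimension names (`az`, `api`), the simulated fault, and the helper `error_rate_by` are all hypothetical. They only show how a fault concentrated in one dimension value can be invisible when the same data is cut along a different dimension.

```python
from collections import defaultdict


def make_requests():
    """Build a hypothetical request log with two dimensions per record."""
    requests = []
    for az in ("az-1", "az-2", "az-3"):
        for api in ("GetItem", "PutItem"):
            for i in range(50):
                # Assumed fault: 40% of requests in az-2 fail; no others do.
                failed = az == "az-2" and i % 5 < 2
                requests.append({"az": az, "api": api, "error": failed})
    return requests


def error_rate_by(requests, dimension):
    """Group requests by one dimension and return the error rate per value."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for request in requests:
        key = request[dimension]
        totals[key] += 1
        errors[key] += request["error"]  # True counts as 1
    return {key: errors[key] / totals[key] for key in totals}


if __name__ == "__main__":
    reqs = make_requests()
    # Sliced by API, both operations look equally (and mildly) unhealthy.
    print(error_rate_by(reqs, "api"))
    # Sliced by availability zone, the fault is clearly isolated to az-2.
    print(error_rate_by(reqs, "az"))
```

In this made-up data set, the per-API view shows the same modest error rate for every operation, while the per-zone view localizes the problem, which is the kind of diagnosis the dimensionality technique is meant to enable.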

Dimensionality is a key concept in observability, allowing for granular analysis of metrics across different aspects of a service.
High-cardinality dimensions can be managed using tools like CloudWatch Metrics Insights and Contributor Insights to focus on the most relevant data points.
Distributed tracing with AWS X-Ray provides a visual map of system architecture and helps pinpoint issues within a distributed system.
Real user monitoring and synthetic workloads are essential for understanding the customer experience and detecting issues that may not be apparent from server-side metrics alone.
Automatic rollbacks and scaling based on utilization metrics are critical practices for maintaining service resiliency.
Observability as code is a practice that ensures consistency in monitoring across production and test environments, which is crucial for effective game days.
The AWS Fault Injection Service is recommended for safely conducting controlled experiments to test system resiliency.
The session underscored the importance of a holistic approach to observability, encompassing both technical tools and operational practices.
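
The idea of focusing on the top contributors of a high-cardinality dimension can be sketched in plain Python. This is not the Contributor Insights API. The `customer_id` dimension, the generated events, and the function `top_contributors` are hypothetical and serve only to show why ranking values is more practical than charting every one of them.

```python
from collections import Counter


def top_contributors(events, dimension, n=3):
    """Return the n most frequent values of `dimension` as
    (value, count, share_of_all_events) tuples."""
    counts = Counter(event[dimension] for event in events)
    total = sum(counts.values())
    return [
        (value, count, count / total)
        for value, count in counts.most_common(n)
    ]


def make_events():
    """Assumed workload: 1,000 customers send one request each,
    and one of them ("cust-0042") sends 500 more."""
    events = [{"customer_id": f"cust-{i:04d}"} for i in range(1000)]
    events += [{"customer_id": "cust-0042"} for _ in range(500)]
    return events


if __name__ == "__main__":
    for value, count, share in top_contributors(make_events(), "customer_id", n=1):
        print(f"{value}: {count} requests ({share:.1%} of traffic)")
```

With a thousand distinct customer IDs, one metric per value would be hard to read, whereas a ranked list immediately surfaces the single customer responsible for about a third of the simulated traffic.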