Observability Best Practices for Modern Applications (COP344)

Title

AWS re:Invent 2022 - Observability: Best practices for modern applications (COP344)

Summary

  • Roland Barcia and Greg Eppel presented on observability best practices for modern applications.
  • Modern apps are more difficult to observe due to their distributed nature, use of various technologies, and microservices architecture.
  • Observability should be considered a day zero problem, and modern apps should be built to be observed.
  • The session covered four best practices:
    1. Navigating instrumentation options.
    2. Optimizing the cost of high cardinality.
    3. Reducing alarm fatigue.
    4. Avoiding dangling traces.
  • AWS services have varying levels of support for tracing, and it's important to understand how to propagate traces across services.
  • OpenTelemetry is recommended for metrics and traces; for logs, other tools are recommended until OpenTelemetry log support reaches general availability (GA).
  • CloudWatch Embedded Metric Format can help manage high cardinality and cost.
  • Synthetic testing, machine learning, and alarm correlation can reduce alarm fatigue.
  • Instrumentation of code is necessary for tracing, and trace context must be passed across service boundaries to avoid dangling traces.
  • The session included hands-on examples and demos.
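
The point about CloudWatch Embedded Metric Format (EMF) and high cardinality can be illustrated with a minimal sketch. It builds one EMF log line by hand, declaring only a low-cardinality field as a metric dimension and leaving high-cardinality fields as plain log properties, so they stay searchable in the log event without becoming separate metrics. The helper `build_emf_record`, the namespace `DemoApp`, and all field values are hypothetical and chosen for illustration; this is not the code shown in the session.

```python
import json
import time


def build_emf_record(namespace, dimensions, metrics, properties, timestamp_ms=None):
    """Return one EMF log line as a JSON string (hypothetical helper).

    dimensions: low-cardinality name -> value pairs, declared as metric dimensions
    metrics:    metric name -> (value, unit)
    properties: high-cardinality context that stays in the log event only
    """
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    record = {
        "_aws": {
            "Timestamp": timestamp_ms,
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    # One dimension set, listing only the low-cardinality keys.
                    "Dimensions": [list(dimensions)],
                    "Metrics": [
                        {"Name": name, "Unit": unit}
                        for name, (_value, unit) in metrics.items()
                    ],
                }
            ],
        },
    }
    # Dimension values, metric values, and extra properties are all
    # top-level members of the same JSON document.
    record.update(dimensions)
    record.update({name: value for name, (value, _unit) in metrics.items()})
    record.update(properties)
    return json.dumps(record)


# Assumed example values: "Service" is a dimension; the request and
# customer identifiers are not, so they do not multiply the metric count.
line = build_emf_record(
    namespace="DemoApp",
    dimensions={"Service": "checkout"},
    metrics={"Latency": (123, "Milliseconds")},
    properties={"requestId": "req-0001", "customerId": "cust-42"},
    timestamp_ms=1700000000000,
)
print(line)
```

Written to standard output in a Lambda function, or shipped through the CloudWatch agent, a line like this is ingested as a log event from which CloudWatch extracts the declared metric. Which fields are worth promoting to dimensions is a per-application cost decision.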

Insights

  • The shift from monolithic to microservices architecture has significantly increased the complexity of observability.
  • Observability is not just about monitoring; it's about understanding the full lifecycle of logs, metrics, and traces within a system.
  • The use of various AWS services (Lambda, ECS, EKS, ROSA) and technologies (containers, serverless functions) requires a nuanced approach to observability.
  • AWS provides native services and support for popular open-source tools for observability, catering to different customer strategies.
  • At the time of the session, the AWS Distro for OpenTelemetry supported metrics and traces, with log support expected later.
  • CloudWatch Embedded Metric Format is a powerful feature for managing telemetry data efficiently and cost-effectively.
  • Alarm management is crucial in modern applications to avoid alert fatigue and ensure that alarms are meaningful and actionable.
  • Tracing is complex and requires careful planning to ensure end-to-end visibility, especially when dealing with services that do not natively support tracing.
  • The session emphasized the shared responsibility model in AWS, where both AWS and customers must take part in the instrumentation for effective observability.
  • The presenters provided resources such as workshops, GitHub pages, and AWS Skill Builder courses for further learning and implementation of observability best practices.
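
The tracing points above, that trace context must cross every service boundary to avoid dangling traces, can be sketched in a few lines. The example is a hand-rolled illustration that assumes the W3C Trace Context `traceparent` header (the default propagation format in OpenTelemetry); in practice an OpenTelemetry SDK or the X-Ray SDK would do this for you, and X-Ray uses its own `X-Amzn-Trace-Id` header. The function names are hypothetical and not taken from the session.

```python
import re
import secrets

# traceparent layout: version "00", 32-hex trace-id, 16-hex parent span id, 2-hex flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled=True):
    """Start a brand-new trace (used when no valid context arrived)."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def child_traceparent(incoming):
    """Continue the caller's trace: keep its trace-id, mint a new span id."""
    match = TRACEPARENT_RE.match(incoming or "")
    if match is None:
        # Missing or malformed context: this is where a trace would be cut
        # off from its upstream spans, so we can only start a fresh one.
        return new_traceparent()
    trace_id, parent_span_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(parent_span_id) == {"0"}:
        # All-zero ids are invalid in the W3C format.
        return new_traceparent()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


def outgoing_headers(incoming_headers):
    """Build the headers to attach to a downstream call."""
    lowered = {key.lower(): value for key, value in incoming_headers.items()}
    return {"traceparent": child_traceparent(lowered.get("traceparent"))}


# Example value in the W3C format; the ids are arbitrary.
upstream = {"Traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
downstream = outgoing_headers(upstream)
print(downstream["traceparent"])
```

If any hop drops the header, for example a queue or a service that does not forward it, the next service falls into the `new_traceparent` branch and its spans appear as a separate trace. That is the dangling-trace situation the session warns about.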