How Not to Practice Observability (DOP404)

Title

AWS re:Invent 2023 - How not to practice observability (DOP404)

Summary

  • Anand from ManageEngine, a division of Zoho Corporation, discusses common pitfalls in implementing observability.
  • Observability is proactive and draws on historical data, whereas monitoring is reactive.
  • Quality of observability improves with the right data sampling, not just more data.
  • Misconceptions about observability can lead to issues like overprovisioning and missing critical spikes in metrics.
  • Dashboards should be created thoughtfully, both to avoid technical debt and to make sure they cover the issues teams refer to most often.
  • Assumptions can lead to incomplete observability that fails to capture every layer of an application.
  • Misconfigurations in alerting can lead to alert fatigue and unnecessary costs.
  • DevOps teams should avoid centralizing configurations and instead tailor observability practices to specific applications.
  • Data hoarding and access restrictions can hinder effective observability.
  • Platform engineering is emerging to address data unification challenges.
  • Observability systems need failover mechanisms and should not cause system crashes.
  • Knowledge transfer between shifts is crucial to avoid reinventing the wheel.
  • Adopting new tools requires internal changes and should not be done just for the sake of using new technology.
  • ManageEngine offers tools for observability and invites attendees to visit their booth for more insights.
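
The point about sampling and missed spikes can be made concrete with a small sketch. The data, the 90% threshold, and the helper names below are illustrative assumptions, not material from the talk. The example covers only the under-sampling side: a short CPU spike disappears when one-second readings are averaged into a single per-minute value, but stays visible when a finer window or a max aggregation is used.

```python
def downsample_mean(values, window):
    """Average consecutive, non-overlapping windows of raw samples."""
    chunks = [values[i:i + window] for i in range(0, len(values), window)]
    return [sum(chunk) / len(chunk) for chunk in chunks]


def downsample_max(values, window):
    """Keep the peak of each consecutive, non-overlapping window."""
    return [max(values[i:i + window]) for i in range(0, len(values), window)]


# Hypothetical CPU utilisation (%), one reading per second for one minute.
raw = [20.0] * 60
raw[30:33] = [98.0, 99.0, 97.0]  # a three-second spike

THRESHOLD = 90.0  # assumed alerting threshold

per_minute_mean = downsample_mean(raw, 60)  # one averaged point per minute
per_minute_max = downsample_max(raw, 60)    # one peak value per minute
per_10s_max = downsample_max(raw, 10)       # one peak value per 10 seconds

# The averaged series never crosses the threshold, so the spike is invisible.
spike_seen_in_mean = any(v > THRESHOLD for v in per_minute_mean)
# The max-based series preserve the spike.
spike_seen_in_max = any(v > THRESHOLD for v in per_minute_max)
spike_windows = [i for i, v in enumerate(per_10s_max) if v > THRESHOLD]

print(spike_seen_in_mean, spike_seen_in_max, spike_windows)
```

Which aggregation and interval are appropriate depends on the metric. The sketch only shows that the choice changes what an observer can see.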

Insights

  • Observability is a complex field that requires a balance between proactive data analysis and avoiding information overload.
  • The right sampling rate is crucial for accurate observability, as both under-sampling and over-sampling can lead to misinterpretation of system health.
  • Dashboard creation is a skill gap in many organizations, and dashboards should be created with a clear purpose and regular usage in mind.
  • There is a risk of assuming that if individual parts of a system are fine, the whole system is fine, which can lead to missing systemic issues.
  • Alerting configurations should be optimized to reduce noise and prevent alert fatigue among engineers.
  • Decentralizing observability configurations can empower teams to tailor observability to their specific needs, avoiding a one-size-fits-all approach.
  • Data accessibility and cross-team observability are essential for quick incident resolution.
  • Platform engineering is becoming important for managing data across various tools and ensuring a unified view of observability data.
  • Observability systems themselves need to be robust and not contribute to system instability.
  • When adopting new tools, it's important to consider the people and processes involved, not just the capabilities of the tool itself.
  • ManageEngine's experience with observability across a wide range of products and customers gives it credibility in the field.
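
As one illustration of reducing alert noise, the sketch below suppresses repeat notifications for the same alert within a cooldown period. This is a common general technique rather than a method described in the talk. The class name `AlertSuppressor`, the alert keys, and the 300-second cooldown are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AlertSuppressor:
    """Allow at most one notification per alert key per cooldown period."""

    cooldown_seconds: float
    _last_sent: dict = field(default_factory=dict)

    def should_notify(self, alert_key: str, now: float) -> bool:
        """Return True if this alert should page someone at time `now`."""
        last = self._last_sent.get(alert_key)
        if last is not None and now - last < self.cooldown_seconds:
            return False  # same alert fired recently: treat as noise
        self._last_sent[alert_key] = now
        return True


# Hypothetical stream of (alert key, timestamp in seconds) events.
events = [
    ("disk_full", 0),
    ("disk_full", 60),   # repeat inside the cooldown, suppressed
    ("cpu_high", 90),    # different key, notified
    ("disk_full", 400),  # cooldown elapsed, notified again
]

suppressor = AlertSuppressor(cooldown_seconds=300)
decisions = [suppressor.should_notify(key, ts) for key, ts in events]
print(decisions)
```

A fixed cooldown is the simplest option. Grouping related alerts or requiring a condition to persist before firing are alternatives with different trade-offs.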