Title
AWS re:Invent 2023 - What’s new with AWS observability and operations (COP339)
Summary
- Brian Denny, a general manager in the AWS Observability team, and his colleague Greg Epple, a tech leader for cloud operations, presented new AWS capabilities for observability and operations.
- AWS aims to assist developers and operators in handling operational incidents, focusing on operational excellence and leveraging telemetry data.
- The observability and operations services are designed to work across AWS, on-premises environments, and other clouds, with an emphasis on easy monitoring, machine learning, and automation that saves time and money.
- Amazon CloudWatch is used extensively within Amazon and AWS for real-time monitoring and troubleshooting.
- AWS has launched several new features and services, including CloudWatch Logs anomaly detection, alarm recommendations, dashboard variables, Amazon Managed Grafana plugins, AWS User Notifications, and integration with Amazon Q.
- Application Signals, a new feature, provides pre-built dashboards for service operators and integrates with Container Insights for monitoring application resiliency.
- AWS has improved the Prometheus experience with agentless metric collection for Amazon EKS clusters.
- CloudWatch natural language query lets users describe what they are looking for in plain language and generates the corresponding query.
- CloudWatch Live Tail enables real-time log viewing.
- AWS introduced two further capabilities: a new Infrequent Access log class for cost-effective log storage, and multi-data source querying for hybrid and multi-cloud metric data.
- AWS Systems Manager now features a low-code visual designer for runbooks and integrates with Amazon CodeGuru for security.
- Incident Manager has been updated to include on-call schedules for more intelligent routing of incidents.
Insights
- AWS is heavily investing in machine learning to enhance the observability and operational capabilities of its services, aiming to reduce the cognitive load on developers and operators during incident handling.
- The emphasis on operational excellence and the integration of telemetry data into AWS services reflect Amazon's internal practices and its commitment to providing a robust and reliable cloud platform.
- The introduction of CloudWatch Logs anomaly detection and natural language query capabilities indicates AWS's focus on making log analysis more accessible and efficient, potentially reducing the time spent on incident investigation.
- The new Application Signals feature suggests a shift towards a more application-centric approach to operations, aligning with modern microservices and containerized application architectures.
- AWS's updates to Systems Manager and Incident Manager, including the visual designer for runbooks and on-call schedules, demonstrate a push towards simplifying incident response and remediation processes.
- The new log class for infrequent access and the multi-data source querying feature show AWS's responsiveness to customer needs for cost optimization and hybrid/multi-cloud environments.
- The integration with Amazon Q and the mobile app for user notifications reflect a trend towards more proactive and user-friendly incident management and communication.
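As a rough illustration of the infrequent-access log class mentioned above, the sketch below shows one way a log group might be created with that class using boto3. It is a minimal example under stated assumptions, not something shown in the session: the helper names `build_log_group_request` and `create_log_group` and the log group name are hypothetical, and the only AWS-specific detail relied on is the `logGroupClass` parameter of the CloudWatch Logs `CreateLogGroup` API.

```python
from typing import Any, Dict

# Log classes accepted by the CloudWatch Logs CreateLogGroup API.
VALID_LOG_CLASSES = {"STANDARD", "INFREQUENT_ACCESS"}


def build_log_group_request(name: str, log_class: str = "INFREQUENT_ACCESS") -> Dict[str, Any]:
    """Build the keyword arguments for a CreateLogGroup call (hypothetical helper)."""
    if log_class not in VALID_LOG_CLASSES:
        raise ValueError(f"unknown log class: {log_class}")
    return {"logGroupName": name, "logGroupClass": log_class}


def create_log_group(name: str, log_class: str = "INFREQUENT_ACCESS") -> None:
    """Create the log group; requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the request builder works without boto3 installed

    client = boto3.client("logs")
    client.create_log_group(**build_log_group_request(name, log_class))


if __name__ == "__main__":
    # Only prints the request that would be sent; the name is a placeholder.
    print(build_log_group_request("/example/app/debug"))
```

Separating the request builder from the API call keeps the example checkable without an AWS account; calling `create_log_group` is what would actually create the resource.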