Title

AWS re:Invent 2022 - Are you ready? Essential strategies for Kubernetes adoption (CON326)

Summary

Ishu Bala, director for EKS, and Rick Sostheim, service team expert, presented strategies for Kubernetes adoption.
Ishu discussed the importance of culture, organizational structure, processes, tools, and architecture in technology adoption.
He described mechanisms as processes that turn inputs into desired outcomes and keep those outcomes in place over time.
Amazon's culture is defined by mechanisms like operational reviews and PR/FAQs.
Ishu highlighted the concept of two-pizza teams for autonomy and ownership, and the need for operational consistency across teams.
AWS operational culture follows the principle "if you build it, you operate it" and emphasizes learning from failures through Correction of Errors (COE) and Operational Readiness Review (ORR).
Tooling is essential for applying best practices across service teams without significant effort.
Rick Sostheim focused on system failures, particularly in Kubernetes and EKS, and how to handle them.
He outlined the EKS service, Kubernetes control plane, and data plane as major failure domains.
Rick stressed the importance of static stability, retries with backoff and jitter, and understanding Kubernetes constructs for resilience.
He provided insights into handling etcd failures, node failures, and control plane impairments.
Rick recommended resources like the Amazon Builders' Library and the EKS Best Practices Guide for further learning.
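
The retry guidance above (retries with backoff and jitter) can be illustrated with a minimal Python sketch. It is not taken from the talk: the function name `call_with_retries`, its parameters, and the default values are hypothetical, and the "full jitter" variant (a random delay between zero and an exponentially growing ceiling) is one common choice among several.

```python
import random
import time


def call_with_retries(operation, is_retryable, base=0.5, cap=30.0, attempts=5,
                      sleep=time.sleep, rng=random.random):
    """Call operation(), retrying retryable errors with capped exponential backoff.

    All names and defaults here are illustrative assumptions, not a library API.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except Exception as error:
            # Give up on non-retryable errors or once the attempt budget is spent.
            if not is_retryable(error) or attempt == attempts - 1:
                raise
            # Exponential ceiling: base, 2*base, 4*base, ... capped at `cap`.
            ceiling = min(cap, base * (2 ** attempt))
            # Full jitter: wait a random fraction of the ceiling so that many
            # clients recovering at once do not retry in lockstep.
            sleep(rng() * ceiling)
```

A caller would pass a function that talks to the Kubernetes API server as `operation` and a predicate that returns True only for transient errors such as timeouts or throttling. The `sleep` and `rng` parameters exist only so the behavior can be checked without real waiting.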

The adoption of Kubernetes requires a holistic approach that includes cultural shifts, organizational restructuring, and the implementation of effective mechanisms.
Amazon's culture of customer obsession and mechanisms like COE and ORR are integral to maintaining operational excellence and learning from failures.
The concept of two-pizza teams is a practical approach to maintaining agility, ownership, and autonomy within teams, which is crucial for innovation and quick decision-making.
Static stability is a key design principle in AWS services: a system keeps operating in its existing state when a dependency, such as a control plane, fails, so the failure of one component does not impact the overall system's functionality.
Understanding and implementing retries with backoff and jitter is critical for managing communication with Kubernetes control planes during outages or impairments.
It is important to monitor etcd storage size and to be cautious about how much data is stored in Kubernetes API objects, since those objects are persisted in etcd and oversized or excessive objects can overload it.
The Amazon Builders' Library and the EKS Best Practices Guide are valuable resources for AWS customers to build resilient and operationally sound Kubernetes environments.
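
As a rough illustration of the static stability principle noted above, the sketch below shows one narrow facet of it: a client that keeps serving its last successfully fetched state while the source it depends on is unavailable. The class `LastKnownGoodCache` and its `fetch` callable are hypothetical names chosen for this example. The sketch does not represent how EKS or Kubernetes implement the idea.

```python
class LastKnownGoodCache:
    """Keep working from the last good value when the dependency fails.

    `fetch` is a hypothetical callable, for example one that asks a control
    plane for the current list of endpoints.
    """

    def __init__(self, fetch):
        self._fetch = fetch
        self._value = None
        self._has_value = False

    def get(self):
        try:
            # Normal path: refresh from the dependency and remember the result.
            self._value = self._fetch()
            self._has_value = True
        except Exception:
            # Impaired path: fall back to the existing state instead of failing,
            # unless nothing has ever been fetched successfully.
            if not self._has_value:
                raise
        return self._value
```

The design choice worth noting is that the fallback path does no new work: it makes no calls to other systems and changes no state, so the consumer continues with what it already has until the dependency recovers.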