Title
AWS re:Invent 2023 - Safely migrate databases that serve millions of requests per second (NFX307)
Summary
- Joey and Ayushi from Netflix presented on safe database migrations at scale.
- Netflix serves members in many countries from multiple AWS regions, each with multiple availability zones, which results in 12 copies of its data worldwide.
- Migrations are necessary due to scale, evolving use cases, new AWS features, and cost considerations.
- Avoiding a data migration is preferable; when one is unavoidable, it should be automated and executed in parallel for safety.
- The migration process involves abstracting the storage API, shadowing traffic idempotently, and merging in-flight writes with backfill data.
- Netflix uses a data abstraction or gateway to decouple applications from database-specific APIs.
- Idempotency tokens are used for deduplication and conflict resolution.
- AWS Nitro has improved clock accuracy, which aids in generating idempotency tokens.
- Read APIs are designed to be shadowable and resumable.
- Shadow traffic is used to duplicate traffic to a new database implementation without impacting production.
- Backfilling involves copying data from the old to the new database, with throttling to avoid system overload.
- Verification of data correctness and system performance is crucial before promoting a new database to production.
- Netflix completed a significant migration effort in 2023, moving over 250 databases that served 300+ applications and held thousands of terabytes of data.
- Case studies included Cassandra version upgrades and a migration from Dynomite to Netflix's Key-Value data abstraction.
- Key takeaways: automate migrations, evaluate new versions thoroughly, isolate API and data migrations, and prefer homogeneous migrations.
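
The flow described above (idempotent writes, shadowing, then merging in-flight writes with backfill data) can be sketched roughly as follows. This is a minimal illustration and not Netflix's implementation: `IdempotencyToken`, `InMemoryStore`, `ShadowingGateway`, and `backfill` are hypothetical names, the in-memory dictionary stands in for a real database, and the random nonce used as a tie-breaker is an assumption.

```python
import time
import uuid
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True, order=True)
class IdempotencyToken:
    # Compared by timestamp first, then by nonce (assumed tie-breaker).
    timestamp_micros: int
    nonce: str

    @staticmethod
    def generate() -> "IdempotencyToken":
        return IdempotencyToken(time.time_ns() // 1_000, uuid.uuid4().hex)


Row = Tuple[IdempotencyToken, str]


class InMemoryStore:
    """Hypothetical stand-in for a database; keeps the write with the highest token."""

    def __init__(self) -> None:
        self.rows: Dict[str, Row] = {}

    def put(self, key: str, value: str, token: IdempotencyToken) -> bool:
        current = self.rows.get(key)
        if current is not None and current[0] >= token:
            return False  # duplicate or older write: applying it is a no-op
        self.rows[key] = (token, value)
        return True

    def get(self, key: str) -> Optional[str]:
        row = self.rows.get(key)
        return row[1] if row else None


class ShadowingGateway:
    """Writes to the primary store and mirrors each write to the shadow store."""

    def __init__(self, primary: InMemoryStore, shadow: InMemoryStore) -> None:
        self.primary = primary
        self.shadow = shadow
        self.shadow_errors = 0

    def put(self, key: str, value: str) -> IdempotencyToken:
        token = IdempotencyToken.generate()
        self.primary.put(key, value, token)
        try:
            # The same token is reused, so a retry or replay stays idempotent.
            self.shadow.put(key, value, token)
        except Exception:
            # A shadow failure is counted but never surfaces to the caller.
            self.shadow_errors += 1
        return token


def backfill(old_rows: Dict[str, Row], new: InMemoryStore) -> int:
    """Copy rows from a snapshot of the old store; newer shadowed writes win."""
    copied = 0
    for key, (token, value) in old_rows.items():
        if new.put(key, value, token):
            copied += 1
    return copied
```

Because every row carries its original token, the backfill can be replayed or resumed without clobbering writes that arrived through the shadow path while it was running.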
Insights
- Netflix's scale and global presence necessitate robust and safe migration strategies to avoid service disruptions.
- The overlap of database solutions for different use cases at Netflix indicates a need for flexible and adaptable data platforms.
- The introduction of new AWS features can trigger migrations even if use cases remain static, highlighting the importance of staying current with cloud offerings.
- Cost considerations play a significant role in migration decisions, especially as managed services may become more expensive at scale compared to self-managed options.
- The use of data abstraction layers and gateways allows Netflix to change data stores without impacting client applications, demonstrating the benefits of loose coupling and abstraction in system design.
- Idempotency tokens and precise clock synchronization are critical for ensuring data consistency during migrations, with AWS Nitro providing the necessary clock accuracy.
- Shadowing traffic and backfilling are essential techniques for ensuring data integrity and system performance during migrations.
- The case studies presented by Netflix provide real-world examples of the challenges and solutions involved in large-scale database migrations.
- The key takeaways emphasize the importance of automation, thorough testing, and strategic planning in successful database migrations.
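
As a rough illustration of throttled backfilling followed by verification, the sketch below copies rows in batches at a capped rate and then reports keys whose values differ between the two stores. It is an assumption-laden example, not the tooling from the talk: `throttled_backfill`, `verify`, and their parameters are hypothetical, plain dictionaries stand in for the old and new databases, and any key already present in the target is assumed to come from a newer shadowed write.

```python
import time
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def batched(items: Iterable[Tuple[str, str]], size: int) -> Iterator[List[Tuple[str, str]]]:
    """Yield lists of at most `size` items."""
    batch: List[Tuple[str, str]] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def throttled_backfill(
    source: Dict[str, str],
    target: Dict[str, str],
    batch_size: int,
    max_rows_per_second: float,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Copy rows in batches, pausing so the rate stays under the cap."""
    min_batch_seconds = batch_size / max_rows_per_second
    processed = 0
    for batch in batched(source.items(), batch_size):
        started = time.monotonic()
        for key, value in batch:
            # Assumption: an existing key came from a newer shadowed write,
            # so the backfill must not overwrite it.
            target.setdefault(key, value)
            processed += 1
        elapsed = time.monotonic() - started
        if elapsed < min_batch_seconds:
            sleep(min_batch_seconds - elapsed)
    return processed


def verify(source: Dict[str, str], target: Dict[str, str]) -> List[str]:
    """Return the keys whose value in the target differs from the source."""
    return sorted(key for key, value in source.items() if target.get(key) != value)
```

An empty list from `verify` would be one signal, alongside performance checks, that the new store is ready to be promoted; the `sleep` parameter is injectable only so the throttle can be exercised without real delays.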