Safely Migrate Databases That Serve Millions of Requests per Second (NFX307)

Title

AWS re:Invent 2023 - Safely migrate databases that serve millions of requests per second (NFX307)

Summary

  • Joey and Ayushi from Netflix presented on safe database migrations at scale.
  • Netflix operates in many countries and uses AWS regions with multiple availability zones, resulting in 12 copies of data worldwide.
  • Migrations are necessary due to scale, evolving use cases, new AWS features, and cost considerations.
  • Avoiding data migration altogether is preferable, but when a migration is unavoidable it should be automated and executed in parallel for safety.
  • The migration process involves abstracting the storage API, shadowing traffic idempotently, and merging in-flight writes with backfill data.
  • Netflix uses a data abstraction layer (gateway) to decouple applications from database-specific APIs.
  • Idempotency tokens are used for deduplication and conflict resolution.
  • AWS Nitro has improved clock accuracy, which aids in generating idempotency tokens.
  • Read APIs are designed to be shadowable and resumable.
  • Shadow traffic is used to duplicate traffic to a new database implementation without impacting production.
  • Backfilling involves copying data from the old to the new database, with throttling to avoid system overload.
  • Verification of data correctness and system performance is crucial before promoting a new database to production.
  • Netflix accomplished a significant migration in 2023, moving over 250 databases impacting 300+ applications and thousands of terabytes of data.
  • Case studies included Cassandra version upgrades and the migration from Dynomite to Netflix's key-value abstraction.
  • Key takeaways: automate migrations, evaluate new versions thoroughly, isolate API and data migrations, and prefer homogeneous migrations.
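The idempotency tokens mentioned above can be sketched as a timestamp plus a random nonce, with last-writer-wins comparison used for deduplication and conflict resolution. This is a minimal illustration, not Netflix's actual token format; accurate clocks (as provided by AWS Nitro instances) keep timestamp skew small enough for this ordering to be meaningful.

```python
import os
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class IdempotencyToken:
    """Orders and deduplicates writes across the old and new stores."""
    timestamp_ns: int  # wall-clock time; accurate clocks keep cross-host skew small
    nonce: bytes       # random bytes break ties between same-timestamp writes


def new_token() -> IdempotencyToken:
    return IdempotencyToken(timestamp_ns=time.time_ns(), nonce=os.urandom(16))


def wins(a: IdempotencyToken, b: IdempotencyToken) -> bool:
    """Last-writer-wins: later timestamp wins; the nonce breaks exact ties."""
    return (a.timestamp_ns, a.nonce) > (b.timestamp_ns, b.nonce)
```

Because comparison is deterministic, replaying the same write (e.g. a client retry, or a backfilled copy of a row) with the same token is a no-op rather than a conflicting update.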
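Shadow traffic, as described above, duplicates every production call to the candidate database without letting the candidate affect clients. A minimal sketch, assuming both stores expose the same hypothetical get/put API (the names are illustrative, not Netflix's gateway interface):

```python
import logging
from concurrent.futures import ThreadPoolExecutor

log = logging.getLogger("shadow")


class ShadowingStore:
    """Mirrors calls to a candidate store; only the primary serves clients."""

    def __init__(self, primary, candidate):
        self.primary = primary
        self.candidate = candidate
        self._pool = ThreadPoolExecutor(max_workers=4)

    def put(self, key, value, token):
        result = self.primary.put(key, value, token)  # production path, unchanged
        # Shadow asynchronously; the idempotency token lets the candidate
        # deduplicate retries and merge with backfilled rows.
        self._pool.submit(self._shadow_put, key, value, token)
        return result

    def _shadow_put(self, key, value, token):
        try:
            self.candidate.put(key, value, token)
        except Exception:
            log.exception("shadow write failed for %r", key)  # never reaches clients

    def get(self, key):
        value = self.primary.get(key)
        self._pool.submit(self._shadow_get, key, value)
        return value

    def _shadow_get(self, key, expected):
        try:
            if self.candidate.get(key) != expected:
                log.warning("mismatch for %r", key)  # feeds the verification step
        except Exception:
            log.exception("shadow read failed for %r", key)
```

Failures and mismatches on the candidate are only logged, which is what makes it safe to shadow full production traffic while the new database is still being evaluated.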
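The throttled backfill step can be sketched as a batched copy loop with a fixed-rate limiter. The function names, batch size, and rate limit are illustrative assumptions; the point is that backfilled rows carry their original idempotency tokens, so a row shadowed from an in-flight write (with a later timestamp) is never clobbered by an older backfilled copy.

```python
import time


def backfill(scan_old, write_new, rate_limit_per_s=1000, batch_size=100):
    """Copy rows from the old store to the new one, throttled to avoid overload.

    `scan_old` is assumed to yield (key, value, token) tuples in a stable,
    resumable order; `write_new` persists a batch, resolving conflicts by token.
    """
    interval = batch_size / rate_limit_per_s  # target seconds per batch
    batch = []
    for row in scan_old():
        batch.append(row)
        if len(batch) >= batch_size:
            start = time.monotonic()
            write_new(batch)
            batch = []
            elapsed = time.monotonic() - start
            if elapsed < interval:
                time.sleep(interval - elapsed)  # simple fixed-rate throttle
    if batch:
        write_new(batch)  # flush the final partial batch
```

Scanning in a stable order is what makes the backfill resumable: after a failure it can restart from the last committed key instead of from the beginning.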

Insights

  • Netflix's scale and global presence necessitate robust and safe migration strategies to avoid service disruptions.
  • The overlap of database solutions for different use cases at Netflix indicates a need for flexible and adaptable data platforms.
  • The introduction of new AWS features can trigger migrations even if use cases remain static, highlighting the importance of staying current with cloud offerings.
  • Cost considerations play a significant role in migration decisions, especially as managed services may become more expensive at scale compared to self-managed options.
  • The use of data abstraction layers and gateways allows Netflix to change data stores without impacting client applications, demonstrating the benefits of loose coupling and abstraction in system design.
  • Idempotency tokens and precise clock synchronization are critical for ensuring data consistency during migrations, with AWS Nitro providing the necessary clock accuracy.
  • Shadowing traffic and backfilling are essential techniques for ensuring data integrity and system performance during migrations.
  • The case studies presented by Netflix provide real-world examples of the challenges and solutions involved in large-scale database migrations.
  • The key takeaways emphasize the importance of automation, thorough testing, and strategic planning in successful database migrations.