Automating a 20 TB File Server Migration (COM303)

Title

AWS re:Invent 2023 - Automating a 20 TB file server migration (COM303)

Summary

  • Dave Stauffacher, a Chief Platform Engineer and AWS Hero, shared his experience migrating a 20 TB corporate file server to Amazon FSx for Windows File Server using AWS DataSync.
  • The migration involved overcoming challenges with a large number of small files (1.5 billion) and ensuring minimal downtime (2 hours).
  • The team initially used a marketplace solution for high availability but transitioned to FSx for Windows once it matured with necessary features like Active Directory integration, capacity scaling, DNS aliasing, file access audit logging, and PrivateLink.
  • They utilized AWS DataSync for file-level copying and FSx as a managed service, with Terraform for infrastructure as code.
  • The migration took 11 months, starting with 4 agents and 8 tasks, scaling up to 50 agents and 150 tasks.
  • Tools were developed for parsing Terraform output, tracking task runtimes, and creating CloudWatch dashboards.
  • Data was prioritized, and tasks were grouped and automated for execution during the migration window.
  • The team learned valuable lessons, including the importance of reliable runtime information, understanding data, and considering data deduplication.
  • Post-migration, the team saves $11,000 per month and is considering further optimizations, such as using S3 for better data management.
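
The summary mentions tools for parsing Terraform output and creating CloudWatch dashboards, but not how they were built. The sketch below shows one way such tooling could look; it is not the speaker's implementation. It assumes a hypothetical Terraform output named `datasync_task_arns` that maps a share name to a DataSync task ARN, and it assumes the `AWS/DataSync` namespace with a `BytesTransferred` metric keyed by a `TaskId` dimension, which should be checked against the current DataSync monitoring documentation. The region, account ID, and task IDs are placeholders.

```python
import json


def task_arns_from_terraform(output_json, output_name="datasync_task_arns"):
    """Extract a {share_name: task_arn} map from `terraform output -json` text.

    `output_name` is a hypothetical output; use whatever your configuration exports.
    """
    outputs = json.loads(output_json)
    if output_name not in outputs:
        raise KeyError(f"Terraform output {output_name!r} not found")
    # Each output is wrapped as {"sensitive": ..., "type": ..., "value": ...}.
    return dict(outputs[output_name]["value"])


def dashboard_body(task_arns, region="us-east-1", metric="BytesTransferred"):
    """Build a CloudWatch dashboard body (JSON string) with one widget per task."""
    widgets = []
    for row, (name, arn) in enumerate(sorted(task_arns.items())):
        # Task ARNs end in ".../task/<task-id>"; the metric dimension uses the ID.
        task_id = arn.rsplit("/", 1)[-1]
        widgets.append({
            "type": "metric",
            "x": 0, "y": row * 6, "width": 12, "height": 6,
            "properties": {
                "title": f"{name}: {metric}",
                "region": region,
                "stat": "Sum",
                "period": 300,
                # Namespace, metric, and dimension names are assumptions to verify.
                "metrics": [["AWS/DataSync", metric, "TaskId", task_id]],
            },
        })
    return json.dumps({"widgets": widgets})


# Placeholder data shaped like the output of `terraform output -json`.
SAMPLE_OUTPUT = json.dumps({
    "datasync_task_arns": {
        "sensitive": False,
        "type": ["map", "string"],
        "value": {
            "finance": "arn:aws:datasync:us-east-1:111122223333:task/task-0aaaaaaaaaaaaaaa1",
            "hr": "arn:aws:datasync:us-east-1:111122223333:task/task-0bbbbbbbbbbbbbbb2",
        },
    }
})

if __name__ == "__main__":
    arns = task_arns_from_terraform(SAMPLE_OUTPUT)
    print(dashboard_body(arns))
```

The resulting string could be passed as `DashboardBody` to the CloudWatch `put_dashboard` call in boto3. Generating the dashboard from Terraform output keeps it in step with the task list as agents and tasks are added.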

Insights

  • Data Complexity: The complexity of managing a large number of small files can significantly impact migration strategies and timelines. Understanding the data structure is crucial for planning.
  • Cost vs. Time: The initial focus on cost savings by using spinning disk storage led to a longer migration time. Balancing cost and resource allocation is essential for efficient migrations.
  • Automation and Tooling: Custom tools and automation played a critical role in managing the migration process, especially when dealing with a high number of tasks and agents.
  • Infrastructure as Code: Using Terraform for infrastructure as code allowed for an audited deployment pipeline, ensuring that changes were safe and traceable.
  • Migration Planning: Detailed planning, akin to NASA's launch procedures, was vital for a successful migration, including prioritizing data, scheduling tasks, and having a clear rollback plan.
  • Post-Migration Considerations: Since the migration, the team has continued to seek cost optimizations and is exploring AWS services such as S3 and Lambda for better data management and further cost reductions.
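
The points about reliable runtime information and migration planning can be illustrated with a small scheduling check: given each task's last observed runtime, does the final sync fit in the cutover window? The following is a minimal sketch and not the method described in the talk. The `Task` fields, the "lower priority number runs first" convention, the slot count, and the sample runtimes are all hypothetical; only the 2-hour window comes from the summary.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    priority: int       # hypothetical convention: lower number runs earlier
    runtime_min: float  # most recent observed runtime, in minutes


def plan_window(tasks, slots, window_min):
    """Greedily assign tasks to concurrent slots.

    Returns (schedule, makespan, fits), where schedule is a list of
    (task_name, slot, start_min, end_min) tuples.
    """
    # Min-heap of (time the slot becomes free, slot index).
    free_at = [(0.0, i) for i in range(slots)]
    heapq.heapify(free_at)
    schedule = []
    # Highest priority first; within a priority, longest task first.
    for task in sorted(tasks, key=lambda t: (t.priority, -t.runtime_min)):
        start, slot = heapq.heappop(free_at)
        end = start + task.runtime_min
        schedule.append((task.name, slot, start, end))
        heapq.heappush(free_at, (end, slot))
    makespan = max((end for _, _, _, end in schedule), default=0.0)
    return schedule, makespan, makespan <= window_min


# Illustrative runtimes only; real values would come from tracked task runs.
SAMPLE_TASKS = [
    Task("finance", 1, 60),
    Task("hr", 1, 30),
    Task("archive", 2, 45),
    Task("scratch", 2, 20),
]

if __name__ == "__main__":
    schedule, makespan, fits = plan_window(SAMPLE_TASKS, slots=2, window_min=120)
    for name, slot, start, end in schedule:
        print(f"slot {slot}: {name} {start:.0f}-{end:.0f} min")
    print(f"makespan {makespan:.0f} min, fits window: {fits}")
```

A plan like this is only as good as the runtimes fed into it, which is why tracking task runtimes across repeated runs matters before committing to a cutover window.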