Scaling Warhammer 40,000: Darktide from 0 to 100,000 Players in 1 Hour (GAM305)

Title

AWS re:Invent 2023 - Scaling Warhammer 40,000: Darktide from 0 to 100,000 players in 1 hour (GAM305)

Summary

  • Presenters: Joris van der Donk (AWS Solutions Architect) and Andrew Klawich (Technical Director at Fatshark).
  • Topic: Scaling the game Warhammer 40,000: Darktide on AWS to support over 100,000 concurrent players within an hour of launch.
  • Fatshark's Background: A game studio from Stockholm, Sweden, known for cooperative gameplay experiences and the Warhammer game series.
  • Technical Challenges: Creating a scalable, server-authoritative, cost-effective architecture for Darktide, ensuring consistent player experience, and managing latency.
  • Solutions Implemented (rough code sketches for several of these follow this list):
    • Login Queue: Utilized AWS Lambda and ElastiCache for Redis to handle spikes in login requests.
    • Immaterium Service: A Java-based service using gRPC and Redis for party management and player presence.
    • Matchmaking and Game Server Allocation: Amazon GameLift FlexMatch for matchmaking and GameLift FleetIQ for managing EC2 instances and game session placement.
    • Global Accelerator: To optimize network paths and reduce latency for players worldwide.
  • Results: Successful scaling to 30,000 vCPUs across regions, handling millions of enemies in-game, and maintaining a good player experience.
  • Lessons Learned: Emphasizing serverless architecture, managed services, and observability; using jitter to avoid traffic spikes; and leveraging EC2 Spot Instances for cost savings.
  • Future Plans: Expand server locations, improve CPU budgeting for game servers, and explore AWS Local Zones and new regions.
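
The talk does not show Fatshark's actual login-queue code, but the pattern described (AWS Lambda in front of ElastiCache for Redis, with jittered client retries) can be sketched roughly as follows. The key names, admit threshold, and jitter window are illustrative assumptions, not values from the talk:

```python
import os
import random

import redis  # redis-py; the ElastiCache for Redis endpoint is supplied via environment

# Illustrative assumptions: key names, admit rate, and jitter window are not from the talk.
QUEUE_KEY = "login:queue"   # sorted set of waiting players, scored by arrival order
ADMIT_THRESHOLD = 500       # how many players may enter per polling interval

r = redis.Redis(host=os.environ["REDIS_HOST"], port=6379, decode_responses=True)


def handler(event, context):
    """Lambda handler: assign a queue position and return a jittered retry hint."""
    player_id = event["playerId"]

    # Enqueue on first contact; NX keeps the original arrival position on retries.
    r.zadd(QUEUE_KEY, {player_id: r.incr("login:arrivals")}, nx=True)
    position = r.zrank(QUEUE_KEY, player_id)

    if position is not None and position < ADMIT_THRESHOLD:
        # Player has reached the front of the queue: admit and remove them.
        r.zrem(QUEUE_KEY, player_id)
        return {"status": "ADMITTED"}

    # Ask the client to poll again later, adding jitter so retries do not land
    # in one synchronized spike (one of the lessons called out in the talk).
    retry_after = 10 + random.uniform(0, 5)
    return {"status": "WAITING", "position": position, "retryAfterSeconds": round(retry_after, 1)}
```

The jittered "retryAfterSeconds" is what spreads player traffic over time instead of having every waiting client hammer the backend at the same instant.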
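
The Immaterium service itself is described only as Java plus gRPC; the sketch below illustrates just the Redis side of party membership and presence, using heartbeat keys that expire when a client stops refreshing them. All key names and TTLs are assumptions:

```python
import os

import redis

# Sketch of the Redis presence pattern only; the real Immaterium service is Java/gRPC.
r = redis.Redis(host=os.environ["REDIS_HOST"], port=6379, decode_responses=True)

PRESENCE_TTL_SECONDS = 30  # presence expires automatically if heartbeats stop


def heartbeat(account_id: str, status: str = "online") -> None:
    """Record or refresh a player's presence; the key expires if the client goes away."""
    r.set(f"presence:{account_id}", status, ex=PRESENCE_TTL_SECONDS)


def join_party(party_id: str, account_id: str) -> None:
    """Add a player to a party's member set."""
    r.sadd(f"party:{party_id}:members", account_id)


def party_presence(party_id: str) -> dict:
    """Return the presence status of every member of a party."""
    members = r.smembers(f"party:{party_id}:members")
    return {m: (r.get(f"presence:{m}") or "offline") for m in members}
```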
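
Matchmaking and allocation details beyond "FlexMatch plus FleetIQ" are not given in the summary; a minimal boto3 sketch of those two calls might look like this, with the matchmaking configuration and game server group names as placeholders:

```python
import uuid

import boto3

gamelift = boto3.client("gamelift")

# "darktide-mission-matchmaking" and "darktide-game-servers" are illustrative
# placeholders, not configuration taken from the talk.


def request_match(player_ids, latency_ms_by_region):
    """Submit a FlexMatch matchmaking request for a party of players."""
    return gamelift.start_matchmaking(
        TicketId=str(uuid.uuid4()),
        ConfigurationName="darktide-mission-matchmaking",
        Players=[
            {
                "PlayerId": pid,
                # e.g. {"eu-west-1": 25, "us-east-1": 95}; lets FlexMatch place the
                # session in a region with acceptable latency for the whole party.
                "LatencyInMs": latency_ms_by_region,
            }
            for pid in player_ids
        ],
    )


def claim_server_for_session():
    """Ask FleetIQ for a viable (Spot-aware) game server to host the session."""
    response = gamelift.claim_game_server(GameServerGroupName="darktide-game-servers")
    server = response["GameServer"]
    # ConnectionInfo would hold the IP/port the matched party connects to.
    return server["GameServerId"], server.get("ConnectionInfo")
```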

Insights

  • Serverless First Approach: Starting with serverless services like Lambda allowed Fatshark to rapidly prototype and scale, with the flexibility to move to other services like Fargate or EC2 if needed.
  • Managed Services: Fatshark's reliance on AWS managed services like DynamoDB, MemoryDB, and Aurora Serverless v2 helped them focus on game development rather than infrastructure management.
  • Cost-Effective Scaling: The use of EC2 Spot Instances and careful right-sizing of the architecture were key cost-saving strategies during the game's launch (see the FleetIQ sketch after this list).
  • Traffic Management: Implementing a login queue with jitter and spreading player traffic over time were effective in managing the massive influx of players at launch.
  • Global Deployment: The deployment of game servers in over 12 regions and the use of AWS Global Accelerator ensured low latency and a good player experience globally (see the Global Accelerator sketch after this list).
  • Observability and Monitoring: Tools like Honeycomb were crucial for identifying and resolving backend performance issues and bugs.
  • Future Enhancements: Plans to use AWS Local Zones and new regions to bring servers closer to players, and to refine CPU budgeting for game servers to handle variable workloads more efficiently.
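
As a rough illustration of the Spot-based cost strategy mentioned above, a GameLift FleetIQ game server group can be created with a Spot-preferred balancing strategy. Every name, ARN, and instance type below is a placeholder rather than Fatshark's real configuration:

```python
import boto3

gamelift = boto3.client("gamelift")

# All identifiers here are placeholders; the talk only states that Spot Instances
# and FleetIQ were used, not this exact setup.
response = gamelift.create_game_server_group(
    GameServerGroupName="darktide-game-servers",
    RoleArn="arn:aws:iam::123456789012:role/gamelift-fleetiq-role",
    MinSize=1,
    MaxSize=2000,
    LaunchTemplate={"LaunchTemplateName": "darktide-game-server", "Version": "$Latest"},
    # Offering several instance types gives FleetIQ more Spot pools to choose from.
    InstanceDefinitions=[
        {"InstanceType": "c5.2xlarge"},
        {"InstanceType": "c5a.2xlarge"},
        {"InstanceType": "m5.2xlarge"},
    ],
    # Prefer Spot capacity for cost savings, fall back to On-Demand when Spot is scarce.
    BalancingStrategy="SPOT_PREFERRED",
)
print(response["GameServerGroup"]["GameServerGroupArn"])
```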
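
Similarly, the Global Accelerator setup is only mentioned at a high level; a hedged boto3 sketch of an accelerator, a listener, and one regional endpoint group (all ports, regions, and endpoint ARNs illustrative) could look like this:

```python
import boto3

# Global Accelerator is a global service; its API is called through the us-west-2 endpoint.
ga = boto3.client("globalaccelerator", region_name="us-west-2")

# Listener ports and endpoint details are illustrative assumptions, not values from the talk.
accelerator = ga.create_accelerator(
    Name="darktide-backend",
    IpAddressType="IPV4",
    Enabled=True,
)["Accelerator"]

listener = ga.create_listener(
    AcceleratorArn=accelerator["AcceleratorArn"],
    Protocol="TCP",
    PortRanges=[{"FromPort": 443, "ToPort": 443}],
)["Listener"]

# Each endpoint group maps the listener onto a regional endpoint (e.g. a load balancer),
# so player traffic enters the AWS backbone at the nearest edge location.
ga.create_endpoint_group(
    ListenerArn=listener["ListenerArn"],
    EndpointGroupRegion="eu-west-1",
    EndpointConfigurations=[
        {
            "EndpointId": "arn:aws:elasticloadbalancing:eu-west-1:123456789012:loadbalancer/app/backend/abc123",
            "Weight": 128,
        }
    ],
)
```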