Title
AWS re:Invent 2023 - Scaling Warhammer 40,000: Darktide from 0 to 100,000 players in 1 hour (GAM305)
Summary
- Presenters: Joris van der Donk (AWS Solutions Architect) and Andrew Klawich (Technical Director at Fatshark).
- Topic: Scaling the game Warhammer 40,000: Darktide on AWS to support over 100,000 concurrent players within an hour of launch.
- Fatshark's Background: A game studio based in Stockholm, Sweden, known for cooperative games such as the Warhammer: Vermintide series.
- Technical Challenges: Building a scalable, server-authoritative, cost-effective architecture for Darktide, ensuring a consistent player experience, and managing latency.
- Solutions Implemented:
  - Login Queue: Utilized AWS Lambda and ElastiCache for Redis to handle spikes in login requests (sketched after this list).
  - Immaterium Service: A Java-based service using gRPC and Redis for party management and player presence (a presence sketch follows this list).
  - Matchmaking and Game Server Allocation: Amazon GameLift FlexMatch for matchmaking and GameLift FleetIQ for managing EC2 instances and game session placement (a matchmaking sketch follows this list).
  - Global Accelerator: To optimize network paths and reduce latency for players worldwide.
- Results: Successful scaling to 30,000 vCPUs across regions, handling millions of enemies in-game, and maintaining a good player experience.
- Lessons Learned: Emphasizing serverless architecture, managed services, and observability; using jitter to avoid traffic spikes; and leveraging EC2 Spot Instances for cost savings.
- Future Plans: Expand server locations, improve CPU budgeting for game servers, and explore AWS Local Zones and new regions.
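
The talk itself contains no code, but the login-queue pattern above maps naturally onto a Redis sorted set scored by arrival time, with a Lambda handler reporting each caller's position. A minimal Python sketch, assuming the `redis-py` client against the ElastiCache endpoint; the key name, admission rate, and event shape are illustrative, not from the talk:

```python
import json
import os
import time

import redis  # redis-py client; ElastiCache for Redis is wire-compatible

# Connect to the ElastiCache endpoint (hypothetical environment variable).
r = redis.Redis(host=os.environ["REDIS_HOST"], port=6379)

QUEUE_KEY = "login-queue"    # sorted set: member = player ID, score = arrival time
ADMIT_THRESHOLD = 100        # illustrative admission rate per poll

def handler(event, context):
    """Lambda handler: enqueue the caller and report their queue position."""
    player_id = event["player_id"]

    # ZADD with NX only sets the score on first insert, so retries keep
    # the player's original place in line.
    r.zadd(QUEUE_KEY, {player_id: time.time()}, nx=True)

    # ZRANK gives the 0-based position in score order.
    position = r.zrank(QUEUE_KEY, player_id)

    if position is not None and position < ADMIT_THRESHOLD:
        r.zrem(QUEUE_KEY, player_id)
        return {"statusCode": 200, "body": json.dumps({"admitted": True})}

    return {
        "statusCode": 200,
        "body": json.dumps({"admitted": False, "position": position}),
    }
```

Clients would poll such an endpoint on a randomized interval (see the jitter sketch at the end of the Insights section) so retries do not re-synchronize into fresh spikes.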
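The Immaterium service itself is Java-based and speaks gRPC, but its presence side can be illustrated with the classic Redis expiring-key pattern: each heartbeat refreshes a short-lived key, and a missing key means the player is offline. A minimal sketch in Python (key names and the TTL are assumptions for illustration):

```python
import redis

r = redis.Redis(host="localhost", port=6379)

HEARTBEAT_TTL_SECONDS = 30  # illustrative: presence expires without a fresh heartbeat

def heartbeat(player_id: str) -> None:
    """Called on each client heartbeat; refreshes the presence key and its TTL."""
    r.set(f"presence:{player_id}", "online", ex=HEARTBEAT_TTL_SECONDS)

def is_online(player_id: str) -> bool:
    """A player is online iff their presence key has not yet expired."""
    return r.exists(f"presence:{player_id}") == 1

def party_online(party_members: list[str]) -> dict[str, bool]:
    """Check presence for a whole party in one round trip using MGET."""
    values = r.mget([f"presence:{m}" for m in party_members])
    return {m: v is not None for m, v in zip(party_members, values)}
```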
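On the matchmaking side, FlexMatch is driven through the GameLift API: the backend submits a ticket carrying the party's players and their measured latencies, and FlexMatch forms matches against a rule set. A minimal `boto3` sketch; the configuration name, attribute names, and party shape here are hypothetical:

```python
import uuid

import boto3

gamelift = boto3.client("gamelift", region_name="eu-west-1")

def request_match(party: list[dict]) -> str:
    """Submit a FlexMatch ticket for a party and return the ticket ID.

    Each entry in `party` is assumed to look like:
    {"player_id": "...", "skill": 42, "latency_ms": {"eu-west-1": 25}}
    (an illustrative shape, not from the talk).
    """
    ticket_id = str(uuid.uuid4())
    gamelift.start_matchmaking(
        TicketId=ticket_id,
        ConfigurationName="darktide-mission-queue",  # hypothetical config name
        Players=[
            {
                "PlayerId": p["player_id"],
                "PlayerAttributes": {"skill": {"N": p["skill"]}},
                # Latency per region lets FlexMatch place the match near players.
                "LatencyInMs": p["latency_ms"],
            }
            for p in party
        ],
    )
    return ticket_id

def check_ticket(ticket_id: str) -> str:
    """Poll the ticket status (e.g. SEARCHING, COMPLETED, FAILED)."""
    resp = gamelift.describe_matchmaking(TicketIds=[ticket_id])
    return resp["TicketList"][0]["Status"]
```
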
Insights
- Serverless First Approach: Starting with serverless services like Lambda allowed Fatshark to rapidly prototype and scale, with the flexibility to move to other services like Fargate or EC2 if needed.
- Managed Services: Fatshark's reliance on AWS managed services like DynamoDB, MemoryDB, and Aurora Serverless v2 helped them focus on game development rather than infrastructure management.
- Cost-Effective Scaling: The use of EC2 Spot Instances and careful right-sizing of the architecture were key strategies for cost savings at launch (a FleetIQ Spot sketch follows this list).
- Traffic Management: Implementing a login queue with jitter and spreading player traffic over time were effective in managing the massive influx of players at launch (a jitter sketch follows this list).
- Global Deployment: The deployment of game servers in over 12 regions and the use of AWS Global Accelerator ensured low latency and a good player experience globally.
- Observability and Monitoring: Tools like Honeycomb were crucial for identifying and resolving backend performance issues and bugs.
- Future Enhancements: Plans to use AWS Local Zones and new regions to bring servers closer to players, and to refine CPU budgeting for game servers to handle variable workloads more efficiently.
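
The Spot strategy above maps onto FleetIQ's game server groups, where several instance types are declared and FleetIQ balances Spot against On-Demand capacity. A minimal `boto3` sketch; the group name, role ARN, launch template, and instance types are placeholders:

```python
import boto3

gamelift = boto3.client("gamelift", region_name="eu-west-1")

# FleetIQ manages an Auto Scaling group under the hood; it needs an IAM role
# and an EC2 launch template that boots the game server image (placeholders here).
gamelift.create_game_server_group(
    GameServerGroupName="darktide-servers-euw1",             # hypothetical name
    RoleArn="arn:aws:iam::123456789012:role/FleetIQRole",    # placeholder ARN
    MinSize=1,
    MaxSize=1000,
    LaunchTemplate={"LaunchTemplateId": "lt-0123456789abcdef0"},  # placeholder
    # Offering several instance types lets FleetIQ pick whichever Spot pool
    # is currently cheapest and least likely to be interrupted.
    InstanceDefinitions=[
        {"InstanceType": "c5.4xlarge"},
        {"InstanceType": "c5a.4xlarge"},
        {"InstanceType": "c6i.4xlarge"},
    ],
    # SPOT_PREFERRED falls back to On-Demand when viable Spot capacity is scarce.
    BalancingStrategy="SPOT_PREFERRED",
)
```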
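Finally, the jitter mentioned in both lists is simply a randomized delay added to retries, so that thousands of clients failing at the same moment do not all retry at the same moment. A minimal sketch of exponential backoff with full jitter (the constants are illustrative):

```python
import random
import time

BASE_DELAY_S = 1.0   # illustrative first-retry delay
MAX_DELAY_S = 60.0   # illustrative cap on any single wait

def retry_with_jitter(attempt_fn, max_attempts: int = 8):
    """Retry attempt_fn with 'full jitter': sleep a uniform random amount
    between 0 and an exponentially growing cap, so retries decorrelate."""
    for attempt in range(max_attempts):
        try:
            return attempt_fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            cap = min(MAX_DELAY_S, BASE_DELAY_S * (2 ** attempt))
            time.sleep(random.uniform(0, cap))
```

Full jitter trades a slightly longer average wait for a much flatter aggregate retry curve, which is exactly what a login queue needs at launch.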