Building a World in the Clouds: MMO Architecture on AWS (MBL304) | AWS re:Invent 2013
DESCRIPTION
Can you really build the infrastructure required to bring a massively multiplayer online game (MMO) to life in the cloud? This session discusses the evolution of Red 5 Studios' Firefall, a free-to-play MMO. Firefall runs entirely on the AWS platform and allows players from around the world to play together in the cloud. The session covers some of the design decisions made over the last two years: the things that worked well and not so well. The session also presents some of the solutions Red 5 implemented to ease the transition from dedicated data center hardware to virtual servers in AWS.
TRANSCRIPT
© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
Building a World in the Clouds: MMO Architecture on AWS
Jeffrey Berube, Director of Technical Operations
Red 5 Studios, Inc.
November 15, 2013
Overview
• What is Firefall?
• Why build in the cloud?
• Infrastructure Goals
• Evolution of the Platform
• Tools for Success
What is Firefall?
• Free-to-play cooperative open world shooter
• “Shardless” world
• Instance-based maps
  – Both persistent and transient map types
  – Possible for many instances of a map to exist at the same time
Why?
Why build in the cloud?
• Players are unpredictable
• Developers are unpredictable
• Cyclical player behavior opens up opportunities for significant cost savings
Why build in the cloud?
• Players are unpredictable
  – Forecasts can be (and usually are) wrong
    • Too little hardware, too many players
      – Bad for everyone
    • Too much hardware, too few players
      – Good for players (sort of) but bad for the business
      – What if they don’t stick around?
• Developers are unpredictable
• Cyclical player behavior opens up opportunities for significant cost savings
Why build in the cloud?
• Players are unpredictable
• Developers are unpredictable
  – Active development has risks
    • Performance can change drastically
    • New services can “appear” the day of the patch
  – MMOs are ALWAYS being actively developed!
    • (If you want to be successful…)
• Cyclical player behavior opens up opportunities for significant cost savings
[Figure: Player Graph, 36 hours]
[Figure: Player Graph with Server Overlay]
[Figure: Player Graph with Efficient Server Overlay, 35%–37.5% savings]
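The savings on this slide come from matching server count to the cyclical player curve instead of running peak capacity around the clock. A minimal sketch of that arithmetic follows. The hourly player counts and the players-per-server capacity are hypothetical values chosen for illustration, not figures from the talk, so the result will not match the slide's percentage.

```python
import math
from typing import Sequence


def servers_needed(players: int, players_per_server: int) -> int:
    """Smallest number of servers that can hold `players` (never fewer than one)."""
    return max(1, math.ceil(players / players_per_server))


def savings_fraction(hourly_players: Sequence[int], players_per_server: int) -> float:
    """Fraction of server-hours saved by resizing hourly instead of running at peak."""
    peak = servers_needed(max(hourly_players), players_per_server)
    fixed_hours = peak * len(hourly_players)
    elastic_hours = sum(servers_needed(p, players_per_server) for p in hourly_players)
    return 1 - elastic_hours / fixed_hours


# Hypothetical cycle: quiet overnight, busy in the evening.
sample = [200, 150, 100, 100, 150, 300, 500, 700, 900, 1000, 1000, 800]
print(f"{savings_fraction(sample, players_per_server=100):.1%}")
```

In practice the achievable saving is lower than this idealized number, because servers cannot be drained instantly and some headroom has to stay online.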
Infrastructure Goals
Deployment and Recovery
How do we make site management better?
• Expansion
• Scalability
• Disaster Recovery
• Self-Healing
Platform
How can the platform make the player experience better?
• Downtime
• Player Mobility
Infrastructure Goals: Deployment and Recovery
Deployment and Recovery Goals
• Quick regional expansion
• On-demand scalability
• Disaster recovery with minimal downtime
• Self-healing
Deployment and Recovery Goals
• Quick regional expansion
  – Traditionally, expansion is a multi-month process
    • Contracts, purchase and shipping, installation, etc.
  – Today, adding a region takes about a week
    • Additional improvements are in the works
• On-demand scalability
• Disaster recovery with minimal downtime
• Self-healing
Deployment and Recovery Goals
• Quick regional expansion
• On-demand scalability
  – Automated scale up and down without limits*
    • *The desired instance sizes may not always be available, however
• Disaster recovery with minimal downtime
• Self-healing
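Since the desired instance size may not always be available, one way to keep automated scale-up working is to try a list of acceptable sizes in preference order. The sketch below is illustrative only: `launch` and `CapacityError` are hypothetical stand-ins for whatever the provisioning layer provides, and the instance type names are examples, not the types Red 5 used.

```python
from typing import Callable, Optional, Sequence


class CapacityError(Exception):
    """Hypothetical error raised when a requested instance type is unavailable."""


def launch_with_fallback(
    preferred_types: Sequence[str],
    launch: Callable[[str], str],
) -> Optional[str]:
    """Try each instance type in order and return the first instance ID obtained.

    `launch` is a hypothetical callable that starts one instance of the given
    type and returns its ID, or raises CapacityError if none is available.
    """
    for instance_type in preferred_types:
        try:
            return launch(instance_type)
        except CapacityError:
            continue  # try the next acceptable size
    return None  # nothing available; the caller decides whether to retry or alert


# Stub launcher for demonstration: only the second choice has capacity.
def fake_launch(instance_type: str) -> str:
    if instance_type != "m1.xlarge":
        raise CapacityError(instance_type)
    return f"i-demo-{instance_type}"


result = launch_with_fallback(["c1.xlarge", "m1.xlarge", "m1.large"], fake_launch)
print(result)
```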
Deployment and Recovery Goals
• Quick regional expansion
• On-demand scalability
• Disaster recovery with minimal downtime
  – Traditionally, DR sites are expensive and are not always properly maintained
  – Our goal is to automate disaster recovery safely
    • We do a lot manually at present
• Self-healing
Infrastructure Goals: Platform
Platform Goals
• Zero downtime game updates
• Players can play globally without restrictions
Platform Goals
• Zero downtime game updates
  – Blue-Green deployment
  – Doesn’t preclude scheduled maintenance
    • Some things are more safely done offline
• Players can play globally without restrictions
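Blue-green deployment keeps two stacks side by side: the new build goes onto the idle stack, and traffic switches over only once that stack checks out. The sketch below shows the switching logic in isolation. The class, the stack names, and the `healthy` callback are assumptions made for illustration and do not describe Red 5's actual tooling.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class BlueGreenRouter:
    """Tracks two stacks and which one receives new player sessions."""

    versions: Dict[str, str] = field(default_factory=lambda: {"blue": "", "green": ""})
    live: str = "blue"

    @property
    def idle(self) -> str:
        return "green" if self.live == "blue" else "blue"

    def deploy(self, version: str, healthy: Callable[[str], bool]) -> bool:
        """Install `version` on the idle stack and switch only if it passes checks."""
        target = self.idle
        self.versions[target] = version
        if not healthy(target):
            return False  # the live stack is untouched, so players keep playing
        self.live = target
        return True


router = BlueGreenRouter()
router.versions["blue"] = "1.0"
switched = router.deploy("1.1", healthy=lambda stack: True)
print(router.live, router.versions[router.live], switched)
```

The previously live stack stays in place after the switch, which is what makes rolling back a matter of pointing traffic at it again.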
Platform Goals
• Zero downtime game updates
• Players can play globally without restrictions
  – Characters won’t be held hostage
    • Player data is available everywhere they want to be
    • We don’t charge a player so that they can play with their friends
  – Prefer closest healthy region, however
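"Prefer closest healthy region" can be read as a two-step rule: drop regions that are failing health checks, then pick the one with the lowest latency to the player. A minimal sketch, assuming latency measurements and health flags are already available as plain mappings. The latency numbers are hypothetical, and the region names are taken from the architecture slides.

```python
from typing import Mapping, Optional


def pick_region(
    latency_ms: Mapping[str, float],
    healthy: Mapping[str, bool],
) -> Optional[str]:
    """Return the healthy region with the lowest measured latency, or None."""
    candidates = [region for region in latency_ms if healthy.get(region, False)]
    if not candidates:
        return None
    return min(candidates, key=lambda region: latency_ms[region])


# Hypothetical measurements for one player; the nearest region is unhealthy.
latencies = {"us-west-2": 40.0, "us-east-1": 95.0, "eu-west-1": 160.0}
status = {"us-west-2": False, "us-east-1": True, "eu-west-1": True}
choice = pick_region(latencies, status)
print(choice)
```

Regions missing from the health mapping are treated as unhealthy, which errs on the side of not sending players somewhere unverified.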
Evolution of the Platform
The Beginning March 2011
AWS Services in Use
Elastic Compute Cloud (EC2)
[Diagram: US-West-1, Availability Zone b. Labels: INET, AWS, CORP, Outside-Game.]
Alpha October 2011
AWS Services in Use
Elastic Compute Cloud (EC2)
Simple Storage Service (S3)
Elastic Load Balancing (ELB)
Simple Queue Service (SQS)
[Diagram: US-West-1, Availability Zone b. Endpoints: INET, AWS, CORP, Operator. Network zones: Inside-Core, Inside-Game, Outside-Game, Inside-LB, Inside-DB, Inside-App, Outside-LB. Node labels: Ic, Chef, Log, MCP, MD, MM, HP, I, C, U, Ar, Ad, Mx, PvP, AWS ELB.]
Closed Beta April 2012
AWS Services in Use
Elastic Compute Cloud (EC2)
Simple Storage Service (S3)
Elastic Load Balancing (ELB)
Simple Queue Service (SQS)
Relational Database Service (RDS)
ElastiCache
[Diagram: US-West-1, Availability Zones b and c. Endpoints: INET, AWS, CORP, Operator. Network zones: Inside-Core, Inside-Game, Outside-Game, Inside-LB, Inside-App, Outside-LB. Node labels: Gr, Ic, Chef, Log, MCP, MD, MM, HP, I, C, U, Ar, Ad, S, OW, PvP, AWS RDS, AWS ELB.]
Gamescom August 2012
AWS Services in Use
Elastic Compute Cloud (EC2)
Simple Storage Service (S3)
Elastic Load Balancing (ELB)
Simple Queue Service (SQS)
Relational Database Service (RDS)
ElastiCache
CloudFront
Virtual Private Cloud (VPC)
[Diagram: regions US-West-2 (Availability Zone b), EU-West-1, US-East-1. Endpoints: INET, AWS, CORP, HQ, Operator. Network zones: Inside-Core, Inside-Game, Outside-Game, Outside-LB, Inside-DB, Inside-App, Inside-AppTasks. Node labels: Gr, Ic, Task, Chef, Log, AWS ELB, MCP, MD, MM, HP, I, C, U, A, P, W, Ar, L, In, Ad, Co, S, OW, NPE, PvE, PvP, VPC.]
Open Beta July 2013
AWS Services in Use
Elastic Compute Cloud (EC2)
Simple Storage Service (S3)
Elastic Load Balancing (ELB)
Simple Queue Service (SQS)
CloudFront
Virtual Private Cloud (VPC)
Elastic MapReduce (EMR)
[Diagram: regions US-West-2 (Availability Zone b), EU-West-1, US-East-1. Endpoints: INET, AWS, CORP, HQ, Ops, Operator. Network zones: Inside-Core, Inside-Game, Outside-Game, Outside-LB, Inside-LB, Inside-DB, Inside-App, Inside-AppTasks, Inside-Search. Node labels: ES, Gr, Ic, Task, Chef, Log, AWS ELB, MCP, MD, MM, HP, I, C, U, A, P, W, Ar, L, In, Ad, Co, S, OW, NPE, PvE, PvP, VPC.]
Today November 2013
AWS Services in Use
Elastic Compute Cloud (EC2)
Simple Storage Service (S3)
Elastic Load Balancing (ELB)
Simple Queue Service (SQS)
CloudFront
Virtual Private Cloud (VPC)
Elastic MapReduce (EMR)
[Diagram: regions US-West-2 (Availability Zone b), EU-West-1, US-East-1, AP-NorthEast-1, SA-East-1. Endpoints: INET, AWS, CORP, HQ, Ops, Operator. Network zones: Inside-Core, Inside-Game, Outside-Game, Outside-LB, Inside-LB, Inside-DB, Inside-App, Inside-AppTasks, Inside-Search. Node labels: ES, Gr, Ic, Task, Chef, Log, AWS ELB, MCP, MD, MM, HP, I, C, U, A, P, W, Ar, L, In, Ad, Co, S, OW, NPE, PvE, PvP, VPC.]
Tools for Success
Third-party Tools
• Opscode Chef
• collectd
• Icinga (Nagios)
• Graphite
• Graylog2
• HAProxy
• Keepalived
• elasticsearch
• Others (Bluepill, Thin, memcached, RabbitMQ, etc.)
Third-party Services
• Dyn
  – Global Load Balancing
  – DNS Failover
• PagerDuty
  – On-call scheduling
  – Phone calls!
• Pingdom
  – Transaction monitoring
• Duo Security
Internal Tools
• Architect
• Cartographer
• Dashboards (Everywhere)
Internal Tools
• Architect
  – Everything heartbeats
  – Service-specific data aggregation
• Cartographer
• Dashboards (Everywhere)
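The slides say only that with Architect "everything heartbeats". As a rough illustration of what heartbeat tracking involves, the sketch below records the last heartbeat per service and lists the services that have gone quiet. The class, the timeout, and the service names are all hypothetical and are not taken from Architect itself.

```python
import time
from typing import Dict, List, Optional


class HeartbeatRegistry:
    """Records the last heartbeat per service and reports the ones gone quiet."""

    def __init__(self, timeout_seconds: float) -> None:
        self.timeout_seconds = timeout_seconds
        self._last_seen: Dict[str, float] = {}

    def beat(self, service_id: str, now: Optional[float] = None) -> None:
        """Note a heartbeat; `now` can be injected to make tests deterministic."""
        self._last_seen[service_id] = time.monotonic() if now is None else now

    def stale(self, now: Optional[float] = None) -> List[str]:
        """Return, sorted, every service whose last heartbeat is older than the timeout."""
        current = time.monotonic() if now is None else now
        return sorted(
            service_id
            for service_id, seen in self._last_seen.items()
            if current - seen > self.timeout_seconds
        )


registry = HeartbeatRegistry(timeout_seconds=30)
registry.beat("matchmaker-1", now=0.0)
registry.beat("game-server-7", now=25.0)
quiet = registry.stale(now=40.0)
print(quiet)
```

A list like `quiet` is the kind of signal a self-healing system could act on, for example by replacing the silent component.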
Internal Tools
• Architect
• Cartographer
  – Builds new game server stacks
  – Replaces failed game server components
  – Scales up (or down) the servers within a pool depending on player demand
• Dashboards (Everywhere)
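Scaling a pool up or down with player demand reduces to computing a target size and comparing it with the current size. The sketch below illustrates that decision under stated assumptions: the headroom percentage, the pool limits, and the function names are hypothetical and do not come from Cartographer.

```python
def desired_pool_size(
    players: int,
    players_per_server: int,
    headroom_percent: int = 20,
    min_servers: int = 1,
    max_servers: int = 100,
) -> int:
    """Servers needed for `players` plus spare headroom, clamped to the pool limits."""
    load = players * (100 + headroom_percent)
    capacity = players_per_server * 100
    needed = -(-load // capacity)  # integer ceiling division, avoids float rounding
    return max(min_servers, min(max_servers, needed))


def scaling_action(current: int, desired: int) -> str:
    """Describe the change needed to move the pool from `current` to `desired`."""
    if desired > current:
        return f"add {desired - current}"
    if desired < current:
        return f"remove {current - desired}"
    return "hold"


target = desired_pool_size(players=500, players_per_server=100)
print(target, scaling_action(current=4, desired=target))
```

Scaling down a game server pool also has to account for players still connected to the servers being removed, which this sketch does not model.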
Internal Tools
• Architect
• Cartographer
• Dashboards (Everywhere)
  – Multiple data sources
    • Graphite
    • Production databases
    • Business Intelligence databases