1,000,000 daily users and no cache

O’Reilly RailsConf, 2011-05-18

Upload: wooga

Post on 15-Jan-2015


Category:

Technology



DESCRIPTION

Online games pose a few interesting challenges for their backend: a single user generates one HTTP call every few seconds, and the balance between reads and writes is close to 50/50, which makes a write-through cache and other common scaling approaches less effective. Starting from a fairly classic Rails application, we changed it gradually as traffic grew in order to meet the required performance. When small changes were no longer enough, we turned parts of our data persistence layer inside out, migrating from SQL to NoSQL without downtime of more than a few minutes. Follow the problems we hit, how we diagnosed them, and how we got around limitations. See which tools we found useful and what other lessons we learned running the system with a team of just two developers, without a sysadmin or operations team as support.

TRANSCRIPT

Page 1: 1,000,000 daily users and no cache

O’Reilly RailsConf, 2011-05-18

Page 2: 1,000,000 daily users and no cache

Who  is  that  guy?

Jesper Richter-Reichhelm / @jrirei

Berlin,  Germany

Head  of  Engineering  @  wooga

Page 3: 1,000,000 daily users and no cache

Wooga  does  social  games

Page 4: 1,000,000 daily users and no cache

Wooga  has  dedicated  game  teams

PHP / Cloud

Ruby / Cloud

Ruby / Bare metal

Erlang / Cloud

Coming soon

Page 5: 1,000,000 daily users and no cache

Flash  client  sends  state  changes  to  backend

[Diagram: Flash client -> Ruby backend]

Page 6: 1,000,000 daily users and no cache

Social  games  need  to  scale  quite  a  bit

1  million  daily  users

200  million  daily  HTTP  requests

Page 7: 1,000,000 daily users and no cache

Social  games  need  to  scale  quite  a  bit

1  million  daily  users

200  million  daily  HTTP  requests

100,000 DB operations per second

40,000  DB  updates  per  second

Page 8: 1,000,000 daily users and no cache

Social  games  need  to  scale  quite  a  bit

1  million  daily  users

200  million  daily  HTTP  requests

100,000 DB operations per second

40,000  DB  updates  per  second

…  but  no  cache
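
To put the numbers on this slide in proportion, here is a small worked calculation. It uses only the figures quoted above; the per-second request rate is an average over a full day, so real peaks would be higher.

```ruby
# Figures taken from the slides.
DAILY_USERS        = 1_000_000
DAILY_REQUESTS     = 200_000_000
DB_OPS_PER_SEC     = 100_000
DB_UPDATES_PER_SEC = 40_000

SECONDS_PER_DAY = 24 * 60 * 60

# Requests one user generates per day, on average.
requests_per_user = DAILY_REQUESTS / DAILY_USERS

# Requests per second averaged over the whole day (peaks are higher).
avg_requests_per_sec = DAILY_REQUESTS.to_f / SECONDS_PER_DAY

# Fraction of all DB operations that are updates.
update_share = DB_UPDATES_PER_SEC.to_f / DB_OPS_PER_SEC

puts "requests per user per day: #{requests_per_user}"
puts "average requests per second: #{avg_requests_per_sec.round}"
puts "updates as share of DB operations: #{(update_share * 100).round}%"
```

With 40% of all DB operations being updates, a read cache can only take away a bit more than half of the database load, which is the point the "no cache" title makes.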

Page 9: 1,000,000 daily users and no cache

A  journey  to  1,000,000  daily  users

Start  of  the  journey

6  weeks  of  pain

Paradise

Looking  back

Page 10: 1,000,000 daily users and no cache

We  used  two  MySQL  shards  right  from  the  start

data_fabric  gem

Easy  to  use

Sharding  by  Facebook’s  user  id

Sharding  on  controller  level
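
A minimal sketch of what sharding by user id at controller level could look like. It does not reproduce the data_fabric API; the shard names, the modulo rule, and the helper names `shard_for` and `with_shard` are assumptions made for illustration.

```ruby
# Hypothetical shard names; the talk only says there were two MySQL shards.
SHARDS = %w[shard_0 shard_1].freeze

# Assumption: a simple modulo on the Facebook user id picks the shard.
def shard_for(facebook_user_id)
  SHARDS[facebook_user_id % SHARDS.size]
end

# "Sharding on controller level": choose the shard once per request
# and run the whole action against it.
def with_shard(facebook_user_id)
  yield shard_for(facebook_user_id)
end

with_shard(1_000_042) { |shard| puts "user 1000042 -> #{shard}" }
```

Choosing the shard once per request keeps all data of one user on one database, so a single API call never has to touch more than one shard.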

Page 11: 1,000,000 daily users and no cache

Scaling application servers was very easy

Running at Amazon EC2

Scalarium for automation and deployment

Role based

Chef recipes for customization

Scaling  app  servers  up  and  out  was  simple

Page 12: 1,000,000 daily users and no cache

Master-slave replication for DBs worked fine

[Diagram: three app servers behind a load balancer (lb); two MySQL masters (db), each with a slave (dbslave)]

Page 13: 1,000,000 daily users and no cache

We added a few application servers over time

[Diagram: nine app servers behind a load balancer (lb); two MySQL masters (db), each with a slave (dbslave)]

Page 14: 1,000,000 daily users and no cache

Basic  setup  worked  well  for  3  months

[Timeline: May to December]

Page 15: 1,000,000 daily users and no cache

A  journey  to  1,000,000  daily  users

Start  of  the  journey

6  weeks  of  pain

Paradise

Looking  back

Page 16: 1,000,000 daily users and no cache

SQL  queries  generated  by  Rubyamf  gem

Rubyamf  serializes  returned  objects

Wrong config => it also loads associations

Page 17: 1,000,000 daily users and no cache

SQL  queries  generated  by  Rubyamf  gem

Rubyamf  serializes  returned  objects

Wrong config => it also loads associations

Really  bad  but  very  easy  to  fix

Page 18: 1,000,000 daily users and no cache

New  Relic  shows  what  happens  during  an  API  call

Page 19: 1,000,000 daily users and no cache

More  traffic  using  the  same  cluster

[Diagram: nine app servers behind a load balancer (lb); two MySQL masters (db), each with a slave (dbslave)]

Page 20: 1,000,000 daily users and no cache

Our  DB  problems  began

[Timeline: May to December]

Page 21: 1,000,000 daily users and no cache

MySQL  hiccups  were  becoming  more  frequent

Everything  was  running  fine  ...

...  and  then  DB  throughput  went  down  to  10%

After a few minutes everything stabilized again ...

...  just  to  repeat  the  cycle  20  minutes  later!

Page 22: 1,000,000 daily users and no cache

ActiveRecord’s status checks caused 20% extra DB load

Check connection state before each SQL query

MySQL  process  list  full  of  ‘status’  calls

Page 23: 1,000,000 daily users and no cache

ActiveRecord’s status checks caused 20% extra DB load

Check connection state before each SQL query

MySQL  process  list  full  of  ‘status’  calls

Page 24: 1,000,000 daily users and no cache

I/O on MySQL masters was still the bottleneck

New Relic shows DB table usage

60% of all UPDATEs were done on the ‘tiles’ table

Page 25: 1,000,000 daily users and no cache

New  Relic  shows  most  used  tables

Page 26: 1,000,000 daily users and no cache

Tiles  are  part  of  the  core  game  loop

Core game loop:

1) plant

2) wait

3) harvest

All are operations on the Tiles table.

Page 27: 1,000,000 daily users and no cache

We  started  to  shard  on  model,  too

We  put  this  table  in  2  extra  shards

old  master

old  slave

Page 28: 1,000,000 daily users and no cache

We  started  to  shard  on  model,  too

We put this table in 2 extra shards

1) Set up new masters as slaves of old ones

old  master

old  slave

new  master

Page 29: 1,000,000 daily users and no cache

We  started  to  shard  on  model,  too

We put this table in 2 extra shards

1) Set up new masters as slaves of old ones

old  master

old  slave

new  master

new  slave

Page 30: 1,000,000 daily users and no cache

We  started  to  shard  on  model,  too

We put this table in 2 extra shards

1) Set up new masters as slaves of old ones

2) App servers start using new masters, too

old  master

old  slave

new  master

new  slave

Page 31: 1,000,000 daily users and no cache

We  started  to  shard  on  model,  too

We put this table in 2 extra shards

1) Set up new masters as slaves of old ones

2) App servers start using new masters, too

3) Cut replication

old  master

old  slave

new  master

new  slave

Page 32: 1,000,000 daily users and no cache

We  started  to  shard  on  model,  too

We put this table in 2 extra shards

1) Set up new masters as slaves of old ones

2) App servers start using new masters, too

3) Cut replication

4) Truncate unused tables

old  master

old  slave

new  master

new  slave
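
Once the tiles table lives on its own masters, every query has to be routed by model as well as by user. The following is a rough sketch of that routing decision only; the database names, the modulo rule, and the helper `database_for` are hypothetical and not taken from the talk or from data_fabric.

```ruby
# Hypothetical names: the two original shards plus two extra shards for tiles.
MAIN_SHARDS  = %w[db_0 db_1].freeze
TILES_SHARDS = %w[tiles_db_0 tiles_db_1].freeze

# Assumption: the model name selects the group of shards,
# the user id modulo the group size selects the shard inside it.
def database_for(model_name, facebook_user_id)
  group = model_name == "Tile" ? TILES_SHARDS : MAIN_SHARDS
  group[facebook_user_id % group.size]
end

puts database_for("Tile", 42)
puts database_for("User", 42)
```

Because the new masters start as replicas of the old ones, both groups hold identical data at the moment of the switch, which is why the app servers can start using them before replication is cut.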

Page 33: 1,000,000 daily users and no cache

8  DBs  and  a  few  more  servers

[Diagram: app servers behind a load balancer (lb); two tiles masters (tiles db), each with a slave (tiles slave); two general masters (db), each with a slave (dbslave)]

Page 34: 1,000,000 daily users and no cache

Doubling the number of DBs didn’t fix it

[Timeline: May to December]

Page 35: 1,000,000 daily users and no cache

We  improved  our  MySQL  setup

RAID-0 of EBS volumes

Using XtraDB instead of vanilla MySQL

Tweaking my.cnf

innodb_flush_log_at_trx_commit

innodb_flush_method

Page 36: 1,000,000 daily users and no cache

data_fabric gem circumvented AR’s internal cache

2x  Tile.find_by_id(id)  =>  1x  SELECT  …

Page 37: 1,000,000 daily users and no cache

data_fabric gem circumvented AR’s internal cache

2x  Tile.find_by_id(id)  =>  1x  SELECT  …
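
The effect described on this slide, two identical lookups resulting in a single SELECT, can be illustrated with a tiny per-request cache. This is a sketch of the idea only, not ActiveRecord's query cache; `FakeTileTable` and `RequestCache` are made-up stand-ins.

```ruby
# Stand-in for the database table; counts how many SELECTs are issued.
class FakeTileTable
  attr_reader :select_count

  def initialize(rows)
    @rows = rows
    @select_count = 0
  end

  def select_by_id(id)
    @select_count += 1
    @rows[id]
  end
end

# Per-request cache: the first lookup for an id hits the table,
# repeated lookups for the same id are served from memory.
class RequestCache
  def initialize(table)
    @table = table
    @cache = {}
  end

  def find_by_id(id)
    @cache.fetch(id) { @cache[id] = @table.select_by_id(id) }
  end
end

table = FakeTileTable.new(1 => { plant: "wheat" })
cache = RequestCache.new(table)
2.times { cache.find_by_id(1) }
puts "SELECTs issued: #{table.select_count}"
```

Without such a cache every repeated lookup inside one request becomes another round trip to the database.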

Page 38: 1,000,000 daily users and no cache

2 + 2 masters and still I/O was not fast enough

We were getting desperate:

“If  2  +  2  is  not  enough,  ...

…  perhaps  4  +  4  masters  will  do?”

Page 39: 1,000,000 daily users and no cache

It’s  no  fun  to  handle  16  MySQL  DBs

[Diagram: app servers behind a load balancer (lb); two tiles masters (tiles db), each with a slave (tiles slave); two general masters (db), each with a slave (dbslave)]

Page 40: 1,000,000 daily users and no cache

It’s  no  fun  to  handle  16  MySQL  DBs

[Diagram: app servers behind a load balancer (lb); four tiles masters (tiles db), each with a slave (tiles slave); four general masters (db), each with a slave (dbslave)]

Page 41: 1,000,000 daily users and no cache

We  were  at  a  dead  end  with  MySQL

[Timeline: May to December]

Page 42: 1,000,000 daily users and no cache

I/O remained the bottleneck for MySQL UPDATEs

Peak  throughput  overall:  5,000  writes/s

Peak  throughput  single  master:  850  writes/s

We  could  get  to  ~1,000  writes/s  …

…  but  then  what?
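
A back-of-the-envelope calculation shows why "but then what?" is a fair question. The sketch assumes that the 4 + 4 setup means eight masters, and it compares against the 40,000 updates per second quoted at the start of the talk, which is the load reached later rather than the load at this point of the story.

```ruby
# Figures from the slides; MASTERS is an interpretation of "4 + 4 masters".
MASTERS                 = 8
PEAK_WRITES_PER_MASTER  = 850     # observed peak on a single master
TUNED_WRITES_PER_MASTER = 1_000   # "We could get to ~1,000 writes/s"
TARGET_UPDATES_PER_SEC  = 40_000  # update rate quoted earlier in the talk

# Upper bounds if every master ran at its peak at the same time.
ceiling_now   = MASTERS * PEAK_WRITES_PER_MASTER
ceiling_tuned = MASTERS * TUNED_WRITES_PER_MASTER

# Masters needed to absorb the later update rate at the tuned throughput.
masters_needed = (TARGET_UPDATES_PER_SEC.to_f / TUNED_WRITES_PER_MASTER).ceil

puts "ceiling today: #{ceiling_now} writes/s"
puts "ceiling after tuning: #{ceiling_tuned} writes/s"
puts "masters needed for #{TARGET_UPDATES_PER_SEC} updates/s: #{masters_needed}"
```

Tuning each master buys less than 20% headroom, while growing by sharding alone would mean several times as many masters, each with its own slave to operate.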

Page 43: 1,000,000 daily users and no cache

Pick  the  right  tool  for  the  job!

Page 44: 1,000,000 daily users and no cache

Redis is fast but goes beyond simple key/value

Redis is a key-value store

Hashes, Sets, Sorted Sets, Lists

Atomic operations like set, get, increment

50,000 transactions/s on EC2

Writes are as fast as reads

Page 45: 1,000,000 daily users and no cache

Shelf tiles: An ideal candidate for using Redis

Shelf tiles:

{ plant1 => 184, plant2 => 141, plant3 => 130, plant4 => 112, … }

Page 46: 1,000,000 daily users and no cache

Shelf tiles: An ideal candidate for using Redis

Redis Hash

HGETALL: load whole hash

HGET: load a single key

HSET: set a single key

HINCRBY: increase value of single key

…
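
The four hash commands listed above map naturally onto the shelf tiles example. The sketch below uses an in-memory stand-in so that it runs without a Redis server; it is not the redis gem, and the key layout `shelf_tiles:<user id>` is an assumption. Values are kept as strings because that is how Redis stores hash fields.

```ruby
# In-memory stand-in mimicking the Redis hash commands named on the slide.
class FakeRedisHashes
  def initialize
    @hashes = Hash.new { |store, key| store[key] = {} }
  end

  # HGETALL: load the whole hash.
  def hgetall(key)
    @hashes[key].dup
  end

  # HGET: load a single field.
  def hget(key, field)
    @hashes[key][field]
  end

  # HSET: set a single field.
  def hset(key, field, value)
    @hashes[key][field] = value.to_s
  end

  # HINCRBY: increase the integer value of a single field, return the new value.
  def hincrby(key, field, amount)
    new_value = @hashes[key].fetch(field, "0").to_i + amount
    @hashes[key][field] = new_value.to_s
    new_value
  end
end

redis = FakeRedisHashes.new
key = "shelf_tiles:42"           # hypothetical layout: one hash per user
redis.hset(key, "plant1", 184)
redis.hset(key, "plant2", 141)
redis.hincrby(key, "plant1", 1)  # the user gains one more plant1 tile
puts redis.hgetall(key).inspect
```

One hash per user means a harvest touches a single field with one atomic command instead of rewriting a whole row.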

Page 47: 1,000,000 daily users and no cache

Migrate  on  the  fly  when  accessing  new  model

Page 48: 1,000,000 daily users and no cache

Migrate on the fly - but only once

true if id could be added, else false
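
One reading of this slide is that a set of already migrated user ids acts as the guard: adding an id succeeds only for the first caller, so only that caller copies the data. The sketch below follows that reading with an in-memory stand-in for a Redis set; `FakeRedisSet`, `MIGRATED`, and `load_tiles` are hypothetical names, and the slide's actual code is not part of the transcript.

```ruby
require "set"

# Stand-in for a Redis set: sadd returns true if the member was newly
# added and false if it was already present.
class FakeRedisSet
  def initialize
    @members = Set.new
  end

  def sadd(member)
    !@members.add?(member).nil?
  end
end

MIGRATED = FakeRedisSet.new

# Hypothetical loader: copies a user's rows from SQL into the new store
# on first access only, then always serves from the new store.
def load_tiles(user_id, sql_rows, redis_store)
  if MIGRATED.sadd(user_id)
    redis_store[user_id] = sql_rows.fetch(user_id, {}).dup
  end
  redis_store[user_id]
end

sql_rows = { 42 => { "plant1" => 184 } }
redis_store = {}
load_tiles(42, sql_rows, redis_store)
sql_rows[42]["plant1"] = 0   # a later change in SQL is no longer picked up
puts load_tiles(42, sql_rows, redis_store).inspect
```

The check and the marking happen in one step, so two requests arriving at the same time cannot both run the migration.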

Page 49: 1,000,000 daily users and no cache

Migrate on the fly - and clean up later

1. Let this run until everything cools down

2. Then migrate the rest manually

3. Remove the migration code

4. Wait until no fallback is necessary

5. Remove the SQL table

Page 50: 1,000,000 daily users and no cache

Migrations on the fly all look the same

Typical migration throughput over 3 days

Initial peak at 100 migrations / second

Page 51: 1,000,000 daily users and no cache

A  journey  to  1,000,000  daily  users

Start  of  the  journey

6  weeks  of  pain

Paradise (or not?)

Looking  back

Page 52: 1,000,000 daily users and no cache

Again:  Tiles  are  part  of  the  core  game  loop

Core game loop:

1) plant

2) wait

3) harvest

All are operations on the Tiles table.

Page 53: 1,000,000 daily users and no cache

Size matters for migrations

Migration check overloaded Redis

Migration only on startup

We overlooked an edge case

Next time only migrate 1% of users

Migrate the rest when everything worked out
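
Migrating only 1% of users first needs a stable rule for who belongs to the first wave. A minimal sketch, assuming integer user ids that are spread roughly evenly; the helper `migrate_now?` is hypothetical and not from the talk.

```ruby
# A user is in the current wave if the last two digits of the id
# fall below the configured percentage.
def migrate_now?(user_id, percent)
  (user_id % 100) < percent
end

users = (1..1_000).to_a
first_wave = users.select { |id| migrate_now?(id, 1) }
puts "first wave: #{first_wave.size} of #{users.size} users"
```

Raising the percentage later keeps everyone from earlier waves included, so the rollout can be widened step by step once the first wave has shown no surprises.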

Page 54: 1,000,000 daily users and no cache

In-memory DBs don’t like saving to disk

You still need to write data to disk eventually

SAVE is blocking

BGSAVE needs free RAM

Dumping on master increased latency by 100%

Page 55: 1,000,000 daily users and no cache

In-memory DBs don’t like to dump to disk

You still need to write data to disk eventually

SAVE is blocking

BGSAVE needs free RAM

Dumping on master increased latency by 100%

Running BGSAVE on slaves every 15 minutes

Page 56: 1,000,000 daily users and no cache

Replication puts Redis master under load

Replication on master starts with a BGSAVE

Not enough free RAM => big problem

Master queues requests that the slave cannot process

Restoring dump on slaves is blocking

Page 57: 1,000,000 daily users and no cache

Redis has a memory fragmentation problem

24  GB

44  GB

in  8  days

Page 58: 1,000,000 daily users and no cache

Redis has a memory fragmentation problem

24  GB

38  GB

in  3  days

Page 59: 1,000,000 daily users and no cache

If  MySQL  is  a  truck...

Fast  enough  for  reads

Can  store  on  disk

Robust replication

http://www.flickr.com/photos/erix/245657047/

Page 60: 1,000,000 daily users and no cache

If  MySQL  is  a  truck,  Redis  is  a  Ferrari

Fast  enough  for  reads

Can  store  on  disk

Robust replication

Super fast reads/writes

Out of memory => dead

Fragile replication

http://www.flickr.com/photos/erix/245657047/

Page 61: 1,000,000 daily users and no cache

Big and static data in MySQL, rest goes to Redis

Fast enough for reads

Can store on disk

Robust replication

Super fast reads/writes

Out of memory => dead

Fragile replication

60 GB data

50% writes

256 GB data

10% writes

http://www.flickr.com/photos/erix/245657047/

Page 62: 1,000,000 daily users and no cache

95% of operation effort is handling DBs!

[Diagram: many app servers behind two load balancers (lb); five Redis masters (redis), each with a slave (redisslave); five MySQL masters (db), each with a slave (dbslave)]

Page 63: 1,000,000 daily users and no cache

We fixed all our problems - we were in Paradise!

[Timeline: May to December]

Page 64: 1,000,000 daily users and no cache

A  journey  to  1,000,000  daily  users

Start  of  the  journey

6  weeks  of  pain

Paradise (or not?)

Looking  back

Page 65: 1,000,000 daily users and no cache

Data  handling  is  most  important

How  much  data  will  you  have?

How often will you read/update that data?

What  data  is  in  the  client,  what  in  the  server?

It’s  hard  to  “refactor”  data  later  on

Page 66: 1,000,000 daily users and no cache

Monitor and measure your application closely

On application level / on OS level

Learn  your  app’s  normal  behavior

Store  historical  data  so  you  can  compare  later

React  if  necessary!
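
Knowing the normal behavior and comparing against stored history can be as simple as checking how far a new measurement lies from the historical average. This is an illustrative sketch only; the metric, the sample values, and the threshold of three standard deviations are assumptions, not something the talk prescribes.

```ruby
# Arithmetic mean of a list of measurements.
def mean(values)
  values.sum.to_f / values.size
end

# Population standard deviation of a list of measurements.
def stddev(values)
  m = mean(values)
  Math.sqrt(values.sum { |v| (v - m)**2 } / values.size)
end

# Flags a measurement that lies more than `threshold` standard deviations
# away from the historical mean.
def abnormal?(history, current, threshold = 3.0)
  spread = stddev(history)
  return current != mean(history) if spread.zero?

  ((current - mean(history)).abs / spread) > threshold
end

# Hypothetical history, e.g. average response time in ms per hour.
history = [52, 48, 50, 51, 49, 50]
puts abnormal?(history, 51)
puts abnormal?(history, 120)
```

A check like this only works if the history is actually stored, which is the reason for keeping historical data in the first place.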

Page 67: 1,000,000 daily users and no cache

Listen  closely  and  answer  truthfully

slideshare.net/wooga

Jesper Richter-Reichhelm

@jrirei

Q  &  A

Page 68: 1,000,000 daily users and no cache

If  you  are  in  Berlin  just  drop  by

Thank  you

slideshare.net/wooga

Jesper Richter-Reichhelm

@jrirei