Transcript
Page 1: Why Did We Think Large Scale Distributed Systems Would be Easy? - PuppetConf 2013

Why did we think large scale distributed systems would be

easy? Gordon Rowell

PuppetConf San Francisco 2013

[email protected]

Page 2

Background

Site Reliability Engineering runs many services

The same rules always apply:

●  Make the service scale
●  Make the deployment consistent
●  Understand all layers of the system
●  Monitor everything
●  Plan for failure
●  Break things, under controlled conditions

Page 3

Scaling is fun

We don't deploy "a server"

•  Servers break, power fails
•  Clients/DNS need to be reconfigured

We don't deploy "a cluster"

•  Networks break, servers break, power fails
•  Clients/DNS need to be reconfigured

We deploy redundant clusters

•  Attempt to send clients to nearest serving cluster
•  Anycast allows for unified client configuration

Page 4

But client DoS is not

Poorly written code...

●  on small numbers of clients...
●  is annoying

Poorly written code...

●  on a huge number of clients...
●  can cause serious infrastructure pain

Write good code and stage your releases

●  Work with the service owners
●  Stage rollouts, allow soak time
●  Have a rollback plan for clients and test it
●  Have DoS limits for services, test them
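One common way to enforce a DoS limit on a service is a token bucket, which allows a steady request rate plus a bounded burst. A minimal sketch (class name, rates, and capacities are illustrative, not from the talk):

```python
import time

class TokenBucket:
    """Illustrative token-bucket rate limiter: allow roughly `rate`
    requests per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = float(rate)          # tokens added per second
        self.capacity = float(capacity)  # maximum burst size
        self.tokens = float(capacity)    # start full
        self.last = time.monotonic()

    def allow(self):
        # Refill tokens for the time elapsed since the last check.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # shed this request rather than melt down

# Example limits (illustrative): 100 requests/sec, bursts of 200.
bucket = TokenBucket(rate=100, capacity=200)
if not bucket.allow():
    pass  # e.g. return an error / drop the request
```

Testing the limiter under synthetic load (as the slide says: have DoS limits, and test them) is as important as having it.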

Page 5

Load balancing is fun

Do you have enough capacity?

•  How many backends do you need?
•  What happens if half of your backends lose power?
•  What about when half are already out for repairs?
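The capacity questions above have a back-of-the-envelope answer: size the fleet so that losing some fraction of backends still leaves enough serving capacity. A sketch (all numbers and names are illustrative, not from the talk):

```python
import math

def backends_needed(peak_qps, per_backend_qps, survivable_fraction=0.5):
    """How many backends so that losing `survivable_fraction` of them
    (power failure, repairs, upgrades) still leaves enough capacity.
    Illustrative calculation, not from the talk."""
    serving_needed = math.ceil(peak_qps / per_backend_qps)
    # If half the fleet can be gone, provision double the serving need.
    return math.ceil(serving_needed / (1.0 - survivable_fraction))

# e.g. 50k qps at peak, 1k qps per backend, tolerate losing half:
print(backends_needed(50_000, 1_000))  # 100
```

Note the slide's point that failures overlap: the half that loses power may not be the half already out for repairs, so the survivable fraction should account for both.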

How do you send clients to the right cluster?

•  Client configuration
•  DNS round-robin (simple global load balancing)
•  DNS views (give best answer for client IP)
•  Anycast (portable IP, routed to "nearest" cluster)
•  Consider: DNS views plus Anycast

Page 6

But global outages are not

Monitor everything

●  Health check failures bring down your service
●  ...by design

Test everything

●  You should expect (and test) data center outages
●  A global outage can ruin your day
●  Cascading failures are unpleasant

Learn from outages

●  Write postmortems
●  Focus on the facts!
●  What went wrong and what can be better?
●  A postmortem is not about blame

Page 7

Thundering herds are not

For Puppet

•  "Lots" of Mac desktops and laptops
•  "Lots" of Ubuntu desktops, laptops and servers
•  "Some" others

What if they all want to do a puppet run?

•  What about every hour?
•  What about every five minutes?

Randomize your cron jobs! (and test it)

How can you shed load on the server?
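One way to randomize runs without losing predictability is a deterministic per-host splay, similar in spirit to Puppet's fqdn_rand: each host hashes its own name to a stable offset within the run period, so the fleet spreads out instead of stampeding at the top of the hour. A sketch (function name and period are illustrative):

```python
import hashlib

def splay_seconds(hostname, period=3600):
    """Deterministic per-host offset within the run period: stable for
    a given host, spread across hosts. Illustrative sketch, similar in
    spirit to Puppet's fqdn_rand."""
    digest = hashlib.sha256(hostname.encode()).hexdigest()
    return int(digest, 16) % period

# Each host sleeps its own offset before running, rather than every
# host starting the puppet run at the same moment:
for host in ("web1.example.com", "web2.example.com"):
    print(host, splay_seconds(host))
```

The offset is stable across reboots (no state to store), which also makes the staggering testable.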

Page 8

Anycast is fun

Anycast is "coarse-grain" load balancing

•  Routes traffic to the "nearest", "serving" cluster

Networks break

•  Physical issues
•  Routing issues
•  Configuration issues
•  Load balancer bugs

Anycast monitoring is hard

Page 9

Anycast directed to one site is not fun

Page 10

Anycast directed to one site is not fun

All clients could be sent to the same cluster

•  Be ready for that
•  Can a single cluster handle worldwide traffic?
•  What do you do if it can't?

Have a mitigation strategy to shed load

●  Include load calculations early in health checks
●  Consider DNS views to redirect some traffic
●  Drop traffic if you have to
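A minimal sketch of folding load into a health check, so an overloaded backend reports unhealthy and the load balancer (or anycast announcement) stops sending it traffic. The threshold and names are assumptions, not from the talk:

```python
import os

def healthy(load1, cpus, max_load_per_cpu=4.0):
    """Report unhealthy when the box is overloaded, so upstream load
    balancing drains it. Threshold is an illustrative assumption."""
    return load1 / max(cpus, 1) < max_load_per_cpu

# Wired to the live machine (Unix):
load1, _, _ = os.getloadavg()
print("serving" if healthy(load1, os.cpu_count() or 1) else "draining")
```

Shedding via the health check is coarse: if every backend in a cluster reports unhealthy at once, you can amplify the very failover stampede you were trying to avoid, which is why the slide pairs it with DNS views and, as a last resort, dropping traffic.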

Page 11

Diversity is good...for people

Be ruthless against platform diversity

If you can't automate it, don't do it

●  "Could we bring up another 50 today, please?"
●  "That backend was just a little different and...oops"

Anycast helps you be consistent

●  Traffic could go anywhere

Every OS upgrade is a time to refactor and clean

Page 12

Questions?

Gordon Rowell [email protected]

