Failure Happens - Reliability and how to run large websites


Upload: artur-bergman

Post on 08-Jul-2015


DESCRIPTION

Talk from Web 2.0 Expo San Francisco 2008 on how to run large websites, and what has failed, and how you get around it.

TRANSCRIPT

Page 1: Failure Happens - Reliability and how to run large websites

Failure Happens

F***, the f*****g thing is f****d

What broke and what we learned

Page 2: Failure Happens - Reliability and how to run large websites

Redundancy

Redundancy, in general terms, refers to the quality or state of being redundant, that is: exceeding what is necessary or normal; or duplication. This can have a negative connotation, especially in rhetoric: superfluous or repetitive; or a positive implication, especially in engineering: serving as a duplicate for preventing failure of an entire system.

Page 3: Failure Happens - Reliability and how to run large websites

Jesse Robbins Artur Bergman

Page 4: Failure Happens - Reliability and how to run large websites

Artur Bergman Jesse Robbins

Page 5: Failure Happens - Reliability and how to run large websites

• Jesse
  – Runs ops for Etelos
  – Firefighter/EMT
  – Emergency Manager
    • Katrina
  – Experiences running large websites
  – Had the best title ever “Master of Disaster”
• Artur
  – Runs ops & engineering for Wikia
  – Experiences of running large websites, enterprise (boring) and stock exchanges
  – Core Perl developer, long development background
• Both of us
  – Write for O’Reilly Radar
  – Speak at conferences
  – Annoy our peers and coworkers
  – Agree on nearly everything

Page 6: Failure Happens - Reliability and how to run large websites

Redundant

Page 7: Failure Happens - Reliability and how to run large websites

Jesse is sick

• Thankfully, we have high availability
  – Hence this talk
• Jesse has 98% availability
• I am more honest, probably more like 90% excluding the time I sleep
• Our combined availability is 99.84%
• His war stories will be missing
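The combined figure comes from treating the two speakers as redundant components: the talk only fails if both are unavailable at once. A minimal sketch of that arithmetic, assuming the two availabilities are independent (the function name is made up for illustration):

```python
def combined_availability(*availabilities: float) -> float:
    """Availability of a redundant set: it is down only when every member is down.

    Assumes the members fail independently of each other.
    """
    all_down = 1.0
    for a in availabilities:
        all_down *= (1.0 - a)
    return 1.0 - all_down


# Exactly 98% and 90% give 99.80%.
print(f"{combined_availability(0.98, 0.90):.2%}")
# A second speaker at about 92% gives 99.84%.
print(f"{combined_availability(0.98, 0.92):.2%}")
```

With exactly 98% and 90% the result is 99.8%; a figure of 99.84% corresponds to a second availability closer to 92%, which fits the slide's "probably more like".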

Page 8: Failure Happens - Reliability and how to run large websites

June 23-24, 2008
Jesse & Steve

Page 9: Failure Happens - Reliability and how to run large websites

364.96 Main

• San Francisco data center
• Hosts a lot of Web 2.0 companies
• Power outage
• 24 July 2007
  – A day I am sure a lot of people remember fondly

Page 10: Failure Happens - Reliability and how to run large websites
Page 11: Failure Happens - Reliability and how to run large websites
Page 12: Failure Happens - Reliability and how to run large websites

Mistakes

• Generator 3 took down 1 and 4
  – 200% more outage than needed
• But really?
  – Not 365 Main's fault

Page 13: Failure Happens - Reliability and how to run large websites

Failure happens

• A single datacenter is the problem
  – Since they all fail at some point
• Recovery procedures after failure
  – Power was gone ~45 minutes
  – Most services took hours to come back
  – Some unnamed ones more than 12 hours
• Communication
  – All DNS servers in the same datacenter!

Page 14: Failure Happens - Reliability and how to run large websites
Page 15: Failure Happens - Reliability and how to run large websites

Radar article

• Disaster recovery plans exist on a different continuum, affecting not just operations but also your entire organisation's response to disasters.
• An earthquake is a question of when, not if. Are the startups ready for this? How long will we expect them to be gone? Several of the world's largest websites went down. None of them were ready for a datacenter outage. None of them had backup datacenters or failover that worked.
• None even had a coherent strategy for communicating the situation to the rest of the world.

Page 16: Failure Happens - Reliability and how to run large websites

Futility of MTBF

• Mean time between failures
  – Vendors quote you this all the time
• Irrelevant!
• Failure is inevitable
• 365 Main probably had an excellent aggregated MTBF
  – But when something fails, the mean time to the next failure is hardly going to make you feel better

Page 17: Failure Happens - Reliability and how to run large websites

MTTR

• Mean time to recovery
• A short MTTR would have drastically reduced the severity of the power outage, even without hot standby
• No one cares if you fail once a minute
  – If you recover in 50 ms
• If you are down 1 minute a week, you are still going to hit 4 nines (99.99%)
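The two claims above can be checked with simple arithmetic. This is a rough sketch, assuming exactly one outage of fixed length per interval; the function and constant names are illustrative only:

```python
def availability(failure_interval_s: float, recovery_time_s: float) -> float:
    """Fraction of time up when one outage of recovery_time_s happens per interval."""
    return 1.0 - recovery_time_s / failure_interval_s


WEEK_S = 7 * 24 * 3600  # seconds in a week

# Failing once a minute but recovering in 50 ms.
fail_every_minute = availability(60, 0.050)
# Down for one minute every week.
one_minute_a_week = availability(WEEK_S, 60)

print(f"{fail_every_minute:.4%}")   # 99.9167%
print(f"{one_minute_a_week:.4%}")   # 99.9901%
```

Under these assumptions, one minute of downtime a week lands just above four nines, and even a failure every minute stays above three nines as long as recovery takes 50 ms.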

Page 23: Failure Happens - Reliability and how to run large websites

Nines (roughly)

• 99%        5000 min / year (3.5 days)
• 99.9%      500 min / year (8 hours)
• 99.99%     50 min / year
• 99.999%    5 min / year
• 99.9999%   30 seconds / year
• 99.99999%  3 seconds / year
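The rounded figures above follow from the number of minutes in a year. A small sketch that computes the downtime budget for each level, assuming a 365-day year (the helper name is illustrative):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Minutes of allowed downtime per year at the given availability."""
    return MINUTES_PER_YEAR * (1.0 - availability_percent / 100.0)


for pct in (99, 99.9, 99.99, 99.999, 99.9999, 99.99999):
    minutes = downtime_minutes_per_year(pct)
    print(f"{pct}% -> {minutes:.2f} min/year ({minutes * 60:.0f} s)")
```

The exact values are slightly higher than the slide's round numbers (about 5256 minutes at 99%, about 52.6 minutes at 99.99%), which is why the slide says "roughly".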

Page 24: Failure Happens - Reliability and how to run large websites

Irrelevance of the nines

• Blizzard
  – $520 million in profit last year
• World of Warcraft
  – 10 million players
• 98-99%
  – By design

Page 25: Failure Happens - Reliability and how to run large websites

Train your users

• Scheduled downtime each week
• Very little redundancy
• Server failure
  – Up to 10 minutes of data loss
• Been like this from the beginning

Page 26: Failure Happens - Reliability and how to run large websites

“We pay them money, so we have to accept the downtime.”

Page 27: Failure Happens - Reliability and how to run large websites

Reliability

• Don’t aim too high unless you run
  – Banks
  – Space shuttles
  – Lung/heart machines
• The higher you aim
  – The more complexity increases (exponentially)
  – The harder you fail

Page 28: Failure Happens - Reliability and how to run large websites
Page 29: Failure Happens - Reliability and how to run large websites

Complexity killed the cat

Page 31: Failure Happens - Reliability and how to run large websites

Site                 Host                   Downtime
Yahoo! 360           360.yahoo.com          5m
LiveJournal          www.livejournal.com    10m
MySpace              www.myspace.com        25m
Xanga                www.xanga.com          45m
Last.fm              www.last.fm            1h 10m
Orkut                www.orkut.com          1h 10m
Facebook             www.facebook.com       1h 35m
Classmates.com       www.classmates.com     2h 5m
LinkedIn             www.linkedin.com       4h 0m
Reunion.com          www.reunion.com        2h 55m
hi5                  www.hi5.com            5h 5m
Friendster           www.friendster.com     6h 0m
Windows Live Spaces  spaces.live.com        7h 25m
Bebo                 www.bebo.com           12h 28m

Jan-Feb 2008 - Source pingdom.com

$800 MM

Page 32: Failure Happens - Reliability and how to run large websites

Measurement

• How do you measure uptime?
• Ping doesn’t work
• Connect
• Your view is limited from your monitoring stations
• Network problems outside your control
  – Hello Cogent

Page 33: Failure Happens - Reliability and how to run large websites

Measurement

• Look at the traffic
  – The data is there
  – HTML delivery time
  – Image delivery time
  – TCP packet loss
  – Use an image call to collect end user performance metrics
• Calculate expected traffic rates
  – Benchmark against that (bandwidth curves should be smooth!)
  – I always watch the bandwidth
• Wikipedia method
  – How many people complain on IRC?
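One way to "benchmark against expected traffic" is to compare each observed sample with the rate you expected for that time slot and flag large deviations. The sketch below is only an illustration: the function name, the 30% tolerance and the sample numbers are all assumptions, not values from the talk.

```python
def traffic_anomalies(observed, expected, tolerance=0.3):
    """Return indices where observed traffic deviates from expected by more
    than `tolerance`, expressed as a fraction of the expected value."""
    flagged = []
    for i, (obs, exp) in enumerate(zip(observed, expected)):
        if exp > 0 and abs(obs - exp) / exp > tolerance:
            flagged.append(i)
    return flagged


# Hypothetical bandwidth samples in Mbit/s, one per time slot.
expected_mbps = [400, 420, 450, 470, 480]
observed_mbps = [395, 430, 210, 460, 485]

print(traffic_anomalies(observed_mbps, expected_mbps))  # [2]
```

A sudden dip like the third sample is the kind of break in an otherwise smooth bandwidth curve that suggests users cannot reach the site, even if every internal check still passes.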

Page 34: Failure Happens - Reliability and how to run large websites

Outage?

Page 35: Failure Happens - Reliability and how to run large websites

Outage!

Page 36: Failure Happens - Reliability and how to run large websites

YouTube vs BGP vs Pakistan

• BGP runs your internet
  – Protocol for routers to share routing data
  – How to get from me to somewhere else
• Each organization has an AS number
• Each router keeps track of the number of AS numbers to the destination over different routes
• Chooses the shortest one
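The "shortest path" rule described above can be sketched as picking the route whose AS path has the fewest entries. This is a deliberate simplification: real BGP implementations apply other policy steps as well, and the route names and AS numbers below are made-up documentation values.

```python
def best_route(routes):
    """Pick the route with the fewest AS hops (simplified BGP path selection)."""
    return min(routes, key=lambda route: len(route["as_path"]))


# Hypothetical routes to the same destination, learned from two upstream ISPs.
routes = [
    {"via": "isp-a", "as_path": [64496, 64497, 64499]},
    {"via": "isp-b", "as_path": [64498, 64499]},
]

print(best_route(routes)["via"])  # isp-b
```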

Page 37: Failure Happens - Reliability and how to run large websites

Anycast / Multihoming

• BGP allows you to tell multiple ISPs that you are capable of handling a network
• Traffic will follow the “shortest” path
• If a link goes down, that router-to-router BGP session goes away and the route is then withdrawn through the system
• “BGP Convergence”
  – Don’t ask what it really means

Page 38: Failure Happens - Reliability and how to run large websites

Networks and prefixes

• Each netblock is subclassed and has a prefix.
• People mostly know /24, which is 256 addresses
• /23 is twice that
• /8 is a vast quantity (about 16.7 million addresses)
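Prefix sizes are powers of two: a /24 contains 256 addresses (254 of them typically usable as hosts), and each shorter prefix doubles that. A quick check with Python's standard `ipaddress` module, using private 10.x networks purely as examples:

```python
import ipaddress


def address_count(prefix: str) -> int:
    """Number of addresses covered by a CIDR prefix."""
    return ipaddress.ip_network(prefix).num_addresses


for prefix in ("10.0.0.0/24", "10.0.0.0/23", "10.0.0.0/22", "10.0.0.0/8"):
    print(prefix, address_count(prefix))
```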

Page 39: Failure Happens - Reliability and how to run large websites

IP conservation vs routing table conservation

• We are running out of IPs
• Our routing table is growing fast
• To limit the growth of the routing table, routers will usually block any routes more specific than /24
• YouTube was being a good citizen and broadcasting one /22 instead of four /24s

Page 40: Failure Happens - Reliability and how to run large websites

Pakistan Telecom

• Government orders ban of YouTube
• PT achieves this by broadcasting a BGP route for one of YouTube's IP ranges using a /24 prefix
  – Sadly, they did this to the entire world
• Routers choose the most specific route first, so /24 wins over /22
• All of YouTube's traffic went to Pakistan
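The "most specific route wins" rule is longest-prefix matching. The sketch below shows why a /24 announcement captures traffic from a covering /22. The prefixes and origin labels are hypothetical private-range examples, not the real ones involved in the incident.

```python
import ipaddress


def select_origin(destination, routing_table):
    """Return the origin of the most specific (longest) prefix covering destination."""
    dest = ipaddress.ip_address(destination)
    matches = []
    for prefix, origin in routing_table:
        network = ipaddress.ip_network(prefix)
        if dest in network:
            matches.append((network.prefixlen, origin))
    if not matches:
        return None
    return max(matches)[1]  # highest prefix length wins


# Hypothetical table: the legitimate /22 plus a more specific /24 inside it.
routing_table = [
    ("10.20.0.0/22", "legitimate origin"),
    ("10.20.1.0/24", "hijacker"),
]

print(select_origin("10.20.1.5", routing_table))  # hijacker
print(select_origin("10.20.2.5", routing_table))  # legitimate origin
```

Only addresses inside the more specific /24 are diverted in this sketch; addresses elsewhere in the /22 still follow the legitimate route.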

Page 41: Failure Happens - Reliability and how to run large websites

Try reaching for 4 nines

• A BGP error anywhere can quickly bring you down
• Thank the souls running the large ISPs' core networking.
  – They are the reason it works
• The only way to solve this is to be a bad citizen and spam the table with more routes. But even that doesn’t fully protect you from local outages

Page 42: Failure Happens - Reliability and how to run large websites

June 23-24, 2008
Jesse & Steve

Page 43: Failure Happens - Reliability and how to run large websites

Value of reliability (operations and performance)

• Bad reliability is a waste of R&D
• Why develop if you can’t deliver?
• Operations is always treated as the stepchild of Engineering
• But with no reliability, no company
• Fixed amount of time + faster site = more page views

Page 44: Failure Happens - Reliability and how to run large websites

Speed / Reliability

• Important
• Direct correlation between speed and user interaction
• Brand name relies on reliability

Page 45: Failure Happens - Reliability and how to run large websites
Page 46: Failure Happens - Reliability and how to run large websites

[Figure: response time versus requests/sec]

Page 47: Failure Happens - Reliability and how to run large websites

[Figure: response time versus requests/sec]

Page 48: Failure Happens - Reliability and how to run large websites

Nothing matters

• This entire conference!
• Any cool features!
• Unless it works

Page 49: Failure Happens - Reliability and how to run large websites

Cost benefit

• Cost of delivery
• Revenue earned
• Increase cost for more complexity

Page 50: Failure Happens - Reliability and how to run large websites

Metrics you need

• Cost per page view
• Cost per specific feature/page
• This is key: what you should prioritize and what you should do depend on these numbers
• How else can you value it?
• Don’t always go for cheap; sometimes it is better to buy time using money, sometimes not.
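As a rough illustration of these metrics, cost per page view is the delivery cost divided by the views served, tracked per feature. The features, costs and view counts below are invented for the example.

```python
def cost_per_thousand_views(monthly_cost_usd: float, monthly_page_views: int) -> float:
    """Delivery cost in USD per 1000 page views."""
    return 1000.0 * monthly_cost_usd / monthly_page_views


# Hypothetical features: (monthly cost in USD, monthly page views).
features = {
    "article pages": (12_000, 300_000_000),
    "search": (6_000, 20_000_000),
}

for name, (cost, views) in features.items():
    print(f"{name}: ${cost_per_thousand_views(cost, views):.2f} per 1000 views")
```

In this made-up case search costs several times more per view than article pages, which is the kind of number that tells you where optimisation effort or extra spending is justified.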

Page 51: Failure Happens - Reliability and how to run large websites

Operational Engineers

• Ops stepchild of development?
  – Ops is staffed with failed developers
• Fire them
• Hire good ones
• Who are passionate about learning and exploring the entire stack

Page 52: Failure Happens - Reliability and how to run large websites

My story

• Software developer
• Interested in ops
• I always get transferred to ops
  – Fixing the same problems every time
• (Save me, go to Velocity and learn!)
• I bring engineering to ops, and a way to look at the entire system

Page 53: Failure Happens - Reliability and how to run large websites
Page 54: Failure Happens - Reliability and how to run large websites

Pyromaniac

Paranoid

Page 55: Failure Happens - Reliability and how to run large websites
Page 56: Failure Happens - Reliability and how to run large websites

Backups / High Availability

• Don’t confuse them
• Backups protect your data
• High Availability keeps your site running
• MySQL replication is a valid HA solution
• But it won’t help you with
  – DROP TABLE;
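The difference can be shown with a toy model. The sketch below uses Python's built-in `sqlite3` as a stand-in for MySQL (an assumption made so the example runs anywhere) and models replication as applying every statement to both a primary and a replica. The table and helper names are illustrative.

```python
import sqlite3

primary = sqlite3.connect(":memory:")
replica = sqlite3.connect(":memory:")
backup = sqlite3.connect(":memory:")


def replicate(statement):
    """Apply a statement to the primary and, as replication would, to the replica."""
    for db in (primary, replica):
        db.execute(statement)
        db.commit()


def table_exists(db, name):
    row = db.execute(
        "SELECT 1 FROM sqlite_master WHERE type='table' AND name=?", (name,)
    ).fetchone()
    return row is not None


replicate("CREATE TABLE articles (title TEXT)")
replicate("INSERT INTO articles VALUES ('Failure happens')")

# Point-in-time copy taken before the mistake.
primary.backup(backup)

# The mistake is faithfully replicated.
replicate("DROP TABLE articles")

print(table_exists(primary, "articles"))  # False
print(table_exists(replica, "articles"))  # False
print(table_exists(backup, "articles"))   # True
```

The replica keeps the site running if the primary dies, but it repeats destructive statements; only the separate backup still has the data.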

Page 57: Failure Happens - Reliability and how to run large websites

Debugging

• 9 Rules of debugging
• http://www.debuggingrules.com/Poster_download.html
  – Yes the font is horrible

Page 58: Failure Happens - Reliability and how to run large websites

Rule 1: Understand the system

• Complexity Kills
• No excuse
• If you write it, you must know it
• If you run it, you must know it
• If you buy it, you must know it

Page 59: Failure Happens - Reliability and how to run large websites

Rule 3: Quit thinking and look

• “It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.”

Page 60: Failure Happens - Reliability and how to run large websites

Rule 3: Quit thinking and look

• What do you look at?
• The importance of monitoring
• Monitoring
• Monitoring
• Monitoring

Page 61: Failure Happens - Reliability and how to run large websites

My my, confusing term

• Monitoring
• Alerting
• Trending

Page 62: Failure Happens - Reliability and how to run large websites

Alerting

• Acts on monitoring data
• Severe alerts
  – Active
  – Needs action
• Passive alerts
  – Things that need to be done but not right now
• DO NOT OVER ALERT
• DO NOT CRY WOLF

Page 63: Failure Happens - Reliability and how to run large websites

Wikia alerting strategy

• When the site is slow
• Or down
• We send emails and do phone calls
• Europe and US West coast
• Looking to hire in East Asia
• No night time

Page 64: Failure Happens - Reliability and how to run large websites

Trending

• Long term
• Capacity planning

Page 65: Failure Happens - Reliability and how to run large websites

Ganglia

• We love Ganglia
• Automatically graphs everything you want - just works
• Large scale clusters
• Multicast
• Zero config
• RRD

Page 66: Failure Happens - Reliability and how to run large websites

http://ganglia.wikimedia.org/

• 270 hosts
• 880 CPU
• 2 clusters
• 1.2 TB of Memory

Page 67: Failure Happens - Reliability and how to run large websites

http://ganglia.wikimedia.org

Page 68: Failure Happens - Reliability and how to run large websites

Custom Ganglia Gmetrics

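• Write your own

    # Report the Time value (field 6) of the first MySQL thread that is
    # neither sleeping nor a system user
    gmetric --name='Oldest query' --type=int32 --units='sec' --dmax=65 \
      --value=`echo 'show processlist' | mysql -uroot -ppass | grep -v Sleep | grep -v 'system user' | head -2 | tail -1 | cut -f 6`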

Page 69: Failure Happens - Reliability and how to run large websites

Something is wrong

• Don’t worry, data warehouse

Page 70: Failure Happens - Reliability and how to run large websites

Problem found

• If it is critical, start a phone conversation
• Use IRC to communicate technical data
• One person liaises with non-technical staff
• One person specifically in command
• Sleep scheduling (audit log important)

Page 71: Failure Happens - Reliability and how to run large websites

Post crisis

• Root cause analysis
  – Just find out what went wrong
  – And how to avoid it
  – Or fix it faster next time if you can’t
• Keep track of your uptime

Page 72: Failure Happens - Reliability and how to run large websites

Automation

• All machines are created equal
• Seriously
• If you manually make changes
• You are wrong
  – Unless you know what you are doing

Page 73: Failure Happens - Reliability and how to run large websites

Best practices

• Version control
• Gold images
• Centralised authentication
• Time sync (NTP)
• Central logging
• (All of this applies for virtual machines too!)

Page 74: Failure Happens - Reliability and how to run large websites

Puppet

• New hip kid on the block
• Written in Ruby
• Better support?
• Much nicer syntax
• Easier to extend

Page 75: Failure Happens - Reliability and how to run large websites

tcpdump / wireshark

• If you suspect the network
• Don’t just suspect
• LOOK AT IT
• tcpdump / Wireshark will tell you
  – If your packets are lost, delayed or corrupted
  – If your windowing is wrong

Page 76: Failure Happens - Reliability and how to run large websites

Puppet

• Automated machine configuration
• Automation is key
• Our MOTD states
  “If you change anything locally, I will hunt down and kill you”

Page 77: Failure Happens - Reliability and how to run large websites

Rule 4: Divide and Conquer

• Look at the problems in turn
• Split between people
• Go in the order you suspect is the most likely

Page 78: Failure Happens - Reliability and how to run large websites

Rule 5: Change one thing at a time

• I cannot stress this enough
• IF YOU DO NOT THEN YOU HAVE FAILED TO IDENTIFY THE PROBLEM

Page 79: Failure Happens - Reliability and how to run large websites

Rule 6: Keep an audit trail

• You might be making things worse
• Good for the root cause analysis
• Have your shell log all commands
  – Good practice anyway
• Version control

Page 80: Failure Happens - Reliability and how to run large websites

Rule 9: If you didn’t fix it, it ain’t fixed

• You must do something to fix a problem
• Or it will bite you again
• And again
• And again
• They don’t just appear and disappear
• Except BGP route convergence :)

Page 81: Failure Happens - Reliability and how to run large websites
Page 82: Failure Happens - Reliability and how to run large websites

Good Book!

Page 83: Failure Happens - Reliability and how to run large websites

“multiple and unexpected interactions of failures are inevitable”
- Charles Perrow

Page 84: Failure Happens - Reliability and how to run large websites

shit happens.
