Understanding and Dealing with Operator Mistakes in Internet Services. Kiran Nagaraja, Fábio Oliveira, Ricardo Bianchini, Richard P. Martin, Thu D. Nguyen. Rutgers University, Vivo Project, http://vivo.cs.rutgers.edu. Funding from NSF grants #EIA-0103722, #EIA-9986046, and #CCR-0100798.

Post on 21-Dec-2015


Understanding and Dealing with Operator Mistakes in Internet Services

Kiran Nagaraja, Fábio Oliveira, Ricardo Bianchini, Richard P. Martin, Thu D. Nguyen

Rutgers University

Vivo Project

http://vivo.cs.rutgers.edu

Funding from NSF grants: #EIA-0103722, #EIA-9986046, and #CCR-0100798.


Motivation

Internet services are ubiquitous, e.g., Google, Yahoo!, Amazon, eBay.

Expectation of 24 x 7 availability, but service outages still happen!

Sorry....We apologize for the inconvenience, but the system is currently unavailable. Please try your request in an hour. If you require assistance please call Customer Service at 1-866-325-3457.

A significant number of outages in Internet services are a result of operator actions [Oppenheimer03]

#1: Architecture is complex

#2: Systems are constantly evolving

#3: Lack of tools for operators to reason about the impact of their actions

Offline testing, emulation, simulation

Very little detail on operator mistakes

Details strongly guarded by companies and administrators


Talk Outline

Approach and Contributions

Operator Study: Understanding the Mistakes

Validation: Preventing Exposure of Mistakes

Conclusion and Future Work


This Work

Understanding: Gather detailed data on operators’ mistakes

What categories of mistakes?

What’s the impact on the service?

How do mistakes correlate with experience and impact?

Approaches to deal with operator mistakes: prevention, recovery, automation

Validation: Allow operators to evaluate the correctness of their actions prior to exposing them to the service

Similar to offline testing, but:

Virtual environment (extension of online environment)

Real workload

Migration back and forth with minimal operator involvement


Contributions

Detailed information on operator tasks and mistakes

43 experiments: detailed data on operator behavior, including 42 mistakes

64% immediately degraded throughput

57% were software configuration mistakes

Demonstrate that human experiments are possible and valuable

Designed and prototyped a validation infrastructure

Implemented on 2 cluster-based services: a cooperative Web server (PRESS) and a multi-tier auction service

2 techniques to allow operators to validate their actions

Demonstrated that validation is a promising technique for reducing impact of operator mistakes

66% of all mistakes observed in operator study were caught

6 of 9 mistakes caught in live operator experiments with validation

Successfully tested with synthetically injected mistakes


Talk Outline

Approach and Contributions

Operator Study: Understanding the Mistakes

Representative environment

Choice of human subjects and experiments

Results

Validation: Preventing Exposure of Mistakes

Conclusion and Future Work


Multi-Tiered Internet Services

[Figure: three-tier service architecture. Client requests arrive at Tier 1 (Web servers), which pass work to Tier 2 (application servers), which in turn use Tier 3 (database server).]


Tasks, Operators & Training

Tasks:

Scheduled maintenance tasks (proactive), e.g., upgrade Apache

Diagnose-and-repair tasks (reactive), e.g., diagnose a disk failure

Operator composition:

14 computer science graduate students

5 professional programmers (Ask Jeeves)

2 sysadmins from our department

Categorization of operators, based on a filled-in questionnaire:

11 novices: some familiarity with setup

5 intermediates: experience with a similar service

5 experts: in charge of a service requiring high uptime

Operator training:

Novice operators given warm-up tasks

Material describing the service, and detailed steps for tasks


Experimental Setup

Service:

3-tier auction service and client emulator from Rice University’s DynaServer Project

Loaded at 35% of capacity

Machines:

2 Web servers (Apache), 5 application servers (Tomcat), 1 database machine (MySQL)

Operator assistance & data capture:

Monitor service throughput

Modified bash shell for command and result trace

Manual observation:

Noting anomalies in operator behavior

Bailing out ‘lost’ operators
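The command-and-result trace mentioned above could be approximated as follows. This is only a sketch: the study used a modified bash shell, whereas this stand-in is a small Python wrapper, and the function name `traced_run`, the record fields, and the sample command are all assumptions made for illustration.

```python
import subprocess
import sys
import time


def traced_run(argv, trace):
    """Run one command and append a record of it (and its result) to `trace`."""
    started = time.time()
    done = subprocess.run(argv, capture_output=True, text=True)
    record = {
        "time": started,            # when the operator issued the command
        "command": argv,            # what was run
        "exit_code": done.returncode,
        "stdout": done.stdout,      # what the operator saw
        "stderr": done.stderr,
    }
    trace.append(record)
    return record


trace = []
# Hypothetical operator command; the Python interpreter stands in for a real tool.
traced_run([sys.executable, "-c", "print('apache restarted')"], trace)
```

Keeping the timestamp with each record is what would let a trace like this be lined up against the monitored service throughput.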


Example Trace

Task: Add an application server

Mistake: Apache misconfiguration

Impact: Degraded throughput

[Figure: trace annotated with three events: application server added; first Apache misconfigured and restarted; second Apache misconfigured and restarted.]


Sampling of Other Mistakes

Adding a new application server:

Omission of new application server from backend member list

Syntax errors, duplicate entries, wrong hostnames

Launching the wrong version of software

Migrating the database for performance upgrade:

Incorrect privileges for accessing the database

Security vulnerability

Database installed on wrong disk


Operator Mistakes: Category Vs Impact

64% of all mistakes had immediate impact on service performance

36% resulted in latent faults

Obs. #1: Significant no. of mistakes can be checked by testing with a realistic environment

Obs. #2: Undetectable latent errors will still require online-recovery techniques

[Figure: bar chart of the number of mistakes (y-axis, 0 to 20) per impact category (x-axis): degraded throughput, service inaccessible, increased MTTR, incomplete component integration, security vulnerability, Web server potentially inaccessible, reduced system capacity, potential database crash.]


Operator Mistakes

[Figure: bar chart of the number of mistakes (y-axis, 0 to 16) per mistake category (x-axis): local config, global config, incorrect restart, start of wrong SW version, unnecessary restart of SW, unnecessary HW replacement, wrong choice of HW.]

Misconfigurations account for 57% of all mistakes

Configuration mistakes spanning multiple components are more likely

Obs. #1: Tools to manipulate and check configurations are crucial

Obs. #2: Be extremely careful when maintaining multiple versions of software
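As an illustration of the kind of configuration check Obs. #1 calls for, the sketch below flags duplicate entries, wrong hostnames, and omitted servers in a backend member list. It is not a tool from the study: the function `check_member_list` and the host names are hypothetical, and a real checker would parse the actual configuration file format.

```python
from collections import Counter


def check_member_list(configured, expected):
    """Return a list of human-readable problems found in a backend member list."""
    problems = []
    counts = Counter(configured)
    for host, n in sorted(counts.items()):
        if n > 1:
            problems.append(f"duplicate entry: {host}")
        if host not in expected:
            problems.append(f"unknown hostname: {host}")
    # Servers that should be members but were left out of the list.
    for host in sorted(set(expected) - set(configured)):
        problems.append(f"missing member: {host}")
    return problems


# Hypothetical cluster: app3 was added but omitted, app2 was entered twice,
# and app4 is a mistyped hostname.
expected = ["app1", "app2", "app3"]
configured = ["app1", "app2", "app2", "app4"]
problems = check_member_list(configured, expected)
```

A check like this covers only one component's file; mistakes that span several components would need the lists of all of them compared against one another.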


Operator Categories

Experts also made mistakes!

Complexity of tasks executed by experts was higher

[Figure: bar chart of the ratio of mistakes to experiments (y-axis, 0 to 1.2) per mistake category (x-axis), with one bar each for novice, intermediate, and expert operators. Categories: local config, global config, incorrect restart, start of wrong SW version, unnecessary restart of SW, unnecessary HW replacement, wrong choice of HW.]


Operator Study Summary

43 experiments, 42 mistakes

27 (64%) mistakes caused immediate impact on service performance

24 (57%) were software configuration mistakes

Mistakes were made across all operator categories

Trace of operator commands & service performance for all experiments

Available at http://vivo.cs.rutgers.edu


Talk Outline

Approach and Contributions

Operator Study: Understanding the Mistakes

Validation: Preventing Exposure of Mistakes

Technique

Experimental Evaluation

Conclusion and Future Work


Validation of Operator’s Actions

Validation:

Allow the operator to check the correctness of his/her actions prior to exposing their impact to the service interface (clients)

Correctness is tested by:

1. Migrating the component(s) to a virtual sandbox environment,

2. Subjecting them to a real load,

3. Comparing their behavior to a known correct one, and

4. Migrating them back to the online environment

Types of validation:

Replica-based: compare with an online replica (real time)

Trace-based: compare with logged behavior
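The four steps and the two validation types can be sketched as below. This is a toy model under stated assumptions, not the prototype's code: components are modeled as plain functions from request to response, migration is left implicit, and the names `validate`, `replica_based`, `trace_based`, `replica`, and `misconfigured` are hypothetical.

```python
def validate(component, requests, reference_responses):
    """Replay requests against an offline component and compare its responses
    with known-correct ones. Returns the requests whose responses differ."""
    mismatches = []
    for request, expected in zip(requests, reference_responses):
        if component(request) != expected:
            mismatches.append(request)
    return mismatches


def replica_based(component, replica, requests):
    # Reference behavior comes from an online replica serving the same requests.
    return validate(component, requests, [replica(r) for r in requests])


def trace_based(component, trace):
    # Reference behavior comes from a logged list of (request, response) pairs.
    requests = [req for req, _ in trace]
    responses = [resp for _, resp in trace]
    return validate(component, requests, responses)


# Hypothetical components: a correct replica and a misconfigured server.
def replica(request):
    return request.upper()


def misconfigured(request):
    return "ERROR" if request == "bid" else request.upper()


requests = ["browse", "bid", "search"]
trace = [(r, replica(r)) for r in requests]
ok_to_migrate_back = not replica_based(replica, replica, requests)
caught_by_replica = replica_based(misconfigured, replica, requests)
caught_by_trace = trace_based(misconfigured, trace)
```

In this model a component would be migrated back online only when the mismatch list is empty; both variants flag the `bid` request for the misconfigured server.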


Validating a Component: Replica-Based

[Figure: replica-based validation. Diagram labels: client requests; Tier 1 (Web servers); Tier 2 (application servers), divided into an online slice and a validation slice; Tier 3 (database). The validation slice holds an application server between a Web server proxy and a database proxy. A shunt, an application state transfer, and compare functions connect the two slices.]


Validating a Component: Trace-Based

[Figure: trace-based validation. Diagram labels: client requests; Tier 1 (Web servers); Tier 2 (application servers), divided into an online slice and a validation slice; Tier 3 (database). The validation slice holds an application server between a Web server proxy and a database proxy. A shunt, stored state, and compare functions connect the two slices.]


Implementation Details

Shunting performed in middleware layer

Each request tagged with a unique ID all along the request path

Component proxies can be constructed with little effort

Reuse discovery and communication interfaces, common messaging core

State management requires a well-defined export and import API

Stateful servers often support such an API

Comparator functions to detect errors

Simple throughput, flow, and content comparators
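Two of the comparators named above, throughput and content, might look roughly like this. The sketch assumes responses are matched by the unique request ID that the shunt attaches; the 10% throughput tolerance, the function names, and the sample data are assumptions, and the flow comparator is not modeled.

```python
def throughput_ok(validated_rps, reference_rps, tolerance=0.1):
    """True if validated throughput is within `tolerance` (a fraction) of the reference."""
    return abs(validated_rps - reference_rps) <= tolerance * reference_rps


def content_mismatches(validated, reference):
    """Compare response bodies keyed by request ID.

    Returns the sorted IDs whose response differs or is missing on the
    validated side.
    """
    return sorted(
        req_id
        for req_id, body in reference.items()
        if validated.get(req_id) != body
    )


# Hypothetical responses keyed by request ID.
reference = {1: "item list", 2: "bid accepted", 3: "user page"}
validated = {1: "item list", 2: "bid rejected"}
bad_ids = content_mismatches(validated, reference)
```

Keying on the request ID means responses can arrive in a different order on the two slices without being reported as errors.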


Validating Our Prototype: Results

Live operator experiments:

Operator given the choice of validation type and duration, and the option to skip validation

Validation caught 6 of the 9 mistakes made in 8 experiments with validation

Mistake-injection experiments:

Validation caught errors in data content (inaccessible files, corrupted files) and configuration mistakes (an incorrect number of workers in the Web server, which degraded throughput)

Operator-emulation experiments:

Operator command scripts derived from the 42 operator mistakes

Both trace-based and replica-based validation caught 22 mistakes

Multi-component validation caught 4 latent (component interaction) mistakes


Reduction in Impact with Validation

[Figure: bar chart of the number of mistakes (y-axis, 0 to 20) per impact category (x-axis), comparing "Mistakes" with "Mistakes with validation". Categories: degraded throughput, service inaccessible, increased MTTR, incomplete component integration, security vulnerability, Web server potentially inaccessible, reduced system capacity, potential database crash.]


Reduction in Mistakes with Validation

[Figure: bar chart of the number of mistakes (y-axis, 0 to 18) per mistake category (x-axis), comparing "Mistakes" with "Mistakes with validation". Categories: local config, global config, incorrect restart, start of wrong SW version, unnecessary restart of SW, unnecessary HW replacement, wrong choice of HW.]


Shunting & Buffering Overheads

Shunting overhead for replica-based validation: 39% additional CPU

All requests and responses are captured and forwarded to the validation slice

Trace-based validation is slightly better: 32% additional CPU

Overhead is incurred on a single component, and only during validation

Various optimizations can reduce overhead to 13-22%

Examples: response summary (64 bytes), sampling (at session boundaries)

Buffering capacity during state checkpointing and duplication

Required to buffer only about 150 requests for small state sizes
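The two optimizations could be sketched as follows. The slides do not say how the 64-byte summary is computed or how sessions are sampled, so both choices here are assumptions: a SHA-512 digest (which happens to be 64 bytes) stands in for the summary, and sampling keeps every second session in full. The function names and sample requests are hypothetical.

```python
import hashlib


def summarize(body: bytes) -> bytes:
    """Reduce a response body to a fixed 64-byte summary (SHA-512 digest)."""
    return hashlib.sha512(body).digest()


def sample_sessions(requests, every=2):
    """Sample at session boundaries: forward every `every`-th session in full."""
    seen = {}   # session ID -> whether this session is being forwarded
    kept = []
    for session_id, request in requests:
        if session_id not in seen:
            # Decide once, when the session first appears.
            seen[session_id] = len(seen) % every == 0
        if seen[session_id]:
            kept.append((session_id, request))
    return kept


online = summarize(b"<html>auction page</html>")
under_test = summarize(b"<html>auction page</html>")
requests = [("s1", "login"), ("s2", "login"), ("s1", "bid"),
            ("s3", "browse"), ("s2", "bid")]
forwarded = sample_sessions(requests)
```

Forwarding a fixed-size summary instead of the full response cuts the data shipped to the validation slice, while sampling whole sessions keeps each forwarded session internally consistent.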


Caveats, Limitations, and Open Issues

Non-determinism increases complexity of comparators and proxies

E.g., choice of back-end server, remote cache vs. local disk, pseudo-random session ID, timestamps

Hard state management may require operator intervention

Component requires initialization prior to online migration

Bootstrapping the validation

Validating an intended modification of service behavior: no traces or replica for comparison!

How long to validate? What types of validation?

Duration spent in validation implies reduced online capacity
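One way a content comparator might cope with the non-determinism listed above is to mask run-specific fields before comparing. This is a sketch only: the response format and the two regular expressions are hypothetical, and a real service would need patterns matched to its own session IDs and timestamps.

```python
import re

# Hypothetical patterns for fields that legitimately differ between runs.
SESSION_ID = re.compile(r"sessionid=[0-9a-f]+")
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")


def normalize(response: str) -> str:
    """Replace run-specific fields with fixed placeholders before comparing."""
    response = SESSION_ID.sub("sessionid=<id>", response)
    return TIMESTAMP.sub("<time>", response)


a = "sessionid=3fa9c1 generated 2004-12-06 10:15:00 price=12"
b = "sessionid=77b0e2 generated 2004-12-06 10:15:03 price=12"
c = "sessionid=77b0e2 generated 2004-12-06 10:15:03 price=15"

same_content = normalize(a) == normalize(b)       # differ only in masked fields
different_content = normalize(a) != normalize(c)  # price differs, still detected
```

Masking handles session IDs and timestamps; the other sources named above, such as the choice of back-end server, would need handling in the proxies rather than in the comparator.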


Conclusions & Future Work

Gathered data on operator execution & mistakes

Majority of the mistakes were configuration errors

Many of them degraded system throughput

Validation is an effective technique to check operator mistakes

Simple techniques caught the majority of mistakes

Feasible in overhead and implementation effort

‘Validation ready’ components: hooks for logging, forwarding & buffering messages, saving/restoring state

Future work: Taking validation further…

Validate operator actions on databases, network components

Combine validation with diagnosis for assisting operators

Other validation techniques: model-based validation


Acknowledgements

We are thankful to our volunteer operators: fellow students, professional programmers, and LCSR staff members

We also would like to express our gratitude to Christine Hung, Neeraj Krishnan, and Brian Russell for their help in building the monitoring infrastructure in the early stages of the project


Thank you!

Questions?

For more information and traces of operator experiments:

http://vivo.cs.rutgers.edu


Backup Slides


Operator Mistakes: Category Vs Impact

[Figure: bar chart of the number of mistakes (y-axis, 0 to 20) per impact category (x-axis): degraded throughput, service inaccessible, increased MTTR, incomplete component integration, security vulnerability, Web server potentially inaccessible, reduced system capacity, potential database crash. Legend, with the number of mistakes per mistake category:]

Wrong choice of HW component (1)

Unnecessary HW replacement (3)

Unnecessary restart of SW component (3)

Start of wrong SW version (8)

Incorrect restart (3)

Global misconfiguration (16)

Local misconfiguration (8)

64% of all mistakes had immediate impact on service performance

36% resulted in latent faults

Obs. #1: Significant no. of mistakes can be checked by testing with a realistic workload

Obs. #2: Undetectable latent errors will still require online-recovery techniques
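The percentages quoted in the talk follow from the per-category counts in the legend above. The short check below recomputes them; the counts are taken from the legend and the figure of 27 immediate-impact mistakes from the study summary, while the variable names are just for illustration.

```python
# Mistake counts per category, as given in the chart legend.
mistakes_by_category = {
    "Global misconfiguration": 16,
    "Local misconfiguration": 8,
    "Start of wrong SW version": 8,
    "Incorrect restart": 3,
    "Unnecessary restart of SW component": 3,
    "Unnecessary HW replacement": 3,
    "Wrong choice of HW component": 1,
}

total = sum(mistakes_by_category.values())
config = (mistakes_by_category["Global misconfiguration"]
          + mistakes_by_category["Local misconfiguration"])

config_share = round(100 * config / total)            # configuration mistakes
immediate_share = round(100 * 27 / total)             # 27 had immediate impact
latent_share = round(100 * (total - 27) / total)      # the rest were latent
```

The counts sum to the 42 mistakes reported, and the two misconfiguration categories together give the 24 configuration mistakes.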


Mendosus and Slice Isolation

Mendosus virtualizes a network of nodes on an Ethernet LAN

Injects network-level failures, including network partitions

Allows easy isolation of nodes into online and validation slices

Migration does not require any network level modifications


Validation Techniques

Trace-Based Validation

Request/response trace logged to disk

State management: state checkpointed to disk is used for initializing

Validation scenarios: can have higher directed coverage

Replica-Based Validation

Real-time forwarding from online replica

State management: state from replica is directly used for initializing

Validation scenarios: reflects current online characteristics

Multi-Component Validation

Test interaction with working components from online slice


Implementation Details

Shunting performed in middleware layer

E.g., auction service: Apache’s mod_jk module, Tomcat valves, JDBC driver

Each request tagged with a unique ID all along the request path

Component proxies can be constructed with little effort

Reuse discovery and communication interfaces, and add a common request/response messaging core

E.g., the auction service required 4 proxies, derived by adding/modifying only 232, 307, 274, and 384 lines of C/Java code

State management requires a well-defined export and import API

Stateful servers often support such an API

For the Tomcat app server, the regular state manager required a small modification to export its API to the validation infrastructure

Simple comparator functions to detect errors

Throughput, flow, and content comparators


Shunting & Buffering Overheads

Various optimizations can reduce overhead to 13-22%

Examples: summary (64 bytes), sampling (at session boundaries)

Buffering capacity during state checkpointing and duplication

Required to buffer only about 150 requests for small state sizes

[Figure: CPU usage (%, 0 to 100) versus load (requests/sec, 130 to 290) for five configurations: base, replica-val, replica-summary-val, replica-sample-val, and replica-data flow-val. The chart carries a "39%" annotation.]