early detection of configuration errors to reduce failure ... · early detection of configuration...

31
Early detection of configuration errors to reduce failure damage Tianyin Xu, Xinxin Jin, Peng Huang, Yuanyuan Zhou, Shan Lu, Long Jin, Shankar Pasupathy UC San Diego University of Chicago NetApp

Upload: phamnhu

Post on 31-Mar-2018

232 views

Category:

Documents


4 download

TRANSCRIPT

Page 1: Early detection of configuration errors to reduce failure ... · Early detection of configuration errors to reduce failure damage ... during initialization 10 ... exec() removes the

Early detection of configuration errors to reduce failure damage

Tianyin Xu, Xinxin Jin, Peng Huang, Yuanyuan Zhou,

Shan Lu, Long Jin, Shankar Pasupathy

UC San Diego University of Chicago NetApp

done

Page 2: Early detection of configuration errors to reduce failure ... · Early detection of configuration errors to reduce failure damage ... during initialization 10 ... exec() removes the

This paper is not about bugsabout configuration errors.

done

• bad values inside configuration files

• introduced by sysadmins

• nothing “wrong” in our code

2

Page 3: Early detection of configuration errors to reduce failure ... · Early detection of configuration errors to reduce failure damage ... during initialization 10 ... exec() removes the

When systems use bad configuration values, the code does report errors.

• throw exceptions

• return error code

• crash with coredumps

3

correct != timely

Page 4: Early detection of configuration errors to reduce failure ... · Early detection of configuration errors to reduce failure damage ... during initialization 10 ... exec() removes the

/* sys_24-7.c */

signal(SIGSEGV, call_techsup);

static void call_techsup(int sig) {if (fork() == 0) {char* args[] = {“0911”, “SOS”};int rv = execvp(dial_prog_path, args);if (rv != 0)fprintf(stderr, “I’m sorry (%d)!”, errno);

}}

Errors are often reported too late!

configuration“/bad/dial/path”

4

Page 5: Early detection of configuration errors to reduce failure ... · Early detection of configuration errors to reduce failure damage ... during initialization 10 ... exec() removes the

4

/* sys_24-7.c */

signal(SIGSEGV, call_techsup);

static void call_techsup(int sig) {if (fork() == 0) {char* args[] = {“0911”, “SOS”};int rv = execvp(dial_prog_path, args);if (rv != 0)fprintf(stderr, “I’m sorry (%d)!”, errno);

}}

Errors are often reported too late!

configuration“/bad/dial/path”

Shoot!Well, this is unexpected…

Error code: 500

An error has occurred and we’re working

to fix the problem!

The service will be up and running shortly.

“It’s too late to apologize!”

Page 6: Early detection of configuration errors to reduce failure ... · Early detection of configuration errors to reduce failure damage ... during initialization 10 ... exec() removes the

rollout

Systems execution has stages.

initialization

observation period

Time

workload

production

error

Server

days/weeks

Server Server Server Server ServerServer

Server Server Server Server Server Server

Server

Server

5

Page 7: Early detection of configuration errors to reduce failure ... · Early detection of configuration errors to reduce failure damage ... during initialization 10 ... exec() removes the

initialization rollout workload error

dam

age

of

con

fig

err

ors

Chart Title

All stages are not created equal.

fix

revenue loss

observation period production

Server

Faulttolerance

service outage

6

Page 8: Early detection of configuration errors to reduce failure ... · Early detection of configuration errors to reduce failure damage ... during initialization 10 ... exec() removes the

Misconfigured backup DNS (used upon attacks) made LinkedIn inaccessible for half a day.

Faulty failover configurations turned a 10 minute outage into a 2.5 hour ordeal.

Misconfigured data protection allowed a bug to wipe out 10% of the storage nodes.

Does this truly happen?

7

Page 9: Early detection of configuration errors to reduce failure ... · Early detection of configuration errors to reduce failure damage ... during initialization 10 ... exec() removes the

Sysadmins’ wish

All the configuration

errors can be exposed

at initialization.

initialization rollout workload error

observation period production

Difficult for sysadmins

to test out latent

configuration errors

8

Reality

Page 10: Early detection of configuration errors to reduce failure ... · Early detection of configuration errors to reduce failure damage ... during initialization 10 ... exec() removes the

Contribution

• A perspective of checking configurations early and detect errors timely

• A study on real-world configuration checking practices deficiency of built-in configuration checks prevalent threats of latent configuration errors

• PCheck: tooling support for automatically generating configuration checking code help systems detect configuration errors early effective, safe, and efficient

9

Page 11: Early detection of configuration errors to reduce failure ... · Early detection of configuration errors to reduce failure damage ... during initialization 10 ... exec() removes the

How well terribly do systems check their configurations?

• Studied configuration parameters of R.A.S. (Reliability, Availability, Serviceability) features

Software RAS Param.

HDFS 44

YARN 35

HBase 25

Apache 14

Squid 21

MySQL 43

mission critical

not really needed for initialization

─ 12%─39% are not used during initialization

10

w/o early checking,

errors would become

latent & catastrophic.

Page 12: Early detection of configuration errors to reduce failure ... · Early detection of configuration errors to reduce failure damage ... during initialization 10 ... exec() removes the

How well terribly do systems check their configurations?

• 14%─93% of the studied parameters do not haveany special checking code at initialization

→ rely on usage code for checking/reporting errors

5%─39% of R.A.S. parameters are subject to latent errors.

11

Page 13: Early detection of configuration errors to reduce failure ... · Early detection of configuration errors to reduce failure damage ... during initialization 10 ... exec() removes the

Detecting latent configuration errors would require separate checking code at systems’ initialization phase.

12

Page 14: Early detection of configuration errors to reduce failure ... · Early detection of configuration errors to reduce failure damage ... during initialization 10 ... exec() removes the

Systems already have checking logic implied by the usage of configuration values (though usage code often comes late).

Can we leverage usage-implied checkingto detect configuration errors early?

13

Page 15: Early detection of configuration errors to reduce failure ... · Early detection of configuration errors to reduce failure damage ... during initialization 10 ... exec() removes the

Why not copy+paste the code that uses configuration values into initialization?

use(cfg1)

use(cfg2)

initialization rollout workload error

observation period production 14

use(cfg1)

use(cfg2)

Page 16: Early detection of configuration errors to reduce failure ... · Early detection of configuration errors to reduce failure damage ... during initialization 10 ... exec() removes the

use(cfg1)

use(cfg2)

Why not copy+paste the code that uses configuration values into initialization?

use(cfg1)

use(cfg2)

initialization rollout workload error

observation period production 14

Page 17: Early detection of configuration errors to reduce failure ... · Early detection of configuration errors to reduce failure damage ... during initialization 10 ... exec() removes the

/* sys_24-7.c */

signal(SIGSEGV, call_techsup);

static void call_techsup(int sig) {if (fork() == 0) {char* args[] = {“0911”, “SOS”};int rv = execvp(dial_prog_path, args);if (rv != 0)fprintf(stderr, “I’m sorry (%d)!”, errno);

}}

configuration

Demo

15

Page 18: Early detection of configuration errors to reduce failure ... · Early detection of configuration errors to reduce failure damage ... during initialization 10 ... exec() removes the

/* sys_24-7.c */

signal(SIGSEGV, call_techsup);

static void call_techsup(int sig) {if (fork() == 0) {char* args[] = {“0911”, “SOS”};int rv = execvp(dial_prog_path, args);if (rv != 0)fprintf(stderr, “I’m sorry (%d)!”, errno);

}}

Demo

int rv = execvp(dial_prog_path, args);

configuration

15

Copy

Page 19: Early detection of configuration errors to reduce failure ... · Early detection of configuration errors to reduce failure damage ... during initialization 10 ... exec() removes the

/* sys_24-7.c */

static int sys_init() { load_config();.........

}

Demo

int rv = execvp(dial_prog_path, args);

15

Paste

Page 20: Early detection of configuration errors to reduce failure ... · Early detection of configuration errors to reduce failure damage ... during initialization 10 ... exec() removes the

Problem 1 Executing code needs context args is undefined

Problem 2 Execution can have side effects prank calls are crimes.

exec() removes the current process image

Copy+paste code does not work!

int rv = execvp(diag_prog_path, args);

16

args is undefined

Page 21: Early detection of configuration errors to reduce failure ... · Early detection of configuration errors to reduce failure damage ... during initialization 10 ... exec() removes the

best effort: may not always be able to determine the values

configurations often have relatively simple context

int rv = execvp(diag_prog_path, args);

Backtrack to determine values of undefined variables

+ char* args[] = {“0911”, “SOS”};

Produce necessary context

17

Page 22: Early detection of configuration errors to reduce failure ... · Early detection of configuration errors to reduce failure damage ... during initialization 10 ... exec() removes the

“Sandbox” the checking code

int rv = execvp(diag_prog_path, args);

char* args[] = {“0911”, “SOS”};

- int rv = execvp(diag_prog_path, args);

- char* args[] = {“0911”, “SOS”};

+ int rv = check_execvp(diag_prog_path);

Prevent side effects

Rewrite instructions based on check utilities that validate the operands w/o executing the instructions

18

int rv = check_execvp(diag_prog_path);

Page 23: Early detection of configuration errors to reduce failure ... · Early detection of configuration errors to reduce failure damage ... during initialization 10 ... exec() removes the

PCheck implementation

• Works for both C and Java programs LLVM compiler framework for C code

Soot compiler framework for Java code

19

Page 24: Early detection of configuration errors to reduce failure ... · Early detection of configuration errors to reduce failure damage ... during initialization 10 ... exec() removes the

PCheck implementation

1. Extract instructions that use configuration values

2. Produce necessary context

3. Prevent side effects

4. Capture runtime anomalies

19

prog

conf

APIs

Input Output

checker

checker

checker

prog

5. Insert + invoke checking code in program bitcode/bytecode

Analysis

invok. loc.

Page 25: Early detection of configuration errors to reduce failure ... · Early detection of configuration errors to reduce failure damage ... during initialization 10 ... exec() removes the

Evaluation

• Evaluate on 58 real-world latent configuration errors 37 new errors (discovered in our study)

21 historical errors (caused failures in the past)

20

Page 26: Early detection of configuration errors to reduce failure ... · Early detection of configuration errors to reduce failure damage ... during initialization 10 ... exec() removes the

done

How many errors can be detected?

0

5

10

15

20

25

30

35

40

1 2

Chart Title

71.4%

78.4%

Historical errors New errors

# latent config errors

missed

detected

21

Page 27: Early detection of configuration errors to reduce failure ... · Early detection of configuration errors to reduce failure damage ... during initialization 10 ... exec() removes the

done

How many errors can be detected?

21

Type of errors# (%) errors detected

Historical New

Type/format errors 1/1 (100.0%) 13/13 (100.0%)

Invalid options/ranges 2/2 (100.0%) 4/4 (100.0%)

Incorrect files/dirs 9/12 (75.0%) 5/7 (71.4%)

Miscellaneous errors 3/6 (50.0%) 7/13 (53.8%)

Total 15/21 (71.4%) 29/37 (78.4%)

Page 28: Early detection of configuration errors to reduce failure ... · Early detection of configuration errors to reduce failure damage ... during initialization 10 ... exec() removes the

done

What errors are missed?

• Cannot generate the checking code fail to produce the execution context

e.g., values from runtime requests

• Cannot safely execute the checking code unknown side effects

e.g., used as bash command

22

Page 29: Early detection of configuration errors to reduce failure ... · Early detection of configuration errors to reduce failure damage ... during initialization 10 ... exec() removes the

done

Caveats

• Errors manifested via a long running period(we cannot run checks for too long.) resource misconfigurations (exhaustion)

performance misconfigurations (degradation)

• Errors not exposed via “obvious” anomalies (we report errors based on exceptions, error code, etc.) semantic errors (e.g., backup data to wrong files)

23

Page 30: Early detection of configuration errors to reduce failure ... · Early detection of configuration errors to reduce failure damage ... during initialization 10 ... exec() removes the

• Checking overhead

HDFS/YARN/HBase less than 1000 msec

Apache/MySQL/Squid less than 100 msec

• False positives tested with the default values and real-world settings

collected from 830 configuration files

only 3 parameters have false alarms reported- caused by imprecise analysis that missed control-flow dependencies (the exposed anomalies are unreal)

done

What is the cost?

24

Page 31: Early detection of configuration errors to reduce failure ... · Early detection of configuration errors to reduce failure damage ... during initialization 10 ... exec() removes the

Check your configurations early!

• Treat configuration errors like fatal diseases

• PCheck: auto-generating& invoking configurationchecking code to enforceearly detection.

Conclusion