
What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems

Jeffry Adityatama, Kurnia J. Eliazar, Agung Laksono, Jeffrey F. Lukman, Vincentius Martin, and Anang D. Satria

Haryadi S. Gunawi, Mingzhe Hao, Tanakorn Leesatapornwongsa, Tiratat Patana-anake

Thanh Do

2

First, let’s ask Google

3

Cloud era

No Deep Root Causes…

4

What does the reliability research community do?

• Bug studies
  1. A Study of Linux File System Evolution. In FAST ’13.
  2. A Comprehensive Study on Real World Concurrency Bug Characteristics. In ASPLOS ’08.
  3. Precomputing Possible Configuration Error Diagnoses. In ASE ’11.
  …

5

Open sourced cloud software

• Publicly accessible bug repositories

6

Study to solve…

• What bugs “live” in the cloud?
• Are there new classes of bugs unique to cloud systems?
• How should cloud dependability tools evolve in the near future?
• Many other questions…

7

Cloud Bug Study (CBS)

• 6 systems: Hadoop MapReduce, HDFS, HBase, Cassandra, Zookeeper, and Flume

• 11 people, 1-year study
• Issues in a 3-year window: Jan 2011 to Jan 2014
• ~21000 issues reviewed
• ~3600 “vital” issues studied in depth
• Cloud Bug Study (CBS) database

8

Classifications

• Aspects – Reliability, performance, availability, security, consistency, scalability, topology, QoS

• Hardware failures - types of hardware and types of hardware failures

• Software bug types – Logic, error handling, optimization, config, race, hang, space, load

• Implications – Failed operation, performance, component downtime, data loss, data staleness, data corruption

• ~25000 annotations in total, about 7 annotations per issue
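The classification scheme above can be pictured as a small record per issue. This is a minimal sketch, not the CBS database format: the category vocabularies are copied from the slide, while the class name, field names, and the `EXAMPLE-1` issue are hypothetical.

```python
from dataclasses import dataclass, field

# Vocabularies copied from the slide; everything else here is an assumption.
ASPECTS = {"reliability", "performance", "availability", "security",
           "consistency", "scalability", "topology", "qos"}
BUG_TYPES = {"logic", "error handling", "optimization", "config",
             "race", "hang", "space", "load"}
IMPLICATIONS = {"failed operation", "performance", "component downtime",
                "data loss", "data staleness", "data corruption"}


@dataclass
class AnnotatedIssue:
    """Hypothetical record for one vital issue and its annotations."""
    issue_id: str
    aspects: set = field(default_factory=set)
    bug_types: set = field(default_factory=set)
    implications: set = field(default_factory=set)

    def validate(self):
        """Return True when every tag belongs to its category's vocabulary."""
        return (self.aspects <= ASPECTS
                and self.bug_types <= BUG_TYPES
                and self.implications <= IMPLICATIONS)

    def annotation_count(self):
        """Total number of tags attached to this issue."""
        return len(self.aspects) + len(self.bug_types) + len(self.implications)


# A made-up issue carrying four annotations.
example = AnnotatedIssue("EXAMPLE-1",
                         aspects={"scalability", "performance"},
                         bug_types={"optimization"},
                         implications={"performance"})
```

Keeping each category as a set lets one issue carry several tags at once, which fits the roughly 7 annotations per issue reported on the slide.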

9

Cloud Bug Study (CBS) database

• Open to public

10

Outline

• Introduction
• Methodology
• Overview of results
• Other CBS database use cases
• Conclusion

11

Methodology

• 6 systems, 3-year span, 2011 to 2014
• 20~30 bugs a day! Protein yeah!
• 17% “vital” issues affecting real deployments
• 3655 vital issues
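As a quick sanity check, the headline numbers are mutually consistent. The sketch below uses only figures quoted in the slides (~21000 issues reviewed, 3655 vital issues, ~25000 annotations); the variable names are illustrative.

```python
reviewed = 21000      # approximate number of issues reviewed (from the slides)
vital = 3655          # vital issues (from the slides)
annotations = 25000   # approximate total annotations (from the slides)

vital_share = vital / reviewed    # fraction of reviewed issues that are vital
per_issue = annotations / vital   # average annotations per vital issue

print(f"vital share: {vital_share:.1%}")
print(f"annotations per vital issue: {per_issue:.1f}")
```

Both results round to the slide values of 17% and about 7 annotations per issue.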

12

Example issue

Title

Type & Priority

Description

Time to resolve

Discussion

13

Outline


14

Classifications for each vital issue

• Aspects
• Hardware types and failure modes
• Software bug types
• Implications
• Bug scopes

15

Overview of results

• Aspects
• Hardware faults vs. Software faults
• Implications

16

Aspects

• CS = Cassandra
• FL = Flume
• HB = HBase
• HD = HDFS
• MR = MapReduce
• ZK = ZooKeeper

17

Aspects: Reliability

• Reliability (45%)
  – Operation & job failures/errors, data loss/corruption/staleness

18

Aspects: Performance

• Reliability
• Performance (22%)

19

Aspects: Availability

• Reliability
• Performance
• Availability (16%)
  – Node and cluster downtime

20

Aspects: Security

• Reliability
• Performance
• Availability
• Security (6%)

21

Overview of results

• Aspects (classical)
• Aspects – Data consistency, scalability, topology, QoS
• Hardware faults vs. Software faults
• Implications

22

Aspects: Data consistency

• Data consistency (5%)
  – Permanently inconsistent replicas
  – Various root causes:
    • Buggy operational protocol
    • Concurrency bugs and node failures

23

Cassandra cross-DC synchronization

[Figure: replicas A, B, C and A’, B’, C’ across data centers; end state labeled “Permanent inconsistency”]

Background operational protocols often buggy!

24

Aspects: Scalability

• Data consistency
• Scalability (2%)
  – A small number does not mean unimportant!
  – Only found at scale
    • Large cluster size
    • Large data
    • Large load
    • Large failures

25

Large cluster

• In Cassandra

[Figure labels: ring position changed; O(n³) calculation; 100x; CPU explosion]
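To see why an O(n³) calculation only hurts at scale, the sketch below counts the iterations of a triple nested loop over n nodes. It is a toy illustration of cubic growth, not Cassandra's actual code; the function name and node counts are made up.

```python
def cubic_steps(n):
    """Count iterations of a triple nested loop over n nodes."""
    steps = 0
    for _ in range(n):
        for _ in range(n):
            for _ in range(n):
                steps += 1
    return steps


small, large = 10, 20
# Doubling the node count multiplies the work by 2**3 = 8.
ratio = cubic_steps(large) / cubic_steps(small)
print(cubic_steps(small), cubic_steps(large), ratio)
```

A calculation that goes unnoticed on a small test cluster can therefore dominate CPU time once the cluster grows by an order of magnitude.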

26

Large data

In HBase

[Figure labels: regions R1, R2, R3, … R100K; tens of minutes; insufficient lookup operation]

27

Large load

In HDFS: 1000x small files in parallel

Not expecting small files!

28

Large failure

AM managing 16,000 tasks fails

Time cost: 7+ hours

[Figure labels: tasks 1, 2, 3, … 16K; un-optimized connection]

29

From above examples…

• Protocol algorithms must anticipate
  – Large cluster sizes
  – Large data
  – Large request load of various kinds
  – Large-scale failures

• The need for scalability bug detection tools

30

Aspects: Topology

• Data consistency
• Scalability
• Topology (1%)
  – Systems have problems when deployed on some network topologies
    • Cross DC
    • Different racks
    • New layering architecture
  – Typically unseen in pre-deployment

31

Aspects: QoS

• Data consistency
• Scalability
• Topology
• QoS (1%)
  – Fundamental for multi-tenant systems
  – Two main points
    • Horizontal/intra-system QoS
    • Vertical/cross-system QoS

32

Overview of results

• Aspects (classical)
• Aspects (unique)
  – Data consistency, scalability, topology, QoS


33

HW faults vs. SW faults

“Hardware can fail, and reliability should come from software.”

34

HW faults and modes

• 299 improper handling of node fail-stop failure

• A memory card running at 25% of normal speed causes problems in an HBase deployment.

35

Hardware faults vs. Software faults

• Hardware failures, components and modes
• Software bug types

36

Software bug types: Logic

• Logic (29%)
  – Many domain-specific issues

37

Software bug types: Error handling

• Logic
• Error handling (18%)
  – Aspirator, Yuan et al. [OSDI ’14]

38

Software bug types: Optimization

• Logic
• Error handling
• Optimization (15%)

39

Software bug types: Configuration

• Logic
• Error handling
• Optimization
• Configuration (14%)

– Automating Configuration Troubleshooting. [OSDI ’10]

– Precomputing Possible Configuration Error Diagnoses. [ASE ’11]

– Do Not Blame Users for Misconfigurations. [SOSP ’13]

40

Software bug types: Race

• Race (12%)
  – < 50% local concurrency bugs
    • Buggy thread interleaving
    • Tons of work
  – > 50% distributed concurrency bugs
    • Reordering of messages, crashes, timeouts
    • More work is needed
  – SAMC [OSDI ’14]
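A distributed concurrency bug of the message-reordering kind can be illustrated by enumerating delivery orders. This is a toy sketch, not SAMC or any system's protocol: the replica model (last write received wins, no version check) and the three messages are assumptions made for illustration.

```python
from itertools import permutations


def apply_in_order(messages):
    """Toy replica: the last write received wins, with no version check."""
    state = {}
    for key, value in messages:
        state[key] = value
    return state


# Three writes that the network may deliver in any order.
messages = [("x", 1), ("x", 2), ("y", 3)]

# Collect the distinct final states over every possible delivery order.
final_states = {
    tuple(sorted(apply_in_order(order).items()))
    for order in permutations(messages)
}
print(final_states)
```

More than one final state means the outcome depends on delivery order, so two replicas that receive the same messages in different orders can end up disagreeing.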

41

Software bug types: Hang

• Hang (4%)
  – Classical deadlock
  – Un-served jobs, stalled operations, …
    • Root causes?
    • How to detect them?

42

Software bug types: Space

• Space (4%)
  – Big data + leak = Big leak
  – Clean-up operations must be flawless.
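The point about clean-up can be sketched with a toy job runner that creates a temporary entry per job. The function names and the in-memory `store` set are hypothetical stand-ins for real temporary files; the only technique shown is clean-up in a `finally` clause versus clean-up that is skipped when the work fails.

```python
def failing_work():
    raise RuntimeError("simulated task failure")


def run_leaky(store, job_id, work):
    store.add(f"tmp-{job_id}")
    result = work()                  # an exception here skips the discard below
    store.discard(f"tmp-{job_id}")
    return result


def run_safe(store, job_id, work):
    store.add(f"tmp-{job_id}")
    try:
        return work()
    finally:
        store.discard(f"tmp-{job_id}")   # runs on success and on failure


def leftover_after_failures(runner, jobs=1000):
    """Run many failing jobs and count temporary entries left behind."""
    store = set()
    for job_id in range(jobs):
        try:
            runner(store, job_id, failing_work)
        except RuntimeError:
            pass
    return len(store)


leaky_count = leftover_after_failures(run_leaky)
safe_count = leftover_after_failures(run_safe)
print(leaky_count, safe_count)
```

Each failed job leaks one entry in the leaky variant, so the leak grows in proportion to the number of failures.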

43

Software bug types: Load

• Load (4%)
  – Happens when systems face high request load
  – Relates to QoS and admission control
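Admission control, mentioned above, can be sketched as a bounded queue that rejects requests once it is full instead of accepting unbounded load. The class name, capacity, and request values are illustrative assumptions, not taken from any of the studied systems.

```python
from collections import deque


class AdmissionController:
    """Toy admission control: accept requests only while the queue has room."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()
        self.rejected = 0

    def submit(self, request):
        """Queue the request, or reject it when the queue is at capacity."""
        if len(self.queue) >= self.capacity:
            self.rejected += 1
            return False
        self.queue.append(request)
        return True

    def serve_one(self):
        """Serve the oldest queued request, or return None when idle."""
        return self.queue.popleft() if self.queue else None


controller = AdmissionController(capacity=3)
accepted = [controller.submit(i) for i in range(5)]
print(accepted, controller.rejected)
```

Rejecting early keeps memory and latency bounded under a burst, at the cost of pushing retries back to the client.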

44

Overview of results

• Aspects (classical)
• Aspects (unique)
  – Data consistency, scalability, topology, QoS


45

Implications

• Failed operation (42%)
• Performance (23%)
• Downtimes (18%)
• Data loss (7%)
• Data corruption (5%)
• Data staleness (5%)

46

Root causes

Every implication can be caused by all kinds of hardware and software faults!

47

“Killer” bugs

• Bugs that simultaneously affect multiple nodes or even the entire cluster

• Single points of failure still exist in many forms
  – Positive feedback loop
  – Buggy failover
  – Repeated bugs after failover
  – …
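The “repeated bugs after failover” pattern can be sketched as follows: when every node runs the same buggy code, retrying the triggering request on a backup crashes the backup too. The `handle` function, the `"poison"` request, and the node names are hypothetical and do not come from a specific issue in the study.

```python
def handle(request):
    """Toy request handler containing a deterministic bug."""
    if request == "poison":
        raise RuntimeError("bug triggered")
    return "ok"


def run_with_failover(nodes, request):
    """Retry the same request on each node in turn; return the nodes that crashed."""
    crashed = []
    for node in nodes:
        try:
            handle(request)
            return crashed           # request served, stop failing over
        except RuntimeError:
            crashed.append(node)     # same bug fires again on the next node
    return crashed


nodes = ["primary", "backup-1", "backup-2"]
crashed_nodes = run_with_failover(nodes, "poison")
print(crashed_nodes)
```

Redundancy protects against independent failures, but a deterministic bug replayed after failover takes down each replica in turn.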

48

Outline


49

CBS database

• 50+ per-system and aggregate graphs from mining the CBS database over the past year

• Still more waiting to be studied…

50

Components with most issues

How should we enhance reliability when multiple cloud systems interact?

Cross-system issues are prevalent!

51

Most challenging types of issues

52

Top k% of most complicated issues

53

System evolution

Hadoop 2.0

54

Conclusion

• One of the largest bug studies for cloud systems

• Many interesting findings, but more questions can be raised from our analysis
  – What types of performance issues exist?
  – Root causes for hang issues?
  – …

• Cloud Bug Study (CBS) database.

55

Thank you!

http://ucare.cs.uchicago.edu/



