What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems Jeffry Adityatama, Kurnia J. Eliazar, Agung Laksono, Jeffrey F. Lukman, Vincentius Martin, and Anang D. Satria Haryadi S. Gunawi, Mingzhe Hao, Tanakorn Leesatapornwongsa, Tiratat Patana-anake Thanh Do

Post on 24-Dec-2015


What Bugs Live in the Cloud?
A Study of 3000+ Issues in Cloud Systems

Jeffry Adityatama, Kurnia J. Eliazar, Agung Laksono, Jeffrey F. Lukman, Vincentius Martin, and Anang D. Satria

Haryadi S. Gunawi, Mingzhe Hao, Tanakorn Leesatapornwongsa, Tiratat Patana-anake

Thanh Do

First, let’s ask Google

Cloud era

No Deep Root Causes…

What does the reliability research community do?

• Bug studies
1. A Study of Linux File System Evolution. In FAST ’13.
2. A Comprehensive Study on Real World Concurrency Bug Characteristics. In ASPLOS ’08.
3. Precomputing Possible Configuration Error Diagnoses. In ASE ’11.
…

Open-source cloud software

• Publicly accessible bug repositories

Study to solve…

• What bugs “live” in the cloud?
• Are there new classes of bugs unique to cloud systems?
• How should cloud dependability tools evolve in the near future?
• Many other questions…

Cloud Bug Study (CBS)

• 6 systems: Hadoop MapReduce, HDFS, HBase, Cassandra, ZooKeeper, and Flume
• 11 people, 1-year study
• Issues in a 3-year window: Jan 2011 to Jan 2014
• ~21000 issues reviewed
• ~3600 “vital” issues studied in depth
• Cloud Bug Study (CBS) database

Classifications

• Aspects – Reliability, performance, availability, security, consistency, scalability, topology, QoS

• Hardware failures - types of hardware and types of hardware failures

• Software bug types – Logic, error handling, optimization, config, race, hang, space, load

• Implications – Failed operation, performance, component downtime, data loss, data staleness, data corruption

• ~25000 annotations in total, about 7 annotations per issue
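
The classification scheme above can be pictured as one tagged record per issue. The following is a minimal Python sketch, assuming hypothetical field names (`aspects`, `hardware`, `bug_types`, `implications`) and a made-up issue ID; it is not the actual CBS database schema. It also checks the "about 7 annotations per issue" figure, using the 3655 vital issues reported in the methodology.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotatedIssue:
    """One vital issue with its classification tags (hypothetical layout)."""
    issue_id: str  # placeholder ID, not a real ticket
    aspects: set = field(default_factory=set)
    hardware: set = field(default_factory=set)
    bug_types: set = field(default_factory=set)
    implications: set = field(default_factory=set)

    def annotation_count(self) -> int:
        # Every tag in every category counts as one annotation.
        return (len(self.aspects) + len(self.hardware)
                + len(self.bug_types) + len(self.implications))


# Illustrative record only; the tags come from the categories listed above.
example = AnnotatedIssue(
    issue_id="EXAMPLE-1",
    aspects={"reliability", "consistency"},
    bug_types={"race"},
    implications={"data loss"},
)

TOTAL_ANNOTATIONS = 25_000  # "~25000 annotations in total"
VITAL_ISSUES = 3_655        # vital issues studied in depth
avg_annotations = TOTAL_ANNOTATIONS / VITAL_ISSUES

print(example.annotation_count(), round(avg_annotations, 1))
```

The average comes out slightly under 7, consistent with the rounded figure on the slide.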

Cloud Bug Study (CBS) database

• Open to public

Outline

• Introduction
• Methodology
• Overview of results
• Other CBS database use cases
• Conclusion

Methodology

• 6 systems, 3-year span, 2011 to 2014
• 20~30 bugs a day! Protein yeah!
• 17% “vital” issues affecting real deployments
• 3655 vital issues
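
A quick arithmetic check of these numbers, sketched in Python. The ~21000 reviewed-issue total is the approximate figure from the study summary, and the exact window boundaries (1 Jan 2011 to 1 Jan 2014) are an assumption.

```python
from datetime import date

reviewed = 21_000  # approximate number of issues reviewed
vital = 3_655      # vital issues studied in depth

# Assumed window boundaries; the talk only says "Jan 2011 to Jan 2014".
window_days = (date(2014, 1, 1) - date(2011, 1, 1)).days

vital_fraction = vital / reviewed
issues_per_day = reviewed / window_days

print(f"{vital_fraction:.1%} vital, {issues_per_day:.1f} issues per day")
```

Under these assumptions the vital share is a little over 17%, and the average arrival rate is a little under 20 issues per day, at the low end of the "20~30 bugs a day" quoted above.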

Example issue

Title

Type & Priority

Description

Time to resolve

Discussion

Outline

• Introduction
• Methodology
• Overview of results
• Other CBS database use cases
• Conclusion

Classifications for each vital issue

• Aspects
• Hardware types and failure modes
• Software bug types
• Implications
• Bug scopes

Overview of results

• Aspects
• Hardware faults vs. Software faults
• Implications

Aspects

• CS = Cassandra
• FL = Flume
• HB = HBase
• HD = HDFS
• MR = MapReduce
• ZK = ZooKeeper

Aspects: Reliability

• Reliability (45%)
  – Operation & job failures/errors, data loss/corruption/staleness

Aspects: Performance

• Reliability
• Performance (22%)

Aspects: Availability

• Reliability
• Performance
• Availability (16%)
  – Node and cluster downtime

Aspects: Security

• Reliability
• Performance
• Availability
• Security (6%)

Overview of results

• Aspects (classical)
• Aspects – Data consistency, scalability, topology, QoS
• Hardware faults vs. Software faults
• Implications

Aspects: Data consistency

• Data consistency (5%)
  – Permanent inconsistent replicas
  – Various root causes:
    • Buggy operational protocol
    • Concurrency bugs and node failures

Cassandra cross-DC synchronization

[Figure: replica versions A, B, C and A’, B’, C’ across datacenters, before and after synchronization]

Background operational protocols are often buggy!

Permanent inconsistency
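
To make "permanent inconsistency" concrete, here is a minimal Python sketch that compares the contents of replicas and reports the keys on which they disagree. The datacenter names `dc1` and `dc2` and the replica contents are illustrative assumptions loosely modeled on the labels in the slide's figure; this is not how Cassandra itself checks replicas.

```python
def divergent_keys(replicas):
    """Return the keys whose value differs between at least two replicas.

    `replicas` maps a replica name to its {key: value} contents.
    A key missing from a replica is treated as the value None.
    """
    all_keys = set()
    for contents in replicas.values():
        all_keys.update(contents)

    diverged = {}
    for key in sorted(all_keys):
        values = {name: contents.get(key) for name, contents in replicas.items()}
        if len(set(values.values())) > 1:
            diverged[key] = values
    return diverged


# Assumed end state: A and B were synchronized, C was not.
replicas = {
    "dc1": {"A": "A'", "B": "B'", "C": "C'"},
    "dc2": {"A": "A'", "B": "B'", "C": "C"},
}
result = divergent_keys(replicas)
print(result)
```

If no later background operation repairs the stale value, the divergence reported here never goes away, which is what makes the inconsistency permanent.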

Aspects: Scalability

• Data consistency
• Scalability (2%)
  – Small number does not mean not important!
  – Only found at scale
    • Large cluster size
    • Large data
    • Large load
    • Large failures

Large cluster

• In Cassandra

O(n^3) calculation

Ring position changed.

100x

CPU explosion

Large data

In HBase

[Figure: R1, R2, R3, …, R100K; lookup takes tens of minutes]

Inefficient lookup operation

Large load

In HDFS 1000x small files in parallel

… Not expecting small files!

Large failure

Time cost: 7+ hours

AM managing 16,000 tasks fails

[Figure: tasks 1, 2, 3, …, 1K, 2K, 3K, 4K, 5K, …, 16K]

Un-optimized connection

From above examples…

• Protocol algorithms must anticipate
  – Large cluster sizes
  – Large data
  – Large request load of various kinds
  – Large scale failures

• The need for scalability bug detection tools
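
One simple way to look for such bugs before deployment is to measure how a protocol step's cost grows at small scales and estimate the growth exponent. The Python sketch below is only an illustration of that idea, not a tool from the study: `cubic_protocol_ops` and `linear_protocol_ops` are hypothetical stand-ins that count operations instead of timing real code, and the exponent is the least-squares slope on a log-log scale.

```python
import math


def cubic_protocol_ops(n):
    """Hypothetical protocol step doing n^3 units of work (counts operations)."""
    ops = 0
    for _ in range(n):
        for _ in range(n):
            for _ in range(n):
                ops += 1
    return ops


def linear_protocol_ops(n):
    """Hypothetical protocol step doing n units of work."""
    return sum(1 for _ in range(n))


def growth_exponent(cost_fn, sizes):
    """Least-squares slope of log(cost) against log(n)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(cost_fn(n)) for n in sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


sizes = [4, 8, 16, 32]  # small "cluster sizes" that are cheap to test
cubic_slope = growth_exponent(cubic_protocol_ops, sizes)
linear_slope = growth_exponent(linear_protocol_ops, sizes)

print(f"cubic step: n^{cubic_slope:.2f}, linear step: n^{linear_slope:.2f}")
```

An exponent well above 1 on small inputs is a warning sign: a cubic step that is unnoticeable on a handful of nodes costs a million times more when the cluster grows by 100x.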

Aspects: Topology

• Data consistency
• Scalability
• Topology (1%)
  – Systems have problems when deployed on some network topologies
    • Cross DC
    • Different racks
    • New layering architecture
  – Typically unseen in pre-deployment

Aspects: QoS

• Data consistency
• Scalability
• Topology
• QoS (1%)
  – Fundamental for multi-tenant systems
  – Two main points
    • Horizontal/intra-system QoS
    • Vertical/cross-system QoS

Overview of results

• Aspects (classical)
• Aspects (unique)
  – Data consistency, scalability, topology, QoS
• Hardware faults vs. Software faults
• Implications

HW faults vs. SW faults

“Hardware can fail, and reliability should come from software.”

HW faults and modes

• 299 issues involve improper handling of node fail-stop failures

• A memory card running at 25% of normal speed causes problems in an HBase deployment.

Hardware faults vs. Software faults

• Hardware failures, components and modes
• Software bug types

Software bug types: Logic

• Logic (29%)
  – Many domain-specific issues

Software bug types: Error handling

• Logic
• Error handling (18%)
  – Aspirator, Yuan et al. [OSDI ’14]

Software bug types: Optimization

• Logic
• Error handling
• Optimization (15%)

Software bug types: Configuration

• Logic
• Error handling
• Optimization
• Configuration (14%)

– Automating Configuration Troubleshooting. [OSDI ’10]

– Precomputing Possible Configuration Error Diagnoses. [ASE ’11]

– Do Not Blame Users for Misconfigurations. [SOSP ’13]

Software bug types: Race

• Race (12%)
  – < 50% local concurrency bugs
    • Buggy thread interleaving
    • Tons of work
  – > 50% distributed concurrency bugs
    • Reordering of messages, crashes, timeouts
    • More work is needed
    – SAMC [OSDI ’14]
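
To illustrate why message reordering alone can produce a distributed concurrency bug, the Python sketch below brute-forces every delivery order of three messages to a single replica and groups the orders by final state. This is a toy enumeration under assumed message semantics (`put` and `delete` on a key-value map); it is not SAMC and does not model crashes or timeouts.

```python
from itertools import permutations


def apply_messages(order):
    """Apply messages in the given order to an initially empty replica state."""
    state = {}
    for kind, key, value in order:
        if kind == "put":
            state[key] = value
        elif kind == "delete":
            state.pop(key, None)
    return state


# Hypothetical in-flight messages; the network may deliver them in any order.
messages = [("put", "x", 1), ("delete", "x", None), ("put", "y", 2)]

# Group delivery orders by the final state they produce.
outcomes = {}
for order in permutations(messages):
    final = tuple(sorted(apply_messages(order).items()))
    outcomes.setdefault(final, []).append(order)

for final, orders in outcomes.items():
    print(dict(final), "reached by", len(orders), "orderings")
```

More than one distinct final state means the result depends on delivery order. With only three messages there are already six orderings to check, and the count grows factorially, which is why exhaustive exploration needs the kind of reduction techniques that model checkers apply.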

Software bug types: Hang

• Hang (4%)
  – Classical deadlock
  – Un-served jobs, stalled operations, …
    • Root causes?
    • How to detect them?

Software bug types: Space

• Space (4%)
  – Big data + leak = Big leak
  – Clean-up operations must be flawless.

Software bug types: Load

• Load (4%)
  – Happens when systems face high request load
  – Relates to QoS and admission control
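
As a reminder of what admission control means here, the following Python sketch admits requests until a fixed number are pending and rejects the rest instead of letting the backlog grow without bound. The class name, the `max_pending` limit, and the request names are all illustrative assumptions, not taken from any of the studied systems.

```python
from collections import deque


class AdmissionController:
    """Admit requests until `max_pending` are queued, then reject new ones."""

    def __init__(self, max_pending):
        self.max_pending = max_pending
        self.pending = deque()
        self.rejected = 0

    def submit(self, request):
        # Reject early under overload so queued work stays bounded.
        if len(self.pending) >= self.max_pending:
            self.rejected += 1
            return False
        self.pending.append(request)
        return True

    def complete_one(self):
        # Finish the oldest pending request, freeing one slot.
        return self.pending.popleft() if self.pending else None


controller = AdmissionController(max_pending=3)
results = [controller.submit(f"req-{i}") for i in range(5)]
print(results, "rejected:", controller.rejected)
```

Once a pending request completes, a slot frees up and new requests are admitted again. A load bug, in this picture, is a code path that accepts work without any such bound.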

Overview of results

• Aspects (classical)
• Aspects (unique)
  – Data consistency, scalability, topology, QoS
• Hardware faults vs. Software faults
• Implications

Implications

• Failed operation (42%)
• Performance (23%)
• Downtimes (18%)
• Data loss (7%)
• Data corruption (5%)
• Data staleness (5%)

Root causes

Every implication can be caused by all kinds of hardware and software faults!

“Killer” bugs

• Bugs that simultaneously affect multiple nodes or even the entire cluster

• Single Point of Failure still exists in many forms
  – Positive feedback loop
  – Buggy failover
  – Repeated bugs after failover
  – …

Outline

• Introduction
• Methodology
• Overview of results
• Other CBS database use cases
• Conclusion

CBS database

• 50+ per-system and aggregate graphs from mining the CBS database over the last year

• Still more waiting to be studied…

Components with most issues

How should we enhance reliability for multiple cloud system interaction?

Cross-system issues are prevalent!

Most challenging types of issues

Top k% of most complicated issues

System evolution

Hadoop 2.0

Conclusion

• One of the largest bug studies for cloud systems

• Many interesting findings, but more questions can be raised from our analysis
  – What types of performance issues exist?
  – Root causes for hang issues?
  – …

• Cloud Bug Study (CBS) database.