What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems
Jeffry Adityatama, Kurnia J. Eliazar, Agung Laksono, Jeffrey F. Lukman, Vincentius Martin, and Anang D. Satria
Haryadi S. Gunawi, Mingzhe Hao, Tanakorn Leesatapornwongsa, Tiratat Patana-anake
Thanh Do
2
First, let’s ask Google
3
Cloud era
No Deep Root Causes…
4
What does the reliability research community do?
• Bug study
1. A Study of Linux File System Evolution. In FAST ’13.
2. A Comprehensive Study on Real World Concurrency Bug Characteristics. In ASPLOS ’08.
3. Precomputing Possible Configuration Error Diagnoses. In ASE ’11.
…
5
Open-source cloud software
• Publicly accessible bug repositories
6
Study to solve…
• What bugs “live” in the cloud?
• Are there new classes of bugs unique to cloud systems?
• How should cloud dependability tools evolve in the near future?
• Many other questions…
7
Cloud Bug Study (CBS)
• 6 systems: Hadoop MapReduce, HDFS, HBase, Cassandra, ZooKeeper, and Flume
• 11 people, 1-year study
• Issues in a 3-year window: Jan 2011 to Jan 2014
• ~21000 issues reviewed
• ~3600 “vital” issues studied in depth
• Cloud Bug Study (CBS) database
8
Classifications
• Aspects – Reliability, performance, availability, security, consistency, scalability, topology, QoS
• Hardware failures – Types of hardware and types of hardware failures
• Software bug types – Logic, error handling, optimization, config, race, hang, space, load
• Implications – Failed operation, performance, component downtime, data loss, data staleness, data corruption
• ~25000 annotations in total, about 7 annotations per issue
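One way to picture a CBS entry is as a record that carries tags from each classification listed above. The following is only a minimal sketch, not the actual CBS database schema: the class `AnnotatedIssue`, its field names, and the issue ID `EXAMPLE-1` are hypothetical, and only the tag vocabularies come from the slide (hardware tags are left out because the slide does not enumerate them).

```python
from dataclasses import dataclass, field

# Tag vocabularies copied from the slide; everything else is illustrative.
ASPECTS = {"reliability", "performance", "availability", "security",
           "consistency", "scalability", "topology", "qos"}
BUG_TYPES = {"logic", "error handling", "optimization", "config",
             "race", "hang", "space", "load"}
IMPLICATIONS = {"failed operation", "performance", "component downtime",
                "data loss", "data staleness", "data corruption"}


@dataclass
class AnnotatedIssue:
    """Hypothetical record for one vital issue and its annotations."""
    issue_id: str
    aspects: set = field(default_factory=set)
    bug_types: set = field(default_factory=set)
    implications: set = field(default_factory=set)

    def validate(self) -> None:
        # Reject any tag that is not in the vocabularies listed on the slide.
        for tags, allowed in ((self.aspects, ASPECTS),
                              (self.bug_types, BUG_TYPES),
                              (self.implications, IMPLICATIONS)):
            unknown = tags - allowed
            if unknown:
                raise ValueError(f"unknown tags: {sorted(unknown)}")

    def annotation_count(self) -> int:
        # Total number of tags attached to this issue.
        return len(self.aspects) + len(self.bug_types) + len(self.implications)


# Made-up example issue with one tag per classification.
issue = AnnotatedIssue("EXAMPLE-1", {"reliability"}, {"race"}, {"data loss"})
issue.validate()
```

Keeping the vocabularies as closed sets makes it cheap to catch mistyped tags before counting annotations across issues.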
9
Cloud Bug Study (CBS) database
• Open to public
10
Outline
• Introduction
• Methodology
• Overview of results
• Other CBS database use cases
• Conclusion
11
Methodology
• 6 systems, 3-year span, 2011 to 2014
• 20~30 bugs a day! Protein yeah!
• 17% “vital” issues affecting real deployments
• 3655 vital issues
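As a quick sanity check on the figures above, the 17% share can be recomputed from the two counts. This is a sketch only: the reviewed total is the approximate ~21000 quoted earlier in the talk, so the result is itself approximate, and the variable names are illustrative.

```python
reviewed = 21000  # approximate number of issues reviewed (~21000 in the slides)
vital = 3655      # vital issues reported on this slide

# Share of reviewed issues that were classified as vital.
vital_fraction = vital / reviewed
print(f"{vital_fraction:.1%}")
```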
12
Example issue
Title
Type & Priority
Description
Time to resolve
Discussion
13
Outline
• Introduction
• Methodology
• Overview of results
• Other CBS database use cases
• Conclusion
14
Classifications for each vital issue
• Aspects
• Hardware types and failure modes
• Software bug types
• Implications
• Bug scopes
15
Overview of results
• Aspects
• Hardware faults vs. Software faults
• Implications
16
Aspects
• CS = Cassandra
• FL = Flume
• HB = HBase
• HD = HDFS
• MR = MapReduce
• ZK = ZooKeeper
17
Aspects: Reliability
• Reliability (45%)
– Operation & job failures/errors, data loss/corruption/staleness
18
Aspects: Performance
• Reliability
• Performance (22%)
19
Aspects: Availability
• Reliability
• Performance
• Availability (16%)
– Node and cluster downtime
20
Aspects: Security
• Reliability
• Performance
• Availability
• Security (6%)
21
Overview of results
• Aspects (classical)
• Aspects – Data consistency, scalability, topology, QoS
• Hardware faults vs. Software faults
• Implications
22
Aspects: Data consistency
• Data consistency (5%)
– Permanently inconsistent replicas
– Various root causes:
• Buggy operational protocol
• Concurrency bugs and node failures
23
Cassandra cross-DC synchronization
[Diagram: replica pairs A’/A, B’/B, C’/C across data centers]
Background operational protocols often buggy!
[Diagram: after synchronization, A’/A’, B’/B’, and C’ alone: permanent inconsistency]
24
Aspects: Scalability
• Data consistency
• Scalability (2%)
– Small number does not mean not important!
– Only found at scale
• Large cluster size
• Large data
• Large load
• Large failures
25
Large cluster
• In Cassandra
O(n³) calculation
Ring position changed.
100x
CPU explosion
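To see why a cubic calculation only hurts at scale, the sketch below counts the steps of a generic triple-nested loop over n nodes. It is not Cassandra's code; the function name `cubic_steps` and the node counts are assumptions chosen for illustration.

```python
def cubic_steps(n: int) -> int:
    """Count the iterations of a generic triple-nested loop over n nodes."""
    steps = 0
    for _ in range(n):
        for _ in range(n):
            for _ in range(n):
                steps += 1
    return steps


small = cubic_steps(10)    # a test-sized cluster
large = cubic_steps(100)   # a cluster ten times larger
print(large // small)      # growth factor in work for 10x more nodes
```

A tenfold increase in nodes multiplies the work a thousandfold, which is why such a bug can stay invisible on small test clusters.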
26
Large data
In HBase
Tens of minutes
[Diagram: regions R1, R2, R3, …, R100K]
Insufficient lookup operation
27
Large load
In HDFS: 1000x small files in parallel
Not expecting small files!
28
Large failure
Time cost: 7+ hours
AM managing 16,000 tasks fails
[Diagram: tasks 1, 2, 3, … through 16K]
Un-optimized connection
29
From above examples…
• Protocol algorithms must anticipate
– Large cluster sizes
– Large data
– Large request load of various kinds
– Large-scale failures
• The need for scalability bug detection tools
30
Aspects: Topology
• Data consistency
• Scalability
• Topology (1%)
– Systems have problems when deployed on some network topologies
• Cross DC
• Different racks
• New layering architecture
– Typically unseen in pre-deployment
31
Aspects: QoS
• Data consistency
• Scalability
• Topology
• QoS (1%)
– Fundamental for multi-tenant systems
– Two main points
• Horizontal/intra-system QoS
• Vertical/cross-system QoS
32
Overview of results
• Aspects (classical)
• Aspects (unique) – Data consistency, scalability, topology, QoS
• Hardware faults vs. Software faults
• Implications
33
HW faults vs. SW faults
“Hardware can fail, and reliability should come from software.”
34
HW faults and modes
• 299 issues involve improper handling of node fail-stop failures
• A memory card running at 25% of normal speed causes problems in an HBase deployment.
35
Hardware faults vs. Software faults
• Hardware failures, components and modes
• Software bug types
36
Software bug types: Logic
• Logic (29%)
– Many domain-specific issues
37
Software bug types: Error handling
• Logic
• Error handling (18%)
– Aspirator, Yuan et al. [OSDI ’14]
38
Software bug types: Optimization
• Logic
• Error handling
• Optimization (15%)
39
Software bug types: Configuration
• Logic
• Error handling
• Optimization
• Configuration (14%)
– Automating Configuration Troubleshooting. [OSDI ’10]
– Precomputing Possible Configuration Error Diagnoses. [ASE ’11]
– Do Not Blame Users for Misconfigurations. [SOSP ’13]
40
Software bug types: Race
• Race (12%)
– < 50% local concurrency bugs
• Buggy thread interleaving
• Tons of work
– > 50% distributed concurrency bugs
• Reordering of messages, crashes, timeouts
• More work is needed
– SAMC [OSDI ’14]
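The message-reordering point can be made concrete with a toy example. The sketch below is not SAMC or any real system's protocol: the key-value messages and the function `apply_in_order` are hypothetical, and the code simply enumerates every delivery order of three messages to show that the final state depends on the order.

```python
from itertools import permutations


def apply_in_order(messages):
    """Apply key-value messages to an empty store in the given delivery order."""
    state = {}
    for op, key, value in messages:
        if op == "put":
            state[key] = value
        elif op == "delete":
            state.pop(key, None)
    return state


# Three made-up messages that a replica might receive in any order.
messages = [("put", "k", 1), ("delete", "k", None), ("put", "k", 2)]

# Collect the distinct final states over all delivery orders.
outcomes = {
    tuple(sorted(apply_in_order(order).items()))
    for order in permutations(messages)
}
print(len(outcomes))
```

Two replicas that see these messages in different orders can therefore end up disagreeing, and crashes or timeouts add further interleavings on top of the reorderings.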
41
Software bug types: Hang
• Hang (4%)
– Classical deadlock
– Un-served jobs, stalled operations, …
• Root causes?
• How to detect them?
42
Software bug types: Space
• Space (4%)
– Big data + leak = Big leak
– Clean-up operations must be flawless.
43
Software bug types: Load
• Load (4%)
– Happens when systems face high request load
– Relates to QoS and admission control
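Admission control, mentioned above, can be sketched as a bounded queue that refuses work once it is full instead of letting load pile up. This is a minimal illustration under assumed names (`AdmissionController`, `submit`, `serve`), not the mechanism of any of the studied systems.

```python
from collections import deque


class AdmissionController:
    """Hypothetical bounded request queue that rejects work when full."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.queue = deque()

    def submit(self, request) -> bool:
        # Reject the request instead of queueing without bound.
        if len(self.queue) >= self.capacity:
            return False
        self.queue.append(request)
        return True

    def serve(self):
        # Hand out the oldest admitted request, or None if the queue is empty.
        return self.queue.popleft() if self.queue else None


ctrl = AdmissionController(capacity=2)
results = [ctrl.submit(r) for r in ("r1", "r2", "r3")]
print(results)
```

Rejecting early keeps memory and latency bounded under a burst; which requests to reject is the QoS policy question.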
44
Overview of results
• Aspects (classical)
• Aspects (unique) – Data consistency, scalability, topology, QoS
• Hardware faults vs. Software faults
• Implications
45
Implications
• Failed operation (42%)
• Performance (23%)
• Downtimes (18%)
• Data loss (7%)
• Data corruption (5%)
• Data staleness (5%)
46
Root causes
Every implication can be caused by all kinds of hardware and software faults!
47
“Killer” bugs
• Bugs that simultaneously affect multiple nodes or even the entire cluster
• Single Point of Failure still exists in many forms
– Positive feedback loop
– Buggy failover
– Repeated bugs after failover
– …
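A positive feedback loop can be illustrated with a toy load-redistribution model: when an overloaded node fails, its load moves to the survivors, which may overload them in turn. The function `surviving_nodes`, the even load split, and the numbers are all assumptions for illustration, not a model taken from the study.

```python
def surviving_nodes(nodes: int, capacity: float, total_load: float) -> int:
    """Fail one node at a time while the per-node load exceeds capacity."""
    alive = nodes
    while alive > 0 and total_load / alive > capacity:
        alive -= 1  # an overloaded node fails; its load shifts to the rest
    return alive


print(surviving_nodes(10, 100, 900))   # load fits: every node stays up
print(surviving_nodes(10, 100, 1050))  # slight overload: the cascade never stops
```

In this model a 5% overload is enough to take down the whole cluster, because each failure makes the remaining nodes more overloaded than before.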
48
Outline
• Introduction
• Methodology
• Overview of results
• Other CBS database use cases
• Conclusion
49
CBS database
• 50+ per-system and aggregate graphs from mining the CBS database over the past year
• Still more waiting to be studied…
50
Components with most issues
How should we enhance reliability for multiple cloud system interaction?
Cross-system issues are prevalent!
51
Most challenging types of issues
52
Top k% of most complicated issues
53
System evolution
Hadoop 2.0
54
Conclude
• One of the largest bug studies for cloud systems
• Many interesting findings, but more questions can be raised from our analysis
– What types of performance issues exist?
– Root causes for hang issues?
– …
• Cloud Bug Study (CBS) database.
55
Thank you!
http://ucare.cs.uchicago.edu/