Post on 15-Jan-2016
Computer Science, Rutgers University, Dark and Panic Lab
Evaluating the Impact of Communication Architecture on Performability of Cluster-Based Services
Kiran Nagaraja, Neeraj Krishnan, Ricardo Bianchini, Richard P. Martin, Thu D. Nguyen
Rutgers University
Motivation
Network services often use clusters of commodity components
  Various design choices, including communication architecture
Numerous performance studies exist
  TCP is perceived to be more robust
Performance vs. availability tradeoff is not well understood
[Figure: Web Server Performance. Throughput in requests/sec (axis 0 to 8000) by communication architecture: TCP vs. VIA-RDMA-0-COPY]
Our Study
Evaluate the impact of two communication architectures on service performance and availability in the presence of faults
  TCP vs. VIA
  Kernel-level vs. user-level communication
  Mature vs. new technology
  Differ in fault model
Quantify performability (performance and availability)
  Study systems under various fault scenarios
  Sensitivity to fault rates and fault classes
Case study: high-performance cluster-based Web server
  Understand the tradeoff between high-performance and high-availability design choices
Computing Average Availability
Assumptions: faults are non-overlapping and independent
Parameters: MTTF, MTTR
  Sources: [Sullivan91, Chillarege95, Iyer99, Talagala99, Trivedi00, Heath02]
Measure throughput under a single fault
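As a rough illustration of how MTTF and MTTR can feed an availability figure, the sketch below assumes the standard steady-state relation in which a fault class is in effect for a fraction MTTR / (MTTF + MTTR) of the time. Because faults are assumed non-overlapping and independent, it simply sums those fractions across classes. The fault classes and numbers are hypothetical, not values from the study, and the sketch treats every fault as a full outage rather than using measured throughput under the fault.

```python
from dataclasses import dataclass


@dataclass
class FaultClass:
    name: str
    mttf_hours: float  # mean time to failure
    mttr_hours: float  # mean time to repair


def fault_fraction(fc: FaultClass) -> float:
    """Fraction of time this fault class is in effect (steady state)."""
    return fc.mttr_hours / (fc.mttf_hours + fc.mttr_hours)


def average_availability(fault_classes) -> float:
    """1 minus the summed fault fractions; valid only under the
    non-overlapping, independent-faults assumption."""
    return 1.0 - sum(fault_fraction(fc) for fc in fault_classes)


# Hypothetical fault load, for illustration only.
faults = [
    FaultClass("link down", mttf_hours=180 * 24.0, mttr_hours=0.05),
    FaultClass("application crash", mttf_hours=30 * 24.0, mttr_hours=0.05),
]
availability = average_availability(faults)
print(f"average availability: {availability:.6f}")
```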
Effect of a Single Fault: Seven-Stage Model
The stages map the behavior of the system under a single fault
  Not every stage occurs for every fault
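One way to read the stage model: each stage of a fault has a duration and a throughput level, and the cost of the fault is the throughput lost across its stages. The sketch below is an assumption about how such a model could be evaluated, not the authors' implementation. The stage descriptions, durations, and throughputs are hypothetical, and stages that do not occur are simply left out of the list.

```python
def lost_requests(normal_tput, stages):
    """Requests lost during one fault.

    stages is a list of (duration_seconds, throughput) pairs, one per
    stage that actually occurs.
    """
    return sum(duration * (normal_tput - tput) for duration, tput in stages)


def average_throughput(normal_tput, mttf_s, stages):
    """Long-run throughput if the fault recurs about once every mttf_s
    seconds (assumes the fault's duration is small relative to mttf_s)."""
    return normal_tput - lost_requests(normal_tput, stages) / mttf_s


# Hypothetical single fault: three stages occur, the rest are skipped.
normal = 5000.0                   # requests/sec under normal execution
stages = [
    (15.0, 0.0),                  # fault not yet detected, service stalled
    (60.0, 3750.0),               # running degraded
    (10.0, 0.0),                  # reset back to normal operation
]
mttf = 30 * 24 * 3600.0           # one fault per month, in seconds

avg = average_throughput(normal, mttf, stages)
print(f"average throughput: {avg:.4f} requests/sec")
print(f"average availability: {avg / normal:.8f}")
```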
Performability Metric
Performability (P) = Tn x log(AI) / log(AA)
  Tn is the normal-performance component; log(AI) / log(AA) is the penalty component
Tn: throughput under normal execution
AI: availability of an “ideal” system, e.g., 0.99999
AA: average availability
Log scale allows linearization with unavailability and reduces the range
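The metric can be computed directly from its definition, P = Tn x log(AI) / log(AA). The sketch below is a minimal illustration: the function name and the example throughput and availability values are assumptions, and only the default ideal availability of 0.99999 comes from the slide.

```python
import math


def performability(normal_tput, avg_availability, ideal_availability=0.99999):
    """P = Tn * log(AI) / log(AA), with both availabilities in (0, 1)."""
    if not (0.0 < avg_availability < 1.0 and 0.0 < ideal_availability < 1.0):
        raise ValueError("availabilities must be strictly between 0 and 1")
    return normal_tput * math.log(ideal_availability) / math.log(avg_availability)


# Hypothetical system: 5000 requests/sec, "three nines" of availability.
p = performability(5000.0, 0.999)
print(f"performability: {p:.2f}")
```

For availabilities close to 1, log(A) is approximately -(1 - A), so the penalty term is roughly the ratio of ideal to actual unavailability. A system that reaches the ideal availability keeps its full normal throughput as its score.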
Case Study: PRESS Web Server
Cluster-based, locality-conscious Web server
  Serves requests out of a globally coordinated memory pool
Several versions developed over time
  Differ in performance and fault tolerance
Internal communication architecture
  TCP versions: TCP-PRESS, TCP-PRESS-HB
  VIA versions: VIA-PRESS-0, VIA-PRESS-3, VIA-PRESS-5
  Names consistent with previous performance study [HPCA02]
PRESS Versions Comparison
PRESS version | Description | Fault detection
TCP-PRESS | Base version | Connection based
TCP-PRESS-HB | Periodic heartbeats |
VIA-PRESS-0 | Base version | Connection based
VIA-PRESS-3 | RDMA for communication | Same
VIA-PRESS-5 | RDMA and zero-copy (dynamic pinning) | Same
General protocol characteristics
TCP (TCP-PRESS, TCP-PRESS-HB) assumes: very few permanent hardware faults; transient faults are common
  Robust to transient faults
  OK to lose packets
VIA (VIA-PRESS-0, VIA-PRESS-3, VIA-PRESS-5) assumes: faults indicate serious problems
  Fail-stop model
  Lost packets are bad
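Heartbeat-based fault detection, as opposed to waiting for a connection to break, can be sketched as a small timeout monitor. This is a minimal illustration and not PRESS code: the class name, the heartbeat interval, and the missed-beat threshold are all assumptions. Time is passed in explicitly so the example is deterministic.

```python
class HeartbeatMonitor:
    """Suspects a peer once it has been silent for several intervals."""

    def __init__(self, interval_s=5.0, max_missed=3):
        self.interval_s = interval_s    # expected gap between heartbeats
        self.max_missed = max_missed    # missed beats tolerated
        self.last_seen = {}             # peer name -> time of last heartbeat

    def record(self, peer, now):
        """Note that a heartbeat from peer arrived at time now."""
        self.last_seen[peer] = now

    def suspected(self, now):
        """Peers silent for longer than interval_s * max_missed."""
        deadline = self.interval_s * self.max_missed
        return sorted(p for p, t in self.last_seen.items() if now - t > deadline)


monitor = HeartbeatMonitor(interval_s=5.0, max_missed=3)
monitor.record("node1", now=0.0)
monitor.record("node2", now=0.0)
monitor.record("node1", now=12.0)   # node2 has stopped sending
print(monitor.suspected(now=16.0))
```

The threshold trades detection latency against false suspicion: a lower `max_missed` reacts faster but is more likely to flag a peer that is merely slow.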
Single-Fault Experiments
Setup: 4-PC cluster running at 90% load
  800 MHz, 2 SCSI disks, 1 Gbps network
  TCP over VIA, bare VIA
  4 client nodes make HTTP requests
  Rutgers trace, Poisson arrival process
Fault set
  Link down, switch down
  OS: memory exhaustion, no pinnable memory
  Null pointer, off-by-N pointer value, off-by-N size [Sullivan91]
  Application crash, hang
  Node crash, freeze
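A Poisson arrival process for client requests can be generated by drawing exponentially distributed gaps between requests. The sketch below is a generic illustration of that technique, not the load generator used in the study. The function name, the request rate, and the duration are assumptions.

```python
import random


def poisson_arrival_times(rate_per_s, duration_s, rng):
    """Arrival times in [0, duration_s) for a Poisson process.

    Inter-arrival gaps are exponential with mean 1 / rate_per_s.
    """
    times = []
    t = rng.expovariate(rate_per_s)
    while t < duration_s:
        times.append(t)
        t += rng.expovariate(rate_per_s)
    return times


rng = random.Random(42)  # fixed seed so runs are repeatable
arrivals = poisson_arrival_times(rate_per_s=1000.0, duration_s=10.0, rng=rng)
print(f"{len(arrivals)} requests in 10 s")
```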
Single-Fault Results
[Figure: link down]
Performance
VIA-based communication enables higher performance
  Low latency, less software overhead
[Figure: Performance Comparison. Throughput in requests/sec (axis 0 to 8000) by PRESS version: TCP, TCP-HB, VIA-0, VIA-3, VIA-5]
Performability Results
Identical fault load for all versions
  Application fault rate: 1/month
All VIA versions do better than TCP
[Figure: Unavailability (axis 0 to 0.004) and performability (axis 0 to 30) by PRESS version: TCP, TCP-HB, VIA-0, VIA-3, VIA-5. Legend: internal link, internal switch, node crash, node freeze, os-mem-no-locking, os-sk-buf-no-mem, application crash, application hang, app-nullpointer, app-offbyNpointer, app-offbyNsize, Performability]
TCP vs. VIA: Program Robustness
VIA application fault rates: 1/day, 1/week, 1/month
  Programming complexity
TCP application fault rate: 1/month
[Figure: Program Robustness. Performability (axis 0 to 30) by PRESS version: TCP and TCP-HB, followed by VIA-0, VIA-3, VIA-5 repeated three times; a crossover point is marked]
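A sensitivity sweep of this kind can be sketched by recomputing performability while only the application fault rate changes. Everything numeric below is hypothetical (throughputs, baseline unavailability, outage length per fault), and the availability model is a simplifying assumption that charges each application fault a fixed full outage. The point is only to show how a crossover can appear as the faster system's fault rate rises.

```python
import math

IDEAL = 0.99999  # availability of the "ideal" system


def performability(tn, availability):
    """P = Tn * log(AI) / log(AA)."""
    return tn * math.log(IDEAL) / math.log(availability)


def availability(base_unavail, faults_per_day, outage_s):
    """Assumed model: unavailability from other faults, plus the fraction
    of each day lost to application faults."""
    return 1.0 - (base_unavail + faults_per_day * outage_s / 86400.0)


# Hypothetical slower system with a fixed application fault rate of 1/month.
tcp = performability(5000.0, availability(0.002, 1 / 30, 120.0))

# Hypothetical faster system swept across three application fault rates.
rates = {"1/month": 1 / 30, "1/week": 1 / 7, "1/day": 1.0}
results = {
    label: performability(7000.0, availability(0.002, rate, 120.0))
    for label, rate in rates.items()
}

for label, via in results.items():
    verdict = "faster system ahead" if via > tcp else "slower system ahead"
    print(f"{label}: {via:.1f} vs {tcp:.1f} ({verdict})")
```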
VIA under Stressful Fault Load
Additional fault load
  Transient packet drops: 1/month; system failure: 1/month
  Application fault rate raised to 2/month
TCP-HB performs slightly better than two of the VIA versions
[Figure: Performability (axis 0 to 20) by PRESS version: TCP, TCP-HB, VIA-0, VIA-3, VIA-5]
Observations – Cluster Communication
Match the fault model of the network stack to the fabric
  Non-fatal behavior on transient faults
    TCP is robust to packet drops
  Fail-stop behavior on permanent faults
Protocol-level fault avoidance
  Preserve message boundaries
  Reduce number of copies
  Pre-allocate communication resources
Explicit fault reporting by all components in the “path”
  End-to-end is necessary, but may not be sufficient
  Reduces detection latency
  Allows more accurate recovery actions
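On a byte-stream transport such as TCP, message boundaries are not preserved by the protocol, so the application has to delimit messages itself. Length-prefix framing is one common way to do that. The sketch below illustrates the general technique and is not taken from PRESS: the function names and the 4-byte header format are assumptions.

```python
import struct

HEADER = struct.Struct("!I")  # 4-byte big-endian length prefix


def frame(payload: bytes) -> bytes:
    """Prefix a message with its length so the receiver can find its end."""
    return HEADER.pack(len(payload)) + payload


def unframe(buffer: bytes):
    """Split received bytes into complete messages plus leftover bytes."""
    messages = []
    offset = 0
    while len(buffer) - offset >= HEADER.size:
        (length,) = HEADER.unpack_from(buffer, offset)
        end = offset + HEADER.size + length
        if end > len(buffer):
            break  # message not fully received yet
        messages.append(buffer[offset + HEADER.size:end])
        offset = end
    return messages, buffer[offset:]


# Two messages arrive split at an arbitrary byte position.
stream = frame(b"GET /a") + frame(b"GET /b")
first, rest = unframe(stream[:13])
second, leftover = unframe(rest + stream[13:])
print(first, second, leftover)
```

A message-based interface removes this reassembly step from the application, which is one reading of why preserved message boundaries count as protocol-level fault avoidance.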
Related Work
Impact of faults on systems
  Robustness and availability studies
Protocol performance studies
  Congestion avoidance and control in WAN
  Back-off based algorithms
Interconnects in cluster environment
  SAN context: packet drops indicate serious failures
  Evidence of faults due to immature technology
  Fault-tolerant interconnects: Myrinet
Summary & Conclusion
Studied the impact of communication architecture on service performability
  Surprisingly, VIA versions delivered better availability
Comparison under varying fault loads
  Evaluated architecture maturity and complexity
Desirable cluster-based protocol characteristics
  Messaging, single-copy transfers, pre-allocated resources
Thank you. Questions?
http://dark-panic.rutgers.edu/Research/vivo