Evaluating the Impact of Communication Architecture on Performability of Cluster-Based Services
Kiran Nagaraja, Neeraj Krishnan, Ricardo Bianchini, Richard P Martin, Thu D Nguyen
Dark and Panic Lab, Computer Science, Rutgers University

Post on 15-Jan-2016


TRANSCRIPT

Page 1: Dark and Panic Lab, Computer Science, Rutgers University

Evaluating the Impact of Communication Architecture on Performability of Cluster-Based Services

Kiran Nagaraja, Neeraj Krishnan, Ricardo Bianchini, Richard P Martin, Thu D Nguyen

Rutgers University

Page 2: Motivation

Network services often use clusters of commodity components
  Various design choices, incl. communication architecture

Numerous performance studies
  TCP is perceived to be more robust

Performance vs. availability tradeoff not well understood

[Figure: Web Server Performance. Requests/sec (axis 0 to 8000) by communication architecture: TCP vs. VIA-RDMA-0-COPY]

Page 3: Our Study

Evaluate impact of 2 different communication architectures on service performance and availability in presence of faults

TCP vs. VIA
  Kernel-level comm. vs. user-level
  Mature vs. new technology
  Differ in fault-model

Quantify performability (performance and availability)

Study systems under various fault scenarios
  Sensitivity to fault rates and fault classes

Case study: High performance cluster-based Web server
  Understand tradeoff between high performance and high availability design choices

Page 4: Computing Average Availability

Assumptions: Faults are non-overlapping and independent

Parameters: MTTF, MTTR
  Sources: [Sullivan91, Chillarege95, Iyer99, Talagala99, Trivedi00, Heath02]

Measure throughput under single fault
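
One way these inputs could be combined into a single number is sketched below. It is an illustration under the slide's non-overlapping assumption, not necessarily the authors' exact procedure: each fault class is weighted by the fraction of time the system spends handling it (degraded time per fault divided by MTTF), and average availability is the resulting average throughput divided by normal throughput. The `FaultClass` type, the class names, and every number are placeholders.

```python
from dataclasses import dataclass


@dataclass
class FaultClass:
    name: str          # hypothetical label
    mttf_s: float      # mean time to failure, seconds
    degraded_s: float  # time per fault spent away from normal operation, seconds
    throughput: float  # average throughput while degraded, requests/sec


def average_availability(normal_throughput, fault_classes):
    """Average throughput under the fault load, as a fraction of normal throughput."""
    # Fraction of time spent in each fault class (valid only if faults never overlap).
    weights = [f.degraded_s / f.mttf_s for f in fault_classes]
    if sum(weights) >= 1.0:
        raise ValueError("fault load too high for the non-overlapping assumption")
    avg = (1.0 - sum(weights)) * normal_throughput
    avg += sum(w * f.throughput for w, f in zip(weights, fault_classes))
    return avg / normal_throughput


# Placeholder numbers, not measurements from the study.
faults = [
    FaultClass("link down", mttf_s=180 * 86400, degraded_s=180, throughput=3000.0),
    FaultClass("node crash", mttf_s=14 * 86400, degraded_s=240, throughput=4500.0),
]
aa = average_availability(6000.0, faults)
print(f"average availability = {aa:.6f}")
```

The measured throughput under each single fault supplies the `throughput` and `degraded_s` fields; MTTF comes from the cited fault-rate sources.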

Page 5: Effect of Single Fault: Seven Stage Model

Various phases map the behavior of the system under a single fault
  Not all phases may be necessary
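
To make the stage idea concrete, a fault's effect can be summarised as a list of (duration, throughput) pairs, one per stage, from which the total degraded time and the mean throughput during the fault follow directly. This is only a sketch: the helper name `summarize_fault` is hypothetical and the stage values are invented. A stage that does not occur is given zero duration.

```python
def summarize_fault(stages):
    """Return (total_duration_s, mean_throughput) for one fault.

    `stages` is a sequence of (duration_s, throughput) pairs, in order.
    """
    total = sum(d for d, _ in stages)
    if total == 0:
        return 0.0, 0.0
    served = sum(d * t for d, t in stages)  # requests served across all stages
    return total, served / total


# Seven placeholder stages; zero-duration entries model stages that do not occur.
stages = [(15, 0.0), (60, 4500.0), (0, 0.0), (120, 4500.0),
          (30, 2000.0), (0, 0.0), (20, 5500.0)]
duration, mean_tput = summarize_fault(stages)
print(duration, round(mean_tput, 1))
```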

Page 6: Performability Metric

$$P = T_n \times \frac{\log(A_I)}{\log(A_A)}$$

$T_n$: throughput under normal execution (normal performance)
$A_I$: availability of “Ideal” system, e.g., 0.99999
$A_A$: average availability
$\log(A_I) / \log(A_A)$: penalty component
Log scale allows linearization with unavailability and reduces the range
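
The metric is simple enough to evaluate directly. The sketch below implements the formula as stated on the slide; the function name, the default ideal availability of 0.99999 (taken from the slide's example), and the input values are illustrative choices.

```python
import math


def performability(tn, aa, ai=0.99999):
    """P = Tn * log(AI) / log(AA), following the slide's definition."""
    if not (0.0 < aa < 1.0 and 0.0 < ai < 1.0):
        raise ValueError("availabilities must lie strictly between 0 and 1")
    return tn * math.log(ai) / math.log(aa)


# Placeholder inputs: 6000 requests/sec at three hypothetical availabilities.
for aa in (0.99999, 0.9999, 0.999):
    print(aa, round(performability(6000.0, aa), 2))
```

When average availability equals the ideal, the penalty component is 1 and P equals the normal throughput. Each tenfold increase in unavailability divides P by roughly ten, which is the linearization the slide refers to.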

Page 7: Case Study: PRESS Web Server

Cluster-based, locality-conscious web server
  Serve requests out of globally coordinated memory pool

Several versions developed over time
  Differ in performance and fault-tolerance

Internal communication architecture
  TCP versions: TCP-PRESS, TCP-PRESS-HB
  VIA versions: VIA-PRESS-0, VIA-PRESS-3, VIA-PRESS-5
  Names consistent with previous performance study [HPCA02]

Page 8: PRESS Versions Comparison

PRESS Version | Description | Fault Detection
TCP-PRESS | Base version | Connection based
TCP-PRESS-HB | | Periodic heartbeats
VIA-PRESS-0 | Base version | Connection based
VIA-PRESS-3 | RDMA for comm. | Same
VIA-PRESS-5 | RDMA and Zero-copy (Dynamic pinning) | Same

General Protocol Characteristics

TCP (TCP-PRESS, TCP-PRESS-HB)
  Assumes: Very few h/w permanent faults, transient faults are common
  Robust to transient faults
  OK to lose packets

VIA (VIA-PRESS-0, VIA-PRESS-3, VIA-PRESS-5)
  Assumes: Faults indicate serious problems
  Fail-stop model
  Lost packets are bad

Page 9: Single-Fault Experiments

Setup: 4-PC cluster running at 90% load
  800 MHz, 2 SCSI disks, 1 Gbps network
  TCP over VIA, bare VIA
  4 client nodes make HTTP requests
    Rutgers trace
    Poisson arrival process

Fault Set
  Link down, switch down
  OS - memory exhaustion, OS - no pin-able memory
  Null pointer, off-by-N pointer value, off-by-N size [Sullivan91]
  Application crash, hang
  Node crash, freeze
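
The slide does not say how the client load was generated beyond naming a Poisson arrival process. As a reminder of what that implies, the sketch below draws exponentially distributed gaps between requests, which yields Poisson arrivals at the chosen mean rate. The function name, the rate, and the duration are placeholders, not values from the study.

```python
import random


def poisson_arrival_times(rate_per_s, duration_s, seed=None):
    """Arrival timestamps of a Poisson process with the given mean rate."""
    rng = random.Random(seed)
    t, times = 0.0, []
    while True:
        t += rng.expovariate(rate_per_s)  # exponential inter-arrival gap
        if t >= duration_s:
            return times
        times.append(t)


arrivals = poisson_arrival_times(rate_per_s=1000.0, duration_s=10.0, seed=1)
print(len(arrivals), "requests scheduled")
```

A load generator would replay trace requests at these timestamps; scaling `rate_per_s` is one way to hold the cluster at a target fraction of its peak throughput.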

Page 10: Single-Fault Results

Link down

Page 11: Performance

VIA-based communication enables higher performance
  Low latency, less software overhead

[Figure: Performance Comparison. Requests/sec (axis 0 to 8000) by PRESS version: TCP, TCP-HB, VIA-0, VIA-3, VIA-5]

Page 12: Performability Results

Identical fault load for all versions
  Application fault rate 1/month

All versions of VIA do better than TCP

[Figure: Unavailability (axis 0 to 0.004) and Performability (axis 0 to 30) by PRESS version: TCP, TCP-HB, VIA-0, VIA-3, VIA-5. Legend: internal link, internal switch, node crash, node freeze, os-mem-no-locking, os-sk-buf-no-mem, application crash, application hang, app-nullpointer, app-offbyNpointer, app-offbyNsize, Performability]

Page 13: TCP vs. VIA: Program Robustness

VIA application fault rates 1/day, 1/week, 1/month
  Programming complexity

TCP application fault rate 1/month

[Figure: Program Robustness. Performability (axis 0 to 30) by PRESS version: TCP, TCP-HB, and VIA-0, VIA-3, VIA-5 repeated for each of the three VIA application fault rates. Annotation: Cross over point]
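
The cross-over idea can be illustrated with a toy sweep: hold the TCP application fault rate at 1/month, raise the VIA rate, and see where the faster system's performability falls below the slower one's. The model below is a deliberate simplification (every second lost to a fault counts as fully unavailable, and faults never overlap), and all throughputs and per-fault losses are invented, so the numbers it prints say nothing about the measured results.

```python
import math

IDEAL = 0.99999  # AI from the performability definition


def performability(tn, faults_per_day, lost_s_per_fault):
    # Unavailability = fraction of time lost to faults (assumes they never overlap).
    unavailability = faults_per_day * lost_s_per_fault / 86400.0
    return tn * math.log(IDEAL) / math.log(1.0 - unavailability)


# All throughputs and per-fault losses are invented for illustration.
tcp = performability(tn=5000.0, faults_per_day=1 / 30, lost_s_per_fault=60.0)
for label, rate in (("1/month", 1 / 30), ("1/week", 1 / 7), ("1/day", 1.0)):
    via = performability(tn=7000.0, faults_per_day=rate, lost_s_per_fault=60.0)
    print(f"VIA at {label}: {via:.2f} (TCP at 1/month: {tcp:.2f})")
```

With these placeholder inputs the higher-throughput system wins at equal fault rates and loses once its faults become weekly, which is the kind of cross-over the chart annotates.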

Page 14: VIA under Stressful Fault Load

Additional fault load
  Transient packet drops 1/month, system failure 1/month
  Application faults -> 2/month

TCP-HB performs slightly better than 2 VIA versions

[Figure: Performability (axis 0 to 20) by PRESS version: TCP, TCP-HB, VIA-0, VIA-3, VIA-5]

Page 15: Observations – Cluster Communication

Match fault-model of network stack to fabric
  Non-fatal behavior on transient faults
    TCP is robust to packet drops
  Fail-stop behavior on permanent faults

Protocol level fault-avoidance
  Preserve message boundaries
  Reduce number of copies
  Pre-allocate communication resources

Explicit fault reporting by all components in “path”
  End-to-End necessary, but may not be sufficient
  Reduces detection latency
  Allows more accurate recovery actions

Page 16: Related Work

Impact of faults on systems
  Robustness and availability studies

Protocol performance studies
  Congestion avoidance and control in WAN
    Back-off based algorithms

Interconnects in cluster environment
  SAN context: Packet drops Serious failures
  Evidence of faults due to immature technology
  Fault tolerant interconnects: Myrinet

Page 17: Summary & Conclusion

Studied impact of communication architecture on service performability

Surprisingly VIA versions delivered better availability

Comparison under varying fault loads
  Evaluated architecture maturity and complexity

Desirable cluster-based protocol characteristics
  Messaging, single-copy transfers, pre-allocated resources

Page 18: Thank you. Questions?

http://dark-panic.rutgers.edu/Research/vivo