Scalable Computing on Open Distributed Systems
Jon Weissman, University of Minnesota
National E-Science Center
CLADE 2008
What is the Problem?
• Open distributed systems
  – tasks are submitted to the "system" for execution
  – workers do the computing: execute a task, return an answer
• The challenge
  – computations that are erroneous or late are less useful
  – workers may fail, err, be hacked, or be misconfigured
  – unpredictable time to return answers
• Both local- and wide-area systems
  – focus here: volunteer wide-area systems
Shape of the Solution
• Replication
  – works for all sources of unreliability: computation and data
• How to do this intelligently and scalably?
Replication Challenges
• How many replicas?
  – too many: a waste of resources
  – too few: the application suffers
• Most approaches assume ad-hoc replication
  – under-replication: task re-execution (higher latency)
  – over-replication: wasted resources (lower throughput)
• Using information about a node's past behavior, we can intelligently size the redundancy
Problems with ad-hoc replication
(figure: a mix of reliable and unreliable nodes; task x is sent to group A, task y to group B)
System Model
(figure: worker nodes annotated with reputation ratings 0.9, 0.8, 0.8, 0.8, 0.8, 0.7, 0.7, 0.4, 0.4, 0.3)
• Reputation rating r_i: the degree of a node's reliability
• Dynamically size the redundancy based on r_i
• Note: variable-sized groups
• Assume no correlated errors (relaxed later)
Smart Replication
• Rating based on past interactions with clients
  – probability r_i over a window: correct/total or timely/total
  – extends to a worker group (assuming no collusion) => likelihood of correctness (LOC)
• Smarter redundancy
  – variable-sized worker groups
  – intuition: higher-reliability clients => smaller groups
For a group g of k clients with independent ratings r_i and majority threshold m = ⌈(k+1)/2⌉:

LOC(g) = Σ_{S⊆g, |S|≥m} ∏_{i∈S} r_i · ∏_{j∈g∖S} (1 − r_j)
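A minimal sketch (not the system's actual code) of how a group's LOC can be computed from individual ratings r_i, assuming independent errors and majority voting, by summing the probability of every majority-correct outcome:

```python
from itertools import combinations

def loc(reliabilities):
    """Likelihood of correctness for a group under majority voting.

    Sums, over every subset S of workers of size >= m (the smallest
    majority), the probability that exactly the workers in S answer
    correctly, assuming independent (uncorrelated) errors.
    """
    k = len(reliabilities)
    m = k // 2 + 1  # smallest majority
    total = 0.0
    for size in range(m, k + 1):
        for subset in combinations(range(k), size):
            p = 1.0
            for i in range(k):
                p *= reliabilities[i] if i in subset else 1 - reliabilities[i]
            total += p
    return total

# loc([0.9, 0.8, 0.7]) ≈ 0.902
```

The subset enumeration is exponential in k, which is tolerable here because groups stay small by design.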
Terms
• LOC (Likelihood of Correctness) of a group g
  – the 'actual' probability of getting a correct or timely answer from the group g of clients
• Target LOC
  – the success rate that the system tries to ensure while forming client groups
Scheduling Metrics
• Guiding metrics
  – throughput: the number of tasks successfully completed in an interval
  – success rate s: the ratio of throughput to the number of tasks attempted
Algorithm Space
• How many replicas?
  – algorithms compute how many replicas are needed to meet a success threshold
• How to reach consensus?
  – Majority (better against byzantine threats)
  – M-1 (better for timeliness)
  – M-2 (first two matching answers)
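The three consensus schemes can be sketched as follows. This is an illustrative interpretation of the slide's labels, taking M-1 as "accept the first answer returned" and M-2 as "accept once two answers match":

```python
from collections import Counter

def consensus(answers, scheme, group_size):
    """Decide whether the answers collected so far reach consensus.

    answers:    results received, in arrival order.
    scheme:     'majority', 'm-1' (first answer wins), or
                'm-2' (first two matching answers win).
    Returns the agreed answer, or None if no consensus yet.
    """
    if scheme == "m-1":
        return answers[0] if answers else None
    counts = Counter(answers)
    if scheme == "m-2":
        for ans, n in counts.items():
            if n >= 2:
                return ans
        return None
    if scheme == "majority":
        ans, n = counts.most_common(1)[0]
        return ans if n > group_size // 2 else None
    raise ValueError(f"unknown scheme: {scheme}")
```

M-1 never waits, so it favors timeliness; majority tolerates up to ⌊(k−1)/2⌋ byzantine workers at the cost of waiting for more responses.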
One Scheduling Algorithm
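The algorithm itself appears on the slide as a figure. A plausible greedy sketch, consistent with the surrounding description (not necessarily the paper's exact algorithm), is to add the most reliable available workers until the group's estimated LOC meets the target:

```python
from itertools import combinations
import math

def loc(rs):
    """Probability that a majority of the group is correct (independent errors)."""
    k, m = len(rs), len(rs) // 2 + 1
    total = 0.0
    for size in range(m, k + 1):
        for s in combinations(range(k), size):
            total += math.prod(rs[i] if i in s else 1 - rs[i] for i in range(k))
    return total

def form_group(workers, target):
    """Greedily add the highest-rated available workers until the
    group's estimated LOC reaches the target (or workers run out).

    workers: dict mapping worker id -> reputation rating r_i.
    Returns the chosen worker ids.
    """
    pool = sorted(workers, key=workers.get, reverse=True)
    group = []
    while pool:
        group.append(pool.pop(0))
        if loc([workers[w] for w in group]) >= target:
            break
    return group
```

Note that a single highly reliable worker can satisfy a modest target on its own, which is exactly the "higher reliability => smaller groups" intuition.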
Evaluation
• Baselines
  – Fixed algorithm: statically sized, equal groups; uses no reliability information
  – Random algorithm: forms groups by randomly assigning nodes until the target LOC is reached
• Simulated a wide variety of node-reliability distributions
Experimental Results: correctness
Simulation: byzantine behavior only; majority voting
Role of the target LOC
• Key parameter, hard to specify
• Too large: groups become too large (low throughput)
• Too small: groups become too small (low success rate)
• Instead, adaptively learn it
  – bias toward throughput, success rate s, or both
Adaptive Algorithm
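The adaptive algorithm appears on the slide as a figure. One hypothetical feedback rule matching the bullets above (my own illustration, not the paper's exact update) raises the target when the observed success rate falls short, and lowers it otherwise to shrink groups and recover throughput:

```python
def adapt_target(target, success_rate, desired_success,
                 step=0.01, lo=0.5, hi=0.999):
    """One feedback step for the LOC target.

    If the observed success rate is below the desired level, groups
    were too small, so raise the target; otherwise lower it slightly
    so groups shrink and throughput improves. Bounds keep the target
    in a sane range.
    """
    if success_rate < desired_success:
        return min(hi, target + step)
    return max(lo, target - step)
```

Running this after every scheduling interval lets the system converge on a target it never had to be told, biased by how `desired_success` trades off s against throughput.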
What about time?
• Timeliness: a result returned after time T is less (or not) useful
  – (1) soft deadlines: a user interacting with, or visualizing, output from the computation
  – (2) hard deadlines: need to get X results done before the HPDC/NSDI/… deadline
• Live experimentation on PlanetLab
• Real application: BLAST
Some PlanetLab data
(figures: computation and communication variability, both across and within nodes; temporal variability)
PlanetLab Environment
• RIDGE is our live system that implements reputation
• 120 wide-area nodes, fully correct, M-1 consensus
• Three timeliness environments based on deadlines: D = 120 s, D = 180 s, D = 240 s
Experimental Results: timeliness
Best BOINC (BOINC*) and conservative BOINC (BOINC-) vs. RIDGE
Makespan Comparison
Collusion
• What if errors are correlated?
• How could that happen?
  – Widespread bug (hardware or software)
  – Misconfiguration
  – Virus
  – Sybil attack
  – Malicious group
• Joint work with Emmanuel Jeannot (INRIA)
Key Ideas
• Executing a task yields answer groups A1, A2, …, Ak
  – each Ai has associated workers Wi1, Wi2, …, Win
  – and an estimated Pcollusion(workers in Ai)
• Learn the probability of correlated errors: Pcollusion(W1, W2)
• Estimate the probability of group correlated errors
  – Pcollusion(G), G = [W1, W2, W3, …], via f{Pcollusion(Wi, Wj) for all i, j}
• Rank answers and select one
  – using Pcollusion(G) and |G|
  – then update the matrix of Pcollusion(W1, W2) values
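A sketch of the ranking step. The slide leaves the aggregation function f unspecified; taking f = max over worker pairs, and the scoring rule below, are my own illustrative assumptions:

```python
from itertools import combinations

def group_collusion_prob(pair_prob, group):
    """Estimate the probability that a group's agreement stems from
    correlated (colluding) error, taking f = max over worker pairs.

    pair_prob: dict mapping frozenset({wi, wj}) -> learned Pcollusion.
    """
    if len(group) < 2:
        return 0.0
    return max(pair_prob.get(frozenset(p), 0.0)
               for p in combinations(group, 2))

def select_answer(answer_groups, pair_prob):
    """Rank candidate answers: prefer large groups with low estimated
    collusion probability (illustrative scoring rule, an assumption).

    answer_groups: dict mapping answer -> list of workers returning it.
    """
    def score(workers):
        return (1 - group_collusion_prob(pair_prob, workers)) * len(workers)
    return max(answer_groups, key=lambda ans: score(answer_groups[ans]))
```

Note that a lone honest worker can outrank a larger group whose members have a history of agreeing suspiciously often.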
Bootstrap Problem
• Building the collusion matrix: colluders must first be "baited"
  – over-replicate such that the majority group is still correct, exposing the colluders
• The needed group size k depends on the probability of worker collusion and the probability that the colluders fool the system
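The slide's symbols for these two probabilities were lost in extraction; using my own notation, if each worker colludes independently with probability $\varepsilon$, the colluders fool a majority-voting group of size $k$ only when they capture a majority of it:

$$P_{\text{fool}}(k) \;=\; \sum_{j=\lceil (k+1)/2 \rceil}^{k} \binom{k}{j}\, \varepsilon^{\,j} (1-\varepsilon)^{k-j}$$

so the bootstrap phase would choose the smallest $k$ with $P_{\text{fool}}(k) \le \delta$ for some tolerance $\delta$.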
Experimental scenarios (correctness figure):
• 4: one group of 30% colluders, always colluding
• 5: the same group, colluding 30% of the time
• 7: two groups (40% and 30% colluders)
Experimental results: throughput (figure)
Summary
• Reliable, scalable computing
  – correctness and timeliness
• Future work
  – combined models and metrics
  – workflows: coupling data and computation reliability

Visit ridge.cs.umn.edu to learn more