Lecture 11: Parallel Processing of Irregular Computations & Load Balancing
Shantanu Dutt, ECE Dept., UIC


Page 1:

Lecture 11: Parallel Processing of Irregular Computations & Load Balancing

Shantanu Dutt, ECE Dept., UIC

Page 2: (figure-only slide)

Page 3:

Discrete Event Simulation—Basics with VHDL Descriptions as an Example.

VHDL Dataflow Description of a Circuit:

library IEEE;
use IEEE.STD_LOGIC_1164.all;
entity ckt1 is
  port(s1, s2 : in bit; Z : out bit);
end entity ckt1;
architecture data_flow of ckt1 is
  signal sbar1, sbar2, x, y : bit;
begin
  sbar1 <= not s1 after 2 ns;
  sbar2 <= not s2 after 2 ns;
  x <= s1 and sbar2 after 4 ns;
  y <= s2 and sbar1 after 4 ns;
  Z <= x or y after 4 ns;
end architecture data_flow;

Page 4:

Discrete Event Simulation—Basics

Page 5:

Discrete Event Simulation—Basics (cont’d)

Page 6:

Discrete Event Simulation—Basics (cont’d)
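The details on these slides are in figures; as a rough illustration of the core DES mechanism (a global event queue ordered by simulation time, driving gate re-evaluation and event scheduling with the "after" delays of the page-3 VHDL model), here is a minimal Python sketch. The signal/gate tables and helper names are illustrative assumptions, not from the slides.

# Minimal discrete-event logic simulator sketch (illustrative; models the
# ckt1 circuit on page 3). A change on a signal re-evaluates its fanout gates,
# each of which schedules its output update after the gate's delay.
import heapq

values = {"s1": 0, "s2": 0, "sbar1": 1, "sbar2": 1, "x": 0, "y": 0, "Z": 0}

# gate: (output signal, delay in ns, function of current signal values)
gates = [
    ("sbar1", 2, lambda v: 1 - v["s1"]),
    ("sbar2", 2, lambda v: 1 - v["s2"]),
    ("x",     4, lambda v: v["s1"] and v["sbar2"]),
    ("y",     4, lambda v: v["s2"] and v["sbar1"]),
    ("Z",     4, lambda v: v["x"] or v["y"]),
]
fanout = {"s1": [0, 2], "s2": [1, 3], "sbar1": [3], "sbar2": [2], "x": [4], "y": [4]}

def simulate(stimuli, horizon=40):
    """stimuli: list of (time_ns, signal, value) external events."""
    eq = list(stimuli)                             # event queue keyed by sim. time
    heapq.heapify(eq)
    while eq:
        t, sig, val = heapq.heappop(eq)            # always process the earliest event
        if t > horizon or values[sig] == val:
            continue                               # no value change => no new events
        values[sig] = val
        print(f"{t:3d} ns: {sig} -> {val}")
        for g in fanout.get(sig, []):              # re-evaluate affected gates
            out, delay, fn = gates[g]
            heapq.heappush(eq, (t + delay, out, int(fn(values))))

simulate([(0, "s1", 1), (10, "s2", 1)])            # Z rises at 8 ns, falls at 20 ns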

Page 7:

Parallel DES for Logic Simulation

Pages 8-14: (figure-only slides; no text captured)

Page 15:

Correctness Issues in Parallel DES

• What happens if inter-processor messages are received out of simulation-time order, either from the same processor or from different processors? That is, if a msg. w/ simulation time ti is received before a msg. w/ simulation time tj, where ti > tj, then what happens? The sim.-time-ti and sim.-time-tj msgs. could be coming from the same or from different processors.

• If a proc. "blindly" processes all msgs. as they come, this can lead to incorrect simulation. E.g., the sim.-time-tj msg. can cause an output that affects the input to the process for the sim.-time-ti msg. in the above example. So if the earlier-arriving sim.-time-ti msg. is processed before the later-arriving sim.-time-tj msg., the former's simulation output will likely be incorrect.

Page 16:

Correctness Issues in Parallel DES: Solutions

• For each msg. sent from processor Pk targeting a (simulation) process Qr (which is, say, in processor Pq), Pk records the sim. time of the latest such msg. When sending the next msg. targeting Qr, Pk also includes the previous sim. time along with the current one.

• So a msg. looks like Mj = (input value, tj [curr. sim. time], tq [prev. sim. time]), and the next msg. is Mi = (input value, ti, tj).

• The receiving proc. Pq likewise records the sim. time of the last msg. received for each input of Qr. If a new msg. for that input of Qr carries a prev. sim. time equal to the recorded one, that msg. is correct in timing order. Otherwise Pq stores the msg. and waits for the not-yet-received msg. of correct timing order.

• So if, for input A of Qr, the recorded time of the prev. simulation is tq, and msg. Mi = (value, ti, tj) is recvd., it will not be processed. Only after msg. Mj = (value, tj, tq) is recvd. will Mj be processed, followed by the processing of Mi (since the latest recorded sim. time for i/p A of Qr is then tj).

• With regard to msgs. from multiple processors, Pq will not perform any simulation until it has recvd. timing-correct msgs. (e.g., Mj above) from all procs. that are supposed to send it msgs. This underscores the importance of null msgs., without which the simulation will not proceed in this approach. A minimal sketch of the ordering check appears below.
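To make the chained-timestamp rule concrete, here is a minimal receiver-side sketch in Python; the class and field names are illustrative assumptions, not from the lecture.

# Receiver-side ordering check for one input of a simulation process Qr
# (illustrative). Each message carries (value, curr_time, prev_time); a message
# is processed only when its prev_time matches the last recorded sim. time.
class InputPort:
    def __init__(self):
        self.last_time = 0        # sim. time of last processed msg. (init 0)
        self.pending = {}         # prev_time -> (value, curr_time) awaiting order

    def receive(self, value, curr_time, prev_time):
        """Buffer the message, then process every buffered message whose
        prev_time chains onto the last processed sim. time."""
        self.pending[prev_time] = (value, curr_time)
        processed = []
        while self.last_time in self.pending:     # next-in-order msg. present?
            value, t = self.pending.pop(self.last_time)
            processed.append((value, t))          # safe to simulate this event
            self.last_time = t                    # advance the recorded sim. time
        return processed

port = InputPort()
print(port.receive(1, curr_time=7, prev_time=4))  # Mi arrives early -> []
print(port.receive(0, curr_time=4, prev_time=0))  # Mj arrives -> [(0, 4), (1, 7)]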

Page 17:

Some examples of applications requiring DES

Page 18: (figure-only slide)

Page 19:

Search Techniques

[Figure: an example graph with nodes A-G, shown twice: once labeled with the BFS visit order 1-7, and once unlabeled ("Graph").]

dfs(v)  /* for basic graph visit, or for soln finding when nodes are partial or full solns */
  v.mark = 1;
  for each (v,u) in E
    if (u.mark != 1) then dfs(u);

Algorithm Depth_First_Search_Soln
  for each v in V
    v.mark = 0;
  if G has partial-soln nodes then
    for each v in V
      if v.mark = 0 then dfs(v);
    end for;
  else
    soln_dfs(root);  /* root is a particular node in V from where we can start the solution search */

soln_dfs(v)
/* used when nodes are basic elts of the problem and not partial-soln nodes, and a soln. is a path */
  v.mark = 1;
  if path to v is a soln then return(1);
  for each (v,u) in E
    if (u.mark != 1) then
      soln_found = soln_dfs(u);
      if (soln_found = 1) then return(soln_found);
  end for;
  v.mark = 0;  /* can visit v again to form another soln on a different path */
  return(0);
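For concreteness, here is a runnable Python rendering of soln_dfs on an assumed version of the slide's A-G graph; the edge list and the solution test (a 5-node path from A ending at F) are illustrative assumptions, chosen so the search finds the slide's (A,B,E,C,F) path.

# Runnable sketch of soln_dfs (illustrative; graph and solution test are
# assumptions mirroring the slide's A-G example).
E = {"A": ["B", "C"], "B": ["E"], "C": ["E", "F"], "E": ["C", "D"],
     "D": ["F"], "F": ["G"], "G": []}

def soln_dfs(v, path, is_soln):
    path.append(v)
    if is_soln(path):                 # path to v is a solution
        return True
    for u in E[v]:
        if u not in path:             # u.mark != 1: u not on the current path
            if soln_dfs(u, path, is_soln):
                return True
    path.pop()                        # v.mark = 0: may reuse v on another path
    return False

path = []
# Example criterion: a path from A that ends at F and has 5 nodes.
if soln_dfs("A", path, lambda p: p[-1] == "F" and len(p) == 5):
    print("Soln found:", path)        # -> ['A', 'B', 'E', 'C', 'F']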

[Figure: DFS (black arcs) and Soln_DFS (black+red arcs) on the A-G graph, with visit order 1-10; soln found: (A,B,E,C,F), meeting some criterion.]

Page 20:

Search Techniques—Exhaustive DFS

optimal_soln_dfs(v)
/* used when nodes are basic elts of the problem and not partial-soln nodes, and a soln. is a path */
begin
  v.mark = 1;
  if path to v is a soln then begin
    if cost < best_cost then begin
      best_soln = soln; best_cost = cost;
    endif
    v.mark = 0;
    return;
  endif
  for each (v,u) in E
    if (u.mark != 1) then begin
      cost = cost + edge_cost(v,u);  /* global var. */
      optimal_soln_dfs(u);
      cost = cost - edge_cost(v,u);  /* restore on backtrack so cost tracks the current path */
    endif
  end for;
  v.mark = 0;  /* can visit v again to form another soln on a different path */
end

Algorithm Depth_First_Search_Opt_Soln
  for each v in V
    v.mark = 0;
  best_cost = infinity; cost = 0;
  optimal_soln_dfs(root);
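A compact runnable sketch of the same exhaustive optimal-soln DFS; the weighted A-G edge list and the solution test (reaching G) are illustrative assumptions, and cost is passed as a parameter rather than kept as a decremented global, which sidesteps the backtracking bookkeeping above.

# Exhaustive DFS over all solution paths, keeping the cheapest (illustrative;
# edge costs and the solution test are assumptions, not from the slides).
import math

W = {("A","B"): 2, ("A","C"): 1, ("B","E"): 3, ("C","E"): 2, ("C","F"): 4,
     ("E","C"): 2, ("E","D"): 1, ("D","F"): 2, ("F","G"): 1}

best_cost, best_soln = math.inf, None

def optimal_soln_dfs(v, path, cost, is_soln):
    global best_cost, best_soln
    path.append(v)
    if is_soln(path):
        if cost < best_cost:                      # keep the cheapest solution so far
            best_cost, best_soln = cost, list(path)
        path.pop()                                # v.mark = 0
        return
    for (a, b), w in W.items():
        if a == v and b not in path:              # unmarked neighbor
            optimal_soln_dfs(b, path, cost + w, is_soln)
    path.pop()                                    # allow v on other solution paths

optimal_soln_dfs("A", [], 0, lambda p: p[-1] == "G")
print(best_cost, best_soln)                       # -> 6 ['A', 'C', 'F', 'G']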

[Figure: the A-G graph search trees for DFS (black arcs), Soln_DFS (black+red arcs), and Optimal_Soln_DFS (black+red+green arcs); visit order 1-10, then i > 10, i+1, ..., i+4; soln found: (A,B,E,C,F); best soln so far: (A,C,E,D,F,G).]

Page 21:

Best-First Search

BeFS(root)
begin
  open = {root};  /* open is the list of generated but not yet expanded nodes—partial solns */
  best_soln_cost = infinity;
  while open != nullset do begin
    curr = first(open);
    if curr is a soln then
      return(curr);  /* curr is an optimal soln */
    else
      children = Expand_&_est_cost(curr);
      /* generate all children of curr & estimate their costs --- cost(u) should be a
         lower bound on the cost of the best soln reachable from u */
    for each child in children do begin
      if child is a soln then
        delete all nodes w in open s.t. cost(w) >= cost(child);
      endif
      store child in open in increasing order of cost;
    endfor
  endwhile
end /* BeFS */

Expand_&_est_cost(Y)
begin
  children = nullset;
  for each basic elt x of the problem "reachable" from Y do begin
    if x not in Y and the child is feasible then begin
      child = Y U {x};
      path_cost(child) = path_cost(Y) + cost(Y, x);  /* cost(Y, x) is the cost of reaching x from Y */
      est(child) = lower bound on the cost of the best soln reachable from child;
      cost(child) = path_cost(child) + est(child);
      children = children U {child};
    endif
  endfor
end /* Expand_&_est_cost */

Y = a partial soln, i.e., a path from the root to the current "node" (a basic elt. of the problem, e.g., a city in TSP, or a vertex in V0 or V1 in min-cut partitioning). We go from each such "node" u to the next one, u', that is "reachable" from u in the problem "graph" (which is part of what you have to formulate).
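Below is a minimal runnable sketch of BeFS with a priority queue; the toy graph, the zero lower-bound estimate, and the helper names are illustrative assumptions. The slide's extra step of deleting dominated open nodes when a solution child appears is omitted; with lower-bound costs, the first solution popped is still optimal.

# Best-first search sketch (illustrative). expand() must return children whose
# costs are lower bounds on the best soln reachable through them.
import heapq
import itertools

def befs(root, root_cost, expand, is_soln):
    tie = itertools.count()                        # tie-breaker so the heap never compares nodes
    open_list = [(root_cost, next(tie), root)]
    while open_list:
        cost, _, curr = heapq.heappop(open_list)   # curr = first(open): min-cost node
        if is_soln(curr):
            return cost, curr                      # lower-bound costs => optimal
        for child_cost, child in expand(curr):
            heapq.heappush(open_list, (child_cost, next(tie), child))
    return None

# Toy usage: min-cost path from A to G in the weighted graph W used earlier,
# with est() = 0 (a trivial but still valid lower bound).
W = {("A","B"): 2, ("A","C"): 1, ("B","E"): 3, ("C","E"): 2, ("C","F"): 4,
     ("E","C"): 2, ("E","D"): 1, ("D","F"): 2, ("F","G"): 1}

def expand(path):
    v, children = path[-1], []
    for (a, b) in W:
        if a == v and b not in path:
            p = path + (b,)
            path_cost = sum(W[e] for e in zip(p, p[1:]))
            children.append((path_cost, p))        # cost = path_cost + est, est = 0
    return children

print(befs(("A",), 0, expand, lambda p: p[-1] == "G"))  # -> (6, ('A', 'C', 'F', 'G'))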

[Figure: a BeFS tree from "root" through a node u; nodes annotated with lower-bound costs (10, 12, 15, 19, 18, 17, 18, 16) and expansion order (1), (2), (3).]

Page 22:

Best-First Search

Proof of optimality when cost is a LB:
• The current set of nodes in "open" represents a complete front of generated nodes, i.e., every un-generated node in the search space is a descendant of some node in "open".
• If the first node curr in "open" is a soln, then cost(curr) <= cost(w) for each w in "open".
• The cost of any solution node not in "open" and not yet generated is >= the cost of its ancestor in "open", and thus >= cost(curr). Hence curr is the optimal (min-cost) soln.
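The inequality chain can be written compactly (a restatement of the argument above, with z an arbitrary not-yet-generated solution and w its ancestor in open):

% first inequality: cost() is a lower bound along any path from w;
% second inequality: curr is the minimum-cost node in open.
\[
  \mathrm{cost}(z) \;\ge\; \mathrm{cost}(w) \;\ge\; \mathrm{cost}(\mathit{curr})
  \quad\Longrightarrow\quad
  \mathrm{cost}(\mathit{curr}) \;=\; \min_{z \,\in\, \text{solutions}} \mathrm{cost}(z).
\]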

[Figure: the same BeFS tree, now marking Y = a partial soln; node costs 10, 12, 15, 19, 18, 17, 18, 16 with expansion order (1), (2), (3).]

Page 23:

Search techs for a TSP example

[Figure: a 6-city TSP graph (cities A-F, edge weights drawn from 1-9) and the corresponding DFS search tree with backtracking (pruned branches marked x); the complete tours (solution nodes) have costs 27, 31, and 33. Caption: exhaustive search using DFS (w/ backtrack) for finding an optimal solution.]

Page 24:

Search techs for a TSP example (cont'd)

[Figure: the BeFS tree for the same TSP instance; each node is labeled path cost + lower-bound estimate (e.g., 5+15, 8+16, 11+14, 14+9, 20, 21+6, 22+9, 23+8), pruned nodes marked X; the optimal tour found has cost 27. Caption: BeFS for finding an optimal TSP solution.]

• Lower-bound cost estimate: cost of MST({unvisited cities} U {current city} U {start city}).
• This is a LB because the LB structure (a spanning tree) ranges over a superset of the reqd soln structures (Hamiltonian paths completing the tour): min(metric M's values over a set S) <= min(M's values over a subset S').
• Similarly for max?? (By the same containment, a max over S upper-bounds the max over S'.)

[Figure: the 6-city TSP graph again; for the partial tour (A, E, F): path cost = 8, and MST({F, A, B, C, D}) cost = 16, giving the 8+16 node above. Also a Venn diagram: the set S of all spanning trees of a graph G contains the set S' of all Hamiltonian paths (paths that visit each node exactly once) of G.]
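Building on the MST bound above, here is a small runnable sketch (Prim's algorithm); the edge weights are illustrative assumptions, not the slide's actual graph, so the computed numbers differ from the 8+16 example.

# MST-based lower bound for a partial TSP tour (illustrative edge weights).
# LB(partial) = path cost so far + MST({unvisited} U {current city} U {start}).
W = {("A","B"): 9, ("A","E"): 5, ("A","F"): 5, ("B","C"): 2, ("B","F"): 1,
     ("C","D"): 3, ("C","F"): 5, ("D","E"): 4, ("D","F"): 8, ("E","F"): 7}

def w(a, b):
    return W.get((a, b), W.get((b, a), float("inf")))

def mst_cost(nodes):
    """Prim's algorithm over the induced subgraph on 'nodes'."""
    nodes = list(nodes)
    in_tree, cost = {nodes[0]}, 0
    while len(in_tree) < len(nodes):
        c, v = min((w(a, b), b) for a in in_tree for b in nodes if b not in in_tree)
        in_tree.add(v)                 # grow the tree by the cheapest crossing edge
        cost += c
    return cost

def lower_bound(partial, all_cities, start="A"):
    path_cost = sum(w(a, b) for a, b in zip(partial, partial[1:]))
    mst_nodes = (set(all_cities) - set(partial)) | {partial[-1], start}
    return path_cost + mst_cost(sorted(mst_nodes))

print(lower_bound(("A", "E", "F"), "ABCDEF"))   # path cost 12 + MST cost 11 = 23 here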

Page 25:

BFS for 0/1 ILP Solution

• X = {x1, ..., xm} are 0/1 vars.
• Choose vars xi = 0/1 as next nodes in some order (random or heuristic-based).

[Figure: B&B tree rooted at "root (no vars expanded)". Branch x2 = 0/1: solve the LP w/ x2 = 0 (cost C1) and w/ x2 = 1 (cost C2). Under x2 = 1, branch x4 = 0/1: LP w/ x2 = 1, x4 = 0 (cost C3) and w/ x2 = 1, x4 = 1 (cost C4). Under x2 = 1, x4 = 1, branch x5 = 0/1: LP w/ x5 = 0 (cost C5, the optimal soln) and w/ x5 = 1 (cost C6).

Cost relations: C5 < C3 < C1 < C6; C2 < C1; C4 < C3.]
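A hedged, runnable sketch of this LP-relaxation branch & bound, searched best-first on the LP lower bound; the tiny knapsack-style instance, the variable-selection rule, and the use of scipy.optimize.linprog are illustrative assumptions, not from the slides.

# Best-first B&B for min c.x s.t. A_ub.x <= b_ub, x in {0,1}^m, using the LP
# relaxation as a lower bound (illustrative sketch).
import heapq
import itertools
import numpy as np
from scipy.optimize import linprog

def lp_relax(c, A_ub, b_ub, fixed):
    """LP relaxation with the vars in 'fixed' pinned to 0/1; free vars in [0,1]."""
    bounds = [(fixed.get(i, 0), fixed.get(i, 1)) for i in range(len(c))]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return (res.fun, res.x) if res.success else (np.inf, None)

def branch_and_bound(c, A_ub, b_ub):
    tie = itertools.count()                      # tie-breaker for the heap
    root_cost, _ = lp_relax(c, A_ub, b_ub, {})
    open_list = [(root_cost, next(tie), {})]     # nodes ordered by LP lower bound
    while open_list:
        cost, _, fixed = heapq.heappop(open_list)
        _, x = lp_relax(c, A_ub, b_ub, fixed)
        if x is None:
            continue                             # infeasible subproblem
        frac = [i for i in range(len(c)) if 1e-6 < x[i] < 1 - 1e-6]
        if not frac:
            return cost, x                       # first integral node popped is optimal
        for v in (0, 1):                         # branch xi = 0 / xi = 1
            child = {**fixed, frac[0]: v}
            child_cost, cx = lp_relax(c, A_ub, b_ub, child)
            if cx is not None:
                heapq.heappush(open_list, (child_cost, next(tie), child))
    return None

# Toy instance: min -8*x0 - 11*x1 - 6*x2  s.t.  5*x0 + 7*x1 + 4*x2 <= 10.
c = np.array([-8.0, -11.0, -6.0])
A_ub = np.array([[5.0, 7.0, 4.0]])
b_ub = np.array([10.0])
print(branch_and_bound(c, A_ub, b_ub))           # expect cost -14 at x = [1, 0, 1]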

Page 26:

(Stop when a generated child is a soln node with cost at most (1+alpha)*cost(best(open)), where alpha is a given sub-optimality fraction.)

Pages 27-31: (figure-only slides; no text captured)

Page 32:

[Figure annotation: condition for speedup > 1.]

Page 33:

• For Sp(P) > 1, we need
    Sp(P) = n*texp / ((n/P)*texp + n*(P-1)*tacc) > 1
  => texp > texp/P + (P-1)*tacc
  => texp*(P-1)/P > (P-1)*tacc
  => P < texp/tacc

• For constant efficiency, this is even worse:
    E(P) = Sp(P)/P = T(1)/(P*Tp(P))
         = n*texp / (n*texp + n*P*(P-1)*tacc) = const. C <= 1
  => 1 + P*(P-1)*tacc/texp = 1/C
  => P*(P-1)*tacc/texp = 1/C - 1.
  The left-hand side strictly grows with P (its derivative w.r.t. P is (2P-1)*tacc/texp, which is nonzero for any P), so the equality cannot hold as P increases, for any problem size: constant efficiency is unachievable with a centralized open list.
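A quick numeric check of the P < texp/tacc bound (the time values here are made up for illustration):

# Speedup under the centralized-open-list model above (illustrative values).
texp, tacc = 100e-6, 5e-6     # per-node expansion time vs. per-access contention delay
n = 10_000                    # total node expansions

def speedup(P):
    Tp = (n / P) * texp + n * (P - 1) * tacc   # parallel time: computation + contention
    return (n * texp) / Tp

for P in (2, 10, 19, 20, 40):
    print(P, round(speedup(P), 3))             # drops to 1 exactly at P = texp/tacc = 20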


Page 34:

• Nodes w/ cost >= that of the current best global soln. so far are discarded. Note that this can sometimes lead to idling, and at other times non-essential work may be done before such nodes get deleted. Both are overheads of parallel B&B.

• A local best soln. at the head of the local open list is the global optimum if all other processors have terminated by then (their termination msgs. may still be in transit in some cases).

Page 35: (figure-only slide)

Page 36:

Load Balancing

Legend: load-info exchange (LIE); load/work transfer.

• Generic load-balance protocol:
  ‒ Periodic LIEs between subsets of processors (generally neighbors, or small extended neighborhoods, e.g., processors a distance k apart for small k)
  ‒ Followed by work transfers, as indicated by the LIE and the work-transfer policy

• Issues to be determined in a LB technique (generally application- and parallel-system-dependent); a toy sketch of one round follows this list:
  ‒ Frequency of LIE
  ‒ Definition of load
  ‒ Load-difference threshold, or more generally some relative-load condition, to trigger a work transfer
  ‒ Donor- or receiver-initiated load/work transfer?
  ‒ How much, and which, work to transfer?
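As a toy illustration (not any specific policy from the lecture), here is a sketch of one LIE-plus-transfer round in which paired processors compare loads and the heavier one donates half the difference; the loads, threshold, and pairing are made-up assumptions.

# One round of a generic neighbor-based load-balancing protocol (illustrative).
# Each proc. exchanges load info (LIE) with a ring neighbor; if the load gap
# exceeds a threshold, the heavier side transfers half the difference.
loads = [40, 4, 22, 9, 31, 6, 17, 2]    # work units per processor (made-up)
THRESHOLD = 8                           # load-difference trigger for a transfer

def lie_round(loads):
    P = len(loads)
    for i in range(0, P, 2):            # pair (i, i+1): disjoint pairs this round
        j = (i + 1) % P
        diff = loads[i] - loads[j]
        if abs(diff) > THRESHOLD:       # relative-load condition met
            moved = abs(diff) // 2      # donor sends half the gap to the receiver
            if diff > 0:
                loads[i] -= moved; loads[j] += moved
            else:
                loads[j] -= moved; loads[i] += moved
    return loads

print(lie_round(loads))                 # -> [22, 22, 16, 15, 19, 18, 10, 9]

In practice the pairing would alternate across rounds (and follow the machine's neighborhood structure) so load can diffuse through the whole processor network.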

Pages 37-44: (figure-only slides; no text captured)

Page 45:

Load balancing without a numerical load computation, based on rank (a la the AC method): this minimizes non-essential work but significantly increases idling due to the large taccess/texp ratio.

Page 46:

Quality Equalizing (QE) Load Balancing Techniques

• Various techniques developed over a few years by my former Ph.D. student Prof. Nihar Mahapatra (MSU) and myself. The references are:

• N.R. Mahapatra and S. Dutt, "An efficient delay-optimal distributed termination detection algorithm", Journal of Parallel and Distributed Computing, Oct. 2007, pp. 1047-1066.

• N.R. Mahapatra and S. Dutt, "Adaptive Quality Equalizing: High-Performance Load Balancing for Parallel Branch-and-Bound Across Applications and Computing Systems", Proc. Joint IEEE Parallel Processing Symposium / Symp. on Parallel and Distributed Processing, April 1998.

• N.R. Mahapatra and S. Dutt, "Random Seeking: A General, Efficient, and Informed Randomized Scheme for Dynamic Load Balancing", Proc. Tenth IEEE Parallel Processing Symposium, April 1996, pp. 881-885.

• N.R. Mahapatra and S. Dutt, "New anticipatory load balancing strategies for scalable parallel best-first search", American Mathematical Society's DIMACS Series on Discrete Mathematics and Theoretical Computer Science, Vol. 22, 1995, pp. 197-232.

• S. Dutt and N.R. Mahapatra, "Scalable load-balancing strategies for parallel A* algorithms", Journal of Parallel and Distributed Computing, Special Issue on Scalability of Parallel Algorithms and Architectures, Vol. 22, No. 3, Sept. 1994, pp. 488-505.

• S. Dutt and N.R. Mahapatra, "Parallel A* algorithms and their performance on hypercube multiprocessors", Proc. Seventh IEEE Parallel Processing Symposium, 1993, pp. 797-803.

Page 47:

• The donor processor grants very few nodes to the acceptor (e.g., alternating 2-3 nodes starting from the local rank-2 node).

• For high-latency, low-bandwidth platforms like NOWs (networks of workstations, and Beowulf clusters like Argo):

  – set s higher (s should be inversely proportional to the bandwidth, otherwise n/w saturation can occur)

  – decrease the frequency of load-info exchange (LIE)

Page 48:

(alternating-rank nodes in the merged open list, for s > 1)

Pages 49-51: (figure-only slides; no text captured)

Page 52:

We will see a worst-case analysis later in this regard.

Page 53:

[E = T1/(P*Tp(P)) = W(N)/Wp(P) = W(N)/(W(N) + Wo(N, P)), where N is the problem size]

Page 54:

Scalability Analysis

• Derivation of QE's isoefficiency upper bound of Θ(P*D*d):

Worst-case assumption (for the worst-case rank difference): each proc. is the worst in its neighborhood, and its neighbor on the path toward the best proc. is the best in its own neighborhood.

Best-node rank w.r.t. Pi,1: Θ(sd).
Best-node rank w.r.t. Pi,2: Θ(sd), and w.r.t. Pi,1: Θ(2sd).
...
Best-node rank w.r.t. Pi,D-1: Θ(sd), and w.r.t. Pi,1: Θ((D-1)sd).

[Figure: a chain of processors Pi,1, Pi,2, Pi,3, ..., Pi,D-1, Pi,D with a Θ((D-1)sd) rank gap between the proc. w/ the worst node and the one or few best procs. w/ essential (opt.-cost) nodes; in the worst case for isoefficiency, the proc. w/ the best node produces the opt. soln.]

Page 55:

Scalability Analysis (cont'd)

• Derivation of QE's isoefficiency upper bound of Θ(P*D*d):

• Taking into account the d other such paths of "neighbors" of the 1st path, the rank difference among the d such paths of length about D is also Θ(Dsd): the Θ(sd) rank gap between neighboring processors on a path encompasses the rank difference w/ the other (d-1) neighbors, one each in the "neighboring" (d-1) paths of length about D. With s = const., this is Θ(Dd).

• After Θ(Dsd) = Θ(Dd) iterations, the proc. w/ the best node produces the optimal solution. In this time, Θ((Dd)^2/2) units of non-essential (NE) work get done in a group of d neighborhood paths of distance about D. This happens across Θ(P/(Dd)) such path groups, so the total NE work across all P procs. = Θ((P/(Dd))*(Dd)^2) = Θ(P*D*d) NE work or idling.
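Summarizing the arithmetic of the bound in the two bullets above (a restatement under the same worst-case assumptions):

% Per group of d paths of length ~D: the rank gap closes over Theta(Dsd) =
% Theta(Dd) iterations (s const.), during which the NE work accumulates as
% 1 + 2 + ... + Theta(Dd) = Theta((Dd)^2/2). With Theta(P/(Dd)) path groups:
\[
  W_{NE} \;=\; \Theta\!\left(\frac{P}{Dd}\right) \cdot \Theta\!\left((Dd)^2\right)
          \;=\; \Theta(P\,D\,d).
\]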

[Figure: the same worst-case picture as on the previous slide: a Θ((D-1)sd) rank gap between the proc. w/ the worst node and the one or few procs. w/ essential (opt.-cost) nodes.]

Page 56:

[Figure annotations: Θ((3tc + 3ts/2)/texp) and Θ((2tc + ts)/texp), to be precise; texp is a constant w.r.t. the architecture.]

Pages 57-60: (figure-only slides; no text captured)

Page 61:

Rationale: more global load balancing (i.e., a smaller global rank difference between the best- and worst-qualitatively-loaded processors) w/o high communication overhead.

Page 62: (figure-only slide)

Page 63:

[Figure annotations: s; cost_k(1) > cost_i(3).]

Pages 64-66: (figure-only slides; no text captured)