Filter Decomposition for Supporting Coarse-grained Pipelined Parallelism

Wei Du, Gagan Agrawal

Ohio State University

Distributed Data-Intensive Applications

Fast growing datasets
Remote data access
Distributed data storage
More connected world

Requirements: Huge Storage/Powerful Computer/Fast Connection

Implementation: Local processing

Implementation: Remote processing

Requirements: Complex Analysis at Data Centers

A Practical Solution

Our hypothesis: the coarse-grained pipelined execution model is a good match

Coarse-Grained Pipelined Execution

Definition: computations associated with an application are carried out in several stages, which are executed on a pipeline of computing units.

Example: K-nearest Neighbor (KNN). Given a 3-D range R = <(x1, y1, z1), (x2, y2, z2)> and a point p = (a, b, c), find the K nearest neighbors of p within R.

Pipeline stages: Range_query, then Find the K-nearest neighbors
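The KNN example can be sketched as two stages chained together, with the range query feeding the neighbor search. This is only an illustration: the function names `range_query` and `k_nearest`, the sample points, and the use of Euclidean distance are assumptions, not taken from the slides.

```python
import heapq
import math

def range_query(points, lo, hi):
    """Stage 1: pass on only the points inside the 3-D range R = <lo, hi>."""
    for p in points:
        if all(l <= c <= h for c, l, h in zip(p, lo, hi)):
            yield p

def k_nearest(points, query, k):
    """Stage 2: keep the k points closest to the query point (Euclidean distance)."""
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, query))

# Made-up sample data for illustration.
points = [(0, 0, 0), (1, 1, 1), (2, 2, 2), (5, 5, 5), (9, 9, 9)]
in_range = range_query(points, lo=(0, 0, 0), hi=(5, 5, 5))
nearest = k_nearest(in_range, query=(1, 1, 2), k=2)
print(nearest)
```

Because stage 1 is a generator, stage 2 consumes points as they are produced, which mirrors on a small scale how one pipeline stage streams its output to the next.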

Challenges

Computation associated with an application needs to be decomposed into stages
Decomposition decisions are dependent on the execution environment
Generating code for each stage (SC03)
Other performance issues for the pipelined execution (ICPP04)
Adapting to the dynamic execution environment (SC04)

RoadMap

Filter Decomposition Problem
MIN_ONETRIP Algorithm
MIN_BOTTLENECK Algorithm
MIN_TOTAL Algorithm
Experimental Results
Related Work
Conclusion

Filter Decomposition

[Figure: a computation pipeline of computing units C1, C2, ..., Cm-1, Cm connected by links L1, ..., Lm-1; a chain of atomic filters f1, f2, ..., fn-1, fn; and example placements of groups of filters onto the computing units]

Goal: find a placement

$p(f_1, f_2, \dots, f_n) = (F_1, F_2, \dots, F_m)$, where $F_i = f_{i_1}, f_{i_1+1}, \dots, f_{i_k}$ with $1 \le i_1, i_k \le n$ for each $1 \le i \le m$, such that the predicted execution time is minimal.

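Cost Model

[Figure: example pipeline of three computing units C1, C2, C3 connected by links L1, L2, with filters f1, ..., f4 placed on them]

Bottleneck stage: the b-th stage, the slowest stage in the pipeline

Execution time for N packets (in the example, C2 is the bottleneck stage):

$$
T = T(C_1) + T(L_1) + N \cdot T(C_2) + T(L_2) + T(C_3) = \sum_{i} T_i + (N-1) \cdot T_b
$$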
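The cost model can be checked numerically with a few lines of Python. The sketch below follows the expanded form T(C1)+T(L1)+N*T(C2)+T(L2)+T(C3), in which every stage is paid once and the bottleneck stage is paid for each of the remaining N-1 packets. The function name `predicted_time` and the stage times are made up for illustration.

```python
def predicted_time(stage_times, num_packets):
    """Predicted pipelined execution time: sum of all stages + (N - 1) * bottleneck."""
    bottleneck = max(stage_times)
    return sum(stage_times) + (num_packets - 1) * bottleneck

# Hypothetical stage times for C1, L1, C2, L2, C3; C2 is the slowest stage.
stages = [2.0, 1.0, 5.0, 1.0, 3.0]

t_one = predicted_time(stages, 1)     # one packet: just the one-trip time
t_many = predicted_time(stages, 100)  # many packets: dominated by the bottleneck
print(t_one, t_many)
```

With one packet the result is the one-trip time (12.0 here); with 100 packets the bottleneck term dominates (12 + 99 * 5 = 507.0), which is why the one-trip time and the bottleneck time are natural separate targets for optimization.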

Three Algorithms

MIN_ONETRIP Algorithm: dynamic programming algorithm to minimize $\sum_i T_i$
MIN_BOTTLENECK Algorithm: dynamic programming algorithm to minimize $T_b$
MIN_TOTAL Algorithm: greedy algorithm that tries to minimize $T$

where $T = \sum_{i} T_i + (N-1) \cdot T_b$

Filter Decomposition: MIN_ONETRIP

[Figure: computing units Cm-2, Cm-1, Cm and links Lm-2, Lm-1, with candidate positions for filters fn-1 and fn]

Goal: minimize time spent by one packet on the pipeline

T[i,j]: minimum cost of doing computations f1, …, fi on computing units C1, …, Cj, where the results of fi are on Cj.

$$
T[i,j] = \min \begin{cases}
T[i-1,j] + \mathrm{Cost\_comp}(P(C_j), \mathrm{Task}(f_i)) \\
T[i,j-1] + \mathrm{Cost\_comm}(B(L_{j-1}), \mathrm{Vol}(f_i))
\end{cases}
$$

Goal: T[n,m]
Cost: O(mn)
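A minimal sketch of this dynamic program in Python follows. Several details are assumptions that the slides do not specify: Cost_comp is modeled as task / power, Cost_comm as volume / bandwidth, the raw input is taken to start on C1 (base case T[0,1] = 0), and `vol[0]` stands for the raw input volume. The function name `min_onetrip` and the sample numbers are hypothetical.

```python
def min_onetrip(task, vol, power, bandwidth):
    """Minimum one-packet time T[n][m] for n filters on m computing units.

    task[i-1]      : work of filter f_i
    vol[i]         : data volume after f_i (vol[0] is the raw input volume)
    power[j-1]     : computing power of unit C_j
    bandwidth[j-1] : bandwidth of link L_j between C_j and C_{j+1}
    """
    n, m = len(task), len(power)
    inf = float("inf")
    # T[i][j]: min cost of finishing f_1..f_i with the result on C_j (1-based j).
    T = [[inf] * (m + 1) for _ in range(n + 1)]
    T[0][1] = 0.0  # assumed base case: the raw input starts on C_1
    for i in range(n + 1):
        for j in range(1, m + 1):
            if i > 0:  # run f_i on C_j
                T[i][j] = min(T[i][j], T[i - 1][j] + task[i - 1] / power[j - 1])
            if j > 1:  # ship the data produced so far over L_{j-1}
                T[i][j] = min(T[i][j], T[i][j - 1] + vol[i] / bandwidth[j - 2])
    return T[n][m]

# Two filters, two units, one link (made-up numbers).
best = min_onetrip(task=[4, 2], vol=[10, 2, 3], power=[1, 2], bandwidth=[1])
print(best)
```

Each table entry is filled from two neighbors in constant time, so the table of (n+1) x m entries gives the O(mn) cost stated above.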

Filter Decomposition: MIN_BOTTLENECK

[Figure: computing units Cm-2, Cm-1, Cm and links Lm-2, Lm-1, with the candidate groupings of filters f1, …, fn on the last unit]

Goal: minimize time spent at the bottleneck stage

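N[i,j]: minimum cost of the bottleneck stage for computing f1, …, fi on computing units C1, …, Cj, where the results of fi are on Cj.

$$
N[i,j] = \min \begin{cases}
\max\{ N[i,j-1],\ \mathrm{Cost\_comm}(B(L_{j-1}), \mathrm{Vol}(f_i)) \} \\
\max\{ N[i-1,j-1],\ \mathrm{Cost\_comm}(B(L_{j-1}), \mathrm{Vol}(f_{i-1})),\ \mathrm{Cost\_comp}(P(C_j), \mathrm{Task}(f_i)) \} \\
\quad \vdots \\
\max\{ N[1,j-1],\ \mathrm{Cost\_comm}(B(L_{j-1}), \mathrm{Vol}(f_1)),\ \mathrm{Cost\_comp}(P(C_j), \mathrm{Task}(f_2) + \dots + \mathrm{Task}(f_i)) \}
\end{cases}
$$

Cost: $O(mn^2)$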
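The recurrence can be sketched in Python as below. As assumptions not stated on the slides, Cost_comp is modeled as task / power and Cost_comm as volume / bandwidth, and the first column is initialized by running f1, …, fi entirely on C1. The function name `min_bottleneck` and the sample numbers are hypothetical.

```python
def min_bottleneck(task, vol, power, bandwidth):
    """Minimum bottleneck-stage cost N[n][m] for n filters on m computing units.

    task[i-1]      : work of filter f_i
    vol[i-1]       : output volume of filter f_i
    power[j]       : computing power of unit C_{j+1}
    bandwidth[j]   : bandwidth of link L_{j+1}
    """
    n, m = len(task), len(power)
    # prefix[i] = Task(f_1) + ... + Task(f_i)
    prefix = [0.0]
    for t in task:
        prefix.append(prefix[-1] + t)
    # N[i][j]: best bottleneck for f_1..f_i with results on unit index j (0-based).
    N = [[0.0] * m for _ in range(n + 1)]
    for i in range(1, n + 1):
        N[i][0] = prefix[i] / power[0]  # assumed base case: everything on C_1
    for j in range(1, m):
        for i in range(1, n + 1):
            best = float("inf")
            # f_1..f_k finish on the previous unit; f_{k+1}..f_i run on this one.
            for k in range(1, i + 1):
                comm = vol[k - 1] / bandwidth[j - 1]
                comp = (prefix[i] - prefix[k]) / power[j]
                best = min(best, max(N[k][j - 1], comm, comp))
            N[i][j] = best
    return N[n][m - 1]

# Two filters, two units, one link (made-up numbers).
bottleneck = min_bottleneck(task=[4, 2], vol=[2, 3], power=[1, 2], bandwidth=[1])
print(bottleneck)
```

The inner loop over k is what raises the cost from O(mn) to O(mn^2): each of the mn entries considers up to n split points.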


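Filter Decomposition: MIN_TOTAL

Goal: minimize the predicted execution time T

[Figure: pipeline of computing units C1, C2, C3, C4 connected by links L1, L2, L3, with atomic filters f1, …, f5]

Estimated cost of each candidate group of filters for C1:

f1 : T1
f1, f2 : T2
f1 - f3 : T3
f1 - f4 : T4

Min{T1, …, T4} = T2, so f1 and f2 are placed on C1.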
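One way to read the MIN_TOTAL greedy step is sketched below. The slides only show that candidate groups for a unit are compared by estimated cost, so much of this is assumed rather than taken from the source: the estimate for a candidate places all remaining filters on the last unit (with intermediate units only forwarding data), f1 stays on C1 and fn on the last unit, there are at least two filters, and costs are task / power and volume / bandwidth. The names `total_time` and `min_total` are hypothetical.

```python
def total_time(cuts, task, vol, power, bandwidth, num_packets):
    """Predicted time T of a placement; cuts[j] = filters finished after unit j."""
    stages, done = [], 0
    for j, c in enumerate(cuts):
        stages.append(sum(task[done:c]) / power[j])   # computation on unit j
        if j < len(cuts) - 1:
            stages.append(vol[c] / bandwidth[j])      # transfer to unit j + 1
        done = c
    return sum(stages) + (num_packets - 1) * max(stages)

def min_total(task, vol, power, bandwidth, num_packets):
    """Greedy sketch: fix the cut after each unit in turn, left to right."""
    n, m = len(task), len(power)
    cuts = []
    for j in range(m - 1):
        start = cuts[-1] if cuts else 1   # assumption: f_1 stays on C_1

        def estimate(c):
            # Assumed estimate: later units only forward data, the rest runs last.
            trial = cuts + [c] * (m - 1 - j) + [n]
            return total_time(trial, task, vol, power, bandwidth, num_packets)

        # Candidates stop at n - 1 so that f_n stays on the last unit (assumption).
        cuts.append(min(range(start, n), key=estimate))
    return cuts + [n]

# Three filters, three units (made-up numbers); vol[0] is the raw input volume.
task, vol = [2, 6, 2], [10, 1, 1, 1]
power, bandwidth = [1, 4, 2], [1, 1]
placement = min_total(task, vol, power, bandwidth, num_packets=10)
print(placement, total_time(placement, task, vol, power, bandwidth, 10))
```

In this example the result `[1, 2, 3]` means f1 runs on C1, f2 on C2, and f3 on C3. Being greedy, the procedure commits to each cut without revisiting it, so it is not guaranteed to find the placement with the smallest T.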


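Experimental Results

4 Configurations
[Figure: the four pipeline configurations; the numeric labels survive only as the runs "1 1 11 1", "1 1 10.1 0.5", "1 1 0.011 0.001", and "0.1 1 0.011 0.001"]

3 Applications
Virtual Microscope
Iso-Surface Rendering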

Used Applications

Virtual Microscope (Vmscope): an emulation of a microscope
input: a rectangular region, a resolution value
output: portion of the original image with certain resolution

Experimental Results: Virtual Microscope

3 queries
Q1 : 1 packet
Q2 : 4 packets
Q3 : 4500 packets

4 Algorithms
MIN_ONETRIP
MIN_BOTTLENECK
MIN_TOTAL
Exhaustive_Search

[Figure: four bar charts of execution time (in ms) for queries Q1, Q2, and Q3 under MIN_ONETRIP, MIN_BOTTLENECK, MIN_TOTAL, and Exha_Search]

Two observations

The performance variance between different algorithms is small.
Exha_Search does not always give the best placement:
the application characteristics are based on one packet's information
combining two filters as one saves copying cost


Used Applications

Iso-surface rendering (Iso)
input: a 3-D grid, a scalar value, a view screen with angle specified
output: a surface seen from certain angle, which captures points in the grid whose scalar value matches the given iso-surface value

Experimental Results: Iso

2 Implementations
ZBUF
ACTP

2 Datasets
small : 3 packets
large : 47 packets

4 Algorithms
MIN_ONETRIP
MIN_BOTTLENECK
MIN_TOTAL
Exhaustive_Search

[Figure: two bar charts for the ZBUF and ACTP implementations, one for the small dataset and one for the large dataset]

Experimental Results: Iso

The MIN_TOTAL algorithm gives the best placement for the small dataset.

The MIN_ONETRIP algorithm finds the best placement for the large dataset.

This application is very data-dependent!


[Figure: two charts, one for ZBUF and one for ACTP, plotted against the number of runs (3, 10, 100)]


Conclusion & Future Work

Our algorithms perform quite well

Future Work
To find more accurate characteristics of applications:
estimate of the performance change resulting from combining multiple atomic filters
estimate of the impact of data dependence

Thank you!
