Filter Decomposition for Supporting Coarse-grained Pipelined Parallelism
Wei Du, Gagan Agrawal
Ohio State University


Post on 13-Mar-2016


TRANSCRIPT

Page 1: Filter Decomposition for Supporting Coarse-grained Pipelined Parallelism

Filter Decomposition for Supporting Coarse-grained Pipelined Parallelism

Wei Du, Gagan Agrawal

Ohio State University

Page 2: Filter Decomposition for Supporting Coarse-grained Pipelined Parallelism

Distributed Data-Intensive Applications
Fast growing datasets
Remote data access
Distributed data storage
More connected world

[Figure: data repositories connected through the Internet]

Page 3: Filter Decomposition for Supporting Coarse-grained Pipelined Parallelism

Implementation: Local processing
Requirements: Huge Storage/Powerful Computer/Fast Connection

[Figure: data repositories connected through the Internet]

Page 4: Filter Decomposition for Supporting Coarse-grained Pipelined Parallelism

Implementation: Remote processing
Requirements: Complex Analysis at Data Centers

[Figure: data repositories connected through the Internet]

Page 5: Filter Decomposition for Supporting Coarse-grained Pipelined Parallelism

A Practical Solution
Our hypothesis: the coarse-grained pipelined execution model is a good match

[Figure: data sources connected through the Internet]

Page 6: Filter Decomposition for Supporting Coarse-grained Pipelined Parallelism

Coarse-Grained Pipelined Execution

Definition: Computations associated with an application are carried out in several stages, which are executed on a pipeline of computing units.

Example: K-nearest Neighbor (KNN)
Given a 3-D range R = <(x1, y1, z1), (x2, y2, z2)> and a point p = (a, b, c), we want to find the nearest K neighbors of p within R.

[Figure: two pipeline stages, Range_query followed by Find the K-nearest neighbors]
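The two KNN stages can be sketched as a pair of Python functions, one per pipeline stage. This is only an illustration of the staging idea: the function names, the sample points, and the use of Euclidean distance are assumptions, not taken from the slides.

```python
import heapq
from math import dist


def range_query(points, lo, hi):
    """Stage 1: keep only the points inside the 3-D range R = <lo, hi>."""
    for q in points:
        if all(l <= c <= h for l, c, h in zip(lo, q, hi)):
            yield q


def k_nearest(points, p, k):
    """Stage 2: return the k points closest to p, nearest first."""
    return heapq.nsmallest(k, points, key=lambda q: dist(p, q))


# Hypothetical sample data: stage 1 streams its output into stage 2.
data = [(0, 0, 0), (1, 1, 1), (2, 2, 2), (5, 5, 5), (9, 9, 9)]
in_range = range_query(data, lo=(0, 0, 0), hi=(5, 5, 5))
result = k_nearest(in_range, p=(1, 1, 2), k=2)
```

Because stage 1 is a generator, stage 2 consumes points as they are produced, which mirrors how each stage could run on a different computing unit.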

Page 7: Filter Decomposition for Supporting Coarse-grained Pipelined Parallelism

Challenges
Computation associated with an application needs to be decomposed into stages
Decomposition decisions are dependent on the execution environment
Generating code for each stage (SC03)
Other performance issues for the pipelined execution (ICPP04)
Adapting to the dynamic execution environment (SC04)

Page 8: Filter Decomposition for Supporting Coarse-grained Pipelined Parallelism

RoadMap
Filter Decomposition Problem
MIN_ONETRIP Algorithm
MIN_BOTTLENECK Algorithm
MIN_TOTAL Algorithm
Experimental Results
Related Work
Conclusion

Page 9: Filter Decomposition for Supporting Coarse-grained Pipelined Parallelism

Filter Decomposition

[Figure: a computation pipeline of computing units C1, C2, ..., Cm-1, Cm connected by links L1, ..., Lm-1; atomic filters f1, f2, ..., fn-1, fn; and two example placements of the filters on the computing units]

Page 10: Filter Decomposition for Supporting Coarse-grained Pipelined Parallelism

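Filter Decomposition

[Figure: a computation pipeline of computing units C1, C2, ..., Cm-1, Cm connected by links L1, ..., Lm-1, and atomic filters f1, f2, ..., fn-1, fn]

Goal: Find a placement
$p(f_1, f_2, \ldots, f_n) = (F_1, F_2, \ldots, F_m)$, where $F_i = f_{i_1}, f_{i_1+1}, \ldots, f_{i_k}$ ($1 \le i_1, i_k \le n$, $1 \le i \le m$), such that the predicted execution time is minimal.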

Page 11: Filter Decomposition for Supporting Coarse-grained Pipelined Parallelism

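Cost Model

[Figure: filters f1, f2, f3, f4 placed on computing units C1, C2, C3 connected by links L1, L2]

Bottleneck stage: the b-th stage, the slowest stage in the pipeline

Execution time for N packets (here the bottleneck is C2):
$$T = T(C_1) + T(L_1) + N \cdot T(C_2) + T(L_2) + T(C_3) = \sum_i T_i + (N-1) \cdot T_b$$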
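The cost model can be checked numerically with a few lines of Python. The sketch follows the expanded form T(C1)+T(L1)+N*T(C2)+T(L2)+T(C3), in which the bottleneck stage is counted once per packet and every other stage once. The function name and the stage times are made up for illustration, and N is assumed to be the number of packets.

```python
def execution_time(stage_times, num_packets):
    """Predicted time for num_packets packets through the pipeline.

    stage_times lists the per-packet time of every stage in order,
    computing units and links alike, e.g. [T(C1), T(L1), T(C2), T(L2), T(C3)].
    """
    bottleneck = max(stage_times)  # T_b, the slowest stage
    # Every stage once for the first packet, then one bottleneck
    # interval for each of the remaining packets.
    return sum(stage_times) + (num_packets - 1) * bottleneck


# Hypothetical stage times; C2 (5.0) is the bottleneck.
stages = [2.0, 1.0, 5.0, 1.0, 3.0]
total = execution_time(stages, num_packets=10)
```

With these numbers the result equals 2 + 1 + 10*5 + 1 + 3, matching the expanded form term by term.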

Page 12: Filter Decomposition for Supporting Coarse-grained Pipelined Parallelism

Three Algorithms

MIN_ONETRIP Algorithm: dynamic programming algorithm to minimize $\sum_i T_i$

MIN_BOTTLENECK Algorithm: dynamic programming algorithm to minimize $T_b$

MIN_TOTAL Algorithm: greedy algorithm that tries to minimize $T$

$$T = \sum_i T_i + (N-1) \cdot T_b$$
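For small pipelines, a brute-force baseline can enumerate every contiguous placement and evaluate T directly. The experiments later compare against an Exhaustive_Search, but the slides do not describe its implementation, so the sketch below is an assumption throughout: computation cost is taken as Task/P, communication cost as Vol/B, `vol[0]` stands for the input data volume, and the bottleneck stage is counted once per packet and every other stage once.

```python
from itertools import combinations_with_replacement


def exhaustive_search(task, vol, power, bandwidth, num_packets):
    """Try every contiguous placement of f_1..f_n on C_1..C_m.

    task[i]: work of filter f_i (1-based; task[0] unused)
    vol[i]: output volume of f_i; vol[0] is the input volume (assumption)
    power[j]: power of unit C_j (1-based; power[0] unused)
    bandwidth[j]: bandwidth of link L_j (1-based; bandwidth[0] unused)
    Returns (best predicted time, cut points); cut j is the number of
    filters completed on units C_1..C_j.
    """
    n, m = len(task) - 1, len(power) - 1
    best_time, best_cuts = float("inf"), None
    for cuts in combinations_with_replacement(range(n + 1), m - 1):
        bounds = (0,) + cuts + (n,)
        stages = []
        for j in range(1, m + 1):
            # Unit C_j runs filters bounds[j-1]+1 .. bounds[j].
            work = sum(task[bounds[j - 1] + 1: bounds[j] + 1])
            stages.append(work / power[j])
            if j < m:
                # Link L_j carries the output of the last filter run so far.
                stages.append(vol[bounds[j]] / bandwidth[j])
        t = sum(stages) + (num_packets - 1) * max(stages)
        if t < best_time:
            best_time, best_cuts = t, cuts
    return best_time, best_cuts


# Hypothetical instance: two filters, two units, one link.
task = [None, 4, 6]
vol = [8, 2, 1]
power = [None, 1, 2]
bandwidth = [None, 2]
one_packet = exhaustive_search(task, vol, power, bandwidth, num_packets=1)
ten_packets = exhaustive_search(task, vol, power, bandwidth, num_packets=10)
```

The number of placements grows combinatorially with n and m, which is why the cheaper algorithms listed above are of interest.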

Page 13: Filter Decomposition for Supporting Coarse-grained Pipelined Parallelism

Filter Decomposition: MIN_ONETRIP

[Figure: candidate placements of the last filters fn-1 and fn on computing units Cm-2, Cm-1, Cm connected by links Lm-2, Lm-1]

Goal: minimize time spent by one packet on the pipeline

Page 14: Filter Decomposition for Supporting Coarse-grained Pipelined Parallelism

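Filter Decomposition: MIN_ONETRIP

[Figure: computing units Cm-2, Cm-1, Cm connected by links Lm-2, Lm-1]

$T[i,j]$: min cost of doing computations $f_1, \ldots, f_i$ on computing units $C_1, \ldots, C_j$, where the results of $f_i$ are on $C_j$.

$$T[i,j] = \min \begin{cases} T[i-1,j] + \mathrm{Cost\_comp}(P(C_j), \mathrm{Task}(f_i)) \\ T[i,j-1] + \mathrm{Cost\_comm}(B(L_{j-1}), \mathrm{Vol}(f_i)) \end{cases}$$

Goal: $T[n,m]$
Cost: $O(mn)$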
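The MIN_ONETRIP recurrence translates into a short dynamic program. The sketch below fills in details the slides leave open, so treat them as assumptions: Cost_comp is taken as Task/P, Cost_comm as Vol/B, `vol[0]` stands for the input data volume, and the base case is T[0,1] = 0 (the input starts on C1). All names are hypothetical.

```python
def min_onetrip(task, vol, power, bandwidth):
    """Minimum time for one packet to traverse the pipeline.

    task[i]: work of filter f_i (1-based; task[0] unused)
    vol[i]: output volume of f_i; vol[0] is the input volume (assumption)
    power[j]: power of unit C_j (1-based; power[0] unused)
    bandwidth[j]: bandwidth of link L_j (1-based; bandwidth[0] unused)
    """
    n, m = len(task) - 1, len(power) - 1
    inf = float("inf")
    # T[i][j]: min cost of running f_1..f_i with the results of f_i on C_j.
    T = [[inf] * (m + 1) for _ in range(n + 1)]
    T[0][1] = 0.0  # assumed base case: input data already on C_1
    for i in range(n + 1):
        for j in range(1, m + 1):
            if i > 0:
                # f_i is computed on C_j.
                T[i][j] = min(T[i][j], T[i - 1][j] + task[i] / power[j])
            if j > 1:
                # Output of f_i is shipped over L_{j-1} to C_j.
                T[i][j] = min(T[i][j], T[i][j - 1] + vol[i] / bandwidth[j - 1])
    return T[n][m]


# Hypothetical instance: two filters, two units, one link.
onetrip = min_onetrip(
    task=[None, 4, 6], vol=[8, 2, 1], power=[None, 1, 2], bandwidth=[None, 2]
)
```

Each table entry is computed in constant time, which gives the O(mn) cost stated on the slide. In this instance the best choice runs f1 on C1, ships its output, and runs f2 on C2 (4 + 1 + 3).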

Page 15: Filter Decomposition for Supporting Coarse-grained Pipelined Parallelism

Filter Decomposition: MIN_BOTTLENECK

[Figure: candidate groupings of filters f1, ..., fn on computing units Cm-2, Cm-1, Cm connected by links Lm-2, Lm-1]

Goal: minimize time spent at the bottleneck stage

Page 16: Filter Decomposition for Supporting Coarse-grained Pipelined Parallelism

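Filter Decomposition: MIN_BOTTLENECK

$N[i,j]$: min cost of bottleneck stage for computing $f_1, \ldots, f_i$ on computing units $C_1, \ldots, C_j$, where the results of $f_i$ are on $C_j$.

$$N[i,j] = \min \begin{cases} \max\{\, N[i,j-1],\ \mathrm{Cost\_comm}(B(L_{j-1}), \mathrm{Vol}(f_i)) \,\} \\ \max\{\, N[i-1,j-1],\ \mathrm{Cost\_comm}(B(L_{j-1}), \mathrm{Vol}(f_{i-1})),\ \mathrm{Cost\_comp}(P(C_j), \mathrm{Task}(f_i)) \,\} \\ \quad\ldots \\ \max\{\, N[1,j-1],\ \mathrm{Cost\_comm}(B(L_{j-1}), \mathrm{Vol}(f_1)),\ \mathrm{Cost\_comp}(P(C_j), \mathrm{Task}(f_2) + \ldots + \mathrm{Task}(f_i)) \,\} \end{cases}$$

Cost: $O(mn^2)$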
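The MIN_BOTTLENECK recurrence can be sketched as a dynamic program over split points k: filters f_1..f_k finish on C_{j-1}, and f_{k+1}..f_i run on C_j. Several details are assumptions rather than part of the slides: Cost_comp is taken as Task/P, Cost_comm as Vol/B, `vol[0]` stands for the input data volume, and the extra case k = 0 (all of f_1..f_i on C_j) is allowed in addition to the cases listed on the slide.

```python
def min_bottleneck(task, vol, power, bandwidth):
    """Minimum achievable time of the slowest pipeline stage.

    task[i]: work of filter f_i (1-based; task[0] unused)
    vol[i]: output volume of f_i; vol[0] is the input volume (assumption)
    power[j]: power of unit C_j (1-based; power[0] unused)
    bandwidth[j]: bandwidth of link L_j (1-based; bandwidth[0] unused)
    """
    n, m = len(task) - 1, len(power) - 1
    # prefix[i] = Task(f_1) + ... + Task(f_i)
    prefix = [0.0] * (n + 1)
    for i in range(1, n + 1):
        prefix[i] = prefix[i - 1] + task[i]
    # N[i][j]: min bottleneck cost for f_1..f_i with results of f_i on C_j.
    N = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        N[i][1] = prefix[i] / power[1]  # everything so far runs on C_1
    for j in range(2, m + 1):
        for i in range(n + 1):
            best = float("inf")
            for k in range(i + 1):
                comm = vol[k] / bandwidth[j - 1]           # link L_{j-1}
                comp = (prefix[i] - prefix[k]) / power[j]  # work on C_j
                best = min(best, max(N[k][j - 1], comm, comp))
            N[i][j] = best
    return N[n][m]


# Hypothetical instance: two filters, two units, one link.
bottleneck = min_bottleneck(
    task=[None, 4, 6], vol=[8, 2, 1], power=[None, 1, 2], bandwidth=[None, 2]
)
```

The prefix sums make each candidate split a constant-time evaluation, so the three nested loops give the O(mn^2) cost stated on the slide.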

Page 17: Filter Decomposition for Supporting Coarse-grained Pipelined Parallelism

[Figure: atomic filters f1, f2, f3, f4, f5 and a pipeline of computing units C1, C2, C3, C4 connected by links L1, L2, L3]

Estimated Cost
f1: T1
f1, f2: T2
f1 - f3: T3
f1 - f4: T4
Min{T1 … T4} = T2

To minimize the predicted execution time T

Filter Decomposition: MIN_TOTAL

Page 18: Filter Decomposition for Supporting Coarse-grained Pipelined Parallelism

RoadMap
Filter Decomposition Problem
MIN_ONETRIP Algorithm
MIN_BOTTLENECK Algorithm
MIN_TOTAL Algorithm
Experimental Results
Related Work
Conclusion

Page 19: Filter Decomposition for Supporting Coarse-grained Pipelined Parallelism

Experimental Results

4 Configurations
3 Applications: Virtual Microscope, Iso-Surface Rendering

[Figure: the four pipeline configurations, with extracted labels "1 1 11 1", "1 1 10.1 0.5", "1 1 0.011 0.001", "0.1 1 0.011 0.001"]

Page 20: Filter Decomposition for Supporting Coarse-grained Pipelined Parallelism

Used Applications

Virtual Microscope (Vmscope): an emulation of a microscope
input: a rectangular region, a resolution value
output: portion of the original image with certain resolution

Page 21: Filter Decomposition for Supporting Coarse-grained Pipelined Parallelism

Experimental Results: Virtual Microscope

3 queries
  Q1: 1 packet
  Q2: 4 packets
  Q3: 4500 packets

4 Algorithms
  MIN_ONETRIP
  MIN_BOTTLENECK
  MIN_TOTAL
  Exhaustive_Search

Page 22: Filter Decomposition for Supporting Coarse-grained Pipelined Parallelism

Experimental Results: Virtual Microscope

[Figure: four bar charts of Execution Time (in ms) for queries Q1, Q2, Q3, comparing MIN_ONETRIP, MIN_BOTTLENECK, MIN_TOTAL, and Exha_Search]

Page 23: Filter Decomposition for Supporting Coarse-grained Pipelined Parallelism

Experimental Results: Virtual Microscope

Two observations
The performance variance between different algorithms is small
The Exha_Search does not always give the best placement
  characteristics based on one packet information
  combining two filters as one, saving copying cost

Page 24: Filter Decomposition for Supporting Coarse-grained Pipelined Parallelism

Used Applications

Iso-surface rendering (Iso)
input: a 3-D grid, a scalar value, a view screen with angle specified
output: a surface seen from certain angle, which captures points in the grid whose scalar value matches the given iso-surface value

Page 25: Filter Decomposition for Supporting Coarse-grained Pipelined Parallelism

Experimental Results: Iso

2 Implementations
  ZBUF
  ACTP

2 Datasets
  small: 3 packets
  large: 47 packets

4 Algorithms
  MIN_ONETRIP
  MIN_BOTTLENECK
  MIN_TOTAL
  Exhaustive_Search

Page 26: Filter Decomposition for Supporting Coarse-grained Pipelined Parallelism

Experimental Results: Iso

[Figure: two bar charts of Execution Time (in ms) for the ZBUF and ACTP implementations, one for the small dataset and one for the large dataset, comparing MIN_ONETRIP, MIN_BOTTLENECK, MIN_TOTAL, and Exha_Search]

Page 27: Filter Decomposition for Supporting Coarse-grained Pipelined Parallelism

Experimental Results: Iso

The MIN_TOTAL algorithm gives the best placement for the small dataset

The MIN_ONETRIP algorithm finds the best placement for the large dataset

This application is very data-dependent!

Page 28: Filter Decomposition for Supporting Coarse-grained Pipelined Parallelism

Experimental Results: Iso

[Figure: two bar charts, one for ZBUF and one for ACTP, of Execution Time (in ms) against Number of Runs (3, 10, 100), comparing MIN_ONETRIP, MIN_BOTTLENECK, MIN_TOTAL, and Exha_Search]

Page 29: Filter Decomposition for Supporting Coarse-grained Pipelined Parallelism

Conclusion & Future Work

Our algorithms perform quite well

Future Work
To find more accurate characteristics of applications
  estimate of the performance change resulting from combining multiple atomic filters
  estimate of the impact of data dependence

Page 30: Filter Decomposition for Supporting Coarse-grained Pipelined Parallelism

Thank you!