
Page 1

KFusion

Simple Annotations for Optimized Data Flow

Liam Kiemele, Celina Berg, Aaron Gulliver, Yvonne Coady

University of Victoria

with thanks to Tim Mattson, Andrew Brownsword (Intel)

Page 2

Road Map

KFusion at work
    Motivation
    KFusion
Costs and benefits
    annotations, lines of code
    modularity, performance
Future work and conclusion
    explicit composition of computation around data flow

IWOCL 2013 Kiemele

Page 3

Parallel Hardware

Page 4

Good News and Bad News…

Parallelism
Added complexity
Optimization
Memory and Bandwidth

Modularity: Let's talk Libraries
    Details behind an API
    Optimize data access (prefetching, caching…)
    Better separation of concerns


Page 5

OpenCL Libraries

OpenCL (Open Computing Language), for CPUs and GPUs

At the heart of any given library will be kernels

Suppose we build an OpenCL Linear Algebra Library

    // Element-wise vector addition: each work-item handles one index.
    __kernel void add_vectors(__global float* sum, __global float* v1,
                              __global float* v2) {
        int i = get_global_id(0);
        sum[i] = v1[i] + v2[i];
    }


Page 6

What you get…

c = sqrt(add(square(x), square(y)));

[Diagram: four separate kernels: square, square, add, sqrt]


Page 7

What you get…

c = sqrt(add(square(x), square(y)));

Kernel   Operations   Memory Access         Cycles
square   1            load and store        804
square   1            load and store        804
add      1            2 loads and 1 store   804
sqrt     1            load and store        804
total    4            9                     3216
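The access counts in the table above can be checked with a small host-side model. The Python sketch below is not OpenCL and not part of KFusion: `CountingBuffer` and `run_unfused` are hypothetical names, each loop stands in for one kernel launch, and the first argument of every kernel is assumed to be its output, as in `add_vectors(sum, v1, v2)`. It counts loads and stores only and makes no attempt to model the cycle column.

```python
import math


class CountingBuffer:
    """List-backed stand-in for a __global buffer that counts accesses."""

    def __init__(self, values, counter):
        self.values = list(values)
        self.counter = counter  # shared dict: {"loads": int, "stores": int}

    def load(self, i):
        self.counter["loads"] += 1
        return self.values[i]

    def store(self, i, value):
        self.counter["stores"] += 1
        self.values[i] = value


# One function per library kernel; argument 0 is the output buffer (assumed).
def square(dst, src, i):
    dst.store(i, src.load(i) ** 2)


def add(dst, a, b, i):
    dst.store(i, a.load(i) + b.load(i))


def sqrt(dst, src, i):
    dst.store(i, math.sqrt(src.load(i)))


def run_unfused(xs, ys):
    """Run square, square, add, sqrt as four separate passes over memory."""
    counter = {"loads": 0, "stores": 0}
    x = CountingBuffer(xs, counter)
    y = CountingBuffer(ys, counter)
    c = CountingBuffer([0.0] * len(xs), counter)
    n = len(xs)
    for i in range(n):      # kernel launch 1
        square(x, x, i)
    for i in range(n):      # kernel launch 2
        square(y, y, i)
    for i in range(n):      # kernel launch 3
        add(c, x, y, i)
    for i in range(n):      # kernel launch 4
        sqrt(c, c, i)
    return c.values, counter


result, accesses = run_unfused([3.0], [4.0])
print(result, accesses)
```

For a single element the model performs 5 loads and 4 stores, which is the 9 memory accesses in the total row.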

Page 8

What you WANT!

c = sqrt(add(square(x), square(y)));

[Diagram: nodes x, y, add, sqrt]


Page 9

What you WANT!

c = sqrt(add(square(x), square(y)));

Kernel   Operations   Memory Access         Cycles
square   1            load and store        804
square   1            load and store        804
add      1            2 loads and 1 store   804
sqrt     1            load and store        804
total    4            9                     3216

Kernel   Operations   Memory Access   Cycles
fu       1            load            404
         1            load            404
         1            -               4
         1            store           404
total    4            3               1216
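The fused row counts can be illustrated the same way. The following Python sketch models what a single fused kernel might do per element, under the assumption that the intermediate squares and the sum stay in private variables and only `c` is written back; `run_fused` is a hypothetical name, and this is a model of the idea rather than KFusion's generated code.

```python
import math


def run_fused(xs, ys):
    """Model one fused pass: two loads, four operations, one store per element."""
    counter = {"loads": 0, "stores": 0}
    c = [0.0] * len(xs)
    for i in range(len(xs)):
        x_i = xs[i]                 # load x once
        counter["loads"] += 1
        y_i = ys[i]                 # load y once
        counter["loads"] += 1
        sx = x_i * x_i              # square, kept in a local
        sy = y_i * y_i              # square, kept in a local
        total = sx + sy             # add, kept in a local
        c[i] = math.sqrt(total)     # sqrt, then the only store
        counter["stores"] += 1
    return c, counter


result, accesses = run_fused([3.0, 6.0], [4.0, 8.0])
print(result, accesses)
```

Per element this is 2 loads and 1 store, matching the 3 accesses in the fused table. One design consequence worth noting: in this model `x` and `y` are never overwritten, whereas the in-place `square(x, x)` of the unfused version would modify them.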

Page 10

Two Choices

Modular Implementation
    Reusable
    Easy to maintain and develop
    Individual kernel optimization

Monolithic Implementation
    Performance
    Allows optimizations across what would otherwise be module boundaries

Can we do both?

Page 11

Introducing KFusion

Application File   Library File      Kernel File
square(…)          float* square     kernel square
square(…)
add(…)             float* add …      kernel add …
sqrt(…)            float* sqrt …     kernel sqrt …

Kernel   Operations   Memory Access         Cycles
square   1            load and store        804
square   1            load and store        804
add      1            2 loads and 1 store   804
sqrt     1            load and store        804
total    4            9                     3216

Page 12

After KFusion…

Application File   Library File      Kernel File
square(…)          void square …     kernel square
square(…)
add(…)             void add …        kernel add …
sqrt(…)            void sqrt …       kernel sqrt …

New Call:      c = fu(…);
New Function:  float* fu(…)
New Kernel:    kernel fu(…)

Kernel   Operations   Memory Access   Cycles
fu       1            load            404
         1            load            404
         1            -               4
         1            store           404
total    4            3               1216

Page 13

It works!

[Chart: time (ms, axis 0 to 0.08) vs vector size (1024, 2048, 4096, 8192); series: Before KFusion, After KFusion]

Page 14

Road Map

KFusion at work
    what and how…why!
Costs and benefits
    annotations, lines of code
    modularity, performance
Future work and conclusion
    explicit composition of computation around data flow

Page 15

Costs

Annotations
    application hints
    library synchronization
    kernel data flow for compositions
Preprocessor
    build dependency graph
    source-to-source transformation
        loop fusion
        deforestation

Page 16

Annotations

application:

    #pragma start fuse
    square(x, x);
    square(y, y);
    add(c, x, y);
    sqrt(c, c);
    // i.e. c = sqrt(add(square(x), square(y)));
    #pragma end fuse

Library:

    #pragma sync out
    public void dot_product(double result, vector x);

    #pragma sync in
    public void matrix_vector_mult(vector b, Matrix A, vector x);
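To make the application-side annotation concrete, here is a minimal Python sketch of how a preprocessor might collect the calls between `#pragma start fuse` and `#pragma end fuse`. It is an illustration only and not the KFusion preprocessor: `fuse_regions` and the regular expression are hypothetical, the `init` and `report` calls in the sample are made up to show that code outside the region is ignored, and only simple one-call-per-line statements are handled.

```python
import re

# Matches a single call such as "add(c, x, y);" with no nested parentheses.
CALL = re.compile(r"^\s*(\w+)\s*\(([^()]*)\)\s*;?\s*$")


def fuse_regions(source):
    """Return one list of (name, args) tuples per annotated fuse region."""
    regions = []
    current = None  # None while outside a fuse region
    for line in source.splitlines():
        stripped = line.strip()
        if stripped == "#pragma start fuse":
            current = []
        elif stripped == "#pragma end fuse":
            if current is not None:
                regions.append(current)
            current = None
        elif current is not None:
            match = CALL.match(line)
            if match:
                args = [a.strip() for a in match.group(2).split(",") if a.strip()]
                current.append((match.group(1), args))
    return regions


APP = """
init(x);
#pragma start fuse
square(x, x);
square(y, y);
add(c, x, y);
sqrt(c, c);
#pragma end fuse
report(c);
"""

for name, args in fuse_regions(APP)[0]:
    print(name, args)
```

The result is an ordered call list per region, which is the natural input for building a dependency graph.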

Page 17

Annotations

kernel: add

    __kernel void add_vectors(__global float* sum, __global float* v1,
                              __global float* v2) {
    #pragma kload
    {
        int i = get_global_id(0);
        float arg1 = v1[i];
        float arg2 = v2[i];
        float s;
    }
        s = arg1 + arg2;
    #pragma kstore
    {
        sum[i] = s;
    }
    }
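The `kload` and `kstore` annotations divide a kernel body into a load section, a computation, and a store section. As a rough sketch of how a tool could recover those three parts, the Python below splits an annotated body by brace matching. `split_kernel` is a hypothetical helper, it assumes exactly one `#pragma kload` block followed by one `#pragma kstore` block, and it does not reflect how the KFusion preprocessor is actually implemented.

```python
def split_kernel(body):
    """Split an annotated kernel body into (load, compute, store) text."""

    def block_after(pragma, start):
        # Locate the pragma, then the brace-delimited block that follows it.
        at = body.index(pragma, start)
        open_brace = body.index("{", at)
        depth = 0
        for pos in range(open_brace, len(body)):
            if body[pos] == "{":
                depth += 1
            elif body[pos] == "}":
                depth -= 1
                if depth == 0:
                    return at, open_brace, pos
        raise ValueError("unbalanced braces after " + pragma)

    _, load_open, load_close = block_after("#pragma kload", 0)
    store_at, store_open, store_close = block_after("#pragma kstore", load_close)

    load = body[load_open + 1:load_close].strip()
    compute = body[load_close + 1:store_at].strip()   # text between the blocks
    store = body[store_open + 1:store_close].strip()
    return load, compute, store


# Body of add_vectors from the slide (text inside the kernel's outer braces).
BODY = """
#pragma kload
{
    int i = get_global_id(0);
    float arg1 = v1[i];
    float arg2 = v2[i];
    float s;
}
    s = arg1 + arg2;
#pragma kstore
{
    sum[i] = s;
}
"""

load, compute, store = split_kernel(BODY)
print(compute)
print(store)
```

Separating the three sections is what lets a fusion step keep the computation while dropping loads and stores of values that another fused kernel already holds.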

Page 18

Dependency Graph

[Diagram: square(x) and square(y) feed add(c,x,y) through x and y; add(c,x,y) feeds sqrt(c) through c]
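A dependency graph like the one on this slide can be derived from the ordered call list. The Python sketch below is one simple way to do it, assuming the first argument of each call is its output buffer and tracking only read-after-write dependencies; `build_dependency_graph` is a hypothetical name, not a KFusion function.

```python
def build_dependency_graph(calls):
    """Return edges (producer index, consumer index, buffer) for a call list.

    calls: list of (name, args) where args[0] is the output buffer (assumed).
    """
    last_writer = {}  # buffer name -> index of the call that last wrote it
    edges = []
    for idx, (_name, args) in enumerate(calls):
        output, inputs = args[0], args[1:]
        for buf in dict.fromkeys(inputs):  # de-duplicate, keep order
            if buf in last_writer:
                edges.append((last_writer[buf], idx, buf))
        last_writer[output] = idx
    return edges


CALLS = [
    ("square", ["x", "x"]),
    ("square", ["y", "y"]),
    ("add", ["c", "x", "y"]),
    ("sqrt", ["c", "c"]),
]

for producer, consumer, buf in build_dependency_graph(CALLS):
    print(CALLS[producer][0], "->", CALLS[consumer][0], "via", buf)
```

For the four calls in the example this yields three edges: both `square` calls feed `add` through `x` and `y`, and `add` feeds `sqrt` through `c`.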

Page 19

Transformation…

[Diagram: square(x) and square(y) feed add_sqrt(c,x,y) through x and y; output c]

Page 20

Replacement Kernel!

[Diagram: a single kernel fu(c,x,y); inputs x and y, output c]
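The end result of the transformation is a single kernel `fu(c,x,y)`. To show the kind of source-to-source step involved, here is a small Python sketch that emits a fused element-wise kernel from a call list. Everything beyond the names `fu`, `square`, `add`, and `sqrt` is an assumption for illustration: the `fuse` function, the expression templates in `BODIES` (standing in for each kernel's computation section), and the generated variable names `x_in`, `t0`, and so on. It is not the algorithm used by the KFusion preprocessor.

```python
def fuse(calls, bodies, outputs):
    """Emit OpenCL-style source for one fused element-wise kernel.

    calls   : list of (name, args); args[0] is the output buffer (assumed)
    bodies  : kernel name -> expression template over {0}, {1}, ...
    outputs : buffers that must be written back to global memory
    """
    held = {}  # buffer -> private variable currently holding its value
    loads, compute = [], []
    for n, (name, args) in enumerate(calls):
        operands = []
        for buf in args[1:]:
            if buf not in held:  # first use: load from global memory once
                held[buf] = buf + "_in"
                loads.append("float {0} = {1}[i];".format(held[buf], buf))
            operands.append(held[buf])
        tmp = "t{0}".format(n)
        expr = bodies[name].format(*operands)
        compute.append("float {0} = {1};".format(tmp, expr))
        held[args[0]] = tmp  # later reads reuse the private variable
    stores = ["{0}[i] = {1};".format(buf, held[buf]) for buf in outputs]
    params = sorted({arg for _, args in calls for arg in args})
    signature = ", ".join("__global float* " + p for p in params)
    lines = ["int i = get_global_id(0);"] + loads + compute + stores
    return ("__kernel void fu(" + signature + ") {\n    "
            + "\n    ".join(lines) + "\n}")


BODIES = {"square": "{0} * {0}", "add": "{0} + {1}", "sqrt": "sqrt({0})"}
CALLS = [
    ("square", ["x", "x"]),
    ("square", ["y", "y"]),
    ("add", ["c", "x", "y"]),
    ("sqrt", ["c", "c"]),
]

print(fuse(CALLS, BODIES, outputs=["c"]))
```

The generated kernel touches global memory three times per element (`x[i]`, `y[i]`, `c[i]`), which lines up with the 3 memory accesses reported for `fu`. Choosing which buffers go in `outputs` is the point where a real tool would need the synchronization hints from the library annotations.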

Page 21

AOSD 2013 Kiemele

Annotations

[Chart: axis 0 to 25; categories: Image Manipulation, Linear Algebra, Physics Engine; series: Application, Kernel/Library]

Page 22

Benefits

[Chart: Lines of Code Generated (axis 0 to 1200); categories: Image Manipulation, Linear Algebra, Physics Engine; series: Library, Kernel, Fused Library, Fused Kernel]

Page 23

Performance

[Chart: Speedup of Fused Kernels over Unfused Kernels vs Image Size; x axis: Image Size (pixels): 512, 1024, 2048, 4096; y axis: Relative performance increase after fusion (0.5 to 4.5); series: Low Dependency, Medium Dependency, High Dependency, High Dependency Low Fusion]

Page 24

Performance

[Chart: Speedup of Automatic Fusion over Manual Fusion vs Image Size; x axis: Image Size (pixels): 512, 1024, 2048, 4096; y axis: Relative Speedup (0 to 1.2); series: Low Dependency, Medium Dependency, High Dependency, High Dependency Low Fusion]

Page 25

Roofline Analysis of Performance

Peak Actual GFlops = minimum(Bandwidth x flops/byte, Peak Performance)

Three Linear Algebra Scenarios
    c = sqrt(a^2 + b^2)
    d = sqrt((x1 - x2)^2 + (y1 - y2)^2)
    Start of conjugate gradient
        r = Ax - b
        p = r
        R2 = r*r
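The roofline bound above is easy to evaluate for the first scenario. The Python sketch below applies the formula to c = sqrt(a^2 + b^2) using the counts from the earlier tables (4 operations per element, 9 memory accesses unfused, 3 fused) and 4-byte floats. The bandwidth and peak figures are made-up placeholders, not measurements from the talk, and `attainable_gflops` and `intensity` are hypothetical helper names.

```python
def attainable_gflops(bandwidth_gb_s, flops_per_byte, peak_gflops):
    """Roofline bound: min(bandwidth x flops/byte, peak performance)."""
    return min(bandwidth_gb_s * flops_per_byte, peak_gflops)


def intensity(flops, accesses, bytes_per_access=4):
    """Arithmetic intensity in flops per byte of memory traffic."""
    return flops / (accesses * bytes_per_access)


# Hypothetical device figures, chosen only to make the arithmetic visible.
BANDWIDTH_GB_S = 100.0
PEAK_GFLOPS = 1000.0

# c = sqrt(a^2 + b^2): 4 operations per element in both versions.
unfused = attainable_gflops(BANDWIDTH_GB_S, intensity(4, 9), PEAK_GFLOPS)
fused = attainable_gflops(BANDWIDTH_GB_S, intensity(4, 3), PEAK_GFLOPS)

print("unfused bound: %.1f GFLOPS" % unfused)
print("fused bound:   %.1f GFLOPS" % fused)
```

As long as both versions stay on the bandwidth-limited side of the roofline, cutting memory traffic from 9 accesses to 3 raises the bound by a factor of three, regardless of the placeholder bandwidth chosen.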

Page 26

c = sqrt(a^2 + b^2)

Page 27

d = sqrt((x1 - x2)^2 + (y1 - y2)^2)

Page 28

Conjugate Gradient

Page 29

Road Map

KFusion at work
    what and how…why!
Costs and benefits
    annotations, lines of code
    modularity, performance
Future work and conclusion
    explicit composition of computation around data flow

Page 30

Future Work

Tools
    comprehension and visualization
    emulation
    performance testing
Combine with other approaches
    Optimizing compilers
    Code generators

    kfuse {
        calls
    }

    __kernel void k(…) {
        kload { … }
        computation
        kstore { … }
    }

Page 31

Conclusion

KFusion is a first step towards
    explicit, flexible control
    allowing optimizations between modules
    separation of concerns

github.com/4Liamk/KFusion/wiki
