OpenMP on the IBM Cell BE
15th meeting of ScicomP
Barcelona Supercomputing Center (BSC)
May 18-22 2009
Marc Gonzalez Tallada
Index
• OpenMP programming and code transformations
• Tiling and software cache transformations
• Sources of overheads
• Performance
• Loop-level parallelism
• Double buffering
• Combining OpenMP and SIMD parallelism
Introduction
• The Cell BE Architecture: a multi-core design that mixes two architectures
  • One core based on the PowerPC architecture (PPE)
  • Eight cores based on the Synergistic Processor Element (SPE)
• SPEs are provided with local stores
  • Load and store instructions on an SPE can address only the Local Store
  • Data transfer to/from main memory is explicitly performed under software control
[Figure: Cell BE block diagram. The Power Processor Element (PPU with L1/L2 and PXU) and eight Synergistic Processor Elements (each an SPU with SXU, 256 KB Local Store, and MFC) are connected by the EIB (up to 96 bytes/cycle; 16 bytes/cycle per port); the MIC attaches dual XDR memory and the BIC attaches FlexIO.]
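Since data movement is explicit, here is a minimal sketch of a single blocking DMA transfer on an SPE using the Cell SDK MFC intrinsics (buffer size and tag id are illustrative):

#include <spu_mfcio.h>

/* Local-store buffer; DMA requires 16-byte alignment,
   and 128-byte alignment gives the best performance. */
static float buf[1024] __attribute__((aligned(128)));

void fetch(unsigned long long ea)   /* ea: effective address in main memory */
{
    const unsigned int tag = 5;               /* any tag id in 0..31 */

    /* Program the transfer: main memory -> local store. */
    mfc_get(buf, ea, sizeof(buf), tag, 0, 0);

    /* Block until the tagged transfer completes. */
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();

    /* buf is now valid; loads/stores can only touch the local store. */
}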
Cell programmability
• Transform the original code
  • Allocate buffers in the local store
  • Introduce DMA operations within the code
  • Synchronization statements
  • Translate from the original address space to the local address space
• Manual solution
  • Very optimized codes, but at the cost of programmability
  • Manual SIMD coding
  • Overlap of communication with computation
• Automatic solution
  • Tiling, double buffering
    • A good solution for regular applications
    • Needs considerable information at compile time
  • Software cache
    • Performance is usually limited by the information available at compile time
    • Very difficult to generate code that overlaps computation with communication
PERFORMANCE but not PROGRAMMABILITY (manual solution)
PROGRAMMABILITY but not PERFORMANCE (automatic solution)
Can the Cell BE be programmed as a cache-based multi-core?
• OpenMP programming model
  • Parallel region
  • Variable scoping: PRIVATE, SHARED, THREADPRIVATE ...
  • Worksharing constructs: DO, SECTIONS, SINGLE
  • Synchronization constructs: CRITICAL, BARRIER, ATOMIC
  • Memory consistency: FLUSH
#pragma omp parallel private(c,i) shared(a, b, d)
{
for (i=0; i<N; i++) c[i]= ...;
...
#pragma omp for schedule(static) reduction(+:s)
for (i=0; i<N; i++) {
a[i] = c[b[i]] + d[i];
s = s + a[i];
}
...
#pragma omp barrier
...
#pragma omp critical
{
s = s + c[0];
}
}
The hardware does not impose any restriction on the model!
The IBM Cell BE can be programmed as a cache-based multi-core.
Main problem to solve ...
• Transform the original code
  • Allocate buffers in the local store
  • Introduce DMA operations within the code
  • Synchronization statements
  • Translate from the original address space to the local address space
#pragma omp parallel private(c,i) shared(a, b, d)
{
for (i=0; i<N; i++) c[i]= ...;
...
#pragma omp for schedule(static) reduction(+:s)
for (i=0; i<N; i++) {
a[i] = c[b[i]] + d[i];
s = s + a[i];
}
...
#pragma omp barrier
...
#pragma omp critical
{
s = s + c[0];
}
}
• Compile-time predictable accesses: a[i], d[i], b[i], s
• Unpredictable access: c[b[i]]
→ Software cache + tiling techniques
Introduction
• Code transformation when there is poor information at compile time
  • Only the software cache is used
#pragma omp for schedule(static) reduction(+:s)
for (i=0; i<N; i++) {
a[i] = c[b[i]] + d[i];
s = s + a[i];
}
...
tmp_s = 0.0;
for (i=start;i<end;i++){
if (!HIT(h1, &d[i]))
MAP(h1, &d[i]);
if (!HIT(h2, &b[i]))
MAP(h2, &b[i]);
tmp01 = REF(h1, &d[i]);
tmp02 = REF(h2, &b[i]);
if (!HIT(h4, &c[tmp02]))
MAP(h4, &c[tmp02]);
tmp03 = REF(h4, &c[tmp02]);
if (!HIT(h3, &a[i]))
MAP(h3, &a[i]);
REF(h3, &a[i])=tmp03 + tmp01;
tmp_s = tmp_s + REF(h3, &a[i]);
}
atomic_add(s, tmp_s, ...);
omp_barrier();
Memory handler (h?): contains a pointer to the buffer in the local store.
HIT: executes the cache lookup and updates the memory handler.
REF: performs the address translation and the actual memory access.
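The slides do not show how these primitives are implemented; the following is a minimal sketch of what a HIT/REF pair might look like for a 4-way set-associative software cache (all names and the layout are illustrative, not the actual runtime):

#define LINE  128                 /* bytes per cache line        */
#define WAYS  4
#define SETS  128                 /* 4 x 128 x 128 B = 64 KB     */

typedef struct {
    unsigned int g_line[WAYS];    /* global line number (tag)    */
    char        *ls_line[WAYS];   /* its copy in the local store */
} cache_set_t;

static cache_set_t sets[SETS];

/* Memory handler: caches the result of the last lookup so REF can
 * translate addresses without searching the cache again. */
typedef struct { char *ls_base; unsigned int g_base; } handler_t;

/* HIT: cache lookup; on success, update the memory handler. */
static int HIT(handler_t *h, unsigned int g_addr)
{
    unsigned int line = g_addr / LINE;
    cache_set_t *s = &sets[line % SETS];
    for (int w = 0; w < WAYS; w++)
        if (s->g_line[w] == line) {
            h->ls_base = s->ls_line[w];
            h->g_base  = line * LINE;
            return 1;
        }
    return 0;   /* miss: MAP must choose a victim and program a DMA */
}

/* REF: translate a global address into the local-store copy. */
static float *REF(handler_t *h, unsigned int g_addr)
{
    return (float *)(h->ls_base + (g_addr - h->g_base));
}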
For strided memory references
• Enable compiler optimizations for memory references that expose a strided access pattern
  • Execute control code at buffer level, not at every memory instance
  • Maximize the overlap between computation and communication
  • Compute the number of iterations that can be executed before the buffer has to change (see the sketch below)
[Figure: the strided reference &a[i] is mapped onto one buffer in the local store.]
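A minimal sketch of the idea, assuming the handler tracks how much of the mapped buffer remains ahead of the current element (field names are illustrative):

/* Handler for a strided reference: [ls_next, ls_end) is the part of
 * the local-store buffer not yet consumed by the loop. */
typedef struct { char *ls_next, *ls_end; } stride_handler_t;

/* Number of iterations that can run before the buffer must change. */
static inline int avail_iters(const stride_handler_t *h, int stride)
{
    return (int)((h->ls_end - h->ls_next) / stride);
}

/* The control code clips the burst length with one call per reference,
 *   n = min(n, i + avail_iters(h1, sizeof(double)));
 * so lookup and DMA code run once per buffer, not once per access. */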
Hybrid code transformation
#pragma omp for schedule(static) reduction(+:s)
for (i=0; i<N; i++) {
a[i] = c[b[i]] + d[i];
s = s + a[i];
}
...
tmp_s = 0.0;
i=start;
while (i< end){
n = end;
if (!AVAIL(h1, &d[i]))
MMAP(h1, &d[i]);
n = min(n, i+AVAIL(h1, &d[i]));
if (!AVAIL(h2, &b[i]))
MMAP(h2, &b[i]);
n = min(n, i+AVAIL(h2, &b[i]));
if (!AVAIL(h3, &a[i]))
MMAP(h3, &a[i]);
n = min(n, i+AVAIL(h3, &a[i]));
HCONSISTENCY(n, h3);
HSYNC(h1, h2, h3);
start = i;
for (i=start;i<n;i++){
...
}
}
atomic_add(s, tmp_s, ...);
omp_barrier();
...
for (i=start;i<n;i++){
tmp01 = REF(h1, &d[i]);
tmp02 = REF(h2, &b[i]);
if (!HIT(h4, &c[tmp02]))
MAP(h4, &c[tmp02]);
tmp03 = REF(h4, &c[tmp02]);
REF(h3, &a[i])=tmp03 + tmp01;
tmp_s = tmp_s + REF(h3, &a[i]);
}
• Organize the LS in two storages:
  • High-locality buffers for predictable accesses
  • A software cache for unpredictable accesses
Execution model
...
tmp_s = 0.0;
i=0;
while (i< upper_bound){
n = N;
if (!AVAIL(h1, &d[i]))
MMAP(h1, &d[i]);
n = min(n, i+AVAIL(h1, &d[i]));
if (!AVAIL(h2, &b[i]))
MMAP(h2, &b[i]);
n = min(n, i+AVAIL(h2, &b[i]));
if (!AVAIL(h3, &a[i]))
MMAP(h3, &a[i]);
HCONSISTENCY(n, h3);
HSYNC(h1, h2, h3);
start = i;
for (i=start;i<n;i++){
...
}
}
atomic_add(s, tmp_s, ...);
omp_barrier();
(In the code above: control code, computation, and synchronization phases.)
• Loops execute in three different phases
  • Control code
    • Allocate buffers
    • Program DMA transfers
    • Consistency
  • Synchronize with the DMA
  • Execute a burst of computation
    • Might include some control code, DMA programming and synchronization
Compiler limitations: memory aliasing
• Compiler limitations
  • What if a, b, c or d are memory aliases? How can buffers be allocated consistently?
  • What if some element in a buffer is also referenced through the software cache?
• Memory aliasing
  • Avoid pointer usage
  • Avoid function calls: use inline annotations (see the sketch after the code below)
#pragma omp parallel private(c,i) shared(a, b, d)
{
for (i=0; i<N; i++) c[i]= ...;
...
#pragma omp for schedule(static) reduction(+:s)
for (i=0; i<N; i++) {
a[i] = c[b[i]] + d[i] + ...;
s = s + a[i];
}
...
#pragma omp barrier
...
#pragma omp critical
{
s = s + c[0];
}
}
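For illustration, a hedged sketch of how C99 restrict qualifiers plus inlining can provide the no-alias guarantee the transformation needs; this is one conventional technique, not necessarily the exact annotations the presented compiler expects:

/* With restrict the compiler may assume a, b, c and d never overlap,
 * so each reference can safely be given its own local-store buffer. */
static inline float kernel(float *restrict a, const int *restrict b,
                           const float *restrict c,
                           const float *restrict d, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++) {
        a[i] = c[b[i]] + d[i];
        s += a[i];
    }
    return s;
}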
Memory Consistency
• Maintain a relaxed consistency model according to the OpenMP memory model
• Based on atomicity and dirty bits
• When data in a buffer has to be evicted, the write-back process is composed of three steps:
  1. Atomic Read
  2. Merge
  3. Atomic Write
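A minimal sketch of the three-step eviction, assuming per-word dirty bits; atomic_read/atomic_write stand for the runtime's lock-protected synchronous DMA primitives and are assumptions, not the actual API:

#define LINE_BYTES 128
#define WORDS (LINE_BYTES / sizeof(float))

/* Assumed runtime primitives: synchronous DMA executed atomically
 * (e.g. under a lock) on the 128-byte line. */
extern void atomic_read (void *ls, unsigned long long ea, int bytes);
extern void atomic_write(void *ls, unsigned long long ea, int bytes);

/* Evict one cache line: only words whose dirty bit is set may
 * overwrite main memory, so stores from other SPEs to the rest of
 * the line are not lost. */
void write_back(const float *ls_line, const unsigned char *dirty,
                unsigned long long ea)
{
    float mem_copy[WORDS];

    atomic_read(mem_copy, ea, LINE_BYTES);      /* 1. Atomic Read  */

    for (unsigned w = 0; w < WORDS; w++)        /* 2. Merge        */
        if (dirty[w])
            mem_copy[w] = ls_line[w];

    atomic_write(mem_copy, ea, LINE_BYTES);     /* 3. Atomic Write */
}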
Evaluation
• Comparison to a traditional software cache
  • 4-way set-associative, 128-byte cache lines, 64 KB capacity
  • Write-back implemented through dirty bits and atomic (synchronous) data transfers
Cache overhead comparison — execution time (sec):

Application   HYBRID   HYBRID synch   TRADITIONAL
IS              9.33          25.93         47.41
CG             12.29          62.13        103.44
FT             10.9           21.63         78.61
MG              3.68           3.76         13.11
Evaluation: Comparing performance with POWER5
• POWER5-based blade with two processors running at 1.5 GHz
• 16 GB of memory (8 GB per processor)
• Each processor
  • 2 cores with 2 SMT threads each
  • Shared 1.8 MB L2
Execution time (sec):

Application / Loop   POWER5   Cell BE HYBRID   Cell BE TRADITIONAL
IS                     8.25             9.33                 47.41
CG                    10.76            12.29                103.44
FT                     5.61            10.9                  78.61
MG                     3.12             3.68                 13.11
IS loop 1              8.00             6.65
IS loop 2              0.25             2.68
FT loop 1              1.52             1.76
FT loop 2              1.17             3.79
FT loop 3              1.14             2.27
FT loop 4              1.19             2.23
FT loop 5              0.59             0.81
MG loop 1              0.22             0.22
MG loop 2              0.03             0.06
MG loop 3              0.67             0.81
MG loop 4              0.37             0.35
MG loop 5              1.55             1.69
MG loop 6              0.21             0.49
MG loop 7              0.07             0.07
Evaluation: Scalability
• Cell BE versus POWER5

Scalability on Cell BE — execution time (sec):

         1 SPE   2 SPEs   4 SPEs   8 SPEs
MG-A     23.99    12.28     6.42     3.5
FT-A     72.48    37.88    20.46    10.96
CG-B     73.74    37.75    20.17    12.25
IS-B     45.59    24.21    14.11    10.24

Scalability on POWER5 — execution time (sec):

         1 thread   2 threads   4 threads
MG-A         6.86        3.79        3.12
FT-A        11.64        6.94        5.61
CG-B        24.86       13.20       10.76
IS-B        10.25        9.83        8.25
Runtime activity
• Number of iterations per runtime intervention
• Buffer size: 4 KB
Columns per configuration: total iterations, runtime interventions (cnt), 4 KB buffer transfers, transfers per intervention, iterations per intervention.

MG
SPEs  kernel  iterations   interventions  transfers   transf./interv.  iters/interv.
2     1       213310272    2788302        17025381    6.11              76.50
4     1       106655136    1393961         8511458    6.11              76.51
8     1        53327648     696943         4255404    6.11              76.52
2     2        95494660    1401200         8842580    6.31              68.15
4     2        47747270     700650         4421740    6.31              68.15
8     2        23873710     350385         2211260    6.31              68.14
2     3        33554432     196096          196096    1.00             171.11
4     3        16777216      98048           98048    1.00             171.11
8     3         8388608      49024           49024    1.00             171.11
2     4          786412       8098           32392    4.00              97.11
4     4          393216       4032           16128    4.00              97.52
8     4          196648       2026            8104    4.00              97.06
2     5          795076       7741           30964    4.00             102.71
4     5          401860       3886           15544    4.00             103.41
8     5          205232       2005            8020    4.00             102.36

CG
SPEs  kernel  iterations  interventions  transfers  transf./interv.  iters/interv.
2     1         225000      444            2664     6.00             506.76
4     1         112500      222            1332     6.00             506.76
8     1          56244      114             684     6.00             493.37
2     2        5624700    11100           11100     1.00             506.73
4     2        2812200     5550            5550     1.00             506.70
8     2        1406100     2850            2850     1.00             493.37
2     3        5624700    11100           22200     2.00             506.73
4     3        2812200     5550           11100     2.00             506.70
8     3        1406100     2850            5700     2.00             493.37
2     4         224988      444             888     2.00             506.73
4     4         112488      222             444     2.00             506.70
8     4          56244      114             228     2.00             493.37
2     5         224988      444             444     1.00             506.73
4     5         112488      222             222     1.00             506.70
8     5          56244      114             114     1.00             493.37
2     6        5624700    11100           44400     4.00             506.73
4     6        2812200     5550           22200     4.00             506.70
8     6        1406100     2850           11400     4.00             493.37
2     7         187490      370             740     2.00             506.73
4     7          93740      185             370     2.00             506.70
8     7          46870       95             190     2.00             493.37
2     8         187490      370             740     2.00             506.73
4     8          93740      185             370     2.00             506.70
8     8          46870       95             190     2.00             493.37

FT
SPEs  kernel  iterations   interventions  transfers  transf./interv.  iters/interv.
2     1       134217728    524288         4194304    8.00             256.00
4     1        67108864    262144         2097152    8.00             256.00
8     1        33554432    131072         1048576    8.00             256.00
2     2       134217728    524288         4194304    8.00             256.00
4     2        67108864    262144         2097152    8.00             256.00
8     2        33554432    131072         1048576    8.00             256.00
2     3       117440512    458752         3670016    8.00             256.00
4     3        58720256    229376         1835008    8.00             256.00
8     3        29360128    114688          917504    8.00             256.00

IS
SPEs  kernel  iterations  interventions  transfers  transf./interv.  iters/interv.
2     1       11534336    11264          11264      1.00             1024.00
4     1        5767168     5632           5632      1.00             1024.00
8     1        2883584     2816           2816      1.00             1024.00
2     2       23068672    22528          22528      1.00             1024.00
4     2       23068672    22528          22528      1.00             1024.00
8     2       23068672    22528          22528      1.00             1024.00
Evaluation: Overhead Distribution

WORK: time spent in actual computation.
WRITE-BACK: time spent in the write-back process.
UPDATE D-B: time spent updating the dirty-bits information.
DMA-IREG: time spent synchronizing with the DMA data transfers in the TC.
DMA-REG: time spent synchronizing with the DMA data transfers in the HLC.
DEC: time spent in the pinning mechanism for cache lines.
TRANSAC: time spent executing control code of the TC.
BARRIER: time spent in the barrier synchronization at the end of the parallel computation.
MMAP: time spent executing look-up, placement/replacement actions and DMA programming.

MG A, loop 5 — cache overhead distribution (%):
WORK 51.53, UPDATE D-B 19.05, WRITE-BACK 9.22, MMAP 7.29, BARRIER 4.11, DMA-REG 3.81, DEC 2.19
Evaluation: Overhead Distribution (legend as above)

IS B, loop 1 — cache overhead distribution (%):
WORK 43.54, TRANSAC 32.95, UPDATE D-B 12.75, DMA-IREG 5.38, WRITE-BACK 2.20, BARRIER 1.18, MMAP 0.74, DMA-REG 0.39, DEC 0.19
Memory Consistency
• Maintain a relaxed consistency model, following the OpenMP memory model
• Important sources of overhead
  • Dirty bits: every store operation is monitored
  • Atomicity in the write-back process
• Optimizations to smooth the impact of this overhead build on several observations about scientific parallel codes (see the sketch below):
  • Most cache lines are modified by a single execution flow
  • Buffers are usually modified entirely, so atomicity is not required at write-back
  • Aliasing between data in a buffer and data in the software cache rarely occurs
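These observations suggest a fast path at eviction time; a hedged sketch (the buffer_t fields and dma_put are illustrative, and write_back is the merge routine sketched earlier):

typedef struct {
    char              *ls;            /* buffer in the local store   */
    unsigned long long ea;            /* home address in main memory */
    int                size;          /* bytes                       */
    unsigned char     *dirty;         /* per-word dirty bits         */
    int                fully_dirty;   /* whole buffer was written    */
    int                single_writer; /* only this SPE wrote to it   */
} buffer_t;

extern void dma_put(const void *ls, unsigned long long ea, int bytes);
extern void write_back(const float *ls_line, const unsigned char *dirty,
                       unsigned long long ea);

void evict(const buffer_t *buf)
{
    if (buf->fully_dirty && buf->single_writer) {
        /* Fast path: the buffer fully overwrites its home location and
         * nobody else touched it, so a plain DMA put suffices;
         * no atomicity and no dirty-bit merge are needed. */
        dma_put(buf->ls, buf->ea, buf->size);
    } else {
        /* Slow path: per-line atomic read-merge-write. */
        for (int l = 0; l < buf->size / 128; l++)
            write_back((const float *)(buf->ls + l * 128),
                       buf->dirty + l * 32,      /* 32 words per line */
                       buf->ea + l * 128);
    }
}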
Evaluation: Memory Consistency

[Figures: reduction of execution time (normalized) per loop for IS class B, CG class B, FT class A and MG class A, comparing the CLR, HL, MR and PERFECT schemes.]

CLR: data eviction based on 128-byte hardware cache line reservation.
HL: data eviction is done at buffer level; no alias between data in a buffer and data in the software cache.
MR: data eviction is done at buffer level; no alias between data in a buffer and data in the software cache, and a single writer.
PERFECT: data eviction is freely executed, without atomicity or dirty bits.
Double buffering techniques
• Double buffering does not come for free (see the sketch below)
  • It implies executing more control code
  • It requires adapting the computational bursts to the data transfer times
  • It depends on the available bandwidth, which itself depends on the number of executing threads
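A minimal double-buffering sketch with the MFC intrinsics (chunk size, tags and the compute routine are illustrative): while the SPU computes on one buffer, the MFC fills the other, which is exactly the extra control code and tuning the bullets above refer to.

#include <spu_mfcio.h>

#define CHUNK 4096
static float buf[2][CHUNK / sizeof(float)] __attribute__((aligned(128)));

extern void compute(float *data, int nfloats);   /* the computational burst */

void process(unsigned long long ea, int nchunks)
{
    int cur = 0;

    /* Prime the pipeline: start fetching the first chunk. */
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);     /* tag = buffer index */

    for (int c = 0; c < nchunks; c++) {
        int nxt = cur ^ 1;

        /* Prefetch the next chunk into the other buffer. */
        if (c + 1 < nchunks)
            mfc_get(buf[nxt], ea + (unsigned long long)(c + 1) * CHUNK,
                    CHUNK, nxt, 0, 0);

        /* Wait only for the current buffer's transfer. */
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();

        /* Compute on the current buffer while the next one arrives. */
        compute(buf[cur], CHUNK / sizeof(float));

        cur = nxt;
    }
}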
Evaluation: pre-fetch of data
• Speedups and execution times
[Figures: per-loop speedups from pre-fetching for CG (loops 1-14), IS (loops 1-4), FT (loops 1-5) and STREAM (copy, scale, add, triad), with modulo-scheduled loops and pre-fetching applied only to regular memory references; most speedups lie between 0.95 and 1.43. Total execution times (sec) with and without pre-fetching: CG 12.72 / 11.76, IS 7.47 / 6.21, FT 10.03 / 12.07.]
Combining OpenMP with SIMD execution
• Actual effect
  • Limited by the execution model: it only affects the computational bursts
  • Very dependent on runtime parameters: the number of threads and the number of iterations per runtime intervention
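For illustration, a minimal sketch of what SIMDizing a computational burst looks like with the SPU intrinsics, assuming 16-byte-aligned local-store buffers and a vector-multiple trip count:

#include <spu_intrinsics.h>

/* a[i] = a[i] + d[i] over one computational burst,
 * processing four single-precision floats per instruction. */
void burst_add(vec_float4 *a, const vec_float4 *d, int n_vectors)
{
    for (int i = 0; i < n_vectors; i++)
        a[i] = spu_add(a[i], d[i]);
}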
[Figure: SIMD speedups per loop (CG L-0 to L-13, IS L-0/L-1, FT L-0 to L-3, MG L-1 to L-5) for 1, 2, 4 and 8 SPEs, ranging up to about 4x.]
Combining OpenMP with SIMD execution
[Figure: SIMD speedups for STREAM (copy, scale, add, triad), FT, MG and CG for 1, 2, 4 and 8 SPEs, ranging up to about 1.8x.]
Conclusions
• OpenMP transformations
  • Remember: three phases (control code, computation, synchronization)
  • Heavily conditioned by memory aliasing: try to avoid pointers, introduce inline annotations ...
  • We can reach performance similar to what we would obtain from a cache-based multi-core
• Double-buffering effectiveness
  • Depends on the number of threads, the access patterns and the bandwidth
  • Speedups range between 10% and 20%
• SIMD effectiveness
  • Only affects the computational phase
  • Limited by alignment constraints
Questions