
A Pipelined Execution of Tiled Nested Loops on SMPs with Computation and Communication Overlapping

Maria Athanasaki, Aristidis Sotiropoulos, Georgios Tsoukalas, Nectarios Koziris

National Technical University of Athens

Dept. of Electrical and Computer Engineering

Computing Systems Laboratory


Overview

• Advanced Architectures
• Tiling for parallelization
• Non-overlapping vs. Overlapping scheme
• Vertical vs. hyperplane grouping
• Application on clusters of SMP nodes



TCP/IP over FastEthernet

Use of popular Socket Interface: create socket descriptor sd, then read/write from/to descriptor sd

[Figure: hosts connected through a Fast Ethernet hub; write (send) on one side, read (receive) on the other; a buffer of given length passes from user space through the kernel-mode TCP, IP and ETH layers]

Example: Send

write(sd, buffer, length);

1) System call (CPU)

2) CPU copies data from user to kernel space

3) CPU adds protocol headers

4) CPU programs DMA engine

5) DMA copies data to NIC
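The send path above starts from a single write call on a socket descriptor. A minimal sketch of that call in use follows, assuming a local socketpair as a stand-in for the TCP connection over Fast Ethernet. The helper names send_all, recv_all and demo_roundtrip are hypothetical and do not come from the slides.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Write exactly `length` bytes to descriptor `sd`, retrying on short writes.
 * Each write() is one system call: the kernel copies the data out of the
 * user buffer before the protocol stack and the DMA engine take over.
 * Returns 0 on success, -1 on error. (Hypothetical helper.) */
int send_all(int sd, const char *buffer, size_t length)
{
    size_t sent = 0;
    while (sent < length) {
        ssize_t n = write(sd, buffer + sent, length - sent);
        if (n < 0)
            return -1;
        sent += (size_t)n;
    }
    return 0;
}

/* Read exactly `length` bytes from descriptor `sd`.
 * Returns 0 on success, -1 on error or early end of stream. */
int recv_all(int sd, char *buffer, size_t length)
{
    size_t got = 0;
    while (got < length) {
        ssize_t n = read(sd, buffer + got, length - got);
        if (n <= 0)
            return -1;
        got += (size_t)n;
    }
    return 0;
}

/* Send `msg` through one end of a local socket pair and read it back from
 * the other end. The socket pair is an assumption made to keep the sketch
 * self-contained; a cluster would use a connected TCP socket instead.
 * Returns 0 if the received bytes equal the sent bytes, -1 otherwise. */
int demo_roundtrip(const char *msg)
{
    int sd[2];
    char in[64];
    size_t length = strlen(msg) + 1;

    assert(length <= sizeof in);
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sd) != 0)
        return -1;
    int ok = send_all(sd[0], msg, length) == 0 &&
             recv_all(sd[1], in, length) == 0 &&
             memcmp(msg, in, length) == 0;
    close(sd[0]);
    close(sd[1]);
    return ok ? 0 : -1;
}
```

Calling demo_roundtrip("tile boundary data") exercises the write and read calls once in each direction of the pair; the user-to-kernel copy of step 2 happens inside each write.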



SCI

What about Scalable Coherent Interface?

Point-to-point, DSM approach



SCI DSM scheme

[Figure: an exported memory segment on one node and the corresponding imported memory segment on another, connected through SCI; labels: write 100, 100, read, 50]



SCI Zero Copy Scheme

[Figure: process VM area and Physical Memory]

Contiguous data in the process VM area are not contiguous in Physical Memory



SCI Zero Copy Scheme

[Figure: the process VM area is mapped to pinned-down memory in Physical Memory]

SCICreateSegment, SCIMapLocalSegment: mapping between Virtual and contiguous Physical Memory



Data transfers

• Programmed I/O mode: CPU handles data transferring; “lost” CPU cycles
• DMA mode: CPU programs the NIC’s buffers; not blocked during transfer; performs useful tasks



SCI DMA approach

• No copying by CPU
• Data already contiguous in PM
• DMA engine copies data to network
• No packetization: done in hardware
• But, init only by kernel

We need VIA



Nested For-Loops

for (i1=l1; i1<=u1; i1++)
  for (i2=l2; i2<=u2; i2++)
    …
      for (in=ln; in<=un; in++)
      {
        Loop Body
      }



Dependence Vectors

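[Figure: 8x8 iteration space with axes i1 and i2, with the dependence vectors drawn between iterations]

for (i1=0; i1<=7; i1++)
  for (i2=0; i2<=7; i2++)
    A[i1][i2] = A[i1-1][i2] + A[i1][i2-1];

Dependence vectors: (1,0) and (0,1)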



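Tiling

[Figure: the iteration space (axes i1, i2) partitioned into tiles]

Tiling

[Figure: the tiled iteration space (axes i1, i2) with tiles assigned to Processor 0 and Processor 1]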
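The tiling shown on these slides can be sketched in code for the 8x8 loop nest of the Dependence Vectors slide. This is a minimal sketch under assumptions not stated on the slides: rectangular tiles of size t1 x t2, an extra row and column holding boundary values so that the index i1-1 stays in range, and hypothetical function names. Rectangular tiles are legal here because both dependence vectors, (1,0) and (0,1), have non-negative components.

```c
#include <assert.h>

#define N 8 /* the iteration space is N x N, as in the 8x8 slide example */

/* Fill row 0 and column 0 with boundary values (assumed to be 1) and clear
 * the rest. Iterations occupy A[1..N][1..N]. */
void init_boundary(long A[N + 1][N + 1])
{
    for (int i = 0; i <= N; i++)
        for (int j = 0; j <= N; j++)
            A[i][j] = (i == 0 || j == 0) ? 1 : 0;
}

/* Original, untiled loop nest. */
void sweep_plain(long A[N + 1][N + 1])
{
    for (int i1 = 1; i1 <= N; i1++)
        for (int i2 = 1; i2 <= N; i2++)
            A[i1][i2] = A[i1 - 1][i2] + A[i1][i2 - 1];
}

/* Tiled loop nest: the two outer loops enumerate tile origins (b1, b2),
 * the two inner loops sweep the iterations inside one t1 x t2 tile,
 * clipped at the edge of the iteration space. */
void sweep_tiled(long A[N + 1][N + 1], int t1, int t2)
{
    assert(t1 > 0 && t2 > 0);
    for (int b1 = 1; b1 <= N; b1 += t1)
        for (int b2 = 1; b2 <= N; b2 += t2)
            for (int i1 = b1; i1 < b1 + t1 && i1 <= N; i1++)
                for (int i2 = b2; i2 < b2 + t2 && i2 <= N; i2++)
                    A[i1][i2] = A[i1 - 1][i2] + A[i1][i2 - 1];
}

/* Returns 1 if the tiled sweep produces exactly the untiled result. */
int tiled_matches_plain(int t1, int t2)
{
    long P[N + 1][N + 1], T[N + 1][N + 1];

    init_boundary(P);
    init_boundary(T);
    sweep_plain(P);
    sweep_tiled(T, t1, t2);
    for (int i = 0; i <= N; i++)
        for (int j = 0; j <= N; j++)
            if (P[i][j] != T[i][j])
                return 0;
    return 1;
}
```

In a parallel run, each processor would execute a row or column of these tiles and exchange the tile boundaries with its neighbours; the sketch only checks that reordering the iterations by tile preserves the result.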



Non-Overlapping Scheme

[Figure: tiled iteration space (axes i1, i2) with tiles assigned to Processor 0, Processor 1 and Processor 2]



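Non-Overlapping vs. Overlapping Scheme

[Figure: execution timelines of processors P0 to P3, one for the non-overlapping scheme and one for the overlapping scheme]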



Overlapping Scheme

[Figure: tiled iteration space (axes i1, i2) with tiles assigned to Processor 0, Processor 1 and Processor 2]



Generalization to SMPs

[Figure: the processors P0 to P3 of the previous schemes are replaced by the SMP nodes SMP0 to SMP3, each with two CPUs, CPU0 and CPU1]



Vertical vs. Hyperplane grouping

[Figure: two assignments of tiles to CPU0 and CPU1 of the nodes SMP0 to SMP3, one for vertical grouping and one for hyperplane grouping]



Example

[Figure: Tile Space mapped to Group Space, with groups assigned to SMP node0 and SMP node1]

Scheduling vector Π=(1,1)
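A linear scheduling vector such as Π=(1,1) assigns each point g of the Group Space the time step Π·g. The sketch below illustrates this under the assumption that the groups inherit the dependence vectors (1,0) and (0,1) of the earlier example; the function names are hypothetical.

```c
#include <assert.h>

/* Time step of group (g1, g2) under the linear schedule (p1, p2):
 * t = p1*g1 + p2*g2. */
int time_step(int p1, int p2, int g1, int g2)
{
    return p1 * g1 + p2 * g2;
}

/* A linear schedule respects the dependence (d1, d2) if the dependent group
 * runs at least one step after the group it depends on, i.e. if the dot
 * product of the schedule and the dependence vector is at least 1. */
int respects_dependence(int p1, int p2, int d1, int d2)
{
    assert(d1 != 0 || d2 != 0); /* the zero vector is not a dependence */
    return p1 * d1 + p2 * d2 >= 1;
}
```

With Π=(1,1), group (2,3) runs at step 5, and all groups on the same anti-diagonal g1 + g2 = 5 can run in parallel at that step. A vector such as (1,-1) would be rejected, because it would schedule a group before the neighbour it depends on along (0,1).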



Non-overlapping vs. Overlapping scheme

• Almost half duration of execution steps
• Slightly more steps

[Figure: execution timelines of processors P0 to P3 for the two schemes]

Non-overlapping scheme: 9 computation + 8 communication steps

Overlapping scheme: 12 steps
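The step counts on this slide can be reproduced with a short calculation. The sketch below rests on two assumptions that the slide does not state: the figure shows 4 processors executing 6 tiles each, and the overlapping scheme uses the linear schedule (2,1) in place of the (1,1) of the non-overlapping scheme. Both were chosen because they yield the numbers quoted above. The function names are hypothetical.

```c
#include <assert.h>

/* Number of time steps of a linear schedule (p_coef, t_coef) over a tile
 * space of `procs` processors with `tiles` tiles each: the step of the last
 * tile minus the step of the first tile, plus one. */
int schedule_length(int p_coef, int t_coef, int procs, int tiles)
{
    assert(procs >= 1 && tiles >= 1);
    return p_coef * (procs - 1) + t_coef * (tiles - 1) + 1;
}

/* Non-overlapping scheme, assumed schedule (1,1): computation steps. */
int nonoverlap_compute_steps(int procs, int tiles)
{
    return schedule_length(1, 1, procs, tiles);
}

/* One communication step separates each pair of consecutive computation
 * steps in the non-overlapping scheme. */
int nonoverlap_comm_steps(int procs, int tiles)
{
    return nonoverlap_compute_steps(procs, tiles) - 1;
}

/* Overlapping scheme, assumed schedule (2,1): every step computes one tile
 * while the communication for neighbouring tiles proceeds concurrently. */
int overlap_steps(int procs, int tiles)
{
    return schedule_length(2, 1, procs, tiles);
}
```

For 4 processors and 6 tiles, the three functions give 9, 8 and 12. The non-overlapping run therefore takes 17 phases in sequence, while the overlapping run takes 12 steps, each lasting roughly as long as the longer of one computation and one communication phase.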



Vertical vs. Hyperplane Grouping

• Slower pipeline filling
• Faster execution because of lack of intratile synchronization
• Preferable for Tile Spaces where the mapping direction is comparatively large

[Figure: two assignments of tiles to CPU0 and CPU1 of the nodes SMP0 to SMP3, one for vertical grouping and one for hyperplane grouping]



Experimental Platform

Linux SMP (Symmetric Multi-Processors) Cluster

• 8 nodes
• 128MB RAM
• 2 Pentium III 800MHz
• SCI ring (SCI Dolphin’s PCI-SCI D330 cards)



Initial Code

for (i=1; i<=X; i++)
  for (j=1; j<=Y; j++)
    for (k=1; k<=Z; k++) {
      A[i][j][k] = func(A[i-1][j][k], A[i][j-1][k], A[i][j][k-1]);
    }
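The experiments vary the tile height, so a sketch of this kernel tiled along k may help. It assumes that the tile height is the tile extent v along the k dimension, uses a small iteration space, and substitutes a made-up func, since the slides do not define one. All names other than A, X, Y, Z and func are hypothetical.

```c
#include <assert.h>
#include <string.h>

enum { X = 4, Y = 4, Z = 64 }; /* small sizes chosen for the sketch */

/* Index 0 in each dimension holds boundary values (zero here). */
typedef unsigned long grid_t[X + 1][Y + 1][Z + 1];

/* Placeholder for the unspecified func of the slide (an assumption). */
unsigned long func(unsigned long a, unsigned long b, unsigned long c)
{
    return (a + 2 * b + 3 * c + 1) % 1000003UL;
}

/* Untiled kernel, as on the slide. */
void run_plain(grid_t A)
{
    for (int i = 1; i <= X; i++)
        for (int j = 1; j <= Y; j++)
            for (int k = 1; k <= Z; k++)
                A[i][j][k] = func(A[i - 1][j][k], A[i][j - 1][k], A[i][j][k - 1]);
}

/* Same kernel with the k dimension cut into tiles of height v. The outer
 * loop walks the tiles; in a pipelined run, each (i, j) column would belong
 * to one CPU, which exchanges tile boundaries after every v iterations. */
void run_tiled(grid_t A, int v)
{
    assert(v >= 1);
    for (int kb = 1; kb <= Z; kb += v)
        for (int i = 1; i <= X; i++)
            for (int j = 1; j <= Y; j++)
                for (int k = kb; k < kb + v && k <= Z; k++)
                    A[i][j][k] = func(A[i - 1][j][k], A[i][j - 1][k], A[i][j][k - 1]);
}

/* Number of tiles along k for tile height v (rounded up). */
int tiles_along_k(int v)
{
    assert(v >= 1);
    return (Z + v - 1) / v;
}

/* Returns 1 if the tiled kernel reproduces the untiled result exactly. */
int same_result(int v)
{
    static grid_t P, T;

    memset(P, 0, sizeof P);
    memset(T, 0, sizeof T);
    run_plain(P);
    run_tiled(T, v);
    return memcmp(P, T, sizeof P) == 0;
}
```

The tile height trades the number of messages against pipeline start-up: a small v gives many tiles and many small transfers, while a large v gives few transfers but delays the first step of each downstream processor.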



Experimental results

[Figure: two plots of Time (sec) versus Tile Height (0 to 35000), one for Iteration Space 16x16x1024K and one for Iteration Space 48x48x512K. Curves: Non-overlapping scheme, vertical grouping; Overlapping scheme, vertical grouping; Non-overlapping scheme, hyperplane grouping; Overlapping scheme, hyperplane grouping]



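Grouping matrix

[Equation: the grouping matrix $H^G$; its entries, which involve the factors $1/m_k$, are not recoverable from the extracted slide]

$m_1 \cdots m_{i-1}\, m_{i+1} \cdots m_n$ = number of CPUs within an SMP node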



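Example

[Figure: Tile Space mapped to Group Space, with groups assigned to SMP node0 and SMP node1; the slide also gives the numeric grouping matrices for this example, whose entries are not recoverable from the extraction]

Scheduling vector Π=(1,1)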
