
Post on 21-Dec-2015


A Pipelined Execution of Tiled Nested Loops on SMPs with Computation and Communication Overlapping

Maria Athanasaki, Aristidis Sotiropoulos, Georgios Tsoukalas, Nectarios Koziris

National Technical University of Athens
Dept. of Electrical and Computer Engineering
Computing Systems Laboratory


Overview

• Advanced architectures
• Tiling for parallelization
• Non-overlapping vs. overlapping scheme
• Vertical vs. hyperplane grouping
• Application on clusters of SMP nodes


TCP/IP over FastEthernet

• Use of the popular socket interface
• Create a socket descriptor sd, then read/write from/to descriptor sd

[Figure: nodes connected through a hub; write corresponds to send, read to receive]
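As a rough illustration of the socket interface described above, the following C sketch creates a descriptor sd, writes to it, and reads the data on the peer descriptor. To stay self-contained it connects to a listener inside the same process over the loopback address. The function name tcp_loopback_echo and the loopback setup are assumptions made for illustration and do not come from the slides.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Sends msg over a loopback TCP connection and reads it back into out.
 * Returns the number of bytes received, or -1 on error.
 * Hypothetical helper, not part of the original presentation. */
ssize_t tcp_loopback_echo(const char *msg, char *out, size_t out_len)
{
    struct sockaddr_in addr;
    socklen_t addr_len = sizeof(addr);
    ssize_t received = -1;
    size_t len, total = 0;
    int listener = -1, sd = -1, peer = -1;

    assert(msg != NULL && out != NULL);
    len = strlen(msg);

    /* Receiving side: listen on 127.0.0.1, port chosen by the kernel. */
    listener = socket(AF_INET, SOCK_STREAM, 0);
    if (listener < 0) goto done;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;
    if (bind(listener, (struct sockaddr *)&addr, sizeof(addr)) < 0) goto done;
    if (listen(listener, 1) < 0) goto done;
    if (getsockname(listener, (struct sockaddr *)&addr, &addr_len) < 0) goto done;

    /* Sending side: create socket descriptor sd and connect it. */
    sd = socket(AF_INET, SOCK_STREAM, 0);
    if (sd < 0) goto done;
    if (connect(sd, (struct sockaddr *)&addr, sizeof(addr)) < 0) goto done;
    peer = accept(listener, NULL, NULL);
    if (peer < 0) goto done;

    /* write = send */
    if (write(sd, msg, len) != (ssize_t)len) goto done;

    /* read = receive; a single read may return fewer bytes than requested */
    while (total < len && total < out_len) {
        ssize_t n = read(peer, out + total, out_len - total);
        if (n <= 0) goto done;
        total += (size_t)n;
    }
    received = (ssize_t)total;

done:
    if (peer >= 0) close(peer);
    if (sd >= 0) close(sd);
    if (listener >= 0) close(listener);
    return received;
}
```

Calling tcp_loopback_echo("tile", out, sizeof(out)) with a 16-byte buffer should return 4 and leave the four sent bytes in out. On a real cluster the two descriptors would live in different processes on different nodes.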


Example: Send

write(sd, buffer, length);

1) System call (CPU)
2) CPU copies data from user to kernel space
3) CPU adds protocol headers
4) CPU programs the DMA engine
5) DMA copies data to the NIC

[Figure: the buffer of the given length passes from user mode to kernel mode, through the TCP, IP and ETH layers, to the Fast Ethernet NIC]


SCI

• What about the Scalable Coherent Interface (SCI)?
• Point-to-point, DSM (distributed shared memory) approach


SCI DSM scheme

[Figure: an exported memory segment and the matching imported memory segment, connected over SCI; figure labels: write 100, 100, read, 50]


SCI Zero Copy Scheme

Contiguous data in the process VM area are not contiguous in physical memory.

[Figure: process VM area and physical memory]


SCI Zero Copy Scheme

SCICreateSegment, SCIMapLocalSegment: mapping between virtual and contiguous physical memory.

[Figure: the process VM area is mapped to pinned-down memory in physical memory]


Data transfers

• Programmed I/O mode
  • CPU handles the data transfer
  • “Lost” CPU cycles
• DMA mode
  • CPU programs the NIC’s buffers
  • CPU is not blocked during the transfer
  • CPU performs useful tasks in the meantime


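SCI DMA approach

• No copying by the CPU: data are already contiguous in physical memory (PM)
• The DMA engine copies the data to the network
• No packetization by the CPU: it is done in hardware
• But the transfer can be initiated only by the kernel

We need VIA (Virtual Interface Architecture)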


Nested For-Loops

for (i1=l1; i1<=u1; i1++)
  for (i2=l2; i2<=u2; i2++)
    …
      for (in=ln; in<=un; in++)
      {
        Loop Body
      }


Dependence Vectors

for (i1=0; i1<=7; i1++)
  for (i2=0; i2<=7; i2++)
    A[i1][i2] = A[i1-1][i2] + A[i1][i2-1];

Dependence vectors: (1,0) and (0,1)

[Figure: iteration space with axes i1 and i2]


Tiling

[Figure: tiled iteration space, axes i1 and i2]


Tiling

[Figure: tiled iteration space (axes i1, i2) with tiles assigned to Processor 0 and Processor 1]
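To make the tiling idea concrete, here is a minimal C sketch that runs the 8x8 loop of the dependence-vector example once in its original order and once tile by tile. Several things are assumptions for illustration: the tile side TILE, the function names, and the shift of the index range to 1..8 so that row 0 and column 0 can hold boundary values. Because both dependence vectors are non-negative, executing rectangular tiles in lexicographic order respects every dependence, so both versions should produce the same array.

```c
#include <assert.h>

#define N 8      /* iteration space is N x N, as in the 8x8 example */
#define TILE 4   /* hypothetical tile side; must divide N in this sketch */

/* Row 0 and column 0 act as boundary values; interior cells get overwritten. */
void init_boundary(long a[N + 1][N + 1])
{
    for (int i = 0; i <= N; i++)
        for (int j = 0; j <= N; j++)
            a[i][j] = 1;
}

/* Original loop order. */
void run_untiled(long a[N + 1][N + 1])
{
    for (int i1 = 1; i1 <= N; i1++)
        for (int i2 = 1; i2 <= N; i2++)
            a[i1][i2] = a[i1 - 1][i2] + a[i1][i2 - 1];
}

/* Same computation, executed one TILE x TILE tile at a time. */
void run_tiled(long a[N + 1][N + 1])
{
    assert(N % TILE == 0);
    for (int t1 = 1; t1 <= N; t1 += TILE)          /* tile origin along i1 */
        for (int t2 = 1; t2 <= N; t2 += TILE)      /* tile origin along i2 */
            for (int i1 = t1; i1 < t1 + TILE; i1++)
                for (int i2 = t2; i2 < t2 + TILE; i2++)
                    a[i1][i2] = a[i1 - 1][i2] + a[i1][i2 - 1];
}
```

In a parallel setting each tile (or column of tiles) would be assigned to a processor, and the values crossing a tile boundary are what has to be communicated between neighbours.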


Non-Overlapping Scheme

[Figure: tiled iteration space (axes i1, i2) with tiles assigned to Processor 0, Processor 1 and Processor 2]


Non-Overlapping vs. Overlapping Scheme

[Figure: execution timelines of processors P0 to P3 under each of the two schemes]


Overlapping Scheme

[Figure: tiled iteration space (axes i1, i2) with tiles assigned to Processor 0, Processor 1 and Processor 2]


Generalization to SMPs

[Figure: processors P0 to P3]


Generalization to SMPs

[Figure: SMP nodes SMP0 to SMP3, each with CPU0 and CPU1]



Generalization to SMPs

[Figure: CPU1 and CPU0]



Vertical vs. Hyperplane grouping

[Figure: the two groupings side by side, each over SMP0 to SMP3 with CPU0 and CPU1 per node]


Example

[Figure: Tile Space and Group Space, with groups assigned to SMP node0 and SMP node1]

Scheduling vector Π=(1,1)
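A linear scheduling vector assigns each point of the group space a time step equal to the dot product of Π with the point's coordinates. The C sketch below shows this for an arbitrary Π. The function names and the assumption of a rectangular group space with zero-based coordinates are illustrative choices and are not stated on the slide.

```c
#include <assert.h>

/* Time step of the group at coordinates g under the linear schedule pi:
 * t = pi . g (dot product). Hypothetical helper for illustration. */
int schedule_step(const int *pi, const int *g, int n)
{
    assert(n > 0);
    int t = 0;
    for (int d = 0; d < n; d++)
        t += pi[d] * g[d];
    return t;
}

/* Number of time steps needed for a rectangular group space whose
 * coordinates run from 0 to size[d]-1 in each dimension d.
 * Assumes every component of pi is non-negative. */
int schedule_length(const int *pi, const int *size, int n)
{
    assert(n > 0);
    int last = 0;
    for (int d = 0; d < n; d++)
        last += pi[d] * (size[d] - 1);
    return last + 1;  /* steps are numbered 0 .. last */
}
```

With Π=(1,1), all groups on the same anti-diagonal receive the same time step, which is the wavefront (hyperplane) execution order.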


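Non-overlapping vs. Overlapping scheme

• Each execution step of the overlapping scheme lasts almost half as long
• The overlapping scheme needs slightly more steps

[Figure: execution timelines of processors P0 to P3 under each scheme]

Non-overlapping scheme: 9 computation + 8 communication steps
Overlapping scheme: 12 steps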
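The step counts above can be turned into a rough timing comparison. The C sketch below assumes a simple cost model that is not given on the slide: in the non-overlapping scheme computation and communication steps are paid one after the other, while in the overlapping scheme each step costs as much as the longer of the two activities. The function and parameter names are hypothetical.

```c
#include <assert.h>

/* Non-overlapping scheme: computation and communication are serialized. */
double nonoverlap_time(int comp_steps, int comm_steps,
                       double t_comp, double t_comm)
{
    assert(comp_steps >= 0 && comm_steps >= 0);
    return comp_steps * t_comp + comm_steps * t_comm;
}

/* Overlapping scheme: each step overlaps computation with communication,
 * so under this model its duration is the maximum of the two. */
double overlap_time(int steps, double t_comp, double t_comm)
{
    assert(steps >= 0);
    double step = t_comp > t_comm ? t_comp : t_comm;
    return steps * step;
}
```

With the slide's counts and equal unit costs for computing and communicating one tile, this model gives 9 + 8 = 17 time units for the non-overlapping scheme against 12 for the overlapping one. The advantage shrinks as the two costs become more unequal.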


Vertical vs. Hyperplane Grouping

• Slower pipeline filling
• Faster execution because of the lack of intra-tile synchronization
• Preferable for tile spaces that are comparatively large along the mapping direction

[Figure: the two groupings side by side, each over SMP0 to SMP3 with CPU0 and CPU1 per node]


Experimental Platform

• Linux SMP (Symmetric Multi-Processor) cluster
• 8 nodes (128 MB RAM, 2 Pentium III 800 MHz)
• SCI ring (Dolphin’s PCI-SCI D330 cards)


Initial Code

for (i=1; i<=X; i++)
  for (j=1; j<=Y; j++)
    for (k=1; k<=Z; k++) {
      A[i][j][k] = func(A[i-1][j][k], A[i][j-1][k], A[i][j][k-1]);
    }
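As a sketch of how this loop can be cut into tiles, the following C code splits the k dimension into slabs of v iterations and checks nothing else about the parallel execution. It assumes that the "tile height" varied in the experiments corresponds to the tile extent along k, which the slides do not state explicitly. The array sizes, the body of func, and the function names are placeholders chosen for illustration.

```c
#include <assert.h>

#define X 4    /* small hypothetical extents, far below the experimental sizes */
#define Y 4
#define Z 16

/* Placeholder for the unspecified func of the slide. */
static long func(long a, long b, long c)
{
    return (a + b + c) % 1000003;
}

/* Index 0 in every dimension holds boundary values. */
void init_array(long a[X + 1][Y + 1][Z + 1])
{
    for (int i = 0; i <= X; i++)
        for (int j = 0; j <= Y; j++)
            for (int k = 0; k <= Z; k++)
                a[i][j][k] = 1;
}

/* Original loop nest. */
void run_initial(long a[X + 1][Y + 1][Z + 1])
{
    for (int i = 1; i <= X; i++)
        for (int j = 1; j <= Y; j++)
            for (int k = 1; k <= Z; k++)
                a[i][j][k] = func(a[i - 1][j][k], a[i][j - 1][k], a[i][j][k - 1]);
}

/* Same computation in slabs of v iterations along k (tile height v).
 * The last slab is shorter when v does not divide Z. */
void run_tiled(long a[X + 1][Y + 1][Z + 1], int v)
{
    assert(v > 0);
    for (int k0 = 1; k0 <= Z; k0 += v) {
        int k_end = (k0 + v - 1 < Z) ? k0 + v - 1 : Z;
        for (int i = 1; i <= X; i++)
            for (int j = 1; j <= Y; j++)
                for (int k = k0; k <= k_end; k++)
                    a[i][j][k] = func(a[i - 1][j][k], a[i][j - 1][k], a[i][j][k - 1]);
    }
}
```

All three dependences point in non-negative directions, so executing the slabs in increasing k order preserves the result of the original loop. In a pipelined execution, each completed slab is the unit whose boundary data would be sent to the neighbouring processor.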


Experimental results

[Figure: two plots of Time (sec) against Tile Height (0 to 35000), one for iteration space 16x16x1024K and one for iteration space 48x48x512K. Each plot compares four curves: non-overlapping scheme with vertical grouping, overlapping scheme with vertical grouping, non-overlapping scheme with hyperplane grouping, and overlapping scheme with hyperplane grouping.]


Grouping matrix

[Equation: the n×n grouping matrix G^H, with 0, 1 and m_j entries; the exact layout did not survive extraction]

m_1 … m_(i-1) m_(i+1) … m_n = number of CPUs within an SMP node


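Example

[Figure: Tile Space and Group Space, with groups assigned to SMP node0 and SMP node1]

[Equation: the grouping matrices G (superscripts H and P) used in this example; the entries did not survive extraction]

Scheduling vector Π=(1,1)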