
Scheduling of Tiled Nested Loops onto a Cluster with a Fixed Number of SMP Nodes

Maria Athanasaki, Evangelos Koukis, Nectarios Koziris

National Technical University of Athens, School of Electrical and Computer Engineering

Computing Systems Laboratory

Scheduling of Tiled Nested Loops onto a Cluster with a Fixed Number of SMP Nodes

National Technical University of Athens, Computing Systems Laboratory

PDP 2004

Previous work

M. Athanasaki, A. Sotiropoulos, G. Tsoukalas, N. Koziris, "Pipelined Scheduling of Tiled Nested Loops onto Clusters of SMPs using Memory Mapped Network Interfaces", SuperComputing Conference on High Performance Networking and Computing (SC2002), Baltimore, Maryland, November 16-22, 2002.

G. Goumas, A. Sotiropoulos and N. Koziris, "Minimizing Completion Time for Loop Tiling with Computation and Communication Overlapping", Proceedings of the 2001 International Parallel and Distributed Processing Symposium (IPDPS2001), IEEE Press, San Francisco, California, April 2001.

Overview

Tiling for parallelization
Non-overlapping vs. overlapping execution scheme
Grouping
Application on a cluster of SMPs with a fixed number of nodes
Experimental and simulation results

Nested For-Loops

for (i1 = l1; i1 <= u1; i1++)
  for (i2 = l2; i2 <= u2; i2++)
    ...
      for (in = ln; in <= un; in++)
      {
        Loop Body
      }

Dependence Vectors

for (i1 = 0; i1 <= 7; i1++)
  for (i2 = 0; i2 <= 7; i2++)
    A[i1][i2] = A[i1-1][i2] + A[i1][i2-1];

Each iteration (i1, i2) reads the values produced by iterations (i1-1, i2) and (i1, i2-1), so the dependence vectors are (1,0) and (0,1).

[Figure: the 8 x 8 iteration space with axes i1 and i2 and the dependence vectors drawn between iterations.]
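The recurrence on the Dependence Vectors slide can be made runnable with a small change. The sketch below is only an illustration: it shifts the iteration space to 1..8 and uses row 0 and column 0 as boundary values set to 1, which is an assumption, since the slide does not say how the border is initialized. The name `fill` and the macro `N` are hypothetical.

```c
#define N 8  /* the iteration space is N x N, as in the slide's 8 x 8 example */

/* Fills A with the slide's recurrence.
 * Assumption: row 0 and column 0 hold boundary values (all 1 here). */
void fill(long A[N + 1][N + 1])
{
    for (int i = 0; i <= N; i++) {
        A[i][0] = 1;
        A[0][i] = 1;
    }
    /* Iteration (i1, i2) needs (i1-1, i2) and (i1, i2-1) to be finished:
     * dependence vectors (1,0) and (0,1). */
    for (int i1 = 1; i1 <= N; i1++)
        for (int i2 = 1; i2 <= N; i2++)
            A[i1][i2] = A[i1 - 1][i2] + A[i1][i2 - 1];
}
```

With these boundary values each entry is a binomial coefficient, which gives an easy check that the dependences were respected.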

Tiling

[Figure: the iteration space (axes i1, i2) partitioned into tiles, with tiles assigned to Processor 0 and Processor 1.]
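As a rough illustration of tiling, the loop with dependence vectors (1,0) and (0,1) can be rewritten so that two outer loops enumerate tiles and two inner loops enumerate the iterations inside one tile. Rectangular tiles are legal here because every dependence component is non-negative. The tile side `TILE_T`, the boundary values, and the function name are assumptions made for this sketch and do not come from the slides.

```c
#define TILE_N 8  /* extent of the iteration space per dimension */
#define TILE_T 4  /* tile side; a hypothetical choice, must be positive */

static int min_int(int a, int b) { return a < b ? a : b; }

/* Tiled version of A[i1][i2] = A[i1-1][i2] + A[i1][i2-1].
 * Assumption: row 0 and column 0 hold boundary values (all 1 here). */
void fill_tiled(long A[TILE_N + 1][TILE_N + 1])
{
    for (int i = 0; i <= TILE_N; i++) {
        A[i][0] = 1;
        A[0][i] = 1;
    }
    /* t1, t2 select a tile by the coordinates of its first iteration. */
    for (int t1 = 1; t1 <= TILE_N; t1 += TILE_T)
        for (int t2 = 1; t2 <= TILE_N; t2 += TILE_T)
            /* i1, i2 sweep the iterations inside that tile. */
            for (int i1 = t1; i1 <= min_int(t1 + TILE_T - 1, TILE_N); i1++)
                for (int i2 = t2; i2 <= min_int(t2 + TILE_T - 1, TILE_N); i2++)
                    A[i1][i2] = A[i1 - 1][i2] + A[i1][i2 - 1];
}
```

In a parallel run each tile would be an atomic unit of work, with data exchanged only at tile borders; the sequential version above only shows the loop structure.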


Non-Overlapping Scheme

[Figure: the tiled iteration space (axes i1, i2) with tiles assigned to Processor 0, Processor 1 and Processor 2.]

Non-Overlapping vs. Overlapping Scheme

[Figure: processors P0 to P3 shown once for the non-overlapping scheme and once for the overlapping scheme.]

Overlapping Scheme

[Figure: the tiled iteration space (axes i1, i2) with tiles assigned to Processor 0, Processor 1 and Processor 2.]
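One simplified way to compare the two execution schemes is a per-step cost model. The sketch below is an assumption made for illustration and is not stated on the slides: it charges computation plus communication for every time step of the non-overlapping scheme, and the larger of the two for every time step of the overlapping scheme. The number of steps may differ between the schemes, so it is passed in separately. All names are hypothetical.

```c
/* Cost of one time step under the assumed model.
 * overlapping != 0: communication is hidden behind computation (or vice versa).
 * overlapping == 0: each step computes first and communicates afterwards. */
double step_cost(double t_comp, double t_comm, int overlapping)
{
    if (overlapping)
        return t_comp > t_comm ? t_comp : t_comm;
    return t_comp + t_comm;
}

/* Total time for a schedule with the given number of time steps. */
double total_time(long steps, double t_comp, double t_comm, int overlapping)
{
    return (double)steps * step_cost(t_comp, t_comm, overlapping);
}
```

Under this model the overlapping scheme wins whenever its extra time steps cost less than the communication it hides.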


Generalization to SMPs - "Grouping"

[Figure: four SMP nodes (SMP0 to SMP3), each with two CPUs (CPU0, CPU1).]

Example: Grouping + Non-Overlapping Communication Scheme

[Figure: the tile space and the corresponding group space, mapped onto SMP node0 and SMP node1.]

Scheduling vector Π = (1,0)

Example: Grouping + Overlapping Communication Scheme

[Figure: the tile space and the corresponding group space, mapped onto SMP node0 and SMP node1.]

Scheduling vector Π = (1,1)
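The two grouping examples use the scheduling vectors Π = (1,0) and Π = (1,1). Under the usual definition of a linear schedule, the group with coordinates g runs at time step Π·g. The sketch below assumes that definition, a two-dimensional rectangular group space whose coordinates start at 0, and non-negative components of Π; the function names are hypothetical.

```c
/* Time step of group (g1, g2) under the linear schedule (pi1, pi2),
 * counting from step 0. */
int time_step(int pi1, int pi2, int g1, int g2)
{
    return pi1 * g1 + pi2 * g2;
}

/* Number of time steps needed to cover a G1 x G2 group space.
 * Assumes pi1, pi2 >= 0, so the last step belongs to group (G1-1, G2-1). */
int schedule_length(int pi1, int pi2, int G1, int G2)
{
    return time_step(pi1, pi2, G1 - 1, G2 - 1) + 1;
}
```

For a 4 x 3 group space this gives 4 steps with Π = (1,0) and 6 steps with Π = (1,1).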


Scheduling onto a Fixed Number of SMPs

Dynamic scheduling by the operating system
  Run-time overhead for generating a lot of processes
  Context switching slows down the execution
Static scheduling at compile time

Scheduling onto a Fixed Number of SMPs

Cyclic Assignment Schedule

Mirror Assignment Schedule

Cluster Assignment Schedule

Retiling


Cyclic Assignment

[Figure: cyclic assignment on 2 SMP nodes with 2 CPUs each. The tile space is divided into chunks labeled SMP0, SMP1, SMP0, SMP1; each chunk is split between CPU0 and CPU1.]
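Reading the labels of the cyclic assignment figure (chunks go to SMP0, SMP1, SMP0, SMP1 and each chunk holds one piece for CPU0 and one for CPU1), the mapping can be sketched as follows. This is an interpretation of the figure, not code from the authors; `Place`, `cyclic_place` and the meaning of `idx` (position of a piece of work along the mapping dimension, starting at 0) are hypothetical.

```c
typedef struct {
    int smp;  /* SMP node number */
    int cpu;  /* CPU number inside that node */
} Place;

/* Cyclic assignment: consecutive pieces fill the CPUs of one node (a chunk),
 * and chunks are dealt to the nodes round-robin. */
Place cyclic_place(int idx, int nodes, int cpus)
{
    Place p;
    p.smp = (idx / cpus) % nodes;  /* chunk number, wrapped over the nodes */
    p.cpu = idx % cpus;            /* position inside the chunk */
    return p;
}
```

With 2 nodes and 2 CPUs each, indices 0 to 7 map to SMP0, SMP0, SMP1, SMP1, SMP0, SMP0, SMP1, SMP1, matching the order of the labels in the figure.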

Cyclic Assignment - Non-Overlapping Communication

[Figure: execution over time (t) on CPU0 and CPU1 of SMP0 and SMP1, for cyclic assignment on 2 SMP nodes with 2 CPUs each.]

Cyclic Assignment - Overlapping Communication

[Figure: execution over time (t) on CPU0 and CPU1 of SMP0 and SMP1, for cyclic assignment on 2 SMP nodes with 2 CPUs each.]

Cyclic Assignment - Communication

[Figure: communication under cyclic assignment on 2 SMP nodes with 2 CPUs each. Chunks are labeled SMP0, SMP1, SMP0, SMP1, each split between CPU0 and CPU1.]


Mirror Assignment

[Figure: mirror assignment on 2 SMP nodes with 2 CPUs each. Chunks are labeled SMP0, SMP1, SMP1, SMP0; the first two chunks are split CPU0, CPU1 and the last two CPU1, CPU0.]
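The labels of the mirror assignment figure (SMP0, SMP1, SMP1, SMP0, with the CPU order reversed in the second half) suggest a mapping that sweeps forward over all CPUs and then backward. The sketch below encodes that reading; it is an interpretation of the figure rather than the authors' code, and `Place`, `mirror_place` and `idx` (position along the mapping dimension, starting at 0) are hypothetical names.

```c
typedef struct {
    int smp;  /* SMP node number */
    int cpu;  /* CPU number inside that node */
} Place;

/* Mirror assignment: a forward sweep over all CPUs of all nodes,
 * followed by a backward sweep, repeated periodically. */
Place mirror_place(int idx, int nodes, int cpus)
{
    int total = nodes * cpus;      /* CPUs in the whole cluster */
    int pos = idx % (2 * total);   /* position inside one forward+backward period */
    if (pos >= total)
        pos = 2 * total - 1 - pos; /* backward sweep mirrors the forward one */

    Place p;
    p.smp = pos / cpus;
    p.cpu = pos % cpus;
    return p;
}
```

With 2 nodes and 2 CPUs each, indices 0 to 7 visit (SMP0,CPU0), (SMP0,CPU1), (SMP1,CPU0), (SMP1,CPU1) and then the same places in reverse order.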

Mirror Assignment - Non-Overlapping Communication

[Figure: execution over time (t) on CPU0 and CPU1 of SMP0 and SMP1, for mirror assignment on 2 SMP nodes with 2 CPUs each.]

Mirror Assignment - Overlapping Communication

[Figure: execution over time (t) on CPU0 and CPU1 of SMP0 and SMP1, for mirror assignment on 2 SMP nodes with 2 CPUs each.]

Mirror Assignment - Communication

[Figure: communication under mirror assignment on 2 SMP nodes with 2 CPUs each. Chunks are labeled SMP0, SMP1, SMP1, SMP0; the first two are split CPU0, CPU1 and the last two CPU1, CPU0.]


Cluster Assignment

[Figure: cluster assignment on 2 SMP nodes (SMP0, SMP1) with 2 CPUs each (CPU0, CPU1). Labels in the figure: tiles, "TILE", TILES, GROUPS.]
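One plausible reading of the cluster assignment figure is a block distribution: neighboring pieces of work are clustered together so that each CPU receives one contiguous block. The sketch below assumes that reading and is not taken from the slides; `Place`, `cluster_place`, `idx` (position along the mapping dimension, 0 <= idx < count) and `count` (number of pieces along that dimension) are hypothetical.

```c
typedef struct {
    int smp;  /* SMP node number */
    int cpu;  /* CPU number inside that node */
} Place;

/* Cluster (block) assignment: the count pieces are cut into contiguous
 * blocks, one block per CPU, in node-major order.
 * Precondition: 0 <= idx < count, nodes > 0, cpus > 0. */
Place cluster_place(int idx, int count, int nodes, int cpus)
{
    int total = nodes * cpus;
    int block = (count + total - 1) / total;  /* ceil(count / total) */
    int pos = idx / block;                    /* global CPU number */

    Place p;
    p.smp = pos / cpus;
    p.cpu = pos % cpus;
    return p;
}
```

With 8 pieces on 2 nodes with 2 CPUs each, every CPU owns 2 neighboring pieces, so data crosses a CPU boundary only at the edge of a block.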

Cluster Assignment - Non-Overlapping Communication

[Figure: execution over time (t) on CPU0 and CPU1 of SMP0 and SMP1, for cluster assignment on 2 SMP nodes with 2 CPUs each.]

Cluster Assignment - Overlapping Communication

[Figure: execution over time (t) on CPU0 and CPU1 of SMP0 and SMP1, for cluster assignment on 2 SMP nodes with 2 CPUs each.]

Cluster Assignment - Communication

[Figure: communication under cluster assignment on 2 SMP nodes (SMP0, SMP1) with 2 CPUs each (CPU0, CPU1). Labels in the figure: TILES, GROUPS.]


Retiling

[Figure: retiling on 2 SMP nodes (SMP0, SMP1) with 2 CPUs each (CPU0, CPU1), showing the old tiles and the new tiles.]

The new tiles retain the computation volume of a tile.
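The retiling slide states only that the new tiles keep the computation volume of an old tile. A hedged sketch of how such a reshaping might be computed in two dimensions is given below. It assumes, as an interpretation and not as the authors' method, that the new tile side along the mapping dimension is chosen so that there is exactly one row of tiles per CPU, and that the other side is adjusted to keep the volume. `TileShape`, `retile` and the parameter names are hypothetical.

```c
typedef struct {
    long x;  /* tile side along the mapping dimension */
    long y;  /* tile side along the other dimension */
} TileShape;

/* Reshapes a tile so that extent / total_cpus iterations of the mapping
 * dimension fall into one tile, keeping x * y (the volume) roughly constant.
 * Integer division may shrink the volume slightly; y never drops below 1. */
TileShape retile(TileShape old_tile, long extent, int total_cpus)
{
    long volume = old_tile.x * old_tile.y;

    TileShape t;
    t.x = (extent + total_cpus - 1) / total_cpus;  /* ceil(extent / cpus) */
    t.y = volume / t.x;
    if (t.y < 1)
        t.y = 1;
    return t;
}
```

For example, a 10 x 10 tile on an extent of 80 iterations and 4 CPUs becomes a 20 x 5 tile with the same volume of 100 iterations.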

Retiling - Non-Overlapping Communication

[Figure: execution over time (t) on CPU0 and CPU1 of SMP0 and SMP1, for retiling on 2 SMP nodes with 2 CPUs each.]

Retiling - Overlapping Communication

[Figure: execution over time (t) on CPU0 and CPU1 of SMP0 and SMP1, for retiling on 2 SMP nodes with 2 CPUs each.]

Retiling - Communication

[Figure: communication under retiling on 2 SMP nodes (SMP0, SMP1) with 2 CPUs each (CPU0, CPU1).]


Experimental Platform

Linux SMP (Symmetric Multi-Processors) cluster
  2 nodes
    2 Pentium III 1266 MHz
    1 GB RAM
Myrinet high performance interconnect
GM low level message passing system

The Myrinet interconnect

User-level networking
  Based on the GM message passing interface
  All message exchange using DMA
    Directly to/from pinned userspace buffers
  Communication is offloaded to the NIC
Programmable NIC
  LANai RISC processor @ 133-333 MHz
  2-8 MB SRAM
2+2 Gbps full duplex fiber links

GM Architecture

Comprised of three main parts
  User library
  Kernel driver
  Firmware on NIC
OS bypass design
  Regions of NIC memory mapped to the VM of a process

[Figure: GM layers. User: Application and GM Library. Kernel: GM kernel module. NIC: GM firmware.]

Sending and Receiving messages over Myrinet/GM

[Figure: a sending application and a receiving application. On each side the host holds a buffer and an event queue, and the NIC holds the send queue (sender) or receive queue (receiver), the LANai processor, and the Host DMA, Send DMA and Recv DMA engines.]

Initial Code

for (i = 1; i <= X; i++)
  for (j = 1; j <= Y; j++)
    for (k = 1; k <= Z; k++) {
      A[i][j][k] = func(A[i-1][j][k], A[i][j-1][k], A[i][j][k-1]);
    }

Experimental results

[Figure: two plots of speedup / #processors (vertical axis, 0.5 to 1) against height of iteration space (horizontal axis, 500 to 3500), one for the Non Overlapping Execution Scheme and one for the Overlapping Execution Scheme, each with curves for cyclic, mirror, cluster and retile.]

Simulation results

[Figures (two slides): plots of speedup / #processors (vertical axis, 0.3 to 1) against height of iteration space (horizontal axis, 0 to 20000). Each slide has one plot for the Non Overlapping Execution Scheme and one for the Overlapping Execution Scheme, with curves for cyclic, mirror, cluster and retile.]

Advantages - Disadvantages

cyclic
  + fast pipeline filling
  - communication
mirror
  + better communication than cyclic
  - idle time steps
  - worse communication than cluster, retile
cluster
  + communication: 1) little volume of data to be transferred 2) data combined in fewer messages
  - slow pipeline filling
retile
  + fast pipeline filling
  + communication: little volume of data to be transferred
  - reorganizes tiles, which annuls the optimal tile shape for cache hits

The End


Cyclic Assignment - Overlapping Communication

[Figure: two equivalent schedulings on SMP0 and SMP1 (CPU0, CPU1). Diagrams of processors (P) against time (t) compare scheduling on an unlimited number of processors with scheduling on a fixed number of processors, where the pipeline is empty while waiting for the necessary data to become available.]