Massively Parallel Architectures
Colin Egan, Jason McGuiness and Michael Hicks (Compiler Technology and Computer Architecture Research Group)


DESCRIPTION

BCS Hertfordshire 2007

TRANSCRIPT

Page 1: Massively Parallel Architectures

1

Massively Parallel Architectures

Colin Egan, Jason McGuiness and Michael Hicks (Compiler Technology and Computer Architecture Research Group)

Page 2: Massively Parallel Architectures

2

Presentation Structure

Introduction

Conclusions

Questions? (ask as we go along and I’ll also leave time for questions at the end of this presentation)

Page 3: Massively Parallel Architectures

3

The memory wall

High performance computing can be achieved through instruction level parallelism (ILP) by:

Pipelining,
Multiple instruction issue (MII).

But both of these techniques can cause a bottleneck in the memory subsystem, which may impact on overall system performance.

The memory wall.

Page 4: Massively Parallel Architectures

4

Overcoming the memory wall

We can improve data throughput and data storage between the CPU and the memory subsystem by introducing extra levels into the memory hierarchy.

L1, L2, L3 Caches.

Page 5: Massively Parallel Architectures

5

Overcoming the memory wall

But … this increases the penalty associated with a miss in the memory subsystem, which impacts on the average cache access time, where:

average cache access time = hit-time + miss-rate * miss-penalty

For two levels of cache:
average access time = hit-time_L1 + miss-rate_L1 * (hit-time_L2 + miss-rate_L2 * miss-penalty_L2)
because miss-penalty_L1 = hit-time_L2 + miss-rate_L2 * miss-penalty_L2.

For three levels of cache:
average access time = hit-time_L1 + miss-rate_L1 * (hit-time_L2 + miss-rate_L2 * (hit-time_L3 + miss-rate_L3 * miss-penalty_L3))
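For the two-level case, a minimal C sketch that evaluates the formula; the cycle counts and miss rates used below are illustrative assumptions, not figures from the presentation.

#include <stdio.h>

/* Average access time for a two-level cache hierarchy:
   t_avg = hit_L1 + miss_L1 * (hit_L2 + miss_L2 * penalty_L2) */
static double avg_access_time_2level(double hit_l1, double miss_rate_l1,
                                     double hit_l2, double miss_rate_l2,
                                     double miss_penalty_l2)
{
    return hit_l1 + miss_rate_l1 * (hit_l2 + miss_rate_l2 * miss_penalty_l2);
}

int main(void)
{
    /* Illustrative numbers only (cycles and miss ratios), not taken from the slides. */
    double t = avg_access_time_2level(1.0, 0.05, 10.0, 0.20, 100.0);
    printf("average access time = %.2f cycles\n", t); /* 1 + 0.05*(10 + 0.2*100) = 2.50 */
    return 0;
}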

Page 6: Massively Parallel Architectures

6

Overcoming the memory wall

The average cache access time is therefore dependent on three criteria: hit-time, miss-rate and miss-penalty.

Increasing the number of levels in the memory hierarchy does not improve main memory access time.

Recovery from a cache miss causes a main memory bottleneck, since main memory cannot keep up with the required fetch rate.

Page 7: Massively Parallel Architectures

7

Overcoming the memory wall

The overall effect:

can limit the amount of ILP,
can impact on processor performance,
increases the architectural design complexity,
increases hardware cost.

Page 8: Massively Parallel Architectures

8

Overcoming the memory wall

So perhaps we should improve the performance of the memory subsystem.

Or we could place processors in memory: Processing In Memory (PIM).

Page 9: Massively Parallel Architectures

9

Processing in memory

The idea of PIM is to overcome the bottleneck between the processor and main memory by combining a processor and memory on a single chip.

The benefits of PIM are:

reduced latency,
a simplification of the memory hierarchy,
high bandwidth,
multi-processor scaling capabilities

Cellular architectures.

Page 10: Massively Parallel Architectures

10

Processing in memory

Reduced latency: in PIM architectures, accesses from the processor to memory:

do not have to go through multiple memory levels (caches) to go off chip,
do not have to travel along a printed circuit line,
do not have to be reconverted back down to logic levels,
do not have to be re-synchronised with a local clock.

Currently there is about an order of magnitude reduction in latency, thereby overcoming the memory wall.

Page 11: Massively Parallel Architectures

11

Processing in memory

Beneficial side-effects of reducing latency:

total silicon area is reduced,
power consumption is reduced.

Page 12: Massively Parallel Architectures

12

Processing in memory

But …

processor speed is reduced,
the amount of available memory is reduced.

Page 13: Massively Parallel Architectures

13

Cellular architectures

PIM is easily scaled: multiple PIM chips can be connected together, forming a network of PIM cells (Cellular architectures), to overcome these two problems.

Page 14: Massively Parallel Architectures

14

Cellular architectures

Cellular architectures are threaded:

each thread unit is independent of all other thread units,
each thread unit serves as a single in-order issue processor,
each thread unit shares computationally expensive hardware such as floating-point units,
there can be a large number (1,000s) of thread units – therefore they are massively parallel.

Page 15: Massively Parallel Architectures

15

Cellular architectures

Cellular architectures have irregular memory access:

some memory is very close to the thread units and is extremely fast,
some is off-chip and slow.

Cellular architectures, therefore, use caches and have a memory hierarchy.

Page 16: Massively Parallel Architectures

16

Cellular architectures

In Cellular architectures multiple thread units perform memory accesses independently.

This means that the memory subsystem of Cellular architectures requires some form of memory access model that permits memory accesses to be effectively served.

Page 17: Massively Parallel Architectures

17

Cellular architectures

Examples:

Bluegene Project,
Cyclops (IBM),
DIMES,
Gilgamesh (NASA),
Shamrock (Notre Dame).

Page 18: Massively Parallel Architectures

18

Cyclops

Developed by IBM at the Thomas J. Watson Research Center. Also called Bluegene/C, in comparison with the later version, Bluegene/L.

Page 19: Massively Parallel Architectures

19

Logical View of the Cyclops64 Chip Architecture

[Figure: logical view of the Cyclops64 chip and board. Each processor contains thread units (TU) with scratch-pad memories (SP), shared registers (SR) and a floating-point unit (FPU); the processors and the on-chip memory banks are connected by a crossbar network, with further links to off-chip memory, an IDE hard drive and other chips via a 3D mesh (labelled link bandwidths: 6 * 4 GB/sec, 4 GB/sec, 50 MB/sec and 1 Gbit/sec).]

Processors: 80 / chip
Thread Units (TU): 2 / processor
Floating-Point Unit (FPU): 1 / processor
Shared Registers (SR): 0 / processor
Scratch-Pad Memory (SP): n KB / TU
On-chip Memory (SRAM): 160 x 28 KB = 4480 KB
Off-chip Memory (DDR2): 4 x 256 MB = 1 GB
Hard Drive (IDE): 1 x 120 GB = 120 GB

SP is located in the memory banks and is accessed directly by the TU, and via the crossbar network by the other TUs.

Page 20: Massively Parallel Architectures

Slide 19

NOTE ABOUT SCRATCH-PAD MEMORIES:

There is one memory bank (MB) per TU on the chip. One part of the MB is reserved for the SP of the corresponding TU. The size of the SP is customizable.

The TU can access its SP directly, without having to use the crossbar network, with a latency of 3/2 cycles for ld/st. The other TUs can access the SP of another TU by accessing the corresponding MB through the network, with a latency of 22/11 cycles for ld/st. Even TUs from the same processor must use the network to access each other's SPs.

The memory bank can support only one access per cycle. Therefore, if a TU accesses its SP while another TU accesses the same SP through the network, some arbitration will occur and one of the two accesses will be delayed.
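To make the note concrete, here is a small illustrative C sketch (not from the presentation) that estimates a thread unit's expected scratch-pad load latency from the quoted figures of 3 cycles for a local load and 22 cycles for a load routed through the crossbar; the remote-access fraction is an assumed parameter.

#include <stdio.h>

/* Illustrative sketch: expected scratch-pad *load* latency for a TU,
   given 3 cycles for a local load and 22 cycles for a load that has to
   go through the crossbar network to another TU's memory bank. */
static double expected_sp_load_latency(double remote_fraction)
{
    const double local_load_cycles  = 3.0;   /* TU accessing its own SP   */
    const double remote_load_cycles = 22.0;  /* another TU's SP, via crossbar */
    return (1.0 - remote_fraction) * local_load_cycles
         + remote_fraction * remote_load_cycles;
}

int main(void)
{
    /* e.g. if 10% of a TU's loads hit another TU's scratch-pad */
    printf("expected load latency = %.1f cycles\n",
           expected_sp_load_latency(0.10)); /* 0.9*3 + 0.1*22 = 4.9 */
    return 0;
}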

Page 21: Massively Parallel Architectures

20

DIMES

Delaware Interactive Multiprocessor Emulation System (DIMES)

is under development by the Computer Architecture and Parallel Systems Laboratory (CAPSL) at the University of Delaware, Newark, DE, USA, under the leadership of Prof. Guang Gao.

Page 22: Massively Parallel Architectures

21

DIMES

DIMES:

is the first hardware implementation of a Cellular architecture,
is a simplified ‘cut-down’ version of Cyclops,
is a hardware validation tool for Cellular architectures,
emulates Cellular architectures, in particular Cyclops, cycle by cycle,
is implemented on at least one FPGA,
has been evaluated by Jason McGuiness.

Page 23: Massively Parallel Architectures

22

DIMES

The DIMES implementation that Jason evaluated:

supports a P-thread programming model (see the sketch after this list),
is a dual processor, where each processor has four thread units,
has 4K of scratch-pad (local) memory per thread unit,
has two banks of 64K global shared memory,
has different memory models:
scratch-pad memory obeys the program consistency model for all of the eight thread units,
global memory obeys the sequential consistency model for all of the eight thread units,
is called DIMES/P2.
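As an illustration of the P-thread programming model, a minimal, generic POSIX-threads sketch follows; it is not DIMES-specific code. Eight software threads, mirroring DIMES/P2's eight thread units, each sum one slice of an array held in shared ("global") memory.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 8
#define N 1024

static int  data[N];             /* shared memory, visible to all threads */
static long partial[NTHREADS];   /* one result slot per thread            */

static void *worker(void *arg)
{
    long id = (long)arg;
    long sum = 0;
    for (int i = (int)id * (N / NTHREADS); i < (int)(id + 1) * (N / NTHREADS); ++i)
        sum += data[i];
    partial[id] = sum;           /* each thread writes only its own slot  */
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    for (int i = 0; i < N; ++i) data[i] = i;

    for (long t = 0; t < NTHREADS; ++t)
        pthread_create(&tid[t], NULL, worker, (void *)t);

    long total = 0;
    for (long t = 0; t < NTHREADS; ++t) {
        pthread_join(tid[t], NULL);
        total += partial[t];
    }
    printf("total = %ld\n", total);   /* 0 + 1 + ... + 1023 = 523776 */
    return 0;
}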

Page 24: Massively Parallel Architectures

23

DIMES

DIMES/P2.

[Figure: DIMES/P2 block diagram. Two processors (Processor 0 and Processor 1), each with four thread units (Thread Unit 0–3), 4K of scratch-pad memory per thread unit and a 64K bank of global memory, connected by a network.]

Page 25: Massively Parallel Architectures

24

Gilgamesh

This material is based on: Sterling TL and Zima HP, Gilgamesh: A Multithreaded Processor-In-Memory Architecture for Petaflops Computing, IEEE, 2002.

Page 26: Massively Parallel Architectures

25

Gilgamesh

A PIM-based massively parallel architecture.
It extends existing PIM capabilities by:

incorporating advanced mechanisms for virtualizing tasks and data,
providing adaptive resource management for load balancing and latency tolerance.

Page 27: Massively Parallel Architectures

26

Gilgamesh System Architecture

Collection of MIND chips interconnected by a system network

MIND: Memory, Intelligence, Network Device.
A MIND chip contains:

multiple banks of DRAM storage,
processing logic,
an external I/O interface,
inter-chip communication channels.

Page 28: Massively Parallel Architectures

27

Gilgamesh System Architecture

[Figure 1 from Sterling & Zima (2002)]

Page 29: Massively Parallel Architectures

28

MIND Chip

The MIND chip is multithreaded.
MIND chips communicate via parcels:

a parcel can be used to perform conventional operations as well as to invoke methods on a remote chip.

Page 30: Massively Parallel Architectures

29

MIND Chip Architecture

[Figure 2 from Sterling & Zima (2002)]

Page 31: Massively Parallel Architectures

30

MIND Chip Architecture

Nodes provide storage, information processing and execution control.
Local wide registers serve as thread state, instruction cache and vector memory row buffer.
The wide ALU performs multiple operations on separate fields simultaneously.
The system memory bus interface connects MIND to workstation and server motherboards.
The data streaming interface deals with rapid, high-bandwidth data movement.
On-chip communication interconnects allow sharing of pipelined FPUs among nodes.

Page 32: Massively Parallel Architectures

31

MIND Node Structure

[Figure 3 from Sterling & Zima (2002)]

Page 33: Massively Parallel Architectures

32

MIND Node Structure

A MIND node integrates a memory block with processing logic and control mechanisms.
The wide ALU is 256 bits and supports multiple operations on separate fields simultaneously (see the sketch below).
The ALU does not provide explicit floating-point operations.
A permutation network rearranges bytes within a complete row.
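To illustrate what "multiple operations on separate fields simultaneously" means, here is a hypothetical C sketch that models a 256-bit row as eight 32-bit fields and applies the same operation to every field; the field split is an assumption for illustration, not the actual MIND ALU layout.

#include <stdint.h>
#include <stdio.h>

/* Model a 256-bit memory row as eight 32-bit fields; a "wide" operation
   applies the same ALU function to every field of the row. */
typedef struct { uint32_t field[8]; } wide_row_t;   /* 8 x 32 = 256 bits */

static wide_row_t wide_add(wide_row_t a, wide_row_t b)
{
    wide_row_t r;
    for (int i = 0; i < 8; ++i)        /* conceptually one wide operation */
        r.field[i] = a.field[i] + b.field[i];
    return r;
}

int main(void)
{
    wide_row_t a = {{1, 2, 3, 4, 5, 6, 7, 8}};
    wide_row_t b = {{10, 20, 30, 40, 50, 60, 70, 80}};
    wide_row_t c = wide_add(a, b);
    for (int i = 0; i < 8; ++i)
        printf("%u ", c.field[i]);     /* 11 22 33 44 55 66 77 88 */
    printf("\n");
    return 0;
}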

Page 34: Massively Parallel Architectures

33

MIND PIM Architecture

A high degree of memory bandwidth is achieved by:

accessing an entire memory row at a time: data parallelism,
partitioning the total memory into multiple independently accessible blocks: multiple memory banks.

Therefore:

memory bandwidth is of square order over an off-chip architecture,
latency is low.
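A rough back-of-the-envelope C sketch of why row-wide accesses across many banks give high on-chip bandwidth; the bank count, row width and clock rate below are assumed figures, not taken from the paper.

#include <stdio.h>

int main(void)
{
    /* Assumed figures for illustration only. */
    const double banks           = 16;        /* independently accessible blocks        */
    const double row_bytes       = 256.0 / 8; /* one 256-bit row = 32 bytes              */
    const double accesses_per_s  = 500e6;     /* one row access per bank per cycle, 500 MHz */

    double peak = banks * row_bytes * accesses_per_s;      /* bytes per second */
    printf("peak on-chip bandwidth ~ %.0f GB/s\n", peak / 1e9); /* ~256 GB/s     */
    return 0;
}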

Page 35: Massively Parallel Architectures

34

Parcels

A parcel is a variable-length communication packet that contains sufficient information to perform a remote procedure invocation.
It is targeted to a virtual address, which may identify individual variables, blocks of data, structures, objects, threads, or I/O streams.
Parcel fields mainly contain: destination address, parcel action specifier, parcel argument, continuation, and housekeeping.
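A hypothetical C struct showing how the listed parcel fields might be laid out; the field names follow the slide, but the types and sizes are assumptions rather than the actual MIND parcel format.

#include <stdio.h>
#include <stdint.h>

typedef struct {
    uint64_t dest_vaddr;   /* destination: virtual address of a variable, object or thread */
    uint32_t action;       /* parcel action specifier: operation or method to invoke        */
    uint32_t housekeeping; /* routing and bookkeeping information                           */
    uint64_t continuation; /* where to send a result or resume on completion                */
    uint32_t arg_len;      /* length of the variable-sized argument field                   */
    uint8_t  args[];       /* parcel argument: flexible array member, variable length       */
} parcel_t;

int main(void)
{
    /* The fixed header is small; the argument field is what makes a parcel variable length. */
    printf("parcel header size = %zu bytes\n", sizeof(parcel_t));
    return 0;
}

A parcel carrying a remote method invocation would, under these assumptions, set action to the method identifier and carry the marshalled parameters in args.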

Page 36: Massively Parallel Architectures

35

Multithreaded Execution

Each thread resides in a MIND node wide register.
The multithread controller determines:

when the thread is ready to perform its next operation,
which resources will be required and allocated.

Threads can be created, terminated or destroyed, suspended, and stored.

Page 37: Massively Parallel Architectures

36

Shamrock Notre Dame

This material is based on: Kogge PM, Brockman JB, Sterling T and Gao G, Processing in Memory: Chips to Petaflops, presented at a workshop of ISCA '97.

Page 38: Massively Parallel Architectures

37

Shamrock Notre Dame

A PIM architecture developed at the University of Notre Dame.

Page 39: Massively Parallel Architectures

38

Shamrock Architecture Overview

[Figure 2 from Kogge et al. (1997)]

Page 40: Massively Parallel Architectures

39

Node Structure

[Figure 3 from Kogge et al. (1997)]

Page 41: Massively Parallel Architectures

40

Node Structure

A node consists of logic for a CPU, four arrays of memory, and data routing.

There are two separately addressable memories on each face of the CPU.

Each memory array is made up of multiple sub-units, which allows all bits read from a memory in one cycle to be available at the same time to the CPU logic.

Page 42: Massively Parallel Architectures

41

Shamrock Chip

Nodes are arranged in rows, staggered so that the two memory arrays on one side of one node impinge directly on the memory arrays of two different nodes in the next row:

Each CPU has a true shared memory interface with four other CPUs.

Page 43: Massively Parallel Architectures

42

Shamrock Chip Architecture

A PIM architecture developed at the University of Notre Dame.

Page 44: Massively Parallel Architectures

43

Shamrock Chip

Off-chip connections can:

connect to adjoining chips in the same tile topology,
allow a processor on the outside to view the chip as memory in the conventional sense.

Page 45: Massively Parallel Architectures

44

Massively Parallel Architectures

Questions?

Colin Egan, Jason McGuiness and Michael Hicks (Compiler Technology and Computer Architecture Research Group)