Scalar Operand Networks for Tiled Microprocessors


Page 1: Scalar Operand Networks for Tiled Microprocessors

Scalar Operand Networks for Tiled Microprocessors

Michael Taylor

Raw Architecture Project

MIT CSAIL

(now at UCSD)

Page 2: Scalar Operand Networks for Tiled Microprocessors

Until 3 years ago, computer architects used the N-way superscalar to encapsulate the ideal for a parallel processor: nearly “perfect”, but not attainable.

                              Superscalar (or VLIW)
“PE”->“PE” communication      Free
Exploitation of parallelism   Implicit (hw scheduler or compiler)
Clean semantics               Yes
Scalable                      No
Power efficient               No

Page 3: Scalar Operand Networks for Tiled Microprocessors

What’s great about superscalar microprocessors? It’s the networks!

Fast, low-latency, tightly-coupled networks (0-1 cycles of latency, no occupancy)
- For lack of a better name, let’s call them Scalar Operand Networks (SONs)
- Can we incorporate the benefits of superscalar communication + multicore scalability?
- Can we build Scalable Scalar Operand Networks?

(I agree with Jose: “We need low-latency tightly-coupled … network interfaces” - Jose Duato, OCIN, Dec 6, 2006)

mul $2,$3,$4

add $6,$5,$2

Page 4: Scalar Operand Networks for Tiled Microprocessors

The industry shift toward Multicore - attainable but hardly ideal

                              Superscalar   Multicore
“PE”->“PE” communication      Free          Expensive
Exploitation of parallelism   Implicit      Explicit
Clean semantics               Yes           No
Scalable                      No            Yes
Power efficient               No            Yes

Page 5: Scalar Operand Networks for Tiled Microprocessors

What we’d like: neither superscalar nor multicore

                              Superscalar   Multicore
“PE”->“PE” communication      Free          Expensive
Exploitation of parallelism   Implicit      Explicit
Clean semantics               Yes           No
Scalable                      No            Yes
Power efficient               No            Yes

Superscalars have fast networks and great usability.

Multicore has great scalability and efficiency.

Page 6: Scalar Operand Networks for Tiled Microprocessors

Why communication is expensive on multicore

[Figure: path of a message from Multiprocessor Node 1 to Multiprocessor Node 2. Labels: send overhead, send occupancy, send latency, transport cost, receive latency, receive occupancy, receive overhead.]

Page 7: Scalar Operand Networks for Tiled Microprocessors

Multiprocessor SON Operand Routing

Multiprocessor Node 1

- Send occupancy: destination node name, sequence number, value, launch sequence
- Send latency: commit latency, network injection

Page 8: Scalar Operand Networks for Tiled Microprocessors

Multiprocessor SON Operand Routing

Multiprocessor Node 2

- Receive occupancy: receive sequence, demultiplexing, branch mispredictions
- Receive latency: injection cost

Shared-memory multiprocessors pay similar overheads: store instruction, commit latency, spin locks (plus attendant branch mispredictions).

Page 9: Scalar Operand Networks for Tiled Microprocessors

Defining a figure of merit for scalar operand networks

5-tuple <SO, SL, NHL, RL, RO>:
- SO: Send Occupancy
- SL: Send Latency
- NHL: Network Hop Latency
- RL: Receive Latency
- RO: Receive Occupancy

Tip: Ordering follows timing of message from sender to receiver

We can use this metric to quantitatively differentiate SONs from existing multiprocessor networks…
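A minimal sketch of how the 5-tuple might be used to compare networks. The additive cost model (occupancies and latencies simply summed, with the hop latency multiplied by the hop count) is an assumption made for illustration, and the class and method names are hypothetical. The three tuples are the ones the talk quotes on its "interesting region" slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SonTuple:
    """5-tuple <SO, SL, NHL, RL, RO>, every field in cycles."""
    send_occupancy: int
    send_latency: int
    network_hop_latency: int
    receive_latency: int
    receive_occupancy: int

    def end_to_end_cycles(self, hops: int) -> int:
        # Assumption: the components add up along the message path.
        return (self.send_occupancy + self.send_latency
                + hops * self.network_hop_latency
                + self.receive_latency + self.receive_occupancy)

    def occupancy(self) -> int:
        # Cycles the sending and receiving processors spend on the message.
        return self.send_occupancy + self.receive_occupancy


# Tuples quoted in the talk.
POWER4 = SonTuple(2, 14, 0, 14, 4)
RAW = SonTuple(0, 0, 1, 2, 0)
SUPERSCALAR = SonTuple(0, 0, 0, 0, 0)

for name, son in [("Power4", POWER4), ("Raw", RAW), ("Superscalar", SUPERSCALAR)]:
    print(name, son.end_to_end_cycles(hops=1), son.occupancy())
```

Under this model, a one-hop operand costs tens of cycles on the Power4 tuple and a handful on the Raw tuple, which is the gap the figure of merit is meant to expose.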

Page 10: Scalar Operand Networks for Tiled Microprocessors

Impact of Occupancy (“o” = SO + RO)

If o * “surface area” > “volume”: not worth it to offload, because the overhead is too high (parallelism too fine-grained).

Impact of Latency

The lower the latency, the less work is needed to keep myself busy while waiting for the answer. If there is not enough parallelism to hide the latency: not worth it to offload, because I could have done it myself faster.

[Figure: timeline for Proc 0 and Proc 1, with an idle interval labeled “nothing to do”.]
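The occupancy rule can be written as a small predicate. This is only a sketch: reading "volume" as the cycles of work being offloaded and "surface area" as the number of operands that must cross the network is an interpretation, not something the slide spells out, and the function name is hypothetical.

```python
def worth_offloading(volume_cycles: float, surface_operands: float,
                     send_occupancy: float, receive_occupancy: float) -> bool:
    """Apply the slide's rule: offloading loses when o * surface area > volume."""
    o = send_occupancy + receive_occupancy
    return o * surface_operands <= volume_cycles


# A 20-cycle block that exchanges 4 operands with its neighbor.
print(worth_offloading(20, 4, send_occupancy=2, receive_occupancy=4))  # o = 6
print(worth_offloading(20, 4, send_occupancy=0, receive_occupancy=0))  # o = 0
```

With the Power4 occupancies (2 and 4) the block is too fine-grained to offload; with zero occupancy, as in the Raw tuple, the same block passes the test.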

Page 11: Scalar Operand Networks for Tiled Microprocessors

The interesting region

Power4 <2, 14, 0, 14, 4> (on-chip)

Superscalar <0, 0, 0, 0, 0> (not scalable)

Page 12: Scalar Operand Networks for Tiled Microprocessors

Tiled Microprocessors (or “Tiled Multicore”)

                              Superscalar   Multicore   Tiled Multicore (w/ scalable SON)
PE-PE communication           Free          Expensive   Cheap
Exploitation of parallelism   Implicit      Explicit    Both
Scalable                      No            Yes         Yes
Power efficient               No            Yes         Yes

Page 13: Scalar Operand Networks for Tiled Microprocessors

Tiled Microprocessors (or “Tiled Multicore”)

                              Superscalar   Multicore   Tiled Multicore
ALU-ALU communication         Free          Expensive   Cheap
Exploitation of parallelism   Implicit      Explicit    Both
Scalable                      No            Yes         Yes
Power efficient               No            Yes         Yes

Page 14: Scalar Operand Networks for Tiled Microprocessors

Transforming from multicore or superscalar to tiled

- Superscalar: add scalability -> Tiled
- CMP/multicore: add scalable SON -> Tiled

Page 15: Scalar Operand Networks for Tiled Microprocessors

The interesting region

Power4 <2, 14, 0, 14, 4> (on-chip)

Raw <0, 0, 1, 2, 0>

Tiled “Famous Brand 2” <0, 0, 1, 0, 0>

Superscalar <0, 0, 0, 0, 0> (not scalable)

Page 16: Scalar Operand Networks for Tiled Microprocessors

Scalability Problems in Wide Issue Microprocessors

[Figure: a 16-ALU wide-issue microprocessor. Labeled blocks: Control, PC, Wide Fetch (16 inst), Unified Load/Store Queue, RF, Bypass Net, and 16 ALUs.]

Page 17: Scalar Operand Networks for Tiled Microprocessors

Area and Frequency Scalability Problems

[Figure: 16 ALUs connected to a shared RF and Bypass Net. Scaling annotations for N ALUs: ~N^3 and ~N^2.]

Ex: Itanium 2

Without modification, freq decreases linearly or worse.

Page 18: Scalar Operand Networks for Tiled Microprocessors

Operand Routing is Global

[Figure: 16 ALUs sharing one RF and Bypass Net, with a >> instruction and a + instruction mapped to two of the ALUs.]

Page 19: Scalar Operand Networks for Tiled Microprocessors

Idea: Make Operand Routing Local

[Figure: 16 ALUs, RF, and Bypass Net.]

Page 20: Scalar Operand Networks for Tiled Microprocessors

Idea: Exploit Locality

[Figure: 16 ALUs, RF, and Bypass Net.]

Page 21: Scalar Operand Networks for Tiled Microprocessors

[Figure: 16 ALUs and RF.]

Replace the crossbar with a point-to-point, pipelined, routed scalar operand network.

Page 22: Scalar Operand Networks for Tiled Microprocessors

[Figure: 16 ALUs and RF, with a >> instruction and a + instruction mapped to two of the ALUs.]

Replace the crossbar with a point-to-point, pipelined, routed scalar operand network.

Page 23: Scalar Operand Networks for Tiled Microprocessors

Operand Transport Scaling: Bandwidth and Area

For N ALUs and N^(1/2) bisection bandwidth, as in a conventional superscalar:

            Un-pipelined crossbar bypass   Point-to-point routed mesh network
Local BW    ~ N^(1/2)                      ~ N
Area        ~ N^2                          ~ N

Scales as 2-D VLSI.

We can route more operands per unit time if we are able to map communicating instructions nearby.
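To make the growth rates concrete, here is a small sketch that compares how local bandwidth and area change when the ALU count grows. The proportionality constants are all set to 1, which is an assumption; only the ratios between two sizes are meaningful. The function names are hypothetical.

```python
import math


def crossbar(n: int) -> dict:
    """Un-pipelined crossbar bypass: local BW ~ N^(1/2), area ~ N^2."""
    return {"local_bw": math.sqrt(n), "area": float(n ** 2)}


def mesh(n: int) -> dict:
    """Point-to-point routed mesh network: local BW ~ N, area ~ N."""
    return {"local_bw": float(n), "area": float(n)}


def growth(model, n_from: int, n_to: int) -> dict:
    """Factor by which each quantity grows when going from n_from to n_to ALUs."""
    before, after = model(n_from), model(n_to)
    return {key: after[key] / before[key] for key in before}


print("crossbar 16 -> 64:", growth(crossbar, 16, 64))
print("mesh     16 -> 64:", growth(mesh, 16, 64))
```

Going from 16 to 64 ALUs, the crossbar model pays 16 times the area for twice the local bandwidth, while the mesh model pays 4 times the area for 4 times the local bandwidth.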

Page 24: Scalar Operand Networks for Tiled Microprocessors

Operand Transport Scaling - Latency

Time for operand to travel between instructions mapped to different ALUs.

                             Un-pipelined crossbar   Point-to-point routed mesh network
Non-local placement          ~ N                     ~ N^(1/2)
Locality-driven placement    ~ N                     ~ 1

Latency bonus if we map communicating instructions nearby so communication is local.
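The mesh column can be checked with a short hop-count sketch. It assumes a square mesh with tiles numbered in row-major order and dimension-ordered (Manhattan) routes; both are illustrative assumptions, and the helper names are hypothetical.

```python
from itertools import product


def mesh_hops(a: int, b: int, width: int) -> int:
    """Manhattan distance between tiles a and b on a width x width mesh."""
    ax, ay = a % width, a // width
    bx, by = b % width, b // width
    return abs(ax - bx) + abs(ay - by)


def mean_hops(width: int) -> float:
    """Average hop count over all ordered pairs of distinct tiles."""
    tiles = range(width * width)
    pairs = [(a, b) for a, b in product(tiles, tiles) if a != b]
    return sum(mesh_hops(a, b, width) for a, b in pairs) / len(pairs)


print("neighbor hop count:", mesh_hops(0, 1, 4))
print("mean hops, 16 tiles:", mean_hops(4))
print("mean hops, 64 tiles:", mean_hops(8))
```

Placing communicating instructions on neighboring tiles keeps the distance at one hop regardless of N, while random placement gives a mean distance that doubles when N quadruples, in line with the ~ 1 and ~ N^(1/2) entries.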

Page 25: Scalar Operand Networks for Tiled Microprocessors

Distribute the Register File

[Figure: 16 ALUs, with the single RF split into 16 RFs, one per ALU.]

Page 26: Scalar Operand Networks for Tiled Microprocessors

[Figure: 16 ALUs with 16 distributed RFs, marked SCALABLE, alongside Control, PC, Wide Fetch (16 inst), and Unified Load/Store Queue.]

Page 27: Scalar Operand Networks for Tiled Microprocessors

More Scalability Problems

[Figure: Control, PC, Wide Fetch (16 inst), and Unified Load/Store Queue.]

Page 28: Scalar Operand Networks for Tiled Microprocessors

Distribute the rest: Raw – a Fully-Tiled Microprocessor

[Figure: 16 ALUs with 16 RFs; the Control, Wide Fetch (16 inst), Unified Load/Store Queue, and PC blocks give way to a separate I$, PC, and D$ for each ALU.]

Page 29: Scalar Operand Networks for Tiled Microprocessors

Tiles!

[Figure: 16 tiles, each containing an ALU, RF, I$, PC, and D$.]

Page 30: Scalar Operand Networks for Tiled Microprocessors

Tiles!

Page 31: Scalar Operand Networks for Tiled Microprocessors

Tiled Microprocessors

-fast inter-tile communication through SON

-easy to scale (same reasons as multicore)

Page 32: Scalar Operand Networks for Tiled Microprocessors

Outline

1. Scalar Operand Network and Tiled Microprocessor intro

2. Raw Architecture + SON

3. VLSI implementation of Raw, a scalable microprocessor with a scalar operand network.

Page 33: Scalar Operand Networks for Tiled Microprocessors

Raw Microprocessor

Tiled scalable microprocessor
Point-to-point pipelined networks

16 tiles, 16 issue

Each 4 mm x 4 mm tile:
- MIPS-style compute processor
  - single-issue 8-stage pipe
  - 32b FPU
  - 32K D Cache, I Cache
- 4 on-chip networks
  - two for operands
  - one for cache misses
  - one for message passing

Page 34: Scalar Operand Networks for Tiled Microprocessors

Raw Microprocessor Components

[Figure: block diagram of one tile. Labels: Compute Processor (Fetch Unit, Instruction Cache, Data Cache, Functional Units, Execution Core, Intra-tile SON); Switch Processor (Instruction Cache, Static Router, Crossbar, Inter-tile SON); Generalized Transport Networks (Dynamic Router “GDN”, Dynamic Router “MDN”, Crossbar); Trusted Core; Untrusted Core; Inter-tile Network Links.]

Page 35: Scalar Operand Networks for Tiled Microprocessors

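Raw Compute Processor Internals

[Figure: compute processor pipeline with stage labels including RF, A, TL, M1, M2, F, P, E, and U, and registers r24, r25, r26, and r27.]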

Ex: fadd r24, r25, r26

Page 36: Scalar Operand Networks for Tiled Microprocessors

Tile-Tile Communication

add $25,$1,$2

Page 37: Scalar Operand Networks for Tiled Microprocessors

Tile-Tile Communication

add $25,$1,$2

Route $P->$E

Page 38: Scalar Operand Networks for Tiled Microprocessors

Tile-Tile Communication

add $25,$1,$2

Route $P->$E

Route $W->$P

Page 39: Scalar Operand Networks for Tiled Microprocessors

Tile-Tile Communication

add $25,$1,$2

Route $P->$E

Route $W->$P

sub $20,$1,$25
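The four steps can be traced with a toy model in which every link is a FIFO. This is a sketch under several assumptions: that $25 is a network-mapped register (writing it injects a value, reading it consumes one), that the receiving tile is the east neighbor of the sending tile, and that timing can be ignored. The queue names and register contents are made up for illustration.

```python
from collections import deque

# One FIFO per link; names are illustrative only.
proc_to_switch_0 = deque()   # tile 0: processor -> switch ($P as a route source)
link_east_to_west = deque()  # tile 0 east port -> tile 1 west port
switch_to_proc_1 = deque()   # tile 1: switch -> processor ($P as a route destination)

regs0 = {1: 10, 2: 32}       # hypothetical register contents on tile 0
regs1 = {1: 50}              # hypothetical register contents on tile 1

# Tile 0 processor: add $25,$1,$2 (the result goes to the network, not the RF)
proc_to_switch_0.append(regs0[1] + regs0[2])

# Tile 0 switch: Route $P->$E
link_east_to_west.append(proc_to_switch_0.popleft())

# Tile 1 switch: Route $W->$P
switch_to_proc_1.append(link_east_to_west.popleft())

# Tile 1 processor: sub $20,$1,$25 (the second operand comes off the network)
regs1[20] = regs1[1] - switch_to_proc_1.popleft()

print(regs1[20])
```

The point of the model is that neither processor runs a send or receive routine: the operand moves as a side effect of ordinary register names, which is why the occupancy terms can be zero.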

Page 40: Scalar Operand Networks for Tiled Microprocessors

tmp3 = (seed*6+2)/3
v2 = (tmp1 - tmp3)*5
v1 = (tmp1 + tmp2)*3
v0 = tmp0 - v1
….

pval5=seed.0*6.0

pval4=pval5+2.0

tmp3.6=pval4/3.0

tmp3=tmp3.6

v3.10=tmp3.6-v2.7

v3=v3.10

v2.4=v2

pval3=seed.0*v2.4

tmp2.5=pval3+2.0

tmp2=tmp2.5

pval6=tmp1.3-tmp2.5

v2.7=pval6*5.0

v2=v2.7

seed.0=seed

pval1=seed.0*3.0

pval0=pval1+2.0

tmp0.1=pval0/2.0

tmp0=tmp0.1

v1.2=v1

pval2=seed.0*v1.2

tmp1.3=pval2+2.0

tmp1=tmp1.3

pval7=tmp1.3+tmp2.5

v1.8=pval7*3.0

v1=v1.8

v0.9=tmp0.1-v1.8

v0=v0.9


Compilation

RawCC assigns instructions to the tiles, maximizing locality. It also generates the static router instructions that transfer operands between tiles.
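One way to picture "maximizing locality" is as a cost that counts operands crossing between tiles. The sketch below is not RawCC's algorithm; it only evaluates two hand-written placements of a few instructions from the listing, and the dependency edges, tile numbers, and function name are assumptions made for illustration.

```python
# (producer, consumer) pairs for two chains taken from the listing.
EDGES = [
    ("seed.0", "pval1"), ("pval1", "pval0"), ("pval0", "tmp0.1"),
    ("seed.0", "pval5"), ("pval5", "pval4"), ("pval4", "tmp3.6"),
]


def cross_tile_operands(assignment: dict, edges=EDGES) -> int:
    """Count operands whose producer and consumer sit on different tiles."""
    return sum(1 for src, dst in edges if assignment[src] != assignment[dst])


# Placement that keeps each dependency chain on a single tile.
by_chain = {"seed.0": 0, "pval1": 0, "pval0": 0, "tmp0.1": 0,
            "pval5": 1, "pval4": 1, "tmp3.6": 1}

# Placement that alternates tiles without regard to dependencies.
alternating = {"seed.0": 0, "pval1": 1, "pval0": 0, "tmp0.1": 1,
               "pval5": 0, "pval4": 1, "tmp3.6": 0}

print("by chain:   ", cross_tile_operands(by_chain))
print("alternating:", cross_tile_operands(alternating))
```

Every counted edge corresponds to an operand that needs static router instructions, so a locality-aware placement also shrinks the routing schedule the compiler has to generate.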

Page 41: Scalar Operand Networks for Tiled Microprocessors

One cycle in the life of a tiled micro

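[Figure: a tiled chip running several workloads at once: httpd, a 4-way automatically parallelized C program, a 2-thread MPI app, a direct I/O stream into the scalar operand network, three memories (mem), and idle tiles (Zzz...).]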

An application uses only as many tiles as needed to exploit the parallelism intrinsic to that application…

Page 42: Scalar Operand Networks for Tiled Microprocessors

One Streaming Application on Raw

Very different traffic patterns than RawCC-style parallelization.

[Figure: the application mapped across all 16 tiles, Tile 0 to Tile 15.]

Page 43: Scalar Operand Networks for Tiled Microprocessors

Auto-Parallelization Approach #2: StreamIt Language + Compiler

[Figure: stream graph before (“Original”) and after fusion, built from Splitter, Joiner, FIRFilter, Vec Mult, Magnitude, and Detector nodes.]

Page 44: Scalar Operand Networks for Tiled Microprocessors

[Figure: resulting stream graph of FIRFilter, Joiner, Vec Mult, Magnitude, and Detector nodes.]

End results: auto-parallelized by MIT StreamIt to 8 tiles.

Page 45: Scalar Operand Networks for Tiled Microprocessors

AsTrO Taxonomy: Classifying SON diversity

[Figure: eight ALUs with instructions such as >>, +, %, &, and / mapped onto them.]

- Assignment (Static/Dynamic): Is instruction assignment to ALUs predetermined?
- Transport (Static/Dynamic): Are operand routes predetermined?
- Ordering (Static/Dynamic): Is the execution order of instructions assigned to a node predetermined?
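The three questions map naturally onto a small record type. The sketch below is illustrative only: the three-letter signature (S or D per axis) is a shorthand chosen here, the type and function names are hypothetical, and the example design is made up rather than a claim about any real processor.

```python
from dataclasses import dataclass
from enum import Enum


class Binding(Enum):
    STATIC = "S"
    DYNAMIC = "D"


@dataclass(frozen=True)
class AsTrO:
    assignment: Binding  # is instruction assignment to ALUs predetermined?
    transport: Binding   # are operand routes predetermined?
    ordering: Binding    # is per-node execution order predetermined?

    def signature(self) -> str:
        return self.assignment.value + self.transport.value + self.ordering.value


def classify(assignment_fixed: bool, routes_fixed: bool, order_fixed: bool) -> AsTrO:
    """Answer the three questions with yes/no and get the classification."""
    def pick(fixed: bool) -> Binding:
        return Binding.STATIC if fixed else Binding.DYNAMIC
    return AsTrO(pick(assignment_fixed), pick(routes_fixed), pick(order_fixed))


# Hypothetical design: compiler places instructions and orders them,
# but operands are routed dynamically at run time.
print(classify(True, False, True).signature())
```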

Page 46: Scalar Operand Networks for Tiled Microprocessors

Microprocessor SON diversity using AsTrO taxonomy

[Figure: decision tree that branches on Assignment, then Transport, then Ordering (each Static or Dynamic). The leaves are Raw, RawDyn, Scale, TRIPS, ILDP, and WaveScalar.]

Page 47: Scalar Operand Networks for Tiled Microprocessors

Outline

1. Scalar Operand Network and Tiled Microprocessor intro

2. Raw Architecture + SON

3. VLSI implementation of Raw, a scalable microprocessor with a scalar operand network.

Page 48: Scalar Operand Networks for Tiled Microprocessors

Raw Chips

October 02

Page 49: Scalar Operand Networks for Tiled Microprocessors

Raw
- 16 tiles (16 issue)
- 180 nm ASIC (IBM SA-27E)
- ~100 million transistors
- 1 million gates
- 3-4 years of development
- 1.5 years of testing
- 200K lines of test code

Core Frequency: 425 MHz @ 1.8 V, 500 MHz @ 2.2 V

Frequency competitive with IBM-implemented PowerPCs in same process.

18W average power

Page 50: Scalar Operand Networks for Tiled Microprocessors

Raw motherboard

Support Chipset implemented in FPGA

Page 51: Scalar Operand Networks for Tiled Microprocessors
Page 52: Scalar Operand Networks for Tiled Microprocessors
Page 53: Scalar Operand Networks for Tiled Microprocessors

A Scalable Microprocessor in Action

[Taylor et al, ISCA ’04]

Page 54: Scalar Operand Networks for Tiled Microprocessors

Conclusions

Scalability problems in general purpose processors can be addressed by tiling resources across a scalable, low-latency, low-occupancy scalar operand network (SON). These SONs can be characterized by a 5-tuple and the AsTrO classification.

The 180 nm 16-issue Raw prototype shows that the approach is feasible. 64+-issue is possible in today’s VLSI processes.

Multicore machines could benefit by adding inter-node SON for cheap communication.
