Scalar Operand Networks for Tiled Microprocessors
Michael Taylor
Raw Architecture Project
MIT CSAIL
(now at UCSD)
Until 3 years ago, computer architects used the N-way superscalar to encapsulate the ideal for a parallel processor: nearly "perfect," but not attainable.
Superscalar
"PE"->"PE" communication: Free
Exploitation of parallelism: Implicit (hw scheduler, or compiler as in VLIW)
Clean semantics: Yes
Scalable: No
Power efficient: No
What's great about superscalar microprocessors? It's the networks!
Fast, low-latency, tightly-coupled networks (0-1 cycles of latency, no occupancy). For lack of a better name, let's call them Scalar Operand Networks (SONs).
Can we combine the benefits of superscalar communication with multicore scalability? In other words, can we build scalable Scalar Operand Networks?
(I agree with Jose: "We need low-latency tightly-coupled ... network interfaces" - Jose Duato, OCIN, Dec 6, 2006)
mul $2,$3,$4
add $6,$5,$2
(The add consumes operand $2 produced by the mul; the superscalar's bypass network forwards it with 0-1 cycles of latency and no occupancy.)
The industry shift toward multicore: attainable, but hardly ideal.
                             Superscalar   Multicore
"PE"->"PE" communication     Free          Expensive
Exploitation of parallelism  Implicit      Explicit
Clean semantics              Yes           No
Scalable                     No            Yes
Power efficient              No            Yes
What we'd like is neither superscalar nor multicore: superscalars have fast networks and great usability; multicore has great scalability and efficiency.
Why communication is expensive on multicore
[Figure: a message from Multiprocessor Node 1 to Node 2 incurs send occupancy and send latency (the send overhead), the transport cost of the network, then receive latency and receive occupancy (the receive overhead).]
Multiprocessor SON Operand Routing
Send side (Multiprocessor Node 1): send occupancy and send latency cover the launch sequence (destination node name, sequence number, value), commit latency, and network injection.
Receive side (Multiprocessor Node 2): receive occupancy and receive latency cover receive-sequence demultiplexing and the attendant branch mispredictions, and injection cost.
Similar overheads exist for shared-memory multiprocessors: store instructions, commit latency, spin locks (+ attendant branch mispredicts).
Defining a figure of merit for scalar operand networks
5-tuple <SO, SL, NHL, RL, RO>:
SO: Send Occupancy
SL: Send Latency
NHL: Network Hop Latency
RL: Receive Latency
RO: Receive Occupancy
Tip: the ordering follows the timing of a message from sender to receiver.
We can use this metric to quantitatively differentiate SONs from existing multiprocessor networks...
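As a gloss (mine, not the talk's), here is a minimal C sketch of how the 5-tuple turns into an end-to-end operand cost. The Power4 and Raw tuple values appear later in the talk; lumping occupancy and latency into a single sum is a simplification.

#include <stdio.h>

/* Figure-of-merit 5-tuple <SO, SL, NHL, RL, RO>, all in cycles. */
typedef struct {
    int so;   /* send occupancy: cycles the sender spends injecting        */
    int sl;   /* send latency: cycles before the operand enters the net    */
    int nhl;  /* network hop latency: cycles per hop traversed             */
    int rl;   /* receive latency: cycles to surface at the receiver        */
    int ro;   /* receive occupancy: cycles the receiver spends extracting  */
} son_tuple;

/* Simplified end-to-end cost of moving one operand across `hops` hops. */
static int operand_cost(son_tuple t, int hops) {
    return t.so + t.sl + t.nhl * hops + t.rl + t.ro;
}

int main(void) {
    son_tuple power4 = {2, 14, 0, 14, 4};  /* on-chip, per the talk */
    son_tuple raw    = {0, 0, 1, 2, 0};    /* Raw's tiled SON       */
    printf("Power4, 0 hops: %d cycles\n", operand_cost(power4, 0));  /* 34 */
    printf("Raw, 3 hops:    %d cycles\n", operand_cost(raw, 3));     /* 5  */
    return 0;
}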
Impact of Occupancy ("o" = SO + RO):
if (o * "surface area" > "volume"), it is not worth it to offload: the overhead is too high (the parallelism is too fine-grained).
Impact of Latency: the lower the latency, the less work is needed to keep the sender busy while waiting for the answer. Otherwise it is not worth it to offload: the sender could have done the work itself faster (not enough parallelism to hide the latency).
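A hypothetical sketch of that break-even test (the linear model and the names are my illustration, not from the talk): offloading pays only when the occupancy charged on the region's communication "surface," plus any latency the sender cannot hide, is smaller than the "volume" of work moved.

/* Illustrative offload break-even test. "surface" = operands crossing
   the boundary between sender and remote PE; "volume" = cycles of work
   inside the offloaded region. */
static int worth_offloading(int surface_operands,  /* operands in + out   */
                            int volume_cycles,     /* work offloaded      */
                            int o,                 /* o = SO + RO         */
                            int unhidden_latency)  /* latency not hidden
                                                      by independent work */
{
    return o * surface_operands + unhidden_latency < volume_cycles;
}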
[Figure: Proc 0 offloads work to Proc 1 and has "nothing to do" while it waits for the result.]
The interesting region:
Power4 (on-chip): <2, 14, 0, 14, 4>
Superscalar: <0, 0, 0, 0, 0> (not scalable)
                             Superscalar   Multicore   Tiled Multicore
PE-PE communication          Free          Expensive   Cheap
Exploitation of parallelism  Implicit      Explicit    Both
Scalable                     No            Yes         Yes (w/ scalable SON)
Power efficient              No            Yes         Yes
Tiled Microprocessors (or “Tiled Multicore”)
Transforming from multicore or superscalar to tiled:
Superscalar + scalability -> Tiled
CMP/multicore + scalable SON -> Tiled
The interesting region:
Power4 (on-chip): <2, 14, 0, 14, 4>
Raw (tiled): <0, 0, 1, 2, 0>
"Famous Brand 2" (tiled): <0, 0, 1, 0, 0>
Superscalar: <0, 0, 0, 0, 0> (not scalable)
Scalability Problems in Wide Issue Microprocessors
[Figure: a 16-issue superscalar - wide-fetch control (16 inst), unified load/store queue, single PC, one register file, and 16 ALUs tied together by a global bypass network.]
Area and Frequency Scalability Problems
[Figure: N ALUs sharing one register file and bypass network.]
For N ALUs, the register file grows as ~N^3 and the bypass network as ~N^2. Ex: Itanium 2.
Without modification, frequency decreases linearly or worse.
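A one-line justification of those exponents (my gloss, using the standard VLSI argument): each of the N ALUs needs ~3 register-file ports (2 read, 1 write), so ports grow ~N; a register cell's area grows ~(ports)^2, and the number of registers in flight also grows ~N, giving RF area ~ N^2 * N = N^3. The bypass network must connect every ALU output to every ALU input, ~ N * N = N^2 wires and comparators.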
Operand Routing is Global
[Figure: an operand produced at one ALU (">>") crosses the global bypass network to reach its consumer ("+") - every result travels the whole chip.]
Idea: Make Operand Routing Local
[Figure: the 16 ALUs, register file, and bypass network reorganized so that operand routing is local rather than global.]
Idea: Exploit Locality
[Figure: a 4x4 grid of ALUs in which communicating instructions are placed near each other.]
Replace the crossbar with a point-to-point, pipelined, routed scalar operand network.
[Figure: the grid of ALUs connected by a point-to-point, pipelined, routed mesh; the ">>" producer now sends its operand a short hop to the "+" consumer.]
Operand Transport Scaling - Bandwidth and Area

              Un-pipelined crossbar bypass   Point-to-point routed mesh
Local BW      ~N^(1/2)                       ~N
Area          ~N^2                           ~N

(For N ALUs and N^(1/2) bisection bandwidth, as in a conventional superscalar. The mesh scales as 2-D VLSI.)
We can route more operands per unit time if we are able to map communicating instructions nearby.
Operand Transport Scaling - Latency
Time for an operand to travel between instructions mapped to different ALUs:

                           Un-pipelined crossbar   Point-to-point routed mesh
Non-local placement        ~N                      ~N^(1/2)
Locality-driven placement  ~N                      ~1

Latency bonus if we map communicating instructions nearby so communication is local.
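A worked instance of those asymptotics (my arithmetic, assuming N = 64 ALUs in an 8x8 mesh): the un-pipelined crossbar's delay grows with its N = 64 inputs regardless of where instructions are placed, while the mesh costs ~sqrt(64) = 8 hops under random (non-local) placement and ~1 hop once communicating instructions are placed on neighboring tiles.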
Distribute the Register File
[Figure: the monolithic register file is distributed - each of the 16 ALUs gets its own small register file. The ALU/RF/bypass fabric is now SCALABLE, but the wide-fetch control (16 inst), unified load/store queue, and single PC remain centralized.]
More Scalability Problems
[Figure: the remaining centralized structures - wide-fetch control (16 inst), unified load/store queue, and program counter.]
Distribute the rest: Raw – a Fully-Tiled Microprocessor
[Figure: the centralized fetch, load/store queue, and PC are distributed too - each of the 16 tiles gets its own PC, instruction cache (I$), data cache (D$), register file, and ALU. Tiles!]
Tiled Microprocessors
- fast inter-tile communication through the SON
- easy to scale (same reasons as multicore)
Outline
1. Scalar Operand Network and Tiled Microprocessor intro
2. Raw Architecture + SON
3. VLSI implementation of Raw, a scalable microprocessor with a scalar operand network
Raw Microprocessor
Tiled scalable microprocessor; point-to-point pipelined networks.
16 tiles, 16 issue.
Each 4 mm x 4 mm tile:
- MIPS-style compute processor: single-issue 8-stage pipe, 32b FPU, 32K D cache, I cache
- 4 on-chip networks: two for operands, one for cache misses, one for message passing
Raw Microprocessor Components
[Figure: each tile pairs a compute processor (fetch unit, instruction cache, data cache, functional units, execution core - the intra-tile SON) with a static router (switch processor, crossbar, instruction cache) that forms the inter-tile SON, plus two dynamic routers ("GDN" and "MDN") for the generalized transport networks. Inter-tile network links connect neighboring tiles; the diagram separates a trusted core from an untrusted core.]
Raw Compute Processor Internals
[Figure: the compute pipeline (stage labels from the figure: RF, A, TL, M1, M2, F, P, E, U) tapped into the crossbar; registers r24-r27 are register-mapped network ports.]
Ex: fadd r24, r25, r26 - the operands arrive from, and the result departs to, the networks via the register-mapped ports.
Tile-Tile Communication
add $25,$1,$2        (sender: the result is written to network-mapped register $25)
Route $P->$E         (sender's switch: route from processor to East port)
Route $W->$P         (receiver's switch: route from West port to processor)
sub $20,$1,$25       (receiver: the operand is read directly as register $25)
Compilation
tmp3 = (seed*6+2)/3
v2 = (tmp1 - tmp3)*5
v1 = (tmp1 + tmp2)*3
v0 = tmp0 - v1
...
pval5=seed.0*6.0
pval4=pval5+2.0
tmp3.6=pval4/3.0
tmp3=tmp3.6
v3.10=tmp3.6-v2.7
v3=v3.10
v2.4=v2
pval3=seed.0*v2.4
tmp2.5=pval3+2.0
tmp2=tmp2.5
pval6=tmp1.3-tmp2.5
v2.7=pval6*5.0
v2=v2.7
seed.0=seed
pval1=seed.0*3.0
pval0=pval1+2.0
tmp0.1=pval0/2.0
tmp0=tmp0.1
v1.2=v1
pval2=seed.0*v1.2
tmp1.3=pval2+2.0
tmp1=tmp1.3
pval7=tmp1.3+tmp2.5
v1.8=pval7*3.0
v1=v1.8
v0.9=tmp0.1-v1.8
v0=v0.9
RawCC assigns instructions to the tiles, maximizing locality. It also generates the static router instructions that transfer operands between tiles.
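For reference, the whole computation implied by the renamed instruction list above (my reconstruction; the initial values of seed and v0-v3 are assumed):

/* Source reconstructed from the renamed instruction list above. */
float seed, v0, v1, v2, v3;            /* initial values assumed */

void step(void) {
    float tmp0 = (seed * 3.0f + 2.0f) / 2.0f;
    float tmp1 = seed * v1 + 2.0f;     /* reads the old v1 */
    float tmp2 = seed * v2 + 2.0f;     /* reads the old v2 */
    float tmp3 = (seed * 6.0f + 2.0f) / 3.0f;
    v2 = (tmp1 - tmp2) * 5.0f;
    v1 = (tmp1 + tmp2) * 3.0f;
    v0 = tmp0 - v1;
    v3 = tmp3 - v2;
}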
One cycle in the life of a tiled micro
[Figure: a snapshot of the 16-tile array running, all at once: httpd; a 4-way automatically parallelized C program; a 2-thread MPI app; a direct I/O stream into the scalar operand network; memory traffic (mem); and idle tiles (Zzz...).]
An application uses only as many tiles as needed to exploit the parallelism intrinsic to that application...
One Streaming Application on Raw
[Figure: placement of one streaming application across tiles 0-15 - very different traffic patterns than RawCC-style parallelization.]
Auto-Parallelization Approach #2: StreamIt Language + Compiler
[Figure: the stream graph before and after fusion. Original: a splitter fans out to eight FIR filters feeding a joiner, then a second splitter fans out to four Vec Mult -> FIR Filter -> Magnitude -> Detector pipelines feeding a joiner. After fusion: each Vec Mult/FIR Filter/Magnitude/Detector pipeline is fused onto a tile, and the FIR filters are fused pairwise.]
End result: auto-parallelized by MIT StreamIt to 8 tiles.
AsTrO Taxonomy: Classifying SON diversity
Assignment (Static/Dynamic): is instruction assignment to ALUs predetermined?
Transport (Static/Dynamic): are operand routes predetermined?
Ordering (Static/Dynamic): is the execution order of instructions assigned to a node predetermined?

Microprocessor SON diversity using the AsTrO taxonomy:

             Assignment   Transport   Ordering
Raw          Static       Static      Static
RawDyn       Static       Dynamic     Dynamic
Scale        Static       Dynamic     Static
TRIPS        Static       Dynamic     Dynamic
ILDP         Dynamic      Dynamic     Static
WaveScalar   Dynamic      Dynamic     Dynamic
Outline
1. Scalar Operand Network and Tiled Microprocessor intro
2. Raw Architecture + SON
3. VLSI implementation of Raw, a scalable microprocessor with a scalar operand network
Raw Chips
October 02
Raw: 16 tiles (16 issue), 180 nm ASIC (IBM SA-27E), ~100 million transistors, 1 million gates.
3-4 years of development, 1.5 years of testing, 200K lines of test code.
Core frequency: 425 MHz @ 1.8 V, 500 MHz @ 2.2 V - competitive with IBM-implemented PowerPCs in the same process.
18W average power.
Raw motherboard: support chipset implemented in FPGA.
A Scalable Microprocessor in Action [Taylor et al, ISCA '04]
Conclusions
Scalability problems in general-purpose processors can be addressed by tiling resources across a scalable, low-latency, low-occupancy scalar operand network (SON). These SONs can be characterized by a 5-tuple and the AsTrO classification.
The 180 nm 16-issue Raw prototype demonstrates that the approach is feasible; 64+-issue is possible in today's VLSI processes.
Multicore machines could benefit by adding an inter-node SON for cheap communication.