University of Amsterdam

CSP Computer Architecture

Computer Architecture: A bottom-up perspective

Andy Pimentel

Computer Architecture Modeling & Simulation group


Course material

Book: J.L. Hennessy and D.A. Patterson, “Computer Architecture, A Quantitative Approach”, 3rd ed.

Other nice book: D. Sima, T. Fountain and P. Kacsuk, “Advanced Computer Architecture, A Design Space Approach”

Sheets available at website (http://www.science.uva.nl/~andy/aca.html)

The schedule, practical assignments, deadlines, etc. are on the website as well.


Outline

Memory hierarchy
    DRAM
    Caches: from concept to implementation

Pipelined processors
    Pipeline hazards
    Some design space issues

Modern superscalar processors
    Decoding, dispatching, issuing and execution of instructions
    Register renaming
    Sequential consistency, exception handling
    Branch prediction


Outline (cont’d)

Application specific optimizations
    SIMD instructions
    Data prefetching

Case studies
    Compaq Alpha 21264, HP PA-8700, IBM POWER 4, Intel Pentium 4

VLIW processors
    Philips TriMedia
    Intel/HP IA64 (Itanium 2)
    Transmeta Crusoe

Embedded processors


Outline (cont’d)

Parallel computers
    Interconnection networks
        Topology, switching, routing, etc.
    Memory hierarchy
        Shared/distributed memory, cache coherency, etc.
    Case studies

Future directions
    Super-speculative processors
    Trace/Multiscalar processors
    Simultaneous multithreading
    I(ntelligent)RAMs
    ...


Memory hierarchy: DRAM

8 to 16 times slower than SRAM

Denser than SRAM (e.g. SRAM needs about 6 transistors/cell)

RAS/CAS addressing using time multiplexing

Needs refreshing

Cycle time roughly 2 times the access time

Processor-memory speed gap is widening
    Processors 50% to 100% faster/year (Moore’s Law)
    DRAM cycle time improves 7%/year


RAMs

[Figure: circuit diagrams of a DRAM cell and an SRAM cell. DRAM cell: one transistor and a storage capacitor to ground, connected to an address line and bitline B. SRAM cell: six transistors (T1 to T6) with nodes C1 and C2, a dc voltage supply, ground, an address line and two bitlines.]


DRAM (cont’d)

RAS/CAS addressing

[Figure: array of DRAM cells (each a capacitor with 1 transistor), with RAS selecting a row and CAS selecting a column]

Step 1: Row Address Select
Step 2: Column Address Select (select bit)

Refresh: read and write back a whole row


DRAM (cont’d)

Refresh time typically in the tens of milliseconds

Number of refresh cycles dependent on number of rows

Two types of refreshing: burst refresh and distributed refresh

[Figure: timelines over one refresh time. Burst refresh performs all refresh cycles back to back; distributed refresh spreads the refresh cycles over the refresh time.]


DRAM (cont’d)

Improving bandwidth (not latency!) by exploiting spatial locality

One RAS, multiple CAS addresses
    Fast page mode DRAMs
    E(xtended) D(ata) O(utput) RAM

Burst mode DRAMs: for one burst, 1 RAS and 1 CAS address
    Burst EDO RAM
    SDRAM

or by improving the interface: SDRAM and Rambus


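DRAM (cont’d)

[Figure: timing diagrams of the RAS, CAS, address and data signals. EDO RAM: the address lines carry ROW, COL1, COL2, COL3 and the data lines return DATA1, DATA2, DATA3. 2-bit burst EDO RAM: the address lines carry only ROW, COL1, COL3, while the data lines still return DATA1, DATA2, DATA3.]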


DRAM (cont’d)

SDRAM changed the interface from asynchronous to synchronous

[Figure: standard DRAM versus synchronous DRAM. Each access passes through the steps address latch, decode, R/W and output. In standard DRAM an access (Addr1, Addr2, ...) occupies all steps before the next one starts; in synchronous DRAM the steps are clocked and successive addresses (Addr1 to Addr5) overlap in different steps.]

Brought (a sort of) pipelining to DRAMs

DDR-SDRAM (Double Data Rate) transfers data on both rising and falling clock edges


DRAM (cont’d)

Rambus (RDRAM)

Interface using a split-transaction (= packet-switched) bus (pipelining!)

Separate row address control lines, column address control lines and (18-bit) data lines

So, three transactions can be active at the same time

High clock rate (400 MHz), but long latency

Data can be transferred on both clock edges


DRAM (cont’d)

Interleaved memory: multiple banks

Bank 0   Bank 1   Bank 2   Bank 3
   0        1        2        3
   4        5        6        7
   8        9       10       11

Optimizes sequential accesses and can hide refresh cycles

Problem: aliased accesses (successive accesses that map onto the same bank)
    Remedies: a large number of banks (NEC SX/3, 128 banks!) or a prime number of banks
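The bank mapping and the aliasing problem can be made concrete with a small sketch. It assumes low-order interleaving (bank = word address mod number of banks), which matches the table of addresses 0 to 11 over 4 banks; the function names `bank_of` and `banks_touched` and the limit `MAX_BANKS` are made up for this illustration.

```c
/* Sketch only: low-order interleaving, bank = word address mod #banks. */
#define MAX_BANKS 256  /* illustrative limit, not from the slides */

/* Bank that holds a given word address. */
unsigned bank_of(unsigned long word_addr, unsigned num_banks)
{
    return (unsigned)(word_addr % num_banks);
}

/* Number of distinct banks touched by `accesses` accesses with a fixed stride.
 * Returns 0 if num_banks is outside the supported range. */
unsigned banks_touched(unsigned long start, unsigned long stride,
                       unsigned accesses, unsigned num_banks)
{
    unsigned char seen[MAX_BANKS] = {0};
    unsigned count = 0;

    if (num_banks == 0 || num_banks > MAX_BANKS)
        return 0;
    for (unsigned i = 0; i < accesses; i++) {
        unsigned b = bank_of(start + (unsigned long)i * stride, num_banks);
        if (!seen[b]) {
            seen[b] = 1;
            count++;
        }
    }
    return count;
}
```

With 4 banks, a stride of 1 spreads the accesses over all banks, while a stride of 4 hits the same bank every time (aliasing). With 5 banks (a prime number) the same stride of 4 again uses every bank.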


Memory hierarchy: caches

Performance gap between processor and main memory → apply caching (basically a poor man’s solution)

Caches are “small” and fast memories (close to the processor, typically SRAM)

Nowadays, 2 (or 3) levels of cache between processor and main memory

Caches are transparent to the user (important!...however...)



Caches (cont’d)

Cache exploits locality in software

Temporal locality: a referenced item tends to be referenced again soon
    Instructions
    Data??

Spatial locality: items close to a referenced item tend to be referenced soon
    Instructions + data


Caches (cont’d)

Instruction, data or unified caches

Address cache (TLB – Translation Lookaside Buffer)
    Caches VA → PA translations
    Split I + D TLBs or unified, sometimes 2 levels

Three common implementations
    Direct mapped
    Fully associative
    Set-associative


Caches (cont’d)

Instruction and data caches store cache blocks (also called cache lines)

Cache block layout: Tag | V | D | Data
    Tag: higher-order address bits
    V: valid bit
    D: dirty bit
    Data: typically 16 - 128 bytes


Cache implementations

Direct mapped cache (often 2nd-level cache)

[Figure: direct mapped cache with 8 cache blocks, each with a tag and 16 bytes of data, backed by a main memory divided into 16-byte blocks (Block0, Block1, ..., Block15, ...). The 16-bit memory address is split into a 9-bit tag, a 3-bit block index and a 4-bit byte offset. The block index selects one cache block, whose stored tag is compared with the address tag to determine a hit.]

Simple hardware & high speed access

Rigid mapping: many memory blocks map onto one cache block → large cache size required
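The address split of this direct mapped example can be sketched in a few lines. The field widths (9-bit tag, 3-bit block index, 4-bit byte offset) are taken from the figure; the type and function names (`dm_fields`, `dm_line`, `dm_split`, `dm_access`) are made up for this illustration, and the sketch only tracks tags, not data.

```c
#include <stdint.h>

#define DM_BLOCKS 8  /* 3-bit block index */

typedef struct { unsigned tag, index, offset; } dm_fields;
typedef struct { int valid; unsigned tag; } dm_line;

/* Split a 16-bit address into tag (9 bits), block index (3) and byte offset (4). */
dm_fields dm_split(uint16_t addr)
{
    dm_fields f;
    f.offset = addr & 0xFu;
    f.index  = (addr >> 4) & 0x7u;
    f.tag    = (unsigned)addr >> 7;
    return f;
}

/* Returns 1 on a hit, 0 on a miss. On a miss the block is (re)placed. */
int dm_access(dm_line cache[DM_BLOCKS], uint16_t addr)
{
    dm_fields f = dm_split(addr);
    dm_line *line = &cache[f.index];

    if (line->valid && line->tag == f.tag)
        return 1;
    line->valid = 1;      /* rigid mapping: only one possible location */
    line->tag = f.tag;
    return 0;
}
```

Addresses 0x0000 and 0x0080 share block index 0 but differ in tag, so alternating between them misses every time. This is the conflict behaviour that the rigid mapping causes.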


Cache implementations (cont’d)

Fully associative cache (e.g. TLB, branch history table)

[Figure: fully associative cache with 8 cache blocks, each with a tag and 16 bytes of data, backed by a main memory of 16-byte blocks. The 16-bit address is split into a 12-bit tag and a 4-bit byte offset. The address tag is compared with the tags of all cache blocks to determine a hit.]

Very flexible mapping (few conflicts)

CAMs (Content Addressable Memory) are expensive → small caches → multimedia applications often a killer for TLBs


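Cache implementations (cont’d)

Set-associative cache (often 1st-level cache)

[Figure: set-associative cache with 8 cache blocks, each with a tag and 16 bytes of data, grouped into sets and backed by a main memory of 16-byte blocks. The 16-bit address is split into a 10-bit tag, a 2-bit set index and a 4-bit byte offset. The set index selects a set, and the address tag is compared with the tags of the blocks in that set to determine a hit.]

Performance similar to fully associative cache but less expensive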


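Virtually vs physically addressed cache

Virtually addressed

[Figure: CPU, cache, MMU and memory. The cache is accessed with the VA; the MMU translates the VA into the PA used for memory; instructions or data (I or D) flow between memory, cache and CPU.]

Parallel VA translation and cache lookup
Aliasing problem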


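Virtually vs physically addressed cache (cont’d)

Physically addressed

[Figure: CPU, MMU, cache and memory. The MMU first translates the VA from the CPU into a PA, which is then used to access the cache; instructions or data (I or D) flow between memory, cache and CPU.]

Slowdown on address translation
No aliasing problem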


Virtually vs physically addressed cache (cont’d)

Virtually-indexed, physically tagged cache
    Cache indexing during translation
    Page offset bits in VA used as cache index

Number of sets in cache limited (dependent on page size)!
    Solutions: large cache sets or page colouring (OS support)

[Figure: the page offset bits of the VA are identical for VA and PA and are used as the cache index]
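The limit on the number of sets follows from simple arithmetic: set index plus block offset must fit within the page offset. The sketch below assumes power-of-two page and block sizes; the function names and the example numbers (4 KB pages, 32-byte blocks) are assumptions for illustration, not values from the slides.

```c
/* Sketch only: sizing limits of a virtually-indexed, physically tagged cache. */

/* Maximum number of sets when the index must come from the page offset bits. */
unsigned long vipt_max_sets(unsigned long page_bytes, unsigned long block_bytes)
{
    return page_bytes / block_bytes;
}

/* Maximum cache size: one page per way ("large cache sets" raise this limit). */
unsigned long vipt_max_cache_bytes(unsigned long page_bytes, unsigned ways)
{
    return page_bytes * ways;
}

/* Set index of an address (virtual or physical). */
unsigned long cache_index(unsigned long addr, unsigned long block_bytes,
                          unsigned long sets)
{
    return (addr / block_bytes) % sets;
}
```

With 4 KB pages and 32-byte blocks at most 128 sets are possible, so a 4-way cache is limited to 16 KB. A virtual and a physical address that share the same page offset (for example 0x12345678 and 0xABCDE678) yield the same index, which is why indexing can start before translation finishes.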


Cache strategies

Replacement strategies in (set-)associative caches
    Random, FIFO, Least Recently Used (LRU)

Write strategies
    Write-through
    Write-back

Write-miss strategies
    Allocate on write (write-back cache): with/without fetch
    No allocate on write (write-through cache)


Cache misses

Three types of cache misses

Compulsory (cold-start) cache miss
    The data block is read for the first time

Capacity cache miss
    The data block has been replaced (cache too small)

Conflict cache miss
    The data block has been replaced (associativity too low)


Reducing the miss rate & penalty

More levels of cache

Critical-word-first read strategy

Lockup-free cache
    Multiple outstanding requests (reads/writes), write buffers

Sub-blocks
    Cache block: Tag | D | V | Subblock0 | D | V | Subblock1
    D = dirty bit
    V = valid bit

Prefetching (explained later in detail)

Victim cache (basically increases associativity)


Pipelined processors

RISC(y) processors dominate the microprocessor market

A few addressing modes, fixed instruction format, load-store architecture

Take advantage of pipelining

IF ID EX WB
   IF ID EX WB
      IF ID EX WB

IF = Instruction Fetch
ID = Instruction Decode
EX = Execute
WB = Write Back

Traditionally no microcode...not true anymore (are we returning to CISC?)

Caching essential due to larger code size and pipelining


Pipelined processors

Functionality of our simple pipeline:

IF: Fetch instruction from Icache, update PC

ID: Decode instruction, fetch operands from registers

EX: Execute instruction
    Use ALU for arithmetic instruction
    Access memory for load/store
    Determine branch taken or not

WB: Write back result (from ALU, memory or PC) to register


Pipeline hazards

There are three types of pipeline hazards:

Structural hazard: due to resource conflicts. For example, two instructions using the ALU in the same clock cycle (in this example the MUL takes 2 cycles) → the pipeline needs to be stalled, causing a bubble

MUL B,C (A = B*C):   IF  ID  EX  EX  WB
INC E (E = E + 1):       IF  ID  --  EX  WB      (-- = pipeline bubble)


Pipeline hazards (cont’d)

Control hazard: due to branching

[Figure: pipeline diagrams of a BRA instruction and the instruction that follows it, for the branch-not-taken and the branch-taken case. In both cases the fetch of the following instruction is postponed by the branch delay.]

In this example, the pipeline is stalled when a branch is encountered


Reducing control stalls

Branch delay slot: delayed branching

Addr.  Original code    Software        Delayed branch
       (interlocked)    interlocked     optimization
100    ld r1, a         ld r1, a        ld r1, a
101    add r2,1,r3      add r2,1,r3     bra 105
102    bra 105          bra 106         add r2,1,r3      (branch delay slot)
103    add r4,r2,r4     no-op           add r4,r2,r4
104    sub r0,r7,r6     add r4,r2,r4    sub r0,r7,r6
105    st 0(r0),r3      sub r0,r7,r6    st 0(r0),r3
106                     st 0(r0),r3

Assume branch delay of 1 cycle

Nullifying: also schedule instructions from taken/untaken path


Reducing control stalls (cont’d)

Branch prediction
    Predict always-taken/always-untaken
    Static prediction (compiler)
        Extra bit in branch instruction to guide IF unit or to optimize branch delay slot technique
        Based on heuristics or profiling
    Dynamic prediction by hardware: discussed later on


Reducing control stalls (cont’d)

Predication

Code with a branch:

    i      Instruction i
    i+1    Branch if C
    i+2    Instruction i + 2       (C false)
    i+3    Instruction i + 3       (C false)
    i+4    Jump to i + 7
    i+5    Instruction i + 5       (C true)
    i+6    Instruction i + 6       (C true)
    i+7    Instruction i + 7

Predicated execution:

           Instruction i
    (!C)   Instruction i + 2
    (!C)   Instruction i + 3
    (C)    Instruction i + 5
    (C)    Instruction i + 6
           Instruction i + 7


Pipeline hazards (cont’d)

Data hazard: due to data dependencies between source and destination operands

ADD B,C (A := B + C):   IF  ID  EX  WB
INC A (A := A + 1):         IF  --  --  ID  EX  WB      (-- = bubble; two bubbles)


Data hazards

Three types of data hazards:

RAW: Read After Write (true dependency)
    Read an operand before it is updated by a previous instruction

WAR: Write After Read (anti dependency)
    Update an operand before it is read by a previous instruction

WAW: Write After Write (output dependency)
    Write an operand before it is written by a previous instruction

WAR and WAW dependencies are false dependencies


Data hazards (cont’d)

ADD R3, R1, 6     # R3 = R1 + 6
ADD R4, R3, R2    # RAW hazard

ADD R4, R3, R2
ADD R3, R1, 6     # WAR hazard

ADD R4, R1, R2
ADD R4, R3, 7     # WAW hazard

In a simple pipeline, only RAW hazards can occur
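The three definitions can be checked mechanically by comparing the register numbers of two instructions in program order. The sketch below is only an illustration: the `instr` record (one destination, two sources, -1 for "no register") and the names `hazards`, `HAZ_RAW`, `HAZ_WAR` and `HAZ_WAW` are assumptions, not part of the course material.

```c
/* Sketch only: classify the dependencies between two instructions. */
typedef struct {
    int dest;   /* destination register number, -1 if none */
    int src1;   /* first source register, -1 if none (e.g. an immediate) */
    int src2;   /* second source register, -1 if none */
} instr;

enum { HAZ_RAW = 1, HAZ_WAR = 2, HAZ_WAW = 4 };

/* Returns a bit mask of the hazards that `second` has with respect to
 * `first`, where `first` comes earlier in program order. */
int hazards(instr first, instr second)
{
    int h = 0;

    /* RAW: second reads what first writes */
    if (first.dest >= 0 &&
        (second.src1 == first.dest || second.src2 == first.dest))
        h |= HAZ_RAW;
    /* WAR: second writes what first reads */
    if (second.dest >= 0 &&
        (first.src1 == second.dest || first.src2 == second.dest))
        h |= HAZ_WAR;
    /* WAW: both write the same register */
    if (first.dest >= 0 && first.dest == second.dest)
        h |= HAZ_WAW;
    return h;
}
```

Applied to the three instruction pairs above, the function reports RAW, WAR and WAW respectively.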


Avoiding RAW data hazards

Data forwarding: an execution unit (ALU) bypass

ADD B,C (A := B + C):   IF  ID  EX  WB
INC A (A := A + 1):         IF  ID  EX  WB

[Figure: the result B + C (the new value of A) is forwarded from the EX stage of the ADD to the EX stage of the INC, so no bubbles are needed]


Avoiding RAW data hazards (cont’d)

Instruction scheduling

[Figure: pipeline diagram of LD r1,a followed by INC r1. The load spends two cycles in the MEM stage, and the INC has to wait until the loaded data is available.]

Addr.  Original code     Software          Scheduled
       (interlocked)     interlocked
100    ld r1, a          ld r1, a          ld r1, a
101    add r1, r1, 1     no-op             sub r2, r4, r7
102    sub r2, r4, r7    no-op             and r3, r3, r5
103    and r3, r3, r5    add r1, r1, 1     add r1, r1, 1
104                      sub r2, r4, r7
105                      and r3, r3, r5

Finding the optimal schedule is an NP-hard problem

Static scheduling (compiler), dynamic (hardware) or hybrid


Avoiding WAR and WAW hazards (false dependencies): register renaming

Old code:
    ST 0(R5), R4
    ADD R4, R3, 7    # WAR hazard

New code:
    ST 0(R5), R4
    ADD R6, R3, 7    # R4 renamed to R6


Some basic pipeline design space issues

Depth: number of stages
    Superpipelined: large number of stages, high clock frequency, but more hazards

Number of execution units
    Scalar ILP (Instruction Level Parallelism) processors: sequential instruction issue to execution units, parallel execution

Number of pipelines
    Superscalar ILP processors: parallel instruction issue, parallel execution


Pipelined processors (cont’d)

Scalar versus Superscalar

[Figure: a scalar ILP pipeline and a superscalar ILP pipeline, each with IF and ID stages feeding the parallel execution units EX1, EX2 and EX3, followed by WB]

Need to preserve sequential consistency (WAR & WAW hazards, exceptions)!


Today’s superscalar processors

Some of the issues that will be touched upon

Parallel decoding, multi-way issuing and out-of-order execution

Preserving sequential consistency

Exceptions

Branch prediction

Application specific optimizations (SIMD instructions, data prefetching)


Instruction decoding

[Figure: instruction fetch stage. Instructions flow from the instruction cache into the instruction buffer and on to the decode/issue unit. With superscalar issue, the issue window holds as many instructions as the issue width.]

To speed up decoding, instructions are predecoded in I-cache

I-cache often prefetches instructions (miss on block i, fetch i and prefetch i + 1 when possible)


Superscalar instruction issue

Blocking (direct) versus non-blocking (shelved) issue

Issue blockages
    Dependencies in window of fetched instructions (older processors)
    Resource contention

Handling issue blockages
    In-order versus out-of-order issue
    Aligned versus unaligned issue


Superscalar instruction issue (cont’d)

[Figure: fetched instructions (a, b, c, ...) being issued from an issue window over cycles 1 to 3, with independent and dependent instructions marked. Panels: in-order issue versus out-of-order issue, and aligned (in-order) issue with a fixed window versus unaligned (in-order) issue with a gliding window.]


Non-blocking issue (shelving)

1. Dispatch instructions to buffer (check for structural hazards)

2. Issue instructions from buffer when operands are available

Usually in-order, aligned dispatch

Note: throughout the literature, the terms instruction issue and dispatch are ambiguous! (my usage differs from H&P!)

[Figure: instruction buffer → decode/dispatch → shelving buffer(s) → EX units. Instruction dispatch checks structural hazards; instruction issue (when operands are available) resolves data hazards.]


Shelving (cont’d)

Types of shelving buffers
    Scoreboard buffers
    Reservation stations associated with EX units (individual, grouped or central)
    Combined buffer for register renaming, shelving and instruction reordering: ROB (ReOrder Buffer)

Number of shelves nowadays ≥ 30, e.g.
    AMD Athlon (K7): 72
    HP PA-8x00: 56
    DEC/Compaq Alpha 21264: 35


Shelving (cont’d)

Issue order

[Figure: a shelving buffer with in-order issue next to a shelving buffer with out-of-order issue, showing which entries are checked for issue]

Nowadays mostly out-of-order issue

Issue rate: how many instructions can be issued/cycle from the shelving buffer(s)


Out-of-order execution

An instruction is issued to an execution unit when its operands are available (no RAW hazards): allows for out-of-order execution (dynamic scheduling)

In general, there are two schemes to control out-of-order execution
    Scoreboarding
    Tomasulo scheduling


Scoreboarding

Introduced in the CDC 6600 (1964!) to strive for 1 IPC

Scoreboard keeps track of the state of instructions and registers
    Entries in shelving buffer store operand locations and bits indicating their availability
    Registers include extra bit indicating their validity
        At issue, if destination register is valid, then mark it invalid. Otherwise block (WAW hazard). Validate bit at WB while checking for WAR hazards
        If source register is invalid, then block (RAW hazard)

Explicit register renaming to avoid WAW and WAR hazards

Traditional scoreboarding (without renaming) implements in-order issue
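The valid-bit rules above can be sketched as follows. This is a strongly simplified model under stated assumptions: it keeps only one valid bit per register, blocks instead of shelving the instruction, and leaves out the WAR check at write back. The names (`scoreboard`, `sb_try_issue`, `sb_write_back`) and the register count of 6 (r0 to r5, as in the example that follows) are chosen for this illustration.

```c
#include <stdbool.h>

#define NREGS 6  /* r0 .. r5 */

typedef struct { bool valid[NREGS]; } scoreboard;

typedef enum { ISSUE_OK, BLOCK_RAW, BLOCK_WAW } issue_result;

/* Initially every register holds a valid value. */
void sb_init(scoreboard *sb)
{
    for (int r = 0; r < NREGS; r++)
        sb->valid[r] = true;
}

/* Try to issue "op dest, src1, src2" (register numbers). */
issue_result sb_try_issue(scoreboard *sb, int dest, int src1, int src2)
{
    if (!sb->valid[src1] || !sb->valid[src2])
        return BLOCK_RAW;          /* a source is still being produced */
    if (!sb->valid[dest])
        return BLOCK_WAW;          /* an earlier write is still pending */
    sb->valid[dest] = false;       /* destination is now being produced */
    return ISSUE_OK;
}

/* Write back: the destination register becomes valid again. */
void sb_write_back(scoreboard *sb, int dest)
{
    sb->valid[dest] = true;
}
```

For instance, after `mul r3, r1, r2` has issued, `add r5, r2, r3` is blocked on r3 until the multiply writes back.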


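Scoreboarding (cont’d)

An example

[Figure: scoreboard state after mul r3, r1, r2 arrives from the decode/dispatch stage. The register file (r0 to r5) holds a value and a valid bit V per register (values 10, 20 and 40 are shown as valid); the valid bit of the destination r3 is 0. The instruction status entry in front of the EX unit has the fields OP, S1, V1, S2, V2, D and holds: mul, r1, 1, r2, 1, r3.]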


Scoreboarding example (cont’d)

[Figure: next step of the example. mul r3, r1, r2 is in execution and add r5, r2, r3 arrives. Its instruction status entry holds: add, r2, 1, r3, 0, r5, so the source r3 is not yet valid (V2 = 0). The valid bit of the destination r5 is 0.]



Scoreboarding example (cont’d)

[Figure: next step of the example. The mul has written its result 200 into r3, which is valid again, so add r5, r2, r3 can proceed. add r0, r1, r3 arrives with the instruction status entry: add, r1, 1, r3, 1, r0. The valid bit of the destination r0 is 0.]



Tomasulo scheduling

Introduced in the IBM 360/91 (1967) by Robert Tomasulo

Usually implements out-of-order issue

Dispatched instructions are kept in reservation stations, explicitly storing operand values

Reservation stations often individually associated with an execution unit


Tomasulo scheduling (cont’d)

Dependency analysis based on the dataflow principle
    Unavailable operands are tagged with reservation station ID producing the value (registers also store this tag)
    Generated results are immediately sent to the reservation stations by a Common Data Bus
    This scheme automatically implements register renaming! (so no WAW and WAR hazards)


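Tomasulo scheduling (cont’d)

The basic architecture

[Figure: incoming instructions, a register file (r0 to r5) that stores a tag per register, and two EX units with reservation stations (RS1 to RS6). Each reservation station has the fields OP, S1 and S2, holding either operand values (e.g. add 10 20) or the tag of the reservation station that will produce the value (e.g. mul 5 RS1). Results are sent over the Common Data Bus (CDB).]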



Register renaming revisited

Hardware implementation of register renaming common in current superscalar processors

Three possible locations of renaming buffers
    Merged architectural and rename register file
    Separate architectural and rename register files
    Renaming in the ROB

At operand fetch, check both architectural and renaming register files


HW register renaming (cont’d)

Two basic buffer architectures

[Figure: associative rename buffers versus indexed rename buffers. An associative rename buffer entry has the fields entry valid, dest. reg., value, value valid and latest, and is found by an associative search. With indexed rename buffers, a table with the fields entry valid and index points to buffer entries with the fields value and value valid.]


HW register renaming (cont’d)

Number of renaming buffers
    Compaq Alpha 21264: 41 (int) + 41 (fp)
    PowerPC 750: 6 (int) + 6 (fp)
    AMD Athlon: 72 (in ROB)
    HP PA-RISC 8x00: 56 (in ROB)

Rename rate: renames per cycle


An example of HW register renaming

[Figure: associative rename buffer with the fields entry valid, dest. reg., value, value valid and latest, and head and tail pointers, shown for the instructions add r3,r1,r2, mul r2,r0,r1 and sub r2,r0,r1. Operands are fetched as value/valid pairs (e.g. "10, valid" and "0, valid").]


HW register renaming example (cont’d)

[Figure: next step of the example for add r3,r1,r2, mul r2,r0,r1 and sub r2,r0,r1. A new rename buffer entry has been allocated at the head pointer; an operand whose value is not yet available is fetched as "tag (3), not valid".]


HW register renaming example (cont’d)

[Figure: next step of the example for add r3,r1,r2, mul r2,r0,r1 and sub r2,r0,r1. The head pointer has advanced by another entry, and a produced value (10) is sent to the waiting add instruction.]


Preserving sequential consistency

Instructions in superscalars may finish out-of-order

Sequential consistency must be preserved: out-of-order instructions might have to complete (also called retire or commit) in-order

Two issues in sequential consistency:
    Processor consistency (sequence of instruction completions)
    Memory consistency (sequence of memory accesses)


Processor consistency

Weak processor consistency
    Instructions may complete out-of-order only when possible (dependencies)
    Problems with precise exceptions (discussed later on)

Strong processor consistency
    Instructions always complete in-order
    Easy to implement and no exception problems → common in modern superscalar processors


Memory consistency

Strong memory consistency
    Memory accesses occur in strict program order

Weak memory consistency
    Load/store reordering (not violating data dependencies)
    Increases processor performance


Load/store reordering

Some processors allow loads to bypass stores when target addresses are different

If target address of store is not known
    Non-speculative bypassing: the load does not bypass the store
    Speculative bypassing (common in modern superscalars): the load bypasses the store, and the state is restored when the bypass turns out to be invalid

Loads bypassing loads in case of Dcache misses: lock-up-free caches


Exception processing

Out-of-order execution may cause problems with exceptions. An example:

divf f0, f1, f2    % causes an exception
add r1, r3, r4     % commits earlier than the divf

If nothing is done (weak consistency of exceptions), then exceptions are imprecise
    Undesirable in modern processors (e.g. paging, IEEE FP standard)


Exception processing (cont’d)

Modern processors typically feature precise exceptions
    Previous instructions have committed
    No following instruction has modified architectural state
    PC points to interrupting instruction

Solutions to implement precise exceptions
    Only issue an instruction when the previous instructions are known not to cause an exception
    History buffer
    ReOrder Buffer (ROB)


History buffer

Store sequential machine state and restore the machine state after an exception

[Figure: instruction results are written to the architectural registers, from which operands are fetched. The superseded items are saved in the history buffer (a queue), which is used for the sequential state restore.]

Expensive path to history buffer (for each simultaneously written operand) and expensive reload after exception


The ReOrder Buffer (ROB)

Increasingly popular in modern superscalar processors

The ROB is a circular buffer for reordering instructions (to establish sequential processor consistency) and may also implement register renaming and shelving

It effectively supports speculative execution and precise exceptions

[Figure: circular buffer with a head pointer (first free entry) and a tail pointer (next instruction to be committed). Entries between tail and head are marked d = dispatched, x = in execution or f = finished.]


The ReOrder Buffer (cont’d)

Instructions inserted in program order (dispatch order)

When an instruction may commit, it writes its result to an architectural register/memory

Commit rate: the number of instructions that can commit in 1 cycle
    Typical commit rate of 1-4

Status bit for speculative execution (a finished speculative instruction may not commit)

When the ROB implements register renaming, the renaming buffers are integrated with the ROB structure
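The commit rule can be sketched with a small circular buffer. The sketch follows the pointer convention of the ROB figure (head = first free entry, tail = next instruction to commit) and models only the instruction state and the speculative bit, not results or renaming. The buffer size and all names (`rob`, `rob_dispatch`, `rob_finish`, `rob_resolve`, `rob_commit`) are assumptions made for this illustration.

```c
#include <stdbool.h>

#define ROB_SIZE 8  /* illustrative size */

typedef enum { DISPATCHED, EXECUTING, FINISHED } rob_state;

typedef struct {
    rob_state state[ROB_SIZE];
    bool speculative[ROB_SIZE];
    int head;    /* first free entry */
    int tail;    /* next instruction to be committed */
    int count;   /* number of occupied entries */
} rob;

void rob_init(rob *r)
{
    r->head = r->tail = r->count = 0;
}

/* Insert an instruction in program order; returns its entry or -1 if full. */
int rob_dispatch(rob *r, bool speculative)
{
    if (r->count == ROB_SIZE)
        return -1;
    int e = r->head;
    r->state[e] = DISPATCHED;
    r->speculative[e] = speculative;
    r->head = (r->head + 1) % ROB_SIZE;
    r->count++;
    return e;
}

/* Execution units may finish instructions in any order. */
void rob_finish(rob *r, int entry)  { r->state[entry] = FINISHED; }

/* The branch this instruction depended on was predicted correctly. */
void rob_resolve(rob *r, int entry) { r->speculative[entry] = false; }

/* Commit in order from the tail, at most commit_rate instructions per call.
 * Stops at the first unfinished or still speculative instruction.
 * Returns the number of committed instructions. */
int rob_commit(rob *r, int commit_rate)
{
    int n = 0;
    while (n < commit_rate && r->count > 0 &&
           r->state[r->tail] == FINISHED && !r->speculative[r->tail]) {
        r->tail = (r->tail + 1) % ROB_SIZE;
        r->count--;
        n++;
    }
    return n;
}
```

An instruction that finishes early stays in the buffer until every older instruction has committed, which is what keeps exceptions precise.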


Control hazards revisited

In general, one out of five instructions is a branch

Grohoski’s estimate of branch statistics:

Branches
    Unconditional (jump, branch to subroutine, return from subroutine): 1/3
    Conditional
        Loop-closing branch: 1/3, taken for (n-1) iterations
        Other cond. branch: 1/3, of which 1/6 taken and 1/6 untaken

Overall: taken 5/6, untaken 1/6


Control hazards (cont’d)

As we know, branches may cause the pipeline to stall
    Resolving of branches may take a while (e.g. branch on the result of a FP operation)
    Especially taken branches should be handled efficiently

Branch problem gets worse
    Deeper pipelines and superscalar processors suffer from more branch penalties (multiple branches in pipeline(s))


Avoiding/reducing branch delays

Branch delay slots (delayed branching)

Multiway branching: follow both paths

Predication

Branch prediction
    Static (compiler)
    Dynamic (hardware)


Dynamic branch prediction

Keep branch history bits (a 2-bit history is common)
    Stored in the Icache, a Branch History Table (BHT) or Branch Target Buffer (discussed later on)

[Figure: state diagram of a 2-bit predictor with the states 11 and 10 (predict taken) and 01 and 00 (predict not taken), with transitions on taken and not taken branch outcomes]

Accuracy ≈ 90%
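A 2-bit history can be sketched as a saturating counter. Note the assumption: the exact transitions of the state diagram are not recoverable from the slide, so the sketch uses the common saturating-counter variant, in which states 3 (11) and 2 (10) predict taken and states 1 (01) and 0 (00) predict not taken. The names `counter2`, `predict_taken`, `update_counter` and `count_mispredicts` are made up for this illustration.

```c
/* Sketch only: 2-bit saturating counter, values 0..3. */
typedef unsigned char counter2;

/* States 2 and 3 predict taken, states 0 and 1 predict not taken. */
int predict_taken(counter2 c)
{
    return c >= 2;
}

/* Move towards 3 on a taken branch and towards 0 on a not taken branch. */
counter2 update_counter(counter2 c, int taken)
{
    if (taken)
        return (counter2)(c < 3 ? c + 1 : 3);
    return (counter2)(c > 0 ? c - 1 : 0);
}

/* Replay a sequence of branch outcomes (1 = taken) and count mispredictions. */
int count_mispredicts(const int *outcomes, int n, counter2 start)
{
    counter2 c = start;
    int misses = 0;

    for (int i = 0; i < n; i++) {
        if (predict_taken(c) != (outcomes[i] != 0))
            misses++;
        c = update_counter(c, outcomes[i]);
    }
    return misses;
}
```

For a loop-closing branch that is taken three times and then falls through, the counter mispredicts only the loop exit: a single not taken outcome moves state 3 to state 2, which still predicts taken for the next execution of the loop.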

Two-level (correlating) predictor

Include behaviour of other branches

[Figure: a table of 2-bit per-branch predictions (entries xx). Bits of the branch address select a row, and the 2-bit global branch history selects the prediction within that row.]

Accuracy ≈ 95%
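The indexing scheme can be sketched as follows, assuming 3 branch address bits and a 2-bit global history that together select one 2-bit saturating counter. The counter update rule, the initial counter value and all names (`corr_predictor`, `cp_predict`, `cp_update`, `cp_run`) are assumptions for this illustration; `pc` is treated as an instruction index rather than a byte address.

```c
#define CP_ADDR_BITS 3
#define CP_HIST_BITS 2
#define CP_ENTRIES (1 << (CP_ADDR_BITS + CP_HIST_BITS))

typedef struct {
    unsigned char counter[CP_ENTRIES];  /* 2-bit saturating counters, 0..3 */
    unsigned history;                   /* outcomes of the last 2 branches */
} corr_predictor;

void cp_init(corr_predictor *p)
{
    for (int i = 0; i < CP_ENTRIES; i++)
        p->counter[i] = 1;              /* start at "weakly not taken" */
    p->history = 0;
}

/* Branch address bits select the row, global history selects the counter. */
static unsigned cp_index(const corr_predictor *p, unsigned pc)
{
    unsigned addr_part = pc & ((1u << CP_ADDR_BITS) - 1);
    return (addr_part << CP_HIST_BITS) | p->history;
}

int cp_predict(const corr_predictor *p, unsigned pc)
{
    return p->counter[cp_index(p, pc)] >= 2;
}

void cp_update(corr_predictor *p, unsigned pc, int taken)
{
    unsigned char *c = &p->counter[cp_index(p, pc)];

    if (taken && *c < 3)
        (*c)++;
    else if (!taken && *c > 0)
        (*c)--;
    /* shift the outcome into the global history register */
    p->history = ((p->history << 1) | (taken ? 1u : 0u))
                 & ((1u << CP_HIST_BITS) - 1);
}

/* Replay outcomes (1 = taken) of one branch; returns the mispredictions. */
int cp_run(corr_predictor *p, unsigned pc, const int *outcomes, int n)
{
    int misses = 0;

    for (int i = 0; i < n; i++) {
        if (cp_predict(p, pc) != (outcomes[i] != 0))
            misses++;
        cp_update(p, pc, outcomes[i]);
    }
    return misses;
}
```

A branch that alternates between taken and not taken defeats a plain 2-bit counter, but here the history distinguishes the two situations, so after a short warm-up every outcome is predicted correctly.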


Two-level (correlating) predictor (cont’d)

Two-level predictors come in various sorts and complexities (Yeh & Patt, 1993)

[Figure: a global branch history register combined with either a global pattern history table, per-set pattern history tables (selected by Set(branch)) or per-address pattern history tables (selected by Addr(branch))]


Two-level Branch Predictors (cont’d)

[Figure: per-address branch history tables (selected by Addr(branch)) combined with either a global pattern history table, per-set pattern history tables (selected by Set(branch)) or per-address pattern history tables (selected by Addr(branch))]


Two-level Branch Predictors (cont’d)

[Figure: per-set branch history tables (selected by Set(branch)) combined with either a global pattern history table, per-set pattern history tables (selected by Set(branch)) or per-address pattern history tables (selected by Addr(branch))]


Branch Target Buffer (BTB)

Cache branch target address besides predicted branch direction
    At each cycle, search PC (instruction to fetch in IF stage) in BTB
    If PC is found, then start fetching from cached target address (predicted taken)
    No branch delay when prediction is correct

Branch folding optimization
    Store target instruction rather than address
    Allows for zero-cycle branches!


Branch Target Buffer (cont’d)

[Figure: the PC of the instruction to fetch is looked up in the BTB entries, each of which holds a branch PC, a predicted PC and prediction bits. No match: proceed normally. Match (=): use the predicted PC as next PC if predicted taken.]
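The lookup can be sketched as a small table indexed by the fetch PC. The sketch assumes a direct mapped BTB, 4-byte instructions and a single prediction bit per entry; none of these details come from the slides, and the names (`btb`, `btb_next_pc`, `btb_update`) are made up for this illustration.

```c
#include <stdint.h>
#include <string.h>

#define BTB_ENTRIES 16  /* illustrative size */

typedef struct {
    int valid;
    uint32_t pc;         /* address of the branch instruction */
    uint32_t target;     /* cached branch target address (predicted PC) */
    int predict_taken;   /* prediction bit */
} btb_entry;

typedef struct { btb_entry e[BTB_ENTRIES]; } btb;

void btb_init(btb *b)
{
    memset(b, 0, sizeof *b);
}

static btb_entry *btb_slot(btb *b, uint32_t pc)
{
    return &b->e[(pc >> 2) % BTB_ENTRIES];
}

/* Next fetch address: the cached target on a hit that is predicted taken,
 * otherwise the next sequential instruction. */
uint32_t btb_next_pc(btb *b, uint32_t pc)
{
    btb_entry *x = btb_slot(b, pc);

    if (x->valid && x->pc == pc && x->predict_taken)
        return x->target;
    return pc + 4;
}

/* Record the outcome of a resolved branch. */
void btb_update(btb *b, uint32_t pc, uint32_t target, int taken)
{
    btb_entry *x = btb_slot(b, pc);

    x->valid = 1;
    x->pc = pc;
    x->target = target;
    x->predict_taken = taken;
}
```

Because the lookup uses only the fetch PC, the target is available in the IF stage, before the branch has even been decoded.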


Speculative execution

Instructions after predicted branch are executed speculatively

How deep can the level of speculation be?
    Typically between 1 and 20 branches

How far are speculative instructions processed?
    Typically up to execution stage; speculative instructions are committed after resolving a branch (e.g. using a ROB)

Increasing the degree of speculative execution: value prediction → we’ll return to this


App. dependent optimizations: SIMD instructions

Multimedia applications increasingly popular: SIMD parallelism can be exploited in many multimedia algorithms
    Many small integer data types
    Frequent multiplies and accumulates in repetitive loops
    Highly parallel operations

ISAs extended with SIMD instructions (e.g. MMX)
    Pack multiple small data items in a register (typically 64 bits or larger)
    Perform same instruction on multiple data items in parallel


SIMD instructions (cont’d)

Packed Add (four 16-bit items in a 64-bit register):

      | a0    | a1    | a2    | a3    |
    + | b0    | b1    | b2    | b3    |
    = | a0+b0 | a1+b1 | a2+b2 | a3+b3 |

Packed Multiply Add:

      | a0 | a1 | a2 | a3 |
    * | b0 | b1 | b2 | b3 |
    = | a0*b0+a1*b1 | a2*b2+a3*b3 |


SIMD instructions (cont’d)

Use INT/FP register file or separate register file (needs OS support!)

Provide both modulo and saturated arithmetic
    Saturated arithmetic very useful for pixel operations

Pose a new problem for compiler writers: automatic vectorization
    The large variety of SIMD instruction sets doesn’t help here
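The difference between modulo and saturated arithmetic can be shown by emulating a packed add in plain C. The sketch treats a 64-bit word as four unsigned 16-bit lanes and loops over them; a real SIMD unit performs the four additions in parallel. The function names `padd16_modulo` and `padd16_saturated` are made up and do not correspond to actual MMX instructions.

```c
#include <stdint.h>

/* Packed add of four unsigned 16-bit lanes, modulo (wrap-around) arithmetic. */
uint64_t padd16_modulo(uint64_t a, uint64_t b)
{
    uint64_t r = 0;

    for (int lane = 0; lane < 4; lane++) {
        uint16_t x = (uint16_t)(a >> (16 * lane));
        uint16_t y = (uint16_t)(b >> (16 * lane));
        uint16_t s = (uint16_t)(x + y);          /* carry out is dropped */
        r |= (uint64_t)s << (16 * lane);
    }
    return r;
}

/* Same operation with saturated arithmetic: clamp each lane at 0xFFFF. */
uint64_t padd16_saturated(uint64_t a, uint64_t b)
{
    uint64_t r = 0;

    for (int lane = 0; lane < 4; lane++) {
        uint32_t x = (uint16_t)(a >> (16 * lane));
        uint32_t y = (uint16_t)(b >> (16 * lane));
        uint32_t s = x + y;
        if (s > 0xFFFFu)
            s = 0xFFFFu;                         /* saturate instead of wrap */
        r |= (uint64_t)s << (16 * lane);
    }
    return r;
}
```

Adding 1 to a lane that holds 0xFFFF wraps to 0 with modulo arithmetic but stays at 0xFFFF with saturated arithmetic. For pixels this means a bright value stays bright instead of turning black.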


App. dependent optimizations: data prefetching

Problem
    Multimedia applications suffer from compulsory cache misses
    Calculations applied on streams (no or little re-use)

Observation
    Regularity in stream processing: data address $S_i = A_0 + i \cdot \mathit{offset}$

Solution: Stream prefetching
    Bring data elements to the cache before they are really needed
    Potential problem: thrashing


Stream prefetching: a classification

Stream prefetching composed of two actions:

Stream detection: detect when an application is performing operations on data streams

Issuing prefetches: request the data cache to (regularly) prefetch a certain amount of data

                                Issuing prefetches
                                Static      Dynamic
Stream detection    Static      (SW)        (hybrid SW/HW)
                    Dynamic     -           (HW)


Universityof

Amsterdam

CSPCSPComputer

Architecture

Static detection, static issuing

Original loop:

    for (i = 0; i < N; i++)
        sum = a[i] + sum;

With prefetches inserted (prefetch distance of 3 elements):

    for (i = 0; i < 3; i++)
        prefetch(&a[i]);
    for (i = 0; i < N - 3; i++) {
        prefetch(&a[i+3]);
        sum = a[i] + sum;
    }
    for (; i < N; i++)
        sum = a[i] + sum;

Cheap implementation

Instruction overhead in loop body

Code rewriting and fine tuning required (e.g. affects compiler optimizations)
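The rewritten loop can be tried on a real compiler. The sketch below assumes GCC or Clang, whose `__builtin_prefetch` builtin stands in for the generic `prefetch()` on the sheet; the prefetch distance of 3 is taken from the sheet and would need tuning on real hardware.

```c
#include <assert.h>
#include <stddef.h>

#define PREFETCH_DISTANCE 3 /* elements ahead, as on the sheet */

/* Sum with statically inserted prefetches: a prologue that starts the
 * stream, a steady-state loop and an epilogue without prefetches. */
long sum_prefetch(const int *a, size_t n)
{
    long sum = 0;
    size_t i;

    assert(a != NULL || n == 0);

    for (i = 0; i < PREFETCH_DISTANCE && i < n; i++)
        __builtin_prefetch(&a[i]);        /* GCC/Clang builtin (assumption) */

    for (i = 0; i + PREFETCH_DISTANCE < n; i++) {
        __builtin_prefetch(&a[i + PREFETCH_DISTANCE]);
        sum += a[i];                      /* extra instruction per iteration */
    }

    for (; i < n; i++)                    /* last elements: nothing left to prefetch */
        sum += a[i];

    return sum;
}
```

The prefetch is only a hint, so the result is identical to the plain loop; any gain is in cache misses, not in the computed value.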



Dynamic detection, dynamic issuing

Detection

Instruction address of loads/stores identifies stream

Large table records the instr. + data addresses of all possible candidates (Stride Prediction Table)

Issuing of prefetch requests

Instruction address hits in SPT:
    Compute stride: $\mathit{stride} = \text{data-addr}_{current} - \text{data-addr}_{table}$
    Issue prefetch request for $\mathit{addr}_{prefetch} = \text{data-addr}_{current} + \mathit{stride}$



Dynamic detection, dynamic issuing (cont’d)

(a) Example loop:

    for (i = 0; i < 100; i++)
        for (j = 0; j < 100; j++)
            A[ i ][ j ] += B[ j ][ i ];

(b) SPT at i = 0, j = 1:

    Instr. addr.    Prev. addr.    Stride
    ld A[i][j]      100000
    ld B[j][i]      200000

(c) SPT at i = 0, j = 2:

    Instr. addr.    Prev. addr.    Stride
    ld A[i][j]      100004         4         prefetch request for 100008
    ld B[j][i]      200400         400       prefetch request for 200800
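The table behaviour in the example can be mimicked in software. This is a simplified sketch, not the hardware design: the table size, the direct-mapped indexing by instruction address and the names are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define SPT_ENTRIES 64 /* hypothetical table size */

struct spt_entry {
    uint32_t pc;        /* instruction address of the load/store (tag) */
    uint32_t prev_addr; /* data address of its previous access */
    int valid;
};

static struct spt_entry spt[SPT_ENTRIES];

/* Record one memory access. On a table hit the stride is computed and
 * *prefetch_addr is set to data_addr + stride; returns 1 on a hit, 0 on
 * a miss (first time this instruction is seen). */
int spt_access(uint32_t pc, uint32_t data_addr, uint32_t *prefetch_addr)
{
    struct spt_entry *e = &spt[(pc >> 2) % SPT_ENTRIES];
    int hit = e->valid && e->pc == pc;

    assert(prefetch_addr != 0);
    if (hit) {
        uint32_t stride = data_addr - e->prev_addr; /* wraps for negative strides */
        *prefetch_addr = data_addr + stride;
    }
    e->pc = pc;            /* allocate or update the entry */
    e->prev_addr = data_addr;
    e->valid = 1;
    return hit;
}
```

Feeding it the addresses of the example (100000 then 100004 for the load of A, 200000 then 200400 for the load of B) yields prefetch requests for 100008 and 200800. The first access of each instruction produces no prefetch, which is the run-in effect.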




Transparent: no programmer action required

Run-in effect

Thrashing

Large SPT required

Even larger SPT needed when loop-unrolling

PC needed




The effect of loop-unrolling

Original loop:

    i0: ......
    i1: load R1 R3
    i2: ......
    i3: incr R1
    i4: jump i0

After loop unrolling:

    i0: ......
    i1: load R1 R3
    i2: ......
    i3: incr R1
    i4: ......
    i5: load R1 R3
    i6: ......
    i7: incr R1
    i8: jump i0




A few approaches to reduce thrashing

A more complex state-machine in the SPT: e.g. only prefetch when measuring a constant stride

Introduce separate stream caches

[Figure: two organizations of processor, SPT, stream cache, cache and memory; labels: MRU replacement, LRU replacement]



Static detection, dynamic issuing

Detection

A stream prefetch instruction signals a prefetch engine to start prefetching the elements of a stream

    prefetch(&a[0], N, 4, 3);
    for (i = 0; i < N; i++)
        sum = a[i] + sum;

Programmer selects which streams to prefetch

No run-in effect

Small prefetch table (streams only, no candidates)

No rewriting of inner loop



Static detection, dynamic issuing (cont’d)

Issuing of prefetch requests

Like HW prefetching: use instruction address of loads/stores

Use data addresses of stream elements
    PC not needed
    Not affected by loop-unrolling: smaller table!
    More robust prefetching (e.g. when stream is accessed with multiple strides)
    More expensive implementation of the prefetch table



Alpha 21264

7-stage pipeline with clustered EX units

OOO implementation using scoreboarding

[Block diagram, by pipeline stage (Fetch, Rename, Issue, Reg. Read, Execute, Memory): branch prediction, line/set prediction and 64 KB 2-way Icache; integer and FP register rename; integer issue queue (20 entries) and FP issue queue (15); two integer register files (80) and FP register file (72); four integer execution units with two address units, FP multiply and FP add execution units; 64 KB 2-way Dcache, L2 cache and system interface]



Alpha 21264 (cont’d)

Hybrid tournament branch predictor

Load/store reordering, software data prefetching

[Figure: branch prediction. The PC indexes a local history table (1024x10) that selects a local prediction (1024x3); the path history indexes a global prediction table (4096x2) and a choice prediction table (4096x2)]



HP PA-8700

OOO implementation using ROBs (4-way issue)

Hybrid static and dynamic (BHT+BTB) branch prediction, simple (i+1) HW data prefetching

[Block diagram: IF unit with 0.75 MB 4-way Icache; sort unit dispatching 4 instructions each to the ALU ROB (28 entries) and the MEM ROB (28 entries); rename registers, architectural registers and retire logic; 2 64-bit INT ALUs, 2 shift/merge units, 2 FP mul/add units, 2 FP DIV/SQRT units and L/S address units; 1.5 MB 4-way Dcache; system bus interface]



AMD Athlon

[Block diagram: 64 KB 2-way (dual ported) Icache with pre-decode info; branch predictor with BHT+BTB (2K entries); IF/ID control and 3-way x86 instruction decoders; ROB (72-entry) + INT register file (24); integer scheduler (18-entry) feeding three IEX units and three address units; FPU stack map/rename, FPU scheduler (36-entry) and FPU register file (88-entry) feeding FADD/MMX/3DNow!, FMUL/MMX/3DNow! and FSTORE pipes; load/store queue unit (44 entries); 64 KB 2-way Dcache (8 banks); bus interface unit; L2 cache controller]



IBM POWER 4

2 processor cores on-chip (share on-chip 1.5MB 8-way L2 cache)



IBM POWER 4 (cont’d)

8-way issue OOO engine

Tournament branch prediction + static prediction

Group Completion Table (GCT) is a sort of ROB

64 KB DM/2-way Icache/Dcache, hardware data prefetch (L2 and L3 caches)



Intel Pentium 4

Decode stage translates 1 IA32 instruction per cycle into uops

Trace cache stores traces of uops

BHT+BTB+static branch prediction

"double pumped" ALUs

Now also with hyper-threading (simple SMT)



Trace caches

Assume a branch predictor throughput of m, then traces are identified by starting address and m − 1 branch outcomes

[Figure: a new trace starting at address A, with both branches taken, is filled from the I$ and stored in the trace cache as trace(A,taken,taken); later, a lookup of A with predictions (t,t) hits in the trace cache and the trace goes to the decoder]



Intel Pentium 4 (cont’d)

20-stage pipeline, 6-way issue OOO execution



Intel Pentium 4 (cont’d)

126-entry ROB, 128 physical registers (8 architectural)



An alternative to superscalar RISC: VLIW

Principle derived from horizontal µ-programming

Very Large Instruction Words containing multiple operation slots (operations drive execution units)

[Figure: a VLIW instruction made up of four slots, each holding a RISC-like operation]

Compiler schedules operations within instruction slots
    No hardware scheduling, compiler must find ILP
    Compiler should know everything about the architecture (e.g. timing) to schedule operations



VLIW processors (cont’d)

[Figure: division of work between compiler and hardware. Steps: frontend & optimizer, determine dependencies, determine independencies, bind resources, execute. The architectures are placed by the point at which the hardware takes over from the compiler: superscalar, dataflow, "Horizon", VLIW and IA64]




The ideal architecture of a VLIW processor

[Figure: main memory; one register file shared by the Load/Store unit, FP ALU, INT ALU and branch unit]

All execution units have direct access to the register file
    Typically, this is infeasible (too many read/write ports for register file): clustered architecture
    Clustering complicates the compiling (e.g. inter-cluster data movements)




Require less complex hardware than (superscalar) RISCs

Generally perform well on scientific and multimedia code (predictable)

In theory, compilers should be able to find more ILP than superscalars (they have a wider scope)

Extra room on chip can be used for application-specific HW optimizations

Compiler requires static branch prediction to schedule code

Code is less compact due to NO-OPs




Code compaction techniques to reduce impact of NO-OPs
    Where to decompress instructions?
        At Icache refill: not in critical path but needs larger Icache
        At instruction fetch: smaller Icache but in critical path

Object code compatibility hard to obtain
    Possible solution: static/dynamic instruction rescheduling



The Philips TriMedia TM1000 processor

32-bit high-performance media processor with VLIW core (currently 6.5 BOPS)

5 operations/instruction (2 memory operations)

Guarded operations (predication)

SIMD operations

Co-processors for common media algorithms

[Block diagram: VLIW CPU with 32K I$ and 16K D$; memory interface to SDRAM (32 bits data, 400 MB/s); video in (CCIR601/656 YUV 4:2:2, 38 MHz, 19 Mpix/s); video out (CCIR601/656 YUV 4:2:2, 80 MHz, 40 Mpix/s); audio in (stereo dig. audio, I2S DC - 100 kHz); audio out (2/4/6/8 chan. dig. audio, I2S DC - 100 kHz); I2C interface (I2C bus to camera, etc.); synchronous serial interface (V.34 or ISDN front end); timers; VLD coprocessor (Huffman decoder, slice-at-a-time, MPEG 1&2); image coprocessor (down & upscaling, YUV to RGB, 50 Mpix/s); PCI interface (32 bits, 33 MHz)]



The TriMedia TM1000 processor (cont’d)

27 execution units

128 entry register file

32 KB, 8-way set-associative Icache (compressed code)

16 KB, 8-way set-associative Dcache
    8 banks, pseudo-dual ported
    Non-blocking, hierarchical LRU
    Streamed, critical-word-first fetching

[Block diagram: instruction cache (32 KB), instruction fetch buffer, instruction decompression hardware, issue register (5 ops), operation routing network, execution units (27 functions), register routing and forwarding network, register file (128 x 32), data cache (16 KB)]



The TriMedia CPU64

Target: 6x to 8x performance increase over TM1000, while the transistor count may not be more than doubled

64-bit registers and data paths (e.g. 64-bit SIMD instructions)

New, extensive, media instruction set

Improved cache control (SW controlled prefetch and allocation), Dcache truly dual ported

Super-Ops: double-slot operation allowing multi-argument, multi-result operations




Inclusion of MMU (separate I-MMU and dual-ported D-MMU)

The D-MMU is/has
    64-entry fully-associative D-TLB, software managed
    Indexed with 32-bit VA and 8-bit process ID
    Variable page sizes of 4 KB to 16 MB: practical for media applications with large data streams

Precise exceptions



The (Intel/HP) Itanium 2 (McKinley)

IA64 processor using EPIC: mixing RISC and VLIW

Move complexity back to the compiler:
    Exploiting explicit parallelism: schedule operations in bundles

[Bundle layout (128 bits): Instruction 2 | Instruction 1 | Instruction 0 | Template; 41 bits per instruction slot, 5-bit template]

    Template provides information on inter and intra dependencies of bundles
    Branch + memory hints (ld.s + chk.s instructions)
    Predication to reduce branches



The Itanium 2 (cont’d)

8-stage in-order pipeline

2-level BHT + BTB, Target Address Registers for compiler-hints + Loop Count register

6 instructions issued per cycle

Register rotating for loop-unrolling support

3 branches may be executed in parallel

Load/store reordering allowed

IA32 instructions translated to IA64 instructions



The Transmeta Crusoe

Simple 4-way VLIW CPU core
    5 functional units
    64 registers

Reduced power by replacing a large number of transistors with software

x86-compatible through Code Morphing: translates x86 to VLIW instructions (did microcode return?)



The Transmeta Crusoe (cont’d)

Code Morphing

Translates + schedules a whole group of instructions at once (includes register renaming)

Caches translations

Analyses program behaviour → gradual optimization of translations

Alleviates compatibility problem of VLIWs

Applies a history-buffer approach for implementing precise exceptions

Can control the processor’s clock speed



The Transmeta Crusoe (cont’d)

The Transmeta Crusoe TM5600

[Block diagram: CPU core (integer unit, FP unit, MMU, multimedia instructions); L1 Icache 64K, 8-way set-assoc.; L1 Dcache 64K, 16-way set-assoc.; unified TLB, 256 entries, 4-way set-assoc.; L2 WB cache 512K, 4-way set-assoc.; bus interface; DDR SDRAM controller; SDR SDRAM controller; PCI controller; DMA; serial ROM interface]



Embedded processors

To the question "what’s the most popular microprocessor around?", you probably answered Intel Pentium

Well...thanks for playing, but

Intel Pentium has almost 0% market share. Zip. Zilch.

Pentium is a statistically insignificant chip with tiny sales!



Embedded processors (cont’d)

Relating microprocessors to life on earth... are Pentiums the viruses of the chip market? ;-)




In the embedded processor market, there’s no big leader




Types of embedded microprocessors
    GP processing cores, such as ARM, MIPS, 68000, PowerPC, etc.
    Digital Signal Processors (DSPs)
    Media processors, such as TriMedia, Emotion Engine (PS2), Equator’s MAP, etc.



DSPs



DSPs (cont’d)

Processing continuous data streams (sequences of samples)

Often (hard) real-time applications: call for predictable behavior!

In-order issue/execution/completion with CPI = 1 (often VLIW core)

Mostly fixed-point (FP is slow and expensive)

Still lots of assembly coding



DSPs (cont’d)

Harvard architecture
    Separate data memory/bus and instruction memory/bus (multiple ports, high bandwidth)

Multiply-accumulate (sum = sum + k*x[i]) in single instruction (common in filters)

Special addressing modes
    Modulo addressing for circular buffers, bit-reversed addressing (for FFTs)
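Multiply-accumulate and modulo addressing usually appear together in a FIR filter. The sketch below spells out in plain C what a DSP does with one MAC instruction and a hardware circular buffer; the filter length and all names are illustrative assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define TAPS 4 /* hypothetical filter length */

struct fir {
    int16_t delay[TAPS]; /* circular buffer with the most recent samples */
    unsigned head;       /* index of the newest sample */
};

/* Push one sample x and return the output sum over j of k[j] * x[n-j]. */
int32_t fir_step(struct fir *f, const int16_t k[TAPS], int16_t x)
{
    int32_t sum = 0;

    assert(f != 0 && f->head < TAPS);
    f->head = (f->head + 1) % TAPS;                 /* modulo addressing */
    f->delay[f->head] = x;

    for (unsigned j = 0; j < TAPS; j++) {
        unsigned idx = (f->head + TAPS - j) % TAPS; /* sample x[n-j] */
        sum += (int32_t)k[j] * f->delay[idx];       /* multiply-accumulate */
    }
    return sum;
}
```

Feeding a unit impulse (1, 0, 0, 0) through the filter returns the coefficients one by one, which is a quick sanity check of the indexing. On a DSP the index wrap-around and the loop control cost no extra instructions.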



DSPs (cont’d)

Both modulo and saturated arithmetic

Zero-overhead loops (loop an instruction (sequence) a number of times)

Predictable interrupt latencies

No caches or caches with locking (→ predictability)




Other concerns

Low costs
    Lowest possible area
    (Some of the) Technology behind the leading edge

Code density (small memory footprint)
    ISA methods
    Compression

Fast time to market
    Compatible architectures (e.g., ARM, MIPS) allow reuse of code
    Customizable core

Low power if application requires portability



Power Intermezzo

Power equations for CMOS logic circuits

Power consumption: $P = A C V^2 f + \tau A V I_{short} f + V I_{leak}$

First component measures dynamic power consumption of (dis-)charging the capacitive load on the output of each gate
    Proportional to frequency ($f$), activity of the gates ($A$, some gates may not switch every clock), the total capacitance seen by gate outputs ($C$) and the square of supply voltage ($V$)
    This term dominates



Power Intermezzo (cont’d)

Second term captures power consumption due to short-circuit current ($I_{short}$) that momentarily ($\tau$) flows between ground and supply voltage when output of gate switches

Third term measures the power lost due to leakage current regardless of the state of the gate
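The three terms can be put into a small calculator to get a feeling for their relative weight. A sketch only, assuming the power equation reads P = A C V^2 f + tau A V I_short f + V I_leak; the struct and function names are made up, and the values in the usage note are not realistic device parameters.

```c
#include <assert.h>

struct cmos_params {
    double A;       /* activity factor of the gates (0..1) */
    double C;       /* total capacitance seen by gate outputs */
    double V;       /* supply voltage */
    double f;       /* clock frequency */
    double tau;     /* duration of the short-circuit current */
    double I_short; /* short-circuit current */
    double I_leak;  /* leakage current */
};

/* First term: dynamic power of (dis-)charging the capacitive load. */
double cmos_dynamic_power(const struct cmos_params *p)
{
    assert(p != 0);
    return p->A * p->C * p->V * p->V * p->f;
}

/* All three terms: dynamic + short-circuit + leakage. */
double cmos_total_power(const struct cmos_params *p)
{
    assert(p != 0);
    return cmos_dynamic_power(p)
         + p->tau * p->A * p->V * p->I_short * p->f
         + p->V * p->I_leak;
}
```

Halving V in such a parameter set reduces the dynamic term by a factor of four while the other two terms only halve, which is the quadratic relationship the first term exhibits.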



Power Intermezzo (cont’d)

Reducing V effectively reduces power consumption (quadratic relationship!)

But, reducing V also limits the maximum frequency ($f_{max} \propto (V - V_{threshold})^2 / V$) → $f_{max}$ roughly linear to V

Lessen the effect of reducing V by reducing $V_{threshold}$

Unfortunately, this increases leakage current ($I_{leak} \propto \exp(-V_{threshold} / 35\,\mathrm{mV})$)
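A short worked step, assuming the frequency relation reads $f_{max} \propto (V - V_{threshold})^2 / V$: expanding the square shows why $f_{max}$ is roughly linear in $V$, and combining this with the dynamic power term shows what voltage scaling buys.

```latex
\begin{align*}
f_{max} &\propto \frac{(V - V_{threshold})^2}{V}
         = V - 2V_{threshold} + \frac{V_{threshold}^2}{V}
         \approx V - 2V_{threshold}
         \quad \text{for } V \gg V_{threshold} \\
P_{dynamic} &= A C V^2 f \;\propto\; V^3
         \quad \text{when } f \text{ is scaled down together with } V
\end{align*}
```

So lowering the supply voltage and the clock together reduces dynamic power roughly cubically, at a roughly linear cost in clock frequency.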



Techniques for power reduction

Logic level
    Clock gating: turn off parts of clock tree (may consume up to 30% of the power of a processor) → reduce parameter A
    Half-frequency clock (use both edges)
    Asynchronous logic

Exploit parallelism (allows for reducing V)
    This does not include pipeline parallelism! (this requires an increase of f)



Techniques for power reduction (cont’d)

Organisation of memory (e.g., multiple smaller banks, code compression)

Buses: reduce the number of swings on address lines exploiting locality (e.g., using Gray code)

OS: dynamically control f (frequency) dependent on application
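The Gray code remark can be checked by counting how many address lines change when consecutive addresses are driven onto a bus. A small sketch with illustrative function names; it models only bit flips between successive words, not real bus electrical behaviour.

```c
#include <assert.h>
#include <stdint.h>

/* Binary-reflected Gray code of n: consecutive values differ in one bit. */
uint32_t to_gray(uint32_t n)
{
    return n ^ (n >> 1);
}

/* Number of lines that swing when the bus goes from word a to word b. */
unsigned swings(uint32_t a, uint32_t b)
{
    uint32_t diff = a ^ b;
    unsigned count = 0;
    while (diff != 0) {
        count += diff & 1u;
        diff >>= 1;
    }
    return count;
}

/* Total swings when addresses first, first+1, ... (count of them) are
 * driven in sequence, either in plain binary or Gray encoded. */
unsigned sequential_swings(uint32_t first, unsigned count, int use_gray)
{
    unsigned total = 0;
    assert(count >= 1);
    for (unsigned i = 1; i < count; i++) {
        uint32_t prev = first + i - 1, cur = first + i;
        total += use_gray ? swings(to_gray(prev), to_gray(cur))
                          : swings(prev, cur);
    }
    return total;
}
```

For the sequential addresses 0 to 7 plain binary needs 11 line swings and Gray code 7, one per step. The saving depends on accesses being sequential, which is the locality the bullet refers to.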



Playstation 2 architecture



Parallel systems

Amdahl’s law → do not forget the “sequential” processor

In the past, many special-purpose processors were used in parallel systems (e.g. Transputer, CM-2)

Nowadays, mostly RISC(y) commodity microprocessors are used
    Cray → DEC Alpha
    SGI → MIPS
    IBM → POWER2

Parallelism exploited at multiple levels
    Task-level, thread-level and instruction-level
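The Amdahl's law remark at the top of this sheet is easy to quantify. The function below is the standard formulation (speedup = 1 / (s + (1 - s)/n) for sequential fraction s on n processors); the function name is illustrative.

```c
#include <assert.h>

/* Speedup on n processors when a fraction seq of the work (0..1)
 * cannot be parallelized. */
double amdahl_speedup(double seq, unsigned n)
{
    assert(seq >= 0.0 && seq <= 1.0 && n >= 1);
    return 1.0 / (seq + (1.0 - seq) / n);
}
```

With only 10% sequential work, 1000 processors give a speedup below 10, which is why the performance of the single "sequential" processor still matters.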



Parallel systems (cont’d)

Some design issues

Processors: number and power of processors, organization (connectivity)

Memory organization: location, caches, etc.

Type of network: Direct or indirect

Synchronization: SIMD → synchronous, MIMD → asynchronous



Parallel systems (cont’d)

[Figure: taxonomy of parallel paradigms. Synchronous: vector/array, SIMD, systolic (MISD). Asynchronous: MIMD. Shared memory and distributed memory variants are distinguished]



MIMD vs SIMD

(Shared memory) MIMD (Multiple Instructions, Multiple Data): medium/coarse grain parallelism

(Shared memory) SIMD (Single Instruction, Multiple Data): fine grain parallelism

[Figure: in the MIMD machine every node has its own instruction unit (IU) and data unit (DU); in the SIMD machine a single IU drives multiple DUs. Both connect through a network to memory]

SIMD has evolved into the SPMD paradigm for MIMD machines

→ I will focus on MIMD parallel architectures



Interconnection networks

Two types of networks

Direct (or point-to-point) connection networks

Indirect connection networks

Network properties/definitions

Network switching : the transportation of data from one processor to the other
    Circuit-switching (connection stays during communication)
    Packet-switching (connection only made for single packet)



Network properties (cont’d)

Topology : the lay-out of the network
    Static (point-to-point networks)
    Dynamic (indirect networks)

Node degree : nr. of channels connected to one node ( )

Diameter of network : maximum shortest path between two nodes ( )

Bisection width : when the network is cut into two equal halves, the minimum number of channels along the cut ( )

Network redundancy (fault tolerance): amount of alternative paths between two nodes ( )

Network scalability : measure for expandability of the network ( )



Network properties (cont’d)

Network routing : the process of steering data (messages) through the network

Routing and redundancy are coupled : high redundancy → many routing possibilities

Network functionality : measure for support of routing, fault tolerance, synchronization, message combining, etc.

Network throughput : amount of transferred data / time units

Network latency : worst case delay for transfer of a unit (empty) message through the network

Hot-spots : nodes that account for a disproportionate amount of network traffic



Direct (point-to-point ) connection networks

[Figure: examples of direct networks: linear array, ring, chordal ring of degree 3, completely connected, mesh, torus, systolic array, 3-hypercube, 4-hypercube]



Direct connection networks (cont’d)

Network             Node degree    Diameter                          Bisection width

Linear array        $2$            $N - 1$                           $1$
Ring                $2$            $\lfloor N/2 \rfloor$             $2$
Completely conn.    $N - 1$        $1$                               $(N/2)^2$
Binary tree         $3$            $2(\log_2 N - 1)$                 $1$
2D-mesh             $4$            $2(\sqrt{N} - 1)$                 $\sqrt{N}$
2D-torus            $4$            $2\lfloor \sqrt{N}/2 \rfloor$     $2\sqrt{N}$
Hypercube           $\log_2 N$     $\log_2 N$                        $N/2$

N equals the number of nodes
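A few rows of the table can be turned into code to compare topologies for a concrete node count. This sketch assumes the table entries read as follows: ring (2, floor(N/2), 2), 2D-mesh (4, 2(sqrt(N) - 1), sqrt(N)), 2D-torus (4, 2 floor(sqrt(N)/2), 2 sqrt(N)) and hypercube (log2 N, log2 N, N/2). The struct and function names are made up.

```c
#include <assert.h>

struct net_metrics {
    unsigned degree;
    unsigned diameter;
    unsigned bisection;
};

static unsigned ilog2u(unsigned n) { unsigned k = 0; while (n > 1) { n >>= 1; k++; } return k; }
static unsigned isqrtu(unsigned n) { unsigned r = 0; while ((r + 1) * (r + 1) <= n) r++; return r; }

struct net_metrics ring_metrics(unsigned n)
{
    struct net_metrics m = { 2, n / 2, 2 };
    assert(n >= 3);
    return m;
}

struct net_metrics mesh2d_metrics(unsigned n)
{
    unsigned r = isqrtu(n);              /* nodes per dimension */
    struct net_metrics m = { 4, 2 * (r - 1), r };
    assert(r >= 2 && r * r == n);        /* n must be a perfect square */
    return m;
}

struct net_metrics torus2d_metrics(unsigned n)
{
    unsigned r = isqrtu(n);
    struct net_metrics m = { 4, 2 * (r / 2), 2 * r };
    assert(r >= 2 && r * r == n);
    return m;
}

struct net_metrics hypercube_metrics(unsigned n)
{
    unsigned k = ilog2u(n);              /* number of dimensions */
    struct net_metrics m = { k, k, n / 2 };
    assert(n >= 2 && (1u << k) == n);    /* n must be a power of two */
    return m;
}
```

For 16 nodes the mesh has diameter 6 and bisection width 4, while the torus and the hypercube both reach diameter 4 and bisection width 8; the hypercube pays for this with a node degree that grows with N.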




Dynamic networks: no (fixed) neighbours (change communication topology based on application demands)

Bus networks

Multistage networks (blocking and non-blocking)
    Omega networks
    Baseline networks
    Clos networks

Crossbar switches



Busses

Generic bus structure in parallel machines

[Figure: processors P1..Pn, memories M1..Mn and I/O devices I/O1..I/On attached to shared address, data and control lines, with a bus arbiter and control unit]

In traditional busses, address and data lines may be time-multiplexed

When there are more bus-masters, arbitration is required



Busses (cont’d)

Synchronous bus

[Timing diagram: typical bus read transaction, showing the Clock, Address, Data and Read signals]

Asynchronous bus
    More complex/expensive (and possibly slower) but also more flexible than synchronous bus



Busses (cont’d)

Split-transaction busses: higher throughput by pipelining (but possibly higher latency)

Need extra bus lines to signal the "owner" of data (using tags)

[Timing diagram: split-transaction (pipelined) bus. Address phases addr 1, addr 2, addr 3 overlap with data phases data 0, data 1, data 2; status signals wait 1 and OK 1]



Busses (cont’d)

Traditional versus split-transaction busses

[Figure: bus usage over time for processors P1, P2 and P3 on a traditional bus and on a split-transaction bus; the legend distinguishes address bus used, data bus used and bus not used]



Busses (cont’d)

Arbitration: two examples of centralized schemes

Independent request/grant lines: flexible + efficient but expensive

[Figure: masters 1..n on the bus lines, each with its own request (R1..Rn) and grant (G1..Gn) line to the central bus arbiter, plus a shared bus busy line; R = Request, G = Grant]



Busses (cont’d)

Arbitration: two examples of centralized schemes (cont’d)

Daisy-chaining: less expensive but slow propagation of grant and less fairness

[Figure: the central bus arbiter passes the grant along a chain (G1, G2, ..., Gn) through masters 1..n; the bus request and bus busy lines are shared]



Multistage networks

[Figure: stages of a x b switches linked by interstage connection (ISC) patterns; inputs numbered 0 to a^n - 1, outputs numbered 0 to b^n - 1]

A generalized Multistage Interconnection Network (MIN) built with a x b switches and a specific InterStage Connection (ISC) pattern



Multistage networks (cont’d)

Switch networks often use 2x2 switches

N inputs require $\log_2 N$ stages of 2x2 switches

Each stage requires $N/2$ switch modules

The number of stages determines the delay of the network




[Figure: a 16x16 Omega network of 2x2 switches, inputs and outputs numbered 0 to 15; each switch can be set to straight, crossover, upper broadcast or lower broadcast]




Routing in Omega network : not all permutations are unblocked

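Permutation (0,7,6,4,2)(1,3)(5) without blocking

[Figure: 8x8 Omega network, inputs and outputs numbered 0 to 7, routing this permutation without conflicts]

Permutation (0,6,4,7,3)(1,5)(2) blocked at switches F, G and H

[Figure: the same network with conflicts at switches F, G and H]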
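Whether a permutation passes an Omega network in one pass can be checked with destination-tag routing. The sketch below assumes the usual construction (a perfect shuffle before every stage, each 2x2 switch set by one destination bit, most significant bit first) and reads the cycle notation (a,b,c) as a sends to b, b sends to c, c sends to a. The function name and the port limit are illustrative.

```c
#include <assert.h>

#define MAX_PORTS 64 /* hypothetical upper bound on network size */

/* Returns 1 if dest[0..n-1] (dest[s] = output port of input s) can be
 * routed through an n x n Omega network of 2x2 switches without two
 * messages needing the same switch output, 0 otherwise. */
int omega_is_unblocked(const unsigned *dest, unsigned n)
{
    unsigned stages = 0;

    while ((1u << stages) < n)
        stages++;
    assert(n >= 2 && n <= MAX_PORTS && (1u << stages) == n);

    for (unsigned k = 1; k <= stages; k++) {
        int used[MAX_PORTS] = {0};
        for (unsigned s = 0; s < n; s++) {
            unsigned line;
            assert(dest[s] < n);
            /* After k stages the message sits on the line formed by the
             * low bits of s followed by the top k bits of dest[s]. */
            line = ((s << k) | (dest[s] >> (stages - k))) & (n - 1);
            if (used[line])
                return 0; /* two messages want the same switch output */
            used[line] = 1;
        }
    }
    return 1;
}
```

Under these assumptions the first permutation on the sheet is accepted and the second is rejected, which agrees with the figure. The check does not report which switches conflict.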



Crossbar switches

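[Figure: crossbar with processors P1, P2, P3, ... on the rows and memories M1, M2, M3, ... on the columns; each crosspoint switch is either on or off]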



Crossbar switches (cont’d)

Possible implementations of a crossbar

[Figure: two implementations connecting inputs I0..I3 to outputs O0..O3: a matrix of crosspoints, and a RAM with Din, Dout, addr and phase signals]




Assume n processors on a bus of width w, an n x n MIN using k x k switches with line width w and an n x n crossbar with line width w.

Network characteristics    Bus                 Multistage network          Crossbar switch

Min. latency               constant            $O(\log_k n)$               constant
Bandwidth                  $O(w)$              $O(w)$ to $O(nw)$           $O(w)$ to $O(nw)$
Wiring complexity          $O(w)$              $O(nw \log_k n)$            $O(n^2 w)$
Switching complexity       $O(n)$              $O(n \log_k n)$             $O(n^2)$
Connectivity and           Only one to one     Some permutations and       All permutations,
routing capability         at a time           broadcast, if network       one at a time
                                               unblocked



Packet switching

→ Divide message into packets and route them through the network

Common packet-switching techniques:

Store & forward switching (rather obsolete)
    Packet is smallest entity
    Packet buffers at intermediate nodes required

[Timing diagram: a packet travelling from source S via intermediate nodes I1 and I2, node against time]



Packet switching (cont’d)

Wormhole “routing”
    Flit is smallest entity
    Only small flit buffers required
    One packet can occupy multiple channels

[Timing diagram: the flits of a packet travelling from source S via intermediate nodes I1 and I2, node against time]

→ Nearly distance independent (low latency)

Virtual cut-through switching
    Combination of the Store & forward and Wormhole techniques



Store & forward vs Wormhole

The communication latencies for store & forward switching and wormhole routing are expressed by:

$T_{s\&f} = \frac{L}{W} \times D$

$T_{wormhole} = \frac{L}{W} + \frac{F}{W} \times (D - 1)$

where L is the packet length in bits, W the channel bandwidth in bits/s, D the distance (number of hops) and F the flit length in bits.
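The difference between the two latencies is easiest to see with numbers. The sketch assumes the formulas read T_s&f = (L/W) D and T_wormhole = L/W + (F/W)(D - 1); the function names are illustrative.

```c
#include <assert.h>

/* Store & forward: the whole packet is received at every hop. */
double latency_store_forward(double L, double W, unsigned D)
{
    assert(W > 0.0);
    return (L / W) * D;
}

/* Wormhole: only the header flit pays for the extra hops; the rest of
 * the packet follows in a pipelined fashion. */
double latency_wormhole(double L, double F, double W, unsigned D)
{
    assert(W > 0.0 && D >= 1);
    return L / W + (F / W) * (D - 1);
}
```

For a 1024-bit packet, 32-bit flits, a channel of 128 bits per time unit and 5 hops, store & forward takes 40 time units and wormhole routing 9. The wormhole latency grows only by F/W per extra hop, which is why it is called nearly distance independent.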



Flow control

Handshaking between switches, e.g. in wormhole routing:

[Figure: switch S sends flit i over the channel to switch D; a request/acknowledge (R/A) line toggling between high and low controls when flit i is transferred and when flit i+1 may follow]



Tree saturation

Wormhole routing may suffer from tree saturation: messages are waiting for each other → can lead to a snowball effect

Message A

Message B



A generic switch architecture

[Figure: input ports with receivers and input buffers, a cross-bar, output buffers with transmitters and output ports; a control unit performs routing and scheduling]



Routing techniques

Location of routing “intelligence”:

Source-based routing (e.g. Myrinet)
    Routers “eat” the head of a packet
    Larger packets
    No fault tolerance

Local routing
    More complex routers but smaller packets

Routing may cause deadlocks

Buffer deadlock (store-and-forward switching)

Channel deadlock (wormhole routing)

Routing may be minimal or non-minimal

Non-minimal routing → potential starvation



Routing deadlocks

[Figure in four panels: channel deadlock between nodes A, B, C and D over channels C1 to C4; the channel-dependence graph containing a cycle; adding two virtual channels (V3, V4); the modified channel-dependence graph using V3 and V4]



Local routing techniques (cont’d)

Determining the routing path

Deterministic (non-adaptive) routing : fixed path
    Minimal and deadlock free

Adaptive routing : exploits alternative paths
    Less prone to contention and more fault-tolerant
    Potential deadlocks
    Reassembling of messages (out-of-order arrival of packets)
    Cannot be source-based routing
    Minimal or non-minimal
    Partially adaptive vs fully adaptive



Deterministic dimension order routing: X-Y

[Figure: a path from source S to destination D in a 2D mesh using X-Y routing; second panel: deadlock not possible with X-Y routing]
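Dimension order routing is simple enough to write down as a routing function: first correct the X coordinate, and only then the Y coordinate. A sketch for a 2D mesh; the coordinate convention (x grows to the east, y grows to the north) and all names are assumptions.

```c
#include <assert.h>

enum hop { HOP_ARRIVED, HOP_EAST, HOP_WEST, HOP_NORTH, HOP_SOUTH };

struct coord { int x, y; };

/* Local routing decision at node cur for a packet destined for dst. */
enum hop xy_next_hop(struct coord cur, struct coord dst)
{
    if (cur.x < dst.x) return HOP_EAST;   /* X dimension first */
    if (cur.x > dst.x) return HOP_WEST;
    if (cur.y < dst.y) return HOP_NORTH;  /* Y only when X is correct */
    if (cur.y > dst.y) return HOP_SOUTH;
    return HOP_ARRIVED;
}

/* Follow the routing function from src to dst and count the hops. */
int xy_route_length(struct coord src, struct coord dst)
{
    int hops = 0;
    enum hop h;

    while ((h = xy_next_hop(src, dst)) != HOP_ARRIVED) {
        switch (h) {
        case HOP_EAST:  src.x++; break;
        case HOP_WEST:  src.x--; break;
        case HOP_NORTH: src.y++; break;
        case HOP_SOUTH: src.y--; break;
        default: assert(0); break;
        }
        hops++;
    }
    return hops;
}
```

Because a packet never turns from the Y dimension back into the X dimension, the channel dependencies cannot form a cycle, which is the reason deadlock is not possible. The path is always minimal but also always the same, so there is no way around a congested or faulty link.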



Deterministic routing (cont’d)

Interval labelling

[Figure: network with nodes numbered 0 to 15; the links of one switch are labelled with the intervals [0,4), [4,6), [7,8) and [8,16)]

Example: Inmos C104 switch



Adaptive routing

[Figure: alternative paths between source S and destination D]



Adaptive routing (cont’d)

[Figure: XY routing versus adaptive routing]



Deadlock avoidance

Deterministic routing (e.g. X-Y)

Partially adaptive routing
    For example, west-first routing for 2D meshes: route a packet first to the west (if required), then route the packet adaptively to north, south or east

[Figure: XY routing versus west-first routing]



West-first routing example



Deadlock avoidance (cont’d)

Virtual channels
    Virtual channels are logical links between two nodes using their own buffers and multiplexed over a single physical channel
    Virtual channels “break” dependency cycles

[Figure: virtual links VL1, VL2 and VL3 between node X and node Y share one physical channel, arbitrated e.g. round-robin]



Virtual channels (cont’d)

Double Y-channel 2D mesh +X subnetwork



Virtual channels (cont’d)

Advantages

Increased network throughput

Deadlock avoidance

Virtual topologies

Dedicated channels (e.g. debugging, monitoring)

Disadvantages

Hardware cost

Higher latency

Incoming packets may be out-of-order



Distributed memory MIMDs: multicomputers

Message-passing machines (packet switched)

Often a point-to-point network

Memory
    Locally addressable
    No global address space

Communication & synchronization
    Via message passing

Architecture is scalable

Communication is not transparent



Distributed memory MIMDs (cont’d)

Problem: intermediate processors must route messages when two communicating nodes are not neighbours

Solution: separate communication processor on node which performs routing and DMA transfers

Problem: no globally accessible memory available, e.g. sharing of data and code difficult (not transparent)

Solution: Virtual Shared Memory (VSM) or Shared Virtual Memory (SVM)



VSM and SVM

Translate memory references into the message-passing paradigm

Virtual Shared Memory (VSM)
    Hardware implementation
    Virtual memory system transparently implemented on top of VSM
    Unit of sharing typically small (e.g. cache block)

Shared Virtual Memory (SVM)
    Software implementation (OS) + hardware support (MMU)
    Virtual memory system implements shared memory (OS not transparent)
    Unit of sharing typically larger (e.g. pages)



VSM and SVM (cont’d)

If data has a fixed home node, then there are four approaches to VSM and SVM:

Central Server
    − no migration
    − no replication
    − no coherency problems

Full Migration
    − migration
    − no replication
    − no coherency problems

Read Replication
    − migration (writes)
    − replication (reads)
    − invalidations guarantee coherency

Full Replication
    − no migration
    − replication on reads and writes
    − sequencer process (S) updates all replications when writing

More on VSM later on...


Real multicomputers: the IBM SP2

POWER2 processors



Shared memory MIMDs: multiprocessors

Network : typically indirect or hybrid (indirect + point-to-point)

Memory
    Locally addressable
    Globally addressable

Communication & synchronization
    Via sharing of data (transparent)
    Critical regions (locking)
    Message passing can be emulated

Architecture is [not,..,reasonably] scalable

Cache coherency problem



Cache coherency in shared memory machines

Cache coherency problem:

In multiprocessor systems data inconsistencies between different caches can easily occur

Three sources of the problem can be identified:

Sharing of writable data

Process migration

I/O activity



Cache coherency (cont’d)

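Sharing of writable data

[Figure: processors P1 and P2 with their caches and a shared memory, in three panels. Before update: both caches and the memory hold X. Write-through: the writing cache and the memory hold X', the other cache still holds X. Write-back: only the writing cache holds X', the other cache and the memory still hold X.]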




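Process migration

[Figure: processors P1 and P2 with their caches and a shared memory, in three panels: before migration, write-through and write-back. After a process migrates to the other processor, one cache holds the new value X' while a stale copy X remains in the other cache (write-through) or in the other cache and the memory (write-back).]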




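I/O activity

[Figure: processors P1 and P2 with their caches, a memory and an I/O device that accesses the memory directly, shown for write-through and write-back caches; copies X and X' become inconsistent because I/O bypasses the caches. A second diagram attaches I/O processors (IOP1, IOP2) to the caches C1 and C2 of P1 and P2 in front of the shared memory. IOP = I/O processor.]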




In general, cache coherency protocols are based on a set of (cache block) states and state transitions

Two types of protocols: write-invalidate and write-update

Write-invalidate suffers from false sharing

False sharing

Some invalidations are not necessary for correct program execution:

Processor 1:            Processor 2:

while (true) do         while (true) do

    A = A + 1               B = B + 1

If A and B are located in the same cache block, a cache miss occurs in each loop iteration due to a ping-pong of invalidations
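The ping-pong effect can be made concrete with a small counting model. This is only a sketch under simplifying assumptions: two processors alternate their writes strictly, each cache holds at most the one block it needs, and the names count_misses and block_of are made up for this illustration.

```c
#include <stddef.h>

/* Block number that contains a byte address. */
static size_t block_of(size_t addr, size_t block_size)
{
    return addr / block_size;
}

/* Processor 0 writes the variable at addr_a, processor 1 the one at addr_b,
   alternating for `iterations` rounds under a write-invalidate protocol.
   Returns the total number of cache misses. */
size_t count_misses(size_t addr_a, size_t addr_b, size_t block_size, int iterations)
{
    int valid[2] = {0, 0};  /* does processor p hold a valid copy of its block? */
    int same_block = block_of(addr_a, block_size) == block_of(addr_b, block_size);
    size_t misses = 0;

    for (int it = 0; it < iterations; it++) {
        for (int p = 0; p < 2; p++) {
            if (!valid[p]) {            /* copy was invalidated or never loaded */
                misses++;
                valid[p] = 1;
            }
            if (same_block)
                valid[1 - p] = 0;       /* the write invalidates the other copy */
        }
    }
    return misses;
}
```

With A and B in the same block every write misses; once the variables are placed in different blocks (for example by padding), only the two initial cold misses remain.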




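[Figure: processors P1, P2, ..., Pn with their caches and a shared memory, in three panels. Initially all caches and the shared memory hold X. Write-invalidate protocol: after a write, the writer's cache and the memory hold X' and the other copies are marked invalid (I). Write-update protocol: after a write, all caches and the shared memory hold X'.]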



Uniform Memory Access (UMA)

[Figure: processors (P) with caches (C) connected through an interconnection network to several memory modules]

Symmetric MultiProcessors (SMP) are well-known UMA architectures



UMA architectures (cont’d)

Not/hardly scalable
  Bus-based architectures → saturation
  Crossbars → too expensive (wiring constraints)
  Multistage networks → wiring constraints + possibly higher latency (more stages)

Possible solutions
  Reduce network traffic by caching
  Clustering → non-uniform memory latency behaviour (NUMA)



UMA architectures (cont’d)

Memory contention occurs when multiple processors are addressing the memory at the same moment
  Banked/multiple memories

Non-uniform network traffic in multistage networks may cause tree saturation
  Use of message combining (e.g. in the atomic Fetch&Add operation)



Message combining

Message combining using the Fetch&Add operation

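[Figure, four steps with P1, P2, a switch and main memory holding X: (1) P1 sends Fetch&Add(X,e1) and P2 sends Fetch&Add(X,e2) to the switch; (2) the switch keeps e1 and forwards a single Fetch&Add(X,e1+e2) to main memory; (3) main memory returns X to the switch and now holds X+e1+e2; (4) the switch returns X to one processor and X+e1 to the other.]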
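The combining step can be sketched in a few lines. This is an illustrative model, not a description of real switch hardware: the type combined_reply and the function switch_combine are hypothetical names, and the order in which the two requests are serialized (P1 first) is an assumption.

```c
/* Replies that the switch hands back to the two requesting processors. */
struct combined_reply {
    int to_p1;
    int to_p2;
};

/* Plain Fetch&Add on a memory word: returns the old value and adds e. */
int fetch_and_add(int *x, int e)
{
    int old = *x;
    *x += e;
    return old;
}

/* The switch merges Fetch&Add(X,e1) and Fetch&Add(X,e2) into one access. */
struct combined_reply switch_combine(int *x, int e1, int e2)
{
    struct combined_reply r;
    int old = fetch_and_add(x, e1 + e2);  /* single request to main memory */

    r.to_p1 = old;        /* as if P1's request was served first */
    r.to_p2 = old + e1;   /* as if P2's request came second; e1 was kept in the switch */
    return r;
}
```

Both processors see results that are consistent with some serial order of the two operations, while the memory module handles only one request.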



Non Uniform Memory Access (NUMA)

Multiple clusters of SMPs: VSM revisited

[Figure: two clusters, each with processors (P), caches (c) and a memory (MEM) on a shared network; the clusters are connected by a message-passing network]

Local memory references are fast, remote ones slow (ratio 1:[2-15]) → latency hiding!

Cache-controller/MMU determines whether a reference is local or remote

When caching is involved, it's called CC-NUMA (cache coherent NUMA)

Typically Read Replication (write invalidation)


NUMA (cont’d)

Caches (CC-NUMA) reduce latency

Possibilities for latency hiding → overlap valuable computation with communication (i.e. the fetching of remote data)

Prefetching of data
  Before remote data is actually required, fetch it from the remote node.

Threading
  When the processor is about to stall on a remote data fetch, schedule a new thread of control (lightweight process)

Relaxed memory consistency models: how consistent should the view of memory be?



Sequential consistency (SC)

Processor 1:            Processor 2:

A = 0;                  B = 0;

......                  ......

A = 1;                  B = 1;

if (B == 0) ...         if (A == 0) ...

SC model: atomic and strongly ordered memory accesses

e.g. delay write until all invalidations have been acknowledged

[Figure: processors P1, P2, P3, ..., Pn connected through a switch to a single-ported memory]
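For the two-processor example on this slide, sequential consistency rules out the outcome in which both tests succeed. The sketch below checks this by brute force: it enumerates every interleaving of the four memory operations that respects program order and counts the runs in which both processors read 0. The encoding of the operations and the function names are assumptions made for this illustration.

```c
/* Operations: 0 = P1 writes A = 1, 1 = P1 reads B,
               2 = P2 writes B = 1, 3 = P2 reads A. */
static void run(const int order[4], int *r1, int *r2)
{
    int A = 0, B = 0;

    for (int k = 0; k < 4; k++) {
        switch (order[k]) {
        case 0: A = 1;   break;
        case 1: *r1 = B; break;
        case 2: B = 1;   break;
        case 3: *r2 = A; break;
        }
    }
}

/* Counts the sequentially consistent executions in which both reads return 0. */
int count_both_zero(void)
{
    int count = 0;

    /* Pick positions i < j for P1's two operations; P2's operations fill
       the remaining positions in program order. */
    for (int i = 0; i < 4; i++) {
        for (int j = i + 1; j < 4; j++) {
            int order[4], next_p2 = 2;
            int r1 = -1, r2 = -1;

            for (int k = 0; k < 4; k++) {
                if (k == i)      order[k] = 0;
                else if (k == j) order[k] = 1;
                else             order[k] = next_p2++;
            }
            run(order, &r1, &r2);
            if (r1 == 0 && r2 == 0)
                count++;
        }
    }
    return count;
}
```

None of the six interleavings lets both processors see 0, so at most one of them can enter its if-branch under SC. A machine that lets a load bypass an earlier write can produce exactly this forbidden outcome.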



Relaxed consistency (RC)

Processor consistency (Sparc): loads may bypass writes

Partial store order (Sparc): loads/writes may bypass writes

Weak consistency (PowerPC) and Release consistency (Alpha, MIPS): no ordering between references (synchronization operations act as memory fences)

Note that programs on RC models always need synchronization such that their execution semantics are the same as under the SC model
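How a synchronization operation acts as a fence can be shown with the C11 atomics from <stdatomic.h>. This is a minimal sketch, not tied to any of the processors named above; the variables payload and ready and the two functions are hypothetical.

```c
#include <stdatomic.h>

static int payload;        /* ordinary shared data */
static atomic_int ready;   /* synchronization flag, zero-initialized (static storage) */

/* Producer: the release store keeps the write to payload ahead of the flag. */
void producer(int value)
{
    payload = value;
    atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Consumer: returns 1 and copies the payload if the flag is set, else 0.
   The acquire load keeps the read of payload behind the flag check. */
int consumer_try(int *out)
{
    if (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        return 0;
    *out = payload;
    return 1;
}
```

Without the release/acquire pair, a relaxed machine could make the flag visible before the payload, and the consumer could read stale data.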



Disadvantages of CC-NUMA

Remote data only held in the small local cache

Performance is severely limited when many data references are remote (e.g. incorrect data partitioning, data does not fit in the cache, etc.)

Possible solutions
  Increase cache size → expensive and may increase latency of local references
  Page migration/replication implemented by the OS
    slow (OS-level) and complex
    works only at page granularity: problem when parallel accesses have finer granularity (false sharing!)



Cache Only Memory Architecture (COMA)

Similar to NUMA, only the main memories act as direct-mapped or set-associative caches → addresses are hashed to a DRAM "cache-line"

Fetched remote data are actually stored in the local main memory (replication)

Data elements do not have a fixed home location: they can migrate

[Figure: two clusters, each with processors (P), caches (c) and a DRAM cache on a shared network; the clusters are connected by a message-passing network]



COMA (cont’d)

MP-network often a hierarchical (e.g. tree) network
  A switch within the tree contains a directory with data elements residing in its sub-tree
  Remote access requires tree traversal as data has no home node
  Switches support message combining
  Write invalidate coherency protocol



COMA (cont’d)

Requires extra memory-subsystem hardware
  Tag memory to check whether a data element in the DRAM cache is the required element
  Comparators to perform this check for multiple DRAM cache blocks (when the DRAM cache is set-associative)
  State memory to keep the state of the DRAM cache elements



COMA versus CC-NUMA

COMA more flexible than CC-NUMA
  Replication not constrained by a small local cache (main memory is much larger!)
  Dynamic migration/replication of data without the need of OS support and at a fine granularity (less false sharing)

COMA needs non-standard memory management hardware (expensive and complex)

Remote accesses in COMA slower due to the tree traversal

Coherency protocol harder to implement in COMA: take care that the last copy of a data element is not removed



COMA versus CC-NUMA (cont’d)

Performance difference: highly dependent on application

Low miss rates: performance of COMA and CC-NUMA similar

Capacity misses dominate (a capacity miss occurs because the data does not fit in the cache): COMA outperforms CC-NUMA as COMA's DRAM cache is usually large enough to store all required data (unlike CC-NUMA's small data caches)

Coherence misses dominate (a coherence miss is due to the invalidation of data): CC-NUMA outperforms COMA due to the higher latency of remote accesses in COMA



Simple COMA (S-COMA)

Data placement/allocation at page-granularity: like (software) SVM, the MMU determines whether a page is in the local DRAM. So, tag memory and comparators are not needed

Coherency managed in hardware and at a fine granularity: transferred data elements are cache blocks (minimizes false sharing)

[Figure: a page in the DRAM cache is divided into cache blocks; individual cache blocks are transferred over the network]


S-COMA (cont’d)

A page can be partially filled with valid data (cache blocks)

OS-managed main memory can be fully associative (not feasible in normal COMA)

DRAM cache misses may be slower than COMA misses due to OS support (e.g. page faults)

Possibility of false replacement: a page fault (DRAM read miss), fetching a single remote cache block and allocating a new local page, may falsely replace an entire page



Cache coherency revisited

Cache coherency protocols are based on cache block states and state transitions → how to find copies of a cache block?

Snoopy bus protocols: caches detect copies by monitoring the bus
  Typically for broadcast-based architectures: UMA machines or within the SMP clusters of CC-NUMAs
  Either write-invalidate or write-update

Directory based protocols: store locations of copies in directory
  More scalable than snoopy protocols → used in non-broadcast networks (e.g. CC-NUMAs and COMAs)
  Typically write-invalidate



Directory based protocols

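Full map

[Figure, three steps with a directory entry for X and the caches of P1, P2 and Pn: (1) no cache holds X and all three processors issue Read X; (2) all three caches hold "X: data", the directory entry points to each of them, and one processor issues Write X; (3) only the writer's cache still holds "X: data" and the directory entry points to that cache alone]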
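A full-map directory entry can be modelled as one presence bit per cache. The sketch below is a simplified illustration: the struct layout, the 32-cache limit and the function names dir_read and dir_write are assumptions, and data transfers and write-backs are not modelled.

```c
#include <stdint.h>

/* Hypothetical full-map directory entry for one memory block. */
struct dir_entry {
    uint32_t present;  /* bit p set: cache p holds a copy (up to 32 caches) */
    int dirty;         /* 1: exactly one cache holds a modified copy */
};

/* Read by cache p: record the new sharer. */
void dir_read(struct dir_entry *e, int p)
{
    e->present |= (uint32_t)1 << p;
    e->dirty = 0;  /* a previous owner would write back first (not modelled) */
}

/* Write by cache p: returns the number of invalidations sent to other caches. */
int dir_write(struct dir_entry *e, int p)
{
    uint32_t others = e->present & ~((uint32_t)1 << p);
    int invalidations = 0;

    for (; others != 0; others &= others - 1)
        invalidations++;           /* one message per remaining presence bit */

    e->present = (uint32_t)1 << p; /* the writer is now the only holder */
    e->dirty = 1;
    return invalidations;
}
```

The storage cost is one bit per cache for every memory block, which is the overhead that limited-map and chained directories try to reduce.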



Directory based protocols (cont’d)

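Limited map

[Figure, two steps with a directory entry for X and the caches of P1, P2 and Pn: (1) two caches hold "X: data" and a further processor issues Read X; (2) again only two caches hold "X: data", because the directory entry has a limited number of pointers]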



Directory based protocols (cont’d)

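Chained directory

[Figure, two steps with a directory entry for X and the caches of P1, P2 and Pn: (1) one cache holds "X: data" and another processor issues Read X; (2) two caches hold "X: data" and the copies are linked in a chain that starts at the directory entry]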




[Figure: state-transition graph of write-through cache i (states Valid and Invalid) and state-transition graph of write-back cache i (states M, S and INV)]

R(i) = read block by cache i

W(i) = write to block by cache i

Z(i) = replace block in cache i

j denotes another cache (j ≠ i)

M = Modified

S = Shared

INV = Invalid or not in cache




A snoopy-bus system with 3 processors with MSI write-back caches

Proc. action   P1 state   P2 state   P3 state   Bus act.   Data from
P1 read x      S          -          -          Rd         Memory
P3 read x      S          -          S          Rd         Memory
P3 write x     I          -          M          I          -
P2 write x     I          M          I          RdI        P3's cache/memory
P1 read x      S          S          I          Rd         P2's cache/memory
P3 read x      S          S          S          Rd         Memory
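The state columns of this table can be reproduced with a very small model of the MSI protocol. It is a sketch only: bus transactions and data transfers are not modelled, "not in cache" and Invalid are merged into one state, and the names msi_read and msi_write are hypothetical.

```c
enum msi { INV, SHARED, MODIFIED };

#define NCACHES 3

/* Processor p reads the block; state[] holds the MSI state per cache. */
void msi_read(enum msi state[NCACHES], int p)
{
    if (state[p] != INV)
        return;                      /* read hit: no bus action */
    for (int q = 0; q < NCACHES; q++)
        if (state[q] == MODIFIED)
            state[q] = SHARED;       /* the owner supplies the data and keeps a clean copy */
    state[p] = SHARED;
}

/* Processor p writes the block: all other copies are invalidated. */
void msi_write(enum msi state[NCACHES], int p)
{
    for (int q = 0; q < NCACHES; q++)
        if (q != p)
            state[q] = INV;
    state[p] = MODIFIED;
}
```

Replaying the six actions of the table (with index 0 for P1, 1 for P2 and 2 for P3) yields the same sequence of states.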




MESI protocol frequently used in commodity processors
  M(odified): dirty, exclusive cache block
  E(xclusive): clean, exclusive cache block
  S(hared): clean, shared cache block
  I(nvalid): block not resident in cache

Exclusive state reduces invalidation traffic: the cache can write without sending invalidations

Bus needs extra status line signalling whether or not data is shared
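The benefit of the Exclusive state shows up in two transitions, sketched below. The function names and the shared_line parameter (standing for the extra bus status line) are assumptions for this illustration; the remaining MESI transitions are left out.

```c
enum mesi { MESI_I, MESI_S, MESI_E, MESI_M };

/* State after a read miss: shared_line is 1 if another cache reports a copy. */
enum mesi mesi_read_miss(int shared_line)
{
    return shared_line ? MESI_S : MESI_E;
}

/* Local write: *needs_bus is set to 1 when the write must be announced on
   the bus (invalidation or read-for-ownership), 0 when it can stay local. */
enum mesi mesi_local_write(enum mesi s, int *needs_bus)
{
    /* E and M are exclusive, so no other cache can hold a copy. */
    *needs_bus = (s == MESI_S || s == MESI_I);
    return MESI_M;
}
```

A block that was read while no other cache held it enters E, and a later write upgrades it to M without any bus traffic.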




Cache coherency in a cache hierarchy

Snooping logic not at all levels

Solution: preserve inclusion property
  If a memory block ∈ L1 cache, then it is ∈ L2 cache
  If a block is in Modified state in the L1 cache, then it must also be marked as modified in the L2 cache

Requirements for inclusion are not trivial: different block sizes, associativities, etc.

Automatic inclusion: L1 direct-mapped and L2 direct-mapped/set-associative with identical block sizes and #sets(L1) ≤ #sets(L2).



Synchronization

Hardware synchronization in multiprocessors: similar to software based (OS-level) synchronization for critical sections (semaphores, monitors)

Atomic read-modify-write operations, such as the test-and-set operation, allow implementation of synchronization primitives (locks)

test-and-set( int *address ) {    /* the whole body executes as one atomic operation */
    temp = *address;
    *address = 1;
    return (temp);
}



Synchronization (cont’d)

Spin lock

lock( int *lock ) {
    while ( test-and-set( lock ) == 1 );
}

unlock( int *lock ) {
    *lock = 0;
}

Suspend lock




Spin locks may cause thrashing:

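[Figure: four snapshots (a)-(d) of the memory and the caches of P0, P1 and P2, marking invalid and dirty copies of the lock: P0 takes the lock, after which the failed lock attempts of P1 and P2 each bring a dirty copy of the lock into their own cache and invalidate the other copies]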




Possible solutions to avoid thrashing

Snooping lock

lock( int *lock ) {
    while ( test-and-set( lock ) == 1 )
        while ( *lock == 1 );
}

test-and-test-and-set lock

lock( int *lock ) {
    for (;;) {
        while ( *lock == 1 );
        if ( test-and-set( lock ) == 0 )
            break;
    }
}
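The test-and-test-and-set lock above is pseudocode for a hardware primitive. A compilable variant can be sketched with C11 atomics, where an atomic exchange plays the role of test-and-set. The function names and the split into a single-attempt ttas_try_lock are choices made for this illustration, not part of the original slides.

```c
#include <stdatomic.h>

/* Atomic exchange acts as test-and-set: writes 1 and returns the old value. */
static int tas(atomic_int *lock)
{
    return atomic_exchange_explicit(lock, 1, memory_order_acquire);
}

/* One attempt; returns 1 if the lock was acquired, 0 otherwise. */
int ttas_try_lock(atomic_int *lock)
{
    if (atomic_load_explicit(lock, memory_order_relaxed) == 1)
        return 0;            /* test: read-only, can be served from the local cache */
    return tas(lock) == 0;   /* test-and-set: the only step that writes */
}

void ttas_lock(atomic_int *lock)
{
    while (!ttas_try_lock(lock))
        ;                    /* spin */
}

void ttas_unlock(atomic_int *lock)
{
    atomic_store_explicit(lock, 0, memory_order_release);
}
```

While the lock is held, waiting processors only read their cached copy, so the invalidation traffic of the plain spin lock is limited to the moments when the lock is released.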




Barrier synchronization
  Shared counter counting the processes reaching the barrier
  Hardwired barrier lines

[Figure: processors P1, P2, P3, ..., Pn with barrier lines b1, b2, b3, ..., bn]
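The shared-counter variant can be sketched as a sense-reversing barrier with C11 atomics. This is one common software formulation, assumed here for illustration; the struct and function names are hypothetical, and the sketch uses the default sequentially consistent atomics for simplicity.

```c
#include <stdatomic.h>

struct barrier {
    atomic_int count;  /* processes that have arrived in the current episode */
    atomic_int sense;  /* flips each time the barrier opens */
    int n;             /* number of participating processes */
};

void barrier_init(struct barrier *b, int n)
{
    atomic_init(&b->count, 0);
    atomic_init(&b->sense, 0);
    b->n = n;
}

void barrier_wait(struct barrier *b)
{
    /* The value the sense flag will take once this episode completes. */
    int my_sense = !atomic_load(&b->sense);

    if (atomic_fetch_add(&b->count, 1) == b->n - 1) {
        atomic_store(&b->count, 0);         /* last arrival resets the counter */
        atomic_store(&b->sense, my_sense);  /* and releases the waiting processes */
    } else {
        while (atomic_load(&b->sense) != my_sense)
            ;                               /* spin until released */
    }
}
```

Each process calls barrier_wait on the same struct barrier; flipping the sense flag allows the barrier to be reused immediately for the next episode.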



Disk storage considerations

To increase fault tolerance and performance of disk storage: RAID (Redundant Array of Inexpensive Disks)

Data is striped over disks

Parallel disk access is possible (important for SMPs)

7 RAID levels, each with a different scheme to provide redundancy

RAID levels 1-5 survive one disk crash, level 6 survives two



RAID (cont’d)

RAID level 0: no redundancy

RAID level 1: Mirroring
  Requires twice the number of disks
  Small recovery time

RAID level 3: Bit-interleaved Parity
  One redundant disk containing parity information
  All reads and writes go to all disks → no parallel disk access

RAID level 5: Block-interleaved Distributed Parity
  One disk's worth of redundant parity information, distributed over all disks
  Reads go to one disk, writes need reads to all disks
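Parity-based redundancy rests on XOR: the parity block is the XOR of the data blocks in a stripe, so any single lost block is the XOR of all the others. The sketch below shows this for byte blocks; the function names and the in-memory representation of a stripe are assumptions for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Parity block = bytewise XOR of the data blocks of one stripe. */
void raid_parity(const uint8_t *const blocks[], size_t nblocks,
                 size_t block_len, uint8_t *parity)
{
    for (size_t i = 0; i < block_len; i++) {
        uint8_t p = 0;
        for (size_t d = 0; d < nblocks; d++)
            p ^= blocks[d][i];
        parity[i] = p;
    }
}

/* Rebuild the block of the failed disk from the survivors and the parity. */
void raid_rebuild(const uint8_t *const blocks[], size_t nblocks, size_t failed,
                  size_t block_len, const uint8_t *parity, uint8_t *out)
{
    for (size_t i = 0; i < block_len; i++) {
        uint8_t v = parity[i];
        for (size_t d = 0; d < nblocks; d++)
            if (d != failed)
                v ^= blocks[d][i];  /* the failed disk's block is never read */
        out[i] = v;
    }
}
```

The same XOR relation explains the cost of writes: changing one data block requires the parity block of its stripe to be recomputed.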



RAID (cont’d)

Block-interleaved Parity (level 4) versus Block-interleaved Distributed Parity (level 5)

Parity disk in RAID level 4 forms bottleneck

[Figure: layout of data blocks 0-15 and parity blocks P0-P3 over five disks. Block-interleaved (RAID 4): all parity blocks are on one dedicated disk. Distributed block-interleaved (RAID 5): the parity blocks are spread over the disks]



Real multiprocessors: SGI Origin 2000

CC-NUMA architecture



The SGI Origin 2000 (cont’d)



The Cray T3D

NUMA architecture



The Cray T3D (cont’d)



The Cray MTA

A UMA MultiThreaded Architecture (MTA): 128 hardware threads per processor

[Figure: processors (max 256), I/O processors (max 256), I/O caches (max 256) and memories (max 512) connected by a 3D toroidal mesh (16x16x16)]



The Cray MTA (cont’d)

[Figure: thread context consisting of SSW, T0-T7, R0-R31 and PC; 128 copies]

Each thread has its own context: 128 * 32 = 4K GPRs

At every instruction a new thread may be scheduled

No data caches
  Latency hiding by thread scheduling
  No cache coherency problem!

Architecture fully pipelined → enough runnable threads avoid bubbles in the pipeline




[Figure: Cray MTA pipeline, with blocks labelled instruction fetch, issue pool, A, M and C, register write, memory pool, retry pool and write pool]




Lookahead in instructions indicating the number of succeeding, independent instructions

LIW (Large Instruction Word) instructions containing 3 operations (1 arithmetic, 1 memory and 1 branch/simple arithmetic)

Tagged memory
  Setting traps on memory locations
  Forwarding (invisible indirection)
  Synchronization (full/empty bit), e.g. a read does not complete until the full bit is set



Future directions: what’s next?

Who can tell?
  Super-speculative processors
  Trace/Multiscalar processors
  Simultaneous MultiThreaded processors (started: Pentium 4)
  I(ntelligent)RAMs
  Reconfigurable (co-)processors
  Single-chip multiprocessors (started: POWER 4)



What’s next? (cont’d)

Instruction execution generations

First generation

Pipelining, second generation

Superscalar pipelining, third generation

Fourth generation (?)



Trace processors

Traces (consisting of multiple basic blocks) are the basic unit for fetching and execution

Traces are constructed dynamically

Processor contains multiple superscalar processing cores to execute traces in parallel



Trace processors (cont’d)

Rely heavily on speculative execution
  Next-trace prediction
  Branch prediction to construct traces
  Data value prediction to "remove" RAW dependencies between traces

Mispredictions may be painful

Trace cache stores whole traces as basic elements (located between I$ and decoder)



Typical architecture of a trace processor

[Figure: block diagram with the components branch prediction, trace construction, instruction preprocessing, trace cache, next-trace prediction, data-value prediction, global registers, and superscalar processing elements 0 to n, each with an instruction buffer, functional units and local registers]



Value prediction

Classification of speculative execution techniques

Speculative execution
  Control speculation
    Branch direction (binary)
    Branch target (multi-valued)
  Data speculation
    Data location
      Aliased (binary)
      Address (multi-valued)
    Data value (multi-valued)



Value prediction (cont’d)

Exploit value locality to reduce data-flow restrictions

Value caching
  Common subexpression elimination in hardware

Predicting values (needs verification at commit stage)
  Last value predictors
  Stride predictors
  Context based predictors: next value based on a number of preceding values (sort of Markov chain)
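The first two predictor types can be sketched with a single table entry per instruction. This is an illustrative model only: the struct stride_pred and the function names are hypothetical, and real predictors add confidence counters and tagging that are not shown.

```c
/* Hypothetical predictor entry for one instruction. */
struct stride_pred {
    long last;    /* last value the instruction produced */
    long stride;  /* difference between its last two values */
};

/* Last-value prediction: the next value equals the previous one. */
long predict_last(const struct stride_pred *p)
{
    return p->last;
}

/* Stride prediction: the next value continues the arithmetic sequence. */
long predict_stride(const struct stride_pred *p)
{
    return p->last + p->stride;
}

/* Called when the real value is known (the verification at commit).
   Returns 1 if the stride prediction was correct, then trains the entry. */
int stride_update(struct stride_pred *p, long actual)
{
    int correct = (predict_stride(p) == actual);

    p->stride = actual - p->last;
    p->last = actual;
    return correct;
}
```

For a sequence such as 4, 8, 12 (for example a loop induction variable), the stride predictor is correct from the second value on, while a last-value predictor would mispredict every time.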



Multiscalar processors

Compiler recognizes tasks in Control Flow Graph of programs
  Consist of multiple BBs: similar to traces, but may contain both taken/untaken paths of internal branches
  Sequential relationship between tasks

Tasks are executed in parallel while preserving loose sequential order

Data dependencies between tasks (compiler) are explicitly communicated via a unidirectional communication ring network

Sequencer schedules tasks to computing elements and performs next-task prediction



Multiscalar processors (cont’d)

Recognition of tasks in a CFG

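[Figure: a control flow graph with basic blocks A-E is partitioned into tasks (Task A, Task B, Task D, Task E), which are assigned to processing elements PE0-PE3; data values are passed between the processing elements]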



Multiscalar processors (cont’d)

A possible multiscalar microarchitecture

[Figure: a sequencer with head and tail pointers, several processing units (each with an I$, a processing element and a register file), and an interconnect to the data banks]



Simultaneous MultiThreading (SMT)

Multiple hardware contexts, one for each thread
  Multiple program counters, register sets, etc.

Traditional fine-grained (vertical) multithreading allows a thread to be scheduled (issuing its instructions) each cycle (e.g. Cray MTA and Sun MAJC)
  Provides latency hiding
  Resources wasted when a thread does not have a lot of ILP
  Potential waste of resources also holds for on-chip multiprocessors



Simultaneous MultiThreading (SMT) (cont’d)

SMT allows issuing instructions from all threads at each cycle

Provides latency hiding + improved ILP (better utilization of resources)

Shown to be a rather straightforward extension of normal superscalar architectures, but
  Instruction fetch unit should fetch instructions from multiple PCs (calls for restrictions)
  Requires a much larger register file (deeper pipeline/lower clock speed)




[Figure: issue slots (horizontal) over time (vertical) for a superscalar processor, fine-grained (vertical) multithreading and simultaneous multithreading, with the slots filled by instructions from different threads]
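The difference in issue-slot utilization can be estimated with a toy model. It assumes that each thread can supply a fixed number of independent instructions per cycle and ignores stalls, fetch limits and resource conflicts; the function names and the ready[] array are hypothetical.

```c
static int min_int(int a, int b)
{
    return a < b ? a : b;
}

/* Fine-grained multithreading: one thread per cycle, chosen round-robin.
   ready[t] = instructions thread t can issue per cycle; width = issue slots. */
int issued_fine_grained(const int ready[], int nthreads, int width, int cycles)
{
    int issued = 0;

    for (int c = 0; c < cycles; c++)
        issued += min_int(width, ready[c % nthreads]);
    return issued;
}

/* SMT: the slots of one cycle may be filled from all threads together. */
int issued_smt(const int ready[], int nthreads, int width, int cycles)
{
    int per_cycle = 0;

    for (int t = 0; t < nthreads; t++)
        per_cycle += ready[t];
    return cycles * min_int(width, per_cycle);
}
```

With four threads of low ILP on a 4-wide machine, the fine-grained scheme leaves most slots of each cycle empty, whereas SMT can fill them from the other threads.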




The sharing of resources may have some negative effects
  Branch prediction interference
  Interthread cache interference
  Increased memory traffic

However, most of these negative effects are hidden because of the multithreading



Intelligent RAM

Addressing the CPU-memory performance gap (widening for increasingly aggressive superscalar-like architectures)

Currently, about 60 to 70% of the die area is used for caches and other memory latency hiding hardware

IRAM solution: integrate processor logic with the DRAM

DRAMs become large enough to store programs and data sets on a single chip



IRAM (cont’d)

Some potential advantages

High internal memory bandwidth (a potential 50X to 100X increase)

Lower memory latency (a potential 5X to 10X decrease)

More energy efficient (fewer or no accesses to a high-capacitance off-chip bus and DRAM consumes less energy than SRAM)



IRAM (cont’d)

Some potential disadvantages

Larger area and lower speed of logic in a DRAM process

Multiplexed I/O lines in DRAM should be avoided: increase of area, power and costs

Retention time of DRAM dependent on temperature: refresh rates could rise dramatically



IRAM (cont’d)

Because of the high bandwidth and low latency, IRAMs are very well suited for vector processing

[Figure: a CPU with caches, I/O and a vector unit, connected through memory crossbars to DRAM memory]

