CSE502: Computer Architecture CSE 502: Computer Architecture Review

Upload: tyshawn-abbitt

Post on 30-Mar-2015


Page 1: CSE502: Computer Architecture Review. CSE502: Computer Architecture Course Overview (1/2) Caveat 1: Im (kind of) new here. Caveat 2: This is a (somewhat)

CSE502: Computer Architecture

CSE 502: Computer Architecture

Review

Course Overview (1/2)
• Caveat 1: I’m (kind of) new here.
• Caveat 2: This is a (somewhat) new course.
• Computer Architecture is … the science and art of selecting and interconnecting hardware and software components to create computers …

Course Overview (2/2)
• This course is hard, roughly like CSE 506
  – In CSE 506, you learn what’s inside an OS
  – In CSE 502, you learn what’s inside a CPU
• This is a project course
  – Learn why things are the way they are, first hand
  – We will “build” emulators of CPU components

Policy and Projects
• Probably different from other classes
  – Much more open, but much more strict
    • Most people followed the policy
    • Some did not
  – Resembles the “real world”
    • You’re here because you want to learn and to be here
    • If you managed to get your partner(s) to do the work
      – You’re probably good enough to do it at your job too
        » The good: You might make a good manager
        » The bad: You didn’t learn much
• Time mgmt. often more important than tech. skill
  – If you started early, you probably have an A already

Amdahl’s Law

$\text{Speedup} = \text{time}_{\text{without enhancement}} / \text{time}_{\text{with enhancement}}$

An enhancement speeds up fraction f of a task by factor S:

$\text{time}_{new} = \text{time}_{orig} \cdot \left( (1-f) + f/S \right)$

$S_{overall} = \frac{1}{(1-f) + f/S}$

[Figure: time_orig (normalized to 1) split into (1 - f) and f; time_new split into (1 - f) and f/S]
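The formula above can be checked numerically with a minimal Python sketch. The function name `amdahl_speedup` is a hypothetical helper chosen for illustration; it is not from the slides.

```python
def amdahl_speedup(f: float, s: float) -> float:
    """Overall speedup when fraction f of the original time is sped up by factor s."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be in [0, 1]")
    if s <= 0.0:
        raise ValueError("s must be positive")
    # S_overall = 1 / ((1 - f) + f / S)
    return 1.0 / ((1.0 - f) + f / s)


if __name__ == "__main__":
    # Half of the task runs twice as fast.
    print(amdahl_speedup(0.5, 2.0))
    # Even a huge S cannot beat the 1 / (1 - f) ceiling.
    print(amdahl_speedup(0.9, 1e9))
```

The second call illustrates why the unenhanced fraction (1 - f) bounds the achievable speedup.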

The Iron Law of Processor Performance

$\frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}}$

• Instructions/Program: Total Work In Program (Algorithms, Compilers, ISA Extensions)
• Cycles/Instruction: CPI or 1/IPC (Microarchitecture)
• Time/Cycle: 1/f (frequency) (Microarchitecture, Process Tech)

Architects target CPI, but must understand the others
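As a rough illustration of the three Iron Law terms, the sketch below multiplies them out in Python. The function name and the sample numbers are assumptions made for the example, not values from the course.

```python
def execution_time(instructions: int, cpi: float, freq_hz: float) -> float:
    """Time/Program = Instructions/Program x Cycles/Instruction x Time/Cycle."""
    time_per_cycle = 1.0 / freq_hz  # seconds per cycle
    return instructions * cpi * time_per_cycle


if __name__ == "__main__":
    # Hypothetical program: 1e9 instructions on a 1 GHz clock.
    base = execution_time(1_000_000_000, 1.0, 1e9)
    # Doubling CPI at the same instruction count and clock doubles the time.
    slower = execution_time(1_000_000_000, 2.0, 1e9)
    print(base, slower)
```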

Averaging Performance Numbers (2/2)
• Arithmetic: times
  – proportional to time
  – e.g., latency

  $\frac{1}{n} \sum_{i=1}^{n} \text{Time}_i$

• Harmonic: rates
  – inversely proportional to time
  – e.g., throughput

  $\frac{n}{\sum_{i=1}^{n} \frac{1}{\text{Rate}_i}}$

• Geometric: ratios
  – unit-less quantities
  – e.g., speedups

  $\sqrt[n]{\prod_{i=1}^{n} \text{Ratio}_i}$
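The three means above can be sketched in a few lines of Python. The function names are illustrative assumptions; the geometric mean is computed through logarithms, which is one common way to avoid overflow on long products.

```python
import math
from typing import Sequence


def arithmetic_mean(times: Sequence[float]) -> float:
    """For quantities proportional to time (e.g., latency)."""
    return sum(times) / len(times)


def harmonic_mean(rates: Sequence[float]) -> float:
    """For quantities inversely proportional to time (e.g., throughput)."""
    return len(rates) / sum(1.0 / r for r in rates)


def geometric_mean(ratios: Sequence[float]) -> float:
    """For unit-less ratios (e.g., speedups); n-th root of the product."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


if __name__ == "__main__":
    print(arithmetic_mean([1.0, 2.0, 3.0]))
    print(harmonic_mean([2.0, 6.0]))
    print(geometric_mean([2.0, 8.0]))
```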

Power vs. Energy
• Power: instantaneous rate of energy transfer
  – Expressed in Watts
  – In Architecture, implies conversion of electricity to heat
  – Power(Comp1+Comp2) = Power(Comp1) + Power(Comp2)
• Energy: measure of using power for some time
  – Expressed in Joules
  – power * time (joules = watts * seconds)
  – Energy(OP1+OP2) = Energy(OP1) + Energy(OP2)

What uses power in a chip?

ISA: A contract between HW and SW
• ISA: Instruction Set Architecture
  – A well-defined hardware/software interface
• The “contract” between software and hardware
  – Functional definition of operations supported by hardware
  – Precise description of how to invoke all features
• No guarantees regarding
  – How operations are implemented
  – Which operations are fast and which are slow (and when)
  – Which operations take more energy (and which take less)

Components of an ISA
• Programmer-visible states
  – Program counter, general purpose registers, memory, control registers
• Programmer-visible behaviors
  – What to do, when to do it
• A binary encoding

Example “register-transfer-level” description of an instruction:
  if imem[pc] == “add rd, rs, rt”
  then pc ← pc+1
       gpr[rd] ← gpr[rs] + gpr[rt]

ISAs last forever, don’t add stuff you don’t need

Locality Principle
• Recent past is a good indication of near future

Spatial Locality: If you looked something up, it is very likely you will look up something nearby soon

Temporal Locality: If you looked something up, it is very likely that you will look it up again soon

Caches
• An automatically managed hierarchy
• Break memory into blocks (several bytes) and transfer data to/from cache in blocks
  – spatial locality
• Keep recently accessed blocks
  – temporal locality

[Figure: Core, $ (cache), Memory]

Fully-Associative Cache
• Keep blocks in cache frames
  – data
  – state (e.g., valid)
  – address tag

address: tag[63:6] | block offset[5:0]

[Figure: every frame holds state, tag, and data; the address tag is compared (=) against all frame tags in parallel, and a multiplexor selects the data of the matching frame; output: hit?]

What happens when the cache runs out of space?

The 3 C’s of Cache Misses
• Compulsory: Never accessed before
• Capacity: Accessed long ago and already replaced
• Conflict: Neither compulsory nor capacity
• Coherence: (In multi-cores, become owner to write)

Cache Size
• Cache size is data capacity (don’t count tag and state)
  – Bigger can exploit temporal locality better
  – Not always better
• Too large a cache
  – Smaller is faster ⇒ bigger is slower
  – Access time may hurt critical path
• Too small a cache
  – Limited temporal locality
  – Useful data constantly replaced

[Figure: hit rate vs. capacity, with the working set size marked]

Block Size
• Block size is the data that is
  – Associated with an address tag
  – Not necessarily the unit of transfer between hierarchies
• Too small a block
  – Don’t exploit spatial locality well
  – Excessive tag overhead
• Too large a block
  – Useless data transferred
  – Too few total blocks
    • Useful data frequently replaced

[Figure: hit rate vs. block size]

Direct-Mapped Cache
• Use middle bits as index
• Only one tag comparison

address: tag[63:16] | index[15:6] | block offset[5:0]

[Figure: a decoder uses the index to select one frame (state, tag, data); the stored tag is compared (=) with the address tag to produce tag match (hit?); a multiplexor selects the data]
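The address split for the direct-mapped cache (tag[63:16], index[15:6], block offset[5:0]) can be sketched as follows. This is a minimal model under stated assumptions: the class and function names are hypothetical, only valid bits and tags are tracked (no data), and a miss simply replaces the frame.

```python
OFFSET_BITS = 6   # block offset[5:0]  -> 64-byte blocks
INDEX_BITS = 10   # index[15:6]        -> 1024 frames


def split_address(addr: int) -> tuple[int, int, int]:
    """Return (tag, index, offset) for a 64-bit address."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


class DirectMappedCache:
    """Tracks only state (valid) and tag per frame; data is omitted."""

    def __init__(self) -> None:
        self.valid = [False] * (1 << INDEX_BITS)
        self.tags = [0] * (1 << INDEX_BITS)

    def access(self, addr: int) -> bool:
        """Return True on a hit; on a miss, install the block."""
        tag, index, _ = split_address(addr)
        hit = self.valid[index] and self.tags[index] == tag  # one tag comparison
        if not hit:
            self.valid[index] = True
            self.tags[index] = tag
        return hit


if __name__ == "__main__":
    cache = DirectMappedCache()
    print(cache.access(0x1000))              # first touch
    print(cache.access(0x1004))              # same block
    print(cache.access(0x1000 + (1 << 16)))  # same index, different tag
```

Two addresses that differ only above bit 15 map to the same frame, which is the source of conflict misses in this organization.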

N-Way Set-Associative Cache

address: tag[63:15] | index[14:6] | block offset[5:0]

[Figure: N ways, each with its own decoder, state/tag/data arrays, tag comparator (=), and multiplexor; the index selects a set (one frame per way), and a final multiplexor picks the way that hit; output: hit?]

Note the additional bit(s) moved from index to tag

Associativity
• Larger associativity
  – lower miss rate (fewer conflicts)
  – higher power consumption
• Smaller associativity
  – lower cost
  – faster hit time

[Figure: hit rate vs. associativity, annotated “~5 for L1-D”]

Parallel vs Serial Caches
• Tag and Data usually separate (tag is smaller & faster)
  – State bits stored along with tags
    • Valid bit, “LRU” bit(s), …
• Parallel access to Tag and Data reduces latency (good for L1)
• Serial access to Tag and Data reduces power (good for L2+)

[Figure: the parallel design reads the data while the tags are compared (= = = =, valid?, hit?); the serial design compares tags first and uses the result as an enable for the data array]

Physically-Indexed Caches
• Core requests are VAs
• Cache index is PA[15:6]
  – VA passes through TLB
  – D-TLB on critical path
• Cache tag is PA[63:16]
• If index size < page size
  – Can use VA for index

Virtual Address: tag[63:14] | index[13:6] | block offset[5:0]
                 virtual page[63:13] | page offset[12:0]

[Figure: index[6:0] is taken from the page offset; the D-TLB supplies physical tag[51:1] and physical index[0:0]; these form physical index[7:0], and the physical tag is compared (= = = =) against each way]

Virtually-Indexed Caches
• Core requests are VAs
• Cache index is VA[15:6]
• Cache tag is PA[63:16]
• Why not tag with VA?
  – Cache flush on ctx switch
• Virtual aliases
  – Ensure they don’t exist
  – … or check all on miss

Virtual Address: tag[63:14] | index[13:6] | block offset[5:0]
                 virtual page[63:13] | page offset[12:0]

[Figure: virtual index[7:0] goes directly to the cache while the D-TLB produces physical tag[51:0], which is compared (= = = =) against each way; one bit overlaps]

Inclusion
• Core often accesses blocks not present on chip
  – Should block be allocated in L3, L2, and L1?
    • Called Inclusive caches
    • Waste of space
    • Requires forced evict (e.g., force evict from L1 on evict from L2+)
  – Only allocate blocks in L1
    • Called Non-inclusive caches (why not “exclusive”?)
    • Must write back clean lines
• Some processors combine both
  – L3 is inclusive of L1 and L2
  – L2 is non-inclusive of L1 (like a large victim cache)

Parity & ECC
• Cosmic radiation can strike at any time
  – Especially at high altitude
  – Or during solar flares
• What can be done?
  – Parity
    • 1 bit to indicate if sum is odd/even (detects single-bit errors)
  – Error Correcting Codes (ECC)
    • 8 bit code per 64-bit word
    • Generally SECDED (Single-Error-Correct, Double-Error-Detect)
• Detecting errors on clean cache lines is harmless
  – Pretend it’s a cache miss

Example bits: 0 1 0 1 1 0 0 1 0 1 1 0 0 1 0 1
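A single parity bit can be sketched in Python using the 16 bits shown on the slide. The helper names are hypothetical, and even parity is an assumption (the slide only says the bit indicates whether the sum is odd or even).

```python
from typing import Sequence


def parity_bit(bits: Sequence[int]) -> int:
    """Even parity: 1 if the number of set bits is odd, else 0."""
    return sum(bits) % 2


def parity_ok(bits: Sequence[int], stored_parity: int) -> bool:
    """True if the stored parity still matches the data bits."""
    return parity_bit(bits) == stored_parity


if __name__ == "__main__":
    word = [0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1]
    p = parity_bit(word)

    one_flip = word.copy()
    one_flip[3] ^= 1            # a single-bit upset is detected
    print(parity_ok(one_flip, p))

    two_flips = one_flip.copy()
    two_flips[7] ^= 1           # a second upset cancels out and goes unnoticed
    print(parity_ok(two_flips, p))
```

The second case shows why parity alone only detects single-bit errors, which is the gap SECDED codes close.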

SRAM vs. DRAM
• SRAM = Static RAM
  – As long as power is present, data is retained
• DRAM = Dynamic RAM
  – If you don’t do anything, you lose the data
• SRAM: 6T per bit
  – built with normal high-speed CMOS technology
• DRAM: 1T per bit (+1 capacitor)
  – built with special DRAM process optimized for density

DRAM Chip Organization
• Low-Level organization is very similar to SRAM
• Cells are only single-ended
  – Reads destructive: contents are erased by reading
• Row buffer holds read data
  – Data in row buffer is called a DRAM row
    • Often called “page” - not necessarily same as OS page
  – Read gets entire row into the buffer
  – Block reads always performed out of the row buffer
    • Reading a whole row, but accessing one block
    • Similar to reading a cache line, but accessing one word

DRAM Organization

[Figure: Dual-rank x8 (2Rx8) DIMM; each Rank is a row of x8 DRAM chips, and each x8 DRAM chip contains several Banks]

• x8 means each DRAM outputs 8 bits, need 8 chips for DDRx (64-bit)
• Why 9 chips per rank? 64 bits data, 8 bits ECC
• All banks within the rank share all address and control pins
• All banks are independent, but can only talk to one bank at a time

AMAT with MLP
• If …
  – cache hit is 10 cycles (core to L1 and back)
  – memory access is 100 cycles (core to mem and back)
• Then …
  – at 50% miss ratio, avg. access: 0.5×10 + 0.5×100 = 55
• Unless MLP is >1.0, then …
  – at 50% mr, 1.5 MLP, avg. access: (0.5×10 + 0.5×100)/1.5 ≈ 37
  – at 50% mr, 4.0 MLP, avg. access: (0.5×10 + 0.5×100)/4.0 ≈ 14

In many cases, MLP dictates performance
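The arithmetic on this slide can be reproduced with a small Python sketch. The function name `amat` and its parameter names are assumptions for illustration; the model simply divides the serial average by the MLP, as the slide does.

```python
def amat(hit_cycles: float, miss_cycles: float, miss_ratio: float,
         mlp: float = 1.0) -> float:
    """Average memory access time, scaled by memory-level parallelism."""
    serial = (1.0 - miss_ratio) * hit_cycles + miss_ratio * miss_cycles
    return serial / mlp


if __name__ == "__main__":
    print(amat(10, 100, 0.5))            # no overlap between accesses
    print(amat(10, 100, 0.5, mlp=1.5))   # some overlap
    print(amat(10, 100, 0.5, mlp=4.0))   # heavy overlap
```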

Memory Controller (1/2)

[Figure: Memory Controller block diagram; Read Queue, Write Queue, and Response Queue connect To/From CPU; a Scheduler and Buffer send Commands and Data to Channel 0 and Channel 1]

Memory Controller (2/2)
• Memory controller connects CPU and DRAM
• Receives requests after cache misses in LLC
  – Possibly originating from multiple cores
• Complicated piece of hardware, handles:
  – DRAM Refresh
  – Row-Buffer Management Policies
  – Address Mapping Schemes
  – Request Scheduling

Address Mapping Schemes
• Example Open-page Mapping Scheme:
  – High Parallelism: [row rank bank column channel offset]
  – Easy Expandability: [channel rank row bank column offset]
• Example Close-page Mapping Scheme:
  – High Parallelism: [row column rank bank channel offset]
  – Easy Expandability: [channel rank row column bank offset]
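To make the field orderings concrete, here is a minimal Python sketch that slices a physical address according to a layout listed from most-significant to least-significant field. The field widths below are hypothetical; the slides give only the ordering.

```python
from typing import Dict, List, Tuple

Layout = List[Tuple[str, int]]  # (field name, width in bits), MSB first

# Open-page, high-parallelism ordering from the slide; widths are assumed.
OPEN_PAGE_HIGH_PARALLELISM: Layout = [
    ("row", 14), ("rank", 1), ("bank", 3),
    ("column", 7), ("channel", 1), ("offset", 6),
]


def decode(addr: int, layout: Layout) -> Dict[str, int]:
    """Split addr into named fields, consuming bits from the LSB upward."""
    fields: Dict[str, int] = {}
    for name, bits in reversed(layout):
        fields[name] = addr & ((1 << bits) - 1)
        addr >>= bits
    return fields


if __name__ == "__main__":
    # Consecutive 64-byte blocks alternate between channels in this layout.
    for block in range(4):
        print(block, decode(block * 64, OPEN_PAGE_HIGH_PARALLELISM))
```

Placing the channel bits just above the block offset spreads adjacent blocks across channels, which is one reading of why that ordering is labeled high parallelism.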

Memory Request Scheduling
• Write buffering
  – Writes can wait until reads are done
• Queue DRAM commands
  – Usually into per-bank queues
  – Allows easily reordering ops. meant for same bank
• Common policies:
  – First-Come-First-Served (FCFS)
  – First-Ready, First-Come-First-Served (FR-FCFS)
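The difference between the two policies can be sketched in Python. This is a simplified model under stated assumptions: a request counts as "ready" when it targets the row currently open in its bank, and the `Request` type, `open_rows` mapping, and function names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    arrival: int  # arrival order (smaller is older)
    bank: int
    row: int


def fcfs(queue: List[Request], open_rows: Dict[int, int]) -> Request:
    """Always pick the oldest request."""
    return min(queue, key=lambda r: r.arrival)


def fr_fcfs(queue: List[Request], open_rows: Dict[int, int]) -> Request:
    """Prefer the oldest row-buffer hit; fall back to the oldest request."""
    ready = [r for r in queue if open_rows.get(r.bank) == r.row]
    return min(ready or queue, key=lambda r: r.arrival)


if __name__ == "__main__":
    queue = [Request(0, 0, 5), Request(1, 0, 7), Request(2, 1, 3)]
    open_rows = {0: 7}  # bank 0 currently has row 7 in its row buffer
    print(fcfs(queue, open_rows))
    print(fr_fcfs(queue, open_rows))
```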

Prefetching (1/2)
• Fetch block ahead of demand
• Target compulsory, capacity, (& coherence) misses
  – Not conflict: prefetched block would conflict
• Big challenges:
  – Knowing “what” to fetch
    • Fetching useless blocks wastes resources
  – Knowing “when” to fetch
    • Too early clutters storage (or gets thrown out before use)
    • Fetching too late defeats purpose of “pre”-fetching

Prefetching (2/2)

[Figure: timelines of a Load traversing L1, L2, and DRAM before Data returns]
• Without prefetching: Total Load-to-Use Latency
• With prefetching: Much improved Load-to-Use Latency
• Or: Somewhat improved Latency

Prefetching must be accurate and timely

Next-Line (or Adjacent-Line) Prefetching
• On request for line X, prefetch X+1 (or X^0x1)
  – Assumes spatial locality
    • Often a good assumption
  – Should stop at physical (OS) page boundaries
• Can often be done efficiently
  – Adjacent-line is convenient when next-level block is bigger
  – Prefetch from DRAM can use bursts and row-buffer hits
• Works for I$ and D$
  – Instructions execute sequentially
  – Large data structures often span multiple blocks

Simple, but usually not timely

Next-N-Line Prefetching
• On request for line X, prefetch X+1, X+2, …, X+N
  – N is called “prefetch depth” or “prefetch degree”
• Must carefully tune depth N. Large N is …
  – More likely to be useful (correct and timely)
  – More aggressive ⇒ more likely to make a mistake
    • Might evict something useful
  – More expensive ⇒ need storage for prefetched lines
    • Might delay useful request on interconnect or port

Still simple, but more timely than Next-Line
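A Next-N-Line prefetcher can be sketched as a function that lists the candidate line addresses. The 64-byte line and 4 KiB page sizes are assumptions, the function name is hypothetical, and the page-boundary cutoff follows the rule stated for Next-Line prefetching.

```python
from typing import List


def next_n_lines(addr: int, n: int, line: int = 64, page: int = 4096) -> List[int]:
    """Addresses of lines X+1 .. X+N, dropping any that cross the page boundary."""
    base = addr - addr % line                 # start of line X
    page_end = (addr // page + 1) * page      # first byte of the next page
    candidates = (base + i * line for i in range(1, n + 1))
    return [a for a in candidates if a < page_end]


if __name__ == "__main__":
    print(next_n_lines(0, 2))           # two lines ahead, same page
    print(next_n_lines(4096 - 64, 4))   # last line of a page: nothing to prefetch
```

With `n=1` this degenerates to plain Next-Line prefetching.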

Stride Prefetching
• Access patterns often follow a stride
  – Accessing column of elements in a matrix
  – Accessing elements in array of structs
• Detect stride S, prefetch depth N
  – Prefetch X+1∙S, X+2∙S, …, X+N∙S

[Figure: strided accesses to a column in a matrix and to elements in an array of structs]

“Localized” Stride Prefetchers
• Store PC, last address, last stride, and count in RPT
• On access, check RPT (Reference Prediction Table)
  – Same stride? count++ if yes, count-- or count=0 if no
  – If count is high, prefetch (last address + stride*N)

Example instructions:
  PCa: 0x409A34  Load R1 = [R2]
  PCb: 0x409A38  Load R3 = [R4]
  PCc: 0x409A40  Store [R6] = R5

RPT contents:
  Tag     Last Addr   Stride   Count
  0x409   A+3N        N        2
  0x409   X+3N        N        2
  0x409   Y+2N        N        1

If confident about the stride (count > Cmin), prefetch (A+4N)
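The RPT update rule can be sketched in Python. Several details are assumptions made for the example: the table is keyed by the full PC rather than a partial tag, a mismatched stride resets the count to zero (one of the two options the slide allows), and the class and parameter names are hypothetical.

```python
from typing import Dict, List, Optional


class StridePrefetcher:
    def __init__(self, depth: int = 1, c_min: int = 1) -> None:
        self.depth = depth          # N in "last address + stride*N"
        self.c_min = c_min          # confidence threshold Cmin
        self.rpt: Dict[int, List[int]] = {}  # pc -> [last_addr, stride, count]

    def access(self, pc: int, addr: int) -> Optional[int]:
        """Update the RPT entry for pc; return a prefetch address or None."""
        entry = self.rpt.get(pc)
        if entry is None:
            self.rpt[pc] = [addr, 0, 0]
            return None
        last_addr, last_stride, count = entry
        stride = addr - last_addr
        if stride != 0 and stride == last_stride:
            count += 1              # same stride: gain confidence
        else:
            count = 0               # different stride: reset confidence
        self.rpt[pc] = [addr, stride, count]
        if count > self.c_min:
            return addr + stride * self.depth
        return None


if __name__ == "__main__":
    pf = StridePrefetcher(depth=1, c_min=1)
    for i in range(4):
        print(pf.access(0x409A34, 1000 + 64 * i))
```

After accesses at A, A+N, A+2N, and A+3N the entry holds (A+3N, N, 2), matching the first row of the table, and the sketch issues a prefetch for A+4N.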

Evaluating Prefetchers
• Compare against larger caches
  – Complex prefetcher vs. simple prefetcher with larger cache
• Primary metrics
  – Coverage: prefetched hits / base misses
  – Accuracy: prefetched hits / total prefetches
  – Timeliness: latency of prefetched blocks / hit latency
• Secondary metrics
  – Pollution: misses / (prefetched hits + base misses)
  – Bandwidth: (total prefetches + misses) / base misses
  – Power, Energy, Area...
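The ratio metrics above translate directly into code. The sketch below uses hypothetical function names and made-up counter values; the bandwidth ratio assumes the numerator is the sum of prefetches and misses.

```python
def coverage(prefetched_hits: int, base_misses: int) -> float:
    """Fraction of baseline misses that the prefetcher turned into hits."""
    return prefetched_hits / base_misses


def accuracy(prefetched_hits: int, total_prefetches: int) -> float:
    """Fraction of issued prefetches that were actually used."""
    return prefetched_hits / total_prefetches


def pollution(misses: int, prefetched_hits: int, base_misses: int) -> float:
    return misses / (prefetched_hits + base_misses)


def bandwidth(total_prefetches: int, misses: int, base_misses: int) -> float:
    """Memory traffic relative to the no-prefetch baseline."""
    return (total_prefetches + misses) / base_misses


if __name__ == "__main__":
    # Hypothetical counters from a simulation run.
    print(coverage(40, 100), accuracy(40, 80), bandwidth(80, 60, 100))
```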

Before there was pipelining…
• Single-cycle control: hardwired
  – Low CPI (1)
  – Long clock period (to accommodate slowest instruction)
• Multi-cycle control: micro-programmed
  – Short clock period
  – High CPI
• Can we have both low CPI and short clock period?

Timelines (time →):
  Single-cycle: insn0.(fetch,decode,exec) | insn1.(fetch,decode,exec)
  Multi-cycle:  insn0.fetch | insn0.dec | insn0.exec | insn1.fetch | insn1.dec | insn1.exec

Pipelining
• Start with multi-cycle design
• When insn0 goes from stage 1 to stage 2 … insn1 starts stage 1
• Each instruction passes through all stages … but instructions enter and leave at faster rate

Timelines (time →):
  Multi-cycle: insn0.fetch | insn0.dec | insn0.exec | insn1.fetch | insn1.dec | insn1.exec
  Pipelined:
    insn0:  fetch   dec     exec
    insn1:          fetch   dec     exec
    insn2:                  fetch   dec     exec

Can have as many insns in flight as there are stages

Instruction Dependencies
• Data Dependence
  – Read-After-Write (RAW) (only true dependence)
    • Read must wait until earlier write finishes
  – Anti-Dependence (WAR)
    • Write must wait until earlier read finishes (avoid clobbering)
  – Output Dependence (WAW)
    • Earlier write can’t overwrite later write
• Control Dependence (a.k.a. Procedural Dependence)
  – Branch condition must execute before branch target
  – Instructions after branch cannot run before branch
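The three data dependences can be identified mechanically from register names. In this Python sketch an instruction is modeled as a hypothetical `(dest, sources)` pair; the representation and function name are assumptions, and memory and control dependences are not covered.

```python
from typing import Optional, Set, Tuple

Insn = Tuple[Optional[str], Set[str]]  # (destination register, source registers)


def data_dependences(earlier: Insn, later: Insn) -> Set[str]:
    """Return which of RAW / WAR / WAW the later instruction has on the earlier one."""
    e_dst, e_src = earlier
    l_dst, l_src = later
    found: Set[str] = set()
    if e_dst is not None and e_dst in l_src:
        found.add("RAW")   # later reads what earlier writes (true dependence)
    if l_dst is not None and l_dst in e_src:
        found.add("WAR")   # later writes what earlier reads (anti-dependence)
    if e_dst is not None and e_dst == l_dst:
        found.add("WAW")   # both write the same register (output dependence)
    return found


if __name__ == "__main__":
    add = ("r1", {"r2", "r3"})   # r1 = r2 + r3
    sub = ("r4", {"r1", "r5"})   # r4 = r1 - r5
    print(data_dependences(add, sub))
```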

Pipeline Terminology
• Pipeline Hazards
  – Potential violations of program dependencies
  – Must ensure program dependencies are not violated
• Hazard Resolution
  – Static method: performed at compile time in software
  – Dynamic method: performed at runtime using hardware
  – Two options: Stall (costs perf.) or Forward (costs hw.)
• Pipeline Interlock
  – Hardware mechanism for dynamic hazard resolution
  – Must detect and enforce dependencies at runtime

Simple 5-stage Pipeline

[Figure: datapath with PC, Inst Cache, Register file (R0 to R7; regA, regB, data, dest), MUXes, adders (PC+1, target), ALU (eq?), and Data Cache, separated by pipeline registers IF/ID, ID/EX, EX/Mem, and Mem/WB that carry instruction, op, dest, offset, valA, valB, PC+1, target, ALU result, and mdata]

Balancing Pipeline Stages

Stage latencies: T_IF&ID = 8 units, T_OF = 9 units, T_EX = 5 units, T_OS = 9 units

• Coarser-Grained Machine Cycle: 4 machine cyc / instruction
  – # stages = 4, T_cyc = 9 units
  – Stages: IF&ID, OF, EX, WB
• Finer-Grained Machine Cycle: 11 machine cyc / instruction
  – # stages = 11, T_cyc = 3 units
  – Stages: IF, IF, ID, OF, OF, OF, EX, EX, WB, WB, WB
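The stage counts on this slide can be reproduced with a short Python sketch, assuming each coarse step is split into ceil(T / T_cyc) sub-stages. The function names are hypothetical.

```python
import math
from typing import Sequence


def stage_count(latencies: Sequence[int], t_cyc: int) -> int:
    """Number of pipeline stages when each step needs ceil(T / t_cyc) cycles."""
    return sum(math.ceil(t / t_cyc) for t in latencies)


def instruction_latency(latencies: Sequence[int], t_cyc: int) -> int:
    """Total time (in units) for one instruction to traverse the pipeline."""
    return stage_count(latencies, t_cyc) * t_cyc


if __name__ == "__main__":
    steps = [8, 9, 5, 9]  # IF&ID, OF, EX, OS from the slide
    for t_cyc in (9, 3):
        print(t_cyc, stage_count(steps, t_cyc), instruction_latency(steps, t_cyc))
```

The finer-grained cycle wastes less time in the short steps, so per-instruction latency drops slightly while throughput improves by the ratio of the cycle times.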

IPC vs. Frequency
• A 10-15% IPC loss is not bad if frequency can double
  – 1000ps stage split into 500ps + 500ps
  – 2.0 IPC at 1GHz = 2 BIPS vs. 1.7 IPC at 2GHz = 3.4 BIPS
• Frequency doesn’t double
  – Latch/pipeline overhead
  – Stage imbalance

[Figure: 900ps splits into 450ps + 450ps; 900ps splits into 350 + 550, giving 1.5GHz]
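The BIPS figures follow from multiplying IPC by clock frequency. The Python sketch below also derives a clock from stage delays; the `latch_ps` overhead parameter is a hypothetical knob added for illustration and is not a value given on the slide.

```python
from typing import Sequence


def bips(ipc: float, freq_ghz: float) -> float:
    """Billions of instructions per second."""
    return ipc * freq_ghz


def clock_ghz(stage_ps: Sequence[float], latch_ps: float = 0.0) -> float:
    """Clock set by the slowest stage plus per-stage latch overhead (assumed)."""
    return 1000.0 / (max(stage_ps) + latch_ps)


if __name__ == "__main__":
    print(bips(2.0, clock_ghz([1000])))        # one 1000ps stage
    print(bips(1.7, clock_ghz([500, 500])))    # ideal split into two stages
    print(bips(1.7, clock_ghz([350, 550])))    # imbalanced split
```

Because the clock is set by the slowest stage, an imbalanced split gives back part of the frequency gain that paid for the IPC loss.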

Architectures for Instruction Parallelism
• Scalar pipeline (baseline)
  – Instruction/overlap parallelism = D
  – Operation Latency = 1
  – Peak IPC = 1.0

[Figure: successive instructions vs. time in cycles (1 to 12); D different instructions overlapped]

Superscalar Machine
• Superscalar (pipelined) Execution
  – Instruction parallelism = D x N
  – Operation Latency = 1
  – Peak IPC = N per cycle

[Figure: successive instructions vs. time in cycles (1 to 12), N instructions issued per cycle; D x N different instructions overlapped]

RISC ISA Format
• Fixed-length
  – MIPS all insts are 32-bits/4 bytes
• Few formats
  – MIPS has 3: R (reg, reg, reg), I (reg, reg, imm), J (addr)
  – Alpha has 5: Operate, Op w/ Imm, Mem, Branch, FP
• Regularity across formats (when possible/practical)
  – MIPS & Alpha opcode in same bit-position for all formats
  – MIPS rs & rt fields in same bit-position for R and I formats
  – Alpha ra/fa field in same bit-position for all 5 formats

Superscalar Decode for RISC ISAs
• Decode X insns. per cycle (e.g., 4-wide)
  – Just duplicate the hardware
  – Instructions aligned at 32-bit boundaries

[Figure: scalar: 1-Fetch delivers one 32-bit inst to one Decoder, producing one decoded inst; superscalar: 4-wide superscalar fetch delivers four 32-bit insts to four Decoders, producing four decoded insts]

CISC ISA
• RISC focus on fast access to information
  – Easy decode, I$, large RF’s, D$
• CISC focus on max expressiveness per min space
  – Designed in era with fewer transistors, chips
  – Each memory access very expensive
    • Pack as much work into as few bytes as possible
• More “expressive” instructions
  – Better potential code generation in theory
  – More complex code generation in practice

ADD in RISC ISA

  Mode       Example           Meaning
  Register   ADD R4, R3, R2    R4 = R3 + R2

ADD in CISC ISA

  Mode                Example            Meaning
  Register            ADD R4, R3         R4 = R4 + R3
  Immediate           ADD R4, #3         R4 = R4 + 3
  Displacement        ADD R4, 100(R1)    R4 = R4 + Mem[100+R1]
  Register Indirect   ADD R4, (R1)       R4 = R4 + Mem[R1]
  Indexed/Base        ADD R3, (R1+R2)    R3 = R3 + Mem[R1+R2]
  Direct/Absolute     ADD R1, (1234)     R1 = R1 + Mem[1234]
  Memory Indirect     ADD R1, @(R3)      R1 = R1 + Mem[Mem[R3]]
  Auto-Increment      ADD R1, (R2)+      R1 = R1 + Mem[R2]; R2++
  Auto-Decrement      ADD R1, -(R2)      R2--; R1 = R1 + Mem[R2]

RISC (MIPS) vs CISC (x86)

RISC (MIPS):
  lui  R1, Disp[31:16]
  ori  R1, R1, Disp[15:0]
  add  R1, R1, R2
  shli R3, R3, 3
  add  R3, R3, R1
  lui  R1, Imm[31:16]
  ori  R1, R1, Imm[15:0]
  st   [R3], R1

CISC (x86):
  MOV [EBX+EAX*8+Disp], Imm

8 insns. at 32 bits each vs 1 insn. at 88 bits: 2.9x!

x86 Encoding
• Basic x86 Instruction:

  Field          Size
  Prefixes       0-4 bytes
  Opcode         1-2 bytes
  Mod R/M        0-1 bytes
  SIB            0-1 bytes
  Displacement   0/1/2/4 bytes
  Immediate      0/1/2/4 bytes

  Shortest Inst: 1 byte
  Longest Inst: 15 bytes

• Opcode has flag indicating Mod R/M is present
  – Most instructions use the Mod R/M byte
  – Mod R/M specifies if optional SIB byte is used
  – Mod R/M and SIB may specify additional constants

Instruction length not known until after decode

Instruction Cache Organization
• To fetch N instructions per cycle...
  – L1-I line must be wide enough for N instructions
• PC register selects L1-I line
• A fetch group is the set of insns. starting at PC
  – For N-wide machine, [PC,PC+N-1]

[Figure: the PC feeds a decoder that selects one Cache Line; each line holds a Tag and four Insts]

Fetch Misalignment
• Now takes two cycles to fetch N instructions

[Figure: cache lines (rows 000 to 111) selected by a decoder, each holding a Tag and four Insts (columns 00, 01, 10, 11)]
  Cycle 1: PC: xxx01001 selects row 010 starting at column 01, delivering three Insts
  Cycle 2: PC: xxx01100 selects row 011 starting at column 00, delivering the remaining Inst
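The two-cycle case can be sketched by counting how many cache lines a fetch group touches. This Python sketch assumes the PC counts instructions (so its low two bits select the column within a four-instruction line) and that one line is read per cycle; the function name is hypothetical.

```python
def fetch_cycles(pc_index: int, width: int = 4, insts_per_line: int = 4) -> int:
    """Cycles needed to fetch `width` instructions starting at pc_index."""
    first_line = pc_index // insts_per_line
    last_line = (pc_index + width - 1) // insts_per_line
    return last_line - first_line + 1   # one cache line per cycle


if __name__ == "__main__":
    print(fetch_cycles(0b01000))  # aligned: the group fits in row 010
    print(fetch_cycles(0b01001))  # misaligned: spills into row 011
```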

Fragmentation due to Branches
• Fetch group is aligned, cache line size > fetch group
  – Taken branches still limit fetch width

[Figure: a decoder selects a cache line containing a Branch; the Insts after the taken Branch are discarded (X X)]

Types of Branches
• Direction:
  – Conditional vs. Unconditional
• Target:
  – PC-encoded
    • PC-relative
    • Absolute offset
  – Computed (target derived from register)

Need direction and target to find next fetch group

Branch Prediction Overview
• Use two hardware predictors
  – Direction predictor guesses if branch is taken or not-taken
  – Target predictor guesses the destination PC
• Predictions are based on history
  – Use previous behavior as indication of future behavior
  – Use historical context to disambiguate predictions

Page 61:

Direction vs. Target Prediction
• Direction: 0 or 1
• Target: 32- or 64-bit value
• Turns out targets are generally easier to predict
  – Don’t need to predict the not-taken (N-t) target
  – The taken (T) target doesn’t usually change
• Only need to predict taken-branch targets
• Prediction is really just a “cache”
  – Branch Target Buffer (BTB)

[Figure: the next PC is either the target predictor's output or PC + sizeof(inst).]

Page 62:

Branch Target Buffer (BTB)

[Figure: the Branch PC indexes a table whose entries hold a Valid Bit (V), a Branch Instruction Address tag (BIA) and a Branch Target Address (BTA). The tag is compared (=) with the Branch PC to produce Hit?; on a hit the BTA becomes the Next Fetch PC.]
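
The lookup in the BTB figure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the direct-mapped layout, `NUM_ENTRIES`, `INST_SIZE` and the method names are hypothetical and not from the slides, and every BTB hit is treated as a predicted-taken branch.

```python
from dataclasses import dataclass

INST_SIZE = 4      # assumed fixed instruction size in bytes
NUM_ENTRIES = 8    # assumed (tiny) direct-mapped BTB


@dataclass
class BTBEntry:
    valid: bool = False
    bia: int = 0   # branch instruction address (tag)
    bta: int = 0   # branch target address


class BTB:
    def __init__(self):
        self.entries = [BTBEntry() for _ in range(NUM_ENTRIES)]

    def _index(self, pc):
        return (pc // INST_SIZE) % NUM_ENTRIES

    def update(self, branch_pc, target):
        """Record the target of a taken branch."""
        self.entries[self._index(branch_pc)] = BTBEntry(True, branch_pc, target)

    def next_fetch_pc(self, pc):
        """Return (hit, next PC): the BTA on a hit, else the fall-through PC."""
        entry = self.entries[self._index(pc)]
        hit = entry.valid and entry.bia == pc
        return hit, (entry.bta if hit else pc + INST_SIZE)
```

A miss simply falls through to `pc + INST_SIZE`, which is why only taken-branch targets need to be stored.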

Page 63:

BTB w/Partial Tags

Branch PCs being looked up: 00000000cfff9810, 00000000cfff9824, 00000000cfff984c

Full tags:
V  Tag              Target
v  00000000cfff981  00000000cfff9704
v  00000000cfff982  00000000cfff9830
v  00000000cfff984  00000000cfff9900

Partial tags:
V  Tag   Target
v  f981  00000000cfff9704
v  f982  00000000cfff9830
v  f984  00000000cfff9900

A different PC, 00001111beef9810, also matches partial tag f981.

Fewer bits to compare, but prediction may alias
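
The aliasing case can be checked directly with the two PCs from the slide. This is a sketch under assumptions: the 16-bit partial tag matches the four hex digits shown, but the choice to drop the low four PC bits (`INDEX_BITS`) is inferred from the figure rather than stated.

```python
TAG_BITS = 16     # partial tag width, matching the four hex digits on the slide
INDEX_BITS = 4    # assumed: the low four PC bits are not part of the tag


def full_tag(pc):
    return pc >> INDEX_BITS


def partial_tag(pc):
    return (pc >> INDEX_BITS) & ((1 << TAG_BITS) - 1)


branch_pc = 0x00000000CFFF9810   # branch installed in the BTB
other_pc = 0x00001111BEEF9810    # unrelated PC from the slide

full_match = full_tag(branch_pc) == full_tag(other_pc)
partial_match = partial_tag(branch_pc) == partial_tag(other_pc)
print(hex(partial_tag(branch_pc)), full_match, partial_match)  # 0xf981 False True
```

With full tags the second PC misses; with partial tags it hits the first branch's entry and inherits its target.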

Page 64:

BTB w/PC-offset Encoding

Branch PC being looked up: 00000000cfff984c

Full target stored:
V  Tag   Target
v  f981  00000000cfff9704
v  f982  00000000cfff9830
v  f984  00000000cfff9900

Only the low-order target bits stored:
V  Tag   Target (low bits)
v  f981  ff9704
v  f982  ff9830
v  f984  ff9900

Predicted target = upper PC bits followed by the stored bits: 00000000cf | ff9900

If target too far or PC rolls over, will mispredict
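
A small sketch of the encoding and of the "too far" failure case follows. The 24-bit width is taken from the six stored hex digits on the slide; the function names and the `far` address are made up for illustration.

```python
OFFSET_BITS = 24                  # six hex digits, as on the slide
MASK = (1 << OFFSET_BITS) - 1


def encode_target(target):
    """Keep only the low-order bits of the target in the BTB entry."""
    return target & MASK


def decode_target(pc, stored):
    """Rebuild the target from the upper bits of the fetch PC."""
    return (pc & ~MASK) | stored


pc = 0x00000000CFFF984C
near = 0x00000000CFFF9900         # same upper bits as the PC
far = 0x00000000D0FF9900          # hypothetical target whose upper bits differ

near_ok = decode_target(pc, encode_target(near)) == near
far_ok = decode_target(pc, encode_target(far)) == far
print(near_ok, far_ok)            # True False
```

The second case is the misprediction the slide warns about: the stored bits are right, but the upper bits borrowed from the PC are not.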

Page 65:

Branches Have Locality
• If a branch was previously taken…
  – There’s a good chance it’ll be taken again

for (i = 0; i < 100000; i++) {
    /* do stuff */
}

This branch will be taken 99,999 times in a row.

Page 66:

Last Outcome Predictor
• Do what you did last time

0xDC08: for (i = 0; i < 100000; i++) {
0xDC44:     if ((i % 100) == 0)
                tick();
0xDC50:     if ((i & 1) == 1)
                odd();
        }

[Figure: branch outcomes annotated as T (taken) and N (not taken).]

Page 67:

Saturating Two-Bit Counter

[Figure: FSM for Last-Outcome Prediction with states 0 and 1, and FSM for 2bC (2-bit Counter) with states 0, 1, 2, 3. Legend: states that predict N-t, states that predict T, transitions on T outcome, transitions on N-t outcome.]
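
To see why the extra bit helps, the two FSMs can be compared on a loop-style branch. This is a sketch: it assumes the usual 2bC encoding (states 0 and 1 predict not-taken, 2 and 3 predict taken) and an arbitrary test pattern that is not from the slides.

```python
def last_outcome_mispredicts(outcomes, state=0):
    """1-bit predictor: predict whatever happened last time."""
    misses = 0
    for taken in outcomes:
        if bool(state) != taken:
            misses += 1
        state = int(taken)
    return misses


def two_bit_mispredicts(outcomes, counter=0):
    """Saturating 2-bit counter: 0,1 predict not-taken; 2,3 predict taken."""
    misses = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            misses += 1
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses


# A loop branch taken 9 times, then not taken once, with the loop run 10 times.
pattern = ([True] * 9 + [False]) * 10
print(last_outcome_mispredicts(pattern), two_bit_mispredicts(pattern))  # 20 12
```

The last-outcome predictor pays twice per loop exit (once leaving, once re-entering), while the saturating counter pays only once after warm-up.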

Page 68:

Typical Organization of 2bC Predictor

[Figure: the PC (32 or 64 bits) is hashed down to log2 n bits, which index a table of n entries/counters to produce the Prediction. FSM Update Logic combines the counter with the Actual outcome and performs the table update.]
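
A minimal model of this organization is sketched below. The hash (drop two low PC bits, keep log2 n bits), the initial counter value and the class name are assumptions for illustration, not details from the slide.

```python
class TwoBitPredictor:
    def __init__(self, n_entries=1024):
        assert n_entries & (n_entries - 1) == 0, "n must be a power of two"
        self.n = n_entries
        self.counters = [1] * n_entries   # assumed initial state: weakly not-taken

    def _index(self, pc):
        # Assumed hash: drop the two low bits, keep log2(n) bits.
        return (pc >> 2) & (self.n - 1)

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        """FSM update logic: saturating increment or decrement on the actual outcome."""
        i = self._index(pc)
        c = self.counters[i]
        self.counters[i] = min(c + 1, 3) if taken else max(c - 1, 0)
```

Because the table is untagged, two branches whose PCs hash to the same index share one counter.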

Page 69:

Track the History of Branches

[Figure: the PC selects an entry that holds the Previous Outcome and two counters, one used if prev = 0 and one used if prev = 1. The animation steps through a sequence of outcomes; the surviving labels show prev = 1 and prev = 0 with counter values of 2 or 3 and prediction = T.]

Page 70:

Deeper History Covers More Patterns
• Counters learn “pattern” of prediction

[Figure: the PC selects an entry that holds the Previous 3 Outcomes and one counter per history value (counter if prev = 000, counter if prev = 001, counter if prev = 010, ..., counter if prev = 111).]

Example: for the outcome stream 00110011001… (0011)*, the counters learn 001 → 1; 011 → 0; 110 → 0; 100 → 1.

Page 71:

Predictor Training Time
• Ex: prediction equals opposite of 2nd most recent outcome
• Hist Len = 2
  – 4 states to train:
    NN → T, NT → T, TN → N, TT → N
• Hist Len = 3
  – 8 states to train:
    NNN → T, NNT → T, NTN → N, NTT → N, TNN → T, TNT → T, TTN → N, TTT → N

Page 72:

Predictor Organizations

[Figure: three PC-hash-indexed organizations: a different pattern for each branch PC; a shared set of patterns; a mix of both.]

Page 73:

Two-Level Predictor Organization
• Branch History Table (BHT)
  – $2^a$ entries
  – h-bit history per entry
• Pattern History Table (PHT)
  – $2^b$ sets
  – $2^h$ counters per set
• Total Size in bits
  – $h \cdot 2^a + 2^{(b+h)} \cdot 2$

[Figure: a bits of the PC hash index the BHT; b bits of the PC hash select a PHT set and the h-bit history selects a counter within it. Each entry is a 2-bit counter.]
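
As a quick arithmetic check of the size formula, the helper below evaluates it for one configuration. The function name and the sample values (a = 10, b = 4, h = 8) are arbitrary choices for illustration.

```python
def two_level_size_bits(a, b, h):
    """Total storage: h * 2**a bits of history plus 2**(b + h) two-bit counters."""
    bht_bits = h * 2 ** a
    pht_bits = 2 * 2 ** (b + h)
    return bht_bits + pht_bits


print(two_level_size_bits(a=10, b=4, h=8))  # 8192 + 8192 = 16384
```

Each extra history bit doubles the PHT term, so h dominates the budget quickly.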

Page 74:

Combined Indexing
• “gshare” (S. McFarling)

[Figure: k bits of the PC hash are XORed with k bits of branch history to index the counter table; k = log2(counters).]
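
A compact gshare model is sketched below, assuming a single global history register of k bits and 2-bit counters. The class name, the PC hash (`pc >> 2`) and the initial counter value are illustrative assumptions.

```python
class GShare:
    def __init__(self, k=10):
        self.k = k
        self.mask = (1 << k) - 1
        self.history = 0                  # global history register, k bits
        self.counters = [1] * (1 << k)    # assumed initial state: weakly not-taken

    def _index(self, pc):
        # XOR k bits of the (hashed) PC with k bits of history.
        return ((pc >> 2) ^ self.history) & self.mask

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        c = self.counters[i]
        self.counters[i] = min(c + 1, 3) if taken else max(c - 1, 0)
        self.history = ((self.history << 1) | int(taken)) & self.mask
```

For a branch that strictly alternates taken and not-taken, the two history values map to two different counters, so the predictor becomes exact after a short warm-up, which a plain per-PC 2-bit counter cannot do.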

Page 75:

OoO Execution
• Out-of-Order execution (OoO)
  – Totally in the hardware
  – Also called Dynamic scheduling
• Fetch many instructions into instruction window
  – Use branch prediction to speculate past branches
• Rename regs. to avoid false deps. (WAW and WAR)
• Execute insns. as soon as possible
  – As soon as deps. (regs and memory) are known
• Today’s machines: 100+ insns. scheduling window

Page 76:

Superscalar != Out-of-Order

A: R1 = Load 16[R2]
B: R3 = R1 + R4
C: R6 = Load 8[R9]
D: R5 = R2 - 4
E: R7 = Load 20[R5]
F: R4 = R4 - 1
G: BEQ R4, #0

[Figure: execution timelines for A to G, where A suffers a cache miss, on four machines. The cycle counts shown are 10, 8, 7 and 5; they appear to correspond to 1-wide In-Order, 2-wide In-Order, 1-wide Out-of-Order and 2-wide Out-of-Order respectively.]

Page 77:

Review of Register Dependencies

Initial register values in every example: R1 = 5, R2 = -2, R3 = 9, R4 = 3.

Read-After-Write
A: R1 = R2 + R3
B: R4 = R1 * R4

Order                      Step      R1   R2   R3   R4
A then B (program order)   after A    7   -2    9    3
                           after B    7   -2    9   21
B then A (reordered)       after B    5   -2    9   15
                           after A    7   -2    9   15

Write-After-Read
A: R1 = R3 / R4
B: R3 = R2 * R4

Order                      Step      R1   R2   R3   R4
A then B (program order)   after A    3   -2    9    3
                           after B    3   -2   -6    3
B then A (reordered)       after B    5   -2   -6    3
                           after A   -2   -2   -6    3

Write-After-Write
A: R1 = R2 + R3
B: R1 = R3 * R4

Order                      Step      R1   R2   R3   R4
A then B (program order)   after A    7   -2    9    3
                           after B   27   -2    9    3
B then A (reordered)       after B   27   -2    9    3
                           after A    7   -2    9    3

Page 78:

Register Renaming
• Register renaming (in hardware)
  – “Change” register names to eliminate WAR/WAW hazards
  – Arch. registers (r1,f0…) are names, not storage locations
  – Can have more locations than names
  – Can have multiple active versions of same name
• How does it work?
  – Map-table: maps names to most recent locations
  – On a write: allocate new location, note in map-table
  – On a read: find location of most recent write via map-table
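
The three map-table rules can be sketched as follows. This is a simplified model under assumptions: instructions are `(dest, src1, src2)` triples of architectural register numbers, the free list never runs out, and locations are never reclaimed; none of these details come from the slides.

```python
def rename(instructions, num_arch_regs=8):
    """Rename (dest, src1, src2) triples of architectural register numbers."""
    map_table = {r: r for r in range(num_arch_regs)}   # name -> most recent location
    next_free = num_arch_regs                          # assumed unbounded free list
    renamed = []
    for dest, src1, src2 in instructions:
        # On a read: find the location of the most recent write via the map table.
        p1, p2 = map_table[src1], map_table[src2]
        # On a write: allocate a new location and note it in the map table.
        map_table[dest] = next_free
        renamed.append((next_free, p1, p2))
        next_free += 1
    return renamed


# WAW pair: A: R1 = R2 + R3 ; B: R1 = R3 * R4
print(rename([(1, 2, 3), (1, 3, 4)]))   # [(8, 2, 3), (9, 3, 4)]
# RAW pair: A: R1 = R2 + R3 ; B: R4 = R1 * R4
print(rename([(1, 2, 3), (4, 1, 4)]))   # [(8, 2, 3), (9, 8, 4)]
```

After renaming, the WAW pair writes two different locations and can complete in either order, while the RAW pair still shows the true dependence (B reads location 8, which A writes).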

Page 79:

Tomasulo’s Algorithm
• Reservation Stations (RS): instruction buffer
• Common data bus (CDB): broadcasts results to RS
• Register renaming: removes WAR/WAW hazards
• Bypassing (not shown here to make example simpler)

Page 80:

Tomasulo Data Structures

[Figure: fetched insns enter the Reservation Stations (fields FU, op, T, T1, T2, V1, V2, with tag comparators marked =). The Map Table holds a tag T per register and the Regfile holds a value per register R. Results are broadcast as CDB.T (tag) and CDB.V (value).]

Page 81:

Where is the “register rename”?
• Value copies in RS (V1, V2)
• Insn. stores correct input values in its own RS entry
• “Free list” is implicit (allocate/deallocate as part of RS)

[Figure: the Tomasulo data structures again: fetched insns, Map Table, Regfile, Reservation Stations (FU, op, T, T1, T2, V1, V2), CDB.T and CDB.V.]

Page 82:

Precise State
• Speculative execution requires
  – (Ability to) abort & restart at every branch
  – Abort & restart at every load
• Synchronous (exception and trap) events require
  – Abort & restart at every load, store, divide, …
• Asynchronous (hardware) interrupts require
  – Abort & restart at every ??
• Real world: bite the bullet
  – Implement abort & restart at every insn.
  – Called precise state

Page 83:

Complete and Retire
• Complete (C): insns. write results into ROB
  – Out-of-order: don’t block younger insns.
• Retire (R): a.k.a. commit, graduate
  – ROB writes results to register file
  – In-order: stall back-propagates to younger insns.

[Figure: pipeline with I$, BP, regfile, L1-D and the Re-Order Buffer (ROB); C marks Complete and R marks Retire.]

Page 84:

P6 Data Structures

[Figure: Dispatch feeds the Map Table (T+), the RS (FU, op, T, T1, T2, V1, V2, with tag comparators marked =) and the ROB (R, value), whose Head is at Retire and whose Tail is at Dispatch. The Regfile holds committed values; results are broadcast as CDB.T and CDB.V.]

Page 85:

MIPS R10K: Alternative Implementation
• One big physical register file holds all data - no copies
  + Register file close to FUs → small and fast data path
  – ROB and RS “on the side” used only for control and tags

[Figure: Dispatch feeds the Map Table (T+), the RS (FU, op, T, T1+, T2+) and the ROB (T, Told; Head at Retire, Tail at Dispatch). Also shown are the Free List, the Arch. Map and the Regfile (R, value). Only tags are broadcast (CDB.T).]

Page 86:

Executing Memory Instructions
• If R1 != R7
  – Then Load R8 gets correct value from cache
• If R1 == R7
  – Then Load R8 should get value from the Store
  – But it didn’t!

Load  R3 = 0[R6]      Issue, Cache Miss!
Add   R7 = R3 + R9    Issue once the miss is serviced
Store R4 → 0[R7]      Issue after the Add
Sub   R1 = R1 - R2    Issue
Load  R8 = 0[R1]      Issue, Cache Hit!

But there was a later load…

Page 87:

Memory Disambiguation Problem
• Ordering problem is a data-dependence violation
• Imprecise memory worse than imprecise registers
• Why can’t this happen with non-memory insts?
  – Operand specifiers in non-memory insns. are absolute
    • “R1” refers to one specific location
  – Operand specifiers in memory insns. are ambiguous
    • “R1” refers to a memory location specified by the value of R1.
    • When pointers (e.g., R1) change, so does this location

Page 88:

Two Problems
• Memory disambiguation on loads
  – Do earlier unexecuted stores to the same address exist?
    • Binary question: answer is yes or no
• Store-to-load forwarding problem
  – I’m a load: Which earlier store do I get my value from?
  – I’m a store: Which later load(s) do I forward my value to?
    • Non-binary question: answer is one or more insn. identifiers

Page 89:

Load/Store Queue (1/2)
• Load/store queue (LSQ)
  – Completed stores write to LSQ
  – When store retires, head of LSQ written to L1-D
    • (or write buffer)
  – When loads execute, access LSQ and L1-D in parallel
    • Forward from LSQ if older store with matching address
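
The forwarding rule in the last bullet can be modeled with a simple search. This is a sketch under assumptions: the LSQ is represented as a list of `(age, addr, data)` tuples where a smaller age means older in program order, and the cache is a plain dictionary; real hardware does this with parallel comparators rather than a loop.

```python
def load_value(load_addr, load_age, store_queue, cache):
    """Return the value a load should see.

    store_queue: list of (age, addr, data) for completed, unretired stores.
    Smaller age means older in program order.
    """
    older = [s for s in store_queue if s[0] < load_age and s[1] == load_addr]
    if older:
        # Forward from the youngest store that is still older than the load.
        return max(older, key=lambda s: s[0])[2]
    return cache.get(load_addr, 0)


stores = [(1, 0x4000, 11), (2, 0x4000, 22), (3, 0x4120, 33)]
cache = {0x4000: 5, 0x4120: 6, 0x5000: 7}
print(load_value(0x4000, 4, stores, cache))  # 22, forwarded from the store of age 2
print(load_value(0x5000, 4, stores, cache))  # 7, no matching store, read the cache
```

Picking the youngest older store matters when several stores to the same address are in flight.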

Page 90:

Load/Store Queue (2/2)

[Figure: pipeline with I$, BP, regfile, ROB, LSQ and L1-D. Loads and stores enter the LSQ, which exchanges store data, addr and load data with L1-D.]

Almost a “real” processor diagram

Page 91:

Loads Execute When …
• Most aggressive approach
• Relies on fact that store→load forwarding is rare
• Greatest potential IPC – loads never stall
• Potential for incorrect execution
  – Need to be able to “undo” bad loads

Page 92:

Detecting Ordering Violations
• Case 1: Older store execs before younger load
  – No problem; if same address, st→ld forwarding happens
• Case 2: Older store execs after younger load
  – Store scans all younger loads
  – Address match → ordering violation
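
Case 2 amounts to a scan over loads that have already executed. The sketch below assumes loads are tracked as `(age, addr)` pairs with smaller ages being older; the function name and data layout are illustrative only.

```python
def find_violations(store_age, store_addr, executed_loads):
    """Return the ages of younger loads that read the store's address too early.

    executed_loads: list of (age, addr) for loads that already executed.
    These loads saw stale data and must be squashed and re-executed.
    """
    return sorted(age for age, addr in executed_loads
                  if age > store_age and addr == store_addr)


loads = [(5, 0x100), (2, 0x100), (6, 0x200)]
print(find_violations(3, 0x100, loads))  # [5]
```

The load of age 2 is older than the store and is unaffected; the load of age 6 touched a different address.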

Page 93:

Loads Checking for Earlier Stores
• On Load dispatch, find data from earlier Store

[Figure: LSQ with an Address Bank and a Data Bank holding ST 0x4000, ST 0x4000, ST 0x4120 and LD 0x4000. Each entry compares its address (=) with the load's; a chain combines “addr match”, “valid store” and “no earlier matches” to pick “use this store”.]

Need to adjust this so that load need not be at bottom, and LSQ can wrap around
If |LSQ| is large, logic can be adapted to have log delay

Page 94:

Data Forwarding
• On execute Store (STA+STD), check for later Loads

[Figure: similar logic to the previous slide. The LSQ holds ST 0x4000, ST 0x4120, LD 0x4000 and another ST 0x4000; an entry captures the value from the Data Bank if there is an addr match and it is a load, and entries marked “Overwritten” do not capture it.]

This is ugly, complicated, slow, and power hungry

Page 95:

Data-Capture Scheduler
• Dispatch: read available operands from ARF/ROB, store in scheduler
• Commit: Missing operands filled in from bypass
• Issue: When ready, operands sent directly from scheduler to functional units

[Figure: Fetch & Dispatch reads the ARF and PRF/ROB and feeds the Data-Capture Scheduler, which feeds the Functional Units. Results return over the Bypass path and as a physical register update.]

Page 96:

Scheduling Loop or Wakeup-Select Loop
• Wake-Up Part:
  – Executing insn notifies dependents
  – Waiting insns. check if all deps are satisfied
    • If yes, “wake up” instruction
• Select Part:
  – Choose which instructions get to execute
    • More than one insn. can be ready
    • Number of functional units and memory ports are limited

Page 97:

Interaction with Execution

[Figure: the Select Logic picks scheduler entry A; the Payload RAM supplies that entry's fields (tags, opcode, ValL, ValR) to execution.]

Page 98:

Simple Scheduler Pipeline

[Figure: timing for a dependence chain A → B → C. A does Select, Payload and Execute in cycle i; its tag broadcast and result broadcast let B do Wakeup and Capture (enable capture on tag match) in the same cycle, so B does Select, Payload and Execute in cycle i+1, and C follows the same way.]

Very long clock cycle

Page 99:

Deeper Scheduler Pipeline

[Figure: timing for a dependence chain A → B → C over cycles i to i+3. Select and Payload are in one cycle and Execute in the next; the tag broadcast drives Wakeup of the dependent, and the result broadcast drives Capture (enable capture).]

Faster, but Capture & Payload on same cycle

Page 100:

Very Deep Scheduler Pipeline

[Figure: timing for instructions A, B, C, D over cycles i to i+6, with Wakeup, Select, Payload, Execute and Capture each in a separate cycle.]

A and B both ready, only A selected, B bids again
A→C and C→D must be bypassed, B→D OK without bypass

Dependent instructions can’t execute back-to-back

Page 101:

Non-Data-Capture Scheduler

[Figure: two variants. In the first, Fetch & Dispatch feeds the Scheduler and Functional Units, with a separate ARF and PRF. In the second, the ARF and PRF are replaced by a Unified PRF. In both, the functional units send a physical register update back to the register file.]

Page 102:

Pipeline Timing

[Figure: Data-Capture: Select, Payload, Execute, with the dependent's Wakeup overlapped so that it can Select, Payload and Execute right behind. Non-Data-Capture: Select, Payload, Read Operands from PRF, Execute; the dependent's timeline contains a “Skip” cycle.]

Substantial increase in schedule-to-execute latency

Page 103:

Handling Multi-Cycle Instructions

Single-cycle producer:
Add R1 = R2 + R3    Sched PayLd Exec
Xor R4 = R1 ^ R5    WU, then Sched PayLd Exec

Multi-cycle producer:
Mul R1 = R2 × R3    Sched PayLd Exec Exec Exec
Add R4 = R1 + R5    WU (delayed), then Sched PayLd Exec

Instructions can’t execute too early

Page 104:

Non-Deterministic Latencies
• Real situations have unknown latency
  – Load instructions
    • Latency ∈ {L1_lat, L2_lat, L3_lat, DRAM_lat}
    • DRAM_lat is not a constant either, queuing delays
  – Architecture specific cases
    • PowerPC 603 has “early out” for multiplication
    • Intel Core 2 also has an early-out divider

Page 105:

Load-Hit Speculation
• Caches work pretty well
  – Hit rates are high (otherwise we wouldn’t use caches)
  – Assume all loads hit in the cache

R1 = 16[$sp]    Sched PayLd Exec Exec Exec    (cache hit, data forwarded)
R2 = R1 + #4    Sched PayLd Exec              (broadcast delayed by DL1 latency)

What to do on a cache miss?

Page 106:

Simple Select Logic

$Grant_0 = 1$
$Grant_1 = \lnot Bid_0$
$Grant_2 = \lnot Bid_0 \land \lnot Bid_1$
$Grant_3 = \lnot Bid_0 \land \lnot Bid_1 \land \lnot Bid_2$
$Grant_{n-1} = \lnot Bid_0 \land \dots \land \lnot Bid_{n-2}$

S entries yields O(S) gate delay

[Figure: tree-structured select over the Scheduler Entries, with inputs x0 to x8 ($x_i = Bid_i$) and outputs grant0 to grant9.]

O(log S) gates
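
The grant equations describe a priority chain, which the loop below mirrors one entry at a time. One assumption is added here: the final grant for an entry also ANDs in that entry's own bid, which the slide's equations leave implicit.

```python
def select_one(bids):
    """Grant the first bidding entry: grant[i] = bid[i] and no earlier entry bids."""
    grants = []
    no_earlier_bid = True     # the enable for entry 0 is the constant 1
    for bid in bids:
        grants.append(bid and no_earlier_bid)
        no_earlier_bid = no_earlier_bid and not bid
    return grants


print(select_one([False, True, True, False]))  # [False, True, False, False]
```

The loop carries `no_earlier_bid` through every entry, which corresponds to the O(S) serial delay; the tree version computes the same prefix in O(log S) levels.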

Page 107:

Implementing Oldest First Select

[Figure: scheduler entries A to H, each with an age (values such as 4, 6, 0, 5, 3, 1, 7, 2), feed the Age-Aware Select Logic, which compares ages stage by stage and produces the Grant.]

Must broadcast grant age to instructions

Page 108:

Problems in N-of-M Select

[Figure: scheduler entries A to H with their ages feed a chain of Age-Aware 1-of-M select blocks; an entry granted by one block has its age replaced by ∞ before the next block.]

O(log M) gate delay / select
N layers → O(N log M) delay

Page 109:

Select Binding

[Figure: two panels. In each, scheduler entries (XOR, SUB, ADD, ADD, CMP) carry an age and a binding to either the Select Logic for ALU1 or the Select Logic for ALU2.]

Not-Quite-Oldest-First: ready insns are aged 2, 3, 4; issued insns are 2 and 4
Wasted Resources: 3 instructions are ready; only 1 gets to issue (the other ALU is idle)

Page 110:

Execution Ports
• Divide functional units into P groups
  – Called “ports”
• Area only $O(P^2 M \log M)$, where P << F
• Logic for tracking bids and grants less complex (deals with P sets)

[Figure: scheduler entries (ADD 3, LOAD 5, ADD 2, MUL 8) bid for Port 0 to Port 4; the functional units behind the ports include ALU1, ALU2, ALU3, M/D, Shift, FAdd, FM/D, SIMD, Load and Store.]

Page 111:

Decentralized RS
• Natural split: INT vs. FP

[Figure: an Int Cluster (INT RF, Int-only wakeup, ALU1, ALU2, Load, Store) and an FP Cluster (FP RF, FP-only wakeup, FAdd, FM/D, FP-Ld, FP-St), issued through Port 0 to Port 3 and sharing the L1 Data Cache.]

Often implies non-ROB based physical register file: one “unified” integer PRF, and one “unified” FP PRF, each managed separately with their own free lists

Page 112:

Higher Complexity not Worth Effort

[Figure: Performance versus “Effort” curve through three design points: Scalar In-Order, Moderate-Pipe Superscalar/OOO, and Very-Deep-Pipe Aggressive Superscalar/OOO.]

Made sense to go Superscalar/OOO: good ROI
Very little gain for substantial effort

Page 113:

SMP Machines
• SMP = Symmetric Multi-Processing
  – Symmetric = All CPUs have “equal” access to memory
• OS sees multiple CPUs
  – Runs one process (or thread) on each CPU

[Figure: CPU0, CPU1, CPU2, CPU3.]

Page 114:

MP Workload Benefits

[Figure: runtime bars for Task A and Task B on one 3-wide OOO CPU, one 4-wide OOO CPU, two 3-wide OOO CPUs and two 2-wide OOO CPUs, with the benefit over the single 3-wide CPU marked.]

Page 115:

… If Only One Task Available

[Figure: runtime bars for a single Task A on one 3-wide OOO CPU, one 4-wide OOO CPU (benefit), two 3-wide OOO CPUs (second CPU idle: no benefit over 1 CPU) and two 2-wide OOO CPUs (performance degradation!).]

Page 116:

Chip-Multiprocessing (CMP)
• Simple SMP on the same chip
  – CPUs now called “cores” by hardware designers
  – OS designers still call these “CPUs”

[Figures: Intel “Smithfield” Block Diagram; AMD Dual-Core Athlon FX]

Page 117:

On-chip Interconnects (1/4)
• Today, (Core+L1+L2) = “core”
  – (L3+I/O+Memory) = “uncore”
• How to interconnect multiple “core”s to “uncore”?
• Possible topologies
  – Bus
  – Crossbar
  – Ring
  – Mesh
  – Torus

[Figure: four cores, each with a cache ($), connected to the LLC $ and the Memory Controller.]

Page 118:

On-chip Interconnects (2/4)
• Possible topologies
  – Bus
  – Crossbar
  – Ring
  – Mesh
  – Torus

[Figure: four Core$ blocks connected to $ Bank 0 to $ Bank 3 and the Memory Controller. Example: Oracle UltraSPARC T5 (3.6GHz, 16 cores, 8 threads per core).]

Page 119:

On-chip Interconnects (3/4)
• Possible topologies
  – Bus
  – Crossbar
  – Ring
  – Mesh
  – Torus

[Figure: four Core$ blocks, $ Bank 0 to $ Bank 3 and the Memory Controller connected through switches. Example: Intel Sandy Bridge (3.5GHz, 6 cores, 2 threads per core).]

• 3 ports per switch
• Simple and cheap
• Can be bi-directional to reduce latency

Page 120:

On-chip Interconnects (4/4)
• Possible topologies
  – Bus
  – Crossbar
  – Ring
  – Mesh
  – Torus

[Figure: tiles that each pair a Core$ with a $ Bank (Bank 0 to Bank 7), plus the Memory Controller. Example: Tilera Tile64 (866MHz, 64 cores).]

• Up to 5 ports per switch

Tiled organization combines core and cache

Page 121:

Multi-Threading
• Uni-Processor: 4-6 wide, lucky if you get 1-2 IPC
  – Poor utilization of transistors
• SMP: 2-4 CPUs, but need independent threads
  – Poor utilization as well (if limited tasks)
• {Coarse-Grained,Fine-Grained,Simultaneous}-MT
  – Use single large uni-processor as a multi-processor
• Cores provide multiple hardware contexts (threads)
  – Per-thread PC
  – Per-thread ARF (or map table)
  – Each core appears as multiple CPUs
    • OS designers still call these “CPUs”

Page 122:

Scalar Pipeline

[Figure: functional-unit occupancy over time.]

Dependencies limit functional unit utilization

Page 123:

Superscalar Pipeline

[Figure: functional-unit occupancy over time.]

Higher performance than scalar, but lower utilization

Page 124:

Chip Multiprocessing (CMP)

[Figure: functional-unit occupancy over time.]

Limited utilization when running one thread

Page 125:

Coarse-Grained Multithreading

[Figure: functional-unit occupancy over time, with a Hardware Context Switch between threads.]

Only good for long latency ops (i.e., cache misses)

Page 126:

Fine-Grained Multithreading

[Figure: functional-unit occupancy over time.]

Saturated workload -> Lots of threads
Unsaturated workload -> Lots of stalls

Intra-thread dependencies still limit performance

Page 127:

Simultaneous Multithreading

[Figure: functional-unit occupancy over time.]

Max utilization of functional units

Page 128:

Paired vs. Separate Processor/Memory?
• Separate CPU/memory
  – Uniform memory access (UMA)
    • Equal latency to memory
  – Low peak performance
• Paired CPU/memory
  – Non-uniform memory access (NUMA)
    • Faster local memory
    • Data placement matters
  – High peak performance

[Figure: left, four CPU($) nodes and four Mem modules on a shared interconnect; right, four CPU($)+Mem pairs, each attached to a router (R).]

Page 129:

Issues for Shared Memory Systems
• Two big ones
  – Cache coherence
  – Memory consistency model
• Closely related
• Often confused

Page 130:

Cache Coherence: The Problem
• Variable A initially has value 0
• P1 stores value 1 into A
• P2 loads A from memory and sees old value 0

[Figure: P1 and P2, each with an L1, share a Bus to Main Memory (A: 0). At t1 P1 executes Store A=1 and its cached copy changes from 0 to 1; at t2 P2 executes Load A and still sees A: 0.]

Need to do something to keep P2’s cache coherent

Page 131: CSE502: Computer Architecture Review. CSE502: Computer Architecture Course Overview (1/2) Caveat 1: Im (kind of) new here. Caveat 2: This is a (somewhat)

CSE502: Computer Architecture

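Simple MSI Protocol

Cache Actions:
• Load, Store, Evict
Bus Actions:
• BusRd, BusRdX, BusInv, BusWB, BusReply

[State diagram: states Invalid, Shared, Modified; each edge is labeled event / action]

• Invalid
  – Load / BusRd (to Shared)
  – Store / BusRdX (to Modified)
• Shared
  – Load / -- (stay Shared)
  – BusRd / [BusReply] (stay Shared)
  – Store / BusInv (to Modified)
  – Evict / -- (to Invalid)
  – BusRdX, BusInv / [BusReply] (to Invalid)
• Modified
  – Load, Store / -- (stay Modified)
  – BusRd / BusReply (to Shared)
  – BusRdX / BusReply (to Invalid)
  – Evict / BusWB (to Invalid)

Usable coherence protocol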
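
The protocol can be sketched as a lookup table plus a snooping step. This is a minimal model, assuming the standard MSI reading of the diagram; the names `MSI`, `step`, and `access` are hypothetical, and only states are tracked, not data values.

```python
# Key: (state, event) -> (next_state, bus_action). None means no bus action.
# "[BusReply]" marks the replies the slide shows as optional.
MSI = {
    ("I", "Load"):   ("S", "BusRd"),
    ("I", "Store"):  ("M", "BusRdX"),
    ("S", "Load"):   ("S", None),
    ("S", "Store"):  ("M", "BusInv"),
    ("S", "Evict"):  ("I", None),
    ("S", "BusRd"):  ("S", "[BusReply]"),
    ("S", "BusRdX"): ("I", "[BusReply]"),
    ("S", "BusInv"): ("I", "[BusReply]"),
    ("M", "Load"):   ("M", None),
    ("M", "Store"):  ("M", None),
    ("M", "Evict"):  ("I", "BusWB"),
    ("M", "BusRd"):  ("S", "BusReply"),
    ("M", "BusRdX"): ("I", "BusReply"),
}


def step(state, event):
    """Return (next_state, bus_action); unlisted pairs leave the state unchanged."""
    return MSI.get((state, event), (state, None))


def access(states, core, op):
    """Apply a Load/Store/Evict on one core and let the other cores snoop it."""
    states[core], bus = step(states[core], op)
    if bus in ("BusRd", "BusRdX", "BusInv"):
        for other in states:
            if other != core:
                states[other], _ = step(states[other], bus)
    return bus


# Replay the coherence problem: P2 reads A, P1 writes A, P2 reads A again.
states = {"P1": "I", "P2": "I"}
trace = []
for core, op in [("P2", "Load"), ("P1", "Store"), ("P2", "Load")]:
    bus = access(states, core, op)
    trace.append((core, op, bus, dict(states)))

for entry in trace:
    print(entry)
```

P1's store issues BusRdX, which invalidates P2's copy, so P2's second load misses and P1 (in Modified) supplies the new value instead of P2 reusing a stale line.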

Page 132

Coherence vs. Consistency

• Coherence concerns only one memory location
• Consistency concerns ordering for all locations
• A Memory System is Coherent if
  – All operations to a given location can be serialized
    • Operations performed by any core appear in program order
  – A read returns the value written by the last store to that location
• A Memory System is Consistent if
  – It follows the rules of its Memory Model
    • Operations on memory locations appear in some defined order

Page 133

Sequential Consistency (SC)

[Figure: processors P1, P2, P3 connected through a switch to Memory. Annotations: processors issue memory ops in program order; switch randomly set after each memory op.]

Defines Single Sequential Order Among All Ops.
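
One way to see what SC allows is to enumerate every interleaving that keeps each processor's program order. The sketch below does this for a hypothetical two-processor test (P1: A=1, then read B; P2: B=1, then read A); the helper names `interleavings` and `run` are assumptions for illustration.

```python
def interleavings(a, b):
    """Yield every merge of sequences a and b that preserves each one's order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


# Each op is (kind, address, value-or-destination-register).
P1 = [("store", "A", 1), ("load", "B", "r1")]
P2 = [("store", "B", 1), ("load", "A", "r2")]


def run(schedule):
    """Execute one global order against a single memory; return (r1, r2)."""
    mem, regs = {"A": 0, "B": 0}, {}
    for kind, addr, arg in schedule:
        if kind == "store":
            mem[addr] = arg
        else:
            regs[arg] = mem[addr]
    return regs["r1"], regs["r2"]


outcomes = {run(s) for s in interleavings(P1, P2)}
print(sorted(outcomes))  # prints [(0, 1), (1, 0), (1, 1)]
```

The outcome (0, 0) never appears: under SC at least one processor must observe the other's store.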

Page 134

Mutex Example w/ Store Buffer

P1                              P2
lockA: A = 1;                   lockB: B = 1;
if (B != 0)                     if (A != 0)
  { A = 0; goto lockA; }          { B = 0; goto lockB; }
/* critical section */          /* critical section */
A = 0;                          B = 0;

[Figure: P1 and P2 on a Shared Bus with memory holding A: 0, B: 0. P1 Read B at t1, P2 Read A at t2; Write A at t3, Write B at t4.]

Does not work
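
The failure can be replayed with a minimal store-buffer model. The `Core` class below is a hypothetical sketch: stores sit in a per-core FIFO until drained, and loads check the core's own buffer before reading memory.

```python
class Core:
    """Hypothetical core with a FIFO store buffer in front of shared memory."""

    def __init__(self, memory):
        self.memory = memory
        self.store_buffer = []   # pending (address, value) stores, oldest first

    def store(self, addr, value):
        # The store retires into the buffer without waiting for memory.
        self.store_buffer.append((addr, value))

    def load(self, addr):
        # Forward from the youngest matching buffered store, else read memory.
        for a, v in reversed(self.store_buffer):
            if a == addr:
                return v
        return self.memory[addr]

    def drain(self):
        # Buffered stores reach memory in program order.
        for a, v in self.store_buffer:
            self.memory[a] = v
        self.store_buffer.clear()


memory = {"A": 0, "B": 0}
p1, p2 = Core(memory), Core(memory)

p1.store("A", 1)           # P1: A = 1 (buffered, not yet visible to P2)
p2.store("B", 1)           # P2: B = 1 (buffered, not yet visible to P1)
p1_sees_b = p1.load("B")   # t1: P1 reads B from memory
p2_sees_a = p2.load("A")   # t2: P2 reads A from memory
p1.drain()                 # t3: Write A reaches memory
p2.drain()                 # t4: Write B reaches memory

both_enter = (p1_sees_b == 0) and (p2_sees_a == 0)
print(both_enter)          # prints True
```

Both processors read 0 and fall through to the critical section, because each load completed before the other processor's earlier store became visible.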

Page 135

Relaxed Consistency Models

• Sequential Consistency (SC):
  – R → W, R → R, W → R, W → W
• Total Store Ordering (TSO) relaxes W → R
  – R → W, R → R, W → W
• Partial Store Ordering relaxes W → W (coalescing WB)
  – R → W, R → R
• Weak Ordering or Release Consistency (RC)
  – All ordering explicitly declared
    • Use fences to define boundaries
    • Use acquire and release to force flushing of values

X → Y: X must complete before Y
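
The ordering rules above can be written as a small table. This is a sketch, not a full model: the names `PRESERVED` and `may_reorder` and the abbreviation "PSO" for Partial Store Ordering are assumptions, and the check applies to two operations on different addresses with no fence between them.

```python
# For each model, the (first, second) pairs whose program order is preserved.
PRESERVED = {
    "SC":  {("R", "W"), ("R", "R"), ("W", "R"), ("W", "W")},
    "TSO": {("R", "W"), ("R", "R"), ("W", "W")},
    "PSO": {("R", "W"), ("R", "R")},
    "RC":  set(),   # nothing is ordered unless declared with fences or acquire/release
}


def may_reorder(model, first, second):
    """True if `second` may complete before `first` (different addresses, no fence)."""
    return (first, second) not in PRESERVED[model]


# A write followed by a read, as in a lock attempt that sets a flag
# and then reads the other processor's flag.
for model in ("SC", "TSO", "PSO", "RC"):
    print(model, may_reorder(model, "W", "R"))
```

Every model except SC lets a read complete before an earlier write to a different address, so any code that relies on W → R ordering needs an explicit fence on those models.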

Page 136

Good Luck!