159233 Computer Architecture (2010-02-18)


© Paul Lyons 2010~ 1 ~

159233 Computer Architecture


INTRODUCTION
IC FABRICATION
THE ISA
REPRESENTING HLL CONSTRUCTS
PERFORMANCE
COMPUTER ARITHMETIC
BUILDING UP THE DATAPATH
FLOATING POINT NUMBERS
SINGLE-CYCLE ARCHITECTURE
MULTI-CYCLE ARCHITECTURE
PIPELINING
EXCEPTIONS
MEMORY MANAGEMENT
    cache
    virtual memory

INTRODUCTION

ABSTRACTION

[Figure: abstraction, from the physical properties of a device (e.g. a disk with tracks and sectors) to concepts]

LEVELS OF ABSTRACTION

Computer systems use technology to simulate the human world

Human thought processes
High level languages: C = (A + B)*3
Assembly language: add A, B
Machine code: 01100000101011
ISA: native data types, instruction set, registers, addressing modes, interrupts, exception handling, I/O handling
Data processing modules: cycles per instruction, physical registers
Gates
Transistors

A GENERIC COMPUTER

[Figure: a generic computer, showing the processor]

MEMORY HIERARCHY

[Figure: the memory hierarchy]

THE IC INDUSTRY

IC FABRICATION

Very large market

Very few products

High rate of development

Long development times

Multiple generations in simultaneous development

Discontinuous technological change

PRODUCING THE WAFERS

[Figure: producing the wafers]

DOPING THE WAFER

[Figure: silicon (Si) crystal lattice doped with phosphorus (P), with negative and positive charges marked]

HOW A CMOS TRANSISTOR WORKS

[Figure: a CMOS transistor with positive charges and an applied voltage (+/-)]

MAKING A CMOS TRANSISTOR

[Figures: steps in making a CMOS transistor]

THE MIPS COMPUTER

THE ISA

a popular microprocessor (a billion sold?)

RISC architecture

CISC: slow memory, assembly language era
RISC: fast memory, HLL era
    simple instructions cut clock cycles to 1
    compilers issue complex instruction sequences
    single addressing mode per instruction
    instructions that operate only on registers
    small controller
    large no. of registers
    hardwired instructions


Architecture

Machine Language

Instruction Set

Compilers

Design Goal

THE ADD INSTRUCTION

The MIPS computer has a 3-address architecture

add a, b, c    # a = b + c
sub a, b, c    # a = b - c

add a, b, c    # a = b + c
add a, a, d    # a = a + d
add a, a, e    # a = a + e
               # a contains the sum of b, c, d, & e

move $8, $19       # r8 ← r19 - desired behaviour
add  $8, $0, $19   # r8 ← 0 + r19 - actual implementation

EXPRESSION TREES AND EVALUATION ORDER

[Figure: expression trees for summing b, c, d and e into a, showing alternative evaluation orders]

THE REGISTERS

$0, $1, … $31

address calculation, stack pointers as well as data storage

Register   Name(s)     Use
0          $zero
1
2-3        $v0-$v1
4-7        $a0-$a3
8-15       $t0-$t7
16-23      $s0-$s7
24-25      $t8-$t9
26-27
28         $gp
29         $sp
30         $fp
31         $ra

RISC DESIGNS FAVOUR SIMPLICITY

32-bit instructions
two types: R(egister)-type & I(mmediate)-type

R-type   op       rs       rt       rd       shamt    funct
         6 bits   5 bits   5 bits   5 bits   5 bits   6 bits

I-type   op       rs       rt       constant or address

I-type instructions have a constant operand or access memory

(Yeah, right)

I-type   op   rs   rt   constant or address

x = h + a[8]

let's say
    x is in register called $t0 in the assembler, actual reg. 8
    h is in register called $s2 in the assembler, actual reg. 18
    array a starts at the location contained in reg. $s3, actual reg. 19

load $t0 with the word in $s2 plus the word 8 up from the address in $s3
lw  $t0, 32($s3)
add $t0, $s2, $t0

fields of lw $t0, 32($s3):   35   19   8   32

(psst; there's a third type as well; J-type, for jump instructions)
regularity is an ideal, but good compromises must sometimes be made
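As a rough illustration of the field layout above, the sketch below packs R-type and I-type fields into 32-bit words. The function names are hypothetical, and the funct value 32 for add is the standard MIPS value, assumed here because the slide does not give it. The lw fields 35, 19, 8, 32 are the ones shown on the slide.

```python
def encode_r_type(op, rs, rt, rd, shamt, funct):
    """Pack R-type fields: op(6) rs(5) rt(5) rd(5) shamt(5) funct(6)."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


def encode_i_type(op, rs, rt, imm):
    """Pack I-type fields: op(6) rs(5) rt(5) constant-or-address(16)."""
    return (op << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)


# Register numbers from the slide: $t0 = 8, $s2 = 18, $s3 = 19
T0, S2, S3 = 8, 18, 19

# lw $t0, 32($s3): op = 35, rs = 19 (base), rt = 8 (destination), offset = 32
lw_word = encode_i_type(35, S3, T0, 32)

# add $t0, $s2, $t0: op = 0, funct = 32 (assumed standard value for add)
add_word = encode_r_type(0, S2, T0, T0, 0, 32)

print(f"lw : {lw_word:032b}")
print(f"add: {add_word:032b}")
```

Note that the offset is 32 because a[8] is 8 words, i.e. 8 x 4 bytes, above the start of the array.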

COMPILATION

HLL code
    f = (g+h) - (i+j)

variable-register associations
    name   register
    f      $s0
    g      $s1
    h      $s2
    i      $s3
    j      $s4

expression tree
    $t0 = $s1 + $s2
    $t1 = $s3 + $s4
    $s0 = $t0 - $t1

assembler
machine code
memory image

REGISTER ↔ MEMORY TRANSFERS

complex programs are difficult to write, with only 32 registers

instructions for storing data to memory and loading data from memory

memory works like a big 1-D array, addressed by byte

address       contents
start         100
start + 4     1000
start + 8     10
start + 12    1

if $19 contains start, then lw $8, 12($19)
loads 1 into register $8

MORE ABOUT MEMORY

Compiler also allocates

2^32 bytes = 4,294,967,296 bytes
2^30 words = 1,073,741,824 words

Large address space means access times

Compiler tries to keep spills

BRANCHES

Computers must have

mostly, that's just that's why

sometimes it needs to that's why the PC is the choice depends on

beq $1, $2, L1 # branch to L1 if $1 and $2 are equal

PC is loaded with PC is incremented

bne $1, $2, L1 # branch to L1 if $1 and $2 not equal

TESTS USING INEQUALITY

SLT (Set on Less Than)
compares
SLT $r1, $r2, $r3
C equivalent:

BLT (Branch on Less Than)
would need
simpler & more regular to use

SLTI (Set on Less Than Immediate)
compares
SLTI $r1, $r2, number

SLTU (Set on Less Than Unsigned)

SLTIU (Set on Less Than Immediate Unsigned)

C equivalent:

IF STATEMENTS

REPRESENTING HLL CONSTRUCTS

if (a==b)
    c = d+e;
else
    c = d-e;

register allocations
    c     d     e     a     b
    $16   $17   $18   $19   $20

LOOPS

while loop
    while (this[i] == k)
        i = i + j

register allocations
    i     j     k     4 (constant)
    $19   $20   $21   $10

repeat loop
    repeat
        i = i + j
    until (this[i] == k)

SUBROUTINE CALLS

Another variety of
saves

jal procAddress

jr $31

For nested procedure calls, stack is spilled into memory
$sp contains
stack lives at top end of memory, & grows downwards

Parameter passing uses registers
for a nested subroutine call,
caller save: called subroutine can then use any registers
callee save: calling subroutine doesn't have to restore registers

IMMEDIATE INSTRUCTIONS TO OPERATE ON CONSTANTS

addi $29, $29, 4      # sp = sp - 1!

lui <regn>, <16-bit const>

lui $8, 255           00011100000001000000000011111111
r8                    00000000111111110000000000000000

addi $8, $8, 96       00100000100001000000000000110000
                      0000000000110000
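A small sketch of how lui and addi combine to build a constant wider than 16 bits. The helper names are hypothetical; registers are modelled as plain 32-bit integers, and addi is assumed to sign-extend its 16-bit constant, as it does on MIPS.

```python
MASK32 = 0xFFFFFFFF


def lui(imm16):
    """Load Upper Immediate: the 16-bit constant goes in the top half, low half is 0."""
    return (imm16 & 0xFFFF) << 16


def addi(reg_value, imm16):
    """Add a sign-extended 16-bit constant to a 32-bit register value."""
    imm = imm16 & 0xFFFF
    if imm & 0x8000:          # sign bit of the 16-bit constant set
        imm -= 0x10000
    return (reg_value + imm) & MASK32


r8 = lui(255)                 # lui $8, 255
r8_after = addi(r8, 96)       # addi $8, $8, 96

print(f"{r8:032b}")           # matches the r8 pattern shown on the slide
print(f"{r8_after:#010x}")
```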

DESIGN PRINCIPLES

Smaller is faster
    more registers → greater area → slower clock

Simplicity favours regularity
    decoding is faster with

Good design demands good compromise
    R-type, I-type, and J-type instructions are all

Make the common case fast
    immediate instructions don't often involve big constants
    so 16-bit constants are OK, with lui only needed occasionally

PERFORMANCE METRICS

PERFORMANCE

throughput
    total work accomplished in a given time

execution time
    time for a given job
    performance (rate or speed) = 1 / execution time

if performanceX > performanceY
then execution timeX < execution timeY

CPU TIME, I/O TIME, AND WALL CLOCK TIME

CPU time is

access times of are common
access time = + +

[Figure: timeline of alternating CPU and I/O periods (CPU, I/O, CPU, I/O, CPU), with CPU time and elapsed time marked]

FACTORS INFLUENCING PERFORMANCE

hardware-related factors
    ISA implementation
    CPU cycle time
    bus cycle time
    caching
    parallelism
    pipelining

software-related factors
    user algorithm
    operating system
    compilers

PERFORMANCE MEASURES; THE CLOCK CYCLE

clock cycle is
    e.g. 10ns

clock rate is
    e.g. 4GHz

no. of clock cycles per instruction is CPI (Cycles Per Instruction), also a factor

execution time = no. of instructions x CPI x clock cycle time

PERFORMANCE COMPONENTS

time = no. of instructions x CPI x clock cycle time

if an instruction set includes n different classes of instruction
$\text{CPU clock cycles} = \sum_{i=1}^{n} \text{CPI}_i \times \text{no. of instructions}_i$

MFLOPs: Millions of Floating Point Operations/second

MIPs: Millions of Instructions/second
    if CPI = 1, MIPS =
    difficult to compare ISAs, as
    difficult to compare programs, as
    does more MIPS automatically mean faster execution?

no one of these is a full measure of performance

PERFORMANCE COMPONENTS

MIPs: Millions of Instructions/second
does more MIPS automatically mean faster execution?
consider a 4GHz computer & 3 classes of instruction with 2 compilers

instruction class                      A     B     C     total
CPI                                    1     2     3
compiler1, billions of instructions    5     1     1
compiler1, billions of cycles          5     2     3     10
compiler2, billions of instructions    10    1     1
compiler2, billions of cycles          10    2     3     15

execution time =

MIPS =
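The two blanks on this slide can be worked through with a few lines of code. This is only a sketch using the figures from the slide (4GHz clock, CPIs of 1, 2 and 3, and the instruction counts for the two compilers). The variable and function names are made up for illustration.

```python
CLOCK_HZ = 4 * 10**9                      # 4GHz computer
CPI = {"A": 1, "B": 2, "C": 3}            # cycles per instruction, by class

# Instruction counts produced by each compiler, by instruction class
MIX = {
    "compiler1": {"A": 5 * 10**9, "B": 1 * 10**9, "C": 1 * 10**9},
    "compiler2": {"A": 10 * 10**9, "B": 1 * 10**9, "C": 1 * 10**9},
}


def cycles(mix):
    """CPU clock cycles = sum over classes of CPI_i x no. of instructions_i."""
    return sum(CPI[cls] * count for cls, count in mix.items())


def execution_time(mix):
    """Seconds = clock cycles / clock rate."""
    return cycles(mix) / CLOCK_HZ


def mips(mix):
    """Millions of instructions per second."""
    return sum(mix.values()) / (execution_time(mix) * 10**6)


for name, mix in MIX.items():
    print(name, execution_time(mix), "s", mips(mix), "MIPS")
```

With these numbers compiler2 scores the higher MIPS rating (3200 against 2800) yet takes longer to run (3.75 s against 2.5 s), which is the point of the slide's question.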

MFLOPS

Millions of Floating Point Operations per Second

much scientific, graphic and engineering computing involves
fast floating point arithmetic implies
many computers have

the same caveat applies to MFLOPS as to MIPS

BENCHMARKS

too many variables, too much hype

benchmarks are standard programs whose
    e.g. Livermore loops
    e.g. SPEC benchmark suites

manufacturers can to achieve good stats
gives an unrealistic impression of

PARALLELISM

some tasks can be
systems that can be divided into
e.g. the atmosphere (weather prediction)

P processors

ideally
in practice,

can't be eliminated

PARALLELISM - AMDAHL'S LAW

how do we measure speedup?

if a task takes 100s in one configuration and 80s in another, what's the speedup?
    speed1 = 1 task/100s = 0.01 tasks/s
    speed2 = 1 task/80s = 0.0125 tasks/s
    speedup = 0.0125 / 0.01 = 1.25

most code is a mixture of
serial processing component is & limits

[Figure: total task time T divided into a serial part ts and a parallel part tp]
$T = t_s + t_p$

if we have P processors working perfectly in parallel then we reduce the time for the parallel section of the code by a factor of P
so the total task time in the parallel configuration $T_P = t_s + t_p / P$
speedup $S = T / T_P = (t_s + t_p) / (t_s + t_p / P)$

if tp can be reduced to 0
max speedup → $T / t_s$
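A short sketch of the speedup calculation, assuming the usual statement of Amdahl's law: the task time T is a serial part ts plus a parallel part tp, and P processors cut only the parallel part, to tp/P. The function names and the 20s/80s split are made up for illustration.

```python
def speedup(t_serial, t_parallel, processors):
    """Speedup S = T / T_P, with T = ts + tp and T_P = ts + tp / P."""
    total = t_serial + t_parallel
    parallel_total = t_serial + t_parallel / processors
    return total / parallel_total


def max_speedup(t_serial, t_parallel):
    """Limit as the parallel part shrinks to 0: S tends to T / ts."""
    return (t_serial + t_parallel) / t_serial


# Hypothetical task: 20s of serial work and 80s of parallelisable work
for p in (1, 2, 4, 16, 1024):
    print(p, "processors:", round(speedup(20, 80, p), 3))
print("limit:", max_speedup(20, 80))
```

However many processors are added, the speedup for this task never reaches 5, because the 20s serial part cannot be removed.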

PARALLELISM

parallelism has two flavours
    independent tasks;
    dependent tasks;

many tasks in computers involve
cf. assembly line for producing goods that undergo same operation sequence

[Figure: pipeline with stages A, B and C processing items 1, 2 and 3 (A1 B1 C1, A2 B2 C2, A3 B3 C3)]

PIPELINE PERFORMANCE

if each task in a pipeline of length L takes t seconds
    single task takes Lt

[Figure: pipeline stages 1, 2, ... L, each taking t: t + t + ... + t]

but for n tasks
    delay before 1st output = Lt
    delay between subsequent outputs = t
    T(n) = Lt + (n - 1)t = (L + n - 1)t

PIPELINE

pipeline characteristics
    pipeline rate, r∞ = 1/t
    pipeline startup time, s = (L - 1)t
    half performance vector length, n½
        (L - 1)t + n½ t = 2 n½ t
        n½ = L - 1
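The pipeline timing formula T(n) = (L + n - 1)t can be checked numerically. The sketch below assumes the usual meaning of the half performance length n½: the number of tasks at which the pipeline delivers half of its asymptotic rate of one result every t seconds. Function names are illustrative only.

```python
def pipeline_time(length, n_tasks, t):
    """T(n) = (L + n - 1) t: L stages to fill, then one result per t."""
    return (length + n_tasks - 1) * t


def pipeline_rate(length, n_tasks, t):
    """Average results per second when n tasks are pushed through."""
    return n_tasks / pipeline_time(length, n_tasks, t)


def half_performance_length(length):
    """n½ = L - 1, from (L - 1)t + n½ t = 2 n½ t."""
    return length - 1


L_STAGES, T_STAGE = 5, 2.0
asymptotic_rate = 1 / T_STAGE
n_half = half_performance_length(L_STAGES)

print(pipeline_time(L_STAGES, 1, T_STAGE))        # a single task takes L t
print(pipeline_rate(L_STAGES, n_half, T_STAGE))   # half of the asymptotic rate
print(asymptotic_rate)
```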

NUMBERS AND THE DATAPATH

COMPUTER ARITHMETIC

datapath
    ALU

number representation
    2's complement

arithmetic operations
    +, *, /, shift

2's COMPLEMENT

used in nearly all microprocessors
most significant bit
other bits

INTEGERS AND ADDRESSES

integers are signed, addresses are unsigned

Note: unsigned addresses are not used in all computers
transputer addresses are 2's complement, in the range -2^31 to 2^31 - 1

no type information included with the data

2's COMPLEMENT

for n bit numbers
    2^(n-1) positive numbers, starting at 0
    2^(n-1) negative numbers, starting at -1
    |max -ve no| =
    x + -x =
    -x =
    -2^(n-1)

operation overflows

4-bit values
     0   0000        -8   1000
     1   0001        -7   1001
     2   0010        -6   1010
     3   0011        -5   1011
     4   0100        -4   1100
     5   0101        -3   1101
     6   0110        -2   1110
     7   0111        -1   1111

examples
    1011 + 1101 = 1000     (-5 + -3 = -8)
    1011 + 1100 = 0111     (-5 + -4 overflows)

[Figure: bit-pattern example involving 0001101, a run of 1s, and 1]
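The 4-bit number wheel can be reproduced in a few lines. This is a sketch with hypothetical helper names. It treats a pattern's most significant bit as having negative weight, which is one common way of describing 2's complement, and flags overflow by comparing the wrapped result with the true sum.

```python
def to_twos(value, bits):
    """Bit pattern (as an unsigned int) of a signed value in the given width."""
    return value & ((1 << bits) - 1)


def from_twos(pattern, bits):
    """Signed value of a pattern: the top bit weighs -2^(bits-1)."""
    sign_bit = 1 << (bits - 1)
    return (pattern & (sign_bit - 1)) - (pattern & sign_bit)


def add(a, b, bits):
    """Add two signed values in a fixed width; report the result and overflow."""
    raw = to_twos(a, bits) + to_twos(b, bits)
    result = from_twos(to_twos(raw, bits), bits)
    return result, result != a + b


for value in (7, 0, -1, -5, -8):
    print(f"{value:3d}  {to_twos(value, 4):04b}")

print(add(-5, -3, 4))   # fits: the answer is -8
print(add(-5, -4, 4))   # -9 does not fit in 4 bits, so the result wraps round
```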

OVERFLOW DETECTION

easy to detect: when

on overflow
    address of overflowing instruction saved
    interrupt handler is called and
    after interrupt code finishes, instruction

1-BIT ALU

BUILDING UP THE DATAPATH

bit-slice architecture
    design an ALU component that handles

logical operations require

We could extend this design by adding more functional blocks, e.g. multipliers

carry propagation
carry lookahead

[Figure: a 1-bit ALU slice with inputs a, b, c, invertb and operation, and outputs result and cout; 32 slices (a0, b0, result0 ... a31, b31, result31) share the operation and invertb lines, with cin entering the first slice]

1-BIT ALU - ADDING CARRY LOOKAHEAD

conventional full adder involves
we could design a 32-bit adder as
64 inputs → 2^64 terms; too big
is there a less expensive way?

can we detect when less significant bit slices will generate a carry output?
set up modules that allow
modules generally incorporate

1-BIT ALU - ADDING CARRY LOOKAHEAD

a   b   cin   cout   sum
0   0   0     0      0       cout always 0: carry kill
0   0   1     0      1
0   1   0     0      1       cout = cin: carry propagate
0   1   1     1      0
1   0   0     0      1
1   0   1     1      0
1   1   0     1      0       cout = 1: carry generate
1   1   1     1      1

Boolean expressions for $G_i$ and $P_i$:
    $G_i = a_i \cdot b_i$
    $P_i = a_i \oplus b_i$
carry input to the next phase:
    $c_{i+1} = G_i + P_i \cdot c_i$
similarly:
    $c_i = G_{i-1} + P_{i-1} \cdot c_{i-1}$
substituting repeatedly:
    $c_{i+1} = G_i + P_i G_{i-1} + P_i P_{i-1} G_{i-2} + \dots + P_i P_{i-1} \cdots P_0 \, c_0$

all can be calculated in parallel from data inputs and c0
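A sketch of the carry lookahead idea, assuming the usual definitions read off the truth table: generate G_i = a_i AND b_i (cout = 1 whatever cin is) and propagate P_i = a_i XOR b_i (cout = cin). Each carry is computed directly from the G and P signals and c0, without waiting for the carry from the slice below. The function names are hypothetical.

```python
def lookahead_carries(a, b, c0, bits):
    """Return ([c0, c1, ..., c_bits], propagate list) using only G, P and c0."""
    g = [(a >> i) & (b >> i) & 1 for i in range(bits)]       # carry generate
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(bits)]     # carry propagate
    carries = [c0]
    for i in range(bits):
        # c[i+1] = G_i + P_i G_(i-1) + ... + P_i ... P_0 c0, fully expanded
        carry = g[i]
        run = p[i]                       # product P_i P_(i-1) ... so far
        for j in range(i - 1, -1, -1):
            carry |= run & g[j]
            run &= p[j]
        carry |= run & c0
        carries.append(carry)
    return carries, p


def add(a, b, c0=0, bits=4):
    """Add using lookahead carries; returns (sum within the width, carry out)."""
    carries, p = lookahead_carries(a, b, c0, bits)
    total = 0
    for i in range(bits):
        total |= (p[i] ^ carries[i]) << i    # sum_i = a_i xor b_i xor c_i
    return total, carries[bits]


print(add(0b0101, 0b1011))   # 5 + 11 = 16: sum bits 0000, carry out 1
```

In hardware the inner loop corresponds to one wide AND-OR expression per carry, which is why the carries are available in parallel.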

1-BIT ALU - ADDING CARRY LOOKAHEAD

carry lookahead circuit works on
CL units are usually associated with
cascaded to make up

[Figure: an ALU built from cascaded carry lookahead (CL) units]

1-BIT ALU - OTHER OPERATIONS

logical operations
    e.g. AND
    bit-for-bit logical operation on a pair of words

shift operations
    e.g. sll & srl (Shift Left Logical & Shift Right Logical)
    shift information in a register by a specified no. of bits

combination of logical and shift operations to extract parts of a word
    create a 32-bit mask with desired bits set to 1 (e.g. 8 bits for a character)
    AND and
    shift result by

1-BIT ALU - OVERFLOW DETECTION

when overflow occurs, carry-in and carry-out of sign bit differ
connect an EOR gate to cin and cout of most significant adder

[Figure: 32-bit ALU (a0, b0, result0 ... a31, b31, result31) with cin, invertb and operation inputs and an overflow output]
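The rule that overflow shows up as a difference between the carry into and the carry out of the sign bit can be tried on a small adder. The sketch below is a 4-bit ripple adder written bit by bit; the function name and the 4-bit width are chosen only for illustration.

```python
def add_with_overflow(a, b, bits=4):
    """Ripple-carry add of two bit patterns; overflow = cin EOR cout of the top slice."""
    carry = 0
    result = 0
    carry_in = 0
    for i in range(bits):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        carry_in = carry                          # carry into this slice
        result |= (ai ^ bi ^ carry_in) << i       # sum bit of a full adder
        carry = (ai & bi) | (carry_in & (ai ^ bi))
    # After the loop, carry_in and carry belong to the most significant adder
    overflow = carry_in ^ carry
    return result, overflow


print(add_with_overflow(0b1011, 0b1101))   # -5 + -3 = -8: no overflow
print(add_with_overflow(0b1011, 0b1100))   # -5 + -4: overflow
print(add_with_overflow(0b0111, 0b0001))   #  7 +  1: overflow
```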

1-BIT ALU - SLT

result = ;

if a < b , is equivalent to

set invertb when performing slt operation to
-ve values have a sign bit
feed o/p of back to
make the o/p MUX in all the ALUs

function   operation   invertb
and        0           -
or         1           -
add        2           0
sub        2           1
slt        3           1

[Figure: bit slices ALU0 to ALU31, each with inputs a, b, c, invertb and operation and outputs result and cout]

BRANCH INSTRUCTIONS - BEQ AND BNE

equality test for beq and bne instructions also relies on
if a=b, then

bne and beq also use
coupled with invertb

zero-detect circuit controls
need to connect to
32-input active-low-input AND gate (i.e., a NOR gate)

function   operation   invertb
and        0           -
or         1           -
add        2           0
sub        2           1
slt        3           1
bne        2           1
beq        2           1

SHIFT

5-bit shamt field specifies
too slow to shift in

barrel shifter shifts
e.g. to shift 5 bits, shamt is

[Figure: barrel shifter stages controlled by the 1's bit, 2's bit and 4's bit of shamt]

DATAPATH LAYOUT

[Figure: datapath layout, with registers and ALU laid out in slices from bit 0 to bit 31; data flows along the slices, while control, instruction decoding and register decoding run across them]

ALU layout is very regular
layout on silicon is also highly structured
control flows and data flows are orthogonal
minimises complexity and communication times

MULTIPLIER

signed multiplication

problem: takes multiple clock cycles

MULTIPLIER - FASTER

    multiplicand          0101
    multiplier          x 1101
    partial products      0101
                         0000
                        0101
                       0101
    product            1000001

what if we put an adder after each partial product?
(except the first, of course)

[Figure: array of adders (∑), one row per partial product, fed by the multiplicand and multiplier bits]

this architecture is well suited to

BOOTH'S ALGORITHM WORKS FOR 2's COMP. NUMBERS

Booth noticed that, when counting in binary,
a string of
will be with a next time the

so the string of can be rewritten
as instead of

    …011111…     (from the 2^m bit down to the 2^n bit)
    …100000…
    …10000-1…

what's the benefit of that?
a simple multiplier uses to multiply by string of x 1s
but if the multiplier can handle , →

the algorithm looking for
    00 - string of zeros;
    10 - start of a string of 1s;
    11 - middle of a string of 1s;
    01 - end of a string of 1s;

still have to
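A minimal sketch of the bit-pair scan, assuming the standard radix-2 form of Booth's algorithm: the multiplier is scanned from the least significant bit, each bit is compared with the bit to its right, the multiplicand is subtracted at the start of a string of 1s (pair 10) and added back one place past the end of the string (pair 01). The function name and the fixed-width check are illustrative choices, not part of the slides.

```python
def booth_multiply(multiplicand, multiplier, bits):
    """Multiply two signed integers by scanning pairs of multiplier bits."""
    lowest, highest = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if not lowest <= multiplier <= highest:
        raise ValueError("multiplier does not fit in the given width")
    product = 0
    previous = 0                                 # imaginary bit right of bit 0
    for i in range(bits):
        current = (multiplier >> i) & 1          # 2's complement bit i
        if (current, previous) == (1, 0):        # start of a string of 1s
            product -= multiplicand << i
        elif (current, previous) == (0, 1):      # end of a string of 1s
            product += multiplicand << i
        # pairs 00 and 11: inside a string, nothing to add or subtract
        previous = current
    return product


print(booth_multiply(5, 13, 8))    # 0101 x 1101 from the earlier slide
print(booth_multiply(5, -3, 4))    # a negative multiplier needs no special case
```

For the multiplier 00001101 this does two subtractions and two additions, and for a long run of 1s it does one of each regardless of the run's length.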

MIPS MULTIPLICATION

the product of the 32-bit multiplications mult and multu occupies 64 bits

hi lo

hi and lo registers are not

mflo $1 (move from lo) puts
mfhi $1 (move from hi) to check that

mult ASM instruction is a
    generates
    generates

DIVISION

mathematically (ideally), division is the inverse of multiplication

if a = b/c
then b = a x c
and if b = 1
then

but with finite precision arithmetic, occur

DIVISION BY REPEATED SUBTRACTION

when we multiplied m by n to produce a product p
we generated

when we divide p by m
we're finding out

we can divide p by m
by repeatedly p till
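One reading of the last blank is the usual one: keep subtracting m from p until what is left is smaller than m, and count the subtractions. The sketch below assumes that reading and non-negative operands. The function name is hypothetical.

```python
def divide_by_subtraction(p, m):
    """Return (quotient, remainder) of p / m using only subtraction and counting."""
    if m <= 0 or p < 0:
        raise ValueError("sketch handles p >= 0 and m > 0 only")
    quotient = 0
    remainder = p
    while remainder >= m:      # m still fits into what is left
        remainder -= m
        quotient += 1
    return quotient, remainder


print(divide_by_subtraction(55, 5))
print(divide_by_subtraction(17, 5))
```

The number of loop iterations equals the quotient, so this is slow for large quotients.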

WHY ARE FP NUMBERS SPECIAL?

FLOATING POINT NUMBERS

we need a way to represent
    numbers with fractions, e.g. 3.14159
    very small numbers, e.g., 0.0000000001
    very large numbers, e.g. 178478 x 10^9

representation:
    sign, exponent, significand:
    more bits for mantissa →
    more bits for exponent →

IEEE 754 floating point standard:
    single precision uses
    double precision uses

IEEE 754 FLOATING POINT STANDARD:

msb of mantissa is always 1, so only the fraction is stored

exponent is "biased" to make sorting easier
subtract 127 to get exponent for single precision & 1023 for double precision

single   bit 31: sign   bits 30-23: exponent (8 bits)    bits 22-0: mantissa (23 bits)
double   bit 63: sign   bits 62-52: exponent (11 bits)   bits 51-0: mantissa (52 bits)
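The single precision layout can be inspected from Python with the standard struct module, which packs a float into its 4-byte IEEE 754 form. This is a sketch; the function names are hypothetical, and value_of handles normalised numbers only (no zeros, denormals, infinities or NaNs).

```python
import struct


def decode_single(x):
    """Split a number's single precision encoding into (sign, biased exponent, fraction)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                  # bit 31
    biased_exponent = (bits >> 23) & 0xFF   # bits 30-23
    fraction = bits & 0x7FFFFF         # bits 22-0
    return sign, biased_exponent, fraction


def value_of(sign, biased_exponent, fraction):
    """Rebuild a normalised value: the leading 1 of the mantissa is implied."""
    mantissa = 1 + fraction / 2**23
    return (-1) ** sign * mantissa * 2.0 ** (biased_exponent - 127)


print(decode_single(1.0))
print(decode_single(-6.5))             # -6.5 = -1.625 x 2^2
print(value_of(*decode_single(-6.5)))
```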

IEEE 754 FLOATING POINT STANDARD:

special cases:

exponent   mantissa
all 0s                 denormalised numbers; zero is a denormalised no. with
all 1s     all 0s      ∞
all 1s     non-0       NAN

QuietNAN has
SignallingNAN has

rounding options:
    to nearest integer, to nearest even integer if fraction is exactly 0.5
    towards 0
    towards +∞
    towards -∞

GUARD AND ROUND BITS

consider adding two decimal numbers with 3 digits of precision: 2.56 + 2.34 x 10^2

with no extra digits     with 1 extra digit     with 2 extra digits
  2.34                     2.340                  2.3400
  0.02                     0.025                  0.0256

extra bits

FLOATING POINT ADDITION

align radix points
    denormalise one number to

add mantissae

renormalise the result
    watch out for

round to correct number of significant digits
    this may occasionally ripple through to the msb and generate an unnormalised result

renormalise again if required

THE "STICKY BIT"

When generating a result, a string of 0s may be followed by a 1 that will be normalised away

1.93650001 (ignoring the exponent)

simple rounding to nearest even value based on rounding digit, would produce

1.93650001

keep sticky bit as next bit to help resolve "mid-way" rounding problems

1

FLOATING POINT MULTIPLICATION

add exponents
    both include a bias, so have to subtract 1 bias from the result

multiply

normalise result, check for overflow

round

renormalise if necessary

FLOATING POINT AND MIPS

MIPS instructions to support IEEE single and double precision floating point:

add.s and add.d
sub.s and sub.d
mul.s and mul.d
div.s and div.d
c.<x>.s and c.<x>.d      <x> may be eq, neq, lt, le, gt, ge
bc1t and bc1f

comparison sets bit to

single and double

ASMs (Designing a Controller)

COMPUTER PROCESSORS

Controller: ASM (Algorithmic State Machine)
    Specifies timing of data manipulations
    Receives status information
Architecture

Controller executes an infinite loop
• instructs processor to get an instruction from memory
• identifies instruction that the processor has retrieved
• instructs processor to perform data manipulations required by the instruction

STATUS INFORMATION (architecture → controller)

Current instruction
If instruction involves a choice
    e.g. JPZ instruction
then controller examines a status line to determine appropriate action

Other status signals are commonly used
    NEG
    OVFL

CONTROL SIGNALS (controller → architecture)

Architectural building block     Control commands     No of bits
Registers
MUXes
Memory
etc

FINITE STATE MACHINES - THE BASIS OF ALGORITHMIC STATE MACHINES

Rectangles contain
Diamonds contain
Roundtangles contain

[Figure: ASM chart starting at state 0, with outputs P, Q, R and S and a decision on Z (T/F)]

A state specifies actions that occur on one clock pulse

A □ must be present
even if it is empty

Outputs in a □ are TRUE
during that state's clock pulse

Outputs in a roundtangle are TRUE
during that state's clock pulse, if …

[Figure: ASM chart fragment with outputs P and R and a decision on Z]

The state is represented as a number
At any instant (clock pulse), the ASM
    has
    asserts
    asserts
    if the condition that governs them is fulfilled
ASM calculates the

[Figure: the ASM chart with its states numbered 0, 1 and 2]

Specs for circuit to navigate round the state chart:

inputs                    output
No. of present state      Next state no.
Status inputs
Any external inputs

[Figure: flowchart for working out the next state no. from the present state no. and Z, asserting P, Q, R and S as TRUE along the way, then repeating]

The ASM state transition table (navigation only)

Inputs          Outputs
Ap Bp Z         An Bn

General structure of the circuit

[Figure: combinatorial logic takes the present state, status and external inputs and produces the outputs and the next state; a clocked register holds the state]

The complete ASM state transition table

Inputs          Outputs
Ap Bp Z         An Bn      P Q R S
0  0  0         1  0
0  0  1         0  1
0  1  -         0  0
1  0  -         0  0

[Figure: clocked 2-bit register (inputs D1 D0, outputs Q1 Q0) holding the state; combinatorial logic takes the present state and Z and produces An, Bn and the outputs P, Q, R, S]
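The navigation part of the table (present state and Z in, next state out) can be modelled as a lookup. This sketch uses only the An Bn columns shown on the slide, with "-" (don't care) written as None. The outputs P, Q, R and S are left out, and the names are hypothetical.

```python
# Rows of the navigation table: (Ap, Bp, Z) -> (An, Bn); None means don't care
TRANSITIONS = [
    ((0, 0, 0), (1, 0)),
    ((0, 0, 1), (0, 1)),
    ((0, 1, None), (0, 0)),
    ((1, 0, None), (0, 0)),
]


def next_state(state, z):
    """Look up the next state number for a present state number and status input Z."""
    ap, bp = state >> 1, state & 1
    for (row_a, row_b, row_z), (an, bn) in TRANSITIONS:
        if (row_a, row_b) == (ap, bp) and row_z in (None, z):
            return (an << 1) | bn
    raise ValueError(f"state {state} is not in the table")


# Clock the machine a few times with a changing Z input
state = 0
trace = [state]
for z in (1, 0, 0, 1):
    state = next_state(state, z)
    trace.append(state)
print(trace)
```

State 3 has no row in the table, so the sketch raises an error for it rather than guessing a next state.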

A practical problem: initialising the state register

Consider a 22” naval gun
• controlled by an ASM
• autoseeking
• autofiring

[Figure: ASM chart. State 00 tests "On automatic?"; state 01 is "Seek enemy target" and tests "Found?"; state 10 is "Fire"]

At power-up,
    if state register contains 0, 1, or 3
    if state register contains 2

We need a reset-on-powerup circuit

[Figure: reset-on-powerup circuit between +5V and 0V producing rst, with a plot of voltage V against time T]

SUMMARY: DESIGNING AN ASM

Construct
    Shows all inputs and control signals

Translate
    external inputs (if any)
    status inputs
    present state
    control commands
    next state

Translate

ASM vs. SOFTWARE

ASMs underlie processor instruction sets
but would you ever use an ASM instead of a computer program?

Software     ASMs in discrete logic     ASMs in FPGA
..           ..                         ..

Is it easy to understand or modify an ASM circuit?

General format is easily recognisable
    inputs
    commands

But combinatorial circuitry and high-level flowchart vocabularies differ significantly

Modifications to state sequence require a complete redesign

Disjunction between ASM diagram and combinatorial circuitry

Could the circuitry (and the outputs)

USING MUXES AS A LOOKUP TABLE

If we connect:
    the control input of a MUX to the state number
    the data inputs to TRUE and FALSE

We'll use 1 MUX for each output (including the bits of the next state number)

state    0   1   2   3   4   5   6   7
Q        1   0   0   1   0   1   1   1

[Figure: 8-input MUX with data inputs 0-7 tied to T or F, selected by the state number, producing Q]
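In software terms a MUX wired this way is an indexed table: the state number selects which fixed TRUE/FALSE input reaches the output. The sketch below uses the Q column from the slide (1, 0, 0, 1, 0, 1, 1, 1 for states 0 to 7). The names are illustrative.

```python
def mux(data_inputs, select):
    """An n-input multiplexer: pass the selected data input through to the output."""
    return data_inputs[select]


# Data inputs 0..7 of the MUX for output Q, tied to TRUE (1) or FALSE (0)
Q_INPUTS = [1, 0, 0, 1, 0, 1, 1, 1]

for state in range(8):
    print(state, mux(Q_INPUTS, state))
```

Changing the value of an output in one state means rewiring one data input, rather than redesigning the combinatorial logic.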

USING MUXES AS A LOOKUP TABLE

Outputs are sometimes

Q should be in state 3, if

[Figure: MUX for Q selected by the State no., with data inputs 0-7 tied to T or F except one, which is driven by A.B]

Simpler and easier to understand than completely combinatorial system

USING MUXES AS A LOOKUP TABLE

Consider our prototype ASM

Inputs          Outputs
Ap Bp Z         An Bn      P Q R S
0  0  0         1  0
0  0  1         0  1
0  1  -         0  0
1  0  -         0  0

[Figure: one MUX per output (An, Bn, P, Q, R, S), each with data inputs 0-3 tied to T, F or Z and selected by the state register outputs Q1 Q0; An and Bn feed the register inputs D1 and D0]

A LIFT CONTROLLER

[Figures: the lift and its controls]

Up button means “Take something upstairs”

Down button means “Take something downstairs”

If the lift is downstairs,

If the lift is upstairs,

A LIFT CONTROLLER

[Figure: ASM chart for the lift controller]
    states: 000 (At bottom), 001 (Starting up), 010 (Going up), 011 (At top), 100 (Starting down), 101 (Going down)
    tests: door open?, upButton + downButton?, closed?, up?, down?
    outputs: openDoor, closeDoor, goUp, goDown, ResetUpRequest, ResetDnRequest

A LIFT CONTROLLER

columns: AP BP CP, condition, AN BN CN, closeDoor, openDoor, resetUpRequest, resetDownRequest, goUp, goDown

AP BP CP
0  0  0
0  0  1
0  1  0
0  1  1
1  0  0
1  0  1

cond1 = doorOpen . ~(upButton + downButton)
cond2 = doorOpen . (upButton + downButton)

[Figure: lift controller circuit. A state register (D2, D1, D0) drives MUXes that produce AN, BN, CN and the outputs openDoor, closeDoor, goUp, goDown, resetUpRequest and resetDownRequest from the inputs closed, DO, UB, DB, up and down]

A MULTIPLICATION CIRCUIT

Phase 1
Phase 2

A MULTIPLICATION CIRCUIT: Analysis

How does “manual” multiplication work?

e.g. 5 x 11 (decimal)

       0101      Multiplicand
     x 1011      Multiplier
       0101      Partial products
      0101
     0000
    0101
    0110111      Product (= 55 decimal)

Hardware multiplication works similarly
But multiplier

When the process ends, running total contains
Otherwise we'd have to use

Storage requirements
    Multiplier
    Multiplicand
    Product

© Paul Lyons 2010~ 115 ~


159233 Computer Architecture

0110111

ASMs (Designing a Controller)

A MULTIPLICATION CIRCUIT: Analysis

For each 1 in the multiplier, add the multiplicand into the running total, allowing for the significance of the 1 bit in the multiplier

1011 (bit significances, left to right: 8 bit, 4 bit, 2 bit, 1 bit)

Each time a partial product is added to the running total
significance needs to be
First PP goes into position
Shift running total

Put running total in a shift register 2n bits wide

If multiplicand is large,
Add an extra bit to the most significant shift register

© Paul Lyons 2010~ 123 ~

159233 Computer Architecture

ASMs (Designing a Controller)

: Design of the architecture

SRA SRB

register

adder

A MULTIPLICATION CIRCUIT

© Paul Lyons 2010~ 124 ~

159233 Computer Architecture

0

SRA SRB

resetA resetBshiftA shiftB

loadMultiplier

multiplier

loadProduct

clock

register

multiplicand

loadMultiplicand

adder

Top 1/2 Bottom 1/2

lobit

Informal algorithm

Load multiplier
Load multiplicand
Repeat 4 times
If lowest bit of multiplier is 1, then add multiplicand to running total
Shift A and B
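The informal algorithm can be sketched in software. This is only an illustration, assuming that SRA holds the top half of the running total, SRB initially holds the multiplier, and the two registers shift right together; the function name and the `n` parameter are hypothetical, not taken from the slides.

```python
def shift_add_multiply(multiplicand, multiplier, n=4):
    """Simulate an n-bit shift-and-add multiplier built from SRA and SRB."""
    mask = (1 << n) - 1
    sra = 0                    # top half of the running total (plus a carry bit)
    srb = multiplier & mask    # bottom half; starts out holding the multiplier
    for _ in range(n):         # "Repeat 4 times" when n == 4
        if srb & 1:            # lobit: lowest bit of the multiplier
            sra += multiplicand & mask   # adder output loaded into SRA
        # shift the combined SRA:SRB register one place to the right
        srb = (srb >> 1) | ((sra & 1) << (n - 1))
        sra >>= 1
    return (sra << n) | srb    # 2n-bit product


print(shift_add_multiply(0b0101, 0b1011))
```

With the slide's operands (0101 and 1011) the sketch should give the same product as the manual working, 0110111.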

ASMs (Designing a Controller)

A MULTIPLICATION CIRCUIT

© Paul Lyons 2010~ 125 ~

159233 Computer Architecture

incr

lobit?

eqz?

done

loadProductT

shiftAshiftB

F

F

00

01

11

10

Start? F

AP BP AN BN loadM

ultiplier

loadM

ultiplicand

clearCounter

reset A

increment

loadProduct

shift A

shift B

done

condition

0 0startT T T T0 1start

0 0

1 0 Tlobit

1 1 Tlobit

0 1

1 1 T-1 0

0 1 T Teqz

0 0 Teqz T T

1 1

ASMs (Designing a Controller)

A MULTIPLICATION CIRCUIT - ASM DIAGRAM

loadMultiplier
loadMultiplicand
clearCounter
resetA


© Paul Lyons 2010~ 127 ~

159233 Computer Architecture

ASMs (Designing a Controller)


D1

D0

Q1

Q0

0123

AN

0123

BN

AP

BP

T F eqz

lobit

start

0123

loadMultiplierloadMultiplicandClearCounterResetA

0123

increment

0123

loadProduct

0123

shiftAshiftB

0123

done

Q0Q1

clearCounter

increment

eqz

© Paul Lyons 2010~ 128 ~

159233 Computer Architecture

THE PROCESSOR = DATAPATH + CONTROL

SINGLE-CYCLE ARCHITECTURE

datapath can process data as specified in the instructions

but fetch/decode/execute cycle needs

control signals regulate
tradeoffs between complex processing and fast hardware modules
want to minimise both

control loop always has
• output the PC to determine the location of the next instruction
• read the registers specified by the instruction (sometimes 1, sometimes 2)
• perform the operation (memory ref, arithmetic/logical, or branch)

© Paul Lyons 2010~ 129 ~

159233 Computer Architecture

THE ARCHITECTURE

SINGLE-CYCLE ARCHITECTURE

PC

registers

instructionmemory

data memory

data out

data in

ALU

© Paul Lyons 2010~ 130 ~

159233 Computer Architecture

COMBINATORIAL vs. SEQUENTIAL LOGIC

SINGLE-CYCLE ARCHITECTURE

datapath components developed so far use
outputs
output of a given set of inputs
delay between

full datapath is
contains storage elements
outputs depend on
clock regulates

controller is a sequential circuit
usually implemented as

© Paul Lyons 2010~ 131 ~

159233 Computer Architecture

CLOCKING DATA THROUGH A PATH

SINGLE-CYCLE ARCHITECTURE

storage(sequential)

storage(sequential)

datapath component(combinatorial)

datapath component(combinatorial)

clock

[Timing diagram labels: setup; hold; register loads data; register o/p unstable; register i/p must be stable; register o/p stable during; cycle time = +]

is it worth cutting a slow stage into two?

yes, if: nc > (n+1)(c - δc), i.e. (n+1)δc > c
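The inequality can be checked numerically. The sketch below assumes one reading of the symbols, which the slide does not spell out: n is the current number of stages, c the current cycle time, and δc the reduction in cycle time obtained by splitting the slow stage. The function name is hypothetical.

```python
def worth_splitting(n, c, delta_c):
    """True if n stages at cycle time c take longer than n+1 stages at c - delta_c."""
    return n * c > (n + 1) * (c - delta_c)


# The simplified form (n+1)*delta_c > c should always agree with the original.
for n, c, d in [(4, 10.0, 3.0), (4, 10.0, 1.0), (2, 8.0, 4.0)]:
    print(n, c, d, worth_splitting(n, c, d), (n + 1) * d > c)
```

The two printed truth values agree on each row because the second form is just the first with nc cancelled from both sides.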

© Paul Lyons 2010~ 132 ~

159233 Computer Architecture

GATED CLOCKS

SINGLE-CYCLE ARCHITECTURE

we don't always want to load data on every clock cycle
use a separate write control line to
clock edge specifies when the data should be loaded
write line specifies whether the data should be loaded

clockwrite write

© Paul Lyons 2010~ 133 ~

159233 Computer Architecture

INSTRUCTION SUBSET

SINGLE-CYCLE ARCHITECTURE

memory reference instructions lw and sw
arithmetic instructions add, sub, and, or, slt
branch instructions beq and j

two phases
single-cycle implementation, combinatorial controller
multi-cycle implementation - leads to

© Paul Lyons 2010~ 134 ~

159233 Computer Architecture

THE PROGRAM COUNTER

SINGLE-CYCLE ARCHITECTURE

add

4

PCinstructionmemory

instruction

© Paul Lyons 2010~ 135 ~

159233 Computer Architecture

REGISTER FILE AND R-FORMAT INSTRUCTIONS

SINGLE-CYCLE ARCHITECTURE

register file contains 32 32-bit registers
implemented as a fast static RAM with dedicated read and write ports
addresses correspond to
control signals, based on current instruction, specify
allows two reads and 1 write on a clock cycle
ALU operates on output from 2 registers, writes result back to register file

write register

read register1read register2

read data1

read data2

write data

/

/

/

5

5

5

instruction regWrite

ALU

zero

4

ALUoperation
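A behavioural sketch of the register file described on this slide may help: two combinational read ports and one write port that only takes effect when regWrite is asserted at the clock edge. The class and method names are hypothetical, and the hard-wired zero register is a MIPS convention that the slide does not mention.

```python
class RegisterFile:
    """32 x 32-bit registers: two read ports, one gated write port."""

    def __init__(self):
        self.regs = [0] * 32

    def read(self, read_register1, read_register2):
        # Both reads happen in the same cycle (5-bit register numbers).
        return self.regs[read_register1 & 31], self.regs[read_register2 & 31]

    def clock_edge(self, reg_write, write_register, write_data):
        # The write only happens when regWrite is asserted on the clock edge.
        # Register 0 stays at zero (MIPS convention, an assumption here).
        if reg_write and (write_register & 31) != 0:
            self.regs[write_register & 31] = write_data & 0xFFFFFFFF


rf = RegisterFile()
rf.clock_edge(True, 9, 7)      # regWrite asserted: register 9 is loaded
rf.clock_edge(False, 9, 99)    # regWrite not asserted: register 9 is unchanged
print(rf.read(9, 0))
```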

© Paul Lyons 2010~ 136 ~

159233 Computer Architecture

MEMORY REFERENCES

SINGLE-CYCLE ARCHITECTURE

lw $t1, offset($t2)
sw $t1, offset($t2)

16-bit value to add to $t2

to generate branch destination
dedicated ALU adds offset to
if offset is -ve, sign is in bit 15
need to sign-extend (set bits 16 - 31 to 1)
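Sign extension of the 16-bit offset can be illustrated with a small function. This is a sketch only; the function name is hypothetical, and values are treated as unsigned 32-bit patterns.

```python
def sign_extend16(value):
    """Extend a 16-bit field to 32 bits by copying bit 15 into bits 16-31."""
    value &= 0xFFFF
    if value & 0x8000:          # sign is in bit 15
        value |= 0xFFFF0000     # set bits 16-31 to 1
    return value


print(hex(sign_extend16(0x0004)))
print(hex(sign_extend16(0xFFFC)))
```

A positive offset is unchanged, while a negative one has its upper 16 bits filled with 1s.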

instruction

read data1

read data2

write register

read register1read register2

write data

regWrite

/

/

/

5

5

5

zero

ALU

ALUoperation

4

sign-extend

data memory

read write

read address

write address

16 32

© Paul Lyons 2010~ 138 ~

159233 Computer Architecture

THE BEQ CONTROL LOGIC

SINGLE-CYCLE ARCHITECTURE

branch destinationALU

sum

shiftleft 2

pc+4

branch address is register (PC) + offset
PC + 4 (here the units are bytes!) is
offset is 16 bits; needs sign extension
unit of offset is words, not bytes; shift it 2 bits to the left to multiply by 4

signextend

instruction

beq $1, $2, offset
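The branch destination calculation on this slide (PC + 4, plus the sign-extended offset shifted left by 2) can be sketched as follows. The function name is hypothetical and addresses are treated as 32-bit values.

```python
def branch_target(pc, offset16):
    """Branch destination = (PC + 4) + (sign-extended 16-bit offset << 2)."""
    imm = offset16 & 0xFFFF
    if imm & 0x8000:            # negative offset: sign is in bit 15
        imm -= 0x10000
    # offset counts words, so shift left 2 to convert to bytes
    return (pc + 4 + (imm << 2)) & 0xFFFFFFFF


print(hex(branch_target(0x1000, 3)))
print(hex(branch_target(0x1000, 0xFFFF)))
```

An offset of 3 words lands 12 bytes beyond PC + 4, and an offset of -1 (0xFFFF) lands back on the branch instruction itself.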

© Paul Lyons 2010~ 139 ~

159233 Computer Architecture

SINGLE-CYCLE ARCHITECTURE

sum

4

shift left 2

PCsum

sign-extendinstruction 0:15

instructionmemory

branchdestination

ALUsum

© Paul Lyons 2010~ 140 ~

159233 Computer Architecture

THE BEQ CONTROL LOGIC

SINGLE-CYCLE ARCHITECTURE

branch destinationALU

sum

shiftleft 2

pc+4

signextend

instruction

beq is a conditional branch instruction
so processor must
ALU
if ALU's zero-detect is TRUE,

zero to branch control logic

regWrite

addread data1

read data2write register

read register1read register2

write data

operation

© Paul Lyons 2010~ 141 ~

159233 Computer Architecture

THE J INSTRUCTION

SINGLE-CYCLE ARCHITECTURE

j address

another instruction that loads the PC - unconditionally this time
26-bit address has two 0 bits added to
no negative addresses, so no need for
top 4 bits of PC are left unaffected
so j instruction can only access

instructionmemory

PC

26unchanged

00
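The way the jump destination is assembled (top 4 bits of the PC unchanged, the 26-bit address field, then two 0 bits) can be sketched like this. The function name is hypothetical, and the sketch takes the top bits straight from the PC value passed in, as the slide describes.

```python
def jump_target(pc, address26):
    """New PC = { PC[31:28], 26-bit address, 00 }."""
    top4 = pc & 0xF0000000                  # top 4 bits left unaffected
    return top4 | ((address26 & 0x03FFFFFF) << 2)


print(hex(jump_target(0x40001234, 0x0000010)))
```

Because only the low 28 bits are replaced, the destination always lies in the same 256 MB region as the current PC.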

© Paul Lyons 2010~ 142 ~

159233 Computer Architecture

SINGLE CLOCK CYCLE RESTRICTIONS

SINGLE-CYCLE ARCHITECTURE

all operations must start and finish in 1 clock cycle
no resources can be shared
multiple operations require

increment PC, calculate address, compare registers all need

however, different instruction types can use the same resource
memory references calculate
register operations calculate
can multiplex

instruction

read data1

read data2

write register

read register1read register2

write data

regWrite

/

/

/

5

5

5

zero

ALU

ALUoperation

4

sign-extend16 32

data memory

read write

read address

write address

similarly

© Paul Lyons 2010~ 143 ~

159233 Computer Architecture

ADDING INSTRUCTION MEMORY

SINGLE-CYCLE ARCHITECTURE

instruction

read data1

read data2

write register

read register1read register2

write data

regWrite

/

/

/

5

5

5

zero

ALU

ALUoperation

4

sign-extend16 32

data memory

read write

read adddress

write adddress

instructionmemory

sumPC

4

Single-cycle instruction requires separate data and instruction memory
no time to read

© Paul Lyons 2010~ 144 ~

159233 Computer Architecture

instruction

read data1

write register

read register1read register2

write data

regWrite

zero

ALU

ALUoperation

4

ADDING THE BEQ INSTRUCTION

SINGLE-CYCLE ARCHITECTURE

data memory

read write

read adddress

write adddress

sign-extend

read data2

sum

4

shift left 2

PC

instructionmemory

sum

© Paul Lyons 2010~ 145 ~

159233 Computer Architecture

SINGLE-CYCLE ARCHITECTURE

write register

read register1read register2write data read data2

read data1

ALU zero

data memory

read adddress

write adddress

instruc-tion

memory

11:15

16:20

21:25

sum

PC

4

sign-extend

0:15 shiftleft 2

sum

© Paul Lyons 2010~ 146 ~

159233 Computer Architecture

ALU CONTROL

SINGLE-CYCLE ARCHITECTURE

controller will handle a subset of the ALU functions

function            ALU Control Input
and                 000
or                  001
add                 010
sub                 110
set on less than    111
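A behavioural sketch of an ALU driven by these five control codes is given below. It assumes 32-bit operands and a signed comparison for set on less than; the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def to_signed(x):
    """Interpret a 32-bit pattern as a two's-complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def alu(control, a, b):
    """Return (result, zero) for the 3-bit ALU control input in the table."""
    if control == 0b000:
        result = a & b
    elif control == 0b001:
        result = a | b
    elif control == 0b010:
        result = a + b
    elif control == 0b110:
        result = a - b
    elif control == 0b111:
        result = 1 if to_signed(a) < to_signed(b) else 0
    else:
        raise ValueError("control code not in the supported subset")
    result &= MASK32
    return result, result == 0


print(alu(0b010, 2, 3))
print(alu(0b110, 5, 5))
```

The second call shows how a subtraction of equal operands raises the zero output, which is what beq relies on.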

© Paul Lyons 2010~ 147 ~

159233 Computer Architecture

COMBINATORIAL CONTROL UNIT

SINGLE-CYCLE ARCHITECTURE

instruction  opcode  ALUop  function code  ALU action        ALU control
lw           lw      00     xxxxxx         add               010
sw           sw      00     xxxxxx         add               010
beq          beq     01     xxxxxx         subtract          110
add          R-type  10     100000         add               010
sub          R-type  10     100010         subtract          110
and          R-type  10     100100         and               000
or           R-type  10     100101         or                001
slt          R-type  10     101010         set on less than  111

(opcode: 6 bits; function code: 6 bits)

             inputs               outputs
instruction  A1 A0  F3 F2 F1 F0   C2 C1 C0
lw/sw        0  0   -  -  -  -    0  1  0
beq          0  1   -  -  -  -    1  1  0
add          1  0   0  0  0  0    0  1  0
sub          1  0   0  0  1  0    1  1  0
and          1  0   0  1  0  0    0  0  0
or           1  0   0  1  0  1    0  0  1
slt          1  0   1  0  1  0    1  1  1

© Paul Lyons 2010~ 148 ~

159233 Computer Architecture

COMBINATORIAL CONTROL UNIT

SINGLE-CYCLE ARCHITECTURE


© Paul Lyons 2010~ 149 ~

159233 Computer Architecture

COMBINATORIAL CONTROL UNIT

SINGLE-CYCLE ARCHITECTURE

Karnaugh maps for C2 (columns F3F2, rows F1F0), one map per value of A1A0

A1A0=00: every cell is 0
A1A0=01: every cell is 1
A1A0=11: every cell is - (don't care)
A1A0=10:

F1F0 \ F3F2   00  01  11  10
00             0   0   -   -
01             -   0   -   -
11             -   -   -   -
10             1   -   -   1

C2 = A0 + A1F1

© Paul Lyons 2010~ 150 ~

159233 Computer Architecture

COMBINATORIAL CONTROL UNIT

SINGLE-CYCLE ARCHITECTURE

C2 = A0 + A1F1

Karnaugh maps for C1 (columns F3F2, rows F1F0), one map per value of A1A0

A1A0=00: every cell is 1
A1A0=01: every cell is 1
A1A0=11: every cell is - (don't care)
A1A0=10:

F1F0 \ F3F2   00  01  11  10
00             1   0   -   -
01             -   0   -   -
11             -   -   -   -
10             1   -   -   1

C1 = `A1 + `F2

© Paul Lyons 2010~ 151 ~

159233 Computer Architecture


COMBINATORIAL CONTROL UNIT

SINGLE-CYCLE ARCHITECTURE

C2 = A0 + A1F1

C1 = `A1 + `F2

Karnaugh maps for C0 (columns F3F2, rows F1F0), one map per value of A1A0

A1A0=00: every cell is 0
A1A0=01: every cell is 0
A1A0=11: every cell is - (don't care)
A1A0=10:

F1F0 \ F3F2   00  01  11  10
00             0   0   -   -
01             -   1   -   -
11             -   -   -   -
10             0   -   -   1

C0 = A1F3 + A1F0

© Paul Lyons 2010~ 152 ~

159233 Computer Architecture

COMBINATORIAL CONTROL UNIT

SINGLE-CYCLE ARCHITECTURE

C2 = A0 + A1F1

C1 = `A1 + `F2

C0 = A1F3 + A1F0
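The three minimised equations can be checked against the ALU control codes listed earlier (010 add, 110 subtract, 000 and, 001 or, 111 set on less than). The sketch below is only a verification aid: the function name is hypothetical, ` in the slides is written as `1 - x` here, and the don't-care function bits for lw/sw and beq are arbitrarily set to 0.

```python
def alu_control(a1, a0, f3, f2, f1, f0):
    """Evaluate C2, C1, C0 from ALUop (A1 A0) and function bits F3..F0."""
    c2 = a0 | (a1 & f1)               # C2 = A0 + A1F1
    c1 = (1 - a1) | (1 - f2)          # C1 = `A1 + `F2
    c0 = (a1 & f3) | (a1 & f0)        # C0 = A1F3 + A1F0
    return c2, c1, c0


rows = {
    "lw/sw": (0, 0, 0, 0, 0, 0),
    "beq":   (0, 1, 0, 0, 0, 0),
    "add":   (1, 0, 0, 0, 0, 0),
    "sub":   (1, 0, 0, 0, 1, 0),
    "and":   (1, 0, 0, 1, 0, 0),
    "or":    (1, 0, 0, 1, 0, 1),
    "slt":   (1, 0, 1, 0, 1, 0),
}
for name, inputs in rows.items():
    print(name, alu_control(*inputs))
```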


© Paul Lyons 2010~ 153 ~

159233 Computer Architecture

C1 = `A1 + `F2

C0 = A1F3 + A1F0

C2 = A0 + A1F1

control block

SINGLE-CYCLE ARCHITECTURE

A

01

F

0123

1C

2

0

COMBINATORIAL CONTROL UNIT

© Paul Lyons 2010~ 154 ~

159233 Computer Architecture

zero16:20

21:25write register

read register1read register2write data read data2

read data1

ALU data memory

read adddress

write adddress

instruc-tion

memory 11:15

sum

PC

4sum

shiftleft 2

sign-extend0:15

THE CONTROL SIGNALS

SINGLE-CYCLE ARCHITECTURE

ALUop26:31

0:5 ALUControl

4

2 controllerMemToReg MemRead MemWritePCSrc RegDst RegWrite ALUSrc

© Paul Lyons 2010~ 155 ~

159233 Computer Architecture

CONTROLLER TRUTH TABLE

SINGLE-CYCLE ARCHITECTURE

signal      R-type (0)  lw (35)  sw (43)  beq (4)
opcode op5  0           1        1        0
opcode op4  0           0        0        0
opcode op3  0           0        1        0
opcode op2  0           0        0        1
opcode op1  0           1        1        0
opcode op0  0           1        1        0
RegDest     1           0        X        X
ALUSrc      0           1        1        0
MemToReg    0           1        X        X
RegWrite    1           1        0        0
MemRead     0           1        0        0
MemWrite    0           0        1        0
Branch      0           0        0        1
ALUOp1      1           0        0        0
ALUOp2      0           0        0        1

MemToReg = op5.`op4.`op3.`op2.op1.op0

RegDest = `op5.`op4.`op3.`op2.`op1.`op0

MemRead = op5.`op4.`op3.`op2.op1.op0

MemWrite = op5.`op4.op3.`op2.op1.op0

Branch = `op5.`op4.`op3.op2.`op1.`op0

ALUOp1 = `op5.`op4.`op3.`op2.`op1.`op0

ALUOp2 = `op5.`op4.`op3.op2.`op1.`op0

ALUSrc = op5.`op4.`op3.`op2.op1.op0 + op5.`op4.op3.`op2.op1.op0

RegWrite = `op5.`op4.`op3.`op2.`op1.`op0 + op5.`op4.`op3.`op2.op1.op0
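The controller truth table can be expressed as a small decoder. This sketch follows the truth table columns for R-type (0), lw (35), sw (43) and beq (4); the function name is hypothetical, and the don't-care (X) entries are arbitrarily returned as 0.

```python
def control(opcode):
    """Decode a 6-bit opcode into the single-cycle control signals."""
    r_type = opcode == 0b000000   # `op5.`op4.`op3.`op2.`op1.`op0
    lw = opcode == 0b100011       #  op5.`op4.`op3.`op2. op1. op0
    sw = opcode == 0b101011       #  op5.`op4. op3.`op2. op1. op0
    beq = opcode == 0b000100      # `op5.`op4.`op3. op2.`op1.`op0
    return {
        "RegDest": int(r_type),
        "ALUSrc": int(lw or sw),
        "MemToReg": int(lw),
        "RegWrite": int(r_type or lw),   # 1 for R-type and lw in the truth table
        "MemRead": int(lw),
        "MemWrite": int(sw),
        "Branch": int(beq),
        "ALUOp1": int(r_type),
        "ALUOp2": int(beq),
    }


for op in (0, 35, 43, 4):
    print(op, control(op))
```

Each signal is a sum of the opcode minterms for the instructions in which the truth table shows a 1.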

© Paul Lyons 2010~ 156 ~

159233 Computer Architecture

CONTROLLER

SINGLE-CYCLE ARCHITECTURE


Vcc

op5

op4

op3

op2

op1

op0

output node

© Paul Lyons 2010~ 157 ~


159233 Computer Architecture

SINGLE-CYCLE vs. MULTI-CYCLE IMPLEMENTATION

MULTI-CYCLE ARCHITECTURE

with single cycle, longest instruction limits speed of whole machine
load instruction involves
CPI = 1 looks good, but

multi-cycle instructions would be faster for all but longest instruction
single memory can be used
single ALU can be used for data, PC and address operations

load and store instructions involve
1 memory access
1 memory access
separate instruction and data memories necessary to

we can divide instructions into phases
e.g., instruction read; register(s) read; compute phase; register write (R-type instruction)
set clock period to length of longest phase of an instruction instead of longest instruction
instructions become

© Paul Lyons 2010~ 160 ~

159233 Computer Architecture

TRISTATE OUTPUTS

MULTI-CYCLE ARCHITECTURE

using a multiplexor to select inputs to an ALU means
multiple 32-bit-wide data paths

alternatively, run a single 32-bit bus past all the sources
give the sources tristate outputs
with n data sources:

log n mux control inputs

32 x n data wires

© Paul Lyons 2010~ 161 ~

159233 Computer Architecture

DATAPATH

MULTI-CYCLE ARCHITECTURE

write signals required

no longer single clock cycle with standard sequence of control signals

temporary registers needed for results because:
signal is computed on 1 clock cycle and
inputs that produced it change
implicit control signals used

4

memoryread adddress

write adddress

write data

PC

PCWrite

ALUSelA

RegDest

IRWrite

MemWrite

MemRead

0

1

2

3

instruction

register

memory data

register

ALUSelB

zeroALU

ALU

out

shiftleft 2

write registerread register1

write data

read data2

read data1

read register2

sign-extend

A

B

Memto Reg

I orD

A

B

ALU

out

© Paul Lyons 2010~ 162 ~

159233 Computer Architecture

BREAKING INSTRUCTION INTO CLOCK CYCLES

MULTI-CYCLE ARCHITECTURE

write registerread register1

write data

read data2

read data1

read register2instruction

registersign-extend shift

left 2

4

memoryread adddress

write adddress

write data

PC

zeroALU

PCWrite

MemRead

MemWrite

IRWrite

RegDest

ALUSelA

0

1

2

3

equalise time spent in each clock cycle
minimise time for whole instruction

clock cycle should contain no more than
1
1
1

memory data

register A

B

ALUSelB

ALU

out

Memto Reg

I orD

© Paul Lyons 2010~ 163 ~

159233 Computer Architecture

CLOCK CYCLE 1 - Instruction Fetch

MULTI-CYCLE ARCHITECTURE

common to all instructions

stores instruction in IR so that

load IR and incPC in parallel
both
use
take effect

IR ← memory[PC]

PC ← PC + 4

© Paul Lyons 2010~ 164 ~

159233 Computer Architecture

A ← reg[IR[21:25]]

B ← reg[IR[16:20]]

ALUout ← PC + (sign-extend(IR[0:15]) << 2)

CLOCK CYCLE 2 – Instruction Decode & Register Fetch

MULTI-CYCLE ARCHITECTURE

read registers specified by rs and rt fields of instruction into A and B registers
don't need them for all instructions, but does no harm

also compute branch target address, just in case
save result

still don't know what the instruction is, though it's in the IR

© Paul Lyons 2010~ 165 ~

159233 Computer Architecture

CLOCK CYCLE 3 – Mem Addr Computation, or Branch Completion

MULTI-CYCLE ARCHITECTURE

in this clock cycle, depends on

memory reference instructions (lw and sw)
ALUout ← A + sign-extend(IR[0:15])

R-type instructions (arithmetic-logic)
ALUout ← A op B

conditional branch instructions
if (A==B) PC ← ALUout

jump instruction
PC ← { PC[28:31], IR[0:25], 2'b00 }

© Paul Lyons 2010~ 166 ~

159233 Computer Architecture

CLOCK CYCLE 4 – Mem Access, or R-Type Instruction Completion

MULTI-CYCLE ARCHITECTURE

memory reference instructions (lw and sw)
MDR ← memory[ALUout]
or
memory[ALUout] ← B

R-type instructions (arithmetic-logic)
reg[IR[11:15]] ← ALUout

© Paul Lyons 2010~ 167 ~

159233 Computer Architecture

CLOCK CYCLE 5 – Memory Read Completion

MULTI-CYCLE ARCHITECTURE

load instruction
reg[IR[16:20]] ← MDR

P 329

© Paul Lyons 2010~ 168 ~

159233 Computer Architecture

[Summary table. Column headings: R-type, memory-reference, branch, jump. Row headings: instr decode, register fetch; Memory Address Computation, or Branch Completion; Memory access or R-type completion; Memory read completion.]

MULTI-CYCLE ARCHITECTURE

write registerread register1

write data

read data2

read data1

read register2instruction

registersign-extend shift

left 2

4

memoryread adddress

write adddress

write data

PC

zeroALU

PCWrite

MemRead

MemWrite

IRWrite

RegDest

ALUSelA

0

1

2

3

memory data

register A

B

ALUSelB

ALU

out

Memto Reg

I orD

instruction fetch
IR ← memory[PC];
PC ← PC + 4;

instr decode, register fetch
A ← reg[IR[21:25]];
B ← reg[IR[16:20]];
ALUout ← PC + (sign-extend(IR[0:15]) << 2);

Memory Address Computation, or Branch Completion
R-type: ALUout ← A op B;
memory-reference: ALUout ← A + sign-extend(IR[0:15]);
branch: if (A==B) PC ← ALUout;
jump: PC ← { PC[28:31], IR[0:25], 2'b00 };

Memory access or R-type completion
R-type: reg[IR[11:15]] ← ALUout;
lw: MDR ← M[ALUout]
sw: M[ALUout] ← B

Memory read completion
load: reg[IR[16:20]] ← MDR;
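Counting the steps each instruction class needs in the summary above gives its cycle count, and from that an average CPI for a program. The sketch below assumes that reading of the table (branch and jump finish in step 3, R-type and sw in step 4, lw in step 5); the function name is hypothetical and the instruction mix is invented purely for illustration.

```python
# Clock cycles per instruction class, read off the five-step summary.
CYCLES = {"R-type": 4, "lw": 5, "sw": 4, "beq": 3, "j": 3}


def average_cpi(mix):
    """Weighted average CPI for a mix given as {instruction class: fraction}."""
    total = sum(mix.values())
    return sum(CYCLES[name] * share for name, share in mix.items()) / total


# Hypothetical mix, not taken from the slides.
print(average_cpi({"R-type": 0.5, "lw": 0.5}))
```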

© Paul Lyons 2010~ 169 ~

159233 Computer Architecture

STATE MACHINE CONTROLLER

MULTI-CYCLE ARCHITECTURE

Standard ASM approach to constructing a controller
see Patterson & Hennessy, pp. 330-340

© Paul Lyons 2010~ 170 ~

159233 Computer Architecture

MICROPROGRAMMING

MULTI-CYCLE ARCHITECTURE

look up control signals for (instruction, clock cycle) instead of calculating them

jump table
2^n entries

processorcontrol lines

microprogramcounter

0: jump to fetch

fetch

microprogram memory

instructionmicrocode

consider a processor with n-bit opcode and no instruction 0
on power-up, fetch loads with , jumps to
jump table jumps to ; instruction executes, jumps back to

© Paul Lyons 2010~ 171 ~

159233 Computer Architecture

MICROPROGRAMMING

MULTI-CYCLE ARCHITECTURE

extra memory access for each clock cycle

new microcode can be downloaded

© Paul Lyons 2010~ 172 ~

159233 Computer Architecture

EXCEPTIONS AND INTERRUPTS

MULTI-CYCLE ARCHITECTURE

exceptions are unexpected events
e.g.

interrupts are unexpected events from outside the processor
I/O devices generate interrupts to signal input events; process swapping

terminological confusion
MIPS convention: both types of event are
Intel 32-bit processor convention: both

handling exceptions is time-consuming
may determine overall speed of machine
save address of current instruction
transfer control to
operating system: OS can then or

© Paul Lyons 2010~ 173 ~

159233 Computer Architecture

EXCEPTIONS AND INTERRUPTS

MULTI-CYCLE ARCHITECTURE

vectored interrupts
when exception occurs, controller
OS routine to handle that exception is

exception handling starts
e.g. signals from:
overflow detector, unrecognised opcode (simplified MIPS processor)
external pin on processor (I/O devices)
states in controller ASM where exceptions can occur have jump

ASM loads:
cause register with
EPC with
PC with (location of OS routine for )

OS:
handles
or

© Paul Lyons 2010~ 174 ~

159233 Computer Architecture

COMPLEX MULTI-CYCLE ARCHITECTURES

MULTI-CYCLE ARCHITECTURE

suitable for
CISC machines can have instructions from 2-3 clock cycles to tens or even hundreds
when data for current instruction moves along datapath,
early parts of datapath

© Paul Lyons 2010~ 175 ~

159233 Computer Architecture

THE BASIC IDEA

PIPELINING

multi-cycle architecture reduces
instructions still
but some instructions take fewer than
so

can we combine 1 CPI behaviour with shorter clocks?

single-cycle (1 CPI) architecture simple but slow
no instruction can run faster than the slowest

each stage in datapath acts on data from a separate instruction
D enacts phase 4 of instruction i on data for instruction i
C enacts phase 3 of instruction i-1 on data for instruction i-1
later instructions can’t work on data currently being produced by the datapath
resources can’t be used at several stages in the datapath
need intermediate registers to keep results available for several clock cycles

DCBA

© Paul Lyons 2010~ 176 ~

159233 Computer Architecture

COMPARISON OF APPROACHES

PIPELINING

Instruction  Instr Fetch  Reg Read  ALU Op  Data Mem  Reg Write  Total
R Format     10ns         5ns       10ns              5ns        30ns
lw           10ns         5ns       10ns    10ns      5ns        40ns
sw           10ns         5ns       10ns    10ns                 35ns
beq          10ns         5ns       10ns                         25ns

[Timing diagram, time axis 0 to 90 ns, showing the instr fetch / reg / ALU / data / reg phases of successive instructions for three cases: pipeline no, cycle single; pipeline no, cycle multi; pipeline yes, cycle single.]

© Paul Lyons 2010~ 177 ~

159233 Computer Architecture

SPEEDUP

PIPELINING

single clock cycle instructions start every 40ns

multi-clock instructions can start every
50ns (lw) x 0.8
40ns (sw & R-type) x 1
37.5ns (branch) x 1.33

speedup

saving of resources (3 ALUs → 1 ALU) as well as speedup

speedup with pipeline
x 4 if
x 2.22 in the example (though )
x 4 is
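The x 2.22 and x 4 figures can be reproduced with a short calculation. This sketch rests on assumptions that the slide leaves implicit: the single-cycle clock is 40ns (the lw total), the pipeline clock is 10ns (the slowest stage), there are 5 stages, and the example runs 5 instructions (matching the 0-90ns time axis). Function names are hypothetical.

```python
STAGE_NS = [10, 5, 10, 10, 5]   # instr fetch, reg read, ALU op, data mem, reg write


def single_cycle_time(n_instructions):
    # Every instruction takes one long clock cycle: the sum of all stages.
    return n_instructions * sum(STAGE_NS)


def pipelined_time(n_instructions):
    # Clock is set by the slowest stage; the first instruction fills the
    # pipeline, then one instruction completes per clock.
    clock = max(STAGE_NS)
    return (len(STAGE_NS) + n_instructions - 1) * clock


def speedup(n_instructions):
    return single_cycle_time(n_instructions) / pipelined_time(n_instructions)


print(round(speedup(5), 2))
print(round(speedup(1_000_000), 2))
```

For 5 instructions this gives 200ns against 90ns; as the instruction count grows, the fill time matters less and the ratio approaches 40ns / 10ns.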

© Paul Lyons 2010~ 178 ~

159233 Computer Architecture

PIPELINE OVERHEADS

PIPELINING

load time
flush time
unequal stage delays
delays in interstage registers

© Paul Lyons 2010~ 179 ~

159233 Computer Architecture

ARE SOME INSTRUCTION SETS BETTER THAN OTHERS?

PIPELINING

constant length instructions "fit" the hardware better
IA32 (Pentium) instructions 1-17 bytes
translated into microinstructions that suit pipelining

standard format with operands in consistent locations
allows register reads to occur before instruction type is known

I-type: op rs rt constant or address

R-type: op rs rt rd shamt funct

© Paul Lyons 2010~ 180 ~


159233 Computer Architecture

ARE SOME INSTRUCTION SETS BETTER THAN OTHERS?

PIPELINING


memory access instructions are shorter if all calculations are register-based
calculation phase can be used for

word-aligned operands reduce memory accesses
no operand transfer takes

© Paul Lyons 2010~ 182 ~

159233 Computer Architecture

write register

read register1read register2write data read data2

read data1

ALUzero

data memory

read adddress

write adddress

instruc-tion

memory

PC

4

sign-extend

shiftleft 2

sum

DIVIDING THE DATAPATH INTO PIPELINE STAGES

PIPELINING

Pipeline registers: IF/ID, ID/EX, EX/MEM, MEM/WB

Stages: instruction fetch; instruction decode, register read; execute, address calculation; memory access; writeback

information needed in a later stage must be passed via
pipeline registers load , read
pipeline registers are named after
for writeback, the pipeline register

© Paul Lyons 2010~ 183 ~

159233 Computer Architecture

DIVIDING THE DATAPATH INTO PIPELINE STAGES

PIPELINING

write register

read register1read register2write data read data2

read data1

ALUzero

data memory

read adddress

write adddress

instruc-tion

memory

PC

4

sign-extend

shiftleft 2

sum

memory accesswriteback

instructionfetch

instruction decode,register read

execute,address calculation

IR ← mem[PC]
PC ← PC + 4

r-typei-type

branch

IF ID ID EX EX MEM MEM WB

© Paul Lyons 2010~ 184 ~

159233 Computer Architecture

DIVIDING THE DATAPATH INTO PIPELINE STAGES

PIPELINING

write register

read register1read register2write data read data2

read data1

ALUzero

data memory

read adddress

write adddress

instruc-tion

memory

PC

4

sign-extend

shiftleft 2

sum

instructionfetch

execute,address calculation memory access

writeback

IF ID EX MEMID EX MEM WB

r-typei-type

branch

A ← Reg[ IR[25-21] ];
B ← Reg[ IR[20-16] ];
Imm ← SE(IR[15-0]);

instruction decode,register read

© Paul Lyons 2010~ 185 ~

159233 Computer Architecture

DIVIDING THE DATAPATH INTO PIPELINE STAGES

PIPELINING

write register

read register1read register2write data read data2

read data1

ALUzero

data memory

read adddress

write adddress

instruc-tion

memory

PC

4

sign-extend

shiftleft 2

sum

instructionfetch

execute,address calculation memory access

writeback

IF ID EX MEM

ALUOut ← A + Imm;
ALUOut ← A func B;
ALUOut ← A op Imm;
ALUOut ← NPC + Imm;
Cond ← (A op 0)

ID EX MEM WB

r-typei-type

branch

instruction decode,register read

© Paul Lyons 2010~ 186 ~

159233 Computer Architecture

DIVIDING THE DATAPATH INTO PIPELINE STAGES

PIPELINING

write register

read register1read register2write data read data2

read data1

ALUzero

data memory

read adddress

write adddress

instruc-tion

memory

PC

4

sign-extend

shiftleft 2

sum

instructionfetch

execute,address calculation memory access

writeback

IF ID EX MEMID EX MEM WB

PC ← NPC

LMD ← M[ALUOut];
or
Memory[ALUOut] ← B;

if cond PC ← ALUOut

r-typei-type

branch

instruction decode,register read

© Paul Lyons 2010~ 187 ~

159233 Computer Architecture

THE LW INSTRUCTION: EXECUTION TRACE

PIPELINING

[Figure sequence: the pipelined datapath (PC, instruction memory, register file, sign-extend, shift left 2, ALU, data memory; pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB), with the active stage highlighted as a lw instruction passes through instruction fetch; instruction decode, register read; execute, address calculation; memory access; writeback.]

© Paul Lyons 2010~ 193 ~

159233 Computer Architecture

CONTROLLING THE PIPELINE

PIPELINING

Instruction Decode turns instruction into
control signals are L→R data flow except for ; both can cause

write register

read register1read register2write data read data2

read data1

ALUzero

data memory

read adddress

write adddress

instruc-tion

memory

PC

4

sign-extend

shiftleft 2

sum

instructionfetch

execute,address calculation memory access

writeback

IF ID EX MEMID EX MEM WBinstruction decode,register read

controlWB

EXWB

WBMEM

MEM

© Paul Lyons 2010~ 194 ~

159233 Computer Architecture


HAZARDS:WHEN PARTS OF THE PIPELINE STAND IDLE

PIPELINING

structural hazards: two instructions need the same resourceconsider lw instructions on a MIPS processor with 1 memory for data and program

data hazards: instruction2 needs data before instruction1 has finished producing itadd $s0 $t0 $t1

sub $t2 $s0 $t3

instr fetch readreg ALU data writereg

instr fetch reg ALU data writereg

solution2: controller inserts into the datapath

reg ALU data writereg

solution: datapath for pipelined MIPS uses separate instruction and data memories

control hazards: control decision depends onif (pipelined) instructioni starts on clock cyclen, instructioni+1 starts on cyclen+1

unless instructioni is ; destinationinstr fetch readreg ALU data writereg

instr fetch reg ALU data writereg

solution1:don't wait for it to bedoesn't work if

© Paul Lyons 2010~ 199 ~

159233 Computer Architecture

HAZARDS: WHEN PARTS OF THE PIPELINE STAND IDLE

PIPELINING

structural hazards: two instructions need the same resource
consider lw instructions on a MIPS processor with 1 memory for data and program
solution: datapath for pipelined MIPS uses separate instruction and data memories

data hazards: instruction 2 needs data before instruction 1 has finished producing it
add $s0, $t0, $t1
sub $t2, $s0, $t3
solution 1: forward data to next instruction as soon as it's produced
don't wait for it to be written to the register file
doesn't work if data is needed before it is produced
solution 2: controller inserts stalls (aka "bubbles") into the datapath

control hazards: control decision depends on result of an incomplete instruction
if (pipelined) instruction i starts on clock cycle n, instruction i+1 starts on cycle n+1
unless instruction i is a branch instruction; destination not known for 4 more clock cycles

[Figure: pipeline stage diagrams (instr fetch, read reg, ALU, data, write reg) for successive overlapped instructions]

HAZARDS: WHEN PARTS OF THE PIPELINE STAND IDLE

PIPELINING

control hazards: control decision depends on result of an incomplete instruction
if (pipelined) instruction i starts on clock cycle n, instruction i+1 starts on cycle n+1
unless instruction i is a branch instruction; destination not known for 4 more clock cycles

can just
put in extra
tests , calculates n &
still have 1 clock-cycle stall

or take one alternative anyway
assume
load next instruction
still lose a clock cycle if , but net improvement

improvements:
assume
or compiler to use the stall time (MIPS)
or on the basis of record

SCHEDULING THE BRANCH DELAY SLOT

PIPELINING

from before
  before:
    add $s1, $s2, $s3
    if $s2=0 then
      delay slot
  after:
    if $s2=0 then
      add $s1, $s2, $s3

from target
  before:
    sub $14, $15, $16
    .
    .
    add $s1, $s2, $s3
    if $s1=0 then
      delay slot
  after:
    .
    .
    add $s1, $s2, $s3
    if $s1=0 then
      sub $14, $15, $16

from fall through
  before:
    add $s1, $s2, $s3
    if $s1=0 then
      delay slot
    sub $14, $15, $16
  after:
    add $s1, $s2, $s3
    if $s1=0 then
      sub $14, $15, $16

BRANCH PREDICTION

PIPELINING

small memory indexed by
each location has 1 bit - set a bit , reset
will sometimes be set
prediction unrelated to
but will contain

but consider performance of a loop branch
taken 9 times, not taken once
misprediction
but after last iteration, prediction is
so misprediction at
branch taken 90% of time, correct prediction 80% of time

use 2-bit branch prediction memory
must be wrong twice before prediction changes
copes with repeated loops

[Figure: state diagram of the 2-bit predictor; four states labelled with the current prediction, transitions labelled "taken" and "not taken"]
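The 80% and 90% figures for a loop branch can be checked with a small simulation. This is a sketch only: it models the predictor as a saturating counter, which is one common way to get "must be wrong twice before prediction changes" and may differ in detail from the state diagram on the slide. The function name `simulate` and the starting state are assumptions.

```python
def simulate(outcomes, bits):
    """Fraction of correct predictions for a saturating counter of `bits` bits."""
    max_count = (1 << bits) - 1
    counter = max_count              # assumption: start in the strongest "taken" state
    correct = 0
    for taken in outcomes:
        predict_taken = counter > max_count // 2
        if predict_taken == taken:
            correct += 1
        # move towards "taken" or "not taken", saturating at both ends
        counter = min(counter + 1, max_count) if taken else max(counter - 1, 0)
    return correct / len(outcomes)

# a loop branch taken 9 times, then not taken once; the whole loop is run 100 times
outcomes = ([True] * 9 + [False]) * 100
one_bit = simulate(outcomes, 1)
two_bit = simulate(outcomes, 2)
print(one_bit, two_bit)
```

With 1 bit the branch is mispredicted at the loop exit and again at the first iteration of the next run; with 2 bits only the exit is mispredicted.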

DATA HAZARDS - CATEGORISATION

PIPELINING

in each of the hazards below, instruction i starts executing before instruction j
hazard name refers to what should happen, not what goes wrong (!)

RAW – Read After Write

WAR – Write After Read

WAW – Write After Write

RAW HAZARDS

EXCEPTIONS

four situations – 2 problematic, 2 not

situation 1
LW   R1,45(R2)
DADD R5,R6,R7
DSUB R8,R6,R7
OR   R9,R6,R7
nothing in following 3 instructions depends on R1
(4th following instruction will be in IF when 1st is in WB)

situation 2
LW   R1,45(R2)
DADD R5,R1,R7
DSUB R8,R6,R7
OR   R9,R6,R7
hardware detects phase1 R1 write, phase2 R1 read
stalls DADD instruction's EX phase (& following instructions)

situation 3
LW   R1,45(R2)
DADD R5,R6,R7
DSUB R8,R1,R7
OR   R9,R6,R7
hardware detects
forwards

situation 4
LW   R1,45(R2)
DADD R5,R6,R7
DSUB R8,R6,R7
OR   R9,R1,R7
no action required. Write of R1 occurs during 1st half of OR's ID phase, and read occurs in 2nd half
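The four situations differ only in how far behind the load the first reader of R1 sits. A minimal sketch of that classification follows; the tuple layout `(op, dest, src1, src2)` and the function names are assumptions made for illustration, not part of the course material.

```python
def load_use_action(distance):
    """Action needed when the instruction `distance` places after a load reads its result."""
    if distance == 1:
        return "stall"      # loaded value is not available when the reader's EX would start
    if distance == 2:
        return "forward"    # value is forwarded to the ALU input
    return "none"           # register written in 1st half of the cycle, read in the 2nd half

def classify(program):
    """program[0] is the load (op, dest, base); later entries are (op, dest, src1, src2)."""
    loaded = program[0][1]
    for distance, instr in enumerate(program[1:], start=1):
        if loaded in instr[2:]:
            return load_use_action(distance)
    return "none"

load = ("LW", "R1", "R2")
case2 = [load, ("DADD", "R5", "R1", "R7"), ("DSUB", "R8", "R6", "R7"), ("OR", "R9", "R6", "R7")]
case3 = [load, ("DADD", "R5", "R6", "R7"), ("DSUB", "R8", "R1", "R7"), ("OR", "R9", "R6", "R7")]
case4 = [load, ("DADD", "R5", "R6", "R7"), ("DSUB", "R8", "R6", "R7"), ("OR", "R9", "R1", "R7")]
print(classify(case2), classify(case3), classify(case4))
```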

WHEN TO FORWARD IN EX PHASE

EXCEPTIONS

source        destination                    destination       forward if:
instruction   instruction
R-type        R-type, I-type, lw, sw, bra    top ALU i/p       EX/MEM[rd] = ID/EX[rs]
R-type        R-type                         bottom ALU i/p    EX/MEM[rd] = ID/EX[rt]
R-type        R-type, I-type, lw, sw, bra    top ALU i/p       MEM/WB[rd] = ID/EX[rs]
R-type        R-type                         bottom ALU i/p    MEM/WB[rd] = ID/EX[rt]
I-type        R-type, I-type, lw, sw, bra    top ALU i/p       EX/MEM[rt] = ID/EX[rs]
I-type        R-type                         bottom ALU i/p    EX/MEM[rt] = ID/EX[rt]
I-type        R-type, I-type, lw, sw, bra    top ALU i/p       MEM/WB[rt] = ID/EX[rs]
I-type        R-type                         bottom ALU i/p    MEM/WB[rt] = ID/EX[rt]
lw            R-type, I-type, lw, sw, bra    top ALU i/p       MEM/WB[rt] = ID/EX[rs]
lw            R-type                         bottom ALU i/p    MEM/WB[rt] = ID/EX[rt]
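The forwarding conditions can be read as a small decision function. The sketch below is illustrative only: the `Instr` record, the field names and the rule that a newer EX/MEM result overrides an older MEM/WB one are assumptions, not something the table states.

```python
from collections import namedtuple

# Hypothetical record for the instruction held in a pipeline register.
Instr = namedtuple("Instr", "kind rd rs rt")

def writes_to(instr):
    """Register that receives the result: rd for R-type, rt for I-type and lw."""
    if instr is None or instr.kind not in ("R-type", "I-type", "lw"):
        return None
    return instr.rd if instr.kind == "R-type" else instr.rt

def forwarding(ex_mem, mem_wb, id_ex):
    """Say which pipeline register feeds each ALU input (None = use the register file)."""
    choice = {"top": None, "bottom": None}
    # Check the older result first so that a newer EX/MEM result overrides it (assumption).
    for name, src in (("MEM/WB", mem_wb), ("EX/MEM", ex_mem)):
        if name == "EX/MEM" and src is not None and src.kind == "lw":
            continue                     # a load's data only appears in MEM/WB
        reg = writes_to(src)
        if reg is None:
            continue
        if reg == id_ex.rs:
            choice["top"] = name         # the rs operand goes to the top ALU input
        if id_ex.kind == "R-type" and reg == id_ex.rt:
            choice["bottom"] = name      # only R-type destinations read rt at the ALU
    return choice

add = Instr("R-type", "s0", "t0", "t1")   # add $s0, $t0, $t1
sub = Instr("R-type", "t2", "s0", "t3")   # sub $t2, $s0, $t3
print(forwarding(ex_mem=add, mem_wb=None, id_ex=sub))
```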

THE IDEAL, THE REALITY, & THE SOLUTION

MEMORY MANAGEMENT

The ideal
indefinite memory capacity
random access
any word instantly available

The reality
limited memory capacity
finite speeds
high speeds → high costs
high capacity → low speed

The solution
hierarchy of memories, with processor registers at the top
each step down has more capacity but slower access

MEMORY MANAGEMENT

THE SOLUTION

[Figure: memory hierarchy: processor registers, main (1°) memory, backing (2°) store, archive; Virtual Memory links main memory and backing store; auto archival and file retrieval link backing store and archive]

Memory management blurs the distinctions to make memory seem
as big as possible
as fast as possible
as cheap as possible
as secure as possible

ALTERNATIVE VIEW

MEMORY MANAGEMENT

[Figure: levels in the memory hierarchy (level 1, level 2, ... level n) below the CPU; distance from the CPU in access time and size of the memory both increase at each level]

PRINCIPLE(S) OF LOCALITY

MEMORY MANAGEMENT

is liable to be
principle of

only a small proportion is of interest at any time

is liable to be followed
principle of

on memory reference:
bring item from
to SRAM: fast, but expensive and thus small
from which

in code, is liable to be followed by
principle of (special case of preceding principle)

(and ask it )

AIMS

MEMORY MANAGEMENT

aims: to make memory behave
as as
as as

technique:
on a hit,
on a miss,
may need to transfer from to

hit ratio:
miss ratio:
miss penalty:
hit time:

CACHE

MEMORY MANAGEMENT - cache

small, fast between registers and

how do we map the large DRAM address space onto the small SRAM?

direct-mapped cache
address in cache with 2^n locations is just

[Figure: 16 memory locations M (addresses 0000 to 1111) mapped onto an 8-location cache (addresses 000 to 111)]

ACCESSING A WORD IN CACHE – the Write Back strategy

MEMORY MANAGEMENT - cache

[Figure: 1024-location cache; the 32-bit address from the processor supplies a 10-bit cache register address (through a decoder) and a tag address; each 55-bit cache entry holds the residual (tag), a 32-bit data part and a 1-bit changed bit; a "residuals equal?" comparator produces the 1-bit match signal; tristate buffers gate the 32-bit data]

R/W   Match   Tag value       Tag action   Cache / memory action
              (changed bit)
R     1       0                            take data from cache
R     1       1                            take data from cache
R     0       0                            read memory to cache
R     0       1               clear        write old data to memory, read new word from memory
W     1       0               set          write data to cache
W     1       1                            write data to cache
W     0       0               set          write new data & address to cache
W     0       1               set          write old data to memory, new data & address to cache
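The eight rows of the table collapse into a few lines of logic. The sketch below is a software model under stated assumptions: word addresses, a Python dict standing in for main memory, and a hypothetical class name `WriteBackCache`; it is not a description of the hardware in the figure.

```python
class WriteBackCache:
    """Direct-mapped, one word per line, with a changed (dirty) bit per line."""

    def __init__(self, lines=1024):
        self.lines = lines
        self.tag = [None] * lines
        self.data = [0] * lines
        self.changed = [False] * lines

    def access(self, memory, address, write=False, value=None):
        index, tag = address % self.lines, address // self.lines
        if self.tag[index] != tag:                      # Match = 0
            if self.changed[index]:                     # write old data to memory
                old_address = self.tag[index] * self.lines + index
                memory[old_address] = self.data[index]
            self.tag[index] = tag                       # new address goes into the cache
            self.changed[index] = False                 # "clear"
            if not write:
                self.data[index] = memory.get(address, 0)   # read memory to cache
        if write:
            self.data[index] = value                    # write data to cache
            self.changed[index] = True                  # "set"
        return self.data[index]

memory = {5: 50, 1029: 99}             # addresses 5 and 1029 share cache line 5
cache = WriteBackCache()
first_read = cache.access(memory, 5)
cache.access(memory, 5, write=True, value=7)
memory_before_eviction = memory[5]     # memory still holds the old value
second_read = cache.access(memory, 1029)
print(first_read, memory_before_eviction, second_read, memory[5])
```

Memory is only updated when the changed line is overwritten, which is the point of the write-back strategy.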


OTHER FLAVOURS OF CACHE
(Guava, Loganberry, Snail)

MEMORY MANAGEMENT - cache

changed bit is not always used
when a cache location is overwritten
it's impossible to tell
always write cache data back to memory when
Simple Swap strategy

OR Write-Through (even simpler)
always write data to cache & back to memory when
not as inefficient as you might think;
buffer queue stores data & address so

[Figure: processor, cache, buffer queue and Memory]

ASSESSING CACHE UPDATE ALGORITHMS

MEMORY MANAGEMENT - cache

[Figure: cycle time against hit ratio HR (0.7 to 1.0) for write-through, simple swap, flagged swap and buffered swap]

BLOCK TRANSFERS

MEMORY MANAGEMENT - cache

temporal locality supported by

spatial locality requires

BLOCK SIZE INFLUENCES MISS RATE

MEMORY MANAGEMENT - cache

[Figure: miss rate (0% to 40%) against block size (4, 16, 64, 256 bytes) for cache sizes 1 KB, 8 KB, 16 KB, 64 KB and 256 KB]

[Figure: block in memory, with the first word referenced marked]

HANDLING A CACHE MISS

MEMORY MANAGEMENT - cache

what happens to the current instruction when a cache miss occurs?

consider a miss when loading an instruction
instruction register contains
we'll have to when the value has loaded
PC ← (because )
initiate
write ; reset

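INTRINSITY FASTMATH PROCESSOR CACHES

MEMORY MANAGEMENT - cache

[Figure: the 32-bit address splits into an 18-bit tag (bits 31 to 14), a cache index (bits 13 to 6) selecting one of 256 entries, and a block offset (bits 5 to 2); each entry holds a valid bit V, the tag and 512 bits of data; a tag comparison produces hit, and the block offset selects the 32-bit data word]

instruction miss rate:
data miss rate: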


TRANSFERRING BLOCKS EFFICIENTLY - Wider Memory And Bus

MEMORY MANAGEMENT - cache

memory speed puts lower limit on latency of access to 1st word of a block

wider memory & bus allow parallel transfer of complete block
no reduction in latency of 1st word
overall transfer rate higher

memory bus clock cycles (10 x longer than processor cycles)

bus and memory width    1 word          2 words         4 words
cache block width       4 words         4 words         4 words
to send address         1 x 1           1 x 1           1 x 1
to access data          15 x 4          15 x 2          15 x 1
to send data            4 x 1           2 x 1           1 x 1
miss penalty            65              33              17
bandwidth achieved      (4 x 4) / 65    (4 x 4) / 33    (4 x 4) / 17
(bytes/cycle)           = 0.25          = 0.48          = 0.94

(4 x 4 is 4 words x 4 bytes per word, so the bandwidth is in bytes per bus cycle)

TRANSFERRING BLOCKS EFFICIENTLY - Interleaved Memory

MEMORY MANAGEMENT - cache

[Figure: three arrangements of CPU, cache and memory M: 1-word wide memory; multi-word wide memory; interleaved memories M1, M2, M3, M4]

interleaved memories
all memories can read in parallel
single-word bus
cycle count:
1 to send address
15 to access memory
4 x 1 to send data
bandwidth = (4 x 4) / 20 = 0.8 bytes/cycle
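The miss penalties and bandwidths for the different memory organisations all follow from the same three costs (1 cycle to send the address, 15 to access memory, 1 to send a word). A sketch for checking the arithmetic; the function names and the 4-bytes-per-word assumption are illustrative.

```python
def miss_penalty(block_words, width_words, send_addr=1, access=15, transfer=1):
    """Bus cycles to fetch a block when memory and bus are `width_words` wide."""
    transfers = block_words // width_words
    return send_addr + access * transfers + transfer * transfers

def interleaved_penalty(block_words, send_addr=1, access=15, transfer=1):
    """All banks access in parallel; words then cross the 1-word bus one at a time."""
    return send_addr + access + transfer * block_words

def bandwidth(block_words, penalty, bytes_per_word=4):
    """Bytes delivered per bus clock cycle during a miss."""
    return block_words * bytes_per_word / penalty

for label, penalty in [("1 word wide", miss_penalty(4, 1)),
                       ("2 words wide", miss_penalty(4, 2)),
                       ("4 words wide", miss_penalty(4, 4)),
                       ("interleaved", interleaved_penalty(4))]:
    print(label, penalty, round(bandwidth(4, penalty), 2))
```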


CACHE DESIGN ISSUES

MEMORY MANAGEMENT - cache

placement policy

size of blocks transferred to and stored in cache

memory update policy

cache size

replacement policy


MEASURING AND IMPROVING CACHE PERFORMANCE

MEMORY MANAGEMENT - cache

memory stall cycles

=

total CPU time = cycle time x (useful execution cycles + stall cycles)


MEASURING AND IMPROVING CACHE PERFORMANCE

MEMORY MANAGEMENT - cache

consider a machine with separate instruction and data caches
processor CPI (without memory stalls): 2
miss penalty (all misses): 100 cycles

for a particular program:
instructions executed: I
loads and stores as % of total instructions: 36%
data cache miss rate: 4%
instruction cache miss rate: 2%

how much faster would the program be if we eliminated misses?

total miss cycles = instruction misses + data misses
= I x 0.02 x 100 + I x 0.36 x 0.04 x 100 = 3.44 I

CPI including memory stalls = 2 + 3.44 = 5.44
now, speed is IPC = 1/CPI
speed increase = 5.44 / 2 = 2.72
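The same calculation can be parameterised so that other scenarios (a lower base CPI, a longer miss penalty after a clock-rate increase) can be tried. A sketch, with the function name `cpi_with_stalls` chosen for illustration:

```python
def cpi_with_stalls(base_cpi, miss_penalty, i_miss=0.02, d_miss=0.04, mem_fraction=0.36):
    """CPI including memory stalls, per instruction executed."""
    instruction_stalls = i_miss * miss_penalty               # every instruction is fetched
    data_stalls = mem_fraction * d_miss * miss_penalty       # only loads and stores touch data
    return base_cpi + instruction_stalls + data_stalls

existing = cpi_with_stalls(2, 100)

# eliminating misses leaves only the base CPI
no_misses = existing / 2
# halving the processor CPI leaves the stall cycles untouched
half_cpi = existing / cpi_with_stalls(1, 100)
# doubling the clock rate halves the cycle time but doubles the miss penalty in cycles
fast_clock = existing / (cpi_with_stalls(2, 200) * 0.5)

print(round(existing, 2), round(no_misses, 2), round(half_cpi, 2), round(fast_clock, 2))
```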

MEASURING AND IMPROVING CACHE PERFORMANCE

MEMORY MANAGEMENT - cache

how much faster would the program be if we eliminated misses?
speed increase = 2.72

what about improving the processor performance?
let's try reducing the Cycles Per Instruction by 50%

                      CPI (no misses)   memory stalls   CPI (total)
existing processor:   2                 3.44            5.44
improved processor:   1                 3.44            4.44

speed increase = 5.44 / 4.44 = 1.23

remember Amdahl's Law?
max speedup = total time / time that cannot be reduced

MEASURING AND IMPROVING CACHE PERFORMANCE

MEMORY MANAGEMENT - cache

how much faster would the program be if we eliminated misses?
speed increase = 2.72

what about improving the processor performance?
let's try reducing the Cycles Per Instruction by 50%
speed increase = 5.44 / 4.44 = 1.23

let's try doubling the clock rate (memory read/write times don't change)
miss penalty = 200 clock cycles (previously 100)
total miss cycles = instruction misses + data misses
= I x 0.02 x 200 + I x 0.36 x 0.04 x 200 = 6.88 I
CPI = 2 + 6.88 = 8.88

speed increase = execution time (slow clock) / execution time (fast clock)
= (I x CPI1 x clock cycle1) / (I x CPI2 x 0.5 x clock cycle1)
= 5.44 / (8.88 x 0.5)
= 1.23

MEASURING AND IMPROVING CACHE PERFORMANCE

MEMORY MANAGEMENT - cache

how much faster would the program be if we eliminated misses?
speed increase = 2.72

what about improving the processor performance?
let's try reducing the Cycles Per Instruction by 50%
speed increase = 5.44 / 4.44 = 1.23

let's try doubling the clock rate (memory read/write times don't change)
speed increase = 1.23

cache misses (& Amdahl) reduce impact of other improvements
increasing clock rate AND decreasing CPI incurs a double hit
with reduced cycles per instruction, stall cycles/overall cycles increases
when processor clock speeds increase, memory clock speeds don't
miss penalty (high clock speed processor) > miss penalty (low clock speed processor)

good cache design helps performance as much as increasing processor speed

ALTERNATIVE PLACEMENT POLICIES

MEMORY MANAGEMENT - cache

there are various schemes for placing blocks in cache
intended to reduce cache misses

direct mapping
cache index bits are subset of memory address
each block of memory locations has only one possible cache destination

fully associative mapping
memory blocks go anywhere in cache, source address is stored with them
cache access involves comparing address tag & cache tag at every cache location
multiple comparators → expensive hardware
suitable for L1 (only a few words) caches only

set-associative mapping
n-way SA cache has n blocks per set
each memory block maps to any element of a unique subset (>= 2) of the cache blocks
mapping to set is direct, mapping within set is associative

INDICATIVE DIAGRAMS

MEMORY MANAGEMENT - cache

[Figure: indicative diagrams of direct mapping, set associative mappings and fully associative mapping]

FORMAL SPECIFICATIONS - DIRECT MAPPING

MEMORY MANAGEMENT - cache

Consider a system with memory and cache both organised into blocks
blocksize, b = 2^w words, for some w
lines in cache, L = 2^k1 => cachesize = 2^(k1+w) words
blocks in memory, B = 2^k2 => memorysize = 2^(k2+w) words
hence addresses are k2 + w bits long

block line j = i mod L (i is memory block)

If
L = 4, k1 = 2
B = 8, k2 = 3
b = 4, w = 2
then these mappings hold:
M0 → C0
M4 → C0
M1 → C1
M5 → C1
etc.

[Figure: memory M with 8 blocks (0 to 7) and cache C with 4 block frames (0 to 3); the real address is a k2-bit block number followed by a w-bit word offset; the cache address is the tag, a k1-bit line number and the w-bit word offset]
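The rule j = i mod L amounts to slicing the address into three bit fields. A minimal sketch, assuming word addresses and the w, k1 notation used above (the function name `split_address` is hypothetical):

```python
def split_address(address, w, k1):
    """Split a word address into (tag, cache line, word within block)."""
    word = address & ((1 << w) - 1)     # bottom w bits select the word in the block
    block = address >> w                # memory block number i
    line = block % (1 << k1)            # j = i mod L, with L = 2^k1
    tag = block >> k1                   # the remaining high-order bits
    return tag, line, word

# L = 4 (k1 = 2), b = 4 (w = 2): blocks M1 and M5 both land in cache line C1
for block in (0, 4, 1, 5):
    print(block, split_address(block * 4, w=2, k1=2))
```

Blocks that share a line are told apart by the tag, which is why the tag has to be stored in the cache alongside the data.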

FORMAL SPECIFICATIONS - Fully Associative Mapping

MEMORY MANAGEMENT - cache

direct mapping
desired data may be in only one cache location
though mapping is many-to-one

fully associative mapping
desired data may be in any of several cache locations
the one which contains the word addressed (if any) must then be identified
many-to-many mapping

again, divide memory into 2^k2 blocks of 2^w words. (k2 + w)-bit addresses
each cache line contains
data field (2^w words)
a tag field (top k2 bits of the block's address)
equality-detect circuit (tag field = top k2 bits of address)

at each memory reference
if tag matches, cache contains the addressed data
the equality detect signal acts as a "line select" to allow I/O on the appropriate word of the line

4-WAY FULLY-ASSOCIATIVE CACHE

MEMORY MANAGEMENT - cache

[Figure: the address (bits 31 to 0) supplies a tag and a cache index; four ways each hold V, tag and data; the four tag comparisons produce hit and select the data output]

SET-ASSOCIATIVE CACHE

MEMORY MANAGEMENT - Cache

amalgam of direct and associative schemes
cache has structure
lines are grouped into
block has words - as before
memory organisation
cache has

[Figure: real address = block index (k2 bits) + word (w bits); cache address = tag (stored in cache with data) + set no (k0 bits) + word (w bits)]

mapping works at first
set j = i mod S (i is a block in main memory)

Then, after : associative search for block
Then: to find specific word
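The two-step lookup (direct mapping to a set, then an associative search inside it) can be sketched as below. The class name, the use of block numbers rather than full addresses, and the FIFO choice of victim are all assumptions made to keep the example short.

```python
class SetAssociativeCache:
    def __init__(self, num_sets, ways):
        self.num_sets = num_sets
        self.ways = ways
        self.sets = [[] for _ in range(num_sets)]   # each set holds up to `ways` tags

    def lookup(self, block):
        """Return True on a hit; on a miss, bring the block into its set."""
        tags = self.sets[block % self.num_sets]     # step 1: set j = i mod S (direct)
        tag = block // self.num_sets
        if tag in tags:                             # step 2: associative search in the set
            return True
        if len(tags) == self.ways:
            tags.pop(0)                             # assumption: oldest block in the set goes
        tags.append(tag)
        return False

cache = SetAssociativeCache(num_sets=2, ways=2)
results = [cache.lookup(b) for b in (0, 2, 0, 4, 2, 0)]
print(results)
```

Blocks 0, 2 and 4 all map to set 0, so with 2 ways the third distinct block forces one of the others out.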

2-WAY SET-ASSOCIATIVE CACHE

MEMORY MANAGEMENT - cache

[Figure: the address (bits 31 to 0) supplies a tag and a cache index that selects a set; each way in the set holds V, tag and data; tag comparisons produce hit and select the data output]

L1, L2, L3

MEMORY MANAGEMENT - Cache

L1 cache
typically 8KB - 128KB
part of processor core
fast technology (SRAM)
processor speed

L2 cache
typically 256KB - 1MB
originally off chip; now often on
½ or ¼ processor speed

L3 cache
increasingly common
typically 16MB - 256MB
off chip (but sometimes on the same die)
expensive; used in high-end processors
½ L2 cache speed

Some L2 and even L3 caches run at processor speeds.

So what's the point of smaller L1 cache?

A more sophisticated, more expensive, more efficient memory mapping policy is desirable but only cost-effective on a small scale???

MEMORY MANAGEMENT - Cache

CPU                       Cache size in the CPU
80486DX and DX2           8 KB L1
80486DX4                  16 KB L1
Pentium                   16 KB L1
Pentium Pro               16 KB L1 + 256 KB L2 (some 512 KB L2)
Pentium MMX               32 KB L1
AMD K6 and K6-2           64 KB L1
Pentium II and III        32 KB L1
Celeron                   32 KB L1 + 128 KB L2
Pentium III Cumine        32 KB L1 + 256 KB L2
AMD K6-3                  64 KB L1 + 256 KB L2
AMD K7 Athlon             128 KB L1
AMD Duron                 128 KB L1 + 64 KB L2
AMD Athlon Thunderbird    128 KB L1 + 256 KB L2

MEMORY MANAGEMENT - Cache

[Figure: Pentium CPU containing registers and L1 cache (16KB), connected by the system bus to L2 cache (256KB), RAM (32M) and the I/O busses]

more modern processors: L2 cache on the processor chip

[Figure: CPU with L2 cache on the same chip]

BLOCK REPLACEMENT POLICY

MEMORY MANAGEMENT - Cache

direct mapping cache
no choice; incoming block can only go into one slot

associative cache
any block could go

set-associative cache
incoming block can only go into one set
any block in selected set could go

FIFO

Least Recently Used; significantly better performance than FIFO
each set has a reference number
when set with reference no. n is referenced
reference nos < n are incremented
reference no of referenced set is set to 0
block-to-go is always block with largest reference number
expensive hardware when sets are large.
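The reference-number scheme can be tried out in a few lines. This sketch keeps one reference number per block slot within a single set, which is an interpretation of the slide; the class and method names are hypothetical.

```python
class LRUSet:
    def __init__(self, ways):
        self.blocks = [None] * ways
        self.ref = list(range(ways))     # distinct reference numbers 0 .. ways-1

    def touch(self, slot):
        n = self.ref[slot]
        for s in range(len(self.ref)):
            if self.ref[s] < n:          # reference nos < n are incremented
                self.ref[s] += 1
        self.ref[slot] = 0               # the referenced slot becomes most recent

    def access(self, block):
        """Return True on a hit; on a miss replace the least recently used block."""
        hit = block in self.blocks
        if hit:
            slot = self.blocks.index(block)
        else:
            slot = self.ref.index(max(self.ref))   # largest reference number goes
            self.blocks[slot] = block
        self.touch(slot)
        return hit

lru = LRUSet(ways=2)
hits = [lru.access(b) for b in ("A", "B", "A", "C")]
survivors = sorted(lru.blocks)
print(hits, survivors)
```

After A, B, A the least recently used block is B, so bringing in C ousts B and keeps A.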

THE NEXT PHASE

MEMORY MANAGEMENT – virtual memory

[Figure: memory hierarchy: processor registers, cache (L1, L2, L3), main (1°) memory, backing (2°) store, archive; Virtual Memory links main memory and backing store; auto archival and file retrieval link backing store and archive]

Memory management blurs the distinctions to make memory seem
as big as possible
as fast as possible
as cheap as possible
as secure as possible

VIRTUAL MEMORY'S RAISONS D'ETRE

MEMORY MANAGEMENT – virtual memory

Original VM let programs use a memory space larger than physical memory
programmers had to divide programs into mutually exclusive overlays (code & data)
program had to control loading of its own overlays
VM automatically maps program pages onto physical memory addresses

Also allowed multiple programs to run simultaneously
independent virtual address spaces
memory protection
predominant use today

[Figure: installed memory compared with program 1, program 2, program 3 and program 4: individually smaller, together larger]

GENERAL IDEA

MEMORY MANAGEMENT – virtual memory

program code and data are stored as fixed-sized units called pages
pages live on disk, have a disk address, are copied into memory when necessary
memory operations use (Virtual Page no, page offset)

program's code (and data) pages don't have to be contiguous

address translation unit converts VP no. into base address of page in physical memory
base address + page offset produces real address of data
bits in offset → page size
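A minimal sketch of the translation step, assuming 4 KB pages (12 offset bits) and a Python dict standing in for the address translation unit's table; the names `translate` and `PAGE_OFFSET_BITS` are illustrative.

```python
PAGE_OFFSET_BITS = 12                    # assumption: 4 KB pages

def translate(virtual_address, page_table):
    """Convert (virtual page no, page offset) into a physical address."""
    vpn = virtual_address >> PAGE_OFFSET_BITS
    offset = virtual_address & ((1 << PAGE_OFFSET_BITS) - 1)
    if vpn not in page_table:
        raise LookupError(f"page fault on virtual page {vpn}")
    base = page_table[vpn] << PAGE_OFFSET_BITS    # base address of the page in memory
    return base + offset

# virtual pages 0 and 3 are resident; their physical pages need not be contiguous
page_table = {0: 7, 3: 2}
print(hex(translate(0x3ABC, page_table)))
print(hex(translate(0x0010, page_table)))
```

The offset passes through unchanged; only the page number is looked up.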

THE ISA MODEL AND REALITY

MEMORY MANAGEMENT – virtual memory

[Figure: the ISA model: the processor addresses a virtual memory space using base and offset; the reality: processor, address translator (address a becomes a'), main memory, fault handler and 2° memory]

PRAGMATIC DECISIONS

MEMORY MANAGEMENT – virtual memory

page fault costs millions of clock cycles
mostly latency of first word
later words arrive comparatively rapidly
so make the page size big enough to repay cost of page fault
4KB – 64KB

it's worth putting considerable effort into reducing page faults
fully associative page placement

long disk access allows time for (complex) software to handle page faults

long write time justifies complexity of write-back over write-through

with fixed size pages, (page no., offset) boundary in addresses is transparent
cf. variable-sized segments, where software manipulates segment no. & offset explicitly

memory protection can use the same mechanisms as virtual memory

PAGE PLACEMENT

MEMORY MANAGEMENT – virtual memory

virtual pages can go anywhere in memory
huge miss penalty means it's worth using complex algorithm, & data structures

[Figure: virtual page number → page table → physical page number]

each process has a page table
page table lives in memory
page table register points at start of page table
to perform a process swap, point page table register at a different process's page table
(and swap the program counter and processor registers)

PAGE TABLE

MEMORY MANAGEMENT – virtual memory

[Figure: the page table register points to the page table; the virtual address is a 20-bit virtual page number (bits 31 to 12) and a 12-bit page offset (bits 11 to 0); each page table entry holds a valid bit V (if 0, then page is not in memory) and an 18-bit physical page number; the physical address is the physical page number (bits 29 to 12) followed by the page offset (bits 11 to 0)]

32-bit virtual address, 30-bit physical address (2 bottom bits = 0)
virtual address space 4 x larger than physical address space

page table stored in (32-bit) main memory has 19-bit entries
13 extra bits (not shown) for page protection information

a similar disk page table holds disk addresses

PAGE FAULTS

MEMORY MANAGEMENT – virtual memory

not all of a program's pages have to be in memory while it's running
only pages that have been referenced since the process was swapped in

but disk space is much more abundant
OS usually reserves enough disk space for all the process's pages
called the swap space

OS is responsible for handling page faults
hardware detects that valid bit for selected page is FALSE

PAGE REPLACEMENT

MEMORY MANAGEMENT – virtual memory

if the OS needs to bring in a page and all pages in main memory are in use
it must oust a page
which page? main memory is a fully associative store

page-to-go should be about to be unused for as long as possible
predicting the future is a difficult business
prediction: Least Recently Used page will be Furthest Future Use page

ousted page goes to swap space

strict LRU algorithm would collect stats at every memory reference
too expensive
instead, each page reference sets a reference bit for the page
reset periodically, tested after standard delay
any page with reference bit still reset is Not Recently Used – can be paged out
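The reference-bit approximation to LRU is simple enough to sketch directly. The class name and the way the periodic reset is triggered are assumptions; in a real system the hardware sets the bits and the OS clears and inspects them.

```python
class ReferenceBits:
    def __init__(self, pages):
        self.referenced = {page: False for page in pages}

    def touch(self, page):
        """Every reference to a page sets its reference bit."""
        self.referenced[page] = True

    def reset(self):
        """Done periodically by the OS."""
        for page in self.referenced:
            self.referenced[page] = False

    def not_recently_used(self):
        """Pages whose bit is still reset are candidates for paging out."""
        return [page for page, used in self.referenced.items() if not used]

bits = ReferenceBits(pages=[0, 1, 2, 3])
bits.touch(0)
bits.reset()                 # start of a new observation interval
bits.touch(1)
bits.touch(3)
candidates = bits.not_recently_used()
print(candidates)
```

One bit per page is far cheaper than exact LRU bookkeeping, at the cost of only separating "used in this interval" from "not used".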

PAGE TABLE SIZES

MEMORY MANAGEMENT – virtual memory

for a machine with 32-bit addresses, 32-bit page table entries and 4KB pages
how many page table entries? 2^32 / 2^12 = 2^20
how big is the page table? 2^20 x 2^2 = 2^22 B = 4MB
typically scores or hundreds of processes running at a time
so, maybe 400MB in total

how to minimise memory usage generally and size of page table in particular?
use dynamically-sized page table; only big if program is page-greedy
keep last page register (aka page limit register)
forces page table to grow in 1 direction only
but stack and heap usually grow in opposite directions

[Figure: processes' address space: code, static data (constants, arrays), dynamic data (lists, trees), stack]
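The page table size arithmetic, written out so that other address widths and page sizes can be tried. The figure of 100 processes is an assumption chosen to match "maybe 400MB in total".

```python
def page_table_size(address_bits, page_bytes, entry_bytes):
    """Return (number of entries, table size in bytes) for a flat page table."""
    entries = 2 ** address_bits // page_bytes
    return entries, entries * entry_bytes

entries, size = page_table_size(address_bits=32, page_bytes=4 * 1024, entry_bytes=4)
processes = 100                                   # assumed number of running processes
total_mb = processes * size // 2 ** 20
print(entries, size // 2 ** 20, total_mb)
```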


PAGE TABLE SIZES

MEMORY MANAGEMENT – virtual memory

how to minimise memory usage generally and size of page table in particular?

use dynamically-sized page table; only big if program is page-greedy
keep last page register (aka page limit register)
forces page table to grow in 1 direction only
but stack and heap usually grow in opposite directions
separate page tables, with 2 page limit registers, 1 for up, 1 for down
top bit of address differentiates between top & bottom segments of address space

inverted page table
only in-use pages are stored; can't use address to index page's entry, must search
to reduce search time, hash the addresses' entries

multi-level (tree-structured) page table
complex, but suits non-contiguous pages
highest order bits address a "segment"; if valid, lower bits address page in segment

page the page table
increases no. of page faults; lock some pages of page table in memory

HANDLING WRITE OPERATIONS

MEMORY MANAGEMENT – virtual memory

cache can use writethrough
memory only hundreds of times slower than registers
small writethrough buffer masks write latency

VM write has to wait for disk access
millions of clock cycles
buffer would be impractical
writeback used instead
disk write only occurs when essential – when OS overwrites a dirty page in memory

THE TLB – HANDLING THE OVERHEAD

MEMORY MANAGEMENT – virtual memory

page table resides in memory
ordinary read or write (Instr. Fetch or lw/sw instruction) involves 2 memory accesses
1 to convert virtual address to physical address
1 to access the data

to reduce memory references
principle of locality applies to page table entries too
Translation Lookaside Buffer is a translation cache
cf. scraps of paper provided by libraries for writing library call no. on
maintains list of locations of a subset of pages

replacement policy difficult
software too slow; hardware too expensive for complex policy (e.g. LRU)
often randomly choose entry-to-go
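A TLB with random replacement can be modelled in a few lines. This is a software sketch under stated assumptions: a dict as the page table, a dict as the TLB, and Python's `random.Random` standing in for whatever the hardware uses to pick the entry-to-go.

```python
import random

class TLB:
    def __init__(self, capacity, page_table, seed=0):
        self.capacity = capacity
        self.page_table = page_table          # full table, "in memory"
        self.entries = {}                     # virtual page no -> physical page no
        self.hits = 0
        self.misses = 0
        self.rng = random.Random(seed)

    def lookup(self, vpn):
        if vpn in self.entries:
            self.hits += 1                    # translation without a memory access
            return self.entries[vpn]
        self.misses += 1
        ppn = self.page_table[vpn]            # the extra memory access
        if len(self.entries) >= self.capacity:
            victim = self.rng.choice(list(self.entries))   # randomly chosen entry-to-go
            del self.entries[victim]
        self.entries[vpn] = ppn
        return ppn

tlb = TLB(capacity=2, page_table={0: 10, 1: 11, 2: 12})
translations = [tlb.lookup(vpn) for vpn in (0, 0, 1, 2)]
print(translations, tlb.hits, tlb.misses)
```

Locality is what makes this pay off: repeated references to the same page hit in the TLB and skip the page table access.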

TLB CIRCUIT OUTLINE

MEMORY MANAGEMENT – virtual memory

[Figure: the virtual address is a 20-bit virtual page number (bits 31 to 12) and a 12-bit page offset (bits 11 to 0); the virtual page number is compared with the Tag of every TLB entry (each entry also holds V, Dirty and a Physical Page Number) to give TLB hit; the Physical Page Number and Page Offset form the physical address, which splits into an 18-bit Physical Address Tag, an 8-bit Cache Index, a 4-bit block offset and a byte offset; the 256-entry cache compares tags to give cache hit and delivers data]

if no hit in TLB, use page table
if no hit there, get page from disk
if no cache hit, use memory


USING THE TLB FOR A READ/WRITE OPERATION

MEMORY MANAGEMENT – virtual memory

virtual address
TLB access
TLB hit?
  No: TLB miss exception
  Yes: TLB provides physical address
    write?
      No: try to read data from cache
        cache hit?
          No: cache miss stall
          Yes: deliver data to the CPU
      Yes: write access bit on?
        No: write protection exception
        Yes: write data into cache, update the tag, put data & address into the write buffer

THE SYNERGY BETWEEN VMAND MEMORY PROTECTION

MEMORY MANAGEMENT – virtual memory

Machines run multiple processes "simultaneously"
on single processor machines only one process is active at a time
simple multitasking swaps between processes when I/O occurs
preemptive multitasking swaps at short intervals

gives users on multiuser systems the impression of sole access
allows modern multiprocessing systems to handle hundreds of processes

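[Figure: process state cycle. The active process issues an I/O request and joins the processes engaged in I/O; when its I/O terminates it joins the ready processes, from which the next active process is taken.]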
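
The movement of processes between the active, I/O and ready states can be sketched as a tiny scheduler. This is a sketch under assumptions: the class and method names are hypothetical, the ready processes are served first-in first-out, and `preempt` stands in for the timer that drives preemptive multitasking.

```python
from collections import deque


class Scheduler:
    """One active process, a queue of ready processes, a set of processes doing I/O."""

    def __init__(self, names):
        self.ready = deque(names)
        self.waiting_io = set()
        self.active = None
        self._dispatch()

    def _dispatch(self):
        """Make the first ready process active (or leave the CPU idle)."""
        self.active = self.ready.popleft() if self.ready else None

    def io_request(self):
        """The active process starts I/O, so another ready process takes over."""
        if self.active is not None:
            self.waiting_io.add(self.active)
        self._dispatch()

    def io_terminates(self, name):
        """A process whose I/O has finished becomes ready again."""
        self.waiting_io.remove(name)
        self.ready.append(name)
        if self.active is None:
            self._dispatch()

    def preempt(self):
        """Timer interval expires: the active process goes to the back of the ready queue."""
        if self.active is not None:
            self.ready.append(self.active)
        self._dispatch()


s = Scheduler(["P1", "P2", "P3"])   # P1 active, P2 and P3 ready
s.io_request()                      # P1 waits for I/O, P2 becomes active
s.preempt()                         # P2 returns to the ready queue, P3 becomes active
s.io_terminates("P1")               # P1 is ready again
```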


VM allows processes
a process's pages can be
what if one process addresses outside its own space?


PROTECTION REQUIREMENTS

MEMORY MANAGEMENT – virtual memory

separate user and OS (supervisor) modes
some instructions available

ability to make certain information read-only for user processes

mechanism for swapping between modes
transfers control to

store in return from exception

put page tables in
allows OS to
prevents user process from
prevents user process from


IMPLEMENTING A PROCESS SWITCH

MEMORY MANAGEMENT – virtual memory

on a machine without a TLB, this is comparatively simple
point page table register at

consider a switch from P1 to P2

on a machine with a TLB
need to
need to

replacing all of P1's entries in TLB can be inefficient if:

problem: virtual address spaces the same
solution: make them different

give each process
OS remembers put
at each memory access,
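
One common way to make the address spaces different is to tag every TLB entry with an identifier for the process that owns it. The sketch below assumes that scheme; the slide leaves its own details blank, so the identifier, the register holding it and all names here are assumptions for illustration only.

```python
class TaggedTLB:
    """TLB sketch whose entries are keyed by (process id, virtual page number)."""

    def __init__(self):
        self.entries = {}        # (pid, vpn) -> physical page number
        self.current_pid = None  # assumed register loaded by the OS on a process switch

    def switch_to(self, pid):
        """Process switch: only the current-process register changes, nothing is flushed."""
        self.current_pid = pid

    def insert(self, vpn, ppn):
        self.entries[(self.current_pid, vpn)] = ppn

    def lookup(self, vpn):
        """Hit only if the entry belongs to the process that is running now."""
        return self.entries.get((self.current_pid, vpn))


tlb = TaggedTLB()
tlb.switch_to(1)
tlb.insert(0x5, 0x100)            # P1 maps its virtual page 5
tlb.switch_to(2)
p2_before = tlb.lookup(0x5)       # P2 uses the same virtual page number: miss
tlb.insert(0x5, 0x200)
tlb.switch_to(1)
p1_after = tlb.lookup(0x5)        # P1's entry survived the switch
```

With this arrangement a switch from P1 to P2 and back leaves P1's translations in place, so they do not have to be reloaded.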


SHARING INFORMATION

MEMORY MANAGEMENT – virtual memory

in general, one process should not be able to
TLB has that prevents a process from

sometimes processes need to be able to share information
P1 wants to access information in a page owned by P2

process P2 asks OS to create a new page table entry
page table entry goes into P1's virtual address space but accesses P2's physical page
P2 can ask OS to set write protection bit in P1's page so P1 can't update the page
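
The sharing steps above can be sketched with per-process page tables. This is a minimal illustration: the class `TinyOS`, its methods and the page numbers are hypothetical, each page table entry is reduced to a physical page number plus a write-permission flag, and physical memory is modelled as one value per page.

```python
class TinyOS:
    """Keeps one page table per process: vpn -> (physical page number, writable)."""

    def __init__(self):
        self.page_tables = {}
        self.memory = {}         # physical page number -> contents (one value per page)

    def map_page(self, proc, vpn, ppn, writable=True):
        self.page_tables.setdefault(proc, {})[vpn] = (ppn, writable)

    def share_page(self, owner, owner_vpn, guest, guest_vpn, writable=False):
        """New entry in the guest's page table pointing at the owner's physical page."""
        ppn, _ = self.page_tables[owner][owner_vpn]
        self.map_page(guest, guest_vpn, ppn, writable)

    def read(self, proc, vpn):
        ppn, _ = self.page_tables[proc][vpn]
        return self.memory.get(ppn)

    def write(self, proc, vpn, value):
        ppn, writable = self.page_tables[proc][vpn]
        if not writable:
            raise PermissionError(f"{proc} may not write virtual page {vpn}")
        self.memory[ppn] = value


system = TinyOS()
system.map_page("P2", vpn=7, ppn=0x40)          # page owned by P2
system.share_page("P2", 7, "P1", guest_vpn=3)   # P1 gets a write-protected view
system.write("P2", 7, "hello")
seen_by_p1 = system.read("P1", 3)               # P1 reads P2's physical page
```

An attempt by P1 to call `system.write("P1", 3, ...)` raises `PermissionError`, which plays the role of the write protection bit.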