159233 computer architecture building up the …plyons/159233 (computer... · · 2010-02-18t r a...

© Paul Lyons 2010~ 1 ~

159233 Computer Architecture




INTRODUCTION
IC FABRICATION
THE ISA
REPRESENTING HLL CONSTRUCTS
PERFORMANCE
COMPUTER ARITHMETIC
BUILDING UP THE DATAPATH
FLOATING POINT NUMBERS
SINGLE-CYCLE ARCHITECTURE
MULTI-CYCLE ARCHITECTURE
PIPELINING
EXCEPTIONS
MEMORY MANAGEMENT
  cache
  virtual memory



INTRODUCTION

Physical properties

ABSTRACTION

Concepts

A device

sector

track

Concepts



LEVELS OF ABSTRACTION

INTRODUCTION

Computer systems use technology to simulate the human world

Human thought processes

ISA

Gates

Data processing modules

native data types, instruction set, registers, addressing modes

interrupts, exception handling

I/O handling

Transistors

cycles per instruction, physical registers

Machine code

Assembly language

High level languages

01100000101011

add A, B

C = (A + B)*3



A GENERIC COMPUTER

INTRODUCTION

processor



MEMORY HIERARCHY

INTRODUCTION



THE IC INDUSTRY

IC FABRICATION

Very large market

Very few products

High rate of development

Long development times

Multiple generations in simultaneous development

Discontinuous technological change



PRODUCING THE WAFERS

IC FABRICATION



IC FABRICATION

[Figure: silicon (Si) crystal lattice]

DOPING THE WAFER

[Figure: doped wafer; labels P, − and +]


HOW A CMOS TRANSISTOR WORKS

IC FABRICATION



MAKING A CMOS TRANSISTOR

IC FABRICATION



THE MIPS COMPUTER

THE ISA

a popular microprocessor (a billion sold?)

RISC architecture

CISC: slow memory, assembly language era
RISC: fast memory, HLL era
  simple instructions cut clock cycles to 1
  compilers issue complex instruction sequences
  single addressing mode per instruction
  instructions that operate only on registers
  small controller
  large no. of registers
  hardwired instructions


Architecture

Machine Language

Instruction Set

Compilers

Design Goal



THE ADD INSTRUCTION

THE ISA

The MIPS computer has a 3-address architecture

add a, b, c    # a = b + c
sub a, b, c    # a = b - c

add a, b, c    # a = b + c
add a, a, d    # a = a + d
add a, a, e    # a = a + e
               # a contains the sum of b, c, d, & e

move $8, $19       # r8 ← r19 - desired behaviour
add  $8, $0, $19   # r8 ← 0 + r19 - actual implementation


EXPRESSION TREES AND EVALUATION ORDER

THE ISA

[Figure: expression trees for the sum of b, c, d and e: a balanced tree, (b + c) + (d + e), and a left-deep tree, ((b + c) + d) + e, with the result in a]


THE REGISTERS

THE ISA

$0, $1, … $31

address calculation, stack pointers as well as data storage

Register  Name(s)    Use
0         $zero
1
2-3       $v0-$v1
4-7       $a0-$a3
8-15      $t0-$t7
16-23     $s0-$s7
24-25     $t8-$t9
26-27
28        $gp
29        $sp
30        $fp
31        $ra

RISC DESIGNS FAVOUR SIMPLICITY

THE ISA

32-bit instructions
two types: R(egister)-type & I(mmediate)-type

R-type   op      rs      rt      rd      shamt   funct
         6 bits  5 bits  5 bits  5 bits  5 bits  6 bits

I-type   op      rs      rt      constant or address
         6 bits  5 bits  5 bits  16 bits

I-type instructions have a constant operand or access memory



THE ISA

(Yeah, right)


x = h + a[8]

let's say
  x is in the register called $t0 in the assembler, actual reg. 8
  h is in the register called $s2 in the assembler, actual reg. 18
  array a starts at the location contained in reg. $s3, actual reg. 19

load $t0 with the word in $s2 plus the word 8 up from the address in $s3
  lw  $t0, 32($s3)
  add $t0, $s2, $t0

fields of the lw instruction (op, rs, rt, offset):
  35 19 8 32

(psst; there's a third type as well: J-type, for jump instructions)
regularity is an ideal, but good compromises must sometimes be made
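The field layouts above can be checked by packing the example instructions into 32-bit words. This is a minimal sketch: the function names are hypothetical, and the funct value 32 for add is taken from the function code 100000 listed in the ALU control table later in these slides.

```python
def encode_i_type(op: int, rs: int, rt: int, imm: int) -> int:
    """Pack op (6 bits), rs (5), rt (5) and a 16-bit constant into one word."""
    return (op << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)


def encode_r_type(op: int, rs: int, rt: int, rd: int, shamt: int, funct: int) -> int:
    """Pack op (6), rs (5), rt (5), rd (5), shamt (5) and funct (6) into one word."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


# lw $t0, 32($s3): op = 35, rs = 19 ($s3), rt = 8 ($t0), offset = 32
lw_word = encode_i_type(35, 19, 8, 32)

# add $t0, $s2, $t0: op = 0, rs = 18 ($s2), rt = 8 ($t0), rd = 8 ($t0), funct = 32
add_word = encode_r_type(0, 18, 8, 8, 0, 32)

print(f"{lw_word:032b}")
print(f"{add_word:032b}")
```

Both formats keep op, rs and rt in the same bit positions, which is the regularity the slide refers to.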



COMPILATION

THE ISA

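HLL code
  f = (g+h) – (i+j)

variable-register associations
  name  register
  f     $s0
  g     $s1
  h     $s2
  i     $s3
  j     $s4

expression tree
  g + h → $t0
  i + j → $t1
  $t0 - $t1 → $s0

assembler
  add $t0, $s1, $s2    # $t0 = g + h
  add $t1, $s3, $s4    # $t1 = i + j
  sub $s0, $t0, $t1    # f = $t0 - $t1

machine code

memory image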

REGISTER ↔ MEMORY TRANSFERS

THE ISA

complex programs are difficult to write, with only 32 registers

instructions for storing data to memory and loading data from memory

memory works like a big 1-D array, addressed by byte

if $19 contains start, then lw $8, 12($19) loads 1 into register $8

  address      contents
  start        100
  start + 4    1000
  start + 8    10
  start + 12   1
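The byte-addressed load above can be modelled with a flat byte array. This is a sketch under stated assumptions: the value of start (16), the memory size and big-endian byte order are choices made for the illustration, and load_word is a hypothetical helper, not a MIPS facility.

```python
WORD_BYTES = 4


def load_word(memory: bytearray, base: int, offset: int) -> int:
    """Model lw: read the 4-byte word at byte address base + offset."""
    address = base + offset
    if address % WORD_BYTES != 0:
        raise ValueError("word accesses must be aligned to 4 bytes")
    return int.from_bytes(memory[address:address + WORD_BYTES], "big", signed=True)


memory = bytearray(64)   # memory as a 1-D array addressed by byte
start = 16               # assumed contents of $19

# Store the four words from the slide at start, start+4, start+8, start+12.
for index, value in enumerate([100, 1000, 10, 1]):
    address = start + WORD_BYTES * index
    memory[address:address + WORD_BYTES] = value.to_bytes(WORD_BYTES, "big", signed=True)

reg8 = load_word(memory, start, 12)   # lw $8, 12($19)
print(reg8)
```

Consecutive words are 4 bytes apart, so the offset in the instruction is 4 times the array index.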



MORE ABOUT MEMORY

THE ISA

Compiler also allocates

2^32 bytes = 4,294,967,296 bytes
2^30 words = 1,073,741,824 words

Large address space means access times

Compiler tries to keep spills


BRANCHES

THE ISA

Computers must have

mostly, that's just that's why

sometimes it needs to that's why the PC is the choice depends on

beq $1, $2, L1 # branch to L1 if $1 and $2 are equal

PC is loaded with PC is incremented

bne $1, $2, L1 # branch to L1 if $1 and $2 not equal



TESTS USING INEQUALITY

THE ISA

SLT (Set on Less Than)
compares
SLT $r1, $r2, $r3

C Equivalent:

BLT (Branch on Less Than)

would need simpler & more regular to use

SLTI (Set on Less Than Immediate)
compares
SLTI $r1, $r2, number

TESTS USING INEQUALITY

THE ISA

SLTU (Set on Less Than Unsigned)

SLTIU (Set on Less Than Immediate Unsigned)

C equivalent:


IF STATEMENTS

REPRESENTING HLL CONSTRUCTS

if (a==b)
    c = d+e;
else
    c = d-e;

register allocations
  c    d    e    a    b
  $16  $17  $18  $19  $20


LOOPS


while loop
  while (this[i] == k)
      i = i + j

register allocations
  i    j    k    4 (constant)
  $19  $20  $21  $10

repeat loop
  repeat
      i = i + j
  until (this[i] == k)

?


SUBROUTINE CALLS


Another variety ofsaves

jal procAddress

jr $31

For nested procedure calls, stack is spilled into memory
$sp contains

stack lives at top end of memory, & grows downwards

Parameter passing uses registers
for a nested subroutine call,

caller save: called subroutine can then use any registers
callee save: calling subroutine doesn’t have to restore registers


IMMEDIATE INSTRUCTIONS TO OPERATE ON CONSTANTS


addi $29, $29, 4 # sp = sp – 1!

lui <regn>, <16-bit const>

addi $8, $8, 96

00011100000001000000000011111111

lui $8, 255

00000000111111110000000000000000 r8

00100000100001000000000000110000

0000000000110000
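The lui/addi pair on this slide can be modelled as two small functions. This is a sketch only: lui, addi and sign_extend16 are hypothetical Python helpers mimicking the instructions, and the operands 255 and 96 are the ones in the assembly text above.

```python
def sign_extend16(value: int) -> int:
    """Interpret the low 16 bits of value as a signed 16-bit number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def lui(constant16: int) -> int:
    """Load upper immediate: constant goes in bits 31..16, low half is zero."""
    return (constant16 & 0xFFFF) << 16


def addi(register: int, constant16: int) -> int:
    """Add a sign-extended 16-bit constant, wrapping to 32 bits."""
    return (register + sign_extend16(constant16)) & 0xFFFFFFFF


r8 = lui(255)        # upper half = 255
upper_only = r8
r8 = addi(r8, 96)    # fill in the lower half
print(f"{upper_only:032b}")
print(f"{r8:#010x}")
```

Because addi sign-extends its constant, a lower half of 0x8000 or more would need the upper half adjusted (or an unsigned OR-immediate instead); the sketch does not handle that case.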



DESIGN PRINCIPLES


Smaller is faster
  more registers → greater area → slower clock

Simplicity favours regularity
  decoding is faster with

Good design demands good compromise
  R-type, I-type, and J-type instructions are all

Make the common case fast
  immediate instructions don’t often involve big constants
  so 16-bit constants are OK, with lui only needed occasionally


PERFORMANCE METRICS

PERFORMANCE

throughput
  total work accomplished in a given time

execution time
  time for a given job
  performance (rate or speed) =

if performanceX > performanceY

then


CPU TIME, I/O TIME, AND WALL CLOCK TIME

PERFORMANCE

CPU time is

access times of are commonaccess time = + +

CPU I/O CPU I/O CPU

CPU time

elapsed time



FACTORS INFLUENCING PERFORMANCE

PERFORMANCE

hardware-related factors
  ISA implementation
  CPU cycle time
  bus cycle time
  caching
  parallelism
  pipelining

software-related factors
  user algorithm
  operating system
  compilers


PERFORMANCE MEASURES; THE CLOCK CYCLE

PERFORMANCE

clock cycle is
  e.g. 10ns

clock rate is
  e.g. 4GHz

no. of clock cycles per instruction is CPI – Cycles Per Instruction – also a factor

execution time = no. of instructions x CPI x clock cycle time
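The product on this slide can be evaluated directly. This is a sketch assuming the third factor is the clock cycle time (the reciprocal of the clock rate); the instruction count and CPI below are made-up illustration values.

```python
def clock_cycle_time(clock_rate_hz: float) -> float:
    """Clock cycle time in seconds is the reciprocal of the clock rate."""
    return 1.0 / clock_rate_hz


def execution_time(instruction_count: float, cpi: float, cycle_time_s: float) -> float:
    """execution time = no. of instructions x CPI x clock cycle time."""
    return instruction_count * cpi * cycle_time_s


# Hypothetical program: 1 million instructions at an average CPI of 2.
time_at_4ghz = execution_time(1_000_000, 2.0, clock_cycle_time(4e9))
time_at_10ns = execution_time(1_000_000, 2.0, 10e-9)
print(time_at_4ghz, time_at_10ns)
```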



PERFORMANCE COMPONENTS

PERFORMANCE

time = no. of instructions x CPI x clock cycle time

$\text{CPU clock cycles} = \sum_{i=1}^{n} \mathrm{CPI}_i \times (\text{no. of instructions})_i$
if an instruction set includes n different classes of instruction

MFLOPS: Millions of Floating Point Operations/second

MIPS: Millions of Instructions/second
  if CPI = 1, MIPS =
  difficult to compare ISAs, as
  difficult to compare programs, as
  does more MIPS automatically mean faster execution?

no one of these is a full measure of performance


PERFORMANCE COMPONENTS

PERFORMANCE

MIPS: Millions of Instructions/second
does more MIPS automatically mean faster execution?
consider a 4GHz computer & 3 classes of instruction, with 2 compilers

                                       A     B     C     total
CPI                                    1     2     3
compiler 1, billions of instructions   5     1     1
compiler 1, billions of cycles         5     2     3     10
compiler 2, billions of instructions   10    1     1
compiler 2, billions of cycles         10    2     3     15

execution time =

MIPS =
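The two blanks on this slide can be worked out from the figures given. This is a sketch: summarise and the dictionary names are hypothetical, and the data are the slide's own (4GHz clock, CPIs of 1, 2 and 3, instruction counts in billions).

```python
CLOCK_HZ = 4e9
CPI = {"A": 1, "B": 2, "C": 3}

# Instruction counts per class for each compiler's output.
compiler1 = {"A": 5e9, "B": 1e9, "C": 1e9}
compiler2 = {"A": 10e9, "B": 1e9, "C": 1e9}


def summarise(counts):
    """Return (total cycles, execution time in seconds, MIPS rating)."""
    cycles = sum(CPI[cls] * n for cls, n in counts.items())
    seconds = cycles / CLOCK_HZ
    mips = sum(counts.values()) / seconds / 1e6
    return cycles, seconds, mips


cycles1, seconds1, mips1 = summarise(compiler1)
cycles2, seconds2, mips2 = summarise(compiler2)
print(seconds1, mips1)
print(seconds2, mips2)
```

Compiler 2's code earns the higher MIPS rating yet takes longer to run, which is why MIPS alone is not a full measure of performance.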



MFLOPS

PERFORMANCE

Millions of Floating Point Operations per Second

much scientific, graphic and engineering computing involves
fast floating point arithmetic implies
many computers have

the same caveat applies to MFLOPS as to MIPS


BENCHMARKS

PERFORMANCE

too many variables, too much hype

benchmarks are standard programs whose
  e.g. Livermore loops
  e.g. SPEC benchmark suites

manufacturers can
to achieve good stats
gives an unrealistic impression of


PARALLELISM

PERFORMANCE

some tasks can be
  systems that can be divided into
  e.g. the atmosphere (weather prediction)

P processors

ideally
in practice,

can't be eliminated


PARALLELISM – AMDAHL'S LAW

PERFORMANCE

how do we measure ?

if a task takes 100s in one configuration and 80s in another, what's the speedup?
  speed1 = 1 task/100s = 0.01 tasks/s
  speed2 = 1 task/80s = 0.0125 tasks/s
  speedup = 0.0125 / 0.01 = 1.25

most code is a mixture of
serial processing component is & limits

  $T = t_s + t_p$

if we have P processors working perfectly in parallel then we reduce the time for the parallel section of the code by a factor of P
so the total task time in the parallel configuration $T_P = t_s + t_p / P$
speedup $S = T / T_P$

if $t_p$ can be reduced to 0, max speedup → $T / t_s$
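A small numerical sketch of Amdahl's law, assuming the usual split of the task time into a serial part ts and a perfectly parallelisable part tp. The split used below (60s serial, 40s parallel, 2 processors) is an invented example chosen to reproduce the slide's 100s and 80s figures.

```python
def parallel_time(t_serial: float, t_parallel: float, processors: int) -> float:
    """Total time when only the parallel section is divided among the processors."""
    return t_serial + t_parallel / processors


def speedup(t_serial: float, t_parallel: float, processors: int) -> float:
    """Speedup = time on one processor / time on P processors."""
    return (t_serial + t_parallel) / parallel_time(t_serial, t_parallel, processors)


def max_speedup(t_serial: float, t_parallel: float) -> float:
    """Limit as the parallel section's time shrinks to 0."""
    return (t_serial + t_parallel) / t_serial


tp_time = parallel_time(60.0, 40.0, 2)   # 100s task, 40s of it parallelisable
s = speedup(60.0, 40.0, 2)
limit = max_speedup(60.0, 40.0)
print(tp_time, s, limit)
```

However many processors are added, the speedup in this example cannot exceed 100/60, because the serial 60s is untouched.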

PARALLELISM

PERFORMANCE

parallelism has two flavours
  independent tasks;
  dependent tasks;

many tasks in computers involve
cf. assembly line for producing goods that undergo same operation sequence

A1 B1 C1

A2 B2 C2

A3 B3 C3



PIPELINE PERFORMANCE

PERFORMANCE

if each task in a pipeline of length L takes t seconds
  single task takes t + t + … + t (L stages) = Lt

but for n tasks
  delay before 1st output = Lt
  delay between subsequent outputs = t
  T(n) = Lt + (n - 1)t = (L + n - 1)t


PIPELINE

PERFORMANCE

pipeline characteristics

pipeline rate, r∞ = 1/t
pipeline startup time, s = (L-1)t
half performance vector length, n½

  (L-1)t + n½ t = 2 n½ t
  n½ = L - 1
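The pipeline timing formula T(n) = (L + n - 1)t can be explored numerically. This is a sketch: the function names are hypothetical, and n½ is taken to be the number of tasks at which throughput reaches half of the peak rate 1/t, which is an assumption about the definition intended on the slide.

```python
def pipeline_time(stages: int, tasks: int, t: float) -> float:
    """Time for n tasks through an L-stage pipeline: (L + n - 1) * t."""
    return (stages + tasks - 1) * t


def throughput(stages: int, tasks: int, t: float) -> float:
    """Average completion rate over the whole run, in tasks per second."""
    return tasks / pipeline_time(stages, tasks, t)


L, t = 5, 2.0
one_task = pipeline_time(L, 1, t)        # the first result appears after L*t
ten_tasks = pipeline_time(L, 10, t)
peak_rate = 1.0 / t                      # one result per stage time once full
half_rate = throughput(L, L - 1, t)      # n = L - 1 tasks
print(one_task, ten_tasks, peak_rate, half_rate)
```

With L = 5, four tasks take 8 stage times, so the average rate is exactly half the peak; longer runs approach the peak rate.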



NUMBERS AND THE DATAPATH

COMPUTER ARITHMETIC

datapath
  ALU

number representation
  2's complement

arithmetic operations
  +, *, /, shift


2's COMPLEMENT

COMPUTER ARITHMETIC

used in nearly all microprocessors
most significant bit

other bits


INTEGERS AND ADDRESSES

COMPUTER ARITHMETIC

integers are , addresses are

Note: unsigned addresses are not used in all computers
transputer addresses are 2's complement, in the range -2^31 to 2^31-1

no type information included with the data


2's COMPLEMENT

COMPUTER ARITHMETIC

for n-bit numbers
  2^(n-1) positive numbers, starting at 0
  2^(n-1) negative numbers, starting at -1
  |max –ve no| =
  x + -x =
  -x =
  -2^(n-1)

operation overflows

value  pattern
  0    0000
  1    0001
  2    0010
  3    0011
  4    0100
  5    0101
  6    0110
  7    0111
 -8    1000
 -7    1001
 -6    1010
 -5    1011
 -4    1100
 -3    1101
 -2    1110
 -1    1111

A

1011 + 1101 = 1000    (-5 + -3 = -8)
1011 + 1100 = 0111    (-5 + -4 overflows)

0001101

00011011111111111111111111111111

1
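The 4-bit patterns tabulated on this slide can be generated and manipulated with a few helpers. This is a sketch: the function names are hypothetical, and negation is done with the usual invert-and-add-1 rule.

```python
def to_twos_complement(value: int, bits: int) -> int:
    """Bit pattern (as a non-negative int) representing value in `bits` bits."""
    if not -(1 << (bits - 1)) <= value < (1 << (bits - 1)):
        raise OverflowError(f"{value} does not fit in {bits} bits")
    return value & ((1 << bits) - 1)


def from_twos_complement(pattern: int, bits: int) -> int:
    """Signed value of a bit pattern: the msb carries weight -2^(bits-1)."""
    return pattern - (1 << bits) if pattern & (1 << (bits - 1)) else pattern


def negate(pattern: int, bits: int) -> int:
    """Invert every bit and add 1, discarding any carry out."""
    return (~pattern + 1) & ((1 << bits) - 1)


def sign_extend(pattern: int, from_bits: int, to_bits: int) -> int:
    """Widen a pattern by copying its sign bit into the new upper bits."""
    return to_twos_complement(from_twos_complement(pattern, from_bits), to_bits)


minus_five = to_twos_complement(-5, 4)
print(f"{minus_five:04b}", f"{negate(0b0011, 4):04b}", f"{sign_extend(minus_five, 4, 8):08b}")
```

Negating the most negative pattern, 1000, gives 1000 again: there is no +8 in 4 bits, so that operation overflows.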



OVERFLOW DETECTION

COMPUTER ARITHMETIC

easy to detect: when

on overflow
  address of overflowing instruction saved
  interrupt handler is called and
  after interrupt code finishes, instruction


1-BIT ALU

BUILDING UP THE DATAPATH

bit-slice architecture
design an ALU component that handles

logical operations require cin

[Figure: 32 cascaded 1-bit ALUs with inputs a0..a31 and b0..b31, outputs result0..result31, and a shared operation control]

We could extend this design by adding more functional blocks, e.g. multipliers

carry propagation carry lookahead

[Figure: 1-bit ALU with inputs a, b, c and invertb, an adder (∑) producing cout, and a result selected by operation]


1-BIT ALU – ADDING CARRY LOOKAHEAD

conventional full adder involves
we could design a 32-bit adder as
  64 inputs → 2^64 terms; too big
  is there a less expensive way?

can we detect when less significant bit slices will generate a carry output?
  set up modules that allow
  modules generally incorporate


1-BIT ALU – ADDING CARRY LOOKAHEAD


a  b  cin   cout  sum
0  0  0     0     0      cout always 0: carry kill
0  0  1     0     1

0  1  0     0     1      cout = cin: carry propagate
0  1  1     1     0
1  0  0     0     1
1  0  1     1     0

1  1  0     1     0      cout = 1: carry generate
1  1  1     1     1

Boolean expressions for Gi and Pi:
  Gi =
  Pi =
carry input to the next phase:
  ci+1 =
similarly:
  ci =
substituting repeatedly:
  ci+1 =
       =

all can be calculated in parallel from data inputs and c0
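A sketch of carry lookahead in software, assuming the standard definitions that match the truth table: generate Gi = ai AND bi, propagate Pi = ai XOR bi, and carry c(i+1) = Gi OR (Pi AND ci). The function names are hypothetical. Each carry is expanded so that it depends only on the G and P signals and c0, never on another carry.

```python
def lookahead_carries(g, p, c0):
    """Return [c0, c1, ..., cn], each computed only from g, p and c0."""
    carries = [c0]
    for i in range(len(g)):
        carry = c0
        for k in range(i + 1):
            carry &= p[k]                  # term P_i ... P_0 . c0
        for j in range(i + 1):
            term = g[j]
            for k in range(j + 1, i + 1):
                term &= p[k]               # term P_i ... P_(j+1) . G_j
            carry |= term
        carries.append(carry)
    return carries


def lookahead_add(a: int, b: int, c0: int = 0, bits: int = 4):
    """Add two `bits`-wide numbers; return (sum, carry out)."""
    g = [(a >> i) & (b >> i) & 1 for i in range(bits)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(bits)]
    c = lookahead_carries(g, p, c0)
    total = 0
    for i in range(bits):
        total |= (p[i] ^ c[i]) << i        # sum bit = a xor b xor carry in
    return total, c[bits]


print(lookahead_add(0b1011, 0b0110))
```

In hardware every term in the expansion is an AND gate feeding one OR gate, so all carries appear after a fixed number of gate delays instead of rippling through the bit slices.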



1-BIT ALU – ADDING CARRY LOOKAHEAD


carry lookahead circuit works on
CL units are usually associated with

cascaded to make up

[Figure: ALU bit slices with cascaded carry lookahead (CL) units]


1-BIT ALU – OTHER OPERATIONS


logical operations
  e.g. AND
  bit-for-bit logical operation on a pair of words

shift operations
  e.g. sll & srl (Shift Left Logical & Shift Right Logical)
  shift information in a register by a specified no. of bits

combination of logical and shift operations to extract parts of a word
  create a 32-bit mask with desired bits set to 1 (e.g. 8 bits for a character)
  AND and
  shift result by
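The mask-and-shift recipe can be sketched in a few lines. The helper name extract_field and the sample word 0x12345678 are invented for the illustration.

```python
def extract_field(word: int, shift: int, width: int) -> int:
    """Extract `width` bits of a 32-bit word, starting at bit position `shift`."""
    mask = ((1 << width) - 1) << shift   # desired bits set to 1, all others 0
    return (word & mask) >> shift        # AND with the mask, then shift right


word = 0x12345678
second_byte = extract_field(word, 8, 8)    # bits 15..8
top_byte = extract_field(word, 24, 8)      # bits 31..24
print(hex(second_byte), hex(top_byte))
```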



1-BIT ALU – OVERFLOW DETECTION


when overflow occurs, carry-in and carry-out of sign bit differ
connect an EOR gate to cin and cout of most significant adder

[Figure: the 32-bit ALU (inputs a0..a31, b0..b31, cin, invertb and operation; outputs result0..result31) with an overflow output taken from the most significant bit slice]
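The overflow rule on this slide (carry into the sign bit differs from the carry out of it) can be checked exhaustively for small word sizes. This is a sketch; add_with_overflow is a hypothetical helper and the default 4-bit width is chosen only to keep the example small.

```python
def add_with_overflow(a: int, b: int, bits: int = 4):
    """Add two bit patterns; return (result pattern, overflow flag)."""
    mask = (1 << bits) - 1
    low_mask = mask >> 1                       # every bit below the sign bit
    carry_in = ((a & low_mask) + (b & low_mask)) >> (bits - 1)
    total = (a & mask) + (b & mask)
    carry_out = total >> bits
    overflow = bool(carry_in ^ carry_out)      # the EOR gate on cin and cout
    return total & mask, overflow


print(add_with_overflow(0b1011, 0b1101))   # -5 + -3
print(add_with_overflow(0b1011, 0b1100))   # -5 + -4
print(add_with_overflow(0b0111, 0b0001))   #  7 +  1
```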



1-BIT ALU - SLT


result = ;

if a < b , is equivalent to

set invertb when performing slt operation to
-ve values have a sign bit
feed o/p of back to
make the o/p MUX in all the ALUs

[Figure: bit slices ALU31 down to ALU0, each with inputs a, b, c, an adder (∑), cout, an operation control and a result output]

function    and  or  add  sub  slt
operation   0    1   2    2    3
invertb     -    -   0    1    1


BRANCH INSTRUCTIONS – BEQ AND BNE

equality test for beq and bne instructions also relies on
if a=b, then

function    and  or  add  sub  slt  bne  beq
operation   0    1   2    2    3    2    2
invertb     -    -   0    1    1    1    1

zero-detect circuit controls

bne and beq also use
coupled with invertb

need to connect to
32-input active-low-input AND gate (i.e., a NOR gate)


SHIFT


5-bit shamt field specifies
too slow to shift in

barrel shifter shifts
e.g. to shift 5 bits, shamt is 00101

[Figure: barrel shifter stages controlled by the 1's, 2's and 4's bits of shamt]
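A barrel shifter can be sketched as a fixed sequence of stages, one per bit of shamt, each shifting by a power of two or passing its input through. This is an assumption about the structure intended on the slide (it is the common logarithmic design); the function name is hypothetical.

```python
def barrel_shift_left(value: int, shamt: int, bits: int = 32) -> int:
    """Shift left by shamt (0..31) using five stages that shift by 1, 2, 4, 8, 16."""
    mask = (1 << bits) - 1
    for stage in range(5):                      # one stage per bit of the shamt field
        if (shamt >> stage) & 1:                # this shamt bit enables the stage
            value = (value << (1 << stage)) & mask
    return value


# Shifting by 5 = binary 00101 uses the 4's stage and the 1's stage only.
print(barrel_shift_left(1, 5))
```

Every shift amount takes the same five stage delays, instead of up to 31 single-bit shifts.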



DATAPATH LAYOUT

[Figure: datapath layout showing the registers and ALU arranged as bit slices from bit 0 to bit 31, with control, instruction decoding, register decoding and data paths]

ALU layout is very regular
layout on silicon is also highly structured
control flows and data flows are orthogonal
minimises complexity and communication times


MULTIPLIER


signed multiplication

+

problem: takes multiple clock cycles


MULTIPLIER - FASTER


multiplicand        0101
multiplier        x 1101
                    ----
                    0101
                   0000
                  0101
                 0101
                 -------
product          1000001

what if we put an adder after each partial product?
(except the first, of course)

[Figure: array multiplier built from rows of adders (∑), one row per partial product, fed by the multiplicand and multiplier bits]

this architecture is well suited to


BOOTH'S ALGORITHM WORKS FOR 2's COMP. NUMBERS


Booth noticed that, when counting in binary,a string of

will be with a next time the

so the string of can be rewrittenas instead of

2m bit

2n bit

…011111…

…10000-1…

…100000…

what's the benefit of that?
a simple multiplier uses to multiply by string of x 1s
but if the multiplier can handle , →

the algorithm looking for
  00 – string of zeros;
  10 – start of a string of 1s;
  11 – middle of a string of 1s;
  01 – end of a string of 1s;

=

still have to
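A sketch of Booth's recoding in software, assuming the usual scheme: scan the multiplier from the least significant bit, subtract the shifted multiplicand at the start of a string of 1s (bit pair 10) and add it just past the end of the string (bit pair 01). The function name and the default 8-bit width are assumptions made for the illustration.

```python
def booth_multiply(multiplicand: int, multiplier: int, bits: int = 8) -> int:
    """Signed multiply; multiplier must fit in `bits` bits of 2's complement."""
    pattern = multiplier & ((1 << bits) - 1)    # 2's complement bit pattern
    product = 0
    previous = 0                                # imaginary bit to the right of bit 0
    for i in range(bits):
        current = (pattern >> i) & 1
        if (current, previous) == (1, 0):       # 10: start of a string of 1s
            product -= multiplicand << i
        elif (current, previous) == (0, 1):     # 01: end of a string of 1s
            product += multiplicand << i
        # 00 and 11: inside a string of 0s or 1s, nothing to add
        previous = current
    return product


# 30 = 0011110 has one string of four 1s: one subtraction (x2) and one addition (x32).
print(booth_multiply(7, 30), booth_multiply(5, -3))
```

A string of 1s that reaches the sign bit never sees its 01 pair, which is what makes the same loop correct for negative multipliers.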

MIPS MULTIPLICATION


the product of the 32-bit multiplications mult and multu occupies 64 bits

hi lo

hi and lo registers are not

mflo $1 (move from lo) puts
mfhi $1 (move from hi) to check that

mult ASM instruction is a
  generates
  generates


DIVISION


mathematically (ideally), division is the inverse of multiplication

if a = b/c
then b =
and if b = 1
then

but with finite precision arithmetic, occur


DIVISION BY REPEATED SUBTRACTION


when we multiplied m by n to produce a product p
we generated

when we divide p by m
we're finding out

we can divide p by m
by repeatedly p till
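Division by repeated subtraction can be sketched as below, assuming non-negative p and positive m, and assuming the loop stops when what is left is smaller than m. The function name is hypothetical.

```python
def divide_by_subtraction(p: int, m: int):
    """Return (quotient, remainder) of p / m by counting subtractions of m."""
    if m <= 0 or p < 0:
        raise ValueError("sketch handles p >= 0 and m > 0 only")
    quotient = 0
    while p >= m:          # stop when another subtraction would go negative
        p -= m
        quotient += 1
    return quotient, p     # what remains of p is the remainder


print(divide_by_subtraction(55, 5), divide_by_subtraction(17, 5))
```

The number of iterations equals the quotient, so this is simple but slow for large quotients.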

WHY ARE FP NUMBERS SPECIAL?

FLOATING POINT NUMBERS

we need a way to represent
  numbers with fractions, e.g. 3.14159
  very small numbers, e.g., 0.0000000001
  very large numbers, e.g. 178478 x 10^9

representation:
  sign, exponent, significand:
  more bits for mantissa →
  more bits for exponent →

IEEE 754 floating point standard:
  single precision uses
  double precision uses



IEEE 754 FLOATING POINT STANDARD:


msb of mantissa is , so only is stored

exponent is "biased" to make sorting easier
subtract 127 to get exponent for single precision & 1023 for double precision

single   bit 31: sign   bits 30-23: exponent (8 bits)    bits 22-0: mantissa (23 bits)
double   bit 63: sign   bits 62-52: exponent (11 bits)   bits 51-0: mantissa (52 bits)
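The single precision layout can be inspected with Python's standard struct module. This is a sketch: decode_single and value_of_normalised are hypothetical helper names, and the reconstruction covers normalised numbers only (hidden leading 1, bias of 127).

```python
import struct


def decode_single(x: float):
    """Split the IEEE 754 single precision encoding of x into its three fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                  # bit 31
    exponent = (bits >> 23) & 0xFF     # bits 30..23, biased by 127
    mantissa = bits & 0x7FFFFF         # bits 22..0, the leading 1 is not stored
    return sign, exponent, mantissa


def value_of_normalised(sign: int, exponent: int, mantissa: int) -> float:
    """Rebuild the value of a normalised number from its fields."""
    return (-1) ** sign * (1 + mantissa / 2 ** 23) * 2.0 ** (exponent - 127)


fields = decode_single(-0.75)          # -0.75 = -1.1 (binary) x 2^-1
print(fields, value_of_normalised(*fields))
```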



IEEE 754 FLOATING POINT STANDARD:


special cases:

exponent   mantissa   meaning
all 0s                denormalised numbers; zero is a denormalised no. with
all 1s     all 0s     ∞
all 1s     non-0      NAN

QuietNAN has

SignallingNAN has

rounding options:
  to nearest integer, to nearest even integer if fraction is exactly 0.5
  towards 0
  towards +∞
  towards -∞


GUARD AND ROUND BITS


consider adding two decimal numbers with 3 digits of precision: 2.56 + 2.34 x 10^2

with no extra digits     2.34   + 0.02
with 1 extra digit       2.340  + 0.025
with 2 extra digits      2.3400 + 0.0256

extra bits
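The effect of the extra digits can be reproduced with Python's decimal module. This is a sketch under stated assumptions: digits shifted out during alignment are simply dropped, the final result is rounded to nearest (ties to even), and the function name is hypothetical. Both operands are expressed in units of 10^2, so 2.56 becomes 0.0256.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_EVEN


def add_with_extra_digits(big: Decimal, small: Decimal, extra_digits: int) -> Decimal:
    """Align `small` keeping 2 + extra_digits fractional digits, add, round to 2."""
    working = Decimal(1).scaleb(-(2 + extra_digits))          # 0.01, 0.001, 0.0001
    aligned = small.quantize(working, rounding=ROUND_DOWN)    # shifted-out digits are lost
    total = big + aligned
    return total.quantize(Decimal("0.01"), rounding=ROUND_HALF_EVEN)


big = Decimal("2.34")       # 2.34 x 10^2
small = Decimal("0.0256")   # 2.56 rewritten with exponent 10^2
results = [add_with_extra_digits(big, small, extra) for extra in (0, 1, 2)]
print(results)
```

With one extra digit the sum lands exactly half-way between two representable values, so the outcome depends on the tie rule; the second extra digit shows that the true sum lies above the half-way point.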



FLOATING POINT ADDITION


align radix points
  denormalise one number to

add mantissae

renormalise the result
  watch out for

round to correct number of significant digits
  this may occasionally ripple through to the msb and generate an unnormalised result

renormalise again if required


THE "STICKY BIT"


When generating a result, a string of 0s may be followed by a 1 that will be normalised away

1.93650001 (ignoring the exponent)

simple rounding to nearest even value based on rounding digit, would produce

1.93650001

keep sticky bit as next bit to help resolve "mid-way" rounding problems

1



FLOATING POINT MULTIPLICATION


add exponents
  both include a bias, so have to subtract 1 bias from the result

multiply

normalise result, check for overflow

round

renormalise if necessary


FLOATING POINT AND MIPS


MIPS instructions to support IEEE single and double precision floating point:

add.s and add.d

sub.s and sub.d

mul.s and mul.d

div.s and div.d

c.<x>.s and c.<x>.d <x> may be eq, neq, lt, le, gt, ge

bc1t and bc1f

comparison sets bit to

single and double



ASMs (Designing a Controller)

COMPUTER PROCESSORS

ControllerASM

Architecture

Controller executes an infinite loop

• instructs processor to get an instruction from memory
• identifies instruction that the processor has retrieved
• instructs processor to perform data manipulations required by the instruction

Specifies timing of data manipulations

Receives status

information

Algorithmic State Machine




STATUS INFORMATION (architecture → controller)

Current instruction
If instruction involves a choice
  e.g. JPZ instruction
then controller examines a status line to determine appropriate action

Other status signals are commonly used
  NEG OVFL



CONTROL SIGNALS (controller → architecture)

Architectural building block    Control commands    No of bits
Registers
MUXes
Memory
etc



FINITE STATE MACHINES - THE BASIS OF ALGORITHMIC STATE MACHINES

Rectangles contain

Diamonds contain
Roundtangles contain

[Figure: example ASM chart with outputs P, Q, R and S and a decision on Z with T and F branches]


Outputs in a are TRUE


A □ must be present

even if it is empty

A state specifies actions that occur on one clock pulse

Outputs in a □ are TRUE

during that state’s clock pulse, if …

P

RZ F




The state is represented as a number
At any instant (clock pulse), the ASM
  has
  asserts
  asserts
  if the condition that governs them is fulfilled
ASM calculates the


Specs for circuit to navigate round the state chart:

inputs                  output
No. of present state    Next state no.
Status inputs
Any external inputs


repeat
  present state?
    0: TRUE → P
       Z?
         T: 1 → next state
         F: TRUE → R; 2 → next state
    1: TRUE → Q; 0 → next state
    2: TRUE → S; 0 → next state
  next state no. → present state no.



The ASM state transition table (navigation only)

Inputs      Outputs
Ap Bp Z     An Bn



General structure of the circuit

[Figure: combinatorial logic takes the present state, status and external inputs and produces the outputs and the next state; a register, loaded on the clock, feeds the next state back as the present state]


The complete ASM state transition table

Inputs       Outputs
Ap Bp Z      An Bn    P Q R S
0  0  0      1  0
0  0  1      0  1
0  1  -      0  0
1  0  -      0  0


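[Figure: circuit with a clocked 2-bit register (inputs D1 D0, outputs Q1 Q0) holding the state, and combinatorial logic producing An, Bn, P, Q, R and S from the present state and Z]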
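The transition table can be run as a table-driven state machine. This is a sketch built on one reading of the slides: next states come from the table (state 0 goes to state 2 when Z = 0 and to state 1 when Z = 1; states 1 and 2 return to state 0), P, Q and S are asserted in states 0, 1 and 2 respectively, and R is treated as a conditional output asserted in state 0 when Z is false. The placement of R and all Python names are assumptions.

```python
# (present state, Z) -> next state; rows marked "-" in the table ignore Z
NEXT_STATE = {
    (0, 0): 2, (0, 1): 1,
    (1, 0): 0, (1, 1): 0,
    (2, 0): 0, (2, 1): 0,
}


def outputs(state: int, z: int):
    """Control outputs asserted during this state's clock pulse."""
    asserted = set()
    if state == 0:
        asserted.add("P")
        if not z:
            asserted.add("R")    # conditional output (assumed placement)
    elif state == 1:
        asserted.add("Q")
    elif state == 2:
        asserted.add("S")
    return asserted


def run(z_inputs, state: int = 0):
    """Clock the machine once per Z value; return the trace and the final state."""
    trace = []
    for z in z_inputs:
        trace.append((state, sorted(outputs(state, z))))
        state = NEXT_STATE[(state, z)]    # the register loads the next state
    return trace, state


trace, final_state = run([1, 0, 0, 1])
print(trace, final_state)
```

The dictionary plays the role of the combinatorial logic and the `state` variable plays the role of the 2-bit register.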




A practical problem:initialising the state register

On automatic?

00

Found?

Fire

10Yes

Seek enemy target

01T

F

Consider a 22” naval gun
• controlled by an ASM
• autoseeking
• autofiring

No

At power-up,
  if state register contains 0, 1, or 3
  if state register contains 2



On automatic?

00

Found?

Fire

10Yes

Seek enemy target

01T

F

No

We need a reset-on-powerup circuit

+5V

0V

rst

+5V

T

V




SUMMARY: DESIGNING AN ASM

Construct
  Shows all inputs and control signals

Translate
  external inputs (if any)
  status inputs
  present state
  control commands
  next state

Translate


but would you ever use an ASM instead of computer program?

ASMs underlie processor instruction sets

ASM vs. SOFTWARE

Software

ASMs in discrete logic

ASMs in FPGA



Is it easy to understand or modify an ASM circuit?

General format is easily recognisable

inputs commands

But combinatorial circuitry and high-level flowchart vocabularies differ significantly




Modifications to state sequence require a complete redesign

Disjunction between ASM diagram and combinatorial circuitry

Could the circuitry(and the outputs)

Is it easy to understand or modify an ASM circuit?




USING MUXES AS A LOOKUP TABLE

If we connect:
  the control input of a MUX to the state number
  the data inputs to TRUE and FALSE

We’ll use 1 MUX for each output (including the bits of the next state number)

state   0  1  2  3  4  5  6  7
Q       1  0  0  1  0  1  1  1

[Figure: 8-input MUX with data inputs 0-7 tied to T or F, selected by the state number, producing Q]



Outputs are sometimes

Q

11

A.B

T F

State no.

01234567

Q

A B

Simpler and easier to understand than completely combinatorial system


Q should be in state 3, if





Consider our prototype ASM

P

Q R SZT F

0

1 2

Ap Bp Z

0 0 00 0 10 1 -1 0 -

An Bn

1 00 10 00 0

Inputs Outputs

P Q R S

1 011 0 00 1 00

00

000 1

ZT F

0123

An

0123

Bn

D1

D0

Q1

Q0

0123

S

0123

R

P0123

0123

Q




A LIFT CONTROLLER




Up button means “Take something upstairs”

Down button means “Take something downstairs”

If the lift is downstairs,

If the lift is upstairs,




[Figure: ASM chart for the lift controller]

state                  tests                                  outputs
000 (At bottom)        door open?, upButton + downButton?     openDoor
001 (Starting up)      closed?                                closeDoor
010 (Going up)         up?                                    goUp, ResetUpRequest
011 (At top)           door open?, upButton + downButton?     openDoor
100 (Starting down)    closed?                                closeDoor
101 (Going down)       down?                                  goDown, ResetDnRequest



AP BP CP   AN BN CN   condition   closeDoor   openDoor   resetUpRequest   resetDownRequest   goUp   goDown
0  0  0
0  0  1
0  1  0
0  1  1
1  0  0
1  0  1

cond1 = doorOpen . ~(upButton + downButton)
cond2 = doorOpen . (upButton + downButton)



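[Figure: MUX-based implementation of the lift controller: a state register (D2 D1 D0), status inputs DO, UB, DB, closed, up and down, and one MUX per output (AN, BN, CN, openDoor, closeDoor, goUp, goDown, resetUpRequest, resetDownRequest)]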



Phase 1

Phase 2

A MULTIPLICATION CIRCUIT




A MULTIPLICATION CIRCUIT: Analysis

How does “manual” multiplication work?

e.g. 5 (decimal) x 11 (decimal)

       0101     Multiplicand
     x 1011     Multiplier
       ----
       0101     Partial products
      0101
     0000
    0101
    -------
    0110111     Product (= 55 decimal)

Hardware multiplication works similarly
But multiplier

When the process ends, running total contains
Otherwise we’d have to use

Storage requirements
  Multiplier    Multiplicand    Product




A MULTIPLICATION CIRCUIT: Analysis

For each 1 in multiplier,
  add multiplicand into running total
  allowing for significance of the 1 bit in the multiplier

  1011    (8 bit, 4 bit, 2 bit, 1 bit)

Each time a partial product is added to the running total
  significance needs to be
  First PP goes into position
  Shift running total

Put running total in a shift register 2n bits wide

If multiplicand is large,
  Add an extra bit to the most significant shift register



: Design of the architecture

SRA SRB

register

adder




0

SRA SRB

resetA resetBshiftA shiftB

loadMultiplier

multiplier

loadProduct

clock

register

multiplicand

loadMultiplicand

adder

Top 1/2 Bottom 1/2

lobit

Informal algorithm

  Load
  Load
  Repeat 4 times
    If lowest bit of multiplier is , then add
    Shift and
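One way to read the informal algorithm is the classic shift-and-add multiplier, sketched below. The assumptions are mine: the multiplier is loaded into the bottom half (SRB), the running total accumulates in the top half (SRA) with room for the adder's carry, the multiplicand is added when the lowest bit is 1, and both halves shift right together once per step.

```python
def shift_add_multiply(multiplicand: int, multiplier: int, n: int = 4) -> int:
    """Multiply two n-bit unsigned numbers into a 2n-bit product."""
    mask = (1 << n) - 1
    sra = 0                    # top half: running total (may briefly need n + 1 bits)
    srb = multiplier & mask    # bottom half: multiplier, replaced bit by bit by the product
    for _ in range(n):         # "Repeat 4 times" when n = 4
        if srb & 1:            # lowest bit of the multiplier
            sra += multiplicand & mask
        # shift the combined register pair right by one bit
        srb = (srb >> 1) | ((sra & 1) << (n - 1))
        sra >>= 1
    return (sra << n) | srb    # top half and bottom half together


print(shift_add_multiply(0b0101, 0b1011))   # 5 x 11
```

Sharing one 2n-bit register between the shrinking multiplier and the growing product is what keeps the storage requirement down.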





incr

lobit?

eqz?

done

loadProductT

shiftAshiftB

F

F

00

01

11

10

Start? F

AP BP   AN BN   loadMultiplier   loadMultiplicand   clearCounter   resetA   increment   loadProduct   shiftA   shiftB   done   condition

0 0startT T T T0 1start

0 0

1 0 Tlobit

1 1 Tlobit

0 1

1 1 T-1 0

0 1 T Teqz

0 0 Teqz T T

1 1


A MULTIPLICATION CIRCUIT – ASM DIAGRAM

loadMultiplier
loadMultiplicand
clearCounter
resetA





D1

D0

Q1

Q0

0123

AN

0123

BN

AP

BP

T F eqz

lobit

start

0123

loadMultiplierloadMultiplicandClearCounterResetA

0123

increment

0123

loadProduct

0123

shiftAshiftB

0123

done

Q0Q1

clearCounter

increment

eqz

THE PROCESSOR = DATAPATH + CONTROL

SINGLE-CYCLE ARCHITECTURE

datapath can process data as specified in the instructions

but fetch/decode/execute cycle needs

control signals regulate
  tradeoffs between complex processing and fast hardware modules
  want to minimise both

control loop always has
  • output to determine the location of the next instruction
  • read specified by the instruction (sometimes 1, sometimes 2)
  • perform (memory ref, arithmetic/logical, or branch)


THE ARCHITECTURE


PC

registers

instructionmemory

data memory

data out

data in

ALU



COMBINATORIAL vs. SEQUENTIAL LOGIC


datapath components developed so far use
  outputs
  output of a given set of inputs
  delay between

full datapath is
  contains storage elements
  outputs depend on
  clock regulates

controller is a sequential circuit
  usually implemented as


CLOCKING DATA THROUGH A PATH


storage(sequential)

storage(sequential)

datapath component(combinatorial)

datapath component(combinatorial)

clock

[Figure: clock timing diagram showing the setup and hold intervals]
  register loads data
  register o/p unstable
  register i/p must be stable
  register o/p stable during
  cycle time = +

is it worth cutting a slow stage into two?

yes, if: $nc > (n+1)(c - \delta c)$, i.e. $(n+1)\delta c > c$


GATED CLOCKS


we don't always want to load data on every clock cycle
  use a separate write control line to
  clock edge specifies the data should be loaded
  write line specifies the data should be loaded

[Figure: gated clock formed from clock and write]


INSTRUCTION SUBSET


memory reference instructions lw and sw
arithmetic instructions add, sub, and, or, slt
branch instructions beq and j

two phases
  single-cycle implementation, combinatorial controller
  multi-cycle implementation – leads to


THE PROGRAM COUNTER


add

4

PCinstructionmemory

instruction

REGISTER FILE AND R-FORMAT INSTRUCTIONS


register file contains 32 32-bit registers
  implemented as a fast static RAM with dedicated read and write ports
  addresses correspond to
  control signals, based on current instruction, specify
  allows two reads and 1 write on a clock cycle
ALU operates on output from 2 registers, writes result back to register file

[Figure: register file with 5-bit read register1, read register2 and write register inputs taken from the instruction, a write data input, a regWrite control, and read data1 and read data2 outputs feeding the ALU; the ALU has a 4-bit ALUoperation control and a zero output]


MEMORY REFERENCES


lw $t1, offset($t2)
sw $t1, offset($t2)
  offset: 16-bit value to add to $t2

to generate branch destination
  dedicated ALU adds offset to
  if offset is –ve, sign is in bit 15
  need to (set bits 16 - 31 to 1)

[Figure: register file, sign-extend unit (16 → 32 bits), ALU and data memory (read address, write address, read and write controls) connected for lw and sw]

THE BEQ CONTROL LOGIC


beq $1, $2, offset

branch address is register (PC) + offset
  PC + 4 (here the units are bytes!) is
  offset is ; needs
  unit of offset is words, not bytes; shift it 2 bits to the left to multiply by 4

[Figure: the instruction's offset passes through sign extend and shift left 2, and a dedicated ALU sums it with pc+4 to give the branch destination]
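The branch destination calculation can be sketched as follows, assuming the 16-bit offset is sign-extended, shifted left 2 bits and added to PC + 4. The helper names and the sample addresses are made up for the illustration.

```python
def sign_extend16(value: int) -> int:
    """Copy bit 15 into bits 16..31 (returned here as a signed Python int)."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_target(pc: int, offset16: int) -> int:
    """Branch destination = (PC + 4) + (sign-extended offset in words) * 4."""
    return (pc + 4 + (sign_extend16(offset16) << 2)) & 0xFFFFFFFF


forward = branch_target(0x1000, 3)         # skip 3 instructions past PC + 4
backward = branch_target(0x1000, 0xFFFF)   # offset -1: branch to this same beq
print(hex(forward), hex(backward))
```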



THE BEQ CONTROL LOGIC

beq is a conditional branch instruction
  so processor must
  ALU
  if ALU's zero-detect is TRUE,

[Figure: branch destination ALU (pc+4 plus the sign-extended, left-shifted offset) alongside the register file and main ALU; the main ALU's zero output goes to the branch control logic]


THE J INSTRUCTION


j address

another instruction that loads the PC – unconditionally this time
  26-bit address has two 0 bits added to
  no negative addresses, so no need for
  top 4 bits of PC are left unaffected
  so j instruction can only access

[Figure: new PC = top 4 bits of PC (unchanged), then the 26-bit address from the instruction, then 00]

© Paul Lyons 2010~ 142 ~
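The construction of the jump destination can be sketched as follows, following the slide's description (top 4 bits of the PC unchanged, 26-bit address field, two 0 bits appended). The function name `jump_target` is hypothetical.

```python
def jump_target(pc: int, address26: int) -> int:
    """New PC = top 4 bits of PC, then the 26-bit address, then 00."""
    upper = pc & 0xF0000000                    # top 4 bits left unaffected
    lower = (address26 & 0x03FFFFFF) << 2      # append two 0 bits
    return upper | lower
```

Because only 28 bits come from the instruction, the destination always lies in the same 2^28-byte region as the current PC.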


SINGLE CLOCK CYCLE RESTRICTIONS


all operations must start and finish in 1 clock cycle
no resources can be shared
multiple operations require

increment PC, calculate address, compare registers all need

however, different instruction types can use the same resource
memory references calculate
register operations calculate
can multiplex

similarly

[Figure: datapath shared by memory-reference and register instructions: register file, sign-extend (16 → 32), ALU (ALUoperation, zero) and data memory with read address, write address and write data]


ADDING INSTRUCTION MEMORY


[Figure: datapath with instruction memory added: PC, PC + 4 adder, instruction memory, register file, sign-extend (16 → 32), ALU and data memory]

Single-cycle instruction requires separate data and instruction memory
no time to read


ADDING THE BEQ INSTRUCTION


[Figure: datapath with beq support: PC, PC + 4 adder, instruction memory, register file, sign-extend and shift left 2 feeding a second adder for the branch destination, ALU with zero output, and data memory]



[Figure: complete single-cycle datapath; instruction bits 21:25 and 16:20 select read register1 and read register2, bits 11:15 select the write register, and bits 0:15 go to sign-extend and shift left 2]


ALU CONTROL


controller will handle a subset of the ALU functions

function            ALU Control Input
and                 000
or                  001
add                 010
sub                 110
set on less than    111


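COMBINATORIAL CONTROL UNIT


instruction  opcode  ALUop  function code  ALU action        ALU control
lw           lw      00     xxxxxx         add               010
sw           sw      00     xxxxxx         add               010
beq          beq     01     xxxxxx         subtract          110
add          R-type  10     100000         add               010
sub          R-type  10     100010         subtract          110
and          R-type  10     100100         and               000
or           R-type  10     100101         or                001
slt          R-type  10     101010         set on less than  111

(opcode and function code are 6 bits each)

truth table: inputs A1 A0 (ALUop) and F3..F0 (function code), outputs C2..C0 (ALU control)

            lw/sw  beq  add  sub  and  or  slt
inputs  A1  0      0    1    1    1    1   1
        A0  0      1    0    0    0    0   0
        F3  -      -    0    0    0    0   1
        F2  -      -    0    0    1    1   0
        F1  -      -    0    1    0    0   1
        F0  -      -    0    0    0    1   0
outputs C2  0      1    0    1    0    0   1
        C1  1      1    1    1    0    0   1
        C0  0      0    0    0    0    1   1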








Karnaugh maps for C2 (one map per value of A1A0; columns F3F2, rows F1F0, both in the order 00 01 11 10)

A1A0=00: all cells 0
A1A0=01: all cells 1
A1A0=11: all cells don't care (-)
A1A0=10:
          F3F2  00  01  11  10
F1F0  00        0   0   -   -
      01        -   0   -   -
      11        -   -   -   -
      10        1   -   -   1

C2 = A0 + A1F1




Karnaugh maps for C1 (one map per value of A1A0; columns F3F2, rows F1F0, both in the order 00 01 11 10)

A1A0=00: all cells 1
A1A0=01: all cells 1
A1A0=11: all cells don't care (-)
A1A0=10:
          F3F2  00  01  11  10
F1F0  00        1   0   -   -
      01        -   0   -   -
      11        -   -   -   -
      10        1   -   -   1

C1 = ¬A1 + ¬F2


Karnaugh maps for C0 (one map per value of A1A0; columns F3F2, rows F1F0, both in the order 00 01 11 10)

A1A0=00: all cells 0
A1A0=01: all cells 0
A1A0=11: all cells don't care (-)
A1A0=10:
          F3F2  00  01  11  10
F1F0  00        0   0   -   -
      01        -   1   -   -
      11        -   -   -   -
      10        0   -   -   1

C0 = A1F3 + A1F0




C2 = A0 + A1F1
C1 = ¬A1 + ¬F2
C0 = A1F3 + A1F0

[Figure: ALU control block with inputs A1, A0 and F3, F2, F1, F0 and outputs C2, C1, C0]
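The three minimised equations (C2 = A0 + A1F1, C1 = NOT A1 + NOT F2, C0 = A1F3 + A1F0) can be checked against the ALU control truth table with a short sketch. The function name `alu_control` and the choice of all-zero function bits for the don't-care rows are assumptions made only for this check.

```python
def alu_control(a1, a0, f3, f2, f1, f0):
    """Evaluate the minimised ALU control equations on single-bit inputs."""
    c2 = a0 | (a1 & f1)
    c1 = (1 - a1) | (1 - f2)
    c0 = (a1 & f3) | (a1 & f0)
    return (c2, c1, c0)


# (A1, A0, F3, F2, F1, F0) -> expected (C2, C1, C0) from the truth table
TRUTH_TABLE = {
    (0, 0, 0, 0, 0, 0): (0, 1, 0),  # lw / sw: add (function bits are don't care)
    (0, 1, 0, 0, 0, 0): (1, 1, 0),  # beq: subtract (function bits are don't care)
    (1, 0, 0, 0, 0, 0): (0, 1, 0),  # add
    (1, 0, 0, 0, 1, 0): (1, 1, 0),  # sub
    (1, 0, 0, 1, 0, 0): (0, 0, 0),  # and
    (1, 0, 0, 1, 0, 1): (0, 0, 1),  # or
    (1, 0, 1, 0, 1, 0): (1, 1, 1),  # slt
}

all_rows_match = all(alu_control(*bits) == out for bits, out in TRUTH_TABLE.items())
```

Because A1 = 0 for lw, sw and beq, the function-code bits cannot influence the outputs in those rows, which is why they can be treated as don't cares.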


THE CONTROL SIGNALS


[Figure: single-cycle datapath with controller: instruction bits 26:31 drive the controller, which produces MemToReg, MemRead, MemWrite, PCSrc, RegDst, RegWrite, ALUSrc and the 2-bit ALUop; ALUop and instruction bits 0:5 drive ALUControl]


CONTROLLER TRUTH TABLE


instruction  R-type  lw    sw    beq
opcode       ( 0 )   (35)  (43)  ( 4 )

regdest      1       0     X     X
ALUSrc       0       1     1     0
MemToReg     0       1     X     X
RegWrite     1       1     0     0
MemRead      0       1     0     0
MemWrite     0       0     1     0
Branch       0       0     0     1
ALUOp1       1       0     0     0
ALUOp2       0       0     0     1

opcode bits
op5          0       1     1     0
op4          0       0     0     0
op3          0       0     1     0
op2          0       0     0     1
op1          0       1     1     0
op0          0       1     1     0

MemToReg = op5.¬op4.¬op3.¬op2.op1.op0
RegDest  = ¬op5.¬op4.¬op3.¬op2.¬op1.¬op0
MemRead  = op5.¬op4.¬op3.¬op2.op1.op0
MemWrite = op5.¬op4.op3.¬op2.op1.op0
Branch   = ¬op5.¬op4.¬op3.op2.¬op1.¬op0
ALUOp1   = ¬op5.¬op4.¬op3.¬op2.¬op1.¬op0
ALUOp2   = ¬op5.¬op4.¬op3.op2.¬op1.¬op0
ALUSrc   = op5.¬op4.¬op3.¬op2.op1.op0 + op5.¬op4.op3.¬op2.op1.op0
RegWrite = ¬op5.¬op4.¬op3.¬op2.¬op1.¬op0 + op5.¬op4.¬op3.¬op2.op1.op0
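As a rough check of the controller truth table, the decode can be sketched as four product terms (one per instruction class) that are then ORed into the control signals. The function name `control_signals` is hypothetical, and the don't-care (X) entries simply come out as 0 here.

```python
def control_signals(opcode: int) -> dict:
    """Decode a 6-bit opcode into the single-cycle control signals."""
    op = [(opcode >> i) & 1 for i in range(6)]   # op[0] .. op[5]
    n = [1 - bit for bit in op]                  # complemented opcode bits

    r_type = n[5] & n[4] & n[3] & n[2] & n[1] & n[0]       # opcode 0
    lw = op[5] & n[4] & n[3] & n[2] & op[1] & op[0]        # opcode 35
    sw = op[5] & n[4] & op[3] & n[2] & op[1] & op[0]       # opcode 43
    beq = n[5] & n[4] & n[3] & op[2] & n[1] & n[0]         # opcode 4

    return {
        "RegDest": r_type,
        "ALUSrc": lw | sw,
        "MemToReg": lw,
        "RegWrite": r_type | lw,
        "MemRead": lw,
        "MemWrite": sw,
        "Branch": beq,
        "ALUOp1": r_type,
        "ALUOp2": beq,
    }
```

Note that RegWrite is asserted for R-type and lw only; sw and beq write no register.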


CONTROLLER











[Figure: controller circuit with Vcc, inputs op5 to op0, and an output node]




SINGLE-CYCLE vs. MULTI-CYCLE IMPLEMENTATION

MULTI-CYCLE ARCHITECTURE

with single cycle, longest instruction limits speed of whole machine
load instruction involves
CPI = 1 looks good, but

multi-cycle instructions would be faster for all but longest instruction
single memory can be used
single ALU can be used for data, PC and address operations

load and store instructions involve
1 memory access
1 memory access
separate instruction and data memories necessary to

we can divide instructions into phases
e.g., instruction read; register(s) read; compute phase; register write (R-type instruction)
set clock period to length of longest phase of an instruction instead of longest instruction
instructions become


TRISTATE OUTPUTS


using a multiplexor to select inputs to an ALU means
multiple 32-bit-wide data paths

alternatively, run a single 32-bit bus past all the sources
give the sources tristate outputs

with n data sources:
log2 n mux control inputs
32 x n data wires


DATAPATH


write signals required

no longer single clock cycle with standard sequence of control signals

temporary registers needed for results because:
signal is computed on 1 clock cycle and
inputs that produced it change
implicit control signals used

[Figure: multi-cycle datapath: PC, single memory, instruction register, memory data register, register file, A and B registers, sign-extend, shift left 2, single ALU and ALUout register; control signals PCWrite, IorD, MemRead, MemWrite, IRWrite, RegDest, MemtoReg, ALUSelA, ALUSelB]


BREAKING INSTRUCTION INTO CLOCK CYCLES



[Figure: multi-cycle datapath with control signals PCWrite, MemRead, MemWrite, IRWrite, RegDest, ALUSelA, ALUSelB, MemtoReg and IorD]

equalise time spent in each clock cycle
minimise time for whole instruction

clock cycle should contain no more than 1 ALU operation, 1 register file access or 1 memory access



CLOCK CYCLE 1 - Instruction Fetch


common to all instructions

stores instruction in IR so that

load IR and inc PC in parallel
both
use
take effect

IR ← memory[PC]
PC ← PC + 4


CLOCK CYCLE 2 – Instruction Decode & Register Fetch


read registers specified by rs and rt fields of instruction into A and B registers
don't need them for all instructions, but does no harm

also compute branch target address, just in case
save result

still don’t know what the instruction is, though it's in the IR

A ← reg[IR[21:25]]
B ← reg[IR[16:20]]
ALUout ← PC + (sign-extend(IR[0:15]) << 2)


CLOCK CYCLE 3 – Mem Addr Computation, or Branch Completion


in this clock cycle, depends on

memory reference instructions (lw and sw)
ALUout ← A + sign-extend(IR[0:15])

R-type instructions (arithmetic-logic)
ALUout ← A op B

conditional branch instructions
if (A==B) PC ← ALUout

jump instruction
PC ← { PC[28:31], IR[0:25], 2'b00 }


CLOCK CYCLE 4 – Mem Access, or R-Type Instruction Completion


memory reference instructions (lw and sw)
MDR ← memory[ALUout]   (lw)
or
memory[ALUout] ← B     (sw)

R-type instructions (arithmetic-logic)
reg[IR[11:15]] ← ALUout


CLOCK CYCLE 5 – Memory Read Completion


load instruction
reg[IR[16:20]] ← MDR

(Patterson & Hennessy, p. 329)


[Figure: multi-cycle datapath, annotated with the action taken in each step]

instruction fetch (all instructions)
    IR ← memory[PC];
    PC ← PC + 4;

instr decode, register fetch (all instructions)
    A ← reg[IR[21:25]];
    B ← reg[IR[16:20]];
    ALUout ← PC + (sign-extend(IR[0:15]) << 2);

Memory Address Computation, or Branch Completion
    R-type:            ALUout ← A op B;
    memory-reference:  ALUout ← A + sign-extend(IR[0:15]);
    branch:            if (A==B) PC ← ALUout;
    jump:              PC ← { PC[28:31], IR[0:25], 2'b00 };

Memory access or R-type completion
    R-type:            reg[IR[11:15]] ← ALUout;
    memory-reference:  MDR ← M[ALUout]      # lw
                       OR: M[ALUout] ← B    # sw

Memory read completion
    load:              reg[IR[16:20]] ← MDR;
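The step summary can be turned into a small lookup that gives the number of clock cycles each instruction class needs. This is a sketch only: the names `STEPS`, `cycles` and `average_cpi` are hypothetical, and any instruction mix passed to `average_cpi` is the caller's own assumption, not data from these notes.

```python
COMMON = ["IR <- memory[PC]; PC <- PC + 4",
          "A <- reg[rs]; B <- reg[rt]; ALUout <- branch target"]

# Register-transfer steps for each instruction class, one entry per clock cycle.
STEPS = {
    "R-type": COMMON + ["ALUout <- A op B", "reg[rd] <- ALUout"],
    "lw": COMMON + ["ALUout <- A + sign-extend(imm)", "MDR <- M[ALUout]", "reg[rt] <- MDR"],
    "sw": COMMON + ["ALUout <- A + sign-extend(imm)", "M[ALUout] <- B"],
    "beq": COMMON + ["if (A == B) PC <- ALUout"],
    "j": COMMON + ["PC <- {PC[28:31], IR[0:25], 00}"],
}


def cycles(kind: str) -> int:
    """Clock cycles needed by one instruction of the given class."""
    return len(STEPS[kind])


def average_cpi(mix: dict) -> float:
    """Weighted CPI for an instruction mix given as {class: fraction}."""
    return sum(fraction * cycles(kind) for kind, fraction in mix.items())
```

Only lw needs all five steps; branches and jumps finish in three, which is where the multi-cycle design gains over a single long clock cycle.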


STATE MACHINE CONTROLLER


Standard ASM approach to constructing a controller
see Patterson & Hennessy, pp. 330-340


MICROPROGRAMMING


look up control signals for (instruction, clock cycle) instead of calculating them

[Figure: microprogram counter addressing the microprogram memory, which holds a jump table (2^n entries; entry 0: jump to fetch), the fetch microcode and the instruction microcode; its outputs drive the processor control lines]

consider a processor with n-bit opcode and no instruction 0
on power-up, fetch loads with , jumps to
jump table jumps to ; instruction executes, jumps back to


MICROPROGRAMMING


extra memory access for each clock cycle

new microcode can be downloaded



EXCEPTIONS AND INTERRUPTS


exceptions are unexpected events
e.g.

interrupts are unexpected events from outside the processor
I/O devices generate interrupts to signal input events; process swapping

terminological confusion
MIPS convention: both types of event are
Intel 32-bit processor convention: both

handling exceptions is time-consuming
may determine overall speed of machine
save address of current instruction
transfer control to operating system: OS can then or


EXCEPTIONS AND INTERRUPTS


vectored interrupts
when exception occurs, controller
OS routine to handle that exception is

exception handling starts
e.g. signals from:
overflow detector, unrecognised opcode (simplified MIPS processor)
external pin on processor (I/O devices)
states in controller ASM where exceptions can occur have jump

ASM loads:
cause register with
EPC with
PC with (location of OS routine for )

OS:
handles
or


COMPLEX MULTI-CYCLE ARCHITECTURES


suitable for
CISC machines can have instructions from 2-3 clock cycles to tens or even hundreds
when data for current instruction moves along datapath,
early parts of datapath


THE BASIC IDEA

PIPELINING

multi-cycle architecture reduces
instructions still
but some instructions take fewer than
so

can we combine 1 CPI behaviour with shorter clocks?

single-cycle (1 CPI) architecture simple but slow
no instruction can run faster than the slowest

each stage in datapath acts on data from a separate instruction
D enacts phase 4 of instruction i on data for instruction i
C enacts phase 3 of instruction i-1 on data for instruction i-1
later instructions can’t work on data currently being produced by the datapath
resources can’t be used at several stages in the datapath
need intermediate registers to keep results available for several clock cycles

[Figure: datapath divided into four stages A, B, C, D]


COMPARISON OF APPROACHES

PIPELINING

Instruction  Instr Fetch  Reg Read  ALU Op  Data Mem  Reg Write  Total
R Format     10ns         5ns       10ns    -         5ns        30ns
lw           10ns         5ns       10ns    10ns      5ns        40ns
sw           10ns         5ns       10ns    10ns      -          35ns
beq          10ns         5ns       10ns    -         -          25ns

[Figure: timing diagram, 0 to 90 ns, showing successive instructions (instr fetch, reg, ALU, data, reg) under three implementations: not pipelined with a single cycle, not pipelined with multiple cycles, and pipelined with single-cycle stages]


SPEEDUP

PIPELINING

single clock cycle instructions start every 40ns

multi-clock instructions can start every
50ns (lw)             x 0.8
40ns (sw & R-type)    x 1
37.5ns (branch)       x 1.33

speedup

saving of resources (3 ALUs → 1 ALU) as well as speedup

speedup with pipeline
x 4 if
x 2.22 in the example (though )
x 4 is
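The pipeline speedup figures can be reproduced with a small calculation. This sketch assumes a 40 ns single-cycle clock, five 10 ns pipeline stages, and that the timing-diagram example contains five instructions; the instruction count is inferred from 200 ns / 90 ns = 2.22 rather than stated in the notes, and the function names are hypothetical.

```python
def single_cycle_time(n: int, clock_ns: float = 40.0) -> float:
    """n instructions, each taking one 40 ns clock cycle."""
    return n * clock_ns


def pipelined_time(n: int, stages: int = 5, stage_ns: float = 10.0) -> float:
    """The first instruction takes `stages` cycles; each later one finishes a cycle after."""
    return (stages + n - 1) * stage_ns


def pipeline_speedup(n: int) -> float:
    return single_cycle_time(n) / pipelined_time(n)
```

With five instructions the speedup is 200 / 90, about 2.22; as the number of instructions grows, the load time of the pipeline becomes negligible and the speedup approaches 40 / 10 = 4.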


PIPELINE OVERHEADS

PIPELINING

load time
flush time
unequal stage delays
delays in interstage registers


ARE SOME INSTRUCTION SETS BETTER THAN OTHERS?

PIPELINING

constant length instructions "fit" the hardware better
IA32 (Pentium) instructions 1-17 bytes
translated into microinstructions that suit pipelining

standard format with operands in consistent locations
allows register reads to occur before instruction type is known

memory access instructions are shorter if all calculations are register-based
calculation phase can be used for

word-aligned operands reduce memory accesses
no operand transfer takes


DIVIDING THE DATAPATH INTO PIPELINE STAGES

PIPELINING

[Figure: single-cycle datapath divided into five stages by the pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB]

IF     instruction fetch
ID     instruction decode, register read
EX     execute, address calculation
MEM    memory access
WB     writeback

information needed in a later stage must be passed via
pipeline registers load , read
pipeline registers are named after
for writeback, the pipeline register



instruction fetch (IF):
IR ← mem[PC]
PC ← PC + 4



instruction decode, register read (ID):
A ← Reg[IR[25-21]];
B ← Reg[IR[20-16]];
IMM ← SE(IR[15-0]);



execute, address calculation (EX):
memory reference:  ALUOut ← A + Imm;
r-type:            ALUOut ← A func B;
i-type:            ALUOut ← A op Imm;
branch:            ALUOut ← NPC + Imm; Cond ← (A op 0)



memory access (MEM):
PC ← NPC
LMD ← M[ALUOut]; or Memory[ALUOut] ← B;
branch: if cond PC ← ALUOut


THE LW INSTRUCTION: EXECUTION TRACE

PIPELINING

[Figure: pipelined datapath with stages IF (instruction fetch), ID (instruction decode, register read), EX, MEM and WB (writeback), separated by the pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB]



[Figures: the lw instruction traced through the pipelined datapath, one stage highlighted per slide: instruction memory (instruction fetch); register file read data1 and read data2 (instruction decode, register read); the ALU sum (execute, address calculation); data memory (memory access); register file write register and write data (writeback)]


CONTROLLING THE PIPELINE

PIPELINING

Instruction Decode turns instruction into
control signals are L→R data flow except for ; both can cause

[Figure: pipelined datapath with control unit; the control signals for the EX, MEM and WB stages are carried forward through the pipeline registers]


HAZARDS: WHEN PARTS OF THE PIPELINE STAND IDLE

PIPELINING

structural hazards: two instructions need the same resource
consider lw instructions on a MIPS processor with 1 memory for data and program

[Figure: pipeline timing of successive lw instructions, each passing through instr fetch, read reg, ALU, data, write reg]

solution: datapath for pipelined MIPS uses separate instruction and data memories

data hazards: instruction 2 needs data before instruction 1 has finished producing it
add $s0, $t0, $t1
sub $t2, $s0, $t3



[Figure: pipeline timing (instr fetch, read reg, ALU, data, write reg) for the add followed by sub $t2, $s0, $t3]

solution1: forward data to next instruction as soon as it's produced
don't wait for it to be written to the register file
doesn't work if data is needed before it is produced

solution2: controller inserts stalls (aka "bubbles") into the datapath

control hazards: control decision depends on result of an incomplete instruction
if (pipelined) instruction i starts on clock cycle n, instruction i+1 starts on cycle n+1
unless instruction i is a branch instruction; destination not known for 4 more clock cycles

can just
put in extra
tests , calculates n &
still have 1 clock-cycle stall

or take one alternative anyway
assume
load next instruction
still lose a clock cycle if , but net improvement

improvements:
assume
or compiler to use the stall time (MIPS)
or on the basis of record


SCHEDULING THE BRANCH DELAY SLOT

PIPELINING

from before
    before:                      after:
    add $s1, $s2, $s3            if $s2=0 then
    if $s2=0 then                    add $s1, $s2, $s3
        delay slot

from target
    before:                      after:
    sub $14,$15,$16              .
    .                            .
    .                            add $s1, $s2, $s3
    add $s1, $s2, $s3            if $s1=0 then
    if $s1=0 then                    sub $14,$15,$16
        delay slot

from fall through
    before:                      after:
    add $s1, $s2, $s3            add $s1, $s2, $s3
    if $s1=0 then                if $s1=0 then
        delay slot                   sub $14,$15,$16
    sub $14,$15,$16


BRANCH PREDICTION

PIPELINING

small memory indexed by
each location has 1 bit - set a bit , reset
will sometimes be set
prediction unrelated to
but will contain

but consider performance of a loop branch
taken 9 times, not taken once
misprediction
but after last iteration, prediction is
so misprediction at
branch taken 90% of time, correct prediction 80% of time

use 2-bit branch prediction memory
must be wrong twice before prediction changes
copes with repeated loops

[Figure: state diagram of the 2-bit predictor: four states, two predicting taken and two predicting not taken, linked by taken / not taken transitions]
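The 80% and 90% figures for a loop branch can be reproduced with a short simulation. This is a sketch under stated assumptions: the loop branch is taken 9 times and then not taken once, the loop is re-entered many times, and the 2-bit predictor is modelled as a saturating counter (states 0 and 1 predict not taken, 2 and 3 predict taken), which is one common realisation of "must be wrong twice before prediction changes". All names are hypothetical.

```python
def one_bit_accuracy(outcomes, prediction=True):
    """1-bit predictor: always predict whatever the branch did last time."""
    correct = 0
    for taken in outcomes:
        correct += int(prediction == taken)
        prediction = taken                 # remember the latest outcome
    return correct / len(outcomes)


def two_bit_accuracy(outcomes, state=3):
    """2-bit saturating counter: states 2 and 3 predict taken, 0 and 1 not taken."""
    correct = 0
    for taken in outcomes:
        correct += int((state >= 2) == taken)
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct / len(outcomes)


# A loop branch: taken 9 times, not taken once, with the loop run 100 times.
loop_branch = ([True] * 9 + [False]) * 100
```

The 1-bit scheme mispredicts twice per pass through the loop (on exit, and again on the first iteration of the next pass), while the 2-bit scheme mispredicts only on exit.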


DATA HAZARDS - CATEGORISATION

PIPELINING

in each of the hazards below, instruction i starts executing before instruction j
hazard name refers to what should happen, not what goes wrong (!)

RAW – Read After Write
WAR – Write After Read
WAW – Write After Write


RAW HAZARDS

EXCEPTIONS

four situations – 2 problematic, 2 not

LW   R1,45(R2)
DADD R5,R6,R7
DSUB R8,R6,R7
OR   R9,R6,R7
nothing in following 3 instructions depends on R1
(4th following instruction will be in IF when 1st is in WB)

LW   R1,45(R2)
DADD R5,R1,R7
DSUB R8,R6,R7
OR   R9,R6,R7
hardware detects phase1 R1 write, phase2 R1 read
stalls DADD instruction's EX phase (& following instructions)

LW   R1,45(R2)
DADD R5,R6,R7
DSUB R8,R1,R7
OR   R9,R6,R7
hardware detects
forwards

LW   R1,45(R2)
DADD R5,R6,R7
DSUB R8,R6,R7
OR   R9,R1,R7
no action required. Write of R1 occurs during 1st half of OR's ID phase, and read occurs in 2nd half


WHEN TO FORWARD IN EX PHASE

EXCEPTIONS

source instruction  destination instruction      destination     forward if:
R-type              R-type, I-type, lw, sw, bra  top ALU i/p     EX/MEM[rd] = ID/EX[rs]
R-type              R-type                       bottom ALU i/p  EX/MEM[rd] = ID/EX[rt]
R-type              R-type, I-type, lw, sw, bra  top ALU i/p     MEM/WB[rd] = ID/EX[rs]
R-type              R-type                       bottom ALU i/p  MEM/WB[rd] = ID/EX[rt]
I-type              R-type, I-type, lw, sw, bra  top ALU i/p     EX/MEM[rt] = ID/EX[rs]
I-type              R-type                       bottom ALU i/p  EX/MEM[rt] = ID/EX[rt]
I-type              R-type, I-type, lw, sw, bra  top ALU i/p     MEM/WB[rt] = ID/EX[rs]
I-type              R-type                       bottom ALU i/p  MEM/WB[rt] = ID/EX[rt]
lw                  R-type, I-type, lw, sw, bra  top ALU i/p     MEM/WB[rt] = ID/EX[rs]
lw                  R-type                       bottom ALU i/p  MEM/WB[rt] = ID/EX[rt]
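The forwarding conditions can be expressed as a small function. This is a sketch, not the hardware: each pipeline register is modelled as a hypothetical dictionary with keys `kind`, `rs`, `rt` and `rd`, and the rule that the newer EX/MEM result overrides an older MEM/WB result is an added assumption.

```python
def written_register(instr):
    """Register a source instruction will write: rd for R-type, rt for I-type and lw."""
    if instr is None:
        return None
    if instr["kind"] == "R-type":
        return instr["rd"]
    if instr["kind"] in ("I-type", "lw"):
        return instr["rt"]
    return None                      # sw and branches write no register


def forwarding(id_ex, ex_mem, mem_wb):
    """Return (top, bottom): the pipeline register forwarded to each ALU input, or None."""
    top = bottom = None
    # Check the older result first so that the newer EX/MEM result takes priority.
    for name, source in (("MEM/WB", mem_wb), ("EX/MEM", ex_mem)):
        dest = written_register(source)
        if dest is None:
            continue
        if name == "EX/MEM" and source["kind"] == "lw":
            continue                 # loaded value is not available yet
        if dest == id_ex["rs"]:
            top = name
        if id_ex["kind"] == "R-type" and dest == id_ex["rt"]:
            bottom = name            # bottom ALU input matters only for R-type
    return top, bottom
```

For example, if an R-type instruction in EX/MEM writes the register that the instruction in ID/EX reads as rs, the top ALU input is forwarded from EX/MEM.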


THE IDEAL, THE REALITY, & THE SOLUTION

MEMORY MANAGEMENT

The ideal
indefinite memory capacity
random access
any word instantly available

The reality
limited memory capacity
finite speeds
high speeds → high costs
high capacity → low speed

The solution
hierarchy of memories, with processor registers at the top
each step down has more capacity but slower access


MEMORY MANAGEMENT

THE SOLUTION

[Figure: memory hierarchy: processor registers, main (primary) memory, backing (secondary) store, archive; Virtual Memory spans main memory and backing store, auto archival and file retrieval span backing store and archive]

Memory management blurs the distinctions to make memory seem
as possible
as possible
as possible
as possible


ALTERNATIVE VIEW

MEMORY MANAGEMENT

[Figure: levels in the memory hierarchy, level 1 to level n, below the CPU; distance from the CPU in access time increases down the hierarchy, as does the size of the memory at each level]


PRINCIPLE(S) OF LOCALITY

MEMORY MANAGEMENT

only a small proportion is of interest at any time

is liable to be
principle of

is liable to be followed
principle of

in code, is liable to be followed by
principle of (special case of preceding principle)

on memory reference:
bring item from
to SRAM: fast, but expensive and thus small
from which
(and ask it )


AIMS

MEMORY MANAGEMENT

aims: to make memory behave
as as
as as

technique:
on a hit,
on a miss,
may need to transfer from to

hit ratio:
miss ratio:
miss penalty:
hit time:


CACHE

MEMORY MANAGEMENT - cache

small, fast between registers and

how do we map the large DRAM address space onto the small SRAM?

direct-mapped cache
address in cache with 2^n locations is just

[Figure: 16-word memory M (addresses 0000 to 1111) mapped onto an 8-location cache (addresses 000 to 111)]


ACCESSING A WORD IN CACHE – the Write Back strategy


[Figure: cache of 1024 55-bit registers, each holding a 32-bit data part, a tag residual and a 1-bit changed bit; the 10-bit cache register address from the processor's 32-bit address drives the decoder, the residual from the cache is compared with the tag address ("residuals equal?") to give the 1-bit match signal, and tristate buffers gate the 32-bit data]

R/W  Match  Tag value (changed bit)  Tag action  Cache / memory action
R    1      0                        -           take data from cache
R    1      1                        -           take data from cache
R    0      0                        -           read memory to cache
R    0      1                        clear       write old data to memory, read new word from memory
W    1      0                        set         write data to cache
W    1      1                        -           write data to cache
W    0      0                        set         write new data & address to cache
W    0      1                        set         write old data to memory, new data & address to cache
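The write-back behaviour in the table can be sketched as a small direct-mapped cache model. This is an illustration only: the class name `WriteBackCache` is hypothetical, memory is modelled as a Python dictionary, and addresses are word addresses (no byte offset) to keep the sketch short.

```python
class WriteBackCache:
    """Direct-mapped cache with one changed (dirty) bit per line."""

    def __init__(self, memory, lines=1024):
        self.memory = memory                  # dict: word address -> value
        self.lines = lines
        self.tag = [None] * lines             # residual stored with each line
        self.data = [0] * lines
        self.changed = [False] * lines

    def _locate(self, address):
        return address % self.lines, address // self.lines   # (index, tag)

    def _write_back_if_changed(self, index):
        if self.tag[index] is not None and self.changed[index]:
            old_address = self.tag[index] * self.lines + index
            self.memory[old_address] = self.data[index]      # write old data to memory

    def read(self, address):
        index, tag = self._locate(address)
        if self.tag[index] != tag:                           # miss
            self._write_back_if_changed(index)
            self.tag[index] = tag
            self.data[index] = self.memory.get(address, 0)   # read memory to cache
            self.changed[index] = False                      # clear changed bit
        return self.data[index]

    def write(self, address, value):
        index, tag = self._locate(address)
        if self.tag[index] != tag:                           # miss
            self._write_back_if_changed(index)
            self.tag[index] = tag                            # new address to cache
        self.data[index] = value                             # write data to cache
        self.changed[index] = True                           # set changed bit
```

A write only reaches memory when the line is later replaced, which is the point of the changed bit: unchanged lines can be overwritten without a memory write.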




OTHER FLAVOURS OF CACHE


(Guava, Loganberry, Snail)

changed bit is not always used
when a cache location is overwritten
it's impossible to tell
always write cache data back to memory when
Simple Swap strategy

OR Write-Through (even simpler)
always write data to cache & back to memory when
not as inefficient as you might think;
buffer queue stores data & address so

[Figure: processor and cache, with a buffer queue between the cache and Memory]


ASSESSING CACHE UPDATE ALGORITHMS


[Figure: cycle time plotted against hit ratio HR (0.7 to 1.0) for write-through, simple swap, flagged swap and buffered swap]


BLOCK TRANSFERS


temporal locality supported by

spatial locality requires



BLOCK SIZE INFLUENCES MISS RATE


[Figure: miss rate (0% to 40%) plotted against block size (4, 16, 64, 256 bytes) for cache sizes of 1 KB, 8 KB, 16 KB, 64 KB and 256 KB; inset showing a block in memory and the first word referenced]


HANDLING A CACHE MISS


what happens to the current instruction when a cache miss occurs?

consider a miss when loading an instruction
instruction register contains
we'll have to when the value has loaded
PC ← (because )
initiate
write ; reset


INTRINSITY FASTMATH PROCESSOR CACHES


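[Figure: Intrinsity FastMATH cache: the 32-bit address is split into an 18-bit tag (bits 31 to 14), a cache index (bits 13 to 6) and a block offset (bits 5 to 2); each of the 256 entries holds a valid bit V, an 18-bit tag and 512 bits of data; comparing tags gives hit, and the block offset selects the 32-bit data word]

instruction miss rate:
data miss rate: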


TRANSFERRING BLOCKS EFFICIENTLY - Wider Memory And Bus


memory speed puts lower limit on latency of access to 1st word of a block

wider memory & bus allow parallel transfer of complete block
no reduction in latency of 1st word
overall transfer rate higher

memory bus clock cycles (10 x longer than processor cycles)
to send address: 1
to access data: 15
to send data: 1

bus and memory width  cache block width  miss penalty                     bandwidth achieved
1 word                4 words            1 x 1 + 15 x 4 + 4 x 1 = 65      4 x 4 / 65 = 0.25 bytes/cycle
2 words               4 words            1 x 1 + 15 x 2 + 2 x 1 = 33      4 x 4 / 33 = 0.48 bytes/cycle
4 words               4 words            1 x 1 + 15 x 1 + 1 x 1 = 17      4 x 4 / 17 = 0.94 bytes/cycle




TRANSFERRING BLOCKS EFFICIENTLY - Interleaved Memory

[Figure: three organisations of CPU, cache and memory: 1-word wide memory M; multi-word wide memory M; interleaved memories M1, M2, M3, M4]

interleaved memories
all memories can read in parallel
single-word bus
cycle count: 1 to send address, 15 to access memory, 4 x 1 to send data
bandwidth = (4 x 4) / 20 = 0.8 bytes/cycle
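The miss penalties and bandwidths for the three memory organisations can be recomputed with a short sketch. It assumes the cycle counts given in these slides (1 bus cycle to send the address, 15 per memory access, 1 per bus transfer), a 4-word block and 4-byte words; the function names are hypothetical.

```python
def miss_penalty(block_words=4, width_words=1, send_address=1, access=15, transfer=1):
    """Bus cycles to fetch one block over a memory and bus `width_words` wide."""
    transfers = block_words // width_words
    return send_address + transfers * access + transfers * transfer


def interleaved_miss_penalty(block_words=4, send_address=1, access=15, transfer=1):
    """All banks are accessed in parallel, then words cross a 1-word bus one at a time."""
    return send_address + access + block_words * transfer


def bandwidth(penalty_cycles, block_words=4, bytes_per_word=4):
    """Bytes transferred per bus cycle for a single miss."""
    return block_words * bytes_per_word / penalty_cycles
```

Interleaving gets most of the benefit of the 4-word-wide organisation (20 cycles against 17) while keeping a single-word bus.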


CACHE DESIGN ISSUES


placement policy

size of blocks transferred to and stored in cache

memory update policy

cache size

replacement policy



MEASURING AND IMPROVING CACHE PERFORMANCE


memory stall cycles = memory accesses x miss rate x miss penalty

total CPU time = cycle time x (useful execution cycles + stall cycles)




consider a machine with separate instruction and data caches
processor CPI (without memory stalls): 2
miss penalty (all misses): 100 cycles

for a particular program:
instructions executed: I
loads and stores as % of total instructions: 36%
data cache miss rate: 4%
instruction cache miss rate: 2%

how much faster would the program be if we eliminated misses?

total miss cycles = instruction misses + data misses
                  = I x 0.02 x 100 + I x 0.36 x 0.04 x 100 = 3.44 I
CPI including memory stalls = 2 + 3.44 = 5.44
now, speed is IPC = 1/CPI
speed increase = 5.44 / 2 = 2.72




how much faster would the program be if we eliminated misses?
speed increase = 2.72

what about improving the processor performance?
let's try reducing the Cycles Per Instruction by 50%

                     CPI (no misses)  memory stalls  CPI (total)
existing processor:  2                3.44           5.44
improved processor:  1                3.44           4.44

remember Amdahl's Law?
max speedup = total time / time that cannot be reduced

speed increase = 5.44 / 4.44 = 1.23






let's try doubling the clock rate (memory read/write times don’t change)

miss penalty = 200 clock cycles (previously 100)
total miss cycles = instruction misses + data misses
                  = I x 0.02 x 200 + I x 0.36 x 0.04 x 200 = 6.88 I

CPI = 2 + 6.88 = 8.88

speed increase = execution time (slow clock) / execution time (fast clock)
               = (I x CPI1 x clock cycle1) / (I x CPI2 x 0.5 x clock cycle1)
               = 5.44 / (8.88 x 0.5)
               = 1.23
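The arithmetic of the cache-performance example can be reproduced with a few lines. This is a sketch using the figures from the example (36% loads and stores, 2% instruction miss rate, 4% data miss rate); the function name `cpi_with_stalls` and the variable names are hypothetical.

```python
def cpi_with_stalls(base_cpi, miss_penalty, i_miss=0.02, d_miss=0.04, mem_fraction=0.36):
    """CPI including memory stalls, per instruction executed."""
    stall_cycles = i_miss * miss_penalty + mem_fraction * d_miss * miss_penalty
    return base_cpi + stall_cycles


existing = cpi_with_stalls(2, 100)       # 2 + 3.44 = 5.44
perfect_cache_gain = existing / 2        # no misses at all: CPI falls back to 2
halved_cpi_gain = existing / cpi_with_stalls(1, 100)
# Doubling the clock rate doubles the miss penalty in cycles but halves the cycle time.
doubled_clock_gain = existing / (cpi_with_stalls(2, 200) * 0.5)
```

Halving the CPI and doubling the clock rate give the same improvement here, because in both cases the stall time is unchanged and only the useful execution time is halved.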






cache misses (& Amdahl) reduce impact of other improvements
increasing clock rate AND decreasing CPI incurs a double hit

with reduced cycles per instruction, stall cycles/overall cycles increases
when processor clock speeds increase, memory clock speeds don't
miss penalty (high clock speed processor) > miss penalty (low clock speed processor)

good cache design helps performance as much as increasing processor speed


ALTERNATIVE PLACEMENT POLICIES


there are various schemes for placing blocks in cache
intended to reduce cache misses

direct mapping
cache index bits are subset of memory address,
each block of memory locations has only one possible cache destination

fully associative mapping
memory blocks go anywhere in cache, source address is stored with them
cache access involves comparing address tag & cache tag at every cache location
multiple comparators → expensive hardware
suitable for L1 (only a few words) caches only

set-associative mapping
n-way SA cache has n blocks in each set
each memory block maps to any element of a unique subset ( >=2) of the cache blocks
mapping to set is direct, mapping within set is associative


INDICATIVE DIAGRAMS


[Figure: indicative diagrams of direct mapping, set-associative mappings and fully associative mapping]



FORMAL SPECIFICATIONS - DIRECT MAPPING


Consider a system with memory and cache both organised into blocks
  blocksize, $b = 2^w$ words, for some $w$
  lines in cache, $L = 2^{k_1}$ ⇒ cachesize = $2^{k_1+w}$ words
  blocks in memory, $B = 2^{k_2}$ ⇒ memorysize = $2^{k_2+w}$ words
  hence addresses are $k_2 + w$ bits long

If
  $L = 4$, $k_1 = 2$
  $B = 8$, $k_2 = 3$
  $b = 4$, $w = 2$

then these mappings hold:
  M0 → C0
  M4 → C0
  M1 → C1
  M5 → C1
  etc.

block → line: j = i mod L (i is the memory block number)

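[Figure: memory M with blocks 0-7 (Block no) mapped onto cache C with block frames 0-3. Real address fields: block no. ($k_2$ bits), word ($w$ bits). Cache address fields: tag, line ($k_1$ bits), word ($w$ bits).]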
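The direct-mapping rule j = i mod L amounts to slicing bit fields out of the address. The sketch below assumes word addressing and the example sizes from this slide (w = 2, k1 = 2, k2 = 3); the function name `split_address` is invented for illustration.

```python
def split_address(addr, w, k1, k2):
    """Split a (k2 + w)-bit word address into (tag, line, word) for a direct-mapped cache."""
    assert 0 <= addr < (1 << (k2 + w)), "address must fit in k2 + w bits"
    word = addr & ((1 << w) - 1)       # low w bits: word within the block
    block = addr >> w                  # k2-bit memory block number i
    line = block & ((1 << k1) - 1)     # j = i mod L, with L = 2**k1
    tag = block >> k1                  # remaining k2 - k1 bits, stored with the data
    return tag, line, word

W, K1, K2 = 2, 2, 3
for block in range(1 << K2):
    tag, line, _ = split_address(block << W, W, K1, K2)
    print(f"M{block} -> C{line} (tag {tag})")
```

Because L is a power of two, the mod operation is just a bit mask, which is why direct mapping is cheap in hardware.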



FORMAL SPECIFICATIONS - Fully Associative Mapping


direct mapping
  desired data may be in only one cache location
  though mapping is many-to-one

fully associative mapping
  desired data may be in any one of several cache locations
  the one which contains the word addressed (if any) must then be identified
  many-to-many mapping

again, divide memory into $2^{k_2}$ blocks of $2^w$ words; $(k_2 + w)$-bit addresses
each cache line contains
  a data field ($2^w$ words)
  a tag field (top $k_2$ bits of the block's address)
  an equality-detect circuit (tag field = top $k_2$ bits of address)

at each memory reference
  if tag matches, cache contains the addressed data
  the equality-detect signal acts as a "line select" to allow I/O on the appropriate word of the line
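A fully associative lookup can be sketched as a search over every line. In hardware all the tag comparisons happen in parallel, so the loop below stands in for one equality-detect circuit per line; the `Line` class and `fa_lookup` function are hypothetical names, not part of the course material.

```python
from dataclasses import dataclass, field

@dataclass
class Line:
    valid: bool = False
    tag: int = 0
    data: list = field(default_factory=list)

def fa_lookup(cache, addr, w):
    """Return the addressed word, or None on a miss, in a fully associative cache."""
    tag = addr >> w                    # top k2 bits: the memory block number
    word = addr & ((1 << w) - 1)       # low w bits: word within the block
    for line in cache:                 # hardware compares all tags at once
        if line.valid and line.tag == tag:
            return line.data[word]     # the match acts as the "line select"
    return None

cache = [Line() for _ in range(4)]
cache[2] = Line(valid=True, tag=5, data=[50, 51, 52, 53])
print(fa_lookup(cache, (5 << 2) | 1, w=2))   # block 5, word 1: hit
print(fa_lookup(cache, (6 << 2) | 1, w=2))   # block 6 is not cached: miss
```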



4-WAY FULLY-ASSOCIATIVE CACHE


[Figure: 4-way fully-associative cache. Labels: address bits 31-0 with a cache index field; four entries, each with a valid bit (V), a tag and data; outputs: hit, data.]



SET-ASSOCIATIVE CACHE

MEMORY MANAGEMENT - Cache

amalgam of direct and associative schemes
cache has structure
  lines are grouped into sets
  block has $2^w$ words - as before
memory organisation
cache has ...

[Figure: real address fields: block index ($k_2$ bits), word ($w$ bits). Cache address fields: tag (stored in cache with data), set no ($k_0$ bits), word ($w$ bits).]

direct mapping works at first
  set j = i mod S (i is a block in main memory)

then, after the set has been selected: associative search for block
then: word offset to find specific word
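The two-stage search (direct to the set, associative within it) might look like the sketch below. The cache contents, the 2-way/2-set sizes and the function name `sa_lookup` are all assumptions chosen for illustration.

```python
def sa_lookup(sets, block_no, num_sets):
    """Look up memory block `block_no` in a set-associative cache.

    `sets` is a list of sets; each set is a list of (valid, tag, data) tuples.
    """
    set_no = block_no % num_sets       # direct step: j = i mod S
    tag = block_no // num_sets         # remaining high-order bits of the block number
    for valid, stored_tag, data in sets[set_no]:   # associative step within the set
        if valid and stored_tag == tag:
            return data
    return None

# 2-way cache with S = 2 sets
sets = [
    [(True, 3, "block 6"), (False, 0, None)],       # set 0
    [(True, 0, "block 1"), (True, 2, "block 5")],   # set 1
]
print(sa_lookup(sets, 5, 2))   # set 1, tag 2: hit
print(sa_lookup(sets, 4, 2))   # set 0, tag 2: miss
```

Only the lines of one set are compared, so an n-way cache needs n comparators however many sets it has.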



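2-WAY SET-ASSOCIATIVE CACHE


[Figure: 2-way set-associative cache. Address bits 31-0 supply the cache index, which selects a set; each way holds a valid bit (V), a tag and data; the tags are compared (=) with the address tag; outputs: hit, data.]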



L1, L2, L3


L1 cache
  typically 8KB - 128KB
  part of processor core
  fast technology (SRAM)
  processor speed

L2 cache
  typically 256KB - 1MB
  originally off chip; now often on
  ½ or ¼ processor speed

L3 cache
  increasingly common
  typically 16MB - 256MB
  off chip (but sometimes on the same die)
  expensive; used in high-end processors
  ½ L2 cache speed

Some L2 and even L3 caches run at processor speeds.

So what's the point of smaller L1 cache?

A more sophisticated, more expensive, more efficient memory mapping policy is desirable but only cost-effective on a small scale???




CPU                       Cache size in the CPU
80486DX and DX2           8 KB L1
80486DX4                  16 KB L1
Pentium                   16 KB L1
Pentium Pro               16 KB L1 + 256 KB L2 (some 512 KB L2)
Pentium MMX               32 KB L1
AMD K6 and K6-2           64 KB L1
Pentium II and III        32 KB L1
Celeron                   32 KB L1 + 128 KB L2
Pentium III Cumine        32 KB L1 + 256 KB L2
AMD K6-3                  64 KB L1 + 256 KB L2
AMD K7 Athlon             128 KB L1
AMD Duron                 128 KB L1 + 64 KB L2
AMD Athlon Thunderbird    128 KB L1 + 256 KB L2



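[Figure: Pentium CPU containing registers and L1 cache (16KB); L2 cache (256KB), RAM (32M) and the I/O busses are reached over the system bus]

more modern processors: L2 cache on the processor chip

[Figure: CPU and L2 cache together on the processor chip]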



BLOCK REPLACEMENT POLICY


direct mapping cache
  no choice; incoming block can only go into one slot

associative cache
  any block could go

set-associative cache
  incoming block can only go into one set
  any block in selected set could go

FIFO

Least Recently Used; significantly better performance than FIFO
  each block in a set has a reference number
  when block with reference no. n is referenced
    reference nos < n are incremented
    reference no. of referenced block is set to 0
  block-to-go is always block with largest reference number
  expensive hardware when sets are large
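The reference-number scheme can be sketched for a single set as follows. It assumes one reference number per block in the set, starting from distinct values; `touch` and `victim` are illustrative names.

```python
def touch(ages, way):
    """Record a reference to `way`; `ages` holds one reference number per block in the set."""
    n = ages[way]
    for i, age in enumerate(ages):
        if age < n:
            ages[i] = age + 1          # blocks used more recently than `way` age by one
    ages[way] = 0                      # the referenced block becomes the most recent

def victim(ages):
    """Block-to-go: the block with the largest reference number."""
    return ages.index(max(ages))

ages = [0, 1, 2, 3]                    # 4-way set; way 3 starts as least recently used
for way in (3, 1, 3, 0):
    touch(ages, way)
print(ages, victim(ages))              # way 2 was never referenced, so it is the victim
```

Every reference may update all the counters in the set, which is where the hardware cost for large sets comes from.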



THE NEXT PHASE

MEMORY MANAGEMENT – virtual memory

[Figure: the memory hierarchy: processor registers; cache (L1, L2, L3); main (1°) memory; backing (2°) store; archive. Labels between levels: Virtual Memory; auto archival and file retrieval.]

Memory management blurs the distinctions to make memory seem
  as big as possible
  as fast as possible
  as cheap as possible
  as secure as possible



VIRTUAL MEMORY'S RAISONS D'ETRE


Original VM let programs use a memory space larger than physical memory
  previously, programmers had to divide programs into mutually exclusive overlays (code & data)
  program had to control loading of its own overlays
  VM automatically maps program pages onto physical memory addresses

Also allowed multiple programs to run simultaneously
  independent virtual address spaces
  memory protection
  predominant use today

[Figure: programs 1-4 sharing installed memory: individually smaller, together larger]



GENERAL IDEA


program code and data are stored as fixed-sized units called pages
  pages live on disk, have a disk address, are copied into memory when necessary
  memory operations use (Virtual Page no, page offset)

program's code (and data) pages don't have to be contiguous

address translation unit converts VP no. into base address of page in physical memory
  base address + page offset produces real address of data
  bits in offset ⇒ page size
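As a rough sketch of the translation step, the function below splits a virtual address into (VP no, offset) and adds the offset to the page's base address. The 12-bit offset (4KB pages), the dictionary standing in for the translation unit, and the addresses are all assumptions made for the example.

```python
PAGE_OFFSET_BITS = 12                  # assumed: 4KB pages

def translate(vaddr, page_base):
    """Map a virtual address to a real address using a dict {VP no: physical base address}."""
    vpn = vaddr >> PAGE_OFFSET_BITS
    offset = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)
    return page_base[vpn] + offset     # base address + page offset

# two consecutive virtual pages held in non-contiguous physical pages (made-up values)
page_base = {0: 0x7000, 1: 0x2000}
print(hex(translate(0x0ABC, page_base)))   # VP 0, offset 0xABC
print(hex(translate(0x1004, page_base)))   # VP 1, offset 0x004
```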



THE ISA MODEL AND REALITY


[Figure: the model: processor, virtual memory space, base + offset. The reality: processor, address translator (a to a'), main memory, fault handler, 2° memory.]



PRAGMATIC DECISIONS


page fault costs millions of clock cycles
  mostly latency of first word
  later words arrive comparatively rapidly
  so make the page size big enough to repay cost of page fault: 4KB – 64KB

it's worth putting considerable effort into reducing page faults
  fully associative page placement

long disk access allows time for (complex) software to handle page faults

long write time justifies complexity of write-back over write-through

with fixed size pages, (page no., offset) boundary in addresses is transparent
  cf. variable-sized segments, where software manipulates segment no. & offset explicitly

memory protection can use the same mechanisms as virtual memory



PAGE PLACEMENT


virtual pages can go anywhere in memory
  huge miss penalty means it's worth using complex algorithms & data structures

virtual page number → physical page number (via the page table)

each process has a page table
  page table lives in memory
  page table register points at start of page table
  to perform a process swap, point page table register at a different process's page table
  (and swap the program counter and processor registers)



PAGE TABLE


[Figure: the page table register points at the page table. The virtual address is a 20-bit virtual page number (bits 31-12) and a 12-bit page offset (bits 11-0). The virtual page number selects a page table entry holding a valid bit V and an 18-bit physical page number; if V is 0, then page is not in memory. The physical address is the physical page number (bits 29-12) followed by the page offset (bits 11-0).]

32-bit virtual address, 30-bit physical address (2 bottom bits = 0)
virtual address space 4 x larger than physical address space

page table stored in (32-bit) main memory has 19-bit entries
13 extra bits (not shown) for page protection information

a similar disk page table holds disk addresses
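A page table lookup with these field widths could be sketched as below. The list-of-tuples representation, the `PageFault` exception and the sample entry are assumptions for illustration; a real table packs V and the page number into one memory word.

```python
class PageFault(Exception):
    """Raised when the valid bit of the selected entry is 0."""

def lookup(page_table, vaddr):
    """Translate a 32-bit virtual address to a 30-bit physical address.

    `page_table` is a list of (valid, physical_page_number) entries indexed by
    virtual page number.
    """
    vpn = vaddr >> 12                  # bits 31-12: virtual page number
    offset = vaddr & 0xFFF             # bits 11-0: page offset
    valid, ppn = page_table[vpn]
    if not valid:
        raise PageFault(vpn)           # the OS would now fetch the page from disk
    return (ppn << 12) | offset        # 18-bit page number followed by the 12-bit offset

page_table = [(0, 0)] * (1 << 20)      # one entry per virtual page
page_table[3] = (1, 0x2A)
print(hex(lookup(page_table, 0x3123)))
```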



PAGE FAULTS


not all of a program's pages have to be in memory while it's running
  only pages that have been referenced since the process was swapped in

but disk space is much more abundant
  OS usually reserves enough disk space for all the process's pages
  called the swap space

OS is responsible for handling page faults
  hardware detects that valid bit for selected page is FALSE



PAGE REPLACEMENT


if the OS needs to bring in a page and all page frames in main memory are in use
  it must oust a page
  which page? main memory is a fully associative store

page-to-go should be the page that will remain unused for as long as possible
  predicting the future is a difficult business
  prediction: Least Recently Used page will be Furthest Future Use page

ousted page goes to swap space

strict LRU algorithm would collect stats at every memory reference
  too expensive
  instead, each page reference sets a reference bit for the page
  reference bits are reset periodically, tested after standard delay
  any page with reference bit still reset is Not Recently Used – can be paged out
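The reference-bit approximation can be sketched in a few lines. The `Page` class and the function names are hypothetical; in a real system the bit is set by hardware and the reset and test are done by the OS on a timer.

```python
class Page:
    def __init__(self, number):
        self.number = number
        self.referenced = False

def reference(page):
    page.referenced = True             # done by hardware on every access to the page

def reset_all(pages):
    for page in pages:                 # done periodically by the OS
        page.referenced = False

def not_recently_used(pages):
    """Pages whose reference bit is still reset after the delay: candidates to page out."""
    return [p.number for p in pages if not p.referenced]

pages = [Page(n) for n in range(4)]
reset_all(pages)
for n in (0, 2, 2, 0):                 # references made during the standard delay
    reference(pages[n])
print(not_recently_used(pages))
```

One bit per page replaces the per-reference bookkeeping of strict LRU, at the cost of not ranking the unused pages against each other.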



PAGE TABLE SIZES


for a machine with 32-bit addresses, 32-bit page table entries and 4KB pages
  how many page table entries? $2^{32} / 2^{12} = 2^{20}$
  how big is the page table? $2^{20} \times 2^{2} = 2^{22}$ B = 4MB

typically scores or hundreds of processes running at a time
  so, maybe 400MB in total
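The size estimate is easy to reproduce. In this sketch the function name `page_table_bytes` is invented, and the figure of 100 processes is an assumption chosen to match the "maybe 400MB" estimate.

```python
def page_table_bytes(address_bits, page_bytes, entry_bytes):
    """Size of a flat page table with one entry per virtual page."""
    entries = 2**address_bits // page_bytes
    return entries * entry_bytes

size = page_table_bytes(address_bits=32, page_bytes=4 * 1024, entry_bytes=4)
print(size // 2**20, "MB per process")
print(100 * size // 2**20, "MB for 100 processes")   # 100 is an assumed process count
```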

how to minimise memory usage generally and size of page table in particular?
  use dynamically-sized page table; only big if program is page-greedy

keep last page register (aka page limit register)
  forces page table to grow in 1 direction only

but stack and heap usually grow in opposite directions

[Figure: a process's address space: code; static data (constants, arrays); dynamic data (lists, trees); stack]





PAGE TABLE SIZES


how to minimise memory usage generally and size of page table in particular?
  use dynamically-sized page table; only big if program is page-greedy

keep last page register (aka page limit register)
  forces page table to grow in 1 direction only

but stack and heap usually grow in opposite directions
  separate page tables, with 2 page limit registers, 1 for up, 1 for down
  top bit of address differentiates between top & bottom segments of address space

inverted page table
  only in-use pages are stored; can't use address to index page's entry, must search
  to reduce search time, hash the addresses' entries

multi-level (tree-structured) page table
  complex, but suits non-contiguous pages
  highest order bits address a "segment"; if valid, lower bits address page in segment

page the page table
  increases no. of page faults; lock some pages of page table in memory
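Of the options above, the multi-level table is perhaps the easiest to sketch. The example below assumes a two-level table and a 10/10/12 split of a 32-bit address, neither of which is specified in the slides; dictionaries stand in for the tables so that unused segments cost nothing.

```python
def two_level_lookup(top_table, vaddr, page_bits=10, offset_bits=12):
    """Walk a two-level page table; the 10/10/12 address split is an assumption."""
    offset = vaddr & ((1 << offset_bits) - 1)
    page = (vaddr >> offset_bits) & ((1 << page_bits) - 1)
    segment = vaddr >> (offset_bits + page_bits)   # highest-order bits pick the "segment"
    second_level = top_table.get(segment)          # absent: segment is not valid
    if second_level is None or page not in second_level:
        return None                                # would be a page fault
    return (second_level[page] << offset_bits) | offset

# Only the segments actually in use have second-level tables (values are made up).
top_table = {
    0: {0: 0x10, 1: 0x11},             # low addresses, e.g. code and heap
    1023: {1023: 0x7F},                # top of the address space, e.g. stack
}
print(hex(two_level_lookup(top_table, 0x00001ABC)))
print(hex(two_level_lookup(top_table, 0xFFFFF004)))
print(two_level_lookup(top_table, 0x40000000))
```

This handles a stack and heap at opposite ends of the address space without page limit registers, since only the two occupied segments need second-level tables.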



HANDLING WRITE OPERATIONS


cache can use writethrough
  memory only hundreds of times slower than registers
  small writethrough buffer masks write latency

VM write has to wait for disk access
  millions of clock cycles
  buffer would be impractical
  writeback used instead
  disk write only occurs when essential – when OS overwrites a dirty page in memory



THE TLB – HANDLING THE OVERHEAD


page table resides in memory
  ordinary read or write (Instr. Fetch or lw/sw instruction) involves 2 memory accesses
    1 to convert virtual address to physical address
    1 to access the data

to reduce memory references
  principle of locality applies to page table entries too
  Translation Lookaside Buffer is a translation cache
    cf. scraps of paper provided by libraries for writing library call no. on
  maintains list of locations of a subset of pages
  replacement policy difficult
    software too slow; hardware too expensive for complex policy (e.g. LRU)
    often randomly choose entry-to-go
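A TLB with random replacement can be modelled as a small dictionary in front of the page table. This is a sketch under assumptions: the `TLB` class, its capacity of 4 entries and the dictionary page table are all invented for the example.

```python
import random

class TLB:
    """A tiny fully associative translation cache with random replacement."""

    def __init__(self, page_table, capacity=4, seed=0):
        self.page_table = page_table   # dict {VPN: PPN}, standing in for the in-memory table
        self.capacity = capacity
        self.entries = {}              # cached subset of the page table
        self.rng = random.Random(seed)
        self.hits = self.misses = 0

    def translate(self, vpn):
        if vpn in self.entries:
            self.hits += 1             # no extra memory access needed
            return self.entries[vpn]
        self.misses += 1               # costs one access to the page table
        if len(self.entries) >= self.capacity:
            del self.entries[self.rng.choice(list(self.entries))]   # random entry-to-go
        self.entries[vpn] = self.page_table[vpn]
        return self.entries[vpn]

tlb = TLB({vpn: vpn + 100 for vpn in range(16)})
for vpn in [1, 1, 2, 1, 2, 3, 1]:      # locality: a few pages referenced repeatedly
    tlb.translate(vpn)
print(tlb.hits, tlb.misses)
```

With this reference pattern only the first touch of each page misses, which is the locality argument in miniature.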



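TLB CIRCUIT OUTLINE


[Figure: TLB and cache circuit. The virtual address is a 20-bit virtual page number (bits 31-12) and a 12-bit page offset (bits 11-0). Each TLB entry holds V, Dirty, Tag and Physical Page Number; the virtual page number is compared (=) with every tag to produce the TLB hit signal. The physical address (20-bit physical page number + page offset) is split into physical address tag, 8-bit cache index, block offset and byte offset; the index selects one of 256 cache entries, and a tag comparison (=) produces the cache hit signal and the data.]

if no hit in TLB, use page table
if no hit there, get page from disk
if no cache hit, use memory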





USING THE TLB FOR A READ/WRITE OPERATION


virtual address
TLB access
TLB hit?
  No: TLB miss exception
  Yes: TLB provides physical address
    write?
      No: try to read data from cache
        cache hit?
          No: cache miss stall
          Yes: deliver data to the CPU
      Yes: write access bit on?
        No: write protection exception
        Yes: write data into cache, update the tag, put data & address into the write buffer
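The decision sequence can be written out as one function. This is a simplified sketch: the exception classes, the dictionaries used for the TLB and cache, and the string returned on a cache miss are all stand-ins invented for the example, and the 12-bit page offset is carried over from the earlier slides.

```python
class TLBMiss(Exception):
    pass

class WriteProtection(Exception):
    pass

def access(tlb, cache, write_buffer, vpn, offset, write=False, data=None):
    """Follow the flowchart for one access. `tlb` maps VPN -> (PPN, write_allowed)."""
    if vpn not in tlb:
        raise TLBMiss(vpn)                     # TLB miss exception
    ppn, write_allowed = tlb[vpn]
    paddr = (ppn << 12) | offset               # TLB provides the physical address
    if write:
        if not write_allowed:
            raise WriteProtection(vpn)         # write access bit is off
        cache[paddr] = data                    # write data into cache (the key plays the tag)
        write_buffer.append((paddr, data))     # put data & address into the write buffer
        return None
    if paddr not in cache:
        return "cache miss stall"              # a real processor would stall and refill
    return cache[paddr]                        # deliver data to the CPU

tlb = {1: (0x2A, True), 2: (0x2B, False)}
cache, write_buffer = {}, []
access(tlb, cache, write_buffer, 1, 0x10, write=True, data=99)
print(access(tlb, cache, write_buffer, 1, 0x10))
print(access(tlb, cache, write_buffer, 1, 0x20))
```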



THE SYNERGY BETWEEN VM AND MEMORY PROTECTION


Machines run multiple processes "simultaneously"
  on single processor machines only one process is active at a time
  simple multitasking swaps between processes when I/O occurs
  preemptive multitasking swaps at short intervals
    gives users on multiuser systems the impression of sole access
    allows modern multiprocessing systems to handle hundreds of processes

[Figure: process states: the active process makes an I/O request and joins the processes engaged in I/O; when I/O terminates it joins the ready processes]



THE SYNERGY BETWEEN VM AND MEMORY PROTECTION


VM allows processes ...
a process's pages can be ...
what if one process addresses outside its own space?



PROTECTION REQUIREMENTS


separate user and OS (supervisor) modes
  some instructions available ...

ability to make certain information read-only for user processes

mechanism for swapping between modes
  transfers control to ...
  store in return from exception

put page tables in ...
  allows OS to ...
  prevents user process from ...
  prevents user process from ...



IMPLEMENTING A PROCESS SWITCH


on a machine without a TLB, this is comparatively simple
  point page table register at ...

consider a switch from P1 to P2

on a machine with a TLB
  need to ...
  need to ...

replacing all of P1's entries in TLB can be inefficient if: ...

problem: virtual address spaces the same
solution: make them different
  give each process ...
  OS remembers ... put ...
  at each memory access, ...



SHARING INFORMATION


in general, one process should not be able to ...
  TLB has ... that prevents a process from ...

sometimes processes need to be able to share information
  P1 wants to access information in a page owned by P2
    process P2 asks OS to create a new page table entry
    page table entry goes into P1's virtual address space but accesses P2's physical page
    P2 can ask OS to set write protection bit in P1's page so P1 can't update the page
