Download - Cpu Organisasi
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 1/108
EECC550 - ShaabanEECC550 - Shaaban#1 Final Review Winter 2000 2-19-2001
CPU OrganizationCPU Organization• Datapath Design:
– Capabilities & performance characteristics of principal
Functional Units (FUs):
– (e.g., Registers, ALU, Shifters, Logic Units, ...)
– Ways in which these components are interconnected (buses
connections, multiplexors, etc.).
– How information flows between components.
• Control Unit Design:
– Logic and means by which such information flow is controlled.
– Control and coordination of FUs operation to realize the targeted
Instruction Set Architecture to be implemented (can either beimplemented using a finite state machine or a microprogram).
• Hardware description with a suitable language, possibly
using Register Transfer Notation (RTN).
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 2/108
EECC550 - ShaabanEECC550 - Shaaban#2 Final Review Winter 2000 2-19-2001
Hierarchy of Computer ArchitectureHierarchy of Computer Architecture
I/O systemInstr. Set Proc.
Compiler
OperatingSystem
Application
Digital Design
Circuit Design
Instruction Set Architecture
Firmware
Datapath & Control
Layout
Software
Hardware
Software/Hardware
Boundary
High-Level Language Programs
Assembly Language
Programs
Microprogram
Register Transfer
Notation (RTN)Logic Diagrams
Circuit Diagrams
Machine Language
Program
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 3/108
EECC550 - ShaabanEECC550 - Shaaban#3 Final Review Winter 2000 2-19-2001
• For a specific program compiled to run on a specific machine“A”, the following parameters are provided:
– The total instruction count of the program.
– The average number of cycles per instruction (average CPI).
– Clock cycle of machine “A”
• How can one measure the performance of this machine running this
program?
– Intuitively the machine is said to be faster or has better
performance running this program if the total execution time is
shorter.
– Thus the inverse of the total measured program execution time is
a possible performance measure or metric:
PerformanceA = 1 / Execution TimeA
How to compare performance of different machines?
What factors affect performance? How to improve performance?
Computer Performance Measures:Computer Performance Measures:Program Execution TimeProgram Execution Time
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 4/108
EECC550 - ShaabanEECC550 - Shaaban#4 Final Review Winter 2000 2-19-2001
CPU Execution Time: The CPU EquationCPU Execution Time: The CPU Equation
• A program is comprised of a number of instructions – Measured in: instructions/program
• The average instruction takes a number of cycles per
instruction (CPI) to be completed.
– Measured in: cycles/instruction
• CPU has a fixed clock cycle time = 1/clock rate
– Measured in: seconds/cycle
• CPU execution time is the product of the above three
parameters as follows:
CPU time = Seconds = Instructions x Cycles x Seconds
Program Program Instruction Cycle
CPU time = Seconds = Instructions x Cycles x Seconds
Program Program Instruction Cycle
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 5/108
EECC550 - ShaabanEECC550 - Shaaban
#5 Final Review Winter 2000 2-19-2001
Factors Affecting CPU PerformanceFactors Affecting CPU Performance
CPU time = Seconds = Instructions x Cycles x Seconds Program Program Instruction Cycle
CPU time = Seconds = Instructions x Cycles x Seconds
Program Program Instruction Cycle
CPI Clock RateInstruction
Count
Program
Compiler
Organization
Technology
Instruction Set
Architecture (ISA)
X
X
X
X
X
X
X X
X
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 6/108
EECC550 - ShaabanEECC550 - Shaaban
#6 Final Review Winter 2000 2-19-2001
Aspects of CPU Execution TimeAspects of CPU Execution Time
CPU Time = Instruction count x CPI
x Clock cycle
Instruction CountInstruction Count
Clock Clock CycleCycle
CPICPIDepends on:
CPU Organization
Technology
Depends on:
Program Used
Compiler
ISA
CPU Organization
Depends on:
Program Used
Compiler
ISA
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 7/108
EECC550 - ShaabanEECC550 - Shaaban
#7 Final Review Winter 2000 2-19-2001
Instruction Types & CPI: An ExampleInstruction Types & CPI: An Example• An instruction set has three instruction classes:
• Two code sequences have the following instruction counts:
• CPU cycles for sequence 1 = 2 x 1 + 1 x 2 + 2 x 3 = 10 cycles
CPI for sequence 1 = clock cycles / instruction count
= 10 /5 = 2
• CPU cycles for sequence 2 = 4 x 1 + 1 x 2 + 1 x 3 = 9 cycles
CPI for sequence 2 = 9 / 6 = 1.5
Instruction class CPI
A 1
B 2
C 3
Instruction counts for instruction classCode Sequence A B C
1 2 1 2
2 4 1 1
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 8/108
EECC550 - ShaabanEECC550 - Shaaban
#8 Final Review Winter 2000 2-19-2001
Instruction Frequency & CPIInstruction Frequency & CPI
• Given a program with n types or classes of instructions with the following characteristics:
C i = Count of instructions of typei
CPI i = Average cycles per instruction of typei
Fi = Frequency of instruction typei
= Ci/ total instruction count
Then:
( )∑=
×=n
i
ii F CPI CPI
1
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 9/108
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 10/108
EECC550 - ShaabanEECC550 - Shaaban#10 Final Review Winter 2000 2-19-2001
Metrics of Computer PerformanceMetrics of Computer Performance
Compiler
ProgrammingLanguage
Application
Datapath
Control
Transistors Wires Pins
ISA
Function Units
Cycles per second (clock rate).
Megabytes per second.
Execution time: Target workload,
SPEC95, etc.
Each metric has a purpose, and each can be misused.
(millions) of Instructions per second – MIPS(millions) of (F.P.) operations per second – MFLOP/s
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 11/108
EECC550 - ShaabanEECC550 - Shaaban#11 Final Review Winter 2000 2-19-2001
Performance Enhancement Calculations:Performance Enhancement Calculations:
Amdahl's LawAmdahl's Law
• The performance enhancement possible due to a given designimprovement is limited by the amount that the improved feature is used
• Amdahl’s Law:
Performance improvement or speedup due to enhancement E:
Execution Time without E Performance with E
Speedup(E) = -------------------------------------- = --------------------------------- Execution Time with E Performance without E
– Suppose that enhancement E accelerates a fraction F of theexecution time by a factor S and the remainder of the time is
unaffected then:
Execution Time with E = ((1-F) + F/S) X Execution Time without E
Hence speedup is given by:
Execution Time without E 1Speedup(E) = --------------------------------------------------------- = --------------------
((1 - F) + F/S) X Execution Time without E (1 - F) + F/S
Note: All fractions here refer to original execution time.
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 12/108
EECC550 - ShaabanEECC550 - Shaaban#12 Final Review Winter 2000 2-19-2001
Pictorial Depiction of Amdahl’s LawPictorial Depiction of Amdahl’s Law
Before:
Execution Time without enhancement E:
Unaffected, fraction: (1- F)
After:
Execution Time with enhancement E:
Enhancement E accelerates fraction F of execution time by a factor of S
Affected fraction: F
Unaffected, fraction: (1- F) F/S
Unchanged
Execution Time without enhancement E 1Speedup(E) = ------------------------------------------------------ = ------------------ Execution Time with enhancement E (1 - F) + F/S
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 13/108
EECC550 - ShaabanEECC550 - Shaaban#13 Final Review Winter 2000 2-19-2001
Amdahl's Law With Multiple Enhancements:Amdahl's Law With Multiple Enhancements:
ExampleExample
• Three CPU performance enhancements are proposed with the followingspeedups and percentage of the code execution time affected:
Speedup1 = S1 = 10 Percentage1 = F1 = 20%
Speedup2 = S2 = 15 Percentage1 = F2 = 15%
Speedup3 = S3 = 30 Percentage1 = F3 = 10%
• While all three enhancements are in place in the new design, eachenhancement affects a different portion of the code and only one
enhancement can be used at a time.
• What is the resulting overall speedup?
• Speedup = 1 / [(1 - .2 - .15 - .1) + .2/10 + .15/15 + .1/30)]
= 1 / [ .55 + .0333 ]
= 1 / .5833 = 1.71
∑ ∑+−
=
i i
i
i
i
S F
F
Speedup
)( )1
1
(
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 14/108
EECC550 - ShaabanEECC550 - Shaaban#14 Final Review Winter 2000 2-19-2001
MIPS Instruction FormatsMIPS Instruction Formats
• op: Opcode, operation of the instruction.
• rs, rt, rd: The source and destination register specifiers.
• shamt: Shift amount.
• funct: selects the variant of the operation in the “op” field.
• address / immediate: Address offset or immediate value.
• target address: Target address of the jump instruction.
op target address
02631
6 bits 26 bits
op rs rt rd shamt funct
061116212631
6 bits 6 bits5 bits5 bits5 bits5 bits
op rs rt immediate
016212631
6 bits 16 bits5 bits5 bits
R-Type
I-Type: ALU
Load/Store, Branch
J-Type: Jumps
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 15/108
EECC550 - ShaabanEECC550 - Shaaban#15 Final Review Winter 2000 2-19-2001
A Single Cycle MIPS A Single Cycle MIPS DatapathDatapath
i m m
1 6
32
ALUctr
Clk
busW
RegWr
32
32
busA
32
busB
55 5
Rw Ra Rb
32 32-bit
Registers
Rs
Rt
Rt
RdRegDst
Ex t e n d e r
M ux
3216
imm16
ALUSrcExtOp
M ux
MemtoReg
Clk
Data In WrEn
32
Adr Data
Memory
MemWr
AL U
Equal
Instruction<31:0>
0
1
0
1
01
< 2 1 : 2 5 >
< 1 6 : 2 0 >
< 1 1 :
1 5 >
< 0 : 1
5 >
Imm16RdRtRs
=
A d d e r
A d
d e r
P C
Clk
0 0
M ux
4
nPC_sel
P CEx t
Adr
Inst
Memory
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 16/108
EECC550 - ShaabanEECC550 - Shaaban#16 Final Review Winter 2000 2-19-2001
RegDst ALUctr ALUSrc MemtoReg Equal
Instruction<31:0><
2 1 : 2 5 >
<
1 6 : 2 0 >
<
1 1 : 1 5 >
< 0 : 1 5 >
Imm16RdRsRt
Adr
Instruction
Memory
DATA PATHDATA PATH
ExtOp MemWr
Control UnitControl Unit
Op
<
2 1 : 2 5 >
Fun
nPC_sel RegWr
< 0 : 2 5 >
Jump_target
Jump
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 17/108
EECC550 - ShaabanEECC550 - Shaaban#17 Final Review Winter 2000 2-19-2001
The Truth Table For The Main ControlThe Truth Table For The Main Control
R-type ori lw sw beq jump
RegDst
ALUSrc
MemtoReg
RegWrite
MemWrite
Branch
Jump
ExtOp
ALUop (Symbolic)
1
0
0
1
0
0
0
x
“R-type”
0
1
0
1
0
0
0
0
Or
0
1
1
1
0
0
0
1
Add
x
1
x
0
1
0
0
1
Add
x
0
x
0
0
1
0
x
Subtract
x
x
x
0
0
0
1
x
xxx
op 00 0000 00 1101 10 0011 10 1011 00 0100 00 0010
ALUop <2> 1 0 0 0 0 x
ALUop <1> 0 1 0 0 0 x
ALUop <0> 0 0 0 0 1 x
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 18/108
EECC550 - ShaabanEECC550 - Shaaban#18 Final Review Winter 2000 2-19-2001
Clk
PC
Rs, Rt, Rd,
Op, Func
Clk-to-Q
ALUctr
Instruction Memoey Access Time
Old Value New Value
RegWr Old Value New Value
Delay through Control Logic
busA
Register File Access Time
Old Value New Value
busB
ALU Delay
Old Value New Value
Old Value New Value
New ValueOld Value
ExtOp Old Value New Value
ALUSrc Old Value New Value
MemtoReg Old Value New Value
Address Old Value New Value
busW Old Value New
Delay through Extender & Mux
Register
Write Occurs
Data Memory Access Time
Worst Case Timing (Load)Worst Case Timing (Load)
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 19/108
EECC550 - ShaabanEECC550 - Shaaban#19 Final Review Winter 2000 2-19-2001
MIPS Single Cycle InstructionMIPS Single Cycle Instruction
Timing ComparisonTiming Comparison
PC Inst Memory mux ALU Data Mem mux
PC Reg FileInst Memory mux ALU mux
PC Inst Memory mux ALU Data Mem
PC Inst Memory cmp mux
Reg File
Reg File
Reg File
Arithmetic & Logical
Load
Store
Branch
Critical Path
setup
setup
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 20/108
EECC550 - ShaabanEECC550 - Shaaban#20 Final Review Winter 2000 2-19-2001
CPU Design StepsCPU Design Steps
1. Analyze instruction set operations using independentRTN => datapath requirements.
2. Select set of datapath components & establish clock
methodology.
3. Assemble datapath meeting the requirements.
4. Analyze implementation of each instruction to determinesetting of control points that effects the register transfer.
5. Assemble the control logic.
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 21/108
EECC550 - ShaabanEECC550 - Shaaban#21 Final Review Winter 2000 2-19-2001
Drawback of Single Cycle ProcessorDrawback of Single Cycle Processor
• Long cycle time.
• All instructions must take as much time as the slowest:
– Cycle time for load is longer than needed for all otherinstructions.
• Real memory is not as well-behaved as idealized memory
– Cannot always complete data access in one (short) cycle.
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 22/108
EECC550 - ShaabanEECC550 - Shaaban#22 Final Review Winter 2000 2-19-2001
Reducing Cycle Time: Multi-Cycle DesignReducing Cycle Time: Multi-Cycle Design• Cut combinational dependency graph by inserting registers / latches.
• The same work is done in two or more fast cycles, rather than one slowcycle.
storage element
Acyclic
CombinationalLogic
storage element
storage element
AcyclicCombinational
Logic (A)
storage element
storage element
Acyclic
CombinationalLogic (B)
=>
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 23/108
EECC550 - ShaabanEECC550 - Shaaban#23 Final Review Winter 2000 2-19-2001
Instruction Processing CyclesInstruction Processing Cycles
Obtain instruction from program storage
Determine instruction type
Obtain operands from registers
Compute result value or status
Store result in register/memory if needed
(usually called Write Back).
Update program counter to address
of next instruction}
Common
steps
for all
instructions
Instruct ion
Fetch
Instruct ion
Decode
Execute
Result
Store
Next
Instruct ion
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 24/108
EECC550 - ShaabanEECC550 - Shaaban#24 Final Review Winter 2000 2-19-2001
Partitioning The Single Cycle DatapathPartitioning The Single Cycle Datapath
Add registers between smallest steps
P C
N
e x t P C
O p e r a n d
F e t c h Exec
R e g .
F i l e
M e m
A c c e s s
D a t a
M e m I n
s t r u c t i o n
F e t c h
R e s
u l t S t o r e
A L U c t r
R e g D s
t
A L U S r c
E x
t O p
M e m W
r
n P C
_ s e l
R e g W
r
M e m W
r
M e m R
d
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 25/108
EECC550 - ShaabanEECC550 - Shaaban#25 Final Review Winter 2000 2-19-2001
Example Multi-cycle DatapathExample Multi-cycle Datapath
P C
N e x t P C
O p e r a n d
F e t c h
E x t
A L U
R e g .
F i l e
M e m
A c c e s s
D a t a
M e m
I n s t r u c t i o n
F e t c h
R e s u l t S t o r e
A L U c t r
R e
g D s t
A L U S
r c
E x t O
p
n P C
_ s e l
R e
g W r
M e m
W r
M e m
R d
I R A
B
R
M
RegFile
M e m T o R e g
E q u a l
Registers added:
IR: Instruction register
A, B: Two registers to hold operands read from register file.
R: or ALUOut, holds the output of the ALU
M: or Memory data register (MDR) to hold data read from data memory
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 26/108
EECC550 - ShaabanEECC550 - Shaaban#26 Final Review Winter 2000 2-19-2001
Operations In Each CycleOperations In Each Cycle
Instruction
Fetch
Instruction
Decode
Execution
Memory
Write
Back
R-Type
IR ←← Mem[PC]
A ←← R[rs]
B ←← R[rt]
R ←← A + B
R[rd] ←← R
PC ←← PC + 4
Logic
Immediate
IR ←← Mem[PC]
A ←← R[rs]
R ←← A OR ZeroExt[imm16]
R[rt] ←← R
PC ←← PC + 4
Load
IR ←← Mem[PC]
A ←← R[rs]
R ←← A + SignEx(Im16)
M ←← Mem[R]
R[rd] ←← M
PC ←← PC + 4
Store
IR ←← Mem[PC]
A ←← R[rs]
B ←← R[rt]
R ←← A + SignEx(Im16)
Mem[R] ←← B
PC ←← PC + 4
Branch
IR ←← Mem[PC]
A ←← R[rs]
B ←← R[rt]
If Equal = 1
PC ←← PC + 4 +
(SignExt(imm16) x4)
else
PC ←← PC + 4
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 27/108
EECC550 - ShaabanEECC550 - Shaaban#27 Final Review Winter 2000 2-19-2001
Finite State Machine (FSM) Control ModelFinite State Machine (FSM) Control Model
• State specifies control points for Register Transfer.• Transfer occurs upon exiting state (same falling edge).
State X
Register Transfer Control Points
Depends on Input
Control State
Next StateLogic
Output Logic
inputs (conditions)
outputs (control points)
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 28/108
EECC550 - ShaabanEECC550 - Shaaban#28 Final Review Winter 2000 2-19-2001
Control Specification For Multi-cycle CPUControl Specification For Multi-cycle CPU
Finite State Machine (FSM)Finite State Machine (FSM)
IR ←← MEM[PC]
R-type
A ←← R[rs]B ←← R[rt]
R ←← A fun B
R[rd] ←← RPC ←← PC + 4
R ←← A or ZX
R[rt] ←← RPC ←← PC + 4
ORi
R ←← A + SX
R[rt] ←← MPC ←← PC + 4
M ←←MEM[R]
LW
R ←←A + SX
MEM[R] ←← B
PC ←← PC + 4
BEQ & EqualBEQ & ~Equal
PC ←← PC + 4 PC ←← PC + SX || 00
SW
“instruction fetch”
“decode / operand fetch”
E x e c u t e
M e
m o r y
W r i t e - b a c k
To instruction fetch
To instruction fetchTo instruction fetch
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 29/108
EECC550 - ShaabanEECC550 - Shaaban#29 Final Review Winter 2000 2-19-2001
Traditional FSM ControllerTraditional FSM Controller
datapath + state diagram => controldatapath + state diagram => control
• Translate RTN statements intocontrol points.
• Assign states.
• Implement the controller.
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 30/108
EECC550 - ShaabanEECC550 - Shaaban#30 Final Review Winter 2000 2-19-2001
MappingMapping RTNsRTNs To Control Points ExamplesTo Control Points Examples& State Assignments& State Assignments
IR ←← MEM[PC]
0000
R-type
A ←← R[rs]
B ←← R[rt] 0001
R ←← A fun B 0100
R[rd] ←← RPC ←← PC + 4
0101
R ←← A or ZX
0110
R[rt] ←← R
PC ←← PC + 40111
ORi
R ←← A + SX 1000
R[rt] ←← M
PC ←← PC + 41010
M ←← MEM[S]
1001
LW
R ←← A + SX
1011
MEM[S] ←← B
PC ←← PC + 4 1100
BEQ & Equal
BEQ & ~Equal
PC ←← PC + 4
0011
PC ←← PC +
SX || 00
0010
SW
“instruction fetch”
“decode / operand fetch”
E x e c u t e
M
e m o r y
W r i t e - b a c k
imem_rd, IRen
Aen, Ben
ALUfun, Sen
RegDst,
RegWr,PCen To instruction fetch
state 0000
To instruction fetch state 0000To instruction fetch state 0000
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 31/108
EECC550 - ShaabanEECC550 - Shaaban#31 Final Review Winter 2000 2-19-2001
Detailed Control SpecificationState Op field Eq Next IR PC Ops Exec Mem Write-Back
en sel A B Ex Sr ALU S R W M M-R Wr Dst
0000 ?????? ? 0001 10001 BEQ 0 0011 1 1
0001 BEQ 1 0010 1 1
0001 R-type x 0100 1 1
0001 orI x 0110 1 1
0001 LW x 1000 1 1
0001 SW x 1011 1 1
0010 xxxxxx x 0000 1 10011 xxxxxx x 0000 1 0
0100 xxxxxx x 0101 0 1 fun 1
0101 xxxxxx x 0000 1 0 0 1 1
0110 xxxxxx x 0111 0 0 or 1
0111 xxxxxx x 0000 1 0 0 1 0
1000 xxxxxx x 1001 1 0 add 11001 xxxxxx x 1010 1 0 0
1010 xxxxxx x 0000 1 0 1 1 0
1011 xxxxxx x 1100 1 0 add 1
1100 xxxxxx x 0000 1 0 0 1
R
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 32/108
EECC550 - ShaabanEECC550 - Shaaban#32 Final Review Winter 2000 2-19-2001
Alternative Multiple Cycle Datapath (In Textbook)
•Shared instruction/data memory unit
• A single ALU shared among instructions
• Shared units require additional or widened multiplexors
• Temporary registers to hold data between clock cycles of the instruction:
• Additional registers: Instruction Register (IR),
Memory Data Register (MDR), A, B, ALUOut
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 33/108
EECC550 - ShaabanEECC550 - Shaaban#33 Final Review Winter 2000 2-19-2001
Operations In Each CycleOperations In Each Cycle
Instruction
Fetch
Instruction
Decode
Execution
Memory
Write
Back
R-Type
IR ←← Mem[PC]
PC ←← PC + 4
A ←← R[rs]
B ←← R[rt]
ALUout ←← PC +(SignExt(imm16)x4)
ALUout ←← A + B
R[rd] ←← ALUout
Logic
Immediate
IR ←← Mem[PC]
PC ←← PC + 4
A ←← R[rs]
B ←← R[rt]
ALUout ←← PC +
(SignExt(imm16) x4)
ALUout ←←
A OR ZeroExt[imm16]
R[rt] ←← ALUout
Load
IR ←← Mem[PC]
PC ←← PC + 4
A ←← R[rs]
B ←← R[rt]
ALUout ←← PC +
(SignExt(imm16) x4)
ALUout ←←
A + SignEx(Im16)
M ←← Mem[ALUout]
R[rd] ←← Mem
Store
IR ←← Mem[PC]
PC ←← PC + 4
A ←← R[rs]
B ←← R[rt]
ALUout ←← PC +
(SignExt(imm16) x4)
ALUout ←←
A + SignEx(Im16)
Mem[ALUout] ←← B
Branch
IR ←← Mem[PC]
PC ←← PC + 4
A ←← R[rs]
B ←← R[rt]
ALUout ←← PC +
(SignExt(imm16) x4)
If Equal = 1
PC ←← ALUout
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 34/108
EECC550 - ShaabanEECC550 - Shaaban#34 Final Review Winter 2000 2-19-2001
Finite State Machine (FSM) SpecificationFinite State Machine (FSM) Specification
IR ← MEM[PC]
PC ← PC + 4
R-type
ALUout
← A fun B
R[rd]←ALUout
ALUout
← A op ZX
R[rt]←ALUout
ORiALUout
← A + SX
R[rt] ← M
M ←
MEM[ALUout]
LWALUout
← A + SX
MEM[ALUout]
←B
SW
“instruction fetch”
“decode”
E x e c u t e
M e
m o r y
W r i t e - b a c k
0000
0001
0100
0101
0110
0111
1000
1001
1010
1011
1100
BEQ
0010
If A = B then
PC ← ALUout
A ← R[rs]
B ← R[rt]
ALUout
←← PC +SX
To instruction fetch
To instruction fetchTo instruction fetch
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 35/108
EECC550 - ShaabanEECC550 - Shaaban#35 Final Review Winter 2000 2-19-2001
MIPS Multi-cycle DatapathMIPS Multi-cycle Datapath
Performance EvaluationPerformance Evaluation• What is the average CPI?
– State diagram gives CPI for each instruction type
– Workload below gives frequency of each type
Type CPIi for type Frequency CPIi x freqIi
Arith/Logic 4 40% 1.6
Load 5 30% 1.5
Store 4 10% 0.4
branch 3 20% 0.6
Average CPI: 4.1
Better than CPI = 5 if all instructions took the same number
of clock cycles (5).
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 36/108
EECC550 - ShaabanEECC550 - Shaaban#36 Final Review Winter 2000 2-19-2001
• Control may be designed using one of several initial representations. Thechoice of sequence control, and how logic is represented, can then be
determined independently; the control can then be implemented with oneof several methods using a structured logic technique.
Initial Representation Finite State Diagram Microprogram
Sequencing Control Explicit Next State Microprogram counter Function + Dispatch ROMs
Logic Representation Logic Equations Truth Tables
Implementation PLA ROMTechnique “hardwi red contr ol ” “microprogrammed control ”
Control Implementation AlternativesControl Implementation Alternatives
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 37/108
EECC550 - ShaabanEECC550 - Shaaban#37 Final Review Winter 2000 2-19-2001
MicroprogrammedMicroprogrammed ControlControl• Finite state machine control for a full set of instructions is very
complex, and may involve a very large number of states: – Slight microoperation changes require new FSM controller.
• Microprogramming: Designing the control as a program that
implements the machine instructions.
• A microprogam for a given machine instruction is a symbolic
representation of the control involved in executing the instruction
and is comprised of a sequence of microinstructions.•
• Each microinstruction defines the set of datapath control signals
that must asserted (active) in a given state or cycle.
• The format of the microinstructions is defined by a number of fields each responsible for asserting a set of control signals.
• Microarchitecture: – Logical structure and functional capabilities of the hardware as
seen by the microprogrammer.
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 38/108
EECC550 - ShaabanEECC550 - Shaaban#38 Final Review Winter 2000 2-19-2001
Types of “branching”• Set state to 0 (fetch)
• Dispatch i (state 1)
• Use incremented
address (seq) state
number 2
Opcode
State Reg
Microinstruction Address
Inputs
Outputs Control
Signal
Fields
Microprogram
Storage ROM/PLA
Multicycle
Datapath
1
Address Select Logic
Adder
Microprogram
Counter, MicroPC
Sequencing
Control
Field
MicroprogrammedMicroprogrammed Control UnitControl Unit
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 39/108
EECC550 - ShaabanEECC550 - Shaaban#39 Final Review Winter 2000 2-19-2001
List of control Signals Grouped Into FieldsList of control Signals Grouped Into FieldsSignal name Eff ect when deasser ted Ef fect when asser ted
ALUSelA 1st ALU operand = PC 1st ALU operand = Reg[rs]
RegWrite None Reg. is writtenMemtoReg Reg. write data input = ALU Reg. write data input = memoryRegDst Reg. dest. no. = rt Reg. dest. no. = rdMemRead None Memory at address is read,
MDR ←← Mem[addr]MemWrite None Memory at address is writtenIorD Memory address = PC Memory address = SIRWrite None IR ←← Memory
PCWrite None PC ←← PCSourcePCWriteCond None IF ALUzero then PC ←← PCSourcePCSource PCSource = ALU PCSource = ALUout
S i n
g l e B i t C o n
t r o
l
Signal nam e Value Effec t ALUOp 00 ALU adds 01 ALU subtracts 10 ALU does function code
11 ALU does logical ORALUSelB 000 2nd ALU input = Reg[rt] 001 2nd ALU input = 4 010 2nd ALU input = sign extended IR[15-0] 011 2nd ALU input = sign extended, shift left 2 IR[15-0]
100 2nd ALU input = zero extended IR[15-0] M u
l t i p l e B i t
C o n
t r o
l
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 40/108
EECC550 - ShaabanEECC550 - Shaaban#40 Final Review Winter 2000 2-19-2001
MicroinstructionMicroinstruction Field ValuesField ValuesField Name Valu es fo r Field Fu nc tio n o f Field with Specif ic Valu e
ALU Add ALU addsSubt. ALU subtractsFunc code ALU does function codeOr ALU does logical OR
SRC1 PC 1st ALU input = PCrs 1st ALU input = Reg[rs]
SRC2 4 2nd ALU input = 4Extend 2nd ALU input = sign ext. IR[15-0]Extend0 2nd ALU input = zero ext. IR[15-0]Extshft 2nd ALU input = sign ex., sl IR[15-0]
rt 2nd ALU input = Reg[rt]destination rd ALU Reg[rd] ←← ALUout
rt ALU Reg[rt] ←← ALUout rt Mem Reg[rt] ←← Mem
Memory Read PC Read memory using PCRead ALU Read memory using ALU outputWrite ALU Write memory using ALU output, value B
Memory register IR IR ←← MemPC write ALU PC ←← ALU
ALUoutCond IF ALU Zero then PC ←← ALUout
Sequencing Seq Go to sequential µinstructionFetch Go to the first microinstructionDispatch i Dispatch using ROM.
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 41/108
EECC550 - ShaabanEECC550 - Shaaban#41 Final Review Winter 2000 2-19-2001
MicroprogramMicroprogram for The Control Unitfor The Control Unit
Label ALU SRC1 SRC2 Dest. Memory Mem. Reg. PC Write Sequencing
Fetch: Add PC 4 Read PC IR ALU Seq
Add PC Extshft Dispatch
Lw: Add rs Extend Seq
Read ALU Seq
rt MEM Fetch
Sw: Add rs Extend Seq
Write ALU Fetch
Rtype: Func rs rt Seq
rd ALU Fetch
Beq: Subt. rs rt ALUoutCond. Fetch
Ori: Or rs Extend0 Seq
rt ALU Fetch
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 42/108
EECC550 - ShaabanEECC550 - Shaaban#42 Final Review Winter 2000 2-19-2001
MIPS Integer ALU RequirementsMIPS Integer ALU Requirements
00 add
01 addU
02 sub
03 subU
04 and
05 or
06 xor
07 nor
12 slt
13 sltU
(1) Functional Specification:
inputs: 2 x 32-bit operands A, B, 4-bit modeoutputs: 32-bit result S, 1-bit carry, 1 bit overflow, 1 bit zerooperations: add, addu, sub, subu, and, or, xor, nor, slt, sltU
(2) Block Diagram:
ALU ALU A B
m
ovf S
32 32
32
4c
10 operations
thus 4 control bits
zero
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 43/108
EECC550 - ShaabanEECC550 - Shaaban#43 Final Review Winter 2000 2-19-2001
Building Block: 1-bit ALUBuilding Block: 1-bit ALU
A
B
M ux
CarryIn
Result
1-bit
Full
Adder
CarryOut
add
and
or
invertBOperation
Performs: AND, OR,
addition on A, B or A, B inverted
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 44/108
EECC550 - ShaabanEECC550 - Shaaban#44 Final Review Winter 2000 2-19-2001
32-Bit ALU Using 32 1-Bit32-Bit ALU Using 32 1-Bit ALUsALUs
32-bit rippled-carry adder (operation/invertB lines not shown)
A31
B31
1-bit
ALUResult31
A0
B0
1-bit
ALUResult0
CarryIn0
CarryOut0
A1
B1
1-bit
ALUResult1
CarryIn1
CarryOut1
A2
B2
1-bit
ALUResult2
CarryIn2
CarryIn3
CarryOut31
::
CarryOut30CarryIn31
C
Addition/Subtraction Performance:
Assume gate delay = T
Total delay = 32 x (1-Bit ALU Delay)
= 32 x 2 x gate delay
= 64 x gate delay
= 64 T
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 45/108
EECC550 - ShaabanEECC550 - Shaaban#45 Final Review Winter 2000 2-19-2001
MIPS ALU With SLT Support AddedMIPS ALU With SLT Support Added
A31 1-bit
ALUB31 Result31
B0 1-bitALU
A0
Result0
CarryIn0
CarryOut0
A1B1
1-bit
ALUResult1
CarryIn1
CarryOut1
A2B2
1-bit
ALUResult2
CarryIn2
CarryIn3
CarryOut31
::
CarryOut30CarryIn31
C
::::
Zero
Overflow
Less = 0
Less = 0
Less = 0
Less
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 46/108
EECC550 - ShaabanEECC550 - Shaaban#46 Final Review Winter 2000 2-19-2001
Improving ALU Performance:
Carry Look Ahead (CLA) A B C-out
0 0 0 “kill”0 1 C-in “propagate”1 0 C-in “propagate”1 1 1 “generate”
A0
B1
S
GP
G = A and B
P = A xor B
A
B
S
GP
A
B
S
GP
A
B
S
GP
Cin
C1 =G0 + C0 • P0
C2 = G1 + G0 • P1 + C0 • P0 • P1
C3 = G2 + G1C3 = G2 + G1 •••• P2 + G0P2 + G0 •••• P1P1 •••• P2 + C0P2 + C0 •••• P0P0 •••• P1P1 •••• P2P2
G
P
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 47/108
EECC550 - ShaabanEECC550 - Shaaban#47 Final Review Winter 2000 2-19-2001
Cascaded Carry Look-aheadCascaded Carry Look-ahead
16-Bit Example16-Bit ExampleCL
A
4-bit Adder
4-bit Adder
4-bit Adder
C1 =G0 + C0 •• P0
C2 = G1 + G0 •• P1 + C0 •• P0 •• P1
C3 = G2 + G1 •• P2 + G0 •• P1 •• P2 + C0 •• P0 •• P1 •• P2
G
P
G0P0
C4 = . . .
C0
Delay = 2 + 2 + 1 = 5 gate delays = 5T
Assuming all
gates have
equal delay T
{
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 48/108
EECC550 - ShaabanEECC550 - Shaaban#48 Final Review Winter 2000 2-19-2001
Unsigned Multiplication ExampleUnsigned Multiplication Example
• Paper and pencil example (unsigned): Multiplicand 1000
Multiplier 1001
1000
0000
0000
1000Product 01001000
• m bits x n bits = m + n bit product, m = 32, n = 32, 64 bit product.
• The binary number system simplifies multiplication:
0 => place 0 ( 0 x multiplicand).
1 => place a copy ( 1 x multiplicand).
• We will examine 4 versions of multiplication hardware & algorithm:
– Successive ref inement of design.
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 49/108
EECC550 - ShaabanEECC550 - Shaaban#49 Final Review Winter 2000 2-19-2001
Operation ofOperation of CombinationalCombinational MultiplierMultiplier
B0
A0A1A2A3
A0A1A2A3
A0A1A2A3
A0A1A2A3
B1
B2
B3
P0P1P2P3P4P5P6P7
0 0 0 00 0 0
• At each stage shift A left ( x 2).
• Use next bit of B to determine whether to add in shifted multiplicand.
• Accumulate 2n bit partial product at each stage.
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 50/108
EECC550 - ShaabanEECC550 - Shaaban#50 Final Review Winter 2000 2-19-2001
MULTIPLY HARDWARE Version 3MULTIPLY HARDWARE Version 3
Product (Multiplier)
Multiplicand
32-bit ALU
Write
Control
32 bits
64 bits
Shift Right
• Combine Multiplier register and Product register:
– 32-bit Multiplicand register.
– 32 -bit ALU.
– 64-bit Product register, (0-bit Multiplier register).
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 51/108
EECC550 - ShaabanEECC550 - Shaaban#51 Final Review Winter 2000 2-19-2001
Multiply AlgorithmMultiply Algorithm
Version 3Version 3
Done
Yes: 32 repetitions
2. Shift the Product register right 1 bit.
No: < 32 repetitions
1. TestProduct0
Product0 = 0Product0 = 1
1a. Add multiplicand to the left half of product &
place the result in the left half of Product register
32nd
repetition?
Start
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 52/108
EECC550 - ShaabanEECC550 - Shaaban#52 Final Review Winter 2000 2-19-2001
CombinationalCombinational Shifter fromShifter from MUXesMUXes
1 0sel
A B
D
Basic Building Block
8-bit right shifter
1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
S2 S1 S0A0A1A2A3A4A5A6A7
R 0R 1R 2R 3R 4R 5R 6R 7
• What comes in the MSBs?
• How many levels for 32-bit shifter?
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 53/108
EECC550 - ShaabanEECC550 - Shaaban#53 Final Review Winter 2000 2-19-2001
DivisionDivision 1001 Quotient
Divisor 1000 1001010 Dividend
–1000
10
101
1010
–1000
10 Remainder (or Modulo result)
• See how big a number can be subtracted, creating quotient bit on each step:
Binary => 1 * divisor or 0 * divisor
Dividend = Quotient x Divisor + Remainder => | Dividend | = | Quotient | + | Divisor |
• 3 versions of divide, successive refinement
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 54/108
EECC550 - ShaabanEECC550 - Shaaban#54 Final Review Winter 2000 2-19-2001
Remainder (Quotient)
Divisor
32-bit ALU
Write
Control
32 bits
64 bits
Shift Left“H I ” “LO”
• 32-bit Divisor register.
• 32 -bit ALU.
• 64-bit Remainder regegister (0-bit Quotient register).
DIVIDE HARDWARE Version 3DIVIDE HARDWARE Version 3
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 55/108
EECC550 - ShaabanEECC550 - Shaaban#55 Final Review Winter 2000 2-19-2001
3b. Restore the original value by adding the Divisor
register to the left half of the Remainderregister,
&place the sum in the left half of the Remainder
register. Also shift the Remainder register to the
left, setting the new least significant bit to 0.
Test
RemainderRemainder < 0Remainder >= 0
2. Subtract the Divisor register from the
left half of the Remainder register, & place the
result in the left half of the Remainder register.
3a. Shift the
Remainder register
to the left setting
the new rightmost
bit to 1.
1. Shift the Remainder register left 1 bit.
Done. Shift left half of Remainder right 1 bit.
Yes: n repetitions (n = 4 here)
nth
repetition?
No: < n repetitions
Start: Place Dividend in RemainderDivide AlgorithmDivide Algorithm
Version 3Version 3
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 56/108
EECC550 - ShaabanEECC550 - Shaaban#56 Final Review Winter 2000 2-19-2001
Representation of Floating Point Numbers inRepresentation of Floating Point Numbers in
Single PrecisionSingle Precision I EEE 754 Standard IEEE 754 Standard
Example: 0 = 0 00000000 0 . . . 0 -1.5 = 1 01111111 10 . . . 0
Magnitude of numbers thatcan be represented is in the range: 2
-126(1.0) to 2 127 (2 - 2-23 )
Which is approximately: 1.8 x 10- 38
to 3.40 x 1038
0 < E < 255Actual exponent is: e = E - 127
1 8 23
s ign
exponent : excess 127binary integer added
mant issa: sign + magnitude, normalizedbinary significand witha hidden integer bit: 1.M
E MS
Value = N = (-1)S X 2 E-127 X (1.M)
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 57/108
EECC550 - ShaabanEECC550 - Shaaban#57 Final Review Winter 2000 2-19-2001
Representation of Floating Point Numbers inRepresentation of Floating Point Numbers in
Single PrecisionSingle Precision I EEE 754 Standard IEEE 754 Standard
Example: 0 = 0 00000000 0 . . . 0 -1.5 = 1 01111111 10 . . . 0
Magnitude of numbers that
can be represented is in the range: 2-126
(1.0) to 2 127 (2 - 2-23 )
Which is approximately: 1.8 x 10- 38
to 3.40 x 1038
0 < E < 255Actual exponent is: e = E - 127
1 8 23
s ign
exponent : excess 127binary integer added
mant issa: sign + magnitude, normalizedbinary significand witha hidden integer bit: 1.M
E MS
Value = N = (-1)S X 2 E-127 X (1.M)
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 58/108
EECC550 - ShaabanEECC550 - Shaaban#58 Final Review Winter 2000 2-19-2001
Representation of Floating Point Numbers inRepresentation of Floating Point Numbers in
Double PrecisionDouble Precision I EEE 754 Standard I EEE 754 Standard
Example: 0 = 0 00000000000 0 . . . 0 -1.5 = 1 01111111111 10 . . . 0
Magnitude of numbers that
can be represented is in the range: 2-1022
(1.0) to 2 1023 (2 - 2 - 52 )
Which is approximately: 2.23 x 10- 308
to 1.8 x 10308
0 < E < 2047Actual exponent is: e = E - 1023
1 11 52
s ign
exponent : excess 1023binary integer added
Mant issa: sign + magnitude, normalizedbinary significand witha hidden integer bit: 1.M
E MS
Value = N = (-1)S X 2 E-1023 X (1.M)
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 59/108
EECC550 - ShaabanEECC550 - Shaaban#59 Final Review Winter 2000 2-19-2001
Floating Point Conversion ExampleFloating Point Conversion Example• The decimal number -2345.12510 is to be represented in the
I EEE 754 32-bit single precision format:
-2345.12510 = -100100101001.0012 (converted to binary)
= -1.00100101001001 x 211 (normalized binary)
• The mantissa is negative so the sign S is given by:
S = 1
• The biased exponent E is given by E = e + 127
E = 11 + 127 = 13810 = 100010102
• Fractional part of mantissa M:
M = .00100101001001000000000 (in 23 bits)
The I EEE 754 single precision representation is given by:
1 10001010 00100101001001000000000
S E M
1 bit 8 bits 23 bits
Hidden
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 60/108
EECC550 - ShaabanEECC550 - Shaaban#60 Final Review Winter 2000 2-19-2001
Floating PointFloating Point
Addition Addition
Flowchart Flowchart
Start
Normalize the sum, either shifting right and
incrementing the exponent or shifting left
and decrementing the exponent
Compare the exponents of the two numbers
shift the smaller number to the right until its
exponent matches the larger exponent
Round the significand to the appropriate number of bits
If mantissa = 0, set exponent to 0
Add the significands (mantissas)
Done
Overflow or
Underflow ?
Generate exception
or return error
(1)
(2)
(3)
(4)
(5)
Stillnormalized?
Yes
No
yes
No
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 61/108
EECC550 - ShaabanEECC550 - Shaaban#61 Final Review Winter 2000 2-19-2001
Floating Point Addition Hardware
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 62/108
EECC550 - ShaabanEECC550 - Shaaban#62 Final Review Winter 2000 2-19-2001
Overflow or
Underflow?
Floating PointFloating Point
Multiplication FlowchartMultiplication Flowchart
(1)
(2)
(3)
(5)
(6)
Start
Done
Is one/both
operands =0?
Set the result to zero:
exponent = 0
Multiply the mantissas
Compute sign of result: Xs XOR Ys
Round or truncate the result mantissa
Compute exponent:
biased exp.(X) + biased exp.(Y) - bias
Generate exception
or return error
Normalize mantissa if needed
(4)
Still
Normalized?(7)
Yes
NoNo
Yes
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 63/108
EECC550 - ShaabanEECC550 - Shaaban#63 Final Review Winter 2000 2-19-2001
Extra Bits for RoundingExtra Bits for RoundingExtra bits used to prevent or minimize rounding errors.
How many extra bits?
IEEE: As if computed the result exactly and rounded.
Addition:
1.xxxxx 1.xxxxx 1.xxxxx
+ 1.xxxxx 0.001xxxxx 0.01xxxxx
1x.xxxxy 1.xxxxxyyy 1x.xxxxyyy
post-normalization pre-normalization pre and post
• Guard Digits : digits to the right of the first p digits of significand to guard
against loss of digits – can later be shifted left into first P places during
normalization.• Addition: carry-out shifted in
• Subtraction: borrow digit and guard
• Multiplication: carry and guard, Division requires guard
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 64/108
EECC550 - ShaabanEECC550 - Shaaban#64 Final Review Winter 2000 2-19-2001
Rounding DigitsRounding DigitsNormalized result, but some non-zero digits to the right of the significand --> the number should be rounded
E.g., B = 10, p = 3: 0 2 1.69
0 0 7.85
0 2 1.61
= 1.6900 * 10
= - .0785 * 10
= 1.6115 * 10
2-bias
2-bias
2-bias
-
One round digit must be carried to the right of the guard digit so that
after a normalizing left shift, the result can be rounded, accordingto the value of the round digit.
IEEE Standard : four rounding modes: round to nearest (default)
round towards plus infinityround towards minus infinityround towards 0
round to nearest: round digit < B/2 then truncate > B/2 then round up (add 1 to ULP: unit in last place) = B/2 then round to nearest even digit
i t can be show n that th is stra tegy m in imizes the mean error in t roduced by round ing .
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 65/108
EECC550 - ShaabanEECC550 - Shaaban#65 Final Review Winter 2000 2-19-2001
Sticky BitSticky BitAdditional bit to the right of the round digit to better fine tune rounding
d0 . d1 d2 d3 . . . dp-1 0 0 0 0 . 0 0 X . . . X X X S X X S
Sticky bit: set to 1 if any 1 bits fall off the end of the round digit
d0 . d1 d2 d3 . . . dp-1 0 0 0 0 . 0 0 X . . . X X X 0
d0 . d1 d2 d3 . . . dp-1 0 0 0 0 . 0 0 X . . . X X X 1
generates a borrow
Round ing Summary :
Radix 2 minimizes wobble in precision.
Normal operations in +,-,*,/ require one carry/borrow bit + one guard digit.
One round digit needed for correct rounding.
Sticky bit needed when round digit is B/2 for max accuracy.
Rounding to nearest has mean error = 0 if uniform distribution of digitsare assumed.
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 66/108
EECC550 - ShaabanEECC550 - Shaaban#66 Final Review Winter 2000 2-19-2001
Pipelining: Design GoalsPipelining: Design Goals• The length of the machine clock cycle is determined by the time
required for the slowest pipeline stage.
• An important pipeline design consideration is to balance the
length of each pipeline stage.
• If all stages are perfectly balanced, then the time per
instruction on a pipelined machine (assuming ideal conditionswith no stalls):
Time per instruction on unpipelined machine
Number of pipe stages
• Under these ideal conditions: – Speedup from pipelining = the number of pipeline stages = k
– One instruction is completed every cycle: CPI = 1 .
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 67/108
EECC550 - ShaabanEECC550 - Shaaban#67 Final Review Winter 2000 2-19-2001
Pipelined Instruction ProcessingPipelined Instruction Processing
RepresentationRepresentation Clock cycle Number Time in clock cycles →
Instruction Number 1 2 3 4 5 6 7 8 9
Instruction I IF ID EX MEM WB
Instruction I+1 IF ID EX MEM WB
Instruction I+2 IF ID EX MEM WB
Instruction I+3 IF ID EX MEM WBInstruction I +4 IF ID EX MEM WB
Time to fill the pipeline
Pipeline Stages:
IF = Instruction Fetch
ID = Instruction Decode
EX = Execution
MEM = Memory Access
WB = Write Back
First instruction, I
Completed
Last instruction,
I+4 completed
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 68/108
EECC550 - ShaabanEECC550 - Shaaban#68 Final Review Winter 2000 2-19-2001
For 1000 instructions, execution time:
• Single Cycle Machine:
– 8 ns/cycle x 1 CPI x 1000 inst = 8000 ns
• Multicycle Machine:
– 2 ns/cycle x 4.6 CPI (due to inst mix) x 1000 inst = 9200 ns
• Ideal pipelined machine, 5-stages:
– 2 ns/cycle x (1 CPI x 1000 inst + 4 cycle fill) = 2008 ns
Single Cycle, Multi-Cycle, Pipeline:Single Cycle, Multi-Cycle, Pipeline:
Performance Comparison ExamplePerformance Comparison Example
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 69/108
EECC550 - ShaabanEECC550 - Shaaban#69 Final Review Winter 2000 2-19-2001
Single Cycle, Multi-Cycle, Vs. PipelineSingle Cycle, Multi-Cycle, Vs. Pipeline
Clk
Cycle 1
Multiple Cycle Implementation:
IF ID EX MEM WB
Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Cycle 10
IF ID EX MEM
Load Store
Clk
Single Cycle Implementation:
Load Store Waste
IF
R-type
Load IF ID EX MEM WB
Pipeline Implementation:
IF ID EX MEM WBStore
IF ID EX MEM WBR-type
Cycle 1 Cycle 2
8 ns
2ns
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 70/108
EECC550 - ShaabanEECC550 - Shaaban#70 Final Review Winter 2000 2-19-2001
MIPS: A Pipelined DatapathMIPS: A Pipelined Datapath
Instructionmemory
Address
4
32
0
Add Addresult
Shiftleft 2
I n s t r u c t i o n
IF/ID EX/MEM MEM/WB
Mux
0
1
Add
PC
0
Address
Writedata
Mux
1
Registers
Read
data 1
Readdata 2
Readregister 1
Readregister 2
16Sign
extend
Writeregister
Writedata
Readdata
Data
memory1
ALUresult
Mux
ALU
Zero
ID/EX
IF ID EX MEM WBInstruction Fetch Instruction Decode Execution Memory Write Back
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 71/108
EECC550 - ShaabanEECC550 - Shaaban#71 Final Review Winter 2000 2-19-2001
Pipeline ControlPipeline Control• Pass needed control signals along from one stage to the next as the
instruction travels through the pipeline just like the data
Execution/Address Calculation
stage control lines
Memory access stage
control lines
stage control
lines
Instruction
Reg
Dst
ALU
O 1
ALU
O 0
ALU
Src Branch
Mem
Read
Mem
Write
Reg
write
Mem to
Re
R-format 1 1 0 0 0 0 0 1 0lw 0 0 0 1 0 1 0 1 1sw X 0 0 1 0 0 1 0 X
beq X 0 1 0 1 0 0 0 X
Control
EX
M
WB
M
WB
WB
IF/ID ID/EX EX/MEM MEM/WB
Instruction
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 72/108
EECC550 - ShaabanEECC550 - Shaaban#72 Final Review Winter 2000 2-19-2001
Basic Performance Issues In PipeliningBasic Performance Issues In Pipelining
• Pipelining increases the CPU instruction throughput:
The number of instructions completed per unit time.
Under ideal condition instruction throughput is one
instruction per machine cycle, or CPI = 1
• Pipelining does not reduce the execution time of an
individual instruction: The time needed to complete all
processing steps of an instruction (also called instruction
completion latency).
• It usually slightly increases the execution time of each
instruction over unpipelined implementations due to the
increased control overhead of the pipeline and pipeline
stage registers delays.
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 73/108
EECC550 - ShaabanEECC550 - Shaaban#73 Final Review Winter 2000 2-19-2001
Pipelining Performance ExamplePipelining Performance Example• Example: For an unpipelined machine:
– Clock cycle = 10ns, 4 cycles for ALU operations and branches and 5 cycles for memory operations with instruction frequencies of 40%, 20% and 40%,
respectively.
– If pipelining adds 1ns to the machine clock cycle then the speedup in
instruction execution from pipelining is:
Non-pipelined Average instruction execution time = Clock cycle x Average CPI = 10 ns x ((40% + 20%) x 4 + 40%x 5) = 10 ns x 4.4 = 44 ns
In the pipelined five implementation five stages are used with
an average instruction execution time of: 10 ns + 1 ns = 11 ns
Speedup from pipelining = Instruction time unpipelined Instruction time pipelined
= 44 ns / 11 ns = 4 times
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 74/108
EECC550 - ShaabanEECC550 - Shaaban#74 Final Review Winter 2000 2-19-2001
Pipeline HazardsPipeline Hazards• Hazards are situations in pipelining which prevent the next
instruction in the instruction stream from executing duringthe designated clock cycle resulting in one or more stall cycles.
• Hazards reduce the ideal speedup gained from pipelining and
are classified into three classes:
– Structural hazards : Arise from hardware resource conflictswhen the available hardware cannot support all possible
combinations of instructions.
– Data hazards : Arise when an instruction depends on the
results of a previous instruction in a way that is exposed by
the overlapping of instructions in the pipeline.
– Control hazards : Arise from the pipelining of conditional
branches and other instructions that change the PC.
St t l h d E l
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 75/108
EECC550 - ShaabanEECC550 - Shaaban#75 Final Review Winter 2000 2-19-2001
Structural hazard Example:
Single Memory For Instructions & Data
Mem
I
ns
t
r.
O
r d
e
r
Time (clock cycles)
Load
Instr 1
Instr 2
Instr 3
Instr 4
AL UMem Reg Mem Reg
AL UMem Reg Mem Reg
AL UMem Reg Mem Reg
AL UReg Mem Reg
AL UMem Reg Mem Reg
Detection is easy in this case (right half highlight means read, left half write)
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 76/108
EECC550 - ShaabanEECC550 - Shaaban#76 Final Review Winter 2000 2-19-2001
Data Hazards ExampleData Hazards Example• Problem with starting next instruction before first is
finished
– Data dependencies here that “go backward in time”create data hazards.
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
IM Reg
IM Reg
CC 1 CC 2 CC 3 CC 4 CC 5 CC 6
Time (in clock cycles)
sub $2, $1, $3
Programexecutionorder (in instructions)
and $12, $2, $5
IM Reg DM Reg
IM DM Reg
IM DM Reg
CC 7 CC 8 CC 9
10 10 10 10 10/–20 –20 –20 – 20 – 20
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
Value ofregister $2:
DM Reg
Reg
Reg
Reg
DM
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 77/108
EECC550 - ShaabanEECC550 - Shaaban#77 Final Review Winter 2000 2-19-2001
Data Hazard Resolution: Stall CyclesData Hazard Resolution: Stall Cycles
Stall the pipeline by a number of cycles.
The control unit must detect the need to insert stall cycles.
In this case two stall cycles are needed.
IM Reg
IM
CC 1 CC 2 CC 3 CC 4 CC 5 CC 6
Time (in clock cycles)
sub $2, $1, $3
Programexecutionorder
(in instructions)
and $12, $2, $5
CC 7 CC 8
10 10 10 10 10/– 20 – 20 – 20 – 20
CC 9
– 20
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
Value ofregister $2:
DM Reg
Reg
IM Reg DM Reg
IM DM Reg
IM DM Reg
Reg
Reg
Reg
DMSTALL STALL
CC 10
– 20
CC 11
– 20
STALL STALL
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 78/108
EECC550 - ShaabanEECC550 - Shaaban#78 Final Review Winter 2000 2-19-2001
Data Hazard Resolution: Stall CyclesData Hazard Resolution: Stall Cycles
Stall the pipeline by a number of cycles.
The control unit must detect the need to insert stall cycles.
In this case two stall cycles are needed.
IM Reg
IM
CC 1 CC 2 CC 3 CC 4 CC 5 CC 6
Time (in clock cycles)
sub $2, $1, $3
Programexecutionorder
(in instructions)
and $12, $2, $5
CC 7 CC 8
10 10 10 10 10/– 20 – 20 – 20 – 20
CC 9
– 20
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
Value ofregister $2:
DM Reg
Reg
IM Reg DM Reg
IM DM Reg
IM DM Reg
Reg
Reg
Reg
DMSTALL STALL
CC 10
– 20
CC 11
– 20
STALL STALL
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 79/108
EECC550 - ShaabanEECC550 - Shaaban
#79 Final Review Winter 2000 2-19-2001
Performance of Pipelines with StallsPerformance of Pipelines with Stalls
• Hazards in pipelines may make it necessary to stall thepipeline by one or more cycles and thus degrading
performance from the ideal CPI of 1.
CPI pipelined = Ideal CPI + Pipeline stall clock cycles per instruction
• If pipelining overhead is ignored and we assume that the stages areperfectly balanced then:
Speedup = CPI unpipelined / (1 + Pipeline stall cycles per instruction)
• When all instructions take the same number of cycles and is equal to
the number of pipeline stages then:
Speedup = Pipeline depth / (1 + Pipeline stall cycles per instruction)
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 80/108
EECC550 - ShaabanEECC550 - Shaaban
#80 Final Review Winter 2000 2-19-2001
Data Hazard Resolution: ForwardingData Hazard Resolution: Forwarding – Register file forwarding to handle read/write to same register
– ALU forwarding
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 81/108
EECC550 - ShaabanEECC550 - Shaaban
#81 Final Review Winter 2000 2-19-2001
Data Hazard Example With ForwardingData Hazard Example With Forwarding
IM Reg
IM Reg
CC 1 CC 2 CC 3 CC 4 CC 5 CC 6
Time (in clock cycles)
sub $2, $1, $3
Programexecution order (in instructions)
and $12, $2, $5
IM Reg DM Reg
IM DM Reg
IM DM Reg
CC 7 CC 8 CC 9
10 10 10 10 10/– 20 – 20 – 20 –20 –20
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
Value of register $2 :
DM Reg
Reg
Reg
Reg
X X X –20 X X X X XValue of EX/MEM :X X X X – 20 X X X XValue of MEM/WB :
DM
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 82/108
EECC550 - ShaabanEECC550 - Shaaban
#82 Final Review Winter 2000 2-19-2001
A Data Hazard Requiring A StallA Data Hazard Requiring A Stall
Reg
IM
Reg
Reg
IM
CC 1 CC 2 CC 3 CC 4 CC 5 CC 6
Time (in clock cycles)
lw $2, 20($1)
Programexecutionorder (in instructions)
and $4, $2, $5
IM Reg DM Reg
IM DM Reg
IM DM Reg
CC 7 CC 8 CC 9
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7
DM Reg
Reg
Reg
DM
A load followed by an R-type instruction that uses the loaded value
Even with forwarding in place a stall cycle is needed
This condition must be detected by hardware
C il S h d li E lC il S h d li E l
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 83/108
EECC550 - ShaabanEECC550 - Shaaban
#83 Final Review Winter 2000 2-19-2001
Compiler Scheduling ExampleCompiler Scheduling Example• Reorder the instructions to avoid as many pipeline stalls as possible:
lw $15, 0($2)lw $16, 4($2)sw $16, 0($2)sw $15, 4($2)
• The data hazard occurs on register $16 between the second lw and the first sw
resulting in a stall cycle
• With forwarding we need to find only one independent instructions to place
between them, swapping the lw instructions works:
lw $15, 0($2)lw $16, 4($2)nopnopsw $15, 0($2)sw $16, 4($2)
• Without forwarding we need three independent instructions to place between
them, so in addition two nops are added.
lw $15, 0($2)lw $16, 4($2)sw $15, 0($2)sw $16, 4($2)
Control Hazards: ExampleControl Hazards: Example
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 84/108
EECC550 - ShaabanEECC550 - Shaaban
#84 Final Review Winter 2000 2-19-2001
Control Hazards: ExampleControl Hazards: Example• Three other instructions are in the pipeline before branch
instruction target decision is made when BEQ is in MEM stage.
• In the above diagram, we are predicting “branch not taken”
– Need to add hardware for flushing the three following instructions if
we are wrong losing three cycles.
Reg
Reg
CC 1
Time (in clock cycles)
40 beq $1, $3, 7
Programexecutionorder (in instructions)
IM Reg
IM DM
IM DM
IM DM
DM
DM Reg
Reg Reg
Reg
Reg
RegIM
44 and $12, $2, $5
48 or $13, $6, $2
52 add $14, $2, $2
72 lw $4, 50($7)
CC 2 CC 3 CC 4 CC 5 CC 6 CC 7 CC 8 CC 9
Reg
Reducing Delay of Taken BranchesReducing Delay of Taken Branches
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 85/108
EECC550 - ShaabanEECC550 - Shaaban
#85 Final Review Winter 2000 2-19-2001
Reducing Delay of Taken BranchesReducing Delay of Taken Branches• Next PC of a branch known in MEM stage: Costs three lost cycles if taken.
• If next PC is known in EX stage, one cycle is saved.
• Branch address calculation can be moved to ID stage using a register comparator, costing
only one cycle if branch is taken.
PCInstruction
memory
4
Registers
Mux
Mux
Mux
ALU
EX
M
WB
M
WB
WB
ID/EX
0
EX/MEM
MEM/WB
Data
memory
Mux
Hazard
detectionunit
Forwarding
unit
IF.Flush
IF/ID
Sign
extend
Control
Mux
=
Shiftleft 2
Mux
Pipeline Performance ExamplePipeline Performance Example
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 86/108
EECC550 - ShaabanEECC550 - Shaaban
#86 Final Review Winter 2000 2-19-2001
Pipeline Performance ExamplePipeline Performance Example• Assume the following MIPS instruction mix:
• What is the resulting CPI for the pipelined MIPS with
forwarding and branch address calculation in ID stage?
• CPI = Ideal CPI + Pipeline stall clock cycles per instruction
= 1 + stalls by loads + stalls by branches = 1 + .3x.25x1 + .2 x .45x1
= 1 + .075 + .09
= 1.165
Type Frequency Arith/Logic 40%Load 30% of which 25% are followed immediately by an instruction using the loaded valueStore 10%branch 20% of which 45% are taken
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 87/108
EECC550 - ShaabanEECC550 - Shaaban
#87 Final Review Winter 2000 2-19-2001
Memory Hierarchy: MotivationMemory Hierarchy: Motivation
Processor-Memory (DRAM) Performance GapProcessor-Memory (DRAM) Performance Gap
µProc60%/yr.
DRAM7%/yr.
1
10
100
1000
1 9 8 0
1 9 8 1
1 9 8 3
1 9 8 4
1 9 8 5
1 9 8 6
1 9 8 7
1 9 8 8
1 9 8 9
1 9 9 0
1 9 9 1
1 9 9 2
1 9 9 3
1 9 9 4
1 9 9 5
1 9 9 6
1 9 9 7
1 9 9 8
1 9 9 9
2 0 0 0
DRAM
CPU
1 9 8 2
Processor-MemoryPerformance Gap:
(grows 50% / year)
P e r f o r m a
n c e
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 88/108
EECC550 - ShaabanEECC550 - Shaaban
#88 Final Review Winter 2000 2-19-2001
Processor-DRAM Performance Gap Impact:Processor-DRAM Performance Gap Impact:
ExampleExample
• To illustrate the performance impact, assume a pipelined RISC
CPU with CPI = 1 using non-ideal memory.
• Over an 10 year period, ignoring other factors, the cost of a full
memory access in terms of number of wasted instructions:
CPU CPU Memory Minimum CPU cycles or
Year speed cycle Access instructions wasted MHZ ns ns
1986: 8 125 190 190/125 = 1.5
1988: 33 30 175 175/30 = 5.8
1991: 75 13.3 155 155/13.3 = 11.651994: 200 5 130 130/5 = 26
1996: 300 3.33 100 110/3.33 = 33
Memory Hierarchy: MotivationMemory Hierarchy: Motivation
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 89/108
EECC550 - ShaabanEECC550 - Shaaban
#89 Final Review Winter 2000 2-19-2001
Memory Hierarchy: MotivationMemory Hierarchy: Motivation
The Principle Of LocalityThe Principle Of Locality• Programs usually access a relatively small portion of their
address space (instructions/data) at any instant of time
(program working set).
• Two Types of locality:
– Temporal Locality: If an item is referenced, it will tend tobe referenced again soon.
– Spatial locality: If an item is referenced, items whose
addresses are close will tend to be referenced soon.
• The presence of locality in program behavior, makes itpossible to satisfy a large percentage of program access needs
using memory levels with much less capacity than program
address space.
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 90/108
EECC550 - ShaabanEECC550 - Shaaban
#90 Final Review Winter 2000 2-19-2001
Levels of The Memory HierarchyLevels of The Memory Hierarchy
Part of The On-chipCPU Datapath
16-256 Registers
One or more levels (Static RAM):
Level 1: On-chip 16-64K
Level 2: On or Off-chip 128-512K
Level 3: Off-chip 128K-8M
Registers
Cache
Main Memory
Magnetic Disc
Optical Disk or Magnetic Tape
Farther away from
The CPU
Lower Cost/Bit
Higher Capacity
Increased AccessTime/Latency
Lower Throughput
Dynamic RAM (DRAM)
16M-16G
Interface:
SCSI, RAID,
IDE, 1394
4G-100G
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 91/108
EECC550 - ShaabanEECC550 - Shaaban
#91 Final Review Winter 2000 2-19-2001
Cache Design & Operation IssuesCache Design & Operation Issues
• Q1: Where can a block be placed cache?(Block placement strategy & Cache organization)
– Fully Associative, Set Associative, Direct Mapped
• Q2: How is a block found if it is in cache?
(Block identi f ication)
– Tag/Block
• Q3: Which block should be replaced on a miss?
(Block replacement)
– Random, LRU
• Q4: What happens on a write?(Cache wri te policy)
– Write through, write back
Cache Organization ExampleCache Organization Example
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 92/108
EECC550 - ShaabanEECC550 - Shaaban
#92 Final Review Winter 2000 2-19-2001
Cache Organization Exampleg p
Locating A Data Block in CacheLocating A Data Block in Cache
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 93/108
EECC550 - ShaabanEECC550 - Shaaban
#93 Final Review Winter 2000 2-19-2001
Locating A Data Block in CacheLocating A Data Block in Cache• Each block frame in cache has an address tag.
• The tags of every cache block that might contain the required dataare checked or searched in parallel.
• A valid bit is added to the tag to indicate whether this entry contains
a valid address.
• The address from the CPU to cache is divided into:
– A block address, further divided into:• An index field to choose a block set in cache.
(no index field when fully associative).
• A tag field to search and match addresses in the selected set.
– A block offset to select the data from the block.
Block Address Block
OffsetTag Index
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 94/108
EECC550 - ShaabanEECC550 - Shaaban
#94 Final Review Winter 2000 2-19-2001
Address Field SizesAddress Field Sizes
Block Address Block
OffsetTag Index
Block offset size = log2(block size)
Index size = log2(Total number of blocks/associativity)
Tag size = address size - index size - offset sizeTag size = address size - index size - offset size
Physical Address Generated by CPU
Four-Way Set Associative Cache:Four-Way Set Associative Cache:
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 95/108
EECC550 - ShaabanEECC550 - Shaaban
#95 Final Review Winter 2000 2-19-2001
Four-Way Set Associative Cache:Four-Way Set Associative Cache:MIPS Implementation ExampleMIPS Implementation Example
Addr ess
22 8
V TagIndex
0
1
2
25 3
25 4
25 5
Data V Tag Data V Tag Data V Tag Data
3222
4-to-1 multiplexor
Hit Da ta
123891011123 031 0
Index
Field
Tag
Field
256 sets
1024 block frames
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 96/108
EECC550 - ShaabanEECC550 - Shaaban
#96 Final Review Winter 2000 2-19-2001
Cache Organization/Addressing ExampleCache Organization/Addressing Example
• Given the following: – A single-level L1 cache with 128 cache block frames
• Each block frame contains four words (16 bytes)
– 16-bit memory addresses to be cached (64K bytes main memory
or 4096 memory blocks)
• Show the cache organization/mapping and cache
address fields for:
• Fully Associative cache.
• Direct mapped cache.• 2-way set-associative cache.
Cache Example: Fully Associative CaseCache Example: Fully Associative Case
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 97/108
EECC550 - ShaabanEECC550 - Shaaban
#97 Final Review Winter 2000 2-19-2001
p yp y
Block offset
= 4 bits
Block Address = 12 bits
Tag = 12 bits
All 128 tags must
be checked in parallel
by hardware to locate
a data block
V
V
V
Valid bit
Cache Example: Direct Mapped CaseCache Example: Direct Mapped Case
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 98/108
EECC550 - ShaabanEECC550 - Shaaban
#98 Final Review Winter 2000 2-19-2001
Block offset
= 4 bits
Block Address = 12 bits
Tag = 5 bits Index = 7 bitsMain Memory
Only a single tag must
be checked in parallelto locate a data block
V
Valid bit
V
V
V
Cache Example: 2-Way Set-AssociativeCache Example: 2-Way Set-Associative
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 99/108
EECC550 - ShaabanEECC550 - Shaaban
#99 Final Review Winter 2000 2-19-2001
Block offset
= 4 bits
Block Address = 12 bits
Tag = 6 bits Index = 6 bits Main Memory
Two tags in a set must
be checked in parallel
to locate a data block
Valid bits not shown
Cache Replacement PolicyCache Replacement Policy
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 100/108
EECC550 - ShaabanEECC550 - Shaaban
#100 Final Review Winter 2000 2-19-2001
Cache Replacement PolicyCache Replacement Policy• When a cache miss occurs the cache controller may have to
select a block of cache data to be removed from a cache block
frame and replaced with the requested data, such a block isselected by one of two methods:
– Random:
• Any block is randomly selected for replacement providinguniform allocation.
• Simple to build in hardware.
• The most widely used cache replacement strategy.
– Least-recently used (LRU):
• Accesses to blocks are recorded and and the block
replaced is the one that was not used for the longest periodof time.
• LRU is expensive to implement, as the number of blocksto be tracked increases, and is usually approximated.
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 101/108
EECC550 - ShaabanEECC550 - Shaaban
#101 Final Review Winter 2000 2-19-2001
Miss Rates for Caches with Different Size,Miss Rates for Caches with Different Size,
Associativity & Replacement AlgorithmAssociativity & Replacement Algorithm
Sample DataSample Data
Associativity: 2-way 4-way 8-way
Size LRU Random LRU Random LRU Random
16 KB 5.18% 5.69% 4.67% 5.29% 4.39% 4.96%
64 KB 1.88% 2.01% 1.54% 1.66% 1.39% 1.53%
256 KB 1.15% 1.17% 1.13% 1.13% 1.12% 1.12%
Si l L l C h P fSi l L l C h P f
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 102/108
EECC550 - ShaabanEECC550 - Shaaban
#102 Final Review Winter 2000 2-19-2001
Single Level Cache PerformanceSingle Level Cache Performance
For a CPU with a single level (L1) of cache and no stalls forcache hits:
CPU time = (CPU execution clock cycles +
Memory stall clock cycles) x clock cycle time
Memory stall clock cycles =
(Reads x Read miss rate x Read miss penalty) +
(Writes x Write miss rate x Write miss penalty)
If write and read miss penalties are the same:
Memory stall clock cycles = Memory accesses x Miss rate x Miss penalty
With ideal memory
Single Level Cache PerformanceSingle Level Cache Performance
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 103/108
EECC550 - ShaabanEECC550 - Shaaban
#103 Final Review Winter 2000 2-19-2001
Single Level Cache PerformanceSingle Level Cache PerformanceCPUtime = IC x CPI x C
CPIexecution = CPI with ideal memory
CPI = CPIexecution + Mem Stall cycles per instruction
CPUtime = IC x (CPIexecution + Mem Stall cycles per instruction) x C
Mem Stall cycles per instruction = Mem accesses per instruction x Miss rate x Miss penalty
CPUtime = IC x (CPIexecution + Mem accesses per instruction x
Miss rate x Miss penalty ) x C
Misses per instruction = Memory accesses per instruction x Miss rate
CPUtime = IC x (CPIexecution + Misses per instruction x Miss penalty) x
C
Cache Performance ExampleCache Performance Example
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 104/108
EECC550 - ShaabanEECC550 - Shaaban
#104 Final Review Winter 2000 2-19-2001
Cache Performance ExampleCache Performance Example• Suppose a CPU executes at Clock Rate = 200 MHz (5 ns per cycle)
with a single level of cache.
• CPIexecution = 1.1
• Instruction mix: 50% arith/logic, 30% load/store, 20% control
• Assume a cache miss rate of 1.5% and a miss penalty of 50 cycles.
CPI = CPIexecution + mem stalls per instruction
Mem Stalls per instruction = Mem accesses per instruction x Miss rate x Miss penalty
Mem accesses per instruction = 1 + .3 = 1.3
Mem Stalls per instruction = 1.3 x .015 x 50 = 0.975 CPI = 1.1 + .975 = 2.075
The ideal CPU with no misses is 2.075/1.1 = 1.88 times faster
Instruction fetch Load/store
Cache Performance ExampleCache Performance Example
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 105/108
EECC550 - ShaabanEECC550 - Shaaban
#105 Final Review Winter 2000 2-19-2001
Cache Performance ExampleCache Performance Example• Suppose for the previous example we double the clock rate to
400 MHZ, how much faster is this machine, assuming similarmiss rate, instruction mix?
• Since memory speed is not changed, the miss penalty takes
more CPU cycles:
Miss penalty = 50 x 2 = 100 cycles.
CPI = 1.1 + 1.3 x .015 x 100 = 1.1 + 1.95 = 3.05
Speedup = (CPIold x Cold)/ (CPInew x Cnew)
= 2.075 x 2 / 3.05 = 1.36
The new machine is only 1.36 times faster rather than 2
times faster due to the increased effect of cache misses.
→ CPUs with higher clock r ate, have more cycles per cache miss and more
memory impact on CPI
3 L l f C h3 Le els of Cache
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 106/108
EECC550 - ShaabanEECC550 - Shaaban
#106 Final Review Winter 2000 2-19-2001
3 Levels of Cache3 Levels of Cache
CPU
L1 Cache
L2 Cache
L3 Cache
Main Memory
Hit Rate= H1, Hit time = 1 cycle
Hit Rate= H2, Hit time = T2 cycles
Hit Rate= H3, Hit time = T3
Memory access penalty, M
3 L l C h P f3 L l C h P f
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 107/108
EECC550 - ShaabanEECC550 - Shaaban
#107 Final Review Winter 2000 2-19-2001
CPUtime = IC x (CPIexecution + Mem Stall cycles per instruction) x C
Mem Stall cycles per instruction = Mem accesses per instruction x Stall cycles per access
• For a system with 3 levels of cache, assuming no penalty
when found in L1 cache:
Stall cycles per memory access =
[miss rate L1] x [ Hit rate L2 x Hit time L2
+ Miss rate L2 x (Hit rate L3 x Hit time L3
+ Miss rate L3 x Memory access penalty) ] =
[1 - H1] x [ H2 x T2
+ ( 1-H2 ) x (H3 x (T2 + T3)
+ (1 - H3) x M) ]
3-Level Cache Performance3-Level Cache Performance
Three Level Cache ExampleThree Level Cache Example
7/23/2019 Cpu Organisasi
http://slidepdf.com/reader/full/cpu-organisasi 108/108
Three Level Cache ExampleThree Level Cache Example• CPU with CPIexecution = 1.1 running at clock rate = 500 MHZ
• 1.3 memory accesses per instruction.
• L1 cache operates at 500 MHZ with a miss rate of 5%
• L2 cache operates at 250 MHZ with miss rate 3%, (T2 = 2 cycles)
• L3 cache operates at 100 MHZ with miss rate 1.5%, (T3 = 5 cycles)
• Memory access penalty, M= 100 cycles. Find CPI.
• With single L1, CPI = 1.1 + 1.3 x .05 x 100 = 7.6CPI = CPIexecution + Mem Stall cycles per instructionMem Stall cycles per instruction = Mem accesses per instruction x Stall cycles per access
Stall cycles per memory access = [1 - H1] x [ H2 x T2 + ( 1-H2 ) x (H3 x (T2 + T3)
+ (1 - H3) x M) ]
= [.05] x [ .97 x 2 + (.03) x ( .985 x (2+5)
+ .015 x 100)]
= .05 x [ 1.94 + .03 x ( 6.895 + 1.5) ]
= .05 x [ 1.94 + .274] = .11
• CPI = 1.1 + 1.3 x .11 = 1.24