PROCESSOR ARCHITECTURE
Jehan-François Pâris
![Page 2: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/2.jpg)
Chapter Organization
• Logic design conventions
• Implementation of a "toy" CPU
• Pipelining
• Pipelining hazards
  – Data hazards
  – Control hazards
• Exceptions
• Parallelism
IMPORTANT
![Page 3: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/3.jpg)
LOGIC DESIGN CONVENTIONS
![Page 4: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/4.jpg)
Combinational/state elements
• Combinational elements:
  – Outputs only depend on current inputs
  – Stateless
    • Adders and, more generally, arithmetic logic unit (ALU)
![Page 5: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/5.jpg)
Combinational/state elements
• State elements:
  – Have a memory holding a state
  – Output depends on current inputs and state of element
  – State reflects past inputs
    • Flip-flops, …
![Page 6: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/6.jpg)
Judicial analogy
• In our legal system
  – Guilty/not guilty decision is stateless
    • Good reasons
  – Sentencing decision is not
    • "Three strikes and you are out" laws
    • Good reasons
![Page 7: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/7.jpg)
Clocking methodology
• We will assume an edge-triggered clocking technology
  – Edge is short enough to prevent data propagation in state elements
  – Can read current state of a memory element at the same time we update it
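The edge-triggered convention can be illustrated with a small software model. This is only a sketch: the `Register` class and its method names are hypothetical and not part of the slides. It shows that the old state stays readable until the active clock edge, at which point the new value takes over.

```python
class Register:
    """Hypothetical model of an edge-triggered state element."""

    def __init__(self, value=0):
        self.value = value        # current state, always readable
        self.pending = value      # value waiting at the data input

    def set_input(self, value):
        self.pending = value      # no effect on the state until the edge

    def clock_edge(self):
        self.value = self.pending  # state changes only at the active edge


pc = Register(0)
pc.set_input(pc.value + 4)        # read the current state while preparing the update
seen_before_edge = pc.value       # still the old state
pc.clock_edge()
seen_after_edge = pc.value        # the new state
```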
![Page 8: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/8.jpg)
Clocking convention
• Omit write control signal if state element is updated at every active clock edge
![Page 9: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/9.jpg)
A "TOY" CPU
![Page 10: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/10.jpg)
Motivation
• "Toy" CPU will implement a subset of MIPS instruction set
• Subset will be
  – Self-sufficient
  – Simpler to implement
  – Complex enough to allow a serious discussion of CPU architecture
![Page 11: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/11.jpg)
The subset
• Will include
  – Load and store instructions: lw (load word) and sw (store word)
  – Arithmetic-logic instructions: add, sub, and, or and slt (set less than)
  – Branch instructions: beq (branch if equal) and j (jump)
![Page 12: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/12.jpg)
Load and store instructions
• Format I
• Three operands:
  – Two registers $r1 and $r2
  – One displacement d
• lw $r1, d($r2) loads into register $r1 main memory word at address contents($r2) + d
• sw $r1, d($r2) stores contents of register $r1 into main memory word at address contents($r2) + d
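The semantics of lw and sw can be sketched in a few lines. This is an illustrative model only: registers and memory are plain dictionaries, and the helper names (`effective_address`, `lw`, `sw`) are assumptions made for the example.

```python
MASK32 = 0xFFFFFFFF

def effective_address(regs, r2, d):
    """Address = contents($r2) + d, kept to 32 bits."""
    return (regs[r2] + d) & MASK32

def lw(regs, mem, r1, d, r2):
    """lw $r1, d($r2): load the memory word into register $r1."""
    regs[r1] = mem[effective_address(regs, r2, d)]

def sw(regs, mem, r1, d, r2):
    """sw $r1, d($r2): store the contents of $r1 into the memory word."""
    mem[effective_address(regs, r2, d)] = regs[r1]


regs = {"$t0": 0, "$t1": 1000, "$t2": 42}
mem = {}
sw(regs, mem, "$t2", 8, "$t1")    # Mem[1008] = 42
lw(regs, mem, "$t0", 8, "$t1")    # $t0 = Mem[1008]
```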
![Page 13: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/13.jpg)
Arithmetic-logic instructions
• Format R
• Three operands:
  – Three registers $r1, $r2 and $r3
• Store into register $r1 result of $r2 <op> $r3, where <op> can be add, subtract, and, or, as well as set if less than
![Page 14: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/14.jpg)
Branch instruction
• Format I
• Three operands:
  – Two registers $r1 and $r2
  – One displacement d
• beq $r1, $r2, d sets value of PC to PC+4 + 4×d if $r1 = $r2
![Page 15: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/15.jpg)
The simplest data path
• Assume CPU will do nothing but
  – Increment its program counter and
  – Deliver the next instruction
![Page 16: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/16.jpg)
The simplest data path
[Figure: the PC drives the Read address input of the instruction memory, which outputs the Instruction; an adder (Add) computes PC + 4]
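A minimal sketch of this simplest data path, assuming the instruction memory is modeled as a dictionary keyed by byte address (the function name `fetch_sequence` is hypothetical):

```python
def fetch_sequence(instruction_memory, cycles, pc=0):
    """Deliver one instruction per cycle and increment the PC by 4."""
    fetched = []
    for _ in range(cycles):
        fetched.append(instruction_memory[pc])  # Read address -> Instruction
        pc = pc + 4                             # the adder that adds 4
    return fetched, pc


imem = {0: "lw", 4: "add", 8: "sw"}
trace, final_pc = fetch_sequence(imem, 3)
```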
![Page 17: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/17.jpg)
Implementing R2R instructions
• The ALU takes two 32-bit inputs
• Returns
  – A 32-bit output
  – A 1-bit signal if the result is zero
![Page 18: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/18.jpg)
The register file
• Two read outputs that are always available
• One write input activated by a RegWrite signal
• Three register selectors
![Page 19: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/19.jpg)
The register file
[Figure: register file with three 5-bit selectors (Read select 1, Read select 2, Write select), a Write data input, two outputs (Read data 1, Read data 2) and a RegWrite control that enables register writes]
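The register file interface can be modeled as follows. This is a sketch under stated assumptions: the class and method names are invented for the example, and the rule that register 0 always reads as zero is the usual MIPS convention, which the slides do not mention.

```python
class RegisterFile:
    """Hypothetical model: 32 registers selected by 5-bit selectors."""

    def __init__(self):
        self.regs = [0] * 32

    def read(self, read_select_1, read_select_2):
        # Both read outputs are always available.
        return self.regs[read_select_1 & 0x1F], self.regs[read_select_2 & 0x1F]

    def write(self, write_select, write_data, reg_write):
        # Nothing is written unless RegWrite is asserted.
        # Assumption: register 0 is hardwired to zero (MIPS convention).
        if reg_write and (write_select & 0x1F) != 0:
            self.regs[write_select & 0x1F] = write_data & 0xFFFFFFFF


rf = RegisterFile()
rf.write(8, 123, reg_write=True)
rf.write(9, 456, reg_write=False)   # ignored: RegWrite is deasserted
```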
![Page 20: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/20.jpg)
Implementing R2R instructions
[Figure: the two read outputs of the register file feed the ALU; the ALU Result is written back into the register file; RegWrite is enabled]
![Page 21: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/21.jpg)
Implementing load and store
• Require
  – An address calculation:
    • contents($r2) + d
  – An access to data memory
• Before doing the address calculation, we must transform 16-bit displacement d into a 32-bit value using sign extension
![Page 22: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/22.jpg)
The data memory
• One address selector
• One write data input
• One read data output
• Two controls
  – MemWrite
  – MemRead
![Page 23: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/23.jpg)
Sign extension (I)
• If 16-bit number has a zero as MSB
  – It is positive
  – Must add 16 zero bits
16-bit value: 0110 1010 1010 0100
32-bit value: 0000 0000 0000 0000 0110 1010 1010 0100
![Page 24: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/24.jpg)
Sign extension (II)
• If 16-bit number has a one as MSB
  – It is negative
  – Must add 16 one bits
16-bit value: 1110 1010 1010 0100
32-bit value: 1111 1111 1111 1111 1110 1010 1010 0100
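Both sign-extension cases can be checked with a short function. This is a sketch; the name `sign_extend16` is an assumption made for the example, and the two calls use the bit patterns from the slides (0x6AA4 and 0xEAA4).

```python
def sign_extend16(value):
    """Extend a 16-bit value to 32 bits by replicating its MSB."""
    value &= 0xFFFF
    if value & 0x8000:                 # MSB is one: prepend 16 one bits
        return value | 0xFFFF0000
    return value                       # MSB is zero: prepend 16 zero bits


positive = sign_extend16(0b0110_1010_1010_0100)
negative = sign_extend16(0b1110_1010_1010_0100)
```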
![Page 25: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/25.jpg)
The data memory
[Figure: data memory with Memory address and Write data inputs, a Read data output, and two controls: MemRead enables memory reads, MemWrite enables memory writes]
![Page 26: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/26.jpg)
Implementing the store instruction
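[Figure: store datapath. The ALU adds a register value to the sign-extended d field (SE) to form the Address of the data memory; the register value to be stored goes to the memory Write input]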
![Page 27: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/27.jpg)
Implementing the load instruction
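[Figure: load datapath. The ALU adds a register value to the sign-extended d field (SE) to form the Address of the data memory; the memory Read output goes back to the register file]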
![Page 28: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/28.jpg)
Implementing conditional branch
• Target address:
  – Sign-extend 16-bit immediate part of instruction
  – Shift left 2
  – Add to PC+4
• Branch control logic:
  – Perform test operation on two registers
  – Check result
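The target address computation can be sketched as below. The function names are hypothetical; the sketch follows the PC+4 + 4×d rule given for beq earlier in the deck.

```python
def sign_extend16(value):
    """Extend a 16-bit value to 32 bits by replicating its MSB."""
    value &= 0xFFFF
    return value | 0xFFFF0000 if value & 0x8000 else value

def branch_target(pc, immediate16):
    """Sign-extend the immediate, shift it left by 2, add it to PC+4."""
    offset = (sign_extend16(immediate16) << 2) & 0xFFFFFFFF
    return (pc + 4 + offset) & 0xFFFFFFFF


forward = branch_target(0x1000, 3)         # skips three instructions
backward = branch_target(0x1000, 0xFFFF)   # d = -1: back to the branch itself
```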
![Page 29: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/29.jpg)
Implementing conditional branch
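[Figure: conditional branch datapath. The register file feeds the ALU, whose result goes to the branch control logic; the d field of the instruction is sign-extended (SE), shifted left by 2 and added to PC+4 to produce the branch destination]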
![Page 30: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/30.jpg)
Note
• Arithmetic-logic operations only use
  – Register file and ALU
• Load and store use
  – ALU for computing memory address
  – Data memory
![Page 31: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/31.jpg)
Implementing other instructions
![Page 32: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/32.jpg)
Combining everything
![Page 33: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/33.jpg)
Left to be done
• All control signals:
  – Two multiplexers: ALUSrc and MemtoReg
  – RegWrite, MemRead and MemWrite switches
  – ALU controls (4 bits)
![Page 34: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/34.jpg)
ALU control signals
| ALU control lines | Function |
|---|---|
| 0000 | and |
| 0001 | or |
| 0010 | add |
| 0110 | subtract |
| 0111 | set on less than |
| 1100 | nor (not in "toy" subset) |
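A behavioral sketch of an ALU driven by these control lines is given below. It is an illustration, not the circuit itself: the function name `alu` is hypothetical, and slt is assumed to compare its operands as signed 32-bit values.

```python
MASK32 = 0xFFFFFFFF

def to_signed(x):
    """Interpret a 32-bit pattern as a signed integer."""
    return x - (1 << 32) if x & 0x80000000 else x

def alu(control, a, b):
    """Return (32-bit result, zero flag) for the given ALU control lines."""
    if control == 0b0000:
        result = a & b
    elif control == 0b0001:
        result = a | b
    elif control == 0b0010:
        result = a + b
    elif control == 0b0110:
        result = a - b
    elif control == 0b0111:
        result = 1 if to_signed(a) < to_signed(b) else 0
    elif control == 0b1100:
        result = ~(a | b)
    else:
        raise ValueError("unsupported ALU control value")
    result &= MASK32
    return result, result == 0
```

The zero flag returned alongside the result is the 1-bit signal that beq relies on after a subtraction.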
![Page 35: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/35.jpg)
Controlling the ALU
• Recall that all R-format instructions have same opcode
  – Operation performed by ALU is specified in the function field (bits 5:0)
![Page 36: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/36.jpg)
Controlling the ALU
• ALU control inputs generated by two-step process
  – Construct two ALUOp control bits from opcode
  – Construct four ALU control bits using
    • Two ALUOp bits
    • Six bits from function field when they are needed
![Page 37: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/37.jpg)
Dependence table
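| Opcode | ALUOp | Operation | Function | Action | ALU Ctl |
|---|---|---|---|---|---|
| lw | 00 | lw | - | add | 0010 |
| sw | 00 | sw | - | add | 0010 |
| beq | 01 | beq | - | subtract | 0110 |
| R-type | 10 | add | 100000 | add | 0010 |
| R-type | 10 | subtract | 100010 | subtract | 0110 |
| R-type | 10 | and | 100100 | and | 0000 |
| R-type | 10 | or | 100101 | or | 0001 |
| R-type | 10 | slt | 101010 | slt | 0111 |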
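The second step of this process can be written as a lookup that mirrors the dependence table. This is a sketch; the names `FUNCT_TO_ALU_CTL` and `alu_control` are assumptions made for the example.

```python
# Function field -> ALU control bits, for R-type instructions (ALUOp = 10).
FUNCT_TO_ALU_CTL = {
    0b100000: 0b0010,  # add
    0b100010: 0b0110,  # subtract
    0b100100: 0b0000,  # and
    0b100101: 0b0001,  # or
    0b101010: 0b0111,  # slt
}

def alu_control(alu_op, funct=0):
    """Derive the four ALU control bits from ALUOp and the function field."""
    if alu_op == 0b00:          # lw and sw: add to compute the address
        return 0b0010
    if alu_op == 0b01:          # beq: subtract to compare
        return 0b0110
    if alu_op == 0b10:          # R-type: the function field decides
        return FUNCT_TO_ALU_CTL[funct]
    raise ValueError("ALUOp value 11 is not used")
```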
![Page 38: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/38.jpg)
Notes
• Two-step process simplifies combinatorial logic
• Many don't care conditions in truth table
![Page 39: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/39.jpg)
Truth table
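| ALUOp1 | ALUOp0 | F5 | F4 | F3 | F2 | F1 | F0 | ALU control bits |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | X | X | X | X | X | X | 0010 |
| 0 | 1 | X | X | X | X | X | X | 0110 |
| 1 | 0 | X | X | 0 | 0 | 0 | 0 | 0010 |
| 1 | X | X | X | 0 | 0 | 1 | 0 | 0110 |
| 1 | 0 | X | X | 0 | 1 | 0 | 0 | 0000 |
| 1 | 0 | X | X | 0 | 1 | 0 | 1 | 0001 |
| 1 | X | X | X | 1 | 0 | 1 | 0 | 0111 |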
![Page 40: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/40.jpg)
Note
• Bits 4 and 5 of function field are not used
• ALUOp bits only have three possible values: 00, 01 and 10
  – Introduces don't care conditions
• All R instructions use same data paths
  – Other control bits depend only on opcode
![Page 41: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/41.jpg)
Control signal effects

| Signal | When deasserted | When asserted |
|---|---|---|
| Regdest | Destination register comes from rt field (bits 20:16) | Destination register comes from rd field (bits 15:11) |
| Regwrite | None | Enables write into destination register |
| ALUSrc | Second ALU operand comes from second register output | Second ALU operand comes from sign-extended displacement (bits 15:0) |
![Page 42: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/42.jpg)
Control signal effects
| Signal | When deasserted | When asserted |
|---|---|---|
| PCSrc | PC is incremented by 4 | PC set to branch target value |
| MemRead | None | Enables memory read output |
| MemWrite | None | Enables memory write |
| MemtoReg | Value fed to destination register comes from ALU | Value fed to destination register comes from memory |
![Page 43: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/43.jpg)
Note
• PCSrc is asserted when
  – Instruction is a branch and
  – ALU Zero result bit is asserted
• We will introduce a Branch control line
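The PCSrc rule amounts to a logical AND followed by a two-way selection. A minimal sketch, with hypothetical function and parameter names:

```python
def next_pc(pc, branch, alu_zero, branch_target):
    """Select the next PC: PCSrc = Branch AND Zero."""
    pc_src = branch and alu_zero
    return branch_target if pc_src else pc + 4
```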
![Page 44: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/44.jpg)
Control line settings
| Instruction | Regdest | ALUSrc | MemtoReg | RegWrite |
|---|---|---|---|---|
| R-format | 1 | 0 | 0 | 1 |
| lw | 0 | 1 | 1 | 1 |
| sw | X | 1 | X | 0 |
| beq | X | 0 | X | 0 |
![Page 45: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/45.jpg)
Control line settings
| Instruction | MemRead | MemWrite | Branch | ALUOp1 | ALUOp0 |
|---|---|---|---|---|---|
| R-format | 0 | 0 | 0 | 1 | 0 |
| lw | 1 | 0 | 0 | 0 | 0 |
| sw | 0 | 1 | 0 | 0 | 0 |
| beq | 0 | 0 | 1 | 0 | 1 |
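The two tables of control line settings can be merged into one lookup keyed by instruction class. This is only a sketch: the dictionary layout and the use of `None` for don't care (X) entries are choices made for the example, and the ALUOp bits are packed into one two-bit value.

```python
X = None  # don't care

CONTROL = {
    #            Regdest ALUSrc MemtoReg RegWrite MemRead MemWrite Branch ALUOp
    "R-format": (1,      0,     0,       1,       0,      0,       0,     0b10),
    "lw":       (0,      1,     1,       1,       1,      0,       0,     0b00),
    "sw":       (X,      1,     X,       0,       0,      1,       0,     0b00),
    "beq":      (X,      0,     X,       0,       0,      0,       1,     0b01),
}

FIELDS = ("Regdest", "ALUSrc", "MemtoReg", "RegWrite",
          "MemRead", "MemWrite", "Branch", "ALUOp")

def control_signals(instruction_class):
    """Return the control line settings as a name -> value mapping."""
    return dict(zip(FIELDS, CONTROL[instruction_class]))
```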
![Page 46: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/46.jpg)
Active datapaths for a R instruction
![Page 47: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/47.jpg)
Active datapaths for a load instruction
![Page 48: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/48.jpg)
Active datapaths for a beq instruction
![Page 49: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/49.jpg)
The "weird" jump instruction
• Uses J format
  – Single 26-bit operand
  – Implements an unconditional jump
• New value of PC is obtained as follows
  – Bits 1:0 are zero (address is multiple of 4)
  – Bits 27:2 come from jump operand
  – Bits 31:28 come from PC+4
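The construction of the jump address can be sketched as follows, assuming the standard MIPS layout in which the 26-bit operand fills bits 27:2 and the top four bits (31:28) are taken from PC+4. The function name `jump_target` is hypothetical.

```python
def jump_target(pc, operand26):
    """Build the new PC for a j instruction."""
    upper = (pc + 4) & 0xF0000000            # bits 31:28 come from PC+4
    middle = (operand26 & 0x03FFFFFF) << 2   # bits 27:2; bits 1:0 are zero
    return upper | middle
```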
![Page 50: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/50.jpg)
Implementing the jump instruction
![Page 51: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/51.jpg)
Limitations of single-cycle design
• If we want all instructions to be executed in one cycle
  – Clock cycle must be long enough to accommodate instruction taking the most time
    • Floating-point multiply or divide
• Does not work for CPUs that have a rich instruction set
![Page 52: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/52.jpg)
PIPELINING
![Page 53: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/53.jpg)
An analogy (I)
• Washing your clothes
  – Four steps:
    1. Putting in the washer
    2. Putting in the dryer
    3. Folding/ironing
    4. Putting them away
![Page 54: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/54.jpg)
An analogy (II)
• Most people
  – Start second wash load as soon as first wash load is in dryer
  – Put second wash load in dryer and start a third wash load while they are folding/ironing the first wash load
![Page 55: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/55.jpg)
Purely sequential approach
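| Time | 6 pm | 6:30 | 7 pm | 7:30 | 8 pm | 8:30 | 9 pm | 9:30 |
|---|---|---|---|---|---|---|---|---|
| Load 1 | Wash | Dry | Fold | Store | | | | |
| Load 2 | | | | | Wash | Dry | Fold | Store |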
![Page 56: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/56.jpg)
Smart approach
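| Time | 6 pm | 6:30 | 7 pm | 7:30 | 8 pm | 8:30 | 9 pm | 9:30 |
|---|---|---|---|---|---|---|---|---|
| Load 1 | Wash | Dry | Fold | Store | | | | |
| Load 2 | | Wash | Dry | Fold | Store | | | |
| Load 3 | | | Wash | Dry | Fold | Store | | |
| Load 4 | | | | Wash | Dry | Fold | Store | |

Solution assumes that a housemate puts folded/ironed clothes away for us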
![Page 57: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/57.jpg)
Main advantage
• Can do much more in much less time
![Page 58: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/58.jpg)
Limitation
• Slowed down by time taken by longest step
  – Could be washing/drying/ironing
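Both the gain and the limitation can be put into numbers. The sketch below assumes each step takes 30 minutes (read off the time axis of the laundry charts), that there is one washer, one dryer and so on, and that every pipeline slot lasts as long as the longest step. The function names are hypothetical.

```python
def sequential_minutes(loads, step_durations):
    """Each load goes through all its steps before the next load starts."""
    return loads * sum(step_durations)

def pipelined_minutes(loads, step_durations):
    """A new load starts every slot; a slot lasts as long as the longest step."""
    slot = max(step_durations)
    return (len(step_durations) + loads - 1) * slot


even_steps = [30, 30, 30, 30]          # wash, dry, fold, store
slow_dryer = [30, 60, 30, 30]          # one step takes twice as long
```

With four loads and even steps, the sequential approach takes 480 minutes and the pipelined one 210 minutes. With the slow dryer the pipelined time grows to 420 minutes, because every slot now lasts 60 minutes.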
![Page 59: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/59.jpg)
Instruction steps (I)
• Good candidates for pipelining steps
  1. Fetch instruction from memory
  2. Decode instruction
  3. Read registers
  4. Execute register to register operation or calculate address
  5. Access operand in memory
  6. Write results into a register
![Page 60: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/60.jpg)
Instruction steps (II)
• Since MIPS instruction set has fixed fields, we can combine steps 2 and 3
  1. Fetch instruction from memory
  2. Read registers while decoding instruction
  3. Execute register to register operation or calculate address
  4. Access operand in memory
  5. Write results into a register
![Page 61: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/61.jpg)
Sample step timings

| Instruction class | Instruction fetch | Register read | ALU operation | Data access | Register write | Total time |
|---|---|---|---|---|---|---|
| Load word (lw) | 200 ps | 100 ps | 200 ps | 200 ps | 100 ps | 800 ps |
| Store word (sw) | 200 ps | 100 ps | 200 ps | 200 ps | -- | 700 ps |
| R format instruction | 200 ps | 100 ps | 200 ps | -- | 100 ps | 600 ps |
| Branch (beq) | 200 ps | 100 ps | 200 ps | -- | -- | 500 ps |
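These timings determine the clock period of each design. The sketch below is an illustration under two assumptions: a single-cycle clock must cover the slowest instruction, and a pipelined clock must cover the slowest step. The names `STEP_PS`, `single_cycle_total` and `pipelined_total` are hypothetical.

```python
# Step times in picoseconds: fetch, register read, ALU, data access, register write.
STEP_PS = {
    "lw":  [200, 100, 200, 200, 100],
    "sw":  [200, 100, 200, 200, 0],
    "R":   [200, 100, 200, 0, 100],
    "beq": [200, 100, 200, 0, 0],
}

single_cycle_clock = max(sum(steps) for steps in STEP_PS.values())
pipelined_clock = max(max(steps) for steps in STEP_PS.values())

def single_cycle_total(n_instructions):
    """Every instruction takes one long cycle."""
    return n_instructions * single_cycle_clock

def pipelined_total(n_instructions, n_steps=5):
    """The first instruction fills the pipeline, then one completes per cycle."""
    return (n_steps + n_instructions - 1) * pipelined_clock
```

Under these assumptions the single-cycle clock is 800 ps and the pipelined clock 200 ps; three consecutive instructions take 2400 ps in the first design and 1400 ps in the second.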
![Page 62: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/62.jpg)
Step 1: Fetch and decode
![Page 63: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/63.jpg)
Step 2: Read registers
![Page 64: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/64.jpg)
Step 3: Use the ALU
![Page 65: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/65.jpg)
Step 4: Access operand in memory
![Page 66: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/66.jpg)
Step 5: Store result in register
![Page 67: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/67.jpg)
Observations
• Most R format instructions operate on three registers and skip step 4
• Same for most I format instructions with an immediate operand
• Store operations skip step 5
• Load register instructions go through all five steps
![Page 68: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/68.jpg)
Pipelining limitations
• Some instructions that skip a step will still have to wait until preceding instruction is done.
• Hazards:
  – An instruction cannot proceed because
    • Hardware cannot support the combination of instructions (structural hazards)
    • Data are not ready (data hazards)
    • Control/branch hazards
![Page 69: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/69.jpg)
Structural hazards
• Combinations of instructions that prevent pipelining
![Page 70: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/70.jpg)
A bad MIPS instruction (I)
• Recall that the IBM instruction set had instructions that add to a register the contents of a memory location
  – RX format
![Page 71: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/71.jpg)
A bad MIPS instruction (II)
• We could think of a MIPS instruction with three register operands
  ADDX r1, r2, r3
  adding to r1 the contents of the word at address contents of r2 + contents of r3
• We would have r1 = r1 + Mem[r2+r3]
![Page 72: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/72.jpg)
A bad MIPS instruction (III)
• It would be great for accessing arrays
  – r2 would have starting address of array (fixed value)
  – r3 would contain the array index multiplied by 4 (incremented after each step)
![Page 73: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/73.jpg)
A bad MIPS instruction (IV)
• Adding this instruction would be a very bad idea
  – Why?
![Page 74: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/74.jpg)
Answer
• Instruction would require two steps using the ALU
  – Adding r2 and r3 to compute the address of the memory operand (step 4)
  – Adding the memory operand to r1
• New step would introduce a structural hazard by preventing any other instruction from accessing the ALU
![Page 75: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/75.jpg)
My comment
• Careful design of the MIPS CPU and instruction set should be noted
  – Not true for older instruction sets
    • IBM 360, DEC VAX, …
  – Not true for x86 instruction sets
    • CPU is designed to be compatible with an existing instruction set
![Page 76: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/76.jpg)
Designing instruction sets for pipelining (I)
• All instructions should have the same length
  – Can fetch future instructions before the current one is decoded
• Have few instruction formats with register fields always in the same position
  – Can combine instruction decode and register read steps
![Page 77: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/77.jpg)
Designing instruction sets for pipelining (II)
• Memory operands should only appear in load and store instructions
  – No instruction can use the ALU twice!
• Operands must be properly aligned in memory
  – Can always access them in a single memory cycle
![Page 78: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/78.jpg)
Data hazards (I)
• Assume we have
  add $s0, $t0, $t1
  sub $t2, $s0, $t3
  or
  s0 = t0 + t1
  t2 = s0 – t3
• Need result of add before proceeding with sub instruction
![Page 79: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/79.jpg)
Detail of steps
| Cycle | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| add | IF | ID/RR | ALU | RW | | |
| sub | | IF | stall | stall | ID/RR | ALU |

• Second instruction must wait until first instruction has updated $s0 in cycle 4 before reading its value in cycle 5
![Page 80: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/80.jpg)
Data hazards (II)
• New value of $s0 computed by the add instruction is not stored in $s0 until its step 5 has completed
• New instruction must wait until add instruction has performed its step 5 before performing its step 2 (register read)
![Page 81: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/81.jpg)
Data hazards (III)
[Figure: pipeline diagram for add followed by sub]
![Page 82: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/82.jpg)
Data hazards (IV)
• We lose two cycles during which nothing can be done
• Cannot trust compiler to remove all data hazards
• Observe that new value of $s0 becomes available at the end of step 3 of add instruction
  – Add special circuitry to provide this value at the end of step 2 of sub instruction
    • Forwarding or bypassing
![Page 83: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/83.jpg)
After forwarding
[Figure: pipeline diagram for add followed by sub, with forwarding]
![Page 84: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/84.jpg)
Detail of steps
| Cycle | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| add | IF | ID/RR | ALU | RW | | |
| sub | | IF | ID/RR | ALU | RW | |

• Second instruction now gets updated value at the end of cycle 3 just in time to use it in cycle 4
  – No stall cycles
![Page 85: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/85.jpg)
Limitations (I)
• Forwarding worked very well because output of step 3 of add was forwarded to be input of step 3 of sub
• Would not work as well if output of a later step of an instruction is needed as input of an earlier step of the next instruction
  – Will still have one or more pipeline stalls (bubbles)
![Page 86: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/86.jpg)
Limitations (II)
• Assume we have
  lw $s0, 20($t1)
  sub $t2, $s0, $t3
  or
  s0 = Mem[t1+20]
  t2 = s0 – t3
• Need new value of s0 before proceeding with sub instruction
![Page 87: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/87.jpg)
Limitations (III)
[Figure: pipeline diagram of the two instructions]
![Page 88: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/88.jpg)
Detail of steps
| Cycle | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| lw | IF | ID/RR | ALU | MEM | RW | |
| sub | | IF | ID/RR | stall | ALU | RW |

• Even with forwarding, second instruction must wait until completion of memory access of first instruction in cycle 4 before performing its ALU step in cycle 5
  – One stall cycle
![Page 89: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/89.jpg)
A last word
• In many architectures, the floating point unit is a significant source of structural hazards
  – Less well adapted to pipelining
• The MIPS architecture assumes that we have separate memories for instructions and data
  – Having a single memory for both would result in many more hazards
![Page 90: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/90.jpg)
Control / jump hazards
• Happen whenever we have a conditional jump
• Consider the instructions
  add $4, $5, $6
  beq $1, $2, 40
  or $7, $8, $9
• Need result of conditional branch (beq) before deciding whether to execute next instruction (or)
![Page 91: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/91.jpg)
Control hazards (II)
[Figure: pipeline diagram for beq followed by or]
![Page 92: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/92.jpg)
Pipelined datapath
![Page 93: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/93.jpg)
Datapaths for pipelined organization
• Define five steps
  1. Fetch instruction from memory (IF)
  2. Instruction decode and register reads (ID)
  3. Execute AL operation on ALU (EX)
  4. Access operand in memory (MEM)
  5. Write back results into a register (WB)
![Page 94: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/94.jpg)
Datapaths for pipelined organization
• Insert registers to save outputs of each step before they get updated by the next step
  1. IF/ID registers
  2. ID/EX registers
  3. EX/MEM registers
  4. MEM/WB registers
![Page 95: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/95.jpg)
A first try
[Figure: first-try pipelined datapath with the four new sets of pipeline registers, starting with IF/ID]
![Page 96: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/96.jpg)
Comments
• This first try is not correct
  – Load instruction will not be implemented correctly
    • Address of destination register will be lost as soon as a new instruction is fetched
    • Must save it at each step
![Page 97: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/97.jpg)
The almost correct datapaths
[Figure annotation: register address follows instruction]
![Page 98: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/98.jpg)
The almost correct datapaths
![Page 99: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/99.jpg)
More problems
• Address of destination register is not always at the same place in all instructions
  – Could be instruction bits (20-16)
    • For all I-format instructions that write into a register
  – Could be instruction bits (15-11)
    • In R format instructions
![Page 100: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/100.jpg)
Why?
• In R format instructions
  opcode | source | source | dest | shamt | funct
• In I format instructions
  opcode | source | s/d | constant/address
![Page 101: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/101.jpg)
The solution
• Add a multiplexer at stage EX
![Page 102: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/102.jpg)
More about data hazards
• Consider
  sub $2, $1, $3
  and $12, $2, $5
  or $13, $6, $2
  add $14, $2, $2
  sw $15, 100($2)
• Last four instructions depend on result of sub
![Page 103: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/103.jpg)
More about data hazards
• $2 is updated at the end of last cycle of sub
• First instruction that would get the correct value of $2 would be the add
![Page 104: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/104.jpg)
More about data hazards
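| Cycle | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| sub | IF | ID+Reg | EX | MEM | WB | |
| and | | IF | ID+Reg | EX | MEM | WB |
| or | | | IF | ID+Reg | EX | MEM |
| add | | | | IF | ID+Reg | EX |
| sw | | | | | IF | ID+Reg |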
![Page 105: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/105.jpg)
Adding a forwarding unit
![Page 106: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/106.jpg)
More data hazards
• We can forward the results of sub instruction at the end of its EX step
  – In time for all four following instructions
• To do that we need a special forwarding unit
• Not all data hazards can be avoided
  – lw followed by any instruction accessing the loaded word
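The decision made by a forwarding unit can be sketched as a selection among three sources for each ALU operand. The slides do not spell out the condition; the one below is the usual textbook formulation and is given here as an assumption, with hypothetical names. It relies on the EX/MEM and MEM/WB pipeline registers carrying the destination register number and the RegWrite signal.

```python
def forward_select(src_reg, ex_mem_reg_write, ex_mem_rd,
                   mem_wb_reg_write, mem_wb_rd):
    """Pick where an ALU operand should come from."""
    # Most recent result first: the instruction one step ahead.
    if ex_mem_reg_write and ex_mem_rd != 0 and ex_mem_rd == src_reg:
        return "EX/MEM"
    # Otherwise the instruction two steps ahead.
    if mem_wb_reg_write and mem_wb_rd != 0 and mem_wb_rd == src_reg:
        return "MEM/WB"
    # No hazard: use the value read from the register file.
    return "REGFILE"


# sub $2,$1,$3 followed by and $12,$2,$5: sub is one step ahead of and.
and_operand = forward_select(2, True, 2, False, 0)
# or $13,$6,$2: and (writing $12) is one step ahead, sub is two steps ahead.
or_operand = forward_select(2, True, 12, True, 2)
```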
![Page 107: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/107.jpg)
Why?
• lw loads word from memory into a register
  – Goes through IF, ID+Reg, EX, MEM and WB steps
  – Register value is updated at the end of WB step
• Must delay any following instruction that wants to access the contents of the register
![Page 108: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/108.jpg)
Data hazard detection unit
• Detects hazards that cannot be avoided
• Inserts no-operation instructions (nop)
  – They do nothing!
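A minimal sketch of the load-use check such a unit might perform follows. The field names (`mem_read`, `rt`, `rs`) and the example instructions are assumptions made for illustration; they are not taken from the slides.

```python
def must_stall(id_ex, if_id):
    """Return True when the instruction being decoded needs a word still being loaded.

    id_ex describes the instruction one step ahead (a load if mem_read is set,
    writing register rt); if_id holds the source registers of the instruction
    being decoded.
    """
    return id_ex["mem_read"] and id_ex["rt"] in (if_id["rs"], if_id["rt"])

# Hypothetical pair: lw $2, 20($1) followed by and $4, $2, $5
load = {"mem_read": True, "rt": 2}
consumer = {"rs": 2, "rt": 5}
print(must_stall(load, consumer))  # True: a nop must be inserted
```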
![Page 109: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/109.jpg)
More about control hazards
• Outcome of conditional branch is not known until end of step EX
  – beq and bne use arithmetic unit to evaluate the branch condition
  – If branch is taken, we must abort the two following instructions
    • Easy because they have not yet updated anything
![Page 110: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/110.jpg)
More about control hazards
beq   IF      ID+Reg  EX      MEM     WB
next          IF      ID+Reg  ABORT
next                  IF      ABORT
dest                          IF      ID+Reg  EX
![Page 111: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/111.jpg)
More about control hazards
beq   IF      ID+Reg  EX      MEM     WB
next          IF      ABORT
dest                  IF      ID+Reg  EX      MEM
![Page 112: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/112.jpg)
Better implementation of beq/bne
![Page 113: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/113.jpg)
MIPS Optimization
• Move comparison ahead to reduce the number of aborted instructions
  – Add simple EQUAL/NOT EQUAL comparison hardware that tests outputs of register file
    • Bitwise XOR, then OR of the resulting bits
      – Will return zero if the register contents are identical
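The XOR-then-OR comparison can be modeled in a few lines. This is a software sketch of the logic only, assuming 32-bit registers; the function name `registers_differ` is hypothetical.

```python
def registers_differ(a, b, width=32):
    """Return 0 if the two register values are identical, 1 otherwise."""
    diff = (a ^ b) & ((1 << width) - 1)  # bitwise XOR: a 1 wherever the bits differ
    result = 0
    for bit in range(width):             # OR all the XOR outputs together
        result |= (diff >> bit) & 1
    return result

print(registers_differ(0x1234, 0x1234))  # 0
print(registers_differ(0x1234, 0x1235))  # 1
```

In hardware the loop is a tree of OR gates, which is why this test is much cheaper than a full ALU subtraction.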
![Page 114: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/114.jpg)
Explanations
• Moving the jump address calculation one step ahead means that we will always do the calculation even when it is not needed.
• Simple comparator duplicates one ALU function
![Page 115: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/115.jpg)
New problem
• We now need the correct values of the input registers in step ID
  – More data hazards
add $t0, $t2, $t3
beq $t0, $s0, 400
  – Data forwarding can reduce the number of nops but not eliminate them
![Page 116: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/116.jpg)
New data hazards
add   IF      ID+Reg  EX      MEM     WB
nop
nop
beq                           IF      ID+Reg  EX      MEM
![Page 117: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/117.jpg)
EXCEPTIONS AND INTERRUPTS
![Page 118: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/118.jpg)
Interrupts (I)
• Request to interrupt the flow of execution of the CPU
• Detected by the CPU hardware
  – After it has executed the current instruction
  – Before it starts the next instruction
![Page 119: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/119.jpg)
Interrupts (II)
• When an interrupt occurs:
a) The current state of the CPU (program counter, program status word, contents of registers, and so forth) is saved, normally on the top of a stack
b) A new CPU state is fetched
![Page 120: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/120.jpg)
Interrupts (III)
• New state includes a new hardware-defined value for the program counter
  – Cannot “hijack” interrupts
• Process is totally transparent to the task being interrupted
  – A process never knows whether it has been interrupted or not
![Page 121: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/121.jpg)
Types of interrupts (I)
• I/O completion interrupts
  – Notify the OS that an I/O operation has completed
• Timer interrupts
  – Notify the OS that a task has exceeded its quantum of CPU time
![Page 122: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/122.jpg)
Types of interrupts (II)
• Traps
  – Notify the OS of a program error (division by zero, illegal op code, illegal operand address, ...) or a hardware failure
• System calls
  – Notify OS that the running task wants to submit a request to the OS
• Notification of another event
![Page 123: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/123.jpg)
A surprising discovery
• Programs do interrupt themselves!
![Page 124: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/124.jpg)
Context switches
• Each interrupt will result in two context switches:
  – One when the running task is interrupted
  – Another when it regains the CPU
• Context switches are not cheap
• The overhead of any simple system call is two context switches
Remember that for 4330!
![Page 125: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/125.jpg)
Prioritizing interrupts (I)
• Interrupt requests may occur while the system is processing another interrupt
• Not all interrupts are equally urgent
  – Some are more urgent than others
  – Also true in real life
![Page 126: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/126.jpg)
Prioritizing interrupts (II)
• The best solution is to prioritize interrupts and assign to each source of interrupts a priority level
– New interrupt requests will be allowed to interrupt lower-priority interrupts but will have to wait for the completion of all other interrupts
• Solution is known as vectorized interrupts.
![Page 127: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/127.jpg)
Example from real life
• Let us try to prioritize
  – Phone is ringing
  – Washer signals end of cycle
  – Dark smoke is coming out of the kitchen
  – …
• With vectorized interrupts, a phone call will never interrupt another phone call
![Page 128: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/128.jpg)
The solution
Smoke in the kitchen
Phone is ringing
End of washer cycle
More low-priority stuff
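The priority rule can be sketched as a tiny software model. Everything here is an assumption made for illustration: the class name `PriorityInterrupts`, the numeric priority levels (3 for smoke, 2 for phone, 1 for washer) and the use of a heap for waiting requests do not come from the slides.

```python
import heapq

class PriorityInterrupts:
    """A new request preempts only handlers of strictly lower priority."""

    def __init__(self):
        self.active = []   # stack of (priority, name) handlers in progress
        self.pending = []  # waiting requests, kept as a max-heap via negated priority

    def request(self, priority, name):
        if not self.active or priority > self.active[-1][0]:
            self.active.append((priority, name))
            return "runs now"
        heapq.heappush(self.pending, (-priority, name))
        return "waits"

    def complete(self):
        """Finish the current handler, then start the most urgent waiting request if allowed."""
        finished = self.active.pop()
        if self.pending:
            priority, name = -self.pending[0][0], self.pending[0][1]
            if not self.active or priority > self.active[-1][0]:
                heapq.heappop(self.pending)
                self.active.append((priority, name))
        return finished[1]

home = PriorityInterrupts()
print(home.request(2, "phone"))              # runs now
print(home.request(2, "second phone call"))  # waits: same level never preempts
print(home.request(3, "smoke"))              # runs now: preempts the phone call
print(home.request(1, "washer"))             # waits
print(home.complete())                       # smoke
```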
![Page 129: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/129.jpg)
MIPS Implementation (I)
• Interrupts are a special case of a branch
  – Use same techniques for handling control hazards
• Almost all MIPS interrupts jump to the same hardware address (0x80000180)
  – MIPS uses a special register to pass along the type of interrupt to the interrupt handler
    • The Cause register
![Page 130: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/130.jpg)
MIPS Implementation (II)
• MIPS also saves the address + 4 of the affected instruction in a special register
  – EPC register
• A STATUS register allows selective disabling of interrupts
  – Useful for handling short critical sections in single-threaded kernel
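The register updates described on these two slides can be summarized in a short model. This is a sketch under stated assumptions: the CPU is represented as a plain dictionary, the function name `raise_exception` is hypothetical, and the cause code used in the example is an arbitrary illustrative value.

```python
HANDLER_ADDRESS = 0x80000180  # common entry point named on the slide

def raise_exception(cpu, cause_code):
    """Model the hardware side of taking an exception."""
    cpu["epc"] = cpu["pc"] + 4    # address + 4 of the affected instruction
    cpu["cause"] = cause_code     # tells the handler what happened
    cpu["pc"] = HANDLER_ADDRESS   # hardware-defined, so it cannot be hijacked
    return cpu

cpu = {"pc": 0x00400020, "epc": 0, "cause": 0}
raise_exception(cpu, cause_code=12)  # 12 is only an example value
print(hex(cpu["pc"]), hex(cpu["epc"]), cpu["cause"])
```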
![Page 131: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/131.jpg)
Issues (I)
• Interrupted instruction may have to be restarted
  – Typical for I/O completion interrupts
• Must then maintain precise exceptions that accurately identify the instruction being interrupted
  – Not true for hardware interrupts
![Page 132: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/132.jpg)
Issues (II)
• Must be able to restart instruction at the exact point it was interrupted
  – Not always easy on many architectures
• MIPS solution is to roll back everything and restart instruction as if nothing had happened
  – Easier on MIPS since register/memory update is always the last step of any instruction
  – Must still ensure that we can restore the original values of all registers
![Page 133: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/133.jpg)
Branch prediction
• CPU will try to predict whether a branch will be taken or not
• Important for loops
  – Branch is taken at every iteration but the last one
See speculative execution
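The loop observation can be quantified with a small count. The sketch below assumes a fixed (static) prediction, which is only one possible scheme and is not specified on the slide; the function name `mispredictions` and the ten-iteration loop are illustrative.

```python
def mispredictions(outcomes, predict_taken=True):
    """Count how often a fixed prediction disagrees with the actual branch outcomes."""
    return sum(1 for taken in outcomes if taken != predict_taken)

# A loop branch executed 10 times: taken at every iteration but the last one
loop_branch = [True] * 9 + [False]
print(mispredictions(loop_branch, predict_taken=True))   # 1
print(mispredictions(loop_branch, predict_taken=False))  # 9
```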
![Page 134: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/134.jpg)
Parallelism
• Instruction-level parallelism (ILP)
• Two ways:
  – Increasing the depth of the pipeline:
    • More steps can be executed in parallel
  – Multiple issue:
    • We duplicate some units (ALU)
      – Two or more units can be at the same pipeline stage
![Page 135: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/135.jpg)
An example
• Could modify the toy MIPS architecture by adding a second ALU:
  – Would allow RR instructions to be executed in parallel with load and store instructions
  – Would also need extra ports in the register bank
• Faster but much more complex
![Page 136: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/136.jpg)
Hazards
• Become an even bigger issue
• Some architectures assume that the compiler will take care of all data hazards
  – Will never issue sequence of instructions with unsatisfied dependencies
• Other architectures check for problems
![Page 137: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/137.jpg)
Speculative execution (I)
• Can speculate that
  – A branch will not be taken (think of loops)
  – A store that precedes a load will not store at the address the load will use
  and execute the instruction(s) hoping for the best
• If speculation is wrong, we must undo what we have done
![Page 138: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/138.jpg)
Speculative execution (II)
• Any speculation mechanism must include methods to
  – Check if the speculation was correct
  – Undo the effect of the speculated instructions
• Quite complex
• Can be done by the compiler or the hardware
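The two required methods, checking and undoing, can be illustrated with a toy software model. This is a sketch only: it checkpoints a register dictionary before speculating, which is an assumption made for clarity and not a description of how any real processor or compiler does it. The names `run_speculatively` and `speculation_holds` are hypothetical.

```python
def run_speculatively(registers, speculated_ops, speculation_holds):
    """Apply register writes speculatively, then keep or undo them.

    speculated_ops is a list of (register, value) writes;
    speculation_holds is a function that performs the check afterwards.
    """
    checkpoint = dict(registers)        # remember the state before speculating
    for dest, value in speculated_ops:
        registers[dest] = value         # execute hoping for the best
    if not speculation_holds():         # check if the speculation was correct
        registers.clear()
        registers.update(checkpoint)    # undo the effect of the speculated writes
        return False
    return True

regs = {"$t0": 1, "$t1": 2}
ok = run_speculatively(regs, [("$t0", 99)], speculation_holds=lambda: False)
print(ok, regs)  # False {'$t0': 1, '$t1': 2}
```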
![Page 139: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/139.jpg)
Fallacies
• Pipelining is easy
• Pipelining ideas can be implemented independently of technologies
![Page 140: PROCESSOR ARCHITECTURE](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813ebc550346895da92307/html5/thumbnails/140.jpg)
Pitfalls
• Instruction set must be pipelining friendly