GENERAL PURPOSE PROCESSOR
Introduction: General-Purpose Processor
Processor designed for a variety of computation tasks
Low unit cost, in part because the manufacturer spreads NRE over large numbers of units – Motorola sold half a billion 68HC05 microcontrollers in 1996 alone
Carefully designed, since higher NRE is acceptable
Can yield good performance, size, and power
Low NRE cost for the embedded system designer: short time-to-market/prototype, high flexibility
User just writes software; no processor design
Basic Architecture
Control unit and datapath
Similar to a single-purpose processor
Key differences:
Datapath is general
Control unit doesn't store the algorithm – the algorithm is "programmed" into the memory
[Figure: processor block diagram – control unit (PC, IR, controller) and datapath (ALU, registers), connected to memory and I/O over control/status lines]
Datapath Operations
Load: read a memory location into a register
ALU operation: input certain registers through the ALU, store the result back in a register
Store: write a register to a memory location
Control Unit
Control unit: configures the datapath operations
Sequence of desired operations ("instructions") stored in memory – the "program"
Instruction cycle – broken into several sub-operations, each taking one clock cycle, e.g.:
Fetch: get the next instruction into the IR
Decode: determine what the instruction means
Fetch operands: move data from memory to datapath registers
Execute: move data through the ALU
Store results: write data from a register to memory
[Figure: example program in memory – 100: load R0, M[500]; 101: inc R1, R0; 102: store M[501], R1 – with M[500] holding the value 10]
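The sub-operations and the example program above can be sketched as a tiny interpreter (a minimal illustration, not ARM semantics; the dictionary-based memory, instruction encoding, and the name `run` are assumptions):

```python
# Each loop iteration performs one instruction cycle: fetch, decode, execute.
def run(memory, program, start=100):
    regs = {"R0": 0, "R1": 0}
    pc = start
    while pc in program:
        inst = program[pc]              # fetch: read the instruction at PC
        op, args = inst[0], inst[1:]    # decode: split opcode and operands
        if op == "load":                # load Rd, M[addr]
            rd, addr = args
            regs[rd] = memory[addr]     # fetch operand from memory
        elif op == "inc":               # inc Rd, Rs: Rd = Rs + 1 via the ALU
            rd, rs = args
            regs[rd] = regs[rs] + 1
        elif op == "store":             # store M[addr], Rs
            addr, rs = args
            memory[addr] = regs[rs]     # store result back to memory
        pc += 1                         # advance to the next instruction
    return regs, memory

memory = {500: 10, 501: 0}
program = {
    100: ("load", "R0", 500),
    101: ("inc", "R1", "R0"),
    102: ("store", 501, "R1"),
}
regs, memory = run(memory, program)
# after the run, M[501] holds 11
```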
Instruction Cycles
[Figure: first instruction cycle – with PC = 100, load R0, M[500] is fetched, decoded, its operand (10) fetched from M[500], executed, and the result stored in R0]
[Figure: second instruction cycle – with PC = 101, inc R1, R0 is fetched and decoded; the ALU computes 10 + 1 = 11 and the result is stored in R1]
[Figure: third instruction cycle – with PC = 102, store M[501], R1 is fetched, decoded, and executed; the value 11 is written to M[501]]
Architectural Considerations
N-bit processor: N-bit ALU, registers, buses, memory data interface
Embedded: 8-bit, 16-bit, 32-bit common
Desktop/servers: 32-bit, even 64-bit
PC size determines the address space
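As a quick check of the address-space point (the function name is an illustrative assumption):

```python
# An N-bit PC can address 2**N locations.
def address_space(pc_bits):
    return 2 ** pc_bits

for n in (8, 16, 32):
    print(f"{n}-bit PC -> {address_space(n)} addressable locations")
# a 16-bit PC gives 65536 locations; a 32-bit PC gives 4294967296
```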
Architectural Considerations: Clock Frequency
Clock frequency: inverse of the clock period
The clock period must be longer than the longest register-to-register delay in the entire processor
Memory access is often the longest such delay
ARM Introduction
ARM RISC Design Philosophy
Smaller die size
Shorter development time
Higher performance
"Insects flap wings faster than small birds": a complex instruction may make some high-level function more efficient, but it slows down the clock for all instructions
ARM Design Philosophy
Reduce power consumption and extend battery life
High code density – embedded systems prefer slow, low-cost memory
Low price
Reduce the area of the die taken by the embedded processor – leave space for specialized processors
Hardware debug capability
ARM is not a pure RISC architecture – it was designed primarily for embedded systems
Instruction set features for embedded systems:
Variable cycle execution for certain instructions – multi-register load-store instructions are faster when memory access is sequential, and give higher code density (such register saves/restores are common at the start and end of functions)
Inline barrel shifting – leads to more complex instructions but improved code density, e.g. ADD r0, r1, r1, LSL #1
Thumb 16-bit instruction set – code can mix 16-bit and 32-bit instructions
Conditional execution – improved code density and fewer branch instructions, e.g.:
CMP r1, r2
SUBGT r1, r1, r2
SUBLT r2, r2, r1
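The three conditional instructions above form the body of a subtraction-based GCD loop; the same algorithm in Python for reference (the function name is an assumption):

```python
def gcd_subtract(a, b):
    """Subtraction-based GCD, mirroring CMP r1,r2 / SUBGT r1,r1,r2 / SUBLT r2,r2,r1."""
    while a != b:      # loop until CMP finds the two values equal
        if a > b:
            a = a - b  # SUBGT r1, r1, r2
        else:
            b = b - a  # SUBLT r2, r2, r1
    return a

# gcd_subtract(12, 18) returns 6
```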
Enhanced DSP instructions – use one processor instead of the traditional combination of two
ARM-Based Embedded Devices
Peripherals
All ARM peripherals are memory mapped
Interrupt controllers:
Standard interrupt controller – sends an interrupt signal to the processor core; can be programmed to ignore or mask an individual device or a set of devices; the interrupt handler reads a device bitmap register to determine which device requires servicing
VIC (vectored interrupt controller) – assigns a priority and an ISR handler to each device; depending on the type, it either calls the standard interrupt handler or jumps to the specific device handler directly
ARM Datapath Registers
R0-R15: general-purpose registers
R13: stack pointer
R14: link register
R15: program counter
R0-R13 are orthogonal
Two program status registers: CPSR and SPSR
ARM's Visible Registers
[Figure: register banks – r0-r15 and the CPSR are usable in user mode; the banked registers r8_fiq-r14_fiq, r13_svc/r14_svc, r13_abt/r14_abt, r13_irq/r14_irq, r13_und/r14_und, and SPSR_fiq/svc/abt/irq/und are available only in the corresponding system modes]
Banked Registers
37 registers in total; 20 are hidden from the program at any given time
These hidden registers are called banked registers, available only when the processor is in a certain mode
The mode can be changed by the program or on an exception: reset, interrupt request, fast interrupt request, software interrupt, data abort, prefetch abort, or undefined instruction
There is no SPSR access in user mode
CPSR
Condition flags: N, Z, C, V
Interrupt masks: I, F
Thumb state: T; Jazelle state: J
Mode bits 4-0: processor mode
Bit layout: bits 31-28 = N Z C V, bits 27-8 unused, bit 7 = I, bit 6 = F, bit 5 = T, bits 4-0 = mode
Six privileged modes:
Abort – entered after a failed attempt to access memory
Fast interrupt request (FIQ)
Interrupt request (IRQ)
Supervisor – entered after reset; the kernel works in this mode
System – a special version of user mode with full read/write access to the CPSR
Undefined – entered when an undefined or unsupported instruction is executed
Plus the unprivileged user mode
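How the N, Z, C, V flags are set can be illustrated for a 32-bit addition (a behavioral sketch; the function name and the tuple encoding of the flags are assumptions):

```python
def add_with_flags(a, b):
    """32-bit add; returns the result and the (N, Z, C, V) condition flags."""
    MASK = 0xFFFFFFFF
    a &= MASK
    b &= MASK
    total = a + b
    r = total & MASK
    n = r >> 31                    # N: bit 31 of the result (negative)
    z = int(r == 0)                # Z: result is zero
    c = int(total > MASK)          # C: unsigned carry out of bit 31
    v = ((a ^ r) & (b ^ r)) >> 31  # V: signed overflow
    return r, (n, z, c, v)

# adding 1 to the largest positive signed value sets N and V
r, flags = add_with_flags(0x7FFFFFFF, 1)
```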
Instruction Execution: 3-Stage Pipeline ARM Organization
Fetch: the instruction is fetched from memory and placed in the instruction pipeline
Decode: the instruction is decoded and the datapath control signals prepared for the next cycle; in this stage the instruction "owns" the decode logic but not the datapath
Execute: the instruction "owns" the datapath; the register bank is read, an operand shifted, and the ALU result generated and written back into the destination register
[Figure: ARM7 core diagram]
[Figure: 3-stage pipeline – single-cycle instruction]
[Figure: 3-stage pipeline – multi-cycle instruction]
PC Behavior
R15 is incremented twice before an instruction executes, due to pipeline operation, so R15 = current instruction address + 8
The offset is +4 for Thumb instructions
To Get Higher Performance
Tprog = (Ninst × CPI) / fclk
Ninst – the number of instructions executed for the program – is constant
Increase the clock rate:
The clock rate is limited by the slowest pipeline stage, so decrease the logic complexity per stage and increase the pipeline depth
Improve the CPI:
Instructions that take more than one cycle are re-implemented to occupy fewer cycles, and pipeline stalls are reduced
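The formula can be evaluated directly; the operating point below (instruction count, CPI, clock rate) is assumed purely for illustration:

```python
def t_prog(n_inst, cpi, f_clk):
    """Program execution time: Tprog = (Ninst * CPI) / fclk."""
    return (n_inst * cpi) / f_clk

# 1 million instructions, CPI = 1.5, 100 MHz clock -> 0.015 s
base = t_prog(1_000_000, 1.5, 100e6)
# doubling the clock rate halves Tprog; halving the CPI does the same
```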
Typical Dynamic Instruction Usage

Instruction Type       Dynamic Usage
Data movement          43%
Control flow           23%
Arithmetic operations  15%
Comparisons            13%
Logical operations      5%
Other                   1%

Statistics for a print preview program in an ARM instruction emulator
Memory Bottleneck
Von Neumann bottleneck: a single instruction and data memory, limited by the available memory bandwidth
A 3-stage ARM core accesses memory on (almost) every clock cycle
Harvard architecture is used in higher-performance ARM cores
The 5-Stage Pipeline
Fetch: the instruction is fetched and placed in the instruction pipeline
Decode: the instruction is decoded and register operands are read from the register file
Execute: an operand is shifted and the ALU result generated; for load and store, the memory address is computed
Buffer/Data: data memory is accessed if required; otherwise the ALU result is simply buffered
Write-back: the results are written back to the register file
Data Forwarding
Read-after-write pipeline hazard: an instruction needs the result of one of its predecessors before that result has returned to the register file, e.g.:
ADD r1, r2, r3
ADD r4, r5, r1
Data forwarding is used to eliminate the stall
In the following case, even with forwarding, a pipeline stall cannot be avoided:
LDR rN, [..]    ; load rN from somewhere
ADD r2, r1, rN  ; and use it immediately
The processor cannot avoid a one-cycle stall
Data Hazards
Handling data hazards in software: encourage the compiler not to place a dependent instruction immediately after a load instruction
Side effects: when a location other than the one explicitly named in an instruction as the destination operand is affected
Addressing modes: complex addressing modes do not necessarily lead to faster execution, e.g.:
Load (X(R1)), R2
versus the equivalent sequence
Add #X, R1, R2
Load (R2), R2
Load (R2), R2
Complex addressing modes require more complex hardware to decode and execute, and cause the pipeline to stall
Pipelining-friendly features:
Access to an operand does not require more than one access to memory
Only load and store instructions access memory
The addressing modes used do not have side effects: register, register indirect, and index modes
Condition codes:
Flags are modified by as few instructions as possible
The compiler should be able to specify in which instructions of the program they are affected and in which they are not
Complex Addressing Mode
[Figure (a): pipeline timing for Load (X(R1)), R2 – the extended execute stage computes X + [R1], reads [X + [R1]], then reads [[X + [R1]]] before the write (W) stage; the next instruction's operand is forwarded]
Simple Addressing Mode
[Figure (b): pipeline timing for the equivalent sequence Add #X, R1, R2; Load (R2), R2; Load (R2), R2 – each instruction passes through F, D, E, W with no extended execute stage]
[Figure: ARM 5-stage pipeline]
Instruction Hazards – Overview
Whenever the stream of instructions supplied by the instruction fetch unit is interrupted, the pipeline stalls
Causes include cache misses and branches
Unconditional Branches
[Figure 8.8: an idle cycle caused by a branch instruction – I3, fetched after the branch I2, is discarded (X), leaving the execution unit idle for one cycle before fetching resumes at the target Ik]
Branch Timing
[Figure 8.9: branch timing in a 4-stage pipeline – (a) when the branch address is computed in the Execute stage, two fetched instructions (I3 and I4) are discarded; (b) when it is computed in the Decode stage, only one (I3) is discarded]
- Branch penalty
- Reducing the penalty
Instruction Queue and Prefetching
[Figure 8.10: an instruction fetch unit with an instruction queue inserted between the fetch (F) unit and the dispatch/decode (D) unit, followed by execute (E) and write results (W)]
Branch Timing with Instruction Queue
[Figure 8.11: branch timing in the presence of an instruction queue; the branch target address is computed in the D stage, the queue length varies (1, 1, 1, 1, 2, 3, 2, 1, 1), and the wrongly fetched instruction is discarded (X) without stalling the execution unit]
Branch Folding
Branch folding – executing the branch instruction concurrently with the execution of other instructions
Branch folding occurs only if, at the time a branch instruction is encountered, at least one instruction other than the branch is available in the queue
Therefore, it is desirable to arrange for the queue to be full most of the time, to ensure an adequate supply of instructions for processing
This can be achieved by increasing the rate at which the fetch unit reads instructions from the cache
Having an instruction queue is also beneficial in dealing with cache misses
Conditional Branches
A conditional branch instruction introduces the added hazard caused by the dependency of the branch condition on the result of a preceding instruction
The decision to branch cannot be made until the execution of that instruction has been completed
Branch instructions represent about 20% of the dynamic instruction count of most programs
Delayed Branch
The instructions in the delay slots are always fetched, so we would like to arrange for them to be fully executed whether or not the branch is taken
The objective is to place useful instructions in these slots
The effectiveness of the delayed-branch approach depends on how often it is possible to reorder instructions
[Figure 8.12: reordering of instructions for a delayed branch –
(a) original program loop:  LOOP: Shift_left R1; Decrement R2; Branch=0 LOOP; NEXT: Add R1,R3
(b) reordered instructions: LOOP: Decrement R2; Branch=0 LOOP; Shift_left R1; NEXT: Add R1,R3]
[Figure 8.13: execution timing showing the delay slot being filled during the last two passes through the loop of Figure 8.12 – while the branch is taken, each pass executes Decrement, Branch, Shift (delay slot); on the final pass the branch is not taken and Add follows the Shift in the delay slot]
Branch Prediction
Predict whether or not a particular branch will be taken
Simplest form: assume the branch will not take place and continue to fetch instructions in sequential address order
Until the branch is evaluated, instruction execution along the predicted path must be done on a speculative basis
Speculative execution: instructions are executed before the processor is certain that they are in the correct execution sequence
Care is needed so that no processor registers or memory locations are updated until it is confirmed that these instructions should indeed be executed
Incorrectly Predicted Branch
[Figure 8.14: timing when a branch decision has been incorrectly predicted as not taken – I1 (Compare) and I2 (Branch>0) proceed, the speculatively fetched I3 and I4 are discarded (X) once the branch condition resolves, and fetching restarts at Ik]
Better performance can be achieved if some branch instructions are predicted as taken and others as not taken:
Use hardware to observe whether the target address is lower or higher than that of the branch instruction
Let the compiler include a branch prediction bit
So far, the branch prediction decision is always the same every time a given instruction is executed – static branch prediction
Superscalar Operation
With a single pipeline, the maximum throughput is one instruction per clock cycle
With multiple processing units, more than one instruction can complete per cycle
Superscalar
[Figure 8.19: a processor with two execution units – the instruction fetch unit fills an instruction queue, a dispatch unit issues to a floating-point unit and an integer unit, and a W (write results) stage completes each instruction]
Timing
[Figure 8.20: instruction execution flow in the processor of Figure 8.19, assuming no hazards – I1 (Fadd) and I3 (Fsub) each occupy the floating-point unit for three cycles, while I2 (Add) and I4 (Sub) complete in a single cycle in the integer unit; all four write back by cycle 7]
ALU
Logic operations: OR, AND, XOR, NOT, NAND, NOR, etc. – no dependencies among bits, so every result bit can be calculated in parallel
Arithmetic operations: ADD, SUB, INC, DEC, MUL, DIV – involve a long carry-propagation chain, a major source of delay, and require optimization
Suitability of an algorithm is judged by resource usage (physical space on the silicon die) and turnaround time
[Figure: the original ARM1 ripple-carry adder circuit – one bit slice with inputs A, B, Cin and outputs sum, Cout]
[Figure: the ARM2 4-bit carry look-ahead scheme – a 4-bit adder logic block with inputs A[3:0], B[3:0], Cin[0] and outputs sum[3:0], Cout[3], plus group propagate (P) and generate (G) signals]
[Figure: the ARM2 ALU logic for one result bit – the NA and NB operand buses and the carry logic are steered by function-select lines fs5:0 (with G and P terms) to produce one bit of the ALU bus]
ARM2 ALU function codes

fs5 fs4 fs3 fs2 fs1 fs0   ALU output
 0   0   0   1   0   0    A and B
 0   0   1   0   0   0    A and not B
 0   0   1   0   0   1    A xor B
 0   1   1   0   0   1    A plus not B plus carry
 0   1   0   1   1   0    A plus B plus carry
 1   1   0   1   1   0    not A plus B plus carry
 0   0   0   0   0   0    A
 0   0   0   0   0   1    A or B
 0   0   0   1   0   1    B
 0   0   1   0   1   0    not B
 0   0   1   1   0   0    zero
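The function table can be modeled behaviorally as a lookup from the fs5..fs0 code to a 32-bit function (a sketch; Python's unbounded integers are masked to 32 bits, and the carry input is made explicit):

```python
MASK = 0xFFFFFFFF  # model 32-bit values

# fs5..fs0 -> function of (A, B, carry-in), per the table above
ALU_FUNCS = {
    (0, 0, 0, 1, 0, 0): lambda a, b, c: a & b,
    (0, 0, 1, 0, 0, 0): lambda a, b, c: a & ~b & MASK,
    (0, 0, 1, 0, 0, 1): lambda a, b, c: a ^ b,
    (0, 1, 1, 0, 0, 1): lambda a, b, c: (a + (~b & MASK) + c) & MASK,
    (0, 1, 0, 1, 1, 0): lambda a, b, c: (a + b + c) & MASK,
    (1, 1, 0, 1, 1, 0): lambda a, b, c: ((~a & MASK) + b + c) & MASK,
    (0, 0, 0, 0, 0, 0): lambda a, b, c: a,
    (0, 0, 0, 0, 0, 1): lambda a, b, c: a | b,
    (0, 0, 0, 1, 0, 1): lambda a, b, c: b,
    (0, 0, 1, 0, 1, 0): lambda a, b, c: ~b & MASK,
    (0, 0, 1, 1, 0, 0): lambda a, b, c: 0,
}

def alu(fs, a, b, carry_in=0):
    return ALU_FUNCS[fs](a & MASK, b & MASK, carry_in)

# "A plus not B plus carry" with carry = 1 is a subtraction: 7 - 5 = 2
```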
The ARM6 carry-select adder scheme
[Figure: the operands are split into blocks a,b[3:0] through a,b[31:28]; each block computes both s and s+1 in parallel, and multiplexers select sum[3:0], sum[7:4], sum[15:8], and sum[31:16] once the true carries are known]
Conditional Sum Adder
An extension of the carry-select adder
Carry-select adder levels: one level using k/2-bit adders, two levels using k/4-bit adders, three levels using k/8-bit adders, etc.
Assuming k is a power of two, the extreme case has log2(k) levels of 1-bit adders – this is a conditional sum adder
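The underlying carry-select step can be sketched in a few lines of Python: each block computes its sum for both possible carry-ins, and a multiplexer picks one when the real carry arrives (the width and block-size parameters are illustrative):

```python
def carry_select_add(a, b, width=32, block=4):
    """Add two unsigned values by blocks, selecting between s and s+1 per block."""
    bmask = (1 << block) - 1
    carry, result = 0, 0
    for shift in range(0, width, block):
        xa = (a >> shift) & bmask
        xb = (b >> shift) & bmask
        s0 = xa + xb                 # block sum assuming carry-in = 0
        s1 = xa + xb + 1             # block sum assuming carry-in = 1
        s = s1 if carry else s0      # mux: select with the actual carry
        result |= (s & bmask) << shift
        carry = s >> block           # carry out of this block
    return result & ((1 << width) - 1)
```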
[Figure: conditional sum adder – worked example]
[Figure: conditional sum adder – top-level block for one bit position]
The ARM6 ALU organization
[Figure: the A and B operand latches feed XOR gates (controlled by invert A / invert B), then an adder with zero detect and a block of logic functions; a result multiplexer selects between the logic and arithmetic results, and the N, Z, C, V flags are produced alongside]
The cross-bar switch barrel shifter principle
[Figure: a 4×4 cross-bar of switches connects in[3:0] to out[3:0]; each diagonal of switches implements one shift amount – no shift, right 1, right 2, right 3, left 1, left 2, left 3]
Shift Implementation
For a left or right shift, one diagonal is turned on
The shifter operates in negative logic; precharging sets all outputs to logic '0'
For rotate right, the right-shift diagonal is enabled together with the complementary left-shift diagonal
Arithmetic shift right uses sign extension rather than '0' fill
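The shifter's operations can be modeled behaviorally for 32-bit values (plain Python rather than a gate-level model; the mnemonics lsl/lsr/ror/asr follow ARM usage):

```python
W, MASK = 32, 0xFFFFFFFF

def lsl(x, n):  # logical shift left, '0' fill from the right
    return (x << n) & MASK

def lsr(x, n):  # logical shift right, '0' fill from the left
    return (x & MASK) >> n

def ror(x, n):  # rotate right: right shift OR'd with the complementary left shift
    n %= W
    return (((x & MASK) >> n) | (x << (W - n))) & MASK

def asr(x, n):  # arithmetic shift right: sign extension rather than '0' fill
    x &= MASK
    if x & 0x80000000:
        return ((x >> n) | (MASK << (W - n))) & MASK
    return x >> n
```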
Multiplier
ARM includes hardware support for integer multiplication
Older ARM cores include low-cost multiplication hardware supporting a 32-bit result multiply and multiply-accumulate
This uses the main datapath iteratively – the barrel shifter and ALU generate a 2-bit product in each cycle – employing a modified Booth's algorithm
Multiplication schemes: radix-2 multiplication, radix-4 multiplication, radix-2 Booth algorithm, radix-4 Booth algorithm
Modified Booth's Recoding

x(i+1) x(i) x(i-1)   y(i+1) y(i)   z(i/2)   Explanation
  0     0     0        0     0       0      No string of 1s in sight
  0     0     1        0     1       1      End of string of 1s
  0     1     0        0     1       1      Isolated 1
  0     1     1        1     0       2      End of string of 1s
  1     0     0       -1     0      -2      Beginning of string of 1s
  1     0     1       -1     1      -1      End a string, begin new one
  1     1     0        0    -1      -1      Beginning of string of 1s
  1     1     1        0     0       0      Continuation of string of 1s
Example: Modified Booth's Recoding

Operand x:             1 0 0 1 1 1 0 1 1 0 1 0 1 1 1 0
Recoded version y: (1) -1 0 1 0 0 -1 1 0 -1 1 -1 1 0 0 -1 0
Radix-4 version z: (1)  -2   2  -1   2   -1   -1   0   -2
Example: radix-4 Booth multiplication (a = 0110 = 6, x = 1010 = -6)
================================
a          0 1 1 0
x          1 0 1 0
z            -1  -2        (radix-4 recoded)
================================
p(0)       0 0 0 0 0 0
+z0·a      1 1 0 1 0 0     (-2a)
---------------------------------
4p(1)      1 1 0 1 0 0
p(1)     1 1 1 1 0 1 0 0
+z1·a    1 1 1 0 1 0       (-a)
---------------------------------
4p(2)    1 1 0 1 1 1 0 0
p(2)     1 1 0 1 1 1 0 0   (= -36)
================================
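The recoding can be checked with a short sketch that scans overlapping bit triples x(i+1), x(i), x(i-1) and emits one radix-4 digit in {-2, -1, 0, 1, 2} per bit pair (the function names are illustrative):

```python
# Triple (x[i+1], x[i], x[i-1]) -> radix-4 digit, per the recoding table
RECODE = {
    (0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 2,
    (1, 0, 0): -2, (1, 0, 1): -1, (1, 1, 0): -1, (1, 1, 1): 0,
}

def booth_radix4(x, bits):
    """Radix-4 Booth digits of a `bits`-wide two's-complement x, LSB digit first."""
    digits = []
    prev = 0                            # x[-1] is taken as 0
    for i in range(0, bits, 2):
        b0 = (x >> i) & 1
        b1 = (x >> (i + 1)) & 1
        digits.append(RECODE[(b1, b0, prev)])
        prev = b1
    return digits

def booth_value(digits):
    """Reconstruct the signed value encoded by the radix-4 digits."""
    return sum(d * 4 ** i for i, d in enumerate(digits))

# the multiplier x = 1010 recodes to digits -2, -1 (LSB first), i.e. the value -6
```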
[Figure: radix-4 multiplication – the product p of multiplier x and multiplicand a is built from the partial products (x3 x2)two · a · 4^1 and (x1 x0)two · a · 4^0]
High-Speed Multiplier
Recent cores have high-performance multiplication hardware supporting a 64-bit result multiply and multiply-accumulate
Multiplier: Carry-Save Addition
In multiplication, multiple partial products are added simultaneously using 2-operand adders
Time-consuming carry propagation must be repeated several times: k operands take k-1 propagations
Techniques for lowering this penalty exist – carry-save addition
Carry propagates only in the last step; the other steps generate a partial sum and a sequence of carries
A basic CSA accepts 3 n-bit operands and generates 2 n-bit results: an n-bit partial sum and an n-bit carry
A second CSA accepts these 2 sequences and another input operand, generating a new partial sum and carry
A CSA reduces the number of operands to be added from 3 to 2 without carry propagation
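The 3-to-2 compression can be demonstrated on whole words: XOR gives the partial sum, and the majority function, shifted one place left, gives the carry word (a behavioral sketch, not a gate-level model):

```python
def carry_save_add(x, y, z):
    """Compress three operands into a partial-sum word and a carry word."""
    s = x ^ y ^ z                              # per-bit sum, no propagation
    c = ((x & y) | (x & z) | (y & z)) << 1     # per-bit carries at their weight
    return s, c

x, y, z = 11, 6, 13
s, c = carry_save_add(x, y, z)
# a single final carry-propagate add resolves the result: s + c == 30
```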
CSA Basic Unit: the (3,2) Counter
The simplest implementation is a full adder (FA) with 3 inputs x, y, z: x + y + z = 2c + s (s, c – the sum and carry outputs)
The outputs are a weighted binary representation of the number of 1's among the inputs, so the FA is called a (3,2) counter
An n-bit CSA is n (3,2) counters in parallel, with no carry links between them
[Figure: (a) carry-propagate vs (b) carry-save arrangements of full adders (inputs A, B, Cin; outputs S, Cout) – in (a) each Cout feeds the next adder's Cin, forming a chain; in (b) the carry outputs are kept as a separate word]
Cascaded CSA for Four 4-bit Operands
The upper 2 levels are 4-bit CSAs; the 3rd level is a 4-bit carry-propagating adder (CPA)
Wallace Tree
A better organization for the CSAs – faster operation time
ARM high-speed multiplier organization
[Figure: registers Rs and Rm feed carry-save adders that consume 8 bits of Rs per cycle; the partial sum and partial carry registers are rotated 8 bits per cycle and initialized for MLA; the ALU adds the partials to produce the final result]