GENERAL PURPOSE PROCESSOR
Introduction: General-Purpose Processor
Processor designed for a variety of computation tasks
Low unit cost, in part because the manufacturer spreads NRE over large numbers of units – Motorola sold half a billion 68HC05 microcontrollers in 1996 alone
Carefully designed, since higher NRE is acceptable
Can yield good performance, size, and power
Low NRE cost for the embedded system designer: short time-to-market/prototype, high flexibility
User just writes software; no processor design
Basic Architecture
Control unit and datapath
Similar to a single-purpose processor
Key differences:
Datapath is general
Control unit doesn't store the algorithm – the algorithm is "programmed" into the memory
[Figure: processor block diagram – control unit (PC, IR, controller) and datapath (ALU, registers), connected to memory and I/O over control/status lines]
Datapath Operations
Load: read a memory location into a register
ALU operation: input certain registers through the ALU, store the result back in a register
Store: write a register to a memory location
Control Unit
Control unit: configures the datapath operations
Sequence of desired operations ("instructions") stored in memory – the "program"
Instruction cycle – broken into several sub-operations, each taking one clock cycle, e.g.:
Fetch: get the next instruction into the IR
Decode: determine what the instruction means
Fetch operands: move data from memory to datapath registers
Execute: move data through the ALU
Store results: write data from a register to memory
[Figure: example program in memory – 100: load R0, M[500]; 101: inc R1, R0; 102: store M[501], R1 – with M[500] holding the value 10]
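The sub-operations and the example program above can be sketched as a tiny interpreter (a minimal illustration, not ARM semantics; the dictionary-based memory, instruction encoding, and the name `run` are assumptions):

```python
# Each loop iteration performs one instruction cycle: fetch, decode, execute.
def run(memory, program, start=100):
    regs = {"R0": 0, "R1": 0}
    pc = start
    while pc in program:
        inst = program[pc]              # fetch: read the instruction at PC
        op, args = inst[0], inst[1:]    # decode: split opcode and operands
        if op == "load":                # load Rd, M[addr]
            rd, addr = args
            regs[rd] = memory[addr]     # fetch operand from memory
        elif op == "inc":               # inc Rd, Rs: Rd = Rs + 1 via the ALU
            rd, rs = args
            regs[rd] = regs[rs] + 1
        elif op == "store":             # store M[addr], Rs
            addr, rs = args
            memory[addr] = regs[rs]     # store result back to memory
        pc += 1                         # advance to the next instruction
    return regs, memory

memory = {500: 10, 501: 0}
program = {
    100: ("load", "R0", 500),
    101: ("inc", "R1", "R0"),
    102: ("store", 501, "R1"),
}
regs, memory = run(memory, program)
# after the run, M[501] holds 11
```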
Instruction Cycles
[Figure: first instruction cycle – with PC = 100, load R0, M[500] is fetched, decoded, its operand (10) fetched from M[500], executed, and the result stored in R0]
[Figure: second instruction cycle – with PC = 101, inc R1, R0 is fetched and decoded; the ALU computes 10 + 1 = 11 and the result is stored in R1]
[Figure: third instruction cycle – with PC = 102, store M[501], R1 is fetched, decoded, and executed; the value 11 is written to M[501]]
Architectural Considerations
N-bit processor: N-bit ALU, registers, buses, memory data interface
Embedded: 8-bit, 16-bit, 32-bit common
Desktop/servers: 32-bit, even 64-bit
PC size determines the address space
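As a quick check of the address-space point (the function name is an illustrative assumption):

```python
# An N-bit PC can address 2**N locations.
def address_space(pc_bits):
    return 2 ** pc_bits

for n in (8, 16, 32):
    print(f"{n}-bit PC -> {address_space(n)} addressable locations")
# a 16-bit PC gives 65536 locations; a 32-bit PC gives 4294967296
```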
Architectural Considerations: Clock Frequency
Clock frequency: inverse of the clock period
The clock period must be longer than the longest register-to-register delay in the entire processor
Memory access is often the longest such delay
ARM Introduction
ARM RISC Design Philosophy
Smaller die size
Shorter development time
Higher performance
"Insects flap wings faster than small birds": a complex instruction may make some high-level function more efficient, but it slows down the clock for all instructions
ARM Design Philosophy
Reduce power consumption and extend battery life
High code density – embedded systems prefer slow, low-cost memory
Low price
Reduce the area of the die taken by the embedded processor – leave space for specialized processors
Hardware debug capability
ARM is not a pure RISC architecture – it was designed primarily for embedded systems
Instruction set features for embedded systems:
Variable cycle execution for certain instructions – multi-register load-store instructions are faster when memory access is sequential, and give higher code density (such register saves/restores are common at the start and end of functions)
Inline barrel shifting – leads to more complex instructions but improved code density, e.g. ADD r0, r1, r1, LSL #1
Thumb 16-bit instruction set – code can mix 16-bit and 32-bit instructions
Conditional execution – improved code density and fewer branch instructions, e.g.:
CMP r1, r2
SUBGT r1, r1, r2
SUBLT r2, r2, r1
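The three conditional instructions above form the body of a subtraction-based GCD loop; the same algorithm in Python for reference (the function name is an assumption):

```python
def gcd_subtract(a, b):
    """Subtraction-based GCD, mirroring CMP r1,r2 / SUBGT r1,r1,r2 / SUBLT r2,r2,r1."""
    while a != b:      # loop until CMP finds the two values equal
        if a > b:
            a = a - b  # SUBGT r1, r1, r2
        else:
            b = b - a  # SUBLT r2, r2, r1
    return a

# gcd_subtract(12, 18) returns 6
```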
Enhanced DSP instructions – use one processor instead of the traditional combination of two
ARM-Based Embedded Devices
Peripherals
All ARM peripherals are memory mapped
Interrupt controllers:
Standard interrupt controller – sends an interrupt signal to the processor core; can be programmed to ignore or mask an individual device or a set of devices; the interrupt handler reads a device bitmap register to determine which device requires servicing
VIC (vectored interrupt controller) – assigns a priority and an ISR handler to each device; depending on the type, it either calls the standard interrupt handler or jumps to the specific device handler directly
ARM Datapath Registers
R0-R15: general-purpose registers
R13: stack pointer
R14: link register
R15: program counter
R0-R13 are orthogonal
Two program status registers: CPSR and SPSR
ARM's Visible Registers
[Figure: register banks – r0-r15 and the CPSR are usable in user mode; the banked registers r8_fiq-r14_fiq, r13_svc/r14_svc, r13_abt/r14_abt, r13_irq/r14_irq, r13_und/r14_und, and SPSR_fiq/svc/abt/irq/und are available only in the corresponding system modes]
Banked Registers
37 registers in total; 20 are hidden from the program at any given time
These hidden registers are called banked registers, available only when the processor is in a certain mode
The mode can be changed by the program or on an exception: reset, interrupt request, fast interrupt request, software interrupt, data abort, prefetch abort, or undefined instruction
There is no SPSR access in user mode
CPSR
Condition flags: N, Z, C, V
Interrupt masks: I, F
Thumb state: T; Jazelle state: J
Mode bits 4-0: processor mode
Bit layout: bits 31-28 = N Z C V, bits 27-8 unused, bit 7 = I, bit 6 = F, bit 5 = T, bits 4-0 = mode
Six privileged modes:
Abort – entered after a failed attempt to access memory
Fast interrupt request (FIQ)
Interrupt request (IRQ)
Supervisor – entered after reset; the kernel works in this mode
System – a special version of user mode with full read/write access to the CPSR
Undefined – entered when an undefined or unsupported instruction is executed
Plus the unprivileged user mode
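How the N, Z, C, V flags are set can be illustrated for a 32-bit addition (a behavioral sketch; the function name and the tuple encoding of the flags are assumptions):

```python
def add_with_flags(a, b):
    """32-bit add; returns the result and the (N, Z, C, V) condition flags."""
    MASK = 0xFFFFFFFF
    a &= MASK
    b &= MASK
    total = a + b
    r = total & MASK
    n = r >> 31                    # N: bit 31 of the result (negative)
    z = int(r == 0)                # Z: result is zero
    c = int(total > MASK)          # C: unsigned carry out of bit 31
    v = ((a ^ r) & (b ^ r)) >> 31  # V: signed overflow
    return r, (n, z, c, v)

# adding 1 to the largest positive signed value sets N and V
r, flags = add_with_flags(0x7FFFFFFF, 1)
```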
Instruction Execution: 3-Stage Pipeline ARM Organization
Fetch: the instruction is fetched from memory and placed in the instruction pipeline
Decode: the instruction is decoded and the datapath control signals prepared for the next cycle; in this stage the instruction "owns" the decode logic but not the datapath
Execute: the instruction "owns" the datapath; the register bank is read, an operand shifted, and the ALU result generated and written back into the destination register
[Figure: ARM7 core diagram]
[Figure: 3-stage pipeline – single-cycle instruction]
[Figure: 3-stage pipeline – multi-cycle instruction]
PC Behavior
R15 is incremented twice before an instruction executes, due to pipeline operation, so R15 = current instruction address + 8
The offset is +4 for Thumb instructions
To Get Higher Performance
Tprog = (Ninst × CPI) / fclk
Ninst – the number of instructions executed for the program – is constant
Increase the clock rate:
The clock rate is limited by the slowest pipeline stage, so decrease the logic complexity per stage and increase the pipeline depth
Improve the CPI:
Instructions that take more than one cycle are re-implemented to occupy fewer cycles, and pipeline stalls are reduced
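The formula can be evaluated directly; the operating point below (instruction count, CPI, clock rate) is assumed purely for illustration:

```python
def t_prog(n_inst, cpi, f_clk):
    """Program execution time: Tprog = (Ninst * CPI) / fclk."""
    return (n_inst * cpi) / f_clk

# 1 million instructions, CPI = 1.5, 100 MHz clock -> 0.015 s
base = t_prog(1_000_000, 1.5, 100e6)
# doubling the clock rate halves Tprog; halving the CPI does the same
```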
Typical Dynamic Instruction Usage

Instruction Type       Dynamic Usage
Data movement          43%
Control flow           23%
Arithmetic operations  15%
Comparisons            13%
Logical operations      5%
Other                   1%

Statistics for a print preview program in an ARM instruction emulator
Memory Bottleneck
Von Neumann bottleneck: a single instruction and data memory, limited by the available memory bandwidth
A 3-stage ARM core accesses memory on (almost) every clock cycle
Harvard architecture is used in higher-performance ARM cores
The 5-Stage Pipeline
Fetch: the instruction is fetched and placed in the instruction pipeline
Decode: the instruction is decoded and register operands are read from the register file
Execute: an operand is shifted and the ALU result generated; for load and store, the memory address is computed
Buffer/Data: data memory is accessed if required; otherwise the ALU result is simply buffered
Write-back: the results are written back to the register file
Data Forwarding
Read-after-write pipeline hazard: an instruction needs the result of one of its predecessors before that result has returned to the register file, e.g.:
ADD r1, r2, r3
ADD r4, r5, r1
Data forwarding is used to eliminate the stall
In the following case, even with forwarding, a pipeline stall cannot be avoided:
LDR rN, [..]    ; load rN from somewhere
ADD r2, r1, rN  ; and use it immediately
The processor cannot avoid a one-cycle stall
Data Hazards
Handling data hazards in software: encourage the compiler not to place a dependent instruction immediately after a load instruction
Side effects: when a location other than the one explicitly named in an instruction as the destination operand is affected
Addressing modes: complex addressing modes do not necessarily lead to faster execution, e.g.:
Load (X(R1)), R2
versus the equivalent sequence
Add #X, R1, R2
Load (R2), R2
Load (R2), R2
Complex addressing modes require more complex hardware to decode and execute, and cause the pipeline to stall
Pipelining-friendly features:
Access to an operand does not require more than one access to memory
Only load and store instructions access memory
The addressing modes used do not have side effects: register, register indirect, and index modes
Condition codes:
Flags are modified by as few instructions as possible
The compiler should be able to specify in which instructions of the program they are affected and in which they are not
Complex Addressing Mode
[Figure (a): pipeline timing for Load (X(R1)), R2 – the extended execute stage computes X + [R1], reads [X + [R1]], then reads [[X + [R1]]] before the write (W) stage; the next instruction's operand is forwarded]
Simple Addressing Mode
[Figure (b): pipeline timing for the equivalent sequence Add #X, R1, R2; Load (R2), R2; Load (R2), R2 – each instruction passes through F, D, E, W with no extended execute stage]
[Figure: ARM 5-stage pipeline]
Instruction Hazards – Overview
Whenever the stream of instructions supplied by the instruction fetch unit is interrupted, the pipeline stalls
Causes include cache misses and branches
Unconditional Branches
[Figure 8.8: an idle cycle caused by a branch instruction – I3, fetched after the branch I2, is discarded (X), leaving the execution unit idle for one cycle before fetching resumes at the target Ik]
Branch Timing
[Figure 8.9: branch timing in a 4-stage pipeline – (a) when the branch address is computed in the Execute stage, two fetched instructions (I3 and I4) are discarded; (b) when it is computed in the Decode stage, only one (I3) is discarded]
- Branch penalty
- Reducing the penalty
Instruction Queue and Prefetching
[Figure 8.10: an instruction fetch unit with an instruction queue inserted between the fetch (F) unit and the dispatch/decode (D) unit, followed by execute (E) and write results (W)]
Branch Timing with Instruction Queue
[Figure 8.11: branch timing in the presence of an instruction queue; the branch target address is computed in the D stage, the queue length varies (1, 1, 1, 1, 2, 3, 2, 1, 1), and the wrongly fetched instruction is discarded (X) without stalling the execution unit]
Branch Folding
Branch folding – executing the branch instruction concurrently with the execution of other instructions
Branch folding occurs only if, at the time a branch instruction is encountered, at least one instruction other than the branch is available in the queue
Therefore, it is desirable to arrange for the queue to be full most of the time, to ensure an adequate supply of instructions for processing
This can be achieved by increasing the rate at which the fetch unit reads instructions from the cache
Having an instruction queue is also beneficial in dealing with cache misses
Conditional Branches
A conditional branch instruction introduces the added hazard caused by the dependency of the branch condition on the result of a preceding instruction
The decision to branch cannot be made until the execution of that instruction has been completed
Branch instructions represent about 20% of the dynamic instruction count of most programs
Delayed Branch
The instructions in the delay slots are always fetched, so we would like to arrange for them to be fully executed whether or not the branch is taken
The objective is to place useful instructions in these slots
The effectiveness of the delayed-branch approach depends on how often it is possible to reorder instructions
[Figure 8.12: reordering of instructions for a delayed branch –
(a) original program loop:  LOOP: Shift_left R1; Decrement R2; Branch=0 LOOP; NEXT: Add R1,R3
(b) reordered instructions: LOOP: Decrement R2; Branch=0 LOOP; Shift_left R1; NEXT: Add R1,R3]
[Figure 8.13: execution timing showing the delay slot being filled during the last two passes through the loop of Figure 8.12 – while the branch is taken, each pass executes Decrement, Branch, Shift (delay slot); on the final pass the branch is not taken and Add follows the Shift in the delay slot]
Branch Prediction
Predict whether or not a particular branch will be taken
Simplest form: assume the branch will not take place and continue to fetch instructions in sequential address order
Until the branch is evaluated, instruction execution along the predicted path must be done on a speculative basis
Speculative execution: instructions are executed before the processor is certain that they are in the correct execution sequence
Care is needed so that no processor registers or memory locations are updated until it is confirmed that these instructions should indeed be executed
Incorrectly Predicted Branch
[Figure 8.14: timing when a branch decision has been incorrectly predicted as not taken – I1 (Compare) and I2 (Branch>0) proceed, the speculatively fetched I3 and I4 are discarded (X) once the branch condition resolves, and fetching restarts at Ik]
Better performance can be achieved if some branch instructions are predicted as taken and others as not taken:
Use hardware to observe whether the target address is lower or higher than that of the branch instruction
Let the compiler include a branch prediction bit
So far, the branch prediction decision is always the same every time a given instruction is executed – static branch prediction
Superscalar Operation
With a single pipeline, the maximum throughput is one instruction per clock cycle
With multiple processing units, more than one instruction can complete per cycle
Superscalar
[Figure 8.19: a processor with two execution units – the instruction fetch unit fills an instruction queue, a dispatch unit issues to a floating-point unit and an integer unit, and a W (write results) stage completes each instruction]
Timing
[Figure 8.20: instruction execution flow in the processor of Figure 8.19, assuming no hazards – I1 (Fadd) and I3 (Fsub) each occupy the floating-point unit for three cycles, while I2 (Add) and I4 (Sub) complete in a single cycle in the integer unit; all four write back by cycle 7]
ALU
Logic operations: OR, AND, XOR, NOT, NAND, NOR, etc. – no dependencies among bits, so every result bit can be calculated in parallel
Arithmetic operations: ADD, SUB, INC, DEC, MUL, DIV – involve a long carry-propagation chain, a major source of delay, and require optimization
Suitability of an algorithm is judged by resource usage (physical space on the silicon die) and turnaround time
[Figure: the original ARM1 ripple-carry adder circuit – one bit slice with inputs A, B, Cin and outputs sum, Cout]
[Figure: the ARM2 4-bit carry look-ahead scheme – a 4-bit adder logic block with inputs A[3:0], B[3:0], Cin[0] and outputs sum[3:0], Cout[3], plus group propagate (P) and generate (G) signals]
[Figure: the ARM2 ALU logic for one result bit – the NA and NB operand buses and the carry logic are steered by function-select lines fs5:0 (with G and P terms) to produce one bit of the ALU bus]
ARM2 ALU function codes

fs5 fs4 fs3 fs2 fs1 fs0   ALU output
 0   0   0   1   0   0    A and B
 0   0   1   0   0   0    A and not B
 0   0   1   0   0   1    A xor B
 0   1   1   0   0   1    A plus not B plus carry
 0   1   0   1   1   0    A plus B plus carry
 1   1   0   1   1   0    not A plus B plus carry
 0   0   0   0   0   0    A
 0   0   0   0   0   1    A or B
 0   0   0   1   0   1    B
 0   0   1   0   1   0    not B
 0   0   1   1   0   0    zero
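The function table can be modeled behaviorally as a lookup from the fs5..fs0 code to a 32-bit function (a sketch; Python's unbounded integers are masked to 32 bits, and the carry input is made explicit):

```python
MASK = 0xFFFFFFFF  # model 32-bit values

# fs5..fs0 -> function of (A, B, carry-in), per the table above
ALU_FUNCS = {
    (0, 0, 0, 1, 0, 0): lambda a, b, c: a & b,
    (0, 0, 1, 0, 0, 0): lambda a, b, c: a & ~b & MASK,
    (0, 0, 1, 0, 0, 1): lambda a, b, c: a ^ b,
    (0, 1, 1, 0, 0, 1): lambda a, b, c: (a + (~b & MASK) + c) & MASK,
    (0, 1, 0, 1, 1, 0): lambda a, b, c: (a + b + c) & MASK,
    (1, 1, 0, 1, 1, 0): lambda a, b, c: ((~a & MASK) + b + c) & MASK,
    (0, 0, 0, 0, 0, 0): lambda a, b, c: a,
    (0, 0, 0, 0, 0, 1): lambda a, b, c: a | b,
    (0, 0, 0, 1, 0, 1): lambda a, b, c: b,
    (0, 0, 1, 0, 1, 0): lambda a, b, c: ~b & MASK,
    (0, 0, 1, 1, 0, 0): lambda a, b, c: 0,
}

def alu(fs, a, b, carry_in=0):
    return ALU_FUNCS[fs](a & MASK, b & MASK, carry_in)

# "A plus not B plus carry" with carry = 1 is a subtraction: 7 - 5 = 2
```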
The ARM6 carry-select adder scheme
[Figure: the operands are split into blocks a,b[3:0] through a,b[31:28]; each block computes both s and s+1 in parallel, and multiplexers select sum[3:0], sum[7:4], sum[15:8], and sum[31:16] once the true carries are known]
Conditional Sum Adder
An extension of the carry-select adder
Carry-select adder levels: one level using k/2-bit adders, two levels using k/4-bit adders, three levels using k/8-bit adders, etc.
Assuming k is a power of two, the extreme case has log2(k) levels of 1-bit adders – this is a conditional sum adder
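The underlying carry-select step can be sketched in a few lines of Python: each block computes its sum for both possible carry-ins, and a multiplexer picks one when the real carry arrives (the width and block-size parameters are illustrative):

```python
def carry_select_add(a, b, width=32, block=4):
    """Add two unsigned values by blocks, selecting between s and s+1 per block."""
    bmask = (1 << block) - 1
    carry, result = 0, 0
    for shift in range(0, width, block):
        xa = (a >> shift) & bmask
        xb = (b >> shift) & bmask
        s0 = xa + xb                 # block sum assuming carry-in = 0
        s1 = xa + xb + 1             # block sum assuming carry-in = 1
        s = s1 if carry else s0      # mux: select with the actual carry
        result |= (s & bmask) << shift
        carry = s >> block           # carry out of this block
    return result & ((1 << width) - 1)
```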
[Figure: conditional sum adder – worked example]
[Figure: conditional sum adder – top-level block for one bit position]
The ARM6 ALU organization
[Figure: the A and B operand latches feed XOR gates (controlled by invert A / invert B), then an adder with zero detect and a block of logic functions; a result multiplexer selects between the logic and arithmetic results, and the N, Z, C, V flags are produced alongside]
The cross-bar switch barrel shifter principle
[Figure: a 4×4 cross-bar of switches connects in[3:0] to out[3:0]; each diagonal of switches implements one shift amount – no shift, right 1, right 2, right 3, left 1, left 2, left 3]
Shift Implementation
For a left or right shift, one diagonal is turned on
The shifter operates in negative logic; precharging sets all outputs to logic '0'
For rotate right, the right-shift diagonal is enabled together with the complementary left-shift diagonal
Arithmetic shift right uses sign extension rather than '0' fill
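The shifter's operations can be modeled behaviorally for 32-bit values (plain Python rather than a gate-level model; the mnemonics lsl/lsr/ror/asr follow ARM usage):

```python
W, MASK = 32, 0xFFFFFFFF

def lsl(x, n):  # logical shift left, '0' fill from the right
    return (x << n) & MASK

def lsr(x, n):  # logical shift right, '0' fill from the left
    return (x & MASK) >> n

def ror(x, n):  # rotate right: right shift OR'd with the complementary left shift
    n %= W
    return (((x & MASK) >> n) | (x << (W - n))) & MASK

def asr(x, n):  # arithmetic shift right: sign extension rather than '0' fill
    x &= MASK
    if x & 0x80000000:
        return ((x >> n) | (MASK << (W - n))) & MASK
    return x >> n
```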
Multiplier
ARM includes hardware support for integer multiplication
Older ARM cores include low-cost multiplication hardware supporting a 32-bit result multiply and multiply-accumulate
This uses the main datapath iteratively – the barrel shifter and ALU generate a 2-bit product in each cycle – employing a modified Booth's algorithm
Multiplication schemes: radix-2 multiplication, radix-4 multiplication, radix-2 Booth algorithm, radix-4 Booth algorithm
Modified Booth's Recoding

x(i+1) x(i) x(i-1)   y(i+1) y(i)   z(i/2)   Explanation
  0     0     0        0     0       0      No string of 1s in sight
  0     0     1        0     1       1      End of string of 1s
  0     1     0        0     1       1      Isolated 1
  0     1     1        1     0       2      End of string of 1s
  1     0     0       -1     0      -2      Beginning of string of 1s
  1     0     1       -1     1      -1      End a string, begin new one
  1     1     0        0    -1      -1      Beginning of string of 1s
  1     1     1        0     0       0      Continuation of string of 1s
Example: Modified Booth's Recoding

Operand x:             1 0 0 1 1 1 0 1 1 0 1 0 1 1 1 0
Recoded version y: (1) -1 0 1 0 0 -1 1 0 -1 1 -1 1 0 0 -1 0
Radix-4 version z: (1)  -2   2  -1   2   -1   -1   0   -2
Example: radix-4 Booth multiplication (a = 0110 = 6, x = 1010 = -6)
================================
a          0 1 1 0
x          1 0 1 0
z            -1  -2        (radix-4 recoded)
================================
p(0)       0 0 0 0 0 0
+z0·a      1 1 0 1 0 0     (-2a)
---------------------------------
4p(1)      1 1 0 1 0 0
p(1)     1 1 1 1 0 1 0 0
+z1·a    1 1 1 0 1 0       (-a)
---------------------------------
4p(2)    1 1 0 1 1 1 0 0
p(2)     1 1 0 1 1 1 0 0   (= -36)
================================
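The recoding can be checked with a short sketch that scans overlapping bit triples x(i+1), x(i), x(i-1) and emits one radix-4 digit in {-2, -1, 0, 1, 2} per bit pair (the function names are illustrative):

```python
# Triple (x[i+1], x[i], x[i-1]) -> radix-4 digit, per the recoding table
RECODE = {
    (0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 2,
    (1, 0, 0): -2, (1, 0, 1): -1, (1, 1, 0): -1, (1, 1, 1): 0,
}

def booth_radix4(x, bits):
    """Radix-4 Booth digits of a `bits`-wide two's-complement x, LSB digit first."""
    digits = []
    prev = 0                            # x[-1] is taken as 0
    for i in range(0, bits, 2):
        b0 = (x >> i) & 1
        b1 = (x >> (i + 1)) & 1
        digits.append(RECODE[(b1, b0, prev)])
        prev = b1
    return digits

def booth_value(digits):
    """Reconstruct the signed value encoded by the radix-4 digits."""
    return sum(d * 4 ** i for i, d in enumerate(digits))

# the multiplier x = 1010 recodes to digits -2, -1 (LSB first), i.e. the value -6
```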
[Figure: radix-4 multiplication – the product p of multiplier x and multiplicand a is built from the partial products (x3 x2)two · a · 4^1 and (x1 x0)two · a · 4^0]
High-Speed Multiplier
Recent cores have high-performance multiplication hardware supporting a 64-bit result multiply and multiply-accumulate
Multiplier: Carry-Save Addition
In multiplication, multiple partial products are added simultaneously using 2-operand adders
Time-consuming carry propagation must be repeated several times: k operands take k-1 propagations
Techniques for lowering this penalty exist – carry-save addition
Carry propagates only in the last step; the other steps generate a partial sum and a sequence of carries
A basic CSA accepts 3 n-bit operands and generates 2 n-bit results: an n-bit partial sum and an n-bit carry
A second CSA accepts these 2 sequences and another input operand, generating a new partial sum and carry
A CSA reduces the number of operands to be added from 3 to 2 without carry propagation
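The 3-to-2 compression can be demonstrated on whole words: XOR gives the partial sum, and the majority function, shifted one place left, gives the carry word (a behavioral sketch, not a gate-level model):

```python
def carry_save_add(x, y, z):
    """Compress three operands into a partial-sum word and a carry word."""
    s = x ^ y ^ z                              # per-bit sum, no propagation
    c = ((x & y) | (x & z) | (y & z)) << 1     # per-bit carries at their weight
    return s, c

x, y, z = 11, 6, 13
s, c = carry_save_add(x, y, z)
# a single final carry-propagate add resolves the result: s + c == 30
```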
CSA Basic Unit: the (3,2) Counter
The simplest implementation is a full adder (FA) with 3 inputs x, y, z: x + y + z = 2c + s (s, c – the sum and carry outputs)
The outputs are a weighted binary representation of the number of 1's among the inputs, so the FA is called a (3,2) counter
An n-bit CSA is n (3,2) counters in parallel, with no carry links between them
[Figure: (a) carry-propagate vs (b) carry-save arrangements of full adders (inputs A, B, Cin; outputs S, Cout) – in (a) each Cout feeds the next adder's Cin, forming a chain; in (b) the carry outputs are kept as a separate word]
Cascaded CSA for Four 4-bit Operands
The upper 2 levels are 4-bit CSAs; the 3rd level is a 4-bit carry-propagating adder (CPA)
Wallace Tree
A better organization for the CSAs – faster operation time
ARM high-speed multiplier organization
[Figure: registers Rs and Rm feed carry-save adders that consume 8 bits of Rs per cycle; the partial sum and partial carry registers are rotated 8 bits per cycle and initialized for MLA; the ALU adds the partials to produce the final result]