computer architecture m - unibo. · pdf filehad no full-custom previous experience • 32...


Upload: hoangkhanh

Post on 12-Mar-2018


Page 1: Computer architecture M - unibo. · PDF filehad no full-custom previous experience • 32 bit fixed length instructions ... 16 32-bit general-purpose ... The second operand goes through

ARM architecture

Computer architecture M

1


History

2

• Design software can be bought (Verilog) – soft core

• Acorn Computers: an English company, a Cambridge (UK) spin-off, which had developed an 8-bit microcomputer for the BBC based on the 6502 architecture (Synertek and Rockwell)

• In 1982 Acorn engineers looked for a new microprocessor for more sophisticated applications but decided against CISC solutions because they were too slow for the specific requirements and had too long an interrupt latency

• They decided to design a totally new architecture. At the same time the Berkeley RISC I and II and the Stanford MIPS (Microprocessor without Interlocked Pipeline Stages) designs appeared, and they decided to follow that philosophy

• ARM (Advanced RISC Machine), whose three-stage pipeline is still in use today

• ARM has been a true industry since 1990 and a «brand» with multiple implementations; it is used by many processor companies (Intel too) in multiple environments, in tailored versions (Intellectual Property, IP, cores)


ARM

3

• T: Thumb
• D: On-chip debug support
• M: Enhanced multiplier
• I: Embedded ICE hardware
• T2: Thumb-2
• S: Synthesizable code
• E: Enhanced DSP instruction set
• J: Java support
• Z: TrustZone
• F: Floating point unit
• H: Handshake, clockless design for synchronous or asynchronous design


ARM - basic concepts

• ARM is a family of RISC processors conceptually similar to DLX

• There are several versions, from very simple to very sophisticated ones

• Multiple environments (e.g. mobile phones)

Example products: Apple iPod Photo and iPod Video 5th gen (2x, @80 MHz), Roomba 500, Lego Mindstorms


ARM – first version

• LOAD/STORE architecture, very simple since the designers had no previous full-custom experience

• 32-bit fixed-length instructions

• Three-address instructions, RISC type (with some CISC-type exceptions)

• Single-cycle instructions, but potentially multiple-cycle since no Harvard architecture is implemented. When more than a single memory access is required (e.g. a LOAD) the extra cycles are used for useful micro-operations (e.g. address autoindexing)

• Fixed register bank. In addition to the programmer-visible registers there are the machine registers


Three-stage ARM

• 16 32-bit general-purpose registers (r0 - r15)

• Three-port register bank (two ports for reading and one for writing). An additional port to read and write register 15 (PC)

• N-position barrel shifter

• 32-bit ALU

• The address register is provided with an incrementer (for sequential accesses). In practice it is a programmable counter

• Two buffer registers for data to and from the memory (invisible to the programmer). Single-bank memory

• Instruction decoder and control logic

• Status register (CPSR)

• Two interrupts: fast and standard


ARM register set

7

Registers available in each mode (a banked register replaces the user-mode register with the same number):

User mode:     r0 - r12, r13 (MSP), r14 (LR), r15 (PC), CPSR
fiq mode:      r8_fiq - r14_fiq, SPSR_fiq
svc mode:      r13_svc, r14_svc, SPSR_svc
abort mode:    r13_abt, r14_abt, SPSR_abt
irq mode:      r13_irq, r14_irq, SPSR_irq
undef. mode:   r13_und, r14_und, SPSR_und

The banked registers and the SPSRs are usable in system (privileged) modes only

CPSR: Current Program Status Register
SPSR: Saved Processor Status Register
MSP: Master Stack Pointer
LR: Link Register (return register for subroutines)

fiq: fast interrupt
svc: software interrupt
abt: memory faults (abort)
irq: standard interrupt
und: undefined instructions

r0 - r7 are common to user and system mode


Current Program Status Register: CPSR (similar to a flag register)

Bits    31  30  29  28   27..8    7  6  5  4..0
Field   N   Z   C   V    Unused   I  F  T  Mode

Condition codes: N negative, Z zero, C carry, V overflow

I, F: interrupt masks
T: Thumb instruction set

CPSR[4:0]   Mode     Use                                          Register set
10000       User     Normal user code                             user
10001       FIQ      Processing fast interrupts                   fiq
10010       IRQ      Processing standard interrupts               irq
10011       SVC      Processing software interrupts               svc
10111       Abort    Processing memory faults                     abt
11011       Undef    Handling undefined instruction traps         und
11111       System   Running privileged operating system tasks    user

NB Thumb instruction set: highly encoded instructions to save memory

Each privileged mode has a Saved Program Status Register (SPSR), where the current CPSR is saved, and its own r14 (Link Register)
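As a rough illustration of the CPSR layout described above, the following Python sketch splits a 32-bit CPSR value into its fields. The function name `decode_cpsr` and the dictionary format are hypothetical; only the bit positions and mode encodings come from the slide.

```python
# Mode encodings copied from the CPSR[4:0] table on this slide.
MODES = {
    0b10000: "User",
    0b10001: "FIQ",
    0b10010: "IRQ",
    0b10011: "SVC",
    0b10111: "Abort",
    0b11011: "Undef",
    0b11111: "System",
}


def decode_cpsr(cpsr):
    """Split a 32-bit CPSR value into named fields (hypothetical helper)."""
    return {
        "N": (cpsr >> 31) & 1,  # negative
        "Z": (cpsr >> 30) & 1,  # zero
        "C": (cpsr >> 29) & 1,  # carry
        "V": (cpsr >> 28) & 1,  # overflow
        "I": (cpsr >> 7) & 1,   # IRQ mask
        "F": (cpsr >> 6) & 1,   # FIQ mask
        "T": (cpsr >> 5) & 1,   # Thumb instruction set
        "mode": MODES.get(cpsr & 0x1F, "reserved"),
    }
```

For example, `decode_cpsr(0x600000D3)` reports Z and C set, both interrupt masks set, and SVC mode.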


Exceptions

9

When an exception occurs:

1) The corresponding mode is activated
2) The PC (r15) is saved in r14 (link register) of the new mode
3) The old CPSR is saved in the SPSR of the new mode
4) IRQ is disabled by setting bit 7 of the CPSR and, if the exception is a Fast Interrupt, bit 6 of the CPSR is set too
5) The PC assumes the value given in the following table (fixed addresses)

Exception                                 Mode    Address
Reset                                     SVC     00000000
Undef. Instr.                             UND     00000004
Soft. Int. (SWI)                          SVC     00000008
Prefetch Abort (Instr. Mem. Fault)        Abort   0000000C
Data Abort (Data Mem. Fault)              Abort   00000010
IRQ                                       IRQ     00000018
FIQ                                       FIQ     0000001C

00000014 cannot be used (old ARMs compatibility)
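The five steps listed on this slide can be sketched in Python as follows. This is only a toy model under stated assumptions: the `state` dictionary, the exception names used as keys and the function `take_exception` are hypothetical, and the sketch follows only what the slide lists (a real core does more, for example return-address adjustments).

```python
# Mode bits from the CPSR slide, vector addresses from the exception table.
MODE_BITS = {"FIQ": 0b10001, "IRQ": 0b10010, "SVC": 0b10011,
             "Abort": 0b10111, "Undef": 0b11011}
VECTORS = {
    "Reset": ("SVC", 0x00),
    "Undef": ("Undef", 0x04),
    "SWI": ("SVC", 0x08),
    "PrefetchAbort": ("Abort", 0x0C),
    "DataAbort": ("Abort", 0x10),
    "IRQ": ("IRQ", 0x18),
    "FIQ": ("FIQ", 0x1C),
}


def take_exception(state, name):
    """Apply the slide's five exception-entry steps to a toy CPU state."""
    mode, vector = VECTORS[name]
    banked = state["banked"].setdefault(mode, {})
    banked["r14"] = state["pc"]      # step 2: PC saved in r14 of the new mode
    banked["spsr"] = state["cpsr"]   # step 3: old CPSR saved in the SPSR
    cpsr = (state["cpsr"] & ~0x1F) | MODE_BITS[mode]  # step 1: new mode
    cpsr |= 1 << 7                   # step 4: IRQ disabled (bit 7)
    if name == "FIQ":
        cpsr |= 1 << 6               # step 4: FIQ also sets bit 6
    state["cpsr"] = cpsr
    state["pc"] = vector             # step 5: fixed vector address
    return state
```

A state such as `{"pc": 0x8000, "cpsr": 0x10, "banked": {}}` taking an `"IRQ"` ends with the PC at 0x18 and the old PC and CPSR stored in the IRQ bank.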


ARM three stages pipeline

10

Read ReadWrite


Organisation

11

• No forwarding unit (not needed because the following instruction finds the updated value already in the RF: three-stage pipeline, see next slide)

• Register bank (= Register File): two read ports and one write port, as in DLX, for the normal data traffic, plus two accesses (read and write) for r15 (PC)

• The memory address register has an incrementer for sequential accesses, which is used for incrementing the PC too

• One extra register for 32-bit multiplication (when multiplying 32-bit data the result can be longer than 32 bits)

• Two transit registers for the memory (Data-in and Data-out; no Harvard architecture initially)


Pipeline timing (time advances to the right):

Instruction       cycle 1   cycle 2   cycle 3   cycle 4   cycle 5
sub r2,r3,r6      fetch     decode    exec
cmp r2,#3                   fetch     decode    exec
add r3,r1,r5                          fetch     decode    exec

NB: an instruction that needs a datum in its exec stage finds it already available in the RF. No forwarding unit!

Three-stage pipeline: single-cycle instructions

• A single-cycle instruction accesses two operands during the execute stage; the datum on bus B is shifted (if required) and combined in the ALU with the datum on bus A. The result is written back into the register bank. The PC is incremented by the incrementer and the result is stored back in r15 AND in the address register for the next instruction access

• Fetch stage: the instruction is read from the memory into the data-in register for decoding

• Decode stage: the instruction in the data-in register is decoded (it does not use the datapath); in the meantime the next instruction is read from the memory and is «clocked» at the end of the fetch stage

• Exec stage: the instruction uses the datapath. In an arithmetic instruction two operands are read; the one on bus B is shifted (combinatorially) if needed and combined with the datum on bus A. The result is written back into the RF in the same clock period
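The execute stage described on this slide can be modelled with a few lines of Python. This is a minimal sketch, assuming a 16-entry register list with r15 as PC; the names `lsl` and `execute_reg_reg` are hypothetical, and only a left shift is modelled for the bus B operand.

```python
MASK32 = 0xFFFFFFFF


def lsl(value, amount):
    """Logical shift left on a 32-bit value (one barrel shifter function)."""
    return (value << amount) & MASK32


def execute_reg_reg(regs, rd, rn, rm, op, shift=0):
    """One execute cycle: Rd <= Rn op shift(Rm), then the PC moves on by 4."""
    bus_a = regs[rn]                 # first operand, read on bus A
    bus_b = lsl(regs[rm], shift)     # second operand, shifted on bus B
    regs[rd] = op(bus_a, bus_b) & MASK32   # result written back to the RF
    regs[15] = (regs[15] + 4) & MASK32     # incrementer updates r15 (PC)
    return regs
```

With `regs[3] = 10` and `regs[6] = 3`, calling `execute_reg_reg(regs, 2, 3, 6, lambda a, b: a - b)` models `sub r2,r3,r6`.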


Three-stage pipeline: multi-cycle instructions

• Since the PC (r15) is incremented in the first stage, the programmer must be aware that it has already been incremented twice (two instructions, 8 bytes) if it is used in the exec stage

• Multi-cycle instructions are executed more irregularly. In this example an ADD is followed by a STORE and three ADDs

• The greyed stages are those where the memory is accessed

• The datapath is used by the STORE for the address computation

Pipeline timing (time advances to the right):

Instruction   Stages
1  ADD        fetch, decode, execute
2  STR        fetch, decode, calc. addr., data xfer
3  ADD        fetch, decode (delayed: decoder busy), execute
4  ADD        fetch, decode, execute
5  ADD        fetch (delayed: memory busy), decode, execute

Decoder busy: the address computation prevents the decoding because the registers towards the ALU cannot be opened (bubble)

Memory busy: no fetch here because the memory is busy with the write (memory store) (bubble)


Three-stage pipeline: multi-cycle instructions

Pipeline timing (time advances to the right):

Instruction           Stages
ldmia r0!,{r2,r3}     fetch, decode, ex ld r2, ex ld r3
sub r2,r3,r6          fetch, decode, ex sub
cmp r2,#3             fetch, decode, ex cmp

LDMIA -> load multiple registers, increment address

• This instruction loads two registers (in this case r2 and r3) with data starting from the address in r0 (in this case). No address computation is needed (the value is already present in r0). The address is incremented by 4 at each load (incrementer)

«Register based» load with autoincrement
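The behaviour of `ldmia r0!,{r2,r3}` described above can be sketched in Python. This is an illustrative model only: `ldmia` is a hypothetical helper, memory is a dictionary keyed by byte address, and the rule that the lowest-numbered register receives the lowest address is taken from the ARM instruction definition rather than from the slide.

```python
def ldmia(regs, memory, base, reg_list, writeback=True):
    """Model of LDMIA: load registers from consecutive words at regs[base]."""
    address = regs[base]             # no address computation: value is in Rn
    for r in sorted(reg_list):       # lowest register gets the lowest address
        regs[r] = memory[address]
        address += 4                 # incrementer: +4 after each load
    if writeback:                    # the "!" suffix writes the final address back
        regs[base] = address
    return regs
```

For `ldmia r0!,{r2,r3}` with r0 = 0x100, r2 and r3 receive the words at 0x100 and 0x104, and r0 ends at 0x108.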


Branch

[Figure: pipeline timing of a taken branch; the decision is taken on the third clock. Recoverable rows:]

Instruction           Stages
bne foo               fetch, decode, execute, linkret, adjust
sub r2,r3,r6          fetch, decode
add r13,r14,r2        fetch, decode, ex add
foo add r0,r1,r2      fetch, decode, ex add

The branch can be with return; in that case the PC value is saved in the linkret stage. The adjust stage corrects the saved value, which has already been incremented by 8


Register/Register instructions: datapath

Reg-Reg instruction:
Rd <= Rn op Rm
R15 (PC) <= AR + 4
AR <= AR + 4

AR: Address Register

The PC value is incremented by 4; the same incremented value is stored in the AR

[Figure: datapath with address register, incrementer, register bank (Rd, Rn, Rm, PC/r15), multiplier, barrel shifter and ALU (both set as per instruction), data out, data in, instruction pipe]


Register/Immediate instructions: datapath

Reg-Imm instruction:
Rd <= Rn op Imm
R15 (PC) <= AR + 4
AR <= AR + 4

In this case the operand is in the instruction (bits [7:0])

[Figure: datapath with address register, incrementer, register bank (Rd, Rn, PC/r15), multiplier, shifter and ALU (both set as per instruction), data out, data in, instruction pipe]


Store instruction: datapath

(a) 1st cycle: the STORE address is computed and stored in the AR. In the meantime r15 (PC) is incremented and the value is stored ONLY in the RF, for the next instruction

Compute address:
AR <= Rn op Disp
R15 (PC) <= AR + 4

(b) 2nd cycle: r15 (PC) is copied into the AR while the datum is written into the memory, and an autoincrement (if required) is executed. If only a single byte must be written, the lowest byte of the word is replicated 4 times in the output register.

Store data:
AR <= R15 (PC)
mem[AR] <= Rd
If autoindexing: Rn <= Rn +/- 4

[Figure: datapath in the two cycles. 1st cycle: address register, increment, registers (Rn, PC), shifter set to lsl #0, ALU = A / A + B / A - B, displacement from instruction bits [11:0], data out, data in, instruction pipe. 2nd cycle: address register, increment, registers (Rn, Rd, PC/r15), shifter, ALU = A + B / A - B, byte replication ("byte?") on data out, data in, instruction pipe]


Branch: datapath

(a) 1st cycle: compute branch target

Compute target address:
AR <= PC + Disp

The displacement is shifted 2 positions left (#2 on the shifter) and added to the PC (ALU = A + B)

(b) 2nd cycle: save return address (if required)

r14 <= PC (R15/PC value to save)
AR <= AR (PC) + 4
R15 <= AR (PC) + 4

[Figure: datapath in the two cycles. 1st cycle: address register, increment, registers (PC/r15), shifter set to #2, ALU = A + B, data out, data in, instruction pipe. 2nd cycle: address register, increment (target PC increment), registers (R14, PC/r15), shifter, ALU = A, data out, data in, instruction pipe]
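The target computation `AR <= PC + Disp` can be illustrated with a short Python sketch. Two facts are taken from the ARM instruction encoding rather than from the slides and should be read as assumptions here: the displacement is a signed 24-bit word offset, and the PC seen by the instruction is its own address plus 8. The function names are hypothetical.

```python
def sign_extend_24(value):
    """Interpret the low 24 bits of value as a signed two's complement number."""
    value &= 0xFFFFFF
    return value - (1 << 24) if value & 0x800000 else value


def branch_target(instr_addr, offset24):
    """Target = PC + (offset << 2), where PC is already instr_addr + 8."""
    pc = instr_addr + 8                      # PC incremented twice by then
    return (pc + (sign_extend_24(offset24) << 2)) & 0xFFFFFFFF


def link_value(instr_addr):
    """Return address for a branch with link: the next instruction."""
    return instr_addr + 4
```

For instance, a branch at address 0x1000 with offset field 0xFFFFFE (that is, -2 words) branches back to 0x1000 itself.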


Pipeline clock

• ARMs don't use edge-sensitive flip-flops (D flip-flops); they are based on a two-phase non-overlapping clock internally derived from the processor clock

• Data transfer is achieved by loading the data alternately into the latches

[Figure: one clock cycle, split into phase 1 and phase 2]


Datapath timing

• The register read buses are dynamic and precharged in phase 2. In this case “dynamic” means that sometimes they are not driven: they maintain their values and look “pseudo-static”

• In phase 1 the selected registers enable their drivers onto the read buses, which present valid data from the start of phase 1

• The second operand goes through the barrel shifter and is therefore available with a small delay

• The ALU has input registers which are enabled during phase 1

• The ALU processes the operands during phase 2: the result is sampled in the destination register at the end of phase 2


Timing diagram

Delay = register read time + shifter delay + ALU delay + register write setup time + phase non-overlap delay

[Figure: timing diagram over phase 1 and phase 2. Labels: register read time, read bus valid, shift time, shift out valid, ALU operands latched, ALU time, ALU out, register write time, precharge invalidates buses]


ALU ARM

The value range in 2's complement of the ARM 32-bit registers goes from -2^31 (0x80000000) to +2^31 - 1 (0x7FFFFFFF). In case of saturation, i.e. overflow (out-of-range values), an automatic correction is performed: if the value is greater than +2^31 - 1 the result becomes +2^31 - 1; if it is smaller than -2^31 the result becomes -2^31.

Since integrating the logic and arithmetic functions is cumbersome, two different circuits were designed, plus a MUX

[Figure: ALU block diagram. A operand latch and B operand latch feed XOR gates (invert A, invert B), which feed the logic functions block and the adder (with C in); the result mux selects logic/arithmetic according to the function and produces the result; zero detect generates Z, and the N, V and C flags are also produced]
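The saturation rule stated on this slide can be checked with a small Python sketch. The helper names are hypothetical; Python integers are unbounded, so the exact sum is computed first and then clamped to the 32-bit signed range.

```python
INT32_MAX = 2**31 - 1   # 0x7FFFFFFF
INT32_MIN = -(2**31)    # bit pattern 0x80000000


def saturate32(value):
    """Clamp an arbitrary integer to the signed 32-bit range."""
    return max(INT32_MIN, min(INT32_MAX, value))


def saturating_add(a, b):
    """Add two signed 32-bit values, saturating instead of wrapping around."""
    return saturate32(a + b)
```

Adding 1 to `INT32_MAX` therefore returns `INT32_MAX` rather than wrapping to a negative value.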


Control logic

[Figure: control logic block diagram. The instruction enters the decode PLA; together with the cycle count, multiply control and load/store multiple blocks it produces the control signals for the subsystems: address control, register control, ALU control, shifter control]

Note the two subsystems which handle the multiplication and the multiple LOAD and STORE instructions.


Memories

Data: 32 bit. Addresses: 32 bit

The notation A(K+2:2) indicates that the address lines of a device with K address lines are connected to the ARM address lines shifted two positions to the right (ARM address bits 0 and 1 are NOT emitted). Notice that in ARM notation the least significant bit is on the right
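The two-position shift between ARM byte addresses and device word addresses can be sketched as follows. The helper `device_address` and its parameter `k` (the number of device address lines) are hypothetical names used only for illustration.

```python
def device_address(arm_address, k):
    """Word address seen by a 32-bit wide device with k address lines.

    ARM address bits 0 and 1 select a byte inside the word and are not
    sent to the device, so the device sees the address shifted right by 2.
    """
    return (arm_address >> 2) & ((1 << k) - 1)
```

Byte addresses 0x1004 to 0x1007 all map to the same device word, 0x401, when `k` is 12.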


Memories

mas[1:0] -> access parallelism (width) control

mas[0]   mas[1]   Access
0        1        word access
1        0        half-word access (selection depending on A[1])
0        0        byte access (selection depending on A[0] and A[1])

A Wait/Ready signal is available

A[31] selects either ROM or RAM, and r/!w selects the access type

The I/O is memory mapped

[Figure labels: BusSel and WR]


ARM Bus

There are three types of buses defined by ARM

• Advanced High-performance Bus (AHB). It is a protocol based on a single bus. Addressing and transfer are overlapped for maximum bandwidth, and burst mode is supported

• Advanced eXtensible Interface (AXI). It is a protocol where data and addresses use different channels, both for reading and for writing. Addressing and transfer overlap, and burst mode is supported.

• Advanced Peripheral Bus (APB), for interfacing low-complexity peripherals

Normally each ARM microcontroller incorporates an AHB or an AXI together with an APB.


ARM Bus

29

AHB or AXI


AHB bus generic structure

[Figure: AHB structure. Masters #1 to #3 drive HADDR and HWDATA and receive HRDATA; an arbiter selects the master whose address/control and write data reach the slaves; a decoder selects which of slaves #1 to #4 returns the read data (HRDATA)]


AHB Bus

• Synchronous bus which supports 32, 64 and 128 bit transfers. 32-bit addresses and burst transfers (multiple transfers with incremented addresses)

• Separate address and data buses

• Up to 16 arbitrated bus masters

• The data transfer to an address is overlapped with the emission of the next address (maximum bandwidth exploitation). The arbitration takes place during the current transfer

• The bus master can «lock» the bus for atomic transfers (e.g. semaphores). «Split» transactions are allowed, where a slave defers the acknowledge to the master. The slave stores the master request; the master, when it gains the bus again, repeats the request (and the slave hopefully is ready to answer). Only a single pending split transaction is allowed

• Burst transfers can be directed to a fixed address (e.g. a FIFO) or to an automatically incremented address. A burst cannot cross a 1 KB address boundary
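The 1 KB rule for incrementing bursts can be checked with a short Python sketch. The function name and its parameters are hypothetical; the check simply compares the 1 KB block of the first and of the last byte touched by the burst, and does not apply to fixed-address bursts.

```python
def crosses_1kb(start, beats, bytes_per_beat):
    """True if an incrementing burst would cross a 1 KB address boundary."""
    last = start + beats * bytes_per_beat - 1   # last byte address touched
    return (start >> 10) != (last >> 10)        # compare 1 KB block numbers
```

A 4-beat burst of 32-bit words starting at 0x3F0 stays inside the first 1 KB block, while the same burst starting at 0x3F8 would cross into the next block and would have to be split.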


Topologies

32


Multilayer structure

33


Typical Multilayer

34

Periph#1

Periph#2

Periph#3


AHB Bus - Master

• An AHB master (e.g. an ARM processor)

• HBUSREQx is the request to the arbiter. The transfer starts when the master receives the signal HGRANTx: it activates the address and control signals, which are received by the slaves, which in turn decode them. The master is unaware of how many slaves are present. For instance there can be three slaves, each one controlling 24 MB of memory, or two slaves, each one controlling 36 MB of memory

• The meaning of the other signals is explained in the next slides


AHB Bus - Timing

36

A read or write transfer with wait states

Write

Read


AHB Bus - Timing

37

Multiple transfers with and without wait periods. Note the overlap between a data transfer and the addressing of the next datum


AHB Bus - Topology

There is a single address bus, used in turn by the selected masters


AHB bus - Arbiter

This figure shows an AHB arbiter. Nothing is said about the arbitration policy; normally it is a round-robin scheme. The signal HGRANTx indicates which master the next access is granted to. HMASTER[3:0] indicates which master is presently controlling the bus
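Since the slide mentions round robin as the usual policy, here is a minimal Python sketch of such an arbiter. It is not taken from the AHB specification, which leaves the policy open; the function name and the list-of-booleans request format are assumptions made for the example.

```python
def round_robin_grant(requests, last_granted):
    """Pick the next requesting master after last_granted, wrapping around.

    requests[i] is True when master i asserts its bus request.
    Returns the index of the master to grant, or None if nobody requests.
    """
    n = len(requests)
    for step in range(1, n + 1):
        candidate = (last_granted + step) % n
        if requests[candidate]:
            return candidate
    return None
```

With masters 0 and 2 requesting and master 0 granted last, the grant goes to master 2; on the next round it returns to master 0.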


High performance AXI bus

• High bandwidth and low latency

• Backward compatible with AHB

• Separate address/control and transfer phases

• Separate read and write channels

• Transfer parallelism from 1 to 128 bytes (lanes) with byte enables

• Burst transactions with a single initial address. A signal indicates the end of the transfer

• Each transaction carries address and controls


AXI read and write

42


ARM caches

• The majority of the processors of the ARM family use a virtually addressed cache (iAPX caches are physically addressed)

• Advantage: the cache access is performed in parallel with the virtual address translation (faster access)

• Disadvantages:

• The cache must be flushed at each context switch

• Possible data sharing must take place outside the cache (the address translation is different for different processors)


ARM MMU

• Virtual memory is mandatory since caches are virtually addressed

• The MMU depends on the processor. Here the characteristics of the ARM 7

• The virtual memory is based on page tables that are not on chip (they are in memory, as is the case with iAPX, i.e. x86 systems)

• Page size can be 1 MB (called sections, with a single-level page table) or 64 KB or 4 KB (pages, with two-level page tables)

• Internal TLB

• Memory is protected by up to 16 domains. A different policy can be defined for each of them (e.g. cacheable or non-cacheable)
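To make the page sizes concrete, the sketch below splits a 32-bit virtual address for a 1 MB section and for a 4 KB page. The exact bit fields (12-bit first-level index, 8-bit second-level index) are an assumption based on the classic ARM two-level translation scheme and are not stated on the slide; the function names are hypothetical.

```python
def split_section(va):
    """1 MB section: 12-bit first-level index, 20-bit offset."""
    return (va >> 20) & 0xFFF, va & 0xFFFFF


def split_small_page(va):
    """4 KB page: 12-bit first-level index, 8-bit second-level index,
    12-bit offset inside the page."""
    return (va >> 20) & 0xFFF, (va >> 12) & 0xFF, va & 0xFFF
```

For the address 0x12345678, the section view gives index 0x123 and offset 0x45678, while the 4 KB view gives indices 0x123 and 0x45 and offset 0x678.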


ARM9TDMI

55

• Harvard architecture

• 5 stages pipeline

• Increased clock frequency


Strong ARM (ARM9TDMI): 5 stages (DLX with cache)

56


ARM7TDMI vs ARM9TDMI

57

Increased number of stages for clock frequency increase


Pipeline ARM9TDMI

[Figure: pipeline comparison. ARM7TDMI: Fetch, Decode, Execute. ARM9TDMI: five stages; recoverable label: Decode / Reg. Read]

Process        0.25 um    Transistors   110,000         MIPS     220
Metal layers   3          Core area     2.1 mm^2        Power    150 mW
Vdd            2.5 V      Clock         0 to 200 MHz    MIPS/W   1500


Stages dynamic (as DLX)

Fetch

Decode: decodes the instruction and reads the registers (three read ports)

Execute: an operand is shifted (if needed) and the ALU result is available, or the address is computed

Buffer/data: memory access (load, store)

Write-back: register write


Forwarding

60

Forwarding paths


Further ARM information

61

From here onward other information about ARM (only for cultural purposes NOT for the exam)


Unavoidable stalls (as DLX)

Timing without stall (not possible):

Cycle            1    2    3    4       5       6     7
LDR R1,@(R2)     IF   ID   EX   MEM     WB
SUB R4,R1,R5          IF   ID   EXsub   MEM     WB
AND R6,R1,R7               IF   ID      EXand   MEM   WB
OR  R8,R1,R9                    IF      ID      EX    MEM

Timing with stall:

Cycle            1    2    3    4       5       6     7     8     9
LDR R1,@(R2)     IF   ID   EX   MEM     WB
SUB R4,R1,R5          IF   ID   stall   EXsub   MEM   WB
AND R6,R1,R7               IF   stall   ID      EX    MEM   WB
OR  R8,R1,R9                    stall   IF      ID    EX    MEM   WB

Not possible: R1 is still being read from memory when it is required by the SUB => STALL
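The load-use stall shown in the tables can be counted with a small Python sketch. It assumes a classic 5-stage pipeline with full forwarding, where only an instruction immediately following a load and reading its destination costs one bubble; the tuple format and function names are hypothetical.

```python
def load_use_stalls(program):
    """Count one-cycle stalls caused by a load followed by a dependent use.

    Each instruction is a tuple (opcode, destination, list_of_sources).
    """
    stalls = 0
    for previous, current in zip(program, program[1:]):
        if previous[0] == "LDR" and previous[1] in current[2]:
            stalls += 1
    return stalls


def total_cycles(program, stages=5):
    """Cycles to drain the pipeline: fill time plus one cycle per stall."""
    return len(program) + (stages - 1) + load_use_stalls(program)
```

For the four instructions on this slide the sketch reports one stall and nine cycles in total, which matches the second table.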


LDR interlock (bubble)

(Interlock => bubble). LDR is followed by an instruction which requires its result

LDR R4, [R7] ; R4 := MEM32[R7]

EOR: Exclusive OR

[Figure: pipeline diagram of the interlock; recoverable labels: unused stages, MEM stage]


ARM architectures

64


Performance

65


ARM10TDMI

66

Reg. Read Decode

• 64 bit memory: two registers transfer in a cycle

• Clock 300 MHz

• CMOS 250 nm

• Performance: 4 times ARM7TDMI

• Branch prediction

• Non-blocking Load and Store (queue)

• 6 stages pipeline


ARM 11

• OOO (out-of-order) execution for the three pipelines


8 stages ARM

• 8-stage pipeline

• Pipeline parallelism: ALU/MAC, LSU

• Load and Store don't block the pipeline

• OOO execution

• Data forwarding

• Static and dynamic branch prediction

• Non-blocking cache access


ARM11 MPCore

Highly configurable:

• Up to 4 processors

• Up to 255 interrupt sources

• Configurable cache, 16K-64K for each processor. MESI

• Single or double 64-bit AXI bus

• Optional vectored floating point


ARM11 MPCore

72


Comparison

Feature                       ARM9E™             ARM10E™            Intel® XScale™     ARM11™
Architecture                  ARMv5TE(J)         ARMv5TE(J)         ARMv5TE            ARMv6
Pipeline length               5                  6                  7                  8
Java decode                   (ARM926EJ)         (ARM1026EJ)        No                 Yes
V6 SIMD instructions          No                 No                 No                 Yes
MIA instructions              No                 No                 Yes                Coprocessor
Branch prediction             No                 Static             Dynamic            Dynamic
Independent load-store unit   No                 Yes                Yes                Yes
Instruction issue             Scalar, in-order   Scalar, in-order   Scalar, in-order   Scalar, in-order
Concurrency                   None               ALU/MAC, LSU       ALU, MAC, LSU      ALU/MAC, LSU
Out-of-order completion       No                 Yes                Yes                Yes
Target implementation         Synthesizable      Synthesizable      Custom chip        Synthesizable and hard macro


ARM Cortex family (V7)

[Figure: chart of the Cortex family, ranging from 12k gates to 2.5 GHz. Recoverable labels: Cortex-M0, Cortex-M1, Cortex-M3, Cortex-M4, SC300, Cortex-R4, "1-2R Heron", Cortex-A5 (x1-4), Cortex-A8, Cortex-A9 (x1-4), Cortex-A15 (x1-4)]

• ARM Cortex-M family (v7-M): microcontrollers for SoC

• ARM Cortex-A family (v7-A): general-purpose processors; application processors for full OS and 3rd party applications

• ARM Cortex-R family (v7-R): embedded processors for real-time and control signal processing

Page 61: ARM Cortex performance

Page 62: ARM Cortex M3 Pipeline

[Figure: three-stage Cortex-M3 pipeline with branch forwarding and speculation]

• 1st stage, Fetch: fetch (prefetch)

• 2nd stage, Decode: instruction decode and register read, AGU, branch

• 3rd stage, Execute: address phase and write back, data phase load/store and branch, multiply and divide, shift, ALU and branch, write

• Execute stage branch (ALU branch and load/store branch)

Page 63: ARM Cortex M3 Datapath

[Figure: Cortex-M3 datapath. Blocks: register bank, Mul/Div, barrel shifter, ALU (operands A and B), writeback, instruction decode, read data register, write data register, address registers and address incrementers. Signals: INTADDR, I_HADDR, I_HRDATA, D_HADDR, D_HWDATA, D_HRDATA]

• Three-stage pipeline similar to that of the ARM7

• This diagram refers to the internal core, which therefore has separate I and D ports. The memory access takes place outside the core

Page 64: ARM Bit Banding

Traditional bit manipulation (RAM byte at 0x02000000)

• RAM byte read: 0 0 0 0 0 0 0 0

• Mask and bit modification: x x x x x 1 x x

• RAM writeback: 0 0 0 0 0 1 0 0

Page 65: ARM Bit Banding

• Register bit 0 is written into the bit

• A write to a bit-band address affects only one bit. The M3 has two 32MB alias regions that map onto the two 1MB bit-band regions. The two regions are separate, one in the SRAM region and one in the peripheral region. Each bit in the bit-band region is addressed sequentially in the 32MB alias region. For example, the eighth bit in the bit-band region can be accessed using the eighth word in the 32MB alias region.

• The write is transformed into an atomic read-modify-write
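The alias mapping described on this slide can be sketched as plain address arithmetic. The sketch below is illustrative only: the function name `bitband_alias` is hypothetical, and the base addresses (0x20000000/0x22000000 for SRAM, 0x40000000/0x42000000 for peripherals) are assumed from the standard Cortex-M3 memory map rather than taken from this slide. Each bit of the 1MB bit-band region gets one 32-bit word in the alias region, so a byte occupies 32 bytes of alias space and a bit occupies 4.

```python
# Assumed Cortex-M3 base addresses (not stated on the slide).
BITBAND_REGION_SIZE = 0x00100000  # each bit-band region is 1MB
BITBAND_REGIONS = (
    (0x20000000, 0x22000000),  # SRAM bit-band base, SRAM alias base
    (0x40000000, 0x42000000),  # peripheral bit-band base, peripheral alias base
)


def bitband_alias(byte_address: int, bit: int) -> int:
    """Return the alias-region word address for one bit of a bit-band byte."""
    if not 0 <= bit <= 7:
        raise ValueError("bit must be in the range 0..7")
    for region_base, alias_base in BITBAND_REGIONS:
        byte_offset = byte_address - region_base
        if 0 <= byte_offset < BITBAND_REGION_SIZE:
            # 8 bits per byte, one 4-byte word per bit: 32 alias bytes per byte.
            return alias_base + byte_offset * 32 + bit * 4
    raise ValueError("address is not inside a bit-band region")


if __name__ == "__main__":
    # The eighth bit (bit 7 of the first byte) maps to the eighth alias word.
    print(hex(bitband_alias(0x20000000, 7)))
```

A store of a register whose bit 0 is set to that alias word then sets only the selected bit; the hardware performs the read-modify-write on the underlying byte.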

Page 66: ARM Thumb instructions

Page 67: ARM Thumb 2 instructions

• Cortex-M3 implements only a portion of Thumb-2

• Variable instruction length

• ARM instructions are fixed length, 32 bits

• Thumb instructions (highly encoded) are fixed length, 16 bits

• Thumb-2 instructions are either 16 or 32 bits
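Because Thumb-2 mixes 16-bit and 32-bit encodings, a decoder has to determine the length from the first halfword. A minimal sketch follows, assuming the ARMv7 rule that a halfword whose top five bits are 0b11101, 0b11110 or 0b11111 starts a 32-bit instruction; the function names are hypothetical.

```python
from typing import List, Tuple


def thumb2_instruction_length(first_halfword: int) -> int:
    """Return the instruction length in bytes (2 or 4) from its first halfword."""
    if not 0 <= first_halfword <= 0xFFFF:
        raise ValueError("halfword must fit in 16 bits")
    top5 = first_halfword >> 11
    # These three prefixes mark the first half of a 32-bit encoding.
    return 4 if top5 in (0b11101, 0b11110, 0b11111) else 2


def split_instructions(halfwords: List[int]) -> List[Tuple[int, ...]]:
    """Group a stream of halfwords into 16-bit and 32-bit instructions."""
    instructions = []
    index = 0
    while index < len(halfwords):
        count = thumb2_instruction_length(halfwords[index]) // 2
        if index + count > len(halfwords):
            raise ValueError("stream ends in the middle of a 32-bit instruction")
        instructions.append(tuple(halfwords[index:index + count]))
        index += count
    return instructions


if __name__ == "__main__":
    print(split_instructions([0x4608, 0xF000, 0xF800, 0xE7FE]))
```

This is only the length rule; it does not decode opcodes or check that an encoding is one the Cortex-M3 actually implements.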

Page 68: ARM Cortex M3 Interrupts

[Figure: Cortex-M3 containing the NVIC and the Cortex-M3 processor core; inputs: INTNMI and 1-240 interrupts]

• A single non-maskable interrupt (INTNMI)

• 1-240 prioritized interrupts

• Maskable interrupts

• Variable number of interrupts (depending on the version)

• Nested Vectored Interrupt Controller (NVIC)

Page 69: ARM Cortex M3 NVIC

• If a higher-priority interrupt arrives during a stack PUSH or POP caused by a previous interrupt, the NVIC immediately reads the pointer to the handler of the higher-priority interrupt

• The NVIC also takes care of the power management scheme: on WFI (Wait for Interrupt) and WFE (Wait for Event) instructions the M3 core is automatically put into a low-power state. The same applies to SOE (Sleep On Exit), which puts the core into power down on exit from the lowest-priority interrupt

Page 70: ARM Cortex M3 memory map

[Figure: the CM3 core (instruction, data and debug ports) connects through a bus matrix with bit-bander, aligner and patch to the ICODE AHB, DCODE AHB, SYSTEM AHB, internal PPB and APB]

Address range          Size   Region
00000000-1FFFFFFF      ½GB    Code space
20000000-3FFFFFFF      ½GB    RAM
40000000-5FFFFFFF      ½GB    Peripheral
60000000-9FFFFFFF      1GB    External RAM
A0000000-DFFFFFFF      1GB    External peripheral
E0000000-E003FFFF             Internal PPB (SCS + NVIC)
E0040000-E00FFFFF             Debug components (APB)
E0100000-FFFFFFFF             System

The memory map is fixed
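Since the map is fixed, classifying an address is a simple table lookup. The sketch below uses the boundary addresses listed on this slide; the pairing of region names with ranges and the helper name `region_of` are assumptions made for illustration.

```python
# (first address, last address, region name), boundaries as listed on the slide.
MEMORY_MAP = [
    (0x00000000, 0x1FFFFFFF, "Code space"),
    (0x20000000, 0x3FFFFFFF, "RAM"),
    (0x40000000, 0x5FFFFFFF, "Peripheral"),
    (0x60000000, 0x9FFFFFFF, "External RAM"),
    (0xA0000000, 0xDFFFFFFF, "External peripheral"),
    (0xE0000000, 0xE003FFFF, "Internal PPB (SCS + NVIC)"),
    (0xE0040000, 0xE00FFFFF, "Debug components (APB)"),
    (0xE0100000, 0xFFFFFFFF, "System"),
]


def region_of(address: int) -> str:
    """Return the name of the fixed region that contains a 32-bit address."""
    if not 0 <= address <= 0xFFFFFFFF:
        raise ValueError("address must fit in 32 bits")
    for first, last, name in MEMORY_MAP:
        if first <= address <= last:
            return name
    raise AssertionError("the map covers the whole 4GB space")


if __name__ == "__main__":
    for probe in (0x08000000, 0x20001000, 0xE000E100):
        print(hex(probe), region_of(probe))
```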

Page 71: ARM Cortex M3 Protection

• The processor allows up to 8 memory regions to be defined through dedicated registers

• Each region includes both data and instructions

• Region size: 32 bytes to 4 GBytes

• There are many free open source OSes for the Cortex-M3:
  • BeRTOS
  • ChibiOS
  • Contiki OS
  • FreeRTOS
  • Micrium uC/OS-II
  • eCos
  • NuttX
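The region sizes mentioned on this slide are not arbitrary byte counts. As a hedged sketch, assuming the ARMv7-M MPU encoding in which a 5-bit SIZE field describes a region of 2^(SIZE+1) bytes (so 32 bytes is the smallest and 4GB the largest encodable region) and in which the base address must be aligned to the region size, the checks look like this. The function names are hypothetical and no real registers are touched.

```python
MIN_REGION_BYTES = 32        # smallest region: 2**5 bytes
MAX_REGION_BYTES = 2 ** 32   # largest encodable region: 4GB


def mpu_size_field(region_bytes: int) -> int:
    """Return the SIZE field value for a power-of-two region size."""
    is_power_of_two = region_bytes > 0 and region_bytes & (region_bytes - 1) == 0
    if not is_power_of_two:
        raise ValueError("region size must be a power of two")
    if not MIN_REGION_BYTES <= region_bytes <= MAX_REGION_BYTES:
        raise ValueError("region size out of range")
    # region_bytes == 2**(SIZE + 1), so SIZE == log2(region_bytes) - 1.
    return region_bytes.bit_length() - 2


def is_valid_region(base_address: int, region_bytes: int) -> bool:
    """Check the size encoding and that the base is aligned to the size."""
    try:
        mpu_size_field(region_bytes)
    except ValueError:
        return False
    return 0 <= base_address <= 0xFFFFFFFF and base_address % region_bytes == 0


if __name__ == "__main__":
    print(mpu_size_field(1024), is_valid_region(0x20000400, 1024))
```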

Page 72: ARM Cortex M3 Simple system

Page 73: ARM Cortex-A8

Aimed at cell phones, game controllers and navigation systems

Advanced performance with low power consumption

Architecture

• Thumb-2 instructions

• 130 new instructions

• High density and high performance

• NEON unit for signal processing

• Audio, video and 3D graphics

Page 74: ARM Cortex-A8

Page 75: ARM Cortex-A8

[Figure: Cortex-A8 block diagram: instruction fetch unit with L1 I cache, instruction decode unit, instruction execute and load/store unit with L1 D cache, NEON media processor, L2 memory system, AXI level 3 memory interface]

Page 76: ARM Cortex-A8 Register file

Page 77: ARM Cortex-A8 Protection

[Figure: the OS runs in privileged mode and application code runs in user mode; physical memory holds OS code + data and application code + data]

Page 78: ARM Cortex-A8 Memory allocation

[Figure: application code (user mode) and the OS (privileged mode) issue virtual addresses; the Memory Management Unit translates them into physical addresses; physical memory holds the code + data of each application and of the OS]

Page 79: ARM Cortex-A8 Memory management

• In addition to address translation, the Memory Management Unit (MMU) checks memory accesses for protection

• The TLBs associate a process identifier with each entry
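Tagging each TLB entry with a process identifier means that two processes can use the same virtual address without the TLB being flushed on every context switch. The following is a toy model of that idea, not the Cortex-A8 hardware: the class `Tlb`, the 4KB page size and the dictionary lookup are all assumptions for illustration, and permissions and global entries are not modelled.

```python
from typing import Dict, Optional, Tuple

PAGE_SIZE = 4096  # assumed page size for this sketch


class Tlb:
    """Toy TLB whose entries are keyed by (process identifier, virtual page)."""

    def __init__(self) -> None:
        self._entries: Dict[Tuple[int, int], int] = {}

    def insert(self, process_id: int, virtual_page: int, physical_page: int) -> None:
        self._entries[(process_id, virtual_page)] = physical_page

    def translate(self, process_id: int, virtual_address: int) -> Optional[int]:
        """Return the physical address, or None on a TLB miss."""
        virtual_page, offset = divmod(virtual_address, PAGE_SIZE)
        physical_page = self._entries.get((process_id, virtual_page))
        if physical_page is None:
            return None  # a real MMU would now walk the page tables
        return physical_page * PAGE_SIZE + offset


if __name__ == "__main__":
    tlb = Tlb()
    tlb.insert(process_id=1, virtual_page=0x10, physical_page=0x80)
    tlb.insert(process_id=2, virtual_page=0x10, physical_page=0x90)
    # Same virtual address, different process identifiers, different frames.
    print(hex(tlb.translate(1, 0x10004)), hex(tlb.translate(2, 0x10004)))
```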

Page 80: ARM Cortex-A8

• Superscalar pipeline: in-order dual issue and OOO execution

Page 81: ARM Cortex-A8

NEON media engine

Page 82: ARM Cortex-A8

Page 83: ARM Cortex-A8 for cell phones

Page 84: ARM Cortex-A9

Page 85: ARM Cortex-A15

Page 86: ARM family

ARM6 → ARM7

• 3-stage pipeline

• Unified memory for data and instructions

• 16-bit Thumb instruction set

• 54 multiplication unit

ARM8 → ARM9 → ARM10

ARM9

• 5-stage pipeline (130 MHz or 200 MHz)

• Separate data and instruction memories

ARM10

• 300 MHz

• Multimedia support

• Optional vector floating-point unit

ARM11

• 8-stage pipeline

• 1 GHz

• Wireless, consumer, networking and automotive

Page 87: ARM Cortex family

• High-end performance

• ARM Cortex-A for complex OS and applications. Thumb and Thumb-2 support

• ARM Cortex-R: embedded processor for real-time systems. Thumb and Thumb-2 support

• ARM Cortex-M: embedded processor for low-cost applications. Thumb-2 only

Page 88: The ARM family