Computer architecture M - unibo
ARM architecture
Computer architecture M
History
• Design software can be bought (Verilog) – soft core
• Acorn Computers: an English company, a Cambridge (UK) spin-off, which had developed an 8 bit microcomputer for the BBC based on the 6502 architecture (Synertek and Rockwell)
• In 1982 Acorn engineers looked for a new microprocessor for more sophisticated applications but decided against CISC solutions, which were too slow for the specific requirements and had too long an interrupt latency
• They decided to design a totally new architecture. At the same time the Berkeley RISC I and II and the Stanford MIPS (Microprocessor without Interlocked Pipeline Stages) appeared, and they decided to follow that philosophy
• ARM (Advanced RISC Machine), whose three-stage pipeline is still in use today
• ARM has been a true industry since 1990 and a «brand» with multiple implementations, used by many processor companies (Intel too) in multiple environments in tailored versions (Intellectual Property, IP, cores)
ARM
• T: Thumb
• D: On-chip debug support
• M: Enhanced multiplier
• I: Embedded ICE hardware
• T2: Thumb-2
• S: Synthesizable code
• E: Enhanced DSP instruction set
• J: JAVA support
• Z: TrustZone
• F: Floating point unit
• H: Handshake, clockless design for synchronous or asynchronous design
ARM - basic concepts
Examples: Apple iPod Photo and iPod Video 5th gen (2X, @80MHz), Roomba 500, Lego Mindstorm
• ARM is a family of RISC processors conceptually similar to DLX
• There are several versions, from very simple to very sophisticated
• Multiple environments (e.g. mobile phones)
ARM – first version
• Single-cycle instructions, but potentially multiple cycles since no Harvard architecture is implemented. When more than a single memory access is required (e.g. a LOAD) the extra cycles are used for useful micro-operations (e.g. auto-indexing the address)
• LOAD/STORE architecture, very simple since the designers had no previous full-custom experience
• 32 bit fixed length instructions
• Three-address instructions, RISC type (with some CISC-type exceptions)
• Fixed register bank. Obviously, in addition to the programmer-visible registers there are the machine registers
Three-stage ARM
• Two interrupts: fast and standard
• 16 32-bit general-purpose registers (r0 - r15)
• Three-port register bank (two ports for reading and one for writing). An additional port for reading and writing register 15 (PC)
• N-position barrel shifter
• 32 bit ALU
• The address register is provided with an incrementer (for sequential accesses). In practice it is a programmable counter
• Two buffer registers for data to and from the memory (invisible to the programmer). Single bank memory
• Instruction decoder and control logic
• Status register (CPSR)
ARM register set
User/System mode   fiq mode    svc mode    abort mode   irq mode    undef. mode
r0-r7
r8                 r8_fiq
r9                 r9_fiq
r10                r10_fiq
r11                r11_fiq
r12                r12_fiq
r13 (MSP)          r13_fiq     r13_svc     r13_abt      r13_irq     r13_und
r14 (LR)           r14_fiq     r14_svc     r14_abt      r14_irq     r14_und
r15 (PC)
CPSR               SPSR_fiq    SPSR_svc    SPSR_abt     SPSR_irq    SPSR_und
(banked registers: system modes only; an empty cell means that the user register is used)

CPSR: Current Program Status Register
SPSR: Saved Program Status Register
MSP: Master Stack Pointer
LR: Link Register (return register for subroutines)

fiq: fast interrupt
svc: software interrupt
abt: memory faults (abort)
irq: standard interrupt
und: undefined instructions
r0-r7 are common to all modes; user and system mode share the whole register set
Current Program Status Register CPSR (similar to a flag register)
Bit:    31  30  29  28 | 27 ... 8 | 7 | 6 | 5 | 4 ... 0
Field:   N   Z   C   V |  Unused  | I | F | T |  Mode
Condition codes: N negative, Z zero, C carry, V overflow
I, F: interrupt masks
T: Thumb instruction set
CPSR[4:0]   Mode     Use                                          Register set
10000       User     Normal user code                             user
10001       FIQ      Processing fast interrupts                   fiq
10010       IRQ      Processing standard interrupts               irq
10011       SVC      Processing software interrupts               svc
10111       Abort    Processing memory faults                     abt
11011       Undef    Handling undefined instruction traps         und
11111       System   Running privileged operating system tasks    user
NB Thumb instruction set: highly encoded instructions to save memory
Each privileged mode has a Saved Program Status Register (SPSR), where the current CPSR is saved, and a specific r14 (Link Register)
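The CPSR layout above can be illustrated with a minimal Python sketch. The function name `decode_cpsr` and the dictionary layout are hypothetical, chosen only for illustration; the bit positions and mode encodings are those listed in the table above.

```python
# Mode encodings taken from the CPSR[4:0] table above.
MODES = {
    0b10000: "User",
    0b10001: "FIQ",
    0b10010: "IRQ",
    0b10011: "SVC",
    0b10111: "Abort",
    0b11011: "Undef",
    0b11111: "System",
}


def decode_cpsr(cpsr):
    """Split a 32 bit CPSR value into its named fields."""
    return {
        "N": (cpsr >> 31) & 1,   # negative
        "Z": (cpsr >> 30) & 1,   # zero
        "C": (cpsr >> 29) & 1,   # carry
        "V": (cpsr >> 28) & 1,   # overflow
        "I": (cpsr >> 7) & 1,    # IRQ mask
        "F": (cpsr >> 6) & 1,    # FIQ mask
        "T": (cpsr >> 5) & 1,    # Thumb state
        "mode": MODES.get(cpsr & 0x1F, "reserved"),
    }


print(decode_cpsr(0x600000D3))
```

For the value 0x600000D3 the sketch reports Z and C set, both interrupt masks set and SVC mode.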
Exceptions
When an exception occurs:
1) The corresponding mode is activated
2) The PC (r15) is saved in r14 (link register) of the new mode
3) The old CPSR is saved in the SPSR of the new mode
4) IRQ is disabled by setting bit 7 of the CPSR and, if the exception is a Fast Interrupt, bit 6 of the CPSR is set too
5) The PC takes the value given in the following table (fixed addresses)
Exception                                  Mode    Address
Reset                                      SVC     00000000
Undef. Instr.                              UND     00000004
Soft. Int. (SWI)                           SVC     00000008
Prefetch Abort (Instr. Mem. Fault)         Abort   0000000C
Data Abort (Data Mem. Fault)               Abort   00000010
IRQ                                        IRQ     00000018
FIQ                                        FIQ     0000001C
Address 00000014 cannot be used (compatibility with old ARMs)
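The exception entry sequence can be sketched in Python as follows. This is only a sketch under simplifying assumptions: the processor state is a plain dictionary, the names `take_exception`, `VECTORS` and `MODE_BITS` are hypothetical, and the exact value stored in r14 (which in a real core depends on the exception type) is simplified to the current PC.

```python
# exception -> (mode entered, vector address), from the table above
VECTORS = {
    "reset": ("svc", 0x00),
    "undef": ("und", 0x04),
    "swi": ("svc", 0x08),
    "prefetch_abort": ("abt", 0x0C),
    "data_abort": ("abt", 0x10),
    "irq": ("irq", 0x18),
    "fiq": ("fiq", 0x1C),
}
# CPSR[4:0] encodings of the privileged exception modes
MODE_BITS = {"fiq": 0b10001, "irq": 0b10010, "svc": 0b10011,
             "abt": 0b10111, "und": 0b11011}


def take_exception(state, exception):
    """Return the new state after steps 1-5 of the exception entry."""
    mode, vector = VECTORS[exception]
    new = dict(state)
    new["r14_" + mode] = state["pc"]        # 2) PC saved in the banked link register
    new["spsr_" + mode] = state["cpsr"]     # 3) old CPSR saved in the banked SPSR
    cpsr = (state["cpsr"] & ~0x1F) | MODE_BITS[mode]  # 1) new mode
    cpsr |= 1 << 7                          # 4) IRQ disabled
    if exception == "fiq":
        cpsr |= 1 << 6                      # 4) FIQ disabled for a fast interrupt
    new["cpsr"] = cpsr
    new["pc"] = vector                      # 5) jump to the fixed address
    return new


print(take_exception({"pc": 0x8000, "cpsr": 0x10}, "irq"))
```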
ARM three stages pipeline
Organisation
• No forwarding unit (not needed because the following instruction finds the updated value already in the RF, three stages pipeline, see next slide)
• Register bank (= Register File): two read ports and one write port, as in DLX, for the normal data traffic, plus two accesses (read and write) for r15 (PC)
• The memory access register has an incrementer for sequential accesses, which is used for incrementing the PC too
• One extra register for 32 bit multiplication (when multiplying 32 bit data the result can be longer than 32 bits)
• Two transit registers for the memory (Data in and Data out; no Harvard architecture initially)
time ->
1  sub r2,r3,r6    fetch   decode   exec. sub
2  cmp r2,#3               fetch    decode      exec. cmp
3  add r3,r1,r5                     fetch       decode      exec. add
NB An instruction in its exec stage finds the datum it requires already available in the RF. No forwarding unit!
Three stages pipeline: single cycle instructions
• An instruction with a single execution clock accesses two operands during the execute stage: the datum on bus B, shifted (if required), is combined in the ALU with the datum on bus A. The result is written back in the register bank. The PC is incremented by the incrementer and the result is stored back in r15 AND in the address register for the next instruction access
• Fetch stage: the instruction is read from the memory into the data-in register for decoding
• Decode stage: the instruction in the data-in register is decoded (it doesn’t use the datapath) and in the meantime the next instruction is read from the memory and is «clocked» at the end of the fetch stage
• Exec stage: the instruction uses the datapath. In an arithmetic instruction two operands are read; the one on bus B is shifted (combinatorially) if needed and combined with the datum on bus A. The result is written back in the RF in the same clock period
Three stages pipeline: multiple cycles instructions
• Since the PC (r15) is incremented in the first stage, the programmer must be aware that it has already been incremented twice (two instructions, 8 bytes) if it has to be used in the exec stage
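The effect of the double increment can be made concrete with a small Python sketch. The function names are hypothetical; the only facts used are the 4 byte instruction length and the two instructions already fetched when the exec stage is reached.

```python
INSTRUCTION_SIZE = 4  # 32 bit fixed length instructions


def pc_seen_in_execute(instruction_address):
    """Value read from r15 by an instruction during its exec stage."""
    # Two younger instructions have already been fetched,
    # so r15 is 8 bytes ahead of the executing instruction.
    return instruction_address + 2 * INSTRUCTION_SIZE


def pc_relative_offset(instruction_address, target_address):
    """Offset to add to r15 (as seen in exec) to reach target_address."""
    return target_address - pc_seen_in_execute(instruction_address)


print(hex(pc_seen_in_execute(0x1000)))     # 0x1008
print(pc_relative_offset(0x1000, 0x1010))  # 8
```

A PC-relative access placed at 0x1000 that must reach 0x1010 therefore uses an offset of 8, not 16.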
• Multi-cycle instructions are executed more irregularly. In this example an ADD is followed by a STORE and three ADDs
• In the original figure the greyed stages are those where the memory is accessed (every fetch and the data xfer)
• The datapath is used by the STORE for the address computation

instruction   time ->
1  ADD        fetch   decode   execute
2  STR                fetch    decode    calc. addr.   data xfer
3  ADD                         fetch     (bubble)      decode       execute
4  ADD                                   fetch         (bubble)     decode    execute
5  ADD                                                 (no fetch)   fetch     decode    execute

• Decoder busy: the address computation prevents the decoding of instruction 3 because the registers towards the ALU cannot be opened (bubble)
• Memory busy: instruction 5 cannot be fetched while the memory is busy with the data transfer of the STORE (no fetch)
Three stages pipeline: multiple cycles instructions
time ->
ldmia r0!,{r2,r3}   fetch   decode   ex ld r2   ex ld r3
sub r2,r3,r6                fetch    decode     (wait)     ex sub
cmp r2,#3                            fetch      (wait)     decode   ex cmp
LDMIA -> Load Multiple registers, Increment (address) After
• This instruction loads two registers (in this case r2 and r3) with data starting from the address in r0 (in this case). No need for address computation (value already present in r0). The address is incremented by 4 at each load (incrementer)
«register based» load with autoincrement
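The behaviour of the load multiple with autoincrement can be sketched in Python. The model is deliberately simplified and its names are hypothetical: registers and memory are dictionaries, memory is indexed by byte address and holds one word per entry, and the lowest-numbered register is loaded from the lowest address.

```python
def ldmia(registers, memory, base, reg_list, writeback=True):
    """Model of LDMIA base!,{reg_list}: load words, incrementing the address by 4."""
    address = registers[base]           # no address computation: value already in base
    for reg in sorted(reg_list):        # lowest register from the lowest address
        registers[reg] = memory[address]
        address += 4                    # the incrementer adds 4 after each load
    if writeback:                       # the "!" suffix updates the base register
        registers[base] = address
    return registers


regs = {0: 0x100, 2: 0, 3: 0}
mem = {0x100: 11, 0x104: 22}
print(ldmia(regs, mem, base=0, reg_list=[2, 3]))
```

After the call r2 holds 11, r3 holds 22 and r0 has advanced from 0x100 to 0x108.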
Branch
time ->
bne foo              fetch   decode   execute   linkret   adjust
sub r2,r3,r6                 fetch    decode
foo add r0,r1,r2                                fetch     decode   ex add
add r13,r14,r2                                            fetch    decode   ex add
Decision on the third clock
The branch can be with return, and the PC value is saved in the linkret stage. The adjust stage adjusts its value, which has already been incremented by 8
Register/Register instructions: datapath
[Figure: datapath with address register, incrementer, register bank (Rd, Rn, Rm, PC/r15), barrel shifter and ALU (operation as per instruction), multiplier, data out, data in and instruction pipe]
Reg-Reg instruction:
Rd <= Rn op Rm
R15 (PC) <= AR + 4
AR <= AR + 4
AR: Address Register
The PC value is incremented by 4. The same incremented value is stored in the AR
Register/Immediate instructions: datapath
[Figure: datapath with address register, incrementer, register bank (Rd, Rn, PC/r15), barrel shifter and ALU (operation as per instruction), multiplier, data out, data in and instruction pipe; the immediate comes from bits [7:0] of the instruction]
Reg-Imm instruction:
Rd <= Rn op Imm
R15 (PC) <= AR + 4
AR <= AR + 4
In this case the operand is in the instruction
Store instruction: datapath
(a) 1st cycle: the STORE address is computed and stored in the AR. In the meantime r15 (PC) is incremented and the value is stored in the RF ONLY, for the next instruction
Compute address:
AR <= Rn op Disp
R15 (PC) <= AR + 4
[Figure: datapath of the 1st cycle: address register, incrementer, register bank (Rn, PC), shifter with lsl #0, ALU computing A / A + B / A - B, displacement from bits [11:0] of the instruction, data out, data in, instruction pipe]
(b) 2nd cycle: r15 (PC) is copied into the AR while the datum is written into the memory and an autoincrement (if required) is executed. If only a single byte must be written, the lowest byte of the word is replicated 4 times in the output register
Store data:
AR <= R15 (PC)
mem[AR] <= Rd
If autoindexing: Rn <= Rn +/- 4
[Figure: datapath of the 2nd cycle: address register, incrementer, register bank (Rn, Rd, PC/r15), shifter, ALU computing A + B / A - B, byte replication on data out, data in, instruction pipe]
Branch
(a) 1st cycle: compute branch target
Compute target address: AR <= PC + Disp
PC adjustment: the displacement is shifted 2 positions left (lsl #2) before the addition
[Figure: datapath of the 1st cycle: address register, incrementer, register bank (PC/r15), shifter with #2, ALU computing A + B, data out, data in, instruction pipe]
(b) 2nd cycle: save return address
Save return address (if required):
r14 <= PC (R15/PC to save)
AR <= AR(PC) + 4
R15 <= AR(PC) + 4 (target PC increment)
[Figure: datapath of the 2nd cycle: address register, incrementer, register bank (R14, PC/r15), shifter, ALU passing A, data out, data in, instruction pipe]
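The two branch cycles can be sketched numerically in Python. The sketch assumes the standard ARM branch encoding, which is not spelled out in the slides: a signed 24 bit word displacement that is converted to a byte displacement (multiplied by 4) and added to the PC, which is already 8 bytes ahead. The function names are hypothetical.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def branch_target(instruction_address, offset24):
    """1st cycle: AR <= PC + Disp, with PC = instruction address + 8."""
    pc = instruction_address + 8
    displacement = sign_extend(offset24, 24) << 2   # words -> bytes
    return (pc + displacement) & 0xFFFFFFFF


def link_value(instruction_address):
    """2nd cycle (after adjust): r14 points to the instruction after the branch."""
    return instruction_address + 4


print(hex(branch_target(0x1000, 2)))         # 0x1010
print(hex(branch_target(0x1000, 0xFFFFFE)))  # 0x1000
print(hex(link_value(0x1000)))               # 0x1004
```

The second call shows a branch to itself: the displacement of -2 words cancels the 8 byte advance of the PC.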
Pipeline clock
• ARMs don’t use edge-sensitive flip-flops (D FFs): they are based on a two-phase non-overlapping clock, internally derived from the processor clock
• Data transfer is achieved by loading the data alternately in the latches
[Figure: one clock cycle split into phase 1 and phase 2]
Datapath timing
• The read register buses are dynamic and precharged in phase 2. In this case “dynamic” means that sometimes they are not driven: they maintain their values and look “pseudo-static”
• In phase 1 the registers in use enable their drivers onto the read buses, which present valid data from the start of phase 1
• The second operand goes through the barrel shifter and is therefore available with a little delay
• The ALU has input registers which are enabled during phase 1
• The ALU processes the operands during phase 2: the result is sampled in the destination register at the end of phase 2
Timing diagram
Delay = Register read time + shifter delay + ALU delay + Write register setup time + No phases overlap delay
[Figure: timing over phase 1 and phase 2: register read time, read bus valid, shift time, shift out valid, ALU operands latched, ALU time, ALU out, register write time; the precharge invalidates the buses]
ALU ARM
The value range in 2’s complement of the ARM 32 bit registers goes from –2^31 (0x80000000) to +2^31 – 1 (0x7FFFFFFF). In case of saturation, “overload” (out of range values), an automatic correction is performed. If the value is greater than +2^31 – 1 the result becomes +2^31 – 1; if it is smaller than –2^31 the value becomes –2^31.
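The saturation rule can be written as a short Python sketch. Python integers are unbounded, so the out-of-range sum is visible before the correction; the function names are hypothetical.

```python
INT32_MAX = 2**31 - 1    # 0x7FFFFFFF
INT32_MIN = -(2**31)     # 0x80000000 as a two's complement pattern


def saturate32(value):
    """Clamp an out-of-range value to the 32 bit two's complement limits."""
    return max(INT32_MIN, min(INT32_MAX, value))


def saturating_add(a, b):
    """Add two 32 bit signed values with saturation instead of wrap-around."""
    return saturate32(a + b)


print(saturating_add(INT32_MAX, 1) == INT32_MAX)   # True
print(saturating_add(INT32_MIN, -1) == INT32_MIN)  # True
print(saturating_add(1, 2))                        # 3
```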
[Figure: ALU structure: A operand latch and B operand latch, XOR gates (invert A, invert B), adder with C in, logic functions, result mux selecting logic/arithmetic according to the function, zero detect; outputs: result and flags N, Z, V, C]
Since the integration of logic and mathematical functions is cumbersome, two different circuits were designed plus a MUX
Control logic
[Figure: the instruction enters the decode PLA; together with the cycle count, multiply control and load/store multiple blocks it generates the control signals for the subsystems: address control, register control, ALU control, shifter control]
Note the two subsystems which handle the multiplication and the multiple LOADs and STOREs.
Memories
Data: 32 bit. Addresses: 32 bit
The notation A(K+2:2) indicates that the address lines of a device with K address lines are connected to the ARM addresses shifted two positions right. (Obviously ARM addresses 0 and 1 are NOT emitted.) Notice that in ARM notation the LSBit is on the right
Memories
mas[1:0] -> parallel access control
Obviously a Wait/Ready signal is available
BusSel and WR
mas[1]  mas[0]  Access
1       0       word access
0       1       half word access (selection depending on A[1])
0       0       byte access (selection depending on A[0] and A[1])
A[31] selects either ROM or RAM, and r/!w the access type
The I/O is memory mapped
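How the access size and the two low address bits select the bytes of the 32 bit data bus can be sketched in Python. The lane numbering (lane 0 = data bits 7:0) assumes a little-endian arrangement, which the slides do not state; `mas` is passed as a 2 bit integer with mas[1] as the most significant bit, and the function name is hypothetical.

```python
def active_byte_lanes(mas, address):
    """Return the byte lanes (0 = bits 7:0 ... 3 = bits 31:24) used by an access."""
    low = address & 0b11
    if mas == 0b10:              # word access: all four lanes
        return [0, 1, 2, 3]
    if mas == 0b01:              # half word access: selected by A[1]
        base = low & 0b10
        return [base, base + 1]
    if mas == 0b00:              # byte access: selected by A[1] and A[0]
        return [low]
    raise ValueError("reserved mas encoding")


print(active_byte_lanes(0b10, 0x1000))  # [0, 1, 2, 3]
print(active_byte_lanes(0b01, 0x1002))  # [2, 3]
print(active_byte_lanes(0b00, 0x1003))  # [3]
```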
ARM Bus
There are three types of buses defined by ARM
• Advanced High-performance Bus (AHB). It is a protocol based on a single bus. Addressing and transfer are overlapped for maximum bandwidth, and burst mode is supported
• Advanced eXtensible Interface (AXI). It is a protocol where data and addresses use different channels, both for reading and for writing. Addressing and transfer are overlapped, and burst mode is supported
• Advanced Peripheral Bus (APB), for the interface of low complexity peripherals
Normally each ARM microcontroller incorporates an AHB or an AXI together with an APB.
ARM Bus
AHB or AXI
AHB bus generic structure
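[Figure: Master #1, Master #2 and Master #3 drive HADDR and HWDATA and receive HRDATA; the arbiter selects the address/control and write data of the granted master; the decoder selects among Slave #1 to Slave #4, which receive HADDR and HWDATA and return the read data on HRDATA]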
AHB Bus
• The bus master can «lock» the bus for atomic transfers (e.g. semaphores). «Split» transactions are allowed, where a slave defers the acknowledge to the master. The slave stores the master request; the master, when it gains the bus again, repeats the request (and the slave is hopefully ready to answer). Only a single pending split transaction is allowed
• Synchronous bus which supports 32, 64 and 128 bit transfers. 32 bit addresses and burst transfers (multiple transfers with incremented addresses)
• Separated address and data buses
• Up to 16 arbitrated bus masters
• The data transfer to an address is overlapped with the emission of the next address (max. bandwidth exploitation). The arbitration takes place during the current transfer
• Burst transfers can be directed to a fixed address (e.g. a FIFO) or to an automatically incremented address. A burst must not cross a 1KB address boundary
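The incrementing burst and the 1KB rule can be illustrated with a small Python sketch. The function names and the default transfer size of 4 bytes are assumptions made for the example only.

```python
def incrementing_burst(start, beats, size_bytes=4):
    """Addresses of a burst with automatically incremented address."""
    return [start + i * size_bytes for i in range(beats)]


def crosses_1kb_boundary(addresses, size_bytes=4):
    """True if the first and the last transferred byte lie in different 1KB blocks."""
    first_block = addresses[0] // 1024
    last_block = (addresses[-1] + size_bytes - 1) // 1024
    return first_block != last_block


ok = incrementing_burst(0x400, 4)
bad = incrementing_burst(0x3F8, 4)
print([hex(a) for a in ok], crosses_1kb_boundary(ok))    # stays inside one block
print([hex(a) for a in bad], crosses_1kb_boundary(bad))  # crosses 0x400
```

A master that needs the second transfer would have to split it into two bursts, one ending at 0x3FC and one starting at 0x400.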
Topologies
Multilayer structure
Typical Multilayer
Periph#1
Periph#2
Periph#3
AHB Bus - Master
• An AHB master (e.g. an ARM processor)
• HBUSREQx is the request to the arbiter. The transfer starts when the master receives the signal HGRANTx: it activates the address and control signals, which are received and decoded by the slaves. The master is unaware of how many slaves are present. For instance there can be three slaves, each one controlling 24 MB of memory, or two slaves, each one controlling 36 MB of memory
• The meaning of the other signals is explained in the next slides
AHB Bus - Timing
A read or write transfer with wait states
Write
Read
AHB Bus - Timing
Multiple transfers with and without wait periods. Note the overlap between a data transfer and the addressing of the next datum
AHB Bus- Topology
There is a single address bus, used in turn by all selected masters
AHB bus - Arbiter
This figure shows an AHB arbiter. Obviously nothing is said about the arbiter policy; normally it is a round robin scheme. The signal HGRANTx indicates which master the next access is granted to. HMASTER[3:0] indicates which master is presently controlling the bus
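Since the policy is left open, the following Python sketch shows only one possible round robin scheme, not the behaviour of any specific AHB arbiter. The function name and the list-of-booleans interface are hypothetical: `requests[i]` stands for HBUSREQi and the returned index for the master that would receive HGRANT.

```python
def round_robin_grant(requests, last_granted):
    """Grant the first requesting master after the one granted last time."""
    n = len(requests)
    for step in range(1, n + 1):
        candidate = (last_granted + step) % n
        if requests[candidate]:
            return candidate
    return None  # nobody is requesting the bus


print(round_robin_grant([True, False, True], last_granted=0))  # 2
print(round_robin_grant([True, False, True], last_granted=2))  # 0
```

Starting the search just after the last granted master guarantees that a master requesting continuously cannot starve the others.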
High performance AXI bus
• High bandwidth and low latency
• Backward compatible with AHB
• Address/control and transfer phases separated
• Separated read and write channels
• Transfer parallelism from 1 to 128 bytes (lanes) with byte enables
• Burst transactions with a single initial address. A signal indicates the end of the transfer
• Each transaction carries address and controls
AXI read and write
ARM caches
• The majority of the processors of the ARM family use a virtually addressed cache (iAPX caches are physically addressed)
• Advantage: the cache access is performed in parallel with the virtual address translation (faster access)
• Disadvantages:
• The cache must be emptied at each context switch
• Possible data sharing must take place outside the cache (the address translation is different for different processors)
ARM MMU
• Virtual memory is mandatory since the caches are virtually addressed
• The MMU depends on the processor. Here are the characteristics of the ARM 7
• The virtual memory is based on page tables that are not on chip (they are in memory, as is the case with iAPX, i.e. in x86 systems)
• The page size can be 1MB (called sections, with a single level page table) or 64 KB or 4 KB (pages, with double level page tables)
• Internal TLB
• Memory is protected by up to 16 domains. A different policy can be defined for each of them (e.g. cacheable or non cacheable)
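How a 32 bit virtual address splits for the three sizes can be sketched in Python. The field widths follow directly from the sizes above (a 1MB section leaves a 20 bit offset, a 64 KB page a 16 bit offset, a 4 KB page a 12 bit offset); the function names are hypothetical and the sketch ignores the descriptor formats and the way real second level tables are laid out.

```python
def split_section(va):
    """1MB section: single level table indexed by the top 12 bits."""
    return {"l1_index": va >> 20, "offset": va & 0xFFFFF}


def split_large_page(va):
    """64 KB page: first level index, second level index, 16 bit offset."""
    return {"l1_index": va >> 20, "l2_index": (va >> 16) & 0xF, "offset": va & 0xFFFF}


def split_small_page(va):
    """4 KB page: first level index, second level index, 12 bit offset."""
    return {"l1_index": va >> 20, "l2_index": (va >> 12) & 0xFF, "offset": va & 0xFFF}


for split in (split_section, split_large_page, split_small_page):
    fields = split(0x12345678)
    print(split.__name__, {name: hex(value) for name, value in fields.items()})
```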
ARM9TDMI
• Harvard architecture
• 5 stages pipeline
• Increased clock frequency
Strong ARM (ARM9TDMI): 5 stages (DLX with cache)
ARM7TDMI vs ARM9TDMI
The number of stages is increased to raise the clock frequency
ARM7TDMI: Fetch, Decode, Execute
ARM9TDMI: Fetch, Decode, Execute, Buffer/data, Write-back
Pipeline ARM9TDMI
[Figure: ARM9TDMI pipeline; the register read takes place in the Decode stage]
Process        0.25 um    Transistors   110,000        MIPS     220
Metal layers   3          Core area     2.1 mm^2       Power    150 mW
Vdd            2.5 V      Clock         0 to 200 MHz   MIPS/W   1500
Stages dynamic
… as DLX
Fetch
Decode: decodes the instruction and reads the registers (three read ports)
Execute: an operand is shifted (if needed) and the ALU result is available, or address computation
Buffer/data: memory access (load, store)
Write-back: register write
Forwarding
ForwardingPaths
Further ARM information
From here onward other information about ARM (only for cultural purposes NOT for the exam)
Unavoidable stalls (as DLX)
Clock cycle      1    2    3    4       5       6     7
LDR R1,@(R2)     IF   ID   EX   MEM     WB
SUB R4,R1,R5          IF   ID   EXsub   MEM     WB
AND R6,R1,R7               IF   ID      EXand   MEM   WB
OR  R8,R1,R9                    IF      ID      EX    MEM
Not possible: R1 is still being read from memory when it is required by the SUB => STALL

Clock cycle      1    2    3    4       5       6     7     8     9
LDR R1,@(R2)     IF   ID   EX   MEM     WB
SUB R4,R1,R5          IF   ID   stall   EXsub   MEM   WB
AND R6,R1,R7               IF   stall   ID      EX    MEM   WB
OR  R8,R1,R9                    stall   IF      ID    EX    MEM   WB
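The stall condition can be expressed as a small Python check. This is a sketch under the assumption that forwarding covers every other dependency, so that only a load immediately followed by a consumer of its destination needs a bubble; instructions are modelled as hypothetical (opcode, destination, sources) tuples.

```python
def needs_load_use_stall(previous, current):
    """True if `current` reads the register that `previous` loads from memory."""
    opcode, dest, _ = previous
    _, _, sources = current
    return opcode == "LDR" and dest in sources


def count_stalls(program):
    """Count the bubbles inserted between adjacent instructions."""
    return sum(needs_load_use_stall(a, b) for a, b in zip(program, program[1:]))


program = [
    ("LDR", "R1", ["R2"]),
    ("SUB", "R4", ["R1", "R5"]),   # needs R1 one cycle too early: stall
    ("AND", "R6", ["R1", "R7"]),   # R1 available by now
    ("OR", "R8", ["R1", "R9"]),
]
print(count_stalls(program))  # 1
```

Only the SUB stalls; the AND and the OR are far enough from the LDR to find R1 available.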
LDR interlock (bubble)
(Interlock => bubble). The LDR is followed by an instruction which requires its result
LDR R4, [R7] ; R4 := MEM32[R7]
EOR: Exclusive Or
Unused MEM stages
ARM architectures
Performance
ARM10TDMI
• 64 bit memory: two registers transfer in a cycle
• Clock 300 MHz
• CMOS 250 nm
• Performance: 4 times ARM7TDMI
• Branch prediction
• Non blocking Load and Store(queue)
• 6 stages pipeline
• OOO execution for the three pipelines
ARM 11
8 stages ARM
• Pipeline parallelism: ALU/MAC, LSU
• Load and Store don’t block the pipeline
• OOO execution
8 stages pipeline
• Data forwarding
• Static and dynamic branch prediction
• Non blocking cache access
ARM11 MPCore
Highly configurable
• Up to 255 interrupt sources
• Up to 4 processors
• Configurable cache 16K-64K for each processor. MESI
• Double or single bus 64-bit AXI
• Optional vectored floating point
ARM11 MPCore
Comparison
Feature                       ARM9E™             ARM10E™            Intel® XScale™     ARM11™
Architecture                  ARMv5TE(J)         ARMv5TE(J)         ARMv5TE            ARMv6
Pipeline length               5                  6                  7                  8
Java decode                   (ARM926EJ)         (ARM1026EJ)        No                 Yes
V6 SIMD instructions          No                 No                 No                 Yes
MIA instructions              No                 No                 Yes                Coprocessor
Branch prediction             No                 Static             Dynamic            Dynamic
Independent load-store unit   No                 Yes                Yes                Yes
Instruction issue             Scalar, in-order   Scalar, in-order   Scalar, in-order   Scalar, in-order
Concurrency                   None               ALU/MAC, LSU       ALU, MAC, LSU      ALU/MAC, LSU
Out-of-order completion       No                 Yes                Yes                Yes
Target implementation         Synthesizable      Synthesizable      Custom chip        Synthesizable and hard macro
ARM Cortex family (V7)
[Figure: the Cortex family, from 12k gates up to 2.5 GHz: Cortex-M0, Cortex-M1, Cortex-M3, Cortex-M4, SC300™, Cortex-R4, Cortex-R Heron, Cortex-A5, Cortex-A8, Cortex-A9, Cortex-A15; some members are available in multi-core configurations (x1-4, 1-2)]
• ARM Cortex-M family (v7-M): Microcontrollers for SoC
• ARM Cortex-A family (v7-A): General purpose processors -Applications processors for full OS and 3rd party applications
• ARM Cortex-R family (v7-R): Embedded processors for real time and control signal processing
ARM Cortex performance
ARM Cortex M3: pipeline
[Figure: three stage pipeline. 1st stage, Fetch (prefetch). 2nd stage, Decode: instruction decode and register read, AGU, branch. 3rd stage, Execute: address phase and write back, data phase load/store and branch, multiply and divide, shift, ALU and branch, write. Branch forwarding and speculation; execute stage branch (ALU branch and load/store branch)]
ARM Cortex M3: datapath
[Figure: register bank, Mul/Div, barrel shifter, ALU (inputs A and B), writeback, instruction decode, read data register, write data register; instruction side with address incrementer, address register, INTADDR, I_HADDR and I_HRDATA; data side with address incrementer, address register, D_HADDR, D_HWDATA and D_HRDATA]
• Three stages pipeline similar to that of the ARM 7
• This diagram refers to the internal core and therefore has I and D ports. The memory access takes place outside the core
ARM Bit Banding
Traditional bit manipulation (on the RAM byte at 0x02000000):
0 0 0 0 0 0 0 0   RAM byte read
x x x x x 1 x x   Mask and bit modification
0 0 0 0 0 1 0 0   RAM writeback
ARM Bit Banding
• Register bit 0 is written into the bit
• A write to a bit-band alias address affects only one bit
• The M3 has two 32MB alias regions that map onto the two 1MB bit-band regions. The two regions are separate, one in the SRAM region and one in the peripheral region. Each bit in the bit-band region is addressed sequentially in the 32MB alias region. For example, the eighth bit in the bit-band region can be accessed using the eighth word in the 32MB alias region
• The write is transformed into an atomic read-modify-write
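The mapping from a bit to its alias word can be sketched in Python. The base addresses below are not given in the slides: they are the values documented for the Cortex-M3 (bit-band regions at 0x20000000 and 0x40000000, alias regions at 0x22000000 and 0x42000000) and should be read as assumptions of this sketch, like the function name.

```python
# (bit-band base, alias base) for the SRAM and the peripheral region
REGIONS = [
    (0x20000000, 0x22000000),
    (0x40000000, 0x42000000),
]
BITBAND_SIZE = 0x100000  # 1MB of bit-band memory per region


def bitband_alias(byte_address, bit):
    """Address of the alias word that maps one bit of a bit-band byte."""
    if not 0 <= bit <= 7:
        raise ValueError("bit must be in 0..7")
    for band_base, alias_base in REGIONS:
        if band_base <= byte_address < band_base + BITBAND_SIZE:
            byte_offset = byte_address - band_base
            # each byte expands to 8 words (32 bytes), each bit to one word (4 bytes)
            return alias_base + byte_offset * 32 + bit * 4
    raise ValueError("address outside the bit-band regions")


print(hex(bitband_alias(0x20000000, 7)))  # 0x2200001c
```

The eighth bit of the region (bit 7 of the first byte) lands on the eighth word of the alias region, as stated above.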
ARM Thumb instructions
ARM Thumb 2 instructions
• Cortex-M3 implements only a portion of Thumb-2
• Variable instruction length
• ARM instructions are fixed length, 32 bit
• Thumb instructions (highly encoded) are fixed length, 16 bit
• Thumb-2 instructions are either 16 or 32 bit
ARM Cortex M3: interrupts
[Figure: Cortex-M3 with the NVIC next to the processor core; inputs: INTNMI and 1-240 interrupts]
• A single non maskable interrupt (INTNMI)
• 1-240 prioritized interrupts
• Maskable interrupts
• Variable (according to the version) number of interrupts
• Vectored interrupt controller (NVIC)
ARM Cortex M3: NVIC
• If a higher priority interrupt arrives during a stack PUSH or POP caused by a previous interrupt, the NVIC immediately reads the pointer to the routine of the higher priority interrupt
• The NVIC also takes care of the power management scheme: on WFI (Wait for Interrupt) and WFE (Wait for Event) instructions the M3 core is automatically put into the low-power state. The same holds for SOE (Sleep On Exit), which puts the core in power down on exit from the lowest priority interrupt
ARM Cortex M3: memory map
Start address   Region                  Size
00000000        Code space              ½GB
20000000        RAM                     ½GB
40000000        Peripheral              ½GB
60000000        External RAM            1GB
A0000000        External peripheral     1GB
E0000000        SCS + NVIC
E0040000        Debug components
E0100000        System (up to FFFFFFFF)
[Figure: the instruction, data and debug ports of the CM3 core reach the regions through a bus matrix with bit-bander, aligner and patch, over ICODE AHB, DCODE AHB, SYSTEM AHB, the internal PPB and the APB]
The memory map is fixed
ARM Cortex M3: protection
• The processor allows up to 8 memory regions to be defined through specific registers
• Each region includes both data and instructions
• Region size: 32 bytes to 3 GBytes
• There are many free open source OSs for the Cortex-M3
• BeRTOS
• ChibiOS
• Contiki OS
• Free RTOS
• Micrium uC/OS-II
• eCos
• NuttX
ARM Cortex M3: simple system
ARM Cortex-A8
Oriented to cell phones, game controllers and navigation systems
Advanced performance with low power consumption
Architecture
• Thumb-2 instructions
• 130 new instructions
• High density and high performance
• NEON unit for signal processing
• Audio, video and 3D graphics
ARM Cortex-A8
[Figure: Cortex-A8 block diagram: instruction fetch unit with L1 I cache, instruction decode unit, instruction execute and load/store unit with L1 D cache, NEON media processor, L2 memory system, AXI level 3 memory interface]
ARM Cortex-A8: register file
ARM Cortex-A8: protection
[Figure: the physical memory holds OS code + data and application code + data; the OS runs in privileged mode, the application code in user mode]
ARM Cortex-A8: memory allocation
[Figure: the virtual addresses issued by the OS (privileged mode) and by the application code (user mode) are translated by the Memory Management Unit into physical addresses; the physical memory holds OS code + data and the code + data of each application]
ARM Cortex-A8: memory management
• The Memory Management Unit (MMU) controls the memory accesses for protection, in addition to the address translation
• The TLBs associate a process identifier to each entry
ARM Cortex-A8
• Superscalar pipeline: dual in-order issue and OOO execution
ARM Cortex-A8
NEON media engine
ARM Cortex-A8 for cell phones
ARM Cortex-A9
ARM Cortex-A15
ARM family
ARM6 → ARM7
• 3 stages pipeline
• Unified memory for data and instructions
• 16 bit Thumb instruction set
• 54 multiplication unit
ARM8 → ARM9 → ARM10
ARM9
• 5 stages pipeline (130 MHz or 200 MHz)
• Separated data and instruction memories
ARM 10
• 300 MHz
• Multimedia support
• Optional vectored floating point unit
ARM11
• 8 stages pipeline
• 1 GHz
• Wireless, consumer, networking and automotive
ARM Cortex family
• High end performance
• ARM Cortex-A for complex OSs and applications. Thumb and Thumb-2 support
• ARM Cortex-R: embedded processor for real-time systems. Thumb and Thumb-2 support
• ARM Cortex-M: embedded processor for low cost applications. Thumb-2 only
The ARM family