ECE5917 SoC Architecture: ARM/AMBA

Tae Hee Han

Semiconductor Systems Engineering

Sungkyunkwan University

Outline

• ARM Processor Architecture

• AMBA Bus

What Is ARM?

• Advanced RISC Machine (RISC: Reduced Instruction Set Computer)

• Founded in November 1990
  - Spun out of Acorn Computers
  - Joint venture of Acorn Computers, Apple Computer, and VLSI Technology

• Designs the ARM range of RISC processor cores

• Licenses ARM core designs to semiconductor partners, who fabricate and sell them to their customers
  - ARM does not fabricate silicon itself

• Also develops technologies to assist with the design-in of the ARM architecture
  - Software tools, boards, debug hardware, application software, bus architectures, peripherals, etc.

• Market leader for low-power and cost-sensitive embedded applications
  - Architectural simplicity allows very small implementations, which result in very low power consumption

The History of ARM

• Developed at Acorn Computers Limited, of Cambridge, England, between 1983 and 1985

• Problems with CISC:
  - Slower than memory parts
    - Due to memory operand and instruction decoding
  - Variable clock cycles per instruction

• Solution: the Berkeley RISC I
  - Competitive
  - Easy to develop (less than a year)
  - Cheap
  - Pointing the way to the future

ARM AArch32

Evolution of the AArch32 (1)

• Original ARM architecture:
  - 32-bit RISC architecture focused on the core instruction set
  - 16 registers, one being the program counter, generally accessible
  - Conditional execution on all instructions
  - Load/Store Multiple operations: good for code density
  - Shifts available on data processing and address generation
  - Original architecture had a 26-bit address space
    - Augmented by a 32-bit address space early in the evolution

• Thumb instruction set was the next big step
  - ARMv4T architecture (ARM7TDMI)
  - Introduced a 16-bit instruction set alongside the 32-bit instruction set
    - Different execution states for different instruction sets
    - Switching ISA as part of a branch or exception
    - Not a full instruction set: ARM still essential

Evolution of the AArch32 (2)

• ARMv5TEJ (ARM926EJ-S) introduced:
  - Better interworking between ARM and Thumb
    - Bottom bit of the address used to determine the ISA
  - DSP-focused additional instructions
  - Jazelle-DBX for Java bytecode interpretation in hardware
  - Some architecting of the virtual memory system

• ARMv6K (ARM1136JF-S) introduced:
  - Media processing: SIMD within the integer datapath
  - Enhanced exception handling
  - Overhaul of the memory system architecture to be fully architected
    - Supported only 1 level of cache

• ARMv7 rolled in a number of substantive changes:
  - Thumb-2*: variable-length instruction set
  - TrustZone*
  - Jazelle-RCT
  - NEON

AArch32 Architecture (1)

• Typical RISC architecture:
  - Large uniform register file
  - Load/store architecture
  - Simple addressing modes
  - Uniform and fixed-length instruction fields

AArch32 Architecture (2)

• Enhancements:
  - Each instruction controls the ALU and shifter
  - Auto-increment and auto-decrement addressing modes
  - Multiple Load/Store
  - Conditional execution

• Results:
  - High performance
  - Low code size
  - Low power consumption
  - Low silicon area

Processor Modes (1)

• The ARM has seven basic processor modes:
  - User: unprivileged mode under which most tasks run
  - Privileged:
    - System: privileged mode using the same registers as User mode
    - Exception modes:
      - FIQ: entered when a high-priority (fast) interrupt is raised
      - IRQ: entered when a low-priority (normal) interrupt is raised
      - Supervisor: entered on reset and when a Software Interrupt instruction is executed
      - Abort: used to handle memory access violations
      - Undef: used to handle undefined instructions

• Mode changes can be made under software control, or can be caused by external interrupts or exception processing
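As a rough illustration of the mode list, the sketch below maps the 5-bit mode field of the CPSR to the seven modes and classifies each one. The bit patterns are the standard ARM encodings, which the slide itself does not list; the names `MODE_BITS`, `is_privileged`, and `is_exception_mode` are hypothetical helpers chosen for this example.

```python
# CPSR mode field M[4:0] for the seven basic modes (standard ARM encodings,
# added here as background; the slide only names the modes).
MODE_BITS = {
    0b10000: "User",
    0b10001: "FIQ",
    0b10010: "IRQ",
    0b10011: "Supervisor",
    0b10111: "Abort",
    0b11011: "Undef",
    0b11111: "System",
}


def is_privileged(mode_bits):
    """Every mode except User is privileged."""
    return MODE_BITS[mode_bits] != "User"


def is_exception_mode(mode_bits):
    """Exception modes are the privileged modes other than System."""
    return MODE_BITS[mode_bits] not in ("User", "System")


if __name__ == "__main__":
    for bits, name in MODE_BITS.items():
        print(f"{bits:05b} {name:10s} privileged={is_privileged(bits)} "
              f"exception={is_exception_mode(bits)}")
```

System mode is the one privileged mode that is not entered by an exception, which is why the two predicates differ only for it.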

Processor Modes (2)

• User mode:
  - Normal program execution mode
  - System resources unavailable
  - Mode changed by exception only

• Exception modes:
  - Entered upon exception
  - Full access to system resources
  - Mode changed freely

AArch32 Registers (1)

• ARM has 37 registers, all of which are 32 bits long.
  - 31 general-purpose registers, including a program counter (PC)
    - At any one time, 16 are visible (R0-R15); the others speed up exception processing
  - 6 status registers
    - 1 dedicated current program status register (CPSR)
    - 5 dedicated saved program status registers (SPSR)

• The current processor mode governs which of several banks is accessible. Each mode can access
  - a particular set of R0-R12 registers
  - a particular R13 (the stack pointer, sp) and R14 (the link register, lr)
  - the program counter, R15 (pc)
  - the current program status register, CPSR

• Privileged modes (except System) can also access
  - a particular SPSR (saved program status register)
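The count of 37 can be checked with a small sketch that enumerates the physical registers bank by bank. It assumes the usual banking layout (FIQ has private copies of R8-R14; IRQ, Supervisor, Abort, and Undef have private copies of R13-R14 only; User and System share one bank). The `_usr`/`_fiq` style suffixes and the function names are illustrative, not official register names.

```python
UNBANKED = [f"R{i}" for i in range(8)]             # R0-R7, shared by all modes
EXCEPTION_MODES = ["fiq", "irq", "svc", "abt", "und"]


def physical_registers():
    """Return (general_purpose, status) lists of physical register names."""
    general = list(UNBANKED)
    general += [f"R{i}_usr" for i in range(8, 13)]  # R8-R12 for every mode but FIQ
    general += [f"R{i}_fiq" for i in range(8, 13)]  # FIQ's private R8-R12
    for bank in ["usr"] + EXCEPTION_MODES:          # User and System share "usr"
        general += [f"R13_{bank}", f"R14_{bank}"]
    general.append("R15")                           # single program counter
    status = ["CPSR"] + [f"SPSR_{m}" for m in EXCEPTION_MODES]
    return general, status


def visible_registers(mode):
    """Return the 16 general-purpose registers visible in `mode`."""
    bank = "usr" if mode in ("usr", "sys") else mode
    low_bank = "fiq" if mode == "fiq" else "usr"
    regs = list(UNBANKED)
    regs += [f"R{i}_{low_bank}" for i in range(8, 13)]
    regs += [f"R13_{bank}", f"R14_{bank}", "R15"]
    return regs


if __name__ == "__main__":
    general, status = physical_registers()
    print(len(general), len(status), len(general) + len(status))
    print(visible_registers("irq"))
```

Because FIQ banks five extra registers, an FIQ handler can keep working state in R8-R12 without saving and restoring them, which is one way the banked registers speed up exception processing.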

AArch32 Registers (2)

• General-purpose registers
  - The GP registers R0-R15 can be split into three groups, which are differentiated by the way they are banked and by their special-purpose uses:
    - The unbanked registers, R0-R7
      - Each of them refers to the same 32-bit physical register in all processor modes
      - They are completely general-purpose registers, with no special uses implied
    - The banked registers, R8-R14
      - The physical register referred to by each of them depends on the current processor mode
      - Where a particular physical register is intended, without depending on the current processor mode, a more specific name is used
    - Register 15, R15 (Program Counter)

AArch32 Registers (3)

[Figure: register bank diagrams, one per mode (User, FIQ, IRQ, SVC, Undef, Abort). Each diagram marks the current visible registers (R0-R12, R13 (sp), R14 (lr), R15 (pc), and CPSR, plus SPSR in the exception modes) and the banked-out registers of the other modes. FIQ banks R8-R12, R13 (sp), R14 (lr), and SPSR; IRQ, SVC, Undef, and Abort each bank R13 (sp), R14 (lr), and SPSR; User mode banks R13 (sp) and R14 (lr).]

AArch32 Register Organization Summary

  Mode    Shared with User mode     Banked (mode-specific) registers
  User    -                         R0-R12, R13 (sp), R14 (lr), R15 (pc), CPSR
  FIQ     R0-R7, R15, and CPSR      R8-R12, R13 (sp), R14 (lr), SPSR
  IRQ     R0-R12, R15, and CPSR     R13 (sp), R14 (lr), SPSR
  SVC     R0-R12, R15, and CPSR     R13 (sp), R14 (lr), SPSR
  Undef   R0-R12, R15, and CPSR     R13 (sp), R14 (lr), SPSR
  Abort   R0-R12, R15, and CPSR     R13 (sp), R14 (lr), SPSR

Thumb state low registers: R0-R7. Thumb state high registers: R8-R15.

Note: System mode uses the User mode register set

Program Status Register

  Bits    Field
  31      N
  30      Z
  29      C
  28      V
  27      Q
  26-25   Undefined
  24      J
  23-8    Undefined
  7       I
  6       F
  5       T
  4-0     mode

  Byte fields: f = bits [31:24], s = bits [23:16], x = bits [15:8], c = bits [7:0]

• Condition code flags
  - N = Negative result from ALU
  - Z = Zero result from ALU
  - C = ALU operation Carried out
  - V = ALU operation oVerflowed

• Sticky Overflow flag: Q flag
  - Architecture 5TE/J only
  - Indicates if saturation has occurred

• J bit
  - Architecture 5TEJ only
  - J = 1: Processor in Jazelle state

• Interrupt disable bits
  - I = 1: Disables the IRQ
  - F = 1: Disables the FIQ

• T bit
  - Architecture xT only
  - T = 0: Processor in ARM state
  - T = 1: Processor in Thumb state

• Mode bits
  - Specify the processor mode
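The field positions can be made concrete with a short decoder. This is a minimal sketch that assumes the bit positions shown for this slide (N, Z, C, V, Q in bits 31-27, J in bit 24, I, F, T in bits 7-5, mode in bits 4-0); `decode_psr` is a hypothetical helper name.

```python
def decode_psr(value):
    """Split a 32-bit program status register value into its named fields."""
    return {
        "N": (value >> 31) & 1,   # negative
        "Z": (value >> 30) & 1,   # zero
        "C": (value >> 29) & 1,   # carry
        "V": (value >> 28) & 1,   # overflow
        "Q": (value >> 27) & 1,   # sticky overflow (saturation)
        "J": (value >> 24) & 1,   # Jazelle state
        "I": (value >> 7) & 1,    # IRQ disable
        "F": (value >> 6) & 1,    # FIQ disable
        "T": (value >> 5) & 1,    # Thumb state
        "mode": value & 0x1F,     # processor mode bits
    }


if __name__ == "__main__":
    # 0x600000D3: Z and C set, IRQ and FIQ disabled, ARM state, mode 0b10011.
    fields = decode_psr(0x600000D3)
    print(fields)
```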

Program Counter (R15)

• When the processor is executing in ARM state:
  - All instructions are 32 bits wide
  - All instructions must be word aligned
  - Therefore the pc value is stored in bits [31:2] with bits [1:0] undefined (as an instruction cannot be halfword or byte aligned)

• When the processor is executing in Thumb state:
  - All instructions are 16 bits wide
  - All instructions must be halfword aligned
  - Therefore the pc value is stored in bits [31:1] with bit [0] undefined (as an instruction cannot be byte aligned)

• When the processor is executing in Jazelle state:
  - All instructions are 8 bits wide
  - Processor performs a word access to read 4 instructions at once

Exception Handling

• When an exception occurs, the ARM:
  - Copies CPSR into SPSR_<mode>
  - Sets appropriate CPSR bits
    - Change to ARM state
    - Change to exception mode
    - Disable interrupts (if appropriate)
  - Stores the return address in LR_<mode>
  - Sets PC to vector address

• To return, the exception handler needs to:
  - Restore CPSR from SPSR_<mode>
  - Restore PC from LR_<mode>

This can only be done in ARM state.

Vector Table

  Address   Exception
  0x1C      FIQ
  0x18      IRQ
  0x14      (Reserved)
  0x10      Data Abort
  0x0C      Prefetch Abort
  0x08      Software Interrupt
  0x04      Undefined Instruction
  0x00      Reset

The vector table can be at 0xFFFF0000 on ARM720T and on ARM9/10 family devices.
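The entry and return sequence can be sketched as a small state model. This is a simplified illustration, not a cycle-accurate description: the CPU state is a plain dictionary, the caller supplies the exact return address (real handlers usually adjust LR by an exception-specific offset), and the mode entered by each exception is taken from the ARM architecture rather than from the slide. All function and key names are hypothetical.

```python
# Vector offsets from the vector table; the mode entered by each exception is
# background knowledge added for this sketch.
VECTORS = {
    "Reset": (0x00, "svc"),
    "Undefined Instruction": (0x04, "und"),
    "Software Interrupt": (0x08, "svc"),
    "Prefetch Abort": (0x0C, "abt"),
    "Data Abort": (0x10, "abt"),
    "IRQ": (0x18, "irq"),
    "FIQ": (0x1C, "fiq"),
}


def new_cpu():
    """Minimal CPU state: User mode, ARM state, interrupts enabled."""
    return {
        "cpsr": {"mode": "usr", "T": 0, "I": 0, "F": 0},
        "spsr": {},   # SPSR_<mode>, keyed by mode
        "lr": {},     # LR_<mode>, keyed by mode
        "pc": 0,
    }


def take_exception(cpu, kind, return_address, high_vectors=False):
    offset, mode = VECTORS[kind]
    cpu["spsr"][mode] = dict(cpu["cpsr"])   # copy CPSR into SPSR_<mode>
    cpu["cpsr"]["T"] = 0                    # change to ARM state
    cpu["cpsr"]["mode"] = mode              # change to exception mode
    cpu["cpsr"]["I"] = 1                    # IRQs are disabled on every exception
    if kind in ("Reset", "FIQ"):
        cpu["cpsr"]["F"] = 1                # FIQs are disabled only by Reset and FIQ
    cpu["lr"][mode] = return_address        # store return address in LR_<mode>
    base = 0xFFFF0000 if high_vectors else 0x00000000
    cpu["pc"] = base + offset               # set PC to the vector address


def exception_return(cpu):
    mode = cpu["cpsr"]["mode"]
    cpu["pc"] = cpu["lr"][mode]             # restore PC from LR_<mode>
    cpu["cpsr"] = dict(cpu["spsr"][mode])   # restore CPSR from SPSR_<mode>


if __name__ == "__main__":
    cpu = new_cpu()
    take_exception(cpu, "IRQ", return_address=0x8004)
    print(hex(cpu["pc"]), cpu["cpsr"])
    exception_return(cpu)
    print(hex(cpu["pc"]), cpu["cpsr"])
```

Restoring the CPSR from the SPSR is what switches the processor back to Thumb state if the interrupted code was running in Thumb.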

Data Sizes and Instruction Set

• The AArch32 is a 32-bit architecture

• When used in relation to the AArch32:
  - Byte means 8 bits
  - Halfword means 16 bits (two bytes)
  - Word means 32 bits (four bytes)

• Most AArch32 implementations provide two instruction sets
  - 32-bit ARM instruction set: standard 32-bit instruction set
  - 16-bit Thumb instruction set
    - 16-bit compressed form
    - Code density better than most CISC
    - Dynamic decompression in pipeline

AArch32 Instruction Set Basic Features

• Load/Store architecture

• 3-address data processing instructions

• Conditional execution

• Load/Store multiple registers

• Shift & ALU operation in single clock cycle

Conditional Execution and Flags

• AArch32 instructions can be made to execute conditionally by postfixing them with the appropriate condition code field

• This improves code density and performance by reducing the number of forward branch instructions

  With a branch:                 With conditional execution:

      CMP  R3,#0                     CMP   R3,#0
      BEQ  skip                      ADDNE R0,R1,R2
      ADD  R0,R1,R2
  skip

• By default, data processing instructions do not affect the condition code flags, but the flags can be optionally set by using "S". CMP does not need "S".

  loop
      …
      SUBS R1,R1,#1      ; decrement R1 and set flags
      BNE  loop          ; if Z flag clear then branch

Condition Codes

• The possible condition codes are listed below:
  - Note AL is the default and does not need to be specified

  Suffix   Description               Flags tested
  EQ       Equal                     Z=1
  NE       Not equal                 Z=0
  CS/HS    Unsigned higher or same   C=1
  CC/LO    Unsigned lower            C=0
  MI       Minus                     N=1
  PL       Positive or Zero          N=0
  VS       Overflow                  V=1
  VC       No overflow               V=0
  HI       Unsigned higher           C=1 & Z=0
  LS       Unsigned lower or same    C=0 or Z=1
  GE       Greater or equal          N=V
  LT       Less than                 N!=V
  GT       Greater than              Z=0 & N=V
  LE       Less than or equal        Z=1 or N!=V
  AL       Always
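To see how the "flags tested" column is used, the sketch below computes the flags a 32-bit `CMP a, b` would produce and then evaluates each suffix against them. It is an illustrative model only; `flags_after_cmp`, `CONDITIONS`, and `condition_passed` are names invented for this example.

```python
MASK = 0xFFFFFFFF


def flags_after_cmp(a, b):
    """Return (N, Z, C, V) as CMP a, b would set them for 32-bit operands."""
    a &= MASK
    b &= MASK
    result = (a - b) & MASK
    n = result >> 31                              # sign bit of the result
    z = int(result == 0)
    c = int(a >= b)                               # carry set means "no borrow"
    v = (((a ^ b) & (a ^ result)) >> 31) & 1      # signed overflow of a - b
    return n, z, c, v


# One predicate per suffix, written directly from the table's third column.
CONDITIONS = {
    "EQ": lambda n, z, c, v: z == 1,
    "NE": lambda n, z, c, v: z == 0,
    "CS": lambda n, z, c, v: c == 1,
    "CC": lambda n, z, c, v: c == 0,
    "MI": lambda n, z, c, v: n == 1,
    "PL": lambda n, z, c, v: n == 0,
    "VS": lambda n, z, c, v: v == 1,
    "VC": lambda n, z, c, v: v == 0,
    "HI": lambda n, z, c, v: c == 1 and z == 0,
    "LS": lambda n, z, c, v: c == 0 or z == 1,
    "GE": lambda n, z, c, v: n == v,
    "LT": lambda n, z, c, v: n != v,
    "GT": lambda n, z, c, v: z == 0 and n == v,
    "LE": lambda n, z, c, v: z == 1 or n != v,
    "AL": lambda n, z, c, v: True,
}


def condition_passed(suffix, flags):
    n, z, c, v = flags
    return CONDITIONS[suffix](n, z, c, v)


if __name__ == "__main__":
    flags = flags_after_cmp(-1, 0)        # -1 is 0xFFFFFFFF as a 32-bit value
    for suffix in ("LT", "HI"):
        print(suffix, condition_passed(suffix, flags))
```

The same flags satisfy both LT and HI in this case: 0xFFFFFFFF is less than 0 as a signed value but higher than 0 as an unsigned value, which is why the table keeps separate signed and unsigned suffixes.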

Examples of Conditional Execution

• Use a sequence of several conditional instructions

  if (a==0) func(1);

      CMP   R0,#0
      MOVEQ R0,#1
      BLEQ  func

• Set the flags, then use various condition codes

  if (a==0) x=0;
  if (a>0)  x=1;

      CMP   R0,#0
      MOVEQ R1,#0
      MOVGT R1,#1

• Use conditional compare instructions

  if (a==4 || a==10) x=0;

      CMP   R0,#4
      CMPNE R0,#10
      MOVEQ R1,#0

Branch Instructions

• Branch: B{<cond>} label

• Branch with Link: BL{<cond>} subroutine_label

  Bits    31-28   27-25   24   23-0
  Field   Cond    1 0 1   L    Offset

  - Cond: condition field
  - L: link bit (0 = Branch, 1 = Branch with link)

• The processor core shifts the offset field left by 2 positions, sign-extends it and adds it to the PC
  - ± 32 Mbyte range
  - How to perform longer branches?
    - Can set up LR manually if needed, then load into PC

          MOV lr, pc
          LDR pc, =dest

    - The ADS linker will automatically generate long branch veneers for branches beyond the 32 MB range.

Data Processing Instructions

• Consist of:
  - Arithmetic: ADD ADC SUB SBC RSB RSC
  - Logical: AND ORR EOR BIC
  - Comparisons: CMP CMN TST TEQ
  - Data movement: MOV MVN

• These instructions only work on registers, NOT memory

• Syntax:

      <Operation>{<cond>}{S} Rd, Rn, Operand2

  - Comparisons set flags only: they do not specify Rd
  - Data movement does not specify Rn

• Second operand is sent to the ALU via barrel shifter

The Barrel Shifter

• LSL: Logical Shift Left
  - Multiplication by a power of 2

• LSR: Logical Shift Right
  - Division by a power of 2

• ASR: Arithmetic Shift Right
  - Division by a power of 2, preserving the sign bit

• ROR: Rotate Right
  - Bit rotate with wrap-around from LSB to MSB

• RRX: Rotate Right Extended
  - Single-bit rotate with wrap-around from CF to MSB

[Figure: one diagram per shift type, showing the destination register and the carry flag (CF), which receives the last bit shifted out.]
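The five shift types can be modelled on 32-bit values as shown below. This is a minimal sketch for shift amounts from 1 to 31 (the special cases for amounts of 0 and 32 or more are left out); each function returns the shifted value together with the carry-out. The function names are illustrative.

```python
MASK = 0xFFFFFFFF


def lsl(value, n):
    """Logical shift left by n (1..31): zeros enter at the LSB."""
    carry = (value >> (32 - n)) & 1            # last bit shifted out of the MSB
    return (value << n) & MASK, carry


def lsr(value, n):
    """Logical shift right by n (1..31): zeros enter at the MSB."""
    carry = (value >> (n - 1)) & 1             # last bit shifted out of the LSB
    return value >> n, carry


def asr(value, n):
    """Arithmetic shift right by n (1..31): copies of the sign bit enter at the MSB."""
    carry = (value >> (n - 1)) & 1
    result = value >> n
    if value >> 31:                            # negative: fill the top n bits with ones
        result |= (MASK << (32 - n)) & MASK
    return result, carry


def ror(value, n):
    """Rotate right by n (1..31): bits leaving the LSB re-enter at the MSB."""
    result = ((value >> n) | (value << (32 - n))) & MASK
    return result, result >> 31                # carry is the new MSB


def rrx(value, carry_in):
    """Rotate right by one bit through the carry flag (a 33-bit rotate)."""
    return (carry_in << 31) | (value >> 1), value & 1


if __name__ == "__main__":
    print([hex(x) for x in asr(0x80000000, 4)])
    print([hex(x) for x in rrx(0x00000001, 1)])
```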

Using the Barrel Shifter: The Second Operand

[Figure: Operand 1 feeds the ALU directly; Operand 2 passes through the barrel shifter before reaching the ALU; the ALU produces the result.]

Register, optionally with shift operation

• Shift value can be either:
  - 5-bit unsigned integer
  - Specified in bottom byte of another register

• Used for multiplication by constant

Immediate value

• 8-bit number, with a range of 0-255
  - Rotated right through even number of positions

• Allows increased range of 32-bit constants to be loaded directly into registers
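Whether a given 32-bit constant fits the "8-bit value rotated right by an even amount" form can be checked by trying all rotations. The sketch below assumes the standard ARM encoding, in which a 4-bit field holds half of the rotate amount (so rotations of 0, 2, ..., 30 are possible); `encode_immediate` and `decode_immediate` are hypothetical helper names.

```python
MASK = 0xFFFFFFFF


def ror32(value, n):
    """Rotate a 32-bit value right by n bits (n is taken modulo 32)."""
    n %= 32
    return ((value >> n) | (value << (32 - n))) & MASK


def decode_immediate(rot, imm8):
    """Value represented by an 8-bit immediate rotated right by 2 * rot."""
    return ror32(imm8, 2 * rot)


def encode_immediate(constant):
    """Return (rot, imm8) for an encodable constant, or None if there is none."""
    constant &= MASK
    for rot in range(16):
        imm8 = ror32(constant, 32 - 2 * rot)   # undo a right-rotate by 2 * rot
        if imm8 <= 0xFF:
            return rot, imm8
    return None


if __name__ == "__main__":
    for value in (0xFF, 0xFF000000, 0x3FC, 0x101):
        print(hex(value), encode_immediate(value))
```

A constant such as 0x101 has set bits nine positions apart, so no 8-bit window covers it and it has to be built another way, for example with a literal-pool load.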


Conditional codes + Data processing instructions + Barrel shifter = Powerful tools for efficiently coded programs

e.g.:

  if (z==1) R1=R2+(R3*4)

compiles to

      ADDEQS R1,R2,R3, LSL #2

(SINGLE INSTRUCTION!)

THUMB Instruction Set

• Thumb is a 16-bit instruction set
  - Optimized for code density from C code (~65% of ARM code size)
  - Improved performance from narrow memory
  - Subset of the functionality of the ARM instruction set

• Core has additional execution state: Thumb
  - Switch between ARM and Thumb using BX instruction

• For most instructions generated by compiler:
  - Conditional execution is not used
  - Source and destination registers identical
  - Only Low registers used
  - Constants are of limited size
  - Inline barrel shifter not used

  32-bit ARM instruction (bits 31-0):      ADDS R2,R2,#1
  16-bit Thumb instruction (bits 15-0):    ADD  R2,#1
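The state switch performed by BX can be sketched in a few lines. The model assumes the usual interworking rule, in which bit 0 of the target register selects the instruction set and is not part of the branch address; `bx` is a hypothetical helper name.

```python
def bx(target_register):
    """Model BX Rm: return (new_pc, thumb_state) for the value held in Rm."""
    thumb_state = target_register & 1              # bit 0 selects Thumb (1) or ARM (0)
    new_pc = target_register & ~1 & 0xFFFFFFFF     # bit 0 is cleared in the address
    return new_pc, thumb_state


if __name__ == "__main__":
    print(bx(0x00008001))   # enter Thumb code at 0x8000
    print(bx(0x00009000))   # enter ARM code at 0x9000
```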

ARM7 Family

• 32/16-bit RISC architecture

• Unified bus architecture
  - Both instructions and data use the same bus

• 3-stage pipeline
  - Fetch / decode / execute

• Coprocessor interface

• EmbeddedICE-RT support

• JTAG interface unit

• Optional support for MMU
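How a 3-stage pipeline overlaps instructions can be shown with an idealized occupancy table. The sketch assumes one instruction enters per cycle with no stalls or branches, which real code does not guarantee; `pipeline_table` is a hypothetical helper name.

```python
STAGES = ["Fetch", "Decode", "Execute"]


def pipeline_table(n_instructions, stages=STAGES):
    """Return, per cycle, a dict mapping each busy stage to an instruction index."""
    rows = []
    for cycle in range(n_instructions + len(stages) - 1):
        row = {}
        for depth, name in enumerate(stages):
            index = cycle - depth        # instruction that entered `depth` cycles ago
            if 0 <= index < n_instructions:
                row[name] = index
        rows.append(row)
    return rows


if __name__ == "__main__":
    for cycle, row in enumerate(pipeline_table(4)):
        print(cycle, row)
```

While instruction 0 is in Execute, instruction 2 is already being fetched, two instructions ahead. Passing a five-entry stage list models a 5-stage pipeline in the same way.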

ARM7 Block Diagram

[Figure: ARM7 datapath. Blocks: address register with +1 incrementer (PC update), register bank (including PC), multiplier, barrel shifter, ALU, data-in register, data-out register, instruction decoder, and controller (control signals, immediate constant). Buses: A[31:0], D[31:0], A-bus, B-bus, ALU-bus.]

ARM7 Family

• ARM7 (common features)
  - ARM v4T ISA
  - Von Neumann architecture
  - Thumb support
  - Debug support
  - 3-stage pipeline

• ARM7TDMI (ARM7)
  - Base integer core (hard macro IP)
  - 100 MIPS (0.18 um)
  - 0.25 mW/MHz
  - NEW: low-voltage layout

• ARM7TDMI-S (ARM7-S)
  - Base integer core
  - 80-110 MHz (0.18 um)
  - 0.28 mW/MHz
  - Synthesizable (soft IP)

• ARM720T (ARM7, 8K cache, MMU, AHB I/F, ETM7 I/F)
  - Platform OS core
  - 100 MHz (0.13 um)
  - 0.20 mW/MHz
  - Hard macro IP

• ARM740T (ARM7, 8K cache, MPU, ASB I/F)
  - Embedded RTOS core
  - Hard macro IP

ARM9 Family

• 32/16-bit RISC architecture

• Harvard architecture
  - Separate memory bus architecture

• 5-stage pipeline
  - Fetch / Decode / Execute / Memory / Write

• Coprocessor interface

• EmbeddedICE-RT support

• JTAG interface unit

• Embedded Trace Macrocell

ARM920T Block Diagram

[Figure: ARM920T block diagram. Blocks: ARM9TDMI processor core (integral EmbeddedICE), instruction cache, instruction MMU, data cache, data MMU, write-back PA TAG RAM, write buffer, CP15, R13, external coprocessor interface, trace interface port, JTAG, and ASB bus interface. Signals: IPA[31:0], IMVA[31:0], ID[31:0], IVA[31:0], DVA[31:0], DD[31:0], DMVA[31:0], DPA[31:0], WBPA[31:0], DINDEX[5:0].]

ARM9 Family

• ARM9T (common features)
  - ARM v4T ISA
  - Harvard architecture
  - Thumb support
  - Debug support
  - 5-stage pipeline
  - Dual caches
  - AMBA ASB, AHB bus

• ARM9TDMI
  - Base integer processor core

• ARM920T (ARM9T, 16K cache, MMU, ASB/AHB I/F, ETM9 I/F)
  - Platform OS processor core
  - Up to 200 MIPS (typical)
  - 180 MHz (typical)

• ARM922T (ARM9, 8K cache, MMU, ASB/AHB I/F, ETM9 I/F)
  - Platform OS processor core
  - Up to 200 MIPS (0.18 um)
  - 0.8 mW/MHz

• ARM940T (ARM9T, 4K cache, MPU, ASB/AHB I/F)
  - Embedded RTOS processor core

ARM9E Family

• ARM9E-S (common features)
  - Synthesizable (soft IP)
  - ARM v5TE ISA
  - Thumb support
  - Debug support
  - 5-stage pipeline
  - Dual caches
  - AMBA AHB bus

• ARM926EJ-S (ARM9E-S, Jazelle enhanced, flexible caches, TCM I/F, MMU, AHB I/F, ETM9 I/F)
  - Platform OS processor core
  - 200 MIPS
  - 220 MHz (0.18 um)

• ARM946E-S (ARM9E-S, flexible caches, TCM I/F, MPU, AHB I/F, ETM9 I/F)
  - Embedded OS processor core
  - 166 MHz (0.18 um)
  - 0.86 mW/MHz

• ARM966E-S (ARM9E-S, flexible SRAMs, AHB I/F, ETM9 I/F)
  - Embedded RTOS processor core

• ARM968E-S
  - Low-power embedded core
  - 270 MHz (0.13 um)
  - 0.14 mW/MHz

ARM10 Family

• 6-stage pipeline

• Parallel load/store unit
  - Allows computation to continue while data transfers complete

• 64-bit data bus

• Optional support for vector floating-point
  - 7th stage in pipeline
  - Increases FP performance
  - Out-of-order completion

ARM1026EJ-S block diagram

• ARM10EJ-S integer core in an ARMv5TEJ implementation
  - 32-bit ARM
  - 16-bit Thumb
  - 8-bit Jazelle instruction set

• MMU
  - Single TLB for both instruction and data

• Memory Protection Unit (MPU)
  - Partitions external memory into 8 protection regions

• Cache (I/D)
  - Configurable to 0KB or 4-128KB

• Tightly Coupled Memory (TCM)
  - Configurable to 0KB or 4KB-1MB

[Figure: ARM1026EJ-S block diagram. Blocks: ARM1026 core, instruction cache and data cache (each with MMU/MPU), instruction TCM interface, data TCM interface, ETM10C interface, VFP10 interface, VIC interface, write buffer, control logic and bus interface unit, and AMBA AHB interfaces.]

ARM11 Family

• 8-stage pipeline
  - Load-store pipeline
  - Arithmetic pipeline

• Parallel load/store unit
  - Allows computation to continue while data transfers complete
  - Non-blocking cache

• 64-bit data bus

• Out-of-order completion

ARM1136JF-S block diagram

• ARM11 core in an ARMv6 implementation
  - 32-bit ARM
  - 16-bit Thumb
  - 8-bit Jazelle instruction set

• Load Store Unit (LSU)
  - Manages all load and store operations
  - Load-store pipeline
    - Decouples loads and stores from the MAC and ALU pipelines

• Prefetch unit

• Memory system
  - Harvard architecture
  - 64-bit datapaths
  - 2-channel DMA into TCM

[Figure: ARM1136JF-S block diagram. Blocks: core, prefetch unit, load store unit, vector floating point (ARM1136JF-S only), instruction cache / TCRAM, data cache / TCRAM, main translation look-aside buffer, level-one instruction-side and data-side cache controllers, DMA, system metrics, debug/JTAG, ETM, VIC, and external coprocessor interface. Bus ports: instruction fetch, DMA, data read, data write, peripheral.]

ARM11 Family

• ARM11 (common features)
  - Platform OS core
  - ARM v6 ISA
  - Thumb code compression
  - DSP extensions
  - 8-stage pipeline (35% enhanced)
  - SIMD media processing
  - Multi-port 64-bit memory system
  - Optional VFP coprocessor
  - TCM (scratch-pad) memory

• ARM1136JF-S (ARM11, I/D cache, TCM, VFP, multi-port M-layer AHB)
  - Platform OS core
  - 620 MHz (90 nm)
  - 0.45 mW/MHz
  - Floating-point coprocessor

• ARM1136J-S (ARM11, I/D cache, TCM, multi-port M-layer AHB)
  - Platform OS core
  - 620 MHz (90 nm)
  - 0.45 mW/MHz

ARM’s First MPCore Architecture based on ARM11

[Figure: ARM11 MPCore. Configurable between 1 and 4 symmetric CPUs, each a CPU/VFP with its own L1 memory, timer, watchdog, and CPU interface (IRQ). An interrupt distributor takes configurable HW interrupt lines; each CPU also has a private FIQ line. A Snoop Control Unit (SCU) connects the CPUs through I&D 64-bit buses and a coherency control bus.]

INTEL SA & XSCALE Family

• SA core (StrongARM)
  - ARM8
  - ARM v4 architecture
  - Harvard architecture, I/D cache
  - 5-stage pipeline
  - Variants:
    - SA110: up to 203 MHz (typical); core support only
    - SA1100: 199 MHz; SA core, DRAM, 16K I-cache, 8K D-cache, MMU
    - SA1110: 233 MHz (typical); SA core, DRAM, SRAM, 16K I-cache, MMU

• XScale
  - ARM v5T architecture
  - Thumb mode support
  - 7-stage integer, 8-stage memory superpipeline
  - Variants:
    - PXA210 (250): up to 400 MHz; XScale core, 64-bit BIU, 2K Mini/D cache, 32K I/D cache, I/D MMU

Cortex-A8 Architecture

• Cortex-A base architecture
  - Thumb-2 technology for power-efficient execution
  - TrustZone™ for secure applications
  - v6 SIMD for compatibility with ARM11
  - Media acceleration applications

• Cortex-A8 extensions
  - Jazelle-RCT for efficient acceleration of execution environments such as Java and Microsoft .NET
  - NEON technology accelerating multimedia, gaming, and signal processing applications
  - VFPv3 supports the full IEEE 754 specification and has been expanded to support 32 registers

[Figure: Cortex-A8 block diagram. Blocks: instruction fetch unit with L1 I-cache, instruction decode unit, instruction execution & load/store unit with L1 D-cache, NEON media processor, L2 memory system, and AXI L3 memory interface.]

Current ARM MP Core Architecture Overview

• Scalable 1-4 core solutions (ARM11, Cortex-A5, Cortex-A7, Cortex-A9, Cortex-A15, Cortex-A53, Cortex-A57)

• Supporting AMP/SMP models and any combination of the two

• Memory system and cache subsystems optimized for MP

• Interrupt controller designed for distribution across multiple cores

[Figure: MPCore overview. Four CPUs (each an MPCore CPU with FPU/NEON, I-cache, and D-cache), debug and trace architecture, distributed interrupts logic, coherency logic, and external memory.]

Source: ARM (2013)

Cortex-A5: High Volume and Value

• Power-efficient performance
  - Simple in-order 8-stage, single-issue pipeline
  - Improved branch prediction

• Full feature set of Cortex-A9
  - NEON, FPU, TrustZone
  - Symmetric Multi-processing (SMP)

• Highly configurable
  - Uniprocessor-only version available
  - Optional NEON / FPU
  - Optional external L2 cache

[Figure: Cortex™-A5 MPCore, cores 1-4. Each core: ARMv7 32b CPU, NEON™ SIMD engine, floating-point unit, 4-64k I-cache, 4-64k D-cache. Shared: ACP (Accelerator Coherency Port), SCU (Snoop Control Unit), 1-2x 64-bit AMBA AXI bus interface.]

Source: ARM (2013)

Cortex-A7: Most Efficient ARMv7

• Power-efficient µ-architecture
  - In-order 8-stage, partial dual-issue
  - Efficient memory system
    - Integrated L2 cache

• Full feature set of Cortex-A15
  - Hardware-enhanced virtualization
  - 1 TB physical memory (40-bit addressing)
  - NEON, FPU, TrustZone
  - big.LITTLE companion to Cortex-A15 using AMBA4 ACE

[Figure: Cortex™-A7 MPCore, cores 1-4. Each core: ARMv7 32b CPU, NEON™ SIMD engine, floating-point unit, 8-64k I-cache, 8-64k D-cache. Shared: SCU, L2 cache (128KB-1MB), 128-bit AMBA 4 coherent bus interface.]

Source: ARM (2013)

Cortex-A9: Widely Adopted

• Performance- and power-optimized multi-core processor
  - Dual-issue, out-of-order pipeline
  - Scalable SMP: up to 4 cores
  - Accelerator Coherency Port (ACP)

• Flexible system architecture
  - Available as a single CPU also
  - Configurable cache sizes
  - Optional second AXI interface
  - Optimized L2 cache controller
  - NEON, FPU, TrustZone

[Figure: Cortex™-A9 MPCore, cores 1-4. Each core: ARMv7 32b CPU, NEON™ SIMD engine, floating-point unit, 16-64k I-cache, 16-64k D-cache. Shared: ARM CoreSight™ multicore debug and trace, ACP, SCU, dual 64-bit AMBA3 AXI.]

Source: ARM (2013)

Cortex-A9 Architecture

• Out-of-order execution

• Register renaming to enable execution speculation

• Non-blocking memory system with load-store forwarding

• Fast-loop mode in instruction prefetch to lower power consumption

[Figure: Cortex-A9 single-core processor. Front end: instruction cache, instruction prefetch stage with fast-loop mode, branch prediction (global history buffer, branch target address cache, return stack), prefetch queue, and instruction queue. Decode and issue: dual-instruction decode stage, register rename stage with virtual-to-physical register pool, branch monitor, and instruction queue and dispatch with a 3+1 dispatch stage (out-of-order multi-issue with speculation). Execution: ALU/MUL, ALU, FPU/NEON, and address units, followed by an out-of-order write-back stage. Memory system: load-store unit (quad-slot with forwarding), store buffer, data cache, µTLB, MMU, and auto-prefetcher. Debug and trace: CoreSight debug access port, profiling monitor block, and program trace unit (CoreSight trace). System-level labels: IRQ/FIQ from a PL390 interrupt controller, CoreSight/JTAG debug, AMBA 3 AXI 64-bit, and a PL310 L2 cache controller (L2 cache control, ECC RAMs, bus interface unit) with a master interface and a secondary master with filtering (advanced bus interface unit).]

ARM Cortex-A9 MP Core Architecture

[Figure: Cortex-A9 MPCore architecture. Four Cortex-A9 CPUs, each with FPU/NEON, a PTM interface, I-cache and D-cache; ARM CoreSight multicore debug and trace architecture; generalized interrupt control and distribution; Snoop Control Unit (SCU) with snoop filtering, cache-to-cache transfers, timers and an Accelerator Coherence Port; L2 cache controller (PL310); primary AMBA 3 64-bit interface and an optional second interface with address filtering.]

PTM: Program Trace Macrocell

Cortex-A12 CPU: High Performance Embedded

n 40% performance uplift over Cortex-A9

n Same best-in-class energy efficiency

n The most area- and cost-efficient solution

n Premium mobile features
  n big.LITTLE™ processing enabled
  n Greater than 4GB addressable memory
  n Security with Virtualization and TrustZone®

[Figure: Cortex-A12 MPCore block diagram. Cores 1-4, each with an ARMv7 32-bit CPU, NEON data engine, floating point unit, 32-64k I-cache and 32k D-cache; ACP; SCU; L2 cache with ECC (256kB-8MB); 128-bit AMBA 4 AXI bus interface; peripheral port. Source: ARM (2013)]

ARM Cortex-A15 Core Architecture

[Figure: Cortex-A15 core pipeline. In-order front end (12 stages): L1 instruction cache, instruction fetch (128 bits), branch prediction (global history buffer, 64-entry micro-BTB, 256-entry branch target buffer, return stack), 32-entry loop buffer, 3-way instruction decode (3 instructions), register rename. Out-of-order back end (3~12 stages): dispatch and issue (8-entry queue per issue port, 8-instruction issue) feeding simple cluster 0 and simple cluster 1 (integer ALU and shifter, includes v6 SIMD), branch, multiply/divide cluster (ARM multiply and integer divide), two complex clusters (all NEON and FPU operations, quad FMAC) and the load/store unit (1 load and 1 store per cycle, L1 data cache, store buffer); write-back through a 60-entry retirement buffer; L2 cache control and bus interface unit (BIU). Additional stage-count callouts in the figure: 5 stages, 7 stages, 1 stage, 1 stage, 2~10 stages. Source: ARM (2013)]

ARM Cortex-A15 MP Core Architecture

n 2.5GHz in 28 HP process
  n 12 stage in-order, 3-12 stage OoO pipeline
  n 3.5 DMIPS/MHz ~ 8750 DMIPS @ 2.5GHz

n ARMv7A with 40-bit PA

n Dynamic repartitioning Virtualization
  n Fast state save and restore
  n Move execution between cores/clusters

n 128-bit AMBA 4 ACE bus
  n Supports system coherency

n ECC on L1 and L2 caches
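
The headline numbers on this slide follow from simple arithmetic. The sketch below reproduces them; the function names are illustrative only and are not taken from any ARM tool.

```python
def peak_dmips(dmips_per_mhz, clock_mhz):
    """Peak Dhrystone throughput for one core at the given clock."""
    return dmips_per_mhz * clock_mhz


def addressable_bytes(pa_bits):
    """Size of the physical address space for a given address width."""
    return 1 << pa_bits


# 3.5 DMIPS/MHz at 2.5 GHz (2500 MHz)
print(peak_dmips(3.5, 2500))             # 8750.0
# 40-bit physical addresses reach 1 TiB
print(addressable_bytes(40) // 2**40)    # 1
```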

[Figure: Cortex-A15 MPCore architecture. Four Cortex-A15 CPUs (virtualization, ECC), each with NEON/FastFP, a PTM interface, I-cache and D-cache; ARM CoreSight multicore debug and trace architecture; generalized interrupt control and distribution; Accelerator Coherence Port; system MMU; Snoop Control Unit (SCU) with cache-to-cache transfers, snoop filtering and timers; L2 cache controller; advanced bus interface unit (> 32-bit addressing); 64-bit/128-bit AMBA 4 interface. Source: ARM (2012)]

PTM: Program Trace Macrocell

Cortex-A15 Pipeline

[Figure: 12-stage in-order pipeline (fetch, 3-instruction decode, rename) followed by a 3~12 stage out-of-order pipeline (capable of 8 issue) with simple cluster 0, simple cluster 1, multiply accumulate, two complex clusters, load & store 0 and load & store 1.]

ARM Application Processors for Embedded


Summary ARMv7-A


Cortex-A5 MPCore
  Features: Thumb / Thumb-2 / DSP / SIMD / Optional VFPv4-D16 FPU / Optional NEON / Jazelle RCT and DBX, 1–4 cores / optional MPCore, snoop control unit (SCU), generic interrupt controller (GIC), accelerator coherence port (ACP)
  Cache, MMU: 4-64 KB / 4-64 KB L1, MMU + TrustZone
  Performance: 1.57 DMIPS/MHz per core

Cortex-A7 MPCore
  Features: Thumb / Thumb-2 / DSP / VFPv4-D16 FPU / NEON / Jazelle RCT and DBX / Hardware virtualization, in-order execution, superscalar, 1–4 SMP cores, Large Physical Address Extensions (LPAE), SCU, GIC, ACP, architecture & feature set are identical to A15, 8-10 stage pipeline
  Cache, MMU: 32 KB / 32 KB L1, 0-4 MB L2, MMU + TrustZone

Cortex-A8
  Features: Thumb / Thumb-2 / VFPv3 FPU / NEON / Jazelle RCT and DAC, 13-stage superscalar pipeline
  Cache, MMU: 16-32 KB / 16-32 KB L1, 0-1 MB L2 opt ECC, MMU + TrustZone
  Performance: 2.0 DMIPS/MHz

Cortex-A9 MPCore
  Features: Thumb / Thumb-2 / DSP / Optional VFPv3 FPU / Optional NEON / Jazelle RCT and DBX, out-of-order speculative issue superscalar, 1–4 SMP cores, SCU, GIC, ACP
  Cache, MMU: 16-64 KB / 16-64 KB L1, 0-8 MB L2 opt parity, MMU + TrustZone
  Performance: 2.5 DMIPS/MHz per core

Cortex-A12 MPCore
  Features: Thumb-2 / DSP / VFPv4 FPU / NEON / Hardware virtualization, out-of-order speculative issue superscalar, 1–4 SMP cores, LPAE, SCU, GIC, ACP
  Cache, MMU: 32-64 KB / 32 KB L1, 256 KB-8 MB L2

Cortex-A15 MPCore
  Features: Thumb / Thumb-2 / DSP / VFPv4 FPU / NEON / Integer divide / Fused MAC / Jazelle RCT / Hardware virtualization, out-of-order speculative issue superscalar, 1–4 SMP cores, LPAE, SCU, GIC, ACP, 15-24 stage pipeline
  Cache, MMU: 32 KB I$ w/parity / 32 KB D$ w/ECC L1, 0-4 MB L2, L2 has ECC, MMU + TrustZone
  Performance: At least 3.5 DMIPS/MHz per core

AArch32 Summary

AArch32 Evolution (1/4)

Version 1, 2, 3
  Features: Early ARM architectures

Version 4
  Features: Load-store instructions for signed and unsigned halfword / byte; System mode
  Cores: SA-110, SA-1110

Version 4T
  Features: Halfword and signed halfword / byte support; System mode; Thumb instruction set
  Cores: ARM7TDMI, ARM9TDMI

Version 5TE
  Features: Improved ARM / Thumb state change; CLZ (Count Leading Zero) instruction; Saturated maths; DSP multiply-accumulate instructions
  Cores: ARM1020E, XScale

Version 5TEJ
  Features: Jazelle (Java bytecode accelerator)
  Cores: ARM9EJ-S, ARM1026EJ-S

Version 6
  Features: SIMD instructions; Multi-processing; V6 memory architecture (VMSA); Unaligned and mixed endian data support
  Cores: ARM1136EJ-S

Version 7
  Features: 7-A (Applications): NEON; 7-R (Real-time): Hardware divide; 7-M (Microcontroller): Thumb-2 only, Hardware divide
  Cores: Cortex-A / R / M


n ARM v1
  n Developed at Acorn Computers Limited between 1983 and 1985
  n 26-bit addressing
  n No multiplication, CPSR, … (never used in a commercial product)

n ARM v2
  n Still 26-bit addressing (version 1 extension)
  n 32-bit result multiply (with accumulate) and coprocessor support
  n Atomic load and store instruction (SWP) support (ARM v2a)

n ARM v3
  n Developed at ARM Limited (1990)
  n 32-bit addressing
  n CPSR, SPSR added
  n Multi mode support: Undefined, Abort mode added
  n ARM6, ARM610, ARM7, ARM710 core (Apple Newton PDA!)


n ARM v4
  n Fully 32-bit addressing
  n ARMv4T: Thumb mode support (T variant)
  n ARMv4 is deployed for StrongARM and ARM810 / ARMv4T: ARM7TDMI, ARM720T, ARM9TDMI, ARM920T, ARM940T

n ARM v5TE{J}
  n Superset of ARMv4 (1999)
  n Enhanced DSP instructions
  n Jazelle: Java H/W Acceleration
  n ARM9E, ARM10TDMI, ARM1020E, XScale

n ARM v6{Z|T2}
  n Media instruction (SIMD) support (2001)
  n ARMv6Z (TrustZone), ARMv6T2 (Thumb-2)
  n ARM1136J, ARM1176JZ


n ARM Architecture Versions and Variants (ARM v7)
  n Application profile (ARM v7-A)
    n Memory management support (MMU)
    n Highest performance at low power
      n Influenced by multi-tasking OS system requirements
    n TrustZone and Jazelle-RCT for a safe, extensible system
  n Real-time profile (ARM v7-R)
    n Protected memory (MPU)
    n Low latency and predictability ‘real-time’ needs
    n Evolutionary path for traditional embedded business
  n Microcontroller profile (ARM v7-M, ARM v6-M)
    n Lowest gate count entry point
    n Deterministic and predictable behavior a key priority
    n Deeply embedded use

Extensions to ARMv7

n MPE – Multiprocessing Extensions
  n Added Cache and TLB Maintenance Broadcast for efficient MP

n VE - Virtualization Extensions
  n Adds hardware support for virtualization:
    n 2 stages of translation in the memory system
    n New mode and privilege level for holding a Hypervisor
      n With associated traps on many system relevant instructions
    n Support for interrupt virtualization
  n Combines with a System MMU

n LPAE – Large Physical Address Extensions
  n Adds ability to address up to 40 bits of physical address space
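
The "2 stages of translation" bullet can be illustrated with a toy model: the guest OS maps a virtual address (VA) to an intermediate physical address (IPA), and the hypervisor maps that IPA to the final physical address (PA). This is only a sketch under simplifying assumptions: flat dictionary page tables, 4 KB pages, and made-up table contents. Real hardware walks multi-level tables.

```python
PAGE_SHIFT = 12                      # assumed 4 KB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1


def translate(table, addr):
    """Map one address through a flat {page number: frame number} table."""
    page = addr >> PAGE_SHIFT
    if page not in table:
        raise LookupError(f"translation fault at {addr:#x}")
    return (table[page] << PAGE_SHIFT) | (addr & PAGE_MASK)


def two_stage(stage1, stage2, va):
    """Stage 1 (guest OS): VA -> IPA. Stage 2 (hypervisor): IPA -> PA."""
    ipa = translate(stage1, va)
    return translate(stage2, ipa)


guest_table = {0x10: 0x80}           # hypothetical guest mapping
hyp_table = {0x80: 0x12345}          # hypothetical hypervisor mapping
print(hex(two_stage(guest_table, hyp_table, 0x10ABC)))   # 0x12345abc
```

The guest never sees the second table, which is what lets a hypervisor relocate guest memory without guest cooperation.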


Announcement

n Homework #3 (Due: Oct. 30)
  n Compare the ISA between Intel Atom and ARM AArch32
  n Free format: Either MS Word or MS PPT

ARM NEON

NEON?

n NEON is a wide SIMD data processing architecture
  n Extension of the ARM instruction set
  n 32 registers, 64-bits wide (dual view as 16 registers, 128-bits wide)

n NEON Instructions perform “Packed SIMD” processing
  n Registers are considered as vectors of elements of the same data type
  n Data types can be: signed/unsigned 8-bit, 16-bit, 32-bit, 64-bit, single prec. float
  n Instructions perform the same operation in all lanes

[Figure: packed SIMD operation. Each lane of source registers Dn and Dm goes through the same operation to produce the matching lane of destination register Dd.]
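
To make "the same operation in all lanes" concrete, the sketch below emulates a lane-wise add on 64-bit D registers held as Python integers. It is a software model only: the helper names are invented here, and the wrap-around (modulo) behaviour per lane is an assumption that matches a plain integer add rather than a saturating one.

```python
def split_lanes(reg, lane_bits, reg_bits=64):
    """Split a register value into lanes, lane 0 being the least significant."""
    mask = (1 << lane_bits) - 1
    return [(reg >> (i * lane_bits)) & mask for i in range(reg_bits // lane_bits)]


def join_lanes(lanes, lane_bits):
    """Pack a list of lane values back into one register value."""
    mask = (1 << lane_bits) - 1
    reg = 0
    for i, value in enumerate(lanes):
        reg |= (value & mask) << (i * lane_bits)
    return reg


def packed_add(dn, dm, lane_bits):
    """Add Dn and Dm lane by lane; a carry never crosses a lane boundary."""
    sums = [a + b for a, b in zip(split_lanes(dn, lane_bits), split_lanes(dm, lane_bits))]
    return join_lanes(sums, lane_bits)


dn = 0x0001_0002_0003_FFFF
dm = 0x0010_0020_0030_0001
print(f"{packed_add(dn, dm, 16):016x}")   # 0011002200330000
```

Lane 0 overflows (0xFFFF + 1) and wraps to 0 without disturbing lane 1, which is the difference from a single 64-bit add.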

NEON: Data Types

n NEON natively supports a set of common data types
  n Integer and Fixed-Point; 8-bit, 16-bit, 32-bit and 64-bit
  n 32-bit Single-precision Floating-point

n Data types are represented using a bit-size and format letter

Size     Any    Integer  Signed  Unsigned  Polynomial  Float
8-bit    .8     .I8      .S8     .U8       .P8         -
16-bit   .16    .I16     .S16    .U16      .P16        -
32-bit   .32    .I32     .S32    .U32      -           .F32
64-bit   .64    .I64     .S64    .U64      -           -

n 8/16-bit: Signed, Unsigned Integers; Polynomials
n 32-bit: Floats
n 64-bit: Signed, Unsigned Integers

NEON: Registers

n NEON provides a 256-byte register file
  n Distinct from the core registers
  n Extension to the VFPv2 register file (VFPv3)

n Two explicitly aliased views
  n 32 x 64-bit registers (D0-D31)
  n 16 x 128-bit registers (Q0-Q15)

n Enables register trade-off
  n Vector length
  n Available registers

n Also uses the summary flags in the VFP FPSCR
  n Adds a QC integer saturation summary flag
  n No per-lane flags, so ‘carry’ handled using wider result (16-bit + 16-bit → 32-bit)

[Figure: register file views. D0, D1, D2, D3, ..., D30, D31 alias pairwise onto Q0, Q1, ..., Q15.]
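
The two aliased views can be modelled as one 256-byte buffer read at different widths. In the sketch below, Qn covers the same bytes as D(2n) (low half) and D(2n+1) (high half), which is the architectural aliasing; the class and method names are illustrative assumptions.

```python
class NeonRegisterFile:
    """Toy model of the 256-byte NEON register file with D and Q views."""

    def __init__(self):
        self.data = bytearray(256)

    def write_d(self, n, value):
        if not 0 <= n < 32:
            raise IndexError("D registers are D0-D31")
        self.data[8 * n:8 * n + 8] = value.to_bytes(8, "little")

    def read_d(self, n):
        if not 0 <= n < 32:
            raise IndexError("D registers are D0-D31")
        return int.from_bytes(self.data[8 * n:8 * n + 8], "little")

    def write_q(self, n, value):
        if not 0 <= n < 16:
            raise IndexError("Q registers are Q0-Q15")
        self.data[16 * n:16 * n + 16] = value.to_bytes(16, "little")

    def read_q(self, n):
        if not 0 <= n < 16:
            raise IndexError("Q registers are Q0-Q15")
        return int.from_bytes(self.data[16 * n:16 * n + 16], "little")


rf = NeonRegisterFile()
rf.write_d(2, 0x1111111111111111)
rf.write_d(3, 0x2222222222222222)
print(hex(rf.read_q(1)))   # 0x22222222222222221111111111111111
```

This is the register trade-off on the slide: code can use 32 short vectors or 16 long ones, but not both independently.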

NEON: Vectors and Scalars

n Registers hold one or more elements of the same data type
  n Vn can be used to reference either a 64-bit Dn or 128-bit Qn register
  n A register, data type combination describes a vector of elements

n Some instructions can reference individual scalar elements
  n Scalar elements are referenced using the array notation Vn[x]

n Array ordering is always from the least significant bit

[Figure: vector views. A 64-bit Dn register (D0-D7 shown, bits 63..0) holds one I64 or two S32 elements. A 128-bit Qn register (Q0-Q7 shown, bits 127..0) holds four F32 or sixteen S8 elements. For Q0 viewed as F32, the lanes from most to least significant are Q0[3], Q0[2], Q0[1], Q0[0].]
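
The Vn[x] notation indexes lanes starting from the least significant bits. A minimal sketch of that indexing, treating a Q register as a 128-bit integer (the helper names are invented for illustration):

```python
import struct


def lane(reg_value, lane_bits, index):
    """Return the raw bits of element Vn[index]; index 0 is the lowest lane."""
    mask = (1 << lane_bits) - 1
    return (reg_value >> (index * lane_bits)) & mask


def as_f32(bits):
    """Reinterpret a 32-bit pattern as an IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# Q0 holding the F32 vector {Q0[3]=3.0, Q0[2]=2.0, Q0[1]=1.0, Q0[0]=0.0}
q0 = 0x40400000_40000000_3F800000_00000000
print([as_f32(lane(q0, 32, i)) for i in range(4)])   # [0.0, 1.0, 2.0, 3.0]
```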

ARM VFP

VFP – Overview

n VFP – “Vector Floating-point”
  n A floating point hardware accelerator

n Described as a “coprocessor”
  n Originally a tightly-coupled coprocessor
  n Executed instructions from ARM instruction stream via dedicated interface
  n Now more tightly integrated into the CPU

n Single and Double precision floating-point
  n Fully IEEE compliant: ANSI/IEEE Std. 754-1985
  n Vector operations (short vectors of up to 8 SP or 4 DP numbers) are handled particularly efficiently by the VFP architecture
    n Most arithmetic instructions can be used on these vectors, allowing SIMD parallelism
    n The FP Load & Store instructions have multiple register forms, allowing vectors to be transferred to and from memory efficiently

VFP – Registers

n VFP has 32 GP registers, each capable of holding a SP FP number or a 32-bit integer

n In D variants, these registers can also be used in pairs to hold up to 16 DP FP numbers

n FPSID (Read Only): Can be read to determine which implementation of the VFP architecture is being used

n FPSCR: Supplies all user-level status and control
  n Status bits hold comparison results and cumulative flags for FP exceptions
  n Control bits are provided to select rounding options and vector length/stride, and to enable FP exception traps

n FPEXC: Contains a few bits for system-level status and control

VFP – Instructions

n Load / store
  n Some of these instructions allow multiple register values to be transferred, providing floating-point equivalents to ARM LDM and STM instructions

n Transfer 32-bit values directly between VFP and ARM GP registers

n Transfer 32-bit values directly between VFP system registers and ARM GP registers

n Add, subtract, multiply, divide, and square root
  n Can be used on short vectors as well as on individual floating-point values

n Copy floating-point values between registers

n Perform combined MAC operations on FP values and short vectors

n Perform conversions between SP, DP, unsigned 32-bit integers and two's complement signed 32-bit integers

n Compare floating-point values in registers with each other or with zero

77

VFP – Versions

n VFPv1 is obsolete

n VFPv2 is an optional extension to the ARMv5TE/5TEJ and ARMv6 architectures

n VFPv3-D32 is broadly compatible with VFPv2 but adds exception-less FPU usage, has 32 64-bit FPU registers as standard, adds VCVT instructions to convert between scalar, float and double, adds immediate mode to VMOV such that constants can be loaded into FPU registers

n VFPv3-D16: as above, but it has only 16 64-bit FPU registers

n VFPv3-F16 is uncommon; it supports IEEE754-2008 half-precision (16-bit) floating point

n VFPv4 in/for Cortex-A5, has a fused multiply-accumulate


ARM AArch64

ARM Architecture Evolution: AArch32 to AArch64

[Figure: architecture timeline ARMv5, ARMv6, ARMv7-A/R, ARMv8-A. Feature labels along the timeline: Jazelle®, VFPv2, SIMD, Thumb®-2, TrustZone®, NEON™ Adv SIMD, VFPv3/v4, CRYPTO. ARMv8-A contains two execution states. AArch32: A32+T32 ISAs including scalar FP (SP and DP) and Adv SIMD (SP float), with ARMv7-A compatibility as the key feature, plus CRYPTO. AArch64: A64 ISA including scalar FP (SP and DP) and Adv SIMD (SP+DP float), plus CRYPTO.]

ARM v8 (AArch64) - Motivation

n Work on 64-bit architecture started in 2007

n Fundamental motivation is evolution into 64-bit
  n Ability to access a large virtual address space
  n Foresee a future need in ARM’s traditional markets
  n Enables expansion of ARM market presence

n Developing ecosystem takes time
  n Development started ahead of strong demand
  n ARM now seeing strong partner interest in 64-bit
    n Though still some years from “must have” status

ARM v8 (AArch64) - Fundamentals

n New instruction set (A64)

n Revised exception handling for exceptions in AArch64 state
  n Fewer banked registers and modes

n Support for all the same architectural capabilities as in ARMv7
  n TrustZone
  n Virtualization

n Memory translation system based on the ARMv7 LPAE table format
  n LPAE format was designed to be easily extendable to AArch64
  n Up to 48 bits of virtual address from a translation table base register
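
As a rough illustration of how a 48-bit virtual address feeds a table walk, the sketch below splits an address into table indices and a page offset. It assumes a 4 KB translation granule with four lookup levels of 9 index bits each; the slide does not state the granule, so treat those numbers as an assumption of this example.

```python
LEVEL_SHIFTS = (39, 30, 21, 12)   # assumed: 4 KB granule, 9 index bits per level


def split_va(va):
    """Split a 48-bit virtual address into per-level indices and a page offset."""
    if va >> 48:
        raise ValueError("address does not fit in 48 bits")
    indices = [(va >> shift) & 0x1FF for shift in LEVEL_SHIFTS]
    offset = va & 0xFFF
    return indices, offset


va = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 0xABC
print(split_va(va))   # ([1, 2, 3, 4], 2748)
```

Four 9-bit indices plus a 12-bit offset account for exactly 48 bits.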

82

ARM v8 (AArch64) – New Instruction Set

n New fixed length Instruction set
  n Instructions are 32-bits in size
  n Clean decode table based on 5-bit register specifiers

n Instruction semantics broadly the same as in AArch32
  n Changes only where there is a compelling reason to do so

n 31 general purpose registers accessible at all times
  n Improved performance and energy
  n General purpose registers are 64-bits wide
  n No banking of general purpose registers
  n Stack pointer is not a general purpose register
  n PC is not a general purpose register
  n Additional dedicated zero register available for most instructions
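
The 5-bit register specifiers make operand extraction a matter of fixed shifts and masks. The sketch below pulls the three register fields out of a 32-bit A64 word, using the field positions of the three-register data-processing form (Rd in bits 4:0, Rn in bits 9:5, Rm in bits 20:16). Other instruction classes use some of these bits differently, so this is an illustration rather than a decoder.

```python
def register_fields(instr):
    """Extract Rd, Rn and Rm from a 32-bit A64 data-processing (register) word."""
    if not 0 <= instr < (1 << 32):
        raise ValueError("A64 instructions are exactly 32 bits")
    return {
        "Rd": instr & 0x1F,
        "Rn": (instr >> 5) & 0x1F,
        "Rm": (instr >> 16) & 0x1F,
    }


ADD_X0_X1_X2 = 0x8B020020   # ADD X0, X1, X2
print(register_fields(ADD_X0_X1_X2))   # {'Rd': 0, 'Rn': 1, 'Rm': 2}
```

A 5-bit field can name 32 values: the 31 general purpose registers plus one encoding that is used for the zero register or the stack pointer depending on the instruction.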

83

AArch64 – Unbanked Registers

n 64-bit GP register file used for:
  n Scalar integer computation
    n 32 and 64-bit
  n Address computation
    n 64-bit

n Media register file used for:
  n Scalar Single and Double Precision FP
    n 32 and 64-bit
  n Advanced SIMD for Integer and FP
    n 64 and 128-bit wide vectors
  n Cryptography

[Figure: the two register files. General purpose registers X0-X30 (X30 is marked with an asterisk) and media registers V0-V31.]

ARM v8 (AArch64) – Key differences from AArch32

n New instructions to support 64-bit operands
  n Most instructions can have 32-bit or 64-bit arguments
  n Addresses assumed to be 64-bits in size
    n LP64 and LLP64 are the primary data models targeted

n Far fewer conditional instructions than in AArch32
  n Conditional {branches, compares, selects}

n No arbitrary length load/store multiple instructions
  n LD/ST ‘P’ for handling pairs of registers added

AArch32 / AArch64 Relationship

n Changes between AArch32 and AArch64 occur on “exception/exception return” only
  n Increasing exception level cannot decrease register width (or vice versa)
  n No Branch and Link between AArch32 and AArch64

n Allows AArch32 applications under AArch64 OS Kernel
  n Alongside AArch64 applications

n Allows AArch32 guest OS under AArch64 Hypervisor
  n Alongside AArch64 guest OS

n Allows AArch32 Secure side with AArch64 Non-secure side
  n Protects AArch32 Secure OS investments into ARMv8

n Requires architected relationship between AArch32 and AArch64 registers

ARM v8 (AArch64) – Advanced SIMD and FP Instruction Set

n A64 Advanced SIMD and FP semantically similar to A32
  n Advanced SIMD shares the floating-point register file as in AArch32

n A64 provides 3 major functional enhancements:
  n More 128-bit registers: 32 x 128-bit wide registers
    n Can be viewed as 64-bit wide registers
  n Advanced SIMD supports DP floating-point execution
  n Advanced SIMD supports full IEEE 754 execution
    n Rounding-modes, Denorms, NaN handling

n Register packing model in A64 is different from A32
  n 64-bit register view fits in the bottom of the 128-bit registers

n Some additional floating-point instructions for IEEE754-2008
  n MaxNum/MinNum instructions, Float to Integer conversions with RoundTiesAway

ARM v8 (AArch64) – Cryptography Support

n Instruction level support for Cryptography
  n Not intended to replace hardware accelerators in an SoC

n AES
  n 2 encode and 2 decode instructions
    n Work on the Advanced SIMD 128-bit registers
    n 2 instructions encode/decode a single round of AES

n SHA-1 and SHA-256 support
  n Keep running hash in two 128-bit wide registers
  n Hash in 4 new data words each instruction
  n Instructions also accelerate key generation

A-series Evolution Details

ARMv7-A (2004), Cortex-A9
• 32-bit Architecture
• 32-bit VA & PA
• Supports SMP 1-4 cores
• TrustZone – Security
• ARM & Thumb1/2 ISA
• NEON & VFP
• L1 D-Cache Coherency

ARMv7-A Extensions (2010), Cortex-A15: all of the above, plus
• Full HW Virtualization
• 40-bit PA
• Hypervisor State
• 2 levels PTT

ARMv8-A (2012), Cortex-A53/A57 (ARMv8: AArch64): all of the above, plus
• 64-bit Architecture
• 31x64-bit registers (double)
• 32-bit instructions
• >40-bit PA
• New exception model

[Figure: Cortex-A57 MPCore block diagram. Cores 1-4, each with an ARMv8 32b/64b CPU (labelled "Virtual 44b PA"), NEON SIMD engine with crypto extension, floating point unit, 48k I-cache with DED parity and 32k D-cache with ECC; ACP; SCU; L2 cache with ECC (512kB ~ 2MB); 128-bit AMBA ACE coherent bus interface.]

[Figure: Cortex-A53 MPCore block diagram. Cores 1-4, each with an ARMv8 32b/64b CPU (labelled "Virtual 40b PA"), NEON SIMD engine with crypto extension, floating point unit, 8~64k I-cache with parity and 8~64k D-cache with ECC; ACP; SCU; L2 cache with ECC (128kB ~ 2MB). Source: ARM (2013)]

[Figure: comparison of Cortex-A9 (32nm) and Cortex-A53 (20nm), labelled "High Performance" and "Energy Efficient".]

Cortex-A53 Features

n Cortex-A53 evolves our 8-Stage in-order LITTLE pipeline
  n ARM v8 AArch64 & AArch32 support at all exception levels

n Optional configurations for different markets
  n Smartphone to networking to offload

Cortex-A53 Cluster

[Figure: Cortex-A53 cluster. Cores 0-3, each with instruction fetch, L1 cache, instruction queue, decode, integer pipelines, NEON pipelines, load-store pipelines and ETM; shared L2 coherency logic and L2 cache; ACP I/F and ACE I/F.]

n Configurable 1-4 core capable
n Optional CPU / NEON retention capability
n Optional ECC for all core and L2 RAMs
n Optional v8 cryptographic extensions
n Debug v8 & ETM™ v4 support
n AMBA® 4 ACE™ system coherent interface
n Optional Accelerator Coherence Port
n L2 cache: none or 128K to 2MB

Cortex-A53 Core & L2 Memory System

n Tightly integrated Core (L1) to L2 Memory system

[Figure: Cortex-A53 core and L2 memory system. Core side: L1 data cache, read buffer, write buffer, store buffer, micro-TLB, main TLB, prefetcher. L2 side: L2 cache, ACE I/F, ACP I/F.]

n Fully associative micro-TLB
n 4-way set-associative, 256-bit fill path, 64-bit fetch path, 64-byte lines
n 512-entry, 4-way set-associative main TLB
n Multi-stream prefetcher
n Multi-entry merging store buffer
n 16-way set-associative, 256-bit fill path, 512-bit fetch path
n System coherent interfaces (MOESI) with 128-bit read/write/snoop paths
n 128-bit Accelerator Coherence Port (IO coherent)
n 3-outstanding load miss capability (per-core, excluding prefetches)
n 8-outstanding transactions (per-core)

[Figure: Cortex-A57 cluster. Cores 1-4, each out-of-order with integer pipelines, FP/Adv SIMD pipelines, 48k I-cache with parity and 32k D-cache with ECC; ACP; L2 with ECC (512kB ~ 2MB).]

Cortex-A57 Features

n Fixed L1-cache size
  n 48KB I-cache
  n 32KB D-cache

n Configurable L2 cache size
  n 512KB/1MB/2MB

n Fully out-of-order execution

n Optional ARMv8 cryptography units

n Reliability features
  n Caches: SECDED ECC or parity
  n System-error capabilities

n 1-4 core MP configurability within a cluster

n System interface compatible with CCN-504 and CCI-400


CoreLink CCN-504 Cache Coherent Network

[Figure: CoreLink CCN-504 Cache Coherent Network system. Four quad Cortex-A57 clusters, each with its own L2 cache, and DSPs attach to the CCN-504, which contains a snoop filter and an 8-16MB L3 cache. Two CoreLink DMC-520 controllers each drive x72 DDR4-3200. GIC-500, MMU-500 and NIC-400 network interconnects serve flash, GPIO, SATA, USB, crypto, DPI, 10-40 GbE, PCIe and DSP. Source: ARM (2013)]

n Heterogeneous processors: CPU (supported CPU: Cortex-A15, Cortex-A57, Cortex-A53), GPU, DSP and accelerators
n Up to 4 cores per cluster
n 4 fully coherent clusters
n 128-bit bus @ CPU frequency
n Snoop filter to minimize snoop traffic
n Integrated L3 cache
n Dual channel DDR3/4 x72
n 18 AMBA ACE-Lite interfaces for I/O coherency
n Usable system bandwidth: ~1 Terabit/sec
n Peripheral address space
n Generic Interrupt Controller
n ECC on RAMs and parity on transport
n Integrated QoS regulation & priority management
n TrustZone aware
n Low Power Support
  n Extensive clock gating
  n Leakage mitigation hooks
  n Granular DVFS and CPU shutdown support
  n Partial or full L3 cache shutdown
  n Retention mode

CoreLink DMC-520

n 5th Generation ARM DMC targeting >95% DRAM efficiency
  n ECC and RAS features
  n Performance Profiling
  n TrustZone Address Space Control

n High efficiency through close integration with CCN-504
  n System wide QoS, designed and verified with ARM CPUs

Performance: Max bandwidth 25.6 GB/s per channel; buffering to optimize read/write turnaround
Interfaces: Optimal direct connection to CCN-504; industry standard DFI-3.0 to connect to PHY
PHY: Low-latency PHY from ARM
Memory: Support for x72 DRAM; DDR3, DDR3L and DDR4 up to DDR4-3200
Low Power Support: Programmable DRAM power modes

[Figure: Dynamic Memory Controller block diagram. System I/F, RAS features, ECC features, QoS, data buffer for both reads and writes, config I/F, performance profiling I/F, and a memory interface (memory channel) to a DDR4, 3 & 3L PHY at 3200 Mbps. Source: ARM (2013)]

RAS = Reliability, Availability, Serviceability
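
The 25.6 GB/s per-channel figure can be checked with a short calculation. The sketch assumes that an x72 channel carries 64 data bits plus 8 ECC bits, so only 64 bits count toward payload bandwidth; that split is the usual ECC DIMM arrangement and is an assumption here, not something stated on the slide.

```python
def peak_bandwidth_gb_s(mega_transfers_per_s, data_bits):
    """Peak payload bandwidth of one DRAM channel in GB/s (10^9 bytes/s)."""
    bytes_per_transfer = data_bits / 8
    return mega_transfers_per_s * bytes_per_transfer / 1000


per_channel = peak_bandwidth_gb_s(3200, 64)   # DDR4-3200, 64 data bits of x72
print(per_channel)        # 25.6
print(2 * per_channel)    # 51.2 for a dual-channel configuration
```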

ARM Cache Architecture

ARM720T, ARMv4T (32bit)
  L1 Cache: 8KB unified I&D cache, 4-way SA, 8 words/line, VIVT
  L2 Cache: -

ARM926EJS, ARMv5TEJ (32bit)
  L1 Cache: 4KB~128KB configurable size, Harvard architecture, 4-way SA, 8 words/line, VIVT using Modified VA (MVA)
  L2 Cache: -

ARM1136EJS, ARMv6 (32bit)
  L1 Cache: 4KB~64KB configurable size, Harvard architecture, 4-way SA, 8 words/line, VIPT
  L2 Cache: -

Cortex-A8, ARMv7-A (32bit)
  L1 Cache: 16~32 KB configurable size, Harvard architecture, 4-way SA, 16 words/line, VIPT for I-cache, PIPT for D-cache
  L2 Cache: 0-1 MB configurable size, unified, 8-way SA, 16 words/line, opt ECC or parity

(CPU name not legible)
  L1 Cache: 4-way SA, 8 words/line, VIPT for I-cache, PIPT for D-cache
  L2 Cache: 0-8 MB configurable size, optional parity

(CPU name not legible)
  L1 Cache: I-cache: 2-way SA, 8 words/line, VIPT; D-cache: 4-way SA, 16 words/line, PIPT
  L2 Cache: 128KB-1 MB configurable size, 8-way SA, 16 words/line

Cortex-A15, ARMv7-A (32bit)
  L1 Cache: I-cache: 32 KB, 2-way SA, 16 words/line, parity per 16 bits, PIPT; D-cache: 32 KB, 2-way SA, 16 words/line, parity per 32 bits, PIPT
  L2 Cache: 512KB-4 MB configurable size, 16-way SA, 16 words/line, optional parity, strictly enforced inclusive with L1 data cache

Cortex-A53, ARMv8-A (64bit), a.k.a. AArch64, 40-bit physical addresses
  L1 Cache: 8~64 KB configurable size, Harvard architecture, 2-way SA, 16 words/line, PIPT
  L2 Cache: 128KB-2MB configurable size, 16-way SA

Cortex-A57, ARMv8-A (64bit), a.k.a. AArch64, 44-bit physical addresses
  L1 Cache: I-cache: 48 KB, 3-way SA, ECC, PIPT; D-cache: 32 KB, 2-way SA, ECC, PIPT
  L2 Cache: 512KB-2MB configurable size, 16-way SA, ECC
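
The size, associativity and line-length columns together fix how many sets a cache has and how an address splits into offset and index bits. A small sketch of that relationship, assuming 32-bit (4-byte) words for the "words/line" figures and power-of-two geometries:

```python
def cache_geometry(size_bytes, ways, words_per_line, bytes_per_word=4):
    """Derive line size, set count and address bit split for a set-associative cache."""
    line_bytes = words_per_line * bytes_per_word
    sets = size_bytes // (ways * line_bytes)
    # bit_length() - 1 equals log2 only for powers of two, which is assumed here
    return {
        "line_bytes": line_bytes,
        "sets": sets,
        "offset_bits": line_bytes.bit_length() - 1,
        "index_bits": sets.bit_length() - 1,
    }


# Cortex-A15 L1: 32 KB, 2-way, 16 words/line
print(cache_geometry(32 * 1024, 2, 16))
# {'line_bytes': 64, 'sets': 256, 'offset_bits': 6, 'index_bits': 8}
```

With 6 offset bits and 8 index bits, the index comes from address bits 13:6, and the remaining upper bits form the tag.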

Announcement

n Homework #4 (Due: Nov. 6)
  n Compare snooping and directory-based cache coherency schemes
  n Free format: Either MS Word or MS PPT (in Korean)

ARM AMBA Bus Architecture

Master versus Slave

n A bus transaction includes two parts:
  n Issuing the command (and address) – request
  n Transferring the data – action

n Master is the one who starts the bus transaction by:
  n Issuing the command (and address)

n Slave is the one who responds to the address by:
  n Sending data to the master if the master asks for data
  n Receiving data from the master if the master wants to send data

[Figure: bus master and bus slave. The master issues the command; data can go either way.]

What defines a BUS?

n Bunch of Wires
n Electrical Specification
n Physical / Mechanical Characteristics – the connectors
n Timing and Signaling Specification
n Transaction Protocol

On-Chip Bus Technology

n Bus Structure Evolution (bus to network)
  n Single Bus → Multilayer Bus → Bus Matrix / Crossbar → Network-on-Chip

n Bus Protocol Evolution
  n Single transaction (ex: AMBA 1.0)
  n Split Transaction, Pipelined Transaction (ex: AMBA 2.0)
  n Multiple Outstanding Addressing with Transaction ID (ex: AMBA AXI, Sonics Backbone)

[Figure: four arrangements of IP blocks, from IPs sharing a single bus through to IPs attached to routers (R) in a network-on-chip.]

On-chip Bus Architecture Classification

n Complicated protocol
  n Arbitration, multiplexing, decoding, etc.
  n Hierarchical structure
  n Ex) AHB, IBM CoreConnect

n Core-independent protocol
  n Simple and straightforward
  n Improves reusability of IP cores
  n Improves reliability
  n Proprietary interconnection network
  n Ex) ARM AXI, Sonics μNetwork, AHB Lite

[Figure: hierarchical bus example. A high-performance CPU, high-performance memory and an arbiter sit on the core bus; a bridge links it to the peripheral bus, which has its own arbiter and peripherals.]

Hierarchical Bus Architecture

n Single-layer bus: all slaves exist in a single layer
  n All slaves suffer performance degradation due to
    n the increased complexity of connection logic
    n the slowest peripheral
  n Result: overall performance degradation

n Hierarchical bus: layers are classified according to required bandwidth (high speed layer, low speed layer)
  n Delay decreases due to the reduced complexity of connection logic → enhanced speed

AMBA (Advanced Microcontroller Bus Architecture)

n On-chip interconnect specification for SoC

n Promotes re-use by defining a common backbone for SoC modules using standard bus architectures

n ASB – Advanced System Bus
n AHB – Advanced High-performance Bus (system backbone)
  n High-performance, high clock freq. modules
  n Processors to on-chip memory, off-chip memory interfaces
n APB – Advanced Peripheral Bus
  n Low-power peripherals
  n Reduced interface complexity
n AXI – Advanced eXtensible Interface
n ACE – AXI Coherency Extension
n ATB – Advanced Trace Bus

AMBA: Brief History

AMBA 1.0 (1995)
  ASB: Advanced System Bus
  APB: Advanced Peripheral Bus

AMBA 2.0 (1999)
  AHB: Advanced High-performance Bus

AMBA 3.0 (2003)
  AXI: Advanced eXtensible Interface
  APB: Addition of wait, error response

AMBA 4.0 (2010)
  AXI4: Some minor changes to AXI
    - Added support for long burst
    - Dropped support for write interleaving
  AXI4-Lite: Simplified AXI4 for on-chip devices requiring a more powerful interface than APB
  AXI4-Stream: A point-to-point protocol without an address, just data

AMBA 4.0 (2011)
  ACE: AXI Coherency Extensions
  ACE-Lite: Simplified ACE

Major Components in an AMBA System

n Master
  n An independent component that can request a read or write operation to any slave in the system (e.g. processor, DSP)

n Slave
  n A component dependent on the master. If selected, the read or write operation on its addressable space is performed under the control of the bus master (e.g. DRAM, SRAM, peripherals)

n Arbiter
  n Arbitrates the bus requests among masters so that only one master is allowed to access the bus at a time (e.g. fixed priority, round-robin)

n Decoder
  n Decides which slave to access by decoding the address from the bus master

n Multiplexer
  n Decides which master or slave accesses the target IP
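
The arbiter and decoder roles can be sketched in a few lines. The code below models the two arbitration policies named above and an address decoder that picks a slave. It is a behavioural illustration only: the memory map, slave names and class names are hypothetical, and real AHB arbitration also involves timing that is not modelled here.

```python
def fixed_priority_grant(requests):
    """Grant the lowest-numbered requesting master (index 0 has top priority)."""
    for master, requesting in enumerate(requests):
        if requesting:
            return master
    return None


class RoundRobinArbiter:
    """Grant requesters in rotating order, starting after the last winner."""

    def __init__(self, num_masters):
        self.num_masters = num_masters
        self.last_granted = num_masters - 1   # so master 0 is considered first

    def grant(self, requests):
        for offset in range(1, self.num_masters + 1):
            candidate = (self.last_granted + offset) % self.num_masters
            if requests[candidate]:
                self.last_granted = candidate
                return candidate
        return None


# Hypothetical memory map: slave name -> (base address, size in bytes)
MEMORY_MAP = {
    "SRAM": (0x0000_0000, 0x0001_0000),
    "APB_BRIDGE": (0x4000_0000, 0x0010_0000),
    "DRAM": (0x8000_0000, 0x1000_0000),
}


def decode(address):
    """Return the slave whose address range contains the address, if any."""
    for slave, (base, size) in MEMORY_MAP.items():
        if base <= address < base + size:
            return slave
    return None


requests = [False, True, True]
print(fixed_priority_grant(requests))             # 1
rr = RoundRobinArbiter(3)
print([rr.grant(requests) for _ in range(3)])     # [1, 2, 1]
print(decode(0x4000_1000))                        # APB_BRIDGE
```

Fixed priority would keep granting master 1 while it requests; round-robin alternates between the two requesters, which avoids starving master 2.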

Example AMBA BUS SYSTEM

[Figure: a typical AMBA-based system. The AHB (high performance, pipelined, burst support, multiple bus masters) connects a high-performance ARM processor, high-bandwidth on-chip RAM, a high-bandwidth external memory interface and a DMA bus master. An APB bridge links the AHB to the APB (low power, non-pipelined, simple interface), which serves the timer, keypad, UART and PIO.]

AMBA 2.0 AHB Bus

AHB Interconnection

n AMBA AHB is designed to be used with a central multiplexor interconnection scheme
  n Avoids tri-state bus

AMBA AHB Overview

n High performance

n Pipelined operation

n Multiple bus masters
  n An arbiter is required

n Single cycle bus master handover

n Burst transfers

n Split transactions

n Single clock edge operation
  n Rising edge

n Non-tri-state implementation
  n Uni-directional vs bi-directional buses

AMBA AHB Overview

n An AHB transfer consists of two distinct sections
  n The address phase, which lasts only a single cycle
  n The data phase, which may require several cycles
    n This is achieved using the HREADY signal

[Figure: timing diagram showing the clock, address and data.]

AMBA Signal Flow

[Figure: AMBA AHB signal flow. Masters 0~7 drive HADDRx[31:0], HWDATAx[31:0], HSIZEx[2:0], HTRANSx[1:0], HBURSTx[2:0] and HWRITE, and exchange HBUSREQx, HLOCKx and HGRANTx with the arbiter. The arbiter outputs HMASTER[3:0] and HMASTLOCK. The decoder derives HSELx from HADDR[31:0]. Slaves 0~15 receive HADDR[31:0], HWDATA[31:0], HSIZE[2:0] and HTRANS[1:0], and return HRDATA[31:0], HREADY and HSPLITx.]

AMBA AHB Signal List (1)

Name           Source            Description
HCLK           Clock source      Bus clock (rising edge)
HRESETn        Reset controller  Reset (active low)
HADDR[31:0]    Master            Address bus
HTRANS[1:0]    Master            Transfer type (non-sequential, sequential, idle, busy)
HWRITE         Master            Transfer direction (high: write, low: read)
HSIZE[2:0]     Master            Transfer size (8/16/32 bit, max. 1024 bit)
HBURST[2:0]    Master            Burst type (4/8/16 beat burst, incrementing or wrapping)
HPROT[3:0]     Master            Protection control (opcode fetch or data access; privileged mode access or user mode access)
HWDATA[31:0]   Master            Write data bus
HSELx          Decoder           Slave select
HRDATA[31:0]   Slave             Read data bus
HREADY         Slave             Transfer done (high: done)
HRESP[1:0]     Slave             Transfer response (okay, error, retry, split)

AMBA AHB Signal List (2)

Name           Source   Description
HBUSREQx       Master   Bus request (max. 16 bus masters)
HLOCKx         Master   Locked transfers (exclusive use of bus when high)
HGRANTx        Arbiter  Bus grant
HMASTER[3:0]   Arbiter  Master number
HMASTLOCK      Arbiter  Locked sequence (same timing as the HMASTER signal)
HSPLITx[15:0]  Slave    Split completion request

AMBA Diagram

n AHB Master
  n An AHB master is able to initiate read and write operations by providing an address and control information
  n Common AHB masters
    n Processor, DSP, DMA controller, etc.

[Figure: AHB master interface. Inputs: arbiter grant (HGRANTx), transfer response (HREADY, HRESP[1:0]), reset (HRESETn), clock (HCLK), data (HRDATA[31:0]). Outputs: arbiter (HBUSREQx, HLOCKx), transfer type (HTRANS[1:0]), address and control (HADDR[31:0], HWRITE, HSIZE[2:0], HBURST[2:0], HPROT[3:0]), data (HWDATA[31:0]).]

AHB Master Block Diagram

n The AHB master is an interface unit that allows user logic to initiate a data transfer on the AHB

n The bus master is optimized to interface with DMA and PCI bus bridge functions to initiate data transfer on the AHB

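[Figure: AHB master block diagram. Blocks: Bus Request, Master Control, Transfer Counter, Address Buffer, Write Buffer. One side faces the AHB bus (slave); the other is the user interface (control, address, read data, write data).]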

117

AMBA Diagram

n AHB Slave
n An AHB slave processes read and write operations issued by an AHB master
n Common AHB slaves
n Memory controller, peripherals

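[Figure: AHB slave interface. Inputs: HSELx (select), HADDR[31:0], HWRITE, HTRANS[1:0], HSIZE[2:0], and HBURST[2:0] (address and control), HWDATA[31:0] (data), HRESETn (reset), HCLK (clock), plus HMASTER[3:0] and HMASTLOCK for a split-capable slave. Outputs: HREADY and HRESP[1:0] (transfer response), HRDATA[31:0] (data), and HSPLITx[15:0] for a split-capable slave.]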

118

Control Signal (1/2)

n HWRITE – Transfer direction
n HIGH: Write using HWDATA[31:0]
n LOW: Read using HRDATA[31:0]

n HSIZE[2:0] – Transfer size

119

HSIZE[2:0] Size Description
000 8 bits Byte
001 16 bits Halfword
010 32 bits Word
011 64 bits -
100 128 bits 4-word line
101 256 bits 8-word line
110 512 bits -
111 1024 bits -
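The HSIZE encoding in the table follows a simple rule: the transfer size in bytes is 2 raised to the HSIZE value. A minimal sketch of that rule, assuming hypothetical helper names (`hsize_to_bytes` and `hsize_to_bits` are not part of any AMBA deliverable):

```python
def hsize_to_bytes(hsize: int) -> int:
    """Return the transfer size in bytes for a 3-bit HSIZE value."""
    if not 0 <= hsize <= 0b111:
        raise ValueError("HSIZE is a 3-bit field")
    # 000 -> 1 byte, 001 -> 2 bytes, ..., 111 -> 128 bytes
    return 1 << hsize


def hsize_to_bits(hsize: int) -> int:
    """Return the transfer size in bits, as listed in the table."""
    return 8 * hsize_to_bytes(hsize)


if __name__ == "__main__":
    for code in range(8):
        print(f"{code:03b} -> {hsize_to_bits(code)} bits")
```

Running the script prints the same eight size values as the table, from 8 bits for 000 to 1024 bits for 111.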

Control Signal (2/2)

n HPROT[3:0] – Protection control (Indicating opcode fetch, data access, privileged mode access, and user mode)

120

HPROT[3] (cacheable) HPROT[2] (bufferable) HPROT[1] (privileged) HPROT[0] (data/opcode) Description
- - - 0 Opcode fetch
- - - 1 Data access
- - 0 - User access
- - 1 - Privileged access
- 0 - - Not bufferable
- 1 - - Bufferable
0 - - - Not cacheable
1 - - - Cacheable

Basic Transfer (1/10)

n Address phase: Only single cycle

n Data phase can be extended to multi-cycle by using HREADY signal

121


n Example: data transfer cycle extension between objects with different operating clock frequencies

122


n Multiple data transfer with non-contiguous address

n A and C: zero wait state, B: one wait state
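The effect of wait states on pipelined transfers can be sketched with a small timing model. This is an illustration under stated assumptions, not part of the AMBA specification: `completion_cycles` is a hypothetical helper, cycle 1 is taken to be the address phase of the first transfer, and transfers are issued back to back with no IDLE cycles.

```python
def completion_cycles(wait_states):
    """Return the cycle in which each transfer's data phase completes.

    wait_states[i] is the number of wait states (HREADY low cycles)
    the slave inserts in the data phase of transfer i.
    """
    cycles = []
    cycle = 1  # cycle 1: address phase of the first transfer
    for waits in wait_states:
        # The data phase takes one cycle plus any inserted wait states.
        # The next address phase overlaps it, so no extra cycle is added.
        cycle += 1 + waits
        cycles.append(cycle)
    return cycles


if __name__ == "__main__":
    # A and C: zero wait states, B: one wait state
    print(completion_cycles([0, 1, 0]))  # [2, 4, 5]
```

With one wait state on B, transfer C completes one cycle later than it would otherwise, because its address phase is stretched while B's data phase is extended.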

123

[Figure: timing diagram annotations: transaction A starts; transaction B starts; transaction A completes; notice the backpressure.]


n Basic Signals for Read Txn

124

Name Source Description

HADDR[31:0] Master Address bus

HRDATA[31:0] Slave Read data bus

HRESP Slave Transfer response (okay, error, retry, split)


n Operation for a Read Txn

125


n Pipelined Operation

126


n HREADY for a Slow Slave

127

Name Source Description
HADDR[31:0] Master Address bus


HREADY Slave Transfer done (high-done)



n Wait State Insertion

128


n Operation for Write Txn

129

Name Source Description
HADDR[31:0] Master Address bus

HWRITE Master Transfer direction (high-write/low-read)

HWDATA[31:0] Master Write data bus


HREADY Slave Transfer done (high-done)



n Operation for a Write Txn

130

Transfer Type (1/2)

n Every transfer is classified by HTRANS[1:0]

131

HTRANS[1:0] Type Description

00 IDLE Indicates that no data transfer is required. The IDLE transfer type is used when a bus master is granted the bus, but does not wish to perform a data transfer. Slaves must always provide a zero wait state OKAY response to IDLE transfers and the transfer should be ignored by the slave.

01 BUSY

The BUSY transfer type allows bus masters to insert IDLE cycles in the middle of bursts of transfers. This transfer type indicates that the bus master is continuing with a burst of transfers, but the next transfer cannot take place immediately. When a master uses the BUSY transfer type the address and control signals must reflect the next transfer in the burst. The transfer should be ignored by the slave. Slaves must always provide a zero wait state OKAY response, in the same way that they respond to IDLE transfers.

10 NONSEQ Indicates the first transfer of a burst or a single transfer. The address and control signals are unrelated to the previous transfer. Single transfers on the bus are treated as bursts of one and therefore the transfer type is NONSEQUENTIAL.

11 SEQ

The remaining transfers in a burst are SEQUENTIAL and the address is related to the previous transfer. The control information is identical to the previous transfer. The address is equal to the address of the previous transfer plus the size (in bytes). In the case of a wrapping burst the address of the transfer wraps at the address boundary equal to the size (in bytes) multiplied by the number of beats in the transfer (either 4, 8 or 16).

Transfer Type (2/2)

n Transfer type example

132

Burst Operation

HBURST[2:0] Type Description

000 SINGLE Single transfer

001 INCR Incrementing burst of unspecified length

010 WRAP4 4-beat wrapping burst

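011 INCR4 4-beat incrementing burst

100 WRAP8 8-beat wrapping burst

101 INCR8 8-beat incrementing burst

110 WRAP16 16-beat wrapping burst

111 INCR16 16-beat incrementing burst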
Burst signal encoding

133

Burst Mode Example (1/5)

n The address wraps if it reaches the address boundary
n In this case, the address boundary is 0x30 ~ 0x3C

134

Four-beat wrapping burst


n Sequentially increasing address
n In this case, the address increases by 4 because SIZE = Word

135

Four-beat incrementing burst


n The address wraps if it reaches the address boundary
n In this case, the address boundary is 0x20 ~ 0x3C

136

Eight-beat wrapping burst

Answer to the Question
n In the case of a wrapping burst, the address of the transfer wraps at the address boundary equal to the size (in bytes) multiplied by the number of beats in the transfer (either 4, 8 or 16).

So an eight-beat wrapping burst with a size (specified in HSIZE[2:0]) of 4 bytes (word = 32 bits) has a 32-byte (= 8 x 4) address boundary. A 32-byte address boundary means that only the 5 LSBs of HADDR[31:0], that is, HADDR[4:0], can change during the burst.

Therefore, if the start address is 0x34, the address changes as follows:

0x34 → 0x38 → 0x3C → 0x20 → 0x24 → 0x28 → 0x2C → 0x30.

n The question in the middle of today's lecture was: if the start address is 0x24, how does the address bus change? The answer is:

0x24 → 0x28 → 0x2C → 0x30 → 0x34 → 0x38 → 0x3C → 0x20.

Only the 5 LSBs are allowed to change because the address boundary is 32 bytes.
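The two address sequences above can be reproduced with a short sketch. The function name `wrapping_burst` is hypothetical, and the sketch assumes the start address is aligned to the transfer size:

```python
def wrapping_burst(start, beats, size_bytes):
    """Return the address of every beat of an AHB wrapping burst.

    The burst wraps at a boundary of beats * size_bytes bytes, so only
    the low-order address bits below that boundary change.
    """
    if beats not in (4, 8, 16):
        raise ValueError("wrapping bursts have 4, 8 or 16 beats")
    boundary = beats * size_bytes
    base = start - (start % boundary)  # lowest address of the wrap window
    return [base + (start - base + i * size_bytes) % boundary
            for i in range(beats)]


if __name__ == "__main__":
    for start in (0x34, 0x24):
        seq = wrapping_burst(start, beats=8, size_bytes=4)
        print(" -> ".join(f"0x{a:02X}" for a in seq))
```

For a start address of 0x34 the window is 0x20 to 0x3F, so the fourth beat wraps to 0x20; for 0x24 the wrap happens on the last beat.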

137


n Sequentially increasing address
n In this case, the address increases by 2 because SIZE = halfword

138

Eight-beat incrementing burst


n Incrementing burst of unspecified length

139

Undefined-length bursts

Early Burst Termination

n There are certain circumstances when a burst will not be allowed to complete and therefore it is important that any slave design which makes use of the burst information can take the correct course of action if the burst is terminated early

n The slave can determine when a burst has terminated early by monitoring the HTRANS signals and ensuring that after the start of the burst every transfer is labeled as SEQUENTIAL or BUSY

n If a NONSEQUENTIAL or IDLE transfer occurs then this indicates that a new burst has started and therefore the previous one must have been terminated

n If a bus master cannot complete a burst because it loses ownership of the bus then it must rebuild the burst appropriately when it next gains access to the bus

n For example, if a master has only completed one beat of a four-beat burst then it must use an undefined-length burst to perform the remaining three transfers

140

Address Decoding

n Decoder – used to provide a select signal, HSELx, for each slave on the bus

n Select signal – Generated by decoding the high-order address signals

n Address boundary – The minimum address space that can be allocated to a single slave is 1KB

141

[Figure: address decoding process. HADDR_M1[31:0] from Master #1 and HADDR_M2[31:0] from Master #2 enter the address and control mux. The selected HADDR goes to all slaves and to the decoder, which generates HSEL_S1, HSEL_S2, and HSEL_S3 for Slaves #1 to #3.]
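The decoding step can be sketched as follows. The address map below is purely hypothetical (the slides do not give one); only the HSEL_S1 to HSEL_S3 names and the 1 KB minimum region size come from the slides. The decoder looks only at the high-order address bits, dropping the 10 low-order bits that address within a 1 KB region.

```python
KB = 1024

# Hypothetical address map: (select signal, base address, size in bytes).
ADDRESS_MAP = [
    ("HSEL_S1", 0x0000_0000, 64 * KB),
    ("HSEL_S2", 0x4000_0000, 1 * KB),
    ("HSEL_S3", 0x4000_0400, 1 * KB),
]


def check_map(address_map):
    """Reject regions that are not aligned to, or a multiple of, 1 KB."""
    for name, base, size in address_map:
        if size == 0 or base % KB or size % KB:
            raise ValueError(f"{name}: region must be a 1 KB multiple")


def decode(haddr, address_map=ADDRESS_MAP):
    """Return the name of the asserted HSELx, or None if nothing matches."""
    page = haddr >> 10  # keep only the high-order address bits
    for name, base, size in address_map:
        if (base >> 10) <= page < ((base + size) >> 10):
            return name
    return None  # a real system would route this to a default slave


if __name__ == "__main__":
    check_map(ADDRESS_MAP)
    for addr in (0x0000_FFFF, 0x4000_03FF, 0x4000_0404, 0x8000_0000):
        print(f"0x{addr:08X} -> {decode(addr)}")
```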

Slave Transfer Responses

n The role of slaves
n Whenever a slave is accessed it must provide a response which indicates the status of the transfer
n The HREADY signal is used to extend the transfer and this works in combination with the response signals, HRESP[1:0], which provide the status of the transfer

n The slave can complete the transfer in a number of ways:
n complete the transfer immediately
n insert one or more wait states to allow time to complete the transfer
n signal an error to indicate that the transfer has failed
n delay the completion of the transfer, but allow the master and slave to back off the bus, leaving it available for other transfers

n Transfer response
n OKAY: with HREADY high, indicates the transfer has completed successfully
n ERROR: Indicates the transfer has failed, for example when accessing an invalid address
n RETRY: Indicates the transfer has not yet completed, so the bus master should retry the transfer
n SPLIT: The transfer has not yet completed successfully, so the bus master must retry the transfer when it is next granted access to the bus

142

Data Buses (1/2)

n AHB does not specify the required endianness, so it is important that all masters and slaves on the bus have the same endianness.

143

Transfer size Address offset DATA [31:24] DATA [23:16] DATA [15:8] DATA [7:0]
Word 0 o o o o
Halfword 0 - - o o
Halfword 2 o o - -
Byte 0 - - - o
Byte 1 - - o -
Byte 2 - o - -
Byte 3 o - - -

- Active byte lanes for a 32-bit little-endian data bus -

Data Buses (2/2)

144

Transfer size Address offset DATA [31:24] DATA [23:16] DATA [15:8] DATA [7:0]

Word 0 o o o o

Halfword 0 o o - -

Halfword 2 - - o o

Byte 0 o - - -

Byte 1 - o - -

Byte 2 - - o -

Byte 3 - - - o

- Active byte lanes for a 32-bit big-endian data bus -
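The two byte-lane tables can be derived from a single rule. A minimal sketch, assuming a 32-bit bus and a hypothetical helper `active_byte_lanes`; lane n stands for DATA[8n+7:8n]:

```python
def active_byte_lanes(size_bytes, offset, big_endian=False):
    """Return the active byte lanes (3 = DATA[31:24] ... 0 = DATA[7:0])."""
    if size_bytes not in (1, 2, 4):
        raise ValueError("size must be 1, 2 or 4 bytes on a 32-bit bus")
    if not 0 <= offset < 4 or offset % size_bytes:
        raise ValueError("offset must be aligned to the transfer size")
    lanes = []
    for byte in range(offset, offset + size_bytes):
        # Little-endian: byte address n uses lane n.
        # Big-endian: byte address n uses lane 3 - n.
        lanes.append(3 - byte if big_endian else byte)
    return sorted(lanes, reverse=True)


if __name__ == "__main__":
    print(active_byte_lanes(2, 2))                   # [3, 2]
    print(active_byte_lanes(2, 2, big_endian=True))  # [1, 0]
```

The halfword at offset 2 sits on DATA[31:16] on a little-endian bus and on DATA[15:0] on a big-endian bus, matching the two tables.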

Slave Transfer Response : Retry

145

n OKAY response: 1 cycle
n ERROR, SPLIT, RETRY responses: at least 2 cycles

Slave Transfer Response : Split

146

Arbitration : Granting Bus Access

147

Before a transaction, a master makes a request to the central arbiter

Eventually the request is granted

Then the transaction proceeds

Performance Impact

Arbitration : Granting Bus Access with Wait States

148

Arbitration : Data Bus Ownership

149

Arbitration : Handover after Burst

150

Multi-layer Bus

n Parallel paths between masters and slaves

n Key Advantages
n Increased bandwidth
n Design flexibility

n Uses the same interface protocol

151

[Figure: multi-layer bus. Masters #1 to #3 connect through a bus matrix to Slaves #1 to #4.]

In Brief: Multi-layer AHB and AHB-Lite (1/2)

n ARM SoC World shifts towards crossbar switched interconnects, in the form of multi-layer busses

n Each layer of the bus is an independent single master AHB system

n Instead of a rather complex monolithic multiplexing scheme, a multi-layer AHB bus architecture with M masters and S slaves is structured as M × 1:S muxes plus S × M:1 slave muxes, all connected to separate arbitration and decoding logic

152

In Brief: Multi-layer AHB and AHB-Lite (2/2)

n Multiple masters can talk to multiple slaves concurrently, as long as no two masters try to access the same slave at the same time

n All arbitration and protocol complexity moves into the fabric
n The interface implementation becomes simpler as a number of unneeded signals, most notably HGRANT and HBUSREQ, can be removed along with their associated protocol

n Although not a necessary consequence of the multi-layer architecture, getting rid of the unpopular SPLIT and RETRY handshaking mechanism was another advantage

n With the advent of AMBA3 this AHB subset has been standardized upon as AHB-Lite

n AHB-Lite is a subset of the full AHB specification for designs where only a single bus master is used. This can either be a simple single-master system, or a multi-layer AHB-Lite system where there is only one AHB master per layer

153

154

AMBA 3.0 AXI Bus

[Figure: AMBA fabric matrix. A structurally configured, parameterised (e.g. address map) fabric matrix core with routing and control blocks is surrounded by sync bridges, async bridges, register slices, downsizers, and AXI-APB and AXI-AHB bridges. Example ports: AXI at 50 MHz, 166 MHz, and 200 MHz; APB at 40 MHz; AHB 32-bit at 70 MHz.]

AMBA3.0: Typical AXI Interconnect

155

Why update the AMBA Protocol?

n Future SoC designs demand:
n Increasing levels of system complexity
n Increasing performance demands
n Increasing clock speed requirements
n Reduction of system power consumption

n Whilst retaining existing AMBA strengths:
n Modular design
n Component re-use
n Simple to understand interface

156

Why AXI instead of Multi-layer AHB?

n AHB is transfer-oriented
n With each transfer, an address will be submitted and a single data item will be written to or read from the selected slave
n All transfers will be initiated by the master, thus the master will be stalled if the slave cannot respond immediately to a transfer request
n Each master can have only one outstanding transaction

n Sequential accesses (bursts) consist of consecutive transfers which indicate their relationship by asserting HTRANS/HBURST accordingly

n Although AHB systems are multiplexed and thus have independent read and write data busses, they cannot operate in full-duplex mode

157

AMBA3 AXI Protocol

n Advanced eXtensible Interface (AXI)
n separate address/control and data phases
n support for unaligned data transfers, using byte strobes
n uses burst-based transactions with only the start address issued
n separate read and write data channels, that can provide low-cost Direct Memory Access (DMA)
n support for issuing multiple outstanding addresses
n support for out-of-order transaction completion
n permits easy addition of register stages to provide timing closure

n Five channels
n Read: read address (AR) and read data (R) channels
n Write: write address (AW), write data (W), and write response (B) channels

[Source: AXI Spec]

158

Channel Architecture

n Four groups of signals
n Address: "A" signal name prefix
n Read: "R" signal name prefix
n Write: "W" signal name prefix
n Write response: "B" signal name prefix

159

ADDRESS

READ DATA

WRITE DATA

RESPONSE

Interconnect, Interface & Channel

160

[Figure: interconnect, interface, and channel. Masters 1 to 3 and Slaves 1 to 4 attach to the AXI interconnect. Each AXI master interface connects to an AXI slave interface through five channels: write address (AW), write data (W), write response (B), read address (AR), and read data (R).]

To read ...

ADDRESS

READ DATA

WRITE DATA

RESPONSE

1. Master issues address

2. Slave returns data

161

To write ...

[Figure: ADDRESS, READ DATA, WRITE DATA, and RESPONSE channels]

1. Master issues address

2. Master gives data

3. Slave acknowledges

162

Channels - One way flow

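Address channel: AVALID, ADDR, AWRITE, ALEN, ASIZE, ABURST, ALOCK, ACACHE, APROT, AID, AREADY

Read data channel: RVALID, RLAST, RDATA, RRESP, RID, RREADY

Write data channel: WVALID, WLAST, WDATA, WSTRB, WID, WREADY

Write response channel: BVALID, BRESP, BID, BREADY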

n Each channel has information flowing in one direction only

n READY is the only return signal

163

Register Slices for Max Frequency

n Register slices can be applied across any channel

n Allows maximum frequency of operation by matching channel latency to channel delay

n Allows system topology to be matched to performance requirements

WREADY

WID

WDATA

WSTRB

WLAST

WVALID

164

Example Register Slices

[Figure: three copies of an interconnect with Masters #1 and #2 and Slaves #1 to #4, showing where address (A) and read data (R) register slices can be inserted: to isolate address timing, to isolate poor read data setup, and to break a long interconnect path.]

165

AMBA 2.0 AHB Burst

n AHB Burst
n Address and data are locked together
n Single pipeline stage
n HREADY controls intervals of address and data

166

ADDRESS: A11 A12 A13 A14 A21 A22 A23 A31

DATA: D11 D12 D13 D14 D21 D22 D23 D31

AXI - One Address for Burst

n One Address for entire burst

n Separation of address and data channels
n Master provides the start address of the burst
n Slave needs to generate the remaining addresses based on burst type (FIXED, INCR, WRAP)
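Since only the start address is sent, the slave has to reconstruct the address of every beat. A minimal sketch of that step, assuming a start address aligned to the transfer size; `axi_beat_addresses` is a hypothetical name, not an API from the AXI specification:

```python
def axi_beat_addresses(start, length, size_bytes, burst):
    """Return the address of each beat of an AXI burst.

    start      -- start address issued by the master (assumed aligned)
    length     -- number of beats in the burst
    size_bytes -- bytes transferred per beat
    burst      -- "FIXED", "INCR" or "WRAP"
    """
    if burst == "FIXED":
        # Every beat targets the same address, e.g. a FIFO port.
        return [start] * length
    if burst == "INCR":
        return [start + i * size_bytes for i in range(length)]
    if burst == "WRAP":
        boundary = length * size_bytes
        base = start - (start % boundary)
        return [base + (start - base + i * size_bytes) % boundary
                for i in range(length)]
    raise ValueError(f"unknown burst type: {burst}")


if __name__ == "__main__":
    for kind in ("FIXED", "INCR", "WRAP"):
        seq = axi_beat_addresses(0x38, 4, 4, kind)
        print(kind, [hex(a) for a in seq])
```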

167

ADDRESS: A11 A21 A31

DATA: D11 D12 D13 D14 D21 D22 D23 D31

AXI - Outstanding Transactions

n Benefit of split transaction: multiple outstanding requests

n Parameters for multiple outstanding requests
n Master I/F: Issuing capability → number of outstanding requests that the master can generate
n Slave I/F: Acceptance capability → number of outstanding requests that the slave can accommodate

168

ADDRESS: A11 A21 A31

DATA: D11 D12 D13 D14 D21 D22 D23 D31

Out of Order Interface

n Each transaction has an ID attached
n Channels have ID signals - AID, RID, etc.

n Transactions with the same ID must be ordered

n Requires bus-level monitoring to ensure correct ordering on each ID
n Masters can issue multiple ordered addresses

169

AMBA 2.0 AHB Burst - Slow slave

n With AHB
n If one slave is very slow, all data is held up.

170

ADDRESS: A11 A12 A13 A14 A21 A22 A23 A31

DATA: D11 D12

Separate Read / Write Channels

n AMBA AXI allows for independent read and write transactions.

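[Figure: AXI master and AXI slave connected by five channels, each with its own READY handshake: write address/control (AWREADY), write data (WREADY), write response (BREADY), read address/control (ARREADY), and read data (RREADY).]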

171

Split Transaction

n Address, data, and response are handled separately.

[Figure: AXI master and AXI slave with the write data (WREADY), read data (RREADY), response (BREADY), write address (AWREADY), and read address (ARREADY) channels.]

172

[Figure: AXI master and AXI slave channel diagram. Step shown: master issues address.]

Split Transaction: Write (1/3)

173

[Figure: AXI master and AXI slave channel diagram. Step shown: master gives data.]


174

[Figure: AXI master and AXI slave channel diagram. Step shown: slave acknowledges.]


175

[Figure: AXI master and AXI slave channel diagram. Step shown: master issues address.]

Split Transaction: Read (1/2)

176

[Figure: AXI master and AXI slave channel diagram. Step shown: slave returns data.]

Split Transaction: Read (2/2)

177

Wire Counts

n Address 32b, data 32b bus case: 184~204 wires
n AW: 52~56, W: 39~43, B: 4~8, AR: 52~56, R: 37~41

[Figure: AXI master and AXI slave; the signal lists distinguish payload signals from handshake signals (VALID, READY)]

Write address/control (52~56 wires)
AWID[3:0] write address ID (0~4 bits)
AWADDR[31:0] write address
AWLEN[3:0] burst length
AWSIZE[2:0] burst size
AWBURST[1:0] burst type
AWLOCK[1:0] lock info
AWCACHE[3:0] cache type
AWPROT[2:0] protection type
AWVALID write address valid
AWREADY write address ready

Write data (39~43 wires)
WID[3:0] write ID tag, WID = AWID (0~4 bits)
WDATA[31:0] write data
WSTRB[3:0] write strobes
WLAST write last
WVALID write valid
WREADY write ready
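The wire counts quoted for the 32-bit address, 32-bit data case can be re-derived by summing the signal widths per channel. A small sketch, assuming the AXI3 field widths listed on this slide and an ID width between 0 and 4 bits; `channel_wires` is a hypothetical helper:

```python
def channel_wires(addr_bits=32, data_bits=32, id_bits=4):
    """Return the wire count of each AXI3 channel."""
    strobe_bits = data_bits // 8  # one write strobe per byte lane
    # ID + ADDR + LEN(4) + SIZE(3) + BURST(2) + LOCK(2) + CACHE(4)
    # + PROT(3) + VALID + READY
    address = id_bits + addr_bits + 4 + 3 + 2 + 2 + 4 + 3 + 1 + 1
    return {
        "AW": address,
        "W": id_bits + data_bits + strobe_bits + 1 + 1 + 1,  # LAST, VALID, READY
        "B": id_bits + 2 + 1 + 1,                             # RESP, VALID, READY
        "AR": address,
        "R": id_bits + data_bits + 2 + 1 + 1 + 1,             # RESP, LAST, VALID, READY
    }


if __name__ == "__main__":
    for id_bits in (0, 4):
        wires = channel_wires(id_bits=id_bits)
        print(id_bits, wires, sum(wires.values()))
```

With no ID bits the total is 184 wires, and with 4-bit IDs it is 204, matching the range on the slide.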

178

Write Address Channel & Signals

Signal Source Description

AWID[3:0] Master Transaction id

AWADDR[31:0] Master Start address

AWLEN[3:0] Master Burst length (1~16)

AWSIZE[2:0] Master Burst size: bytes per transfer (1~128)

AWBURST[1:0] Master Burst type (FIXED, INCR, WRAP)

AWLOCK[1:0] / AWCACHE[3:0] Master / Master Lock / cache type

AWPROT[2:0] Master Protection type

AWVALID / AWREADY Master / Slave Handshake control

179

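[Figure: write transaction. Write address channel: "I'm sending data. Please store it." Write data channel: "Here is the data." Write response channel: "I received that data correctly." The channels are independent and synchronized with ID numbers or "tags".]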

Write Data & Response Channels

Signal (Write Data Ch.) Source Description

WID[3:0] Master Transaction id

WDATA[31:0] Master Write data

WSTRB[3:0] Master Write strobe per byte

WLAST Master Write last

WVALID / WREADY Master / Slave Handshake control

Signal (Write Response Ch.) Source Description

BID[3:0] Slave Transaction id

BRESP[1:0] Slave Write response

BVALID / BREADY Slave / Master Handshake control

180

Read Channels & Signals

n AW signals ~ AR signals

n R signals ~ W signals
n RLAST is controlled by slave

181

[Figure: read transaction. Read address channel: "Give me some data". Read data channel: "Here you go". The channels are independent and synchronized with ID numbers or "tags".]

Read Burst Operation

[Figure: read burst timing diagram, annotated: read request is initiated; read request is accepted; read data is ready; first data is transferred; the last data is transferred.]

Note: data is transferred only when VALID = READY = 1

182

Overlapping Read Bursts

[Figure: overlapping read bursts. Read request A is accepted; read request B is accepted via the AR channel while data A(0) is transferred via the R channel.]

183

Write Burst Operation

[Figure: write burst timing diagram, annotated: write request A is accepted; the response completes the write operation.]

184

Cache Support: ARCACHE[3:0] / AWCACHE[3:0]

n Bufferable bit (B): AWCACHE[0]
n The write can be delayed for an arbitrary number of cycles

n Cacheable bit (C): AR(W)CACHE[1]
n Read: prefetch or read caching is possible
n Write: write merging is possible

n Read Allocate bit (RA): ARCACHE[2]
n On a read miss, fetch the data into the cache
n If C = low, RA must be low

n Write Allocate bit (WA): AWCACHE[3]
n On a write miss, fetch the data into the cache, and then write to the cache (and through to the memory)
n If C = low, WA must be low
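The bit assignments and the "RA and WA must be low when C is low" constraint can be expressed as a small decoder. This is a sketch under the bit positions given on this slide; `decode_axcache` is a hypothetical name:

```python
def decode_axcache(value):
    """Decode a 4-bit AxCACHE value into its B, C, RA and WA bits."""
    if not 0 <= value <= 0xF:
        raise ValueError("AxCACHE is a 4-bit field")
    bits = {
        "bufferable": bool(value & 0b0001),      # bit 0 (B)
        "cacheable": bool(value & 0b0010),       # bit 1 (C)
        "read_allocate": bool(value & 0b0100),   # bit 2 (RA)
        "write_allocate": bool(value & 0b1000),  # bit 3 (WA)
    }
    # Allocation hints are only meaningful for cacheable transactions.
    if not bits["cacheable"] and (bits["read_allocate"] or bits["write_allocate"]):
        raise ValueError("RA and WA must be low when C is low")
    return bits


if __name__ == "__main__":
    print(decode_axcache(0b0111))  # cacheable, bufferable, read-allocate
```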

185

Protection Support

n Normal or privileged: AR(W)PROT[0]
n High → privileged

n Secure or non-secure: AR(W)PROT[1]
n Low → secure

n Instruction or data: AR(W)PROT[2]
n High / low → instruction / data

186

[Source: AXI Spec]

Atomic Access

n Normal access, AR(W)LOCK[1:0] = b00
n Exclusive access, b01
n Exclusive read … Exclusive write
n If there is no intervening write to the address region, EXOKAY response. Otherwise, OKAY response.
n Usually used for read-modify-write

n Locked access, b10
n Start with b10, and end with b00
n During the period, only the lock-initiating master can access the address region (not the slave!!!)

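[Figure: exclusive access timeline. Master 1 issues E.RD 0x100, Master 2 issues WR 0x100, then Master 1 issues E.WR 0x100 and Slave 1 responds OKAY, i.e. the exclusive write fails.]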
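The exclusive access sequence in the figure can be modelled with a toy monitor. This is a simplified sketch, not the behaviour required by the AXI specification: it tracks one exact address per master rather than an address region, and the class and method names are hypothetical.

```python
class ExclusiveMonitor:
    """Toy model of a slave-side exclusive access monitor."""

    def __init__(self):
        self._reserved = {}  # master name -> reserved address

    def _clear(self, addr):
        # Drop every reservation that covers this address.
        for master in [m for m, a in self._reserved.items() if a == addr]:
            del self._reserved[master]

    def exclusive_read(self, master, addr):
        self._reserved[master] = addr
        return "EXOKAY"

    def write(self, master, addr):
        # A normal write invalidates reservations on the same address.
        self._clear(addr)
        return "OKAY"

    def exclusive_write(self, master, addr):
        if self._reserved.get(master) == addr:
            self._clear(addr)
            return "EXOKAY"  # success: the write updates memory
        return "OKAY"        # failure: the write is not performed


if __name__ == "__main__":
    mon = ExclusiveMonitor()
    mon.exclusive_read("Master 1", 0x100)
    mon.write("Master 2", 0x100)
    print(mon.exclusive_write("Master 1", 0x100))  # OKAY
```

Without the intervening write from Master 2, the same exclusive write would return EXOKAY.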

187

Out-of-Order Transaction

n Transaction ID is used to identify data transfers on all channels
n ARID, AWID, RID, WID, & BID
n Up to four bits

n Ordering by transaction ID
n Master needs to finish data transfers with the same transaction ID in the order of request issue
n Slave can handle data transfers with different transaction IDs out of order

ADDRESS: A11 A21 A31

DATA: D21 D22 D23 D11 D12 D13 D14 D31

188

Ordering

n Transaction id
n ARID, AWID, RID, WID, & BID
n Up to four bits

n In-order requirement
n Transfers with the same id finish in the same order that they are initiated

n Real implementation
n Transaction id = <master ID, channel ID>
n Channel ID = original AXI transaction id
n Master ID is needed to identify the initiating master among all the masters

189

Transaction ID Implementation

n Real implementation
n Transaction ID = <master ID, channel ID>
n Channel ID = original AXI transaction id
n Master ID is needed to identify the initiating master among all the masters

[Figure: seven masters (CPU, Video Decoder, 3D Graphics, LCD Control, Video Processor, Mixer, DMA) connect through a crossbar to the memory controller.]

ID: 4 + ceil(log2 7) = 7 bits
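The ID width seen by the memory controller follows directly from the formula above. A minimal sketch with a hypothetical helper name, using integer arithmetic in place of a floating-point logarithm:

```python
def interconnect_id_bits(channel_id_bits, num_masters):
    """Return the width of <master ID, channel ID> at the slave side."""
    if num_masters < 1:
        raise ValueError("at least one master is required")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1
    master_id_bits = (num_masters - 1).bit_length()
    return channel_id_bits + master_id_bits


if __name__ == "__main__":
    print(interconnect_id_bits(4, 7))  # 7
```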

190

Write Interleaving

n Interleaving rule
n Data with different IDs can be interleaved
n The order within a single burst is maintained
n The order of the first data items needs to be the same as that of the requests
n Write interleave capability → the maximum number of transactions that the master can interleave

[Figure: write interleaving timing diagram. ADDRESS: A11, A21, A31. DATA beats: D11 to D14, D21 to D23, and D31.]

191

Low Power Interface (C channel)

192

AMBA Class Library

n Defined method for modelling AXI components
n Data structure used to fully describe a transaction
n Standard method calls for generating and receiving transactions

n Modelling abstraction levels
n Programmer's view level
n Programmer's view + timing
n Cycle callable

193

AXI Support

[Figure: AXI support across the RTL, verification, and modelling environments: AMBA Design Kit, AXI Interconnect Generator, AMBA Verification Components, RealView Model Library, and AMBA Class Library.]

194

Summary

n AXI is the next generation AMBA bus
n Channel architecture
n Register slices
n Burst addressing
n Multiple outstanding bursts
n Out-of-order completion

n Interconnect options
n Shared bus, multi-layer and mixed

n More than just a bus protocol
n RTL, modelling and verification environments

195

196

AMBA 4.0 AXI Bus

In Brief: AXI4 (1/3)

n Three flavors: AXI4, AXI4-Lite, AXI4-Stream
n All three share the same handshake rules and signal naming

197

AMBA 3.0: APB, AHB, AXI
AMBA 4.0: AXI4, AXI4-Stream, AXI4-Lite

Feature / AXI4 / AXI4-Lite / AXI4-Stream
Dedicated for / high-performance and memory-mapped systems / register-style interfaces (area-efficient implementation) / non-address-based IP (PCIe, filters, etc.)
Burst / up to 256 / 1 / unlimited
Data width / 32 to 1,024 bits / 32 or 64 bits / any number of bytes
Applications / embedded, memory / small-footprint control logic / DSP, video, communication


n Functionality has been added and several known issues in AXI3 have been addressed
n The maximum burst length has been increased from 16 to 256 transfers for certain types of bursts (INCR, non-exclusive).
n Quality-of-Service signaling has been added; the finer details of its interpretation are implementation defined

n AXI4 defines address regions for slaves, which allows implementations of memory perspectives on the bus level

n Some ordering requirements and transfer dependencies have been refined, as have the meanings of the cache policy signals AxCACHE

n Abstract memory types as defined by the ARMv6/v7 architectures and multicore architectures are much better represented by these changes

198


n Legacy (AHB) locked transfers, which do not fit within the idea of a switched interconnect, are no longer supported
n AXI4 does not support locked transactions, but an AXI3 implementation must support them

n A rather significant change seems to be the banning of write interleaving, which could help improve the system throughput
n Write interleaving is hardly used by regular masters but can be used by fabrics that gather streams from different sources. With the new AXI4-Stream protocol, write interleaving is still available for fabrics

199

In Brief: AXI4-Stream

n The new AXI4-Stream protocol was designed for streaming data to destinations that are not memory mapped internally
n Display controllers, transmitters, and routing fabrics are among the target applications for this new protocol

n Building upon the proven, simple AXI channel handshake, AXI4-Stream is essentially an AXI write data channel with additional control signals and a slightly modified protocol
n The burst (packet) length is not restricted, and the number of bytes of the data signal TDATA can be an arbitrary integer, including zero

200

In Brief: AXI4-Lite

n The focus of AXI has been on high-performance data transfer, but what about the low-end hardware registers, configuration, etc like APB?

n The issue with APB is the bridge

n AXI4-Lite addresses this last issue by defining certain restrictions that would allow a slave to be connected directly to an AXI fabric

n In AXI4-Lite, you might say that AXI gets "dumbed" down to a few basic transaction types

n The burst length is fixed to one data transfer, transfers are non-cacheable and non-bufferable, exclusive access is not allowed and access width must always be the same as data bus width

n This is supposed to make the interface design simple enough to be implemented quickly in custom IP

201

In Brief: ACE

n The ACE protocol extends the AXI4 protocol and provides support for hardware-coherent caches

n A five state cache model to define the state of any cache line in the coherent system

n The cache line state determines what actions are required during access to that cache line

n Additional signaling on the existing AXI4 channels that enables new transactions and information to be conveyed to locations that require hardware coherency support

n Additional channels that enable communication with a cached master when another master is accessing an address location that might be shared

n The ACE protocol also provides:
n Barrier transactions that guarantee transaction ordering within a system
n Distributed Virtual Memory (DVM) functionality to manage virtual memory

202

What’s New in AMBA4

n Increased productivity
n Supports memory mapped, control and streaming applications

203

Source: Xilinx (2013)

What’s New in AMBA4:

n New Stream Interface

n AHB absent

n New AXI-Lite spec

n New AXI Features

n AXI Feature removal

n AXI clarifications

204

AXI Changes

n Clarifications
n ordering, posting, memory types / cache impacts

n Removal of locked transactions

n Removal of write data interleave

n New features
n longer bursts
n QoS field
n User fields

205

Longer Bursts

n Can issue up to 256 beat bursts
n performance, atomicity, simplicity

n Not guaranteed atomic beyond 16 beats
n even in device space / non-modifiables
n system (integrator) performance & QoS

n Atomicity may be undesirable for core as well

206

Basic AXI4 Signaling: 5 Channels, Point-to-Point

207

Basic AXI4 Handshaking

n Master asserts and holds VALID when data is available

n Slave asserts READY if able to accept data

n DATA and other signals transferred when VALID and READY = 1

n Master sends next DATA/other signals or deasserts VALID

n Slave deasserts READY if no longer able to accept data

208

[Figure: three timing diagrams: VALID before READY handshake; READY before VALID handshake; VALID with READY handshake]
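The handshake rules above can be illustrated with a cycle-by-cycle model of one channel. This is a sketch, not a protocol-accurate simulator: `simulate_channel` is a hypothetical helper, the master is assumed to assert VALID whenever it has data pending, and the slave's READY behaviour is supplied as a per-cycle list.

```python
def simulate_channel(items, ready_pattern):
    """Return (cycle, item) pairs for every completed transfer.

    items         -- data the master wants to send, in order
    ready_pattern -- READY value driven by the slave in each cycle
    """
    pending = list(items)
    transfers = []
    for cycle, ready in enumerate(ready_pattern):
        valid = bool(pending)  # master holds VALID until the data is taken
        if valid and ready:
            # A transfer happens only when VALID and READY are both high.
            transfers.append((cycle, pending.pop(0)))
    return transfers


if __name__ == "__main__":
    # The slave is not ready in cycles 0 and 2, so each item waits one cycle.
    print(simulate_channel(["D0", "D1"], [False, True, False, True]))
    # [(1, 'D0'), (3, 'D1')]
```

When READY is already high in the cycle VALID is asserted, the transfer completes in that same cycle, which corresponds to the READY-before-VALID and VALID-with-READY cases.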

Appendix: PL301 Crossbar Bus

209

AXI Crossbar Bus: ARM PrimeCell PL301

n PL301: PrimeCell High Performance Matrix

Slave Interface (SI) Master Interface (MI)

210

PL301 Internal Structure

n Adjust interface differences
n Data width, frequency, and protocol

[Figure: AMBA fabric matrix. A structurally configured, parameterised (e.g. address map) fabric matrix core with routing and control blocks is surrounded by sync bridges, async bridges, register slices, downsizers, and AXI-APB and AXI-AHB bridges. Example ports: AXI at 50 MHz, 166 MHz, and 200 MHz; APB at 40 MHz; AHB 32-bit at 70 MHz.]

211

Adjusting Data Width

n ExpanderAxi
n Converts a narrow master for use on a wide bus
n Data replication on the write data path and muxing on the read data path

n FunnelAxi
n Converts a narrow slave for use on a wide bus
n Mux on the write data path and replication on the read data path
n Assumes that transfers are suitably sized for the slave (i.e. no wider than the narrow bus)

n DownSizerAXI
n Converts a wide bus to a narrow bus
n Transactions <= narrow bus width pass through
n Transactions > narrow bus width are broken down into smaller transactions

[Figure: PL300 width converters. Expander (Exp): 32b to 64b. Funnel (Fun): 64b to 32b. DownSizer (DS): 64b to 32b.]

212

PL301 Features

n Configurable number of SIs and MIs
n Sparse connection options to reduce gate count and improve security
n Configurable AXI address/data widths
n Decoded address register that you can configure for each SI
n Flexible register stages to aid timing closure
n An arbitration mechanism that you can configure for each MI, implementing:
n a fixed Round-Robin (RR) scheme
n a programmable RR scheme
n a programmable scheme that provides prioritized groups of Least Recently Granted (LRG) arbitration

n A programmable Quality of Service (QoS) scheme
n Support for multiple clock domains: synchronous and asynchronous
n Configurable cyclic dependency schemes to enable a master to have outstanding transactions to more than one slave [Source: PL301 TS]

213

Arbitration Scheme: Fixed Priority

214

Arbitration Scheme: Round Robin

[Source: PL301 TS]

215

Arbitration Scheme: Hybrid

n Combination of round robin and fixed priority

216

Arbitration Scheme

n LRG (least recently granted) scheme

[Source: PL301 TS]

217

Arbitration Scheme: Fixed Round Robin

n A weighted round robin in a fixed order

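[Figure: fixed round-robin slot tables for slave s0 shared by masters m0 to m5 (ARM, VIDEO, 3D, LCD, ...). A master such as m2 appears in several slots of the fixed order and so receives a larger share.]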

218

An Example of Crossbar Bus Design

[Figure: CPU, Video Decoder, 3D Graphics, LCD Control, Video Process, Mixer, and DMA connect through a crossbar to the memory controller.]

Some IPs need a privileged usage of memory BW, e.g. 40% of total BW

219

Programmable QoS

Maximum # of requests allowed for best effort traffic

Assume Tidemark = 4 and ID match = M0. If there are 4 outstanding requests for M1, then only requests from M0 are accepted by S0 until one of M1's requests is served

[Source: PL301 TS]

220

Bus Arbitration: A Generic Arbiter

n Assumptions
n Each request has its own performance requirement (e.g., bandwidth budget and/or latency)
n E.g., low latency access from CPU
n E.g., bandwidth guarantee for LCD / Camera controllers
n In order to avoid starvation, a global time out is applied

n Priority order: time-out > bandwidth > best effort
n Time-out (TO) request: TO = 0, and BW budget > 0
n Bandwidth (BW) request: TO > 0, and BW budget > 0
n Best effort (BE) request: BW budget = 0

n Priority promotion/demotion
n Demotion: if BW budget is exhausted, demotion to BE
n Promotion: if BW budget becomes positive, promotion to BW or TO request

n Time-out counters
n One type of counter for QoS access w/ time out
n The other for old requests
n When a normal request (w/ unspecified TO) arrives at the bus, a timer is assigned and starts to be decremented each cycle.

221

Bus Arbitration: A Generic Arbiter

n If there is any TO request, it is served
  n QoS request w/ time-out
  n Old request
n If there is any BW request
  n Give the bus to the BW requests based on BW budget
    n Based on fixed time slot allocation or statistical slot allocation
n To the other BE requests, apply the same priority order as for BW requests
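
The classification and priority order of the generic arbiter can be sketched as follows. The `Request` fields, the function names, and the tie-breaking rule inside a class (largest remaining budget first, then smallest remaining time-out) are assumptions chosen for illustration; the slides only fix the order TO > BW > BE and the conditions on the time-out value and the BW budget.

```python
from dataclasses import dataclass


@dataclass
class Request:
    master: str
    timeout: int    # cycles left before the request times out
    bw_budget: int  # remaining bandwidth budget of the requesting master


def classify(req):
    """Map a request to TO, BW, or BE following the conditions on the slide."""
    if req.bw_budget <= 0:
        return "BE"                       # budget exhausted: demoted to best effort
    return "TO" if req.timeout <= 0 else "BW"


RANK = {"TO": 0, "BW": 1, "BE": 2}        # smaller rank is served first


def select(requests):
    """Pick the next request: class first, then budget and time-out (assumed)."""
    if not requests:
        return None
    return min(requests, key=lambda r: (RANK[classify(r)], -r.bw_budget, r.timeout))


pending = [Request("lcd", 5, 8), Request("dma", 3, 0), Request("cpu", 0, 2)]
order = []
while pending:
    winner = select(pending)
    order.append((winner.master, classify(winner)))
    pending.remove(winner)
print(order)  # [('cpu', 'TO'), ('lcd', 'BW'), ('dma', 'BE')]
```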

222

[Figure labels: Higher Priority; Current Slot]

Note: The arbitration scheme is similar to the one used in memory access scheduling

Case of fixed time slot allocation
At a time slot (instant or period):
  if the master allocated to the slot has a request,
    then it is served
  else
    the other pending requests are served (e.g., round robin)
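
The fixed time slot rule can be turned into a short runnable sketch. The function name `slot_grants`, the master names, and the slot table are hypothetical; the round-robin fallback is one possible choice, as the slide itself only gives it as an example.

```python
def slot_grants(slot_owners, masters, requests_by_cycle):
    """Grant one master per cycle using fixed slots with round-robin fallback."""
    grants = []
    rr = 0  # round-robin pointer into `masters`, used only for the fallback
    for cycle, requests in enumerate(requests_by_cycle):
        owner = slot_owners[cycle % len(slot_owners)]
        if owner in requests:
            grants.append(owner)          # the slot owner has a request: serve it
            continue
        winner = None
        for offset in range(len(masters)):
            candidate = masters[(rr + offset) % len(masters)]
            if candidate in requests:
                winner = candidate        # slot is unused: serve another requester
                rr = (masters.index(candidate) + 1) % len(masters)
                break
        grants.append(winner)
    return grants


masters = ["cpu", "lcd", "dma"]
slot_owners = ["lcd", "cpu"]              # slots alternate between lcd and cpu
requests_by_cycle = [{"lcd", "dma"}, {"dma"}, {"cpu", "dma"}, {"cpu", "dma"}]
grants = slot_grants(slot_owners, masters, requests_by_cycle)
print(grants)  # ['lcd', 'dma', 'cpu', 'cpu']
```

In cycles 1 and 2 the slot owner is idle, so the slot is handed to another pending master instead of being wasted.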

A Deadlock Problem in Accessing Multiple Slaves

[Figure: Master 1 and Master 2 access Memory 1 and Memory 2 through Memory Controller 1 and Memory Controller 2 over an on-chip bus. Master 1 issues requests A and B; Master 2 issues requests C and D. A timing diagram shows the order in which the requests are served.]

n B is blocked at master 2
n D is blocked at master 1

Color (= transaction ID): requests with the same transaction ID need to be finished in the order of request issue

Optimization in memory controller: memory controllers can serve independent requests out of order to increase memory utilization or to lower memory access latency

223

[Figure: timing diagram over cycles 1 to 17 showing requests A, B, C, and D of Master 1 and Master 2 at Memory 1 and Memory 2]

Cyclic Dependency Schemes

n Outstanding requests are permitted only for a single slave per transaction id

n The deadlock problem is resolved while limiting parallel (memory) accesses

224

Single Slave Scheme

n Allow multiple outstanding transactions only to the same slave

225

Unique ID Scheme

n Accept only out-of-order requests, i.e., requests with different transaction IDs

226

Single Slave per ID

n Combination of both single slave and unique ID schemes

n Allow multiple outstanding requests to a single slave per transaction ID
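
The three cyclic dependency schemes can be compared with a small issue-permission check. This sketch follows the one-line descriptions on the slides; the function name `may_issue`, the scheme labels, and the representation of outstanding transactions as (transaction ID, slave) pairs are assumptions.

```python
def may_issue(scheme, outstanding, txn_id, slave):
    """Decide whether a new transaction may be issued.

    outstanding: list of (txn_id, slave) pairs that are already in flight.
    """
    if scheme == "single_slave":
        # All outstanding transactions must target the same slave as the new one.
        return all(s == slave for _, s in outstanding)
    if scheme == "unique_id":
        # No outstanding transaction may share the new transaction's ID.
        return all(i != txn_id for i, _ in outstanding)
    if scheme == "single_slave_per_id":
        # Outstanding transactions with the same ID must target the same slave.
        return all(s == slave for i, s in outstanding if i == txn_id)
    raise ValueError(f"unknown scheme: {scheme}")


in_flight = [(0, "mem1")]
schemes = ["single_slave", "unique_id", "single_slave_per_id"]
same_id_other_slave = [may_issue(s, in_flight, 0, "mem2") for s in schemes]
other_id_other_slave = [may_issue(s, in_flight, 1, "mem2") for s in schemes]
same_id_same_slave = [may_issue(s, in_flight, 0, "mem1") for s in schemes]
print(same_id_other_slave, other_id_other_slave, same_id_same_slave)
```

The case that can lead to the deadlock shown earlier, the same ID sent to a different slave, is rejected by all three schemes.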

227

[Figure: timing diagram over cycles 1 to 17 showing requests A and B (Master 1) and C and D (Master 2) at Memory 1 and Memory 2, with RP1 and RP2 receiving the responses]

Request Parallelizer (RP) for Transaction ID Renaming

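[Figure: Master 1 and Master 2 connect to the NoC through request parallelizers RP1 and RP2, which translate master IDs into network IDs; Memory Controller 1 and Memory Controller 2 serve Memory 1 and Memory 2 for requests A, B, C, and D. Annotation: "Further reduction?"]

Rename transaction IDs at the NoC boundary to enable out-of-order transfers in the NoC

*Requirement: reserve buffer space in the RP for transaction ID renaming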
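
One way to picture transaction ID renaming is the behavioral sketch below. It is an illustration under stated assumptions, not the design from the slides: the class `RequestParallelizer`, its methods, and the policy of one fresh network ID per request with a per-master-ID reorder buffer are all hypothetical. The `capacity` parameter stands for the reserved buffer space mentioned in the requirement.

```python
from collections import defaultdict, deque


class RequestParallelizer:
    """Renames master IDs to unique network IDs and restores response order."""

    def __init__(self, capacity):
        self.free_ids = deque(range(capacity))  # one network ID per buffer slot
        self.in_flight = defaultdict(deque)     # master ID -> network IDs, issue order
        self.owner = {}                         # network ID -> master ID
        self.buffered = {}                      # network ID -> early response data

    def issue(self, master_id):
        """Return a fresh network ID, or None if no buffer slot is free."""
        if not self.free_ids:
            return None
        net_id = self.free_ids.popleft()
        self.in_flight[master_id].append(net_id)
        self.owner[net_id] = master_id
        return net_id

    def receive(self, net_id, data):
        """Buffer a response; release responses of that master ID in issue order."""
        master_id = self.owner[net_id]
        self.buffered[net_id] = data
        queue = self.in_flight[master_id]
        released = []
        while queue and queue[0] in self.buffered:
            head = queue.popleft()
            released.append((master_id, self.buffered.pop(head)))
            del self.owner[head]
            self.free_ids.append(head)          # the buffer slot becomes free again
        return released


rp = RequestParallelizer(capacity=2)
a = rp.issue(7)              # request A, master ID 7
b = rp.issue(7)              # request B, same master ID
stalled = rp.issue(7)        # no buffer slot left
early = rp.receive(b, "B")   # B returns first and is held in the buffer
in_order = rp.receive(a, "A")
print(stalled, early, in_order)
```

Because every request carries a distinct network ID, the network and the memory controllers are free to complete B before A; the master still sees A before B.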

228

On-Chip Network (OCN) Design Flow

[Figure: OCN design flow. Stages shown: architecture exploration w/ performance evaluation (IP1 to IP4 and memories mapped onto Bus 1 / Bus 2 or a single Bus 1); pre-layout register slice optimization; early engagement w/ floorplanning; bus RTL generation; generated bus RTL code; bus RTL verification (Specman-based); P&R. Problems annotated at P&R: too long wire, too big I/O delay. Remedies annotated: latching, buffering, pipelining.]

229

AXI Bus Architecture Design Space

n Interconnection level
  n PL300 decomposition: cascading level
n Bus component level
  n Common parameters: bit width, etc.
  n PL300: WriteIssueCap, etc.
  n PL340: arbiter queue, data FIFO size, etc.
  n Register slice: full/forward/bypass per channel
    n Closely related to implementation
n IP interface level
  n # of AXI ports, internal buffer usage/size, address queue size, etc.

[Figure: example system in which processors, a DSP, and IPs connect through cascaded PL300 interconnects (2x2, 3x2), register slices, and PL340 blocks to SRAM. Annotations:]
n Large WriteIssuingCapability on high-latency paths
n Register slices on timing-critical paths
n 2-level cascaded architecture to localize traffic
n Advanced AXI interface: buffer usage and size, address queue size, burst size

230

Performance Evaluation of Bus Architecture

n Five methods depending on how to model the traffic of bus masters
  n ViP (virtual platform)-based methods
    n Random bus traffic model
      n ViP version of Sonics' random pattern generator
    n Sequence-based bus traffic model
      n ViP version of Specman-based model in DTV team
    n Timed model
      n Bus masters are modeled in cycle accuracy as ViP models
  n HW Integrator
    n Bus masters are emulated on FPGA
  n RTL simulation
    n Golden RTL is built and simulated

231

Bus RTL Generation: AMBA Designer

n Generated IPs
  n Interconnect, DMA, static memory controller, etc.
n Parameters
  n Data width (32b ~ 128b), protocol (AXI, AHB, APB), etc.
  n Timing (register slice, address decode), security (TrustZone), etc.
  n QoS (programmable QoS), arbitration schemes, etc.

232

Bus RTL Verification

n Bus component verification
  n Verification of each component (PL30x, 340, bridges)
  n Method & criterion: Specman eVC & protocol coverage
n Bus architecture verification
  n Verification of the entire bus architecture
  n Method & criterion: Specman eVC + system scenarios & functional correctness + protocol coverage
  n Called Platform Verifier
    n Under development by System LSI Design Technology team

233



234

Register Slice for Timing Isolation

n Register slice can be used at any channel independently

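[Figure: AXI master and AXI slave connected by the AXI channels, each with its own handshake: write data (WREADY), read data (RREADY), response (BREADY), and the address channels (AWREADY, ARREADY)]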

235

Register Slice for Timing Isolation

n Register slice incurs one cycle latency per insertion

[Figure: AXI master and AXI slave connected by the AXI channels: write data (WREADY), read data (RREADY), response (BREADY), and the address channels (AWREADY, ARREADY)]
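
As a worked example of the one-cycle penalty, the sketch below adds up the latency of a read that crosses register slices on the read address (AR) and read data (R) channels. The function name `read_latency_cycles` and the numbers are hypothetical and serve only to show the arithmetic.

```python
def read_latency_cycles(base_cycles, ar_slices, r_slices):
    """Read latency when each register slice adds one cycle on its channel."""
    # The request crosses the AR slices, the returning data crosses the R slices.
    return base_cycles + ar_slices + r_slices


# Hypothetical path: 10 cycles without slices, one slice on AR, two on R.
latency = read_latency_cycles(10, ar_slices=1, r_slices=2)
print(latency)  # 13
```

Since slices can be placed per channel, only the channels that actually carry a timing-critical path need to pay this cost.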

236

Two Modes: Full and Forward

n Area benefit: forward mode gives smaller area

n The same latency penalty

AXI register slice (full mode)

AXI register slice (forward mode)

[Source: PL301 TS]

237



238

Announcement

n Homework #5 (Due: Nov. 13)

239
