ece5917 soc architecture: arm/ambacontents.kocw.net/kocw/document/2014/sungkyunkwan/hanta... ·...
TRANSCRIPT
ECE5917SoC Architecture: ARM/AMBA
Tae Hee Han: [email protected]
Semiconductor Systems Engineering
Sungkyunkwan University
Outline
n ARM Processor Architecture
n AMBA Bus
2
What Is ARM?
n Advanced RISC Machine
n Founded in November 1990n Spun out of Acorn Computers – Joint venture of Acorn computer, Apple computer and
VLSI technology
n Designs the ARM range of RISC processor cores
n Licenses ARM core designs to semiconductor partners who fabricate and sell to their customers
n ARM does not fabricate silicon itself
n Also develop technologies to assist with the design-in of the ARM architecturen Software tools, boards, debug hardware, application software, bus architectures,
peripherals etc.
n Market-leader for low-power and cost-sensitive embedded applicationsn Architectural simplicity which allows very small implementations which result in very
low power consumption
3
Reduced Instruction Set Computer
The History of ARM
n Developed at Acorn Computers Limited, of Cambridge, England, between 1983 and 1985
n Problems with CISC:n Slower then memory parts
n Due to memory operand and instruction decodingn Variable clock cycles per instruction
n Solution – the Berkeley RISC I:n Competitiven Easy to develop (less than a year)n Cheapn Pointing the way to the future
4
5
ARM AArch32
Evolution of the AArch32 (1)
n Original ARM architecture:n 32-bit RISC architecture focused on core instruction setn 16 Registers - 1 being the Program counter – generally accessiblen Conditional execution on all instructionsn Load/Store Multiple operations - Good for Code Densityn Shifts available on data processing and address generationn Original architecture had 26-bit address space
n Augmented by a 32-bit address space early in the evolution
n Thumb instruction set was the next big stepn ARMv4T architecture (ARM7TDMI)n Introduced a 16-bit instruction set alongside the 32-bit instruction set
n Different execution states for different instruction setsn Switching ISA as part of a branch or exceptionn Not a full instruction set – ARM still essential
6
Evolution of the AArch32 (2)
n ARMv5TEJ (ARM926EJ-S) introduced:n Better interworking between ARM and Thumb
n Bottom bit of the address used to determine the ISAn DSP-focused additional instructionsn Jazelle-DBX for Java byte code interpretation in hardwaren Some architecting of the virtual memory system
n ARMv6K (ARM1136JF-S) introduced:n Media processing – SIMD within the integer datapathn Enhanced exception handlingn Overhaul of the memory system architecture to be fully architected
n Supported only 1 level of cache
n ARMv7 rolled in a number of substantive changes:n Thumb-2* - variable length instruction setn TrustZone*n Jazelle-RCTn Neon
7
AArch32 Architecture (1)
n Typical RISC architecture:n Large uniform register filen Load/store architecturen Simple addressing modesn Uniform and fixed-length instruction fields
8
AArch32 Architecture (2)
n Enhancements:n Each instruction controls the ALU and shiftern Auto-increment and auto-decrement addressing modesn Multiple Load/Storen Conditional execution
n Results:n High performancen Low code sizen Low power consumptionn Low silicon area
9
Processor Modes (1)
n The ARM has seven basic processor modes:n User : unprivileged mode under which most tasks runn Privileged:
n System : privileged mode using the same registers as user moden FIQ : entered when a high priority (fast) interrupt is raisedn IRQ : entered when a low priority (normal) interrupt is raisedn Supervisor : entered on reset and when a Software Interrupt instruction is
executedn Abort : used to handle memory access violationsn Undef : used to handle undefined instructions
n Mode changes can be made under software control, or can be caused by external interrupts or exception processing
10
Exceptionmodes
Processor Modes (2)
n User mode:n Normal program execution
moden System resources
unavailablen Mode changed by exception
only
n Exception modes:n Entered upon exceptionn Full access to system
resourcesn Mode changed freely
11
AArch32 Registers (1)
n ARM has 37 registers all of which are 32-bits long.n 31 general purpose registers including a program counter (PC)
n At any one time, 16 visible (R0 - R15), Others speed up the exception processn 6 status registers
n 1 dedicated current program status register (CPSR)n 5 dedicated saved program status registers (SPSR)
n The current processor mode governs which of several banks is accessible. Each mode can access
n a particular set of R0-R12 registersn a particular R13 (the stack pointer, sp) and R14 (the link register, lr)n the program counter, R15 (pc)n the current program status register, CPSR
n Privileged modes (except System) can also accessn a particular SPSR (saved program status register)
12
AArch32 Registers (2)
n General-purpose registersn The GP registers R0 – R15 can be split into three groups, which
are differentiated in the way they are banked and in their special-purpose uses:
n The unbanked registers, R0 – R7n Each of them refers to the same 32-bit physical register in all processor
modesn They are completely GP registers, with no special uses implied
n The banked registers, R8 – R14n The physical register referred to by each of them depends on the current
processor moden Where a particular physical register is intended, w/o depending on the current
processor mode, a more specific name is usedn Register 15, R15 (Program Counter)
13
AArch32 Registers (3-1)
R0R1R2R3R4R5R6R7R8R9
R10R11R12
R13 (sp)R14 (lr)R15 (pc)
CPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R8R9
R10R11R12
R13 (sp)R14 (lr)
SPSR
FIQ IRQ SVC Undef Abort
User Mode R0R1R2R3R4R5R6R7R8R9
R10R11R12
R13 (sp)R14 (lr)R15 (pc)
CPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R8R9
R10R11R12
R13 (sp)R14 (lr)
SPSR
Current Visible Registers
Banked out Registers
FIQ IRQ SVC Undef Abort
R0R1R2R3R4R5R6R7
R15 (pc)
CPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R8R9
R10R11R12
R13 (sp)R14 (lr)
SPSR
Current Visible Registers
Banked out Registers
User IRQ SVC Undef Abort
R8R9
R10R11R12
R13 (sp)R14 (lr)
FIQ ModeIRQ Mode R0R1R2R3R4R5R6R7R8R9
R10R11R12
R15 (pc)
CPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R8R9
R10R11R12
R13 (sp)R14 (lr)
SPSR
Current Visible Registers
Banked out Registers
User FIQ SVC Undef Abort
R13 (sp)R14 (lr)
Undef Mode R0R1R2R3R4R5R6R7R8R9
R10R11R12
R15 (pc)
CPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R8R9
R10R11R12
R13 (sp)R14 (lr)
SPSR
Current Visible Registers
Banked out Registers
User FIQ IRQ SVC Abort
R13 (sp)R14 (lr)
SVC Mode R0R1R2R3R4R5R6R7R8R9
R10R11R12
R15 (pc)
CPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R8R9
R10R11R12
R13 (sp)R14 (lr)
SPSR
Current Visible Registers
Banked out Registers
User FIQ IRQ Undef Abort
R13 (sp)R14 (lr)
User Mode R0R1R2R3R4R5R6R7R8R9
R10R11R12
R15 (pc)
CPSR
R13 (sp)R14 (lr)
SPRS
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
R8R9
R10R11R12
R13 (sp)R14 (lr)
SPSR
Current Visible Registers
Banked out Registers
FIQ IRQ SVC Undef
14
R13 (sp)R14 (lr)
SPRS
Abort
AArch32 Registers (3-2)
R0R1R2R3R4R5R6R7R8R9
R10R11R12
R13 (sp)R14 (lr)R15 (pc)
CPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R8R9
R10R11R12
R13 (sp)R14 (lr)
SPSR
FIQ IRQ SVC Undef Abort
User Mode R0R1R2R3R4R5R6R7R8R9
R10R11R12
R13 (sp)R14 (lr)R15 (pc)
CPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R8R9
R10R11R12
R13 (sp)R14 (lr)
SPSR
Current Visible Registers
Banked out Registers
FIQ IRQ SVC Undef Abort
R0R1R2R3R4R5R6R7
R15 (pc)
CPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R8R9
R10R11R12
R13 (sp)R14 (lr)
SPSR
Current Visible Registers
Banked out Registers
User IRQ SVC Undef Abort
R8R9
R10R11R12
R13 (sp)R14 (lr)
FIQ ModeIRQ Mode R0R1R2R3R4R5R6R7R8R9
R10R11R12
R15 (pc)
CPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R8R9
R10R11R12
R13 (sp)R14 (lr)
SPSR
Current Visible Registers
Banked out Registers
User FIQ SVC Undef Abort
R13 (sp)R14 (lr)
Undef Mode R0R1R2R3R4R5R6R7R8R9
R10R11R12
R15 (pc)
CPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R8R9
R10R11R12
R13 (sp)R14 (lr)
SPSR
Current Visible Registers
Banked out Registers
User FIQ IRQ SVC Abort
R13 (sp)R14 (lr)
SVC Mode R0R1R2R3R4R5R6R7R8R9
R10R11R12
R15 (pc)
CPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R8R9
R10R11R12
R13 (sp)R14 (lr)
SPSR
Current Visible Registers
Banked out Registers
User FIQ IRQ Undef Abort
R13 (sp)R14 (lr)
FIQ Mode R0R1R2R3R4R5R6R7R8R9
R10R11R12
R15 (pc)
CPSR
R13 (sp)R14 (lr)
SPRS
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
Current Visible Registers
Banked out Registers
IRQ SVC Undef
15
R13 (sp)R14 (lr)
SPRS
Abort
R8R9
R10R11R12
R13 (sp)R14 (lr)
SPSR
User
AArch32 Registers (3-3)
R0R1R2R3R4R5R6R7R8R9
R10R11R12
R13 (sp)R14 (lr)R15 (pc)
CPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R8R9
R10R11R12
R13 (sp)R14 (lr)
SPSR
FIQ IRQ SVC Undef Abort
User Mode R0R1R2R3R4R5R6R7R8R9
R10R11R12
R13 (sp)R14 (lr)R15 (pc)
CPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R8R9
R10R11R12
R13 (sp)R14 (lr)
SPSR
Current Visible Registers
Banked out Registers
FIQ IRQ SVC Undef Abort
R0R1R2R3R4R5R6R7
R15 (pc)
CPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R8R9
R10R11R12
R13 (sp)R14 (lr)
SPSR
Current Visible Registers
Banked out Registers
User IRQ SVC Undef Abort
R8R9
R10R11R12
R13 (sp)R14 (lr)
FIQ ModeIRQ Mode R0R1R2R3R4R5R6R7R8R9
R10R11R12
R15 (pc)
CPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R8R9
R10R11R12
R13 (sp)R14 (lr)
SPSR
Current Visible Registers
Banked out Registers
User FIQ SVC Undef Abort
R13 (sp)R14 (lr)
Undef Mode R0R1R2R3R4R5R6R7R8R9
R10R11R12
R15 (pc)
CPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R8R9
R10R11R12
R13 (sp)R14 (lr)
SPSR
Current Visible Registers
Banked out Registers
User FIQ IRQ SVC Abort
R13 (sp)R14 (lr)
SVC Mode R0R1R2R3R4R5R6R7R8R9
R10R11R12
R15 (pc)
CPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R8R9
R10R11R12
R13 (sp)R14 (lr)
SPSR
Current Visible Registers
Banked out Registers
User FIQ IRQ Undef Abort
R13 (sp)R14 (lr)
IRQ Mode R0R1R2R3R4R5R6R7R8R9
R10R11R12
R15 (pc)
CPSR
R13 (sp)R14 (lr)
SPRS
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
Current Visible Registers
Banked out Registers
SVC Undef
16
R13 (sp)R14 (lr)
SPRS
Abort
R13 (sp)R14 (lr)
User
R8R9
R10R11R12
R13 (sp)R14 (lr)
SPSR
FIQ
AArch32 Registers (3-4)
R0R1R2R3R4R5R6R7R8R9
R10R11R12
R13 (sp)R14 (lr)R15 (pc)
CPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R8R9
R10R11R12
R13 (sp)R14 (lr)
SPSR
FIQ IRQ SVC Undef Abort
User Mode R0R1R2R3R4R5R6R7R8R9
R10R11R12
R13 (sp)R14 (lr)R15 (pc)
CPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R8R9
R10R11R12
R13 (sp)R14 (lr)
SPSR
Current Visible Registers
Banked out Registers
FIQ IRQ SVC Undef Abort
R0R1R2R3R4R5R6R7
R15 (pc)
CPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R8R9
R10R11R12
R13 (sp)R14 (lr)
SPSR
Current Visible Registers
Banked out Registers
User IRQ SVC Undef Abort
R8R9
R10R11R12
R13 (sp)R14 (lr)
FIQ ModeIRQ Mode R0R1R2R3R4R5R6R7R8R9
R10R11R12
R15 (pc)
CPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R8R9
R10R11R12
R13 (sp)R14 (lr)
SPSR
Current Visible Registers
Banked out Registers
User FIQ SVC Undef Abort
R13 (sp)R14 (lr)
Undef Mode R0R1R2R3R4R5R6R7R8R9
R10R11R12
R15 (pc)
CPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R8R9
R10R11R12
R13 (sp)R14 (lr)
SPSR
Current Visible Registers
Banked out Registers
User FIQ IRQ SVC Abort
R13 (sp)R14 (lr)
SVC Mode R0R1R2R3R4R5R6R7R8R9
R10R11R12
R15 (pc)
CPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R8R9
R10R11R12
R13 (sp)R14 (lr)
SPSR
Current Visible Registers
Banked out Registers
User FIQ IRQ Undef Abort
R13 (sp)R14 (lr)
SVC Mode R0R1R2R3R4R5R6R7R8R9
R10R11R12
R15 (pc)
CPSR
R13 (sp)R14 (lr)
SPRS
R13 (sp)R14 (lr)
SPSR
Current Visible Registers
Banked out Registers
Undef
17
R13 (sp)R14 (lr)
SPRS
Abort
R13 (sp)R14 (lr)
User
R8R9
R10R11R12
R13 (sp)R14 (lr)
SPSR
FIQ
R13 (sp)R14 (lr)
SPSR
IRQ
AArch32 Registers (3-5)
R0R1R2R3R4R5R6R7R8R9
R10R11R12
R13 (sp)R14 (lr)R15 (pc)
CPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R8R9
R10R11R12
R13 (sp)R14 (lr)
SPSR
FIQ IRQ SVC Undef Abort
User Mode R0R1R2R3R4R5R6R7R8R9
R10R11R12
R13 (sp)R14 (lr)R15 (pc)
CPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R8R9
R10R11R12
R13 (sp)R14 (lr)
SPSR
Current Visible Registers
Banked out Registers
FIQ IRQ SVC Undef Abort
R0R1R2R3R4R5R6R7
R15 (pc)
CPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R8R9
R10R11R12
R13 (sp)R14 (lr)
SPSR
Current Visible Registers
Banked out Registers
User IRQ SVC Undef Abort
R8R9
R10R11R12
R13 (sp)R14 (lr)
FIQ ModeIRQ Mode R0R1R2R3R4R5R6R7R8R9
R10R11R12
R15 (pc)
CPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R8R9
R10R11R12
R13 (sp)R14 (lr)
SPSR
Current Visible Registers
Banked out Registers
User FIQ SVC Undef Abort
R13 (sp)R14 (lr)
Undef Mode R0R1R2R3R4R5R6R7R8R9
R10R11R12
R15 (pc)
CPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R8R9
R10R11R12
R13 (sp)R14 (lr)
SPSR
Current Visible Registers
Banked out Registers
User FIQ IRQ SVC Abort
R13 (sp)R14 (lr)
SVC Mode R0R1R2R3R4R5R6R7R8R9
R10R11R12
R15 (pc)
CPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R8R9
R10R11R12
R13 (sp)R14 (lr)
SPSR
Current Visible Registers
Banked out Registers
User FIQ IRQ Undef Abort
R13 (sp)R14 (lr)
Undef Mode R0R1R2R3R4R5R6R7R8R9
R10R11R12
R15 (pc)
CPSR
R13 (sp)R14 (lr)
SPRS
R13 (sp)R14 (lr)
SPSR
Current Visible Registers
Banked out Registers
SVC
18
R13 (sp)R14 (lr)
SPRS
Abort
R13 (sp)R14 (lr)
User
R8R9
R10R11R12
R13 (sp)R14 (lr)
SPSR
FIQ
R13 (sp)R14 (lr)
SPSR
IRQ
AArch32 Registers (3-6)
R0R1R2R3R4R5R6R7R8R9
R10R11R12
R13 (sp)R14 (lr)R15 (pc)
CPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R8R9
R10R11R12
R13 (sp)R14 (lr)
SPSR
FIQ IRQ SVC Undef Abort
User Mode R0R1R2R3R4R5R6R7R8R9
R10R11R12
R13 (sp)R14 (lr)R15 (pc)
CPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R8R9
R10R11R12
R13 (sp)R14 (lr)
SPSR
Current Visible Registers
Banked out Registers
FIQ IRQ SVC Undef Abort
R0R1R2R3R4R5R6R7
R15 (pc)
CPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R8R9
R10R11R12
R13 (sp)R14 (lr)
SPSR
Current Visible Registers
Banked out Registers
User IRQ SVC Undef Abort
R8R9
R10R11R12
R13 (sp)R14 (lr)
FIQ ModeIRQ Mode R0R1R2R3R4R5R6R7R8R9
R10R11R12
R15 (pc)
CPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R8R9
R10R11R12
R13 (sp)R14 (lr)
SPSR
Current Visible Registers
Banked out Registers
User FIQ SVC Undef Abort
R13 (sp)R14 (lr)
Undef Mode R0R1R2R3R4R5R6R7R8R9
R10R11R12
R15 (pc)
CPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R8R9
R10R11R12
R13 (sp)R14 (lr)
SPSR
Current Visible Registers
Banked out Registers
User FIQ IRQ SVC Abort
R13 (sp)R14 (lr)
SVC Mode R0R1R2R3R4R5R6R7R8R9
R10R11R12
R15 (pc)
CPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R13 (sp)R14 (lr)
SPSR
R8R9
R10R11R12
R13 (sp)R14 (lr)
SPSR
Current Visible Registers
Banked out Registers
User FIQ IRQ Undef Abort
R13 (sp)R14 (lr)
Abort Mode R0R1R2R3R4R5R6R7R8R9
R10R11R12
R15 (pc)
CPSR
R13 (sp)R14 (lr)
SPRS
R13 (sp)R14 (lr)
SPSR
Current Visible Registers
Banked out Registers
SVC
19
R13 (sp)R14 (lr)
SPRS
Undef
R13 (sp)R14 (lr)
User
R8R9
R10R11R12
R13 (sp)R14 (lr)
SPSR
FIQ
R13 (sp)R14 (lr)
SPSR
IRQ
AArch32 Register Organization Summary
20
UsermodeR0-R7,R15,and
CPSR
R8R9
R10R11R12
R13 (sp)R14 (lr)
SPSR
FIQ
R8R9
R10R11R12
R13 (sp)R14 (lr)R15 (pc)
CPSR
R0R1R2R3R4R5R6R7
User
R13 (sp)R14 (lr)
SPSR
IRQ
Usermode
R0-R12,R15,and
CPSR
R13 (sp)R14 (lr)
SPSR
Undef
Usermode
R0-R12,R15,and
CPSR
R13 (sp)R14 (lr)
SPSR
SVC
Usermode
R0-R12,R15,and
CPSR
R13 (sp)R14 (lr)
SPSR
Abort
Usermode
R0-R12,R15,and
CPSR
Thumb stateLow registers
Thumb stateHigh registers
Note: System mode uses the User mode register set
Program Status Register
n Condition code flagsn N = Negative result from ALU n Z = Zero result from ALUn C = ALU operation Carried outn V = ALU operation oVerflow
n Sticky Overflow flag - Q flagn Architecture 5TE/J onlyn Indicates if saturation has occurred
n J bitn Architecture 5TEJ onlyn J = 1: Processor in Jazelle state
n Interrupt Disable bits.n I = 1: Disables the IRQ.n F = 1: Disables the FIQ.
n T Bitn Architecture xT onlyn T = 0: Processor in ARM staten T = 1: Processor in Thumb state
n Mode bitsn Specify the processor mode
2731
N Z C V Q
28 67
I F T mode
1623 815 5 4 024
f s x c
U n d e f i n e dJ
21
Program Counter (R15)
n When the processor is executing in ARM state:n All instructions are 32 bits widen All instructions must be word alignedn Therefore the pc value is stored in bits [31:2] with bits [1:0] undefined (as
instruction cannot be halfword or byte aligned)
n When the processor is executing in Thumb state:n All instructions are 16 bits widen All instructions must be halfword alignedn Therefore the pc value is stored in bits [31:1] with bit [0] undefined (as
instruction cannot be byte aligned)
n When the processor is executing in Jazelle state:n All instructions are 8 bits widen Processor performs a word access to read 4 instructions at once
22
Vector Table
Exception Handling
n When an exception occurs, the ARM:n Copies CPSR into SPSR_<mode>n Sets appropriate CPSR bits
n Change to ARM staten Change to exception mode n Disable interrupts (if appropriate)
n Stores the return address in LR_<mode>n Sets PC to vector address
n To return, exception handler needs to:n Restore CPSR from SPSR_<mode>n Restore PC from LR_<mode>
This can only be done in ARM state Vector table can be at 0xFFFF0000 on ARM720T
and on ARM9/10 family devices
FIQIRQ
(Reserved)Data Abort
Prefetch AbortSoftware Interrupt
Undefined Instruction
Reset
0x1C0x180x140x100x0C0x080x040x00
23
24
Data Sizes and Instruction Set
n The AArch32 is a 32-bit architecture
n When used in relation to the AArch32:n Byte means 8 bitsn Halfword means 16 bits (two bytes)n Word means 32 bits (four bytes)
n Most AArch32 implement two instruction setsn 32-bit ARM Instruction Set: Standard 32-bit instruction setn 16-bit Thumb Instruction Set
n 16-bit compressed form
n Code density better than most CISC
n Dynamic decompression in pipeline
AArch32 Instruction Set Basic Features
n Load/Store architecture
n 3-address data processing instructions
n Conditional execution
n Load/Store multiple registers
n Shift & ALU operation in single clock cycle
25
Conditional Execution and Flags
n AArch32 instructions can be made to execute conditionally by postfixing them with the appropriate condition code field
n This improves code density and performance by reducing the number of forward branch instructionsCMP R3,#0 CMP R3,#0BEQ skip ADDNE R0,R1,R2ADD R0,R1,R2skip
n By default, data processing instructions do not affect the condition code flags but the flags can be optionally set by using “S”. CMP does not need “S”.
loop…SUBS R1,R1,#1BNE loop
if Z flag clear then branch
decrement R1 and set flags
26
Condition Codes
n The possible condition codes are listed below:n Note AL is the default and does not need to be specified
Not equalUnsigned higher or sameUnsigned lowerMinus
Equal
OverflowNo overflowUnsigned higherUnsigned lower or same
Positive or Zero
Less thanGreater thanLess than or equalAlways
Greater or equal
EQNECS/HSCC/LO
PLVS
HILSGELTGTLEAL
MI
VC
Suffix Description
Z=0C=1C=0
Z=1Flags tested
N=1N=0V=1V=0C=1 & Z=0C=0 or Z=1N=VN!=VZ=0 & N=VZ=1 or N=!V
27
Examples of Conditional Execution
n Use a sequence of several conditional instructions if (a==0) func(1);
CMP R0,#0MOVEQ R0,#1BLEQ func
n Set the flags, then use various condition codesif (a==0) x=0;if (a>0) x=1;
CMP R0,#0MOVEQ R1,#0MOVGT R1,#1
n Use conditional compare instructionsif (a==4 || a==10) x=0;
CMP R0,#4CMPNE R0,#10MOVEQ R1,#0
28
Branch Instructions
n Branch : B{<cond>} label
n Branch with Link : BL{<cond>} subroutine_label
n The processor core shifts the offset field left by 2 positions, sign-extends it and adds it to the PC
n ± 32 Mbyte rangen How to perform longer branches?
2831 24 0
Cond 1 0 1 L Offset
Condition field
Link bit 0 = Branch1 = Branch with link
232527
29
Can set up LR manually if needed, then load into PCMOV lr, pcLDR pc, =dest
ADS linker will automatically generate long branch veneers for branches beyond 32MB range.
Data Processing Instructions
n Consist of :n Arithmetic: ADD ADC SUB SBC RSB RSCn Logical: AND ORR EOR BICn Comparisons: CMP CMN TST TEQn Data movement: MOV MVN
n These instructions only work on registers, NOT memoryn Syntax:
<Operation>{<cond>}{S} Rd, Rn, Operand2
n Comparisons set flags only - they do not specify Rdn Data movement does not specify Rn
n Second operand is sent to the ALU via barrel shifter
30
The Barrel Shifter
DestinationCF 0 Destination CF
LSL : Logical Left Shift ASR: Arithmetic Right Shift
Multiplication by a power of 2 Division by a power of 2, preserving the sign bit
Destination CF...0 Destination CF
LSR : Logical Shift Right ROR: Rotate Right
Division by a power of 2 Bit rotate with wrap aroundfrom LSB to MSB
Destination
RRX: Rotate Right Extended
Single bit rotate with wrap aroundfrom CF to MSB
CF
31
Register, optionally with shift operation
n Shift value can be either be:n 5 bit unsigned integern Specified in bottom byte of another
registern Used for multiplication by constant
Immediate valuen 8 bit number, with a range of 0-255.
n Rotated right through even number of positions
n Allows increased range of 32-bit constants to be loaded directly into registersResult
Operand 1
BarrelShifter
Operand 2
ALU
Using the Barrel Shifter: The Second Operand
32
Data Processing Instructions
33
Conditional codes
+Data processing instructions
+Barrel shifter
=Powerful tools for efficient coded programs
Data Processing Instructions
34
e.g.:
if (z==1) R1=R2+(R3*4)
compiles to
EQADDS R1,R2,R3, LSL #2
( SINGLE INSTRUCTION ! )
THUMB Instruction Set
n Thumb is a 16-bit instruction setn Optimized for code density from C code (~65% of ARM code size)n Improved performance from narrow memoryn Subset of the functionality of the ARM instruction set
n Core has additional execution state - Thumbn Switch between ARM and Thumb using BX instruction
n For most instructions generated by compiler:
n Conditional execution is not usedn Source and destination registers
identicaln Only Low registers usedn Constants are of limited sizen Inline barrel shifter not used
35
015
31 0ADDS R2,R2,#1
ADD R2,#1
32-bit ARM Instruction
16-bit Thumb Instruction
ARM7 Family
n 32/16-bit RISC Architecture
n Unified bus architecturen Both instructions and data use the same bus
n 3 stage pipeliningn Fetch / decode / execution
n Coprocessor interface
n Embedded ICE-RT support
n JTAG interface unit
n Optional support for MMU
36
ARM7 Block Diagram
37
Address Register
Register bank
Multiplier
BarrelShifter
Controller
Instruction Decoder
Data in Register
Data out Register
ALU
+1
A[31:0]
PC
Immediate constant
PC update
Control Signals
D[31:0]
ALU-bus
A-bus
B-bus
ARM7 Family
38
AHB I/F
ETM7 I/F
ARM7
8K Cache
MMU
ASB I/F
ARM7
8K Cache
MPU
• Platform OS core• 100MHz (0.13um)• 0.20mW/MHz• Hard Macro IP
ARM720T
• Embedded RTOS core• Hard Macro IP
ARM740T
• Base Integer core• 80-110MHz (0.18um)• 0.28mW/MHz• Synthesizable (soft IP)
ARM7TDMI-S
• Base Integer core(Hard Macro IP)• 100MIPS (0.18um)• 0.25mW/MHz• NEW : low-voltage layout
ARM7TDMI • ARM v4T ISA• Von Neumann Architecture• Thumb Support• Debug Support• 3-Stage Pipeline
ARM7-S
ARM7
ARM9 Family
n 32/16-bit RISC Architecture
n Harvard Architecturen Separate memory bus architecture
n 5 stage pipeliningn Fetch/Decode/Execute/Memory/Write
n Coprocessor interface
n EmbeddedICE-RT support
n JTAG interface unit
n Embedded Trace Macrocell
39
ARM920T Block Diagram
40
Trace Interface
Port
External coprocessor
interfaceInstruction
cache
ARM9TDMIProcessor core
(Integral Embedded ICE)
InstructionMMU
R13
CP15
R13
DataCache Data
MMUWrite backPATRG RAM
WriteBuffer
Trace Interface
Port
IPA[31:0]
IMVA[31:0]
ID[31:0]IVA[31:0]
DVA[31:0] DD[31:0]
DMVA[31:0] DPA[31:0]
WBPA[31:0] DINDEX[5:0]
JTAG
ASB
ARM9 Family
41
ASB, AHB I/F
ARM9T
4K Cache
MPU
• Platform OS processor core• Up to 200MIPS (typical)• 180MHz (typical)
ARM920T
• Embedded RTOS processor core
ARM940T
• Base Integer processor core
ARM9TDMI• ARM v4T ISA• Harvard architecture• Thumb Support• Debug Support• 5-Stage Pipeline• Dual caches• AMBA ASB, AHB bus
ARM9T
ASB, AHB I/F
ETM9 I/F
ARM9T
16K Cache
MMU
ASB, AHB I/F
ETM9 I/F
ARM9
8K Cache
MMU
• Platform OS processor core• Up to 200MIPS (0.18um)• 0.8mW/MHz
ARM922T
ARM9E Family
42
• Platform OS processor core• 200MIPS• 220MHz (0.18um)
ARM926EJ-S
• Embedded RTOS processor core
ARM966E-S
• Low Power Embedded core• 270MHz (0.13um)• 0.14mW/MHz
ARM968E-S• Synthesizable (Soft IP)• ARM v5TE ISA• Thumb Support• Debug Support• 5-Stage Pipeline• Dual Caches• AMBA AHB bus
ARM9E-S
• Embedded OS processor core• 166MHz (0.18um)• 0.86mW/MHz
ARM946E-S
AHB I/F
ETM9 I/F
ARM9E-S
Flexible Caches
TCM I/F
MMU
AHB I/F
ETM9 I/F
ARM9E-S
Flexible Caches
TCM I/F
MPU
AHB I/F
ETM9 I/F
ARM9E-S
Flexible SRAMs
JazelleEnhanced
ARM10 Family
n 6 stage pipeline
n Parallel load / store unitn Allow computation to continue while data transfers complete
n 64-bit data bus
n Optional support for vector floating-pointn 7th stage in pipelinen Increase FP performancen Out-of-order completion
43
ARM1026EJ-S block diagram
n ARM10EJ-S integer core in an ARMv5TEJ implementation
n 32-bit ARMn 16-bit Thumbn 8-bit Jazelle instruction set
n MMUn Single TLB for both instruction and
data
n Memory Protection Unit (MPU)n Partition external memory into 8-
protection regions
n Cache (I/D)n Configurable to 0KB or 4-128KB
n Tightly Coupled Memory (TCM)n Configurable to 0KB or 4KB-1MB
44
InstructionTCM
interface
DataTCM
interface
ETM 10C Interface
InstructionCache
DataCache
MMU/MPU MMU/MPU
ARM1026core
Control Logic and Bus Interface Unit
Write buffer
VFP10Interface
VICInterface
AMBA AHB InterfaceInterface Data
ARM11 Family
n 8 stage pipelinen load-store pipelinen Arithmetic pipeline
n Parallel load / store unitn Allow computation to continue while data transfers completen Non-blocking cache
n 64-bit data bus
n Out-of-order completion
45
ARM1136JF-S block diagram
n ARM11 Core in an ARMv6 implementation
n 32-bit ARMn 16-bit Thumbn 8-bit Jazelle instruction set
n Load Store Unit(LSU)n Manages all load and store
operationsn Load-Store pileline
n Decouples loads and stores from the MAC and ALU pipelines
n Prefetch Unit
n Memory systemn Harvard architecturen 64-bit datapathsn 2-channel DMA into TCM
46
Vector Floating Point(ARM1136JF-S only)
InstructionCache
TCRAM
Pre-fetch Unit Core Data Cache /
TCRAMLoad Store
Unit
Main Translation Look-aside
Buffer
System metrics
Level one instruction side cache controller
DMA Level one data side cache controller
Debug/JTAGETM
VICExternal coprocessor
Interface
Instruction fetch DMA Data readData Write
Peripheral
ARM11 Family
47
• Platform OS core
ARM11
• ARM v6 ISA• Thumb Code Compression• DSP extensions• 8-Stage Pipeline (35% enhanced)• SIMD media processing• Multi-port 64bit memory system• Optional VFP coprocessor• TCM(scratch-pad) memory
ARM11
• Platform OS core• 620MHz (90nm)• 0.45mW/MHz• Floating Point coprocessor
ARM1136JF-S
M-layer AHB
Multi Port
ARM11
I/D Cache
TCM
VFP
M-layer AHB
Multi Port
ARM11
I/D Cache
TCM• Platform OS core• 620MHz (90nm)• 0.45mW/MHz
ARM1136J-S
ARM’s First MPCore Architecture based on ARM11
CPU/VFPCPU/VFP
L1 MemoryL1 Memory
CPU/VFPCPU/VFP
L1 MemoryL1 Memory
CPU/VFPCPU/VFP
L1 MemoryL1 Memory
CPU/VFPCPU/VFP
L1 MemoryL1 Memory
Snoop Control Unit (SCU)I&D64-bit bus
CoherencyControl
bus
Timer
WdogCPU
Interface
IRQ
Timer
WdogCPU
Interface
IRQ
Timer
WdogCPU
Interface
IRQ
Timer
WdogCPU
Interface
IRQ
Interrupt Distributor
Private FIQ lines
Configurable HW interrupt lines
ConfigurableBetween1 and 4
Symmetric CPU
48
INTEL SA & XSCALE Family
49
• Up to 203MHz(Typical)•Only Core Support
SA110
• SA core (StrongARM)• ARM8• ARM v4 Architecture• Harvard Architecture, I/D Cache• 5-Stage Pipeline
•Xscale• ARM5T Architecture• Thumb mode Support• 7-stage integer, 8-stage• memory superpipeline
• Up to 400MHz
PXA210 (250)
Xcale Core
64-bit BIU
2K Mini/D Cache
32K I/D Cache
I/D MMU
• 199MHz
SA1100
SA Core
DRAM
16K-I Cache
8K-D Cache
MMU
SA Core
DRAM
16K-I Cache
MMU
• 233MHz (Typical)
SA1110
SRAM
Cortex-A8 Architecture
n Cortex-A Base Architecturen Thumb-2 technology for power efficient
executionn TrustZoneTM for secure applicationsn v6 SIMD for compatibility with ARM11n Media acceleration applications
n Cortex-A8 Extensionsn Jazelle-RCT for efficient acceleration of
execution environments such as Java and Microsoft .NET
n NEON technology accelerating multimedia gaming and signal processing applications
n VFPv3 supports full IEEE 754 specification and has been expanded to support 32 registers
50
Instruction Fetch Unit NEON
Media Processor
L2 Memory System
AXI L3 Memory Interface
L1 I-cache
Instruction Decode
Unit
Instruction Execution & Load/Store
Unit
L1 D-cache
Cortex-A8
Current ARM MP Core Architecture Overview
n Scalable 1-4 cores solutions (ARM11, Cortex-A5, Cortex-A7, Cortex-A9, Cortex-A15, Cortex-A53, Cortex-A57)
n Supporting AMP/SMP models and any combination of the two
n Memory system and cache subsystems optimized for MP
n Interrupt Controller designed for distribution across multiple cores
51
Debug and Trace Architecture
Distributed Interrupts Logic
CPU CPU CPU CPU
Coherency Logic
External Memory
FPU/Neon
MPCore CPU
I-cache D-cache
Source: ARM (2013)
Cortex-A5: High Volume and Value
n Power-efficient Performancen Simple in-order 8-stage, single-
issue pipelinen Improved branch prediction
n Full feature set of Cortex-A9n NEON, FPU, TrustZonen Symmetric Multi-processing
(SMP)
n Highly configurablen Uniprocessor only version
availablen Optional NEON / FPUn Optional external L2 cache
52
CortexTM-A5 MPCore
432
ARMv7 32b CPU
NEONTM
SIMD engine
Floating Point Unit
4-64k I-Cache 4-64k D-Cache Core 1
ACP (Accelerator Coherency Port) SCU (Snoop Control Unit)
1-2x 64-bit AMBA AXI Bus Interface
Source: ARM (2013)
Cortex-A7: Most Efficient ARMv7
n Power efficient m-architecture n In-order 8-stage, partial dual-
issue n Efficient memory system
n Integrated L2 cache
n Full feature set of Cortex-A15 n Hardware enhanced virtualization n 1 TB physical memory (40bit
addressing) n NEON, FPU, TrustZone n big.LITTLE companion to Cortex-
A15 using AMBA4 ACE
53
CortexTM-A7 MPCore
432
ARMv7 32b CPU
NEONTM
SIMD engine
Floating Point Unit
8-64k I-Cache 8-64k D-Cache Core 1
SCU L2 Cache (128KB – 1MB)
128-bit AMBA 4 – Coherent Bus Interface
Source: ARM (2013)
Cortex-A9: Widely Adopted
n Performance & power optimized multi-core processor
n Dual-issue, Out-of-order pipeline n Scalable SMP – up to 4 cores n Accelerator Coherency Port (ACP)
n Flexible system architecture n Available as a single CPU also n Configurable cache sizes n Optional second AXI interface n Optimized L2 cache controller n NEON, FPU, TrustZone
54
Source: ARM (2013)
CortexTM-A9 MP Core
432
ARMv7 32b CPU
NEONTM
SIMD engine
Floating Point Unit
16-64k I-Cache16-64k D-
CacheCore 1
ARM CoreSightTM Multicore Debug and Trace
ACP SCU
Dual 64-bit AMBA3 AXI
Cortex-A9 Architecture
n Out-of-Order execution
n Register renaming to enable execution speculation
n Non-blocking memory system with load-store forwarding
n Fast loop mode in instruction pre-fetch to lower power consumption
55
FPU/NEONFPU/NEONFPU/NEONFPU/NEONFPU/NEONFPU/NEON
Profiling Monitor Block
Profiling Monitor Block
Dual-instructionDecode stage
Dual-instructionDecode stage
RegisterRename stage
RegisterRename stage
CoreSightDebug Access Port
CoreSightDebug Access Port Virtual to
PhysicalRegister Pool
Virtual toPhysical
Register Pool
BranchMonitorBranch
Monitor
Out of orderMulti-issue
withSpeculation
3+1 Dispatchstage
Out of orderMulti-issue
withSpeculation
3+1 Dispatchstage
Inst
ruct
ion
queu
e an
d Di
spat
ch
Inst
ruct
ion
queu
e an
d Di
spat
ch
ALU/MULALU/MUL
ALUALU
FPU/NEONFPU/NEON
AddressAddress
OoOWriteBackstage
OoOWriteBackstage
Instruction prefetch stageInstruction prefetch stageFast-loop
modeFast-loop
mode
Instructioncache
Instructioncache
Branch PredictionBranch PredictionGlobal History BufferGlobal History Buffer
BR-Target Addr CacheBR-Target Addr Cache
Return StackReturn Stack
Inst
ruct
ion
queu
eIn
stru
ctio
n qu
eue
Pref
etch
queu
ePr
efet
chqu
eue
Memory SystemMemory System
Auto-prefetcherAuto-prefetcher
Load-Store Unit
Quad-slot with forwarding
Load-Store Unit
Quad-slot with forwarding
Store bufferStore buffer
mTLBmTLB
MMUMMU
ProgramTraceUnit
ProgramTraceUnit
Data cacheData cache
CoresightTrace
L2 Cache ControlL2 Cache Control
ECC RAMsECC RAMsBus Interface Unit (BIU)Bus Interface Unit (BIU)
AMBA 3 AXI 64bit
IRQ/FIQ
PL390InterruptController
…
Coresight/JTAG
Debug
Cortex-A9Single coreProcessor
PL310L2 Cache
ControllerMaster Interface Secondary master (with
filtering)
Advanced Bus Interface Unit
ARM Cortex-A9 MP Core Architecture
56
ARM CoreSightTM Multicore Debug and Trace Architecture
GeneralizedInterrupt Controland Distribution
Snoop Control Unit (SCU)AcceleratorCoherence
PortTimersSnoopFiltering
Cache-2-CacheTransfers
Primary AMBA 3 64bit Interface Optional 2nd I/F with Address Filtering
L2 Cache Controller (PL310)
FPU/NEON PTMI/F
Cortex-A9 CPU
I-Cache D-Cache
FPU/NEON PTMI/F
Cortex-A9 CPU
I-Cache D-Cache
FPU/NEON PTMI/F
Cortex-A9 CPU
I-Cache D-Cache
FPU/NEON PTMI/F
Cortex-A9 CPU
I-Cache D-Cache
ø PTM: Program Trace Macrocell
Cortex-A12 CPU: High Performance Embedded
n 40% performance uplift over Cortex-A9
n Same best-in-class energy efficiency
n The most area- and cost-efficient solution
n Premium mobile features n big.LITTLE™ processing enabled n Greater than 4GB addressable
memory n Security with Virtualization and
TrustZone®
57
CortexTM-A12 MP Core
432
ARMv7 32b CPU
NEONTM
Data Engine
Floating Point Unit
32-64k I-Cache 32k D-Cache Core 1
ARM CoreSightTM Multicore Debug and Trace
ACP SCU L2 Cache w/ECC (256kB -8MB)
128-bit AMBA 4 AXI Bus Interface
Peripheral Port
Source: ARM (2013)
Load-Store Unit
Branch Prediction
ARM Cortex-A15 Core Architecture
58
128 bits
L1 Instruction Cache
Instruction Fetch
3-way Instruction Decode
Register Rename
DispatchIssue (8-entry Queue per Issue port)
Complex Cluster (N
EON
/FPU)
Complex Cluster (N
EON
/FPU)
Branch
Simple
Cluster 1
Simple
Cluster 0
Load/Store
Multiply Divide
Cluster
32-entry Loop Buffer
L2 Cache ControlBus Interface U
nit (BIU)
Write Back(60 entry) Retirement Buffer
Global History BufferMicroBTB (64-entry)
Branch Target Buffer (256-entry)
Return Stack
L1 Data Cache
Store Buffer
Load/Store
3 Instructions
8 Instruction issue
1 Load & 1 Store
per cycleARM
multiply & Integer
divide
2~10 stages Integer ALU & Shifter
(includes v6-SIMD)
All NEON & FPU ops Quad-FMAC
12 Stage In-Order Pipeline
3~12 Stage Out-of-Order
Pipeline
5 Stages7 Stages
1Stage
1Stage
Source: ARM (2013)
ARM Cortex-A15 MP Core Architecture
n 2.5GHz in 28 HP processn 12 stage in-order, 3-12 stage OoO pipelinen 3.5 DMIPS/MHz ~ 8750 DMIPS @ 2.5GHz
n ARMv7A with 40-bit PA
n Dynamic repartitioning Virtualizationn Fast state save and restoren Move execution between cores/clusters
n 128-bit AMBA 4 ACE busn Supports system coherency
n ECC on L1 and L2 caches
59
ARM CoreSight Multicore Debug and Trace Architecture
NEON/FastFP
PTM I/F
Cortex-A15
Virtualization, ECC
I-$ D-$
NEON/FastFP
PTM I/F
Cortex-A15
Virtualization, ECC
I-$ D-$
NEON/FastFP
PTM I/F
Cortex-A15
Virtualization, ECC
I-$ D-$
NEON/FastFP
PTM I/F
Cortex-A15
Virtualization, ECC
I-$ D-$
L2 Cache Controller
Advanced Bus Interface Unit (> 32-bit Addressing)
64-bit/128-bit AMBA4 Interface
Generalized Interrupt
Control and Distribution
Accelerator Coherence
PortSystemMMU
Snoop Control Unit (SCU)
Cache-2-Cache Transfers
Snoop Filtering Timers
Simple Cluster 0
Simple Cluster 1
Multiply Accumulate
Complex
Complex
Load & Store 0
Load & Store 1
Fetc
h
Deco
de
Rena
me
Cortex-A15 Pipeline
12 Stage In-order pipeline(Fetch, 3 Instruction
Decode, Rename)Source: ARM (2012)ø PTM: Program Trace Macrocell
3~12 Stage out-of-order
pipeline(capable of 8
issue)
ARM Application Processors for Embedded
60
Summary ARMv7-A
61
Cortex-A5 MPCore
Thumb / Thumb-2 / DSP / SIMD / Optional VFPv4-D16 FPU / Optional NEON / Jazelle RCT and DBX, 1–4 cores / optional MPCore, snoop control unit (SCU), generic interrupt controller (GIC), accelerator coherence port (ACP)
4-64 KB / 4-64 KB L1, MMU + TrustZone
1.57 DMIPS / MHz per core
Cortex-A7 MPCore
Thumb / Thumb-2 / DSP / VFPv4-D16 FPU / NEON / Jazelle RCT and DBX / Hardware virtualization, in-order execution, superscalar, 1–4 SMP cores, Large Physical Address Extensions (LPAE), SCU, GIC, ACP, architecture & feature set are identical to A15, 8-10 stage pipeline
32 KB / 32 KB L1, 0-4 MB L2, MMU + TrustZone
1.9 DMIPS / MHz per core
Cortex-A8 Thumb / Thumb-2 / VFPv3 FPU / NEON / Jazelle RCT and DAC, 13-stage superscalar pipeline
16-32 KB / 16-32 KB L1, 0-1 MB L2 opt ECC, MMU + TrustZone
2.0 DMIPS/MHz
Cortex-A9 MPCore
Thumb / Thumb-2 / DSP / Optional VFPv3 FPU / Optional NEON / Jazelle RCT and DBX, out-of-order speculative issue superscalar, 1–4 SMP cores, SCU, GIC, ACP
16-64 KB / 16-64 KB L1, 0-8 MB L2 opt parity, MMU + TrustZone
2.5 DMIPS/MHz per core
Cortex-A12
MPCore
Thumb-2 / DSP / VFPv4 FPU / NEON / Hardware virtualization, out-of-order speculative issue superscalar, 1–4 SMP cores, LPAE, SCU, GIC, ACP
32-64 KB / 32 KB L1, 256 KB-8 MB L2
3.0 DMIPS / MHz per core
Cortex-A15
MPCore
Thumb / Thumb-2 / DSP / VFPv4 FPU / NEON / Integer divide / Fused MAC / Jazelle RCT / Hardware virtualization, out-of-order speculative issue superscalar, 1–4 SMP cores, LPAE, SCU, GIC, ACP, 15-24 stage pipeline
32 KB I$ w/parity / 32 KB D$ w/ECC L1, 0-4 MB L2, L2 has ECC, MMU + TrustZone
At least 3.5 DMIPS/MHz per core
62
AArch32 Summary
AArch32 Evolution (1/4)
63
1, 2, 3Early ARM architectures
4THalfword and signed halfword / byte support
System mode
Thumb instruction set
4Load-store instrs for signed and unsigned halfword / byte
System mode
SA-110 SA-1110
ARM7TDMI ARM9TDMI
5TE
Improved ARM / Thumb state change
CLZ(Count Leading Zero) instr.
Saturated maths
DSP multiply-accumulate instructions
ARM1020E
XScale
5TEJ
Jazelle(Java bytecode accelerator)
ARM9EJ-S
ARM1026EJ-S
ARM1136EJ-S6
7
SIMD InstructionsMulti-processingV6 Memory acrchitecture(VMSA)Unaligned and mixed endian data support
7-A (Applications)NEON
7-R (Real-time)Hardware divide
7-M (Microcontroller)Thumb-2 onlyHardware divide
Cortex-A /R /M
AArch32 Evolution (2/4)
n ARM v1n Developed at Acorn Computer Limited between 1983 and 1985n 26-bit addressingn No multiplication, CPSR, …(never used in a commercial product)
n ARM v2n Still 26-bit addressing (version 1 extension)n 32-bit result multiply (with accumulate) and coprocessor supportn Atomic load and store instruction (SWP) support (ARM v2a)
n ARM v3n Developed at ARM Limited (1990)n 32-bit addressingn CPSR, SPSR addedn Multi mode support : Undefined, Abort mode addedn ARM6, ARM610, ARM7, ARM710 core (Apple Newton PDA!)
64
AArch32 Evolution (3/4)
n ARM v4n Fully 32-bit addressingn ARMv4T : Thumb mode support ( T variant)n ARMv4 is deployed for StrongARM and ARM810 / ARMv4T: ARM7TDMI,
ARM720T, ARM9TDMI, ARM920T, ARM940T
n ARM v5TE{J}n Superset of ARMv4(1999)n Enhanced DSP Instructionn Jazelle : Java H/W Accelerationn ARM9E, ARM10TDMI, ARM 1020E, XScale
n ARM v6{Z|T2}n Media instruction(SIMD) support (2001)n ARMv6Z (TrustZone), ARMv6T2 (Thumb-2)n ARM1136J, ARM1176JZ
65
AArch32 Evolution (4/4)
n ARM Architecture Versions and Variants(ARM v7)n Application profile (ARM v7-A)
n Memory management support (MMU)n Highest performance at low power
n Influenced by multi-tasking OS system requirementsn TrustZone and Jazelle-RCT for a safe, extensible system
n Real-time profile (ARM v7-R)n Protected memory (MPU)n Low latency and predictability ‘real-time’ needsn Evolutionary path for traditional embedded business
n Microcontroller profile (ARM v7-M, ARM v6-M)n Lowest gate count entry pointn Deterministic and predictable behavior a key priorityn Deeply embedded use
66
Extensions to ARMv7
n MPE – Multiprocessing Extensionsn Added Cache and TLB Maintenance Broadcast for efficient MP
n VE - Virtualization Extensionsn Adds hardware support for virtualization:
n 2 stages of translation in the memory systemn New mode and privilege level for holding an Hypervisor
n With associated traps on many system relevant instructionsn Support for interrupt virtualization
n Combines with a System MMU
n LPAE – Large Physical Address Extensionsn Adds ability to address up to 40-bits of physical address space
67
Announcement
n Homework #3 (Due: Oct. 30)n Compare the ISA between Intel Atom and ARM Aarch-32
n Free format: Either MS Word or MS PPT
68
69
ARM NEON
NEON?
n NEON is a wide SIMD data processing architecturen Extension of the ARM instruction setn 32 registers, 64-bits wide (dual view as 16 registers, 128-bits wide)
n NEON Instructions perform “Packed SIMD” processingn Registers are considered as vectors of elements of the same data typen Data types can be: signed/unsigned 8-bit, 16-bit, 32-bit, 64-bit, single prec. floatn Instructions perform the same operation in all lanes
70
Dn (src register)
Dm (src register)
Dd (Dest Register)
Lane
Operation
NEON: Data Types
n NEON natively supports a set of common data typesn Integer and Fixed-Point; 8-bit, 16-bit, 32-bit and 64-bitn 32-bit Single-precision Floating-point
n Data types are represented using a bit-size and format letter
71
.8.I8
.S8
.U8
.P8
.16.I16
.S16
.U16
.P16
.32.I32
.S32
.U32
.F32
.64 .I64.S64
.U64
64-bit Signed, Unsigned Integers;
32-bit Signed, Unsigned Integers;
Floats
32-bit Signed, Unsigned Integers;
Floats
8/16-bit Signed, Unsigned Integers;
Polynomials
NEON: Registers
n NEON provides a 256-byte register filen Distinct from the core registersn Extension to the VFPv2 register file (VFPv3)
n Two explicitly aliased viewsn 32 x 64-bit registers (D0-D31)n 16 x 128-bit registers (Q0-Q15)
n Enables register trade-offn Vector lengthn Available registers
n Also uses the summary flags in the VFP FPSCRn Adds a QC integer saturation summary flagn No per-lane flags, so ‘carry’ handled using wider result (16bit+16bit à 32-bit)
72
D0
D1
D2
D3
...D30
D31
Q0
Q1
...
Q15
NEON: Vectors and Scalars
n Registers hold one or more elements of the same data typen Vn can be used to reference either a 64-bit Dn or 128-bit Qn registern A register, data type combination describes a vector of elements
n Some instructions can reference individual scalar elementsn Scalar elements are referenced using the array notation Vn[x]
n Array ordering is always from the least significant bit
73
Dn
I64
S32 S32
63 0
D0
D7
64-bit
Qn
F32
S8
127 0
Q0
Q7
128-bit
F32 F32 F32
S8 S8 S8 S8 S8 S8 S8 S8 S8 S8 S8 S8 S8 S8 S8
F32 F32 F32 F32 Q0
Q0 [3] Q0 [2] Q0 [1] Q0 [0]
74
ARM VFP
VFP – Overview
n VFP – “Vector Floating-point”n A floating point hardware accelerator
n Described as a “coprocessor”n Originally a tightly-coupled coprocessorn Executed instructions from ARM instruction stream via dedicated interfacen Now more tightly integrated into the CPU
n Single and Double precision floating-pointn Fully IEEE compliant: ANSI/IEEE Std. 754-1985n Vector operation (Short vectors of up to 8 SP or 4 DP numbers) are handled
particularly efficiently by the VFP architecturen Most arithmetic instructions can be used on these vectors, allowing SIMD
parallelismn The FP Load & Store instructions have multiple register forms, allowing vectors to
be transferred to and from memory efficiently
75
VFP – Registers
n VFP has 32 GP registers, each capable of holding a SP FP number or a 32-bit integer
n In D variants, these registers can also be used in pairs to hold up to 16 DP FP numbers:n FPSID (Read Only): Can be read to determine which
implementations of the VFP architecture is being usedn FPSCR: Supplies all user-level status and control
n Status bits hold comparison results and cumulative flags for FP exceptionsn Control bits are provided to select rounding options and vector
length/stride, and to enable FP exception trapsn FPEXC: Contains a few bits for system-level status and control
76
VFP – Instructions
n Load /storen Some of these instructions allow multiple register values to be transferred, providing
floating-point equivalents to ARM LDM and STM instructions
n Transfer 32-bit values directly between VFP and ARM GP registers
n Transfer 32-bit values directly between VFP system registers and ARM GP registers
n Add, subtract, multiply, divide, and square rootn Can be used on short vectors as well as on individual floating-point values
n Copy floating-point values between registers
n Perform combined MAC operations on FP values and short vectors
n Perform conversions between SP, DP, unsigned 32-bit integers and two's complement signed 32-bit integers
n Compare floating-point values in registers with each other or with zero
77
VFP – Versions
n VFPv1 is obsolete
n VFPv2 is an optional extension to the ARMv5TE/5TEJ and ARMv6 architectures
n VFPv3-D32 is broadly compatible with VFPv2 but adds exception-less FPU usage, has 32 64-bit FPU registers as standard, adds VCVT instructions to convert between scalar, float and double, adds immediate mode to VMOV such that constants can be loaded into FPU registers
n VFPv3-D16: as above, but it has only 16 64-bit FPU registers
n VFPv3-F16 is uncommon; it supports IEEE754-2008 half-precision (16-bit) floating point
n VFPv4 in/for Cortex-A5, has a fused multiply-accumulate
78
79
ARM AArch64
ARM Architecture Evolution: AArch 32 to AArch 64
80
Jazelle®
VFPv2
SIMD
Thumb®-2
NEON™Adv SIMD
TrustZone®
ARMv5 ARMv6 ARMv7-A/R
Key feature ARMv7-A
compatibility
VFPv3/v4
ARMv8-A
A32+T32 ISAsincluding:l Scalar FP
(SP and DP)l Adv SIMD
(SP Float)
AArch32
A64 ISAsincluding:l Scalar FP
(SP and DP)l Adv SIMD
(SP+DP Float)
AArch64
CRYPTO CRYPTO
ARM v8 (AArch64) - Motivation
n Work on 64-bit architecture started in 2007
n Fundamental motivation is evolution into 64-bitn Ability to access a large virtual address spacen Foresee a future need in ARM’s traditional marketsn Enables expansion of ARM market presence
n Developing ecosystem takes timen Development started ahead of strong demand n ARM now seeing strong partner interest in 64-bit
n Though still some years from “must have” status
81
ARM v8 (AArch64) - Fundamentals
n New instruction set (A64)
n Revised exception handling for exceptions in AArch64 staten Fewer banked registers and modes
n Support for all the same architectural capabilities as in ARMv7n TrustZonen Virtualization
n Memory translation system based on the ARMv7LPAE table formatn LPAE format was designed to be easily extendable to AArch64-bit n Up to 48 bits of virtual address from a translation table base register
82
ARM v8 (AArch64) – New Instruction Set
n New fixed length Instruction set n Instructions are 32-bits in sizen Clean decode table based on a 5-bit register specifiers
n Instruction semantics broadly the same as in AArch32n Changes only where there is a compelling reason to do so
n 31 general purpose registers accessible at all timesn Improved performance and energyn General purpose registers are 64-bits widen No banking of general purpose registersn Stack pointer is not a general purpose register n PC is not a general purpose register n Additional dedicated zero register available for most instructions
83
AArch64 – Unbanked Registers
n 64-bit GP register file used for:n Scalar integer computation
n 32 and 64-bitn Address computation
n 64-bit
n Media register file used for:n Scalar Single and Double Precision
FPn 32 and 64-bit
n Advanced SIMD for Integer and FPn 64 and 128-bit wide vectors
n Cryptography
84
X0 X8 X16 X24
X1 X9 X17 X25
X2 X10 X18 X26
X3 X11 X19 X27
X4 X12 X20 X28
X5 X13 X21 X29
X6 X14 X22 X30*
X7 X15 X23
V0 V8 V16 V24
V1 V9 V17 V25
V2 V10 V18 V26
V3 V11 V19 V27
V4 V12 V20 V28
V5 V13 V21 V29
V6 V14 V22 V30
V7 V15 V23 V31
ARM v8 (AArch64) – Key differences from AArch32
n New instructions to support 64-bit operandsn Most instructions can have 32-bit or 64-bit argumentsn Addresses assumed to be 64-bits in size
n LP64 and LLP64 are the primary data models targeted
n Far fewer conditional instructions than in AArch32n Conditional {branches, compares, selects}
n No arbitrary length load/store multiple instructionsn LD/ST ‘P’ for handling pairs of registers added
85
AArch32 / AArch64 Relationship
n Changes between AArch32 and AArch64 occur on “exception/exception return” only
n Increasing exception level cannot decrease register width (or vice versa)n No Branch and Link between AArch32 and AArch64
n Allows AArch32 applications under AArch64 OS Kerneln Alongside AArch64 applications
n Allows AArch32 guest OS under AArch64 Hypervisorn Alongside AArch64 guest OS
n Allows AArch32 Secure side with AArch64 Non-secure siden Protects AArch32 Secure OS investments into ARMv8
n Requires architected relationship between AArch32 and AArch64 registers
86
ARM v8 (AArch64) – Advanced SIMD and FP Instruction Set
n A64 Advanced SIMD and FP semantically similar to A32n Advanced SIMD shares the floating-point register file as in AArch32
n A64 provides 3 major functional enhancements:n More 128 bit registers: 32 x 128 bit wide registers
n Can be viewed as 64-bit wide registersn Advanced SIMD supports DP floating-point executionn Advanced SIMD support full IEEE 754 execution
n Rounding-modes, Denorms, NaN handling
n Register packing model in A64 is different from A32n 64-bit register view fit in bottom of the 128-bit registers
n Some Additional floating-point instructions for IEEE754-2008n MaxNum/MinNum instructions, Float to Integer conversions with
RoundTiesAway
87
ARM v8 (AArch64) – Cryptography Support
n Instruction level support for Cryptographyn Not intended to replace hardware accelerators in an SoC
n AESn 2 encode and 2 decode instructions
n Work on the Advanced SIMD 128-bit registersn 2 instructions encode/decode a single round of AES
n SHA-1 and SHA-256 supportn Keep running hash in two128 bit wide registersn Hash in 4 new data words each instructionn Instructions also accelerate key generation
88
A-series Evolution Details
89
ARMv7-A ARMv7-A - Extensions ARMv8-A2004 2010 2012
• 32-bit Architecture • 32-bit VA & PA • Supports SMP 1-4 cores • TrustZone – Security • ARM & Thumb1/2 ISA • NEON & VFP • L1 D-Cache Coherency
• 32-bit Architecture• 32-bit VA & PA• Supports SMP 1-4 cores• TrustZone – Security• ARM & Thumb1/2 ISA• NEON & VFP• L1 D-Cache Coherency
• 32-bit Architecture• 32-bit VA & PA• Supports SMP 1-4 cores• TrustZone – Security• ARM & Thumb1/2 ISA• NEON & VFP• L1 D-Cache Coherency
• Full HW Virtualization• 40-bit PA• Hypervisor State• 2 levels PTT
• Full HW Virtualization• 40-bit PA• Hypervisor State• 2 levels PTT
• 64-bit Architecture• 31x64-bit registers (double)• 32-bit instructions• >40-bit PA• New exception model
Cortex-A9
Cortex-A15
Cortex-A53/A57
ARMv8: AArch64
90
CortexTM-A57 MPCore
432
ARMv8 32b/64b CPU Virtual 44b PA
NEONTM
SIMD engine with crypto ext.
Floating Point Unit
48k I-Cachewith DED parity
32k D-Cache w/ECC
Core 1
ARM CoreSightTM Multicore Debug and Trace
ACP SCU L2 Cache w/ECC (512kB ~ 2MB)
128-bit AMBA ACE Coherent Bus Interface
CortexTM-A53 MPCore
432
ARMv8 32b/64b CPU Virtual 40b PA
NEONTM
SIMD engine with crypto ext.
Floating Point Unit
8~64k I-Cachew/parity
8~64k D-Cache w/ECC
Core 1
ARM CoreSightTM Multicore Debug and Trace
ACP SCU L2 Cache w/ECC (128kB ~ 2MB)
128-bit AMBA ACE Coherent Bus Interface
Source: ARM (2013)
Cortex-A932nm
Cortex-A53,20nm
Cortex-A9
Cortex-A53
High Performance Energy Efficient
Cortex-A53 Features
n Cortex-A53 evolves our 8-Stage in-order LITTLE pipelinen ARM v8 AArch64 & AArch32 support at all exception levels
n Optional configurations for different marketsn Smartphone to networking to offload
91
Cortex-A53 Cluster
Core 3
NEON Pipelines
Integer Pipelines
Load-Store Pipelines
ETM
Deco
de
Inst
ruct
ion
Que
ueInstruction Fetch
L1 C
ache
Core 2
NEON Pipelines
Integer Pipelines
Load-Store Pipelines
ETM
Deco
de
Inst
ruct
ion
Que
ueInstruction Fetch
L1 C
ache
Core 1
NEON Pipelines
Integer Pipelines
Load-Store Pipelines
ETM
Deco
de
Inst
ruct
ion
Que
ueInstruction Fetch
L1 C
ache
Core 0
NEON Pipelines
Integer Pipelines
Load-Store Pipelines
ETM
Deco
de
Inst
ruct
ion
Que
ueInstruction Fetch
L1 C
ache
L2
Coherency
L2 Cache
ACP I/F
ACE I/F
Configurable 1-4 Core Capable Optional CPU / NEON
retention capability
Optional ECC for all Core and L2
RAMs
Optional v8 Cryptographic
extensions
Debug v8 & ETM™ v4 Support
AMBA® 4 ACE™System Coherent
Interface
Optional Accelerator
Coherence Port
None or 128K to 2MB
Cortex-A53 Core & L2 Memory System
n Tightly integrated Core (L1) to L2 Memory system
92
L1 Data Cache
Read Buffer
Fully associative Micro-TLB
4-way set-associative,
256-bit Fill Path,64-bit Fetch Path,
64-byte lines
512-Entry, 4-way set associative main TLB
Multi-Stream prefetcher
16-way set-associative,256-bit Fill Path,
512-bit Fetch Path
System coherent interfaces (MOESI)
with 128-bit read/write/snoop
paths
128-bit Accelerator
Coherence Port (IO coherent)
Write Buffer uTLB
Store Buffer
Prefetcher
Main TLB
L2 Cache
ACE I/F
ACP I/F
Core L2
Multi-entry merging store
buffer
3-outstanding load miss capability (per-core, excluding prefetches) 8-outstanding transactions (per-core)
CortexTM-A57 Cluster
FP/Adv SIMD Pipelines
Integer Pipelines
48k
I-Cac
hew
ith p
arity
32k D-Cache w/ECC
Core 4
Out
-of-O
rder
FP/Adv SIMD Pipelines
Integer Pipelines
48k
I-Cac
hew
ith p
arity
32k D-Cache w/ECC
Core 3
Out
-of-O
rder
FP/Adv SIMD Pipelines
Integer Pipelines
48k
I-Cac
hew
ith p
arity
32k D-Cache w/ECC
Core 2
Out
-of-O
rder
Cortex-A57 Features
n Fixed L1-cache sizen 48KB I-cachen 32KB D-cache
n Configurable L2 cache sizen 512KB/1MB/2MB
n Fully out-of-order execution
n Optional ARMv8 cryptography units
n Reliability features n Caches: SECDED ECC or parity n System-error capabilities
n 1-4 core MP configurability within a cluster
n System interface compatible with CCN-504 and CCI-400
93
FP/Adv SIMD Pipelines
Integer Pipelines
48k
I-Cac
hew
ith p
arity
32k D-Cache w/ECC
Core 1
Out
-of-O
rder
ACP L2 w/ECC (512kB ~ 2MB)
128-bit AMBA ACE Coherent Bus Interface
CoreLink CCN-504 Cache Coherent Network
94
DSPDSP
CoreLink CCN-504 Cache Coherent Network
Quad Cortex-
A57
L2 Cache
Quad Cortex-
A57
L2 Cache
Quad Cortex-
A57
L2 Cache
Quad Cortex-
A57
L2 Cache
NIC-400
GIC-500
Snoop Filter8-16MB L3 Cache
CoreLinkDMC-520
NIC-400 Network Interconnect
X72DDR4-3200
Flash GPIO
SATAUSBCryptoDPI
10-40 GbE PCIe DSP
CoreLinkDMC-520
X72DDR4-3200
MMU-500
18 AMBA ACE-Lite interfaces for I/O coherency
§ ECC on RAMs and parity on transport§ Integrated QoS regulation & Priority
management§ TrustZone aware
To minimize snoop traffic
Heterogeneous processors – CPU (supported CPU: Cortex-A15, Cortex-A57, Cortex-A53), GPU, DSP and acceleratorsUp to 4 cores
per cluster
4 fully coherent clusters
128-bit bus @ CPU
frequency
Integrated L3 cache
Dual channel DDR3/4 x72
Usable system bandwidth: ~1 Terabit/secPeripheral address space
Generic Interrupt Controller
Low Power Support
§ Extensive clock gating§ Leakage mitigation hooks§ Granular DVFS and CPU
shutdown support§ Partial or full L3 cache
shutdown§ Retention mode
Source: ARM (2013)
Performance Max bandwidth: 25.6 GB/s per channel Buffering to optimize read/write turnaround
Interfaces Optimal direct connection to CCN-504 Industry standard DFI-3.0 to connect to PHY
PHY Low-latency PHY from ARM
Memory Support for x72 DRAM DDR3, DDR3L and DDR4 up to DDR4-3200
Low Power Support Programmable DRAM power modes
Memory Channel
CoreLink DMC-520
n 5th Generation ARM DMC targeting >95% DRAM efficiencyn ECC and RAS features n Performance Profiling n TrustZone Address Space Control
n High efficiency through close integration with CCN-504n System wide QoS, designed and verified with ARM CPUs
95
System I/F
RAS Features
ECC Features
Data Bufferfor both Reads and Writes
Memory Interface
ConfigI/F
Perf
orm
ance
Pr
ofili
ng I/
F
QoS
Dynamic Memory Controller
DDR4, 3 & 3L PHY 3200 Mbpsø RAS = Reliability, Availability, Serviceability
Source: ARM (2013)
ARM Cache Architecture
96
CPU Architecture L1 Cache L2 CacheARM720T ARMv4T (32bit) 8KB Unified I&D cache, 4-way SA, 8 words/line, VIVT -
ARM926EJS ARMv5TEJ (32bit)4KB~128KB configurable size, Harvard architecture,
4-way SA, 8 words/line,VIVT using Modified VA (MVA)
-
ARM1136EJS ARMv6 (32bit) 4KB~64KB configurable size, Harvard architecture,4-way SA, 8 words/line, VIPT -
Cortex-A8 ARMv7-A (32bit)16~32 KB configurable size, Harvard architecture,
4-way SA, 16 words/line,VIPT for I-cache, PIPT for D-cache
0-1 MB configurable size, unified,8-way SA, 16 words / line, opt ECC or parity
Cortex-A9 ARMv7-A (32bit)16~32 KB configurable size, Harvard architecture,
4-way SA, 8 words/line,VIPT for I-cache, PIPT for D-cache
0-8 MB configurable size,optional parity
Cortex-A7 ARMv7-A (32bit)8~64 KB configurable size, Harvard architecture,
I-cache: 2-way SA, 8 words/line, VIPTD-cache: 4-way SA, 16 words/line, PIPT
128KB-1 MB configurable size, 8-way SA, 16 words / line
Cortex-A15 ARMv7-A (32bit) I-cache: 32 KB, 2-way SA, 16 words/line, Parity per 16 bits, PIPTD-cache: 32 KB, 2-way SA, 16 words/line, Parity per 32 bits, PIPT
512KB-4 MB configurable size,16-way SA, 16 words/line, Optional parity,
Strictly enforced inclusive with L1 data cache
Cortex-A53 ARMv8-A (64bit)a.k.a AArch64
8~64 KB configurable size, Harvard architecture,2-way SA, 16 words/line, PIPT
128KB-2MB configurable size,16-way SA,
Strictly enforced inclusive with L1 data cache
Cortex-A57 ARMv8-A (64bit)a.k.a AArch64
I-cache: 48 KB, 3-way SA, ECC, PIPTD-cache: 32 KB, 2-way SA, ECC, PIPT
512KB-2MB configurable size,16-way SA, ECC,
Strictly enforced inclusive with L1 data cache
40-bit Physical Addresses
44-bit Physical Addresses
Announcement
n Homework #4 (Due: Nov. 6)n Compare cache coherency schemes between snooping and
Directory-based coherencen Free format: Either MS Word or MS PPT (in Korean)
97
98
ARM AMBA Bus Architecture
Master versus Slave
n A bus transaction includes two parts:n Issuing the command (and address) – requestn Transferring the data – action
n Master is the one who starts the bus transaction by:n Issuing the command (and address)
n Slave is the one who responds to the address by:n Sending data to the master if the master ask for datan Receiving data from the master if the master wants to send data
BusMaster
BusSlave
Master issues command
Data can go either way
99
Bunch of Wires
Electrical Specification
Physical / Mechanical Characteristics– the connectors
Timing and Signaling Specification
Transaction Protocol
What defines a BUS?
100
On-Chip Bus Technology
101
Single Bus Multilayer Bus
Single transaction Split TransactionPipelined Transaction
Bus Matrix / Crossbar
Multiple Outstanding Addressing (Transaction ID)
ex) AMBA 2.0 ex) AMBA AXISonics Backbone
ex) AMBA 1.0
• Bus Structure Evolution
• Bus Protocol Evolution IP IP
IP IPBus
IP IP
IP IPBus
IP IP
IP IPBus
IP IP
IP IPBusR R
R R
Network-on-Chip
Bus to Network
On-chip Bus Architecture Classification
n Complicated protocoln Arbitration, multiplexing, decoding, etc.n Hierarchical structure
n Ex) AHB, IBM CoreConnect
n Core-independent protocoln Simple and straightforwardn Improves reusability of IP coresn Improves reliability
n Proprietary interconnection network
n Ex) ARM AXI, Sonics mNetwork, AHB Lite
High-performanceCPU
High-performanceMemory
Arbiter Bridge
Peripheral Peripheral Arbiter
Core Bus
Peripheral Bus
102
Hierarchical Bus Architecture
n All slaves are exist in a single layer
n All slaves suffer performance degradation due to
n the increased complexity of connection logic
n the slowest peripheral
n Classified layers according to required bandwidth
n Delay decreases due to the reduced complexity of connection logic à enhanced speed
103
Overall performance degradation
High speed layer
Low speed layer
AMBA (Advanced Microcontroller Bus Architecture)
n On-chip interconnect specification for SoC
n Promotes re-use by defining a common backbone for SoC modules using standard bus architectures
n ASB – Advanced System Busn AHB – Advanced High-performance Bus (system backbone)
n High-performance, high clock freq. modules n Processors to on-chip memory, off-chip memory interfaces
n APB – Advanced Peripheral Busn Low-power peripherals n Reduced interface complexity
n AXI – Advanced eXtensible Interface n ACE – AXI Coherency Extensionn ATB – Advanced Trace Bus
104
AMBA: Brief History
105
AMBA: Brief History
AMBA 1.0 1995ASB Advanced System BusAPB Advanced Peripheral Bus
AMBA 2.0 1999 AHB Advanced High-performance Bus
AMBA 3.0 2003AXI Advanced eXtensible BusAPB Addition of wait, error response
AMBA 4.02010
AXI4Some minor changes to AXI- Added support for long burst- Dropped support for write interleaving
AXI4-Lite Simplified AXI4 for on-chip device requiring a more powerful interface than APB
AXI4-Stream A point-to-point protocol without an address, just data
2011ACE AXI Coherency Extensions
ACE-Lite Simplified ACE
106
Major Component in AMBA System
n Mastern An independent component that can request to perform the read or write operation to
any slave in the system (e.g.) Processor, DSP etc.
n Slaven A dependent component on the master. If selected, the read or write operation on its
addressable space is performed under the control of the bus master (e.g.) DRAM, SRAM, Peripherals etc.
n Arbitern Arbitrate the bus requests among masters such that only one master is allowed to
access the bus (e.g.) fixed priority, round-robin etc.
n Decodern It decides which slave to access by decoding the address from the bus master
n Multiplexern It decides which master or slave access the target IP
107
Example AMBA BUS SYSTEM
High PerformanceARM processor
High-bandwidthon-chip RAM
HighBandwidth
ExternalMemoryInterface
DMABus Master
APBBridge
Timer
Keypad
UART
PIO
AHB
APB
High PerformancePipelined
Burst SupportMultiple Bus Masters
Low PowerNon-pipelined
Simple Interface
A typical AMBA-based system
108
109
AMBA 2.0 AHB Bus
AHB Interconnection
n AMBA AHB is designed to be used with a central multiplexor interconnection schemen Avoids tri-state bus
110
AMBA AHB Overview
n High performance
n Pipelined operation
n Multiple bus mastersn It is necessary to implement Arbiter
n Single cycle bus master handover
n Burst transfers
n Split transactions
n Single clock edge operationn Rising edge
n Non-tri-state implementationn Uni-direction vs Bi-direction
111
Address
Data
clock
AMBA AHB Overview
n An AHB transfer consists of two distinct sectionsn The address phase,
which lasts only a single cycle
n The data phase, which may require several cycles
n This is achieved using the HREADY signal
112
AMBA Signal Flow
113
Master0~7
Master0~7
HADDRx[31:0]HWDATAx[31:0]HSIZEx[2:0]HTRANSx[1:0]
ArbiterArbiter
HBUSREQx
HLOCKx
HGRANTx
HMASTER[3:0]HMASTLOCK
Slave 0~15
HSPLITx
HRDATA[31:0]
HREADYHADDR[31:0]
HWDATA[31:0]
HSIZE[2:0]
HTRANS[1:0]DecoderDecoder
HSELx
HADDR[31:0]
HBURSTx[2:0]HWRITE
AMBA AHB Signal List (1)
Name Source DescriptionHCLK Clock source Bus clock (rising edge)HRESETn Reset controller Reset (Active low)HADDR[31:0] Master Address bus HTRANS[1:0] Master Transfer type (non-sequential, sequential, idle, busy)HWRITE Master Transfer direction (high-write/low-read)
HSIZE[2:0] Master Transfer size[(8/16/32 bit), Max. 1024bit]
HBURST[2:0] Master Burst type (4/8/16 beat burst, incrementing or wrapping)
HPROT[3:0] Master Protection control (opcode fetch or data access) (privileged mode access, user mode access)
HWDATA[31:0] Master Write data busHSELx Decoder Slave selectHRDATA[31:0] Slave Read data busHREADY Slave Transfer done (high-done)HRESP[1:0] Slave Transfer response (okay, error, retry, split)
114
AMBA AHB Signal List (2)
Name Source Description
HBUSREQx Master Bus request (Max. 16 bus master)
HLOCKx Master Locked transfers (Exclusive use of bus when high)
HGRANTx Arbiter Bus grant
HMASTER[3:0] Arbiter Master number
HMASTLOCK Arbiter Locked sequence (The same timing with HMASTER signal)
HSPLITx[15:0] Slave Split completion request
115
AMBA Diagram
n AHB Mastern An AHB master is able to initiate read and write operation by providing an
address and control informationn Common AHB master
n Processor, DSP, DMA controller, etc
AHBMaster
HRDATA[31:0] HWDATA[31:0]
HADDR[31:0]
Arbitergrant
Transferresponse
HGRANTx
Clock
Data
HREADY
HRESP[1:0]
HRESETn
HCLKReset
HBUSREQx
HLOCKx
HTRANS[1:0]
HWRITE
HSIZE[2:0]
HBURST[2:0]
HPROT[3:0]
Arbiter
Transfer type
Address&
control
Data
116
AHB Master Block Diagram
n The AHB master is an interface unit that allows user logic to initiate a data transfer on the AHB
n The bus master is optimized to interface with DMA and PCI bus bridge functions to initiate data transfer on the AHB
BusRequest
MasterControl
TransferCounter
AddressBuffer
WriteBuffer
AHB Bus(Slave)
User Interface
Control Address R data W data
117
AMBA Diagram
n AHB Slaven An AHB slave processes read and write operations issued by an AHB mastern Common AHB slave
n Memory controller, Peripherals
AHBslave
HRDATA[31:0]HWDATA[31:0]
HADDR[31:0]Select
Transferresponse
HSELx
Clock
Data
HREADY
HRESP[1:0]
HRESETn
HCLKReset
HTRANS[1:0]HWRITE
HSIZE[2:0]
HBURST[2:0]
HSPLITx[15:0]
AddressAnd
control
Data
Split-capableslave
HMASTER[3:0]
HMASTLOCK
118
Control Signal (1/2)
n HWRITE – Transfer directionn HIGH : Write using HWDAT[31:0]n LOW : Read using HRDATA[31:0]
n HSIZE[2:0] – Transfer size
119
HSIZE[2:0] Size Description000 8 bits Byte001 16 bits half word010 32 bits Word011 64 bits100 128 bits 4-word line101 256 bits 8-word line110 512 bits111 1024 bits
Control Signal (2/2)
n HPROT[3:0] – Protection control (Indicating opcode fetch, data access, privileged mode access, and user mode)
120
HPROT[3]cacheable
HPROT[2]bufferable
HPROT[1]privileged
HPROT[0]Data/opcode Description
- - - 0 Opcode fetch- - - 1 Data access- - 0 - User access- - 1 - Privileged access- 0 - - Not bufferable- 1 - - Bufferable0 - - - Not cacheable1 - - - Cacheable
Basic Transfer (1/10)
n Address phase: Only single cycle
n Data phase can be extended to multi-cycle by using HREADY signal
121
Basic Transfer (2/10)
n Example: data transfer cycle extension between objects with different operating clock frequencies
122
Basic Transfer (3/10)
n Multiple data transfer with non-contiguous address
n A and C: zero wait state, B: one wait state
123
Transaction A StartsTransaction B Starts
Transaction A Completes
Notice backpressure
Basic Transfer (4/10)
n Basic Signals for Read Txn
124
Name Source Description
HADDR[31:0] Master Address bus
HRDATA[31:0] Slave Read data bus
HRESP Slave Transfer response (okay, error, retry, split)
Basic Transfer (5/10)
n Operation for a Read Txn
125
Basic Transfer (6/10)
n Pipelined Operation
126
Basic Transfer (7/10)
n HREADY for a Slow Slave
127
Name Source DescriptionHADDR[31:0] Master Address bus
HRDATA[31:0] Slave Read data bus
HREADY Slave Transfer done (high-done)
HRESP Slave Transfer response (okay, error, retry, split)
Basic Transfer (8/10)
n Wait State Insertion
128
Basic Transfer (9/10)
n Operation for Write Txn
129
Name Source DescriptionHADDR[31:0] Master Address bus
HWRITE Master Transfer direction (high-write/low-read)
HWDATA[31:0] Master Write data bus
HRDATA[31:0] Slave Read data bus
HREADY Slave Transfer done (high-done)
HRESP Slave Transfer response (okay, error, retry, split)
Basic Transfer (10/10)
n Operation for a Write Txn
130
Transfer Type (1/2)
n All data transfer are processed with HTRANS[1:0]
131
HTRANS[1:0] Type Description
00 IDLEIndicates that no data transfer is required. The IDLE transfer type is used when a bus master is granted the bus, but does not wish to perform a data transfer.Slaves must always provide a zero wait state OKAY response to IDLE transfers and the transfer should be ignored by the slave.
01 BUSY
The BUSY transfer type allows bus masters to insert IDLE cycles in the middle of bursts of transfers. This transfer type indicates that the bus master is continuing with a burst of transfers, but the next transfer cannot take place immediately. When a master uses the BUSY transfer type the address and control signals must reflect the next transfer in the burst. The transfer should be ignored by the slave. Slaves must always provide a zero wait state OKAY response, in the same way that they respond to IDLE transfers.
10 NONSEQIndicates the first transfer of a burst or a single transfer. The address and control signals are unrelated to the previous transfer. Single transfers on the bus are treated as bursts of one and therefore the transfer type is NONSEQUENTIAL.
11 SEQ
The remaining transfers in a burst are SEQUENTIAL and the address is related to the previous transfer. The control information is identical to the previous transfer. The address is equal to the address of the previous transfer plus the size (in bytes). In the case of a wrapping burst the address of the transfer wraps at the address boundary equal to the size (in bytes) multiplied by the number of beats in the transfer (either 4, 8 or 16).
Transfer Type (2/2)
n Transfer type example
132
Burst Operation
HBURST[2:0] Type Description
000 SINGLE Single transfer
001 INCR Incrementing burst of unspecified length
010 WRAP4 4-beat wrapping burst
011 INCR4 4-beat incrementing burst
100 WRAP8 8-beat wrapping burst
101 INCR8 8-beat incrementing burst
110 WRAP16 16-beat wrapping burst
111 INCR16 16-beat incrementing burst
Burst signal encoding
133
Burst Mode Example (1/5)
n The address wraps if it reaches the address boundaryn In this case, the address boundary is 0x30 ~ 0x3C
134
Four-beat wrapping burst
Burst Mode Example (2/5)
n Sequentially increasing addressn In this case, the address increases by 4 because SIZE = Word
135
Four-beat incrementing burst
Burst Mode Example (3/5)
n The address wrap if it reaches the address boundaryn In this case, the address boundary is 0x20 ~ 0x3C
136
Eight-beat wrapping burst
Answer to the Questionn In the case of a wrapping burst the address of the transfer wraps at the address
boundary equal to the size (in bytes) multiplied by the number of beats in the transfer (either 4, 8 or 16).
So if ‘Eight-beat wrapping burst’ with size (specified in HSIZE[2:0]) of 4Byte (Word=32bit) has 32 Byte(=8x4) address boundary. ‘32 Byte address boundary’ means that only 5 LSB bits out of HADDR[31:0], that is, HADDR[4:0] can be changed.
Therefore if the start address is 0x34, then next address changes as follows:
0x34 à 0x38 à 0x3C à 0x20 à 0x24 à 0x28 à 0x2C à 0x30.
n The question in the middle of today’s lecture is: If the start address begins with 0x24, how the address bus will be changed? The answer is:
0x24 à 0x28 à 0x2C à 0x30 à 0x34 à 0x38 à 0x3C à 0x20.
Because only 5 LSB bits are allowed to be changed due to the fact that address boundary is 32 Byte.
137
Burst Mode Example (4/5)
n Sequentially increasing addressn In this case, the address increases by 2 because SIZE = halfword
138
Eight-beat incrementing burst
Burst Mode Example (5/5)
n Incrementing burst of unspecified length
139
Undefined-length bursts
Early Burst Termination
n There are certain circumstances when a burst will not be allowed to complete and therefore it is important that any slave design which makes use of the burst information can take the correct course of action if the burst is terminated early
n The slave can determine when a burst has terminated early by monitoring the HTRANS signals and ensuring that after the start of the burst every transfer is labeled as SEQUENTIAL or BUSY
n If a NONSEQUENTIAL or IDLE transfer occurs then this indicates that a new burst has started and therefore the previous one must have been terminated
n If a bus master cannot complete a burst because it loses ownership of the bus then it must rebuild the burst appropriately when it next gains access to the bus
n For example, if a master has only completed one beat of a four-beat burst then it must use an undefined-length burst to perform the remaining three transfers
140
Address Decoding
n Decoder – used to provide a select signal, HSELx, for each slave on the bus
n Select signal – Implemented by high-order address signals
n Address boundary – The minimum address space that can be allocated to a single slave is 1KB
141
Master# 1
Slave# 1
Slave# 2
Slave#3Decoder
Master# 2
HADDR_M1[31:0]
HADDR_M2[31:0]
HSEL_S1Address &
control mux
HADDR to all slaves
HSEL_S2
HSEL_S3
- Address decoding Process-
Slave Transfer Responses
n The role of slavesn Whenever a slave is accessed it must provide a response which indicates the status of the
transfern The HREADY signal is used to extend the transfer and this works in combination with the
response signals, HRESP[1:0], which provide the status of the transfer
n The slave can complete the transfer in a number of ways:n complete the transfer immediatelyn insert one or more wait states to allow time to complete the transfern signal an error to indicate that the transfer has failedn delay the completion of the transfer, but allow the master and slave to back off the bus, leaving it
available for other transfers
n Transfer responsen OKAY with HREADY indicates the transfer has completed successfullyn ERROR response: Occurs when accessing wrong addressesn RETRY: Indicates transfer has not yet completed, so the bus master should retry the transfern SPLIT: The transfer has not yet completed successfully, so the bus master must retry the transfer
when it is next granted access to the bus
142
Data Buses (1/2)
n AHB does not specify the required endianness, so, it is important that all masters and slaves on the bus are of the same endianness.
143
Transfersize
Address offset
DATA [31:24]
DATA [23:16]
DATA [15:8] DATA [7:0]
Word 0 o o o oHalfword 0 - - o oHalfword 2 o o - -
Byte 0 - - - oByte 1 - - o -Byte 2 - o - -Byte 3 o - - -
- Active byte lanes for a 32-bit little-endian data bus -
Data Buses (2/2)
144
Transfer size Address offset DATA [31:24] DATA [23:16] DATA [15:8] DATA [7:0]
Word 0 o o o o
Halfword 0 o o - -
Halfword 2 o o
Byte 0 o - - -
Byte 1 - o - -
Byte 2 - - o -
Byte 3 - - - o
- Active byte lanes for a 32-bit big-endian data bus -
Slave Transfer Response : Retry
145
l okay response : 1 cyclel error, split, retry : more than 2 cycles
Slave Transfer Response : Split
146
Arbitration : Granting Bus Access
147
Before a transaction a master makes a request to the central arbiter Eventually the request is granted
Then the transaction proceeds
Performance Impact
Arbitration : Granting Bus Access with Wait States
148
Arbitration : Data Bus Ownership
149
Arbitration : Handover after Burst
150
Multi-layer Bus
n Parallel paths between masters and slaves
n Key Advantagesn Increased bandwidthn Design flexibility
n Uses the same interface protocol
151
Master#1
Master#1
Slave#4
Slave#4
BusMatrix
BusMatrix
Master#2
Master#2
Master#3
Master#3
Slave#3
Slave#3
Slave#2
Slave#2
Slave#1
Slave#1
In Brief: Multi-layer AHB and AHB-Lite (1/2)
n ARM SoC World shifts towards crossbar switched interconnects, in the form of multi-layer busses
n Each layer of the bus is an independent single master AHB system
n Instead of a rather complex monolithic multiplexing scheme, n A multi-layer AHB bus architecture with M masters and S slaves is structured
as M ´1:S muxes plus S ´ M:1 slave muxes all connected to separate arbitration and decoding logic
152
In Brief: Multi-layer AHB and AHB-Lite (2/2)
n Multiple masters can talk to multiple slaves concurrently, as long as no two masters don't try to access the same slave at the same time
n All arbitration and protocol complexity moves into the fabricn The interface implementation becomes simpler as a number of unneeded
signals, most notably HGRANT and HBUSREQ, can be removed along with their associated protocol
n Although not a necessary consequence of the multi-layer architecture, getting rid of the unpopular SPLIT and RETRY handshaking mechanism was another advantage
n With the advent of AMBA3 this AHB subset has been standardized upon as AHB-Lite
n AHB-Lite is a subset of the full AHB specification for designs where only a single bus master is used. This can either be a simple single-master system, or a multi-layer AHB-Lite system where there is only one AHB master per layer
153
154
AMBA 3.0 AXI Bus
Routing Control Routing Control Routing Control
Control Control ControlRouting Routing Routing
AXI - APB
Synch bridgeAsync bridge
Register Slice
AXI - AHB
Downsizer
Downsizer
Async bridgeSynch bridge
AXI – 50MHz AXI – 166MHz AXI – 200MHz
APB – 40MHzAHB – 32-bit, 70MHzAXI – 200MHz
Fabric Matrix Core
AMBA Fabric Matrix
Parameterised – e.g. address
map
Structurally Configured
AMBA3.0: Typical AXI Interconnect
155
Why update the AMBA Protocol?
n Future SoC designs demand:n Increasing levels of system complexityn Increasing performance demandsn Increasing clock speed requirementsn Reduction of system power consumption
n Whilst retaining existing AMBA strengths:n Modular designn Component re-usen Simple to understand interface
156
Why AXI instead of Multi-layer AHB?
n AHB is transfer-orientedn With each transfer, an address will be submitted and a single data item will
be written to or read from the selected slaven All transfers will be initiated by the master, thus the master will be stalled if
the slave cannot respond immediately to a transfer requestn Each master can have only one outstanding transaction
n Sequential accesses (bursts) consist of consecutive transfers which indicate their relationship by asserting HTRANS/HBURST accordingly
n Although AHB systems are multiplexed and thus have independent read and write data busses, they cannot operate in full-duplex mode
157
AMBA3 AXI Protocol
n Advanced eXtensible Interface (AXI)n separate address/control and data phasesn support for unaligned data transfers, using byte strobesn uses burst-based transactions with only the start address issuedn separate read and write data channels, that can provide low-cost
Direct Memory Access (DMA)n support for issuing multiple outstanding addressesn support for out-of-order transaction completionn permits easy addition of register stages to provide timing closure
n Five channelsn Read: Address read (AR) and read (R) data channelsn Write: Address write (AW), write (W) data, and response (R)
channels
[Source: AXI Spec]
158
Channel Architecture
n Four groups of signalsn Address “A” signal name prefixn Read “R” signal name prefixn Write “W” signal name prefixn Write Response“B” signal name prefix
159
ADDRESS
READ DATA
WRITE DATA
RESPONSE
Interconnect, Interface & Channel
160
Interconnect
Master 1 Master 2 Master 3
Slave 1 Slave 2 Slave 3 Slave 4
Interface
Interface
AXIMaster
AXI masterinterface
AXISlave
AXI slaveinterface
AXIInterconnect
AXI masterinterface
AXI slaveinterface
Write address channel (AW)
Write data channel (W)
Read address channel (AR)
Write response channel (B)
Read data channel (R)
Write address channel (AW)
Write data channel (W)
Read address channel (AR)
Write response channel (B)
Read data channel (R)
To read ...
ADDRESS
READ DATA
WRITE DATA
RESPONSE
1. Master issues address
2. Slave returns data
161
To write ...
ADDRESS
READ DATA
WRITE DATA
RESPONSE
2. Master gives data
3. Slave acknowledges
1. Master issues address
162
Channels - One way flow
AVALIDADDRAWRITEALENASIZEABURSTALOCK
AIDAREADY
RVALIDRLASTRDATARRESPRIDRREADY
WVALIDWLASTWDATAWSTRBWIDWREADY
BVALIDBRESPBIDBREADY
APROTACACHE
n Each channel has information flowing in one direction only
n READY is the only return signal
163
Register Slices for Max Frequency
n Register slices canbe applied across any channel
n Allows maximum frequency of operation by matching channel latency to channel delay
n Allows system topology to be matched to performance requirements
WREADY
WID
WDATA
WSTRB
WLAST
WVALID
164
Example Register Slices
Master#1
Master#2
Slave#1
Slave#2
Slave#3
Slave#4
A
A
A
A
A
A
Master#1
Master#2
Slave#1
Slave#2
Slave#3
Slave#4
A
A
A
A
A
A
Master#1
Master#2
Slave#1
Slave#2
Slave#3
Slave#4
A
A
A
A
A
A
Isolate addresstimingR
R
R
R
R
R
R
R
R
R
R
RIsolate poor read
data setupBreak long
interconnect path
165
AMBA 2.0 AHB Burst
n AHB Burstn Address and Data are locked togethern Single pipeline stage n HREADY controls intervals of address and data
166
A21 A22 A23A11 A12 A13 A14
D21 D22 D23D11 D12 D13 D14
A31
D31
ADDRESS
DATA
AXI - One Address for Burst
n One Address for entire burst
n Separation of address and data channeln Master provides the start address of burstn Slave needs to generate the remaining addresses based on burst type (FIXED,
INCR, WRAP)
167
A21A11
D21 D22 D23D11 D12 D13 D14
A31
D31
ADDRESS
DATA
AXI - Outstanding Transactions
n Benefit of split transaction: multiple outstanding requests
n Parameters for multiple outstanding requestsn Master I/F: Issuing capability à # of outstanding request that the master can
generaten Slave I/F: Acceptance capability à # of outstanding request that the slave
can accommodate
168
A21A11
D21 D22 D23D11 D12 D13 D14
A31
D31
ADDRESS
DATA
Out of Order Interface
n Each transaction has an ID attachedn Channels have ID signals - AID, RID, etc.
n Transactions with the same ID must be ordered
n Requires bus-level monitoring to ensure correct ordering on each IDn Masters can issue multiple ordered addresses
169
AMBA 2.0 AHB Burst - Slow slave
n With AHBn If one slave is very slow, all data is held up.
170
A21 A22 A23A11 A12 A13 A14
D11 D12
A31ADDRESS
DATA
Separate Read / Write Channels
n AMBA AXI allows for independent read and write transactions.
AXIMaster
AXISlave
Write data
WREADY
Read data
RREADY
Response
BREADY
Write Address/Control
AWREADY
Read Address/Control
ARREADY
171
Split Transaction
n Address, data, and response are handled separately.
AXIMaster
AXISlave
Write data
WREADY
Read data
RREADY
Response
BREADY
Write Address/Control
AWREADY
Read Address/Control
ARREADY
172
Master issues address
AXIMaster
AXISlave
Write data
WREADY
Read data
RREADY
Response
BREADY
Write Address/Control
AWREADY
Read Address/Control
ARREADY
Split Transaction: Write (1/3)
173
AXIMaster
AXISlave
Write data
WREADY
Read data
RREADY
Response
BREADY
Write Address/Control
AWREADY
Read Address/Control
ARREADY
Master gives data
Split Transaction: Write (2/3)
174
AXIMaster
AXISlave
Write data
WREADY
Read data
RREADY
Response
BREADY
Write Address/Control
AWREADY
Read Address/Control
ARREADY
Slave acknowledges
Split Transaction: Write (3/3)
175
AXIMaster
AXISlave
Write data
WREADY
Read data
RREADY
Response
BREADY
Write Address/Control
AWREADY
Read Address/Control
ARREADY
Master issues address
Split Transaction: Read (1/2)
176
AXIMaster
AXISlave
Write data
WREADY
Read data
RREADY
Response
BREADY
Write Address/Control
AWREADY
Read Address/Control
ARREADY
Slave returns data
Split Transaction: Read (2/2)
177
Wire Counts
n Address 32b, data 32b bus case: 184~204n AW: 52~56, W: 39~43, B: 4~8, AR: 52~56, R: 37~41
AXIMaster
AXISlave
Write Address/ControlAWID[3:0] write addr ID (0~4 bits)AWADDR[31:0] write addr IDAWLEN[3:0] burst lengthAWSIZE[2:0] burst sizeAWBURST[1:0] burst typeAWLOCK[1:0] lock infoAWCACHE[3:0] cache typeAWPROT[2:0] protection typeAWVALID write address validAWREADY write address ready
Write dataWID[3:0] write ID tag, AWID = WID (0~4 bits)WDATA[31:0] write dataWSTRB[3:0] write strobesWLAST write lastWVALID write validWREADY write ready
52~56
39~43
Payload
Handshakesignals
178
Write Address Channel & Signals
Signal Source Description
AWID[3:0] Master Transaction id
AWADDR[31:0] Master Start address
AWLEN[3:0] Master Burst length (1~16)
AWSIZE[2:0] Master Data width (1~1024)
AWBURST[1:0] Master Burst type (FIXED, INCR, WRAP)
AWLOCK[1:0] / AWCACHE[1:0] Master / Master Lock / cache type
AWPROT[2:0] Master Protection type
AWVALID / AWREADY Master / Slave Handshake control
179
Independent
Independent
I’m sending data. Please store it.
Here is the data.
I received that data correctly.
channels synchronized with ID # or
“tags”
Write Data & Response Channels
Signal (Write Data Ch.) Source Description
WID[3:0] Master Transaction id
WADDR[31:0] Master Start address
WSTRB[3:0] Master Write strobe per byte
WLAST Master Write last
WVALID / WREADY Master / Slave Handshake control
Signal (Write Response Ch.) Source Description
BID[3:0] Master Transaction id
BRESP[1:0] Master Write response
BVALID / BREADY Master / Slave Handshake control
180
Read Channels & Signals
n AW signals ~ AR signals
n R signals ~ W signalsn RLAST is controlled by slave
181
Give me some data
Here you go
Independent
channels synchronized with ID # or “tags”
Read Burst Operation
Read requestis initiated
Read requestis accepted
Data read is ready
1st data is transferred
The last data is transferred
Note: data transfer only when valid = ready = 1
182
Overlapping Read Bursts
Read request Ais accepted Read request B
is accepted via AR channelwhile data A(0) is
transferred via R channel
183
Write Burst Operation
Write request Ais accepted
Response completeswrite operation
184
Cache Support: ARCACHE[3:0] / AWCACHE[3:0]
n Bufferable bit (B): AWCACHE[0]n Write delay can be an arbitrary one
n Cacheable bit (C): AR(W)CACHE[1]n Read: prefetch or read cache is possiblen Write: write merging is possible
n Read Allocate bit (RA): ARCACHE[2]n If read miss, fetch the data to cachen If C=low, RA=low
n Write Allocate bit (WA): AWCACHE[3]n If write miss, fetch the data to cache, and then write to the cache (and
through the memory)n If C=low, WA=low
185
Protection Support
n Normal or privileged: AR(W)PROT[0]n High à privileged
n Secure or non-secure: AR(W)PROT[1]n Low à secure
n Instruction or data, AR(W)PROT[2]n High / low à instruction / data
186
[Source: AXI Spec]
Atomic Access
n Normal access, AR(W)LOCK[1:0]=b00n Exclusive access, b01
n Exclusive read … Exclusive writen If no intervening write to the address region, EXOKAY response.
If not, OKAY response.n Usually used for read-modify-write
n Locked access, b11n Start with b11, and end with b00n During the period, only the lock initiating master can access the
address region (not the slave!!!)
time
E.RD 0x100 WR 0x100
Master 1 Master 2
E.WR 0x100
Master 1
OKAY
Slave 1
187
Out-of-Order Transaction
n Transaction ID is used to identify data transfer at all channelsn ARID, AWID, RID, WID, & BIDn Up to four bits
n Ordering by transaction IDn Master needs to finish data transfers with the same transaction ID in the order of
request issue. n Slave can handle data transfers with different transaction IDs out-of-order
A21A11
D21 D22 D23 D11 D12 D13 D14
A31
D31
ADDRESS
DATA
188
Ordering
n Transaction idn ARID, AWID, RID, WID, & BIDn Up to four bits
n In-order requirementn Transfers with the same id finishes in the same order that they
are initiated
n Real implementationn Transaction id = <master ID, channel ID>n Channel ID = original AXI transaction idn Master ID is needed to identify the initiating master among all
the masters
189
Transaction ID Implementation
n Real implementationn Transaction ID = <master ID, channel ID>n Channel ID = original AXI transaction idn Master ID is needed to identify the initiating master among all the masters
CPU VideoDecoder
3DGraphics
LCDControl
VideoProcessor Mixer DMA
Crossbar
MemoryController
ID: 4 + ceil(log27)=7 bits
190
Write Interleaving
n Interleaving rulen Data with different ID can be interleaved.n The order within a single burst is maintainedn The order of first data needs to be the same with that of requestn Write Interleave Cability à The maximum number of
transactions that master can interleave
A11
D11 D12 D13 D14
ADDRESS
DATA
A21
D21 D22 D23
A31
D31
191
Low Power Interface (C channel)
192
AMBA Class Library
n Defined method for modelling AXI componentsn Data structure used to fully describe a transactionn Standard method calls for generating and receiving transactions
n Modelling abstraction levels n Programmer view leveln Programmer’s view + timing n Cycle callable
193
AXI Support
RTLEnvironment Verification
Environment
ModellingEnvironment
AMBADesign
KitAMBA
VerificationComponents
RealViewModelLibrary
AMBA Class Library
AXIInterconnect
Generator
194
Summary
n AXI is the next generation AMBA busn Channel architecturen Registers Slices n Burst addressingn Multiple outstanding burstsn Out of order completion
n Interconnect Optionsn Shared bus, multi-layer and mixed
n More than just a bus protocoln RTL, Modelling and Verification environments
195
196
AMBA 4.0 AXI Bus
In Brief: AXI4 (1/3)
n Three flavors: AXI4, AXI4-Lite, AXI4-Streamn All three share same handshake rules and signal naming
197
APBAPB AHBAHB AXIAXIAMBA 3.0
AXI4AXI4 AXI4 StreamAXI4 Stream AXI4 LiteAXI4 LiteAMBA 4.0
AXI4 AXI4-Lite AXI4-Stream
Dedicated forhigh-performance
and memorymapped systems
register-style interfaces(area efficient
implementation)
non-addressbased IP
(PCIe, Filters, etc.)
Burst up to 256 1 Unlimited
Data width 32 to 1,024 bits 32 or 64 bits any number of bytes
Applications Embedded, memory Small footprint control logic
DSP, video, communication
In Brief: AXI4 (2/3)
n Functionality has been added and several known issues in AXI3n The maximum burst length has been increased from 16 to 256 transfers for
certain types of bursts (INCR, non-exclusive).n A Quality-of-Service signaling has been added, where the finer details of the
interpretation are implementation defined
n AXI4 defines address regions for slaves, which allows implementations of memory perspectives on the bus level
n Some ordering requirements and transfer dependencies have been refined, as have the meanings of the cache policy signals AxCACHE
n Abstract memory types as defined by ARMv6/v7 architectures and multicorearchitectures are much better represented by these changes
198
In Brief: AXI4 (3/3)
n Legacy (AHB) locked transfers are no longer supported, which does not fit within the idea of a switched interconnectn AXI4 does not support locked transactions but, an AXI3
implementation must support locked transactions
n A rather significant change seems to be the banning of write interleaving, which could help improve the system throughputn Write interleaving is hardly used by regular masters but can be
used by fabrics that gather streams from different sources. With the new AXI4-Stream protocol, write interleaving is still available for fabrics
199
In Brief: AXI4-Stream
n The new AXI4-Stream protocol was designed for streaming data to destinations that are not memory mapped internallyn Display controllers, transmitters, but also routing fabrics are
among the target applications for this new protocol
n Building upon the proven simple AXI channel handshake AXI4-Stream is essentially an AXI write data channel with additional control signals and a slightly modified protocoln The burst (packet) length is not restricted and the number of
bytes of the data signal TDATA can be an arbitrary integer including zero
200
In Brief: AXI4-Lite
n The focus of AXI has been on high-performance data transfer, but what about the low-end hardware registers, configuration, etc like APB?
n The issue with APB is the bridge
n AXI4-Lite addresses this last issue by defining certain restrictions that would allow a slave to be connected directly to an AXI fabric
n In AXI4-Lite, you might say that AXI gets "dumbed" down to a few basic transaction types
n The burst length is fixed to one data transfer, transfers are non-cacheable and non-bufferable, exclusive access is not allowed and access width must always be the same as data bus width
n This is supposed to make the interface design simple enough to be implemented quickly in custom IP
201
In Brief: ACE
n The ACE protocol extends the AXI4 protocol and provides support for hardware-coherent caches
n A five state cache model to define the state of any cache line in the coherent system
n The cache line state determines what actions are required during access to that cache line
n Additional signaling on the existing AXI4 channels that enables new transactions and information to be conveyed to locations that require hardware coherency support
n Additional channels that enable communication with a cached master when another master is accessing an address location that might be shared
n The ACE protocol also provides:n Barrier transactions that guarantee transaction ordering within a systemn Distributed Virtual Memory (DVM) functionality to manage virtual memory
202
What’s New in AMBA4
n Increased productivityn Supports memory mapped, control and streaming applications
203
Source: Xilinx (2013)
What’s New in AMBA4:
n New Stream Interface
n AHB absent
n New AXI-Lite spec
n New AXI Features
n AXI Feature removal
n AXI clarifications
204
AXI Changes
n Clarificationsn ordering, posting, memory types / cache impacts
n Removal of locked transactions
n Removal of write data interleave
n New featuresn longer burstsn QoS fieldn User fields
205
Longer Bursts
n Can issue up to 256 beat burstsn performance, atomicity, simplicity
n Not guaranteed atomic beyond 16 beatsn even in device space / non-modifiablesn system (integrator) performance & QoS
n Atomicity may be undesirable for core as well
206
Basic AXI4 Signaling: 5 Channels, Point-to-Point
207
Basic AXI4 Handshaking
n Master asserts and holds VALID when data is available
n Slave asserts READY if able to accept data
n DATA and other signals transferred when VALID and READY = 1
n Master sends next DATA/other signals or deasserts VALID
n Slave deasserts READY if no longer able to accept data
208
VALID before READY handshake READY before VALID handshake VALID with READY handshake
Appendix:PL301 Crossbar Bus
209
AXI Crossbar Bus: ARM PrimeCell PL301
n PL301: PrimeCell High Performance Matrix
Slave Interface (SI) Master Interface (MI)
210
PL301 Internal Structure
n Adjust interface differencesn Data width, frequency, and protocol
Routing Control Routing Control Routing Control
Control Control ControlRouting Routing Routing
AXI - APB
Synch bridgeAsync bridge
Register Slice
AXI - AHB
Downsizer
Downsizer
Async bridgeSynch bridge
AXI – 50MHz AXI – 166MHz AXI – 200MHz
APB – 40MHzAHB – 32-bit, 70MHzAXI – 200MHz
Fabric Matrix Core
AMBA Fabric Matrix
Parameterised –e.g. address
map
Structurally Configured
211
Adjusting Data Width
n ExpanderAxin Converts a narrow master for use on a wide
busn Data replication on write data and muxing
on read path.
n FunnelAxin Converts a narrow slave for use on a wide
busn Mux on write data path and replication on
read data pathn Assumes that transfers are suitably sized for
slave (i.e. no wider than the narrow bus)
n DownSizerAXIn Converts a wide bus to a narrow busn Transaction <= narrow bus pass throughn Transactions > narrow bus are broken down
in to smaller transactions.
PL300Exp32b 64b
32b 64b
PL300Fun
32b
64b
64b
PL30032b
DS64b
32b64b
32b
212
PL301 Features
n Configurable number of SIs and MIsn Sparse connection options to reduce gate count and improve securityn Configurable AXI address/data widthsn Decoded address register that you can configure for each SIn Flexible register stages to aid timing closuren An arbitration mechanism that you can configure for each MI,
implementing:n a fixed Round-Robin (RR) schemen a programmable RR schemen a programmable scheme that provides prioritized groups of Least Recently
Granted (LRG) arbitration
n A programmable Quality of Service (QoS) schemen Support for multiple clock domains: synchronous and asynchronous.n Configurable cyclic dependency schemes to enable a master to have
outstanding transactions to more than one slave [Source: PL301 TS]
213
Arbitration Scheme: Fixed Priority
214
Arbitration Scheme: Round Robin
[Source: PL301 TS]
215
Arbitration Scheme: Hybrid
n Combination of round robin and fixed priority
216
Arbitration Scheme
n LRG (least recently granted) scheme
[Source: PL301 TS]
217
Arbitration Scheme: Fixed Round Robin
n A weighted round robin in a fixed order
ARM VIDEO 3D LCD
m0 m1 m2 m3 m4 m5
s0
S0
m0
m1 m2 m3 m4
m2
m5
m0
m1
m2 m4 m3
m2
m0
m5
m2
218
An Example of Crossbar Bus Design
CPU VideoDecoder
3DGraphics
LCDControl
VideoProcess Mixer DMA
Crossbar
MemoryController
Some IPs need a privileged usage of memory BW
40% of total BW
219
Programmable QoS
Maximum # of requestsallowed for best effort traffic
Assume Tidemark = 4, and ID match = M0.If there is 4 outstandingrequests for M1, then only requests from M0 are accepted by S0 until one of M1’s requests is served
[Source: PL301 TS]
220
Bus Arbitration: A Generic Arbiter
n Assumptionsn Each request has its own performance requirement (e.g., bandwidth budget and/or latency)
n E.g., low latency access from CPUn E.g., bandwidth guarantee for LCD / Camera controllers
n In order to avoid starvation, a global time out is applied
n Priority order: time-out > bandwidth > best effortn Time-out (TO) request: TO = 0, and BW budget > 0n Bandwidth (BW) request: TO > 0, and BW budget > 0n Best effort (BE) request: BW budget = 0
n Priority promotion/demotionn Demotion: if BW budget is exhausted, demotion to BEn Promotion: if BW budget becomes positive, promotion to BW or TO request
n Time-out countersn One type of counter for QoS access w/ time outn The other for old request
n When a normal request (w/ unspecified TO) arrives at the bus, a timer is assigned and starts to be decremented each cycle.
221
Bus Arbitration: A Generic Arbiter
n If there is any TO request, it is servedn QoS request w/ time outn Old request
n If there is any BW requestn Give the bus to the BW requests based on BW budget
n Based on fixed time slot allocation or statistical slot allocation
n To the other BE requests, apply the same priority order that BW requests
222
HigherPriority
Note: Arbitration scheme is similar to the one in memory access scheduling
Current Slot
Case of fixed time slot allocationAt a time slot (instant or period)if the master allocated to the slot has a request,
then it is servedelse
the other pending requests are served(e.g., round robin)
A Deadlock Problem in Accessing Multiple Slaves
MemoryController
1
MemoryController
2Master
2
Master1
Memory1
Memory2
D
B
A
On-chip Bus
C
Master 2
C D
Memory 1
Memory 2
B is blocked at master 2
D is blocked at master 1
Master 1
A B Time
B
D A
C
Color (= Transaction id)Requests with the same transaction idneed to be finished in the order ofrequest issues
Color (= Transaction id)Requests with the same transaction idneed to be finished in the order ofrequest issues
Optimization in Memory ControllerMemory controllers can serve independent requests out-of-orderto increase memory utilization or to lower memory access latency
Optimization in Memory ControllerMemory controllers can serve independent requests out-of-orderto increase memory utilization or to lower memory access latency
223
Master 2
Memory 1
Memory 2
Master 1 C
DA
A A
B
B
C
C D
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
D
B
16 17
Cyclic Dependency Schemes
n Outstanding requests are permitted only for a single slave per transaction id
n The deadlock problem is resolved while limiting parallel (memory) accesses
224
Single Slave Scheme
n Allow multiple outstanding transactions only to the same slave
225
Unique ID Scheme
n Accept only out-of-order requests, i.e., requests with different transaction ID’s
226
Single Slave per ID
n Combination of both single slave and unique ID schemes
n Allow multiple outstanding requests to a single slave per transaction ID
227
Master 2
C D
Memory 1
Memory 2
AD
Master 1
A B
RP2 (receiving)
D
D
A
C
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
RP1 (receiving) C
A
CB
B
B
16 17
Request Parallelizer (RP) for Transaction ID Renaming
MemoryController
1
MemoryController
2Master
2
Master1
Memory1
Memory2
D
C
B
A
NoCRP1
RP2
Master IDs Network IDs
Furtherreduction?
Rename Transaction IDs at NoC Boundary to Enable Out-of-Order Transfers in NoC
*Requirement: Reserve Buffer Space in RP for Transaction ID Renaming
228
On-Chip Network (OCN) Design Flow
IP1 IP2 IP3 IP4
Bus 1 Bus 2
Mem Mem
IP1 IP2 IP3 IP4
Bus 1
Mem Mem
Architecture exploration w/performance evaluation
Pre-layoutRegister sliceoptimization
IP1 IP2
IP3IP4
Bus 1
Early engagement w/ floorplanning
GeneratedBusRTL code
Bus RTL Verification (Specman-based)
IP1 IP2
IP3IP4
Bus 1
P&R
latching
buffering
pipelining
too longwire
too big I/O delay
Bus RTLgeneration
Bus RTLgeneration
229
AXI Bus Architecture Design Space
n Interconnection leveln PL300 decomposition: cascading level
n Bus component leveln Common parameters: bit width, etc.n PL300: WriteIssueCap, etc.n PL340: arbiter queue, data fifo size,
etc.n Register slice: full/forward/bypass per
channeln Closely related with implementation
n IP interface leveln # of AXI ports, internal buffer
usage/size, address queue size, etc.
PL3002x2
PL300
3x2
PL30_33x2
PL340
PL340
Reg
IP
IP
Processor
Processor
IP
IP
DSP
Large WriteIssuingCapability on high-latency paths
Register slices on timing-critical paths
2-level cascaded architecture to localize
traffic
SRAM
Advanced AXI interfaceBuffer usage, size
Address queue sizeBurst size
230
Performance Evaluation of Bus Architecture
n Five methods depending on how to model the traffic of bus mastersn ViP (virtual platform)-based methods
n Random bus traffic modeln ViP version of Sonics’ random pattern generator
n Sequence-based bus traffic modeln ViP version of Specman-based model in DTV team
n Timed modeln Bus masters are modeled in cycle accuracy as ViP models
n HW Integratorn Bus masters are emulated on FPGA
n RTL simulationn Golden RTL is built and simulated
231
Bus RTL Generation: AMBA Designer
n Generated IP’sn Interconnect, DMA, Static
memory controller etc.
n Parametersn data width (32b ~ 128b),
protocol (AXI, AHB, APB), etc.n Timing (register slice, address
decode), security (trustzone), etc.
n QoS (programmable QoS), arbitration schemes, etc.
232
Bus RTL Verification
n Bus component verificationn Verification of each components (PL30x, 340, bridges)n Method & criterion: Specman eVC & protocol coverage
n Bus architecture verificationn Verification of the entire bus architecturen Method & criterion: Specman eVC + system scenarios &
functional correctness + protocol coverage
n Called Platform Verifiern Under development by System LSI Design Technology team
233
On-Chip Network (OCN) Design Flow
IP1 IP2 IP3 IP4
Bus 1 Bus 2
Mem Mem
IP1 IP2 IP3 IP4
Bus 1
Mem Mem
Architecture exploration w/performance evaluation
Pre-layoutRegister sliceoptimization
IP1 IP2
IP3IP4
Bus 1
Early engagement w/ floorplanning
GeneratedBus
RTL code
Bus RTL Verification (Specman-based)
IP1 IP2
IP3IP4
Bus 1
P&R
latching
buffering
pipelining
too longwire
too big I/O delay
Bus RTLgeneration
Bus RTLgeneration
234
Register Slice for Timing Isolation
n Register slice can be used at any channel independently
AXIMaster
AXISlave
Write data
WREADY
Read data
RREADY
Response
BREADY
Write Address/Control
AWREADY
Read Address/Control
ARREADY
235
Register Slice for Timing Isolation
n Register slice incurs one cycle latency per insertion
AXIMaster
AXISlave
Write data
WREADY
Read data
RREADY
Response
BREADY
Write Address/Control
AWREADY
Read Address/Control
ARREADY
236
Two Modes: Full and Forward
n Area benefit: forward mode gives smaller area
n The same latency penalty
AXI register slice (full mode)
AXI register slice (forward mode)
[Source: PL301 TS]
237
On-Chip Network (OCN) Design Flow
IP1 IP2 IP3 IP4
Bus 1 Bus 2
Mem Mem
IP1 IP2 IP3 IP4
Bus 1
Mem Mem
Architecture exploration w/performance evaluation
Pre-layoutRegister sliceoptimization
IP1 IP2
IP3IP4
Bus 1
Early engagement w/ floorplanning
GeneratedBus
RTL code
Bus RTL Verification (Specman-based)
IP1 IP2
IP3IP4
Bus 1
P&R
latching
buffering
pipelining
too longwire
too big I/O delay
Bus RTLgeneration
Bus RTLgeneration
238
Announcement
n Homework #5 (Due: Nov. 13)
239