© ARM 2016

Fast, Scalable and Energy Efficient IO Solutions: Accelerating infrastructure SoC time-to-market

Sridhar Valluru

ARM Tech Symposia 2016

Product Manager

Intelligent Flexible Cloud

Scalability and Flexibility

[Figure: nodes labelled A (Acceleration), C (Compute) and S (Storage) in varying combinations, connected by packet flows]

Target design space

Data center: 100 W SoC, 64-96 CPUs, ~300 mm²

Access point: 2-5 W SoC, 4-8 CPUs, ~30 mm²

Execution environment supporting IFC

[Figure: layered execution environment]

Physical machine (e.g., processors, DRAM, caches, MMU, IOMMU, other resources and SoC devices ...)

Other external devices (e.g., disks, NICs, FPGAs, GPUs, crypto, other accelerators, other devices ...)

Firmware, option ROMs, etc. (optional, system dependent)

Hyper-privileged: virtual machine monitor (VMM)/hypervisor

Privileged: guest OS 1, guest OS 2 and guest OS 3, running in virtual machines 1, 2 and 3 ...

Non-privileged: applications (JVM_1, App_1, App_4), VNFs (VNF_1 to VNF_4) and containers (Container_1, Container_2)

IO challenges for next-gen SoC systems

Scalability
- Limited number of hardware IO due to capacity
- Traffic management for a large number of IO streams

Performance
- Large number of translations
- Large number of page table walks
- Insufficient TLB in SMMU

Power Efficiency
- Large page table walks
- Large number of memory accesses
- Enormous TLB
- Large dynamic power

System memory management unit (SMMU)

Next-generation example server subsystem

[Figure: example server subsystem. Blocks shown: Cortex-A processors with ELA-500; interconnect; Coherent Mesh Network (CMN-600); 1-8 memory controllers (DMC-620) with DDR4; CoreSight SoC; Generic Interrupt Controller (GIC); System MMU (SMMU); IO (PCIe and accelerators); non-coherent interconnect; peripherals; security (CryptoCell)]

ARM IO MMU or system MMU (SMMU)

[Figure: IO #1, IO #2 and IO #3 issue accesses in the virtual address space (VA); each passes through a TBU for address translation; the TBUs and the TCU of the SMMU are linked by an AXI-Stream interconnect; memory sits in the physical address space (PA)]

IO Accelerator
- Virtual address

Memory
- Physical address

Translation Buffer Unit (TBU)
- Performs translation from VA to PA
- Holds TLB
- Performs security/access checks
- Sends a request to the TCU on a translation miss

Translation Control Unit (TCU)
- Performs table walks of translation tables
- Handles ATS* requests/responses for PCIe
- Performs security/access checks

Local AXI Stream
- Free-flowing transport
- Enables distributed TBUs

* ATS – PCIe address translation services (PCIe 3.0 ECN)
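The division of work between TBU and TCU can be sketched in a few lines of Python. This is only an illustrative model, not ARM's implementation: the class names, the flat dictionary standing in for the translation tables, and the 4 KiB page size are all assumptions made for the example.

```python
PAGE_SHIFT = 12                      # assumed 4 KiB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1


class TCU:
    """Stand-in for the TCU: resolves a page number by 'walking' a table."""

    def __init__(self, page_table):
        self.page_table = page_table  # hypothetical flat table: VPN -> PPN
        self.walks = 0                # number of table walks performed

    def walk(self, vpn):
        self.walks += 1
        if vpn not in self.page_table:
            raise LookupError(f"no translation for page {vpn:#x}")
        return self.page_table[vpn]


class TBU:
    """Stand-in for a TBU: holds a TLB and asks the TCU only on a miss."""

    def __init__(self, tcu):
        self.tcu = tcu
        self.tlb = {}                 # cached translations: VPN -> PPN

    def translate(self, va):
        vpn, offset = va >> PAGE_SHIFT, va & PAGE_MASK
        if vpn not in self.tlb:       # TLB miss: request the translation
            self.tlb[vpn] = self.tcu.walk(vpn)
        return (self.tlb[vpn] << PAGE_SHIFT) | offset


tcu = TCU({0x10: 0x80})
tbu = TBU(tcu)
print(hex(tbu.translate(0x10234)))    # 0x80234 (miss, one table walk)
print(hex(tbu.translate(0x10238)))    # 0x80238 (hit in the TLB)
print(tcu.walks)                      # 1
```

Only the first access to a page reaches the TCU; later accesses to the same page are served from the TBU's TLB.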

SMMU architecture evolution

SMMUv1
Features
- Support for v7 page table for IO virtualization
- 4k page granule
Implemented by CoreLink™ MMU-401

SMMUv2
Adds
- Up to 128 translation contexts
- Support for v8 page tables
- 64k page granule
Implemented by CoreLink™ MMU-500

SMMUv3
Adds
- Scalability enhancements for millions of translation contexts
- Context store in memory
- PCIe address translation services (ATS) for returning translations to end points with address translation caches (ATC)
- PCIe process address space ID (PASID) for process-specific translations
- PCIe page request interface (PRI) support for access to unpinned pages in memory
- Software communication via memory queues (non-blocking / scalable)
- Support for message-signalled interrupts
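The 4k and 64k page granules listed here determine how an address splits into a page number and an offset, and so how many translations a buffer needs. The Python sketch below works through that arithmetic; the helper names and the 2 MiB buffer are illustrative assumptions.

```python
def split_address(va, granule):
    """Split an address into (page number, offset) for a power-of-two granule."""
    if granule <= 0 or granule & (granule - 1):
        raise ValueError("granule must be a power of two")
    shift = granule.bit_length() - 1
    return va >> shift, va & (granule - 1)


def pages_spanned(base, length, granule):
    """Count the pages (one translation each) covering [base, base + length)."""
    if length <= 0:
        return 0
    first, _ = split_address(base, granule)
    last, _ = split_address(base + length - 1, granule)
    return last - first + 1


KIB = 1024
buffer_size = 2 * KIB * KIB                       # hypothetical 2 MiB DMA buffer
print(pages_spanned(0, buffer_size, 4 * KIB))     # 512
print(pages_spanned(0, buffer_size, 64 * KIB))    # 32
```

With the larger granule the same buffer needs 16 times fewer translations, which is one reason a larger granule eases pressure on a TLB of fixed size.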

Addressing performance & scalability challenge

SMMU microarchitecture
- VA to PA translation overhead
- Limited TLB scalability with number of IO devices

VA to PA translation in ATC
- ATC removes dependency on TBU size
- Populating ATC requires SMMU to support ATS (address translation services)

TBU
- Micro-TLB: small, fully associative
- Main TLB: large, set associative

TCU
- Config cache: caches context info
- Multi-level walk cache: separate S1/S2

[Figure: IO #1, IO #2 and IO #3, each with a local ATC (address translation cache), attached to TBUs; the TBUs and the TCU of the SMMU are linked by an AXI-Stream interconnect]
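The two TLB levels named for the TBU (a small fully associative micro-TLB in front of a large set-associative main TLB) can be modelled as follows. This is a sketch under stated assumptions: LRU replacement, set selection by the low bits of the page number, and all class and function names are choices made for the example rather than details of the real hardware.

```python
from collections import OrderedDict


class FullyAssociativeTLB:
    """Any entry can hold any page; LRU replacement is assumed."""

    def __init__(self, entries):
        self.capacity = entries
        self.entries = OrderedDict()          # VPN -> PPN, least recent first

    def lookup(self, vpn):
        if vpn not in self.entries:
            return None
        self.entries.move_to_end(vpn)         # mark as most recently used
        return self.entries[vpn]

    def fill(self, vpn, ppn):
        self.entries[vpn] = ppn
        self.entries.move_to_end(vpn)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recent entry


class SetAssociativeTLB:
    """The page number selects one set; each set holds `ways` entries."""

    def __init__(self, sets, ways):
        self.sets = [FullyAssociativeTLB(ways) for _ in range(sets)]

    def _set_for(self, vpn):
        return self.sets[vpn % len(self.sets)]

    def lookup(self, vpn):
        return self._set_for(vpn).lookup(vpn)

    def fill(self, vpn, ppn):
        self._set_for(vpn).fill(vpn, ppn)


def tbu_lookup(micro, main, vpn):
    """Try the micro-TLB, then the main TLB; report where the hit came from."""
    ppn = micro.lookup(vpn)
    if ppn is not None:
        return ppn, "micro"
    ppn = main.lookup(vpn)
    if ppn is not None:
        micro.fill(vpn, ppn)                  # promote into the micro-TLB
        return ppn, "main"
    return None, "miss"                       # would go to the TCU
```

A set-associative structure is cheaper to build at large sizes, but pages that map to the same set compete for its few ways, so it can miss even when other sets are empty.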

Advantages of PCIe ATS for IO access performance

Scalability of ATC
- Size and number of ATCs grow with the number of IO devices, whereas the TLB size in an SMMU is fixed (however large)

Independence of ATCs
- Local ATC accesses are independent of each other and do not result in cache thrashing
- A shared TLB in an SMMU can suffer from thrashing if multiple IO devices access too many scattered locations in memory

Customizable pre-fetch
- IO devices can request translations ahead of time according to known access patterns
- A shared TLB in an SMMU is not aware of IO access patterns and cannot implement a universal pre-fetch policy

Customizable replacement policies
- IO devices can prioritize caching of some entries over others based upon known access patterns
- E.g., an Ethernet NIC might choose to exclusively cache ring descriptor translations and store data buffer translations only temporarily

Support for unpinned memory without stalling faults with the use of PRI
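The scalability and independence points can be illustrated with a toy simulation: three devices interleave accesses to four pages each, first through one shared TLB of fixed size and then through one small ATC per device. Everything here is an assumption chosen to make the effect visible (LRU replacement, the round-robin trace, the cache sizes, the function names); it is not a model of any real SMMU or endpoint.

```python
from collections import OrderedDict


class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def access(self, key):
        """Return True on a hit; on a miss, insert the key, evicting if full."""
        if key in self.entries:
            self.entries.move_to_end(key)
            return True
        self.entries[key] = True
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)
        return False


def make_trace(devices, pages_per_device, rounds):
    """Round-robin trace: each device cycles through its own pages."""
    return [(dev, r % pages_per_device)
            for r in range(rounds) for dev in range(devices)]


def hit_rate(trace, cache_for):
    hits = sum(cache_for(dev).access((dev, page)) for dev, page in trace)
    return hits / len(trace)


def compare(devices=3, pages_per_device=4, rounds=100, shared_entries=8):
    trace = make_trace(devices, pages_per_device, rounds)
    shared = LRUCache(shared_entries)               # fixed-size shared TLB
    shared_rate = hit_rate(trace, lambda dev: shared)
    atcs = [LRUCache(pages_per_device) for _ in range(devices)]
    atc_rate = hit_rate(trace, lambda dev: atcs[dev])  # one ATC per device
    return shared_rate, atc_rate


print(compare())   # (0.0, 0.96)
```

In this setup the combined working set (12 pages) exceeds the shared TLB (8 entries), so every access evicts an entry that is needed again before it can be reused. The per-device ATCs add capacity with each device and miss only on the first touch of each page.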

ARM SMMU and Cadence PCIe RP integration

[Figure: EndPoint (EP) with its ATC, connected over the PCIe link to the Root Port (RP); the RP exposes M, S and T interfaces; the SMMU (TBU and TCU) sits between the RP and memory (Mem), with DTI-ATS linking the T interface and the TCU]

M: AXI master interface
- All normal PCIe packets, with or without a translated address, are seen here

S: AXI slave interface

T: DTI-ATS (distributed translation interface for PCIe ATS), which supports
- ATS translation requests from the EP
- Invalidation requests from the TCU
- PRI (page request interface) requests from the EP

* ATS – PCIe address translation services (PCIe 3.0 ECN)

Cadence PCIe RC’s DTI-ATS features

Separate interface provided with the PCIe RC IP
- All PCIe ATS-related requests, responses and invalidations are routed to this interface

DTI-ATS implementation supports additional features
- PCIe PRI (page request interface)
- PCIe PASID support (process address space ID)

DTI-ATS is conveyed using AXI4-Stream
- Separate master AXI4-Stream and slave AXI4-Stream interfaces
- Transaction sideband signals to indicate the context information to the TBU
- DTI-ATS packets can be presented/accepted in one clock cycle

Debug & status
- Registers to capture status and error conditions encountered in the DTI-ATS protocol

SMMU with PCIe operation – no ATS

[Figure: EndPoint (EP), PCIe link, Root Port (RP) with M, S and T interfaces, SMMU (TBU and TCU, with DTI-ATS on the T interface) and memory (Mem); callouts 1 to 4, a and b mark the steps below]

1. Client logic generates a TLP with an untranslated address
2. EP sends this as a PCIe TLP to the RP
3. On receipt by the RP, since the packet is a data-flow packet, it is sent on the “M” interface
   a) If the TBU does not have a suitable translation for the address received, it issues a request to the TCU
   b) The TCU returns the translation response to the TBU
4. The TBU then forwards the transaction to the memory

SMMU with PCIe operation – with ATS and ATC hit

[Figure: EndPoint (EP) with its ATC, PCIe link, Root Port (RP) with M, S and T interfaces, SMMU (TBU and TCU, with DTI-ATS on the T interface) and memory (Mem); callouts mark the ATC lookup (1), the hit (2) and steps 3 to 5 below]

1. Client logic generates a TLP with a virtual address
2. The client logic uses the translated address if it is available from the ATC
3. The EP sends this as a PCIe TLP that carries the translated address
4. On receipt by the RP, since the packet is a data-flow packet, it is sent on the “M” interface
5. The TBU then forwards the transaction to the memory via the main interconnect

SMMU with PCIe operation – with ATS and ATC miss

[Figure: EndPoint (EP) with its ATC, PCIe link, Root Port (RP) with M, S and T interfaces, SMMU (TBU and TCU, with DTI-ATS on the T interface) and memory (Mem); callouts mark the ATC lookup and miss (1) and steps 2 to 6 below]

1. EP client generates a PCIe translation request for a particular address that needs translation
2. The translation request goes out on the PCIe link to the RP
3. The RP forwards the translation request it received to the TCU on the “T” interface
4. The TCU then generates the response completion
5. The RP repacks the translation completion as a TLP and sends it back to the EP
6. Once the EP receives this completion for the translation request it generated, it populates the local ATC
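The ATC hit and miss flows can be summarised in a short Python model: the endpoint looks up its ATC, and only on a miss sends a translation request through the root port, then fills the ATC from the completion. The classes, method names, 4 KiB page size and dictionary page table are hypothetical stand-ins for illustration, not the Cadence or ARM interfaces.

```python
PAGE_SHIFT = 12                       # assumed 4 KiB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1


class RootPort:
    """Hypothetical RP model; a dictionary stands in for the TCU behind "T"."""

    def __init__(self, page_table):
        self.page_table = page_table  # VPN -> PPN, resolved by the TCU stand-in
        self.translation_requests = 0

    def translation_request(self, vpn):
        # Steps 2 to 5 of the miss flow: request in, completion back to the EP.
        self.translation_requests += 1
        return self.page_table[vpn]


class EndPoint:
    def __init__(self, root_port):
        self.root_port = root_port
        self.atc = {}                 # local address translation cache

    def issue(self, va):
        """Return (translated address, 'hit' or 'miss') for one access."""
        vpn, offset = va >> PAGE_SHIFT, va & PAGE_MASK
        if vpn in self.atc:
            outcome = "hit"
        else:
            outcome = "miss"
            # Step 6 of the miss flow: populate the ATC from the completion.
            self.atc[vpn] = self.root_port.translation_request(vpn)
        return (self.atc[vpn] << PAGE_SHIFT) | offset, outcome


rp = RootPort({0x40: 0x9000})
ep = EndPoint(rp)
print(ep.issue(0x40010))              # (150994960, 'miss')
print(ep.issue(0x40020))              # (150994976, 'hit')
print(rp.translation_requests)        # 1
```

After the first miss, further accesses to the same page carry an already translated address and generate no more translation requests.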

IO challenges for next-gen SoC systems

Scalability
- ATC allows PCIe RC to support multiple IO accelerators
- AXI Stream interface allows distributed TBUs to be connected to a TCU

Performance
- With ATC, address translation is no longer needed for every transaction
- TCU cache reduces page table walks

Power Efficiency
- ATC and TCU minimize memory accesses for page table walks
- Custom ATC in IO accelerator removes the need for a very large TLB in the SMMU

Summary

- IFC is driving the need for scalability, performance and efficiency for IO accesses in infrastructure SoCs
- ARM has been addressing IO virtualization solutions via its SMMU
- Fast, performant IO such as PCIe Gen4 from Cadence has been efficiently integrated with ARM’s SMMU through an architected interface, DTI-ATS
- The combined SMMU-PCIe solution delivers high-performance access for IO devices with PCIe ATS as well as PRI and PASID support
- SMMU IP from ARM is designed to handle the performance, scalability, and power efficiency demands from SoCs for IFC
