
© ARM 2016

Boost SoC performance from edge to cloud

ARM® CoreLink™ System IP

Neil Parris, Director interconnect marketing

China Tech Symposia

Systems and software group, ARM

November 2016


More nodes, new use cases

3.7 exabytes per month

22x bandwidth increase

1 ms end to end

~30x access nodes





Intelligent flexible cloud to enable new use cases

Compute

Acceleration





Heterogeneous compute requires coherency

Flexible heterogeneous architecture

Blend compute and acceleration for target solution

Fast, reliable transport to shared memory

Maximize throughput, minimize latency

Coherency simplifies software

Accelerate SoC development and deployment

IP designed, optimized and validated for systems

[Diagram: Cortex-A and accelerator on a coherent backplane (CoreLink Interconnect, CoreLink Controllers), with CoreSight, TrustZone and ARM IP Tooling, …]




3rd-generation ARM coherent backplane IP

CoreLink CMN-600 Coherent Mesh Network

CoreLink DMC-620 Dynamic Memory Controller

Optimized for next-generation intelligent connected systems





Build more powerful systems

Boost performance

Up to 5x more throughput

Fastest path to DDR4 memory

Up to 50% latency reduction

Performance at any design point

Up to 32 clusters (128 CPUs)

Frequencies exceeding 2.5 GHz

Tailor designs from edge to cloud

>1TB/s bandwidth

Performance comparison to ARM CoreLink CCN and CoreLink DMC-520





Delivering maximum compute density

6x compute, 5x throughput

[Chart: relative performance (axis 0 to 6) for three configurations, with a 2.5x annotation]

16x Cortex-A57, CoreLink CCN-504, 2x DMC-520

32x Cortex-A72, CoreLink CCN-508, 4x DMC-520

64x Cortex-A72, CoreLink CMN-600, 8x DMC-620

Compute = measured by specint2k6_rate

Throughput = achieved requested bandwidth

Same process node and test conditions.




Fastest path to DDR4 memory

[Chart: static latency, Cortex-A72 load-to-use, split into CPU, interconnect + DMC, and DDR PHY + memory, for CoreLink CCN-504 with DMC-520 and for CoreLink CMN-600 with DMC-620; annotation: 50%]

Same process node and test conditions

Estimated DDR PHY + memory cycles for 3rd party PHY & closed page DRAM

CoreLink CMN-600 configured as 4 CPU cluster to match CoreLink CCN-504

Increase CPU performance

50% backplane latency reduction

High frequency mesh transport

One cycle per mesh cross point

Improved area efficiency

60% more bandwidth for same area
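
How much a 50% backplane latency reduction helps the CPU depends on how large the interconnect + DMC portion is relative to the whole load-to-use path. The sketch below shows that arithmetic only. The cycle counts are hypothetical placeholders chosen for illustration, not figures from this presentation, and the function names are invented for the example.

```python
def load_to_use_latency(cpu, backplane, phy_mem):
    """Total static load-to-use latency as the sum of its three portions."""
    return cpu + backplane + phy_mem


def overall_reduction(cpu, backplane, phy_mem, backplane_cut=0.5):
    """Fraction by which total latency drops when only the backplane
    (interconnect + DMC) portion shrinks by `backplane_cut`."""
    before = load_to_use_latency(cpu, backplane, phy_mem)
    after = load_to_use_latency(cpu, backplane * (1 - backplane_cut), phy_mem)
    return (before - after) / before


# Hypothetical cycle counts, chosen only to illustrate the arithmetic.
example = overall_reduction(cpu=20, backplane=40, phy_mem=40)
print(f"overall load-to-use reduction: {example:.0%}")
```

With these placeholder numbers the backplane is 40% of the path, so halving it trims the total by 20%. The DDR PHY and memory portion is untouched by the interconnect and controller.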

Tailor solutions from edge to cloud




New scalable coherent mesh architecture

[Diagram: CoreLink CMN-600 mesh connecting Cortex-A clusters, accelerators, Agile System Cache, DMC-620 controllers with DDR4-3200, and NIC-450 IO (PCIe, 100GbE)]

Custom mesh size and device placement

Agile System Cache with snoop filter

1 to 32 clusters (128 CPUs), mix ARMv8-A CPUs and accelerators

1 to 8 high performance DDR4-3200 controllers

Up to 32 IO coherent subsystems

Coherent Multichip Link, CCIX support




Scalable solutions from edge to cloud

Automated interconnect generation with ARM CoreLink Creator

                 Data center compute    Access point
Bandwidth        >1 TB/s                20 GB/s
System cache     128MB                  0MB
DDR channels     8                      1
Cortex-A CPUs    128                    1

[Diagrams: an access point configuration (Cortex-A, accelerator, system cache, DMC-620 and NIC-450 IO on CoreLink CMN-600) and a data center compute configuration (CPUs, DMC-620 controllers, 100GbE and PCIe on CoreLink CMN-600)]




Why a mesh topology?

Interconnect capabilities scale with system size

Naturally add links, wires, cross point routers with resources

Mesh cross sectional bandwidth scales by N vs 1 for a ring topology

Mesh latency scales by √N vs N for a ring topology
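
The scaling argument can be checked with a textbook topology model. This is a minimal sketch under stated assumptions: N counts the nodes, the mesh is square (k x k with N = k * k), every hop costs the same, and the ring is bidirectional. None of these details come from the slide, and the function names are invented for the example.

```python
import math


def ring_metrics(n):
    """Worst-case hop count and bisection link count for a bidirectional ring."""
    # The farthest node is halfway round; cutting a ring in two severs 2 links.
    return {"max_hops": n // 2, "bisection_links": 2}


def mesh_metrics(n):
    """Same metrics for a square k x k mesh with n = k * k nodes."""
    k = math.isqrt(n)
    if k * k != n:
        raise ValueError("n must be a perfect square for a square mesh")
    # Corner to corner is (k - 1) hops in each dimension; a straight cut
    # between two columns severs one link per row.
    return {"max_hops": 2 * (k - 1), "bisection_links": k}


for n in (16, 64):
    print(n, "ring:", ring_metrics(n), "mesh:", mesh_metrics(n))
```

In this model the ring's worst-case hop count grows linearly with the node count while the mesh's grows with its side length, and the ring's bisection stays at two links while the mesh's grows as the mesh gets larger.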

Bandwidth scaling comparison

[Chart: achieved bandwidth (GB/s, axis 0 to 1200) against number of CPU clusters (0 to 32) for the CoreLink CCN family and CoreLink CMN-600]

Achieved coherent bandwidth as observed by requestors

Same process node and test conditions




Innovations to increase throughput

Intelligent cache allocation

Throughput uplift for RDMA, networking, storage

IO allocate on ingress, de-allocate on egress

Combine with integrated scratch pad

Lock critical counters, stats, and tables on-chip

Software configurable cache partitioning

[Chart: relative IO throughput (axis 0 to 2), DDR compared with Agile System Cache]

[Diagram: CoreLink CMN-600 with DMC-620; Cortex-A and IO share the Agile System Cache, which allocates on ingress and reads and invalidates on egress; a scratch pad region of the cache locks critical data]
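
The allocate-on-ingress, de-allocate-on-egress idea can be illustrated with a toy model. This is a sketch only, not the CMN-600 implementation: the class, its fields, and the one-buffer-per-line simplification are all assumptions made for the example.

```python
from collections import deque


class IOSystemCache:
    """Toy model: IO buffers are allocated in the cache on ingress and
    invalidated (freed) when they are read on egress."""

    def __init__(self, capacity_lines):
        self.capacity = capacity_lines
        self.lines = set()
        self.ddr_writes = 0  # buffers that did not fit and went to DDR

    def ingress(self, buf_id):
        if len(self.lines) < self.capacity:
            self.lines.add(buf_id)
        else:
            self.ddr_writes += 1

    def egress(self, buf_id):
        """Read the buffer; return True if it was served from the cache."""
        if buf_id in self.lines:
            self.lines.remove(buf_id)  # read and invalidate
            return True
        return False  # had to come from DDR


def run_stream(cache, n_buffers, in_flight):
    """Stream buffers through the cache with at most `in_flight` pending."""
    pending = deque()
    hits = 0
    for buf_id in range(n_buffers):
        cache.ingress(buf_id)
        pending.append(buf_id)
        if len(pending) >= in_flight:
            hits += cache.egress(pending.popleft())
    while pending:
        hits += cache.egress(pending.popleft())
    return hits


small_window = IOSystemCache(capacity_lines=8)
hits_small = run_stream(small_window, n_buffers=1000, in_flight=4)

large_window = IOSystemCache(capacity_lines=8)
hits_large = run_stream(large_window, n_buffers=1000, in_flight=16)
print(hits_small, small_window.ddr_writes, hits_large, large_window.ddr_writes)
```

Because each buffer frees its line as soon as it is consumed, a long stream never touches DDR in this model as long as the number of buffers in flight fits in the cache. Once the in-flight window exceeds the capacity, some traffic spills.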





Maximizing heterogeneous SoC performance

New working groups for affinity or isolation

Assign cache, bandwidth and memory resources

Flexible assignment, software programmable

Provides predictable multi-application performance

QoS regulation for compute, accelerators, IO

End-to-end regulation from master through memory

Tune for bandwidth, latency or real-time traffic

Intelligent memory scheduling to meet guarantees
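
As an illustration of bandwidth regulation in general, the sketch below uses a token bucket, a common generic technique. It is not presented as the mechanism CoreLink uses; the class, parameter names, and numbers are assumptions made for the example.

```python
class TokenBucket:
    """Generic bandwidth regulator: a master may issue a request only
    when enough credit (in bytes) has accumulated."""

    def __init__(self, rate_bytes_per_cycle, burst_bytes):
        self.rate = rate_bytes_per_cycle
        self.burst = burst_bytes
        self.tokens = burst_bytes  # start with a full bucket

    def tick(self):
        """Advance one cycle, refilling credit up to the burst limit."""
        self.tokens = min(self.burst, self.tokens + self.rate)

    def try_issue(self, size_bytes):
        """Consume credit and return True if the request may go now."""
        if self.tokens >= size_bytes:
            self.tokens -= size_bytes
            return True
        return False


def simulate(bucket, cycles, request_bytes=64):
    """A greedy master that tries to issue one request every cycle."""
    issued = 0
    for _ in range(cycles):
        if bucket.try_issue(request_bytes):
            issued += 1
        bucket.tick()
    return issued * request_bytes


# Hypothetical settings: 16 bytes of credit per cycle, 64-byte burst.
total = simulate(TokenBucket(rate_bytes_per_cycle=16, burst_bytes=64), cycles=1000)
print(total / 1000, "bytes per cycle on average")
```

The greedy master ends up at the configured average rate however often it asks, which is the property a regulator needs in order to keep one requester from starving the others.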

[Diagram: example systems (CPUs, Cortex-A, accelerator, system cache, DMC-620, NIC-450, IO, 100GbE, PCIe) with resources grouped into control plane, data plane, and four virtual machines]




Enterprise-class DDR3/4 memory controller

System optimized

Lowest memory latency and highest bandwidth utilization with efficient QoS

Up to 95% utilization with random traffic

Up to 50% reduction in static pipeline latency

Latest DDR standards

Up to DDR4-3200 memory

DDR3/4 with UDIMM, RDIMM, LRDIMM

Up to 1 TB per channel with 3D stacked DRAM

Secure, reliable, protected data

Advanced Security and RAS

Integrated ARM TrustZone

SECDED or symbol based error correction

End-to-end data path parity protection

Performance comparison to CoreLink DMC-520
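
For a sense of scale, the DDR4-3200 and 95% utilization figures translate into DRAM bandwidth as sketched below. The 64-bit data bus per channel is an assumption (the usual DDR4 DIMM data width, excluding ECC bits), and the eight-channel case takes the upper end of the 1 to 8 controller range; the function name is invented for the example.

```python
def ddr_peak_gb_per_s(mega_transfers_per_s, bus_width_bits=64):
    """Peak bandwidth of one DDR channel in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9


per_channel = ddr_peak_gb_per_s(3200)   # DDR4-3200, assumed 64-bit channel
eight_channels = 8 * per_channel        # eight controllers, one channel each
at_95_percent = 0.95 * eight_channels   # utilization figure from the slide

print(f"per channel:       {per_channel:.1f} GB/s")
print(f"eight channels:    {eight_channels:.1f} GB/s")
print(f"at 95% utilization: {at_95_percent:.2f} GB/s")
```

These are peak and best-case sustained figures for the DRAM interface alone; they say nothing about latency or about traffic that is served from the system cache.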

Multichip interconnect standards




Interconnect standards for different needs

ARM AMBA

The standard for on-chip communication enabling IP portability, creation and re-use

CCIX

Extends the benefits of cache coherency to the multi-chip server node for evolving accelerator and IO use-cases

GenZ

Enables a new data centric computing approach with scalable memory pools at both server node and rack level




CCIX: Extending coherency benefits to multichip

New workloads require more shared data, higher bandwidth and lower latency

Coherency eliminates the software and DMA overhead of transferring data between devices

Free flowing, high frequency AMBA 5 CHI data transfers carried over multichip topologies

Accelerates time to deployment by leveraging existing PCIe transport

IP, electricals, mechanicals and software exist today

Extends top end bandwidth to 25 Gbps

[Diagram: server node with shared address space; a compute node and an accelerator, each with DDR memory, connected by CCIX]




GenZ: A new approach to data access

Data centric computing approach to big data

Interconnect based on memory operations

Eliminates traditional complex, code intensive block based storage software stacks

Storage Class Memory (SCM)

New, emerging non-volatile memory technologies

Latencies closer to traditional DDR than SSDs.

Disaggregated memory at rack scale

Large pool of low latency, volatile and non-volatile memory at the rack scale

Dynamic utilization/allocation lowers TCO

[Diagram: data center rack; server nodes, each with DDR memory and CCIX links, connected over GenZ to pooled memory and storage class memory]




Assemble systems in days with IP tooling

Enables guided intelligent IP configuration, creation and assembly

Ensures system viability with design rule checks (DRCs)

Reduces and converges iterations quickly

[Diagram: ARM IP tooling flow (configure, create, assemble) over systems IP: Cortex-A, accelerator, CoreLink CMN-600, CoreLink DMC-620, CoreSight, TrustZone; coherent backplane with 1-32 clusters]




Accelerate software development

[Diagram: software stack (application & ODP API, Linux kernel, hypervisor, device drivers (UEFI/ACPI)) on an ARM Fixed Virtual Platform (FVP) of the reference system: Cortex-A, accelerator, CoreLink CMN-600, CoreLink DMC-620, CoreSight, TrustZone; coherent backplane with 1-32 clusters]

Reference software stack

Open source device drivers for CoreLink IP

Linux kernel and OS boot ready

Compliant with UEFI, ACPI and Server Base System Architecture (SBSA)

Prototype with fixed virtual platforms

Prototyping model of reference system

Built with ARM Fast Models for IP components

Reference subsystem memory map and registers




Jump start SoC designs

[Diagram: reference system and software stack on an ARM Fixed Virtual Platform (FVP)]

System reference design data

Peta cycles of system validation

Measured RTL industry benchmark reports

Measured area, frequency and power in targeted process nodes




Trusted and proven ARM CoreLink family

ARM CoreLink System IP – silicon proven in billions of devices

CoreLink CMN-600 & DMC-620 applicable to multiple applications

>75 Coherent interconnect licenses

>75 Memory controller licenses

>500 Total interconnect licenses




Build more powerful SoCs – faster

CoreLink CMN-600 Coherent Mesh Network and CoreLink DMC-620 Dynamic Memory Controller

Boost performance

5x more throughput

50% lower latency

Tailor solutions

1 to 32 clusters (128 CPUs)

Mix compute and acceleration

Accelerate deployment

Automated interconnect creation

Software virtual prototyping

The trademarks featured in this presentation are registered and/or unregistered trademarks of ARM Limited (or its subsidiaries) in the EU and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners.

Copyright © 2016 ARM Limited
