© ARM 2016
Boost SoC performance from edge to cloud
ARM® CoreLink™ System IP
Neil Parris, Director, interconnect marketing
China Tech Symposia
Systems and software group, ARM
November 2016
More nodes, new use cases
3.7 exabytes per month
22x bandwidth increase
~30x access nodes
1ms end to end
Intelligent flexible cloud to enable new use cases
Compute
Acceleration
Heterogeneous compute requires coherency
Flexible heterogeneous architecture
Blend compute and acceleration for target solution
Fast, reliable transport to shared memory
Maximize throughput, minimize latency
Coherency simplifies software
Accelerate SoC development and deployment
IP designed, optimized and validated for systems
Cortex-A
ARM IP Tooling
CoreLink Interconnect
CoreLink Controllers
CoreSight
Coherent backplane
TrustZone
Accelerator
…
3rd-generation ARM coherent backplane IP
CoreLink CMN-600 Coherent Mesh Network
CoreLink DMC-620 Dynamic Memory Controller
Optimized for next-generation intelligent connected systems
Build more powerful systems
Boost performance
Up to 5x more throughput
Fastest path to DDR4 memory
Up to 50% latency reduction
Performance at any design point
Up to 32 clusters (128 CPUs)
Frequencies exceeding 2.5GHz
Tailor designs from edge to cloud
>1TB/s bandwidth
Performance comparison to ARM CoreLink CCN and CoreLink DMC-520
Delivering maximum compute density
Figure: relative performance (bar chart, axis 0 to 6) for three systems, with bars labeled 2.5x, 5x throughput and 6x compute
16x Cortex-A57, CoreLink CCN-504, 2x DMC-520
32x Cortex-A72, CoreLink CCN-508, 4x DMC-520
64x Cortex-A72, CoreLink CMN-600, 8x DMC-620
Compute = measured by specint2k6_rate
Throughput = achieved requested bandwidth
Same process node and test conditions.
Fastest path to DDR4 memory
Figure: static latency, Cortex-A72 load-to-use, split into CPU, interconnect + DMC, and DDR PHY + memory, for CoreLink CCN-504 with DMC-520 versus CoreLink CMN-600 with DMC-620 (chart label: 50%)
Same process node and test conditions
Estimated DDR PHY + memory cycles for 3rd party PHY & closed page DRAM
CoreLink CMN-600 configured as 4 cpu cluster to match CoreLink CCN-504
Increase CPU performance
50% backplane latency reduction
High frequency mesh transport
One cycle per mesh cross point
Improved area efficiency
60% more bandwidth for same area
Tailor solutions from edge to cloud
New scalable coherent mesh architecture
CoreLink CMN-600
Custom mesh size and device placement
Agile System Cache with snoop filter
1 to 32 clusters (128 CPUs), mix ARMv8-A CPUs and accelerators
1 to 8 high performance DDR4-3200 controllers
Up to 32 IO coherent subsystems
Coherent Multichip Link, CCIX support
Diagram labels: Cortex-A, Accelerator, Agile System Cache, DMC-620, DDR4-3200, NIC-450, IO, PCIe, 100GbE
Scalable solutions from edge to cloud
                 Data center compute   Access point
Bandwidth        >1 TB/s               20 GB/s
System cache     128MB                 0MB
DDR channels     8                     1
Cortex-A CPUs    128                   1
Automated interconnect generation with ARM CoreLink Creator
Diagram labels, data center compute: CoreLink CMN-600 mesh with 16 CPU blocks, DMC-620 controllers, 100GbE, PCIe
Diagram labels, access point: Cortex-A, Accelerator, System Cache, CoreLink CMN-600, DMC-620, NIC-450, IO
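The two design points and the limits quoted on the slides can be written down as a small configuration check. This is only a sketch: the `DesignPoint` record and `check` function are hypothetical and are not the CoreLink Creator input format, the cluster counts (32 and 1) are not stated in the table, and the limit of 4 CPUs per cluster is inferred from "32 clusters (128 CPUs)".

```python
from dataclasses import dataclass
from typing import List

# Limits quoted on the slides for CoreLink CMN-600 based systems.
MAX_CLUSTERS = 32
MAX_CPUS = 128
MAX_DDR_CHANNELS = 8
# Assumption: inferred from "32 clusters (128 CPUs)".
CPUS_PER_CLUSTER = 4


@dataclass(frozen=True)
class DesignPoint:
    """Hypothetical record of one system configuration."""
    name: str
    clusters: int
    cpus: int
    ddr_channels: int
    system_cache_mb: int


def check(dp: DesignPoint) -> List[str]:
    """Return rule violations; an empty list means the design point fits."""
    errors = []
    if not 1 <= dp.clusters <= MAX_CLUSTERS:
        errors.append(f"{dp.name}: clusters must be 1..{MAX_CLUSTERS}")
    cpu_limit = min(MAX_CPUS, dp.clusters * CPUS_PER_CLUSTER)
    if not 1 <= dp.cpus <= cpu_limit:
        errors.append(f"{dp.name}: cpus must be 1..{cpu_limit}")
    if not 1 <= dp.ddr_channels <= MAX_DDR_CHANNELS:
        errors.append(f"{dp.name}: DDR channels must be 1..{MAX_DDR_CHANNELS}")
    if dp.system_cache_mb < 0:
        errors.append(f"{dp.name}: system cache size cannot be negative")
    return errors


# Values from the table; the cluster counts are assumptions.
DATA_CENTER = DesignPoint("data center compute", clusters=32, cpus=128,
                          ddr_channels=8, system_cache_mb=128)
ACCESS_POINT = DesignPoint("access point", clusters=1, cpus=1,
                           ddr_channels=1, system_cache_mb=0)

if __name__ == "__main__":
    for dp in (DATA_CENTER, ACCESS_POINT):
        print(dp.name, check(dp) or "ok")
```

Both table columns pass the check, while a request such as 64 clusters or 9 DDR channels is reported as a violation, which is the kind of result a design rule check would be expected to give.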
Why a mesh topology?
Interconnect capabilities scale with
system size
Naturally add links, wires, cross point
routers with resources
Mesh cross sectional bandwidth scales
by N vs 1 for a ring topology
Mesh latency scales by √N vs N for a
ring topology
Figure: bandwidth scaling comparison, CoreLink CCN family versus CoreLink CMN-600. Achieved bandwidth (GB/s, axis 0 to 1200) against number of CPU clusters (0 to 32)
Achieved coherent bandwidth as observed by requestors
Same process node and test conditions
CoreLink CMN-600
DMC-620
Innovations to increase throughput
Intelligent cache allocation
Throughput uplift for RDMA, networking, storage
IO allocate on ingress, de-allocate on egress
Combine with integrated scratch pad
Lock critical counters, stats, and tables on-chip
Software configurable cache partitioning
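A toy model helps show why allocating IO data on ingress and invalidating it on egress can lift IO throughput: the packet never has to travel to DDR and back. The sketch below is an illustration under strong assumptions (an unbounded cache, one CPU read per packet, a hypothetical `ToySystemCache` class) and does not describe how the Agile System Cache is actually implemented.

```python
class ToySystemCache:
    """Counts DDR accesses for a packet flow, with or without IO allocation."""

    def __init__(self, allocate_io: bool):
        self.allocate_io = allocate_io
        self.lines = set()      # addresses currently held in the cache
        self.ddr_accesses = 0   # reads and writes that reached DDR

    def io_ingress_write(self, addr: int) -> None:
        if self.allocate_io:
            self.lines.add(addr)        # allocate on ingress
        else:
            self.ddr_accesses += 1      # packet is written to DDR

    def cpu_read(self, addr: int) -> None:
        if addr not in self.lines:
            self.ddr_accesses += 1      # miss: fetch the packet from DDR
            self.lines.add(addr)

    def io_egress_read(self, addr: int) -> None:
        if addr not in self.lines:
            self.ddr_accesses += 1      # miss: fetch the packet from DDR
        elif self.allocate_io:
            self.lines.discard(addr)    # read and invalidate on egress


def run(packets: int, allocate_io: bool) -> ToySystemCache:
    """Push `packets` packets through ingress, CPU processing and egress."""
    cache = ToySystemCache(allocate_io)
    for addr in range(packets):
        cache.io_ingress_write(addr)
        cache.cpu_read(addr)
        cache.io_egress_read(addr)
    return cache


if __name__ == "__main__":
    for policy in (False, True):
        result = run(100, allocate_io=policy)
        print(f"allocate_io={policy}: {result.ddr_accesses} DDR accesses, "
              f"{len(result.lines)} lines left in cache")
```

In this model the invalidate on egress also leaves the cache empty once a packet has left, so transient IO data does not displace other lines.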
Figure: relative IO throughput (axis 0 to 2), DDR versus Agile System Cache
Diagram labels: Cortex-A, IO, allocate on ingress, read and invalidate on egress, scratch pad, cache, lock critical data
Maximizing heterogeneous SoC performance
New working groups for affinity or isolation
Assign cache, bandwidth and memory resources
Flexible assignment, software programmable
Provides predictable multi-application performance
QoS regulation for compute, accelerators, IO
End-to-end regulation from master through memory
Tune for bandwidth, latency or real-time traffic
Intelligent memory scheduling to meet guarantees
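The slides do not say how bandwidth regulation is implemented. One common technique for capping a requester's bandwidth is a token bucket, sketched below purely as an illustration; the class, its parameters and the cycle-based simulation are all assumptions and not a description of the CoreLink QoS hardware.

```python
class TokenBucket:
    """Illustrative bandwidth regulator: tokens are bytes the master may send."""

    def __init__(self, rate_bytes_per_cycle: float, burst_bytes: float):
        self.rate = rate_bytes_per_cycle
        self.burst = burst_bytes
        self.tokens = burst_bytes       # start with a full bucket

    def tick(self) -> None:
        """Advance one cycle, refilling up to the burst limit."""
        self.tokens = min(self.burst, self.tokens + self.rate)

    def try_send(self, nbytes: int) -> bool:
        """Grant the request only if enough tokens are available."""
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False


def simulate(bucket: TokenBucket, request_bytes: int, cycles: int) -> int:
    """A greedy master tries to send every cycle; return the bytes granted."""
    sent = 0
    for _ in range(cycles):
        bucket.tick()
        if bucket.try_send(request_bytes):
            sent += request_bytes
    return sent


if __name__ == "__main__":
    regulated = simulate(TokenBucket(16.0, 64.0), request_bytes=64, cycles=1000)
    unregulated = simulate(TokenBucket(64.0, 64.0), request_bytes=64, cycles=1000)
    print("regulated bytes:", regulated)
    print("unregulated bytes:", unregulated)
```

The refill rate sets the long-run bandwidth share and the burst size sets how much latency-sensitive traffic can be sent back to back, which is one way to trade bandwidth against latency per master.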
Diagram labels: data center mesh (16 CPU blocks, DMC-620 controllers, 100GbE, PCIe) and access point system (Cortex-A, Accelerator, System Cache, DMC-620, NIC-450, IO), annotated with control plane, data plane and four virtual machines
Enterprise-class DDR3/4 memory controller
System optimized
Lowest memory latency and highest bandwidth utilization with efficient QoS
Up to 95% utilization with random traffic
Up to 50% reduction in static pipeline latency
Latest DDR standards
Up to DDR4-3200 memory
DDR3/4 with UDIMM, RDIMM, LRDIMM
Up to 1 TB per channel with 3D stacked DRAM
Secure, reliable, protected data
Advanced Security and RAS
Integrated ARM TrustZone
SECDED or symbol based error correction
End-to-end data path parity protection
Performance comparison to CoreLink DMC-520
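To put the DDR4-3200 and 95% utilization figures in context, peak DRAM bandwidth per channel is the transfer rate multiplied by the data bus width. The sketch below assumes a 64-bit (8 byte) data bus per channel, excluding ECC bits, which is the usual DIMM width but is not stated on the slide; the function names are illustrative.

```python
def ddr_peak_gb_s(mega_transfers_per_s: int, bus_bytes: int = 8) -> float:
    """Peak bandwidth of one channel in GB/s (1 GB = 10**9 bytes)."""
    return mega_transfers_per_s * bus_bytes / 1000.0


def system_gb_s(channels: int, utilization: float,
                mega_transfers_per_s: int = 3200) -> float:
    """Sustained bandwidth across all channels at the given utilization."""
    return channels * utilization * ddr_peak_gb_s(mega_transfers_per_s)


if __name__ == "__main__":
    per_channel = ddr_peak_gb_s(3200)
    print(f"DDR4-3200 peak per channel: {per_channel:.1f} GB/s")
    print(f"8 channels at 95% utilization: {system_gb_s(8, 0.95):.1f} GB/s")
```

Under this assumption one DDR4-3200 channel peaks at 25.6 GB/s, so eight channels at 95% utilization sustain a little under 195 GB/s of DRAM bandwidth.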
Multichip interconnect standards
Interconnect standards for different needs
ARM AMBA
The standard for on-chip communication enabling IP
portability, creation and re-use
CCIX
Extends the benefits of cache coherency to the multi-chip
server node for evolving accelerator and IO use-cases
GenZ
Enables a new data centric computing approach with scalable
memory pools at both server node and rack level
CCIX: Extending coherency benefits to multichip
New workloads require more shared data, higher
bandwidth and lower latency
Coherency eliminates the software and DMA
overhead of transferring data between devices
Free flowing, high frequency AMBA 5 CHI data
transfers, transferred over multichip topologies
Accelerates time to deployment by leveraging
existing PCIe transport
IP, electricals, mechanicals and software exist today
Extends top end bandwidth to 25Gbps
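A rough calculation shows what the 25Gbps figure could mean for a whole link. The sketch assumes the figure is a per-lane signalling rate, a 16 lane link, and the 128b/130b line encoding used by recent PCIe generations; none of these are stated on the slide, and PCIe Gen4 at 16 GT/s per lane is included only as a point of comparison.

```python
def link_gb_s(gt_per_s: float, lanes: int, encoding: float = 128 / 130) -> float:
    """Usable bandwidth of one link direction in GB/s, before protocol overhead."""
    return gt_per_s * lanes * encoding / 8


if __name__ == "__main__":
    pcie_gen4 = link_gb_s(16, lanes=16)   # comparison point
    ccix_25g = link_gb_s(25, lanes=16)    # assumes 25 Gbps is per lane
    print(f"x16 at 16 GT/s: {pcie_gen4:.1f} GB/s per direction")
    print(f"x16 at 25 GT/s: {ccix_25g:.1f} GB/s per direction")
    print(f"uplift: {ccix_25g / pcie_gen4:.2f}x")
```

Packet headers and flow control reduce the achievable figure further, so these numbers are upper bounds under the stated assumptions.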
Diagram: server node with shared address space. Labels: compute node, accelerator, DDR memory, CCIX
GenZ: A new approach to data access
Data centric computing approach to big data
Interconnect based on memory operations
Eliminates traditional complex, code intensive block
based storage software stacks
Storage Class Memory (SCM)
New, emerging non-volatile memory technologies
Latencies closer to traditional DDR than SSDs.
Disaggregated memory at rack scale
Large pool of low latency, volatile and non-volatile
memory at the rack scale
Dynamic utilization/allocation lowers TCO
Diagram: data center rack. Labels: server node (x2), DDR memory, CCIX, pooled memory, storage class memory (x4), GenZ
Accelerating system deployment
Assemble systems in days with IP tooling
Enables guided intelligent IP configuration, creation and assembly
Ensures system viability with design rule checks (DRCs)
Reduces and converges iterations quickly
Diagram: ARM IP tooling (configure, create, assemble) over systems IP. Labels: Cortex-A (1-32 clusters), Accelerator, CoreLink CMN-600, CoreLink DMC-620, CoreSight, TrustZone, coherent backplane
Accelerate software development
Reference software stack
Open source device drivers for CoreLink IP
Linux kernel and OS boot ready
Compliant with UEFI, ACPI and Server Base System Architecture (SBSA)
Prototype with fixed virtual platforms
Prototyping model of reference system
Built with ARM Fast Models for IP components
Reference subsystem memory map and registers
Diagram labels: Application & ODP API, Linux Kernel, Hypervisor, Device Drivers (UEFI/ACPI), ARM Fixed Virtual Platform (FVP), ARM IP tooling, Cortex-A (1-32 clusters), Accelerator, CoreLink CMN-600, CoreLink DMC-620, CoreSight, TrustZone, coherent backplane
Jump start SoC designs
System reference design data
Peta cycles of system validation
Measured RTL industry benchmark reports
Measured area, frequency and power in targeted process nodes
Trusted and proven ARM CoreLink family
ARM CoreLink System IP – silicon proven in billions of devices
CoreLink CMN-600 & DMC-620 applicable to multiple applications
>500 total interconnect licenses
>75 coherent interconnect licenses
>75 memory controller licenses
Build more powerful SoCs – faster
CoreLink CMN-600 Coherent Mesh Network and CoreLink DMC-620 Dynamic Memory Controller
Boost performance
5x more throughput
50% lower latency
Tailor solutions
1 to 32 clusters (128 CPUs)
Mix compute and acceleration
Accelerate deployment
Automated interconnect creation
Software virtual prototyping
The trademarks featured in this presentation are registered and/or unregistered trademarks of ARM Limited
(or its subsidiaries) in the EU and/or elsewhere. All rights reserved. All other marks featured may be
trademarks of their respective owners.
Copyright © 2016 ARM Limited