balance, flexibility, and partnership: an arm approach to future hpc node architectures

ARM Research – Software & Large Scale Systems

Node Architecture: From Present Technology to Future Exascale Nodes

Balance, Flexibility, and Partnership: An ARM Approach to Future HPC Node Architectures

Eric Van Hensbergen, Senior Principal Research Engineer


ARM Background



ARM: Architecture



ARM: Microarchitecture


Low-power processing solutions for applications, real-time/control and microcontroller end markets

Scalable roadmap for application efficient computing

Software compatibility across a diverse application range


ARM: GPGPUs


Bringing visual computing to life

Combining the best of the CPU and GPU

Putting massive amounts of processing power into the hands of the application developer


ARM: Supporting IP


System performance with power efficiency

Enabling distributed processing with scalable architectures

Simplifying software elements through hardware coherency


ARM: Optimizations for Foundries


Advanced physical IP tuned for a specific foundry and process technology

Artisan Physical IP offered for more than 100 processes from 250nm to 20nm: broadest coverage in the industry

POP IP for ARM Cortex processors and Mali GPUs delivers fast time to market, low risk, and leadership performance


ARM: Software Tools and Energy Efficient Platforms


The broad ARM software ecosystem is continually advancing and evolving

Optimized software solutions enable increased system efficiency


ARM: Segments


Internet of Things, Embedded, Mobile, Laptops, Enterprise, Networking, Supercomputing


ARM: Business Model

[Diagram: ARM, SemiPartner, OEM, Customer; flows labelled licence, royalty, IP, chips]

ARM invests in ecosystem

Ecosystem provides value chain with support & products based on ARM technology


Exascale Challenge: Power Efficiency


Top 500 MFLOPs/W over time

[Chart: x-axis Nov-11 to May-15; y-axis 0 to 4500 MFLOPs/W; series: #1, #500, AVG, MIN, MAX]
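To make the power-efficiency challenge concrete, the sketch below converts an efficiency in MFLOPs/W into the power needed to sustain one exaflop. The efficiency values are hypothetical points chosen for illustration (the first three fall within the chart's axis range); none of them are measurements from the Top 500 list, and the function name is likewise an assumption.

```python
# Power needed to sustain a target FLOP rate at a given energy efficiency.
# The efficiency values below are illustrative assumptions, not Top 500 data.

EXAFLOP = 1e18  # floating-point operations per second


def power_megawatts(mflops_per_watt: float, target_flops: float = EXAFLOP) -> float:
    """Return the power in MW needed to sustain target_flops."""
    if mflops_per_watt <= 0:
        raise ValueError("efficiency must be positive")
    flops_per_watt = mflops_per_watt * 1e6
    watts = target_flops / flops_per_watt
    return watts / 1e6


for efficiency in (500, 2000, 4500, 50000):
    megawatts = power_megawatts(efficiency)
    print(f"{efficiency:>6} MFLOPs/W -> {megawatts:8.1f} MW per exaflop")
```

Under this arithmetic, 500 MFLOPs/W implies 2000 MW per exaflop, while 50000 MFLOPs/W brings the figure down to 20 MW.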


Maximizing Throughput Density: per mm2, per Watt

[Chart: 20 Thread Workload, relative performance (Spec2K6 rate), y-axis 0 to 1.2. Bars: Xeon-E5 2650 V3 (10 cores 20 threads, 105W*); Cortex-A57 (20 cores 20 threads, <30W); Cortex-A72 (20 cores 20 threads, <30W); Xeon-E5 2660 V3 (10 cores 20 threads, 105W*). Frequency labels on the chart: 2.7GHz, 2.5GHz.]

Comparison for equivalent number of threads. Platforms used:

• Xeon-E5 2660 10C20T platform (measured)
• Xeon-E5 2650 10C20T platform (measured)
• GCC compiler v4.9 with -O3 flag
• TDP rating source: ark.intel.com

Estimated result on example 20C ARM Cortex platforms with CCN-508, 28MB total L2+L3 cache:

• Per-core measurements on RTL with relevant memory system
• GCC compiler v4.9 with -O3 flag
• Scaled to 20T based on modelled and empirical results
• Power estimated in 16nm based on ARM internal implementations for entire CPU+interconnect complex including 20xCPU, CCN-508, L2+L3 caches
• Actual results on silicon platforms may vary

ARM Solution Benefits:

• Less than 1/3rd the power for equivalent performance*
• Allows power headroom for specialized computing or greater thread density

* A portion of Intel TDP power will be consumed by IO. The Cortex-A72 and Cortex-A57 estimates exclude IO power.
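The "less than 1/3rd the power" figure can be checked against the numbers quoted on the slide. The short sketch below takes the 105W Intel TDP and treats the "<30W" ARM estimate as exactly 30W, which is an assumption that gives an upper bound on the ratio; it also takes the slide's "equivalent performance" at face value and ignores the IO caveat in the footnote.

```python
from fractions import Fraction

XEON_TDP_W = 105     # TDP quoted on the slide (source: ark.intel.com)
ARM_EST_MAX_W = 30   # slide quotes "<30W"; 30 is used here as an upper bound

# Exact ratio of the ARM estimate to the Xeon TDP.
ratio = Fraction(ARM_EST_MAX_W, XEON_TDP_W)

print(f"ARM / Xeon power ratio is at most {float(ratio):.3f}")
print("below one third:", ratio < Fraction(1, 3))
```

Using `Fraction` keeps the comparison with one third exact rather than relying on floating-point rounding.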

Cortex-A72: Ideal for dense compute environments

• Single Broadwell CPU + 256K L2 [1]: ~8mm2
• Cortex-A72 MP4 + 2MB L2 [3]: ~8mm2
• Single Cortex-A72 core [2]: ~1.15mm2
• Cortex-A72 is <20% of the size
• A quad core Cortex-A72 with 8x the L2 cache RAM is the same size

[1] Source: Estimated from die-shot image provided by Intel at IDF 2014.
[2][3] Source: ARM trial implementations on TSMC 16FF+, using ARM Artisan libraries
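The area claims on this slide reduce to two ratios, sketched below from the approximate figures quoted above. Reading "<20% size" as a single Cortex-A72 core compared with the Broadwell CPU plus its L2 is an assumption, as are the constant names.

```python
# Approximate areas quoted on the slide (all values are "~" estimates).
BROADWELL_CPU_L2_MM2 = 8.0   # single Broadwell CPU + 256K L2
A72_CORE_MM2 = 1.15          # single Cortex-A72 core
A72_MP4_2MB_L2_MM2 = 8.0     # quad-core Cortex-A72 + 2MB L2

# Single A72 core relative to the Broadwell CPU + L2 (assumed comparison).
core_ratio = A72_CORE_MM2 / BROADWELL_CPU_L2_MM2

# L2 capacity of the quad-core A72 cluster relative to the Broadwell L2.
l2_ratio = (2 * 1024) / 256  # 2MB versus 256K

print(f"A72 core is {core_ratio:.1%} of the Broadwell CPU + L2 area")
print(f"A72 MP4 cluster carries {l2_ratio:.0f}x the L2 in the same ~8mm2")
```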


Reminder: Embedded SoC in HPC is not a new concept


Top 500 RMAX/Core

[Chart: x-axis Nov-11 to May-15; y-axis 0 to 120; series: #1, #500, AVG, MIN, MAX]


Objectives

To develop a full energy-efficient HPC prototype using low-power commercially available embedded technology.

To develop a portfolio of exascale applications to be run on this new generation of HPC systems.

To design a next-generation HPC system together with a range of embedded technologies in order to overcome the limitations identified in the prototype system.

Mont-Blanc

MB Prototype installed in the Torre Girona chapel @ BSC

Status

Prototype operational: 8 standard BullX chassis, 72 compute blades, 1080 compute cards, 2160 ARM Cortex-A15 processors, 1080 ARM Mali-T604 GPUs.

11 scientific applications ported and in use for evaluation of the prototype.

Research ongoing into areas such as memory, on-chip and off-chip interconnect, and compute acceleration.
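The prototype counts above imply a regular packaging hierarchy. The sketch below derives the per-level ratios from the quoted totals; it assumes the components are spread evenly across chassis, blades and cards, which the slide does not state explicitly, and the helper name is hypothetical.

```python
# Totals quoted for the Mont-Blanc prototype.
CHASSIS = 8
BLADES = 72
CARDS = 1080
CPUS = 2160   # ARM Cortex-A15
GPUS = 1080   # ARM Mali-T604


def per_group(total: int, groups: int) -> int:
    """Divide total evenly across groups, failing loudly if it does not fit."""
    quotient, remainder = divmod(total, groups)
    if remainder:
        raise ValueError(f"{total} does not divide evenly into {groups} groups")
    return quotient


blades_per_chassis = per_group(BLADES, CHASSIS)
cards_per_blade = per_group(CARDS, BLADES)
cpus_per_card = per_group(CPUS, CARDS)
gpus_per_card = per_group(GPUS, CARDS)

print(f"{blades_per_chassis} blades per chassis")
print(f"{cards_per_blade} compute cards per blade")
print(f"{cpus_per_card} Cortex-A15 and {gpus_per_card} Mali-T604 per card")
```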


The Energy Efficient Computing Research Programme has been established through a £19M capital grant from the Department of Business Innovation and Skills to establish a centre of best practice in the UK that will enable users of computer systems to achieve the same outcomes while minimising the consumption of energy.

The Hartree Centre, Science & Technology Facilities Council, UK

“This is a fantastic opportunity to meet the challenge of developing a computationally powerful and energy-efficient platform based on the 64-bit ARM v8 microprocessor … The Hartree centre will be actively developing a robust software ecosystem encompassing compilers, linkers, numerical libraries and tools – all of which are fundamental to the adoption of these types of technologies.”

Lenovo are providing a NeXtScale system: 1,152 64-bit Cavium ThunderX ARM cores in 6U.

http://www.theplatform.net/2015/02/27/prototype-arm-clusters-muscle-hpc/


Balance: One Size Core Doesn’t Fit All


Top 500 Efficiency over Time (RMAX/RPEAK)

[Chart: x-axis Nov-11 to May-15; y-axis 0% to 120%; series: #1, #500, AVG, MIN, MAX; annotation: HPCG (1.8%-4.07%) (1.8%-10%)]
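The efficiency metric in the chart title is the ratio RMAX/RPEAK. The sketch below computes it for a made-up system; all three performance figures are hypothetical, and expressing the HPCG result as a fraction of RPEAK is an assumption about how the annotated percentages were derived.

```python
def fraction_of_peak(achieved: float, rpeak: float) -> float:
    """Return achieved performance as a fraction of theoretical peak."""
    if rpeak <= 0:
        raise ValueError("rpeak must be positive")
    return achieved / rpeak


# Hypothetical system, all values in TFLOPs (not taken from any real list entry).
RPEAK = 1000.0  # theoretical peak
RMAX = 750.0    # measured HPL performance
HPCG = 30.0     # measured HPCG performance

hpl_efficiency = fraction_of_peak(RMAX, RPEAK)
hpcg_efficiency = fraction_of_peak(HPCG, RPEAK)

print(f"HPL efficiency (RMAX/RPEAK): {hpl_efficiency:.1%}")
print(f"HPCG as a fraction of RPEAK: {hpcg_efficiency:.1%}")
```

The gap between the two fractions is the point of the slide: a node balanced for dense linear algebra can still deliver only a few percent of peak on a memory-bound benchmark.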


Seeking Balance: FastForward II

ARM Focus Areas

• Evaluation of next-generation architecture in the context of DoE applications
• Evaluation of throughput and multithreaded core designs for HPC
• Next generation memory technologies
• Design study to find right balance of core types, memory, and interconnect
• Development and integration of full system simulation technology with other partners
• Workload characterization and optimization for ARM architecture

https://asc.llnl.gov/fastforward/


Flexibility


Current ARM Microarchitectural Flavors

• Big Cores
  • Performance optimized cores
  • Pro: High single thread performance
  • Challenge: Higher power and larger area

• Little Cores
  • Efficiency optimized cores
  • Pro: Lowest energy
  • Challenge: Requires massive concurrency to yield performance

• GPU/Throughput Accelerator
  • Highly specialized processors adapted from gaming/graphics market space
  • Pro: Extremely dense performance
  • Challenge: Productive programmability
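The trade-off between big and little cores can be illustrated with a simple Amdahl's-law model. This model is not from the slides; it is a minimal sketch in which the core counts and relative per-core speeds are hypothetical values standing in for a fixed area or power budget.

```python
# Amdahl's-law comparison of a few big cores against many little cores.
# All core counts and relative speeds are hypothetical, chosen for illustration.


def run_time(parallel_fraction: float, cores: int, core_speed: float) -> float:
    """Normalised run time: serial part on one core, parallel part over all cores."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if cores < 1 or core_speed <= 0:
        raise ValueError("cores must be >= 1 and core_speed must be positive")
    serial_fraction = 1.0 - parallel_fraction
    return (serial_fraction + parallel_fraction / cores) / core_speed


BIG = {"cores": 4, "core_speed": 1.0}      # assumed big-core configuration
LITTLE = {"cores": 16, "core_speed": 0.4}  # assumed little-core configuration

for p in (0.50, 0.90, 0.99):
    big_time = run_time(p, **BIG)
    little_time = run_time(p, **LITTLE)
    winner = "big" if big_time < little_time else "little"
    print(f"parallel fraction {p:.2f}: big={big_time:.3f} "
          f"little={little_time:.3f} -> {winner} cores finish first")
```

With these assumed parameters the little-core configuration wins only at the highest parallel fraction tried, which is one way to read "requires massive concurrency to yield performance".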

What class of ARM IP?

RESEARCH & DEVELOPMENT

Source: HotChips 2014


Source: Broadcom presentation at IDC HPC User Forum, April 7, 2014


Cavium ThunderX

• Up to 48 custom ARMv8-A cores @ 2.5GHz
• 1S and 2S configuration
• Up to 4x72 bit DDR3/4 Memory Controllers
• Family Specific I/Os
• Standards based low latency Ethernet fabric
• virtSOC™: Virtualization from Core to I/O
• Family Specific Accelerators: Storage/Networking/Compute/Security

The benefits of this Workload Specific approach:

• Efficiency (performance, latency, power, and scalability)
• Best in Class Optimized solution for the specific workload

[Diagram labels: Fully Virtualized; Networking; Storage Controller; Accelerators; Optimized Power; Lower Cost; Security; Virtualized Network & Storage; Storage & Analytics Accelerator; High Speed Network; ARM 64bit Processor; Security Accelerator; Network Accelerator]

[Image: ThunderX 2S Reference Platform]


One size core doesn’t fit all, but one architecture can.


Partnership


Challenges


Challenges for Exascale

• Memory bandwidth
• But… with low memory latency
• And… with low cost
• But what about… data movement costs
• Turning solutions to the above into commodity technology, so that price is not the primary barrier to Exascale.


Challenge: Ecosystem


ARM Math Libraries

In November 2015, we plan to offer a commercially supported set of 64-bit ARMv8 numerical libraries for scientific computing, built on technology from NAG.

• Enable ARM partners’ computational kernels tuned for their SoC implementation.
• Unified, validated framework.
• A57, A72 and Cavium® ThunderX optimizations available at launch date, others to follow.
• All implementations hosted on arm.com

By the end of 2015, an HPC-specific ARM microsite will offer downloads, technical reference material, how-to-guides and third-party software recommendations for the scientific computing community.

2015 Focus: BLAS, LAPACK, FFT


Compilers

Commercial

• PathScale (Alpha): C, C++, Fortran; OpenMP 4.0
• NAG (Alpha): Fortran; OpenMP 3.1

Open-Source

• GCC: C, C++, Fortran; OpenMP 4.0
• LLVM: C, C++; OpenMP 3.1

November 2014: PathScale provides the full EKOPath compiler suite including OpenACC and OpenMP 4.0 C/C++/Fortran support for ARMv8 to support HPC and Enterprise customers exploring the power efficiencies of these devices. As an enabling technology, EKOPath gives our customers the ability to compile for native ARMv8 CPU or accelerated architectures that return the fastest time to solution. Your application defines the benchmark, EKOPath lets you evaluate the new architecture with your code, across either Intel64/AMD64 and now directly compare it against the performance of enterprise ready ARMv8 processors.

November 2013: The Numerical Algorithms Group (NAG), the global numerical software and HPC services company, announces a new technical collaboration with ARM®, the world's leading semiconductor IP supplier. NAG's highly skilled team of HPC experts, numerical analysts and computer scientists will ensure the algorithms in the NAG Numerical Library and the facilities of the NAG FORTRAN Compiler are available for use on ARM's 64-bit ARMv8-A architecture-based platforms.

• Open-source focus on AArch64 correctness up to 2014.
• Now improving core performance through mostly architectural (not microarchitectural) optimisations.
• Command-line enablement for new ARM cores (e.g. A72).
• Most focus and improvement in floating-point code.

Current work:
• Improvements for big-endian ARM.
• Floating-point rounding mode optimization.
• Making use of more sophisticated ARM instructions.
• Scheduler / register allocation improvements.
• Improved memcpy, memset, glibc string routines.
• Improved performance on NEON intrinsics.

Current work:
• Vectorizer improvement.
• Loop unrolling/interleaving.
• Improved register allocation.
• ABI conformance.
• Improve inliner heuristics.
• Scheduling for Cortex-A57.
• Software pipelining.
• Jump threading.

http://www.pathscale.com/ARMv8

http://www.nag.co.uk/market/articles/nag-to-broaden-64-bit-armv8-a-ecosystem

http://www.nag.com/frontpage



Research

• Co-Design
  • Workload optimizations and characterization for HPC & big data
  • Architectural & system design sensitivity sweeps for performance & energy
  • Simulation and modeling infrastructure

Architecture

• ARM Architecture Partner Engagements
  • Evolve architecture envelope allowing partners to better accommodate requirements of HPC and Data Intensive Computing
  • Improved support for massive concurrency

Ecosystem

• Software Ecosystem Enablement
  • Operating systems and runtimes targeted and optimized for ARM HPC
  • Optimized math library enablement of ARM architecture
  • Parallel and vector optimizing compilers and runtimes
  • Cross-stack optimizations for resiliency and energy efficiency

Microarchitecture

• Broader ARM Partner Engagement
  • Higher performance core designs with increased computational throughput
  • Decreased memory latency and increased bandwidth
  • Multi-thread optimized cores


Thanks

We are growing the HPC research team and have several entry-level positions open for PhDs. Come talk to me if you are interested or apply directly: http://goo.gl/re11oi
