parallelism and energy-efficient gpus from mobile to...

38
Parallelism and Energy-efficient GPUs from Mobile to Supercomputers Jem Davies ARM Fellow,VP of Technology Media Processing Division 24-Feb-13

Upload: letram

Post on 07-May-2018

220 views

Category:

Documents


4 download

TRANSCRIPT

Page 1: Parallelism and Energy-efficient GPUs from Mobile to Supercomputersprism.sejong.ac.kr/download/p01_Jem_Davies.pdf ·  · 2018-02-05Parallelism and Energy-efficient GPUs from Mobile

Parallelism and Energy-efficient GPUs from Mobile to Supercomputers

Jem Davies

ARM Fellow, VP of Technology Media Processing Division

24-Feb-13

Page 2: Parallelism and Energy-efficient GPUs from Mobile to Supercomputersprism.sejong.ac.kr/download/p01_Jem_Davies.pdf ·  · 2018-02-05Parallelism and Energy-efficient GPUs from Mobile

First, some background…

2

Page 3: Parallelism and Energy-efficient GPUs from Mobile to Supercomputersprism.sejong.ac.kr/download/p01_Jem_Davies.pdf ·  · 2018-02-05Parallelism and Energy-efficient GPUs from Mobile

The Eras of Computing

1960 1970 1980 1990 2000 2010 2020

Uni

ts

1M 10M Mainframe

Mini

1st Era

100M

1 Billion PC

Desktop Internet

2nd Era

100 Billion The Internet of Things

10 Billion Mobile Internet

3

Page 4: Parallelism and Energy-efficient GPUs from Mobile to Supercomputersprism.sejong.ac.kr/download/p01_Jem_Davies.pdf ·  · 2018-02-05Parallelism and Energy-efficient GPUs from Mobile

The Eras of Computing

1960 1970 1980 1990 2000 2010 2020

Uni

ts

1M 10M Mainframe

Mini

1st Era

100M

1 Billion PC

Desktop Internet

2nd Era

100 Billion The Internet of Things

10 Billion Mobile Internet

4

Page 5: Parallelism and Energy-efficient GPUs from Mobile to Supercomputersprism.sejong.ac.kr/download/p01_Jem_Davies.pdf ·  · 2018-02-05Parallelism and Energy-efficient GPUs from Mobile

Merging Our Digital and Physical Worlds ARM® technology and the ARM ecosystem scale from sensors to servers, connecting our world, from the Internet of Things to supercomputers

5

ARM’s partners make SoCs containing heterogeneous compute engines

8.7 billion ARM-based chips in 2012, from <50 cents to >$200

We’ve always been in the multicore business

Page 6: Parallelism and Energy-efficient GPUs from Mobile to Supercomputersprism.sejong.ac.kr/download/p01_Jem_Davies.pdf ·  · 2018-02-05Parallelism and Energy-efficient GPUs from Mobile

From 1mm3 to 1km3 Battery Solar Cells

Processor, SRAM and PMU

8.75mm3 platform solar cell 0.18µm Cortex™-M3 12µAh Li-ion battery

University of Michigan

1mm3 platform

Work supported by the National Science Foundation and University of Wisconsin-Madison

4200 ARM powered Neutrino Detectors 70 bore holes 2.5km deep 60 detectors per bore hole

1km3 platform

6

Page 7: Parallelism and Energy-efficient GPUs from Mobile to Supercomputersprism.sejong.ac.kr/download/p01_Jem_Davies.pdf ·  · 2018-02-05Parallelism and Energy-efficient GPUs from Mobile

Mobility Is Reshaping Behaviour

Symantec “2012 State of Mobility Survey,” BI intelligence 2012

Guohe, 2012

71% of enterprises

plan to establish own store

Tablets outsell

laptop PCs 60% Facebook users

access from mobile

IDG, Pew Research

79% Of iPad users use exclusively

for Internet

1 billion smartphones will ship in 2013

50% of mobile computing devices in 2013 will be ARM-based

36% Chinese

Android™ users spend 2+ hrs a day on mobile

apps

7

Page 8: Parallelism and Energy-efficient GPUs from Mobile to Supercomputersprism.sejong.ac.kr/download/p01_Jem_Davies.pdf ·  · 2018-02-05Parallelism and Energy-efficient GPUs from Mobile

8

Pushing the Boundaries of Mobile Computing

1080p HD video and

beyond, multiple streams, HEVC

Live image and scene recognition

for augmented reality

Intuitive gesture and natural

language input

Video & image editing, manipulation, and rendering

Console- quality gaming

experience

Facial recognition and unlock

Delivering tomorrow’s features and capabilities 2.5k/4k displays,

multiple simultaneous

displays

Multi-tasking and multi- window

productivity

Page 9: Parallelism and Energy-efficient GPUs from Mobile to Supercomputersprism.sejong.ac.kr/download/p01_Jem_Davies.pdf ·  · 2018-02-05Parallelism and Energy-efficient GPUs from Mobile

9

Interactive Mobile Devices Drive Graphics

Graphics capability is a key factor in consumer purchasing decisions

Rich graphics is a priority for anything with a screen Smartphones, DTVs, STBs, Tablets,

hand-held games consoles, netbooks In-car infotainment

In the post-PC era, data from the cloud and graphically-rich, interactive, connected devices means a growing market for GPUs 4 billion internet-connected screens in 2016, most with

embedded graphics

Page 10: Parallelism and Energy-efficient GPUs from Mobile to Supercomputersprism.sejong.ac.kr/download/p01_Jem_Davies.pdf ·  · 2018-02-05Parallelism and Energy-efficient GPUs from Mobile

Solving the Graphics Market Challenges

10

Page 11: Parallelism and Energy-efficient GPUs from Mobile to Supercomputersprism.sejong.ac.kr/download/p01_Jem_Davies.pdf ·  · 2018-02-05Parallelism and Energy-efficient GPUs from Mobile

Performance/Power is the Big Challenge

Requirements on the GPU continue to grow exponentially but still have to fit within constant power boundaries

Mali GPU power already in mobile power budget; 35% additional energy efficiency improvements required every year to fit new performance requirements within SoC thermal limits

ARM GPU and System savings of 35% annually R

elat

ive

Pow

er

11

Page 12: Parallelism and Energy-efficient GPUs from Mobile to Supercomputersprism.sejong.ac.kr/download/p01_Jem_Davies.pdf ·  · 2018-02-05Parallelism and Energy-efficient GPUs from Mobile

>230 OEM products shipping with Mali GPUs

#1 graphics in Android tablets (>50%)

Delivering Tomorrow’s Features and Capabilities

>20% Android smartphones

ARM Mali GPU Momentum

#1 in graphics- enabled DTV (>70%)

12

75 licences across 54 partners, making Mali-based SoCs

Over 150M Mali™-based GPUs reported in 2012

Page 13: Parallelism and Energy-efficient GPUs from Mobile to Supercomputersprism.sejong.ac.kr/download/p01_Jem_Davies.pdf ·  · 2018-02-05Parallelism and Energy-efficient GPUs from Mobile

ARM is not just about Mobile…

…but Mobile has been driving a lot of innovation and enabling diversity

13

Page 14: Parallelism and Energy-efficient GPUs from Mobile to Supercomputersprism.sejong.ac.kr/download/p01_Jem_Davies.pdf ·  · 2018-02-05Parallelism and Energy-efficient GPUs from Mobile

Super-phone

Mali -T678

Connectivity

SOI

Mobile Revolution Built On Diversity

40LP

20nm

32 HKMG

14/16 FinFET

Cortex-A53

Cortex-A7

Cortex-A5

Cortex-A15

Cortex-A9

Mass- market

smartphone

Tablet

Clamshell

Mali -T628

Mali -T604

Mali -450 Mali

-400

Video

Memory Reference Design

Modem

Cortex-A57

Camera

Android Windows

Phone

Firefox Mobile

Chrome OS

Windows RT

Aliyun OS

14

Page 15: Parallelism and Energy-efficient GPUs from Mobile to Supercomputersprism.sejong.ac.kr/download/p01_Jem_Davies.pdf ·  · 2018-02-05Parallelism and Energy-efficient GPUs from Mobile

Post-PC World Built on Diversity Rapid pace of development has embraced numerous changes,

built on diverse, nimble, fast-responding ecosystem of partners One size does NOT fit all (nor should it!)

Not just desktops and laptops any more

New form factors and new usage models: Phones, tablets, phablets, netbooks, hybrid tablets, TVs,

consoles, Chromebooks, watches!

Always-on, always-connected, long battery-life

Different form factors in different regions

Diverse Silicon foundry ecosystem: bulk CMOS, FDSOI, FinFET

ARM’s partners are driven by innovation and user requirements, not by ARM

ARM shows technical leadership but embraces diversity – we do not pick winners!

15

Page 16: Parallelism and Energy-efficient GPUs from Mobile to Supercomputersprism.sejong.ac.kr/download/p01_Jem_Davies.pdf ·  · 2018-02-05Parallelism and Energy-efficient GPUs from Mobile

big Performance, LITTLE Power

16

Perf

orm

ance

Highest Performance

Highest Efficiency

big

LITTLE

Smartphone needs to be very responsive

Constant connectivity and usage Voice, SMS, trickle feeds are low-intensity tasks

Mobile performance

needs are highly elastic

Page 17: Parallelism and Energy-efficient GPUs from Mobile to Supercomputersprism.sejong.ac.kr/download/p01_Jem_Davies.pdf ·  · 2018-02-05Parallelism and Energy-efficient GPUs from Mobile

GPU Compute Making the Difference

Computer Vision

Real Time Still and Moving Image Perfection

Upscaling

Multi-Perspective Vision 2D to 3D

Information Extraction

Multi-User Interaction

Benefits

More energy-efficient processing BOM reduction

Improved accuracy/quality Improved existing use cases

Unlock new use cases

Light-Field Photography

Computational Photography

Trends

Heterogeneous computing Portability Parallel computation Hardware acceleration GPU Computing

17

Page 18: Parallelism and Energy-efficient GPUs from Mobile to Supercomputersprism.sejong.ac.kr/download/p01_Jem_Davies.pdf ·  · 2018-02-05Parallelism and Energy-efficient GPUs from Mobile

18

Unifying CPU and GPU Compute

High efficiency approaches for:

Computational photography

Immersive visual computing

Augmented Reality

Improved user interfaces

GPU for highly-threaded tasks and CPU for low-threaded tasks

Efficient ARM IP synergy makes it worthwhile to move even small tasks to the GPU

Full Profile GPU Compute is key to enabling portability

Continued Performance

& Efficiency Gains

Right-sized computing

Page 19: Parallelism and Energy-efficient GPUs from Mobile to Supercomputersprism.sejong.ac.kr/download/p01_Jem_Davies.pdf ·  · 2018-02-05Parallelism and Energy-efficient GPUs from Mobile

Energy Efficiency Underlies It All Efficiency is a requirement for a connected world

ARM has always understood this – others are now catching on

19

Page 20: Parallelism and Energy-efficient GPUs from Mobile to Supercomputersprism.sejong.ac.kr/download/p01_Jem_Davies.pdf ·  · 2018-02-05Parallelism and Energy-efficient GPUs from Mobile

Energy-efficiency Has Always Mattered Segment differentiation matters less than you think

High-end Mobile driven by thermal constraints And “You can never be too thin”

Sensors need 10 years from a button cell battery

Everybody hates fans

Desktops struggle to: Dissipate more than ~150W out of a single chip package

Supply more than ~300W from a PSU onto a PCI card

Servers struggle: To get more than ~10kW into a rack

To get more than ~500kW of heat out of a shipping container

Spending more on power and cooling than on hardware

To buy more power from the grid 20

Page 21: Parallelism and Energy-efficient GPUs from Mobile to Supercomputersprism.sejong.ac.kr/download/p01_Jem_Davies.pdf ·  · 2018-02-05Parallelism and Energy-efficient GPUs from Mobile

GPUs from Mobile to Supercomputer

Mont-Blanc project selects Samsung Exynos 5 Processor - 13 Nov 2012 The project continues its research effort towards an energy efficient HPC prototype using low-power embedded technology

Salt Lake City, 13th November 2012.- The Mont-Blanc European project has selected the Samsung Exynos platform as the building block for powering its first integrated low power- High Performance Computing (HPC) prototype. The aim of Mont-Blanc project is to design a new type of computer architecture capable of setting future global HPC standards, built from today’s energy efficient solutions used in embedded and mobile devices

The Samsung Exynos 5 Dual is built on 32nm low-power HKMG (High-K Metal Gate), and features a dual-core 1.7GHz mobile CPU built on ARM® Cortex™-A15 architecture plus an integrated ARM Mali™-T604 GPU for increased performance density and energy efficiency. It has been featured and market proven in consumer and mobile devices such as Samsung Chromebook and Google’s Nexus 10

21

Page 22: Parallelism and Energy-efficient GPUs from Mobile to Supercomputersprism.sejong.ac.kr/download/p01_Jem_Davies.pdf ·  · 2018-02-05Parallelism and Energy-efficient GPUs from Mobile

The Search for Parallelism ARM (like the industry) utilizes parallelism across all segments, across multiple devices

to drive down energy use CPUs use instruction-level parallelism: modern ARM CPUs have IPC well over 3

CPUs and OSs exploit task-level parallelism: fast context swap, MMUs, virtual memory, hypervisors

Graphics is one of the few problem spaces that have very high levels of parallelism: Do <foo> on every vertex in the frame (10s - 100s of kvertices/frame)

Do <bar> on every pixel in the frame (millions of pixels/frame)

Data-level parallelism through thread-level parallelism

GPU compute (throughput architectures/implementations) has great advantages: Performance and Energy efficiency

CPUs remain “easier” to program

Do GPUs merge with CPUs? Maybe not yet, but connections between them become ever more important

22

Page 23: Parallelism and Energy-efficient GPUs from Mobile to Supercomputersprism.sejong.ac.kr/download/p01_Jem_Davies.pdf ·  · 2018-02-05Parallelism and Energy-efficient GPUs from Mobile

Amdahl’s Law is Alive and Well 100% parallel

95% parallel

90% parallel

75% parallel 50% parallel

0

4

8

12

16

20

24

28

32

0 4 8 12 16 20 24 28 32

Max

imum

spe

edup

Number of cores

Speedup on parallel processors is limited by the sequential portion of the program

Sequential portion need not be large to constrain speedup significantly !

23

Page 24: Parallelism and Energy-efficient GPUs from Mobile to Supercomputersprism.sejong.ac.kr/download/p01_Jem_Davies.pdf ·  · 2018-02-05Parallelism and Energy-efficient GPUs from Mobile

ARM’s Heterogeneous Computing Advantages

ARM has always been involved in heterogeneous systems For 23 years, mobile SoCs have contained

CPUs, DSPs and other compute elements

ARM’s first multicore CPU was produced 10 years ago ARM has produced several generations of

coherent interconnect

ARM introduced the world’s first embedded multicore GPU Now have TFLOP GPUs

big.LITTLE™ and GPU Computing Right task, right processor

Today’s modern mobile SoC contains ISPs, DSPs, GPUs, baseband accelerators, crypto accelerators, GPUs and VPUs “Helper” ARM CPUs, security, power

controllers, baseband, WiFi

Heterogeneous, multicore applications CPUs

ARM is introducing multicore VPUs

All within today’s energy budget…

24

Page 25: Parallelism and Energy-efficient GPUs from Mobile to Supercomputersprism.sejong.ac.kr/download/p01_Jem_Davies.pdf ·  · 2018-02-05Parallelism and Energy-efficient GPUs from Mobile

Improving the Granule of Offload Hardware: sharing data Traditional CPU -> GPU was unidirectional offload

GPU is I/O-coherent with CPU ◦ GPU reads snoop CPU data

Frame buffer is not cached

Too big, especially with multi-buffering

Typically large, pre-arranged buffers

Set up by a driver

GPU Compute is much more bidirectional GPU needs to be fully coherent with the CPU

Smaller-scale sharing of data

Exchanging pointers to data structures

Directly referred to by the application

Avoid driver interaction 25

CPU GPU Display processor

CPU GPU

Object Database Texture Database

Frame Buffer

Page 26: Parallelism and Energy-efficient GPUs from Mobile to Supercomputersprism.sejong.ac.kr/download/p01_Jem_Davies.pdf ·  · 2018-02-05Parallelism and Energy-efficient GPUs from Mobile

CPU + GPU Heterogeneous benefits ARM leadership in system coherency is key to enable heterogeneous computing and

the future of graphics

GPU participating in system coherency enables a broad set of benefits and values Avoids unnecessary cache flush/invalidate operations Maintain warm caches: lower power, higher performance

Avoids DRAM access to data present in CPU caches Lower power by elimination of off-chip memory accesses

Eliminates software-managed cache coherency Greater efficiency, reduced complexity

Faster TTM by removing source of complex synchronization bugs

Reduced cost of sharing data between CPU and GPU Widens the set of algorithms ripe for GPU acceleration

26

CoreLink ™ CCN-504 Cache Coherent Network with AMBA® 4 ACE™ Interfaces

CoreLink DMC 520 System and I/O NIC Network interconnect

L2 L2 L2

Up to Quad big CPUs

Up to Quad LITTLE CPUs

Up to 8 Mali-T678

Page 27: Parallelism and Energy-efficient GPUs from Mobile to Supercomputersprism.sejong.ac.kr/download/p01_Jem_Davies.pdf ·  · 2018-02-05Parallelism and Energy-efficient GPUs from Mobile

What is ARM doing to help?

27

Page 28: Parallelism and Energy-efficient GPUs from Mobile to Supercomputersprism.sejong.ac.kr/download/p01_Jem_Davies.pdf ·  · 2018-02-05Parallelism and Energy-efficient GPUs from Mobile

Building Smarter Systems Visual computing is about designing the best compute

systems to match the competing demands of power and performance

ARM provides the four fundamental IP building blocks needed to enable high-performance, energy efficient, coherent System-on-Chips

ARM’s unique experience and portfolio of processors, memory systems, physical IP and interconnect designed together

ARM’s leading system IP extends coherency for optimized multicore solutions

Heterogeneous computing for optimum energy efficiency

28

Page 29: Parallelism and Energy-efficient GPUs from Mobile to Supercomputersprism.sejong.ac.kr/download/p01_Jem_Davies.pdf ·  · 2018-02-05Parallelism and Energy-efficient GPUs from Mobile

ARMv8: 64-bit Processing Now Available In Mobile Power

64-bit processing and full backward compatibility with 32-bit

Mobile devices will need more than 4GBytes

Scalable power and performance with big.LITTLE

Sub 1mm2 processors

Co-designed with Mali-T600 GPUs

32-bit

64-bit

CRYPTO

ARMv8

Scalar FP

Advanced SIMD

ARM v7A Software

Designed to deliver all your computing needs 29

Page 30: Parallelism and Energy-efficient GPUs from Mobile to Supercomputersprism.sejong.ac.kr/download/p01_Jem_Davies.pdf ·  · 2018-02-05Parallelism and Energy-efficient GPUs from Mobile

Complexity Out-of-order superscalar CPUs introduces some complexity

Multicore CPUs/GPUs – more complexity Memory consistency model

Threading models

Heterogeneous computing - more complexity

Some GPU compute engines are: Complex

Badly described

Difficult to reason about

Most graphics developers want to create visually stunning content, not argue about the finer points of computer science

30

Page 31: Parallelism and Energy-efficient GPUs from Mobile to Supercomputersprism.sejong.ac.kr/download/p01_Jem_Davies.pdf ·  · 2018-02-05Parallelism and Energy-efficient GPUs from Mobile

Complexity drives the need for standards If we do not abstract away from this complexity

and present a simpler world to the developer… Either they won’t use these new systems

Or, they’ll use it and get it wrong

Either way, it will be our fault, and they will hate us for it

If we all provide different abstractions… They will hate us for it

We need standard(s) And a very small number of them

31

Page 32: Parallelism and Energy-efficient GPUs from Mobile to Supercomputersprism.sejong.ac.kr/download/p01_Jem_Davies.pdf ·  · 2018-02-05Parallelism and Energy-efficient GPUs from Mobile

ARM and HSA

Heterogeneous System Architecture (HSA) Foundation Not-for-profit consortium founded by industry leaders to create

hardware and software standards for heterogeneous computing to: Simplify the programming environment Make compute at low power pervasive Introduce new capabilities in modern computing devices

ARM is a founding member and board member

Heterogeneous computing is the essence of efficient computation: the right processor for the right task

More research required – not a solved problem Better models, compilation technologies, debug, programming methodologies

We need to make it easier to use – still seen as a “black art”

32

Page 33: Parallelism and Energy-efficient GPUs from Mobile to Supercomputersprism.sejong.ac.kr/download/p01_Jem_Davies.pdf ·  · 2018-02-05Parallelism and Energy-efficient GPUs from Mobile

33

Diverse Partners Driving Future of Heterogeneous Computing

© Copyright 2012 HSA Foundation. All Rights Reserved.

Founders

Promoters

Supporters

Contributors

Academic

Page 34: Parallelism and Energy-efficient GPUs from Mobile to Supercomputersprism.sejong.ac.kr/download/p01_Jem_Davies.pdf ·  · 2018-02-05Parallelism and Energy-efficient GPUs from Mobile

A Call to the ARM Partnership for Action

34

Page 35: Parallelism and Energy-efficient GPUs from Mobile to Supercomputersprism.sejong.ac.kr/download/p01_Jem_Davies.pdf ·  · 2018-02-05Parallelism and Energy-efficient GPUs from Mobile

Towards the Future Don’t miss opportunities. Push the boundaries!

Innovation in the post-PC era requires: Focus on energy-efficiency across diverse market

segments with diverse devices

Continue to utilize parallelism (of all forms)

Compute on the right device Highest efficiency CPU for required performance Use the right programmable device for the right task,

such as GPU computing Domain-specific processors for specific tasks

Not just hardware problem, need more software investment: Operating systems, hypervisors, managed environments, APIs, tools

Application developers across the ecosystem

We need new standards, but there will be different paths to follow One size does NOT fit all

35

Page 36: Parallelism and Energy-efficient GPUs from Mobile to Supercomputersprism.sejong.ac.kr/download/p01_Jem_Davies.pdf ·  · 2018-02-05Parallelism and Energy-efficient GPUs from Mobile

ARM committed to heterogeneous, parallel computing

36

Page 37: Parallelism and Energy-efficient GPUs from Mobile to Supercomputersprism.sejong.ac.kr/download/p01_Jem_Davies.pdf ·  · 2018-02-05Parallelism and Energy-efficient GPUs from Mobile

Thank you 谢谢

37

Page 38: Parallelism and Energy-efficient GPUs from Mobile to Supercomputersprism.sejong.ac.kr/download/p01_Jem_Davies.pdf ·  · 2018-02-05Parallelism and Energy-efficient GPUs from Mobile

Questions?

38