Parallelism and Energy-efficient GPUs from Mobile to Supercomputers
Jem Davies
ARM Fellow, VP of Technology, Media Processing Division
24-Feb-13
First, some background…
2
The Eras of Computing
[Chart: units versus year, 1960 to 2020]
1st Era: Mainframe (1M), Mini (10M)
2nd Era: PC (100M), Desktop Internet (1 Billion)
Mobile Internet (10 Billion), The Internet of Things (100 Billion)
3
Merging Our Digital and Physical Worlds
ARM® technology and the ARM ecosystem scale from sensors to servers, connecting our world, from the Internet of Things to supercomputers
5
ARM’s partners make SoCs containing heterogeneous compute engines
8.7 billion ARM-based chips in 2012, from <50 cents to >$200
We’ve always been in the multicore business
From 1mm³ to 1km³
1mm³ platform (University of Michigan)
Battery, solar cells, processor, SRAM and PMU
8.75mm³ platform: solar cell, 0.18µm Cortex™-M3, 12µAh Li-ion battery
1km³ platform
4200 ARM powered neutrino detectors: 70 bore holes 2.5km deep, 60 detectors per bore hole
Work supported by the National Science Foundation and University of Wisconsin-Madison
6
Mobility Is Reshaping Behaviour
71% of enterprises plan to establish own store
Tablets outsell laptop PCs
60% of Facebook users access from mobile
79% of iPad users use exclusively for Internet
1 billion smartphones will ship in 2013
50% of mobile computing devices in 2013 will be ARM-based
36% of Chinese Android™ users spend 2+ hrs a day on mobile apps
Sources: Symantec “2012 State of Mobility Survey”; BI Intelligence 2012; Guohe, 2012; IDG; Pew Research
7
8
Pushing the Boundaries of Mobile Computing
Delivering tomorrow’s features and capabilities
1080p HD video and beyond, multiple streams, HEVC
Live image and scene recognition for augmented reality
Intuitive gesture and natural language input
Video & image editing, manipulation, and rendering
Console-quality gaming experience
Facial recognition and unlock
2.5k/4k displays, multiple simultaneous displays
Multi-tasking and multi-window productivity
9
Interactive Mobile Devices Drive Graphics
Graphics capability is a key factor in consumer purchasing decisions
Rich graphics is a priority for anything with a screen: smartphones, DTVs, STBs, tablets, hand-held games consoles, netbooks, in-car infotainment
In the post-PC era, data from the cloud and graphically rich, interactive, connected devices mean a growing market for GPUs
4 billion internet-connected screens in 2016, most with embedded graphics
Solving the Graphics Market Challenges
10
Performance/Power is the Big Challenge
Requirements on the GPU continue to grow exponentially but still have to fit within constant power boundaries
Mali GPU power already in mobile power budget; 35% additional energy efficiency improvements required every year to fit new performance requirements within SoC thermal limits
[Chart: relative power over time, showing ARM GPU and system savings of 35% annually]
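To get a feel for how a 35% yearly improvement compounds, here is a minimal sketch. It assumes that "35% more energy-efficient" means performance per watt is multiplied by 1.35 each year (the slide does not define the metric precisely) and that the power budget stays constant. The function name is illustrative, not something from the talk.

```python
def perf_at_constant_power(years, annual_gain=0.35):
    """Relative performance reachable in a fixed power budget after `years`.

    Assumption (not from the slides): efficiency compounds multiplicatively,
    so performance per watt grows by a factor of (1 + annual_gain) each year.
    """
    if years < 0:
        raise ValueError("years must be non-negative")
    return (1.0 + annual_gain) ** years


if __name__ == "__main__":
    for year in range(6):
        print(f"year {year}: {perf_at_constant_power(year):.2f}x")
```

Under this assumption, five years of 35% gains allow roughly 4.5x the performance within the same power budget.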
11
ARM Mali GPU Momentum
Delivering Tomorrow’s Features and Capabilities
>230 OEM products shipping with Mali GPUs
#1 graphics in Android tablets (>50%)
>20% Android smartphones
#1 in graphics-enabled DTV (>70%)
12
75 licences across 54 partners, making Mali-based SoCs
Over 150M Mali™-based GPUs reported in 2012
ARM is not just about Mobile…
…but Mobile has been driving a lot of innovation and enabling diversity
13
Mobile Revolution Built On Diversity
[Diagram: the diverse building blocks of mobile devices]
Devices: Super-phone, Mass-market smartphone, Tablet, Clamshell
CPUs: Cortex-A57, Cortex-A53, Cortex-A15, Cortex-A9, Cortex-A7, Cortex-A5
GPUs: Mali-T678, Mali-T628, Mali-T604, Mali-450, Mali-400
Process: 40LP, 32 HKMG, 20nm, 14/16 FinFET, SOI
Other blocks: Connectivity, Video, Memory Reference Design, Modem, Camera
Operating systems: Android, Windows Phone, Firefox Mobile, Chrome OS, Windows RT, Aliyun OS
14
Post-PC World Built on Diversity
Rapid pace of development has embraced numerous changes, built on diverse, nimble, fast-responding ecosystem of partners
One size does NOT fit all (nor should it!)
Not just desktops and laptops any more
New form factors and new usage models: Phones, tablets, phablets, netbooks, hybrid tablets, TVs, consoles, Chromebooks, watches!
Always-on, always-connected, long battery-life
Different form factors in different regions
Diverse Silicon foundry ecosystem: bulk CMOS, FDSOI, FinFET
ARM’s partners are driven by innovation and user requirements, not by ARM
ARM shows technical leadership but embraces diversity – we do not pick winners!
15
big Performance, LITTLE Power
16
[Chart: performance axis; big = Highest Performance, LITTLE = Highest Efficiency]
Smartphone needs to be very responsive
Constant connectivity and usage
Voice, SMS, trickle feeds are low-intensity tasks
Mobile performance needs are highly elastic
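The idea of matching an elastic workload to the most efficient core that can handle it can be sketched as below. This is only an illustration: the `CoreType` class, the `pick_core` function, and every number in it are hypothetical and are not ARM figures. Real big.LITTLE scheduling is considerably more involved than a single lookup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreType:
    name: str
    max_perf: float   # peak throughput, arbitrary units (hypothetical)
    power_mw: float   # power at that throughput, milliwatts (hypothetical)


# Illustrative numbers only; they are not ARM figures.
LITTLE = CoreType("LITTLE", max_perf=1.0, power_mw=100.0)
BIG = CoreType("big", max_perf=3.0, power_mw=750.0)


def pick_core(required_perf, cores=(LITTLE, BIG)):
    """Return the lowest-power core type that can meet `required_perf`.

    If no core is fast enough, fall back to the fastest one available.
    """
    capable = [c for c in cores if c.max_perf >= required_perf]
    if capable:
        return min(capable, key=lambda c: c.power_mw)
    return max(cores, key=lambda c: c.max_perf)


if __name__ == "__main__":
    for demand in (0.2, 2.0, 10.0):
        print(demand, "->", pick_core(demand).name)
```

With these made-up numbers, a low-intensity task such as a trickle feed lands on the LITTLE core, while a demanding burst goes to the big core.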
GPU Compute Making the Difference
Computer Vision
Real Time Still and Moving Image Perfection
Upscaling
Multi-Perspective Vision 2D to 3D
Information Extraction
Multi-User Interaction
Light-Field Photography
Computational Photography
Benefits
More energy-efficient processing
BOM reduction
Improved accuracy/quality
Improved existing use cases
Unlock new use cases
Trends
Heterogeneous computing
Portability
Parallel computation
Hardware acceleration
GPU Computing
17
18
Unifying CPU and GPU Compute
High efficiency approaches for:
Computational photography
Immersive visual computing
Augmented Reality
Improved user interfaces
GPU for highly-threaded tasks and CPU for low-threaded tasks
Efficient ARM IP synergy makes it worthwhile to move even small tasks to the GPU
Full Profile GPU Compute is key to enabling portability
Continued Performance & Efficiency Gains
Right-sized computing
Energy Efficiency Underlies It All
Efficiency is a requirement for a connected world
ARM has always understood this – others are now catching on
19
Energy-efficiency Has Always Mattered
Segment differentiation matters less than you think
High-end Mobile driven by thermal constraints
And “You can never be too thin”
Sensors need 10 years from a button cell battery
Everybody hates fans
Desktops struggle to:
Dissipate more than ~150W out of a single chip package
Supply more than ~300W from a PSU onto a PCI card
Servers struggle:
To get more than ~10kW into a rack
To get more than ~500kW of heat out of a shipping container
Spending more on power and cooling than on hardware
To buy more power from the grid
20
GPUs from Mobile to Supercomputer
Mont-Blanc project selects Samsung Exynos 5 Processor - 13 Nov 2012
The project continues its research effort towards an energy efficient HPC prototype using low-power embedded technology
Salt Lake City, 13th November 2012. The Mont-Blanc European project has selected the Samsung Exynos platform as the building block for powering its first integrated low-power High Performance Computing (HPC) prototype. The aim of the Mont-Blanc project is to design a new type of computer architecture capable of setting future global HPC standards, built from today’s energy efficient solutions used in embedded and mobile devices.
The Samsung Exynos 5 Dual is built on 32nm low-power HKMG (High-K Metal Gate), and features a dual-core 1.7GHz mobile CPU built on ARM® Cortex™-A15 architecture plus an integrated ARM Mali™-T604 GPU for increased performance density and energy efficiency. It has been featured and market proven in consumer and mobile devices such as Samsung Chromebook and Google’s Nexus 10.
21
The Search for Parallelism
ARM (like the industry) utilizes parallelism across all segments, across multiple devices to drive down energy use
CPUs use instruction-level parallelism: modern ARM CPUs have IPC well over 3
CPUs and OSs exploit task-level parallelism: fast context swap, MMUs, virtual memory, hypervisors
Graphics is one of the few problem spaces that have very high levels of parallelism:
Do <foo> on every vertex in the frame (10s - 100s of kvertices/frame)
Do <bar> on every pixel in the frame (millions of pixels/frame)
Data-level parallelism through thread-level parallelism
GPU compute (throughput architectures/implementations) has great advantages:
Performance and Energy efficiency
CPUs remain “easier” to program
Do GPUs merge with CPUs?
Maybe not yet, but connections between them become ever more important
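The "do <bar> on every pixel" pattern can be sketched as below: the same small function is applied independently to each pixel, so the work can be split across threads in any order. The `shade` kernel is a hypothetical stand-in for <bar>, and a CPU thread pool is used only to show the structure. A GPU would typically run one lightweight thread per pixel, and CPython threads will not actually speed up this CPU-bound loop.

```python
from concurrent.futures import ThreadPoolExecutor


def shade(pixel):
    """Stand-in for <bar>: invert one 8-bit grey pixel (hypothetical kernel)."""
    return 255 - pixel


def shade_row(row):
    """Apply the per-pixel kernel to one row of the frame."""
    return [shade(p) for p in row]


def shade_frame(frame, workers=4):
    """Apply `shade` to every pixel, submitting one row per task.

    No pixel depends on any other, so rows may be processed in any order
    or at the same time; Executor.map returns results in input order.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(shade_row, frame))


if __name__ == "__main__":
    print(shade_frame([[0, 255], [10, 20]]))
```

The key property is that the kernel has no cross-pixel dependencies, which is what lets data-level parallelism be expressed as many independent threads.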
22
Amdahl’s Law is Alive and Well
[Chart: maximum speedup (0 to 32) versus number of cores (0 to 32), with curves for 100%, 95%, 90%, 75% and 50% parallel code]
Speedup on parallel processors is limited by the sequential portion of the program
Sequential portion need not be large to constrain speedup significantly!
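The curves on this slide follow from the standard statement of Amdahl's law: with a parallel fraction p and n cores, the maximum speedup is 1 / ((1 - p) + p / n). A minimal sketch that reproduces the chart's end points is shown below; the function name is illustrative.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Maximum speedup on `cores` processors by Amdahl's law.

    speedup = 1 / ((1 - p) + p / n), where p is the parallel fraction
    of the program and n is the number of cores.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    for p in (1.0, 0.95, 0.90, 0.75, 0.50):
        print(f"{p:.0%} parallel, 32 cores: {amdahl_speedup(p, 32):.2f}x")
```

With 32 cores, code that is 95% parallel tops out at about 12.5x, and code that is 50% parallel stays below 2x.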
23
ARM’s Heterogeneous Computing Advantages
ARM has always been involved in heterogeneous systems
For 23 years, mobile SoCs have contained CPUs, DSPs and other compute elements
ARM’s first multicore CPU was produced 10 years ago
ARM has produced several generations of coherent interconnect
ARM introduced the world’s first embedded multicore GPU
Now have TFLOP GPUs
big.LITTLE™ and GPU Computing
Right task, right processor
Today’s modern mobile SoC contains ISPs, DSPs, GPUs, baseband accelerators, crypto accelerators and VPUs
“Helper” ARM CPUs, security, power controllers, baseband, WiFi
Heterogeneous, multicore applications CPUs
ARM is introducing multicore VPUs
All within today’s energy budget…
24
Improving the Granule of Offload
Hardware: sharing data
Traditional CPU -> GPU was unidirectional offload
GPU is I/O-coherent with CPU
◦ GPU reads snoop CPU data
Frame buffer is not cached
Too big, especially with multi-buffering
Typically large, pre-arranged buffers
Set up by a driver
GPU Compute is much more bidirectional
GPU needs to be fully coherent with the CPU
Smaller-scale sharing of data
Exchanging pointers to data structures
Directly referred to by the application
Avoid driver interaction
25
[Diagram: CPU, GPU and display processor, with object database, texture database and frame buffer]
CPU + GPU Heterogeneous benefits
ARM leadership in system coherency is key to enable heterogeneous computing and the future of graphics
GPU participating in system coherency enables a broad set of benefits and values
Avoids unnecessary cache flush/invalidate operations
Maintain warm caches: lower power, higher performance
Avoids DRAM access to data present in CPU caches
Lower power by elimination of off-chip memory accesses
Eliminates software-managed cache coherency
Greater efficiency, reduced complexity
Faster TTM by removing source of complex synchronization bugs
Reduced cost of sharing data between CPU and GPU
Widens the set of algorithms ripe for GPU acceleration
26
[Diagram: system built around the CoreLink™ CCN-504 Cache Coherent Network with AMBA® 4 ACE™ interfaces]
Compute clusters (with L2): up to Quad big CPUs, up to Quad LITTLE CPUs, up to 8 Mali-T678
CoreLink DMC 520
System and I/O NIC (network interconnect)
What is ARM doing to help?
27
Building Smarter Systems
Visual computing is about designing the best compute systems to match the competing demands of power and performance
ARM provides the four fundamental IP building blocks needed to enable high-performance, energy-efficient, coherent Systems-on-Chip
ARM’s unique experience and portfolio of processors, memory systems, physical IP and interconnect designed together
ARM’s leading system IP extends coherency for optimized multicore solutions
Heterogeneous computing for optimum energy efficiency
28
ARMv8: 64-bit Processing Now Available In Mobile Power
Designed to deliver all your computing needs
64-bit processing and full backward compatibility with 32-bit
Mobile devices will need more than 4GBytes
Scalable power and performance with big.LITTLE
Sub-1mm² processors
Co-designed with Mali-T600 GPUs
[Diagram: ARMv8 with 32-bit and 64-bit execution, CRYPTO, Scalar FP, Advanced SIMD, and ARMv7-A software]
29
Complexity
Out-of-order superscalar CPUs introduce some complexity
Multicore CPUs/GPUs - more complexity
Memory consistency model
Threading models
Heterogeneous computing - more complexity
Some GPU compute engines are:
Complex
Badly described
Difficult to reason about
Most graphics developers want to create visually stunning content, not argue about the finer points of computer science
30
Complexity drives the need for standards
If we do not abstract away from this complexity and present a simpler world to the developer…
Either they won’t use these new systems
Or, they’ll use them and get it wrong
Either way, it will be our fault, and they will hate us for it
If we all provide different abstractions…
They will hate us for it
We need standard(s)
And a very small number of them
31
ARM and HSA
Heterogeneous System Architecture (HSA) Foundation
Not-for-profit consortium founded by industry leaders to create hardware and software standards for heterogeneous computing to:
Simplify the programming environment
Make compute at low power pervasive
Introduce new capabilities in modern computing devices
ARM is a founding member and board member
Heterogeneous computing is the essence of efficient computation: the right processor for the right task
More research required - not a solved problem
Better models, compilation technologies, debug, programming methodologies
We need to make it easier to use – still seen as a “black art”
32
33
Diverse Partners Driving Future of Heterogeneous Computing
[Slide of HSA Foundation member logos by tier: Founders, Promoters, Supporters, Contributors, Academic]
© Copyright 2012 HSA Foundation. All Rights Reserved.
A Call to the ARM Partnership for Action
34
Towards the Future
Don’t miss opportunities. Push the boundaries!
Innovation in the post-PC era requires:
Focus on energy-efficiency across diverse market segments with diverse devices
Continue to utilize parallelism (of all forms)
Compute on the right device
Highest efficiency CPU for required performance
Use the right programmable device for the right task, such as GPU computing
Domain-specific processors for specific tasks
Not just a hardware problem; more software investment is needed:
Operating systems, hypervisors, managed environments, APIs, tools
Application developers across the ecosystem
We need new standards, but there will be different paths to follow
One size does NOT fit all
35
ARM committed to heterogeneous, parallel computing
36
Thank you 谢谢
37
Questions?
38