"efficient convolutional neural network inference on mobile gpus," a presentation from...
TRANSCRIPT
Copyright © 2016 Imagination Technologies 1
Efficient Convolutional Neural Network Inference on Mobile GPUs
Paul Brasnett
May 3, 2016
Overview
• About Imagination Technologies
• PowerVR GPUs
• Case study: Implementing Convolutions
• Performance Analysis
• Conclusions
• Resources
About Imagination Technologies
• Imagination Technologies is a leading IP supplier for multimedia, processors and communications
• More than 8bn units containing Imagination IP shipped
[Diagram: IP families connected by the SoC fabric: PowerVR Graphics & GPU Compute Processors, PowerVR Video Processors, PowerVR Vision Processors, Ensigma Communications Processors, MIPS Processors]
What is a Mobile GPU?
A mobile GPU is a GPU optimised for high performance at low power.
What is a Mobile GPU?
Target markets: mobile devices, automotive, consumer multimedia, wearables, Internet of Things, augmented reality
Why Mobile GPUs for Vision Processing?
• CPUs can deliver high peak/burst performance, but generate large amounts of heat
• PowerVR Mobile GPUs provide:
• Lowest-power FP16 & int pipelines
• Local memory for highly efficient data access for compute operations
• Power-saving features such as gating of non-compute parts of the GPU for efficient compute operation
Why Mobile GPUs for Vision Processing?
[Chart: performance relative to CPU (CPU = 100%)]

Benchmark                  | CPU  | PowerVR Series6
Provence (ray tracing)     | 100% | 265%
Particle Simulation (32k)  | 100% | 407%
Particle Simulation (4k)   | 100% | 517%
Julia Set                  | 100% | 963%
Ambient Occlusion          | 100% | 1126%
Denoise                    | 100% | 482%
Gaussian Blur              | 100% | 383%
Moving the CNN Workload to the GPU
[Diagram: the CPU (CPU0/CPU1, few threads, large cache) enqueues a compute kernel to the PowerVR GPU through its host interface; a coarse-grain scheduler dispatches work to multiprocessors (Unified Shading Clusters), each with residency slots, a common store, a compute store and a texture processing unit; an L2/system-level cache and the system memory interface connect both processors to unified system memory]
Evolution of Mobile GPU
PowerVR Series 6 GPU → PowerVR Series 7 GPU → PowerVR Series 8 GPU → …
Evolution of Mobile GPU
New APIs: OpenCL 1.2, OpenCL 2.0, OpenCV, OpenVX, Vulkan
GPU Dominates Compute in Modern SoCs
• Mobile GPUs are increasingly dominating compute performance in SoCs
[Illustrative diagram only, showing the relative CPU/GPU sizes]
Why CNNs?
• State-of-the-art performance
• Rapid development cycles
• Range of vision tasks
• Classification
• Localisation
• Other applications…
Example: camera localisation (PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization, Kendall, A., Grimes, M., Cipolla, R., ICCV 2015)
What is a CNN?
CNN architecture building blocks: Convolution, Activation, Normalization, Pooling, Fully Connected, Soft Max
CNN example network: an image passes through stacked Convolution → Activation → Pooling → Normalization stages, followed by a Fully Connected layer and a Soft Max.
CNN Object Classification
• Training (offline): architecture + data → CNN library (compute + time) → model coefficients
• Inference (online): architecture + model coefficients + image → CNN library (compute) → classification, running on the mobile GPU
Where is the Cost in CNN Inference?
[Chart: flops by layer type for AlexNet: Convolution, Normalisation, Pooling, Fully Connected]
Matrix Multiply — Naïve
• Create as many work-items as there are elements in the output matrix
• Each work-item reads its row and column and produces a dot product
• Requires a large number of accesses to memory
[Diagram: A × B = C]
OpenCL Memory Model
• The OpenCL memory model maps closely to the GPU architecture
• Private memory: per work-item
• Local memory: shared within a work-group
• Global memory / constant memory: visible to all work-groups
• Host memory: typically shared between CPU and GPU on a mobile SoC
Matrix Multiply — Tiling Approach
• Work-items load A data into private memory
• Work-groups load B data into local memory
• Each work-item reads from local memory and produces a dot product
• Significantly reduces global memory accesses
[Diagram: A × B = C]
Tiling approach based on Volkov and Demmel, "Benchmarking GPUs to tune dense linear algebra", 2008
Matrix Multiply — OpenCL Tips
• Choose a work-group size to fit the GPU; 32 work-items is typically a good choice for PowerVR GPUs
• Read multiple items (e.g. 4 or 8) into private memory at a time to optimise memory transfers
• Consider the half data type in place of float: most PowerVR platforms provide up to 2x the flops
• Define the work-group size at compile time: __attribute__((reqd_work_group_size(SIZE, 1, 1)))
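Putting these tips together, an illustrative (untested) OpenCL C kernel skeleton might look like the following; the kernel name `mat_mul_row`, the per-row work split and the fixed work-group size of 32 are assumptions for this sketch, not PowerVR's actual kernel:

```c
#pragma OPENCL EXTENSION cl_khr_fp16 : enable

#define WG_SIZE 32   /* typically a good choice on PowerVR */

/* Work-group size fixed at compile time, as recommended above. */
__attribute__((reqd_work_group_size(WG_SIZE, 1, 1)))
__kernel void mat_mul_row(__global const half *A,
                          __global const half *B,
                          __global half *C,
                          int K, int N)
{
    int row = get_global_id(0);
    for (int col = 0; col < N; ++col) {
        half acc = 0;
        /* Read 4 items at a time into private memory. */
        for (int k = 0; k < K; k += 4) {
            half4 a = vload4(0, A + row * K + k);
            half4 b = (half4)(B[(k + 0) * N + col], B[(k + 1) * N + col],
                              B[(k + 2) * N + col], B[(k + 3) * N + col]);
            acc += dot(a, b);
        }
        C[row * N + col] = acc;
    }
}
```

half arithmetic requires the cl_khr_fp16 extension, and a production kernel would also tile B into __local memory as on the previous slides.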
Matrix Multiply — Tiling Approach
[Log-scale chart: time (s), 0.1 to 1000, against matrix size, comparing the naïve and tiled matrix multiplies]
CNN Classification: AlexNet & GoogLeNet

Metric                                    | AlexNet | GoogLeNet
Model coefficients, millions (bandwidth)  | 60      | 5.5
Operations, billions (compute)            | 1.3     | 3.1
Top-5 error rate (%)                      | 18.2    | 10.07
Performance Analysis — CNN Inference
• Time consumed by layer type
[Pie charts: time split across convolutions, pooling, normalisation and fully connected layers; AlexNet reference time*: 1.36, GoogLeNet reference time*: 1.00]
Performance Analysis — GPU v CPU*
* CPU results based on Caffe (with ATLAS)
[Bar chart: relative FPS performance on AlexNet (higher is better), scale 0 to 14; GPU: PowerVR 2-cluster GPU (480MHz) vs CPU: ARM A15 (1.6GHz)]
Efficiency Analysis — GPU v CPU
[Bar chart: relative efficiency on AlexNet (higher is better), scale 0 to 3.5; GPU: PowerVR 2-cluster GPU (480MHz) vs CPU: ARM A15 (1.6GHz)]
Conclusions
• Mobile GPUs are widely available in a range of SoCs across numerous markets today
• Compared to mobile CPUs, PowerVR Mobile GPUs offer up to 3x higher efficiency and up to 12x higher performance for CNN deployment
• Newer CNN architectures with smaller fully connected layers help to make more efficient use of compute resources
• PowerVR GPUs scale to allow for higher levels of performance and lower power for current and future generations of vision-enabled products
• Come and see the demo during the next break
Resources
• PowerVR GPU Compute: https://imgtec.com/tools/powervr-gpu-compute/
• Guide to writing OpenCL: http://blog.imgtec.com/powervr/a-quick-guide-to-writing-opencl-kernels-for-rogue
• PowerVR Imaging Framework: http://blog.imgtec.com/powervr/powervr-imaging-framework-sdk
• PowerVR CNN Demo: see our stand
• OpenCL Tutorial: https://handsonopencl.github.io/