TRANSCRIPT
Real World HPC Systems for Big Data/AI Research
Efficient 2D Convolution on CUDA-enabled GPUs [2]
References
[1] Shweta Salaria, Aleksandr Drozd, Artur Podobas, Satoshi Matsuoka. Learning Neural Representations for Predicting GPU Performance. ISC'19.
[2] Peng Chen, Mohamed Wahib, Shin'ichiro Takizawa, Ryousei Takano, Satoshi Matsuoka. Efficient Algorithms for the Summed Area Tables Primitive on GPUs. IEEE CLUSTER'18.
[3] Haoyu Zhang, Mohamed Wahib, Satoshi Matsuoka. Can Local Binary Convolutions Make Neural Networks Models Smaller? Submitted to ICPP'19 (poster).
[4] Mateusz Bysiek, Aleksandr Drozd, Satoshi Matsuoka. Migrating Legacy Fortran to Python While Retaining Fortran-level Performance Through Transpilation and Type Hints. 6th Workshop on Python for High-Performance and Scientific Computing, co-held with SC16.
[Figure: 2D convolution within a CUDA warp. An M×N filter slides over the input while a 32×C register matrix caches the window across threads 0–31 of the warp. Each thread combines its cached values v and filter weights w₀, w₁, w₂ into partial sums s₀, s₁, … using the multiply-accumulate update sumᵢ ← vᵢ × sᵢ + sumᵢ₋₁; the partial sums for elements e0…e31 are then exchanged between threads to yield the convolution results. Steps: (1) register cache (32×C); (2) compute partial sums; (3) transfer partial sums.]
Problem
• 2D convolution computation on CUDA-enabled GPUs

Proposal
(1) Use registers as a cache
• Lower latency, higher throughput
• High data reuse
(2) Compute partial sums in parallel
(3) Transfer partial sums in parallel
• No stalling within a thread
• Efficient communication between threads via warp shuffle
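A minimal CPU-side sketch of the three steps above (an illustration under our own assumptions, not the authors' CUDA kernel; `WARP`, `C`, and the 3-tap filter are hypothetical values):

```python
# Emulates the warp-level row convolution steps on the CPU.
WARP = 32                      # threads in a CUDA warp
C = 4                          # elements cached in registers per thread
weights = [0.25, 0.5, 0.25]    # example 3-tap filter (w0, w1, w2)

def warp_convolve_row(row):
    """Convolve one row with `weights` (valid positions only)."""
    K = len(weights)
    # (1) "Register cache": each thread privately caches enough consecutive
    #     input for C outputs (C + K - 1 elements).
    cache = [row[t * C:t * C + C + K - 1] for t in range(WARP)]
    out = []
    for t in range(WARP):
        v = cache[t]
        # (2) Each thread computes its partial sums independently, via the
        #     running multiply-accumulate  sum_i <- v_i * w_i + sum_{i-1}.
        for j in range(len(v) - K + 1):
            s = 0.0
            for i, w in enumerate(weights):
                s = v[j + i] * w + s
            out.append(s)
    # (3) On the GPU, boundary values are exchanged between threads with warp
    #     shuffles; here the overlap built into step (1) plays that role.
    return out

row = [float(x) for x in range(WARP * C + len(weights) - 1)]
result = warp_convolve_row(row)
```

Because the example filter sums to 1 and the row increases linearly, each output equals the centre input of its window, which makes the sketch easy to check by hand.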
Can Local Binary Convolutions Make Neural Networks Models Smaller? [3]
Background
• Local Binary Convolutional Neural Network (LBCNN) models use Local Binary Convolution (LBC) layers
• (+) fewer learned parameters
• (−) lower accuracy

Problem
• Not applicable to large models
• Too slow
• Too much accuracy loss
Proposed Method
• Replace only half of the normal convolution layers with LBC

Results
• Accuracy and speed of the different models
• Models with different numbers of difference maps generated by the LBC filters
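To make the idea concrete, here is a hedged sketch of what an LBC layer computes (our own toy illustration, not the paper's code; `K`, `NUM_MAPS`, and the weights are made up): the convolution filters are fixed random {−1, 0, +1} patterns, and only the 1×1 weights combining the resulting difference maps are learned, which is what shrinks the model.

```python
import random

random.seed(0)
K = 3              # filter size
NUM_MAPS = 4       # number of difference maps (the knob varied in the results)

# Fixed binary "anchor" filters: sampled once, never trained.
anchors = [[[random.choice((-1, 0, 1)) for _ in range(K)] for _ in range(K)]
           for _ in range(NUM_MAPS)]
# Learned part: one scalar weight per difference map (a 1x1 convolution).
combine = [random.uniform(-1, 1) for _ in range(NUM_MAPS)]

def lbc(image):
    """Apply LBC to a 2D list `image`; returns one output map (valid mode)."""
    H, W = len(image), len(image[0])
    out = [[0.0] * (W - K + 1) for _ in range(H - K + 1)]
    for m, f in enumerate(anchors):
        for y in range(H - K + 1):
            for x in range(W - K + 1):
                # Difference map: convolution with a fixed binary filter...
                d = sum(f[i][j] * image[y + i][x + j]
                        for i in range(K) for j in range(K))
                # ...passed through ReLU, then mixed by the learned weight.
                out[y][x] += combine[m] * max(d, 0.0)
    return out

img = [[float(x + y) for x in range(6)] for y in range(6)]
maps = lbc(img)
```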
Learning Neural Representations for Predicting GPU Performance1
Problem l Modeling performance of applications across systems with different
GPU microarchitecturesProposall A collaborative filtering based matrix factorization (MF) model to
automatically learn latent features describing performance ofapplications on systems
l A multi-layer perceptron (MLP) to model complex non-linearinteractions between applications and systems
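A rough sketch of the two prediction styles (a toy illustration under our own assumptions, not the paper's model; `DIM`, the layer sizes, and the random weights are hypothetical): MF scores an application–system pair by a dot product of latent vectors, while the MLP feeds the concatenated embeddings through ReLU layers.

```python
import random

random.seed(1)
DIM = 4  # latent-vector size (hypothetical)

def embed(dim):
    # Stand-in for a learned embedding lookup.
    return [random.uniform(-1, 1) for _ in range(dim)]

app_vec, sys_vec = embed(DIM), embed(DIM)

# Matrix-factorization prediction: inner product of the latent vectors.
mf_score = sum(a * s for a, s in zip(app_vec, sys_vec))

# MLP prediction: concatenation -> hidden ReLU layers -> scalar output.
def mlp(x, layers):
    for W, b in layers:
        x = [max(sum(wi * xi for wi, xi in zip(row, x)) + bj, 0.0)
             for row, bj in zip(W, b)]
    return sum(x)  # final linear readout (unit weights, for brevity)

hidden = 8
layers = [
    ([[random.uniform(-0.5, 0.5) for _ in range(2 * DIM)]
      for _ in range(hidden)], [0.0] * hidden),
    ([[random.uniform(-0.5, 0.5) for _ in range(hidden)]
      for _ in range(hidden)], [0.0] * hidden),
]
mlp_score = mlp(app_vec + sys_vec, layers)
```

The dot product can only capture linear interactions between the two latent vectors; the MLP's hidden layers are what let the model fit the non-linear application/system interactions mentioned above.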
[Figure: Multi-layer perceptron model using latent features. Application and system IDs are mapped to embeddings (application and system latent vectors), which are concatenated and passed through N ReLU layers to produce the predicted score.]
Results
• MLP with 2 layers (MLP-2) achieves 90.6% accuracy when predicting instructions per second (IPS)

Figures: actual vs. predicted normalized IPS values; accuracy of the models on the IPS dataset.
Framework for Transpilation Between Python and Fortran [4]
Fortran
(+) top performance
(+) HPC legacy
(−) hard to maintain

Python
(+) ease of programming
(+) general-purpose tools
(−) big runtime overhead

Dilemma
• Performance or ease of programming?

Solution
• Python for development, Fortran at runtime.
1. Take legacy Fortran
• reuse legacy code
• keep battle-tested implementations
• will be translated by our framework

2. Migrate to Python
• happens once
• semi-automatic or automatic
• performance-critical data is retained (via type hints)
• user can easily extend functionality

3. Translate performance-critical kernels back to Fortran
• JIT, at runtime
• fully automatic
• original performance is retained
• user doesn't interact with Fortran
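The kind of kernel such a framework targets might look like the following (a hedged sketch, not the framework's actual API: function and variable names are our own). The type hints pin every value to a Fortran-compatible type, so a transpiler can lower the loop to an equivalent DO loop without losing performance.

```python
from typing import List

def saxpy(n: int, a: float, x: List[float], y: List[float]) -> List[float]:
    """y <- a*x + y; a loop a transpiler can map 1:1 onto a Fortran DO loop."""
    for i in range(n):            # would translate to: do i = 1, n
        y[i] = a * x[i] + y[i]
    return y

out = saxpy(3, 2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
# out is [12.0, 24.0, 36.0]
```

Without the hints, `x` and `y` could hold arbitrary Python objects and the loop would have to stay in the interpreter; with them, fixed `real`/`integer` types can be emitted directly.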
Results
• Figure 1: DGEMM performance matches Fortran and is 5× better than Numba.
• Figure 2: The migrated Miranda IO benchmark retains its original performance.
Evaluation on an NVIDIA Titan Xp GPU