TRANSCRIPT
Machine Learning Research at Arm
Matthew Mattina, Senior Director, Arm ML Research Lab
Arm ML Research Lab: Vision and Core Research Threads
Arm's ML Research Lab vision is to be a technology leader in efficient ML inference and distributed ML
Efficient Hardware for ML
• Hardware designs and accelerator microarchitecture
• ISA additions
• Exploiting emerging device technology
Model Design & Optimization
• Novel model architectures
• Emerging use cases
• AutoML and Network Architecture Search
Edge-to-Cloud & ML Systems
• Distributed and on-device training
• Model security and model distribution
Image credit: Song Han
Arm ML Research Landscape
[Landscape chart: the three research threads (Efficient Hardware, Model Design & Optimization, Edge-to-Cloud) mapped against engagement stage (Executing, Developing, Watching, plus Tracking and Collaboration). Items on the chart include FixyNN, ISP-ML, Distributed ML & Federated Learning, AutoBOT, PCM/RRAM, Si Photonics, Neural SLAM, Google – Sparsity, MIT – AutoML, Princeton – Bayesian Nets, SystemX – SNORKEL and Edge ML, SRC – C-BRIC and ADA, TinyML, Network Optimization, Security, BU – Training with a Test-Time Budget, LLNL – ML Workloads, BonsEyes platform, Oxford – DNN optimization, Unsupervised Models, GANs, NVM, CPU-ML, PGM & Explainable AI, Datacenter ML, UT Austin – NLP, MNEMOSENE, M0N0, Verification CE, IMP Perception, Predictive Analytics, H2020 SAFE AI, and RSH-ML.]
Arm ML Research Lab
Hokchhay Tann
Partha Maji
We’re Hiring! Openings available in our Boston, Austin, and Cambridge (UK) locations!
Recent Publications from Arm ML Research Lab
I. Fedorov, R. Adams, M. Mattina, P. Whatmough, “SpArSe: Sparse Architecture Search for CNNs on Resource-Constrained Microcontrollers” (NeurIPS ‘19)
Z.-G. Liu, M. Mattina, “Learning low-precision neural networks without Straight-Through Estimator (STE)” (IJCAI ‘19)
D. Gope, G. Dasika, M. Mattina, “Ternary Hybrid Neural-Tree Networks for Highly Constrained IoT Applications,” 2019 Conference on Systems and Machine Learning (SysML ‘19)
P. Whatmough, C. Zhou, P. Hansen, S. Venkataramanaiah, J. Seo, M. Mattina, “FixyNN: Efficient Hardware for Mobile Computer Vision via Transfer Learning,” 2019 Conference on Systems and Machine Learning (SysML ‘19)
U. Thakker, J. Beu, G. Dasika, M. Mattina, “Measuring scheduling efficiency of RNNs for NLP Applications,” International Workshop on Performance Analysis of Machine Learning Systems (FastPath ’19)
U. Thakker, J. Beu, D. Gope, G. Dasika, M. Mattina, “RNN Compression using Hybrid Matrix Decomposition,” (tinyML Summit ’19)
P. Maji, A. Mundy, G. Dasika, J. Beu, M. Mattina, R. Mullins, “Efficient Winograd or Cook-Toom Convolution Kernel Implementation on Widely Used Mobile CPUs,” Workshop on Energy Efficient Machine Learning and Cognitive Computing for Embedded Applications (EMC^2 ‘19)
P. Whatmough, C. Zhou, P. Hansen, M. Mattina, “Energy Efficient Hardware for On-Device CNN Inference via Transfer Learning”, On-Device ML Workshop, Neural Information Processing Systems (NeurIPS ‘18)
Y. Zhu, A. Samajdar, M. Mattina, P. Whatmough, “Euphrates: Algorithm-SoC Co-Design for Low-Power Mobile Continuous Vision”, International Symposium on Computer Architecture (ISCA’18)
University Engagements (University – Topics – Status/Agreement)
• Harvard University (Sasha Rush, David Brooks, Gu-Yeon Wei) – NLP NN/HW co-design – Funded by Arm RSH-ML over three years, 2018-2020
• MIT (HAN Lab, Song Han) – Deep compression, AutoML – Funded by Arm RSH-ML over three years, 2018-2020
• Boston University (Venkatesh Saligrama) – Learning for a test-time budget; learning with limited supervision – Funded by Arm RSH-ML over three years, 2018-2020
• Princeton University (Ryan Adams) – Co-optimization of ML/hardware; simple, robust decision-making machines – Funded by Arm RSH-ML over three years, 2018-2020
• Trinity College Dublin – CALCULUS: performance optimization techniques – Funded by Arm over four years, 2017-2020
• Oxford University (Nic Lane) – Binary network optimization; statistical foundations for network pruning – Funded by iCASE over three years, 2017-2019; PhD student Javier Fernández-Marqués to intern in 2019
• SRC/GRC – JUMP Center liaison: C-BRIC T1 (Neuro-inspired Algorithms & Theory); JUMP Center tracking: ADA (Hadi Esmaeilzadeh, UCSD, cloud-to-edge stack for DNN acceleration) – Funded by Arm RSH
• SystemX, Stanford (Chris Ré) – Computation for Data Analytics; SNORKEL, ML for the edge – Funded by Arm RSH for two Focus Area tokens
• RISELab, Berkeley – Data services vision; Clipper (a general-purpose ML model serving system); Ray (a distributed execution framework for cloud-edge ML) – Funded primarily by ISG plus RSH & IPG contributions
• BonsEyes – Arm-based platforms as example deployments – Funded by EU over three years, 2017-2019
• University of Texas at Austin – Greg Durrett: NLP, scalable training/inference with large data sets; Ruben Rathnasingham: collaboration with Dell Medical – Funded by Arm RSH for 2019; on standby for consulting
• University of Michigan (Honglak Lee) – Generative networks, GANs – Arm RSH / UoM sponsorship, exploring topics for collaboration
• University of Manchester (Gavin Brown) – Ensembles to modular NNs – Exploring topics for collaboration
• University of Cambridge – TBD – CDT program, exploring topics for collaboration
Arm ML Research Lab: Selected Projects
TinyML
What is it?
• “Swimming in sensors, drowning in data”
• Model design and optimization for highly constrained hardware platforms
• Can we get 10X+ reduction in ops or memory with minimal accuracy loss?
Near term results
• Hybrid neural + non-neural techniques
• New training approaches for binary/ternary networks
• Compression techniques for recurrent neural networks (RNNs) that operate on time-series data
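As a rough, hypothetical illustration of the kind of RNN compression referenced above (plain truncated-SVD low-rank factorization rather than the hybrid matrix decomposition used in the Arm work), a recurrent weight matrix could be shrunk like this:

```python
import numpy as np

# Illustrative only: compress a recurrent weight matrix W (h x h) by
# replacing it with a rank-r factorization U_r @ V_r. This is generic
# low-rank SVD, not the exact hybrid decomposition from the Arm paper.
def low_rank_compress(W, rank):
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]        # (h x r), singular values folded in
    V_r = Vt[:rank, :]                  # (r x h)
    return U_r, V_r

h, r = 256, 32                          # hypothetical hidden size and rank
W = np.random.randn(h, h).astype(np.float32)
U_r, V_r = low_rank_compress(W, r)

orig_params = W.size                    # 65,536 weights
comp_params = U_r.size + V_r.size       # 16,384 weights (4x smaller)
print(orig_params, comp_params)
# The recurrent matvec W @ x is then approximated as U_r @ (V_r @ x).
```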
BBC Micro:Bit (Arm Cortex M0, 16KB RAM)
LPCXpresso 1125 (Arm Cortex M0, 8KB SRAM)
M0N0 (Arm Cortex M33, 16KB SRAM)
TinyML: HybridNet
“DS-CNN” is a highly optimized network for the keyword spotting (KWS) task
• How do we optimize it further at iso-accuracy?
Ternarize weight values using Strassen's algorithm
• Overall memory footprint reduced by 30%
Selectively use decision trees to reduce compute
• Total number of operations reduced by 12%
Less than 0.3% loss in accuracy for these savings
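For intuition, a minimal sketch of generic threshold-based ternarization is shown below; it illustrates where the memory savings come from, but it is not the Strassen-based ternarization or the training procedure used in the SysML'19 paper:

```python
import numpy as np

def ternarize(W, threshold_factor=0.7):
    """Map FP32 weights to {-alpha, 0, +alpha} using the common |w| > t
    heuristic. The Arm paper's training recipe is more involved."""
    t = threshold_factor * np.mean(np.abs(W))        # per-tensor threshold
    mask = np.abs(W) > t                             # which weights survive
    alpha = np.mean(np.abs(W[mask])) if mask.any() else 0.0
    return alpha * np.sign(W) * mask                 # ternary tensor

W = np.random.randn(64, 64).astype(np.float32)
W_t = ternarize(W)
# Each ternary weight needs ~2 bits instead of 32 bits, which is where
# the memory-footprint savings quoted on this slide come from.
```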
[Charts: accuracy (%) vs. overall memory footprint (KB) and accuracy (%) vs. number of operations (M), comparing DS-CNN and ST-HybridNet.]
Published in SysML’19 - https://arxiv.org/abs/1903.01531
AutoBot
What is it?
• Automate Neural Architecture Search (NAS) on Arm
• Incorporate information about Arm hardware into the optimization flow
• Reduce search runtime
Near term goal: Top-Down (Optimization)
1. Input a trained model
2. Optimize for Arm IP – reduce latency/energy at iso-accuracy
Long term goal: Bottom-Up (Design)
1. Input a dataset
2. Create a from-scratch model optimized for Arm IP
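A minimal, hypothetical sketch of what "incorporating hardware information into the optimization flow" can look like; the function names and the latency-penalty objective are illustrative, not the actual AutoBot formulation:

```python
# Hypothetical hardware-aware NAS scoring: reward accuracy, penalize
# candidates whose estimated latency on the target Arm IP exceeds a budget.
def score_candidate(accuracy, est_latency_ms, budget_ms=10.0, penalty=0.05):
    over = max(0.0, est_latency_ms - budget_ms)
    return accuracy - penalty * over

def search(candidates, latency_model, evaluate):
    """candidates: iterable of architecture configs
       latency_model: maps a config to estimated latency on the Arm IP
       evaluate: trains/evaluates a config and returns validation accuracy"""
    return max(candidates,
               key=lambda c: score_candidate(evaluate(c), latency_model(c)))
```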
[Chart: model QoR vs. optimization runtime, comparing Top-Down (Model Opt), Bottom-Up (Global NAS), and MicroBrew (Local NAS).]
Accepted at NeurIPS’19 - https://arxiv.org/abs/1905.12107
ML Convolution Kernels in ArmCL
New optimized FP32 depthwise kernel
• Depthwise-separable convolutions consist of depthwise and pointwise layers
• RSH contributed new techniques for performing depthwise convolution
• NEON-optimized, cache-friendly direct convolution outperforms the GEMV-based method
• Activation fusion further reduces memory traffic
• 6x speedup compared to previous ArmCL FP32 depthwise kernels
• Overall 2x performance uplift for the whole MobileNet v1 model
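For reference, a naive NumPy sketch of the two stages of a depthwise-separable layer (assumed NHWC layout, stride 1, no padding); the ArmCL kernel computes the same thing with NEON-optimized, cache-friendly direct convolution:

```python
import numpy as np

def depthwise_conv(x, dw):            # x: (H, W, C), dw: (k, k, C)
    k, _, C = dw.shape
    H, W, _ = x.shape
    out = np.zeros((H - k + 1, W - k + 1, C), dtype=x.dtype)
    for c in range(C):                # one filter per input channel
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j, c] = np.sum(x[i:i+k, j:j+k, c] * dw[:, :, c])
    return out

def pointwise_conv(x, pw):            # pw: (C_in, C_out), a 1x1 convolution
    return x @ pw                     # mixes channels at every pixel

x  = np.random.randn(8, 8, 16).astype(np.float32)
dw = np.random.randn(3, 3, 16).astype(np.float32)
pw = np.random.randn(16, 32).astype(np.float32)
y  = pointwise_conv(depthwise_conv(x, dw), pw)   # shape (6, 6, 32)
```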
[Chart: runtime normalized to original, Original ACL vs. Latest ACL, broken down into Convolution, Depthwise, and Other; MobileNet v1 1.0/224, FP32, on 1x Cortex-A73.]
ML Winograd Kernels in ArmCL
4x speedup to ArmCL FP32 convolution
• Introduced Winograd convolution to improve ArmCL CPU performance
• Winograd lowers convolution to element-wise multiplication in a transformed domain, reducing the number of multiplications
• Contributed code and analysis to ArmCL
• Up to 1.6x whole-network speedup, depending on the proportion of network time spent in convolution
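A worked 1-D example of the idea, using the standard Winograd F(2,3) transforms (Lavin & Gray); ArmCL applies the 2-D analogue per tile:

```python
import numpy as np

# Winograd F(2,3): two outputs of a 3-tap convolution with 4 element-wise
# multiplications instead of 6.
BT = np.array([[1, 0, -1,  0],
               [0, 1,  1,  0],
               [0, -1, 1,  0],
               [0, 1,  0, -1]], dtype=np.float32)
G  = np.array([[1.0,  0.0, 0.0],
               [0.5,  0.5, 0.5],
               [0.5, -0.5, 0.5],
               [0.0,  0.0, 1.0]], dtype=np.float32)
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float32)

d = np.array([1., 2., 3., 4.], dtype=np.float32)   # 4 input samples
g = np.array([0.5, 1., -1.], dtype=np.float32)     # 3 filter taps

m = (G @ g) * (BT @ d)        # 4 element-wise multiplications
y = AT @ m                    # 2 outputs

# Reference: direct sliding dot product for the same 2 outputs.
ref = np.array([np.dot(d[0:3], g), np.dot(d[1:4], g)])
assert np.allclose(y, ref)
```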
[Chart: ArmCL performance – whole-network speedup due to Winograd across VGG19, VGG16, Inception v3, SqueezeNet, and SqueezeNet v1.1.]
FixyNN
What is it?
• Accelerator concept to push TOPS/W/mm2
Goals
1. Aggressive HW specialization via transfer learning
2. Implement a TF -> Verilog tool (DeepFreeze) to understand the PPA benefit of fixed-weight datapaths
3. ML experiments to understand transfer learning
4. System model for iso-area comparison with a baseline
Results
• Energy efficiency of up to 11.2 TOPS/W – nearly 2× more efficient than NVDLA alone in the same area
– 1.42x TOPS/W by fixing 4/13 layers
– 1.92x TOPS/W by fixing 7/13 layers
• Accuracy loss of < 1% over six datasets
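A software-level sketch of the transfer-learning split FixyNN exploits, with hypothetical layer choices: the shared front-end is frozen (in hardware, its weights are baked into the fixed datapath) while only a small task-specific back-end remains programmable and trainable:

```python
import torch
import torch.nn as nn

# Hypothetical module shapes; stands in for the first N conv layers that
# FixyNN would fix in hardware as the shared feature extractor.
shared_frontend = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
)
for p in shared_frontend.parameters():
    p.requires_grad = False               # "fixed" feature extractor

task_backend = nn.Sequential(             # trained per task via transfer learning
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, 10),
)

x = torch.randn(1, 3, 32, 32)
logits = task_backend(shared_frontend(x))
```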
[Diagram: FixyNN hardware. The input feeds a shared front-end – the Fixed Feature Extractor (FFE) – which is fully parallel, fully pipelined, uses zero DRAM bandwidth, and has its weights hard-coded in a fixed datapath. It feeds a programmable CNN accelerator back-end with task-specific CONV/POOL/FC layers for Task 1 .. Task N (e.g. producing the label “CAT”), whose weights are stored in DRAM and staged through SRAM.]
DeepFreeze: https://github.com/ARM-software/DeepFreeze
FixyNN: https://arxiv.org/abs/1902.11128
Discussion
Analog HW and Non-neural models
The case for non-digital neural network accelerators
What are the most promising alternatives to digital CMOS?
What improvements in performance/W are possible?
What are the challenges? ADC/DACs? Noise?
[Diagram: a fully-connected layer with inputs x1–x4 and outputs y1–y4, shown alongside an analog resistive crossbar.]
$y_j = \sum_{i=1}^{n} w_{ij}\, x_i \;\equiv\; i_j = \sum_{k=1}^{n} g_{kj}\, v_k$, with conductance $g = \frac{1}{R}$
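A toy numeric check of the analogy, assuming ideal devices (no ADC/DAC or noise effects): the crossbar's column currents compute exactly the same dot products as the digital multiply-accumulates.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.uniform(0.1, 1.0, size=(4, 3))   # digital weights w_ij
x = rng.uniform(0.0, 1.0, size=4)        # digital activations x_i

y_digital = x @ W                        # y_j = sum_i w_ij * x_i

G = W                                    # conductances g = 1/R encode the weights
v = x                                    # input voltages encode the activations
i_analog = v @ G                         # i_j = sum_k g_kj * v_k (Kirchhoff sum)

assert np.allclose(y_digital, i_analog)
print(y_digital)
```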
The case for non-neural network model architectures
What are the most promising non-neural alternatives to neural network models?
What problems do they solve?
In what ways are they better than neural models?
What are the challenges with non-neural network models?
From “Neuromorphic Computing: A bio-plausible route toward spike-based machine intelligence,” K. Roy et al.
Thank You! Danke! Merci! 谢谢! ありがとう! Gracias! Kiitos!