Hardware-efficient Machine Learning
Marton Havasi, Robert Peharz
Machine Learning Reading Group
Cambridge, 30th of November 2017
Motivation for Hardware-efficient Machine Learning
- current philosophy: stronger, higher, faster
- watching Formula 1 might be fun, but who is going to build the machine learning bicycle?
- applications often come with stringent constraints: embedded systems, autonomous navigation
- insight: what is the “practical complexity” of ML tasks?
Example: Deep Compression [1]
- AlexNet: 240 MB → 6.9 MB
- VGG-16: 552 MB → 11.3 MB
- 3–5× more energy efficient

[1] S. Han, H. Mao, W. J. Dally, Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016.
Binarized Neural Networks (NIPS’16)
- feed-forward neural network
- use binary weights: {−1, +1}
- use threshold units for hidden units:

  sign(x) = +1 if x ≥ 0, −1 if x < 0

- thus, hidden units reduce to XNOR operations, integer accumulation, and thresholding (see the sketch below)
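To make the arithmetic concrete, here is a minimal Python sketch of a binary dot product via XNOR and popcount; the bit-packing convention and the helper name binary_dot are my own illustration, not from the talk:

    def binary_dot(x_bits: int, w_bits: int, n: int) -> int:
        """Dot product of two {-1,+1}^n vectors packed as n-bit integers
        (bit set = +1, bit clear = -1)."""
        mask = (1 << n) - 1
        agree = ~(x_bits ^ w_bits) & mask   # XNOR: bit set where signs agree
        matches = bin(agree).count("1")     # popcount (integer accumulation)
        return 2 * matches - n              # matches minus mismatches

    # x = (+1, -1, +1), w = (+1, +1, +1)  ->  dot product = 1
    print(binary_dot(0b101, 0b111, 3))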
Training
- real-valued weights during training, restricted to [−1, 1]
- binarized (using sign(x)) in forward pass
- “straight-through estimator”: replace sign(x) with the identity f(x) = x in the backprop pass (sketched below)
- “shift-based batch normalization” (test time?)
- ADAM or “shift-based AdaMax” for optimization
- first layer: 8-bit fixed-point multiplication
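A minimal PyTorch sketch of the straight-through estimator, as my own illustration rather than the authors’ code; the forward pass binarizes, the backward pass treats sign as the identity (with the common clipping to |x| ≤ 1):

    import torch

    class BinarizeSTE(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x):
            ctx.save_for_backward(x)
            # sign(x) with sign(0) = +1, matching the threshold unit above
            return torch.where(x >= 0, torch.ones_like(x), -torch.ones_like(x))

        @staticmethod
        def backward(ctx, grad_output):
            (x,) = ctx.saved_tensors
            # straight-through: identity gradient, clipped to |x| <= 1
            return grad_output * (x.abs() <= 1).to(grad_output.dtype)

    w_real = torch.randn(4, requires_grad=True)
    w_bin = BinarizeSTE.apply(w_real)   # binary weights used in the forward pass
    w_bin.sum().backward()              # gradients flow back to w_real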
Classification Results
Energy Considerations (45nm technology)
Runtime using SWAR [2]

[2] SWAR: SIMD within a register; SIMD: single instruction, multiple data.
Overview and Core Idea (submitted to ICLR’18)
cf. Soudry et al., NIPS’14; Hernandez-Lobato and Adams, ICML’15
Model and Approach
- inputs: x_0
- ternary weights {−1, 0, 1}
- L-layer neural net: a_l = W_l x_{l−1}
- sign as non-linearity: x_l = sign(a_l)
- softmax output: exp(a_l) / Σ_{l′} exp(a_{l′}) (not needed in the test phase)
- assume prior p(W), where W = {W_l}_{l=1}^L
- interpret the softmax as likelihood p(D | W)
- infer the posterior p(W | D) ∝ p(D | W) p(W)
Variational Approach
minimize KL(q(W) || p(W | D)) = KL(q(W) || p(W)) − E_{q(W)}[log p(D | W)] + log p(D)
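For completeness, the identity follows by expanding log p(W | D) with Bayes’ rule inside the expectation:

    \begin{align*}
    \mathrm{KL}(q(W)\,\|\,p(W \mid D))
      &= \mathbb{E}_{q(W)}[\log q(W) - \log p(W \mid D)] \\
      &= \mathbb{E}_{q(W)}[\log q(W) - \log p(W) - \log p(D \mid W)] + \log p(D) \\
      &= \mathrm{KL}(q(W)\,\|\,p(W)) - \mathbb{E}_{q(W)}[\log p(D \mid W)] + \log p(D).
    \end{align*}

Since log p(D) is constant in q, minimizing the left-hand side is the same as maximizing the ELBO.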
Further Details
- 3-bit input weights
- activation normalization (divide by √d_{l−1}) during the probabilistic forward pass (PFP)
- re-weighting of the variational objective:

  λ KL(q(W) || p(W)) − (1 − λ) E_{q(W)}[log p(D | W)]

  → corresponds to λ/(1 − λ) copies of the training data
- dropout
- 50 iterations of Spearmint (Bayesian optimization) for hyper-parameters
Results on MNIST Variants
Ensembles of Discrete NNs
Overview (NIPS’17)
- Relation between variational inference and compression.
- Use a Bayesian approach to prune weights and lower precision in neural networks.
- Experimental results.
Optimal encoding
- What is the mean number of bits required to encode samples from a distribution?
- Discretize using buckets of length t and use Huffman coding:

  lim_{t→0} Σ_i P(it ≤ z < (i+1)t) [−log2 P(it ≤ z < (i+1)t)]

- Shannon’s source coding theorem: H(p) = ∫ p(z) [−log2 p(z)] dz
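A numerical illustration (my own sketch, not from the talk): for a standard normal, the bucketed code length behaves like the differential entropy plus log2(1/t) bits of quantization precision for small t; in the KL costs used below, these log2(1/t) terms cancel between q and p, which is why KL(q || p) stays meaningful as t → 0.

    import math

    def ncdf(x):                      # standard normal CDF
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    t = 0.01                          # bucket length
    H_disc = 0.0
    for i in range(-1000, 1000):      # buckets covering [-10, 10)
        p = ncdf((i + 1) * t) - ncdf(i * t)
        if p > 0:
            H_disc -= p * math.log2(p)

    h = 0.5 * math.log2(2 * math.pi * math.e)   # differential entropy of N(0,1)
    print(H_disc)                     # ~8.69 bits
    print(h + math.log2(1 / t))       # ~8.69 bits: h(p) plus precision term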
Bayesian Compression
- What if we only care about the distribution?
- Naive approach: E_q[−log p(z)]
- The information cost is the additional information over the prior: E_q[−log p(z)] − H(q) = KL(q || p)
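The information cost combines into a closed form when q and the prior p are Gaussian; a small illustrative helper (names and values are my own):

    import math

    def kl_gaussians(mu_q, sig_q, mu_p, sig_p):
        """KL(N(mu_q, sig_q^2) || N(mu_p, sig_p^2)), in nats."""
        return (math.log(sig_p / sig_q)
                + (sig_q ** 2 + (mu_q - mu_p) ** 2) / (2 * sig_p ** 2)
                - 0.5)

    print(kl_gaussians(0.0, 1.0, 0.0, 1.0))   # 0.0: matching the prior costs nothing
    print(kl_gaussians(0.5, 0.1, 0.0, 1.0))   # a precise, shifted weight is expensive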
Variational Inference
- Approximate the posterior distribution p(z | x) by maximizing the ELBO:

  ELBO(φ) = E_{q_φ(z)}[log p(x | z)] − KL(q_φ(z) || p(z))
Variational Inference as Occam’s Razor
- Occam’s Razor: what is the simplest explanation of the data?
- What is the information cost of describing the data?
- Complexity cost: L_c = KL(q || p)
- Error cost: L_e = E_q[−log p(x | z)]
- Overall information cost: L_e + L_c = −ELBO(φ)
Hardware-efficient neural networks
- Weight pruning
  - pruning individual weights
  - pruning nodes
- Quantization
  - binary, ternary weights
  - k-means quantization
  - precision quantization
Advantages of the Bayesian approach
- Sparsity-inducing priors for weight pruning.
- Noisy weights allow for a reduced-precision binary encoding.
Training
- Data: D
- Parameters: w
- Variational distribution: q_φ(w)

  ELBO(φ) = E_{q_φ(w)}[log p(D | w)] − KL(q_φ(w) || p(w))

- Reparameterize w = f(φ, ε) (sketched below):

  ELBO(φ) = E_{p(ε)}[log p(D | f(φ, ε))] − KL(q_φ(w) || p(w))
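A minimal sketch of the reparameterization w = f(φ, ε) for a Gaussian q_φ (the Gaussian choice and the 10-dimensional shape are my illustrative assumptions):

    import torch
    import torch.nn.functional as F

    mu = torch.zeros(10, requires_grad=True)    # phi = (mu, rho)
    rho = torch.zeros(10, requires_grad=True)

    def sample_w():
        sigma = F.softplus(rho)      # keep the standard deviation positive
        eps = torch.randn(10)        # eps ~ p(eps) = N(0, I), parameter-free
        return mu + sigma * eps      # w = f(phi, eps), differentiable in phi

    w = sample_w()
    loss = (w ** 2).sum()            # stand-in for -log p(D | w)
    loss.backward()                  # gradients reach mu and rho through f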
Choice of prior
z ∼ p(z)
w ∼ N(w; 0, z²)
Improper log-uniform prior

p(z) ∝ 1/|z|

p(w) ∝ ∫ (1/|z|) N(w; 0, z²) dz ∝ 1/|w|
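The 1/|w| marginal can be verified with the substitution u = |w|/z (a check I am adding; it is not spelled out in the talk):

    \int_0^\infty \frac{1}{z}\,\mathcal{N}(w; 0, z^2)\,dz
      = \int_0^\infty \frac{1}{\sqrt{2\pi}\,z^2}\, e^{-w^2/(2z^2)}\,dz
      \overset{u=|w|/z}{=} \frac{1}{\sqrt{2\pi}\,|w|}\int_0^\infty e^{-u^2/2}\,du
      = \frac{1}{2|w|} \;\propto\; \frac{1}{|w|}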
p(W, z) ∝ Π_{i=1}^{A} (1/|z_i|) · Π_{i,j}^{A,B} N(w_ij; 0, z_i²)

q_φ(W, z) ∝ Π_{i=1}^{A} N(z_i; μ_{z_i}, μ_{z_i}² α_i) · Π_{i,j}^{A,B} N(w_ij; μ_ij z_i, z_i² σ_ij²)
Improper log-uniform prior
- Test time: fix the weights at their means.
- Pruning: prune scale i where log α_i > t
- Precision, from the variance of the effective weight (checked numerically below):

  Var(w_ij) = Var(z_i · (w_ij / z_i))
            = Var(z_i) (E[w_ij / z_i]² + Var(w_ij / z_i)) + Var(w_ij / z_i) E[z_i]²
            = σ_{z_i}² (μ_ij² + σ_ij²) + σ_ij² μ_{z_i}²
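The last line is the standard variance identity for a product of independent random variables; a quick Monte Carlo sanity check with arbitrary illustrative parameters:

    import numpy as np

    rng = np.random.default_rng(0)
    mu_z, sig_z = 1.5, 0.3            # z_i    ~ N(mu_z, sig_z^2)
    mu, sig = -0.7, 0.2               # w/z    ~ N(mu,  sig^2), independent of z_i

    z = rng.normal(mu_z, sig_z, 1_000_000)
    v = rng.normal(mu, sig, 1_000_000)
    w = z * v

    analytic = sig_z**2 * (mu**2 + sig**2) + sig**2 * mu_z**2
    print(w.var(), analytic)          # the two estimates should agree closely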
Half-Cauchy scale prior
- Half-Cauchy density: C⁺(z; 0, s) = 2 (π s (1 + z²/s²))⁻¹
- s ∼ C⁺(0, τ)
- z̃_i ∼ C⁺(0, 1)
- w̃_ij ∼ N(0, 1)
- w_ij = w̃_ij z̃_i s (sampled in the sketch below)
- In the limit, this recovers the improper log-uniform prior.
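Sampling from this hierarchy is straightforward, since a half-Cauchy variate is the absolute value of a Cauchy variate; a sketch with hypothetical layer dimensions A, B:

    import numpy as np

    rng = np.random.default_rng(0)
    A, B, tau = 64, 32, 1.0                        # hypothetical sizes and scale

    s = tau * abs(rng.standard_cauchy())           # s     ~ C+(0, tau)
    z = np.abs(rng.standard_cauchy(size=A))        # z~_i  ~ C+(0, 1)
    w_tilde = rng.normal(size=(A, B))              # w~_ij ~ N(0, 1)
    w = w_tilde * z[:, None] * s                   # w_ij = w~_ij * z~_i * s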
Experiments
Summary
- Used a Bayesian approach to determine the dropout and precision of the weights.
- Sparsity-inducing priors allow for weight pruning.
- Noisy weights allow for reduced precision.
- The performance is on par with existing, less principled approaches.
Thank you!