Hardware-efficient Machine Learning
Marton Havasi, Robert Peharz
Machine Learning Reading Group
Cambridge, 30th of November 2017
Motivation for Hardware-efficient Machine Learning
- current philosophy: stronger, higher, faster
- watching Formula 1 might be fun, but who is going to build the machine learning bicycle?
- applications often come with stringent constraints: embedded systems, autonomous navigation
- insight: what is the “practical complexity” of ML tasks?
Example: Deep Compression [1]
- AlexNet: 240 MB → 6.9 MB
- VGG-16: 552 MB → 11.3 MB
- 3–5× more energy efficient

[1] S. Han, H. Mao, W. J. Dally, Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016.
Binarized Neural Networks (NIPS’16)
- feed-forward neural network
- use binary weights: {−1, +1}
- use threshold units for hidden units:

  sign(x) = +1 if x ≥ 0, −1 if x < 0

- thus, hidden units reduce to XNOR operations, integer accumulation, and thresholding (see the sketch below)
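To make the arithmetic concrete, here is a minimal Python sketch of a binary dot product via XNOR and popcount; the bit-packing convention and the helper name binary_dot are my own illustration, not from the talk:

    def binary_dot(x_bits: int, w_bits: int, n: int) -> int:
        """Dot product of two {-1,+1}^n vectors packed as n-bit integers
        (bit set = +1, bit clear = -1)."""
        mask = (1 << n) - 1
        agree = ~(x_bits ^ w_bits) & mask   # XNOR: bit set where signs agree
        matches = bin(agree).count("1")     # popcount (integer accumulation)
        return 2 * matches - n              # matches minus mismatches

    # x = (+1, -1, +1), w = (+1, +1, +1)  ->  dot product = 1
    print(binary_dot(0b101, 0b111, 3))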
Training
- real-valued weights during training, restricted to [−1, 1]
- binarized (using sign(x)) in forward pass
- “straight-through estimator”: replace sign(x) with the identity f(x) = x in the backprop pass (sketched below)
- “shift-based batch normalization” (test time?)
- ADAM or “shift-based AdaMax” for optimization
- first layer: 8-bit fixed-point multiplication
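A minimal PyTorch sketch of the straight-through estimator, as my own illustration rather than the authors’ code; the forward pass binarizes, the backward pass treats sign as the identity (with the common clipping to |x| ≤ 1):

    import torch

    class BinarizeSTE(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x):
            ctx.save_for_backward(x)
            # sign(x) with sign(0) = +1, matching the threshold unit above
            return torch.where(x >= 0, torch.ones_like(x), -torch.ones_like(x))

        @staticmethod
        def backward(ctx, grad_output):
            (x,) = ctx.saved_tensors
            # straight-through: identity gradient, clipped to |x| <= 1
            return grad_output * (x.abs() <= 1).to(grad_output.dtype)

    w_real = torch.randn(4, requires_grad=True)
    w_bin = BinarizeSTE.apply(w_real)   # binary weights used in the forward pass
    w_bin.sum().backward()              # gradients flow back to w_real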
Classification Results
Energy Considerations (45nm technology)
Runtime using SWAR [2]

[2] SWAR: SIMD within a register; SIMD: single instruction, multiple data.
Overview and Core Idea (submitted to ICLR’18)
cf. Soudry et al., NIPS’14; Hernandez-Lobato and Adams, ICML’15
Model and Approach
- inputs: x_0
- ternary weights {−1, 0, 1}
- L-layer neural net: a_l = W_l x_{l−1}
- sign as non-linearity: x_l = sign(a_l)
- softmax output: exp(a_l) / Σ_{l′} exp(a_{l′}) (not needed in the test phase)
- assume prior p(W), where W = {W_l}_{l=1}^L
- interpret the softmax as likelihood p(D | W)
- infer the posterior p(W | D) ∝ p(D | W) p(W)
Variational Approach
minimize KL(q(W) || p(W | D)) = KL(q(W) || p(W)) − E_{q(W)}[log p(D | W)] + log p(D)
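For completeness, the identity follows by expanding log p(W | D) with Bayes’ rule inside the expectation:

    \begin{align*}
    \mathrm{KL}(q(W)\,\|\,p(W \mid D))
      &= \mathbb{E}_{q(W)}[\log q(W) - \log p(W \mid D)] \\
      &= \mathbb{E}_{q(W)}[\log q(W) - \log p(W) - \log p(D \mid W)] + \log p(D) \\
      &= \mathrm{KL}(q(W)\,\|\,p(W)) - \mathbb{E}_{q(W)}[\log p(D \mid W)] + \log p(D).
    \end{align*}

Since log p(D) is constant in q, minimizing the left-hand side is the same as maximizing the ELBO.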
Further Details
- 3-bit input weights
- activation normalization (divide by √d_{l−1}) during the probabilistic forward pass (PFP)
- re-weighting of the variational objective:

  λ KL(q(W) || p(W)) − (1 − λ) E_{q(W)}[log p(D | W)]

  → corresponds to λ/(1 − λ) copies of the training data
- dropout
- 50 iterations of Spearmint (Bayesian optimization) for hyper-parameters
Results on MNIST Variants
Ensembles of Discrete NNs
Overview (NIPS’17)
- Relation between variational inference and compression.
- Use a Bayesian approach to prune weights and lower precision in neural networks.
- Experimental results.
Optimal encoding
- What is the mean number of bits required to encode samples from a distribution?
- Discretize using buckets of length t and use Huffman coding:

  lim_{t→0} Σ_i P(it ≤ z < (i+1)t) [−log2 P(it ≤ z < (i+1)t)]

- Shannon’s source coding theorem: H(p) = ∫ p(z) [−log2 p(z)] dz
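A numerical illustration (my own sketch, not from the talk): for a standard normal, the bucketed code length behaves like the differential entropy plus log2(1/t) bits of quantization precision for small t; in the KL costs used below, these log2(1/t) terms cancel between q and p, which is why KL(q || p) stays meaningful as t → 0.

    import math

    def ncdf(x):                      # standard normal CDF
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    t = 0.01                          # bucket length
    H_disc = 0.0
    for i in range(-1000, 1000):      # buckets covering [-10, 10)
        p = ncdf((i + 1) * t) - ncdf(i * t)
        if p > 0:
            H_disc -= p * math.log2(p)

    h = 0.5 * math.log2(2 * math.pi * math.e)   # differential entropy of N(0,1)
    print(H_disc)                     # ~8.69 bits
    print(h + math.log2(1 / t))       # ~8.69 bits: h(p) plus precision term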
Bayesian Compression
- What if we only care about the distribution?
- Naive approach: E_q[−log p(z)]
- The information cost is the additional information over the prior: E_q[−log p(z)] − H(q) = KL(q || p)
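The information cost combines into a closed form when q and the prior p are Gaussian; a small illustrative helper (names and values are my own):

    import math

    def kl_gaussians(mu_q, sig_q, mu_p, sig_p):
        """KL(N(mu_q, sig_q^2) || N(mu_p, sig_p^2)), in nats."""
        return (math.log(sig_p / sig_q)
                + (sig_q ** 2 + (mu_q - mu_p) ** 2) / (2 * sig_p ** 2)
                - 0.5)

    print(kl_gaussians(0.0, 1.0, 0.0, 1.0))   # 0.0: matching the prior costs nothing
    print(kl_gaussians(0.5, 0.1, 0.0, 1.0))   # a precise, shifted weight is expensive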
Variational Inference
- Approximate the posterior distribution p(z | x) by maximizing the ELBO:

  ELBO(φ) = E_{q_φ(z)}[log p(x | z)] − KL(q_φ(z) || p(z))
Variational Inference as Occam’s Razor
- Occam’s Razor: what is the simplest explanation of the data?
- What is the information cost of describing the data?
- Complexity cost: L_c = KL(q || p)
- Error cost: L_e = E_q[−log p(x | z)]
- Overall information cost: L_e + L_c = −ELBO(φ)
Hardware-efficient neural networks
- Weight pruning
  - pruning individual weights
  - pruning nodes
- Quantization
  - binary, ternary weights
  - k-means quantization
  - precision quantization
Advantages of the Bayesian approach
- Sparsity-inducing priors for weight pruning.
- Noisy weights allow for a reduced-precision binary encoding.
Training
- Data: D
- Parameters: w
- Variational distribution: q_φ(w)

  ELBO(φ) = E_{q_φ(w)}[log p(D | w)] − KL(q_φ(w) || p(w))

- Reparameterize w = f(φ, ε) (sketched below):

  ELBO(φ) = E_{p(ε)}[log p(D | f(φ, ε))] − KL(q_φ(w) || p(w))
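A minimal sketch of the reparameterization w = f(φ, ε) for a Gaussian q_φ (the Gaussian choice and the 10-dimensional shape are my illustrative assumptions):

    import torch
    import torch.nn.functional as F

    mu = torch.zeros(10, requires_grad=True)    # phi = (mu, rho)
    rho = torch.zeros(10, requires_grad=True)

    def sample_w():
        sigma = F.softplus(rho)      # keep the standard deviation positive
        eps = torch.randn(10)        # eps ~ p(eps) = N(0, I), parameter-free
        return mu + sigma * eps      # w = f(phi, eps), differentiable in phi

    w = sample_w()
    loss = (w ** 2).sum()            # stand-in for -log p(D | w)
    loss.backward()                  # gradients reach mu and rho through f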
Choice of prior
z ∼ p(z)
w ∼ N(w; 0, z²)
Improper log-uniform prior

p(z) ∝ 1/|z|

p(w) ∝ ∫ (1/|z|) N(w; 0, z²) dz ∝ 1/|w|
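The 1/|w| marginal can be verified with the substitution u = |w|/z (a check I am adding; it is not spelled out in the talk):

    \int_0^\infty \frac{1}{z}\,\mathcal{N}(w; 0, z^2)\,dz
      = \int_0^\infty \frac{1}{\sqrt{2\pi}\,z^2}\, e^{-w^2/(2z^2)}\,dz
      \overset{u=|w|/z}{=} \frac{1}{\sqrt{2\pi}\,|w|}\int_0^\infty e^{-u^2/2}\,du
      = \frac{1}{2|w|} \;\propto\; \frac{1}{|w|}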
p(W, z) ∝ Π_{i=1}^{A} (1/|z_i|) · Π_{i,j}^{A,B} N(w_ij; 0, z_i²)

q_φ(W, z) ∝ Π_{i=1}^{A} N(z_i; μ_{z_i}, μ_{z_i}² α_i) · Π_{i,j}^{A,B} N(w_ij; μ_ij z_i, z_i² σ_ij²)
Improper log-uniform prior
- Test time: fix the weights at their means.
- Pruning: prune scale i where log α_i > t
- Precision, from the variance of the effective weight (checked numerically below):

  Var(w_ij) = Var(z_i · (w_ij / z_i))
            = Var(z_i) (E[w_ij / z_i]² + Var(w_ij / z_i)) + Var(w_ij / z_i) E[z_i]²
            = σ_{z_i}² (μ_ij² + σ_ij²) + σ_ij² μ_{z_i}²
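The last line is the standard variance identity for a product of independent random variables; a quick Monte Carlo sanity check with arbitrary illustrative parameters:

    import numpy as np

    rng = np.random.default_rng(0)
    mu_z, sig_z = 1.5, 0.3            # z_i    ~ N(mu_z, sig_z^2)
    mu, sig = -0.7, 0.2               # w/z    ~ N(mu,  sig^2), independent of z_i

    z = rng.normal(mu_z, sig_z, 1_000_000)
    v = rng.normal(mu, sig, 1_000_000)
    w = z * v

    analytic = sig_z**2 * (mu**2 + sig**2) + sig**2 * mu_z**2
    print(w.var(), analytic)          # the two estimates should agree closely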
Half-Cauchy scale prior
- Half-Cauchy density: C⁺(z; 0, s) = 2 (π s (1 + z²/s²))⁻¹
- s ∼ C⁺(0, τ)
- z̃_i ∼ C⁺(0, 1)
- w̃_ij ∼ N(0, 1)
- w_ij = w̃_ij z̃_i s (sampled in the sketch below)
- In the limit, this recovers the improper log-uniform prior.
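Sampling from this hierarchy is straightforward, since a half-Cauchy variate is the absolute value of a Cauchy variate; a sketch with hypothetical layer dimensions A, B:

    import numpy as np

    rng = np.random.default_rng(0)
    A, B, tau = 64, 32, 1.0                        # hypothetical sizes and scale

    s = tau * abs(rng.standard_cauchy())           # s     ~ C+(0, tau)
    z = np.abs(rng.standard_cauchy(size=A))        # z~_i  ~ C+(0, 1)
    w_tilde = rng.normal(size=(A, B))              # w~_ij ~ N(0, 1)
    w = w_tilde * z[:, None] * s                   # w_ij = w~_ij * z~_i * s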
Experiments
Summary
- Used a Bayesian approach to determine the dropout and precision of the weights.
- Sparsity-inducing priors allow for weight pruning.
- Noisy weights allow for reduced precision.
- The performance is on par with existing, less principled approaches.
Thank you!