practical advice for building neural nets deep learning and neural nets spring 2015

Practical Advice For Building Neural Nets

Deep Learning and Neural Nets, Spring 2015

Day’s Agenda

1. Celebrity guests

2. Discuss issues and observations from homework

3. Catrin Mills on climate change in the Arctic

4. Practical issues in building nets

Homework Discussion

Lei’s problem with the perceptron not converging

demo

Manjunath’s questions about priors

Why does the balance of data make a difference in what is learned? (I believe it will for Assignment 2)

If I’d told you the priors should be 50/50, what would you have to do differently?

What’s the relationship between statistics of the training set and the test set?

Suppose we have a classifier that outputs a probability (not just 0 or 1). If you know the priors in the training set and you know the priors in the test set to be different, how do you repair the classifier to accommodate the mismatch between training and testing priors?
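One standard answer to the last question can be sketched as follows, assuming that only the class priors differ between the training and test sets while the class-conditional densities stay the same. Each output probability is reweighted by the ratio of test prior to training prior and the result is renormalized. The function name `adjust_for_priors` and the example numbers are illustrative, not from the lecture.

```python
def adjust_for_priors(probs, train_priors, test_priors):
    """Rescale class posteriors learned under train_priors to test_priors.

    Assumes p(x | class) is identical in both sets; only p(class) differs.
    """
    weighted = [p * (q_test / q_train)
                for p, q_train, q_test in zip(probs, train_priors, test_priors)]
    total = sum(weighted)
    return [w / total for w in weighted]


# A classifier trained on 50/50 data outputs 0.5 for each class, but the
# second class makes up only 10% of the test population.
adjusted = adjust_for_priors([0.5, 0.5], [0.5, 0.5], [0.9, 0.1])
```

With an uninformative output of 0.5 per class, the adjusted probabilities fall back to the test priors, which is the behavior one would want.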

Weight Initialization

Break symmetries

use small random values

Weight magnitudes should depend on fan in

What Mike has always done

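Draw all weights feeding into neuron j (including bias) from a zero-mean Gaussian, e.g., $w_{ji} \sim \mathcal{N}(0, 1)$

Normalize the weights such that their L1 norm, $\sum_i |w_{ji}|$, equals a fixed constant, i.e., divide each $w_{ji}$ by $\sum_i |w_{ji}|$ and multiply by that constant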

Works well for logistic units; to be determined for ReLU units
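A minimal sketch of this scheme, assuming that "normalize" means rescaling each neuron's incoming weights (bias included) to a fixed sum of absolute values, which matches the "L1 normalization" name used later in the lecture. The target value is not preserved in this text, so `l1_norm` is a placeholder parameter, and the function name is hypothetical.

```python
import numpy as np


def init_l1_normalized(fan_in, n_units, l1_norm=1.0, rng=None):
    """Gaussian weights, rescaled so each unit's incoming weights
    (bias included) have a fixed sum of absolute values.

    l1_norm is a placeholder: the target constant is an assumption.
    """
    rng = np.random.default_rng() if rng is None else rng
    # One row per unit j; the extra column holds the bias weight.
    w = rng.standard_normal((n_units, fan_in + 1))
    w *= l1_norm / np.abs(w).sum(axis=1, keepdims=True)
    return w


weights = init_l1_normalized(fan_in=100, n_units=20,
                             rng=np.random.default_rng(0))
```

Because of the rescaling, the variance of the initial Gaussian draw has no effect on the final weights. With inputs in [-1, +1], the net input to each unit is bounded in magnitude by `l1_norm`, which is the worst-case conditioning this scheme aims for.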

Weight Initialization

A perhaps better idea (due to Kelvin Wagner)

Draw all weights feeding into neuron j (including bias) via $w_{ji} \sim \mathcal{N}(0, 1)$

If input activities lie in [-1, +1], then variance of input to unit j grows with fan-in to j, $f_j$

Normalize such that variance of input is equal to $C^2$

i.e., $w_{ji} \leftarrow C \, w_{ji} / \sqrt{f_j}$

If input activities lie in [-1, +1], most net inputs will be in [-2C, +2C]
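The fan-in scaling can be sketched as below, assuming the weights end up distributed as N(0, C²/fan-in), which matches the N(0, 4/FanIn) and N(0, 1/FanIn) schemes tested later in the lecture (C = 2 and C = 1). The function name, the uniform test inputs, and the choice to exclude the bias from the fan-in count are assumptions made for illustration.

```python
import numpy as np


def init_fan_in_scaled(fan_in, n_units, c=1.0, rng=None):
    """Draw weights from N(0, c**2 / fan_in), bias column included."""
    rng = np.random.default_rng() if rng is None else rng
    w = rng.standard_normal((n_units, fan_in + 1))
    return c * w / np.sqrt(fan_in)


rng = np.random.default_rng(0)
C = 2.0
w = init_fan_in_scaled(fan_in=400, n_units=50, c=C, rng=rng)

# Inputs uniform in [-1, +1], plus a constant 1 feeding the bias weight.
x = np.hstack([rng.uniform(-1, 1, size=(1000, 400)), np.ones((1000, 1))])
net = x @ w.T

# Fraction of net inputs that land inside [-2C, +2C].
frac_within = np.mean(np.abs(net) <= 2 * C)
```

Inputs that use only part of the [-1, +1] range have variance below 1, so the net inputs are typically tighter than the [-2C, +2C] bound suggests.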

Does Weight Initialization Scheme Matter?

Trained 3 layer net on 2500 digit patterns

10 output classes, tanh units

20-1600 hidden units

automatic adaptation of learning rates

10 minibatches per epoch

500 epochs

12 replications of each architecture

Does Weight Initialization Scheme Matter?

Weight initialization schemes

Gaussian random weights, N(0, .001²)

Gaussian random weights, N(0, .01²)

L1 constraint on Gaussian random weights [conditioning for worst case]

Gaussian weights, N(0, 4/FanIn) [conditioning for average case]

Gaussian weights, N(0, 1/FanIn) [conditioning for average case]

Small Random Weights

Strangeness

training set can’t be learned (caveat: plotting accuracy, not MSE)

if there’s overfitting, it doesn’t happen until 200 hidden units

Mike’s L1 Normalization Scheme

About the same as small random weights

Normalization Based On Fan In

Perfect performance on training set

Test set performance dependent on scaling

Conditioning The Input Vectors

If $m_i$ is the mean activity of input unit i over the training set and $s_i$ is the std dev over the training set

For each input (in both training and test sets), normalize by $x_i \leftarrow (x_i - m_i) / s_i$

where $m_i$ is the training set mean activity and $s_i$ is the std deviation of the training set activities
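A short sketch of this conditioning step, assuming inputs are stored as a NumPy array with one row per example. The key point is that the mean and standard deviation come from the training set only and are reused unchanged on the test set. The helper names and the `eps` guard against constant inputs are additions for illustration.

```python
import numpy as np


def fit_conditioner(train_x, eps=1e-8):
    """Per-input mean and std dev, computed on the training set only."""
    m = train_x.mean(axis=0)
    s = train_x.std(axis=0)
    # eps guards against constant inputs (an addition, not in the lecture).
    return m, np.maximum(s, eps)


def condition(x, m, s):
    """Shift and scale each input unit by the training-set statistics."""
    return (x - m) / s


rng = np.random.default_rng(0)
train_x = rng.normal(5.0, 3.0, size=(200, 4))
test_x = rng.normal(5.0, 3.0, size=(50, 4))

m, s = fit_conditioner(train_x)
train_z = condition(train_x, m, s)
test_z = condition(test_x, m, s)  # reuse the training statistics
```

After conditioning, each training input has mean 0 and standard deviation 1; the test inputs will be close to that but not exact, which is expected.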

Conditioning The Hidden Units

If you’re using logistic units, then replace logistic output with function scaled from -1 to +1, $y = 2 / (1 + e^{-net}) - 1$

With net=0, y=0

Will tend to cause biases to be closer to zero and more on the same scale as other weights in network

Will also satisfy assumption I make to condition initial weights and weight updates for the units in the next layer

tanh function
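The rescaled logistic and tanh are closely related, which the following sketch checks numerically: stretching the logistic from (0, 1) to (-1, +1) gives exactly tanh evaluated at half the net input. The function names are illustrative.

```python
import math


def logistic(net):
    return 1.0 / (1.0 + math.exp(-net))


def scaled_logistic(net):
    """Logistic output stretched from (0, 1) to (-1, +1); y = 0 at net = 0."""
    return 2.0 * logistic(net) - 1.0


# The rescaled logistic equals tanh evaluated at half the net input.
for net in (-3.0, -0.5, 0.0, 0.5, 3.0):
    assert abs(scaled_logistic(net) - math.tanh(net / 2.0)) < 1e-12
```

So switching from logistic to tanh units amounts to recentering the output at zero and changing the slope, not changing the family of functions the unit can compute.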

Setting Learning Rates I

Initial guess for learning rate

If error doesn’t drop consistently, lower initial learning rate and try again

If error falls reliably but slowly, increase learning rate.

Toward end of training

Error will often jitter, at which point you can gradually lower the learning rate toward 0 to clean up weights

Remember, plateaus in error often look like minima

be patient

have some idea a priori how well you expect your network to be doing, and print statistics during training that tell you how well it’s doing

plot epochwise error as a function of epoch, even if you’re doing minibatches

Setting Learning Rates II

Momentum

Adaptive and neuron-specific learning rates

Observe error on epoch t-1 and epoch t

If decreasing, then increase global learning rate, $\epsilon_{global}$, by an additive constant

If increasing, decrease global learning rate by a multiplicative constant

If fan-in of neuron j is $f_j$, then $\epsilon_j = \epsilon_{global} / f_j$

Setting Learning Rates III

Mike’s hack

Initialization
    epsilon = .01
    inc = epsilon / 10
    if (batch_mode_training)
        scale = .5
    else
        scale = .9

Update
    if (current_epoch_error < previous_epoch_error)
        epsilon = epsilon + inc
        saved_weights = weights
    else
        epsilon = epsilon * scale
        inc = epsilon / 10
        if (batch_mode_training)
            weights = saved_weights
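The hack above could be written in Python roughly as follows. This is a sketch: the class name `AdaptiveRate`, the use of `copy.deepcopy` to snapshot weights, and the guard for the case where no weights have been saved yet are additions, and the caller is assumed to supply the epoch errors and the current weights.

```python
import copy


class AdaptiveRate:
    """Grow the learning rate additively while error falls, cut it
    multiplicatively when error rises."""

    def __init__(self, batch_mode, epsilon=0.01):
        self.batch_mode = batch_mode
        self.epsilon = epsilon
        self.inc = epsilon / 10
        self.scale = 0.5 if batch_mode else 0.9
        self.saved_weights = None

    def update(self, current_error, previous_error, weights):
        """Adjust epsilon after an epoch; return the weights to continue from."""
        if current_error < previous_error:
            self.epsilon += self.inc
            self.saved_weights = copy.deepcopy(weights)
        else:
            self.epsilon *= self.scale
            self.inc = self.epsilon / 10
            # In batch mode, discard the bad step and roll back.
            # The None check is an added guard, not part of the original.
            if self.batch_mode and self.saved_weights is not None:
                weights = copy.deepcopy(self.saved_weights)
        return weights


rate = AdaptiveRate(batch_mode=True)
w = rate.update(0.40, 0.50, [1.0, 2.0])  # error fell: epsilon grows by inc
w = rate.update(0.45, 0.40, [1.5, 2.5])  # error rose: halve epsilon, roll back
```

The rollback applies only in batch mode, presumably because a rise in full-batch error reliably signals an overshoot, whereas minibatch error is noisy.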

Setting Learning Rates IV

rmsprop

Hinton lecture
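For reference, a minimal sketch of the rmsprop update as it is commonly stated: keep a running average of the squared gradient for each weight and divide the gradient by its square root. The hyperparameter values, the `eps` term, and the toy quadratic objective are assumptions for illustration, not taken from this lecture.

```python
import numpy as np


def rmsprop_step(w, grad, mean_sq, lr=0.001, decay=0.9, eps=1e-8):
    """One rmsprop update: scale the gradient by a running RMS of its
    recent magnitude, separately for each weight."""
    mean_sq = decay * mean_sq + (1.0 - decay) * grad ** 2
    w = w - lr * grad / (np.sqrt(mean_sq) + eps)
    return w, mean_sq


# Toy problem: minimize f(w) = sum(w**2), whose gradient is 2w.
w = np.array([1.0, -2.0])
mean_sq = np.zeros_like(w)
for _ in range(2000):
    w, mean_sq = rmsprop_step(w, 2.0 * w, mean_sq, lr=0.01)
```

Because the step size is roughly `lr` regardless of gradient magnitude, the weights hover near the minimum instead of settling exactly, which is one reason to lower the learning rate toward the end of training.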

Exploit optimization methods using curvature

Requires computation of Hessian

When To Stop Training

1. Train n epochs; lower learning rate; train m epochs

bad idea: can’t assume one-size-fits-all approach

2. Error-change criterion

stop when error isn’t dropping

My recommendation: criterion based on % drop over a window of, say, 10 epochs

1 epoch is too noisy

absolute error criterion is too problem dependent

Karl’s idea: train for a fixed number of epochs after criterion is reached (possibly with lower learning rate)
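The windowed criterion might look like this in code. It is a sketch: the 1% threshold, the function name, and the handling of a non-positive reference error are assumptions, and `errors` is taken to be the list of epochwise errors recorded so far.

```python
def should_stop(errors, window=10, min_pct_drop=1.0):
    """True when error fell by less than min_pct_drop percent
    over the last `window` epochs."""
    if len(errors) <= window:
        return False  # not enough history yet
    old, new = errors[-window - 1], errors[-1]
    if old <= 0:
        return True  # nothing left to reduce (assumed convention)
    pct_drop = 100.0 * (old - new) / old
    return pct_drop < min_pct_drop


# Error that keeps falling by 10% per epoch, then flattens for 10 epochs.
falling = [0.9 ** t for t in range(20)]
plateau = falling + [falling[-1]] * 10

stop_while_falling = should_stop(falling)
stop_on_plateau = should_stop(plateau)
```

Using a percentage makes the threshold independent of the error scale, which addresses the objection to absolute error criteria. Since plateaus can look like minima, Karl's idea of training a fixed number of extra epochs after the criterion fires is a reasonable safeguard.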

When To Stop Training

3. Weight-change criterion

Compare weights at epochs t-10 and t and test: $\max_i |w_i^{t} - w_i^{t-10}| < \theta$ (for a threshold $\theta$)

Don’t base on length of overall weight change vector

Possibly express as a percentage of the weight

Be cautious: small weight changes at critical points can result in rapid drop in error
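A sketch of a weight-change test, assuming the criterion is applied to the largest change in any single weight, which is consistent with the advice not to use the length of the overall change vector. The threshold values, the `relative` option for the percentage variant, and the function name are illustrative.

```python
import numpy as np


def weights_converged(w_old, w_new, threshold=1e-3, relative=False, eps=1e-12):
    """Test the largest single-weight change, not the norm of the
    whole change vector."""
    change = np.abs(w_new - w_old)
    if relative:
        # Express each change as a fraction of the old weight's magnitude.
        change = change / (np.abs(w_old) + eps)
    return bool(change.max() < threshold)


w_t_minus_10 = np.array([0.50, -1.20, 3.00])
w_t = np.array([0.5001, -1.2002, 3.02])

absolute_ok = weights_converged(w_t_minus_10, w_t)
relative_ok = weights_converged(w_t_minus_10, w_t,
                                threshold=0.01, relative=True)
```

In this example the third weight moved by 0.02, which fails the absolute test but is under 1% of its magnitude, so the two variants disagree. That disagreement is the reason for considering the percentage form.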
