
Page 1: Document5

CONTENTS

1. Introduction

2. Probability Distributions

3. Linear Models for Regression

4. Linear Models for Classification

5. Neural Networks

6. Kernel Methods

Page 2: Document5

Feed-forward Network Functions

• Multilayer Perceptron (MLP)

• Mathematical representations of information processing in biological systems

• Linear models for regression and classification

• Neural network model

Page 3: Document5

• The overall two-layer network function takes the form
$$y_k(\mathbf{x}, \mathbf{w}) = \sigma\!\left( \sum_{j=1}^{M} w_{kj}^{(2)}\, h\!\left( \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)} \right) + w_{k0}^{(2)} \right),$$
where $h(\cdot)$ is the hidden-unit activation function and, for binary targets, $\sigma(a) = 1/(1 + e^{-a})$ is the logistic sigmoid.

Page 4: Document5

• For multiclass problems the output activation is instead the softmax function,
$$y_k(\mathbf{x}, \mathbf{w}) = \frac{\exp(a_k)}{\sum_{j} \exp(a_j)}.$$

• The neural network model is simply a nonlinear function from a set of input variables $\{x_i\}$ to a set of output variables $\{y_k\}$, controlled by a vector $\mathbf{w}$ of adjustable parameters; its hidden nodes represent deterministic variables.

• We can absorb the biases by defining additional input variables $x_0 = 1$ and $z_0 = 1$, giving the compact form
$$y_k(\mathbf{x}, \mathbf{w}) = \sigma\!\left( \sum_{j=0}^{M} w_{kj}^{(2)}\, h\!\left( \sum_{i=0}^{D} w_{ji}^{(1)} x_i \right) \right).$$
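As an illustration, a minimal NumPy sketch of this two-layer forward pass; the layer sizes, the tanh hidden activation, and the random weights below are assumptions for illustration, not taken from the slides:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W1, W2):
    """Two-layer feed-forward network with biases absorbed via x0 = z0 = 1.

    W1 : (M, D+1) first-layer weights, W2 : (K, M+1) second-layer weights.
    """
    x_tilde = np.concatenate(([1.0], x))      # prepend x0 = 1
    a_hidden = W1 @ x_tilde                   # first-layer activations a_j
    z = np.tanh(a_hidden)                     # hidden-unit outputs z_j = h(a_j)
    z_tilde = np.concatenate(([1.0], z))      # prepend z0 = 1
    a_out = W2 @ z_tilde                      # output-layer activations a_k
    return sigmoid(a_out)                     # outputs y_k

# Example: D = 3 inputs, M = 4 hidden units, K = 2 outputs (all invented)
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(4, 4))       # (M, D+1)
W2 = rng.normal(scale=0.5, size=(2, 5))       # (K, M+1)
print(forward(np.array([0.2, -1.0, 0.5]), W1, W2))
```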

Page 5: Document5

Neural networks are said to be “universal approximators”.


Page 7: Document5

Objective for Network Training

• Given input vectors $\{\mathbf{x}_n\}$ and target vectors $\{\mathbf{t}_n\}$, $n = 1, \dots, N$, training is based on the minimization of a sum-of-squares error function
$$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \| \mathbf{y}(\mathbf{x}_n, \mathbf{w}) - \mathbf{t}_n \|^2.$$

Relation to maximum likelihood criterion

• Assuming a Gaussian conditional distribution $p(t \mid \mathbf{x}, \mathbf{w}) = \mathcal{N}\!\left(t \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1}\right)$ and taking the negative logarithm of the likelihood, we have
$$\frac{\beta}{2} \sum_{n=1}^{N} \{ y(\mathbf{x}_n, \mathbf{w}) - t_n \}^2 - \frac{N}{2} \ln \beta + \frac{N}{2} \ln(2\pi),$$
so maximizing the likelihood is equivalent to minimizing the sum-of-squares error.

Page 8: Document5

• Maximum likelihood thus reduces to minimization of the sum-of-squares error function, giving $\mathbf{w}_{\mathrm{ML}}$; the noise precision then follows from
$$\frac{1}{\beta_{\mathrm{ML}}} = \frac{1}{N} \sum_{n=1}^{N} \{ y(\mathbf{x}_n, \mathbf{w}_{\mathrm{ML}}) - t_n \}^2.$$

• For the case of multiple target variables (assumed independent, with a shared noise precision $\beta$),
$$\frac{1}{\beta_{\mathrm{ML}}} = \frac{1}{NK} \sum_{n=1}^{N} \| \mathbf{y}(\mathbf{x}_n, \mathbf{w}_{\mathrm{ML}}) - \mathbf{t}_n \|^2.$$
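A small sketch of these two quantities, assuming the network predictions are already available as an N x K array; the toy data below is invented:

```python
import numpy as np

def sum_of_squares_error(Y, T):
    """E(w) = 1/2 * sum_n ||y(x_n, w) - t_n||^2 for predictions Y and targets T (both N x K)."""
    return 0.5 * np.sum((Y - T) ** 2)

def beta_ml(Y, T):
    """ML noise precision for multiple targets: 1/beta = (1/NK) * sum_n ||y_n - t_n||^2."""
    N, K = T.shape
    return 1.0 / (np.sum((Y - T) ** 2) / (N * K))

# Toy check with random "network outputs" standing in for y(x_n, w_ML)
rng = np.random.default_rng(1)
T = rng.normal(size=(100, 2))
Y = T + rng.normal(scale=0.3, size=T.shape)   # residual std 0.3 -> beta around 1/0.09
print(sum_of_squares_error(Y, T), beta_ml(Y, T))
```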

Page 9: Document5

Cross Entropy Error Function

• Consider a binary classification with a single output $y = \sigma(a)$ interpreted as $p(C_1 \mid \mathbf{x})$ and target $t \in \{0, 1\}$; let
$$p(t \mid \mathbf{x}, \mathbf{w}) = y(\mathbf{x}, \mathbf{w})^{t}\, \{1 - y(\mathbf{x}, \mathbf{w})\}^{1-t}.$$

• Cross-entropy error function:
$$E(\mathbf{w}) = -\sum_{n=1}^{N} \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \}.$$

• In case of K separate binary classifications,
$$p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}) = \prod_{k=1}^{K} y_k(\mathbf{x}, \mathbf{w})^{t_k}\, \{1 - y_k(\mathbf{x}, \mathbf{w})\}^{1-t_k}.$$

Page 10: Document5

• Taking the negative logarithm of the likelihood function then gives the cross-entropy error function
$$E(\mathbf{w}) = -\sum_{n=1}^{N} \sum_{k=1}^{K} \{ t_{nk} \ln y_{nk} + (1 - t_{nk}) \ln(1 - y_{nk}) \},$$
where $y_{nk} = y_k(\mathbf{x}_n, \mathbf{w})$.

• In case of standard multiclass classification with a 1-of-K coding scheme, the outputs are given by the softmax and the error becomes
$$E(\mathbf{w}) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \ln y_k(\mathbf{x}_n, \mathbf{w}).$$
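A short sketch of the two error functions, assuming the network outputs are already valid probabilities; the clipping constant and toy values are illustrative:

```python
import numpy as np

def binary_cross_entropy(y, t, eps=1e-12):
    """E(w) = -sum_n { t_n ln y_n + (1 - t_n) ln(1 - y_n) } for sigmoid outputs y in (0, 1)."""
    y = np.clip(y, eps, 1.0 - eps)
    return -np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))

def multiclass_cross_entropy(Y, T, eps=1e-12):
    """E(w) = -sum_n sum_k t_nk ln y_nk for softmax outputs Y and 1-of-K targets T (both N x K)."""
    return -np.sum(T * np.log(np.clip(Y, eps, 1.0)))

# Toy usage with hand-picked probabilities standing in for network outputs
y = np.array([0.9, 0.2, 0.7])
t = np.array([1.0, 0.0, 1.0])
print(binary_cross_entropy(y, t))

Y = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
T = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
print(multiclass_cross_entropy(Y, T))
```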


Page 12: Document5

Parameter optimization

• The smallest value of $E(\mathbf{w})$ will occur at a point in weight space at which the gradient vanishes, $\nabla E(\mathbf{w}) = \mathbf{0}$.

Local quadratic approximation

• If $\hat{\mathbf{w}}$ is some point in weight space around which we expand, the Taylor expansion is
$$E(\mathbf{w}) \simeq E(\hat{\mathbf{w}}) + (\mathbf{w} - \hat{\mathbf{w}})^{\mathrm{T}} \mathbf{b} + \tfrac{1}{2} (\mathbf{w} - \hat{\mathbf{w}})^{\mathrm{T}} \mathbf{H} (\mathbf{w} - \hat{\mathbf{w}}),$$
where $\mathbf{b} = \left.\nabla E\right|_{\mathbf{w} = \hat{\mathbf{w}}}$ and $\mathbf{H} = \nabla\nabla E$ is the Hessian matrix.

Page 13: Document5

• We obtain the corresponding local approximation to the gradient, $\nabla E \simeq \mathbf{b} + \mathbf{H}(\mathbf{w} - \hat{\mathbf{w}})$.

• Let $\mathbf{w}^{\star}$ be a local minimum; then $\nabla E(\mathbf{w}^{\star}) = \mathbf{0}$ and
$$E(\mathbf{w}) \simeq E(\mathbf{w}^{\star}) + \tfrac{1}{2} (\mathbf{w} - \mathbf{w}^{\star})^{\mathrm{T}} \mathbf{H} (\mathbf{w} - \mathbf{w}^{\star}).$$

• For a geometric interpretation, we can perform an eigen-analysis of the Hessian,
$$\mathbf{H} \mathbf{u}_i = \lambda_i \mathbf{u}_i,$$
where the $\mathbf{u}_i$ are orthonormal vectors, i.e., $\mathbf{u}_i^{\mathrm{T}} \mathbf{u}_j = \delta_{ij}$.

• We expand $\mathbf{w} - \mathbf{w}^{\star} = \sum_i \alpha_i \mathbf{u}_i$, then
$$E(\mathbf{w}) = E(\mathbf{w}^{\star}) + \tfrac{1}{2} \sum_i \lambda_i \alpha_i^2.$$

Page 14: Document5

• $\mathbf{H}$ is positive definite if, and only if, $\mathbf{v}^{\mathrm{T}} \mathbf{H} \mathbf{v} > 0$ for all $\mathbf{v} \neq \mathbf{0}$, or equivalently if its eigenvalues $\lambda_i$ are all positive.
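A brief numerical sketch of the quadratic approximation and the eigenvalue test for positive definiteness; the toy Hessian and minimum below are invented:

```python
import numpy as np

def quadratic_approx(E_wstar, H, w, w_star):
    """Local quadratic approximation E(w) ~ E(w*) + 1/2 (w - w*)^T H (w - w*) at a minimum w*."""
    d = w - w_star
    return E_wstar + 0.5 * d @ H @ d

def is_positive_definite(H):
    """H is positive definite iff all eigenvalues of the (symmetric) Hessian are positive."""
    return bool(np.all(np.linalg.eigvalsh(H) > 0.0))

# Toy symmetric matrix standing in for the Hessian at a minimum
H = np.array([[2.0, 0.5],
              [0.5, 1.0]])
w_star = np.array([1.0, -1.0])
print(is_positive_definite(H))
print(quadratic_approx(0.3, H, np.array([1.2, -0.8]), w_star))
```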

Page 15: Document5

Gradient Descent Optimization

• Batch gradient descent:
$$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta \nabla E\!\left(\mathbf{w}^{(\tau)}\right),$$
where the parameter $\eta > 0$ is the learning rate.

• Sequential (on-line) gradient descent: if the error function is a sum over data points, $E(\mathbf{w}) = \sum_{n} E_n(\mathbf{w})$, then the update is made one data point at a time,
$$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta \nabla E_n\!\left(\mathbf{w}^{(\tau)}\right).$$
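A sketch contrasting the batch and sequential updates, using a linear model's per-point sum-of-squares gradient as a stand-in for the network gradient; the data and learning rate are illustrative assumptions:

```python
import numpy as np

# Gradient of the per-point error E_n(w) = 1/2 (w^T x_n - t_n)^2 for a linear model
# standing in for the network (the update rules themselves are unchanged).
def grad_En(w, x_n, t_n):
    return (w @ x_n - t_n) * x_n

def batch_gd(w, X, t, eta=0.01, iters=200):
    """w <- w - eta * grad E(w), with E(w) = sum_n E_n(w)."""
    for _ in range(iters):
        w = w - eta * sum(grad_En(w, x_n, t_n) for x_n, t_n in zip(X, t))
    return w

def sequential_gd(w, X, t, eta=0.01, epochs=200):
    """w <- w - eta * grad E_n(w), one data point at a time."""
    for _ in range(epochs):
        for x_n, t_n in zip(X, t):
            w = w - eta * grad_En(w, x_n, t_n)
    return w

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 3))
w_true = np.array([1.0, -2.0, 0.5])
t = X @ w_true
print(batch_gd(np.zeros(3), X, t))        # both should approach w_true
print(sequential_gd(np.zeros(3), X, t))
```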

Page 16: Document5

Error Backpropagation Algorithm

• We try to find an efficient way of evaluating the gradient $\nabla E_n(\mathbf{w})$ for a feed-forward neural network.

• In a general feed-forward network, each unit computes a weighted sum of its inputs,
$$a_j = \sum_i w_{ji} z_i, \qquad z_j = h(a_j).$$

Page 17: Document5

Evaluation of Error-Function Derivatives

• Define $\delta_j \equiv \partial E_n / \partial a_j$. For output units, $\delta_k = y_k - t_k$; for hidden units,
$$\delta_j = h'(a_j) \sum_k w_{kj} \delta_k,$$
where the sum runs over all units $k$ to which unit $j$ sends a connection.

• Finally,
$$\frac{\partial E_n}{\partial w_{ji}} = \delta_j z_i.$$

Page 18: Document5

Error Backpropagation Procedure

1. Apply an input vector $\mathbf{x}_n$ to the network and forward propagate using $a_j = \sum_i w_{ji} z_i$ and $z_j = h(a_j)$ to find the activations of all hidden and output units.

2. Evaluate $\delta_k = y_k - t_k$ for all output units.

3. Backpropagate the $\delta$'s using $\delta_j = h'(a_j) \sum_k w_{kj} \delta_k$ to obtain $\delta_j$ for each hidden unit.

4. Find the required derivatives $\partial E_n / \partial w_{ji} = \delta_j z_i$ for all weights in the first and second layers (a code sketch follows below).
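A sketch of this four-step procedure for a two-layer network, assuming tanh hidden units and linear outputs with a sum-of-squares error (so that $\delta_k = y_k - t_k$); the finite-difference comparison at the end is only a sanity check, and all sizes and data are invented:

```python
import numpy as np

def backprop_single(x, t, W1, W2):
    """Backpropagation for one data point in a two-layer network.

    Hidden units use tanh (h'(a) = 1 - tanh(a)^2); outputs are linear with a
    sum-of-squares error, for which delta_k = y_k - t_k.
    W1 : (M, D+1), W2 : (K, M+1); biases absorbed via x0 = z0 = 1.
    Returns the gradients dE/dW1, dE/dW2.
    """
    # Step 1: forward pass
    x_tilde = np.concatenate(([1.0], x))
    a1 = W1 @ x_tilde
    z = np.tanh(a1)
    z_tilde = np.concatenate(([1.0], z))
    y = W2 @ z_tilde
    # Step 2: output deltas
    delta_out = y - t
    # Step 3: backpropagate to the hidden units (skip the bias column of W2)
    delta_hidden = (1.0 - z ** 2) * (W2[:, 1:].T @ delta_out)
    # Step 4: required derivatives dE/dw_ji = delta_j * z_i
    grad_W2 = np.outer(delta_out, z_tilde)
    grad_W1 = np.outer(delta_hidden, x_tilde)
    return grad_W1, grad_W2

# Finite-difference check on a single first-layer weight (hypothetical small network)
rng = np.random.default_rng(3)
x, t = rng.normal(size=3), rng.normal(size=2)
W1, W2 = rng.normal(size=(4, 4)), rng.normal(size=(2, 5))
g1, _ = backprop_single(x, t, W1, W2)

def E(W1_):
    z = np.tanh(W1_ @ np.concatenate(([1.0], x)))
    y = W2 @ np.concatenate(([1.0], z))
    return 0.5 * np.sum((y - t) ** 2)

eps = 1e-6
W1p = W1.copy(); W1p[0, 1] += eps
print(g1[0, 1], (E(W1p) - E(W1)) / eps)   # the two values should agree closely
```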

Page 19: Document5

Logistic Sigmoid Activation Function

• For the logistic sigmoid, $h(a) = \sigma(a) = \dfrac{1}{1 + e^{-a}}$, the derivative needed in the backward pass takes the simple form $h'(a) = \sigma(a)\{1 - \sigma(a)\}$.

Page 20: Document5

The Hessian Matrix

• The backpropagation algorithm for the first derivatives can be extended to the second derivatives of the error, given by
$$\frac{\partial^2 E}{\partial w_{ji}\, \partial w_{lk}}.$$

• The Hessian matrix plays an important role in many aspects of neural computing, and various approximation schemes have been used to evaluate it for a neural network.

• Exact computation of the Hessian is also available, at a cost of $O(W^2)$ where $W$ is the number of weights.

Page 21: Document5

Diagonal Approximation

• For this case, the diagonal elements are computed by
$$\frac{\partial^2 E_n}{\partial w_{ji}^2} = \frac{\partial^2 E_n}{\partial a_j^2}\, z_i^2, \qquad
\frac{\partial^2 E_n}{\partial a_j^2} = h'(a_j)^2 \sum_k \sum_{k'} w_{kj} w_{k'j} \frac{\partial^2 E_n}{\partial a_k\, \partial a_{k'}} + h''(a_j) \sum_k w_{kj} \frac{\partial E_n}{\partial a_k}.$$

• If we neglect off-diagonal elements in the second-derivative terms, we obtain
$$\frac{\partial^2 E_n}{\partial a_j^2} = h'(a_j)^2 \sum_k w_{kj}^2 \frac{\partial^2 E_n}{\partial a_k^2} + h''(a_j) \sum_k w_{kj} \frac{\partial E_n}{\partial a_k}.$$

Page 22: Document5

Outer Product Approximation

• In case of a single output with a sum-of-squares error, the Hessian is
$$\mathbf{H} = \nabla\nabla E = \sum_{n=1}^{N} \nabla y_n (\nabla y_n)^{\mathrm{T}} + \sum_{n=1}^{N} (y_n - t_n) \nabla\nabla y_n.$$

• If the network is well trained, so that $y_n \simeq t_n$, the second term can be neglected and we have the outer-product approximation
$$\mathbf{H} \simeq \sum_{n=1}^{N} \mathbf{b}_n \mathbf{b}_n^{\mathrm{T}}, \qquad \mathbf{b}_n = \nabla y_n = \nabla a_n.$$

Page 23: Document5

Inverse Hessian

• Using the outer-product approximation,
$$\mathbf{H}_N = \sum_{n=1}^{N} \mathbf{b}_n \mathbf{b}_n^{\mathrm{T}},$$
where $\mathbf{b}_n = \nabla_{\mathbf{w}} a_n$.

• Sequential procedure for building the Hessian: after the first $L$ data points,
$$\mathbf{H}_{L+1} = \mathbf{H}_L + \mathbf{b}_{L+1} \mathbf{b}_{L+1}^{\mathrm{T}}.$$

• Using the Woodbury identity
$$\left(\mathbf{M} + \mathbf{v}\mathbf{v}^{\mathrm{T}}\right)^{-1} = \mathbf{M}^{-1} - \frac{\mathbf{M}^{-1}\mathbf{v}\,\mathbf{v}^{\mathrm{T}}\mathbf{M}^{-1}}{1 + \mathbf{v}^{\mathrm{T}}\mathbf{M}^{-1}\mathbf{v}},$$
we obtain
$$\mathbf{H}_{L+1}^{-1} = \mathbf{H}_L^{-1} - \frac{\mathbf{H}_L^{-1}\mathbf{b}_{L+1}\mathbf{b}_{L+1}^{\mathrm{T}}\mathbf{H}_L^{-1}}{1 + \mathbf{b}_{L+1}^{\mathrm{T}}\mathbf{H}_L^{-1}\mathbf{b}_{L+1}}.$$
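A sketch of the sequential inverse-Hessian update under the outer-product approximation; the small initializer $\alpha\mathbf{I}$ (needed so the first inverse exists) and the random gradient vectors are assumptions:

```python
import numpy as np

def sequential_inverse_hessian(B, alpha=1e-3):
    """Build H^{-1} sequentially under the outer-product approximation H = sum_n b_n b_n^T.

    B : (N, W) matrix whose rows are the gradient vectors b_n.
    Initialized with H_0 = alpha * I so that the first inverse exists.
    """
    W = B.shape[1]
    H_inv = np.eye(W) / alpha
    for b in B:
        Hb = H_inv @ b
        # Rank-one (Woodbury / Sherman-Morrison) update of the inverse
        H_inv = H_inv - np.outer(Hb, Hb) / (1.0 + b @ Hb)
    return H_inv

rng = np.random.default_rng(4)
B = rng.normal(size=(50, 5))
alpha = 1e-3
H = alpha * np.eye(5) + B.T @ B          # outer-product Hessian plus the same initializer
print(np.allclose(sequential_inverse_hessian(B, alpha), np.linalg.inv(H)))  # expect True
```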

Page 24: Document5

5.5 Regularization in Neural Networks

• The optimum value of M, the number of hidden units, gives the best generalization performance, i.e., the optimum balance between under-fitting and over-fitting (Figure 5.9).

• Alternatively, choose a relatively large M and control complexity by adding a regularization term to the error function,
$$\widetilde{E}(\mathbf{w}) = E(\mathbf{w}) + \frac{\lambda}{2} \mathbf{w}^{\mathrm{T}} \mathbf{w}.$$

• This regularizer is also known as weight decay, and $\lambda$ is the regularization coefficient.

Page 25: Document5

5.5.1 Consistent Gaussian priors

• Simple weight decay is inconsistent with certain scaling properties of network mappings: because it treats all weights and biases on an equal footing, it does not satisfy the requirement that linearly transforming the inputs or outputs should yield an equivalent network with correspondingly rescaled weights.

• The regularizer is therefore modified to
$$\frac{\lambda_1}{2} \sum_{w \in \mathcal{W}_1} w^2 + \frac{\lambda_2}{2} \sum_{w \in \mathcal{W}_2} w^2,$$
where $\mathcal{W}_1$ and $\mathcal{W}_2$ denote the sets of first-layer and second-layer weights.

• This corresponds to a prior of the form
$$p(\mathbf{w} \mid \alpha_1, \alpha_2) \propto \exp\!\left( -\frac{\alpha_1}{2} \sum_{w \in \mathcal{W}_1} w^2 - \frac{\alpha_2}{2} \sum_{w \in \mathcal{W}_2} w^2 \right).$$

Page 26: Document5

5.5.2 Early stopping

• Early stopping is a way of controlling the effective complexity of a network.

• Our goal is to achieve the best generalization: training is stopped at the point of smallest error on an independent validation set, rather than at the minimum of the training-set error (Figure 5.12: training set error vs. validation set error). A sketch of this procedure is given below.
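A generic sketch of early stopping with a patience counter; the callbacks, patience value, and toy quadratic errors are illustrative assumptions rather than the book's exact procedure:

```python
import numpy as np

def train_with_early_stopping(init_w, grad_E_train, error_valid, eta=0.01,
                              max_iters=10000, patience=20):
    """Gradient-descent training that keeps the weights with the lowest validation error.

    grad_E_train(w) -> gradient of the training error; error_valid(w) -> validation error.
    Training stops after `patience` iterations without validation improvement.
    """
    w = init_w.copy()
    best_w, best_err, since_best = w.copy(), error_valid(w), 0
    for _ in range(max_iters):
        w = w - eta * grad_E_train(w)
        err = error_valid(w)
        if err < best_err:
            best_w, best_err, since_best = w.copy(), err, 0
        else:
            since_best += 1
            if since_best >= patience:
                break                      # validation error no longer improving
    return best_w, best_err

# Toy usage: "training" and "validation" errors from two slightly different quadratics
grad_train = lambda w: 2.0 * (w - 1.0)
valid_err = lambda w: float(np.sum((w - 1.2) ** 2))
print(train_with_early_stopping(np.zeros(2), grad_train, valid_err))
```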

Page 27: Document5

• Early stopping exhibits similar behavior to regularization using a weight-decay term.

• Figure 5.13: a schematic illustration of why early stopping can give similar results to weight decay in the case of a quadratic error function.

Page 28: Document5

5.5.3 Invariances

• The training set is augmented using replicas of the training patterns, transformed according to the desired invariances, e.g., translation or scale invariance (Figure 5.14).
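A minimal sketch of this kind of augmentation, using horizontal shifts of image arrays as a stand-in for the desired transformation; the array shapes and shift amounts are assumptions:

```python
import numpy as np

def augment_with_shifts(images, labels, shifts=(-1, 1)):
    """Augment a training set with horizontally shifted replicas of each image.

    images : (N, H, W) array; labels : (N,) array. Horizontal translation is used
    here as a stand-in for whatever invariance is desired.
    """
    aug_images, aug_labels = [images], [labels]
    for s in shifts:
        aug_images.append(np.roll(images, s, axis=2))   # shift along the width axis
        aug_labels.append(labels)
    return np.concatenate(aug_images), np.concatenate(aug_labels)

# Toy usage with random 8x8 "images"
rng = np.random.default_rng(5)
X, y = rng.random((10, 8, 8)), rng.integers(0, 2, size=10)
X_aug, y_aug = augment_with_shifts(X, y)
print(X_aug.shape, y_aug.shape)   # (30, 8, 8) (30,)
```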

Page 29: Document5

5.5.4 Tangent propagation

• We can use regularization to encourage models to be invariant to transformations of the input through the technique of “tangent propagation” (Figure 5.15).

Page 30: Document5

5.5.4 Tangent propagation

• Let $\mathbf{s}(\mathbf{x}_n, \xi)$ be the vector that results from acting on $\mathbf{x}_n$ by the transformation, with $\mathbf{s}(\mathbf{x}, 0) = \mathbf{x}$, where $\xi$ is a parameter. The tangent vector at the point $\mathbf{x}_n$ is given by
$$\boldsymbol{\tau}_n = \left.\frac{\partial \mathbf{s}(\mathbf{x}_n, \xi)}{\partial \xi}\right|_{\xi = 0}.$$

• Under a transformation of the input vector, the network output vector will, in general, change; its derivative with respect to $\xi$ is
$$\left.\frac{\partial y_k}{\partial \xi}\right|_{\xi=0} = \sum_{i=1}^{D} J_{ki}\, \tau_i,$$
where $J_{ki}$ is the $(k, i)$ element of the Jacobian matrix $\mathbf{J}$.

Page 31: Document5

5.5.4 Tangent propagation

• Encourage local invariance in the neighborhood of the data points by the addition of a regularization function $\Omega$, giving the total error
$$\widetilde{E} = E + \lambda \Omega,$$
where $\lambda$ is a regularization coefficient and
$$\Omega = \frac{1}{2} \sum_n \sum_k \left( \left.\frac{\partial y_{nk}}{\partial \xi}\right|_{\xi=0} \right)^{\!2} = \frac{1}{2} \sum_n \sum_k \left( \sum_i J_{nki}\, \tau_{ni} \right)^{\!2}.$$

• Notes:
  1. $\Omega = 0$ in case the network mapping function is invariant under the transformation in the neighborhood of each pattern vector.
  2. The value of $\lambda$ determines the balance between fitting the training data and learning the invariance property.
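A sketch of evaluating $\Omega$ numerically, approximating each Jacobian-tangent product $\mathbf{J}_n \boldsymbol{\tau}_n$ by a central finite difference of the network output; the toy "network" and tangent vectors are invented:

```python
import numpy as np

def tangent_regularizer(net, X, tangents, eps=1e-5):
    """Omega = 1/2 sum_n sum_k (sum_i J_nki * tau_ni)^2, with the Jacobian-tangent
    product J_n tau_n approximated by a central finite difference of the network output.

    net(x) -> output vector y(x); X : (N, D) inputs; tangents : (N, D) tangent vectors tau_n.
    """
    omega = 0.0
    for x_n, tau_n in zip(X, tangents):
        # Directional derivative of y along tau_n, i.e. J(x_n) @ tau_n
        jv = (net(x_n + eps * tau_n) - net(x_n - eps * tau_n)) / (2.0 * eps)
        omega += 0.5 * float(np.sum(jv ** 2))
    return omega

# Toy usage: a fixed random nonlinear "network" and random tangent vectors
rng = np.random.default_rng(6)
A = rng.normal(size=(2, 3))
net = lambda x: np.tanh(A @ x)
X = rng.normal(size=(5, 3))
tangents = rng.normal(size=(5, 3))
print(tangent_regularizer(net, X, tangents))
```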

Page 32: Document5

• Figure 5.16


Page 33: Document5

5.5.5 Training with transformed data

• Consider perturbation of the inputs by the transformation $\mathbf{s}(\mathbf{x}, \xi)$, where the parameter $\xi$ is drawn from a density $p(\xi)$ with zero mean and small variance.

• Expand the transformation as a Taylor series in $\xi$:
$$\mathbf{s}(\mathbf{x}, \xi) = \mathbf{x} + \xi \boldsymbol{\tau} + \frac{\xi^2}{2} \boldsymbol{\tau}' + O(\xi^3).$$

Page 34: Document5

5.5.5 Training with transformed data

• Expand the model function to give
$$y(\mathbf{s}(\mathbf{x}, \xi)) = y(\mathbf{x}) + \xi\, \boldsymbol{\tau}^{\mathrm{T}} \nabla y(\mathbf{x}) + \frac{\xi^2}{2} \left[ (\boldsymbol{\tau}')^{\mathrm{T}} \nabla y(\mathbf{x}) + \boldsymbol{\tau}^{\mathrm{T}} \nabla\nabla y(\mathbf{x})\, \boldsymbol{\tau} \right] + O(\xi^3).$$

• Substituting into the error function and expanding as a Taylor series in $\xi$ then allows the average over the transformation to be taken term by term.

Page 35: Document5

5.5.5 Training with transformed data

• Omitting terms of $O(\xi^3)$, the average error function is
$$\widetilde{E} = E + \lambda \Omega,$$
where the regularization term takes the form
$$\Omega = \frac{1}{2} \int \left[ \{ y(\mathbf{x}) - \mathbb{E}[t \mid \mathbf{x}] \} \left\{ (\boldsymbol{\tau}')^{\mathrm{T}} \nabla y(\mathbf{x}) + \boldsymbol{\tau}^{\mathrm{T}} \nabla\nabla y(\mathbf{x})\, \boldsymbol{\tau} \right\} + \left( \boldsymbol{\tau}^{\mathrm{T}} \nabla y(\mathbf{x}) \right)^{2} \right] p(\mathbf{x})\, \mathrm{d}\mathbf{x},$$
in which we have performed the integration over $\xi$.

Page 36: Document5

5.5.5 Training with transformed data

• To simplify the regularization term, we note that the function which minimizes the total error satisfies $y(\mathbf{x}) = \mathbb{E}[t \mid \mathbf{x}] + O(\xi)$.

• Thus, to order $\xi$, the first term in the regularizer vanishes and we are left with
$$\Omega = \frac{1}{2} \int \left( \boldsymbol{\tau}^{\mathrm{T}} \nabla y(\mathbf{x}) \right)^{2} p(\mathbf{x})\, \mathrm{d}\mathbf{x}.$$

• In case the transformation is simply the addition of random noise, $\mathbf{x} \to \mathbf{x} + \boldsymbol{\xi}$, then $\boldsymbol{\tau}$ reduces to the noise direction and
$$\Omega = \frac{1}{2} \int \left\| \nabla y(\mathbf{x}) \right\|^{2} p(\mathbf{x})\, \mathrm{d}\mathbf{x},$$
which is known as Tikhonov regularization. We can train an MLP by adding this regularization function to the error.

Page 37: Document5

5.5.7 Soft weight sharing

• One way to reduce the effective complexity of a network is to constrain weights within certain groups to be equal.

• We can instead encourage the weight values to form several groups, rather than just one group, by considering a mixture-of-Gaussians prior
$$p(\mathbf{w}) = \prod_i p(w_i),$$
where
$$p(w_i) = \sum_{j=1}^{M} \pi_j\, \mathcal{N}(w_i \mid \mu_j, \sigma_j^2)$$
and the $\pi_j$ are the mixing coefficients.

Page 38: Document5

5.5.7 Soft weight sharing

• Taking the negative logarithm then leads to a regularization term
$$\Omega(\mathbf{w}) = -\sum_i \ln\!\left( \sum_{j=1}^{M} \pi_j\, \mathcal{N}(w_i \mid \mu_j, \sigma_j^2) \right),$$
and the total error is given by
$$\widetilde{E}(\mathbf{w}) = E(\mathbf{w}) + \lambda\, \Omega(\mathbf{w}).$$

• If the weights were constant, the mixture parameters $\{\pi_j, \mu_j, \sigma_j\}$ could be determined by the EM algorithm.

• As the distribution of weights is itself evolving during the learning process, a joint optimization is instead performed simultaneously over the weights and the mixture-model parameters.
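A short sketch of the soft-weight-sharing penalty $\Omega(\mathbf{w})$ for a flat weight vector; the two-component mixture parameters below are illustrative:

```python
import numpy as np

def soft_weight_sharing_penalty(w, pi, mu, sigma):
    """Omega(w) = -sum_i ln( sum_j pi_j * N(w_i | mu_j, sigma_j^2) ).

    w : flat array of network weights; pi, mu, sigma : mixture parameters (length M).
    """
    # Gaussian density of every weight under every mixture component, shape (len(w), M)
    diff = w[:, None] - mu[None, :]
    dens = np.exp(-0.5 * (diff / sigma[None, :]) ** 2) / (np.sqrt(2.0 * np.pi) * sigma[None, :])
    return -np.sum(np.log(dens @ pi))

# Toy usage: two components encouraging weights to cluster near 0 and near 1
w = np.array([0.05, -0.1, 0.9, 1.1, 0.0])
pi = np.array([0.5, 0.5])
mu = np.array([0.0, 1.0])
sigma = np.array([0.2, 0.2])
print(soft_weight_sharing_penalty(w, pi, mu, sigma))
```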