CONTENTS
1. Introduction
2. Probability Distributions
3. Linear Models for Regression
4. Linear Models for Classification
5. Neural Networks
6. Kernel Methods
Feed-forward Network Functions
• Multilayer Perceptron (MLP)
• Mathematical representations of information processing in biological systems
• Linear models for regression and classification
• Neural network model
• A two-layer network first forms $M$ linear combinations of the inputs $x_1, \ldots, x_D$,

$$a_j = \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)}, \qquad j = 1, \ldots, M,$$

where the $a_j$ are activations, the $w_{ji}^{(1)}$ are weights, and the $w_{j0}^{(1)}$ are biases.
• Each activation is transformed by a differentiable, nonlinear activation function to give the hidden units $z_j = h(a_j)$, which are in turn combined to give output activations $a_k = \sum_{j=1}^{M} w_{kj}^{(2)} z_j + w_{k0}^{(2)}$, transformed by an output activation function to give $y_k = \sigma(a_k)$.
• For multiclass problems, the output activations are transformed by a softmax activation function,

$$y_k = \frac{\exp(a_k)}{\sum_j \exp(a_j)}.$$

• The neural network model is thus simply a nonlinear function from a set of input variables $\{x_i\}$ to a set of output variables $\{y_k\}$, controlled by a vector $\mathbf{w}$ of adjustable weight parameters; the hidden units are deterministic variables rather than stochastic ones.
• We can absorb the biases by giving the additional variables $x_0 = 1$ and $z_0 = 1$, so the overall network function becomes

$$y_k(\mathbf{x}, \mathbf{w}) = \sigma\left( \sum_{j=0}^{M} w_{kj}^{(2)} \, h\left( \sum_{i=0}^{D} w_{ji}^{(1)} x_i \right) \right).$$
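As a concrete illustration, here is a minimal NumPy sketch of this two-layer network function. The tanh hidden activation and logistic-sigmoid output are illustrative choices, and the names (`forward`, `W1`, `W2`) are ours rather than anything fixed by the slides:

```python
import numpy as np

def forward(x, W1, W2):
    """Two-layer network forward pass with tanh hidden units.

    x  : (D,) input vector.
    W1 : (M, D+1) first-layer weights (last column is the bias w_j0).
    W2 : (K, M+1) second-layer weights (last column is the bias w_k0).
    Returns hidden activations z (M,) and outputs y (K,).
    """
    a1 = W1 @ np.append(x, 1.0)      # a_j = sum_i w_ji x_i + w_j0
    z = np.tanh(a1)                  # z_j = h(a_j)
    a2 = W2 @ np.append(z, 1.0)      # a_k = sum_j w_kj z_j + w_k0
    y = 1.0 / (1.0 + np.exp(-a2))    # logistic sigmoid output
    return z, y
```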
![Page 5: Document5](https://reader030.vdocuments.mx/reader030/viewer/2022020506/56d6beb01a28ab3016933253/html5/thumbnails/5.jpg)
Neural networks are said to be “universal approximators”Pattern Recognition and Machine Learning 86
Objective for Network Training
• Given input vectors $\{\mathbf{x}_n\}$ and target values $\{t_n\}$, $n = 1, \ldots, N$, network training can be framed as the minimization of a sum-of-squares error function,

$$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(\mathbf{x}_n, \mathbf{w}) - t_n \}^2.$$

Relation to maximum likelihood criterion
• Assuming Gaussian noise, $p(t \mid \mathbf{x}, \mathbf{w}) = \mathcal{N}(t \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1})$, the likelihood of the data set is $\prod_{n=1}^{N} \mathcal{N}(t_n \mid y(\mathbf{x}_n, \mathbf{w}), \beta^{-1})$.
• Taking the negative logarithm, we have

$$\frac{\beta}{2} \sum_{n=1}^{N} \{ y(\mathbf{x}_n, \mathbf{w}) - t_n \}^2 - \frac{N}{2} \ln \beta + \frac{N}{2} \ln(2\pi).$$
![Page 8: Document5](https://reader030.vdocuments.mx/reader030/viewer/2022020506/56d6beb01a28ab3016933253/html5/thumbnails/8.jpg)
ML Minimization of sum-of-squares error function
• For the case of multiple target variables
Pattern Recognition and Machine Learning 89
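Both quantities above are straightforward to compute. The following sketch (function names ours) assumes predictions and targets are stored as $(N, K)$ NumPy arrays:

```python
import numpy as np

def sum_of_squares_error(Y, T):
    """E(w) = 1/2 * sum_n ||y(x_n, w) - t_n||^2 for predictions Y, targets T."""
    return 0.5 * np.sum((Y - T) ** 2)

def beta_ml(Y, T):
    """ML noise precision: 1/beta = (1/NK) * sum_n ||y(x_n, w_ML) - t_n||^2."""
    N, K = T.shape
    return N * K / np.sum((Y - T) ** 2)
```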
Cross Entropy Error Function
• Consider a binary classification with target $t \in \{0, 1\}$, and let the network output model the class posterior, so that

$$p(t \mid \mathbf{x}, \mathbf{w}) = y(\mathbf{x}, \mathbf{w})^{t} \{ 1 - y(\mathbf{x}, \mathbf{w}) \}^{1 - t}.$$

• The cross-entropy error function is

$$E(\mathbf{w}) = - \sum_{n=1}^{N} \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \}.$$

• In the case of $K$ separate binary classifications,

$$E(\mathbf{w}) = - \sum_{n=1}^{N} \sum_{k=1}^{K} \{ t_{nk} \ln y_{nk} + (1 - t_{nk}) \ln(1 - y_{nk}) \}.$$
• In the case of standard multiclass classification with a 1-of-K coding scheme, taking the negative logarithm of the likelihood function gives the cross-entropy error function

$$E(\mathbf{w}) = - \sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \ln y_k(\mathbf{x}_n, \mathbf{w}),$$

where the outputs are given by the softmax function $y_k(\mathbf{x}, \mathbf{w}) = \exp(a_k) / \sum_j \exp(a_j)$.
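A small sketch of these formulas, with a numerically stabilized softmax (the shift by the row maximum is a standard trick, not something stated on the slides); names are ours:

```python
import numpy as np

def softmax(A):
    """Row-wise softmax y_k = exp(a_k) / sum_j exp(a_j), shifted for stability."""
    A = A - A.max(axis=1, keepdims=True)
    expA = np.exp(A)
    return expA / expA.sum(axis=1, keepdims=True)

def cross_entropy(Y, T, eps=1e-12):
    """E(w) = -sum_n sum_k t_nk ln y_nk for 1-of-K coded targets T."""
    return -np.sum(T * np.log(Y + eps))
```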
Parameter optimization
• The smallest value of $E(\mathbf{w})$ will occur at a point where the gradient vanishes, $\nabla E(\mathbf{w}) = 0$.

Local quadratic approximation
• Around some point $\hat{\mathbf{w}}$ in weight space, the Taylor expansion gives

$$E(\mathbf{w}) \simeq E(\hat{\mathbf{w}}) + (\mathbf{w} - \hat{\mathbf{w}})^{\mathrm{T}} \mathbf{b} + \frac{1}{2} (\mathbf{w} - \hat{\mathbf{w}})^{\mathrm{T}} \mathbf{H} (\mathbf{w} - \hat{\mathbf{w}}),$$

where $\mathbf{b} = \nabla E |_{\mathbf{w} = \hat{\mathbf{w}}}$ and $\mathbf{H} = \nabla \nabla E$ is the Hessian matrix.
• We obtain the corresponding local approximation to the gradient, $\nabla E \simeq \mathbf{b} + \mathbf{H} (\mathbf{w} - \hat{\mathbf{w}})$.
• Let $\mathbf{w}^{\star}$ be a local minimum; since $\nabla E(\mathbf{w}^{\star}) = 0$,

$$E(\mathbf{w}) \simeq E(\mathbf{w}^{\star}) + \frac{1}{2} (\mathbf{w} - \mathbf{w}^{\star})^{\mathrm{T}} \mathbf{H} (\mathbf{w} - \mathbf{w}^{\star}).$$

• For a geometric interpretation, we can perform the eigen-analysis $\mathbf{H} \mathbf{u}_i = \lambda_i \mathbf{u}_i$, where the $\mathbf{u}_i$ are orthonormal vectors, i.e., $\mathbf{u}_i^{\mathrm{T}} \mathbf{u}_j = \delta_{ij}$.
• We expand $\mathbf{w} - \mathbf{w}^{\star} = \sum_i \alpha_i \mathbf{u}_i$, then

$$E(\mathbf{w}) = E(\mathbf{w}^{\star}) + \frac{1}{2} \sum_i \lambda_i \alpha_i^2.$$
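The eigen-expansion can be checked numerically. In this sketch the Hessian `H` and displacement `dw` are made-up example values; it verifies that the eigenbasis form of the quadratic approximation agrees with the direct quadratic form:

```python
import numpy as np

# Hypothetical Hessian at a stationary point w* (symmetric by construction).
H = np.array([[3.0, 1.0],
              [1.0, 2.0]])

# Eigen-analysis H u_i = lambda_i u_i with orthonormal u_i (columns of U).
lam, U = np.linalg.eigh(H)

# E(w) - E(w*) = 1/2 * sum_i lambda_i alpha_i^2 in the eigenbasis,
# where w - w* = sum_i alpha_i u_i.
dw = np.array([0.5, -0.2])            # an arbitrary displacement from w*
alpha = U.T @ dw                      # coordinates in the eigenbasis
quad = 0.5 * np.sum(lam * alpha**2)
assert np.isclose(quad, 0.5 * dw @ H @ dw)    # the two forms agree
print("positive definite:", np.all(lam > 0))  # minimum iff all lambda_i > 0
```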
• $\mathbf{H}$ is positive definite if, and only if, $\mathbf{v}^{\mathrm{T}} \mathbf{H} \mathbf{v} > 0$ for all $\mathbf{v} \neq \mathbf{0}$, or equivalently if the eigenvalues $\lambda_i$ are all positive; in that case $\mathbf{w}^{\star}$ is a local minimum.
Gradient Descent Optimization
• Batch gradient descent:

$$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta \nabla E(\mathbf{w}^{(\tau)}),$$

where the parameter $\eta > 0$ is the learning rate.
• Sequential (stochastic) gradient descent: if the error function comprises a sum over data points, $E(\mathbf{w}) = \sum_n E_n(\mathbf{w})$, then updates are made one data point at a time,

$$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta \nabla E_n(\mathbf{w}^{(\tau)}).$$
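Both update rules are easy to state in code. In this sketch, `grad_E` and `grad_E_n` are assumed to be user-supplied callables (e.g., computed by backpropagation), and the defaults are arbitrary:

```python
import numpy as np

def batch_gradient_descent(w, grad_E, eta=0.1, steps=100):
    """w(tau+1) = w(tau) - eta * grad E(w(tau)), using the full data set."""
    for _ in range(steps):
        w = w - eta * grad_E(w)
    return w

def sequential_gradient_descent(w, grad_E_n, N, eta=0.1, epochs=10, rng=None):
    """Update on one data point at a time: w <- w - eta * grad E_n(w)."""
    if rng is None:
        rng = np.random.default_rng(0)
    for _ in range(epochs):
        for n in rng.permutation(N):   # visit points in random order
            w = w - eta * grad_E_n(w, n)
    return w
```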
Error Backpropagation Algorithm
• We seek an efficient way of evaluating the gradient $\nabla E(\mathbf{w})$ for a feed-forward neural network.

Evaluation of Error-Function Derivatives
• In a general feed-forward network, each unit computes a weighted sum of its inputs, $a_j = \sum_i w_{ji} z_i$, which is transformed to give the activation $z_j = h(a_j)$.
• Defining the errors $\delta_j \equiv \partial E_n / \partial a_j$, the output units have $\delta_k = y_k - t_k$, and the hidden units satisfy

$$\delta_j = h'(a_j) \sum_k w_{kj} \delta_k,$$

where the sum runs over all units $k$ to which unit $j$ sends a connection.
• Finally, the required derivative is obtained as

$$\frac{\partial E_n}{\partial w_{ji}} = \delta_j z_i.$$
Error Backpropagation Procedure
1. Apply an input vector $\mathbf{x}_n$ to the network and forward propagate using $a_j = \sum_i w_{ji} z_i$ and $z_j = h(a_j)$ to find the activations of all hidden and output units.
2. Evaluate $\delta_k = y_k - t_k$ for all output units.
3. Backpropagate the $\delta$'s using $\delta_j = h'(a_j) \sum_k w_{kj} \delta_k$ to obtain $\delta_j$ for each hidden unit.
4. Evaluate the required derivatives $\partial E_n / \partial w_{ji}^{(1)} = \delta_j x_i$ and $\partial E_n / \partial w_{kj}^{(2)} = \delta_k z_j$ for all weights in the first and second layers.
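Putting the four steps together for the two-layer network sketched earlier, assuming tanh hidden units and linear output units with a sum-of-squares error (so that $\delta_k = y_k - t_k$ holds exactly); the function name and weight layout are ours:

```python
import numpy as np

def backprop(x, t, W1, W2):
    """One backpropagation pass for a two-layer network
    (tanh hidden units, linear outputs, sum-of-squares error)."""
    # 1. Forward propagate to find all activations.
    a1 = W1 @ np.append(x, 1.0)
    z = np.tanh(a1)
    y = W2 @ np.append(z, 1.0)            # linear output units
    # 2. Output-unit errors: delta_k = y_k - t_k.
    delta2 = y - t
    # 3. Backpropagate: delta_j = h'(a_j) * sum_k w_kj delta_k,
    #    with h'(a) = 1 - tanh(a)^2; bias weights are excluded from the sum.
    delta1 = (1.0 - z**2) * (W2[:, :-1].T @ delta2)
    # 4. Required derivatives: dE/dw_ji = delta_j * z_i (input or hidden + bias).
    grad_W1 = np.outer(delta1, np.append(x, 1.0))
    grad_W2 = np.outer(delta2, np.append(z, 1.0))
    return grad_W1, grad_W2
```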
Logistic Sigmoid Activation Function
• For the logistic sigmoid $h(a) = \sigma(a) = 1 / (1 + e^{-a})$, the derivative takes the convenient form $h'(a) = h(a)\{1 - h(a)\}$, so the backpropagated errors can be computed from the unit activations alone.
The Hessian Matrix
• The backpropagation algorithm for the first derivatives can be extended to the second derivatives of the error, given by the Hessian matrix with elements $\partial^2 E / \partial w_{ji} \partial w_{lk}$.
• The Hessian matrix plays an important role in many aspects of neural computing, and various approximation schemes have been used to evaluate it for a neural network.
• Exact computation of the Hessian is also available, with cost $O(W^2)$, where $W$ is the number of weights.
Diagonal Approximation
• For an error comprising a sum over data points, $E = \sum_n E_n$, the diagonal elements are computed by

$$\frac{\partial^2 E_n}{\partial w_{ji}^2} = \frac{\partial^2 E_n}{\partial a_j^2} \, z_i^2.$$

• If we neglect off-diagonal elements in the second-derivative terms, we obtain

$$\frac{\partial^2 E_n}{\partial a_j^2} = h'(a_j)^2 \sum_k w_{kj}^2 \frac{\partial^2 E_n}{\partial a_k^2} + h''(a_j) \sum_k w_{kj} \frac{\partial E_n}{\partial a_k}.$$
Outer Product Approximation
• In the case of a single output with a sum-of-squares error, the Hessian is

$$\mathbf{H} = \nabla \nabla E = \sum_{n=1}^{N} \nabla y_n (\nabla y_n)^{\mathrm{T}} + \sum_{n=1}^{N} (y_n - t_n) \nabla \nabla y_n.$$

• If the network is well trained, so that $y_n \simeq t_n$, the second term can be neglected and we have the outer-product approximation

$$\mathbf{H} \simeq \sum_{n=1}^{N} \mathbf{b}_n \mathbf{b}_n^{\mathrm{T}}, \qquad \mathbf{b}_n = \nabla y_n.$$
Inverse Hessian
• Using the outer-product approximation, a sequential procedure builds the Hessian from the first $L$ data points,

$$\mathbf{H}_L = \sum_{n=1}^{L} \mathbf{b}_n \mathbf{b}_n^{\mathrm{T}}, \qquad \mathbf{H}_{L+1} = \mathbf{H}_L + \mathbf{b}_{L+1} \mathbf{b}_{L+1}^{\mathrm{T}},$$

where $\mathbf{b}_n = \nabla y_n$.
• Using the Woodbury identity

$$(\mathbf{M} + \mathbf{v}\mathbf{v}^{\mathrm{T}})^{-1} = \mathbf{M}^{-1} - \frac{(\mathbf{M}^{-1}\mathbf{v})(\mathbf{v}^{\mathrm{T}}\mathbf{M}^{-1})}{1 + \mathbf{v}^{\mathrm{T}}\mathbf{M}^{-1}\mathbf{v}},$$

we obtain

$$\mathbf{H}_{L+1}^{-1} = \mathbf{H}_L^{-1} - \frac{\mathbf{H}_L^{-1} \mathbf{b}_{L+1} \mathbf{b}_{L+1}^{\mathrm{T}} \mathbf{H}_L^{-1}}{1 + \mathbf{b}_{L+1}^{\mathrm{T}} \mathbf{H}_L^{-1} \mathbf{b}_{L+1}}.$$
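The sequential Woodbury update translates directly into code. In this sketch, the rows of `B` are the gradient vectors $\mathbf{b}_n$, and starting from $\mathbf{H}_0 = \alpha \mathbf{I}$ with a small $\alpha$ (so the initial matrix is invertible) is an assumption on our part:

```python
import numpy as np

def sequential_inverse_hessian(B, alpha=1e-3):
    """Build H^{-1} for H = sum_n b_n b_n^T via the Woodbury identity,
    one data point at a time.

    B     : (N, W) matrix whose rows are the gradients b_n = grad y(x_n).
    alpha : small constant; start from H_0 = alpha * I to keep H invertible.
    """
    W = B.shape[1]
    H_inv = np.eye(W) / alpha
    for b in B:
        Hb = H_inv @ b                                 # H^{-1} b (symmetric H_inv)
        H_inv = H_inv - np.outer(Hb, Hb) / (1.0 + b @ Hb)
    return H_inv
```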
5.5 Regularization in Neural Networks
• The optimum value of $M$, the number of hidden units, gives the best generalization performance, i.e., the optimum balance between under-fitting and over-fitting (Figure 5.9).
• Alternatively, generalization can be controlled by adding a regularization term to the error,

$$\tilde{E}(\mathbf{w}) = E(\mathbf{w}) + \frac{\lambda}{2} \mathbf{w}^{\mathrm{T}} \mathbf{w}.$$

This regularizer is also known as weight decay, and $\lambda$ is the regularization coefficient.
5.5.1 Consistent Gaussian priors
• Simple weight decay is inconsistent with certain scaling properties of network mappings: because it treats all weights and biases on an equal footing, it does not satisfy the requirement that a linear transformation of the inputs or outputs should yield an equivalent network.
• The regularizer is therefore modified to treat the two layers separately,

$$\frac{\lambda_1}{2} \sum_{w \in \mathcal{W}_1} w^2 + \frac{\lambda_2}{2} \sum_{w \in \mathcal{W}_2} w^2,$$

where $\mathcal{W}_1$ and $\mathcal{W}_2$ denote the sets of first- and second-layer weights.
• This corresponds to a prior of the form

$$p(\mathbf{w} \mid \alpha_1, \alpha_2) \propto \exp\left( -\frac{\alpha_1}{2} \sum_{w \in \mathcal{W}_1} w^2 - \frac{\alpha_2}{2} \sum_{w \in \mathcal{W}_2} w^2 \right).$$
5.5.2 Early stopping
• Early stopping is a way of controlling the effective complexity of a network.
• Our goal is to achieve the best generalization: training is halted at the point of smallest error measured on an independent validation set, rather than at the minimum of the training-set error.
• Figure 5.12: training-set error and validation-set error as functions of the iteration number.
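A minimal sketch of the early-stopping loop, holding on to the best weights seen on the validation set; `step`, `val_error`, and the patience rule are our framing of the procedure, not notation from the slides:

```python
def train_with_early_stopping(step, val_error, w, patience=10, max_iters=10_000):
    """Keep the weights with the lowest validation error seen so far and
    stop once it has not improved for `patience` iterations.

    step      : function w -> w, one training update on the training set.
    val_error : function w -> float, error on the held-out validation set.
    """
    best_w, best_err, since_best = w, val_error(w), 0
    for _ in range(max_iters):
        w = step(w)
        err = val_error(w)
        if err < best_err:
            best_w, best_err, since_best = w, err, 0
        else:
            since_best += 1
            if since_best >= patience:
                break
    return best_w, best_err
```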
• Early stopping exhibits similar behavior to regularization using a weight-decay term.
• Figure 5.13: a schematic illustration of why early stopping can give similar results to weight decay in the case of a quadratic error function.
5.5.3 Invariances
• The training set is augmented using replicas of the training patterns, transformed according to the desired invariances (e.g., translation or scale invariance), as in the sketch below (Figure 5.14).
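A minimal sketch of this augmentation, where `transform` is any user-supplied function encoding the desired invariance (e.g., a small random translation); all names are ours:

```python
import numpy as np

def augment(X, T, transform, replicas=4, rng=None):
    """Augment (X, T) with transformed replicas of each training pattern.

    transform : function (x, rng) -> transformed x, encoding the invariance.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    X_aug, T_aug = [X], [T]
    for _ in range(replicas):
        X_aug.append(np.stack([transform(x, rng) for x in X]))
        T_aug.append(T)  # the target is unchanged by the transformation
    return np.concatenate(X_aug), np.concatenate(T_aug)
```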
5.5.4 Tangent propagation
• We can use regularization to encourage models to be invariant to transformations of the input through the technique of "tangent propagation" (Figure 5.15).
• Let $\mathbf{s}(\mathbf{x}_n, \xi)$ be the vector that results from acting on $\mathbf{x}_n$ by the transformation, with $\mathbf{s}(\mathbf{x}_n, 0) = \mathbf{x}_n$ and $\xi$ a parameter. The tangent vector at the point $\mathbf{x}_n$ is given by

$$\boldsymbol{\tau}_n = \left. \frac{\partial \mathbf{s}(\mathbf{x}_n, \xi)}{\partial \xi} \right|_{\xi=0}.$$

• Under a transformation of the input vector, the network output vector will, in general, change; the derivative of output $k$ with respect to $\xi$ is

$$\left. \frac{\partial y_k}{\partial \xi} \right|_{\xi=0} = \sum_{i=1}^{D} J_{ki} \tau_i,$$

where $J_{ki} = \partial y_k / \partial x_i$ is an element of the Jacobian matrix.
• We encourage local invariance in the neighborhood of the data points by the addition of a regularization function $\Omega$, giving the total error $\tilde{E} = E + \lambda \Omega$, where $\lambda$ is a regularization coefficient and

$$\Omega = \frac{1}{2} \sum_n \sum_k \left( \sum_{i=1}^{D} J_{nki} \tau_{ni} \right)^2.$$

• Notes:
  1. $\Omega = 0$ in case the network mapping function is invariant under the transformation in the neighborhood of each pattern vector.
  2. The value of $\lambda$ determines the balance between fitting the training data and learning the invariance property.

A numerical sketch of $\Omega$ is given below.
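This sketch uses a central difference to approximate the directional derivative $\mathbf{J}\boldsymbol{\tau}$, so no analytic Jacobian is needed; the function name and finite-difference step are our choices:

```python
import numpy as np

def tangent_prop_penalty(y, X, Tau, eps=1e-5):
    """Omega = 1/2 * sum_n sum_k (sum_i J_nki * tau_ni)^2, with the
    directional derivative J_n tau_n estimated by a central difference.

    y   : function x -> outputs (K,), the network mapping.
    X   : (N, D) input vectors; Tau : (N, D) tangent vectors.
    """
    omega = 0.0
    for x, tau in zip(X, Tau):
        d = (y(x + eps * tau) - y(x - eps * tau)) / (2 * eps)  # J tau
        omega += 0.5 * np.sum(d ** 2)
    return omega
```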
• Figure 5.16
5.5.5 Training with transformed data
• Consider perturbation of the inputs by a transformation governed by a parameter $\xi$ with density $p(\xi)$ having zero mean and small variance.
• Expand the transformation as a Taylor series in $\xi$:

$$\mathbf{s}(\mathbf{x}, \xi) = \mathbf{x} + \xi \boldsymbol{\tau} + \frac{\xi^2}{2} \boldsymbol{\tau}' + O(\xi^3),$$

where $\boldsymbol{\tau}'$ denotes the second derivative of $\mathbf{s}(\mathbf{x}, \xi)$ with respect to $\xi$ evaluated at $\xi = 0$.
• Expanding the model function then gives a Taylor series in powers of $\xi$,

$$y(\mathbf{s}(\mathbf{x}, \xi)) = y(\mathbf{x}) + \xi \, \boldsymbol{\tau}^{\mathrm{T}} \nabla y(\mathbf{x}) + \frac{\xi^2}{2} \left[ (\boldsymbol{\tau}')^{\mathrm{T}} \nabla y(\mathbf{x}) + \boldsymbol{\tau}^{\mathrm{T}} \nabla \nabla y(\mathbf{x}) \, \boldsymbol{\tau} \right] + O(\xi^3).$$
• Substituting into the sum-of-squares error and omitting terms of $O(\xi^3)$, the average error function is

$$\tilde{E} = E + \lambda \Omega, \qquad \lambda = \mathbb{E}[\xi^2],$$

where the regularization term takes the form

$$\Omega = \frac{1}{2} \int \left[ \{ y(\mathbf{x}) - t \} \left\{ (\boldsymbol{\tau}')^{\mathrm{T}} \nabla y(\mathbf{x}) + \boldsymbol{\tau}^{\mathrm{T}} \nabla \nabla y(\mathbf{x}) \, \boldsymbol{\tau} \right\} + \left( \boldsymbol{\tau}^{\mathrm{T}} \nabla y(\mathbf{x}) \right)^2 \right] p(t \mid \mathbf{x}) \, p(\mathbf{x}) \, \mathrm{d}\mathbf{x} \, \mathrm{d}t,$$

in which we have performed the integration over $\xi$.
• To simplify the regularization term, note that the function which minimizes the total error is $y(\mathbf{x}) = \mathbb{E}[t \mid \mathbf{x}] + O(\xi)$.
• Thus, to order $\xi$, the first term in the regularizer vanishes and we are left with

$$\Omega = \frac{1}{2} \int \left( \boldsymbol{\tau}^{\mathrm{T}} \nabla y(\mathbf{x}) \right)^2 p(\mathbf{x}) \, \mathrm{d}\mathbf{x}.$$

• In the case where the transformation is simply the addition of random noise, $\mathbf{x} \rightarrow \mathbf{x} + \boldsymbol{\xi}$, the regularizer reduces to

$$\Omega = \frac{1}{2} \int \| \nabla y(\mathbf{x}) \|^2 \, p(\mathbf{x}) \, \mathrm{d}\mathbf{x},$$

which is known as Tikhonov regularization. Its derivatives with respect to the network weights can be found by extending backpropagation, so an MLP can be trained directly with this regularization function.
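An empirical version of the Tikhonov regularizer, replacing the integral over $p(\mathbf{x})$ by a sum over the data points and the gradient by central differences; this assumes a scalar-output network `y`, and all names are ours:

```python
import numpy as np

def tikhonov_penalty(y, X, eps=1e-5):
    """Sample-average estimate of Omega = 1/2 * int ||grad y(x)||^2 p(x) dx
    for a scalar network output y(x), using central differences."""
    omega = 0.0
    for x in X:
        g = np.array([(y(x + eps * e) - y(x - eps * e)) / (2 * eps)
                      for e in np.eye(len(x))])   # finite-difference gradient
        omega += 0.5 * np.sum(g ** 2)
    return omega
```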
5.5.7 Soft weight sharing
• One way to reduce the effective complexity is to constrain weights within certain groups to be equal.
• We can instead encourage the weight values to form several groups, rather than just one group, by considering a prior that is a mixture of Gaussians,

$$p(\mathbf{w}) = \prod_i p(w_i), \qquad \text{where} \quad p(w_i) = \sum_{j=1}^{M} \pi_j \, \mathcal{N}(w_i \mid \mu_j, \sigma_j^2)$$

and the $\pi_j$ are the mixing coefficients.
• Taking the negative logarithm then leads to a regularization term

$$\Omega(\mathbf{w}) = - \sum_i \ln \left( \sum_{j=1}^{M} \pi_j \, \mathcal{N}(w_i \mid \mu_j, \sigma_j^2) \right),$$

and the total error is given by $\tilde{E}(\mathbf{w}) = E(\mathbf{w}) + \lambda \Omega(\mathbf{w})$.
• If the weights were constant, the parameters $\{\pi_j, \mu_j, \sigma_j\}$ could be determined by the EM algorithm; since the distribution of weights is itself evolving during the learning process, a joint optimization is instead performed simultaneously over the weights and the mixture-model parameters.
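The regularization term $\Omega(\mathbf{w})$ is easy to evaluate for given mixture parameters. This sketch (names ours) treats `pi`, `mu`, and `sigma` as NumPy arrays of length $M$, one entry per mixture component:

```python
import numpy as np

def soft_weight_sharing_penalty(w, pi, mu, sigma):
    """Omega(w) = -sum_i ln( sum_j pi_j * N(w_i | mu_j, sigma_j^2) )."""
    w = w.reshape(-1, 1)                     # (num_weights, 1) for broadcasting
    gauss = np.exp(-0.5 * ((w - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return -np.sum(np.log(gauss @ pi))       # mixture over j for each weight
```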