
ML-Class, Fall 2011

Dhruvkaran Mehta
http://www.about.me/dhruvkaran

@dhruvkaran

January 6, 2012

Abstract

These notes capture various aspects taught by Stanford's Prof. Andrew Ng in the inaugural batch of http://www.ml-class.org held in the fall of 2011. The sole purpose of these notes is to patch my personal intuition-holes. They capture what seemed non-trivial to me. Given my generic background, the same might appeal to a larger population.


Contents

1 Linear Regression with one variable
1.1 Problem definition
1.2 More about the cost function
1.3 Gradient Descent

2 Linear Regression with multiple variables
2.1 Problem Definition
2.2 Notes
2.3 Cost Function
2.4 Feature Scaling and Normalization
2.5 Debugging
2.6 Normal Equation - the analytical solution

3 Logistic Regression (Classification)
3.1 Problem Definition
3.2 Decision boundary
3.3 Cost function
3.4 Prediction
3.5 Multi-class classification

4 Vectorization
4.1 Framework
4.2 Linear Regression
4.3 Logistic Regression
4.4 Tricks

5 Regularization
5.1 Symptom: Overfitting
5.2 Cure: Feature Curation and Regularization
5.3 Regularization for linear regression
5.3.1 Regularized linear regression - an interpretation
5.4 Regularization - Normal Equations
5.5 Regularized logistic regression

6 Neural Networks - Theory and Forward Propagation
6.1 Introduction
6.2 Theory
6.3 Terminology
6.4 Forward Propagation
6.4.1 Vectorization
6.5 Interpretation and notes
6.6 Multi-class classification using neural networks
6.7 Cost function

7 Backpropagation algorithm
7.1 Activation forward propagation and cost function
7.2 Error backpropagation - derivation
7.3 Backpropagation - Formalization and Vectorization
7.4 Numerical gradient checking
7.5 Notes

8 Applied Machine Learning
8.1 Hypothesis evaluation
8.2 Cross-Validation set
8.3 Bias and Variance
8.4 More about learning curves
8.5 Remedies from learning curve diagnostics
8.6 Skewed datasets
8.6.1 F1 score
8.6.2 Synthesizing positive examples
8.7 Ceiling Analysis

9 Support Vector Machines
9.1 An alternative view of logistic regression
9.2 Large Margin Intuition
9.2.1 Understanding dot-product
9.3 Kernels
9.3.1 The trick!
9.4 Discussion about SVM parameters
9.5 Don't re-invent the wheel
9.6 Notes
9.7 Logistic Regression v/s Support Vector Machines

10 Clustering
10.1 k-Means
10.2 Algorithm
10.3 Optimization objective
10.4 Notes

11 Principal component analysis
11.1 Problem formulation
11.2 PCA is not linear regression
11.3 Pre-processing
11.4 Algorithm
11.5 Choosing the number of principal components
11.6 Applications

12 Anomaly Detection
12.1 Problem formulation
12.2 Modeling
12.3 Algorithm
12.4 Algorithm Evaluation
12.5 Anomaly detection and supervised classification
12.6 Choosing features to use
12.7 Multivariate gaussian distribution
12.8 Modeling
12.9 Comparison

13 Recommender systems
13.1 Applications
13.2 Content based recommendations
13.3 Collaborative filtering
13.4 Algorithm
13.4.1 Formal Algorithm
13.5 Vectorization: Low rank matrix factorization
13.6 Mean normalization

14 Large scale machine learning
14.1 Stochastic gradient descent
14.2 Online learning
14.3 Batch gradient descent and mapreduce


1 Linear Regression with one variable

1.1 Problem definition

Given an input variable x (also known as the predictor) and output variable y (or, the prediction), build a model that predicts y for a value of x that hasn't been seen before.

More concretely,

Input:

\{(x^{(i)}, y^{(i)})\}, \quad i \in [1, m], \quad x^{(i)} \in \mathbb{R}, \quad y^{(i)} \in \mathbb{R} \quad (1)

Output:

\{\Theta_0, \Theta_1\} \quad (2)

Such that:

y \approx \Theta_0 + \Theta_1 \times x \quad (3)

Hypothesis:

h_\Theta(x) = \Theta_0 + \Theta_1 \times x \quad (4)

Goal:

\min_{\Theta_0, \Theta_1} J(\Theta_0, \Theta_1) \quad (5)

Where:

J(\Theta_0, \Theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (h_\Theta(x^{(i)}) - y^{(i)})^2 \quad (6)

1.2 More about the cost function

The cost function used in equation 6 (also known as OLS, or ordinary least squares) is convex, and so this minimization problem has a global minimum.

1.3 Gradient Descent

Gradient descent is an optimization algorithm used for convex optimization problems. [In real life, we use advanced optimization techniques like fminunc and fmincg instead.]

repeat

\Theta_j \leftarrow \Theta_j - \alpha \times \frac{\partial J(\Theta_0, \Theta_1)}{\partial \Theta_j}, \quad j \in [0, 1] \quad (7)

where:

\frac{\partial J(\Theta_0, \Theta_1)}{\partial \Theta_j} = \frac{\partial J(\Theta)}{\partial \Theta_j} = \frac{1}{m} \sum_{i=1}^{m} (h_\Theta(x^{(i)}) - y^{(i)}) \times x_j^{(i)} \quad (8)

(For convenience, x_0 \leftarrow 1)

until \{\Theta_0, \Theta_1\} converge

1

Page 7: Machine Learning class notes Andrew Ng

8/10/2019 Machine Learning class notes Andrew Ng

http://slidepdf.com/reader/full/machine-learning-class-notes-andrew-ng 7/42

Notes:

• Updates to all Θ’s happen simultaneously.

• α is the learning rate. Too small → slow convergence. Too high → oscillations and no convergence.

• As the algorithm approaches a minimum, the derivative will be smaller and so the steps will be smaller. One needn't change α.
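To make the update loop concrete, here is a minimal NumPy sketch of batch gradient descent for the single-variable case (the function name, the fixed iteration count, and the zero initialization are my own choices for illustration):

import numpy as np

def gradient_descent(x, y, alpha=0.01, n_iters=1000):
    # Batch gradient descent for h(x) = theta0 + theta1 * x (equations 7 and 8)
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(n_iters):
        h = theta0 + theta1 * x                  # predictions on all m examples
        grad0 = (1.0 / m) * np.sum(h - y)        # derivative w.r.t. theta0 (x0 = 1)
        grad1 = (1.0 / m) * np.sum((h - y) * x)  # derivative w.r.t. theta1
        theta0 -= alpha * grad0                  # simultaneous update of both
        theta1 -= alpha * grad1
    return theta0, theta1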

2 Linear Regression with multiple variables

2.1 Problem Definition

Given n predictors x_1, x_2, ..., x_n and outcome y, build a model that predicts y for a value of x = (x_1, x_2, ..., x_n) that hasn't been seen before.

More concretely,

Input:

\{(x^{(i)}, y^{(i)})\}, \quad i \in [1, m], \quad x^{(i)} \in \mathbb{R}^n, \quad y^{(i)} \in \mathbb{R} \quad (9)

Output:

\Theta = \{\Theta_0, \Theta_1, \Theta_2, ..., \Theta_n\} \in \mathbb{R}^{n+1} \quad (10)

Such that:

y \approx \Theta_0 + \Theta_1 \times x_1 + \Theta_2 \times x_2 + ... + \Theta_n \times x_n \quad (11)

Hypothesis:

h_\Theta(x) = \Theta_0 + \Theta_1 \times x_1 + \Theta_2 \times x_2 + ... + \Theta_n \times x_n \quad (12)

Goal:

\min_{\Theta_0, \Theta_1, ..., \Theta_n} J(\Theta_0, \Theta_1, ..., \Theta_n) \quad (13)

Where:

J(\Theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\Theta(x^{(i)}) - y^{(i)})^2 \quad (14)

2.2 Notes

• For convenience, we define x_0 \leftarrow 1. So now, each sample x \in \mathbb{R}^{n+1}.

• So now, X \in \mathbb{R}^{m \times (n+1)}, y \in \mathbb{R}^{m \times 1}, \Theta \in \mathbb{R}^{(n+1) \times 1}

• h_\Theta(x^{(i)}) = \Theta^T x^{(i)} \quad (x^{(i)} \in \mathbb{R}^{(n+1) \times 1})


2.3 Cost Function

The cost function and the derivative remain the same as they were with linear regression for one variable.

J(\Theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\Theta(x^{(i)}) - y^{(i)})^2 \quad (15)

\frac{\partial J(\Theta)}{\partial \Theta_j} = \frac{1}{m} \sum_{i=1}^{m} (h_\Theta(x^{(i)}) - y^{(i)}) \times x_j^{(i)} \quad (16)

2.4 Feature Scaling and Normalization

All features used in the predictor vectors must be scaled and normalized for the following reasons:

• There is only one learning rate, α, for all features. If the features are not scaled, it is possible that there will be oscillations for the model parameter Θ corresponding to one feature while the Θ corresponding to another feature might converge too slowly. This could result in no convergence.

• If one feature is way larger than another feature in its typical values, not scaling will mean that the larger feature is dominant in predicting the value of the outcome, since the hypothesis is a linear combination of the features. The underlying assumption is that all features are equally important and all Θ's are of comparable magnitudes.

Feature scaling and normalization can be done in the following ways:

x_j^{(i)} = \frac{x_j^{(i)} - \mu_j}{\sigma_j} \quad (17)

or,

x_j^{(i)} = \frac{x_j^{(i)} - \mu_j}{\max_j - \min_j} \quad (18)

In both those equations, the denominator is a measure of feature spread, which can be replaced by any other measure that serves the same purpose.
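A minimal NumPy sketch of equation 17 (the helper name is my own):

import numpy as np

def scale_features(X):
    # Z-score normalization: one mean and one spread per column (equation 17)
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma

The returned mu and sigma should be kept, so that any future example is scaled with the same statistics before prediction.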

2.5 Debugging

An easy way to debug any linear optimization problem when working with a convex optimization function is to plot the J v/s iterations curve as seen in figure 1. If the cost function decreases with every iteration, the algorithm is working. Oscillations mean the algorithm is skipping the minimum or hovering around it. One way to fix that is to decrease the learning rate, α.

2.6 Normal Equation - the analytical solution

The solution for Θ can also be found analytically by solving XΘ = y:

\Theta = (X^T X)^{-1} X^T y \quad (19)

Using this equation, there is no need to choose α since it is not iterative. But it is much slower to use this method if the number of features n is very large, since that slows down the computation of the n × n matrix inverse tremendously. For values of n larger than 1000-10000, it is definitely much more advisable to use gradient descent or some other iterative optimization algorithm.
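As a one-line sketch (using the pseudo-inverse, a defensive choice of mine for the case where X^T X is near-singular; the notes use the plain inverse):

import numpy as np

def normal_equation(X, y):
    # Closed-form solution of equation 19; X must include the bias column of ones
    return np.linalg.pinv(X.T @ X) @ X.T @ y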


Figure 1: Debugging linear regression

3 Logistic Regression (Classification)

3.1 Problem Definition

Given m training examples \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), ..., (x^{(m)}, y^{(m)})\}, where x^{(i)} \in \mathbb{R}^n and y \in \{0, 1\}, build the model:

y = g(\Theta_0 + \Theta_1 \times x_1 + \Theta_2 \times x_2 + ... + \Theta_n \times x_n) > \Delta \quad (20)

Hypothesis:

h_\Theta(x) = g(\Theta^T x) = \frac{1}{1 + e^{-\Theta^T x}} \in (0, 1) \quad (21)

where g(z) = \frac{1}{1 + e^{-z}}, also known as the sigmoid, or logistic function.

We will try to get h_\Theta(x) to model the probability p(y = 1; x, \Theta) and then use \Delta to threshold the probability.

3.2 Decision boundary

The decision boundary is the curve which represents the wall between membership and non-membership. It can be moved by changing the threshold of the decision, ∆. Typically, due to the nature of the sigmoid function, we choose ∆ ← 0.5, since h_\Theta(x) > 0.5 when \Theta^T x > 0. Although it is important to bear in mind that when trying to play with false positives and negatives, etc., this threshold can help. There are situations where we want absolutely no false positives (cancer detection) and then there are some where we want no false negatives (quality assurance). ∆ can help achieve this balance.

In its current form, the hypothesis 21 can only support linear decision boundaries. To have non-linear boundaries, we can add extra (interactive) features like x_1^2, x_2^2, x_1 x_2, ....


Figure 2: Logistic regression - Cost functions

3.3 Cost function

The cost function for a single classification for the logistic regression is:

J = -y \times \log(h_\Theta(x)) - (1 - y) \times \log(1 - h_\Theta(x)) \quad (22)

Looking at this equation carefully, it penalizes y towards the wrong ends heavily. The first term, -y \times \log(h_\Theta(x)), is activated only when y = 1 and penalizes being away from the value of 1 heavily (as illustrated in figure 2), and vice versa.

Over all the examples, the cost function is:

J(\Theta) = \frac{1}{m} \sum_{i=1}^{m} [-y^{(i)} \times \log(h_\Theta(x^{(i)})) - (1 - y^{(i)}) \times \log(1 - h_\Theta(x^{(i)}))] \quad (23)

And the cost derivative remains the same as with linear regression:

We know that for one example, J = -y \times \log h - (1 - y) \times \log(1 - h).

\frac{\partial J(\Theta)}{\partial \Theta_j} = \frac{-y}{h} (-h^2) e^{-\Theta^T x} (-x_j) - \frac{(1 - y)}{(1 - h)} h^2 e^{-\Theta^T x} (-x_j) \quad (24)

Using \frac{\partial h}{\partial \Theta_j} = h^2 e^{-\Theta^T x} x_j and 1 - h = h e^{-\Theta^T x}, this simplifies to:

\frac{\partial J(\Theta)}{\partial \Theta_j} = -y h e^{-\Theta^T x} x_j + (1 - y) h x_j \quad (25)

\frac{\partial J(\Theta)}{\partial \Theta_j} = -y x_j + y h x_j + h x_j - y h x_j = (h - y) x_j \quad (26)

\frac{\partial J(\Theta)}{\partial \Theta_j} = \frac{1}{m} \sum_{i=1}^{m} (h_\Theta(x^{(i)}) - y^{(i)}) \times x_j^{(i)} \quad (27)

3.4 Prediction

Once we know \Theta = (\Theta_0, \Theta_1, \Theta_2, ..., \Theta_n), we predict using:

p(y = 1 | x^{(test)}, \Theta) = h_\Theta(x^{(test)}) = g(\Theta^T x^{(test)}) \quad (28)

and we predict y = 1 if p(y = 1 | x^{(test)}, \Theta) > \Delta.
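A minimal sketch of this prediction rule (function names are mine):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, x_test, delta=0.5):
    # Equation 28: p(y = 1 | x, Theta) thresholded at Delta; x_test includes x0 = 1
    p = sigmoid(theta @ x_test)
    return int(p > delta)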


3.5 Multi-class classification

To use the same technique with N classes instead of 2 classes, we would train N one-v/s-all classifiers and then pick the class which maximizes h_\Theta(x^{(test)}).

4 Vectorization

When writing code for machine learning, it is beneficial to make use of the years of research and hard work that has gone into building highly optimized linear algebra libraries on whatever platform is in question. Vectorized code that leverages super-fast matrix multiplication, inversion, transpose, etc., usually runs orders of magnitude faster than non-vectorized, "loopy" code. In this section, we will study vectorized implementations of linear and logistic regression.

4.1 Framework

m training examples, with n features in each example.

Input matrix: X \in \mathbb{R}^{m \times n}

For convenience, we add an extra (bias term) feature, x_0 \leftarrow 1. So, now, X \in \mathbb{R}^{m \times (n+1)}

Target prediction for training, or, output matrix: y \in \mathbb{R}^{m \times 1}

Model parameters: \Theta = (\Theta_0, \Theta_1, ..., \Theta_n) \in \mathbb{R}^{(n+1) \times 1}

4.2 Linear Regression

[J, grad] = function(Θ, X, y)
err ← XΘ − y {Matrix multiplication and subtraction to compute errors on all training examples}
J = 1/(2m) × err^T × err {Vector-vector multiplication to perform a sum of products for \sum_{i=1}^{m} (err^{(i)})^2}
grad = 1/m × (err^T × X)^T {But, (AB)^T = B^T A^T, so, grad = 1/m × X^T × err}

4.3 Logistic Regression

[J, grad] = function(Θ, X, y)
h ← sigmoid(XΘ) {Leveraging a vectorized implementation of sigmoid}
err ← h − y
J = −1/m × [y^T × log(h) + (1 − y^T) × log(1 − h)] {Using a vectorized log function}
grad = 1/m × (err^T × X)^T {But, (AB)^T = B^T A^T, so, grad = 1/m × X^T × err}
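The same two routines, sketched in NumPy (a direct translation of the pseudocode above, with names of my choosing):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def linear_cost_grad(theta, X, y):
    m = len(y)
    err = X @ theta - y            # errors on all m examples at once
    J = (err @ err) / (2 * m)      # err^T err sums the squared errors
    grad = (X.T @ err) / m         # (AB)^T = B^T A^T
    return J, grad

def logistic_cost_grad(theta, X, y):
    m = len(y)
    h = sigmoid(X @ theta)
    err = h - y
    J = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
    grad = (X.T @ err) / m
    return J, grad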

4.4 Tricks

There are many more subtle reasons one should strive for completely vectorized code. For instance, if we use vectorized implementations, and later in time move to a multi-threaded architecture or even a compute cluster, underlying libraries can often utilize modern machines better, with parallelization built into the linear algebra libraries. Here we discuss some vectorization tricks:


• To sum over products, always consider vector-vector multiplication

• Most platforms like octave and numpy will let you index a matrix with a vector which can repeat. For instance, eye(k)(<insert_indexing_vector>,:)

5 Regularization

5.1 Symptom: Overfitting

Overfitting is a common problem in machine learning problems. It's best described as a model which is so tuned to a training set that while it performs really well on that data, it suffers from severe performance penalties on new data, or what is more commonly referred to as a test set. In other words, the model fails to generalize its learnings for data it has not seen before. Overfitting usually occurs due to:

• Number of features, n >> the number of training examples, m

• Complex interactive features exist in the training predictors. For example, x_1, x_2, x_1 x_2, x_1^2 x_2, .... These not only increase the number of features being trained on, but also reduce the SNR (signal-to-noise ratio), and so the amount of information in the data is really less than it seems.

5.2 Cure: Feature Curation and Regularization

To improve the generalization capability of your model:

• Reduce the number of features and the interactivity of the features. Look out for features which proxy the information from each other and dampen the SNR. Irrelevant features also make this worse.

• Regularization is the technique used to keep the magnitudes of the model parameters in \Theta, i.e. (\Theta_0, \Theta_1, ..., \Theta_n), small. This effectively means we want to minimize ||\Theta|| while actually minimizing the error function. Just minimizing ||\Theta|| by itself will not help.

5.3 Regularization for linear regression

With regularization, the linear regression equation looks like:

J(\Theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\Theta(x^{(i)}) - y^{(i)})^2 + \frac{\lambda}{2m} \sum_{j=1}^{n} \Theta_j^2 \quad (29)

and if j > 0,

\frac{\partial J(\Theta)}{\partial \Theta_j} = \frac{1}{m} \sum_{i=1}^{m} (h_\Theta(x^{(i)}) - y^{(i)}) \times x_j^{(i)} + \frac{\lambda}{m} \times \Theta_j \quad (30)

else if j = 0,

\frac{\partial J(\Theta)}{\partial \Theta_j} = \frac{1}{m} \sum_{i=1}^{m} (h_\Theta(x^{(i)}) - y^{(i)}) \times x_j^{(i)} \quad (31)

Take notice that the regularization term does not try to suppress the parametric bias, \Theta_0, since it is left out of the regularization term.
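A NumPy sketch of equations 29-31 (names are mine; the bias \Theta_0 is excluded from the penalty by zeroing its entry):

import numpy as np

def regularized_linear_cost_grad(theta, X, y, lam):
    m = len(y)
    err = X @ theta - y
    reg = np.r_[0.0, theta[1:]]    # theta with the bias term zeroed out
    J = (err @ err) / (2 * m) + (lam / (2 * m)) * (reg @ reg)
    grad = (X.T @ err) / m + (lam / m) * reg
    return J, grad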


6 Neural Networks - Theory and Forward Propagation

6.1 Introduction

Figure 3: Neuron - in nature and math

• To add non-linear interactive features like x_1 x_2, x_1^2 (x_1^2 + x_2^2 will give us a circular hypothesis), etc.

• Use systems like neural networks which can model a non-linear decision boundary by internalizing non-linear and interactive features (explained later).

The problem with adding interactive terms is that the total number of input features explodes combinatorially.

6.2 Theory

Neural networks in nature are networks made up of nodes which have many incoming connections, or inputs, and one outgoing connection, or output. These nodes, called neurons, are activated based on a weighted sum of inputs if the sum is above a threshold.

6.3 Terminology

• A neural network has an input layer, where the number of neurons is num_input_features + 1, and hidden layers, which have a similar extra bias or intercept unit.

• The output layer has as many neurons as num_output_classes

• Inputs: \{(x^{(i)}, y^{(i)})\}, i \in [1, m], x^{(i)} \in \mathbb{R}^n

• We add the bias input unit and so X \in \mathbb{R}^{m \times (n+1)}

• a_i^{(j)} → activation of the i-th neuron in layer j. This is the sigmoid of the weighted sum before thresholding, i.e. h_\Theta(x) = \frac{1}{1 + e^{-\Theta x}}. In other words,

a_1^{(2)} = g(\Theta_{10}^{(1)} x_0 + \Theta_{11}^{(1)} x_1 + \Theta_{12}^{(1)} x_2 + ... + \Theta_{1n}^{(1)} x_n) \quad (35)

where g(z) = sigmoid(z).

• L → number of layers in the neural network.

• s_j → the number of neurons in layer j (not counting the bias unit)

– s_0 ← num_input_features

– \Theta^{(j)} \in \mathbb{R}^{s_{j+1} \times (s_j + 1)} (the extra column is for the bias unit) helps propagate the activations from layer j to j + 1


Figure 4: Forward Propagation

6.4 Forward Propagation

In order to understand forward propagation better, let's consider the following scenario: s_1 = 3, s_2 = 3, s_3 = 1 (see figure 4), i.e. 3 input features and two output classes (one output node can classify between two classes).

{Input layer activations are the inputs.}
a_0^{(1)} ← 1, a_1^{(1)} ← x_1^{(i)}, a_2^{(1)} ← x_2^{(i)}, a_3^{(1)} ← x_3^{(i)}

{For the hidden layers:}
a_1^{(2)} ← g(z_1^{(2)}) = g(\Theta_{10}^{(1)} a_0^{(1)} + \Theta_{11}^{(1)} a_1^{(1)} + \Theta_{12}^{(1)} a_2^{(1)} + \Theta_{13}^{(1)} a_3^{(1)})
a_2^{(2)} ← g(z_2^{(2)})
a_3^{(2)} ← g(z_3^{(2)})

Now, we set a_0^{(2)} ← 1

output = h_\Theta(x) ← g(z_1^{(3)}) = g(\Theta_{10}^{(2)} a_0^{(2)} + \Theta_{11}^{(2)} a_1^{(2)} + \Theta_{12}^{(2)} a_2^{(2)} + \Theta_{13}^{(2)} a_3^{(2)})

6.4.1 Vectorization

a^{(1)} ← x \in \mathbb{R}^{4 \times 1}
\Theta^{(1)} \in \mathbb{R}^{3 \times 4}
z^{(2)} ← \Theta^{(1)} a^{(1)}
a^{(2)} ← g(z^{(2)})
Now, we add the bias a_0^{(2)} ← 1, so a^{(2)} \in \mathbb{R}^{4 \times 1}
\Theta^{(2)} \in \mathbb{R}^{1 \times 4}
z^{(3)} ← \Theta^{(2)} a^{(2)}
a^{(3)} ← g(z^{(3)}) \in \mathbb{R}
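The same vectorized pass, sketched in NumPy for an arbitrary list of weight matrices (a minimal illustration; names are mine):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagate(x, thetas):
    # x: input features without the bias; thetas: [Theta^(1), ..., Theta^(L-1)]
    a = x
    for theta in thetas:
        a = np.r_[1.0, a]          # prepend the bias unit a_0 = 1
        a = sigmoid(theta @ a)     # z = Theta a, then the elementwise sigmoid
    return a                       # output layer activations h_Theta(x)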

6.5 Interpretation and notes

The hidden layer activations are basically non-linear, derived, interactive features which the network learns.

• More than 2 hidden layers are usually unnecessary, and the changes which backpropagate into the third hidden layer are minimal - so not much is to be learnt by adding a third hidden layer in most cases.


• Usually, it makes sense to have the same number of units in all the hidden layers. The more the fan-out from the input layer to the hidden layers, the larger the number of available inferred features.

Another intuitive way to look at neural nets is to see how simple logical functions like AND, OR, XNOR can be implemented using just an input and an output layer.

• For AND: \Theta ← [-15 \ 10 \ 10]

• For OR: \Theta ← [-5 \ 10 \ 10]

6.6 Multi-class classification using neural networks

We will now try to use neural networks to perform a K-class one-v/s-all style classification. So, instead of y^{(i)} \in \{0, 1\}, we have y^{(i)} \in \{0, 1\}^K, i.e. y^{(i)} looks like (0, 0, ..., 1, 0, ..., 0)^T. Also, s_L ← K.

6.7 Cost function

The cost function for neural networks used in classification problems looks very similar to the cost function for the logistic regression.

J = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} [y_k^{(i)} \log h_\Theta(x^{(i)})_k + (1 - y_k^{(i)}) \log(1 - h_\Theta(x^{(i)})_k)] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} (\Theta_{ji}^{(l)})^2 \quad (36)

Take note of how the regularization term has skipped the bias terms by not allowing i = 0.

7 Backpropagation algorithm

There is an overarching theme in the algorithms we have discussed so far. The key point in the formulation of each algorithm is when we find a (convex) cost function which varies as we change the parameters. We then proceed to understand how this cost would change if we changed each of these parameters up or down (this is the gradient step). Using the gradient, we then alter the parameters to obtain a lower cost.

The cost function for a neural network discussed towards the end of the previous section is no different. The various \Theta^{(l)} matrices which propagate activations from layer l to the next one are the parameters. (For implementation, we will roll them all out into a long vector and then unroll them again when we want to play with them. This way we can use them with optimization routines like fminunc and fmincg.)

The focus of this section will be to understand how to compute \frac{\partial J}{\partial \Theta_{ij}^{(l)}} and then use that to perform optimization to train the neural network.


Figure 5: Backpropagation of errors

7.1 Activation forward propagation and cost function

Just to recapitulate, forward propagation takes the input x^{(i)} and computes the final decisions h_\Theta(x^{(i)}) using:

a^{(1)} = x^{(i)}
z^{(2)} = \Theta^{(1)} a^{(1)}
a^{(2)} = g(z^{(2)})
...
z^{(L)} = \Theta^{(L-1)} a^{(L-1)}
h_\Theta(x^{(i)}) = a^{(L)} = g(z^{(L)})

and, the cost function is:

J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} [y_k^{(i)} \log h_\Theta(x^{(i)})_k + (1 - y_k^{(i)}) \log(1 - h_\Theta(x^{(i)})_k)] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} (\Theta_{ji}^{(l)})^2 \quad (37)

7.2 Error backpropagation - derivation

We need to add some more terminology in order to pursue our goal of \frac{\partial J}{\partial \Theta_{ij}^{(l)}}. We denote:

\delta_j^{(l)} → "error" for unit j in layer l

To propagate the errors backwards (see fig. 5) from the output layer to the first hidden layer (the input layer has no errors - it is the ground truth), we use the following:

\delta_j^{(L)} = a_j^{(L)} - y_j

The vectorized form of saying the same thing is:

\delta^{(L)} = a^{(L)} - y \quad (38)

Backpropagate the error using,

\delta^{(L-1)} = (\Theta^{(L-1)})^T \delta^{(L)} \times g'(z^{(L-1)}) \quad (39)


Now, since,

g(z^{(L-1)}) = \frac{1}{1 + e^{-\Theta^{(L-2)} a^{(L-2)}}} = \frac{1}{1 + e^{-z^{(L-1)}}} \quad (40)

g'(z^{(L-1)}) = \frac{1}{(1 + e^{-z^{(L-1)}})^2} \times e^{-z^{(L-1)}} = g(z^{(L-1)}) \times (1 - g(z^{(L-1)})) = a^{(L-1)} \times (1 - a^{(L-1)}) \quad (41)

From 39 and 41, we have:

\delta^{(L-1)} = (\Theta^{(L-1)})^T \delta^{(L)} \times a^{(L-1)} \times (1 - a^{(L-1)}) \quad (42)

Similarly, we go all the way to:

\delta^{(2)} = (\Theta^{(2)})^T \delta^{(3)} \times a^{(2)} \times (1 - a^{(2)}) \quad (43)

Ignoring regularization, it turns out that (a more detailed derivation was out of the scope of the course):

\frac{\partial J}{\partial \Theta_{ij}^{(l)}} = a_j^{(l)} \delta_i^{(l+1)} \quad (44)

7.3 Backpropagation - Formalization and Vectorization

Given a supervised learning training set, \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), ..., (x^{(m)}, y^{(m)})\}, we want to compute J according to equation 37 and another set of matrices, D^{(l)}, where D_{ij}^{(l)} holds the value for \frac{\partial J}{\partial \Theta_{ij}^{(l)}}, which we saw earlier can be computed as a_j^{(l)} \delta_i^{(l+1)}. Let's look into a fairly vectorized implementation of backpropagation.

for l = 1 → L − 1 do
\Delta^{(l)} ← zerosLike(\Theta^{(l)})
end for
for i = 1 → m do
{The outer loop can be vectorized too. Not done here for simplicity.}
a^{(1)} ← x^{(i)} {First layer activations are the inputs}
{Forward propagate the inputs through the parametrized implementation}
for l = 2 → L do
z^{(l)} ← \Theta^{(l-1)} a^{(l-1)}
a^{(l)} ← g(z^{(l)})
end for
{Now, we compute the error and backpropagate it}
\delta^{(L)} ← a^{(L)} − y
for l = L − 1 → 2 do
\delta^{(l)} ← (\Theta^{(l)})^T \delta^{(l+1)} \times a^{(l)} \times (1 − a^{(l)})
end for
for l = 1 → L − 1 do
\Delta^{(l)} ← \Delta^{(l)} + \delta^{(l+1)} (a^{(l)})^T {Vectorization for: \Delta_{ij}^{(l)} ← \Delta_{ij}^{(l)} + a_j^{(l)} \delta_i^{(l+1)}}
end for
end for
for l = 1 → L − 1 do
reg\Theta^{(l)} ← [0 1 ... 1; 0 1 ... 1; ...; 0 1 ... 1] \times \Theta^{(l)} {We don't regularize the bias terms}
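For concreteness, here is a minimal NumPy sketch of the per-example forward/backward pass above (sigmoid activations assumed throughout, no regularization; names are mine):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_single(x, y, thetas):
    # Forward pass, keeping every layer's activations (with bias units prepended)
    a = np.r_[1.0, x]
    activations = [a]
    for theta in thetas:
        a = np.r_[1.0, sigmoid(theta @ a)]
        activations.append(a)
    delta = activations[-1][1:] - y            # delta^(L) = a^(L) - y
    grads = [None] * len(thetas)
    for l in range(len(thetas) - 1, -1, -1):
        a_prev = activations[l]
        grads[l] = np.outer(delta, a_prev)     # Delta^(l) += delta^(l+1) (a^(l))^T
        if l > 0:
            d = (thetas[l].T @ delta) * a_prev * (1 - a_prev)   # equation 42
            delta = d[1:]                      # the bias unit carries no "error"
    return grads                               # one gradient matrix per Theta^(l)

Summing these per-example gradients over the training set and dividing by m yields the unregularized D^{(l)} matrices.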


8 Applied Machine Learning

• Increase regularization by increasing λ

8.1 Hypothesis evaluation

In order to evaluate our hypothesis, we will perform the following train-evaluate cycle:

• Data split: Split the training data into 70% for the training set and 30% for the test set. So now, we have two sets of data, the training set \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), ..., (x^{(0.7m)}, y^{(0.7m)})\} and the test set \{(x_{test}^{(1)}, y_{test}^{(1)}), (x_{test}^{(2)}, y_{test}^{(2)}), ..., (x_{test}^{(0.3m)}, y_{test}^{(0.3m)})\}.

• Learn the model parameters \Theta on the training set and compute J_{test}(\Theta). Use the correct cost function depending on the optimization problem at hand - for regression, use OLS; for classification, use misclassification error, etc.

Since the model parameters \Theta are fit to the training set, the training error, J_{train}(\Theta), is likely to be lower than the test error, J_{test}(\Theta).

8.2 Cross-Validation set

Consider the scenario where we are trying to choose between many model hypotheses like:

h_\Theta(x) = \Theta_0 + \Theta_1 x_1
h_\Theta(x) = \Theta_0 + \Theta_1 x_1 + \Theta_2 x_1^2
...
h_\Theta(x) = \Theta_0 + \Theta_1 x_1 + \Theta_2 x_1^2 + ... + \Theta_5 x_1^5

Now, if we use J_{test}(\Theta) to decide which hypothesis to use, we have actually used the test set to fit another model parameter, viz. the degree of the model hypothesis. Reporting this error as the model health will likely be an optimistic estimate, as explained in the previous section.

Let’s revisit the test split:

• Let's split it as 60% for the training data, 20% for the test set, and 20% for what we will now call the cross-validation set. So now, we have three sets of data: the training set \{(x^{(1)}, y^{(1)}), ..., (x^{(0.6m)}, y^{(0.6m)})\}, the test set \{(x_{test}^{(1)}, y_{test}^{(1)}), ..., (x_{test}^{(0.2m)}, y_{test}^{(0.2m)})\}, and the cross-validation set \{(x_{cv}^{(1)}, y_{cv}^{(1)}), ..., (x_{cv}^{(0.2m)}, y_{cv}^{(0.2m)})\}.

This time, we will use the cross-validation set to pick the model parameter, i.e. the hypothesis degree, and then use the test set to report the model health. This will avoid any over-optimistic estimates and the embarrassment that comes with them.
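A minimal sketch of such a shuffled 60/20/20 split in NumPy (names mine):

import numpy as np

def split_data(X, y, seed=0):
    # Shuffle, then cut into train / cross-validation / test at 60% / 80%
    m = len(y)
    idx = np.random.default_rng(seed).permutation(m)
    tr, cv = int(0.6 * m), int(0.8 * m)
    train, val, test = idx[:tr], idx[tr:cv], idx[cv:]
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])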

8.3 Bias and Variance

A model with a high bias is one that is under-fitting the dataset, or in other words is too simple, has too few features, etc., and is not able to capture all the information in the data into the model. On the other hand, a model with a high variance is one which over-fits and fails to generalize well to data it has not seen before.

Let's analyze the scenario of fitting the degree of the model hypothesis for a linear regression, such as the one described in the previous section. In this case, we will compute the following errors, for each value of the degree of the hypothesis:

J_{train}(\Theta) = \frac{1}{2m} \sum_{i=1}^{m} [h_\Theta(x^{(i)}) - y^{(i)}]^2 \quad (45)


Figure 6: Learning curve - Model hypothesis degree

J_{cv}(\Theta) = \frac{1}{2m_{cv}} \sum_{i=1}^{m_{cv}} [h_\Theta(x_{cv}^{(i)}) - y_{cv}^{(i)}]^2 \quad (46)

From figure 6, we can tell that the region of high bias or underfit (when the degree of the polynomial is low, and the straight-looking lines are not able to capture the complex data) is characterized by: high J_{train}, high J_{cv} and J_{train} ≈ J_{cv}, whereas the region of high variance or overfit (when the degree of the polynomial is so high that it captures the noise in the data along with the signals and fails to generalize) is characterized by: low J_{train}, high J_{cv} and J_{train} << J_{cv}.

While our job is to just pick the model parameter(s) that minimize the cross-validation error and then report the test error, it is important to understand bias and variance in order to diagnose problems in the model.

If we were doing the same study for the regularization parameter λ instead of the model degree, the learning curves would look like figure 7.

8.4 More about learning curves

It's intuitive that when the size of the training set is small, it is easier to learn, and it is highly likely that the model will over-fit the data. As the number of examples in the training set increases, the tendency to over-fit drops (as there are many more constraints for the model to satisfy), but still, the cross-validation error is likely to be higher than the training error. The learning curves for the training-set size m look like figure 8.

As before, in the underfit (high bias) scenario, as m increases, we observe: high J_{train}, high J_{cv} and J_{train} ≈ J_{cv} (see figure 9).


Figure 10: Learning curve - Training set size - High Variance

And in the over-fit (high variance) scenario: low J_{train}, high J_{cv} and J_{train} << J_{cv} (see figure 10).

8.5 Remedies from learning curve diagnostics

Getting more data for under-fitted models is useless since they haven't even modeled all the information in the existing training data. Whereas over-fitted models can benefit from more data.

Next step                                        Will solve...
Get more data                                    High variance
Reduce features                                  High variance
Increase regularization (λ)                      High variance
Additional features                              High bias
Add polynomial, interactive, derived features    High bias
Decrease regularization (λ)                      High bias

In general, it makes sense to start with a very simple model and algorithm, plot the learning curves, and then iterate in a typical agile software engineering fashion. This definitely beats starting out with a large assumption and then burning man-hours in the wrong direction.

8.6 Skewed datasets

Skewed datasets present a special challenge. Consider a cancer dataset with 1% positive examples and 99% negative examples. Now, if misclassification error was used as the benchmarking metric in a classification environment, then a classifier which always predicted negative for cancer would have 99% accuracy, or 1% misclassification! In this situation we need to change the metric which we use to fit the model parameters using the cross-validation error. Only the metric changes, nothing else. We still follow the algorithm-model-learning curve workflow.

8.6.1 F1 score

Let's dissect a cancer classifier's performance using the chart below:

                          Actual
                          1                      0
Predicted   1     True Positive (tp)     False Positive (fp)
            0     False Negative (fn)    True Negative (tn)

We notice that accuracy, or misclassification error, can be obtained as:


misclassification = \frac{fp + fn}{tp + fp + tn + fn}

The problem here is that the total number of true samples is too low for this to be a meaningful metric. Instead we define:

precision = \frac{tp}{tp + fp} \quad (47)

recall = \frac{tp}{tp + fn} \quad (48)

and,

F_1 \ score = 2 \times \frac{precision \times recall}{precision + recall} \quad (49)

This is a very well defined tradeoff. The reader might try to understand how, for instance, varying the threshold ∆ for h_\Theta(x) in a logistic regression setup might change the precision and recall. A cancer classifier needs to be high precision, but a low recall might be slightly more manageable, since people should not be put on chemotherapy unless absolutely sure. Also, multiple high precision, low recall classifiers can be combined for a good output. On the other hand, for quality assurance in a manufacturing environment, we would like a low precision, high recall classifier to make sure everything being delivered is perfect.
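Equations 47-49 in a small NumPy sketch (names mine; assumes binary labels in {0, 1}):

import numpy as np

def f1_score(y_true, y_pred):
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp)                              # equation 47
    recall = tp / (tp + fn)                                 # equation 48
    return 2 * precision * recall / (precision + recall)   # equation 49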

8.6.2 Synthesizing positive examples

Another way of dealing with skewed datasets is to artificially synthesize positive samples. This is possible with problems like PhotoOCR, where we can obtain various fonts, overlay them onto backgrounds of a reasonable contrast and then apply wavelet transforms, etc. Note that adding noise which is not seen in the real world, like gaussian pixelations, will not really help the cause of building a dataset that is not skewed. The added noise has to be realistic.

8.7 Ceiling Analysis

Another way of deciding which component to work on next is ceiling analysis. While diagnostic learning curves help us decide what should be worked on in a particular machine learning algorithm, ceiling analysis can help us decide which component of a machine learning pipeline (composed of several machine learning algorithms) should be worked on next. Consider a pipeline like: Picture → Smoothing → Face Detection → Face Recognition. To perform ceiling analysis, we will replace the output of each layer with the perfect (often hand-crafted) dataset as input to the next layer. This will result in a table like:

Perfect component(s)                             Pipeline accuracy
Nothing                                          63%
Smoothing                                        63%
Smoothing, Face Detection                        93%
Smoothing, Face Detection, Face Recognition      100%

So now, it becomes clear that the biggest gains are to be obtained with better face detection.

9 Support Vector Machines

Support vector machines are a more modern implementation of a classification technique. In this section we get an opportunity to exhibit several concepts like large margin intuition and kernel techniques.


9.1 An alternative view of logistic regression

The model hypothesis for a logistic regression was:

h_\Theta(x^{(i)}) = \frac{1}{1 + e^{-\Theta^T x^{(i)}}} \quad (50)

The sigmoid function looks like figure 11.

Figure 11: Sigmoid function: \frac{1}{1 + e^{-x}}

So, for logistic regression, we need \Theta^T x >> 0 if y = 1, and \Theta^T x << 0 if y = 0. The cost function for logistic regression is J = -y \times \log(h_\Theta(x)) - (1 - y) \times \log(1 - h_\Theta(x)) for a given example. Let's re-write this as J = y \times cost_1(x) + (1 - y) \times cost_0(x), where cost_1(x) is the cost imposed on the hypothesis prediction when the target y is 1, and vice versa.

Now, we get to the cost function of the Support Vector Machine:

J = y \times cost_1(\Theta^T x) + (1 - y) \times cost_0(\Theta^T x) \quad (51)

We replace the previous cost functions with curves which zero out when \Theta^T x > 1 for y = 1, as shown in figure 12, and when \Theta^T x < -1 for y = 0, as shown in figure 13.

Now, to classify absolutely correctly, we just need to make sure that we learn model parameters \Theta such that \Theta^T x > 1 for y = 1 and \Theta^T x < -1 for y = 0. This will make more sense as we begin to talk about large-margin intuition in the next section. The overall objective for a support vector machine looks like:

\min_\Theta C \sum_{i=1}^{m} [y^{(i)} cost_1(\Theta^T x^{(i)}) + (1 - y^{(i)}) cost_0(\Theta^T x^{(i)})] + \frac{1}{2} \sum_{j=1}^{n} \Theta_j^2 \quad (52)

Figure 12: cost_1(\Theta^T x)


Figure 13: cost_0(\Theta^T x)

Notice how λ has been replaced by another constant, C, for mathematical convenience. It still plays a role similar to λ, only its correlation with regularization is inverted.
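The notes only draw cost_1 and cost_0; a common concrete choice consistent with figures 12 and 13 is the hinge form sketched below (this particular functional form and the code names are my assumption, not stated in the notes):

import numpy as np

def cost1(z):
    # zero once z = Theta^T x exceeds 1, growing linearly below (figure 12 shape)
    return np.maximum(0.0, 1.0 - z)

def cost0(z):
    # zero once z is below -1, growing linearly above (figure 13 shape)
    return np.maximum(0.0, 1.0 + z)

def svm_objective(theta, X, y, C):
    # Equation 52; the bias theta[0] is left out of the regularization sum
    z = X @ theta
    data_term = C * np.sum(y * cost1(z) + (1 - y) * cost0(z))
    return data_term + 0.5 * np.sum(theta[1:] ** 2)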

9.2 Large Margin Intuition

Now, we will examine equation 52 to understand how a support vector machine chooses a decision boundary which maximizes the margins from all the classes. For the purposes of this discussion, let's assume that the SVM constant C is set to a very large value: assume C → ∞. A large value of C acts like a small value of λ - so this is going to be like studying an SVM with no regularization; it will be prone to over-fitting the data.

As C goes up, the penalty associated with the terms with cost_1 and cost_0 goes up, so the SVM is highly incentivized to make those costs 0, i.e., make ||\Theta^T x|| > 1 - even if doing so has other side effects, like over-fitting.

Assuming a large value of C, we can reformulate equation 52 as:

\min_\Theta \frac{1}{2} \sum_{j=1}^{n} \Theta_j^2 \quad s.t. \quad \Theta^T x^{(i)} > 1 \ \text{if} \ y^{(i)} = 1 \ \text{and} \ \Theta^T x^{(i)} < -1 \ \text{if} \ y^{(i)} = 0 \quad (53)

in other words:

\min_\Theta \frac{1}{2} \sum_{j=1}^{n} \Theta_j^2 \quad s.t. \quad ||\Theta^T x^{(i)}|| > 1 \quad (54)

9.2.1 Understanding dot-product

Vector dot product is basically the process of projecting one vector onto another.

u \cdot v = ||u|| \, ||v|| \cos(\phi) = ||p|| \, ||v||, \ \text{where} \ p \ \text{is the projection of} \ u \ \text{onto} \ v \quad (55)

The next thing to remember is that \Theta is the normal vector to the decision boundary at each point. Hence, \Theta^T x, which is the dot-product, can be re-written as \Theta \cdot x. What the un-regularized SVM is really trying to do is make sure that \Theta \cdot x, or the projection of the data vector (x) onto the normal vector (\Theta), is maximized. (Since trying to make \Theta^T x^{(i)} > 1 means making p^{(i)} ||\Theta|| > 1.) This incentive, along with trying to \min_\Theta \frac{1}{2} \sum_{j=1}^{n} \Theta_j^2, i.e. \min_\Theta \frac{1}{2} ||\Theta||^2, translates to making the projection of x^{(i)} onto \Theta large, and gives the SVM a large margin property (see fig. 14).

A "perfect" (or, in other words, unregularized) SVM, however, will try to get a perfect decision boundary because the penalty on misclassification is too high. Lowering that penalty (C) will decrease the large margin property but increase regularization. Hence, C gives us a knob into modeling v/s generalization, and must be studied with learning curves just as λ was.


Figure 14: SVM - large margin intuition


9.3 Kernels

The kernel trick is what makes SVMs special. It's possible to use this trick to make several mathematical optimizations possible. The kernel trick projects the data onto a higher-dimensional space where it is possible to find a linear decision boundary.

Say we have n features (x_1, x_2, ..., x_n). The kernel trick is to replace (x_1, x_2, ..., x_n) with new derived features (f_1, f_2, ..., f_k), where f_j^{(i)} = similarityFunction(x^{(i)}, l_j). To do this, we first need to choose the landmark points (l_1, l_2, ..., l_k), where l_k \in \mathbb{R}^n, the same space as the input examples. Usually, the similarityFunction is gaussian:

similarityFunction(x^{(i)}, l_j) = \exp\left(-\frac{||x^{(i)} - l_j||^2}{2\sigma^2}\right) \quad (56)

This function has the value of 1 at the landmark location and varies in a gaussian fashion all the way to 0 as we move far away from the landmark. This particular similarity function (also known as a kernel) introduces another model parameter we need to figure out a value for: σ, the spread of the similarity.

The model now predicts h_\Theta(x) = 1 when \Theta_0 + \Theta_1 f_1 + \Theta_2 f_2 + ... + \Theta_n f_n > 1 and h_\Theta(x) = 0 when \Theta_0 + \Theta_1 f_1 + \Theta_2 f_2 + ... + \Theta_n f_n < -1 (in accordance with cost_1 and cost_0).

The outstanding question then is how we go about choosing the landmark points \{l_1, l_2, ..., l_k\}. We have to think about the landmark points as magnets (landmarks) which emit a magnetic field (decision boundary field), where the resulting magnetic field (decision boundary) is a combination of those emitted by the individual magnets. We can also think of the spread, σ, as the magnetic strength of the pole (fig. 15).
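Equation 56 in a small NumPy sketch (names mine):

import numpy as np

def gaussian_similarity(x, l, sigma):
    # 1 at the landmark, decaying in a gaussian fashion with distance (equation 56)
    return np.exp(-np.sum((x - l) ** 2) / (2 * sigma ** 2))

def kernel_features(x, landmarks, sigma):
    # Derived features f_1, ..., f_k for one example x
    return np.array([gaussian_similarity(x, l, sigma) for l in landmarks])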

As a side note, there are many kernels which people use. If someone says they are using a linear kernel, it means they are not using landmark-expansion, i.e. f = x. Not every similarity function can be a valid kernel. A valid kernel has to satisfy a theorem called Mercer's Theorem. Some valid kernels are:

• Polynomial kernel: similarityFunction(x, l) = (x^T l)^d, (x^T l + const)^d, ...

• String kernel

• Chi-square kernel


Figure 15: SVM - Gaussian kernel - Do you see the magnetic field?

9.3.1 The trick!

The SVM kernel trick is to set (l_1, l_2, ..., l_m) = (x^{(1)}, x^{(2)}, ..., x^{(m)}); here, k = m. So now:

f^{(i)} = (similarityFunction(x^{(i)}, l_1), similarityFunction(x^{(i)}, l_2), ..., similarityFunction(x^{(i)}, l_m)) \quad (57)

which makes for,

f^{(i)} = (similarityFunction(x^{(i)}, x^{(1)}), similarityFunction(x^{(i)}, x^{(2)}), ..., similarityFunction(x^{(i)}, x^{(m)})) \quad (58)

We have transformed x^{(i)} \in \mathbb{R}^{n+1} → f^{(i)} \in \mathbb{R}^{m+1} (we still have the bias term). Basically, the new features denote how similar the input example is to each of the other input examples. It's like dropping a north magnetic pole at each positive example and a south magnetic pole at each negative example.

And now we will predict using \Theta^T f, where \Theta will be trained using the cost function:

\min_\Theta C \sum_{i=1}^{m} [y^{(i)} cost_1(\Theta^T f^{(i)}) + (1 - y^{(i)}) cost_0(\Theta^T f^{(i)})] + \frac{1}{2} \Theta^T \Theta \quad (59)

(Vectorization note: \Theta^T \Theta = \sum_{j=1}^{n} \Theta_j^2 and n = m)

In most numerical implementations, \Theta^T \Theta is actually implemented as \Theta^T M \Theta, and this is what helps tremendously speed up this matrix multiplication. Since \Theta \in \mathbb{R}^{m+1} (m is the number of training examples, which can easily be very large), computing \Theta^T \Theta is very slow without this trick. This numerical trick does not work well with the cost function used for logistic regression, and renders the kernel trick useless in that scenario.

9.4 Discussion about SVM parameters

• C (≈ 1/λ): High C → (low bias, high variance); Low C → (high bias, low variance). Typical value: 1.

• σ: High σ means that the features vary more smoothly. Prone to under-fitting: leads to high bias, low variance. Low σ means features vary less smoothly: low bias, high variance, more prone to over-fitting.


9.5 Don’t re-invent the wheel

Due to the complex nature of the mathematical optimizations involved in support vector machines, it usually makes sense to use existing packages and libraries like liblinear/libsvm to get the job done. They only need a valid kernel and values for the model parameters, C, λ, to be plugged in.

9.6 Notes

When using a gaussian kernel, it is very important to run feature scaling and normalization before running the SVM algorithm. It should be obvious to the reader (from equation 56) how non-normalized features could give more importance to some features than others.

To perform multi-class classification across K classes using SVMs, we will have to implement K one-v/s-all SVM classifiers.

Neural networks can be used in almost all situations where an SVM will do well - but they are likely to be much slower to train.

9.7 Logistic Regression v/s Support Vector Machines

Different situations call for different techniques. When considering the choice between logistic regression and support vector machines, keep the following in mind:

n >> m (e.g. n = 10,000, m = 10..1000): Too many features. Prone to over-fitting (high variance). Use logistic regression or an SVM (without a kernel).

n is small, m is intermediate (e.g. n = 1..1000, m = 10..10,000): Use an SVM (with a gaussian/RBF kernel).

n is small, m is large (e.g. n = 1..1000, m = 50,000+): Try to add more features. Prone to under-fitting (high bias). Use logistic regression or an SVM (without a kernel).

10 Clustering

Clustering is an unsupervised technique to form groups of input examples. The training set, \{x^{(1)}, x^{(2)}, x^{(3)}, ..., x^{(m)}\}, has no labels. The goal is to define clusters, into one of which each of the m input examples belongs. Clustering is used for market segmentation and analysis, social network analysis, and datacenter health studies, amongst many other applications.

10.1 k-Means

Input: \{x^{(1)}, x^{(2)}, x^{(3)}, ..., x^{(m)}\}, and the number of clusters, K.

Output: \{\mu^{(1)}, \mu^{(2)}, \mu^{(3)}, ..., \mu^{(K)}\}, the K centroids for the K clusters.

Note: This time, x \in \mathbb{R}^n. There is no bias term, x_0.

10.2 Algorithm

Randomly initialize all \{\mu^{(1)}, \mu^{(2)}, \mu^{(3)}, ..., \mu^{(K)}\}
while all \mu's have not converged do
{Cluster assignment step}
for i = 1 → m do
for k = 1 → K do
dist(k) ← ||x^{(i)} − \mu^{(k)}||
end for
c^{(i)} ← index such that dist(index) is minimized
end for
{Move the centroids}
for k = 1 → K do
\mu^{(k)} ← mean(x^{(i)}) \ \forall i : c^{(i)} == k
end for
end while
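The same loop, sketched in NumPy (with a fixed iteration count instead of a convergence test, and no handling of the empty-cluster corner case; names are mine):

import numpy as np

def k_means(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]   # K random examples
    for _ in range(n_iters):
        # Cluster assignment step: nearest centroid for every example
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        c = dists.argmin(axis=1)
        # Move the centroids to the mean of their assigned points
        mu = np.array([X[c == k].mean(axis=0) for k in range(K)])
    return c, mu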

The only awkward requirement for the k-means algorithm is knowing the number of clusters, K, up front. For data that is not visually well separated, this number should come from downstream requirements like "how many t-shirt sizes do I want to make?"

10.3 Optimization objective

The cost function for each step of the k-means algorithm can be specified as:

J(c^{(1)}, c^{(2)}, ..., c^{(m)}, \mu^{(1)}, \mu^{(2)}, ..., \mu^{(K)}) = \frac{1}{m} \sum_{i=1}^{m} ||x^{(i)} - \mu_{c^{(i)}}||^2 \quad (60)

In the cluster-assignment step (when the points are all assigned to one of the clusters), this cost function is minimized w.r.t. \{c^{(1)}, c^{(2)}, ..., c^{(m)}\} whilst the \mu's are held constant, and in the centroid-move step (when we recompute the cluster centroids from the newly assigned points), we minimize it w.r.t. \{\mu^{(1)}, \mu^{(2)}, ..., \mu^{(K)}\} while the c's remain unchanged.

Another way to decide the number of clusters is the elbow method (see figure 16). We pick the number of clusters at which the cost function drops suddenly. This usually finds a natural number of clusters. But often, as the reader might expect, an elbow is not to be found in real world data.

Figure 16: The elbow method

10.4 Notes

• The random initialization has a large impact on the way the k-means algorithm performs, and it is often advised to set the initial centroids to K randomly chosen examples.


• The k-means cost function is not convex. The algorithm can get stuck in local minima when the random initialization is a terrible one. This problem usually manifests itself with bad clusters which do not make sense even visually. The way to get around this problem is to run the algorithm several hundred times with a different random initialization each time, and record the cost function at convergence. Finally, we pick the clusters from the run with the minimum final cost.

11 Principal component analysis

PCA is a method to reduce the dimensionality of a dataset while maintaining the information in it. Sometimes, we need to take a dataset with 100s of features and reduce the number of features. Usually the underlying motivation is:

• To make an already well performing machine learning algorithm run faster by reducing the number of features, and hence the size of the involved matrices.

• To bring the data down to 2 or 3 dimensions so it can be visualized. Take this with a grain of salt: visualizing features you don't know the meaning of will only take you so far. Still useful for deciding the number of clusters, etc.

• Data compression

11.1 Problem formulation

Given inputs \{x^{(1)}, x^{(2)}, ..., x^{(m)}\}, where x^{(i)} \in \mathbb{R}^n, find a transformation matrix such that the projection of any input using the matrix is z^{(i)} \in \mathbb{R}^k, where k < n.

i.e.

Find U_{red} s.t. U_{red} \in \mathbb{R}^{k \times n}, z^{(i)} = U_{red} \times x^{(i)}, such that the projection error is minimized. The projection error is the error in the data recovered in the n-dimensional space using the transpose of the original transformation. So, if x_{approx}^{(i)} = U_{red}^T \times z^{(i)}, the objective for PCA is:

\min_{U_{red}} \frac{1}{m} \sum_{i=1}^{m} ||x^{(i)} - x_{approx}^{(i)}||^2 \quad (61)

11.2 PCA is not linear regression

Often, there's confusion between linear regression and PCA. The difference is illustrated in figure 17. While the linear regression cost function is trying to minimize the y-distance between the actuals and the prediction, PCA is trying to minimize the projected distance.

11.3 Pre-processing

Before starting to run principal component analysis, it is important to perform feature scaling and normalization using:

x^{(i)}_j ← (x^{(i)}_j − µ_j) / s_j

where s_j is some measure of feature spread, like the standard deviation, variance, or range (max_j − min_j).


Figure 17: PCA is not linear regression

11.4 Algorithm

As a first step, we compute the covariance matrix for the input data.

Σ = (1/m) Σ_{i=1}^{m} (x^{(i)})(x^{(i)})^T   (62)

This is another reason the feature normalization is important: this step actually assumes that the features are zero-meaned. Remember covariance computation? Also, this can be vectorized as Σ = (1/m) X^T X where X ∈ R^{m×n}. The covariance matrix Σ ∈ R^{n×n} has the following properties:

• The matrix is symmetric across its diagonal and positive semi-definite; the diagonal elements are always non-negative.

• The diagonal elements, Σ_ii, represent the variance of the i-th feature.

• The non-diagonal elements, Σ_ij = Σ_ji, represent the covariance of the i-th and j-th features. The magnitude of the covariance represents how strongly the two features vary together. Its sign denotes whether the features vary in the same direction (positive vs. negative correlation).

The idea central to PCA is to come up with n new features which are sorted by how much information there is in each of the new features. Intuitively, the spread of a feature is its importance: the more the variance or spread of the feature, the more it tells us about the input. Think about it this way; if a feature does not vary at all (i.e. has a constant value for all inputs), its variance is 0, and it's useless to include this feature in the analysis. On the other hand, a feature that has a wide spread across inputs adds a lot of information. A real, varying value of the feature tells us a story about the input example.

We use the singular value decomposition technique to do this. While an in-depth discussion of SVD is out of scope, here's an intuition about it. SVD takes the covariance matrix Σ as its input. Then, it begins searching across various new axes for a direction such that when all the data is projected onto this direction, the spread is maximized. Having picked one axis, SVD now proceeds to do the same, only this time the possible axes are reduced to the set that is orthogonal to the first one. Finally, we obtain a set of n eigenvectors, or what is also a basis for the new vector space.

More concretely, we run:

[U, S, V] = svd(Σ)   (63)


The matrix U represents the eigenvectors of the new space, one per column:

U = [ u_1  u_2  u_3  ...  u_n ]   (64)

The dimensions corresponding to {u_1, u_2, u_3, ..., u_n} have decreasing amounts of spread/information.

The matrix S is the covariance matrix for the new features. It has some interesting properties:

• The non-diagonal elements, S_ij, are all almost 0. Almost none of the new features vary with each other.

• The values of the diagonal elements, S_ii, are larger than Σ_ii. The variance of the new features is larger.

• As we go down the diagonal, the magnitude of S_ii decreases as i increases.

Our next job is to pick a number k < n and then use {u_1, u_2, u_3, ..., u_k} as the new basis for projection into the reduced space.

U_red = [ u_1  u_2  u_3  ...  u_k ]^T   (65)

With U_red ∈ R^{k×n} and X ∈ R^{m×n}, we compute the new training data, Z ∈ R^{m×k}, as:

Z = X × U_red^T   (66)

Approximations of the original data can be re-composed using the same transformation as:

X_approx = Z × U_red   (67)
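Putting the whole pipeline together, here's a minimal numpy sketch (my own illustration; it assumes standard-deviation scaling, and the function names are hypothetical):

import numpy as np

def pca_fit(X, k):
    # Normalize as in section 11.3, here with standard-deviation scaling.
    mu, s = X.mean(axis=0), X.std(axis=0)
    X_norm = (X - mu) / s
    # Covariance matrix, vectorized (eq. 62): Sigma = (1/m) X^T X.
    Sigma = (X_norm.T @ X_norm) / X_norm.shape[0]
    # eq. 63: U holds the eigenvectors as columns, S the new variances.
    U, S, _ = np.linalg.svd(Sigma)
    U_red = U[:, :k].T            # first k eigenvectors, transposed (eq. 65)
    return U_red, mu, s, S

def pca_project(X, U_red, mu, s):
    # Z = X_norm x U_red^T (eq. 66)
    return ((X - mu) / s) @ U_red.T

def pca_recover(Z, U_red, mu, s):
    # X_approx = Z x U_red (eq. 67), then undo the scaling.
    return (Z @ U_red) * s + mu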

11.5 Choosing the number of principal components

The average squared projection error for PCA is (1/m) Σ_{i=1}^{m} ||x^{(i)} − x^{(i)}_approx||², and the total variance in the data set is given by (1/m) Σ_{i=1}^{m} ||x^{(i)}||². We need to choose k such that most of the variance in the data is preserved. So, we need to choose k such that:

[(1/m) Σ_{i=1}^{m} ||x^{(i)} − x^{(i)}_approx||²] / [(1/m) Σ_{i=1}^{m} ||x^{(i)}||²] < ε   (68)

To maintain 99% of the variance, set ε = 0.01. Since the diagonal elements of S are the variance numbers in the new space, we need to make sure that


Figure 18: Anomaly in the data

(Σ_{i=1}^{k} S_ii) / (Σ_{i=1}^{n} S_ii) > (1 − ε)   (69)

We could also compute various values of U_red for various values of k and then compute the optimal k using equation 68, but since SVD provides the new covariance matrix, this is easier.
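A sketch of that shortcut, assuming S is the vector of diagonal elements returned by the SVD above (the function name is mine):

import numpy as np

def choose_k(S, epsilon=0.01):
    # Retained-variance ratio for each candidate k, then eq. 69's test:
    # pick the smallest k whose ratio exceeds 1 - epsilon.
    ratios = np.cumsum(S) / np.sum(S)
    return int(np.searchsorted(ratios, 1.0 - epsilon)) + 1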

11.6 Applications

• Speed up an already well-performing supervised learning algorithm. Use only the training set to compute U_red. Then use it for everything else, including new predictions. It is now a part of the learnt pipeline.

• Data compression and storage

• Visualization

• Do not use PCA to reduce the number of features in order to prevent over-fitting. PCA will retain most of the information in the features, so this will not actually help. PCA does not use the information in the labels y^{(i)}; regularization does a better job for this motive. Try everything else, get the pipeline working, and then use PCA to add a speed-up.

12 Anomaly Detection

Anomaly detection is a form of unsupervised classification used to find outlier input examples in a given set. It is used in various applications including fraud detection, manufacturing defect detection, quality assurance, and monitoring datacenter machines. (See fig. 18.)

12.1 Problem formulation

Given {x^{(1)}, x^{(2)}, ..., x^{(m)}}, decide if x_test is anomalous (y = 1 if p(x_test) < ε).

12.2 Modeling

The underlying assumption is that we know the distribution from which the various features are drawn. For the discussions in the text, we will use the Gaussian distribution as the assumption. So we hypothesize that:


Figure 19: Gaussian Anomaly - top view

x_j ∼ N(µ_j, σ_j²)   ∀ j ∈ [1, n]   (70)

The probability density function for the Gaussian distribution is:

p(x; µ, σ²) = (1 / (√(2π) σ)) exp(−(x − µ)² / (2σ²))   (71)

12.3 Algorithm

Choose features x_j which might be indicative of anomalous examples.

for j ← 1 → n do
    µ_j ← (1/m) Σ_{i=1}^{m} x^{(i)}_j
    σ_j² ← (1/m) Σ_{i=1}^{m} (x^{(i)}_j − µ_j)²
end for

Now, predict p(x_test) using:

p(x_test) = p(x_test,1; µ_1, σ_1²) × p(x_test,2; µ_2, σ_2²) × ... × p(x_test,n; µ_n, σ_n²) = Π_{j=1}^{n} p(x_test,j; µ_j, σ_j²)   (72)

And then predict an anomaly, i.e. y = 1, if p(x_test) < ε. (See fig. 19.)
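A minimal numpy sketch of the fit-and-predict steps above (my own illustration; the names are hypothetical):

import numpy as np

def fit_gaussian(X):
    # Per-feature mu_j and sigma_j^2, as in the loop above.
    return X.mean(axis=0), X.var(axis=0)

def p(x, mu, sigma2):
    # Product of the per-feature Gaussian densities (eqs. 71 and 72).
    d = np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return np.prod(d)

def is_anomaly(x, mu, sigma2, epsilon):
    # Predict y = 1 (anomalous) when p(x) < epsilon.
    return p(x, mu, sigma2) < epsilon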

12.4 Algorithm Evaluation

It's very important that we find a way to evaluate the algorithm we just formulated. We will try to do that in this section using a labeled dataset, {(x^{(i)}, y^{(i)})}.

• Split the dataset into 60% training, 20% cross-validation, and 20% test data. Assume that all the training data is normal, i.e. non-anomalous (y = 0).

• Now, fit the models for x_j ∼ N(µ_j, σ_j²).

• Use the cross-validation set to decide a value for ε which leads to a good F1 score, precision, recall, misclassification error, or whatever metric suits the situation (see the sketch after this list).

• Report the evaluation metric on the test set with the chosen model parameter ε.
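A sketch of the ε-selection step, assuming p_cv holds p(x) for every cross-validation example and y_cv the labels (both names are mine); it scans candidate thresholds and keeps the one with the best F1 score:

import numpy as np

def select_epsilon(p_cv, y_cv):
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(p_cv.min(), p_cv.max(), 1000):
        pred = (p_cv < eps).astype(int)     # y = 1 when p(x) < eps
        tp = np.sum((pred == 1) & (y_cv == 1))
        fp = np.sum((pred == 1) & (y_cv == 0))
        fn = np.sum((pred == 0) & (y_cv == 1))
        if tp == 0:
            continue
        prec, rec = tp / (tp + fp), tp / (tp + fn)
        f1 = 2 * prec * rec / (prec + rec)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1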


12.5 Anomaly detection and supervised classification

The next logical question is: why would one not just use logistic regression or a support vector machine? Why are we learning an altogether new technique which does the same thing?

Anomaly Detection (Unsupervised) | Classification (Supervised)
---------------------------------|----------------------------
Use when ||{y == 1}|| << ||{y == 0}||. | Better used when the dataset is more or less balanced in volume for each of the classes: ||{y == 1}|| ≈ ||{y == 0}||.
Models what's normal and then declares as anomalous anything that does not fit the model. | Tries to model each class separately.
Future anomalies can look different from past ones so long as they don't look normal. | Past anomalies must reflect any kind that may occur in the future. (This is hard, for instance, in a low-volume manufacturing setup, but works when we have tons of data, like in email spam classification.)

12.6 Choosing features to use

We need to tackle the problem of picking features which take on outlier values in anomalous cases; ones which have an 'elbow' in p(x) at a certain boundary. Consider the case when the features are for computers in a datacenter, with features like CPU usage, disk, memory, and network traffic. Now, if most of these run web servers, CPU and network traffic are going to be highly correlated. So, while CPU and network traffic alone don't really indicate machine performance, the feature CPU/NetworkTraffic does. It will take on very low values if the software on the server is not working as intended and very high values if the hardware is malfunctioning. The trick is to pick out features which take on very small or very large values in anomalous cases.

Another thing to keep in mind is to perform feature transformations if the features are not distributed in a Gaussian fashion. While this is not a large problem even if they aren't, a number of simple transforms like x_1 ← log(x_1), x_2 ← log(x_2 + const), x_3 ← √x_3 can lead us to derived Gaussian features.

12.7 Multivariate Gaussian distribution

The Gaussian modeling discussed thus far has been a particular manifestation which assumes that the chosen features are all independent of each other. It has a feature independence assumption. As we saw in the previous section, this is not always true: CPU and network traffic are often correlated, and such an assumption makes it harder to choose features.

Another way to look at this is that the feature independence assumption leads to a combined model which can only model ellipses whose axes are parallel to the feature axes. (See fig. 20 for a contour projection plot.)

Sometimes, we need to model data which has correlations and looks more like fig. 21.

The original model fails to capture this kind of feature covariance.

12.8 Modeling

Nothing changes. We just model µ ∈ R^n and Σ ∈ R^{n×n} (Σ is the covariance matrix) for all features at the same time. And the pdf is:


Figure 20: Non-correlated features

Figure 21: Correlated features


p(x; µ, Σ) = (1 / ((2π)^{n/2} |Σ|^{1/2})) exp(−(1/2)(x − µ)^T Σ^{−1} (x − µ))   (73)

where,

µ = (1/m) Σ_{i=1}^{m} x^{(i)}   (74)

and,

Σ = (1/m) Σ_{i=1}^{m} (x^{(i)} − µ)(x^{(i)} − µ)^T   (75)

A closer look reveals that the real difference in what gets modeled is that this time we capture the non-diagonal terms, Σ_ij, which is the covariance of the i-th and j-th features. If all the non-diagonal terms were 0, as in:

Σ = [ σ_1²  0     ...  0
      0     σ_2²  ...  0
      ...   ...   ...  ...
      0     0     ...  σ_n² ]

this model would be identical to the original one.
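A minimal numpy sketch of the multivariate version (my own illustration; it uses np.linalg.solve rather than explicitly inverting Σ):

import numpy as np

def fit_multivariate_gaussian(X):
    # mu (eq. 74) and the full covariance Sigma (eq. 75).
    mu = X.mean(axis=0)
    Xc = X - mu
    return mu, (Xc.T @ Xc) / X.shape[0]

def p_multivariate(x, mu, Sigma):
    # Multivariate Gaussian density (eq. 73).
    n = mu.shape[0]
    diff = x - mu
    norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm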

12.9 Comparison

• The feature independence assumption makes the original model computationally cheaper, and it usually works out just fine.

• In the case where m < n, the covariance matrix Σ is non-invertible and only the original model is feasible.

• On the flip side, the multivariate model captures covariance and makes for an easier time in picking the features.

13 Recommender systems

Consider the problem of recommending great products or movies to a user on a website. Let's lay the terminology groundwork. The data looks like:

Movie     | User 1 | User 2 | ... | User n_u
----------|--------|--------|-----|---------
Movie 1   | 5      | 4      | ... | 1
Movie 2   | ?      | 2      | ... | 5
...       | ...    | ...    | ... | ...
Movie n_m | 2      | 2      | ... | ?

Number of users: n_u; number of movies: n_m. r(i, j) = 1 iff movie i is rated by user j, else it is 0. And lastly, y^{(i,j)} is the rating given to movie i by user j.


13.1 Applications

• Predict the rating for a movie a user hasn’t rated yet.

• Offer the user another product, given which product they already bought/viewed.

13.2 Content-based recommendations

For a moment, let's assume that someone gave us a set of features for each movie, something along the lines of romance, comedy, action, thriller, foreign, etc. Couldn't we learn a set of model parameters for each user, Θ^{(j)}, using linear regression with each of the rated movies (where r(i, j) = 1) as the input examples? Consider: Θ^{(j)} → prediction parameters for user j, and x^{(i)} → features for movie i. Once we learn the model for user j, we can predict their rating for an unrated movie using:

prediction = (Θ^{(j)})^T x^{(i)}   (76)

This is linear regression, and so we have the bias term, i.e. Θ^{(j)} ∈ R^{n+1} where n is the number of available features. Let's say that the number of movies rated by user j is m^{(j)}. The per-user linear regression has the objective:

min_{Θ^{(j)}} (1/(2m^{(j)})) Σ_{i: r(i,j)=1} [(Θ^{(j)})^T x^{(i)} − y^{(i,j)}]² + (λ/(2m^{(j)})) Σ_{k=1}^{n} (Θ^{(j)}_k)²   (77)

Combining all the per-user regressions, we have,

min_{Θ^{(1)}, Θ^{(2)}, ..., Θ^{(n_u)}} J(Θ^{(1)}, Θ^{(2)}, ..., Θ^{(n_u)})   (78)

where,

J(Θ^{(1)}, Θ^{(2)}, ..., Θ^{(n_u)}) = (1/2) Σ_{j=1}^{n_u} Σ_{i: r(i,j)=1} [(Θ^{(j)})^T x^{(i)} − y^{(i,j)}]² + (λ/2) Σ_{j=1}^{n_u} Σ_{k=1}^{n} (Θ^{(j)}_k)²   (79)

(Notice how the m^{(j)} term is missing. It turns out that constant does not really affect the algorithm.)
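As an illustration (mine, not from the lecture), a single user's Θ^{(j)} can be fit in closed form with the regularized normal equation over just the movies that user rated; X_rated is assumed to carry a leading bias column of ones:

import numpy as np

def fit_user(X_rated, y_rated, lam):
    # Regularized normal equation for one user's theta (objective 77).
    # X_rated: feature rows of the movies this user rated;
    # y_rated: the corresponding ratings.
    reg = lam * np.eye(X_rated.shape[1])
    reg[0, 0] = 0.0                     # do not regularize the bias term
    return np.linalg.solve(X_rated.T @ X_rated + reg, X_rated.T @ y_rated)

# Predicted rating for movie i by user j: theta_j @ x_i   (eq. 76)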

13.3 Collaborative filtering

Now, we approach the case where the movies don't have input features, but all users have told us how much they like comedy and hate romance. It is possible to perform an inverted linear regression and back out what we think the feature values for a movie should be.

So, given Θ^{(1)}, Θ^{(2)}, ..., Θ^{(n_u)}, learn x^{(1)}, x^{(2)}, ..., x^{(n_m)}.

Similar to equation 77, the objective function for a single movie this time is:

min_{x^{(i)}} (1/2) Σ_{j: r(i,j)=1} [(Θ^{(j)})^T x^{(i)} − y^{(i,j)}]² + (λ/2) Σ_{k=1}^{n} (x^{(i)}_k)²   (80)

and, similar to equation 78, that for many movies is:

min_{x^{(1)}, x^{(2)}, ..., x^{(n_m)}} J(x^{(1)}, x^{(2)}, ..., x^{(n_m)})   (81)


where,

J(x^{(1)}, x^{(2)}, ..., x^{(n_m)}) = (1/2) Σ_{i=1}^{n_m} Σ_{j: r(i,j)=1} [(Θ^{(j)})^T x^{(i)} − y^{(i,j)}]² + (λ/2) Σ_{i=1}^{n_m} Σ_{k=1}^{n} (x^{(i)}_k)²   (82)

The process of collaborative filtering is just to repeat these two processes, with some ratings missing and no given features, over and over until convergence, as in: Θ → x → Θ → x → Θ → ...

13.4 Algorithm

The alternating process described in the previous section can be combined into one optimization objective by treating all of {Θ^{(1)}, Θ^{(2)}, ..., Θ^{(n_u)}, x^{(1)}, x^{(2)}, ..., x^{(n_m)}} as model parameters. The minimization objective for collaborative filtering is:

J(Θ^{(1)}, ..., Θ^{(n_u)}, x^{(1)}, ..., x^{(n_m)}) = (1/2) Σ_{(i,j): r(i,j)=1} [(Θ^{(j)})^T x^{(i)} − y^{(i,j)}]² + (λ/2) Σ_{i=1}^{n_m} Σ_{k=1}^{n} (x^{(i)}_k)² + (λ/2) Σ_{j=1}^{n_u} Σ_{k=1}^{n} (Θ^{(j)}_k)²   (83)

The derivatives required for gradient descent or advanced optimization techniques are:

∂J/∂x^{(i)}_k = Σ_{j: r(i,j)=1} [(Θ^{(j)})^T x^{(i)} − y^{(i,j)}] Θ^{(j)}_k + λ x^{(i)}_k   (84)

and,

∂J/∂Θ^{(j)}_k = Σ_{i: r(i,j)=1} [(Θ^{(j)})^T x^{(i)} − y^{(i,j)}] x^{(i)}_k + λ Θ^{(j)}_k   (85)

13.4.1 Formal Algorithm

Initialize Θ^{(1)}, ..., Θ^{(n_u)}, x^{(1)}, ..., x^{(n_m)} to small random values. Use gradient descent or an advanced optimization technique, along with equations 83, 84, and 85, to train the model parameters Θ^{(1)}, ..., Θ^{(n_u)}, x^{(1)}, ..., x^{(n_m)} (a vectorized sketch follows this list).

• Now, we can predict the rating for a (user, movie) pair using pred = (Θ^{(j)})^T x^{(i)}.

• We can also tell the similarity between two movies by using ||x^{(i)} − x^{(j)}|| as the distance between them.
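Here's a vectorized numpy sketch of one gradient-descent step on equation 83 (my own illustration; X stacks the x^{(i)} as rows, Theta the Θ^{(j)}, Y the ratings, and R the r(i, j) indicators):

import numpy as np

def cofi_gradient_step(X, Theta, Y, R, lam, alpha):
    # X: n_m x n movie features; Theta: n_u x n user parameters;
    # Y: n_m x n_u ratings; R: n_m x n_u indicators, 1 where rated.
    E = (X @ Theta.T - Y) * R            # prediction errors, zeroed where unrated
    X_grad = E @ Theta + lam * X         # eq. 84, vectorized over all i, k
    Theta_grad = E.T @ X + lam * Theta   # eq. 85, vectorized over all j, k
    return X - alpha * X_grad, Theta - alpha * Theta_grad

# Initialize X, Theta to small random values, then iterate until J converges.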

13.5 Vectorization: Low rank matrix factorization

We are given the matrices:

Y = [ y^{(1,1)}    y^{(1,2)}    ...  y^{(1,n_u)}
      y^{(2,1)}    y^{(2,2)}    ...  y^{(2,n_u)}
      ...          ...          ...  ...
      y^{(n_m,1)}  y^{(n_m,2)}  ...  y^{(n_m,n_u)} ]


PRED = [ (Θ^{(1)})^T x^{(1)}    (Θ^{(2)})^T x^{(1)}    ...  (Θ^{(n_u)})^T x^{(1)}
         (Θ^{(1)})^T x^{(2)}    (Θ^{(2)})^T x^{(2)}    ...  (Θ^{(n_u)})^T x^{(2)}
         ...                    ...                    ...  ...
         (Θ^{(1)})^T x^{(n_m)}  (Θ^{(2)})^T x^{(n_m)}  ...  (Θ^{(n_u)})^T x^{(n_m)} ]

X = [ ... (x^{(1)})^T ...
      ... (x^{(2)})^T ...
      ...
      ... (x^{(n_m)})^T ... ]

and,

Θ = [ ... (Θ^{(1)})^T ...
      ... (Θ^{(2)})^T ...
      ...
      ... (Θ^{(n_u)})^T ... ]

The vectorized format of the predictions is PRED = X Θ^T. In the linear algebra world, the matrix PRED bears a property called low rank.

13.6 Mean normalization

Let's consider a new user who has not rated any movies yet. In the collaborative filtering algorithm, according to equation 83, Θ^{(j)} will turn out to be all zeros (also, think about a movie that has not been rated by anyone), and we will be unable to recommend movies to the new user (or recommend a new movie to any user). This can be overcome by normalizing all the ratings across the mean for that movie, like so:

y^{(i,j)} ← y^{(i,j)} − µ^{(i)}   (86)

and then perform predictions, using

prediction = (Θ^{(j)})^T x^{(i)} + µ^{(i)}   (87)

Now, all new users will have predicted ratings equal to the mean rating for that movie, which does make sense.
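A small numpy sketch of the normalization (mine; unrated cells are assumed to be zeroed by R):

import numpy as np

def mean_normalize(Y, R):
    # Per-movie mean over rated entries only (eq. 86).
    mu = np.array([row[r == 1].mean() if r.any() else 0.0
                   for row, r in zip(Y, R)])
    return (Y - mu[:, None]) * R, mu

# Prediction for (movie i, user j): theta_j @ x_i + mu[i]   (eq. 87)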

14 Large scale machine learning

Often, the amount of data which a machine learning scientist deals with is large enough that it warrants special techniques, both to save time and for us to be able to iterate faster.

14.1 Stochastic gradient descent

In our study of gradient descent, we were updating all the model parameters after computing the cost and the gradient across all the inputs. In organizations which have a lot of data, this is a problem, since one cannot monitor progress until the algorithm has run through all thirty million examples at least once. Stochastic gradient descent solves this problem. The only change is that we update all model parameters with the evaluation of the cost for each input example.


while model parameters have not converged (change < ε) do
    (x, y) ← next example
    Θ_j ← Θ_j − α ∂J(Θ; x, y)/∂Θ_j   ∀ j
end while

where we set the cost (for linear regression) for a single example according to:

J(Θ; x, y) = cost(Θ; x, y) = (1/2)(h_Θ(x) − y)² + (λ/2) Σ_{j=1}^{n} Θ_j²   (88)

• This cost function is not convex, and the descent can get stuck in a local optimum. However, this is usually not a problem. This algorithm, however, will not show a constantly decreasing cost for each iteration. In order to monitor it, plot the cost averaged over every 1000 iterations. If that curve is constantly decreasing, you're golden.

• Also, since it does one example at a time, stochastic gradient descent doesn't quite converge at the minimum. It keeps hovering around a very acceptable solution. If we want more convergence, one thing to do is to reduce the learning rate α with the iterations, with something like: α ← const1 / (iterationNumber + const2).

Another variant of stochastic gradient descent is mini-batch gradient descent, where instead of one example at a time or all examples each time, we choose a mini-batch of b examples, such that 1 ≤ b ≤ m.
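A sketch of both variants for linear regression (my own illustration; batch=1 gives stochastic descent, larger values give mini-batches):

import numpy as np

def sgd_linear_regression(X, y, alpha=0.01, lam=0.0, epochs=1, batch=1):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        order = np.random.permutation(m)      # shuffle once per pass
        for start in range(0, m, batch):
            b = order[start:start + batch]
            err = X[b] @ theta - y[b]         # h_theta(x) - y on the batch
            grad = X[b].T @ err / len(b) + lam * theta
            theta -= alpha * grad             # step along eq. 88's gradient
    return theta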

14.2 Online learning

A lot of business models, like information retrieval (search), dynamic pricing, etc., require that a machine learning system be able to learn from data as soon as it comes into the system. Batch learning algorithms fail here, but stochastic gradient descent works in this scenario.

14.3 Batch gradient descent and mapreduce

Another way to speed up batch gradient descent is to use mapreduce. One can shard all the inputs, and compute the gradient and cost (both of which are summations over the examples) distributed over the shards. Then these can be combined to move the algorithm one step forward (model parameters are changed) on a master shard. This distributed algorithm can also benefit from the multi-core architecture of modern machines.

The thing to remember is: if the primary step in an algorithm is distributable, like a summation, it can likely benefit from a distributed-computing speed-up.
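As a toy illustration of the map-reduce idea on a single multi-core machine (everything here is hypothetical, using Python's multiprocessing to stand in for real shards):

import numpy as np
from multiprocessing import Pool

def shard_gradient(args):
    # Map step: partial gradient summation over one shard of the examples.
    X_shard, y_shard, theta = args
    return X_shard.T @ (X_shard @ theta - y_shard)

def distributed_gradient(X, y, theta, n_shards=4):
    # Reduce step: add the per-shard sums, then average over all m examples.
    jobs = list(zip(np.array_split(X, n_shards),
                    np.array_split(y, n_shards), [theta] * n_shards))
    with Pool(n_shards) as pool:
        partials = pool.map(shard_gradient, jobs)
    return sum(partials) / X.shape[0]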