Machine Learning for Data Mining: Important Issues

Andres Mendez-Vazquez

July 3, 2015



Outline

1 Bias-Variance Dilemma
  Introduction
  Measuring the difference between optimal and learned
  The Bias-Variance
  “Extreme” Example

2 Confusion Matrix
  The Confusion Matrix

3 K-Cross Validation
  Introduction
  How to choose K


Introduction

What did we see until now?
The design of learning machines from two main points of view:

Statistical Point of View
Linear Algebra and Optimization Point of View

Going back to the probability models
We might think of the machine to be learned as a function g(x|D), something like curve fitting, under a data set

D = {(x_i, y_i) | i = 1, 2, ..., N}   (1)

Remark: where the x_i ~ p(x|Θ)!!!


Thus, we have that

Two main functions
A function g(x|D) obtained using some algorithm!!!
E[y|x], the optimal regression...

Important
The key factor here is the dependence of the approximation on D.

Why?
The approximation may be very good for a specific training data set but very bad for another.

This is the reason for studying fusion of information at the decision level...


How do we measure the difference

We have that

Var(X) = E[(X - µ)^2]

We can do the same for our data

Var_D(g(x|D)) = E_D[(g(x|D) - E[y|x])^2]

Now, if we add and subtract

E_D[g(x|D)]   (2)

Remark: this is the expected output of the machine g(x|D).


Thus, we have that

Our original variance

Var_D(g(x|D)) = E_D[(g(x|D) - E[y|x])^2]
             = E_D[(g(x|D) - E_D[g(x|D)] + E_D[g(x|D)] - E[y|x])^2]
             = E_D[ (g(x|D) - E_D[g(x|D)])^2
                    + 2 (g(x|D) - E_D[g(x|D)]) (E_D[g(x|D)] - E[y|x])
                    + (E_D[g(x|D)] - E[y|x])^2 ]

Finally

E_D[ (g(x|D) - E_D[g(x|D)]) (E_D[g(x|D)] - E[y|x]) ] = ?   (3)
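An aside not on the slides: the answer to (3) is that the cross term vanishes, which is exactly what yields the clean decomposition on the next slide. A short derivation:

```latex
% The second factor, E_D[g(x|D)] - E[y|x], does not depend on D, so it can be
% pulled out of the expectation over D; the remaining factor has zero mean.
\begin{align*}
\mathbb{E}_D\!\left[\big(g(x|D)-\mathbb{E}_D[g(x|D)]\big)\big(\mathbb{E}_D[g(x|D)]-\mathbb{E}[y|x]\big)\right]
  &= \big(\mathbb{E}_D[g(x|D)]-\mathbb{E}[y|x]\big)\,
     \mathbb{E}_D\!\left[g(x|D)-\mathbb{E}_D[g(x|D)]\right] \\
  &= \big(\mathbb{E}_D[g(x|D)]-\mathbb{E}[y|x]\big)\cdot 0 = 0 .
\end{align*}
```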


We have the Bias-Variance

Our Final Equation

E_D[(g(x|D) - E[y|x])^2] = E_D[(g(x|D) - E_D[g(x|D)])^2] + (E_D[g(x|D)] - E[y|x])^2
                           \_________ VARIANCE _________/   \_________ BIAS _________/

Where the variance
It measures the error between our machine g(x|D) and the expected output of the machine under x_i ~ p(x|Θ).

Where the bias
It is the quadratic error between the expected output of the machine under x_i ~ p(x|Θ) and the output of the optimal regression.
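To make the decomposition concrete, here is a minimal Python sketch (not from the slides) that estimates both terms by Monte Carlo, resampling many training sets D. The sinusoidal f, the noise level, and the straight-line least-squares learner are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):                                # assumed "true" regression E[y|x]
    return np.sin(2 * np.pi * x)

def train_g(x_train, y_train):
    # Illustrative learner: least-squares fit of a straight line w1*x + w0.
    w1, w0 = np.polyfit(x_train, y_train, deg=1)
    return lambda x: w1 * x + w0

N, sigma_eps, n_datasets = 20, 0.3, 2000
x_train = np.linspace(0.0, 1.0, N)       # fixed design points x_i
x0 = 0.25                                # point x at which we evaluate g(x|D)

# Sample many data sets D (only the y_i change, as on the slides) and record g(x0|D).
g_at_x0 = np.empty(n_datasets)
for m in range(n_datasets):
    y_train = f(x_train) + rng.normal(0.0, sigma_eps, size=N)
    g = train_g(x_train, y_train)
    g_at_x0[m] = g(x0)

variance = np.var(g_at_x0)                      # E_D[(g(x0|D) - E_D[g(x0|D)])^2]
bias_sq  = (np.mean(g_at_x0) - f(x0)) ** 2      # (E_D[g(x0|D)] - E[y|x0])^2
mse      = np.mean((g_at_x0 - f(x0)) ** 2)      # E_D[(g(x0|D) - E[y|x0])^2]

print(f"variance={variance:.4f}  bias^2={bias_sq:.4f}  "
      f"sum={variance + bias_sq:.4f}  mse={mse:.4f}")
```

Up to Monte Carlo error, variance plus squared bias matches the mean squared error, mirroring the equation above.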


Remarks

We have then
Even if the estimator is unbiased, it can still result in a large mean square error due to a large variance term.

The situation is more dire for a finite data set D
We then have a trade-off:

1 Increasing the bias decreases the variance, and vice versa.
2 This is known as the bias-variance dilemma.


Similar to...

Curve Fitting
If, for example, the adopted model is complex (many parameters involved) with respect to the number N, the model will fit the idiosyncrasies of the specific data set.

Thus
It will result in low bias but will yield high variance as we change from one data set to another.

Furthermore
If N grows, we can fit a more complex model, which reduces the bias while keeping the variance low.

However, N is always finite!!!


Thus

You always need to compromise
However, you always have some a priori knowledge about the data,

Allowing you to impose restrictions
Lowering both the bias and the variance.

Nevertheless
The following example helps to better grasp the bothersome bias-variance dilemma.


For this

Assume
The data is generated by the following model

y = f(x) + ε,   ε ~ N(0, σ_ε^2)

We know that
The optimum regressor is E[y|x] = f(x).

Furthermore
Assume that the randomness in the different training sets, D, is due to the y_i's (affected by noise), while the respective points, x_i, are fixed.


Sampling the Space

Imagine that the x-values of D lie in an interval [x_1, x_2]
For example, you can choose

x_i = x_1 + ((x_2 - x_1)/(N - 1)) (i - 1),   with i = 1, 2, ..., N
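As a small aside (my own note, not from the slides), this evenly spaced design is exactly what numpy.linspace produces:

```python
import numpy as np

x1, x2, N = 0.0, 1.0, 20
# x_i = x1 + ((x2 - x1) / (N - 1)) * (i - 1), for i = 1, 2, ..., N
x = np.linspace(x1, x2, N)
```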


Case 1

Choose the estimate of f(x), g(x|D), to be independent of D
For example, g(x) = w_1 x + w_0.

[Figure: the data points are spread around (x, f(x)).]


Case 1

Since g(x) is fixed

E_D[g(x|D)] = g(x|D) ≡ g(x)   (4)

With

Var_D[g(x|D)] = 0   (5)

On the other hand
Because g(x) was chosen arbitrarily, the expected bias must be large:

(E_D[g(x|D)] - E[y|x])^2   (BIAS)   (6)


Case 2

On the other hand
Now g_1(x) corresponds to a polynomial of high degree, so it can pass through each training point in D.

[Figure: example of g_1(x) passing through the data points.]


Case 2

Due to the zero mean of the noise source

E_D[g_1(x|D)] = f(x) = E[y|x]   for any x = x_i   (7)

Remark: at the training points the bias is zero.

However, the variance increases

E_D[(g_1(x|D) - E_D[g_1(x|D)])^2] = E_D[(f(x) + ε - f(x))^2] = σ_ε^2,   for x = x_i, i = 1, 2, ..., N

In other words
The bias becomes zero (or approximately zero), but the variance is now equal to the variance of the noise source.
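The two extremes are easy to simulate. Below is a Python sketch (my own illustration, not from the slides) that draws many noisy data sets over a fixed grid of x_i, uses an arbitrary fixed line for Case 1 and an interpolating polynomial for Case 2, and measures squared bias and variance at the training points; f, σ_ε, and the fixed line's weights are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):                                  # assumed true regression E[y|x]
    return np.sin(2 * np.pi * x)

N, sigma_eps, n_datasets = 10, 0.2, 5000
x = np.linspace(0.0, 1.0, N)               # fixed design points x_i

# Case 1: a fixed g(x) = w1*x + w0, chosen without looking at D.
w1, w0 = 0.0, 0.0
preds_fixed = np.tile(w1 * x + w0, (n_datasets, 1))   # same prediction for every D

# Case 2: a degree-(N-1) polynomial g1(x) that passes through every training point.
preds_interp = np.empty((n_datasets, N))
for m in range(n_datasets):
    y = f(x) + rng.normal(0.0, sigma_eps, size=N)
    coeffs = np.linalg.solve(np.vander(x, N), y)      # exact interpolation of the N points
    preds_interp[m] = np.polyval(coeffs, x)

def bias2_and_var(preds):
    mean_pred = preds.mean(axis=0)                    # E_D[g(x_i|D)], estimated
    bias2 = np.mean((mean_pred - f(x)) ** 2)          # averaged over the training x_i
    var = np.mean(preds.var(axis=0))
    return bias2, var

print("Case 1 (fixed line):    bias^2=%.4f  variance=%.4f" % bias2_and_var(preds_fixed))
print("Case 2 (interpolation): bias^2=%.4f  variance=%.4f" % bias2_and_var(preds_interp))
print("noise variance sigma_eps^2 = %.4f" % sigma_eps ** 2)
```

With these assumptions, Case 1 reports zero variance and a large bias, while Case 2 reports (near-)zero bias at the x_i and a variance close to σ_ε^2, matching equations (4)-(7).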


Observations

First
Everything that has been said so far applies to both the regression and the classification tasks.

However
The mean squared error is not the best way to measure the power of a classifier.

Think about
A classifier that sends everything far away from the hyperplane, away from the target values ±1!!!
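A toy illustration (my own, not from the slides): a classifier that pushes every point far to the correct side of the decision boundary has zero classification error, yet a huge squared error against the ±1 targets.

```python
import numpy as np

y_true = np.array([+1, +1, -1, -1, +1])              # class labels encoded as +/-1
scores = np.array([+7.0, +9.0, -8.0, -6.0, +10.0])   # confident, correct outputs

zero_one_error = np.mean(np.sign(scores) != y_true)  # classification error: 0.0
squared_error = np.mean((scores - y_true) ** 2)      # MSE against the +/-1 targets: large

print(zero_one_error, squared_error)
```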


Introduction

Something Notable
In evaluating the performance of a classification system, the probability of error alone sometimes does not assess its performance sufficiently.

For this, assume an M-class classification task
An important issue is to know whether there are classes that exhibit a higher tendency for confusion.

The confusion matrix
The confusion matrix A = [A_ij] is defined such that each element A_ij is the number of data points whose true class was i but which were classified in class j.
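A minimal sketch (my own, not from the slides) of how A = [A_ij] can be tallied from true and predicted labels; the label arrays are made-up examples.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    # A[i, j] = number of points whose true class is i but were classified as j.
    A = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        A[t, p] += 1
    return A

y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2]       # hypothetical true classes (0-indexed)
y_pred = [0, 1, 1, 1, 0, 2, 2, 1, 2]       # hypothetical classifier outputs

print(confusion_matrix(y_true, y_pred, n_classes=3))
```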


Thus

We have that
From A, one can directly extract the recall and precision values for each class, along with the overall accuracy.

Recall, R_i
It is the percentage of data points with true class label i which were correctly classified in that class.

For example, in a two-class problem
The recall of the first class is calculated as

R_1 = A_11 / (A_11 + A_12)   (8)


More

Precision, P_i
It is the percentage of data points classified as class i whose true class is indeed i.

Therefore, again for a two-class problem

P_1 = A_11 / (A_11 + A_21)   (9)

Overall Accuracy, Ac
The overall accuracy, Ac, is the percentage of data that has been correctly classified.


Thus, for an M-Class Problem

We have that

Ac = (1/N) Σ_{i=1}^{M} A_ii   (10)
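Continuing the confusion-matrix sketch above (the 3x3 matrix below is again a made-up example), recall, precision, and overall accuracy per equations (8)-(10) can be read directly off A:

```python
import numpy as np

A = np.array([[50,  3,  2],
              [ 4, 40,  6],
              [ 1,  5, 44]])               # hypothetical 3-class confusion matrix

N = A.sum()                                 # total number of data points
recall    = np.diag(A) / A.sum(axis=1)      # R_i = A_ii / sum_j A_ij   (row-wise)
precision = np.diag(A) / A.sum(axis=0)      # P_i = A_ii / sum_j A_ji   (column-wise)
accuracy  = np.trace(A) / N                 # Ac  = (1/N) * sum_i A_ii

print("recall   :", np.round(recall, 3))
print("precision:", np.round(precision, 3))
print("accuracy : %.3f" % accuracy)
```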


What we want

We want to measure
A quality measure for comparing different classifiers (for different parameter values).

We call it the risk

R(f) = E_D[L(y, f(x))]   (11)

Example: L(y, f(x)) = ||y - f(x)||_2^2

More precisely
For different values γ_j of the parameter, we train a classifier f(x|γ_j) on the training set.


Then, calculate the empirical Risk

Do you have any ideas?
Give me your best shot!!!

Empirical Risk
We use the validation set to estimate

R̂(f(x|γ_j)) = (1/N_v) Σ_{i=1}^{N_v} L(y_i, f(x_i|γ_j))   (12)

Thus, you follow this procedure
1 Select the value γ* which achieves the smallest estimated error.
2 Re-train the classifier with parameter γ* on all data except the test set (i.e., train + validation data).
3 Report the error estimate R̂ computed on the test set.
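A compact Python sketch of this select-on-validation, retrain, report-on-test procedure (my own illustration; the toy nearest-mean "classifier", the 0/1 loss, the γ grid, and the synthetic splits are all assumptions standing in for the real ones):

```python
import numpy as np

rng = np.random.default_rng(2)

def train_classifier(X, y, gamma):
    """Toy learner f(x|gamma): nearest class mean, with the means shrunk by gamma."""
    mu_pos = gamma * X[y == +1].mean(axis=0)
    mu_neg = gamma * X[y == -1].mean(axis=0)
    return lambda Xq: np.where(
        np.linalg.norm(Xq - mu_pos, axis=1) < np.linalg.norm(Xq - mu_neg, axis=1), +1, -1)

def empirical_risk(f, X, y):
    return np.mean(f(X) != y)            # 0/1 loss; any L(y, f(x)) would do

# Synthetic data, split into train / validation / test.
X = rng.normal(size=(300, 2)) + np.array([1.0, 1.0])
y = np.where(rng.random(300) < 0.5, +1, -1)
X[y == -1] -= 2.0                        # shift the negative class
X_tr, y_tr = X[:150], y[:150]
X_va, y_va = X[150:225], y[150:225]
X_te, y_te = X[225:], y[225:]

gammas = [0.1, 0.5, 1.0, 2.0]            # candidate parameter values gamma_j

# 1. Train f(x|gamma_j) on the training set and estimate its risk on the validation set.
val_risk = {g: empirical_risk(train_classifier(X_tr, y_tr, g), X_va, y_va) for g in gammas}
gamma_star = min(val_risk, key=val_risk.get)

# 2. Re-train with gamma* on everything except the test set (train + validation).
f_star = train_classifier(np.vstack([X_tr, X_va]), np.concatenate([y_tr, y_va]), gamma_star)

# 3. Report the error estimate computed on the held-out test set.
print("gamma* =", gamma_star, " test risk =", empirical_risk(f_star, X_te, y_te))
```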


Idea

Something Notable
Each of the error estimates computed on the validation set comes from a single instance of a trained classifier. Can we improve the estimate?

K-fold Cross Validation
To estimate the risk of a classifier f:

1 Split the data into K equally sized parts (called "folds").
2 Train an instance f_k of the classifier, using all folds except fold k as training data.
3 Compute the cross-validation (CV) estimate

R̂_CV(f(x|γ_j)) = (1/N) Σ_{i=1}^{N} L(y_i, f_{k(i)}(x_i|γ_j))   (13)

where k(i) is the fold containing x_i.
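A self-contained Python sketch of the K-fold estimate (13) (my own illustration; the toy nearest-mean classifier and the synthetic data are assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)

def train_classifier(X, y):
    """Toy nearest-class-mean classifier, standing in for any learner f_k."""
    mu_pos, mu_neg = X[y == +1].mean(axis=0), X[y == -1].mean(axis=0)
    return lambda Xq: np.where(
        np.linalg.norm(Xq - mu_pos, axis=1) < np.linalg.norm(Xq - mu_neg, axis=1), +1, -1)

def kfold_cv_risk(X, y, K):
    N = len(y)
    fold_of = rng.permutation(N) % K                 # k(i): the fold containing x_i
    losses = np.empty(N)
    for k in range(K):
        held_out = (fold_of == k)
        f_k = train_classifier(X[~held_out], y[~held_out])     # train without fold k
        losses[held_out] = (f_k(X[held_out]) != y[held_out])   # L(y_i, f_{k(i)}(x_i)), 0/1 loss
    return losses.mean()                             # (1/N) * sum_i L(y_i, f_{k(i)}(x_i))

# Synthetic two-class data, as an assumed example.
X = rng.normal(size=(200, 2))
y = np.where(rng.random(200) < 0.5, +1, -1)
X[y == +1] += 1.5                                    # separate the positive class a bit

print("5-fold CV risk estimate: %.3f" % kfold_cv_risk(X, y, K=5))
```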


Example

K = 5, k = 3

  Fold:   1       2       3            4       5
  Role:   Train   Train   Validation   Train   Train

Actually, we have
The cross-validation procedure does not involve the test data:

  [ Train Data + Validation Data ]   [ Test ]
    (split this part into the K folds)
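A small sketch of that bookkeeping (my own illustration): the test portion is carved off first and never enters the cross-validation folds; the sizes and the choice k = 3 are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(4)

N, K = 100, 5
indices = rng.permutation(N)

# Hold out the test set first -- cross-validation never sees it.
n_test = N // 5
test_idx = indices[:n_test]
trainval_idx = indices[n_test:]                 # "split this part" into the K folds

folds = np.array_split(trainval_idx, K)         # folds 1..K over train + validation only
k = 2                                           # fold 3 (0-indexed) plays validation, as on the slide
val_idx = folds[k]
train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])

print("test:", len(test_idx), " validation (fold 3):", len(val_idx), " train:", len(train_idx))
```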


How to choose K

Extremal cases
K = N, called leave-one-out cross validation (LOOCV)
K = 2

An often-cited problem with LOOCV is that we have to train many (= N) classifiers, but there is also a deeper problem.

Argument 1: K should be small, e.g. K = 2
1 Unless we have a lot of data, the variance between two distinct training sets may be considerable.
2 Important concept: by removing substantial parts of the sample in turn and at random, we can simulate this variance.
3 By removing a single point (LOOCV), we cannot make this variance visible.


How to choose K

Argument 2: K should be large, e.g. K = N
1 Classifiers generally perform better when trained on larger data sets.
2 A small K means we substantially reduce the amount of training data used to train each f_k, so we may end up with weaker classifiers.
3 This way, we will systematically overestimate the risk.

Common recommendation: K = 5 to K = 10
Intuition:
1 K = 10 means the number of samples removed from training is one order of magnitude below the training sample size.
2 This should not weaken the classifier considerably, but it should be large enough to make variance effects measurable.
