
CSC411: Midterm Review

Xiaohui Zeng

February 7, 2019

Built from slides by James Lucas, Joe Wu and others


Agenda

1. A brief overview

2. Some sample questions


Basic ML Terminology

- Regression
- Overfitting
- Generalization
- Stochastic Gradient Descent (SGD)
- Classification
- Underfitting
- Regularization
- Bayes Optimal


Basic ML Terminology

- Training Data
- Validation Data
- Test Data
- Optimization
- 0-1 Loss
- Linear classifier
- Features
- Model


Some Questions

Question 1

1. Bagging improves performance by reducing variance.

2. Given discrete random variables X and Y, the information gain in Y due to X is

$$IG(Y \mid X) = H(Y) - H(Y \mid X),$$

where $H$ denotes the entropy.
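A small numeric sketch of these two quantities (illustrative code, not from the slides; the helper names entropy and information_gain are hypothetical):

import numpy as np

def entropy(p):
    # Entropy in bits of a discrete distribution (array of probabilities).
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                              # treat 0 log 0 as 0
    return -np.sum(p * np.log2(p))

def information_gain(joint):
    # IG(Y|X) = H(Y) - H(Y|X), from a joint table P(X=i, Y=j).
    joint = np.asarray(joint, dtype=float)
    p_x = joint.sum(axis=1)                   # marginal over X
    p_y = joint.sum(axis=0)                   # marginal over Y
    # H(Y|X) = sum_x P(x) H(Y | X = x)
    h_y_given_x = sum(p_x[i] * entropy(joint[i] / p_x[i])
                      for i in range(len(p_x)) if p_x[i] > 0)
    return entropy(p_y) - h_y_given_x

print(information_gain([[0.5, 0.0], [0.0, 0.5]]))      # X determines Y -> 1.0 bit
print(information_gain([[0.25, 0.25], [0.25, 0.25]]))  # independent X, Y -> 0.0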


Some Questions

Question 2

Take labelled data (X, y).

1. Why should you use a validation set?

2. How do you know if your model is overfitting?

3. How do you know if your model is underfitting?


ML Models

1. Nearest Neighbours

2. Decision Trees

3. Ensembles

4. Linear Regression

5. Logistic Regression

6. SVMs


Nearest Neighbours

1. Decision Boundaries
2. Choice of k vs. Generalization (see the sketch below)
3. Curse of dimensionality
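A minimal sketch of k-NN prediction, assuming Euclidean distance and majority vote (the function name is illustrative):

import numpy as np

def knn_predict(X_train, y_train, x, k):
    # Majority vote among the k nearest training points (Euclidean distance).
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return np.bincount(y_train[nearest]).argmax()

# Small k gives a jagged, low-bias/high-variance boundary;
# large k gives a smoother, higher-bias/lower-variance one.
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([0.2, 0.1]), k=3))  # -> 0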


Decision Trees

1. Entropy: H(X), H(Y|X)

2. Information Gain

3. Decision Boundaries


Bayes Optimality

Starting with the squared error: $E[(y - t)^2 \mid x] = E[y^2 - 2yt + t^2 \mid x]$.

1. Choose a single value $y^*$ based on $p(t \mid x)$: the expected error decomposes as (derivation below)

   $$(y - E[t \mid x])^2 + \mathrm{Var}(t \mid x)$$

2. Treat $y$ as a random variable; the expected error decomposes into
   - a bias term
   - a variance term
   - the Bayes error
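Filling in the algebra for case 1, using $E[t^2 \mid x] = E[t \mid x]^2 + \mathrm{Var}(t \mid x)$:

\begin{align*}
E[(y - t)^2 \mid x] &= y^2 - 2y\, E[t \mid x] + E[t^2 \mid x]\\
&= y^2 - 2y\, E[t \mid x] + E[t \mid x]^2 + \mathrm{Var}(t \mid x)\\
&= (y - E[t \mid x])^2 + \mathrm{Var}(t \mid x),
\end{align*}

which is minimized by $y^* = E[t \mid x]$; the leftover $\mathrm{Var}(t \mid x)$ is the irreducible Bayes error.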


Ensembles

1. Bagging / bootstrap aggregation (see the sketch below)

2. Boosting
   - decision stumps
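A rough sketch of bagging with decision stumps (illustrative helper names and toy data, not the course's reference code): each stump is fit on a bootstrap resample, and the ensemble averages their votes, which reduces variance.

import numpy as np

def stump_fit(X, y):
    # Depth-1 tree: exhaustively pick the (feature, threshold, polarity)
    # that minimizes 0-1 error on the given sample.
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for flip in (False, True):
                pred = (X[:, j] > thr).astype(int)
                if flip:
                    pred = 1 - pred
                err = np.mean(pred != y)
                if best is None or err < best[0]:
                    best = (err, j, thr, flip)
    return best[1:]

def stump_predict(stump, X):
    j, thr, flip = stump
    pred = (X[:, j] > thr).astype(int)
    return 1 - pred if flip else pred

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

stumps = []
for _ in range(25):                             # 25 bootstrap rounds
    idx = rng.integers(0, len(X), size=len(X))  # resample with replacement
    stumps.append(stump_fit(X[idx], y[idx]))

votes = np.mean([stump_predict(s, X) for s in stumps], axis=0)
print(np.mean((votes > 0.5).astype(int) == y))  # majority-vote accuracy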


Linear Regression

1. Loss function

2. Direct solution

3. (Stochastic) Gradient Descent (see the sketch below)

4. Regularization
   - L1 vs. L2 norm
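A minimal sketch (illustrative names, not the course's reference code) of the direct solution with optional L2 regularization, plus a single SGD step on the squared error:

import numpy as np

def ridge_fit(X, t, lam=0.0):
    # Direct solution: w = (X^T X + lam I)^{-1} X^T t (lam = 0 gives least squares).
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ t)

def sgd_step(w, x, t, lr=0.01):
    # One stochastic step on one example: grad of (w^T x - t)^2 is 2 (w^T x - t) x.
    return w - lr * 2.0 * (w @ x - t) * x

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
t = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
print(ridge_fit(X, t, lam=0.1))  # approximately [1, -2, 0.5]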


Logistic Regression

1. Loss functions (see the sketch below)
   - 0-1 loss?
   - L2 loss?
   - cross-entropy loss?

2. Binary vs. Multi-class

3. Decision Boundaries
   - $p = \dfrac{1}{1 + e^{-\theta^\top x}}$
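A minimal sketch of the cross-entropy loss and its gradient for binary logistic regression (illustrative names and toy data):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(theta, X, y):
    # Average cross-entropy for labels y in {0, 1}.
    p = sigmoid(X @ theta)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradient(theta, X, y):
    # d/dtheta of the average cross-entropy: X^T (p - y) / n.
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

# The boundary p = 0.5 is exactly theta^T x = 0: logistic regression is a
# linear classifier even though the sigmoid and the loss are nonlinear.
X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5], [1.0, -2.0]])
y = np.array([1, 0, 1, 0])
theta = np.zeros(2)
for _ in range(500):
    theta -= 0.5 * gradient(theta, X, y)
print(theta, cross_entropy(theta, X, y))  # loss shrinks as the data separate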


SVMs

1. Hinge loss

2. Margins
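A minimal sketch of the hinge loss for labels y in {-1, +1} (illustrative names): points with margin $y\, w^\top x \geq 1$ contribute zero loss; only points inside the margin or misclassified matter.

import numpy as np

def hinge_loss(w, X, y):
    # L(w) = mean(max(0, 1 - y * (w^T x))).
    margins = y * (X @ w)
    return np.mean(np.maximum(0.0, 1.0 - margins))

w = np.array([1.0, -1.0])
X = np.array([[2.0, 0.0],    # margin  2.0 -> loss 0
              [0.5, 0.0],    # margin  0.5 -> loss 0.5
              [-1.0, 0.0]])  # margin -1.0 -> loss 2 (misclassified)
y = np.array([1, 1, 1])
print(hinge_loss(w, X, y))   # (0 + 0.5 + 2) / 3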


Sample Question 1

First, we use a linear regression method to model this data. To test our linear regressor, we choose at random some data records to be a training set, and choose at random some of the remaining records to be a test set.

Now let us increase the training set size gradually. As the training set size increases, what do you expect will happen with the mean training and mean testing errors? (No explanation required.)

- Mean Training Error: A. Increase; B. Decrease
- Mean Testing Error: A. Increase; B. Decrease


Q1 Solution

- The training error tends to increase. As more examples have to be fitted, it becomes harder to "hit", or even come close to, all of them.
- The test error tends to decrease. As we take more examples into account when training, we have more information and can come up with a model that better resembles the true behavior. More training examples lead to better generalization.
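A quick simulation of both trends (a sketch assuming a noisy linear ground truth; not part of the original solution):

import numpy as np

rng = np.random.default_rng(0)

def make_data(n, d=5, noise=0.5):
    X = rng.normal(size=(n, d))
    w = np.arange(1.0, d + 1.0)          # fixed "true" weights
    return X, X @ w + noise * rng.normal(size=n)

X_test, t_test = make_data(1000)
for n in [6, 10, 30, 100, 300]:
    X, t = make_data(n)
    w = np.linalg.lstsq(X, t, rcond=None)[0]
    train = np.mean((X @ w - t) ** 2)
    test = np.mean((X_test @ w - t_test) ** 2)
    print(f"n={n:4d}  train MSE={train:.3f}  test MSE={test:.3f}")
# Train error rises toward the noise level; test error falls toward it.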


Sample Question 2

If variables X and Y are independent, is I(X|Y) = 0? If yes, prove it. If no, give a counterexample.


Q2 Solution

Recall that

- Two random variables X and Y are independent if for all $x \in \mathrm{Values}(X)$ and all $y \in \mathrm{Values}(Y)$, $P(X = x, Y = y) = P(X = x)\, P(Y = y)$.
- $H(X) = -\sum_x P(x) \log_2 P(x)$


Q2 Solution

\begin{align}
I(X \mid Y) &= H(X) - H(X \mid Y) \tag{1}\\
&= -\sum_x P(x) \log_2 P(x) - \Big(-\sum_y \sum_x P(x, y) \log_2 P(x \mid y)\Big) \tag{2}\\
&= -\sum_x P(x) \log_2 P(x) - \Big(-\sum_y P(y) \sum_x P(x) \log_2 P(x)\Big) \tag{3}\\
&= -\sum_x P(x) \log_2 P(x) - \Big(-\sum_x P(x) \log_2 P(x)\Big) \tag{4}\\
&= 0 \tag{5}
\end{align}

Step (2) to (3) uses independence: $P(x, y) = P(y)\, P(x \mid y)$ with $P(x \mid y) = P(x)$; step (3) to (4) uses $\sum_y P(y) = 1$.


Sample Question 3

Given input $x \in \mathbb{R}^D$ and target $t \in \mathbb{R}$, consider a linear model of the form

$$y(x, w) = \sum_{i=1}^{D} w_i x_i.$$

Now suppose a noisy perturbation $\varepsilon_i$ is added independently to each of the input variables $x_i$, i.e., $\tilde{x}_i = x_i + \varepsilon_i$. Assume

- $E[\varepsilon_i] = 0$
- for $i \neq j$: $E[\varepsilon_i \varepsilon_j] = 0$
- $E[\varepsilon_i^2] = \lambda$

We define the following objective that tries to be robust to noise:

$$w^* = \arg\min_w E_\varepsilon[(w^\top \tilde{x} - t)^2]. \tag{6}$$

Show that it is equivalent to minimizing L2-regularized linear regression, i.e.,

$$w^* = \arg\min_w \big[(w^\top x - t)^2 + \lambda \lVert w \rVert^2\big].$$


Q3 Solution

Let

$$\tilde{y} = \sum_{i=1}^{D} w_i (x_i + \varepsilon_i) = y + \sum_{i=1}^{D} w_i \varepsilon_i,$$

where $y = w^\top x = \sum_{i=1}^{D} w_i x_i$.


Q3 Solution

Then we start with

$$w^* = \arg\min_w E_\varepsilon[(w^\top \tilde{x} - t)^2] = \arg\min_w E_\varepsilon[(\tilde{y} - t)^2];$$

the inner term is

\begin{align}
(\tilde{y} - t)^2 &= \tilde{y}^2 - 2\tilde{y} t + t^2 \tag{7}\\
&= \Big(y + \sum_{i=1}^{D} w_i \varepsilon_i\Big)^2 - 2t \Big(y + \sum_{i=1}^{D} w_i \varepsilon_i\Big) + t^2 \tag{8}\\
&= y^2 + 2y \sum_{i=1}^{D} w_i \varepsilon_i + \Big(\sum_{i=1}^{D} w_i \varepsilon_i\Big)^2 - 2ty - 2t \sum_{i=1}^{D} w_i \varepsilon_i + t^2 \tag{9}
\end{align}

Then we take the expectation under the distribution of $\varepsilon$.


Q3 Solution

That is,

$$E_\varepsilon\Big[y^2 + 2y \sum_{i=1}^{D} w_i \varepsilon_i + \Big(\sum_{i=1}^{D} w_i \varepsilon_i\Big)^2 - 2ty - 2t \sum_{i=1}^{D} w_i \varepsilon_i + t^2\Big].$$

The second and fifth terms vanish since $E[\varepsilon_i] = 0$, while the third term becomes

$$E_\varepsilon\Big[\Big(\sum_{i=1}^{D} w_i \varepsilon_i\Big)^2\Big] = \sum_{i,j} w_i w_j\, E[\varepsilon_i \varepsilon_j] = \lambda \sum_{i=1}^{D} w_i^2,$$

using $E[\varepsilon_i \varepsilon_j] = 0$ for $i \neq j$ and $E[\varepsilon_i^2] = \lambda$. Finally we see

$$E_\varepsilon[(\tilde{y} - t)^2] = y^2 + \lambda \sum_{i=1}^{D} w_i^2 - 2ty + t^2 = (y - t)^2 + \lambda \lVert w \rVert^2.$$
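A quick Monte Carlo sanity check of this equivalence (a sketch, not part of the original solution): averaging the noisy-input squared error over many draws of $\varepsilon$ should match the closed form.

import numpy as np

rng = np.random.default_rng(0)
D, lam = 4, 0.3
w = rng.normal(size=D)
x = rng.normal(size=D)
t = 1.5
y = w @ x

eps = rng.normal(scale=np.sqrt(lam), size=(500_000, D))  # E[eps_i^2] = lam
noisy_err = (eps + x) @ w - t                            # = y~ - t per draw
print(np.mean(noisy_err ** 2))                 # Monte Carlo estimate
print((y - t) ** 2 + lam * np.sum(w ** 2))     # closed form; should agree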
