Midterm Review Fall 2019


Page 1

Midterm Review Fall 2019

Page 2

Comments

• Midterm location on schedule
• CCIS 1 440

• Covers Chapters 1-6 (up to GLMs, but not including GLMs)

• My office hours are today, from 2-4 p.m.

Page 3

High-level comments

• In some cases there is a lot of space, but that does not mean you need to fill the entire page. If the answer can be stated concisely in two sentences, that is perfectly reasonable

• I am not looking for one specific “right” answer. I am looking for your thought process, to see if you understood the material

• In the past, students have given the wrong answer (e.g., on a true/false question), but I gave full marks because the reasoning demonstrated understanding

• Two common mistakes: (a) giving multiple answers in the hope that one is right (you'll lose marks for this); (b) answering a different question than the one asked (read the question carefully)

Page 4

Probability review

• Quantify uncertainty using probability theory

• Discussed sigma-algebras and probability measures

• Discussed random variables as functions of event-space

• Discussed relationships between random variables, including (in)dependence and conditional independence

• Discussed operations, like expected value, marginalization, Bayes rule, chain rule

Page 5

Exercise: Probability

• Suppose that we have created a machine learning algorithm that predicts whether a link will be clicked. It has a TPR (true positive rate) of 0.99 and an FPR (false positive rate) of 0.01.
• Let C be a binary RV, with C = 1 indicating Predict Click

• Let Y be binary RV, with Y = 1 indicating Actual Click

• p(C = 1 | Y = 1) = TPR

• p(C = 1 | Y = 0) = FPR

• The link is actually clicked on 1 in 1,000 visits to the website. If we predict the link will be clicked on a specific visit, what is the probability it will actually be clicked?
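As a numerical check (a minimal sketch; the variable names are only for illustration), Bayes' rule gives the answer directly from the quantities defined above:

```python
# Bayes' rule for p(Y = 1 | C = 1), using the numbers from the exercise.
tpr = 0.99          # p(C = 1 | Y = 1)
fpr = 0.01          # p(C = 1 | Y = 0)
p_click = 1 / 1000  # p(Y = 1), the base rate of actual clicks

# p(C = 1) by marginalizing over Y
p_predict_click = tpr * p_click + fpr * (1 - p_click)

# Bayes' rule: p(Y = 1 | C = 1) = p(C = 1 | Y = 1) p(Y = 1) / p(C = 1)
p_actual_given_predicted = tpr * p_click / p_predict_click
print(p_actual_given_predicted)  # roughly 0.09, despite the high TPR
```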

Page 6

Exercise: MAP and ML

• See a set of random variables X1, …, Xn

• We assumed that these RVs are independent and identically distributed

• Then we assumed a distribution for each p(Xi | theta)
• e.g., p(Xi | theta) is a Gaussian

• How would our ML objectives change if we did not assume independence?
• previously maximized p(D | theta) = prod_{i=1}^n p(Xi | theta)
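As a reference point before the whiteboard discussion, here is a minimal sketch of the iid log-likelihood that this product corresponds to, assuming (purely for illustration) that p(Xi | theta) is Gaussian with unknown mean and known variance:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=50)  # X1, ..., Xn, assumed iid

def log_likelihood(theta, x, sigma=1.0):
    # log p(D | theta) = sum_i log p(Xi | theta), valid under the iid assumption
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (x - theta) ** 2 / (2 * sigma**2))

# With iid Gaussian data, the maximizer is the sample mean.
thetas = np.linspace(0.0, 4.0, 401)
best = thetas[np.argmax([log_likelihood(t, data) for t in thetas])]
print(best, data.mean())  # the grid maximizer is close to the sample mean

# Without independence, log p(D | theta) no longer splits into this sum over
# samples; we would need the joint distribution of (X1, ..., Xn) given theta.
```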

(we'll do this on the whiteboard)

Page 7

Exercise: Bias of Sample Average

• See a set of random variables X1, …, Xn (dataset of size n)

• We showed the sample average is unbiased if the Xi are iid

• Is it biased or unbiased if the data is identically distributed, but not independent?

• (Why do we talk about the dataset being random?)
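A small simulation to check your reasoning about the second question (a sketch with assumed numbers: every Xi is N(mu, 1) marginally, but the samples are made dependent through an equicorrelated covariance):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, n, rho, trials = 3.0, 10, 0.8, 20000

# Covariance with 1 on the diagonal and rho off the diagonal: each Xi is
# N(mu, 1), so the data are identically distributed but strongly dependent.
cov = np.full((n, n), rho) + (1 - rho) * np.eye(n)
datasets = rng.multivariate_normal(np.full(n, mu), cov, size=trials)

sample_averages = datasets.mean(axis=1)  # one sample average per dataset
print(sample_averages.mean())  # close to mu: linearity of expectation needs no independence
print(sample_averages.var())   # but the variance of the estimator does change
```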

Page 8

Optimization

• We’ve discussed first and second-order gradient descent

• We've discussed batch gradient descent, mini-batch gradient descent and stochastic gradient descent
• one epoch corresponds to going once over the entire dataset (size n)

• each batch GD update uses one epoch (across all n samples)

• SGD does n updates for each epoch, noisy gradient

• mini-batch SGD does n / b updates for each epoch, where b is the batch size and the gradient is slightly less noisy than 1-sample SGD
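A minimal sketch of one epoch under each scheme, for linear regression with squared error (the synthetic data and the stepsize are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, b, stepsize = 100, 3, 10, 0.01
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

def gradient(w, Xb, yb):
    # gradient of the mean squared error on the given (mini-)batch
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

w = np.zeros(d)

# Batch GD: one update per epoch, using all n samples.
w = w - stepsize * gradient(w, X, y)

# SGD: n updates per epoch, each from a single (noisy) sample gradient.
for i in rng.permutation(n):
    w = w - stepsize * gradient(w, X[i:i + 1], y[i:i + 1])

# Mini-batch SGD: n / b updates per epoch, each averaging b sample gradients.
for start in range(0, n, b):
    idx = slice(start, start + b)
    w = w - stepsize * gradient(w, X[idx], y[idx])
```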

Page 9

Optimization

• We've discussed batch gradient descent, mini-batch gradient descent and stochastic gradient descent
• one epoch corresponds to going once over the entire dataset (size n)

• each batch GD update uses one epoch (across all n samples)

• SGD does n updates for each epoch, noisy gradient

• mini-batch SGD does n / b updates for each epoch, where b is the batch size and the gradient is slightly less noisy than 1-sample SGD

• We've only done first-order SGD. What about second order?
• Can also consider stochastic Hessian approximations

• The real goal is to scale the update, to make flat regions more curved and really sharp regions flatter

• Common to use a vector of stepsizes (a diagonal approximation to the inverse Hessian), which scales the update to each dimension separately
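One common instance of such a stepsize vector is an RMSProp-style scaling; this particular rule is an assumption for illustration, not necessarily the one used in the course:

```python
import numpy as np

def rmsprop_update(w, grad, avg_sq_grad, stepsize=0.01, beta=0.9, eps=1e-8):
    # Track a running average of the squared gradient per dimension, then scale
    # each coordinate's update by 1 / sqrt(avg_sq_grad): dimensions with small
    # gradients (flat directions) get larger effective steps, and dimensions
    # with large gradients (sharp directions) get smaller ones.
    avg_sq_grad = beta * avg_sq_grad + (1 - beta) * grad**2
    w = w - stepsize * grad / (np.sqrt(avg_sq_grad) + eps)
    return w, avg_sq_grad

w, v = np.zeros(3), np.zeros(3)
g = np.array([1e-3, 1.0, 10.0])  # very differently scaled gradient coordinates
w, v = rmsprop_update(w, g, v)
print(w)  # the per-dimension scaling evens out the step taken in each coordinate
```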

Page 10

Practice midterm question


Bonus (Mandatory for students in 566). [20 marks]

A typical goal behind data normalization is to make all the features of the same scale. For example, the features are rescaled to be zero-mean, with unit variance, by taking the data matrix X ∈ R^{n×d} and centering and normalizing each column: X_ij = (X_ij − μ_j) / σ_j, where μ_j = (1/n) Σ_{i=1}^n X_ij and σ_j² = (1/n) Σ_{i=1}^n (X_ij − μ_j)². Why might this be important for gradient descent? Hint: Consider a setting where you have two features, where the first has range [0, 0.01] and the other [0, 1000]. Think about the stochastic gradient descent update for linear regression, and what issues might arise.
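A sketch of the column standardization described in the question, together with a quick look at why feature scale matters for the SGD update (the data here is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
# First feature in [0, 0.01], second in [0, 1000], as in the hint.
X = np.column_stack([rng.uniform(0, 0.01, n), rng.uniform(0, 1000, n)])
y = 5.0 * X[:, 0] + 0.01 * X[:, 1] + rng.normal(size=n)

w = np.zeros(2)
i = 0
# Stochastic gradient for sample i: 2 (x_i^T w - y_i) x_i, so each coordinate
# of the gradient is proportional to the magnitude of that feature.
grad = 2.0 * (X[i] @ w - y[i]) * X[i]
print(grad)  # the two coordinates differ by many orders of magnitude

# Standardize each column: subtract the column mean, divide by the column std.
mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_scaled = (X - mu) / sigma
grad_scaled = 2.0 * (X_scaled[i] @ w - y[i]) * X_scaled[i]
print(grad_scaled)  # comparable magnitudes, so a single stepsize can work for both
```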

Page 11

Exercise: Understand behaviour of algorithms

• Q1: Run l1-LinearRegression twice on the same data, using SGD. Should you expect to get about the same error on the testing set?

• Q2: Use squared error for binary classification (nonconvex obj). Again, run twice on the same data, using SGD. Should you expect to get about the same error on the testing set?

• Q3: Use squared error for binary classification (nonconvex obj). Run, with SGD, on the same training data, in the same order, with the same starting point → should it produce the same final set of weights?

• Q4: When performing batch gradient descent with line search, should you expect loss(wt) to consistently decrease?

• Q5: When using SGD with a small fixed stepsize, should you expect loss(wt) to consistently decrease?
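A small simulation you can use to check your intuition for Q4 and Q5 (synthetic linear-regression data and a basic backtracking line search, both assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.5 * rng.normal(size=n)

def loss(w):
    return np.mean((X @ w - y) ** 2)

def grad(w, idx=slice(None)):
    Xb, yb = X[idx], y[idx]
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

# Batch GD with a backtracking line search: each accepted step decreases the loss.
w, batch_losses = np.zeros(d), []
for _ in range(50):
    g, step = grad(w), 1.0
    while loss(w - step * g) > loss(w) - 0.5 * step * (g @ g):
        step *= 0.5
    w = w - step * g
    batch_losses.append(loss(w))

# SGD with a small fixed stepsize: an individual step can increase the full loss.
w, sgd_losses = np.zeros(d), []
for _ in range(50):
    i = rng.integers(n)
    w = w - 0.01 * grad(w, slice(i, i + 1))
    sgd_losses.append(loss(w))

print(all(np.diff(batch_losses) <= 0))  # expected True: monotone decrease
print(all(np.diff(sgd_losses) <= 0))    # typically False: noisy gradients
```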

Page 12

Terminology clarification

• I have used the words Cost, Objective, Loss, and Error

• These are all sometimes used interchangeably in ML

• I try to use these specifically to mean:
• Cost — the true cost for errors in our prediction (e.g., for classification, we had a 0-1 cost in Chapter 3).

• Objective — the thing we are trying to optimize, usually labelled as little c in optimizations. It can be more general than an error

• Loss — typically also means the objective, where we minimize Loss. Sometimes we have used it to be the error term in the objective.

• Error — usually for the least-squares error (reflects error in prediction inside objective); I’ll try to avoid this extra term

Page 13

Optimal models versus Estimated models

• Chapter 4: Talked about optimal models
• For 0-1 Cost, found f*(x) = argmax_y p(y | x)

• For Squared-error Cost, found f*(x) = E[Y | x]

• For Absolute-error Cost, found f*(x) = median(Y | x)

• None of Chapter 3 is about estimation, only about what functions we wish we could get to use for prediction

• Afterwards (Linear regression, GLMs), discussed surrogates to find these f* — we did NOT directly minimize this cost

Page 14

Exercise: Why don’t we directly minimize the Cost?

• In one case we do: linear regression. Otherwise, we typically do not; instead, we use MLE/MAP objectives as a proxy for minimizing the Expected Cost

• Let’s imagine we did directly minimize Expected Cost, for binary classification

• We can obtain an Empirical Cost, to estimate this cost

• How might we minimize this?

E[C] = ∫_X Σ_{y ∈ Y} cost(f(x), y) p(x, y) dx

E[C] = ∫_X Σ_{y ∈ Y} 1(f(x) ≠ y) p(x, y) dx

Empirical cost: Σ_{i=1}^n 1(f(x_i) ≠ y_i)
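As a concrete version of the empirical cost above (up to the 1/n factor), here is a minimal sketch with an assumed linear classifier and made-up labels; note that this count is piecewise constant in the weights, which is one reason we optimize smooth surrogate objectives instead:

```python
import numpy as np

def empirical_01_cost(w, X, y):
    # (1/n) * sum_i 1(f(x_i) != y_i), with f(x) = 1 if x^T w > 0 and 0 otherwise
    predictions = (X @ w > 0).astype(int)
    return np.mean(predictions != y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
print(empirical_01_cost(np.array([1.0, 1.0]), X, y))   # 0.0 for the true separator
print(empirical_01_cost(np.array([1.0, -1.0]), X, y))  # much worse for a bad one
```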

Page 15

Expected Cost = Reducible and Irreducible Error

• We cannot reduce irreducible error

• Our goal is to find f to minimize reducible error
• Note that directly minimizing expected cost will minimize reducible error, since we cannot influence irreducible error

• We cannot directly minimize this true reducible error (expectation over all pairs (x,y)); instead, we use sampled training data to minimize it

1. f(x) = E[Y|x]

2. f(x) ≠ E[Y|x].

When f(x) = E[Y|x], the expected cost can be simply expressed as

E[C] = ∫_X p(x) ∫_Y (E[Y|x] − y)² p(y|x) dy dx = ∫_X p(x) V[Y|X = x] dx    (4.4)

Recall that V[Y|X = x] is the variance of Y, for the given x. The expected cost, therefore, reflects the cost incurred from noise or variability in the targets. This is the best scenario in regression for a squared error cost; we cannot achieve a lower expected cost.

The next situation is when f(x) ≠ E[Y|x]. Here, we will proceed by decomposing the squared error as

(f(x) − y)² = (f(x) − E[Y|x] + E[Y|x] − y)²
            = (f(x) − E[Y|x])² + 2(f(x) − E[Y|x])(E[Y|x] − y) + (E[Y|x] − y)²

where the middle cross term is denoted g(x, y). Notice that the expected value of g(x, Y) for each x is zero because

E[g(x, Y)] = E[(f(x) − E[Y|x])(E[Y|x] − Y) | x]
           = (f(x) − E[Y|x]) E[(E[Y|x] − Y) | x]
           = (f(x) − E[Y|x]) (E[Y|x] − E[Y|x])
           = 0.

Therefore, we can conclude that E[g(X, Y)] = 0, when taking expectations over X. We can now express the expected cost as

E[C] = E[(f(X) − Y)²] = E[(f(X) − E[Y|X])²] + E[(E[Y|X] − Y)²],

where the first term is the reducible error and the second is the irreducible error.

The first term reflects how far the trained model f(x) is from the optimal model E[Y|x]. The second term reflects the inherent variability in Y given x, as written in Equation (4.4). These terms are also often called the reducible and irreducible errors. If we extend the class of functions f to predict E[Y|x], we can reduce the first expected error. However, the second error is inherent or irreducible in the sense that no matter how much we improve the function, we cannot reduce this term. This relates to the problem of partial observability, where there is always some stochasticity due to a lack of information. This irreducible distance could potentially be further reduced by providing more feature information (i.e., extending the information in x). However, for a given dataset, with the given features, this error is irreducible.

To sum up, we argued here that optimal classification and regression models critically depend on knowing or accurately learning the posterior distribution p(y|x). This task can be solved in different ways, but a straightforward approach is to assume a functional form for p(y|x), say p(y|x, θ), where θ is a set of weights or parameters that are to be learned from the data.

Page 16

Bias-variance (trade-off)

Reducible Error = Bias² + Variance

Realizable setting, with w(D) the weights estimated from dataset D and ω the true weights:

Bias = E[w(D)] − ω

Variance = V[w(D)]

Page 17

Bias-variance (generally)

Reducible Error = E[(f̂(X) − f(X))²] = E[ E[(f̂(x) − f(x))² | X = x] ]

Given x:

E[(f̂(x) − f(x))²] = (E[f̂(x)] − f(x))² + V[f̂(x)]

Page 18

Exercise: Reasoning about Training Error and Expected Error

• Imagine you get 0 training error (say squared error)

• But the irreducible error is sigma^2, the variance of p(y | x)

• This suggests the best error you can get is sigma^2 (i.e., the minimal error, even with zero reducible error)

• What does this tell you about your learned model?


Page 19

Bias-variance (trade-off)

Q1: Why might variance increase with increasing model complexity?

Q2: Is this really a trade-off? Does decreasing variance necessarily imply we have increased bias?
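One way to build intuition for Q1 (a sketch; the data-generating process and model orders are assumptions): refit polynomials of two different orders on many resampled training sets and compare how much the fitted predictions vary.

```python
import numpy as np

rng = np.random.default_rng(0)
x_test = np.linspace(-1, 1, 50)

def fit_and_predict(order, n=30):
    # Draw a fresh training set and fit a polynomial of the given order.
    x = rng.uniform(-1, 1, n)
    y = np.sin(2 * x) + 0.3 * rng.normal(size=n)
    coeffs = np.polyfit(x, y, order)
    return np.polyval(coeffs, x_test)

for order in (1, 9):
    preds = np.array([fit_and_predict(order) for _ in range(200)])
    # Variance of the learned predictor across datasets, averaged over x_test:
    # the more complex model changes much more from one training set to the next.
    print(order, preds.var(axis=0).mean())
```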

Page 20

Practice Midterm Question

• Imagine you transform the input observations to higher-order polynomials, before using linear regression. You consider all polynomials of orders k = 1,2,...,100. You are given a training set, with a separate test set. How might you determine which models are overfitting or underfitting?
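One concrete way to approach this (a sketch with made-up data; the question itself is about the reasoning): compute training and test error for each order k and look for the characteristic patterns.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test = 20, 200
x_train, x_test = rng.uniform(-1, 1, n_train), rng.uniform(-1, 1, n_test)
y_train = np.sin(3 * x_train) + 0.2 * rng.normal(size=n_train)
y_test = np.sin(3 * x_test) + 0.2 * rng.normal(size=n_test)

for k in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, k)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    # Underfitting: both errors are high (k too small).
    # Overfitting: training error is much lower than test error (k too large).
    print(k, round(train_err, 4), round(test_err, 4))
```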

Page 21

Practice Midterm Question


Question 4. [20 marks]

Let us assume a setting where the true model is linear, i.e., Y = w_0 + Σ_{j=1}^d w_j X_j + ε for weights w_j ∈ R, random variables X_j and ε ~ N(0, 1). Imagine you get a dataset with n = 100 samples, and train one model with linear regression and one model with cubic regression (linear regression, where you first expand the features into all the polynomial terms in a cubic polynomial). Which model do you think will obtain lower training error, or will they perform the same, and why?
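A sketch of this setup (with assumed weights, two features, and n = 100) that you can use to check your answer; note that the cubic feature expansion contains the plain linear features, so the larger model class can always match the smaller one on the training set:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = rng.normal(size=(n, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(size=n)  # the true model is linear

def train_mse(features):
    Phi = np.column_stack([np.ones(n), features])  # include the intercept w_0
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)    # least-squares fit
    return np.mean((Phi @ w - y) ** 2)

linear_features = X
cubic_features = np.column_stack([X, X**2, X**3, X[:, 0:1] * X[:, 1:2]])
print(train_mse(linear_features))  # the cubic features include the linear ones,
print(train_mse(cubic_features))   # so the cubic training error is never higher
```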

Page 22

Generalized linear models

• Can pick any natural exponential family distribution for p(y | x)

• If p(y | x) is Gaussian, then we get linear regression with <x, w> approximating E[y | x]

• If p(y | x) is Bernoulli, then we get logistic regression with sigmoid(<x, w>) approximating E[y | x]

• If p(y | x) is Poisson, then we get Poisson regression with exp(<x, w>) approximating E[y | x]

• If p(y | x) is a Multinomial (multiclass), then we get multinomial logistic regression with softmax(<x, w>) approximating E[y | x]

Note: GLMs are not on the midterm, but this summary is useful for thinking about distributions

Page 23

Exercise: Selecting distributions

• What distributions might you use if

• we have binary features and targets?

• binary targets and continuous features?

• positive targets?

• categorical features with a large number of categories?

• multi-class targets, with continuous features?

Page 24

Exercise: Selecting algorithms

• When might logistic regression do better than linear regression?

• When might Poisson regression do better than linear regression?

• When might Ridge Regression (l2-regularized linear regression) do better than linear regression?

Page 25

Practice Midterm Question

• Discuss any one form of regularization that is used to train linear regression models. Why is such regularization used?

• Example answer that would get full marks:

One form of regularization that is used with linear regression is the ℓ2 regularizer, where ℓ2(w) = Σ_j w_j² = ||w||₂². This function of w is added to the least-squares loss ||Xw − y||₂², to get the modified minimization min_w ||Xw − y||₂² + λ||w||₂². This regularizer helps prevent overfitting, by encoding a preference for weights that are closer to zero (the ℓ2 regularizer corresponds to using a zero-mean Gaussian prior on the weights). From an algebraic perspective, it adds a non-negligible positive constant λ to the eigenvalues of the matrix XᵀX, making this matrix better conditioned and so making the solution that uses the inverse of this matrix more stable.
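A quick numerical illustration of the algebraic point in this answer (the data, including the nearly collinear column, is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 5, 1.0
X = rng.normal(size=(n, d))
X[:, 4] = X[:, 3] + 1e-6 * rng.normal(size=n)  # two nearly collinear columns
y = rng.normal(size=n)

A = X.T @ X
print(np.linalg.cond(A))                    # huge condition number: nearly singular
print(np.linalg.cond(A + lam * np.eye(d)))  # adding lambda * I makes it well conditioned

# The l2-regularized (ridge) solution uses the better-conditioned matrix.
w_ridge = np.linalg.solve(A + lam * np.eye(d), X.T @ y)
```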

Page 26

Another answer that would get full marks

• Discuss any one form of regularization that is used to train linear regression models. Why is such regularization used?

One form of regularization that is used with linear regression is the ℓ2 regularizer, where ℓ2(w) = Σ_j w_j² = ||w||₂². This regularizer helps prevent overfitting, by encoding a preference for weights that are closer to zero (the ℓ2 regularizer corresponds to using a zero-mean Gaussian prior on the weights).

Page 27

Challenge Exercise

• Imagine you believed p(y|x) is a mixture of two Gaussians

• Recall we can take the convex combination of two Gaussians to represent a more complex bimodal distribution

• If you want to estimate p(y|x), how might you parameterize it?

• How might you estimate those parameters?

c₁ N(μ₁, σ₁²) + (1 − c₁) N(μ₂, σ₂²)
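A minimal sketch of one possible parameterization; everything here, including the choice to let only the means depend on x, is an assumption for illustration. The parameters could then be estimated by minimizing this negative log-likelihood with a gradient method, or with EM.

```python
import numpy as np

def mixture_density(y, x, params):
    # p(y | x) = c * N(y; w1^T x, s1^2) + (1 - c) * N(y; w2^T x, s2^2)
    c, w1, w2, s1, s2 = params
    def normal_pdf(values, mean, s):
        return np.exp(-(values - mean) ** 2 / (2 * s**2)) / np.sqrt(2 * np.pi * s**2)
    return c * normal_pdf(y, x @ w1, s1) + (1 - c) * normal_pdf(y, x @ w2, s2)

def negative_log_likelihood(params, X, Y):
    # Objective to minimize over (c, w1, w2, s1, s2) for a training set (X, Y).
    return -np.sum(np.log(mixture_density(Y, X, params)))

# Example evaluation with assumed parameters and made-up data.
rng = np.random.default_rng(0)
X, Y = rng.normal(size=(20, 3)), rng.normal(size=20)
params = (0.5, np.ones(3), -np.ones(3), 1.0, 1.0)
print(negative_log_likelihood(params, X, Y))
```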