CMU-Q 15-381, Lecture 24:
Supervised Learning 2
Teacher:
Gianni A. Di Caro
SUPERVISED LEARNING

Given a collection of labeled input features and outputs (x^(i), y^(i)), i = 1, …, m, and a
hypothesis function h_θ, find the parameter values θ that minimize the average empirical loss:

    minimize_θ  (1/m) Σ_{i=1}^{m} ℓ( h_θ(x^(i)), y^(i) )

We need to specify:
1. The hypothesis class H, h_θ ∈ H (the hypothesis space)
2. The loss function ℓ (the performance criterion, measuring prediction errors)
3. The algorithm for solving the optimization problem (often approximately)
4. A complete ML design: from data processing to learning to validation and testing
CLASSIFICATION AND REGRESSION

Features: Width, Lightness

Classification: (Width, Lightness) → {Salmon, Sea bass} (discrete)
    h_θ(x): x ∈ ℝ² → y ∈ {0, 1}
Regression: (Width, Lightness) → Weight (continuous)
    h_θ(x): x ∈ ℝ² → y ∈ ℝ

Which hypothesis class h? The data may involve complex boundaries and relations.
PROBABILISTIC MODELS: DISCRIMINATIVE VS. GENERATIVE

Discriminative models:
    Directly learn p(y | x)
    Parametric hypothesis
    Allow discriminating between classes / predicted outputs

Generative models / Probability distributions:
    Learn p(x, y), the probabilistic model that describes the data, then use Bayes' rule:

        p(y | x) = p(x | y) p(y) / p(x) = p(x, y) / p(x)

    Allow generating any relevant data

Regression and classification problems can be stated in probabilistic terms (later).
The mapping y = h_θ(x) that we are learning can be naturally interpreted as the
probability of the output being y given the input data x (under the selected
hypothesis h and the learned parameter vector θ).

[Figure: scatter plot of the two classes in feature space (x₁, x₂); points labeled
salmon and sea bass]
GENERATIVE MODELS

A generative approach would proceed as follows:
1. By looking at the feature data about salmon, build a model of a salmon
2. By looking at the feature data about sea bass, build a model of a sea bass
3. To classify a new fish based on its features x, match it against the salmon
   and the sea bass models, to see whether it looks more like the salmon or
   more like the sea bass we had seen in the training set

Steps 1-3 are equivalent to modeling p(x | y), where y ∈ {c₁, c₂}: the conditional
probability that the observed features x are those of a salmon or of a sea bass.

A discriminative model, which learns p(y | x; θ), can be used to label the data,
to discriminate among the data, but not to generate the data.
o Direct learning of the mapping from x to y
o E.g., a discriminative approach tries to find the (linear, in this case)
  decision boundary that yields the best classification on the training data,
  and takes decisions accordingly
GENERATIVE MODELS

Bayes' rule:

    p(y | x) = p(x | y) p(y) / p(x) = p(x, y) / p(x)

    posterior = (likelihood × prior) / evidence

p(x | y = c₁) models the distribution of salmon features
p(x | y = c₂) models the distribution of sea bass features
p(y) can be derived from the dataset or from other sources
o E.g., p(c₁) = ratio of salmon in the dataset, p(c₂) = ratio of sea bass

    p(x) = p(x | y = c₁) p(y = c₁) + p(x | y = c₂) p(y = c₂)

To make a prediction:

    argmax_y p(y | x) = argmax_y p(x | y) p(y) / p(x) = argmax_y p(x | y) p(y)

Equivalent to: decide c₁ if p(c₁ | x) > p(c₂ | x), otherwise decide c₂
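The argmax rule above can be sketched in code. This is a minimal illustration, assuming
hypothetical 1-D Gaussian class-conditional densities and made-up priors (all numbers are
invented for the example, not taken from the slides):

```python
from math import sqrt, pi, exp

# Hypothetical 1-D Gaussian class-conditional density p(x | y)
def gaussian_pdf(x, mu, sigma):
    return exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

# Assumed class-conditional parameters (mean, std) and priors p(y) -- illustrative
params = {"salmon": (3.0, 0.5), "sea bass": (5.0, 0.8)}
priors = {"salmon": 0.6, "sea bass": 0.4}

def classify(x):
    # argmax_y p(x | y) p(y): the evidence p(x) is the same for both classes,
    # so it can be dropped from the comparison
    posteriors = {c: gaussian_pdf(x, mu, s) * priors[c]
                  for c, (mu, s) in params.items()}
    return max(posteriors, key=posteriors.get)

print(classify(3.2))  # near the salmon mean -> salmon
print(classify(5.1))  # near the sea bass mean -> sea bass
```

The decision is exactly the threshold rule: decide c₁ when p(x | c₁) p(c₁) > p(x | c₂) p(c₂).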
GENERATIVE MODELS AND BAYES DECISION RULE

Decide c₁ if p(x | c₁) p(c₁) > p(x | c₂) p(c₂), otherwise decide c₂

Equivalently, in terms of the likelihood ratio:

    Decide c₁ if p(x | c₁) / p(x | c₂) > p(c₂) / p(c₁), otherwise decide c₂

[Figure: class-conditional densities and the resulting decision regions; class 2
gets two disconnected regions]
GENERATIVE MODELS

Given the joint distribution we can derive any conditional or marginal probability:
    Sample from p(x, y) to obtain labeled data points
    Given the priors p(y), sample a class or a predictor value
    Given the class y, sample instance data x ~ p(x | y) of that class, or, given a
    predictor variable, sample an expected output

Downside: higher complexity, more parameters to learn

Density estimation problem:
    Parametric (e.g., Gaussian densities)
    Non-parametric (full density estimation)
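Sampling labeled data from the joint p(x, y) = p(x | y) p(y), as described above, can be
sketched as follows. The priors and Gaussian class-conditionals are assumed, illustrative
values, not part of the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed generative model: priors p(y) and Gaussian class-conditionals p(x | y)
priors = {"salmon": 0.6, "sea bass": 0.4}
cond = {"salmon": (3.0, 0.5), "sea bass": (5.0, 0.8)}  # (mean, std), illustrative

def sample(n):
    """Draw n labeled points (x, y) from p(x, y) = p(x | y) p(y)."""
    # First sample the class from the priors, then the features from p(x | y)
    labels = rng.choice(list(priors), size=n, p=list(priors.values()))
    xs = np.array([rng.normal(*cond[y]) for y in labels])
    return xs, labels

xs, labels = sample(1000)
# With many samples, the empirical label frequencies approach the priors
print(round((labels == "salmon").mean(), 2))
```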
LET'S GO BACK TO LINEAR REGRESSION…

Linear model as hypothesis:

    y = h(x; w) = w₀ + w₁x₁ + w₂x₂ + … + w_d x_d = wᵀ·x

    x = (1, x₁, x₂, …, x_d)

Find w that minimizes the deviation from the desired answers:
y^(i) ≈ h(x^(i)), ∀i in the dataset

Loss function: Mean squared error (MSE)

    ℓ = (1/m) Σ_{i=1}^{m} ( y^(i) − h(x^(i)) )²

The model does not try to explain the variation in the observed y's for the data
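A minimal sketch of the MSE-minimizing least-squares fit on toy data (the dataset here is
synthetic, invented for the example):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1-D dataset: y roughly linear in x (illustrative data, not from the slides)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 0.5 * x + rng.normal(0, 0.3, size=50)

# Design matrix with the constant feature x_0 = 1, as in x = (1, x_1, ..., x_d)
X = np.column_stack([np.ones_like(x), x])

# Least-squares solution: the w minimizing the MSE loss
w, *_ = np.linalg.lstsq(X, y, rcond=None)
mse = np.mean((y - X @ w) ** 2)
print(w, mse)  # w close to (2.0, 0.5), mse close to the noise variance
```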
STATISTICAL MODEL FOR LINEAR REGRESSION

A statistical model of linear regression: y = wᵀx + ε

    ε ~ N(0, σ²)    ⟹    y ~ N(wᵀx, σ²),    E[y | x] = wᵀx

The model does explain the variation in the observed y's for the data, in terms of
additive white Gaussian noise.

The conditional distribution of y given x:

    p(y | x; w, σ) = (1 / (σ√(2π))) exp( −(y − wᵀx)² / (2σ²) )

i.e., the probability of the output being y given the predictor x
STATISTICAL MODEL FOR LINEAR REGRESSION

Let's consider the entire dataset D, and let's assume that all samples are
independent and identically distributed (i.i.d.) random variables.

What is the joint probability of all the training data? That is, the probability of
observing all the outputs y in D given the x's and w?

    p( y^(1), y^(2), …, y^(m) | x^(1), x^(2), …, x^(m); w, σ )

By i.i.d.:

    p( y^(1), …, y^(m) | x^(1), …, x^(m); w, σ ) = Π_{i=1}^{m} p( y^(i) | x^(i); w, σ )

Likelihood function of the predictions, the probability of observing the outputs y
in D given the x's and w:

    L(w, σ, D) = Π_{i=1}^{m} p( y^(i) | x^(i); w, σ )

Maximum likelihood estimation of the parameters w: the parameter values maximizing
the likelihood of the predictions, i.e., the values such that the probability of
observing the data in D is maximized:

    w* = argmax_w L(w, σ, D)
STATISTICAL MODEL FOR LINEAR REGRESSION

Log-likelihood:

    l(w, σ, D) = log L(w, σ, D) = log Π_{i=1}^{m} p( y^(i) | x^(i); w, σ )
               = Σ_{i=1}^{m} log p( y^(i) | x^(i); w, σ )

Using the conditional density p(y | x; w, σ) = (1 / (σ√(2π))) exp( −(y − wᵀx)² / (2σ²) ):

    l(w, σ, D) = Σ_{i=1}^{m} log [ (1 / (σ√(2π))) exp( −(y^(i) − wᵀx^(i))² / (2σ²) ) ]
               = Σ_{i=1}^{m} [ −(1 / (2σ²)) ( y^(i) − wᵀx^(i) )² − log(σ√(2π)) ]
               = −(1 / (2σ²)) Σ_{i=1}^{m} ( y^(i) − wᵀx^(i) )² − m log(σ√(2π))

Does it look familiar? Maximizing the predictive log-likelihood with respect to w
is equivalent to minimizing the MSE loss function:

    max_w l(w, σ, D)  ~  min_w MSE

More generally, the least-squares linear fit under Gaussian noise corresponds to
the maximum likelihood estimate of the parameters.
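The equivalence can be checked numerically: the least-squares solution also maximizes the
Gaussian log-likelihood derived above. A sketch on synthetic data (all numbers are
illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data from the statistical model y = w^T x + eps, eps ~ N(0, sigma^2)
sigma = 0.5
x = rng.uniform(-3, 3, size=200)
X = np.column_stack([np.ones_like(x), x])
w_true = np.array([1.0, -2.0])
y = X @ w_true + rng.normal(0, sigma, size=200)

def log_likelihood(w):
    # l(w, sigma, D) = -(1/(2 sigma^2)) sum (y - w^T x)^2 - m log(sigma sqrt(2 pi))
    m = len(y)
    resid = y - X @ w
    return -resid @ resid / (2 * sigma**2) - m * np.log(sigma * np.sqrt(2 * np.pi))

# Least-squares (MSE-minimizing) estimate
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# The log-likelihood at w_ls is at least as high as at any perturbed w,
# since the squared-residual term is globally minimized at w_ls
for _ in range(100):
    w_other = w_ls + rng.normal(0, 0.1, size=2)
    assert log_likelihood(w_ls) >= log_likelihood(w_other)
print("least-squares solution maximizes the Gaussian log-likelihood")
```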
NON-LINEAR, ADDITIVE REGRESSION MODELS

NON-LINEAR PROBLEMS?

Two possible strategies:
    Design a non-linear regressor / classifier
    Modify the input data to make the problem linear
MAP DATA INTO HIGHER-DIMENSIONAL FEATURE SPACES

The SVM solution is expressed in terms of dot products between feature vectors. This
property makes it easy to define a kernel function that implicitly performs the desired
transformation, so that linear classifiers can still be used.

The hyperplane is found in z-space, then projected back into x-space, where it is an
ellipse.
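The kernel idea can be illustrated without any SVM machinery: for the degree-2 homogeneous
polynomial kernel k(a, b) = (a·b)², there is an explicit feature map φ with
φ(a)·φ(b) = k(a, b), so the dot product in feature space is computed without ever forming φ.
A small numerical check (the feature map below is one common choice, assumed for the example):

```python
import numpy as np

def phi(v):
    # Explicit degree-2 feature map for 2-D input: (x1^2, x2^2, sqrt(2) x1 x2)
    x1, x2 = v
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

def poly_kernel(a, b):
    # Homogeneous polynomial kernel of degree 2: k(a, b) = (a . b)^2
    return (a @ b) ** 2

a = np.array([1.0, 2.0])
b = np.array([0.5, -1.0])

# The kernel computes the feature-space dot product implicitly
print(phi(a) @ phi(b), poly_kernel(a, b))  # both 2.25
```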
NON-LINEAR, ADDITIVE REGRESSION MODELS

Main idea to model nonlinearities: replace the inputs to the linear units with k feature
(basis) functions φ_j(x), j = 1, …, k, where each φ_j(x) is an arbitrary function of x:

    y = h(x; w) = w₀ + w₁φ₁(x) + w₂φ₂(x) + … + w_k φ_k(x) = wᵀ·φ(x)

[Diagram: original feature input x → basis functions φ (new input) → linear model]
EXAMPLES OF FEATURE FUNCTIONS

Higher-order polynomial with one-dimensional input, x = (x):

    φ₁(x) = x,  φ₂(x) = x²,  φ₃(x) = x³,  …

Quadratic polynomial with two-dimensional inputs, x = (x₁, x₂):

    φ₁(x) = x₁,  φ₂(x) = x₁²,  φ₃(x) = x₂,  φ₄(x) = x₂²,  φ₅(x) = x₁x₂

Transcendental functions:

    φ₁(x) = sin(x),  φ₂(x) = cos(x),  …
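The basis functions above are straightforward to code; a sketch (the function names are
illustrative, not from the slides):

```python
import numpy as np

def poly_features(x, k):
    """Map a scalar input to polynomial basis functions (1, x, x^2, ..., x^k)."""
    return np.array([x**j for j in range(k + 1)])

def quad_features_2d(x1, x2):
    """Quadratic basis for 2-D input: (1, x1, x1^2, x2, x2^2, x1*x2)."""
    return np.array([1.0, x1, x1**2, x2, x2**2, x1 * x2])

print(poly_features(2.0, 3))       # [1. 2. 4. 8.]
print(quad_features_2d(1.0, 3.0))  # [1. 1. 1. 3. 9. 3.]
```

A linear model in these new features remains linear in the parameters w, which is what
keeps the fitting machinery unchanged.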
SOLUTION USING FEATURE FUNCTIONS

The same techniques (analytical gradient + system of equations, or gradient descent)
used for the plain linear case with MSE as the loss function:

    φ(x^(i)) = ( 1, φ₁(x^(i)), φ₂(x^(i)), …, φ_k(x^(i)) )

    h(x^(i); w) = w₀ + w₁φ₁(x^(i)) + w₂φ₂(x^(i)) + … + w_k φ_k(x^(i)) = wᵀ·φ(x^(i))

    ℓ = (1/m) Σ_{i=1}^{m} ( y^(i) − h(x^(i)) )²

To find min_w ℓ we have to look where ∇_w ℓ = 0:

    ∇_w ℓ = −(2/m) Σ_{i=1}^{m} ( y^(i) − h(x^(i)) ) φ(x^(i)) = 0

This results in a system of linear equations (the normal equations), one for each
j = 0, 1, …, k:

    w₀ Σ_{i=1}^{m} 1·φ_j(x^(i)) + w₁ Σ_{i=1}^{m} φ₁(x^(i)) φ_j(x^(i)) + …
      + w_k Σ_{i=1}^{m} φ_k(x^(i)) φ_j(x^(i)) = Σ_{i=1}^{m} y^(i) φ_j(x^(i))
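A sketch of solving the normal equations ΦᵀΦ w = Φᵀ y for polynomial basis functions on
synthetic cubic data (the data and the degree are assumptions made for the example):

```python
import numpy as np

rng = np.random.default_rng(3)

# Noisy samples of an (assumed) cubic: y = 1 - x + 0.5 x^3 + noise
x = rng.uniform(-2, 2, size=80)
y = 1.0 - x + 0.5 * x**3 + rng.normal(0, 0.2, size=80)

k = 3
# Rows of Phi are phi(x^(i)) = (1, x, x^2, x^3)
Phi = np.vander(x, k + 1, increasing=True)

# Solve the normal equations Phi^T Phi w = Phi^T y directly
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
print(w)  # close to (1, -1, 0, 0.5)
```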
EXAMPLE OF SGD WITH FEATURE FUNCTIONS

One-dimensional feature vectors and a high-order polynomial: x = (x), φ_j(x) = x^j

    h(x; w) = w₀ + w₁φ₁(x) + w₂φ₂(x) + … + w_k φ_k(x) = w₀ + Σ_{j=1}^{k} w_j x^j

On-line, single-sample (x^(i), y^(i)) gradient update, ∀j = 1, …, k (and similarly
for w₀, with φ₀(x) = 1):

    w_j ← w_j + α ( y^(i) − h(x^(i)) ) φ_j(x^(i))

Same form as in the linear regression model, with x_j^(i) → φ_j(x^(i))
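The on-line update can be sketched as follows, on synthetic quadratic data (the learning
rate, number of epochs, and dataset are illustrative choices, not prescribed by the slides):

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic data from an (assumed) quadratic: y = 0.5 + 2 x - x^2 + noise
xs = rng.uniform(-1, 1, size=200)
ys = 0.5 + 2.0 * xs - 1.0 * xs**2 + rng.normal(0, 0.05, size=200)

k = 2
w = np.zeros(k + 1)
alpha = 0.1  # learning rate, illustrative

def h(x, w):
    # Polynomial hypothesis h(x; w) = w_0 + w_1 x + ... + w_k x^k
    return sum(w[j] * x**j for j in range(len(w)))

# On-line, single-sample updates: w_j <- w_j + alpha (y - h(x)) phi_j(x), phi_j(x) = x^j
for epoch in range(200):
    for x, y in zip(xs, ys):
        err = y - h(x, w)
        for j in range(k + 1):
            w[j] += alpha * err * x**j
print(w)  # close to (0.5, 2.0, -1.0)
```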
ELECTRICITY EXAMPLE
New data: it doesn't look linear anymore
NEW HYPOTHESIS

The complexity of the model grows: one parameter for each feature transformed according
to a polynomial of order 2 (at least 3 parameters vs. the 2 of the original hypothesis)
NEW HYPOTHESIS

At least 5 parameters (if we had multiple predicting features, all their order-d products
would also have to be considered, resulting in a number of additional parameters)
NEW HYPOTHESIS

The number of parameters is now larger than the number of data points, so the polynomial
can almost precisely fit the data → Overfitting
SELECTING MODEL COMPLEXITY

Dataset with 10 points, 1-D features: which hypothesis class should we use?

    Linear regression:            y = h(x; w) = w₀ + w₁x
    Polynomial regression, cubic: y = h(x; w) = w₀ + w₁x + w₂x² + w₃x³

MSE as the loss function.
Which model would give the smaller error in terms of MSE / least-squares fit?
SELECTING MODEL COMPLEXITY

Cubic regression provides a better fit to the data, and a smaller MSE.

    Should we stick with the hypothesis h(x; w) = w₀ + w₁x + w₂x² + w₃x³ ?
    Since a higher-order polynomial seems to provide a better fit, why don't we use a
    polynomial of order higher than 3?
    What is the highest order that makes sense for the given problem?
SELECTING MODEL COMPLEXITY

For 10 data points, a degree-9 polynomial gives a perfect fit (Lagrange interpolation):
the training error is zero.
Is it always good to minimize (even reduce to zero) the training error?
Related (and more important) question: how will we perform on new, unseen data?
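The perfect-fit claim is easy to reproduce: with 10 points and a degree-9 polynomial the
Vandermonde system is square, so the interpolant drives the training error to (numerically)
zero, while saying nothing about new inputs. A sketch on synthetic data (the underlying
linear trend and the noise level are assumptions for the example):

```python
import numpy as np

rng = np.random.default_rng(5)

# 10 noisy points from a simple underlying linear trend (illustrative)
x = np.linspace(0, 1, 10)
y = 1.0 + 2.0 * x + rng.normal(0, 0.2, size=10)

# Degree-9 polynomial: as many parameters as data points -> exact interpolation
V = np.vander(x, 10, increasing=True)
w = np.linalg.solve(V, y)
train_err = np.mean((V @ w - y) ** 2)

# Prediction at a new point just outside the training inputs can be wildly off,
# even though the training error is essentially zero
x_new = 1.05
pred = np.polyval(w[::-1], x_new)  # polyval wants highest-degree coefficient first
print(train_err, pred)
```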
OVERFITTING

The degree-9 polynomial model totally fails the prediction for the new point!

Overfitting: the situation when the training error is low but the generalization error
is high. Causes of the phenomenon:
    A highly complex hypothesis model, with a large number of parameters
    (degrees of freedom)
    A small data size (compared to the complexity of the model)
The learned function has enough degrees of freedom to (over)fit all the data perfectly
OVERFITTING

Empirical loss vs. generalization loss
TRAINING AND VALIDATION LOSS

SPLITTING DATASET IN TWO

PERFORMANCE ON VALIDATION SET
INCREASING MODEL COMPLEXITY
In this case, the small size of the dataset makes overfitting easy as the degree of the
polynomial (i.e., the hypothesis complexity) increases. For a large multi-dimensional
dataset this effect is less strong / evident.
TRAINING VS. VALIDATION LOSS
MODEL SELECTION AND EVALUATION PROCESS
1. Break all available data into training and testing sets (e.g., 70% / 30%)
2. Break the training set into internal training and validation sets (e.g., 70% / 30%)
3. Loop:
   i.   Set a hyperparameter value (e.g., degree of polynomial → model complexity)
   ii.  Train the model using the internal training set
   iii. Validate the model using the validation set
   iv.  Exit the loop if validation errors keep growing while training errors go to zero
4. Choose the hyperparameters using the validation-set results: the hyperparameter values
   corresponding to the lowest validation errors
5. (Optional) With the selected hyperparameters, retrain the model using all the
   training data
6. Evaluate the (generalization) performance on the testing set
   (more on this next time)
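Steps 1-6 above can be sketched end-to-end on synthetic data (the dataset, split sizes, and
candidate degrees are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)

# Synthetic dataset: cubic ground truth with noise (illustrative)
x = rng.uniform(-2, 2, size=200)
y = 1.0 + x - 0.5 * x**3 + rng.normal(0, 0.5, size=200)

# Steps 1-2: split into testing / validation / internal training sets
idx = rng.permutation(len(x))
test, val, train = idx[:60], idx[60:100], idx[100:]

def fit(deg, rows):
    # Train a polynomial model of the given degree on the selected rows
    V = np.vander(x[rows], deg + 1, increasing=True)
    w, *_ = np.linalg.lstsq(V, y[rows], rcond=None)
    return w

def mse(w, rows):
    V = np.vander(x[rows], len(w), increasing=True)
    return np.mean((V @ w - y[rows]) ** 2)

# Steps 3-4: loop over the hyperparameter (degree), pick the lowest validation MSE
scores = {d: mse(fit(d, train), val) for d in range(1, 10)}
best = min(scores, key=scores.get)

# Steps 5-6: retrain on all training data, evaluate generalization on the test set
w = fit(best, np.concatenate([train, val]))
print(best, mse(w, test))
```

The degree with the lowest validation error should have enough capacity for the cubic
ground truth, while the held-out test set gives the final generalization estimate.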
MODEL SELECTION AND EVALUATION PROCESS

[Diagram: the Dataset is split into a Training set and a Testing set; the Training set is
further split into an Internal training set and a Validation set. Each candidate model
1, 2, …, k is learned on the internal training set and validated on the validation set;
the best model h is selected.]