MACHINE LEARNING
3. Supervised Learning
Learning a Class from Examples
Based on E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
Class C of a "family car"
Prediction: Is car x a family car?
Knowledge extraction: What do people expect from a family car?
Output: positive (+) and negative (–) examples
Input representation (expert suggestions): x1: price, x2: engine power; ignore other attributes

Training set: X = {x^t, r^t}, t = 1, ..., N, where

r^t = 1 if x^t is positive
      0 if x^t is negative

x = [x1, x2]^T
[Figure: positive (+) and negative (–) examples in the (x1: price, x2: engine power) plane; the class C of family cars forms a region in this plane.]
• Assume a class model: an axis-aligned rectangle
• (p1 ≤ price ≤ p2) AND (e1 ≤ engine power ≤ e2)
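The rectangle class model above can be sketched as a simple predicate; the parameter values used below are illustrative assumptions, not from the lecture:

```python
def make_rectangle_hypothesis(p1, p2, e1, e2):
    """Axis-aligned rectangle hypothesis: h(x) = 1 inside, 0 outside."""
    def h(price, engine_power):
        return 1 if (p1 <= price <= p2) and (e1 <= engine_power <= e2) else 0
    return h

# Illustrative parameter values.
h = make_rectangle_hypothesis(p1=10_000, p2=25_000, e1=80, e2=200)
print(h(18_000, 120))  # inside the rectangle -> 1
print(h(60_000, 400))  # outside -> 0
```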
S, G, and the Version Space
Most specific hypothesis: S; most general hypothesis: G
Any h ∈ H between S and G is consistent with the training set
These consistent hypotheses make up the version space (Mitchell, 1997)
Hypothesis class H

h(x) = 1 if h classifies x as positive
       0 if h classifies x as negative

Error of h on the training set X:

E(h | X) = Σ_{t=1}^{N} 1( h(x^t) ≠ r^t )
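A minimal sketch of computing this empirical error, assuming the training set is stored as ((price, engine power), label) pairs (the values below are made up):

```python
def empirical_error(h, X):
    """E(h|X): count of training examples that hypothesis h misclassifies."""
    return sum(1 for x, r in X if h(*x) != r)

# Tiny illustrative training set.
X = [((18_000, 120), 1), ((60_000, 400), 0), ((12_000, 90), 1)]
h = lambda price, power: 1 if 10_000 <= price <= 25_000 and 80 <= power <= 200 else 0
print(empirical_error(h, X))  # 0 -> h is consistent with this sample
```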
Generalization
Problem of generalization: how well will our hypothesis classify future examples?
In our example, a hypothesis is characterized by 4 numbers: (p1, p2, e1, e2)
Choose the best one: include all positive examples and no negative examples
There are infinitely many such hypotheses for real-valued parameters
Doubt
In some applications, a wrong decision is very costly
May reject an instance if it falls between S (most specific) and G (most general)
VC Dimension
We assumed that H (the hypothesis space) includes the true class C
H should be flexible enough, i.e. have enough capacity, to include C
We need some measure of the hypothesis space's flexibility (complexity)
We can try to increase the complexity of the hypothesis space
VC Dimension
N points can be labeled in 2^N ways as +/–
H shatters N points if, for every one of these labelings, there exists an h ∈ H consistent with it; the largest such N is the VC dimension, VC(H) = N
An axis-aligned rectangle shatters at most 4 points!
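The shattering claim can be checked by brute force, under an assumed "diamond" placement of the 4 points (no point inside the bounding box of the other three):

```python
from itertools import product

def rectangle_consistent(points, labels):
    """Does some axis-aligned rectangle contain exactly the positive points?
    The tightest candidate is the bounding box of the positives."""
    pos = [p for p, y in zip(points, labels) if y == 1]
    if not pos:
        return True  # an empty rectangle classifies everything as negative
    x1, x2 = min(p[0] for p in pos), max(p[0] for p in pos)
    y1, y2 = min(p[1] for p in pos), max(p[1] for p in pos)
    # Consistent iff no negative point falls inside the bounding box.
    return all(not (x1 <= p[0] <= x2 and y1 <= p[1] <= y2)
               for p, y in zip(points, labels) if y == 0)

# Four points in a diamond: all 2^4 = 16 labelings must be realizable.
diamond = [(0, 1), (1, 0), (2, 1), (1, 2)]
print(all(rectangle_consistent(diamond, labels)
          for labels in product([0, 1], repeat=4)))  # True
```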
Probably Approximately Correct (PAC) Learning
Fix a target probability of classification error (the planned future)
The actual error depends on the training sample (the past)
We want the actual error probability (the actual future) to be less than the target, with high probability
How many training examples N should we have, such that with probability at least 1 − δ, h has error at most ε? (Blumer et al., 1989)
Let's calculate how many samples we need for S:
Each strip has probability at most ε/4
Pr that a random instance misses a strip: 1 − ε/4
Pr that N instances all miss a strip: (1 − ε/4)^N
Pr that N instances miss any of the 4 strips: at most 4(1 − ε/4)^N
Require 4(1 − ε/4)^N ≤ δ; since (1 − x) ≤ exp(−x), it suffices that 4 exp(−εN/4) ≤ δ, i.e. N ≥ (4/ε) ln(4/δ)
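A quick numeric sketch of this bound:

```python
import math

def pac_sample_size(eps, delta):
    """N >= (4/eps) * ln(4/delta) for the axis-aligned rectangle class."""
    return math.ceil((4 / eps) * math.log(4 / delta))

# e.g. error at most 5% with probability at least 95%
print(pac_sample_size(eps=0.05, delta=0.05))  # 351
```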
Noise
Sources of noise:
Imprecision in recording the input attributes
Errors in labeling the data points (teacher noise)
Additional attributes not taken into account (hidden or latent); e.g. the same price/engine power may get different labels because of color, and the effect of such attributes is modeled as noise
The class boundary might not be simple, so we may need a more complicated hypothesis space/model
Noise and Model Complexity
Use the simpler model because it is:
Simpler to use (lower computational complexity)
Easier to train (lower space complexity)
Easier to explain (more interpretable)
Better at generalizing (lower variance; Occam's razor)
Occam's Razor
If the actual class is simple and there is mislabeling or noise, the simpler model will generalize better
The simpler model results in more errors on the training set
But it will generalize better: it does not try to explain the noise in the training sample
Simple explanations are more plausible
Multiple Classes
General case: K classes, e.g. Family, Sport, and Luxury cars
Classes can overlap
Can use a different or the same hypothesis class for each class
An instance may fall into two classes (or none); sometimes it is worth rejecting it

Multiple classes C_i, i = 1, ..., K

Training set: X = {x^t, r^t}, t = 1, ..., N, where

r_i^t = 1 if x^t ∈ C_i
        0 if x^t ∈ C_j, j ≠ i
Train hypotheses h_i(x), i = 1, ..., K:

h_i(x^t) = 1 if x^t ∈ C_i
           0 if x^t ∈ C_j, j ≠ i
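A one-vs-all sketch of this setup; the toy price-threshold hypotheses below are illustrative assumptions, not from the lecture:

```python
def one_vs_all_predict(hypotheses, x):
    """Return the index of the class whose hypothesis fires, or None (reject)
    when zero or several of the K hypotheses claim x."""
    claims = [i for i, h in enumerate(hypotheses) if h(x) == 1]
    return claims[0] if len(claims) == 1 else None

# Three toy price thresholds for Family / Sport / Luxury.
hs = [lambda x: 1 if x < 30_000 else 0,
      lambda x: 1 if 30_000 <= x < 80_000 else 0,
      lambda x: 1 if x >= 80_000 else 0]
print(one_vs_all_predict(hs, 20_000))   # 0  (Family)
print(one_vs_all_predict(hs, 100_000))  # 2  (Luxury)
```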
Regression
Output is not a Boolean (yes/no) or a class label but a numeric value
Training set of examples (x^t, r^t)
Interpolation: fit a function (e.g. a polynomial) to the examples
Extrapolation: predict the output for an x outside the range of the training data
Regression: there is added noise; assumption: the noise is due to hidden variables
Approximate the output by a model g(x)
Regression
Empirical error on the training set:

E(g | X) = (1/N) Σ_{t=1}^{N} ( r^t − g(x^t) )²

If the hypothesis space is linear functions: g(x) = w1 x + w0
Calculate the best parameters (w1, w0) that minimize the error by setting the partial derivatives ∂E/∂w1 and ∂E/∂w0 to zero
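Setting those partial derivatives to zero gives the familiar closed-form least-squares solution, sketched here:

```python
def fit_line(xs, rs):
    """Minimize E(w1, w0) = (1/N) * sum((r - (w1*x + w0))**2) in closed form."""
    n = len(xs)
    mx, mr = sum(xs) / n, sum(rs) / n
    w1 = (sum((x - mx) * (r - mr) for x, r in zip(xs, rs))
          / sum((x - mx) ** 2 for x in xs))
    w0 = mr - w1 * mx
    return w1, w0

# A noise-free line r = 2x + 1 is recovered exactly.
print(fit_line([0, 1, 2, 3], [1, 3, 5, 7]))  # (2.0, 1.0)
```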
Higher-order polynomials
Model Selection & Generalization
Learning is an ill-posed problem: the data alone are not sufficient to find a unique solution
Each training example removes the hypotheses inconsistent with it
The need for inductive bias: assumptions about H (e.g. rectangles in our example)
Generalization: how well a model performs on new data
Overfitting: H more complex than C or f
Underfitting: H less complex than C or f
Triple Trade-Off
There is a trade-off between three factors (Dietterich, 2003):
1. Complexity of H, c(H)
2. Training set size, N
3. Generalization error, E, on new data
As N increases, E decreases
As c(H) increases, E first decreases and then increases
Cross-Validation
To estimate the generalization error, we need data unseen during training. We split the data into:
Training set (50%): to train a model
Validation set (25%): to select a model (e.g. the degree of the polynomial)
Test (publication) set (25%): to estimate the error
Use resampling when there is little data
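A sketch of the 50/25/25 split described above; shuffling before splitting is an assumed but standard step:

```python
import random

def split_data(data, seed=0):
    """Shuffle, then split into 50% train / 25% validation / 25% test."""
    data = list(data)
    random.Random(seed).shuffle(data)
    n = len(data)
    a, b = n // 2, n // 2 + n // 4
    return data[:a], data[a:b], data[b:]

train, valid, test = split_data(range(100))
print(len(train), len(valid), len(test))  # 50 25 25
```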
Dimensions of a Supervised Learner

1. Model: g(x | θ)

2. Loss function: E(θ | X) = Σ_t L( r^t, g(x^t | θ) )

3. Optimization procedure: θ* = arg min_θ E(θ | X)