TRANSCRIPT
CHAPTER 2:
Supervised Learning
Learning a Class from Examples
Class C of a “family car”
Prediction: Is car x a family car?
Knowledge extraction: What do people expect from a family car?
Output: positive (+) and negative (–) examples
Input representation: x1: price, x2: engine power
Training set X
X = {x^t, r^t}_{t=1}^N
r = 1 if x is positive, 0 if x is negative
x = [x1, x2]^T

Class C
(p1 ≤ price ≤ p2) AND (e1 ≤ engine power ≤ e2)
Hypothesis class H
h(x) = 1 if h says x is positive, 0 if h says x is negative
Error of h on X:
E(h | X) = Σ_{t=1}^N 1(h(x^t) ≠ r^t)
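The empirical error E(h | X) is just a count of disagreements between h and the labels. A minimal sketch in Python, where h is an axis-aligned rectangle over (price, engine power); the data values and rectangle bounds are illustrative, not from the slides:

```python
def h(x, p1, p2, e1, e2):
    """Rectangle hypothesis: 1 if price and engine power fall inside the bounds."""
    price, power = x
    return 1 if (p1 <= price <= p2 and e1 <= power <= e2) else 0

def empirical_error(X, r, p1, p2, e1, e2):
    """E(h|X) = sum over t of 1(h(x^t) != r^t)."""
    return sum(1 for x, label in zip(X, r) if h(x, p1, p2, e1, e2) != label)

# Toy training set: (price, engine power) pairs with +/- labels.
X = [(10, 100), (20, 150), (35, 220), (15, 120)]
r = [1, 1, 0, 1]
print(empirical_error(X, r, 8, 25, 90, 180))  # this rectangle makes 0 errors
```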
S, G, and the Version Space
most specific hypothesis, S
most general hypothesis, G
Any h ∈ H between S and G is consistent; together they make up the version space (Mitchell, 1997)
Margin
Choose the h with the largest margin.
VC Dimension
N points can be labeled in 2^N ways as +/–.
H shatters N points if, for each of these labelings, there exists an h ∈ H consistent with it; VC(H) is the largest such N.
An axis-aligned rectangle shatters 4 points only, so the VC dimension of axis-aligned rectangles is 4.
Probably Approximately Correct (PAC) Learning
How many training examples N should we have, such that with probability at least 1 − δ, h has error at most ε? (Blumer et al., 1989)
Each of the four strips between the tightest rectangle S and the true class C carries at most ε/4 of the probability mass.
Pr that a random instance misses a strip: 1 − ε/4
Pr that N instances miss a strip: (1 − ε/4)^N
Pr that N instances miss any of the 4 strips: at most 4(1 − ε/4)^N
Requiring 4(1 − ε/4)^N ≤ δ and using (1 − x) ≤ exp(−x):
4 exp(−εN/4) ≤ δ, i.e., N ≥ (4/ε) log(4/δ)
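The bound is easy to evaluate numerically; a small sketch (the sample values of ε and δ are arbitrary):

```python
from math import log, ceil

def pac_sample_size(eps, delta):
    """N >= (4/eps) * log(4/delta): enough examples so that, with probability
    at least 1 - delta, the tightest rectangle has error at most eps."""
    return ceil((4.0 / eps) * log(4.0 / delta))

print(pac_sample_size(0.1, 0.05))  # e.g. eps = 0.1, delta = 0.05
```

Note how the requirement grows only logarithmically in 1/δ but linearly in 1/ε.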
Noise and Model Complexity
Use the simpler one because it is:
Simpler to use (lower computational complexity)
Easier to train (lower space complexity)
Easier to explain (more interpretable)
Better at generalizing (lower variance; Occam’s razor)
Multiple Classes, C_i, i = 1, ..., K
X = {x^t, r^t}_{t=1}^N
r_i^t = 1 if x^t ∈ C_i, 0 if x^t ∈ C_j, j ≠ i
Train hypotheses h_i(x), i = 1, ..., K:
h_i(x^t) = 1 if x^t ∈ C_i, 0 if x^t ∈ C_j, j ≠ i
Regression
X = {x^t, r^t}_{t=1}^N, r^t ∈ ℝ
r^t = f(x^t) + ε
Linear model: g(x) = w1 x + w0
Quadratic model: g(x) = w2 x^2 + w1 x + w0
E(g | X) = (1/N) Σ_{t=1}^N [r^t − g(x^t)]^2
E(w1, w0 | X) = (1/N) Σ_{t=1}^N [r^t − (w1 x^t + w0)]^2
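For the linear model, E(w1, w0 | X) has a closed-form minimizer (the usual least-squares solution). A self-contained sketch on a toy sample; the data values are illustrative:

```python
def fit_line(x, r):
    """Closed-form least squares for g(x) = w1*x + w0, minimizing
    E(w1, w0 | X) = (1/N) * sum_t (r^t - (w1*x^t + w0))**2."""
    N = len(x)
    mx = sum(x) / N
    mr = sum(r) / N
    w1 = (sum((xi - mx) * (ri - mr) for xi, ri in zip(x, r))
          / sum((xi - mx) ** 2 for xi in x))
    w0 = mr - w1 * mx
    return w1, w0

x = [0.0, 1.0, 2.0, 3.0]
r = [1.0, 3.0, 5.0, 7.0]   # exactly r = 2x + 1, so no residual error
w1, w0 = fit_line(x, r)
print(w1, w0)
```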
Model Selection & Generalization
Learning is an ill-posed problem; the data alone are not sufficient to find a unique solution.
Hence the need for inductive bias: assumptions about H.
Generalization: how well a model performs on new data.
Overfitting: H more complex than C or f.
Underfitting: H less complex than C or f.
Triple Trade-Off
There is a trade-off between three factors (Dietterich, 2003):
1. Complexity of H, c(H)
2. Training set size, N
3. Generalization error, E, on new data
As N increases, E decreases.
As c(H) increases, E first decreases and then increases.
Cross-Validation
To estimate generalization error, we need data unseen during training. We split the data as:
Training set (50%)
Validation set (25%)
Test (publication) set (25%)
Use resampling when there is little data.
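The 50/25/25 split can be sketched in a few lines; the fixed seed and the toy dataset are illustrative choices:

```python
import random

def split_data(data, seed=0):
    """Shuffle, then split into 50% training, 25% validation, 25% test,
    following the proportions suggested on the slide."""
    data = list(data)
    random.Random(seed).shuffle(data)   # fixed seed for reproducibility
    n = len(data)
    n_train = n // 2
    n_val = (n - n_train) // 2
    return (data[:n_train],
            data[n_train:n_train + n_val],
            data[n_train + n_val:])

train, val, test = split_data(range(100))
print(len(train), len(val), len(test))  # 50 25 25
```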
Dimensions of a Supervised Learner
1. Model: g(x | θ)
2. Loss function: E(θ | X) = Σ_t L(r^t, g(x^t | θ))
3. Optimization procedure: θ* = arg min_θ E(θ | X)
CHAPTER 3:
Bayesian Decision Theory
Probability and Inference
Result of tossing a coin is {Heads, Tails}.
Random variable X ∈ {1, 0}
Bernoulli: P{X = x} = p_o^x (1 − p_o)^(1 − x)
Sample: X = {x^t}_{t=1}^N
Estimation: p_o = #{Heads} / #{Tosses} = Σ_t x^t / N
Prediction of next toss: Heads if p_o > 1/2, Tails otherwise
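The estimate and the prediction rule fit in a few lines; the toss sequence here is made up:

```python
def estimate_p(tosses):
    """p_o = #{Heads}/#{Tosses} = sum_t x^t / N, with Heads coded as 1."""
    return sum(tosses) / len(tosses)

tosses = [1, 0, 1, 1, 0, 1, 1, 0]   # 5 heads in 8 tosses
p = estimate_p(tosses)
print(p, "Heads" if p > 0.5 else "Tails")  # predict the next toss
```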
Classification
Credit scoring: inputs are income and savings; output is low-risk vs high-risk.
Input: x = [x1, x2]^T, Output: C ∈ {0, 1}
Prediction:
Choose C = 1 if P(C = 1 | x1, x2) > 0.5, otherwise choose C = 0
or equivalently
Choose C = 1 if P(C = 1 | x1, x2) > P(C = 0 | x1, x2), otherwise choose C = 0
Bayes’ Rule
P(C | x) = p(x | C) P(C) / p(x)
posterior = likelihood × prior / evidence
P(C = 0) + P(C = 1) = 1
p(x) = p(x | C = 1) P(C = 1) + p(x | C = 0) P(C = 0)
P(C = 0 | x) + P(C = 1 | x) = 1
Bayes’ Rule: K > 2 Classes
P(C_i | x) = p(x | C_i) P(C_i) / p(x) = p(x | C_i) P(C_i) / Σ_{k=1}^K p(x | C_k) P(C_k)
P(C_i) ≥ 0 and Σ_{i=1}^K P(C_i) = 1
Choose C_i if P(C_i | x) = max_k P(C_k | x)
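Normalizing the products p(x | C_i) P(C_i) by the evidence gives the posteriors directly. A minimal sketch; the likelihood and prior values are made-up numbers for some fixed x:

```python
def posteriors(likelihoods, priors):
    """P(C_i | x) = p(x | C_i) P(C_i) / sum_k p(x | C_k) P(C_k)."""
    joint = [l * p for l, p in zip(likelihoods, priors)]
    evidence = sum(joint)                 # p(x)
    return [j / evidence for j in joint]

# Illustrative class-conditional densities p(x | C_i) at some x, and priors.
post = posteriors([0.2, 0.5, 0.1], [0.3, 0.3, 0.4])
print(post, post.index(max(post)))        # choose the C_i with max posterior
```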
Losses and Risks
Actions: α_i
Loss of α_i when the state is C_k: λ_ik
Expected risk (Duda and Hart, 1973):
R(α_i | x) = Σ_{k=1}^K λ_ik P(C_k | x)
Choose α_i if R(α_i | x) = min_k R(α_k | x)
Losses and Risks: 0/1 Loss
λ_ik = 0 if i = k, 1 if i ≠ k
R(α_i | x) = Σ_{k=1}^K λ_ik P(C_k | x) = Σ_{k ≠ i} P(C_k | x) = 1 − P(C_i | x)
For minimum risk, choose the most probable class.
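Under 0/1 loss, the expected risk reduces to one minus the posterior, so minimizing risk is the same as maximizing the posterior. A small sketch checking this numerically; the posterior values are made up:

```python
def risk(i, posteriors, loss):
    """R(a_i | x) = sum_k loss[i][k] * P(C_k | x)."""
    return sum(loss[i][k] * p for k, p in enumerate(posteriors))

K = 3
zero_one = [[0 if i == k else 1 for k in range(K)] for i in range(K)]
post = [0.2, 0.7, 0.1]
risks = [risk(i, post, zero_one) for i in range(K)]
print(risks)   # under 0/1 loss, R(a_i|x) equals 1 - P(C_i|x)
```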
Losses and Risks: Reject
λ_ik = 0 if i = k; λ if i = K + 1 (reject); 1 otherwise, with 0 < λ < 1
R(α_{K+1} | x) = Σ_{k=1}^K λ P(C_k | x) = λ
R(α_i | x) = Σ_{k ≠ i} P(C_k | x) = 1 − P(C_i | x)
Choose C_i if P(C_i | x) > P(C_k | x) for all k ≠ i and P(C_i | x) > 1 − λ; otherwise reject
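The reject rule above is a one-liner once the posteriors are known: pick the most probable class only if its posterior clears the threshold 1 − λ. A sketch with illustrative numbers:

```python
def decide(posteriors, lam):
    """Choose class i if P(C_i|x) is maximal and exceeds 1 - lam;
    otherwise reject (0 < lam < 1 is the cost of rejecting)."""
    i = max(range(len(posteriors)), key=lambda k: posteriors[k])
    return i if posteriors[i] > 1 - lam else "reject"

print(decide([0.1, 0.85, 0.05], lam=0.3))   # confident enough: class 1
print(decide([0.4, 0.35, 0.25], lam=0.3))   # no posterior exceeds 0.7: reject
```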
Discriminant Functions
g_i(x), i = 1, ..., K
Choose C_i if g_i(x) = max_k g_k(x)
g_i(x) can be:
−R(α_i | x)
P(C_i | x)
p(x | C_i) P(C_i)
K decision regions R_1, ..., R_K:
R_i = {x | g_i(x) = max_k g_k(x)}
K = 2 Classes
Dichotomizer (K = 2) vs polychotomizer (K > 2)
g(x) = g1(x) − g2(x)
Choose C1 if g(x) > 0, otherwise choose C2
Log odds: log [P(C1 | x) / P(C2 | x)]
Utility Theory
Probability of state k given evidence x: P(S_k | x)
Utility of α_i when the state is k: U_ik
Expected utility:
EU(α_i | x) = Σ_k U_ik P(S_k | x)
Choose α_i if EU(α_i | x) = max_j EU(α_j | x)
Association Rules
Association rule: X → Y
People who buy/click/visit/enjoy X are also likely to buy/click/visit/enjoy Y.
A rule implies association, not necessarily causation.
Association measures
Support (X → Y):
P(X, Y) = #{customers who bought X and Y} / #{customers}
Confidence (X → Y):
P(Y | X) = P(X, Y) / P(X) = #{customers who bought X and Y} / #{customers who bought X}
Lift (X → Y):
P(X, Y) / (P(X) P(Y)) = P(Y | X) / P(Y)
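The three measures are simple counting ratios over the transaction list. A minimal sketch; the baskets and item names are toy data:

```python
def measures(baskets, X, Y):
    """Support, confidence, and lift of the rule X -> Y over a list of
    transactions (each transaction is a set of items)."""
    n = len(baskets)
    n_x = sum(1 for b in baskets if X in b)
    n_y = sum(1 for b in baskets if Y in b)
    n_xy = sum(1 for b in baskets if X in b and Y in b)
    support = n_xy / n                      # P(X, Y)
    confidence = n_xy / n_x                 # P(Y | X)
    lift = confidence / (n_y / n)           # P(Y | X) / P(Y)
    return support, confidence, lift

baskets = [{"milk", "bread"}, {"milk"}, {"bread"}, {"milk", "bread"}]
print(measures(baskets, "milk", "bread"))
```

A lift above 1 means buying X makes Y more likely than its base rate; here it is below 1.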
Apriori algorithm (Agrawal et al., 1996)
For (X, Y, Z), a 3-item set, to be frequent (have enough support), (X, Y), (X, Z), and (Y, Z) should all be frequent.
If (X, Y) is not frequent, none of its supersets can be frequent.
Once we find the frequent k-item sets, we convert them to rules: X, Y → Z, ... and X → Y, Z, ...
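The level-wise pruning idea can be sketched compactly: build k-item candidates only from item sets all of whose (k−1)-item subsets are already frequent. The toy baskets are illustrative, and a real implementation would also generate the rules from the frequent sets:

```python
from itertools import combinations

def apriori(baskets, min_support):
    """Level-wise frequent item-set mining: a k-item set is a candidate
    only if every one of its (k-1)-item subsets is frequent."""
    n = len(baskets)
    def support(items):
        return sum(1 for b in baskets if set(items) <= b) / n
    items = sorted({i for b in baskets for i in b})
    frequent = {(i,) for i in items if support((i,)) >= min_support}
    result = set(frequent)
    k = 2
    while frequent:
        pool = sorted({i for f in frequent for i in f})
        candidates = {c for c in combinations(pool, k)
                      if all(s in frequent for s in combinations(c, k - 1))}
        frequent = {c for c in candidates if support(c) >= min_support}
        result |= frequent
        k += 1
    return result

baskets = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
print(sorted(apriori(baskets, min_support=0.6)))
```

Here all singletons and pairs are frequent, but ("a", "b", "c") appears in only 2 of 5 baskets and is pruned.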