1 cs598: machine learning and natural language lecture 7: probabilistic classification oct. 19,21...
TRANSCRIPT
![Page 1: 1 CS598: Machine Learning and Natural Language Lecture 7: Probabilistic Classification Oct. 19,21 2006 Dan Roth University of Illinois, Urbana-Champaign](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649ef95503460f94c0ad7a/html5/thumbnails/1.jpg)
1
CS598: Machine Learning and Natural Language
Lecture 7: Probabilistic Classification
Oct. 19,21 2006
Dan RothUniversity of Illinois, [email protected]://L2R.cs.uiuc.edu/~danr
![Page 2: 1 CS598: Machine Learning and Natural Language Lecture 7: Probabilistic Classification Oct. 19,21 2006 Dan Roth University of Illinois, Urbana-Champaign](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649ef95503460f94c0ad7a/html5/thumbnails/2.jpg)
2
Model the problem of text correction as that of generating correct sentences.
Goal: learn a model of the language; use it to predict.
PARADIGM Learn a probability distribution over all sentences
Use it to estimate which sentence is more likely. Pr(I saw the girl it the park) <> Pr(I saw the girl in the
park)[In the same paradigm we sometimes learn a conditional
probability distribution]
In practice: make assumptions on the distribution’s type
In practice: a decision policy depends on the assumptions
2: Generative Model
![Page 3: 1 CS598: Machine Learning and Natural Language Lecture 7: Probabilistic Classification Oct. 19,21 2006 Dan Roth University of Illinois, Urbana-Champaign](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649ef95503460f94c0ad7a/html5/thumbnails/3.jpg)
3
Consider a distribution D over space XY X - the instance space; Y - set of labels. (e.g. +/-1)
Given a sample {(x,y)}1m
,, and a loss function L(x,y) Find hH that minimizes
i=1,mL(h(xi),yi) L can be: L(h(x),y)=1, h(x)y, o/w L(h(x),y) = 0 (0-1
loss)
L(h(x),y)= (h(x)-y)2 , (L2 )
L(h(x),y)=exp{- y h(x)}
Find an algorithm that minimizes average loss; then, we know that things will be okay (as a function of H).
Before: Error Driven Learning
![Page 4: 1 CS598: Machine Learning and Natural Language Lecture 7: Probabilistic Classification Oct. 19,21 2006 Dan Roth University of Illinois, Urbana-Champaign](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649ef95503460f94c0ad7a/html5/thumbnails/4.jpg)
4
Goal: find the best hypothesis from some space H of hypotheses, given the observed data D.
Define best to be: most probable hypothesis in H
In order to do that, we need to assume a probability distribution over the class H.
In addition, we need to know something about the relation between the data observed and the hypotheses (E.g., a coin problem.)
As we will see, we will be Bayesian about other things, e.g., the parameters of the model
Basics of Bayesian Learning
![Page 5: 1 CS598: Machine Learning and Natural Language Lecture 7: Probabilistic Classification Oct. 19,21 2006 Dan Roth University of Illinois, Urbana-Champaign](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649ef95503460f94c0ad7a/html5/thumbnails/5.jpg)
5
P(h) - the prior probability of a hypothesis h Reflects background knowledge; before data is
observed. If no information - uniform distribution.
P(D) - The probability that this sample of the Data is observed. (No knowledge of the hypothesis)
P(D|h): The probability of observing the sample D, given that the hypothesis h holds
P(h|D): The posterior probability of h. The probability h holds, given that D has been observed.
Basics of Bayesian Learning
![Page 6: 1 CS598: Machine Learning and Natural Language Lecture 7: Probabilistic Classification Oct. 19,21 2006 Dan Roth University of Illinois, Urbana-Champaign](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649ef95503460f94c0ad7a/html5/thumbnails/6.jpg)
6
P(h|D) increases with P(h) and with P(D|h)
P(h|D) decreases with P(D)
P(D)P(h)h)|P(DD)|P(h
Bayes Theorem
![Page 7: 1 CS598: Machine Learning and Natural Language Lecture 7: Probabilistic Classification Oct. 19,21 2006 Dan Roth University of Illinois, Urbana-Champaign](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649ef95503460f94c0ad7a/html5/thumbnails/7.jpg)
7
The learner considers a set of candidate hypotheses H (models), and attempts to find the most probable one h H, given the observed data.
Such maximally probable hypothesis is called maximum a posteriori hypothesis (MAP); Bayes theorem is used to compute it:
P(D)P(h)h)|P(DD)|P(h
h)P(h)|P(Dargmax
P(D)P(h)h)|P(DargmaxD)|P(hargmaxh
Hh
HhHhMAP
Learning Scenario
![Page 8: 1 CS598: Machine Learning and Natural Language Lecture 7: Probabilistic Classification Oct. 19,21 2006 Dan Roth University of Illinois, Urbana-Champaign](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649ef95503460f94c0ad7a/html5/thumbnails/8.jpg)
8
We may assume that a priori, hypotheses are equally probable
We get the Maximum Likelihood hypothesis:
Here we just look for the hypothesis that best explains the data
h)P(h)|P(DargmaxD)|P(hargmaxh HhHhMAP
Hh,hP(hP(h jiji ),)
h)|P(Dargmaxh HhML
Learning Scenario (2)
![Page 9: 1 CS598: Machine Learning and Natural Language Lecture 7: Probabilistic Classification Oct. 19,21 2006 Dan Roth University of Illinois, Urbana-Champaign](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649ef95503460f94c0ad7a/html5/thumbnails/9.jpg)
9
How should we use the general formalism? What should H be?
H can be a collection of functions. Given the training data, choose an optimal function. Then, given new data, evaluate the selected function on it.
H can be a collection of possible predictions. Given the data, try to directly choose the optimal prediction.
H can be a collection of (conditional) probability distributions.
Could be different! Specific examples we will discuss:
Naive Bayes: a maximum likelihood based algorithm; Max Entropy: seemingly, a different selection criteria; Hidden Markov Models
Bayes Optimal Classifier
![Page 10: 1 CS598: Machine Learning and Natural Language Lecture 7: Probabilistic Classification Oct. 19,21 2006 Dan Roth University of Illinois, Urbana-Champaign](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649ef95503460f94c0ad7a/html5/thumbnails/10.jpg)
10
f:XV, finite set of values Instances x X can be described as a collection of features Given an example, assign it the most probable value in V
{0,1}x )x,...,x,(xx in21
),...xx,x|P(vargmax x)|P(vargmax v n21jVvjVvMAP jj
Bayesian Classifier
![Page 11: 1 CS598: Machine Learning and Natural Language Lecture 7: Probabilistic Classification Oct. 19,21 2006 Dan Roth University of Illinois, Urbana-Champaign](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649ef95503460f94c0ad7a/html5/thumbnails/11.jpg)
11
• f:XV, finite set of values•Instances x X can be described as a collection of features
• Given an example, assign it the most probable value in V
• Bayes Rule:
• Notational convention: P(y) means P(Y=y)
{0,1}x )x,...,x,(xx in21
),...xx,x|P(vargmax x)|P(vargmax v n21jVvjVvMAP jj
))P(vv|,...xx,P(xargmax
),...xx,P(x
))P(vv|,...xx,P(xargmax v
jjn21Vv
n21
jjn21VvMAP
j
j
Bayesian Classifier
![Page 12: 1 CS598: Machine Learning and Natural Language Lecture 7: Probabilistic Classification Oct. 19,21 2006 Dan Roth University of Illinois, Urbana-Champaign](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649ef95503460f94c0ad7a/html5/thumbnails/12.jpg)
12
• Given training data we can estimate the two terms.• Estimating P(vj) is easy. For each value vj count how many times it appears in the training data.
• However, it is not feasible to estimate
))P(vv|,...xx,P(xargmax v jjn21VvMAP j
)v|,...xx,P(x jn21
Bayesian Classifier
![Page 13: 1 CS598: Machine Learning and Natural Language Lecture 7: Probabilistic Classification Oct. 19,21 2006 Dan Roth University of Illinois, Urbana-Champaign](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649ef95503460f94c0ad7a/html5/thumbnails/13.jpg)
13
• Given training data we can estimate the two terms.• Estimating P(vj) is easy. For each value vj count how many times it appears in the training data.
• However, it is not feasible to estimate• In this case we have to estimate, for each target value, the probability of each instance (most of which will not occur)
))P(vv|,...xx,P(xargmax v jjn21VvMAP j
)v|,...xx,P(x jn21
Bayesian Classifier
![Page 14: 1 CS598: Machine Learning and Natural Language Lecture 7: Probabilistic Classification Oct. 19,21 2006 Dan Roth University of Illinois, Urbana-Champaign](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649ef95503460f94c0ad7a/html5/thumbnails/14.jpg)
14
• Given training data we can estimate the two terms.• Estimating P(vj) is easy. For each value vj count how many times it appears in the training data.
• However, it is not feasible to estimate• In this case we have to estimate, for each target value, the probability of each instance (most of which will not occur)
• In order to use a Bayesian classifiers in practice, we need to make assumptions that will allow us to estimate these quantities.
))P(vv|,...xx,P(xargmax v jjn21VvMAP j
)v|,...xx,P(x jn21
Bayesian Classifier
![Page 15: 1 CS598: Machine Learning and Natural Language Lecture 7: Probabilistic Classification Oct. 19,21 2006 Dan Roth University of Illinois, Urbana-Champaign](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649ef95503460f94c0ad7a/html5/thumbnails/15.jpg)
15
• Assumption: feature values are independent given the target value
))P(vv|,...xx,P(xargmax v jjn21VvMAP j
n
1i ji
jnjn3jn32jn21
jn3jn32jn21
jn2jn21
jn21
)v|P(x
v|P(xv|,...xP(xv,,...xx|)P(xv,,...xx|P(x
v|,...xP(xv,,...xx|)P(xv,,...xx|P(x
v|,...x)P(xv,,...xx|P(x
)v|,...xx,P(x
))...)
.......
))
)
Naive Bayes
![Page 16: 1 CS598: Machine Learning and Natural Language Lecture 7: Probabilistic Classification Oct. 19,21 2006 Dan Roth University of Illinois, Urbana-Champaign](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649ef95503460f94c0ad7a/html5/thumbnails/16.jpg)
16
• Assumption: feature values are independent given the target value
• Generative model:• First choose a value vj V according to P(vj )• For each vj : choose x1 x2 …, xn according to P(xk |vj )
))P(vv|,...xx,P(xargmax v jjn21VvMAP j
n
1i jiijnn2211 )vv|bP(x)vv|b,...xbx,bP(x
Naive Bayes
![Page 17: 1 CS598: Machine Learning and Natural Language Lecture 7: Probabilistic Classification Oct. 19,21 2006 Dan Roth University of Illinois, Urbana-Champaign](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649ef95503460f94c0ad7a/html5/thumbnails/17.jpg)
17
• Assumption: feature values are independent given the target value
• Learning method: Estimate n|V| parameters and use them to compute the new value. (how to estimate?)
))P(vv|,...xx,P(xargmax v jjn21VvMAP j
n
1i jiijnn2211 )vv|bP(x)vv|b,...xbx,bP(x
Naive Bayes
![Page 18: 1 CS598: Machine Learning and Natural Language Lecture 7: Probabilistic Classification Oct. 19,21 2006 Dan Roth University of Illinois, Urbana-Champaign](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649ef95503460f94c0ad7a/html5/thumbnails/18.jpg)
18
• Assumption: feature values are independent given the target value
• Learning method: Estimate n|V| parameters and use them to compute the new value.• This is learning without search. Given a collection of training examples, you just compute the best hypothesis (given the assumptions)• •This is learning without trying to achieve consistency or even approximate consistency.
))P(vv|,...xx,P(xargmax v jjn21VvMAP j
n
1i jiijnn2211 )vv|bP(x)vv|b,...xbx,bP(x
Naive Bayes
![Page 19: 1 CS598: Machine Learning and Natural Language Lecture 7: Probabilistic Classification Oct. 19,21 2006 Dan Roth University of Illinois, Urbana-Champaign](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649ef95503460f94c0ad7a/html5/thumbnails/19.jpg)
19
• Assumption: feature values are independent given the target value
• Learning method: Estimate n|V| parameters and use them to compute the new value.• This is learning without search. Given a collection of training examples, you just compute the best hypothesis (given the assumptions) •This is learning without trying to achieve consistency or even approximate consistency. Why does it work?
))P(vv|,...xx,P(xargmax v jjn21VvMAP j
n
1i jiijnn2211 )vv|bP(x)vv|b,...xbx,bP(x
Naive Bayes
![Page 20: 1 CS598: Machine Learning and Natural Language Lecture 7: Probabilistic Classification Oct. 19,21 2006 Dan Roth University of Illinois, Urbana-Champaign](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649ef95503460f94c0ad7a/html5/thumbnails/20.jpg)
20
• Notice that the features values are conditionally independent, given the target value, and are not required to be independent.• Example: f(x,y)=xy over the product distribution defined by p(x=0)=p(x=1)=1/2 and p(y=0)=p(y=1)=1/2 The distribution is defined so that x and y are independent: p(x,y) = p(x)p(y) (Interpretation - for every value of x and y)• But, given that f(x,y)=0:
p(x=1|f=0) = p(y=1|f=0) = 1/3 p(x=1,y=1 | f=0) = 0
so x and y are not conditionally independent.
Conditional Independence
![Page 21: 1 CS598: Machine Learning and Natural Language Lecture 7: Probabilistic Classification Oct. 19,21 2006 Dan Roth University of Illinois, Urbana-Champaign](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649ef95503460f94c0ad7a/html5/thumbnails/21.jpg)
21
• The other direction also does not hold. x and y can be conditionally independent but not independent. f=0: p(x=1|f=0) =1, p(y=1|f=0) = 0 f=1: p(x=1|f=1) =0, p(y=1|f=1) = 1 and assume, say, that p(f=0) = p(f=1)=1/2 Given the value of f, x and y are independent.• What about unconditional independence ?
Conditional Independence
![Page 22: 1 CS598: Machine Learning and Natural Language Lecture 7: Probabilistic Classification Oct. 19,21 2006 Dan Roth University of Illinois, Urbana-Champaign](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649ef95503460f94c0ad7a/html5/thumbnails/22.jpg)
22
• The other direction also does not hold. x and y can be conditionally independent but not independent. f=0: p(x=1|f=0) =1, p(y=1|f=0) = 0 f=1: p(x=1|f=0) =0, p(y=1|f=1) = 1 and assume, say, that p(f=0) = p(f=1)=1/2 Given the value of f, x and y are independent.• What about unconditional independence ? p(x=1) = p(x=1|f=0)p(f=0)+p(x=1|f=1)p(f=1) = 0.5+0=0.5 p(y=1) = p(y=1|f=0)p(f=0)+p(y=1|f=1)p(f=1) = 0.5+0=0.5But, p(x=1, y=1)=p(x=1,y=1|f=0)p(f=0)+p(x=1,y=1|f=1)p(f=1) = 0
so x and y are not independent.
Conditional Independence
![Page 23: 1 CS598: Machine Learning and Natural Language Lecture 7: Probabilistic Classification Oct. 19,21 2006 Dan Roth University of Illinois, Urbana-Champaign](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649ef95503460f94c0ad7a/html5/thumbnails/23.jpg)
23
Day Outlook Temperature Humidity Wind PlayTennis
1 Sunny Hot High Weak No 2 Sunny Hot High Strong No 3 Overcast Hot High Weak Yes 4 Rain Mild High Weak Yes 5 Rain Cool Normal Weak Yes 6 Rain Cool Normal Strong No 7 Overcast Cool Normal Strong Yes 8 Sunny Mild High Weak No 9 Sunny Cool Normal Weak Yes10 Rain Mild Normal Weak Yes 11 Sunny Mild Normal Strong Yes12 Overcast Mild High Strong Yes13 Overcast Hot Normal Weak Yes14 Rain Mild High Strong No
i jijVvNB )v|P(x)P(vargmax v
j
Example
![Page 24: 1 CS598: Machine Learning and Natural Language Lecture 7: Probabilistic Classification Oct. 19,21 2006 Dan Roth University of Illinois, Urbana-Champaign](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649ef95503460f94c0ad7a/html5/thumbnails/24.jpg)
24
• How do we estimate P(observation | v) ?
i ino}{yes,vNB v)|nobservatioP(xP(v)argmax v
Estimating Probabilities
![Page 25: 1 CS598: Machine Learning and Natural Language Lecture 7: Probabilistic Classification Oct. 19,21 2006 Dan Roth University of Illinois, Urbana-Champaign](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649ef95503460f94c0ad7a/html5/thumbnails/25.jpg)
25
• Compute P(PlayTennis= yes); P(PlayTennis= no)=• Compute P(outlook= s/oc/r | PlayTennis= yes/no) (6 numbers)• Compute P(Temp= h/mild/cool | PlayTennis= yes/no) (6 numbers)• Compute P(humidity= hi/nor | PlayTennis= yes/no) (4 numbers)• Compute P(wind= w/st | PlayTennis= yes/no) (4 numbers)
i jijVvNB )v|P(x)P(vargmax v
j
Example
![Page 26: 1 CS598: Machine Learning and Natural Language Lecture 7: Probabilistic Classification Oct. 19,21 2006 Dan Roth University of Illinois, Urbana-Champaign](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649ef95503460f94c0ad7a/html5/thumbnails/26.jpg)
26
• Compute P(PlayTennis= yes); P(PlayTennis= no)=• Compute P(outlook= s/oc/r | PlayTennis= yes/no) (6 numbers)• Compute P(Temp= h/mild/cool | PlayTennis= yes/no) (6 numbers)• Compute P(humidity= hi/nor | PlayTennis= yes/no) (4 numbers)• Compute P(wind= w/st | PlayTennis= yes/no) (4 numbers)
•Given a new instance: (Outlook=sunny; Temperature=cool; Humidity=high; Wind=strong)
• Predict: PlayTennis= ?
i jijVvNB )v|P(x)P(vargmax v
j
Example
![Page 27: 1 CS598: Machine Learning and Natural Language Lecture 7: Probabilistic Classification Oct. 19,21 2006 Dan Roth University of Illinois, Urbana-Champaign](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649ef95503460f94c0ad7a/html5/thumbnails/27.jpg)
27
•Given: (Outlook=sunny; Temperature=cool; Humidity=high; Wind=strong)
P(PlayTennis= yes)=9/14=0.64 P(PlayTennis= no)=5/14=0.36
P(outlook = sunny|yes)= 2/9 P(outlook = sunny|no)= 3/5 P(temp = cool | yes) = 3/9 P(temp = cool | no) = 1/5P(humidity = hi |yes) = 3/9 P(humidity = hi |no) = 4/5P(wind = strong | yes) = 3/9 P(wind = strong | no)= 3/5
P(yes|…..) ~ 0.0053 P(no|…..) ~ 0.0206
i jijVvNB )v|P(x)P(vargmax v
j
Example
![Page 28: 1 CS598: Machine Learning and Natural Language Lecture 7: Probabilistic Classification Oct. 19,21 2006 Dan Roth University of Illinois, Urbana-Champaign](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649ef95503460f94c0ad7a/html5/thumbnails/28.jpg)
28
•Given: (Outlook=sunny; Temperature=cool; Humidity=high; Wind=strong)
P(PlayTennis= yes)=9/14=0.64 P(PlayTennis= no)=5/14=0.36
P(outlook = sunny|yes)= 2/9 P(outlook = sunny|no)= 3/5 P(temp = cool | yes) = 3/9 P(temp = cool | yes) = 1/5P(humidity = hi |yes) = 3/9 P(humidity = hi |yes) = 4/5P(wind = strong | yes) = 3/9 P(wind = strong | no)= 3/5
P(yes|…..) ~ 0.0053 P(no|…..) ~ 0.0206
What is we were asked about Outlook=OC ?
i jijVvNB )v|P(x)P(vargmax v
j
Example
![Page 29: 1 CS598: Machine Learning and Natural Language Lecture 7: Probabilistic Classification Oct. 19,21 2006 Dan Roth University of Illinois, Urbana-Champaign](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649ef95503460f94c0ad7a/html5/thumbnails/29.jpg)
29
•Given: (Outlook=sunny; Temperature=cool; Humidity=high; Wind=strong)
P(PlayTennis= yes)=9/14=0.64 P(PlayTennis= no)=5/14=0.36
P(outlook = sunny|yes)= 2/9 P(outlook = sunny|no)= 3/5 P(temp = cool | yes) = 3/9 P(temp = cool | no) = 1/5P(humidity = hi |yes) = 3/9 P(humidity = hi |no) = 4/5P(wind = strong | yes) = 3/9 P(wind = strong | no)= 3/5
P(yes|…..) ~ 0.0053 P(no|…..) ~ 0.0206 P(no|instance) = 0/.0206/(0.0053+0.0206)=0.795
i jijVvNB )v|P(x)P(vargmax v
j
Example
![Page 30: 1 CS598: Machine Learning and Natural Language Lecture 7: Probabilistic Classification Oct. 19,21 2006 Dan Roth University of Illinois, Urbana-Champaign](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649ef95503460f94c0ad7a/html5/thumbnails/30.jpg)
30
• Notice that the naïve Bayes method gives a method for predicting rather than an explicit classifier.• In the case of two classes, v{0,1} we predict that v=1 iff:
i jijVvNB )v|P(x)P(vargmax v
j
10)v|P(x0)P(v
1)v|P(x1)P(vn
1i jij
n
1i jij
Naive Bayes: Two Classes
![Page 31: 1 CS598: Machine Learning and Natural Language Lecture 7: Probabilistic Classification Oct. 19,21 2006 Dan Roth University of Illinois, Urbana-Champaign](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649ef95503460f94c0ad7a/html5/thumbnails/31.jpg)
31
• Notice that the naïve Bayes method gives a method for predicting rather than an explicit classifier.• In the case of two classes, v{0,1} we predict that v=1 iff:
i jijVvNB )v|P(x)P(vargmax v
j
10)v|P(x0)P(v
1)v|P(x1)P(vn
1i jij
n
1i jij
1q-(1q0)P(v
p-(1p1)P(v
0)v|1p(xq 1),v|1p(xp
ii
ii
x-1i
xij
x-1i
xij
iiii
n
i
n
i
1
1
)
)
:Denote
Naive Bayes: Two Classes
![Page 32: 1 CS598: Machine Learning and Natural Language Lecture 7: Probabilistic Classification Oct. 19,21 2006 Dan Roth University of Illinois, Urbana-Champaign](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649ef95503460f94c0ad7a/html5/thumbnails/32.jpg)
32
•In the case of two classes, v{0,1} we predict that v=1 iff:
1)
q-1
q)(q-(10)P(v
)p-1
p)(p-(11)P(v
)q-(1q0)P(v
)p-(1p1)P(v
n
1i
x
i
iij
n
1i
x
i
iij
n
1i
x-1i
xij
n
1i
x-1i
xij
i
i
ii
ii
Naive Bayes: Two Classes
![Page 33: 1 CS598: Machine Learning and Natural Language Lecture 7: Probabilistic Classification Oct. 19,21 2006 Dan Roth University of Illinois, Urbana-Champaign](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649ef95503460f94c0ad7a/html5/thumbnails/33.jpg)
33
•In the case of two classes, v{0,1} we predict that v=1 iff:
1)
q-1
q)(q-(10)P(v
)p-1
p)(p-(11)P(v
)q-(1q0)P(v
)p-(1p1)P(v
n
1i
x
i
iij
n
1i
x
i
iij
n
1i
x-1i
xij
n
1i
x-1i
xij
i
i
ii
ii
0)xq-1
qlog
p-1
p(log
q-1
p-1log
0)P(v
1)P(vlog
:iff 1vpredict we logarithm; Take
ii
i
ii
i
ii
i
j
j
Naïve Bayes: Two Classes
![Page 34: 1 CS598: Machine Learning and Natural Language Lecture 7: Probabilistic Classification Oct. 19,21 2006 Dan Roth University of Illinois, Urbana-Champaign](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649ef95503460f94c0ad7a/html5/thumbnails/34.jpg)
34
•In the case of two classes, v{0,1} we predict that v=1 iff:
• •We get that the optimal Bayes behavior is given by a linear separator with
1)
q-1
q)(q-(10)P(v
)p-1
p)(p-(11)P(v
)q-(1q0)P(v
)p-(1p1)P(v
n
1i
x
i
iij
n
1i
x
i
iij
n
1i
x-1i
xij
n
1i
x-1i
xij
i
i
ii
ii
0)xq-1
qlog
p-1
p(log
q-1
p-1log
0)P(v
1)P(vlog
:iff 1vpredict we logarithm; Take
ii
i
ii
i
ii
i
j
j
irrelevant is feature the and 0w then qp if
p-1
q-1
q
plog)
q-1
qlog
p-1
p(logw
iii
i
i
i
i
i
i
ii
ii
Naïve Bayes: Two Classes
![Page 35: 1 CS598: Machine Learning and Natural Language Lecture 7: Probabilistic Classification Oct. 19,21 2006 Dan Roth University of Illinois, Urbana-Champaign](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649ef95503460f94c0ad7a/html5/thumbnails/35.jpg)
35
• We have not addressed the question of why does this Classifier perform well, given that the assumptions are unlikely to be satisfied.
• The linear form of the classifiers provides some hints. (More on that later; also, Roth’99 Garg&Roth ECML’02); one of the presented papers will also address this partly.
Why does it work?
![Page 36: 1 CS598: Machine Learning and Natural Language Lecture 7: Probabilistic Classification Oct. 19,21 2006 Dan Roth University of Illinois, Urbana-Champaign](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649ef95503460f94c0ad7a/html5/thumbnails/36.jpg)
36
•In the case of two classes we have that:
bxw)x |0P(v
)x |1P(vlog ii i
j
j
Naïve Bayes: Two Classes
![Page 37: 1 CS598: Machine Learning and Natural Language Lecture 7: Probabilistic Classification Oct. 19,21 2006 Dan Roth University of Illinois, Urbana-Champaign](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649ef95503460f94c0ad7a/html5/thumbnails/37.jpg)
37
•In the case of two classes we have that:
•but since
•We get (plug in (2) in (1); some algebra):
•Which is simply the logistic (sigmoid) function used in the neural network representation.
bxw)x |0P(v
)x |1P(vlog ii i
j
j
)x |0P(v-1)x |1P(v jj
b)xwexp(-1
1)x |1P(v
ii ij
Naïve Bayes: Two Classes
![Page 38: 1 CS598: Machine Learning and Natural Language Lecture 7: Probabilistic Classification Oct. 19,21 2006 Dan Roth University of Illinois, Urbana-Champaign](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649ef95503460f94c0ad7a/html5/thumbnails/38.jpg)
38
Another look at Naive Bayes
n32 x x x x1 n-1n43322 xx xx xx xx1
l)|Pr( 1
1 2 3 n
l)|Pr( n l)|Pr( 2
Pr(l)l
|} lj|j)(x, {|
|} l j 1(x) |j)(x, {|logrPlogrPlogc D
l],[D
l],[l],[ 0
ˆ/ˆ
(x)cargmax(x)prediction i
n
0il],[x0.1}{l i
|S|/|} lj|j)(x, {|logrPlogc Dl],[l],[ 00
ˆ
Note this is a bit different than the previous
linearization. Rather than a single function, here we have argmax over
several different functions.
Graphical model. It encodes the NB independence
assumption in the edge structure (siblings are
independent given parents)
![Page 39: 1 CS598: Machine Learning and Natural Language Lecture 7: Probabilistic Classification Oct. 19,21 2006 Dan Roth University of Illinois, Urbana-Champaign](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649ef95503460f94c0ad7a/html5/thumbnails/39.jpg)
39
Hidden Markov Model (HMM)
HMM is a probabilistic generative model It models how an observed sequence is generated
Let’s call each position in a sequence a time step At each time step, there are two variables
Current state (hidden) Observation
![Page 40: 1 CS598: Machine Learning and Natural Language Lecture 7: Probabilistic Classification Oct. 19,21 2006 Dan Roth University of Illinois, Urbana-Champaign](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649ef95503460f94c0ad7a/html5/thumbnails/40.jpg)
40
HMM
Elements Initial state probability P(s1) Transition probability P(st|st-1) Observation probability P(ot|st)
As before, the graphical model is an encoding of the independence assumptions
Note that we have seen this in the context of POS tagging.
s1
o1
s2
o2
s3
o3
s4
o4
s5
o5
s6
o6
![Page 41: 1 CS598: Machine Learning and Natural Language Lecture 7: Probabilistic Classification Oct. 19,21 2006 Dan Roth University of Illinois, Urbana-Champaign](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649ef95503460f94c0ad7a/html5/thumbnails/41.jpg)
41
HMM for Shallow Parsing
States: {B, I, O}
Observations: Actual words and/or part-of-speech tags
s1=B
o1
Mr.
s2=I
o2
Brown
s3=O
o3
blamed
s4=B
o4
Mr.
s5=I
o5
Bob
s6=O
o6
for
![Page 42: 1 CS598: Machine Learning and Natural Language Lecture 7: Probabilistic Classification Oct. 19,21 2006 Dan Roth University of Illinois, Urbana-Champaign](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649ef95503460f94c0ad7a/html5/thumbnails/42.jpg)
42
HMM for Shallow Parsing
Given a sentences, we can ask what the most likely state sequence is
Initial state probability:P(s1=B),P(s1=I),P(s1=O)
Transition probabilty:P(st=B|st-1=B),P(st=I|st-1=B),P(st=O|st-1=B),P(st=B|st-1=I),P(st=I|st-1=I),P(st=O|st-1=I),…
Observation Probability:P(ot=‘Mr.’|st=B),P(ot=‘Brown’|st=B),…,P(ot=‘Mr.’|st=I),P(ot=‘Brown’|st=I),…,…
s1=B
o1
Mr.
s2=I
o2
Brown
s3=O
o3
blamed
s4=B
o4
Mr.
s5=I
o5
Bob
s6=O
o6
for
![Page 43: 1 CS598: Machine Learning and Natural Language Lecture 7: Probabilistic Classification Oct. 19,21 2006 Dan Roth University of Illinois, Urbana-Champaign](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649ef95503460f94c0ad7a/html5/thumbnails/43.jpg)
43
Finding most likely state sequence in HMM (1)
![Page 44: 1 CS598: Machine Learning and Natural Language Lecture 7: Probabilistic Classification Oct. 19,21 2006 Dan Roth University of Illinois, Urbana-Champaign](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649ef95503460f94c0ad7a/html5/thumbnails/44.jpg)
44
Finding most likely state sequence in HMM (2)
![Page 45: 1 CS598: Machine Learning and Natural Language Lecture 7: Probabilistic Classification Oct. 19,21 2006 Dan Roth University of Illinois, Urbana-Champaign](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649ef95503460f94c0ad7a/html5/thumbnails/45.jpg)
45
Finding most likely state sequence in HMM (3)
A function of sk
![Page 46: 1 CS598: Machine Learning and Natural Language Lecture 7: Probabilistic Classification Oct. 19,21 2006 Dan Roth University of Illinois, Urbana-Champaign](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649ef95503460f94c0ad7a/html5/thumbnails/46.jpg)
46
Finding most likely state sequence in HMM (4)
Viterbi’s Algorithm Dynamic Programming
![Page 47: 1 CS598: Machine Learning and Natural Language Lecture 7: Probabilistic Classification Oct. 19,21 2006 Dan Roth University of Illinois, Urbana-Champaign](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649ef95503460f94c0ad7a/html5/thumbnails/47.jpg)
47
Learning the Model
Estimate Initial state probability P (s1) Transition probability P(st|st-1) Observation probability P(ot|st)
Unsupervised Learning (states are not observed) EM Algorithm
Supervised Learning (states are observed; more common)
ML Estimate of above terms directly from data
Notice that this is completely analogues to the case of naive Bayes, and essentially all other models.
![Page 48: 1 CS598: Machine Learning and Natural Language Lecture 7: Probabilistic Classification Oct. 19,21 2006 Dan Roth University of Illinois, Urbana-Champaign](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649ef95503460f94c0ad7a/html5/thumbnails/48.jpg)
48
Another view of Markov Models
x)|Pr(w)t|Pr(w iii
)t,...,t,t|Pr(t)t|Pr(t 1-1ii1ii1i Assumptions:
Prediction: predict tT that maximizest)|Pr(tt)|Pr(w)t|Pr(t 1ii-1i
Input: ) ( )...t:(w?),:(w),t:),...(wt:(wx 1i1ii-1i-1i11
t t -1i t 1i
w -1i iw w 1i
States:
Observations:
T
W
![Page 49: 1 CS598: Machine Learning and Natural Language Lecture 7: Probabilistic Classification Oct. 19,21 2006 Dan Roth University of Illinois, Urbana-Champaign](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649ef95503460f94c0ad7a/html5/thumbnails/49.jpg)
49
Another View of Markov Models
As for NB: features are pairs and singletons of t‘s, w’s
|} tt|)t'(t, {|
|} tt' tt |)t'(t, {|logrPlogrPlogc
1
21D]t[1,
D]t,[t]t,t[t 121121
ˆ/ˆ
(x)cargmax(x)prediction t],[xT}{t i
]t,t[t 221c t]t,[c w 0c Otherwise, t],[ ]t,t[t 121
c
Only 3 active featuresOnly 3 active features
Input: ) ( )...t:(w?),:(w),t:),...(wt:(wx 1i1ii-1i-1i11
t t -1i t 1i
w -1i iw w 1i
States:
Observations:
T
W
This can be extended to an argmax that maximizes the prediction of the whole state sequence and computed, as before, via Viterbi.
![Page 50: 1 CS598: Machine Learning and Natural Language Lecture 7: Probabilistic Classification Oct. 19,21 2006 Dan Roth University of Illinois, Urbana-Champaign](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649ef95503460f94c0ad7a/html5/thumbnails/50.jpg)
50
Learning with Probabilistic Classifiers
Learning Theory
We showed that probabilistic predictions can be viewed as predictions via Linear Statistical Queries Models (Roth’99).
The low expressivity explains Generalization+Robustness Is that all?
It does not explain why is it possible to (approximately) fit the data with these models. Namely, is there a reason to believe that these hypotheses minimize the empirical error on the sample?
In General, No. (Unless it corresponds to some probabilistic assumptions that hold).
|S| /|} lh(x)|Sx {|(h)ErrS
![Page 51: 1 CS598: Machine Learning and Natural Language Lecture 7: Probabilistic Classification Oct. 19,21 2006 Dan Roth University of Illinois, Urbana-Champaign](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649ef95503460f94c0ad7a/html5/thumbnails/51.jpg)
51
Learning Protocol
LSQ hypotheses are computed directly, w/o assumptions on the underlying distribution:
- Choose features - Compute coefficients
Is there a reason to believe that an LSQ hypothesis
minimizes the empirical error on the sample?
In general, no. (Unless it corresponds to some probabilistic
assumptions that hold).
![Page 52: 1 CS598: Machine Learning and Natural Language Lecture 7: Probabilistic Classification Oct. 19,21 2006 Dan Roth University of Illinois, Urbana-Champaign](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649ef95503460f94c0ad7a/html5/thumbnails/52.jpg)
52
Learning Protocol: Practice
LSQ hypotheses are computed directly: - Choose features - Compute coefficients
If hypothesis does not fit the training data - - Augment set of features
(Forget your original assumption)
![Page 53: 1 CS598: Machine Learning and Natural Language Lecture 7: Probabilistic Classification Oct. 19,21 2006 Dan Roth University of Illinois, Urbana-Champaign](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649ef95503460f94c0ad7a/html5/thumbnails/53.jpg)
53
Example: probabilistic classifiers
n32 x x x x1 n-1n43322 xx xx xx xx1
Features are pairs and singletons of t‘s, w’s
Additional features are included
l)|Pr( 1
1 2 3 n
l)|Pr( n l)|Pr( 2
Pr(l)l
t t -1i t 1i
w -1i iw w 1i
States:
Observations:
T
W
If hypothesis does not fit the training data - augment set of features (forget assumptions)
![Page 54: 1 CS598: Machine Learning and Natural Language Lecture 7: Probabilistic Classification Oct. 19,21 2006 Dan Roth University of Illinois, Urbana-Champaign](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649ef95503460f94c0ad7a/html5/thumbnails/54.jpg)
54
Why is it relatively easy to fit the data? Consider all distributions with the same
marginals
(E.g, a naïve Bayes classifier will predict the same regardless of which distribution generated the data.)
(Garg&Roth ECML’01):In most cases (i.e., for most such distributions),
the resulting predictor’s error is close to optimal classifier (that if given the correct distribution)
)Pr( i l|
Robustness of Probabilistic Predictors
![Page 55: 1 CS598: Machine Learning and Natural Language Lecture 7: Probabilistic Classification Oct. 19,21 2006 Dan Roth University of Illinois, Urbana-Champaign](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649ef95503460f94c0ad7a/html5/thumbnails/55.jpg)
55
Summary: Probabilistic Modeling
Classifiers derived from probability density estimation models were viewed as LSQ
hypotheses.
Probabilistic assumptions: + Guiding feature selection but also - - Not allowing the use of more general
features.
k21 iiiiii xxxc(x)c ...
![Page 56: 1 CS598: Machine Learning and Natural Language Lecture 7: Probabilistic Classification Oct. 19,21 2006 Dan Roth University of Illinois, Urbana-Champaign](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649ef95503460f94c0ad7a/html5/thumbnails/56.jpg)
56
A Unified Approach
Most methods blow up original feature space.
And make predictions using a linear representation over the new feature space
kn ) (x)... (x), (x), (x) n321 (),...,,( 321 kxxxxX
(x)ci
ijimaxarg
j
Note: Methods do not have to actually do that; But: they produce same decision as a hypothesis that does that. (Roth 98; 99,00)
![Page 57: 1 CS598: Machine Learning and Natural Language Lecture 7: Probabilistic Classification Oct. 19,21 2006 Dan Roth University of Illinois, Urbana-Champaign](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649ef95503460f94c0ad7a/html5/thumbnails/57.jpg)
57
A Unified Approach
Most methods blow up original feature space.
And make predictions using a linear representation over the new feature space
kn ) (x)... (x), (x), (x) n321 (),...,,( 321 kxxxxX
(x)ci
ijimaxarg
j Probabilistic Methods Rule based methods
(TBL; decision lists;
exponentially decreasing weights)
Linear representation (SNoW;Perceptron;
SVM;Boosting) Memory Based Methods
(subset features)
![Page 58: 1 CS598: Machine Learning and Natural Language Lecture 7: Probabilistic Classification Oct. 19,21 2006 Dan Roth University of Illinois, Urbana-Champaign](https://reader035.vdocuments.mx/reader035/viewer/2022062720/56649ef95503460f94c0ad7a/html5/thumbnails/58.jpg)
58
A Unified Approach
Most methods blow up original feature space.
And make predictions using a linear representation over the new feature space
kn ) (x)... (x), (x), (x) n321 (),...,,( 321 kxxxxX
(x)ci
ijimaxarg
j
Q 1: How are weights determined?Q 2: How is the new feature-space determined? Implications? Restrictions?