Statistical NLP course, Master in Computational Linguistics, 2nd year, 2013-2014, Diana Trandabat
TRANSCRIPT
Intro to probabilities
• Probability deals with prediction:
– Which word will follow in this ....?
– How can parses for a sentence be ordered?
– Which meaning is more likely?
– Which grammar is more linguistically plausible?
– See the phrase “more lies ahead”. How likely is it that “lies” is a noun?
– See “Le chien est noir”. How likely is it that the correct translation is “The dog is black”?
• Any rational decision can be described probabilistically.
Notations
• Experiment (or trial)
– repeatable process by which observations are made, e.g. tossing 3 coins
• Observe a basic outcome from the sample space, Ω (the set of all possible basic outcomes)
• Examples of sample spaces:
• one coin toss, sample space Ω = { H, T }
• three coin tosses, Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
• part-of-speech of a word, Ω = {N, V, Adj, etc…}
• next word in a Shakespeare play, |Ω| = size of the vocabulary
• number of words in your MSc thesis, Ω = { 0, 1, …, ∞ }
Notation
• An event A, is a set of basic outcomes, i.e., a subset of the sample space, Ω.
Example:
– Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
– e.g. basic outcome = THH
– e.g. event = “has exactly 2 H’s”
A={THH, HHT, HTH}
– A=Ω is the certain event P(A=Ω)=1
– A=∅ is the impossible event P(A=∅) = 0
– For “not A” , we write Ā
Intro to probabilities
• The true probability of an event is hard to compute.
• It is easy to compute an estimate of the probability, written p̂(x).
• As the number of observations grows, |X| → ∞, the estimate converges: p̂(x) → P(x).
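A minimal sketch of this convergence in Python, using simulated coin tosses (the function name and seed are illustrative, not from the course):

```python
import random

def estimate_p_heads(n_tosses, rng):
    """Empirical estimate p̂(H): the fraction of heads in n_tosses simulated fair-coin tosses."""
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses

rng = random.Random(42)  # fixed seed so the run is reproducible
for n in (10, 1000, 100000):
    # As n grows, the estimate p̂(H) drifts toward the true value P(H) = 0.5.
    print(n, estimate_p_heads(n, rng))
```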
Intro to probabilities
• “A coin is tossed 3 times.
• What is the likelihood of 2 heads?”
– Experiment: Toss a coin three times
– Sample space Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
– Event: basic outcomes that have exactly 2 H’s
A = {THH, HTH, HHT}
– the likelihood of 2 heads is 3 out of 8 possible outcomes
P(A) = 3/8
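The enumeration above can be checked with a few lines of Python (variable names are illustrative):

```python
from itertools import product
from fractions import Fraction

# Sample space for three coin tosses: all 8 sequences of H/T.
omega = [''.join(t) for t in product('HT', repeat=3)]

# Event A: basic outcomes that have exactly 2 H's.
A = [o for o in omega if o.count('H') == 2]

# P(A) = |A| / |Ω| under the uniform distribution.
p_A = Fraction(len(A), len(omega))
print(sorted(A), p_A)  # ['HHT', 'HTH', 'THH'] 3/8
```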
Probability distribution
• A probability distribution is an assignment of probabilities to a set of outcomes.
– A uniform distribution assigns the same probability to all outcomes (e.g. a fair coin).
– A Gaussian distribution assigns a bell curve over the outcomes.
– Many others.
– Uniform and Gaussian distributions are popular in SNLP.
Independent events
• Two events are independent if:
p(a,b) = p(a) · p(b)
• Consider a fair die. Intuitively, each side (1, 2, 3, 4, 5, 6) has a probability of 1/6.
• Consider the event X, “the number on the die is divisible by 2”, and the event Y, “the number is divisible by 3”.
• X = {2, 4, 6}, Y = {3, 6}
• p(X) = p(2) + p(4) + p(6) = 1/6 + 1/6 + 1/6 = 3/6 = 1/2
• p(Y) = p(3) + p(6) = 2/6 = 1/3
• p(X,Y) = p(6) = 1/6 = 1/2 · 1/3 = p(X) · p(Y)
• ==> X and Y are independent
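The die example can be verified with a small Python sketch (names are illustrative):

```python
from fractions import Fraction

omega = range(1, 7)  # faces of a fair die, each with probability 1/6
X = {n for n in omega if n % 2 == 0}  # divisible by 2 -> {2, 4, 6}
Y = {n for n in omega if n % 3 == 0}  # divisible by 3 -> {3, 6}

def p(event):
    """Probability of an event under the uniform distribution on a fair die."""
    return Fraction(len(event), 6)

# Joint event X,Y: outcomes in both X and Y, i.e. the intersection {6}.
print(p(X), p(Y), p(X & Y))           # 1/2 1/3 1/6
print(p(X & Y) == p(X) * p(Y))        # True -> X and Y are independent
```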
Conditioned events
• Non-independent events are called conditioned (dependent) events.
• p(X|Y) = “the probability of X given that event Y occurred”.
• p(X|Y) = p(X,Y) / p(Y)
• p(X) = a priori probability (prior)
• p(X|Y) = posterior probability
• Are X and Y independent? p(X) = 1/2, p(Y) = 1/3, p(X,Y) = 1/6, p(X|Y) = (1/6) / (1/3) = 1/2 = p(X) ==> independent.
• Consider Z, the event “the number on the die is divisible by 4”.
Are X and Z independent?
p(Z) = p(4) = 1/6
p(X,Z) = 1/6
p(X|Z) = p(X,Z) / p(Z) = (1/6) / (1/6) = 1 ≠ 1/2 = p(X) ==> non-independent.
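The dependence of X and Z can be checked the same way; a sketch with illustrative helper names:

```python
from fractions import Fraction

omega = range(1, 7)  # faces of a fair die
X = {n for n in omega if n % 2 == 0}  # divisible by 2 -> {2, 4, 6}
Z = {n for n in omega if n % 4 == 0}  # divisible by 4 -> {4}

def p(event):
    return Fraction(len(event), 6)

def p_cond(a, b):
    """Conditional probability p(a|b) = p(a,b) / p(b)."""
    return p(a & b) / p(b)

# p(X|Z) = 1 but p(X) = 1/2, so observing Z changes the probability of X.
print(p_cond(X, Z), p(X))  # 1 1/2 -> X and Z are dependent
```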
Bayes’ Theorem
• Bayes’ Theorem lets us swap the order of dependence between events
• We saw that
P(A|B) = P(A,B) / P(B)
• Bayes’ Theorem:
P(A|B) = P(B|A) P(A) / P(B)
Example
• S:stiff neck, M: meningitis
• P(S|M) = 0.5, P(M) = 1/50,000, P(S) = 1/20
• I have a stiff neck; should I worry?
P(M|S) = P(S|M) P(M) / P(S) = (0.5 × 1/50,000) / (1/20) = 0.0002
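The arithmetic can be checked with a tiny helper (the function name is illustrative):

```python
def bayes(p_b_given_a, p_a, p_b):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# P(meningitis | stiff neck) from the slide's numbers.
p_m_given_s = bayes(p_b_given_a=0.5, p_a=1 / 50000, p_b=1 / 20)
print(p_m_given_s)  # 0.0002 -- a stiff neck alone is weak evidence of meningitis
```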
Other useful relations (marginalization):
p(x) = Σ_{y∈Y} p(x|y) · p(y)   or   p(x) = Σ_{y∈Y} p(x,y)
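A small illustration of marginalization, using a hypothetical joint distribution over (word, tag) pairs (the words, tags, and probabilities are invented for the example):

```python
from fractions import Fraction

# Hypothetical joint probabilities p(word, tag) for two ambiguous words.
joint = {
    ('lies', 'N'): Fraction(1, 10),
    ('lies', 'V'): Fraction(3, 10),
    ('flies', 'N'): Fraction(2, 10),
    ('flies', 'V'): Fraction(4, 10),
}

def p_word(word):
    """Marginalization: p(word) = sum over tags of p(word, tag)."""
    return sum(p for (w, t), p in joint.items() if w == word)

print(p_word('lies'))   # 2/5
print(p_word('flies'))  # 3/5
```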
Chain rule:
p(x1,x2,…,xn) = p(x1) · p(x2|x1) · p(x3|x1,x2) · … · p(xn|x1,x2,…,xn-1)
The proof is easy, through successive reductions. Consider the event y as the joint occurrence of the events x1,x2,…,xn-1:
p(x1,x2,…,xn) = p(y,xn) = p(y) · p(xn|y) = p(x1,x2,…,xn-1) · p(xn|x1,x2,…,xn-1)
Similarly, for the event z = (x1,x2,…,xn-2):
p(x1,x2,…,xn-1) = p(z,xn-1) = p(z) · p(xn-1|z) = p(x1,x2,…,xn-2) · p(xn-1|x1,x2,…,xn-2)
. . .
p(x1,x2,…,xn) = p(x1) · p(x2|x1) · p(x3|x1,x2) · … · p(xn|x1,x2,…,xn-1)
(prior, bigram, trigram, …, n-gram)
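The chain rule can be verified numerically on a toy joint distribution (the distribution and all names are illustrative, not from the course):

```python
from itertools import product

# A toy joint distribution over three binary variables; the weights are
# arbitrary and normalized so the probabilities sum to 1.
weights = {bits: i + 1 for i, bits in enumerate(product((0, 1), repeat=3))}
total = sum(weights.values())
joint = {bits: w / total for bits, w in weights.items()}

def marginal(prefix):
    """p(x1, ..., xk): sum the joint over all completions of the prefix."""
    k = len(prefix)
    return sum(p for bits, p in joint.items() if bits[:k] == prefix)

def chain_rule(bits):
    """p(x1) * p(x2|x1) * ... * p(xn|x1,...,xn-1),
    where each factor p(xk|x1..xk-1) = p(x1..xk) / p(x1..xk-1)."""
    prob = 1.0
    for k in range(1, len(bits) + 1):
        prob *= marginal(bits[:k]) / marginal(bits[:k - 1])  # marginal(()) == 1
    return prob

# The chain-rule factorization reproduces the joint for every outcome.
for bits in joint:
    assert abs(chain_rule(bits) - joint[bits]) < 1e-12
```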
Objections
• People don’t compute probabilities.
• Why would computers?
• Or do they?
• John went to …
the market
go
red
if
number
Objections
• Statistics only count words and co-occurrences.
• Two different concepts:
– statistical model and statistical method.
• The first doesn’t need the second.
• A person who uses intuition to reason is using a statistical model without statistical methods.
• Objections refer mainly to the accuracy of statistical models.