Statistical NLP course, Master in Computational Linguistics, 2nd year, 2013-2014, Diana Trandabat
TRANSCRIPT
Intro to probabilities
• Probability deals with prediction:
– Which word will follow in this ....?
– How can parses for a sentence be ordered?
– Which meaning is more likely?
– Which grammar is more linguistically plausible?
– See the phrase “more lies ahead”. How likely is it that “lies” is a noun?
– See “Le chien est noir”. How likely is it that the correct translation is “The dog is black”?
• Any rational decision can be described probabilistically.
Notations
• Experiment (or trial)
– repeatable process by which observations are made, e.g. tossing 3 coins
• Observe a basic outcome from the sample space, Ω (the set of all possible basic outcomes)
• Examples of sample spaces:
• one coin toss, sample space Ω = { H, T }
• three coin tosses, Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
• part-of-speech of a word, Ω = {N, V, Adj, etc…}
• next word in a Shakespeare play, |Ω| = size of the vocabulary
• number of words in your MSc thesis, Ω = { 0, 1, …, ∞ }
Notation
• An event A, is a set of basic outcomes, i.e., a subset of the sample space, Ω.
Example:
– Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
– e.g. basic outcome = THH
– e.g. event = “has exactly 2 H’s”
A={THH, HHT, HTH}
– A=Ω is the certain event P(A=Ω)=1
– A=∅ is the impossible event P(A=∅) = 0
– For “not A” , we write Ā
Intro to probabilities
• The true probability of an event is hard to compute.
• It is easy to compute an estimate of the probability, written p̂(x).
• As the number of observations grows, |X| → ∞, the estimate converges: p̂(x) → P(x).
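A minimal sketch of this convergence in Python, using simulated coin tosses (the function name and seed are illustrative, not from the course):

```python
import random

def estimate_p_heads(n_tosses, rng):
    """Empirical estimate p̂(H): the fraction of heads in n_tosses simulated fair-coin tosses."""
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses

rng = random.Random(42)  # fixed seed so the run is reproducible
for n in (10, 1000, 100000):
    # As n grows, the estimate p̂(H) drifts toward the true value P(H) = 0.5.
    print(n, estimate_p_heads(n, rng))
```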
Intro to probabilities
• “A coin is tossed 3 times.
• What is the likelihood of 2 heads?”
– Experiment: Toss a coin three times
– Sample space Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
– Event: basic outcomes that have exactly 2 H’s
A = {THH, HTH, HHT}
– the likelihood of 2 heads is 3 out of 8 possible outcomes
P(A) = 3/8
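The enumeration above can be checked with a few lines of Python (variable names are illustrative):

```python
from itertools import product
from fractions import Fraction

# Sample space for three coin tosses: all 8 sequences of H/T.
omega = [''.join(t) for t in product('HT', repeat=3)]

# Event A: basic outcomes that have exactly 2 H's.
A = [o for o in omega if o.count('H') == 2]

# P(A) = |A| / |Ω| under the uniform distribution.
p_A = Fraction(len(A), len(omega))
print(sorted(A), p_A)  # ['HHT', 'HTH', 'THH'] 3/8
```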
Probability distribution
• A probability distribution is an assignment of probabilities to a set of outcomes.
– A uniform distribution assigns the same probability to all outcomes (e.g. a fair coin).
– A Gaussian distribution assigns a bell curve over the outcomes.
– Many others.
– Uniform and Gaussian distributions are popular in SNLP.
Independent events
• Two events are independent if:
p(a,b) = p(a) · p(b)
• Consider a fair die. Intuitively, each side (1, 2, 3, 4, 5, 6) has a probability of 1/6.
• Consider the event X, “the number on the die is divisible by 2”, and the event Y, “the number is divisible by 3”.
• X = {2, 4, 6}, Y = {3, 6}
• p(X) = p(2) + p(4) + p(6) = 1/6 + 1/6 + 1/6 = 3/6 = 1/2
• p(Y) = p(3) + p(6) = 2/6 = 1/3
• p(X,Y) = p(6) = 1/6 = 1/2 · 1/3 = p(X) · p(Y)
• ==> X and Y are independent
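The die example can be verified with a small Python sketch (names are illustrative):

```python
from fractions import Fraction

omega = range(1, 7)  # faces of a fair die, each with probability 1/6
X = {n for n in omega if n % 2 == 0}  # divisible by 2 -> {2, 4, 6}
Y = {n for n in omega if n % 3 == 0}  # divisible by 3 -> {3, 6}

def p(event):
    """Probability of an event under the uniform distribution on a fair die."""
    return Fraction(len(event), 6)

# Joint event X,Y: outcomes in both X and Y, i.e. the intersection {6}.
print(p(X), p(Y), p(X & Y))           # 1/2 1/3 1/6
print(p(X & Y) == p(X) * p(Y))        # True -> X and Y are independent
```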
Conditioned events
• Non-independent events are called conditioned (dependent) events.
• p(X|Y) = “the probability of X given that event Y occurred”.
• p(X|Y) = p(X,Y) / p(Y)
• p(X) = a priori probability (prior)
• p(X|Y) = posterior probability
• Are X and Y independent? p(X) = 1/2, p(Y) = 1/3, p(X,Y) = 1/6, p(X|Y) = (1/6) / (1/3) = 1/2 = p(X) ==> independent.
• Consider Z, the event “the number on the die is divisible by 4”.
Are X and Z independent?
p(Z) = p(4) = 1/6
p(X,Z) = 1/6
p(X|Z) = p(X,Z) / p(Z) = (1/6) / (1/6) = 1 ≠ 1/2 = p(X) ==> non-independent.
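The dependence of X and Z can be checked the same way; a sketch with illustrative helper names:

```python
from fractions import Fraction

omega = range(1, 7)  # faces of a fair die
X = {n for n in omega if n % 2 == 0}  # divisible by 2 -> {2, 4, 6}
Z = {n for n in omega if n % 4 == 0}  # divisible by 4 -> {4}

def p(event):
    return Fraction(len(event), 6)

def p_cond(a, b):
    """Conditional probability p(a|b) = p(a,b) / p(b)."""
    return p(a & b) / p(b)

# p(X|Z) = 1 but p(X) = 1/2, so observing Z changes the probability of X.
print(p_cond(X, Z), p(X))  # 1 1/2 -> X and Z are dependent
```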
Bayes’ Theorem
• Bayes’ Theorem lets us swap the order of dependence between events
• We saw that
P(A|B) = P(A,B) / P(B)
• Bayes’ Theorem:
P(A|B) = P(B|A) P(A) / P(B)
Example
• S:stiff neck, M: meningitis
• P(S|M) = 0.5, P(M) = 1/50,000, P(S) = 1/20
• I have a stiff neck; should I worry?
P(M|S) = P(S|M) P(M) / P(S) = (0.5 × 1/50,000) / (1/20) = 0.0002
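The arithmetic can be checked with a tiny helper (the function name is illustrative):

```python
def bayes(p_b_given_a, p_a, p_b):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# P(meningitis | stiff neck) from the slide's numbers.
p_m_given_s = bayes(p_b_given_a=0.5, p_a=1 / 50000, p_b=1 / 20)
print(p_m_given_s)  # 0.0002 -- a stiff neck alone is weak evidence of meningitis
```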
Other useful relations (marginalization):
p(x) = Σ_{y∈Y} p(x|y) · p(y)   or   p(x) = Σ_{y∈Y} p(x,y)
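A small illustration of marginalization, using a hypothetical joint distribution over (word, tag) pairs (the words, tags, and probabilities are invented for the example):

```python
from fractions import Fraction

# Hypothetical joint probabilities p(word, tag) for two ambiguous words.
joint = {
    ('lies', 'N'): Fraction(1, 10),
    ('lies', 'V'): Fraction(3, 10),
    ('flies', 'N'): Fraction(2, 10),
    ('flies', 'V'): Fraction(4, 10),
}

def p_word(word):
    """Marginalization: p(word) = sum over tags of p(word, tag)."""
    return sum(p for (w, t), p in joint.items() if w == word)

print(p_word('lies'))   # 2/5
print(p_word('flies'))  # 3/5
```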
Chain rule:
p(x1,x2,…,xn) = p(x1) · p(x2|x1) · p(x3|x1,x2) · … · p(xn|x1,x2,…,xn-1)
The proof is easy, through successive reductions. Consider the event y as the joint occurrence of the events x1,x2,…,xn-1:
p(x1,x2,…,xn) = p(y,xn) = p(y) · p(xn|y) = p(x1,x2,…,xn-1) · p(xn|x1,x2,…,xn-1)
Similarly, for the event z = (x1,x2,…,xn-2):
p(x1,x2,…,xn-1) = p(z,xn-1) = p(z) · p(xn-1|z) = p(x1,x2,…,xn-2) · p(xn-1|x1,x2,…,xn-2)
. . .
p(x1,x2,…,xn) = p(x1) · p(x2|x1) · p(x3|x1,x2) · … · p(xn|x1,x2,…,xn-1)
(prior, bigram, trigram, …, n-gram)
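The chain rule can be verified numerically on a toy joint distribution (the distribution and all names are illustrative, not from the course):

```python
from itertools import product

# A toy joint distribution over three binary variables; the weights are
# arbitrary and normalized so the probabilities sum to 1.
weights = {bits: i + 1 for i, bits in enumerate(product((0, 1), repeat=3))}
total = sum(weights.values())
joint = {bits: w / total for bits, w in weights.items()}

def marginal(prefix):
    """p(x1, ..., xk): sum the joint over all completions of the prefix."""
    k = len(prefix)
    return sum(p for bits, p in joint.items() if bits[:k] == prefix)

def chain_rule(bits):
    """p(x1) * p(x2|x1) * ... * p(xn|x1,...,xn-1),
    where each factor p(xk|x1..xk-1) = p(x1..xk) / p(x1..xk-1)."""
    prob = 1.0
    for k in range(1, len(bits) + 1):
        prob *= marginal(bits[:k]) / marginal(bits[:k - 1])  # marginal(()) == 1
    return prob

# The chain-rule factorization reproduces the joint for every outcome.
for bits in joint:
    assert abs(chain_rule(bits) - joint[bits]) < 1e-12
```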
Objections
• People don’t compute probabilities.
• Why would computers?
• Or do they?
• John went to …
the market
go
red
if
number
Objections
• Statistics only count words and co-occurrences.
• Two different concepts:
– statistical model and statistical method.
• The first doesn’t need the second.
• A person who uses intuition to reason is using a statistical model without statistical methods.
• Objections refer mainly to the accuracy of statistical models.