# Machine learning: a probabilistic approach (murphyk/teaching/cs340-fall06/lectures/review-4pp.pdf)


CS340: Machine Learning

Review for midterm

Kevin Murphy

Machine learning: a probabilistic approach

We want to make models of data so we can find patterns and predict the future.

We will use probability theory to represent our uncertainty about what the right model is; this will induce uncertainty in our predictions.

We will update our beliefs about the model as we acquire data (this is called learning); as we get more data, the uncertainty in our predictions will decrease.

Outline

- Probability basics
- Probability density functions
- Parameter estimation
- Prediction
- Multivariate probability models
- Conditional probability models
- Information theory

Probability basics

Chain (product) rule:

$$p(A, B) = p(A)\,p(B|A) \qquad (1)$$

Marginalization (sum) rule:

$$p(B) = \sum_a p(A = a, B) \qquad (2)$$

Hence we derive Bayes' rule:

$$p(A|B) = \frac{p(A, B)}{p(B)} \qquad (3)$$
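The three rules above can be checked numerically on a small joint distribution. Here is a minimal sketch with a made-up joint table over two binary variables:

```python
# Toy joint distribution p(A, B) over two binary variables (values made up).
joint = {
    (0, 0): 0.30, (0, 1): 0.10,
    (1, 0): 0.20, (1, 1): 0.40,
}

# Sum rule: p(B=1) = sum_a p(A=a, B=1)
p_B1 = sum(p for (a, b), p in joint.items() if b == 1)  # 0.10 + 0.40 = 0.50

# Bayes' rule: p(A=1 | B=1) = p(A=1, B=1) / p(B=1)
p_A1_given_B1 = joint[(1, 1)] / p_B1  # 0.40 / 0.50 = 0.80
print(p_A1_given_B1)
```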

PMF

A probability mass function (PMF) is defined on a discrete random variable $X \in \{1, \ldots, K\}$ and satisfies

$$1 = \sum_{x=1}^{K} P(X = x) \qquad (4)$$

$$0 \le P(X = x) \le 1 \qquad (5)$$

This is just a histogram!

PDF

A probability density function (PDF) is defined on a continuous random variable $X \in \mathbb{R}$ or $X \in \mathbb{R}^+$ or $X \in [0, 1]$ etc., and satisfies

$$1 = \int_X p(X = x)\,dx \qquad (6)$$

$$0 \le p(X = x) \qquad (7)$$

Note that it is possible that $p(X = x) > 1$, since the density gets multiplied by $dx$: $P(x \le X \le x + dx) \approx p(x)\,dx$. Statistics books often write $f_X(x)$ for PDFs and use $P(\cdot)$ for PMFs.

Moments of a distribution

Mean (first moment):

$$\mu = E[X] = \sum_x x\,p(x) \qquad (8)$$

Variance (second central moment):

$$\sigma^2 = \mathrm{Var}[X] = \sum_x (x - \mu)^2 p(x) = E[X^2] - \mu^2 \qquad (9)$$
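Both variance formulas in eq. (9) can be verified on a small PMF; a sketch with a made-up distribution:

```python
# Mean and variance of a discrete distribution, computed both from the
# central-moment definition and from E[X^2] - mu^2 (toy pmf, values assumed).
pmf = {1: 0.2, 2: 0.5, 3: 0.3}  # p(x) for x in {1, 2, 3}

mu = sum(x * p for x, p in pmf.items())                       # E[X]
var_central = sum((x - mu) ** 2 * p for x, p in pmf.items())  # sum (x-mu)^2 p(x)
var_moments = sum(x ** 2 * p for x, p in pmf.items()) - mu ** 2

print(mu, var_central, var_moments)  # the two variance formulas agree
```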

CDFs

The cumulative distribution function is defined as

$$F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt \qquad (10)$$

[Figure: PDF of the standard normal (left) and the Gaussian CDF (right), for $x \in [-3, 3]$.]

Quantiles of a distribution

$$P(a \le X \le b) = F_X(b) - F_X(a) \qquad (11)$$

e.g., if $Z \sim \mathcal{N}(0, 1)$, then $F_Z(z) = \Phi(z)$, so

$$P(-1.96 \le Z \le 1.96) = \Phi(1.96) - \Phi(-1.96) \qquad (12)$$
$$= 0.975 - 0.025 = 0.95 \qquad (13)$$

Hence if $X \sim \mathcal{N}(\mu, \sigma)$,

$$P(\mu - 1.96\sigma < X < \mu + 1.96\sigma) = 1 - 2 \times 0.025 = 0.95 \qquad (14)$$

Hence the interval $\mu \pm 1.96\sigma$ contains 0.95 of the mass.
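The 0.95 figure can be checked numerically: $\Phi$ can be written with the standard error function, $\Phi(z) = \tfrac{1}{2}(1 + \mathrm{erf}(z/\sqrt{2}))$. A minimal sketch using only the standard library:

```python
import math

# Standard normal CDF via the error function: Phi(z) = (1 + erf(z/sqrt 2)) / 2.
def Phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Mass in the central interval [-1.96, 1.96] of N(0, 1):
mass = Phi(1.96) - Phi(-1.96)
print(round(mass, 3))  # approximately 0.95
```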

Outline

- Probability basics
- Probability density functions
- Parameter estimation
- Prediction
- Multivariate probability models
- Conditional probability models
- Information theory

Bernoulli distribution

$X \in \{0, 1\}$, $p(X = 1) = \theta$, $p(X = 0) = 1 - \theta$; written $X \sim \mathrm{Be}(\theta)$. $E[X] = \theta$.

Multinomial distribution

$X \in \{1, \ldots, K\}$, $p(X = k) = \theta_k$; written $X \sim \mathrm{Mu}(\theta)$, with $0 \le \theta_k \le 1$ and $\sum_{k=1}^{K} \theta_k = 1$. $E[X_i] = \theta_i$ (for the indicator encoding $X_i = I(X = i)$).

(Univariate) Gaussian distribution

$X \in \mathbb{R}$, $X \sim \mathcal{N}(\mu, \sigma)$, $\mu \in \mathbb{R}$, $\sigma \in \mathbb{R}^+$:

$$\mathcal{N}(x|\mu, \sigma) \stackrel{\mathrm{def}}{=} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}(x - \mu)^2} \qquad (15)$$

$E[X] = \mu$.

[Figure: PDF of the standard normal.]

Beta distribution

$X \in [0, 1]$, $X \sim \mathrm{Be}(\alpha_1, \alpha_0)$, $\alpha_0, \alpha_1 \ge 0$. $E[X] = \frac{\alpha_1}{\alpha_1 + \alpha_0}$.

$$\mathrm{Be}(X|\alpha_1, \alpha_0) \propto X^{\alpha_1 - 1} (1 - X)^{\alpha_0 - 1}$$

[Figure: Beta densities for $(a, b) = (0.1, 0.1)$, $(1, 1)$, $(2, 3)$, and $(8, 4)$.]

Dirichlet distribution

$X \in [0, 1]^K$, $X \sim \mathrm{Dir}(\alpha_1, \ldots, \alpha_K)$, $\alpha_k \ge 0$. $E[X_i] = \frac{\alpha_i}{\alpha}$, where $\alpha = \sum_k \alpha_k$.

$$\mathrm{Dir}(X|\alpha_1, \ldots, \alpha_K) \propto \prod_k X_k^{\alpha_k - 1}$$
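The stated means can be sanity-checked by simulation. A sketch with assumed parameter values, using only the standard library (a Dirichlet draw can be built from independent gamma draws):

```python
import random

random.seed(0)

# Monte Carlo check of E[X] = a1 / (a1 + a0) for a Beta (parameters assumed).
a1, a0 = 8.0, 4.0
n = 100_000
beta_mean = sum(random.betavariate(a1, a0) for _ in range(n)) / n
print(beta_mean)  # close to 8 / 12 = 2/3

# A Dirichlet draw: X_k = G_k / sum_j G_j with G_k ~ Gamma(alpha_k, 1).
alphas = [2.0, 3.0, 5.0]
g = [random.gammavariate(a, 1.0) for a in alphas]
x = [gi / sum(g) for gi in g]
print(sum(x))  # components of a Dirichlet draw sum to 1
```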

Outline

- Probability basics
- Probability density functions
- Parameter estimation
- Prediction
- Multivariate probability models
- Conditional probability models
- Information theory

Parameter estimation: MLE

Likelihood of iid data $D = \{x_n\}_{n=1}^{N}$:

$$L(\theta) = p(D|\theta) = \prod_n p(x_n|\theta) \qquad (16)$$

Log likelihood:

$$\ell(\theta) = \log p(D|\theta) = \sum_n \log p(x_n|\theta) \qquad (17)$$

MLE = maximum likelihood estimate:

$$\hat{\theta} = \arg\max_\theta L(\theta) = \arg\max_\theta \ell(\theta) \qquad (18)$$

hence

$$\frac{\partial \ell}{\partial \theta_i} = 0 \qquad (19)$$

MLE for Bernoullis

$$p(X_n|\theta) = \theta^{X_n} (1 - \theta)^{1 - X_n} \qquad (20)$$

$$\ell(\theta) = \sum_n X_n \log\theta + \sum_n (1 - X_n) \log(1 - \theta) \qquad (21)$$
$$= N_1 \log\theta + N_0 \log(1 - \theta) \qquad (22)$$

$$N_1 = \sum_n I(X_n = 1), \quad N_0 = \sum_n I(X_n = 0) \qquad (23)$$

$$\frac{d\ell}{d\theta} = \frac{N_1}{\theta} - \frac{N_0}{1 - \theta} = 0 \qquad (24)$$

$$\hat{\theta} = \frac{N_1}{N_1 + N_0} \qquad (25)$$
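Eq. (25) says the Bernoulli MLE is just the empirical frequency of 1s; a minimal sketch on made-up data:

```python
# MLE for a Bernoulli parameter (eq. 25): theta_hat = N1 / (N1 + N0).
# The coin-flip data below is made up.
data = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]

N1 = sum(x == 1 for x in data)  # count of heads
N0 = sum(x == 0 for x in data)  # count of tails
theta_hat = N1 / (N1 + N0)
print(theta_hat)  # 7 / 10 = 0.7
```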

MLE for Multinomials

$$p(X_n|\theta) = \prod_k \theta_k^{I(X_n = k)} \qquad (26)$$

$$\ell(\theta) = \sum_n \sum_k I(x_n = k) \log\theta_k \qquad (27)$$
$$= \sum_k N_k \log\theta_k \qquad (28)$$

Add a Lagrange multiplier for the constraint $\sum_k \theta_k = 1$:

$$\tilde{\ell} = \sum_k N_k \log\theta_k + \lambda \Big(1 - \sum_k \theta_k\Big) \qquad (29)$$

$$\frac{\partial \tilde{\ell}}{\partial \theta_k} = \frac{N_k}{\theta_k} - \lambda = 0 \qquad (30)$$

$$\hat{\theta}_k = \frac{N_k}{N} \qquad (31)$$
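Eq. (31) says the multinomial MLE is the normalized count vector; a sketch on made-up data over $K = 3$ categories:

```python
from collections import Counter

# MLE for a multinomial (eq. 31): theta_k = N_k / N, the normalized counts.
# Toy data over K = 3 categories (made up).
data = [1, 2, 2, 3, 2, 1, 3, 2]
N = len(data)
counts = Counter(data)  # N_k for each category k

theta_hat = {k: counts[k] / N for k in sorted(counts)}
print(theta_hat)  # {1: 0.25, 2: 0.5, 3: 0.25}
```

Note the Lagrange constraint $\sum_k \hat{\theta}_k = 1$ holds automatically, since the counts sum to $N$.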

Parameter estimation: Bayesian

$$p(\theta|D) \propto p(D|\theta)\,p(\theta) \qquad (32)$$
$$= \Big[\prod_n p(x_n|\theta)\Big] p(\theta) \qquad (33)$$

We say $p(\theta)$ is conjugate to $p(D|\theta)$ if $p(\theta|D)$ is in the same family as $p(\theta)$. This allows recursive (sequential) updating. MLE is a point estimate; Bayes computes the full posterior, a natural model of uncertainty (e.g., posterior variance, entropy, credible intervals, etc.).

Parameter estimation: MAP

For many models, as $|D| \to \infty$,

$$p(\theta|D) \to \delta(\theta - \hat{\theta}_{MAP}) \to \delta(\theta - \hat{\theta}_{MLE}) \qquad (34)$$

where

$$\hat{\theta}_{MAP} = \arg\max_\theta p(D|\theta)\,p(\theta) \qquad (35)$$

The MAP (maximum a posteriori) estimate is a posterior mode. It is also called a penalized likelihood estimate, since

$$\hat{\theta}_{MAP} = \arg\max_\theta \log p(D|\theta) - \lambda c(\theta) \qquad (36)$$

where $c(\theta) = -\log p(\theta)$ is a penalty or regularizer, and $\lambda$ is the strength of the regularizer.

Bayes estimate for Bernoulli

$$X_n \sim \mathrm{Be}(\theta) \qquad (37)$$
$$\theta \sim \mathrm{Beta}(\alpha_1, \alpha_0) \qquad (38)$$

$$p(\theta|D) \propto \Big[\prod_n \theta^{X_n} (1 - \theta)^{1 - X_n}\Big]\, \theta^{\alpha_1 - 1} (1 - \theta)^{\alpha_0 - 1} \qquad (39)$$
$$= \theta^{N_1} (1 - \theta)^{N_0}\, \theta^{\alpha_1 - 1} (1 - \theta)^{\alpha_0 - 1} \qquad (40)$$
$$\propto \mathrm{Beta}(\alpha_1 + N_1, \alpha_0 + N_0) \qquad (41)$$
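The conjugate update in eq. (41) is just addition of counts to pseudo-counts, which is why sequential updating works; a minimal sketch with an assumed uniform prior:

```python
# Conjugate updating for the Bernoulli (eq. 41): a Beta(a1, a0) prior plus
# counts (N1, N0) yields a Beta(a1 + N1, a0 + N0) posterior.
def update(a1, a0, data):
    N1 = sum(data)
    N0 = len(data) - N1
    return a1 + N1, a0 + N0

prior = (1.0, 1.0)                   # uniform Beta(1, 1) prior (assumed)
post = update(*prior, [1, 1, 0, 1])  # batch update on toy data
print(post)                          # (4.0, 2.0)

# Sequential updating, one observation at a time, gives the same posterior,
# because the posterior after each step is again a Beta.
seq = prior
for x in [1, 1, 0, 1]:
    seq = update(*seq, [x])
print(seq == post)
```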

Bayes estimate for Multinomial

$$X_n \sim \mathrm{Mu}(\theta) \qquad (42)$$
$$\theta \sim \mathrm{Dir}(\alpha_1, \ldots, \alpha_K) \qquad (43)$$

$$p(\theta|D) \propto \Big[\prod_n \prod_k \theta_k^{I(X_n = k)}\Big] \prod_k \theta_k^{\alpha_k - 1} \qquad (44)$$
$$= \prod_k \theta_k^{N_k}\, \theta_k^{\alpha_k - 1} \qquad (45)$$
$$\propto \mathrm{Dir}(\alpha_1 + N_1, \ldots, \alpha_K + N_K) \qquad (46)$$

Outline

- Probability basics
- Probability density functions
- Parameter estimation
- Prediction
- Multivariate probability models
- Conditional probability models
- Information theory

Predicting the future

The posterior predictive density is obtained by Bayesian model averaging:

$$p(X|D) = \int p(X|\theta)\,p(\theta|D)\,d\theta \qquad (47)$$

Plug-in principle: if $p(\theta|D) \approx \delta(\theta - \hat{\theta})$, then

$$p(X|D) \approx p(X|\hat{\theta}) \qquad (48)$$

Predictive for Bernoullis

$$p(\theta|D) = \mathrm{Beta}(\alpha_1 + N_1, \alpha_0 + N_0) \qquad (49)$$
$$\stackrel{\mathrm{def}}{=} \mathrm{Beta}(\alpha_1', \alpha_0') \qquad (50)$$

$$p(X = 1|D) = \int p(X = 1|\theta)\,p(\theta|D)\,d\theta \qquad (51)$$
$$= \int \theta\,p(\theta|D)\,d\theta \qquad (52)$$
$$= E[\theta|D] \qquad (53)$$
$$= \frac{\alpha_1'}{\alpha_1' + \alpha_0'} \qquad (54)$$
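Eq. (54) with a uniform Beta(1, 1) prior is Laplace's rule of succession, $(N_1 + 1)/(N + 2)$; a sketch comparing it to the MLE on made-up counts:

```python
# Posterior predictive for the Bernoulli (eq. 54): p(X=1 | D) is the
# posterior mean a1' / (a1' + a0'), not the MLE.
a1, a0 = 1.0, 1.0   # prior pseudo-counts (uniform prior, assumed)
N1, N0 = 3, 1       # observed counts (made up)

p_pred = (a1 + N1) / (a1 + N1 + a0 + N0)  # (3 + 1) / (4 + 2)
p_mle = N1 / (N1 + N0)
print(p_pred, p_mle)  # the prior shrinks the estimate toward 0.5
```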

Cross validation

We can use cross validation (CV) to find the model with the best predictive performance.

Outline

- Probability basics
- Probability density functions
- Parameter estimation
- Prediction
- Multivariate probability models
- Conditional probability models
- Information theory

Multivariate probability models

Each data case $X_n$ is now a vector of variables $X_{ni}$, $i = 1{:}p$, $n = 1{:}N$.

- e.g., $X_{ni} \in \{1, \ldots, K\}$ is the $i$th word in the $n$th sentence
- e.g., $X_{ni} \in \mathbb{R}$ is the $i$th attribute of the $n$th feature vector

We need to define $p(X_n|\theta)$ for feature vectors of potentially variable size.

Independent features (bag of words model)

$$p(X_n|\theta) = \prod_i p(X_{ni}|\theta_i) \qquad (55)$$

e.g., $X_{ni} \in \{0, 1\}$, product of Bernoullis:

$$p(X_n|\theta) = \prod_i \theta_i^{X_{ni}} (1 - \theta_i)^{1 - X_{ni}} \qquad (56)$$

e.g., $X_{ni} \in \{1, \ldots, K\}$, product of multinomials:

$$p(X_n|\theta) = \prod_i \prod_k \theta_{ik}^{I(X_{ni} = k)} \qquad (57)$$

MLE for bag of words model

$$p(D|\theta) = \prod_n \prod_i p(X_{ni}|\theta_i) \qquad (58)$$

We can estimate each $\theta_i$ separately, e.g., for a product of multinomials:

$$p(D|\theta) = \prod_n \prod_i \prod_k \theta_{ik}^{I(X_{ni} = k)} \qquad (59)$$
$$= \prod_i \prod_k \theta_{ik}^{N_{ik}} \qquad (60)$$

$$N_{ik} \stackrel{\mathrm{def}}{=} \sum_n I(X_{ni} = k) \qquad (61)$$

$$\ell(\theta_i) = \sum_k N_{ik} \log\theta_{ik} \qquad (62)$$

$$\hat{\theta}_{ik} = \frac{N_{ik}}{N_i}, \quad N_i = \sum_k N_{ik} \qquad (63)$$
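The factored MLE in eq. (63) means each feature's multinomial is estimated from its own counts; a sketch on a made-up corpus of fixed-length sentences, where feature $i$ is the word at position $i$:

```python
from collections import Counter

# MLE for the product-of-multinomials bag-of-words model (eq. 63):
# theta_ik = N_ik / N_i, estimated independently per position i.
# Toy corpus (made up), all "sentences" of equal length.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]
N = len(corpus)

# theta[i] maps each word k to N_ik / N for position i.
theta = [
    {k: c / N for k, c in Counter(doc[i] for doc in corpus).items()}
    for i in range(len(corpus[0]))
]
print(theta[1])  # position 1: 'cat' has probability 2/3, 'dog' has 1/3
```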

Bayes estimate for bag of words model

$$p(\theta|D) \propto \prod_i p(\theta_i) \Big[\prod_n p(X_{ni}|\theta_i)\Big] \qquad (64)$$

We can update each $\theta_i$ separately, e.g., for a product of multinomials:

$$p(\theta|D) = \prod_i \mathrm{Dir}(\theta_i|\alpha_{i1}, \ldots, \alpha_{iK}) \Big[\prod_k \theta_{ik}^{N_{ik}}\Big] \qquad (65)$$
$$\propto \prod_i \mathrm{Dir}(\theta_i|\alpha_{i1} + N_{i1}, \ldots, \alpha_{iK} + N_{iK}) \qquad (66)$$

Factored prior × factored likelihood = factored posterior.

(Discrete time) Markov model

$$p(X_n|\theta) = p(X_{n1}|\pi) \prod_{t=2}^{T} p(X_{nt}|X_{n,t-1}, A) \qquad (67)$$

For a discrete state space, $X_{nt} \in \{1, \ldots, K\}$:

$$\pi_i = p(X_1 = i) \qquad (68)$$
$$A_{ij} = p(X_t = j|X_{t-1} = i) \qquad (69)$$

$$p(X_n|\theta) = \prod_i \pi_i^{I(X_{n1} = i)} \prod_t \prod_i \prod_j A_{ij}^{I(X_{nt} = j, X_{n,t-1} = i)} \qquad (70)$$
$$= \prod_i \pi_i^{N_i^1} \prod_i \prod_j A_{ij}^{N_{ij}} \qquad (71)$$

$$N_i^1 \stackrel{\mathrm{def}}{=} I(X_{n1} = i) \qquad (72)$$

$$N_{ij} \stackrel{\mathrm{def}}{=} \sum_t I(X_{n,t} = j, X_{n,t-1} = i) \qquad (73)$$

MLE for Markov model

$$p(D|\theta) = \prod_n \prod_i \pi_i^{I(X_{n1} = i)} \prod_t \prod_i \prod_j A_{ij}^{I(X_{nt} = j, X_{n,t-1} = i)} \qquad (74)$$
$$= \prod_i \pi_i^{N_i^1} \prod_i \prod_j A_{ij}^{N_{ij}} \qquad (75)$$

$$N_i^1 \stackrel{\mathrm{def}}{=} \sum_n I(X_{n1} = i) \qquad (76)$$

$$N_{ij} \stackrel{\mathrm{def}}{=} \sum_n \sum_t I(X_{n,t} = j, X_{n,t-1} = i) \qquad (77)$$

$$\hat{A}_{ij} = \frac{N_{ij}}{\sum_{j'} N_{ij'}} \qquad (78)$$

$$\hat{\pi}_i = \frac{N_i^1}{N} \qquad (79)$$
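Eqs. (78)–(79) say the Markov-model MLE is normalized bigram and initial-state counts; a sketch on made-up state sequences over $K = 2$ states:

```python
from collections import Counter

# MLE for a Markov chain (eqs. 78-79): initial-state probabilities from
# first symbols, transition probabilities from normalized bigram counts.
# Toy state sequences over states {0, 1} (made up).
sequences = [[0, 0, 1, 1], [0, 1, 1, 0], [1, 1, 0, 0]]
K = 2

N1 = Counter(seq[0] for seq in sequences)                        # N_i^1
Nij = Counter((s[t - 1], s[t]) for s in sequences
              for t in range(1, len(s)))                         # N_ij

pi = [N1[i] / len(sequences) for i in range(K)]
A = [[Nij[(i, j)] / sum(Nij[(i, jj)] for jj in range(K))
      for j in range(K)] for i in range(K)]
print(pi)  # [2/3, 1/3]
print(A)   # each row of A sums to 1
```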

Stationary distribution for a Markov chain

$\pi_i$ is the fraction of time we spend in state $i$, given by the principal eigenvector:

$$\pi = \pi A \qquad (80)$$

Balance the flow across a cut set: for a two-state chain with $A_{12} = \alpha$ and $A_{21} = \beta$,

$$\pi_1 \alpha = \pi_2 \beta \qquad (81)$$

Since $\pi_1 + \pi_2 = 1$, we have

$$\pi_1 = \frac{\beta}{\alpha + \beta}, \quad \pi_2 = \frac{\alpha}{\alpha + \beta} \qquad (82)$$
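The fixed point of eq. (80) can be found by power iteration, repeatedly applying $\pi \leftarrow \pi A$; a sketch with assumed transition probabilities for a two-state chain:

```python
# Stationary distribution by power iteration (eq. 80). Two-state chain with
# alpha = p(1 -> 2) = 0.3 and beta = p(2 -> 1) = 0.2 (values assumed), so the
# closed form of eq. 82 gives pi = (beta, alpha) / (alpha + beta).
alpha, beta = 0.3, 0.2
A = [[1 - alpha, alpha],
     [beta, 1 - beta]]

pi = [0.5, 0.5]           # arbitrary starting distribution
for _ in range(1000):
    pi = [sum(pi[i] * A[i][j] for i in range(2)) for j in range(2)]

print(pi)  # converges to [0.4, 0.6]
```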

Outline

- Probability basics
- Probability density functions
- Parameter estimation
- Prediction
- Multivariate probability models
- Conditional probability models
- Information theory

Conditional density estimation

Goal: learn $p(y|x)$, where $x$ is the input and $y$ is the output.

Generative model:

$$p(y|x) \propto p(x|y)\,p(y) \qquad (83)$$

Discriminative model, e.g., logistic regression:

$$p(y|x) = \sigma(w^T x) \qquad (84)$$
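A logistic regression prediction as in eq. (84) is just a sigmoid applied to an inner product; a minimal sketch with hand-picked (not learned) weights:

```python
import math

# Discriminative model p(y=1|x) = sigma(w^T x), with sigma the logistic
# sigmoid. The weight vector below is hypothetical, not learned.
def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

w = [-1.0, 2.0, 0.5]   # hypothetical weights (first entry acts as a bias)
x = [1.0, 0.8, 0.3]    # input with a leading 1 for the bias term

p_y1 = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))  # sigma(0.75)
print(round(p_y1, 3))
```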

Naive Bayes = conditional bag of words model

Generative model:

$$p(y|x) \propto p(x|y)\,p(y) \qquad (85)$$

in which the features of $x$ are conditionally independent given $y$:

$$p(y|x) \propto p(y) \prod_{i=1}^{p} p(x_i|y) \qquad (86)$$

e.g., for multinomials, $p(y = c|x) \propto \ldots$
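The naive Bayes rule in eq. (86) can be sketched end to end: per-class multinomial word models plus a class prior, scored in log space. This is an illustrative implementation on a made-up toy data set, with add-one (Laplace) smoothing added to avoid zero counts:

```python
import math
from collections import Counter, defaultdict

# Minimal naive Bayes classifier (eq. 86): p(y|x) proportional to
# p(y) * prod_i p(x_i|y). The "spam" training set is made up.
train = [(["win", "cash", "now"], "spam"),
         (["meeting", "at", "noon"], "ham"),
         (["cash", "prize", "now"], "spam"),
         (["lunch", "at", "noon"], "ham")]

classes = Counter(y for _, y in train)          # class counts for p(y)
word_counts = defaultdict(Counter)              # per-class word counts
for words, y in train:
    word_counts[y].update(words)
vocab = {w for words, _ in train for w in words}

def log_posterior(words, y):
    # log p(y) + sum_i log p(x_i|y), with add-one smoothing
    lp = math.log(classes[y] / sum(classes.values()))
    total = sum(word_counts[y].values()) + len(vocab)
    for w in words:
        lp += math.log((word_counts[y][w] + 1) / total)
    return lp

pred = max(classes, key=lambda y: log_posterior(["cash", "now"], y))
print(pred)
```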