
  • CS340: Machine Learning

    Review for midterm

    Kevin Murphy

    Machine learning: a probabilistic approach

    We want to make models of data so we can find patterns and predict the future.

    We will use probability theory to represent our uncertainty about what the right model is; this will induce uncertainty in our predictions.

    We will update our beliefs about the model as we acquire data (this is called learning); as we get more data, the uncertainty in our predictions will decrease.

    Outline

    Probability basics
    Probability density functions
    Parameter estimation
    Prediction
    Multivariate probability models
    Conditional probability models
    Information theory

    Probability basics

    Chain (product) rule:

    $p(A, B) = p(A)\,p(B \mid A)$  (1)

    Marginalization (sum) rule:

    $p(B) = \sum_{a} p(A = a, B)$  (2)

    Hence we derive Bayes rule:

    $p(A \mid B) = \frac{p(A, B)}{p(B)}$  (3)
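    A quick numerical check of rules (1)-(3), sketched in Python; the joint table `p_AB` is a made-up illustrative example, not from the slides.

```python
import numpy as np

# Joint distribution p(A, B) over two binary variables (illustrative numbers).
p_AB = np.array([[0.3, 0.1],   # rows: A = 0, 1
                 [0.2, 0.4]])  # cols: B = 0, 1

p_A = p_AB.sum(axis=1)               # sum rule: p(A) = sum_b p(A, B = b)
p_B = p_AB.sum(axis=0)               # sum rule: p(B) = sum_a p(A = a, B)
p_B_given_A = p_AB / p_A[:, None]    # chain rule rearranged: p(B|A) = p(A,B) / p(A)
p_A_given_B = p_AB / p_B[None, :]    # Bayes rule: p(A|B) = p(A,B) / p(B)

# Same thing written the familiar way: p(A|B) = p(B|A) p(A) / p(B)
assert np.allclose(p_A_given_B, p_B_given_A * p_A[:, None] / p_B[None, :])
```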

  • PMF

    A probability mass function (PMF) is defined on a discrete random variable $X \in \{1, \dots, K\}$ and satisfies

    $1 = \sum_{x=1}^{K} P(X = x)$  (4)

    $0 \le P(X = x) \le 1$  (5)

    This is just a histogram!

    PDF

    A probability density function (PDF) is defined on a continuous random variable $X \in \mathbb{R}$, $X \in \mathbb{R}^{+}$, $X \in [0, 1]$, etc., and satisfies

    $1 = \int_{\mathcal{X}} p(X = x)\, dx$  (6)

    $0 \le p(X = x)$  (7)

    Note that it is possible that $p(X = x) > 1$, since the density gets multiplied by $dx$: $P(x \le X \le x + dx) \approx p(x)\, dx$. Statistics books often write $f_X(x)$ for PDFs and use $P(\cdot)$ for PMFs.
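    A small sketch of the note above: a narrow Gaussian (here with an arbitrary $\sigma = 0.1$) has a density that exceeds 1 at its mode, yet still integrates to 1.

```python
import numpy as np

# Evaluate N(x | 0, 0.1) on a fine grid and approximate its integral.
mu, sigma = 0.0, 0.1
x = np.linspace(-1, 1, 20001)
p = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / np.sqrt(2 * np.pi * sigma ** 2)

dx = x[1] - x[0]
print(p.max())        # about 3.99 > 1: the density can exceed 1
print(p.sum() * dx)   # about 1.0: it still integrates to 1
```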

    Moments of a distribution

    Mean (first moment):

    $\mu = E[X] = \sum_{x} x\, p(x)$  (8)

    Variance (second central moment):

    $\sigma^2 = \mathrm{Var}[X] = \sum_{x} (x - \mu)^2 p(x) = E[X^2] - \mu^2$  (9)
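    A quick check of the shortcut in (9), $\mathrm{Var}[X] = E[X^2] - \mu^2$, on a made-up PMF.

```python
import numpy as np

# Arbitrary PMF over x in {1, 2, 3, 4}.
x = np.array([1, 2, 3, 4])
p = np.array([0.1, 0.2, 0.3, 0.4])

mu = np.sum(x * p)                        # mean, eq. (8)
var_central = np.sum((x - mu) ** 2 * p)   # second central moment
var_shortcut = np.sum(x ** 2 * p) - mu ** 2
assert np.isclose(var_central, var_shortcut)
```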

    CDFs

    The cumulative distribution function is defined as

    $F(x) = P(X \le x) = \int_{-\infty}^{x} f(x')\, dx'$  (10)

    [Figures: standard normal PDF and the corresponding Gaussian CDF, plotted over $x \in [-3, 3]$]

  • Quantiles of a distribution

    $P(a \le X \le b) = F_X(b) - F_X(a)$  (11)

    e.g., if $Z \sim N(0, 1)$, then $F_Z(z) = \Phi(z)$ and

    $P(-1.96 \le Z \le 1.96) = \Phi(1.96) - \Phi(-1.96)$  (12)
    $= 0.975 - 0.025 = 0.95$  (13)

    Hence if $X \sim N(\mu, \sigma)$,

    $P(\mu - 1.96\sigma < X < \mu + 1.96\sigma) = 1 - 2 \times 0.025 = 0.95$  (14)

    Hence the interval $\mu \pm 1.96\sigma$ contains 0.95 of the probability mass.
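    A numerical check of (12)-(14), using the standard relation $\Phi(z) = \tfrac{1}{2}(1 + \mathrm{erf}(z/\sqrt{2}))$; the values $\mu = 5$, $\sigma = 2$ below are arbitrary.

```python
from math import erf, sqrt

# Standard normal CDF via the error function.
def Phi(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

print(Phi(1.96) - Phi(-1.96))   # about 0.95, matching (12)-(13)

# For X ~ N(mu, sigma), standardize: P(a < X < b) = Phi((b-mu)/sigma) - Phi((a-mu)/sigma)
mu, sigma = 5.0, 2.0
a, b = mu - 1.96 * sigma, mu + 1.96 * sigma
print(Phi((b - mu) / sigma) - Phi((a - mu) / sigma))  # about 0.95, matching (14)
```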

    Outline

    Probability basics

    Probability density functions
    Parameter estimation
    Prediction
    Multivariate probability models
    Conditional probability models
    Information theory

    Bernoulli distribution

    $X \in \{0, 1\}$, $p(X = 1) = \theta$, $p(X = 0) = 1 - \theta$; $X \sim \mathrm{Be}(\theta)$; $E[X] = \theta$

    Multinomial distribution

    $X \in \{1, \dots, K\}$, $p(X = k) = \theta_k$; $X \sim \mathrm{Mu}(\theta)$; $0 \le \theta_k \le 1$, $\sum_{k=1}^{K} \theta_k = 1$; $E[X_i] = \theta_i$

  • (Univariate) Gaussian distribution

    $X \in \mathbb{R}$, $X \sim N(\mu, \sigma)$, $\mu \in \mathbb{R}$, $\sigma \in \mathbb{R}^{+}$

    $N(x \mid \mu, \sigma) \stackrel{\mathrm{def}}{=} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}(x - \mu)^2}$  (15)

    $E[X] = \mu$

    [Figure: standard normal PDF, plotted over $x \in [-3, 3]$]

    Beta distribution

    $X \in [0, 1]$, $X \sim \mathrm{Be}(\alpha_1, \alpha_0)$, $\alpha_0, \alpha_1 \ge 0$; $E[X] = \frac{\alpha_1}{\alpha_1 + \alpha_0}$

    $\mathrm{Be}(X \mid \alpha_1, \alpha_0) \propto X^{\alpha_1 - 1} (1 - X)^{\alpha_0 - 1}$

    [Figure: Beta densities on [0, 1] for (a, b) = (0.1, 0.1), (1, 1), (2, 3), (8, 4)]

    Dirichlet distribution

    $X \in [0, 1]^K$, $X \sim \mathrm{Dir}(\alpha_1, \dots, \alpha_K)$, $\alpha_k \ge 0$; $E[X_i] = \frac{\alpha_i}{\alpha}$, where $\alpha = \sum_k \alpha_k$.

    $\mathrm{Dir}(X \mid \alpha_1, \dots, \alpha_K) \propto \prod_k X_k^{\alpha_k - 1}$

    Outline

    Probability basics

    Probability density functions

    Parameter estimation
    Prediction
    Multivariate probability models
    Conditional probability models
    Information theory

  • Parameter estimation: MLE

    Likelihood of iid data $D = \{x_n\}_{n=1}^{N}$:

    $L(\theta) = p(D \mid \theta) = \prod_{n} p(x_n \mid \theta)$  (16)

    Log likelihood:

    $\ell(\theta) = \log p(D \mid \theta) = \sum_{n} \log p(x_n \mid \theta)$  (17)

    MLE = maximum likelihood estimate:

    $\hat{\theta} = \arg\max_{\theta} L(\theta) = \arg\max_{\theta} \ell(\theta)$  (18)

    hence we solve

    $\frac{\partial \ell}{\partial \theta_i} = 0$  (19)

    MLE for Bernoullis

    $p(X_n \mid \theta) = \theta^{X_n}(1 - \theta)^{1 - X_n}$  (20)

    $\ell(\theta) = \sum_{n} X_n \log\theta + \sum_{n} (1 - X_n)\log(1 - \theta)$  (21)
    $= N_1 \log\theta + N_0 \log(1 - \theta)$  (22)

    $N_1 = \sum_{n} I(X_n = 1)$  (23)

    $\frac{d\ell}{d\theta} = \frac{N_1}{\theta} - \frac{N_0}{1 - \theta} = 0$  (24)

    $\hat{\theta} = \frac{N_1}{N_1 + N_0}$  (25)
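    A minimal sketch of (25) on made-up coin flips, with a grid search over the log likelihood (22) as a sanity check.

```python
import numpy as np

# Bernoulli MLE: count heads and normalize.
X = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])
N1 = np.sum(X == 1)
N0 = np.sum(X == 0)
theta_mle = N1 / (N1 + N0)
print(theta_mle)   # 0.7

# Sanity check: the closed form matches a grid search over the log likelihood (22).
grid = np.linspace(0.01, 0.99, 99)
loglik = N1 * np.log(grid) + N0 * np.log(1 - grid)
print(grid[np.argmax(loglik)])   # also about 0.7
```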

    MLE for Multinomials

    $p(X_n \mid \theta) = \prod_{k} \theta_k^{I(X_n = k)}$  (26)

    $\ell(\theta) = \sum_{n} \sum_{k} I(x_n = k) \log\theta_k$  (27)
    $= \sum_{k} N_k \log\theta_k$  (28)

    Adding a Lagrange multiplier $\lambda$ for the constraint $\sum_k \theta_k = 1$:

    $\tilde{\ell} = \sum_{k} N_k \log\theta_k + \lambda\Big(1 - \sum_{k} \theta_k\Big)$  (29)

    $\frac{\partial\tilde{\ell}}{\partial\theta_k} = \frac{N_k}{\theta_k} - \lambda = 0$  (30)

    $\hat{\theta}_k = \frac{N_k}{N}$  (31)
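    A sketch of (31) on a made-up dataset with $K = 3$ states.

```python
import numpy as np

# Multinomial MLE: theta_k = N_k / N.
K = 3
X = np.array([1, 2, 2, 3, 1, 2, 2, 3, 2, 2])   # observations in {1, ..., K}
N_k = np.array([np.sum(X == k) for k in range(1, K + 1)])
theta_mle = N_k / N_k.sum()
print(theta_mle)   # [0.2 0.6 0.2]
```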

    Parameter estimation: Bayesian

    $p(\theta \mid D) \propto p(D \mid \theta)\, p(\theta)$  (32)
    $= \Big[\prod_{n} p(x_n \mid \theta)\Big] p(\theta)$  (33)

    We say $p(\theta)$ is conjugate to $p(D \mid \theta)$ if $p(\theta \mid D)$ is in the same family as $p(\theta)$. This allows recursive (sequential) updating.

    The MLE is a point estimate. Bayes computes the full posterior, a natural model of uncertainty (e.g., posterior variance, entropy, credible intervals, etc.).

  • Parameter estimation: MAP

    For many models, as $|D| \to \infty$,

    $p(\theta \mid D) \to \delta(\theta - \hat{\theta}_{MAP}) \to \delta(\theta - \hat{\theta}_{MLE})$  (34)

    where

    $\hat{\theta}_{MAP} = \arg\max_{\theta}\, p(D \mid \theta)\, p(\theta)$  (35)

    The MAP (maximum a posteriori) estimate is a posterior mode. It is also called a penalized likelihood estimate, since

    $\hat{\theta}_{MAP} = \arg\max_{\theta}\, \log p(D \mid \theta) - \lambda\, c(\theta)$  (36)

    where $c(\theta) = -\log p(\theta)$ is a penalty or regularizer, and $\lambda$ is the strength of the regularizer.

    Bayes estimate for Bernoulli

    $X_n \sim \mathrm{Be}(\theta)$  (37)
    $\theta \sim \mathrm{Beta}(\alpha_1, \alpha_0)$  (38)

    $p(\theta \mid D) \propto \Big[\prod_{n} \theta^{X_n}(1 - \theta)^{1 - X_n}\Big]\, \theta^{\alpha_1 - 1}(1 - \theta)^{\alpha_0 - 1}$  (39)
    $= \theta^{N_1}(1 - \theta)^{N_0}\, \theta^{\alpha_1 - 1}(1 - \theta)^{\alpha_0 - 1}$  (40)
    $\propto \mathrm{Beta}(\alpha_1 + N_1, \alpha_0 + N_0)$  (41)
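    A sketch of the conjugate update (39)-(41); the prior hyperparameters and data below are arbitrary illustrative values.

```python
import numpy as np

# Beta-Bernoulli conjugate update: add counts to the prior pseudo-counts.
a1, a0 = 2.0, 2.0                     # prior Beta(alpha_1, alpha_0)
X = np.array([1, 1, 0, 1, 1, 0, 1])   # observed coin flips
N1, N0 = np.sum(X == 1), np.sum(X == 0)

a1_post, a0_post = a1 + N1, a0 + N0   # posterior Beta(alpha_1 + N1, alpha_0 + N0)
post_mean = a1_post / (a1_post + a0_post)
print((a1_post, a0_post), post_mean)  # (7.0, 4.0)  0.636...
```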

    Bayes estimate for Multinomial

    $X_n \sim \mathrm{Mu}(\theta)$  (42)
    $\theta \sim \mathrm{Dir}(\alpha_1, \dots, \alpha_K)$  (43)

    $p(\theta \mid D) \propto \Big[\prod_{n} \prod_{k} \theta_k^{I(X_n = k)}\Big] \prod_{k} \theta_k^{\alpha_k - 1}$  (44)
    $= \prod_{k} \theta_k^{N_k}\, \theta_k^{\alpha_k - 1}$  (45)
    $\propto \mathrm{Dir}(\alpha_1 + N_1, \dots, \alpha_K + N_K)$  (46)
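    The analogous Dirichlet-multinomial update (44)-(46), again on made-up data.

```python
import numpy as np

# Dirichlet-multinomial conjugate update: add counts to the prior pseudo-counts.
alpha = np.array([1.0, 1.0, 1.0])     # uniform Dir(1, 1, 1) prior (an arbitrary choice)
X = np.array([1, 2, 2, 3, 2, 1, 2])   # observations in {1, 2, 3}
N_k = np.array([np.sum(X == k) for k in (1, 2, 3)])

alpha_post = alpha + N_k              # Dir(alpha_k + N_k)
post_mean = alpha_post / alpha_post.sum()
print(alpha_post, post_mean)          # [3. 5. 2.]  [0.3 0.5 0.2]
```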

    Outline

    Probability basics

    Probability density functions

    Parameter estimation

    Prediction
    Multivariate probability models
    Conditional probability models
    Information theory

  • Predicting the future

    The posterior predictive density is obtained by Bayesian model averaging:

    $p(X \mid D) = \int p(X \mid \theta)\, p(\theta \mid D)\, d\theta$  (47)

    Plug-in principle: if $p(\theta \mid D) \approx \delta(\theta - \hat{\theta})$, then

    $p(X \mid D) \approx p(X \mid \hat{\theta})$  (48)

    Predictive for Bernoullis

    $p(\theta \mid D) = \mathrm{Beta}(\alpha_1 + N_1, \alpha_0 + N_0)$  (49)
    $\stackrel{\mathrm{def}}{=} \mathrm{Beta}(\alpha_1', \alpha_0')$  (50)

    $p(X = 1 \mid D) = \int p(X = 1 \mid \theta)\, p(\theta \mid D)\, d\theta$  (51)
    $= \int \theta\, p(\theta \mid D)\, d\theta$  (52)
    $= E[\theta \mid D]$  (53)
    $= \frac{\alpha_1'}{\alpha_1' + \alpha_0'}$  (54)
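    A sketch of the posterior predictive (54). With $\alpha_1 = \alpha_0 = 1$ this reduces to add-one (Laplace) smoothing, which pulls the prediction toward 0.5 relative to the MLE; the data below are made up.

```python
import numpy as np

# Posterior predictive p(X = 1 | D) = (a1 + N1) / (a1 + N1 + a0 + N0).
a1, a0 = 1.0, 1.0
X = np.array([1, 1, 1, 0, 1])
N1, N0 = np.sum(X == 1), np.sum(X == 0)

p_pred = (a1 + N1) / (a1 + N1 + a0 + N0)
theta_mle = N1 / (N1 + N0)
print(p_pred, theta_mle)   # 0.714... vs 0.8: the predictive is shrunk toward 0.5
```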

    Cross validation

    We can use CV to find the model with the best predictive performance.

    Outline

    Probability basics

    Probability density functions

    Parameter estimation

    Prediction

    Multivariate probability models
    Conditional probability models
    Information theory

  • Multivariate probability models

    Each data case $X_n$ is now a vector of variables $X_{ni}$, $i = 1{:}p$, $n = 1{:}N$.

    e.g., $X_{ni} \in \{1, \dots, K\}$ is the $i$th word in the $n$th sentence; e.g., $X_{ni} \in \mathbb{R}$ is the $i$th attribute in the $n$th feature vector.

    We need to define $p(X_n \mid \theta)$ for feature vectors of potentially variable size.

    Independent features (bag of words model)

    $p(X_n \mid \theta) = \prod_{i} p(X_{ni} \mid \theta_i)$  (55)

    e.g., $X_{ni} \in \{0, 1\}$, product of Bernoullis:

    $p(X_n \mid \theta) = \prod_{i} \theta_i^{X_{ni}} (1 - \theta_i)^{1 - X_{ni}}$  (56)

    e.g., $X_{ni} \in \{1, \dots, K\}$, product of multinomials:

    $p(X_n \mid \theta) = \prod_{i} \prod_{k} \theta_{ik}^{I(X_{ni} = k)}$  (57)

    MLE for bag of words model

    $p(D \mid \theta) = \prod_{n} \prod_{i} p(X_{ni} \mid \theta_i)$  (58)

    We can estimate each $\theta_i$ separately, e.g., for a product of multinomials:

    $p(D \mid \theta) = \prod_{n} \prod_{i} \prod_{k} \theta_{ik}^{I(X_{ni} = k)}$  (59)
    $= \prod_{i} \prod_{k} \theta_{ik}^{N_{ik}}$  (60)

    $N_{ik} \stackrel{\mathrm{def}}{=} \sum_{n} I(X_{ni} = k)$  (61)

    $\ell(\theta_i) = \sum_{k} N_{ik} \log\theta_{ik}$  (62)

    $\hat{\theta}_{ik} = \frac{N_{ik}}{N_i}$  (63)
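    A sketch of the count-and-normalize MLE (61)-(63) for a product of multinomials; the data matrix below is made up, with $N = 4$ cases, $p = 3$ features, and $K = 2$ values per feature.

```python
import numpy as np

# Count N_ik per feature i and value k, then normalize each feature's row.
K = 2
X = np.array([[1, 2, 2],
              [1, 1, 2],
              [2, 1, 2],
              [1, 1, 1]])              # X[n, i] in {1, ..., K}
N, p = X.shape

N_ik = np.zeros((p, K))
for k in range(1, K + 1):
    N_ik[:, k - 1] = np.sum(X == k, axis=0)   # N_ik = sum_n I(X_ni = k)

theta = N_ik / N_ik.sum(axis=1, keepdims=True)  # theta_ik = N_ik / N_i
print(theta)
```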

    Bayes estimate for bag of words model

    $p(\theta \mid D) \propto \prod_{i} p(\theta_i) \Big[\prod_{n} \prod_{i} p(X_{ni} \mid \theta_i)\Big]$  (64)

    We can update each $\theta_i$ separately, e.g., for a product of multinomials:

    $p(\theta \mid D) \propto \prod_{i} \mathrm{Dir}(\theta_i \mid \alpha_{i1}, \dots, \alpha_{iK}) \Big[\prod_{k} \theta_{ik}^{N_{ik}}\Big]$  (65)
    $\propto \prod_{i} \mathrm{Dir}(\theta_i \mid \alpha_{i1} + N_{i1}, \dots, \alpha_{iK} + N_{iK})$  (66)

    Factored prior $\times$ factored likelihood = factored posterior.

  • (Discrete time) Markov model

    $p(X_n \mid \theta) = p(X_{n1} \mid \pi) \prod_{t=2}^{p} p(X_{nt} \mid X_{n,t-1}, A)$  (67)

    For a discrete state space, $X_{nt} \in \{1, \dots, K\}$:

    $\pi_i = p(X_1 = i)$  (68)
    $A_{ij} = p(X_t = j \mid X_{t-1} = i)$  (69)

    $p(X_n \mid \theta) = \prod_{i} \pi_i^{I(X_{n1} = i)} \prod_{t} \prod_{i} \prod_{j} A_{ij}^{I(X_{nt} = j,\, X_{n,t-1} = i)}$  (70)
    $= \prod_{i} \pi_i^{N_{1i}} \prod_{i} \prod_{j} A_{ij}^{N_{ij}}$  (71)

    $N_{1i} \stackrel{\mathrm{def}}{=} I(X_{n1} = i)$  (72)
    $N_{ij} \stackrel{\mathrm{def}}{=} \sum_{t} I(X_{n,t} = j,\, X_{n,t-1} = i)$  (73)

    MLE for Markov model

    $p(D \mid \theta) = \prod_{n} \prod_{i} \pi_i^{I(X_{n1} = i)} \prod_{t} \prod_{i} \prod_{j} A_{ij}^{I(X_{nt} = j,\, X_{n,t-1} = i)}$  (74)
    $= \prod_{i} \pi_i^{N_{1i}} \prod_{i} \prod_{j} A_{ij}^{N_{ij}}$  (75)

    $N_{1i} \stackrel{\mathrm{def}}{=} \sum_{n} I(X_{n1} = i)$  (76)
    $N_{ij} \stackrel{\mathrm{def}}{=} \sum_{n} \sum_{t} I(X_{n,t} = j,\, X_{n,t-1} = i)$  (77)

    $\hat{A}_{ij} = \frac{N_{ij}}{\sum_{j'} N_{ij'}}$  (78)

    $\hat{\pi}_i = \frac{N_{1i}}{N}$  (79)
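    A sketch of the Markov-chain MLE (76)-(79) on two made-up sequences; states are labelled $0, \dots, K-1$ here for easier indexing.

```python
import numpy as np

# Count initial states and transitions, then normalize.
K = 2
sequences = [np.array([0, 0, 1, 1, 1, 0]),
             np.array([1, 1, 0, 0, 1])]

N1 = np.zeros(K)          # initial-state counts N_1i
Nij = np.zeros((K, K))    # transition counts N_ij
for seq in sequences:
    N1[seq[0]] += 1
    for t in range(1, len(seq)):
        Nij[seq[t - 1], seq[t]] += 1

pi_hat = N1 / N1.sum()                          # (79)
A_hat = Nij / Nij.sum(axis=1, keepdims=True)    # (78)
print(pi_hat)   # [0.5 0.5]
print(A_hat)    # [[0.5 0.5], [0.4 0.6]]
```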

    Stationary distribution for a Markov chain

    $\pi_i$ is the fraction of time we spend in state $i$, given by the principal eigenvector:

    $\pi A = \pi$  (80)

    Balance flow across a cut set; for a two-state chain with $\alpha = p(1 \to 2)$ and $\beta = p(2 \to 1)$,

    $\pi_1 \alpha = \pi_2 \beta$  (81)

    Since $\pi_1 + \pi_2 = 1$, we have

    $\pi_1 = \frac{\beta}{\alpha + \beta}, \qquad \pi_2 = \frac{\alpha}{\alpha + \beta}$  (82)
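    A sketch of (80)-(82): the stationary distribution is the left eigenvector of $A$ with eigenvalue 1, illustrated on an arbitrary two-state chain.

```python
import numpy as np

# Two-state chain with alpha = p(1 -> 2) and beta = p(2 -> 1).
alpha, beta = 0.3, 0.1
A = np.array([[1 - alpha, alpha],
              [beta, 1 - beta]])

# pi A = pi  <=>  pi is an eigenvector of A^T with eigenvalue 1.
evals, evecs = np.linalg.eig(A.T)
v = np.real(evecs[:, np.argmax(np.real(evals))])
pi = v / v.sum()

print(pi)                                               # [0.25 0.75]
print([beta / (alpha + beta), alpha / (alpha + beta)])  # matches (82)
```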

    Outline

    Probability basics

    Probability density functions

    Parameter estimation

    Prediction

    Multivariate probability models

    Conditional probability models
    Information theory

  • Conditional density estimation

    Goal: learn $p(y \mid x)$, where $x$ is the input and $y$ is the output.

    Generative model:

    $p(y \mid x) \propto p(x \mid y)\, p(y)$  (83)

    Discriminative model, e.g., logistic regression:

    $p(y \mid x) = \sigma(w^T x)$  (84)

    Naive Bayes = conditional bag of words model

    Generative model:

    $p(y \mid x) \propto p(x \mid y)\, p(y)$  (85)

    in which the features of $x$ are conditionally independent given $y$:

    $p(y \mid x) \propto p(y) \prod_{i=1}^{p} p(x_i \mid y)$  (86)

    e.g., for multinomial features,

    $p(y = c \mid x) \propto p(y = c) \prod_{i} \prod_{k} \theta_{cik}^{I(x_i = k)}$
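    The slides' example uses multinomial features; the sketch below is the binary-feature variant (a product of Bernoullis per class, cf. (56)), with add-one smoothed estimates on made-up data.

```python
import numpy as np

# Naive Bayes with binary features: p(y = c | x) up to a constant is
# p(y = c) * prod_i p(x_i | y = c), with class-conditional Bernoullis.
X = np.array([[1, 0, 1],
              [1, 1, 1],
              [0, 0, 1],
              [0, 1, 0],
              [0, 0, 0]])          # X[n, i] in {0, 1}
y = np.array([1, 1, 1, 0, 0])      # class labels

classes = np.unique(y)
prior = np.array([np.mean(y == c) for c in classes])   # p(y = c)
# Posterior-mean estimates under a uniform Beta(1,1) prior (avoids log(0)).
theta = np.array([(X[y == c].sum(axis=0) + 1) / ((y == c).sum() + 2)
                  for c in classes])                    # p(x_i = 1 | y = c)

def predict_log_post(x):
    # log p(y = c | x) up to a constant: log p(y = c) + sum_i log p(x_i | y = c)
    ll = np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta), axis=1)
    return np.log(prior) + ll

print(classes[np.argmax(predict_log_post(np.array([1, 0, 1])))])  # predicts class 1
```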
