Unsupervised learning (part 1), Lecture 19. David Sontag, New York University. Slides adapted from Carlos Guestrin, Dan Klein, Luke Zettlemoyer, Dan Weld, Vibhav Gogate, and Andrew Moore.


  • Unsupervised learning (part 1), Lecture 19

    David Sontag, New York University

    Slides adapted from Carlos Guestrin, Dan Klein, Luke Zettlemoyer, Dan Weld, Vibhav Gogate, and Andrew Moore

  • Bayesian networks enable use of domain knowledge

    Will my car start this morning?

    Heckerman et al., Decision-Theoretic Troubleshooting, 1995

    Bayesian networks (Reference: Chapter 3)

    A Bayesian network is specified by a directed acyclic graph G = (V, E) with:

    1. One node i ∈ V for each random variable X_i
    2. One conditional probability distribution (CPD) per node, p(x_i | x_{pa(i)}), specifying the variable's probability conditioned on its parents' values

    Corresponds 1-1 with a particular factorization of the joint distribution:

    p(x_1, \ldots, x_n) = \prod_{i \in V} p(x_i \mid x_{pa(i)})

    Powerful framework for designing algorithms to perform probability computations

    David Sontag (NYU), Graphical Models, Lecture 1, January 31, 2013 (slide 30/44)
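Not from the slides: a minimal Python sketch of this factorization for an invented three-node network (Cloudy → Rain → WetLawn) with made-up CPD values, computing the joint probability of one full assignment as the product of the node-wise CPDs.

```python
# Joint probability of a full assignment as a product of CPDs,
# p(x_1, ..., x_n) = prod_i p(x_i | x_pa(i)), for a made-up three-node
# network Cloudy -> Rain -> WetLawn with invented CPD values.
p_cloudy = {True: 0.5, False: 0.5}
p_rain_given_cloudy = {True: {True: 0.8, False: 0.2},   # p(rain | cloudy)
                       False: {True: 0.1, False: 0.9}}
p_wet_given_rain = {True: {True: 0.9, False: 0.1},      # p(wet | rain)
                    False: {True: 0.2, False: 0.8}}

def joint(cloudy, rain, wet):
    """One factor per node, each conditioned only on its parents."""
    return (p_cloudy[cloudy]
            * p_rain_given_cloudy[cloudy][rain]
            * p_wet_given_rain[rain][wet])

# Sanity check: the joint sums to 1 over all 2^3 assignments.
total = sum(joint(c, r, w) for c in (True, False)
            for r in (True, False) for w in (True, False))
print(joint(True, True, True), total)
```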


  • Bayesian networks enable use of domain knowledge

    What is the differential diagnosis?

    Beinlich et al., The ALARM Monitoring System, 1989

  • Bayesian networks are generative models

    •  Can sample from the joint distribution, top-down
    •  Suppose Y can be "spam" or "not spam", and X_i is a binary indicator of whether word i is present in the e-mail
    •  Let's try generating a few emails!
    •  Often helps to think about Bayesian networks as a generative model when constructing the structure and thinking about the model assumptions

    [Diagram: label Y with children X_1, X_2, X_3, …, X_n (the features)]
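As a rough illustration (not from the slides), here is a toy generative spam model with an invented five-word vocabulary and made-up CPD values, sampled top-down: first the label Y, then each word indicator X_i given Y.

```python
import random

# Toy naive Bayes spam model: P(Y) and P(X_i = 1 | Y) for each word i.
# All numbers here are made up for illustration.
p_spam = 0.4
vocab = ["free", "meeting", "viagra", "deadline", "winner"]
p_word_given_y = {
    "spam":     {"free": 0.8, "meeting": 0.1, "viagra": 0.7, "deadline": 0.1, "winner": 0.6},
    "not spam": {"free": 0.2, "meeting": 0.6, "viagra": 0.01, "deadline": 0.5, "winner": 0.05},
}

def sample_email(rng):
    """Ancestral (top-down) sampling: first the label Y, then each X_i given Y."""
    y = "spam" if rng.random() < p_spam else "not spam"
    words = [w for w in vocab if rng.random() < p_word_given_y[y][w]]
    return y, words

rng = random.Random(0)
for _ in range(3):
    y, words = sample_email(rng)
    print(f"{y:9s} -> {words}")
```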

  • Inference in Bayesian networks

    •  Computing marginal probabilities in tree-structured Bayesian networks is easy
       –  The algorithm called "belief propagation" generalizes what we showed for hidden Markov models to arbitrary trees
    •  Wait… this isn't a tree! What can we do?

    [Diagrams: a chain-structured model over Y_1, …, Y_6 with observations X_1, …, X_6, and the model with label Y and features X_1, …, X_n]
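A small sketch (not from the slides) of why marginals are easy in a tree such as naive Bayes: the posterior over the label is obtained by multiplying the CPDs along the tree and normalizing over the label. The vocabulary and CPD values below are invented.

```python
# Exact marginal inference in a tree-structured network (here: naive Bayes).
# Computes P(Y | observed words); the toy CPD values below are made up.
p_y = {"spam": 0.4, "not spam": 0.6}
p_word_given_y = {
    "spam":     {"free": 0.8, "meeting": 0.1, "winner": 0.6},
    "not spam": {"free": 0.2, "meeting": 0.6, "winner": 0.05},
}

def posterior_label(observed):
    """observed maps each word to 1 (present) or 0 (absent)."""
    joint = {}
    for y, prior in p_y.items():
        score = prior
        for word, x in observed.items():
            p1 = p_word_given_y[y][word]
            score *= p1 if x == 1 else (1.0 - p1)
        joint[y] = score                    # p(y, x) for this label
    z = sum(joint.values())                 # p(x): marginalize out the label
    return {y: s / z for y, s in joint.items()}

print(posterior_label({"free": 1, "meeting": 0, "winner": 1}))
```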

  • Inference in Bayesian networks

    •  In some cases (such as this) we can transform this into what is called a "junction tree", and then run belief propagation

  • Approximate inference

    •  There is also a wealth of approximate inference algorithms that can be applied to Bayesian networks such as these
    •  Markov chain Monte Carlo algorithms repeatedly sample assignments for estimating marginals
    •  Variational inference algorithms (deterministic) find a simpler distribution which is "close" to the original, then compute marginals using the simpler distribution
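As a hedged illustration of the MCMC idea (not from the slides): a Gibbs sampler on an invented three-variable chain A → B → C, repeatedly resampling each unobserved variable given its Markov blanket to estimate P(B = 1 | C = 1). All CPD values are made up; for these numbers the exact answer is about 0.81.

```python
import random

# Toy chain A -> B -> C with binary variables; CPD values are made up.
p_a1 = 0.3                       # P(A = 1)
p_b1_given_a = {0: 0.2, 1: 0.8}  # P(B = 1 | A)
p_c1_given_b = {0: 0.1, 1: 0.7}  # P(C = 1 | B)

def bernoulli(p, rng):
    return 1 if rng.random() < p else 0

def gibbs_estimate_p_b_given_c1(num_samples=50000, burn_in=1000, seed=0):
    """Estimate P(B = 1 | C = 1) by Gibbs sampling over the unobserved A and B (C is clamped to 1)."""
    rng = random.Random(seed)
    a, b = 1, 1
    count_b1, kept = 0, 0
    for t in range(num_samples + burn_in):
        # Resample A given its Markov blanket {B}: P(A | B) is proportional to P(A) P(B | A)
        w1 = p_a1 * (p_b1_given_a[1] if b == 1 else 1 - p_b1_given_a[1])
        w0 = (1 - p_a1) * (p_b1_given_a[0] if b == 1 else 1 - p_b1_given_a[0])
        a = bernoulli(w1 / (w1 + w0), rng)
        # Resample B given its Markov blanket {A, C=1}: P(B | A, C=1) is proportional to P(B | A) P(C=1 | B)
        w1 = p_b1_given_a[a] * p_c1_given_b[1]
        w0 = (1 - p_b1_given_a[a]) * p_c1_given_b[0]
        b = bernoulli(w1 / (w1 + w0), rng)
        if t >= burn_in:
            kept += 1
            count_b1 += b
    return count_b1 / kept

print(gibbs_estimate_p_b_given_c1())  # should be close to the exact 0.81 for these numbers
```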

  • Maximum likelihood estimation in Bayesian networks

    Suppose that we know the Bayesian network structure G

    Let \theta_{x_i \mid x_{pa(i)}} be the parameter giving the value of the CPD p(x_i \mid x_{pa(i)})

    Maximum likelihood estimation corresponds to solving:

    \max_\theta \frac{1}{M} \sum_{m=1}^{M} \log p(x^m; \theta)

    subject to the non-negativity and normalization constraints.

    This is equal to:

    \max_\theta \frac{1}{M} \sum_{m=1}^{M} \log p(x^m; \theta)
      = \max_\theta \frac{1}{M} \sum_{m=1}^{M} \sum_{i=1}^{N} \log p(x_i^m \mid x_{pa(i)}^m; \theta)
      = \max_\theta \sum_{i=1}^{N} \frac{1}{M} \sum_{m=1}^{M} \log p(x_i^m \mid x_{pa(i)}^m; \theta)

    The optimization problem decomposes into an independent optimization problem for each CPD! Has a simple closed-form solution.

    David Sontag (NYU), Graphical Models, Lecture 10, April 11, 2013 (slide 17/22)
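A minimal sketch (not from the slides) of that closed-form solution in the fully observed, discrete case: each CPD entry is just a normalized count. The tiny child/parent data set below is invented.

```python
from collections import Counter

# Closed-form ML estimation of one CPD, p(x_i | x_pa(i)), from fully observed data:
# count joint configurations and normalize within each parent configuration.
# The tiny (child, parent) data set is made up for illustration.
data = [("rain", "cloudy"), ("rain", "cloudy"), ("dry", "cloudy"),
        ("dry", "clear"), ("dry", "clear"), ("rain", "clear")]

joint_counts = Counter(data)                            # N(x_i, x_pa(i))
parent_counts = Counter(parent for _, parent in data)   # N(x_pa(i))

cpd = {(child, parent): joint_counts[(child, parent)] / parent_counts[parent]
       for (child, parent) in joint_counts}

for (child, parent), prob in sorted(cpd.items()):
    print(f"p({child} | {parent}) = {prob:.2f}")
```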


  • Returning to clustering…

    •  Clusters may overlap
    •  Some clusters may be "wider" than others
    •  Can we model this explicitly?
    •  With what probability is a point from a cluster?

  • Probabilistic Clustering

    •  Try a probabilistic model!
    •  Allows overlaps, clusters of different size, etc.
    •  Can tell a generative story for data: P(Y) P(X|Y)
    •  Challenge: we need to estimate model parameters without labeled Ys

       Y    X1     X2
       ??    0.1    2.1
       ??    0.5   -1.1
       ??    0.0    3.0
       ??   -0.1   -2.0
       ??    0.2    1.5
       …     …      …

  • Gaussian Mixture Models

    [Figure: three clusters with means \mu_1, \mu_2, \mu_3]

    •  P(Y): There are k components
    •  P(X|Y): Each component generates data from a multivariate Gaussian with mean \mu_i and covariance matrix \Sigma_i

    Each data point is assumed to have been sampled from a generative process:

    1.  Choose component i with probability P(y = i)   [Multinomial]
    2.  Generate data point ~ N(\mu_i, \Sigma_i)

    P(X = x_j \mid Y = i) = \frac{1}{(2\pi)^{m/2} |\Sigma_i|^{1/2}} \exp\!\left( -\frac{1}{2} (x_j - \mu_i)^T \Sigma_i^{-1} (x_j - \mu_i) \right)

    By fitting this model (unsupervised learning), we can learn new insights about the data
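Not from the slides: a short numpy sketch of this two-step generative process, plus the multivariate Gaussian density, using invented 2-D parameters (weights, means, covariances).

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up 2-D mixture: mixing weights P(y = i), means, and covariances.
weights = np.array([0.5, 0.3, 0.2])
means = [np.array([0.0, 0.0]), np.array([4.0, 4.0]), np.array([-3.0, 3.0])]
covs = [np.eye(2), np.array([[1.0, 0.8], [0.8, 1.5]]), 0.5 * np.eye(2)]

def sample_point():
    """The generative story: pick a component, then draw from its Gaussian."""
    i = rng.choice(len(weights), p=weights)          # 1. choose component i ~ Multinomial
    x = rng.multivariate_normal(means[i], covs[i])   # 2. draw x ~ N(mu_i, Sigma_i)
    return i, x

def gaussian_density(x, mu, sigma):
    """P(X = x | Y = i) for an m-dimensional Gaussian."""
    m = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (m / 2) * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(sigma, diff)) / norm

i, x = sample_point()
print(f"component {i}, x = {x}, density under its component = {gaussian_density(x, means[i], covs[i]):.4f}")
```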

  • Multivariate Gaussians

    \Sigma ∝ identity matrix

    P(X = x_j \mid Y = i) = \frac{1}{(2\pi)^{m/2} |\Sigma_i|^{1/2}} \exp\!\left( -\frac{1}{2} (x_j - \mu_i)^T \Sigma_i^{-1} (x_j - \mu_i) \right)

  • Multivariate Gaussians

    \Sigma = diagonal matrix: the X_i are independent, à la Gaussian naive Bayes

    (same density formula as above)

  • Multivariate Gaussians

    \Sigma = arbitrary (positive semidefinite) matrix:
    -  specifies rotation (change of basis)
    -  eigenvalues specify relative elongation

    (same density formula as above)

  • Multivariate Gaussians

    Covariance matrix \Sigma = degree to which the x_i vary together

    Eigenvalue \lambda of \Sigma

    (same density formula as above)
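As a small aside (not from the slides), a numpy check that the eigenvectors of an invented covariance matrix Σ give the rotation of the density contours and the eigenvalues their relative elongation.

```python
import numpy as np

# A made-up 2-D covariance matrix with correlated coordinates.
sigma = np.array([[3.0, 1.2],
                  [1.2, 1.0]])

# Eigendecomposition of the symmetric matrix: Sigma = Q diag(lambda) Q^T.
eigvals, eigvecs = np.linalg.eigh(sigma)

# The contour {x : x^T Sigma^{-1} x = 1} is an ellipse whose axes point along the
# eigenvectors (columns of Q, i.e. the rotation) and whose semi-axis lengths are
# sqrt(lambda_k) (the relative elongation).
print("eigenvalues:", eigvals.round(3))
print("principal axes (columns):\n", eigvecs.round(3))
print("semi-axis lengths:", np.sqrt(eigvals).round(3))
```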

  • Modelling eruption of geysers

    Old Faithful Data Set

    [Scatter plot: time to eruption vs. duration of last eruption]

  • Modelling eruption of geysers

    Old Faithful Data Set

    [Plots: single Gaussian fit vs. mixture of two Gaussians]

  • Marginal distribution for mixtures of Gaussians

    p(x) = \sum_{k=1}^{K} \pi_k \, N(x \mid \mu_k, \Sigma_k)

    (K = 3 in the figure; \pi_k is the mixing coefficient and N(x \mid \mu_k, \Sigma_k) is the component)

  • Marginal distribution for mixtures of Gaussians

    [Figure]
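A short sketch (not from the slides) evaluating this marginal for a 1-D, K = 3 mixture with invented mixing coefficients, means, and standard deviations.

```python
import math

# Made-up 1-D mixture with K = 3 components: mixing coefficients, means, std devs.
pis = [0.5, 0.3, 0.2]
mus = [-2.0, 0.0, 3.0]
sigmas = [0.5, 1.0, 0.8]

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mixture_density(x):
    """Marginal p(x) = sum_k pi_k * N(x | mu_k, sigma_k^2)."""
    return sum(pi * normal_pdf(x, mu, s) for pi, mu, s in zip(pis, mus, sigmas))

for x in (-2.0, 0.0, 1.5, 3.0):
    print(f"p({x:+.1f}) = {mixture_density(x):.4f}")
```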

  • Learning mixtures of Gaussians

    [Panels: original data (hypothesized), observed data (y missing), inferred y's (learned model)]

    Shown is the posterior probability Pr(Y = i | x) that a point was generated from the i-th Gaussian

  • ML estimation in the supervised setting

    •  Univariate Gaussian

    •  Mixture of multivariate Gaussians: the ML estimate for each of the multivariate Gaussians is given by

       \mu_k^{ML} = \frac{1}{n_k} \sum_{j=1}^{n_k} x_j          \Sigma_k^{ML} = \frac{1}{n_k} \sum_{j=1}^{n_k} (x_j - \mu_k^{ML})(x_j - \mu_k^{ML})^T

       where the sums run only over the x_j generated from the k-th Gaussian (n_k of them)
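A minimal numpy sketch (not from the slides) of these supervised ML estimates on invented labeled data: for each component, the mean and the 1/n_k-normalized covariance of the points carrying that label.

```python
import numpy as np

rng = np.random.default_rng(1)

# Labeled (supervised) toy data: points x with known component labels y.
x = np.vstack([rng.normal([0, 0], 1.0, size=(50, 2)),
               rng.normal([5, 5], 0.5, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Per-component ML estimates: empirical mean and covariance of that component's points.
for k in (0, 1):
    xk = x[y == k]
    mu_k = xk.mean(axis=0)
    diff = xk - mu_k
    sigma_k = diff.T @ diff / len(xk)   # ML (1/n_k) covariance, not the 1/(n_k - 1) version
    print(f"component {k}: mu = {mu_k.round(2)}, Sigma =\n{sigma_k.round(2)}")
```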

  • What about with unobserved data?

    •  Maximize the marginal likelihood:
       –  argmax_\theta \prod_j P(x_j) = argmax_\theta \prod_j \sum_{k=1}^{K} P(Y_j = k, x_j)

    •  Almost always a hard problem!
       –  Usually no closed-form solution
       –  Even when lg P(X, Y) is convex, lg P(X) generally isn't…
       –  Many local optima

  • Expectation Maximization

    1977: Dempster, Laird, & Rubin

  • The EM Algorithm

    •  A clever method for maximizing the marginal likelihood:
       –  argmax_\theta \prod_j P(x_j) = argmax_\theta \prod_j \sum_{k=1}^{K} P(Y_j = k, x_j)
       –  Based on coordinate descent. Easy to implement (e.g., no line search, learning rates, etc.)

    •  Alternate between two steps:
       –  Compute an expectation
       –  Compute a maximization

    •  Not magic: still optimizing a non-convex function with lots of local optima
       –  The computations are just easier (often, significantly so)

  • EM: Two Easy Steps

    Objective: argmax_\theta \lg \prod_j \sum_{k=1}^{K} P(Y_j = k, x_j; \theta) = \sum_j \lg \sum_{k=1}^{K} P(Y_j = k, x_j; \theta)

    Data: {x_j | j = 1 .. n}

    •  E-step: Compute expectations to "fill in" missing y values according to the current parameters \theta
       –  For all examples j and values k for Y_j, compute: P(Y_j = k | x_j; \theta)

    •  M-step: Re-estimate the parameters with "weighted" MLE estimates
       –  Set \theta_{new} = argmax_\theta \sum_j \sum_k P(Y_j = k | x_j; \theta_{old}) \log P(Y_j = k, x_j; \theta)

    Particularly useful when the E and M steps have closed-form solutions

  • Gaussian Mixture Example: Start

  • After first iteration

  • After 2nd iteration

  • After 3rd iteration

  • After 4th iteration

  • After 5th iteration

  • After 6th iteration

  • After 20th iteration

  • EM for GMMs: only learning means (1D)

    Iterate: on the t-th iteration let our estimates be \lambda_t = {\mu_1^{(t)}, \mu_2^{(t)}, \ldots, \mu_K^{(t)}}

    E-step: Compute the "expected" classes of all data points

      P(Y_j = k \mid x_j, \mu_1 \ldots \mu_K) \propto \exp\!\left( -\frac{1}{2\sigma^2} (x_j - \mu_k)^2 \right) P(Y_j = k)

    M-step: Compute the most likely new \mu's given the class expectations

      \mu_k = \frac{\sum_{j=1}^{m} P(Y_j = k \mid x_j)\, x_j}{\sum_{j=1}^{m} P(Y_j = k \mid x_j)}
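A compact sketch of this means-only EM loop on invented 1-D data, with the shared variance σ² and uniform class priors held fixed (assumptions matching the slide; the data and initial means are made up).

```python
import math
import random

# EM for a 1-D mixture where only the K means are learned; the shared variance
# sigma^2 and the (uniform) class priors are held fixed. Data below are made up.
random.seed(0)
data = ([random.gauss(-3.0, 1.0) for _ in range(100)]
        + [random.gauss(2.0, 1.0) for _ in range(100)])
K, sigma2 = 2, 1.0
mus = [-1.0, 1.0]                      # initial guesses for the means

for t in range(50):
    # E-step: P(Y_j = k | x_j) proportional to exp(-(x_j - mu_k)^2 / (2 sigma^2)) P(Y_j = k)
    resp = []
    for x in data:
        w = [math.exp(-(x - mu) ** 2 / (2 * sigma2)) for mu in mus]  # uniform prior cancels
        z = sum(w)
        resp.append([wk / z for wk in w])
    # M-step: each mean becomes the responsibility-weighted average of the data
    mus = [sum(r[k] * x for r, x in zip(resp, data)) / sum(r[k] for r in resp)
           for k in range(K)]

print("estimated means:", [round(m, 2) for m in mus])  # should approach roughly -3 and 2
```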

  • What if we do hard assignments?

    Iterate: on the t-th iteration let our estimates be \lambda_t = {\mu_1^{(t)}, \mu_2^{(t)}, \ldots, \mu_K^{(t)}}

    E-step: Compute the "expected" classes of all data points

      P(Y_j = k \mid x_j, \mu_1 \ldots \mu_K) \propto \exp\!\left( -\frac{1}{2\sigma^2} (x_j - \mu_k)^2 \right) P(Y_j = k)

    M-step: Compute the most likely new \mu's given the class expectations, where \delta represents a hard assignment to the "most likely" or nearest cluster:

      \mu_k = \frac{\sum_{j=1}^{m} \delta(Y_j = k, x_j)\, x_j}{\sum_{j=1}^{m} \delta(Y_j = k, x_j)}

    Equivalent to the k-means clustering algorithm!!!
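The same loop with hard assignments, on the same kind of invented 1-D data, is exactly k-means: the δ of the slide becomes an argmin over squared distances. A minimal sketch:

```python
import random

# Hard-assignment EM with fixed, equal variances = the k-means algorithm (1-D here).
# The data and the K = 2 initial means are made up.
random.seed(0)
data = ([random.gauss(-3.0, 1.0) for _ in range(100)]
        + [random.gauss(2.0, 1.0) for _ in range(100)])
mus = [-1.0, 1.0]

for t in range(20):
    # "E-step": hard-assign each point to its nearest mean (the delta in the slide).
    assign = [min(range(len(mus)), key=lambda k: (x - mus[k]) ** 2) for x in data]
    # "M-step": each mean becomes the average of the points assigned to it.
    mus = [sum(x for x, a in zip(data, assign) if a == k) /
           max(1, sum(1 for a in assign if a == k))
           for k in range(len(mus))]

print("k-means centers:", [round(m, 2) for m in mus])
```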

  • E.M. for General GMMs

    Iterate: on the t-th iteration let our estimates be
      \lambda_t = {\mu_1^{(t)}, \ldots, \mu_K^{(t)}, \Sigma_1^{(t)}, \ldots, \Sigma_K^{(t)}, p_1^{(t)}, \ldots, p_K^{(t)}}

    E-step: Compute the "expected" classes of all data points for each class

      P(Y_j = k \mid x_j; \lambda_t) \propto p_k^{(t)}\, p(x_j; \mu_k^{(t)}, \Sigma_k^{(t)})

      (p_k^{(t)} is shorthand for the estimate of P(y = k) on the t-th iteration; the second factor evaluates the probability of a multivariate Gaussian at x_j)

    M-step: Compute the weighted MLE for \mu given the expected classes above

      \mu_k^{(t+1)} = \frac{\sum_j P(Y_j = k \mid x_j; \lambda_t)\, x_j}{\sum_j P(Y_j = k \mid x_j; \lambda_t)}

      \Sigma_k^{(t+1)} = \frac{\sum_j P(Y_j = k \mid x_j; \lambda_t)\, [x_j - \mu_k^{(t+1)}][x_j - \mu_k^{(t+1)}]^T}{\sum_j P(Y_j = k \mid x_j; \lambda_t)}

      p_k^{(t+1)} = \frac{\sum_j P(Y_j = k \mid x_j; \lambda_t)}{m}          (m = # training examples)
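A sketch of the full EM updates (means, covariances, mixing weights) on invented 2-D data with numpy; the initialization choices here (random data points as means, identity covariances, uniform weights) are assumptions, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up 2-D data from two clusters, treated as unlabeled.
data = np.vstack([rng.multivariate_normal([0, 0], np.eye(2), 150),
                  rng.multivariate_normal([4, 4], [[1.0, 0.5], [0.5, 1.0]], 150)])
n, d = data.shape
K = 2

# Initialization: random data points as means, identity covariances, uniform weights.
mus = data[rng.choice(n, K, replace=False)]
sigmas = np.array([np.eye(d) for _ in range(K)])
ps = np.full(K, 1.0 / K)

def gaussian_pdf(x, mu, sigma):
    """Multivariate Gaussian density evaluated at each row of x."""
    diff = x - mu
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * np.sum(diff @ np.linalg.inv(sigma) * diff, axis=1)) / norm

for t in range(100):
    # E-step: responsibilities P(Y_j = k | x_j; lambda_t) proportional to p_k * N(x_j; mu_k, Sigma_k)
    resp = np.column_stack([ps[k] * gaussian_pdf(data, mus[k], sigmas[k]) for k in range(K)])
    resp /= resp.sum(axis=1, keepdims=True)

    # M-step: weighted MLEs for mu_k, Sigma_k, p_k.
    nk = resp.sum(axis=0)
    mus = (resp.T @ data) / nk[:, None]
    for k in range(K):
        diff = data - mus[k]
        sigmas[k] = (resp[:, k, None] * diff).T @ diff / nk[k]
    ps = nk / n

print("mixing weights:", ps.round(2))
print("means:\n", mus.round(2))
```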

  • The general learning problem with missing data

    •  Marginal likelihood: X is observed, Z (e.g., the class labels Y) is missing:

       \ell(\theta : Data) = \sum_j \log P(x_j \mid \theta) = \sum_j \log \sum_z P(x_j, z \mid \theta)

    •  Objective: Find argmax_\theta \ell(\theta : Data)
    •  Assuming the hidden variables are missing completely at random
       (otherwise, we should explicitly model why the values are missing)

  • Properties of EM

    •  One can prove that:
       –  EM converges to a local maximum
       –  Each iteration improves the log-likelihood

    •  How? (Same as k-means)
       –  Likelihood objective instead of the k-means objective
       –  The M-step can never decrease the likelihood

  • EM pictorially

    Derivation of the EM algorithm:

    [Figure 2: Graphical interpretation of a single iteration of the EM algorithm. The function l(\theta|\theta_n) is bounded above by the likelihood function L(\theta). The functions are equal at \theta = \theta_n. The EM algorithm chooses \theta_{n+1} as the value of \theta for which l(\theta|\theta_n) is a maximum. Since L(\theta) \ge l(\theta|\theta_n), increasing l(\theta|\theta_n) ensures that the value of the likelihood function L(\theta) is increased at each step.]

    We now have a function, l(\theta|\theta_n), which is bounded above by the likelihood function L(\theta). Additionally, observe that

      l(\theta_n|\theta_n) = L(\theta_n) + \Delta(\theta_n|\theta_n)
                           = L(\theta_n) + \sum_z P(z|X, \theta_n) \ln \frac{P(X|z, \theta_n)\, P(z|\theta_n)}{P(z|X, \theta_n)\, P(X|\theta_n)}
                           = L(\theta_n) + \sum_z P(z|X, \theta_n) \ln \frac{P(X, z|\theta_n)}{P(X, z|\theta_n)}
                           = L(\theta_n) + \sum_z P(z|X, \theta_n) \ln 1
                           = L(\theta_n),                                   (16)

    so for \theta = \theta_n the functions l(\theta|\theta_n) and L(\theta) are equal.

    Our objective is to choose a value of \theta so that L(\theta) is maximized. We have shown that the function l(\theta|\theta_n) is bounded above by the likelihood function L(\theta) and that the values of the functions l(\theta|\theta_n) and L(\theta) are equal at the current estimate \theta = \theta_n. Therefore, any \theta which increases l(\theta|\theta_n) will also increase L(\theta). In order to achieve the greatest possible increase in the value of L(\theta), the EM algorithm calls for selecting \theta such that l(\theta|\theta_n) is maximized. We denote this updated value as \theta_{n+1}. This process is illustrated in Figure 2.

    (Figure from a tutorial by Sean Borman)

    David Sontag (NYU), Graphical Models, Lecture 11, April 12, 2012 (slide 11/13)

    Slide annotations: L(\theta) is the likelihood objective; l(\theta|\theta_n) is the lower bound at iteration n.

  • What you should know

    •  Mixture of Gaussians
    •  EM for mixture of Gaussians:
       –  How to learn maximum likelihood parameters in the case of unlabeled data
       –  Relation to k-means
    •  Two-step algorithm, just like k-means
    •  Hard / soft clustering
    •  Probabilistic model

    •  Remember, EM can get stuck in local minima,
       –  and empirically it DOES