Unsupervised learning (part 1), Lecture 19. David Sontag, New York University. Slides adapted from Carlos Guestrin, Dan Klein, Luke Zettlemoyer, Dan Weld, Vibhav Gogate, and Andrew Moore.


  • Unsupervised learning (part 1), Lecture 19

    David Sontag, New York University

    Slides adapted from Carlos Guestrin, Dan Klein, Luke Zettlemoyer, Dan Weld, Vibhav Gogate, and Andrew Moore

  • Bayesian networks enable use of domain knowledge

    Will my car start this morning?

    Heckerman et al., Decision-Theoretic Troubleshooting, 1995

    Bayesian networks (Reference: Chapter 3)

    A Bayesian network is specified by a directed acyclic graph G = (V, E) with:

    1. One node i ∈ V for each random variable X_i
    2. One conditional probability distribution (CPD) per node, p(x_i | x_{pa(i)}), specifying the variable's probability conditioned on its parents' values

    Corresponds 1-1 with a particular factorization of the joint distribution:

    p(x_1, \ldots, x_n) = \prod_{i \in V} p(x_i \mid x_{pa(i)})

    Powerful framework for designing algorithms to perform probability computations

    David Sontag (NYU), Graphical Models, Lecture 1, January 31, 2013 (slide 30/44)
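Not from the slides: a minimal Python sketch of this factorization for an invented three-node network (Cloudy → Rain → WetLawn) with made-up CPD values, computing the joint probability of one full assignment as the product of the node-wise CPDs.

```python
# Joint probability of a full assignment as a product of CPDs,
# p(x_1, ..., x_n) = prod_i p(x_i | x_pa(i)), for a made-up three-node
# network Cloudy -> Rain -> WetLawn with invented CPD values.
p_cloudy = {True: 0.5, False: 0.5}
p_rain_given_cloudy = {True: {True: 0.8, False: 0.2},   # p(rain | cloudy)
                       False: {True: 0.1, False: 0.9}}
p_wet_given_rain = {True: {True: 0.9, False: 0.1},      # p(wet | rain)
                    False: {True: 0.2, False: 0.8}}

def joint(cloudy, rain, wet):
    """One factor per node, each conditioned only on its parents."""
    return (p_cloudy[cloudy]
            * p_rain_given_cloudy[cloudy][rain]
            * p_wet_given_rain[rain][wet])

# Sanity check: the joint sums to 1 over all 2^3 assignments.
total = sum(joint(c, r, w) for c in (True, False)
            for r in (True, False) for w in (True, False))
print(joint(True, True, True), total)
```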


  • Bayesian networks enable use of domain knowledge

    What is the differential diagnosis?

    Beinlich et al., The ALARM Monitoring System, 1989

  • Bayesian networks are generative models

    •  Can sample from the joint distribution, top-down
    •  Suppose Y can be "spam" or "not spam", and X_i is a binary indicator of whether word i is present in the e-mail
    •  Let's try generating a few emails!
    •  Often helps to think about Bayesian networks as a generative model when constructing the structure and thinking about the model assumptions

    [Diagram: label Y with children X_1, X_2, X_3, …, X_n (the features)]
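As a rough illustration (not from the slides), here is a toy generative spam model with an invented five-word vocabulary and made-up CPD values, sampled top-down: first the label Y, then each word indicator X_i given Y.

```python
import random

# Toy naive Bayes spam model: P(Y) and P(X_i = 1 | Y) for each word i.
# All numbers here are made up for illustration.
p_spam = 0.4
vocab = ["free", "meeting", "viagra", "deadline", "winner"]
p_word_given_y = {
    "spam":     {"free": 0.8, "meeting": 0.1, "viagra": 0.7, "deadline": 0.1, "winner": 0.6},
    "not spam": {"free": 0.2, "meeting": 0.6, "viagra": 0.01, "deadline": 0.5, "winner": 0.05},
}

def sample_email(rng):
    """Ancestral (top-down) sampling: first the label Y, then each X_i given Y."""
    y = "spam" if rng.random() < p_spam else "not spam"
    words = [w for w in vocab if rng.random() < p_word_given_y[y][w]]
    return y, words

rng = random.Random(0)
for _ in range(3):
    y, words = sample_email(rng)
    print(f"{y:9s} -> {words}")
```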

  • Inference in Bayesian networks

    •  Computing marginal probabilities in tree-structured Bayesian networks is easy
       –  The algorithm called "belief propagation" generalizes what we showed for hidden Markov models to arbitrary trees
    •  Wait… this isn't a tree! What can we do?

    [Diagrams: a chain-structured model over Y_1, …, Y_6 with observations X_1, …, X_6, and the model with label Y and features X_1, …, X_n]
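A small sketch (not from the slides) of why marginals are easy in a tree such as naive Bayes: the posterior over the label is obtained by multiplying the CPDs along the tree and normalizing over the label. The vocabulary and CPD values below are invented.

```python
# Exact marginal inference in a tree-structured network (here: naive Bayes).
# Computes P(Y | observed words); the toy CPD values below are made up.
p_y = {"spam": 0.4, "not spam": 0.6}
p_word_given_y = {
    "spam":     {"free": 0.8, "meeting": 0.1, "winner": 0.6},
    "not spam": {"free": 0.2, "meeting": 0.6, "winner": 0.05},
}

def posterior_label(observed):
    """observed maps each word to 1 (present) or 0 (absent)."""
    joint = {}
    for y, prior in p_y.items():
        score = prior
        for word, x in observed.items():
            p1 = p_word_given_y[y][word]
            score *= p1 if x == 1 else (1.0 - p1)
        joint[y] = score                    # p(y, x) for this label
    z = sum(joint.values())                 # p(x): marginalize out the label
    return {y: s / z for y, s in joint.items()}

print(posterior_label({"free": 1, "meeting": 0, "winner": 1}))
```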

  • Inference in Bayesian networks

    •  In some cases (such as this) we can transform this into what is called a "junction tree", and then run belief propagation

  • Approximate inference

    •  There is also a wealth of approximate inference algorithms that can be applied to Bayesian networks such as these
    •  Markov chain Monte Carlo algorithms repeatedly sample assignments for estimating marginals
    •  Variational inference algorithms (deterministic) find a simpler distribution which is "close" to the original, then compute marginals using the simpler distribution
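As a hedged illustration of the MCMC idea (not from the slides): a Gibbs sampler on an invented three-variable chain A → B → C, repeatedly resampling each unobserved variable given its Markov blanket to estimate P(B = 1 | C = 1). All CPD values are made up; for these numbers the exact answer is about 0.81.

```python
import random

# Toy chain A -> B -> C with binary variables; CPD values are made up.
p_a1 = 0.3                       # P(A = 1)
p_b1_given_a = {0: 0.2, 1: 0.8}  # P(B = 1 | A)
p_c1_given_b = {0: 0.1, 1: 0.7}  # P(C = 1 | B)

def bernoulli(p, rng):
    return 1 if rng.random() < p else 0

def gibbs_estimate_p_b_given_c1(num_samples=50000, burn_in=1000, seed=0):
    """Estimate P(B = 1 | C = 1) by Gibbs sampling over the unobserved A and B (C is clamped to 1)."""
    rng = random.Random(seed)
    a, b = 1, 1
    count_b1, kept = 0, 0
    for t in range(num_samples + burn_in):
        # Resample A given its Markov blanket {B}: P(A | B) is proportional to P(A) P(B | A)
        w1 = p_a1 * (p_b1_given_a[1] if b == 1 else 1 - p_b1_given_a[1])
        w0 = (1 - p_a1) * (p_b1_given_a[0] if b == 1 else 1 - p_b1_given_a[0])
        a = bernoulli(w1 / (w1 + w0), rng)
        # Resample B given its Markov blanket {A, C=1}: P(B | A, C=1) is proportional to P(B | A) P(C=1 | B)
        w1 = p_b1_given_a[a] * p_c1_given_b[1]
        w0 = (1 - p_b1_given_a[a]) * p_c1_given_b[0]
        b = bernoulli(w1 / (w1 + w0), rng)
        if t >= burn_in:
            kept += 1
            count_b1 += b
    return count_b1 / kept

print(gibbs_estimate_p_b_given_c1())  # should be close to the exact 0.81 for these numbers
```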

  • Maximum likelihood estimation in Bayesian networks

    Suppose that we know the Bayesian network structure G

    Let \theta_{x_i \mid x_{pa(i)}} be the parameter giving the value of the CPD p(x_i \mid x_{pa(i)})

    Maximum likelihood estimation corresponds to solving:

    \max_\theta \frac{1}{M} \sum_{m=1}^{M} \log p(x^m; \theta)

    subject to the non-negativity and normalization constraints.

    This is equal to:

    \max_\theta \frac{1}{M} \sum_{m=1}^{M} \log p(x^m; \theta)
      = \max_\theta \frac{1}{M} \sum_{m=1}^{M} \sum_{i=1}^{N} \log p(x_i^m \mid x_{pa(i)}^m; \theta)
      = \max_\theta \sum_{i=1}^{N} \frac{1}{M} \sum_{m=1}^{M} \log p(x_i^m \mid x_{pa(i)}^m; \theta)

    The optimization problem decomposes into an independent optimization problem for each CPD! Has a simple closed-form solution.

    David Sontag (NYU), Graphical Models, Lecture 10, April 11, 2013 (slide 17/22)
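A minimal sketch (not from the slides) of that closed-form solution in the fully observed, discrete case: each CPD entry is just a normalized count. The tiny child/parent data set below is invented.

```python
from collections import Counter

# Closed-form ML estimation of one CPD, p(x_i | x_pa(i)), from fully observed data:
# count joint configurations and normalize within each parent configuration.
# The tiny (child, parent) data set is made up for illustration.
data = [("rain", "cloudy"), ("rain", "cloudy"), ("dry", "cloudy"),
        ("dry", "clear"), ("dry", "clear"), ("rain", "clear")]

joint_counts = Counter(data)                            # N(x_i, x_pa(i))
parent_counts = Counter(parent for _, parent in data)   # N(x_pa(i))

cpd = {(child, parent): joint_counts[(child, parent)] / parent_counts[parent]
       for (child, parent) in joint_counts}

for (child, parent), prob in sorted(cpd.items()):
    print(f"p({child} | {parent}) = {prob:.2f}")
```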


  • Returning to clustering…

    •  Clusters may overlap
    •  Some clusters may be "wider" than others
    •  Can we model this explicitly?
    •  With what probability is a point from a cluster?

  • Probabilistic Clustering

    •  Try a probabilistic model!
    •  Allows overlaps, clusters of different size, etc.
    •  Can tell a generative story for data: P(Y) P(X|Y)
    •  Challenge: we need to estimate model parameters without labeled Ys

       Y    X1     X2
       ??    0.1    2.1
       ??    0.5   -1.1
       ??    0.0    3.0
       ??   -0.1   -2.0
       ??    0.2    1.5
       …     …      …

  • Gaussian Mixture Models

    [Figure: three clusters with means \mu_1, \mu_2, \mu_3]

    •  P(Y): There are k components
    •  P(X|Y): Each component generates data from a multivariate Gaussian with mean \mu_i and covariance matrix \Sigma_i

    Each data point is assumed to have been sampled from a generative process:

    1.  Choose component i with probability P(y = i)   [Multinomial]
    2.  Generate data point ~ N(\mu_i, \Sigma_i)

    P(X = x_j \mid Y = i) = \frac{1}{(2\pi)^{m/2} |\Sigma_i|^{1/2}} \exp\!\left( -\frac{1}{2} (x_j - \mu_i)^T \Sigma_i^{-1} (x_j - \mu_i) \right)

    By fitting this model (unsupervised learning), we can learn new insights about the data
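Not from the slides: a short numpy sketch of this two-step generative process, plus the multivariate Gaussian density, using invented 2-D parameters (weights, means, covariances).

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up 2-D mixture: mixing weights P(y = i), means, and covariances.
weights = np.array([0.5, 0.3, 0.2])
means = [np.array([0.0, 0.0]), np.array([4.0, 4.0]), np.array([-3.0, 3.0])]
covs = [np.eye(2), np.array([[1.0, 0.8], [0.8, 1.5]]), 0.5 * np.eye(2)]

def sample_point():
    """The generative story: pick a component, then draw from its Gaussian."""
    i = rng.choice(len(weights), p=weights)          # 1. choose component i ~ Multinomial
    x = rng.multivariate_normal(means[i], covs[i])   # 2. draw x ~ N(mu_i, Sigma_i)
    return i, x

def gaussian_density(x, mu, sigma):
    """P(X = x | Y = i) for an m-dimensional Gaussian."""
    m = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (m / 2) * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(sigma, diff)) / norm

i, x = sample_point()
print(f"component {i}, x = {x}, density under its component = {gaussian_density(x, means[i], covs[i]):.4f}")
```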

  • Multivariate Gaussians

    \Sigma ∝ identity matrix

    P(X = x_j \mid Y = i) = \frac{1}{(2\pi)^{m/2} |\Sigma_i|^{1/2}} \exp\!\left( -\frac{1}{2} (x_j - \mu_i)^T \Sigma_i^{-1} (x_j - \mu_i) \right)

  • Multivariate Gaussians

    \Sigma = diagonal matrix: the X_i are independent, à la Gaussian naive Bayes

    (same density formula as above)

  • Multivariate Gaussians

    \Sigma = arbitrary (positive semidefinite) matrix:
    -  specifies rotation (change of basis)
    -  eigenvalues specify relative elongation

    (same density formula as above)

  • Multivariate Gaussians

    Covariance matrix \Sigma = degree to which the x_i vary together

    Eigenvalue \lambda of \Sigma

    (same density formula as above)
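As a small aside (not from the slides), a numpy check that the eigenvectors of an invented covariance matrix Σ give the rotation of the density contours and the eigenvalues their relative elongation.

```python
import numpy as np

# A made-up 2-D covariance matrix with correlated coordinates.
sigma = np.array([[3.0, 1.2],
                  [1.2, 1.0]])

# Eigendecomposition of the symmetric matrix: Sigma = Q diag(lambda) Q^T.
eigvals, eigvecs = np.linalg.eigh(sigma)

# The contour {x : x^T Sigma^{-1} x = 1} is an ellipse whose axes point along the
# eigenvectors (columns of Q, i.e. the rotation) and whose semi-axis lengths are
# sqrt(lambda_k) (the relative elongation).
print("eigenvalues:", eigvals.round(3))
print("principal axes (columns):\n", eigvecs.round(3))
print("semi-axis lengths:", np.sqrt(eigvals).round(3))
```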

  • Modelling eruption of geysers

    Old Faithful Data Set

    [Scatter plot: time to eruption vs. duration of last eruption]

  • Modelling eruption of geysers

    Old Faithful Data Set

    [Plots: single Gaussian fit vs. mixture of two Gaussians]

  • Marginal distribution for mixtures of Gaussians

    p(x) = \sum_{k=1}^{K} \pi_k \, N(x \mid \mu_k, \Sigma_k)

    (K = 3 in the figure; \pi_k is the mixing coefficient and N(x \mid \mu_k, \Sigma_k) is the component)

  • Marginal distribution for mixtures of Gaussians

    [Figure]
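A short sketch (not from the slides) evaluating this marginal for a 1-D, K = 3 mixture with invented mixing coefficients, means, and standard deviations.

```python
import math

# Made-up 1-D mixture with K = 3 components: mixing coefficients, means, std devs.
pis = [0.5, 0.3, 0.2]
mus = [-2.0, 0.0, 3.0]
sigmas = [0.5, 1.0, 0.8]

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mixture_density(x):
    """Marginal p(x) = sum_k pi_k * N(x | mu_k, sigma_k^2)."""
    return sum(pi * normal_pdf(x, mu, s) for pi, mu, s in zip(pis, mus, sigmas))

for x in (-2.0, 0.0, 1.5, 3.0):
    print(f"p({x:+.1f}) = {mixture_density(x):.4f}")
```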

  • Learning mixtures of Gaussians

    [Panels: original data (hypothesized), observed data (y missing), inferred y's (learned model)]

    Shown is the posterior probability Pr(Y = i | x) that a point was generated from the i-th Gaussian

  • ML estimation in the supervised setting

    •  Univariate Gaussian

    •  Mixture of multivariate Gaussians: the ML estimate for each of the multivariate Gaussians is given by

       \mu_k^{ML} = \frac{1}{n_k} \sum_{j=1}^{n_k} x_j          \Sigma_k^{ML} = \frac{1}{n_k} \sum_{j=1}^{n_k} (x_j - \mu_k^{ML})(x_j - \mu_k^{ML})^T

       where the sums run only over the x_j generated from the k-th Gaussian (n_k of them)
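A minimal numpy sketch (not from the slides) of these supervised ML estimates on invented labeled data: for each component, the mean and the 1/n_k-normalized covariance of the points carrying that label.

```python
import numpy as np

rng = np.random.default_rng(1)

# Labeled (supervised) toy data: points x with known component labels y.
x = np.vstack([rng.normal([0, 0], 1.0, size=(50, 2)),
               rng.normal([5, 5], 0.5, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Per-component ML estimates: empirical mean and covariance of that component's points.
for k in (0, 1):
    xk = x[y == k]
    mu_k = xk.mean(axis=0)
    diff = xk - mu_k
    sigma_k = diff.T @ diff / len(xk)   # ML (1/n_k) covariance, not the 1/(n_k - 1) version
    print(f"component {k}: mu = {mu_k.round(2)}, Sigma =\n{sigma_k.round(2)}")
```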

  • What about with unobserved data?

    •  Maximize the marginal likelihood:
       –  argmax_\theta \prod_j P(x_j) = argmax_\theta \prod_j \sum_{k=1}^{K} P(Y_j = k, x_j)

    •  Almost always a hard problem!
       –  Usually no closed-form solution
       –  Even when lg P(X, Y) is convex, lg P(X) generally isn't…
       –  Many local optima

  • Expectation Maximization

    1977: Dempster, Laird, & Rubin

  • The EM Algorithm

    •  A clever method for maximizing the marginal likelihood:
       –  argmax_\theta \prod_j P(x_j) = argmax_\theta \prod_j \sum_{k=1}^{K} P(Y_j = k, x_j)
       –  Based on coordinate descent. Easy to implement (e.g., no line search, learning rates, etc.)

    •  Alternate between two steps:
       –  Compute an expectation
       –  Compute a maximization

    •  Not magic: still optimizing a non-convex function with lots of local optima
       –  The computations are just easier (often, significantly so)

  • EM: Two Easy Steps

    Objective: argmax_\theta \lg \prod_j \sum_{k=1}^{K} P(Y_j = k, x_j; \theta) = \sum_j \lg \sum_{k=1}^{K} P(Y_j = k, x_j; \theta)

    Data: {x_j | j = 1 .. n}

    •  E-step: Compute expectations to "fill in" missing y values according to the current parameters \theta
       –  For all examples j and values k for Y_j, compute: P(Y_j = k | x_j; \theta)

    •  M-step: Re-estimate the parameters with "weighted" MLE estimates
       –  Set \theta_{new} = argmax_\theta \sum_j \sum_k P(Y_j = k | x_j; \theta_{old}) \log P(Y_j = k, x_j; \theta)

    Particularly useful when the E and M steps have closed-form solutions

  • Gaussian Mixture Example: Start

  • After first iteration

  • After 2nd iteration

  • After 3rd iteration

  • After 4th iteration

  • After 5th iteration

  • After 6th iteration

  • After 20th iteration

  • EM for GMMs: only learning means (1D)

    Iterate: on the t-th iteration let our estimates be \lambda_t = {\mu_1^{(t)}, \mu_2^{(t)}, \ldots, \mu_K^{(t)}}

    E-step: Compute the "expected" classes of all data points

      P(Y_j = k \mid x_j, \mu_1 \ldots \mu_K) \propto \exp\!\left( -\frac{1}{2\sigma^2} (x_j - \mu_k)^2 \right) P(Y_j = k)

    M-step: Compute the most likely new \mu's given the class expectations

      \mu_k = \frac{\sum_{j=1}^{m} P(Y_j = k \mid x_j)\, x_j}{\sum_{j=1}^{m} P(Y_j = k \mid x_j)}
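A compact sketch of this means-only EM loop on invented 1-D data, with the shared variance σ² and uniform class priors held fixed (assumptions matching the slide; the data and initial means are made up).

```python
import math
import random

# EM for a 1-D mixture where only the K means are learned; the shared variance
# sigma^2 and the (uniform) class priors are held fixed. Data below are made up.
random.seed(0)
data = ([random.gauss(-3.0, 1.0) for _ in range(100)]
        + [random.gauss(2.0, 1.0) for _ in range(100)])
K, sigma2 = 2, 1.0
mus = [-1.0, 1.0]                      # initial guesses for the means

for t in range(50):
    # E-step: P(Y_j = k | x_j) proportional to exp(-(x_j - mu_k)^2 / (2 sigma^2)) P(Y_j = k)
    resp = []
    for x in data:
        w = [math.exp(-(x - mu) ** 2 / (2 * sigma2)) for mu in mus]  # uniform prior cancels
        z = sum(w)
        resp.append([wk / z for wk in w])
    # M-step: each mean becomes the responsibility-weighted average of the data
    mus = [sum(r[k] * x for r, x in zip(resp, data)) / sum(r[k] for r in resp)
           for k in range(K)]

print("estimated means:", [round(m, 2) for m in mus])  # should approach roughly -3 and 2
```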

  • What if we do hard assignments?

    Iterate: on the t-th iteration let our estimates be \lambda_t = {\mu_1^{(t)}, \mu_2^{(t)}, \ldots, \mu_K^{(t)}}

    E-step: Compute the "expected" classes of all data points

      P(Y_j = k \mid x_j, \mu_1 \ldots \mu_K) \propto \exp\!\left( -\frac{1}{2\sigma^2} (x_j - \mu_k)^2 \right) P(Y_j = k)

    M-step: Compute the most likely new \mu's given the class expectations, where \delta represents a hard assignment to the "most likely" or nearest cluster:

      \mu_k = \frac{\sum_{j=1}^{m} \delta(Y_j = k, x_j)\, x_j}{\sum_{j=1}^{m} \delta(Y_j = k, x_j)}

    Equivalent to the k-means clustering algorithm!!!
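The same loop with hard assignments, on the same kind of invented 1-D data, is exactly k-means: the δ of the slide becomes an argmin over squared distances. A minimal sketch:

```python
import random

# Hard-assignment EM with fixed, equal variances = the k-means algorithm (1-D here).
# The data and the K = 2 initial means are made up.
random.seed(0)
data = ([random.gauss(-3.0, 1.0) for _ in range(100)]
        + [random.gauss(2.0, 1.0) for _ in range(100)])
mus = [-1.0, 1.0]

for t in range(20):
    # "E-step": hard-assign each point to its nearest mean (the delta in the slide).
    assign = [min(range(len(mus)), key=lambda k: (x - mus[k]) ** 2) for x in data]
    # "M-step": each mean becomes the average of the points assigned to it.
    mus = [sum(x for x, a in zip(data, assign) if a == k) /
           max(1, sum(1 for a in assign if a == k))
           for k in range(len(mus))]

print("k-means centers:", [round(m, 2) for m in mus])
```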

  • E.M. for General GMMs

    Iterate: on the t-th iteration let our estimates be
      \lambda_t = {\mu_1^{(t)}, \ldots, \mu_K^{(t)}, \Sigma_1^{(t)}, \ldots, \Sigma_K^{(t)}, p_1^{(t)}, \ldots, p_K^{(t)}}

    E-step: Compute the "expected" classes of all data points for each class

      P(Y_j = k \mid x_j; \lambda_t) \propto p_k^{(t)}\, p(x_j; \mu_k^{(t)}, \Sigma_k^{(t)})

      (p_k^{(t)} is shorthand for the estimate of P(y = k) on the t-th iteration; the second factor evaluates the probability of a multivariate Gaussian at x_j)

    M-step: Compute the weighted MLE for \mu given the expected classes above

      \mu_k^{(t+1)} = \frac{\sum_j P(Y_j = k \mid x_j; \lambda_t)\, x_j}{\sum_j P(Y_j = k \mid x_j; \lambda_t)}

      \Sigma_k^{(t+1)} = \frac{\sum_j P(Y_j = k \mid x_j; \lambda_t)\, [x_j - \mu_k^{(t+1)}][x_j - \mu_k^{(t+1)}]^T}{\sum_j P(Y_j = k \mid x_j; \lambda_t)}

      p_k^{(t+1)} = \frac{\sum_j P(Y_j = k \mid x_j; \lambda_t)}{m}          (m = # training examples)
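A sketch of the full EM updates (means, covariances, mixing weights) on invented 2-D data with numpy; the initialization choices here (random data points as means, identity covariances, uniform weights) are assumptions, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up 2-D data from two clusters, treated as unlabeled.
data = np.vstack([rng.multivariate_normal([0, 0], np.eye(2), 150),
                  rng.multivariate_normal([4, 4], [[1.0, 0.5], [0.5, 1.0]], 150)])
n, d = data.shape
K = 2

# Initialization: random data points as means, identity covariances, uniform weights.
mus = data[rng.choice(n, K, replace=False)]
sigmas = np.array([np.eye(d) for _ in range(K)])
ps = np.full(K, 1.0 / K)

def gaussian_pdf(x, mu, sigma):
    """Multivariate Gaussian density evaluated at each row of x."""
    diff = x - mu
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * np.sum(diff @ np.linalg.inv(sigma) * diff, axis=1)) / norm

for t in range(100):
    # E-step: responsibilities P(Y_j = k | x_j; lambda_t) proportional to p_k * N(x_j; mu_k, Sigma_k)
    resp = np.column_stack([ps[k] * gaussian_pdf(data, mus[k], sigmas[k]) for k in range(K)])
    resp /= resp.sum(axis=1, keepdims=True)

    # M-step: weighted MLEs for mu_k, Sigma_k, p_k.
    nk = resp.sum(axis=0)
    mus = (resp.T @ data) / nk[:, None]
    for k in range(K):
        diff = data - mus[k]
        sigmas[k] = (resp[:, k, None] * diff).T @ diff / nk[k]
    ps = nk / n

print("mixing weights:", ps.round(2))
print("means:\n", mus.round(2))
```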

  • The general learning problem with missing data

    •  Marginal likelihood: X is observed, Z (e.g., the class labels Y) is missing:

       \ell(\theta : Data) = \sum_j \log P(x_j \mid \theta) = \sum_j \log \sum_z P(x_j, z \mid \theta)

    •  Objective: Find argmax_\theta \ell(\theta : Data)
    •  Assuming the hidden variables are missing completely at random
       (otherwise, we should explicitly model why the values are missing)

  • Properties of EM

    •  One can prove that:
       –  EM converges to a local maximum
       –  Each iteration improves the log-likelihood

    •  How? (Same as k-means)
       –  Likelihood objective instead of the k-means objective
       –  The M-step can never decrease the likelihood

  • EM pictorially

    Derivation of the EM algorithm:

    [Figure 2: Graphical interpretation of a single iteration of the EM algorithm. The function l(\theta|\theta_n) is bounded above by the likelihood function L(\theta). The functions are equal at \theta = \theta_n. The EM algorithm chooses \theta_{n+1} as the value of \theta for which l(\theta|\theta_n) is a maximum. Since L(\theta) \ge l(\theta|\theta_n), increasing l(\theta|\theta_n) ensures that the value of the likelihood function L(\theta) is increased at each step.]

    We now have a function, l(\theta|\theta_n), which is bounded above by the likelihood function L(\theta). Additionally, observe that

      l(\theta_n|\theta_n) = L(\theta_n) + \Delta(\theta_n|\theta_n)
                           = L(\theta_n) + \sum_z P(z|X, \theta_n) \ln \frac{P(X|z, \theta_n)\, P(z|\theta_n)}{P(z|X, \theta_n)\, P(X|\theta_n)}
                           = L(\theta_n) + \sum_z P(z|X, \theta_n) \ln \frac{P(X, z|\theta_n)}{P(X, z|\theta_n)}
                           = L(\theta_n) + \sum_z P(z|X, \theta_n) \ln 1
                           = L(\theta_n),                                   (16)

    so for \theta = \theta_n the functions l(\theta|\theta_n) and L(\theta) are equal.

    Our objective is to choose a value of \theta so that L(\theta) is maximized. We have shown that the function l(\theta|\theta_n) is bounded above by the likelihood function L(\theta) and that the values of the functions l(\theta|\theta_n) and L(\theta) are equal at the current estimate \theta = \theta_n. Therefore, any \theta which increases l(\theta|\theta_n) will also increase L(\theta). In order to achieve the greatest possible increase in the value of L(\theta), the EM algorithm calls for selecting \theta such that l(\theta|\theta_n) is maximized. We denote this updated value as \theta_{n+1}. This process is illustrated in Figure 2.

    (Figure from a tutorial by Sean Borman)

    David Sontag (NYU), Graphical Models, Lecture 11, April 12, 2012 (slide 11/13)

    Slide annotations: L(\theta) is the likelihood objective; l(\theta|\theta_n) is the lower bound at iteration n.

  • What you should know

    •  Mixture of Gaussians
    •  EM for mixture of Gaussians:
       –  How to learn maximum likelihood parameters in the case of unlabeled data
       –  Relation to k-means
    •  Two-step algorithm, just like k-means
    •  Hard / soft clustering
    •  Probabilistic model

    •  Remember, EM can get stuck in local minima,
       –  and empirically it DOES