CS839: Probabilistic Graphical Models
Lecture 8: Learning Fully Observed Undirected Graphical Models
Theo Rekatsinas
Recall: Undirected Graphical Models
• Pairwise (non-causal) relationships
• We can write down the model and score specific configurations of the RVs, but not generate samples
• Contingency constraints on node configurations
Recall: MLE for BNs
• If we assume the parameters for each CPD are globally independent, and all nodes are fully observed, then the log-likelihood function decomposes into a sum of local terms, one per node.
• MLE-based parameter estimation of a GM reduces to local estimation of each GLIM.
MLE for Undirected Graphical Models
• For directed models, the log-likelihood decomposes into a sum of terms, one per family (node plus parents).
• For undirected models, the log-likelihood does not decompose, because the normalization constant Z is a function of all parameters.
• In general, we need to do inference to learn parameters for undirected models, even in the fully observed case.
Log-likelihood for Undirected Graphical Models with tabular clique potentials
• Sufficient statistics: for an MRF (V, E), the number of times that a configuration x is observed in a dataset D can be represented as follows.
• The log-likelihood is:
• In terms of the counts, the log-likelihood is:
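One standard way to write the counts and the log-likelihood is sketched below. The notation is an assumption: $\delta(x, x_n)$ is 1 when sample $n$ has configuration $x$ and 0 otherwise, $N = |D|$, $\psi_c$ are the tabular clique potentials, and $Z$ is the normalization constant.

```latex
\begin{align*}
m(x) &= \sum_{n=1}^{N} \delta(x, x_n), \qquad
m(x_c) = \sum_{x_{V \setminus c}} m(x), \\
\ell(\theta; D) &= \sum_{x} m(x) \log p(x \mid \theta)
  = \sum_{x} m(x) \log \Big( \frac{1}{Z} \prod_{c} \psi_c(x_c) \Big) \\
&= \sum_{c} \sum_{x_c} m(x_c) \log \psi_c(x_c) - N \log Z .
\end{align*}
```

The last line is where the counts $m(x_c)$ act as sufficient statistics: the data enter the log-likelihood only through them.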
Taking the derivative
• Log-likelihood
• First term:
• Second term:
• Derivative of log-likelihood
• Hence we need that:
• This says that:
  • For the maximum likelihood estimates of the parameters, for each clique, the model marginals must be equal to the observed marginals (empirical counts).
• This is only a condition that the parameters should satisfy!
  • It does not tell us how to get the maximum likelihood estimates.
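A sketch of the derivation behind this condition, assuming the tabular log-likelihood $\ell = \sum_c \sum_{x_c} m(x_c) \log \psi_c(x_c) - N \log Z$ with $Z = \sum_x \prod_d \psi_d(x_d)$ (the symbols $\delta$, $x'$ and $d$ are notation introduced here for illustration):

```latex
\begin{align*}
\frac{\partial}{\partial \psi_c(x_c)} \sum_{c'} \sum_{x'_{c'}} m(x'_{c'}) \log \psi_{c'}(x'_{c'})
  &= \frac{m(x_c)}{\psi_c(x_c)}, \\
\frac{\partial \log Z}{\partial \psi_c(x_c)}
  &= \frac{1}{Z} \sum_{x'} \delta(x'_c, x_c)\, \frac{1}{\psi_c(x_c)} \prod_{d} \psi_d(x'_d)
   = \frac{p(x_c)}{\psi_c(x_c)}, \\
\frac{\partial \ell}{\partial \psi_c(x_c)}
  &= \frac{m(x_c)}{\psi_c(x_c)} - N\, \frac{p(x_c)}{\psi_c(x_c)} = 0
  \;\Longrightarrow\;
  p_{MLE}(x_c) = \frac{m(x_c)}{N} = \tilde{p}(x_c).
\end{align*}
```

The model marginal $p(x_c)$ on the right-hand side depends on all potentials through $Z$, which is why this is a condition rather than a closed-form solution.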
MLE for Undirected Graphical Models
• Case 1: The model is decomposable (triangulated graph) and all the clique potentials are defined on maximal cliques.
  • The MLEs of the clique potentials are equal to the empirical marginals (or conditionals) of the corresponding cliques.
• Solve for the MLE by inspection
• Decomposable models
  • G is decomposable ⇔ G is triangulated ⇔ G has a junction tree
• Ex.: Chain X1 – X2 – X3
$$p_{MLE}(X_1, X_2, X_3) = \frac{\tilde{p}(X_1, X_2)\,\tilde{p}(X_2, X_3)}{\tilde{p}(X_2)}$$
$$p_{MLE}(X_1, X_2) = \sum_{X_3} p_{MLE}(X_1, X_2, X_3) = \tilde{p}(X_1 \mid X_2) \sum_{X_3} \tilde{p}(X_2, X_3) = \tilde{p}(X_1, X_2)$$
$$p_{MLE}(X_2, X_3) = \tilde{p}(X_2, X_3)$$
• To compute the clique potentials we just use the empirical marginals (or conditionals), i.e., the separator must be divided into one of its neighbors. Then Z = 1.
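The chain example can be checked numerically. The sketch below assumes three binary variables and a made-up count table (both hypothetical). It sets ψ12 = p̃(x1, x2) and ψ23 = p̃(x3 | x2), i.e. it divides the separator p̃(x2) into one neighbor, and then checks that the product of potentials already sums to 1 and reproduces the empirical clique marginals.

```python
import numpy as np

# Hypothetical counts m(x1, x2, x3) for three binary variables (illustration only).
counts = np.array([[[10, 2], [3, 5]],
                   [[4, 6], [1, 9]]], dtype=float)
p_emp = counts / counts.sum()          # empirical joint, axes = (x1, x2, x3)

p12 = p_emp.sum(axis=2)                # empirical marginal p~(x1, x2)
p23 = p_emp.sum(axis=0)                # empirical marginal p~(x2, x3)
p2 = p_emp.sum(axis=(0, 2))            # empirical separator marginal p~(x2)

# Clique potentials "by inspection": divide the separator into one neighbor.
psi12 = p12                            # psi(x1, x2) = p~(x1, x2)
psi23 = p23 / p2[:, None]              # psi(x2, x3) = p~(x3 | x2)

# Product of potentials; no explicit normalization is applied.
p_mle = psi12[:, :, None] * psi23[None, :, :]

Z = p_mle.sum()
print("Z =", Z)
print("clique marginals match:",
      np.allclose(p_mle.sum(axis=2), p12),
      np.allclose(p_mle.sum(axis=0), p23))
```

Only the clique marginals are matched; the fitted joint itself generally differs from the empirical joint unless X1 and X3 happen to be conditionally independent given X2 in the data.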
MLE for Undirected Graphical Models
• Case 2: The model is non-decomposable and/or the potentials are defined on non-maximal cliques. We cannot equate the MLE of the clique potentials to empirical marginals (or conditionals).
  • Iterative Proportional Fitting (IPF)
  • Generalized Iterative Scaling (GIS)
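Iterative Proportional Fitting (IPF)
• From the log-likelihood:
$$\frac{m(x_c)}{N\,\psi_c(x_c)} = \frac{p(x_c)}{\psi_c(x_c)}$$
• Let's rewrite in a different way, using $\tilde{p}(x_c) = m(x_c)/N$:
$$\frac{\tilde{p}(x_c)}{\psi_c(x_c)} = \frac{p(x_c)}{\psi_c(x_c)}$$
• The clique potentials implicitly appear in the model marginal: $p(x_c) = f(\psi_c(x_c))$
• Let's forget a closed-form solution and focus on a fixed-point iteration method:
$$\frac{\tilde{p}(x_c)}{\psi_c^{(t+1)}(x_c)} = \frac{p^{(t)}(x_c)}{\psi_c^{(t)}(x_c)} \quad \text{or} \quad \psi_c^{(t+1)}(x_c) = \psi_c^{(t)}(x_c)\,\frac{\tilde{p}(x_c)}{p^{(t)}(x_c)}$$
• Need to run inference for $p^{(t)}(x_c)$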
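A minimal sketch of the IPF update, assuming a small discrete model where inference can be done by brute-force enumeration. The 4-cycle graph, the random empirical distribution, and the helper names `joint`, `marginal` and `ipf` are hypothetical choices for illustration; a real implementation would replace the enumeration with a proper inference routine.

```python
import numpy as np

def joint(potentials, n_vars, card):
    """Normalized joint p(x) from {clique: table}, by brute-force enumeration."""
    p = np.ones((card,) * n_vars)
    for clique, table in potentials.items():
        # Reshape so the table broadcasts along the axes of its clique variables.
        shape = [card if v in clique else 1 for v in range(n_vars)]
        p = p * table.reshape(shape)
    return p / p.sum()

def marginal(p, clique):
    """Marginal of the joint array p over the (sorted) variables in clique."""
    others = tuple(v for v in range(p.ndim) if v not in clique)
    return p.sum(axis=others)

def ipf(p_emp, cliques, n_sweeps=200):
    """Run IPF sweeps; return the potentials and the scaled log-likelihood per sweep."""
    n_vars, card = p_emp.ndim, p_emp.shape[0]
    potentials = {c: np.ones((card,) * len(c)) for c in cliques}
    trace = []
    for _ in range(n_sweeps):
        for c in cliques:
            p_model = joint(potentials, n_vars, card)   # inference step
            # psi_c <- psi_c * (empirical marginal / model marginal)
            potentials[c] = potentials[c] * marginal(p_emp, c) / marginal(p_model, c)
        p_model = joint(potentials, n_vars, card)
        trace.append(float((p_emp * np.log(p_model)).sum()))
    return potentials, trace

# Hypothetical setup: binary 4-cycle X0 - X1 - X2 - X3 - X0 with edge potentials.
cliques = [(0, 1), (1, 2), (2, 3), (0, 3)]
rng = np.random.default_rng(0)
p_emp = rng.random((2, 2, 2, 2))
p_emp /= p_emp.sum()

potentials, trace = ipf(p_emp, cliques)
p_fit = joint(potentials, 4, 2)
max_gap = max(np.abs(marginal(p_fit, c) - marginal(p_emp, c)).max() for c in cliques)
print("largest marginal mismatch:", max_gap)
```

Each clique update needs the current model marginal, so every inner step calls `joint`; that is the inference cost the slide refers to. The `trace` list records the scaled log-likelihood after each sweep so its behavior over iterations can be inspected.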
Properties of IPF Updates
• Set of fixed-point equations:
$$\psi_c^{(t+1)}(x_c) = \psi_c^{(t)}(x_c)\,\frac{\tilde{p}(x_c)}{p^{(t)}(x_c)}$$
• We can show that it is also a coordinate ascent algorithm (coordinates = parameters of clique potentials).
• At each step, it will increase the log-likelihood, and it will converge to a global maximum.
• Maximizing the log-likelihood is equivalent to minimizing the KL divergence (cross entropy).
• The max-entropy principle for parameterization offers a dual perspective on the MLE.
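The equivalence with KL minimization can be made explicit. Assuming $\tilde{p}(x) = m(x)/N$ denotes the empirical distribution and $H(\tilde{p})$ its entropy (notation introduced here), a short derivation is:

```latex
\begin{align*}
\frac{1}{N}\,\ell(\theta; D)
  &= \sum_{x} \tilde{p}(x) \log p(x \mid \theta) \\
  &= \sum_{x} \tilde{p}(x) \log \tilde{p}(x)
     - \sum_{x} \tilde{p}(x) \log \frac{\tilde{p}(x)}{p(x \mid \theta)} \\
  &= -H(\tilde{p}) - \mathrm{KL}\big( \tilde{p} \,\|\, p(\cdot \mid \theta) \big).
\end{align*}
```

Since $H(\tilde{p})$ does not depend on $\theta$, maximizing the log-likelihood and minimizing $\mathrm{KL}(\tilde{p} \,\|\, p(\cdot \mid \theta))$ select the same parameters.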
MLE for undirected graphical models
• What have we seen so far?
  • Decomposable graphs
    • Clique potentials correspond to marginals or conditionals
  • Clique potentials that correspond to full tables
    • Iterative Proportional Fitting:
$$\psi_c^{(t+1)}(x_c) = \psi_c^{(t)}(x_c)\,\frac{\tilde{p}(x_c)}{p^{(t)}(x_c)}$$
• What about models that are parameterized more compactly?
Feature-parameterized clique potentials
• So far we saw the most general form of an undirected graphical model: cliques are parameterized by general tabular potential functions.
• For large cliques these potentials are exponentially costly for inference. Also, we have exponentially many parameters to learn from limited data.
• Solution: Change the graphical model to make cliques smaller.
  • This changes the dependencies and may force us to make more independence assumptions than what we had.
• Solution: Keep the same graphical model but use fewer parameters to define the clique potentials.
  • Recall parameter sharing for BNs.
• This is the idea behind feature-based models.
Features
• Let a clique correspond to three consecutive characters
• How would you define p(c1, c2, c3)?
  • For all possible character combinations you need $26^3 - 1$ parameters.
  • But there are sequences that are unlikely: kfd
• A "feature" is a function that is non-zero for a few particular inputs. Think of Boolean features.
  • Is "ing" the input sequence? Then 1, otherwise 0.
• We can define features for continuous inputs as well.
Features as potentials
• Each feature function can be converted to a potential by exponentiating it. We can multiply these potentials together to get a clique potential.
• Example:
• There is still an exponential number of settings, but we only use K parameters corresponding to the K features.
• Can we recover the tabular representation?
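A small sketch of this construction for the three-character clique. The two features (`f_ing`, `f_ed`), their weights, and the lowercase 26-letter alphabet are hypothetical choices for illustration. The potential is the product of exp(θ_k f_k), i.e. exp of the weighted feature sum, and the tabular form is recovered by evaluating it on every setting.

```python
import itertools
import math
import string

ALPHABET = string.ascii_lowercase  # 26 letters

# Hypothetical Boolean features on a clique of three consecutive characters.
def f_ing(c1, c2, c3):
    return 1.0 if (c1, c2, c3) == ("i", "n", "g") else 0.0

def f_ed(c1, c2, c3):
    return 1.0 if (c2, c3) == ("e", "d") else 0.0   # any first character

features = [f_ing, f_ed]
theta = [2.0, 1.5]   # made-up weights, one per feature (K = 2)

def potential(c1, c2, c3):
    """Clique potential: product of exp(theta_k * f_k) = exp(sum_k theta_k * f_k)."""
    return math.exp(sum(t * f(c1, c2, c3) for t, f in zip(theta, features)))

# Recover the tabular representation by evaluating every setting of the clique.
table = {chars: potential(*chars) for chars in itertools.product(ALPHABET, repeat=3)}
print(len(table), table[("i", "n", "g")], table[("k", "f", "d")])
```

The table has 26^3 entries but only K = 2 free parameters: every setting that triggers no feature shares the same potential value of 1.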
Combining Features
• Each feature has a weight $\theta_k$, which represents the numerical strength of the feature and whether it increases or decreases the probability of a clique.
• The marginal over the clique is a generalized exponential family distribution (a generalized linear model).
• The features may be overlapping across cliques.
Feature-based model
• Joint distribution:
• We can use the simplified form
• The features correspond to the sufficient statistics of our model.
• We need to learn parameters $\theta_k$
• What about IPF?
$$\psi_c^{(t+1)}(x_c) = \psi_c^{(t)}(x_c)\,\frac{\tilde{p}(x_c)}{p^{(t)}(x_c)}$$
  • Not clear how to use this rule to update the parameters and potentials
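One common way to write the joint and its simplified form is sketched below. It assumes each clique potential is $\psi_c(x_c) = \exp\{\sum_k \theta_k f_k(x_c)\}$, and that $i$ runs over all (clique, feature) pairs with $c_i$ the clique on which feature $i$ is defined; this indexing convention is an assumption made here.

```latex
\begin{align*}
p(x) &= \frac{1}{Z(\theta)} \prod_{c} \psi_c(x_c)
      = \frac{1}{Z(\theta)} \exp\Big\{ \sum_{c} \sum_{k} \theta_k f_k(x_c) \Big\}, \\
p(x) &= \frac{1}{Z(\theta)} \exp\Big\{ \sum_{i} \theta_i f_i(x_{c_i}) \Big\}, \qquad
Z(\theta) = \sum_{x} \exp\Big\{ \sum_{i} \theta_i f_i(x_{c_i}) \Big\}.
\end{align*}
```

In this form a single weight $\theta_i$ can affect many entries of a clique table at once, which is one way to see why the entry-by-entry IPF rule does not translate directly into an update for $\theta$.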
MLE of Feature-based Undirected Graphical Models
• Objective: scaled likelihood function
• Main difficulties: the partition function is a complex function of the parameters. If we take a derivative, Z appears in the denominator. Nothing changes. We want to avoid computing Z.
• Approximation time…
• We replace log Z by its upper bound $\log Z(\theta) \le \mu Z(\theta) - \log \mu - 1$, where $\mu = Z^{-1}(\theta^{(t)})$
• Thus we have
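A sketch of this step, assuming the feature-based form $p(x \mid \theta) = \exp\{\sum_i \theta_i f_i(x)\} / Z(\theta)$ and writing $\tilde{p}(x)$ for the empirical distribution (notation assumed here). The bound follows from $\log u \le u - 1$ applied to $u = \mu Z(\theta)$, and it holds with equality at $\theta = \theta^{(t)}$.

```latex
\begin{align*}
\tilde{\ell}(\theta; D) &= \frac{1}{N} \sum_{n} \log p(x_n \mid \theta)
  = \sum_{x} \tilde{p}(x) \sum_{i} \theta_i f_i(x) - \log Z(\theta), \\
\log\big( \mu Z(\theta) \big) &\le \mu Z(\theta) - 1
  \;\Longrightarrow\;
  \log Z(\theta) \le \mu Z(\theta) - \log \mu - 1, \\
\tilde{\ell}(\theta; D) &\ge \sum_{x} \tilde{p}(x) \sum_{i} \theta_i f_i(x)
  - \frac{Z(\theta)}{Z(\theta^{(t)})} - \log Z(\theta^{(t)}) + 1 .
\end{align*}
```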
MLE of Feature-based Undirected Graphical Models
• We have
• We define
• We assume $f_i(x) \ge 0$ and $\sum_i f_i(x) = 1$. Also, by convexity of exp:
$$\exp\Big( \sum_i \pi_i x_i \Big) \le \sum_i \pi_i \exp(x_i)$$
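A sketch of how the convexity inequality is typically used here, assuming the definition $\Delta\theta_i^{(t)} = \theta_i - \theta_i^{(t)}$ and the lower bound obtained by replacing $\log Z(\theta)$ with $\mu Z(\theta) - \log \mu - 1$ at $\mu = Z^{-1}(\theta^{(t)})$. The weights $\pi_i$ in the inequality are played by $f_i(x)$, which is why the features are assumed non-negative and summing to one.

```latex
\begin{align*}
\frac{Z(\theta)}{Z(\theta^{(t)})}
  &= \frac{1}{Z(\theta^{(t)})} \sum_{x} \exp\Big( \sum_i \theta_i^{(t)} f_i(x) \Big)
     \exp\Big( \sum_i \Delta\theta_i^{(t)} f_i(x) \Big) \\
  &= \sum_{x} p(x \mid \theta^{(t)}) \exp\Big( \sum_i f_i(x)\, \Delta\theta_i^{(t)} \Big)
   \le \sum_{x} p(x \mid \theta^{(t)}) \sum_i f_i(x) \exp\big( \Delta\theta_i^{(t)} \big), \\
\tilde{\ell}(\theta; D)
  &\ge \sum_i \theta_i \sum_{x} \tilde{p}(x) f_i(x)
   - \sum_{x} p(x \mid \theta^{(t)}) \sum_i f_i(x) \exp\big( \Delta\theta_i^{(t)} \big)
   - \log Z(\theta^{(t)}) + 1 .
\end{align*}
```

The resulting bound decouples across the $\theta_i$, so each coordinate can be optimized in closed form.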
MLE of Feature-based Undirected Graphical Models
• We have
• We take the derivative
  • $p^{(t)}(x)$ is the unnormalized version of $p(x \mid \theta^{(t)})$
• Our updates are:
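A minimal sketch of a GIS-style update under the assumptions f_i(x) ≥ 0 and Σ_i f_i(x) = 1. It uses the normalized model distribution, computed by brute-force enumeration, so the update takes the form θ_i ← θ_i + log(E_p̃[f_i] / E_model[f_i]). The three binary variables, the raw indicator features, the slack feature that makes the features sum to one, and the random empirical distribution are all hypothetical choices for illustration.

```python
import itertools
import numpy as np

# All joint states of three binary variables (hypothetical toy model).
states = np.array(list(itertools.product([0, 1], repeat=3)))

# Hypothetical raw indicator features: x1 == x2, x2 == x3, x1 == 1.
raw = np.stack([
    states[:, 0] == states[:, 1],
    states[:, 1] == states[:, 2],
    states[:, 0] == 1,
], axis=1).astype(float)

# Scale by the number of raw features and add a slack feature so that
# every row is non-negative and sums to 1.
C = raw.shape[1]
F = np.hstack([raw / C, 1.0 - raw.sum(axis=1, keepdims=True) / C])

def model_dist(theta):
    """p(x | theta) proportional to exp(sum_i theta_i f_i(x)), by enumeration."""
    logits = F @ theta
    w = np.exp(logits - logits.max())   # subtract max for numerical stability
    return w / w.sum()

def gis(p_emp, n_iters=2000):
    """GIS-style iterations: theta_i += log(empirical / model expectation of f_i)."""
    theta = np.zeros(F.shape[1])
    target = p_emp @ F                  # empirical feature expectations
    for _ in range(n_iters):
        expected = model_dist(theta) @ F   # inference step
        theta = theta + np.log(target / expected)
    return theta

rng = np.random.default_rng(0)
p_emp = rng.random(len(states))
p_emp /= p_emp.sum()

theta = gis(p_emp)
gap = np.abs(model_dist(theta) @ F - p_emp @ F).max()
print("largest feature-expectation mismatch:", gap)
```

At convergence the model's feature expectations match the empirical ones, which is the feature-based counterpart of matching clique marginals in the tabular case.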
Summary
• Iterative Proportional Fitting (IPF) is a general algorithm for MLE of UGMs
  • A fixed-point equation for potentials over single cliques; uses coordinate ascent
  • Requires the potential to be fully parameterized
  • The clique described by the potentials does not have to be a max-clique
  • For a fully decomposable model, reduces to a single-step iteration
• Generalized Iterative Scaling (GIS)
  • Iterative scaling on a general UGM with feature-based potentials
  • IPF is a special case of GIS where the clique potential is built on features defined as indicator functions of the clique configurations.