lecture 3: directed graphical models and latent variables
TRANSCRIPT
CSC412/2506 Winter 2018 Probabilistic Learning and Reasoning
Lecture 3: Directed Graphical Models and Latent Variables
Based on slides by Richard Zemel
Learning outcomes
• What aspects of a model can we express using graphical notation?
• Which aspects are not captured in this way?
• How do independencies change as a result of conditioning?
• Reasons for using latent variables
• Common motifs such as mixtures and chains
• How to integrate out unobserved variables
Joint Probabilities
• Chain rule implies that any joint distribution equals

p(x_{1:D}) = p(x_1) p(x_2|x_1) p(x_3|x_1, x_2) p(x_4|x_1, x_2, x_3) ... p(x_D|x_{1:D−1})

• Directed graphical model implies a restricted factorization
Conditional Independence
• Notation: x_A ⊥ x_B | x_C
• Definition: two (sets of) variables x_A and x_B are conditionally independent given a third x_C if:

P(x_A, x_B | x_C) = P(x_A | x_C) P(x_B | x_C)  ∀ x_C

which is equivalent to saying

P(x_A | x_B, x_C) = P(x_A | x_C)  ∀ x_C

• Only a subset of all distributions respect any given (nontrivial) conditional independence statement. The subset of distributions that respect all the CI assumptions we make is the family of distributions consistent with our assumptions.
• Probabilistic graphical models are a powerful, elegant and simple way to specify such a family.
Directed Graphical Models
• Consider directed acyclic graphs over n variables.
• Each node has a (possibly empty) set of parents π_i
• We can then write

P(x_1, ..., x_N) = ∏_i P(x_i | x_{π_i})

• Hence we factorize the joint in terms of local conditional probabilities
• Exponential in the "fan-in" of each node instead of in N (see the sketch below)
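To make the saving concrete, here is a minimal sketch (the three-node chain and all table values are invented, not from the lecture) of evaluating a joint as a product of local conditional probability tables:

```python
# A minimal sketch (nodes and table values invented): the chain x1 -> x2 -> x3
# over binary variables, stored as local conditional probability tables (CPTs).
import numpy as np

p_x1 = np.array([0.6, 0.4])            # p(x1)
p_x2_given_x1 = np.array([[0.7, 0.3],  # row a: p(x2 | x1=a)
                          [0.2, 0.8]])
p_x3_given_x2 = np.array([[0.9, 0.1],  # row b: p(x3 | x2=b)
                          [0.5, 0.5]])

def joint(x1, x2, x3):
    """p(x1, x2, x3) = p(x1) p(x2|x1) p(x3|x2): a product of local factors."""
    return p_x1[x1] * p_x2_given_x1[x1, x2] * p_x3_given_x2[x2, x3]

# The factorization needs 1 + 2 + 2 = 5 free parameters instead of 2^3 - 1 = 7
# for the full table, and the gap grows exponentially with the number of nodes
# as long as fan-in stays small. Sanity check: the joint sums to one.
print(sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)))  # 1.0
```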
Conditional Independence in DAGs
• If we order the nodes in a directed graphical model so that parents always come before their children in the ordering, then the graphical model implies the following about the distribution:

{x_i ⊥ x_{π̃_i} | x_{π_i}}  ∀ i

where x_{π̃_i} are the nodes coming before x_i that are not its parents
• In other words, the DAG is telling us that each variable is conditionally independent of its non-descendants given its parents.
• Such an ordering is called a "topological" ordering
Missing Edges
• Key point about directed graphical models: missing edges imply conditional independence
• Remember that by the chain rule we can always write the full joint as a product of conditionals, given an ordering:

P(x_1, x_2, ...) = P(x_1) P(x_2|x_1) P(x_3|x_2, x_1) P(x_4|x_3, x_2, x_1) ...

• If the joint is represented by a DAGM, then some of the conditioned variables on the right-hand sides are missing.
• This is equivalent to enforcing conditional independence.
• Start with the "idiot's graph": each node has all previous nodes in the ordering as its parents.
• Now remove edges to get your DAG.
• Removing an edge into node i eliminates an argument from the conditional probability factor P(x_i | x_1, x_2, ..., x_{i−1})
D-Separation
• D-separation, or directed-separation, is a notion of connectedness in DAGMs in which two (sets of) variables may or may not be connected conditioned on a third (set of) variable.
• D-connection implies conditional dependence and d-separation implies conditional independence.
• In particular, we say that x_A ⊥ x_B | x_C if every variable in A is d-separated from every variable in B conditioned on all the variables in C.
• To check if an independence is true, we can cycle through each node in A, do a depth-first search to reach every node in B, and examine the path between them. If all of the paths are d-separated, then we can assert x_A ⊥ x_B | x_C.
• Thus, it will be sufficient to consider triples of nodes. (Why?)
• Pictorially, when we condition on a node, we shade it in.
Chain
• Q: When we condition on y, are x and z independent?

P(x, y, z) = P(x) P(y|x) P(z|y)

which implies

P(z|x, y) = P(x, y, z) / P(x, y) = P(x) P(y|x) P(z|y) / [P(x) P(y|x)] = P(z|y)

and therefore x ⊥ z | y
• Think of x as the past, y as the present and z as the future.
Common Cause
• Q: When we condition on y, are x and z independent?

P(x, y, z) = P(y) P(x|y) P(z|y)

which implies

P(x, z|y) = P(x, y, z) / P(y) = P(y) P(x|y) P(z|y) / P(y) = P(x|y) P(z|y)

and therefore x ⊥ z | y
Explaining Away
• Q: When we condition on y, are x and z independent?

P(x, y, z) = P(x) P(z) P(y|x, z)

• x and z are marginally independent, but given y they are conditionally dependent.
• This important effect is called explaining away (Berkson's paradox).
• For example, flip two coins independently; let x = coin 1, z = coin 2.
• Let y = 1 if the coins come up the same and y = 0 if different.
• x and z are independent, but if I tell you y, they become coupled! (See the sketch below.)
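The coin example can be checked by brute-force enumeration. A small sketch (the helper `prob` is ours, not from the lecture):

```python
import itertools

# Enumerate the coin example: x and z are fair independent coins,
# and y = 1 exactly when they match, so p(x, y, z) = p(x) p(z) p(y|x, z).
table = {}
for x, z in itertools.product([0, 1], repeat=2):
    y = int(x == z)
    table[(x, y, z)] = 0.25  # p(x) p(z) = 0.5 * 0.5; p(y|x,z) is deterministic

def prob(pred):
    """Probability of the event defined by the predicate pred(x, y, z)."""
    return sum(p for xyz, p in table.items() if pred(*xyz))

# Marginally, x and z are independent: p(x=1, z=1) = p(x=1) p(z=1).
print(prob(lambda x, y, z: x == 1 and z == 1))                      # 0.25
print(prob(lambda x, y, z: x == 1) * prob(lambda x, y, z: z == 1))  # 0.25

# But conditioned on y = 1 they become coupled:
p_x1_given_y1 = prob(lambda x, y, z: x == 1 and y == 1) / prob(lambda x, y, z: y == 1)
p_x1_given_y1_z1 = (prob(lambda x, y, z: x == 1 and y == 1 and z == 1)
                    / prob(lambda x, y, z: y == 1 and z == 1))
print(p_x1_given_y1, p_x1_given_y1_z1)  # 0.5 vs 1.0: knowing z changes x
```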
Bayes-Ball Algorithm
• To check if x_A ⊥ x_B | x_C we need to check if every variable in A is d-separated from every variable in B conditioned on all variables in C.
• In other words, given that all the nodes in x_C are clamped, when we wiggle nodes x_A can we change any of the nodes in x_B?
• The Bayes-Ball Algorithm is such a d-separation test. We shade all nodes x_C, place balls at each node in x_A (or x_B), let them bounce around according to some rules, and then ask if any of the balls reach any of the nodes in x_B (or x_A).
Bayes-Ball Boundary Rules
• We also need the boundary conditions:
• Here's a trick for the explaining away case: if y or any of its descendants is shaded, the ball passes through.
• Notice balls can travel opposite to edge directions.
Plates & Parameters
• Since Bayesian methods treat parameters as random variables, we would like to include them in the graphical model.
• One way to do this is to repeat all the iid observations explicitly and show the parameter only once.
• A better way is to use plates, in which repeated quantities that are iid are put in a box.
Plates: Macros for Repeated Structures
• Plates are like "macros" that allow you to draw a very complicated graphical model with a simpler notation.
• The rules of plates are simple: repeat every structure in a box a number of times given by the integer in the corner of the box (e.g. N), updating the plate index variable (e.g. n) as you go.
• Duplicate every arrow going into the plate and every arrow leaving the plate by connecting the arrows to each copy of the structure.
Nested/Intersecting Plates
• Plates can be nested, in which case their arrows get duplicated also, according to the rule: draw an arrow from every copy of the source node to every copy of the destination node.
• Plates can also cross (intersect), in which case the nodes at the intersection have multiple indices and get duplicated a number of times equal to the product of the duplication numbers on all the plates containing them.
Unobserved Variables
• Certain variables Q in our models may be unobserved, either some of the time or always, either at training time or at test time.
• Graphically, we will use shading to indicate observation.
Partially Unobserved (Missing) Variables
• If variables are occasionally unobserved they are missing data, e.g., undefined inputs, missing class labels, erroneous target values.
• In this case, we can still model the joint distribution, but we define a new cost function in which we sum out or marginalize the missing values at training or test time:

ℓ(θ; D) = ∑_complete log p(x_c, y_c|θ) + ∑_missing log p(x_m|θ)
        = ∑_complete log p(x_c, y_c|θ) + ∑_missing log ∑_y p(x_m, y|θ)

Recall that p(x) = ∑_q p(x, q). (A sketch of this cost is given below.)
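As a hedged illustration of this cost, here is a sketch for a hypothetical two-class model with unit-variance Gaussian class-conditionals; the model, function names, and data values are all invented for illustration:

```python
import numpy as np

def log_lik_with_missing(theta, xs_complete, ys_complete, xs_missing):
    """Hypothetical sketch: log-likelihood when some class labels y are missing.
    theta = (prior over K classes, per-class means); unit variance assumed.
    Missing labels are summed out: log p(x) = log sum_y p(x, y)."""
    prior, means = theta
    def log_joint(x, y):  # log p(x, y | theta) = log p(y) + log N(x | mu_y, 1)
        return np.log(prior[y]) - 0.5 * (x - means[y]) ** 2 - 0.5 * np.log(2 * np.pi)
    ll = sum(log_joint(x, y) for x, y in zip(xs_complete, ys_complete))
    for x in xs_missing:  # marginalize the unobserved label, stably
        ll += np.logaddexp.reduce([log_joint(x, y) for y in range(len(prior))])
    return ll

theta = (np.array([0.5, 0.5]), np.array([-1.0, 1.0]))
print(log_lik_with_missing(theta, [-1.2, 0.9], [0, 1], [0.1]))
```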
Latent Variables
• What to do when a variable z is always unobserved? Depends on where it appears in our model. If we never condition on it when computing the probability of the variables we do observe, then we can just forget about it and integrate it out. E.g., given y, x fit the model p(z, y|x) = p(z|y) p(y|x, w) p(w). (In other words, if it is a leaf node.)
• But if z is conditioned on, we need to model it:
e.g. given y, x fit the model p(y|x) = ∑_z p(y|x, z) p(z)
Where Do Latent Variables Come From?
• Latent variables may appear naturally, from the structure of the problem, because something wasn't measured, because of faulty sensors, occlusion, privacy, etc.
• But also, we may want to intentionally introduce latent variables to model complex dependencies between variables without looking at the dependencies between them directly. This can actually simplify the model (e.g., mixtures).
Latent Variable Models & Regression
• You can think of clustering as the problem of classification with missing class labels.
• You can think of factor models (such as factor analysis, PCA, ICA, etc.) as linear or nonlinear regression with missing inputs.
Why is Learning Harder?
• In fully observed iid settings, the probability model is a product, thus the log likelihood is a sum where terms decouple. (At least for directed models.)

ℓ(θ; D) = log p(x, z|θ) = log p(z|θ_z) + log p(x|z, θ_x)

• With latent variables, the probability already contains a sum, so the log likelihood has all parameters coupled together via z (just as with the partition function in undirected models):

ℓ(θ; D) = log ∑_z p(x, z|θ) = log ∑_z p(z|θ_z) p(x|z, θ_x)
Why is Learning Harder?
• Likelihood couples parameters:

ℓ(θ; D) = log ∑_z p(z|θ_z) p(x|z, θ_x)

• We can treat this as a black box probability function and just try to optimize the likelihood as a function of θ (e.g. gradient descent). However, sometimes taking advantage of the latent variable structure can make parameter estimation easier.
• Good news: soon we will see how to deal with latent variables.
• Basic trick: put a tractable distribution on the values you don't know. Basic math: use convexity to lower bound the likelihood.
Mixture Models
• Most basic latent variable model with a single discrete node z.
• Allows different submodels (experts) to contribute to the (conditional) density model in different parts of the space.
• Divide & conquer idea: use simple parts to build complex models (e.g., multimodal densities, or piecewise-linear regressions).
Mixture Densities
• Exactly like a classification model but the class is unobserved and so we sum it out. What we get is a perfectly valid density:

p(x|θ) = ∑_{k=1}^K p(z = k|θ_z) p(x|z = k, θ_k) = ∑_k α_k p_k(x|θ_k)

where the "mixing proportions" add to one: ∑_k α_k = 1.
• We can use Bayes' rule to compute the posterior probability of the mixture component given some data:

p(z = k|x, θ) = α_k p_k(x|θ_k) / ∑_j α_j p_j(x|θ_j)

These quantities are called responsibilities.
Example: Gaussian Mixture Models
• Consider a mixture of K Gaussian components:

p(x|θ) = ∑_k α_k N(x|μ_k, Σ_k)

p(z = k|x, θ) = α_k N(x|μ_k, Σ_k) / ∑_j α_j N(x|μ_j, Σ_j)

ℓ(θ; D) = ∑_n log ∑_k α_k N(x^(n)|μ_k, Σ_k)

• Density model: p(x|θ) is a familiarity signal. Clustering: p(z|x, θ) is the assignment rule, −ℓ(θ) is the cost. (A sketch of these quantities in code follows.)
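A minimal sketch of these three quantities, with made-up parameters and SciPy's `multivariate_normal` for the Gaussian densities:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Invented 2-component, 2-D GMM parameters for illustration.
alphas = np.array([0.3, 0.7])                 # mixing proportions, sum to 1
mus = [np.zeros(2), np.array([3.0, 3.0])]     # component means
Sigmas = [np.eye(2), 2.0 * np.eye(2)]         # component covariances

def responsibilities(x):
    """p(z=k | x, theta) = alpha_k N(x|mu_k,Sigma_k) / sum_j alpha_j N(x|mu_j,Sigma_j)"""
    dens = np.array([a * multivariate_normal.pdf(x, m, S)
                     for a, m, S in zip(alphas, mus, Sigmas)])
    return dens / dens.sum()

def log_lik(X):
    """l(theta; D) = sum_n log sum_k alpha_k N(x_n | mu_k, Sigma_k)"""
    return sum(np.log(sum(a * multivariate_normal.pdf(x, m, S)
                          for a, m, S in zip(alphas, mus, Sigmas))) for x in X)

X = np.array([[0.1, -0.2], [2.9, 3.2]])
print(responsibilities(X[0]), log_lik(X))  # assignment rule and (minus) the cost
```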
Example: Mixtures of Experts
• Also called conditional mixtures. Exactly like a class-conditional model but the class is unobserved and so we sum it out again:

p(y|x, θ) = ∑_{k=1}^K p(z = k|x, θ_z) p(y|z = k, x, θ_k) = ∑_k α_k(x|θ_z) p_k(y|x, θ_k)

where ∑_k α_k(x) = 1 ∀x.
• Harder: must learn α(x) (unless we chose z independent of x).
• We can still use Bayes' rule to compute the posterior probability of the mixture component given some data:

p(z = k|x, y, θ) = α_k(x) p_k(y|x, θ_k) / ∑_j α_j(x) p_j(y|x, θ_j)

This function is often called the gating function.
Example: Mixtures of Linear Regression Experts
• Each expert generates data according to a linear function of the input plus additive Gaussian noise:

p(y|x, θ) = ∑_k α_k N(y|β_k^T x, σ_k^2)

• The "gate" function can be a softmax classification machine:

α_k(x) = p(z = k|x) = e^{η_k^T x} / ∑_j e^{η_j^T x}

• Remember: we are not modeling the density of the inputs x. (A sketch follows.)
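A small sketch of this conditional mixture with invented parameters (`etas`, `betas`, `sigmas` are ours, not from the lecture): the gate is a softmax in x and each expert is a linear-Gaussian regressor.

```python
import numpy as np

# Two experts on 1-D inputs; all parameter values invented for illustration.
etas = np.array([[2.0], [-2.0]])    # gating weights eta_k, one row per expert
betas = np.array([[1.5], [-0.5]])   # regression weights beta_k
sigmas = np.array([0.3, 0.3])       # expert noise standard deviations

def gate(x):
    """alpha_k(x) = softmax(eta_k^T x), computed stably."""
    logits = etas @ x
    e = np.exp(logits - logits.max())
    return e / e.sum()

def predictive_density(y, x):
    """p(y | x, theta) = sum_k alpha_k(x) N(y | beta_k^T x, sigma_k^2)"""
    a = gate(x)
    normals = np.exp(-0.5 * ((y - betas @ x) / sigmas) ** 2) / (sigmas * np.sqrt(2 * np.pi))
    return float(a @ normals)

x = np.array([1.0])
print(predictive_density(1.4, x))   # expert 1 dominates the gate for positive x
```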
Gradient Learning with Mixtures
• We can learn mixture densities using gradient descent on the likelihood as usual. The gradients are quite interesting:

ℓ(θ) = log p(x|θ) = log ∑_k α_k p_k(x|θ_k)

∂ℓ/∂θ = (1/p(x|θ)) ∑_k α_k ∂p_k(x|θ_k)/∂θ
      = ∑_k α_k (1/p(x|θ)) p_k(x|θ_k) ∂ log p_k(x|θ_k)/∂θ
      = ∑_k [α_k p_k(x|θ_k) / p(x|θ)] ∂ℓ_k/∂θ_k
      = ∑_k r_k ∂ℓ_k/∂θ_k

• In other words, the gradient is the responsibility weighted sum of the individual log likelihood gradients, with responsibilities r_k = p(z = k|x, θ) as defined earlier (see the numerical check below).
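A quick numerical check of this identity for a 1-D, two-component Gaussian mixture with fixed unit variances (all values invented): the analytic responsibility-weighted gradient with respect to the means should match finite differences.

```python
import numpy as np

# Invented mixture: two 1-D unit-variance Gaussians, gradient w.r.t. the means.
alphas = np.array([0.4, 0.6])
mus = np.array([-1.0, 2.0])
x = 0.5

def comp_dens(mu):
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

def log_lik(mus):
    return np.log(np.sum(alphas * comp_dens(mus)))

# Analytic: dl/dmu_k = r_k * d log p_k / d mu_k, with responsibilities r_k
# and d log N(x|mu,1)/dmu = (x - mu).
dens = alphas * comp_dens(mus)
r = dens / dens.sum()
grad_analytic = r * (x - mus)

# Finite-difference check:
eps = 1e-6
grad_numeric = np.array([
    (log_lik(mus + eps * np.eye(2)[k]) - log_lik(mus)) / eps for k in range(2)])
print(grad_analytic, grad_numeric)  # the two should agree closely
```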
Parameter Constraints
• If we want to use general optimizations (e.g., conjugate gradient) to learn latent variable models, we often have to make sure parameters respect certain constraints (e.g., ∑_k α_k = 1, each covariance Σ_k positive definite).
• A good trick is to re-parameterize these quantities in terms of unconstrained values. For mixing proportions, use the softmax:

α_k = exp(q_k) / ∑_j exp(q_j)

• For covariance matrices, use the Cholesky decomposition:

Σ^{−1} = A^T A,  |Σ|^{−1/2} = ∏_i A_ii

where A is upper triangular with positive diagonal:

A_ii = exp(r_i) > 0,  A_ij = a_ij (j > i),  A_ij = 0 (j < i)
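A sketch of both re-parameterizations (the function names are ours):

```python
import numpy as np

def mixing_proportions(q):
    """Unconstrained q -> valid mixing proportions via a stabilized softmax."""
    e = np.exp(q - q.max())
    return e / e.sum()

def precision_from_unconstrained(r, a):
    """Unconstrained (r, a) -> positive definite precision Sigma^{-1} = A^T A,
    with A upper triangular, A_ii = exp(r_i) > 0 and A_ij = a_ij for j > i."""
    D = len(r)
    A = np.zeros((D, D))
    A[np.diag_indices(D)] = np.exp(r)
    A[np.triu_indices(D, k=1)] = a
    return A.T @ A  # positive definite by construction (A is invertible)

q = np.array([0.3, -1.2, 2.0])
print(mixing_proportions(q).sum())   # -> 1.0, a valid set of alphas

prec = precision_from_unconstrained(np.array([0.1, -0.3]), np.array([0.5]))
print(np.linalg.eigvalsh(prec) > 0)  # all True: positive definite
```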
Logsumexp
• Often you can easily compute b_k = log p(x|z = k, θ_k), but it will be very negative, say −10^6 or smaller.
• Now, to compute ℓ = log p(x|θ) you need to compute log ∑_k e^{b_k} (e.g., for calculating responsibilities at test time or for learning).
• Careful! Do not compute this by doing log(sum(exp(b))). You will get underflow and an incorrect answer.
• Instead do this:
– Add a constant exponent B to all the values b_k such that the largest value equals zero: B = max(b).
– Compute log(sum(exp(b − B))) + B.
• Example: if log p(x|z = 1) = −120 and log p(x|z = 2) = −120, what is log p(x) = log[p(x|z = 1) + p(x|z = 2)]? Answer: log[2e^{−120}] = −120 + log 2.
• Rule of thumb: never use log or exp by itself.
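A direct implementation of the recipe above (using a larger magnitude than the slide's worked example so the naive version actually underflows in float64):

```python
import numpy as np

def logsumexp(b):
    """log(sum(exp(b))) computed stably: shift by B = max(b), then add B back."""
    B = np.max(b)
    return B + np.log(np.sum(np.exp(b - B)))

b = np.array([-1000.0, -1000.0])   # exp(-1000) underflows to 0.0 in float64
print(np.log(np.sum(np.exp(b))))   # naive: -inf (underflow, incorrect)
print(logsumexp(b))                # stable: -1000 + log 2 ~= -999.3069
```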
Hidden Markov Models (HMMs)
• A very popular form of latent variable model
• Z_t → Hidden states taking one of K discrete values
• X_t → Observations taking values in any space

Example: discrete, M observation symbols, B ∈ ℝ^{K×M}:

p(x_t = j|z_t = k) = B_kj
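As a hedged sketch, here is ancestral sampling from a small discrete HMM; the transition matrix A, emission matrix B, and initial distribution pi are invented for illustration:

```python
import numpy as np

# Invented HMM with K = 2 hidden states and M = 3 observation symbols.
rng = np.random.default_rng(0)
pi = np.array([0.5, 0.5])        # p(z_1)
A = np.array([[0.9, 0.1],        # A[k, l] = p(z_{t+1} = l | z_t = k)
              [0.2, 0.8]])
B = np.array([[0.7, 0.2, 0.1],   # B[k, j] = p(x_t = j | z_t = k)
              [0.1, 0.3, 0.6]])

def sample_hmm(T):
    """Draw (z_1..z_T, x_1..x_T) by following the chain forward."""
    z = rng.choice(2, p=pi)
    zs, xs = [], []
    for _ in range(T):
        zs.append(z)
        xs.append(rng.choice(3, p=B[z]))  # emit a symbol from state z
        z = rng.choice(2, p=A[z])         # transition to the next state
    return zs, xs

print(sample_hmm(10))
```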
Inference in Graphical Models
x_E → Observed evidence variables (subset of nodes)
x_F → Unobserved query nodes we'd like to infer
x_R → Remaining variables, extraneous to this query but part of the given graphical representation
Naïve Inference
• Suppose each variable takes one of k discrete values

p(x_1, x_2, ..., x_5) = ∑_{x_6} p(x_1) p(x_2|x_1) p(x_3|x_1) p(x_4|x_2) p(x_5|x_3) p(x_6|x_2, x_5)
                      = p(x_1) p(x_2|x_1) p(x_3|x_1) p(x_4|x_2) p(x_5|x_3) ∑_{x_6} p(x_6|x_2, x_5)

• Costs O(k) operations to update each of O(k^5) table entries
• Use factorization and the distributive law to reduce complexity (see the sketch below)
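A sketch of the saving on this exact factorization, with invented binary CPTs: since ∑_{x_6} p(x_6|x_2, x_5) = 1, the last factor drops out entirely, and we never need to build the full joint.

```python
import numpy as np

# Invented binary CPTs (k = 2) for the six-variable example above. Naively,
# summing x6 out of the full joint touches a k^6 table; pushing the sum
# inward only ever builds factors over the remaining five variables.
k = 2
rng = np.random.default_rng(1)
def cpt(*shape):  # random conditional table, normalized over the last axis
    t = rng.random(shape)
    return t / t.sum(axis=-1, keepdims=True)

p1, p21, p31, p42, p53, p625 = (cpt(k), cpt(k, k), cpt(k, k),
                                cpt(k, k), cpt(k, k), cpt(k, k, k))

# Naive: build the full joint p(x1,...,x6), then sum over x6 (axis f).
joint = np.einsum('a,ab,ac,bd,ce,bef->abcdef', p1, p21, p31, p42, p53, p625)
naive = joint.sum(axis=5)

# Distributive law: sum_{x6} p(x6|x2,x5) = 1, so the factor just drops out.
fast = np.einsum('a,ab,ac,bd,ce->abcde', p1, p21, p31, p42, p53)
print(np.allclose(naive, fast))  # -> True
```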
Learning outcomes
• What aspects of a model can we express using graphical notation?
• Which aspects are not captured in this way?
• How do independencies change as a result of conditioning?
• Reasons for using latent variables
• Common motifs such as mixtures and chains
• How to integrate out unobserved variables