CSC 412/2506 Winter 2018: Probabilistic Learning and Reasoning
Lecture 3: Directed Graphical Models and Latent Variables
Based on slides by Richard Zemel


Learning Outcomes

• What aspects of a model can we express using graphical notation?
• Which aspects are not captured in this way?
• How do independencies change as a result of conditioning?
• Reasons for using latent variables
• Common motifs such as mixtures and chains
• How to integrate out unobserved variables

Joint Probabilities

• The chain rule implies that any joint distribution equals

  p(x_{1:D}) = p(x_1) p(x_2|x_1) p(x_3|x_1, x_2) p(x_4|x_1, x_2, x_3) ... p(x_D|x_{1:D-1})

• A directed graphical model implies a restricted factorization.

Conditional Independence

• Notation: x_A ⊥ x_B | x_C

• Definition: two (sets of) variables x_A and x_B are conditionally independent given a third x_C if

  P(x_A, x_B | x_C) = P(x_A | x_C) P(x_B | x_C)   for all x_C

  which is equivalent to saying

  P(x_A | x_B, x_C) = P(x_A | x_C)   for all x_C

• Only a subset of all distributions respect any given (nontrivial) conditional independence statement. The subset of distributions that respect all the CI assumptions we make is the family of distributions consistent with our assumptions.

• Probabilistic graphical models are a powerful, elegant and simple way to specify such a family.

Directed Graphical Models

• Consider directed acyclic graphs over N variables.
• Each node has a (possibly empty) set of parents π_i.
• We can then write

  P(x_1, ..., x_N) = Π_i P(x_i | x_{π_i})

• Hence we factorize the joint in terms of local conditional probabilities.
• The cost is exponential in the "fan-in" of each node instead of in N.
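To make the factorization concrete, here is a minimal sketch (not from the slides) that evaluates P(x_1, ..., x_N) = Π_i P(x_i | x_{π_i}) for a small hypothetical binary DAG; the structure and the numbers in the tables are invented for illustration.

```python
# Evaluate a joint probability as a product of local conditional tables
# for an illustrative binary DAG (the classic "sprinkler" network).
# parents[i] lists the parents pi_i of node i; cpt[i] maps
# (parent values) -> P(node = 1 | parents).
parents = {
    "cloudy":    [],
    "sprinkler": ["cloudy"],
    "rain":      ["cloudy"],
    "wet":       ["sprinkler", "rain"],
}
cpt = {
    "cloudy":    {(): 0.5},
    "sprinkler": {(0,): 0.5, (1,): 0.1},
    "rain":      {(0,): 0.2, (1,): 0.8},
    "wet":       {(0, 0): 0.0, (0, 1): 0.9, (1, 0): 0.9, (1, 1): 0.99},
}

def joint(assignment):
    """P(x) = prod_i P(x_i | x_{pi_i}) for a full assignment {node: 0/1}."""
    p = 1.0
    for node, value in assignment.items():
        pa = tuple(assignment[q] for q in parents[node])
        p1 = cpt[node][pa]                  # P(node = 1 | parents)
        p *= p1 if value == 1 else 1.0 - p1
    return p

print(joint({"cloudy": 1, "sprinkler": 0, "rain": 1, "wet": 1}))
# 0.5 * 0.9 * 0.8 * 0.9 = 0.324
```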

Conditional Independence in DAGs

• If we order the nodes in a directed graphical model so that parents always come before their children in the ordering, then the graphical model implies the following about the distribution:

  {x_i ⊥ x_{π̃_i} | x_{π_i}}   for all i

  where x_{π̃_i} are the nodes coming before x_i that are not its parents.

• In other words, the DAG is telling us that each variable is conditionally independent of its non-descendants given its parents.

• Such an ordering is called a "topological" ordering.

Example DAG

• Consider this six-node network; the joint probability is now:

  p(x_{1:6}) = p(x_1) p(x_2|x_1) p(x_3|x_1) p(x_4|x_2) p(x_5|x_3) p(x_6|x_2, x_5)

Missing Edges

• Key point about directed graphical models: missing edges imply conditional independence.
• Remember that by the chain rule we can always write the full joint as a product of conditionals, given an ordering:

  P(x_1, x_2, ...) = P(x_1) P(x_2|x_1) P(x_3|x_2, x_1) P(x_4|x_3, x_2, x_1) ...

• If the joint is represented by a DAG M, then some of the conditioned-on variables on the right-hand sides are missing.
• This is equivalent to enforcing conditional independence.
• Start with the "idiot's graph": each node has all previous nodes in the ordering as its parents.
• Now remove edges to get your DAG.
• Removing an edge into node i eliminates an argument from the conditional probability factor P(x_i | x_1, x_2, ..., x_{i-1}).

D-Separation

• D-separation, or directed separation, is a notion of connectedness in DAGMs in which two (sets of) variables may or may not be connected conditioned on a third (set of) variables.
• D-connection implies conditional dependence and d-separation implies conditional independence.
• In particular, we say that x_A ⊥ x_B | x_C if every variable in A is d-separated from every variable in B conditioned on all the variables in C.
• To check if an independence is true, we can cycle through each node in A, do a depth-first search to reach every node in B, and examine the paths between them. If all of the paths are d-separated, then we can assert x_A ⊥ x_B | x_C.
• Thus, it will be sufficient to consider triples of nodes. (Why?)
• Pictorially, when we condition on a node, we shade it in.

Chain

• Q: When we condition on y, are x and z independent?

  P(x, y, z) = P(x) P(y|x) P(z|y)

  which implies

  P(z|x, y) = P(x, y, z) / P(x, y) = P(x) P(y|x) P(z|y) / (P(x) P(y|x)) = P(z|y)

  and therefore x ⊥ z | y.

• Think of x as the past, y as the present and z as the future.

Common Cause

• Q: When we condition on y, are x and z independent?

  P(x, y, z) = P(y) P(x|y) P(z|y)

  which implies

  P(x, z|y) = P(x, y, z) / P(y) = P(y) P(x|y) P(z|y) / P(y) = P(x|y) P(z|y)

  and therefore x ⊥ z | y.

Explaining Away

• Q: When we condition on y, are x and z independent?

  P(x, y, z) = P(x) P(z) P(y|x, z)

• x and z are marginally independent, but given y they are conditionally dependent.
• This important effect is called explaining away (Berkson's paradox).
• For example, flip two coins independently; let x = coin 1, z = coin 2.
• Let y = 1 if the coins come up the same and y = 0 if different.
• x and z are independent, but if I tell you y, they become coupled!
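The coin example can be checked numerically. The sketch below (my own illustration) builds P(x, y, z) = P(x) P(z) P(y|x, z) for two fair coins with y = 1 when they match, and confirms that x and z are marginally independent but become coupled once y is observed.

```python
# Numeric check of explaining away: two fair coins x and z, y = [x == z].
from itertools import product

def p_xyz(x, y, z):
    # P(x, y, z) = P(x) P(z) P(y | x, z), with y deterministic given x, z
    return 0.5 * 0.5 * (1.0 if y == int(x == z) else 0.0)

# Marginal independence: P(z=1 | x=1) equals P(z=1) = 0.5
p_x1 = sum(p_xyz(1, y, z) for y, z in product([0, 1], repeat=2))
p_x1_z1 = sum(p_xyz(1, y, 1) for y in [0, 1])
print(p_x1_z1 / p_x1)            # 0.5

# Conditional dependence: P(z=1 | x=1, y=1) = 1.0, not 0.5
p_x1_y1 = sum(p_xyz(1, 1, z) for z in [0, 1])
print(p_xyz(1, 1, 1) / p_x1_y1)  # 1.0
```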

Bayes-Ball Algorithm

• To check if x_A ⊥ x_B | x_C we need to check if every variable in A is d-separated from every variable in B conditioned on all variables in C.
• In other words, given that all the nodes in x_C are clamped, when we wiggle nodes x_A can we change any of the nodes in x_B?
• The Bayes-Ball Algorithm is such a d-separation test: we shade all nodes x_C, place balls at each node in x_A (or x_B), let them bounce around according to some rules, and then ask if any of the balls reach any of the nodes in x_B (or x_A).

Bayes-Ball Rules

• The three cases we considered tell us the rules:

Bayes-Ball Boundary Rules

• We also need the boundary conditions.
• Here's a trick for the explaining-away case: if y or any of its descendants is shaded, the ball passes through.
• Notice that balls can travel opposite to edge directions.

Canonical Micrographs

Examples of Bayes-Ball Algorithm

• Notice: balls can travel opposite to edge direction.
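The Bayes-Ball rules amount to checking whether every undirected path between the two sets is blocked. Below is a brute-force sketch of that test (my own illustration, not the lecture's algorithm), which assumes the networkx library is available: it enumerates simple paths and applies the chain, fork, and collider rules from the canonical micrographs.

```python
# Brute-force d-separation check: x _|_ y | Z iff every undirected path
# from x to y is blocked. A path is blocked if some consecutive triple
# (u, m, v) on it is blocked:
#   chain u->m->v (or reversed) or fork u<-m->v : blocked iff m in Z
#   collider u->m<-v : blocked iff neither m nor any descendant of m is in Z
import networkx as nx

def d_separated(G, x, y, Z):
    Z = set(Z)
    for path in nx.all_simple_paths(G.to_undirected(), x, y):
        blocked = False
        for u, m, v in zip(path, path[1:], path[2:]):
            collider = G.has_edge(u, m) and G.has_edge(v, m)
            if collider:
                if m not in Z and not (nx.descendants(G, m) & Z):
                    blocked = True
                    break
            elif m in Z:      # chain or fork with an observed middle node
                blocked = True
                break
        if not blocked:
            return False      # found an active (unblocked) path
    return True

# The three canonical cases: chain x->y->z, fork x<-y->z, collider x->y<-z
chain = nx.DiGraph([("x", "y"), ("y", "z")])
fork = nx.DiGraph([("y", "x"), ("y", "z")])
collider = nx.DiGraph([("x", "y"), ("z", "y")])
print(d_separated(chain, "x", "z", {"y"}))     # True
print(d_separated(fork, "x", "z", {"y"}))      # True
print(d_separated(collider, "x", "z", set()))  # True  (marginally independent)
print(d_separated(collider, "x", "z", {"y"}))  # False (explaining away)
```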

Plates

Plates & Parameters

• Since Bayesian methods treat parameters as random variables, we would like to include them in the graphical model.
• One way to do this is to repeat all the iid observations explicitly and show the parameter only once.
• A better way is to use plates, in which repeated quantities that are iid are put in a box.

Plates: Macros for Repeated Structures

• Plates are like "macros" that allow you to draw a very complicated graphical model with a simpler notation.
• The rules of plates are simple: repeat every structure in a box a number of times given by the integer in the corner of the box (e.g., N), updating the plate index variable (e.g., n) as you go.
• Duplicate every arrow going into the plate and every arrow leaving the plate by connecting the arrows to each copy of the structure.

Nested / Intersecting Plates

• Plates can be nested, in which case their arrows get duplicated also, according to the rule: draw an arrow from every copy of the source node to every copy of the destination node.
• Plates can also cross (intersect), in which case the nodes at the intersection have multiple indices and get duplicated a number of times equal to the product of the duplication numbers on all the plates containing them.

Example: Nested Plates

Example DAGM: Markov Chain

• Markov property: conditioned on the present, the past and future are independent.

Unobserved Variables

• Certain variables Q in our models may be unobserved, either some of the time or always, either at training time or at test time.
• Graphically, we will use shading to indicate observation.

Partially Unobserved (Missing) Variables

• If variables are occasionally unobserved they are missing data, e.g., undefined inputs, missing class labels, erroneous target values.
• In this case, we can still model the joint distribution, but we define a new cost function in which we sum out or marginalize the missing values at training or test time:

  ℓ(θ; D) = Σ_complete log p(x_c, y_c|θ) + Σ_missing log p(x_m|θ)
          = Σ_complete log p(x_c, y_c|θ) + Σ_missing log Σ_y p(x_m, y|θ)

  Recall that p(x) = Σ_q p(x, q).
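As a small illustration of the marginalized cost above (not from the slides), the sketch below assumes a toy generative classifier p(x, y|θ) = p(y) N(x|μ_y, σ²): labeled points contribute log p(x, y|θ), while points with missing labels contribute log Σ_y p(x, y|θ).

```python
# Marginalized log likelihood with some labels missing (toy model,
# illustrative numbers).
import numpy as np
from scipy.stats import norm

prior = np.array([0.5, 0.5])          # p(y)
mu, sigma = np.array([-1.0, 1.0]), 1.0

def log_joint(x, y):                   # log p(x, y | theta)
    return np.log(prior[y]) + norm.logpdf(x, mu[y], sigma)

labeled = [(0.3, 1), (-1.2, 0)]        # complete (x_c, y_c) pairs
unlabeled = [0.5, -0.7]                # x_m with missing labels

ll = sum(log_joint(x, y) for x, y in labeled)
ll += sum(np.logaddexp(log_joint(x, 0), log_joint(x, 1)) for x in unlabeled)
print(ll)
```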

Latent Variables

• What to do when a variable z is always unobserved? It depends on where it appears in our model. If we never condition on it when computing the probability of the variables we do observe, then we can just forget about it and integrate it out. E.g., given y, x fit the model p(z, y|x) = p(z|y) p(y|x, w) p(w). (In other words, if it is a leaf node.)
• But if z is conditioned on, we need to model it: e.g., given y, x fit the model p(y|x) = Σ_z p(y|x, z) p(z).

Where Do Latent Variables Come From?

• Latent variables may appear naturally, from the structure of the problem, because something wasn't measured, because of faulty sensors, occlusion, privacy, etc.
• But also, we may want to intentionally introduce latent variables to model complex dependencies between variables without looking at the dependencies between them directly. This can actually simplify the model (e.g., mixtures).

Latent Variable Models & Regression

• You can think of clustering as the problem of classification with missing class labels.
• You can think of factor models (such as factor analysis, PCA, ICA, etc.) as linear or nonlinear regression with missing inputs.

Why is Learning Harder?

• In fully observed iid settings, the probability model is a product, thus the log likelihood is a sum where terms decouple. (At least for directed models.)

  ℓ(θ; D) = log p(x, z|θ) = log p(z|θ_z) + log p(x|z, θ_x)

• With latent variables, the probability already contains a sum, so the log likelihood has all parameters coupled together via z (just as with the partition function in undirected models):

  ℓ(θ; D) = log Σ_z p(x, z|θ) = log Σ_z p(z|θ_z) p(x|z, θ_x)

Why is Learning Harder?

• The likelihood couples the parameters:

  ℓ(θ; D) = log Σ_z p(z|θ_z) p(x|z, θ_x)

• We can treat this as a black-box probability function and just try to optimize the likelihood as a function of θ (e.g., by gradient descent). However, sometimes taking advantage of the latent variable structure can make parameter estimation easier.
• Good news: soon we will see how to deal with latent variables.
• Basic trick: put a tractable distribution on the values you don't know. Basic math: use convexity to lower bound the likelihood.

Mixture Models

• The most basic latent variable model, with a single discrete node z.
• Allows different submodels (experts) to contribute to the (conditional) density model in different parts of the space.
• Divide & conquer idea: use simple parts to build complex models (e.g., multimodal densities, or piecewise-linear regressions).

Mixture Densities

• Exactly like a classification model but the class is unobserved and so we sum it out. What we get is a perfectly valid density:

  p(x|θ) = Σ_{k=1}^{K} p(z = k|θ_z) p(x|z = k, θ_k) = Σ_k α_k p_k(x|θ_k)

  where the "mixing proportions" add to one: Σ_k α_k = 1.

• We can use Bayes' rule to compute the posterior probability of the mixture component given some data:

  p(z = k|x, θ) = α_k p_k(x|θ_k) / Σ_j α_j p_j(x|θ_j)

  These quantities are called responsibilities.

Example: Gaussian Mixture Models

• Consider a mixture of K Gaussian components:

  p(x|θ) = Σ_k α_k N(x|μ_k, Σ_k)

  p(z = k|x, θ) = α_k N(x|μ_k, Σ_k) / Σ_j α_j N(x|μ_j, Σ_j)

  ℓ(θ; D) = Σ_n log Σ_k α_k N(x^(n)|μ_k, Σ_k)

• Density model: p(x|θ) is a familiarity signal. Clustering: p(z|x, θ) is the assignment rule, −ℓ(θ) is the cost.
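The sketch below (illustrative parameters, one-dimensional data) computes the three quantities on this slide for a two-component Gaussian mixture: the density p(x|θ), the log likelihood ℓ(θ; D), and the responsibilities p(z = k|x, θ).

```python
# 1-D mixture of two Gaussians with made-up parameters.
import numpy as np
from scipy.stats import norm

alpha = np.array([0.3, 0.7])             # mixing proportions, sum to 1
mu = np.array([-2.0, 1.0])
sigma = np.array([1.0, 0.5])

x = np.array([-1.5, 0.0, 1.2])           # a few data points

# component densities N(x | mu_k, sigma_k): shape (N, K)
comp = norm.pdf(x[:, None], mu[None, :], sigma[None, :])

density = comp @ alpha                    # p(x | theta) = sum_k alpha_k N_k(x)
loglik = np.sum(np.log(density))          # l(theta; D) = sum_n log p(x_n | theta)
resp = (alpha * comp) / density[:, None]  # responsibilities p(z = k | x, theta)

print(density, loglik)
print(resp.sum(axis=1))                   # each row sums to 1
```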

Example: Mixtures of Experts

• Also called conditional mixtures. Exactly like a class-conditional model but the class is unobserved and so we sum it out again:

  p(y|x, θ) = Σ_{k=1}^{K} p(z = k|x, θ_z) p(y|z = k, x, θ_k) = Σ_k α_k(x|θ_z) p_k(y|x, θ_k)

  where Σ_k α_k(x) = 1 for all x.

• Harder: we must learn α(x) (unless we chose z independent of x).
• We can still use Bayes' rule to compute the posterior probability of the mixture component given some data:

  p(z = k|x, y, θ) = α_k(x) p_k(y|x, θ_k) / Σ_j α_j(x) p_j(y|x, θ_j)

  This function is often called the gating function.

Example: Mixtures of Linear Regression Experts

• Each expert generates data according to a linear function of the input plus additive Gaussian noise:

  p(y|x, θ) = Σ_k α_k(x) N(y|β_k^T x, σ_k²)

• The "gate" function can be a softmax classification machine:

  α_k(x) = p(z = k|x) = exp(η_k^T x) / Σ_j exp(η_j^T x)

• Remember: we are not modeling the density of the inputs x.
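A minimal sketch of this conditional density (parameters η, β, σ invented for illustration): a softmax gate α_k(x) mixes K linear-Gaussian experts.

```python
# Mixture of linear regression experts: p(y|x) = sum_k alpha_k(x) N(y | beta_k^T x, sigma_k^2)
import numpy as np
from scipy.stats import norm

K, D = 3, 2
rng = np.random.default_rng(0)
eta = rng.normal(size=(K, D))      # gating weights
beta = rng.normal(size=(K, D))     # per-expert regression weights
sigma = np.array([0.5, 1.0, 0.3])  # per-expert noise std

def predictive_density(y, x):
    """p(y | x, theta) with a softmax gate over K linear experts."""
    logits = eta @ x
    alpha = np.exp(logits - logits.max())
    alpha /= alpha.sum()                       # softmax gate alpha_k(x)
    return np.sum(alpha * norm.pdf(y, beta @ x, sigma))

x = np.array([0.7, 1.0])                       # input (last entry as a bias feature)
print(predictive_density(1.3, x))
```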

Gradient Learning with Mixtures

• We can learn mixture densities using gradient descent on the likelihood as usual. The gradients are quite interesting:

  ℓ(θ) = log p(x|θ) = log Σ_k α_k p_k(x|θ_k)

  ∂ℓ/∂θ = (1/p(x|θ)) Σ_k α_k ∂p_k(x|θ_k)/∂θ
        = Σ_k α_k (1/p(x|θ)) p_k(x|θ_k) ∂log p_k(x|θ_k)/∂θ
        = Σ_k (α_k p_k(x|θ_k)/p(x|θ)) ∂ℓ_k/∂θ_k
        = Σ_k r_k ∂ℓ_k/∂θ_k

  where r_k = p(z = k|x, θ) is the responsibility of component k.

• In other words, the gradient is the responsibility-weighted sum of the individual log-likelihood gradients.
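A quick numerical sanity check (made-up numbers, not from the slides): for a one-dimensional Gaussian mixture and a single data point, the responsibility-weighted gradient r_k (x − μ_k)/σ_k² of the component log likelihood should match a finite-difference estimate of ∂ℓ/∂μ_k.

```python
# Check d l / d mu_0 = r_0 * (x - mu_0) / sigma_0^2 against finite differences.
import numpy as np
from scipy.stats import norm

alpha = np.array([0.4, 0.6])
mu = np.array([-1.0, 2.0])
sigma = np.array([1.0, 0.8])
x = 0.5

def loglik(mu_vec):
    return np.log(np.sum(alpha * norm.pdf(x, mu_vec, sigma)))

# responsibility-weighted analytic gradient for component k = 0
comp = alpha * norm.pdf(x, mu, sigma)
r = comp / comp.sum()
analytic = r[0] * (x - mu[0]) / sigma[0] ** 2

# finite-difference gradient with respect to mu_0
eps = 1e-6
mu_plus = mu.copy()
mu_plus[0] += eps
numeric = (loglik(mu_plus) - loglik(mu)) / eps

print(analytic, numeric)   # should agree to ~1e-5
```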

Parameter Constraints

• If we want to use general optimization (e.g., conjugate gradient) to learn latent variable models, we often have to make sure parameters respect certain constraints (e.g., Σ_k α_k = 1, covariance matrices Σ_k positive definite).
• A good trick is to re-parameterize these quantities in terms of unconstrained values. For mixing proportions, use the softmax:

  α_k = exp(q_k) / Σ_j exp(q_j)

• For covariance matrices, use the Cholesky decomposition:

  Σ^{-1} = A^T A,   |Σ|^{-1/2} = Π_i A_ii

  where A is upper triangular with a positive diagonal:

  A_ii = exp(r_i) > 0,   A_ij = a_ij (j > i),   A_ij = 0 (j < i)
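A minimal sketch of these re-parameterizations (my own code): unconstrained q gives mixing proportions via the softmax, and an upper-triangular A with exponentiated diagonal gives a positive definite Σ^{-1} = A^T A.

```python
# Unconstrained re-parameterizations for mixture learning.
import numpy as np

def mixing_proportions(q):
    """alpha_k = exp(q_k) / sum_j exp(q_j); q is unconstrained."""
    e = np.exp(q - q.max())
    return e / e.sum()

def precision_from_unconstrained(r, a):
    """Build Sigma^{-1} = A^T A from unconstrained diagonal r and
    strictly upper-triangular entries a (length D*(D-1)/2)."""
    D = len(r)
    A = np.zeros((D, D))
    A[np.diag_indices(D)] = np.exp(r)      # positive diagonal
    A[np.triu_indices(D, k=1)] = a         # free upper-triangular part
    return A.T @ A

q = np.array([0.3, -1.0, 2.0])
print(mixing_proportions(q))               # sums to 1

prec = precision_from_unconstrained(np.zeros(2), np.array([0.5]))
print(np.linalg.eigvalsh(prec))            # all eigenvalues > 0
```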

Logsumexp

• Often you can easily compute b_k = log p(x|z = k, θ_k), but it will be very negative, say −10^6 or smaller.
• Now, to compute ℓ = log p(x|θ) you need to compute log Σ_k e^{b_k} (e.g., for calculating responsibilities at test time or for learning).
• Careful! Do not compute this by doing log(sum(exp(b))). You will get underflow and an incorrect answer.
• Instead do this:
  – Shift all the values b_k by a constant B so that the largest value equals zero: B = max(b).
  – Compute log(sum(exp(b − B))) + B.
• Example: if log p(x|z = 1) = −120 and log p(x|z = 2) = −120, what is log p(x) = log[p(x|z = 1) + p(x|z = 2)]? Answer: log[2e^{−120}] = −120 + log 2.
• Rule of thumb: never use log or exp by itself.
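A minimal sketch of the trick (my own code, using a more extreme value than the slide's −120 example so that the naive computation actually underflows in double precision):

```python
# Stable log-sum-exp: shift by the maximum before exponentiating.
import numpy as np

def logsumexp(b):
    """log(sum(exp(b))) computed stably: shift by B = max(b) first."""
    B = np.max(b)
    return B + np.log(np.sum(np.exp(b - B)))

b = np.array([-1200.0, -1200.0])
print(logsumexp(b))                  # -1200 + log(2) = -1199.3069...
print(np.log(np.sum(np.exp(b))))     # naive version: exp underflows, gives -inf
```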

Hidden Markov Models (HMMs)

• A very popular form of latent variable model.
• Z_t → hidden states taking one of K discrete values.
• X_t → observations taking values in any space.
• Example: discrete observations with M symbols, B ∈ ℝ^{K×M}:

  p(x_t = j|z_t = k) = B_kj
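As a small illustration (the transition matrix A, initial distribution, and emission matrix B below are all made up), the sketch samples a hidden state sequence and its observations from such an HMM.

```python
# Sample from a discrete-observation HMM: hidden Markov chain z, emissions x.
import numpy as np

rng = np.random.default_rng(0)
K, M, T = 2, 3, 10
pi0 = np.array([0.6, 0.4])              # initial state distribution
A = np.array([[0.9, 0.1],               # p(z_{t+1} = j | z_t = i)
              [0.2, 0.8]])
B = np.array([[0.7, 0.2, 0.1],          # p(x_t = j | z_t = k) = B[k, j]
              [0.1, 0.3, 0.6]])

z = np.empty(T, dtype=int)
x = np.empty(T, dtype=int)
z[0] = rng.choice(K, p=pi0)
x[0] = rng.choice(M, p=B[z[0]])
for t in range(1, T):
    z[t] = rng.choice(K, p=A[z[t - 1]])  # Markov chain over hidden states
    x[t] = rng.choice(M, p=B[z[t]])      # emission given current state
print(z, x)
```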

Inference in Graphical Models

• x_E → observed evidence variables (subset of nodes)
• x_F → unobserved query variables we'd like to infer
• x_R → remaining variables, extraneous to this query but part of the given graphical representation

Inference with Two Variables

• Table look-up: p(y|x = x̄)
• Bayes' rule: p(x|y = ȳ) = p(ȳ|x) p(x) / p(ȳ)

Naïve Inference

• Suppose each variable takes one of k discrete values.

  p(x_1, x_2, ..., x_5) = Σ_{x_6} p(x_1) p(x_2|x_1) p(x_3|x_1) p(x_4|x_2) p(x_5|x_3) p(x_6|x_2, x_5)

• This costs O(k) operations to update each of O(k^5) table entries.
• Use the factorization and the distributive law to reduce the complexity:

  p(x_1, x_2, ..., x_5) = p(x_1) p(x_2|x_1) p(x_3|x_1) p(x_4|x_2) p(x_5|x_3) Σ_{x_6} p(x_6|x_2, x_5)
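The saving can be checked numerically. The sketch below (random binary CPTs, my own illustration) verifies that summing x_6 out of the full joint gives the same marginal as pushing the sum inward, where Σ_{x6} p(x_6|x_2, x_5) = 1.

```python
# Compare naive marginalization of x6 with the factored (distributed-law) form.
import numpy as np

rng = np.random.default_rng(0)
k = 2

def cpt(*shape):
    """Random conditional table, normalized over its first axis (the child)."""
    t = rng.random(shape)
    return t / t.sum(axis=0, keepdims=True)

p1 = cpt(k)            # p(x1)
p2 = cpt(k, k)         # p(x2 | x1)
p3 = cpt(k, k)         # p(x3 | x1)
p4 = cpt(k, k)         # p(x4 | x2)
p5 = cpt(k, k)         # p(x5 | x3)
p6 = cpt(k, k, k)      # p(x6 | x2, x5)

# naive: build the full joint over 6 variables, then sum out x6
joint = np.einsum('a,ba,ca,db,ec,fbe->abcdef', p1, p2, p3, p4, p5, p6)
naive = joint.sum(axis=5)

# factored: sum_{x6} p(x6|x2,x5) = 1, so the marginal is just the product
factored = np.einsum('a,ba,ca,db,ec->abcde', p1, p2, p3, p4, p5)

print(np.allclose(naive, factored))   # True
```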

Inference in Directed Graphs

Learning Outcomes

• What aspects of a model can we express using graphical notation?
• Which aspects are not captured in this way?
• How do independencies change as a result of conditioning?
• Reasons for using latent variables
• Common motifs such as mixtures and chains
• How to integrate out unobserved variables

Questions?

• Thursday: tutorial on automatic differentiation
• This week: Assignment 1 released