Generative and Discriminative Approaches to Graphical Models
CMSC 35900 Topics in AI, Lecture 1
Yasemin Altun
January 3, 2007


Page 1

Generative and Discriminative Approaches to Graphical Models

CMSC 35900 Topics in AI, Lecture 1

Yasemin Altun

January 3, 2007

Page 2

ADMINISTRATIVE STUFF

I Lectures: Wednesday 3:30-6:00, TTI Room 201
I Office hours: Wednesday 10am-Noon
I No text book. Reference reading will be handed out.
I No homework, no exam
I Presentation/Discussions: 40% of grade
I Final Project: 60% of grade
I Apply one of the methods discussed to your research area (e.g. NLP, vision, CompBio)
I Add new features to SVM-struct or other available packages
I Theoretical work
I URL: http://ttic.uchicago.edu/~altun/Teaching/CS359/index.html

I Class mail list

Page 3

Prerequisites

I Familiarity with:
I Probability: random variables, densities, expectations, joint, marginal, and conditional probabilities, Bayes rule, independence
I Linear algebra
I Optimization: Lagrangian methods

Page 4

Traditional Prediction Problems

I Supervised learning: given input-output pairs, find a function that predicts the outputs of new inputs
I Binary classification, label class {0, 1}
I Multiclass classification, label class {0, . . . , m}
I Regression, label class ℝ
I Unsupervised learning: given only inputs, discover some structure, e.g. clusters, outliers
I Semi-supervised learning: given a few input-output pairs and many inputs, find a function to predict the outputs of new inputs
I Transduction: given a few input-output pairs and many inputs, find a function that predicts well on the given unlabeled inputs

Page 5

Key Components

I 4 aspects of learning:
I Representation
I Parameterization and the hypothesis space
I Learning objective
I Optimization method
I Different settings lead to different learning methods
I For prediction tasks, the state-of-the-art methods are Support Vector Machines, Boosting, and Gaussian Processes

Page 6

Discriminative Learning

I All these methods are from the discriminative learning paradigm.
I (Treat inputs and outputs as random variables: X for the input with instantiation x, Y for the output with instantiation y; p(x) for the probability P(X = x).)
I Given an input x, they discriminate the target label y, e.g. via p(y|x).
I Since they condition on x, they can treat arbitrarily complex objects as input.
I Versus a generative approach:
I where the goal is to estimate the joint distribution p(x, y)
I p(x, y) = p(y) p(x|y)
I p(x|y): given the target label, generate the input
I e.g. the Naive Bayes classifier

Page 7

Structured (Output) Prediction

I Traditionally, discriminative methods predict one simple variable.
I In real-life applications, this is rarely the case.
I Not taking dependencies into account is an important shortcoming.
I Domains: Natural Language Processing, Speech, Information Retrieval, Computer Vision, Bioinformatics, Computational Economy

Page 8

Examples

I Domain: Natural Language Processing
I Application: Part-of-speech tagging
I Input: A sequence of words
I Output: Labels of each word as noun, verb, adjective, etc.

  John  hit  the  ball.
  Noun  Vb   Det  Noun

Page 9

Examples

I Domain: Computational Biology
I Application: Protein Secondary Structure Prediction
I Input: Amino-acid sequence
  AAYKSHGSGDYGDHDVGHPTPGDPWVEPDYGINVYH
I Output: H/E/- regions
  HHHH-------EEEEEEEE-----------HHHHH----

Page 10

Examples

I Domain: Computer vision
I Application: Identifying joint angles of the human body

Page 11

Examples

I Domain: Natural Language Processing
I Application: Parsing
I Input: Sentence (sequence of words)
I Output: Parse tree (a configuration of grammar terminals/non-terminals)

Page 12

Examples

I Domain: Information Retrieval
I Application: Text classification with taxonomies
I Input: Document
I Output: A (leaf) class from the taxonomy

Page 13

Possible approaches

I Ignore the dependencies and use a standard learning method for each component. BAD!
I Consider all components as one unit, and treat every possible joint labeling as its own class:
I Input: word1, . . . , wordn
I Each word can take a label from {1, . . . , m}
I A multiclass classification problem where the label set has size m^n (e.g. for a 10-word sentence with 45 POS tags, 45^10 ≈ 3.4 × 10^16 classes)
I Hopeless!!!
I Use graphical models!

Page 14

Graphical Models

I A framework for multivariate statistical models
I Widespread domains and ubiquitous applications
I A marriage of graph theory and probability theory
I Graph theory:
I Provides a means to build complex systems from simple parts
I Dependencies between variables are encoded in the graph structure
I Efficient algorithms for learning and inference
I Bayesian Networks: graphical models with directed acyclic graphs
I Markov Random Fields (Markov networks): graphical models with undirected graphs

Page 15

Bayes(ian) Net(work)s

I aka Belief Networks, Directed Graphical Models
I Each node is a random variable. Shaded nodes denote fixed (observed) values.
I Edges represent dependency (causation).
I No directed cycles. Bayes nets are DAGs.
I Local Markov property: a node is conditionally independent of its non-descendants given its parents.

Directed Graphical Models

• Consider directed acyclic graphs over n variables.
• Each node has a (possibly empty) set of parents Πi.
• Each node maintains a function fi(xi; xΠi) such that fi > 0 and ∑xi fi(xi; xΠi) = 1 for every setting xΠi of the parents.
• Define the joint probability to be:

P(x1, x2, . . . , xn) = ∏i fi(xi; xΠi)

Even with no further restriction on the fi, it is always true that

fi(xi; xΠi) = P(xi | xΠi)

so we will just write

P(x1, x2, . . . , xn) = ∏i P(xi | xΠi)

• Factorization of the joint in terms of local conditional probabilities. Exponential in the "fan-in" of each node instead of in the total number of variables n.

Conditional Independence in DAGs

• If we order the nodes in a directed graphical model so that parents always come before their children in the ordering, then the graphical model implies the following about the distribution:

{ xi ⊥ x~Πi | xΠi } for all i

where x~Πi are the nodes coming before xi that are not its parents.
• In other words, the DAG is telling us that each variable is conditionally independent of its non-descendants given its parents.
• Such an ordering is called a "topological" ordering.

Example DAG

• Consider this six-node network over X1, . . . , X6. The joint probability is now:

P(x1, x2, x3, x4, x5, x6) = P(x1) P(x2|x1) P(x3|x1) P(x4|x2) P(x5|x3) P(x6|x2, x5)

(Figure: the six-node DAG X1-X6, with a conditional probability table attached to each node.)

Missing Edges

• Key point about directed graphical models: missing edges imply conditional independence.
• Remember that by the chain rule we can always write the full joint as a product of conditionals, given an ordering:

P(x1, x2, x3, x4, . . .) = P(x1) P(x2|x1) P(x3|x1, x2) P(x4|x1, x2, x3) . . .

• If the joint is represented by a DAGM, then some of the conditioned variables on the right-hand sides are missing. This is equivalent to enforcing conditional independence.
• Start with the "idiot's graph": each node has all previous nodes in the ordering as its parents.
• Now remove edges to get your DAG.
• Removing an edge into node i eliminates an argument from the conditional probability factor p(xi | x1, x2, . . . , xi−1).

Page 16

Chain Rule for Bayes Nets

P(x1:N) = P(x1) P(x2|x1) P(x3|x1, x2) . . .
        = ∏i=1..N P(xi | x1:i−1)
        = ∏i=1..N P(xi | xΠi)

I Factorization of the joint P(x1:N) into local conditional probabilities

Page 17

Compact Representation


I Factorization P(x1:6) = P(x1) P(x2|x1) P(x3|x1) P(x4|x2) P(x5|x3) P(x6|x2, x5)

I If each xi takes one of m values and K is the maximum number of parents (fan-in), the representation shrinks from a single table with m^N entries to on the order of N·m^K entries.

I From exponential in N to linear in N. In general K << N. (A small numerical sketch follows.)
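To make the table-size comparison concrete, here is a minimal Python sketch (not from the slides) that encodes the six-node example DAG with made-up CPT values for binary variables, evaluates the factored joint, and checks that it normalizes. Every probability value in it is an illustrative assumption.

```python
import itertools

# Illustrative CPTs for six binary variables; every number is made up.
# Keys are (child_value, parent_value(s)).
p1 = {(0,): 0.6, (1,): 0.4}                                   # P(x1)
p2 = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.8}     # P(x2 | x1)
p3 = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.4, (1, 1): 0.6}     # P(x3 | x1)
p4 = {(0, 0): 0.5, (1, 0): 0.5, (0, 1): 0.1, (1, 1): 0.9}     # P(x4 | x2)
p5 = {(0, 0): 0.8, (1, 0): 0.2, (0, 1): 0.3, (1, 1): 0.7}     # P(x5 | x3)
p6 = {(0, 0, 0): 0.9, (1, 0, 0): 0.1, (0, 1, 0): 0.6, (1, 1, 0): 0.4,
      (0, 0, 1): 0.5, (1, 0, 1): 0.5, (0, 1, 1): 0.2, (1, 1, 1): 0.8}  # P(x6 | x2, x5)

def joint(x1, x2, x3, x4, x5, x6):
    """P(x1:6) via the factorization P(x1)P(x2|x1)P(x3|x1)P(x4|x2)P(x5|x3)P(x6|x2,x5)."""
    return (p1[(x1,)] * p2[(x2, x1)] * p3[(x3, x1)] *
            p4[(x4, x2)] * p5[(x5, x3)] * p6[(x6, x2, x5)])

# The factored joint sums to 1 over all 2^6 assignments.
print(round(sum(joint(*xs) for xs in itertools.product((0, 1), repeat=6)), 6))  # 1.0

# Storage: a full joint table needs 2^6 = 64 entries; the CPTs above hold
# 2 + 4*4 + 8 = 26 numbers (and only half of those are free parameters).
```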

Page 18

Bayes Net Example: Hidden Markov Models

(Figure: HMM graphical model, with hidden tags y1, . . . , yN and observed words x1, . . . , xN.)

I POS-tagging example: a sentence of length N, where each word xi is assigned one of m POS tags yi.

I Factorization of the joint into local conditional probabilities (a toy numerical example follows this list):

P(x1:N, y1:N) = P(y1) P(x1|y1) ∏i=2..N P(xi|yi) P(yi|yi−1)

I Fan-in = 2
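As a concrete illustration of this factorization, here is a small Python sketch (not part of the lecture) with a two-tag, two-word toy HMM. All the probability tables are made-up assumptions, and the brute-force maximization over tag sequences stands in for what Viterbi would do efficiently.

```python
import itertools

# Toy HMM with m = 2 tags {N(oun), V(erb)} and a 2-word vocabulary;
# all probabilities are invented for illustration.
tags = ["N", "V"]
p_init  = {"N": 0.7, "V": 0.3}                                    # P(y1)
p_trans = {("N", "N"): 0.4, ("V", "N"): 0.6,                      # P(yi | yi-1)
           ("N", "V"): 0.8, ("V", "V"): 0.2}
p_emit  = {("john", "N"): 0.9, ("hit", "N"): 0.1,                 # P(xi | yi)
           ("john", "V"): 0.2, ("hit", "V"): 0.8}

def joint(xs, ys):
    """P(x1:N, y1:N) under the HMM factorization."""
    p = p_init[ys[0]] * p_emit[(xs[0], ys[0])]
    for i in range(1, len(xs)):
        p *= p_trans[(ys[i], ys[i - 1])] * p_emit[(xs[i], ys[i])]
    return p

sentence = ["john", "hit"]
# Brute-force "decoding": pick the tag sequence maximizing the joint
# (Viterbi does this efficiently; m^N enumeration is fine only for tiny N).
best = max(itertools.product(tags, repeat=len(sentence)),
           key=lambda ys: joint(sentence, ys))
print(best)   # ('N', 'V') with these made-up numbers
```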

Page 19

Independence queries

I Using local independencies, we can infer global independencies.
I In general this is a hard problem, but given a Bayes net there is an efficient algorithm (the Bayes Ball algorithm) for listing all conditional independence relations that must be true according to the graph.
I More independence relations are possible for specific parameter choices.
I The graph represents a family of joint distributions that satisfy these independence relations.

Page 20

Inference

I Estimate the values of hidden variables from observed ones.
I From causes to effects: given the parent node, how likely are we to observe the child node? Read off the conditional probability.
I From effects to causes: given the child node, how do we infer the ancestor?
I Use Bayes rule:

P(c|e1:N) = P(e1:N|c) P(c) / P(e1:N)

I Naive Bayes classifier: the effects are conditionally independent given the cause, so (a small numerical sketch follows)

P(c|e1:N) ∝ P(c) ∏i=1..N P(ei|c)
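The following minimal sketch (illustrative only, with invented priors and effect probabilities) shows the naive Bayes computation above: multiply the prior by the per-effect likelihoods and normalize.

```python
# Naive Bayes posterior over a cause c given effects e1:N,
# computed as P(c) * prod_i P(ei | c), then normalized.
priors = {"flu": 0.1, "cold": 0.3, "healthy": 0.6}          # P(c), made up
p_effect = {                                                # P(ei present | c), made up
    "fever": {"flu": 0.9, "cold": 0.3, "healthy": 0.01},
    "cough": {"flu": 0.8, "cold": 0.7, "healthy": 0.05},
}

def posterior(observed_effects):
    """Return P(c | e1:N) for every cause c."""
    scores = {c: priors[c] for c in priors}
    for e in observed_effects:
        for c in scores:
            scores[c] *= p_effect[e][c]
    z = sum(scores.values())                 # marginal likelihood P(e1:N)
    return {c: s / z for c, s in scores.items()}

print(posterior(["fever", "cough"]))
# -> "flu" has the highest posterior with these made-up tables
```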

Page 21

Quick Medical Reference-DT Bayes Net

Approach 2: build a generative model and use Bayes' rule to invert

• We can build a causal model of how diseases cause symptoms, and use Bayes' rule to invert:

P(c|e1:N) = P(e1:N|c) P(c) / P(e1:N) = P(e1:N|c) P(c) / ∑c′ P(e1:N|c′) P(c′)

• In words: posterior = (class-conditional likelihood × prior) / marginal likelihood

Naive Bayes classifier

• Simplest generative model: assume the effects are conditionally independent given the cause, Ei ⊥ Ej | C, so

P(E1:N|C) = ∏i=1..N P(Ei|C)

• Hence P(c|e1:N) ∝ P(e1:N|c) P(c) = ∏i=1..N P(ei|c) P(c)

(Figure: the naive Bayes graphical model, a cause node C with effect nodes E1, . . . , EN as children.)

• This model is extremely widely used (e.g., for document classification, spam filtering, etc.) even when the observations are not independent.

P(C = cancer | E1 = spots, E2 = vomiting, E3 = fever) ∝ P(spots|cancer) P(vomiting|cancer) P(fever|cancer) P(C = cancer)

QMR-DT Bayes net (Quick Medical Reference, decision theoretic): a two-layer network of 570 disease nodes (e.g. heart disease, flu, botulism) and 4075 symptom nodes (e.g. WBC count, sex = F, abdominal pain).

Slide from lecture notes of K. Murphy

Page 22

Learning

I Parameter Learning: given the graph structure, how do we get the conditional probability distributions P(Xi | XΠi)?
I Parameters θ are unknown constants
I Given a fully observed training sample S
I Find their estimates by maximizing the (penalized) log-likelihood (a counting-based sketch follows this list):

θ̂ = argmaxθ [ log P(S|θ) − λ R(θ) ]   (drop the −λR(θ) term for the unpenalized estimate)

I Frequentist approach (versus Bayesian approach)
I Structure Learning: given a fully observed training sample S, how do we get the graph G and its parameters θ?
I Find the estimators of the unknown constants G and θ by maximizing the log-likelihood
I Iterate between G and θ steps
I Consider the case where some variables are hidden
I Consider the case where no polynomial-time exact algorithms exist
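For fully observed discrete data, maximum-likelihood parameter learning reduces to counting. The sketch below is a hypothetical two-node net A → B with made-up data, not an example from the lecture; it estimates each CPT P(x | parents) by normalized counts.

```python
from collections import Counter, defaultdict

# Hypothetical structure and data, used only to illustrate counting-based MLE.
structure = {"A": [], "B": ["A"]}                      # node -> list of parents
data = [{"A": 0, "B": 1}, {"A": 0, "B": 0}, {"A": 1, "B": 1},
        {"A": 1, "B": 1}, {"A": 0, "B": 1}]

def fit_cpts(structure, data):
    """Estimate P(node | parents) for every node by normalized counts."""
    cpts = {}
    for node, parents in structure.items():
        counts = defaultdict(Counter)                  # parent config -> child value counts
        for row in data:
            pa = tuple(row[p] for p in parents)
            counts[pa][row[node]] += 1
        cpts[node] = {pa: {v: n / sum(c.values()) for v, n in c.items()}
                      for pa, c in counts.items()}
    return cpts

print(fit_cpts(structure, data))
# e.g. P(A=0) = 0.6, P(B=1|A=0) = 2/3, P(B=1|A=1) = 1.0
```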

Page 23

Markov Random Fields (MRFs)

I aka Undirected graphical models, Markov networks
I Nodes represent random variables; undirected edges represent (possible) symmetric dependencies.
I A node is conditionally independent of its non-neighbors given its neighbors.
I Separation: XA ⊥ XC | XB if every path from a node in XA to a node in XC includes at least one node in XB.

Even more structure

• Surprisingly, once you have specified the basic conditional independencies, there are other ones that follow from those.
• In general, it is a hard problem to say which extra CI statements follow from a basic set. However, in the case of DAGMs, we have an efficient way of generating all CI statements that must be true given the connectivity of the graph.
• This involves the idea of d-separation in a graph.
• Notice that for specific (numerical) choices of factors at the nodes there may be even more conditional independencies, but we are only concerned with statements that are always true of every member of the family of distributions, no matter what specific factors live at the nodes.
• Remember: the graph alone represents a family of joint distributions consistent with its CI assumptions, not any specific distribution.

Explaining Away

(Figure: the v-structure X → Y ← Z.)

• Q: When we condition on y, are x and z independent?

P(x, y, z) = P(x) P(z) P(y|x, z)

• x and z are marginally independent, but given y they are conditionally dependent.
• This important effect is called explaining away (Berkson's paradox).
• For example, flip two coins independently; let x = coin 1, z = coin 2. Let y = 1 if the coins come up the same and y = 0 if different.
• x and z are independent, but if I tell you y, they become coupled!

Undirected Models

• Also graphs with one node per random variable and edges that connect pairs of nodes, but now the edges are undirected.
• Semantics: every node is conditionally independent from its non-neighbours given its neighbours, i.e. xA ⊥ xC | xB if every path between xA and xC goes through xB.
• Can model symmetric interactions that directed models cannot.
• aka Markov Random Fields, Markov Networks, Boltzmann Machines, Spin Glasses, Ising Models

Simple Graph Separation

• In undirected models, simple graph separation (as opposed to d-separation) tells us about conditional independencies.
• xA ⊥ xC | xB if every path between xA and xC is blocked by some node in xB.
• "Markov Ball" algorithm: remove xB and see if there is any path from xA to xC.
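The "Markov Ball" test above is just a reachability check. Here is a minimal sketch (the graph and node sets are made up for illustration): delete the conditioning nodes, then search for a remaining path.

```python
from collections import deque

# Made-up undirected graph: edges 1-2, 1-3, 2-4, 2-5, 3-5, 5-6.
graph = {1: {2, 3}, 2: {1, 4, 5}, 3: {1, 5}, 4: {2}, 5: {2, 3, 6}, 6: {5}}

def separated(graph, xA, xB, xC):
    """True iff every path from xA to xC is blocked by xB (simple graph separation)."""
    blocked = set(xB)
    frontier = deque(n for n in xA if n not in blocked)
    seen = set(frontier)
    while frontier:
        node = frontier.popleft()
        if node in xC:
            return False                       # found an unblocked path
        for nb in graph[node] - blocked - seen:
            seen.add(nb)
            frontier.append(nb)
    return True

print(separated(graph, {1}, {2, 3}, {6}))      # True: {2, 3} blocks all paths
print(separated(graph, {1}, {2}, {6}))         # False: the path 1-3-5-6 remains
```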

Page 24

Parameterization of MRFs

I Goal: represent the joint probability in terms of local functions
I Clique: a fully connected subset of nodes

P(X1:N) = (1/Z) ∏c∈C φc(xc),    Z = ∑X ∏c∈C φc(xc)

I C is the set of maximal cliques
I φ are positive potential functions
I Z is the partition function
I No probabilistic interpretation of the potential functions (they cannot be conditionals P(xi | xΠi) or marginals P(xi, xΠi))

I No requirement that the unnormalized product of potentials be a probability distribution. Non-probabilistic learning methods?
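A minimal sketch of this parameterization (with invented potential values, echoing the 1.5/0.2 entries of the clique-potential example on the next page): the joint is the product of clique potentials divided by the partition function Z, computed here by brute-force enumeration, which is only feasible for tiny models.

```python
import itertools

# MRF over three binary variables with maximal cliques {x1, x2} and {x2, x3}.
# Potential values are illustrative assumptions; they favor agreeing neighbors.
phi_12 = {(0, 0): 1.5, (0, 1): 0.2, (1, 0): 0.2, (1, 1): 1.5}
phi_23 = {(0, 0): 1.5, (0, 1): 0.2, (1, 0): 0.2, (1, 1): 1.5}

def unnormalized(x1, x2, x3):
    return phi_12[(x1, x2)] * phi_23[(x2, x3)]

# Partition function Z: sum of the unnormalized product over all assignments.
Z = sum(unnormalized(*xs) for xs in itertools.product((0, 1), repeat=3))

def joint(x1, x2, x3):
    return unnormalized(x1, x2, x3) / Z

print(Z)                    # partition function
print(joint(0, 0, 0))       # agreeing configurations get high probability
```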

Page 25

MRF example

Conditional Parameterization?

• In directed models, we started with p(X) = ∏i p(xi | xΠi) and we derived the d-separation semantics from that.
• Undirected models: we have the semantics, we need a parameterization.
• What about this "conditional parameterization"?

p(X) = ∏i p(xi | xneighbours(i))

• Good: a product of local functions. Good: each one has a simple conditional interpretation. Bad: the local functions cannot be arbitrary, but must agree properly in order to define a valid distribution.

Marginal Parameterization?

• OK, what about this "marginal parameterization"?

p(X) = ∏i p(xi, xneighbours(i))

• Good: a product of local functions. Good: each one has a simple marginal interpretation. Bad: only very few pathological marginals on overlapping nodes can be multiplied to give a valid joint.

Clique Potentials

• Whatever factorization we pick, we know that only connected nodes can be arguments of a single local function.
• A clique is a fully connected subset of nodes.
• Thus, consider using a product of positive clique potentials:

P(X) = (1/Z) ∏cliques c φc(xc),    Z = ∑X ∏cliques c φc(xc)

• A product of functions that don't need to agree with each other.
• Still factors in the way that the graph semantics demand.
• Without loss of generality we can restrict ourselves to maximal cliques. (Why?)

(Figure: the six-node undirected graph X1-X6 with example clique potential tables.)

Examples of Clique Potentials

(Figure: a chain Xi−1, Xi, Xi+1 with pairwise potential tables over the values {−1, 1}; example entries are 0.2 and 1.5.)

I Maximal cliques: {X1, X2}, {X1, X3}, {X2, X4}, {X3, X5}, {X2, X5, X6}

Page 26

Conversion of MRFs and Bayes Nets

I Cannot always convert MRFs to Bayes nets and vice versa:
a) No directed model represents x ⊥ y | {w, z} and w ⊥ z | {x, y}, and only those.
b) No undirected model represents x ⊥ y, and only that.

Expressive Power

• Can we always convert directed ↔ undirected? No.

(Figure: (a) a four-node graph over W, X, Y, Z; (b) a three-node graph over X, Y, Z.)

• No directed model can represent these and only these independencies: x ⊥ y | {w, z}, w ⊥ z | {x, y}.
• No undirected model can represent these and only these independencies: x ⊥ y.

What’s Inside the Nodes/Cliques?

•We’ve focused a lot on the structure of the graphs in directed andundirected models. Now we’ll look at specific functions that canlive inside the nodes (directed) or on the cliques (undirected).

• For directed models we need prior functions p(xi) for root nodesand parent-conditionals p(xi|x!i) for interior nodes.

• For undirected models we need clique potentials "C(xC) on themaximal cliques (or log potentials/energies HC(xC)).

•We’ll consider various types of nodes: binary/discrete (categorical),continuous, interval, and integer counts.

•We’ll see some basic probability models (parametrized families ofdistributions); these models live inside nodes of directed models.

•We’ll also see a variety of potential/energy functions which takemultiple node values as arguments and return a scalarcompatibility; these live on the cliques of undirected models.

Probability Tables & CPTs

• For discrete (categorical) variables, the most basic parametrizationis the probability table which lists p(x = kth value).

• Since PTs must be nonnegative and sum to 1, for k-ary nodesthere are k # 1 free parameters.

• If a discrete node has discrete parent(s) we make one table for eachsetting of the parents: this is a conditional probability table or CPT.

(Figure: the six-node DAG X1-X6 with its conditional probability tables.)

Exponential Family

• For a numeric random variable x,

p(x|η) = h(x) exp{ η⊤T(x) − A(η) } = (1/Z(η)) h(x) exp{ η⊤T(x) }

is an exponential family distribution with natural parameter η.
• The function T(x) is a sufficient statistic.
• The function A(η) = log Z(η) is the log normalizer.
• Key idea: all you need to know about the data in order to estimate the parameters is captured in the summarizing function T(x).
• Examples: Bernoulli, binomial/geometric/negative-binomial, Poisson, gamma, multinomial, Gaussian, ...
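As a quick sanity check of these definitions (my example, not the slides'), the Bernoulli(μ) distribution has natural parameter η = log(μ/(1−μ)), sufficient statistic T(x) = x, h(x) = 1, and log normalizer A(η) = log(1 + e^η). The sketch below verifies this numerically.

```python
import math

# Bernoulli(mu) written in exponential-family form
#   p(x|eta) = h(x) exp(eta*T(x) - A(eta)),  T(x) = x,  h(x) = 1.
def bernoulli_pmf(x, mu):
    return mu if x == 1 else 1.0 - mu

def bernoulli_expfam(x, eta):
    A = math.log(1.0 + math.exp(eta))        # log normalizer A(eta) = log Z(eta)
    return math.exp(eta * x - A)

mu = 0.3
eta = math.log(mu / (1.0 - mu))              # natural parameter
for x in (0, 1):
    print(x, bernoulli_pmf(x, mu), round(bernoulli_expfam(x, eta), 10))
# both columns agree: 0.7 for x = 0 and 0.3 for x = 1
```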

Page 27

Topics in Graphical Models

I Representation: What is the graphical model?
I Directed/undirected graphs
I Independence properties, Markov properties
I Inference: How can we use these models to efficiently answer probabilistic queries?
I Exact inference (hidden, fully observed), e.g. the Viterbi and Forward-Backward algorithms (a forward-pass sketch follows this list)
I Approximate inference, e.g. loopy belief propagation, sampling, variational methods
I Learning:
I Parameter learning, e.g. the Expectation-Maximization algorithm
I Structure learning, e.g. Structural EM
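As referenced in the inference item above, here is a minimal sketch of the forward pass of the Forward-Backward algorithm, reusing the made-up toy HMM numbers from the earlier POS-tagging sketch. It computes the likelihood P(x1:N) in O(N·m^2) time by dynamic programming instead of summing over all m^N tag sequences.

```python
# Toy HMM parameters (all invented, same as the earlier POS-tagging sketch).
states = ["N", "V"]
p_init  = {"N": 0.7, "V": 0.3}                                    # P(y1)
p_trans = {("N", "N"): 0.4, ("V", "N"): 0.6,                      # P(yi | yi-1)
           ("N", "V"): 0.8, ("V", "V"): 0.2}
p_emit  = {("john", "N"): 0.9, ("hit", "N"): 0.1,                 # P(xi | yi)
           ("john", "V"): 0.2, ("hit", "V"): 0.8}

def forward(xs):
    """Return P(x1:N) via alpha_i(y) = P(x1:i, yi = y)."""
    alpha = {y: p_init[y] * p_emit[(xs[0], y)] for y in states}
    for x in xs[1:]:
        alpha = {y: p_emit[(x, y)] *
                    sum(alpha[yp] * p_trans[(y, yp)] for yp in states)
                 for y in states}
    return sum(alpha.values())

print(forward(["john", "hit"]))   # 0.342 with these toy numbers
```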

Page 28

What is new then?

I Graphical models have been studied in a probabilistic framework, generally using the generative paradigm.
I Recent research married discriminative learning methods (SVMs, boosting, etc.) with the graphical models literature.
I Result: learning that uses the structure of the input and output spaces jointly in a discriminative framework.
I Advantages of discriminative methods:
I Implicit data representation via kernels
I Explicit feature induction
I Using one of the above, the ability to learn efficiently in high-dimensional feature spaces
I Leading to improved accuracy
I Advantages of graphical models:
I A powerful representational framework to capture complex structures
I Leading to reliable predictions
I Efficient learning and inference algorithms

Page 29

What is new?

I Representation: high-dimensional feature spaces
I Notion of cost, e.g. Hamming loss
I Incorporating the cost function into the optimization
I Various objective functions and optimization methods

Page 30

Outline

I Graphical Models
I Representation: directed/undirected, independence
I Inference: exact/approximate, fully observed/partially observed
I Parameter learning
I Structure learning

I Discriminative Methods for GMs
I Conditional Random Fields (CRFs)
I Perceptron learning on random fields
I Boosting approaches
I Support vector machine approaches
I Kernel CRFs
I Decompositional approaches