
  • Generative and Discriminative Approaches to Graphical Models

    CMSC 35900 Topics in AI, Lecture 1

    Yasemin Altun

    January 3, 2007

  • ADMINISTRATIVE STUFF

    - Lectures: Wednesday 3:30-6:00, TTI Room 201
    - Office hours: Wednesday 10am-Noon
    - No textbook; reference reading will be handed out.
    - No homework, no exam
    - Presentation/Discussions: 40% of grade
    - Final Project: 60% of grade
      - Apply one of the methods discussed to your research area (e.g. NLP, vision, computational biology)
      - Add new features to SVM-struct or another available package
      - Theoretical work
    - URL: http://ttic.uchicago.edu/~altun/Teaching/CS359/index.html
    - Class mailing list

  • Prerequisites

    - Familiarity with:
      - Probability: random variables, densities, expectations, joint, marginal, and conditional probabilities, Bayes rule, independence
      - Linear algebra
      - Optimization: Lagrangian methods

  • Traditional Prediction Problems

    - Supervised learning: given input-output pairs, find a function that predicts outputs of new inputs
      - Binary classification: label set {0, 1}
      - Multiclass classification: label set {0, . . . , m}
      - Regression: label set ℝ
    - Unsupervised learning: given only inputs, discover some structure, e.g. clusters or outliers
    - Semi-supervised learning: given a few input-output pairs and many inputs, find a function to predict outputs of new inputs
    - Transduction: given a few input-output pairs and many inputs, find a function that predicts well on the given unlabeled inputs

  • Key Components

    - Four aspects of learning:
      - Representation
      - Parameterization and the hypothesis space
      - Learning objective
      - Optimization method
    - Different settings lead to different learning methods
    - For prediction tasks, state-of-the-art methods include Support Vector Machines, Boosting, and Gaussian Processes

  • Discriminative Learning

    - All of these methods belong to the discriminative learning paradigm.
    - Notation: treat inputs and outputs as random variables, X for the input with instantiation x, Y for the output with instantiation y; p(x) denotes the probability P(X = x).
    - Given an input x, they discriminate among the target labels y, e.g. via p(y|x).
    - Since they condition on x, they can treat arbitrarily complex objects as input.
    - In contrast, a generative approach
      - estimates the joint distribution p(x, y),
      - factorizes it as p(x, y) = p(y) p(x|y),
      - where p(x|y) describes how the input is generated given the target label,
      - e.g. the Naive Bayes classifier.

  • Structured (Output) Prediction

    - Traditionally, discriminative methods predict one simple variable.
    - In real-life problems this is rarely the case.
    - Not taking dependencies between outputs into account is an important shortcoming.
    - Domains: Natural Language Processing, Speech, Information Retrieval, Computer Vision, Bioinformatics, Computational Economics

  • Examples

    - Domain: Natural Language Processing
    - Application: Part-of-speech tagging
    - Input: a sequence of words
    - Output: a label for each word (noun, verb, adjective, etc.)

      John/Noun  hit/Vb  the/Det  ball/Noun

  • Examples

    - Domain: Computational Biology
    - Application: Protein secondary structure prediction
    - Input: an amino-acid sequence
      AAY KSHGSGDYGDHDVGHPTPGDPWVEPDYGINVYH
    - Output: H/E/- labels for each residue
      HHHH-------EEEEEEEE---------HHHHH----

  • Examples

    - Domain: Computer Vision
    - Application: Identifying joint angles of the human body

  • Examples

    - Domain: Natural Language Processing
    - Application: Parsing
    - Input: a sentence (sequence of words)
    - Output: a parse tree (a configuration of grammar terminals/non-terminals)

  • Examples

    - Domain: Information Retrieval
    - Application: Text classification with taxonomies
    - Input: a document
    - Output: a (leaf) class from the taxonomy

  • Possible approaches

    - Ignore the dependencies and use a standard learning method for each component. Bad!
    - Treat all components jointly and regard every possible labeling as its own class:
      - Input: word1, . . . , wordn
      - Each word can take a label from {1, . . . , m}
      - This is a multiclass classification problem with m^n classes (e.g. with m = 45 tags and n = 20 words, around 10^33 classes). Hopeless!
    - Use graphical models!

  • Graphical Models

    - A framework for multivariate statistical models
    - Widespread domains and ubiquitous applications
    - A marriage of graph theory and probability theory
    - The graph
      - provides a means to build complex systems from simple parts,
      - encodes the dependencies between variables in its structure,
      - supports efficient algorithms for learning and inference.
    - Bayesian networks: graphical models with directed acyclic graphs
    - Markov random fields (Markov networks): graphical models with undirected graphs

  • Bayes(ian) Net(work)s

    - aka Belief Networks, Directed Graphical Models
    - Each node is a random variable; shaded nodes denote fixed (observed) values.
    - Edges represent dependency (causation).
    - No directed cycles: Bayes nets are DAGs.
    - Local Markov property: a node is conditionally independent of its non-descendants given its parents.

    Directed Graphical Models

    • Consider directed acyclic graphs over n variables.
    • Each node has a (possibly empty) set of parents πi.
    • Each node maintains a function fi(xi; xπi) such that fi > 0 and ∑_{xi} fi(xi; xπi) = 1 for all πi.
    • Define the joint probability to be

        P(x1, x2, . . . , xn) = ∏_i fi(xi; xπi)

    • Even with no further restriction on the fi, it is always true that fi(xi; xπi) = P(xi|xπi), so we will just write

        P(x1, x2, . . . , xn) = ∏_i P(xi|xπi)

    • Factorization of the joint in terms of local conditional probabilities: exponential in the "fan-in" of each node instead of in the total number of variables n.

    Conditional Independence in DAGs

    • If we order the nodes in a directed graphical model so that parents always come before their children in the ordering, then the graphical model implies the following about the distribution:

        {xi ⊥ xπ̃i | xπi} for all i,

      where xπ̃i are the nodes coming before xi that are not its parents.
    • In other words, the DAG is telling us that each variable is conditionally independent of its non-descendants given its parents.

    • Such an ordering is called a “topological” ordering.

    Example DAG

    • Consider this six-node network over X1, . . . , X6 (figure: the DAG with a conditional probability table at each node). The joint probability is now:

        P(x1, x2, x3, x4, x5, x6) = P(x1) P(x2|x1) P(x3|x1) P(x4|x2) P(x5|x3) P(x6|x2, x5)

      (A numeric sketch of this factorization follows below.)
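    The following is a minimal Python sketch of this factorization, assuming binary variables; the tables p1 . . . p6 and all of their numbers are hypothetical, not from the lecture. It simply multiplies the local conditionals P(xi | parents(xi)) and checks that the 2^6 = 64 joint entries sum to 1.

      # Hypothetical CPTs for the six-node example (binary variables).
      from itertools import product

      p1 = {0: 0.6, 1: 0.4}                                    # P(x1)
      p2 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}          # P(x2|x1)
      p3 = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.9, 1: 0.1}}          # P(x3|x1)
      p4 = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.4, 1: 0.6}}          # P(x4|x2)
      p5 = {0: {0: 0.3, 1: 0.7}, 1: {0: 0.6, 1: 0.4}}          # P(x5|x3)
      p6 = {(0, 0): {0: 0.9, 1: 0.1}, (0, 1): {0: 0.5, 1: 0.5},
            (1, 0): {0: 0.4, 1: 0.6}, (1, 1): {0: 0.1, 1: 0.9}}  # P(x6|x2,x5)

      def joint(x1, x2, x3, x4, x5, x6):
          """P(x1:6) = P(x1)P(x2|x1)P(x3|x1)P(x4|x2)P(x5|x3)P(x6|x2,x5)."""
          return (p1[x1] * p2[x1][x2] * p3[x1][x3] *
                  p4[x2][x4] * p5[x3][x5] * p6[(x2, x5)][x6])

      # Sanity check: the 64 joint entries sum to 1.
      total = sum(joint(*xs) for xs in product([0, 1], repeat=6))
      print(round(total, 10))  # 1.0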

    Missing Edges

    • Key point about directed graphical models: missing edges imply conditional independence.
    • Remember that by the chain rule we can always write the full joint as a product of conditionals, given an ordering:

        P(x1, x2, x3, x4, . . .) = P(x1) P(x2|x1) P(x3|x1, x2) P(x4|x1, x2, x3) . . .

    • If the joint is represented by a DAGM, then some of the conditioning variables on the right-hand sides are missing. This is equivalent to enforcing conditional independence.
    • Start with the "idiot's graph": each node has all previous nodes in the ordering as its parents.
    • Now remove edges to get your DAG. Removing an edge into node i eliminates an argument from the conditional probability factor p(xi|x1, x2, . . . , xi−1).

  • Chain Rule for Bayes Nets


    P(x1:N) = P(x1) P(x2|x1) P(x3|x1, x2) . . .
            = ∏_{i=1}^{N} P(xi | x1:i−1)
            = ∏_{i=1}^{N} P(xi | xΠi)

    - Factorization of the joint P(x1:N) into local conditional probabilities

  • Compact Representation


    - Factorization: P(x1:6) = P(x1) P(x2|x1) P(x3|x1) P(x4|x2) P(x5|x3) P(x6|x2, x5)
    - If each xi takes one of m values and K = max number of parents (fan-in), this reduces the m^N entries of the full joint to on the order of N·m^K local terms.
    - From exponential in N to linear in N (exponential only in the fan-in K); see the small check below.
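    As a quick sanity check of this counting argument, the sketch below assumes binary variables (m = 2) and a maximum fan-in of K = 2, and prints the two quantities the slide compares, m^N versus N·m^K, for a few values of N.

      # Full joint table size m^N versus the order-N*m^K factored representation.
      m, K = 2, 2
      for N in (6, 10, 20, 30):
          print(N, m ** N, N * m ** K)   # exponential vs. linear growth in N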

  • Bayes Net Example: Hidden Markov Models

    !"#$%&!"#&!"#!%& !"#$%&!"#&!"#!%&

    '"#$%&'"#&'"#!%& '"#$%&'"#&'"#!%&

    (&)*++ ,&)-./)(01)2(34(#4506

    - POS-tagging example: each word xi is assigned one of m POS tags yi; assume a sentence of length N.
    - Factorization of the joint into local conditional probabilities (a sketch follows below):

        P(x1:N, y1:N) = P(y1) P(x1|y1) ∏_{i=2}^{N} P(xi|yi) P(yi|yi−1)

    - Fan-in = 2
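    Below is a minimal sketch of this HMM factorization on the earlier "John hit the ball" example. The tag set and all probability tables (init, trans, emit) are invented for illustration; only the form of the product comes from the slide.

      init = {"Noun": 0.4, "Verb": 0.2, "Det": 0.4}                 # P(y1)
      trans = {"Noun": {"Noun": 0.2, "Verb": 0.6, "Det": 0.2},      # P(yi|yi-1)
               "Verb": {"Noun": 0.3, "Verb": 0.1, "Det": 0.6},
               "Det":  {"Noun": 0.8, "Verb": 0.1, "Det": 0.1}}
      emit = {"Noun": {"John": 0.5, "ball": 0.5},                   # P(xi|yi)
              "Verb": {"hit": 1.0},
              "Det":  {"the": 1.0}}

      def hmm_joint(words, tags):
          """P(x1:N, y1:N) = P(y1)P(x1|y1) * prod_i P(xi|yi)P(yi|yi-1)."""
          p = init[tags[0]] * emit[tags[0]].get(words[0], 0.0)
          for i in range(1, len(words)):
              p *= trans[tags[i - 1]][tags[i]] * emit[tags[i]].get(words[i], 0.0)
          return p

      print(hmm_joint(["John", "hit", "the", "ball"],
                      ["Noun", "Verb", "Det", "Noun"]))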

  • Independence queries

    - Using the local independences, we can infer global independencies.
    - In general this is a hard problem, but given a Bayes net there is an efficient algorithm (the Bayes Ball algorithm) to list all conditional independence relations that must hold according to the graph.
    - More independence relations may hold in a particular distribution.
    - The graph represents a family of joint distributions that satisfy these independence relations.

  • Inference

    - Estimate the values of hidden variables from observed ones.
    - From causes to effects: given the parent node, how likely are we to observe the child node? Read off the conditional probability.
    - From effects to causes: given the child node, how do we infer its ancestors?
    - Use Bayes rule:

        P(c|e1:N) = P(e1:N|c) P(c) / P(e1:N)

    - Naive Bayes classifier: the effects are conditionally independent given the cause, so (a sketch follows below)

        P(c|e1:N) ∝ P(c) ∏_{i=1}^{N} P(ei|c)
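    A minimal sketch of this naive Bayes inversion, with hypothetical classes ("flu", "cold"), effects, and numbers: multiply the prior by the per-effect likelihoods and normalize by the marginal likelihood.

      def naive_bayes_posterior(prior, likelihood, effects):
          """prior[c] = P(c); likelihood[c][e] = P(e|c); effects = observed e1:N."""
          scores = {c: prior[c] for c in prior}
          for c in scores:
              for e in effects:
                  scores[c] *= likelihood[c][e]
          z = sum(scores.values())                      # marginal likelihood P(e1:N)
          return {c: s / z for c, s in scores.items()}  # normalized posterior

      prior = {"flu": 0.1, "cold": 0.9}
      likelihood = {"flu":  {"fever": 0.8, "cough": 0.7},
                    "cold": {"fever": 0.2, "cough": 0.6}}
      print(naive_bayes_posterior(prior, likelihood, ["fever", "cough"]))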

  • Quick Medical Reference-DT Bayes Net

    Approach 2: build a generative model and use Bayes' rule to invert

    • We can build a causal model of how diseases cause symptoms, and use Bayes' rule to invert:

        P(c|e1:N) = P(e1:N|c) P(c) / P(e) = P(e1:N|c) P(c) / ∑_{c′} P(e1:N|c′) P(c′)

    • In words:

        posterior = (class-conditional likelihood × prior) / marginal likelihood

    Naive Bayes classifier

    • Simplest generative model: assume the effects are conditionally independent given the cause, Ei ⊥ Ej | C:

        P(E1:N|C) = ∏_{i=1}^{N} P(Ei|C)

    • Hence P(c|e1:N) ∝ P(e1:N|c) P(c) = P(c) ∏_{i=1}^{N} P(ei|c)
      (Figure: cause node C with children E1, . . . , EN.)

    Naive Bayes classifier

    • This model is extremely widely used (e.g. for document classification, spam filtering, etc.) even when the observations are not independent.

        P(c|e1:N) ∝ P(e1:N|c) P(c) = P(c) ∏_{i=1}^{N} P(ei|c)

        P(C = cancer | E1 = spots, E2 = vomiting, E3 = fever) ∝
            P(spots|cancer) P(vomiting|cancer) P(fever|cancer) P(C = cancer)

    QMR-DT Bayes net (Quick Medical Reference, decision theoretic)

    (Figure: bipartite network with 570 disease nodes, e.g. heart disease, flu, botulism, and 4075 symptom/finding nodes, e.g. WBC count, sex = F, abdominal pain.)

    Slides from lecture notes of K. Murphy

  • Learning

    - Parameter learning: given the graph structure, how do we get the conditional probability distributions P(Xi|XΠi)?
      - Parameters θ are unknown constants.
      - Given a fully observed training sample S,
      - find their estimates by maximizing the (penalized) log-likelihood:

          θ̂ = argmax_θ log P(D|θ) (− λ R(θ) for the penalized version)

      - Frequentist approach (versus the Bayesian approach); a counting sketch follows below.
    - Structure learning: given a fully observed training sample S, how do we get the graph G and its parameters θ?
      - Find the estimates of the unknown constants G and θ by maximizing the log-likelihood.
      - Iterate between G-steps and θ-steps.
    - Consider the case where some variables are hidden.
    - Consider the case where no polynomial-time exact algorithms exist.

  • Markov Random Fields (MRFs)

    - aka Undirected graphical models, Markov networks
    - Nodes represent random variables; undirected edges represent (possible) symmetric dependencies.
    - A node is conditionally independent of its non-neighbors given its neighbors.
    - Separation: XA ⊥ XC | XB if every path from a node in XA to a node in XC includes at least one node in XB.

    Even more structure

    • Surprisingly, once you have specified the basic conditional independencies, there are other ones that follow from those.
    • In general, it is a hard problem to say which extra CI statements follow from a basic set. However, in the case of DAGMs, we have an efficient way of generating all CI statements that must be true given the connectivity of the graph.
    • This involves the idea of d-separation in a graph.
    • Notice that for specific (numerical) choices of factors at the nodes there may be even more conditional independencies, but we are only concerned with statements that are always true of every member of the family of distributions, no matter what specific factors live at the nodes.
    • Remember: the graph alone represents a family of joint distributions consistent with its CI assumptions, not any specific distribution.

    Explaining Away

    (Figure: v-structure x → y ← z, shown with and without conditioning on y.)

    • Q: When we condition on y, are x and z independent?

        P(x, y, z) = P(x) P(z) P(y|x, z)

    • x and z are marginally independent, but given y they are conditionally dependent.
    • This important effect is called explaining away (Berkson's paradox).
    • For example, flip two coins independently; let x = coin 1 and z = coin 2. Let y = 1 if the coins come up the same and y = 0 if they differ.

    • x and z are independent, but if I tell you y, they become coupled!
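    A small simulation of this two-coin example (sample size and random seed are arbitrary): marginally, x carries no information about z, but conditioning on y = 1 (the coins agree) makes them perfectly coupled.

      import random

      random.seed(0)
      samples = [(random.randint(0, 1), random.randint(0, 1)) for _ in range(100000)]

      # Marginal: P(z=1 | x=1) is about P(z=1) = 0.5
      z_given_x1 = [z for x, z in samples if x == 1]
      print(sum(z_given_x1) / len(z_given_x1))          # ~0.5

      # Conditioned on y=1 (coins agree): P(z=1 | x=1, y=1) = 1
      z_given_x1_y1 = [z for x, z in samples if x == 1 and x == z]
      print(sum(z_given_x1_y1) / len(z_given_x1_y1))    # 1.0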

    Undirected Models

    • Also graphs with one node per random variable and edges that connect pairs of nodes, but now the edges are undirected.
    • Semantics: every node is conditionally independent of its non-neighbours given its neighbours, i.e. xA ⊥ xC | xB if every path between xA and xC goes through xB.
      (Figure: node sets XA and XC separated by XB.)
    • Can model symmetric interactions that directed models cannot.
    • aka Markov Random Fields, Markov Networks, Boltzmann Machines, Spin Glasses, Ising Models

    Simple Graph Separation

    • In undirected models, simple graph separation (as opposed to d-separation) tells us about conditional independencies.
    • xA ⊥ xC | xB if every path between xA and xC is blocked by some node in xB.
      (Figure: node sets XA and XC separated by XB.)
    • "Markov Ball" algorithm: remove xB and see if there is any path from xA to xC (a BFS sketch follows below).
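    A minimal sketch of this "Markov Ball" check: delete the conditioning set xB from the undirected graph, then run a plain breadth-first search to see whether any path still connects xA to xC. The graph (a four-node chain) and the function name separated are made up for illustration.

      from collections import deque

      def separated(adj, A, B, C):
          """True if every path between node sets A and C passes through B."""
          blocked = set(B)
          frontier = deque(a for a in A if a not in blocked)
          seen = set(frontier)
          while frontier:
              u = frontier.popleft()
              if u in C:
                  return False                     # found a path avoiding B
              for v in adj.get(u, ()):
                  if v not in blocked and v not in seen:
                      seen.add(v)
                      frontier.append(v)
          return True

      # Chain 1 - 2 - 3 - 4: nodes 1 and 4 are separated by {2}, but not by {}.
      adj = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
      print(separated(adj, {1}, {2}, {4}))    # True
      print(separated(adj, {1}, set(), {4}))  # False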

  • Parameterization of MRFs

    - Goal: represent the joint probability in terms of local functions.
    - Clique: a fully connected subset of nodes.

        P(X1:N) = (1/Z) ∏_{c∈C} φc(xc),   Z = ∑_X ∏_{c∈C} φc(xc)

    - C is the set of maximal cliques.
    - φc are positive potential functions.
    - Z is the partition function (a brute-force sketch follows below).
    - The potential functions have no probabilistic interpretation (they need not be conditionals P(xi|xΠi) or marginals P(xi, xΠi)).
    - No requirement for P(X1:N) to be a probability distribution. Non-probabilistic learning methods?
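    A minimal sketch of this parameterization for a tiny hypothetical MRF: a single maximal clique over three binary variables with an arbitrary positive potential (the 2.0/0.5 values are made up), with the partition function Z computed by brute-force enumeration.

      from itertools import product

      def phi(x1, x2, x3):
          """An arbitrary positive potential on the clique {x1, x2, x3}."""
          return 2.0 if x1 == x2 == x3 else 0.5   # favors agreement; not a probability

      Z = sum(phi(*xs) for xs in product([0, 1], repeat=3))       # partition function

      def prob(x1, x2, x3):
          return phi(x1, x2, x3) / Z                              # P(x1:3)

      print(Z)                                                    # 7.0
      print(sum(prob(*xs) for xs in product([0, 1], repeat=3)))   # 1.0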

  • MRF example

    Conditional Parameterization?

    • In directed models, we started with p(X) = ∏_i p(xi|xπi) and we derived the d-separation semantics from that.
    • Undirected models: we have the semantics, we need a parameterization.
    • What about this "conditional parameterization"?

        p(X) = ∏_i p(xi|xneighbours(i))

    • Good: a product of local functions. Good: each one has a simple conditional interpretation. Bad: the local functions cannot be arbitrary, but must agree properly in order to define a valid distribution.

    Marginal Parameterization?

    • OK, what about this "marginal parameterization"?

        p(X) = ∏_i p(xi, xneighbours(i))

    • Good: a product of local functions. Good: each one has a simple marginal interpretation. Bad: only very few pathological marginals on overlapping nodes can be multiplied to give a valid joint.

    Clique Potentials

    • Whatever factorization we pick, we know that only connected nodes can be arguments of a single local function.
    • A clique is a fully connected subset of nodes.
    • Thus, consider using a product of positive clique potentials:

        P(X) = (1/Z) ∏_{cliques c} ψc(xc),   Z = ∑_X ∏_{cliques c} ψc(xc)

    • A product of functions that don't need to agree with each other.
    • Still factors in the way that the graph semantics demand.
    • Without loss of generality we can restrict ourselves to maximal cliques. (Why?)
      (Figure: the six-node undirected graph over X1, . . . , X6 with potential tables on its cliques.)

    Examples of Clique Potentials

    (Figure: (a) the six-node graph with its clique potential tables; (b) a chain Xi−1, Xi, Xi+1 with example pairwise potential tables over the values {−1, 1}, with entries 1.5 and 0.2.)

    - Maximal cliques: {X1, X2}, {X1, X3}, {X2, X4}, {X3, X5}, {X2, X5, X6}

  • Conversion of MRFs and Bayes Nets

    - We cannot always convert MRFs to Bayes nets and vice versa:
      a) no directed model represents x ⊥ y | {w, z} and w ⊥ z | {x, y} and only those;
      b) no undirected model represents x ⊥ y and only that.

    Expressive Power

    • Can we always convert directed ↔ undirected? No.
      (Figure: (a) an undirected graph over W, X, Y, Z; (b) a directed graph over X, Y, Z.)
    • No directed model can represent these and only these independencies: x ⊥ y | {w, z}, w ⊥ z | {x, y}.
    • No undirected model can represent these and only these independencies: x ⊥ y.

    What’s Inside the Nodes/Cliques?

    • We've focused a lot on the structure of the graphs in directed and undirected models. Now we'll look at the specific functions that can live inside the nodes (directed) or on the cliques (undirected).
    • For directed models we need prior functions p(xi) for root nodes and parent-conditionals p(xi|xπi) for interior nodes.
    • For undirected models we need clique potentials ψC(xC) on the maximal cliques (or log potentials/energies HC(xC)).
    • We'll consider various types of nodes: binary/discrete (categorical), continuous, interval, and integer counts.
    • We'll see some basic probability models (parametrized families of distributions); these models live inside the nodes of directed models.
    • We'll also see a variety of potential/energy functions which take multiple node values as arguments and return a scalar compatibility; these live on the cliques of undirected models.

    Probability Tables & CPTs

    • For discrete (categorical) variables, the most basic parametrization is the probability table, which lists p(x = k-th value).
    • Since PTs must be nonnegative and sum to 1, for k-ary nodes there are k − 1 free parameters.
    • If a discrete node has discrete parent(s), we make one table for each setting of the parents: this is a conditional probability table, or CPT.
      (Figure: the six-node example DAG with a CPT at each node.)

    Exponential Family

    • For a numeric random variable x,

        p(x|η) = h(x) exp{η⊤ T(x) − A(η)} = (1/Z(η)) h(x) exp{η⊤ T(x)}

      is an exponential family distribution with natural parameter η.
    • The function T(x) is a sufficient statistic.
    • The function A(η) = log Z(η) is the log normalizer.
    • Key idea: all you need to know about the data in order to estimate the parameters is captured in the summarizing function T(x).
    • Examples: Bernoulli, binomial/geometric/negative-binomial, Poisson, gamma, multinomial, Gaussian, ... (a Bernoulli check follows below)
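    As a small numeric check of this form, the Bernoulli distribution with mean μ can be written with T(x) = x, h(x) = 1, natural parameter η = log(μ/(1−μ)), and A(η) = log(1 + e^η); the sketch below verifies that p(x|η) recovers μ^x (1−μ)^(1−x) for μ = 0.3 (the choice of μ is arbitrary).

      import math

      def bernoulli_expfam(x, eta):
          A = math.log(1.0 + math.exp(eta))   # A(eta) = log Z(eta)
          return math.exp(eta * x - A)        # h(x) = 1, T(x) = x

      mu = 0.3
      eta = math.log(mu / (1.0 - mu))         # natural parameter
      for x in (0, 1):
          print(bernoulli_expfam(x, eta), mu**x * (1 - mu)**(1 - x))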

  • Topics in Graphical Models

    - Representation: what is the graphical model?
      - Directed/undirected graphs
      - Independence properties, Markov properties
    - Inference: how can we use these models to efficiently answer probabilistic queries?
      - Exact inference (hidden or fully observed), e.g. the Viterbi and Forward-Backward algorithms
      - Approximate inference, e.g. loopy belief propagation, sampling, variational methods
    - Learning:
      - Parameter learning, e.g. the Expectation-Maximization algorithm
      - Structure learning, e.g. Structural EM

  • What is new then?

    - Graphical models have been studied in a probabilistic framework,
    - generally using the generative paradigm.
    - Recent research married discriminative learning methods (SVMs, boosting, etc.) with the graphical models literature.
    - Result: learning that uses the structure of the input and output spaces jointly in a discriminative framework.
    - Advantages of discriminative methods:
      - Implicit data representation via kernels
      - Explicit feature induction
      - Using one of the above, the ability to learn efficiently in high-dimensional feature spaces
      - Leading to improved accuracy
    - Advantages of graphical models:
      - A powerful representational framework to capture complex structures
      - Leading to reliable predictions
      - Efficient learning and inference algorithms

  • What is new?

    - Representation: high-dimensional feature spaces
    - Notion of cost, e.g. Hamming loss (see the sketch below)
    - Incorporating the cost function into the optimization
    - Various objective functions and optimization methods
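    A minimal sketch of the Hamming loss mentioned above for structured outputs: the number (or fraction) of positions where the predicted labeling disagrees with the true one. The function name and the normalization option are illustrative.

      def hamming_loss(y_true, y_pred, normalize=True):
          assert len(y_true) == len(y_pred)
          errors = sum(t != p for t, p in zip(y_true, y_pred))
          return errors / len(y_true) if normalize else errors

      print(hamming_loss(["Noun", "Verb", "Det", "Noun"],
                         ["Noun", "Noun", "Det", "Noun"]))  # 0.25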

  • Outline

    - Graphical models
      - Representation: directed/undirected, independence
      - Inference: exact/approximate, fully observed/partially observed
      - Parameter learning
      - Structure learning
    - Discriminative methods for graphical models
      - Conditional Random Fields (CRFs)
      - Perceptron learning on random fields
      - Boosting approaches
      - Support vector machine approaches
      - Kernel CRFs
      - Decompositional approaches

    I Discriminative Methods for GMI Conditional Random Fields (CRFs)I Perceptron learning on random fieldsI Boosting approachesI Support vector machine approachesI Kernel CRFsI Decompositional approaches