Learning Graphical Models


  • Machine Learning

    Learning Graphical Models

    Eric Xing

    Lecture 12, August 14, 2010


  • Inference and Learning

    A BN M describes a unique probability distribution P.

     Typical tasks:

     Task 1: How do we answer queries about P?

     We use inference as a name for the process of computing answers to such queries.

     So far we have learned several algorithms for exact and approximate inference.

     Task 2: How do we estimate a plausible model M from data D?

    i. We use learning as a name for the process of obtaining a point estimate of M.

    ii. But Bayesians seek p(M|D), which is actually an inference problem.

    iii. When not all variables are observable, even computing a point estimate of M requires inference to impute the missing data.

  • Learning Graphical Models

    The goal:

    Given a set of independent samples (assignments of random variables), find the best (the most likely?) graphical model (both the graph and the CPDs).

    [Figure: two candidate graph structures over the variables (B, E, A, C, R), a CPT P(A | E, B) whose four (b, e) rows contain the values 0.9/0.1, 0.2/0.8, 0.01/0.99, and 0.9/0.1, and training samples (B,E,A,C,R) = (T,F,F,T,F), (B,E,A,C,R) = (T,F,T,T,F), ..., (B,E,A,C,R) = (F,T,T,T,F).]

  • Learning Graphical Models

    Scenarios:

      completely observed GMs: directed, undirected
      partially observed GMs: directed, undirected (an open research topic)

    Estimation principles:

      Maximum likelihood estimation (MLE)
      Bayesian estimation
      Maximal conditional likelihood
      Maximal "margin"

    We use learning as a name for the process of estimating the parameters, and in some cases, the topology of the network, from data.

  • ML Parameter Estimation for completely observed GMs of given structure

    [Model: Z → X]

    The data: {(z(1),x(1)), (z(2),x(2)), (z(3),x(3)), ..., (z(N),x(N))}

  • The basic idea underlying MLE

    (for now let's assume that the structure is given)

    [Graph: X1 and X2 are parents of X3]

     Likelihood:

    L(\theta; X) = p(X \mid \theta) = p(X_1 \mid \theta_1)\, p(X_2 \mid \theta_2)\, p(X_3 \mid X_1, X_2; \theta_3)

     Log-likelihood:

    \ell(\theta; X) = \log p(X_1 \mid \theta_1) + \log p(X_2 \mid \theta_2) + \log p(X_3 \mid X_1, X_2; \theta_3)

     Data log-likelihood:

    \ell(\theta; D) = \sum_n \log p(x_n \mid \theta)
                    = \sum_n \log p(x_{n,1} \mid \theta_1) + \sum_n \log p(x_{n,2} \mid \theta_2) + \sum_n \log p(x_{n,3} \mid x_{n,1}, x_{n,2}; \theta_3)

     MLE:

    \{\theta_1, \theta_2, \theta_3\}_{MLE} = \arg\max_\theta \ell(\theta; D)

    \theta_1^* = \arg\max \sum_n \log p(x_{n,1} \mid \theta_1), \quad
    \theta_2^* = \arg\max \sum_n \log p(x_{n,2} \mid \theta_2), \quad
    \theta_3^* = \arg\max \sum_n \log p(x_{n,3} \mid x_{n,1}, x_{n,2}; \theta_3)
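Because the data log-likelihood decomposes into one term per CPD, each CPD can be fit independently from its own counts. Below is a minimal sketch of that decomposition for the three-node example above, assuming binary variables and a made-up toy dataset (none of the numbers come from the slides):

```python
import numpy as np

# Toy fully observed samples for the graph X1 -> X3 <- X2; each row is (x1, x2, x3).
data = np.array([
    [0, 1, 1],
    [1, 0, 1],
    [1, 1, 1],
    [0, 0, 0],
    [1, 0, 0],
])

# MLE for the root nodes: empirical frequencies.
theta1 = data[:, 0].mean()           # estimate of P(X1 = 1)
theta2 = data[:, 1].mean()           # estimate of P(X2 = 1)

# MLE for the CPD P(X3 = 1 | X1, X2): a separate count ratio per parent configuration.
theta3 = np.zeros((2, 2))
for a in (0, 1):
    for b in (0, 1):
        rows = data[(data[:, 0] == a) & (data[:, 1] == b)]
        if len(rows) > 0:
            theta3[a, b] = rows[:, 2].mean()

print(theta1, theta2)
print(theta3)
```

Each parameter block maximizes only its own sum of log-terms, mirroring the term-by-term arg max above.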

  • Example 1: conditional Gaussian

    The completely observed model: Z → X

     Z is a class indicator vector:

    Z = [Z^1, Z^2, \ldots, Z^M]^T, \quad Z^m \in \{0, 1\}, \quad \sum_m Z^m = 1

    p(z) = \pi_1^{z^1} \pi_2^{z^2} \cdots \pi_M^{z^M} = \prod_m (\pi_m)^{z^m}

    (all except one of these terms will be one), and a datum is in class m w.p. \pi_m.

     X is a conditional Gaussian variable with a class-specific mean:

    p(x \mid z^m = 1, \mu, \sigma) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\!\left\{ -\frac{(x - \mu_m)^2}{2\sigma^2} \right\}

    i.e.  p(x \mid z, \mu, \sigma) = \prod_m N(x; \mu_m, \sigma)^{z^m}
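To make the generative story concrete, here is a small sampling sketch of this model; the number of classes, mixture weights, and class means are made-up values, not taken from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters: M = 3 classes with a shared unit variance.
pi = np.array([0.5, 0.3, 0.2])     # class priors pi_m
mu = np.array([-2.0, 0.0, 3.0])    # class-specific means mu_m
sigma = 1.0

N = 1000
z = rng.choice(len(pi), size=N, p=pi)      # class indicator, stored as an index
x = rng.normal(loc=mu[z], scale=sigma)     # x | z ~ N(mu_z, sigma^2)
```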

  • Example 1: conditional Gaussian

     Data log-likelihood:

    \ell(\theta; D) = \log \prod_n p(z_n, x_n) = \log \prod_n p(z_n \mid \pi)\, p(x_n \mid z_n, \mu, \sigma)
                    = \sum_n \log p(z_n \mid \pi) + \sum_n \log p(x_n \mid z_n, \mu, \sigma)
                    = \sum_n \sum_m z_n^m \log \pi_m \;-\; \sum_n \sum_m z_n^m \frac{(x_n - \mu_m)^2}{2\sigma^2} + C

     MLE:

    \pi_m^* = \arg\max \ell(\theta; D) \;\text{ s.t. }\; \sum_m \pi_m = 1
      \;\Rightarrow\; \frac{\partial}{\partial \pi_m} \ell(\theta; D) = 0, \; \forall m
      \;\Rightarrow\; \pi_m^{MLE} = \frac{\sum_n z_n^m}{N} = \frac{n_m}{N}
      (the fraction of samples of class m)

    \mu_m^* = \arg\max \ell(\theta; D)
      \;\Rightarrow\; \mu_m^{MLE} = \frac{\sum_n z_n^m x_n}{\sum_n z_n^m}
      (the average of samples of class m)
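These two closed-form estimators are nothing more than class fractions and class-wise averages. A minimal sketch, assuming a one-hot indicator matrix Z and a vector of observations x (the toy numbers below are illustrative only):

```python
import numpy as np

# Toy fully observed data: N = 6 samples, M = 2 classes.
# Z[n, m] = 1 iff sample n is in class m (one-hot rows); x[n] is the observation.
Z = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [0, 1], [1, 0]])
x = np.array([-1.9, -2.2, 3.1, 2.8, 3.3, -2.0])

N = Z.shape[0]

# pi_m = (sum_n z_n^m) / N : fraction of samples of class m
pi_hat = Z.sum(axis=0) / N

# mu_m = (sum_n z_n^m x_n) / (sum_n z_n^m) : average of the samples of class m
mu_hat = (Z * x[:, None]).sum(axis=0) / Z.sum(axis=0)

print(pi_hat)   # [0.5 0.5]
print(mu_hat)   # class-wise sample means
```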

  • Example 2: HMM: two scenarios

     Supervised learning: estimation when the "right answer" is known

     Examples:

    GIVEN: a genomic region x = x1…x1,000,000 where we have good (experimental) annotations of the CpG islands

    GIVEN: the casino player allows us to observe him one evening, as he changes dice and produces 10,000 rolls

     Unsupervised learning: estimation when the "right answer" is unknown

     Examples:

    GIVEN: the porcupine genome; we don't know how frequent the CpG islands are there, nor do we know their composition

    GIVEN: 10,000 rolls of the casino player, but we don't see when he changes dice

     QUESTION: Update the parameters θ of the model to maximize P(x|θ) --- maximum likelihood (ML) estimation

  • Recall definition of HMM

    [Graph: hidden chain y1 → y2 → y3 → … → yT, with an emission xt from each yt]

     Transition probabilities between any two states:

    p(y_t^j = 1 \mid y_{t-1}^i = 1) = a_{i,j}

    or, in general,

    p(y_t \mid y_{t-1}^i = 1) \sim \text{Multinomial}(a_{i,1}, a_{i,2}, \ldots, a_{i,M}), \quad \forall i \in I

     Start probabilities:

    p(y_1) \sim \text{Multinomial}(\pi_1, \pi_2, \ldots, \pi_M)

     Emission probabilities associated with each state:

    p(x_t \mid y_t^i = 1) \sim \text{Multinomial}(b_{i,1}, b_{i,2}, \ldots, b_{i,K}), \quad \forall i \in I

    or in general:

    p(x_t \mid y_t^i = 1) \sim f(\cdot \mid \theta_i), \quad \forall i \in I
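In matrix form, these parameters are a start vector π, a transition matrix A = [a_ij], and an emission matrix B = [b_ik]. The sketch below writes them down and samples a sequence, using made-up values loosely inspired by the dishonest-casino example (2 hidden states, 6 die faces); none of the numbers are from the slides:

```python
import numpy as np

rng = np.random.default_rng(1)

M, K = 2, 6                          # hidden states (e.g. fair/loaded), observation symbols (die faces)
pi = np.array([0.5, 0.5])            # start probabilities pi_i
A = np.array([[0.95, 0.05],          # transition probabilities a_ij
              [0.10, 0.90]])
B = np.array([[1/6] * 6,             # emission probabilities b_ik
              [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]])

T = 20
y = np.empty(T, dtype=int)
x = np.empty(T, dtype=int)
y[0] = rng.choice(M, p=pi)
x[0] = rng.choice(K, p=B[y[0]])
for t in range(1, T):
    y[t] = rng.choice(M, p=A[y[t - 1]])    # y_t | y_{t-1} ~ Multinomial(a_{y_{t-1}, :})
    x[t] = rng.choice(K, p=B[y[t]])        # x_t | y_t     ~ Multinomial(b_{y_t, :})
```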

  • Supervised ML estimation

     Given x = x1…xN for which the true state path y = y1…yN is known,

     Define:

    \ell(\theta; x, y) = \log p(x, y)
                       = \sum_n \left( \log p(y_{n,1}) + \sum_{t=2}^{T} \log p(y_{n,t} \mid y_{n,t-1}) + \sum_{t=1}^{T} \log p(x_{n,t} \mid y_{n,t}) \right)

    A_{ij} = # times state transition i→j occurs in y
    B_{ik} = # times state i in y emits k in x

     We can show that the maximum likelihood parameters θ are:

    a_{ij}^{ML} = \frac{\#(i \to j)}{\#(i \to \bullet)}
                = \frac{\sum_n \sum_{t=2}^{T} y_{n,t-1}^i \, y_{n,t}^j}{\sum_n \sum_{t=2}^{T} y_{n,t-1}^i}
                = \frac{A_{ij}}{\sum_{j'} A_{ij'}}

    b_{ik}^{ML} = \frac{\#(i \to k)}{\#(i \to \bullet)}
                = \frac{\sum_n \sum_{t=1}^{T} y_{n,t}^i \, x_{n,t}^k}{\sum_n \sum_{t=1}^{T} y_{n,t}^i}
                = \frac{B_{ik}}{\sum_{k'} B_{ik'}}

     If x is continuous, we can treat \{(x_{n,t}, y_{n,t}) : t = 1{:}T,\; n = 1{:}N\} as N×T observations of, e.g., a Gaussian, and apply the learning rules for Gaussians …
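The count-ratio estimators above translate directly into code. A minimal sketch, assuming integer-coded state and observation sequences; the function name and the tiny labeled sequence are made up for illustration:

```python
import numpy as np

def supervised_hmm_mle(xs, ys, M, K):
    """ML estimates of the transition matrix a and emission matrix b
    from fully labeled sequences (xs[n][t] in 0..K-1, ys[n][t] in 0..M-1)."""
    A = np.zeros((M, M))   # A[i, j] = # times transition i -> j occurs in y
    B = np.zeros((M, K))   # B[i, k] = # times state i emits symbol k in x
    for x, y in zip(xs, ys):
        for t in range(1, len(y)):
            A[y[t - 1], y[t]] += 1
        for t in range(len(y)):
            B[y[t], x[t]] += 1
    a = A / A.sum(axis=1, keepdims=True)   # a_ij = A_ij / sum_j' A_ij'
    b = B / B.sum(axis=1, keepdims=True)   # b_ik = B_ik / sum_k' B_ik'
    return a, b

# Tiny made-up example: 2 states, 3 symbols, one labeled sequence.
xs = [[0, 1, 2, 2, 0, 1]]
ys = [[0, 0, 1, 1, 0, 0]]
a_hat, b_hat = supervised_hmm_mle(xs, ys, M=2, K=3)
```

Note that any transition or emission never seen in the training data gets probability exactly 0 here, which is the drawback discussed on the next slide.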

  • Supervised ML estimation, ctd.

     Intuition:

     When we know the underlying states, the best estimate of θ is the average frequency of transitions & emissions that occur in the training data.

     Drawback:

     Given little data, there may be overfitting:

     P(x|θ) is maximized, but θ is unreasonable

     0 probabilities – VERY BAD
