dynamic bayesian networks beyond 10708./guestrin/class/10708-f06/slides/dbn... · 2006. 12. 26. ·...

17
1 1 Dynamic Bayesian Networks Beyond 10708 Graphical Models – 10708 Carlos Guestrin Carnegie Mellon University December 1 st , 2006 Readings: K&F: 18.1, 18.2, 18.3, 18.4 2 Dynamic Bayesian network (DBN) HMM defined by Transition model P(X (t+1) |X (t) ) Observation model P(O (t) |X (t) ) Starting state distribution P(X (0) ) DBN – Use Bayes net to represent each of these compactly Starting state distribution P(X (0) ) is a BN (silly) e.g, performance in grad. school DBN Vars: Happiness, Productivity, HiraBlility, Fame Observations: PapeR, Schmooze

Upload: others

Post on 20-Oct-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

  • 1

    1

    Dynamic Bayesian Networks

    Beyond 10708

    Graphical Models – 10708

    Carlos Guestrin

    Carnegie Mellon University

    December 1st, 2006

    Readings:

    K&F: 18.1, 18.2, 18.3, 18.4

    2

    Dynamic Bayesian network (DBN)

    � HMM defined by

    � Transition model P(X(t+1)|X(t))

    � Observation model P(O(t)|X(t))

    � Starting state distribution P(X(0))

    � DBN – Use Bayes net to represent each of these compactly

    � Starting state distribution P(X(0)) is a BN

    � (silly) e.g, performance in grad. school DBN

    � Vars: Happiness, Productivity, HiraBlility, Fame

    � Observations: PapeR, Schmooze

  • 2

    3

    Unrolled DBN

    � Start with P(X(0))

    � For each time step, add vars as defined by 2-TBN

    4

    “Sparse” DBN and fast inference

    “Sparse” DBN ⇒ Fast inference����Time t

    C

    B

    A

    F

    E

    D

    t+1

    C’

    A’

    B’

    F’

    D’

    E’

    B’’

    A’’

    B’’’

    C’’’

    A’’’

    t+2 t+3

    C’’

    E’’

    D’’

    E’’’

    F’’’

    D’’’

    F’’

  • 3

    5

    Even after one time step!!

    What happens when we marginalize out time t?

    Time t t+1

    C’

    A’

    C

    B

    A

    B’

    F’

    D’

    F

    E

    D

    E’

    6

    “Sparse” DBN and fast inference 2

    “Sparse” DBN Fast inferenceAlmost!

    ☺☺☺☺

    Structured representation of belief often yields good approximate

    ?

    B’’

    A’’

    B’’’

    C’’’

    A’’’

    Time t t+1

    C’

    A’

    t+2 t+3

    C

    B

    A

    B’

    C’’

    E’’

    D’’

    E’’’

    F’’’

    D’’’

    F’

    D’

    F

    E

    D

    E’

    F’’

  • 4

    7

    BK Algorithm for approximate DBN inference[Boyen, Koller ’98]

    � Assumed density filtering:

    � Choose a factored representation P for the belief state

    � Every time step, belief not representable with P, project into representation

    ^

    ^

    B’’

    A’’

    B’’’

    C’’’

    A’’’

    Time t t+1

    C’

    A’

    t+2 t+3

    C

    B

    A

    B’

    C’’

    E’’

    D’’

    E’’’

    F’’’

    D’’’

    F’

    D’

    F

    E

    D

    E’

    F’’

    8

    A simple example of BK: Fully-Factorized Distribution

    � Assumed density:

    � Fully factorized

    Time t t+1

    C’

    A’

    C

    B

    A

    B’

    F’

    D’

    F

    E

    D

    E’

    True P(X(t+1)):Assumed Density

    for P(X(t+1)):^

  • 5

    9

    Computing Fully-Factorized Distribution at time t+1

    � Assumed density:

    � Fully factorized

    Time t t+1

    C’

    A’

    C

    B

    A

    B’

    F’

    D’

    F

    E

    D

    E’

    Assumed Density for P(X(t+1)):

    Computingfor P(Xi

    (t+1)):^ ^

    10

    General case for BK: Junction Tree Represents Distribution

    � Assumed density:

    � Fully factorized

    Time t t+1

    C’

    A’

    C

    B

    A

    B’

    F’

    D’

    F

    E

    D

    E’

    True P(X(t+1)):Assumed Density

    for P(X(t+1)):^

  • 6

    11

    Computing factored belief state in the next time step

    � Introduce observations in current

    time step

    � Use J-tree to calibrate time t

    beliefs

    � Compute t+1 belief, project into

    approximate belief state

    � marginalize into desired factors

    � corresponds to KL projection

    � Equivalent to computing

    marginals over factors directly

    � For each factor in t+1 step belief

    � Use variable elimination

    C’

    A’

    C

    B

    A

    B’

    F’

    D’

    F

    E

    D

    E’

    12

    Error accumulation

    � Each time step, projection introduces error

    � Will error add up?

    � causing unbounded approximation error as t→∞→∞→∞→∞

  • 7

    13

    Contraction in Markov process

    14

    BK Theorem

    � Error does not grow unboundedly!

    � Theorem: If Markov chain contracts at a rate of γγγγ (usually very small), and assumed density projection at each time step has error bounded by εεεε (usually large) then the expected error at every iteration is bounded by εεεε/γγγγ.

  • 8

    15

    Example – BAT network [Forbes et al.]

    16

    BK results [Boyen, Koller ’98]

  • 9

    17

    Thin Junction Tree Filters [Paskin ’03]

    � BK assumes fixed approximation clusters

    � TJTF adapts clusters over time

    � attempt to minimize

    projection error

    18

    Hybrid DBN (many continuous and discrete variables)

    � DBN with large number of discrete

    and continuous variables

    � # of mixture of Gaussian components

    blows up in one time step!

    � Need many smart tricks…

    � e.g., see Lerner Thesis

    Reverse Water Gas Shift System(RWGS) [Lerner et al. ’02]

  • 10

    19

    DBN summary

    � DBNs

    � factored representation of HMMs/Kalman filters

    � sparse representation does not lead to efficient inference

    � Assumed density filtering

    � BK – factored belief state representation is assumed density

    � Contraction guarantees that error does blow up (but could still be large)

    � Thin junction tree filter adapts assumed density over time

    � Extensions for hybrid DBNs

    20

    This semester…

    � Bayesian networks, Markov networks, factor graphs,

    decomposable models, junction trees, parameter learning, structure learning, semantics, exact inference, variable

    elimination, context-specific independence, approximate

    inference, sampling, importance sampling, MCMC, Gibbs,

    variational inference, loopy belief propagation, generalized

    belief propagation, Kikuchi, Bayesian learning, missing data, EM, Chow-Liu, IPF, GIS, Gaussian and hybrid

    models, discrete and continuous variables, temporal and

    template models, Kalman filter, linearization, switching

    Kalman filter, assumed density filtering, DBNs, BK,

    Causality,…

    � Just the beginning… ☺☺☺☺

  • 11

    21

    Quick overview of some hot topics...

    � Conditional Random Fields

    � Maximum Margin Markov Networks

    � Relational Probabilistic Models� e.g., the parameter sharing model that you learned for

    a recommender system in HW1

    � Hierarchical Bayesian Models� e.g., Khalid’s presentation on Dirichlet Processes

    � Influence Diagrams

    22

    Generative v. Discriminative models – Intuition

    � Want to Learn: h:X aaaa Y� X – features

    � Y – set of variables

    � Generative classifier, e.g., Naïve Bayes, Markov networks:� Assume some functional form for P(X|Y), P(Y)� Estimate parameters of P(X|Y), P(Y) directly from training data� Use Bayes rule to calculate P(Y|X= x)� This is a ‘generative’ model

    � Indirect computation of P(Y|X) through Bayes rule

    � But, can generate a sample of the data, P(X) = ∑∑∑∑y P(y) P(X|y)

    � Discriminative classifiers, e.g., Logistic Regression, Conditional Random Fields:� Assume some functional form for P(Y|X)� Estimate parameters of P(Y|X) directly from training data

    � This is the ‘discriminative’ model� Directly learn P(Y|X), can have lower sample complexity

    � But cannot obtain a sample of the data, because P(X) is not available

  • 12

    23

    Conditional Random Fields [Lafferty et al. ’01]

    � Define a Markov network using a log-linear model for P(Y|X):

    � Features, e.g., for pairwise CRF:

    � Learning: maximize conditional log-likelihood

    � sum of log-likelihoods you know and love…

    � learning algorithm based on gradient descent, very similar to learning MNs

    24

    Max (Conditional) Likelihood

    x1,t(x1)…

    xm,t(xm)

    D

    f(x,y)

    Estimation Classification

    Don’t need to learn entire distribution!

  • 13

    25

    OCR Example

    � We want:

    argmaxword wT f( ,word) = “brace”

    � Equivalently:

    wT f( ,“brace”) > wT f( ,“aaaaa”)

    wT f( ,“brace”) > wT f( ,“aaaab”)

    wT f( ,“brace”) > wT f( ,“zzzzz”)

    a lot!

    26

    � Goal: find w such that

    wTf(x,t(x)) > wTf(x,y) x∈∈∈∈D y≠t(x)

    wT[f(x,t(x)) – f(x,y)] > 0

    � Maximize margin γ

    � Gain over y grows with # of mistakes in y: ∆∆∆∆tx(y)

    ∆∆∆∆t (“craze”) = 2 ∆∆∆∆t (“zzzzz”) = 5

    w�∆∆∆∆fx(y) > 0

    Max Margin Estimation

    w�∆∆∆∆fx(y) ≥≥≥≥ γγγγ

    A A

    w�∆∆∆∆f (“craze”) ≥ 2γγγγ w�∆∆∆∆f (“zzzzz”) ≥ 5γγγγ

    ∆∆∆∆tx(y)

  • 14

    27

    M3Ns: Maximum Margin Markov Networks [Taskar et al. ’03]

    x1,t(x1)…

    xm,t(xm)

    D

    f(x,y)

    ClassificationEstimation

    28

    Propositional Models and Generalization

    � Suppose you learn a model for social networks for CMU from FaceBook data to predict movie preferences:

    � How would you apply when new people join CMU?

    � Can you apply it to make predictions a some “little technical college” in Cambridge, Mass?

  • 15

    29

    Generalization requires Relational Models (e.g., see tutorial by Getoor)

    � Bayes nets defined specifically for an instance, e.g., CMU FaceBook today

    � fixed number of people

    � fixed relationships between people

    � …

    � Relational and first-order probabilistic models

    � talk about objects and relations between objects

    � allow us to represent different (and unknown) numbers

    � generalize knowledge learned from one domain to other, related, but different domains

    30

    Reasoning about decisionsK&F Chapters 20 & 21

    � So far, graphical models only have random variables

    � What if we could make decisions that influence the probability of these variables?

    � e.g., steering angle for a car, buying stocks, choice of medical treatment

    � How do we choose the best decision?

    � the one that maximizes the expected long-term utility

    � How do we coordinate multiple decisions?

  • 16

    31

    Example of an Influence Diagram

    32

    Many, many, many more topics we didn’t even touch, e.g.,...

    � Active learning

    � Non-parametric models

    � Continuous time models

    � Learning theory for graphical models

    � Distributed algorithms for graphical models

    � Graphical models for reinforcement learning

    � Applications

    � …

  • 17

    33

    What next?

    � Seminars at CMU:� Machine Learning Lunch talks: http://www.cs.cmu.edu/~learning/

    � Intelligence Seminar: http://www.cs.cmu.edu/~iseminar/

    � Machine Learning Department Seminar: http://calendar.cs.cmu.edu/cald/seminar� Statistics Department seminars: http://www.stat.cmu.edu/seminar

    � …

    � Journal:� JMLR – Journal of Machine Learning Research (free, on the web)

    � JAIR – Journal of AI Research (free, on the web)

    � …

    � Conferences:� UAI: Uncertainty in AI

    � NIPS: Neural Information Processing Systems

    � Also ICML, AAAI, IJCAI and others

    � Some MLD courses:� 10-705 Intermediate Statistics (Fall)� 10-702 Statistical Foundations of Machine Learning (Spring)

    � 10-801 Advanced Topics in Graphical Models: statistical foundations, approximate inference, and Bayesian methods (Spring)

    � …