
©2005-2007 Carlos Guestrin

HMMs

Machine Learning – 10701/15781
Carlos Guestrin
Carnegie Mellon University

November 7th, 2007

Example of a hidden Markov model (HMM)


Understanding the HMM Semantics

[Figure: an HMM over hidden states X1, …, X5, each taking values in {a, …, z}, with one observation Oi attached to each hidden state Xi.]

HMMs semantics: Details


Just 3 distributions:
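In standard HMM notation (the slide's own equations did not survive extraction), these are the initial-state, transition, and observation distributions:

$$P(X_1), \qquad P(X_i \mid X_{i-1}), \qquad P(O_i \mid X_i),$$

with the transition and observation distributions shared across all time steps i (parameter sharing).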


HMMs semantics: Joint distribution

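A hedged reconstruction of the missing formula: in standard HMM form the joint distribution factorizes as

$$P(X_{1:n}, O_{1:n}) = P(X_1)\,P(O_1 \mid X_1)\,\prod_{i=2}^{n} P(X_i \mid X_{i-1})\,P(O_i \mid X_i).$$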

Learning HMMs from fully observable data is easy


Learn 3 distributions:
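A sketch of the standard maximum-likelihood estimates from fully observed sequences (the slide's exact formulas are missing; Count(·) denotes empirical counts over the m training sequences):

$$\hat{P}(X_1 = a) = \frac{\mathrm{Count}(X_1 = a)}{m}, \qquad
\hat{P}(X_i = b \mid X_{i-1} = a) = \frac{\mathrm{Count}(a \to b)}{\sum_{b'} \mathrm{Count}(a \to b')}, \qquad
\hat{P}(O = o \mid X = x) = \frac{\mathrm{Count}(x, o)}{\sum_{o'} \mathrm{Count}(x, o')}.$$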


Possible inference tasks in an HMM


Marginal probability of a hidden variable:

Viterbi decoding – most likely trajectory for hidden vars:
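Written out in standard notation (a reconstruction of the slide's equations), the two tasks are

$$P(X_i \mid o_{1:n}) \qquad \text{and} \qquad \arg\max_{x_{1:n}} P(x_{1:n} \mid o_{1:n}).$$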

Using variable elimination to compute P(Xi | o1:n)


Variable elimination order?

Compute: P(Xi | o1:n)

Example:
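A sketch of the kind of example intended here: compute P(X5 | o1:5) by eliminating X1, X2, X3, X4 in order; each elimination produces a factor over the next hidden variable,

$$f_1(X_2) = \sum_{x_1} P(x_1)\,P(o_1 \mid x_1)\,P(X_2 \mid x_1), \qquad
f_i(X_{i+1}) = \sum_{x_i} f_{i-1}(x_i)\,P(o_i \mid x_i)\,P(X_{i+1} \mid x_i),$$

$$P(X_5 \mid o_{1:5}) \propto f_4(X_5)\,P(o_5 \mid X_5).$$

Every step involves only two hidden variables at a time, so the whole query costs O(n) small factor operations.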


What if I want to compute P(Xi | o1:n) for each i?


Variable elimination for each i?

Compute: P(Xi | o1:n) for each i

Variable elimination for each i, what’s the complexity?

Reusing computation

Compute: P(Xi | o1:n) for each i, reusing shared intermediate factors


The forwards-backwards algorithm

Initialization, then for i = 2 to n:
Generate a forwards factor by eliminating Xi-1
Initialization, then for i = n-1 down to 1:
Generate a backwards factor by eliminating Xi+1
For every i, the probability is:
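A hedged reconstruction of the recursions, using α for the forwards factors and β for the backwards factors (the symbols are my labels, not the slide's):

$$\alpha_1(X_1) = P(X_1)\,P(o_1 \mid X_1), \qquad
\alpha_i(X_i) = P(o_i \mid X_i) \sum_{x_{i-1}} \alpha_{i-1}(x_{i-1})\,P(X_i \mid x_{i-1}),$$

$$\beta_n(X_n) = 1, \qquad
\beta_i(X_i) = \sum_{x_{i+1}} P(x_{i+1} \mid X_i)\,P(o_{i+1} \mid x_{i+1})\,\beta_{i+1}(x_{i+1}),$$

$$P(X_i \mid o_{1:n}) \propto \alpha_i(X_i)\,\beta_i(X_i).$$

One forward sweep and one backward sweep give all n posteriors, which is the O(n) trick referred to in the summary slide.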

What you’ll implement 1: multiplication


What you’ll implement 2: marginalization
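A minimal Python sketch of these two factor operations for the HMM chain, assuming factors are stored as NumPy arrays over the hidden-state alphabet; the function names and array layout are my own, not the assignment's:

```python
import numpy as np

# A factor over (X_{i-1}, X_i) is a 2-D array F[x_prev, x_cur];
# a factor over a single X_i is a 1-D array.

def multiply(prev_msg, transition, emission_col):
    """Multiply an incoming 1-D factor over X_{i-1} with the transition
    factor P(X_i | X_{i-1}) and the evidence column P(o_i | X_i).
    Returns a 2-D factor over (X_{i-1}, X_i)."""
    # prev_msg: (K,); transition: (K, K); emission_col: (K,)
    return prev_msg[:, None] * transition * emission_col[None, :]

def marginalize(factor, axis):
    """Sum out one variable of a 2-D factor, returning a 1-D factor."""
    return factor.sum(axis=axis)

# Example: one forward step of the chain with K hidden states.
K = 26                                   # e.g. the alphabet {a, ..., z}
rng = np.random.default_rng(0)
T = rng.random((K, K)); T /= T.sum(axis=1, keepdims=True)  # P(X_i | X_{i-1})
e = rng.random(K)                        # P(o_i | X_i) for the observed o_i
alpha_prev = np.full(K, 1.0 / K)         # incoming forwards factor over X_{i-1}

alpha_next = marginalize(multiply(alpha_prev, T, e), axis=0)  # eliminates X_{i-1}
```

The same two operations, applied right to left, produce the backwards factors.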

Higher-order HMMs


Add dependencies further back in time → better representation, but harder to learn
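For example, a second-order HMM replaces the transition model with one that conditions on the two previous states (a standard formulation, not written out on the slide):

$$P(X_i \mid X_{i-1}, X_{i-2}),$$

so for K hidden states the number of transition parameters grows from O(K²) to O(K³).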


What you need to know

Hidden Markov models (HMMs)
Very useful, very powerful! Speech, OCR, …
Parameter sharing, only learn 3 distributions
Trick reduces inference from O(n²) to O(n)
Special case of BN


Bayesian Networks (Structure) Learning

Machine Learning – 10701/15781
Carlos Guestrin
Carnegie Mellon University

November 7th, 2007


Review Bayesian Networks

Compact representation for probability distributions

Exponential reduction in number of parameters

Fast probabilistic inference using variable elimination: compute P(X | e) in time exponential in the tree-width, not in the number of variables
Today: learn BN structure

[Figure: running-example Bayes net with nodes Flu, Allergy, Sinus, Headache, Nose.]

Learning Bayes nets

Data: x(1), …, x(m)
Learn: structure and parameters (CPTs – P(Xi | PaXi))


Learning the CPTs

Data: x(1), …, x(m). For each discrete variable Xi:
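The maximum-likelihood estimate the slide points at is the standard counting formula (a reconstruction; counts are over the m data points):

$$\hat{P}(X_i = x \mid \mathrm{Pa}_{X_i} = \mathbf{u}) = \frac{\mathrm{Count}(X_i = x,\ \mathrm{Pa}_{X_i} = \mathbf{u})}{\mathrm{Count}(\mathrm{Pa}_{X_i} = \mathbf{u})}.$$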

Information-theoretic interpretation of maximum likelihood 1

Given structure, log likelihood of data:



Information-theoretic interpretation of maximum likelihood 2

Given structure, log likelihood of data:


Information-theoretic interpretation of maximum likelihood 3

Given structure, log likelihood of data:

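The derivation these three slides build up is the standard one; a hedged reconstruction: plugging the MLE parameters into the log likelihood and grouping terms by family gives

$$\log \hat{P}(D \mid \hat{\theta}, G)
= m \sum_i \sum_{x_i, \mathbf{pa}_i} \hat{P}(x_i, \mathbf{pa}_i)\,\log \hat{P}(x_i \mid \mathbf{pa}_i)
= m \sum_i \Big[\hat{I}(X_i; \mathrm{Pa}_{X_i}) - \hat{H}(X_i)\Big],$$

where Î and Ĥ are empirical mutual information and entropy. The entropy terms do not depend on the structure, so maximizing likelihood over structures amounts to choosing parent sets with maximal empirical mutual information.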


Decomposable score

Log data likelihood

Decomposable score: decomposes over families in the BN (a node and its parents)
Will lead to significant computational efficiency!
Score(G : D) = ∑i FamScore(Xi | PaXi : D)

How many trees are there? Super-exponentially many (n^(n−2) labeled trees on n nodes); nonetheless, an efficient algorithm finds the optimal tree.


Scoring a tree 1: equivalent trees

Scoring a tree 2: similar trees


Chow-Liu tree learning algorithm 1

For each pair of variables Xi,Xj

Compute empirical distribution:

Compute mutual information:

Define a graph with nodes X1, …, Xn
Edge (i,j) gets weight:
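The standard quantities behind these three steps (a reconstruction; Count(·) is over the m samples):

$$\hat{P}(x_i, x_j) = \frac{\mathrm{Count}(x_i, x_j)}{m}, \qquad
\hat{I}(X_i; X_j) = \sum_{x_i, x_j} \hat{P}(x_i, x_j)\,\log \frac{\hat{P}(x_i, x_j)}{\hat{P}(x_i)\,\hat{P}(x_j)},$$

and edge (i, j) gets weight Î(Xi; Xj) (equivalently m·Î(Xi; Xj); the constant factor does not change the maximum-weight tree).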

Chow-Liu tree learning algorithm 2

Optimal tree BN:
Compute the maximum-weight spanning tree
Directions in the BN: pick any node as root; breadth-first search defines the edge directions
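A compact Python sketch of the whole algorithm, assuming discrete data in an integer array of shape (m, n); the helper names are mine, and SciPy's minimum_spanning_tree is run on negated weights to obtain a maximum-weight spanning tree:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, breadth_first_order

def empirical_mi(data, i, j):
    """Empirical mutual information between discrete columns i and j."""
    xi, xj = data[:, i], data[:, j]
    joint = np.zeros((xi.max() + 1, xj.max() + 1))
    np.add.at(joint, (xi, xj), 1.0)
    joint /= joint.sum()
    pi, pj = joint.sum(axis=1, keepdims=True), joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (pi @ pj)[nz])).sum())

def chow_liu(data, root=0):
    """Return directed edges (parent, child) of the Chow-Liu tree."""
    n = data.shape[1]
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            W[i, j] = empirical_mi(data, i, j)
    # Maximum-weight spanning tree == minimum spanning tree of negated weights.
    mst = minimum_spanning_tree(-W).toarray() != 0
    adj = (mst | mst.T).astype(float)
    # Orient edges away from the chosen root via breadth-first search.
    order, preds = breadth_first_order(adj, root, directed=False)
    return [(int(preds[v]), int(v)) for v in order if preds[v] >= 0]

# Usage: each (parent, child) pair returned defines PaX_child = {parent}.
```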


Can we extend Chow-Liu 1

Tree-augmented naïve Bayes (TAN) [Friedman et al. ’97]
The naïve Bayes model overcounts, because correlations between features are not considered
Same as Chow-Liu, but score edges with:
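The edge weight used by TAN is the conditional mutual information given the class variable C (the standard score from Friedman et al. ’97):

$$\hat{I}(X_i; X_j \mid C) = \sum_{x_i, x_j, c} \hat{P}(x_i, x_j, c)\,\log \frac{\hat{P}(x_i, x_j \mid c)}{\hat{P}(x_i \mid c)\,\hat{P}(x_j \mid c)}.$$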

Can we extend Chow-Liu 2

(Approximately) learning models with tree-width up to k [Chechetka & Guestrin ’07]; but the runtime is O(n^(2k+6))…


What you need to know about learning BN structures so far

Decomposable scores
Maximum likelihood
Information-theoretic interpretation
Best tree (Chow-Liu)
Best TAN
Nearly best k-treewidth (in O(n^(2k+6)))

Scoring general graphical models – Model selection problem

Data: <x1(1), …, xn(1)>, …, <x1(m), …, xn(m)>


What’s the best structure?

The more edges, the fewer independence assumptions, the higher the likelihood of the data, but will overfit…


Maximum likelihood overfits!

Information never hurts:

Adding a parent always increases score!!!
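In terms of the decomposable ML score above, this is the "information never hurts" property (a standard identity, not a quote from the slide): for any candidate extra parent Z,

$$\hat{I}(X_i;\ \mathrm{Pa}_{X_i} \cup \{Z\}) \ \ge\ \hat{I}(X_i;\ \mathrm{Pa}_{X_i}),$$

so adding a parent can never decrease a family's likelihood score, and the fully connected DAG maximizes training likelihood.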

Bayesian score avoids overfitting

Given a structure, distribution over parameters

Difficult integral: use the Bayesian information criterion (BIC) approximation (equivalent as M → ∞)

Note: regularize with MDL score
Best BN under BIC is still NP-hard
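The BIC score referred to here is the standard one (a reconstruction, with m data points and Dim(G) free parameters in structure G):

$$\mathrm{BIC}(G : D) = \log \hat{P}(D \mid \hat{\theta}_{\mathrm{MLE}}, G) - \frac{\log m}{2}\,\mathrm{Dim}(G).$$

The second term penalizes dense structures, which is why it behaves like an MDL regularizer; it is still decomposable over families.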


Structure learning for general graphs

In a tree, a node only has one parent

Theorem: The problem of learning a BN structure with at most d parents is NP-hard for any (fixed) d ≥ 2

Most structure learning approaches use heuristics
Exploit score decomposition
(Quickly) describe two heuristics that exploit decomposition in different ways

Learn BN structure using local search

Starting from Chow-Liu tree

Local search, possible moves:
• Add edge
• Delete edge
• Invert edge

Score using BIC
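A hedged Python sketch of such a search, exploiting the decomposable BIC score so each move only re-scores one family. For brevity it assumes a fixed variable ordering (which makes acyclicity automatic) and uses only add/delete moves rather than the full add/delete/invert set on the slide; all names are mine:

```python
import numpy as np

def family_bic(data, child, parents):
    """BIC contribution of one family: log-likelihood minus (log m)/2 * #params."""
    m = data.shape[0]
    k_child = int(data[:, child].max()) + 1
    k_pa = [int(data[:, p].max()) + 1 for p in parents]
    pa_idx = np.zeros(m, dtype=int)
    for p, k in zip(parents, k_pa):          # mixed-radix parent configuration index
        pa_idx = pa_idx * k + data[:, p]
    counts = np.zeros((int(np.prod(k_pa)) if parents else 1, k_child))
    np.add.at(counts, (pa_idx, data[:, child]), 1.0)
    row = counts.sum(axis=1, keepdims=True)
    with np.errstate(invalid="ignore"):
        ll = np.sum(counts * np.log(np.where(counts > 0, counts / row, 1.0)))
    n_params = counts.shape[0] * (k_child - 1)
    return ll - 0.5 * np.log(m) * n_params

def hill_climb(data, max_iters=100):
    """Greedy local search over parent sets under a fixed ordering X0 < X1 < ..."""
    n = data.shape[1]
    parents = {i: [] for i in range(n)}
    score = {i: family_bic(data, i, parents[i]) for i in range(n)}
    for _ in range(max_iters):
        best = None
        for child in range(n):
            for cand in range(child):        # earlier variables only: no cycles
                pa = parents[child]
                new_pa = [p for p in pa if p != cand] if cand in pa else pa + [cand]
                gain = family_bic(data, child, new_pa) - score[child]
                if gain > 1e-9 and (best is None or gain > best[0]):
                    best = (gain, child, new_pa)
        if best is None:                     # no single move improves the score
            break
        _, child, new_pa = best
        parents[child], score[child] = new_pa, family_bic(data, child, new_pa)
    return parents
```

Starting from a Chow-Liu tree rather than the empty graph, as the slide suggests, just changes the initialization of `parents`.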


What you need to know about learning BNs

Learning BNs:
Maximum likelihood or MAP learns parameters
Decomposable score
Best tree (Chow-Liu)
Best TAN
Other BNs: usually local search with BIC score