Learning, Uncertainty, and Information: Big Ideas (November 8, 2004)
Roadmap
• Turing, Intelligence, and Learning
• Noisy-channel model
  – Uncertainty, Bayes' Rule, and Applications
• Hidden Markov Models
  – The Model
  – Decoding the best sequence
  – Training the model (EM)
• N-gram models: Modeling sequences
  – Shannon, Information Theory, and Perplexity
• Conclusion
Turing & Intelligence
• Turing (1950):
  – "Computing Machinery and Intelligence"
  – "Imitation Game" (aka Turing test)
    • Functional definition of intelligence as indistinguishable from human
  – Key question raised: Learning
    • Can a system be intelligent if it only knows its program?
    • Learning is necessary for intelligence
  – 1) Programmed knowledge
  – 2) Learning mechanism
  – Knowledge, reasoning, learning, communication
Noisy-Channel Model
• Original message not directly observable
  – Passed through some channel between sender and receiver, plus noise
  – Examples: telephone (Shannon); word sequence vs. acoustics (Jelinek); genome sequence vs. CATG; object vs. image
• Derive the most likely original input from the observed output
Bayesian Inference
• P(W|O) is difficult to compute directly
  – W: the input; O: the observations
  – Generative, sequence models
• Apply Bayes' Rule, then drop the denominator P(O), which is constant across candidates:

  W* = argmax_W P(W|O)
     = argmax_W P(O|W) P(W) / P(O)
     = argmax_W P(O|W) P(W)
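The decision rule above can be sketched with a toy candidate set; the candidate words, likelihoods, and priors below are invented purely for illustration:

```python
# Toy noisy-channel decoder: choose the candidate W that maximizes
# P(O|W) * P(W).  All numbers here are made up for illustration.
candidates = {
    # W: (P(O|W), P(W))  -- likelihood of the observation, prior on W
    "there": (0.30, 0.020),
    "their": (0.25, 0.015),
    "theirs": (0.10, 0.002),
}

def decode(candidates):
    # argmax over W of P(O|W) * P(W); P(O) is omitted since it is
    # the same for every candidate
    return max(candidates, key=lambda w: candidates[w][0] * candidates[w][1])

best = decode(candidates)
```

The same rule scales to real systems by swapping in an acoustic model for P(O|W) and a language model for P(W).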
Applications
• AI: Speech recognition!, POS tagging, sense tagging, dialogue, image understanding, information retrieval
• Non-AI:
  – Bioinformatics: gene sequencing
  – Security: intrusion detection
  – Cryptography
Probabilistic Reasoning over Time
• Issue: Discrete models
  – Many processes change continuously
  – How do we make observations? States?
• Solution: Discretize
  – "Time slices": make time discrete
  – Observations and states associated with each time slice: O_t, Q_t
• Observations can be discrete or continuous
  – Here we focus on discrete for clarity
Modelling Processes over Time
• Infer the underlying state sequence from the observations
• Issue: A new state depends on preceding states
  – Analyzing sequences
• Problem 1: Possibly unbounded number of probability tables
  – Observation + State + Time
• Solution 1: Assume a stationary process
  – Rules governing the process are the same at all times
• Problem 2: Possibly unbounded number of parents
  – Markov assumption: only consider a finite history
  – Common: first- or second-order Markov: depend on the last state or two
Hidden Markov Models (HMMs)
• An HMM is:
  – 1) A set of states: Q = q_1, q_2, ..., q_N
  – 2) A set of transition probabilities: A = {a_ij}
    • Where a_ij is the probability of transition q_i -> q_j
  – 3) Observation probabilities: B = {b_i(o_t)}
    • The probability of observing o_t in state i
  – 4) An initial probability distribution over states: π_i
    • The probability of starting in state i
  – 5) A set of accepting (final) states
Three Problems for HMMs
• Find the probability of an observation sequence given a model
  – Forward algorithm
• Find the most likely path through a model given an observation sequence
  – Viterbi algorithm (decoding)
• Find the most likely model parameters given an observation sequence
  – Baum-Welch (EM) algorithm
Bins and Balls Example
• Assume there are two bins filled with red and blue balls. Behind a curtain, someone selects a bin and then draws a ball from it (and replaces it). They then select either the same bin or the other one and then select another ball…
– (Example due to J. Martin)
Bins and Balls
• Π (initial): Bin1: 0.9, Bin2: 0.1
• A (transitions):
        Bin1  Bin2
  Bin1  0.6   0.4
  Bin2  0.3   0.7
• B (emissions):
        Bin1  Bin2
  Red   0.7   0.4
  Blue  0.3   0.6
Bins and Balls
• Assume the observation sequence:
  – Blue Blue Red (BBR)
• Both bins have red and blue balls
  – Any state sequence could produce the observations
• However, they are NOT equally likely
  – Big difference in start probabilities
  – Observation depends on state
  – State depends on prior state
Bins and Balls
Observations: Blue Blue Red

States  Probability
1 1 1   (0.9*0.3)*(0.6*0.3)*(0.6*0.7) = 0.0204
1 1 2   (0.9*0.3)*(0.6*0.3)*(0.4*0.4) = 0.0078
1 2 1   (0.9*0.3)*(0.4*0.6)*(0.3*0.7) = 0.0136
1 2 2   (0.9*0.3)*(0.4*0.6)*(0.7*0.4) = 0.0181
2 1 1   (0.1*0.6)*(0.3*0.3)*(0.6*0.7) = 0.0023
2 1 2   (0.1*0.6)*(0.3*0.3)*(0.4*0.4) = 0.0009
2 2 1   (0.1*0.6)*(0.7*0.6)*(0.3*0.7) = 0.0053
2 2 2   (0.1*0.6)*(0.7*0.6)*(0.7*0.4) = 0.0071
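The enumeration above can be checked by brute force; a minimal sketch using the transition, emission, and start probabilities from the earlier slides (states 1 and 2 stand for Bin1 and Bin2):

```python
from itertools import product

# Bins-and-balls HMM parameters from the slides
pi = {1: 0.9, 2: 0.1}                      # initial probabilities
A = {(1, 1): 0.6, (1, 2): 0.4,
     (2, 1): 0.3, (2, 2): 0.7}             # transition probabilities
B = {1: {"Red": 0.7, "Blue": 0.3},
     2: {"Red": 0.4, "Blue": 0.6}}         # emission probabilities

obs = ["Blue", "Blue", "Red"]

def joint(states, obs):
    """P(states, obs) = start prob * emissions * transitions."""
    p = pi[states[0]] * B[states[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= A[(states[t - 1], states[t])] * B[states[t]][obs[t]]
    return p

# Enumerate every state sequence (2^3 = 8 of them)
probs = {s: joint(s, obs) for s in product([1, 2], repeat=3)}
total = sum(probs.values())                # Problem 1: P(obs), by brute force
best = max(probs, key=probs.get)           # Problem 2: most likely path
```

This exhaustive enumeration is exactly what the forward and Viterbi algorithms later avoid by dynamic programming.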
Answers and Issues
• Here, to compute the probability of the observations
  – Just add up all the state-sequence probabilities
• To find the most likely state sequence
  – Just pick the sequence with the highest value
• Problem: Computing all paths is expensive
  – About 2T * N^T multiplications
• Solution: Dynamic programming
  – Sweep across all states at each time step
  – Summing (Problem 1) or maximizing (Problem 2)
Forward Probability
Given the model λ, the forward probability α_j(t) is:

  Definition:     α_j(t) = P(o_1, o_2, ..., o_t, q_t = j | λ)
  Initialization: α_j(1) = π_j b_j(o_1),  1 ≤ j ≤ N
  Induction:      α_j(t+1) = [ Σ_{i=1..N} α_i(t) a_ij ] b_j(o_{t+1})
  Termination:    P(O | λ) = Σ_{i=1..N} α_i(T)

Where α is the forward probability, t is the time in the utterance, i and j are states in the HMM, a_ij is the transition probability from i to j, b_j(o_t) is the probability of observing o_t in state j, N is the number of states, and T is the last time step.
Acoustic Model
• 3-state phone model for [m]
  – Use a Hidden Markov Model (HMM)
  – Probability of a sequence: sum of the probabilities of its paths

[Figure: a 3-state phone HMM with states Onset, Mid, End plus a Final state; arcs labeled with transition probabilities (0.7, 0.3, 0.9, 0.1, 0.4, 0.6) and states labeled with observation probabilities over codebook symbols C1–C6 (e.g., Onset: C1 0.5, C2 0.2, C3 0.3)]
Forward Algorithm
• Idea: a matrix where each cell forward[t,j] represents the probability of being in state j after seeing the first t observations.
• Each cell expresses the probability: forward[t,j] = P(o_1, o_2, ..., o_t, q_t = j | λ)
  – q_t = j means "the t-th state in the sequence of states is state j."
• Compute the probability by summing over the extensions of all paths leading to the current cell.
• An extension of a path from state i at time t-1 to state j at time t is computed by multiplying together:
  i. the previous path probability from the previous cell, forward[t-1,i],
  ii. the transition probability a_ij from previous state i to current state j, and
  iii. the observation likelihood b_j(o_t) that current state j matches observation symbol o_t.
Forward Algorithm
Function Forward(observations of length T, state-graph) returns observation probability
  num-states <- num-of-states(state-graph)
  Create a path probability matrix forward[num-states+2, T+2]
  forward[0,0] <- 1.0
  for each time step t from 0 to T do
    for each state s from 0 to num-states do
      for each transition s' from s in state-graph do
        new-score <- forward[s,t] * a[s,s'] * b_s'(o_t)
        forward[s',t+1] <- forward[s',t+1] + new-score
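As a concrete check of the pseudocode above, here is a minimal Python forward pass over the bins-and-balls model from the earlier slides (states 0 and 1 stand for Bin1 and Bin2):

```python
# Minimal forward algorithm; alpha[t][j] is the probability of the
# first t+1 observations with the model ending in state j at time t.
pi = [0.9, 0.1]
A = [[0.6, 0.4], [0.3, 0.7]]                 # A[i][j] = P(state j | state i)
B = [{"Red": 0.7, "Blue": 0.3},
     {"Red": 0.4, "Blue": 0.6}]              # emission probabilities

def forward(obs):
    N = len(pi)
    # initialization: alpha_j(1) = pi_j * b_j(o_1)
    alpha = [[pi[j] * B[j][obs[0]] for j in range(N)]]
    # induction: sum over all predecessors, then emit
    for t in range(1, len(obs)):
        alpha.append([
            sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
            for j in range(N)
        ])
    # termination: P(obs | model) = sum of final-column alphas
    return sum(alpha[-1])

p = forward(["Blue", "Blue", "Red"])
```

The result matches the sum obtained by enumerating all eight state sequences by hand, but with O(N^2 T) work instead of O(N^T).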
Viterbi Code
Function Viterbi(observations of length T, state-graph) returns best-path
  num-states <- num-of-states(state-graph)
  Create a path probability matrix viterbi[num-states+2, T+2]
  viterbi[0,0] <- 1.0
  for each time step t from 0 to T do
    for each state s from 0 to num-states do
      for each transition s' from s in state-graph do
        new-score <- viterbi[s,t] * a[s,s'] * b_s'(o_t)
        if ((viterbi[s',t+1] == 0) || (viterbi[s',t+1] < new-score))
          then viterbi[s',t+1] <- new-score
               back-pointer[s',t+1] <- s
  Backtrace from the highest-probability state in the final column of viterbi[] and return the path
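The same idea in runnable form, again over the bins-and-balls model (states 0 and 1 stand for Bin1 and Bin2); the only change from the forward pass is max instead of sum, plus back-pointers:

```python
# Minimal Viterbi decoder with back-pointers to recover the best path.
pi = [0.9, 0.1]
A = [[0.6, 0.4], [0.3, 0.7]]
B = [{"Red": 0.7, "Blue": 0.3},
     {"Red": 0.4, "Blue": 0.6}]

def viterbi(obs):
    N = len(pi)
    v = [[pi[j] * B[j][obs[0]] for j in range(N)]]
    back = []                                # back[t-1][j] = best predecessor of j at t
    for t in range(1, len(obs)):
        v.append([0.0] * N)
        back.append([0] * N)
        for j in range(N):
            scores = [v[t - 1][i] * A[i][j] for i in range(N)]
            i_best = max(range(N), key=scores.__getitem__)
            back[-1][j] = i_best
            v[t][j] = scores[i_best] * B[j][obs[t]]
    # backtrace from the highest-probability final state
    state = max(range(N), key=v[-1].__getitem__)
    best_prob = v[-1][state]
    path = [state]
    for ptrs in reversed(back):
        state = ptrs[state]
        path.append(state)
    path.reverse()
    return path, best_prob

path, p = viterbi(["Blue", "Blue", "Red"])
```

For Blue Blue Red this recovers the 1-1-1 sequence, agreeing with the exhaustive table on the earlier slide.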
Modeling Sequences, Redux
• Discrete observation values
  – Simple, but inadequate
  – Many observations are highly variable
• Gaussian pdfs over continuous values
  – Assume normally distributed observations
• Typically sum over multiple shared Gaussians
  – "Gaussian mixture models"
  – Trained with the HMM model
For a continuous observation vector o_t of dimension n, the Gaussian observation density for state j is:

  b_j(o_t) = 1 / sqrt((2π)^n |Σ_j|) * exp( -(1/2) (o_t - μ_j)^T Σ_j^{-1} (o_t - μ_j) )

where μ_j is the mean vector and Σ_j the covariance matrix for state j.
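A minimal sketch of the mixture idea for a scalar observation, assuming a hypothetical two-component mixture; the weights, means, and variances below are invented for illustration:

```python
import math

def gaussian_pdf(o, mu, var):
    # Univariate normal density N(o; mu, var)
    return math.exp(-(o - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def gmm_emission(o, mixture):
    # b_j(o) = sum_m c_m * N(o; mu_m, var_m), where the weights c_m sum to 1
    return sum(c * gaussian_pdf(o, mu, var) for c, mu, var in mixture)

# Hypothetical 2-component mixture for one HMM state
b_j = [(0.6, 0.0, 1.0), (0.4, 3.0, 2.0)]
p = gmm_emission(0.5, b_j)
```

In a real recognizer the observations are n-dimensional feature vectors and the components are multivariate Gaussians, but the weighted-sum structure is the same.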
Learning HMMs
• Issue: Where do the probabilities come from?
• Solution: Learn from data
  – Trains transition (a_ij) and emission (b_j) probabilities
  – Typically assume the model structure is given
• Baum-Welch, aka the forward-backward algorithm
  – Iteratively estimate expected counts of transitions taken and symbols emitted
  – Get estimated probabilities from the forward computation
  – Divide probability mass over the contributing paths
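A hedged sketch of one Baum-Welch re-estimation step for the two-state bins-and-balls model: a backward pass mirrors the forward pass, expected transition counts divide probability mass over paths, and normalizing those counts gives new transition probabilities. A real trainer iterates this to convergence over many observation sequences.

```python
# One forward-backward re-estimation step (transitions only, for brevity).
pi = [0.9, 0.1]
A = [[0.6, 0.4], [0.3, 0.7]]
B = [{"Red": 0.7, "Blue": 0.3}, {"Red": 0.4, "Blue": 0.6}]
obs = ["Blue", "Blue", "Red"]
N, T = 2, len(obs)

# Forward pass: alpha[t][j] = P(o_1..o_t, q_t = j)
alpha = [[pi[j] * B[j][obs[0]] for j in range(N)]]
for t in range(1, T):
    alpha.append([sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                  for j in range(N)])

# Backward pass: beta[t][i] = P(o_{t+1}..o_T | q_t = i)
beta = [[1.0] * N for _ in range(T)]
for t in range(T - 2, -1, -1):
    for i in range(N):
        beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                         for j in range(N))

p_obs = sum(alpha[-1])                       # total probability of the data

# Expected transition counts: probability mass divided over all paths
# that use the i -> j transition at some time t
xi = [[sum(alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
           for t in range(T - 1)) / p_obs
       for j in range(N)] for i in range(N)]

# M-step: renormalize expected counts into new transition probabilities
A_new = [[xi[i][j] / sum(xi[i]) for j in range(N)] for i in range(N)]
```

The emission probabilities are re-estimated the same way, from expected counts of each symbol being emitted in each state.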