Learning, Uncertainty, and Information: Big Ideas (November 8, 2004)
Roadmap
• Turing, Intelligence, and Learning
• Noisy-channel model
  – Uncertainty, Bayes' Rule, and Applications
• Hidden Markov Models
  – The Model
  – Decoding the best sequence
  – Training the model (EM)
• N-gram models: Modeling sequences
  – Shannon, Information Theory, and Perplexity
• Conclusion
Turing & Intelligence
• Turing (1950):
  – "Computing Machinery and Intelligence"
  – "Imitation Game" (aka Turing test)
    • Functional definition of intelligence as indistinguishable from human
  – Key question raised: Learning
    • Can a system be intelligent if it only knows its program?
    • Learning is necessary for intelligence
  – 1) Programmed knowledge
  – 2) Learning mechanism
  – Knowledge, reasoning, learning, communication
Noisy-Channel Model
• Original message not directly observable
  – Passed through some channel between sender and receiver, plus noise
  – Examples: telephone (Shannon); word sequence vs. acoustics (Jelinek); genome sequence vs. CATG; object vs. image
• Derive the most likely original input from the observed output
Bayesian Inference
• P(W|O) is difficult to compute directly
  – W: the input; O: the observations
  – Generative, sequence models
• Apply Bayes' Rule, then drop the denominator P(O), which is constant across candidates:

  W* = argmax_W P(W|O)
     = argmax_W P(O|W) P(W) / P(O)
     = argmax_W P(O|W) P(W)
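The decision rule above can be sketched with a toy candidate set; the candidate words, likelihoods, and priors below are invented purely for illustration:

```python
# Toy noisy-channel decoder: choose the candidate W that maximizes
# P(O|W) * P(W).  All numbers here are made up for illustration.
candidates = {
    # W: (P(O|W), P(W))  -- likelihood of the observation, prior on W
    "there": (0.30, 0.020),
    "their": (0.25, 0.015),
    "theirs": (0.10, 0.002),
}

def decode(candidates):
    # argmax over W of P(O|W) * P(W); P(O) is omitted since it is
    # the same for every candidate
    return max(candidates, key=lambda w: candidates[w][0] * candidates[w][1])

best = decode(candidates)
```

The same rule scales to real systems by swapping in an acoustic model for P(O|W) and a language model for P(W).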
Applications
• AI: Speech recognition!, POS tagging, sense tagging, dialogue, image understanding, information retrieval
• Non-AI:
  – Bioinformatics: gene sequencing
  – Security: intrusion detection
  – Cryptography
Probabilistic Reasoning over Time
• Issue: Discrete models
  – Many processes change continuously
  – How do we make observations? States?
• Solution: Discretize
  – "Time slices": make time discrete
  – Observations and states associated with each time slice: O_t, Q_t
• Observations can be discrete or continuous
  – Here we focus on discrete for clarity
Modelling Processes over Time
• Infer the underlying state sequence from the observations
• Issue: A new state depends on preceding states
  – Analyzing sequences
• Problem 1: Possibly unbounded number of probability tables
  – Observation + State + Time
• Solution 1: Assume a stationary process
  – Rules governing the process are the same at all times
• Problem 2: Possibly unbounded number of parents
  – Markov assumption: only consider a finite history
  – Common: first- or second-order Markov: depend on the last state or two
Hidden Markov Models (HMMs)
• An HMM is:
  – 1) A set of states: Q = q_1, q_2, ..., q_N
  – 2) A set of transition probabilities: A = {a_ij}
    • Where a_ij is the probability of transition q_i -> q_j
  – 3) Observation probabilities: B = {b_i(o_t)}
    • The probability of observing o_t in state i
  – 4) An initial probability distribution over states: π_i
    • The probability of starting in state i
  – 5) A set of accepting (final) states
Three Problems for HMMs
• Find the probability of an observation sequence given a model
  – Forward algorithm
• Find the most likely path through a model given an observation sequence
  – Viterbi algorithm (decoding)
• Find the most likely model parameters given an observation sequence
  – Baum-Welch (EM) algorithm
Bins and Balls Example
• Assume there are two bins filled with red and blue balls. Behind a curtain, someone selects a bin and then draws a ball from it (and replaces it). They then select either the same bin or the other one and then select another ball…
– (Example due to J. Martin)
Bins and Balls
• Π (initial): Bin1: 0.9, Bin2: 0.1
• A (transitions):
        Bin1  Bin2
  Bin1  0.6   0.4
  Bin2  0.3   0.7
• B (emissions):
        Bin1  Bin2
  Red   0.7   0.4
  Blue  0.3   0.6
Bins and Balls
• Assume the observation sequence:
  – Blue Blue Red (BBR)
• Both bins have red and blue balls
  – Any state sequence could produce the observations
• However, they are NOT equally likely
  – Big difference in start probabilities
  – Observation depends on state
  – State depends on prior state
Bins and Balls
Observations: Blue Blue Red

States  Probability
1 1 1   (0.9*0.3)*(0.6*0.3)*(0.6*0.7) = 0.0204
1 1 2   (0.9*0.3)*(0.6*0.3)*(0.4*0.4) = 0.0078
1 2 1   (0.9*0.3)*(0.4*0.6)*(0.3*0.7) = 0.0136
1 2 2   (0.9*0.3)*(0.4*0.6)*(0.7*0.4) = 0.0181
2 1 1   (0.1*0.6)*(0.3*0.3)*(0.6*0.7) = 0.0023
2 1 2   (0.1*0.6)*(0.3*0.3)*(0.4*0.4) = 0.0009
2 2 1   (0.1*0.6)*(0.7*0.6)*(0.3*0.7) = 0.0053
2 2 2   (0.1*0.6)*(0.7*0.6)*(0.7*0.4) = 0.0071
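The enumeration above can be checked by brute force; a minimal sketch using the transition, emission, and start probabilities from the earlier slides (states 1 and 2 stand for Bin1 and Bin2):

```python
from itertools import product

# Bins-and-balls HMM parameters from the slides
pi = {1: 0.9, 2: 0.1}                      # initial probabilities
A = {(1, 1): 0.6, (1, 2): 0.4,
     (2, 1): 0.3, (2, 2): 0.7}             # transition probabilities
B = {1: {"Red": 0.7, "Blue": 0.3},
     2: {"Red": 0.4, "Blue": 0.6}}         # emission probabilities

obs = ["Blue", "Blue", "Red"]

def joint(states, obs):
    """P(states, obs) = start prob * emissions * transitions."""
    p = pi[states[0]] * B[states[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= A[(states[t - 1], states[t])] * B[states[t]][obs[t]]
    return p

# Enumerate every state sequence (2^3 = 8 of them)
probs = {s: joint(s, obs) for s in product([1, 2], repeat=3)}
total = sum(probs.values())                # Problem 1: P(obs), by brute force
best = max(probs, key=probs.get)           # Problem 2: most likely path
```

This exhaustive enumeration is exactly what the forward and Viterbi algorithms later avoid by dynamic programming.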
Answers and Issues
• Here, to compute the probability of the observations
  – Just add up all the state-sequence probabilities
• To find the most likely state sequence
  – Just pick the sequence with the highest value
• Problem: Computing all paths is expensive
  – About 2T * N^T multiplications
• Solution: Dynamic programming
  – Sweep across all states at each time step
  – Summing (Problem 1) or maximizing (Problem 2)
Forward Probability
Given the model λ, the forward probability α_j(t) is:

  Definition:     α_j(t) = P(o_1, o_2, ..., o_t, q_t = j | λ)
  Initialization: α_j(1) = π_j b_j(o_1),  1 ≤ j ≤ N
  Induction:      α_j(t+1) = [ Σ_{i=1..N} α_i(t) a_ij ] b_j(o_{t+1})
  Termination:    P(O | λ) = Σ_{i=1..N} α_i(T)

Where α is the forward probability, t is the time in the utterance, i and j are states in the HMM, a_ij is the transition probability from i to j, b_j(o_t) is the probability of observing o_t in state j, N is the number of states, and T is the last time step.
Acoustic Model
• 3-state phone model for [m]
  – Use a Hidden Markov Model (HMM)
  – Probability of a sequence: sum of the probabilities of its paths

[Figure: a 3-state phone HMM with states Onset, Mid, End plus a Final state; arcs labeled with transition probabilities (0.7, 0.3, 0.9, 0.1, 0.4, 0.6) and states labeled with observation probabilities over codebook symbols C1–C6 (e.g., Onset: C1 0.5, C2 0.2, C3 0.3)]
Forward Algorithm
• Idea: a matrix where each cell forward[t,j] represents the probability of being in state j after seeing the first t observations.
• Each cell expresses the probability: forward[t,j] = P(o_1, o_2, ..., o_t, q_t = j | λ)
  – q_t = j means "the t-th state in the sequence of states is state j."
• Compute the probability by summing over the extensions of all paths leading to the current cell.
• An extension of a path from state i at time t-1 to state j at time t is computed by multiplying together:
  i. the previous path probability from the previous cell, forward[t-1,i],
  ii. the transition probability a_ij from previous state i to current state j, and
  iii. the observation likelihood b_j(o_t) that current state j matches observation symbol o_t.
Forward Algorithm
Function Forward(observations of length T, state-graph) returns observation probability
  num-states <- num-of-states(state-graph)
  Create a path probability matrix forward[num-states+2, T+2]
  forward[0,0] <- 1.0
  for each time step t from 0 to T do
    for each state s from 0 to num-states do
      for each transition s' from s in state-graph do
        new-score <- forward[s,t] * a[s,s'] * b_s'(o_t)
        forward[s',t+1] <- forward[s',t+1] + new-score
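As a concrete check of the pseudocode above, here is a minimal Python forward pass over the bins-and-balls model from the earlier slides (states 0 and 1 stand for Bin1 and Bin2):

```python
# Minimal forward algorithm; alpha[t][j] is the probability of the
# first t+1 observations with the model ending in state j at time t.
pi = [0.9, 0.1]
A = [[0.6, 0.4], [0.3, 0.7]]                 # A[i][j] = P(state j | state i)
B = [{"Red": 0.7, "Blue": 0.3},
     {"Red": 0.4, "Blue": 0.6}]              # emission probabilities

def forward(obs):
    N = len(pi)
    # initialization: alpha_j(1) = pi_j * b_j(o_1)
    alpha = [[pi[j] * B[j][obs[0]] for j in range(N)]]
    # induction: sum over all predecessors, then emit
    for t in range(1, len(obs)):
        alpha.append([
            sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
            for j in range(N)
        ])
    # termination: P(obs | model) = sum of final-column alphas
    return sum(alpha[-1])

p = forward(["Blue", "Blue", "Red"])
```

The result matches the sum obtained by enumerating all eight state sequences by hand, but with O(N^2 T) work instead of O(N^T).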
Viterbi Code
Function Viterbi(observations of length T, state-graph) returns best-path
  num-states <- num-of-states(state-graph)
  Create a path probability matrix viterbi[num-states+2, T+2]
  viterbi[0,0] <- 1.0
  for each time step t from 0 to T do
    for each state s from 0 to num-states do
      for each transition s' from s in state-graph do
        new-score <- viterbi[s,t] * a[s,s'] * b_s'(o_t)
        if ((viterbi[s',t+1] == 0) || (viterbi[s',t+1] < new-score))
          then viterbi[s',t+1] <- new-score
               back-pointer[s',t+1] <- s
  Backtrace from the highest-probability state in the final column of viterbi[] and return the path
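The same idea in runnable form, again over the bins-and-balls model (states 0 and 1 stand for Bin1 and Bin2); the only change from the forward pass is max instead of sum, plus back-pointers:

```python
# Minimal Viterbi decoder with back-pointers to recover the best path.
pi = [0.9, 0.1]
A = [[0.6, 0.4], [0.3, 0.7]]
B = [{"Red": 0.7, "Blue": 0.3},
     {"Red": 0.4, "Blue": 0.6}]

def viterbi(obs):
    N = len(pi)
    v = [[pi[j] * B[j][obs[0]] for j in range(N)]]
    back = []                                # back[t-1][j] = best predecessor of j at t
    for t in range(1, len(obs)):
        v.append([0.0] * N)
        back.append([0] * N)
        for j in range(N):
            scores = [v[t - 1][i] * A[i][j] for i in range(N)]
            i_best = max(range(N), key=scores.__getitem__)
            back[-1][j] = i_best
            v[t][j] = scores[i_best] * B[j][obs[t]]
    # backtrace from the highest-probability final state
    state = max(range(N), key=v[-1].__getitem__)
    best_prob = v[-1][state]
    path = [state]
    for ptrs in reversed(back):
        state = ptrs[state]
        path.append(state)
    path.reverse()
    return path, best_prob

path, p = viterbi(["Blue", "Blue", "Red"])
```

For Blue Blue Red this recovers the 1-1-1 sequence, agreeing with the exhaustive table on the earlier slide.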
Modeling Sequences, Redux
• Discrete observation values
  – Simple, but inadequate
  – Many observations are highly variable
• Gaussian pdfs over continuous values
  – Assume normally distributed observations
• Typically sum over multiple shared Gaussians
  – "Gaussian mixture models"
  – Trained with the HMM model
For a continuous observation vector o_t of dimension n, the Gaussian observation density for state j is:

  b_j(o_t) = 1 / sqrt((2π)^n |Σ_j|) * exp( -(1/2) (o_t - μ_j)^T Σ_j^{-1} (o_t - μ_j) )

where μ_j is the mean vector and Σ_j the covariance matrix for state j.
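A minimal sketch of the mixture idea for a scalar observation, assuming a hypothetical two-component mixture; the weights, means, and variances below are invented for illustration:

```python
import math

def gaussian_pdf(o, mu, var):
    # Univariate normal density N(o; mu, var)
    return math.exp(-(o - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def gmm_emission(o, mixture):
    # b_j(o) = sum_m c_m * N(o; mu_m, var_m), where the weights c_m sum to 1
    return sum(c * gaussian_pdf(o, mu, var) for c, mu, var in mixture)

# Hypothetical 2-component mixture for one HMM state
b_j = [(0.6, 0.0, 1.0), (0.4, 3.0, 2.0)]
p = gmm_emission(0.5, b_j)
```

In a real recognizer the observations are n-dimensional feature vectors and the components are multivariate Gaussians, but the weighted-sum structure is the same.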
Learning HMMs
• Issue: Where do the probabilities come from?
• Solution: Learn from data
  – Trains transition (a_ij) and emission (b_j) probabilities
  – Typically assume the model structure is given
• Baum-Welch, aka the forward-backward algorithm
  – Iteratively estimate expected counts of transitions taken and symbols emitted
  – Get estimated probabilities from the forward computation
  – Divide probability mass over the contributing paths
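A hedged sketch of one Baum-Welch re-estimation step for the two-state bins-and-balls model: a backward pass mirrors the forward pass, expected transition counts divide probability mass over paths, and normalizing those counts gives new transition probabilities. A real trainer iterates this to convergence over many observation sequences.

```python
# One forward-backward re-estimation step (transitions only, for brevity).
pi = [0.9, 0.1]
A = [[0.6, 0.4], [0.3, 0.7]]
B = [{"Red": 0.7, "Blue": 0.3}, {"Red": 0.4, "Blue": 0.6}]
obs = ["Blue", "Blue", "Red"]
N, T = 2, len(obs)

# Forward pass: alpha[t][j] = P(o_1..o_t, q_t = j)
alpha = [[pi[j] * B[j][obs[0]] for j in range(N)]]
for t in range(1, T):
    alpha.append([sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                  for j in range(N)])

# Backward pass: beta[t][i] = P(o_{t+1}..o_T | q_t = i)
beta = [[1.0] * N for _ in range(T)]
for t in range(T - 2, -1, -1):
    for i in range(N):
        beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                         for j in range(N))

p_obs = sum(alpha[-1])                       # total probability of the data

# Expected transition counts: probability mass divided over all paths
# that use the i -> j transition at some time t
xi = [[sum(alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
           for t in range(T - 1)) / p_obs
       for j in range(N)] for i in range(N)]

# M-step: renormalize expected counts into new transition probabilities
A_new = [[xi[i][j] / sum(xi[i]) for j in range(N)] for i in range(N)]
```

The emission probabilities are re-estimated the same way, from expected counts of each symbol being emitted in each state.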