Speech Recognition (Part 2)
T. J. Hazen
MIT Computer Science and Artificial Intelligence Laboratory
Lecture Overview
• Probabilistic framework
• Pronunciation modeling
• Language modeling
• Finite state transducers
• Search
• System demonstrations (time permitting)
Probabilistic Framework
• Speech recognition is typically performed using a probabilistic modeling approach
• Goal is to find the most likely string of words, W, given the acoustic observations, A:
$\hat{W} = \arg\max_{W} P(W \mid A)$
• The expression is rewritten using Bayes’ Rule:
$\hat{W} = \arg\max_{W} \dfrac{P(A \mid W)\, P(W)}{P(A)} = \arg\max_{W} P(A \mid W)\, P(W)$

(P(A) does not depend on W, so it can be dropped from the maximization.)
Probabilistic Framework
• Words are represented as sequences of phonetic units.
• Using phonetic units, U, the expression expands to:

$\arg\max_{U,W}\; \underbrace{P(A \mid U)}_{\text{acoustic model}}\; \underbrace{P(U \mid W)}_{\text{pronunciation model}}\; \underbrace{P(W)}_{\text{language model}}$

• Pronunciation and language models provide constraint
• Pronunciation and language models are encoded in a lexical network
• Search must efficiently find the most likely U and W
Phonemes
• Phonemes are the basic linguistic units used to construct morphemes, words, and sentences.
  – Phonemes represent unique canonical acoustic sounds.
  – When constructing words, changing a single phoneme changes the word.
• Example phonemic mappings:
  – pin → /p ih n/
  – thought → /th ao t/
  – saves → /s ey v z/
• English spelling is not (exactly) phonemic:
  – Pronunciation cannot always be determined from spelling.
  – Homophones have the same phonemes but different spellings:
    * two vs. to vs. too, bear vs. bare, queue vs. cue, etc.
  – The same spelling can have different pronunciations:
    * read, record, either, etc.
Phonemic Units and Classes
• Vowels: aa : pot, ae : bat, ah : but, ao : bought, aw : bout, ax : about, ay : buy, eh : bet, er : bert, ey : bait, ih : bit, iy : beat, ow : boat, oy : boy, uh : book, uw : boot
• Semivowels: l : light, r : right, w : wet, y : yet
• Fricatives: s : sue, z : zoo, sh : shoe, zh : azure, f : fee, v : vee, th : thesis, dh : that, hh : hat
• Nasals: m : might, n : night, ng : sing
• Affricates: ch : chew, jh : Joe
• Stops: p : pay, b : bay, t : tea, d : day, k : key, g : go
Phones
• Phones (or phonetic units) are used to represent the actual acoustic realization of phonemes.
• Examples:
  – Stops contain a closure and a release:
    * /t/ → [tcl t]
    * /k/ → [kcl k]
  – The /t/ and /d/ phonemes can be flapped:
    * utter /ah t er/ → [ah dx er]
    * udder /ah d er/ → [ah dx er]
  – Vowels can be fronted:
    * Tuesday /t uw z d ey/ → [tcl t ux z d ey]
Enhanced Phoneme Labels
• Stops: p : pay, b : bay, t : tea, d : day, k : key, g : go
• Special sequences: nt : interview, tq en : Clinton
• Stops w/ optional release: pd : tap, bd : tab, td : pat, dd : bad, kd : pack, gd : dog
• Unaspirated stops: p- : speed, t- : steep, k- : ski
• Stops w/ optional flap: tf : batter, df : badder
• Retroflexed stops: tr : tree, dr : drop
Example Phonemic Baseform File
```
<hangup>    : _h1 +                      # _h1, _n1 are special noise model symbols;
<noise>     : _n1 +                      #   "+" repeats the previous symbol
<uh>        : ah_fp                      # ah_fp is a special filled-pause vowel
<um>        : ah_fp m
adder       : ae df er
atlanta     : ( ae | ax ) td l ae nt ax  # "( a | b )" gives alternate pronunciations
either      : ( iy | ay ) th er
laptop      : l ae pd t aa pd
northwest   : n ao r th w eh s td
speech      : s p- iy ch
temperature : t eh m p ( r ? ax | er ax ? ) ch er   # "?" marks an optional phoneme
trenton     : tr r eh n tq en
```
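To make the baseform notation concrete, here is a minimal Python sketch (my illustration, not the lecture's actual lexicon tools) that expands an entry's alternations `( a | b )` and optional phonemes `x ?` into the full set of pronunciations:

```python
def expand(tokens):
    """Return all phoneme-sequence alternatives for a baseform token list."""
    results = [[]]          # partial expansions, built left to right
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == '(':
            depth, j = 1, i + 1
            while depth:    # find the matching ')'
                if tokens[j] == '(': depth += 1
                elif tokens[j] == ')': depth -= 1
                j += 1
            inner = tokens[i + 1:j - 1]
            # split the group into top-level alternatives on '|'
            alts, cur, depth = [], [], 0
            for t in inner:
                if t == '(': depth += 1
                if t == ')': depth -= 1
                if t == '|' and depth == 0:
                    alts.append(cur); cur = []
                else:
                    cur.append(t)
            alts.append(cur)
            choices = [seq for a in alts for seq in expand(a)]
            i = j
        else:
            choices = [[tok]]
            i += 1
        # a trailing '?' makes the preceding unit optional
        if i < len(tokens) and tokens[i] == '?':
            choices = choices + [[]]
            i += 1
        results = [r + c for r in results for c in choices]
    return results

# 'atlanta : ( ae | ax ) td l ae nt ax' -> two pronunciations
for p in expand('( ae | ax ) td l ae nt ax'.split()):
    print(' '.join(p))
```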
Applying Phonological Rules
• Multiple phonetic realizations of a phoneme can be generated by applying phonological rules.
• Example:

  butter : b ah tf er

  This can be realized phonetically as:

  bcl b ah tcl t er  (standard /t/)

  or as:

  bcl b ah dx er  (flapped /t/)

• Phonological rewrite rules can be used to generate this network:

  butter : bcl b ah ( tcl t | dx ) er
Example Phonological Rules
• Example rule for /t/ deletion (“destination”):

  {s} t {ax ix} => [tcl t];

  (rule format: {left context} phoneme {right context} => phonetic realization)

• Example rule for palatalization of /s/ (“miss you”):

  {} s {y} => s | sh;
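As an illustration of how such rules might be applied (a sketch under my own rule encoding, not SUMMIT's actual rule engine), each rule below is a (left context, phoneme, right context, alternatives) tuple, and the optional `[tcl t]` realization is modeled as an empty alternative:

```python
RULES = [
    # /t/ may delete between /s/ and /ax/ or /ix/ ("destination")
    ({'s'}, 't', {'ax', 'ix'}, ['tcl t', '']),
    # /s/ may palatalize before /y/ ("miss you")
    (set(), 's', {'y'}, ['s', 'sh']),
]

def realizations(phonemes):
    """Expand a phoneme sequence into all phone sequences the rules allow."""
    options = []
    for i, p in enumerate(phonemes):
        left = phonemes[i - 1] if i > 0 else None
        right = phonemes[i + 1] if i + 1 < len(phonemes) else None
        alts = [p]  # default: the phoneme surfaces unchanged
        for lctx, phn, rctx, outs in RULES:
            if p == phn and (not lctx or left in lctx) and (not rctx or right in rctx):
                alts = outs
                break
        options.append(alts)
    # cross-product of the per-position alternatives
    seqs = ['']
    for alts in options:
        seqs = [(s + ' ' + a).strip() for s in seqs for a in alts]
    return seqs

print(realizations('d eh s t ax n ey sh ax n'.split()))  # "destination"
print(realizations('m ih s y uw'.split()))               # "miss you"
```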
Contractions and Reductions
• Examples of contractions:
  – what’s → what is
  – isn’t → is not
  – won’t → will not
  – i’d → i would | i had
  – today’s → today is | today’s
• Examples of multi-word reductions:
  – gimme → give me
  – gonna → going to
  – ave → avenue
  – ‘bout → about
  – d’y’ave → do you have
• Contracted and reduced forms are entered in the lexical dictionary
Language Modeling
• A language model constrains hypothesized word sequences
• A finite state grammar (FSG) example:

  [FSG diagram: (tell me | what is) the (forecast | weather) (for | in) (baltimore | boston)]

• Probabilities can be added to arcs for additional constraint
• FSGs work well when users stay within the grammar…
• …but FSGs can’t cover everything that might be spoken.
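A toy rendering of this grammar as a state-transition table, with the state numbering and exact arcs being my reconstruction of the figure:

```python
FSG = {
    0: [('tell me', 1), ('what is', 1)],
    1: [('the', 2)],
    2: [('forecast', 3), ('weather', 3)],
    3: [('for', 4), ('in', 4)],
    4: [('baltimore', 5), ('boston', 5)],
}
FINAL = {5}

def sentences(state=0, prefix=()):
    """Enumerate every word sequence the grammar accepts."""
    if state in FINAL:
        yield ' '.join(prefix)
    for word, nxt in FSG.get(state, []):
        yield from sentences(nxt, prefix + (word,))

for s in sentences():
    print(s)   # e.g. "what is the weather in boston"
```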
N-gram Language Modeling
• An n-gram model is a statistical language model
• Predicts current word based on previous n-1 words
• Trigram model expression: $P(w_n \mid w_{n-2}, w_{n-1})$
• Examples:
  – P( boston | arriving in )
  – P( seventeenth | tuesday march )
• An n-gram model allows any sequence of words…
• …but prefers sequences common in the training data.
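A maximum-likelihood trigram estimate is just a ratio of counts; a minimal sketch (not the lecture's training recipe):

```python
from collections import Counter

def trigram_probs(corpus):
    """Return a function p(a, b, c) = count(a b c) / count(a b)."""
    tri, bi = Counter(), Counter()
    for sent in corpus:
        words = ['<s>', '<s>'] + sent.split() + ['</s>']
        for a, b, c in zip(words, words[1:], words[2:]):
            tri[(a, b, c)] += 1
            bi[(a, b)] += 1
    return lambda a, b, c: tri[(a, b, c)] / bi[(a, b)] if bi[(a, b)] else 0.0

p = trigram_probs(['arriving in boston', 'arriving in denver'])
print(p('arriving', 'in', 'boston'))   # 0.5
```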
N-gram Model Smoothing
• For a bigram model, what if $p(w_n \mid w_{n-1}) = 0$?
• To avoid sparse training data problems, we can use an interpolated bigram:

$\tilde{p}(w_n \mid w_{n-1}) = \lambda_{w_{n-1}}\, p(w_n \mid w_{n-1}) + (1 - \lambda_{w_{n-1}})\, \tilde{p}(w_n)$

• One method for determining the interpolation weight:

$\lambda_{w_{n-1}} = \dfrac{c(w_{n-1})}{c(w_{n-1}) + K}$
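A sketch of this interpolation scheme in Python (my implementation of the slide's formulas; here the backoff distribution $\tilde{p}(w_n)$ is left as an unsmoothed unigram):

```python
from collections import Counter

class InterpolatedBigram:
    def __init__(self, corpus, K=5.0):
        self.uni, self.bi = Counter(), Counter()
        self.K, self.total = K, 0
        for sent in corpus:
            words = ['<s>'] + sent.split()
            for w in words[1:]:
                self.uni[w] += 1
                self.total += 1
            for a, b in zip(words, words[1:]):
                self.bi[(a, b)] += 1
        # c(w_{n-1}): how often each word occurs as a conditioning history
        self.hist = Counter()
        for (a, _), c in self.bi.items():
            self.hist[a] += c

    def prob(self, prev, word):
        c_hist = self.hist[prev]
        lam = c_hist / (c_hist + self.K)           # lambda = c / (c + K)
        p_bi = self.bi[(prev, word)] / c_hist if c_hist else 0.0
        p_uni = self.uni[word] / self.total        # unsmoothed unigram backoff
        return lam * p_bi + (1 - lam) * p_uni

lm = InterpolatedBigram(['arriving in boston', 'arriving in denver'])
print(lm.prob('in', 'boston'))   # nonzero even with sparse data
print(lm.prob('in', 'chicago'))  # 0.0 only because 'chicago' is unseen entirely
```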
Class N-gram Language Modeling
• Class n-gram models can also help sparse data problems
• Class trigram expression:

$P(w_n \mid w_{n-2}, w_{n-1}) \approx P(\mathrm{class}(w_n) \mid \mathrm{class}(w_{n-2}), \mathrm{class}(w_{n-1}))\; P(w_n \mid \mathrm{class}(w_n))$

• Example:

  P( seventeenth | tuesday march ) ≈ P( NTH | WEEKDAY MONTH ) · P( seventeenth | NTH )
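Evaluating the factorization is just a product of two table lookups; the probability values below are invented purely for illustration:

```python
p_class = {('WEEKDAY', 'MONTH', 'NTH'): 0.30}   # P(class_n | class_{n-2}, class_{n-1})
p_word  = {('seventeenth', 'NTH'): 0.02}        # P(w_n | class_n)
CLASS   = {'tuesday': 'WEEKDAY', 'march': 'MONTH', 'seventeenth': 'NTH'}

def class_trigram(w2, w1, w):
    key = (CLASS[w2], CLASS[w1], CLASS[w])
    return p_class.get(key, 0.0) * p_word.get((w, CLASS[w]), 0.0)

print(class_trigram('tuesday', 'march', 'seventeenth'))  # 0.006
```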
Multi-Word N-gram Units
• Common multi-word units can be treated as a single unit within an n-gram language model (see the sketch after this list)
• Common uses of compound units:
  – Common multi-word phrases:
    * thank_you , good_bye , excuse_me
  – Multi-word sequences that act as a single semantic unit:
    * new_york , labor_day , wind_speed
  – Letter sequences or initials:
    * j_f_k , t_w_a , washington_d_c
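One simple way to apply such a mapping is a greedy left-to-right rewrite of known phrases into compound tokens before language model training; a sketch with a hypothetical compound table:

```python
COMPOUNDS = {('new', 'york'): 'new_york', ('thank', 'you'): 'thank_you',
             ('labor', 'day'): 'labor_day'}

def merge_compounds(words):
    """Greedily replace known adjacent word pairs with compound tokens."""
    out, i = [], 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in COMPOUNDS:
            out.append(COMPOUNDS[pair]); i += 2
        else:
            out.append(words[i]); i += 1
    return out

print(merge_compounds('flights to new york on labor day'.split()))
# ['flights', 'to', 'new_york', 'on', 'labor_day']
```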
Finite-State Transducer (FST) Motivation
• Most speech recognition constraints and results can be represented as finite-state automata:
  – Language models (e.g., n-grams and word networks)
  – Lexicons
  – Phonological rules
  – N-best lists
  – Word graphs
  – Recognition paths
• A common representation and common algorithms are desirable:
  – Consistency
  – Powerful algorithms can be employed throughout the system
  – Flexibility to combine or factor in unforeseen ways
What is an FST?
• One initial state
• One or more final states
• Transitions between states, labeled input : output / weight
  – input requires an input symbol to match
  – output indicates the symbol to output when the transition is taken
  – epsilon (ε) consumes no input or produces no output
  – weight is the cost (e.g., -log probability) of taking the transition
• An FST defines a weighted relationship between regular languages
• A generalization of the classic finite-state acceptor (FSA); a minimal sketch follows below
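A bare-bones weighted FST along these lines (an illustrative sketch, not a real toolkit such as OpenFST; it assumes the machine has no epsilon cycles):

```python
from dataclasses import dataclass, field

@dataclass
class FST:
    start: int
    finals: set
    # arcs[state] = list of (input, output, weight, next_state); None = epsilon
    arcs: dict = field(default_factory=dict)

    def add_arc(self, src, inp, out, weight, dst):
        self.arcs.setdefault(src, []).append((inp, out, weight, dst))

    def transduce(self, symbols):
        """Yield (output_sequence, total_weight) for every accepting path."""
        stack = [(self.start, 0, (), 0.0)]
        while stack:
            state, i, out, w = stack.pop()
            if i == len(symbols) and state in self.finals:
                yield out, w
            for inp, o, wt, dst in self.arcs.get(state, []):
                if inp is None:                      # epsilon arc: consume no input
                    stack.append((dst, i, out + ((o,) if o else ()), w + wt))
                elif i < len(symbols) and symbols[i] == inp:
                    stack.append((dst, i + 1, out + ((o,) if o else ()), w + wt))

# Tiny lexicon FST mapping /k ae t/ -> 'cat', with weight acting as -log P
f = FST(start=0, finals={3})
f.add_arc(0, 'k', None, 0.0, 1)
f.add_arc(1, 'ae', None, 0.0, 2)
f.add_arc(2, 't', 'cat', 0.7, 3)
print(list(f.transduce(['k', 'ae', 't'])))   # [(('cat',), 0.7)]
```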
FST Example: Lexicon
• Lexicon maps /phonemes/ to ‘words’
• Words can share parts of their pronunciations
• Sharing at the beginning of words benefits recognition speed, because pruning can discard many words at once
FST Composition
• Composition (∘) combines two FSTs into a single FST that performs both mappings in a single step:

  (words → /phonemes/) ∘ (/phonemes/ → [phones]) = (words → [phones])
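Composition itself can be sketched as a product construction over state pairs; this minimal version (my illustration) handles only epsilon-free FSTs, with path weights added as in the tropical semiring:

```python
def compose(A, B):
    """Compose two FSTs, each given as (start, finals, [(src, in, out, wt, dst)])."""
    startA, finalsA, arcsA = A
    startB, finalsB, arcsB = B
    arcs, finals = [], set()
    seen, frontier = set(), [(startA, startB)]
    while frontier:
        qa, qb = frontier.pop()
        if (qa, qb) in seen:
            continue
        seen.add((qa, qb))
        if qa in finalsA and qb in finalsB:
            finals.add((qa, qb))
        for (sa, ia, oa, wa, da) in arcsA:
            if sa != qa:
                continue
            for (sb, ib, ob, wb, db) in arcsB:
                # B must consume exactly what A outputs
                if sb == qb and ib == oa:
                    arcs.append(((qa, qb), ia, ob, wa + wb, (da, db)))
                    frontier.append((da, db))
    return ((startA, startB), finals, arcs)

# words->phonemes composed with phonemes->phones (degenerate one-arc toys)
L = (0, {1}, [(0, 'cat', 'k ae t', 0.0, 1)])              # lexicon
P = (0, {1}, [(0, 'k ae t', 'kcl k ae tcl t', 0.0, 1)])   # phonological
print(compose(L, P))   # maps 'cat' directly to its phone string
```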
FST Optimization Example
[Figure: a letter-to-word lexicon, before optimization]
FST Optimization Example: Determinization
• Determinization turns the lexicon into a tree
• Words share common prefixes
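For an unweighted lexicon, determinization amounts to building a prefix tree (true FST determinization must also handle weights and delayed outputs); a sketch:

```python
def build_trie(lexicon):
    """Build a phoneme prefix tree; '#' nodes carry the word outputs."""
    trie = {}
    for word, phonemes in lexicon.items():
        node = trie
        for p in phonemes.split():
            node = node.setdefault(p, {})
        node.setdefault('#', []).append(word)   # '#' marks a word end
    return trie

lex = {'to': 't uw', 'two': 't uw', 'tea': 't iy'}
print(build_trie(lex))
# {'t': {'uw': {'#': ['to', 'two']}, 'iy': {'#': ['tea']}}}
```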
FST Optimization Example: Minimization
• Minimization enables sharing at the ends
A Cascaded FST Recognizer
• The recognizer is a cascade of FSTs, each mapping between adjacent levels of representation:
  – C : CD model mapping (acoustic model labels ↔ phonetic units)
  – P : phonological model (phonetic units ↔ phonemic units)
  – L : lexical model (phonemic units ↔ spoken words)
  – R : reductions model (spoken words ↔ canonical words)
  – M : multi-word mapping (canonical words ↔ multi-word units)
  – G : language model (over multi-word units)
• P and L form the pronunciation model; R, M, and G form the language model
A Cascaded FST Recognizer
• Example: the levels of representation for “gimme new york city”:
  – Multi-word units: give me new_york_city
  – Canonical words: give me new york city
  – Spoken words: gimme new york city
  – Phonemic units: g ih m iy n uw y ao r kd s ih tf iy
  – Phonetic units: gcl g ih m iy n uw y ao r kcl s ih dx iy
Search
• Once again, the probabilistic expression is:

$\arg\max_{U,W}\; \underbrace{P(A \mid U)}_{\text{acoustic model}}\; \underbrace{P(U \mid W)\, P(W)}_{\text{lexical FST}}$

• Pronunciation and language models are encoded in the FST
• Search must efficiently find the most likely U and W
Viterbi Search
• Viterbi search: a time-synchronous, breadth-first search

[Figure: search trellis with lexical nodes h#, m, a, r, z on the vertical axis and time t0–t8 on the horizontal axis; the best path spells h# m a r z h# (“mars”)]
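A compact sketch of time-synchronous Viterbi decoding over such a network (the log-probability tables below are hypothetical, and this is my illustration rather than the lecture's decoder):

```python
import math

def viterbi(obs_scores, transitions, start, final):
    """obs_scores[t][s] = log P(observation t | state s);
       transitions[s] = list of (next_state, log transition prob)."""
    T = len(obs_scores)
    best = {start: 0.0}               # best log-score per active state
    back = [{} for _ in range(T)]     # backpointers per frame
    for t in range(T):
        new = {}
        for s, score in best.items():
            for nxt, lp in transitions.get(s, []):
                cand = score + lp + obs_scores[t].get(nxt, -math.inf)
                if cand > new.get(nxt, -math.inf):
                    new[nxt] = cand
                    back[t][nxt] = s
        best = new                    # advance one frame, breadth-first
    # trace the best path backwards from the final state (assumes it is reached)
    path, s = [final], final
    for t in range(T - 1, -1, -1):
        s = back[t][s]
        path.append(s)
    return best.get(final, -math.inf), path[::-1]

# Toy run spelling h# m a r z h# ("mars"), one phone per frame:
obs = [{'m': -0.1}, {'a': -0.1}, {'r': -0.1}, {'z': -0.1}, {'h#_end': -0.1}]
trans = {'h#': [('h#', -0.7), ('m', -0.7)],
         'm':  [('m', -0.7), ('a', -0.7)],
         'a':  [('a', -0.7), ('r', -0.7)],
         'r':  [('r', -0.7), ('z', -0.7)],
         'z':  [('z', -0.7), ('h#_end', -0.7)]}
print(viterbi(obs, trans, 'h#', 'h#_end'))
# (-4.0, ['h#', 'm', 'a', 'r', 'z', 'h#_end'])
```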
Viterbi Search Pruning
• Search efficiency can be improved with pruning:
  – Score-based: don’t extend low-scoring hypotheses
  – Count-based: extend only a fixed number of hypotheses

[Figure: the same trellis, with low-scoring partial paths marked as pruned]
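Both pruning styles can be applied to the per-frame hypothesis set (`new` in the Viterbi sketch above); `beam_width` and `beam_size` are hypothetical tuning parameters:

```python
def prune(hyps, beam_width=8.0, beam_size=1000):
    """Keep hypotheses within beam_width of the best, capped at beam_size."""
    if not hyps:
        return hyps
    best_score = max(hyps.values())
    # score-based: drop hypotheses more than beam_width below the best
    kept = {s: v for s, v in hyps.items() if v >= best_score - beam_width}
    # count-based: keep at most beam_size of the highest-scoring hypotheses
    top = sorted(kept.items(), key=lambda kv: kv[1], reverse=True)[:beam_size]
    return dict(top)
```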
Search Pruning Example
• Count-based pruning can effectively reduce the search space
• Example: fix the beam size (count) and vary the beam width (score)
N-best Computation with Backwards A* Search
• A backwards A* search can be used to find the N-best paths
• The Viterbi backtrace is used as the future estimate for path scores

[Figure: the same trellis, searched backwards from the final node]
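A hedged reconstruction of the idea in Python: run a forward Viterbi pass, then search backwards with A*, using the forward score as the (here exact) future estimate so that complete paths pop off the queue in best-first order:

```python
import heapq, math

def nbest(obs, trans, start, final, N):
    """Return up to N (score, state path) pairs, best first."""
    T = len(obs)
    # forward pass: V[t][s] = best log-score reaching s after frame t
    V = [dict() for _ in range(T)]
    prev = {start: 0.0}
    for t in range(T):
        for s, sc in prev.items():
            for nxt, lp in trans.get(s, []):
                cand = sc + lp + obs[t].get(nxt, -math.inf)
                if cand > V[t].get(nxt, -math.inf):
                    V[t][nxt] = cand
        prev = V[t]
    inv = {}                                  # arcs reversed for the backwards pass
    for s, arcs in trans.items():
        for nxt, lp in arcs:
            inv.setdefault(nxt, []).append((s, lp))
    f0 = V[T - 1].get(final, -math.inf)
    # heap entries: (-(suffix score g + future estimate h), g, frame, state, suffix)
    heap = [(-f0, 0.0, T - 1, final, (final,))] if f0 > -math.inf else []
    results = []
    while heap and len(results) < N:
        negf, g, t, s, suffix = heapq.heappop(heap)
        if t < 0:                             # reached the start: path is complete
            results.append((-negf, suffix))
            continue
        step = g + obs[t].get(s, -math.inf)   # consume frame t in state s
        for p, lp in inv.get(s, []):
            h = V[t - 1].get(p, -math.inf) if t > 0 else (0.0 if p == start else -math.inf)
            if h > -math.inf:
                heapq.heappush(heap, (-(step + lp + h), step + lp, t - 1, p, (p,) + suffix))
    return results

obs = [{'m': -0.1}, {'h#_end': -0.1}]
trans = {'h#': [('m', -0.7)], 'm': [('h#_end', -0.7)]}
print(nbest(obs, trans, 'h#', 'h#_end', 1))
# [(-1.6, ('h#', 'm', 'h#_end'))]
```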
Street Address Recognition
• Street address recognition is difficult:
  – 6.2M unique street, city, state pairs in the US (283K unique words)
  – High confusion rate among similar street names
  – Very large search space for recognition
• Commercial solution: directed dialogue
  – Breaks the problem into a set of smaller recognition tasks
  – Simple for first-time users, but tedious with repeated use

C: Main menu. Please say one of the following:
C: “directions”, “restaurants”, “gas stations”, or “more options”.
H: Directions.
C: Okay. Directions. What state are you going to?
H: Massachusetts.
C: Okay. Massachusetts. What city are you going to?
H: Cambridge.
C: Okay. Cambridge. What is the street address?
H: 32 Vassar Street.
C: Okay. 32 Vassar Street in Cambridge, Massachusetts.
C: From your current location, continue straight on…
Street Address Recognition
• Research goal: mixed-initiative dialogue
  – More difficult to predict what users will say
  – Far more natural for repeat or expert users

C: How can I help you?
H: I need directions to 32 Vassar Street in Cambridge, Mass.

• Recognition approach: dynamically adapt the recognition vocabulary
  – Three recognition passes over one utterance
  – 1st pass: detect the state and activate relevant cities
  – 2nd pass: detect cities and activate relevant streets
  – 3rd pass: recognize the full street address
Dynamic Vocabulary Recognition