Automatic Continuous Speech Recognition

[Diagram: training pipeline with a database of speech and text feeding a scoring stage]
Automatic Continuous Speech Recognition

Problems with isolated word recognition:
– Every new task contains novel words without any available training data.
– There are simply too many words, and these words may have different acoustic realizations. Variability increases with coarticulation between words and with speech rate.
– We do not know where the word boundaries are.
In CSR, should we use words? Or what is the basic unit to represent salient acoustic and phonetic information?
Model Unit Issues
– Accurate: represents the acoustic realizations that appear in different contexts.
– Trainable.
– Generalizable: new words can be derived from the units.
Comparison of Different Units

Words:
– Small task: accurate, trainable, non-generalizable.
– Large vocabulary: accurate, non-trainable, non-generalizable.

Phonemes:
– Large vocabulary: not accurate, trainable, over-generalizable.
Syllables:
– English: 30,000 syllables: not very accurate, non-trainable, generalizable.
– Chinese: 1,200 tone-dependent syllables; Japanese: 50 syllables: accurate, trainable, generalizable.

Allophones: realizations of phonemes in different contexts.
– Accurate, non-trainable, generalizable.
– Triphones are an example of allophones.
Training in Sphinx
1. The phoneme set is trained.
2. Triphones are created.
3. Senones are created.
4. Senones are pruned.
5. Triphones are trained; their senones go from 1 Gaussian up to 8 or 16 Gaussians per state.
Context independent: phonemes
– SPHINX: model_architecture/Telefonica.ci.mdef

Context dependent: triphones
– SPHINX: model_architecture/Telefonica.untied.mdef
Clustering Acoustic-Phonetic Units

Many phones have similar effects on their neighboring phones; hence, many triphones have very similar Markov states.

A senone is a cluster of similar Markov states.

Advantages:
– More training data per cluster.
– Less memory used.
Senonic Decision Tree (SDT)

An SDT classifies the Markov states of the triphones represented in the training corpus by asking linguistic questions composed of conjunctions, disjunctions, and/or negations of a set of predetermined questions.
Linguistic Questions

| Question | Phones in Each Question |
|----------|-------------------------|
| Aspgen   | hh                      |
| Sil      | sil                     |
| Alvstp   | d, t                    |
| Dental   | dh, th                  |
| Labstp   | b, p                    |
| Liquid   | l, r                    |
| Lw       | l, w                    |
| S/Sh     | s, sh                   |
| …        | …                       |
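The linguistic questions above are simply sets of phones, and composed questions combine them with conjunctions, disjunctions, and negations. A minimal sketch, with the phone sets taken from the table but the composition helpers purely illustrative (this is not Sphinx's actual implementation):

```python
# Linguistic questions as phone sets (from the table above).
QUESTIONS = {
    "Alvstp": {"d", "t"},
    "Dental": {"dh", "th"},
    "Labstp": {"b", "p"},
    "Liquid": {"l", "r"},
    "Lw": {"l", "w"},
}

def ask(question, phone):
    """Simple question: does `phone` belong to the question's phone set?"""
    return phone in QUESTIONS[question]

def ask_any(names, phone):
    """Disjunction of simple questions."""
    return any(ask(q, phone) for q in names)

def ask_and_not(q_yes, q_not, phone):
    """Conjunction with a negation."""
    return ask(q_yes, phone) and not ask(q_not, phone)

print(ask("Liquid", "l"))                    # True
print(ask_any(["Alvstp", "Labstp"], "p"))    # True
print(ask_and_not("Liquid", "Lw", "r"))      # True: /r/ is a liquid but not in Lw
```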
Decision Tree for Classifying the Second State of a /k/ Triphone

[Tree diagram. The root asks "Is the left phone (LP) a sonorant or nasal?"; lower nodes ask questions such as "Is the right phone (RP) a back-R?", "Is LP /s, z, sh, zh/?", "Is RP voiced?", and a composed question on LP and the left context; each leaf is a senone (Senone 1 through Senone 6).]
When applied to the word welcome

[The same decision tree, traversed for the /k/ in "welcome": the answers at each node select the senone for the triphone's second state.]
The tree can be constructed automatically by searching, at each node, for the question that yields the maximum entropy decrease.
– Sphinx construction: $base_dir/c_scripts/03.buildtrees. Results: $base_dir/trees/Telefonica.unpruned/A-0.dtree

When the tree grows too large, it needs to be pruned.
– Sphinx: $base_dir/c_scripts/04.buildtrees. Results: $base_dir/trees/Telefonica.500/A-0.dtree, $base_dir/Telefonica_arquitecture/Telefonica.500.mdef
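The greedy splitting criterion can be sketched as follows: for each candidate question, split the samples reaching a node into yes/no sets and keep the question with the largest entropy decrease. Everything here (the samples, the class labels, the discrete-entropy measure) is illustrative; Sphinx actually measures the likelihood of Gaussian models rather than label entropy:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_question(samples, questions):
    """Pick the question with the largest entropy decrease.

    samples: list of (left_context_phone, acoustic_class) pairs.
    questions: dict mapping question name -> set of phones.
    """
    base = entropy([lab for _, lab in samples])
    best, best_gain = None, 0.0
    for name, phones in questions.items():
        yes = [lab for ctx, lab in samples if ctx in phones]
        no = [lab for ctx, lab in samples if ctx not in phones]
        if not yes or not no:          # a useless split: skip it
            continue
        split = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(samples)
        gain = base - split
        if gain > best_gain:
            best, best_gain = name, gain
    return best, best_gain

samples = [("d", "A"), ("t", "A"), ("l", "B"), ("r", "B"), ("b", "A")]
questions = {"Alvstp": {"d", "t"}, "Liquid": {"l", "r"}, "Labstp": {"b", "p"}}
q, gain = best_question(samples, questions)
print(q)   # "Liquid" separates classes A and B perfectly here
```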
Subword Unit Models Based on HMMs
Words

Words can be modeled using composite HMMs.

A null transition is used to go from one subword unit to the next.

[Diagram: composite HMM for the word "two": /sil/ /t/ /uw/ /sil/]
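The composition above can be sketched as a chain of subword-unit HMMs joined by null (non-emitting) transitions. The 3-state-per-unit topology and the state-naming scheme below are assumptions for illustration:

```python
def make_unit(name, idx, n_states=3):
    """A subword-unit HMM, reduced here to a list of named emitting states."""
    return {"name": name,
            "states": [f"{name}{idx}_s{i}" for i in range(n_states)]}

def concatenate(units):
    """Chain units into a word model: the exit state of each unit is linked
    to the entry state of the next by a null transition (emits nothing)."""
    null_arcs = [(prev["states"][-1], nxt["states"][0])
                 for prev, nxt in zip(units, units[1:])]
    states = [s for u in units for s in u["states"]]
    return {"states": states, "null_transitions": null_arcs}

# The slide's example: /sil/ /t/ /uw/ /sil/ for the word "two".
word_two = concatenate([make_unit(u, i)
                        for i, u in enumerate(("sil", "t", "uw", "sil"))])
print(len(word_two["states"]))          # 12 emitting states
print(word_two["null_transitions"])     # 3 null transitions between units
```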
Continuous Speech Training

[Diagram: training pipeline with a database of speech and text feeding a scoring stage]
For each utterance to be trained, the subword units are concatenated to form word models.
– Sphinx: dictionary
– $base_dir/training_input/dict.txt
– $base_dir/training_input/train.lbl
Let’s assume we are going to train the phonemes in the sentence:
– Two four six.

The phonemes of this sentence are:
– /t/ /w/ /o/ /f/ /o/ /r/ /s/ /i/ /x/

Therefore the HMM will be:

[Diagram: concatenated HMM: /sil/ /t/ /uw/ /sil/ /f/ /o/ /r/ /s/ /i/ /x/]
We can estimate the parameters of each HMM using the forward-backward re-estimation formulas already defined.
The ability to automatically align each individual HMM to the corresponding unsegmented speech observation sequence is one of the most powerful features of the forward-backward algorithm.
Language Models for Large Vocabulary Speech Recognition

[Diagram: training pipeline with a database of speech and text feeding a scoring stage]
Instead of using the maximum likelihood $P(\mathbf{O} \mid M_i)$ alone, recognition can be improved by computing the maximum a posteriori probability:

$$P(\mathbf{O} \mid M_k)\,P(M_k) = \max_{i=1,2,\ldots,q} P(\mathbf{O} \mid M_i)\,P(M_i)$$

where $P(M_i)$ is given by the language model and $P(\mathbf{O} \mid M_i)$ by the Viterbi algorithm.
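In log domain, this MAP rule is just an argmax over log P(O|M) + log P(M). A toy sketch with invented scores (in practice the acoustic log-score comes from Viterbi and the prior from the language model; the hypotheses and numbers below are made up):

```python
# Hypothetical N-best list with made-up log scores.
hypotheses = {
    "two four six": {"log_acoustic": -120.0, "log_lm": -4.2},
    "to four six":  {"log_acoustic": -119.5, "log_lm": -7.9},
    "two for six":  {"log_acoustic": -121.0, "log_lm": -6.5},
}

def map_decode(hyps):
    # argmax_k  log P(O | M_k) + log P(M_k)
    return max(hyps, key=lambda m: hyps[m]["log_acoustic"] + hyps[m]["log_lm"])

print(map_decode(hypotheses))  # "two four six" wins on the combined score
```

Note that the acoustically best hypothesis ("to four six") loses once the language-model prior is added in.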
Language Models for Large Vocabulary Speech Recognition

Goal:
– Provide an estimate of the probability of a “word” sequence $(w_1 w_2 w_3 \ldots w_Q)$ for the given recognition task.

This can be solved as follows:

$$P(W) = P(w_1 w_2 w_3 \ldots w_Q) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1 w_2) \cdots P(w_Q \mid w_1 w_2 \ldots w_{Q-1})$$
Since it is impossible to reliably estimate the conditional probabilities $P(w_j \mid w_1 w_2 \ldots w_{j-1})$ for all $j$, in practice an N-gram language model is used:

$$P(w_Q \mid w_1 w_2 \ldots w_{Q-1}) \approx P(w_Q \mid w_{Q-N+1} \ldots w_{Q-1})$$

In practice, reliable estimators are obtained for N = 1 (unigram), N = 2 (bigram), or possibly N = 3 (trigram).
Examples:

Unigram: P(Maria loves Pedro) = P(Maria) P(loves) P(Pedro)

Bigram: P(Maria loves Pedro) = P(Maria | <sil>) P(loves | Maria) P(Pedro | loves) P(</sil> | Pedro)
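The bigram example can be sketched directly: multiply the conditional probabilities along the sentence, including the boundary markers. The probability values are invented for illustration, and `<s>`/`</s>` stand in for the slide's `<sil>`/`</sil>`:

```python
# Invented bigram probabilities for the example sentence.
bigram = {
    ("<s>", "Maria"): 0.2,
    ("Maria", "loves"): 0.1,
    ("loves", "Pedro"): 0.05,
    ("Pedro", "</s>"): 0.3,
}

def sentence_prob(words):
    """P(sentence) under the bigram model, with boundary markers."""
    tokens = ["<s>"] + words + ["</s>"]
    p = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        p *= bigram.get((prev, cur), 0.0)  # no smoothing in this sketch
    return p

print(sentence_prob(["Maria", "loves", "Pedro"]))  # 0.2 * 0.1 * 0.05 * 0.3
```

An unseen bigram zeroes out the whole sentence here, which is why real language models apply smoothing.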
CMU-Cambridge Language Modeling Tools
$base_dir/c_scripts/languageModelling
[Diagram: training pipeline with a database of speech and text feeding a scoring stage]
$$P(w_i \mid w_{i-2}, w_{i-1}) = \frac{C(w_{i-2}\, w_{i-1}\, w_i)}{C(w_{i-2}\, w_{i-1})}$$

where
– $C(w_{i-2}\, w_{i-1})$ = total number of times the sequence $w_{i-2}\, w_{i-1}$ was observed
– $C(w_{i-2}\, w_{i-1}\, w_i)$ = total number of times the sequence $w_{i-2}\, w_{i-1}\, w_i$ was observed
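This maximum-likelihood estimate is just a ratio of counts, which can be sketched on a tiny illustrative corpus:

```python
from collections import Counter

# Maximum-likelihood trigram estimation from counts, per the formula above:
# P(w_i | w_{i-2}, w_{i-1}) = C(w_{i-2} w_{i-1} w_i) / C(w_{i-2} w_{i-1}).
corpus = "two four six two four eight two four six".split()

trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigrams = Counter(zip(corpus, corpus[1:]))

def p_trigram(w2, w1, w):
    """MLE of P(w | w2, w1); assumes the bigram (w2, w1) was observed."""
    return trigrams[(w2, w1, w)] / bigrams[(w2, w1)]

print(p_trigram("two", "four", "six"))    # C(two four six) / C(two four) = 2/3
```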