CS 124/LINGUIST 180: From Languages to Information
Dan Jurafsky
Lecture 19: Speech Recognition
The final exam
• Friday March 20, 12:15-3:15, in 300-300
• Open book and open note
• You won’t need a calculator
• Computers are OK for reading, e.g., the slides and the textbooks, but no use of the internet on your laptop or any internet-aware device, on the honor code; i.e., open book and notes, but not open web
• The problems will be very much like Homework 5, which I gave you specifically as finals prep
Topics we covered
• http://www.stanford.edu/class/linguist180/
Some classes in these areas
• cs224N Natural Language Processing (Manning, Spring 2009)
• cs224S Speech Recognition, Understanding, Dialogue (Jurafsky)
• cs224U Natural Language Understanding (possibly Winter 2010)
• ling284 History of Comp. Linguistics/NLP (Jurafsky and Kay, Autumn or Winter)
• cs121 Intro to AI (Latombe, Spring 2009)
• cs221 Artificial Intelligence (Ng)
• cs228 Structured Probabilistic Models (Koller)
• cs262 Computational Genomics (Batzoglou, often Winter)
• cs229 Machine Learning (Ng)
• cs270 Intro to Biomedical Informatics (Musen)
• cs276 Information Retrieval and Web Search (Manning/Raghavan)
Outline for ASR
• ASR Tasks and Architecture
• Five easy pieces of an ASR system:
  1) The Lexicon (an HMM with phones as hidden states)
  2) The Language Model
  3) The Acoustic Model (phone detector)
  4) Feature extraction (“MFCC”)
  5) HMM stuff:
     1) Viterbi decoding
     2) EM (Baum-Welch) training
Applications of Speech Recognition/Understanding (ASR/ASU)
• Dictation
• Telephone-based information: GOOG 411; directions, air travel, banking, etc.
• “Google Voice” voice mail transcription: http://www.nytimes.com/2009/03/12/technology/personaltech/12pogue.html
• Hands-free (in car)
• Second language (‘L2’) (accent reduction)
• Audio archive searching and aligning

1/5/07
Speaker Recognition tasks
Applications of Speaker Recognition and Language Recognition
• Language recognition for call routing
• Speaker recognition:
  - Speaker verification (binary decision): voice password, telephone assistant
  - Speaker identification (one of N): criminal investigation
Speech synthesis
• Telephone dialogue systems
• Games
• The new iPod shuffle: http://www.apple.com/ipodshuffle/voiceover.html
• The Kindle voice: controversy!
• Compare to state-of-the-art synthesis: http://www.research.att.com/~ttsweb/tts/demo.php
LVCSR
• Large Vocabulary Continuous Speech Recognition:
  - ~20,000-64,000 words
  - Speaker-independent (vs. speaker-dependent)
  - Continuous speech (vs. isolated-word)
• Useful for:
  - Dictation
  - Voice-mail transcription
Current error rates

| Task | Vocabulary | Error Rate (%) |
| --- | --- | --- |
| Digits | 11 | 0.5 |
| WSJ read speech | 5K | 3 |
| WSJ read speech | 20K | 3 |
| Broadcast news | 64,000+ | 10 |
| Conversational telephone | 64,000+ | 20 |

Ballpark numbers; exact numbers depend very much on the specific corpus.
HSR versus ASR

| Task | Vocab | ASR | Human SR |
| --- | --- | --- | --- |
| Continuous digits | 11 | .5 | .009 |
| WSJ 1995 clean | 5K | 3 | 0.9 |
| WSJ 1995 w/noise | 5K | 9 | 1.1 |
| SWBD 2004 | 65K | 20 | 4 |

• Conclusions:
  - Machines are about 5 times worse than humans
  - The gap increases with noisy speech
  - These numbers are rough; take them with a grain of salt
Why is conversational speech harder?
• A piece of an utterance without context
• The same utterance with more context
LVCSR Design Intuition
• Build a statistical model of the speech-to-words process
• Collect lots and lots of speech, and transcribe all the words.
• Train the model on the labeled speech
• Paradigm: Supervised Machine Learning + Search
The Noisy Channel Model
• Search through space of all possible sentences.
• Pick the one that is most probable given the waveform.
The Noisy Channel Model (II)
• What is the most likely sentence out of all sentences in the language L given some acoustic input O?
• Treat acoustic input O as sequence of individual observations O = o1,o2,o3,…,ot
• Define a sentence as a sequence of words: W = w1,w2,w3,…,wn
Noisy Channel Model (III)
• Probabilistic implication: pick the highest-probability sentence W:

    Ŵ = argmax_{W ∈ L} P(W | O)

• We can use Bayes’ rule to rewrite this:

    Ŵ = argmax_{W ∈ L} P(O | W) P(W) / P(O)

• Since the denominator is the same for each candidate sentence W, we can ignore it for the argmax:

    Ŵ = argmax_{W ∈ L} P(O | W) P(W)
Noisy channel model

    Ŵ = argmax_{W ∈ L} P(O | W) × P(W)
                       likelihood   prior
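The decision rule above can be sketched directly in code. The two candidate sentences and all probability values below are invented for illustration; a real recognizer scores a vast hypothesis space, and works in log space to avoid numerical underflow:

```python
import math

# Toy sketch of the noisy-channel decision rule:
#   W_hat = argmax_W P(O|W) * P(W)
# Candidate sentences and probabilities are made up for illustration.
candidates = {
    "wreck a nice beach": {"acoustic": 1e-5, "lm": 1e-9},
    "recognize speech":   {"acoustic": 8e-6, "lm": 1e-6},
}

def score(probs):
    # log P(O|W) + log P(W): sums of logs instead of products of tiny numbers
    return math.log(probs["acoustic"]) + math.log(probs["lm"])

best = max(candidates, key=lambda w: score(candidates[w]))
print(best)  # the likelihood * prior product favors "recognize speech"
```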
Starting with the HMM Lexicon
• A list of words
• Each one with a pronunciation in terms of phones
• We get these from an on-line pronunciation dictionary
• CMU dictionary: 127K words
  http://www.speech.cs.cmu.edu/cgi-bin/cmudict
• We’ll represent the lexicon as an HMM
ARPAbet
• http://www.stanford.edu/class/cs224s/arpabet.html
HMMs for speech: the word “six”
• Hidden states are phones
• Loopbacks because a phone is ~100 milliseconds long:
  - An observation of speech every 10 ms
  - So each phone repeats ~10 times (simplifying greatly)
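A minimal sketch of this lexicon HMM for “six”, assuming a self-loop probability of 0.9 (a made-up value; real transition probabilities are trained): with self-loop probability p, the stay duration is geometric with mean 1/(1-p), so p = 0.9 gives the ~10 frames (~100 ms) per phone described above.

```python
# Hidden states are the phones of "six", each with a loopback (self-loop).
phones = ["s", "ih", "k", "s"]
self_loop = 0.9  # assumed value for illustration; trained in a real system

transitions = {}
for i in range(len(phones)):
    transitions[(i, i)] = self_loop          # loopback: stay in this phone
    transitions[(i, i + 1)] = 1 - self_loop  # advance (last phone -> exit)

# Geometric-distribution mean: expected frames spent in one phone state
expected_frames = 1 / (1 - self_loop)
print(expected_frames)  # ~10 frames of 10 ms each, i.e. ~100 ms per phone
```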
HMM for the digit recognition task
The rest
• So if it’s just HMMs, I just need to tell you how to build a phone detector
• The rest is the same as part-of-speech or named entity tagging
• Phone detection algorithm: supervised machine learning
  - Classifier: Gaussian Mixture Model
  - Features: “Mel-frequency Cepstral Coefficients” (MFCC)
Speech Production Process
• Respiration: we (normally) speak while breathing out. Respiration provides airflow (“pulmonic egressive airstream”).
• Phonation: the airstream sets the vocal folds in motion; vibration of the vocal folds produces sound. The sound is then modulated by:
• Articulation and resonance: the shape of the vocal tract, characterized by:
  - Oral tract: teeth, soft palate (velum), hard palate, tongue, lips, uvula
  - Nasal tract

Text adapted from Sharon Rose
Sagittal section of the vocal tract (Techmer 1880)

[Figure labels: nasal cavity; pharynx; vocal folds (within the larynx); trachea; lungs]

Text copyright J. J. Ohala, Sept 2001, from Sharon Rose slide
[Figure from Mark Liberman’s website, from Ultimate Visual Dictionary]
[Figure from Mark Liberman’s web site, from Language Files (7th ed.)]
Vocal tract

[Figure thanks to John Coleman]
Vocal tract movie (high-speed x-ray)

[Figure of Ken Stevens, from Peter Ladefoged’s web site]
[Figure of Ken Stevens, labels from Peter Ladefoged’s web site]
USC’s SAIL Lab (Shri Narayanan)
Larynx and Vocal Folds
• The larynx (voice box):
  - A structure made of cartilage and muscle
  - Located above the trachea (windpipe) and below the pharynx (throat)
  - Contains the vocal folds (the adjective for larynx is laryngeal)
• Vocal folds (older term: vocal cords):
  - Two bands of muscle and tissue in the larynx
  - Can be set in motion to produce sound (voicing)

Text from slides by Sharon Rose, UCSD LING 111 handout
The larynx, external structure, from the front

[Figure thanks to John Coleman]
Vertical slice through the larynx, as seen from the back

[Figure thanks to John Coleman]
Voicing
• Air comes up from the lungs
• It forces its way through the vocal cords, pushing them open (2, 3, 4)
• This causes the air pressure in the glottis to fall, since:
  - when gas runs through a constricted passage, its velocity increases (Venturi tube effect)
  - this increase in velocity results in a drop in pressure (Bernoulli principle)
• Because of the drop in pressure, the vocal cords snap together again (6-10)
• A single cycle takes ~1/100 of a second

Figure and text from John Coleman’s web site
Voicelessness
• When the vocal cords are open, air passes through unobstructed
• Voiceless sounds: p / t / k / s / f / sh / th / ch
• If the air moves very quickly, the turbulence causes a different kind of phonation: whisper
Vocal folds open during breathing

[Figure from Mark Liberman’s web site, from Ultimate Visual Dictionary]
Vocal Fold Vibration

[UCLA Phonetics Lab demo]
Consonants and Vowels
• Consonants: phonetically, sounds with audible noise produced by a constriction
• Vowels: phonetically, sounds with no audible noise produced by a constriction
• (it’s more complicated than this, since we have to consider syllabic function, but this will do for now)
Text adapted from John Coleman
Acoustic Phonetics
• Sound waves: http://www.kettering.edu/~drussell/Demos/waves-intro/waves-intro.html
Simple Periodic Waves (sine waves)

[Figure: a sine wave plotted from 0 to 0.02 s, amplitude between –0.99 and 0.99, with one cycle marked]

• Characterized by:
  - period T
  - amplitude A
  - phase
• Fundamental frequency, in cycles per second, or Hz
• F0 = 1/T
Simple periodic waves
• Computing the frequency of a wave: 5 cycles in 0.5 seconds = 10 cycles/second = 10 Hz
• Amplitude: 1
• Equation: y = A sin(2πft)
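The equation can be checked numerically; the amplitude and frequency below are the slide’s example values (A = 1, f = 10 Hz), so the period is T = 1/f = 0.1 s and the waveform repeats exactly after one period:

```python
import math

# y = A * sin(2*pi*f*t), with the slide's example values
A, f = 1.0, 10.0

def y(t):
    return A * math.sin(2 * math.pi * f * t)

T = 1 / f  # period: 0.1 s
assert abs(y(0.03 + T) - y(0.03)) < 1e-9   # periodic with period T
assert abs((0.5 / T) - 5) < 1e-12          # 5 cycles fit in 0.5 s -> 10 Hz
```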
Speech sound waves
• A little piece from the waveform of the vowel [iy]
• Y axis: amplitude = amount of air pressure at that time point
  - Positive is compression
  - Zero is normal air pressure; negative is rarefaction
• X axis: time
Digitizing Speech
Digitizing Speech
• Analog-to-digital conversion (A-D conversion)
• Two steps:
  - Sampling
  - Quantization
Sampling
• Measuring the amplitude of the signal at time t
• The sampling rate needs to give at least two samples for each cycle:
  - Roughly speaking, one for the positive and one for the negative half of each cycle
  - More than two samples per cycle is OK
  - Fewer than two samples per cycle will cause frequencies to be missed
• So the maximum frequency that can be measured is half the sampling rate
• The maximum frequency measurable at a given sampling rate is called the Nyquist frequency
Sampling
• Original signal in red
• If we measure at the green dots, we will see a lower-frequency wave and miss the correct higher-frequency one (aliasing)
Sampling
In practice, then, we use the following sampling rates:
• 16,000 Hz (samples/sec): microphone (“wideband”)
• 8,000 Hz (samples/sec): telephone
Why?
• We need at least 2 samples per cycle
• The maximum measurable frequency is half the sampling rate
• Human speech is < 10,000 Hz, so we need at most a 20K sampling rate
• Telephone speech is filtered at 4K, so 8K is enough
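The Nyquist limit can be demonstrated numerically: at an 8,000 Hz sampling rate the Nyquist frequency is 4,000 Hz, so a 5,000 Hz tone cannot be represented; its samples coincide, up to a sign flip, with those of a 3,000 Hz tone (the frequencies here are chosen for the demonstration):

```python
import math

# Aliasing at fs = 8000 Hz: a 5000 Hz tone is above the 4000 Hz Nyquist
# frequency, so its samples look like a 3000 Hz tone (phase-inverted).
fs = 8000
samples_5000 = [math.sin(2 * math.pi * 5000 * n / fs) for n in range(32)]
samples_3000 = [math.sin(2 * math.pi * 3000 * n / fs) for n in range(32)]

# sin(2*pi*(5000/8000)*n) == -sin(2*pi*(3000/8000)*n) for every integer n
assert all(abs(a + b) < 1e-9 for a, b in zip(samples_5000, samples_3000))
```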
Quantization
• Quantization: representing the real value of each amplitude as an integer
  - 8-bit (−128 to 127) or 16-bit (−32768 to 32767)
• Formats:
  - 16-bit PCM
  - 8-bit mu-law (log compression)
• Byte order: LSB (Intel) vs. MSB (Sun, Apple)
• Headers:
  - Raw (no header)
  - Microsoft wav
  - Sun .au

[Figure label: 40-byte header]
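A sketch of mu-law log compression: the constant mu = 255 is the standard value for 8-bit encoding, but the quantizer below is a simplified illustration of the idea (map amplitudes through a log curve, then round to an integer), not the exact G.711 codec:

```python
import math

# Mu-law companding: small amplitudes get finer quantization resolution
# than large ones, which suits speech (mostly small amplitudes).
MU = 255.0

def mu_law_compress(x):
    # x is a sample in [-1.0, 1.0]; the result is also in [-1.0, 1.0]
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def quantize_8bit(x):
    # map the companded value onto the 8-bit integer range [-128, 127]
    return max(-128, min(127, int(round(mu_law_compress(x) * 127))))

# A quiet sample uses far more of the integer range than a linear map,
# where round(0.01 * 127) would give just 1:
print(quantize_8bit(0.01))   # 29
print(quantize_8bit(1.0))    # 127
```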
WAV format
Waves have different frequencies

[Figure: two sine waves plotted from 0 to 0.02 s, amplitude between –0.99 and 0.99: one at 100 Hz, one at 1000 Hz]
Complex waves: adding a 100 Hz and a 1000 Hz wave together

[Figure: the summed waveform plotted from 0 to 0.05 s, amplitude between –0.9654 and 0.99]
Spectrum

[Figure: frequency components (100 and 1000 Hz) on the x-axis (frequency in Hz), amplitude on the y-axis]
Spectra continued
• Fourier analysis: any wave can be represented as the (infinite) sum of sine waves of different frequencies (each with its own amplitude and phase)
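Fourier analysis can be illustrated with a brute-force DFT (O(N²) in pure Python, fine for a demo) applied to the 100 Hz + 1000 Hz mixture from the earlier slides; the component amplitudes here are arbitrary choices:

```python
import cmath
import math

# 0.1 s of a 100 Hz + 1000 Hz mixture, sampled at 8000 Hz.
# With N = 800 samples, frequency bins are fs/N = 10 Hz apart.
fs, N = 8000, 800
signal = [math.sin(2 * math.pi * 100 * n / fs) +
          0.5 * math.sin(2 * math.pi * 1000 * n / fs) for n in range(N)]

def dft_magnitudes(x):
    # Direct evaluation of the DFT for bins up to the Nyquist frequency
    n_samples = len(x)
    return [abs(sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_samples)
                    for n in range(n_samples)))
            for k in range(n_samples // 2)]

mags = dft_magnitudes(signal)
# The two largest components sit at bins k = f * N / fs:
# 100 Hz -> k = 10, 1000 Hz -> k = 100
peaks = sorted(range(len(mags)), key=lambda k: mags[k], reverse=True)[:2]
print(sorted(k * fs / N for k in peaks))  # [100.0, 1000.0]
```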
Spectrum of one instant in an actual soundwave: many components across the frequency range

[Figure: spectrum from 0 to 5000 Hz on the x-axis; magnitude from 0 to 40 dB on the y-axis]
Part of [ae] waveform from “had”
• Note the complex wave repeating nine times in the figure
• Plus a smaller wave which repeats 4 times for every large pattern
• The large wave has a frequency of 250 Hz (9 times in 0.036 seconds)
• The small wave is roughly 4 times this, or roughly 1000 Hz
• Two little tiny waves sit on top of each peak of the 1000 Hz waves
Back to the spectrum
• A spectrum represents these frequency components
• It is computed by the Fourier transform, an algorithm that separates out each frequency component of a wave
• The x-axis shows frequency; the y-axis shows magnitude (in decibels, a log measure of amplitude)
• Peaks at 930 Hz, 1860 Hz, and 3020 Hz
Spectrogram: spectrum + time dimension
From Mark Liberman’s web site
Detecting Phones
• Two stages:
  - Feature extraction (basically a slice of a spectrogram)
  - Building a phone classifier (using a GMM classifier)
MFCC: Mel-Frequency Cepstral Coefficients
Final Feature Vector
• 39 features per 10 ms frame:
  - 12 MFCC features
  - 12 delta MFCC features
  - 12 delta-delta MFCC features
  - 1 (log) frame energy
  - 1 delta (log) frame energy
  - 1 delta-delta (log) frame energy
• So each frame is represented by a 39-dimensional vector
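A sketch of how delta (“velocity”) features might be computed, using a simple central difference over one cepstral coefficient; production systems typically use a regression over several neighboring frames, so treat this formula as illustrative:

```python
# delta[t] = (c[t+1] - c[t-1]) / 2, with the endpoints clamped.
# Delta-deltas are just deltas of the deltas.
def deltas(cepstra):
    # cepstra: per-frame values of one coefficient (one value per 10 ms frame)
    d = []
    for t in range(len(cepstra)):
        prev = cepstra[max(t - 1, 0)]
        nxt = cepstra[min(t + 1, len(cepstra) - 1)]
        d.append((nxt - prev) / 2.0)
    return d

c = [0.0, 1.0, 2.0, 3.0, 4.0]  # a linearly rising coefficient
print(deltas(c))  # [0.5, 1.0, 1.0, 1.0, 0.5]: interior deltas equal the slope
```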
Acoustic Modeling (= phone detection)
• Given a 39-dimensional vector corresponding to the observation of one frame, o_i
• And given a phone q we want to detect
• Compute p(o_i | q)
• Most popular method: GMM (Gaussian mixture models)
• Other methods: neural nets, CRFs, SVMs, etc.
Gaussian Mixture Models
• Also called “fully-continuous HMMs”
• P(o | q) computed by a Gaussian:

    p(o | q) = 1 / sqrt(2πσ²) · exp( −(o − μ)² / (2σ²) )
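The formula can be written directly as code; the evaluation points below are arbitrary, chosen to show the density falling off away from the mean:

```python
import math

# p(o|q) = 1/sqrt(2*pi*sigma^2) * exp(-(o - mu)^2 / (2*sigma^2))
def gaussian(o, mu, var):
    return math.exp(-(o - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Highest at the mean, falling off with distance:
print(gaussian(0.0, 0.0, 1.0))  # 1/sqrt(2*pi), about 0.3989
print(gaussian(3.0, 0.0, 1.0))  # three standard deviations out, about 0.0044
```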
Gaussians for Acoustic Modeling
• A Gaussian is parameterized by a mean and a variance
• P(o | q) is highest at the mean and low far from the mean

[Figure: Gaussian curves of P(o | q) over o, with different means]
Training Gaussians
• A (single) Gaussian is characterized by a mean and a variance
• Imagine that we had some training data in which each phone was labeled
• And imagine that we were computing just one single spectral value (a real-valued number) as our acoustic observation
• We could just compute the mean and variance from the data:

    μ_i = (1/T) Σ_{t=1}^{T} o_t            such that o_t is phone i

    σ_i² = (1/T) Σ_{t=1}^{T} (o_t − μ_i)²   such that o_t is phone i
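A sketch of this training procedure on made-up labeled frames (the phone labels and observation values below are invented for illustration):

```python
# Each frame is (phone label, single spectral observation); training a
# Gaussian for phone i just means taking the mean and variance of the
# observations labeled i.
frames = [("iy", 3.0), ("iy", 5.0), ("ae", 10.0), ("iy", 4.0), ("ae", 12.0)]

def train_gaussian(frames, phone):
    obs = [o for (p, o) in frames if p == phone]
    mu = sum(obs) / len(obs)
    var = sum((o - mu) ** 2 for o in obs) / len(obs)
    return mu, var

print(train_gaussian(frames, "iy"))  # mean 4.0, variance 2/3
print(train_gaussian(frames, "ae"))  # mean 11.0, variance 1.0
```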
But we need 39 Gaussians, not 1!
• The observation o is really a vector of length 39
• So we need a vector of Gaussians:

    p(o | q) = 1 / sqrt( (2π)^D Π_{d=1}^{D} σ²[d] ) · exp( −(1/2) Σ_{d=1}^{D} (o[d] − μ[d])² / σ²[d] )
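A sketch of the formula above, computed in log space, which is how real decoders avoid underflow when multiplying 39 per-dimension densities; the 3-dimensional toy vectors are made up (a real frame would be 39-dimensional):

```python
import math

# Diagonal-covariance multivariate Gaussian: a product of D independent
# one-dimensional Gaussians, evaluated as a sum of log densities.
def diag_gaussian_loglik(o, mu, var):
    return sum(-0.5 * (math.log(2 * math.pi * var[d]) +
                       (o[d] - mu[d]) ** 2 / var[d])
               for d in range(len(o)))

o   = [0.0, 1.0, -1.0]
mu  = [0.0, 0.0, 0.0]
var = [1.0, 1.0, 1.0]
loglik = diag_gaussian_loglik(o, mu, var)

# Matches the product of three standard-normal densities:
expected = sum(math.log(math.exp(-x * x / 2) / math.sqrt(2 * math.pi)) for x in o)
assert abs(loglik - expected) < 1e-9
```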
Actually, a mixture of Gaussians
• Each phone is modeled by a weighted sum of different Gaussians
• Hence it is able to model complex facts about the data

[Figure: mixture densities for Phone A and Phone B]
Gaussian acoustic modeling
• Summary: each phone is represented by a GMM parameterized by:
  - M mixture weights
  - M mean vectors
  - M covariance matrices
• We usually assume the covariance matrix is diagonal, i.e., we just keep a separate variance for each cepstral feature
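A sketch of evaluating such a GMM with diagonal covariances; all parameter values below are invented for illustration (real parameters come from training, e.g. via Baum-Welch):

```python
import math

# Likelihood of a frame under one phone's GMM:
#   p(o|q) = sum_m w_m * N(o; mu_m, diag(var_m))
def diag_gaussian(o, mu, var):
    return math.prod(
        math.exp(-(o[d] - mu[d]) ** 2 / (2 * var[d])) /
        math.sqrt(2 * math.pi * var[d]) for d in range(len(o)))

def gmm_likelihood(o, weights, means, variances):
    return sum(w * diag_gaussian(o, mu, var)
               for w, mu, var in zip(weights, means, variances))

weights   = [0.6, 0.4]                 # M = 2 mixture weights (sum to 1)
means     = [[0.0, 0.0], [3.0, 3.0]]   # M mean vectors
variances = [[1.0, 1.0], [1.0, 1.0]]   # diagonals of M covariance matrices

p = gmm_likelihood([0.0, 0.0], weights, means, variances)
print(p)  # dominated by the first component: about 0.6 * 1/(2*pi)
```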
Where we are
• Given: a wave file
• Goal: output a string of words
• What we know: the acoustic model
  - How to turn the wave file into a sequence of acoustic feature vectors, one every 10 ms
  - If we had a complete phonetic labeling of the training set, we know how to train a Gaussian “phone detector” for each phone
  - We also know how to represent each word as a sequence of phones
• What we knew from a few weeks ago: the language model
• Next time:
  - Seeing all this back in the context of HMMs
  - Search: how to combine the language model and the acoustic model to produce a sequence of words
HMM for digit recognition task
Viterbi trellis for “five”
Viterbi trellis for “five”
Search space with bigrams
Viterbi trellis
Viterbi backtrace
Summary
• ASR architecture
• Phonetics background
• Five easy pieces of an ASR system:
  1) Lexicon
  2) Feature extraction
  3) Acoustic model (phone detector)
  4) Language model
  5) Viterbi decoding
A few advanced topics
Why foreign accents are hard
• A word by itself
• The word in context
Sentence Segmentation
• A binary classification task: judge the juncture between each pair of words
• Features:
  - Pause
  - Duration of the previous phone and rime
  - Pitch change across the boundary; pitch range of the previous word
Disfluencies
• Reparandum: the thing repaired
• Interruption point (IP): where the speaker breaks off
• Editing phase (edit terms): uh, I mean, you know
• Repair: the fluent continuation
• Fragments: incomplete or cut-off words. Example: “Uh yeah, yeah, well, it- it- that’s right. And it-”