mfcc tutorial - ocw.nthu.edu.tw · 1980s can recognize 20000 words 4mb ram => 30sec speech per...
TRANSCRIPT
![Page 1: MFCC tutorial - ocw.nthu.edu.tw · 1980s Can recognize 20000 words 4MB ram => 30sec speech per 100 minutes](https://reader033.vdocuments.mx/reader033/viewer/2022052817/60b0b853403d317ca54e4ad2/html5/thumbnails/1.jpg)
MFCC tutorial
Brought to you by EE6641 TAs
![Page 2: MFCC tutorial - ocw.nthu.edu.tw · 1980s Can recognize 20000 words 4MB ram => 30sec speech per 100 minutes](https://reader033.vdocuments.mx/reader033/viewer/2022052817/60b0b853403d317ca54e4ad2/html5/thumbnails/2.jpg)
Outline
History
Mel frequency
Cepstrum
MFCC
Applications
Conclusions
![Page 3: MFCC tutorial - ocw.nthu.edu.tw · 1980s Can recognize 20000 words 4MB ram => 30sec speech per 100 minutes](https://reader033.vdocuments.mx/reader033/viewer/2022052817/60b0b853403d317ca54e4ad2/html5/thumbnails/3.jpg)
Outline
History
Mel frequency
Cepstrum
MFCC
Applications
Conclusions
![Page 4: MFCC tutorial - ocw.nthu.edu.tw · 1980s Can recognize 20000 words 4MB ram => 30sec speech per 100 minutes](https://reader033.vdocuments.mx/reader033/viewer/2022052817/60b0b853403d317ca54e4ad2/html5/thumbnails/4.jpg)
Signal Processing
Speech recognition
Pitch detection
Cover-song detector and so
on…
![Page 5: MFCC tutorial - ocw.nthu.edu.tw · 1980s Can recognize 20000 words 4MB ram => 30sec speech per 100 minutes](https://reader033.vdocuments.mx/reader033/viewer/2022052817/60b0b853403d317ca54e4ad2/html5/thumbnails/5.jpg)
1930s
![Page 6: MFCC tutorial - ocw.nthu.edu.tw · 1980s Can recognize 20000 words 4MB ram => 30sec speech per 100 minutes](https://reader033.vdocuments.mx/reader033/viewer/2022052817/60b0b853403d317ca54e4ad2/html5/thumbnails/6.jpg)
1950s
![Page 7: MFCC tutorial - ocw.nthu.edu.tw · 1980s Can recognize 20000 words 4MB ram => 30sec speech per 100 minutes](https://reader033.vdocuments.mx/reader033/viewer/2022052817/60b0b853403d317ca54e4ad2/html5/thumbnails/7.jpg)
1960s
![Page 8: MFCC tutorial - ocw.nthu.edu.tw · 1980s Can recognize 20000 words 4MB ram => 30sec speech per 100 minutes](https://reader033.vdocuments.mx/reader033/viewer/2022052817/60b0b853403d317ca54e4ad2/html5/thumbnails/8.jpg)
1960s
Rej Reddy
Soviet=>DTW capable of 200 words
![Page 9: MFCC tutorial - ocw.nthu.edu.tw · 1980s Can recognize 20000 words 4MB ram => 30sec speech per 100 minutes](https://reader033.vdocuments.mx/reader033/viewer/2022052817/60b0b853403d317ca54e4ad2/html5/thumbnails/9.jpg)
1960’s
IDA: Leonard Baum
![Page 10: MFCC tutorial - ocw.nthu.edu.tw · 1980s Can recognize 20000 words 4MB ram => 30sec speech per 100 minutes](https://reader033.vdocuments.mx/reader033/viewer/2022052817/60b0b853403d317ca54e4ad2/html5/thumbnails/10.jpg)
1980s
Can recognize 20000 words
4MB ram => 30sec speech per 100
minutes
![Page 11: MFCC tutorial - ocw.nthu.edu.tw · 1980s Can recognize 20000 words 4MB ram => 30sec speech per 100 minutes](https://reader033.vdocuments.mx/reader033/viewer/2022052817/60b0b853403d317ca54e4ad2/html5/thumbnails/11.jpg)
1990
Commercial opportunities
Number of words bigger than human’s
![Page 12: MFCC tutorial - ocw.nthu.edu.tw · 1980s Can recognize 20000 words 4MB ram => 30sec speech per 100 minutes](https://reader033.vdocuments.mx/reader033/viewer/2022052817/60b0b853403d317ca54e4ad2/html5/thumbnails/12.jpg)
2000s
Lern & Hauspie
Dragon System
Later Bankrupt
![Page 13: MFCC tutorial - ocw.nthu.edu.tw · 1980s Can recognize 20000 words 4MB ram => 30sec speech per 100 minutes](https://reader033.vdocuments.mx/reader033/viewer/2022052817/60b0b853403d317ca54e4ad2/html5/thumbnails/13.jpg)
2010s
Deep learning
Reduced 30% error
“the most dramatic change”
![Page 14: MFCC tutorial - ocw.nthu.edu.tw · 1980s Can recognize 20000 words 4MB ram => 30sec speech per 100 minutes](https://reader033.vdocuments.mx/reader033/viewer/2022052817/60b0b853403d317ca54e4ad2/html5/thumbnails/14.jpg)
Outline
History
Mel frequency
Cepstrum
MFCC
Applications
Conclusions
![Page 15: MFCC tutorial - ocw.nthu.edu.tw · 1980s Can recognize 20000 words 4MB ram => 30sec speech per 100 minutes](https://reader033.vdocuments.mx/reader033/viewer/2022052817/60b0b853403d317ca54e4ad2/html5/thumbnails/15.jpg)
Mel-frequency
perceptual scale of pitch
1000 to 1000
"聽閾"
Not all equations are the same
![Page 16: MFCC tutorial - ocw.nthu.edu.tw · 1980s Can recognize 20000 words 4MB ram => 30sec speech per 100 minutes](https://reader033.vdocuments.mx/reader033/viewer/2022052817/60b0b853403d317ca54e4ad2/html5/thumbnails/16.jpg)
Mel-frequency
Hz 40 161 200 404 693 867 1000 2022 3000 3393 4109 5526 6500 7743 12000
mel 43 257 300 514 771 928 1000 1542 2000 2142 2314 2600 2771 2914 3228
![Page 17: MFCC tutorial - ocw.nthu.edu.tw · 1980s Can recognize 20000 words 4MB ram => 30sec speech per 100 minutes](https://reader033.vdocuments.mx/reader033/viewer/2022052817/60b0b853403d317ca54e4ad2/html5/thumbnails/17.jpg)
Outline
History
Mel frequency
Cepstrum
MFCC
Applications
Conclusions
![Page 18: MFCC tutorial - ocw.nthu.edu.tw · 1980s Can recognize 20000 words 4MB ram => 30sec speech per 100 minutes](https://reader033.vdocuments.mx/reader033/viewer/2022052817/60b0b853403d317ca54e4ad2/html5/thumbnails/18.jpg)
Cepstrum
FFT => abs() => log() => IFFT(FFT)
“quefrency”
![Page 19: MFCC tutorial - ocw.nthu.edu.tw · 1980s Can recognize 20000 words 4MB ram => 30sec speech per 100 minutes](https://reader033.vdocuments.mx/reader033/viewer/2022052817/60b0b853403d317ca54e4ad2/html5/thumbnails/19.jpg)
Spectral Envelope
![Page 20: MFCC tutorial - ocw.nthu.edu.tw · 1980s Can recognize 20000 words 4MB ram => 30sec speech per 100 minutes](https://reader033.vdocuments.mx/reader033/viewer/2022052817/60b0b853403d317ca54e4ad2/html5/thumbnails/20.jpg)
Spectral Envelope
![Page 21: MFCC tutorial - ocw.nthu.edu.tw · 1980s Can recognize 20000 words 4MB ram => 30sec speech per 100 minutes](https://reader033.vdocuments.mx/reader033/viewer/2022052817/60b0b853403d317ca54e4ad2/html5/thumbnails/21.jpg)
Spectral Envelope
![Page 22: MFCC tutorial - ocw.nthu.edu.tw · 1980s Can recognize 20000 words 4MB ram => 30sec speech per 100 minutes](https://reader033.vdocuments.mx/reader033/viewer/2022052817/60b0b853403d317ca54e4ad2/html5/thumbnails/22.jpg)
Why dB?
![Page 23: MFCC tutorial - ocw.nthu.edu.tw · 1980s Can recognize 20000 words 4MB ram => 30sec speech per 100 minutes](https://reader033.vdocuments.mx/reader033/viewer/2022052817/60b0b853403d317ca54e4ad2/html5/thumbnails/23.jpg)
Disc-Cosine-Trans v.s. Disc-Fourier-
Trans
DCT
DFT
Difference?
![Page 24: MFCC tutorial - ocw.nthu.edu.tw · 1980s Can recognize 20000 words 4MB ram => 30sec speech per 100 minutes](https://reader033.vdocuments.mx/reader033/viewer/2022052817/60b0b853403d317ca54e4ad2/html5/thumbnails/24.jpg)
Disc-Cosine-Trans v.s. Disc-Fourier-
Trans
DCT
DFT
2x resolution
0.5x memory
O(N*N)
![Page 25: MFCC tutorial - ocw.nthu.edu.tw · 1980s Can recognize 20000 words 4MB ram => 30sec speech per 100 minutes](https://reader033.vdocuments.mx/reader033/viewer/2022052817/60b0b853403d317ca54e4ad2/html5/thumbnails/25.jpg)
Disc-Cosine-Trans v.s. Disc-Fourier-
Trans
![Page 26: MFCC tutorial - ocw.nthu.edu.tw · 1980s Can recognize 20000 words 4MB ram => 30sec speech per 100 minutes](https://reader033.vdocuments.mx/reader033/viewer/2022052817/60b0b853403d317ca54e4ad2/html5/thumbnails/26.jpg)
Disc-Cosine-Trans v.s. Disc-Fourier-
Trans
![Page 27: MFCC tutorial - ocw.nthu.edu.tw · 1980s Can recognize 20000 words 4MB ram => 30sec speech per 100 minutes](https://reader033.vdocuments.mx/reader033/viewer/2022052817/60b0b853403d317ca54e4ad2/html5/thumbnails/27.jpg)
Outline
History
Mel frequency
Cepstrum
MFCC
Applications
Conclusions
![Page 28: MFCC tutorial - ocw.nthu.edu.tw · 1980s Can recognize 20000 words 4MB ram => 30sec speech per 100 minutes](https://reader033.vdocuments.mx/reader033/viewer/2022052817/60b0b853403d317ca54e4ad2/html5/thumbnails/28.jpg)
MFCC
FFT => power spectrum =>
triangular filter banks (usually 26)
log => DCT(IDCT)
取係數 (usually 13)
![Page 29: MFCC tutorial - ocw.nthu.edu.tw · 1980s Can recognize 20000 words 4MB ram => 30sec speech per 100 minutes](https://reader033.vdocuments.mx/reader033/viewer/2022052817/60b0b853403d317ca54e4ad2/html5/thumbnails/29.jpg)
MFCC
Why MFCC?
Simplicity (Only several coefficients)
Smoothness
![Page 30: MFCC tutorial - ocw.nthu.edu.tw · 1980s Can recognize 20000 words 4MB ram => 30sec speech per 100 minutes](https://reader033.vdocuments.mx/reader033/viewer/2022052817/60b0b853403d317ca54e4ad2/html5/thumbnails/30.jpg)
Outline
History
Mel frequency
Cepstrum
MFCC
Applications
Conclusions
![Page 31: MFCC tutorial - ocw.nthu.edu.tw · 1980s Can recognize 20000 words 4MB ram => 30sec speech per 100 minutes](https://reader033.vdocuments.mx/reader033/viewer/2022052817/60b0b853403d317ca54e4ad2/html5/thumbnails/31.jpg)
Machine Learning
Unsupervised learning
Supervised learning
Semi-supervised learning
![Page 32: MFCC tutorial - ocw.nthu.edu.tw · 1980s Can recognize 20000 words 4MB ram => 30sec speech per 100 minutes](https://reader033.vdocuments.mx/reader033/viewer/2022052817/60b0b853403d317ca54e4ad2/html5/thumbnails/32.jpg)
Unsupervised learning
Expectation maximization
E-step vs M-step
![Page 33: MFCC tutorial - ocw.nthu.edu.tw · 1980s Can recognize 20000 words 4MB ram => 30sec speech per 100 minutes](https://reader033.vdocuments.mx/reader033/viewer/2022052817/60b0b853403d317ca54e4ad2/html5/thumbnails/33.jpg)
K-means
![Page 34: MFCC tutorial - ocw.nthu.edu.tw · 1980s Can recognize 20000 words 4MB ram => 30sec speech per 100 minutes](https://reader033.vdocuments.mx/reader033/viewer/2022052817/60b0b853403d317ca54e4ad2/html5/thumbnails/34.jpg)
GMM
![Page 35: MFCC tutorial - ocw.nthu.edu.tw · 1980s Can recognize 20000 words 4MB ram => 30sec speech per 100 minutes](https://reader033.vdocuments.mx/reader033/viewer/2022052817/60b0b853403d317ca54e4ad2/html5/thumbnails/35.jpg)
Supervised learning
Based on “labels”
Empirical vs General
Error minimization
![Page 36: MFCC tutorial - ocw.nthu.edu.tw · 1980s Can recognize 20000 words 4MB ram => 30sec speech per 100 minutes](https://reader033.vdocuments.mx/reader033/viewer/2022052817/60b0b853403d317ca54e4ad2/html5/thumbnails/36.jpg)
SVM
![Page 37: MFCC tutorial - ocw.nthu.edu.tw · 1980s Can recognize 20000 words 4MB ram => 30sec speech per 100 minutes](https://reader033.vdocuments.mx/reader033/viewer/2022052817/60b0b853403d317ca54e4ad2/html5/thumbnails/37.jpg)
kNN
![Page 38: MFCC tutorial - ocw.nthu.edu.tw · 1980s Can recognize 20000 words 4MB ram => 30sec speech per 100 minutes](https://reader033.vdocuments.mx/reader033/viewer/2022052817/60b0b853403d317ca54e4ad2/html5/thumbnails/38.jpg)
Neural nets
![Page 39: MFCC tutorial - ocw.nthu.edu.tw · 1980s Can recognize 20000 words 4MB ram => 30sec speech per 100 minutes](https://reader033.vdocuments.mx/reader033/viewer/2022052817/60b0b853403d317ca54e4ad2/html5/thumbnails/39.jpg)
Musical Instruments Identification
Use audio recorded by ourselves
![Page 40: MFCC tutorial - ocw.nthu.edu.tw · 1980s Can recognize 20000 words 4MB ram => 30sec speech per 100 minutes](https://reader033.vdocuments.mx/reader033/viewer/2022052817/60b0b853403d317ca54e4ad2/html5/thumbnails/40.jpg)
Outline
History
Mel frequency
Cepstrum
MFCC
Applications
Conclusions
![Page 41: MFCC tutorial - ocw.nthu.edu.tw · 1980s Can recognize 20000 words 4MB ram => 30sec speech per 100 minutes](https://reader033.vdocuments.mx/reader033/viewer/2022052817/60b0b853403d317ca54e4ad2/html5/thumbnails/41.jpg)
Quick Recap
Why Mel-Filterbank?
人耳聽覺Why DCT?
頻譜對稱性兩倍的解析度
Why dB?
系統分解
![Page 42: MFCC tutorial - ocw.nthu.edu.tw · 1980s Can recognize 20000 words 4MB ram => 30sec speech per 100 minutes](https://reader033.vdocuments.mx/reader033/viewer/2022052817/60b0b853403d317ca54e4ad2/html5/thumbnails/42.jpg)
Conclusions
What’s next?
工欲善其事,必先利其器。
![Page 43: MFCC tutorial - ocw.nthu.edu.tw · 1980s Can recognize 20000 words 4MB ram => 30sec speech per 100 minutes](https://reader033.vdocuments.mx/reader033/viewer/2022052817/60b0b853403d317ca54e4ad2/html5/thumbnails/43.jpg)