SPECTRUM? Hynek Hermansky with Jordan Cohen, Sangita Sharma, and Pratibha Jain,
![Page 1: SPECTRUM? Hynek Hermansky with Jordan Cohen, Sangita Sharma, and Pratibha Jain,](https://reader036.vdocuments.mx/reader036/viewer/2022082818/56649f1b5503460f94c304c0/html5/thumbnails/1.jpg)
SPECTRUM?
Hynek Hermansky
with
Jordan Cohen, Sangita Sharma, and Pratibha Jain,
![Page 2: SPECTRUM? Hynek Hermansky with Jordan Cohen, Sangita Sharma, and Pratibha Jain,](https://reader036.vdocuments.mx/reader036/viewer/2022082818/56649f1b5503460f94c304c0/html5/thumbnails/2.jpg)
/u/ /o/ /a/ // /iy/
Helmholtz
Radio Rex (1917)
“limited commercial success” - John Pierce, 1969
beer
Newton
![Page 3: SPECTRUM? Hynek Hermansky with Jordan Cohen, Sangita Sharma, and Pratibha Jain,](https://reader036.vdocuments.mx/reader036/viewer/2022082818/56649f1b5503460f94c304c0/html5/thumbnails/3.jpg)
SHORT TERM SPECTRUM
[Figure: spectrogram with time and frequency axes; a short-term spectrum of about 20 ms is passed to "classify"]
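As a rough illustration of the short-term spectrum on this slide, the sketch below cuts a signal into frames of about 20 ms and computes one power spectrum per frame. The function name `short_term_spectrum`, the 10 ms hop, the Hamming window, and the 8 kHz test tone are assumptions made for the example; the slide specifies only the approximate 20 ms span.

```python
import numpy as np


def short_term_spectrum(signal, sample_rate, frame_ms=20.0, hop_ms=10.0):
    """Return per-frame power spectra with shape (n_frames, n_bins).

    frame_ms follows the "about 20 ms" on the slide; hop_ms and the
    Hamming window are assumptions chosen for this illustration.
    """
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    window = np.hamming(frame_len)
    frames = np.stack([
        signal[i * hop_len:i * hop_len + frame_len] * window
        for i in range(n_frames)
    ])
    # Power spectrum of each windowed frame (real-input FFT).
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2


# Usage: one second of a 1 kHz tone sampled at 8 kHz (hypothetical input).
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000.0 * t)
spec = short_term_spectrum(tone, sample_rate)
peak_bin = int(np.argmax(spec[0]))
peak_hz = peak_bin * sample_rate / 160  # 160 samples per 20 ms frame
```

Each row of `spec` is the kind of vector a frame-level classifier would see; everything outside its 20 ms span is invisible to that classifier.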
![Page 4: SPECTRUM? Hynek Hermansky with Jordan Cohen, Sangita Sharma, and Pratibha Jain,](https://reader036.vdocuments.mx/reader036/viewer/2022082818/56649f1b5503460f94c304c0/html5/thumbnails/4.jpg)
Cortical receptive fields
![Page 5: SPECTRUM? Hynek Hermansky with Jordan Cohen, Sangita Sharma, and Pratibha Jain,](https://reader036.vdocuments.mx/reader036/viewer/2022082818/56649f1b5503460f94c304c0/html5/thumbnails/5.jpg)
ASR from TempoRAl Patterns (TRAP)
[Figure: spectrogram with time and frequency axes and phone “boundaries” marked. Two paths lead to "classify": a short-term spectrum of about 20 ms, and a 1 sec window over the temporal pattern of critical band energies]
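The temporal-pattern input can be sketched as follows: for every frame, take the trajectory of each critical-band energy over a window of about 1 sec centered on that frame. This is a minimal sketch, not the authors' implementation; the function name `temporal_patterns`, the 100 frames-per-second rate, the 15 bands, and the per-trajectory mean removal are all assumptions made for illustration.

```python
import numpy as np


def temporal_patterns(band_energies, frame_rate_hz=100.0, window_s=1.0):
    """Slice per-band energy trajectories around each center frame.

    band_energies: array (n_frames, n_bands) of critical-band energies.
    Returns an array (n_centers, n_bands, window_len), one trajectory per
    band and center frame. Frame rate and mean removal are assumptions.
    """
    half = int(round(window_s * frame_rate_hz / 2.0))
    window_len = 2 * half + 1
    n_frames, _ = band_energies.shape
    if n_frames < window_len:
        raise ValueError("need at least one full window of frames")
    patterns = []
    for center in range(half, n_frames - half):
        segment = band_energies[center - half:center + half + 1]
        # Remove each band's mean over the window (assumed normalization).
        segment = segment - segment.mean(axis=0)
        patterns.append(segment.T)  # (n_bands, window_len)
    return np.stack(patterns)


# Usage with synthetic energies: 3 s at 100 frames/s, 15 bands (hypothetical).
rng = np.random.default_rng(0)
energies = rng.normal(size=(300, 15))
traps = temporal_patterns(energies)
```

In contrast to a short-term spectrum, each vector here covers a single frequency band but a long stretch of time, so a classifier fed with `traps[i, b]` sees only band `b`.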
![Page 6: SPECTRUM? Hynek Hermansky with Jordan Cohen, Sangita Sharma, and Pratibha Jain,](https://reader036.vdocuments.mx/reader036/viewer/2022082818/56649f1b5503460f94c304c0/html5/thumbnails/6.jpg)
WHY 200-1000 ms?
• because that’s where the information is (coarticulation)
– mutual info studies (Bilmes, Yang et al.)
• psychophysics of hearing
– 200 ms “critical time window” (forward masking, perception of loudness, perception of gaps, …)
• physiology of hearing
– time component of cortical receptive fields (Klein)
• because “it works”
– ETSI Aurora work
[Figure: time-frequency plane (axes: time, frequency) with a 200 – 1000 ms span marked]
![Page 7: SPECTRUM? Hynek Hermansky with Jordan Cohen, Sangita Sharma, and Pratibha Jain,](https://reader036.vdocuments.mx/reader036/viewer/2022082818/56649f1b5503460f94c304c0/html5/thumbnails/7.jpg)
WHY narrow frequency bands?
• psychophysics of hearing
– independence of processing within critical bands
• physiology of hearing
– mechanical selectivity of cochlea
– cortical receptive fields (e.g. Shamma)
• because “it works”
– multi-band ASR (Bourlard and Dupont, Hermansky et al., …)
– decrease in ASR accuracy for wider frequency spans (Jain and Hermansky - Eurospeech 2003)
[Figure: time-frequency plane (axes: time, frequency) with a 1-3 Bark span marked]
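To give a feel for how wide a 1-3 Bark span is in Hz, the sketch below converts between Hz and Bark with one common approximation of the Bark scale, 6·asinh(f/600), which is also used in PLP analysis. The slides do not say which Bark formula was used, so that choice, the function names, and the 1 kHz center frequency are assumptions.

```python
import math


def hz_to_bark(freq_hz):
    """Approximate Bark value for a frequency in Hz (assumed formula)."""
    return 6.0 * math.asinh(freq_hz / 600.0)


def bark_to_hz(bark):
    """Inverse of hz_to_bark."""
    return 600.0 * math.sinh(bark / 6.0)


def band_edges_hz(center_hz, width_bark):
    """Lower and upper edge in Hz of a band width_bark wide around center_hz."""
    center_bark = hz_to_bark(center_hz)
    low = bark_to_hz(center_bark - width_bark / 2.0)
    high = bark_to_hz(center_bark + width_bark / 2.0)
    return low, high


# Usage: compare a 1 Bark and a 3 Bark band around 1 kHz (hypothetical center).
low1, high1 = band_edges_hz(1000.0, 1.0)
low3, high3 = band_edges_hz(1000.0, 3.0)
```

Because the scale is compressive, the same width in Bark corresponds to a wider span in Hz at higher center frequencies.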
![Page 8: SPECTRUM? Hynek Hermansky with Jordan Cohen, Sangita Sharma, and Pratibha Jain,](https://reader036.vdocuments.mx/reader036/viewer/2022082818/56649f1b5503460f94c304c0/html5/thumbnails/8.jpg)
Which features?
• no knowledge is better than wrong knowledge
– data cannot lie
– speech evolved to be heard
• data-derived processing is consistent with human-like processing (minus the irrelevant components of the human cognitive processing)
[Figure: time-frequency plane (axes: time, frequency) feeding a "data-guided processing" block that outputs features]
![Page 9: SPECTRUM? Hynek Hermansky with Jordan Cohen, Sangita Sharma, and Pratibha Jain,](https://reader036.vdocuments.mx/reader036/viewer/2022082818/56649f1b5503460f94c304c0/html5/thumbnails/9.jpg)
WHY data-guided processing?
• some function of class posteriors
– class posteriors form the most efficient feature set [e.g. Fukunaga]
• posteriors of which classes?
[Figure: time-frequency plane (axes: time, frequency) feeding a "data-guided (trained on data) processing" block that outputs features]
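One possible "function of class posteriors" is sketched below: take the logarithm of the posteriors and decorrelate the result with a PCA projection. The slide does not state which function is meant, so this particular choice, the function name `posterior_features`, and the synthetic posteriors are assumptions for illustration only.

```python
import numpy as np


def posterior_features(posteriors, n_components=None, eps=1e-10):
    """Map class posteriors (n_frames, n_classes) to decorrelated features.

    Log followed by PCA is an assumed example of "some function of class
    posteriors"; eps guards against log(0).
    """
    log_post = np.log(posteriors + eps)
    centered = log_post - log_post.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]       # largest variance first
    basis = eigvecs[:, order[:n_components]]
    return centered @ basis


# Usage with synthetic posteriors from a softmax over random scores.
rng = np.random.default_rng(0)
scores = rng.normal(size=(500, 5))
posteriors = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
features = posterior_features(posteriors, n_components=3)
feature_cov = np.cov(features, rowvar=False)
```

The projected features are uncorrelated by construction, which suits back ends that model features with diagonal covariances; whether that matters depends on the recognizer that consumes them.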
![Page 10: SPECTRUM? Hynek Hermansky with Jordan Cohen, Sangita Sharma, and Pratibha Jain,](https://reader036.vdocuments.mx/reader036/viewer/2022082818/56649f1b5503460f94c304c0/html5/thumbnails/10.jpg)
Speech Events
[Figure: block diagram with the labels: signal, frequency-selective hearing, event detection (two blocks), p(event, frequency), class (phoneme?) detection]
![Page 11: SPECTRUM? Hynek Hermansky with Jordan Cohen, Sangita Sharma, and Pratibha Jain,](https://reader036.vdocuments.mx/reader036/viewer/2022082818/56649f1b5503460f94c304c0/html5/thumbnails/11.jpg)
TRAP TANDEM
[Figure: block diagram. A time-frequency plane (axes: time, frequency) feeds blocks labeled "data processing (trained system)" (twice) and "processing (trained system)"; the outputs are labeled "class posteriors" and "some function of phoneme posteriors"]