Email: {ikeno, John.Hansen}@utdallas.edu Slide 1 IAFPA-2006 Center for Robust Speech Systems SLIDES by John H.L. Hansen, 2006 Ayako Ikeno and John H.L. Hansen IAFPA-2006 July 23-26, 2006 Center for Robust Speech Systems (CRSS) Erik Jonsson School of Engineering & Computer Science University of Texas at Dallas Richardson, Texas 75083-0688, U.S.A.

TRANSCRIPT

  • Slide 1: Ayako Ikeno and John H.L. Hansen. IAFPA-2006, July 23-26, 2006. Center for Robust Speech Systems (CRSS), Erik Jonsson School of Engineering & Computer Science, University of Texas at Dallas, Richardson, Texas 75083-0688, U.S.A. Email: {ikeno, John.Hansen}@utdallas.edu. Slides by John H.L. Hansen, 2006.
  • Slide 2: CRSS & Speech Processing Overview; Previous Studies on Stress & Lombard Effect; Perceptual Speaker ID with Lombard Speech; Speech Corpus - UTScope; Experimental Setup; Results; Summary & Impact.
  • Slide 3: Overview of CRSS-Hansen Research. Spoken Document Retrieval: http://SpeechFind.utdallas.edu; Speech Under Stress; Speech Enhancement; UTDrive & CU-Move: In-Vehicle Voice Navigation; Dialect & Accent; In-Set / Out-of-Set Speaker Detection; Normalization: Speaker, Environment, Language; UAE, Egypt, Palestine, etc.; Cuba, Peru, Puerto Rico; Cambridge, Irish, Welsh, etc.
  • Slide 4: Why Speech Systems Break? Environmental based: acoustic noise; room reverberation; physical task demands. Communication based: microphone; voice compression; channel / mobile cellular. Speaker based problems: stress & emotion; Lombard effect / noise; psychological task demands; accent / language; speaker differences (age, sex, vocal tract); spontaneous speech. Context based effects: homonyms (English +10,000; Japanese 120); confusable: (take, stake, straight; cake, Kate); ambiguous: "Jeet yet?", "it's ours" vs. "it sours", "nice guys" vs. "nice skies". "Um, I just wanna, I just want to say, I don't know what I want to say." Diagram labels: Speech; Stress; Environment; Noise; Accent; Language; Speech Recognition; Human (Auditory) Recognition; Voice Communications Channel; Noise; American English; Speaker; Lombard Effect; Speaker Recognition.
  • Slide 5: Speech Production: Phonetics & Acoustics. Diagram labels: Noise; Stress; Microphone; Speaker; Speech Physiology; Acoustic Speech Waveform (Neutral, Stress).
  • Slide 6: Does Stress Variability Impact Speaker Recognition? Limited research on speaker recognition over stress, Lombard effect, etc. The NATO RSG.10 report showed probe experimental results with the SUSAS corpus (NATO, 2000).
  • Slide 7: Pitch; Glottal Spectral Slope; Formant Location. (Earlier studies by Hansen (1988): 200 speech features, 10,000 stat. tests.)
  • Slide 8: Phone Duration; RMS Intensity.
  • Slide 9: Stress Detection Using Pitch. Conditional Gaussian fit (Zhou, Hansen 1997). Classification error rate: Neutral vs. Loud: 7.24% (Neutral), 8.28% (Loud); Neutral vs. Lombard: 20.69% (Neutral), 19.31% (Lombard). Figures: probability distribution; detection (ROC) curves.
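
The pitch-based detection on Slide 9 can be illustrated with a minimal sketch, assuming one mean-pitch value per utterance and a single Gaussian fitted per class. The pitch values, function names, and class labels below are hypothetical and chosen only for illustration; they are not data or code from Zhou and Hansen (1997), and the slide does not state how the original fit was carried out.

```python
import math
from statistics import mean, stdev


def fit_gaussian(samples):
    """Return (mean, standard deviation) of a list of pitch values in Hz."""
    return mean(samples), stdev(samples)


def log_likelihood(x, mu, sigma):
    """Log of the Gaussian density N(x; mu, sigma^2)."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def classify(x, models):
    """Pick the class whose Gaussian gives x the highest likelihood."""
    return max(models, key=lambda label: log_likelihood(x, *models[label]))


def error_rate(samples, true_label, models):
    """Fraction of samples from one class that are assigned to another class."""
    wrong = sum(1 for x in samples if classify(x, models) != true_label)
    return wrong / len(samples)


# Hypothetical mean-pitch values (Hz), NOT taken from the slides.
neutral_f0 = [118.0, 122.0, 125.0, 130.0, 121.0, 127.0]
lombard_f0 = [138.0, 145.0, 150.0, 142.0, 155.0, 148.0]

models = {
    "neutral": fit_gaussian(neutral_f0),
    "lombard": fit_gaussian(lombard_f0),
}
```

Reporting `error_rate` separately for each class mirrors the per-class figures quoted on the slide, for example 20.69% (Neutral) and 19.31% (Lombard).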
  • Slide 10: ROC Curves: Stress Detection.
  • Slide 11: Past Stress Detection Studies Using Traditional Features. Stress/Neutral error rates by individual feature: Pitch 6-21%; Glottal Spectral Slope 18-36%; Intensity 28-46%; Phone Duration 38-46%; Formant Location: 1st formant 50%, 2nd formant 58%. Feature fusion (Duration + Intensity + mean Pitch): 0-17%.
  • Slide 12: Teager Energy Operator. Discrete-time and continuous-time TEO definitions, where Ψ[·] is the Teager Energy Operator. TEO-CB-Auto-Env: Critical Band based TEO AUTOcorrelation ENVelope. Critical Frequency 17 Band Partition: based on Auditory Perception. Ref: Zhou, Hansen, Kaiser, IEEE Transactions on Speech & Audio Processing, vol. 9(2): 201-216, March 2001.
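
The equations on Slide 12 did not survive in this transcript. As a hedged illustration, the sketch below uses the commonly cited discrete-time form of the Teager Energy Operator, Ψ[x(n)] = x(n)² − x(n+1)·x(n−1); the continuous-time counterpart is usually written Ψ[x(t)] = (dx/dt)² − x(t)·d²x/dt². The function name and test signal are assumptions for illustration, and the code covers only the basic operator, not the TEO-CB-Auto-Env feature (critical-band filtering and autocorrelation envelope).

```python
import math


def teager_energy(x):
    """Discrete TEO for interior samples: x(n)^2 - x(n+1) * x(n-1)."""
    return [x[n] ** 2 - x[n + 1] * x[n - 1] for n in range(1, len(x) - 1)]


# For a pure tone A*cos(omega*n), the operator output is the constant
# A^2 * sin(omega)^2, which tracks both amplitude and frequency.
amplitude, omega = 2.0, 0.3
signal = [amplitude * math.cos(omega * n) for n in range(50)]

psi = teager_energy(signal)
expected = amplitude ** 2 * math.sin(omega) ** 2
```

The constant output for a sinusoid is why the operator is often described as an energy measure sensitive to both amplitude and frequency.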
  • Slide 13: Neutral HMM model vs. stress-trained HMM model. Assessment for NATO SUSC-0 military cockpit recordings.
  • Slide 14: Detection of Speech Under Stress: WRAIR. Goal: (1) identify, model, and classify speech under stress in military-related task conditions, and (2) improve automatic speech coding under stress. Approach: Effective Soldier of the Quarter Board paradigm. Monitor and track biometrics of stress: heart rate, blood pressure, stress hormones, psychometrics. Engineering: focus on nonlinear air turbulent model, Teager Energy Operator; identify stress-dependent performance across speakers, phonemes. Ref: Rahurkar, Hansen, Meyerhoff, Saviolakis, Koenig, "Frequency Distribution based Weighted Sub-Band Approach for Classification of Emotional/Stressful Content in Speech," Interspeech, pp. 721-724, Geneva, Switzerland, Sept. 2003 (another paper at Interspeech-2005).
  • Slide 15: First observed by Etienne Lombard in 1911. Change in speech production in response to noise to increase communication performance. Lombard Test: standard test for hearing loss in U.S. (ASHA); measures dB-SPL change in speech production. Hansen (1988): evaluation of 200 features with +10,000 statistical tests on 11 different stressed speech conditions to quantify changes in speech production.
  • Slide 16: IAFPA-06: focus on Lombard Effect. Audio samples for the perceptual experiment were extracted from the UTScope corpus (Speech under COgnitive and Physical stress & Emotion). Consists of 4 domains: Lombard Effect (noise levels & types); Physical Stress (stair climbing/stepper); Cognitive Stress (driving, simulator & actual); Emotion (Angry, Fear, Anxiety, Frustration).
  • Slide 17: Goal: obtain Lombard speech at different noise levels; quantify ground truth with biometric analysis. Lombard Effect Speech: 9 conditions (3 noise, 3 levels), 1 sec. duration. Pink Noise: 65, 75, 85 dB-SPL. Highway Noise (windows open): 70, 80, 90 dB-SPL. Large Crowd Noise: 70, 80, 90 dB-SPL.
  • Slide 18: UTScope. Pink noise: 65, 75, 86 dB-SPL. Highway driving, windows half open: 70, 80, 90 dB-SPL. Large crowd noise: 70, 80, 90 dB-SPL. Pure-tone hearing screening. Open-air headphones for speech feedback. Noise levels calibrated with Quest SLM.
  • Slide 19: UTScope. 20 TIMIT sentences; 5 digit strings; 1 minute spontaneous speech; 100 speakers. 8-channel DAT recorder; P-mic; close-talking mic; far-field mic.
  • Slide 20: The ASHA-certified sound booth and recording equipment.
  • Slide 21: Figure panels: Male Lombard; Male Neutral. Lombard Effect impacts temporal and spectral structure (as expected). Evaluation: perceptual experiments to assess speaker recognition.
  • Slide 22: Listener Test. Speakers: corpus UTScope; native US English speakers; female speakers only. Speech conditions (Reference / Test): NL-LD: Neutral / Lombard; LD-LD: Lombard / Lombard; NL-NL: Neutral / Neutral. Noise type: highway driving. Noise level: 90 dB-SPL.
  • Slide 23: Speech Materials. Read speech. TIMIT sentences: phonetically balanced. 3 sentences per audio sample (.wav, 16 kHz). Ref: "Basketball can be an entertaining sport." "My problem is, the cat's meow always hurts my ears." "The causeway ended abruptly at the shore." Test: "Youngsters commonly love chocolate and candies as treats." "December and January are nice months to spend in Miami." "There were other farmhouses nearby."
  • Slide 24: Listener Test. Listeners (12: 2f/10m, May 06; 41 as of July 06): India (4), China (1), Korea (1), Mexico (1), Pakistan (1), Thailand (1), Turkey (1), US (1), Vietnam (1). Task: In-set vs. Out-of-set Speaker Identification. Reference/Training: 12 in-set female speakers. Test: 8 in-set speakers, 4 out-of-set speakers.
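
The in-set vs. out-of-set task on Slide 24 can be scored as a binary decision per test speaker. The sketch below is a minimal illustration, assuming each trial records whether the test speaker was truly in the reference set and whether the listener judged them to be; the function name and the listener responses are hypothetical, and only the 8 in-set / 4 out-of-set split comes from the slide.

```python
def score_trials(trials):
    """Score (is_in_set, listener_said_in_set) boolean pairs.

    Returns overall accuracy plus accuracy on in-set and out-of-set trials.
    """
    in_set = [said for truth, said in trials if truth]
    out_of_set = [said for truth, said in trials if not truth]
    return {
        "overall": sum(truth == said for truth, said in trials) / len(trials),
        "in_set": sum(in_set) / len(in_set),
        "out_of_set": sum(not said for said in out_of_set) / len(out_of_set),
    }


# 8 in-set and 4 out-of-set test speakers, as on the slide.
# The listener responses themselves are made up for illustration.
trials = (
    [(True, True)] * 6      # in-set speakers correctly accepted
    + [(True, False)] * 2   # in-set speakers wrongly rejected
    + [(False, False)] * 3  # out-of-set speakers correctly rejected
    + [(False, True)] * 1   # out-of-set speaker wrongly accepted
)

scores = score_trials(trials)
```

Because each trial is a two-way decision, a listener answering at random would be expected to score about 50% overall.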
  • Slide 25: Reference audio: Neutral, Lombard. Test audio: Neutral, Lombard.
  • Slide 26: The effect of speech condition: significant (p=.0024). Mismatched condition (NL-LD) accuracy: chance level (52%). Lombard speech (LD-LD, 79%): higher accuracy than neutral speech (NL-NL, 67%). The Lombard effect may emphasize speech characteristics and improve accuracy on perceptual speaker ID.
  • Slide 27: Automated System Performance (SUSAS Corpus). Emotion/stress, Mismatched Training / Matched Training: Neutral 96; Angry 34 / 75; Lombard 48 / 99; Fast 91 / 90; Slow 90 / 98; Soft 73 / 89; Loud 22 / 81. Loss: Angry 62%, Lombard 48%, Loud 74% (5-74% loss). (See Hansen, et al., "The Impact of Speech Under 'Stress' on Military Speech Technology," NATO Research & Tech. Org. RTO-TR-10, March 2000.) The trend holds for the automated system as well.
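
The loss figures on Slide 27 appear to be the neutral score (96) minus each mismatched-training score. That reading is an assumption inferred from the numbers, not something the slide states; the short check below reproduces the quoted values under it. Variable names are illustrative.

```python
# Scores as listed on Slide 27 (SUSAS corpus, automated system).
neutral = 96
mismatched = {"Angry": 34, "Lombard": 48, "Fast": 91, "Slow": 90, "Soft": 73, "Loud": 22}
matched = {"Angry": 75, "Lombard": 99, "Fast": 90, "Slow": 98, "Soft": 89, "Loud": 81}

# Assumed definition: loss = neutral score - mismatched-training score.
loss = {style: neutral - score for style, score in mismatched.items()}

# Points regained when training matches the test style.
recovered = {style: matched[style] - mismatched[style] for style in mismatched}

for style in mismatched:
    print(f"{style:8s} loss {loss[style]:3d}   matched - mismatched {recovered[style]:+d}")
```

Under this assumption the losses span 5 (Fast) to 74 (Loud), matching the "5-74% loss" range on the slide.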
  • Slide 28: In-Set accuracy: affected by the speech condition significantly (p