automated detection of speech landmarks using gaussian mixture modeling a. r. jayan p. c. pandey


FRSM-08, 20-21 Feb. 2008, Kolkata, India

Automated Detection of Speech Landmarks Using Gaussian Mixture Modeling

A. R. Jayan, P. C. Pandey

{arjayan, pcpandey}@ee.iitb.ac.in

EE Dept., IIT Bombay

February 2008




A. R. Jayan and P. C. Pandey, "Automated detection of speech landmarks using Gaussian Mixture Modeling", Frontiers of Research on Speech and Music (FRSM-08), Feb. 20-21, 2008, Jadavpur University, Kolkata, India.

Abstract: Landmarks in a speech signal are regions with abrupt spectral variations. Automated detection of these regions is important for several applications in speech processing. The performance of landmark detection using parameters extracted from predefined spectral bands is generally limited by speaker-related spectral variability. This paper presents a landmark detection technique that adapts to the acoustic properties of speech. Parameters are extracted by Gaussian mixture modeling (GMM) of the smoothed spectral envelope. A single rate-of-rise function, obtained from the set of GMM parameters, is used for locating landmark regions. The method was evaluated using manually labeled VCV syllables and sentences. It detected 85 % of stop release bursts in VCV syllables and 82 % in sentences within 5 ms of the manually located landmarks.

Address: SPI Lab, EE Dept., IIT Bombay, Powai, Mumbai 400 076, India
Web: http://www.ee.iitb.ac.in/~spilab
E-mail: {arjayan, pcpandey}@ee.iitb.ac.in




PRESENTATION OUTLINE

1. Introduction

2. Gaussian Mixture Modeling (GMM)

3. Experimental results

4. Summary and conclusion




1. INTRODUCTION

Landmark detection

Speech landmarks: regions containing important information for speech perception

Associated with spectral transitions

Landmark types

1. Abrupt-consonantal (AC) - Tight constrictions of primary articulators

2. Abrupt (A) - Fast glottal or velum activity

3. Non-abrupt (N) - Semi-vowel landmarks, less vocal tract constriction

4. Vocalic (V) - Vowel landmarks

Abrupt (~68%), Vocalic (~29%), Non-abrupt (~3%)




Example of landmarks

Applications of landmark detection

▪ Feature extraction for supporting speech recognition
▪ Intelligibility enhancement




Earlier studies on automated landmark detection

Schutte and Glass, 2005
▪ Mel frequency cepstral coefficients, support vector machines (SVMs)
▪ Application: Extraction of features for speech recognition

Sainath and Hazen, 2006
▪ Sinusoidal model, short-time energy, signal harmonicity
▪ Application: Extraction of features for speech recognition

Liu, 1996
▪ 512-point DFT on 6 ms frames, frame shift 1 ms
▪ 20-point moving average along time to get smooth parameter tracks
▪ First difference of maximum spectral component in 6 spectral bands
▪ Application: Extraction of features for speech recognition

[Figure: detection rate (%) versus detection time (ms)]




Factors limiting detection rate and temporal resolution

▪ Effectiveness of parameters in capturing acoustic variations
  ▪ Short-time energy variation in spectral bands: weak bursts may not get detected
  ▪ Centroid frequency: not well defined during low-energy segments
  ▪ Fixed band boundaries: may not adapt to speech variability

▪ Temporal smoothing of parameter tracks
  ▪ Time resolution affected

▪ Detection operation
  ▪ First difference operation not optimized for all types of landmarks
  ▪ Time step of 10 ms may be too large for burst detection

▪ Effect of noise on parameters
  ▪ Cepstral features: sensitive to noise
  ▪ Band energy or spectral peaks: not much affected
  ▪ Band centroids: sensitive to noise




Need for high temporal resolution and detection rate

Application dependent

▪ Speech recognition: analysis performed around landmarks for parameter extraction. Landmarks detected with
  ▪ high accuracy
  ▪ moderate temporal resolution (20-30 ms)

▪ Intelligibility enhancement: modification of landmark regions, detected with
  ▪ good temporal resolution (0-5 ms)
  ▪ some tolerance to detection errors, but low tolerance to insertions, as insertions may introduce distortions

Landmark type

▪ Short-duration events (bursts) need high time resolution.
▪ Voicing onsets/offsets may not require high resolution, as signal properties remain the same for a long duration.




Improvement in intelligibility of conversational speech by incorporating properties of clear speech: enhancing landmark regions

▪ Consonant–vowel intensity ratio (CVR) enhancement: increasing the energy of the consonant segment.

▪ Consonant duration enhancement: lengthening CV and VC transitions (burst duration, VOT, formant transition).

Challenges

▪ Accurate detection of regions for modification.
▪ Analysis-modification-synthesis with low processing artifacts.
▪ Processing without increasing the overall speaking rate: lengthening of transition regions with a corresponding shortening of steady-state segments.




Earlier studies on intelligibility enhancement

Colotte & Laprie, 2000
▪ Identifying regions based on mel-cepstral analysis
▪ Stops and unvoiced fricatives amplified by +4 dB
▪ Transition segments time-scaled by 1.8, 2.0 (TD-PSOLA)

Skowronski & Harris, 2006
▪ Spectral transition measure based voiced/unvoiced classification
▪ Energy redistribution in voiced/unvoiced segments (ERVU)
▪ Amplifying low-energy regions critical to intelligibility

Jayan & Pandey, 2007
▪ Variation of maximum energy and centroid in 5 spectral bands
▪ VC and CV transition segments expanded, steady-state segments compressed → less temporal masking by nearby vowel
▪ Intensity scaling of transition segments
▪ Overall speech duration is kept unaltered




Fixed spectral band based landmark detection

Spectrum divided into five non-overlapping bands
▪ 0–0.4, 0.4–1.2, 1.2–2.0, 2.0–3.5, 3.5–5.0 kHz
▪ Sampling frequency: 10 k samples/s
▪ 512-point FFT on 6 ms frames
▪ Frame shift: 1 ms
▪ Peak spectral component and band centroid in each band, every 1 ms (related to formant peaks and formant frequencies)

Peak energy in band $b$ (DFT indices $k_1$ to $k_2$) for frame $n$, with $X_n(k)$ the $N$-point DFT of the frame

$$E_p(b,n) = 10 \log_{10} \left( \max_{k_1 \le k \le k_2} |X_n(k)|^2 \right)$$

Centroid frequency

$$f_c(b,n) = \frac{f_s}{N} \, \frac{\sum_{k=k_1}^{k_2} k \, |X_n(k)|^2}{\sum_{k=k_1}^{k_2} |X_n(k)|^2}$$

Rate-of-rise functions, with time step $K$

$$E'_p(b,n) = E_p(b,n+K) - E_p(b,n-K)$$

$$f'_c(b,n) = f_c(b,n+K) - f_c(b,n-K)$$

Transition index

$$T_r(n) = \sum_{b=1}^{5} E'_p(b,n) \, f'_c(b,n)$$
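The fixed-band parameters on this slide can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names, the small numerical floor, and the use of a plain product inside the transition index are choices made here, not details confirmed by the slide.

```python
import numpy as np

FS = 10_000          # sampling frequency (samples/s), from the slide
NFFT = 512           # DFT length, from the slide
BANDS_HZ = [(0, 400), (400, 1200), (1200, 2000), (2000, 3500), (3500, 5000)]

def band_parameters(power_spec):
    """power_spec: array (n_frames, NFFT//2 + 1) holding |X_n(k)|^2.
    Returns peak energy (dB) and centroid frequency (Hz), each (n_frames, 5)."""
    freqs = np.arange(power_spec.shape[1]) * FS / NFFT
    e_p, f_c = [], []
    for lo, hi in BANDS_HZ:
        sel = (freqs >= lo) & (freqs < hi)
        band = power_spec[:, sel] + 1e-12      # assumed floor: avoids log(0) and 0/0
        e_p.append(10.0 * np.log10(band.max(axis=1)))
        f_c.append((band * freqs[sel]).sum(axis=1) / band.sum(axis=1))
    return np.stack(e_p, axis=1), np.stack(f_c, axis=1)

def symmetric_difference(track, K):
    """Rate of rise track[n + K] - track[n - K]; zero where undefined."""
    ror = np.zeros_like(track)
    ror[K:-K] = track[2 * K:] - track[:-2 * K]
    return ror

def transition_index(e_p, f_c, K):
    """Sum over bands of the product of the two rate-of-rise functions
    (the combining operator is an assumption of this sketch)."""
    return (symmetric_difference(e_p, K) * symmetric_difference(f_c, K)).sum(axis=1)
```

Because the band edges are hard-coded, a formant that moves across an edge changes which band reports it, which is the limitation the next slide attributes to fixed boundaries.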




Limitations

Only 60 % of release bursts in VCV syllables detected within 5 ms of manual labels.

Possible reasons
▪ Poor approximation of formant peaks and frequencies by maximum energy and centroid in spectral bands with fixed boundaries.
▪ Temporal smoothing performed on parameter tracks.

Gaussian Mixture Modeling (GMM)
▪ Provides a parametric representation of smoothed spectra.
▪ Can be used to extract formant-like features.
▪ Gaussian mean → formant frequency, amplitude → formant peak, variance → formant bandwidth.
▪ Abrupt spectral variations result in abrupt variations in Gaussian parameters.
▪ Parameter extraction by smoothing in the spectral domain, with no smoothing in the temporal domain → improved temporal resolution.




Earlier studies on Gaussian modeling of speech spectra

▪ Zolfaghari & Robinson, 1996
  ▪ Cepstrally smoothed speech spectrum modeled by GMMs.
  ▪ Formant analysis, formant vocoder.
  ▪ Formant tracks followed LPC based tracker, higher formant bandwidths.

▪ Stuttle & Gales, 2002
  ▪ Low-pass filtered speech spectrum modeled by GMMs.
  ▪ GMM features used with MFCC features in speech recognition.
  ▪ GMM parameters found effective in noisy environments.

▪ Omar et al., 2001
  ▪ Used Gaussian model of phonetic boundaries.
  ▪ Improvement in phoneme recognition accuracy.

▪ Lindblom & Samuelsson, 2003
  ▪ Bounded support expectation maximization algorithm for modeling speech source spectra (EMBS).




Objective

Automated detection of landmarks for stop consonants with high temporal resolution, using Gaussian mixture modeling of speech spectra.

Landmark detection using GMM parameters

Speech signal → log|FFT| → Spectral smoothing → GMM parameters → Rate-of-rise measure → Landmarks




2. GAUSSIAN MIXTURE MODELING

▪ Speech signal sampled at 10 k samples/s
▪ 512-point DFT on 6 ms frames
▪ Frame shift = 1 ms
▪ Spectral smoothing by low-pass filtering the spectral envelope; filter impulse response → 20-point raised cosine window
▪ Parameter extraction by the expectation maximization (EM) algorithm
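The analysis front end listed above (6 ms frames, 1 ms shift, 512-point DFT, smoothing along frequency with a 20-point raised cosine) might look roughly like the following. The Hamming analysis window, the edge padding, and the function name are assumptions of this sketch; the slide does not specify them.

```python
import numpy as np

FS = 10_000                    # samples/s, from the slide
FRAME_LEN = 6 * FS // 1000     # 6 ms -> 60 samples
FRAME_SHIFT = 1 * FS // 1000   # 1 ms -> 10 samples
NFFT = 512

def smoothed_log_spectra(x):
    """Return an array (n_frames, NFFT//2) of smoothed log-magnitude spectra."""
    x = np.asarray(x, dtype=float)
    n_frames = 1 + (len(x) - FRAME_LEN) // FRAME_SHIFT
    idx = np.arange(FRAME_LEN) + FRAME_SHIFT * np.arange(n_frames)[:, None]
    frames = x[idx] * np.hamming(FRAME_LEN)           # analysis window: assumption
    mag = np.abs(np.fft.rfft(frames, NFFT, axis=1))[:, :NFFT // 2]
    log_mag = np.log(mag + 1e-10)
    h = np.hanning(22)[1:-1]                          # 20 non-zero raised-cosine taps
    h /= h.sum()                                      # unity gain at DC
    pad = len(h) // 2
    padded = np.pad(log_mag, ((0, 0), (pad, pad)), mode="edge")
    smooth = np.apply_along_axis(
        lambda row: np.convolve(row, h, mode="valid"), 1, padded)
    return smooth[:, :NFFT // 2]                      # smoothing is along k only
```

Note that the filter runs along the frequency index only, so each 1 ms frame is processed independently and no averaging across time is introduced.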

GMM approximation of smoothed spectral envelope $S_x(n,k)$

$$\hat{S}_x(n,k) = \sum_{m=1}^{M} A_m(n) \, G\big(\mu_m(n), \sigma_m(n)\big)$$

$$G\big(\mu_m(n), \sigma_m(n)\big) = \frac{1}{\sqrt{2\pi}\,\sigma_m} \exp\left( -\frac{(x-\mu_m)^2}{2\sigma_m^2} \right)$$

Initialization
▪ Means → equal spacing along $k$: $\mu_m = (m - 0.5)\,N/(2M)$
▪ Equal mixture weights = $1/M$
▪ Equal standard deviations = $N/(2M)$
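One common way to fit Gaussians to a spectrum with EM is to treat the envelope as a histogram over frequency bins. The sketch below follows that idea with the initialization given on the slide; it assumes the envelope has already been shifted to be non-negative, and the function name, iteration count, and variance floor are illustrative choices rather than the authors' settings.

```python
import numpy as np

def _gauss(k, mu, sd):
    """Gaussian densities evaluated at bins k, one column per component."""
    return np.exp(-0.5 * ((k[:, None] - mu) / sd) ** 2) / (np.sqrt(2.0 * np.pi) * sd)

def fit_spectrum_gmm(envelope, M=4, n_iter=100):
    """Fit M Gaussians to a non-negative envelope S(k), k = 0..K-1 (K = N/2).
    Returns amplitudes A_m, means, standard deviations, and the modeled spectrum."""
    S = np.asarray(envelope, dtype=float)
    K = len(S)
    k = np.arange(K, dtype=float)
    total = S.sum()
    p = S / total                                   # envelope as a histogram
    w = np.full(M, 1.0 / M)                         # equal weights 1/M
    mu = (np.arange(1, M + 1) - 0.5) * K / M        # (m - 0.5) N / (2M)
    sd = np.full(M, K / M)                          # N / (2M)
    for _ in range(n_iter):
        joint = w * _gauss(k, mu, sd) + 1e-300      # (K, M), guarded against 0/0
        resp = joint / joint.sum(axis=1, keepdims=True)        # E-step
        mass = (p[:, None] * resp).sum(axis=0) + 1e-12         # M-step
        w = mass / mass.sum()
        mu = (p[:, None] * resp * k[:, None]).sum(axis=0) / mass
        var = (p[:, None] * resp * (k[:, None] - mu) ** 2).sum(axis=0) / mass
        sd = np.sqrt(np.maximum(var, 1.0))          # 1-bin floor: assumption
    amps = total * w                                # A_m in the slide's notation
    model = (amps * _gauss(k, mu, sd)).sum(axis=1)
    return amps, mu, sd, model
```

Running this once per 1 ms frame yields the amplitude, mean, and variance tracks used on the following slides; warm-starting each frame from the previous frame's parameters is a possible speed-up, but is not something the slides mention.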




Gaussian parameters

▪ Gaussian amplitudes are consistent during vowel, consonant, and silence segments.
▪ Means and variances are not well defined during low-energy segments.
▪ Parameter tracks derived using Gaussian amplitudes.

Detection of burst landmarks

▪ Rate-of-rise (ROR) function derived using Gaussian amplitudes, except that of the first Gaussian.
▪ Normalized to 0-1 range, 10-point median filtering.
▪ Square root operation to make ROR more sensitive to burst onsets.

$$r(n) = \left[ \sum_{m=2}^{M} \big( A_m(n+s) - A_m(n-s) \big)^2 \right]^{0.5}$$

where $A_m(n)$ is the amplitude track of the $m$-th Gaussian and $s$ is the time step.
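A rough sketch of the rate-of-rise computation follows. The slide does not state the time step or the exact order of the normalization and median filtering, so this version assumes that each amplitude track is normalized to 0-1 and median filtered before differencing, and uses a hypothetical default step of 2 frames.

```python
import numpy as np

def median_filter(x, width=10):
    """Sliding median with edge padding (handles the even width on the slide)."""
    pad_l = width // 2
    pad_r = width - 1 - pad_l
    padded = np.pad(x, (pad_l, pad_r), mode="edge")
    return np.array([np.median(padded[i:i + width]) for i in range(len(x))])

def burst_ror(amps, s=2, med_width=10):
    """amps: (n_frames, M) Gaussian amplitude tracks A_m(n).
    Returns r(n), built from all tracks except the first Gaussian."""
    a = np.asarray(amps, dtype=float)[:, 1:]          # drop the first Gaussian
    lo = a.min(axis=0)
    span = a.max(axis=0) - lo
    a = (a - lo) / np.where(span > 0, span, 1.0)      # 0-1 normalization
    a = np.column_stack(
        [median_filter(a[:, m], med_width) for m in range(a.shape[1])])
    diff = np.zeros_like(a)
    diff[s:-s] = a[2 * s:] - a[:-2 * s]               # A_m(n+s) - A_m(n-s)
    return np.sqrt((diff ** 2).sum(axis=1))           # the 0.5 exponent
```

Peaks of `r(n)` above a threshold would then be taken as candidate burst landmarks; the threshold and peak-picking rule are not given on the slide and are left out here.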




3. RESULTS AND DISCUSSION

Number of Gaussians for modeling decided by computing the normalized mean squared error between the smoothed spectrum $S_x(n,k)$ and the Gaussian modeled spectrum $\hat{S}_x(n,k)$:

$$e(n) = \frac{\sum_{k=1}^{N/2} \big( S_x(n,k) - \hat{S}_x(n,k) \big)^2}{\sum_{k=1}^{N/2} S_x(n,k)^2}$$

▪ e(n) computed for vowels /a/, /i/, /u/ and fricatives /v/, /z/, /f/, /s/.
▪ Voiced sounds modeled more accurately.
▪ Not much improvement on increasing the number of Gaussians above 3.
▪ Selected M = 4, for modeling 4 significant vocal tract resonances.

Normalized mean squared error for no. of Gaussian components

Phoneme   1      2      3      4      5
/a/       0.22   0.08   0.06   0.05   0.04
/i/       0.45   0.08   0.05   0.05   0.05
/u/       0.35   0.12   0.08   0.07   0.05
/v/       0.18   0.05   0.04   0.04   0.03
/z/       0.49   0.10   0.01   0.01   0.01
/f/       0.43   0.28   0.20   0.19   0.18
/s/       0.77   0.16   0.13   0.11   0.13
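The error measure used to choose the number of Gaussians is simple to compute for one frame. A minimal sketch, with hypothetical argument names:

```python
import numpy as np

def nmse(smoothed, modeled):
    """Normalized mean squared error between the smoothed spectrum
    and the Gaussian modeled spectrum of one frame."""
    s = np.asarray(smoothed, dtype=float)
    g = np.asarray(modeled, dtype=float)
    return ((s - g) ** 2).sum() / (s ** 2).sum()
```

Values like those in the table would come from evaluating this for M = 1 to 5 on frames of each phoneme; how the frames were selected or averaged is not stated on the slide.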




Evaluation using VCV syllables

Test material: 3 vowels /a/, /i/, /u/, 6 stops /b/, /d/, /g/, /p/, /t/, /k/, and 6 speakers (3 male, 3 female) → 108 manually labeled tokens.

[Figure panels: signal, spectrogram, GMM spectrogram, amplitude, mean, variance, ROR]




Evaluation using sentences

Test material: 15 Marathi sentences with 98 manually labeled tokens, 1 speaker.

[Figure panels for the sentence 'kamal, ki thi kam kar the ?': signal, amplitude, mean, variance, burst landmarks, ROR]




Comparison of results

M1: GMM based. M2: maximum spectral component + centroid in spectral bands with fixed boundaries.

[Figure: detection of burst landmarks for VCV syllables and for sentences; % detection for M1 and M2 versus localization error (0-5, 0-10, 0-15, 0-20, 0-30 ms)]

Observations

▪ M1 outperforms M2 in terms of detection rate for temporal resolution < 10 ms.
▪ M2 detects more landmarks than M1, but with poorer temporal resolution.
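Detection rates of the kind compared above can be tabulated from manual and automatic landmark times as follows. This is a sketch with hypothetical names; it scores each manual label by its nearest detection and does not enforce one-to-one matching or count insertions, which the actual evaluation may have handled differently.

```python
import numpy as np

def detection_rates(manual_ms, detected_ms, tolerances_ms=(5, 10, 15, 20, 30)):
    """Percentage of manual landmarks with a detection within each tolerance."""
    manual = np.asarray(manual_ms, dtype=float)
    detected = np.asarray(detected_ms, dtype=float)
    if detected.size == 0:
        return {t: 0.0 for t in tolerances_ms}
    # Localization error: distance from each manual label to its nearest detection
    err = np.abs(manual[:, None] - detected[None, :]).min(axis=1)
    return {t: 100.0 * float(np.mean(err <= t)) for t in tolerances_ms}

rates = detection_rates([100, 200, 300, 400], [103, 212, 330, 900])
# rates[5] is 25.0 and rates[30] is 75.0 for this toy input
```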




4. SUMMARY & CONCLUSION

Landmark detection using Gaussian parameters investigated:

▪ Good temporal resolution compared to parameters extracted from spectral bands with fixed boundaries.

▪ Most of the landmarks detected within 10 ms of manual labels.

Future work:

▪ Evaluation of the landmark detection method in the presence of noise.
▪ The method is more computation intensive and needs further investigation for real-time detection of landmarks.
