IIT Bombay, arjayan@ee.iitb.ac.in, pcpandey@ee.iitb.ac.in
14th National Conference on Communications, 1-3 Feb. 2008, IIT Bombay, Mumbai, India
Detection of Acoustic Landmarks with High Resolution for Speech Processing
A. R. Jayan, P. C. Pandey, V. K. Pandey
{arjayan, pcpandey, vinod}@ee.iitb.ac.in
EE Dept, IIT Bombay
3rd February, 2008
PRESENTATION OUTLINE
1. Introduction
  ▪ Acoustic properties of clear speech
  ▪ Landmark detection
  ▪ Need for high time resolution
2. Automated landmark detection with high resolution
  ▪ Pass 1
  ▪ Pass 2
3. Experimental results
4. Summary and conclusion
1. INTRODUCTION
Acoustic properties of clear speech
Clear speech: Speech produced with clear articulation when talking to a hearing-impaired listener, or in noisy environments
Examples (conversational and clear versions) - http://www.acoustics.org/press/145th/clr-spch-tab.htm
▪ ‘the book tells a story’
▪ ‘the boy forgot his book’
Intelligibility of clear speech
▪ Picheny et al., 1985: ~17% more intelligible than conversational speech
▪ More intelligible for different classes of listeners & listening conditions
Acoustic differences between clear and conversational speech
Sentence level
▪ Reduced speaking rate (conversational: 200 wpm, clear: 100 wpm)
▪ Larger variation in fundamental frequency
▪ More pauses, with longer pause durations
Word level
▪ Fewer sound deletions
▪ More sound insertions
Phonetic level
▪ Context-dependent, non-linear increase in segment durations
▪ More targeted vowel formants
▪ Increase in consonant intensity
Improvement in intelligibility of conversational speech by incorporating properties of clear speech
Consonant–vowel intensity ratio (CVR) enhancement
▪ Increasing energy of consonant segment
Consonant duration enhancement
▪ Lengthening CV and VC transitions (burst duration, VOT, formant transition)
Challenges
▪ Accurate detection of regions for modification
▪ Analysis-modification-synthesis with low processing artifacts
▪ Processing without altering the overall speaking rate: increase in transition regions with a corresponding decrease in steady-state segments
Intelligibility enhancement using properties of clear speech
Hazan & Simpson, 1998
▪ manually labeled VCV syllables and sentences
▪ intensity modification of stop burst +12 dB, frication +6 dB, nasal +6 dB
▪ spectral modification by filtering
Colotte & Laprie, 2000
▪ automated method for identifying regions based on mel-cepstral analysis
▪ stops and unvoiced fricatives amplified by +4 dB
▪ transition segments time-scaled by 1.8, 2.0 (TD-PSOLA)
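The level changes quoted on this slide (+12 dB, +6 dB, +4 dB) are amplitude scalings of labeled segments. A minimal sketch of that arithmetic follows; the function names, the toy signal, and the segment boundaries are hypothetical and are not taken from either cited study.

```python
def db_to_gain(gain_db):
    """Convert a level change in dB to a linear amplitude factor."""
    return 10.0 ** (gain_db / 20.0)


def amplify_segment(samples, start, end, gain_db):
    """Return a copy of `samples` with samples[start:end] scaled by `gain_db`."""
    gain = db_to_gain(gain_db)
    out = list(samples)
    for i in range(start, end):
        out[i] *= gain
    return out


# Hypothetical toy signal: a weak burst at samples 2..4 between two vowels.
signal = [0.5, 0.5, 0.01, -0.02, 0.01, 0.5, 0.5]
boosted = amplify_segment(signal, 2, 5, 12.0)  # +12 dB on the burst only
```

A +12 dB change corresponds to an amplitude factor of about 3.98, and +6 dB to about 2.0. A practical system would probably taper the gain at the segment edges to avoid discontinuities; this sketch does not.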
Landmark detection
Speech landmarks
▪ Regions containing important information for speech perception
▪ Associated with spectral transitions
Landmark types
1. Abrupt-consonantal (AC) - Tight constrictions of primary articulators
2. Abrupt (A) - Fast glottal or velum activity
3. Non-abrupt (N) - Semi-vowel landmarks, less vocal tract constriction
4. Vocalic (V) - Vowel landmarks
Abrupt (~68%), Vocalic (~29%), Non-abrupt (~3%)
Landmarks
Liu, 1996
▪ Based on energy variation in 6 spectral bands: 0-0.4, 0.8-1.5, 1.2-2.0, 2.0-3.5, 3.5-5.0, 5.0-8 kHz
▪ Parameter: First difference of maximum energy (log) in each spectral band; time-step = 50 ms in coarse level, 26 ms in fine level
▪ Matching of peaks across bands for locating boundaries
▪ Detects glottal landmarks, sonorant closures and releases, and stop closures and releases
Application: Extraction of features for supporting speech recognition
Detection rate vs. temporal resolution
[Figure: detection rate vs. temporal resolution; plotted values 44 %, 73 %, 83 %, 88 %]
Uses same processing for all types of landmarks
Niyogi & Sondhi, 2002
▪ for stop consonants
▪ total energy & energy above 3 kHz in log scale
▪ measure of spectral flatness
▪ non-linear operator optimized for burst detection
Salomon et al., 2002
▪ Hilbert transform based envelope to extract temporal parameters
▪ spectral information
▪ adaptive time-steps (5 ms for burst onset, 30 ms for frication, 2 × pitch period for periodic regions)
Alani & Deriche, 1999
▪ wavelet transform based decomposition
▪ energy variations in 6 bands
Need for high temporal resolution and detection rate
Application dependent
▪ Speech recognition: Analysis is performed around landmarks for parameter extraction
  ▪ high accuracy
  ▪ moderate temporal resolution (20-30 ms)
▪ Intelligibility enhancement: Modify landmark regions
  ▪ high temporal resolution (< 5 ms)
  ▪ some tolerance to detection errors, but low tolerance to insertions as insertions may introduce distortions
Landmark type
▪ Short duration events (bursts) need high time resolution
▪ Voicing onsets/offsets may not require this much resolution as signal properties remain the same for a long duration
Factors limiting detection rate and temporal resolution
▪ Effectiveness of parameters in capturing acoustic variations
  ▪ short-time energy variation in spectral bands: weak burst may not get detected
  ▪ centroid frequency: not well defined during low energy segments
  ▪ fixed band boundaries: may not adapt to speech variability
▪ Smoothing performed during parameter extraction
  ▪ temporal smoothing on spectrum affects time resolution
▪ Type of distance measure
  ▪ first difference operation not optimized for all types of landmarks
  ▪ time-step of 10 ms is too large for burst detection
▪ Effect of noise on parameters
Acoustic cues for the different phonetic events are distributed non-homogeneously in the time-frequency plane
▪ Separate detectors are required for each phonetic class
▪ Each detector must use a method most suited for the phonetic event
Objective
Automated detection of landmarks for stop consonants with high temporal resolution, for applications in speech intelligibility enhancement
2. AUTOMATED LANDMARK DETECTION
[Block diagram]
Pass 1: speech → short-time spectral analysis → computation of energy peaks and centroids in the frequency bands → computation of energy and centroid RORs → computation of spectral transition index → landmark localization → landmarks (pass 1)
Pass 2: landmarks (pass 1) → wavelet decomposition around landmarks → computation of short-time energy and ZCRs → computation of energy and ZCR RORs → landmark localization → landmarks (pass 2)
Landmark detection using spectral peaks and centroids
Pass 1
Spectrum divided into five non-overlapping bands
▪ 0–0.4, 0.4–1.2, 1.2–2.0, 2.0–3.5, 3.5–5.0 kHz
▪ Sampling frequency 10 k samples/s
▪ 512-point FFT on 6 ms frames
▪ frame interval 1 ms (one frame every 1 ms)
Parameters
▪ maximum energy in each spectral band, every 1 ms
▪ band centroids estimated in each band, every 1 ms
▪ features similar to formant peaks and formant frequencies
▪ can be estimated easily
▪ not much affected by noise
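The Pass 1 parameters can be illustrated for a single frame. The following is a minimal sketch using only the Python standard library (a naive DFT stands in for the FFT, and a numerical library would be the natural choice in practice). The band edges, sampling frequency, FFT length, and frame length come from the slide; the Hamming window, the energy-weighted centroid, the handling of band-edge bins, and all function names are assumptions of this sketch.

```python
import cmath
import math

FS = 10000    # sampling frequency in samples/s (from the slide)
NFFT = 512    # FFT length (from the slide)
BANDS_HZ = [(0, 400), (400, 1200), (1200, 2000), (2000, 3500), (3500, 5000)]


def power_spectrum(frame, nfft=NFFT):
    """Squared magnitude of the zero-padded DFT for bins 0..nfft/2 (naive DFT)."""
    spec = []
    for k in range(nfft // 2 + 1):
        acc = 0j
        for m, x in enumerate(frame):
            acc += x * cmath.exp(-2j * math.pi * k * m / nfft)
        spec.append(abs(acc) ** 2)
    return spec


def band_peak_and_centroid(spec, lo_hz, hi_hz, fs=FS, nfft=NFFT):
    """Peak energy (dB) and energy-weighted centroid (Hz) of one band."""
    k1 = int(math.ceil(lo_hz * nfft / fs))
    k2 = min(int(math.floor(hi_hz * nfft / fs)), nfft // 2)
    band = spec[k1:k2 + 1]
    peak_db = 10.0 * math.log10(max(band) + 1e-12)  # small floor avoids log(0)
    total = sum(band)
    if total <= 0.0:
        # Assumption: fall back to the band centre when the band is empty.
        return peak_db, 0.5 * (lo_hz + hi_hz)
    centroid_bin = sum(k * p for k, p in zip(range(k1, k2 + 1), band)) / total
    return peak_db, centroid_bin * fs / nfft


# One 6 ms frame (60 samples at 10 kHz) of a 1 kHz tone, Hamming-windowed.
n = 60
frame = [
    math.sin(2 * math.pi * 1000 * m / FS)
    * (0.54 - 0.46 * math.cos(2 * math.pi * m / (n - 1)))
    for m in range(n)
]
spec = power_spectrum(frame)
features = [band_peak_and_centroid(spec, lo, hi) for lo, hi in BANDS_HZ]
```

For this test tone, the 0.4–1.2 kHz band has the largest peak energy and its centroid lies close to 1 kHz. Repeating the computation for a frame every 1 ms would give the per-band contours used in the rest of Pass 1.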
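Peak energy
\[ E_p(b,n) = 10 \log_{10} \Big( \max_{k_1 \le k \le k_2} |X_n(k)|^2 \Big) \]
Centroid frequency
\[ f_c(b,n) = \frac{f_s}{N} \, \frac{\sum_{k=k_1}^{k_2} k\,|X_n(k)|^2}{\sum_{k=k_1}^{k_2} |X_n(k)|^2} \]
Rate-of-rise functions
\[ E'_p(b,n) = E_p(b,n+K) - E_p(b,n-K) \]
\[ f'_c(b,n) = f_c(b,n+K) - f_c(b,n-K) \]
Transition index
\[ T_r(n) = \sum_{b=1}^{5} E'_p(b,n)\, f'_c(b,n) \]
where X_n(k) is the N-point spectrum of frame n, k_1 and k_2 are the first and last bins of band b, f_s is the sampling frequency, and K sets the time step
▪ tracks simultaneous variation of energy and centroid
▪ centroids given less weighting in low energy areas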
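The rate-of-rise functions and the transition index can be sketched on per-band tracks sampled once per frame. This is a minimal illustration, not the authors' implementation. It assumes a symmetric difference between frames n+K and n-K, holds the end values at the track edges, and takes the absolute value of each product so that rising and falling transitions both contribute positively. The function names and toy tracks are hypothetical.

```python
def rate_of_rise(track, K):
    """Symmetric difference x[n+K] - x[n-K]; end values are held at the edges."""
    last = len(track) - 1
    return [track[min(n + K, last)] - track[max(n - K, 0)]
            for n in range(len(track))]


def transition_index(peak_db, centroid_hz, K):
    """Sum over bands of |E'_p(b, n) * f'_c(b, n)| (absolute value assumed)."""
    n_frames = len(peak_db[0])
    index = [0.0] * n_frames
    for e_track, f_track in zip(peak_db, centroid_hz):
        e_ror = rate_of_rise(e_track, K)
        f_ror = rate_of_rise(f_track, K)
        for n in range(n_frames):
            index[n] += abs(e_ror[n] * f_ror[n])
    return index


# Toy single-band tracks at a 1 ms frame interval: both jump at frame 10.
energy = [[20.0] * 10 + [50.0] * 10]       # dB
centroid = [[500.0] * 10 + [900.0] * 10]   # Hz
T = transition_index(energy, centroid, K=2)
nonzero_frames = [n for n, v in enumerate(T) if v > 0.0]
```

With K = 2 the index is non-zero only for frames 8 to 11, that is, within K frames of the jump. Because the two rates of rise are multiplied, a band in which only the centroid moves while the energy stays flat contributes nothing.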
Example: /uka/
[Figure: peak & centroid contours in the five bands: 0-0.4, 0.4-1.2, 1.2-2.0, 2.0-3.5, 3.5-5.0 kHz]
Example: /uka/
[Figure: peak & centroid ROR contours, time step = 26 ms, in the five bands: 0-0.4, 0.4-1.2, 1.2-2.0, 2.0-3.5, 3.5-5.0 kHz]
Example: /uka/
[Figure: transition index derived from RORs with time step = 26 ms]
Example: /uka/
[Figure: transition index derived from RORs with time step = 4 ms]
Less sensitive to slow transitions
Problems
Large time step (> 20 ms)
▪ detects with less temporal accuracy
▪ also detects slowly varying events (higher detection rate)
Small time step (< 5 ms)
▪ detects abrupt transitions with good resolution
▪ misses slow transitions
Pass 2:
Analyze landmarks detected in Pass 1 with a small time-step
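The trade-off between large and small time steps can be seen on a toy energy track that contains one slow ramp and one abrupt jump. The sketch below assumes a symmetric-difference rate of rise at a 1 ms frame interval; the track, the values of K, and the helper names are hypothetical.

```python
def rate_of_rise(track, K):
    """Symmetric difference x[n+K] - x[n-K]; end values are held at the edges."""
    last = len(track) - 1
    return [track[min(n + K, last)] - track[max(n - K, 0)]
            for n in range(len(track))]


# Toy energy track (dB), one value per ms:
# flat, slow 40 ms ramp (frames 30..70), flat, abrupt 20 dB jump at frame 101.
ramp = [20.0 + 0.5 * i for i in range(41)]
track = [20.0] * 30 + ramp + [40.0] * 30 + [60.0] * 30


def summarize(track, K, ramp_span=(30, 70)):
    """Return (peak ROR, frames sharing the peak, largest ROR inside the ramp)."""
    ror = rate_of_rise(track, K)
    peak = max(ror)
    plateau = sum(1 for v in ror if v == peak)
    slow_peak = max(ror[ramp_span[0]:ramp_span[1] + 1])
    return peak, plateau, slow_peak


small_step = summarize(track, K=2)    # 4 ms between the compared frames
large_step = summarize(track, K=13)   # 26 ms between the compared frames
```

With K = 2 the abrupt jump gives a peak only 4 frames wide and the slow ramp reaches a tenth of that peak, so the ramp is easy to miss. With K = 13 the ramp reaches 65% of the peak and is easier to detect, but the peak for the jump is spread over 26 frames, which is the loss of temporal accuracy noted above.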
Improving temporal resolution: Pass 2
▪ 40 ms window centered around burst landmarks detected in Pass 1
▪ decomposed to 6 levels by discrete Meyer wavelet
▪ detail (high frequency) contents in the lower two levels used for localizing bursts
Parameters
▪ short-time energy variation
▪ zero crossing rate
Compute normalized RORs with a time-step of 3 ms
Get a new transition index as
\[ T_{ez}(n) = 0.5 \sum_{l=1}^{2} E'_n(l,n)\, Z'_n(l,n) \]
Relocate landmark to the location corresponding to the peak in T_ez(n)
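The relocation step can be sketched under several assumptions. The two detail signals are taken as already computed and of equal length (they could come from a wavelet library such as PyWavelets, which is not used here). The normalized rates of rise of energy and zero-crossing count are multiplied and averaged over the two levels. When several samples share the peak value, the middle one is chosen. The window length, the value of K, and all names are hypothetical choices for this illustration.

```python
def short_time_energy(x, win):
    """Sum of squares over about `win` samples centred on each sample."""
    half = win // 2
    return [sum(v * v for v in x[max(0, n - half):n + half + 1])
            for n in range(len(x))]


def zero_crossing_count(x, win):
    """Number of sign changes over about `win` samples centred on each sample."""
    half = win // 2
    counts = []
    for n in range(len(x)):
        seg = x[max(0, n - half):n + half + 1]
        counts.append(sum(1 for a, b in zip(seg, seg[1:]) if (a >= 0) != (b >= 0)))
    return counts


def normalized_ror(track, K):
    """Symmetric difference x[n+K] - x[n-K], scaled to a maximum magnitude of 1."""
    last = len(track) - 1
    ror = [track[min(n + K, last)] - track[max(n - K, 0)]
           for n in range(len(track))]
    peak = max(abs(v) for v in ror)
    return [v / peak if peak > 0 else 0.0 for v in ror]


def relocate(details, win, K):
    """Sample index at the peak of T_ez(n), summed over the detail signals."""
    n_samples = len(details[0])
    t_ez = [0.0] * n_samples
    for d in details:
        e = normalized_ror(short_time_energy(d, win), K)
        z = normalized_ror(zero_crossing_count(d, win), K)
        for n in range(n_samples):
            t_ez[n] += 0.5 * e[n] * z[n]
    peak = max(t_ez)
    ties = [n for n, v in enumerate(t_ez) if v >= peak - 1e-9]
    return ties[len(ties) // 2]  # assumption: middle of a flat peak


# Toy 40 ms window at 10 kHz (400 samples): silence, then a burst at sample 200.
detail = [0.0] * 200 + [0.5 * (-1) ** i for i in range(200)]
landmark = relocate([detail, detail], win=10, K=15)  # K=15 spans 3 ms
```

For this toy window the relocated landmark falls on sample 200, the onset of the burst. Real detail signals from different decomposition levels would differ in bandwidth and, for a decimated transform, in length, so they would need to be brought to a common time axis first.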
Relocating stop landmarks
3. EXPERIMENTAL RESULTS
Test material: VCV syllables
▪ 2 speakers (1 male, 1 female)
▪ 3 stop consonants (/p/, /t/, /k/)
▪ 3 initial and 3 final vowel contexts (/a/, /i/, /u/)
▪ Total 54 tokens

Pass 1 (entries: number of missed tokens per initial vowel; "-" = none)
Stop      30 ms       20 ms       10 ms       5 ms
          a  i  u     a  i  u     a  i  u     a  i  u
/p/       -  -  -     -  -  -     -  -  -     1  1  2
/t/       -  -  -     -  -  -     -  -  -     1  1  2
/k/       -  -  -     1  -  -     1  -  1     3  3  3
Det. %    100         98.1        96.3        68.5

Pass 2 (entries: number of missed tokens per initial vowel; "-" = none)
Stop      10 ms       7 ms        5 ms        3 ms
          a  i  u     a  i  u     a  i  u     a  i  u
/p/       -  -  -     -  -  -     -  -  -     -  -  -
/t/       -  -  -     -  -  -     -  -  -     -  1  -
/k/       -  -  -     -  -  -     -  -  -     -  -  -
Det. %    100         100         100         98.1
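The detection percentages for the VCV syllables can be cross-checked from the table entries, reading each entry as a count of missed tokens out of the 54 tokens (an interpretation that the arithmetic supports but the slide does not state). A minimal sketch for the Pass 1 table, with hypothetical names:

```python
TOTAL_TOKENS = 54  # 2 speakers x 3 stops x 3 initial vowels x 3 final vowels

# Missed tokens in Pass 1, rows /p/, /t/, /k/, columns initial vowel a, i, u,
# keyed by temporal resolution in ms ("-" in the table is entered as 0).
pass1_misses = {
    30: [[0, 0, 0], [0, 0, 0], [0, 0, 0]],
    20: [[0, 0, 0], [0, 0, 0], [1, 0, 0]],
    10: [[0, 0, 0], [0, 0, 0], [1, 0, 1]],
    5:  [[1, 1, 2], [1, 1, 2], [3, 3, 3]],
}


def detection_rate(misses, total=TOTAL_TOKENS):
    """Percentage of tokens detected, rounded to one decimal place."""
    missed = sum(sum(row) for row in misses)
    return round(100.0 * (total - missed) / total, 1)


rates = {tol: detection_rate(m) for tol, m in pass1_misses.items()}
```

The computed rates are 100.0, 98.1, 96.3, and 68.5 % at 30, 20, 10, and 5 ms, matching the Det. % row. The same calculation with the single miss at 3 ms in Pass 2 gives 98.1 %.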
Test material: TIMIT sentences
▪ 5 speakers (2 male, 3 female)
▪ 10 sentences per speaker
▪ closure and burst onsets of /b/, /d/, /g/, /p/, /t/, /k/
▪ total 418 tokens

Detection rates (%)
Phoneme class      30 ms             20 ms             10 ms
                   Pass 1  Pass 2    Pass 1  Pass 2    Pass 1  Pass 2
Stop (548)         94      96        82      86        62      66
Fricative (266)    95      95        90      90        76      79
Nasal (154)        80      79        70      70        53      51
Vowel (614)        77      79        70      71        58      57
S. vowel (213)     69      70        68      67        60      61
Overall det. (%)   84.1    85.7      76.4    78.0      61.7    63.0

[Figure: Localization error]
4. SUMMARY & CONCLUSION
Pass 2 improves temporal resolution of stop landmarks
▪ Significant improvement in stop burst localization in VCV syllables: 30% improvement for 5 ms resolution
▪ Marginal improvement in sentences: 4% improvement for stop landmarks at 10 ms resolution
Possible reasons
▪ reduced closure duration in sentences
▪ unreleased bursts
▪ errors in Pass 1 may be above 30 ms
▪ use of 40 ms window in Pass 2, may need modification
▪ errors in the manual labels
Future work: Evaluation of the method in presence of noise