What are the Essential Cues for Understanding Spoken Language? Steven Greenberg International Computer Science Institute 1947 Center Street, Berkeley, CA 94704 http://www.icsi.berkeley.edu/~steveng [email protected]


Page 1: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

What are the Essential Cues for

Understanding Spoken Language?

Steven Greenberg
International Computer Science Institute
1947 Center Street, Berkeley, CA 94704

http://www.icsi.berkeley.edu/~steveng
[email protected]

Page 2: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

No Scientist is an Island …

IMPORTANT COLLEAGUES

ACOUSTIC BASIS OF SPEECH INTELLIGIBILITY
Takayuki Arai, Joy Hollenback, Rosaria Silipo

AUDITORY-VISUAL INTEGRATION FOR SPEECH PROCESSING
Ken Grant

AUTOMATIC SPEECH RECOGNITION AND FEATURE CLASSIFICATION
Shawn Chang, Lokendra Shastri, Mirjam Wester

STATISTICAL ANALYSIS OF PRONUNCIATION VARIATION
Eric Fosler, Leah Hitchcock, Joy Hollenback

Page 3: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Germane Publications

STATISTICAL PROPERTIES OF SPOKEN LANGUAGE AND PRONUNCIATION MODELING

Fosler-Lussier, E., Greenberg, S. and Morgan, N. (1999) Incorporating contextual phonetics into automatic speech recognition. Proceedings of the 14th International Congress of Phonetic Sciences, San Francisco.

Greenberg, S. (1997) On the origins of speech intelligibility in the real world. Proceedings of the ESCA Workshop on Robust Speech Recognition for Unknown Communication Channels, Pont-a-Mousson, France, pp. 23-32.

Greenberg, S. (1999) Speaking in shorthand - A syllable-centric perspective for understanding pronunciation variation, Speech Communication, 29, 159-176.

Greenberg, S. and Fosler-Lussier, E. (2000) The uninvited guest: Information's role in guiding the production of spontaneous speech, in the Proceedings of the Crest Workshop on Models of Speech Production: Motor Planning and Articulatory Modelling, Kloster Seeon, Germany .

Greenberg, S., Hollenback, J. and Ellis, D. (1996) Insights into spoken language gleaned from phonetic transcription of the Switchboard corpus, in Proc. Intern. Conf. Spoken Lang. (ICSLP), Philadelphia, pp. S24-27.

AUTOMATIC PHONETIC TRANSCRIPTION AND ACOUSTIC FEATURE CLASSIFICATION

Chang, S., Greenberg, S. and Wester, M. (2001) An elitist approach to articulatory-acoustic feature classification. 7th European Conference on Speech Communication and Technology (Eurospeech-2001).

Chang, S., Shastri, L. and Greenberg, S. (2000) Automatic phonetic transcription of spontaneous speech (American English), Proceedings of the International Conference on Spoken Language Processing, Beijing.

Shastri, L., Chang, S. and Greenberg, S. (1999) Syllable segmentation using temporal flow model neural networks. Proceedings of the 14th International Congress of Phonetic Sciences, San Francisco.

Wester, M., Greenberg, S. and Chang, S. (2001) A Dutch treatment of an elitist approach to articulatory-acoustic feature classification. 7th European Conference on Speech Communication and Technology (Eurospeech-2001).

http://www.icsi.berkeley.edu/~steveng

Page 4: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Germane Publications

PERCEPTUAL BASES OF SPEECH INTELLIGIBILITY

Arai, T. and Greenberg, S. (1998) Speech intelligibility in the presence of cross-channel spectral asynchrony, IEEE International Conference on Acoustics, Speech and Signal Processing, Seattle, pp. 933-936.

Greenberg, S. and Arai, T. (1998) Speech intelligibility is highly tolerant of cross-channel spectral asynchrony. Proceedings of the Joint Meeting of the Acoustical Society of America and the International Congress on Acoustics, Seattle, pp. 2677-2678.

Greenberg, S. and Arai, T. (2001) The relation between speech intelligibility and the complex modulation spectrum. Submitted to the 7th European Conference on Speech Communication and Technology (Eurospeech-2001).

Greenberg, S., Arai, T. and Silipo, R. (1998) Speech intelligibility derived from exceedingly sparse spectral information, Proceedings of the International Conference on Spoken Language Processing, Sydney, pp. 74-77.

Silipo, R., Greenberg, S. and Arai, T. (1999) Temporal Constraints on Speech Intelligibility as Deduced from Exceedingly Sparse Spectral Representations, Proceedings of Eurospeech, Budapest.

AUDITORY-VISUAL SPEECH PROCESSING

Grant, K. and Greenberg, S. (2001) Speech intelligibility derived from asynchronous processing of auditory-visual information. Submitted to the ISCA Workshop on Audio-Visual Speech Processing (AVSP-2001).

PROSODIC STRESS ACCENT – AUTOMATIC CLASSIFICATION AND CHARACTERIZATION

Hitchcock, L. and Greenberg, S. (2001) Vowel height is intimately associated with stress-accent in spontaneous American English discourse. Submitted to the 7th European Conference on Speech Communication and Technology (Eurospeech-2001).

Silipo, R. and Greenberg, S. (1999) Automatic transcription of prosodic stress for spontaneous English discourse. Proceedings of the 14th International Congress of Phonetic Sciences, San Francisco.

Silipo, R. and Greenberg, S. (2000) Prosodic stress revisited: Reassessing the role of fundamental frequency. Proceedings of the NIST Speech Transcription Workshop, College Park, MD.

Silipo, R. and Greenberg, S. (2000) Automatic detection of prosodic stress in American English discourse. Technical Report 2000-1, International Computer Science Institute, Berkeley, CA.

http://www.icsi.berkeley.edu/~steveng

Page 5: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

PROLOGUE

The Central Challenge for Models of Speech Recognition

Page 6: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Language - The Traditional Perspective

The “classical” view of spoken language posits a quasi-arbitrary relation between the lower and higher tiers of linguistic organization

Page 7: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

The Serial Frame Perspective on Speech

• Traditional models of speech recognition assume that the identity of a phonetic segment depends on the detailed spectral profile of the acoustic signal for a given (usually 25-ms) frame of speech
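The frame-by-frame view described above can be sketched as follows. This is only an illustration of cutting a signal into 25-ms frames; the 10-ms hop between frames, the 16-kHz sample rate and the function name `frame_signal` are assumptions made for the example and do not come from the slides.

```python
def frame_signal(samples, sample_rate, frame_ms=25.0, hop_ms=10.0):
    """Split a sample sequence into fixed-length, overlapping frames.

    frame_ms follows the 25-ms figure on the slide; hop_ms is an assumed value.
    """
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    frames = []
    start = 0
    # Keep only complete frames; a trailing partial frame is dropped.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames


# One second of silence at an assumed 16-kHz sample rate.
one_second = [0.0] * 16000
frames = frame_signal(one_second, 16000)
print(len(frames), len(frames[0]))
```

A serial-frame recognizer would then compute a spectral profile for each frame independently, which is the assumption the following slides call into question.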

Page 8: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Language - A Syllable-Centric Perspective

A more empirical perspective of spoken language focuses on the syllable as the interface between “sound” and “meaning”

Within this framework the relationship between the syllable and the higher and lower tiers is non-arbitrary and statistically systematic

Page 9: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Lines of Evidence

Page 10: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Take Home Messages

• Segmentation is crucial for understanding spoken language
  – At the level of the phrase
  – the word
  – the syllable
  – the phonetic segment

• But … this linguistic segmentation is inherently “fuzzy”

• As is the spectral information associated with each linguistic tier

• The low-frequency (3-25 Hz) modulation spectrum is a crucial acoustic (and possibly visual) parameter associated with intelligibility
  – It provides segmentation information that unites the phonetic segment with the syllable (and possibly the word and beyond)

• Many properties of spontaneous spoken language differ from those of laboratory and citation speech
  – There are systematic patterns in “real” speech that potentially reveal underlying principles of linguistic organization

Page 11: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

The Central Importance of the Modulation Spectrum and the Syllable for

Understanding Spoken Language

Page 12: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Effects of Reverberation on the Speech Signal

Reflections from walls and other surfaces routinely modify the spectro-temporal structure of the speech signal under everyday conditions

Page 13: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Effects of Reverberation on the Speech Signal

Reflections from walls and other surfaces routinely modify the temporal and modulation spectral properties of the speech signal

The modulation spectrum’s peak is attenuated and shifted down to ca. 2 Hz

[based on an illustration by Hynek Hermansky]

Page 14: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Modulation Spectrum Computation
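The figure for this slide did not survive in the transcript. As a rough stand-in, the sketch below shows one common way to estimate a modulation spectrum for a single band: take the amplitude envelope, then measure how strongly it fluctuates at each modulation frequency. It is a simplified illustration and not necessarily the procedure used in the talk; the rectify-and-average envelope, the 100-Hz envelope rate, the test signal and all function names are assumptions, and no filterbank is applied.

```python
import cmath
import math


def envelope(samples, sample_rate, env_rate=100):
    """Amplitude envelope: mean absolute value over non-overlapping windows."""
    win = sample_rate // env_rate
    return [sum(abs(x) for x in samples[i:i + win]) / win
            for i in range(0, len(samples) - win + 1, win)]


def modulation_spectrum(env, env_rate, mod_freqs):
    """Magnitude of the mean-removed envelope at each modulation frequency (Hz)."""
    mean = sum(env) / len(env)
    centered = [e - mean for e in env]
    n = len(centered)
    spectrum = {}
    for f in mod_freqs:
        acc = sum(c * cmath.exp(-2j * math.pi * f * k / env_rate)
                  for k, c in enumerate(centered))
        spectrum[f] = abs(acc) / n
    return spectrum


def band_energy(spectrum, low_hz, high_hz):
    """Sum of squared magnitudes for modulation frequencies in [low_hz, high_hz]."""
    return sum(m * m for f, m in spectrum.items() if low_hz <= f <= high_hz)


# Synthetic test signal (an assumption): a 1-kHz tone amplitude-modulated at 4 Hz.
sample_rate = 8000
signal = [(1.0 + 0.8 * math.sin(2 * math.pi * 4 * t / sample_rate))
          * math.sin(2 * math.pi * 1000 * t / sample_rate)
          for t in range(2 * sample_rate)]

env = envelope(signal, sample_rate)
spec = modulation_spectrum(env, 100, range(1, 17))
peak_hz = max(spec, key=spec.get)
print(peak_hz, band_energy(spec, 3, 6))
```

The `band_energy` helper corresponds to the kind of 3-6 Hz measure referred to later in the spectral asynchrony slides.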

Page 15: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

The Modulation Spectrum Reflects Syllables

The peak in the distribution of syllable duration is close to the mean (200 ms)

The syllable duration distribution is very close to that of the modulation spectrum, suggesting that the modulation spectrum reflects syllables
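The link between syllable duration and modulation frequency is a reciprocal one: a unit that recurs every T seconds corresponds to a rate of 1/T Hz. A small worked example, with a hypothetical helper name:

```python
def duration_ms_to_rate_hz(duration_ms):
    """Convert a repetition period in milliseconds to a rate in hertz."""
    return 1000.0 / duration_ms


# A 200-ms syllable corresponds to a 5-Hz modulation rate.
print(duration_ms_to_rate_hz(200))

# The 3-25 Hz range quoted earlier spans periods from about 333 ms down to 40 ms.
print(round(1000.0 / 3), round(1000.0 / 25))
```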

Page 16: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

The Ability to Understand Speech Under Reverberant Conditions

(Spectral Asynchrony)

Page 17: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Spectral Asynchrony - Method

Output of quarter-octave frequency bands quasi-randomly time-shifted relative to a common reference. Maximum shift interval ranged between 40 and 240 ms (in 20-ms steps). Mean shift interval is half of the maximum interval. Adjacent channels separated by a minimum of one-quarter of the maximum shift range.

Stimuli – 40 TIMIT Sentences

“She washed his dark suit in greasy dish water all year”
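The shift-assignment rules above can be sketched as a small sampling routine. This is a guess at one way to satisfy the stated constraints, not the procedure actually used in the study: shifts are drawn uniformly between 0 and the maximum (so the mean is half the maximum only on average), and a draw is rejected if it falls within a quarter of the maximum of the previous channel's shift. The channel count of 19 and the function name are assumptions.

```python
import random


def draw_channel_shifts(max_shift_ms, num_channels, rng):
    """Draw one time shift per channel with a minimum gap between neighbors."""
    min_gap = max_shift_ms / 4.0
    shifts = []
    for _ in range(num_channels):
        while True:
            candidate = rng.uniform(0.0, max_shift_ms)
            # Accept the first channel unconditionally; afterwards require
            # adjacent channels to differ by at least a quarter of the maximum.
            if not shifts or abs(candidate - shifts[-1]) >= min_gap:
                shifts.append(candidate)
                break
    return shifts


rng = random.Random(0)  # fixed seed so the example is repeatable
shifts = draw_channel_shifts(160, 19, rng)
print([round(s) for s in shifts])
```

Each band's output would then be delayed by its assigned shift before the bands are summed back together.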

Page 18: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Spectral Asynchrony - Paradigm

The magnitude of energy in the 3-6 Hz region of the modulation spectrum is computed for each sub-band (of 4 or 7 channels) as a function of spectral asynchrony

The modulation spectrum magnitude is relatively unaffected by asynchronies of 80 ms or less (open symbols), but is appreciably diminished for asynchronies of 160 ms or more

Is intelligibility correlated with the reduction in the 3-6 Hz modulation spectrum?

Page 19: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Intelligibility and Spectral Asynchrony

Speech intelligibility does appear to be roughly correlated with the energy in the modulation spectrum between 3 and 6 Hz

The correlation varies depending on the sub-band and the degree of spectral asynchrony

Page 20: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Spectral Asynchrony - Summary

• Speech is capable of withstanding a high degree of temporal asynchrony across frequency channels

• This form of cross-spectral asynchrony is similar to the effects of many common forms of acoustic reverberation

• Speech intelligibility remains high (>75%) until the maximum asynchrony exceeds 140 ms

• The magnitude of the low-frequency (3-6 Hz) modulation spectrum is highly correlated with speech intelligibility

Page 21: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Understanding Spoken Language Under Very Sparse Spectral Conditions

Page 22: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

A Flaw in the Spectral Asynchrony Study

Of the 448 possible combinations of four slits across the spectrum (where one slit is present in each of the 4 sub-bands), ca. 10% (i.e. 45) exhibit a coefficient of variation less than 10%. Thus, the seeming temporal tolerance of the auditory system may be illusory (if listeners can decode the speech signal using information from only a small number of channels distributed across the spectrum)

[Figure panels: Distribution of channel asynchrony; Intelligibility of spectrally desynchronized speech]
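The counting argument above can be illustrated with a short enumeration. The sketch assumes sub-bands of 4, 4, 4 and 7 channels, which is one split consistent with the 448 figure (4 x 4 x 4 x 7 = 448) and with the "4 or 7 channel" sub-bands mentioned earlier; the per-channel shifts are random placeholders rather than the study's values, and the use of the population standard deviation is likewise an assumption.

```python
import itertools
import random
import statistics

SUB_BAND_SIZES = (4, 4, 4, 7)  # assumed split; 4 * 4 * 4 * 7 = 448


def coefficient_of_variation(values):
    """Standard deviation divided by the mean (population form)."""
    mean = statistics.mean(values)
    return statistics.pstdev(values) / mean if mean else float("inf")


def count_low_variation(sub_band_shifts, threshold=0.10):
    """Count channel combinations (one per sub-band) whose shifts nearly agree."""
    combos = list(itertools.product(*sub_band_shifts))
    low = [c for c in combos if coefficient_of_variation(c) < threshold]
    return len(combos), len(low)


# Placeholder shifts in milliseconds, one list per sub-band.
rng = random.Random(1)
placeholder_shifts = [[rng.uniform(0.0, 160.0) for _ in range(n)]
                      for n in SUB_BAND_SIZES]
total, low = count_low_variation(placeholder_shifts)
print(total, low)
```

A combination with a low coefficient of variation is one in which the four selected channels are shifted by nearly the same amount, so they remain effectively synchronized with one another.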

Page 23: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Spectral Slit Paradigm

Can listeners decode spoken sentences using just four narrow (1/3 octave) channels (“slits”) distributed across the spectrum?

The edge of each slit was separated from its nearest neighbor by an octave

The modulation pattern for each slit differs from that of the others

The four-slit compound waveform looks very similar to the full-band signal
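The slit geometry described above (1/3-octave slits, with an octave between the edges of neighboring slits) fixes the band edges once the lowest edge is chosen. The sketch below works that out; the 300-Hz starting edge is a hypothetical value chosen for illustration, and the function name is an assumption.

```python
def slit_edges(lowest_edge_hz, num_slits=4, width_octaves=1.0 / 3.0, gap_octaves=1.0):
    """Return (low, high) edges in Hz for slits of fixed octave width and spacing."""
    edges = []
    low = lowest_edge_hz
    for _ in range(num_slits):
        high = low * 2.0 ** width_octaves   # upper edge is 1/3 octave above the lower
        edges.append((low, high))
        low = high * 2.0 ** gap_octaves     # next slit starts one octave higher
    return edges


for low, high in slit_edges(300.0):
    print(round(low), round(high))
```

With this spacing, corresponding edges of successive slits lie 4/3 of an octave apart, a frequency ratio of roughly 2.52.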

Page 24: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Word Intelligibility - Single Slits

The intelligibility associated with any single slit is only 2 to 9%

The mid-frequency slits exhibit somewhat higher intelligibility than the lateral slits

Page 25: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Word Intelligibility - Road Map

1. Intelligibility as a function of the number of slits (from one to four)

Page 26: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Word Intelligibility - 1 Slit

Page 27: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Word Intelligibility - 2 Slits

Page 28: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Word Intelligibility - 3 Slits

Page 29: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Word Intelligibility - 4 Slits

Page 30: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Word Intelligibility - Road Map

2. Intelligibility for different combinations of two-slit compounds

The two center slits yield the highest intelligibility

Page 31: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Word Intelligibility - 2 Slits

Page 32: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Word Intelligibility - 2 Slits

Page 33: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Intelligibility - 2 Slits

Page 34: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Intelligibility - 2 Slits

Page 35: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Intelligibility - 2 Slits

Page 36: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Intelligibility - 2 Slits

Page 37: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Word Intelligibility - Road Map

3. Intelligibility for different combinations of three-slit compounds

Combinations with one or two center slits yield the highest intelligibility

Page 38: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Intelligibility - 3 Slits

Page 39: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Intelligibility - 3 Slits

Page 40: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Intelligibility - 3 Slits

Page 41: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Intelligibility - 3 Slits

Page 42: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Word Intelligibility - Road Map

4. Four slits yield nearly (but not quite) perfect intelligibility of ca. 90%

This maximum level of intelligibility makes it possible to deduce the specific contribution of each slit by itself and in combination with others

Page 43: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Intelligibility - 3 Slits

Page 44: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Spectral Slits - Summary

• A detailed spectro-temporal analysis of the speech signal is not required to understand spoken language

• An exceedingly sparse spectral representation can, under certain circumstances, yield nearly perfect intelligibility

Page 45: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Modulation Spectrum Across Frequency

The modulation spectrum varies in magnitude across frequency

The shape of the modulation spectrum is similar for the three lowest slits, but the highest frequency slit differs from the rest in exhibiting a far greater amount of energy in the mid modulation frequencies

Page 46: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Word Intelligibility - Single Slits

The intelligibility associated with any single slit ranges between 2 and 9%, suggesting that the shape and magnitude of the modulation spectrum per se is NOT the controlling variable for intelligibility

Page 47: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Spectral Slits - Summary

• A detailed spectro-temporal analysis of the speech signal is not required to understand spoken language

• An exceedingly sparse spectral representation can, under certain circumstances, yield nearly perfect intelligibility

• The magnitude component of the modulation spectrum does not appear to be the controlling variable for intelligibility

Page 48: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

The Effect of Desynchronizing Sparse Spectral Information on Speech

Intelligibility

Page 49: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Modulation Spectrum Across Frequency

Desynchronizing the slits by more than 25 ms results in a significant decline in intelligibility

Page 50: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Spectral Slits - Summary

• Even small amounts of asynchrony (>25 ms) imposed on spectral slits can result in significant degradation of intelligibility

• Asynchrony greater than 50 ms has a profound impact on intelligibility

Page 51: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Intelligibility and Slit Asynchrony

Page 52: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Spectral Slits - Summary

• A detailed spectro-temporal analysis of the speech signal is not required to understand spoken language

• An exceedingly sparse spectral representation can, under certain circumstances, yield nearly perfect intelligibility

• The magnitude component of the modulation spectrum does not appear to be the controlling variable for intelligibility

Page 53: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Spectral Slits - Summary

• Small amounts of asynchrony (>25 ms) imposed on spectral slits can result in significant degradation of intelligibility

• Asynchrony greater than 50 ms has a profound impact on intelligibility

• Intelligibility progressively declines with greater amounts of asynchrony up to an asymptote of ca. 250 ms

• Beyond asynchronies of 250 ms intelligibility IMPROVES, but the amount of improvement depends on individual factors

• Such results are NOT inconsistent with the high intelligibility of desynchronized full-spectrum speech, but rather imply that the auditory system is capable of extracting phonetically important information from a relatively small proportion of spectral channels

• BOTH the amplitude and phase components of the modulation spectrum are extremely important for speech intelligibility

• The modulation phase is of particular importance for cross-spectral integration of phonetic information

Page 54: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Speech Intelligibility Derived from Asynchronous Presentation of

Auditory and Visual Information

Page 55: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Auditory-Visual Integration of Speech

• Video of spoken (Harvard/IEEE) sentences, presented in tandem with sparse spectral representation (low- and high-frequency slits)

Page 56: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Auditory-Visual Integration - Mean Intelligibility

9 Subjects

• When the AUDIO signal LEADS the VIDEO, there is a progressive decline in intelligibility, similar to that observed for audio-alone signals

• When the VIDEO signal LEADS the AUDIO, intelligibility is preserved for asynchrony intervals as large as 200 ms

Page 57: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Auditory-Visual Integration - by Individual Ss

Variation across subjects

Video lagging often better than synchronous

Page 58: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Audio-Video Integration – Summary

• Sparse audio and speech-reading information, when presented alone, provide minimal intelligibility

• But can, when combined, provide good intelligibility

• When the audio signal leads the video, intelligibility falls off rapidly as a function of onset asynchrony

• When the video signal leads the audio, intelligibility is maintained for asynchronies as long as 200 ms

• The dynamics of the video appear to be combined with the dynamics associated with the audio to provide good intelligibility

• The dynamics associated with the video signal are probably most closely associated with place of articulation information

• The implication is that place information has a long time constant of ca. 200 ms and appears linked to the syllable

Page 59: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Perceptual Evidence for the Spectral Origin of

Articulatory-Acoustic Features

Page 60: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Spectral Slit Paradigm

• Signals were CV and VC Nonsense Syllables (from CUNY)

Page 61: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Consonant Recognition - Single Slits

Page 62: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Consonant Recognition - 1 Slit

Page 63: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Consonant Recognition - 2 Slits

Page 64: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Consonant Recognition - 3 Slits

Page 65: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Consonant Recognition - 4 Slits

Page 66: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Consonant Recognition - 5 Slits

Page 67: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Consonant Recognition - 2 Slits

Page 68: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Consonant Recognition - 2 Slits

Page 69: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Consonant Recognition - 2 Slits

Page 70: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Consonant Recognition - 2 Slits

Page 71: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Consonant Recognition - 2 Slits

Page 72: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Consonant Recognition - 2 Slits

Page 73: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Consonant Recognition - 3 Slits

Page 74: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Consonant Recognition - 3 Slits

Page 75: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Consonant Recognition - 3 Slits

Page 76: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Consonant Recognition - 4 Slits

Page 77: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Consonant Recognition - 5 Slits

Page 78: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Articulatory-Feature Analysis

• The consonant recognition results can be scored in terms of articulatory features correct

• When the accuracy of the features is scored relative to the accuracy of consonant recognition, an interesting pattern emerges

• Certain features (place and manner) appear to be highly correlated with consonant recognition performance

• While the voicing and rounding features are less highly correlated
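Scoring consonant responses by articulatory feature, as described above, can be sketched like this. The feature table covers only ten English consonants with their standard place, manner and voicing values; the rounding feature mentioned on the slide is left out, and the stimulus/response pairs are made-up toy values, not results from the study. The function and variable names are assumptions.

```python
# (place, manner, voicing) for a small, illustrative consonant set.
FEATURES = {
    "p": ("bilabial", "stop", "voiceless"),
    "b": ("bilabial", "stop", "voiced"),
    "t": ("alveolar", "stop", "voiceless"),
    "d": ("alveolar", "stop", "voiced"),
    "k": ("velar", "stop", "voiceless"),
    "g": ("velar", "stop", "voiced"),
    "s": ("alveolar", "fricative", "voiceless"),
    "z": ("alveolar", "fricative", "voiced"),
    "m": ("bilabial", "nasal", "voiced"),
    "n": ("alveolar", "nasal", "voiced"),
}
FEATURE_NAMES = ("place", "manner", "voicing")


def score_responses(pairs):
    """Proportion correct for whole consonants and for each feature.

    pairs is a list of (stimulus, response) consonant labels.
    """
    n = len(pairs)
    scores = {"consonant": sum(s == r for s, r in pairs) / n}
    for i, name in enumerate(FEATURE_NAMES):
        scores[name] = sum(FEATURES[s][i] == FEATURES[r][i] for s, r in pairs) / n
    return scores


# Toy (stimulus, response) pairs, for illustration only.
toy_pairs = [("p", "p"), ("p", "b"), ("t", "k"), ("m", "n")]
scores = score_responses(toy_pairs)
print(scores)
```

A response can be wrong as a consonant yet still correct on some features, which is what allows feature accuracy to be compared against consonant accuracy across listening conditions.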

Page 79: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Correlation - AFs/Consonant Recognition

Consonant recognition is almost perfectly correlated with place of articulation performance

This correlation suggests that the place feature is based on cues distributed across the entire speech spectrum, in contrast to features such as voicing and rounding, which appear to be extracted from a narrower band of the spectrum

Manner is also highly correlated with consonant recognition, implying that this feature is extracted from a fairly broad portion of the spectrum

Page 80: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Phonetic Transcription of Spontaneous (American) English

Page 81: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Phonetic Transcription of Spontaneous English

• Telephone dialogues of 5-10 minutes duration - SWITCHBOARD

• Amount of material manually transcribed
  – 4 hours labeled at the phone level and segmented at the syllabic level (this material was later phonetically segmented by automatic methods)
  – 1 hour labeled and segmented at the phonetic-segment level

• Diversity of material transcribed
  – Spans speech of both genders (ca. 50/50%) reflecting a wide range of American dialectal variation (6 regions + “army brat”), speaking rate and voice quality

• Transcribed by whom?
  – 11 undergraduates and 1 graduate student, all enrolled at UC-Berkeley. Most of the corpus was transcribed by four individuals out of the twelve
  – Supervised by Steven Greenberg and John Ohala

• Transcription system
  – A variant of Arpabet, with phonetic diacritics such as: _gl, _cr, _fr, _n, _vl, _vd

• How long does transcription take? (Don’t ask!)
  – 388 times real time for labeling and segmentation at the phonetic-segment level
  – 150 times real time for labeling phonetic segments and segmenting syllables

• How was labeling and segmentation performed?
  – Using a display of the signal waveform, spectrogram, word transcription and “forced alignments” (estimates of phones and boundaries) + audio (listening at multiple time scales - phone, word, utterance) on Sun workstations

• Data available at - http://www.icsi.berkeley.edu/real/stp
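To put the real-time factors above in perspective, the arithmetic can be worked through directly. The sketch assumes that each factor applies uniformly to the corresponding amount of material listed on the slide (1 hour at 388 times real time, 4 hours at 150 times real time); the helper name is hypothetical.

```python
def transcription_hours(audio_hours, real_time_factor):
    """Labor time implied by a real-time factor for a given amount of audio."""
    return audio_hours * real_time_factor


phone_level = transcription_hours(1, 388)      # labeled and segmented at the phonetic-segment level
syllable_level = transcription_hours(4, 150)   # phone labels plus syllable segmentation
print(phone_level, syllable_level, phone_level + syllable_level)
```

Under that assumption the five hours of speech represent on the order of a thousand hours of transcriber effort.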

Page 82: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

Phonetic Transcription

What a “typical” computer screen shot of the speech material looks like to a transcriber

Page 83: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

A Brief Tour of Pronunciation Variation in Spontaneous American English

Page 84: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

How Many Pronunciations of “and”?

N   Pronunciation     N   Pronunciation
82  ae n              3   eh
63  eh n              2   ae n dcl
45  ix n              2   ae
35  ax n              2   ax m
34  en                2   ax n d
30  n                 2   ae eh n dcl d
20  ae n dcl d        2   eh n dcl d
17  ih n              2   ax nx
17  q ae n            2   q ae ae n
11  ae n d            2   q ix n
7   q eh n            2   ix n dcl d
7   ae nx             2   ih
6   ae ae n           2   eh eh n
6   ah n              2   q eh nx
5   eh nx             2   ix d n
4   uh n              1   eh m
4   ix nx             1   ax n dcl d
4   q ae n dcl d      1   aw n
3   eh n d            1   ae q
3   q ae nx           1   eh dcl

Page 85: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

How Many Pronunciations of “and”?

N   Pronunciation     N   Pronunciation
1   ah nx             1   m
1   ae n t            1   ae ae n d
1   eh d              1   nx
1   ah n dcl d        1   q ae ae n
1   ey ih n dcl       1   q ae ae n dcl d
1   ae ix n           1   q ae eh n dcl d
1   ae nx ax          1   q ae ih n
1   ax ng             1   aa n
1   ay n              1   q ae n d
1   ih ah n d         1   ? nx
1   ae hh             1   q ae n q
1   ih ng             1   eh n m
1   ix                1   q eh en dcl
1   ae n d dcl        1   eh ng
1   ix dcl d          1   q eh n q
1   ae eh n           1   em
1   hh n              1   q eh ow m
1   ix n t            1   q ih n
1   ae ax n dcl d     1   q ix en
1   iy eh n           1   er

Page 86: What are the Essential Cues for Understanding Spoken Language? Steven Greenberg

1   I 6 4 9   5 3   5 3   a y

2   a n d 5 2 1   8 7   1 6   a e n

3   th e 4 7 5    7 6   2 7   d h a x

4   y o u 4 0 6   6 8   2 0   y ix

5   th a t 3 2 8   1 1 7   1 1   d h a e

6   a 3 1 9   2 8   6 4   a x

7   to 2 8 8   6 6   1 4   tc l t u w

8   k n o w 2 4 9   3 4   5 6   n o w

9   o f 2 4 2   4 4   2 1   a x v

1 0   it 2 4 0   4 9   2 2   ih

1 1   y e a h 2 0 3   4 8   4 3   y a e

1 2   in 1 7 8   2 2   4 5   ih n

1 3   th e y 1 5 2   2 8   6 0   d h e y

1 4   d o 1 3 1   3 0   5 4   d c l d u w

1 5   s o 1 3 0   1 4   7 4   s o w

1 6   b u t 1 2 3   4 5   1 2   b c l b a h tc l t

1 7   is 1 2 0   2 4   5 0   ih z

1 8   lik e 1 1 9   1 9   4 6   l a y k c l k

1 9   h a v e 1 1 6   2 2   5 4   h h a e v

2 0   w a s 1 1 1   2 4   2 3   w a h z

2 1   w e 1 0 8   1 3   8 3   w iy

2 2   it's 1 0 1   1 4   2 0   ih tc l s

2 3   ju s t 1 0 1   3 4   1 7   jh ix s

2 4   o n 9 8   1 8   4 9   a a n

2 5   o r 9 4   2 3   3 6   e r

2 6   n o t 9 2   2 4   2 4   m a a q

2 7   th in k 9 2   2 3   3 2   th ih n g k c l k

2 8   fo r 8 7   1 9   4 6   f e r

2 9   w e ll 8 4   4 9   2 3   w e h l

3 0   w h a t 8 2   4 0   1 4   w a h d x

3 1   a b o u t 7 7   4 6   1 2   a x b c l b a w

3 2   a ll 7 4   2 7   2 4   a o l

3 3   th a t's 7 4   1 9   1 6   d h e h s

3 4   o h 7 4   1 7   6 1   o w

3 5   re a lly 7 1   2 5   4 5   r ih l iy

3 6   o n e 6 9   8   7 8   w a h n

3 7   a re 6 8   1 9   4 2   e r

3 8   I'm 6 7 9   2 6   q a a m

39  right  61  21  28  r ay
40  uh  60  16  41  ah
41  them  60  18  23  ax m
42  at  59  36  8  ae dx
43  there  58  28  22  dh eh r
44  my  58  9  66  m ay
45  mean  56  10  58  m iy n
46  don't  56  21  14  dx ow
47  no  55  8  77  n ow
48  with  55  20  35  w ih th
49  if  55  18  41  ih f
50  when  54  18  31  w eh n
51  can  54  28  15  kcl k ae n
52  then  51  19  38  dh eh n
53  be  50  11  76  bcl b iy
54  as  49  16  18  ae z
55  out  47  19  22  ae dx
56  kind  47  17  21  kcl k ax nx
57  because  46  31  15  kcl k ax z
58  people  45  21  44  pcl p iy pcl l el
59  go  45  5  83  gcl g ow
60  got  45  32  15  gcl g aa
61  this  44  11  47  dh ih s
62  some  43  4  48  s ah m
63  would  41  16  29  w ih dcl
64  things  41  15  52  th ih ng z
65  now  39  11  69  n aw
66  lot  39  9  47  l aa dx
67  had  39  19  24  hh ae dcl
68  how  39  11  53  hh aw
69  good  38  13  27  gcl g uh dcl
70  get  38  20  13  gcl g eh dx
71  see  37  6  80  s iy
72  from  36  10  28  f r ah m
73  he  36  7  39  iy
74  me  35  5  87  m iy
75  don't  35  21  14  dx ow
76  their  33  19  25  dh eh r
77  more  32  11  56  m ao r
78  it's  31  14  20  ih tcl s
79  that's  31  20  16  dh eh s
80  too  31  6  60  tcl t uw
81  okay  31  17  45  ow kcl k ey
82  very  30  11  36  v eh r iy
83  up  30  11  34  ah pcl p
84  been  30  11  51  bcl b ih n
85  guess  29  8  42  gcl g eh s
86  time  29  8  62  tcl t ay m
87  going  29  21  13  gcl g ow ih ng
88  into  28  20  14  ih n tcl t uw
89  those  27  12  42  dh ow z
90  here  27  11  25  hh iy er
91  did  27  13  23  dcl d ih dx
92  work  25  8  66  w er kcl k
93  other  25  14  26  ah dh er
94  an  25  12  28  ax n
95  I've  25  7  46  ay v
96  thing  24  9  52  th ih ng
97  even  24  7  40  iy v ix n
98  our  23  9  33  aa r
99  any  23  11  23  ix n iy
100  we're  23  8  25  w ey r

How Many Different Pronunciations?

Rank  Word  N  #Pron  MCP %Total  Most Common Pronunciation (MCP)
1  I  649  53  53  ay
2  and  521  87  16  ae n
3  the  475  76  27  dh ax
4  you  406  68  20  y ix
5  that  328  117  11  dh ae
6  a  319  28  64  ax
7  to  288  66  14  tcl t uw
8  know  249  34  56  n ow
9  of  242  44  21  ax v
10  it  240  49  22  ih
11  yeah  203  48  43  y ae
12  in  178  22  45  ih n
13  they  152  28  60  dh ey
14  do  131  30  54  dcl d uw
15  so  130  14  74  s ow
16  but  123  45  12  bcl b ah tcl t
17  is  120  24  50  ih z
18  like  119  19  46  l ay kcl k
19  have  116  22  54  hh ae v
20  was  111  24  23  w ah z

The 20 most frequent words account for 35% of the tokens
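The table columns (N, #Pron, most common pronunciation and its share of the tokens) can be derived from any phonetically transcribed corpus. A minimal sketch, assuming the corpus is available as (word, pronunciation) pairs; the function name and the toy tokens are illustrative and are not taken from the talk:

```python
from collections import Counter, defaultdict


def pronunciation_stats(tokens):
    """tokens: iterable of (word, pronunciation) pairs, one per spoken token.

    Returns {word: (n, n_pron, mcp, mcp_percent)}, mirroring the table columns.
    """
    by_word = defaultdict(Counter)
    for word, pron in tokens:
        by_word[word][pron] += 1

    stats = {}
    for word, prons in by_word.items():
        n = sum(prons.values())                    # N: number of tokens
        mcp, mcp_count = prons.most_common(1)[0]   # most common pronunciation
        stats[word] = (n, len(prons), mcp, round(100 * mcp_count / n))
    return stats


# Toy data, invented for illustration only
toy_tokens = [
    ("and", "ae n"), ("and", "ae n"), ("and", "ix n"), ("and", "ae n dcl d"),
    ("go", "gcl g ow"), ("go", "gcl g ow"),
]
toy_stats = pronunciation_stats(toy_tokens)
```

A word with many variants and a low MCP share (like "and" in the table) is one whose canonical dictionary form rarely matches what is actually said.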

How Many Different Pronunciations?

Rank  Word  N  #Pron  MCP %Total  Most Common Pronunciation (MCP)
21  we  108  13  83  w iy
22  it's  101  14  20  ih tcl s
23  just  101  34  17  jh ix s
24  on  98  18  49  aa n
25  or  94  23  36  er
26  not  92  24  24  m aa q
27  think  92  23  32  th ih ng kcl k
28  for  87  19  46  f er
29  well  84  49  23  w eh l
30  what  82  40  14  w ah dx
31  about  77  46  12  ax bcl b aw
32  all  74  27  24  ao l
33  that's  74  19  16  dh eh s
34  oh  74  17  61  ow
35  really  71  25  45  r ih l iy
36  one  69  8  78  w ah n
37  are  68  19  42  er
38  I'm  67  9  26  q aa m
39  right  61  21  28  r ay
40  uh  60  16  41  ah

The 40 most frequent words account for 45% of the tokens

How Many Different Pronunciations?

Rank  Word  N  #Pron  MCP %Total  Most Common Pronunciation (MCP)
41  them  60  18  23  ax m
42  at  59  36  8  ae dx
43  there  58  28  22  dh eh r
44  my  58  9  66  m ay
45  mean  56  10  58  m iy n
46  don't  56  21  14  dx ow
47  no  55  8  77  n ow
48  with  55  20  35  w ih th
49  if  55  18  41  ih f
50  when  54  18  31  w eh n
51  can  54  28  15  kcl k ae n
52  then  51  19  38  dh eh n
53  be  50  11  76  bcl b iy
54  as  49  16  18  ae z
55  out  47  19  22  ae dx
56  kind  47  17  21  kcl k ax nx
57  because  46  31  15  kcl k ax z
58  people  45  21  44  pcl p iy pcl l el
59  go  45  5  83  gcl g ow
60  got  45  32  15  gcl g aa

The 60 most frequent words account for 55% of the tokens

How Many Different Pronunciations?

Rank  Word  N  #Pron  MCP %Total  Most Common Pronunciation (MCP)
61  this  44  11  47  dh ih s
62  some  43  4  48  s ah m
63  would  41  16  29  w ih dcl
64  things  41  15  52  th ih ng z
65  now  39  11  69  n aw
66  lot  39  9  47  l aa dx
67  had  39  19  24  hh ae dcl
68  how  39  11  53  hh aw
69  good  38  13  27  gcl g uh dcl
70  get  38  20  13  gcl g eh dx
71  see  37  6  80  s iy
72  from  36  10  28  f r ah m
73  he  36  7  39  iy
74  me  35  5  87  m iy
75  don't  35  21  14  dx ow
76  their  33  19  25  dh eh r
77  more  32  11  56  m ao r
78  it's  31  14  20  ih tcl s
79  that's  31  20  16  dh eh s
80  too  31  6  60  tcl t uw

The 80 most frequent words account for 62% of the tokens

How Many Different Pronunciations?

Rank  Word  N  #Pron  MCP %Total  Most Common Pronunciation (MCP)
81  okay  31  17  45  ow kcl k ey
82  very  30  11  36  v eh r iy
83  up  30  11  34  ah pcl p
84  been  30  11  51  bcl b ih n
85  guess  29  8  42  gcl g eh s
86  time  29  8  62  tcl t ay m
87  going  29  21  13  gcl g ow ih ng
88  into  28  20  14  ih n tcl t uw
89  those  27  12  42  dh ow z
90  here  27  11  25  hh iy er
91  did  27  13  23  dcl d ih dx
92  work  25  8  66  w er kcl k
93  other  25  14  26  ah dh er
94  an  25  12  28  ax n
95  I've  25  7  46  ay v
96  thing  24  9  52  th ih ng
97  even  24  7  40  iy v ix n
98  our  23  9  33  aa r
99  any  23  11  23  ix n iy
100  we're  23  8  25  w ey r

The 100 most frequent words account for 67% of the tokens
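The coverage figures (35%, 45%, 55%, 62%, 67%) are cumulative shares of all word tokens. A small sketch of that calculation, assuming only a flat list of word tokens; the function name and the toy word list are made up for illustration:

```python
from collections import Counter


def cumulative_coverage(word_tokens, top_n):
    """Percent of all tokens covered by the top_n most frequent word types."""
    counts = Counter(word_tokens)
    total = sum(counts.values())
    covered = sum(count for _, count in counts.most_common(top_n))
    return 100.0 * covered / total


# Toy token stream (12 tokens, 6 word types), invented for illustration
toy_words = "i and the i and i you know i and the of".split()
top3_coverage = cumulative_coverage(toy_words, 3)
```

Running this over a real transcript for top_n = 20, 40, ..., 100 would give a coverage curve of the kind summarized on these slides.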

English Syllable Structure is (sort of) Like Japanese

87% of the pronunciations are simple syllabic forms

84% of the canonical corpus is composed of simple syllabic forms

n = 103,054

Most syllables are simple in form (no consonant clusters)

C = Consonant
V = Vowel

Examples
CV – “go”
CVC – “cat”
VC – “of”
V – “a”

Corpus = “Canonical” representation
Pronunciation = Actual pronunciation

Coda consonants tend to “drop”
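The simple/complex distinction can be made mechanical by reducing each syllable to its consonant-vowel skeleton. A sketch under two assumptions that are not from the talk: the vowel inventory below is a partial ARPABET-style list, and each consonant is written as a single symbol (no separate closure symbols such as "kcl"):

```python
# Assumed (partial) vowel inventory; everything else is treated as a consonant.
VOWELS = {"iy", "ih", "ix", "ey", "eh", "ae", "aa", "ao", "ah", "ax",
          "ow", "uh", "uw", "ay", "aw", "oy", "er", "el"}

# The four simple forms listed on the slide (no consonant clusters).
SIMPLE_FORMS = {"CV", "CVC", "VC", "V"}


def cv_pattern(phones):
    """Map a syllable's phone list to a skeleton such as 'CVC'."""
    return "".join("V" if p in VOWELS else "C" for p in phones)


def is_simple(phones):
    """True if the syllable has one of the simple (cluster-free) forms."""
    return cv_pattern(phones) in SIMPLE_FORMS


cat = ["k", "ae", "t"]                        # CVC
strength = ["s", "t", "r", "eh", "ng", "th"]  # CCCVCC
```

Counting `is_simple` over canonical forms and over actual pronunciations separately is one way to reproduce the two percentages contrasted above.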

Complex Syllables ARE Important (Though)

There are many “complex” syllable forms (consonant clusters), but all occur relatively infrequently

Thus, despite English’s reputation for complex syllabic forms, only ca. 15% of the syllable tokens are actually complex

n = 17,760

C = Consonant
V = Vowel

Examples
CVCC – “fifth”
VCC – “ounce”
CCV – “stow”
CCVC – “stoop”
CCVCC – “stops”
CCCVCC – “strength”

Complex syllables tend to be part of noun phrases (nouns or adjectives)

Coda consonants tend to “drop”

Syllable-Centric Pronunciation Patterns

“Cat” [k ae t]: [k] = onset, [ae] = nucleus, [t] = coda

Onsets are pronounced canonically far more often than nuclei or codas

Codas tend to be pronounced canonically more frequently in formal speech than in spontaneous dialogues

[Figure: Percent Canonically Pronounced by Syllable Position, for spontaneous speech and read sentences; n = 120,814]
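The "percent canonically pronounced" measure compares each syllable constituent with its dictionary form. A minimal sketch, assuming syllables are already split into onset, nucleus and coda and aligned with their canonical forms; the data layout and the three toy syllables are hypothetical:

```python
def percent_canonical(syllables):
    """syllables: list of (canonical, realized) pairs. Each member is a dict
    with 'onset', 'nucleus' and 'coda' keys holding a phone string ('' if absent).

    Only syllables whose canonical form has the constituent are counted.
    """
    result = {}
    for position in ("onset", "nucleus", "coda"):
        relevant = [(c, r) for c, r in syllables if c[position]]
        matches = sum(1 for c, r in relevant if c[position] == r[position])
        result[position] = 100.0 * matches / len(relevant) if relevant else None
    return result


# Toy data, invented for illustration: "cat" with a dropped coda,
# "and" with a reduced vowel and simplified coda, "go" fully canonical.
toy_syllables = [
    ({"onset": "k", "nucleus": "ae", "coda": "t"},
     {"onset": "k", "nucleus": "ae", "coda": ""}),
    ({"onset": "", "nucleus": "ae", "coda": "n d"},
     {"onset": "", "nucleus": "ix", "coda": "n"}),
    ({"onset": "g", "nucleus": "ow", "coda": ""},
     {"onset": "g", "nucleus": "ow", "coda": ""}),
]
toy_result = percent_canonical(toy_syllables)
```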

Complex Onsets are Highly Canonical

COMPLEX onsets contain TWO or MORE consonants

Complex onsets are pronounced more canonically than simple onsets despite the greater potential for deviation from the standard pronunciation

[Figure: Percent Canonically Pronounced (axis range 70 to 100) by Syllable Onset Type, Simple (C) vs. Complex (CC(C)), for STP (spontaneous speech) and TIMIT (read sentences)]

Speaking Style Affects Syllable Codas

Codas are much more likely to be realized canonically in formal than in spontaneous speech

COMPLEX codas contain TWO or MORE consonants

STP – Spontaneous phone dialogues
TIMIT – Read sentences

[Figure: Percent Canonically Pronounced by Syllable Coda Type]

Onsets (but not Codas) Affect Nuclei

The presence of a syllable onset has a substantial impact on the realization of the nucleus

STP – Spontaneous phone dialogues
TIMIT – Read sentences

[Figure: Percent Canonically Pronounced (axis range 50 to 70) for All Nuclei, With Onset, Without Onset, With Coda and Without Coda, in STP and TIMIT]

Syllable-Centric Articulatory Feature Analysis

• Place of articulation deviates most in nucleus position
• Manner of articulation deviates most in onset and coda position
• Voicing deviates most in coda position

Phonetic deviation along a SINGLE feature

Place deviates very little from canonical form in the onset and coda. It is a STABLE articulatory feature (AF) in these positions

Place is VERY unstable in nucleus position
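Feature-level deviation can be tallied by comparing the articulatory features of the canonical and the realized segment. The sketch below uses a tiny, hypothetical feature table (six consonants, three features); it is not the feature set used in the talk and only illustrates the bookkeeping:

```python
# Hypothetical, heavily simplified feature table for illustration only.
FEATURES = {
    "t":  {"place": "alveolar", "manner": "stop",      "voicing": "voiceless"},
    "d":  {"place": "alveolar", "manner": "stop",      "voicing": "voiced"},
    "dx": {"place": "alveolar", "manner": "flap",      "voicing": "voiced"},
    "s":  {"place": "alveolar", "manner": "fricative", "voicing": "voiceless"},
    "z":  {"place": "alveolar", "manner": "fricative", "voicing": "voiced"},
    "k":  {"place": "velar",    "manner": "stop",      "voicing": "voiceless"},
}


def deviating_features(canonical, realized):
    """List the features on which the realized segment departs from canonical."""
    return [name for name in ("place", "manner", "voicing")
            if FEATURES[canonical][name] != FEATURES[realized][name]]


single = deviating_features("s", "z")    # deviation along a single feature
several = deviating_features("t", "dx")  # deviation across several features
```

Grouping such lists by syllable position (onset, nucleus, coda) gives the single-feature and multi-feature deviation counts discussed on these slides.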

Articulatory PLACE Feature Analysis

• Place of articulation is a “dominant” feature in nucleus position only
• Drives the feature deviation in the nucleus for manner and rounding

Phonetic deviation across SEVERAL features

Place “carries” manner and rounding in the nucleus

Articulatory MANNER Feature Analysis

• Manner of articulation is a “dominant” feature in onset and coda position
• Drives the feature deviation in onsets and codas for place and voicing

Phonetic deviation across SEVERAL features

Manner is less stable in the coda than in the onset

Manner drives place and voicing deviations in the onset and coda

Articulatory VOICING Feature Analysis

• Voicing is a subordinate feature in all syllable positions
• Its deviation pattern is controlled by manner in onset and coda positions

Phonetic deviation across SEVERAL features

Voicing is unstable in coda position and is dominated by manner

The Intimate Relation Between Stress Accent and Vocalic Identity (especially height)

What is (usually) Meant by Prosodic Stress?

• Prosody is supposed to pertain to extra-phonetic cues in the acoustic signal
• The pattern of variation over a sequence of SYLLABLES pertaining to: syllabic DURATION, AMPLITUDE and PITCH (f0) variation over time (but the plot thickens, as we shall see)

OGI Stories - Pitch Doesn’t Cut the Mustard

• Although pitch range is the most important of the f0-related cues, it is not as good a predictor of stress as DURATION

[Figure legend: Duration, Amplitude, Pitch Range, Av. Pitch]

Total Energy is the Best Predictor of Stress

• Duration x Amplitude is superior to all other pairwise combinations of acoustic parameters. Pitch appears redundant with duration.

[Figure legend: Duration x Amplitude, Dur x Pitch Range, Duration, Dur x Pitch Av, Pitch Range x Average, Pitch Av x Amp, Pitch Range x Amp]
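The "total energy" predictor is simply the product of nucleus duration and amplitude. The slides do not state how predictors were scored, so the sketch below swaps in a simple rank-based separability measure (the probability that a stressed nucleus outscores an unstressed one) as an assumption; the nuclei are toy values, not corpus data:

```python
def separability(stressed_scores, unstressed_scores):
    """Probability that a random stressed nucleus scores higher than a random
    unstressed one (ties count half). 1.0 = perfect separation, 0.5 = chance."""
    wins = 0.0
    for s in stressed_scores:
        for u in unstressed_scores:
            if s > u:
                wins += 1.0
            elif s == u:
                wins += 0.5
    return wins / (len(stressed_scores) * len(unstressed_scores))


def score_predictor(nuclei, predictor):
    """nuclei: list of (duration_ms, amplitude, is_stressed) tuples."""
    stressed = [predictor(d, a) for d, a, s in nuclei if s]
    unstressed = [predictor(d, a) for d, a, s in nuclei if not s]
    return separability(stressed, unstressed)


# Toy nuclei, invented for illustration only
toy_nuclei = [
    (120, 0.8, True), (90, 0.9, True), (150, 0.6, True),
    (100, 0.3, False), (60, 0.5, False), (80, 0.4, False),
]
duration_only = score_predictor(toy_nuclei, lambda d, a: d)
integrated_energy = score_predictor(toy_nuclei, lambda d, a: d * a)
```

In this toy set the product separates the two classes completely while duration alone does not, which is the qualitative pattern the slide reports for the real data.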

The Nitty Gritty (a.k.a. the Corpus Material)

• SWITCHBOARD PHONETIC TRANSCRIPTION CORPUS
– Switchboard contains informal telephone dialogues
– 54 minutes of material that had previously been phonetically transcribed (by highly trained phonetics students from UC-Berkeley)
– 45.5 minutes of “pure” speech (filled pauses, junctures filtered out), consisting of: 9,991 words, 13,446 syllables, 33,370 phonetic segments
– All of this material had been hand-segmented at either the phonetic-segment or syllabic level by the transcribers
– The syllabic-segmented material was subsequently segmented at the phonetic-segment level by a special-purpose neural network trained on 72 minutes of hand-segmented Switchboard material. This automatic segmentation was manually verified

Manual Transcription of Stress Accent

• 2 UC-Berkeley Linguistics students each transcribed the full 45 minutes of material (i.e., there is 100% overlap between the 2)
• Three levels of stress-accent were marked for each syllabic nucleus
– Fully stressed (78% concordance between transcribers)
– Completely unstressed (85% interlabeler agreement)
– An intermediate level of accent (neither fully stressed nor completely unstressed; ca. 60% concordance)
– Hence, 95% concordance in terms of some level of stress
• The labels of the two transcribers were averaged
– In those instances where there was disagreement, the magnitude of disparity was almost always (ca. 90%) one step. Usually, disagreement signaled a genuine ambiguity in stress accent
• The illustrations in this presentation are based solely on those data in which both transcribers concurred (i.e., fully stressed or completely unstressed)
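The labeling procedure (three levels, two transcribers, averaged labels, and a unanimous subset used for the plots) can be sketched as follows. The numeric coding of the levels (1.0, 0.5, 0.0) and the toy label sequences are assumptions made for illustration:

```python
# Assumed numeric coding of the three stress-accent levels
FULL, INTERMEDIATE, NONE = 1.0, 0.5, 0.0


def combine_transcribers(labels_a, labels_b):
    """Return (averaged labels, overall agreement rate, indices of nuclei on
    which both transcribers chose the same extreme label)."""
    pairs = list(zip(labels_a, labels_b))
    averaged = [(a + b) / 2 for a, b in pairs]
    agreement = sum(a == b for a, b in pairs) / len(pairs)
    unanimous = [i for i, (a, b) in enumerate(pairs)
                 if a == b and a in (FULL, NONE)]
    return averaged, agreement, unanimous


# Toy labels for five syllabic nuclei, invented for illustration
labels_a = [FULL, NONE, INTERMEDIATE, FULL, NONE]
labels_b = [FULL, NONE, FULL, INTERMEDIATE, NONE]
averaged, agreement, unanimous = combine_transcribers(labels_a, labels_b)
```

Here both disagreements are one step apart, matching the slide's observation that disparities were almost always a single step.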

A Brief Primer on Vocalic Acoustics

• Vowel quality is generally thought to be a function primarily of two articulatory properties, both related to the motion of the tongue
– The front-back plane is most closely associated with the second formant frequency (or more precisely F2 - F1) and the volume of the front-cavity resonance
– The height parameter is closely linked to the frequency of F1
• In the classic vowel “triangle” segments are positioned in terms of the tongue positions associated with their production, as follows:

[Figure: vowel triangle]

Durational Differences - Stressed/Unstressed

• There is a large dynamic range in duration between stressed and unstressed nuclei
• Diphthongs and tense, low monophthongs tend to have a larger range than the lax monophthongs

Spatial Patterning of Duration and Amplitude

• Let’s return to the vowel triangle and see if it can shed light on certain patterns in the vocalic data
• The duration, amplitude and their product (integrated energy) will be plotted on a 2-D grid, where the x-axis will always be in terms of hypothetical front-back tongue position (and hence remain constant throughout the plots to follow)
• The y-axis will serve as the dependent measure, sometimes expressed in terms of duration, or amplitude, or their product

Duration - Monophthongs vs. Diphthongs

[Figure: duration for all nuclei, with panels for Diphthongs and Monophthongs]

Duration - Monophthongs vs. Diphthongs

[Figure: duration for Stressed and Unstressed nuclei, with panels for Diphthongs and Monophthongs]

Proportion of Stress Accent and Vowel Height

Take Home Messages

• The vowel system of English (and perhaps other languages as well) needs to be re-thought in light of the intimate relationship between vocalic identity, nucleic duration and stress accent
• Stressed syllables tend to have significantly longer nuclei than their unstressed counterparts, consistent with the findings reported by Silipo and Greenberg in previous years’ meetings regarding the OGI Stories corpus (telephone monologues)
• Certain vocalic classes exhibit a far greater dynamic range in duration than others
– Diphthongs tend to be longer than monophthongs, BUT ….
– The low monophthongs ([ae], [aa], [ay], [aw], [ao]) exhibit patterns of duration and dynamic range under stress (accent) similar to diphthongs
• The statistical patterns are consistent with the hypothesis that duration serves under many conditions as either a primary or secondary cue for vowel height (normally associated with the frequency of the first formant)

Take Home Messages

• Moreover, the stress-accent system in spontaneous (American) English appears to be closely associated with vocalic identity
• Low vowels are far more likely to be fully stressed than high vowels (with the mid vowels exhibiting an intermediate probability of being stressed)
• Thus, the identity of a vowel cannot be considered independently of stress-accent
• The two parameters are likely to be flip sides of the same Koine
• Although English is not generally considered to be a vowel-quantity language (as is Finnish), given the close relationship between stress-accent and duration, and between duration and vowel quality, there is some sense in which English (and perhaps other stress-accent languages) manifests certain properties of a “quantity” system

Automatic Methods for Articulatory Feature Extraction and Phonetic Transcription

Manner-Specific Place Classification

Manner Feature Classification/Segmentation

• Automatic methods (neural networks) can accurately label MANNER of articulation features for spontaneous material (Switchboard corpus)
• Implication – MANNER information may be relatively co-terminous with phonetic segments and evade “co-articulation” effects

Label Accuracy per Frame

• Central frames are labeled more accurately than those close to the segmental boundaries
• Implication – some frames are created more equal than others

OGI Numbers Corpus
Frame step interval = 10 ms
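Per-frame accuracy as a function of position within the segment can be tabulated by measuring each frame's distance from the nearest segment edge. A minimal sketch, assuming frame-level predictions are already grouped by reference segment; the data layout and toy segments are hypothetical:

```python
from collections import defaultdict


def accuracy_by_boundary_distance(segments):
    """segments: list of (reference_label, [predicted label per 10 ms frame]).

    Returns {distance_in_frames_from_nearest_edge: accuracy}.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for reference, frames in segments:
        n = len(frames)
        for i, predicted in enumerate(frames):
            distance = min(i, n - 1 - i)  # 0 = frame adjacent to a boundary
            totals[distance] += 1
            hits[distance] += int(predicted == reference)
    return {d: hits[d] / totals[d] for d in sorted(totals)}


# Toy segments, invented for illustration: errors cluster at the edges
toy_segments = [
    ("s", ["f", "s", "s", "s", "z"]),
    ("iy", ["ih", "iy", "iy", "iy"]),
]
toy_accuracy = accuracy_by_boundary_distance(toy_segments)
```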

MANNER Classification – Elitist Approach

• “Confident” (usually central) frames are classified more accurately

NTIMIT (telephone) Corpus
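One way to read the "elitist" idea is to score only those frames on which the classifier is confident. The sketch below keeps frames whose highest output exceeds a threshold; the threshold of 0.7, the data layout and the toy frames are assumptions, since the slide gives no such details:

```python
def elitist_accuracy(frames, threshold=0.7):
    """frames: list of (reference_label, {label: classifier_output}).

    Returns (accuracy on all frames, accuracy on confident frames,
    fraction of frames retained). Confident accuracy is None if no frame
    passes the threshold.
    """
    def is_correct(reference, outputs):
        return max(outputs, key=outputs.get) == reference

    confident = [(ref, out) for ref, out in frames
                 if max(out.values()) >= threshold]
    all_acc = sum(is_correct(r, o) for r, o in frames) / len(frames)
    conf_acc = (sum(is_correct(r, o) for r, o in confident) / len(confident)
                if confident else None)
    return all_acc, conf_acc, len(confident) / len(frames)


# Toy frames, invented for illustration
toy_frames = [
    ("stop", {"stop": 0.9, "fricative": 0.1}),
    ("vowel", {"vowel": 0.95, "nasal": 0.05}),
    ("nasal", {"vowel": 0.55, "nasal": 0.45}),
    ("fricative", {"fricative": 0.6, "stop": 0.4}),
]
toy_result = elitist_accuracy(toy_frames)
```

The trade-off is visible in the third return value: accuracy rises on the retained frames, but only part of the signal is labeled.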

Manner-Specific Place Classification

• Knowing the “manner” improves “place” classification for consonants

NTIMIT (telephone) Corpus

Manner-Specific Place Classification

• Knowing the “manner” improves “place” classification for vowels as well

NTIMIT (telephone) Corpus
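Why knowing the manner helps can be illustrated with a deliberately simplified stand-in: instead of training a separate place classifier per manner class, the sketch below just restricts the place candidates to those compatible with the known manner. The place inventory, the scores and the function name are all hypothetical:

```python
# Hypothetical inventory of place values that can occur with each manner class
PLACES_BY_MANNER = {
    "stop": {"bilabial", "alveolar", "velar"},
    "nasal": {"bilabial", "alveolar", "velar"},
    "fricative": {"labiodental", "dental", "alveolar", "palatal", "glottal"},
    "vowel": {"front", "central", "back"},
}


def classify_place(place_scores, manner=None):
    """Pick the best-scoring place; if the manner is known, consider only
    the places that are possible for that manner."""
    if manner is None:
        candidates = place_scores
    else:
        candidates = {place: score for place, score in place_scores.items()
                      if place in PLACES_BY_MANNER[manner]}
    return max(candidates, key=candidates.get)


# Toy output of a manner-independent place classifier for one frame
toy_scores = {"front": 0.40, "alveolar": 0.35, "velar": 0.25}
without_manner = classify_place(toy_scores)
with_manner = classify_place(toy_scores, manner="stop")
```

A manner-specific classifier goes further than this pruning step, since it can also learn cues that are only informative within one manner class.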

Manner-Specific Place Classification – Dutch

• Knowing the “manner” improves “place” classification for consonants and vowels in DUTCH as well as in English

VIOS (telephone) Corpus

Manner-Specific Place Classification – Dutch

• Knowing the “manner” improves “place” classification for the “approximant” segments in DUTCH
• Approximants are classified as “vocalic” rather than as “consonantal”

VIOS (telephone) Corpus

Take Home Messages

• Automatic recognition systems can be used to test specific hypotheses about the acoustic properties of articulatory features, segments and syllables
• Manner information appears to be well classified and segmented
– Suggests that manner features may be the key articulatory feature dimension for segmentation within the syllable
• Place information is not as well classified as manner information
– Improvement of place with manner-specific classification suggests that place recognition does depend to a certain degree on manner classification
• Voicing information appears to be relatively robust under many conditions and therefore is likely to emanate from a variety of spectral regions
– The time constant for voicing information is also likely to be less than or coterminous with the segment

Sample Transcription from the ALPS System

• The ALPS (automatic labeling of phonetic segments) system performs very similarly to manual transcription in terms of both labels and segmentation
– 11 ms average concordance in segmentation
– 83% concordance with respect to phonetic labels

OGI Numbers (telephone) corpus
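The two concordance figures correspond to a boundary-timing measure and a label-agreement measure. A sketch of both, assuming the manual and automatic segment lists are already aligned one-to-one (a real comparison would need an alignment step first); the segment times below are toy values:

```python
def concordance(manual, automatic):
    """Each list holds (label, start_ms, end_ms) tuples, aligned one-to-one.

    Returns (mean absolute boundary difference in ms, percent label agreement).
    """
    boundary_diffs = []
    label_hits = 0
    for (m_label, m_start, m_end), (a_label, a_start, a_end) in zip(manual, automatic):
        boundary_diffs.append(abs(m_start - a_start))
        boundary_diffs.append(abs(m_end - a_end))
        label_hits += int(m_label == a_label)
    mean_diff = sum(boundary_diffs) / len(boundary_diffs)
    return mean_diff, 100.0 * label_hits / len(manual)


# Toy segmentations, invented for illustration
manual_segments = [("s", 0, 90), ("eh", 90, 170), ("v", 170, 230), ("ax", 230, 280)]
automatic_segments = [("s", 0, 100), ("eh", 100, 165), ("v", 165, 240), ("ix", 240, 280)]
toy_concordance = concordance(manual_segments, automatic_segments)
```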

ALPS Output Can Be Superior to Alignments

[Figure panels: ALPS - Seg, ALPS Manner Information, Forced Alignment Segments, Speech Waveform, Spectrogram, Word Transcript]

Switchboard (telephone) Corpus

Grand Summary and Conclusions

• The controlling parameters for understanding spoken language appear to be based on low-frequency modulation patterns in the acoustic signal associated with the syllable
• Both the magnitude and phase of the modulation patterns are important
• Encoding information in terms of low-frequency modulations provides a certain degree of robustness to the speech signal that enables it to be decoded under a wide range of acoustic and speaking conditions
• Manner information appears to be the key to understanding segmentation internal to the syllable
• Place features appear to be dominant and most stable at syllable onset and coda
• Manner is the stable feature dimension for the syllabic nucleus
• Voicing and rounding appear to be auxiliary features linked to manner and place feature information
• “Real” speech can be useful in delineating underlying patterns of linguistic organization
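Low-frequency modulation patterns are often examined through the modulation spectrum of the amplitude envelope. The slides do not say how the modulation patterns were computed, so the following is only an illustrative sketch: a synthetic 100 Hz carrier is amplitude-modulated at 4 Hz (roughly a syllable rate), its envelope is taken as the mean rectified amplitude per 10 ms frame, and a plain DFT of the envelope shows where the modulation energy lies. All signal parameters and function names are assumptions:

```python
import cmath
import math


def frame_envelope(signal, frame_len):
    """Mean rectified amplitude in consecutive non-overlapping frames."""
    return [sum(abs(s) for s in signal[i:i + frame_len]) / frame_len
            for i in range(0, len(signal) - frame_len + 1, frame_len)]


def modulation_spectrum(envelope, frame_rate_hz, max_hz=16.0):
    """Magnitude of the envelope's DFT at each bin up to max_hz (mean removed)."""
    n = len(envelope)
    mean = sum(envelope) / n
    centered = [e - mean for e in envelope]
    spectrum = {}
    k = 1
    while k * frame_rate_hz / n <= max_hz:
        coeff = sum(c * cmath.exp(-2j * math.pi * k * i / n)
                    for i, c in enumerate(centered))
        spectrum[k * frame_rate_hz / n] = abs(coeff) / n
        k += 1
    return spectrum


SAMPLE_RATE = 1000  # Hz, chosen small to keep the toy example fast
toy_signal = [(1 + 0.8 * math.cos(2 * math.pi * 4 * n / SAMPLE_RATE))
              * math.sin(2 * math.pi * 100 * n / SAMPLE_RATE)
              for n in range(2 * SAMPLE_RATE)]          # 2 seconds
toy_envelope = frame_envelope(toy_signal, frame_len=10)  # 10 ms frames
toy_spectrum = modulation_spectrum(toy_envelope, frame_rate_hz=100.0)
peak_hz = max(toy_spectrum, key=toy_spectrum.get)
```

Only the magnitude is kept here; the complex DFT coefficients also carry the phase, which the summary singles out as equally important.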

That’s All, Folks

Many Thanks for Your Time and Attention