Available online at www.sciencedirect.com

Computer Speech and Language 28 (2014) 580–597

Speaking in noise: How does the Lombard effect improve acoustic contrasts between speech and ambient noise?

Maëva Garnier ∗, Nathalie Henrich
Department of Speech and Cognition, GIPSA-Lab (UMR 5216: CNRS, INPG, University Stendhal, UJF), Grenoble, France

Received 14 October 2012; received in revised form 26 July 2013; accepted 29 July 2013
Available online 15 August 2013

Abstract

What makes speech produced in the presence of noise (Lombard speech) more intelligible than conversational speech produced in quiet conditions? This study investigates the hypothesis that speakers modify their speech in the presence of noise in such a way that acoustic contrasts between their speech and the background noise are enhanced, which would improve speech audibility.

Ten French speakers were recorded while playing an interactive game first in quiet condition, then in two types of noisy conditions with different spectral characteristics: a broadband noise (BB) and a cocktail-party noise (CKTL), both played over loudspeakers at 86 dB SPL.

Similarly to (Lu and Cooke, 2009b), our results suggest no systematic “active” adaptation of the whole speech spectrum or vocal intensity to the spectral characteristics of the ambient noise. Regardless of the type of noise, the gender or the type of speech segment, the primary strategy was to speak louder in noise, with a greater adaptation in BB noise and an emphasis on vowels rather than any type of consonants.

Active strategies were evidenced, but were subtle and of second order to the primary strategy of speaking louder: for each gender, fundamental frequency (f0) and first formant frequency (F1) were modified in cocktail-party noise in a way that optimized the release in energetic masking induced by this type of noise. Furthermore, speakers showed two additional modifications as compared to shouted speech, which therefore cannot be interpreted in terms of vocal effort only: they enhanced the modulation of their speech in f0 and vocal intensity and they boosted their speech spectrum specifically around 3 kHz, in the region of maximum ear sensitivity associated with the actor’s or singer’s formant.

© 2013 Elsevier Ltd. All rights reserved.

Keywords: Lombard speech; Noise; Production; Speech audibility; Auditory detection; Segregation; Energetic masking

This paper has been recommended for acceptance by ‘Dr. Martin Cooke’.
∗ Corresponding author. Tel.: +33 4 76 57 50 61.
E-mail address: [email protected] (M. Garnier).
0885-2308/$ – see front matter © 2013 Elsevier Ltd. All rights reserved.
http://dx.doi.org/10.1016/j.csl.2013.07.005

1. Introduction

Noise exposure triggers an adaptation in speech production, commonly referred to as the Lombard effect. When communicating in noisy environments, speakers commonly increase vocal intensity and fundamental frequency (f0) as compared to communicating in quiet environments (Castellanos et al., 1996; Junqua, 1993; Van Summers et al., 1988). Speech produced in noise (also called Lombard speech) is also characterized by a higher first-formant frequency of vowels (F1), boosted energy above 2 kHz and increased vowel/consonant (V/C) ratio in both vocal intensity and duration (Boril and Pollak, 2005; Castellanos et al., 1996; Egan, 1972; Junqua, 1993; Kadiri, 1998; Mokbel, 1992; Stanton et al., 1988; Van Summers et al., 1988).

Lombard speech has been shown to be more intelligible than conversational speech produced in quiet condition (Dreher and O’Neill, 1957; Lu and Cooke, 2008; Pittman and Wiley, 2001; Van Summers et al., 1988). It remains unclear which speech modifications contribute to this gain in intelligibility and which aspects of intelligibility are improved by this speech adaptation (phoneme recognition, speech audibility, utterance parsing, etc.).

In this article, we focus on whether and how speakers may try to improve their audibility in noise, i.e. the detection and perception of speech information by their interlocutor within the background noise. Before envisaging and suggesting some possible strategies, let us first review the current knowledge on the mechanisms and factors that influence speech perception in noise.

First, it is known that the audibility of a sound is considerably degraded when it is heard simultaneously with a competing noise or sound stream that contains energy in the same critical frequency bands. The energetic-masking effect increases with increased spectral overlap and decreased signal-to-noise ratio (SNR) between the target sound and the masker (Hornsby and Ricketts, 2001; French and Steinberg, 1947). In the case of speech, multi-talker noise degrades the perception of vowels more than consonants, whereas white Gaussian noise has the opposite effect (Junqua, 1993). Similarly, speech is more degraded by a competing speech produced by a speaker of the same gender (Brungart, 2001).

Auditory fusion is another perceptual phenomenon that occurs when two or more sound streams are heard at the same time. The concurrent streams are interpreted as coming from the same source when they present similar acoustic characteristics (such as the average intensity, pitch and timbre, but also in the temporal modulation of these parameters). They are segregated from each other when acoustic contrasts exceed a given threshold (Darwin et al., 2003).

Both phenomena of energetic masking and auditory fusion underlie the “cocktail-party effect”, i.e. the difficulty of following a voice and understanding what is said within a multitude of other competing voices (Arons, 1992). Psychoacoustic research showed how the segregation of a target speech from another competing speech is particularly difficult when both voices are similar in spectral content and fundamental frequency (f0) (Assmann and Summerfield, 1990), first-formant frequency (F1) (Darwin et al., 2003), when the target voice is not modulated in f0 (Marin and McAdams, 1991), and when it is compressed in amplitude (Hornsby and Ricketts, 2001).

What could speakers then do to improve their speech audibility and segregation in noise?

(1) Speakers may try to decrease the amount of energetic masking and enhance acoustic contrasts by increasing the global vocal intensity of their speech, and more specifically the spectral energy in frequency regions where the background noise presents maximum energy (boosting strategies).

(2) They may try to shift the spectral energy, or at least important phonetic cues coded in frequency (f0, formants), to spectral bands where the background noise presents minimum energy (bypass strategies).

(3) They may try to increase the temporal modulation of their speech in f0 and vocal intensity (modulation strategies).

Evidence of boosting strategies was provided by Mokbel (1992) and Junqua et al. (1998). Mokbel (1992) demonstrated that a speaker enhances his speech energy more in the frequency band where noise was concentrated. In a single-talker experiment comparing speech adaptation to broadband noises filtered by different band-pass filters, Junqua et al. (1998) showed that, at constant masker level, the increase of vocal intensity varies with noise spectral tilt. Bypass strategies were observed in recent studies (Lu and Cooke, 2008, 2009a,b) that showed how the center of gravity (CoG) of the speech spectrum increases in frequency when speaking in low-pass noises (multi-babble noise, driving noise or low-pass filtered broadband noise). However, they did not observe such bypass strategies in high-pass filtered noises (Lu and Cooke, 2009a,b), in which the speech CoG and additional spectral cues (f0, F1) were not shifted down but were still shifted to higher frequencies. No study has specifically explored the use of modulation strategies in noise. Nevertheless, enhanced intonation contours and wider f0 ranges in noise have been reported by several authors (Boril and Pollak, 2005; Garnier et al., 2006, 2010; Welby, 2006).

In line with these previous studies, this study aims at examining whether speakers adopt boosting, bypass or modulation strategies in noise to adapt to the spectral characteristics of the background noise and enhance acoustic contrasts between their speech and that noise. Two different types of noise frequently encountered in ecological situations were chosen here to test speech adaptation to varying energetic masking: a broadband noise (with equally distributed energy below 10 kHz) and a cocktail-party noise (with concentrated energy below 1 kHz). Similarly to previous studies, evidence of boosting, bypass or modulation strategies was searched in the global adaptation of speech intensity and spectrum. In addition, more specific and local adaptations were searched (1) in the modification of spectral cues such as f0 and F1, (2) in the temporal modulation of f0 and vocal intensity, and (3) in the potentially different effects that noise type can have on speech modifications made by both genders and for different types of speech segments.

Table 1
List and phonetic transcription of the 16 French target words used in the study.

Bijou [biʒu]      Chausson [ʃosɔ̃]   Cochon [koʃɔ̃]   Dauphin [dofɛ̃]
Fusil [fyzi]      Gitans [ʒitɑ̃]     Guenon [gønɔ̃]   Lagon [lagɔ̃]
Marie [m�Ri]      Navet [navɛ]      Panda [pɑ̃da]    Requin [Rekɛ̃]
Sommet [sɔmmɛ]    Toupie [tupi]     Vallée [vale]   Zébu [zeby]

2. Materials and methods

2.1. Speech production experiment

The corpus is similar to the experiments presented in Garnier (2008) and Garnier et al. (2010).

2.1.1. Participants

Ten native French speakers (five men and five women) aged 20–28 years old took part in the recording. Only one of them had some basic knowledge about the Lombard effect. None reported any speech or hearing difficulties.

2.1.2. Task

To account for the effect of communicative interaction on the Lombard effect, speakers were recorded while playing a collaborative game in pairs. The game was inspired by the Map Task game (Brown et al., 1983). Its rules are summarized here, and more details can be found in Garnier et al. (2010). The speakers had to exchange information about 16 items drawn on their map, so as to reconstruct a path that linked the items. The items corresponded to target words comprised of two CV (consonant–vowel) syllables (see Table 1). They were selected to represent most of the French phonemes (all of the vowels except [œ], all the consonants except [ŋ] and none of the semi-vowels [w], [j] and [ɥ]). The experimental condition aimed at reproducing as much as possible a realistic situation of face-to-face interaction in noisy conditions. Speakers were seated two meters from and facing each other. They could use audio and visual information from the face only, since hands and head movements were constrained. No carrier sentence was imposed, so as to preserve spontaneity. The speech content could not be predicted so that speakers had to adjust their intelligibility level. An example of utterances produced by the speech partners (translated from French) is given below.¹ As in realistic noisy conditions, speakers were sometimes unsuccessful in their communication and had to repeat or reformulate their utterance. These repetitions were included in the long-term acoustical analysis (long-term average spectrum, distribution of f0 and vocal intensity over the game duration). Only the first occurrence of the target words was considered for the analysis of syllables and segments.

1 Leader: “The first item is the shark (/Rəkɛ̃/ in French).”
Follower: “Well... the shark is associated with the summit (/sɔmmɛ/ in French)”
Leader: “with the what?”
Follower: “with the summit!”
Follower: “Yes”
Leader: “Ah, ok. Hmm... Then I’m going to the pig (/koʃɔ̃/ in French)”

2.1.3. Experimental conditions

Speakers played the game in a sound-treated booth in three sound conditions: (1) quiet, (2) 86 dB SPL of broadband noise (BB) and (3) 86 dB SPL of non-intelligible cocktail-party noise (CKTL). The two types of noise were selected from the BD Bruit database (Zeiliger et al., 1994). Their spectral characteristics are given in Fig. 1. In the BB noise, spectral energy is attenuated above 10 kHz. The CKTL noise is made from the non-intelligible speech of four males and four females. The spectral energy is concentrated below 800 Hz. It presents two maxima around 170 and 500 Hz, and a local minimum around 340 Hz. Noise files were 3 min long. Noise started with a fade-in and was turned off once the two speakers had completed the game after approximately 2–3 min.

Fig. 1. Spectral characteristics of the broadband noise (BB) and cocktail-party noise (CKTL).

To avoid perturbing the speakers’ self-monitoring feedback and their perception of the background noise, the two noises were played over two loudspeakers (Tannoy System 600) instead of headphones. Loudspeakers were positioned 1.5 m from the speakers in each lateral direction and at the level of their ears. Noise levels were calibrated using a 1/2″ pressure microphone (B&K 4165) and an artificial head placed where the speaker would be seated inside the booth.

2.1.4. Audio recordings

The audio speech signal was recorded with a cardioid headset microphone (Beyerdynamic Opus 54) placed about 5 cm in front of the mouth. The signal was pre-amplified (RME Octamic) and sampled at 44.1 kHz and 16 bits (RME ADI 8 Pro converter and RME DIGI 9652 HDSP sound card). Speech intensity was calibrated prior to the experiment by recording the audio signal of a sustained vowel produced by the speaker, and by measuring the corresponding sound pressure level (SPL) at the microphone with a digital Sound Level Meter.

Noise was removed from the speech recordings using a dedicated noise-canceling method (Ternström et al., 2002). Before each noisy condition, 10 s of white noise was played into the loudspeakers and recorded at the microphone with the speaker remaining quiet, which enabled the estimation of the impulse response of the loudspeaker-to-microphone channel. For each noisy condition, it was then possible to estimate the entire noise signal that would have been recorded at the microphone if the speaker had remained quiet, and to subtract this estimation, in the time-domain, from the actual noisy recording of speech. The result of this subtraction gave the audio signal of speech produced in noise, with little remaining signal from the surrounding noise. Such a method can be used in laboratory conditions where the noise signal is perfectly known, and where loudspeakers and microphones remain at the same place. However, the position of the speaker in the room also affects the channel estimation. As a consequence, it is necessary to restrain speakers’ movements to maximize the denoising performance and to guarantee the validity of acoustic measurements. When head movements are restrained, the possible bias introduced by the denoising is less than 0.2 dB for intensity of voiced segments, 0.7 dB for unvoiced ones, 0.13 tones for f0, 10 Hz for the first formant frequency and 3 Hz for the centroid of the speech spectrum (Garnier et al., 2010). In this experiment participants remained still during the 2–3 min of noise exposure. In between each speaking condition the participants were allowed to move and relax. The headset microphone was firmly attached to the participant’s head to avoid any movement of the face away from the microphone. The map used for the interactive game was placed high enough on a stand so that the speakers could see both the map and their speech partner by moving their eyes but not their head. Writing on the map could be achieved with wrist movements only.
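The principle of this denoising step (channel estimation from a calibration noise, then time-domain subtraction of the predicted noise) can be illustrated with a short Python sketch. It is not the implementation of Ternström et al. (2002): it assumes a linear, time-invariant loudspeaker-to-microphone channel, estimates a finite impulse response by regularized division in the frequency domain, and uses hypothetical function names and parameters (`estimate_channel`, `cancel_noise`, `n_taps`, `eps`).

```python
import numpy as np

def estimate_channel(played, recorded, n_taps, eps=1e-8):
    """Estimate an FIR impulse response h such that recorded ~ played * h.

    Assumption: the channel is linear and time-invariant, and `recorded`
    contains only the calibration noise (the speaker remains quiet).
    """
    n = len(played) + len(recorded)          # zero-padding avoids circular wrap-around
    P = np.fft.rfft(played, n)
    R = np.fft.rfft(recorded, n)
    # Regularized spectral division: H = R * conj(P) / (|P|^2 + eps)
    H = R * np.conj(P) / (np.abs(P) ** 2 + eps)
    return np.fft.irfft(H, n)[:n_taps]

def cancel_noise(noisy_speech, noise_played, h):
    """Subtract, in the time domain, the noise predicted at the microphone."""
    predicted = np.convolve(noise_played, h)[:len(noisy_speech)]
    return noisy_speech - predicted
```

In such a scheme the quality of the subtraction depends directly on how well `h` still describes the acoustic path during the recording, which is consistent with the requirement stated above that speakers restrain their movements.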

2.2. Analysis

Sentences, syllables and segments of the target words were manually segmented from the audio signal using Praat software. All the acoustic analyses were made with MATLAB.


2.2.1. Long-term spectral and frequency descriptors

The long-term average spectrum (LTAS) was computed from 0 to 6 kHz over the concatenated utterances produced in each experimental condition by each speaker (∼1 min 30 s). The LTAS center of gravity (CoG) was measured from speech normalized in intensity in order to compare the energy distribution between conversational and Lombard speech spectra. The mean energy (in dB SPL) in the 0–1 kHz frequency band (corresponding to f0 and F1), in the 2–4 kHz band (corresponding to the actor’s formant area) and in the 1–2 kHz and 4–6 kHz remaining bands were derived from this LTAS.
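As an illustration of these descriptors, the following Python sketch computes an LTAS, its center of gravity and band levels. It is only a sketch under stated assumptions, since the original analysis was made with MATLAB and its exact settings are not given here: the frame length, the Hann window, the 50% overlap and the power-weighted definition of the CoG are all assumptions, the function names are hypothetical, and the band levels are uncalibrated (in dB relative to an arbitrary reference, not dB SPL).

```python
import numpy as np

def ltas(signal, fs, n_fft=1024, f_max=6000.0):
    """Long-term average power spectrum up to f_max (linear power units)."""
    hop = n_fft // 2                               # assumed 50% overlap
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop    # requires len(signal) >= n_fft
    power = np.zeros(n_fft // 2 + 1)
    for i in range(n_frames):
        frame = signal[i * hop:i * hop + n_fft] * window
        power += np.abs(np.fft.rfft(frame)) ** 2
    power /= n_frames
    freqs = np.fft.rfftfreq(n_fft, 1.0 / fs)
    keep = freqs <= f_max
    return freqs[keep], power[keep]

def center_of_gravity(freqs, power):
    """Power-weighted mean frequency of the spectrum (assumed CoG definition)."""
    return np.sum(freqs * power) / np.sum(power)

def band_level_db(freqs, power, f_lo, f_hi):
    """Mean power in [f_lo, f_hi) expressed in dB (uncalibrated)."""
    band = (freqs >= f_lo) & (freqs < f_hi)
    return 10.0 * np.log10(np.mean(power[band]) + 1e-20)
```

The four bands used in the study would correspond to calls with the limits (0, 1000), (1000, 2000), (2000, 4000) and (4000, 6000) Hz.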

The fundamental frequency (f0) was estimated by autocorrelation over the voiced parts of these concatenated utterances. The distribution of f0 values was computed using a quantization step of 5 Hz and was normalized so that its integration summed to 100%. The mode of the f0 distribution was detected to estimate the f0 value that is produced the most frequently by a speaker in a given condition. The amplitude of f0 modulation was expressed in tones by taking out f0 values that corresponded to the lowest 15% of the distribution (i.e. the least frequent occurrences of f0), then extracting the width of that new distribution.
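The f0 distribution descriptors can be sketched in Python as follows. The text leaves some details open, so this is one possible reading rather than the authors' procedure: the least frequent 5-Hz bins are removed until they account for at most 15% of the distribution, the width is taken between the lowest and highest remaining bin centers, and the conversion to tones assumes 6 tones per octave. All function names are hypothetical.

```python
import numpy as np

def f0_distribution(f0_hz, step=5.0):
    """Histogram of f0 values in bins of `step` Hz, normalized to sum to 100%."""
    f0_hz = np.asarray(f0_hz, dtype=float)
    lo = np.floor(f0_hz.min() / step) * step
    hi = np.ceil(f0_hz.max() / step) * step + step
    edges = np.arange(lo, hi + step / 2, step)
    counts, edges = np.histogram(f0_hz, bins=edges)
    percent = 100.0 * counts / counts.sum()
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, percent

def f0_mode(centers, percent):
    """Center of the most frequently produced f0 bin."""
    return centers[np.argmax(percent)]

def f0_modulation_tones(centers, percent, trim=15.0):
    """Width (in tones) of the distribution after removing the least
    frequent bins, up to `trim` percent of the total (assumed reading)."""
    order = np.argsort(percent, kind="stable")       # least frequent bins first
    dropped = np.cumsum(percent[order]) <= trim
    kept = np.ones(len(percent), dtype=bool)
    kept[order[dropped]] = False
    lo, hi = centers[kept].min(), centers[kept].max()
    return 6.0 * np.log2(hi / lo)                    # assumed: 6 tones per octave
```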

2.2.2. Syllable descriptors

Mean f0 and mean vocal intensity were measured for each CV syllable of the target word. Female speakers have higher f0 and greater intra-speaker f0 variations (in Hertz) than males. To account for the gender difference and allow intra-speaker comparisons, f0 was expressed in tones (from fref = 50 Hz). The magnitude of amplitude modulation was calculated for each CV syllable as the difference between the maximal intensity of the vowel and the minimal intensity of the preceding consonant.
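The two syllable descriptors can be written compactly, as in the Python sketch below. The conversion formula is not given in the text; the sketch assumes the usual equal-tempered scale, in which one octave (a doubling of frequency) spans 6 tones, so that f0 in tones is 6·log2(f0/fref). Function names are hypothetical.

```python
import numpy as np

F_REF = 50.0  # reference frequency in Hz, as stated in the text

def hz_to_tones(f0_hz, f_ref=F_REF):
    """f0 in tones above f_ref, assuming 6 tones per octave."""
    return 6.0 * np.log2(np.asarray(f0_hz, dtype=float) / f_ref)

def amplitude_modulation_db(consonant_db, vowel_db):
    """Maximal vowel intensity minus minimal intensity of the preceding consonant."""
    return float(np.max(vowel_db) - np.min(consonant_db))
```

On such a logarithmic scale, a rise from 100 to 120 Hz and a rise from 200 to 240 Hz correspond to the same interval, which is what makes comparisons between male and female speakers meaningful.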

2.2.3. Segment descriptors

For each talker and each condition, the mean frequency of the first formant was semi-automatically measured for the vowels [a] (4 occurrences), [i] (5 occurrences) and [u] (2 occurrences) of the target word, using a conventional autocorrelation-based LPC method.

Segment mean duration and mean vocal intensity were measured for vowels (32 measurements), sonorants with a formant structure ([n], [m] and [l]: 8 measurements), and unvoiced consonants ([f], [s], [ʃ], [p], [t] and [k]: 12 measurements) of the target word.

2.2.4. Statistical analysis

Using SPSS software, a one-way analysis of variance (ANOVA) with repeated measures was conducted on each parameter, considering one mean value per speaker and per condition.

Main effects were tested first for the factor CONDITION (three levels: CONDITION 1 – quiet (i.e. no noise), CONDITION 2 – BB noise at 86 dB SPL, CONDITION 3 – CKTL noise at 86 dB SPL) and the inter-subject factor GENDER (two levels: 1-female, 2-male). The statistical interaction between these two factors was also tested.

Secondly, specific contrasts were examined using Bonferroni adjustments. The effect of noise exposure on speech production was tested as the contrast between quiet (CONDITION 1) and the two types of noise (CONDITION 2 and 3). The effect of noise type on speech modification was tested as the contrast between BB noise (CONDITION 2) and CKTL noise (CONDITION 3). The influence of gender on speech modification in noise (GENDER × (CONDITION 1 vs. CONDITIONS 2–3)) and on the effect of noise type (GENDER × (CONDITION 2 vs. CONDITION 3)) were also tested.

All the results are reported in Figs. 2–4 and 6–8. The conventional notation was adopted for indicating statistical significance of the ANOVA tests: *** for p < .001, ** for p < .01, * for p < .05 and ns (not significant) for p > .05.
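For readers less familiar with this notation, the Python sketch below shows a generic Bonferroni adjustment and the mapping from p-values to significance labels. It does not reproduce the SPSS analysis: the function names are hypothetical, the number of contrasts is left as a parameter, and treating p = .05 exactly as not significant is an assumption, since the text does not define that boundary case.

```python
def bonferroni_adjust(p, n_comparisons):
    """Bonferroni-adjusted p-value: multiply by the number of comparisons, cap at 1."""
    return min(1.0, p * n_comparisons)

def significance_label(p):
    """Map a p-value to the notation used in the figures."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return "ns"  # assumption: p = .05 exactly is treated as not significant
```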

3. Results

This section explores the three types of strategies mentioned in the introduction that speakers may adopt to improve their speech audibility and segregation from a background noise (boosting, bypass, modulation). Where necessary, the hypothesis is stated again, together with the expected results for a given strategy. The observations are then described and compared to the expectations.

Fig. 2. Effect of noise exposure on syllable mean intensity as a function of noise type (Broadband noise (BB), Cocktail-party noise (CKTL)) and speaker’s gender. On the left graph, error bars indicate the standard deviation over the five speakers of each gender. The black horizontal line across the three panels represents the level of BB and CKTL noises (86 dB SPL). It shows how the increase of vocal intensity in noisy conditions enables the speakers to have a positive signal-to-noise ratio, whereas it would be negative if they did not adapt from quiet to noise. The table on the right side summarizes the main effects and interaction of the factors CONDITION and GENDER on the syllable mean intensity. Specific contrasts tested the effect of noise exposure ((BB and CKTL noises) vs. quiet) and noise type (BB vs. CKTL), as well as the influence of gender (Females vs. Males) on these effects.

3.1. Boosting strategies

3.1.1. Adaptation of speech intensity to the degree of energetic masking

Fig. 2 shows the variation of syllable intensity with noise exposure, noise type and speaker’s gender. As expected, syllable intensity increased significantly when speakers adapted from quiet (i.e. 40 dB SPL of room noise) to 86 dB SPL of noise, regardless of the type of noise. The average increase was 16.6 ± 3.1 dB (see Fig. 2 and footnote 2). Despite the increase of syllable intensity in noise, the SNR decreased on average by 29.4 dB between speech produced in quiet and that produced in noise. However, it would have decreased by 46 dB if the speakers had not adapted their vocal intensity in noise. Fig. 2 shows how speech modification contributes to maintain a positive SNR in noise of 12.5 ± 3.7 dB on average, whereas, without adaptation, it would have been negative for all but one male speaker.
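The arithmetic behind these SNR figures can be checked with a few lines of Python. The sketch only uses the average values reported above and a hypothetical helper name (`snr_change_db`); it says nothing about individual speakers.

```python
ROOM_NOISE_DB = 40.0      # dB SPL of room noise in quiet (from the text)
AMBIENT_NOISE_DB = 86.0   # dB SPL of BB or CKTL noise (from the text)
MEAN_INCREASE_DB = 16.6   # average increase of syllable intensity (from the text)
MEAN_SNR_IN_NOISE = 12.5  # average SNR reached in noise (from the text)

def snr_change_db(noise_rise_db, speech_rise_db):
    """Change in SNR when the noise and speech levels rise by the given amounts."""
    return speech_rise_db - noise_rise_db

noise_rise = AMBIENT_NOISE_DB - ROOM_NOISE_DB                    # 46 dB
change_without_adaptation = snr_change_db(noise_rise, 0.0)       # -46 dB
change_with_adaptation = snr_change_db(noise_rise, MEAN_INCREASE_DB)  # about -29.4 dB
# Average SNR that would be obtained in noise without the intensity increase:
snr_without_adaptation = MEAN_SNR_IN_NOISE - MEAN_INCREASE_DB    # about -4.1 dB
```

The negative value obtained without adaptation is consistent with the statement that the SNR would have been negative for nearly all speakers had they kept their conversational intensity.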

The CKTL noise is made of simultaneous voices and therefore induces greater energetic masking of speech than does BB noise. However, speakers did not increase their syllable intensity more in CKTL noise. On the contrary, the increase of syllable intensity was significantly greater in BB noise (17.7 ± 2.7 dB) than in CKTL noise (15.4 ± 3.8 dB) (see Fig. 2).

Furthermore, the long-term average spectrum of the CKTL noise demonstrates a first energy peak around 170 Hz and a globally more similar envelope to the female voice spectrum. In comparison, BB noise has a flat spectrum, and thus does not induce greater energetic masking on one gender over another. However, no significant interaction was observed between the noise type and the speaker’s gender: females did not increase their syllable intensity more than males in either BB noise or in CKTL noise (see Fig. 2).

2 This standard deviation is calculated from the mean modification observed in each participant (thus over 10 values).

3.1.2. Adaptation of segment intensity and duration to the degree of energetic masking

Vowels and sonorants have a greater spectral overlap with the CKTL noise, compared with the BB noise. On the contrary, the spectrum of unvoiced fricatives ([s], [ʃ], [f]) and stop consonants ([p], [t] and [k]) is more similar to the BB noise spectrum. Consequently, if speakers adopted boosting strategies to compensate for the degree of energetic masking, we could expect that:

- [h1]: the variation of sonorants’ intensity and duration follows the same tendency as vowels, because they demonstrate a comparable spectrum.
- [h2]: there is an interaction between the segment type and the noise type. Vowels and sonorants would be more emphasized in intensity and duration in CKTL noise compared to BB noise [h2a], and more emphasized than unvoiced consonants in CKTL noise [h2b]. On the contrary, unvoiced consonants may be more emphasized in intensity and duration in BB noise compared to CKTL noise [h2c], and more emphasized than vowels and sonorants in BB noise [h2d].

Fig. 3. Effect of noise exposure on segments intensity (a) and duration (b), as a function of noise type (Broadband noise (BB), Cocktail-party noise (CKTL)) and segment type (vowels, sonorants [n], [m] and [l], and unvoiced plosive and fricative consonants). On the left graphs, each bar represents the average variation of intensity or duration from quiet to noise, as well as the inter-speaker variability in this adaptation. The table on the right side summarizes the main effect of the factor CONDITION on segment intensity and duration. Specific contrasts tested the effect of noise exposure ((BB and CKTL noises) vs. quiet) and noise type (BB vs. CKTL).

Fig. 3 shows how speakers modified the vocal intensity and the duration of vowels, sonorants and unvoiced fricatives and plosives in BB and CKTL noises. Results of statistical analysis are given in the table on its right side.

The results did not support our hypotheses.

Sonorants’ intensity and duration were not found to be modified in noise in a similar way to vowels (hypothesis [h1]) but rather similarly to unvoiced consonants. The intensity of sonorants and unvoiced consonants increased on average by 12.3 dB and 12.8 dB respectively, whereas vowels’ intensity increased by 17.3 dB. Vowels were significantly lengthened (on average by 33 ± 12 ms), whereas unvoiced fricatives and plosives were significantly shortened (by 10 ± 7 ms). Sonorants were shortened by 6 ms, although not significantly.

In contradiction to our hypotheses [h2], the variation of segment intensity and duration with noise exposure did not show any significant interaction between segment type and noise type. Instead, the following general tendencies were observed:

- Regardless of the type of noise and the type of consonants (unvoiced ones or sonorants), vocal intensity increased more for vowels than consonants, and segment duration increased for vowels whereas it tended to decrease for consonants. As a consequence, in all cases, the vowels to consonants ratio was found to increase in noise in both intensity (4.7 ± 1.8 dB) and duration (41 ± 10 ms).
- Vowels’ intensity and duration increased more in BB noise (by 18.7 ± 3.1 dB) than in CKTL noise (by 15.8 ± 3.7 dB) (contrary to hypothesis [h2a]), whereas no significant effect of noise type was found on consonants’ intensity and duration (contrary to hypothesis [h2c]).

Fig. 4. Effect of noise exposure on speech spectrum, as a function of noise type (Broadband noise (BB), Cocktail-party noise (CKTL)) and speaker’s gender. On the left graphs, lines represent the spectrum envelope stylized from the energy measured in the four frequency bands 0–1 kHz, 1–2 kHz, 2–4 kHz and 4–6 kHz. Speech signals were normalized in intensity. Error bars indicate the standard deviation over the five speakers of each gender. The table on the right side summarizes the main effects and interaction of the factors CONDITION and GENDER on the energy in the 0–1 kHz, 1–2 kHz, 2–4 kHz and 4–6 kHz frequency bands. Specific contrasts tested the effect of noise exposure ((BB and CKTL noises) vs. quiet) and noise type (BB vs. CKTL), as well as the influence of gender (Females vs. Males) on these effects.

3.1.3. Specific boost of the speech spectrum in regions of maximum energetic masking

As shown in Fig. 1, the CKTL noise presents maximum energy below 1 kHz, then decreased energy with increasing frequencies above 1 kHz. The BB noise presents equal energy below 10 kHz. If the speakers boosted the energy of their speech in the frequency bands where the background noise presents maximum energy, a greater increase of spectral energy in the 0–1 kHz frequency band should be found in CKTL noise compared to BB noise.
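The band comparison described above can be sketched as follows. This is only an illustration under stated assumptions: the input format (parallel lists of frequencies and power values) and the function name `band_levels_db` are hypothetical, and the paper's own measurement procedure may differ. The four band edges are the ones named in the text, and dividing by the total energy stands in for the intensity normalization.

```python
import math

# Band edges in Hz: the four analysis bands named in the text.
BANDS_HZ = [(0, 1000), (1000, 2000), (2000, 4000), (4000, 6000)]


def band_levels_db(freqs_hz, powers):
    """Level of each band in dB relative to the total 0-6 kHz energy.

    freqs_hz and powers are parallel sequences describing a power spectrum
    (hypothetical input format). Dividing by the total energy removes
    overall intensity differences between recording conditions.
    """
    totals = [0.0] * len(BANDS_HZ)
    for f, p in zip(freqs_hz, powers):
        for i, (lo, hi) in enumerate(BANDS_HZ):
            if lo <= f < hi:
                totals[i] += p
                break  # each spectral component belongs to one band only
    overall = sum(totals)
    if overall <= 0:
        raise ValueError("spectrum has no energy between 0 and 6 kHz")
    return [10 * math.log10(t / overall) if t > 0 else float("-inf")
            for t in totals]


# Toy spectra (made-up numbers, not data from the study).
freqs = [500, 1500, 3000, 5000]
levels_bb = band_levels_db(freqs, [8.0, 1.0, 0.5, 0.5])
levels_cktl = band_levels_db(freqs, [8.0, 1.0, 0.5, 0.5])

# A boosting strategy would predict a positive difference in the 0-1 kHz band.
low_band_difference_db = levels_cktl[0] - levels_bb[0]
print(low_band_difference_db)
```

With real recordings, the quantity of interest is the difference between the CKTL and BB levels in the first band; the hypothesis predicts that it is positive.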

Fig. 4 shows how male and female speakers modified the energy distribution of their voice spectrum in BB and CKTL noises. Results of statistical analysis are given in the table on its right side.

At normalized intensities, the spectra of voices produced in CKTL noise did not present enhanced energy in the 0–1 kHz band, compared to voices produced in BB noise. Noise type did not show any significant influence on the distribution of voice energy in higher frequency bands either. Instead, the following general tendencies were observed:

- For both types of noise, female voices showed a similar amount of energy in the 0–1 kHz band compared to voices produced in quiet condition, whereas Lombard male voices showed slightly less energy in the 0–1 kHz band, compared to voices produced in quiet condition.

- For both males and females, and for both types of noise, the voice spectrum was significantly boosted in the 1–2 kHz and 2–4 kHz bands but decreased in energy at high frequencies (above 4 kHz for males, and above 6 kHz for females), compared to conversational voice produced in quiet condition.

Fig. 5. Examples of typical spectrum envelopes observed in quiet and 86 dB SPL of noise for the vowels [a], [i] and [u]. The selected graphs correspond to female productions from different speakers for each vowel, but from the same speaker in quiet and cocktail-party noise (CKTL). For comparison purposes, spectra have been normalized to their maximum amplitude. The spectrum enhancement observed between 2 and 4 kHz in noise involves different formants for the vowels [a], [i] and [u]. Two clustering strategies were observed for the vowel [i].

Fig. 5 gives an example of typical spectral envelopes observed in this study for vowels produced by female speakers in quiet and noisy conditions. These graphs show how the boosted energy of Lombard speech in the 1–4 kHz region comes not only from a flatter spectral slope but also from the specific enhancement of the amplitude of higher formants. Different formants were involved in this enhancement, depending on the vowel ([a], [i] or [u]). For the vowel [i], two different strategies of formant clustering were observed across speakers and experimental conditions: (1) two clusterings of F2–F3 and of F4–F5 (illustrated in Fig. 5, bottom left panel); (2) one clustering of F2–F3–F4 (illustrated in Fig. 5, bottom right panel).

3.2. Bypass strategies

3.2.1. Global shift of spectral energy away from regions of minimum energetic masking

An alternative to boosting speech energy in the frequency bands of maximum energetic-masking regions is to shift the energy of the speech spectrum away from these masking regions. As CKTL noise presents maximum energy below 1 kHz, one may hypothesize the speech spectral energy to be shifted toward higher frequencies in CKTL noise than in BB noise.


Fig. 6. Effect of noise exposure on the CoG of the global speech spectrum, as a function of noise type (Broadband noise (BB), Cocktail-party noise (CKTL)) and speaker’s gender. The top left plot represents the spectral envelope of the CKTL noise. The dashed vertical lines across the panels correspond to the frequency regions where the CKTL noise spectrum presents local maxima (∼170 Hz and 500 Hz). Error bars indicate the standard deviation over the five speakers of each gender. The table on the right side summarizes the main effects and interaction of the factors CONDITION and GENDER on the CoG of the speech spectrum. Specific contrasts tested the effect of noise exposure ((BB and CKTL noises) vs. quiet) and noise type (BB vs. CKTL), as well as the influence of gender (Females vs. Males) on these effects.

Fig. 6 shows the variation of the CoG of the speech spectrum (including voiced and unvoiced segments) with noise exposure, noise type and speaker’s gender. The table on its right side gives the results of statistical analysis. In both types of noise, speakers shifted their speech spectrum toward higher frequencies. This adaptation translated into a significant increase of speech CoG (by 533 ± 368 Hz on average), which was greater for female speakers (786 ± 360 Hz) than for male speakers (280 ± 122 Hz). In CKTL noise, this spectral shift helped female speakers raise their speech CoG above 800 Hz, where the energy of that background noise is considerably attenuated (see Fig. 1). For male speakers, however, their speech CoG in CKTL noise remained in the 500–900 Hz region, above the frequency of the second peak of the CKTL noise spectrum, but not at a sufficiently high frequency to benefit from a significant release in energetic masking. Furthermore, contrary to our expectations, speakers were not found to raise their speech CoG more in CKTL noise than in BB noise, but they increased it by a similar extent in both types of noise.
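For readers unfamiliar with the measure, the spectral centre of gravity can be sketched as a power-weighted mean frequency. This is an assumption about the definition, since this section does not restate how the CoG was computed, and the function name and toy spectra below are hypothetical.

```python
def spectral_cog_hz(freqs_hz, powers):
    """Power-weighted mean frequency of a spectrum (assumed CoG definition)."""
    total = sum(powers)
    if total <= 0:
        raise ValueError("spectrum has no energy")
    return sum(f * p for f, p in zip(freqs_hz, powers)) / total


# Toy spectra (made-up numbers): a steep spectral slope vs. a flatter one.
freqs = [250, 500, 1000, 2000, 4000]
steep_slope = [1.0, 0.5, 0.25, 0.125, 0.0625]  # power halves per octave
flat_slope = [1.0, 1.0, 1.0, 1.0, 1.0]         # equal power in every component

cog_steep = spectral_cog_hz(freqs, steep_slope)
cog_flat = spectral_cog_hz(freqs, flat_slope)
print(cog_steep, cog_flat)
```

Under this definition, flattening the spectral slope moves the CoG upward even when no individual formant changes, which is why a CoG increase alone cannot distinguish increased vocal effort from a noise-specific strategy.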

3.2.2. Shift of the f0 information to regions of minimum energetic masking

The CKTL noise spectrum presents a local minimum of energy around 340 Hz, which releases the energetic masking by 6 dB in comparison to the two adjacent peaks of energy around 170 Hz and 500 Hz (see Fig. 1). An f0 frequency of 340 Hz seems an accessible target for female speakers. Consequently, a bypass strategy would involve raising their voice in CKTL noise so that their f0 reaches the region of reduced energetic masking. On the contrary, male speakers rarely raise their speaking voice to f0 higher than 250–300 Hz. They cannot reach the local minimum of the CKTL noise spectrum. In their case, a bypass strategy would consist in limiting the increase of their pitch in CKTL noise in order to maintain their f0 range below the first peak of energy of the CKTL noise (around 170 Hz, see Fig. 1 and top panel in Fig. 8).

Fig. 7 shows the distribution of f0 values for all the speakers in quiet condition, BB noise and CKTL noise. Its bottom table presents the variation of mean f0 for the target-word syllables, as a function of noise type and speaker’s gender. As expected, all the female speakers were found to increase their syllable f0 in CKTL (by 4.5 ± 0.5 tones on average) so that their most frequent f0 (corresponding to the mode of the f0 distribution) in CKTL noise was found on average at 334 ± 17 Hz, i.e. 5.8 tones away from the first peak of the CKTL spectrum, and as close as 0.2 tones to its local minimum of energy. However, contrary to our expectations, male speakers were not found to limit the increase of their f0 in CKTL noise. Similarly to females, they raised their syllable f0 in CKTL noise by 4 ± 1 tones on average. This brought their most frequent f0 in CKTL noise to 180 ± 38 Hz on average, i.e. as close as 0.5 tones from the center frequency of the main peak of the CKTL spectrum.
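The distances in tones quoted above can be checked with a short calculation, assuming that one tone equals two equal-tempered semitones (one sixth of an octave), so that the interval between two frequencies is 6·log2(f2/f1). The function name is hypothetical; the frequencies are those given in the text.

```python
import math


def interval_in_tones(f_from_hz, f_to_hz):
    """Signed interval in whole tones (assumes 1 tone = 1/6 octave)."""
    return 6 * math.log2(f_to_hz / f_from_hz)


CKTL_PEAK_HZ = 170.0     # first spectral peak of the CKTL noise (from the text)
CKTL_MINIMUM_HZ = 340.0  # local spectral minimum of the CKTL noise (from the text)

FEMALE_MODE_F0_HZ = 334.0  # most frequent female f0 in CKTL noise (from the text)
MALE_MODE_F0_HZ = 180.0    # most frequent male f0 in CKTL noise (from the text)

# Female mode relative to the first noise peak and to the noise minimum.
print(round(interval_in_tones(CKTL_PEAK_HZ, FEMALE_MODE_F0_HZ), 1))         # 5.8
print(round(abs(interval_in_tones(FEMALE_MODE_F0_HZ, CKTL_MINIMUM_HZ)), 1))  # 0.2

# Male mode relative to the first noise peak.
print(round(interval_in_tones(CKTL_PEAK_HZ, MALE_MODE_F0_HZ), 1))           # 0.5
```

Under this assumption the three values reproduce the 5.8, 0.2 and 0.5 tones reported in the paragraph.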


Fig. 7. Effect of noise exposure on the distribution of f0 values, as a function of noise type (Broadband noise (BB), Cocktail-party noise (CKTL)) for the ten speakers of this study. These distributions are normalized in amplitude so their integration sums to 100%. The spectrum profile of the CKTL noise is represented at the top. The dashed vertical lines across the panels correspond to the frequency area where the CKTL noise spectrum presents a local maximum (∼170 Hz) and minimum (∼340 Hz). The bottom table summarizes the main effects and interaction of the factors CONDITION and GENDER on the mean syllable f0. Specific contrasts tested the effect of noise exposure ((BB and CKTL noises) vs. quiet) and noise type (BB vs. CKTL), as well as the influence of gender (Females vs. Males) on these effects.

When considering vowel spectra, the amplitude of the two first voice harmonics was found to be comparable in CKTL noise for male speakers (H1–H2 = 0.1 ± 2.1 dB SPL), whereas the amplitude of the first harmonic was always much greater than that of the second harmonic for female speakers in the same condition (H1–H2 = 8.4 ± 2.9 dB SPL). Thus, even if male speakers do not raise f0 high enough to reach the local minimum of the CKTL noise spectrum (around 340 Hz), they may raise it high enough for the second harmonic (2f0) to be located in that frequency region. The most frequent value of 2f0 for males was indeed measured on average around 360 Hz in CKTL noise.
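The harmonic argument can be illustrated with a small helper that finds which harmonic of a given f0 falls closest to the local minimum of the CKTL noise. The helper name and the limit of ten harmonics are arbitrary choices for this sketch; the f0 values are the most frequent values reported above.

```python
def nearest_harmonic(f0_hz, target_hz, max_harmonic=10):
    """Return (k, k * f0_hz) for the harmonic closest to target_hz in Hz."""
    best_k = min(range(1, max_harmonic + 1),
                 key=lambda k: abs(k * f0_hz - target_hz))
    return best_k, best_k * f0_hz


CKTL_MINIMUM_HZ = 340.0  # local minimum of the CKTL noise spectrum (from the text)

# Most frequent f0 values in CKTL noise reported in the text.
print(nearest_harmonic(334.0, CKTL_MINIMUM_HZ))  # females
print(nearest_harmonic(180.0, CKTL_MINIMUM_HZ))  # males
```

For the female value the fundamental itself is the closest component, whereas for the male value it is the second harmonic, in line with the 2f0 of about 360 Hz mentioned in the text.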

As BB noise spectrum presents no local minimum in energy below 10 kHz, one might expect speakers to still raise f0 in BB noise, because of the well-known relationship between the variation of f0 and vocal intensity (Titze, 1989), but to a lesser extent than in CKTL noise, and without any specific adjustment of the f0 distribution around 340 Hz.

Fig. 8. Effect of noise exposure on the first formant frequency of [a], [i] and [u] vowels, as a function of noise type (Broadband noise (BB), Cocktail-party noise (CKTL)) and speaker’s gender. On the left plots, the dashed vertical line across the panels corresponds to the frequency area where the CKTL noise spectrum presents a local minimum (∼340 Hz). Error bars indicate the standard deviation over the five speakers of each gender. The table on the right side summarizes the main effects and interaction of the factors CONDITION and GENDER on the first formant frequency of the three cardinal vowels of French: [a], [i] and [u]. Specific contrasts tested the effect of noise exposure ((BB and CKTL noises) vs. quiet) and noise type (BB vs. CKTL), as well as the influence of gender (Females vs. Males) on these effects.

Syllable f0 was indeed found to increase significantly in BB noise (by 5.5 ± 0.5 tones on average), by a similar extent for both males and females. However, contrary to our expectations, this increase was greater in BB noise than in CKTL noise (by 1.1 ± 0.5 tones). As a consequence, the most frequent f0 of female speakers and the most frequent 2f0 value of male speakers in BB noise were not specifically adjusted around 340 Hz.

3.2.3. Shift of the F1 information to regions of minimum energetic masking

Similarly to the f0 range, the range of F1 is situated below 1 kHz, where the CKTL noise spectrum has maximum energy and exerts greater energetic masking. A bypass strategy would be to adjust F1 to frequency regions where the CKTL noise has reduced energy. The region around 340 Hz, where the CKTL noise spectrum has a local minimum in energy, may be an accessible range to shift the F1 of closed vowels. On the other hand, the F1 of open vowels would have to be shifted above 800 Hz if speakers aimed at improving the audibility of that information.

Fig. 8 represents the variation of F1 with noise exposure for the two closed vowels of French, [i] and [u], and for the most open vowel [a], as a function of noise type and speaker’s gender. The table on its right side gives the results of statistical analysis. F2 was not examined here, since its range is already situated above 800 Hz, where the energy of the CKTL noise is considerably reduced, and since our research question is not whether the vowel space is reduced or expanded in noise, but to determine whether the variations of formant frequencies can contribute to release the degree of energetic masking.

For the open vowel [a], F1 was found to increase on average by 140 ± 30 Hz, by a similar extent for both genders and both types of noise. For females, F1 values were raised above 800 Hz, where the energy of the CKTL noise is considerably decreased. For males, however, F1 values in noise remained just below 700 Hz, in the frequency region corresponding to the second peak of the CKTL noise.

For the closed vowels [i] and [u], females were found to increase F1 by an average of 145 ± 36 Hz, similarly to the open vowel [a], whereas males increased F1 by only 72 ± 24 Hz on average. Thus, gender was found to have a significant effect on the modification of F1 in noise for these two closed vowels. Furthermore, a significant interaction was observed between noise type and speaker’s gender in the variation of F1 for these closed vowels; F1 tended to increase more in BB noise for females (by 43 Hz on average) and more in CKTL noise for males (by 21 Hz on average). As a result, modifications of the closed vowels [i] and [u] in CKTL noise contributed, for both genders, toward shifting their first formant away from the first main peak of the CKTL noise spectrum (around 170 Hz), and close to the frequency where the CKTL noise spectrum presents a local minimum (around 340 Hz) (see Fig. 9). These modifications appear to be specifically adapted to the characteristics of the CKTL noise, as they were not observed in BB noise.

Fig. 9. Effect of noise exposure on the amount of speech modulation in intensity and frequency, as a function of noise type (Broadband noise (BB), Cocktail-party noise (CKTL)) and speaker’s gender. On the left plots, error bars indicate the standard deviation over the five speakers of each gender. The bottom table summarizes the main effects and interaction of the factors CONDITION and GENDER on speech modulation in amplitude and frequency. Specific contrasts tested the effect of noise exposure ((BB and CKTL noises) vs. quiet) and noise type (BB vs. CKTL), as well as the influence of gender (Females vs. Males) on these effects.

3.3. Modulation strategies

3.3.1. Modulation of vocal intensity within each syllable

Speakers significantly enhanced the intensity dynamics of their syllables when they adapted to noise, by 3.3 ± 1.2 dB SPL on average (see Fig. 9a). A greater modulation of syllable intensity was observed in BB noise (4.8 ± 2.0 dB SPL) than in CKTL noise (1.8 ± 1.1 dB SPL) and this significant effect of noise type was more accentuated in females than in males.

3.3.2. Modulation of f0 over the long term

In Fig. 7, it can be seen that speakers not only raise f0 when they adapt from quiet to 86 dB SPL of noise but also significantly widen their f0 range, by an average of 0.8 ± 0.4 tones. The amount of widening was not significantly affected by gender or noise type (see Fig. 9b).
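The two modulation measures can be sketched as simple range statistics. Both definitions below are assumptions made for illustration only, since this section does not restate how the measures were computed: intensity dynamics is taken as the difference between the highest and lowest level within a syllable, and f0 range as the interval in tones between the lowest and highest voiced f0 values. A real analysis would probably use percentiles rather than extremes to limit the influence of tracking errors.

```python
import math


def intensity_dynamics_db(levels_db):
    """Max minus min level within one syllable, in dB (assumed definition)."""
    return max(levels_db) - min(levels_db)


def f0_range_tones(f0_values_hz):
    """Interval between lowest and highest voiced f0, in tones (assumed definition).

    Frames with f0 <= 0 are treated as unvoiced and ignored.
    One tone is taken as 1/6 of an octave.
    """
    voiced = [f for f in f0_values_hz if f > 0]
    if len(voiced) < 2:
        raise ValueError("need at least two voiced frames")
    return 6 * math.log2(max(voiced) / min(voiced))


# Toy contours (made-up numbers, not data from the study).
syllable_levels_db = [60.0, 72.5, 66.0]
f0_track_hz = [100.0, 0.0, 150.0, 200.0]

print(intensity_dynamics_db(syllable_levels_db))
print(f0_range_tones(f0_track_hz))
```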

4. Discussion and conclusions

4.1. Do speakers adopt boosting, bypass or modulation strategies to improve their audibility in noise?

As in previous studies, we found a significant effect of noise type on the modification of speech intensity but no significant effect on voice spectrum. In this study, this effect of the noise type can be summarized by two main conclusions:

(1) The relative energy between 0 and 1 kHz was not found to be more enhanced in CKTL noise, neither was the speech CoG more shifted toward high frequencies in CKTL noise.

So, contrary to Mokbel (1992), who showed that a speaker enhanced his speech energy more in the frequency band where noise was concentrated, and Junqua et al. (1998), who showed that the increase of vocal intensity varied with noise spectral tilt at constant masker level, our results do not support the existence of such boosting strategies in speakers communicating in low frequency noise. However, our observations are consistent with those of Lu and Cooke (2009a,b), who also observed a greater shift of speech CoG toward high frequencies in broadband noises and in high frequency noises rather than in low frequency noises.

(2) Noise type was not found to interact with the type of speech segment and with the speaker’s gender in the modification of segments’ intensity and duration, syllable intensity and speech CoG:

- Regardless of the speaker’s gender and the type of segment, a greater increase of these parameters was observed in BB noise compared to CKTL noise, although CKTL noise induces greater energetic masking on speech. This result is in agreement with Egan (1972) and Lu and Cooke (2009b), who observed a greater increase of voice intensity in broadband noise than in noise enhanced in low frequencies or high frequencies. It does not reflect the results of previous studies (Junqua et al., 1998; Ternström et al., 2002; Jung, 2012), in which white noise, speech-shaped noise, music noise or driving noise did not induce any different adaptation in vocal effort. However, these studies compared noise types at similar perceived loudness (in dB A), whereas our own study compared CKTL and BB noises at similar physical levels in dB SPL.

- Regardless of the noise type, vocal intensity and duration were found to increase significantly more for vowels than for consonants. Sonorant consonants followed the same tendency as unvoiced consonants, despite their spectral similarity to vowels. This observation is consistent with other studies (Castellanos et al., 1996; Junqua, 1993) that also reported the vowel-to-consonant ratio to increase in both intensity and duration in noise for each kind of consonant. However, this greater adaptation of vowels could also result from the small vocabulary size and the weak confusability between the target words of our experiment, which were only distinguishable by their vowels (except the pairs ‘cochon’ (pig) vs. ‘chausson’ (slipper), and ‘navet’ (turnip) vs. ‘vallée’ (valley)).

- Regardless of the noise type, females raised their CoG more with noise exposure than did males. This partly comes from a notable decrease of the relative energy above 4 kHz in males. Again, this observation is in agreement with previous studies (Castellanos et al., 1996; Junqua, 1993) that reported an increase of speech energy between 4 and 5 kHz in female Lombard speech, whereas the greatest enhancement of speech energy in males was observed between 2 and 4 kHz (Castellanos et al., 1996; Junqua, 1993).

As a consequence, our observations do not support the hypothesis that speakers modify their speech intensity and spectrum in noise in a way that specifically compensates for the degree of energetic masking, using either a boosting or a bypass strategy. Instead, our results rather support the idea that speakers adapt their level of vocal effort to the perceived loudness of the background noise, greater in broadband noise than in low frequency noises such as the CKTL noise of this study (because of the greater ear-sensitivity in the 2–4 kHz frequency band). Thus, regardless of the spectral content of the background noise, the type of speech segment and the speaker’s gender, the main strategy to cope with noise appears to consist in increasing vocal intensity to a level that preserves a positive SNR (in dB SPL).
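The difference between a physical level in dB SPL and a level in dB A can be made concrete with the standard A-weighting curve. The sketch below uses the usual analytic form of that curve (as specified in IEC 61672); it is offered only to illustrate why a noise concentrated below 1 kHz reads lower in dB A than a broadband noise at the same dB SPL, and it is not a description of how the cited studies measured loudness.

```python
import math


def a_weighting_db(f_hz):
    """A-weighting gain in dB at frequency f_hz (standard analytic form)."""
    f2 = f_hz ** 2
    numerator = (12194.0 ** 2) * f2 ** 2
    denominator = ((f2 + 20.6 ** 2)
                   * math.sqrt((f2 + 107.7 ** 2) * (f2 + 737.9 ** 2))
                   * (f2 + 12194.0 ** 2))
    # The +2.00 dB offset normalizes the curve to about 0 dB at 1 kHz.
    return 20 * math.log10(numerator / denominator) + 2.00


# Frequencies of interest in the text: the CKTL peaks and minimum,
# the 1 kHz reference, and the 2-4 kHz region of maximum ear sensitivity.
for f in (170, 340, 500, 1000, 3000):
    print(f, round(a_weighting_db(f), 1))
```

The curve attenuates components around the CKTL peaks by several dB and slightly emphasizes the 2–4 kHz region, which is consistent with the interpretation that the broadband noise was perceived as louder at an equal level in dB SPL.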

This main strategy of increasing vocal effort can very well account for the raised f0, F1 and CoG observed in Lombard speech, as these acoustical modifications commonly accompany the increase of vocal intensity (Titze and Sundberg, 1992; Sundberg and Nordenberg, 2006) and are observed in shouted speech, even in absence of background noise (Rostolland, 1982; Schulman, 1989; Lienard and Di Benedetto, 1999; Traumüller and Erikkson, 2000).

Nevertheless, additional modifications that can improve speech audibility in noise have also been observed in this study, while they are not directly related to the increase of vocal intensity.

In CKTL noise, speakers of both genders shifted their f0 distribution so that their most frequent f0 or 2f0 value was specifically adjusted in the frequency region where the CKTL noise presented a local minimum in energy. Likewise, we observed an interaction between noise type, vowel type (open vs. closed) and speaker’s gender. In CKTL noise, speakers of both genders specifically adjusted the F1 of closed vowels ([i] and [u]) in the frequency region where the CKTL noise had a local minimum in energy. Such an improved contrast in spectrum, f0 and F1 is predicted to release the energetic masking effect of the CKTL noise and to improve the segregation of Lombard speech from another competing speech stream (Culling and Darwin, 1993; Darwin, 1981).

Furthermore, a more detailed examination of the spectra of Lombard vowels indicated that the shift of speech CoG toward higher frequencies came not only from the flattening of the spectral slope, as commonly observed when vocal effort is increased, but also from specific higher-formant clustering and amplitude enhancement. This recalls the “singing formant” or the “actor’s formant” observed in opera singers who sing over an orchestra, or in stage actors who have to project their voice at a distance (Bele, 2006). This boosting strategy enhances energy around 3–4 kHz, where the human ear is most sensitive to sound pressure level, and improves voice ‘projection’ and audibility. Such an energy boost of the 1–4 kHz region is not observed in the shouted speech of untrained speakers, while it is reported in clear speech (Krause and Braida, 2004). Both artificial flattening of the spectral tilt and artificial enhancement of speech energy above 1.5 kHz have proved to be efficient techniques of speech enhancement (Horwitz et al., 2008; Skowronski and Harris, 2006).

Lastly, modulation strategies were observed, as greater modulations of f0 and vocal intensity were found in Lombard speech than in speech produced in quiet, both for male and female speakers. These modulation strategies are not directly related to increased vocal effort, as shouted voice demonstrates reduced f0 modulations (Rostolland, 1982). On the contrary, a wider f0 range (Picheny et al., 1986) and an increased low-frequency modulation of the intensity envelope (Krause and Braida, 2004) have also been observed in clear speech, and in speakers who are intrinsically more intelligible than others (Bradlow et al., 1996). Several psychoacoustic studies have shown that vowels degraded by an interfering sound are better detected and recognized when their f0 is modulated (Ishizuka and Aikawa, 2002). Sound stream segregation is also improved by intensity modulation and temporal fluctuations (Ishizuka and Aikawa, 2002). Synthesis of natural f0-contour variations has been shown to improve intelligibility of speech compared to flat contours (Laures and Weismer, 1999). Segmental intelligibility was worsened when applying compression effects that decrease the modulation in intensity (Boike and Souza, 2000; Hornsby and Ricketts, 2001). Consequently, the enhanced modulation of f0 and vocal intensity observed here in Lombard speech is likely to reflect the intention of the speaker to improve his intelligibility.

4.2. The Lombard effect, a two-level adaptation

In agreement with Lu and Cooke (2009b), our results suggest that there is no systematic “active” adaptation to the spectral characteristics of the ambient noise. The increase in vocal intensity, the spectral shift, the rise in f0 and F1 appear to be primary and unavoidable features of Lombard speech, regardless of the type of noise and the speaker. The same tendencies have been observed in shouted speech, in the absence of ambient noise (Lienard and Di Benedetto, 1999; Stanton et al., 1988; Titze, 1989), and in Lombard speech with or without interaction (Amazi and Garber, 1982; Garnier et al., 2010). All these modifications may be interpreted as the different facets of one main action, which is to increase voice intensity in noisy conditions.

However, when compatible with this primary tendency, there also appear to be subtler and secondary variations of these same parameters in ways that optimize the acoustic contrast with the background noise. For example, the distribution of f0 values is not only shifted to higher frequencies in noise (primary strategy of increasing voice intensity), but it is also widened and specifically adjusted in the frequency area where the CKTL noise has minimum energy (secondary strategies for improving acoustic contrasts).

This leads us to support the idea that speech adaptation to noise may be broken down into two hierarchical “levels” of strategies. These primary and secondary speech adaptations can be related to different mechanisms underlying the Lombard effect. Thus, primary and unvarying modifications of speech in noise, related to the global increase of vocal effort, may very well be related to the automatic and uncontrollable regulation of voice intensity that leads speakers to raise their voice when they get an attenuated feedback of their own voice (Pick et al., 1989). Furthermore, the Lombard effect also depends on communicative interaction, as vocal intensity and related parameters are more modified in noise for interactive than non-interactive situations (Amazi and Garber, 1982; Garnier et al., 2010). Additional modifications that do not relate directly to vocal intensity are observed in noise for interactive situations only (Garnier et al., 2010). This communicative mechanism that also contributes to the Lombard effect may very well account for the secondary modifications of speech parameters observed in this current study, when they are compatible with the primary ones.

4.3. Implications of this work

Speech modifications examined in this study have applications to the improvement of speech perception when speech is broadcast in noisy environments, such as train stations, car interiors, places with ventilation or multi-talker noise. In background noise of low frequency energy, it may be beneficial to shift speech f0, formants and spectrum toward higher frequencies, as is observed in Lombard speech. Irrespective of the noise type, speech auditory detection and segregation may be improved by enhancing both frequency and amplitude modulation of speech, and boosting speech energy in the 2–4 kHz frequency band.

It is more difficult to predict the perceptual consequence of conjoint speech modifications, as their effects are not simply additive. Thus, conjoint modifications may either lead to counterproductive outcomes or may have particularly positive consequences for speech auditory detection. For example, auditory detection may be particularly improved when enhancing speech modulation together with shifting the speech spectrum toward the 2–4 kHz region, a frequency band where the human ear is most sensitive not only to sound pressure level but also to acoustic contrasts (Jesteadt et al., 1977; Wier et al., 1977).

Likewise, speech modifications can have multiple effects on these different aspects of intelligibility (audibility, phoneme recognition, utterance parsing). For instance, the enhanced modulation of speech in f0 and intensity may both improve speech segregation from a background noise and the segmentation of the utterance into lexical units (Garnier et al., 2010; Welby, 2006). On the other hand, raised f0 and F1 may improve speech segregation from a multi-talker noise, but may simultaneously degrade acoustic cues to vowel recognition.

These results may also have further applications for injury prevention and therapy in the case of vocal misuse and abuse in noisy working places (preschool teachers, bartenders, factory workers, etc.). Among the different observed modifications, enhancing speech modulation and spectral energy in the 2–4 kHz region are communicative techniques that can be taught to people in order to improve their speech audibility in a safer way than simply increasing vocal intensity.

Acknowledgements

We are grateful to Danièle Dubois for fruitful discussions on the methodological aspects of this study. We also warmly thank the 10 speakers who kindly agreed to participate in this experiment, despite the discomfort of the noisy situations.

References

Amazi, D.K., Garber, S.R., 1982. The Lombard sign as a function of age and task. J. Speech Lang. Hear. Res. 25, 581–585.
Arons, B., 1992. A review of the cocktail party effect. J. Amer. Voice I/O Soc. 12 (7), 35–50.
Assmann, P.F., Summerfield, Q., 1990. Modeling the perception of concurrent vowels: vowels with different fundamental frequencies. J. Acoust. Soc. Am. 88, 680–697.
Bele, I.V., 2006. The Speaker’s formant. J. Voice 20, 555–578.
Boike, K.T., Souza, P.E., 2000. Effect of compression ratio on speech recognition and speech-quality ratings with wide dynamic range compression amplification. J. Speech Lang. Hear. Res. 43, 456–468.


Boril, H., Pollak, P., 2005. Design and collection of Czech Lombard database. In: Proceedings of ICSLP, Lisbon, Portugal, pp. 1577–1580.
Bradlow, A.R., Torretta, G.M., Pisoni, D.B., 1996. Intelligibility of normal speech I: global and fine-grained acoustic-phonetic talker characteristics. Sp. Commun. 20, 255–272.
Brown, G., Anderson, A., Yule, G., Shillcock, R., 1983. Teaching Talk. Cambridge University Press, Cambridge.
Brungart, D., 2001. Informational and energetic masking effects in the perception of two simultaneous talkers. J. Acoust. Soc. Am. 109, 1101–1109.
Castellanos, A., Benedi, J.M., Casacuberta, F., 1996. An analysis of general acoustic-phonetic features for Spanish speech produced with the Lombard effect. Sp. Commun. 20, 23–35.
Culling, J.F., Darwin, C.J., 1993. Perceptual separation of simultaneous vowels: within and across-formant grouping by F0. J. Acoust. Soc. Am. 93, 3454–3467.
Darwin, C.J., 1981. Perceptual grouping of speech components differing in fundamental frequency and onset-time. Q. J. Exp. Psychol. 1981 (33A), 185–207.
Darwin, C.J., Brungart, D.S., Simpson, B.D., 2003. Effects of fundamental frequency and vocal-tract length changes on attention to one of two simultaneous talkers. J. Acoust. Soc. Am. 114, 2913–2922.
Dreher, J.J., O’Neill, J., 1957. Effects of ambient noise on speaker intelligibility for words and phrases. J. Acoust. Soc. Am. 29, 1320–1323.
Egan, J.J., 1972. Psychoacoustics of the Lombard voice response. J. Aud. Res. 12, 318–324.
French, N., Steinberg, J., 1947. Factors governing the intelligibility of speech sounds. J. Acoust. Soc. Am. 19, 90–119.
Garnier, M., 2008. May speech modifications in noise contribute to enhance audio-visible cues to segment perception? In: Proceedings of AVSP, Moreton Island, Australia, pp. 95–100.
Garnier, M., Dohen, M., Lœvenbruck, H., Welby, P., Bailly, L., 2006. The Lombard effect: a physiological reflex or a controlled intelligibility enhancement? In: Proceedings of ISSP, Ubatuba, Brazil, pp. 255–262.
Garnier, M., Henrich, N., Dubois, D., 2010. Influence of sound immersion and communicative interaction on the Lombard effect. J. Speech Lang. Hear. Res. 53, 588–608.
Hornsby, B.W., Ricketts, T.A., 2001. The effects of compression ratio, signal-to-noise ratio, and level on speech recognition in normal-hearing listeners. J. Acoust. Soc. Am. 109, 2964–2973.
Horwitz, A.R., Ahlstrom, J.B., Dubno, J.R., 2008. Factors affecting the benefits of high-frequency amplification. J. Speech Lang. Hear. Res. 51, 798–813.
Ishizuka, K., Aikawa, K., 2002. Effect of F0 fluctuation and amplitude modulation of natural vowels on vowel identification in noisy environments. In: Proceedings of ICSLP, Denver, USA, pp. 1633–1636.
Jesteadt, W., Wier, C.C., Green, D.M., 1977. Intensity discrimination as a function of frequency and sensation level. J. Acoust. Soc. Am. 61, 169–177.
Junqua, J., 1993. The Lombard reflex and its role on human listeners and automatic speech recognizers. J. Acoust. Soc. Am. 93, 510–524.
Jung, O., 2012. On the Lombard effect induced by vehicle interior driving noises, regarding sound pressure level and long-term average speech spectrum. Acta Acust. United Acust. 98 (2), 334–341.
Junqua, J.-C., Fincke, S., et al., 1998. Influence of the speaking style and the noise spectral tilt on the Lombard reflex and automatic speech recognition. In: Proceedings of ICSLP, Sydney.
Kadiri, N., 1998. Conséquences d’un environnement bruité sur la production de la parole. [Effect of noise exposure on speech production]. Toulouse University, France (Ph.D. Thesis).
Krause, J.C., Braida, L.D., 2004. Acoustic properties of naturally produced clear speech at normal speaking rates. J. Acoust. Soc. Am. 115, 362–378.
Laures, J.S., Weismer, G., 1999. The effects of a flattened fundamental frequency on intelligibility at the sentence level. J. Speech Lang. Hear. Res. 42, 1148–1156.
Lienard, J.S., Di Benedetto, M.G., 1999. Effect of vocal effort on spectral properties of vowels. J. Acoust. Soc. Am. 106, 411–422.
Lu, Y., Cooke, M., 2008. Speech production modifications produced by competing talkers, babble, and stationary noise. J. Acoust. Soc. Am. 124, 3261–3275.
Lu, Y., Cooke, M., 2009a. The contribution of changes in F0 and spectral tilt to increased intelligibility of speech produced in noise. Sp. Commun.

51, 1253–1262.Lu, Y., Cooke, M., 2009b. Speech production modifications produced in the presence of low-pass and high-pass filtered noise. J. Acoust. Soc. Am.

126, 1495–1499.Marin, C.M., McAdams, S., 1991. Segregation of concurrent sounds II: effects of spectral envelope tracing, frequency modulation coherence, and

frequency modulation width. J. Acoust. Soc. Am. 89, 341–351.Mokbel, C., 1992. Reconnaissance de la parole dans le bruit: bruitage/débruitage [Speech recognition in noise: Noise degradation vs. cancelation].

Ecole Nationale Supérieure des Télécommunications, Paris, France (Ph.D. Thesis).Picheny, M.A., Durlach, N.I., Braida, L.D., 1986. Speaking clearly for the hard of hearing II: acoustic characteristics of clear and conversational

speech. J. Speech Hear. Res. 29, 434–446.Pick, H.L., Siegel, G.M., Fox, P.W., Garber, S.R., Kearney, J.K., 1989. Inhibiting the Lombard effect. J. Acoust. Soc. Am. 85, 894–900.Pittman, A.L., Wiley, T.L., 2001. Recognition of speech produced in noise. J. Speech Lang. Hear. Res. 44, 487–496.Rostolland, 1982. Phonetic structure of shouted voice. Acta Acust. 51, 80–89.Schulman, 1989. Articulatory dynamics of loud and normal speech. J. Acoust. Soc. Am. 85, 295–312.Skowronski, M.D., Harris, J.G., 2006. Applied principles of clear and Lombard speech for automated intelligibility enhancement in noisy environ-

ments. Sp. Commun. 48, 549–558.

Stanton, B.J., Jamieson, L.H., Allen, G.D., 1988. Acoustic-phonetic analysis of loud and Lombard speech in simulated cockpit conditions. In: Proceedings of ICASSP, New York, USA, pp. 331–334.
Sundberg, J., Nordenberg, M., 2006. Effects of vocal loudness variation on spectrum balance as reflected by the alpha measure of long-term-average spectra of speech. J. Acoust. Soc. Am. 120, 453–457.

Ternström, S., Sodersten, M., Bohman, M., 2002. Cancellation of simulated environmental noise as a tool for measuring vocal performance during noise exposure. J. Voice 16, 195–206.

Titze, I.R., 1989. On the relation between subglottal pressure and fundamental frequency in phonation. J. Acoust. Soc. Am. 85, 901–906.
Titze, I.R., Sundberg, J., 1992. Vocal intensity in speakers and singers. J. Acoust. Soc. Am. 91 (5), 2936–2946.
Traunmüller, H., Eriksson, A., 2000. Acoustic effects of variation in vocal effort by men, women, and children. J. Acoust. Soc. Am. 107, 3438–3451.
Van Summers, W., Pisoni, D.B., Bernacki, R.H., Pedlow, R.I., Stokes, M.A., 1988. Effects of noise on speech production: acoustic and perceptual analyses. J. Acoust. Soc. Am. 84, 917–928.
Welby, P., 2006. Intonational differences in Lombard speech: looking beyond F0 range. In: Proceedings of Speech Prosody, Dresden, Germany, pp. 763–766.

Wier, C.C., Jesteadt, W., Green, D.M., 1977. Frequency discrimination as a function of frequency and sensation level. J. Acoust. Soc. Am. 61, 178–184.
Zeiliger, J., Serignat, J.F., Autresserre, D., Meunier, C., 1994. BD Bruit, une base de données de parole de locuteurs soumis à du bruit [BD Bruit, a database of speakers exposed to noise]. In: Proceedings of JEP, Grenoble, France, pp. 287–290.