
Hearing Research 303 (2013) 58–66

Contents lists available at SciVerse ScienceDirect

Hearing Research

journal homepage: www.elsevier.com/locate/heares

Review

The neural processing of masked speech

Sophie K. Scott a,*, Carolyn McGettigan b

a Institute of Cognitive Neuroscience, UCL, 17 Queen Square, London WC1N 3AR, UK
b Department of Psychology, Royal Holloway University of London, Egham, Surrey TW20 0EX, UK

Article info

Article history:
Received 12 February 2013
Received in revised form 29 April 2013
Accepted 3 May 2013
Available online 16 May 2013

* Corresponding author. Tel.: +44 7881853586. E-mail address: [email protected] (S.K. Scott).

0378-5955/$ – see front matter © 2013 Published by Elsevier B.V.
http://dx.doi.org/10.1016/j.heares.2013.05.001

Abstract

Spoken language is rarely heard in silence, and a great deal of interest in psychoacoustics has focused on the ways that the perception of speech is affected by properties of masking noise. In this review we first briefly outline the neuroanatomy of speech perception. We then summarise the neurobiological aspects of the perception of masked speech, and investigate this as a function of masker type, masker level and task.

This article is part of a Special Issue entitled “Annual Reviews 2013”.
© 2013 Published by Elsevier B.V.

1. Introduction

1.1. The neural basis of speech perception

The perceptual processing of heard speech is associated with considerable subcortical and cortical processing, and in humans the auditory cortical fields lie in the left and right dorsolateral temporal lobes. Primary auditory cortex (PAC), which receives all its input from the auditory thalamus, lies on Heschl’s gyrus, which is on the supratemporal plane – that is, the cortical surfaces of the dorsolateral temporal lobes that extend into the Sylvian fissure (Fig. 1). Auditory association cortex, to which PAC projects, and which also receives inputs from the auditory thalamus, surrounds PAC on the supratemporal plane and extends along the superior temporal gyrus (STG) to the superior temporal sulcus (STS).

Historically, our understanding of the kinds of perceptual mechanisms found in auditory cortex has been driven by studies of neuropsychological patients, for example Wernicke’s pioneering work outlining the effect that lesions of left STG played in deficits of the reception of speech – that is, sensory aphasia. This neuropsychological work with patients who have acquired brain damage has continued. For example, Johnsrude et al. (2000) demonstrated that patients with temporal lobe resections, both left and right, had preserved pitch perception, but patients with right temporal lobe resections had specific problems identifying the direction of pitch change. This work suggested strongly that right temporal lobe mechanisms were not associated with determining pitch per se, but were critical to aspects of the detection of structure in pitch variation. However, recent developments in functional imaging have allowed us both to better characterize the extent and locations of lesions in neuropsychological investigations, and to determine the relationship(s) between structure and function in the brains of healthy adults. This has been a tremendous development in the field of cognitive neuroscience generally, and for the fields of speech and hearing in particular, as it has permitted us to move beyond constructs such as ‘Broca’s area’ and ‘Wernicke’s area’ when discussing the neurobiology of speech and sound processing (Rauschecker and Scott, 2009; Scott and Johnsrude, 2003; Wise et al., 2001). Though these concepts have been critical to our understanding of different profiles of aphasia, they tend to be less helpful as anatomical frameworks, not least because the boundaries of the cortical fields people would be prepared to call Wernicke’s area have widened considerably over the intervening 150 years (Wise et al., 2001). Such variability weakens the specificity of “Wernicke’s area” as an anatomical construct.

1.2. Functional neuroimaging

Functional neuroimaging techniques, such as positron emission tomography (PET) and functional magnetic resonance imaging (fMRI), allow us to try and map between structure and function in the human brain. In this review we will address the neural systems recruited during both energetic and informational masking, and also discuss the ways that these have been used to modulate intelligibility in neuroimaging studies. Energetic masking has been largely associated with the masking effects of steady-state noise, although it has recently been demonstrated that there are important contributions from random amplitude fluctuations in the noise (Stone et al., 2011, 2012). As these contributions have not been investigated with functional imaging, in this manuscript we will refer to speech masked with noise as energetic/modulation masking.

Fig. 1. Lateral view of the left cortical hemisphere, showing cortical fields important in speech perception and comprehension.


Informational masking has been associated with elements of masking which extend beyond the influence of energetic/modulation masking, for example, associated with masking speech or other complex sounds (Brungart, 2001).

We confine the present discussion to fMRI and PET. The spatial and temporal resolution of these techniques is relatively poor, but they permit the resolution of peaks of neural activity within millimetres, which allows for localization of activity within gyral anatomy. Furthermore, PET and fMRI both permit the detection of neural activity across the whole brain, including the cerebral cortex and subcortical structures, which means that they are ideal techniques for looking at networks of activity. We note that several electrophysiological studies have investigated neural responses to speech in masking sounds (Dimitrijevic et al., 2013; Parbery-Clark et al., 2011; Riecke et al., 2012; Xiang et al., 2010). These techniques have the advantage of very fine temporal analyses of neural responses, and the increasing use of combined methods will be able to exploit the benefits of both approaches.

Both PET and fMRI utilize local changes in blood flow as an index of neural activity, since where there are regional increases in neural activity, there are immediate increases in the blood supply to that region. PET measures changes in the blood flow by detecting increases in the number of gamma rays emitted following decay of a radiolabelled tracer in the blood. fMRI detects differences in the proportion of oxygenated to deoxygenated blood – the BOLD (blood oxygenation level dependent) response (Ogawa et al., 1990). Though equally sensitive across the whole brain, and silent, PET scans are limited by the half-life of the tracers used, which typically take minutes to decay, and regulations regarding the total amount of radioactivity that can be delivered to any one participant. This has the effect of reducing the statistical power of PET, which is nowadays used less often for non-clinical functional scans. fMRI, in contrast, allows many more scans per condition, and thus has greater potential power. However, there can be serious signal loss in fMRI due to artefacts caused by air-bone interfaces, such as those found at the anterior temporal lobes. Signal lost in this way cannot be recovered (but ongoing work in MRI physics aims to reduce these artefactual limitations; Posse, 2012). Finally, the echo-planar imaging sequences used to acquire fMRI data generate a great deal of acoustic noise. While there is continued interest in developing silent sequences (e.g. Peelle et al., 2010), the vast majority of fMRI studies of audition either use continuous scanning (where scanner noise is always present), or employ a sparse scanning technique (Edmister et al., 1999; Hall et al., 1999). The latter takes advantage of the fact that the BOLD response builds to a peak over a lag of several seconds (4–6 s) after the onset of a perceptual/cognitive event. This means that it is possible to deliver acoustic stimuli during a quiet period and to time the noisy echo-planar image acquisition to occur later, coincident with the estimated peak of the blood flow response. Importantly, the maps of activation seen in standard neuroimaging studies are made by contrasting the condition of interest with a baseline condition, meaning that we are looking at maps of activation relative to some other state of whole-brain activation. This subtractive approach means that the selection of a baseline condition can have profound implications for the pattern of observed results.
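To make the sparse-sampling logic concrete, the following Python sketch schedules stimulus onsets so that each noisy image acquisition coincides with the assumed 4–6 s BOLD peak. This is an illustrative sketch, not code from any of the studies discussed; the timing parameters (an 8 s silent gap, a 5 s stimulus-to-peak lag) are assumptions chosen for the example.

```python
# Hypothetical sparse-sampling schedule: stimuli are presented in the
# silent gaps between volume acquisitions, timed so that the assumed
# BOLD peak (bold_lag seconds after stimulus onset) lands on the next
# noisy echo-planar acquisition.

def sparse_onsets(n_trials, acq_time=2.0, silent_gap=8.0,
                  stim_duration=3.0, bold_lag=5.0):
    """Return (stimulus_onset, acquisition_onset) pairs in seconds."""
    schedule = []
    t = 0.0
    for _ in range(n_trials):
        acq_onset = t + silent_gap          # next acquisition starts here
        stim_onset = acq_onset - bold_lag   # put the BOLD peak on the volume
        # The stimulus must fit inside the quiet period before the volume.
        assert stim_onset >= t and stim_onset + stim_duration <= acq_onset
        schedule.append((stim_onset, acq_onset))
        t = acq_onset + acq_time            # scanner noise ends here
    return schedule

for stim, acq in sparse_onsets(3):
    print(f"stimulus at {stim:5.1f} s -> volume acquired at {acq:5.1f} s")
```

In a real design the gap and lag would of course be set from the scanner's acquisition time and pilot estimates of the haemodynamic delay, rather than the fixed values assumed here.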

1.3. Functional neuroimaging studies of speech perception

Functional imaging has revealed considerable complexity in the ways that the human brain engages with speech and sound (Scott and Johnsrude, 2003). The perception of spoken language was addressed in some very early functional imaging studies, and there is a largely uniform finding of strong bilateral activation of the superior temporal gyri in response to human speech, relative to silence (Wise et al., 1991). Within this, primary auditory cortex has been shown to have no specific role in speech perception (e.g. Mummery et al., 1999), and the planum temporale is similarly not selective in its response to speech (Binder et al., 1996). Speech-selective responses (defined as neural activity which is significantly greater than that seen in response to a complex auditory baseline) have been described in left anterior temporal fields, lying in the STG/STS, lateral and anterior to PAC. These speech-selective responses have been seen for a number of different kinds of linguistic information, from intelligible speech (Narain et al., 2003; Okada et al., 2010; Scott et al., 2000), including sentences and single words (Cohen et al., 2004; Scott and Wise, 2004), to phonotactic structure (Jacquemot et al., 2003) and single phonemes (Agnew et al., 2011). This profile has been linked to the rostrally directed ‘what’ pathways described in primate neurophysiology and neuroanatomy, where cells in the anterior STS have been linked to the processing of conspecific vocalizations (Rauschecker and Scott, 2009; Scott and Johnsrude, 2003). It seems that the early perceptual processing of intelligible speech is associated with this same rostral ‘what’ pathway, and this may well be linked to amodal semantic representations in the anterior and ventral temporal lobe(s) (Patterson et al., 2007). In contrast, auditory fields lying posterior and medial have been associated with sensorimotor links in speech perception and production tasks: the left medial planum has been shown to respond during speech production, even if the speech is silently mouthed (i.e. there is no acoustic consequence of the articulation; see Wise et al., 2001). This ‘how’ pathway has been linked to auditory–somatosensory responses in non-human primates, and has been suggested to form a ‘sounds do-able’ pathway whereby sounds are analysed in terms of the actions and events that have caused them (Warren et al., 2005), perhaps with specific reference to the production of sound via the human vocal tract (Hickok et al., 2003, 2009; Pa and Hickok, 2008).

In terms of hemispheric asymmetries, there are clear differences between the left and the right temporal lobes in terms of their responses to speech and sound. The left auditory areas show a relatively specific response to linguistic information at several different levels: from phonotactic structure (Jacquemot et al., 2003) and categorical perception of syllables (Liebenthal et al., 2005), to word forms (Cohen et al., 2004) and semantics/syntax (Friederici et al., 2003). This left hemisphere dominance for spoken language may reflect purely linguistic mechanisms (McGettigan and Scott, 2012) or more domain-general mechanisms. For example, musicians with perfect pitch show greater activation of the left anterior STS than non-perfect-pitch musicians (Schulze et al., 2009), which may suggest that left hemisphere mechanisms could be associated with categorization and expertise in auditory processing. However, whatever is driving the left dominance for language, this appears to reflect a specialization in the left temporal lobe that is not purely acoustic (McGettigan and Scott, 2012). In contrast, it is relatively easy to modulate right temporal lobe activity with acoustic factors – in addition to an enhanced response to voice-like (Rosen et al., 2011) and speaker-related information (Belin and Zatorre, 2003), right auditory areas respond more strongly than the left to longer sounds (Boemio et al., 2005), and to sounds that have pitch variation (Zatorre and Belin, 2001). Future developments will be able to outline how these hemispheric asymmetries interact with the rostral/caudal auditory processing pathways discussed above.

2. Masking speech

Speech is often hard to understand in the presence of other sounds, and psychophysical approaches to understanding this have typically distinguished between informational and energetic/modulation components to the masking signal (e.g. Brungart, 2001). Energetic/modulation masking effects are those associated with masking at the auditory periphery, where the masking sound competes with the target speech at the level of the cochlea. Signals that lead to the highest amounts of energetic/modulation masking are those with a wide spectrum (e.g. noise, multi-talker babble) presented at a low signal-to-noise ratio (SNR). There is typically an ogive relationship between the intelligibility of the target speech and the SNR level. Signals leading to informational masking, in addition to their energetic/modulation masking effects, are those which cause competition for resources as a result of some higher-order property of the acoustic signal – for example, linguistic information (Dirks and Bower, 1969; Festen and Plomp, 1990). An example of this would be trying to comprehend one talker against the concurrent speech of another talker. This is generally held to represent some central auditory processing of the masking signal, which is competing for resources with the target stimulus. This competition need not be lexical – reversed speech is an effective informational masker (Brungart and Simpson, 2002). One feature of informational masking is the possibility of intrusions of items from the masking signal into the participants’ spoken repetitions of the target signal. These intrusions can be taken as indicating some central processing of the informational masker (Brungart, 2001).
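As a concrete illustration of these quantities, the sketch below mixes a target with a masker at a requested SNR and evaluates a hypothetical logistic (‘ogive’) intelligibility function. The midpoint and slope values are invented for demonstration and do not come from any of the studies reviewed here.

```python
# Illustrative sketch: mixing a target signal with a masker at a requested
# SNR, plus a hypothetical logistic ("ogive") psychometric function relating
# intelligibility to SNR. Parameter values are invented, chosen only to
# show the shape of the relationship described in the text.
import numpy as np

def mix_at_snr(target, masker, snr_db):
    """Scale `masker` so the target-to-masker power ratio equals snr_db."""
    p_target = np.mean(target ** 2)
    p_masker = np.mean(masker ** 2)
    gain = np.sqrt(p_target / (p_masker * 10 ** (snr_db / 10.0)))
    return target + gain * masker

def intelligibility(snr_db, midpoint=-6.0, slope=0.5):
    """Hypothetical ogive: proportion of words correct as a function of SNR."""
    return 1.0 / (1.0 + np.exp(-slope * (snr_db - midpoint)))

rng = np.random.default_rng(0)
target = rng.standard_normal(16000)  # stand-in for 1 s of speech at 16 kHz
masker = rng.standard_normal(16000)  # stand-in for a wide-band noise masker
mixture = mix_at_snr(target, masker, snr_db=-3.0)

for snr in (-12, -6, 0, 6):
    print(f"{snr:+d} dB SNR -> predicted {100 * intelligibility(snr):.0f}% correct")
```

Pre-testing of the kind described in several of the studies below amounts to locating points on such a function, e.g. the SNR at which listeners score roughly 50% or 80% correct.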

3. Functional imaging studies using speech in the presence of maskers

There are two distinct differences in the approaches taken to the neurobiological study of masked speech. First, there are studies that address the cortical and subcortical systems associated with the perception of speech in a masker, and the way that the pattern of activation can be affected by different masker types and characteristics. Several of these studies are motivated by the findings that the perception of speech in noise is something which worsens with age, due to presbycusis as well as to central changes in auditory processing capacity (e.g. Helfer et al., 2010). Other studies are motivated by more specific questions about the mechanisms underlying different aspects of masking phenomena. These motivations tend to influence the design of the experiment. For example, studies overtly comparing energetic/modulation masking with informational masking do not typically include a speech-in-quiet condition, as they have matched instead for intelligibility scores across the two masking conditions (Scott et al., 2004, 2009b). In a largely separate set of studies, target speech items are presented with a masking noise as a way of expressly manipulating the intelligibility of speech and investigating the neural systems supporting comprehension. In such studies, the relationship between speech comprehension and the SNR of energetic/modulation or informational maskers, which compete with the target speech for resources at the auditory periphery, is used as one of a variety of ways in which speech intelligibility can be affected.

3.1. Investigations of masking speech: energetic/modulation masking studies

Salvi et al. (2002) conducted a study of the perception of speech in noise, imaging neural activity with PET, and using the following conditions: quiet, noise, speech-in-quiet, and speech-in-noise (where the noise was composed of 12-talker babble – SNR not reported). Energetic/modulation masking effects might therefore be expected to dominate in this experiment. There was an overt task: during each PET scan, the subjects were required to repeat the last word of each sentence aloud (or say ‘nope’ if they could not hear the word, or when they thought a sentence ended in the Noise condition). The contrast of speech-in-noise (SPIN) over speech-in-quiet revealed widespread midline cortical and cerebellar activation. The speech-in-noise level was selected such that the participants would be expected to get approximately 50% of the sentences correct, and it is possible that the activations seen in the SPIN condition reflect cognitive resources (attentional and/or semantic) which support this task, as the energetic/modulation masking makes accurate repetition more difficult (Obleser et al., 2007). The opposite contrast, of speech-in-quiet over SPIN, was not reported.

Hwang et al. (2006) used fMRI to measure neural activation while participants listened to stories in Chinese. They used a block design, where the participants heard the stories either in quiet or with a continuous white noise masker at +5 dB SNR. The use of a continuous fMRI paradigm means that the scanner noise was a further source of acoustic stimulation (i.e. speech-in-quiet should be thought of as speech-in-scanner-noise). However, as the scanner noise was present in every scan, it should have been effectively subtracted from the contrasts of interest (though it may still have influenced overall levels of activation). Hwang et al. (2006) found reduced activation in left STS (and elsewhere) for the SPIN condition relative to the speech-in-quiet condition. It is possible that this reflects reduced intelligibility of the speech-in-noise relative to the speech-in-quiet condition, since the left STS has a strong response to intelligibility in speech (Eisner et al., 2010; McGettigan et al., 2012; Narain et al., 2003; Rosen et al., 2011; Scott et al., 2000). The opposite contrast, of SPIN over speech-in-quiet, was not reported.

Hwang et al. (2007) performed the same experiment with older adults and found comparable effects, with some evidence for a greater signal decrease for the speech-in-noise condition in the left temporal lobe for the older adults than for the younger group. They specifically identified posterior left STG as a site for central effects of presbycusis.

Wong et al. (2008) used fMRI to study the perception of speech in masking noise, and utilized a sparse imaging design to reduce the contamination of the acoustic stimuli with scanner noise. Their task involved the presentation of single words in a picture-matching task, with one word (out of a possible 20) presented per trial. The noise was multi-talker babble, so one would again expect energetic/modulation masking effects to dominate, and the noise was presented so that it coincided with the onsets and offsets of the presented words (i.e. the noise had the same duration as the word). The SNRs chosen were −5 dB and +20 dB: these were selected as the +20 dB SNR masking levels elicited ‘normal’ performance, while the −5 dB SNR led to a significant decrement in performance. A speech-in-quiet condition was also included. The results were complex: the contrast of (−5 dB SNR SPIN) > (+20 dB SNR SPIN) showed activation in posterior STG and left anterior insula, while the opposite contrast revealed extensive MTG and superior occipital gyrus activation, plus fusiform, inferior temporal gyrus and posterior cingulate gyrus. This is somewhat surprising, as the (+20 dB SNR SPIN) > (−5 dB SNR SPIN) contrast might be expected to reveal the more extensive auditory cortical fields associated with the better comprehension of speech at higher SNRs. This pattern of activations may well reflect the task used, in which the activation to the single word in the context of a visual matching task may not entail extensive cortical activations compared to that seen when listeners perceive connected speech, such as sentences or stories (Hwang et al., 2006, 2007; Salvi et al., 2002). The repetition of the same 20 words in each condition may also have been detrimental to responses in auditory cortical fields, which can show habituation to repeated spoken items (Zevin and McCandliss, 2004). However, using a region-of-interest (ROI) analysis, the authors identified sensitivity in the left STG to the level of the masking noise, where there was greater activation associated with increasing noise levels. This result is directly at odds with the finding of Hwang and colleagues of less activation in the STG for the SPIN listening condition than for the speech-in-quiet listening condition (Hwang et al., 2006, 2007; see also Scott et al., 2004).

In a further study (Wong et al., 2009), the design was repeated with older adult participants, with the difference that all three sets of stimuli, regardless of SNR, were adjusted to have a level of 70 dB SPL. There was no behavioural difference between the older and the younger groups for the speech-in-quiet and +20 dB SNR conditions, but at −5 dB SNR the older adults had lower scores. The results indicated that the older adults, especially at −5 dB SNR, showed reduced activation in auditory areas, and recruited prefrontal and parietal areas associated with attentional and/or working memory processes to support the difficult listening task. The authors specifically linked this to compensatory listening strategies.

There may be some specific reasons for the differences between the studies reported by Hwang and colleagues and Wong and colleagues. Wong et al. found greater activation in STG regions for speech in noise at a lower SNR, while Hwang et al. found higher levels of activation for the speech-in-quiet than the speech in noise. These differences may be a consequence of the precise implementation of the study designs: Hwang et al. used stories as the target speech, and a continuous masking noise, while Wong et al. used single words presented in noise of matched duration. Hwang et al. asked people to respond when they had understood a sentence, and Wong et al. asked participants to match the heard word to pictures, which may emphasize semantic and executive processes over and above normal speech perception, while simultaneously reducing the total amount of speech heard. We discuss such factors in detail later in the paper, along with consideration of other aspects of the study design (such as sparse vs. continuous scanning, the SNRs chosen, and how these might affect the patterns of results observed across different studies).

These studies show that there are differences in the cortical processing of speech-in-quiet and speech in a masking noise; however, the methodological dissimilarities (and variation in how the results are analysed) mean that the precise nature of these differences remains elusive.

3.2. Investigations of masking speech: informational masking

In an fMRI study of the perception of speech-in-speech (SPIS), Nakai et al. (2005) presented listeners with a story to follow, read aloud by a female talker. Periodically the story was masked with speech from a different (male) talker (DV) or speech from the same talker (SV) (these different maskers were presented in different blocks – i.e. the type of masking speech was not randomly varied). In addition to any energetic/modulation masking effects, one would expect masking speech to lead to considerable informational masking, and that this would be more severe for the SV condition than for the different talker (DV) condition (Brungart, 2001). Continuous fMRI scanning was carried out with a blocked presentation of baseline speech with no masker, alternated with the DV or SV conditions. There was bilateral STG activation for the contrast of the DV condition with the baseline speech condition, indicating that the masking speech was indeed being processed centrally to some extent, i.e. that central, cortical auditory areas were processing the unattended masking speech as well as the attended target speech. The SV condition led to significantly greater activation in the bilateral temporal lobes, and also in prefrontal and parietal fields. A direct comparison of the SV and DV conditions revealed greater activation for the SV masker condition in the pre-supplementary motor area (pre-SMA), right parietal and bilateral prefrontal areas, suggesting that these are recruited to compensate for the strong competition for central resources produced by the SV masker. This study indicates that as informational masking effects increase, there is greater activity in non-auditory areas, potentially reflecting the recruitment of brain areas to cope with the increased attentional demands of a difficult listening condition.

Hill and Miller (2010) ran an fMRI study involving auditory presentation of 3 simultaneous talkers, all continuously reciting IEEE sentences (IEEE Subcommittee on Subjective Measurements, 1969). All the stimuli were generated from sentences produced by the same talker, with additional ‘talkers’ produced by shifting the pitch of the original speaker up or down. The ‘three talkers’ were presented at three virtual spatial locations, using headphones and a head-related transfer function. The sequences were constructed such that sentence onsets were not signalled by longer gaps, nor were there synchronous sentence onsets across the different voices. Prior to each trial, participants were cued to attend to either the pitch or location of the target talker, or to rest. In the attending trials, the participants were required to identify when the target talker began a new sentence. Continuous scanning was employed. As there were no conditions where the number of masking speakers or the type of maskers was varied, this study can be considered to be an investigation of attentional control and strategy within an informational masking context, rather than a study of masking per se. During stimulus presentation, the authors reported activation in bilateral STG/STS and beyond (but not in PAC), and the same system was recruited whether listening was based on pitch or location. This activation therefore probably reflects the objects of perception, rather than the strategies used to attend to them.

3.3. Investigations of masking speech: contrasting informational and energetic/modulation masking

Informational and energetic/modulation masking have been identified as involving somewhat different mechanisms, although a complex masking sound (like masking speech) will produce both informational and energetic/modulation masking factors. One interesting distinction is the effect of level – while there is a relationship between SNR and intelligibility for energetic/modulation maskers, especially for SNRs below 0 dB, masking speech items can intrude into responses during informational masking even at high SNRs (Brungart, 2001). An early PET paper from the first author’s lab aimed to investigate this by requiring participants to listen to short (BKB; Bench et al., 1979) sentences produced by a female talker, and contrasting two masker types – continuous speech spectrum noise as the energetic/modulation masker, and a male talker producing similar (though not identical) sentences as the informational masker. Passive presentation was used, to avoid neural activation associated with an overt task: such a design is arguably preferable when the processes under consideration are held to be both automatic and obligatory (Scott and Wise, 2004). Pre-testing for intelligibility was used to establish that the SNRs used for each masking type did not result in overall differences of intelligibility across maskers. This resulted in use of the following SNRs: −3, 0, +3 and +6 dB for the speech in noise, and −6, −3, 0 and +3 dB for the speech in speech.

The study of Scott et al. (2004) has been criticized for not including a speech-in-quiet condition (Wong et al., 2008). However, it would have been impossible to avoid intelligibility differences between the speech-in-quiet and masked speech, which would have led to differences in neural activation which reflected this modulation of intelligibility, rather than being an index of masking mechanisms. The logic of the chosen design was that contrasting speech in speech with speech in noise will remove perceptual/linguistic activations associated with the comprehension of the target talker’s speech, and reveal activation more specifically associated with the processing of the masking noise/talker. Also, a goal was to reveal neural activation that varied with the SNR, separately for each masking condition.

The study revealed greater overall activity for energetic/modulation masking than informational masking (independent of SNR) in rostral and dorsal prefrontal cortex and posterior parietal cortex (Fig. 2), but this contrast revealed no activation in the temporal lobes. Within the energetic/modulation masking conditions, there was evidence for level-dependent activations in left ventral prefrontal cortex and the supplementary motor area (SMA) (Fig. 2). This was interpreted in terms of the increased recruitment of semantic access and articulatory processes as the level of the masking noise increased. In contrast, there was extensive, level-independent activation in the dorsolateral temporal lobes associated with the contrast of speech-in-speech over speech-in-noise (Fig. 2, see also Fig. 4). As predicted from the behavioural data, there were no level-dependent effects within the speech-in-speech condition.

Fig. 2. Lateral view of the left and right cortical hemispheres, showing cortical areas associated with different aspects of the perception of speech-in-speech and speech-in-noise (from Scott et al., 2004).

The lack of activity beyond the temporal lobes may have been because the masking talker was male, and the target talker was female, so the competition for central resources was potentially reduced relative to more extreme informational masking conditions (e.g. masking a talker with their own voice). The results were interpreted as showing a distributed, non-linguistic-specific network to support the perception of speech in noise, with both semantic and articulatory processes and representations recruited more strongly as the level of the noise masker increases. The strong bilateral activation of the dorsolateral temporal lobes in response to the masking speech was attributed to the obligatory and automatic central processing of the unattended speech: as has been shown by early investigations into selective attention, irrelevant or unattended speech is not discarded at the auditory periphery but is processed to some degree. However, we cannot distinguish from this design which aspects of the unattended speech masker drive this central processing – is it the linguistic nature of speech, or some aspect of its acoustic structure? Considering the latter, it could be that because masking speech is highly amplitude modulated, it allows for brief ‘glimpses’ of the target speech (Festen and Plomp, 1990).
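The ‘glimpsing’ idea can be illustrated with a toy calculation: treat the target and masker as amplitude envelopes, and count the proportion of frames in which the target exceeds the masker by some local SNR criterion. Everything below (the 3 dB criterion, the uniform stand-in envelopes) is a hypothetical illustration of the concept, not an implementation of Festen and Plomp's (1990) analysis.

```python
# Toy illustration of 'glimpsing': a strongly modulated masker has
# low-energy dips during which the target's local SNR is favourable,
# whereas a continuous (flat-envelope) masker has no such dips.
import numpy as np

def glimpse_proportion(target_env, masker_env, criterion_db=3.0):
    """Fraction of frames where the target exceeds the masker by criterion_db."""
    local_snr_db = 20 * np.log10(target_env / masker_env)
    return float(np.mean(local_snr_db > criterion_db))

rng = np.random.default_rng(2)
target_env = rng.uniform(0.1, 1.0, 200)      # stand-in target envelope, 200 frames
steady_masker = np.full(200, 0.5)            # continuous noise: flat envelope
speech_masker = rng.uniform(0.05, 1.0, 200)  # speech-like: strongly modulated

print("glimpses vs continuous noise:", glimpse_proportion(target_env, steady_masker))
print("glimpses vs modulated masker:", glimpse_proportion(target_env, speech_masker))
```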

A follow-up study (Scott et al., 2009b) addressed these issues by presenting participants with speech in signal-correlated noise (Schroeder, 1968), speech in spectrally rotated speech, and speech in speech. All three maskers allowed ‘glimpses’ of the target speech, and the three maskers varied in the degree to which they are dominated by energetic/modulation masking and informational masking. The target speech again comprised simple sentences read by a female talker. Signal-correlated noise (Schroeder, 1968) was created by modulating noise (in this experiment, noise with a speech-shaped spectrum) with the amplitude envelope of sentences spoken by the male talker from the previous study (Scott et al., 2004). The speech in signal-correlated noise was considered to be dominated by energetic/modulation masking, but because of the amplitude-modulated profile of the signal-correlated noise, ‘glimpses’ of the target speech were possible. The spectrally rotated speech was again generated from the original male masking talker. Spectral rotation involves inverting a low-passed speech signal around a frequency point (e.g. 4 kHz low-pass filtered speech, inverted around 2 kHz), and has been used extensively as a baseline condition in speech perception studies as it preserves many aspects of the original speech signal, such as pitch, spectral structure (though inverted) and amplitude modulation (Blesser, 1972). This results in a signal highly analogous to speech in terms of complexity, but which cannot be understood. This was included as a masking condition as it was expected to contribute to informational masking phenomena, but without contributions from linguistic competition. Finally, there was a speech-in-speech condition, which was identical to the speech-in-speech informational masking condition in the previous study (Scott et al., 2004). Pretesting was used to select SNRs leading to comparable levels of around 80% intelligibility. These were −3 dB for the signal-correlated noise masker, and −6 dB for the speech and rotated speech maskers.
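The two masker constructions described above can be sketched in a few lines of signal processing. The following fragment is a simplified reading of Schroeder-style signal-correlated noise (envelope-modulated noise) and Blesser-style spectral rotation via ring modulation; the filter orders, cutoffs and sampling rate are illustrative assumptions rather than the parameters used in Scott et al. (2009b).

```python
# Illustrative sketches of signal-correlated noise (noise carrying the
# amplitude envelope of speech) and spectral rotation (inverting a
# low-passed signal around 2 kHz via ring modulation).
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 16000  # assumed sampling rate, Hz

def lowpass(x, cutoff_hz, fs=FS, order=4):
    sos = butter(order, cutoff_hz, btype="low", fs=fs, output="sos")
    return sosfiltfilt(sos, x)

def signal_correlated_noise(speech, fs=FS):
    """Noise sharing the amplitude envelope of `speech`."""
    envelope = lowpass(np.abs(speech), cutoff_hz=30.0, fs=fs)  # rectify + smooth
    noise = np.random.default_rng(0).standard_normal(len(speech))
    return envelope * noise  # for speech-shaped noise, filter `noise` first

def spectrally_rotate(speech, fs=FS, band_hz=4000.0):
    """Invert the spectrum of `speech` around band_hz / 2 (here 2 kHz)."""
    x = lowpass(speech, band_hz, fs)               # band-limit to 0-4 kHz
    t = np.arange(len(x)) / fs
    rotated = x * np.cos(2 * np.pi * band_hz * t)  # ring modulation
    return lowpass(rotated, band_hz, fs)           # discard the upper image

speech = np.random.default_rng(1).standard_normal(FS)  # stand-in for 1 s of speech
scn = signal_correlated_noise(speech)
rotated = spectrally_rotate(speech)
```

Ring modulation by a 4 kHz carrier maps each component at frequency f onto 4000 − f and 4000 + f; low-passing again removes the upper image, leaving the spectrum inverted around 2 kHz.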

The contrast of speech in speech with speech in signal-correlated noise revealed much more restricted activation of the lateral STG by the masking speech than was seen in the earlier study (Fig. 3, Fig. 4), suggesting that much of this activation was indeed associated with unmasked ‘glimpses’ of the target speech in the speech-in-speech condition (Festen and Plomp, 1990). Once this was controlled for (by having glimpses possible in both speech in signal-correlated noise and speech in speech), the activation no longer extended into primary auditory fields. Notably, there was also activation in the STG seen in the contrast of speech in rotated speech over speech in signal-correlated noise, though it was restricted to the right STG (Fig. 3).

Fig. 3. Cortical areas associated with (A) the contrast of speech-in-speech > speech-in-noise and (B) the contrast of speech-in-rotated-speech > speech-in-noise (from Scott et al., 2009a,b). Upper panels show activity on coronal brain sections, and lower panels show activity on transverse brain sections. Figures have been redrawn from images thresholded at p < 0.0001, cluster size >40 voxels.


The non-intelligibility of the spectrally rotated speech thus results in some informational masking effects through competition for central auditory processing resources, but not via linguistic mechanisms (which we would expect to yield modulations of left-dominant auditory regions).

Fig. 4 shows the speech-in-speech > speech-in-noise contrasts for the Scott et al. (2004) study, where continuous noise was used for the speech in noise (A), and the Scott et al. (2009a,b) study, in which signal-correlated noise was used to construct the speech-in-noise condition (B). The activation from the Scott et al. (2004) study is more extensive than that seen from the Scott et al. (2009a,b) study. The results of Scott et al. (2009a,b) therefore indicated that some (but not all) of the greater STG activation seen for informational masking relative to energetic/modulation masking is a consequence of the glimpses of the target speech afforded by (naturally amplitude-modulated) speech maskers, but not by continuous noise maskers, which are commonly used. Furthermore, the study suggested that a masker need not be formed of intelligible speech to lead to central auditory processing: there was significant activation of the right STG by the masking spectrally rotated speech, and this is consistent with theories of informational masking which suggest that these may involve linguistic effects, but that informational masking can also arise due to other, paralinguistic properties of the stimuli.

Fig. 4. Cortical areas associated with the contrast of speech-in-speech > speech-in-noise from (A) Scott et al., 2004, where the noise in speech-in-noise is continuous, and (B) Scott et al., 2009a,b, where the noise in speech-in-noise is speech amplitude modulated, that is, glimpses of the target speech are possible in both speech-in-speech and speech-in-noise. Upper panels show activity on coronal brain sections, and lower panels show activity on transverse brain sections. Figures have been redrawn from images thresholded at p < 0.0001, cluster size >40 voxels.

4. Cognitive studies using speech in noise

Several studies have employed masking signals to explore the neural systems supporting the comprehension of speech, and cognitive processes within this. Here, the primary aim is often not to explore the specific perceptual effects of presenting competing sounds alongside speech, but rather to use the presentation of speech against masking noise as a way of varying the comprehension of speech, and hence as a way of exploring linguistic processes relevant to comprehension. Binder et al. (2004) carried out an fMRI study of speech perception using a parametric modulation of SNR (−6 dB, −1 dB, +4 dB, +14 dB and +24 dB) with syllables (‘ba’ and ‘da’) masked by white noise. Increasing SNR led to increases in the intelligibility of the speech signal, while reaction times (RTs) showed an inverted U-shape function, with the slowest responses occurring around +4 dB SNR. The authors related the RT profile to decision-making processes in the task, suggesting that intermediate SNRs are those at which these processes are most strongly engaged (i.e. where the speech percept is likely to be most ambiguous). In a clever analysis approach, the authors identified brain regions whose activation profiles across the conditions matched the behavioural pattern of accuracy and RT change with SNR. They found that activation in bilateral superior temporal cortex showed a positive correlation with accuracy, while ventral areas of the pars opercularis in both hemispheres, as well as the left inferior frontal sulcus, showed the opposite pattern. In contrast, activation in bilateral anterior insula and portions of the frontal operculum showed a profile similar to the pattern of RTs across SNR. This study, as in some of the energetic/modulation masking studies, showed a sensitivity to speech intelligibility in temporal lobe fields, and an involvement of left prefrontal areas as the masked speech became harder to understand. Activity in other prefrontal areas (bilateral anterior insula and frontal operculum) was associated with the speed of response, suggesting an involvement in task-related aspects of performance. This study is an elegant example of how fMRI data can be used to dissociate different aspects of the perception of speech in noise.
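The profile-matching logic of this analysis can be made concrete with a small sketch: correlate each region's mean activation across the SNR conditions with the behavioural accuracy and RT profiles. All the numbers below are invented stand-ins, not data from Binder et al. (2004); only the SNR levels match the design.

```python
# Illustrative sketch of profile matching: correlate each region's mean
# activation across SNR conditions with the behavioural profiles.
# All activation and behavioural values are hypothetical.
import numpy as np

snrs = np.array([-6, -1, 4, 14, 24])                 # dB, as in the design
accuracy = np.array([0.55, 0.70, 0.85, 0.97, 0.99])  # hypothetical proportion correct
rt = np.array([650, 720, 780, 700, 640])             # hypothetical ms: inverted U

roi_profiles = {
    # hypothetical mean BOLD signal per condition, one row per region
    "superior_temporal": np.array([0.4, 0.9, 1.4, 1.9, 2.0]),  # tracks accuracy
    "anterior_insula":   np.array([0.8, 1.1, 1.5, 1.0, 0.7]),  # tracks RT
}

def corr(a, b):
    """Pearson correlation between two condition profiles."""
    return float(np.corrcoef(a, b)[0, 1])

for roi, profile in roi_profiles.items():
    print(f"{roi:18s} r(accuracy) = {corr(profile, accuracy):+.2f}  "
          f"r(RT) = {corr(profile, rt):+.2f}")
```

A region whose profile correlates with accuracy is a candidate for tracking the speech percept itself; one whose profile follows the RT curve is a candidate for task or decision processes.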

A more recent study (Davis et al., 2011) took a similar approach, using behavioural profiles of speech comprehension in noise to explore ‘top-down’ versus ‘bottom-up’ aspects of speech processing. Similar to Binder et al. (2004), they presented speech in noise (this time signal-correlated noise with English sentences) across 6 different SNRs equally spaced from −5 dB to 0 dB. However, in an additional manipulation, half of the sentences were semantically anomalous (e.g. ‘Her good slope was done in carrot’). The authors used a cluster analysis approach to fMRI responses across these conditions in order to identify neural regions showing significantly different profiles of activation. Similar to Binder et al. (2004), the left posterior STG showed a profile of increasing signal as SNR increased. However, this profile did not distinguish between semantically anomalous and semantically coherent sentence types. The left inferior frontal gyrus (IFG), as well as anterior STG, in contrast, showed a profile in which the response to semantically coherent items increased with increasing SNR, but decreased at the highest SNRs, while activation to the anomalous items continued to increase for these conditions. The authors interpreted this as a reflection of ‘effortful semantic integration’ (p. 3928) taking place in the IFG. That is, this region is engaged strongly at intermediate SNRs where there is sufficient low-level perception of the signal to allow higher-order linguistic computations to be engaged, but this is no longer necessary at high SNRs for semantically coherent (and to some extent, predictable) sentences. On the other hand, semantically anomalous sentences demand additional processing to facilitate the extraction of a sensible sentence percept, even at high SNRs. Although the posterior STG showed no distinction between the two sentence types, and therefore may be more concerned with the lower-level acoustic-phonetic aspects of perception, the authors suggest that this region is likely to be functionally connected with inferior frontal regions as part of a network of regions supporting comprehension.

Similar findings related to the IFG have recently been reported (Adank et al., 2012) in an fMRI experiment exploring the differences in the neural responses between two challenging listening conditions – SPIN and accented speech – and undistorted speech. Their SPIN stimuli were simple sentences in Dutch that were presented in speech spectrum noise at an SNR of +2 dB. In a sentence verification task, in which participants responded with a yes/no evaluation of the semantic plausibility of the sentence content, the SPIN condition elicited 18.1% errors (compared with 8% errors for undistorted, speech-in-quiet sentences). When comparing the BOLD responses to SPIN with those to speech-in-quiet, the authors found greater signal bilaterally in the opercular part of the IFG, as well as cingulate cortex, parahippocampal gyrus, caudate and frontal pole. They relate this to the engagement of higher-order linguistic representations in the attempt to understand speech that has been masked by an external signal. As a fully factorial design was not employed (that is, there was no SPIN + accented speech condition), however, it remains unknown how the effects of masking noise and accented speech interact.

Scott et al. (2004), in the study described in a previous section, explored the neural consequences of energetic/modulation and informational masking. However, they also included one analysis that addressed the correlates of speech intelligibility across all conditions (SPIN and speech-in-speech), which they identified using the behavioural speech repetition scores from every participant (which were collected prior to scanning). This revealed a single cluster of activation in left anterior STG that showed a positive correlation with stimulus intelligibility. This is in line with their previous work on factorial comparisons of intelligible and unintelligible sentence stimuli (Scott et al., 2000), while the results of Davis et al. (2011) also point towards higher-order processes in this region supporting the extraction of a meaningful perception from a noisy stimulus.

Zekveld et al. (2006) masked spoken sentences with speech spectrum noise at 144 SNRs from −35 dB to 0 dB (−30 dB to 5 dB in some participants). The chosen range yielded speech comprehension accuracy scores of around 50% on average. The authors explored the interaction between masking and perception, dividing the stimuli into a factorial array describing SNR (low, medium and high), and intelligible versus unintelligible. Analyses exploring responses of the speech perception network across these conditions showed that unintelligible stimuli with low SNR yielded significant activation (compared with noise alone) in left Brodmann Area (BA) 44, in the opercular part of the IFG. They link this back to the findings of Binder et al. (2004) and suggest that BA 44 is recruited to support articulatory strategies, which affect speech perception in a top-down fashion and help the comprehension of severely distorted speech stimuli. In a further contrast in which SNR was controlled (at ‘medium’) but intelligibility differed, the authors identified increased signal in bilateral superior temporal cortex and a more anterior portion of the left IFG, in BA 45. There was some differentiation between the neural responses in temporal and frontal areas with increasing SNR, with temporal regions showing a (non-significant) ‘earlier’ onset response at around −13 dB, and a significantly larger asymptotic response than frontal areas.

There are some clear themes emerging from this work. Most consistently, there appears to be a distinction between the role of mainly left-lateralized regions of the opercular part of the IFG and sites in superior temporal cortex, where the frontal regions are associated with strategic and top-down effects, while temporal cortex more faithfully tracks the changes in speech audibility with varying SNR. At the same time, two studies have indicated that temporal responses in the comprehension of noisy speech precede those of the frontal cortex, both in the timing of the response, and in terms of the SNRs at which they ‘emerge’ (Davis et al., 2011; Zekveld et al., 2006). What are the roles of these higher-order influences of frontal cortex? Is the role evaluative, for example in the decision-making aspects of the Binder et al. (2004) task? Or is the involvement more dynamic, with, for example, an articulatory strategy guiding the extraction of phonemic content from the incoming stimulus? A parallel pattern in other studies is that direct contrasts of SPIN with speech-in-quiet or undistorted speech reveal activations in left inferior frontal cortex. Some authors have linked motoric simulation strategies to difficult listening situations (Adank, 2012; Hervais-Adelman et al., 2012). However, as we have previously argued (McGettigan et al., 2010; Scott et al., 2009a), the choice of stimulus and task can also have effects on the extent to which certain strategies come into play, for example, the engagement of articulatory representations to perform phonetic decomposition of a heard syllable (Sato et al., 2009). Across the studies described above (and for the moment not considering the type of noise/masker used), authors have presented participants with stimuli ranging from a closed set of syllables, to words, to sentences, with tasks including repetition, semantic evaluation and passive listening (sometimes subsequent to an active behavioural task on the same stimuli). These different tasks would be expected to result in different patterns of neural activation. For example, the cognitive processes involved in evaluating whether a sentence ‘makes sense’ or not may more consistently recruit more prefrontal cortical fields than when deciding whether a heard syllable is a ‘da’ or a ‘ba’ (Scott et al., 2003). This may partly explain why some studies in the literature on SPIN implicate frontal and auditory/auditory association cortices to varying degrees. In terms of task difficulty, the choice of SNR(s) used, in combination with task, is likely to have a major role in which brain regions are uncovered, and the profile exhibited by these regions. In factorial designs, the precise SNR used, and the degree of its effects on intelligibility, presents the authors with a certain ‘snapshot’ view of the activation of brain networks, and the relative involvement of components within them. A good example is Adank et al. (2012), who used two noise conditions – one intrinsic (sentences spoken with a novel accent) and one extrinsic (sentences in speech-shaped noise). When each of these was contrasted with rest, different profiles of activation were found. Additional activations were observed in the direct comparison of the two noise conditions. However, it is also the case that the two ‘noise’ conditions, aside from their likely differing effects at the auditory periphery, were significantly different in the proportion of errors they generated in the behavioural task. Thus, the direct comparison of the two conditions reflects an unknown ratio of contributions of basic acoustic and higher-order task-related effects.
Consideration of task difficulty effects may help us to understand some of the inconsistencies across studies. For example, the IFG has been shown to be more engaged by SPIN than by speech-in-quiet (Adank et al., 2012), yet IFG activation also shows positive correlations with SNR (and therefore increasing intelligibility) (Davis et al., 2011; Zekveld et al., 2006). In a study by Zekveld et al. (2012), where performance in the scanner was on average less than 29% correct, a comparison of SPIN with speech-in-quiet gave an array of fronto-parietal activations more commonly associated with the default mode network (networks of activation which have been consistently associated with the brain ‘at rest’ and which typically indicate that participants are focussing on their own thoughts rather than on the experimental stimuli (Raichle et al., 2001)). In contrast, the IFG appeared in the converse comparison of speech-in-quiet > SPIN (Zekveld et al., 2012). We suggest that, where factorial designs have been employed to compare different types of masker, authors should endeavour to control for intelligibility across conditions (see Scott et al., 2004) and consider the overall level of difficulty at which their task is pitched. Alternatively, the parametric designs used by Davis et al. (2011) and Zekveld et al. (2006) offer the chance for authors to characterize an ‘intelligibility landscape’ across the brain, showing the interplay of different key sites in the comprehension network across different levels and types of masker. These designs could be developed to interrogate, for example, the effects of masking on the neural systems supporting comprehension at different levels of the linguistic hierarchy.

5. Conclusions

From a functional neuroimaging perspective, the understanding of speech presented against different maskers is still developing, not least because of the considerable technical problems of dealing with the acoustic noise generated by the scanning sequences used in most fMRI paradigms. However, a pattern is starting to emerge, where there is now considerable evidence for widespread prefrontal and parietal activation associated with dealing with speech in masking noise (Adank et al., 2012; Hwang et al., 2006, 2007; Scott et al., 2004; Scott et al., 2009b; Wong et al., 2008, 2009) and some evidence for level-specific recruitment of prefrontal and premotor fields (Davis et al., 2011; Zekveld et al., 2012). In contrast, the claim that informational masking arises due to competition for resources at central auditory cortical processing levels has been largely supported by the findings of considerable processing of the unattended masking speech in bilateral superior temporal lobes, which is seen in addition to the activation associated with the target speech. As might be expected, these central informational masking effects are not restricted to intelligible speech, and can also be seen for spectrally rotated speech (Scott et al., 2009b). Greater activation (outside auditory cortex) is also seen when the masking speech is highly similar to the target speech (Nakai et al., 2005), as would be predicted from the behavioural literature (Brungart, 2001).

There are outstanding questions about what this increased STG activation reflects: is it showing parallel processing of the different auditory streams associated with the target speech and with the informational masker? Is it a result of increased competition for resources? Does this processing profile vary between the two hemispheres, and does this vary with functional differences? Further studies will be needed to refine our understanding of these responses. However, if the objects of auditory perception are processed along neural pathways that are both plastic and capable of processing multiple streams of information in parallel (Scott, 2005), all of these possibilities may be found to play a role.

We have outlined several factors which affect the patterns of activation seen above and beyond those specifically associated with masking: the task used, the speech stimuli selected, and the SNR all affect the kinds of activation seen. Further outstanding challenges will be to identify cortical signatures that are masker-specific, and signatures that might be recruited for both energetic/modulation masking and informational masking (not possible from Scott et al.'s 2004 or 2009 studies), and to address the ways that ageing affects the perception of masked speech while controlling for intelligibility (or performance differences). Finally, another important dimension to address will be individual variation in performance with different maskers. The neural activations associated with variation in adaptation to noise-vocoded speech have been reported (Eisner et al., 2010): in this study, better learning was associated with greater recruitment of left IFG, rather than greater recruitment of auditory cortical fields. Is this similar or different for the perception of speech in masking sounds? How are the neural mechanisms altered by the kinds of experience-based changes (e.g. musical training) which have been argued to lead to enhanced perception of speech in noise (Parbery-Clark et al., 2012; Strait et al., 2012)? We anticipate that the next decade of research will build upon the strong foundations reported here, with the real possibility of important changes in our understanding of how our brains cope with listening in a noisy world.

References

Adank, P., 2012. The neural bases of difficult speech comprehension and speech production: Two Activation Likelihood Estimation (ALE) meta-analyses. Brain Lang. 122, 42–54.

Adank, P., Davis, M.H., Hagoort, P., 2012. Neural dissociation in processing noise and accent in spoken language comprehension. Neuropsychologia 50, 77–84.

Agnew, Z.K., McGettigan, C., Scott, S.K., 2011. Discriminating between auditory and motor cortical responses to speech and nonspeech mouth sounds. J. Cogn. Neurosci. 23, 4038–4047.

Belin, P., Zatorre, R.J., 2003. Adaptation to speaker's voice in right anterior temporal lobe. Neuroreport 14, 2105–2109.

Bench, J., Kowal, A., Bamford, J., 1979. The BKB (Bamford-Kowal-Bench) sentence lists for partially-hearing children. Br. J. Audiol. 13, 108–112.

Binder, J.R., Frost, J.A., Hammeke, T.A., Rao, S.M., Cox, R.W., 1996. Function of the left planum temporale in auditory and linguistic processing. Brain 119, 1239–1247.

Binder, J.R., Liebenthal, E., Possing, E.T., Medler, D.A., Ward, B.D., 2004. Neural correlates of sensory and decision processes in auditory object identification. Nat. Neurosci. 7, 295–301.

Blesser, B., 1972. Speech perception under conditions of spectral transformation. I. Phonetic characteristics. J. Speech Hear. Res. 15, 5–41.

Boemio, A., Fromm, S., Braun, A., Poeppel, D., 2005. Hierarchical and asymmetric temporal sensitivity in human auditory cortices. Nat. Neurosci. 8, 389–395.

Brungart, D.S., 2001. Informational and energetic masking effects in the perception of two simultaneous talkers. J. Acoust. Soc. Am. 109, 1101–1109.

Brungart, D.S., Simpson, B.D., 2002. Within-ear and across-ear interference in a cocktail-party listening task. J. Acoust. Soc. Am. 112, 2985–2995.

Cohen, L., Jobert, A., Le Bihan, D., Dehaene, S., 2004. Distinct unimodal and multimodal regions for word processing in the left temporal cortex. Neuroimage 23, 1256–1270.

Davis, M.H., Ford, M.A., Kherif, F., Johnsrude, I.S., 2011. Does semantic context benefit speech understanding through "top-down" processes? Evidence from time-resolved sparse fMRI. J. Cogn. Neurosci. 23, 3914–3932.

Dimitrijevic, A., Pratt, H., Starr, A., 2013. Auditory cortical activity in normal hearing subjects to consonants and vowels presented in quiet and in noise. Clin. Neurophysiol., in press.

Dirks, D.D., Bower, D.R., 1969. Masking effects of speech competing messages. J. Speech Hear. Res. 12, 229.

Edmister, W.B., Talavage, T.M., Ledden, P.J., Weisskoff, R.M., 1999. Improved auditory cortex imaging using clustered volume acquisitions. Hum. Brain Mapp. 7, 89–97.

Eisner, F., McGettigan, C., Faulkner, A., Rosen, S., Scott, S.K., 2010. Inferior frontal gyrus activation predicts individual differences in perceptual learning of cochlear-implant simulations. J. Neurosci. 30, 7179–7186.

Festen, J.M., Plomp, R., 1990. Effects of fluctuating noise and interfering speech on the speech-reception threshold for impaired and normal hearing. J. Acoust. Soc. Am. 88, 1725–1736.

Friederici, A.D., Ruschemeyer, S.A., Hahne, A., Fiebach, C.J., 2003. The role of left inferior frontal and superior temporal cortex in sentence comprehension: localizing syntactic and semantic processes. Cereb. Cortex 13, 170–177.

Hall, D.A., Haggard, M.P., Akeroyd, M.A., Palmer, A.R., Summerfield, A.Q., Elliott, M.R., Gurney, E.M., Bowtell, R.W., 1999. "Sparse" temporal sampling in auditory fMRI. Hum. Brain Mapp. 7, 213–223.

Helfer, K.S., Chevalier, J., Freyman, R.L., 2010. Aging, spatial cues, and single- versus dual-task performance in competing speech perception. J. Acoust. Soc. Am. 128, 3625–3633.

Hervais-Adelman, A.G., Carlyon, R.P., Johnsrude, I.S., Davis, M.H., 2012. Brain regions recruited for the effortful comprehension of noise-vocoded words. Lang. Cogn. Process. 27, 1145–1166.

Hickok, G., Okada, K., Serences, J.T., 2009. Area Spt in the human planum temporale supports sensory-motor integration for speech processing. J. Neurophysiol. 101, 2725–2732.

Hickok, G., Buchsbaum, B., Humphries, C., Muftuler, T., 2003. Auditory-motor interaction revealed by fMRI: speech, music, and working memory in area Spt. J. Cogn. Neurosci. 15, 673–682.

Hill, K.T., Miller, L.M., 2010. Auditory attentional control and selection during cocktail party listening. Cereb. Cortex 20, 583–590.

Hwang, J.H., Wu, C.W., Chen, J.H., Liu, T.C., 2006. The effects of masking on the activation of auditory-associated cortex during speech listening in white noise. Acta Otolaryngol. 126, 916–920.

Hwang, J.H., Li, C.W., Wu, C.W., Chen, J.H., Liu, T.C., 2007. Aging effects on the activation of the auditory cortex during binaural speech listening in white noise: an fMRI study. Audiol. Neurootol. 12, 285–294.

Jacquemot, C., Pallier, C., Le Bihan, D., Dehaene, S., Dupoux, E., 2003. Phonological grammar shapes the auditory cortex: a functional magnetic resonance imaging study. J. Neurosci. 23, 9541–9546.

Johnsrude, I.S., Penhune, V.B., Zatorre, R.J., 2000. Functional specificity in the right human auditory cortex for perceiving pitch direction. Brain 123, 155–163.

Liebenthal, E., Binder, J.R., Spitzer, S.M., Possing, E.T., Medler, D.A., 2005. Neural substrates of phonemic perception. Cereb. Cortex 15, 1621–1631.

McGettigan, C., Scott, S.K., 2012. Cortical asymmetries in speech perception: what's wrong, what's right and what's left? Trends Cogn. Sci. 16, 269–276.

McGettigan, C., Agnew, Z.K., Scott, S.K., 2010. Are articulatory commands automatically and involuntarily activated during speech perception? Proc. Natl. Acad. Sci. U. S. A. 107, E42.

McGettigan, C., Evans, S., Rosen, S., Agnew, Z.K., Shah, P., Scott, S.K., 2012. An application of univariate and multivariate approaches in fMRI to quantifying the hemispheric lateralization of acoustic and linguistic processes. J. Cogn. Neurosci. 24, 636–652.

Mummery, C.J., Ashburner, J., Scott, S.K., Wise, R.J.S., 1999. Functional neuroimaging of speech perception in six normal and two aphasic subjects. J. Acoust. Soc. Am. 106, 449–457.

Nakai, T., Kato, C., Matsuo, K., 2005. An fMRI study to investigate auditory attention: a model of the cocktail party phenomenon. Magn. Reson. Med. Sci. 4, 75–82.

Narain, C., Scott, S.K., Wise, R.J.S., Rosen, S., Leff, A., Iversen, S.D., Matthews, P.M., 2003. Defining a left-lateralized response specific to intelligible speech using fMRI. Cereb. Cortex 13, 1362–1368.

Obleser, J., Wise, R.J.S., Dresner, M.A., Scott, S.K., 2007. Functional integration across brain regions improves speech perception under adverse listening conditions. J. Neurosci. 27, 2283–2289.

Ogawa, S., Lee, T.M., Kay, A.R., Tank, D.W., 1990. Brain magnetic resonance imaging with contrast dependent on blood oxygenation. Proc. Natl. Acad. Sci. U. S. A. 87, 9868–9872.

Okada, K., Rong, F., Venezia, J., Matchin, W., Hsieh, I.H., Saberi, K., Serences, J.T., Hickok, G., 2010. Hierarchical organization of human auditory cortex: evidence from acoustic invariance in the response to intelligible speech. Cereb. Cortex 20, 2486–2495.

Pa, J., Hickok, G., 2008. A parietal-temporal sensory-motor integration area for the human vocal tract: evidence from an fMRI study of skilled musicians. Neuropsychologia 46, 362–368.

Parbery-Clark, A., Marmel, F., Bair, J., Kraus, N., 2011. What subcortical-cortical relationships tell us about processing speech in noise. Eur. J. Neurosci. 33, 549–557.

Parbery-Clark, A., Anderson, S., Hittner, E., Kraus, N., 2012. Musical experience strengthens the neural representation of sounds important for communication in middle-aged adults. Front. Aging Neurosci. 4, 30.

Patterson, K., Nestor, P.J., Rogers, T.T., 2007. Where do you know what you know? The representation of semantic knowledge in the human brain. Nat. Rev. Neurosci. 8, 976–987.

Peelle, J.E., Eason, R.J., Schmitter, S., Schwarzbauer, C., Davis, M.H., 2010. Evaluating an acoustically quiet EPI sequence for use in fMRI studies of speech and auditory processing. Neuroimage 52, 1410–1419.

Posse, S., 2012. Multi-echo acquisition. Neuroimage 62, 665–671.

Raichle, M.E., MacLeod, A.M., Snyder, A.Z., Powers, W.J., Gusnard, D.A., Shulman, G.L., 2001. A default mode of brain function. Proc. Natl. Acad. Sci. U. S. A. 98, 676–682.

Rauschecker, J.P., Scott, S.K., 2009. Maps and streams in the auditory cortex: nonhuman primates illuminate human speech processing. Nat. Neurosci. 12, 718–724.

Riecke, L., Vanbussel, M., Hausfeld, L., Baskent, D., Formisano, E., Esposito, F., 2012. Hearing an illusory vowel in noise: suppression of auditory cortical activity. J. Neurosci. 32, 8024–8034.

Rosen, S., Wise, R.J.S., Chadha, S., Conway, E.-J., Scott, S.K., 2011. Hemispheric asymmetries in speech perception: sense, nonsense and modulations. PLoS ONE 6. http://dx.doi.org/10.1371/journal.pone.0024672.

Salvi, R.J., Lockwood, A.H., Frisina, R.D., Coad, M.L., Wack, D.S., Frisina, D.R., 2002. PET imaging of the normal human auditory system: responses to speech in quiet and in background noise. Hear. Res. 170, 96–106.

Sato, M., Tremblay, P., Gracco, V.L., 2009. A mediating role of the premotor cortex in phoneme segmentation. Brain Lang. 111, 1–7.

Schroeder, M.R., 1968. Reference signal for signal quality studies. J. Acoust. Soc. Am. 44.

Schulze, K., Gaab, N., Schlaug, G., 2009. Perceiving pitch absolutely: comparing absolute and relative pitch possessors in a pitch memory task. BMC Neurosci. 10.

Scott, S.K., 2005. Auditory processing – speech, space and auditory objects. Curr. Opin. Neurobiol. 15, 197–201.

Scott, S.K., Johnsrude, I.S., 2003. The neuroanatomical and functional organization of speech perception. Trends Neurosci. 26, 100–107.

Scott, S.K., Wise, R.J.S., 2004. The functional neuroanatomy of prelexical processing in speech perception. Cognition 92, 13–45.

Scott, S.K., Leff, A.P., Wise, R.J.S., 2003. Going beyond the information given: a neural system supporting semantic interpretation. Neuroimage 19, 870–876.

Scott, S.K., McGettigan, C., Eisner, F., 2009a. A little more conversation, a little less action – candidate roles for the motor cortex in speech perception. Nat. Rev. Neurosci. 10, 295–302.

Scott, S.K., Blank, C.C., Rosen, S., Wise, R.J., 2000. Identification of a pathway for intelligible speech in the left temporal lobe. Brain 123, 2400–2406.

Scott, S.K., Rosen, S., Wickham, L., Wise, R.J., 2004. A positron emission tomography study of the neural basis of informational and energetic masking effects in speech perception. J. Acoust. Soc. Am. 115, 813–821.

Scott, S.K., Rosen, S., Beaman, C.P., Davis, J.P., Wise, R.J.S., 2009b. The neural processing of masked speech: evidence for different mechanisms in the left and right temporal lobes. J. Acoust. Soc. Am. 125, 1737–1743.

Stone, M.A., Füllgrabe, C., Moore, B.C.J., 2012. Notionally steady background noise acts primarily as a modulation masker of speech. J. Acoust. Soc. Am. 132, 317–326.

Stone, M.A., Füllgrabe, C., Mackinnon, R.C., Moore, B.C.J., 2011. The importance for speech intelligibility of random fluctuations in "steady" background noise. J. Acoust. Soc. Am. 130, 2874–2881.

Strait, D.L., Parbery-Clark, A., Hittner, E., Kraus, N., 2012. Musical training during early childhood enhances the neural encoding of speech in noise. Brain Lang. 123, 191–201.

Warren, J.E., Wise, R.J., Warren, J.D., 2005. Sounds do-able: auditory-motor transformations and the posterior temporal plane. Trends Neurosci. 28, 636–643.

Wise, R., Chollet, F., Hadar, U., Friston, K., Hoffner, E., Frackowiak, R., 1991. Distribution of cortical neural networks involved in word comprehension and word retrieval. Brain 114, 1803–1817.

Wise, R.J., Scott, S.K., Blank, S.C., Mummery, C.J., Murphy, K., Warburton, E.A., 2001. Separate neural subsystems within 'Wernicke's area'. Brain 124, 83–95.

Wong, P.C., Uppunda, A.K., Parrish, T.B., Dhar, S., 2008. Cortical mechanisms of speech perception in noise. J. Speech Lang. Hear. Res. 51, 1026–1041.

Wong, P.C., Jin, J.X., Gunasekera, G.M., Abel, R., Lee, E.R., Dhar, S., 2009. Aging and cortical mechanisms of speech perception in noise. Neuropsychologia 47, 693–703.

Xiang, J., Simon, J., Elhilali, M., 2010. Competing streams at the cocktail party: exploring the mechanisms of attention and temporal integration. J. Neurosci. 30, 12084–12093.

Zatorre, R.J., Belin, P., 2001. Spectral and temporal processing in human auditory cortex. Cereb. Cortex 11, 946–953.

Zekveld, A.A., Heslenfeld, D.J., Festen, J.M., Schoonhoven, R., 2006. Top-down and bottom-up processes in speech comprehension. Neuroimage 32, 1826–1836.

Zekveld, A.A., Rudner, M., Johnsrude, I.S., Heslenfeld, D.J., Ronnberg, J., 2012. Behavioral and fMRI evidence that cognitive ability modulates the effect of semantic context on speech intelligibility. Brain Lang. 122, 103–113.

Zevin, J., McCandliss, B., 2004. Dishabituation to phonetic stimuli in a "silent" event-related fMRI design. Int. J. Psychol. 39, 102.