Perceptual encoding of natural speech sounds revealed by the N1 event-related potential response

Olivia Pereira1,2, Yang Agnes Gao1, Joseph C. Toscano1

1 Dept. of Psychological and Brain Sciences, Villanova University

2 Nemours Biomedical Research, Nemours Alfred I. duPont Hospital for Children

To appear in Auditory Perception & Cognition

Abstract

Recent work has demonstrated that the auditory N1 event-related potential (ERP) component tracks continuous changes in voice onset time (VOT; an acoustic cue distinguishing word-initial voicing categories), suggesting that this ERP response can index early perceptual representations. The present study aims to determine whether the N1 can serve as a more general index of cue encoding, providing a tool for measuring listeners' perception of fine-grained acoustic differences in speech. We examined ERP responses to a wide range of phonetic contrasts, focusing particularly on voicing and place of articulation distinctions for different classes of speech sounds. Listeners were presented with natural speech spanning 18 consonants in English, and identified the consonant they heard while EEG was recorded. Results show differences in N1 amplitude as a function of voicing and place for both fricatives and stop consonants, replicating and extending previous findings. The pattern of results also suggests that some distinctions reflect perceptual encoding based on acoustic cue dimensions rather than articulatory dimensions. Overall, these results demonstrate that the N1 can serve as a general index of perceptual encoding at early stages of auditory perception across a range of phonetic contrasts in speech.

Keywords: speech perception, perceptual encoding, phonetic categorization, event-related brain potential technique, auditory N1

Corresponding Author:
Olivia Pereira
1701 Rockland Road, Wilmington, DE, 19803
Nemours Biomedical Research
Nemours Alfred I. duPont Hospital for Children
Email: [email protected]
Phone: 302-298-7293

What are the perceptual representations that allow human listeners to accurately perceive speech? Numerous models of speech perception have attempted to address this question, proposing that listeners might rely on categorical representations corresponding to phonological features (Liberman et al., 1967), articulatory gestures either inferred to be produced by the talker (motor theory; Liberman & Mattingly, 1985) or directly perceived (direct realism; Fowler, 1984), auditory representations that reflect domain-general characteristics of perception (Lotto & Kluender, 1998; Diehl et al., 2004), or acoustic cue representations (Nearey, 1997; Toscano & McMurray, 2010).

Distinguishing these models and uncovering the perceptual representations that listeners use has proven to be challenging. The categorical perception framework provides an illustrative example. Early studies hypothesized that listeners ignore fine-grained acoustic details during perceptual processing, instead perceiving speech sounds in terms of invariant phoneme categories (Liberman et al., 1957). This approach was influential in the development of many models of speech perception, particularly motor theory (Liberman & Mattingly, 1985). However, other work has refuted the claim that speech is perceived categorically. Pisoni & Lazarus (1974) demonstrated that characteristics of the task used (e.g., ABX discrimination vs. 4IAX discrimination) yield differences in whether or not listeners show categorical responses,1 and some listeners behave more categorically than others (Gerrits & Schouten, 2004). Similarly, other measurements, such as reaction times (Pisoni & Tash, 1974), category goodness ratings (Miller, 1994), and graded lexical activation in the visual world eye-tracking paradigm (McMurray et al., 2002) all reveal that listeners are sensitive to sub-phonemic acoustic differences in speech, arguing against the strongest forms of categorical perception (see Schouten et al., 2003, for further discussion).

Given the challenges in uncovering the perceptual representations used to perceive speech using behavioral tasks, some researchers have turned to cognitive neuroscience methods that may provide more direct measures of perceptual processing. For example, Myers et al. (2009) used fMRI to investigate how listeners map acoustic information onto phoneme categories in different cortical areas, specifically by looking at activation for within- vs. between-category phonemic differences. They found that inferior frontal sulcus showed activation for between-category differences but not within-category differences. In contrast, left superior temporal gyrus (STG) and superior temporal sulcus were activated by both within- and between-category differences. These results suggest that perceptual representations (localized to STG) are sensitive to within-category phonetic differences. Other work has found sensitivity to sub-phonemic differences in inferior frontal gyrus (IFG) as well (Blumstein et al., 2005).

While these results provide information about the representations used to process speech, the limited temporal resolution of fMRI makes it difficult to study listeners' initial perceptual representations. Intracranial EEG recording (electrocorticography; ECOG) provides precise temporal resolution while also localizing responses to specific cortical areas. However, ECOG studies have found conflicting results as to the types of representations used to perceive speech, with some data suggesting representations in STG are based on phonological features (Chang et al., 2010; Mesgarani et al., 2014) and others showing that detailed information about the speech spectrogram can be reconstructed from ECOG recordings (Pasley et al., 2012), suggesting a high sensitivity to acoustic differences in speech. Recently, Toscano et al. (2018) used the event-related optical signal (EROS) technique (Gratton & Fabiani, 2001) to measure the time-course of speech processing in different cortical areas. This technique offers good temporal and spatial resolution, and unlike ECOG, can be used non-invasively. Toscano et al. found that early perceptual representations in posterior STG are sensitive to continuous acoustic details, consistent with fMRI work (Myers et al., 2009) but contradicting claims from some ECOG studies (Chang et al., 2010).

Overall, the nature of perceptual representations used to perceive speech remains unclear from these studies, and often, they suggest conflicting explanations. In addition, many of them have focused on studying just one or two acoustic-phonetic distinctions. While useful, a full description of the processes underlying speech perception requires data on a wide range of naturally-produced speech sounds measured against an index that corresponds to listeners' initial perceptual representation of those sounds. One candidate for such a measure is the auditory N1 component measured using the event-related brain potential (ERP) technique.

1This is not to say that task-elicited top-down effects (e.g., attention and memory demands) are not important for perception. Indeed, previous work has suggested an important role of top-down information in speech perception (McClelland & Elman, 1986; Pitt & Samuel, 1995).

The N1 reflects early cortical stages of auditory perception (Picton & Hillyard, 1974). Generators for the N1 are likely located in auditory cortex (Pratt, 2010) as well as other cortical areas, including frontal areas (Giard et al., 1994). Thus, although the N1 is described as an auditory ERP component, its generators do not appear to be exclusively located in primary auditory cortex. The N1 also varies with several acoustic properties of the stimulus (e.g., frequency, intensity; Picton et al., 1978). Overall, it presents a good candidate for measuring early stages of perceptual processing and for distinguishing different candidate representations used for speech perception.

Frye et al. (2007) provide evidence suggesting that the magnetic counterpart to the N1 measured using magnetoencephalography (MEG) may index listeners' early perceptual representations. They presented listeners with stimuli varying along a voice onset time (VOT) continuum, an acoustic cue distinguishing word-initial voicing in stop consonants (e.g., /b/ vs. /p/) and found that the N1m response (the magnetic equivalent to the electrical N1 response) varied linearly with changes in VOT, suggesting that this response tracks acoustic changes in the speech signal. Subsequently, Toscano et al. (2010) presented listeners with stimuli varying along VOT continua and measured the auditory N1 ERP response, finding that continuous changes in VOT are also tracked linearly by the auditory N1 ERP component, with larger N1 amplitudes for shorter VOTs. These data suggest that listeners encode speech in terms of continuous acoustic differences at early stages of perceptual processing.

These results also open the door to studying perceptual representations more directly using the ERP technique. By measuring N1 responses to speech sounds varying in different acoustic cues and phonological contrasts, we may be able to gain further insight into the nature of representations that listeners use to perceive speech. For instance, a classic question concerns whether perceptual representations are based on articulatory gestures (Fowler, 1984; Viswanathan et al., 2010) or acoustic cue dimensions (Stevens & Blumstein, 1978). Behavioral data alone have been unable to resolve this debate (though see Viswanathan et al., 2010, for an alternative approach to addressing this). By measuring perceptual representations more directly, we may be able to shed light on this question by examining whether perceptual encoding varies with changes along acoustic dimensions or articulatory ones.

Several issues must be overcome to determine whether the N1 ERP component can provide a more general measure of early speech sound encoding and a tool for addressing these questions. First, we must investigate N1 responses to a variety of speech sounds, varying along different acoustic cue dimensions and phonological contrasts, in order to determine whether this measure indexes encoding of VOT specifically (as shown by Toscano et al., 2010), temporal cue distinctions generally, or phonetically-informative cues more generally. There is some preliminary evidence suggesting that the N1 may be a more general index of perceptual encoding. Toscano (2011) demonstrated similar N1 effects for vowel sounds varying along a continuum from /i/ to /u/, suggesting that these responses are not limited to VOT distinctions or stop consonants. Moreover, Toscano et al. (2010) found an overall difference in N1 amplitude between /b,p/ and /d,t/ distinctions, indicating that information about place of articulation may be encoded by the N1 as well. However, these differences among stop consonants have not been systematically investigated.

Second, previous work has measured these responses using either synthetic speech (Frye et al., 2007; Toscano et al., 2010) or modified naturally-produced speech sounds (Toscano, 2011). While these approaches are useful for investigating the contributions of specific acoustic dimensions, they do not tell us whether the N1 provides a more general indicator of perceptual encoding in natural speech, which contains a number of acoustic cues for each phonological contrast (Lisker, 1986; Toscano & McMurray, 2010). Ultimately, experiments with both carefully-controlled modified speech and natural speech will be necessary for uncovering the perceptual representations listeners use.

The current study aims to address these issues by presenting listeners with a wide array of natural speech sounds—76 words across 18 consonants, spanning most of the distinctions used in English—and examining how N1 amplitude varies as a function of specific phonological features (voicing, place of articulation, and manner of articulation). This will provide the data needed to determine how general the N1 is as a measure of perceptual encoding, and which specific phonological contrasts are distinguished by differences in N1 amplitude. Based on previous work, we predict, at minimum, that differences in stop consonant voicing and place will be evident in the N1. However, since this has not been investigated with natural speech, we do not know whether these responses reflect acoustic or articulatory dimensions, and it is unclear which other phonological contrasts produce changes in N1 amplitude. These issues are addressed in the current study.

Method

Participants

Twenty-seven participants from the Villanova University community completed the experiment. All participants had self-reported normal hearing, normal or corrected-to-normal vision, and were fluent in English.2 One participant's EEG data contained too many oculomotor artifacts, so this participant was excluded from analysis, resulting in N=26 participants in the final sample (15 female; 23 right-handed; mean age: 19 years). Participants provided informed consent in accordance with Villanova IRB protocols and received course credit or monetary compensation for their time.

Stimuli

Listeners were presented with natural speech sounds spanning 18 consonants (/b, tʃ, d, f, g, dʒ, k, l, m, n, p, ɹ, s, ʃ, t, v, w, z/) embedded in word-initial minimal pairs, chosen to span as many voicing, place, and manner contrasts as possible. This resulted in 2–8 words for each specific phoneme, with two words for /v/ (since it can only occur as a feature-level minimal pair with /f/ [a voicing contrast] and /z/ [a place contrast]) and eight words for /d/ (since it can participate in minimal pair contrasts with a number of other phonemes spanning place, manner, and voicing differences; see Table 1 for a complete list of words).3

2Demographic information was unavailable for one participant. In addition, three participants reported being non-native English speakers. Because there is little work on how the N1 varies in response to specific speech sounds, it is unknown how responses differ as a function of language background. To see whether this had an effect, we ran follow-up analyses examining data for the subset of listeners who were native English speakers (N=23). This revealed the same pattern of results as those reported below, except that some effects involving voicing were marginal: the voicing × manner interaction for the N1 amplitude analysis (χ2(1)=3.64, p=0.056), the main effects of voicing for stops (χ2(1)=2.99, p=0.084) and fricatives (χ2(1), p=0.074) for the amplitude analyses, and the main effect of voicing for the latency analysis (χ2(1)=3.19, p=0.074).

Table 1
Words used in the experiment.

Phoneme   Words                                          Manner                Voicing    Place
/b/       bead, beat, beer, bees, bet, bill, bit         Stop                  Voiced     Bilabial
/d/       dead, deal, debt, deed, deer, den, dill, dip   Stop                  Voiced     Alveolar
/g/       gear, get, gill                                Stop                  Voiced     Velar
/p/       peas, pen, pet, pin                            Stop                  Voiceless  Bilabial
/t/       tease, ted, teen, ten, thin, tin, tip          Stop                  Voiceless  Alveolar
/k/       Ken, keys, kin                                 Stop                  Voiceless  Velar
/v/       veal, vend, vest                               Fricative             Voiced     Labiodental
/z/       zeal, zen, zest, zip                           Fricative             Voiced     Alveolar
/f/       fed, feel, feet, fend                          Fricative             Voiceless  Labiodental
/s/       said, seal, seat, seen, sin, sip               Fricative             Voiceless  Alveolar
/ʃ/       shed, sheet, shin                              Fricative             Voiceless  Postalveolar
/dʒ/      gin, Jeep, Jess                                Affricate             Voiced     Postalveolar
/tʃ/      cheap, chess, chin                             Affricate             Voiceless  Postalveolar
/m/       meet, met, mitt                                Nasal                 Voiced     Bilabial
/n/       knit, neat, need, net, nip                     Nasal                 Voiced     Alveolar
/ɹ/       red, reed, rip                                 Approximant           Voiced     Alveolar
/l/       let, lead, lip, lit                            Lateral Approximant   Voiced     Alveolar
/w/       weed, wet, wit                                 Glide                 Voiced     Bilabial

Stimuli were recorded by a female talker in a sound-attenuated booth with a Rode NT1 condenser microphone and digitized at a sampling rate of 44.1 kHz using a Focusrite Scarlett 18i8 audio interface. Table 1 lists the minimal pair words used in the experiment. Several tokens of each word were recorded and two of the authors (OP and JCT) selected the best token to be used in the experiment based on overall audio quality (e.g., tokens that were free from audio artifacts). Selected tokens were spliced into individual sound files and amplitude normalized using the scale intensity function in Praat, a software package used for phonetic analysis (Boersma & Weenink, 2016). Stimuli are available at the Open Science Framework repository for this project: http://osf.io/e9wp2.
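The amplitude normalization step can be sketched as follows. This is a minimal, illustrative R function (R being the analysis environment used below) that rescales each token to a common RMS level using the tuneR package; the paper itself used Praat's scale intensity function, and the file names and target level here are hypothetical.

```r
# Minimal sketch of amplitude normalization, assuming the tuneR package.
# The study used Praat's "Scale intensity" function; this RMS-based version
# is an approximation for illustration only.
library(tuneR)

normalize_rms <- function(infile, outfile, target_rms = 0.1) {
  wav <- readWave(infile)                    # mono recording assumed
  samples <- wav@left / (2^(wav@bit - 1))    # convert to [-1, 1]
  scaled <- samples * (target_rms / sqrt(mean(samples^2)))
  scaled <- pmax(pmin(scaled, 1), -1)        # guard against clipping
  out <- Wave(left = round(scaled * (2^(wav@bit - 1) - 1)),
              samp.rate = wav@samp.rate, bit = wav@bit)
  writeWave(out, outfile)
}

# Hypothetical usage for one token:
# normalize_rms("bead_token3.wav", "bead_normalized.wav")
```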

Procedure

During the experiment, participants were seated comfortably in front of a computer in a sound-attenuated and electrically-shielded booth. Stimuli were presented using OpenSesame (Mathôt et al., 2012) via E-A-RTONE 3A insert earphones at each participant's most comfortable level.4 The experiment was completed in a single two-hour session. Stimuli were presented 10 times each in random order in blocks of 17 trials, for a total of 760 trials in the experiment. Each trial consisted of a fixation cross that appeared at the center of the screen, followed by the auditory stimulus, and an inter-trial interval of 900-980 ms. The interval between onset of the fixation cross and onset of the speech stimulus was jittered randomly between 300-450 ms. Participants indicated which sound each word began with by clicking letters corresponding to each phoneme from a display. Several consonants do not have a one-to-one mapping to a specific letter in English; alternative letters were used for these response labels. Specifically, "ch" was used for /tʃ/, "j" was used for /dʒ/, "r" was used for /ɹ/, and "sh" was used for /ʃ/. Participants were told to ignore spelling of the words and answer based on the sound at the beginning of each stimulus (e.g., for knit, they should respond "n").

3Although /ʒ/ does occur in some words in English, it does not typically occur word-initially except in loanwords. As such, it was not included in the stimuli for the current study.

4Due to an error in the experiment presentation program, stimuli were played back at a sampling rate of 48.0 kHz rather than 44.1 kHz, resulting in a slightly higher playback rate and perceived pitch than in the original recordings.
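As a rough illustration of the trial structure described in the Procedure above, the following R snippet builds a randomized trial list with the jittered fixation-to-stimulus interval and inter-trial interval. It is a sketch only: the experiment itself was implemented in OpenSesame, and the word list shown is a placeholder.

```r
# Sketch of a randomized trial list with timing jitter (illustrative only;
# the actual experiment was implemented in OpenSesame).
set.seed(1)
words <- c("bead", "peas", "dead", "tease")      # placeholder for the 76 words
n_reps <- 10                                     # each word presented 10 times

trials <- data.frame(word = rep(words, n_reps))
trials <- trials[sample(nrow(trials)), , drop = FALSE]         # random order
trials$fix_to_stim_ms <- round(runif(nrow(trials), 300, 450))  # jittered interval
trials$iti_ms <- round(runif(nrow(trials), 900, 980))          # inter-trial interval
head(trials)
```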

Listeners were instructed to minimize blinking during trials, and encouraged to blink during the inter-trial interval. In addition, a longer break was given approximately every 60 seconds (at the end of each block) and midway through the experiment.

EEG Recording and Data Processing

EEG was recorded using a Brain Products actiCHamp system with 32 electrodes at standard 10-20 sites (C3, C4, Cz, CP1, CP2, CP5, CP6, F3, F4, F7, F8, Fz, FC1, FC2, FC5, FC6, O1, O2, Oz, P3, P4, P7, P8, Pz, T7, T8; Klem et al., 1999). Electrode impedances were kept at <10 kΩ. Voltages were referenced to the left mastoid online, and re-referenced to the average mastoids offline. Electrooculograms (EOG) were recorded via an electrode located above the left eye (vertical EOG) and electrodes located adjacent to the external canthi of each eye (horizontal EOG). For all participants, trials containing oculomotor artifacts were rejected via peak-to-peak threshold detection with a threshold at 75 µV, as well as by visual inspection. Independent component analysis (ICA) was used to remove vertical and horizontal eye movement components for three participants with a high number of oculomotor artifacts. After running ICA, these data were subjected to the same artifact rejection routine as above in order to remove trials containing any remaining artifacts. EEG was recorded at a sampling rate of 500 Hz and band-pass filtered offline at 0.1-30 Hz (Butterworth filter with 12 dB/octave roll-off). Stimulus onsets were marked using a StimTrak device and acoustical adapter (Brain Products GmbH). All data processing was performed in EEGLAB (Delorme & Makeig, 2004) and ERPLAB (Lopez-Calderon & Luck, 2014).
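The peak-to-peak criterion amounts to discarding any epoch in which the voltage range on a channel exceeds 75 µV. Below is a minimal R sketch of this logic, assuming epoched data stored as trials × samples matrices; the actual rejection was performed with ERPLAB's tools plus visual inspection, and the variable names here are assumptions.

```r
# Sketch of peak-to-peak artifact rejection (threshold = 75 microvolts).
# `epochs` is assumed to be a list of channel matrices, each trials x samples;
# the real pipeline used ERPLAB's detection routines and visual inspection.
reject_peak_to_peak <- function(epochs, threshold_uv = 75) {
  # For each channel, compute the voltage range of every epoch
  p2p_by_channel <- sapply(epochs, function(ch) {
    apply(ch, 1, function(epoch) max(epoch) - min(epoch))
  })
  # Flag a trial if any channel exceeds the threshold
  bad_trials <- apply(p2p_by_channel, 1, function(p2p) any(p2p > threshold_uv))
  which(bad_trials)
}

# Hypothetical usage:
# bad <- reject_peak_to_peak(list(vEOG = veog_epochs, hEOG = heog_epochs))
# clean <- lapply(epochs, function(ch) if (length(bad)) ch[-bad, , drop = FALSE] else ch)
```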

Statistical Analyses

Statistical analyses were performed using linear mixed-effects models fit with the lme4 package (Bates et al., 2015) in R (R Development Core Team, 2011). Mean N1 amplitude was entered as the dependent measure. N1 amplitudes were measured as the mean voltage from 75 to 125 ms post-stimulus across electrodes F3, Fz, and F4. These electrode locations are the same as those used in previous work examining the N1 response to speech sounds (Toscano et al., 2010) and are near the expected location of the fronto-central N1 peak. The time window encompasses the typical peak of the N1 and allows us to measure differences in N1 amplitude while minimizing overlap from adjacent components, such as the P2. Each consonant was coded based on its phonological features: voicing (voiced or voiceless), place of articulation (bilabial, labiodental, labialvelar, alveolar, post-alveolar, or velar), and manner of articulation (stop, fricative, nasal, affricate, approximant, lateral approximant).5 Each feature was coded as 0 or 1 (e.g., for bilabial, all bilabial consonants were coded as 1; all others as 0), centered, and entered as a fixed effect in the model. The maximal random effect structure justified by the design (Barr et al., 2013) was used. Model comparison was used to evaluate significance of each fixed effect and interaction via χ2 goodness-of-fit tests.

5Because the distribution of features across consonants is unbalanced (e.g., there are no voiceless approximants), higher order interactions that do not occur are automatically dropped from the model.
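To make the analysis pipeline described above concrete, the R sketch below computes mean N1 amplitude in the 75-125 ms window averaged over F3, Fz, and F4, and then evaluates a fixed effect by comparing nested lme4 models with a chi-square test. The data frame layout and variable names are assumptions, and the random-effects structure shown is simplified relative to the maximal structure actually used.

```r
# Sketch of the amplitude measurement and mixed-effects analysis (illustrative;
# variable names and the simplified random-effects structure are assumptions).
library(lme4)

# `erp` is assumed to hold single-trial voltages in long format, with columns
# subject, trial, voicing, manner, time_ms, F3, Fz, F4 (features coded 0/1).
n1 <- subset(erp, time_ms >= 75 & time_ms <= 125)
n1$amp <- rowMeans(n1[, c("F3", "Fz", "F4")])        # average frontal channels
n1_mean <- aggregate(amp ~ subject + trial + voicing + manner, data = n1, FUN = mean)

# Center the 0/1 feature codes, as in the analysis reported above
n1_mean$voicing_c <- as.numeric(scale(n1_mean$voicing, scale = FALSE))
n1_mean$manner_c  <- as.numeric(scale(n1_mean$manner,  scale = FALSE))

# Models with and without the voicing main effect; chi-square model comparison
m_full    <- lmer(amp ~ voicing_c * manner_c + (1 + voicing_c | subject),
                  data = n1_mean)
m_reduced <- lmer(amp ~ manner_c + voicing_c:manner_c + (1 + voicing_c | subject),
                  data = n1_mean)
anova(m_reduced, m_full)   # likelihood-ratio (chi-square) test for voicing
```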

Results

Overall, participants performed well on the behavioral task with a mean accuracy of 95.4%, expected performance for a phoneme categorization task (Toscano & Allen, 2014). Additionally, stimuli elicited typical auditory ERP waveforms, with a prominent N1 peak around 100 ms after stimulus onset for each consonant. Mean N1 amplitudes are shown in Figure 1, and grand average waveforms for subsets of consonants (discussed in detail below) are shown in Figures 3–7. Scalp distribution was similar across all consonants, with the peak of the N1 observed at frontal electrodes, as expected (Fig. 2).

Figure 1. Mean N1 amplitudes. (A) Mean amplitudes for stop consonants, showing effects of voicing (larger N1 for voiced sounds) and place (largest N1 for bilabial, followed by velar, followed by alveolar). (B) Mean amplitudes for fricatives, showing an effect of voicing (larger N1 for voiced sounds) and an effect of place for voiceless sounds (smaller N1 for /f/ than for /s,ʃ/). (C) Mean amplitudes for nasals (/m,n/). (D) Mean amplitudes for affricates (/dʒ,tʃ/). (E) Mean amplitudes for approximants (/ɹ,w/) and lateral approximants (/l/). Error bars indicate standard error.


Figure 2. Scalp maps showing mean voltage from 75 to 125 ms for each of the 18 consonant distinctions in the experiment, grouped by (A) stop consonants, (B) fricatives, (C) nasals, (D) affricates, and (E) approximants and lateral approximants.

N1 Amplitude

We first examined differences in N1 amplitude across all consonants. This analysis revealed a main effect of voicing (χ2(1)=8.44, p=0.004), with larger N1 amplitudes for voiced consonants. There was also a voicing × manner interaction (χ2(1)=6.65, p=0.010), suggesting that voicing differences were observed for some manners of articulation but not others. Lastly, there was a place × manner interaction (χ2(1)=5.35, p=0.021), suggesting that place differences were observed for some manners but not others. These effects are consistent with the pattern of results seen in Figure 1: an overall effect of voicing (larger N1 amplitude for voiced sounds), and differences as a function of voicing and place for the stops and fricatives, but not the other consonants. To further evaluate these effects, we fit models to the data for each manner of articulation separately, as consonants with different manners are acoustically quite different from each other. This allowed us to examine effects of place and voicing for stops and fricatives, which vary along both dimensions, as well as place differences for nasals and voicing differences for affricates. The remaining three consonants (/l,ɹ,w/) are all voiced and /w/ does not fit neatly into a place of articulation category; these were examined with respect to each individual consonant.

Stop Consonants. Figure 3 shows grand average waveforms for each voicing contrast among the stop consonants, and Figure 4 shows waveforms for each place of articulation contrast. As expected, we found a main effect of voicing (χ2(1)=5.36, p=0.021), such that voiced stops produce larger N1s than voiceless stops, consistent with the observations for synthetic speech from Toscano et al. (2010). We also found a main effect of place (χ2(2)=9.16, p=0.010). For the voiced stops (/b,d,g/), the largest N1 amplitude was elicited by /b/, with smaller amplitudes for /d/ and /g/ (which were similar to each other). For the voiceless stops (/p,t,k/), however, N1 differences were more apparent, with the largest N1 for /p/, a smaller N1 for /k/, and the smallest N1 for /t/. This order does not reflect the relative ordering of these phonemes based on articulatory gestures (/p,t,k/), suggesting instead that listeners may be encoding an acoustic cue that varies from /p/ to /k/ to /t/; spectral shape (Stevens & Blumstein, 1978) may be one such cue. The place × voicing interaction was not significant (χ2(2)=3.37, p=0.185), suggesting that the place effect did not differ between the two voicing categories. Thus, N1 amplitude varies with place of articulation for stop consonants, and moreover, it does so in a way that is inconsistent with a gestural coding system, with the largest N1 for bilabial stops, followed by velars, followed by alveolars.

Figure 3. Grand average waveforms for stop consonant voicing differences: (A) bilabials (/b,p/), (B) alveolars (/d,t/), and (C) velars (/g,k/). For each contrast, larger N1s are observed for voiced than for voiceless phonemes. Note: In all ERP waveform figures presented here, positive is plotted up and voltages correspond to the average across F3, Fz, and F4.

Figure 4. Grand average waveforms for stop consonant place differences: (A) voiced sounds (/b,d,g/), and (B) voiceless sounds (/p,t,k/). For both sets of sounds, bilabials (/b,p/) produce the largest N1. For voiceless sounds, N1 amplitude decreases from bilabials (/p/) to velars (/k/) to alveolars (/t/). For voiced sounds, the velars (/g/) produce the smallest N1, but the difference between alveolars (/d/) and velars (/g/) is smaller.


Figure 5. Grand average waveforms for fricative voicing contrasts: (A) labiodental fricatives (/v,f/) and (B) alveolar fricatives (/z,s/). For both sets of consonants, voiced sounds produce larger N1 amplitudes, similar to the pattern observed for stop consonants in Figure 3.

Figure 6. Grand average waveforms for fricative place of articulation contrasts: (A) voiced sounds (/v,z/) and (B) voiceless sounds (/f,s,ʃ/). No differences in N1 amplitude are observed for the voiced sounds. For the voiceless sounds, labiodental fricatives (/f/) produce a smaller N1 than the alveolar (/s/) and post-alveolar (/ʃ/) fricatives.

Fricative Consonants. Figure 5 shows waveforms for each voicing contrast for the fricatives, and Figure 6 shows waveforms for the place distinctions among the fricatives. The pattern of results is similar to that of the stop consonants. The mixed-effects analyses revealed a main effect of voicing (χ2(1)=4.72, p=0.030) with larger N1s for voiced sounds. This suggests that the N1 may be sensitive to voicing cues that are common to both stops and fricatives, such as the presence of low-frequency energy, rather than to specific cues like VOT. No differences in N1 amplitude are observed between the voiced fricatives. For the voiceless fricatives, N1 amplitude is smaller for /f/ relative to the other two phonemes (/s,ʃ/). We also found a place × voicing interaction (χ2(1)=5.29, p=0.021), suggesting that N1 amplitude varies with fricative place of articulation, but only for the voiceless consonants. This was confirmed in a follow-up analysis examining the effect of place for the voiced and voiceless fricatives separately. For the voiceless sounds, there was a main effect of place (χ2(2)=11.25, p=0.004), but the effect was not significant for the voiced sounds (χ2(1)=0.007, p=0.932).

Other Consonants. Lastly, we examined differences for the remaining seven consonants, which only vary along one phonological feature dimension each. Overall, within each manner of articulation (grouping approximants [/ɹ,w/] and lateral approximants [/l/] together), no substantial differences in N1 amplitude were observed. For the nasals, which only differ in place of articulation, N1 amplitude was slightly larger for /n/ than for /m/ (Fig. 7A), but this difference was not significant (χ2(1)=1.71, p=0.190). For the affricates (/tʃ,dʒ/), N1 amplitude was slightly larger for the voiced sounds (/dʒ/) than the voiceless sounds (/tʃ/), similar to the pattern observed for the stops and fricatives (Fig. 7B). However, this difference was not significant for the affricates (χ2(1)=2.78, p=0.095). N1 amplitudes were similar for the remaining three consonants (/l,ɹ,w/; Fig. 7C). For these sounds, we examined differences as a function of manner and place. Neither effect was significant (place: χ2(1)=1.22, p=0.269; manner: χ2(1)=0.24, p=0.624). Thus, overall, differences in N1 amplitude are observed across the stops and fricatives, but no differences are apparent for the other phonemes.

Figure 7. Grand average waveforms for the remaining seven consonants: (A) nasals (/m,n/), (B) affricates (/dʒ,tʃ/), and (C) approximants (/ɹ,w/) and lateral approximants (/l/). Overall, no differences in N1 amplitude are observed within a given manner of articulation for these consonants.

N1 Latency

To further characterize N1 differences across consonants, we examined the latency of the N1 component, following a similar approach to the analyses for N1 amplitude. We calculated the 50% fractional area latency (Luck, 2014) in the same time window used to compute mean N1 amplitude (75-125 ms) and computed the average latency across the three frontal electrodes (F3, Fz, F4). Figure 8 shows mean latencies for each consonant. Overall, latencies were similar across all conditions, with slightly longer latencies for voiced fricatives. A mixed-effects model comparing latencies across all consonants revealed a main effect of voicing (χ2(1)=4.54, p=0.033) and a main effect of manner (χ2(5)=12.31, p=0.031); no other main effects and none of the interactions were significant.6 Follow-up analyses revealed a main effect of voicing for the fricatives (χ2(1)=8.35, p=0.004) and no other significant effects. Thus, N1 latencies are longer for voiced fricatives than for the other consonants.

6For the overall latency analysis, the baseline model (i.e., the model with no fixed effects) failed to converge. Since this is the model against which the voicing effect is compared in the χ2 tests, we performed an alternative analysis in which the effect of voicing was compared against a model containing just the terms for place (i.e., comparing a model with terms for place against a model with terms for both place and voicing). This analysis showed a marginal effect for voicing (χ2(1)=3.79, p=0.052).
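The 50% fractional area latency described above is the time point by which half of the negative-going area within the 75-125 ms window has accumulated (Luck, 2014). A minimal R sketch under that definition follows; the actual measurements were made in ERPLAB, and counting only negative deflections toward the area is an assumption of this illustration.

```r
# Sketch of 50% fractional area latency for a negative-going component
# (illustrative; the actual measurement used ERPLAB's fractional area tools).
fractional_area_latency <- function(time_ms, voltage, t_min = 75, t_max = 125,
                                    fraction = 0.5) {
  idx <- which(time_ms >= t_min & time_ms <= t_max)
  v <- voltage[idx]
  neg_area <- pmax(-v, 0)          # only negative voltage contributes (assumed)
  cum_area <- cumsum(neg_area)
  target <- fraction * cum_area[length(cum_area)]
  time_ms[idx][which(cum_area >= target)[1]]
}

# Hypothetical usage with averaged frontal-channel waveforms sampled at 500 Hz:
# lat <- fractional_area_latency(times, rowMeans(cbind(F3_avg, Fz_avg, F4_avg)))
```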


Figure 8. Mean N1 latencies for (A) stop consonants, (B) fricatives, (C) nasals, (D) affricates, and (E) approximants and lateral approximants. Overall, voiced fricatives show slightly longer latencies than the other sounds.

Discussion

The results demonstrate that the N1 varies across a range of acoustic distinctions and phonological contrasts in speech. We found clear differences in N1 amplitude as a function of both voicing and place among stops and fricatives, two of the largest classes of speech sounds in English. This suggests that the N1 can serve as a useful tool for studying early stages of perceptual encoding during speech perception.

Nature of perceptual representations

These results complement other work examining the neurobiology of speech processing using both EEG (Toscano et al., 2010) and other methods (e.g., Chang et al., 2010; Myers et al., 2009; Toscano et al., 2018). Moreover, the current study extends previous work that has typically only looked at a single phonetic contrast. While some previous studies have used a broad range of speech sounds (Mesgarani et al., 2014), they have found mixed results in terms of the types of perceptual representations used by the brain to perceive those sounds (e.g., Di Liberto et al., 2015, found sensitivity corresponding to both acoustic cues and phonological categories).


The data presented here help to clarify these issues, and the effects reveal patterns that are informative as to the types of representations that listeners are using. This is demonstrated most clearly by the stop place of articulation effect, where N1 amplitude is largest for bilabials, followed by velars, followed by alveolars. The relative ordering of effects along this dimension does not correspond to the articulatory gestures used to produce the sounds, which would predict an ordering from bilabials to alveolars to velars (or vice versa; the direction of the effect is immaterial in this case; Ohala et al., 1986). This argues against articulatory gestures as the type of representation used by listeners to perceive speech. Rather, an account based on acoustic cues seems more promising. However, the current data do not rule out the possibility that a gestural representation (or some other type of representation) is used to represent speech sounds at a later stage of processing. The P3 component, which varies as a function of listeners' phonological category boundaries, may index such representations (Toscano et al., 2010). Nonetheless, the current results suggest that initial perceptual encoding is better described in terms of differences along acoustic dimensions, rather than strictly along articulatory ones.

Such representations argue against motor theory and gestural models of speech perception, which posit that perceptual representations are based on listeners' inferences about the talker's gestures, either perceived directly (direct realism; Fowler, 1984) or abstractly (motor theory; Liberman & Mattingly, 1985). In both cases, this implies a representational organization based on articulatory features. One alternative, however, is that gestures are only perceived as binary features (e.g., +/- alveolar, or +/- bilabial). Combinations of binary features could, in theory, result in the pattern of observations reported here. However, if this is the case, the representations cannot be distinguished based on whether they are acoustic or articulatory in nature (e.g., a high/low formant frequency [acoustic] and a front/back tongue position [articulatory] would yield similar results). Representations organized along an articulatory feature dimension provide the most direct test of gestural models.

Is it possible to observe an articulatory-based response in the N1, given that it is an auditory sensory ERP component? First, note that the N1 is not exclusively generated in primary auditory cortex (cf. Picton & Hillyard, 1974; Giard et al., 1994). In addition, activation in IFG corresponding to differences along acoustic cue dimensions is observed at approximately the same time as the N1, suggesting possible frontal contributions to the response (Toscano et al., 2018). Thus, differences between the speech sounds presented in the current study could have been organized along articulatory dimensions localized to other cortical areas (though this is not the pattern we observed). Moreover, gestural theories predict that the relevant units for speech perception are articulatory-based even though the input to the system is auditory. Thus, the strongest form of these theories would predict that the earliest representations outside of those immediately responsible for coding the sound (e.g., subcortical auditory structures) would be based on gestures. The current data argue against this, but it is possible that speech is initially encoded in terms of acoustic cues and then recoded as articulatory gestures at a later stage. The current results do not reveal the basis of these more abstract phonological representations, but they do suggest that, at least at early stages of perception, listeners process speech sounds in terms of specific acoustic cues.

Which cues are encoded?

If listeners encode speech sounds in terms of acoustic cues, as suggested by the current results, which cues are they using? Because the current study used natural speech sounds that contain multiple cues, we cannot determine which specific cues are being encoded in these stimuli. However, spectral shape, a cue that has been argued to be invariant across stop place of articulation, varies in precisely the same way as the stop consonant place of articulation effects observed in the N1. Stevens & Blumstein (1978) demonstrated that spectral shape shows a diffuse-falling pattern (low-frequency peak) for bilabials, a mid-frequency peak for velars, and a diffuse-rising pattern (high-frequency peak) for alveolars. Thus, listeners may be encoding this or a similar cue to distinguish place of articulation among stops.

For fricatives, an effect of place was also observed for the voiceless sounds. However, the pattern of results does not suggest an effect corresponding to spectral mean, similar to the spectral cue mentioned above for the stops. Although spectral mean does provide a useful cue for distinguishing place among fricatives (Jongman et al., 2000), it is largest for /s/ and smallest for /ʃ/, with intermediate values for /f/. This does not match the relative order of effects for N1 amplitude, which is largest for /s/ and smallest for /f/. Thus, there is likely some other cue dimension that listeners are tracking for these distinctions.

The results do reveal some general patterns across the stops and fricatives that may indicate which specific acoustic cues are indexed by differences in N1 amplitude. Because we used natural speech, these data do not isolate specific cues. However, both the voiced fricatives and voiced stop consonants showed larger N1s than their voiceless counterparts. Typically, the primary acoustic cues distinguishing these sounds are described as distinct from one another (VOT for stops; low-frequency modulations for fricatives). However, these sounds can also be described in terms of acoustically-similar cues. For instance, the amplitude of low-frequency periodic energy is greater for voiced than voiceless fricatives (McMurray & Jongman, 2011), which provides a perceptually-useful cue (Li et al., 2012). If we were to average across the first 20-40 milliseconds of a stop consonant, we would also find greater low-frequency energy for the voiced stops than the voiceless stops, since voiced stops have shorter VOTs and, therefore, the average would contain more periodic energy from the vowel. F0 onset also serves as a voicing cue for both stops (Abramson & Lisker, 1985; Whalen et al., 1993) and fricatives (McMurray & Jongman, 2011), providing an additional possible cue operating in this frequency range that listeners may be encoding.
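This low-frequency account can be made concrete with a simple measurement: low-pass filter a token and compute the RMS energy over its initial tens of milliseconds. The R sketch below uses the tuneR and signal packages; the 500 Hz cutoff, the 30 ms window, and the file names are illustrative assumptions, not values from the paper.

```r
# Sketch: low-frequency energy over the initial portion of a token
# (illustrative; the 500-Hz cutoff and 30-ms window are assumed values).
library(tuneR)
library(signal)

low_freq_energy <- function(infile, cutoff_hz = 500, window_ms = 30) {
  wav <- readWave(infile)
  x <- wav@left / (2^(wav@bit - 1))
  fs <- wav@samp.rate
  bf <- butter(4, cutoff_hz / (fs / 2), type = "low")   # 4th-order Butterworth
  x_lp <- filtfilt(bf, x)                               # zero-phase low-pass
  n <- round(fs * window_ms / 1000)
  sqrt(mean(x_lp[1:n]^2))                               # RMS over initial window
}

# Under this measure, a voiced-initial token would be expected to show more
# low-frequency energy at onset than a voiceless-initial one (hypothetical files):
# low_freq_energy("bead_token.wav"); low_freq_energy("peas_token.wav")
```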

Note, however, that such patterns are not universal: the effects of place of articulation were different for stops and fricatives (i.e., largest for bilabials for stops; smallest for labiodentals for fricatives, the closest corresponding place of articulation). Indeed, along with the results for stops suggesting cue-based encoding rather than encoding based on articulatory gestures, these results indicate that features may ultimately not be the best way to characterize speech sounds for the purposes of describing the representations used in speech perception (they are, after all, derived from speech production). Additional data, particularly data manipulating specific acoustic cues along different acoustic dimensions, will be needed to address this issue, and will serve as a useful complement to the broad analysis of natural speech sounds presented here.

The remaining speech sounds, /m, n, dʒ, tʃ, l, ɹ, w/, showed no substantial differences in N1 amplitude. Clearly, listeners can encode the acoustic differences between these sounds, as they are able to discriminate them behaviorally. Why, then, do these sounds not lead to differences in N1 amplitude? If the N1 indexes perceptual encoding along phonetically-relevant—but distinct—acoustic cue dimensions, it may simply be that the acoustic cues listeners use for these distinctions are represented in a way that we cannot detect via differences in the fronto-central N1. That is, because ERP components require generators that are oriented in a particular direction in cortex relative to the scalp, there are likely many acoustic cues for which we cannot observe differences in N1 amplitude, simply due to the way they are represented cortically. Indeed, a key advantage of this dataset is that it allows us to investigate which potential acoustic and phonological dimensions are readily observable in scalp-recorded ERP components. Moreover, the differences observed here are not necessarily (and are probably not) the only ones involved in speech perception. For instance, earlier, subcortical representations may also be informative as to which acoustic distinctions in speech listeners are sensitive to.

Later-occurring responses

The current findings demonstrate that the ERP technique can be used to study early auditory perception. Future work could also look at later ERP components, such as the P3, in conjunction with the N1. The P3 has been shown to reflect stimulus categorization and is not specific to speech or auditory perception (Azizian et al., 2006; Johnson & Donchin, 1980). Thus, the P3 is a good candidate for measuring post-perceptual phonological categorization. Previously, Toscano et al. (2010) found that differences in the P3 reflect listeners' phonological category structure, with the largest P3 amplitudes at the continuum endpoints for task-defined target categories. Participants heard stimuli varying along two VOT continua, beach-peach and dart-tart. One of the four word endpoints served as a target in different blocks of the experiment. The results showed that the P3 was largest for stimuli from the target word's continuum at VOT values corresponding to the target word endpoint (i.e., largest P3 for short VOTs along the beach-peach continuum when beach was the target). P3 amplitude was also graded within each phoneme category, suggesting that fine-grained acoustic information is preserved at post-perceptual stages. Investigating these later-occurring responses in more detail with a broader range of speech sounds would provide further information as to how listeners encode speech at different stages of processing.

Such data would also help to address the question of how the information encoded in the N1 response is ultimately used for categorization. Because we used natural speech presented in quiet, listeners were nearly at ceiling at recognizing these sounds, as is typically observed. In order to assess the extent to which differences in N1 encoding correspond to listeners' eventual behavioral classification of the sound and/or to later-occurring ERP responses, we would need to either make the sounds ambiguous (using either synthetic or edited naturally-produced speech) or present them in noise—this is an important goal for future work in this area.

Conclusions

Overall, our results indicate that the N1 can serve as a useful tool for studying cue encoding at early stages of speech perception across a range of phonological distinctions. We find evidence that voiced consonants produce larger N1 amplitudes than voiceless consonants, suggesting that listeners may be tracking the amplitude of low-frequency energy at onset as a cue to the voiced/voiceless distinction. There is also evidence that listeners encode specific acoustic cues to place of articulation, such as spectral shape for stops, rather than encoding sounds in terms of articulatory features. Together, these findings open the door for future research using the N1 ERP response to shed light on the question of how human listeners represent speech sounds during early stages of perception.

Acknowledgements

We would like to thank the research assistants in the Word Recognition and Auditory Perception Lab at Villanova for their help with data collection, Emma Folk and Ben Falandays for assistance with participant recruitment, and Emma Folk for recording the auditory stimuli used in the experiment.

References

Abramson, A. S. & Lisker, L. (1985). Relative power of cues: F0 shift versus voice timing. In V. Fromkin (Ed.), Phonetic linguistics: Essays in honor of Peter Ladefoged (pp. 25–33). New York: Academic Press.

Azizian, A., Freitas, A., Watson, T., & Squires, N. (2006). Electrophysiological correlates of categorization: P300 amplitude as index of target similarity. Biological Psychology, 71(3), 278–288.

Barr, D. J., Levy, R., Scheepers, C., & Tily, H. J. (2013). Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68(3), 255–278.

Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48.

Blumstein, S. E., Myers, E. B., & Rissman, J. (2005). The perception of voice onset time: An fMRI investigation of phonetic category structure. Journal of Cognitive Neuroscience, 17(9), 1353–1366.

Boersma, P. & Weenink, D. (2016). Praat: Doing phonetics by computer. Available at: http://www.praat.org/.

Chang, E. F., Rieger, J. W., Johnson, K., Berger, M. S., Barbaro, N. M., & Knight, R. T. (2010). Categorical speech representation in human superior temporal gyrus. Nature Neuroscience, 13(11), 1428.

Delorme, A. & Makeig, S. (2004). EEGLAB: An open source toolbox for analysis of single-trial EEG dynamics including independent component analysis. Journal of Neuroscience Methods, 134, 9–21.

Di Liberto, G. M., O'Sullivan, J. A., & Lalor, E. C. (2015). Low-frequency cortical entrainment to speech reflects phoneme-level processing. Current Biology, 25(19), 2457–2465.

Diehl, R. L., Lotto, A. J., & Holt, L. L. (2004). Speech perception. Annual Review of Psychology, 55, 149–179.

Fowler, C. A. (1984). Segmentation of coarticulated speech in perception. Perception and Psychophysics, 36, 359–368.

Frye, R. E., Fisher, J. M., Coty, A., Zarella, M., Liederman, J., & Halgren, E. (2007). Linear coding of voice onset time. Journal of Cognitive Neuroscience, 19, 1476–1487.

Gerrits, E. & Schouten, M. E. H. (2004). Categorical perception depends on the discrimination task. Perception and Psychophysics, 66, 363–376.


Giard, M., Perrin, F., Echallier, J., Thevenet, M., Froment, J., & Pernier, J. (1994). Dissociation of temporal and frontal components in the human auditory N1 wave: A scalp current density and dipole model analysis. Electroencephalography and Clinical Neurophysiology/Evoked Potentials Section, 92(3), 238–252.

Gratton, G. & Fabiani, M. (2001). Shedding light on brain function: The event-related optical signal. Trends in Cognitive Sciences, 5(8), 357–363.

Johnson, R. & Donchin, E. (1980). P300 and stimulus categorization: Two plus one is not so different from one plus one. Psychophysiology, 17(2), 167–178.

Jongman, A., Wayland, R., & Wong, S. (2000). Acoustic characteristics of English fricatives. Journal of the Acoustical Society of America, 108, 1252–63.

Klem, G. H., Lüders, H. O., Jasper, H. H., & Elger, C. (1999). The ten-twenty electrode system of the International Federation. Electroencephalography and Clinical Neurophysiology, 52(3), 3–6.

Li, F., Trevino, A., Menon, A., & Allen, J. B. (2012). A psychoacoustic method for studying the necessary and sufficient perceptual cues of American English fricative consonants in noise. Journal of the Acoustical Society of America, 132(4), 2663–2675.

Liberman, A. M., Cooper, F. S., Shankweiler, D. P., & Studdert-Kennedy, M. (1967). Perception of the speech code. Psychological Review, 74, 431–461.

Liberman, A. M., Harris, K. S., Hoffman, H. S., & Griffith, B. C. (1957). The discrimination of speech sounds within and across phoneme boundaries. Journal of Experimental Psychology, 54(5), 358–368.

Liberman, A. M. & Mattingly, I. G. (1985). The motor theory of speech perception revised. Cognition, 21, 1–36.

Lisker, L. (1986). "Voicing" in English: A catalogue of acoustic features signaling /b/ versus /p/ in trochees. Language and Speech, 29, 3–11.

Lopez-Calderon, J. & Luck, S. J. (2014). ERPLAB: An open-source toolbox for the analysis of event-related potentials. Frontiers in Human Neuroscience, 8, 213.

Lotto, A. J. & Kluender, K. R. (1998). General contrast effects in speech perception: Effect of preceding liquid on stop consonant identification. Perception and Psychophysics, 60, 602–619.

Luck, S. J. (2014). An Introduction to the Event-Related Potential Technique. MIT Press.

Mathôt, S., Schreij, D., & Theeuwes, J. (2012). OpenSesame: An open-source, graphical experiment builder for the social sciences. Behavior Research Methods, 44(2), 314–324.

McClelland, J. L. & Elman, J. L. (1986). The TRACE model of speech perception. Cognitive Psychology, 18, 1–86.

McMurray, B. & Jongman, A. (2011). What information is necessary for speech categorization? Harnessing variability in the speech signal by integrating cues computed relative to expectations. Psychological Review, 118, 219–46.


McMurray, B., Tanenhaus, M. K., & Aslin, R. N. (2002). Gradient effects of within-category phonetic variation on lexical access. Cognition, 86, B33–42.

Mesgarani, N., Cheung, C., Johnson, K., & Chang, E. F. (2014). Phonetic feature encoding in human superior temporal gyrus. Science, 343(6174), 1006–1010.

Miller, J. L. (1994). On the internal structure of phonetic categories: A progress report. Cognition, 50(1), 271–285.

Myers, E. B., Blumstein, S. E., Walsh, E., & Eliassen, J. (2009). Inferior frontal regions underlie the perception of phonetic category invariance. Psychological Science, 20(7), 895–903.

Nearey, T. M. (1997). Speech perception as pattern recognition. Journal of the Acoustical Society of America, 101, 3241–3254.

Ohala, J. J., Browman, C. P., & Goldstein, L. M. (1986). Towards an articulatory phonology. Phonology, 3, 219–252.

Pasley, B. N., David, S. V., Mesgarani, N., Flinker, A., Shamma, S. A., Crone, N. E., Knight, R. T., & Chang, E. F. (2012). Reconstructing speech from human auditory cortex. PLoS Biology, 10(1), e1001251.

Picton, T. W. & Hillyard, S. A. (1974). Human auditory evoked potentials. II. Effects of attention. Electroencephalography and Clinical Neurophysiology, 36, 191–9.

Picton, T. W., Woods, D. L., & Proulx, G. B. (1978). Human auditory sustained potentials. II. Stimulus relationships. Electroencephalography and Clinical Neurophysiology, 198–210.

Pisoni, D. B. & Lazarus, J. H. (1974). Categorical and noncategorical modes of speech perception along the voicing continuum. Journal of the Acoustical Society of America, 55, 328–333.

Pisoni, D. B. & Tash, J. (1974). Reaction times to comparisons within and across phonetic categories. Attention, Perception, & Psychophysics, 15(2), 285–290.

Pitt, M. A. & Samuel, A. G. (1995). Lexical and sublexical feedback in auditory word recognition. Cognitive Psychology, 29, 149–188.

Pratt, H. (2010). Sensory ERP components. New York: Oxford University Press.

R Development Core Team (2011). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

Schouten, B., Gerrits, E., & van Hessen, A. (2003). The end of categorical perception as we know it. Speech Communication, 41(1), 71–80.

Stevens, K. N. & Blumstein, S. E. (1978). Invariant cues for place of articulation in stop consonants. Journal of the Acoustical Society of America, 64, 1358–1368.

Toscano, J. C. (2011). Perceiving speech in context: Compensation for contextual variability at the level of acoustic cue encoding and categorization.


Toscano, J. C. & Allen, J. B. (2014). Across- and within-consonant errors for isolated syllables in noise. Journal of Speech, Language, and Hearing Research, 57, 2293–2307.

Toscano, J. C., Anderson, N. D., Fabiani, M., Gratton, G., & Garnsey, S. M. (2018). The time-course of cortical responses to speech revealed by fast optical imaging. Brain and Language, 184, 32–42.

Toscano, J. C. & McMurray, B. (2010). Cue integration with categories: Weighting acoustic cues in speech using unsupervised learning and distributional statistics. Cognitive Science, 34, 434–464.

Toscano, J. C., McMurray, B., Dennhardt, J., & Luck, S. J. (2010). Continuous perception and graded categorization: Electrophysiological evidence for a linear relationship between the acoustic signal and perceptual encoding of speech. Psychological Science, 21, 1532–1540.

Viswanathan, N., Magnuson, J. S., & Fowler, C. A. (2010). Compensation for coarticulation: Disentangling auditory and gestural theories of perception of coarticulatory effects in speech. Journal of Experimental Psychology: Human Perception and Performance, 36, 1005–1015.

Whalen, D. H., Abramson, A. S., Lisker, L., & Mody, M. (1993). F0 gives voicing information even with unambiguous voice onset times. Journal of the Acoustical Society of America, 93(4), 2152–2159.