24 Multisensory Interactions in Speech Perception

Jordi Navarra, H. Henny Yeung, Janet F. Werker, and Salvador Soto-Faraco

Speech as a Multisensory Phenomenon

The perception of someone talking provides correlated input to more than one sensory modality (mainly vision and audition) simultaneously. There are many everyday situations (such as face-to-face conversations, watching television, or videoconferencing) in which linguistically reliable information can be obtained from the sight of the speaker. The fact that one often communicates effectively in the absence of any visual cue (e.g., talking over the telephone) perhaps leads to the simple inference that these visual speech cues are completely redundant with respect to the concurrent acoustic input or even useless in their linguistic and informational relevance. Nevertheless, empirical evidence accumulated over the last few decades provides solid grounds to dismiss this subjective impression and instead supports the conclusion that there is important information from vision that, when accessible, complements and supplements the acoustic speech signal.

Vision carries substantial, and linguistically relevant, cues about the spoken signal. For example, research and clinical/educational practice with deaf individuals have repeatedly demonstrated the benefits of lipreading (or speechreading) under conditions of hearing loss (see Auer, 2010, for a review). Normally hearing individuals also display a remarkable sensitivity to these visual speech cues. For instance, the fact that adults, and even infants as young as 4 months, are capable of discriminating between silent faces articulating sentences in different languages (e.g., English and French) suggests that the sensitivity to visual speech information arises as a part of normal development and not only as a compensatory strategy to cope with acoustic impairment (Weikum et al., 2007; see also Soto-Faraco et al., 2007; see figure 24.1). Many different linguistic cues (including both segmental and suprasegmental) can, in fact, be retrieved from visual speech articulations (Bernstein, Eberhardt, & Demorest, 1989; Jiang, Auer, Alwan, Keating, & Bernstein, 2007; Vatikiotis-Bateson, Munhall, Kasahara, Garcia, & Yehia, 1996; Yehia, Kuratate, & Vatikiotis-Bateson, 2002; and see chapter 23, in this volume, by Vatikiotis-Bateson and Munhall) and from other visible correlates such as head motion (Hadar, Steiner, Grant, & Rose, 1983, 1984; Munhall, Jones, Callan, Kuratate, & Vatikiotis-Bateson, 2004). An important question, however, is whether and how this visual source of information about speech is combined with auditory speech when they are both present.

The pioneering work by Sumby and Pollack (1954; see also Cotton, 1935) represents the first successful attempt to assess the role of dynamic facial information in the comprehension of a spoken message. Using a clever setup, these authors demonstrated that the perception of acoustically presented words masked with noise improved substantially when the speaker’s facial movements were available to the perceiver (see also Grant & Greenberg, 2001, and Ross, Saint-Amour, Leavitt, Javitt, & Foxe, 2007, for other demonstrations of visual enhancement of auditory speech perception). This type of result reveals that observers can exploit (and thus benefit from) the informational correspondence between visual and acoustic aspects of the speech signal when needed.

Moreover, abundant research suggests that visual speech exerts a substantial impact on speech perception even under good acoustic conditions (e.g., Reisberg, McLean, & Goldfield, 1987), that is, not only when the acoustic signal is degraded. McGurk and MacDonald (1976) demonstrated this point in a highly influential study, showing that a syllable that is heard as /ba/ when presented in isolation is often heard as /da/ when dubbed onto a video clip of a face silently articulating the syllable [ga] (see figure 24.2).¹ This added value of audiovisual (AV) integration has also been demonstrated in more subtle ways, without the need for artificially induced intersensory conflict. For example, the combination of visual and auditory speech can make us more sensitive to nonnative phonemic distinctions that are difficult to discern on the basis of visual or auditory information alone (Navarra & Soto-Faraco, 2007; see also Teinonen, Aslin, Alku, & Csibra, 2008). Developmental research has shown that these AV speech perception abilities are already present quite early in life, as young infants, and even neonates, are able to detect the match between speech sounds and corresponding visually perceived articulatory gestures (Aldridge, Braga, Walton, & Bower, 1999; Kuhl & Meltzoff, 1982; Patterson & Werker, 2003) and seem to display McGurk-like illusions (Burnham & Dodd, 2004).
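
Returning to the visual enhancement effect reported by Sumby and Pollack (1954), that benefit is often expressed not as a raw difference in intelligibility scores but as a gain relative to the room left for improvement over listening alone. The sketch below illustrates that normalization; the function name and the example scores are ours, invented purely for illustration rather than taken from any study’s data.

    def relative_visual_gain(auditory_only, audiovisual):
        """Visual benefit normalized by the room left for improvement.

        Both arguments are proportions of words identified correctly (0-1).
        A value of 0 means that seeing the talker added nothing; a value of
        1 means that vision recovered all of the remaining errors.
        """
        if not 0.0 <= auditory_only < 1.0:
            raise ValueError("auditory-only score must lie in [0, 1)")
        return (audiovisual - auditory_only) / (1.0 - auditory_only)

    # Invented scores for a poor signal-to-noise ratio: 30% of words correct
    # by ear alone versus 75% correct when the speaker's face is also visible.
    print(relative_visual_gain(0.30, 0.75))  # ~0.64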

The present chapter provides an overview of recent advances in this area, discussing some of the research issues on multisensory speech perception that have led to general agreement as well as those that are currently the focus of debate. We have organized the present review under three topics: the developmental aspects of AV speech integration, the perceptual and neural bases of AV speech integration, and the possible role that attention and top-down processes play in AV speech processing.²

Early Development of AV Integration of Speech (Infant Studies)

Unlike in research on adult speech perception (see the next section), a primary theoretical divide among researchers studying the development of speech perception is whether the senses work together from birth or whether coordination of input across modalities comes about through experience and learning. The discussion in the current section will, in consequence, focus on when and how the auditory and visual systems come to work together in development.

According to the more standard theoretical approach, the senses are separate at birth and only become integrated through learning and experience. Within this approach, AV speech perception is referred to as “intermodal” or “cross-modal,” to capture the fact that two independently functioning modalities need to act together. There is disagreement within this approach as to whether integration occurs across development through general principles of associative learning (e.g., Birch & Lefford, 1963) or through more active hypothesis testing (e.g., Piaget, 1952). In contrast to this integration framework, differentiation theorists refer to initial AV perception as “amodal,” to capture the notion that the senses are not separate at birth and respond together to the input, for example, encoding intensity, rhythm, or other amodal properties of the generating sources (e.g., Gibson, 1969). In differentiation approaches, experience also plays a role; it is needed to allow the senses to act independently and to attune, or narrow, the perceptual system to make it respond selectively to only those multimodal relations present in the infant’s environment (see Lewkowicz, 1994, for a review of both classes of theories).

Figure 24.1 Experimental setup used by Weikum et al. (2007). Infants aged 4–6 months were presented with a series of video clips showing faces silently articulating sentences in French or English. Once the infants habituated to the clips in one language (e.g., English), test trials appeared. Test trials consisted of similar silent video clips that could appear in the same language (following the example, English) for half of the infants or in another language (French) for the other half of the infants. According to the results, infants tested with a new language spent more time looking at the faces than the other group, indicating that infants discriminated English from French speech just from viewing silent articulations. By 8 months, only bilingual (French/English) infants succeeded, perhaps reflecting an experience-based selectivity in keeping only necessary perceptual sensitivities.

Figure 24.2 The McGurk effect. We often “hear” the syllable /da/ when being presented with a face articulating the syllable [ga] dubbed with the sounds of the syllable /ba/ (McGurk & MacDonald, 1976).

Evidence for differentiation comes from studies that show, for example, infants’ ability to detect the equivalence between the phonetic content of heard and seen speech (i.e., AV matching), even for sounds not previously experienced. There is also incontrovertible evidence that experience does play a role in the development of AV speech perception. For example, initial multimodal sensitivities that are not required in the infants’ environment become less evident (a trend called “perceptual narrowing”), and those that are required become more robust. The evidence for each of these theoretical approaches is reviewed below.

Infant AV Matching

In a typical AV matching paradigm, an infant is shown two side-by-side talking faces, each articulating a different utterance. A speech sound that matches the visible articulation of one of the faces but not the other is presented centrally. Evidence of AV matching is obtained if the infant looks longer at the matching face than at the mismatching face. Research in the late 1970s revealed that 2.5- to 4-month-old infants are able to use synchrony to match mouth movements with sounds (Dodd, 1979; see also Lewkowicz, 2010). By 5–7 months of age, infants are able to match the affective content of faces and voices (Soken & Pick, 1992; Walker-Andrews, 1997), and they can perform AV matching based on the gender of faces and voices by 7–9 months (Patterson & Werker, 2002; Poulin-Dubois, Serbin, Kenyon, & Derbyshire, 1994; Walker-Andrews, Bahrick, Raglioni, & Diaz, 1991). Indeed, a recent study provides electrophysiological evidence that gender matching might be possible during speech perception as early as 10 weeks of age (Bristow et al., 2009).
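
The “looks longer” criterion in this paradigm is usually expressed as a simple preference score, the proportion of total looking time spent on the face whose articulation matches the sound, evaluated against the 50% expected by chance. The sketch below only illustrates that arithmetic; the variable names and the looking times are invented and do not come from any particular study.

    def matching_preference(seconds_on_matching_face, seconds_on_mismatching_face):
        """Proportion of total looking time spent on the matching face.

        Scores reliably above 0.5 across a group of infants are taken as
        evidence that the infants detected the audiovisual correspondence.
        """
        total = seconds_on_matching_face + seconds_on_mismatching_face
        if total <= 0:
            raise ValueError("no looking time was recorded")
        return seconds_on_matching_face / total

    # Invented example: 14 s of looking at the matching face and 9 s at the
    # mismatching face yields a preference of roughly 0.61.
    print(matching_preference(14.0, 9.0))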

The first evidence of phonetic matching of AV speech was provided in infants aged 4.5 months by Kuhl and Meltzoff (1982) and has been replicated several times for vowels and consonants, using both female (Kuhl & Meltzoff, 1984; Lewkowicz, 2000; MacKain, Studdert-Kennedy, Spieker, & Stern, 1983) and male videotaped faces (Patterson & Werker, 1999). Recent work has demonstrated that the match between AV vowel pairs is, in fact, evident as early as 2 months of age (Baier, Idsardi, & Lidz, 2007; Patterson & Werker, 2003). Indeed, infants at this age distinguish the dynamic lip movements of a rounded production, /wi/, from a single vowel, /i/, which ends in the same articulatory configuration, and can use this subtle difference to detect an AV match (Baier et al., 2007). These findings of matching as early as 2 months of age, nevertheless, are also consistent with an integration view, since they could be the result of the substantial early experience that infants have watching speaking faces, particularly in the face-to-face exchanges that are quite characteristic of parent-infant interactions. This is an important issue, as a requirement for experience would be consistent with a learning perspective and would suggest that AV speech perception follows general learning principles, just like the association between the sound and the sight of a car.

Support for a differentiation view, on the other hand, comes from accumulating evidence suggesting that bimodal matching may also be possible without specific experience. It has been observed that even neonates are able to match the seen movements of the mouth with heard vowels (Aldridge, Braga, Walton, & Bower, 1999) and that, at only 4 months of age, infants can match AV vowels (Walton & Bower, 1993) and consonants from a foreign language (Pons, Lewkowicz, Soto-Faraco, & Sebastián-Gallés, 2009). In the work by Pons and colleagues, infants were tested on their ability to bimodally match the English syllables /ba/ and /va/ presented acoustically with their respective lip movements. Note that the /ba/-/va/ phonetic distinction is not contrastive in Spanish (i.e., adult Spanish speakers tested in that study could not match correctly between acoustic and visual exemplars of /va/ and /ba/). At 4 months of age both Spanish and American infants succeeded, whereas by 10 months of age only American, but not Spanish, infants showed signs of pairing the sight and sound of /va/ and /ba/. This pattern of results suggests that AV matching precedes specific experience with phonological categories. Yet experience with a particular language modulates this ability, accommodating each infant’s perceptual system to the demands of his or her particular linguistic environment (as the differentiation hypothesis suggests).

Additional support for the limited role of experience in the early detection of AV speech congruency may also come from the ability to match heard and seen vocalizations beyond one’s own species. Human infants at 5 months of age are able to match the species of heard vocalizations with static faces of humans and monkeys but not with static duck faces (Vouloumanos, Druhen, Hauser, & Huizink, 2009). One interpretation of these results is that early matching abilities are driven by experience-independent, evolutionarily based biases for conspecifics rather than by actual experience with specific animals. Yet, it is also possible that the greater similarity between human and monkey faces than between human and duck faces facilitates generalization of learning and hence better matching of monkey calls to monkey faces. More nuanced evidence for early amodal sensitivities to vocal sounds comes from reports that even at birth (Lewkowicz, Leo, & Simion, 2010), and continuing to 6 months (Lewkowicz & Ghazanfar, 2006), human infants can match heard and seen vocalizations of rhesus monkeys, even though they have never been exposed to them before. By 10 months of age, however, this AV matching ability declines for other species (Lewkowicz, Sowinski, & Place, 2008), again suggesting that experience is required to maintain the initial sensitivity.

Infant AV Integration

Although the evidence for AV matching of speech at an early age is remarkable, sensitivity to AV match/mismatch does not directly imply that correlated information from vision and audition is integrated into a unitary percept. Stronger evidence of the senses working together can be found when a unitary percept arises from the combination of the features specified by heard and seen speech, as happens in the McGurk effect. Burnham and Dodd (2004) tested integration abilities in 4-month-old infants by habituating them to either an AV matching /ba/ or a mismatching speech event consisting of an audio /ba/ dubbed onto a visual [ga], a combination that leads to a fused “da” (and sometimes “tha”) percept in adults (see McGurk & MacDonald, 1976). After habituation, infants were tested with the sounds “ba,” “da,” or “tha” alone (i.e., dubbed onto a still face that provided no dynamic/linguistic information). In the test following habituation, and in line with typical habituation results, infants were expected to look longer at novel stimuli than at the ones experienced during habituation. Consistent with the idea that infants integrated the AV mismatching syllable into “da” or “tha” during the habituation phase, they showed greater recovery in looking times to the sound /ba/ than to /da/ or /tha/ (compared to infants habituated to the audiovisually matching /ba/). This suggests that, during habituation, the acoustic /ba/ + visual [ga] stimulus was fused and perceived as either “da” or “tha.”

Two other studies have tested visual capture rather than AV fusion. In the adult literature, the presentation of an auditory /b/ accompanied by a visual /v/ results in an acoustic percept of the visually presented phoneme (i.e., perceived as /v/), a phenomenon commonly referred to as “visual capture.” Even if this phenomenon is not as indicative of integration as the McGurk effect, it surely represents another example of multisensory speech perception. In a series of experiments, Rosenblum, Schmuckler, and Johnson (1997) provided evidence that infants of 4 months show visual [va] capture of auditory /ba/. In a subsequent study using a slightly different method, Desjardins and Werker (2004) provided further evidence of visual capture in 4-month-old infants (auditory /bi/ by visual [vi]), although in their case the phenomenon was not evident in all conditions. On the basis of these results, Desjardins and Werker concluded that although a foundation for bimodal speech perception might be present in the very young infant, its strength increases with experience.

Kushnerenko, Teinonen, Volein, and Csibra (2008) recorded event-related potentials (ERPs) from infants aged 5 months in response to four types of AV stimuli: AV congruent /ba/ + [ba] and /ga/ + [ga] stimuli, AV incongruent /ba/ + [ga] (leading to “da” fusions in adults), and AV incongruent /ga/ + [ba] (leading to the phonotactically unusual combination percept “bga” in adults; see McGurk & MacDonald, 1976). They reasoned that the first three types of stimuli should be perceived as a coherent, fused syllable, whereas the AV incongruency that leads to “bga” should pose a perceptual problem. In line with this prediction, the authors found a difference in the visual ERP component to a visual [ga] versus a visual [ba], but more importantly, there was also a distinct ERP response to the /ga/ + [ba] (combination) stimulus compared to any other AV stimulus type (more positive over frontal areas and more negative over temporal areas). Although our current knowledge of the meaning of different infant ERP signals is still limited, the fact that different responses were seen to AV stimulus configurations leading to a unitary fused syllable (“da”) versus those leading to a phonotactically anomalous combination (“bga”) led the authors to conclude that neural responses are sensitive to the results of AV integration.

In contrast to developmental research on AV matching abilities, which has involved a wide variety of ages (including newborns) and cross-linguistic approaches (i.e., native and nonnative speech categories), until recently no research on AV integration of speech had involved infants younger than 4 months or nonnative stimuli. In 2009, Bristow and colleagues conducted a study in which they measured the electrophysiological MMR (mismatch response) to a change to a new acoustically presented vowel (either /a/ or /i/) following repeated presentation of either an auditory-only vowel or a visual-only (silently articulated) vowel (/i/ or /a/, respectively). This hybrid design allowed the authors to investigate the neural response to an acoustic stimulus that either matched or did not match the previously presented vowel. Interestingly, the MMR to a change from a visual vowel to a different acoustic vowel was as rapid as the MMR to a change from one acoustic vowel to a different acoustic vowel. These results suggest the existence of shared representations for visual and auditory speech at just 4 months of age. Moreover, the dominant neural responses were seen first in left hemisphere temporal areas and afterwards in left hemisphere frontal areas, indicating the possible involvement of “language” areas of the brain similar to those engaged in adult perceivers.

The above results suggest that the neural systems supporting AV integration emerge very early in life. Nonetheless, in order to more fully address the question of whether visual information influences auditory speech perception prior to specific experience, it will be essential to test AV integration of speech in young infants using unfamiliar auditory and/or visual stimuli, just as has been done in the bimodal matching studies reviewed above. Studies with newborn infants without prior experience with visual speech will also be informative for determining whether there is an organizing role for visual experience. In summary, research to date indicates substantial readiness for AV speech perception in young infants even without prior specific experience. Yet, it has become clear that experience also plays an important role in maintaining and sharpening these initial sensitivities. The initial organization seen provides a foundation for apprehending the redundant information available in visual and auditory speech and thus could facilitate learning about language in general and the native language in particular.

Characterizing AV Integration of Speech in Adults

A great deal of research has sought to characterize the processes and the representations that underlie AV speech perception in adults. By AV speech process we refer to a set of mental operations that contribute to the integration of information from two or more sensory modalities, and from which emerges a speech percept. By AV speech representation we refer to the format and informational content of the input entering into, or the output resulting from, these integrative processes.

As any review of the literature on oral communication will show, there is a great deal of disagreement about both the processes and representations involved in the perception of speech. Researchers have advocated for a variety of alternatives, including abstractions of phonological features that are formatted from prototypes in memory (e.g., Massaro, 1987) or defined acoustically (e.g., Stevens, 2002), abstract motor plans to produce specific speech gestures (Liberman & Mattingly, 1985), or even amodal information about the actual movements involved in these gestures (e.g., Fowler, 1996). In the specific case of AV speech perception, theoretical approaches have similarly disagreed about what the code might be in which visual and auditory inputs are represented during speech perception. Similarly, how AV speech processes are conceptualized is enormously influenced by a commitment to any particular representation of incoming AV speech signals. For example, a direct-realist view about the representation of speech gestures leads to a reduction in the set of processes involved in AV speech perception (see Fowler, 1996).

Equally relevant questions for the understanding of AV integration of speech have been when (during the processing of speech signals) and where (in the brain) integration takes place. The advance of neuroimaging and electrophysiological techniques, including functional magnetic resonance imaging, has allowed researchers to investigate these issues in great detail. Initial attempts tended to depict AV integration of speech as a late phenomenon (that is, occurring after early analyses of the unimodal signals have already taken place) in heteromodal areas rather than, for example, in primary visual or auditory cortex. More recent literature, however, tends to emphasize the existence of an early interplay between vision and audition (or even a somewhat unidirectional visual-to-auditory modulation). Both of these controversies (i.e., what and how as well as when and where) are reviewed in the two sections below.

What and How: Representational Codes and Related Processes

Classic models of speech perception have frequently neglected the multimodal nature of communication. For some, speech perception is based exclusively on the acoustic input (e.g., Klatt, 1980; Liberman, 1982; McClelland & Elman, 1986), whereas others have, in later formulations, included visual input as a possible source of complementary information (e.g., Diehl & Kluender, 1987; Diehl, Lotto, & Holt, 2004; Liberman & Mattingly, 1985). Even within those theoretical views of speech perception that consider the multisensory nature of spoken language, there are very different approaches regarding both the processes involved in perceiving AV speech and the representations in which speech inputs are encoded. Three distinct approaches, including the most influential models of AV speech perception proposed to date, are discussed in turn.

Convergence Views

Such theories generally assume that both visual and auditory speech inputs are mapped onto unitary representations (e.g., gestural codes) that mediate perceptual processes (see Fowler, 2004). The convergence theory that gathered most of the attention from the research community in the last decade is, without doubt, the motor theory of speech perception (Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967). According to this theory, speech production and speech perception are both based on a common, prearticulatory code (Liberman et al., 1967; see also Liberman & Mattingly, 1985). This theory, and also the direct-realist theory of speech perception (Fowler, 1996; see also Best, 1995), have suggested that speech events are represented as specific speech gestures. A strong claim of this type of theory is, therefore, the involvement of motor representations during speech perception.

The hypothesis of motor involvement during speech perception has received empirical support from a rich body of behavioral evidence showing that perceived auditory speech can alter the latency (Bell-Berti, Raphael, Pisoni, & Sawusch, 1979; Fowler, Brown, Sabadini, & Weihing, 2003; Gordon & D. E. Meyer, 1984; Levelt et al., 1991) and even the kinematics (Houde & Jordan, 1998; Yuen, Davis, Brysbaert, & Rastle, 2010) with which the listener produces speech. Complementary studies also show that articulating (Nasir & Ostry, 2009; Sams, Möttönen, & Sihvonen, 2005), planning articulation (Roelofs, Özdemir, & Levelt, 2007), or even having one’s facial skin deformed in ways that mimic articulation (Ito, Tiede, & Ostry, 2009) can all affect the perception of auditory speech. In line with this view, recent evidence has also suggested that perceiving visual and AV speech utterances can have a similar effect on ensuing production, influencing the latency with which speakers articulate these and other utterances (Galantucci, Fowler, & Goldstein, 2009; Gentilucci & Bernardis, 2007; Kerzel & Bekkering, 2000).

Further evidence for the possible role of gestural information in speech perception comes from motor evoked potentials (MEPs) induced by transcranial magnetic stimulation (TMS) of lip- and tongue-related areas within the motor cortex (see Fadiga, Craighero, & Olivier, 2005, for a review). Several studies have observed selective modulation of MEPs at the articulators involved in the production of specific speech events during auditory (e.g., Fadiga, Craighero, Buccino, & Rizzolatti, 2002; Wilson, Pinar-Saygin, Sereno, & Iacoboni, 2004) and visual (e.g., Sato, Buccino, Gentilucci, & Cattaneo, 2009; Watkins, Strafella, & Paus, 2003) speech perception. TMS has also been used to disrupt the functioning of these cortical motor areas, and doing so has been shown to influence speech perception itself, particularly when the speech signal is rendered ambiguous either by adding noise or by using stimuli that fall between categories in a phonetic continuum (D’Ausilio et al., 2009; Möttönen & Watkins, 2009).

Together, these results suggest that motor-articulatory processes might be engaged in AV speech perception, but more evidence must be obtained to determine their causal role (Hickok, Holt, & Lotto, 2009; Mahon & Caramazza, 2008; Mitterer & Ernestus, 2008). In fact, critics of the motor theories also argue that speech perception abilities remain relatively intact in patients in whom motor brain areas are damaged (e.g., Crinion, Warburton, Lambon-Ralph, Howard, & Wise, 2006; Terao et al., 2007; but see Pulvermüller & Fadiga, 2010).

Associationist Views

These views generally differ from convergence views in two important respects. First, they suggest that speech information from separate modalities retains modality-specific characteristics throughout processing (see Bernstein, Auer, & Moore, 2004). In other words, unlike convergence theories, which assume amodal (or heteromodal) and unitary representations from the initial stages of processing the multisensory speech signal, associationist theories assume parallel processing of speech information in separate modalities until a relatively late stage of perceptual analysis. Second, according to this view, integration of multisensory input does not involve representations of speech that are linked to speech gestures per se, but rather to abstract features stored as prototypes in memory, which are associated across modalities in the course of processing (see Braida, 1991; Massaro, 1987, 1998, 2004). The fuzzy-logical model of perception (FLMP; see Massaro, 1987, 1998) illustrates both of these claims. This model postulates the existence of different processing stages, with cross-modal integration not taking place until a relatively advanced stage of processing (called feature integration). According to this view, speech features extracted from the visual and auditory inputs are compared against “prestored” phonological prototypes to select the best match.
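
The integration stage of the FLMP is usually summarized as a multiplicative combination of the continuous degree of support each modality lends to every candidate prototype, followed by normalization (a relative-goodness rule). The sketch below is a minimal illustration of that rule, not a reimplementation of Massaro’s full model, and the numerical support values are invented.

    def flmp_integrate(auditory_support, visual_support):
        """Schematic FLMP integration: multiply the auditory and visual
        support values (in [0, 1]) for each candidate prototype, then
        normalize so the outputs sum to 1 and can be read as predicted
        response probabilities (the relative goodness of each match).
        """
        combined = {candidate: auditory_support[candidate] * visual_support[candidate]
                    for candidate in auditory_support}
        total = sum(combined.values())
        return {candidate: value / total for candidate, value in combined.items()}

    # Invented values: the acoustic input weakly favors /ba/, while the
    # visible articulation strongly favors /da/.
    print(flmp_integrate({"ba": 0.6, "da": 0.4}, {"ba": 0.1, "da": 0.9}))
    # -> roughly {'ba': 0.14, 'da': 0.86}: the visually supported /da/ dominates.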

Although the FLMP can account for a large amount of empirical data (e.g., Massaro, 1998), it has been subject to some substantial criticisms (e.g., Green & Gerdeman, 1995). A possible argument against one of the postulates of prototype-based models such as the FLMP is the excessive weight they confer to stored memory representations. The fact that prelinguistic (i.e., 4-month-old) infants are able to integrate and match speech events across sensory modalities (see the previous section) may suggest that this cross-modal ability is not necessarily based on established auditory categories or prototypes in memory (i.e., a phonological system). Of course, this does not mean that (adult) perceivers with a well-developed phonological system use the same code during AV integration of the speech signal as infants do, but one can at least conclude that AV matching of speech is possible without an elaborated phonological system or entrenched prototypes in memory. Another empirical argument against the FLMP is the finding of AV interactions at the level of articulatory features, prior to phonological categorization (Green & Gerdeman, 1995).

Analysis-by-Synthesis Views

According to this relatively recent view, the role of visual input, when it provides salient and sufficiently distinguishable information, is to constrain the processing of the auditory signal, thus making it faster and more accurate. During speech production, for example, there is often visible articulatory information (e.g., closure of the lips) that precedes a speech sound (e.g., the sound /b/; see figure 24.3) by several tenths of a second (see Schroeder, Lakatos, Kajikawa, Partan, & Puce, 2008). According to van Wassenhove and colleagues (van Wassenhove, Grant, & Poeppel, 2005), this advance information can be used during speech perception to generate phonological predictions against which incoming sounds are evaluated. The more constraining the prediction from visual speech is (e.g., [b] generates a stronger prediction than [g]), the faster its correlated sound seems to be processed. Visual anticipation during AV speech perception may thus help the perceiver to quickly retrieve the relevant representations from the rather complex and fast-changing acoustic speech signal (see Navarra, Alsius, Soto-Faraco, & Spence, 2010).
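
One way to picture this predictive constraint, purely as an illustration and not as the computational model proposed by van Wassenhove and colleagues, is to treat the early viseme as a prior that reweights the evidence the later-arriving acoustic signal provides for each candidate phoneme: a distinctive viseme such as [b] concentrates the prior on very few candidates, whereas a less informative one such as [g] leaves the field open. The candidate sets and numbers below are invented.

    def constrain_by_viseme(acoustic_likelihood, viseme_prior):
        """Schematic prediction step: a visually derived prior reweights the
        acoustic evidence for each candidate phoneme, and the result is
        renormalized. The sharper the prior (the more distinctive the
        viseme), the fewer candidates retain appreciable probability.
        """
        weighted = {phoneme: acoustic_likelihood[phoneme] * viseme_prior.get(phoneme, 0.0)
                    for phoneme in acoustic_likelihood}
        total = sum(weighted.values()) or 1.0
        return {phoneme: value / total for phoneme, value in weighted.items()}

    # Invented numbers: the noisy audio is ambiguous between /b/, /d/, and /g/,
    # but the lips were seen to close, which is compatible mainly with /b/.
    audio = {"b": 0.4, "d": 0.3, "g": 0.3}
    lips_closed_prior = {"b": 0.8, "d": 0.1, "g": 0.1}
    print(constrain_by_viseme(audio, lips_closed_prior))  # /b/ now clearly dominates

The arithmetic here is deliberately the same as in the FLMP sketch above; what the analysis-by-synthesis view adds is the claim that the visual term becomes available before the sound and is generated by internal synthesis, possibly drawing on motor knowledge (as discussed next), rather than by matching against stored prototypes.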

Recent conceptions of this theoretical view suggest that the prediction process may be based on premotor-articulatory representations (Skipper, Nusbaum, & Small, 2005). According to this view, visual speech information generates an internally synthesized motor prediction about an articulatory event, represented in terms of the speaker’s own gestural knowledge. These predictions are then integrated with the ongoing acoustic input to derive the final percept (Skipper, van Wassenhove, Nusbaum, & Small, 2007). According to Skipper and colleagues (2007), the generation of these motor predictions may help to solve the lack-of-invariance problem, that is, the problem of mapping speech signals that are extremely variable (the way a specific phoneme is produced differs for each speaker and depends heavily on its adjacent phonemes, a phenomenon called coarticulation) onto highly specific, discrete, and categorical representations. It is worth noting, however, that the relevance given to visual anticipatory information and motor predictions is restricted in these models by (1) the large ambiguity that exists within the visual speech signal itself (e.g., the highly salient articulations of /p/, /b/, and /m/ are virtually equivalent) and by (2) the fact that the articulation of many phonemes remains invisible to the perceiver (e.g., the articulation of /g/).

Where and When: The Neural Underpinnings of Perceiving AV Speech

Figure 24.3 Speech articulatory gestures often precede their corresponding sounds. This early visual information has an impact on the way speech sounds are processed, as shown, for example, by an advancement of auditory ERP components such as the N1 (van Wassenhove et al., 2005; see also Besle et al., 2004).

Studies using positron emission tomography (PET) and functional magnetic resonance imaging (fMRI) have repeatedly shown activation in posterior areas of the superior temporal sulcus (STS) during auditory (Binder et al., 1997; Calvert, Bullmore, Brammer, & Campbell, 1997) and visual (Blasi et al., 1999; Calvert et al., 1997) speech perception. In a seminal study, Calvert and colleagues (Calvert, Campbell, & Brammer, 2000) reported, for the first time using fMRI, multisensory responses in the left STS following a superadditive pattern (with a criterion adapted from single-neuron animal neurophysiology)³ during the perception of matching AV speech. In contrast, a subadditive response pattern in the same brain area was found during the presentation of mismatching AV speech stimuli (see also Callan et al., 2003; Wright, Pelphrey, Allison, McKeown, & McCarthy, 2003). Further studies have highlighted the STS as a node at which correlated auditory and visual information (from speech and nonspeech stimuli) is thought to converge (see Beauchamp, Argall, Bodurka, Duyn, & Martin, 2004).
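
The superadditive criterion used by Calvert and colleagues (2000) is commonly operationalized by comparing the response to the audiovisual stimulus against the sum of the two unimodal responses. The sketch below illustrates that comparison with invented response estimates; real fMRI studies apply this logic with appropriate statistical tests rather than a simple point comparison.

    def classify_av_response(av, auditory, visual):
        """Label an audiovisual response relative to the sum of the unimodal
        responses: AV > A + V is superadditive, AV < A + V is subadditive,
        and AV equal to A + V is simply additive.
        """
        unimodal_sum = auditory + visual
        if av > unimodal_sum:
            return "superadditive"
        if av < unimodal_sum:
            return "subadditive"
        return "additive"

    # Invented response estimates in arbitrary units:
    print(classify_av_response(av=1.4, auditory=0.6, visual=0.5))  # superadditive
    print(classify_av_response(av=0.8, auditory=0.6, visual=0.5))  # subadditive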

The conceptualization of the STS as the place where signals from vision and audition are integrated fits well with a hierarchical and associationist conception of AV speech processing (see previous subsection), whereby multisensory binding occurs in higher-order association areas after independent visual and auditory processing in unimodal brain areas has taken place. Processing within these multisensory sites would then engage in a recurrent loop via feedback projections to hierarchically lower, sensory areas. Previous findings of visually evoked activation in auditory cortices (Calvert et al., 1997; Bernstein et al., 2002; see also Sams et al., 1991) could then be accounted for as a result of feedback projections from heteromodal areas. This hierarchical view of AV speech integration has been called into question, however, by accumulating evidence of AV interactions at very early latencies (e.g., as early as 30 msec after the onset of the sound; see Besle, Bertrand, & Giard, 2009, for a review) during speech perception in electrophysiological studies (Besle, Fort, Delpuech, & Giard, 2004; Hertrich, Mathiak, Lutzenberger, & Ackermann, 2008; Lebib, Papo, de Bode, & Baudonnière, 2003; Möttönen, Schürmann, & Sams, 2004; van Wassenhove et al., 2005).⁴ Note, however, that the degree of stimulus and domain specificity of these early correlates with respect to speech processing is still unclear (e.g., Bernstein, Auer, Wagner, & Ponton, 2008; Ponton, Bernstein, & Auer, 2009).

Because of the biophysical characteristics of speech production, visible articulatory gestures are often available before their consequent sounds (e.g., the lips usually close more than 150 msec before the sound /b/ is produced; see figure 24.3). This visual precedence seems to more than compensate for the temporal offset in neural processing times whereby visual signals arrive at their primary cortical areas later than auditory signals do (see Schroeder et al., 2008). As several studies now suggest, this early visual information can, when it is salient enough (e.g., the viseme [p] as opposed to the viseme [g]), have a strong impact on the way incoming sounds are processed (see the earlier discussion of the analysis-by-synthesis theoretical view of AV speech perception). For example, van Wassenhove et al. (2005) reported a decrease in the amplitude and a speed-up in the latency of the auditory evoked potentials N1 and P2 as a result of visual modulation (see also Stekelenburg & Vroomen, 2007, for similar ERP effects using nonspeech stimuli). This evidence, together with previous results showing multisensory interactions in nonprimary areas such as the STS (e.g., Calvert et al., 2000), suggests that integration between auditory and visual speech cues possibly involves interactions at different levels of processing: early, when the speech sounds are predicted by prior visual information, as findings such as that of van Wassenhove and her colleagues (2005) suggest, and also late (in heteromodal areas such as the multisensory STS), as several researchers have previously found (Bernstein, Lu, & Jiang, 2008; Calvert et al., 2000; Callan et al., 2003; Miller & d’Esposito, 2005; Wright et al., 2003).

In a recent study using intracranial ERPs in epileptic patients, Besle and collaborators (2008) proposed that this early influence from vision on audition may be the result of a direct link between motion-responsive visual areas (MT/V5) and auditory cortex rather than the consequence of feedback projections from heteromodal multisensory areas. In the study by Besle et al., when participants were presented with (AV-matching) consonant-vowel syllables such as /ba/, secondary auditory areas were activated just 10 msec after the activation of MT/V5. Notably, previous neuropsychological studies had already pointed out the relevance of visual motion processing in AV and visual speech perception (Campbell, Zihl, Massaro, Munhall, & Cohen, 1997; Munhall, Servos, Santi, & Goodale, 2002; Mohammed et al., 2005). Besle et al. (2008) also reported a decrease of activity in auditory areas as a result of highly predictive visual information, perhaps mirroring the N1 amplitude reduction found in earlier ERP studies of AV speech processing (e.g., Besle et al., 2004; van Wassenhove et al., 2005). According to Besle and colleagues, this effect could reflect an alleviation of auditory processing induced by the early disambiguating role of linguistically informative visual cues. Another interesting (yet highly speculative) hypothesis regarding this visual-to-auditory influence is that visual input modulates the neuronal oscillatory activity in primary auditory cortex (A1; Schroeder et al., 2008). Neuronal oscillations are the result of the spontaneous electrophysiological activity in a population of neurons. The visual modulation of this oscillatory activity in A1 may, according to Schroeder and colleagues, induce a temporary, highly excitatory state during which ensuing auditory input could be facilitated (i.e., resulting in amplified neuronal responses to sounds). The fact that an amplitude reduction (and not an amplification) of the auditory evoked potentials is usually found as a result of the anticipatory effect of visual speech will, in our opinion, have to be taken into consideration in further developments of this hypothesis.

An issue that has recently attracted researchers’ interest is whether these early visual modulations of auditory responses to upcoming speech sounds are specific to linguistic input (reflecting, for example, phonological AV interactions) or not. Although similar electrophysiological effects (i.e., amplitude reduction and speedup of auditory evoked potentials) have been reported for nonlinguistic stimuli when vision is predictive of acoustic events (such as a video clip of a person handclapping; Stekelenburg & Vroomen, 2007; see also Navarra, Vatakis, et al., 2005, for parallel effects using both speech and nonspeech), some researchers have interpreted the early effects found with AV speech as phonetic in nature (e.g., Hertrich et al., 2008). Arnal, Morillon, Kell, and Giraud (2009) have recently analyzed combined data from magnetoencephalography (MEG) and fMRI, showing that early visual-to-auditory effects seem to be unresponsive to the level of congruency between the visual gesture and the speech sound, perhaps supporting a prelinguistic and phonologically nonspecific account of these early visual-to-auditory interactions. Arnal et al. also presented evidence suggesting that viseme-phoneme congruency is detected later in processing, perhaps owing to processing in multimodal regions such as the STS, where AV integration would take place. According to Arnal and colleagues (2009), and perhaps in line with the proposal by Schroeder and colleagues (2008) described above, the nonspecific visual-to-auditory effects found at an early processing stage could reflect a reset of activity in auditory cortex into an excitatory/preparatory state that may increase perceptual efficiency at a subsequent stage.

In conclusion, finding out when and where visual and auditory speech inputs interact (and are integrated) in terms of brain processes still is an ongoing task. It is also worth noting that, at present, all this evidence about AV interactions needs to be integrated with the mounting evidence for links between speech perception and production (see D’Ausilio et al., 2009; Möttönen & Watkins, 2009; Sato et al., 2010; see also Convergence Views, previous subsection). Skipper and colleagues (2005), for example, found that activity in premotor cortex (as well as in the superior temporal lobe [STS and STG]) during AV speech perception can be modulated by the saliency of the visual speech signal. Activity in these cortical motor areas was observed to a lesser extent during unimodal speech perception, suggesting their stronger involvement in multisensory perception.

Is Attention Needed to Integrate AV Speech?

One currently debated question in the multisensory literature (including AV speech perception) is whether integration across sensory modalities depends on the distribution of the observer’s attention (see chapter 20, this volume). Interestingly, the case of AV speech is in fact one of the most frequently used illustrations of the automatic and mandatory nature of multisensory integration. Paramount among these illustrations is the McGurk illusion, whereby the illusory acoustic percept arising from AV conflict occurs even under optimal hearing conditions and regardless of the observer’s knowledge about the manipulation (McGurk & MacDonald, 1976; see figure 24.2). Following up on this idea, AV speech integration has been assumed to fulfill the criteria for automaticity that are typically used in cognitive psychology. Namely, automaticity implies that, once the appropriate sensory stimulation is present, the process will initiate quickly, in a preattentive manner, without much influence of the perceiver’s voluntary control and largely unaffected by the availability of processing resources (i.e., regardless of whether other ongoing tasks are the current focus of the observer’s attention; Schneider & Shiffrin, 1977). In the particular case of multisensory integration, some authors have further assumed that it is a cognitively impenetrable process (Bertelson & de Gelder, 2004; Bertelson & Aschersleben, 1998; Colin et al., 2002; de Gelder & Bertelson, 2003), that is, a process that proceeds with little influence from other cognitive systems, including top-down attention (see Coltheart, 1999; Fodor, 1983).

The characterization of AV speech integration as an automatic, attention-free process has important consequences because it implies that the benefits arising from AV binding (i.e., enhanced comprehension, faster processing, and increased accuracy, as described earlier) would come about at almost no cost, so that little interference with other ongoing brain processes would be expected. Yet, despite the widespread assumption about the automaticity of AV integration, including that of speech (Bertelson & de Gelder, 2004; Bertelson & Aschersleben, 1998; Colin et al., 2002; de Gelder & Bertelson, 2003; Bertelson, Vroomen, de Gelder, & Driver, 2000; Vroomen, Bertelson, & de Gelder, 2001), recent evidence is starting to question whether some of the classical criteria for automatic processing apply to multisensory integration (Talsma & Woldorff, 2005), including the case of speech (Alsius, Navarra, Campbell, & Soto-Faraco, 2005; Alsius, Navarra, & Soto-Faraco, 2007; Fairhall & Macaluso, 2009). In addition, there is ample consensus nowadays that the interplay between perception and attention can be expressed at various levels of information processing, including the earliest processing stages (see Talsma, Senkowski, Soto-Faraco, & Woldorff, 2010, for the specific case of the interplay between attention and multisensory integration). We review empirical evidence supporting the assumption of automaticity (or lack thereof) in AV speech processing and attempt to integrate past and current findings under a common framework.

Explicit Manipulations of Attention to AV Speech

The fact that perceivers do not notice, most of the time, their use of visual information during speech perception gives us an approximate (albeit subjective) idea of the extent to which AV integration appears to be automatic. As noted earlier, a strong empirical argument for the automaticity of AV speech integration is that the McGurk illusion remains impervious to several cognitive manipulations such as repeated exposure, conscious knowledge of the cross-dubbed nature of the stimuli (McGurk & MacDonald, 1976), and gender mismatch between face and voice (Green, Kuhl, Meltzoff, & Stevens, 1991). This resilience of AV integration to such manipulations is typical of automatic, mandatory processes. Another argument favoring the automaticity of AV speech integration is based on the lack of effects of explicit instructions on AV tasks (e.g., Dekle, Fowler, & Funnell, 1992; Easton & Basala, 1982; Massaro, 1987). In Massaro’s (1987) study, for instance, identification responses to incongruent stimuli (i.e., auditory /ba/ and visual [da]) were similar regardless of whether the observers received the instruction to focus attention on one or the other modality (or both), thus supporting the automaticity hypothesis.

Nevertheless, more recent findings hint at the potential for modulation of AV integration. For example, the observer’s semantic and lexical expectations exert some influence on the probability of occurrence of the McGurk illusion (Windmann, 2003). Along similar lines, other studies have reported that responses to McGurk-like stimuli can in fact change depending on the explicit instructions given to participants (Colin, Radeau, & Deltenre, 2005; Easton & Basala, 1982; Massaro, 1998; but see Dekle et al., 1992) or on their prior knowledge about the dubbed nature of the stimuli (Summerfield & McGrath, 1984). Finally, a recent study found that the time needed to find the talking face that matched an acoustically presented sentence strongly depended on the number of distractor talking faces (Alsius & Soto-Faraco, 2011). Thus, although a number of observations suggest a lack of cognitive influence on AV speech binding, current data from studies manipulating the observer’s expectations and prior knowledge about the stimuli reveal a certain flexibility that does not fit perfectly with the idea of a strictly automatic and involuntary mechanism.

Implicit Manipulations of Attention to AV Speech

Explicit manipulations in an intersensory conflict situation, such as the ones discussed above, can be subject to confounds involving cognitive biases (see de Gelder & Bertelson, 2003, p. 462, for a recent discussion). For example, it is impossible to know with certainty that simply instructing participants to focus on the auditory component of the AV event and ignore vision (i.e., Massaro, 1987; Dekle et al., 1992; Easton & Basala, 1982) will ensure a lack of attention to the visual component. It is in fact well known that in low-perceptual-load situations attention can spill over to task-irrelevant stimuli (e.g., Lavie, 1995). Consequently, the interpretation of the findings discussed above in terms of automaticity, or lack thereof, must remain inconclusive. Alternative paradigms using implicit manipulations of the participants’ attention to the AV stimulus are less susceptible to these types of confounds. For example, Driver (1996) used a selective listening task in which participants had to recall words from one (target) out of two possible and overlapping speech streams (both streams composed of triplets of two-syllable words presented at a rate of two per second). Performance was better when a visual talking face that matched the auditory target utterances was presented away from the source of the sounds than at the source of the sounds. According to Driver, the visible talking face selectively attracted the (correlated) relevant auditory stream toward its spatial location, thus illusorily separating the target stream from the distractor stream (i.e., the ventriloquist effect). In other words, AV congruency acted to create an illusory spatial separation that facilitated selective attention to one speech stream over the other despite the fact that the two auditory speech streams were physically presented from exactly the same location. This suggests that AV congruency was processed before spatial selective attention was deployed.

In a study by Andersen, Tiippana, Laarni, Kojo, and Sams (2009), participants had to covertly attend to one of two speaking faces that were presented simultaneously (one at each side of fixation) while reporting centrally presented auditory speech (syllables). The syllable spoken by the attended visual face dominated participants’ reports of the auditory target syllable, suggesting a modulation of the visual influence on AV processing by the direction of spatial attention. As with tasks
explicitly directing attention to a specific modality (discussed above), one quibble with the interpretation of the findings of Andersen et al. is that the lip movements at the attended location might have biased responses at the decision stage rather than the perception stage. In order to test the automaticity of AV speech integration without the potential confounds associated with requiring explicit judgments from the perceiver, Soto-Faraco, Navarra, and Alsius (2004) measured the McGurk effect indirectly. Participants were asked to classify the first syllable of disyllabic pseudowords (such as /tobi/ or /tadi/) irrespective of the second syllable, which could be kept constant throughout a block of trials (always /di/ or always /bi/) or else vary randomly from trial to trial; this latter condition is known to lead to a slowdown in RTs (see Navarra, Sebastián-Gallés, & Soto-Faraco, 2005; Navarra & Soto-Faraco, 2007; Pallier, 1994). In the variable condition, the syllable /di/ was achieved by dubbing AV-congruent items (auditory /todi/ + visual [todi]), or else AV-incongruent items conducive to the illusory fusion “di” (auditory /tobi/ + visual [togi]). Syllabic interference was equally effective with real and with illusory variation, suggesting that AV binding occurred in a completely task-irrelevant syllable and prior to the perceptual classification of the task-relevant syllable. Furthermore, in a complementary experiment, syllabic interference could be prevented by alternating real and illusory /di/, despite the syllabic variability present in each modality individually.
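
The logic of this chronometric measure can be restated in a few lines of analysis code. The sketch below, written in Python with invented condition labels and response times (none of which come from the original study), simply computes the interference effect (mean RT in variable blocks minus mean RT in constant blocks) separately for real and illusory variation of the second syllable.

    # Illustration only: hypothetical response times (ms) for the speeded
    # classification design described above; values do not come from the study.
    from statistics import mean

    rts = {
        ("constant", "real"): [512, 505, 520],      # second syllable fixed, congruent dubbing
        ("variable", "real"): [548, 555, 560],      # second syllable varies acoustically
        ("constant", "illusory"): [515, 508, 518],  # fused percept fixed across trials
        ("variable", "illusory"): [550, 557, 562],  # fused percept varies via McGurk dubbing
    }

    def interference(variation):
        """RT cost of second-syllable variability for one kind of variation."""
        return mean(rts[("variable", variation)]) - mean(rts[("constant", variation)])

    # Comparable costs for real and illusory variation are taken as evidence that
    # the task-irrelevant syllable was integrated audiovisually before the
    # classification of the task-relevant first syllable.
    print(interference("real"), interference("illusory"))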

Gentilucci and Cattaneo (2005) provide another demonstration of an indirect (and possibly involuntary) influence of AV integration, this time on the perceiver’s articulatory production of target AV stimuli. In their study, participants’ articulation of spoken responses to the identity of AV-incongruent syllables was influenced by the visually specified speech gesture at kinematic and acoustic levels. So, in AV-incongruent trials for which participants had answered correctly (i.e., when they did not experience the McGurk illusion), the kinematic profile of their verbal response was nevertheless different from the production of responses to AV-congruent trials. Altogether, these results show evidence of AV speech integration at stages of processing that do not operate under voluntary control and perhaps before the allocation of attention (e.g., selecting a spatial location, as per Driver’s 1996 findings).

Neural Evidence of Attention-Free versus Attention-Influenced AV Integration of Speech

Colin et al. (2002) used McGurk fusions in an ERP mismatch negativity (MMN) paradigm where deviant oddballs were presented within an otherwise homogeneous sequence of auditory stimuli (i.e., /ba/). In this study by Colin et al., deviant stimuli consisted of incongruent visual lip movements ([ga]) dubbed onto one of the standard (/ba/) auditory events. Despite the absence of a real acoustic change, these illusory deviants evoked a mismatch ERP response, leading Colin et al. (2002) to conclude that AV integration takes place prior to the preattentive comparison process giving rise to the MMN. Other studies have shown that McGurk stimuli can eliminate the MMN difference wave when resulting in an illusory standard event (i.e., illusory “da” induced by /ba/ + [ga] among real /da/ standard stimuli; Kislyuk, Möttönen, & Sams, 2008). This last finding is important in order to rule out the possibility that the MMN found by Colin et al. (2002) was, in fact, a brain response to AV incongruence (see Ponton et al., 2009, for potential confounds associated with measuring mismatch negativity in AV speech).
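
The design logic of these McGurk-MMN studies can be made explicit with a short sketch. The Python code below builds a toy oddball sequence in which the acoustic token never changes and deviance exists only at the level of the fused AV percept; the syllable labels, deviant probability, and fusion rule are illustrative assumptions, not the stimulus code of Colin et al. (2002) or Kislyuk et al. (2008).

    # Toy oddball sequence in the spirit of the McGurk-MMN designs described above.
    # The acoustics are identical on every trial; only the dubbed visual syllable
    # changes, so any "deviance" exists at the level of the fused percept.
    import random

    random.seed(0)

    def make_trial(deviant=False):
        auditory = "ba"                     # the acoustic token never changes
        visual = "ga" if deviant else "ba"  # incongruent dubbing on deviant trials
        percept = "da" if visual == "ga" else auditory  # assumed illusory fusion
        return {"auditory": auditory, "visual": visual, "percept": percept}

    # Roughly 15% deviants embedded among standards, as in typical MMN paradigms.
    sequence = [make_trial(deviant=random.random() < 0.15) for _ in range(200)]
    acoustic_changes = sum(t["auditory"] != "ba" for t in sequence)  # always 0
    percept_changes = sum(t["percept"] != "ba" for t in sequence)    # illusory deviants only
    print(acoustic_changes, percept_changes)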

A recent fMRI study by Fairhall and Macaluso (2009), however, points to a rather different conclusion about the attention requirements for AV speech integration. Fairhall and colleagues manipulated the direction of covert spatial attention to lateralized speaking faces in the presence of a central auditory speech stream (similar to Andersen et al., 2009). When visual attention was directed to a speaking face whose lips matched the central sound, visual target detection improved, and, importantly, the STS area was selectively activated (bilaterally) as compared to when the mismatching talking face was attended. Further analyses revealed yet other areas selectively activated by attended AV congruency, including the superior colliculus as well as visual sensory areas. Contrary to the ERP findings discussed above (Colin et al., 2002; Kislyuk et al., 2008), this result suggests a modulation of AV speech processing by (spatial) attention, whose consequences carry over to multisensory association areas.

The Interplay between Attention and AV Integration of Speech

As the data reviewed above reveal, some manipulations of attention can alter the outcome of AV speech integration, whereas others seem to be ineffective. One possible key to evaluating this mixed pattern of results is whether the potential modulations of AV speech integration caused by attention reflect modulations at the level of unisensory processing (that carry over to multisensory integration stages) or whether, alternatively, attention has a direct impact on the multisensory integration mechanism itself (Navarra et al., 2010). Alsius et al. (2005, 2007; see also Tiippana, Andersen, & Sams, 2004) addressed this question using
a dual-task paradigm in which participants reported their perception of McGurk stimuli presented occasionally in the context of a difficult visual, auditory, or somatosensory monitoring task. As expected, visual load reduced the visual influence on auditory speech (less McGurk effect, more auditory responses) when compared to a no-load baseline. The interesting result was that auditory load also led to a reduction of the visual influence on AV integration and, perhaps counterintuitively given that the perceptual load was imposed on the auditory system, to an increase of auditory-based responses. In all cases (see Alsius et al., 2007, for effects of somatosensory load on the McGurk effect), the reduction in the McGurk illusion under dual-task conditions was stronger than any decrement in unisensory perception (auditory or visual) and thus consistent with the idea that depleting resources in any sensory modality by imposing a demanding task reduced the ability to integrate multisensory information in exactly the same fashion (i.e., leading to more auditory-based responses). One possible interpretation of these data is that the shortage of resources depletes the online processing of the less-informative (visual) sensory input when both the visual and the auditory signals are available concurrently (see figure 24.4). Following up on this idea, the reduction in McGurk illusions could have resulted from the attenuation of the weight given to the visual input during AV speech integration when resources were compromised (see Navarra et al., 2010).
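
The critical contrast in these dual-task experiments can be summarized as follows: the drop in McGurk fusions under load should exceed any drop observed in unisensory performance. The Python sketch below states that contrast with hypothetical proportions; the numbers are invented and are not the data or analysis code of Alsius et al. (2005, 2007).

    # Hypothetical proportions illustrating the dual-task contrast discussed above;
    # none of these values come from the cited experiments.
    baseline = {"fusion": 0.80, "auditory_only": 0.95, "visual_only": 0.85}
    under_load = {"fusion": 0.55, "auditory_only": 0.92, "visual_only": 0.80}

    fusion_drop = baseline["fusion"] - under_load["fusion"]
    unisensory_drop = max(
        baseline["auditory_only"] - under_load["auditory_only"],
        baseline["visual_only"] - under_load["visual_only"],
    )

    # If the loss of fusions clearly exceeds the unisensory decrement, the load
    # effect is attributed to the integration stage rather than to degraded
    # unisensory input.
    print(fusion_drop, unisensory_drop, fusion_drop > unisensory_drop)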

Figure 24.4 Three hypothetical models regarding the possible influence of attention on the perception of AV speech are presented schematically. According to model 1, attention does not have any influence on AV speech perception (e.g., Dekle et al., 1992; Easton & Basala, 1982; Massaro, 1987). Model 2 assumes that attention modulates the integration of visual and auditory speech signals itself. When the available attentional resources are depleted, the final percept will rely, according to model 2, on unimodal information (e.g., auditory, in Alsius et al., 2005, 2007). Model 3 predicts that attention can influence the processing of visual speech input (arguably the less informative channel in normal acoustic conditions). According to this model, the processing of visual speech information is (at least partially) blocked when attentional resources are nearly exhausted (see Navarra et al., 2010). Recent studies have found evidence of attentional modulations of AV speech perception at multiple stages of processing, perhaps backing up a combination of models 2 and 3 (see Fairhall & Macaluso, 2009).

Finally, some studies have addressed whether, as expected from an encapsulated and cognitively impenetrable process, AV speech integration mechanisms prevent access to information about the individual (unisensory) components of the stimulus. This unitary notion of the AV percept has, in fact, been challenged by recent reports showing sensitivity to modality-specific characteristics during AV speech processing. For example, robust AV integration remains even under face/voice gender mismatches that otherwise lead to an increased sensitivity to AV temporal asynchrony (Vatakis & Spence, 2007). In a more explicit demonstration of this point, Soto-Faraco and Alsius (2007, 2009) used incongruent AV speech tokens leading to the illusory combination /bda/ (visual [ba] + acoustic /da/; e.g., McGurk & MacDonald, 1976) presented at varying AV asynchronies. Participants were asked to identify the syllable they heard and to perform a temporal order judgment (or simultaneity) task about the AV asynchrony in each trial. The data revealed that the temporal window at which the AV illusion occurred was substantially wider than the temporal interval needed to resolve the correct order between the AV signals. That is, at some asynchronies, participants would correctly report that audition (containing /da/) temporally led vision (containing [ba]) but classify the perceived syllable as /bda/ (note the inversion in the sequence of phonemes). This finding strongly suggests that access to some features of the individual unimodal stimuli (temporal order) is not prevented when AV events are bound (see Miller & D’Esposito, 2005, for a dissociation between neural correlates of AV fusion and coincidence detection).
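
A schematic way to express this dissociation is to compare the range of asynchronies over which fusion responses dominate with the range over which temporal order is reliably resolved. The Python sketch below does this with hypothetical response proportions and arbitrary criteria (50% for fusion, 75% for order judgments); it is an illustration of the comparison, not a reanalysis of Soto-Faraco and Alsius (2007, 2009).

    # Illustrative comparison between the AV fusion window and temporal-order
    # resolution; SOAs run from audio-leading (negative) to visual-leading
    # (positive), in ms, and all proportions are invented.
    soas = [-300, -200, -100, 0, 100, 200, 300]
    p_fusion = [0.15, 0.55, 0.80, 0.85, 0.80, 0.60, 0.20]         # /bda/-type reports
    p_order_correct = [0.95, 0.90, 0.75, 0.50, 0.70, 0.88, 0.94]  # temporal-order accuracy

    fusion_window = [s for s, p in zip(soas, p_fusion) if p >= 0.50]
    order_resolved = [s for s, p in zip(soas, p_order_correct) if s != 0 and p >= 0.75]

    # The point of the comparison: at several asynchronies observers can both
    # resolve the temporal order of the signals and still report the fused syllable.
    overlap = sorted(set(fusion_window) & set(order_resolved))
    print(fusion_window, order_resolved, overlap)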

In summary, extant data are far from conclusive with regard to the assumption of automaticity of AV speech binding held by several authors (Bertelson & Aschersleben, 1998; Bertelson & de Gelder, 2004; Bertelson et al., 2000; Colin et al., 2002; de Gelder & Bertelson, 2003; de Gelder, Böcker, Tuomainen, Hensen, & Vroomen, 1999; Vroomen et al., 2001). On the one hand, it is clear that the consequences of AV integration of speech are quite compelling and that these phenomena are indeed robust to several cognitive and attentional manipulations. On the other hand, it is no less true that behavioral (dual-task) and neuroimaging (spatial attention) evidence for attentional modulation
and cognitive penetrability of AV speech integration is starting to accumulate (see Navarra et al., 2010, for a review of this issue). This pattern of findings is, in general, inconsistent with an extreme conceptualization of AV speech integration as strictly automatic. Instead, the evidence seems to be in line with recent proposals about the flexible interplay between multisensory integration and attention (Talsma et al., 2010).

Concluding Remarks

Perceiving speech is, most of the time, a multisensory experience involving the interplay between at least two different sensory modalities: audition and vision. Although we cannot ignore the fact that, under good listening conditions, auditory input is sufficient to understand speech, there is undeniable evidence that the visual and auditory aspects of speech, when available, contribute to an integrated perception of spoken language. Indeed, anticipatory visual information seems to modulate how the incoming sounds are processed from quite early stages of signal processing.

Speech is an extremely complex stimulus that varies continuously over time at very fast rates. Thus, retrieving as much information as early as possible becomes crucial for communication to be efficient. The present volume is full of examples where the combination of signals from different sensory modalities leads to more accurate and faster representations of perceptual objects, and speech is no exception. We discussed evidence suggesting that AV integration of speech results in a more efficient encoding of the message. Although experience is crucial for developing an effective AV speech-processing system, there is evidence that infants are surprisingly well prepared to integrate visual and acoustic speech at very early ages.

The binding of AV speech streams seems to be, in fact, so strong that we are less sensitive to AV asynchrony when perceiving speech than when perceiving other stimuli (see Vatakis, Ghazanfar, & Spence, 2008; Vatakis & Spence, 2007). Despite results such as these (see also Tuomainen, Andersen, Tiippana, & Sams, 2005) that suggest that AV speech represents a special case of perceptual input for the brain, other experimental evidence indicates that this may not be the case, at least for the processing of certain perceptual properties. For example, temporal adaptation effects observed during exposure to temporally misaligned AV speech streams influence not only the perception of speech but also the perception of other (simpler) stimuli (e.g., Navarra, Vatakis, et al., 2005). Furthermore, the perception of speech and nonspeech stimuli seems to share common resources that can be depleted by a demanding nonspeech task (Alsius et al., 2005, 2007). Electrophysiological evidence of a general (rather than speech-specific) predictive role of visual processing with respect to the processing of sounds is another example where the apparently special status of speech perception could be set aside in favor of a more general, stimulus-nonspecific account of human perception (see Stekelenburg & Vroomen, 2007).

Regardless of whether it is special or not in terms of brain and cognitive processes, speech is essentially multisensory. Its perception seems to involve the constant interaction between vision and audition and, possibly, the combination of visual and auditory information in heteromodal areas of the brain, as well as the activation of other areas related to phonological representations and articulatory behavior. Although visual speech cues are integrated with sounds in a quite robust and automatic fashion, recent studies suggest (in line with other studies described in section V, Attention, in this volume) that the processes underlying this integration are more amenable to cognitive and attentional penetrability than originally assumed.

Acknowledgments

This work was supported by grant PSI2009–12859 and the Ramón y Cajal Programme (RyC-2008–00231) from the Ministerio de Ciencia e Innovación (Spain) to J. Navarra, by grant 81103 from the Natural Sciences and Engineering Research Council of Canada to J. F. Werker, and by grants PSI2010–15426 and CDS00012 from the Ministerio de Ciencia e Innovación (Spain), grant 2009SGR-292 from the Comissionat per a Universitats i Recerca del DIUE-Generalitat de Catalunya, and grant StG-2010 263145 from the European Research Council to S. Soto-Faraco.

Notes

1. One can experience for oneself how the auditory percept simply changes (i.e., from /da/ to /ba/) as a function of whether one opens or closes one’s eyes. There are demonstrations available online; a simple web search for “McGurk illusion” will return some examples (e.g., http://www.youtube.com/watch?v=jtsfidRq2tw). Note that phonemes appear between slashes (/ /) and visemes (the visual equivalent of phonemes) between brackets ([ ]).

2. We note that properties of the speech signal can also be extracted from sensory modalities other than vision and audition (e.g., Auer, Bernstein, & Coulter, 1998; Bernstein et al., 1989; Gick & Derrick, 2009; Grant, Ardell, Kuhl, & Sparks, 1985; Yuan, Reed, & Durlach, 2005; see Kirman, 1973, and Summers, 1992, for reviews). For example, somatosensory inputs can influence speech perception both in individuals who have experience using tactile methods of communication (Bernstein, Demorest, Coulter, & O’Connell, 1991; Reed, Durlach, Braida, & Schultz, 1989) and in individuals without any previous explicit training using, for example, tactile aids (Fowler & Dekle, 1991; Gick & Derrick, 2009; Gick, Jóhannsdóttir, Gibraiel, & Mühlbauer, 2008; Ito et al., 2009). For reasons of brevity, however, we will concentrate only on the AV case because the bulk of the research and the related controversies can be best exemplified in this literature.

3. Following seminal studies recording single-cell activity in the cat’s superior colliculus (SC), the idea of superadditivity in humans was characterized by a neural response (in this case in terms of BOLD signal, in a population of neurons) that exceeds the sum of the responses elicited by each of the unimodal inputs separately in that particular brain area (see the illustrative sketch after these notes). Even though the validity of the superadditive criterion for revealing multisensory integration has been put in doubt (both in human fMRI studies, e.g., Laurienti, Perrault, Stanford, Wallace, & Stein, 2005; Stevenson, Kim, & James, 2009; and in terms of single-cell electrophysiology, e.g., Holmes, 2009; Holmes & Spence, 2005), the STS still stands out as a multisensory site for AV speech integration when other criteria are used (see Stevenson et al., 2009; Calvert et al., 2000).

4. Note that the early interactions between audition and vision seen when subtracting the sum of the auditory and visual responses (A + V) from the audiovisual response (AV) have to be taken with caution because of possible differences in the level of attention among the auditory, the visual, and the AV conditions (see Besle et al., 2009). When these possible effects are controlled, a modulation of the auditory ERP components at 100–125 msec after the stimulus onset is observed (see Besle et al., 2009; Ponton et al., 2009).
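
Notes 3 and 4 rest on the same comparison: the bimodal response, AV, set against the sum of the unimodal responses, A + V. The Python sketch below illustrates that logic with invented numbers, first as a scalar superadditivity test on hypothetical BOLD estimates and then as an AV − (A + V) difference wave over hypothetical ERP time courses; it is not the analysis code of any of the cited studies.

    # Scalar superadditivity test on hypothetical BOLD estimates for one region;
    # the values stand in for GLM beta weights and are invented for illustration.
    def is_superadditive(bold_av, bold_a, bold_v):
        """True when the bimodal response exceeds the sum of the unimodal ones."""
        return bold_av > bold_a + bold_v

    print(is_superadditive(bold_av=1.4, bold_a=0.6, bold_v=0.5))  # True
    print(is_superadditive(bold_av=1.0, bold_a=0.6, bold_v=0.5))  # False

    # Additive-model difference wave, AV(t) - [A(t) + V(t)], for made-up ERP
    # traces sampled every 25 ms from 0 to 200 ms post-stimulus (arbitrary units).
    times  = [0, 25, 50, 75, 100, 125, 150, 175, 200]
    erp_a  = [0.0, 0.2, 0.6, 1.0, 1.4, 1.2, 0.8, 0.4, 0.1]
    erp_v  = [0.0, 0.1, 0.2, 0.3, 0.3, 0.2, 0.2, 0.1, 0.0]
    erp_av = [0.0, 0.3, 0.7, 1.1, 1.3, 1.1, 0.8, 0.4, 0.1]

    # Negative values around 100-125 ms would correspond to the suppressive AV
    # interaction described in note 4, provided attention is matched across the
    # auditory, visual, and audiovisual conditions.
    diff_wave = [round(av - (a + v), 2) for av, a, v in zip(erp_av, erp_a, erp_v)]
    print(list(zip(times, diff_wave)))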

References

Aldridge, M. A., Braga, E. S., Walton, G. E., & Bower, T. G. R. (1999). The intermodal representation of speech in newborns. Developmental Science, 2, 42–46.

Alsius, A., Navarra, J., & Soto-Faraco, S. (2007). Attention to touch reduces audiovisual speech integration. Experimental Brain Research, 183, 399–404.

Alsius, A., & Soto-Faraco, S. (2011). Searching for audiovisual correspondence in multiple speaker scenarios. Experimental Brain Research; Epub ahead of print.

Alsius, A., Navarra, J., Campbell, R., & Soto-Faraco, S. (2005). Audiovisual integration of speech falters under high attention demands. Current Biology, 15, 839–843.

Andersen, T. S., Tiippana, K., Laarni, J., Kojo, I., & Sams, M. (2009). The role of visual spatial attention in audiovisual speech perception. Speech Communication, 51, 184–193.

Arnal, L. H., Morillon, B., Kell, C. A., & Giraud, A. L. (2009). Dual neural routing of visual facilitation in speech processing. Journal of Neuroscience, 29, 13445–13453.

Auer, E. T., Jr. (2010). Investigating speechreading and deafness. Journal of the American Academy of Audiology, 21, 163–168.

Auer, E. T., Jr., Bernstein, L. E., & Coulter, D. C. (1998). Temporal and spatio-temporal vibrotactile displays for voice fundamental frequency: an initial evaluation of a new vibrotactile speech perception aid with normal-hearing and hearing-impaired individuals. Journal of the Acoustical Society of America, 104, 2477–2489.

Baier, R., Idsardi, W., & Lidz, J. (2007). Two-month-olds are sensitive to lip rounding in dynamic and static speech events. Audiovisual Speech Processing Conference, Kasteel Groenendaal, Hilvarenbeek, The Netherlands.

Beauchamp, M. S., Argall, B. D., Bodurka, J., Duyn, J. H., & Martin, A. (2004). Unraveling multisensory integration: patchy organization within human STS multisensory cortex. Nature Neuroscience, 7, 1190–1192.

Bell-Berti, F., Raphael, L. J., Pisoni, D. B., & Sawusch, J. R. (1979). Some relationships between speech production and perception. Phonetica, 36, 373–383.

Bernstein, L. E., Auer, E. T., Jr., Moore, J. K., Ponton, C. W., Don, M., & Singh, M. (2002). Visual speech perception without primary auditory cortex activation. Neuroreport, 13, 311–315.

Bernstein, L. E., Auer, E. T., Jr., Wagner, M., & Ponton, C. W. (2008). Spatio-temporal dynamics of audiovisual speech processing. NeuroImage, 39, 423–435.

Bernstein, L. E., Auer, E. T., Jr., & Moore, J. K. (2004). Audiovisual speech binding: Convergence or association? In G. Calvert, C. Spence, & B. E. Stein (Eds.), Handbook of multisensory processing (pp. 203–223). Cambridge, MA: MIT Press.

Bernstein, L. E., Demorest, M. E., Coulter, D. C., & O’Connell, M. P. (1991). Lipreading sentences with vibrotactile vocoders: performance of normal-hearing and hearing-impaired subjects. Journal of the Acoustical Society of America, 90, 2971–2984.

Bernstein, L. E., Eberhardt, S. P., & Demorest, M. E. (1989). Single-channel vibrotactile supplements to visual perception of intonation and stress. Journal of the Acoustical Society of America, 85, 397–405.

Bernstein, L. E., Lu, Z. L., & Jiang, J. (2008). Quantified acoustic-optical speech signal incongruity identifies cortical sites of audiovisual speech processing. Brain Research, 1242, 172–184.

Bertelson, P., & Aschersleben, G. (1998). Automatic visual bias of perceived auditory location. Psychonomic Bulletin & Review, 5, 482–489.

Bertelson, P., & de Gelder, B. (2004). The psychology of multimodal perception. In C. Spence & J. Driver (Eds.), Crossmodal space and crossmodal attention (pp. 141–179). Oxford: Oxford University Press.

Bertelson, P., Vroomen, J., de Gelder, B., & Driver, J. (2000). The ventriloquist effect does not depend on the direction of deliberate visual attention. Perception & Psychophysics, 62, 321–332.

Besle, J., Fischer, C., Bidet-Caulet, A., Lecaignard, F., Bertrand, O., & Giard, M. H. (2008). Visual activation and audiovisual interactions in the auditory cortex during speech perception: intracranial recordings in humans. Journal of Neuroscience, 28, 14301–14310.

Besle, J., Fort, A., Delpuech, C., & Giard, M. H. (2004). Bimodal speech: early suppressive visual effects in human auditory cortex. European Journal of Neuroscience, 20, 2225–2234.

Best, C. (1995). A direct realist view of cross-language speech perception. In W. Strange (Ed.), Speech perception and linguistic experience: Issues in cross-language research (pp. 171–204). Baltimore, MD: York Press.

Binder, J. R., Frost, J. A., Hammeke, T. A., Cox, R. W., Rao, S. M., & Prieto, T. (1997). Human brain language areas identified by functional MRI. Journal of Neuroscience, 17, 353–362.

Birch, H., & Lefford, A. (1963). Intersensory development in children. Monographs of the Society for Research in Child Development, 28.

Blasi, V., Paulesu, E., Mantovani, F., Menoncello, L., Giovanni, U. D., Sensolo, S., et al. (1999). Ventral prefrontal areas specialised for lip-reading: a PET activation study. NeuroImage, 9, 1003.

Braida, L. (1991). Crossmodal integration in the identification of consonant segments. Quarterly Journal of Experimental Psychology. A, Human Experimental Psychology, 43, 647–678.

Bristow, D., Dehaene-Lambertz, G., Mattout, J., Soares, C., Gliga, T., & Baillet, S. (2009). Hearing faces: how the infant brain matches the face it sees with the speech it hears. Journal of Cognitive Neuroscience, 21, 905–921.

Burnham, D., & Dodd, B. (2004). Auditory-visual speech integration by pre-linguistic infants: perception of an emergent consonant in the McGurk effect. Developmental Psychobiology, 44, 204–220.

Callan, D., Jones, J. A., Munhall, K. G., Kroos, C., Callan, A., & Vatikiotis-Bateson, E. (2003). Neural processes underlying perceptual enhancement by visual speech gestures. Neuroreport, 14, 2213–2218.

Calvert, G. A., Bullmore, E. T., Brammer, M. J., & Campbell, R. (1997). Activation of auditory cortex during silent lipreading. Science, 276, 593–596.

Calvert, G. A., Campbell, R., & Brammer, M. J. (2000). Evidence from functional magnetic resonance imaging of crossmodal binding in the human heteromodal cortex. Current Biology, 10, 649–657.

Campbell, R., Zihl, J., Massaro, D., Munhall, K., & Cohen, M. (1997). Speechreading in the akinetopsic patient L.M. Brain, 120, 1793–1803.

Colin, C., Radeau, M., & Deltenre, P. (2005). Top-down and bottom-up modulation of audiovisual integration in speech. European Journal of Cognitive Psychology, 17, 541–560.

Colin, C., Radeau, M., Soquet, A., Demolin, D., Colin, F., & Deltenre, P. (2002). Mismatch negativity evoked by the McGurk-MacDonald effect: a phonetic representation within short-term memory. Clinical Neurophysiology, 113, 495–506.

Coltheart, M. (1999). Modularity and cognition. Trends in Cognitive Sciences, 3, 115–120.

Cotton, J. C. (1935). Normal visual hearing. Science, 82, 592–593.

Crinion, J. T., Warburton, E. A., Lambon-Ralph, M. A., Howard, D., & Wise, R. J. (2006). Listening to narrative speech after aphasic stroke: the role of the left anterior temporal lobe. Cerebral Cortex, 16, 1116.

D’Ausilio, A., Pulvermüller, F., Salmas, P., Bufalari, I., Begliomini, C., & Fadiga, L. (2009). The motor somatotopy of speech perception. Current Biology, 19, 381–385.

de Gelder, B., Böcker, K. B., Tuomainen, J., Hensen, M., & Vroomen, J. (1999). The combined perception of emotion from voice and face: early interaction revealed by human electric brain responses. Neuroscience Letters, 260, 133–136.

de Gelder, B., & Bertelson, P. (2003). Multisensory integration, perception and ecological validity. Trends in Cognitive Sciences, 7, 460–467.

Dekle, D., Fowler, C., & Funnell, M. (1992). Auditory-visual integration in perception of real words. Perception & Psychophysics, 51, 355–362.

Desjardins, R. N., & Werker, J. F. (2004). Is the integration of heard and seen speech mandatory for infants? Developmental Psychobiology, 45, 187–203.

Diehl, R. L., & Kluender, K. R. (1987). On the categorization of speech sounds. In S. Harnad (Ed.), Categorical perception: the groundwork of cognition (pp. 226–253). New York: Cambridge University Press.

Diehl, R. L., Lotto, A. J., & Holt, L. L. (2004). Speech percep-tion. Annual Review of Psychology, 55, 149–179.

Dodd, B. (1979). Lip-reading in infants: attention to speech presented in- and out-of-synchrony. Cognitive Psychology, 11, 478–484.

Driver, J. (1996). Enhancement of selective listening by illusory mislocation of speech sounds due to lip-reading. Nature, 381, 66–68.

Easton, R. D., & Basala, M. (1982). Perceptual dominance during lipreading. Perception & Psychophysics, 32, 562–570.

Fadiga, L., Craighero, L., Buccino, G., & Rizzolatti, G. (2002). Speech listening specifically modulates the excitability of tongue muscles: a TMS study. European Journal of Neuroscience, 15, 399–402.

Fadiga, L., Craighero, L., & Olivier, E. (2005). Human motor cortex excitability during the perception of others’ action. Current Opinion in Neurobiology, 15, 213–218.

Fairhall, S. L., & Macaluso, E. (2009). Spatial attention can modulate audiovisual integration at multiple cortical and subcortical sites. European Journal of Neuroscience, 29, 1247–1257.

Fodor, J. (1983). The modularity of mind. Cambridge, MA: The MIT Press.

Fowler, C. A. (2004). Speech as a supramodal or amodal phenomenon. In G. Calvert, C. Spence, & B. E. Stein (Eds.), Handbook of multisensory processing (pp. 189–202). Cambridge, MA: MIT Press.

Fowler, C. A., Brown, J. M., Sabadini, L., & Weihing, J. (2003). Rapid access to speech gestures in perception: evidence from choice and simple response time tasks. Journal of Memory and Language, 49, 396–413.

Fowler, C. A., & Dekle, D. J. (1991). Listening with eye and hand: cross-modal contributions to speech perception. Journal of Experimental Psychology. Human Perception and Performance, 17, 816–823.

Galantucci, B., Fowler, C. A., & Goldstein, L. (2009). Perceptuomotor compatibility effects in speech. Attention, Perception & Psychophysics, 71, 1138–1149.

Gentilucci, M., & Bernardis, P. (2007). Imitation during phoneme production. Neuropsychologia, 45, 608–615.

Gentilucci, M., & Cattaneo, L. (2005). Automatic audiovisual integration in speech perception. Experimental Brain Research, 167, 66–75.

Gibson, E. J. (1969). Principles of perceptual learning and development. Englewood Cliffs, NJ: Prentice Hall.

Gick, B., & Derrick, D. (2009). Aero-tactile integration in speech perception. Nature, 462, 502–504.

Gick, B., Jóhannsdóttir, K. M., Gibraiel, D., & Mühlbauer, J. (2008). Tactile enhancement of auditory and visual speech perception in untrained perceivers. Journal of the Acoustical Society of America, 123, EL72–EL76.

Gordon, P. C., & Meyer, D. E. (1984). Perceptual-motor processing of phonetic features in speech. Journal of Experimental Psychology. Human Perception and Performance, 10, 153–178.

Grant, K. W., Ardell, L. H., Kuhl, P. K., & Sparks, D. W. (1985). The contribution of fundamental frequency, amplitude envelope, and voicing duration cues to speechreading in normal-hearing subjects. Journal of the Acoustical Society of America, 77, 671–677.

Grant, K. W., & Greenberg, S. (2001). Speech intelligibility derived from asynchronous processing of auditory-visual information. Proceedings of the AVSP 2001 International Conference on Auditory-Visual Speech Processing, Scheelsminde, Denmark, pp. 132–137.

Green, K. P., & Gerdeman, A. (1995). Cross-modal discrepancies in coarticulation and the integration of speech information: the McGurk effect with mismatched vowels. Journal of Experimental Psychology. Human Perception and Performance, 21, 1409–1426.

Green, K. P., Kuhl, P. K., Meltzoff, A. N., & Stevens, E. B. (1991). Integrating speech information across talkers, gender, and sensory modality: female faces and male voices in the McGurk effect. Perception & Psychophysics, 50, 524–536.

Hadar, U., Steiner, T. J., Grant, E. C., & Rose, F. C. (1983). Head movement correlates of juncture and stress at sentence level. Language and Speech, 26, 117–129.

Hadar, U., Steiner, T. J., Grant, E. C., & Rose, F. C. (1984). The timing of shifts in head posture during conversation. Human Movement Science, 3, 237–245.

Hertrich, I., Mathiak, K., Lutzenberger, W., & Ackermann, H. (2008). Time course of early audiovisual interactions during speech and nonspeech central auditory processing: a magnetoencephalography study. Journal of Cognitive Neuroscience, 21, 259–274.

Hickok, G., Holt, L. L., & Lotto, A. J. (2009). Response to Wilson: what does motor cortex contribute to speech perception? Trends in Cognitive Sciences, 13, 330–331.

Holmes, N. P. (2009). The principle of inverse effectiveness in multisensory integration: some statistical considerations. Brain Topography, 21, 168–176.

Holmes, N. P., & Spence, C. (2005). Multisensory integration: space, time, and superadditivity. Current Biology, 15, R762–R764.

Houde, J. F., & Jordan, M. I. (1998). Sensorimotor adaptation in speech production. Science, 279, 1213–1216.

Ito, T., Tiede, M., & Ostry, D. J. (2009). Somatosensory function in speech perception. Proceedings of the National Academy of Sciences of the United States of America, 106, 1245–1248.

Jiang, J., Auer, E. T., Jr., Alwan, A., Keating, P. A., & Bernstein, L. E. (2007). Similarity structure in visual speech perception and optical phonetics. Perception & Psychophysics, 69, 1070–1083.

Kerzel, D., & Bekkering, H. (2000). Motor activation from visible speech: evidence from stimulus response compatibility. Journal of Experimental Psychology. Human Perception and Performance, 26, 634–647.

Kirman, J. H. (1973). Tactile communication of speech: a review and an analysis. Psychological Bulletin, 80, 54–74.

Kislyuk, D. S., Möttönen, R., & Sams, M. (2008). Visual processing affects the neural basis of auditory discrimination. Journal of Cognitive Neuroscience, 20, 2175–2184.

Klatt, D. H. (1980). Speech perception. A model of acoustic-phonemic analysis and lexical access. Journal of Phonetics, 8, 279–312.

Kuhl, P. K., & Meltzoff, A. N. (1982). The bimodal perception of speech in infancy. Science, 218, 1138–1141.

Kuhl, P. K., & Meltzoff, A. N. (1984). The intermodal representation of speech in infants. Infant Behavior and Development, 7, 361–381.

Kushnerenko, E., Teinonen, T., Volein, A., & Csibra, G. (2008). Electrophysiological evidence of illusory audiovisual speech percept in human infants. Proceedings of the National Academy of Sciences of the United States of America, 105, 11442–11445.

Laurienti, P. J., Perrault, T. J., Stanford, T. R., Wallace, M. T., & Stein, B. E. (2005). On the use of superadditivity as a metric for characterizing multisensory integration in functional neuroimaging studies. Experimental Brain Research, 166, 289–297.

Lavie, N. (1995). Perceptual load as a necessary condition for selective attention. Journal of Experimental Psychology. Human Perception and Performance, 21, 451–468.

Lebib, R., Papo, D., de Bode, S., & Baudonnière, P. M. (2003). Evidence of a visual-to-auditory cross-modal sensory gating phenomenon as reflected by the human P50 event-related brain potential modulation. Neuroscience Letters, 341, 185–188.

Levelt, W. J. M., Schriefers, H., Vorberg, D., Meyer, A. S., Pechmann, T., & Havinga, J. (1991). The time course of lexical access in speech production: a study of picture naming. Psychological Review, 98, 122–142.

Lewkowicz, D. J. (1994). Development of intersensory perception in human infants. In D. J. Lewkowicz & R. Lickliter (Eds.), The development of intersensory perception: comparative perspectives (pp. 165–203). Hillsdale, NJ: Lawrence Erlbaum.

Lewkowicz, D. J. (2000). Infants’ perception of the audible, visible and bimodal attributes of multimodal syllables. Child Development, 71, 1241–1257.

Lewkowicz, D. J. (2010). Infant perception of audiovisual speech synchrony. Developmental Psychology, 46, 66–77.

Lewkowicz, D. J., & Ghazanfar, A. A. (2006). The decline of cross-species intersensory perception in human infants. Proceedings of the National Academy of Sciences of the United States of America, 103, 6771–6774.

Lewkowicz, D. J., Leo, I., & Simion, F. (2010). Intersensory perception at birth: newborns match non-human primate faces & voices. Infancy, 15, 46–60.

Lewkowicz, D. J., Sowinski, R., & Place, S. (2008). The decline of cross-species intersensory perception in human infants: underlying mechanisms & its developmental persistence. Brain Research, 1242, 291–302.

Liberman, A. M. (1982). On finding that speech is special. American Psychologist, 37, 148–167.

Liberman, A. M., Cooper, F. S., Shankweiler, D. P., & Studdert-Kennedy, M. (1967). Perception of the speech code. Psychological Review, 74, 431–461.

Liberman, A. M., & Mattingly, I. G. (1985). The motor theory of speech perception revised. Cognition, 21, 1–36.

MacKain, K., Studdert-Kennedy, M., Spieker, S., & Stern, D. (1983). Infant intermodal speech perception is a left hemisphere function. Science, 219, 1347–1349.

Mahon, B. Z., & Caramazza, A. (2008). A critical look at the embodied cognition hypothesis and a new proposal for grounding conceptual content. Journal of Physiology, Paris, 102, 59–70.

Massaro, D. W. (1987). Speech perception by ear and eye: a paradigm for psychological inquiry. Hillsdale, NJ: Lawrence Erlbaum Associates.

Massaro, D. W. (1998). Perceiving talking faces: from speech perception to a behavioral principle. Cambridge, MA: MIT Press.

Massaro, D. W. (2004). From multisensory integration to talking heads and language learning. In G. Calvert, C. Spence, & B. E. Stein (Eds.), Handbook of multisensory processing (pp. 153–176). Cambridge, MA: MIT Press.

McClelland, J. L., & Elman, J. L. (1986). The TRACE model of speech perception. Cognitive Psychology, 18, 1–86.

McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264, 746–748.

Miller, L. M., & D’Esposito, M. (2005). Perceptual fusion and stimulus coincidence in the cross-modal integration of speech. Journal of Neuroscience, 25, 5884–5893.

Mitterer, H., & Ernestus, M. (2008). The link between speech perception and production is phonological and abstract: evidence from the shadowing task. Cognition, 109, 168–173.

Mohammed, T., Campbell, R., MacSweeney, M., Milne, E., Hansen, P., & Coleman, M. (2005). Speechreading skill and visual movement sensitivity are related in deaf speechreaders. Perception, 34, 205–216.

Möttönen, R., Schürmann, M., & Sams, M. (2004). Time course of multisensory interactions during audiovisual speech perception in humans: a magnetoencephalographic study. Neuroscience Letters, 363, 112–115.

Möttönen, R., & Watkins, K. E. (2009). Motor representations of articulators contribute to categorical perception of speech sounds. Journal of Neuroscience, 29, 9819–9825.

Munhall, K., Jones, J. A., Callan, D. E., Kuratate, T., & Vatikiotis-Bateson, E. (2004). Head movement improves auditory speech perception. Psychological Science, 15, 133–137.

Munhall, K. G., Servos, P., Santi, A., & Goodale, M. A. (2002). Dynamic visual speech perception in a patient with visual form agnosia. Neuroreport, 13, 1793–1796.

Nasir, S. M., & Ostry, D. J. (2009). Auditory plasticity and speech motor learning. Proceedings of the National Academy of Sciences of the United States of America, 106, 20470–20475.

Navarra, J., Alsius, A., Soto-Faraco, S., & Spence, C. (2010). Assessing the role of attention in the audiovisual integration of speech. Information Fusion, 11, 4–11.

Navarra, J., Sebastián-Gallés, N., & Soto-Faraco, S. (2005). The perception of second language sounds in early bilinguals: new evidence from an implicit measure. Journal of Experimental Psychology. Human Perception and Performance, 31, 912–918.

Navarra, J., & Soto-Faraco, S. (2007). Hearing lips in a second language: visual articulatory information enables the perception of L2 sounds. Psychological Research, 71, 4–12.

Navarra, J., Vatakis, A., Zampini, M., Soto-Faraco, S., Humphreys, W., & Spence, C. (2005). Exposure to asynchronous audiovisual speech extends the temporal window for audiovisual integration. Brain Research. Cognitive Brain Research, 25, 499–507.

Pallier, C. (1994). Rôle de la syllabe dans la perception de la parole: études attentionnelles. Unpublished PhD dissertation, École des Hautes Études en Sciences Sociales, Paris.

Patterson, M. L., & Werker, J. F. (1999). Matching phonetic information in lips and voice is robust in 4.5-month-old infants. Infant Behavior and Development, 22, 237–247.

Patterson, M. L., & Werker, J. F. (2002). Infants’ ability to match dynamic phonetic and gender information in the face and voice. Journal of Experimental Child Psychology, 81, 93–115.

Patterson, M. L., & Werker, J. F. (2003). Two-month-old infants match phonetic information in lips and voice. Developmental Science, 6, 191–196.

Piaget, J. (1952). The origins of intelligence in children. New York: International University Press.

Pons, F., Lewkowicz, D. J., Soto-Faraco, S., & Sebastián-Gallés, N. (2009). Narrowing of intersensory speech perception in infancy. Proceedings of the National Academy of Sciences of the United States of America, 106, 10598–10602.

Ponton, C. W., Bernstein, L. E., & Auer, E. T., Jr. (2009). Mismatch negativity with visual-only and audiovisual speech. Brain Topography, 21, 207–215.

Poulin-Dubois, D., Serbin, L., Kenyon, B., & Derbyshire, A. (1994). Infants’ intermodal knowledge about gender. Developmental Psychology, 30, 436–442.

Pulvermüller, F., & Fadiga, L. (2010). Active perception: Sensorimotor circuits as a cortical basis for language. Nature Reviews. Neuroscience, 11, 351–360.

Reed, C. M., Durlach, N. I., Braida, L. D., & Schultz, M. C. (1989). Analytic study of the Tadoma method: effects of hand position on segmental speech perception. Journal of Speech and Hearing Research, 32, 921–929.

Reisberg, D., McLean, J., & Goldfield, A. (1987). Easy to hear but hard to understand: a lip-reading advantage with intact auditory stimuli. In B. Dodd & R. Campbell (Eds.), Hearing by eye: the psychology of lip-reading (pp. 97–114). Hillsdale, NJ: Lawrence Erlbaum Associates.

Roelofs, A., Özdemir, R., & Levelt, W. J. M. (2007). Influences of spoken word planning on speech recognition. Journal of Experimental Psychology. Learning, Memory, and Cognition, 33, 900–913.

Rosenblum, L. D., Schmuckler, M. A., & Johnson, J. A. (1997). The McGurk effect in infants. Perception & Psychophysics, 59, 347–357.

Ross, L. A., Saint-Amour, D., Leavitt, V. M., Javitt, D. C., & Foxe, J. J. (2007). Do you see what I am saying? Exploring visual enhancement of speech comprehension in noisy environments. Cerebral Cortex, 17, 1147–1153.

Sams, M., Aulanko, R., Hämäläinen, M., Hari, R., Lounasmaa, O. V., Lu, S. T., et al. (1991). Seeing speech: visual information from lip movements modifies activity in the human auditory cortex. Neuroscience Letters, 127, 141–145.

Sams, M., Möttönen, R., & Sihvonen, T. (2005). Seeing and hearing others and oneself talk. Brain Research. Cognitive Brain Research, 23, 429–435.

Sato, M., Buccino, G., Gentilucci, M., & Cattaneo, L. (2009). On the tip of the tongue: modulation of the primary motor cortex during audiovisual speech perception. Speech Communication, 52, 533–541.

Schneider, W., & Shiffrin, R. M. (1977). Controlled and automatic human information processing: detection, search, and attention. Psychological Review, 84, 1–66.

Schroeder, C. E., Lakatos, P., Kajikawa, Y., Partan, S., & Puce, A. (2008). Neuronal oscillations and visual amplification of speech. Trends in Cognitive Sciences, 12, 106–113.

Skipper, J. I., Nusbaum, H. C., & Small, S. L. (2005). Listening to talking faces: motor cortical activation during speech perception. NeuroImage, 25, 76–89.

Skipper, J. I., van Wassenhove, V., Nusbaum, H. C., & Small, S. L. (2007). Hearing lips and seeing voices: how cortical areas supporting speech production mediate audiovisual speech perception. Cerebral Cortex, 17, 2387–2399.

Soken, N. H., & Pick, A. D. (1992). Intermodal perception of happy and angry expressive behaviors by seven-month-old infants. Child Development, 63, 787–795.

Soto-Faraco, S., & Alsius, A. (2007). Conscious access to the unisensory components in a cross-modal illusion. Neuroreport, 18, 347–350.

Soto-Faraco, S., & Alsius, A. (2009). Deconstructing the McGurk-MacDonald illusion. Journal of Experimental Psychology. Human Perception and Performance, 35, 580–587.

Soto-Faraco, S., Navarra, J., & Alsius, A. (2004). Assessing automaticity in audiovisual speech integration: evidence from the speeded classification task. Cognition, 92, B13–B23.

Soto-Faraco, S., Navarra, J., Weikum, W., Vouloumanos, A., Sebastián-Gallés, N., & Werker, J. (2007). Discriminating languages by speechreading. Perception & Psychophysics, 69, 218–231.

Stekelenburg, J. J., & Vroomen, J. (2007). Neural correlates of multisensory integration of ecologically valid audiovisual events. Journal of Cognitive Neuroscience, 19, 1964–1973.

Stevens, K. N. (2002). Toward a model for lexical access based on acoustic landmarks and distinctive features. Journal of the Acoustical Society of America, 111, 1872–1891.

Stevenson, R. A., Kim, S., & James, T. W. (2009). An additive-factors design to disambiguate neuronal and areal convergence: measuring multisensory interactions between audio, visual, and haptic sensory streams using fMRI. Experimental Brain Research, 198, 183–194.

Sumby, W. H., & Pollack, I. (1954). Visual contribution to speech intelligibility in noise. Journal of the Acoustical Society of America, 26, 212–215.

Summerfield, Q., & McGrath, M. (1984). Detection and resolution of audiovisual incompatibility in the perception of vowels. Quarterly Journal of Experimental Psychology: Human Experimental Psychology, 36, 51–74.

Summers, I. R. (Ed.). (1992). Tactile aids for the hearing impaired (practical aspects of audiology). London: John Wiley & Sons.

Talsma, D., Senkowski, D., Soto-Faraco, S., & Woldorff, M. G. (2010). The multifaceted interplay between attention and multisensory integration. Trends in Cognitive Sciences, 14, 400–410.

Talsma, D., & Woldorff, M. G. (2005). Selective attention and multisensory integration: multiple phases of effects on the evoked brain activity. Journal of Cognitive Neuroscience, 17, 1098–1114.

Teinonen, T., Aslin, R. N., Alku, P., & Csibra, G. (2008). Visual speech contributes to phonetic learning in 6-month-old infants. Cognition, 105, 850–855.

Terao, Y., Ugawa, Y., Yamamoto, T., Sakurai, Y., Masumoto, T., & Abe, O. (2007). Primary face motor area as the motor representation of articulation. Journal of Neurology, 254, 442–447.

Tiippana, K., Andersen, T. S., & Sams, M. (2004). Visual attention modulates audiovisual speech perception. European Journal of Cognitive Psychology, 16, 457–472.

Tuomainen, J., Andersen, T. S., Tiippana, K., & Sams, M. (2005). Audio-visual speech is special. Cognition, 96, B13–B22.

van Wassenhove, V., Grant, K. W., & Poeppel, D. (2005). Visual speech speeds up the neural processing of auditory speech. Proceedings of the National Academy of Sciences of the United States of America, 102, 1181–1186.

Vatakis, A., Ghazanfar, A. A., & Spence, C. (2008). Facilitation of multisensory integration by the “unity effect” reveals that speech is special. Journal of Vision (Charlottesville, Va.), 8, 1–11.

Vatakis, A., & Spence, C. (2007). Crossmodal binding: evaluating the “unity assumption” using audiovisual speech stimuli. Perception & Psychophysics, 69, 744.

Vatikiotis-Bateson, E., Munhall, K. G., Kasahara, Y., Garcia, F., & Yehia, H. (1996). Characterizing audiovisual information during speech. Proceedings of the 4th International Conference on Spoken Language Processing (ICSLP 96, October 3–6), Philadelphia, PA (pp. 1485–1488).

Vouloumanos, A., Druhen, M. J., Hauser, M. D., & Huizink, A. T. (2009). Five-month-old infants’ identification of the sources of vocalizations. Proceedings of the National Academy of Sciences of the United States of America, 106, 18867–18872.

Vroomen, J., Bertelson, P., & de Gelder, B. (2001). The ventriloquist effect does not depend on the direction of automatic visual attention. Perception & Psychophysics, 63, 651–659.

Walker-Andrews, A. S. (1997). Infants’ perception of expressive behaviors: differentiation of multimodal information. Psychological Bulletin, 121, 437–456.

Walker-Andrews, A. S., Bahrick, L. E., Raglioni, S. S., & Diaz, I. (1991). Infants’ bimodal perception of gender. Ecological Psychology, 3, 55–75.

Walton, G. E., & Bower, T. G. (1993). Amodal representations of speech in infants. Infant Behavior and Development, 16, 233–243.

Watkins, K. E., Strafella, A. P., & Paus, T. (2003). Seeing and hearing speech excites the motor system involved in speech production. Neuropsychologia, 41, 989–994.

Weikum, W. M., Vouloumanos, A., Navarra, J., Soto-Faraco, S., Sebastián-Gallés, N., & Werker, J. F. (2007). Visual language discrimination in infancy. Science, 316, 1159.

Wilson, S. M., Saygin, A. P., Sereno, M. I., & Iacoboni, M. (2004). Listening to speech activates motor areas involved in speech production. Nature Neuroscience, 7, 701–702.

Windmann, S. (2003). Effects of sentence context and expectation on the McGurk illusion. Journal of Memory and Language, 50, 212–230.

Wright, T. M., Pelphrey, K. A., Allison, T., McKeown, M. J., & McCarthy, G. (2003). Polysensory interactions along lateral temporal regions evoked by audiovisual speech. Cerebral Cortex, 13, 1034–1043.

Yehia, H. C., Kuratate, T., & Vatikiotis-Bateson, E. (2002). Linking facial animation, head motion and speech acoustics. Journal of Phonetics, 30, 555–568.

Yuan, H., Reed, C. M., & Durlach, N. I. (2005). Tactual display of consonant voicing as a supplement to lipreading. Journal of the Acoustical Society of America, 118, 1003–1015.

Yuen, I., Davis, M. H., Brysbaert, M., & Rastle, K. (2010). Activation of articulatory information in speech perception. Proceedings of the National Academy of Sciences of the United States of America, 107, 592–597.
