


Gaze Strategies and Audiovisual Speech Enhancement

by

Astrid Yi

A thesis submitted in conformity with the requirements
for the degree of Master of Applied Science

Graduate Department of Electrical and Computer Engineering and
Institute of Biomaterials and Biomedical Engineering

University of Toronto

Copyright © 2010 by Astrid Yi


Abstract

Gaze Strategies and Audiovisual Speech Enhancement

Astrid Yi

Master of Applied Science

Graduate Department of Electrical and Computer Engineering and Institute of Biomaterials and Biomedical Engineering

University of Toronto

2010

Quantitative relationships were established between speech intelligibility and gaze patterns when subjects listened to sentences spoken by a single talker at different auditory SNRs while viewing one or more talkers. When the auditory SNR was reduced and subjects moved their eyes freely, the main gaze strategy involved looking closer to the mouth. The natural tendency to move closer to the mouth was found to be consistent with a gaze strategy that helps subjects improve their speech intelligibility in environments that include multiple talkers.

With a single talker and a fixed point of gaze, subjects’ speech intelligibility was found to be optimal for fixations that were distributed within 10° of the center of the mouth. Lower performance was observed at larger eccentricities, and this decrease in performance was investigated by mapping the reduced acuity in the peripheral region to various levels of spatial degradation.


Acknowledgements

I would like to thank my supervisor, Dr. Willy Wong, for his guidance and contributions to this thesis. I would also like to thank Dr. Moshe Eizenman for his insights and knowledge about various aspects of vision, and for the use of his lab’s point-of-gaze estimation system, without which this work would not have been possible.

I would like to express my gratitude to my subjects, who came back multiple times in order to run the various experiments.

I want to thank my parents for giving me the opportunity to pursue this academic endeavor, and I am grateful to Eric Dacquay, who agreed to record all the stimuli used for the experiments.


Contents

1 Introduction
  1.1 Current State of Knowledge Regarding Audiovisual Speech
    1.1.1 Theories Behind Audiovisual Speech
    1.1.2 Neural Mechanisms Underlying Speech Perception
    1.1.3 Conditions for Audiovisual Speech
    1.1.4 Signal Characteristics of an Audiovisual Speech Stream
  1.2 Purpose of the Current Work
  1.3 Thesis Outline

2 General Methodology
  2.1 Experiment Stimuli
    2.1.1 Video Processing
    2.1.2 Audio Processing
  2.2 Experimental Setup
    2.2.1 Subjects
    2.2.2 Hardware Apparatus
    2.2.3 Room Conditions
    2.2.4 Calibration
    2.2.5 Experiment Software
  2.3 Data Analysis
    2.3.1 Speech Intelligibility Score
    2.3.2 Eye Movements

3 Gaze Patterns and Audiovisual Speech Enhancement
  3.1 Speech Intelligibility with Natural Viewing (Single Talker)
    3.1.1 Stimuli
    3.1.2 Experiment Procedure
    3.1.3 Audiovisual Threshold and Speech Intelligibility
    3.1.4 Percentage of Time Spent Fixating on Different Facial Regions
    3.1.5 Gaze Patterns with Respect to Time
    3.1.6 Natural Gaze Strategies
  3.2 Speech Intelligibility with a Fixed Point of Gaze (Single Talker)
    3.2.1 Stimuli
    3.2.2 Experiment Procedure
    3.2.3 Results and Discussion
  3.3 Speech Intelligibility with Multiple Talkers
    3.3.1 Stimuli
    3.3.2 Experiment Procedure
    3.3.3 Speech Intelligibility Results
    3.3.4 Natural Gaze Strategies
  3.4 Comparison of Audiovisual Speech Perception with Single and Multiple Talkers

4 Visual Processing and Audiovisual Speech Enhancement
  4.1 Methods
    4.1.1 Subjects
    4.1.2 Stimuli
    4.1.3 Experiment Design and Procedure
  4.2 Speech Intelligibility with Low-Pass Filtered Videos
  4.3 Discussion
  4.4 Spatial and Temporal Frequency Channels

5 Conclusions
  5.1 Summary
  5.2 Future Work

References


List of Tables

3.1 Audiovisual Threshold of Each Subject

List of Figures

1.1 Sensory Pathways in the Brain - Adapted from (Calvert, Spence, & Stein, 2004)

2.1 Generating an Audiovisual Stimulus
2.2 Video Processing
2.3 Audio Processing
2.4 Experiment Setup
2.5 Sound Calibration Setup
2.6 Flow Chart of the Experiment Software
2.7 Audiovisual Threshold Algorithm
2.8 Experiment Progress Message Boxes
2.9 DirectShow Filter Graph

3.1 Average Audiovisual Speech Intelligibility Scores as a Function of Auditory SNR when Subjects Gazed Naturally at Video Recordings of a Talker
3.2 Percentage of Time Spent Fixating on Different Facial Regions during a Speech Intelligibility Test under 4 Auditory Conditions
3.3 Comparison of the Percentage of Time Spent Fixating on Different Facial Regions between Correct and Incorrect Responses
3.4 Average Euclidean Gaze Distance between Subjects’ Fixation Points and the Center of the Mouth of the Talker during a Speech Intelligibility Test under 4 Different Auditory Conditions
3.5 Average Horizontal Gaze Distance between Subjects’ Fixation Points and the Center of the Mouth of the Talker during a Speech Intelligibility Test under 4 Different Auditory Conditions
3.6 Scanning Pattern for Subject 4 at the ‘Audiovisual Threshold’ Auditory Condition. The subject used a 4.5° saccade to shift his gaze from approximately 5° from the mouth center at the beginning of the trial to less than 1° at the end of the trial.
3.7 Scanning Pattern for Subject 6 at the ‘No Noise’ Auditory Condition. The subject used a sequence of fixations to look at the talker’s left eye, nose, an area close to the talker’s right eye, and then once again at the nose. Note that during the whole sequence the subject did not fixate within an area of 2.5° from the mouth center.
3.8 Percentage of Trials with the ‘saccades towards mouth’ and ‘other’ Gaze Strategies
3.9 Breakdown of the Gaze Strategy ‘saccades towards mouth’ under 4 Different Auditory Conditions
3.10 Average Speech Intelligibility Scores as a Function of Proximity to the Center of the Mouth at Subjects’ Audiovisual Threshold for a Single Talker
3.11 Composition of Two-Talker Stimuli
3.12 Average Speech Intelligibility Scores as a Function of Proximity to the Center of the Mouth at Subjects’ Audiovisual Threshold for Two Talkers
3.13 Average Euclidean Gaze Distance between Subjects’ Fixation Points and the Center of the Mouth of the Correct ‘Talking Face’ during a Speech Intelligibility Test at the ‘Audiovisual Threshold’ Auditory Condition
3.14 Scanning Pattern for Subject 4 at the ‘Audiovisual Threshold’ Auditory Condition. The subject first fixated at the eyes of the ‘correct talker’, shifted his gaze towards the mouth of the ‘incorrect talker’, and made saccadic eye movements towards the mouth of the ‘correct talker’.
3.15 Average Euclidean Distance between Subjects’ Fixation Points and the Correct Talker’s Mouth Center for Different Types of Gaze Behavior
3.16 Average Speech Intelligibility Scores with a Fixed Point of Gaze when Viewing One or Two Talkers

4.1 Transfer Function of the Gaussian Blur Filter
4.2 Sinusoidal Gratings of 4 Cycles per Degree for Different Image Contrasts
4.3 Grating Visual Acuity - Adapted from (Rovamo, Virsu, Laurinen, & Hyvarinen, 1982)
4.4 Speech Intelligibility Scores as a Function of Low-Pass Filter Cutoff Frequencies - Adapted from (Munhall, Kroos, Jozan, & Vatikiotis-Bateson, 2004)
4.5 Examples of Video Frames Low-Pass Filtered at Different Cutoff Frequencies
4.6 Average Speech Intelligibility Scores as a Function of Low-Pass Filter Cutoff Frequencies
4.7 Average Speech Intelligibility Scores as a Function of Physiological Cutoff Frequency at the ‘Audiovisual Threshold’ Auditory Condition when the Original Signal is Attenuated by 99% at the Cutoff Frequency
4.8 Average Speech Intelligibility Scores as a Function of Physiological Cutoff Frequency at the ‘Audiovisual Threshold’ when the Original Signal is Attenuated by 99.9% at the Cutoff Frequency
4.9 Temporal Filters in the Human Visual System - Adapted from (Hess & Snowden, 1992)

5.1 Average Speech Intelligibility Scores for Younger and Older Adults under a Blurred Condition - Adapted from (Gordon & Allen, 2009)
5.2 Visual Acuity as a Function of Eccentricity - Adapted from (Westheimer, 1979)

Chapter 1

Introduction

Individuals who suffer from hearing impairments have greater difficulty understanding a talker in noisy environments than people with normal hearing. The attenuated signal in the auditory modality can result in a masking of speech information by noise. Hearing aids can be used to help offset the impairment, but they are not able to completely filter out noise, which decreases their usefulness in this context. A different technique could therefore be used to help mitigate the effects of hearing loss. Research has shown that when a sensory channel is impaired, a secondary one can be used to recover the information presented in an environment. Using sinusoidally amplitude-modulated signals, Qian (2009) found that the combination of visual and auditory information can improve signal detection by approximately 2 dB. Studies on audiovisual speech perception showed similar enhancement over unimodal conditions. Under noisy conditions, discerning the talker’s utterances can be difficult due to the degradation of the acoustic signal. However, seeing a talker’s face can improve speech intelligibility, with a gain equivalent to raising the acoustic SNR by as much as 15 dB (Sumby & Pollack, 1954; MacLeod & Summerfield, 1987). By studying gaze strategies and speech intelligibility with normal-hearing subjects, techniques can be devised to enable hearing-impaired individuals to improve their audiovisual speech perception. Ebrahimi and Kunov (1991) showed the possibility of developing such a technique by demonstrating an audiovisual speech enhancement of 35% when subjects wore a visual lipreading aid which encoded different speech signal features (voice pitch, energy of the speech signal).

1.1 Current State of Knowledge Regarding Audiovisual Speech

Combining auditory and visual information can have two different effects. First, vision can aid audiovisual speech perception. The contribution of vision to speech perception was demonstrated with several forms of signal degradation. For instance, Grant and Walden (1996) filtered out some frequency bands of a speech signal and showed that speech intelligibility was not affected when subjects were able to view the talker’s face. In a similar experiment, Boothroyd et al. (1988) encoded an acoustic signal with only the talker’s voice fundamental frequency, making it impossible to decipher the utterances heard; only when this auditory information was combined with the presentation of the talker’s face was an improvement in speech intelligibility observed. In addition to aiding speech perception, the combination of auditory and visual information can produce illusory perceptions, such as the McGurk effect. This illusion consists of perceiving an auditory event that differs from both the auditory and visual components of the stimulus. For instance, an auditory /ba/ dubbed with a visual /ga/ results in perceiving /da/ (McGurk & MacDonald, 1976). This illusory perception has been studied extensively in order to understand different components of audiovisual speech.

1.1.1 Theories Behind Audiovisual Speech

The literature provides a few theories to explain how audiovisual effects, such as audiovisual enhancement and illusory perceptions, occur. These theories can be placed under the category of the ‘common format’ theory, which suggests that auditory and visual information are transformed to a common metric through the mechanism of neural convergence, whereby audiovisual speech is processed by cortical neurons that respond to both auditory and visual speech stimulation (Calvert et al., 2004).

There exist a few versions of the common format theory, but each one claims that modality-specific processing terminates early. Specifically, the auditory and visual information related to speech are converted to a common format prior to the stage where speech segments (consonants and vowels) are used to form words. An example is the version of the common format theory proposed by Summerfield (1987), based on speech articulation dynamics. It suggested that the acoustic speech signal is related to various aspects of lip movements, such as lip openings and closures, rate of lip movements, and direction of lip movements. Audiovisual speech perception would then involve integrating auditory and visual information related to articulatory dynamics (Calvert et al., 2004).

Evidence supporting the common format theory was derived from the McGurk effect. Green et al. (1991) mismatched the talker’s gender between the auditory and visual streams of various stimuli and found that the magnitude of the McGurk effect was similar to that obtained when the talker’s gender was matched. Based on this result, the researchers indicated that the auditory information may have been recoded prior to the occurrence of audiovisual integration. A similar inference was made by Rosenblum et al. (1996b), who found that the McGurk effect could be perceived even when the visual stimulus was reduced to a few moving dots corresponding to facial motion. It appears that subjects were able to combine speech motion information with acoustic information.

1.1.2 Neural Mechanisms Underlying Speech Perception

The results of psychophysical experiments provided some insight as to when auditory and visual information form a single percept. Various neuroimaging studies complemented these experiments by providing more details about which parts of the brain are activated during speech perception. The unimodal pathways are first reviewed before discussing the multisensory interactions.

Auditory Speech Pathway

A few synaptic levels are involved in processing auditory stimuli. The lowest occurs in the core auditory cortex, located on the upper surface of the temporal lobe (area 41). It receives input from the medial geniculate, denoted ‘MG’ in Figure 1.1. The second synaptic level lies in the belt region (area 42), where the input originates from the core, medial geniculate, and medial pulvinar. The next synaptic level consists of a parabelt region (area 22), which surrounds the core and extends over the lateral surface of the superior temporal gyrus; its input comes from the geniculate and medial pulvinar. Finally, the fourth synaptic level is in the superior temporal sulcus, which is considered to be the region where modality-specific information might interact (Calvert et al., 2004).

Several studies showed that speech perception mainly involves areas at the third and fourth synaptic levels. When Binder et al. (2000) presented simpler stimuli, such as tones, they observed that the early areas (core and belt areas) of the auditory pathway were activated. However, the activated areas shifted to the parabelt on the lateral surface of the superior temporal gyrus when subjects listened to speech stimuli. Scott et al. (2000) also found that cortical activation occurred later in the pathway with speech. Specifically, the left superior temporal sulcus was involved when subjects attempted to decipher the words of a sentence.

Visual Speech Pathway

Visual speech information enters the cortex via the primary visual area V1, which is considered to be the first synaptic level of the visual pathway. V1 receives information from the magnocellular and parvocellular layers of the lateral geniculate nucleus. It has a retinotopic mapping, coding mostly fine-grained information. The second synaptic level consists of areas that are connected to V1, such as V2, V4, and V5. While these areas are generally not thought to process visual speech, the study mentioned in Bavelier et al. (2001) suggested that audiovisual speech enhancement might occur at this level. The third and fourth synaptic levels consist of the fusiform, inferior, and middle temporal gyri, which process information about object shape (Grill-Spector, Kourtzi, & Kanwisher, 2001).

Figure 1.1: Sensory Pathways in the Brain - Adapted from (Calvert et al., 2004)

The visual speech pathway is less understood than the auditory speech pathway. The cortical activations involved in visual speech were identified by comparison with those elicited by still faces or nonspeech mouth movements. The results suggested that many different areas were activated, including those responsible for sensory, motion, and face processing (Calvert et al., 2004). In the study of Pekkola et al. (2005), speechreading was found to activate the auditory cortex.

Audiovisual Speech Pathway

The cortical activations during audiovisual speech perception involve the superior temporal gyrus and the posterior superior temporal sulcus (pSTS). The latter region was found to be the main site for audiovisual speech integration (R. Campbell, 2008). For instance, it showed supra-additive activation when the auditory and visual information were matched, while exhibiting a sub-additive response with mismatched inputs (Calvert, Campbell, & Brammer, 2000).

Some studies investigated the time course of cortical activation with scalp-recorded event-related potentials (ERP) and magnetoencephalography (MEG). The auditory and visual information were found to be first processed in the primary sensory cortices, and the superior temporal regions bound the different data streams. The pSTS was then activated and extended to the parietal lobe. Alternatively, it projected back to the primary visual or auditory regions.

In studies involving audiovisual speech perception, the pSTS was found to be mostly sensitive to motion. An example comes from a study conducted by Callan et al. (2004), who compared cortical activations when subjects were presented with different spatially degraded audiovisual stimuli. When the visual component of the stimuli was low-pass filtered, facial details were limited; in this case, only the pSTS was activated. However, both the pSTS and the middle temporal gyrus (MTG) were activated when the visual stream was barely filtered.


1.1.3 Conditions for Audiovisual Speech

Visual Considerations

Past studies suggest that audiovisual speech integration is quite robust to spatial manipulations. For instance, MacDonald et al. (2000) used spatial quantization to degrade the visual component of their stimuli and showed that the McGurk effect could be perceived even at the coarsest level. Similarly, Munhall et al. (2004) degraded images by applying different band-pass and low-pass filters and revealed that the filtered visual information was sufficient to attain a higher speech intelligibility score than that of auditory-only signal presentation. In fact, it was demonstrated that an audiovisual speech enhancement could be observed even when the visual stimulus was reduced to a few points corresponding to facial movements (Rosenblum, Johson, & Saldana, 1996a).

Temporal Considerations

Although audiovisual speech integration is not much affected by spatial manipulations, its occurrence depends on the time difference between the onsets of the auditory and visual signals. Experiments involving audiovisual synchrony showed that a speech signal can be delayed by as much as 250 ms or advanced by 130 ms before the auditory and visual signals are perceived as separate components (Dixon & Spitz, 1980). Pandey et al. (1986) found that subjects could tolerate an auditory delay of 120 ms before a degradation in speech intelligibility performance was observed, and that an auditory delay of 300 ms made the audiovisual condition equivalent to the visual-only one. These results suggest that at normal conversational distances, audiovisual speech integration is not affected by the auditory delay introduced by the difference between the speeds of transmission of light and sound (Calvert et al., 2004).


1.1.4 Signal Characteristics of an Audiovisual Speech Stream

Various characteristics of an audiovisual speech stream have been analyzed to gain a better understanding of audiovisual speech perception. Grant and Seitz (2000) examined the correlation between the envelope of a speech signal and lip movements on a small set of sentences. They found that the correlation ranged between 35% and 52%. By testing a larger set of sentences, Craig et al. (2008) showed that the correlation was approximately 70%. When the set of sentences came from different language databases, speech signals were found to be correlated between 31% and 65% with lip movements, and variations occurred across talkers and sentences. These results were also found to be applicable to longer spoken segments (Chandrasekaran, Trubanova, Stillittano, Caplier, & Ghazanfar, 2009). The correlation between the envelope of a speech signal and mouth opening suggests that the auditory and visual information are redundant. Therefore, if one of the input streams is impaired, the speech signal can be recovered with the other input stream.

The strongest correlations occurred in two frequency bands: below 1 kHz and in the range of 2-3 kHz (Grant & Seitz, 2000; Chandrasekaran et al., 2009). These bands mapped to different formant regions, which contain the frequency components that allow people to distinguish between various speech sounds such as vowels. In addition, the temporal modulations corresponding to these frequency bands were found to be prominent between 2 Hz and 7 Hz.

Another characteristic of audiovisual speech is the timing of lip movements relative to the onset of the voice. Past studies showed that the onset of the visual signal precedes the onset of the auditory signal. Specifically, a temporal window of 100 ms to 300 ms was found in (Chandrasekaran et al., 2009) for bilabial consonants (consonants articulated using both lips), by comparing the time at which a transition occurred from an open to a closed mouth state with the onset of the speech sound. Wassenhove et al. (2005) reported a smaller delay (85 ms to 155 ms) by analyzing electroencephalography recordings with an audiovisual stimulus consisting of a single syllable.

1.2 Purpose of the Current Work

The goal of this thesis is to gain higher-level knowledge regarding audiovisual speech perception. Specifically, it attempts to determine the gaze strategies used to improve speech intelligibility. This objective is achieved by analyzing subjects’ eye movements, which were obtained with a remote point-of-gaze estimation system. Unlike previous studies, where the speech stimuli were simple and consisted of one or two syllables, subjects’ audiovisual speech perception was investigated with spoken sentences.

1.3 Thesis Outline

The thesis is organized as follows. Chapter 2 describes the general methodology used in the experiments, including the procedures involved in generating the stimuli, the experimental setup, and the data analysis techniques. Chapter 3 examines the gaze strategies involved in audiovisual speech enhancement. Chapter 4 then discusses the visual processing underlying audiovisual speech enhancement. Finally, Chapter 5 highlights the results of this thesis and outlines future work.


Chapter 2

General Methodology

This chapter discusses the general methodology used in conducting an experiment. First, the processes involved in generating audiovisual stimuli are explained, where an audiovisual stimulus refers to the presence of both audio and video components. The experimental setup is then described, followed by the procedures involved in data analysis.

2.1 Experiment Stimuli

Low-context SPIN (Speech Perception in Noise) sentences were selected and spoken by a male talker fluent in English. Low-context sentences were chosen to prevent participants from predicting the next word based on previous utterances. They were also selected because they are typically used in speech intelligibility tests, for the following reasons. Their phonetic content is properly balanced, so no words are more susceptible to masking by noise. Their lengths are similar: they consist of five to eight words, which translate to six to eight syllables. Furthermore, the words used in determining the speech intelligibility score are considered to have average word familiarity. This factor is important since it can affect speech intelligibility: low word familiarity can decrease performance, while high word familiarity improves it (Kalikow & Stevens, 1977).


The steps involved in generating an experiment stimulus are illustrated in Figure 2.1. Each sentence was recorded using an HD camera (Panasonic HDC-TM20) in the standard AVC format at 1920x1080. Another resolution was available for recording; however, it was discarded for two reasons. First, its angular resolution was lower, so fine details would have been lost due to the reduction of high-frequency components. Second, the native size of the imaging sensor was 1920x1080, so the lower resolution would have resulted in internal processing of the video by the camera, and undesired video manipulations might have occurred.

[Figure 2.1 flow: video recording of a SPIN sentence → separate video and audio streams → video processing and audio processing → audiovisual stimulus]

Figure 2.1: Generating an Audiovisual Stimulus

For each recording, the audio and video streams were extracted using FFmpeg. This command line tool comes from an open source project and provides an interface to the libavcodec library, which contains the code necessary for encoding and decoding video under various compression schemes. Once the audio and video streams were extracted, further processing was applied to them, as described in subsections 2.1.1 and 2.1.2. Finally, the streams were recombined with FFmpeg to form an audiovisual stimulus, which was stored in the AVI container format.

2.1.1 Video Processing

Figure 2.2 shows the details related to video processing. First, the extracted video stream was uncompressed to prepare it for resizing and processing. The resolution of the display used in the experimental setup was set to its native resolution of 1280x1024 to prevent interpolation and blurring. The resolution of the resized video was selected to be 1280x720 to ensure a one-to-one mapping between each pixel of a video frame and the display. Automatic resizing during video playback was thus avoided while preserving the aspect ratio of the original recordings. Furthermore, the selected resolution was close to what a human eye can resolve. Normal visual acuity is defined as the ability to resolve a spatial pattern separated by one minute of arc, which corresponds to 1/60 = 0.0167 degrees (Davson, 1990). The display subtended 32 degrees horizontally and 25 degrees vertically of subjects’ field of view. The horizontal resolution thus mapped to 32/1280 = 0.025 degrees per pixel, while the vertical one mapped to 25/1024 = 0.024 degrees per pixel. Ideally, the resolution would have been higher in order to be closer to normal visual acuity; however, it was limited to the native resolution of the display.

Figure 2.2: Video Processing
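The acuity comparison above can be restated in a few lines (a sketch using only the values quoted in the text):

```python
# Angular size of one display pixel, from the field of view given in the text.
H_FOV_DEG, V_FOV_DEG = 32, 25      # display extent at a 66 cm viewing distance
H_PIX, V_PIX = 1280, 1024          # native display resolution

deg_per_px_h = H_FOV_DEG / H_PIX   # = 0.025 degrees per pixel
deg_per_px_v = V_FOV_DEG / V_PIX   # ~ 0.024 degrees per pixel
acuity_deg = 1 / 60                # normal visual acuity: one minute of arc

# Each pixel subtends roughly 1.5x the finest resolvable detail, which is why
# a higher resolution would have been preferable but was not available.
ratio = deg_per_px_h / acuity_deg  # ~ 1.5
```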

Once a video stream was resized, it was ready for further processing, which involved the application of a blurring filter. This step was optional, since not all stimuli contained blurred video. The video was then ready to be combined with the audio stream.

2.1.2 Audio Processing

The audio stream was modified by following the steps shown in Figure 2.3. Each speech signal was normalized using another audio file as a reference point. This audio file contained white noise lasting as long as the longest speech signal. The normalized signal was then combined with white noise to produce a new audio stream. The desired SNR level was obtained by varying the speech signal level while maintaining the noise level.

Figure 2.3: Audio Processing
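As a sketch of this step (assuming an RMS-based level measurement, which the thesis does not specify), the speech can be scaled against a fixed noise floor to reach a target SNR:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale the speech against a fixed noise level to reach a target SNR."""
    rms = lambda x: np.sqrt(np.mean(x ** 2))
    # Gain that makes 20*log10(rms(scaled speech) / rms(noise)) equal snr_db.
    gain = (rms(noise) / rms(speech)) * 10 ** (snr_db / 20)
    # The reference noise lasts at least as long as the speech, so trim it.
    return gain * speech + noise[: len(speech)]
```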

2.2 Experimental Setup

Figure 2.4 provides an overview of the experimental setup. Subjects were seated at a distance of 66 cm from a display and interacted with an experiment program. The audio stream was fed through headphones, while the video stream was presented on a display which subtended 32 degrees horizontally and 25 degrees vertically of subjects’ field of view. Gaze positions were monitored using a remote, non-contact point-of-gaze estimation system.

Figure 2.4: Experiment Setup


2.2.1 Subjects

One non-naïve and eleven naïve adult subjects from Toronto were included in the study. Their ages ranged from 20 to 25 years, except for the non-naïve subject. All participants were fluent in English. Furthermore, they had self-reported normal hearing and normal or corrected-to-normal visual acuity. Prior to the commencement of the experiment, they read and signed a consent form.

2.2.2 Hardware Apparatus

The experiment interface was provided via a PC sufficiently fast to decode HD videos. The computer was an AMD Athlon 64 X2 dual-core 4400+ (2.22 GHz) with 1.50 GiB of RAM. The video output was produced by an Asus EAX300 SE graphics card and presented on a 19-inch ViewSonic LCD. The audio output was fed through wide frequency-response headphones (AKG K301xtra).

A remote, non-contact point-of-gaze estimation system consisting of 2 cameras and 4 infrared light sources (Guestrin & Eizenman, 2006; Guestrin & Eizenman, 2008) was used to track subjects’ eye movements at a rate of 30 Hz. It allowed free head movements in a volume of 15x15x15 cm. It extracted subjects’ eye features, such as the pupil centre and corneal reflections, from video images and used them to estimate gaze position to within 1 degree. The corneal reflections were created by the infrared light sources illuminating the eyes.

2.2.3 Room Conditions

All the experiments were conducted in the room where the point-of-gaze estimation system was available. The ambient noise, which came from computer fans, was measured to be around 50 dB SPL.


2.2.4 Calibration

Sound Level

Prior to running an experiment, the sound level was calibrated to 75 dB SPL using the following procedure. A video file consisting of only black frames for the video stream and white noise as the audio signal was played in the same manner as in each experiment. As illustrated in Figure 2.5, the sound level was measured with a sound level meter (Larson Davis LxT) through a signal conditioner with an artificial ear (G.R.A.S. Type 43AG). The sound level was set to the desired level by adjusting the volume.

Figure 2.5: Sound Calibration Setup

Eye Tracker

A one-point calibration routine was executed to estimate the visual axes of the left and right eyes. A bright flashing stimulus was presented on a dark uniform background at the center of the screen. Once the calibration procedure was complete, subjects were asked to look at various points. Successful calibration was ensured by verifying that these points corresponded to subjects’ gaze positions. If the difference was significant, the calibration process was repeated until gaze positions were obtained within the accuracy of the point-of-gaze estimation system (1 degree).

2.2.5 Experiment Software

The experiment software was implemented in C++ using DirectShow, which provided the best framework and interface for video playback; the programming language was chosen based on what DirectShow supported best. Figure 2.6 shows the flow chart used to implement an experiment. Upon program start-up, a training session was completed, followed by the assessment of the subject’s audiovisual threshold. The presentation order of the test conditions was then determined, and a dialog box appeared on the display to mark the beginning of the experiment. A stimulus file was randomly selected and played, during which subjects viewed and listened to the talker. After video playback, subjects were prompted with a dialog box in which they reported the utterances heard. This process was repeated until the number of trials required to test a condition was completed. If there were additional conditions to be tested, subjects were instructed to take a short break before moving on to the next stage of the experiment. They then viewed their progress and completed the next set of trials. Further details on the algorithm steps are provided in the subsequent subsections.

Training Session

A training session was provided at the beginning of an experiment to introduce subjects to the experimental stimuli and tasks. It comprised 5 trials, where each trial consisted of viewing and listening to a video sequence with no noise. This acoustic condition was chosen to ensure that subjects became familiar with the low-context nature of the sentences. The first frame of each video sequence was paused for about 2 seconds to ensure that participants were prepared to complete each trial. After the presentation of a stimulus, subjects were prompted with a dialog box in which they reported the utterances heard.


[Figure 2.6 flow: start → show experiment instructions → complete training session → get audiovisual threshold → determine presentation order of test conditions → show experiment progress → choose stimulus file → play stimulus file and collect gaze positions → get subject’s response; repeat until the end of the set, then until the end of the experiment → end]

Figure 2.6: Flow Chart of the Experiment Software


Audiovisual Threshold

Upon completion of the training session, the subject’s audiovisual threshold was obtained. The audiovisual threshold corresponds to the point at which participants attain a speech intelligibility score of 50% when viewing and listening to audiovisual stimuli. The algorithm used is shown in Figure 2.7 and is comparable to the one presented by MacLeod and Summerfield (1990). Participants were presented with 29 video sequences, where the desired auditory SNR level was achieved by varying the speech signal level while maintaining the noise level. The first stimulus was presented at an SNR of -20 dB, at which it is not possible to detect the utterances correctly. The SNR was then increased in 2 dB steps, and the sentence was repeated until its last word was correctly identified; the SNR was then reduced by 2 dB. The next set of 2 sentences was presented once. If the last word of one of these sentences was correctly identified, the SNR was reduced by 2 dB; otherwise, it was increased by 2 dB. This procedure was repeated for the remaining sets of sentences. The audiovisual threshold was determined by taking the average of the SNRs used in presenting the last 10 sets of sentences; the SNRs of the first 5 sets were not included, to allow the procedure to converge to the audiovisual threshold.
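A minimal sketch of this adaptive procedure is given below; the set and trial bookkeeping is simplified, and last_word_correct is a hypothetical callback standing in for presenting a sentence at a given SNR and scoring the subject’s response:

```python
def audiovisual_threshold(last_word_correct, n_sets=15, step_db=2):
    """Staircase over sets of 2 sentences; returns the mean SNR of the last
    10 sets (the first 5 sets are discarded as convergence trials)."""
    snr = -20                                  # first stimulus: unintelligible
    while not last_word_correct(snr):
        snr += step_db                         # raise the SNR until the last
    snr -= step_db                             # word is heard, then back off
    set_snrs = []
    for _ in range(n_sets):
        set_snrs.append(snr)
        responses = [last_word_correct(snr) for _ in range(2)]  # 2 sentences
        snr += -step_db if any(responses) else step_db
    return sum(set_snrs[-10:]) / 10
```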

Experiment Progress

While performing an experiment, subjects were made aware of their progress. For instance, a message box appeared prior to the commencement of each new set of trials to indicate which set would be executed. An additional message box was used to show which trial was being completed within a particular set. An example of the aforementioned message boxes is provided in Figure 2.8.


[Figure 2.7 flow: present the first stimulus; while the response is incorrect, increase the SNR by 2 dB and repeat; once correct, decrease the SNR by 2 dB; then, for each subsequent set of 2 sentences, decrease the SNR by 2 dB after a correct response and increase it by 2 dB otherwise]

Figure 2.7: Audiovisual Threshold Algorithm

Figure 2.8: Experiment Progress Message Boxes. (a) Set Progress; (b) Trial Progress.

Experiment Instructions

Once a set of trials was completed, a message box was displayed with experiment-related instructions. For instance, it suggested that subjects take a short break before moving on to the next set of trials. It could also instruct them to look at a particular location on the display during video playback.

Stimulus File Selection

A total of 200 low-context SPIN sentences were available from the lists provided in (Kalikow & Stevens, 1977). A stimulus file was randomly selected from this pool and was marked as used once it had been viewed. A check was performed when the next file was chosen to ensure that it was not being repeated; if it had already been viewed, another file was selected.
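The same behavior can be sketched as a re-draw over a set of used files (equivalently, the pool could be shuffled once up front; the names here are illustrative):

```python
import random

def next_stimulus(pool, used):
    """Pick a random sentence file that has not been presented yet."""
    candidate = random.choice(pool)
    while candidate in used:       # re-draw if the file was already viewed
        candidate = random.choice(pool)
    used.add(candidate)            # mark as used for subsequent checks
    return candidate
```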

Video Playback

A few steps were required to play a video sequence. First, a window on which the video was to be played was created. The width and height of the window were chosen to match the display resolution of the experimental setup, since the video sequence spanned the whole display.

[Figure 2.9 graph: File Source (stimulus file) → AVI Splitter → DivX Decoder (video decoder) → VMR7 Mixing Mode (video renderer), and AVI Splitter → AC3 Filter (audio decoder) → Default DirectSound Device (audio renderer)]

Figure 2.9: DirectShow Filter Graph

After creating a video window, a DirectShow filter graph was constructed programmatically, as shown in Figure 2.9, in order to render the video file. The ‘File Source’ filter loaded a stimulus file, which was passed through the ‘AVI Splitter’ filter to separate the audio and video streams. The video stream was decoded by the DivX decoder, which could decode both Xvid and DivX videos, since both are implementations of the MPEG-4 Advanced Simple Profile standard. The video renderer then took the decoded video and put it in a format ready to be displayed. A similar procedure occurred for the audio stream. Note that the ‘VMR7 Mixing Mode’ filter was selected instead of the default ‘VMR7 Non-Mixing Mode’ in order to maintain the original aspect ratio of the video; it also simplified the display of overlaid graphics. For some test conditions, it was necessary to place an image at a specific location on top of each video frame. The image overlay was performed after rendering the DirectShow filter graph. First, a bitmap image was created which consisted of a black fixation cross with a width and height of 48x48 pixels; alternatively, it consisted of a black filled rectangle with dimensions of 1280x300 pixels. Upon its creation, the image was placed at the desired location. For the first image choice, the color key was set such that the portion of the bitmap which did not contain the cross was transparent, in order to avoid obstructing the area around the fixation cross. Once the DirectShow filter graph was generated, it was paused for 2 seconds before being played, to allow sufficient time for participants to prepare to complete a trial.

Gaze Positions

During video playback, subjects’ gaze positions were collected as follows. First, the text file in which the eye tracking data were to be saved was opened. Then, the time at which the video started playing was queried with a precision of one millisecond. While the video was playing, the current time was obtained, and the eye tracker system was polled to get the subject’s gaze locations. The video playback start time was subtracted from the current time, and this time difference was saved into the text file along with the corresponding gaze locations. This process was repeated until the end of video playback.
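In outline, the logging loop looks like the following sketch, where poll_gaze and video_is_playing are hypothetical stand-ins for the eye tracker and DirectShow interfaces:

```python
import time

def log_gaze(poll_gaze, video_is_playing, path="gaze_log.txt"):
    """Timestamp each gaze sample relative to the start of video playback."""
    start = time.monotonic()                  # video playback start time
    with open(path, "w") as log:
        while video_is_playing():
            x1, y1, x2, y2 = poll_gaze()      # estimates from both cameras
            elapsed_ms = (time.monotonic() - start) * 1000.0
            log.write(f"{elapsed_ms:.0f}\t{x1}\t{y1}\t{x2}\t{y2}\n")
```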


Subject’s Response

Once a stimulus had been viewed, participants were asked to report what they heard. They recorded their responses in a dialog box like the one presented in Figure 2.8b. Each data entry was then saved into a text file, with each new entry being appended to the previous one.

2.3 Data Analysis

2.3.1 Speech Intelligibility Score

A standard procedure was used to measure speech intelligibility. In the case of SPIN sentences, the target word is at the end (Kalikow & Stevens, 1977). Therefore, only the last word of each sentence counted towards calculating subjects’ speech intelligibility score. If the last word reported was not spelled correctly but was phonetically identical to the word being compared, it was considered a correct response; for instance, ‘tied’ is phonetically equivalent to ‘tide’. In addition, each letter of the last word reported was checked to verify the possibility that it was written with its perceptual equivalent; if so, it was counted as correct. A list of groups of phonemes which are perceptually equivalent was found in (Iverson, Bernstein, & Edward, 1998). These groups are referred to as ‘phonemic equivalence classes’ and include: {m, b, p}, {f, v}, {d, t}, {s, z}, {l, n}, and {k, g}. Considering the first class ({m, b, p}), the words ‘rome’, ‘robe’, and ‘rope’ were predicted to be visually perceived in the same manner; thus, reporting the word ‘rope’ as any of the 3 aforementioned alternatives was considered a valid response.
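A sketch of the letter-substitution part of this check is shown below; the phonetic-spelling comparison (e.g. ‘tied’/‘tide’) would require a pronunciation dictionary and is omitted:

```python
# Phonemic equivalence classes from (Iverson, Bernstein, & Edward, 1998).
CLASSES = [set("mbp"), set("fv"), set("dt"), set("sz"), set("ln"), set("kg")]

def equivalent(a, b):
    """True if two letters are identical or share an equivalence class."""
    return a == b or any(a in c and b in c for c in CLASSES)

def last_word_matches(response, target):
    """Accept the response if every letter matches the target up to phonemic
    equivalence, e.g. 'robe' and 'rome' both score as correct for 'rope'."""
    r, t = response.lower(), target.lower()
    return len(r) == len(t) and all(equivalent(a, b) for a, b in zip(r, t))
```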


2.3.2 Eye Movements

When gathering details about subjects’ eye movements, the experiment software saved the following information in a text file: the time in milliseconds and the gaze positions along the x and y axes from both cameras 1 and 2. Gaze positions were provided in display coordinates, so the first step involved mapping these data to video coordinates, to determine which parts of the talker’s face subjects focused on to maximize their speech intelligibility. Along the x-axis, no mapping was needed, since the display spanned the same number of pixels as the video. However, mapping was required along the y-axis: when a stimulus was presented, it was centered on the display, with black borders along the top and bottom edges, since the video height was less than the display height. Mapping between display and video coordinates therefore involved subtracting half of the total border height: y_video = y_display − (1024 − 720)/2 = y_display − 152.

Once gaze positions were mapped to video coordinates, inconsistent data were discarded. The eye tracker system had the capability of providing 2 estimates of the gaze position at any time instance, with an accuracy of 1 degree. Therefore, if these estimates differed by more than the accuracy, they were discarded. Alternatively, if only 1 estimate was obtained, it was not considered in the data analysis. At most, 15% of all the experimental data were discarded.
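A sketch of the mapping and the consistency check follows; the 1-degree tolerance is converted to pixels with the 0.025 degrees-per-pixel figure from section 2.1.1, and the function names are illustrative:

```python
import math

DEG_PER_PX = 32 / 1280           # display: 32 degrees across 1280 pixels
Y_OFFSET = (1024 - 720) // 2     # black border above the 1280x720 video

def to_video_coords(x_disp, y_disp):
    """Map a display-coordinate gaze sample into video coordinates."""
    return x_disp, y_disp - Y_OFFSET

def merge_estimates(p1, p2, tol_deg=1.0):
    """Average the two cameras' estimates if they agree to within 1 degree;
    otherwise, or if an estimate is missing, discard the sample (None)."""
    if p1 is None or p2 is None:
        return None
    if math.dist(p1, p2) * DEG_PER_PX > tol_deg:
        return None
    return (p1[0] + p2[0]) / 2, (p1[1] + p2[1]) / 2
```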

A separate file was created for data analysis. This file contained the time and the corresponding gaze distance from the center of the talker’s mouth for all the trials within a particular test condition. The distance was calculated for each time interval of 33 milliseconds, since the point-of-gaze estimation system provided data at this rate. For each time interval, the average gaze distance from the center of the talker’s mouth was calculated along with its standard deviation, in order to determine subjects’ general trend. The x-coordinate of the talker’s mouth center was obtained by taking the average of the x-coordinates of the left and right lip corners, while the y-coordinate was derived from the average of the y-coordinates of the upper and lower lips.
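For instance (a sketch; the lip-landmark coordinates are assumed to be annotated per video frame):

```python
import math

def mouth_center(left_corner, right_corner, upper_lip, lower_lip):
    """Mouth center: mean x of the lip corners, mean y of the two lips."""
    cx = (left_corner[0] + right_corner[0]) / 2
    cy = (upper_lip[1] + lower_lip[1]) / 2
    return cx, cy

def gaze_distance_deg(gaze_xy, center_xy, deg_per_px=32 / 1280):
    """Euclidean gaze-to-mouth distance, converted from pixels to degrees."""
    return math.dist(gaze_xy, center_xy) * deg_per_px
```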


Chapter 3

Gaze Patterns and Audiovisual Speech Enhancement

Lansing and McConkie (1994) were the first to demonstrate the feasibility of recording eye movements to study speechreading. Shifts in gaze patterns were observed, but no further analysis was made. Subsequent studies attempted to establish a relationship between gaze behavior and audiovisual speech perception. For instance, Everdell et al. (2007) examined the gaze fixation asymmetry of their subjects during audiovisual speech perception but found no correlation with speech intelligibility. Other studies (Buchan, Pare, & Munhall, 2007; Lansing & McConkie, 2003) found that subjects prefer fixating on the mouth and on the nose during spoken sentence perception, in addition to spending a considerable amount of time fixating on other regions of the face. Similar observations were made when the task was extended to listening to and viewing video recordings of monologues under different SNRs (signal-to-noise ratios). Vatikiotis-Bateson et al. (1998) observed a tendency to gaze primarily at the eyes and at the mouth. With increasing noise levels, the number of mouth fixations increased, but it represented only about half of all fixations even at the highest noise level. The results were replicated when the talker identity was varied (Buchan, Pare, & Munhall, 2008).


The above studies were qualitative in nature; they attempted to establish a rela-

tionship between gaze behavior and audiovisual speech perception by comparing the

percentage of time spent in different parts of the face under different auditory condi-

tions. However, further details were not provided. This chapter presents an experiment

which examines the natural scanning behavior during a speech intelligibility task and

complements the descriptions provided in previous studies. Furthermore, it investigates

the changes in speech intelligibility as a function of the proximity of gaze fixations to the

visual cues that contribute to audiovisual speech enhancement. This study is carried out

for the cases where subjects view either single or multiple talkers.

3.1 Speech Intelligibility with Natural Viewing (Single Talker)

3.1.1 Stimuli

The audiovisual stimuli were created using the procedure described in section 2.1. Two

different types of stimuli were generated. The first type of stimuli included 1 set of video

sequences of the low-context SPIN (Speech Perception In Noise) sentences without noise.

The second type of stimuli consisted of 27 sets of video sequences of the low-context SPIN

sentences with white Gaussian noise introduced to the speech signal, where the auditory

SNRs ranged from -20 dB to +6 dB.
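As an illustration of how such stimuli can be produced, the following sketch mixes white Gaussian noise into a speech waveform at a prescribed SNR (assuming the SNR is defined from the average powers of speech and noise; this is not the original stimulus-generation code):

import numpy as np

def add_noise_at_snr(speech, snr_db, rng=None):
    # Scale white Gaussian noise so that 10*log10(P_speech / P_noise) = snr_db.
    rng = np.random.default_rng() if rng is None else rng
    speech_power = np.mean(speech ** 2)
    noise_power = speech_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=speech.shape)
    return speech + noise

# One noisy version per auditory SNR from -20 dB to +6 dB (27 values):
# noisy_sets = {snr: add_noise_at_snr(speech, snr) for snr in range(-20, 7)}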

3.1.2 Experiment Procedure

Subjects were seated at a distance of 66 cm from a 19-inch LCD display. Once the point-

of-gaze estimation system was calibrated, subjects were asked to look at various points.

A successful calibration was confirmed by verifying that these points corresponded to

subjects’ gaze positions within the accuracy of the point-of-gaze estimation system (1◦).


Subjects began the experiment by completing a training phase in order to become

familiar with the experimental stimuli and tasks. The training phase comprised of 5

trials, where each trial involved listening and viewing naturally a video recording under

the ‘no noise’ auditory condition. After the presentation of a stimulus, subjects were

prompted with a dialog box in which they were asked to report all the words heard.

After the training session, the subject’s audiovisual threshold was estimated by fol-

lowing the procedure described in section 2.2.5. The audiovisual threshold was calculated in

order to determine the auditory SNRs for presenting the experimental stimuli. Once the

audiovisual threshold was calculated, subjects took a small break. They were then pre-

sented with 4 sets of 30 trials, where each set tested subjects’ speech intelligibility under

one of the following auditory conditions: no noise, audiovisual threshold + 5 dB, audio-

visual threshold, and audiovisual threshold - 5 dB. The first 5 trials from each set were

used as practice trials to allow subjects to get used to each test condition. The presenta-

tion order of conditions and stimuli was randomized, and no stimulus was reused during

the entire experimental session. Subjects were free to gaze anywhere on the display, and

their gaze positions were recorded during the presentation of each stimulus.

3.1.3 Audiovisual Threshold and Speech Intelligibility

As shown in Table 3.1, the audiovisual threshold of each subject varied between -9 dB and

-15 dB. The audiovisual threshold was defined to correspond to a speech intelligibility

score of about 50%. From Figure 3.1, it can be observed that subjects almost achieved

this performance on average by attaining a speech intelligibility score of 43%. Under the

‘no noise’ condition, the mean percentage of correct responses was 98%, and it decreased

monotonically with decreased auditory SNR. The lowest value was 22% and was obtained

when the auditory SNR was 5 dB below the audiovisual threshold of each subject. A

one-way analysis of variance (ANOVA) showed a statistically significant difference in

performance (F (3, 40) = 152.76, p = 0) as a function of auditory SNR. A one-tail t-test


showed statistically significant differences in speech intelligibility scores for each set of

two auditory conditions (p < 10⁻⁵).
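Tests of this kind can be reproduced with standard statistical routines; the sketch below uses scipy (illustrative only: the score arrays are placeholders for the per-subject percentages of correct words, and the one-tailed p-value is obtained by halving the two-tailed one):

import numpy as np
from scipy import stats

# Placeholder scores: one % correct per subject (11 subjects) per condition;
# in the actual analysis these come from the experiment.
rng = np.random.default_rng(0)
scores = [rng.uniform(0, 100, size=11) for _ in range(4)]

# One-way ANOVA across the four auditory conditions:
# 4 groups of 11 observations give F(3, 40).
F, p = stats.f_oneway(*scores)

# Paired t-test between two conditions; halve p for a one-tailed test.
t, p_two = stats.ttest_rel(scores[2], scores[3])
p_one = p_two / 2.0 if t > 0 else 1.0 - p_two / 2.0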

Table 3.1: Audiovisual Threshold of Each Subject

Subject    Audiovisual Threshold (SNR)
   1       -12 dB
   2       -15 dB
   3       -12 dB
   4       -15 dB
   5       -10 dB
   6        -9 dB
   7       -13 dB
   8       -13 dB
   9       -10 dB
  10       -11 dB
  11       -14 dB

3.1.4 Percentage of Time Spent Fixating on Different Facial Regions

The analyses performed in past studies on gaze behavior and audiovisual speech percep-

tion separated the eye movement data into different parts of the face. They showed a

tendency to look more at the mouth or at the nose with increasing noise levels. However,

this observation could not be made with the data obtained from the experiment described

earlier. Figure 3.2 shows the average gaze distribution on different facial regions when

subjects performed a speech intelligibility test under 4 auditory conditions. The facial

regions considered were the mouth, nose, and eyes. If the eye movement data fell into

other parts of the talker's face, they were placed into the 'other' category.

Figure 3.1: Average Audiovisual Speech Intelligibility Scores as a Function of Auditory SNR when Subjects Gazed Naturally at Video Recordings of a Talker (y-axis: % correct words; x-axis: conditions tested).

A one-way analysis of variance (ANOVA) showed that subjects preferred to look at one or more regions of the talker's face except when the auditory SNR was set to 5 dB below subjects' audiovisual threshold (F (3, 36) = 13.14, p < 10⁻⁵ for 'no noise'; F (3, 40) = 3.57, p < 0.05 for 'audiovisual threshold + 5 dB'; and F (3, 40) = 3.68, p < 0.05 for 'audiovisual threshold'). Overall, subjects preferred to look at the eyes.

Under the ‘no noise’ condition, the percentage of time spent fixating at the eyes was 50%,

and it decreased to a minimum of 30% when noise was introduced to the speech signal.

The decrease in the percentage of fixations on the eyes was replaced with more fixations at

the nose and at the mouth. Specifically, the percentage of mouth fixations rose from 13%

to a value ranging between 24% and 33%. Similarly, the percentage of nose fixations rose

from 17% to about 23%. Although there were more mouth fixations in the presence of

noise, it could not be concluded that the number of mouth fixations increased with lower

auditory SNRs, unlike the observations made in (Vatikiotis-Bateson et al., 1998). In fact,


the percentage of time spent gazing at the mouth decreased from 33% to 25% when the

auditory SNR was changed from ‘audiovisual threshold’ to ‘audiovisual threshold - 5 dB’.

Figure 3.2: Percentage of Time Spent Fixating on Different Facial Regions during a Speech Intelligibility Test under 4 Auditory Conditions (panels (a)-(d): No Noise, Audiovisual Threshold + 5 dB, Audiovisual Threshold, Audiovisual Threshold - 5 dB; each panel plots the % of time spent within each facial region: mouth, nose, eyes, other).

Figure 3.3 provides a comparison of the average gaze distribution between reporting

correct and incorrect responses. A two-tail t-test showed no statistically significant differences in gaze

distribution for all the auditory conditions tested (p > 0.6 for ‘mouth’, p > 0.7 for ‘nose’,

p > 0.5 for ‘eyes’, and p > 0.4 for ‘other’). Therefore, reporting an incorrect answer was

not caused by looking at a wrong part of the talker’s face.

Figure 3.3: Comparison of the Percentage of Time Spent Fixating on Different Facial Regions between Correct and Incorrect Responses (panels (a)-(d): No Noise, Audiovisual Threshold + 5 dB, Audiovisual Threshold, Audiovisual Threshold - 5 dB; each panel plots the % of time spent within each facial region for correct and incorrect responses).

The standard deviations of the average gaze distributions shown in Figures 3.2 and 3.3 ranged between 6% and 28%. The large values were caused by subjects having their own preferred fixation points. For instance, some subjects spent most of their time looking at the eyes and at the nose while others focused on the nose and mouth.

3.1.5 Gaze Patterns with Respect to Time

The following observations were made when the eye movement data were analyzed by sep-

arating them into different facial regions. First, the number of nose and mouth fixations

increased when noise was introduced to the speech signal. Second, the gaze distributions

were similar when comparing them between correct and incorrect responses. However, it is not clear how subjects' gaze varied with respect to time. To determine this relationship, the eye movement data were analyzed in a different manner. For each auditory condition, the mean distance between subjects' fixation points and the center of the mouth of the talker was plotted against time, as illustrated in Figure 3.4.

Figure 3.4: Average Euclidean Gaze Distance between Subjects' Fixation Points and the Center of the Mouth of the Talker during a Speech Intelligibility Test under 4 Different Auditory Conditions (panels (a)-(d): No Noise, Audiovisual Threshold + 5 dB, Audiovisual Threshold, Audiovisual Threshold - 5 dB; each panel plots the average gaze distance from the mouth center in degrees against time in seconds).

The

graphs show a tendency to move closer to the center of the mouth towards the end of

each sentence. Under the ‘no noise’ condition, subjects started an experimental trial by

placing their gaze 4.3◦ away from the center of the mouth and finished it by fixating approximately 3◦ away from it. A similar scanning pattern was observed for the noisy conditions, except that the final fixation point was closer to the mouth: subjects brought their gaze to within 2◦ of the center of the mouth when noise was introduced to the speech signals. It seems that subjects felt the urge to move their gaze closer to the mouth in order to better decipher the utterances. It is noted that subjects moved their gaze mostly along the vertical axis; the eye movements along the horizontal axis varied by at most 1◦, as shown in Figure 3.5.

Figure 3.5: Average Horizontal Gaze Distance between Subjects' Fixation Points and the Center of the Mouth of the Talker during a Speech Intelligibility Test under 4 Different Auditory Conditions (panels (a)-(d): No Noise, Audiovisual Threshold + 5 dB, Audiovisual Threshold, Audiovisual Threshold - 5 dB; each panel plots the average horizontal gaze distance from the mouth center in degrees against time in seconds).

3.1.6 Natural Gaze Strategies

Figure 3.6: Scanning Pattern for Subject 4 at the 'Audiovisual Threshold' Auditory Condition. The subject used a 4.5◦ saccade to shift his gaze from approximately 5◦ from the mouth center at the beginning of the trial to less than 1◦ at the end of the trial (Euclidean distance from the mouth center in degrees against time in seconds, with the START and END of the trial marked).

The graphs that were shown in the previous section were derived from a combination

of two types of gaze strategies. In the first strategy, subjects used saccadic eye movements

to shift their gaze from an initial starting point directly to a region that was within 2.5◦

of the center of the mouth. This gaze strategy is denoted as ‘saccades towards mouth’

strategy and includes trials where subjects fixated within 2.5◦ of the center of the mouth

during the time that the target word (the last word of each sentence) was heard. A

distance of 2.5◦ from the center of the mouth was selected so that 98% of all trials that

used this gaze strategy would be captured. An example of the ‘saccades towards mouth’

strategy is provided in Figure 3.6. As an alternative to this gaze strategy, subjects used

either prolonged fixations on a particular feature of the talker’s face (eyes, nose, etc.)

or a sequence of fixations and saccades to look at different features of the talker's face

without fixating within 2.5◦ of the center of the mouth. This gaze strategy is denoted as


the ‘other’ gaze strategy and is illustrated in Figure 3.7. When subjects used the gaze

strategy ‘saccades towards mouth’, the mean distance from the center of the mouth was

1.2◦ with a standard deviation of 0.7◦. When subjects used the ‘other’ gaze strategy, the

mean distance from the center of the mouth was 4.3◦ with a standard deviation of 1.1◦.
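To make the classification of trials concrete, one reading of the criterion above can be sketched as follows (illustrative, not the original analysis code; gaze distances are assumed to be expressed in degrees relative to the mouth center, and the target-word interval as a pair of sample indices):

import numpy as np

def is_saccades_towards_mouth(dist_deg, target_start, target_end,
                              radius_deg=2.5):
    # A trial counts as 'saccades towards mouth' if the (valid) gaze
    # samples stayed within radius_deg of the mouth center while the
    # target word -- the last word of the sentence -- was heard.
    during_target = dist_deg[target_start:target_end]
    valid = during_target[~np.isnan(during_target)]
    return valid.size > 0 and bool(np.all(valid <= radius_deg))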

Figure 3.7: Scanning Pattern for Subject 6 at the 'No Noise' Auditory Condition. The subject used a sequence of fixations to look at the talker's left eye, nose, an area close to the talker's right eye, and then once again at the nose. Note that during the whole sequence the subject did not fixate within an area of 2.5◦ from the mouth center (Euclidean distance from the mouth center in degrees against time in seconds).

Figure 3.8 shows the percentage of trials for which one of the above two gaze strategies

was used. A one-way analysis of variance (ANOVA) showed statistically significant dif-

ferences in the percentage of trials that each gaze strategy was used as a function of audi-

tory SNR (F (3, 40) = 4.73, p = 0.0065 for ‘saccades towards mouth’ and F (3, 40) = 4.59,

p = 0.0075 for ‘other’). A two-tail t-test showed that the gaze strategies during the ‘no

noise’ condition were significantly different from that of any other auditory condition

(p < 0.005). In the ‘no-noise’ condition, the two strategies were used approximately

equally, but when noise was added to the speech signal, the percentage of trials involving

the gaze strategy ‘saccades towards mouth’ increased to about 80%. The large standard

deviations in the percentage of trials in Figure 3.8 resulted from large inter-subject variations, as subjects had individual preferences for one of the two aforementioned gaze strategies.

Figure 3.8: Percentage of Trials with the 'saccades towards mouth' and 'other' Gaze Strategies (y-axis: % of trials; x-axis: auditory condition, from 'no noise' to 'audiovisual threshold - 5 dB').

A two-tail t-test showed no significant differences between the mean percentages of

correct responses for the two different gaze strategies. When subjects used the 'saccades towards mouth' gaze strategy, the scores were 98%, 69%, 40%, and 19% for the auditory conditions of 'no noise', 'audiovisual threshold + 5 dB', 'audiovisual threshold', and 'audiovisual threshold - 5 dB', respectively. When subjects used the 'other' gaze strategy, the scores were 100%, 73%, 50%, and 24% for the same auditory conditions.

Subjects used different approaches for the ‘saccades towards mouth’ gaze strategy.

Figure 3.9 shows the breakdown for each of these approaches. For some of the trials,

subjects used the ‘mouth fixations’ approach where they maintained their fixation points

within 2.5◦ of the center of the mouth from the onset of the speech signal. They probably

made saccadic eye movements towards the mouth when the stimuli were paused, but this

possibility could not be confirmed since subjects’ gaze was only recorded after the pause.


For some of the other trials, the initial gaze position was close to the region of interest,

and subjects moved their gaze gradually towards the center of the mouth. This approach

is denoted as ‘gradual movements’. In the remaining trials, subjects made one or more

saccades to bring their gaze closer to the center of the mouth. The first saccade occurred

at different time instances, and Figure 3.9 shows the percentage of trials for which

the first saccade was made in one of the following time intervals: [0.0, 1.0], [1.0, 2.0], and

[2.0, 3.0] seconds.

Figure 3.9: Breakdown of the Gaze Strategy 'saccades towards mouth' under 4 Different Auditory Conditions (panels (a)-(d): No Noise, Audiovisual Threshold + 5 dB, Audiovisual Threshold, Audiovisual Threshold - 5 dB; each panel plots the % of trials for the approaches: mouth fixations, gradual movements, and a first saccade in [0.0-1.0], [1.0-2.0], or [2.0-3.0] seconds).

Compared to the results shown for the ‘no noise’ auditory condition, it can be observed


that subjects made a greater attempt to look closer at the center of the mouth when a

speech signal was presented with noise. The percentage of trials involving the different

approaches for the ‘saccades towards mouth’ gaze strategy were similar for all auditory

conditions involving noise except for the case where the first saccade occurred in the time

interval of 1.0 to 2.0 seconds. At the lowest auditory SNR, the percentage of trials with

the ‘[1.0-2.0] seconds’ approach dropped to 13% from about 25%. It seems that some

subjects gave up on their speech intelligibility tasks due to the difficulty in discerning

the utterances. The average speech intelligibility score at the lowest auditory SNR was

22%, and some studies such as (Vatikiotis-Bateson, Eigsti, & Yano, 1994) indicated that

subjects tended to give up on an experimental task when speech intelligibility fell below

30%.

Based on the number of correct responses obtained with the different gaze patterns

or strategies, it appears that looking close to the mouth did not aid in improving speech

intelligibility. Subjects performed similarly whether they gazed close to the center of the

mouth or looked at other facial features, such as the nose and the eyes. How far can a

person place his/her gaze from the center of the mouth without observing a degradation

in speech intelligibility score?

3.2 Speech Intelligibility with a Fixed Point of Gaze (Single Talker)

A limited number of studies explored audiovisual speech perception with respect to the

proximity of gaze fixations to visual cues. Smeele et al. (1998) investigated the laterality

effects when subjects’ speechreading performance was tested with 4 different visemes

(/ba/, /va/, /tha/, and /da/) as a function of eccentricity. A synthetic face was used

as the visual stimulus, and subjects maintained their gaze at a fixation point, which was

within 9.1◦ either to the right or to the left of the edge of the synthetic face. Subjects’


speechreading performance decreased monotonically with increasing eccentricities, where

the mean percentage of correct responses was lower by 15% for 2 of the visemes (/va/,

/tha/) when subjects fixated at a point to the left of the visual stimuli. Based on this

result, it was suggested that the left-hemisphere of the brain is more adept at processing

temporal information. However, this statement could not be made conclusively since the

number of visemes tested was limited.

Another study investigated the effects of proximity of gaze fixations to visual cues in

audiovisual speech perception. Pare et al. (2003) determined how well the McGurk effect

could be perceived when subjects viewed video recordings of a talker while gazing either

at the mouth, eyes (5◦ from the mouth), or hairline (10◦ from the mouth) of the talker.

Subjects were able to perceive the McGurk effect in all of the above 3 conditions, but

the effect was less pronounced for the hairline condition by about 5% to 20% depending

on the utterances. In a separate experiment, the McGurk effect was tested with greater

eccentricities, where the fixation points were either at the mouth or at 20◦, 40◦, and 60◦

horizontally relative to the mouth. The ability to perceive the McGurk effect decreased

as a function of eccentricity, and subjects were able to perceive the McGurk effect 10%

of the time when they fixated at a point displaced 60◦ horizontally from the mouth.

The aforementioned studies only dealt with speechreading performance or the ability

to perceive the McGurk effect. Therefore, it is not clear how their results would translate

to spoken sentence perception. A similar experiment was conducted to determine how

speech intelligibility was affected with respect to the proximity of gaze fixations to visual

cues.

3.2.1 Stimuli

The audiovisual stimuli were created using the procedure described in section 2.1. Two

different types of stimuli were generated. The first type of stimuli included 1 set of video

sequences of the low-context SPIN (Speech Perception In Noise) sentences without noise.


The second type of stimuli consisted of 27 sets of video sequences of the low-context SPIN

sentences with white Gaussian noise introduced to the speech signal, where the auditory

SNRs ranged from -20 dB to +6 dB.

For each video sequence, an image was overlaid on top of each video frame. The image consisted of either a black fixation cross of size 48x48 pixels or a black filled rectangle of size 1280x300 pixels. For the first image option, the fixation cross was placed either at the center of the mouth (0◦) or at 2.5◦, 5◦, 10◦, or 15◦ relative to the center of the mouth. For the second image option, the black filled rectangle covered the mouth region.
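Placing a cross at a given eccentricity requires converting visual angle into on-screen pixels; a minimal sketch is shown below (the 66 cm viewing distance is from the experiment, while the pixel pitch of about 0.294 mm is an assumed value typical of a 19-inch 1280x1024 LCD, not taken from the thesis):

import math

def degrees_to_pixels(angle_deg, viewing_distance_cm=66.0,
                      pixel_pitch_mm=0.294):
    # On-screen extent (in pixels) subtending angle_deg at the viewer's eye.
    size_cm = 2.0 * viewing_distance_cm * math.tan(math.radians(angle_deg) / 2.0)
    return size_cm * 10.0 / pixel_pitch_mm

# e.g. a 2.5 degree offset corresponds to roughly 98 pixels on such a display.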

3.2.2 Experiment Procedure

Subjects sat at a distance of 66 cm from a 19-inch LCD display. The eye tracker was then

initialized and calibrated. After a successful calibration, subjects completed 5 practice

trials, where each trial involved listening to a video sequence under the ‘no noise’ auditory

condition while maintaining the gaze on a fixation cross at the center of the mouth of

the talker. Once each stimulus was presented, subjects reported the last word heard.

Upon completion of the training phase, subjects maintained their gaze on a fixation

cross at the center of the mouth of the talker while their audiovisual threshold was

determined according to the procedure outlined in section 2.2.5. Subjects then took a

small break before commencing the speech intelligibility tests.

Subjects were presented with 6 sets of 30 trials, where the first 5 trials of each set were

used as practice trials to allow subjects to get used to each test condition. For 5 of these

sets, subjects fixated on a cross that was placed vertically at a specific distance from the

center of the mouth of the talker (0◦, 2.5◦, 5◦, 10◦, or 15◦). For the remaining set, subjects

viewed each stimulus naturally, where the talker’s mouth region was covered by a black

filled rectangle. The presentation order of conditions and stimuli was randomized, and

no sentence was reused.


Subjects were only tested under one auditory condition as their speech intelligibility

had to be determined for various fixation distances relative to the center of the mouth

of the talker. Studies, such as (Sumby & Pollack, 1954), showed that the speech in-

telligibility improvement associated with the availability of visual information was more

pronounced for low auditory SNRs. However, selecting a very low auditory SNR (speech

intelligibility score below 30%) could result in subjects giving up in their speech intelligi-

bility tasks (Vatikiotis-Bateson et al., 1994). As a result, the auditory SNR was selected

to correspond to the subject’s audiovisual threshold, where a speech intelligibility score

of 50% could be attained.

3.2.3 Results and Discussion

Figure 3.10: Average Speech Intelligibility Scores as a Function of Proximity to the Center of the Mouth at Subjects' Audiovisual Threshold for a Single Talker (y-axis: % correct words; x-axis: proximity to the center of the mouth in degrees).

Figure 3.10 shows the speech intelligibility score when a single talker was presented on

a computer monitor and subjects fixated on a single point during each trial. A fixation

point was set either at the center of the mouth (0◦) or at 2.5◦, 5◦, 10◦, and 15◦ from the


center of the mouth. A one-way analysis of variance (ANOVA) showed a main effect for

the proximity of the fixation point to the center of the mouth (F (5, 60) = 13.93, p = 0).

A two-tail t-test showed no statistically significant differences in speech intelligibility scores

between each set of two fixation points when the points were within 10◦ of the center

of the mouth. A one-tail t-test showed a statistically significant lower score between

fixation points that were within 10◦ of the center of the mouth and a fixation point that

was 15◦ from the center of the mouth (p < 0.05).

When subjects fixated within 10◦ of the center of the mouth of the talker, their

speech intelligibility score was approximately 60%, which was 17% higher than the one

obtained with natural viewing at the ‘audiovisual threshold’ auditory condition. This

performance discrepancy can be explained by the fact that a higher auditory SNR was

obtained in some cases when estimating the audiovisual threshold of each subject. For

half of the subjects, the audiovisual threshold value was identical to the one obtained for

the experiment which tested speech intelligibility with natural viewing. However, it was

1 dB to 2 dB higher for the remaining subjects.

The worst speech intelligibility score (29%) was obtained when the visual

information lying within the talker’s mouth region was unavailable. A one-tail t-test

showed that this performance was significantly lower compared to the one obtained when

subjects fixated on a cross at different distances relative to the center of the mouth of the

talker (p < 0.005). The difference between the best and worst speech intelligibility scores was approximately 30%. The worst performance corresponded to the case where subjects effectively had access only to auditory information, as the addition of visual information to auditory information has been found to contribute an improvement of 20% to 30% in speech intelligibility (Grant & Seitz, 2000). The performance degradation can be

attributed to the absence of visual information within the mouth region, and this result

is in agreement with previous findings. Studies of speechreading performance of deaf

children showed that the performance is similar when either the entire face of the talker,


part of the talker’s face or only the lips were viewed (Ijsseldijk, 1992). Similar results with

normal hearing subjects (Marassa & Lansing, 1995) showed that the visual information

that contributes to audiovisual speech enhancement is found within the mouth region.

Our experimental results are not consistent with the ones reported in (Smeele et al.,

1998). The discrepancy in results can be explained by the differences in the cognitive

load of the experimental tasks. In (Smeele et al., 1998), subjects were asked to perform

another task in addition to speechreading while maintaining their gaze at a fixation point.

The additional task consisted of counting the number of times the fixation point varied in

size. Since subjects had to divide their attention between two tasks, they probably did not

perform as well as if they had been given only the speechreading task. Their performance

peaked when the fixation point was placed on the synthetic face and deteriorated beyond

that point.

Pare et al. (2003) observed no differences in the perception of the McGurk effect

when subjects gazed within 10◦ from the talker’s mouth. These results agree with our

findings, where the study of proximity of gaze fixations was extended to spoken sentence

perception. Subjects were able to maintain their gaze at a fixation point placed 10◦ from

the center of the mouth of the talker without compromising their performance. If speech

intelligibility is not affected by the proximity of the fixation points relative to the center

of the mouth, why did subjects change their gaze strategy to bring their fixation point

closer to the mouth under natural viewing and low auditory SNRs?

One possible explanation is that gaze strategies are developed to optimize performance

under different conditions and are often used subconsciously (i.e. subjects are not aware of

their gaze strategies). For example, to detect targets under very low levels of illumination,

subjects often look away from the target so that the image of the target will fall on their

peripheral retinas, while to detect the color and/or the fine spatial features of a target,

subjects look directly at the target. When the auditory SNR was high, subjects could

use any gaze strategy to achieve perfect performance. When the auditory SNR was low,


subjects could not achieve perfect performance with any gaze strategy. Nevertheless,

they resorted to a gaze strategy that was found to be effective for the enhancement of

audiovisual speech perception in other environments. Such an environment could include

multiple visual sources.

3.3 Speech Intelligibility with Multiple Talkers

There are currently no studies that have explored the effects of the proximity of gaze fixations

in audiovisual speech perception with multiple visual sources. The closest studies involved

investigating the role of attention in perceiving the McGurk effect. An example can be

given with an experiment conducted by Tiippana et al. (2004). In this experiment, the

ability to detect the McGurk effect was tested under 2 different viewing conditions. In

the first condition, subjects were presented with video recordings of a talker and were

asked to look directly at the talker’s face. In the second condition, they were presented

with both the talker’s face and a visual distractor consisting of a leaf moving across the

talker’s face, and they were asked to attend to the moving leaf. Once they viewed and

listened to each stimulus, they reported what they heard. From the experimental results,

it was found that the McGurk effect was lessened when subjects directed their gaze at

the moving leaf.

The role of visual spatial attention in the perception of the McGurk effect was also

explored in (Andersen, Tiippana, Laarni, Kojo, & Sams, 2009). Subjects were presented

with two instances of a talker’s face, where the faces were displayed symmetrically about

a vertical axis going through a central fixation point. The faces were independent in the

sense that each instance of the talker’s face uttered a different word. One instance of the

talker’s face said /aka/ while the other one said /ata/. Subjects were asked to maintain

their gaze at the central fixation point while attending to the talker’s face indicated by

a cueing arrow. The McGurk effect was found to be less well perceived compared to the


case where a single instance of the talker's face was presented.

The aforementioned studies showed that the introduction of a second visual source

interfered with subjects’ ability to detect the McGurk effect. However, it is not clear how

their results would translate to spoken sentence perception. A similar experiment was

conducted to determine the effects of the proximity of gaze fixations in speech intelligi-

bility with multiple visual sources.

3.3.1 Stimuli

The audiovisual stimuli were created using the procedure described in section 2.1. Two

different types of stimuli were generated. The first type of stimuli included 1 set of video

sequences of the low-context SPIN (Speech Perception In Noise) sentences without noise.

The second type of stimuli consisted of 27 sets of video sequences of the low-context SPIN

sentences with white Gaussian noise introduced to the speech signal, where the auditory

SNRs ranged from -20 dB to +6 dB.

Each stimulus was created by combining two different video streams (see Figure 3.11)

with the audio stream corresponding to one of the video streams. The stimuli were

presented either without modifications or with the addition of an overlay. In the latter

case, an image consisting of a black fixation cross of size 48x48 pixels was overlaid on top of each video frame. The fixation cross was placed at one of the following eccentricities

relative to the center of the mouth of the ‘talking face’ whose audio output was fed

through headphones (correct ‘talking face’): 0◦, 2.5◦, 5◦, 10◦, or 15◦.

Figure 3.11: Composition of Two-Talker Stimuli


3.3.2 Experiment Procedure

Subjects sat at a distance of 66 cm from a 19-inch LCD display, and the eye tracker was

initialized and calibrated. After a successful calibration, subjects completed 5 practice

trials, where each trial involved listening to a video sequence under the ‘no noise’ auditory

condition while maintaining the gaze on a fixation cross at the center of the mouth of

the correct ‘talking face’. Once each stimulus was presented, subjects reported the last

word heard.

After the training phase, subjects maintained their gaze on a fixation cross at the

center of the mouth of the correct ‘talking face’ while their audiovisual threshold was

determined by following the procedure outlined in section 2.2.5. Subjects then took a

small break before commencing the speech intelligibility tests.

Subjects were presented with 6 sets of 30 trials, where the first 5 trials of each set

were used as practice trials to allow subjects to get used to each test condition. For 5

of these sets, subjects fixated on a cross that was placed vertically at a specific distance

from the center of the mouth of the correct ‘talking face’ (0◦, 2.5◦, 5◦, 10◦, or 15◦). For

the remaining set, subjects viewed each stimulus naturally. The presentation order of

conditions and stimuli was randomized, and no sentence was reused.

Similarly to the experiment which tested subjects’ speech intelligibility with a fixed

point of gaze (single talker), subjects were only tested under one auditory condition

as there were many test conditions. The auditory SNR was chosen to correspond to

the subject’s audiovisual threshold, where a speech intelligibility score of 50% could be

attained.

3.3.3 Speech Intelligibility Results

Figure 3.12 shows the speech intelligibility score when two talkers were presented on a

computer monitor and subjects fixated on a single point during each trial. A fixation


point was set either at the center of the mouth of the correct ‘talking face’ (0◦) or at 2.5◦,

5◦, 10◦, and 15◦ from the center of the mouth of the correct ‘talking face’. A one-way

analysis of variance (ANOVA) showed a main effect for the proximity of the fixation

point to the center of the mouth (F (4, 50) = 6.41, p < 0.0005). A two-tail t-test showed

no statistically significant differences in speech intelligibility scores between fixation points that were within 2.5◦ of the center of the mouth of the correct 'talking face' (0◦ and 2.5◦). A one-tail t-test showed statistically significant differences (p < 0.05) in speech

intelligibility scores between fixating either at 0◦ or 2.5◦ and any other fixation points

(i.e. at 5◦, 10◦ and 15◦).

Figure 3.12: Average Speech Intelligibility Scores as a Function of Proximity to the Center of the Mouth at Subjects' Audiovisual Threshold for Two Talkers (y-axis: % correct words; x-axis: proximity to the center of the mouth in degrees).

3.3.4 Natural Gaze Strategies

Figure 3.13 shows how subjects’ gaze varied with respect to time when they viewed

naturally two talkers on a computer display. Subjects exhibited the same gaze behavior

as in the case where their speech intelligibility was tested with a single talker. They

moved their gaze closer to the center of the mouth of the 'correct talker' towards the end of each trial. The movements were made with saccades, which occurred at different time instances in each trial. An example of a scanning pattern is provided in Figure 3.14.

Figure 3.13: Average Euclidean Gaze Distance between Subjects' Fixation Points and the Center of the Mouth of the Correct 'Talking Face' during a Speech Intelligibility Test at the 'Audiovisual Threshold' Auditory Condition (average gaze distance from the mouth center in degrees against time in seconds).

As shown in Figure 3.15, subjects exhibited 4 different gaze patterns when viewing

two instances of a talker’s face. For the first gaze pattern, subjects began each trial by

looking at the correct ‘talking face’ at an approximate distance of 6◦ from the center

of the mouth. They then moved their gaze towards the other talker and shifted their

gaze back to the correct ‘talking face’ to bring their fixation point to 2◦ from the center

of the mouth of the correct ‘talking face’. This gaze pattern was observed for about

half of the trials. The second gaze pattern involved looking first at the correct ‘talking

face’ and then looking at the other talker, and it was used in 8% of the trials. In about

30% of all trials, subjects began gazing at the incorrect ‘talking face’ and moved their

gaze closer to the center of the mouth of the correct 'talking face'. For the remaining trials, subjects maintained their gaze on the incorrect 'talking face', with fixation distances varying between 12◦ and 14◦ from the center of the mouth of the correct 'talking face'.

Figure 3.14: Scanning Pattern for Subject 4 at the 'Audiovisual Threshold' Auditory Condition. The subject first fixated at the eyes of the 'correct talker', shifted his gaze towards the mouth of the 'incorrect talker', and then made saccadic eye movements towards the mouth of the 'correct talker' (Euclidean distance from the mouth center in degrees against time in seconds).

Figure 3.15: Average Euclidean Distance between Subjects' Fixation Points and the Correct Talker's Mouth Center for Different Types of Gaze Behavior (panels: (a) Correct to Correct 'Talking Face'; (b) Correct to Incorrect 'Talking Face'; (c) Incorrect to Correct 'Talking Face'; (d) Incorrect to Incorrect 'Talking Face'; each panel plots the average gaze distance from the mouth center of the actual 'talking face' in degrees against time in seconds).

3.4 Comparison of Audiovisual Speech Perception with Single and Multiple Talkers

Figure 3.16 compares subjects’ speech intelligibility between viewing single and multiple

talkers. A one-tail t-test between the speech intelligibility scores with single and multiple

talkers showed a statistically significant reduction in performance when multiple talkers

were presented and subjects used a fixed point of gaze strategy at an eccentricity of

5◦ (p < 0.005) or 10◦ (p < 0.01).


Figure 3.16: Average Speech Intelligibility Scores with a Fixed Point of Gaze when Viewing One or Two Talkers (y-axis: % correct words; x-axis: proximity to the center of the mouth in degrees; two series: two talkers and single talker).

When subjects fixated on a single point while listening to a single talker, the mean

percentage of correct responses was similar for all fixation points that were distributed

within 10◦ of the center of the mouth. Subjects’ performance decreased significantly

when the fixation point was placed at 15◦ from the center of the mouth. The afore-

mentioned results are consistent with the results obtained with natural viewing (single

talker). Subjects were able to fixate anywhere on the talker’s face without compromising

their performance.

For multiple talkers, the mean percentage of correct responses was similar when fix-

ations were within 2.5◦ of the center of the mouth and dropped significantly with larger

fixation distances (5◦, 10◦, and 15◦). The introduction of an additional visual source

interfered with subjects’ ability to use visual cues to enhance speech intelligibility, and

the performance deteriorated. This result is consistent with the observation that the

McGurk effect was lessened when two talkers were presented in the subjects’ visual field

(Andersen et al., 2009).


When subjects moved their eyes freely while listening and viewing a single talker,

their gaze strategy changed as a function of the auditory SNR. For auditory SNRs that

were within 5 dB of the subject's audiovisual threshold, a gaze strategy consisting of

one or more saccades towards the mouth was used in approximately 80% of trials, with a

mean fixation distance of 1.2◦ and a standard deviation of 0.7◦ relative to the center of the

mouth. However, this strategy was used in only about half of the trials at a high auditory

SNR. In the remaining trials, subjects used a gaze strategy that consisted of either long

fixations on a specific face feature (eyes, nose, etc.) or a random sequence of saccades

and fixations between the different face features, with a mean fixation distance of 4.3◦

and a standard deviation of 1.1◦ relative to the center of the mouth. All fixation points

during natural viewing were within 8◦ of the center of the mouth, and the percentage

of correct words reported did not depend on the gaze strategy or on the proximity of

the fixation points relative to the center of the mouth. These observations are consistent

with the results obtained in the experiment involving a fixed point of gaze strategy and

a single talker, where it was shown that the speech intelligibility score was not affected

as long as fixation points were distributed within 10◦ of the center of the mouth.

The natural scanning patterns obtained with multiple talkers were similar to the ones

observed with a single talker. In about 80% of all trials, subjects moved their gaze closer

to the center of the mouth of the correct ‘talking face’ by bringing their fixation point to

within 2◦ of it.


Chapter 4

Visual Processing and Audiovisual Speech Enhancement

From the experiments involving a single talker, it was found that subjects could fixate

anywhere on the talker’s face without compromising their performance. For a viewing

distance of 66 cm, subjects achieved the best speech intelligibility score when their fix-

ation points were distributed within 10◦ of the center of the mouth of the talker. The

above results suggest that the fine spatial details of the mouth and lips are not essential

for optimal enhancement of speech intelligibility. This observation is in agreement with

studies that investigated the effects of manipulating the level of details of visual informa-

tion in visual and audiovisual speech perception. These studies can be categorized into

different groups.

The first group of studies involves determining how audiovisual speech perception is affected by varying the viewing distance or the size of the visual information. Although these studies tested only the perception of the McGurk effect, their findings suggest that high spatial resolution is not required to perceive audiovisual

speech. When Jordan and Sergeant (1998) manipulated the image size to be between

2.5% and 100% of the original image size, they found that the McGurk effect was per-


ceived for all image sizes except when the image size was reduced to 2.5% of the

original image size. A similar experiment (Jordan, McCotter, & Thomas, 2000) showed

that the perception of the McGurk effect was not affected until subjects viewed the stim-

uli at a distance greater than or equal to 20 meters. With decreasing image size or increasing viewing distance, the ability to resolve fine spatial details decreased, which in turn reduced the perception of the McGurk effect. However, the exact spatial degradation

levels that could be tolerated were not provided. Instead, they were explored in another

group of studies.

C. S. Campbell and Massaro (1997) examined the influence of spatial quantization

in visual speech perception when the image resolution was manipulated by averaging

neighbouring pixels into larger blocks. They found that speechreading performance was

unaffected until the spatial quantization level was set to frequencies below 16 cycles/face

(this form of measure was obtained by dividing the number of quantized blocks that run

horizontally at the talker's eye level by 2). The effects of spatial quantization were also

explored in (MacDonald et al., 2000), where it was shown that the McGurk effect could

be perceived even at the coarsest level (11.2 pixels/face) tested. However, this result

might not be entirely correct due to the introduction of high-frequency components from

the boundaries of quantization blocks. More care was taken in (Munhall et al., 2004)

to confirm that fine spatial details are not required for audiovisual speech perception.

In one of the experiments, the visual stream of the stimuli was low-pass filtered using

a rotationally symmetric Gaussian filter with cutoff frequencies ranging from 0.25 to

7.9 cycles/degree. Compared to the case where only the audio stream was presented,

subjects were able to improve their speech intelligibility for all spatial filtering conditions

except for the lowest cutoff frequency. Their performance reached an asymptote when the

cutoff frequency was approximately 1 cycle/degree, and it matched that of the unfiltered

condition.

The finding that fine details are not required for audiovisual speech perception could


be equivalent to stating that peripheral vision is sufficient for audiovisual speech enhance-

ment. However, no previous attempt was made to verify this equivalence. An experiment

similar to the one described in (Munhall et al., 2004) was conducted to determine the

relationship between spatial degradation and the inability to resolve fine details in the

periphery.

4.1 Methods

4.1.1 Subjects

Five of the subjects who participated in the experiments presented in the previous chapter

were included in this study. The time interval between this experiment and the previous

one was approximately one month.

4.1.2 Stimuli

27 sets of video sequences of the low-context SPIN sentences were generated, where each

speech signal was combined with white Gaussian noise, and auditory SNRs ranged from

-20 dB to +6 dB. The audiovisual stimuli were created using the procedure described in

section 2.1. The original video recordings were resized to 1280x720 pixels and low-pass filtered

using a rotationally symmetric Gaussian filter. Other low-pass filters were considered,

but they were discarded due to the introduction of ringing effects. Prior to applying the

blurring filter, the data from each video frame were extracted and placed at the center

of a square surface, which was initialized with zeros. The surface dimension was chosen

to be a power of 2 to speed up computation time, and the smallest power of 2 which

was greater than or equal to the largest dimension of a video frame was 2048 pixels. A

fast Fourier transform was applied on the surface spatial data to convert them to the

frequency domain, and each element of the surface was multiplied with its corresponding

Gaussian blur kernel element. An inverse fast Fourier transform was used as a final step


to convert the filtered frequency domain data to its spatial representation. It is noted

that the blurring process could have been completed in the spatial domain, but this

option was discarded due to the extensive computation introduced by the convolution

operation for large kernels.

For each video sequence, an image consisting of a black fixation cross of size 48×48

pixels was overlayed on top of each video frame. The fixation cross was placed at 2.5◦

relative to the center of the mouth of the talker since this point was found to be the

preferred gaze position when subjects’ speech intelligibility was tested under natural

viewing.

Figure 4.1: Transfer Function of the Gaussian Blur Filter

A Gaussian filter was used to perform the blurring. The filter is defined by $f(d) = e^{-d^2/(2\sigma^2)}$, where $d$ represents the distance between the centre and a point on a two-dimensional surface, and $\sigma$ denotes the cutoff frequency of the filter. A few values of the filter are provided as a reference (see Figure 4.1): at $d = \sigma$, the amplitude of the original signal is decreased by 40%; at $d = 3\sigma$, by 99%; and at $d = 3.7\sigma$, by 99.9%. It was observed that


some spatial frequency components beyond 15 cycles per degree were still perceptible,

and this was not desirable for the following reason. If subjects’ speech intelligibility was

found to be affected by a cutoff frequency of x, it could not be ascertained that spatial

frequencies below x were the sole contributors to audiovisual speech enhancement. Some

spatial frequencies beyond x could also have been involved. Therefore, a stricter

definition for cutoff frequency was employed.

The definition of the cutoff frequency was chosen based on the contrast detection

thresholds of sinusoidal gratings. A sinusoidal grating with a spatial frequency of 4 cycles

per degree was generated and low-pass filtered such that the contrasts of the resulting images were 0.1, 0.01, and 0.001. From Figure 4.2, it can be observed that the sinusoidal gratings could barely be resolved at an image contrast of 0.001. As a result, the cutoff

frequency of the low-pass filter was selected to correspond to an image contrast that

was between 0.01 and 0.001. Specifically, it was chosen such that the amplitude of the

given signal was reduced by 99% at the cutoff frequency. It is noted that a compromise

was made by minimizing the contributions of the spatial frequencies beyond the cutoff

frequency. The amplitude of the signal associated with lower spatial frequencies were

more attenuated.

4.1.3 Experiment Design and Procedure

Subjects were seated at a distance of 66 cm from a 19-inch LCD display. The point-of-

gaze estimation system was then initialized and calibrated. After successful calibration,

subjects completed 5 practice trials, where each trial involved listening to a video sequence

under the ‘no noise’ auditory condition while maintaining the gaze on a fixation cross

at 2.5◦ from the center of the mouth of the talker. Once each stimulus was presented,

subjects reported the last word heard.

Upon completing the practice trials, subjects maintained their gaze on a fixation cross

while their audiovisual threshold was determined according to the procedure outlined in


Figure 4.2: Sinusoidal Gratings of 4 Cycles per Degree at Image Contrasts of (a) 1, (b) 0.1, (c) 0.01, and (d) 0.001

Section 2.2.5. Subjects then took a short break before commencing the speech intelligibility tests.

Subjects were presented with 8 sets of 25 trials, where each set tested subjects’ speech

intelligibility at different spatial degradation levels at the ‘audiovisual threshold’ audi-

tory condition. In each trial, subjects maintained their gaze on a fixation cross placed

2.5◦ relative to the center of the mouth, and they reported the last word heard after

the stimulus presentation. The presentation order of conditions and sentences was ran-

domized. The cutoff frequencies for the blurring filter were selected based on the ones

used by Munhall et al. (2004) in their experiment on audiovisual speech perception with

low-pass filtered videos. In addition, some of them were chosen to coincide with the loss

of visual acuity in different regions of the periphery.



Figure 4.3: Grating Visual Acuity - Adapted from (Rovamo et al., 1982)

One of the experiments presented in the previous chapter involved evaluating the

performance of subjects in speech intelligibility tasks as a function of the proximity of

gaze fixations to the center of the mouth of the talker. The eccentricities tested were 0◦,

2.5◦, 5◦, 10◦, and 15◦ relative to the center of the mouth of the talker. Each eccentricity

condition was mapped to a spatial degradation level by using the grating visual acuity

transfer function shown in Figure 4.3.

Rovamo et al. (1982) measured resolution thresholds at different visual-field locations

using stationary sinusoidal gratings. They started by presenting unresolvable gratings

and decreased the spatial frequency in small steps until participants reported being able

to resolve the gratings. The spatial frequency of the first visible gratings was referred to as

a resolution threshold. A similar experiment was performed by Thibos et al. (1987) who

obtained the resolution thresholds by asking participants to reduce the spatial frequency

until gratings could just barely be resolved. However, the resolution thresholds obtained

in (Thibos et al., 1987) were higher than those presented in (Rovamo et al., 1982). As a

result, the lowest reported values were used. The spatial degradation levels corresponding


to the eccentricities of 0◦, 2.5◦, 5◦, 10◦, and 15◦ were translated to 24, 16, 12, 7, and 4

cycles per degree, respectively.

Figure 4.4: Speech Intelligibility Scores as a Function of Low-Pass Filter Cutoff Frequencies - Adapted from (Munhall et al., 2004)

Figure 4.5: Examples of Video Frames Low-Pass Filtered at Cutoff Frequencies of (a) 0.25, (b) 0.5, (c) 1, (d) 2, (e) 4, (f) 6, (g) 9, and (h) 15 cycles/degree

The number of low-context SPIN sentences available was limited. Therefore, the test

conditions had to be carefully chosen. Figure 4.4 shows the speech intelligibility results


found by (Munhall et al., 2004) as a function of low-pass filter cutoff frequencies. These

values were provided in cycles per face, so they were converted to cycles per degree in

order to deal with an absolute measure. The corresponding cutoff frequencies in cycles

per degree were calculated to be 0.24, 0.49, 0.97, 2, 3.86, and 7.9. It appears that subjects’

performance reached an asymptote at a cutoff frequency of 7.3 cycles per face (0.97 cycles

per degree), suggesting that the presence of additional spatial frequency components did

not contribute to a further enhancement in speech intelligibility. It is noted that all the

spatial degradation levels obtained for the eccentricities of 0◦, 2.5◦, 5◦, 10◦, and 15◦ were

beyond 0.97 cycles per degree. Therefore, it did not seem necessary to test subjects’

speech intelligibility at each of these points since the results were expected to be similar.

Instead, only the spatial degradation levels corresponding to the last three eccentricities (5◦,

10◦, and 15◦) were considered in addition to some of the ones employed by Munhall et

al. (2004). The low-pass filter cutoff frequencies were thus selected to be 0.25, 0.5, 1, 2,

4, 6, 9, and 15 cycles per degree. Figure 4.5 provides examples of video frames with the

corresponding blurring levels.

4.2 Speech Intelligibility with Low-Pass Filtered Videos

Figure 4.6 shows the mean percentage of correct words reported when subjects viewed

low-pass filtered video recordings of a talker while fixating at a cross 2.5◦ from the center

of the mouth. Subjects reached an asymptotic performance level (approximately 60%)

when the cutoff frequency of the low-pass filter was 6 cycles/degree, and this perfor-

mance was equivalent to that of the unfiltered condition. For cutoff frequencies below

6 cycles/degree, subjects’ speech intelligibility decreased monotonically with decreasing

cutoff frequency.

A one-way analysis of variance (ANOVA) showed a main effect for the low-pass filter

cutoff frequency (F(7, 32) = 3.97, p < 0.005). A one-tailed t-test showed no significant


Figure 4.6: Average Speech Intelligibility Scores (% correct words) as a Function of Low-Pass Filter Cutoff Frequency (cycles/degree)

statistical differences in speech intelligibility scores between cutoff frequencies that were

in the range of 6 and 15 cycles/degree or in the range of 0.25 and 0.5 cycles/degree. For

the remaining conditions, the statistical differences were either significant (p < 0.05) or

trended towards significance (p < 0.10).

4.3 Discussion

The statement that fine spatial details are not required for audiovisual speech perception

can be made based on the results obtained from two different experiments. When subjects

fixated on a single point while viewing and listening to a single talker, the average speech

intelligibility score was similar for fixation points that were distributed within 10◦ of the

center of the mouth of the talker. Subjects were able to achieve the best audiovisual

speech enhancement with their peripheral vision even though the ability to resolve fine

spatial details was reduced. When subjects were presented with low-pass filtered stimuli


Figure 4.7: Average Speech Intelligibility Scores (% correct words) as a Function of Physiological Cutoff Frequency (cycles/degree) at the ‘Audiovisual Threshold’ Auditory Condition when the Original Signal is Attenuated by 99% at the Cutoff Frequency. Scores for unfiltered stimuli at eccentricities of 5◦, 10◦, and 15◦ are overlaid for comparison.

with cutoff frequencies higher than or equal to 6 cycles/degree, their performance was the

same as that of the unfiltered condition. Therefore, low spatial frequency components

are sufficient to perceive audiovisual speech.

Figure 4.7 compares the mean percentage of correct responses between the conditions

of viewing low-pass filtered video recordings while maintaining the gaze at 2.5◦ from

the center of the mouth and viewing unfiltered stimuli with fixation points placed either

at 5◦, 10◦, or 15◦ relative to the center of the mouth. A two-tailed t-test showed no

significant statistical difference (p = 0.50) when comparing the speech intelligibility scores

between viewing filtered (with cutoff frequency of 4 cycles/degree) and unfiltered (with

an eccentricity of 15◦) stimuli.

The minor difference in performance between the conditions of viewing filtered (with

cutoff frequency of 4 cycles/degree) and unfiltered (with an eccentricity of 15◦) stim-

uli can be modulated by changing the mapping between spatial degradation and the


resolution limit in the periphery, as shown in Figure 4.8. When subjects viewed video

recordings that were low-pass filtered at a cutoff frequency for which the amplitude of the

original signal was attenuated by 99.9%, their performance matched that of the unfiltered

condition.

Figure 4.8: Average Speech Intelligibility Scores (% correct words) as a Function of Physiological Cutoff Frequency (cycles/degree) at the ‘Audiovisual Threshold’ when the Original Signal is Attenuated by 99.9% at the Cutoff Frequency. Scores for unfiltered stimuli at eccentricities of 5◦, 10◦, and 15◦ are overlaid for comparison.

The speech intelligibility results shown in Figure 4.6 differed from those presented in

(Munhall et al., 2004). Munhall et al. (2004) found that subjects required only spatial

frequencies below 1 cycle per degree instead of 6 cycles per degree to maximize speech

intelligibility. The discrepancies in speech intelligibility results may have been caused by

several factors. First, the sentences used in the stimuli were different. In (Munhall et

al., 2004), CID (Central Institute for the Deaf) “everyday sentences” were used instead

of the low-context SPIN sentences. CID “everyday sentences” are considered to be high-

context sentences, and this type of sentence simplifies the speech intelligibility task, as

the next uttered word can be guessed based on previous utterances. A second factor

contributing to discrepancies in speech intelligibility results involves the conversion of


spatial frequencies from cycles/face to cycles/degree. A table showing the conversion

from cycles/face to cycles/degree was provided in (Munhall et al., 2004), but it is not

clear if the conversion was applicable to each of the experiments presented. Finally, the

discrepancy in speech intelligibility results may have been caused by the use of a different

definition for the low-pass filter cutoff frequency.

4.4 Spatial and Temporal Frequency Channels

The human visual system is composed of narrowly tuned channels, where a channel

refers to a filtering mechanism which passes some of the input information (DeValois &

DeValois, 1990). These channels can be divided into two groups: low spatial frequency (less

than 1 cycle/degree) channels and high spatial frequency (greater than 1 cycle/degree)

channels (Blakemore & Campbell, 1969). The first group was found to be responsible for

movement detection while the latter deals with the analysis of spatial patterns in addition

to being most sensitive to stationary images (Tolhurst, 1973), (Anderson & Burr, 1985).

Based on the speech intelligibility scores shown in Figure 4.8, subjects required spatial frequencies below 6 cycles/degree in order to achieve optimal performance. This observation suggests that both low and high spatial frequency channels of the visual system are involved in audiovisual speech perception.

When subjects viewed video recordings that were low-pass filtered either at a cutoff

frequency of 0.25 cycle/degree or 0.5 cycle/degree, their performance was similar to the

case where they had access only to auditory information. The inability to

observe an audiovisual speech enhancement was caused by the blurring filter masking

most of the temporal information. With a low-pass filter cutoff frequency of 0.25 cy-

cle/degree or 0.5 cycle/degree, the lip movements of the talker could not be discerned.

However, temporal information alone is not sufficient to achieve optimal performance.


When subjects were presented with a collection of points which mapped to facial fea-

tures (lips, teeth, tongue, jaw, cheeks, forehead, and nose), their speech intelligibility was

lower by approximately 5 dB compared to when they had access to the full

face (Rosenblum et al., 1996b). Subjects required finer spatial details to maximize their

speech intelligibility.


Figure 4.9: Temporal Filters in the Human Visual System - Adapted from (Hess &

Snowden, 1992)

When subjects were tested with a single talker at different eccentricities relative to the

center of the mouth, a lower speech intelligibility score was observed for fixation points

placed at 15◦ from the center of the mouth. The decrease in performance was caused by

the inability to resolve some of the spatial and/or temporal frequencies in the periphery.

At an eccentricity of 15◦, only spatial frequencies below 6 cycles per degree can be seen

(Rovamo et al., 1982). Furthermore, the ability to view temporal frequencies is restricted.

As shown in Figure 4.9, there are 3 types of temporal filters in the human visual system

although the third one is not agreed upon in all studies (Hammett & Smith, 1992). The

first is a low-pass filter with a cutoff frequency of 10 Hz, while the others are

band-pass filters centered around 10 and 20 Hz respectively. At an eccentricity of 15◦,

the temporal filter consists of a band-pass filter centered around 10 Hz (Snowden & Hess,

1992), so all the frequencies below 8 Hz or above 12 Hz are attenuated. The facial motion

and acoustic speech data are within the temporal range of 0 Hz to 7 Hz (Chandrasekaran

et al., 2009), which means that subjects were unable to achieve the optimal performance


when fixating at 15◦, as the temporal information associated with speech was attenuated.


Chapter 5

Conclusions

5.1 Summary

Some discrepancies were found between the findings reported in previous studies on gaze

behavior and audiovisual speech perception. One study (Vatikiotis-Bateson et al., 1998)

showed that the number of fixations on the mouth increased as a function of noise level.

Other studies (Buchan et al., 2007), (Lansing & McConkie, 2003) showed that more

fixations were made on the nose and on the mouth when noise was introduced to speech

signals. Nevertheless, studies on gaze behavior and audiovisual speech perception agreed

that the primary fixation regions were the eyes, nose, and mouth.

From past findings, it was not clear which gaze strategies were optimal for improving

speech intelligibility. The studies presented in Chapter 3 were the first to quantify and

establish relationships between speech intelligibility and gaze patterns. When subjects

moved their eyes freely while listening and viewing a single talker, their gaze strategy

changed as a function of auditory SNR. For low auditory SNRs, a gaze strategy consisting

of one or more saccades towards the mouth was used in approximately 80% of all trials

while it was used in only about half of the trials for high auditory SNR. In the remaining

trials, subjects used a gaze strategy that consisted of either long fixations on a specific


face feature or a random sequence of saccades and fixations between the different face

features, but they did not bring their fixation points close to the mouth. All fixation

points were within 8◦ of the center of the mouth of the talker, and the speech intelligibility

performance did not depend on the gaze strategy or on the proximity of the fixation

points relative to the center of the mouth. This observation is consistent with the results

obtained in the experiment for which subjects’ speech intelligibility was tested with a

single talker and a fixed point of gaze. Subjects were able to achieve the best performance

as long as their fixation points were distributed within 10◦ of the center of the mouth.

Although subjects did not need to bring their fixation points closer to the mouth for optimal audiovisual speech enhancement, they adopted this gaze strategy because it was found to be effective in other environments, such as an environment that includes multiple

visual sources. When subjects were tested with a fixed point of gaze and multiple talkers

in their visual field, they optimized their speech intelligibility by using a gaze strategy

that brought their fixation points to within 2.5◦ of the center of the mouth.

By investigating eye movement behavior and the effects of manipulating the level

of details of visual information, past studies suggested that peripheral vision might be

sufficient for audiovisual speech perception. However, no study explicitly investigated

the role of peripheral vision in audiovisual speech perception, nor was there a study

which compared audiovisual speech perception between the conditions of viewing spa-

tially filtered images (using foveal/parafoveal vision) and viewing unfiltered information

(using peripheral vision). In Chapter 4, this comparison was made. By using the grating

visual acuity curve, eccentricities were mapped to levels of spatial degradation. Sub-

jects’ speech intelligibility was found to be the same for a given eccentricity and its

corresponding level of spatial degradation (determined by the aforementioned mapping).

When subjects viewed low-pass filtered video recordings, their speech intelligibility was

optimized as long as they were able to see spatial frequencies below 6 cycles/degree

(where the signal was attenuated by 99% at the cutoff frequency). Their performance


decreased when the stimuli were low-pass filtered at a lower cutoff frequency, and

this performance decrease was caused by the inability to see some spatial frequencies in

addition to the blurring filter masking some of the temporal information. When subjects

viewed unfiltered video recordings while fixating at a single point, they obtained the best

audiovisual speech enhancement for fixation points that were distributed within 10◦ of

the center of the mouth. Their performance decreased significantly when fixating 15◦

from the center of the mouth, and this performance decrease was caused by the inabil-

ity to view some of the spatial and temporal information. In the periphery, the ability

to resolve fine spatial details is limited. Furthermore, the temporal information passes

through a bandpass filter, which is centered around 10 Hz (Snowden & Hess, 1992), and

the information associated with temporal frequencies below 8 Hz is attenuated. This

suggests that subjects cannot achieve optimal performance as the temporal information

pertaining to speech is limited (Chandrasekaran et al., 2009).

5.2 Future Work

The studies on young adult subjects demonstrated that for a single talker, speech intelli-

gibility was not affected by gaze strategies or by the proximity of fixation points to visual

cues that contribute to audiovisual speech enhancement. However, it is not clear how the

results from these studies would translate to the aging population. Gordon and Allen

(2009) showed that the aging population was not able to achieve the same audiovisual

speech enhancement as the younger population with blurred stimuli (see Figure 5.1).

When the level of blurring corresponded to a visual acuity of 20/50 (Snellen fraction

of 0.4), older adults exhibited an audiovisual speech enhancement for high-context sen-

tences only, but their performance was approximately 20% lower than that obtained

for unfiltered stimuli. According to Figure 5.2, a blurring level of 20/50 (Snellen fraction

of 0.4) is equivalent to the reduced visual acuity observed at an eccentricity of 4◦. Under


this condition, older subjects were not able to achieve the best speech intelligibility score,

so it can be inferred that they need to bring their fixation points closer to the mouth in

order to achieve optimal performance.


Figure 5.1: Average Speech Intelligibility Scores for Younger and Older Adults under a

Blurred Condition - Adapted from (Gordon & Allen, 2009)


Figure 5.2: Visual Acuity as a Function of Eccentricity - Adapted from (Westheimer,

1979)


References

Andersen, T. S., Tiippana, K., Laarni, J., Kojo, I., & Sams, M. (2009, Feb.). The role of

visual spatial attention in audiovisual speech perception. Speech Communication,

51 (2), 184-193.

Anderson, S. J., & Burr, D. C. (1985, Aug.). Spatial and temporal selectivity of the

human motion detection system. Vision Research, 25 (8), 1147-1154.

Bavelier, D., Brozinsky, C., Tomann, A., Mitchell, T., Neville, H., & Liu, G. (2001,

Dec.). Impact of early deafness and early exposure to sign language on the cerebral

organization for motion processing. Journal of Neuroscience, 21 , 8931-8942.

Binder, J. R., Frost, J. A., Hammeke, T. A., Bellgowan, P. S., Springer, J. A., & Kauf-

man, J. N. (2000, Jun.). Human temporal lobe activation by speech and nonspeech

sounds. Cerebral Cortex , 10 , 512-528.

Blakemore, C., & Campbell, F. W. (1969, Jul.). On the existence of neurones in the

human visual system selectively sensitive to orientation and size of retinal images.

Journal of Physiology , 203 (1), 237-260.

Boothroyd, A., Hnath-Chisolm, T., Hanin, L., & Kishon-Rabin, L. (1988, Dec.). Voice

fundamental frequency as an auditory supplement to the speechreading of sen-

tences. Ear and Hearing, 9(6), 306-312.

Buchan, J. N., Paré, M., & Munhall, K. G. (2007, Mar.). Spatial statistics of gaze

fixations during dynamic face processing. Social Neuroscience, 2 (1), 1-13.

Buchan, J. N., Paré, M., & Munhall, K. G. (2008, Jun.). The effect of varying talker iden-


tity and listening conditions on gaze behavior during audiovisual speech perception.

Brain Research, 1242, 162-171.

Callan, D. E., Jones, J. A., Munhall, K., Kroos, C., Callan, A. M., & Vatikiotis-Bateson,

E. (2004, May). Multisensory integration sites identified by perception of spatial

wavelet filtered visual speech gesture information. Journal of Cognitive Neuro-

science, 16 , 805-816.

Calvert, G. A., Campbell, R., & Brammer, M. J. (2000, Sept.). Evidence from functional

magnetic resonance imaging of crossmodal binding in the human heteromodal cor-

tex. Current Biology , 10 , 649-657.

Calvert, G. A., Spence, C., & Stein, B. E. (2004). The handbook of multisensory processes

(1st ed.). Cambridge, Massachusetts: MIT Press.

Campbell, C. S., & Massaro, D. W. (1997, May). Perception of visible speech: influence

of spatial quantization. Perception, 26 (5), 627-644.

Campbell, R. (2008, Jun.). The processing of audio-visual speech: empirical and neural

bases. Philosophical Transactions of the Royal Society Biological Sciences , 363 ,

1001-1010.

Chandrasekaran, C., Trubanova, A., Stillittano, S., Caplier, A., & Ghazanfar, A. A.

(2009, Jul.). The natural statistics of audiovisual speech. PLoS Computational

Biology , 5 (7), 1-18.

Craig, M. S., Van Lieshout, P., & Wong, W. (2008, Nov.). A linear model of acoustic-to-

facial mapping: model parameters, data set size, and generalization across speakers.

Journal of the Acoustical Society of America, 124(5), 3183-3190.

Davson, H. (1990). Physiology of the eye (5th ed.). Houndmills, Basingstoke, Hamphsire:

The Macmillan Press Ltd.

DeValois, R. L., & DeValois, K. K. (1990). Spatial vision. New York: Oxford University

Press.

Dixon, N., & Spitz, L. (1980, Sept.). The detection of audiovisual desynchrony. Percep-


tion, 9 (6), 719-721.

Ebrahimi, D., & Kunov, H. (1991, Oct.). Peripheral vision lipreading aid. IEEE Trans-

actions on Biomedical Engineering , 38 (10), 944-952.

Everdell, I. T., Marsh, H., Yurick, M. D., Munhall, K. G., & Paré, M. (2007, Oct.).

Gaze behaviour in audiovisual speech perception: Asymmetrical distribution of

face-directed fixations. Perception, 36 (10), 1535-1545.

Gordon, M. S., & Allen, S. (2009). Audiovisual speech in older and younger adults:

Integrating a distorted visual signal with speech in noise. Experimental Aging

Research, 35 (2), 202-219.

Grant, K. W., & Seitz, P.-F. (2000, Sept.). The use of visible speech cues for improving

auditory detection of spoken sentences. Journal of the Acoustical Society of America,

108 (3), 1197-1208.

Grant, K. W., & Walden, B. E. (1996, Oct.). Evaluating the articulation index for

auditory-visual consonant recognition. Journal of the Acoustical Society of Amer-

ica, 100 (4), 2415-2424.

Green, K. P., Kuhl, P. K., Meltzoff, A. N., & Stevens, E. B. (1991, Jun.). Integrating

speech information across talkers, gender, and sensory modality: female faces and

male voices in the McGurk effect. Perception & Psychophysics, 50(6), 524-536.

Grill-Spector, K., Kourtzi, Z., & Kanwisher, N. (2001, Dec.). The lateral occipital

complex and its role in object recognition. Vision Research, 41 , 1409-1422.

Guestrin, E. D., & Eizenman, M. (2006, Jun.). General theory of remote gaze estimation

using the pupil center and corneal reflections. IEEE Transactions on Biomedical

Engineering , 53 (6), 1124-1133.

Guestrin, E. D., & Eizenman, M. (2008, Mar.). Remote point-of-gaze estimation requiring

a single-point calibration for applications with infants. Proceedings of ETRA 2008 ,

267-274.

Hammett, S. T., & Smith, A. T. (1992, Feb.). Two temporal channels or three? A


re-evaluation. Vision Research, 32 (2), 285-291.

Hess, R. F., & Snowden, R. J. (1992, Jan.). Temporal properties of human visual filters:

number, shapes and spatial covariation. Vision Research, 32 (1), 47-59.

Ijsseldijk, F. J. (1992, Apr.). Speechreading performance under different conditions of

video image, repetition, and speech rate. Journal of Speech and Hearing Research,

35 , 466-471.

Iverson, P., Bernstein, L. E., & Auer, E. T., Jr. (1998, Oct.). Modeling the interaction

of phonemic intelligibility and lexical structure in audiovisual word recognition.

Speech Communication, 26 (1), 45-63.

Jordan, T. R., McCotter, M. V., & Thomas, S. M. (2000, Oct.). Visual and audio-

visual speech perception with color and gray-scale facial images. Perception &

Psychophysics , 62 (7), 1394-1404.

Jordan, T. R., & Sergeant, P. (1998). Effects of facial image size on visual and audiovisual

speech recognition. In R. Campbell, B. Dodd, & D. Burnham (Eds.), Hearing

by eye II: Advances in the psychology of speechreading and auditory-visual speech (pp. 155-176). Hove, East Sussex: Psychology Press.

Kalikow, D. N., Stevens, K. N., & Elliott, L. L. (1977, May). Development of a test of speech intelligi-

bility in noise using sentence materials with controlled word predictability. Journal

of the Acoustical Society of America, 61(5), 1337-1351.

Lansing, C. R., & McConkie, G. W. (1994). A new method for speechreading research:

tracking observers’ eye movements. Journal of the Academy of Rehabilitation Au-

diology , 27 , 25-43.

Lansing, C. R., & McConkie, G. W. (2003, May). Word identification and eye fixa-

tion locations in visual and visual-plus-auditory presentations of spoken sentences.

Perception & Psychophysics , 65 (4), 536-552.

MacDonald, J., Andersen, S., & Bachmann, T. (2000, Oct.). Hearing by eye: how much

spatial degradation can be tolerated? Perception, 29 (10), 1155-1168.


MacLeod, A., & Summerfield, Q. (1987, May). Quantifying the contribution of vision to

speech perception in noise. British Journal of Audiology, 21(2), 131-141.

MacLeod, A., & Summerfield, Q. (1990, Jan.). A procedure for measuring auditory and

audiovisual speech-reception thresholds for sentences in noise: Rationale, evalua-

tion, and recommendations for use. British Journal of Audiology , 24 (1), 29-43.

Marassa, L. K., & Lansing, C. R. (1995, Dec.). Visual word recognition in 2 facial motion

conditions: full face versus lips-plus-mandible. Journal of Speech and Hearing

Research, 38 , 1387-1394.

McGurk, H., & MacDonald, J. (1976, Dec). Hearing lips and seeing voices. Nature, 264 ,

746-748.

Munhall, K. G., Kroos, C., Jozan, G., & Vatikiotis-Bateson, E. (2004, May). Spa-

tial frequency requirements for audiovisual speech perception. Perception & Psy-

chophysics , 66 (4), 574-583.

Pandey, P. C., Kunov, H., & Abel, S. M. (1986, Jan.). Disruptive effects of auditory

signal delay on speech perception with lipreading. Journal of auditory research,

26 (1), 27-41.

Paré, M., Richler, R. C., ten Hove, M., & Munhall, K. G. (2003, May). Gaze behavior in audiovisual speech perception: The influence of ocular fixations on the McGurk effect. Percep-

tion & Psychophysics , 65 (4), 553-567.

Pekkola, J., Ojanen, V., Autti, T., Jääskeläinen, I. P., Möttönen, R., Tarkiainen, A., et al. (2005, Jan.). Primary auditory cortex activation by visual speech: an fMRI study at 3 T. NeuroReport, 16(2), 125-128.

Qian, C. L. (2009). Crossmodal modulation as a basis for visual enhancement of auditory

performance. Unpublished master’s thesis, University of Toronto.

Rosenblum, L. D., Johnson, J. A., & Saldaña, H. M. (1996a, Apr.). An audiovisual

test of kinematic primitives for visual speech perception. Journal of Experimental

Psychology: Human Perception and Performance, 22 (2), 318-331.


Rosenblum, L. D., Johnson, J. A., & Saldaña, H. M. (1996b, Dec.). Point-light facial

displays enhance comprehension of speech in noise. Journal of Speech and Hearing

Research, 39 (6), 1159-1170.

Rovamo, J., Virsu, V., Laurinen, P., & Hyvärinen, L. (1982, Nov.). Resolution of

gratings oriented along and across meridians in peripheral vision. Investigative

Ophthalmology & Visual Science, 23 (5), 666-670.

Scott, S. K., Blank, C. C., Rosen, S., & Wise, R. J. S. (2000, Oct.). Identification of a

pathway for intelligible speech in the left temporal lobe. Brain, 123 (12), 2400-2406.

Smeele, P. M. T., Massaro, D. W., Cohen, M. M., & Sittig, A. C. (1998, Aug.). Lat-

erality in visual speech perception. Journal of Experimental Psychology: Human

Perception and Performance, 24 (4), 1232-1242.

Snowden, R. J., & Hess, R. F. (1992, Jan.). Temporal frequency filters in the human

peripheral visual field. Vision Research, 32 (1), 61-72.

Sumby, W. H., & Pollack, I. (1954, Mar.). Visual contribution to speech intelligibility in

noise. Journal of the Acoustical Society of America, 26(2), 212-215.

Summerfield, Q. (1987). Some preliminaries to a comprehensive account of audio-visual

speech perception. In R. Campbell & B. Dodd (Eds.), Hearing by eye: The psy-

chology of lipreading (p. 3-51). Hove, East Sussex: Lawrence Erlbaum Associates

Ltd.

Thibos, L. N., Cheney, F. E., & Walsh, D. J. (1987, Aug.). Retinal limits to the detection

and resolution of gratings. Journal of the Optical Society of America, 4 (8), 1524-

1529.

Tiippana, K., Sams, M., & Andersen, T. S. (2004, May). Visual attention influences

audiovisual speech perception. European Journal of Cognitive Psychology , 16 (3),

457-472.

Tolhurst, D. J. (1973, Jun.). Separate channels for the analysis of the shape and the

movement of a moving visual stimulus. Journal of Physiology , 231 (3), 385-402.


Vatikiotis-Bateson, E., Eigsti, I. M., & Yano, S. (1994). Listener eye movement behavior

during audiovisual perception. Journal of the Acoustical Society of Japan, 94 (3),

679-680.

Vatikiotis-Bateson, E., Eigsti, I.-M., Yano, S., & Munhall, K. G. (1998, Aug.). Eye

movement of perceivers during audiovisual speech perception. Perception & Psy-

chophysics , 60 (6), 926-940.

van Wassenhove, V., Grant, K. W., & Poeppel, D. (2005, Jan.). Visual speech speeds

up the neural processing of auditory speech. Proceedings of the National Academy

of Sciences of the United States of America, 102 (4), 1181-1186.

Westheimer, G. (1979, Feb.). Scaling of visual acuity measurements. Archives of Oph-

thalmology , 97 , 327-330.