seeing, hearing, and touching: multimodality putting it ...fisher/bf-intersense2.pdf · putting it...

1

Seeing, Hearing, and Touching:

Putting It All Together

Sensory Integration Module

• Seeing and Hearing Events Fisher

• Touching, Seeing, and Hearing MacLean

• Integrating Applications: Tight Coupling & Physical Metaphors MacLean

• Integrating Applications: Designing for Intimacy Fels

The morning talks gave a perspective on how vision science can be used to inform the

design of visually complex interfaces such as those used in information visualization

systems. The second half of the course looks at intersensory interactions and how they

can inform the move to multimodal environments. These environments combine

visual, auditory, and haptic displays with a richer set of inputs from users, including

speech, gesture, and biopotentials.

2

Seeing and Hearing Events (Fisher) 2

Moving to multimodality

Vision

Virtual

interaction model

Force & tactile

feedback

Psychophysics of vision, sound, and touch will change when the environment is multimodal

Hearing

Force display technology works by using mechanical actuators to apply forces to the

user. By simulating the physics of the user’s virtual world, we can compute these

forces in real-time, and then send them to the actuators so that the user feels them.
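The compute-then-actuate loop described above can be illustrated with a minimal sketch. It assumes a one-dimensional virtual wall modeled as a spring-damper, which is a common penalty-based approach and not necessarily the one used in the systems discussed here. The names `wall_force`, `simulate`, `STIFFNESS`, and `DAMPING`, and their values, are hypothetical.

```python
# Hypothetical 1-D virtual wall at x = 0; the probe penetrates when x < 0.
STIFFNESS = 800.0   # N/m, assumed spring constant of the virtual wall
DAMPING = 2.0       # N*s/m, assumed damping to reduce oscillation


def wall_force(position: float, velocity: float) -> float:
    """Return the force (N) to command to the actuator for one servo tick."""
    penetration = -position if position < 0.0 else 0.0
    if penetration == 0.0:
        return 0.0  # free space: no force
    force = STIFFNESS * penetration - DAMPING * velocity
    return max(force, 0.0)  # the wall can push outward but never pull inward


def simulate(positions, dt=0.001):
    """Compute the commanded force for each sample of a probe trajectory."""
    forces = []
    previous = positions[0]
    for x in positions:
        velocity = (x - previous) / dt  # finite-difference velocity estimate
        forces.append(wall_force(x, velocity))
        previous = x
    return forces
```

In a real device this computation would run inside a fixed-rate servo loop; the 1 ms time step used here is only an assumed value.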

3


Intersensory Interactions

• Intro and metacognitive gap

• Integrating Cognitive Science in design

• Cognitive Architecture

– Modularity and multimodal interaction

• Information hiding-- conflict resolution

• Cognitive impenetrability

• Performance differences between modules

• Recalibration

– Spatial indexes in complex environments

• Multimodal cue matching within modules

We begin with a justification for an increased role for theory in the design of these

more complex interfaces. I will argue that the combination of a large design space and

the structural inability of humans to introspect at the level of sensory and attentional

processes makes conventional design techniques inadequate in these situations. This

is followed by a brief discussion of the challenges of taking information from

Psychology, Kinesiology and other disciplines that may fall under the broad banner of

Cognitive Science into account in designing interactive applications.

4


Vision systems to multimodality

• Ron: Vision systems and subsystems

– Pre-attentive vision (gist, layout, events)

– Attention (grab ~5 objects for processing)

– Combine for “virtual representation”

• Extend system concept to modalities

– Some are similar across modalities

– Some are multimodal

– Some are task-dependent

Extending the visual perception studies described by Ron and applied by Tamara to

multimodal interaction is conceptually simple, since vision is composed of separate

channels that can be thought of as modalities. The move to multimodality is similar to

the move to multiple channels, or systems in vision.

5


Extending to complex worlds

• Lab studies: few events, visual or

auditory

• In contrast to multimodal interfaces

– Virtual worlds

– Augmented reality

– Ubiquitous computing

• How are multiple multimodal events

dealt with in the brain?

The previous studies looked at relatively simple environments by our standards (but

complex from the standpoint of psychophysics!). How can we extend these methods

to more complex environments?

6


Some systems are multimodal

Example: Cross-modal speech system

• Reduces cognitive load

– Fast, effortless information processing

• Near-optimal information integration

between cues and sensory modalities

– Fuzzy logic cue integration

– Bayesian categorization

Modularity of processing has some advantages-- fast, effortless processing of multiple

sensory channels. It comes at a cost of lack of cognitive control and lack of access to

early-stage representations.
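As one illustration of fuzzy logic cue integration, the sketch below uses the two-alternative form of the Fuzzy Logical Model of Perception: each modality's degree of support for a category is multiplied, then normalized against the competing alternative. The function name `flmp_two_alternative` and the example support values are assumptions for illustration, not values from the studies described here.

```python
def flmp_two_alternative(auditory_support: float, visual_support: float) -> float:
    """Probability of reporting category A given two sources of support.

    Each argument is a fuzzy truth value in [0, 1]: how strongly that
    modality supports category A (support for B is taken as 1 - value).
    """
    for value in (auditory_support, visual_support):
        if not 0.0 <= value <= 1.0:
            raise ValueError("support values must lie in [0, 1]")
    support_a = auditory_support * visual_support
    support_b = (1.0 - auditory_support) * (1.0 - visual_support)
    total = support_a + support_b
    if total == 0.0:
        return 0.5  # the sources flatly contradict each other
    return support_a / total


if __name__ == "__main__":
    # An ambiguous sound (0.5) paired with a clear face (0.9) follows the face.
    print(flmp_two_alternative(0.5, 0.9))
    # Two moderately informative cues reinforce each other.
    print(flmp_two_alternative(0.7, 0.7))
```

With equal priors, this multiplicative rule coincides with Bayesian combination of independent cues, which is one reason the two approaches named on the slide are often discussed together.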

7


Illusory conjunctions occur in artificial multimodal environments

Example: Movie theatre

• The McGurk effect (face influences

sound)

– Dubbed movie

• The ventriloquist effect (vision

captures sound location)

– Sound seems to come from actor

Another area where we have conducted research deals with a basic question in

multimodal perception-- how do the different sensory channels decide what stimuli

get “matched” between vision, hearing, and touch? And, once matched, how are they

integrated and resolved into a multimodal or transmodal percept?

If, as we have maintained, much of our perceptual processing takes place in systems,

stimulus matching and fusion have to take place in multiple systems in parallel-- so

how do they come up with the same answer?

8

Seeing and Hearing Events (Fisher) 8 Rensink

Feedback from higher-level areas allows a small number of proto-objects to be

stabilized.

[Note: These links may be related to (or even the same as?) the FINSTs in Brian

Fisher’s talk]

9


Study: Pointing to sounds

• Cognitive location good

• Pointing shows visual

capture

– Aware of visual and

auditory locations, but

point to visual

• No effect of

phoneme/viseme fit on

ventriloquism

• Slow recalibration of

auditory space if offset

constant

“Ba”

What was sound?

Where was source?

(point or describe)

Looking at the impact of this on perception in complex display environments with

multiple visual and auditory events that contain errors in location and category fit

generates some counter-intuitive findings.

Motor performance is typically found to be less sensitive to illusions than

cognitive processes; however, that does not seem to be the case for auditory

localization. It seems that the dorsal system has a greater drive to combine visual

and auditory events, leading to a greater tolerance for location errors for

reaching than for voice measures.

In pointing to visual targets in the presence of a visual distractor, we found few

illusory errors when users were not allowed to see their hands (such as when

using a head-mounted display), but closed loop pointing had errors similar to

those observed with vocal interactions. Giving them a visible cursor on the screen

hurt performance, and adding a lag to the response of the cursor actually aided

performance.

10


Our interpretation: 2 systems at work

• Different multimodal systems solve

feature assignment problem differently

– Motor system: high visual dominance

– Cognitive system: low visual dominance

– Phoneme/viseme mismatch doesn’t help

• Vision can recalibrate spatial sound “map”

11

Seeing and Hearing Events (Fisher) 11 Rensink

Feedback from higher-level areas allows a small number of proto-objects to be

stabilized.

[Note: These links may be related to (or even the same as?) the FINSTs in Brian

Fisher’s talk]

12


Attentional pointers in systems?

Cognitive processing

Action (motor space)

Auditory

localization

Visual

localization

If we look at the impact of these attentional tokens, or FINSTs, in multimodal

perception in rich sensory environments, we can see two views of how they might

work. The naïve view is that the information from two events is “tagged” by a FINST

and is reassembled in cognition after the sensory processes have done their work.

13


Displays and multimodal perception

• Immersive environments must display a complex multimodal world

– Virtual Reality must provide entire world

– Augmented Reality must blend with real world

• Multimodal displays have errors

– Location of events is not precise (esp. in depth)

– Timing is not precise

– Graphics can be low-fidelity

• What will be the impact of these errors on users?

Logically, we would assume that correspondence in time and location are critical, and

goodness-of-fit (how plausible the match is conceptually) might play a role as well.

However, these are precisely the areas where errors in scene rendering, stereo conflicts,

and poor synchrony might cause errors.

14


Disadvantage: Each module must solve the feature assignment problem.

• Modules can’t accept information from other modules: Information encapsulation

• Different modules should have access to a different set of matching cues.

• Illusory conjunctions can occur in

multimodal environments:

– Phoneme perception: The McGurk effect

– Auditory localization: The ventriloquist effect

Another area where we have conducted research deals with a basic question in

multimodal perception-- how do the different sensory channels decide what stimuli

get “matched” between vision, hearing, and touch? And, once matched, how are they

integrated and resolved into a multimodal or transmodal percept?

If, as we have maintained, much of our perceptual processing takes place in modules,

stimulus matching and fusion have to take place in multiple modules in parallel-- so

how do they come up with the same answer?

15


Study: Impact of display errors on

multimodal perception

• Immersive environments typically have display errors

– Location of events is not precise

– Timing is not precise

– Graphics can be low-fidelity

• As immersive environments add sound and touch, what will be the impact of these errors?

Logically, we would assume that correspondence in time and location are critical, and

goodness-of-fit (how plausible the match is conceptually) might play a role as well.

However, these are precisely the areas where errors in scene rendering, stereo conflicts,

and poor synchrony might cause errors.

16


Recalibration by pairing (Epstein, 1975)

• Individual senses adapt to display

• Sensory modalities calibrate each other: haptics,

vision, sound

– Observed actions calibrate visual space (space

constancy)

– Vision calibrates hearing for the location of a multimodal

event

– Sound calibrates vision for the time of a multimodal event

• Result is an after-effect: remapping of auditory

(visual, haptic) space

One promising theory examines how the basic characteristics of space and time are

compared between modalities in order to calibrate them against one another. This

fundamentally depends on the consistencies of events in the real world. How will

virtual worlds affect this process?
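The slow recalibration described above can be pictured with a toy model. The sketch assumes a simple error-driven (delta-rule) update in which each paired audio-visual event shifts the auditory map a small fraction of the way toward the visual location. The update rule, the `rate` parameter, and the function names are assumptions made for illustration, not a model taken from Epstein (1975).

```python
def recalibrate(offset_deg: float, n_events: int, rate: float = 0.05) -> float:
    """Return the learned auditory shift (degrees) after n paired events.

    offset_deg: constant displacement of the visual event from the sound.
    rate: assumed fraction of the remaining discrepancy corrected per event.
    """
    shift = 0.0
    for _ in range(n_events):
        discrepancy = offset_deg - shift  # remaining audio-visual disagreement
        shift += rate * discrepancy       # move the auditory map toward vision
    return shift


def perceived_sound_location(true_deg: float, shift: float) -> float:
    """After-effect: a sound presented alone is heard displaced by the shift."""
    return true_deg + shift
```

Under these assumptions the shift only accumulates when the offset stays constant across events, which is the situation a display with a fixed audio-visual misalignment would create.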

17


Impact of information encapsulation

• Multimodal environment with errors in

timing and location

• The same event might give rise to a single

multimodal construct in one task, and two

unimodal events for another.

– Vary location of visual and auditory phonemes

in a simple teleconferencing-style video display

– Vary information carried by using synthetic

speech stimuli (5 levels).

We can intentionally vary these errors to see how observers respond.

18


Ventriloquism meets the McGurk

effect.

• Vary location of visual and auditory phonemes in a

simple teleconferencing-style video display

• Vary information carried by using synthetic speech

stimuli (5 levels).

• Subjects report sound location and syllable heard.

• Analyses included testing a variety of

mathematical models of information integration by

fitting free parameters with STEPIT.

We intentionally introduce errors in category fit by using ambiguous synthetic speech,

and multiple loudspeakers hidden behind a curved screen. So we can vary space and

fit at will. This is a much more complex set of stimuli than are typically used in

perception research, but our interests are in rich sensory environments not

psychophysics and sensory primitives.

The data that this kind of test produces are difficult to analyze with the statistics

typically used in psychology, such as ANOVA. Instead, we use a model-fitting analysis.
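A minimal sketch of this kind of model-fitting analysis follows. It assumes the two-alternative fuzzy-logic integration rule as the candidate model, root-mean-square deviation as the badness-of-fit measure, and a crude shrinking-step coordinate search as a stand-in for STEPIT (whose actual algorithm and interface are not reproduced here). The response proportions are synthetic, generated from known parameters purely to show the mechanics; they are not data from the study.

```python
import math


def predict(aud, vis):
    """Fuzzy-logic integration: P(response A) for each auditory x visual cell."""
    return [[a * v / (a * v + (1 - a) * (1 - v)) for v in vis] for a in aud]


def rmsd(params, observed, n_aud):
    """Root-mean-square deviation between model predictions and data."""
    model = predict(params[:n_aud], params[n_aud:])
    errors = [
        (m - o) ** 2
        for model_row, obs_row in zip(model, observed)
        for m, o in zip(model_row, obs_row)
    ]
    return math.sqrt(sum(errors) / len(errors))


def fit(observed, n_aud, n_vis, step=0.2, tol=1e-6):
    """Coordinate search: nudge one parameter at a time, shrink step when stuck."""
    params = [0.5] * (n_aud + n_vis)
    best = rmsd(params, observed, n_aud)
    while step > tol:
        improved = False
        for i in range(len(params)):
            for delta in (step, -step):
                trial = list(params)
                # Keep parameters strictly inside (0, 1) to avoid division by zero.
                trial[i] = min(0.99, max(0.01, trial[i] + delta))
                score = rmsd(trial, observed, n_aud)
                if score < best:
                    params, best, improved = trial, score, True
        if not improved:
            step /= 2.0
    return params, best


# Synthetic "observed" proportions: 5 auditory levels x 2 visual conditions.
TRUE_AUD = [0.1, 0.3, 0.5, 0.7, 0.9]
TRUE_VIS = [0.2, 0.8]
OBSERVED = predict(TRUE_AUD, TRUE_VIS)

if __name__ == "__main__":
    fitted, error = fit(OBSERVED, n_aud=5, n_vis=2)
    print("fitted parameters:", [round(p, 3) for p in fitted])
    print("RMSD:", round(error, 5))
```

Because each subject's data are fitted separately, the procedure yields one parameter set per person. Note that in this particular model different parameter sets can produce identical predictions, so the fit is judged by the deviation rather than by the parameter values themselves.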

19


Use of mathematical modeling tools

allows us to address

• Sensory input from a number of channels

simultaneously

• How stimuli from multiple channels are

matched and partitioned into mental

representations

• How information from multiple senses is

integrated to give rise to trans-modal

mental events

Modeling lets us extend our experiments to more realistic situations.

20


Results:

• Visual capture of auditory source location, resulting in a

shifting of unimodal auditory location estimation

(ventriloquism after-effect).

• No effect of location difference on phoneme perception as

measured by statistical or modeling tests.

• No correlation between errors in the two tasks (i.e. subjects

could not selectively attend to the auditory phoneme on trials

when visual capture failed).

• Overall, modularity of phoneme perception is supported.

The results supported our hypotheses, but also showed promise in another way-- each

user who ran in the test generated an individual set of values that describe their unique

set of weightings of the perceptual stimuli within the context of the more complex

environment and task.

21


Changes in task interact with modules

• 2 visual systems: “ventral stream” for recognition and “dorsal stream” for action.

• Where vs how

• Different impact of illusions

• Lesion data

One set of modules that is difficult to understand conceptually comes from the

neuroanatomy of vision. Research has shown that there are separate brain areas that

support motor performance and scene understanding.

In mammals, a phylogenetically older dorsal visual system deals primarily with

motor activities. It is a relatively inaccurate system, but it is robust to changes in head,

eye, and body positions, and has access to proprioceptive information to aid in

coordinating perception with action.

A second ventral visual system deals with small field operations that require superior

precision. While it allows for fine discrimination between stimuli, it sacrifices the

proprioceptive input and eye, head and body position information, and the ability to

co-ordinate unseen body parts with visual information.

22


Functional Neuroanatomy of

perception for action.

2 visual systems—“ventral stream” for cognition and

“dorsal stream” for motor performance.

23


2 visual systems lesion evidence

Lesion | Performance deficits | Spared abilities

V1 (blindsight) | detection and identification | pointing

Ventrolateral occipital (DF) | identification, shape recognition, object orientation | object manipulation (orientation matching, grip scaling)

Posterior parietal (RV) | object manipulation (orientation matching, grip scaling) | identification, shape recognition, object orientation

Evidence from brain-damaged patients supports this dissociation in humans.

24


2 visual system illusions

Stimuli | Deficits | Spared abilities

Titchener circles | size report | grip scaling

Displacement during saccade | detection of displacement, location report | pointing

Moving or off-centre frame | induced motion, location report | pointing

25


Study: Pointing in large displays (Po)

• Tell me where the

target is

• Point with no

feedback

• Point with visual

feedback (cursor)

• Point with delayed

visual feedback















26


Findings

1. Can you tell if a target is on the left or right?

• 3 out of 7 males, 7 out of 7 females made errors

2. Can you point to it with no visual feedback?

• 6 out of 10 who failed #1 were correct

3. Are you better with a (simulated) laser pointer?

• Out of 6 who point accurately in 2, all fail

4. Will pointing accuracy be affected if visible pointer lags pointing?

• 3 of the 6 who failed #3 succeed

All results predicted by 2 visual systems hypothesis

27


Research with videoconferencing

and abstract displays

• Targeting sound: cognitive better than

motor

– Subjects aware of visual and auditory

locations, but point to visual

• Targeting vision with context: Less

feedback is better

– Pointing with no visual feedback better

– Lagged cursor better than unlagged















28


Interpreting pointing studies

• Pointing studies counterintuitive, but

predicted by response characteristics

of neurons in the dorsal and ventral streams

to visual and auditory stimuli

See our Smart Graphics 03 paper for more on this study.

29


Extending to complex worlds

• Previous studies in simple worlds, with a

few visual and auditory events

• Multimodal environments are complex

– Virtual worlds

– Augmented reality

– Ubiquitous computing

• How are multiple multimodal events dealt

with in the cognitive architecture?

The previous studies looked at relatively simple environments by our standards (but

complex from the standpoint of psychophysics!). How can we extend these methods

to more complex environments?

30


Indexical cognition (Pylyshyn)

31


Mental representations of complex worlds

• Cognitive architecture perspective requires that links be established between lower level perceptual qualities and cognitive symbols, i.e. a pointer, called a FINST.

• FINSTing allows us to interact with perceptual objects and events without the need for mental images per se.

• Symbolic representation + pointers makes different predictions than intuitive picture-in-the-head

As Ron demonstrated for you earlier, our mental models of complex scenes are not as

complete as we think. He suggested that much of what we think of as our mental

representation is actually in the world, and we sample from it in real time as needed.

The FINST hypothesis is one theory about how we might do that.

32


Indexical cognition (Pylyshyn)

According to this theory, we have a limited number of places in the scene that receive

a high level of processing. The rest of the scene is processed to a much more limited

extent, and if a change is masked as in Ron’s demos, it will go unnoticed.

33


Naïve view of FINSTs in Cognitive Arch

Phoneme

perception

Voice

recognition

Auditory

localization



FINSTs

If we look at the impact of these attentional tokens, or FINSTs, in multimodal

perception in rich sensory environments, we can see two views of how they might

work. The naïve view is that the information from two events is “tagged” by a FINST

and is reassembled in cognition after the sensory processes have done their work.

34


Another view of FINSTs

Phoneme

perception

Voice

recognition

Auditory

localization



FINSTs

An alternative view would avoid the assembly process, and simply use the labels.

35


Multimodal representations are virtual

All modalities store little info in memory:

instead they take up information as needed

– Vision-- attention, eye, head and body

movements change view

– Haptics-- active exploration of space with

hands

– Hearing-- uses body and head movements to

localize sound and improve quality

36


Mental representations of complex environments

• Cognitive architecture perspective requires that

links be established between lower level

perceptual qualities and cognitive symbols—i.e. a

pointer, called a FINST.

• FINSTing allows us to interact with perceptual

objects and events without the need for mental

images per se.

• Symbolic representation + pointers makes

different predictions than intuitive picture-in-the-

head

• Coping with spatial transformations in complex

data spaces

I will conclude with a review of key aspects of the talk, and then ask for questions.

37


More about FINSTs

• FINSTs Link mind & perceptual world

– Visual routines: (collinear, inside, subitizing)

– History of an object

– Object-centred, “sticky”

– Drawn to salient changes-- onsets, luminance

increments, oddballs

– Finite number ~ 4-7

– FINSTs + ANCHORs for motor behaviour

Whichever model is true, there are some repercussions to FINSTing an object.
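The properties listed above suggest a small fixed-capacity index table. The sketch below is only a data-structure analogy, not a model from the FINST literature: the class name `IndexPool`, the default capacity of 4, and the replace-least-salient policy are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class IndexPool:
    """A fixed number of 'sticky' pointers to objects in the scene."""
    capacity: int = 4  # assumed; the slide gives roughly 4-7
    bound: dict = field(default_factory=dict)  # object id -> salience when bound

    def notice(self, object_id: str, salience: float) -> bool:
        """Try to bind an index to a salient object; return True if bound."""
        if object_id in self.bound:
            return True  # sticky: the index stays with its object
        if len(self.bound) < self.capacity:
            self.bound[object_id] = salience
            return True
        weakest = min(self.bound, key=self.bound.get)
        if salience > self.bound[weakest]:
            del self.bound[weakest]  # a more salient event captures an index
            self.bound[object_id] = salience
            return True
        return False  # unindexed: changes to this object may go unnoticed

    def is_indexed(self, object_id: str) -> bool:
        """Report whether an object currently holds one of the indexes."""
        return object_id in self.bound
```

The point of the analogy is the capacity limit: once every index is bound, a new object is tracked only by displacing one that was tracked before.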

38


More about ANCHORs

• ANCHORs link mind & action

– Remembered locations for eye

movements

– Direct interaction with items off the retina

– Fast, robust motor performance by

action routines

– Affordances for action

A second mechanism that we will not be able to spend much time on is a pointer in

motor space called an ANCHOR. This mediates skilled motor performance by

offloading much of the task to low-level perceptuo-motor routines.

39


Multimodal events support adaptation

• Individual senses adapt to display

• Modalities use multimodal events for

cross-calibration

– Observed actions calibrate visual space

– Vision calibrates sound location

– Sound calibrates vision for time

• Result includes after-effect: a

remapping of perceptual space

(Epstein, 1975)

One promising theory examines how the basic characteristics of space and time are

compared between modalities in order to calibrate them against one another. This

fundamentally depends on the consistencies of events in the real world. How will

virtual worlds affect this process?

40


Research question: Role of focal attention?

Are attentional resources shared

between senses?

• Will adding sound and haptics impact

visual attention?

– Or, will it offload processing from vision?

• Does a shift in one modality cause

complementary attention shifts in

others?

• Does recalibration require attention?

As the need grows for interfaces that make extremely efficient use of limited perceptual

resources, sharing of attention becomes something we need to understand better.

There’s been quite a bit of study about attentional distribution within vision; less with

audition, and virtually none with touch. Even less studied is attention as shared among

senses. If, for example, we plan to offload the visual sense by delivering information

haptically, we had better know whether this transfer of work will actually unload total

attention required – or make the situation even worse. A group at UBC is working on

this problem right now.

41


Research Topic: Pointers for action?

• Attentional pointers link mind and world

• Do “action pointers” link mind &

muscles?

– Remembered locations for eye movements

– Direct interaction with items off the retina

– Fast, robust motor performance by action

routines

A second mechanism that we will not be able to spend much time on is a pointer in

motor space called an ANCHOR. This mediates skilled motor performance by

offloading much of the task to low-level perceptuo-motor routines.

42


Research Topic: Individual

differences

• Perceptual rules are the same

• Impact differs over time and for individuals

– e.g. sensitivity to stereo depth & spatial sound cues

– Ability to adapt to new cue combinations

• Perceptual customization may help

– For individuals: “personal equation” for interaction

– In real time, through attentive computing

The “personal equation” was an invention of Friedrich Bessel, who died in 1846. Modern precision astronomy is essentially Bessel’s creation. In astronomy, the personal equation is the amount by which a measurement made by a particular individual differs from a standard (usually the mean of other observers’ measurements). It is essentially a fudge factor that compensates for the characteristic deviations of a particular individual. This controls for the constant part of measurement error (between-subject error), leaving trial-by-trial measurement errors (within-subject error). The concept of a personal equation was an important component of Wundt’s psychophysics.

A personal equation of interaction can be thought of as solving the personal equation for the individual: instead of modifying the measurement to better match objective reality, we modify reality (or its simulation) to better fit the individual’s perceptual, attentive, and cognitive characteristics.
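The arithmetic behind the personal equation can be sketched as follows. The function names are hypothetical, and the choice of the mean of the other observers as the standard follows the description above. At least two observers are needed for the comparison to be defined.

```python
from statistics import mean


def personal_equations(readings: dict) -> dict:
    """Constant offset of each observer from the mean of the other observers.

    readings maps an observer name to that observer's repeated measurements
    of the same quantity. Requires at least two observers.
    """
    observer_means = {name: mean(values) for name, values in readings.items()}
    equations = {}
    for name, own_mean in observer_means.items():
        others = [m for other, m in observer_means.items() if other != name]
        equations[name] = own_mean - mean(others)  # between-subject error
    return equations


def within_subject_errors(values):
    """Trial-by-trial deviations left once the observer's own mean is removed."""
    centre = mean(values)
    return [v - centre for v in values]
```

A personal equation of interaction would apply the offset in the opposite direction: rather than correcting the reading, the display would be adjusted by the individual's offset.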

43


Module disadvantages

• Coordination

– Distortions in location, timing, and category-relevant information may lead to the formation of conflicting representations in different modules.

• Processing inflexibility

– Errors and conflicts within a module can create errors and increase cognitive load. (CRT flicker example)

• Information hiding

– Cognitive impenetrability of modules makes it difficult for operators to determine the reasons for their poor performance.

44


Future challenges

• Perception, cognition, & action in multimodal environment

with many events and actors

• Applications in entertainment, cognition, communication

• Blend of virtual and real spaces… with seams

– Are the rules consistent?

– Can users shift between them?

– Can frames support rule shifts?

Thus, large screen, multimodal and virtual environments, augmented reality,

ubiquitous computing etc. pose difficult problems for designers.

45


Opportunities for creative design

• Environments: Affordances for exploration

– Spatial cognition, human space constancy theory

• Support for creative & logical thinking

– Problem solving, embodied cognition models

• Media-based communication & collaboration

– Metacognition, distributed cognition

• Experience (Kansei) engineering: Moving beyond

usability

46


What to expect in the next talk

• More on haptics

• Other senses

– Neuromuscular, GSR, heart rate, brain, other biopotentials

• Applications

– Displays, input, & sensing technologies

– Design examples

– Virtual environments

• Communicating human experience: information, emotion,

environment

– Intimacy and embodiment

– Sources of aesthetics
