1
Seeing, Hearing, and Touching:
Putting It All Together
Sensory Integration Module
• Seeing and Hearing Events (Fisher)
• Touching, Seeing, and Hearing (MacLean)
• Integrating Applications: Tight Coupling & Physical Metaphors (MacLean)
• Integrating Applications: Designing for Intimacy (Fels)
The morning talks gave a perspective on how vision science can be used to inform the
design of visually complex interfaces such as those used in information visualization
systems. The second half of the course looks at intersensory interactions and how they
can inform the move to multimodal environments. These environments combine
visual, auditory, and haptic displays with a richer set of inputs from users, including
speech, gesture, and biopotentials.
2
Seeing and Hearing Events (Fisher) 2
Moving to multimodality
[Diagram labels: Vision; Hearing; Force & tactile feedback; Virtual interaction model]
Psychophysics of vision, sound, and touch will change when environment is multimodal
Force display technology works by using mechanical actuators to apply forces to the
user. By simulating the physics of the user’s virtual world, we can compute these
forces in real-time, and then send them to the actuators so that the user feels them.
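The force computation described above can be illustrated with a minimal sketch of a one-dimensional virtual wall. This is a toy under stated assumptions, not the rendering method of any particular device: the spring-damper (penalty) model, the name `wall_force`, the 1 kHz update rate, and the stiffness and damping values are all chosen for illustration.

```python
def wall_force(position_m, velocity_m_s, wall_m=0.0,
               stiffness=500.0, damping=2.0):
    """Force (N) a virtual wall exerts on the user's probe.

    Hypothetical penalty model: the wall fills all positions below wall_m.
    stiffness (N/m) and damping (N*s/m) are illustrative values only.
    """
    penetration = wall_m - position_m
    if penetration <= 0.0:
        return 0.0  # probe is in free space: nothing to feel
    force = stiffness * penetration - damping * velocity_m_s
    return max(force, 0.0)  # a wall pushes out, it never pulls in


# One pass of the loop per sample: read position, compute force, command actuator.
dt = 0.001  # assumed 1 kHz update rate
positions = [0.002, 0.001, 0.0, -0.001, -0.002]  # probe moving into the wall (m)
forces = []
previous = positions[0]
for position in positions:
    velocity = (position - previous) / dt
    forces.append(wall_force(position, velocity))
    previous = position
```

The force stays at zero until the probe crosses the wall surface and then grows with penetration depth, which is what the user would feel as a surface.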
3
Intersensory Interactions
• Intro and metacognitive gap
• Integrating Cognitive Science in design
• Cognitive Architecture
– Modularity and multimodal interaction
• Information hiding-- conflict resolution
• Cognitive impenetrability
• Performance differences between modules
• Recalibration
– Spatial indexes in complex environments
• Multimodal cue matching within modules
We begin with a justification for an increased role for theory in the design of these
more complex interfaces. I will argue that the combination of a large design space and
the structural inability of humans to introspect at the level of sensory and attentional
processes makes conventional design techniques inadequate in these situations. This
is followed by a brief discussion of the challenges of taking information from
Psychology, Kinesiology and other disciplines that may fall under the broad banner of
Cognitive Science into account in designing interactive applications.
4
Vision systems to multimodality
• Ron: Vision systems and subsystems
– Pre-attentive vision (gist, layout, events)
– Attention (grab ~5 objects for processing)
– Combine for “virtual representation”
• Extend system concept to modalities
– Some are similar across modalities
– Some are multimodal
– Some are task-dependent
Extending the visual perception studies described by Ron and applied by Tamara to
multimodal interaction is conceptually simple, since vision is composed of separate
channels that can be thought of as modalities. The move to multimodality is similar to
the move to multiple channels, or systems, in vision.
5
Extending to complex worlds
• Lab studies: few events, visual or
auditory
• In contrast to multimodal interfaces
– Virtual worlds
– Augmented reality
– Ubiquitous computing
• How are multiple multimodal events
dealt with in the brain?
The previous studies looked at relatively simple environments by our standards (but
complex from the standpoint of psychophysics!). How can we extend these methods
to more complex environments?
6
Some systems are multimodal
Example: Cross-modal speech system
• Reduces cognitive load
– Fast, effortless information processing
• Near-optimal information integration
between cues and sensory modalities
– Fuzzy logic cue integration
– Bayesian categorization
Modularity of processing has some advantages: fast, effortless processing of multiple
sensory channels. It comes at the cost of cognitive control and of access to
early-stage representations.
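A minimal sketch of one way such cue integration can be computed, assuming the slide's "fuzzy logic cue integration" refers to a multiplicative rule of the kind used in Massaro's Fuzzy Logical Model of Perception. The slide does not spell out the formula, so the rule, the function name `integrate_cues`, and the example support values are assumptions made for illustration.

```python
def integrate_cues(auditory_support, visual_support):
    """Probability of reporting alternative A (e.g. /da/) rather than B (e.g. /ba/).

    Each argument is the degree, between 0 and 1, to which one modality
    supports alternative A; support for B is taken to be the complement.
    """
    a, v = auditory_support, visual_support
    support_for_a = a * v
    support_for_b = (1.0 - a) * (1.0 - v)
    total = support_for_a + support_for_b
    if total == 0.0:
        return 0.5  # the two cues flatly contradict each other
    return support_for_a / total


# An ambiguous sound (0.5) paired with a clear face (0.9) follows the face.
ambiguous_sound_clear_face = integrate_cues(0.5, 0.9)
# Two moderately informative cues reinforce each other.
two_agreeing_cues = integrate_cues(0.8, 0.8)
```

Under this rule the less ambiguous cue has the larger influence on the outcome, and two cues that agree give a more confident answer than either cue alone.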
7
Illusory conjunctions occur in artificial multimodal environments
Example: Movie theatre
• The McGurk effect (face influences
sound)
– Dubbed movie
• The ventriloquist effect (vision
captures sound location)
– Sound seems to come from actor
Another area where we have conducted research deals with a basic question in
multimodal perception-- how do the different sensory channels decide what stimuli
get “matched” between vision, hearing, and touch? And, once matched, how are they
integrated and resolved into a multimodal or transmodal percept?
If, as we have maintained, much of our perceptual processing takes place in systems,
stimulus matching and fusion have to take place in multiple systems in parallel-- so
how do they come up with the same answer?
8
Rensink
Feedback from higher-level areas allows a small number of proto-objects to be
stabilized.
[Note: These links may be related to (or even the same as?) the FINSTs in Brian
Fisher’s talk]
9
Study: Pointing to sounds
• Cognitive location good
• Pointing shows visual capture
– Aware of visual and auditory locations, but point to visual
• No effect of phoneme/viseme fit on ventriloquism
• Slow recalibration of auditory space if offset constant
[Figure: stimulus “Ba”, with the questions “What was sound?” and “Where was source? (point or describe)”]
Looking at the impact of this on perception in complex display environments with
multiple visual and auditory events that contain errors in location and category fit
generates some counter-intuitive findings.
Motor performance is typically found to be less sensitive to illusions than
cognitive processes; however, that does not seem to be the case for auditory
localization. It seems that the dorsal system has a greater drive to combine visual
and auditory events, leading to a greater tolerance for location errors for
reaching than for voice measures.
In pointing to visual targets in the presence of a visual distractor, we found few
illusory errors when users were not allowed to see their hands (such as when
using a head-mounted display), but closed loop pointing had errors similar to
those observed with vocal interactions. Giving them a visible cursor on the screen
hurt performance, and adding a lag to the response of the cursor actually aided
performance.
10
Our interpretation: 2 systems at work
• Different multimodal systems solve
feature assignment problem differently
– Motor system: high visual dominance
– Cognitive system: low visual dominance
– Phoneme/viseme mismatch doesn’t help
• Vision can recalibrate spatial sound “map”
12
Attentional pointers in systems?
[Diagram labels: Cognitive processing; Action (motor space); Auditory localization; Visual localization]
If we look at the impact of these attentional tokens, or FINSTs, in multimodal
perception in rich sensory environments, we can see two views of how they might
work. The naïve view is that the information from two events is “tagged” by a FINST
and is reassembled in cognition after the sensory processes have done their work.
13
Displays and multimodal perception
• Immersive environments must display a complex multimodal world
– Virtual Reality must provide entire world
– Augmented Reality must blend with real world
• Multimodal displays have errors
– Location of events is not precise (esp. in depth)
– Timing is not precise
– Graphics can be low-fidelity
• What will be the impact of these errors on users?
Logically, we would assume that correspondence in time and location are critical, and
goodness-of-fit (how plausible the match is conceptually) might play a role as well.
However, these are precisely the areas where flaws in scene rendering, stereo conflicts,
and poor synchrony might cause errors.
14
Disadvantage: Each module must solve feature assignment problem.
• Modules can’t accept information from other modules: Information encapsulation
• Different modules should have access to a different set of matching cues.
• Illusory conjunctions can occur in
multimodal environments:
– Phoneme perception: The McGurk effect
– Auditory localization: The ventriloquist effect
Another area where we have conducted research deals with a basic question in
multimodal perception-- how do the different sensory channels decide what stimuli
get “matched” between vision, hearing, and touch? And, once matched, how are they
integrated and resolved into a multimodal or transmodal percept?
If, as we have maintained, much of our perceptual processing takes place in modules,
stimulus matching and fusion have to take place in multiple modules in parallel-- so
how do they come up with the same answer?
15
Study: Impact of display errors on
multimodal perception
• Immersive environments typically have display errors
– Location of events is not precise
– Timing is not precise
– Graphics can be low-fidelity
• As immersive environments add sound and touch, what will be the impact of these errors?
Logically, we would assume that correspondence in time and location are critical, and
goodness-of-fit (how plausible the match is conceptually) might play a role as well.
However, these are precisely the areas where flaws in scene rendering, stereo conflicts,
and poor synchrony might cause errors.
16
Recalibration by pairing (Epstein, 1975)
• Individual senses adapt to display
• Sensory modalities calibrate each other: haptics,
vision, sound
– Observed actions calibrate visual space (space
constancy)
– Vision calibrates hearing for the location of a multimodal
event
– Sound calibrates vision for the time of a multimodal event
• Result is an after-effect: remapping of auditory
(visual, haptic) space
One promising theory examines how the basic characteristics of space and time are
compared between modalities in order to calibrate them against each other. This
fundamentally depends on the consistencies of events in the real world. How will
virtual worlds affect this process?
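As a rough illustration of how a constant offset between seen and heard locations could gradually remap auditory space, here is a minimal sketch. It assumes a simple error-correcting update in which each paired exposure shifts the auditory map by a fixed fraction of the remaining conflict. That update rule, the name `recalibrate`, and the learning rate are illustrative assumptions, not the model proposed in the cited theory.

```python
def recalibrate(visual_offset_deg, exposures, learning_rate=0.05):
    """Shift of the auditory map (degrees) after each paired exposure.

    Hypothetical rule: every audio-visual pairing moves the auditory map
    a fixed fraction of the way toward the visual location.
    """
    shift = 0.0
    history = []
    for _ in range(exposures):
        conflict = visual_offset_deg - shift  # remaining audio-visual mismatch
        shift += learning_rate * conflict
        history.append(shift)
    return history


# Sound consistently displayed 10 degrees away from its visual source.
shifts = recalibrate(visual_offset_deg=10.0, exposures=100)
after_effect_deg = shifts[-1]  # shift that persists once vision is removed
```

In this sketch the shift builds up slowly and approaches the size of the offset, so a sound heard alone afterwards would be mislocalized by roughly that amount.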
17
Impact of information encapsulation
• Multimodal environment with errors in
timing and location
• The same event might give rise to a single
multimodal construct in one task, and two
unimodal events for another.
– Vary location of visual and auditory phonemes
in a simple teleconferencing-style video display
– Vary information carried by using synthetic
speech stimuli (5 levels).
We can intentionally vary these errors to see how observers respond.
18
Ventriloquism meets the McGurk
effect.
• Vary location of visual and auditory phonemes in a
simple teleconferencing-style video display
• Vary information carried by using synthetic speech
stimuli (5 levels).
• Subjects report sound location and syllable heard.
• Analyses included testing a variety of
mathematical models of information integration by
fitting free parameters with STEPIT.
We intentionally introduce errors in category fit by using ambiguous synthetic speech,
and multiple loudspeakers hidden behind a curved screen. So we can vary space and
fit at will. This is a much more complex set of stimuli than are typically used in
perception research, but our interests are in rich sensory environments not
psychophysics and sensory primitives.
The data that this kind of test produces are difficult to analyze with conventional psychology
statistics such as ANOVA. Instead, we use a model-fitting analysis.
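A minimal sketch of what a model-fitting analysis of this kind involves: choose free parameters so that a model's predictions deviate as little as possible from the observed response proportions. Everything here is assumed for illustration. The two-parameter logistic model, the synthetic "observed" data generated from known parameters, and the simple coordinate search (a stand-in for a minimization routine such as STEPIT, which is not reproduced here) are not taken from the study.

```python
import math


def logistic(level, midpoint, slope):
    """Predicted proportion of one response at a given stimulus level."""
    return 1.0 / (1.0 + math.exp(-slope * (level - midpoint)))


def rmsd(params, levels, observed):
    """Root mean squared deviation between predictions and observations."""
    midpoint, slope = params
    squared = [(logistic(x, midpoint, slope) - y) ** 2
               for x, y in zip(levels, observed)]
    return math.sqrt(sum(squared) / len(squared))


def coordinate_search(loss, start, step=1.0, tolerance=1e-6, max_iterations=10000):
    """Derivative-free search: nudge one parameter at a time, halve the step when stuck."""
    params = list(start)
    best = loss(params)
    iterations = 0
    while step > tolerance and iterations < max_iterations:
        improved = False
        for i in range(len(params)):
            for delta in (step, -step):
                trial = list(params)
                trial[i] += delta
                value = loss(trial)
                if value < best:
                    params, best, improved = trial, value, True
        if not improved:
            step /= 2.0
        iterations += 1
    return params, best


# Synthetic data: five stimulus levels, responses generated from known parameters.
levels = [1, 2, 3, 4, 5]
observed = [logistic(x, 3.2, 1.5) for x in levels]
fitted, error = coordinate_search(lambda p: rmsd(p, levels, observed), start=[2.0, 1.0])
```

Running the same fit separately for each subject yields one set of parameter values per person, and competing models can be compared by how small a deviation each achieves.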
19
Use of mathematical modeling tools
allows us to address
• Sensory input from a number of channels
simultaneously
• How stimuli from multiple channels are
matched and partitioned into mental
representations
• How information from multiple senses is
integrated to give rise to trans-modal
mental events
Modeling lets us extend our experiments to more realistic situations.
20
Results:
• Visual capture of auditory source location, resulting in a
shifting of unimodal auditory location estimation
(ventriloquism after-effect).
• No effect of location difference on phoneme perception as
measured by statistical or modeling tests.
• No correlation between errors in the two tasks (i.e. subjects
could not selectively attend to the auditory phoneme on trials
when visual capture failed).
• Overall, modularity of phoneme perception is supported.
The results supported our hypotheses, but also showed promise in another way-- each
user who ran in the test generated an individual set of values that describe their unique
set of weightings of the perceptual stimuli within the context of the more complex
environment and task.
21
Changes in task interact with modules
• 2 visual systems: “ventral stream” for recognition and “dorsal stream” for action.
• Where vs how
• Different impact of illusions
• Lesion data
One set of modules that is difficult to understand conceptually comes from the
neuroanatomy of vision. Research has shown that there are separate brain areas that
support motor performance and scene understanding.
In mammals, a phylogenetically older dorsal visual system deals primarily with
motor activities. It is a relatively inaccurate system, but it is robust to changes in head,
eye, and body positions, and has access to proprioceptive information to aid in
coordinating perception with action.
A second ventral visual system deals with small field operations that require superior
precision. While it allows for fine discrimination between stimuli, it sacrifices the
proprioceptive input and eye, head and body position information, and the ability to
co-ordinate unseen body parts with visual information.
22
Functional Neuroanatomy of
perception for action.
2 visual systems—“ventral stream” for cognition and
“dorsal stream” for motor performance.
23
2 visual systems lesion evidence
lesion | performance deficits | spared abilities
V1 (blindsight) | detection and identification | pointing
Ventrolateral occipital (DF) | identification, shape recognition, object orientation | object manipulation (orientation matching, grip scaling)
Posterior parietal (RV) | object manipulation (orientation matching, grip scaling) | identification, shape recognition, object orientation
Evidence from brain-damaged patients supports this dissociation in humans.
24
2 visual system illusions
stimuli | deficits | spared abilities
Titchener circles | size report | grip scaling
displacement during saccade | detection of displacement, location report | pointing
Moving or off-centre frame | induced motion, location report | pointing
25
Study: Pointing in large displays (Po)
• Tell me where the target is
• Point with no feedback
• Point with visual feedback (cursor)
• Point with delayed visual feedback
Looking at the impact of this on perception in complex display environments with
multiple visual and auditory events that contain errors in location and category fit
generates some counter-intuitive findings.
Motor performance is typically found to be less sensitive to illusions than
cognitive processes; however, that does not seem to be the case for auditory
localization. It seems that the dorsal system has a greater drive to combine visual
and auditory events, leading to a greater tolerance for location errors for
reaching than for voice measures.
In pointing to visual targets in the presence of a visual distractor, we found few
illusory errors when users were not allowed to see their hands (such as when
using a head-mounted display), but closed loop pointing had errors similar to
those observed with vocal interactions. Giving them a visible cursor on the screen
hurt performance, and adding a lag to the response of the cursor actually aided
performance.
26
Findings
1. Can you tell if a target is on the left or right?
• 3 out of 7 males, 7 out of 7 females made errors
2. Can you point to it with no visual feedback?
• 6 out of 10 who failed #1 were correct
3. Are you better with a (simulated) laser pointer?
• Out of 6 who point accurately in 2, all fail
4. Will pointing accuracy be affected if visible pointer lags pointing?
• 3 of the 6 who failed #3 succeed
All results predicted by 2 visual systems hypothesis
27
Research with videoconferencing
and abstract displays
• Targeting sound: cognitive better than
motor
– Subjects aware of visual and auditory
locations, but point to visual
• Targeting vision with context: Less
feedback is better
– Pointing with no visual feedback better
– Lagged cursor better than unlagged
Looking at the impact of this on perception in complex display environments with
multiple visual and auditory events that contain errors in location and category fit
generates some counter-intuitive findings.
Motor performance is typically found to be less sensitive to illusions than
cognitive processes; however, that does not seem to be the case for auditory
localization. It seems that the dorsal system has a greater drive to combine visual
and auditory events, leading to a greater tolerance for location errors for
reaching than for voice measures.
In pointing to visual targets in the presence of a visual distractor, we found few
illusory errors when users were not allowed to see their hands (such as when
using a head-mounted display), but closed loop pointing had errors similar to
those observed with vocal interactions. Giving them a visible cursor on the screen
hurt performance, and adding a lag to the response of the cursor actually aided
performance.
28
Interpreting pointing studies
• Pointing studies counterintuitive, but
predicted by response characteristics
of neurons in the dorsal/ventral streams
to visual and auditory stimuli
See our Smart Graphics 03 paper for more on this study.
29
Extending to complex worlds
• Previous studies in simple worlds, with a
few visual and auditory events
• Multimodal environments are complex
– Virtual worlds
– Augmented reality
– Ubiquitous computing
• How are multiple multimodal events dealt
with in the cognitive architecture?
The previous studies looked at relatively simple environments by our standards (but
complex from the standpoint of psychophysics!). How can we extend these methods
to more complex environments?
30
Indexical cognition (Pylyshyn)
31
Mental representations of complex worlds
• Cognitive architecture perspective requires that links be established between lower level perceptual qualities and cognitive symbols, i.e. a pointer, called a FINST.
• FINSTing allows us to interact with perceptual objects and events without the need for mental images per se.
• Symbolic representation + pointers makes different predictions than intuitive picture-in-the-head
As Ron demonstrated for you earlier, our mental models of complex scenes are not as
complete as we think. He suggested that much of what we think of as our mental
representation is actually in the world, and we sample from it in real time as needed.
The FINST hypothesis is one theory about how we might do that.
32
Indexical cognition (Pylyshyn)
According to this theory, we have a limited number of places in the scene that receive
a high level of processing. The rest of the scene is processed to a much more limited
extent, and if a change is masked as in Ron’s demos, it will go unnoticed.
33
Naïve view of FINSTs in Cognitive Arch
[Diagram labels: Phoneme perception; Voice recognition; Auditory localization; Cognitive processing; Action (motor space); FINSTs]
If we look at the impact of these attentional tokens, or FINSTs, in multimodal
perception in rich sensory environments, we can see two views of how they might
work. The naïve view is that the information from two events is “tagged” by a FINST
and is reassembled in cognition after the sensory processes have done their work.
34
Another view of FINSTs
[Diagram labels: Phoneme perception; Voice recognition; Auditory localization; Cognitive processing; Action (motor space); FINSTs]
An alternative view would avoid the assembly process, and simply use the labels.
35
Multimodal representations are virtual
All modalities store little info in memory:
instead they take up information as needed
– Vision-- attention, eye, head and body
movements change view
– Haptics-- active exploration of space with
hands
– Hearing-- uses body and head movements to
localize sound and improve quality
36
Mental representations of complex environments
• Cognitive architecture perspective requires that
links be established between lower level
perceptual qualities and cognitive symbols—i.e. a
pointer, called a FINST.
• FINSTing allows us to interact with perceptual
objects and events without the need for mental
images per se.
• Symbolic representation + pointers makes
different predictions than intuitive picture-in-the-
head
• Coping with spatial transformations in complex
data spaces
I will conclude with a review of key aspects of the talk, and then ask for questions.
37
More about FINSTs
• FINSTs link mind & perceptual world
– Visual routines: (collinear, inside, subitizing)
– History of an object
– Object-centred, “sticky”
– Drawn to salient changes-- onsets, luminance
increments, oddballs
– Finite number ~ 4-7
– FINSTs + ANCHORs for motor behaviour
Whichever model is true, there are some repercussions to FINSTing an object.
38
More about ANCHORs
• ANCHORs link mind & action
– Remembered locations for eye
movements
– Direct interaction with items off the retina
– Fast, robust motor performance by
action routines
– Affordances for action
A second mechanism that we will not be able to spend much time on is a pointer in
motor space called an ANCHOR. This mediates skilled motor performance by
offloading much of the task to low-level perceptuo-motor routines.
39
Multimodal events support adaptation
• Individual senses adapt to display
• Modalities use multimodal events for
cross-calibration
– Observed actions calibrate visual space
– Vision calibrates sound location
– Sound calibrates vision for time
• Result includes after-effect: a
remapping of perceptual space
(Epstein, 1975)
One promising theory examines how the basic characteristics of space and time are
compared between modalities in order to calibrate them against each other. This
fundamentally depends on the consistencies of events in the real world. How will
virtual worlds affect this process?
40
Research question: Role of focal attention?
Are attentional resources shared
between senses?
• Will adding sound and haptics impact
visual attention?
– Or, will it offload processing from vision?
• Does a shift in one modality cause
complementary attention shifts in
others?
• Does recalibration require attention?
As the need grows for interfaces that make extremely efficient use of limited perceptual
resources, sharing of attention becomes something we need to understand better.
There’s been quite a bit of study about attentional distribution within vision; less with
audition, and virtually none with touch. Even less studied is attention as shared among
senses. If, for example, we plan to offload the visual sense by delivering information
haptically, we had better know whether this transfer of work will actually unload total
attention required – or make the situation even worse. A group at UBC is working on
this problem right now.
41
Research Topic: Pointers for action?
• Attentional pointers link mind and world
• Do “action pointers” link mind &
muscles?
– Remembered locations for eye movements
– Direct interaction with items off the retina
– Fast, robust motor performance by action
routines
A second mechanism that we will not be able to spend much time on is a pointer in
motor space called an ANCHOR. This mediates skilled motor performance by
offloading much of the task to low-level perceptuo-motor routines.
42
Research Topic: Individual
differences
• Perceptual rules are the same
• Impact differs over time and for individuals
– e.g. sensitivity to stereo depth & spatial sound cues
– Ability to adapt to new cue combinations
• Perceptual customization may help
– For individuals: “personal equation” for interaction
– In real time, through attentive computing
The “personal equation” was an invention of Friedrich Bessel, who died in 1846. The modern astronomy of precision is essentially Bessel’s
creation. In astronomy the personal equation is the amount by which a measurement made by a particular individual differs from a
standard (usually the mean of other observers’ measurements). It is essentially a fudge factor that compensates for the characteristic
deviations of a particular individual. This controls for the constant part of measurement error (between-subject error), leaving trial-by-trial
measurement errors (within-subject error). The concept of a personal equation was an important component of Wundt’s psychophysics.
A personal equation of interaction can be thought of as solving the personal equation for the individual: instead of modifying the
measurement to better match objective reality, we modify reality (or its simulation) to better fit the individual’s perceptual, attentive
and cognitive characteristics.
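The arithmetic behind a personal equation can be sketched in a few lines. The readings below are made-up illustrative numbers, and the function names `personal_equation` and `remove_offset` are hypothetical. The sketch only shows how the constant part of an observer's error is separated from the trial-by-trial part.

```python
from statistics import mean


def personal_equation(observer_values, reference_values):
    """Mean signed deviation of one observer from the reference standard."""
    return mean(o - r for o, r in zip(observer_values, reference_values))


def remove_offset(observer_values, offset):
    """Subtract the observer's constant offset from each measurement."""
    return [value - offset for value in observer_values]


# Illustrative numbers only: three readings by one observer and the matching standard.
observer = [10.3, 12.2, 9.4]
standard = [10.0, 12.0, 9.0]

offset = personal_equation(observer, standard)  # constant (between-subject) part
corrected = remove_offset(observer, offset)
residuals = [c - s for c, s in zip(corrected, standard)]  # within-subject part
```

A personal equation of interaction would apply the same offset in the other direction: rather than correcting the observer's reports, the display would be adjusted by the offset so that it fits the observer.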
43
Module disadvantages
• Coordination
– Distortions in location, timing, and category-relevant information may lead to the formation of conflicting representations in different modules.
• Processing inflexibility
– Errors and conflicts within a module can create errors and increase cognitive load. (CRT flicker example)
• Information hiding
– Cognitive impenetrability of modules makes it difficult for operators to determine the reasons for their poor performance.
44
Future challenges
• Perception, cognition, & action in multimodal environment
with many events and actors
• Applications in entertainment, cognition, communication
• Blend of virtual and real spaces… with seams
– Are the rules consistent?
– Can users shift between them?
– Can frames support rule shifts?
Thus, large screen, multimodal and virtual environments, augmented reality,
ubiquitous computing etc. pose difficult problems for designers.
45
Opportunities for creative design
• Environments: Affordances for exploration
– Spatial cognition, human space constancy theory
• Support for creative & logical thinking
– Problem solving, embodied cognition models
• Media-based communication & collaboration
– Metacognition, distributed cognition
• Experience (Kansei) engineering: Moving beyond
usability
46
What to expect in the next talk
• More on haptics
• Other senses
– Neuromuscular, GSR, heart rate, brain, other biopotentials
• Applications
– Displays, input, & sensing technologies
– Design examples
– Virtual environments
• Communicating human experience: information, emotion,
environment
– Intimacy and embodiment
– Sources of aesthetics