3D-VITA
Robrecht Jurriaans
3D Visual Information to Audio
3D-VITA: Sensory Substitution Device Employing Virtual Synaesthesia Based on Common Weak Associations between Visual Appearance and Auditory Stimuli
A thesis submitted in conformity with the requirements for the degree of
MSc. in Artificial Intelligence
Robrecht Constantijn Jurriaans, [email protected]
BSc Artificial Intelligence, Universiteit van Amsterdam, 2011
Supervisor: Jan van Gemert
Informatics Institute, Faculty of Science, Universiteit van Amsterdam
Science Park 904, 1098 XH Amsterdam
2014
Abstract
In this thesis a novel sensory substitution device is proposed. It takes visual information from an RGB-D camera (Microsoft Kinect) and maps the information to an audio signal. This is achieved by segmenting the images taken with the RGB-D camera into superpixels using the SLIC algorithm. A visual descriptor is made for each superpixel according to the standard features used in state-of-the-art material recognition systems. These descriptors are mapped to a set of parameters for audio synthesis based on literature on weak synaesthesia and etymological connections. Finally, the sounds are played through a set of headphones using HRTF-enabled binaural sound. The camera is mounted on top of the headphones, enabling the user to scan the environment. The system is evaluated in an object detection task and a virtual localisation task. In the virtual task, the user is given a sequence of stimuli from three fixed locations and is asked to move his or her gaze towards where the sound is coming from; this task determines whether the user is capable of hearing binaural sound. In the object detection task, a large object is moved around the user, who is asked to determine where the object is located in the scene by moving his or her gaze towards where they think the object is. The correlation between stimulus and reaction is determined for each user and normalised with the stimulus-reaction correlation from the virtual task. The results show that the system has a slight bias to the right, meaning that the virtual audio space is rotated to the right around the user. Furthermore, the localisation task shows that the system provides a sense of distal attribution without the necessity of training. In conclusion, the system is capable of supplying the user with visual information through audio in such a way that the user experiences the visual information.
“Just because a man lacks the use of his eyes doesn’t mean he lacks vision.”
— Stevie Wonder (1950–)
Chapter 1
Introduction
Imagine yourself, standing on New York’s famous Times Square.
Now imagine yourself standing there without the ability to see, hear or even feel.
Try to get to the other side, through the chaos that this place is known for.
As humans, we are strongly dependent on our ability to perceive the world. To do this we
have evolved incredibly complex systems that are capable of retrieving information from the
environment around us. With our eyes we can detect light strength and even the frequency of the
light-wave within the range of approximately 430 THz–790 THz. Our ears can detect pressure
differences, down to 20 µPa, which allows us to hear the faintest of movement. With our nose
we are capable of sensing the composition of the air we are breathing. Our tongue can analyse
what we are eating. Our entire body is covered with pressure-sensitive nerves which allow us to
feel pressure, temperature and all kinds of other properties of the air around us. These are the
five senses, but humans possess even more senses than these basic five. We also have a sense for
acceleration, balance, orientation, time, pain, hunger, thirst and a plethora of information about
our internal state. All of this information is measured by sensory cells and converted into an
electrical signal that is sent to the brain. Our brain then integrates all of this information which
allows us to experience the world around us.
With a system as complex as the brain in charge of perception, it is nothing short of a miracle
that it usually works quite well. However, due to the complexity and the sensitivity of our sensory
system, it is not surprising that parts of the system may fail. Sensory impairment can arise when a sense organ does not function, in the connection between the organ and the brain, or even in the processing within the brain itself. Although some forms of sensory impairment
result in minor inconveniences, such as anosmia, others can have a big impact on the autonomy
of an individual, such as blindness, vertigo and disequilibrium. This is mainly due to how we, as humans, have arranged the world around us: we have shaped our day-to-day infrastructure to cater to able-bodied individuals. More often than not, this leads to problems for
those that have disabilities. However, the issues with sensory impairment do not necessarily arise
from the lack of sensory input, e.g. the inability to see or hear, but rather from the diminished
amount of information that the individual can use. If the information that would normally be
acquired through sight, would be available to a blind individual, being blind would be less of a
problem. To facilitate this, sensory substitution devices can be used.
A sensory substitution device is a device, or system, that takes information from one modality
and converts it into a signal in another modality. This is done so that the information can be
perceived by a person with an impairment in the first modality. For instance, a blind person is
not capable of sensing visual information. By mapping the visual information to an audio signal,
he or she gets access to this visual information. The idea is that there is a difference between
sensing and perceiving.
Definition:
Sensing is what your body does. It is a reaction of one of your organs to external or internal stimuli, which are converted into an electrical signal that is sent to the brain.
Perceiving is what your brain does. It takes the sensory input and converts it into a mental representation.
It follows that perception can happen without sensing and vice versa. For instance, hallucinations
occur when we perceive something we cannot have sensed. Sensing without perceiving occurs
more frequently and is often caused by our attention blocking out these signals. In the case of
audible-vision sensory substitution we may perceive visual information without actually sensing
the visual information through our visual system, but rather sensing the auditory signal which
conveys the same information or an approximation of what our vision would have sensed. The
reason for perception to still be possible is that perception is done by the brain. More specifically,
the brain integrates all sensory input into a single representation of the world. So in a sense,
being blind is not an issue due to the lack of seeing, or even a lack of detecting light, but rather
due to the information that we take from light not being available in the integration process.
Using sensory substitution devices circumvents this problem by making that information accessible
to the brain and thus augmenting perception with the missing information.
Although sensory substitution devices are usually designed for people with a sensory impairment,
there are also many other possible applications of sensory substitution. Using more elaborate
sensors we can augment the “standard” set of sense organs and thus augment our perception of
reality. A large group of bio-hackers use magnetic implants which allow them to perceive magnetic fields in the form of pressure; this is not to be confused with magnetoception, a somewhat weak sense allowing humans to feel their orientation relative to the earth. This magnetic sense works because the implanted magnet moves slightly in response to magnetic fields, allowing the user to perceive these fields through touch.
Figure 1.1: A screenshot from the game Mark of the Ninja in which sound perception is improved by visual representations such as the transparent circles representing how far sound has travelled
Figure 1.2: Ben Underwood, who is fully blind, can skate using his extraordinary hearing, earning him the nickname “the real-life Batman”
Another possible application is in user interface design. Take for instance the game Mark of the
Ninja by Klei Entertainment (http://www.markoftheninja.com). In Mark of the Ninja the
player plays as a highly skilled ninja with excellent stealth capabilities. To facilitate this feel, the
designers have implemented a system which allows users to see sound. Everything that makes a
sound also creates a circle on the screen which indicates how far this sound is audible, as can
be seen in fig. 1.1. Also note the sleeping dog on the right of the image which is shown through
a visualisation of the typical representation of snoring in popular culture in the form of “zzzz.”
Another use of sensory substitution in this game is the visualisation of what enemies can see,
which allows the player to “sense” how visible the avatar is at any given time. The same type of
user interface design can be found in the real world, for instance the use of colour to signify concepts, such as red fonts on letters signifying importance.
There are types of sensory substitution which do not rely on external systems or devices. For
instance, there is a group of people that can use echolocation to sense distance and surface
shape. Ben Underwood, seen in fig. 1.2, was such an individual. He generated “clicks” with
his tongue and listened to the echoes of this sound. His brain interpreted the returning sound
giving him an indication of distances to objects around him. Furthermore, he achieved this
without external devices, allowing him to quickly assess his surroundings and build a mental
representation. However, although this technique can be seen as sensory substitution, with the
echoes functioning as a stimulator substituting a sense of distance, it is also perhaps just a normal
sense taken to its extreme form.
Perhaps the “ninja”-like super-power of Ben Underwood was not a case of sensory substitution,
but more a case of well-developed integration of sensory input. Many of the sensory substitution
devices are designed to use this sensory integration to give the user the experience of perceiving.
Perhaps sensory substitution devices are not just sensory substitution, but rather designed to
utilise this integration, creating a form of synaesthesia. Synaesthesia is the involuntary coupling
of sensory experience. For instance, people with synaesthesia [Linkovski et al., 2012] may read
the character “7” and perceive it as being orange or they may experience notes played on a piano
to have a blueish tint. It is important to note that these associations are involuntary and that it
must not be confused with explicit associations that people may have. An example of an explicit
association is that music played on a steel-drum may remind somebody of a beach. This type
of explicit association is caused by memory and exposure, rather than the implicit involuntary
associations that come with synaesthesia.
The sensory substitution device proposed in this thesis combines the traditional set-up of sensory
substitution devices with synaesthesia-like mapping between vision and audio. The system is
called 3D Visual Information To Audio, or 3D-VITA for short. It uses a camera capable of capturing both colour and depth information, and a 3D auditory illusion based on Head-Related Transfer Functions, so that the 3D spatial information from the scene can be directly encoded in
the sounds. This means that the sounds that are generated only need to encode visual information
and not spatial information. The translation of visual information to audio is guided by common weak synaesthesia as well as etymological similarities within the vocabulary of visual and auditory phenomena. This mapping is based on literature from material recognition and on synaesthetic experiments. The idea is that the most important aspect of the sounds is that they are differentiable [Parise and Spence, 2012]: bright colours must sound bright and dampened colours should sound dampened.
3D-VITA, as introduced in this thesis, is meant for people with a visual impairment and its
main purpose is to facilitate navigation. However, if basic visual recognition is also supported,
navigation might become easier. For instance, being able to move from point A to point B while
avoiding obstacles is a possible navigational task. But being able to actually identify point B
gives the user far more independence. Most sensory substitution devices focussed on navigational
tasks do not take detailed visual information into account, but instead take spatial information
from the visual domain. Some systems do incorporate simple visual cues such as light intensity.
On the other hand, systems that focus on visual appearance generate stimuli that are complex,
such as systems that do object recognition and return the class of object as a speech signal. In
chapter 2 these systems will be discussed in more detail.
Chapter 2
Background
2.1 Sensory Substitution Devices
Sensory Substitution Devices are devices or systems that take stimuli, or sensory data, from one
modality and convert that signal into a signal in another modality. The resulting signal is then
returned to the user. One of the first sensory substitution devices was a system [Bach-Y-Rita
et al., 1969] that took images from a camera and converted them into a tactile signal on the back of the user's head. It is important to note that a sensory substitution device, such as a system that converts video into a tactile signal, does not result in the user sensing [Declerck et al., 2009, Poirier et al., 2007b] what the camera is providing. However, the user could potentially perceive
the visual information as the brain eventually grows accustomed to the new data it is receiving. This
distinction may seem arbitrary, but it lies at the basis of designing such systems. Perception
occurs due to the integration of stimuli from different modalities and therefore sensory substitution
devices result in perception rather than sensing. In the case of Bach-Y-Rita’s [Bach-Y-Rita et al.,
1969] tactile-vision-sensory-substitution, the user may perceive visual information, but senses
only tactile input. This form of perception is called distal attribution [Auvray et al., 2005] which
is the attribution of sensory experience to an external and distinct object.
All sensory substitution devices utilise [Auvray and Myin, 2009, Lenay et al., 1997, Bach-y Rita
and W Kercel, 2003] the same principle, namely that a sensor is used which can retrieve data in the modality of the sense that needs to be substituted. This data is then processed and
converted into data which serves as input to a stimulator. The stimulator acts as an interface
between the device and the user. For instance, in a Text-To-Speech system the sensor takes in text, which is converted into an audio signal that is then output via a speaker acting as the stimulator. Conversely, in a Speech-To-Text system, the sensor is a microphone and the
stimulator is a screen to display the output. There are also types of sensory substitution that do
not need a sensor or a processing unit, but instead are designed to convey information that is
usually found in one modality. Perhaps the most well-known variant of this type is Braille, which takes textual information, which is usually represented visually, and displays it through tactile sensation. Text itself is also a form of sensory substitution which takes an auditory signal and
Figure 2.1: The various possible sensory substitution devices: 1. Audible Vision, 2. Visible Touch, 3. Tactile Vision, 4. Audible Touch, 5. Tactile Hearing, 6. Audible Spatial Awareness, 7. Tactile Balance, 8. Tactile Spatial Awareness, 9. Tactile Sensory Relocation, 10. Visible Audio; excluded are Visible Balance and Visible Spatial Awareness
displays it visually.
In this section an overview is given of sensory substitution devices that substitute vision. Since
vision is far more complex [Calder, 2010, Crary, 2006] and actually comprises several individual senses, e.g. the sense of light intensity, the sense of colour, the sense of depth [Siegle and Warren, 2010],
sense of motion, the systems are categorised by their purpose. That is to say, the type of function
usually provided by vision that they are intended to replace. Further categorisation is done on
the modality of the stimulator that is used. This results in a categorisation as seen in fig. 2.1.
Omitted from this categorisation are the gustatory and olfactory senses due to their isolated and
specific roles in how we perceive the world. In the figure arrows point from the stimulator to the
sensor that is substituted. Note that Visible Balance and Visible Orientation are not represented
in the categorisation. These types of systems can be found in most vehicles, especially airborne,
as indicators for the pilot or operator. First a brief overview is given of the various stimulators
that can be utilised for sensory substitution.
2.2 Stimulators
Tactile Stimulators
There are different methods of returning tactile stimuli to the user. This has to do with the
different physiological premises of how humans perceive touch. There are different types of
nerve-endings that act as sensors and these different sensors react differently to various stimuli.
Some of the sensors respond rapidly to the signal and only respond to changes in the signal,
while others return a more continuous output. The different nerve clusters also differ in the frequencies at which they respond most strongly: the Lamellar Corpuscle has the strongest response at 250 Hz while Merkel Nerve Endings respond most strongly around 5 Hz–15 Hz.
To stimulate these nerves, there are two types of stimulators: electrotactile and vibrotactile.
Electrotactile stimulators use electric signals to directly stimulate the nerves. This can be done
either from the skin, or directly to the nerve. The latter uses less energy, but is also more invasive.
Vibrotactile stimulators use pressure and vibration to stimulate the nerves. The problem with
vibrotactile stimulation is that the nerves that can get activated through this have various spatial
and temporal resolution constraints, depending on where on the body the stimulator is attached.
In recent years, both types of stimulation have been successfully used on the tongue [Williams et al., 2011], whose nerves have a high spatial resolution and require minimal stimulation to be activated.
Visual Stimulators
There is a broad range of visual stimulators which can be divided into two main categories: screens
and implants. Screens consist of one or multiple lights which can be monochrome or have
a set of possible colours. For example, traffic lights consist of three monochrome lights ordered
spatially. Both the colours and the spatial position of the lights convey meaning. Implants are
connected directly to the optical nerve and work by applying patterns of electricity on the optical
nerve which are then processed by the visual cortex.
Auditory Stimulators
Audio is a very strong stimulator due to the human auditory system being able to deal with very
complex and rapidly changing sound patterns [Auvray and Myin, 2009], even in the presence of
noise. Furthermore, the perceptual resolution of the auditory system is very fine for both pitch
and amplitude. In chapter 3 the perception of audio will be discussed in finer detail. A final
advantage of audio as a stimulator as opposed to visual and haptic interfaces is that audio is
relatively low-cost to produce in terms of computation and the energy required by the stimulator.
2.3 Visual Substitution
Reading Substitutional Aids
Language has known many forms of sensory substitution. It is difficult to assess where the exact
boundary lies between normal sensory integration and the field of sensory substitution. Written
and spoken language are in essence substitutions of one another or perhaps both are sensory
substitutions of a mental language. Due to both blindness and deafness, as well as a plethora of
mental conditions including agraphia, dyslexia and alexia, there have been numerous endeavours
[Steele et al., 1989] in substituting modalities of language. For instance, sign language is a
visible-audio substitution.
Braille is a haptic written language consisting of several dots, developed in the 19th century by Louis Braille (1809–1852). Each character consists of six dots in a two-by-three array. Each
dot can have one of two values. Either the dot is raised or it is not. This leads to a total of 63
possible characters.
The Optacon [Goldish and Taylor, 1974] is a reading aid using a haptic stimulator consisting of
6 × 24 vibrating pins. It gives a direct tactile mapping of the image from a small hand-held camera. It is operated by
placing a finger on the vibrating pins and moving the camera over printed text with the other
hand.
Akin to the Optacon, the Stereotoner [Smith, 1972] is intended as a reading aid and uses a
hand-operated camera with a very small field of view. However, the haptic stimulator of the
Optacon is instead replaced with an auditory stimulus. The image from the camera is put
through a line detector and the detected lines are mapped to sounds varying in pitch.
The Kurzweil Reading Machine is a print-to-speech machine [Kurzweil et al., 1990, 2000] consisting
of an omni-font optical character recognition system that uses a flat-bed scanner to convert
printed text to a digital signal which is then given as the input for a text-to-speech device. It is
mainly used for reading printed media. Unlike the Optacon and the Stereotoner, the Kurzweil
Reading Machine preprocesses the signal before activating the stimulator, which relieves the user of processing the raw signals and reduces the training and adaptation period to a minimum.
Later reading aids [Steele et al., 1989] mostly follow the principles behind the Kurzweil Reading
Machine, but have improved on the optical character recognition and text-to-speech synthesis
components. As in the Kurzweil Reading Machine, 3D-VITA employs the same concept of
preprocessing the input signal to create a more compact auditory representation. However,
unlike the Kurzweil Reading Machine, 3D-VITA also uses a more complex mapping instead of a
direct auditory representation which allows for more flexibility in the type of entities that can be
recognised by the user.
Other types of reading aids focus on the addition of meta-data [Xydas et al., 2005] to visual
documents to improve speech synthesis. The idea is that the original information within the
documents, i.e. the raw text data, is not complete enough to perform meaningful substitution and
that the addition of information such as mood, the visual lay-out of the document and the role
of each piece of text, e.g. the header, subtitles and main text, greatly improves understanding.
Adding this meta-data improves audible-vision perception tasks.
Obstacle Avoidance
Navigation is an essential skill for autonomy [Meers and Ward, 2007, Roentgen et al., 2008a,
Giudice and Legge, 2008] and relies heavily on our visual system. As with reading aids, early rudimentary systems relied on substitution to a tactile stimulus.
The Long Cane [Blasch et al., 1996] is not necessarily a tactile-vision sensory substitution device
as its main goal is not to replace vision, but rather to allow the user to successfully navigate a
complex environment. In a sense, it replaces the sense of distance rather than the full visual
system. Similar to Braille, the simple yet effective design of the Long Cane means that it is still in common use although more advanced substitutes are available. As an improvement
on the Long Cane, Laser Canes remove the necessity to physically touch the objects in an
environment. Most Laser Canes utilise some type of distance sensor.
The Mowat sensor [Pressey, 1977] uses a high-frequency sound to detect distances which are used
as the input for a vibrating stimulator. The Nottingham Obstacle Detector [Dodds et al., 1981]
is an early prototype for the Sonic Pathfinder [Heyes, 1984] and utilises a Mowat Sensor. The
Sonic Pathfinder instead employs a form of audible-vision sensory substitution and converts the
detected objects into tones that indicate the structure of the environment.
Combining principles from both the Nottingham Obstacle Detector and the Sonic Pathfinder,
the MiniGuide [Penrod and Simmons, 2005, Kolarik et al., 2013] uses ultrasonic echo-location to
detect objects, mapping the signal to both vibration and sound feedback. Unlike most other
laser canes, the MiniGuide is meant as a secondary aid, to be used together with a primary aid
such as a seeing eye dog or a white cane. As such, the MiniGuide has an easily accessible on/off button, so that the user engages it only when more detail of the environment
is needed. The Ultracane builds upon the concept of the Nottingham Obstacle Detector, but
uses two Mowat sensors [Penrod and Simmons, 2005] connected to two vibrating pads in the
handle. By using two sensors a more detailed signal can be delivered to the user.
The Bat ’K’ Sonar-Cane also utilises [Roentgen et al., 2008a,b] ultra-sonic echo-location which
results in distance measurements represented as beeps of varying pitches. The ultra-sonic device
is mounted on the end of a traditional long cane. By sweeping the Sonar-Cane the user is given
a continuous soundscape representing the distance to various objects in the direct vicinity of the
user.
The TSight [Cancar et al., 2013] uses an infra-red range sensor coupled to vibrating actuators as
stimulators steering the user away from objects that are moving towards the user. The device
extracts time-to-contact distance measurements and maps this to an array of vibrating pads
carried around the waist. This type of sensory substitution is already closer to true vision substitution than the laser canes, as measurements in a 3D environment are converted to a 2D representation. However, as light intensity and colour information are ignored, this device is not an example of true vision substitution. Akin to the Kurzweil Reading Machine, this device
utilises preprocessing of the raw data from the sensor as opposed to using a direct mapping. The
time-to-contact is measured using optical flow and thus relies on motion within the visual field.
The TSight allows users to hit moving targets with the temporal and spatial precision of a sighted person in the same task.
Finally, sensory augmentation has been successfully applied to the task of teleoperated navigation
[Liu and Wang, 2012] by providing auditory feedback. This auditory feedback resulted in significantly fewer collisions with objects and increased precision in navigation.
True Vision Substitution
The type of sensory substitution that enables users to retrieve full visual information from the
environment is called true vision substitution. Unlike sensory substitution devices that only give
one type of visual information, such as reading and obstacle avoidance aids, they try to give a full
overview of the visual information of the scene. These true vision substitution devices are often
implanted devices that take visual information from a camera or photoelectric sensor and convert
this visual signal into an electronic signal which is then injected into either the optical nerve or
via a retinal electrode. Non-implantable devices map the visual 2D information [Auvray and
Myin, 2009, Lenay et al., 1997] to a stimulus of a different modality. These sensory substitution
devices rely on more extensive mapping [Renier et al., 2005] between the visual information and
the modality of the stimulator. Such mapping is a form of synaesthesia [Proulx, 2010] which
utilises the sensory integration performed by the brain to activate the visual cortex. Although
implantable devices are often considered [Bach-y Rita and W Kercel, 2003, Rauschecker, 1995,
Normann et al., 1999] to be better at activating the visual cortex, the procedure is more costly
than most non-implantable devices [Zrenner, 2002] and the invasiveness [Rizzo III et al., 2001,
Lenay et al., 2003] of the procedure is generally less desirable than the non-implantable devices.
Cortical or retinal electrode matrix display
Early work on retinal implants shows great promise [Zrenner, 2002] for using these implants to restore vision. For most implants the reaction of the brain is immediate [Schanze et al., 2007] and users generally report that they feel like they are seeing when
using such devices. These implants work by taking an array of electrodes and implanting these
within either the optical nerve or in nerves that are close to the skin [Caspi et al., 2009]. Various
tasks have been performed successfully by users of such systems, such as reading letters [Zrenner et al., 2011], navigation [Normann et al., 1999], object recognition [Humayun et al., 2003] and orientation [Yanai et al., 2007].
Electrocutaneous and Vibrotactile Displays
The first non-invasive sensory substitution device that targeted true vision substitution was the
TVSS (Tactile-Vision Sensory Substitution) device [Bach-Y-Rita et al., 1969] which utilised an array of hundreds of vibrating pins on the back of the subject which represented the image
as captured in real-time by a camera. The original set-up can be seen in fig. 2.2. From an
initial set of 50 objects, participants were able to achieve a 100 % recognition rate within 100
trials. A later version of this device [Bach-y Rita, 1980] used an electrocutaneous stimulator,
the Tongue Display Unit, which could be worn on the tongue. The main advantages of using
electro-stimulators on the tongue [Ptito et al., 2005, Kaczmarek, 2011, Zelek et al., 2003] are that the tongue has a higher spatial resolution and that stimulating the tongue requires approximately 3 % of the voltage required [Ptito et al., 2008] to stimulate the finger. Another advantage of
tongue stimulation is its effectiveness in treating cortical blindness [Matteau et al., 2010, Kupers et al., 2010], especially in diminishing the effects [Chebat et al., 2007] of early-onset cortical blindness.
Figure 2.2: The original TVSS-device [Bach-Y-Rita et al., 1969]
Definition:
Cortical Blindness refers to patients who have functioning eyes but are unable to see due to inactivity of the occipital cortex, mainly in Brodmann areas 17 (V1), 18 and 19.
Early experiments [White et al., 1970, Guarniero, 1974, Bach-y Rita, 1983] showed that if the
user is given control over the sensor, the user is able to recognise 3D spatial configurations of
objects without extensive training. These results were later reproduced with various sensory
substitution devices [Amedi et al., 2007, Williams et al., 2011]. In the design of 3D-VITA, this
result plays an integral role as the RGB-D camera is mounted on the head of the user.
A major disadvantage of using electrocutaneous or vibrotactile displays [Bach-y Rita, 2004] is
the lack of colour information that can be represented in the signal as provided by the stimulator.
Using tactile stimulators, either light intensity from a camera or depth information from a depth
sensor, such as an infra-red or an ultrasonic sensor, can be represented. Allowing the user to move
the camera gives a faint sense of depth, but the signal remains in essence 1-dimensional. Another
disadvantage of tactile sensors is that they are, although not invasive, more intrusive than for
instance auditory stimulators.
Auditory displays
There are various types of auditory displays, i.e. audio speakers, which can be either in-ear or external. The main advantage of such auditory displays is the range of signals they can produce. Systems such as the Kurzweil Reading Machine use speech synthesis to
convey the information from the substituted sense. Other systems use either analogue samples or
synthesised sounds. One of the most well-known sensory substitution devices that utilises sounds
is the vOICe [Amedi et al., 2007, Auvray et al., 2007] which uses grey-scale images captured by
a camera and converts these into sound-scapes. The sound-scape is created by sweeping over the
image from left to right. Each y-value, or row, in the image corresponds to a fixed pitch with
the bottom row being the lowest pitch and the top row being the highest. The intensity value of
each pixel determines the amplitude of that pitch for that column and all columns are played
from left to right within a one second interval. This results in a continuous sound-scape that is
very rich [Proulx et al., 2008] in information as it conveys all of the spatial information from the
image. Studies on the vOICe have shown that the system is very effective. For instance, during
a reading task the visual word form area of the brain is significantly activated [Striem-Amit
et al., 2012] and during shape recognition tasks the lateral occipital complex [Amedi et al., 2007,
Merabet et al., 2009, Ward and Meijer, 2010] is activated. Furthermore, users of the vOICe have
reported [Proulx, 2010] different types of visual phenomena, even after the experiments.
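The sweep mapping described above can be made concrete with a short sketch. The snippet below is purely illustrative and not the vOICe's actual implementation; the frequency range, the one-second sweep and the normalisation are assumed values chosen for readability:

import numpy as np

def image_to_soundscape(img, duration=1.0, sr=44100, f_low=500.0, f_high=5000.0):
    # Rows map to fixed pitches (bottom row lowest), pixel intensity maps
    # to amplitude, and columns are played left to right within `duration` s.
    n_rows, n_cols = img.shape
    samples_per_col = int(sr * duration / n_cols)
    freqs = np.linspace(f_low, f_high, n_rows)[::-1]  # row 0 = top = highest pitch
    t = np.arange(samples_per_col) / sr
    chunks = []
    for col in range(n_cols):
        amps = img[:, col].astype(float) / 255.0      # intensity -> amplitude
        chunk = (amps[:, None] * np.sin(2 * np.pi * freqs[:, None] * t)).sum(axis=0)
        chunks.append(chunk)
    signal = np.concatenate(chunks)
    return signal / max(np.abs(signal).max(), 1e-9)   # normalise to [-1, 1]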
Extended usage of sensory substitution devices like the vOICe has been shown to lead to structural reorganisation of the brain [Arno et al., 1999, Collignon et al., 2007] during recognition tasks, depth perception tasks [Renier et al., 2005] and geometry categorisation tasks [Pollok et al., 2005].
et al., 2005]. These changes have been found in the occipital cortex [Arno et al., 2001a] as well
as in the parietal cortex [Kim and Zatorre, 2008]. Furthermore, refinements of the auditory-
responsive areas in the parietal cortex as well as refinements [Rauschecker, 2001, Kim and
Zatorre, 2008] in the selectivity of neurons in the auditory cortex have been found in early-blind individuals
after usage of sensory substitution devices. An interesting observation [Poirier et al., 2007a] is
the increased activation in the dorsal and ventral extra-striate areas of cortical blind subjects
that has been found, which suggests that perception through audible-vision sensory substitution
devices could be visual-like in nature. These studies are based on sensory substitution devices following an inverse model of the cochlea [Capelle et al., 1998], which suggests that following biologically plausible constructions is beneficial for neural activation during usage of a sensory substitution device. Another potential hint at the visual-like nature of perception of
audible-vision sensory substitution devices is the presence of certain visual illusions, such as the
Ponzo illusion [Renier et al., 2004] and occlusion illusions [Jacomuzzi and Bruno, 2006], which suggest that perception through these substituted channels occurs within the same areas as visual perception originally would.
An important improvement over systems such as the vOICe is the addition of preprocessing of
the image by finding salient spots in the image. This can be done by either using a retina-like
smoothing of the image [Arno et al., 2001b], where the centre of the image retains its original
sharpness but pixels towards the edges of the image are blurred, or by using algorithms such as
a neural network [Lescal et al., 2013] to detect salient areas within the image which are then
converted into audio stimuli.
2.4 Design Choices for 3D-VITA
For 3D-VITA, the design has been heavily influenced by the existing sensory substitution devices.
The system uses a 3D camera to obtain images as well as depth measurements combining the
obstacle avoidance aids such as the Sonic Pathfinder [Heyes, 1984] with true vision systems
such as the vOICe [Amedi et al., 2007]. The images are preprocessed, akin to systems such as
the KRM [Steele et al., 1989] and the later additions to the vOICe [Arno et al., 2001b, Lescal
et al., 2013], by clustering pixels into superpixels, which represent areas with similar colours and
structure. For each superpixel various features are calculated including colour, light intensity,
saturation, texture and depth. Each feature vector is then given as input to a synthesiser which
converts the features into an audio sample. The samples are then played back each second to the
user using 3D sound via headphones. The 3D sound is achieved using a binaural illusion based
on head-related transfer functions. This pipeline is discussed in more detail in chapter 4.
Chapter 3
Underlying Theory
3.1 Sound Perception
Sound consists of waves propagating through space. These waves are caused by the displacement of (air) molecules, which vibrate and cause nearby molecules to vibrate at the same frequency, resulting in waves propagating through air (or other materials)
that can enter the ear. This process can be seen in fig. 3.3. The ear then converts these
mechanical vibrations into internal vibrations which are in turn converted into nerve impulses.
However, sound is not just waves and their particularities; it is mainly a psychological perception [Sacks, 2010] of the wave patterns. Other types of waves also have frequency and amplitude, but are not perceived as sound, e.g. light. Frequency and amplitude of waves have corresponding principles when sound is perceived: frequency corresponds to pitch and amplitude to loudness.
Sound waves
Sound waves are mechanical waves, which means that they propagate through a medium, such as
air molecules, and differ in this from electromagnetic waves, such as light. Sounds are generated
from all kinds of objects, but the sounds can be very different depending on what generated the
sound. Sound can range from pure tones, which can be described with a sine wave, e.g. a tuning
fork, to noise in which no pattern can be detected, e.g. radio static noise.
Tones are generated by vibrating, or oscillating, objects which cause compression and rarefaction depending on the direction of motion. Compression occurs when the object moves towards the direction the sound is
travelling. The object pushes molecules together which causes an increase in pressure of the
medium. Rarefaction occurs when the object moves away from the direction of the sound. There
are then less molecules within the same space resulting in a decrease of pressure, as seen in fig.
3.3. Pure tones, which are mathematically represented by a single sine wave, can be generated by
tuning-forks. These waves can be described using only a frequency and an amplitude. Usually,
Figure 3.1: Additive synthesis of two sine waves resulting in a more complex waveform
Figure 3.2: Noise is usually named after the colour that shares similarities in their respective spectra; the panels show white, pink, blue and red noise, with frequency on the x-axis and amplitude in dB on the y-axis
sounds are not pure tones, but rather consist of a number of interfering sound waves. By adding
sine waves a more complex sound [Erickson, 1975] can be created, as seen in fig. 3.1.
Noise does not contain such patterns, but instead has energy at all frequencies. White noise
has equal energy in equally sized frequency bands in the spectrogram. A white noise signal has
the same energy, or equal power, in the bandwidth of 40 Hz–60 Hz as it has in the bandwidth of
400 Hz–420 Hz. White noise got its name because it was believed to have the same spectral flatness as white light. Other noise colours, such as violet, pink, red, grey and blue noise, as seen in fig. 3.2, are also related to the presumed spectral qualities of light. For instance, pink noise has a frequency spectrum which is linear in log-space.
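These noise colours can be generated by shaping white noise in the frequency domain. A minimal sketch, assuming a power spectral density proportional to f^α (α = 0 for white, −1 for pink, −2 for red, +1 for blue, +2 for violet):

import numpy as np

def coloured_noise(n, exponent, sr=44100):
    # Shape white noise so that its power spectral density is
    # proportional to f**exponent.
    white = np.random.randn(n)
    spectrum = np.fft.rfft(white)
    freqs = np.fft.rfftfreq(n, d=1.0 / sr)
    freqs[0] = freqs[1]                      # avoid division by zero at DC
    spectrum *= freqs ** (exponent / 2.0)    # amplitude = sqrt(power)
    noise = np.fft.irfft(spectrum, n)
    return noise / np.abs(noise).max()

pink = coloured_noise(44100, -1.0)           # energy falls 3 dB per octave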
The human ear
The human ear consists of three main parts: the outer ear, the middle ear and the inner ear. The outer ear serves to protect the inner parts and to funnel the sound waves into the middle ear. The middle ear translates the sound waves into internal vibrations. The
inner ear then translates these internal vibrations to nerve impulses which are then sent to the
auditory cortex. A schematic overview of the internals of the ear can be found in fig. 3.4.
The outer ear consists of the ear flap and the ear canal. The ear flap channels the sound waves
into the ear canal and the canal itself amplifies frequencies up to 3000 Hz. At the end of the ear
canal the sound waves are absorbed by the ear drum which vibrates along with the waves. These
waves are then converted to internal vibrations by three small bones, known as the ossicles (the
Figure 3.3: Sound waves propagate through a medium such as air molecules
Figure 3.4: Internals of the ear: the outer ear (ear canal), the middle ear (ear drum and ossicles) and the inner ear (cochlea)
hammer, anvil, and stirrup) which transmit the vibrations of the ear drum into the fluids within the cochlea. The ossicles function as a sort of lever system which amplifies the vibrations, resulting in the human ear being capable of detecting sounds with low amplitudes. The cochlea
is a cavity located in the inner ear and is about 3 cm in length if stretched out. The cavity
contains a liquid which vibrates along with the ear drum via the ossicles. The inner surface of
the cochlea is lined with about 20,000 hair-like nerve cells of various lengths and resilience. Due
to the differences in length, the hairs resonate with certain frequencies causing them to have a
larger amplitude if the liquid is vibrating with the same frequency. If the nerve cell is agitated
enough, it will send an electrical signal to the brain.
Certain frequencies are amplified by the outer and middle ear. The ear canal amplifies frequencies around 3000 Hz, which is also the frequency around which human speech sounds are located, and the middle ear boosts frequencies in the same region. This means that the human ear is most sensitive to frequencies in the 1000 Hz–3000 Hz band.
Psychoacoustic properties of sound
“If a tree falls in the forest and there is no one to hear it, did it still make a sound?”
This famous philosophical thought experiment is essentially about the knowledge of reality and
observation. Sound waves travel through the air, but are only perceived when they reach the ear.
What we hear shares a strong connection with the physical attributes of the wave, as it travels
through the air, but these connections change in meaning and in our understanding. The wave
is converted into a mental representation [Levitin, 2013] and it is this representation that we
experience as sound. The mental representation changes the physical properties of the wave into
Figure 3.5: Sound waves with small frequency differences create a noticeable pattern in amplitude differences over time, as can be seen when combining a wave with a frequency of 10 Hz and a wave with a frequency of 11 Hz
Figure 3.6: Spectra of a piano and a violin playing the same note. Apart from the differences in energy patterns, note also the difference between attack and decay of each note
psycho-acoustical properties. For instance, the frequency of the wave is converted into pitch and
the amplitude is converted into loudness.
The difference between the properties of the real wave and the psychoacoustic properties of the
mental representation is not as clear [Patil et al., 2012] as the difference between the frequency
of a wave of light and the mental representation of this frequency as a colour. This is mainly due
to the fact that pitch is encoded in neurons firing at the same frequency as the incoming wave.
This means that pitch is also represented in Hz. However, perceptually the brain is incapable
of noticing small differences in Hz. For the brain, there is no discernible difference between
1000 Hz and 1001 Hz. The difference can only be heard when two sounds at those frequencies
are produced simultaneously. Due to the slight difference between the waves, the combined amplitude over time oscillates between a maximum of the summed amplitudes (when the waves are almost in sync) and a minimum when they cancel each other out (when the waves are out of sync), as can be seen in fig. 3.5. This phenomenon is used when tuning guitars, by playing the same note simultaneously on two different strings.
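The beating pattern follows directly from the sum-to-product identity for two sine waves:

sin(2π f₁ t) + sin(2π f₂ t) = 2 cos(π (f₁ − f₂) t) · sin(π (f₁ + f₂) t)

For f₁ = 10 Hz and f₂ = 11 Hz this is a 10.5 Hz tone whose amplitude is modulated by the slow cosine term; the envelope |2 cos(π t)| peaks twice per 2-second period, so a beat is heard at |f₁ − f₂| = 1 Hz.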
Humans are exceptionally skilled at recognising the difference between sound producers. That is to say, even when two instruments are playing the same note, the human brain is capable of determining which instrument is playing when. The reason for this is timbre. Timbre usually refers to every property of sound which is not the frequency or the amplitude. However, this is not a scientific or rigid definition; it can be decomposed as follows:
Definition:
Timbre is the property of a sound wave that determines the characteristics of that sound. The
timbre is composed of the energy pattern of the harmonics of a tone.
What this means is that as sound can be represented as the addition of several sine waves, timbre
can be represented as the energy of each individual sound wave, as can be seen in the spectra in
fig. 3.6. This definition is not complete, but it is already sufficient for representing the differences between sound producers [Fujinaga, 1998, Fujinaga and MacMillan, 2000]. What is missing from
this definition are the characteristics of the sound over time. In theory, these properties can be
represented as an energy pattern over the low frequency sine waves, but this representation is not
intuitive. Instead, it is easier [Berger, 2005] to represent these properties as the convolution with
a function that controls the total energy of the sound. This function represents how the energy of
the sample changes over time. It starts with the attack of the tone. The attack refers to the time
between the initial production of the sound and the moment the sound reaches its maximum
energy. In fig. 3.6 this can be seen for a tone produced on a piano and a violin. The piano
reaches its maximum energy before the violin, which is due to the difference between hitting a
string within the piano and the slow swelling of the sound for a violin. The other properties of
the energy function are the time it takes for the sound to die out after production, the average
energy of the sound and the length of the sound.
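This energy function can be sketched as a simple attack-plus-decay envelope multiplied with a tone. The parameter values below are illustrative assumptions only, chosen to mimic the fast attack of a struck piano string versus the slow swell of a bowed violin:

import numpy as np

def attack_decay_envelope(n_samples, sr=44100, attack=0.01, decay_rate=4.0):
    # Linear rise to full energy over `attack` seconds, then exponential decay.
    t = np.arange(n_samples) / sr
    rise = np.clip(t / attack, 0.0, 1.0)
    fall = np.exp(-decay_rate * np.maximum(t - attack, 0.0))
    return rise * fall

t = np.arange(44100) / 44100.0
tone = np.sin(2 * np.pi * 440.0 * t)
piano_like = tone * attack_decay_envelope(44100, attack=0.005, decay_rate=3.0)
violin_like = tone * attack_decay_envelope(44100, attack=0.4, decay_rate=1.0)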
Chapter 4
Methodology
The sensory substitution device created for this thesis, 3D-VITA, provides audible vision. However, traditional audible-vision sensory substitution devices take a visual percept
and convert this phenomenon into an audible signal. This results in sensory substitution devices
taking over the role of sensory integration instead of employing the brain of the user. By directly
converting visual information, rather than visual percepts, the user can perceive much more
complex information. This is why the sensory substitution device described in this thesis is
called 3D-VITA which stands for 3D Visual Information To Audio. The system utilises a 3D
camera, mounted on the head of the user, to get the visual information from the scene. The
visual information is based on a segmentation of the image into local patches with similar visual
qualities, known as super-pixels. A set of visual features is created for each super-pixel and
these features are used to generate sounds. The sound generation is steered by weak associations
stemming from the visual features of materials. The sounds are played back to the user through headphones using head-related transfer functions, creating the illusion that the sounds are located in 3D space. The location of each of the sounds is taken from the depth information from the camera.
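The binaural playback can be sketched as the convolution of each generated sample with a pair of head-related impulse responses (HRIRs) for the direction obtained from the depth camera. The sketch below assumes a set of measured HRIRs is available (for example from a public database such as CIPIC) and is not the thesis implementation itself:

import numpy as np

def binauralise(mono, hrir_left, hrir_right):
    # Convolving a mono signal with the left- and right-ear impulse
    # responses for a given azimuth/elevation applies the interaural
    # time and level differences plus the spectral cues of the pinna,
    # creating the illusion of a source at that direction.
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    return np.stack([left, right], axis=-1)  # (n_samples, 2) stereo signal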
The following sections give greater detail on this pipeline. The sections follow the pipeline of the
system which consists of a camera to take images with 3D information, a computer to segment
the images, extract the visual features and generate the sounds and a pair of headphones which
enable the user to hear the sounds in a simulated 3D environment. A schematic overview is given
in fig. 4.1.
4.1 RGB-D Camera
RGB-D cameras can provide both colour and depth information. The RGB-D camera that was
used for the prototype is the Microsoft Kinect (http://www.xbox.com/en-us/kinect/). The Kinect
uses a normal RGB camera, an infra-red laser emitter and an infra-red camera. The infra-red
laser is diffracted to create a pattern of infra-red dots. The resulting pattern is compared to a
reference infra-red image created with a plane at a fixed distance. By comparing the distance
from the dots in the pattern to the original reference image, a disparity image can be obtained.
RGB-D Camera: RGB-D cameras are capable of taking full-colour images (RGB) as well as retrieving depth information (D)

Segmentation: The image is segmented using the SLIC algorithm. This results in a set of super-pixels with approximately the same size and visual coherence within each super-pixel, see fig. 4.2

Feature Extraction: A set of visual features is extracted for each super-pixel. The features are based on typical material recognition systems

Sound Generation: The visual features are translated to a set of audio parameters, based on the acoustic properties of a set of typical materials

HRTF Headphones: With headphones, an auditory effect can be created resulting in the user hearing sounds in a 3D environment using head-related transfer functions

Figure 4.1: A schematic overview of the system
The disparity image can then be converted into depth measurements using the camera parameters
as obtained through calibration. The 3D location vector [x′, y′, z′] of the object can be found
from pixel locations x and y using depth d and constant c:
x′ = x · d · c,   y′ = y · d · c,   z′ = d   (4.1)
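Eq. 4.1 applied to a whole depth frame can be sketched as follows. The calibration constant c = 0.0021 is a commonly quoted Kinect approximation and is an assumption here, not a value from this thesis; pixel coordinates are taken relative to the image centre:

import numpy as np

def backproject(depth, c=0.0021):
    # Lift every pixel (x, y) with depth d to the 3D point
    # [x*d*c, y*d*c, d] following eq. 4.1.
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    xs = xs - w / 2.0                 # pixel coordinates relative to centre
    ys = ys - h / 2.0
    return np.dstack([xs * depth * c, ys * depth * c, depth])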
One potential pitfall of using this type of depth camera [Khoshelham, 2011] is that transparent and reflective objects cannot be accurately measured. However, there are no optical methods
that are capable of fully circumventing this issue. Another typical issue is that measurements fail
if a surface is oriented almost parallel to the camera. This is due to the laser specks becoming
either blurred by being smeared out over a larger surface area, or due to being reflected away
from the camera. This issue is less problematic as the segmentation is also based on edges,
causing most of the faulty measurements to lie on the edges of the super-pixels.
4.2 Super Pixel Segmentation
There is a fine balance between giving too much and too little information. If we were to create
a descriptor for each pixel and generate a sound from that, as done in the vOICe [Amedi et al.,
2007, Auvray et al., 2007], the user would get very little information for each pixel, but much
stimulation. The user would get the information in a raw format where the brain has to do
the filtering and focus attention, which can be straining for the user. Instead, 3D-VITA should
function much as a sensory organ does and filter relevant information for the user. On the
other hand, if we take global descriptors of the image, the system will filter out too much data
and thus contributes little to the perception by the user. This is why 3D-VITA uses
image patches for which it generates sound. The system uses SLIC segmentation (Simple Linear
Iterative Clustering) [Achanta et al., 2012] which results in a set of super pixels. With SLIC, a
predefined number of super pixels can be found that incorporates both spatial relevance and
colour relevance for each super pixel, as seen in fig. 4.2.
Definition:
Super Pixels are groups of pixels that share some relation, be it a spatial relation or relation
in colour or texture
The algorithm is given a number K of super pixels to find in the image; with N pixels in the image, each super pixel will contain around N/K pixels. With super pixels of approximately the same size, there will be a super pixel centred on every grid interval S = √(N/K).
The goal of the algorithm is to find a set of super pixels C_k = [l_k, a_k, b_k, x_k, y_k]^T with their related pixels that have a smaller distance to that super pixel than to any other centre. The
distance measure, as seen in eq. 4.2, between a cluster centre and a pixel is determined using the
Euclidean distance in colour space and the Euclidean distance between pixel locations. The SLIC
algorithm works on the CIELAB colour space in which Euclidean distances are perceptually
meaningful for small distances. A distance threshold m is chosen which weights how much colour
difference is allowed. The higher m is chosen, the more spatial proximity is used for determining
pixel relations to the cluster. The algorithm can be seen in alg. 1.
d_lab = √((l_k − l_i)² + (a_k − a_i)² + (b_k − b_i)²)
d_xy = √((x_k − x_i)² + (y_k − y_i)²)
D_s = d_lab + (m/S) · d_xy    (4.2)
The algorithm begins by selecting an initial set of super pixel centres C_k = [l_k, a_k, b_k, x_k, y_k]^T for k = 1, …, K at the centres of an equally spaced grid. For each centre the surrounding pixels are analysed and the centre is moved to the lowest gradient position in a 3 × 3 patch. This is because we do not want centres to lie on edges, which often represent borders between distinctly different patches. The gradient G(x, y) is calculated using the lab vector I(x, y) using eq. 4.3.
G(x, y) = ‖I(x+1, y) − I(x−1, y)‖² + ‖I(x, y+1) − I(x, y−1)‖²    (4.3)
For all experiments, K was chosen to be 64. With the resolution of the Kinect at 640 × 480, this results in a cluster size of approximately 4800 pixels per super pixel for 64 clusters.
Algorithm 1 SLIC Segmentation
Require: K = number of super pixels
1: Initialise K cluster centres C_k = [l_k, a_k, b_k, x_k, y_k]^T by sampling pixels at a regular grid
2: Perturb cluster centres in an n × n neighbourhood, to the lowest gradient position
3: repeat
4:   for all cluster centres C_k do
5:     Assign the best matching pixels from a 2S × 2S square neighbourhood around the cluster centre according to the distance measure in eq. 4.2
6:   end for
7:   Compute new cluster centres and residual error E (L1 distance between previous centres and recomputed centres)
8: until E ≤ threshold
9: Enforce connectivity
Figure 4.2: The SLIC algorithm applied to a painting by Malevich. The parameters were set to K = 64, m = 20
With the initial centres in place, the algorithm iteratively uses K-Means to move the cluster
centres by checking a 2S × 2S square around the centre. This is done by relating each pixel
in the image to the nearest cluster centre according to the distance measure in eq. 4.2. When
each pixel is associated, the new cluster centres are calculated as the average [l, a, b, x, y] vector. The
algorithm terminates when the cluster centres converge. A visual representation of these steps
can be seen in fig. 4.2.
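The distance measure of eq. 4.2 is straightforward to express in code, and an off-the-shelf SLIC implementation is available in scikit-image, whose n_segments and compactness parameters play the roles of K and m. The file name below is a placeholder:

import numpy as np
from skimage import io
from skimage.segmentation import slic

def slic_distance(center, pixel, m, S):
    # Eq. 4.2: CIELAB colour distance plus spatially weighted pixel
    # distance; `center` and `pixel` are [l, a, b, x, y] vectors.
    d_lab = np.linalg.norm(center[:3] - pixel[:3])
    d_xy = np.linalg.norm(center[3:] - pixel[3:])
    return d_lab + (m / S) * d_xy

img = io.imread('frame.png')                       # placeholder input image
labels = slic(img, n_segments=64, compactness=20)  # K = 64, m = 20 as in the text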
4.3 Extracting Visual Features
The selection of visual features is guided by the current state-of-the-art material recognition
systems. The reason for this is that these features give a good representation of the materials so
Table 4.1: Hue ranges for each of the five colours used. Hue values between these bins were weighted linearly between the two neighbouring bins

Colour name   Low Hue   High Hue
Red           355       10
Yellow        51        60
Green         81        140
Blue          221       240
Pink          331       345
that they can be distinguished from one another. By finding natural complements within the
audio property domain, it should follow that humans can therefore distinguish between visually
different objects.
Colour is represented as a histogram of the primary colours “ROYGBIV”, which are Red, Orange, Yellow, Green, Blue, Indigo and Violet. To reduce complexity within the sounds, Orange
is considered a natural blend between Red and Yellow, while Indigo and Violet are grouped
together as Pink. This reduces the main colours to Red, Yellow, Green, Blue and Pink. The reasoning behind this is that these five colours can be linked to the five main notes
from a standard chord, which will let the final result sound more pleasant. For each super pixel
a histogram of the colours is made using hue. Each of the colours is represented by a certain
range in which that colour is the dominant colour. In between these ranges a linear blend is
made between two consecutive bins. So an orange pixel with a hue of 30 will add 0.5 to both
the red and the yellow bin. In table 4.1 the bin ranges for each colour are shown. Note that
hue is represented as an angle and thus wraps around, meaning that red is represented as the
ranges 355− 360 and 0− 10. Due to the possibility that individual super pixels have different
amounts of pixels within them, the histogram is then normalised. The saturation values are
simply averaged for each super pixel. Brightness is represented as three distinct values being
dark, medium or light. The reasoning behind this has to do with the way they are represented in
sound.
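The binning scheme can be made concrete with a small sketch. The ranges follow table 4.1; the helper function and its blending logic across the gaps between bins are my own illustrative reconstruction of the description above, not the thesis code.

# Five-bin hue histogram with linear blending between consecutive bins.
import numpy as np

# Hue ranges from table 4.1, in order around the colour circle; red wraps 0.
BINS = [("red", 355, 10), ("yellow", 51, 60), ("green", 81, 140),
        ("blue", 221, 240), ("pink", 331, 345)]

def hue_histogram(hues):
    """Normalised 5-bin histogram for the hue values (degrees) of one super pixel."""
    hist = np.zeros(len(BINS))
    for h in hues:
        h = h % 360
        placed = False
        for i, (_, lo, hi) in enumerate(BINS):
            inside = lo <= h <= hi if lo <= hi else (h >= lo or h <= hi)
            if inside:                              # dominant-colour range
                hist[i] += 1.0
                placed = True
                break
        if not placed:                              # blend across the gap between bins
            for i in range(len(BINS)):
                gap_start = BINS[i][2]              # high edge of bin i
                gap_end = BINS[(i + 1) % len(BINS)][1]   # low edge of next bin
                span = (gap_end - gap_start) % 360  # unwrap gaps crossing 0 degrees
                offset = (h - gap_start) % 360
                if 0 < offset < span:
                    w = offset / span
                    hist[i] += 1.0 - w
                    hist[(i + 1) % len(BINS)] += w
                    break
    return hist / max(hist.sum(), 1e-9)             # normalise for super-pixel size

# an orange hue of 30 degrees adds roughly 0.5 to both the red and yellow bin
print(hue_histogram(np.array([30.0])))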
Texturedness is also represented. Note that this refers to texture in the sense of noise rather than structured texture such as repeating patterns. Texturedness is measured by smoothing the grey-scale image I with a convolution by a 9 × 9 averaging filter, giving a smoothed image I′, and taking the absolute difference with the original image (eq. 4.4). Pixels that differ from their surrounding pixels thus receive a higher value than pixels surrounded by similar pixels. The result T highlights the pixels that are noisy compared to their neighbours.

T = |I − I′|    (4.4)
For each super pixel the mean of T is taken to represent its texturedness. Values closer to the centre of the super pixel are weighted more heavily using a simple Gaussian filter, since the edges of the super pixel are aligned to edges within the image.
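A minimal sketch of this texturedness measure is given below, assuming scipy is available. The 9 × 9 averaging filter and the absolute difference follow eq. 4.4; the exact form of the Gaussian centre-weighting, here a window whose width is derived from the super-pixel extent, is an assumption on my part, as the thesis does not specify its parameters.

# Centre-weighted mean of T = |I - I'| (eq. 4.4) for one super pixel.
import numpy as np
from scipy.ndimage import uniform_filter

def texturedness(gray, labels, k):
    smoothed = uniform_filter(gray.astype(float), size=9)   # 9x9 averaging filter
    T = np.abs(gray - smoothed)
    mask = labels == k
    ys, xs = np.nonzero(mask)
    cy, cx = ys.mean(), xs.mean()
    # Gaussian weight peaking at the super-pixel centre; the width is an
    # illustrative choice, not a value stated in the thesis
    sigma = 0.5 * max(np.ptp(ys), np.ptp(xs), 1)
    w = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2.0 * sigma ** 2))
    return float((T[mask] * w).sum() / w.sum())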
Table 4.2: Notes and their corresponding colour frequencies as used by 3D-VITA

Colour Name   Frequency (THz)   Frequency × 2^-40 (Hz)   Note
Red           431               392                      G4
Yellow        513               466                      B♭4
Green         575               523                      C5
Blue          684               622                      D♯5
Pink          768               698                      F5
These are the only four features used to represent each super pixel. They are considered within the material recognition literature to hold the most information, with the omission of shape, which has no real meaning here due to the usage of super pixels. Other typical material features, such as texture patterns, are also omitted due to computational constraints.
4.4 Generating Sound
Sounds are generated using an additive synthesiser with a low-pass filter, an ADSR envelope and a layer of white noise. As noted in the previous section, five colours are represented in a normalised histogram. Each colour is linked to a certain note within a chord. These notes are found by taking the centre point of each colour bin and scaling the frequency down 40 octaves, i.e. multiplying the frequency by 2^-40. The resulting notes can be seen in table 4.2. The notes in the table form a G minor 7 chord with an augmented fifth. Although red can be seen as having the note G, it does not matter at which note the chord starts; it is more important that different colours have audible differences.
Using the frequencies from table 4.2, a sine wave is generated for each histogram value H_i with corresponding frequency f_i using equation 4.5 to obtain audio sample y. The sine waves are generated at a sample rate of 44100 Hz with amplitude a = 7500.

y(t) = \sum_i H_i \, a \, \sin\!\left(\frac{2 \pi f_i t}{44100}\right)    (4.5)
Frequency f_i is multiplied by 0.5, 1 or 2 according to the brightness class of the super pixel. A darker super pixel is thus moved one octave down, while a lighter super pixel is moved one octave up.
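Equation 4.5 and the octave shift translate directly into a short synthesis routine. The frequencies, amplitude and sample rate below follow table 4.2 and the text; the function itself is an illustrative sketch rather than the 3D-VITA code.

# Additive synthesis per eq. 4.5 with the brightness octave shift.
import numpy as np

FREQS = {"red": 392.0, "yellow": 466.0, "green": 523.0,
         "blue": 622.0, "pink": 698.0}             # Hz, table 4.2
SR, A = 44100, 7500                                 # sample rate, amplitude

def synthesise(hist, brightness, n_samples=SR):
    """hist: normalised 5-bin colour histogram; brightness: 0.5, 1 or 2."""
    t = np.arange(n_samples)
    y = np.zeros(n_samples)
    for Hi, fi in zip(hist, FREQS.values()):
        y += Hi * A * np.sin(2 * np.pi * (fi * brightness) * t / SR)
    return y

# a mostly-red, dark super pixel sounds one octave down (G3 instead of G4)
y = synthesise([0.8, 0.05, 0.05, 0.05, 0.05], brightness=0.5)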
The mean saturation µ_s of the super pixel is used as a parameter for a low-pass filter, producing the filtered audio sample y′. A low-pass filter dampens the sound, since the higher frequency ranges are averaged out. The lower the saturation, the lower the cut-off frequency is set, so that desaturated super pixels produce a duller sound. To achieve this, µ_s is inverted and normalised to the desired effect value (eq. 4.6).
Figure 4.3: A typical ADSR envelope with amplitude on the y-axis and the sample index on the x-axis
RC = \frac{(1 - \mu_s / 255) \cdot 150}{44100}
\delta t = \frac{1}{44100}
\alpha = \frac{\delta t}{RC + \delta t}
y'(t) = \alpha \, y(t) + (1 - \alpha) \, y'(t - 1)    (4.6)
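Equation 4.6 corresponds to a standard one-pole RC low-pass, sketched below; feeding back the previous output sample is the conventional form of this filter.

# Saturation-driven one-pole low-pass per eq. 4.6.
import numpy as np

def lowpass(y, mean_saturation):
    """y: audio samples; mean_saturation: mean HSV saturation in [0, 255]."""
    RC = (1.0 - mean_saturation / 255.0) * 150.0 / 44100.0
    dt = 1.0 / 44100.0
    alpha = dt / (RC + dt)        # high saturation -> alpha near 1 -> little damping
    out = np.empty_like(y, dtype=float)
    out[0] = alpha * y[0]
    for t in range(1, len(y)):
        out[t] = alpha * y[t] + (1.0 - alpha) * out[t - 1]
    return out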
To add texture to the sound, the original wave is mixed with a layer of white noise. White noise has equal energy at all frequencies and thus does not bias certain tones to stand out or be masked. Red or Brownian noise, for instance, has more energy at lower frequencies, which would make the lower-frequency tones less audible than the higher ones. The layer of noise is given the same maximum energy as the original sine waves.
Finally, the sound is passed through an ADSR envelope. ADSR stands for attack, decay, sustain, release and is usually represented as a piece-wise linear function, as can be seen in fig. 4.3. The attack is the beginning of the sample: the time between the start of the sample and the moment it reaches maximum amplitude. The decay follows immediately after, during which the amplitude drops to the steady amplitude of the sound, called the sustain. The release is the final part of the sound: the drop from the sustain level back to zero. The ADSR function is multiplied with the original audio sample y(t).
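A piece-wise linear ADSR envelope as in fig. 4.3 can be generated as follows. Interpreting the table 4.3 values a, d and r as segment fractions of the sample length and s as the sustain level is my assumption; the thesis does not state the units of these parameters.

# Piece-wise linear ADSR envelope; segment lengths are fractions of n.
import numpy as np

def adsr(n, a=0.05, d=0.4, s=0.9, r=0.45):
    na, nd, nr = int(a * n), int(d * n), int(r * n)
    ns = max(n - na - nd - nr, 0)                     # remainder holds the sustain
    return np.concatenate([
        np.linspace(0.0, 1.0, na, endpoint=False),    # attack: rise to peak
        np.linspace(1.0, s, nd, endpoint=False),      # decay: drop to sustain level
        np.full(ns, s),                               # sustain
        np.linspace(s, 0.0, nr),                      # release: back to zero
    ])

# the envelope multiplies the audio signal:  y_shaped = adsr(len(y)) * y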
Table 4.3: Specification of 3D-VITA

Pipeline             Method                  Specification
Image Retrieval      RGB-D camera (Kinect)   Colour and depth
Segmentation         SLIC                    K = 64, m = 20
Feature extraction   Hue                     see table 4.1
Feature extraction   Saturation              mean per super pixel
Feature extraction   Lightness               three classes (Dark, Medium, Light)
Feature extraction   Texture                 summation over edge pixels
Audio Synthesis      Additive Synthesis      see table 4.2, a = 7500, sample rate = 44100 Hz
Audio Synthesis      Low-Pass Filter         see eq. 4.6
Audio Synthesis      ADSR                    a = 0.05, d = 0.4, r = 0.45, s = 0.9
Playback             HRTF                    OpenAL implementation
Playback             Stimulus interval       1 s
Playback             Delay                   1 s
4.5 Head-Related Transfer Functions
To create the binaural illusion, i.e. the illusion of 3D localised sound, head-related transfer functions (HRTF) [Begault et al., 1994, Larcher et al., 2000, Furse, 2009] are used for both ears. The HRTF is the Fourier transform of the head-related impulse response (HRIR), which encodes the response of the ear to an impulse arriving from a given source location. Convolving a sound with the HRIR alters it as if it came from that source location. Different frequencies within a complex sound have different responses in the ear, mainly due to the shape of the outer ear, head and body of the listener. For distances larger than 1 m the differences in head shape become negligible, so it is not necessary to measure the HRTF for each specific user of the system. HRTFs also help resolve the Cone of Confusion: the set of source locations for which the interaural level difference (ILD) and interaural time difference (ITD) are identical.
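The playback step can be illustrated with a simple binaural rendering sketch. The thesis uses the OpenAL HRTF implementation [Furse, 2009], so the numpy version below is only a conceptual stand-in: hrir_left and hrir_right are hypothetical measured impulse responses for the desired source direction.

# Binaural rendering by convolving a mono signal with left/right HRIRs.
import numpy as np

def binauralise(mono, hrir_left, hrir_right):
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    n = max(len(left), len(right))
    stereo = np.zeros((n, 2))                 # column 0: left ear, column 1: right ear
    stereo[:len(left), 0] = left
    stereo[:len(right), 1] = right
    return stereo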
4.6 Summary
The system uses a Kinect, mounted on a pair of headphones, to retrieve both colour and depth images. The colour images are converted to the CIELAB colour space for segmentation with the SLIC algorithm. The depth images are smoothed and used to determine the 3D location of each super pixel in the scene. The colour images are converted to the HSV colour space for the extraction of colour features. Hue, saturation and lightness are converted to audio parameters representing the note, dampening and octave. The texturedness of each super pixel is measured using a sharpening filter and summing over the edge pixels from this filter. A sound is produced using additive synthesis and then modified using an ADSR envelope. The sounds are played back to the user over headphones in a virtual 3D space, an effect achieved using head-related transfer functions. A full overview of the parameters used can be found in table 4.3.
Chapter 5
Experiments
To validate the system, both a virtual and a location experiment were performed. An overview of the ten subjects can be found in table 5.1. Subjects were given approximately two minutes of adaptation (training) time before starting the experiments, to let them get accustomed to the sounds. This time was kept short to keep training to a minimum, as blind persons would not be able to have this visual-audio training before using the system. The experiment for each subject consists of three parts: the initial adaptation time, the virtual experiment and the location task. Subjects were asked questions relating to their musical ability, both in production and in listening. The results of this questionnaire can be found in table 5.2. The age of the subjects ranges from 19 to 25 years. Seven of the subjects reported they were musicians, although four of them also reported playing only occasionally. Most of the subjects reported listening to a broad range of genres spanning jazz, classical music, pop and rock. Some of the subjects reported being unsure whether they had any form of auditory damage. Apart from the ten subjects, an eleventh subject also performed the experiments. This subject had a subsidiary infection of the right ear canal. She was used as a baseline, since auditory difficulties in one ear hamper the spatialisation of sound.
Table 5.1: Subjects' gender, age and right-handedness

Id           Gender   Age        Right-handed
Subject 1    Male     19         yes
Subject 2    Male     24         no
Subject 3    Male     21         yes
Subject 4    Male     25         yes
Subject 5    Male     19         yes
Subject 6    Male     21         yes
Subject 7    Male     25         yes
Subject 8    Male     25         yes
Subject 9    Female   20         yes
Subject 10   Male     20         no
Total        8 to 2   µ = 21.9   8 to 2
Table 5.2: Subjects' musical background

Id           Instrument(s)            Musical ability         Musical taste
Subject 1    Piano                    Casual player           Very broad
Subject 2    Piano                    30 minutes per day      Very broad
Subject 3    Guitar and Bass-guitar   Daily. Also in a band   Punk, Stoner
Subject 4    X                        X                       Broad
Subject 5    X                        X                       Very broad
Subject 6    Piano                    4-8 hours per week      Broad
Subject 7    Drums                    Few hours per week      Broad
Subject 8    X                        X                       Very broad
Subject 9    Saxophone                Infrequent              Broad
Subject 10   Piano, Guitar, Fiddle    2 hours per week        Broad
Figure 5.1: Three of the nine synthetic images used for the synthetic experiment. Each coloured square had three possible locations within the image
5.1 Experimental Set-Up
Virtual Experiment
The virtual experiment consists of nine synthesised images: for each primary colour (R, G, B), a square at one of three positions (left, middle, right). The depth map for each image consists of all NaNs except for the square, which is set at a fixed depth of 3.0 m. In front of the subject, a large visual marker is placed. When receiving a stimulus, the user is asked to turn their head towards where the square is. Using the visual marker, the head position of the subject is tracked to see how close the user's gaze is to the virtual object. From these nine values (distance of gaze centre to the centre of the virtual object) the initial calibration error can be found. This error can arise from inaccuracies in the software or hardware, as well as from auditory damage the subject may have.
Location Task
The goal of the location task is to measure how accurate subjects are at finding an object in the sound-scape. A large visual marker is placed on the perimeter of a semi-circle with a radius of 2.5 m, centred around the user, as can be seen in figure 5.2.

Figure 5.2: Experimental set-up of the room with the participant in the middle

If the subject believes the visual
marker is in the centre of their field of view, the subject presses a button, which records this moment. The audio is temporarily disabled while the visual marker is moved to a new location, at which point the next task starts. The soundscapes are generated at approximately one soundscape per second, and each image is stored so that the user's gaze can be tracked over time. A total of ten location tasks is given to each subject; these are averaged on their time axis from the starting time of the task to the moment the user presses the button. For each image a distance to the centre of the visual marker is calculated. The distance is the angle at which the marker is located, the angle being zero if the marker is at the centre of the subject's gaze. If the visual marker is out of the field of view, the maximum possible distance is assigned. The distance can be both positive and negative and is not normalised, which allows for analysing the search patterns the subjects use to locate the marker.
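This angular error metric can be sketched as a simple conversion from the marker's pixel position to a signed angle. The 57° horizontal field of view used here is the nominal Kinect value and an assumption on my part; the thesis does not state the conversion it uses.

# Hypothetical pixel-to-angle conversion for the error metric.
def marker_angle(marker_x, image_width=640, fov_deg=57.0):
    """Signed angle (degrees) of the marker from the centre of gaze."""
    return (marker_x - image_width / 2) / image_width * fov_deg

print(marker_angle(640))   # 28.5 deg: marker at the right image border
print(marker_angle(320))   #  0.0 deg: marker centred in the gaze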
5.2 Results
General Response of the Subjects
After the session, the subjects were asked about their personal findings and opinions on the system. All subjects reported following the same strategy. During the first couple of experiments, the subjects tried to find the "red" sound within the environment. Around the fourth or fifth experiment, the subjects switched strategy to instead focus on contrast and changes in sound. This effect can be seen in fig. 5.4, where the time taken for the task jumps from the normal pattern. Most subjects felt insecure about their ability to correctly identify the marker within the room, despite the low error rate. Some of the subjects reported that the sounds were pleasant; two of the subjects reported they were unpleasant.
Most subjects reported that they had difficulty in locating where sounds were coming from. Some
subjects thought sounds came from behind them, others reported hearing more sounds coming
from the left than from the right and vice versa. The differences between the subjects can be
explained by the HRTF having different effects on each listener's spatial perception of the sounds. Another explanation might be that some subjects have slight hearing damage, which results in a bias in perceived locations.
Virtual Experiment Results
During the virtual experiment, all subjects experienced a slight bias towards sounds from the right. When the right stimuli were presented, subjects turned their heads 90° to the right, while the left stimuli made them turn 65° to the left. Subjects reacted to the middle stimuli by turning 30° to the right. This indicates that the entire audioscape is slightly rotated around the user towards the right. Most subjects reported that some of the sounds seemed to come from within their own head and were therefore unable to locate these sounds in the virtual room. This is a natural response to sounds that are directly in front of a listener, as the sound is then nearly the same for both ears. One of the subjects turned completely around for stimuli from both the left and the right, a common effect in people with damage to the auditory system. Two of the subjects also looked slightly up and down for some of the stimuli, especially the green and blue stimuli. The green stimuli have high brightness and the blue stimuli low brightness; it is a common perceptual illusion to feel that sounds with a higher pitch come from higher up and vice versa.
The eleventh subject, who suffers from a subsidiary infection of the ear canal, performed significantly worse than the other subjects. She had a strong bias to the left, which indicates that the right ear was indeed still not functioning normally. Furthermore, she reported that she could not successfully attribute a location to each stimulus.
Location Task Results
During the location experiments, a total of 4720 stimuli were presented. These stimuli were divided over 10 participants, who each completed 10 experiments; the results are given in table 5.3. Of these 4720 stimuli, the marker was present in 2451 of the frames. Each experiment was concluded with the subject indicating that the marker was within the field of view. In these 100 "screenshots" the marker was present in 68 of the frames; the screenshots can be found in fig. 5.3. Compared to the 51.9 % of all frames in which the marker was present, the 68 % of the screenshots is above random. Furthermore, 7 of the 10 subjects performed significantly above random, 2 of the subjects performed at random and 1 subject performed significantly below random. Of the 2 subjects performing at random, one performed at 80 %, indicating that this subject may simply have spent more time looking at the marker before deciding. Subject 10, who performed below random, indicated that she had a specific strategy of searching for high-pitched sounds. Looking at her screenshots, the bottom row in fig. 5.3, it becomes apparent that her hypothesis was aimed at the floor.
Given the 100 trials of which 68 were successful, we can reject the hypothesis that the subjects were performing the task at random. To do this, the experiment was modelled as a binomial chance experiment with a 0.519 chance of success per trial.
Figure 5.3: All screenshots taken by the subjects during the experiments. Each row corresponds to a single subject (Subject 1 at the top through Subject 10 at the bottom); from left to right are experiments 1 to 10
The probability of observing this result under the random-performance hypothesis is 0.0007951, as calculated in equation 5.1, under the assumption that each trial, or task, is performed independently. If we consider the best and worst performing participants as outliers, there were 80 trials of which 57 were successful; the probability then becomes 0.0003300.
P(68 correct out of 100 trials) = \binom{100}{68} \times 0.519^{68} \times 0.481^{32} = 7.951 \times 10^{-4}    (5.1)
If we compare the first 5 tasks of each subject to their last 5 tasks, we get the probabilities in equations 5.2 and 5.3. During the first 5 tasks, 28 out of 50 screenshots were successful, while the last 5 tasks resulted in 39 out of 50 successful screenshots.
P(28 correct out of 50 trials) = \binom{50}{28} \times 0.519^{28} \times 0.481^{22} \approx 0.33    (5.2)
P(39 correct out of 50 trials) = \binom{50}{39} \times 0.519^{39} \times 0.481^{11} \approx 0.0001    (5.3)
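These binomial quantities are easy to verify with scipy. Note that the exact point probability P(X = k) and the tail probability P(X ≥ k) differ; the sketch below prints both so the reported values can be checked under either reading.

# Binomial check for eqs. 5.1-5.3 under the null of p = 0.519 per trial.
from scipy.stats import binom

p = 0.519
for k, n in [(68, 100), (28, 50), (39, 50)]:
    # point probability P(X = k) and tail probability P(X >= k)
    print(k, n, binom.pmf(k, n, p), binom.sf(k - 1, n, p))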
The difference between the first 5 and last 5 tasks shows that, despite receiving no intermediate feedback on their performance, the subjects were able to improve significantly over the course of the experiment.
Table 5.3: Location experiments per subject. Total frames is the total number of stimuli the subject received. Frames with marker is the number of those stimuli in which the marker was present. The screenshots were taken at the moment the subject confirmed the marker was in front of him or her

Id           Total Frames   Frames with marker   Screenshots with marker
Subject 1    493            404 (81 %)           8 (80 %)
Subject 2    521            239 (45 %)           6 (60 %)
Subject 3    151            77 (51 %)            7 (70 %)
Subject 4    606            236 (40 %)           7 (70 %)
Subject 5    868            488 (56 %)           5 (50 %)
Subject 6    172            111 (61 %)           8 (80 %)
Subject 7    184            103 (56 %)           10 (100 %)
Subject 8    677            386 (57 %)           9 (90 %)
Subject 9    267            110 (41 %)           7 (70 %)
Subject 10   269            108 (40 %)           1 (10 %)
Total        4720           2451 (51.9 %)        68 (68 %)
When considering the individual performance of each subject, this metric becomes difficult to use, as there were only 10 trials per subject. This holds especially for subject 1 in table 5.3, who did not score above random: he performed well at the task, but spent a larger portion of the time looking at the object. Table 5.3 also shows that subject 10 performed significantly below random. The eleventh subject also performed below random, but additionally reported not being able to form any hypothesis on what the sounds represented. While subject 10 followed a set strategy and executed it consistently, the eleventh subject did not follow any strategy, as she could not differentiate the sounds produced by the system. Although the sample size is small, no noticeable relation was found between performance and the background information gathered about the subjects in tables 5.1 and 5.2.
Another important metric to consider is the time taken for each consecutive task; the results for each of the ten subjects can be viewed in fig. 5.4. Noteworthy is that some of the subjects needed approximately the same amount of time for each task, while others fluctuated in the time needed. The latter subjects reported that they sometimes changed their hypothesis and took longer trying to verify whether their new hypothesis was correct. This effect is noticeable in fig. 5.4 around the fourth experiment, where for many of the subjects the time is either above or below their average. Another effect visible in fig. 5.4 is that the overall time taken per experiment becomes more stable towards the final five experiments; that is, the deviation from the mean becomes smaller.
To see what search strategy the subjects used, the marker was tracked over time during each experiment. In fig. 5.5 six representative graphs can be seen. The marker was detected in each frame that was converted into an audio stimulus; the x position of the marker is shown on the y-axis. If the marker was not detected, the last known x position was used to determine whether the subject was looking to the right or the left of the marker. For these frames, either x = 0 or
Figure 5.4: Time taken for each experiment. Blue circles indicate correct screenshots. The blue dashed line is the mean time taken per experiment
Figure 5.5: Six representative graphs (Subject 5, exp 2; Subject 4, exp 5; Subject 8, exp 10; Subject 10, exp 1; Subject 9, exp 6; Subject 7, exp 6) showing the search strategies used by the subjects. On the y-axis is the time (ms) into the experiment and on the x-axis is the x-value of the centre point of the marker
x = 640 was used when the subject was looking to the right or to the left of the marker, respectively. What becomes apparent from these graphs is the sweeping strategy used by most subjects: they move their heads from left to right when looking for the marker. Another finding is that some of the subjects had their gaze trained on the marker in the final frames, but moved away for a brief moment before taking a screenshot. After the experiments, these subjects reported that they did this to verify that the marker was indeed in the field of view.
Chapter 6
Conclusion
In this thesis, a novel sensory substitution device named 3D-VITA was proposed. The system was tested on ten subjects in both a virtual and a localisation task. The system retrieves rich visual information with an RGB-D camera and translates this information into an audio signal. Results showed that the system indeed has an effect on the subjects, allowing them to detect an object in the room. It is difficult to assess precisely what this effect is, but it is clear that the subjects were capable of performing above random. Furthermore, when talking about their experience with the system, most subjects reported that they indeed "felt" the presence of an object in the room. This distal attribution is a difficult effect to measure, but together with the performance of the subjects it makes it very likely that they were capable of finding the object during the experiment. What is remarkable is that they did so without training: the two-minute adaptation period was not enough for users to extract which parts of the audio signal were relevant. This is also clear from the performance during the first five experiments. Most subjects made mistakes during these first five, but managed to improve their performance, despite not getting any feedback during the experiment apart from the audio signal of the system.
The most important result is that the system has potential and that it is very much possible to create sensory substitution devices that operate on rich information. This is important, since before 3D-VITA there were no sensory substitution devices that attempted to utilise the sensory integration capabilities of the human brain.
Although the experiments show that the system provides enough information for users to perform a localisation task without the need for training, more rigorous testing is required to fully understand what the system can provide. The experiments made use of a single red marker. In the future, this red marker needs to be replaced by real-world objects to fully show the power of the system. The red marker is, however, difficult to find in this room, since the red hue is converted to the primary note of the chord used by the synthesiser. This means that even when no red objects are in the field of view, the brain completes the chord, making it hard to focus on just a single note. Furthermore, the red-yellowish tint of the lights in the room made the task more difficult. Another potential issue is that the task was performed from a single vantage point. In future experiments, it would be interesting to see how the system performs in a task that allows movement; due to the hardware, this was not possible at the current stage.
The localisation experiment was done by the subjects without prior training. Despite this, the subjects exhibited improvement, even over a small number of trials. Future experiments should include a training stage in which feedback is given during the experiment.
Perhaps the most obvious issue with the system in its current state is the one-second interval and delay. This leads to a latency between the user's head position and the stimulus provided by the system, which greatly increases the complexity of the task. To improve the system, this delay needs to be addressed. A possible method is to replace the SLIC segmentation algorithm with a simpler method. The added benefit of removing the delay, as well as the one-second interval at which stimuli are provided, is that the user gets continuous feedback on the environment.
3D-VITA needs more experimentation to understand the limits and full capabilities of the system. However, as the experiments have shown, it has great potential as a means to replace vision without invasive surgery. The integration of sensory information is one of the great feats of the brain and, thus far, no sensory substitution device had attempted to explore the possibilities of this powerful ability. With 3D-VITA, these integration capabilities are utilised to provide a stronger sense of perception.
Bibliography
R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Susstrunk. Slic superpixels compared
to state-of-the-art superpixel methods. Pattern Analysis and Machine Intelligence, IEEE
Transactions on, 34(11):2274–2282, 2012.
A. Amedi, W. M. Stern, J. A. Camprodon, F. Bermpohl, L. Merabet, S. Rotman, C. Hemond,
P. Meijer, and A. Pascual-Leone. Shape conveyed by visual-to-auditory sensory substitution
activates the lateral occipital complex. Nature neuroscience, 10(6):687–689, 2007.
P. Arno, C. Capelle, M.-C. Wanet-Defalque, M. Catalan-Ahumada, and C. Veraart. Auditory
coding of visual patterns for the blind. Perception, 28:1013–1030, 1999.
P. Arno, A. G. De Volder, A. Vanlierde, M.-C. Wanet-Defalque, E. Streel, A. Robert, S. Sanabria-
Bohorquez, and C. Veraart. Occipital activation by pattern recognition in the early blind
using auditory substitution for vision. Neuroimage, 13(4):632–645, 2001a.
P. Arno, A. Vanlierde, E. Streel, M.-C. Wanet-Defalque, S. Sanabria-Bohorquez, and C. Veraart.
Auditory substitution of vision: pattern recognition by the blind. Applied Cognitive Psychology,
15(5):509–519, 2001b.
M. Auvray and E. Myin. Perception with compensatory devices: from sensory substitution to
sensorimotor extension. Cognitive Science, 33(6):1036–1058, 2009.
M. Auvray, S. Hanneton, C. Lenay, and K. O’Regan. There is something out there: distal
attribution in sensory substitution, twenty years later. Journal of Integrative Neuroscience, 4
(4):505–21, 2005.
M. Auvray, S. Hanneton, and J. K. O'Regan. Learning to perceive with a visuo-auditory
substitution system: Localisation and object recognition with 'The vOICe'. Perception,
36(3):416, 2007.
P. Bach-y Rita. Brain plasticity as a basis for therapeutic procedures. Recovery of function:
Theoretical considerations for brain injury rehabilitation, pages 225–263, 1980.
P. Bach-y Rita. Tactile vision substitution: past and future. International Journal of
Neuroscience, 19(1-4):29–36, 1983.
P. Bach-y Rita. Tactile sensory substitution studies. Annals of the New York Academy of
Sciences, 1013:83–91, 2004.
P. Bach-y Rita and S. W. Kercel. Sensory substitution and the human–machine interface. Trends
in cognitive sciences, 7(12):541–546, 2003.
P. Bach-Y-Rita, C. C. Collins, F. A. Saunders, B. White, and L. Scadden. Vision Substitution
by Tactile Image Projection. Nature, 221(5184):963–964, Mar. 1969.
D. R. Begault et al. 3-D sound for virtual reality and multimedia, volume 955. AP professional
Boston etc, 1994.
K. W. Berger. Some factors in the recognition of timbre. The Journal of the Acoustical Society
of America, 36(10):1888–1891, 2005.
B. Blasch, S. LaGrow, and W. De l’Aune. Three aspects of coverage provided by the long cane:
Object, surface, and foot-placement preview. Journal of Visual Impairment and Blindness, 90:
295–301, 1996.
D. J. Calder. Assistive technologies and the visually impaired: a digital ecosystem perspective.
In Proceedings of the 3rd International Conference on PErvasive Technologies Related to
Assistive Environments, page 1. ACM, 2010.
L. Cancar, A. Díaz, A. Barrientos, D. Travieso, and D. M. Jacobs. Tactile-sight: A sensory
substitution device based on distance-related vibrotactile flow. International Journal of
Advanced Robotic Systems, 10, 2013.
C. Capelle, C. Trullemans, P. Arno, and C. Veraart. A real-time experimental prototype for
enhancement of vision rehabilitation using auditory substitution. Biomedical Engineering,
IEEE Transactions on, 45(10):1279–1293, 1998.
A. Caspi, J. D. Dorn, K. H. McClure, M. S. Humayun, R. J. Greenberg, and M. J. McMahon.
Feasibility study of a retinal prosthesis: spatial vision with a 16-electrode implant. Archives
of Ophthalmology, 127(4):398–401, 2009.
D.-R. Chebat, C. Rainville, R. Kupers, and M. Ptito. Tactile-'visual' acuity of the tongue in
early blind individuals. Neuroreport, 18(18):1901–1904, 2007.
O. Collignon, M. Lassonde, F. Lepore, D. Bastien, and C. Veraart. Functional cerebral
reorganization for auditory spatial processing and auditory substitution of vision in early blind
subjects. Cerebral Cortex, 17(2):457–465, 2007.
J. Crary. Modernizing vision. Images: A Reader, page 270, 2006.
G. Declerck, C. Lenay, and A. Khatchatourov. Rendre tangible le visible. Irbm, 30(5):252–257,
2009.
A. Dodds et al. The nottingham obstacle detector: Development and evaluation. Journal of
Visual Impairment and Blindness, 75(5):203–09, 1981.
R. Erickson. Sound structure in music. Univ of California Press, 1975.
I. Fujinaga. Machine recognition of timbre using steady-state tone of acoustic musical instruments.
In Proceedings of the International Computer Music Conference, pages 207–10. Citeseer, 1998.
I. Fujinaga and K. MacMillan. Realtime recognition of orchestral instruments. In Proceedings of
the international computer music conference, volume 141, page 143, 2000.
R. W. Furse. Building an openal implementation using ambisonics. In Audio Engineering
Society Conference: 35th International Conference: Audio for Games. Audio Engineering
Society, 2009.
N. A. Giudice and G. E. Legge. Blind navigation and the role of technology. Engineering
handbook of smart technology for aging, disability, and independence, pages 479–500, 2008.
L. H. Goldish and H. E. Taylor. The optacon: A valuable device for blind persons. New Outlook
for the Blind, 68(2):49–56, 1974.
G. Guarniero. Experience of tactile vision. Perception, 3(1):101–104, 1974.
A. D. Heyes. The sonic pathfinder: A new electronic travel aid. Journal of Visual Impairment
and Blindness, 78(5):200–02, 1984.
M. S. Humayun, J. D. Weiland, G. Y. Fujii, R. Greenberg, R. Williamson, J. Little, B. Mech,
V. Cimmarusti, G. Van Boemel, G. Dagnelie, et al. Visual perception in a blind subject with
a chronic microelectronic retinal prosthesis. Vision research, 43(24):2573–2581, 2003.
A. Jacomuzzi and N. Bruno. Perceiving occlusion through auditory–visual substitution. Cognitive
Processing, 7:128–130, 2006.
K. A. Kaczmarek. The tongue display unit (tdu) for electrotactile spatiotemporal pattern
presentation. Scientia Iranica, 18(6):1476–1485, 2011.
K. Khoshelham. Accuracy analysis of kinect depth data. In ISPRS workshop laser scanning,
volume 38(5), page W12, 2011.
J.-K. Kim and R. J. Zatorre. Generalized learning of visual-to-auditory substitution in sighted
individuals. Brain research, 1242:263–275, 2008.
A. J. Kolarik, M. A. Timmis, S. Cirstea, and S. Pardhan. Sensory substitution information
informs locomotor adjustments when walking through apertures. Experimental brain research,
pages 1–10, 2013.
R. Kupers, D. R. Chebat, K. H. Madsen, O. B. Paulson, and M. Ptito. Neural correlates of
virtual route recognition in congenital blindness. Proceedings of the National Academy of
Sciences, 107(28):12716–12721, 2010.
R. Kurzweil, M. L. Schneider, and M. L. Schneider. The age of intelligent machines, volume
579. MIT press Cambridge, 1990.
R. Kurzweil, F. Bhathena, and S. Baum. Reading machine system for the blind having a
dictionary, Mar. 2000. US Patent 6,033,224.
V. Larcher, O. Warusfel, J.-M. Jot, and J. Guyard. Study and comparison of efficient methods
for 3-d audio spatialization based on linear decomposition of hrtf data. In Audio Engineering
Society Convention 108. Audio Engineering Society, 2000.
C. Lenay, S. Canu, and P. Villon. Technology and perception: the contribution of sensory
substitution systems. In Cognitive Technology, 1997. Humanizing the Information Age.
Proceedings., Second International Conference on, pages 44–53. IEEE, 1997.
C. Lenay, O. Gapenne, S. Hanneton, C. Marque, and C. Genouelle. Sensory substitution: Limits
and perspectives. Touching for knowing, pages 275–292, 2003.
D. Lescal, J. Rouat, and J. Voix. Sensorial substitution system from vision to audition using
transparent digital earplugs. In Proceedings of Meetings on Acoustics, volume 19, page 040014.
Acoustical Society of America, 2013.
D. J. Levitin. This is your brain on music: Understanding a human obsession. Atlantic Books
Ltd, 2013.
O. Linkovski, L. Akiva-Kabiri, L. Gertner, and A. Henik. Is it for real? evaluating authenticity
of musical pitch-space synesthesia. Cognitive processing, 13(1):247–251, 2012.
R. Liu and Y.-X. Wang. Auditory feedback and sensory substitution during teleoperated
navigation. Mechatronics, IEEE/ASME Transactions on, 17(4):680–686, 2012.
I. Matteau, R. Kupers, E. Ricciardi, P. Pietrini, and M. Ptito. Beyond visual, aural and haptic
movement perception: hmt+ is activated by electrotactile motion stimulation of the tongue in
sighted and in congenitally blind individuals. Brain research bulletin, 82(5):264–270, 2010.
S. Meers and K. Ward. Substitute three-dimensional perception using depth and colour sensors.
Faculty of Informatics-Papers, page 578, 2007.
L. B. Merabet, L. Battelli, S. Obretenova, S. Maguire, P. Meijer, and A. Pascual-Leone.
Functional recruitment of visual cortex for sound encoded object identification in the blind.
Neuroreport, 20(2):132–138, 2009.
R. A. Normann, E. M. Maynard, P. J. Rousche, and D. J. Warren. A neural interface for a
cortical vision prosthesis. Vision research, 39(15):2577–2587, 1999.
C. V. Parise and C. Spence. Audiovisual crossmodal correspondences and sound symbolism: A
study using the implicit association test. Experimental brain research, 220(3-4):319–333, 2012.
K. Patil, D. Pressnitzer, S. Shamma, and M. Elhilali. Music in our ears: the biological bases of
musical timbre perception. PLoS computational biology, 8(11):e1002759, 2012.
W. Penrod and T. Simmons. An evaluation and comparison of the hand guide by guideline and
the miniguide developed by gdp research electronic travel devices. Closing the Gap, 23(6):
22–24, 2005.
C. Poirier, A. De Volder, D. Tranduy, and C. Scheiber. Pattern recognition using a device
substituting audition for vision in blindfolded sighted subjects. Neuropsychologia, 45(5):
1108–1121, 2007a.
C. Poirier, A. G. D. Volder, and C. Scheiber. What neuroimaging tells us about sensory
substitution. Neurosci Biobehav Rev, 31(7):1064–1070, 2007b.
B. Pollok, I. Schnitzler, P. Stoerig, T. Mierdorf, and A. Schnitzler. Image-to-sound conversion:
experience-induced plasticity in auditory cortex of blindfolded adults. Experimental brain
research, 167(2):287–291, 2005.
N. Pressey. Mowat sensor. Focus, 11(3):35–39, 1977.
M. J. Proulx. Synthetic synaesthesia and sensory substitution. Consciousness and Cognition,
19(1):501–503, 2010.
M. J. Proulx, P. Stoerig, E. Ludowig, and I. Knoll. Seeing 'where' through the ears: effects of
learning-by-doing and long-term sensory deprivation on localization based on image-to-sound
substitution. PLoS One, 3(3):e1840, 2008.
M. Ptito, S. M. Moesgaard, A. Gjedde, and R. Kupers. Cross-modal plasticity revealed by
electrotactile stimulation of the tongue in the congenitally blind. Brain, 128(3):606–614, 2005.
M. Ptito, A. Fumal, A. M. de Noordhout, J. Schoenen, A. Gjedde, and R. Kupers. Tms of the
occipital cortex induces tactile sensations in the fingers of blind braille readers. Experimental
Brain Research, 184(2):193–200, 2008.
J. P. Rauschecker. Compensatory plasticity and sensory substitution in the cerebral cortex.
Trends in neurosciences, 18(1):36–43, 1995.
J. P. Rauschecker. Cortical plasticity and music. Annals of the New York Academy of Sciences,
930(1):330–336, 2001.
L. Renier, C. Laloyaux, O. Collignon, D. Tranduy, A. Vanlierde, R. Bruyer, and A. G. De Volder.
The ponzo illusion with auditory substitution of vision in sighted and early-blind subjects.
Perception, 34(7):857–867, 2004.
L. Renier, O. Collignon, C. Poirier, D. Tranduy, A. Vanlierde, A. Bol, C. Veraart, and A. G.
De Volder. Cross-modal activation of visual cortex during depth perception using auditory
substitution of vision. Neuroimage, 26(2):573–580, 2005.
J. F. Rizzo III, J. Wyatt, M. Humayun, E. de Juan, W. Liu, A. Chow, R. Eckmiller, E. Zrenner,
T. Yagi, and G. Abrams. Retinal prosthesis: an encouraging first decade with major challenges
ahead. Ophthalmology, 108(1):13–14, 2001.
U. R. Roentgen, G. J. Gelderblom, M. Soede, and L. P. de Witte. Inventory of electronic
mobility aids for persons with visual impairments: A literature review. Journal of Visual
Impairment & Blindness, 102(11), 2008a.
U. R. Roentgen, G. J. Gelderblom, M. Soede, and L. P. de Witte. Journal of Visual Impairment
& Blindness, pages 702–724, 2008b.
O. Sacks. Musicophilia: Tales of music and the brain. Random House LLC, 2010.
T. Schanze, L. Hesse, C. Lau, N. Greve, W. Haberer, S. Kammer, T. Doerge, A. Rentzos,
and T. Stieglitz. An optically powered single-channel stimulation implant as test system for
chronic biocompatibility and biostability of miniaturized retinal vision prostheses. Biomedical
Engineering, IEEE Transactions on, 54(6):983–992, 2007.
J. H. Siegle and W. H. Warren. Distal attribution and distance perception in sensory substitution.
Perception, 39(2):208, 2010.
G. Smith. The stereotoner: A new reading aid for the blind. In Proceedings of the 25th Annual
Conference on Engineering and Medical Biology, 1972.
R. D. Steele, G. L. Goodrich, D. Hennies, and J. A. McKinley. Reading aid technology for blind
persons: Responses to a questionnaire of experienced users. Assistive Technology, 1(2):23–30,
1989.
E. Striem-Amit, L. Cohen, S. Dehaene, and A. Amedi. Reading with sounds: sensory substitution
selectively activates the visual word form area in the blind. Neuron, 76(3):640–652, 2012.
J. Ward and P. Meijer. Visual experiences in the blind induced by an auditory sensory substitution
device. Consciousness and cognition, 19(1):492–500, 2010.
B. W. White, F. A. Saunders, L. Scadden, P. Bach-Y-Rita, and C. C. Collins. Seeing with the
skin. Perception & Psychophysics, 7(1):23–27, 1970.
M. D. Williams, C. T. Ray, J. Griffith, and W. De l’Aune. The use of a tactile-vision sensory
substitution system as an augmentative tool for individuals with visual impairments. Journal
of Visual Impairment and Blindness, 105(1), Jan. 2011.
G. Xydas, V. Argyropoulos, T. Karakosta, and G. Kouroupetroglou. An experimental approach
in recognizing synthesized auditory components in a non-visual interaction with documents.
Proc. Human-Computer Interaction-HCII, 2005.
D. Yanai, J. D. Weiland, M. Mahadevappa, R. J. Greenberg, I. Fine, and M. S. Humayun. Visual
performance using a retinal prosthesis in three subjects with retinitis pigmentosa. American
journal of ophthalmology, 143(5):820–827, 2007.
J. S. Zelek, S. Bromley, D. Asmar, and D. Thompson. A haptic glove as a tactile-vision sensory
substitution for wayfinding. Journal of Visual Impairment & Blindness, 97(10), 2003.
E. Zrenner. Will retinal implants restore vision? Science, 295(5557):1022–1025, 2002.
E. Zrenner, K. U. Bartz-Schmidt, H. Benav, D. Besch, A. Bruckmann, V.-P. Gabel, F. Gekeler,
U. Greppmaier, A. Harscher, S. Kibbel, et al. Subretinal electronic chips allow blind patients
to read letters and combine them to words. Proceedings of the Royal Society B: Biological
Sciences, 278(1711):1489–1497, 2011.
“Good sense is at the bottom of everything: virtue, genius, wit, talent and taste.”
— J.J. de Chenier (1764–1811)