
3D-VITA

Robrecht Jurriaans

3D Visual Information to Audio


3D-VITA

Sensory Substitution Device Employing Virtual Synaesthesia Based on Common Weak Associations between Visual Appearance and Auditory Stimuli

A thesis submitted in conformity with the requirements for the degree of

MSc. in Artificial Intelligence

Robrecht Constantijn Jurriaans
[email protected]

BSc Artificial Intelligence, Universiteit van Amsterdam, 2011

Supervisor: Jan van Gemert

Informatics Institute, Faculty of Science, Universiteit van Amsterdam

Science Park 904, 1098 XH Amsterdam

2014


Abstract

In this thesis a novel sensory substitution device is proposed. It takes visual information from an RGB-D camera (Microsoft Kinect) and maps that information to an audio signal. This is achieved by segmenting the images taken with the RGB-D camera into superpixels using the SLIC algorithm. A visual descriptor is computed for each superpixel according to the standard features used in state-of-the-art material recognition systems. These descriptors are mapped to a set of parameters for audio synthesis based on literature on weak synaesthesia and etymological connections. Finally, the sounds are played through a set of headphones using HRTF-enabled binaural audio. The camera is mounted on top of the headphones, enabling the user to scan the environment. The system is evaluated in an object detection task and a virtual localisation task. In the virtual task, the user is given a sequence of stimuli from three fixed locations and is asked to move his or her gaze towards where the sound appears to come from. This task determines whether the user is capable of localising binaural sound. In the object detection task, a large object is moved around the user, who is asked to indicate where the object is located in the scene by moving his or her gaze towards its perceived position. The correlation between stimulus and reaction is determined for each user and normalised with the stimulus-reaction correlation from the virtual task. The results show that the system has a slight bias to the right, meaning that the virtual audio space is rotated to the right around the user. Furthermore, the localisation task shows that the system provides a sense of distal attribution without the need for training. In conclusion, the system is capable of supplying the user with visual information through audio in such a way that the user experiences the visual information.

“Just because a man lacks the use of his eyes doesn’t mean he lacks vision.”
— Stevie Wonder (1950–)


Chapter 1

Introduction

Imagine yourself, standing on New York’s famous Times Square.

Now imagine yourself standing there without the ability to see, hear or even feel.

Try to get to the other side, through the chaos that this place is known for.

As humans, we are strongly dependent on our ability to perceive the world. To do this we

have evolved incredibly complex systems that are capable of retrieving information from the

environment around us. With our eyes we can detect light intensity and even the frequency of the

light wave, within the range of approximately 430 THz–790 THz. Our ears can detect pressure

differences, down to 20 µPa, which allows us to hear the faintest of movement. With our nose

we are capable of sensing the composition of the air we are breathing. Our tongue can analyse

what we are eating. Our entire body is covered with pressure-sensitive nerves which allow us to

feel pressure, temperature and all kinds of other properties of the air around us. These are the

five senses, but humans possess even more senses than these basic five. We also have a sense for

acceleration, balance, orientation, time, pain, hunger, thirst and a plethora of information about

our internal state. All of this information is measured by sensory cells and converted into an

electrical signal that is sent to the brain. Our brain then integrates all of this information, which

allows us to experience the world around us.

With a system as complex as the brain in charge of perception, it is nothing short of a miracle

that it usually works quite well. However, due to the complexity and the sensitivity of our sensory

system, it is not surprising that parts of the system may fail. Sensory impairment can arise

when a sense organ does not function, when the connection between the organ and the brain fails,

or even when the processing within the brain itself is disturbed. Although some forms of sensory impairment

result in minor inconveniences, such as anosmia, others can have a big impact on the autonomy

of an individual, such as blindness, vertigo and disequilibrium. This is mainly due to how we,

as humans, have arranged the world around us. We have shaped our day-to-day

infrastructure to cater to able-bodied individuals. More often than not, this leads to problems for


those that have disabilities. However, the issues with sensory impairment do not necessarily arise

from the lack of sensory input, e.g. the inability to see or hear, but rather from the diminished

amount of information that the individual can use. If the information that would normally be

acquired through sight were available to a blind individual, being blind would be less of a

problem. To facilitate this, sensory substitution devices can be used.

A sensory substitution device is a device, or system, that takes information from one modality

and converts it into a signal in another modality. This is done so that the information can be

perceived by a person with an impairment in the first modality. For instance, a blind person is

not capable of sensing visual information. By mapping the visual information to an audio signal,

he or she gets access to this visual information. The idea is that there is a difference between

sensing and perceiving.

Definition:

Sensing is what your body does. It is a reaction of one of your organs to external or internal stimuli, which is converted into an electrical signal that is sent to the brain.

Perceiving is what your brain does. It takes the sensory input and converts it into a mental representation.

It follows that perception can happen without sensing and vice versa. For instance, hallucinations

occur when we perceive something we cannot have sensed. Sensing without perceiving occurs

more frequently and is often caused by our attention blocking out these signals. In the case of

audible-vision sensory substitution we may perceive visual information without actually sensing

the visual information through our visual system, but rather sensing the auditory signal which

conveys the same information or an approximation of what our vision would have sensed. The

reason that perception is still possible is that perception is done by the brain. More specifically,

the brain integrates all sensory input into a single representation of the world. So in a sense,

being blind is not an issue due to the lack of seeing, or even a lack of detecting light, but rather

due to the information that we take from light not being available in the integration process.

Using sensory substitution devices circumvents this problem by making that information accessible

to the brain and thus augmenting perception with the missing information.

Although sensory substitution devices are usually designed for people with a sensory impairment,

there are also many other possible applications of sensory substitution. Using more elaborate

sensors we can augment the “standard” set of sense organs and thus augment our perception of

reality. A large group of bio-hackers use magnetic implants which allow them to perceive magnetic

fields in the form of pressure (not to be confused with magnetoception, the somewhat weak sense that

allows humans to feel their orientation relative to the earth's magnetic field). This magnetic

sense works because the implanted magnet moves slightly in response to magnetic fields, allowing the user

to perceive these fields through touch.


Figure 1.1: A screenshot from the game Mark of the Ninja in which sound perception is improved by visual representations such as the transparent circles representing how far sound has travelled

Figure 1.2: Ben Underwood, who is fully blind, can skate using his extraordinary hearing, earning him the nickname “the real-life Batman”

Another possible application is in user interface design. Take for instance the game Mark of the

Ninja by Klei Entertainment (http://www.markoftheninja.com). In Mark of the Ninja the

player plays as a highly skilled ninja with excellent stealth capabilities. To facilitate this feel, the

designers have implemented a system which allows users to see sound. Everything that makes a

sound also creates a circle on the screen which indicates how far this sound is audible, as can

be seen in fig. 1.1. Also note the sleeping dog on the right of the image which is shown through

a visualisation of the typical representation of snoring in popular culture in the form of “zzzz.”

Another use of sensory substitution in this game is the visualisation of what enemies can see,

which allows the player to “sense” how visible the avatar is at any given time. The same type of

user interface design can be found in the real world, for instance using colours to signify concepts,

such as red fonts on letters signifying importance.

There are types of sensory substitution which do not rely on external systems or devices. For

instance, there is a group of people that can use echolocation to sense distance and surface

shape. Ben Underwood, seen in fig. 1.2, was such an individual. He generated “clicks” with

his tongue and listened to the echoes of this sound. His brain interpreted the returning sound

giving him an indication of distances to objects around him. Furthermore, he achieved this

without external devices allowing him to quickly assess his surroundings and build a mental

representation. However, although this technique can be seen as sensory substitution, with the

echoes functioning as stimulator substituting a sense of distance, it is also perhaps just a normal

sense taken to its extreme form.

Perhaps the “ninja”-like super-power of Ben Underwood was not a case of sensory substitution,

but more a case of well-developed integration of sensory input. Many of the sensory substitution

devices are designed to use this sensory integration to give the user the experience of perceiving.


Perhaps sensory substitution devices are not just sensory substitution, but rather designed to

utilise this integration, creating a form of synaesthesia. Synaesthesia is the involuntary coupling

of sensory experience. For instance, people with synaesthesia [Linkovski et al., 2012] may read

the character “7” and perceive it as being orange or they may experience notes played on a piano

to have a blueish tint. It is important to note that these associations are involuntary and that they

must not be confused with explicit associations that people may have. An example of an explicit

association is that music played on a steel-drum may remind somebody of a beach. This type

of explicit association is caused by memory and exposure, rather than the implicit involuntary

associations that come with synaesthesia.

The sensory substitution device proposed in this thesis combines the traditional set-up of sensory

substitution devices with synaesthesia-like mapping between vision and audio. The system is

called 3D Visual Information To Audio, or 3D-VITA in short. It uses a camera capable of

capturing both colour and depth information, and a 3D auditory illusion based on Head-Related

Transfer Functions (HRTFs), so that the 3D spatial information from the scene can be directly encoded in

the sounds. This means that the sounds that are generated only need to encode visual information

and not spatial information. The translation of visual information to audio is guided by common

weak synaesthesia as well as etymological similarities within the vocabulary of visual and auditory

phenomena. This mapping is based on literature from material recognition and on synaesthetic

experiments. The idea is that the most important aspect of the sounds must be that they are

differentiable [Parise and Spence, 2012]: bright colours must sound bright, and dampened

colours should sound dampened.

3D-VITA, as introduced in this thesis, is meant for people with a visual impairment and its

main purpose is to facilitate navigation. However, if basic visual recognition is also supported,

navigation might become easier. For instance, being able to move from point A to point B while

avoiding obstacles is a possible navigational task. But being able to actually identify point B

gives the user far more independence. Most sensory substitution devices focussed on navigational

tasks do not take detailed visual information into account, but instead take spatial information

from the visual domain. Some systems do incorporate simple visual cues such as light intensity.

On the other hand, systems that focus on visual appearance generate stimuli that are complex,

such as systems that do object recognition and return the class of object as a speech signal. In

chapter 2 these systems will be discussed in more detail.


Chapter 2

Background

2.1 Sensory Substitution Devices

Sensory Substitution Devices are devices or systems that take stimuli, or sensory data, from one

modality and convert that signal into a signal in another modality. The resulting signal is then

returned to the user. One of the first sensory substitution devices was a system [Bach-Y-Rita

et al., 1969] that took images from a camera and converted it to a tactile signal on the back

of the user’s head. It is important to note that a sensory substitution device, such as a system

that converts video into a tactile signal, does not result in the user sensing [Declerck et al., 2009,

Poirier et al., 2007b] what the camera is providing. However, the user could potentially perceive

the visual information as the brain eventually grows accustomed to the new data it is receiving. This

distinction may seem arbitrary, but it lies at the basis of designing such systems. Perception

occurs due to the integration of stimuli from different modalities and therefore sensory substitution

devices result in perception rather than sensing. In the case of Bach-Y-Rita’s [Bach-Y-Rita et al.,

1969] tactile-vision-sensory-substitution, the user may perceive visual information, but senses

only tactile input. This form of perception is called distal attribution [Auvray et al., 2005] which

is the attribution of sensory experience to an external and distinct object.

All sensory substitution devices utilise [Auvray and Myin, 2009, Lenay et al., 1997, Bach-y Rita

and W Kercel, 2003] the same principle, namely a sensor is used which can retrieve data in

the same modality as the sense that needs to be substituted. This data is then processed and

converted into data which serves as input to a stimulator. The stimulator acts as an interface

between the device and the user. For instance, in a Text-To-Speech system the sensor takes

text and converts this into an audio signal, which is then outputted via a speaker that acts as

the stimulator. Conversely, in a Speech-To-Text system, the sensor is a microphone and the

stimulator is a screen to display the output. There are also types of sensory substitution that do

not need a sensor or a processing unit, but instead are designed to convey information that is

usually found in one modality. Perhaps the best-known variant of this type is Braille, which

takes textual information that is usually represented visually and displays it through tactile

sensation. Text itself is also a form of sensory substitution which takes an auditory signal and


Figure 2.1: The various possible sensory substitution devices: 1. Audible Vision, 2. Visible Touch, 3. Tactile Vision, 4. Audible Touch, 5. Tactile Hearing, 6. Audible Spatial Awareness, 7. Tactile Balance, 8. Tactile Spatial Awareness, 9. Tactile Sensory Relocation, 10. Visible Audio; excluded are Visible Balance and Visible Spatial Awareness

displays it visually.

In this section an overview is given of sensory substitution devices that substitute vision. Since

vision is far more complex [Calder, 2010, Crary, 2006] and actually comprises several individual

senses, e.g. sense of light intensity, sense of colour, sense of depth [Siegle and Warren, 2010],

sense of motion, the systems are categorised by their purpose. That is to say, the type of function

usually provided by vision that they are intended to replace. Further categorisation is done on

the modality of the stimulator that is used. This results in a categorisation as seen in fig. 2.1.

Omitted from this categorisation are the gustatory and olfactory senses due to their isolated and

specific roles in how we perceive the world. In the figure arrows point from the stimulator to the

sensor that is substituted. Note that Visible Balance and Visible Orientation are not represented

in the categorisation. These types of systems can be found in most vehicles, especially airborne,

as indicators for the pilot or operator. First a brief overview is given of the various stimulators

that can be utilised for sensory substitution.

2.2 Stimulators

Tactile Stimulators

There are different methods of returning tactile stimuli to the user. This has to do with the

different physiological premises of how humans perceive touch. There are different types of


nerve-endings that act as sensors and these different sensors react differently to various stimuli.

Some of the sensors respond rapidly to the signal and only respond to changes in the signal,

while others return a more continuous output. The different nerve clusters can also differ in

the frequencies at which they have the strongest response, such as the Lamellar Corpuscle having the

strongest response at 250 Hz while Merkel Nerve Endings have the strongest response around

5 Hz–15 Hz.

To stimulate these nerves, there are two types of stimulators: electrotactile and vibrotactile.

Electrotactile stimulators use electric signals to directly stimulate the nerves. This can be done

either from the skin, or directly to the nerve. The latter uses less energy, but is also more invasive.

Vibrotactile stimulators use pressure and vibration to stimulate the nerves. The problem with

vibrotactile stimulation is that the nerves that can get activated through this have various spatial

and temporal resolution constraints, depending on where on the body the stimulator is attached.

In recent years, both types of stimulation have been successfully used on the tongue [Williams

et al., 2011], whose nerves have a high spatial resolution and require minimal stimulation to be

activated.

Visual Stimulators

There is a broad range of visual stimulators which can be divided into two main categories: screens

and implants. Screens consist of one or multiple lights, which can be monochrome or have

a set of possible colours. For example, traffic lights consist of three monochrome lights ordered

spatially. Both the colours and the spatial position of the lights convey meaning. Implants are

connected directly to the optical nerve and work by applying patterns of electrical stimulation to it,

which are then processed by the visual cortex.

Auditory Stimulators

Audio is a very strong stimulator due to the human auditory system being able to deal with very

complex and rapidly changing sound patterns [Auvray and Myin, 2009], even in the presence of

noise. Furthermore, the perceptual resolution of the auditory system is very fine for both pitch

and amplitude. In chapter 3 the perception of audio will be discussed in finer detail. A final

advantage of audio as a stimulator as opposed to visual and haptic interfaces is that audio is

relatively low-cost to produce in terms of computation and the energy required by the stimulator.

2.3 Visual Substitution

Reading Substitutional Aids

Language has seen many forms of sensory substitution. It is difficult to assess where the exact

boundary lies between normal sensory integration and the field of sensory substitution. Written

and spoken language are in essence substitutions of one another or perhaps both are sensory

substitutions of a mental language. Due to both blindness and deafness, as well as a plethora of

mental conditions including agraphia, dyslexia and alexia, there have been numerous endeavours


[Steele et al., 1989] in substituting modalities of language. For instance, sign language is a

visible-audio substitution.

Braille is a haptic written language developed in the 19th century by Louis Braille (1809–1852).

Each character consists of six dots in a two-by-three array. Each dot can have one of two values:

either the dot is raised or it is not. This gives 2^6 = 64 dot patterns; excluding the empty pattern

leaves a total of 63 possible characters.

The Optacon [Goldish and Taylor, 1974] is a reading aid using a haptic stimulator consisting of

6×24 vibrating pins. It gives a direct mapping of the image from a small hand-held camera. It is operated by

placing a finger on the vibrating pins and moving the camera over printed text with the other

hand.

Akin to the Optacon, the Stereotoner [Smith, 1972] is intended as a reading aid and uses a

hand-operated camera with a very small field of view. However, the haptic stimulator of the

Optacon is instead replaced with an auditory stimulus. The image from the camera is put

through a line detector and the detected lines are mapped to sounds varying in pitch.

The Kurzweil Reading Machine is a print-to-speech machine [Kurzweil et al., 1990, 2000] consisting

of an omni-font optical character recognition system that uses a flat-bed scanner to convert

printed text to a digital signal which is then given as the input for a text-to-speech device. It is

mainly used for reading printed media. Unlike the Optacon and the Stereotoner, the Kurzweil

Reading Machine preprocesses the signal before activating the stimulator, which alleviates the

user from processing the signals and reduces the training and adaptation period to a minimum.

Later reading aids [Steele et al., 1989] mostly follow the principles behind the Kurzweil Reading

Machine, but have improved on the optical character recognition and text-to-speech synthesis

components. As in the Kurzweil Reading Machine, 3D-VITA employs the same concept of

preprocessing the input signal to create a more compact auditory representation. However,

unlike the Kurzweil Reading Machine, 3D-VITA also uses a more complex mapping instead of a

direct auditory representation which allows for more flexibility in the type of entities that can be

recognised by the user.

Other types of reading aids focus on the addition of meta-data [Xydas et al., 2005] to visual

documents to improve speech synthesis. The idea is that the original information within the

documents, i.e. the raw text data, is not complete enough to perform meaningful substitution and

that the addition of information such as mood, the visual lay-out of the document and the role

of each piece of text, e.g. the header, subtitles and main text, greatly improves understanding.

Adding this meta-data improves audible-vision perception tasks.

Obstacle Avoidance

Navigation is an essential skill for autonomy [Meers and Ward, 2007, Roentgen et al., 2008a,

Giudice and Legge, 2008] and relies heavily on our visual system. As with reading aids, early rudimentary

systems relied on substitution to a tactile stimulus.

The Long Cane [Blasch et al., 1996] is not necessarily a tactile-vision sensory substitution device


as its main goal is not to replace vision, but rather to allow the user to successfully navigate a

complex environment. In a sense, it replaces the sense of distance rather than the full visual

system. As with Braille, the simple yet effective design of the Long Cane has kept it

in common use, although more advanced substitutes are available. As an improvement

on the Long Cane, Laser Canes remove the necessity to physically touch the objects in an

environment. Most Laser Canes utilise some type of distance sensor.

The Mowat sensor [Pressey, 1977] uses high-frequency sound to detect distances, which are used

as the input for a vibrating stimulator. The Nottingham Obstacle Detector [Dodds et al., 1981]

is an early prototype for the Sonic Pathfinder [Heyes, 1984] and utilises a Mowat Sensor. The

Sonic Pathfinder instead employs a form of audible-vision sensory substitution and converts the

detected objects into tones that indicate the structure of the environment.

Combining principles from both the Nottingham Obstacle Detector and the Sonic Pathfinder,

the MiniGuide [Penrod and Simmons, 2005, Kolarik et al., 2013] uses ultrasonic echo-location to

detect objects, mapping the signal to both vibration and sound feedback. Unlike most other

laser canes, the MiniGuide is meant as a secondary aid, to be used together with a primary aid

such as a seeing eye dog or a white cane. As such, the MiniGuide has an easily accessible button

to switch it on and off, so that the user activates it only when more detail of the environment

is needed. The Ultracane builds upon the concept of the Nottingham Obstacle Detector, but

uses two Mowat sensors [Penrod and Simmons, 2005] connected to two vibrating pads in the

handle. By using two sensors a more detailed signal can be delivered to the user.

The Bat ’K’ Sonar-Cane also utilises [Roentgen et al., 2008a,b] ultra-sonic echo-location which

results in distance measurements represented as beeps of varying pitches. The ultra-sonic device

is mounted on the end of a traditional long cane. By sweeping the Sonar-Cane the user is given

a continuous soundscape representing the distance to various objects in the direct vicinity of the

user.

The TSight [Cancar et al., 2013] uses an infra-red range sensor coupled to vibrating actuators as

stimulators, steering the user away from objects that are moving towards the user. The device

extracts time-to-contact distance measurements and maps these to an array of vibrating pads

carried around the waist. This type of sensory substitution is already closer to true vision

substitution than the laser canes, as measurements in a 3D environment are converted to a 2D

representation. However, as light intensity and colour information are ignored, this device is

not an example of true vision substitution. Akin to the Kurzweil Reading Machine, this device

utilises preprocessing of the raw data from the sensor as opposed to using a direct mapping. The

time-to-contact is measured using optical flow and thus relies on motion within the visual field.

The TSight allows users to hit moving targets with the temporal and spatial precision

of a sighted person in the same task.

Finally, sensory augmentation has been successfully applied to the task of teleoperated navigation

[Liu and Wang, 2012] by providing auditory feedback. This auditory feedback resulted in a

significantly lower number of collisions with other objects and increased precision in navigation.


True Vision Substitution

The type of sensory substitution that enables users to retrieve full visual information from the

environment is called true vision substitution. Unlike sensory substitution devices that only give

one type of visual information, such as reading and obstacle avoidance aids, they try to give a full

overview of the visual information of the scene. These true vision substitution devices are often

implanted devices that take visual information from a camera or photoelectric sensor and convert

this visual signal into an electronic signal, which is then delivered either directly to the optical nerve or

via a retinal electrode. Non-implantable devices map the visual 2D information [Auvray and

Myin, 2009, Lenay et al., 1997] to a stimulus of a different modality. These sensory substitution

devices rely on more extensive mapping [Renier et al., 2005] between the visual information and

the modality of the stimulator. Such mapping is a form of synaesthesia [Proulx, 2010] which

utilises the sensory integration performed by the brain to activate the visual cortex. Although

implantable devices are often considered [Bach-y Rita and W Kercel, 2003, Rauschecker, 1995,

Normann et al., 1999] to be better at activating the visual cortex, the procedure is more costly

than most non-implantable devices [Zrenner, 2002] and the invasiveness [Rizzo III et al., 2001,

Lenay et al., 2003] of the procedure makes them generally less desirable than non-implantable devices.

Cortical or retinal electrode matrix display

Early work on retinal implants shows great promise [Zrenner, 2002] for the possibility of using

these implants to restore vision. For most implants the reaction of the brain is immediate

[Schanze et al., 2007] and the users generally report that they feel like they are seeing when

using such devices. These implants work by taking an array of electrodes and implanting these

within either the optical nerve or in nerves that are close to the skin [Caspi et al., 2009]. Users

of such systems have successfully performed various tasks, such as reading letters [Zrenner

et al., 2011], navigation [Normann et al., 1999], object recognition [Humayun et al., 2003] and

orientation [Yanai et al., 2007].

Electrocutaneous and Vibrotactile Displays

The first non-invasive sensory substitution device that targeted true vision substitution was the

TVSS (Tactile-Vision Sensory Substitution) device [Bach-Y-Rita et al., 1969], which utilised

an array of hundreds of vibrating pins on the back of the subject which represented the image

as captured in real-time by a camera. The original set-up can be seen in fig. 2.2. From an

initial set of 50 objects, participants were able to achieve a 100 % recognition rate within 100

trials. A later version of this device [Bach-y Rita, 1980] used an electrocutaneous stimulator,

the Tongue Display Unit, which could be worn on the tongue. The main advantages of using

electro-stimulators on the tongue [Ptito et al., 2005, Kaczmarek, 2011, Zelek et al., 2003] are that

the tongue has a higher spatial resolution and that stimulating the tongue requires approximately

3 % of the voltage required [Ptito et al., 2008] to stimulate the finger. Another advantage of

tongue stimulation is its effectiveness in treating cortical blindness [Matteau et al., 2010, Kupers

et al., 2010], especially in diminishing the effects [Chebat et al., 2007] of early onset of cortical

blindness.


Figure 2.2: The original TVSS-device [Bach-Y-Rita et al., 1969]

Definition:

Cortical Blindness refers to patients who have functioning eyes, but are unable to see due to

inactivity of the occipital cortex, mainly in Brodmann area 17 (V1) and Brodmann areas

18 & 19.

Early experiments [White et al., 1970, Guarniero, 1974, Bach-y Rita, 1983] showed that if the

user is given control over the sensor, the user is able to recognise 3D spatial configurations of

objects without extensive training. These results were later reproduced with various sensory

substitution devices [Amedi et al., 2007, Williams et al., 2011]. In the design of 3D-VITA, this

result plays an integral role as the RGB-D camera is mounted on the head of the user.

A major disadvantage of using electrocutaneous or vibrotactile displays [Bach-y Rita, 2004] is

the lack of colour information that can be represented in the signal as provided by the stimulator.

Using tactile stimulators, either light intensity from a camera or depth information from a depth

sensor, such as an infra-red or an ultrasonic sensor, can be represented. Allowing the user to move

the camera gives a faint sense of depth, but the signal remains in essence 1-dimensional. Another

disadvantage of tactile stimulators is that they are, although not invasive, more intrusive than,

for instance, auditory stimulators.


Auditory displays

There are various types of auditory displays, i.e. audio speakers. These auditory displays can be

either in-ear or external speakers. The main advantage of such auditory displays is the range

of signals they can produce. Systems such as the Kurzweil Reading Machine use speech synthesis to

convey the information from the substituted sense. Other systems use either analogue samples or

synthesised sounds. One of the most well-known sensory substitution devices that utilises sounds

is the vOICe [Amedi et al., 2007, Auvray et al., 2007] which uses grey-scale images captured by

a camera and converts these into sound-scapes. The sound-scape is created by sweeping over the

image from left to right. Each y-value, or row, in the image corresponds to a fixed pitch, with

the bottom row being the lowest pitch and the top row being the highest. The intensity value of

each pixel determines the amplitude of that pitch for that column and all columns are played

from left to right within a one second interval. This results in a continuous sound-scape that is

very rich [Proulx et al., 2008] in information as it conveys all of the spatial information from the

image. Studies on the vOICe have shown that the system is very effective. For instance, during

a reading task the visual word form area of the brain is significantly activated [Striem-Amit

et al., 2012] and during shape recognition tasks the lateral occipital complex [Amedi et al., 2007,

Merabet et al., 2009, Ward and Meijer, 2010] is activated. Furthermore, users of the vOICe have

reported [Proulx, 2010] different types of visual phenomena, even after the experiments.
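As a rough sketch of this row-to-pitch sweeping scheme (not the published vOICe implementation; the frequency range, logarithmic pitch spacing and one-second duration below are illustrative assumptions):

```python
import numpy as np

def image_to_soundscape(img, duration=1.0, fs=44100, f_low=500.0, f_high=5000.0):
    """Sweep a grey-scale image (values in [0, 1]) from left to right."""
    rows, cols = img.shape
    freqs = np.geomspace(f_low, f_high, rows)[::-1]  # bottom row = lowest pitch
    t = np.arange(int(duration * fs / cols)) / fs    # samples per column
    chunks = []
    for c in range(cols):                            # left-to-right sweep
        column = img[:, c][:, None]                  # pixel intensity = amplitude
        chunks.append((column * np.sin(2 * np.pi * freqs[:, None] * t)).sum(axis=0))
    signal = np.concatenate(chunks)
    return signal / (np.abs(signal).max() + 1e-9)    # normalise to [-1, 1]

soundscape = image_to_soundscape(np.random.rand(64, 64))  # e.g. a test image
```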

Extended usage of sensory substitution devices like the vOICe has been shown to lead to

structural reorganisation of the brain [Arno et al., 1999, Collignon et al., 2007] during recognition

tasks and depth perception tasks [Renier et al., 2005] and geometry categorisation tasks [Pollok

et al., 2005]. These changes have been found in the occipital cortex [Arno et al., 2001a] as well

as in the parietal cortex [Kim and Zatorre, 2008]. Furthermore, refinements of the auditory-

responsive areas in the parietal cortex as well as refinements [Rauschecker, 2001, Kim and

Zatorre, 2008] in the selectivity of neurons in the auditory cortex have been found in early-blind subjects

after usage of sensory substitution devices. An interesting observation [Poirier et al., 2007a] is

the increased activation found in the dorsal and ventral extra-striate areas of cortical blind subjects,

which suggests that perception through audible-vision sensory substitution

devices could be visual-like in nature. These studies are based on the sensory substitution

devices following an inverse model of the cochlea [Capelle et al., 1998] which suggests that

following biologically plausible constructions is beneficial in the neural activation during usage of

a sensory substitution device. Another potential hint at the visual-like nature of perception of

audible-vision sensory substitution devices is the presence of certain visual illusions, such as the

Ponzo illusion [Renier et al., 2004] and occlusion illusions [Jacomuzzi and Bruno, 2006], which

hint that the perception through these substituted channels occurs within the same areas as

the visual perception originally would.

An important improvement over systems such as the vOICe is the addition of preprocessing of

the image by finding salient spots in the image. This can be done by either using a retina-like

smoothing of the image [Arno et al., 2001b], where the centre of the image retains its original

sharpness but pixels towards the edges of the image are blurred, or by using algorithms such as

a neural network [Lescal et al., 2013] to detect salient areas within the image which are then


converted into audio stimuli.

2.4 Design Choices for 3D-VITA

For 3D-VITA, the design has been heavily influenced by existing sensory substitution devices.

The system uses a 3D camera to obtain images as well as depth measurements, combining

obstacle avoidance aids such as the Sonic Pathfinder [Heyes, 1984] with true vision systems

such as the vOICe [Amedi et al., 2007]. The images are preprocessed, akin to systems such as

the KRM [Steele et al., 1989] and the later additions to the vOICe [Arno et al., 2001b, Lescal

et al., 2013], by clustering pixels into superpixels, which represent areas with similar colours and

structure. For each superpixel various features are calculated including colour, light intensity,

saturation, texture and depth. Each feature vector is then given as input to a synthesiser which

converts the features into an audio sample. The samples are then played back each second to the

user using 3D sound via headphones. The 3D sound is achieved using a binaural illusion based

on head-related transfer functions. This pipeline is discussed in more detail in chapter 4.
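Putting the pieces together, a rough per-frame sketch of this pipeline might look as follows; every function below is a hypothetical stub standing in for a component discussed in chapter 4, not the actual implementation:

```python
# Stubs for the pipeline components (placeholders, not the real system).
def grab_rgbd():           # would read an RGB frame and depth map from the Kinect
    return None, None

def slic_segment(rgb):     # would run SLIC and return the superpixels
    return []

def extract_features(rgb, depth, sp):  # colour, intensity, saturation, texture, depth
    return {}

def synthesise(features):  # would map the feature vector to synthesiser parameters
    return b""

def play_binaural(sound, position):    # would play the sample via HRTF at a 3D position
    pass

def process_frame():
    rgb, depth = grab_rgbd()                         # 1. RGB-D camera
    for sp in slic_segment(rgb):                     # 2. SLIC segmentation
        features = extract_features(rgb, depth, sp)  # 3. feature extraction
        sound = synthesise(features)                 # 4. sound generation
        play_binaural(sound, position=None)          # 5. HRTF headphone playback
```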


Chapter 3

Underlying Theory

3.1 Sound Perception

Sound consists of waves propagating through space. These waves are caused by displacement of

(air) molecules, which in turn vibrate. These vibrations cause nearby molecules to also

vibrate with the same frequency, resulting in waves propagating through air (or other materials)

that can enter the ear. This process can be seen in fig. 3.3. The ear then converts these

mechanical vibrations into internal vibrations which are in turn converted into nerve impulses.

However, sound is not just waves and their physical particularities; it is mainly a psychological

perception [Sacks, 2010] of the wave-patterns. Other types of waves also have frequencies and

amplitude, but are not perceived as sound, e.g., light. Frequency and amplitude of waves have

corresponding principles when sound is perceived, frequency corresponds to pitch and amplitude

to loudness.

Sound waves

Sound waves are mechanical waves, which means that they propagate through a medium, such as

air molecules, and differ in this from electromagnetic waves, such as light. Sounds are generated

from all kinds of objects, but the sounds can be very different depending on what generated the

sound. Sound can range from pure tones, which can be described with a sine wave, e.g. a tuning

fork, to noise in which no pattern can be detected, e.g. radio static noise.

Tones are generated by vibrating, or oscillating, objects which cause compression and rarefaction

depending on the direction of motion. Compression occurs when the object moves towards the direction the sound is

travelling. The object pushes molecules together which causes an increase in pressure of the

medium. Rarefaction occurs when the object moves away from the direction of the sound. There

are then less molecules within the same space resulting in a decrease of pressure, as seen in fig.

3.3. Pure tones, which are mathematically represented by a single sine wave, can be generated by

tuning-forks. These waves can be described using only a frequency and an amplitude. Usually,


Figure 3.1: Additive synthesis of two sine waves resulting in a more complex waveform

Figure 3.2: Noise is usually named after the colour that shares similarities with its spectrum (panels: white, pink, blue and red noise; x-axis is frequency in Hz, y-axis is amplitude in dB)

sounds are not pure tones, but rather consist of a number of interfering sound waves. By adding

sine waves a more complex sound [Erickson, 1975] can be created, as seen in fig. 3.1.
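A minimal sketch of such additive synthesis, with arbitrary example frequencies (220 Hz and its third harmonic):

```python
import numpy as np

fs = 44100                          # sampling rate in Hz
t = np.arange(fs) / fs              # one second of samples
wave = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 660 * t)
wave /= np.abs(wave).max()          # normalise the combined, more complex waveform
```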

Noise does not contain such patterns, but instead has energy at all frequencies. White noise

has equal energy in equally sized frequency bands in the spectrogram. A white noise signal has

the same energy, or equal power, in the bandwidth of 40 Hz–60 Hz as it has in the bandwidth of

400 Hz–420 Hz. White noise got its name because it was believed to have the same spectral

flatness as white light. Other types of noise colour, such as violet, pink, red, grey and blue noise

as seen in fig. 3.2, are also related to the believed spectral qualities of light. For instance, pink

noise has a frequency spectrum which is linear in log-space.
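A sketch of generating white noise and shaping it into pink noise, assuming the usual definitions (white: flat power spectrum; pink: power falling off as 1/f, i.e. amplitude as 1/sqrt(f)):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2 ** 16
white = rng.standard_normal(n)             # equal power in equal bandwidths

spectrum = np.fft.rfft(white)
freqs = np.fft.rfftfreq(n, d=1 / 44100.0)
freqs[0] = freqs[1]                        # avoid dividing by zero at DC
pink = np.fft.irfft(spectrum / np.sqrt(freqs), n=n)
pink /= np.abs(pink).max()                 # 1/f power spectrum: linear in log-space
```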

The human ear

The human ear consists of three main parts: the outer ear, the middle ear and the inner

ear. The outer ear protects the inner parts and funnels the sound waves

into the middle ear. The middle ear translates the sound waves into internal vibrations. The

inner ear then translates these internal vibrations to nerve impulses which are then sent to the

auditory cortex. A schematic overview of the internals of the ear can be found in fig. 3.4.

The outer ear consists of the ear flap and the ear canal. The ear flap channels the sound waves

into the ear canal and the canal itself amplifies frequencies up to 3000 Hz. At the end of the ear

canal the sound waves are absorbed by the ear drum which vibrates along with the waves. These

waves are then converted to internal vibrations by three small bones, known as the ossicles (the


Figure 3.3: Sound waves propagate through a medium such as air molecules

Figure 3.4: Internals of the ear (outer ear: ear flap and ear canal; middle ear: ear drum and ossicles; inner ear: cochlea)

hammer, anvil, and stirrup), which transmit the vibrations of the ear drum into the fluid within

the cochlea. The ossicles function as levers, which causes the vibrations to be amplified,

resulting in the human ear being capable of detecting sounds with low amplitudes. The cochlea

is a cavity located in the inner ear and is about 3 cm in length if stretched out. The cavity

contains a liquid which vibrates along with the ear drum via the ossicles. The inner surface of

the cochlea is lined with about 20,000 hair-like nerve cells of various lengths and resiliency. Due

to the differences in length, the hairs resonate with certain frequencies causing them to have a

larger amplitude if the liquid is vibrating with the same frequency. If the nerve cell is agitated

enough, it will send an electrical signal to the brain.

Certain frequencies are amplified by the outer and the middle ear. The ear canal amplifies

frequencies around 3000 Hz, which is also the frequency range in which human speech sounds are

located, and the middle ear provides a further boost. This means that the human ear is

most sensitive to frequencies in the 1000 Hz–3000 Hz band.

Psychoacoustic properties of sound

“If a tree falls in the forest and there is no one to hear it, did it still make a sound?”

This famous philosophical thought experiment is essentially about the knowledge of reality and

observation. Sound waves travel through the air, but are only perceived when they reach the ear.

What we hear shares a strong connection with the physical attributes of the wave, as it travels

through the air, but these connections change in meaning and in our understanding. The wave

is converted into a mental representation [Levitin, 2013] and it is this representation that we

experience as sound. The mental representation changes the physical properties of the wave into


Figure 3.5: Sound waves with small frequency differences create a noticeable pattern in amplitude differences over time, as can be seen when combining a wave with a frequency of 10 Hz and a wave with 11 Hz

Figure 3.6: Spectra of a piano and a violin playing the same note. Apart from the differences in energy patterns, note also the difference between attack and decay of each note

psycho-acoustical properties. For instance, the frequency of the wave is converted into pitch and

the amplitude is converted into loudness.

The difference between the properties of the real wave and the psychoacoustic properties of the

mental representation is not as clear [Patil et al., 2012] as the difference between the frequency

of a wave of light and the mental representation of this frequency as a colour. This is mainly due

to the fact that pitch is encoded in neurons firing at the same frequency as the incoming wave.

This means that pitch is also represented in Hz. However, perceptually the brain is incapable

of noticing small differences in Hz. For the brain, there is no discernible difference between

1000 Hz and 1001 Hz. The difference can only be heard when two sounds at those frequencies

are produced simultaneously. Due to the slight difference in the wave, the amplitude difference

over time oscillates between a maximum of the combined amplitudes of the waves (when they are

almost in sync) and a minimum when they cancel each other out (when the waves are not in

sync), as can be seen in fig. 3.5. This phenomenon is used to tune guitar strings by

playing the same note simultaneously on two different strings.
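A sketch of this beating phenomenon, using 440 Hz and 441 Hz as audible example frequencies:

```python
import numpy as np

fs = 44100
t = np.arange(3 * fs) / fs                     # three seconds of samples
beat = np.sin(2 * np.pi * 440 * t) + np.sin(2 * np.pi * 441 * t)
# sin(a) + sin(b) = 2 sin((a+b)/2) cos((a-b)/2), so the amplitude envelope
# |2 cos(2*pi*0.5*t)| rises and falls once per second: a 1 Hz beat, which is
# what one listens for when tuning two strings to the same note.
```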

Humans are exceptionally skilled at recognising the difference between sound producers. That

is to say, even when two instruments are playing the same note, the human brain is capable of

determining which instrument is playing when. The reason for this is timbre. Timbre is usually

referred to as every property of sound which is not the frequency or the amplitude. However,

this is not a scientific or rigid definition; it can be decomposed into the following:


Definition:

Timbre is the property of a sound wave that determines the characteristics of that sound. The

timbre is composed of the energy pattern of the harmonics of a tone.

What this means is that as sound can be represented as the addition of several sine waves, timbre

can be represented as the energy of each individual sound wave, as can be seen in the spectra in

fig. 3.6. This definition is not complete, but already sufficient in representing the differences

between sound producers [Fujinaga, 1998, Fujinaga and MacMillan, 2000]. What is missing from

this definition are the characteristics of the sound over time. In theory, these properties can be

represented as an energy pattern over the low frequency sine waves, but this representation is not

intuitive. Instead, it is easier [Berger, 2005] to represent these properties by shaping the sound with

a function that controls the total energy of the sound. This function represents how the energy of

the sample changes over time. It starts with the attack of the tone. The attack refers to the time

between the initial production of the sound and the moment the sound reaches its maximum

energy. In fig. 3.6 this can be seen for a tone produced on a piano and a violin. The piano

reaches its maximum energy before the violin, which is due to the difference between hitting a

string within the piano and the slow swelling of the sound for a violin. The other properties of

the energy function are the time it takes for the sound to die out after production, the average

energy of the sound and the length of the sound.
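A sketch of shaping a tone with such an energy function, assuming a simple linear attack followed by an exponential decay (the attack times below are illustrative, not measured values):

```python
import numpy as np

def enveloped_tone(freq, attack, duration=2.0, fs=44100):
    t = np.arange(int(duration * fs)) / fs
    tone = np.sin(2 * np.pi * freq * t)
    env = np.minimum(t / attack, 1.0)            # rise to maximum energy
    env *= np.exp(-np.maximum(t - attack, 0.0))  # die out after the attack
    return tone * env

piano_like = enveloped_tone(440.0, attack=0.01)   # fast attack: struck string
violin_like = enveloped_tone(440.0, attack=0.4)   # slow swell: bowed string
```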


Chapter 4

Methodology

The sensory substitution device created for this thesis, 3D-VITA, produces audible

vision. However, traditional audible vision sensory substitution devices take a visual percept

and convert this phenomenon into an audible signal. This results in sensory substitution devices

taking over the role of sensory integration instead of employing the brain of the user. By directly

converting visual information, rather than visual percepts, the user can perceive much more

complex information. This is why the sensory substitution device described in this thesis is

called 3D-VITA which stands for 3D Visual Information To Audio. The system utilises a 3D

camera, mounted on the head of the user, to get the visual information from the scene. The

visual information is based on a segmentation of the image into local patches with similar visual

qualities, known as super-pixels. A set of visual features is created for each super-pixel and

these features are used to generate sounds. The sound generation is steered by weak associations

stemming from the visual features of materials. The sounds are played back to the user through

headphones with head related transfer functions, creating the illusion that the sounds are in a 3D

space. The location of each of the sounds is taken from the depth information from the camera.

The following sections give greater detail on this pipeline. The sections follow the pipeline of the

system which consists of a camera to take images with 3D information, a computer to segment

the images, extract the visual features and generate the sounds and a pair of headphones which

enable the user to hear the sounds in a simulated 3D environment. A schematic overview is given

in fig. 4.1.

4.1 RGB-D Camera

RGB-D cameras can provide both colour and depth information. The RGB-D camera that was

used for the prototype is the Microsoft Kinect (http://www.xbox.com/en-us/kinect/). The Kinect

uses a normal RGB camera, an infra-red laser emitter and an infra-red camera. The infra-red

laser is diffracted to create a pattern of infra-red dots. The resulting pattern is compared to a

reference infra-red image created with a plane at a fixed distance. By comparing the distance

from the dots in the pattern to the original reference image, a disparity image can be obtained.


Figure 4.1: A schematic overview of the system, consisting of the following components:

RGB-D Camera — RGB-D cameras are capable of taking full-colour images (RGB) as well as retrieving depth information (D)

Segmentation — The image is segmented using the SLIC algorithm. This results in a set of super-pixels with approximately the same size and visual coherence within each super-pixel, see fig. 4.2

Feature Extraction — A set of visual features is extracted for each super-pixel. The features are based on typical material recognition systems

Sound Generation — The visual features are translated to a set of audio parameters, based on the acoustic properties of a set of typical materials

HRTF Headphones — With headphones, an auditory effect can be created resulting in the user hearing sounds in a 3D environment using head related transfer functions

The disparity image can then be converted into depth measurements using the camera parameters

as obtained through calibration. The 3D location vector $[x', y', z']$ of the object can be found

from pixel locations x and y using depth d and constant c:

$$x' = xdc, \qquad y' = ydc, \qquad z' = d \qquad (4.1)$$
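Eq. 4.1 transcribes directly to code; a minimal sketch, where the value of the constant c is a placeholder (in practice it comes from the calibration mentioned above):

```python
import numpy as np

def pixel_to_3d(x, y, d, c=0.0021):
    """Map pixel location (x, y) with depth d to a 3D location vector (eq. 4.1)."""
    return np.array([x * d * c, y * d * c, d])
```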

One potential pitfall of using this type of depth camera [Khoshelham, 2011] is that transparent

and reflective objects cannot be accurately measured. However, there are no optical methods

that are capable of fully circumventing this issue. Another typical issue is that measurements fail

if a surface is oriented almost parallel to the camera. This is due to the laser specks becoming

either blurred by being smeared out over a larger surface area, or due to being reflected away

from the camera. This issue is less problematic as the segmentation is also based on edges,

causing most of the faulty measurements to lie on the edges of the super-pixels.

4.2 Super Pixel Segmentation

There is a fine balance between giving too much and too little information. If we were to create

a descriptor for each pixel and generate a sound from that, as done in the vOICe [Amedi et al.,

2007, Auvray et al., 2007], the user would get very little information for each pixel, but much

stimulation. The user would get the information in a raw format where the brain has to do

the filtering and focus attention, which can be straining for the user. Instead, 3D-VITA should

function much as a sensory organ does and filter relevant information for the user. On the

other hand, if we take global descriptors of the image, the system will filter out too much data


and thus contribute little to the user's perception. This is why 3D-VITA uses

image patches for which it generates sound. The system uses SLIC segmentation (Simple Linear

Iterative Clustering) [Achanta et al., 2012] which results in a set of super pixels. With SLIC, a

predefined number of super pixels can be found that incorporates both spatial relevance and

colour relevance for each super pixel, as seen in fig. 4.2.

Definition:

Super pixels are groups of pixels that share some relation, be it a spatial relation or a relation in colour or texture.

The algorithm is given a number K of super pixels to find in the image and with N pixels in the

image, each super pixel will have around N/K pixels. With super pixels of approximately the

same size, there will be a super pixel centred at every grid interval S = \sqrt{N/K}.

The goal of the algorithm is to find a set of super pixels Ck = [lk, ak, bk, xk, yk]T with their

related pixels that have a smaller distance to that super pixel than to any other centre. The

distance measure, as seen in eq. 4.2, between a cluster centre and a pixel is determined using the

Euclidean distance in colour space and the Euclidean distance between pixel locations. The SLIC

algorithm works on the CIELAB colour space in which Euclidean distances are perceptually

meaningful for small distances. A distance threshold m is chosen which weights how much colour

difference is allowed. The higher m is chosen, the more spatial proximity is used for determining

pixel relations to the cluster. The algorithm can be seen in alg. 1.

d_{lab} = \sqrt{(l_k - l_i)^2 + (a_k - a_i)^2 + (b_k - b_i)^2}

d_{xy} = \sqrt{(x_k - x_i)^2 + (y_k - y_i)^2}

D_s = d_{lab} + \frac{m}{S}\, d_{xy} \qquad (4.2)

The algorithm begins by selecting an initial set of super pixel centres C_k = [l_k, a_k, b_k, x_k, y_k]^T for k = 1, ..., K at the centres of an equally spaced grid. For each centre the surrounding pixels are analysed and the centre is moved to the lowest gradient position in a 3 × 3 patch. This is because we do not want centres to lie on edges, which often represent borders between distinctly different patches. The gradient G(x, y) is calculated from the Lab vector I(x, y) using eq. 4.3.

G(x, y) = \|I(x+1, y) - I(x-1, y)\|^2 + \|I(x, y+1) - I(x, y-1)\|^2 \qquad (4.3)

For all experiments, K was chosen to be 64. With the resolution of the Kinect at 640 × 480, this results in a cluster size of approximately 4800 pixels per super pixel.

Page 25: 3D-VITA - UvA · CHAPTER 1. INTRODUCTION 2 those that have disabilities. However, the issues with sensory impairment do not necessarily arise from the lack of sensory input, …

CHAPTER 4. METHODOLOGY 22

Algorithm 1 SLIC Segmentation
Require: K = number of super pixels
1: Initialise K cluster centres C_k = [l_k, a_k, b_k, x_k, y_k]^T by sampling pixels at a regular grid
2: Perturb cluster centres in an n × n neighbourhood to the lowest gradient position
3: repeat
4:   for all cluster centres C_k do
5:     Assign the best matching pixels from a 2S × 2S square neighbourhood around the cluster centre according to the distance measure in eq. 4.2
6:   end for
7:   Compute new cluster centres and residual error E (L1 distance between previous and recomputed centres)
8: until E ≤ threshold
9: Enforce connectivity

Figure 4.2: The SLIC algorithm applied to a painting by Malevich. The parameters were set to K = 64, m = 20

With the initial centres in place, the algorithm iteratively applies K-means to move the cluster centres, checking a 2S × 2S square around each centre. Each pixel in the image is related to the nearest cluster centre according to the distance measure in eq. 4.2. When each pixel is associated, the new cluster centres are calculated as the mean [l, a, b, x, y] vector. The algorithm terminates when the cluster centres converge. A visual representation of these steps can be seen in fig. 4.2.
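For reference, this segmentation step can be reproduced with the SLIC implementation that ships with scikit-image; the sketch below is not the thesis code, but uses the parameters from table 4.3 (the input file name is a placeholder).

import numpy as np
from skimage import io
from skimage.segmentation import slic

# Load an RGB frame; for RGB input, slic converts to CIELAB internally.
image = io.imread("frame.png")  # placeholder file name

# K = 64 super pixels with compactness m = 20, as in table 4.3.
segments = slic(image, n_segments=64, compactness=20)

# `segments` is an (H, W) label map: pixels sharing a label belong to
# the same super pixel.
print(np.unique(segments).size, "super pixels found")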

4.3 Extracting Visual Features

The selection of visual features is guided by the current state-of-the-art material recognition

systems. The reason for this is that these features give a good representation of the materials so


Table 4.1: Hue ranges for each of the five colours used. Hue values between these bins were weighted linearly between the two neighbouring bins

Colour name   Low Hue   High Hue
Red           355       10
Yellow        51        60
Green         81        140
Blue          221       240
Pink          331       345

that they can be distinguished from one another. By finding natural complements within the

audio property domain, it should follow that humans can distinguish between visually

different objects.

Colour is represented as a histogram of the primary colours "ROYGBIV": Red, Orange, Yellow, Green, Blue, Indigo and Violet. To reduce complexity within the sounds, Orange is considered a natural blend between Red and Yellow, while Indigo and Violet are grouped together as Pink. This reduces the main colours to Red, Yellow, Green, Blue and Pink. The reasoning behind this is that these five colours can be linked to the five main notes of a standard chord, which lets the final result sound more pleasant. For each super pixel a histogram of the colours is made using hue. Each of the colours is represented by a range in which that colour is dominant. In between these ranges a linear blend is made between the two consecutive bins, so an orange pixel with a hue of 30 adds roughly 0.5 to both the red and the yellow bin. Table 4.1 shows the bin ranges for each colour. Note that hue is represented as an angle and thus wraps around, meaning that red covers the ranges 355-360 and 0-10. Because individual super pixels can contain different numbers of pixels, the histogram is then normalised. The saturation values are simply averaged for each super pixel. Brightness is represented as three distinct values: dark, medium or light. The reasoning behind this has to do with the way they are represented in sound.
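A minimal sketch of this five-bin histogram with linear blending is given below, assuming hue angles in degrees; the blending rule is our reading of the description above, not the thesis code.

import numpy as np

# Hue ranges (degrees) from table 4.1, in circular order.
BINS = [("red", 355, 10), ("yellow", 51, 60), ("green", 81, 140),
        ("blue", 221, 240), ("pink", 331, 345)]

def hue_histogram(hues):
    # Normalised five-colour histogram for one super pixel.
    hist = np.zeros(len(BINS))
    for h in hues:
        h = h % 360
        for i, (_, lo, hi) in enumerate(BINS):
            inside = lo <= h <= hi if lo < hi else (h >= lo or h <= hi)
            if inside:
                hist[i] += 1.0
                break
        else:
            # Between two bins: split linearly between the previous and
            # the next bin on the hue circle.
            nxt = min(range(len(BINS)), key=lambda i: (BINS[i][1] - h) % 360)
            prv = (nxt - 1) % len(BINS)
            gap = (BINS[nxt][1] - BINS[prv][2]) % 360
            w = ((h - BINS[prv][2]) % 360) / gap  # 0 near prv, 1 near nxt
            hist[prv] += 1.0 - w
            hist[nxt] += w
    return hist / hist.sum()

# An orange hue of 30 degrees splits roughly evenly over red and yellow.
print(hue_histogram(np.array([30.0])))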

Texturedness is also represented. Note that this refers to texture as in noise rather than structured texture such as repeating patterns. Texturedness is measured by smoothing the grey-scale image I with a 9 × 9 averaging filter, yielding a smoothed image I', and subtracting this from the original. Pixels that differ from their surrounding pixels thus receive a higher value than pixels surrounded by similar pixels. The result T highlights the pixels that are noisy compared to their neighbours.

T = |I - I'| \qquad (4.4)

For each super pixel the mean of T is taken to represent the texturedness of the super pixel. Values closer to the centre of the super pixel are weighted more heavily using a simple Gaussian filter, since the edges of the super pixel are aligned to edges within the image.
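A sketch of eq. 4.4 using scipy's uniform (averaging) filter is shown below; the per-super-pixel Gaussian centre weighting is omitted here.

import numpy as np
from scipy.ndimage import uniform_filter

def texturedness(gray):
    # T = |I - I'| from eq. 4.4, with I' the image smoothed by a
    # 9 x 9 averaging filter.
    smoothed = uniform_filter(gray.astype(float), size=9)
    return np.abs(gray - smoothed)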


Table 4.2: Notes and their corresponding colour frequencies as used by 3D-VITA

Colour Name   Frequency (THz)   Frequency × 2^-40 (Hz)   Note
Red           431               392                      G4
Yellow        513               466                      B♭4
Green         575               523                      C5
Blue          684               622                      D♯5
Pink          768               698                      F5

These are the only four features used to represent each super pixel. Within the material recognition literature they are considered to hold the most information, with the omission of shape, which has no real meaning here due to the usage of super pixels. Other typical material features, such as texture patterns, are also omitted due to computational constraints.

4.4 Generating Sound

Sounds are generated using an additive synthesiser with a low-pass filter, an ADSR envelope and a layer of white noise. As noted in the previous section, five colours are represented in a normalised histogram. Each colour is linked to a certain note within a chord. These notes are found by taking the centre frequency of each colour's hue bin and scaling it down 40 octaves, thus multiplying the frequency by 2^{-40}. The resulting notes can be seen in table 4.2.

The notes in the table form a G minor 7 chord with an augmented fifth. Although red can be seen as having the note G, it does not matter at which note we start the chord; it is more important that different colours have audible differences.
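As a quick sanity check of this octave-scaling rule, dividing the light frequencies from table 4.2 by 2^40 lands on (approximately, within 1 Hz of rounding) the listed note frequencies:

# Scaling light down 40 octaves: f_sound = f_light / 2**40.
for name, thz in [("red", 431), ("yellow", 513), ("green", 575),
                  ("blue", 684), ("pink", 768)]:
    hz = thz * 1e12 / 2**40
    print(f"{name}: {hz:.0f} Hz")
# red: 392 Hz (G4), yellow: ~466 Hz (B-flat 4), green: 523 Hz (C5),
# blue: 622 Hz (D-sharp 5), pink: 698 Hz (F5)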

Using the frequencies from table 4.2, a sine wave is generated for each histogram value H_i with corresponding frequency f_i using equation 4.5, giving audio sample y. Note that the sine wave is generated at a sample rate of 44100 Hz and with amplitude a = 7500.

y(t) = \sum_i H_i \cdot a \cdot \sin\!\left(\frac{2\pi f_i t}{44100}\right) \qquad (4.5)

Frequency f_i is multiplied by 0.5, 1 or 2 corresponding to the brightness value of the super pixel. This moves a darker super pixel one octave down, while a lighter super pixel is moved one octave up.
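A sketch of eq. 4.5 together with the brightness octave shift is given below; it is an illustration of the mixing rule, not the thesis synthesiser.

import numpy as np

SR = 44100            # sample rate (Hz)
AMPLITUDE = 7500      # a in eq. 4.5
NOTE_FREQS = np.array([392.0, 466.0, 523.0, 622.0, 698.0])  # table 4.2

def synthesise(hist, brightness, duration=1.0):
    # `hist` is the normalised five-colour histogram; `brightness` is
    # 0.5, 1 or 2 for dark, medium or light super pixels.
    t = np.arange(int(SR * duration))
    freqs = NOTE_FREQS * brightness
    waves = np.sin(2 * np.pi * freqs[:, None] * t[None, :] / SR)
    return AMPLITUDE * (hist[:, None] * waves).sum(axis=0)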

The mean saturation μ_s of the super pixel is used as a parameter for a low-pass filter, resulting in the filtered audio sample y'. A low-pass filter produces a dampened sound, since the higher frequency ranges are averaged out. The higher the saturation, the lower the cut-off frequency is set. To achieve this, μ_s is inverted and normalised to the desired effect value.


Figure 4.3: A typical ADSR envelope, with amplitude on the y-axis and the sample time on the x-axis

RC = \frac{(1 - \mu_s/255) \cdot 150}{44100}

\delta_t = \frac{1}{44100}

\alpha = \frac{\delta_t}{RC + \delta_t}

y'(t) = \alpha\, y(t) + (1 - \alpha)\, y'(t-1) \qquad (4.6)
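Eq. 4.6 amounts to a standard first-order IIR low-pass; a direct transcription might look as follows (a sketch, assuming silence before t = 0 for the first sample).

import numpy as np

def lowpass(y, mean_saturation):
    # RC constant from the inverted, normalised mean saturation (eq. 4.6).
    rc = (1.0 - mean_saturation / 255.0) * 150.0 / 44100.0
    dt = 1.0 / 44100.0
    alpha = dt / (rc + dt)
    out = np.empty(len(y))
    out[0] = alpha * y[0]                 # assume y'(-1) = 0
    for t in range(1, len(y)):
        out[t] = alpha * y[t] + (1.0 - alpha) * out[t - 1]
    return out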

To add texture to the sound, the original wave is mixed with a layer of white noise. White noise has equal energy at all frequencies and thus does not bias certain tones to stand out or be masked. For instance, red or Brownian noise has more energy at the lower frequencies, causing lower-frequency tones to be masked more than higher ones. The layer of noise is given the same maximum energy as the original sine waves.
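A sketch of the noise layer is given below; scaling the noise by the texturedness feature is our assumption, as the text only fixes the noise's maximum energy.

import numpy as np

def add_noise(y, texture, rng=None):
    # White noise scaled to the same maximum energy as the sine mixture,
    # weighted by the (assumed) texturedness feature in [0, 1].
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.uniform(-1.0, 1.0, size=len(y)) * np.abs(y).max()
    return y + texture * noise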

Finally, the sound is passed through an ADSR envelope. ADSR stands for attack, decay, sustain, release and is usually represented as a piece-wise linear function, as can be seen in fig. 4.3. The attack is the beginning of the sample and signifies the time between the start of the sample and the moment it reaches maximum amplitude. The decay is the period right after, in which the amplitude drops to the steady amplitude of the sound, called the sustain level. The release is the final part of the sound and is the drop from the sustain level back to zero. The ADSR function is multiplied with the original audio sample y(t).
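A sketch of the envelope with the parameters from table 4.3 is shown below, under the assumption that a, d and r are fractions of the sample length and s is the sustain level; the thesis does not spell the units out.

import numpy as np

def adsr_envelope(n, a=0.05, d=0.4, s=0.9, r=0.45):
    # Piece-wise linear ADSR envelope over n samples.
    na, nd, nr = int(a * n), int(d * n), int(r * n)
    ns = n - na - nd - nr                 # remaining sustain samples
    return np.concatenate([
        np.linspace(0.0, 1.0, na, endpoint=False),  # attack: 0 -> 1
        np.linspace(1.0, s, nd, endpoint=False),    # decay: 1 -> s
        np.full(ns, s),                             # sustain plateau
        np.linspace(s, 0.0, nr),                    # release: s -> 0
    ])

# The envelope multiplies the synthesised sample: y_out = adsr_envelope(len(y)) * y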


Table 4.3: Specification of 3D-VITA

Pipeline            Method                  Specification
Image Retrieval     RGB-D camera (Kinect)   Colour and depth
Segmentation        SLIC                    K = 64, m = 20
Feature extraction  Hue                     see table 4.1
Feature extraction  Saturation
Feature extraction  Lightness               Three classes (Dark, Medium, Light)
Feature extraction  Texture                 Summation over edge pixels
Audio Synthesis     Additive Synthesis      see table 4.2, a = 7500, sample rate = 44100 Hz
Audio Synthesis     Low-Pass Filter         see eq. 4.6
Audio Synthesis     ADSR                    a = 0.05, d = 0.4, r = 0.45, s = 0.9
Playback            HRTF                    OpenAL implementation
Playback            Stimulus interval       1 s
Playback            Delay                   1 s

4.5 Head-Related Transfer Functions

To create the binaural illusion, i.e. the illusion of 3D localised sound, head-related transfer functions (HRTFs) [Begault et al., 1994, Larcher et al., 2000, Furse, 2009] are used for both ears. The HRTF is the Fourier transform of the head-related impulse response (HRIR), which encodes the response within the ear to an impulse arriving from a given source location. Convolving a sound with the HRIR alters it as if it came from that source location. Different frequencies within a complex sound have different responses in the ear, mainly due to the shape of the outer ear, head and body of the listener. For distances larger than 1 m the differences in head shape become negligible, so that it is not necessary to measure the HRTF for each specific user of the system. HRTFs also help resolve the cone of confusion: the set of source locations for which the interaural level difference (ILD) and interaural time difference (ITD) are identical.
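Conceptually, applying an HRIR pair reduces to two convolutions, as the sketch below illustrates; the system itself relies on the OpenAL HRTF implementation listed in table 4.3 rather than direct convolution.

import numpy as np

def spatialise(mono, hrir_left, hrir_right):
    # Render a mono sound at the virtual location for which the HRIR
    # pair was measured.
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    return np.stack([left, right], axis=1)  # stereo: (samples, 2)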

4.6 Summary

The system uses a Kinect, mounted on a pair of headphones, to retrieve both colour and depth

images. The colour images are converted to the CIELAB colour space for segmentation with the

SLIC algorithm. The depth images are smoothed and are used to determine the 3D location of

each super pixel in the scene. The colour images are converted to the HSV colour space for the

extraction of colour features. Hue, saturation and lightness are converted to audio parameters

representing the note, dampness and octave. The texturedness of each super pixel is measured

using a sharpen filter and summing over the edge pixels from this filter. A sound is produced

using additive synthesis. The sound is then shaped by an ADSR envelope. The sounds are then played back to the user over headphones in a virtual 3D space. This 3D virtual effect is achieved using head-related transfer functions. A full overview of the used parameters can be seen in table 4.3.


Chapter 5

Experiments

To validate the system, both a virtual and a location experiment were performed. An overview

of the ten subjects can be found in table 5.1. Subjects were given approximately two minutes of

adaptation (training) time before starting the experiments. This brief time was included in the

experiments to let users get accustomed to the sounds. The time was kept short to keep training to a minimum, as blind persons would not have this visual-audio

training before using the system. The experiment for each subject consists of three parts: the

initial adaptation time, the virtual experiment and the location task. Subjects were asked

questions relating to their musical ability, both in production and in listening. The results of this

questionnaire can be found in table 5.2. The ages of the subjects range from 19 to 25 years.

Seven of the subjects reported they were musicians, although four of them also reported playing

only occasionally. Most of the subjects reported listening to a broad range of genres spanning jazz,

classical music, pop and rock. Some of the subjects reported that they were unsure whether they

had any form of auditory damage. Apart from the ten subjects, an eleventh subject also performed the experiments. This subject had an infection of the right ear canal. She was

Table 5.1: Subjects' gender, age and right-handedness

Id          Gender   Age        Right-handed
Subject 1   Male     19         yes
Subject 2   Male     24         no
Subject 3   Male     21         yes
Subject 4   Male     25         yes
Subject 5   Male     19         yes
Subject 6   Male     21         yes
Subject 7   Male     25         yes
Subject 8   Male     25         yes
Subject 9   Female   20         yes
Subject 10  Male     20         no
Total       8 to 2   µ = 21.9   8 to 2


Table 5.2: Subjects' musical background

Id          Instrument(s)            Musical ability        Musical taste
Subject 1   Piano                    Casual player          Very broad
Subject 2   Piano                    30 minutes per day     Very broad
Subject 3   Guitar and bass guitar   Daily; also in a band  Punk, stoner
Subject 4   X                        X                      Broad
Subject 5   X                        X                      Very broad
Subject 6   Piano                    4-8 hours per week     Broad
Subject 7   Drums                    Few hours per week     Broad
Subject 8   X                        X                      Very broad
Subject 9   Saxophone                Infrequent             Broad
Subject 10  Piano, guitar, fiddle    2 hours per week       Broad

Figure 5.1: Three of the nine synthetic images used for the synthetic experiment. Each coloured square had three possible locations within the image

used as a sort of baseline, as auditory difficulties in one ear hamper the spatialisation of sound.

5.1 Experimental Set-Up

Virtual Experiment

The virtual experiment consists of nine synthesised images: for each primary colour (R, G, B), a square at one of three positions (left, middle, right). The depth map for each image consists of all NaNs except for the square, which is set at a fixed depth of 3.0 m. In front of the subject, a large visual marker is placed. The user is asked, upon receiving the stimulus, to turn his or her head towards where the square is. Using the visual marker, the head position of the subject is then tracked to see how close the subject's gaze is to the virtual object. Using these nine values (distance of gaze centre to the centre of the virtual object), the initial calibration error can be found. This error can arise from inaccuracies in the software or hardware, as well as from any auditory damage the subject may have.

Location Task

The goal of the location task is to measure how accurately subjects can find an object in

the sound-scape. A large visual marker is placed on the perimeter of a semi-circle centred around


Figure 5.2: Experimental set-up of the room with the participant in the middle

the user with a radius of 2.5 m, as can be seen in figure 5.2. When the subject believes the visual marker is in the centre of his or her field of view, the subject presses a button, which records this moment. The audio is temporarily disabled while the visual marker is moved to a new location, at which point the next task starts. The soundscapes are generated at approximately one soundscape per second, and each image is stored so that the subject's gaze can be tracked over time. A total of ten location tasks are given to each subject, which are averaged on their time axis from the starting time of the task to the moment the subject presses the button. For each image a distance to the centre of the visual marker is calculated. This distance is the angle at which the marker is located, the angle being zero if the marker is at the centre of the subject's gaze. If the visual marker is out of the field of view, the maximum possible distance is assigned to that location. The distance can be both positive and negative and is not normalised, which allows for analysing the search patterns the subjects utilise to locate the marker.
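One way to obtain this angular distance from the marker's pixel position is a pinhole-camera conversion, sketched below; the roughly 57° horizontal field of view of the Kinect is an assumption on our part, as the thesis does not state its exact conversion.

import numpy as np

HFOV_DEG = 57.0     # approximate Kinect horizontal field of view (assumption)
IMAGE_WIDTH = 640

def marker_angle(marker_x):
    # Azimuth of the marker relative to the gaze centre, in degrees;
    # zero when the marker sits in the middle of the image.
    f = (IMAGE_WIDTH / 2) / np.tan(np.radians(HFOV_DEG / 2))
    return np.degrees(np.arctan((marker_x - IMAGE_WIDTH / 2) / f))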

5.2 Results

General Response of the Subjects

After the session, the subjects were asked about their personal findings and opinions on the

system. All subjects reported following the same strategy. During the first couple of experiments, the subjects tried to find the “red” sound within the environment. Around the fourth or fifth experiment the subjects switched strategy to instead focus on contrast and changes in sound. This effect can be seen in fig. 5.4, where the time taken for the task jumps away from the normal pattern. Most subjects

felt insecure about their ability to correctly identify the marker within the room, despite the low

error rate. Some of the subjects reported that the sounds were pleasant and two of the subjects

reported they were unpleasant.

Most subjects reported that they had difficulty in locating where sounds were coming from. Some

subjects thought sounds came from behind them, others reported hearing more sounds coming

from the left than from the right and vice versa. The differences between the subjects can be


explained by the HRTF having different effects on the spatial perception of the sounds. Another

explanation might be that some subjects have slight hearing damage, which results in a directional bias.

Virtual Experiment Results

During the virtual experiment, all subjects experienced a slight bias towards sounds from the

right. When the right stimuli were presented subjects turned their head 90° to the right, while

the left stimuli made them turn their head 65° to the left. The middle stimulus was reacted to by

turning 30° to the right. This indicates that the entire audioscape is slightly rotated around the

user towards the right. Most subjects reported that some of the sounds seemed to come from

within their own head, and they were therefore not able to locate these sounds in the virtual room.

This is a natural response to sounds directly in front of a listener, as the sound is then nearly the

the same for both ears. One of the subjects turned completely around for both stimuli from the

left and right. This is a common effect in people with damage to the auditory system. Two of

the subjects also looked slightly up and down for some of the stimuli. This was especially true

for the green and blue stimuli. The green stimuli have high brightness and the blue stimuli have

low brightness. It is a common perceptual illusion to feel that sounds with a higher pitch come

from higher up and vice versa.

The eleventh subject, who suffered from an infection of the right ear canal, performed significantly worse than the other subjects. She had a strong bias to the left, which indicates that the right ear was indeed still not functioning normally. Furthermore, she reported that she could not successfully

attribute a location to each stimulus.

Location Task Results

During the location experiments, a total of 4720 stimuli were presented. These stimuli were

divided over 10 different participants, who each completed 10 experiments; the results

are in table 5.3. During these 4720 stimuli, the marker was present in 2451 of the frames. Each

experiment was concluded with the subject indicating the marker was within the field of view.

Of these 100 “screenshots” the marker was present in 68 of the frames. These screenshots can

be found in fig. 5.3. Compared to the 51.9 % of the frames in which the marker was present,

the 68 % of the screenshots is above random. Furthermore, 7 of the 10 subjects performed

significantly above random, while 2 of the subjects performed at random and 1 performed significantly below random. Of the 2 subjects performing at random, one scored 80 %, but the marker was present in 81 % of his frames, indicating that this subject may simply have spent more time looking at the marker before deciding. Subject 10, who performed below random, indicated that she had a specific strategy of searching for high-pitched sounds. Looking at her screenshots, the bottom row in fig. 5.3, it becomes apparent that

her hypothesis was aimed at the floor.

Given the 100 trials that were performed, of which 68 were successful, we can reject the hypothesis that the subjects were performing the task at random. To do this, the experiment was modelled as a binomial chance experiment with a 0.519 chance of success. The probability of observing at least this many successes at random is 0.0007951, as calculated in equation 5.1. This is under the assumption


Figure 5.3: All screenshots as taken by the subjects during the experiments. Each row corresponds to a single subject and from left to right are experiments 1 to 10

hypothesis is correct is 0.0007951, as calculated in equation 5.1. This is under the assumption

that each trial, or task, is performed independently. If we consider the best and worst performing

participants as outliers, there were 80 trials of which 57 were successful. The probability then

becomes 0.0003300.

P(\geq 68 \text{ correct out of } 100 \text{ trials}) = \sum_{k=68}^{100} \binom{100}{k}\, 0.519^{k}\, 0.481^{100-k} = 7.951 \times 10^{-4} \qquad (5.1)
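These tail probabilities can be reproduced with scipy; binom.sf(k, n, p) gives P(X > k), so the calls below compute P(X >= k + 1).

from scipy.stats import binom

p = 0.519  # chance that a random frame contains the marker

print(binom.sf(67, 100, p))  # P(X >= 68), eq. 5.1, ~8e-4
print(binom.sf(27, 50, p))   # P(X >= 28), eq. 5.2, ~0.33
print(binom.sf(38, 50, p))   # P(X >= 39), eq. 5.3, on the order of 1e-4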

If we compare the first 5 tasks of each subject to their last 5 tasks, we get the probabilities in equations 5.2 and 5.3: during the first 5 tasks, 28 out of 50 screenshots were successful, while the last 5 tasks resulted in 39 out of 50 successful screenshots.

P(\geq 28 \text{ correct out of } 50 \text{ trials}) = \sum_{k=28}^{50} \binom{50}{k}\, 0.519^{k}\, 0.481^{50-k} \approx 0.33 \qquad (5.2)

P(\geq 39 \text{ correct out of } 50 \text{ trials}) = \sum_{k=39}^{50} \binom{50}{k}\, 0.519^{k}\, 0.481^{50-k} \approx 0.0001 \qquad (5.3)

The difference between the first 5 and last 5 tasks shows that, despite not receiving any intermediate feedback on their performance, the subjects were able to improve their performance significantly during the experiment.


Table 5.3: Location experiments per subject. Total frames is the total number of stimuli the subject received. Frames with marker is the number of those stimuli in which the marker was present. The screenshots were taken at the time the subject confirmed the marker was in front of him or her

Id          Total Frames   Frames with marker   Screenshots with marker
Subject 1   493            404 (81 %)           8 (80 %)
Subject 2   521            239 (45 %)           6 (60 %)
Subject 3   151            77 (51 %)            7 (70 %)
Subject 4   606            236 (40 %)           7 (70 %)
Subject 5   868            488 (56 %)           5 (50 %)
Subject 6   172            111 (61 %)           8 (80 %)
Subject 7   184            103 (56 %)           10 (100 %)
Subject 8   677            386 (57 %)           9 (90 %)
Subject 9   267            110 (41 %)           7 (70 %)
Subject 10  269            108 (40 %)           1 (10 %)
Total       4720           2451 (51.9 %)        68 (68 %)

It is worthwhile to note that the individual performance of each subject is difficult to assess with this metric, as there were only 10 trials per subject. This holds especially for subject 1 in table 5.3, who did not score above random: he did perform considerably well at the task, but spent a larger portion of the time looking at the object. Looking at table 5.3, it is also worth mentioning that subject 10 performed significantly below random. The eleventh subject also performed below random, but additionally reported not being able to form any hypothesis on what the sounds represented. While subject 10 followed a set strategy and seemed to follow it consistently, the eleventh subject did not follow any strategy, as she could not differentiate the sounds produced by the system. Although the sample size is small, no noticeable relation was found between performance and the background information about the subjects in tables 5.1 and 5.2.

Another important metric to consider is the time taken for each consecutive task. The results can be viewed in fig. 5.4 for each of the ten subjects. Noteworthy is the fact that some of the subjects needed approximately the same amount of time for each task, while others fluctuated in the time they needed. The fluctuating subjects reported that they sometimes changed their hypothesis and took longer to verify whether the new hypothesis was correct. This effect is noticeable in fig. 5.4 around the fourth experiment, where for many of the subjects the time lies either above or below that subject's average. Another effect that can be seen in fig. 5.4 is that the overall time taken per experiment becomes more stable towards the final five experiments; that is to say, the deviation from the mean becomes smaller.

To see what search strategy the subjects used, the marker was tracked over time during each experiment. In fig. 5.5 six representative graphs can be seen. The marker was detected in each frame that was converted into an audio stimulus, and the x position of the marker is shown on the y-axis. If the marker was not detected, the last known x position was used to determine whether the subject was looking to the right or to the left of the marker. For these frames, either x = 0 or


Figure 5.4: Time taken for each experiment. Blue circles indicate correct screenshots. The blue dashed line is the mean time taken per experiment

Figure 5.5: Six representative graphs (subject 5, exp. 2; subject 4, exp. 5; subject 8, exp. 10; subject 10, exp. 1; subject 9, exp. 6; subject 7, exp. 6) showing the search strategy as used by the subjects. On the y-axis is the time (ms) taken for the experiment and on the x-axis is the x-value of the centre point of the marker

x = 640 was used for looking to the right or left of the marker respectively. What becomes

apparent from these graphs is the sweeping strategy used by most subjects. The subjects move

their head from left to right when looking for the marker. Another effect was that some of the subjects, in the final frames, had their gaze trained on the marker but moved away for a brief moment before taking a screenshot. After the experiments, these subjects reported that they did this to verify that the marker was present in the field of view.


Chapter 6

Conclusion

In this thesis, a novel sensory substitution device was proposed named 3D-VITA. The system

was tested on ten different subjects in both a virtual and a localisation task. The system uses

rich visual information using a RGB-D camera and translates this information into an audio

signal. Results showed that the system indeed has an effect on the subjects allowing them to

detect an object in the room. It is difficult to assess what this effect is precisely, but that the

subjects were capable of performing above random is clear. Furthermore, when talking about

their experience with the system, most subjects reported that they indeed “felt” the presence

of an object in the room. This distal attribution is a difficult to measure effect, but it is clear

that together with the performance of the subjects, that it is very likely that they were capable

of finding the object during the experiment. What is remarkable about this is that they did so

without training. The two minute adaption period was not enough for users to extract which

parts of the audio signal were relevant. This is also clear from the performance during the first

five experiments. Most subjects made mistakes during these first five, but managed to improve

their performance, despite not getting any feedback during the experiment apart from the audio

signal of the system.

The most important thing that was shown is that the system does have potential and that it is

very much possible to create sensory substitution devices that operate on rich information. This

is important, since before 3D-VITA there were no sensory substitution devices that attempted

to utilise the sensory integration capabilities of the human brain.

Although the experiments show that the system indeed provides enough information for users to

perform a localisation task without the need for training, more rigorous testing is required to

fully understand what the system can provide. The experiments made use of a single red marker. In the future, this red marker needs to be replaced by real-world objects, to fully show the power of the system. Moreover, the red marker is difficult to find, since the red hue is converted to the primary note of the chord used by the synthesiser. This means that even when no red objects are in the field of view, the brain completes the chord, making it hard to focus on a single note. Furthermore, the red-yellowish tint of the lights in the room made the


task more difficult. Another potential issue is that the task was performed from a single vantage

point. In future experiments, it would be interesting to see how this system performs in a task

that allows movement. Due to hardware constraints, this was not possible at the current stage.

The localisation experiment was done by the subjects without prior training. Despite this lack

of training, the subjects exhibited improvement, even over a small number of trials. Future

experiments should include a training stage in which feedback is given during the experiment.

Perhaps the most obvious issue with the system in its current stage is the one-second interval and delay. This leads to a latency between the user's head position and the stimulus provided by

the system. This latency greatly increases the complexity of the task. To improve the system,

this delay needs to be addressed. A possible method to achieve this is to remove the SLIC

segmentation algorithm and replace it with a simpler method. The added benefit of removing

the delay, as well as the one-second interval at which stimuli are provided, is that the user gets

continuous feedback on the environment.

3D-VITA needs more experimentation to understand the limits and full capabilities of the

system. However, as the experiments have shown, it has great potential as a means to replace

vision without invasive surgery. The integration of sensory information is one of the great feats of

the brain and thus far, no sensory substitution device has attempted to explore the possibilities

of this powerful ability. With 3D-VITA, these integration capabilities are utilised to provide a

stronger sense of perceiving.


Bibliography

R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk. SLIC superpixels compared

to state-of-the-art superpixel methods. Pattern Analysis and Machine Intelligence, IEEE

Transactions on, 34(11):2274–2282, 2012.

A. Amedi, W. M. Stern, J. A. Camprodon, F. Bermpohl, L. Merabet, S. Rotman, C. Hemond,

P. Meijer, and A. Pascual-Leone. Shape conveyed by visual-to-auditory sensory substitution

activates the lateral occipital complex. Nature neuroscience, 10(6):687–689, 2007.

P. Arno, C. Capelle, M.-C. Wanet-Defalque, M. Catalan-Ahumada, and C. Veraart. Auditory

coding of visual patterns for the blind. Perception, 28:1013–1030, 1999.

P. Arno, A. G. De Volder, A. Vanlierde, M.-C. Wanet-Defalque, E. Streel, A. Robert, S. Sanabria-

Bohorquez, and C. Veraart. Occipital activation by pattern recognition in the early blind

using auditory substitution for vision. Neuroimage, 13(4):632–645, 2001a.

P. Arno, A. Vanlierde, E. Streel, M.-C. Wanet-Defalque, S. Sanabria-Bohorquez, and C. Veraart.

Auditory substitution of vision: pattern recognition by the blind. Applied Cognitive Psychology,

15(5):509–519, 2001b.

M. Auvray and E. Myin. Perception with compensatory devices: from sensory substitution to

sensorimotor extension. Cognitive Science, 33(6):1036–1058, 2009.

M. Auvray, S. Hanneton, C. Lenay, and K. O’Regan. There is something out there: distal

attribution in sensory substitution, twenty years later. Journal of Integrative Neuroscience, 4

(4):505–21, 2005.

M. Auvray, S. Hanneton, and J. K. O'Regan. Learning to perceive with a visuo-auditory substitution system: Localisation and object recognition with 'the vOICe'. Perception, 36(3):416, 2007.

P. Bach-y Rita. Brain plasticity as a basis for therapeutic procedures. Recovery of function:

Theoretical considerations for brain injury rehabilitation, pages 225–263, 1980.

P. Bach-y Rita. Tactile vision substitution: past and future. International Journal of

Neuroscience, 19(1-4):29–36, 1983.

P. Bach-y Rita. Tactile sensory substitution studies. Annals of the New York Academy of Sciences, 1013:83–91, 2004.

P. Bach-y Rita and S. W. Kercel. Sensory substitution and the human–machine interface. Trends

in cognitive sciences, 7(12):541–546, 2003.

P. Bach-Y-Rita, C. C. Collins, F. A. Saunders, B. White, and L. Scadden. Vision Substitution

by Tactile Image Projection. Nature, 221(5184):963–964, Mar. 1969.

D. R. Begault et al. 3-D sound for virtual reality and multimedia, volume 955. AP professional

Boston etc, 1994.

K. W. Berger. Some factors in the recognition of timbre. The Journal of the Acoustical Society

of America, 36(10):1888–1891, 2005.

B. Blasch, S. LaGrow, and W. De l’Aune. Three aspects of coverage provided by the long cane:

Object, surface, and foot-placement preview. Journal of Visual Impairment and Blindness, 90:

295–301, 1996.

D. J. Calder. Assistive technologies and the visually impaired: a digital ecosystem perspective.

In Proceedings of the 3rd International Conference on PErvasive Technologies Related to

Assistive Environments, page 1. ACM, 2010.

L. Cancar, A. Díaz, A. Barrientos, D. Travieso, and D. M. Jacobs. Tactile-sight: A sensory

substitution device based on distance-related vibrotactile flow. International Journal of

Advanced Robotic Systems, 10, 2013.

C. Capelle, C. Trullemans, P. Arno, and C. Veraart. A real-time experimental prototype for

enhancement of vision rehabilitation using auditory substitution. Biomedical Engineering,

IEEE Transactions on, 45(10):1279–1293, 1998.

A. Caspi, J. D. Dorn, K. H. McClure, M. S. Humayun, R. J. Greenberg, and M. J. McMahon.

Feasibility study of a retinal prosthesis: spatial vision with a 16-electrode implant. Archives

of Ophthalmology, 127(4):398–401, 2009.

D.-R. Chebat, C. Rainville, R. Kupers, and M. Ptito. Tactile-'visual' acuity of the tongue in

early blind individuals. Neuroreport, 18(18):1901–1904, 2007.

O. Collignon, M. Lassonde, F. Lepore, D. Bastien, and C. Veraart. Functional cerebral

reorganization for auditory spatial processing and auditory substitution of vision in early blind

subjects. Cerebral Cortex, 17(2):457–465, 2007.

J. Crary. Modernizing vision. Images: A Reader, page 270, 2006.

G. Declerck, C. Lenay, and A. Khatchatourov. Rendre tangible le visible. IRBM, 30(5):252–257,

2009.

A. Dodds et al. The nottingham obstacle detector: Development and evaluation. Journal of

Visual Impairment and Blindness, 75(5):203–09, 1981.

R. Erickson. Sound structure in music. Univ of California Press, 1975.

I. Fujinaga. Machine recognition of timbre using steady-state tone of acoustic musical instruments.

In Proceedings of the International Computer Music Conference, pages 207–10. Citeseer, 1998.

I. Fujinaga and K. MacMillan. Realtime recognition of orchestral instruments. In Proceedings of the International Computer Music Conference, volume 141, page 143, 2000.

R. W. Furse. Building an openal implementation using ambisonics. In Audio Engineering

Society Conference: 35th International Conference: Audio for Games. Audio Engineering

Society, 2009.

N. A. Giudice and G. E. Legge. Blind navigation and the role of technology. Engineering

handbook of smart technology for aging, disability, and independence, pages 479–500, 2008.

L. H. Goldish and H. E. Taylor. The optacon: A valuable device for blind persons. New Outlook

for the Blind, 68(2):49–56, 1974.

G. Guarniero. Experience of tactile vision. Perception, 3(1):101–104, 1974.

A. D. Heyes. The sonic pathfinder: A new electronic travel aid. Journal of Visual Impairment

and Blindness, 78(5):200–02, 1984.

M. S. Humayun, J. D. Weiland, G. Y. Fujii, R. Greenberg, R. Williamson, J. Little, B. Mech,

V. Cimmarusti, G. Van Boemel, G. Dagnelie, et al. Visual perception in a blind subject with

a chronic microelectronic retinal prosthesis. Vision research, 43(24):2573–2581, 2003.

A. Jacomuzzi and N. Bruno. Perceiving occlusion through auditory–visual substitution. Cognitive

Processing, 7:128–130, 2006.

K. A. Kaczmarek. The tongue display unit (tdu) for electrotactile spatiotemporal pattern

presentation. Scientia Iranica, 18(6):1476–1485, 2011.

K. Khoshelham. Accuracy analysis of kinect depth data. In ISPRS workshop laser scanning,

volume 38(5), page W12, 2011.

J.-K. Kim and R. J. Zatorre. Generalized learning of visual-to-auditory substitution in sighted

individuals. Brain research, 1242:263–275, 2008.

A. J. Kolarik, M. A. Timmis, S. Cirstea, and S. Pardhan. Sensory substitution information

informs locomotor adjustments when walking through apertures. Experimental brain research,

pages 1–10, 2013.

R. Kupers, D. R. Chebat, K. H. Madsen, O. B. Paulson, and M. Ptito. Neural correlates of

virtual route recognition in congenital blindness. Proceedings of the National Academy of

Sciences, 107(28):12716–12721, 2010.

R. Kurzweil, M. L. Schneider, and M. L. Schneider. The age of intelligent machines, volume

579. MIT press Cambridge, 1990.

R. Kurzweil, F. Bhathena, and S. Baum. Reading machine system for the blind having a

dictionary, Mar. 2000. US Patent 6,033,224.

V. Larcher, O. Warusfel, J.-M. Jot, and J. Guyard. Study and comparison of efficient methods

for 3-d audio spatialization based on linear decomposition of hrtf data. In Audio Engineering

Society Convention 108. Audio Engineering Society, 2000.

C. Lenay, S. Canu, and P. Villon. Technology and perception: the contribution of sensory

substitution systems. In Cognitive Technology, 1997. Humanizing the Information Age.

Proceedings., Second International Conference on, pages 44–53. IEEE, 1997.


C. Lenay, O. Gapenne, S. Hanneton, C. Marque, and C. Genouelle. Sensory substitution: Limits

and perspectives. Touching for knowing, pages 275–292, 2003.

D. Lescal, J. Rouat, and J. Voix. Sensorial substitution system from vision to audition using

transparent digital earplugs. In Proceedings of Meetings on Acoustics, volume 19, page 040014.

Acoustical Society of America, 2013.

D. J. Levitin. This is your brain on music: Understanding a human obsession. Atlantic Books

Ltd, 2013.

O. Linkovski, L. Akiva-Kabiri, L. Gertner, and A. Henik. Is it for real? evaluating authenticity

of musical pitch-space synesthesia. Cognitive processing, 13(1):247–251, 2012.

R. Liu and Y.-X. Wang. Auditory feedback and sensory substitution during teleoperated

navigation. Mechatronics, IEEE/ASME Transactions on, 17(4):680–686, 2012.

I. Matteau, R. Kupers, E. Ricciardi, P. Pietrini, and M. Ptito. Beyond visual, aural and haptic

movement perception: hmt+ is activated by electrotactile motion stimulation of the tongue in

sighted and in congenitally blind individuals. Brain research bulletin, 82(5):264–270, 2010.

S. Meers and K. Ward. Substitute three-dimensional perception using depth and colour sensors.

Faculty of Informatics-Papers, page 578, 2007.

L. B. Merabet, L. Battelli, S. Obretenova, S. Maguire, P. Meijer, and A. Pascual-Leone.

Functional recruitment of visual cortex for sound encoded object identification in the blind.

Neuroreport, 20(2):132–138, 2009.

R. A. Normann, E. M. Maynard, P. J. Rousche, and D. J. Warren. A neural interface for a

cortical vision prosthesis. Vision research, 39(15):2577–2587, 1999.

C. V. Parise and C. Spence. Audiovisual crossmodal correspondences and sound symbolism: A

study using the implicit association test. Experimental brain research, 220(3-4):319–333, 2012.

K. Patil, D. Pressnitzer, S. Shamma, and M. Elhilali. Music in our ears: the biological bases of

musical timbre perception. PLoS computational biology, 8(11):e1002759, 2012.

W. Penrod and T. Simmons. An evaluation and comparison of the hand guide by guideline and

the miniguide developed by gdp research electronic travel devices. Closing the Gap, 23(6):

22–24, 2005.

C. Poirier, A. De Volder, D. Tranduy, and C. Scheiber. Pattern recognition using a device

substituting audition for vision in blindfolded sighted subjects. Neuropsychologia, 45(5):

1108–1121, 2007a.

C. Poirier, A. G. D. Volder, and C. Scheiber. What neuroimaging tells us about sensory

substitution. Neurosci Biobehav Rev, 31(7):1064–1070, 2007b.

B. Pollok, I. Schnitzler, P. Stoerig, T. Mierdorf, and A. Schnitzler. Image-to-sound conversion:

experience-induced plasticity in auditory cortex of blindfolded adults. Experimental brain

research, 167(2):287–291, 2005.

N. Pressey. Mowat sensor. Focus, 11(3):35–39, 1977.

M. J. Proulx. Synthetic synaesthesia and sensory substitution. Consciousness and Cognition, 19(1):501–503, 2010.

M. J. Proulx, P. Stoerig, E. Ludowig, and I. Knoll. Seeing 'where' through the ears: effects of

learning-by-doing and long-term sensory deprivation on localization based on image-to-sound

substitution. PLoS One, 3(3):e1840, 2008.

M. Ptito, S. M. Moesgaard, A. Gjedde, and R. Kupers. Cross-modal plasticity revealed by

electrotactile stimulation of the tongue in the congenitally blind. Brain, 128(3):606–614, 2005.

M. Ptito, A. Fumal, A. M. de Noordhout, J. Schoenen, A. Gjedde, and R. Kupers. Tms of the

occipital cortex induces tactile sensations in the fingers of blind braille readers. Experimental

Brain Research, 184(2):193–200, 2008.

J. P. Rauschecker. Compensatory plasticity and sensory substitution in the cerebral cortex.

Trends in neurosciences, 18(1):36–43, 1995.

J. P. Rauschecker. Cortical plasticity and music. Annals of the New York Academy of Sciences,

930(1):330–336, 2001.

L. Renier, C. Laloyaux, O. Collignon, D. Tranduy, A. Vanlierde, R. Bruyer, and A. G. De Volder.

The ponzo illusion with auditory substitution of vision in sighted and early-blind subjects.

Perception, 34(7):857–867, 2004.

L. Renier, O. Collignon, C. Poirier, D. Tranduy, A. Vanlierde, A. Bol, C. Veraart, and A. G.

De Volder. Cross-modal activation of visual cortex during depth perception using auditory

substitution of vision. Neuroimage, 26(2):573–580, 2005.

J. F. Rizzo III, J. Wyatt, M. Humayun, E. de Juan, W. Liu, A. Chow, R. Eckmiller, E. Zrenner,

T. Yagi, and G. Abrams. Retinal prosthesis: an encouraging first decade with major challenges

ahead. Ophthalmology, 108(1):13–14, 2001.

U. R. Roentgen, G. J. Gelderblom, M. Soede, and L. P. de Witte. Inventory of electronic

mobility aids for persons with visual impairments: A literature review. Journal of Visual

Impairment & Blindness, 102(11), 2008a.

U. R. Roentgen, G. J. Gelderblom, M. Soede, and L. P. de Witte. Journal of Visual Impairment & Blindness, pages 702–724, 2008b.

O. Sacks. Musicophilia: Tales of music and the brain. Random House LLC, 2010.

T. Schanze, L. Hesse, C. Lau, N. Greve, W. Haberer, S. Kammer, T. Doerge, A. Rentzos,

and T. Stieglitz. An optically powered single-channel stimulation implant as test system for

chronic biocompatibility and biostability of miniaturized retinal vision prostheses. Biomedical

Engineering, IEEE Transactions on, 54(6):983–992, 2007.

J. H. Siegle and W. H. Warren. Distal attribution and distance perception in sensory substitution.

Perception, 39(2):208, 2010.

G. Smith. The stereotoner: A new reading aid for the blind. In Proceedings of the 25th Annual

Conference on Engineering and Medical Biology, 1972.

R. D. Steele, G. L. Goodrich, D. Hennies, and J. A. McKinley. Reading aid technology for blind

Page 44: 3D-VITA - UvA · CHAPTER 1. INTRODUCTION 2 those that have disabilities. However, the issues with sensory impairment do not necessarily arise from the lack of sensory input, …

BIBLIOGRAPHY 41

persons: Responses to a questionnaire of experienced users. Assistive Technology, 1(2):23–30,

1989.

E. Striem-Amit, L. Cohen, S. Dehaene, and A. Amedi. Reading with sounds: sensory substitution

selectively activates the visual word form area in the blind. Neuron, 76(3):640–652, 2012.

J. Ward and P. Meijer. Visual experiences in the blind induced by an auditory sensory substitution

device. Consciousness and cognition, 19(1):492–500, 2010.

B. W. White, F. A. Saunders, L. Scadden, P. Bach-Y-Rita, and C. C. Collins. Seeing with the

skin. Perception & Psychophysics, 7(1):23–27, 1970.

M. D. Williams, C. T. Ray, J. Griffith, and W. De l’Aune. The use of a tactile-vision sensory

substitution system as an augmentative tool for individuals with visual impairments. Journal

of Visual Impairment and Blindness, 105(1), Jan. 2011.

G. Xydas, V. Argyropoulos, T. Karakosta, and G. Kouroupetroglou. An experimental approach

in recognizing synthesized auditory components in a non-visual interaction with documents.

Proc. Human-Computer Interaction-HCII, 2005.

D. Yanai, J. D. Weiland, M. Mahadevappa, R. J. Greenberg, I. Fine, and M. S. Humayun. Visual

performance using a retinal prosthesis in three subjects with retinitis pigmentosa. American

journal of ophthalmology, 143(5):820–827, 2007.

J. S. Zelek, S. Bromley, D. Asmar, and D. Thompson. A haptic glove as a tactile-vision sensory

substitution for wayfinding. Journal of Visual Impairment & Blindness, 97(10), 2003.

E. Zrenner. Will retinal implants restore vision? Science, 295(5557):1022–1025, 2002.

E. Zrenner, K. U. Bartz-Schmidt, H. Benav, D. Besch, A. Bruckmann, V.-P. Gabel, F. Gekeler,

U. Greppmaier, A. Harscher, S. Kibbel, et al. Subretinal electronic chips allow blind patients

to read letters and combine them to words. Proceedings of the Royal Society B: Biological

Sciences, 278(1711):1489–1497, 2011.

“Good sense is at the bottom of everything: virtue, genius, wit, talent and taste.”
— J.J. de Chenier (1764–1811)