
Optimizing personalized 3D soundscape for a wearable mobility aid for the blind

Submitted Master Thesis

by

stud. Alexis Guibourgé

born on 28.07.1990, residing at: Lierstr. 11A

80639 München, Tel.: 0179 1762787

Lehrstuhl für STEUERUNGS- und REGELUNGSTECHNIK

Technische Universität München

UNIV.-PROF. DR.-ING./UNIV. TOKIO MARTIN BUSS

Supervisors: Prof. Dr. Jörg Conradt and Prof. Dr.-Ing. Bernhard Seeber
Start: 01.04.2015
Intermediate report: 28.07.2015
Submission: 01.10.2015


Abstract

Visually impaired people face difficulties in their daily lives, as detecting and avoiding obstacles is highly complex. To assist them in this daily task, AuvioLab created a hearing-based device: each obstacle is represented by a virtual sound placed at the same position. The superposition of virtual sounds creates a soundscape, and users can avoid the obstacles by localizing the different sounds in three-dimensional space. Unfortunately, obtaining the best possible precision and accuracy in sound localization requires individual and complex measurements to build the soundscape, which is either time consuming or expensive. The goal of this master thesis was to implement a low-cost and fast soundscape-individualization process. Consequently, individual measurements were avoided and other strategies were designed. One strategy was implemented and the users' performance was measured. Based on the results, the soundscape design was optimized in order to improve the individualization and thus the sound-localization precision and accuracy achievable by the user. An improvement of 70% in accuracy was achieved, and a resolution of nine degrees horizontally and twelve degrees vertically was attained.


Contents

1 Introduction
  1.1 Motivation
  1.2 Mission
  1.3 Thesis goal

2 Existing Mobility Aids for the Visually Impaired
  2.1 Commercialized Devices
    2.1.1 Commonly Used Devices
    2.1.2 Electronic Devices
  2.2 AuvioLab's Device
    2.2.1 Environment Perception
    2.2.2 Soundscape Limitations
  2.3 Conclusion

3 Preliminaries
  3.1 Mathematical Background
    3.1.1 Coordinate System
    3.1.2 Anatomical Planes
    3.1.3 Signal Processing
    3.1.4 Statistics
  3.2 Sensory Substitution
  3.3 Physics of Three-Dimensional Hearing
    3.3.1 Sound
    3.3.2 Psychoacoustics
    3.3.3 Human Performance in Sound Localization
  3.4 Virtual Auditory Display
    3.4.1 Head-Related Transfer Function
    3.4.2 Soundscape Creation
    3.4.3 Headphone Equalization
  3.5 Conclusion

4 Soundscape Individualization
  4.1 Individualization Review
    4.1.1 Generic HRTF
    4.1.2 Modelization
    4.1.3 Subjective Selection
    4.1.4 Anthropometric Matching Method
  4.2 Proposed Individualization
    4.2.1 Hybrid Selection
  4.3 Soundscape Parameters
    4.3.1 Stimulus
    4.3.2 The CIPIC Database
  4.4 Conclusion

5 Sound Localization Experiment
  5.1 Data Acquisition
    5.1.1 Pointing Paradigm Review
    5.1.2 Proposed Paradigm
    5.1.3 Data Acquisition Procedure
    5.1.4 Measurements
  5.2 Results
    5.2.1 Pinna Matching
    5.2.2 Azimuth
    5.2.3 Elevation
    5.2.4 Comparison
  5.3 Discussion
    5.3.1 Soundscape Modifications
    5.3.2 Head Tracking
    5.3.3 Auralization
    5.3.4 Training
    5.3.5 Elevation Coding
  5.4 Conclusion

6 Soundscape Optimization
  6.1 Introduction
  6.2 HRTF Selection Modification
  6.3 Elevation Coding Review
    6.3.1 Directional Bands
    6.3.2 Covert Peaks
    6.3.3 Natural Frequency Elevation Mapping
    6.3.4 Artificial Coding
    6.3.5 Coding Design
  6.4 Experiment
  6.5 Results
    6.5.1 Azimuth
    6.5.2 Elevation
    6.5.3 Comparison and Discussion
  6.6 Conclusion

7 Conclusion
  7.1 Future Work

List of Figures

Bibliography


Chapter 1

Introduction

In 2014, the World Health Organization estimated that 285 million people were visually impaired worldwide [Org14]. Among them, 39 million are blind. Due to the growing population, especially in emerging countries, these numbers are increasing: in 2003, the World Health Organization estimated the number of visually impaired people at only 135 million. Moreover, the World Health Organization reports that 85% of visually impaired people live in poor countries. Cataract is the largest cause of blindness, as the Global Data on Visual Impairment 2010 states; this disease is said to be responsible for approximately half of all blindness and visual impairment cases. Surgery can partially treat it, but this treatment remains rare and even inexistent in some emerging countries. Moreover, many other diseases or congenital defects are still untreatable, for example Stargardt disease. The medical field tries to cure blindness through prevention and surgery. However, as medical treatments are expensive, not always available and restricted to certain diseases, other professions are trying to use technology to help blind people in their daily lives. Blind people need help, since blindness makes it very difficult or even impossible to accomplish vital tasks such as finding objects, reading and traveling safely. Devices are available or being developed to help them accomplish such vital tasks. This work focused on the task of mobility and safe navigation.

1.1 Motivation

Some solutions have been found to help blind people travel alone, like the white cane or the guide dog. Now, new kinds of devices are emerging that use new technologies. Among these devices, some are based on sensory substitution, that is, the replacement of vision by another sense such as touch or hearing. Unfortunately, none of the developed devices are widely used, and several reasons might explain this. Some tools require complex training, and the existing devices are either too expensive, too complicated to use or not efficient enough. Therefore, the white cane remains the most commonly used device.


1.2 Mission

The need for an intuitive and low-cost device motivated the creation of the startup AuvioLab. This startup develops a device based on the above-mentioned sensory substitution [Per15]. Indeed, it is known that not only the eyes but also the ears are able to capture distances and spatial cues. Moreover, spatial audio has made huge progress during the last decades, and it is now possible to produce 3D sounds through headphones [WK89b, EMWW93]. Therefore, the created device simply captures the visual environment with two Dynamic Vision Sensor (DVS) cameras; a processor reconstructs the three-dimensional image of the space, converts it into a three-dimensional sound, called a soundscape, and this is then transmitted through headphones to the user. The user's brain is then expected to adapt to the new stimuli, which contain the spatial cues the eyes can no longer capture, and to process them using the visual cortex, as it would do with visual stimuli.

1.3 Thesis goal

Unfortunately, each person perceives sound in a unique way. Ear shape and head and torso size are parameters that change the sound reaching the eardrum. Therefore, it is impossible to produce a single soundscape for every listener without degrading localization performance. The most efficient way to address this issue is to perform individual measurements on each person, but this procedure is either expensive or time consuming. AuvioLab's strategy is to build a low-cost and easily adaptable device; consequently, performing these complex measurements is not an option. The goal of this thesis was to find a soundscape-individualization process that optimizes cost, time consumption and achievable performance. Localization experiments were conducted to measure the performance and improve the process.


Chapter 2

Existing Mobility Aids for the Visually Impaired

2.1 Commercialized Devices

This second chapter introduces the market of mobility aids. AuvioLab's main competitors are presented for a better understanding of the expectations placed on the device. These expectations guided the strategy deployed to create and optimize the soundscapes.

2.1.1 Commonly Used Devices

Different mobility assistance devices exist to help blind people in their daily lives, but the only widely spread tools are the white cane and the guide dog. The white cane is a long stick of approximately one and a half meters with a spherical tip at its end. By rubbing this tip against the ground, the user is able to collect various spatial cues. Firstly, he can detect objects lying on the ground, and walls, when the cane is blocked. The change in the strength required to move the white cane also gives information on the ground type. The angle of the cane indicates the elevation of the surrounding environment and enables the blind person to find holes or sidewalk edges. Finally, the sound produced by the white cane makes it possible for the user to guess the type of surface in front of him, for example to distinguish grass from gravel. The white cane is a low-cost tool and can be bought from thirty euros upwards. However, it has limitations. The short range of the cues given by the cane keeps the user from feeling comfortable in unknown environments. The fact that the user can only detect objects near the ground is also a critical limitation. Another mobility aid for blind people is the guide dog, the second most used aid. Similar to the white cane, the guide dog enables the user to avoid obstacles. Furthermore, the dogs are trained to cross streets and find specific objects like doors. Finally, dogs can memorize certain paths, so the user can follow the dog to go shopping or to work, for example. Unfortunately, allergies, price and housing facilities are factors restricting the use of guide dogs. The cost of a guide dog from birth to retirement is approximately 67 000 euros, training included [Gui14].

2.1.2 Electronic Devices

In order to overcome the above-mentioned limitations of the white cane and the guide dog, other tools using new technologies have been developed over the last decades. The UltraCane is one of these new tools. Developed at a UK university [Ult], the UltraCane is an improvement of the white cane: ultrasonic emitters have been implemented on its tip. Objects at a distance of less than 4 meters from the user can be detected, and the user receives this information by feeling the vibration of the cane. However, the field of view remains small and the price of this tool is high (about 800 euros). iGlasses is another device helping blind people in their mobility tasks [RNI15]. It again uses ultrasonic sensors to detect objects. Moreover, the sensors are located on the head of the user, so that the user can choose the field of view by moving his head. Unfortunately, this tool is not able to indicate the distance of the detected object, so it is not feasible to fully rely on it. Therefore, this tool is only meant to be used in addition to another assistive device. The above-mentioned devices collect the important spatial cues so that a blind person can move again without colliding with objects. The problem is that it is hard to collect the right cues. Some tools try to address this problem by collecting all the cues that a normally seeing person could catch and by transmitting all of them to the user. This can be done through sensory substitution. A famous tool using this is the BrainPort device [BT15]. It consists of a head-mounted camera connected to 400 electrodes placed on the user's tongue. The electrodes display the camera-captured images by producing a stimulation on the user's tongue for each pixel: the brighter the pixel, the stronger the stimulation. After ten hours of training, the user no longer focuses on the translation of tactile to visual images; sensory substitution occurs and this translation becomes unconscious. Therefore, the interpretation of the stimulus is quick and the user regains the sensation of seeing. However, the perceived grayscale images do not provide enough resolution to fully rely on them. The BrainPort is thus meant to be used as an additional device rather than an alternative device. Moreover, this device is only available in the United States of America and its price is not communicated; however, a review estimated it at around 10 000 dollars [Blo15]. Finally, a system based on auditory visual sensory substitution also exists: the vOICe [Mei92]. It is a device that captures a two-dimensional image of the environment through a head-mounted camera and codes it with sounds. The coding algorithm converts the vertical positions into frequency and the horizontal axis into loudness. The main drawback of this method is that it requires a time-consuming training stage before reliable results are attained. Figure 2.1 summarizes this information.


Mobility Aid   Price       Comments
White Cane     30-150 €    small field of view
Guide Dog      60 000 €    expensive and binding
UltraCane      800 €       reduced visibility range
iGlasses       100 €       not a self-sufficient device
BrainPort      10 000 $    expensive; intensive training needed
The vOICe      500 €       intensive training needed

Figure 2.1: Different mobility-aid devices and their prices. The prices are only approximations.


2.2 AuvioLab’s Device

BrainPort and the vOICe, both based on sensory substitution, share the main drawback of not being intuitive. Indeed, the coding used by the vOICe is complicated, and the user needs to go through a time-consuming training stage to be able to understand the sound stimulus provided. Regarding BrainPort, besides the fact that the brain is not used to capturing spatial cues through touch, the stimulus given by the device does not contain enough information to replace the white cane or the guide dog. In order to address these issues, AuvioLab aims to create a sufficient, intuitive and reliable device based on auditory visual sensory substitution. The actual device is made up of three parts. On the front end, the three-dimensional environment is captured through two sensors. Then, this environment is reconstructed and the important information is selected [Per15]. Finally, a sound stimulus containing all this information, called a soundscape, is created.

2.2.1 Environment Perception

Two Dynamic Vision Sensor (DVS) cameras are used to reconstruct the three-dimensional environment of the user through stereoscopy [Per15]. These sensors are retina-inspired; they can process visual information with little power. Indeed, DVS cameras are only sensitive to changes in light intensity, and redundant data is automatically discarded. The DVS cameras generate up to 16 000 events every 50 ms [Per15]. These generated events are assumed to capture enough information on the environment to give a good perception of it. Even if the DVS only selects non-redundant events, not all of them are useful for the user with respect to the mobility task. Indeed, only the events linked to an obstacle on the path should be considered. Consequently, different strategies to find these obstacles are being investigated; filtering and clustering are the main algorithms used at the moment to select the relevant events. Once the 3D environment is reconstructed through the DVS cameras and the relevant events are selected, the auditory stimulus is constructed. To do so, a three-dimensional soundscape is created: each event, linked to a cartesian position, is associated with a sound placed at the same position through headphones. The user captures the position of the event by locating this virtual sound; he then hears his spatial environment.

2.2.2 Soundscape Limitations

Unfortunately, hearing is not as precise as vision for the localization task. Indeed, localizing an object through hearing can only be done with a certain error, divided into precision and accuracy errors. Another difference between hearing and vision is the quantity of information these senses can process. A sighted person can perceive a very complex environment in a split second with vision, whereas localizing even a single sound source is often very difficult. Finally, one of the most important issues the AuvioLab team encounters is the fact that each human auditory system is unique; each person hears sound differently. Therefore, the system has to be adapted to every user. A perfect adaptation of the system can only be obtained through individual measurements. Unfortunately, those measurements are expensive and time consuming, and are therefore not an option for a device meant to be distributed to a large number of people.

2.3 Conclusion

This review shows that it is difficult to break into the market of mobility assistive devices. Electronic devices require additional training from the user and struggle to win users' trust. AuvioLab believes it is possible to make an intuitive and reliable device using auditory-vision substitution. To achieve a breakthrough in the market, AuvioLab has to create soundscapes of high quality under cost and user-adaptation constraints. To fulfill these objectives, the soundscape-individualization process has to be designed and optimized.


Chapter 3

Preliminaries

This chapter explains the mathematical tools and physical concepts used in this thesis, in order to aid understanding and avoid confusion.

3.1 Mathematical Background

In this work, several mathematical tools were used to represent a spatial environment, to create sound signals and to analyse the measurements. Those tools are described in the following section.

3.1.1 Coordinate System

As the localization of sound sources in space is central to this work, it is necessary to choose an appropriate geometrical representation of this space. Several coordinate systems are frequently used for this purpose.

Spherical Coordinate System

The spherical coordinate system defines the position of a point P with three numbers: the radial distance, the polar angle and the azimuthal angle. The radial distance is the length of the vector going from the origin to P. The polar angle is the angle between this vector and the zenith direction. The azimuthal angle is the angle between the orthogonal projection of this vector onto the plane defined by the X and Y axes, and the X axis. Figure 3.1 illustrates this. When this coordinate system is used for sound localization, the origin corresponds to the center of the listener's head, and the radial distance, polar angle and azimuthal angle describe the position of the sound source. This coordinate system is widely used in the literature. However, in most cases, the polar angle is replaced by the elevation angle, which corresponds to the angle between the sound position and its projection onto the plane defined by the X and Y axes. In this work, "spherical coordinates" refers to this set of parameters.


Figure 3.1: Representation of the Spherical Coordinate System. [Sph]

For convenience, it is also possible to use another coordinate system, described in the following.

Interaural Polar Coordinate System

In this coordinate system, the position of a point P is described by the radial distance, the azimuthal angle and the elevation angle. The radial distance is still the length of the vector going from the origin to P. However, the azimuthal angle is now the angle between this vector and its projection onto the plane defined by the Z and Y axes. The elevation is here defined as the angle between the vector going from the origin to the latter projection and the Y axis. The coordinates are represented in Figure 3.2.

Coordinate System Transformation

As both coordinate systems were used, it is necessary to establish the transformations connecting them; the needed transformations are derived in the following. Let us define ρ, θ, φ as the radial distance, the azimuth and the elevation of a point P in the spherical coordinate system, and ρ′, θ′, φ′ as the radial distance, the azimuth and the elevation of the same point in the interaural polar coordinate system.

Converting the spherical coordinates to cartesian coordinates is straightforward and leads to Equations (3.1), (3.2) and (3.3). Equations (3.4) and (3.5) give the relationship between the cartesian and interaural coordinates.

x = ρ sin(θ) cos(φ)   (3.1)

y = ρ cos(θ) cos(φ)   (3.2)

z = ρ sin(φ)   (3.3)

tan(φ′) = z / y   (3.4)

sin(θ′) = x / ρ   (3.5)

Figure 3.2: Representation of the Interaural Coordinate System. x1, x2 and x3 correspond to the previously-introduced X, Y and Z respectively. [Int]

Combining Equations 3.1 to 3.5 leads to Equations 3.6 and 3.7, which give the desired relationship, provided the angles are restricted to intervals on which the sine and tangent functions are invertible. These formulae can be found in Xie's book [Xie13].

φ′ = tan⁻¹( tan(φ) / cos(θ) )   (3.6)

θ′ = sin⁻¹( sin(θ) · cos(φ) )   (3.7)
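As an illustration of Equations 3.6 and 3.7, the following Python sketch converts spherical (azimuth, elevation) angles into interaural-polar angles. It is not part of the thesis software; the function name and the example angles are purely illustrative.

    import numpy as np

    def spherical_to_interaural(azimuth, elevation):
        """Convert spherical (azimuth, elevation) angles, in radians, into
        interaural-polar angles using Equations 3.6 and 3.7; the radial
        distance is unchanged by the transformation."""
        phi_i = np.arctan2(np.tan(elevation), np.cos(azimuth))    # Eq. 3.6
        theta_i = np.arcsin(np.sin(azimuth) * np.cos(elevation))  # Eq. 3.7
        return theta_i, phi_i

    # Example: a source 30 degrees to the right and 20 degrees up.
    theta_i, phi_i = spherical_to_interaural(np.radians(30.0), np.radians(20.0))
    print(np.degrees(theta_i), np.degrees(phi_i))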

To avoid any ambiguity, the adopted coordinate system is specified before each use.

3.1.2 Anatomical Planes

The anatomical planes were also used in this work to describe spatial locations with respect to the human body. They are represented in Figure 3.3.


The transverse planes, also called horizontal planes, are all the planes orthogonal to the human body axis. In the following, this definition is restricted: the horizontal plane is considered to be the plane orthogonal to the body axis and cutting the head at ear level. The sagittal planes are all the planes dividing the body into left and right portions. The midsagittal plane, also called the median plane, is the sagittal plane containing the body axis. Finally, the coronal planes are the planes dividing the body into front and back portions.

Figure 3.3: Representation of the sagittal, coronal and transversal planes. [Bod]

3.1.3 Signal Processing

Signal processing is necessary to understand, describe and reproduce sounds. It provides powerful mathematical tools to do so, and the following section introduces the main ones used.

The Fourier Transform

The sounds heard in daily life rarely consist of pure tones, but they can be expressed as a superposition of pure tones. The brain is sensitive to this superposition, and the Fourier Transform represents it by transforming the time-domain representation of a signal into the frequency domain.


The Discrete Fourier Transform of a signal is given by Equation 3.8.

X_k = Σ_{n=0}^{N−1} x_n e^{−j2πkn/N},   k = 0, 1, …, N−1   (3.8)

The sequence of the X_k constitutes the spectrum of the signal.
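As a small illustration of Equation 3.8, the snippet below builds a two-tone signal and computes its spectrum with NumPy's FFT, which evaluates the same sums; the sampling rate and tone frequencies are arbitrary choices, not values from this work.

    import numpy as np

    fs = 16000                      # sampling rate in Hz (arbitrary choice)
    t = np.arange(0, 0.1, 1 / fs)   # 100 ms of signal
    x = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 2000 * t)

    X = np.fft.rfft(x)              # DFT of the real-valued signal (Eq. 3.8)
    freqs = np.fft.rfftfreq(len(x), 1 / fs)

    # The two largest spectral magnitudes sit at the component frequencies.
    print(freqs[np.argsort(np.abs(X))[-2:]])   # -> 2000.0 and 440.0 Hz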

Filtering

To change the properties of a sound signal, it is possible to modify the shape of its spectrum. This can be done with an equalization process. In order to remove undesired frequencies, low-pass, band-pass, band-stop and high-pass filters can be used. It is also possible to use shelving filters in order to boost or attenuate a certain frequency range without changing the other frequencies.

The cutoff frequency of a filter corresponds to the frequency at which the gain is 3 dB below the nominal gain. It is the arbitrary boundary between the passband and the transition band. Figure 3.4 shows the gain of a band-pass filter, where fL and fH are the cutoff frequencies, f0 the center frequency and B the bandwidth.

Figure 3.4: Gain diagram of a band-pass filter. fL and fH are the low-pass and high-pass cutoff frequencies respectively. f0 is the center frequency. [Ban]
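As a sketch of such a band-pass filter, the snippet below designs a Butterworth filter with SciPy; the sampling rate and the cutoff frequencies fL and fH are illustrative assumptions, not values used in the thesis.

    import numpy as np
    from scipy import signal

    fs = 44100                       # sampling rate in Hz (assumed)
    f_low, f_high = 500.0, 4000.0    # illustrative cutoff frequencies fL and fH

    # 4th-order Butterworth band-pass; the gain is -3 dB at both cutoffs.
    sos = signal.butter(4, [f_low, f_high], btype="bandpass", fs=fs, output="sos")

    # Keep only the 500 Hz - 4 kHz band of one second of white noise.
    noise = np.random.randn(fs)
    filtered = signal.sosfilt(sos, noise)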

Impulse Response

As sound travels from one point to another, it propagates through a medium and undergoes transformations that affect its spectral and temporal properties. This can be modeled as a dynamic system excited by an input signal and outputting another signal. It can be useful to reproduce these transformations for any given input signal, in order to simulate a particular dynamic system and therefore a particular spatial environment.


To reproduce the transformations a dynamic system would apply to an input signal, the impulse response of this dynamic system is needed. The impulse response is the output signal of the system when it is excited with a Dirac impulse. Because a Dirac impulse contains all possible frequencies, the impulse response makes it possible to fully simulate the related dynamic system.

Convolution Theorem Duality

A way to filter a signal is to convert it to the spectral domain and to multiply its spectrum by the filter spectrum. However, the convolution theorem duality states that a multiplication in the spectral domain is equivalent to a convolution in the time domain. Therefore, it is also possible to filter a signal by convolving it in the time domain with the time-domain representation of the filter. Convolving the input signal with a system's impulse response therefore yields the system's output signal.
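The duality can be checked numerically. The following sketch, using arbitrary random signals, filters once by time-domain convolution and once by multiplication of the spectra:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(256)   # input signal
    h = rng.standard_normal(64)    # impulse response of some FIR system

    # Time domain: convolve the input with the impulse response.
    y_time = np.convolve(x, h)

    # Frequency domain: multiply the spectra, then transform back.
    n = len(x) + len(h) - 1
    y_freq = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(h, n), n)

    print(np.allclose(y_time, y_freq))   # True: both yield the system output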

3.1.4 Statistics

A part of this work was to analyse the localization performance of different listeners for a given soundscape. The listener was meant to localize the sound sources with a certain degree of closeness, which refers to accuracy. The listener also had to be able to repeat these localizations with stability, which refers to precision. Finally, differences between individuals had to be reduced; this is the adaptivity or individualization criterion. These performances were measured by having a listener estimate sound positions and by collecting these estimations. In order to extract the performances from these estimations, statistical tools were required. These tools are adapted from Probability and Statistics for Engineers and Scientists [Ros09]. The relation between individual and group performance is derived for better understanding.

Precision and Accuracy

Let X be the estimation of a sound-source position located at x. As this estimation is repeated several times, it can be considered to be a random variable. Equations 3.9 and 3.10 define the mean value and the individual signed error respectively, E[X] being the expectation of the random variable X. The signed error indicates how close the estimations are to the real position on average; it therefore refers to the accuracy of the user. The unsigned error, given by Equation 3.11, gives a value containing the accuracy as well as the precision. Indeed, this value is equal to zero if both the accuracy and the precision are perfect, but grows if either of these two performances is degraded. To purely measure the precision, the standard deviation is needed; it can be computed from the variance (Equation 3.12) with Equation 3.13.

X̄ = E[X]   (3.9)

e_sign[X] = E[X] − x   (3.10)

e_unsign[X] = E[|X − x|]   (3.11)

Var[X] = E[(X − E[X])²]   (3.12)

Sdv[X] = √Var[X]   (3.13)

These values set up a sufficient set to describe the accuracy and the precision a listener can achieve, and will allow performance comparisons with other studies.
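A minimal sketch of how these quantities can be computed from repeated estimates follows; the function name and the sample data are illustrative and not taken from the experiment.

    import numpy as np

    def localization_metrics(estimates, true_position):
        """Accuracy and precision metrics of repeated position estimates
        (Equations 3.9 to 3.13), in the same angular unit as the inputs."""
        estimates = np.asarray(estimates, dtype=float)
        mean = estimates.mean()                                     # Eq. 3.9
        signed_error = mean - true_position                         # Eq. 3.10
        unsigned_error = np.abs(estimates - true_position).mean()   # Eq. 3.11
        std = estimates.std(ddof=1)                                 # Eq. 3.13
        return signed_error, unsigned_error, std

    # Example: five azimuth estimates (degrees) of a source actually at 30 degrees.
    print(localization_metrics([27.0, 33.0, 31.0, 25.0, 36.0], 30.0))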

Population Performance

As previously explained, tests were conducted on different persons of a group, or population. Not only the individual performances but also the group performance had to be considered. It was thus necessary to derive the relationship between individual and group performance. The group's estimations are the concatenation of all the individuals' estimations. The following paragraphs derive these relationships. A population formed of two persons is considered; the generalization to a larger number of persons is straightforward. Let us introduce four random variables: X, Y, W and Z. X and Y correspond to the estimations of two single persons. W is a discrete random variable taking the values 0 and 1 with the same probability, and Z is the random variable associated with the population, mathematically described by Equation 3.14.

Z = W · X + (1 − W) · Y   (3.14)

As W, X and Y are independent random variables, applying Equation 3.10 to Z directly leads to Equation 3.15. The unsigned error of Z is obtained with Equation 3.11 and gives Equation 3.16. The König-Huygens theorem gives another expression of the variance, Equation 3.17. Using this expression and replacing Z by its definition in Equation 3.14, the relationship between the variance of Z and the variances of X and Y can be obtained. This relationship is given by Equation 3.18.

e_sign[Z] = ½ (e_sign[X] + e_sign[Y])   (3.15)

e_unsign[Z] = ½ (e_unsign[X] + e_unsign[Y])   (3.16)

Var[Z] = E[Z²] − E[Z]²   (3.17)

Var[Z] = ½ (Var[X] + Var[Y]) + ((E[X] − E[Y]) / 2)²   (3.18)


The latter equation is important as it emphasizes the fact that the variance of a population is always at least as large as the average of the individual variances. This difference increases as the difference between individual means increases, and Figure 3.5 illustrates this.

Figure 3.5: Left: Distributions of the random variables X and Y, assumed to be gaussian. Right: Distribution of the random variable Z. It can intuitively be observed that the variance of Z is larger than the average variance of X and Y.

Therefore, individual precision performance here refers to the mean of the individual standard deviations. To express the perception difference between the population's individuals, the between-mean variance is used, given by Equation 3.19, i being a discrete, equally distributed random variable corresponding to the ith individual and N the number of individuals.

Var_bm[Z] = Var[ E[X_i] ],   i ∈ {1, …, N}   (3.19)
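Equation 3.18 can be verified numerically. In the sketch below (synthetic data with illustrative biases and spreads), the pooled variance of two equally weighted listeners matches the right-hand side of Equation 3.18:

    import numpy as np

    rng = np.random.default_rng(1)

    # Two listeners' estimates of the same source, with different biases and spreads.
    x = rng.normal(loc=28.0, scale=3.0, size=100_000)   # listener X
    y = rng.normal(loc=34.0, scale=4.0, size=100_000)   # listener Y

    z = np.concatenate([x, y])   # pooled population estimates (Eq. 3.14, fair W)

    lhs = z.var()
    rhs = 0.5 * (x.var() + y.var()) + ((x.mean() - y.mean()) / 2) ** 2   # Eq. 3.18
    print(lhs, rhs)   # essentially equal; larger than the average of the variances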

Estimator

Since the estimations are known only for a limited set of observations, the above-mentioned quantities are approximated through estimators. Let Y be a random variable and y_i the ith observation of Y, i going from 1 to N. The estimators are given by Equations 3.20 and 3.21 [Bol01].

E[Y] = (1/N) Σ_{i=1}^{N} y_i   (3.20)

Var[Y] = (1/N) Σ_{i=1}^{N} y_i² − E[Y]²   (3.21)

However, as the observations are limited, the mean value E[Y] is itself a random variable. The above variance is thus biased and must be corrected to obtain the true variance. Equation 3.22 gives the relationship between the true variance and the biased variance:

Var_true[Y] = (N / (N − 1)) · Var_bias[Y]   (3.22)
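Equation 3.22 is the usual Bessel correction; a two-line check with NumPy (the sample values are arbitrary):

    import numpy as np

    y = np.array([27.0, 33.0, 31.0, 25.0, 36.0])
    n = len(y)

    biased = y.var()          # Eq. 3.21, divides by N
    unbiased = y.var(ddof=1)  # corrected variance
    print(np.isclose(unbiased, n / (n - 1) * biased))   # True (Eq. 3.22)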

3.2 Sensory Substitution

AuvioLab’s constructs a device that enable the user to substitute the vision senses forhearing. To understand the choice made while designing the product, it is importantto know what sensory substitution involves.As previously said, the world health organisation states that blindness is predomi-nantly caused by diseases affecting the eyes and the optic nerve. The consequenceof this is that the visual cortex stops receiving stimuli and becomes therefore un-used. However, as the Neuroscience Biobehavioral Reviews, 2014, states [Poi07],the brain has plasticity properties and can modify its structure to face changes.Indeed, it is possible to reactivate the visual cortex by replacing the visual stimuluswith a correlated stimulus. This stimulus should be captured with the appropriatesense, transmitted to the brain part linked to this sense and then transmitted tothe visual cortex. Bach-y-Rita [yR20] observed this process through fMRI analysis.This suggests that, under the condition that an appropriate substituted signal isconstructed, the blind person can regain the perception of seeing.To make a visual sensory substitution possible, a stimulus containing some visualinformation has to be construct. A balance has to be made between stimulus com-plexity and ease of interpretation.

3.3 Physics of Three-Dimensional Hearing

Although vision is the sense most used to capture spatial cues, hearing also enables this. Indeed, the sound reaching the eardrum is full of spatial cues the brain can interpret. These cues are of two main types: monaural and binaural. The basics of sound wave physics and of psychoacoustics are introduced in the following.


3.3.1 Sound

Sound is a mechanical vibration of a fluid that propagates as a longitudinal wave. This propagation mainly results in local perturbations of the air pressure and velocity. These local perturbations can be perceived at the eardrum by making it vibrate; this vibration is then transmitted to the brain, enabling humans to hear. Different properties of the sound signal are important for hearing. Among them is the acoustic pressure of the signal, which characterizes its loudness and is expressed in decibels SPL (Sound Pressure Level). This quantity is given by Equation 3.23, with P the root mean square pressure given by Equation 3.24 and P0 the reference pressure given by Equation 3.25.

20 log₁₀(P / P₀)   (3.23)

P = lim_{T→∞} √( (1 / (2T)) ∫_{−T}^{T} P²(t) dt )   (3.24)

P₀ = 2 · 10⁻⁵ N·m⁻²   (3.25)

The order of magnitudes of the sound decibel are listed below:

• Threshold of pain: 130 dB SPL

• Normal conversation: 40-60 dB SPL

• Very calm room: 20-30 dB SPL
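To connect Equations 3.23 to 3.25 with these orders of magnitude, the sketch below computes the SPL of a synthetic tone; the sampling rate and amplitude are illustrative assumptions, not measurement data.

    import numpy as np

    P0 = 2e-5   # reference pressure in N/m^2 (Eq. 3.25)

    def spl_db(pressure_signal):
        """Sound pressure level in dB SPL of a pressure signal given in N/m^2
        (Equations 3.23 and 3.24, with the RMS taken over the finite signal)."""
        p_rms = np.sqrt(np.mean(np.square(pressure_signal)))
        return 20.0 * np.log10(p_rms / P0)

    # A 1 kHz tone with an RMS pressure of 0.02 N/m^2 is about 60 dB SPL,
    # i.e. in the range of a normal conversation.
    t = np.linspace(0.0, 1.0, 48000, endpoint=False)
    tone = 0.02 * np.sqrt(2.0) * np.sin(2.0 * np.pi * 1000.0 * t)
    print(round(spl_db(tone)))   # -> 60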

As sound is a mechanical vibration, it propagates with a certain speed and undergoes reflections and refractions when striking objects. These transformations are important for localizing sound sources. The speed of sound depends on the fluid and is equal to 343.2 meters per second for dry air at 20°C at atmospheric pressure. Reflection and refraction are the phenomena occurring when the sound strikes an interface between two media. The incident ray splits into a refracted and a reflected ray. The reflected ray stays in the same medium whereas the refracted ray propagates into the other medium. A change of direction occurs between the incident ray and the two other rays. If the reflection surface is very smooth, this change of direction follows the law of reflection, which states that the (non-oriented) reflected angle θr is equal to the incident angle θi. These angles are defined in Figure 3.6. Diffraction is an interference phenomenon occurring when the wave encounters an object whose size is comparable to the wavelength. This phenomenon is described by the Huygens-Fresnel principle, and Figure 3.7 represents the transformations undergone by a plane wave striking an obstacle. Regarding the sound energy, it decreases as the sound travels. Indeed, under normal conditions, sound can be considered to propagate omnidirectionally, causing an energy decay proportional to 1/r², r being the radial distance in the spherical coordinate system centered on the sound source.


Figure 3.6: Wave going from P to Q and undergoing a reflection in O. θi is the incident angle, θr is the reflection angle. [HRT]

Figure 3.7: Diffraction illustration. Left: sound going through a hole. Right: sound striking an object. [HRT]


Moreover, as the sound splits into a reflected and a refracted wave when striking an object, the reflected wave contains less energy than the incident wave. This can be modeled by an object-specific absorption coefficient.

3.3.2 Psychoacoustics

Psychoacoustics studies how sounds reaching the ears are interpreted by the brain. Three-dimensional hearing is at the heart of this science. This human ability is based on different cues, categorized into binaural and monaural cues.

Binaural localization

Some cues helping the brain to localize sounds are based on right-left differences in the signal. The phase shift between the sound at the right and left ear is called the Interaural Time Difference (ITD). Figure 3.8 illustrates the fact that, depending on the position of a sound source, the signal travelling to the right ear does not take the same path as the signal travelling to the left ear. This path can be longer, shorter, or of the same length; consequently, one of the signals lags behind the other, which explains the phase shift. The ITD therefore makes it possible to locate the azimuth of a sound source, that is, its lateral position. The brain is very receptive to the ITD when the spectrum of the sound is restricted to the 50 Hz - 1300 Hz range; beyond this range, the ITD is no longer detectable. To give an idea of the order of magnitude of the ITD, this time difference is equal to zero when the source is in front and goes up to about 800 µs when the source is at a lateral position. If the sound source is made up of higher frequencies, the brain relies on another cue to estimate the azimuth: the Interaural Level Difference (ILD). As Figure 3.8 already showed, the paths taken by the left-ear and right-ear signals are not the same. Therefore, the attenuation undergone by these signals is not the same either; one signal is more attenuated than the other, and this difference is called the ILD. The ILD reaches approximately twenty dB at extreme lateral positions, as Figure 3.9 shows.
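The ITD magnitudes quoted above can be reproduced with the classical spherical-head (Woodworth) approximation; this model is not used in the thesis, and the head radius below is an assumed average value.

    import numpy as np

    HEAD_RADIUS = 0.0875     # assumed average head radius in metres
    SPEED_OF_SOUND = 343.2   # m/s, dry air at 20 degrees Celsius

    def itd_woodworth(azimuth_deg):
        """ITD in seconds for a far-field source at the given azimuth (0 = front),
        using the spherical-head approximation ITD = (a / c) * (theta + sin(theta))."""
        theta = np.radians(azimuth_deg)
        return (HEAD_RADIUS / SPEED_OF_SOUND) * (theta + np.sin(theta))

    for az in (0, 30, 60, 90):
        print(az, round(itd_woodworth(az) * 1e6), "microseconds")   # 0, ~261, ~488, ~655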

Monaural localization

The brain is also able to perform a spectral decomposition of the sound in order to extract additional spatial cues. Indeed, it can capture the spectral distortion of a known signal and localize it according to this distortion. This is called monaural localization, as only one ear is needed to achieve it. Monaural localization is mainly used in two cases: to distinguish sounds located in front from those behind, and to estimate the elevation of a sound. Indeed, a sound located at a certain position and the coronally symmetric sound produce exactly the same ITD and ILD. It is therefore impossible for a listener to say whether a sound is in front of or behind him based on these two cues.


Figure 3.8: Binaural hearing [Xie13]

Figure 3.9: ILD with respect to the sound location and the frequency. The 0° location corresponds to the intersection of the median and transversal planes. [UoW]


Moreover, as the elevation of a sound changes (elevation being defined in the interaural coordinate system), the ITD and ILD hardly change. Figure 3.10 describes the cone of confusion: any slice circumference of this cone corresponds to all the possible elevations for one single azimuth. To distinguish two sounds on the same slice of a cone of confusion, monaural localization is therefore needed. As monaural localization uses spectral analysis, the displayed signal has to be broadband, meaning that it has to contain all the audible frequencies, so as to enable the best localization performance.

Figure 3.11 summarizes the important spatial cues contained in sound.

Figure 3.10: Cone of confusion. [UoW]

Cue        Frequency range    Related to
ITD        50 Hz - 1300 Hz    Azimuth
ILD        1 kHz - 20 kHz     Azimuth
Spectrum   all                Azimuth & Elevation

Figure 3.11: Spatial Cues in Sound

3.3.3 Human Performance in Sound Localization

With vision, the brain is able to construct a very accurate spatial image of the environment. Indeed, the images captured by the retina enable the brain to locate important components with great precision.


Thanks to this process, sighted people are able to grab objects, avoid obstacles and even achieve very complex tasks requiring high precision, like playing tennis or football. Unfortunately, many issues arise when hearing is used to perform those tasks.

Error types

Using the hearing sense for localization tasks causes different kinds of errors, which can be categorized into three main types. The first type is called localization error and refers to the accuracy. It corresponds to the occurrence of a bias between the real and the estimated position of a sound source. In the free field, this bias is lowest in the median plane at eye level and highest for lateral sounds, according to Blauert [Bla96]. The second type of error is the localization blur, also called response variation or precision. It corresponds to perceptual noise and is strongly related to the minimum audible angle introduced in the next section. Finally, the last error category is front-back and up-down confusion. This situation is illustrated in Figure 3.12, which shows that in some cases the listener believes a sound source in front of him is behind, and vice versa. Wenzel et al., 1993, reported that the front-back error was up to 19% in the free field [EMWW93]. They also reported that this error was not symmetric; in most cases, sounds in front were believed to be behind. They explained this phenomenon with the ventriloquism effect. Indeed, vision and hearing are highly linked with one another, and the brain always tries to give a coherent interpretation of the environment. Therefore, if a sound source is heard but nothing can be seen, the brain manipulates reality and makes the sound come from another direction. Consequently, as no sound source is to be seen during localization tasks, the brain interprets this as if the sound source were located behind. Fortunately, this is not a major issue for the created device, as the field of view is restricted to the frontal area.

Figure 3.12: Front / Back ambiguity


Minimum Audible Angle

In order to estimate the best device-achievable performance, it is important to know what precision can be obtained by using only the hearing sense to localize objects. Mills, 1972, measured the Minimum Audible Angle (MAA), which corresponds to the minimum angle from which it is possible to discriminate two sound sources [Mil72]. Figure 3.13 shows this MAA depending on the frequency and the azimuth.

Figure 3.13: Minimum Audible Angle for different azimuth positions. The vertical axis represents the MAA, the horizontal axis the frequency. Each curve corresponds to a given azimuth. From Mills, 1972 [Mil72].

Figure 3.13 shows that the minimum audible angle is approximately one degree for low frequencies, i.e. the brain is able to distinguish two sound sources one degree apart for an azimuth around 0°. However, these results also show that the MAA degrades with increasing azimuth: at lateral positions, the MAA is approximately 7°. Furthermore, for pure tones, the MAA also degrades for frequencies between 1050 Hz and 2000 Hz. Therefore, a precision below 1° in front and 7° at lateral positions is not likely to be achievable for sound localization. Moreover, the MAA is measured by having a person detect changes in a sound while its position varies; it does not require the listener to localize the sound. Consequently, additional error is to be expected for localization tasks requiring the subject to identify the actual location of the sound source [Mid90].

Localization Performances

Localizing an unknown speaker by hearing alone was done with a localization blur of 17° in Blauert's experiments [Bla96].


However, still according to Blauert, when the listener knew the speaker, this blur went down to 9°. Finally, Blauert stated that if white noise was to be localized, this blur decreased to 4°. Focusing on the azimuth precision, Middlebrooks reported it to be approximately 2.5° for white noise at an azimuth and elevation of zero [Mid90]. He also reported that this value increased as the sound moved from the center to the edges; a value of approximately 9.5° was found for an azimuth of 100°.

Figure 3.14: Precision and MAA comparison. The circles represent the standard deviation of the estimations in the horizontal plane. Each value is the mean across six subjects and two azimuths (right and left). The triangles represent the Minimum Audible Angle. The vertical lines are the error bars. [Mid90]

Experiments revealed that precision was poorer for elevation estimations [WK89a, EMWW93, Mid90]. Middlebrooks found this value to be 8° on average. Furthermore, elevation localization relies heavily on spectral analysis, and the mentioned performance can be severely degraded if the signal is not well chosen.

3.4 Virtual Auditory Display

It is possible to localize real sound sources with a certain accuracy and precision [Bla96]. The goal of AuvioLab is to create virtual sound sources through headphones and to have the listener estimate their positions. The following section explains how virtual three-dimensional sounds can be created through headphones.

3.4.1 Head-Related Transfer Function

As a sound travels from the source to the ear, it undergoes diffractions and reflections before reaching the eardrum. Those diffractions and reflections are due to the spatial environment of the listener, but also to the listener's own body.


Indeed, the head, the torso and the ears modify the original signal. Being able to reconstruct these modifications is important, as sound localization is mainly based on them. Therefore, these body-specific transformations are measured through the Head-Related Impulse Response (HRIR) for a set of sound positions, as illustrated in Figure 3.15. The HRIR then enables the reproduction of the transformations undergone by a sound going from a given source to the eardrum of a particular listener.

Figure 3.15: Head-related responses hL and hR for a given sound-source location. X(t) is the sound signal at the sound source. XL(t) and XR(t) are the sounds at the listener's left and right eardrums respectively. They are given by the convolution XL/R(t) = (hL/R ∗ X)(t). [HRT]

To measure the HRIR, it is necessary to place a microphone at each ear-canal entrance of the listener. Then, a sound must be played from different positions around the listener and the responses recorded. The achievable spatial resolution directly depends on the number of positions played. Signal-processing techniques then enable the reconstruction of the HRIR. Figure 3.16 illustrates such an HRIR measurement. The Head-Related Transfer Function (HRTF) is simply defined as the spectrum of the HRIR. Filtering a sound with an HRTF linked to a given position modifies this sound as if it had traveled from that position to the ear-canal entrance of the HRTF owner. Unfortunately, measuring the HRIR is either time consuming or expensive. In most cases, this procedure takes between 40 minutes and several hours. The cost is explained by the facilities and equipment needed to perform these measurements; an anechoic chamber and a setup of high-fidelity loudspeakers and microphones are required.


Figure 3.16: Left: Microphone placed at the ear-canal entrance of a subject. Right: HRIR measurement of a dummy head. The dummy head is placed in an anechoic chamber and sounds are played through the surrounding loudspeakers to measure the HRIR. [oS09]

3.4.2 Soundscape Creation

Once the individual HRTF is measured, it is possible to place virtual sound sources in three-dimensional space. Firstly, the sound-source position has to be chosen. Once this is done, the corresponding right-ear and left-ear HRIRs are selected and the input signal is convolved with them, which corresponds to an HRTF filtering of the signal. Then the two sound channels are transmitted to the two ears. Figure 3.17 illustrates this process.

As almost all HRTFs are measured at a fixed radial distance from the listener, the distance of the object is not encoded in the signal. Therefore, to add this feature, a level scaling of the signal has to be performed to simulate the sound attenuation. In the near field, near-field transfer functions can also be used to give a distance impression.

Another limitation of this process is that the HRTFs are only available for a discretized set of directions. If precise sound positioning is needed, a way to overcome this problem is to interpolate the HRTFs; several studies have already investigated efficient ways to do so [Ajd05, FPFD02]. The device also needs to represent multiple sound sources, which is very convenient with this process: the convolver shown in Figure 3.17 is simply applied to each sound source and, the environment being linear and time-invariant, summing the resulting right and left output signals superposes the sounds. In the case of a large number of sound sources, the principal components of the HRTFs can be extracted through principal component analysis, thus reducing the complexity of the signal processing.
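A minimal sketch of this convolver for several sources is given below. It is not AuvioLab's implementation; the HRIR data and the crude 1/distance gain are placeholders (real HRIRs would come from a measured set such as the CIPIC database discussed in Chapter 4).

    import numpy as np

    def render_source(stimulus, hrir_left, hrir_right, distance=1.0):
        """Render one virtual source: convolve the stimulus with the left and
        right HRIRs selected for its direction, then scale with 1/distance."""
        gain = 1.0 / max(distance, 0.1)          # crude distance attenuation
        left = gain * np.convolve(stimulus, hrir_left)
        right = gain * np.convolve(stimulus, hrir_right)
        return np.stack([left, right])           # shape: (2, samples)

    def render_soundscape(sources):
        """Superpose several sources (the environment is linear and time-invariant)."""
        rendered = [render_source(*src) for src in sources]
        n = max(r.shape[1] for r in rendered)
        out = np.zeros((2, n))
        for r in rendered:
            out[:, :r.shape[1]] += r
        return out

    # Example with a placeholder noise stimulus and random "HRIRs".
    noise = np.random.randn(4800)
    hl, hr = np.random.randn(200), np.random.randn(200)
    binaural = render_soundscape([(noise, hl, hr, 1.5), (noise, hr, hl, 3.0)])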

Finally, to enhance the sound experience, other features can also be added to this process.


Figure 3.17: Process for three-dimensional sound creation (convolver). The graph must be read from the bottom to the top. Firstly, a sound signal, called the stimulus, must be selected. Then the left and right HRIRs must be selected with respect to the position where the sound is to be placed. Finally, the stimulus is duplicated and each copy is convolved with one of the selected HRIRs. Each of the two resulting signals is then played separately to the right and left ear. [Int]


Firstly, Begault and Wenzel [BW01] showed that adding a head tracker improves sound localization. With head tracking, the convolver changes with the head position, which brings additional cues to the listener. Moreover, adding echoes to the signal by including the room impulse response in the process was shown to be useful for externalizing the sound (placing the sound outside of the head) [BW01]. These possibilities are discussed in Chapter 5.

3.4.3 Headphone Equalization

Complete control of the sound produced at the listener's ear-canal entrance is highly important when producing virtual three-dimensional sounds. Headphones allow this under certain conditions.

Headphone Reproduction Issues

Unlike natural sounds, sounds produced through headphones originate very near the ears. If this particular situation is not taken into account, the sound produced is perceived as being inside the head and therefore unnatural. To avoid this, headphone manufacturers have created two main standard corrections: free-field and diffuse-field equalization. These headphone-embedded sound-processing techniques had to be considered and the input signal adapted to them. The following section explains how to do so and describes the method used by Bosun Xie in his book dedicated to head-related transfer functions [Xie13].

Free Field and Diffuse Field Equalization

The goal of free-field and diffuse-field equalization is to avoid the spectral distortion due to the coupling of ear and headphones and to reproduce the coloration that the sound would have if it were played through a loudspeaker placed far from the user. This is of capital importance in order to externalize the sound. The first standard that appeared was the free-field equalization, which corresponds to the filter described in Equation 3.26, where H(θ0, φ0, f) is an averaged HRTF corresponding to the position (θ0, φ0) and Hp(f) is the headphone-to-ear-canal-entrance transfer function. This equalization gives the impression that the sound is located at a given position; in most cases the front direction is chosen: θ0 = 0 and φ0 = 0. This equalization method is therefore adapted to give the impression that the sound comes from a specific location.

F_{\text{free}}(f) = \frac{H(\theta_0, \phi_0, f)}{H_p(f)} \qquad (3.26)

Another correction standard exists: the diffuse-field equalization. This equalization gives the impression that the sound comes from all directions.


The diffuse-field equalization is done by applying the filter given by Equation 3.27.

F_{\text{diffuse}}(f) = \frac{\sqrt{\frac{1}{M}\sum_i |H(\theta_i, \phi_i, f)|^2}}{H_p(f)} \qquad (3.27)

In this work, the signal is already colored with selected HRTFs before being transmitted to the headphones. Therefore, the headphone-embedded signal processing corresponding to the sound externalization has to be compensated, whereas the ear-canal correction is still needed. The Sennheiser HD 650 headphones used here are diffuse-field equalized, so the filter described in Equation 3.28 has to be applied to the signal.

F(f) = \frac{1}{\sqrt{\frac{1}{M}\sum_i |H(\theta_i, \phi_i, f)|^2}} \qquad (3.28)

3.5 Conclusion

This chapter introduced all the necessary background to understand the creation and the testing of three-dimensional soundscapes. In this work, spherical and interaural coordinate systems were used to normalize the listener's spatial environment. Accuracy, precision and adaptivity of the device were measured with the mentioned statistics, and the introduced signal processing tools were used to create the three-dimensional soundscapes. The physical theory behind 3D hearing was necessary to create and improve these soundscapes.


Chapter 4

Soundscape Individualization

The mobility assistive device under development has specific requirements for the three-dimensional soundscape. Indeed, the latter is expected to:

• Maximize the elevation and azimuth accuracy and precision

• Be adaptable to a large number of users

• Minimize the setup time

This chapter explains how the device-specific soundscape was created with respect to these criteria.

4.1 Individualization Review

Producing three-dimensional sounds requires filtering the signal with Head-Related Transfer Functions (HRTFs). HRTFs are user dependent, as each person perceives sound differently. Unfortunately, individual HRTFs are difficult to obtain, as they require expensive and time-consuming measurements. To overcome this issue, an individualization of the HRTF is needed. Individualization is the choice or creation of a particular HRTF according to the listener. However, the relationship between HRTFs and human morphology is still unclear. Nevertheless, different strategies exist to choose the best HRTF for a given person, and the following sections explain them.

4.1.1 Generic HRTF

A naive solution is to use a non-individualized HRTF, that is, a generic HRTF. Wightman et al. used one particular HRTF to create all their soundscapes [EMWW93]. They showed that the use of a non-individualized HRTF degraded the user's ability to localize sound sources. Their results showed that the localization blur increased by approximately 30% when the perfect individual HRTF was not used.


It is obvious that this loss of performance cannot be totally avoided if the listener's HRTF is not used. However, other individualization processes make it possible to minimize this loss. It is also possible to use a manikin HRTF to create virtual soundscapes, like that of the KEMAR manikin or the B&K manikin [PP14, JGC04]. However, these methods seemed to degrade the performance even further. Indeed, Gupta et al., 2004, found that the use of a B&K manikin increased the unsigned error by 34% compared to the unsigned error obtained with the individual HRTFs [JGC04].

4.1.2 Modeling

In order to take into account the head and ear geometry of the user, it is possible to model the HRTF. Meshram et al. implemented an HRTF creation algorithm using ear and head pictures. To do so, they used a mesh acquisition technique to obtain an accurate three-dimensional image of the ear. This required a high-resolution camera and powerful geometry software. Once the mesh surface was acquired, an adaptive rectangular decomposition was performed to compute the scattering of the sound waves due to the surface. This was done by solving the wave equation with given boundary conditions [Mes00]. Unfortunately, no quantitative comparison with other individualization techniques was conducted.

The HAT model is another method to individualize an HRTF [Zot00]. Instead of computing the entire HRTF, this algorithm only personalizes existing HRTFs. For this method, only the torso and head parameters are considered, but not the ear parameters. To obtain a personalized HRTF, an algorithm is used to reconstruct all the propagation paths of the sound going from the source to one eardrum. This algorithm considers the torso and head shadowing effect, that is, the effect occurring on the sound path when the torso or the head is between the source and the eardrum. This algorithm also considers the reverberation of the sound on the torso. The decrease of the unsigned error obtained with this method was less than 1% compared to the unsigned error obtained with a generic HRTF.

Gupta et al., 2004, created an HRTF modeling method based on the shape and size of the outer ear [JGC04]. They looked for empirical sound equations in order to describe the reflections and resonances undergone by the sound reaching the eardrum. The global mean unsigned error obtained with the modeled HRIR was 5% lower than the 31.4° global unsigned error they obtained using a B&K manikin HRTF. However, as said previously, the results obtained with the B&K manikin are already very poor.

Unfortunately, these solutions do not match the above-mentioned needs. Indeed, the first computation method requires a complex and long setup, which contradicts the need for a quick setting of the soundscape. Moreover, the performance one could obtain with these methods is still being investigated and it is not certain that it is better than the performance obtained with generic HRTFs. Regarding the HAT model, it requires an existing HRTF as a starting point. Using a generic HRTF is a possible choice, but as Zotkin et al. [Zot00] showed, the performance can be increased by


choosing the initial HRTF more carefully. Therefore, the HAT model does not solve the problem of the initial HRTF choice.

4.1.3 Subjective Selection

It is also possible to use a database containing different HRTFs and to select one with respect to some criteria. Several selection processes exist. Seeber, 2003, created the subjective selection [SF03]. In this process, the user simply listens to sounds produced with different HRTFs and chooses the one giving the best spatial impression with respect to externalization and accuracy. Seeber's results showed that an optimization step occurred while selecting the HRTF: the users mainly chose HRTFs that increased their precision performance. This individualization procedure has the advantage of being fast, as the selection lasted about 10 minutes for a database containing 12 HRTFs. A degradation of 24% of the azimuth unsigned error compared to the azimuth unsigned error obtained with individualized HRTFs was observed. The Determination method of OptimuM Impulse-response by Sound Orientation (DOMISO) is similar to the subjective selection process [Iwa06]. The main difference is the way the user tested the different HRTFs: in the subjective selection, the user chose which HRTF to test, whereas in the DOMISO method this choice was made by an algorithm. Compared to non-individualized HRTFs, Iwaya obtained better results with DOMISO-individualized HRTFs with respect to front-back confusions. However, no improvement could be found regarding the accuracy. Paukner, Rothbucher and Diepold [AK14] conducted an experiment comparing the DOMISO results with results obtained using a KEMAR HRTF. They found an improvement of 5.6% and 7.5% with respect to the azimuth and elevation unsigned error respectively. However, for evident reasons, these methods cannot be used for large databases.

4.1.4 Anthropometric Matching Method

For large databases, another process exists: the anthropometric matching method. It consists of choosing an HRTF by comparing the anthropometric parameters of the subject with those of the real owner of the HRTF. The process relies on the fact that a high correlation between ears and HRTFs exists, as Middlebrooks found in 1999 [Mid99]. Unfortunately, as the CIPIC Interface Laboratory states, no specification exists for a general and sufficient set of well-defined and relevant anthropometric measurements for selecting the best subject-dependent HRTF. Indeed, the exact influence of the pinna, head and torso geometry on the HRTF is still unknown [Int]. Therefore, the choice of the anthropometric parameters to consider for such a selection is difficult to make. However, some studies, like the one led by Zhang et al. [MZA11], tried to address this problem by performing statistical analysis on databases giving access to sufficient information. Indeed, some databases like the CIPIC database give access to a large amount of data and metadata, so that


one can try to choose the relevant anthropometric parameters by analyzing the correlation between the HRTFs and these parameters. The CIPIC database contains 37 anthropometric measurements, shown in Figure 4.1, for 27 different subjects and 1250 directions. Therefore, it gives the necessary support to attempt to extract the most important anthropometric parameters.

Figure 4.1: Pinna parameters available in the CIPIC database. 9 different parameters were available, both for the right and left ears. [Int]

A selection based on the pinna parameters was carried out by Zotkin et al. [Zot00]. In this study, all the pinna parameters were compared to those of the HRTF owners, and the HRTF that minimized the difference was selected. This method showed a 5% improvement of the results, with respect to the unsigned error, compared to those obtained with a B&K-manikin HRTF. A hypothesis to explain this poor improvement is that the pinna parameters are very sensitive, and small changes in them can cause large changes in the HRTF. The poor improvement can also be explained by the fact that all the pinna parameters available in the CIPIC database were used and considered to be of the same importance.

4.2 Proposed Individualization

Figure 4.3 summarizes the existing individualization methods and the achievable performance they enable. In order to optimize the achievable performance without measuring individual HRTFs, an HRTF-individualization process was needed. Averaging and modeling methods were discarded as they either do not minimize the soundscape setting time or do not maximize the localization performance. Therefore, the individualization was based on HRTF selection. An HRTF database was


Figure 4.2: Head and torso parameters available in the CIPIC database. 17 parameters were registered in the CIPIC database for the head and torso. [Int]

Individualization           | Unsigned error           | Compared to...
Manikin HRTF                | 34% degradation          | individual HRTF
Generic HRTF                | 30% degradation          | individual HRTF
Model                       | ≤ 1% improvement         | manikin HRTF
Subjective Selection        | 24% degradation          | individual HRTF
DOMISO                      | 5.6% & 7.5% improvement  | manikin HRTF
Anthropometric Selection    | 5% improvement           | manikin HRTF

Figure 4.3: Different individualization methods. The percentages in the table were computed between the unsigned error obtained with the individualization process and the unsigned error obtained with the reference process given in the last column.


needed to perform this selection. The choice of this database was made according to the following criteria: it should contain anthropometric information about the different HRTF owners, contain a sufficient number of HRTFs and be usable for commercial purposes. The CIPIC database is at the moment the biggest database that fulfills these three criteria. It was therefore chosen.

4.2.1 Hybrid Selection

The choice of the CIPIC database required the use of the DOMISO, the subjective or the anthropometric selection. The anthropometric selection has the main advantage of being very fast and of enabling a selection within large HRTF databases, which is the case of the CIPIC database. On the other hand, the subjective selection shows significant results concerning the performance optimization. The DOMISO selection was however discarded, as it mainly improved the front-back confusions but had a lower effect on precision and accuracy. Therefore, in order to combine the advantages provided by the anthropometric selection with those provided by the subjective selection, a hybrid selection was designed. The first step of the hybrid selection was to select a reduced set of HRTFs. This preselection was based on anthropometric comparison and accelerated the selection process. The second step was the subjective selection applied to this reduced set of HRTFs.

Preselection

As Zotkin’s results state [Zot00], selecting an HRTF based on anthropometric se-lection is not likely to significantly improve the localization performances if theanthropometric parameters are not selected carefully. Therefore, the choice of theanthropometric parameters used to perform this selection relied on a statistical anal-ysis performed by Zhang et al. [MZA11]. Through principal component analysis,Zhang et al. could reduce the 200 HRIR samples to ten components containing 95%of the variance. Moreover, by considering the correlation between different anthro-pometric parameters, they were able to find the parameters having the strongestcorrelation with the HRTFs. According to this study, pinna height (d5), pinnawidth (d6), cavum concha width (d3) and fossa height (d4) were the most HRTF-correlated parameters (Figure 4.1). Thanks this preselection, HRTFs belonging topersons having very different ears compared to the user of the device could be dis-carded.The four most important pinna parameters were measured and five HRTFswere extracted out of the CIPIC database. These five HRTFs were selected throughthe optimization problem 4.1.

\min_j \sum_{i=1}^{4} \frac{|d_i - d_{i,j}|}{d_{i,j}} \qquad (4.1)

where d_i corresponds to the i-th pinna parameter of the user and d_{i,j} corresponds to the i-th pinna parameter related to the j-th transfer function.


The goal of this minimization problem was to find the ear closest to the subject's ear with respect to the relative-error cost function 4.1.
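A sketch of this preselection step is given below, assuming the four pinna parameters are available as plain arrays; the function name preselect_hrtfs and the argument layout are illustrative only.

    import numpy as np

    def preselect_hrtfs(user_params, database_params, n_keep=5):
        # user_params: the four measured pinna parameters (d5, d6, d3, d4) of the new user.
        # database_params: shape (N, 4), the same parameters for the N database subjects.
        user = np.asarray(user_params, dtype=float)
        db = np.asarray(database_params, dtype=float)
        cost = np.sum(np.abs(user - db) / db, axis=1)  # relative-error cost of Equation 4.1
        return np.argsort(cost)[:n_keep]               # indices of the closest ears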

Final Selection

After the preselection, a reduced set of HRTFs was available. To select one HRTF out of this set, the subjective selection was used [SF03]. The user selected the final HRTF by listening to the sounds they produced and by choosing the one that produced the best spatial impression. More precisely, a soundscape was created for each of these HRTFs. This soundscape represented a sound source moving from the right to the left. The user could listen to the five soundscapes as many times as wanted. He was allowed to take notes and to choose the order of display, so that back-to-back comparisons could be performed. The user was asked to focus on three criteria while comparing the different soundscapes. The first criterion was the range of display: the sound source was displayed in the frontal area, between -45° and +45°, and it should therefore be heard in this range. If a soundscape gave the impression that the sound source was more to the sides, it should be discarded. The second criterion was the accuracy criterion. The sound source moved step by step from the right to the left, with equal step size. Therefore, the subject was asked to focus on this step size: if he had the impression that the sound source sped up or slowed down, he was asked to discard this soundscape. Finally, the subject was also asked to focus on whether he thought the sound source was located in front of or behind him. If the sound source was felt to be behind, the soundscape was discarded. To allow the user to compare these different soundscapes easily, a Matlab GUI was developed; it is shown in Figure 4.4. By clicking on five different buttons, it was possible to display the different soundscapes. Each soundscape lasted about 10 seconds.

In the following, the other parameters chosen to construct a relevant soundscape for the device are presented.

4.3 Soundscape Parameters

In order to create the soundscape, other parameters had to be set. Indeed, the choice of the sound-source type had to be made carefully and the adaptation of the soundscape to the headphones had to be considered.

4.3.1 Stimulus

Band-Limited Signals

Adelbert W. Bronkhorst, 1995, conducted a study to find out the influence of the spectral shape of a sound on the localization performance [Bro95]. To do so, he displayed a broadband signal, white noise, and filtered it with different low-pass filters. The results he obtained are summarized in Figure 4.5.


Figure 4.4: Designed GUI. The user can select a subject; the subject number is linked to a preselected HRTF. This linked HRTF is used to create a soundscape, which is then displayed. The text to the left and the procedure itself are from Seeber, 2003 [SF03].

Figure 4.5: Unsigned error depending on the cutoff frequency of the low-pass filter applied to a broadband noise under different conditions. Circle: global unsigned error. Diamond: vertical unsigned error. Closed symbol: real sound. Open symbol: virtual sound. [Bro95]


The vertical axis is related to the localization blur and the horizontal axis represents the cutoff frequency of the low-pass filter. Each curve corresponds to a different experimental condition, but an average trend can be seen: the higher the cutoff frequency of the filter, the better the precision of the localization. Therefore, in order to achieve the best possible precision, the stimulus was chosen to be broadband.

Pink Noise

Among broadband signals, different choices were possible. The simplest broadband stimulus is white noise, which consists of a sequence of uniformly distributed random samples in the discrete domain. Unfortunately, the main drawback of white noise is that it produces a sound that is unnatural and unpleasant to hear. This is a major issue, as the device is meant to be used for long periods of time by blind people. Therefore, it was important to produce a soundscape that was pleasant to hear, or at least not annoying. In contrast to white noise, pink noise is a much more natural noise, as it occurs in many physical and biological systems [Han93]. Indeed, pink noise can be found in the fluctuation of tides and river heights or in the heartbeat. Voss and Clarke showed that the pitch and loudness fluctuations in speech and music are also pink noises [VC75]. Therefore, pink noise seemed more appropriate to the needs of the device, as this signal is more natural and more pleasant to hear than white noise. Pink noise was thus chosen to construct the three-dimensional soundscapes.
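One common way to generate pink noise is to spectrally shape white noise so that its power falls off as 1/f. The sketch below uses this approach; it is only an illustration, as the thesis does not specify how its pink noise was generated.

    import numpy as np

    def pink_noise(n_samples, rng=None):
        # Shape white Gaussian noise so that its power spectrum decays as 1/f.
        rng = np.random.default_rng() if rng is None else rng
        spectrum = np.fft.rfft(rng.standard_normal(n_samples))
        freqs = np.fft.rfftfreq(n_samples)
        freqs[0] = freqs[1]                      # avoid a division by zero at DC
        spectrum /= np.sqrt(freqs)               # amplitude ~ 1/sqrt(f), i.e. power ~ 1/f
        pink = np.fft.irfft(spectrum, n_samples)
        return pink / np.max(np.abs(pink))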

Modulation

As seen in the second chapter, the soundscape was made to represent obstacles through events. Therefore, the pink noise was modulated with a Gaussian function, plotted in Figure 4.6.

The pulse duration was a difficult setting to establish, as two conflicting constraints had to be fulfilled. The first constraint was the presence criterion, that is, the constraint of having a consistent spatial immersion for the user. This criterion was important in order to tie the created soundscape to reality. According to Michael Abrash, chief scientist of Oculus, one of the leaders in the virtual reality field, the main constraint to be fulfilled in order to satisfy this presence criterion is the motion-to-photon latency, that is, the time between a position change of the user and the corresponding change in the display [Abr15]. Michael Abrash states that this time has to stay below 25 milliseconds to remain undetected by the user. An increase of this time causes the user to have a distorted perception of reality, with latencies appearing while moving the head or traveling. In order to fulfill this motion-to-photon latency criterion, which translates into a motion-to-audio latency criterion in this case, the update period of the device should be under 25 milliseconds and the pulses should therefore be shorter than that. Unfortunately, another constraint requires the pulses to last longer: the spectral-analysis constraint. Indeed, to estimate the position of a sound, the brain performs a spectral analysis of the signal, and this process is time consuming.


Therefore, it is not possible to reduce the length of the pulses without decreasing the localization performance, especially for the elevation. Different pulse lengths can be found in the literature: for elevation-localization purposes, the pulse duration is chosen between 100 milliseconds and 250 milliseconds [WK89a, Mid90, EMWW93]. Therefore, a pulse duration of 100 milliseconds was chosen. This was the best possible trade-off between the motion-to-audio latency constraint and the spectral-analysis constraint. A duration of 150 milliseconds of silence between two bursts was chosen arbitrarily, as shown in Figure 4.6.

Figure 4.6: Plot of the modulation. The horizontal axis represents the time, the vertical axis the amplitude of the signal. The pulse duration is equal to 100 milliseconds. The time between two pulses is equal to 150 milliseconds.
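A sketch of the modulation described above is given below. The thesis specifies the 100 ms pulse duration and the 150 ms gap but not the width of the Gaussian window, so the window width used here is an assumption.

    import numpy as np

    def pulse_train(signal, fs=44100, pulse_ms=100, gap_ms=150):
        # Build one Gaussian pulse followed by silence, then tile it over the signal.
        pulse_len = int(fs * pulse_ms / 1000)
        gap_len = int(fs * gap_ms / 1000)
        t = np.arange(pulse_len)
        window = np.exp(-0.5 * ((t - pulse_len / 2) / (pulse_len / 6)) ** 2)  # assumed width
        period = np.concatenate([window, np.zeros(gap_len)])
        repeats = int(np.ceil(len(signal) / len(period)))
        envelope = np.tile(period, repeats)[:len(signal)]
        return signal * envelope

Applied to the pink noise of the previous sketch, this yields the burst train illustrated in Figure 4.6.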

4.3.2 The CIPIC Database

The CIPIC database was used in the experiment, as it offers a large set of HRTFs and an extensive set of additional data such as anthropometric parameters. The resolution of the contained HRTFs is equal to 5° and 5.625° for azimuth and elevation respectively in the needed range: [-30°, +30°] for azimuth and [-33.75°, +33.75°] for elevation. Moreover, it is important to note that the CIPIC data can be used for commercial purposes. Several points have to be considered before using these HRTFs, as the measurement protocol is not standardized. The first important point is that the sound was recorded at the entrance of a closed ear canal, that is, the ear canal was blocked during the measurements and the microphone was placed at the outer end of the canal. During the HRTF acquisition, each HRIR was measured for a duration of 4.5 ms at a sampling rate of 44100 Hz, which corresponds to 200 samples per HRIR.


After the acquisition, the free-field response was measured by removing the subject and placing a microphone in the center of the loudspeaker hoop. After measuring the free-field response, an equalization was applied to all the HRIRs in order to remove the loudspeaker and microphone distortion.
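For illustration, the sketch below shows how a single HRIR pair could be fetched from one subject of the public CIPIC release. The file name hrir_final.mat, the variable names hrir_l / hrir_r and the measurement grid reproduced here follow the public CIPIC documentation and should be treated as assumptions of this sketch.

    import numpy as np
    from scipy.io import loadmat

    # Nominal CIPIC grid: 25 azimuths x 50 elevations, 200-sample HRIRs at 44.1 kHz.
    CIPIC_AZIMUTHS = np.array([-80, -65, -55] + list(range(-45, 50, 5)) + [55, 65, 80])
    CIPIC_ELEVATIONS = -45.0 + 5.625 * np.arange(50)

    def load_cipic_hrir(mat_path, azimuth_deg, elevation_deg):
        # Return the left/right HRIR pair measured closest to the requested direction.
        data = loadmat(mat_path)
        ia = int(np.argmin(np.abs(CIPIC_AZIMUTHS - azimuth_deg)))
        ie = int(np.argmin(np.abs(CIPIC_ELEVATIONS - elevation_deg)))
        return data["hrir_l"][ia, ie, :], data["hrir_r"][ia, ie, :]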

4.4 Conclusion

The soundscape was designed under optimization constraints. The quantities to optimize were the duration of the soundscape individualization and the performance achievable with it. This optimization problem led to the construction of a hybrid HRTF selection to individualize the sound and to the use of a pink-noise stimulus modulated with pulses. To measure the efficiency of this individualization process, a localization experiment was performed. The following chapter describes this experiment and presents the obtained results.


Chapter 5

Sound Localization Experiment

5.1 Data Acquisition

5.1.1 Pointing Paradigm Review

In order to measure the performance achievable with the designed soundscape, a localization experiment was necessary. This localization experiment had to define how the subject performing the localization task would communicate spatial positions and how these communicated positions would be captured. No matter the communication paradigm, noise appears; however, the choice of paradigm is of capital importance to minimize this noise. In the following, the existing paradigms are reviewed and the proposed paradigm is then presented.

Straightforward Methods

The first method created, which also appears to be the most straightforward, is the absolute judgment paradigm. Used by Wightman et al., it simply consists of saying out loud the absolute position of the guesses [WK89a]. According to Wightman's experiments, this method gives consistent and stable results. Its main drawback is, however, the procedural training stage that has to be performed before the testing. Indeed, the subjects had to go through this stage in order to get used to this unintuitive way of reporting a spatial position. Wightman et al. used 10 hours of training before collecting the final data.

To make the pointing paradigm more intuitive, more sophisticated methods appeared. Haber et al., 1993, used a rotating dial and a drawing method to have the subject indicate the position of a sound source [Hab93].

Body Pointing Methods

To further enhance the ease of use of the pointing methods, body parts can be used. Bronkhorst, 1995, used a nose-pointing method [Bro95] and Haber et al., 1993, used a finger-pointing method [Hab93].


In these two methods, the subject listened to a sound and had to indicate the position of the target by pointing at it with the nose or the finger. Middlebrooks, 1990, used the nose-pointing paradigm and reported that the precision of this method was degraded for large angles, as the subject had to make larger movements and left the comfortable range of head motion for these angles. In addition, another drawback was that the subject had to wait for the signal to finish before moving; he therefore had to remember where the sound was in order to find it, which introduced additional error. Regarding hand-pointing methods, Pinek and Brouchon, 1992, found that a bias appeared because the user pointed with either the right or the left hand [PB92]. To counter this and improve the precision of the finger-pointing method, extensions of the body can be used, such as a hand-held cane, a short stick or a toy gun. Another issue occurring with these paradigms is the data acquisition. Indeed, recording precisely the location pointed at by a finger or a nose is difficult. Middlebrooks, 1990, used an electromagnetic device to measure the head position with an uncertainty of less than 0.5° [Mid90]. The sensor was held firmly on top of the subject's head. Pedersen and Jorgensen used a toy-gun pointing method in their experiment and recorded the position with the help of a tracking device [PJ02].

Laser Pointing Methods

The previous methods have in common that they do not use any visual feedback to specify the guessed spatial position. To explore the influence of visual feedback, another pointing paradigm was created: the laser-pointing paradigm. With this method, the user carries a laser pointer in the left or right hand and points in the intended direction with it. This method was used by Pedersen et al. and Paukner et al. [PP14, PJ02]. The biggest difference with the previous paradigms is that the user sees exactly where he is pointing. Moreover, in order to get rid of the asymmetry caused by holding the laser pointer in the right or the left hand, it is possible to have the laser-pointer position controlled by a motor. The Proprioception Decoupled Pointer (ProDePo) created and used by Seeber, 2003 [See03, SF03], implemented this. This method enabled the user to localize a sound by controlling a laser through a trackball. A representation is shown in Figure 5.1. Additional buttons were added to the trackball in order to record more information about the spatial impression given by the sound, such as the quality of the externalization or the front-back impression.

Virtual Pointing Methods

A last category of paradigms used to localize sound sources relies on virtual environments. Instead of pointing at a real position, the subjects specify a position in a virtual environment. For example, in the sound-localization experiment of Mendonca et al., 2012, the subjects localized sounds by pressing a touch screen. The Acoustics Research Institute in Vienna, 2010, also explored such possibilities [PML10].


Figure 5.1: Schematic of the ProDePo method used by Seeber [See03]. Lautsprecher stands for loudspeaker. Vorhang refers to the curtain that covered the loudspeakers. Versuchsperson is the experiment subject. The trackball was used to control the mirror represented behind the subject's head. The inclination of the mirror controlled the direction of the laser beam.


They assumed that a virtual interface giving cues on the pointed direction could considerably improve the precision of the method. To prove this, they used a head-mounted device displaying images to the user. The visual environment was constructed according to the head or finger position of the user. They concluded that, provided the users are sufficiently trained, both pointing with the head and pointing with the finger give reliable results with the help of the visual environment. However, they also showed that visual feedback is of capital importance, as the results were highly corrupted when no visual feedback was provided.

Paradigm Comparison

In order to compare several of the above-mentioned paradigms, an experiment was conducted by Haber et al., 1993 [Hab93]. It consisted of measuring the localization ability of people using 9 different methods. Among these methods, straightforward and body-part pointing methods were used. The body-part pointing methods gave the best performance regarding accuracy and variance, as Figure 5.2 shows, and comparable results were obtained with methods using an extension of the body, like a hand-held cane or a short stick. However, the absolute judgment paradigm and the dialing methods were not well suited. This suggests that the best accuracy and precision are obtained when the subject directly specifies the position in his real spatial environment rather than on a projection of it.

Figure 5.2: Accuracy of the listener for the azimuth, depending on the paradigm used. The mean unsigned error is represented on the vertical axis, the different pointing paradigms on the horizontal axis. Results obtained by Haber et al. [Hab93]

Finally, by comparing different paradigms, Majdak and Laback found that visual feedback was important in order to obtain accurate and stable localization performance [PML10].


This supports the use of methods such as localization through a virtual environment or laser pointing. Figure 5.3 summarizes this comparison.

Paradigm                   | Intuitive | Realizable | Requirement
Absolute Judgment          | +         | +++        | Training
Rotating dial and drawing  | +         | +++        | Training
Nose Pointing              | +++       | +          | Nose tracker
Finger Pointing            | +++       | +          | Finger tracker
Laser Pointing             | +++       | +++        | -
ProDePo                    | +++       | +          | Photoreceivers and trackball
Virtual environment        | ++        | ++         | Spatial projection

Figure 5.3: Paradigm comparison. A paradigm is said to be intuitive if almost no procedural training is needed to achieve stable position estimations. The realizable criterion was judged with respect to the available facilities and the deadlines.

5.1.2 Proposed Paradigm

The paradigm had to be designed with respect to the following criteria:

• Easy to use, intuitive

• No training requirement

• Precise recording and automated procedure

• Realizable in the lab with existing equipment

• Quick answering procedure

• High resolution grid

Paradigms such as absolute judgment or drawing were discarded, as they require procedural training to enable subjects to estimate positions precisely. Implementation of a virtual environment such as a head-mounted display is time consuming and requires the subject to perform real-to-virtual spatial projections. Therefore, such methods were not chosen. Finally, laser-pointing methods seemed to give the best balance between the intuitive and the realizable criteria. However, this procedure had to be adapted to the available facilities. Moreover, as the experiment time was limited, the implementation had to be fast.

The designed paradigm consisted in controlling a laser pointer with a computer. The different positions of the laser pointer were mapped to the laser-spot positions on the front wall.


A precise calibration made it possible to express the laser-spot positions in the subject's spherical reference system, centered at the subject's head. Therefore, each position of the pointer was directly related to the direction of the sound source estimated by the user. Because of that, photoreceptors were not needed. However, due to system latencies, a step size of 2.5° was chosen. This made it possible to reach any position in less than 2 seconds.

The subject sat right under the laser pointer, and the height of the chair was adjusted to place his eyes at exactly 1.3 meters elevation. A picture of the setup is shown in Figure 5.4. A black dot in front of the subject served to position his head. Additional black dots and lines were hung on the front wall as markers. The laser pointer was statically connected to two servos.

Figure 5.4: Picture of the designed setup. The laser pointer can be seen at the top. The computer controlling the pointer is at the bottom of the picture. The visible wall is the projection area of the red dot.

The first servo used was a Hitec HS-85MG+. Its stall torque at 6 V is equal to 3.5 kg·cm, which was largely sufficient to carry a laser pointer of approximately 10 g directly connected to the axis and to withstand the small tension exerted by the electrical wire. Its precision is equal to 1°. The other servo was a Hitec HSR-8498HB. Its torque at 6 V is equal to 7.4 kg·cm, which was sufficient to carry a 30 g pointer and a 10 g laser pointer. This servo is also 1° precise. This mounting was controlled and supplied by an Arduino Uno microcontroller. The control signal was a Pulse Width Modulation (PWM) signal. The neutral position was at a pulse width of 1.5 ms, the -90° angle corresponded to a pulse width of 0.6 ms and +90° to a width of 2.4 ms.

The power supply provided by the microcontroller was 5 V, which means the obtained torque was lower than the values mentioned above. However, the system was overdimensioned to address this. The Arduino microcontroller was programmed with the help of Matlab's Arduino toolbox.
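The angle-to-pulse-width mapping described above is linear between the given calibration points. The original control code was written with Matlab's Arduino toolbox; the small Python sketch below only illustrates the mapping, and the function name is hypothetical.

    def angle_to_pulse_ms(angle_deg):
        # -90 deg -> 0.6 ms, 0 deg -> 1.5 ms, +90 deg -> 2.4 ms (linear in between).
        angle = max(-90.0, min(90.0, angle_deg))  # clamp to the servo range
        return 1.5 + 0.9 * angle / 90.0

    # One 2.5 deg step of the pointing grid changes the pulse width by 25 microseconds.
    assert abs(angle_to_pulse_ms(2.5) - angle_to_pulse_ms(0.0) - 0.025) < 1e-9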


Finally, this paradigm was estimated to be 2.5° accurate and 1° precise. However, this does not take into account the adaptation of the user to the device. Indeed, the precision can degrade if the paradigm is not intuitive for the user, as for the absolute judgment paradigm without training [WK89b].

5.1.3 Data Acquisition Procedure

The procedure was made up of two steps: the HRTF selection and the localization testing. Figure 5.5 illustrates this process. In the first step, the HRTF of the user was selected through the proposed hybrid selection. First, the subject's key pinna measurements were entered into the software. Once this was done, the selection stage began. The software selected 5 HRTFs out of the 27 HRTFs present in the CIPIC database by comparing the entered anthropometric values with the corresponding HRTF metadata. Then the subject's participation was required: the subject had to test the different HRTFs and select the best one according to the subjective selection procedure. At the end of this task, the individualized HRTF was obtained. As mentioned in the previous chapter, a GUI was designed to make the process quick and pleasant. The user could test the HRTFs as many times as needed, so that back-to-back comparisons were possible. Once the HRTF was selected, the testing procedure, described by the flow chart in Figure 5.6, could begin.

Figure 5.5: HRTF-selection flow chart. The squares represent the different process steps. The curved shapes represent the databases. The circles are the starting and ending points.


Figure 5.6: Data-recording flow chart. The squares represent the different process steps. The curved shapes represent the databases. The circles are the starting and ending points. The arrows represent user actions.

First, the previously selected HRTF was loaded. Then a sound target was displayed at a random position in front of the user, in the [-30°, +30°] and [-33.75°, +33.75°] ranges for azimuth and elevation respectively. The range was chosen with respect to the device's field of view. The subject had to localize the target by moving the laser pointer with the arrow keys of the keyboard. As soon as the subject thought he had found the target, he simply pressed the space key to save his guess. Then, depending on the state of the process, another sound target was displayed or not.

5.1.4 Measurements

At each step, the servos' instructions, linked to a particular position of the laser pointer and thus of the red dot, were saved. A mapping between the red-dot locations and the servos' instructions was recorded manually to correct the servos' nonlinearity and imperfections in the mounting.

Regarding the coordinate system, the CIPIC HRTFs were designed to work with the interaural coordinate system. For convenience, the servos' instructions and the position records were given and measured in the spherical coordinate system. The transformation introduced in the preliminaries was built into the software to handle this. The results were finally converted into the interaural coordinate system.
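The thesis defines its transformation in the preliminaries, which are not reproduced here. For illustration, the sketch below uses one common convention for converting a vertical-polar direction (azimuth, elevation) into interaural-polar coordinates (lateral angle, polar angle); the sign conventions are an assumption of this sketch.

    import numpy as np

    def spherical_to_interaural(azimuth_deg, elevation_deg):
        az, el = np.radians(azimuth_deg), np.radians(elevation_deg)
        lateral = np.arcsin(np.cos(el) * np.sin(az))              # angle away from the median plane
        polar = np.arctan2(np.sin(el), np.cos(el) * np.cos(az))   # rotation about the interaural axis
        return np.degrees(lateral), np.degrees(polar)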

Twelve subjects performed the experiment, with 45 trials per experiment.


The duration of the experiment was 20 minutes on average. The soundscape individualization lasted about 15 minutes, including the time for the pinna measurements.

5.2 Results

5.2.1 Pinna Matching

In order to quantify the quality of the pinna matching, the relative distance defined in Equation 4.1 was used. This distance ranged from 1% to 10% for single dimensions and from 5% to 30% for the overall distance. The average relative distance between a subject's ear and the ear corresponding to the selected HRTF was equal to 16%.

5.2.2 Azimuth

Individual Accuracy

Figure 5.7 shows the distribution of the judged azimuths for the 12 subjects. The average value is represented by the black diamond. The lower and upper borders of the box represent the 25th and 75th percentiles respectively. The lowest and highest bounds represent the minimum and maximum values of the distribution respectively.

For every mean azimuth a bias was observed. This bias was equal to 11.4° on average. The mean unsigned error was equal to 16.5° and varied strongly with the azimuth: this value was equal to 8.6° for a target azimuth of 0° and to 18.2° for a target azimuth of 30°. Figure 5.8 shows a linear regression of the mean estimated values and further illustrates this overestimation, as the slope of the linear approximation was equal to 1.6. This indicates that the estimations were on average 60% too high.
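The thesis computes these quantities with Equations 3.13 and 3.19 of the preliminaries, which are not reproduced here. The sketch below only shows the straightforward sample versions of the bias, unsigned error, regression slope and correlation used throughout this section; the function name is illustrative.

    import numpy as np

    def azimuth_statistics(targets_deg, judgments_deg):
        targets = np.asarray(targets_deg, dtype=float)
        judgments = np.asarray(judgments_deg, dtype=float)
        signed_error = np.mean(judgments - targets)            # bias
        unsigned_error = np.mean(np.abs(judgments - targets))
        slope, _ = np.polyfit(targets, judgments, 1)           # regression of the estimations
        corr = np.corrcoef(targets, judgments)[0, 1]
        return signed_error, unsigned_error, slope, corr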

This overestimation seemed to be related to the HRTF selection, as the latter did not take into account the size of the subjects' heads. Indeed, it was observed that the HRTFs chosen by the different subjects were on average related to bigger heads: the average head diameter of the users was equal to 13.6 centimeters, whereas the average head diameter corresponding to the set of chosen HRTFs was equal to 14.43 centimeters. The ITD is a crucial cue for estimating the azimuth and it is highly related to the head size. Therefore, it was necessary to establish the relationship between the head size and this cue to further understand this phenomenon.

Figure 3.8 shows the left and right paths taken by the sound to reach the left and right ears respectively. The length difference between the two paths divided by the speed of sound directly gives the ITD. A first-order approximation of this value is expressed by Equation 5.1. Assuming the ITD is fixed and the angle θ stays in a reduced range, the sine function can be approximated by its first-order Taylor series. By taking the natural logarithm of the previous equation, Equation 5.2 is obtained. Finally, the differential of Equation 5.2 leads to Equation 5.3, describing


Figure 5.7: Distribution of the judged azimuths depending on the target position. The red line is the median value, the black diamond is the mean value, the box boundaries are the 25th and 75th percentiles and the vertical black lines represent the 95% confidence interval.


Figure 5.8: Linear regression of the averaged azimuth estimations. The horizontal axis represents the target azimuth. The vertical axis represents the population mean estimation. The correlation coefficient and the slope can be read to the right. The black line is the linear approximation.


the relationship between the head radius variation ∆r and the estimated azimuth variation ∆θ.

\mathrm{ITD}(\theta) = \frac{r}{c}\,(\sin\theta + \theta), \qquad 0 \le \theta \le \frac{\pi}{2} \qquad (5.1)

\ln(A) - \ln(r) = \ln(\theta), \qquad A \in \mathbb{R} \qquad (5.2)

\Delta\theta = -\frac{\Delta r}{r}\,\theta \qquad (5.3)

Now, considering that r, ∆r and θ are respectively equal to 13 cm, 1 cm and 30°, the resulting overestimation is equal to 2.5°. This does not fully explain the 11.4° overestimation obtained here.

However, this equation does not take into account the influence of the ILD. Indeed, the ILD is a very important cue at high frequencies and could influence this result.

Another hypothesis is that the range of displayed sounds did not cover the entire frontal area. Intuitively, the human brain would try to cover the entire space and shift the perceived sounds accordingly. Mendonca et al. reported a similar phenomenon [Men12].

Individual Precision

In order to estimate the precision of the users in the localization task, the standard deviations of the individual judgments were computed and averaged using Equation 3.13. The averaged individual standard deviation was 10.4 degrees. Differences appeared between the standard deviations for the different azimuth angles: the standard deviation was lowest for the center and the extreme positions, that is, 0°, 30° and -30°.

Adaptability of the Soundscape

Differences existed between the sound perception of the individual subjects. For the extreme positions, a difference of 15° appeared between the maximum and minimum mean perception of two different subjects. The between-mean standard deviation, computed with Equation 3.19, was 13.5° when averaged over all target azimuths. However, this between-mean standard deviation varied strongly with the azimuth: in the median plane, it was equal to 5.5°, and it went up to 17.6° for azimuths of 30° and -30°. The overestimation partially explains this. Indeed, as this effect was stronger for lateral positions than for centered positions and as not all listeners were affected by it, the between-mean standard deviation increased for lateral positions.


5.2.3 Elevation

Individual Accuracy

Figure 5.9 shows the elevation judgment distribution of the 12 subjects. Unlike the azimuth, the elevation was underestimated. A linear regression of the mean elevation estimations was performed. It showed that the underestimation was equal to 80% for the elevation.

Figure 5.9: Distribution of the elevation estimations depending on the target position. The red line is the median value, the black diamond is the mean value, the box boundaries are the 25th and 75th percentiles and the vertical black lines represent the 95% confidence interval.

Listeners were poor at estimating the elevation, as revealed by the obtained 17.8° unsigned error and the linear-regression slope of 0.2. The experiment showed that only three persons out of twelve could perform a discrimination task between up and down with a success rate higher than sixty-six percent, and five were unable to discriminate up and down at all. The consequence of this was the strong underestimation of 80% that can be observed in Figure 5.9. This underestimation can be explained by the fact that some responses were random. Moreover, it was observed that the 95% confidence interval was reduced to [-27.6°, +22.3°], whereas the displayed sound targets were located in the [-33.75°, +33.75°] range. This might indicate that, because of the uncertainty, subjects tended to localize sounds near


the horizontal plane to minimize their error. However, no pressure was put on the subjects regarding their achieved performance.

Precision and Adaptability of the Soundscape

The mean standard deviation of the judgments was 12.9°. The between-mean standard deviation was equal to 7.9°. This is low compared to the azimuth between-mean standard deviation and is essentially due to the underestimation, that is, to the fact that the judgments were mainly concentrated near the transversal plane at ear level. Figure 5.10 brings the results together.

Results                          | Azimuth | Elevation
Signed error                     | 11.4°   | -12.6°
Unsigned error                   | 16.5°   | 17.8°
Standard deviation               | 10.4°   | 9.8°
Between-mean standard deviation  | 13.5°   | 7.9°
Slope                            | 1.6     | 0.2
Correlation coefficient          | 0.977   | 0.76

Figure 5.10: Overview of the different results obtained. The correlation coefficient is computed with respect to the linear regression of the mean estimations.

5.2.4 Comparison

Many studies have already looked at the precision one can achieve in sound localization with virtual sound fields. Unfortunately, it is difficult to compare the obtained results, as many parameters change from one experiment to another. However, the following section tries to provide some elements of comparison in order to assess the achieved performance of the designed soundscape. This comparison is expected to lead to new ideas to improve the soundscape.

Comparison with Real Sound Conditions

The following paragraphs compare the obtained results with results obtained for real sound sources. John Middlebrooks measured the ability of subjects to localize broadband sound sources displayed by loudspeakers; his results are shown in Figure 5.11 [Mid90]. The averaged azimuth unsigned error he obtained for real sound sources was 5.4°. Seeber conducted similar experiments and found a median azimuth unsigned error of 3.4° [See03]. These values are much lower than the 16.5° unsigned error obtained in the presented experiment. This could be due to the difference between real and virtual display. However, Seeber observed in his experiment that the unsigned error did not increase for virtual sound localization in the horizontal plane compared to real sound localization.


Training could also explain this difference. Indeed, Middlebrooks had all his subjects go through an intensive training stage. The training stopped when the subjects could reach an accurate and precise localization of each target position. It consisted of localizing randomly placed sounds and receiving visual feedback after each trial. Five to ten hours of training were required for each subject. Seeber, however, did not use any training in his localization experiment. Nevertheless, two main differences are to be seen between the presented experiment and Seeber's. The first difference is the localization paradigm. Indeed, Seeber used the ProDePo paradigm, which was shown to be very accurate and precise compared to other paradigms. For example, thanks to the ProDePo method, Seeber was able to obtain an azimuth localization error three times smaller than the azimuth unsigned error obtained by Recanzone et al., 1998, with the head-pointing paradigm [SF03]. Unfortunately, the achievable precision of the proposed paradigm was not measured for real sound sources and it is thus difficult to quantify its influence on the results. The other difference is that Seeber only displayed sounds in the horizontal plane. Middlebrooks' results, shown in Figure 5.11, showed that the azimuth unsigned error was higher for sound sources outside the horizontal plane. Indeed, he found that this error increased by 59% when sound sources were located at a +45° elevation instead of +5°, for azimuths between -30°

and +30°. This could also explain the relative difference of 40% between Seeber's and Middlebrooks' results. Finally, Seeber and Middlebrooks both reported an overestimation phenomenon: they found a signed error of 1.6° and 2.1° respectively. This is much lower than the 11.4° signed error obtained here. The hybrid selection could explain the higher overestimation obtained in this experiment.

Regarding the elevation, Middlebrooks reported an unsigned error of 5.6° and a signed error of -3.5°. Consistent with the presented experiment are the facts that an underestimation occurred and that the performance was poorer than the azimuth performance. However, the signed and unsigned errors obtained here were higher, equal to -12.5° and 17.8° respectively. This suggests that the hybrid selection or the virtual display strongly degraded the elevation localization performance. Wightman et al. [WK89b, EMWW93] also conducted experiments to evaluate the localization ability for real sound sources. One out of eight subjects was unable to estimate the elevation of real sound sources in their experiment. Although this is lower than the 38% ratio obtained in the presented experiment, it is consistent with the poor elevation performance observed. The difference in ratio could be explained by the experimental differences, which were the training stage and the real-virtual conditions. Finally, Wightman et al. obtained an unsigned error of 20.4°. This value is much higher than the values obtained by Seeber or Middlebrooks. The main differences between Wightman's and Seeber's experiments were the paradigm used and the positions of the sound targets. Moreover, it is hard to compare their results, as Wightman only reported the global unsigned error without mentioning the azimuth unsigned error. Finally, although procedural training was used by Wightman to reduce the paradigm's influence on the results, the difficulty of using the absolute judgment paradigm could partly explain why the unsigned error obtained was twice as high as Middlebrooks'.


Figure 5.11: Mean estimation and standard deviation for each target position, represented with a cross. The step between each cross is equal to 10°. Only every second measurement is shown for ease of reading. -90° is the subject's right, +90° the subject's left. The vertical axis represents the elevation, the horizontal axis the azimuth. From Middlebrooks [Mid90].

Training might also explain this difference, as Middlebrooks used a training based on sound memorization whereas Wightman used a training based on cue memorization. Indeed, Wightman prevented subjects from learning a mapping between sounds and positions by scrambling the sound spectrum between each trial. Therefore, the subjects had to focus on the sound cues and not on the sound itself during the training. This was not the case for Middlebrooks.

Finally, this paragraph shows that an important difference is to be seen between results obtained for real and virtual sound sources. However, further investigation showed that this might not be due to real-virtual display differences, but to experimental differences. Indeed, Seeber found that a factor of 3 could appear between the mean unsigned errors from one experiment to another [See03]. This would also explain the large difference observed between Middlebrooks' and Wightman et al.'s results. Accordingly, it is possible that the large errors obtained in the presented experiment were partly caused by the designed paradigm.

Comparison With Individualized Soundscapes

Seeber, 2003, found a median unsigned error of the azimuth estimations of 3.3° in the horizontal plane when using individual HRTFs [See03]. The difference with the 16.5° azimuth unsigned error obtained in the current experiment can be explained by the paradigm difference, the degradation due to elevation changes or the hybrid selection.


Wightman et al., 1988, conducted an experiment to evaluate the localization performance achievable with a virtual soundscape and individualized HRTFs [WK89a, WK89b]. The created virtual auditory display was constructed by convolving a Gaussian noise with individualized HRIRs. A similar experiment was also conducted for a soundscape created with a non-individualized HRTF [EMWW93]. However, the non-individual HRTF used by Wightman et al. was carefully chosen: this HRTF belonged to a good listener, that is, a person that showed good performance in the elevation-estimation task for real sound sources. The motivation for this choice was the assumption that, individual localization performances being different, HRTFs belonging to people showing good localization performance must be of better quality. This assumption was supported by the following observation: good listeners had their performance degraded when listening through HRTFs of bad listeners. However, the contrary was not true: no improvement was observed for bad listeners listening through HRTFs of good listeners [EMWW93]. Consequently, the localization performance might have been degraded in the current experiment by the presence of bad-quality HRTFs in the CIPIC database. Assuming that 12.5% of the CIPIC population was unable to correctly localize the elevation, 10% of the bad localization cases would be explained by that. In the current experiment, 38% of the subjects were unable to estimate the elevation. This value was equal to 25% in Wightman et al.'s experiment using non-individualized HRTFs. The 13% difference might be explained by the above-mentioned phenomenon. Wightman et al.'s experiments reported that the average global unsigned error in the [-45°, +45°] azimuth range and the [-30°, +60°] elevation range was equal to 24° and 27.5° for virtual soundscapes using individual and non-individualized HRTFs respectively. Though of the same order of magnitude, these global errors are lower than the 34.3° global error found in the presented experiment. As explained, the degradation of 25% of the unsigned error compared to the 27.5° unsigned error obtained by Wightman et al. could be due to the presence of bad listeners in the CIPIC database. Moreover, the fact that Wightman et al. had all the subjects go through a training stage before the testing could also explain this difference. Paukner et al. [PP14] found an azimuth global unsigned error of 32.1° using individual HRTFs. This value is 33% higher than the value obtained by Wightman for the same soundscape. Paradigm differences and the absence of training could explain this.

W. G. Gardner, 1997, measured the elevation-estimation performance for virtual three-dimensional sounds produced through a KEMAR HRTF [Gar97]. An average error of 14.3° for the azimuth and of 34.2° for the global error was reported. Compared to the results obtained in the current experiment, this is a decrease of 13% and an increase of 12% for the azimuth and elevation unsigned error respectively. Regarding the azimuth, the difference suggests that the hybrid selection degraded the performances. The overestimation occurring due to poor head-size matching could be responsible for this. Regarding the elevation, the relative error obtained in the presented experiment is lower than the error obtained in Gardner's experiment: 17.8° against 19.9°.


Figure 5.12: Elevation-estimation results for a bad listener in the virtual field. The vertical axis represents the judged elevation, the horizontal axis the target's azimuth. From Wightman et al., 1993 [EMWW93].

It is difficult to say whether this improvement is due to the hybrid selection or only to differences between the experimental conditions. Finally, a non-negligible difference in the standard deviations is to be seen between the two experiments. Gardner's standard deviation was approximately 20°, which is much higher than the 9.8° obtained in this work. This suggests that using the KEMAR dummy-head HRTF to produce the sound especially degrades the elevation performances compared to the hybrid selection. Despite different experimental conditions, the results obtained by Wightman et al., Gardner and Paukner et al. [WK89a, WK89b, EMWW93, Gar97, AK14] are of the same order of magnitude as the results obtained here. No matter the paradigms and the experimental conditions, the results obtained with individualized HRTF were always better than the results obtained with the hybrid selection. Sound sources created by carefully choosing a generic HRTF were located with less error than sound sources created with the hybrid selection. This observation may, however, be due to training. Finally, the hybrid selection seemed to improve localization performances compared to the KEMAR HRTF. This suggests that the hybrid selection optimized the elevation unsigned error through its anthropometric and subjective selection.

Comparison to Other Individualization Methods

Seeber, 2003, found an unsigned error of the azimuth estimations equal to 4.1° in the horizontal plane when using the subjective method to select the HRTF. This is still much lower than the 16.5° azimuth unsigned error obtained in the current experiment and is certainly due to the paradigm difference, the degradation due to elevation changes, and the individualization difference.


Figure 5.13: Elevation-estimation results using non-individualized HRTF obtained by Gardner. The vertical axis represents the judged elevation, the horizontal axis the elevation of the sound target. The error bars represent the standard deviation. From Gardner, 1997 [Gar97].

However, Seeber observed with the subjective selection method an additional overestimation of 4%, which is consistent with the results obtained here. This suggests that the overestimation obtained in the presented experiment was partly caused by the subjective selection process.

Paukner et al., 2014, compared different HRTF individualization methods: the regression and the DOMISO methods [MIM06, PP14]. Using short pink-noise bursts as stimulus, Paukner et al. obtained a mean unsigned error of 15.88° for the DOMISO selection and a mean unsigned error of 21.42° for the regression. Individualization through regression showed poor results compared to the other individualization methods. However, the difference between the DOMISO and the hybrid selection is not significant and could be explained by differences in the experimental setup.

Very close to the presented method, Zotkin et al. developed an anthropometric selection process based on seven different pinna parameters [Zot00]. Through this method, a global mean relative error of 27.1° was obtained. This value is close to the one reported here. The difference can be due to experimental differences. Indeed, they used head tracking, which adds cues to localize sounds, and the nose-pointing paradigm was used.

Conclusion

Figure 5.14 compares the global unsigned errors obtained in different experiments. This comparison shows that the HRTF selection currently used seemed to improve the performance one could obtain with a manikin HRTF.


Experiment            Global unsigned error   Experimental conditions
[Mid90]               10.6°                   Real sound sources, naive training
[See03]               3.4°                    Real sound sources, azimuth only
[WK89b]               20.4°                   Real sound sources, training
[See03]               3.3°                    Individual HRTF, azimuth only
[See03]               4.1°                    Subjective selection, azimuth only
[EMWW93]              24.0°                   Individual HRTF, training
[EMWW93]              27.5°                   Generic HRTF, training
[AK14]                32.1°                   Individual HRTF
[Gar97, AK14]         34.5°, 35.8°            KEMAR manikin HRTF
[AK14]                33.8°                   DOMISO individualization
[AK14]                42.4°                   Regression individualization
Current experiment    34.3°                   Hybrid selection individualization

Figure 5.14: This table reports the sound-localization unsigned errors obtained in different studies. All the experiments used white or pink noise pulses as stimulus. All the experiments except [Mid90] were conducted with virtual sound sources. All the experiments except [See03] measured both azimuth and elevation errors.

However, no significant improvement could be found compared to the DOMISO method and to the non-individualized soundscape created by Wenzel et al. Finally, the difference in the order of magnitude between the errors obtained in the proposed experiment and in Seeber's or Middlebrooks' experiments suggests that the paradigms and experimental setups highly influenced the results. Unfortunately, it was not possible to measure this influence quantitatively. This comparison remains qualitative, which has to be kept in mind when referring to it.

5.3 Discussion

The main issues of the designed soundscape were the azimuth overestimation and the elevation blur. To address these issues, two possibilities were considered. The first one was to improve the soundscape quality. The other possibility was to add features to the soundscape, such as training, head tracking, auralization or elevation coding, to facilitate the sound localization. This section discusses these different possibilities.

5.3.1 Soundscape Modifications

In order to improve the produced virtual auditory display, the soundscape individualization could be changed. Regarding the azimuth, the biggest issue the system encountered was the overestimation. The comparison with Seeber's experiment [See03] revealed that this overestimation was partly due to the subjective selection. However, in Seeber's experiment, the overestimation due to the subjective selection was equal to 5% whereas it was equal to 60% with the hybrid selection. Therefore, it seemed that the choice of the database and the preselection process increased this overestimation. As the head size corresponding to the HRTF was not taken into account during the selection, it was chosen randomly. Unfortunately, the average head width in the CIPIC database is equal to 14.5 cm whereas the average head width of the subjects was equal to 13.6 cm. It was estimated that this partly caused the estimation bias. Therefore, in order to avoid this overestimation, it was considered to change the preselection. By considering the head size during the preselection, it was possible to force subjects to choose HRTFs corresponding to head sizes smaller than their own. Unfortunately, the CIPIC database only contains 37 HRTFs, and only 7 HRTFs correspond to head sizes smaller than 13.6 centimeters. Therefore, adding such a constraint would degrade the pinna matching.

It is also possible to change the HRTF individualization to improve the elevation-estimation performances. To do so, the pinna parameters used for the preselection had to be changed. However, according to Wightman et al.'s experiments [WK89a, EMWW93], only a small improvement is to be expected from this. Indeed, in their experiments a non-negligible number of persons were unable to estimate the elevation of noise sounds, even for real sources. Changing the HRTF selection is thus not expected to solve this problem.

However, assuming the system is good enough, a lot can be done a posteriori to improve the performances. The next pages expose the different ways to do this and discuss their pros and cons.

5.3.2 Head Tracking

Head tracking is one way to improve the localization performances. This method consists of measuring the head position of the subject, through a magnetic sensor for example, and then adapting the sound with respect to the measured position. This gives the listener additional cues to help him localize sounds. Several experiments reported that this method made it possible to reduce the localization blur and the front-back confusions [AK14, BW01]. Indeed, most of the studies reported a higher localization blur at the sides of the listener [EMWW93, See03, Mid90]. With a head tracker, the listener can turn his head until he faces the sound source. Once the sound source is in front of him, the localization blur is minimized. As the unsigned error was equal to 17° at the edges and to 8.6° in the middle for the proposed soundscape, an improvement of 50% was hypothetically achievable through this method. The resolution of front-back confusions enabled by head tracking was explained by the fact that the listeners could use the ITD and ILD cues to estimate the elevation. Indeed, tilting the head makes the elevation estimation similar to an azimuth estimation; the confusion is thus immediately resolved. In 2014, Paukner et al. investigated the impact of head tracking on the localization performance [AK14]. They reported that the azimuth unsigned error decreased from 15.88° to 10.87°, which represented an improvement of 32%. Regarding the elevation, the improvement was equal to 19%. Begault and Wenzel, 2001, also investigated the impact of head tracking on the spatial perception of virtual speech sounds. Their results are shown in Figure 5.15. They reported that head tracking reduced the unsigned error by 27% when using a KEMAR HRTF to produce the sounds.

Figure 5.15: Impact of head tracking on the unsigned azimuth error. White and black stand for with and without head tracking respectively. From Begault and Wenzel [BW01].
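To make the role of head tracking concrete, the sketch below recomputes the head-relative direction of a virtual source from a measured head yaw before binaural rendering; the function and its names are purely illustrative assumptions and do not describe the device's actual implementation.

```python
def head_relative_direction(source_az_deg, source_el_deg, head_yaw_deg):
    """Recompute the direction used for binaural rendering from the tracked head yaw.

    Only the yaw rotation is handled in this sketch; a full implementation
    would also compensate head pitch and roll.
    """
    # Wrap the relative azimuth to the (-180, 180] degree range.
    rel_az = (source_az_deg - head_yaw_deg + 180.0) % 360.0 - 180.0
    return rel_az, source_el_deg

# Example: a source fixed at +60 deg appears at +20 deg once the listener has
# turned the head by +40 deg, where the localization blur is smaller.
print(head_relative_direction(60.0, 0.0, 40.0))   # -> (20.0, 0.0)
```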

5.3.3 Auralization

Head tracking was not the only feature that could be added to the device in order to improve the performances. Auralization was another possibility and would have made the sound more realistic by adding echoes. This process is mainly used to externalize sounds. However, Begault and Wenzel, 2001, also investigated the influence of echoes on the sound-localization performances. They compared the results obtained with auralization to results obtained without echoes for a sound generated through a KEMAR HRTF [BW01]. They reported that auralization provided a 31% improvement of the performances, that is, a decrease of 8° of the unsigned error in their case. However, they also reported that full auralization degraded the performances compared to auralization restricted to early reflections. This suggests that the diffuse reflections added confusion. Unfortunately, the drawback was that auralization increased the elevation error. Indeed, Begault and Wenzel reported that because of the auralization the elevation unsigned error went from 17.6° to 28.7°, that is, a 39% degradation with respect to the unsigned error, and the standard deviation underwent a degradation of 16%.


Figure 5.16: Impact of echoes on the unsigned azimuth error. From Begault and Wenzel [BW01].
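As a rough illustration of auralization restricted to early reflections, a dry signal can be mixed with a few delayed and attenuated copies of itself; the delays, gains and function name below are arbitrary placeholders, not values taken from Begault and Wenzel.

```python
import numpy as np

def add_early_reflections(signal, fs=44100,
                          reflections=((0.007, 0.50), (0.011, 0.35), (0.017, 0.25))):
    """Mix a dry signal with a few early reflections, given as (delay in s, gain)."""
    dry = np.asarray(signal, dtype=float)
    out = dry.copy()
    for delay_s, gain in reflections:
        d = int(round(delay_s * fs))          # delay in samples
        if d < len(dry):
            out[d:] += gain * dry[:len(dry) - d]
    return out
```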

5.3.4 Training

Adding an initial training stage was another possibility to improve the localization performances. However, training is an ambiguous term and can refer to different concepts. Indeed, one could define the learning phase as the memorization of the different sound and position pairs. This is what Middlebrooks implemented in his experiment [Mid90]. However, another, more global definition is the memorization of the sound-cue and position pairs. This corresponds to a remapping of the spatial-hearing perception. The latter definition is less restrictive as such a training is not signal dependent. Wightman and Kistler trained their subjects this way [WK89a]. However, the performance achieved through this process was not as high as with Middlebrooks' training process; the unsigned error was approximately two times higher for the same amount of training time (10 hours). Majdak et al., 2010, also studied the effect of training and proposed a standard way to do it [PML10]. Their protocol was the following: a sound was displayed and the user had to localize it. If the response was correct, a new target was generated, but if the response was not correct, a visual feedback was given and the subject had to localize the same sound one more time. In order to avoid the previously mentioned memorization of the sounds instead of the cues, they randomly changed the sound level so that the users did not listen to the exact same sound twice for the same position. The training lasted approximately three hours. They reported that the first 400 trials were the most important in the learning process, as 90% of the improvement was made there. Afterwards, the improvement continued but was much slower. Using individualized HRTF, they reported an improvement of 22% of the precision compared to results obtained without training. One hour of training every day for 10 days was required to achieve this. The major improvement happened during the first five days. Mendonca et al. also tried to improve the localization accuracy with the help of training sessions [Men12]. They especially focused on fast training sessions whose duration was about 30 minutes. The training was divided into two stages: active learning and passive feedback. In the first stage, the user selected different positions and listened to the sound displayed at these positions. Five minutes were allocated to this task. After these five minutes, target sounds were displayed in a reduced set of positions: five positions instead of 15. The user had to perform the localization task with feedback until he could answer 80% of the trials correctly. Approximately 25 minutes were needed for a participant to reach such a score. Mendonca et al. observed a decrease of 46% of the unsigned error for non-individualized HRTF. For the elevation, they only trained three positions. They reported that the training had little influence on low-elevation localization, but that high elevations could be better localized. They also reported that the variance could be reduced. Unfortunately, nothing was done to avoid the sound-memorization effect; this training did not perform a remapping of the spatial-hearing perception.
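A simplified sketch in the spirit of these feedback-training protocols is given below; the function names, the level-roving range and the fixed trial count are assumptions for illustration and do not reproduce the cited procedures exactly.

```python
import random

def feedback_training(positions, present, get_response,
                      n_trials=400, level_rove_db=10.0):
    """Sketch of a feedback-based localization training session.

    `present(position, level_db)` renders a virtual source and
    `get_response()` returns the judged position; both are assumed to be
    supplied by the experimental setup.
    """
    n_correct = 0
    for _ in range(n_trials):
        target = random.choice(positions)
        # Rove the presentation level so the listener learns the spatial cues
        # rather than memorizing the exact signal played at each position.
        present(target, random.uniform(-level_rove_db, level_rove_db))
        response = get_response()
        if response == target:
            n_correct += 1
        else:
            print(f"Feedback: target was {target}, response was {response}")
    return n_correct / n_trials
```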

5.3.5 Elevation Coding

Finally, it was also possible to transform the spectral properties of the stimulus in order to give additional cues to the listeners. This process was likely to improve the elevation performances, as spectral cues are mainly used for elevation estimation. To do so, different strategies exist. One strategy is to use a peak-and-notch filter on the sound to emphasize some key HRTF characteristics [Bla83, RB67]. The other method is to perform an artificial coding of the sound. In this case, the localization of the sound is no longer straightforward but goes through a decoding process. Indeed, the listener has to recognize the added cues and then use a learned mapping to estimate the position.

5.4 Conclusion

Method                  Azimuth Improvement   Elevation Improvement
HRTF Selection          50%                   Unknown
Auralization [BW01]     31%                   -28.7%
Head tracking [AK14]    32%                   19%
Training [Men12]        46%                   46%
Elevation coding        No improvement        Unknown

Figure 5.17: This table compares different methods to improve the sound localization. The percentages are computed with respect to the unsigned errors obtained in the mentioned experiments.

Figure 5.17 summarizes the different possibilities that were available to optimize the soundscape individualization. The main issues the created virtual auditory display encountered were the azimuth overestimation and the elevation blur. Though showing better performances than the manikin HRTF, the hybrid selection did not show significant improvements compared to other individualization methods. Given the localization performances needed for the device, the designed soundscape had to be improved. Investigation revealed that different possibilities were available to do so. Regarding the azimuth, considering the user's head size in order to prevent him from using HRTFs corresponding to heads bigger than his own seemed to be the most convincing possibility. It was also possible to add features to the soundscape. Among the possible features were the use of head tracking, training, auralization and elevation coding. Head tracking is already implemented in the device and better sound-localization performances are to be expected in real conditions. Training was able to improve the performances by up to 46% in Mendonca's experiment, but this process is time consuming. Auralization was able to improve the azimuth performances by 31% in Begault and Wenzel's experiment [BW01], but the price to pay was a degradation of the elevation-localization performances.


Chapter 6

Soundscape Optimization

6.1 Introduction

The previous chapter exposes different possibilities that were likely to improve the designed soundscape. Regarding the azimuth, as the overestimation was found to be related to the individualization process, the latter was modified as explained in the next section. Regarding the elevation blur, training, auralization and elevation coding were proposed to reduce it. Unfortunately, auralization degraded the elevation performance in Begault and Wenzel's experiment [BW01]. It was therefore not chosen. Training stages showed interesting improvements but were time consuming [PML10]. Short training stages as proposed by Mendonca et al. [Men12] were also considered, but since they were based on signal memorization, this solution was not considered optimal. Therefore, by elimination, elevation coding was chosen to reduce the elevation blur.

6.2 HRTF Selection Modification

In order to handle the overestimation effect, the HRTF selection process was modified. The first option considered was to add the head size of the user to the preselection parameters. Unfortunately, due to the small number of HRTFs contained in the CIPIC database, this approach was not practicable. Indeed, as mentioned before, the average head width of the CIPIC population is equal to 14.5 centimeters, and only 7 HRTFs belong to persons having a head width smaller than the average head width of the subjects of the current experiment. Consequently, it was not possible to combine selection through pinna matching and selection through head matching. A preselection based only on the head size was therefore implemented.

Zhang et al. [MZA11] reported that the head parameters available in the CIPIC database, given in Figure 4.2, are highly correlated. Moreover, they stated that the parameter most correlated with the different HRTFs is the head width. Therefore, this parameter was chosen to perform the preselection. The algorithm selecting a reduced set of HRTFs compared the head width of the subject, defined by X1 in Figure 4.1, to the head widths contained in the CIPIC database. Then the five closest head-width values were selected, with the constraint that none of these values could be larger than the user's head width. If the head width of the user was smaller than 13.3 cm, fewer than five HRTFs were preselected. If the head width of the user was smaller than 12.6 cm, the smallest head width was selected and no subjective selection occurred.
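A minimal sketch of this head-width preselection is given below, assuming the CIPIC head widths (parameter X1) are available as an array; the function and parameter names are illustrative and not taken from the actual implementation.

```python
import numpy as np

def preselect_hrtfs(user_head_width_cm, database_head_widths_cm,
                    n_max=5, min_width_cm=12.6):
    """Preselect HRTF candidates by head width, never exceeding the user's own."""
    widths = np.asarray(database_head_widths_cm, dtype=float)
    candidates = np.flatnonzero(widths <= user_head_width_cm)

    if user_head_width_cm < min_width_cm or candidates.size == 0:
        # Fall back to the single smallest head width; no subjective
        # selection stage follows in this case.
        return [int(np.argmin(widths))]

    # Sort the admissible candidates by closeness to the user's head width
    # and keep at most n_max of them (fewer if the user's head is small).
    order = np.argsort(user_head_width_cm - widths[candidates])
    return [int(i) for i in candidates[order[:n_max]]]
```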

6.3 Elevation Coding Review

The following section reviews how the elevation can be coded to achieve better localization performance. The brain mainly estimates the elevation of a sound source by performing a spectral analysis. Therefore, the following strategies are based on spectral modifications of the sound signal and do not affect the ITD and ILD cues, in order to preserve the azimuth performances.

6.3.1 Directional Bands

Blauert showed that for one-third-octave-wide band noise, the estimated direction is independent of the real location of the sound. In fact, this estimation is shown to be based only on the center frequency of the signal, a phenomenon also known as Pratt's effect [Bla96]. Therefore, it was possible to map frequency ranges to particular directions. These frequency ranges are called directional bands. Three directions are associated with directional bands: front, back and up. The frequency ranges corresponding to these directional bands are presented in Figure 6.1. As shown, frequencies belonging to the [2 kHz - 7 kHz] range are perceived to be in front and therefore correspond to the front directional band according to Blauert [Bla96]. On the contrary, signals containing a strong component around 1 kHz were perceived to be more diffuse, as if the sound was coming from behind. Finally, another directional band exists around 8 kHz, linked to the upper area. In order to use these directional bands to enhance the elevation estimation, it is therefore possible to boost certain frequencies of the sound according to the spatial impression to be given. However, three main issues appear while doing this. The first one is the poor resolution of these additional cues. Indeed, the frequency bands are only mapped to three different directions: front, up and back. Unfortunately, this did not suit the device requirements, as the elevation precision had to be improved within the frontal area. The second issue is that these frequency bands are user dependent, as shown by Itoh et al. [MIM06] and by Blauert's results presented in Figure 6.2 [Bla96]. The latter figure shows the elevation estimation for different subjects according to the center frequency of the displayed signal [Bla83]. It is clear that the back direction is not mapped to the same frequency bands for every subject. For example, subject D associated the back area with frequencies under 3 kHz whereas subject E associated this same area with frequencies beyond 5 kHz. The last issue is that no frequency band was found to be linked to low elevations according to Blauert [Bla96]. Indeed, all the sounds perceived through band-limited noise were located above the transverse plane.

Figure 6.1: Left: experimental setup. Right: perception depending on the center frequency. From Blauert [Bla83].

6.3.2 Covert Peaks

Other cues able to emphasize the elevation perception are the covert peaks. They refer to the spatial locations at which a particular frequency is maximally transmitted by the ear. Butler [RB67] showed that it was possible to influence the perceived position of a sound source by notch filtering the spectral cues corresponding to the other covert peak areas. Butler even managed to predict the listener's judgment with the help of his HRTF. The response patterns in the median plane could be predicted using a model based on the spectral comparison of the listener's HRTF and the signal spectrum. However, similar to the boosting method, the performance achieved by notching out some frequencies did not seem to be sufficient. Indeed, the Covert Peak Areas (CPA) found are high, low, front and back, which would also provide a poor resolution for an elevation coding. Moreover, the covert peaks are user dependent, which was an issue as the individual HRTFs were not available here.

6.3.3 Natural Frequency Elevation Mapping

Parise et al. investigated the origin of the Frequency Elevation Mapping (FEM) [CVPE14]. They came to the conclusion that the FEM is due to the shape of the ear, as already stated, but also to natural sounds.


Figure 6.2: Directional-band perception. The colors represent the subject-perceived direction, the horizontal axis the center frequency and the vertical axis the different subjects. From Blauert, Spatial Hearing [Bla96].


Indeed, according to their results, high-frequency sounds statistically have a tendency to originate from elevated sources in natural auditory scenes. One hypothesis made to explain this is that sound sources with a high elevation naturally produce signals with more energy in the high frequencies, as leaves on a tree produce more high frequencies than footsteps on the floor. Another hypothesis is that the ground absorbs the high frequencies, explaining why sound sources coming from below contain more low frequencies. They stated that this is particularly true for frequencies between one and seven kHz.

6.3.4 Artificial Coding

The previous paragraphs show that the spectral shape of the stimulus influences the perceived elevation. However, they also show that it seems impossible to construct a single signal that would naturally give the same elevation impression to all listeners. To achieve this, a remapping between pitch and elevation must be performed for every listener. This seems to be possible through artificial coding, and Susnik et al. investigated this possibility [RST05]. To do so, they created a simple coding by filtering the signal through band-pass or low-pass filters with different center and cutoff frequencies respectively. In their code, high frequencies corresponded to high elevations and vice versa. The stimuli were randomly displayed and the listener had to say whether the elevation was lower or higher than the elevation of the previously displayed sound. The results showed that, by filtering pink noise through a low-pass filter, it was possible to code sixty-one different elevations without any training. This way of coding was 10% to 28% better than coding through band-pass filtering. This procedure is called artificial coding as it does not rely on spatial perception. Therefore, the main drawback of this method is that no absolute position can be coded. However, as this procedure seemed to enable the best possible elevation accuracy, it was chosen to complete the soundscape.

6.3.5 Coding Design

The coding strategy was inspired by the coding strategy of Susnik et al. [RST05]. Susnik et al. stated that the coding offering the best resolution is the low-pass filtering coding. In order to obtain this full resolution, it was however necessary, according to them, to cover the range from 500 Hz to 18 kHz. Unfortunately, as ILD are mainly detectable at frequencies higher than 1 kHz, using this strategy could degrade the azimuth-estimation performances for some coded signals. To measure this, two different coding strategies were implemented. The first coding strategy was the low-pass filtering of pink noise with different cutoff frequencies. Each cutoff frequency was mapped to a particular elevation. The choice of the cutoff frequencies was made based on the critical bands defined by the Bark model. The following cutoff frequencies were therefore chosen: 0.7 kHz, 1.17 kHz, 1.85 kHz, 2.9 kHz, 4.8 kHz and 8.5 kHz to encode -33.75°, -22.5°, -11.25°, 0°, 11.25° and 22.5°. These frequencies define equally large bands on the Bark scale. No filtering was used for the +33.75° elevation. In the second strategy, instead of filtering the signal, some frequencies were boosted through a second-order shelving filter. The boosted frequencies were again chosen according to the Bark model and were the same as the previous strategy's cutoff frequencies. By doing so, the signal remained broadband and no alteration of the ITD and ILD cues could occur.
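A minimal sketch of the first (low-pass) coding strategy is given below; only the cutoff-to-elevation mapping follows the values listed above, whereas the Butterworth design, the filter order and the function names are assumptions made for illustration.

```python
from scipy.signal import butter, lfilter

# Mapping from target elevation (degrees) to low-pass cutoff frequency (Hz),
# following the Bark-based choice described in the text.
CUTOFF_BY_ELEVATION = {
    -33.75: 700.0, -22.5: 1170.0, -11.25: 1850.0,
    0.0: 2900.0, 11.25: 4800.0, 22.5: 8500.0,
    # +33.75 deg: broadband signal, no filtering
}

def code_elevation(pink_noise, elevation_deg, fs=44100.0, order=4):
    """Low-pass coding of the elevation before spatialization (first strategy)."""
    cutoff = CUTOFF_BY_ELEVATION.get(elevation_deg)
    if cutoff is None:
        return pink_noise                       # +33.75 deg: leave broadband
    b, a = butter(order, cutoff / (fs / 2.0), btype="low")
    return lfilter(b, a, pink_noise)
```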

6.4 Experiment

To test the newly created soundscapes and to validate the hypotheses, the same experimental setup was used. 17 subjects went through the test, representing a total amount of time of six hours. Each of the 17 subjects had 70 trials to perform, leading to a total of 1190 measurements. The first strategy was used with twelve subjects, the second strategy with five subjects. The soundscape individualization process lasted approximately 3 minutes for both strategies. The test lasted 16 minutes on average.

Unlike in the previous experiment, a calibration had to be done to give the listener the correct elevation mapping. To do so, sound sources associated with the extreme elevations -33.75° and +33.75° and with the 0° elevation were displayed three times for 5 seconds each. A visual feedback was given simultaneously. It was assumed, by construction, that the listener would be able to complete the mapping intuitively.

6.5 Results

6.5.1 Azimuth

Figure 6.3 shows the distribution of the azimuth judgments of the different subjects. The distribution corresponding to the target position at +10° seemed not to be consistent with the other measures, since its 95% confidence interval is twice as big as that of the distribution at -10°. This might be due to an asymmetry in the experimental setup. Indeed, the subjects sat next to a vertical board to their right during the experiment. Several subjects reported that they had difficulties discriminating the +10° and -10° locations, as they had the impression the vertical board was producing echoes. Therefore, sound sources located to the right were partially interpreted as echoes produced by a sound source located at the mirror-symmetric position about the median plane. This could explain the irregularity. This problem did not appear for larger azimuths.

The linear regression of the global mean azimuth estimations was also computed. It showed that an underestimation of 17.8% occurred.
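For reference, the statistics reported in this chapter can be computed from the paired target and judged angles as sketched below, assuming the unsigned error denotes the mean absolute difference between judged and target angles; the averaging over subjects is not reproduced here.

```python
import numpy as np

def localization_metrics(target_deg, judged_deg):
    """Summary statistics for one set of localization judgments (a sketch)."""
    target = np.asarray(target_deg, dtype=float)
    judged = np.asarray(judged_deg, dtype=float)
    signed_error = np.mean(judged - target)            # bias (over/underestimation)
    unsigned_error = np.mean(np.abs(judged - target))   # mean absolute error
    slope, intercept = np.polyfit(target, judged, 1)    # linear regression
    r = np.corrcoef(target, judged)[0, 1]               # correlation coefficient
    return {"signed": signed_error, "unsigned": unsigned_error,
            "slope": slope, "intercept": intercept, "r": r}

# Example: a regression slope of 0.82 corresponds to an underestimation of about 18%.
```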

The azimuth-judgment distribution for the second coding strategy is presented in Figure 6.4. No significant asymmetry could be found anymore, suggesting that the asymmetry was due to the low-pass filtering of the signal.


Figure 6.3: Azimuth-estimation distribution for the first coding. The red line is the median value, the black diamond is the mean value, the box boundaries are the 25th and 75th percentiles and the vertical black lines represent the 95% confidence interval.


Figure 6.4: Azimuth-estimation distribution for the second coding. The red line is the median value, the black diamond is the mean value, the box boundaries are the 25th and 75th percentiles and the vertical black lines represent the 95% confidence interval.


Regarding the linear regression, the slope coefficient for the first strategy was equal to 0.82 whereas it was equal to 1.12 for the second strategy. The mean standard deviations were equal to 8.2° and 7.9°, and the between-mean standard deviations to 6.69° and 7.21°. These differences were not significant, as the estimated precision of the pointing system was equal to 1°.

6.5.2 Elevation

Figure 6.5 shows the judged elevation distribution. The linear regression gave a correlation coefficient of 0.95. This was lower than the correlation coefficient obtained for the azimuth regression but still indicated a strong linear behavior. The +33.75° location was perceived as lower than the +22.5° location, that is, the broadband signal was perceived as lower than the low-pass filtered signal with the 8.5 kHz cutoff frequency. The standard deviation was equal to 9° and the between-mean standard deviation to 7.84°.

Figure 6.5: Elevation-estimation distribution for the first coding. The red line is the median value, the black diamond is the mean value, the box boundaries are the 25th and 75th percentiles and the vertical black lines represent the 95% confidence interval.

Regarding the second experiment, Figure 6.6 shows the elevation-estimation distribution. The correlation coefficient provided by the linear regression was equal to 0.83, which is 13% lower than for the first strategy. This suggests a nonlinear behavior of the elevation estimations for the second strategy. The standard deviation was equal to 12°, which was 25% higher than for the first strategy. Regarding the between-mean standard deviation, the value was equal to 6.0°, which is 23% smaller than the between-mean standard deviation obtained for the first strategy.

Figure 6.6: Elevation-estimation distribution for the second coding. The red line is the median value, the black diamond is the mean value, the box boundaries are the 25th and 75th percentiles and the vertical black lines represent the 95% confidence interval.

6.5.3 Comparison and Discussion

Figure 6.7 and Figure 6.8 show the main results obtained for the three different experiments made during this work. Experiment one refers to the subjective selection based on pinna parameters. Experiment two corresponds to the elevation coding through low-pass filtering and experiment three corresponds to the elevation coding through shelving filtering. It is indisputable that the second and third created soundscapes provided better localization results than the first one. Indeed, improvements from 20% up to 70% were made for azimuth and elevation localization regarding both accuracy and precision. An improvement of the between-mean standard deviation was also to be seen between these same experiments.


Azimuth Results                   Soundscape 1   Soundscape 2   Soundscape 3
Signed error                      11.4°          -3.0°          1.7°
Unsigned error                    16.5°          9.2°           8.5°
Standard deviation                10.4°          8.2°           7.2°
Between-mean standard deviation   13.5°          6.7°           8.0°
Slope                             1.6            0.82           1.12
Correlation coefficient           0.977          0.99           0.99

Figure 6.7: Azimuth results obtained for the different experiments. Soundscape 1 refers to the soundscape created with the hybrid selection. Soundscapes 2 and 3 refer to the soundscapes created with the low-pass and boosted-band coding respectively.

Elevation Results                 Soundscape 1   Soundscape 2   Soundscape 3
Signed error                      -12.6°         -2.8°          -3.1°
Unsigned error                    17.8°          12.3°          14.2°
Standard deviation                9.8°           9.0°           12.0°
Between-mean standard deviation   7.9°           7.8°           6.0°
Slope                             0.2            0.7            0.5
Correlation coefficient           0.76           0.95           0.82

Figure 6.8: Elevation results obtained for the different experiments. Soundscape 1 refers to the soundscape created with the hybrid selection. Soundscapes 2 and 3 refer to the soundscapes created with the low-pass and boosted-band coding respectively.


The main effect of the second HRTF selection was to be seen in the relative mean error of the azimuth. The average overestimation of 11.4° obtained with the first HRTF selection was reduced to -3.0° and +1.7° for the second and third experiments respectively, which represents a 70% and 83% improvement. The direct consequence of this was the reduction of the mean unsigned error. To another extent, the standard deviation of the azimuth estimations also decreased significantly. Indeed, a decrease of 21% to 24% was to be seen. Finally, the between-mean variance also decreased significantly for the second and third experiments. An improvement of 45% was found. These results indicated that the HRTF selection with respect to the head size improved the accuracy, the precision and the individualization of the soundscape compared to the hybrid selection. This was consistent with the fact that the ILD and ITD essentially depend on the head size and not on the ear shape of the listener. However, the complete removal of the overestimation was not expected from the results obtained through Equation 5.3. This indicated that the overestimation was not only due to an ITD mismatch, but might also have been caused by an ILD mismatch.

Regarding the elevation, the second and third experiments showed that the achievable localization performance for coded sounds was very high compared to the results obtained with the hybrid selection. Indeed, regarding the accuracy, an improvement of 85% was reached with a mean error of -2.8° against -17.6° for the first experiment. The results also showed that the standard deviation was not significantly reduced in the second experiment; this might be explained by the fact that the uncertainty on the elevation was high in the first experiment, causing an estimation centering. The standard deviation of the first experiment was therefore abnormally low. The same reason explains the fact that the between-mean variance did not change significantly from one experiment to another. Finally, the second and third elevation codings both showed the same defect: the +22.5° and +33.75° elevations were switched when estimated. This may be explained by the directional bands. Indeed, Blauert stated that frequencies beyond 8 kHz are perceived to be behind and not above [Bla83]. Therefore, it seems that the initial calibration was not sufficient to remap these high frequencies to high positions. Consequently, two options are available to counter this effect: a longer training stage can be implemented to force the remapping, or the elevation code has to be constrained to the [500 Hz - 8 kHz] frequency range. The drawback of the first solution is that it is time consuming; the drawback of the second solution is that less accuracy will be available.

Both the second and third experiments provided an accuracy near the minimum observable accuracy. Indeed, the absolute signed errors were equal to 3.0° and 1.7°, and the pointing-system accuracy was estimated to be 2.5°. Regarding the unsigned error, the difference between the two strategies was less than 1.0°, and this value is below the pointing-system precision. It was thus not possible to find a significant difference between the two strategies for these performances. The slope of the mean estimations was smaller for the low-pass filtering coding, suggesting more uncertainty. However, no significant changes occurred for the azimuth standard deviation. Regarding the elevation, the low-pass filtering method provided better results than the boosting method, both for accuracy and precision. Indeed, the mean unsigned error was 14% lower in the second experiment than in the third, and the standard deviation was 25% lower. Finally, the third elevation coding showed a lower correlation coefficient for the linear regression, going hand in hand with the poorer accuracy results.

The total unsigned errors obtained for the second and third experiments were equal to 21.5° and 22.7°. This represented an improvement of 38% and 35% respectively compared to the previous 34.3° global unsigned error. This is lower than the improvement achieved by Mendonca et al. [Men12] with training. However, it is better than the results obtained for auralization and head tracking by Begault et al. [BW01]. These newly obtained global unsigned errors are smaller than the value obtained by Wightman et al. with individual HRTF. However, they are larger than the unsigned error obtained by Wightman et al. for real sound sources, and twice as high as the error obtained by Middlebrooks.

6.6 Conclusion

Based on the first experiment's results, the hybrid selection was modified and two new soundscape-creation strategies were developed to further optimize the accuracy, precision and adaptivity. The hybrid selection was modified in order to correct the overestimation error obtained with the first soundscape, which was a major problem. Regarding the elevation, coding was assumed to be the most promising strategy with respect to the device constraints and was therefore implemented. Finally, the second HRTF selection improved both the azimuth accuracy and the precision of the device. Indeed, the second experiment reported an average error equal to -3.0° and a standard deviation equal to 8.3°, which is an improvement of 70% and 21% respectively compared to the first HRTF selection. Regarding the elevation, the low-pass filtering strategy provided the best results, with a mean error of -2.8° and a standard deviation of 9.2°. The unsigned error obtained for the second strategy was close to the unsigned error obtained by Wightman et al. with individual HRTF. This second strategy showed a 38% improvement of the soundscape with respect to the global unsigned error and provided a 9.2° horizontal and a 12.3° vertical resolution respectively. Finally, the two newly designed soundscapes needed the same setting time: 3 minutes, including the anthropometric measurement. The setting time of the first soundscape was approximately 15 minutes. None of the created soundscapes required additional cost. Figure 6.9 brings all this information together.


Individualization Strategy   Global Unsigned Error   Setup time   Cost
Hybrid selection             34.3°                   15 minutes   none
Low-pass coding              21.5°                   3 minutes    none
Boosted-band coding          22.7°                   3 minutes    none

Figure 6.9: Final Comparison


Chapter 7

Conclusion

Localizing objects with sounds is possible because sounds contain cues related to the spatial environment and the brain is able to interpret these through monaural and binaural analysis. Therefore, auditory-vision substitution could be used to help blind people localize objects such as obstacles. To enable this sensory substitution, special sound stimuli, also called soundscapes, have to be created. These stimuli have to contain all the spatial information needed for the listener to localize objects with enough precision and accuracy.

Unfortunately, each person perceives sounds differently due to morphological differences. Therefore, it is impossible to create a single soundscape for every user; instead, the soundscape has to be individualized in order to achieve better sound-localization precision and accuracy. In this work, a soundscape individualization process was designed with respect to the following criteria: cost, time consumption and achievable performances. These criteria reflected the will to create an intuitive and affordable device. The soundscape was created through a hybrid HRTF selection. The creation process, also called individualization, had the main advantages of being low-cost and quick. However, no significant improvement was found compared to other existing soundscapes. Indeed, two main drawbacks occurred with the created soundscape: overestimation of lateral positions and elevation blur. Based on the obtained performances, the soundscape individualization process was optimized and the accuracy could be improved by 71% for the azimuth. Moreover, to counter the elevation blur, additional features were added to the soundscape. The choice was made to perform an artificial coding of the elevation, meaning that the elevation estimations no longer relied on natural but on learned cues. This coding was nevertheless designed to resemble the natural coding, in order to enable a fast adaptation. This strategy showed a 78% improvement of the accuracy without any training stage. A global unsigned error of 21.5° was reached with the optimized soundscape, and this value is only 5% higher than the 20.4° unsigned error obtained by Wightman et al. for a soundscape created with individual HRTF. Finally, the individualization time could also be optimized: it was reduced to 3 minutes, compared to 15 minutes for the hybrid selection.


7.1 Future Work

Although the soundscape has been optimized, several possibilities remain to improve the achievable performances. Training is likely to bring improvements of up to 46% with respect to the unsigned error according to Mendonca et al. [BW01, Men12]. However, this additional step goes against the time constraint, and the right balance therefore has to be found between precision and adaptation time. Moreover, it is necessary to measure how head tracking will improve the performances before constructing the training stage. Then, the final achievable precision of the device will be obtained.

Finally, the next important topic to investigate is the amount of information a soundscape can transmit to the listener. Indeed, to fully replace vision, a large amount of information has to be transmitted to the ears. The DVS cameras can produce 16386 events every 15 µs. Consequently, sound interpretation is indisputably the limiting factor for the achievable information rate. It is thus important to know how many sounds can be localized simultaneously without degrading the localization performances. The literature on this topic seems very sparse and experiments are therefore needed to answer these questions.


List of Figures

2.1 Mobility Aid Devices

3.1 Spherical Coordinate System
3.2 Interaural Coordinate System
3.3 Body Planes
3.4 Band Pass Filter
3.5 Population and Individual Distribution
3.6 Sound Reflexion
3.7 Sound Diffraction
3.8 ITD Representation
3.9 ILD Representation
3.10 Cone of Confusion
3.11 Spatial Cues Table
3.12 Front-Back Confusion
3.13 MAA
3.14 Precision and MAA Comparison
3.15 HRTF
3.16 HRIR Measurements
3.17 Convolver

4.1 Pinna Parameters
4.2 Body Parameters
4.3 Individualization Comparison
4.4 GUI for the Subjective Selection
4.5 Frequency Influence on Localization
4.6 Stimulus Modulation

5.1 ProDePo Paradigm
5.2 Paradigms' Comparison
5.3 Paradigm Comparison Table
5.4 Designed Paradigm
5.5 HRTF Selection
5.6 Data Acquisition
5.7 Exp 1: Azimuth Distribution
5.8 Exp 1: Azimuth Regression
5.9 Exp 1: Elevation Distribution
5.10 First Results
5.11 Middlebrooks' Results
5.12 Wenzel's Results
5.13 Gardner's Results
5.14 Individualization Comparison
5.15 Head Tracking Results
5.16 Auralization Results
5.17 Improvement Methods Comparison

6.1 Directional Bands
6.2 Directional Bands Resolution
6.3 Exp 2: Azimuth Distribution
6.4 Exp 3: Azimuth Distribution
6.5 Exp 2: Elevation Distribution
6.6 Exp 3: Elevation Distribution
6.7 Final Results: Azimuth
6.8 Final Results: Elevation
6.9 Final Comparison


Bibliography

[Abr15] Michael Abrash. What virtual reality could, should, and almost certainly will be within two years, 2015.

[Ajd05] Thibaut Ajdler. Interpolation of head related transfer functions considering acoustics. AES 118th Convention, May 2005.

[AK14] Alexander Kuhn, Martin Rothbucher, and Klaus Diepold. HRTF Customization by Regression. Munich University of Technology, 2014.

[Ban] Introduction to digital filters. https://ccrma.stanford.edu/~jos/filters/. Accessed: 2015-09-14.

[Bla83] Jens Blauert. Introduction: hearing of music in three spatial dimensions, 1983.

[Bla96] Jens Blauert. Spatial hearing, revised edition: the psychophysics of human sound localization, October 1996.

[Blo15] Bloomberg. Now blind Americans can 'see' with devices atop their tongues. http://www.bloomberg.com/news/articles/2015-06-19/now-blind-americans-can-see-with-device-atop-their-tongues, 2015. Accessed: 2015-09-14.

[Bod] Sagittal plane. https://en.wikipedia.org/wiki/Sagittal_plane. Accessed: 2015-09-13.

[Bol01] Login Nikolaevich Bol'shev. Encyclopedia of mathematics. Springer, 2001.

[Bro95] Bronkhorst. Localization of real and virtual sound sources. Acoustical Society of America, 95:2542–2543, 1995.

[BT15] BrainPort Technologies, Wicab Inc. BrainPort V100. http://www.wicab.com/en_us/, 2015. Accessed: 2015-09-14.


[BW01] Durand R. Begault and Elizabeth M. Wenzel. Direct comparison of the impact of head tracking, reverberation, and individualized head-related transfer functions on the spatial perception of a virtual speech source. Audio Eng. Soc., 49(10):904–917, October 2001.

[CVPE14] Cesare V. Parise, Katharina Knorre, and Marc O. Ernst. Signal bandwidth necessary for horizontal sound localization. PNAS, 111(16):6104–6108, April 2014.

[EMWW93] Elizabeth M. Wenzel, Doris J. Kistler, and Frederic L. Wightman. Localization using nonindividualized head-related transfer functions. Acoustical Society of America, 94(1):111–123, July 1993.

[FPFD02] Fabio P. Freeland, Luiz Wagner P. Biscainho, and Paulo Sergio R. Diniz. Efficient HRTF interpolation in 3D moving sound. AES 22nd International Conference on Virtual, Synthetic and Entertainment Audio, 2002.

[Gar97] William G. Gardner. 3-D Audio Using Loudspeakers. PhD thesis, Massachusetts Institute of Technology, 1997.

[Gui14] Cost of a guide dog. https://www.guidedogs.org.uk/media/3701632/Cost-of-a-guide-dog-2013.pdf, 2014.

[Hab93] Lyn Haber. Comparison of nine methods of indicating the direction to objects: data from blind adults. Perception, 22:35–47, 1993.

[Han93] P. H. Chung Handel. Noise in physical systems, 1993.

[HRT] Head-related transfer function. https://en.wikipedia.org/wiki/Head-related_transfer_function. Accessed: 2015-09-12.

[Int] The CIPIC database. http://interface.cipic.ucdavis.edu/sound/tutorial/psych.html. Accessed: 2015-09-14.

[Iwa06] Yuko Iwaya. Individualization of head-related transfer function with tournament-style listening: Listening with other's ears. Acoust. Sci. Tech., 27(6):340–344, 2006.

[JGC04] Navarun Gupta, Armando Barreto, and Maroof Choudhury. Modeling head-related transfer functions based on pinna anthropometry, 2004.

[Mei92] Peter B. L. Meijer. An experimental system for auditory image representation. IEEE Transactions on Biomedical Engineering, 39(2):113–123, February 1992.


[Men12] Catarina Mendonca. On the improvement of localization accuracy with non-individualized HRTF-based sounds. Journal of the Audio Engineering Society, 2012.

[Mes00] Alok Meshram. P-HRTF: Efficient personalized HRTF computation for high-fidelity spatial sound, 2000.

[Mid90] John C. Middlebrooks. Two-dimensional sound localization by human listeners. Acoustical Society of America, 87(5):2188–2200, June 1990.

[Mid99] John C. Middlebrooks. Virtual localization improved by scaling non-individualized external-ear transfer functions in frequency. Acoustical Society of America, 106(3):1493–1511, September 1999.

[Mil72] Mills. Auditory localization, 1972.

[MIM06] Motokuni Itoh, Kazuhiro Iida, and Masayuki Morimoto. Individual differences in directional bands. The 9th Western Pacific Acoustic Conference, June 2006.

[MZA11] M. Zhang, R. A. Kennedy, and T. D. Abhayapala. Statistical method to identify key anthropometric parameters in HRTF individualization, 2011.

[Org14] World Health Organisation. Blindness, 2014.

[oS09] University of Southampton. Head-related impulse response measurement, 2009.

[PB92] Pinek and Brouchon. Head turning versus manual pointing to auditory targets in normal hearing subjects and in subjects with right parietal damage, 1992.

[Per15] Vinicius Pereira. Processing of Event-Based Stereoscopic Visual Information For Visual-To-Auditory Sensory Substitution. PhD thesis, Munich University of Technology, 2015.

[PJ02] Jan Abildgaard Pedersen and Torben Jorgensen. Localization performance of real and virtual sound sources, 2002.

[PML10] Piotr Majdak, Matthew J. Goupell, and Bernhard Laback. 3-D localization of virtual sound sources: Effects of visual environment, pointing method, and training. Attention, Perception & Psychophysics, 72(2):454–469, 2010.

[Poi07] Poirier. What neuroimaging tells us about sensory substitution. Neuroscience and Biobehavioral Reviews, 31:1064–1070, 2007.


[PP14] Philipp Paukner, Martin Rothbucher, and Klaus Diepold. Sound Localization Performance Comparison of Different HRTF-Individualization Methods. Munich University of Technology, 2014.

[RB67] Suzanne K. Roffler and Robert A. Butler. Factors that influence the localization of sound. Acoustical Society of America, pages 1255–1259, December 1967.

[RNI15] AMBUTECH RNIB. iGlasses. http://ambutech.com/new/?dest=/iglasses, 2015. Accessed: 2015-09-14.

[Ros09] Sheldon Ross. Introduction to probability and statistics for engineers and scientists, 2009.

[RST05] Rudolf Susnik, Jaka Sodnik, and Saso Tomazic. Coding of elevation in acoustic image of space, 2005.

[See03] Bernhard Seeber. Untersuchung der auditiven Lokalisation mit einer Lichtzeigermethode. PhD thesis, Munich University of Technology, 2003.

[SF03] Bernhard U. Seeber and Hugo Fastl. Subjective selection of non-individual head-related transfer functions. International Conference on Auditory Display, July 2003.

[Sph] Spherical coordinate system. https://en.wikipedia.org/wiki/Spherical_coordinate_system. Accessed: 2015-09-14.

[Ult] Ultracane. https://www.ultracane.com. Accessed: 2015-09-14.

[UoW] Hearing Research Center, University of Washington. Sound localization and the auditory scene. http://courses.washington.edu/psy333/lecture_pdfs/Week9_Day2.pdf. Accessed: 2015-09-14.

[VC75] Voss and Clarke. 1/f noise in music and speech, 1975.

[WK89a] Frederic L. Wightman and Doris J. Kistler. Headphone simulation of free-field listening. I: Stimulus synthesis. Acoustical Society of America, 85:858–867, February 1989.

[WK89b] Frederic L. Wightman and Doris J. Kistler. Headphone simulation of free-field listening. II: Psychophysical validation. Acoustical Society of America, 85(2):858–867, February 1989.


[yR20] Paul Bach y Rita. Sensory substitution and the human-machine interface. Trends in Cognitive Sciences, 7(12):541–546, 2003.

[Zot00] Dmitry N. Zotkin. HRTF personalization using anthropometric measurements, 2000.



License

This work is licensed under the Creative Commons Attribution 3.0 Germany License. To view a copy of this license, visit http://creativecommons.org or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California 94105, USA.