


JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION

Vol. 9, No. 1, March, pp. 51–61, 1998. Article No. VC980379

Signal Processing, Acoustics, and Psychoacoustics for High Quality Desktop Audio

Chris Kyriakakis,* Tomlinson Holman,† Jong-Soong Lim, Hai Hong, and Hartmut Neven

Integrated Media Systems Center, University of Southern California, Los Angeles, California 90089-2561

Received August 22, 1997; accepted March 4, 1998

Integrated media workstations are increasingly being used for creating, editing, and monitoring sound that is associated with video or computer-generated images. While the requirements for high quality reproduction in large-scale systems are well understood, these have not yet been adequately translated to the workstation environment. In this paper we discuss several factors that pertain to high quality sound reproduction at the desktop, including acoustical and psychoacoustical considerations, signal processing requirements, and the importance of dynamically adapting the reproduced sound as the listener's head moves. We present a desktop audio system that incorporates several novel design requirements and integrates vision-based listener-tracking for accurate spatial sound reproduction. We conclude with a discussion of the role the pinnae play in immersive (3D) audio reproduction and present a method of pinna classification that allows users to select a set of parameters that closely match their individual listening characteristics. © 1998 Academic Press

* E-mail: [email protected].
† Also with TMH Corporation, 3375 S. Hoover Str., Suite J, Los Angeles, CA 90007.

1047-3203/98 $25.00. Copyright © 1998 by Academic Press. All rights of reproduction in any form reserved.

1. INTRODUCTION

Numerous applications are currently envisioned for integrated media workstations. The principal function of such systems is to manipulate, edit, and display still images, video, and computer animation and graphics. The necessity, however, to accurately monitor the sound associated with visual images created and edited in the desktop environment has only recently been recognized. This is largely due to the increased use of digital audio workstations that have benefited from rapid advances both in main CPU computational power and in special-purpose DSPs. Many sound editing operations that could previously only be performed in calibrated (and very costly) dubbing stages are now routinely performed on digital audio workstations.

In addition to accurate reproduction of the measurable characteristics of sound (e.g., frequency response and dynamic range), multichannel and emerging 3D audio program material requires accurate spatial perception of sound as well in order to create a seamless aural environment and achieve sound localization relative to visual images. For such material, a mismatch between the aurally perceived and visually observed positions of a particular sound causes a cognitive dissonance that can seriously limit the desired suspension of disbelief [1].

Applications for high-quality desktop audio include professional sound editing for film and television, immersive telepresence, augmented and virtual reality, distance learning, and home entertainment. Such a wide variety of applications has led to an equally wide variety of interrelated, and at times conflicting, system requirements that arise from fundamental physical limitations as well as current technological drawbacks [2]. For example, while there have been advances in sound recording and reproduction technologies, as well as in the understanding of human sound perception mechanisms, these have not yet been combined in such a way as to achieve accurate synthesis of fully 3D auditory scenes. Furthermore, many acoustical and psychoacoustical issues that pertain to sound reproduction in large rooms have not yet been correctly translated to the desktop environment.

In this paper we examine several key issues in the implementation of high quality desktop-based audio systems. Such issues include the optimization of the frequency response over a given frequency range, the dynamic range, and stereo imaging subject to constraints imposed by room acoustics and human listening characteristics. Several problems that are particular to the desktop environment will be discussed, including the frequency response anomalies that arise due to the local acoustical environment, the proximity of the listener to the loudspeakers, the acoustics associated with small rooms, and the location and orientation of the listener's head relative to the loudspeakers. We will address these issues from three complementary perspectives: identification of limitations that affect the performance of desktop audio systems; evaluation of the current status of desktop audio system development with


respect to such limits; and delineation of technological considerations that impact present and future system design and development.

2. LIMITATIONS OF DESKTOP AUDIO SYSTEMS

There are two classes of limitations that impede the implementation of seamless audio reproduction. The first class encompasses limitations imposed by physical laws, and its understanding is essential for determining the feasibility of a particular technology with respect to the absolute physical limits. Many such fundamental limitations are not directly dependent on the choice of systems, but instead they pertain to the actual process of sound propagation and attenuation in irregularly shaped rooms. For example, in order to recreate an environment for a listener it is necessary to encode the acoustical characteristics of the remote location during the recording and then decode those characteristics locally in the user's environment. Furthermore, the influence of the local acoustic environment on the perception of spatial attributes such as direction and distance, as well as on colorations that arise from anomalies in the frequency response, must be taken into account. The situation is further complicated by the fact that the decoding process includes the physiological signal processing performed by the human hearing mechanisms. This processing translates level and time differences and direction-dependent frequency response effects caused by the pinna, head, and torso into sound localization cues through a set of amplitude and phase transformations known as the head-related transfer functions (HRTFs).

The second class of limitations contains a number of constraints that arise purely from technological considerations. These technological constraints are equally useful in understanding the potential applications of a given system and are imposed by the particular technology chosen for system implementation. For example, there are two choices for delivering sound in a desktop environment. The first is based on headphones that are capable of reproducing signals to each ear individually. While in certain applications this method can be very effective because it eliminates crosstalk, it suffers from three main drawbacks: (1) there are large errors in sound position perception associated with headphones, especially for the most important visual direction, out in front; (2) it is very difficult to externalize sounds and avoid the "inside-the-head" sensation; and (3) headphones are uncomfortable for extended periods of time [3, 4]. In this paper we will focus our discussion on loudspeaker-based reproduction.

3. REQUIREMENTS FOR HIGH QUALITY SOUND

A significant amount of work in the area of high quality sound production and reproduction in large rooms has originated from the film industry. A well-defined set of standards has been developed for sound monitoring conditions in dubbing stages to ensure the transparent reproduction of program material in theaters. Such standards include loudspeaker positioning for multichannel monitoring, loudspeaker frequency response and directivity requirements, precise sound pressure level calibration, control of room acoustics parameters (such as reverberation time and discrete reflections), and background noise levels. Meeting these standards ensures that material produced in one professional dubbing stage can be monitored under identical conditions in another dubbing stage or in a movie theater. The design challenge in desktop audio systems is to successfully map these standards onto the desktop environment through appropriate acoustical and psychoacoustical scaling and system design.

3.1. Acoustical Considerations

In a typical desktop sound monitoring environment, delivery of stereophonic sound is achieved through two loudspeakers that are typically placed on either side of a video or computer monitor. This environment, combined with the acoustical problems of small rooms, causes severe problems that contribute to audible distortion of the reproduced sound [5]. While an experienced professional can identify and correct for such problems during the monitoring stage, any changes made are permanently recorded and appear as errors during playback in a different environment. Among these problems the one most often neglected is the effect of discrete early reflections. The effects of such reflections on sound quality have been studied extensively [5–8] and it has been shown that they are the dominant source of monitoring nonuniformities when all the other standards discussed above have been met. These nonuniformities appear in the form of colorations (frequency response anomalies) in rooms with an early reflection level that exceeds −15 dB spectrum level relative to the direct sound for the first 15 ms [9, 10] (Figs. 1, 2). Such a high level of reflected sound gives rise to comb filtering in the frequency domain that in turn causes noticeable changes in timbre. The perceived effects of such distortions were not quantified until psychoacoustic experiments [6, 11] demonstrated their importance.

A potential solution that alleviates the problems of early reflections in small rooms is near-field monitoring. In theory, the direct sound is dominant when the listener is very close to the loudspeakers, thus reducing the room effects to below audibility. In practice, however, there are several issues that must be addressed in order to provide high quality sound. One such issue relates to the large reflecting surfaces that are typically present near the loudspeakers. Strong reflections from a console or a video/computer monitor act as baffle extensions for the loudspeaker, resulting in a boost of mid-bass frequencies. Furthermore, even if it were possible to place the loudspeakers far away from large reflecting surfaces, this would only solve the problem for middle and high frequencies. Low frequency room modes do not depend on surfaces in the local acoustical environment, but rather on the physical size of the room. These modes produce standing waves that give rise to large variations in frequency response (Fig. 3).

FIG. 1. Desktop sound system time response that meets the psychoacoustic requirements for low-level early reflections. The spectrum level of the reflected sound is more than 15 dB below that of the direct sound.

FIG. 2. Desktop sound system time response that violates the requirements for low-level early reflections. The early reflection peaks at 1.2 ms and 6 ms give rise to a spectrum level that is above the −15 dB criterion.
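The comb filtering produced by a discrete early reflection is easy to illustrate numerically. The sketch below is ours, not the authors': the 1.2 ms delay matches the first reflection peak of Fig. 2, while the −6 dB reflection gain (`gain = 0.5`) is an arbitrary illustrative value chosen to violate the −15 dB criterion.

```python
import numpy as np

def comb_response(delay_s, reflection_gain, freqs):
    """Magnitude response of a direct sound plus one delayed reflection:
    |1 + g * exp(-j*2*pi*f*tau)|."""
    return np.abs(1.0 + reflection_gain * np.exp(-2j * np.pi * freqs * delay_s))

delay = 1.2e-3        # 1.2 ms early reflection, as in Fig. 2
gain = 0.5            # -6 dB reflection, well above the -15 dB criterion
freqs = np.linspace(20.0, 20000.0, 20000)
mag_db = 20.0 * np.log10(comb_response(delay, gain, freqs))
```

The notches fall at odd multiples of 1/(2τ) ≈ 417 Hz for τ = 1.2 ms, with a peak-to-notch span of about 9.5 dB for this gain, which is why a single strong early reflection is heard as a change in timbre rather than as an echo.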

FIG. 3. Frequency response of a desktop loudspeaker system that clearly shows the effects of the local acoustical environment. There are large peaks and dips that give rise to significant audible distortion in the reproduced sound. Also note that there is no bass reproduction below 150 Hz.

Finally, another factor that has a negative effect on the quality of reproduced sound relates to the physical size of the loudspeakers. Typical two-way designs in which the tweeter is physically separated from the woofer exhibit strong radiation pattern changes in the crossover frequency range. Amplitude and phase matching in this frequency range becomes critical and as a result such speakers are extremely sensitive to placement and typically produce a flat frequency response for direct sound in one exact position. This limitation makes typical two-way speakers unsuitable for near-field monitoring.

The current state-of-the-art in desktop reproduction systems is rather poor both for low-cost as well as for high-cost near-field monitors (Fig. 3). As can be clearly seen from the measured frequency response, there are large deviations from flat response that arise from a combination of loudspeaker design and acoustical environment drawbacks. The sound reproduced by such systems does not meet the standards required for professional applications and does a very poor job at translating the experience of a large theater or dubbing stage to the desktop. Furthermore, such distortions in the reproduced sound can obscure problems present in the original recording that only become apparent in the finished product.

3.2. Design Requirements

In order to address the problems described above, a set of solutions has been developed for single listener desktop reproduction that delivers sound quality equivalent to a calibrated dubbing stage [5]. These solutions include:

Direct-path dominant design. By combining elements of psychoacoustics in the system design, it is possible to place the listener in a direct sound field that is dominant over the reflected and reverberant sound. The colorations that arise due to such effects are eliminated and this results in a listening experience that is dramatically different from what is achievable through traditional near-field monitoring methods. The design considerations for this direct-path dominant design include the effect of the video/computer monitor that extends the loudspeaker baffle, as well as the large reflecting surface on which the computer keyboard typically rests.

Correct low-frequency response. There are severe problems in the uniformity of low-frequency response that arise from the standing waves associated with the acoustics of small rooms. Such anomalies can give rise to variations as large as ±15 dB for different listening locations in a typical room. The advantage of desktop audio systems lies in the fact that the position of the loudspeakers and, to a large extent, the listener are known a priori. It is, therefore, possible to use equalization to produce very smooth low-frequency response. One fundamental limitation imposed by small room acoustics is that this can only be achieved for a relatively small volume of space centered around the listener. One possible solution to this problem can be found by tracking the listener's position and adjusting the equalization dynamically. An early version of such a system is described in a later section of this paper.

3.3. Equalization Requirements

The desktop environment presents an unusual set of requirements for equalization as compared to sound systems designed for larger venues. Conventional processing methods based on 1/3-octave-band, constant-Q equalization that are derived from the critical band theory of hearing are not applicable in this case. The basic assertion of critical band theory is that frequency components that lie very close to each other are perceived differently from those components that are further apart. In a conventional listening environment in which the listener typically sits farther away from the loudspeakers (and thus perceives more of the reverberant field), it is possible to equalize using conventional methods. In a desktop listening environment, however, this theory breaks down because the direct field is dominant and the effects of the room are only present at low frequencies.

The number of standing waves per 1/3-octave necessary for a diffuse sound field is high enough only above a certain frequency (called the Schroeder frequency). Below this frequency standing waves can give rise to level variations that can cause frequency components that lie very close to each other to be reproduced at very different levels. This violates the conditions necessary for equalization based on critical band theory and instead necessitates the use of parametric equalizers with filters that can be precisely tuned in center frequency and bandwidth.

One advantage provided by desktop sound systems over their large room counterparts arises from the fact that standing waves are formed very rapidly as compared to the buildup time observed in large rooms. In a large room the equalized steady-state sound combined with nonequalized reflected (transient) sound can give rise to a worse overall sound quality. In the desktop environment the time difference of arrival between the direct and transient sound is so small that equalization of the steady-state sound is perceived as optimal for the combined sound as well.

4. LISTENER LOCATION CONSIDERATIONS

In large rooms multichannel sound systems are used to convey sound images that are primarily confined to the horizontal plane and are uniformly distributed over the audience area. Typical systems used for cinema reproduction use three front channels (left, center, right), two surround channels (left and right surround), and a separate low-frequency channel. Such 5.1 channel systems are designed to provide accurate sound localization relative to visual images in front of the listener and diffuse (ambient) sound to the sides and behind the listener. The use of a center loudspeaker helps create a solid sound image between the left and right loudspeakers and anchors the sound to the center of the stage.

For desktop applications, in which a single user is located in front of a CRT display, we no longer have the luxury of a center loudspeaker because that position is occupied by the display. In such cases sound is reproduced mainly through the use of two loudspeakers placed symmetrically on either side of the CRT and two surround loudspeakers placed to the side and above the listening position. Size limitations prevent the front loudspeakers from being capable of reproducing the entire spectrum; thus a separate subwoofer loudspeaker is used to reproduce the low frequencies. The two front loudspeakers can create a virtual (phantom) image that appears to originate from the exact center of the display, provided that the listener is seated symmetrically with respect to the loudspeakers. With proper head and loudspeaker placement, it is possible to recreate a spatially accurate soundfield with the correct frequency response in one exact position, the sweet spot. However, even in this static case, the sound originating from each loudspeaker arrives at each ear at different times (about 200 µs apart), thereby giving rise to acoustic crosstalk. These time differences combined with reflection and diffraction effects caused by the head lead to frequency response anomalies that are perceived as a lack of clarity [12].

This problem can be solved by adding a crosstalk cancellation filter to the signal of each loudspeaker. The idea is to design a filter that generates a signal out of phase and delayed by the amount of time it takes the sound to reach the opposite ear. This signal combines with the in-phase signal from the opposite loudspeaker to create a cancellation of the undesired crosstalk. This method was initially introduced by Schroeder and Atal [13] and later refined by Cooper and Bauck [14], who coined the term "transaural audio." While this solution may be satisfactory for the static case, as soon as the listener moves even slightly, the conditions for cancellation are no longer met and the phantom image moves toward the closest loudspeaker because of the precedence effect. In order, therefore, to achieve the highest possible quality of sound for a nonstationary listener and preserve the spatial information in the original material, it is necessary to know the precise location of the listener relative to the loudspeakers. In the section below we describe an experimental system that incorporates a novel listener-tracking method in order to overcome the difficulties associated with two-ear listening, as well as the technological limitations imposed by loudspeaker-based desktop audio systems.

4.1. Vision-Based Listener Tracking

Computer vision has historically been considered problematic, particularly for tasks that require object recognition. Up to now the complexity of vision-based approaches has prevented them from being incorporated into desktop-based integrated media systems. Recently, however, the Laboratory of Computational and Biological Vision at USC has developed a vision architecture that is capable of recognizing the identity, spatial position (pose), facial expression, gestures, and movement of a human subject in real time. This highly versatile architecture integrates a broad variety of visual cues in order to identify the location and orientation of a listener's head within the image.

The first step in determining head position involves finding the listener's silhouette. This is accomplished by first performing motion detection based on difference images under the assumption that the cameras are fixed in space. A conventional stereo algorithm is then used to detect pixel disparity within the regions of the image that are moving. The tracking accuracy of this algorithm is typically higher than traditional stereo algorithms because we confine our search only to those regions that are moving. Once the disparities for changing pixels have been determined, a disparity histogram is used to detect disparity intervals that are characterized by strong image motion. This histogram represents the number of changing pixels as a function of their disparity (Fig. 4a). A moving person typically gives rise to a local maximum in this representation. We then construct a binary silhouette image by activating the pixels that correspond to a local maximum in the disparity histogram. A silhouette image is thus generated for every section of the image that is moving.

Following the silhouette detection process we use two additional detection processes to find the location of the head within the silhouette. The first uses a lookup table to check for colors corresponding to skin tones and the second identifies regions of the silhouette that are convex (Fig. 4b). The binary outputs from both detectors are clustered and bounding boxes are computed for each cluster whose size is likely to correspond to the size of the head at the distance where the associated silhouette image was detected. The center of the head position is computed from the center of the bounding box and the disparity associated with the silhouette image based on a simple pinhole camera model. The estimates of discrete head positions are then converted to trajectories.

In order to make this algorithm practical for desktop audio applications, it is necessary to account for periods of time during which the listener's head does not move. The algorithm first performs a thinning that assigns a single representative position estimate to closely spaced estimates. This representative estimate is then checked to see if it belongs to an existing trajectory. Under the assumption of spatio-temporal continuity, for every position estimate in frame M the algorithm finds the closest head position determined for the previous frame (M − 1) and connects it to the current estimate. If no estimate can be found that is sufficiently close, then it is assumed that a new head appeared in the image.

While there are several alternative methods for tracking humans (e.g., magnetic, ultrasound, infrared, laser), they are typically based on tethered operation or require artificial fiducials to be worn by the user. Furthermore, these methods do not offer any additional functionality to match what can be achieved with vision-based methods (e.g., face and expression recognition, ear classification). In the following section we describe a novel desktop audio system that we have developed that meets all of the design requirements and acoustical considerations described above and incorporates the vision-based tracking algorithm that allows us to modify the reproduced signal in response to listener movements.

5. DESKTOP AUDIO SYSTEM WITH HEAD TRACKING

Our prototype desktop audio system is based on the MicroTheater™ system developed by TMH Corporation [15]. This is a desktop multichannel system that was designed to provide professional sound editors with a monitoring platform that translates the experience of a dubbing stage to the desktop using a combination of acoustic, psychoacoustic, and signal processing methods. The frequency response anomalies that were present due to the local acoustical environment have been eliminated, as can be seen in the frequency response plot that is very flat (±2 dB) from 30 Hz to 20 kHz (Fig. 5).

For the head-tracking experiment we used only the two front loudspeakers, which are positioned on the sides of a video monitor at a distance of 45 cm from each other and 50 cm from the listener's ears (Fig. 5). The seating position height is adjusted so that the listener's ears are at the tweeter level of the loudspeakers (117 cm from the floor), which, combined with the high-horizontal-directivity design of the loudspeakers, minimizes colorations in the sound due to off-axis lobing.

The vision-based tracking algorithm described above has been incorporated using a standard video camera connected to an SGI Indy workstation. This tracking system provides us with the coordinates of the center of the listener's head relative to the loudspeakers and is currently capable of operating at 10 frames/s with an error of ±1.5 cm. The goal of the experiment was to render a virtual (phantom) sound source in the center of the screen while the head of the listener is moving left or right in the plane parallel to the loudspeaker baffles.
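The silhouette-extraction step of the tracking algorithm (Section 4.1) can be sketched as follows. This is our own minimal illustration, not the authors' implementation: the function name, the bin count, and the selection of only the single strongest peak are our simplifications, whereas the actual algorithm keeps every local maximum and produces one silhouette per moving region.

```python
import numpy as np

def moving_silhouette(disparity, changed, n_bins=64):
    """Histogram the disparities of changed (moving) pixels, take the
    strongest peak as the disparity interval of a moving object, and
    activate the changed pixels that fall inside that interval."""
    hist, edges = np.histogram(disparity[changed], bins=n_bins)
    peak = int(np.argmax(hist))          # strongest image motion
    lo, hi = edges[peak], edges[peak + 1]
    # np.histogram's last bin is closed on the right, so use <= here
    return changed & (disparity >= lo) & (disparity <= hi)
```

Head candidates would then be sought inside each silhouette by the skin-tone and convexity detectors, as described above.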

Page 7: Signal Processing, Acoustics, and Psychoacoustics for High Quality Desktop Audio

SIGNAL PROCESSING AND ACOUSTICS 57

FIG. 4. (a) The first step in the vision-based tracking algorithm involves motion detection from a disparity image. A disparity histogramis generated and the local maxima are used to generate a silhouette of the moving images. (b) In the second stage of the vision-basedalgorithm skin-colored and convex regions are identified. The results of this search are combined with the motion results to estimate theposition of the head.

listener in this plane, there is a relative time difference of arrival between the sound signals from each loudspeaker (Fig. 6). This time difference causes the perceived location of the sound image to shift toward the loudspeaker that is closer to the listener. In order to maintain proper stereophonic perspective, the ipsilateral time delay must be adjusted as the listener moves relative to the loudspeakers.

The head coordinates provided from the tracking algorithm were used to determine the necessary time delay adjustment. This information is processed by a 32-bit DSP processor board (ADSP-2106x SHARC) resident in a Pentium-based PC. The required relative time delay between the two channels varies from 0 μs in the center spot to 340 μs in the extreme left or right positions. The DSP board is used to delay the sound from the loudspeaker that is closest to the listener so that sound arrives with the


FIG. 5. Frequency response of desktop loudspeaker system designed using the direct-path dominant and correct low-frequency response guidelines described in the text. The response has been corrected with minimal parametric equalization and is relatively flat (±2 dB) from 30 Hz to 20 kHz. The solid line represents the on-axis response and the dotted line represents the 10° off-axis (vertical) response.

same time difference as if the listener were positioned in the exact center between the loudspeakers. In other words, we have demonstrated stereophonic reproduction with an adaptively optimized sweet spot. To achieve seamless operation for continuous listener movement, a linear interpolation scheme was used to address the problem of audible clicks that result from instantaneous changes in the digital delay between the two channels.
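The delay-update scheme described above can be sketched as follows. This is a minimal illustration in Python rather than the authors' SHARC DSP code; the sample rate and block handling are assumed for the example. The delay is ramped linearly across each block of samples and realized by linear interpolation between adjacent samples, so a tracker update never produces an instantaneous jump in the delay line.

```python
import numpy as np

FS = 48000  # assumed sample rate (Hz); 340 us is about 16 samples at 48 kHz


def ramped_delay(x, d0_us, d1_us, fs=FS):
    """Delay x by an amount that ramps linearly from d0_us to d1_us
    (microseconds) over the block. The fractional delay is realized by
    linear interpolation between adjacent samples, avoiding the audible
    click that an instantaneous change in an integer delay would cause."""
    x = np.asarray(x, dtype=float)
    n = x.size
    d = np.linspace(d0_us, d1_us, n) * 1e-6 * fs      # per-sample delay (samples)
    pad = int(np.ceil(d.max())) + 1                   # zeros for start-up history
    xp = np.concatenate([np.zeros(pad), x, [0.0]])    # trailing guard sample
    pos = np.arange(n) + pad - d                      # fractional read positions
    i = np.floor(pos).astype(int)
    frac = pos - i
    return (1.0 - frac) * xp[i] + frac * xp[i + 1]    # linear interpolation
```

For the geometry reported in the text, successive blocks of the nearer loudspeaker's channel would ramp between delays in the 0-340 μs range as the tracker reports new head positions.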

FIG. 6. The relative delay in the time of arrival of the direct sound from each loudspeaker to each (same-side) ear as a function of head position in the horizontal plane parallel to the loudspeakers. The geometry of our experimental desktop sound system is shown in the inset.
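A curve of the kind shown in Fig. 6 follows directly from the path-length difference between the two loudspeakers and the head position. The sketch below uses an assumed loudspeaker spacing and listening distance (not the dimensions of the experimental setup, which are given in the figure inset) and computes the delay to the center of the head; the measurement in the figure is to the same-side ear, which shifts the curve slightly.

```python
import math

C = 343.0  # approximate speed of sound in air (m/s)


def relative_delay_us(x, spacing=0.6, distance=0.5):
    """Relative arrival-time difference (microseconds) between the two
    loudspeaker signals for a head displaced x meters from the center
    line, in the plane parallel to the loudspeaker baffles. spacing is
    the distance between the loudspeakers and distance the perpendicular
    distance to the baffle plane; both are assumed example values.
    A positive result means the far loudspeaker's sound arrives later,
    i.e., the near channel must be delayed by this amount."""
    d_far = math.hypot(distance, x + spacing / 2)   # path from far speaker
    d_near = math.hypot(distance, x - spacing / 2)  # path from near speaker
    return (d_far - d_near) / C * 1e6
```

At the sweet spot (x = 0) the function returns 0 μs, and it grows monotonically and antisymmetrically as the head moves off-center, matching the shape of the curve in the figure.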


While the enhanced functionality provided by the head tracking system is promising, there are still several issues that must be addressed. We are currently in the process of identifying the computational bottlenecks of both the tracking and the audio signal processing algorithms and integrating both into a low-cost PC-based platform for real-time operation (30 frames/s). Furthermore, we are expanding the capability of the current single-camera system to include a second camera in a stereoscopic configuration that will provide depth information and pose estimation for head rotations.

6. DESKTOP IMMERSIVE AUDIO

Desktop audio systems, such as those associated with multimedia PC's, are increasingly being used for reproduction of program material that makes use of 3D audio processing. Such processing typically relies on head-related transfer functions that have been averaged over a number of test subjects or measured using a dummy-head ear. Systems based on such nonindividualized HRTF's have been shown to suffer from serious drawbacks that arise from the fact that each listener usually has characteristics that are significantly different from the average ear [16]. Furthermore, mapping the entire three-dimensional auditory space requires a large number of tedious and time-consuming measurements. This process must be repeated for every intended listener in order to produce accurate results.

6.1. Vision-Based Pinna Classification

The human pinna is a rather sophisticated instrument that has been shown to play a key role in sound localization [17–19]. The pinna folds act as miniature reflectors that create small time delays which in turn give rise to comb filtering effects in the frequency domain. These ridges are arranged in such a way as to optimally translate a change in angle of the incident sound into a change in the pattern of reflections. It has been demonstrated [17] that the human ear–brain interface can detect delay differences as short as 7 μs. Furthermore, as the sound source is moved toward 180° in azimuth (directly behind the listener) the pinna also acts as a low-pass filter, thus providing additional localization cues.

In order to circumvent the inaccuracies in sound localization that arise from variations in the pinna characteristics of different listeners, we are developing a method for pinna classification. The novelty of our approach is that it is based on visual recognition of pinna physiology and selection of the appropriate set of HRTF filters. We are currently in the process of establishing a database of pinna images and associated measured directional characteristics. A picture of the pinna from every new listener will allow us to select the HRTF from our database that corresponds to the ear whose pinna shape is closest to the new ear.

The basic principles used in this vision-based ear classification scheme rely on the elastic graph matching method that places graph nodes at appropriate fiducial points of the pattern [20]. Selected features from a new pinna shape can then be compared with those in the database to determine the best match. In elastic graph matching, visual features from an image are represented in the form of a vector (jet). Each component of these multidimensional vectors is the result of the convolution of a local grey level value with a Gabor wavelet of a particular frequency and orientation. Jets are calculated at several different points from a modelgraph that is chosen to represent the object to be classified (Fig. 7). The modelgraph we designed for the pinna consists of 19 jets, each representing one key geometrical feature on the pinna. We used five frequencies and eight orientations for the Gabor wavelets, for a total of 40 components in each jet.

Comparison among jets and graphs is performed through a similarity function that is defined as the normalized dot product of two jets over the entire modelgraph. This gives us a method for comparison that is robust to changes in illumination and contrast. The first step in the procedure we used was to manually place nodes on key features of 13 pinnae to create appropriate modelgraphs (Fig. 7). These modelgraphs were then used to automatically find the modelgraph of any new pinna.

Initial results have shown successful matching of ears from unknown listeners to those already in our database, including two artificial ears from the KEMAR dummy-head system. We are currently in the process of performing listening tests to determine the improvement in localization that results from the pinna matching. We are also working on an improved version of the matching method that will select transfer function characteristics from several stored pinnae to best match the corresponding characteristics of the new pinna. An appropriate set of weighting factors will then be determined to form a synthetic HRTF that closely resembles that of the new listener.

7. CONCLUSIONS

We have examined the acoustical, psychoacoustical, and signal processing design requirements for implementing desktop audio systems for high fidelity sound reproduction. We proposed a set of solutions that pertain to the loudspeaker design in order to place the listener in the direct-path dominant field, adjustments for reflecting and diffracting surfaces in the local acoustical environment, and parametric equalization that is not based on 1/3 octave bands. We also presented a desktop audio system design that incorporates a novel listener-tracking algorithm based on principles of computer vision. This novel system adjusts


FIG. 7. Two modelgraphs from our pinna database are shown here. The nodes correspond to the location of the jets that carry the Gabor wavelet convolution information.

the output from each loudspeaker in real time based on the location of the listener's head. Finally, we presented a method for pinna classification based on elastic graph matching that can be used to select the appropriate set of head-related transfer function filters by performing a visual match with one of the measured pinnae in our database. We are currently working on implementing a system that incorporates all of these features to render multichannel program material from just two loudspeakers for a moving listener.

ACKNOWLEDGMENTS

The authors thank Professor Christoph von der Malsburg from the USC Laboratory for Computational and Biological Vision for his guidance and support. This research has been funded in part by the Integrated Media Systems Center, a National Science Foundation Engineering Research Center, with additional support from the Annenberg Center for Communication at USC and the California Trade and Commerce Agency.

REFERENCES

1. B. Shinn-Cunningham, Adapting to discrepant information in multimedia displays, in 134th Meeting of the Acoustical Society of America, San Diego, California, 1997.
2. C. Kyriakakis, Fundamental and technological limitations of immersive audio systems, IEEE Proceedings: Special Issue on Multimedia Signal Processing, June 1998, to appear.
3. F. L. Wightman, D. J. Kistler, and M. Arruda, Perceptual consequences of engineering compromises in synthesis of virtual auditory objects, J. Acoust. Soc. Am. 101, 1992, 1050–1063.
4. D. R. Begault, Challenges to the successful implementation of 3-D sound, J. Audio Eng. Soc. 39, 1991, 864–870.
5. T. Holman, Monitoring sound in the one-person environment, SMPTE J. 106, 1997, 673–678.
6. F. E. Toole, Loudspeaker measurements and their relationship to listener preferences, J. Audio Eng. Soc. 34, 1986, 227–235.
7. S. Bech, Perception of timbre of reproduced sound in small rooms: influence of room and loudspeaker position, J. Audio Eng. Soc. 42, 1994, 999–1007.
8. S. E. Olive and F. E. Toole, The detection of reflections in typical rooms, J. Audio Eng. Soc. 37, 1989, 539–553.
9. R. Walker, Early reflections in studio control rooms: The results from the first controlled image design installations, in 96th Meeting of the Audio Engineering Society, Amsterdam, 1994.
10. T. Holman, Report on mixing studios sound quality, J. Jpn. Audio Soc., 1994.
11. F. E. Toole, Subjective measurements of loudspeaker sound quality and listener performance, J. Audio Eng. Soc. 33, 1985, 2–32.
12. T. Holman, New factors in sound for cinema and television, J. Audio Eng. Soc. 39, 1991, 529–539.
13. M. R. Schroeder and B. S. Atal, Computer simulation of sound transmission in rooms, IEEE Int. Conv. Record 7, 1963.
14. D. H. Cooper and J. L. Bauck, Prospects for transaural recording, J. Audio Eng. Soc. 37, 1989, 3–19.
15. TMH Corporation, http://www.tmhlabs.com.
16. E. M. Wenzel, M. Arruda, and D. J. Kistler, Localization using nonindividualized head-related transfer functions, J. Acoust. Soc. Am. 94, 1993, 111–123.
17. J. Hebrank and D. Wright, Spectral cues used in the localization of sound sources in the median plane, J. Acoust. Soc. Am. 56, 1974, 1829–1834.
18. J. Blauert, Spatial Hearing: The Psychophysics of Human Sound Localization, Revised Edition, MIT Press, Cambridge, MA, 1997.
19. D. W. Batteau, The role of the pinna in human localization, Proc. R. Soc. London B 168, 1967, 158–180.
20. L. Wiskott, J. M. Fellous, N. Krueger, and C. von der Malsburg, Face Recognition by Elastic Bunch Graph Matching, Tech. Report No. 8, Institute for Neuroinformatics, Bochum, 1996.


CHRIS KYRIAKAKIS is assistant professor and an investigator in the Integrated Media Systems Center (IMSC), a National Science Foundation Engineering Research Center at the University of Southern California. He received his B.S. degree in Electrical Engineering from the California Institute of Technology in 1985 and his M.S. and Ph.D. degrees in Electrical Engineering from USC in 1987 and 1993, respectively. He is a member of the IEEE, the Audio Engineering Society, and the Acoustical Society of America. Prof. Kyriakakis' research interests include the design and implementation of novel audio- and video-based human–computer interfaces, signal processing for immersive 3D sound reproduction, hybrid optical and electronic devices for smart camera and 3D display applications, immersivision, and head-mounted displays for virtual and augmented reality applications.

TOMLINSON HOLMAN is associate professor in the School of Cinema–Television and an investigator in the Integrated Media Systems Center of the School of Engineering at the University of Southern California. He is president of TMH Corporation, where he develops entertainment technology. He developed the Lucasfilm THX Division's offerings: the THX Sound System for motion-picture theaters, Home THX, and the THX Laser Disc Program, and was technical director for the design phase of Skywalker Ranch. Holman holds six U.S. and corresponding patents, licensed by over 50 companies. He holds Fellowships of the Audio Engineering Society, the British Kinematograph Sound and Television Society, and the Society of Motion Picture and Television Engineers. He is a member of the Acoustical Society of America and the IEEE. He won the Samuel L. Warner Medal, the Eastman Kodak Gold Medal from SMPTE, and the 1996 Career Achievement Award from the Cinema Audio Society.

JONG-SOONG LIM was born in Taegu, Korea, in 1964. He earned B.S. and M.S. degrees in electronics engineering from Kyung-Pook National University, Taegu, Korea, in 1986 and 1988, respectively. He is currently an electrical engineering Ph.D. student at the Integrated Media Systems Center at the University of Southern California. Prior to his graduate work, he was employed for eight years with the LG Electronics Central Research Center in Seoul, Korea. His current interests include crosstalk cancellation filter design for real-time 3-D audio implementations.

HAI HONG earned a B.S. degree in Computer Science in 1992 and an M.S. degree in Electrical Engineering in 1995, both from Tsinghua University, P. R. China. Since August 1995 he has been working toward the Ph.D. degree in Computer Science at the University of Southern California. His research interests and previous publications are in the areas of facial gesture and expression recognition, and multimedia processing.

HARTMUT NEVEN is a research assistant professor at the University of Southern California, where he leads the research of the Computational and Biological Vision Laboratory. He is co-founder and vice-president of R&D at Eyematic Interfaces, a start-up company specializing in human–computer interfaces. In 1996 he received his Ph.D. with honors from the Institute for Neuroinformatics in Bochum, Germany, with a thesis on "Dynamics of vision-guided autonomous mobile robots." He studied Physics and Economics in Köln, Paris, Tübingen, Aachen, and Jerusalem. He received a fellowship from the Studienstiftung des deutschen Volkes, an organization supporting the top 0.5% of German students.