optimal crosstalk cancellation for binaural audio with … · optimal crosstalk cancellation for...

24
Optimal Crosstalk Cancellation for Binaural Audio with Two Loudspeakers Edgar Y. Choueiri Princeton University [email protected] Crosstalk cancellation (XTC) yields high-spatial-fidelity reproduction of binaural audio through loudspeakers allowing a listener to perceive an accurate 3-D image of a recorded soundfield. Such accurate 3-D sound reproduction is useful in a wide range of applications in the medical, military and commercial audio sectors. However, XTC is known to add a severe spectral coloration to the sound and that has been an impediment to the wide adoption of loudspeaker-based binaural audio. The nature of this coloration in two-loudspeaker XTC systems, and the fundamental aspects of the regularization methods that can be used to optimally control it, were studied analytically using a free-field two-point-source model. It was shown that constant-parameter regularization, while effective at decreasing coloration peaks, does not yield optimal XTC filters, and can lead to the formation of roll-offs and doublet peaks in the filter’s frequency response. Frequency-dependent regularization was shown to be significantly better for XTC optimization, and was used to derive a prescription for designing optimal two-loudspeaker XTC filters, whereby the audio spectrum is divided into adjacent bands, each of is which associated with one of three XTC impulse responses, which were derived analytically. Aside from the sought fundamental insight, the analysis led to the formulation of band-assembled XTC filters, whose optimal properties favor their practical use for enhancing the spatial realism of two-loudspeaker playback of standard stereo recordings containing binaural cues. I. INTRODUCTION A. Background and Motivation The ultimate goal of binaural audio with loudspeakers (BAL), also known as transauralization[1], is to repro- duce, at the entrance of each of the listener’s ear canals, the sound pressure signals recorded on only the ipsilat- eral channel of a stereo signal. If the stereo signal [2] was encoded with the head-related transfer function (HRTF) of the listener, and includes the proper ITD (interau- ral time difference) and ILD (interaural level difference) cues, then delivering the signal on each of the channels of the stereo signal to the ipsilateral ear, and only to that ear, would ideally guarantee that the ear-brain system receives the cues it needs to hear an accurate 3-D repro- duction of the recorded soundfield. Since, with playback from two loudspeakers, each of them is also heard by the contralateral ear (crosstalk), approaching the goal of BAL requires an effective cancellation of this unintended crosstalk. In addition to crosstalk cancellation (XTC), effective BAL requires an abatement of sound reflections in the listening room, which cause degradation to the integrity of the binaural cues at the listener’s ears[3–5]. While this problem can be somewhat alleviated through pre- scriptions that increase the ratio of direct to reflected sound, full disambiguation of front-back sound local- ization through BAL has been shown to require XTC levels[6] above 20 dB, which are difficult to achieve prac- tically even under anechoic conditions[3]. Therefore, it would seem that the goal stated in the first paragraph could be more naturally reached with binaural audio through headphones (or earphones)[7] as both crosstalk and room reflections would be non- existent. However, with earphones or headphones, the location of the playback transducers in or very near the ears means that non-idealities, (e.g., mismatches between the HRTF of the listener and that used to encode the recording, movement of the perceived sound image with movement of the listener’s head, lack of bone-conducted sound, transducer-induced resonances in the ear canal, discomfort, etc.), when above a certain threshold, can lead to difficulties in perceiving a realistic 3-D image and to the perception that the sound (or some of its spectral components) is inside, or too close to, the listener’s head. Binaural playback through loudspeakers is largely im- mune to this head internalization of sound, for even when non-idealities in binaural reproduction are present, the sound originates far enough from the listener to be per- ceived to come from outside the head. Furthermore, cues such as bone-conducted sound and the involvement of the listener’s own head, torso and pinnae in sound diffraction and reflection during playback (even if it departs from, or interferes with, the diffraction-induced coloration repre- sented in the HRTF used to encode the binaural record- ing) could be expected to enhance the perceived realism of sound reproduction relative to that achieved with ear- phones. These potential advantages have, implicitly or explicitly, motivated the development of XTC-enabled BAL since the earliest work on the subject[8–10]. In scientific applications of BAL, such as its application to study spatial hearing disabilities of elderly adults[3], the high levels of XTC (above 20 dB) needed for highly- accurate transmission of binaural cues to a listener re- quire anechoic, or semi-anechoic, environments, precise matching of the listener’s HRTF with that used in the recording, and constrained positioning of the listener’s

Upload: vuongphuc

Post on 21-Jul-2018

260 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: Optimal Crosstalk Cancellation for Binaural Audio with … · Optimal Crosstalk Cancellation for Binaural Audio with Two Loudspeakers Edgar Y. Choueiri Princeton University choueiri@princeton.edu

Optimal Crosstalk Cancellation for Binaural Audio with Two Loudspeakers

Edgar Y. ChoueiriPrinceton University

[email protected]

Crosstalk cancellation (XTC) yields high-spatial-fidelity reproduction of binaural audio throughloudspeakers allowing a listener to perceive an accurate 3-D image of a recorded soundfield. Suchaccurate 3-D sound reproduction is useful in a wide range of applications in the medical, militaryand commercial audio sectors. However, XTC is known to add a severe spectral coloration to thesound and that has been an impediment to the wide adoption of loudspeaker-based binaural audio.The nature of this coloration in two-loudspeaker XTC systems, and the fundamental aspects of theregularization methods that can be used to optimally control it, were studied analytically usinga free-field two-point-source model. It was shown that constant-parameter regularization, whileeffective at decreasing coloration peaks, does not yield optimal XTC filters, and can lead to theformation of roll-offs and doublet peaks in the filter’s frequency response. Frequency-dependentregularization was shown to be significantly better for XTC optimization, and was used to derivea prescription for designing optimal two-loudspeaker XTC filters, whereby the audio spectrum isdivided into adjacent bands, each of is which associated with one of three XTC impulse responses,which were derived analytically. Aside from the sought fundamental insight, the analysis led to theformulation of band-assembled XTC filters, whose optimal properties favor their practical use forenhancing the spatial realism of two-loudspeaker playback of standard stereo recordings containingbinaural cues.

I. INTRODUCTION

A. Background and Motivation

The ultimate goal of binaural audio with loudspeakers(BAL), also known as transauralization[1], is to repro-duce, at the entrance of each of the listener’s ear canals,the sound pressure signals recorded on only the ipsilat-eral channel of a stereo signal. If the stereo signal [2] wasencoded with the head-related transfer function (HRTF)of the listener, and includes the proper ITD (interau-ral time difference) and ILD (interaural level difference)cues, then delivering the signal on each of the channels ofthe stereo signal to the ipsilateral ear, and only to thatear, would ideally guarantee that the ear-brain systemreceives the cues it needs to hear an accurate 3-D repro-duction of the recorded soundfield. Since, with playbackfrom two loudspeakers, each of them is also heard bythe contralateral ear (crosstalk), approaching the goal ofBAL requires an effective cancellation of this unintendedcrosstalk.

In addition to crosstalk cancellation (XTC), effectiveBAL requires an abatement of sound reflections in thelistening room, which cause degradation to the integrityof the binaural cues at the listener’s ears[3–5]. Whilethis problem can be somewhat alleviated through pre-scriptions that increase the ratio of direct to reflectedsound, full disambiguation of front-back sound local-ization through BAL has been shown to require XTClevels[6] above 20 dB, which are difficult to achieve prac-tically even under anechoic conditions[3].

Therefore, it would seem that the goal stated in thefirst paragraph could be more naturally reached withbinaural audio through headphones (or earphones)[7]

as both crosstalk and room reflections would be non-existent. However, with earphones or headphones, thelocation of the playback transducers in or very near theears means that non-idealities, (e.g., mismatches betweenthe HRTF of the listener and that used to encode therecording, movement of the perceived sound image withmovement of the listener’s head, lack of bone-conductedsound, transducer-induced resonances in the ear canal,discomfort, etc.), when above a certain threshold, canlead to difficulties in perceiving a realistic 3-D image andto the perception that the sound (or some of its spectralcomponents) is inside, or too close to, the listener’s head.

Binaural playback through loudspeakers is largely im-mune to this head internalization of sound, for even whennon-idealities in binaural reproduction are present, thesound originates far enough from the listener to be per-ceived to come from outside the head. Furthermore, cuessuch as bone-conducted sound and the involvement of thelistener’s own head, torso and pinnae in sound diffractionand reflection during playback (even if it departs from, orinterferes with, the diffraction-induced coloration repre-sented in the HRTF used to encode the binaural record-ing) could be expected to enhance the perceived realismof sound reproduction relative to that achieved with ear-phones. These potential advantages have, implicitly orexplicitly, motivated the development of XTC-enabledBAL since the earliest work on the subject[8–10].

In scientific applications of BAL, such as its applicationto study spatial hearing disabilities of elderly adults[3],the high levels of XTC (above 20 dB) needed for highly-accurate transmission of binaural cues to a listener re-quire anechoic, or semi-anechoic, environments, precisematching of the listener’s HRTF with that used in therecording, and constrained positioning of the listener’s

Page 2: Optimal Crosstalk Cancellation for Binaural Audio with … · Optimal Crosstalk Cancellation for Binaural Audio with Two Loudspeakers Edgar Y. Choueiri Princeton University choueiri@princeton.edu

E.Y. CHOUEIRI: OPTIMAL CROSSTALK CANCELLATION 2

head in the area of equalization (“sweet spot”). In manyless stringent applications[10], modest levels of XTC,even of a few dB over a limited range of frequencies,have the potential of significantly enhancing the 3-D real-ism of the reproduction of recordings containing binauralcues. This is because, by definition, localization cues ina binaural recording represent differential interaural in-formation that is intended to be transmitted to the earswith no crosstalk. In other words, crosstalk cancellation,at any level, is a reduction of unintended artifice in theloudspeaker playback of recordings containing significantbinaural cues.

This reduction of unintended artifice through XTCshould also apply to the loudspeaker playback of moststereo recordings[11], especially those made in real acous-tic spaces, and even to recordings made using standardstereo microphone techniques without a dummy head,since these techniques[12] all rely on preserving in therecording a good measure of the natural ILD and ITDcues needed for spatial localization during playback. Weshould therefore expect that effecting even a relativelylow level of XTC to the playback of such standard stereorecordings, even those lacking HRTF encoding, shouldenhance image localization compared to playback withfull crosstalk, as well as the perception of width anddepth of the sound-field, since these binaural featuresare always, to some degree, corrupted by crosstalk[13].

With such promises of high-spatial-fidelity reproduc-tion of binaural recordings, and enhanced realism to theplayback of a large portion of existing acoustic stereorecordings, the question arises as to why crosstalk can-cellation has not yet penetrated widely in the professionaland consumer audio sectors.

A part of the answer to this question is related tothe physical constraints required for the practical im-plementation of an effective XTC-enabled BAL play-back system. These constraints include sensitivity tohead movements and a limited sweet spot[14–18], sen-sitivity to room reflections[4, 5], and the often-requireddeparture from the well-established stereo loudspeakerconfiguration[19, 20] (where the loudspeakers span anangle of 60 degrees with respect to the listener). Muchresearch effort has been expended recently on relievingsome of these constraints and has resulted in potentialsolutions, of varying degrees of practicality, which in-clude: widening the sweet spot through the use of multi-ple loudspeakers[21–24], providing XTC at multiple lis-tening locations[25], enhancing robustness to head move-ment through the use of sum and difference filters[26],and dynamically moving the sweet spot to follow the lo-cation of the listener’s head by tracking it with opticalsensors[27].

Another major impediment to the wide adoption ofXTC-enabled BAL has been the spectral coloration thatXTC filters inherently impose on the sound emitted bythe loudspeakers. The fundamental nature of this spec-tral coloration, its basic features, its dependencies, andoptimal methods to abate it with minimal adverse ef-

fects on XTC performance, are the main subjects of thispaper.

B. The Problem of XTC-induced SpectralColoration

1. Nature of the Problem

One main difficulty in implementing XTC is to reducethe artifice of crosstalk without adding an artifice of an-other kind: spectral coloration. Sound waves travelingfrom two distinct sources to the ears set up an interfer-ence pattern in the intervening air space. Depending onthe frequency, the distance of the ears from the loud-speakers, the distance between the loudspeakers, and thephase relationship between the left and right componentsof the recorded stereo signal, the wave interference at anear of the listener might be destructive, complementary,or constructive. At some of the frequencies for which theinterference is destructive at the ear, XTC control (i.e.,signal processing that would cause the waves from theloudspeakers to the contralateral ears to be nulled) wouldrequire boosting the amplitude of the emitted waves. Aswe shall see in Section II C, for typical listening configu-rations these level boosts[28] in the case of a perfect XTCfilter (defined as one that theoretically yields an infiniteXTC level over the entire audio band, in a free-field oranechoic environment) can easily be in excess of 30 dB,and therefore amount to severe spectral coloration.

Of course, such a “perfect” XTC filter would imposethese necessary level boosts only at the loudspeakers insuch a way that, at the listener’s ears, not only thecrosstalk is cancelled, but also the frequency spectrum isreconstructed perfectly, i.e., with no spectral coloration.

As recognized in Ref. [29, 30], and as will be furtherdiscussed in Section II C, the frequencies at which thelevel boosts are required correspond to the frequenciesat which system inversion (the mathematical inversionof the system’s transfer matrix, which leads to the XTCfilter) is ill-conditioned. At these frequencies, XTC con-trol becomes highly sensitive to errors[30], so that evena small error in the alignment of the listener’s head inthe real world would lead to an effective loss of XTCcontrol at, and near, these frequencies. Therefore, notonly would there be undesired crosstalk at the listener’sears at these frequencies, but also, and consequently, thelevels boosts which must necessarily be imposed at thesefrequencies, would be fully hearable, even in the sweetspot, as a coloration.

Even in an ideal world where the loudspeakers-listeneralignment is perfect, this spectral coloration imposed atthe loudspeakers would present three probems: 1) itwould be heard by a listener outside the sweet spot, 2) itwould cause a relative increase (compared to unprocessedsound playback) in the physical strain on the playbacktransducers, and 3) it would correspond to a loss in thedynamic range[29]. Since even professional audio equip-

Page 3: Optimal Crosstalk Cancellation for Binaural Audio with … · Optimal Crosstalk Cancellation for Binaural Audio with Two Loudspeakers Edgar Y. Choueiri Princeton University choueiri@princeton.edu

E.Y. CHOUEIRI: OPTIMAL CROSSTALK CANCELLATION 3

ment is seldom designed to have more more than a few dBheadroom above the levels required to reproduce realisticSPL peaks[31], the dynamic range of the program mustbe decreased by more than 30 dB (minus the headroom),in the case of the “perfect” XTC filter defined above, toavoid clipping. This is particularly problematic, for in-stance, in the case of wide-dynamic-range audio recordedin 16 bits.

2. Previous Work and Present Goals

Takeuchi and Nelson[29] have developed a method thatnot only yields excellent measured XTC performance[3,32], but also effectively solves the problem of spectral col-oration. However, their method, called “Optimal SourceDistribution” (OSD), which is discussed in Section II C.4,requires the use of a minimum of six transducers posi-tioned at various angles around the listener.

The problem of XTC-induced spectral coloration forplayback with only two loudspeakers remains compellingdue to the simplicity of the two-loudspeaker set-up andits compatibility with existing audio equipment. In thispaper, we study this problem in the context of XTC op-timization, which we define as maximizing XTC perfor-mance for a desired tolerable level of spectral colorationor, equivalently, minimizing the spectral coloration for adesired XTC performance.

In particular, we use a free-field two-point-sourcemodel and address, analytically, the fundamental aspectsof spectral coloration control through both constant-parameter and frequency-dependent regularization meth-ods. While regularization methods have been used inthe audio literature for sound source reconstruction[33]and to control ill-conditioning in HRTF inversion[3, 23,24, 34] their fundamental properties in the context ofXTC optimization and XTC-induced spectral colorationhave not been studied in detail, especially in the case offrequency-dependent regularization.

Aside from the fundamental insight we seek throughthis analysis, we aim to derive analytical expressions forthe time-domain impulse responses (IRs) of such optimalXTC filters. These IRs are not only useful for sheddinglight on the time-domain aspects of XTC optimization,but can also lend their benefits, through digital convo-lution engines, to (typically non-scientific) applicationswhere the needed XTC levels are below those requiringinversion of the listener’s HRTF, and are sufficient to en-hance the spatial realism of the two-loudspeaker playback(in regular listening rooms) of audio signals containingsignificant binaural cues.

II. THE FUNDAMENTAL XTC PROBLEM

In this section we start with the mathematical formu-lation of the model and the governing transformation ma-

trices. We then define a set of metrics that will be usefulfor evaluating and comparing the spectral coloration andperformance of XTC filters, and conclude with the defini-tion and discussion of a benchmark for such comparisons:the perfect XTC filter.

A. Formulation and Transformation Matrices

In order to render the analysis tractable enough so thatfundamental insight is more easily obtained, we makethe idealizing assumptions that sound propagation oc-curs in a free field (with no diffraction or refelection fromthe head and pinnae of the listener or any other physi-cal objects), and that the loudspeakers radiate like pointsources.

H

DL

DR

VL

VR

PL

PR

l1

l2

ll2

l1

l

∆r

C

FIG. 1: Geometry of the two-source free-field model. (Allsymbols are defined in the text.)

In the frequency domain, the air pressure at a free-fieldpoint located a distance r from a point source (monopole)radiating a sound wave of frequency ω is given by[35]

P (r, iω) =iωρoq

4πe−ikr

r,

where ρ0 is the air density, k = 2π/λ = ω/cs thewavenumber, λ the wavelength, cs the speed of sound(340.3 m/s), and q the source strength (in units of vol-ume per unit time). It is convenient to define

V =iωρ0q

4π,

Page 4: Optimal Crosstalk Cancellation for Binaural Audio with … · Optimal Crosstalk Cancellation for Binaural Audio with Two Loudspeakers Edgar Y. Choueiri Princeton University choueiri@princeton.edu

E.Y. CHOUEIRI: OPTIMAL CROSSTALK CANCELLATION 4

which is the time derivative of ρoq/(4π), the mass flowrate of air from the center of the source.

Therefore, at the left ear of a listener in the symmetrictwo-source geometry shown in Fig. 1, the air pressure dueto the two sources, under the above-stated assumptions,add up as

PL(iω) =e−ikl1

l1VL(iω) +

e−ikl2

l2VR(iω). (1)

Similarly, at the right ear, we have

PR(iω) =e−ikl2

l2VL(iω) +

e−ikl1

l1VR(iω). (2)

Here, l1 and l2 are the path lengths between any of thetwo sources and the ipsilateral and contralateral ear, re-spectively, as shown in that figure.

In order to maintain a connection with the relevantliterature, we adopt the same nomenclature used inRefs. [19, 20, 29, 30]. Namely, unless otherwise stated, weuse uppercase letters for frequency variables, lowercasefor time-domain variables, uppercase bold for matricesand lowercase bold for vectors, and define

∆l ≡ l2 − l1 and g ≡ l1/l2 (3)

as the path length difference and path length ratio, re-spectively. An inspection of the geometry illustratedin Fig. 1 shows that 0 < g < 1, and that the path lengthscan be expressed as

l1 =

√l2 +

(∆r2

)2

−∆r l sin(θ), (4)

l2 =

√l2 +

(∆r2

)2

+ ∆r l sin(θ), (5)

where ∆r is the effective distance between the entrancesof the ear canals, and l is the distance between eithersource and the interaural mid-point. As defined in Fig. 1,Θ = 2θ is the loudspeaker span. Note that for l >>∆r sin(θ), as in most loudspeaker-based listening set-ups,we have g ' 1. Another important parameter is the timedelay,

τc =∆lcs, (6)

defined as the time it takes a sound wave to traverse thepath length difference ∆l.

Using the above definitions, Eqs. (1) and (2) can bere-written in matrix form as[

PL(iω)PR(iω)

]= α

[1 ge−iωτc

ge−iωτc 1

] [VL(iω)VR(iω)

], (7)

where

α =e−iωl1/cs

l1. (8)

In the time domain, α is simply a transmission delay(divided by the constant l1) that does not affect the shapeof the signal. Its role in insuring causality is discussed inSection III B. The source vector v = [VL(iω), VR(iω)]Tis obtained from the vector of “recorded” signals d =[DL(iω), DR(iω)]T , through the transformation

v = Hd, (9)

where

H =

HLL(iω) HLR(iω)

HRL(iω) HRR(iω)

(10)

is the sought 2×2 filter matrix. Therefore, from Eq. (7),we have

p = αCHd, (11)

where p = [PL(iω), PR(iω)]T is the vector of pressures atthe ears, and C is the system’s transfer matrix

C ≡[

1 ge−iωτc

ge−iωτc 1

], (12)

which, like all matrices we will be dealing with, is sym-metric due to the symmetry of the geometry.

In summary, the transformation from the signal d,through the filter H, to the source variables v, thenthrough wave propagation from the sources to pressurep at the ears of the listener, can be written simply as

p = αRd. (13)

where we have introduced the performance matrix, R,defined as

R =

RLL(iω) RLR(iω)

RRL(iω) RRR(iω)

≡ CH. (14)

B. Metrics

We now wish to define a set of metrics by which tojudge the spectral coloration and performance of XTCfilters. In this context we note that the diagonal elementsof R represent the ipsilateral transmission of the signalto the ears, and the off-diagonal elements represent theundesired contralateral transmission, i.e., the crosstalk.

Therefore, the amplitude spectrum (to a factor α) ofa signal fed to only one (either left or right) of the twoinputs of the system, as heard at the ipsilateral ear is

Esi‖(ω) ≡ |RLL(iω)| = |RRR(iω)|,

where the subscripts “si” and ‖ stand for “side image”and “ipsilateral ear (with respect to the input signal)”respectively, since Esiip

, as defined, is the frequency re-sponse (at the ipsilateral ear) for the side image that

Page 5: Optimal Crosstalk Cancellation for Binaural Audio with … · Optimal Crosstalk Cancellation for Binaural Audio with Two Loudspeakers Edgar Y. Choueiri Princeton University choueiri@princeton.edu

E.Y. CHOUEIRI: OPTIMAL CROSSTALK CANCELLATION 5

would result from the input being panned to one side.Similarly, at the contralateral ear to the input signal(subscript X), we have the following side-image fre-quency response:

EsiX(ω) ≡ |RLR(iω)| = |RRL(iω)|.

The system’s frequency response at either ear when thesame signal is split equally between left and right inputsis another spectral coloration metric. It can be obtainedfrom the product R · [1/2, 1/2]T , which leads to

Eci(ω) ≡ |RLL(iω) +RLR(iω)2

| = |RRL(iω) +RRR(iω)2

|.

Here the subscript “ci” stands for “center image” sinceEci, as defined, is the frequency response (at either ear)for the center image that would result from the inputbeing panned to the center.

Also of importance to our discussions are the frequencyresponses that would be measured at the sources (loud-speakers). These are denoted by S, and can be obtainedfrom the elements of the filter matrix H. They are givenusing the same subscript convention used above (with “‖”and “X” referring to the loudspeakers that are ipsilateraland contralateral to the input signal, respectively) by

Ssi‖(ω) ≡ |HLL(iω)| = |HRR(iω)|,

SsiX(ω) ≡ |HLR(iω)| = |HRL(iω)|,

Sci(ω) ≡ |HLL(iω) +HLR(iω)2

| = |HRL(iω) +HRR(iω)2

|.

An intuitive interpretation of the significance of the abovemetrics is that a signal panned from a single input to bothinputs to the system will result in frequency responsesgoing from Esi to Eci at the ears, and Ssi to Sci at theloudspeakers.

Two other spectral coloration metrics are the fre-quency responses of the system to in-phase and out-of-phase inputs to the system. These two responses areobtained simply from the product of the filter matrix H

with the vectors [1, 1]T and [1,−1]T (or [−1, 1]T ), respec-tively, and are given by:

Si(ω) ≡ |HLL(iω) +HLR(iω)| = |HRL(iω) +HRR(iω)|,So(ω) ≡ |HLL(iω)−HLR(iω)| = |HRL(iω)−HRR(iω)|,

where the subscripts i and o denote the in-phase and out-of-phase responses, respectively. Note that, as defined, Siis double (i.e., 6 dB above) Sci, as the latter describes asignal of amplitude 1 panned to center (i.e., split equallybetween L and R inputs), while the former describes twosignals of amplitude 1 fed in phase to the two inputs ofthe system.

Since a real signal can consist of various componentshaving different phase relationships, it is more useful

to combine Si(ω) and So(ω) into a single metric, S(ω),which is the envelope spectrum that describes the maxi-mum amplitude that could be expected at the loudspeak-ers, and is given by

S(ω) = max [Si(ω), So(ω)] .

It is relevant to note that S(ω) is equivalent to ||H||,the 2-norm of H, and that Si and So are the two singu-lar values, which can be obtained through singular valuedecomposition of the matrix as was done in Ref. [29].

Finally, an important metric that will allow us to eval-uate and compare the XTC performance of various filtersis χ(ω), the crosstalk cancellation spectrum:

χ(ω) ≡ |RLL(iω)||RRL(iω)|

=|RRR(iω)||RLR(iω)|

=Esi‖(ω)

EsiX(ω)

.

The above definitions give us a total of eight metrics,(Esi‖ , EsiX

, Eci, Ssi‖ , SsiX, Sci, S, χ), all real functions

of frequency, by which to evaluate and compare the spec-tral coloration and XTC performance of XTC filters.

C. Benchmark: Perfect Crosstalk Cancellation

A perfect crosstalk cancellation (P-XTC) filter is de-fined as one that, theoretically, yields infinite crosstalkcancellation at the ears of the listener, for all frequen-cies.

Crosstalk cancellation, as defined in Section I A, re-quires that the pressure at each of the two ears be thatwhich would have resulted from the ipsilateral signalalone, namely, in the frequency domain, PL = αDL andPR = αDR, where all quantities are complex functions offrequency. Therefore, in order to achieve perfect cancella-tion of the crosstalk, Eq. (13) requires that R = I, whereI is the unity matrix, and thus, as per the definition of Rin Eq. (14), the P-XTC filter is simply the inverse of thesystem transfer matrix expressed in Eq. (12), and can beexpressed exactly:

H [P ] = C−1 =1

1− g2e−2iωτc

[1 −ge−iωτc

−ge−iωτc 1

],

(15)where the superscript [P ] denotes perfect XTC. For thisfilter, the eight metrics we defined above become:

E[P ]

si‖= 1; E

[P ]

siX= 0; E

[P ]

ci =12

;

S[P ]

si‖(ω) =

∣∣∣∣ 11− g2e−2iωτc

∣∣∣∣ =1√

g4 − 2g2 cos(2ωτc) + 1;

S[P ]

siX(ω) =

∣∣∣∣ −ge−iωτc

1− g2e−2iωτc

∣∣∣∣ =g√

g4 − 2g2 cos(2ωτc) + 1;

S[P ]

ci (ω) =12

∣∣∣∣1− g

g + eiωτc

∣∣∣∣ =1

2√g2 + 2g cos(ωτc) + 1

;

S[P ](ω) = max(∣∣∣∣1− g

g + eiωτc

∣∣∣∣ , ∣∣∣∣1 +g

eiωτc − g

∣∣∣∣) ,

Page 6: Optimal Crosstalk Cancellation for Binaural Audio with … · Optimal Crosstalk Cancellation for Binaural Audio with Two Loudspeakers Edgar Y. Choueiri Princeton University choueiri@princeton.edu

E.Y. CHOUEIRI: OPTIMAL CROSSTALK CANCELLATION 6

= max

(1√

g2 + 2g cos(ωτc) + 1,

1√g2 − 2g cos(ωτc) + 1

); (16)

χ[P ](ω) = ∞. (17)

Therefore the perfect (χ = ∞) XTC filter gives flat fre-quency responses at the ears (E[P ](ω) = constant), butnot at the sources. To appreciate the extent of spec-tral coloration at the loudspeakers, we plot the S[P ](ω)frequency responses expressed above in Fig. 2 for a typ-ical value g = .985. Throughout this paper, for the sakeof illustration, we complement the non-dimensional plotswith dimensional calculations, which are represented bythe same curves read in terms of the frequency f = ω/2πon the top axis, for a typical listening geometry charac-terized by g = .985 and τc = 68 µs (i.e., 3 samples at theRed Book CD sampling rate of 44.1kHz), which wouldbe the case, for instance, of a set-up with ∆r = 15 cm,l = 1.6 m, and Θ = 18.)

S` ApE

SsiApE

SciApE

Π100 Π10 Π2 Π 2Π 3Π-20

-10

0

10

20

30

40100 1000 104 2x104

Ω Τc

dB

f HHzL

FIG. 2: Perfect XTC filter frequency responses at the loud-speakers: amplitude envelope (heavy curve), side image (lightsolid curve), and central image (light dashed curve). The dot-ted horizontal line marks the envelope ceiling, which for thiscase (g = .985) is 36.5 dB. The non-dimensional frequency ωτcis given on the bottom axis, and the corresponding frequencyin Hz, shown on the top axis, is to illustrate a particular (typ-ical) case of τc =3 samples at a sampling rate of 44.1 kHz.

(Since S[P ]

si‖' S

[P ]

siXwhen g ' 1, these two spectra are shown

as the single curve S[P ]

si .)

The peaks in these spectra occur at frequencies forwhich the system must boost the amplitude of the sig-nal at the loudspeakers in order to effect XTC at theears while compensating for the destructive interferenceat that location. Similarly, minima in the spectra occurwhen the amplitude must be attenuated.

Using the first and second derivatives (with respectto ωτc) of the above expressions for the various S[P ](ω)

spectra, we find the following amplitudes and frequenciesfor the associated peaks and minima, denoted by ↑ and↓ superscripts, respectively:

S[P ]↑

si‖=

11− g2

at ωτc = nπ, with n = 0, 1, 2, 3, 4, . . .

S[P ]↓

si‖=

11 + g2

at ωτc = nπ

2, with n = 1, 3, 5, 7, . . .

S[P ]↑

siX=

g

1− g2at ωτc = nπ, with n = 0, 1, 2, 3, 4, . . .

S[P ]↓

siX=

g

1 + g2at ωτc = n

π

2, with n = 1, 3, 5, 7, . . .

S[P ]↑

ci =1

2− 2gat ωτc = nπ, with n = 1, 3, 5, 7, . . .

S[P ]↓

ci =1

2 + 2gat ωτc = nπ, with n = 0, 2, 4, 6, . . .

S[P ]↑ =1

1− gat ωτc = nπ, with n = 0, 1, 2, 3, 4, . . .

(18)

S[P ]↓ =1√

1 + g2at ωτc = n

π

2, with n = 1, 3, 5, 7, . . .

(19)

For a typical listening set-up, g ' 1, say, our referenceg = .985 case shown in Fig. 2, the envelope peaks (i.e.,S[P ]↑) correspond to a boost of

20 log10

(1

1− .985

)= 36.5 dB

(and the peaks in the other spectra, S[P ]↑

si‖' S

[P ]↑

siX'

S[P ]↑

ci , correspond to boosts of about 30.5 dB.) Whilethese boosts have equal frequency widths across the spec-trum, when the spectrum is plotted logarithmically (asis appropriate for human sound perception), the low-frequency boost is most prominent in its perceived fre-quency extent. This bass boost has long been recog-nized as an intrinsic problem in XTC. While the high-frequency peaks could, in principle, be pushed out of theaudio range by decreasing τc (which, as can be seen fromEqs. (4) to (6), is achieved by increasing l and/or decreas-ing the loudspeaker span Θ, as is done in the so-called“Stereo Dipole” configuration described in Ref. [19, 20],where Θ = 10), the “low frequency boost” of the P-XTCfilter would remain problematic.

As mentioned in Section I B 1, the severe spectralcoloration associated with these high-amplitude peakspresents three practical problems: 1) it would be heardby a listener outside the sweet spot, 2) it would cause arelative increase (compared to unprocessed sound play-back) in the physical strain on the playback transducers,and 3) it would correspond to a loss in the dynamic range.

These penalties might be a justifiable price to payif we are guaranteed the infinitely good XTC perfor-mance (χ = ∞) and the perfectly flat frequency re-

Page 7: Optimal Crosstalk Cancellation for Binaural Audio with … · Optimal Crosstalk Cancellation for Binaural Audio with Two Loudspeakers Edgar Y. Choueiri Princeton University choueiri@princeton.edu

E.Y. CHOUEIRI: OPTIMAL CROSSTALK CANCELLATION 7

sponse (E[P ](ω) = constant) that the perfect XTC fil-ter promises at the ears of a listener in the sweet spot.However, in practice, these theoretically promised bene-fits are unachievable due to the solution’s sensitivity tounavoidable errors. This problem can best be appreci-ated by evaluating the condition number of the transfermatrix C.

It is well known that in matrix inversion problems thesensitivity of the solution to errors in the system is givenby the condition number of the matrix. (For a discussionof the condition number in the context of XTC systemerrors, see Ref. [30]). The condition number κ(C) of thematrix C is given by

κ(C) = ||C|| ||C−1|| = ||C|| ||H [P ]||.

(It is also, equivalently, the ratio of largest to smallestsingular values of the matrix.) Therefore, we have

κ(C) = max

(√2(g2 + 1)

g2 + 2g cos(ωτc) + 1− 1,√

2(g2 + 1)g2 − 2g cos(ωτc) + 1

− 1

).

Using the first and second derivatives of this function,as we did for the previous spectra, we find the followingmaxima and minima:

κ↑(C) =1 + g

1− gat ωτc = nπ, with n = 0, 1, 2, 3, 4, . . .

κ↓(C) = 1 at ωτc = nπ

2, with n = 1, 3, 5, 7, . . . (20)

as was also reported in Ref. [30] in terms of wavelengths.First, we note that the peaks and minima in the con-dition number occur at the same frequencies as thoseof the amplitude envelope spectrum at the loudspeak-ers, S[P ]. Second, we note that the minima have acondition number of unity (the lowest possible value),which implies that the filter resulting from the inver-sion of C is most robust (i.e., least sensitive to errors inthe transfer matrix) at the non-dimensional frequenciesωτc = π/2, 3π/2, 5π/2, . . . . Conversely, the conditionnumber can reach very high values (e.g., κ↑(C) = 132.3for our typical case of g = .985) at the non-dimensionalfrequencies ωτc = 0, π, 2π, 3π . . . . As g → 1 the ma-trix inversion resulting in the P-XTC filter becomes ill-conditioned, or in other words, infinitely sensitive to er-rors. The slightest misalignment, for instance, of thelistener’s head, would thus result in a severe loss in XTCcontrol at the ears (at and near these frequencies) which,in turn, causes the severe spectral coloration in S[P ](ω)to be transmitted to the ears.

We are now in a position to appreciate the pre-scription proposed and implemented by Takeuchi andNelson[29, 32], which effectively solves both the robust-ness and spectral coloration problem of the P-XTC filterby insuring that the system operates always under condi-tions where κ(C) is small. This can be done by allowing

the loudspeaker span to be a function of the frequency.More specifically, after noting that typically l >> ∆r, sothat the approximation ∆l ' ∆r sin(θ) holds, and there-fore ωτc = ω∆l/cs = 2πf∆l/cs can be approximated by

ωτc '2πf∆r sin(θ)

csfor l ∆r, (21)

we can re-write the robustness condition (stated inEq. (20)) as

Θ(f) ' 2 sin−1

(ncs

4f∆r

), with n = 1, 3, 5, 7, . . .

Since both cs and ∆r are constant, the required loud-speaker span is solely a function of the frequency f . Inpractice this prescription, called Optimal Source Distri-bution (OSD), can be implemented by using a crossovernetwork to distribute adjacent bands of the audio spec-trum to pairs of transducers, whose spans are calculatedfrom the above equation so that in each band the condi-tion number does not exceed unity by much, thus insur-ing robustness and low coloration over the entire audiospectrum. It is clear, however, that this solution is notapplicable to the case of a single pair of loudspeakers,which is the focus of our analysis.

We refer the reader interested in the OSD method andXTC errors to Ref. [29, 30, 32], and sum up the discus-sion in this section by stating that, for the case of onlytwo loudspeakers, the perfect XTC filter carries in prac-tice the penalties of over-amplification (and the associ-ated loss of dynamic range) at frequencies where systeminversion is ill-conditioned, transducer fatigue, and a se-vere spectral coloration that is heard by listeners insideand outside the sweet spot.

III. CONSTANT-PARAMETERREGULARIZATION

Regularization methods allow controlling the norm ofthe approximate solution of an ill-conditioned linear sys-tem at the price of some loss in the accuracy of the solu-tion. The control of the norm through regularization canbe done subject to an optimization prescription, such asthe minimization of a cost function. Ref. [36] provides adetailed discussion of regularization methods in a generalmathematical context, and Refs. [3, 18, 23, 33, 34] are ex-amples of the use of regularization to control numericalHRTF inversion. We discuss regularization analyticallyin the context of XTC filter optimization, which we defineas the maximization of XTC performance for a desiredtolerable level of spectral coloration or, equivalently, theminimization of spectral coloration for a desired mini-mum XTC performance.

In essence, a nearby solution to the matrix inversionproblem is sought:

H [β] =[CHC + βI

]−1

CH , (22)

Page 8: Optimal Crosstalk Cancellation for Binaural Audio with … · Optimal Crosstalk Cancellation for Binaural Audio with Two Loudspeakers Edgar Y. Choueiri Princeton University choueiri@princeton.edu

E.Y. CHOUEIRI: OPTIMAL CROSSTALK CANCELLATION 8

where the superscript H denotes the Hermitian operator,and β is the regularization parameter which essentiallycauses a departure from H [P ], the exact inverse of C. Inthis section we take β to be a constant, 0 < β 1. Thepseudoinverse matrix H [β] is the regularized filter, andthe superscript [β] is used to denote constant-parameterregularization. The regularization stated in Eq. (22) canbe shown[23, 34, 37] to correspond to a minimization ofa cost function, J(iω),

J(iω) = eH(iω)e(iω) + βvH(iω)v(iω), (23)

where the vector e represents a performance metric thatis a measure of the departure from the signal reproducedby the perfect filter. Physically, then, the first term in thesum constituting the cost function represents a measureof the performance error, and the second term representsan “effort penalty,” which is a measure of the power ex-erted by the loudspeakers. For β > 0, Eq. (22) leadsto an optimum, which corresponds to the least-squareminimization of the cost function J(iω).

Therefore, an increase of the regularization parameterβ leads to a minimization of the effort penalty at the ex-pense of a larger performance error and thus to an abate-ment of the peaks in the norm of H, i.e., the colorationpeaks in the S(ω) spectra, at the price of a decrease inXTC performance at and near the frequencies where thesystem is ill-conditioned.

A. Frequency Response

Using the explicit form for C given by Eq. (12), in thelast equation above, we find:

H [β] =

H [β]LL(iω) H

[β]LR(iω)

H[β]RL(iω) H

[β]RR(iω)

, (24)

where

H[β]LL(iω) = H

[β]RR(iω)

=g2ei4ωτc − (β + 1)ei2ωτc

g2ei4ωτc + g2 − [(g2 + β)2 + 2β + 1],

(25)

H[β]LR(iω) = H

[β]RL(iω)

=geiωτc − g(g2 + β)ei3ωτc

g2ei4ωτc + g2 − [(g2 + β)2 + 2β + 1].

(26)

The eight metric spectra we defined in Section II B be-come:

E[β]

si‖(ω) =

g4 + βg2 − 2g2 cos(2ωτc) + β + 1−2g2 cos(2ωτc) + (g2 + β)2 + 2β + 1

;

E[β]

siX(ω) =

2gβ| cos(ωτc)|−2g2 cos(2ωτc) + (g2 + β)2 + 2β + 1

;

E[β]

ci (ω) =12− β

2 [g2 + 2 cos(ωτc) + β + 1];

S[β]

si‖(ω) =

√g4 − 2(β + 1)g2 cos(2ωτc) + (β + 1)2

−2g2 cos(2ωτc) + (g2 + β)2 + 2β + 1;

S[β]

siX(ω) =

g√

(g2 + β)2 − 2(g2 + β) cos(2ωτc) + 1−2g2 cos(2ωτc) + (g2 + β)2 + 2β + 1

;

S[β]

ci (ω) =

√g2 + 2g cos(ωτc) + 1

2[g2 + 2g cos(ωτc) + β + 1];

S[β](ω) = max

( √g2 + 2g cos(ωτc) + 1

g2 + 2g cos(ωτc) + β + 1,√

g2 − 2g cos(ωτc) + 1g2 − 2g cos(ωτc) + β + 1

); (27)

χ[β](ω) =g4 + βg2 − 2g2 cos(2ωτc) + β + 1

2gβ| cos(ωτc)|. (28)

Of course, as β → 0, H [β] →H [P ], and it can be verifiedthat the spectra of the perfect XTC filter are recoveredfrom the expressions above.

The envelope spectrum, S[β](ω), is plotted in Fig. 3for three values of β. Two features can be noted in thatplot: 1) increasing the regularization parameter atten-uates the peaks in the spectrum without affecting theminima, and 2) with increasing β the spectral maximasplit into doublet peaks (two closely-spaced peaks).

Β = 0

.005

.05

29.5 dB19.5 dB

Π100 Π10 Π2 Π 2Π 3Π-20

-10

0

10

20

30

40100 1000 104 2x104

Ω Τc

S`AΒEHd

BL

f HHzL

FIG. 3: Effects of regularization on the envelope spectrum atthe loudspeakers, S[β](ω), showing peak attenuation and for-mation of doublet peaks as β is increased. (Other parametersare the same as for Fig. 2.)

To get a measure of peak attenuation and the condi-tions for the formation of doublet peaks, we take the firstand second derivatives of S[β](ω) with respect to ωτc andfind the conditions for which the first derivative is nil andthe second is negative. These conditions are summarizedas follows: If β is below a threshold β∗ defined as

β < β∗ ≡ (g − 1)2, (29)

the peaks are singlets and occur at the same non-dimensional frequencies as for the envelope spectrum

Page 9: Optimal Crosstalk Cancellation for Binaural Audio with … · Optimal Crosstalk Cancellation for Binaural Audio with Two Loudspeakers Edgar Y. Choueiri Princeton University choueiri@princeton.edu

E.Y. CHOUEIRI: OPTIMAL CROSSTALK CANCELLATION 9

peaks of the P-XTC filter (S[P ]↑), and have the followingamplitude:

S[β]↑ =1− g

(g − 1)2 + β

at ωτc = nπ, with n = 0, 1, 2, 3, 4, . . .

If the condition

β∗ ≤ β 1 (30)

is satisfied, the maxima are doublet peaks located at thefollowing non-dimensional frequencies:

ωτc = nπ ± cos−1

(g2 − β + 1

2g

)with n = 0, 1, 2, 3, 4, . . .

(31)and have an amplitude

S[β]↑↑ =1

2√β, (32)

which does not depend on g. (The superscripts ↑ and↑↑ denote singlet and doublet peaks, respectively.) Theattenuation of peaks in the S[β] spectrum due to regu-larization can be obtained by dividing the amplitude ofthe peaks in the P-XTC (i.e., β = 0) spectrum by that ofpeaks in the regularized spectrum. For the case of singletpeaks, the attenuation is

20 log10

(S[P ]↑

S[β]↑

)= 20 log10

(g − 1)2 + 1

]dB,

and for doublet peaks, it is given by

20 log10

(S[P ]↑

S[β]↑↑

)= 20 log10

[2√β

1− g

]dB.

For the typical case of g = .985 illustrated in Fig. 3, wehave β∗ = 2.225 × 10−4, and for β = .005 and 0.05 weget doublet peaks that are attenuated (with respect tothe peaks in the P-XTC spectrum) by 19.5 and 29.5 dB,respectively, as marked on that plot.

Therefore, increasing the regularization parameterabove this (typically low) threshold causes the maxima inthe envelope spectrum to split into doublet peaks shiftedby a frequency ∆(ωτc) = cos−1[(g2 − β + 1)/2g] to ei-ther side of the peaks in the response of the perfect XTCfilter. (For our illustrative case of g = .985, we haveβ∗ = 2.225×10−4 and ∆(ωτc) ' 0.225 for β = .05). Dueto the logarithmic nature of frequency perception for hu-mans, these doublet peaks are perceived as narrow-bandartifacts at high frequencies (i.e., for n = 1, 2, 3, . . . ), butthe first doublet peak centered at n = 0 is perceived asa wide-band low-frequency rolloff of typically many dB,as can be clearly seen in Fig. 3. Therefore, constant-βregularization transforms the bass boost of the perfectXTC filter into a bass roll-off.

Since regularization is essentially a deliberate intro-duction of error into system inversion, we should expectboth the XTC spectrum and the frequency responses atthe ears to suffer (i.e., depart from their ideal P-XTCfilter levels of ∞ and 0 dB, respectively) with increasingβ. The effects of constant-parameter regularization onresponses at the ears are illustrated in Fig. 4.

Β = .05

.005

.005

.05

ΧAΒE

EsiÈÈ

AΒE

Π100 Π10 Π2 Π 2Π 3Π-20

-10

0

10

20

30

40100 1000 104 2x104

Ω Τc

dB

f HHzL

FIG. 4: Effects of regularization on the the crosstalk cancel-lation spectrum, χ[β](ω) (top two curves), and the ipsilateral

frequency response at the ear for a side image, E[β]

si‖(ω) (bot-

tom two curves). The black horizontal bars on the top axismark the frequency ranges for which an XTC level of 20 dBor higher is reached with β = .05, and the grey bars representthe same for the case of β = .005. (Other parameters are thesame as for Fig. 2.)

The black curves in that plot represent the crosstalkcancellation spectra and show that XTC control is lostwithin frequency bands centered around the frequen-cies where the system is ill-conditioned (ωτc = nπ withn = 0, 1, 2, 3, 4, . . . ) and whose frequency extent widenswith increasing regularization. For example, increasingβ to .05 limits XTC of 20 dB or higher to the frequencyranges marked by black horizontal bars on the top axisof that figure, with the first range extending only from1.1 to 6.3 kHz and the second and third ranges locatedabove 8.4 kHz. In many practical applications, suchhigh (20 dB) XTC levels may not be needed or achiev-able (e.g., because of room reflections and/or HRTF mis-match) and the higher values of β needed to tame thespectral coloration peaks below a required level at theloudspeakers may be tolerated.

The E[β]

si‖(ω) responses at the ears, shown as the bot-

tom curves in Fig. 4, depart only by a few dB from thecorresponding P-XTC (i.e., β = 0) filter response (whichis a flat curve at 0 dB). More precisely and generally, themaxima and minima of the E[β]

si‖(ω) spectrum are given

Page 10: Optimal Crosstalk Cancellation for Binaural Audio with … · Optimal Crosstalk Cancellation for Binaural Audio with Two Loudspeakers Edgar Y. Choueiri Princeton University choueiri@princeton.edu

E.Y. CHOUEIRI: OPTIMAL CROSSTALK CANCELLATION 10

by:

E[β]↑

si‖=

g2 + 1g2 + β + 1

at ωτc = nπ

2, with n = 1, 3, 5, . . .

E[β]↓

si‖=

g4 + (β − 2)g2 + β + 1g4 + 2(β − 1)g2 + (β + 1)2

at ωτc = nπ, with n = 0, 1, 2, 3, 4, . . .

For the typical (g = .985) example shown in the figure, wehave, for β = .05, E[β]↑

si‖= −.2 dB and E

[β]↓

si‖= −6.1 dB,

showing that even relatively aggressive regularization re-sults in a spectral coloration at the ears that is quitemodest compared to the spectral coloration the perfectXTC filter imposes at the loudspeakers.

In sum, we conclude that, while constant-parameterregularization is effective at reducing the amplitude ofpeaks (including the “low-frequency boost”) in the enve-lope spectrum at the loudspeakers, it typically results inundesirable narrow-band artifacts at higher frequenciesand a rolloff of the lower frequencies at the loudspeakers.This non-optimal behavior can be avoided if the regu-larization parameter is allowed to be a function of thefrequency, as we shall see in Section IV.

Before we do so, it is insightful to consider the effectsof constant-parameter regularization on the time-domainresponse of XTC filters.

B. Impulse Response

We start by making the substitution z = ei2ωτc inEqs. (25) and (26) to get

H[β]LL(z) = H

[β]RR(z)

=z2g2 − z(β + 1)

z2g2 + g2 − z [(g2 + β)2 + 2β + 1], (33)

H[β]LR(z) = H

[β]RL(z)

=z[gz−1/2 − g(g2 + β)z1/2

]z2g2 + g2 − z [(g2 + β)2 + 2β + 1]

. (34)

The two expressions above have the same quadratic de-nominator, which can be factored as

z2g2 + g2 − z[(g2 + β)2 + 2β + 1

]= g2(z − a1)(z − a2),

where

a1 =a−

√a2 − 4g4

2g2, a2 =

a+√a2 − 4g4

2g2, (35)

and

a = (g2 + β)2 + 2β + 1. (36)

We can then re-write Eqs. (33) and (34) as

H[β]LL(z) = H

[β]RR(z)

=[z − (β + 1)

g2

]×(

11− a1z−1

)(1

z − a2

),(37)

H[β]LR(z) = H

[β]RL(z)

=[z−1/2 − (g2 + β)z1/2

g

]×(

11− a1z−1

)(1

z − a2

).(38)

Since 0 < g < 1, and β ≥ 0, we see from Eqs. (35)and (36) that 0 ≤ a1 < 1 and a2 > 1, and therefore|a1z

−1| < 1 and a2 > |z|. This allows us to express theterms 1/(1− a1z

−1) and 1/(z− a2) in the last two equa-tions as two convergent power series (whose convergenceinsures that we have a stable filter), and thus write thelast two equations as

H[β]LL(z) = H

[β]RR(z)

=[z − (β + 1)

g2

]×( ∞∑

m=0

am1 z−m

)( ∞∑m=0

−a−m−12 zm

)(39)

H[β]LR(z) = H

[β]RL(z)

=[z−1/2 − (g2 + β)z1/2

g

]×( ∞∑

m=0

am1 z−m

)( ∞∑n=0

−a−m−12 zm

).(40)

The filter is now in a form that can be readily trans-formed into a time-domain filter, h[β], represented by

h[β] =

h[β]LL(t) h

[β]LR(t)

h[β]RL(t) h

[β]RR(t)

. (41)

We do so by substituting back ei2ωτc for z in Eqs. (39)and (40), and taking the inverse Fourier transform (IFT)to get

h[β]LL(t) =

12π

∫ ∞−∞

H[β]LL(iω)eiωtdω

= h[β]RR(t) =

12π

∫ ∞−∞

H[β]RR(iω)eiωtdω

=[δ(t+ 2τc)−

β + 1g2

δ(t)]∗ ψ(t), (42)

Page 11: Optimal Crosstalk Cancellation for Binaural Audio with … · Optimal Crosstalk Cancellation for Binaural Audio with Two Loudspeakers Edgar Y. Choueiri Princeton University choueiri@princeton.edu

E.Y. CHOUEIRI: OPTIMAL CROSSTALK CANCELLATION 11

h[β]LR(t) =

12π

∫ ∞−∞

H[β]LR(iω)eiωtdω

= h[β]RL(t) =

12π

∫ ∞−∞

H[β]RL(iω)eiωtdω

=[δ(τc − t)

g− (g2 + β)δ(t+ τc)

g

]∗ ψ(t),

(43)

where the asterisk denotes the convolution operation, andψ(t) is the IFT of the product of the two series appear-ing in Eqs. (39) and (40), and is given by the followingconvolution of two trains of Dirac delta functions:

ψ(t) =∞∑m=0

am1 δ(t− 2mτc) ∗∞∑m=0

−a−m−12 δ(t+ 2mτc),

(44)We see that the first train evolves forward in time andthe second evolves in reverse time.

The impulse response (IR) represented by Eqs. (42)and (43) is plotted in Fig. 5 for three values of β.

The IR of the perfect XTC filter is shown in the toppanel of that figure and consists of two trains of decay-ing and inter-delayed delta functions of opposite sign.Mathematically, it is the special case of β = 0, for whichEqs. (37) and (38) simplify to

H[P ]LL(z) = H

[P ]RR(z) =

11− a1z−1

, (45)

H[P ]LR(z) = H

[P ]RL(z) = − gz−1/2

1− a1z−1, (46)

from which, through the inverse Fourier transform, we re-cover the IR of the perfect XTC filter derived in Ref. [20]:

h[P ]LL(t) = h

[P ]RR(t) =

∞∑n=0

an1 δ(t− 2nτc) (47)

h[P ]LR(t) = h

[P ]RL(t)

= −gδ(t− τc) ∗∞∑n=0

an1 δ(t− 2nτc), (48)

where a1 = g2 (obtained by setting β = 0 in Eqs. (35)and (36)) is the pole of the filter. We see that the perfectXTC IR starts at t = 0 with an amplitude of unity anddecays to an amplitude an1 = (l1/l2)2n after a time 2nτc.

Its physical significance has been discussed by Kirkebyet. al.[20] who, along with Atal et. al.[9] before, recog-nized the recursive nature of XTC filters. Briefly, a phys-ical appreciation of the perfect crosstalk cancellation IRcan be obtained by considering the hypothetical case ofa positive pulse whose duration is much smaller than τc,fed into only one of the two inputs of the system, say theleft input. From Eq. (9), we see that this pulse, dL(t), isemitted from the left loudspeaker as a series of positivepulses dL(t)∗hLL(t) (corresponding to the filled circles inthe top panel of Fig. 5) and from the right loudspeaker asa series of negative pulses dL(t) ∗ hRL(t) (corresponding

ææææææææææææææææææææææææææææææææææ

çççççççççççççççççççççççççççççççççç

Β = 0

-200 -100 0 100 200-1.0

-0.5

0.0

0.5

1.0

æææææææææææææææææææææææææææææææææ

æ

æ

æ

æ

æ

æ

æææææææææææææææææææææææææææææççççççççççççç

çççççççç

ççççççççççç

ç

ç

ç

ç

ç

ç

ççççççççççççççç

ççççççççççççççç

Β = .005

-200 -100 0 100 200-1.0

-0.5

0.0

0.5

1.0

æææææææææææææææ

æ

æ

æ

æ

æ

æ

æ

æææææææææææææææççççççççççç

ççççç

ç

ç

ç

ç

ç

ç

çççççççççççççççç

Β = .05

-200 -100 0 100 200-1.0

-0.5

0.0

0.5

1.0

t HsamplesL

FIG. 5: Impulse responses h[β]LL(t) = h

[β]RR(t) (filled circles)

and h[β]LR(t) = h

[β]RL(t) (empty circles) for three values of β.

(g = .985, τc = 3 samples.)

to the empty circles in the same plot). These two series ofpulses are delayed by τc with respect to each other so thatafter the first positive pressure pulse arrives at the leftear, then reaches the right ear with a slightly smaller am-plitude, it is cancelled there by a negative pressure pulseof equal amplitude (that was emitted a time τc earlier bythe right loudspeaker), which in turn is cancelled at theleft ear by a positive pressure pulse, and so on. The netresult is that only the first pulse is heard and only at theleft ear, i.e., with no crosstalk.

The effects of regularization on the XTC IR can begleaned from a comparison of the three panels of Fig. 5.When β is finite, the IR has a “pre-echo” part, i.e., it

Page 12: Optimal Crosstalk Cancellation for Binaural Audio with … · Optimal Crosstalk Cancellation for Binaural Audio with Two Loudspeakers Edgar Y. Choueiri Princeton University choueiri@princeton.edu

E.Y. CHOUEIRI: OPTIMAL CROSSTALK CANCELLATION 12

extends in reverse time (t < 0) as shown in Fig. 5. Asalso can be seen in that figure, and inferred from Eq. (44),the delta functions in the t < 0 and t > 0 parts haveopposite signs. With increasing regularization, the t < 0part increases in prominence and the IR becomes shorterin temporal extent, which correspond in the frequencydomain to a spectrum with abated peaks.

To insure causality, a time delay must be used to in-clude the t < 0 part of the IR. In practice (e.g., whendealing with numerical HRTF inversion), this can bedone through a “modelling delay” that accommodatesboth the non-causal part of the IR and the transmissiondelay

δ

(l1cs− t)

associated with the factor α in Eq. (8).The length of a filter having a pole close to the unit

circle, |z| = 1, is inversely proportional to the distancebetween the pole and the unit circle[38]. As β is in-creased the poles pull away from the unit circle as perEqs. (35) and (36), and therefore the length of a finite-βIR is reduced by a factor of

1− a1

1− g2

with respect to the length of the perfect XTC IR. Thisfactor (which is based on a1 since 1 − a1 < |1 − a2| ) isaccurate as long as 1 − g2 << 1 and 1 − a1 << 1. Forinstance, for the IR shown in the middle panel of Fig. 5we have β = .005 and g = .985, which give a1 = .86 andthe IR is about 4.5 times shorter than the perfect XTCIR.

IV. FREQUENCY-DEPENDENTREGULARIZATION

In order to avoid the frequency-domain artifacts dis-cussed in Section III A and illustrated in Fig. 3, we seekan optimization prescription that would cause the enve-lope spectrum S(ω) to be flat at a desired level Γ (dB)over the frequency bands where the perfect filter’s enve-lope spectrum exceeds Γ (dB). Outside these bands (i.e.,below that level), we apply no regularization. This canbe stated symbolically as:

S(ω) = γ if S[P ](ω) ≥ γ, (49)

S(ω) = S[P ](ω) if S[P ](ω) < γ, (50)

where the P-XTC envelope spectrum, S[P ](ω), is givenby Eq. (16), and

γ = 10Γ/20, (51)

with Γ given in dB. We will take Γ ≥ 0 dB and, since Γcannot exceed the magnitude of the peaks in the S[P ](ω)

spectrum, γ is bounded by the inequalities:

1 ≤ γ ≤ 11− g

, (52)

where the last term is S[P ]↑ , given by Eq. (18).The frequency-dependent regularization parameter

needed to effect the spectral flattening requiredby Eq. (49) is obtained by setting S[β](ω), given byEq. (27), equal to γ and solving for β(ω), which is now afunction of frequency. Since the regularized spectral en-velope, S[β](ω), (which is also ||H [β]||, the 2-norm of theregularized XTC filter) is the maximum of two functions,we get two solutions for β(ω):

βI(ω) = −g2 + 2g cos(ωτc) +

√g2 − 2g cos(ωτc) + 1

γ− 1,

(53)

βII(ω) = −g2− 2g cos(ωτc) +

√g2 + 2g cos(ωτc) + 1

γ− 1.

(54)The first solution, βI(ω), applies for frequency bandswhere the out-of-phase response of the perfect filter (i.e.,the second singular value, which is the second argumentof the max function in Eq. (16)) dominates over the in-phase response (i.e., the first argument of that function):

S[P ]o =

1√g2 − 2g cos(ωτc) + 1

≥ S[P ]i =

1√g2 + 2g cos(ωτc) + 1

. (55)

Similarly, regularization with βII(ω) applies for frequencybands where S

[P ]i ≥ S

[P ]o . Therefore, we must distin-

guish between three branches of the optimized solution:two regularized branches corresponding to β = βI(ω)and β = βII(ω), and one non-regularized (perfect-filter)branch corresponding to β = 0. We call these Branch I,II and P, respectively, and sum up the conditions associ-ated with each as follows:

Branch I: applies where S[P ](ω) ≥ γ and S[P ]o ≥ S[P ]

i ,

and requires setting S(ω) = γ, β = βI(ω);

Branch II: applies where S[P ](ω) ≥ γ and S[P ]i ≥ S[P ]

o ,

and requires setting S(ω) = γ, β = βII(ω);

Branch P: applies where S[P ](ω) < γ,

and requires setting S(ω) = S[P ](ω), β = 0.

Following this three-branch division, the envelopespectrum at the loudspeakers, S(ω), for the case offrequency-dependent regularization is plotted as thethick black curve in Fig. 6 for Γ = 7 dB. This valuewas chosen because it corresponds to the magnitude ofthe (doublet) peaks in the β = .05 spectrum (i.e., Γ =20 log10(1/2

√β)), which is also plotted (light solid curve)

Page 13: Optimal Crosstalk Cancellation for Binaural Audio with … · Optimal Crosstalk Cancellation for Binaural Audio with Two Loudspeakers Edgar Y. Choueiri Princeton University choueiri@princeton.edu

E.Y. CHOUEIRI: OPTIMAL CROSSTALK CANCELLATION 13

as a reference for the corresponding case of constant-parameter regularization. (We call a spectrum ob-tained with frequency-dependent regularization and oneobtained with constant-β regularization “correspondingspectra,” if the peaks in S[β](ω), whether singlets or dou-blets, are equal to γ.) It is clear from that figure that

Branch I P II P I P II

Band 1 2 3 4 5 6 7

Β = 0

Β = .05

Β = Β HΩ, G = 7 dBL

Π100 Π10 Π2 Π 2Π 3Π-20

-10

0

10

20

30

40100 1000 104 2x104

Ω Τc

S`Hd

BL

f HHzL

FIG. 6: Envelope spectrum at the loudspeakers, S(ω), for thecase of frequency-dependent regularization with Γ = 7 dB(thick black curve) and for the corresponding reference caseof β = .05 (grey curve). The benchmark case of the perfectXTC filter is also shown (dashed grey curve). The verticaldotted lines show the frequency bounds of the resulting sevenbands, which are numbered consecutively at the top of theplot, and labeled with the corresponding branch name at thebottom. (Other parameters are the same as for Fig. 2.)

the low-frequency boost and the high-frequency peaks ofthe perfect XTC spectrum, which would be transformedinto a low-frequency roll-off and narrow-band artifacts,respectively, by constant-β regularization, are now flatat the desired maximum coloration level, Γ. The rest ofthe spectrum, i.e., the frequency bands with amplitudebelow Γ, is allowed to benefit from the infinite XTC levelof the perfect XTC filter and the robustness associatedwith relatively low condition numbers.

A. Band Hierarchy

The three-branch prescription therefore splits the au-dio spectrum into a series of adjacent frequency bands,which we number consecutively starting with Band 1 forthe lowest-frequency band. The frequency bounds foreach band can be found by setting S[P ](ω), given byEq. (16), to γ and solving for ωτc. This results in the fol-lowing hierarchy of bands and their associated frequencybounds:

• Bands 1, 5, 9, 13, 17, . . . , 4n+ 1 belong to Branch I,and are bounded by

2nπ − φ ≤ ωτc ≤ 2nπ + φ; (56)

• Bands 2, 6, 10, 14, 18, . . . , 4n+2 belong to Branch P,and are bounded by

2nπ + φ ≤ ωτc ≤ (2n+ 1)π − φ; (57)

• Bands 3, 7, 11, 15, 19, . . . , 4n + 3 belong to BranchII, and are bounded by

(2n+ 1)π − φ ≤ ωτc ≤ (2n+ 1)π + φ; (58)

• Bands 4, 8, 12, 16, 20, . . . , 4n+4 belong to Branch P,and are bounded by

(2n+ 1)π + φ ≤ ωτc ≤ 2(2n+ 1)π − φ; (59)

where n = 0, 1, 2, 3, 4, . . . and

φ = cos−1

(g2γ2 + γ2 − 1

2gγ2

). (60)

For instance, applying this hierarchy to the case of g =.985, and Γ = 7 dB (i.e., γ = 107/20 = 2.24), shownin Fig. 6, we have the following set of eight consecutivefrequency bounds for the seven consecutive bands be-tween ωτc = 0 and 3π: 0, 0.45, 2.69, 3.59, 5.83, 6.74,8.97, 9.42, which correspond to dimensional frequen-cies, f (Hz) (with τs = 3 samples at 44.1 kHz) givenby the set: 0, 1061.5, 6288.5, 8411.5, 13638.5, 15761.5,20988.5, 22000, as marked by the vertical lines in Fig. 6.Bands 1 and 5 belong to Branch I and are regularizedwith β = βI(ω); Bands 3 and 7 belong to Branch II andare regularized with β = βII(ω); and Bands 2, 4, and 6belong to Branch P and are not regularized. In general,successive bands, starting from the lowest-frequency one,are mapped to the following succession of branches: I, P,II, P, I, P, II, P, . . .

B. Frequency Response

The amplitude envelope of the frequency response atthe loudspeakers, given by Eqs. (49) and (50), was al-ready shown in Fig. 6. The other optimized metric spec-tra can be derived as follows:

Y[O]I (ω) = Y [βI(ω)](ω), for Branch-I bands; (61)

Y[O]II (ω) = Y [βII(ω)](ω), for Branch-II bands; (62)

Y[O]P (ω) = Y [P ](ω), for Branch-P bands; (63)

where Y represents any of the eight metric spectra wedefined in Section II B, the superscript [O] denotes thesought optimized version of that metric spectrum, thesubscript I, II, or P denotes one of the three branches, andthe superscripts [βI(ω)] and [βII(ω)] denote regularizationfollowing the formulas for the regularized metric spec-tra in Section III A, but with β taken to be frequency-dependent according to Eqs. (53) and (54).

Page 14: Optimal Crosstalk Cancellation for Binaural Audio with … · Optimal Crosstalk Cancellation for Binaural Audio with Two Loudspeakers Edgar Y. Choueiri Princeton University choueiri@princeton.edu

E.Y. CHOUEIRI: OPTIMAL CROSSTALK CANCELLATION 14

For example, following the above hierarchical prescrip-tion, and using Eqs. (28), (53), (54), and (17), the opti-mized crosstalk cancellation spectrum becomes

χ[O]I,II(ω) = ∓

∓γx2 + (g2 + 1)(xγ ±

√g2 ∓ x+ 1

)|x|(γg2 ∓ γx+ γ −

√g2 ∓ x+ 1

) ,

(64)

χ[O]P (ω) = χ[p](ω) =∞, (65)

where, for compactness, we have used the definitionx ≡ 2g cos(ωτc) and combined both branches into one ex-pression using the double subscripts “I,II” and the doublesign (± or ∓) with the top and bottom signs associatedwith Branches I and II, respectively. Similarly, the opti-mized version of the ipsilateral frequency response at theear for a side image, Esi‖(ω), becomes

E[O]

si‖ I,II(ω) =

±xγ2(g2 ∓ x+ 1) + (γ2g + γ)√g2 ∓ x+ 1

g2 ∓ x± 2γx√g2 ∓ x+ 1 + 1

(66)

E[O]

si‖P(ω) = E

[P ]

si‖(ω) = 1. (67)

These spectra are plotted in Fig. 7 where it is im-

EsiÈÈ

Χ

Β = .05

Β = Β HΩ, G = 7 dBL

Π100 Π10 Π2 Π 2Π 3Π-20

-10

0

10

20

30

40100 1000 104 2x104

Ω Τc

dB

f HHzL

FIG. 7: Crosstalk cancellation spectrum, χ(ω) (black curves),and ipsilateral frequency response at the ear for a side image,Esi‖

(ω) (light curves), for the cases of frequency-dependent

regularization (solid curves) and β = .05 (dashed curves).The frequency ranges for which an XTC level of 20 dB orhigher is reached are marked on the top axis by black hori-zontal bars for the case of β = β(ω) with Γ = 7 dB. (Otherparameters are the same as for Fig. 2.)

mediately clear from the χ(ω) curves that frequency-dependent regularization yields a significant enhance-ment of XTC level over that obtained with constant-βregularization. We can also deduce from this plot thatthe higher the desired minimum level of XTC, the larger

is the XTC enhancement over that attained with the cor-responding constant-β regularization.

Furthermore, this XTC enhancement occurs with norelative penalty to the frequency response at the ears,as can be seen by comparing the Esi‖(ω) spectrum withfrequency-dependedent regularization (solid grey curve)to that with β = .05 (dashed grey curve) in the samefigure.

It can be verified through Eqs. (28) and (64) thatconstant-β regularization yields an XTC level that isequal to that obtained with the corresponding frequency-dependent regularization only at the discrete frequenciesat which to the peaks in the corresponding S[β](ω) spec-trum are located, i.e., at

ωτc = nπ, if1

4γ2< (g − 1)2;

= nπ ± cos−1

(g2 − β + 1

2g

),

if (g − 1)2 ≤ 14γ2 1,

with n = 0, 1, 2, 3, 4, . . . (68)

(where the inequalities are those conditioning singlet ordoublet peaks in the corresponding S[β](ω) spectrum,and are derived from Eqs. (29), (30) and (32)). Atall other frequencies, frequency-dependent regularizationyields superior XTC performance to that obtained withconstant-β regularization. This behavior, which can alsobe seen graphically in the χ(ω) curves of Fig. 7, is due tothe fact that forcing the envelope spectrum to be flat (inbands belonging to Branches I and II) through frequency-dependent regularization clamps the effort penalty termin the cost function (second term in the sum in Eq. (23))leading to a minimization of the performance error. Thisin turn leads to a maximization of XTC level, which ex-ceeds the corresponding constant-β XTC level at all fre-quencies (except at those given by Eq. (68), where bothcorresponding S spectra reach the same value, γ), sincethe corresponding constant-β envelope, S[β](ω), is lowerthan (or equal to) γ (as seen in Fig. 6).

Therefore, we conclude that if we define XTC filteroptimization as “the maximization of XTC performancefor a desired tolerable level of spectral coloration” aswe did earlier, only frequency-dependent regularizationleads to an optimal XTC filter over all frequencies, whileconstant-β regularization leads to an XTC filter thatis optimized only at the discrete frequencies given byEq. (68).

C. Impulse Response: The Band-AssembledCrosstalk Cancellation Hierarchy Filter

In the frequency domain, the optimized XTC filter isgiven by the following matrix:

Page 15: Optimal Crosstalk Cancellation for Binaural Audio with … · Optimal Crosstalk Cancellation for Binaural Audio with Two Loudspeakers Edgar Y. Choueiri Princeton University choueiri@princeton.edu

E.Y. CHOUEIRI: OPTIMAL CROSSTALK CANCELLATION 15

H [O] =

H [O]LL(iω) H

[O]LR(iω)

H[O]RL(iω) H

[O]RR(iω)

, (69)

whose elements are derived following the same hierarchi-cal prescription (i.e., Eqs. (61)-(63)) we used to get theoptimized metric spectra, namely by substituting β1(ω)and β1(ω) from Eqs. (53) and (54), and β = 0 into eachof Eqs. (25) and (26), to get the Branch I, Branch II andBranch P versions of the filter’s matrix elements. Thisleads to

H[O]LLI,II

(iω) = H[O]RRI,II

(iω)

=γ2[±x− g2

(1 + e2iωτc

)]+ γ√g2 ∓ x+ 1

g2 ± x(

2γ√g2 ∓ x+ 1− 1

)+ 1

,

(70)

H[O]LRI,II

(iω) = H[O]RLI,II

(iω)

=∓γ2

[±x− g2

(1 + e2iωτc

)]+ gγeiωτc

√g2 ∓ x+ 1

g2 ± x(

2γ√g2 ∓ x+ 1− 1

)+ 1

,

(71)

H[O]LLP

(iω) = H[O]LLP

(iω) = H[P ]LL(iω) = H

[P ]RR(iω), (72)

H[O]LRP

(iω) = H[O]RLP

(iω) = H[P ]LR(iω) = H

[P ]RL(iω), (73)

where, again,

x ≡ 2g cos(ωτc),

and we have followed the same subscript and sign con-ventions used to compact the XTC spectrum in Eq. (64).Eqs. (72) and (73) give the Branch-P elements of the ma-trix of the optimized filter, which are also the elementsof the perfect filter’s matrix given by Eqs. (45) and (46),whose inverse Fourier transforms had given us the IRsexpressed in Eqs. (47) and (48). Therefore we need toderive only the IRs associated with Branches I and II ofthe optimized filter.

To do so, we follow, albeit through more cumber-some algebra, the same approach we used to obtain theconstant-β IRs in Section III B; namely, we seek to factorthe frequency-domain response of the filter into a prod-uct of terms, whose IFT can be readily found, or whichcan be expressed as convergent series of functions whoseIFT can be readily found. The complete IR is then theconvolution of the IFTs of all the terms in the factoredfrequency-domain response of the filter. The challengeis to carry out the factorization in such a way that allthe invoked power series expansions converge over theparameter space of interest.

The derivation is carried out in Appendix A, where wealso discuss the convergence of the adopted series expan-sions. The resulting filter in the time domain is given by

the following two IRs:

h[O]LLI,II

(t) = h[O]RRI,II

(t) = (ψ0 + γψ1) ∗ ψa, (74)

h[O]LRI,II

(t) = h[O]RLI,II

(t) = [∓ψ0 + γgδ(t+ τc) ∗ ψ1] ∗ ψa,

(75)

where

ψa = ±ψ2 ∗ ψ3 ± (ψ1 ∓ ψ4) ∗ ψ5ψ6(c1) ∗ ψ6(c2),(76)

ψ0 = −g2γ2δ(t)± gγ2δ(τc − t)± gγ2δ(t+ τc)−g2γ2δ(t+ 2τc), (77)

ψ1 =∞∑m=0

(12m

)(∓g)m

(g2 + 1

) 12−m ×

m∑k=0

(mk

)δ(2kτc − t−mτc), (78)

ψ2 = ± 14gγ

∞∑m=0

(− 1

2m

)(−1)m ×

2m∑k=0

(2mk

)(−1)k+m4−mδ(t+ 2kτc − 2mτc),

(79)

ψ3 =∞∑m=0

(− 1

2m

)(∓g)m

(g2 + 1

)− 12−m ×

m∑k=0

(mk

)δ(2kτc − t−mτc), (80)

ψ4 = 2gγδ(τc − t) + 2gγδ(t+ τc), (81)

ψ5 = ± 1(4gγ)3

∞∑m=0

(− 3

2m

)(−1)m ×

2m∑k=0

(2mk

)(−1)k+m4−mδ(t+ 2kτc − 2mτc),

(82)

ψ6(c) =∞∑p=0

(±c2g

)p ∞∑n=0

(−p2n

)(−1)n ×

2n∑k=0

(2mk

)(−1)k+m4−mδ(t+ 2kτc − 2mτc).

(83)

with the constants c1 and c2 given by

c1 =

√16γ2(g2 + 1) + 1∓ 1

8γ2, (84)

c2 =−√

16γ2(g2 + 1) + 1∓ 18γ2

. (85)

The impulse responses are valid for values of γ and g that

Page 16: Optimal Crosstalk Cancellation for Binaural Audio with … · Optimal Crosstalk Cancellation for Binaural Audio with Two Loudspeakers Edgar Y. Choueiri Princeton University choueiri@princeton.edu

E.Y. CHOUEIRI: OPTIMAL CROSSTALK CANCELLATION 16

satisfy the condition:

max

(√5 +√

5

2√g2 + 1

, 1

)≤ γ ≤ 1

1− g, (86)

which is shown graphically as a region plot in Fig. 9, inAppendix A.

ææææææææææææææææææ ææ

ææ

ææ

ææ

ææ

ææ

æ

æ

æ

æ

æ

æ

æ

ææææææææææææææææææææææææææææææçççççççççççççççççççç

çççç

çç

ç

ç

çç

ç

ç

ç

ç

ç

ç

ç

ç

çç

ççççççç

ççççççççççççççççççççç

Branch I

-100 -50 0 50 100-1.0

-0.5

0.0

0.5

1.0

æææææææææææææææææææææææææææææææ

ææ

æææææ

æ

æ

æ

ææ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æææ

æ

æ

æ

æ

æææææ

æ

æ

ææææ

æææ

æææææææææææææææææææææççççççççççççç

ççççççççççççççççç

ççççç

ç

ç

ç

ç

ç

çç

ç

ç

ç

ç

ç

ç

ç

ç

ç

ç

ç

ç

çç

çç

ç

ç

ç

ç

çççç

ç

ç

ç

ççççççççççç

ççççççççççççççççççç

Branch II

-100 -50 0 50 100-1.0

-0.5

0.0

0.5

1.0

t HsamplesL

FIG. 8: Impulse response of the optimal XTC filter: h[O]LL(t) =

h[O]RR(t) (filled circles) and h

[O]LR(t) = h

[O]RL(t) (empty circles),

for Branch I (top panel) and Branch II (bottom panel). (Γ =7 dB, g = .985, and τc = 3 samples, as in Fig. 2.)

The impulse responses for Branch-I and Branch-II ofthis optimal filter are shown in Fig. 8 for our typicalcase of g = .985 and τc = 3 samples, and, along withthe perfect filter IR shown in the top panel of Fig. 5,completely specify the optimal XTC filter.

Compared to the corresponding (β = .05) finite-betaIR in the bottom panel of Fig. 5, the optimal XTC IRsshown in Fig. 8 are more complex in their structure. Fur-thermore, each IR consists of a train of deltas that arespaced by τc as opposed to the 2τc intervals we had forthe perfect and finite-beta filters.

These IRs are difficult to interpret physically because,as they stand, they also include the time response asso-ciated, in the frequency domain, with frequency bandswhere the IR is not valid. This is illustrated in Ap-pendix B, in the bottom panel of Fig. 10, where theenvelope spectrum obtained from the Fourier transformof the Branch-I optimal IR is compared to the expected

flat envelope spectrum, S[O]I (ω) = γ. The agreement is

excellent only in the bands belonging to the branch forwhich the IR is intended (which, in the case illustrated inthat plot, are the first and fifth bands). In other bands,not only is the IR not valid, but, as discussed in theappendices, its application may lead to singularities as-sociated with the divergence of some of the series thatconstitute it (see for instance the singularities appearingin the Branch-P bands in Fig. 10).

Therefore, in principle, the application of the op-timal filter requires that, prior to XTC filtering, therecorded signal, [dLi

(t), dRi(t)]T , be passed through a

crossover filter whose crossover frequencies are set tothe band bounds given by the hierarchical prescriptionin Eqs. (56)-(59). The resulting bands are then as-sembled into three groups (I, II and P) according totheir branch identity. The combined recorded stereo sig-nals in each group can thus be represented by a vector[dLi(t), dRi(t)]

T , where the index i stands for Branch I,II or P. The loudspeakers source vector, in the time do-main, needed for optimal crosstalk cancellation is thengiven by the time-domain version of Eq. (9):vL(t)

vR(t)

=∑i

h[O]

LLi(t) h

[O]LRi

(t)

h[O]RLi

(t) h[O]RRi

(t)

∗dLi(t)

dRi(t)

,

(87)where the summation is over the three branches, andthe convolution operates in the same fashion as matrixmultiplication.

Causality is insured by calculating the IRs with a “pre-delay,” starting back at a time t < 0, whose exact tem-poral extent is not important as long as it allows the in-clusion of the salient part of the IR. For the IRs in Fig. 8,this pre-delay should start at about t = −100 samples.

V. APPLICATION AND PRACTICALCONSIDERATIONS

While the first goal of the preceding analyses wasto provide insight into the theory and fundamentals ofXTC optimization, the resulting optimized IR can offerpractical benefits in audio applications where the spa-tial fidelity of a recording is degraded by the unintendedcrosstalk inherent in playback through loudspeakers. Aswe argued in Section I A, this is the case of not onlypure binaural recordings made with dummy head micro-phones, but also of the vast majority of standard stereorecordings, since they generally contain ILD and ITDcues, which would be degraded by unintended crosstalk.

A. The Value of Analytical XTC Filters

Analytical XTC filters cannot rival the performance ofnumerical HRTF-based XTC filters in ideal situations,i.e., when 1) sound reflections in the listening room are

Page 17: Optimal Crosstalk Cancellation for Binaural Audio with … · Optimal Crosstalk Cancellation for Binaural Audio with Two Loudspeakers Edgar Y. Choueiri Princeton University choueiri@princeton.edu

E.Y. CHOUEIRI: OPTIMAL CROSSTALK CANCELLATION 17

negligible or non-existent (anechoic or semi-anechoic en-vironments), 2) the recording was made with the indi-vidualized inverted HRTF of the listener, 3) the XTCfilter includes the individualized inverted HRTF of thelistener, and 4) the listener’s head is constrained in a re-stricted sweet spot. Any departure from these idealitieswould cause the effective XTC level at the listener’s earsto drop, and the spectral coloration that is necessarilyimposed at the loudspeakers to become more audible atthe listener’s ears.

Since in many, if not most, practical listening situa-tions in non-anechoic environments all of the four ide-alities listed above are compromised to a certain degree,the practically achievable XTC level of numerical HRTF-based XTC filters seldom exceeds 13 dB over a wide fre-quency range[3]. An optimal analytical XTC filter, evenone based on a free-field model, such as the one derived inthe previous section, can become competitive especiallyin situations where it is calculated for, and used with,a loudspeaker span that is small enough to diminish therelative importance of head-shadowing effects. In suchapplications, an optimal analytical XTC filter can offerthe following advantages over a numerical HRTF-basedXTC filter:

1. The simplicity of using a single filter for all indi-viduals.

2. Shorter filters which incur lower CPU loads on thedigital processor.

3. Low spectral coloration for listeners inside and out-side the sweet spot (and the associated decrease inthe physical loading of the transducers).

(The third advantage could, in principle, be neutralizedby applying the optimal regularization method, describedin the previous section, to the design of a numericalHRTF-based XTC – at the price of eroding some of theadvantages associated with the use of an individualizedHRTF.)

With this justification for the usefulness ofanalytically-derived optimal XTC filters, we turnour attention to some practical issues related to theirspecific design and their application to real listeningsituations.

B. Filter Design Strategy

Of course, filter design strategies depend on perfor-mance requirements (desired maximum tolerable col-oration level or minimum XTC level) and the specificsand constraints of the listening configuration (constraintson the listening distance, l, and the loudspeaker span, Θ,and to some extent, the sound reflection characteristicsof the listening room).

One approach to filter design is to start with the spec-ification of the maximum tolerable coloration level, thatis, Γ in dB. For instance, for critical (e.g., audiophile)

listening and audio mastering applications, it may notbe desirable to have Γ exceed 3-5 dB, while for home-theatre applications, audio (spectral) fidelity may be in-tentionally compromised with higher values of Γ for theadvantage of having more XTC headroom for reproduc-ing surround effects with the two loudspeakers.

The choice of loudspeaker span is particularly impor-tant. In cases where it is constrained to a set value, asfor compatibility with the so-called “standard stereo tri-angle”, i.e., Θ = 60, the value of Θ becomes a fixedinput to the design process and is used, along with l, tocalculate g and τc from Eqs. (3)-(6). (The inequality inEq. (86), which is typically easy to satisfy, must hold forthat particular combination of γ = 10Γ/20 and g. If not,one of the input parameters, usually Γ, must be adjustedaccordingly before proceeding further with the design).In cases where Θ is not constrained to a preset value,it becomes a useful variable in the filter design processand can be used to simplify the filter, as discussed inSection V C below.

With γ, g, and τc specified, one has all the parame-ters needed to calculate the spectra associated with theXTC optimal filter, as described in Sections IV A andIV B, and thus evaluate the various aspects of the fil-ter. (These evaluations are more conveniently done interms of the dimensional frequency, f , in Hz, by select-ing the intended sampling rate.) In particular, a plotof the XTC spectrum according to Eqs. (64) and (65)allows the evaluation of the XTC performance of the fil-ter (defined as the frequency extent over which a desiredminimum XTC level is reached or exceeded) which, byvirtue of the implicit optimization (i.e., minimization ofthe cost function in Eq. (23)), is the maximum achievableXTC performance for that particular set of input param-eters. If the calculated XTC performance is judged bysome empirical standards to be above that achievable inthe intended listening environment (for instance, soundreflections in a reverberant room may limit the achiev-able XTC to only a few dB over a good part of the audiospectrum), the calculation can be repeated with a lowervalue of Γ, thus leading to even higher spectral fidelity.Conversely, a lower than desired XTC performance canbe amended by raising Γ.

Once the target XTC performance and coloration levelare reached, one proceeds to the time domain by calcu-lating the Branch-P IRs from Eqs. (42)-(44), and theBranch-I, and II IRs from Eqs. (74)-(85). The loud-speakers source vector can then be calculated accordingto Eq. (87), following the prescription given in the textpreceding that equation, i.e., by appropriately convolv-ing the 3-part IR with the recorded stereo signal afterhaving passed the latter through a crossover filter whosecrossover frequencies are set to the band bounds givenin Eqs. (56)-(59). The convolution operations can becarried out digitally, and in real time if desired, usinga digital convolution plugin. (Such software plugins relyon FFT-based algorithms[39, 40] for fast convolution andhave become readily available in the commercial and pub-

Page 18: Optimal Crosstalk Cancellation for Binaural Audio with … · Optimal Crosstalk Cancellation for Binaural Audio with Two Loudspeakers Edgar Y. Choueiri Princeton University choueiri@princeton.edu

E.Y. CHOUEIRI: OPTIMAL CROSSTALK CANCELLATION 18

lic domains for use as IR-based reverb processors.)

C. Simplified Implementation

An XTC system consisting of the properly configuredcrossover filter, the three XTC IR matrices, and the mul-tiple instances of convolution plugins, can be consideredas a single filter, having stereo inputs and outputs, whichacts as a linear operator. Therefore, once assembled, itcan be “rung” once by a single delta impulse, appliedto one of its two inputs, and the recorded stereo outputwould then represent one of the two columns of the 2x2IR matrix of the entire filter. Because of symmetry, theother column of the IR matrix is obtained by simply flip-ping the two recorded outputs. This results in a singleIR, representing the entire three-branch multi-band fil-ter, and simplifies any future application of Eq. (87) toa simpler one (with no crossover filtering) in which thesummation and indices are foregone.

D. The Role of Loudspeaker Span

Another important simplification arises in applicationswhere the loudspeaker span, Θ = 2θ, is not constrainedto a preset value, such at the 60 of the standard stereotriangle, and therefore can be a variable in the filter de-sign process. Since τc depends on the loudspeaker span,the bounds of the bands can be moved by varying θ. Bysetting θ equal to a particular value, θ∗, the higher boundof the second band (which belongs to Branch P) can bemade to coincide with a cutoff frequency, fc, above whichXTC is psychoacoustically not needed. Such a band-limited optimal XTC filter has the advantage that it re-quires only a 2-band crossover filter, and its IR consistsof only the Branch-I and Branch-P parts, thus leading tosignificant simplifications in the design and implementa-tion of the filter.

To find an expression for θ∗ as a function of fc, underthe typically valid approximations g ' 1 and l ∆r,we set ωτc equal to the upper bound of the second band(which, from Eq. (57), is π − φ), use Eq. (21), and solvefor θ, to get

θ∗ ' sin−1

cs(π − cos−1

[2γ2 − 1

2γ2

])2πfc∆r

. (88)

A number of studies[27, 41] have suggested that XTCabove a frequency of about 6 kHz is not critical or per-haps even necessary. Therefore, we set fc to that valuein the above equation, solve for θ∗, design the filter fora loudspeaker span of 2θ∗, use a 2-band crossover filterto separate the first two bands, apply the Branch-I andBranch-P parts of the filter to the first and second bands,respectively, and allow the part of the audio spectrumabove fc to bypass the filter.

It is relevant to mention in the context of loudspeakerspan that keeping Θ small offers advantages that havebeen recognized since Kirkeby and co-workers presentedtheir analysis[20] of the “stereo dipole” configuration,which has a span of only 10. Objective and subjectiveevaluations of the effects of loudspeaker span in XTCsystems have indicated that such a low-Θ configurationgives a larger sweet spot than that obtained with largerloudspeaker spans[18]. This can be attributed to the rel-ative insensitivity of the path length difference, ∆l, tohead movements when the span is small. On the otherhand, the same study favored larger spans partly becauseincreasing the span, with the distance l fixed, lowers thevalue of g and consequently decreases the magnitude ofthe coloration peaks and condition numbers. We shouldhowever expect, in light of our study of regularization,that an optimal XTC filter in which regularization is usedto flatten these peaks and lower the condition numbers,while maintaining good XTC performance, should tip thebalance in favor of lower values of Θ. This remains to beverified experimentally.

Another argument in favor of small loudspeaker spansis particular to the use of analytical filters based on afree-field model, such as those discussed in this paper.Since the free-field model ignores the presence of the lis-tener’s head, it should be expected that filters based onit perform better when the effects of head shadowing areminimized. This can be achieved by decreasing the spanangle as can be seen, for instance, in Fig. 3.13 of Ref. [27],where the inter-aural transfer function (the ratio of thefrequency responses at the two ears) of a typical humanhead, measured as a function of the azimuthal positionof a sound source, is small (about -2dB) and flat (within2 dB) for a small horizontal source azimuth (θ = 5), butworsens with increasing azimuths.

E. An Example

To illustrate the above design guidelines and discus-sions, we give the example of a listening situation whoseonly two design requirements are a distance l = 1.6 mand a maximum coloration level of Γ = 7 dB. FromEq. (88), with fc ' 6 kHz, and ∆r = 15 cm[42], weget θ = 9, which we take as half the loudspeaker span.From Eqs. (3)-(6), we can then calculate g = .985 andτc = 3 samples at a sampling rate of 44.1 kHz. These areprecisely the dimensional and non-dimensional parame-ters chosen for the calculations that are illustrated in theplots throughout this paper. The Branch-P and Branch-IIRs are therefore given by those shown in the top panelsof Figs. 5 and 8, respectively. The Branch-II IR is notneeded as the XTC filter is limited to 6 kHz, which, bydesign, was made to be the upper bound of the secondband (Branch-P). The spectra associated with this filterare given by the solid curves in Figs. 6 and 7, with thedimensional frequency read off the top axes of the plots,up to the cuttoff frequency of 6 kHz. In particular, we

Page 19: Optimal Crosstalk Cancellation for Binaural Audio with … · Optimal Crosstalk Cancellation for Binaural Audio with Two Loudspeakers Edgar Y. Choueiri Princeton University choueiri@princeton.edu

E.Y. CHOUEIRI: OPTIMAL CROSSTALK CANCELLATION 19

note that the XTC performance (top curve in Fig. 7), ex-ceeds 20 dB for a wide range of frequencies that extendsfrom the 6 kHz cuttoff down to 850 Hz, then drops offwith decreasing frequency, reaching 5 dB at 290 Hz.

VI. SUMMARY

We distill the main points and findings of this work inthe following bullets.

• 3-D reproduction of binaural audio with two loud-speakers requires cancellation of the crosstalk be-tween the loudspeakers and the contralateral earsof the listener. A perfect XTC filter (i.e., one withinfinite crosstalk cancellation) can be easily de-signed but causes severe spectral coloration to thesound emitted by the loudspeakers due to the ill-conditioned inversion of the system’s transfer func-tion.

• The coloration produced by the perfect XTC filterconsists of peaks in the frequency spectrum thatcan typically exceed 30 dB, and thus strain theplayback transducers and significantly reduce thedynamic range of the playback system. Further-more, the coloration is heard throughout the listen-ing space and, due to extreme sensitivity to errorsin the system, it is also heard by the listener in thesweet spot.

• Using a two-source free-field model, we have shownthat constant-parameter regularization, which hasbeen used previously to design HRTF-based XTCsystems, can tame these peaks but can produce abass roll-off and high-frequency artifacts in the fil-ter’s frequency response. Furthermore, we demon-strated that constant-beta regularization does notlead to the optimization of XTC filters across allfrequencies, but rather only at discrete, widely-spaced frequencies.

• Full optimization can be achieved throughfrequency-dependent regularization, and requiresthat the audio spectrum be divided into a hier-archical set of adjacent frequency bands, each ofwhich belonging to one of three solution branchesthat make up the complete optimal filter. We de-rived analytical expressions for the three branchesof the filter in terms of series expansions, which weshowed are convergent for typical listening situa-tions. The corresponding impulse responses werethen obtained analytically, and expressed as con-volutions of trains of Dirac deltas.

• Aside from seeking fundamental insight into the na-ture and characteristics of optimal XTC filters, weaddressed a number of issues related to their appli-cation. In particular, we argued that analytical fil-ters derived under the simplifying assumptions of a

free-field model can be useful in practical situationswhere individualized HRTF-based XTC filters areeither too cumbersome to implement or not neededto attain the XTC levels required for enhancing thespatial fidelity of playback in non-anechoic environ-ments. We described a strategy for designing suchoptimal filters that meets practical design require-ments, and we gave an illustrative example for atypical listening configuration.

Acknowledgments

The author wishes to thank Ralph Glasgal for intro-ducing him to the XTC problem, Angelo Farina for hisX-Volver convolution plugin, Robin Miller for his inde-pendent analytical auditioning of the filters, and J. S.Bach for his Mass in B Minor.

APPENDIX A: DERIVATION OF THE IMPULSERESPONSE OF THE OPTIMAL XTC FILTER

Here we carry out the derivation of Eqs. (74) to (83)following the approach outlined in Section IV C.

We start by factoring the expressions appearingin Eqs. (70) and (71), which, we note, have the samedenominator, into the following product of terms:

H[O]LLI,II

(iω) = H[O]RRI,II

(iω) = (Ψ0 + γΨ1) Ψa (A1)

H[O]LRI,II

(iω) = H[O]RLI,II

(iω) =(∓Ψ0 + γgeiωτcΨ1

)Ψa,

(A2)

where

Ψ0 = γ2[±x−

(1 + e2iωτc

)g2], (A3)

Ψ1 =√g2 ∓ x+ 1, (A4)

Ψa =1

g2 ± x(

2γ√g2 ∓ x+ 1− 1

)+ 1

. (A5)

The term Ψa can be factored as

Ψa = ±Ψ2Ψ3 ± (Ψ1 ∓Ψ4)Ψ5Ψ6(c1)Ψ6(c2),

where

Ψ2 =1

2γx, (A6)

Ψ3 =1√

g2 ∓ x+ 1, (A7)

Ψ4 = 2γx, (A8)

Ψ5 =1

8γ3x3, (A9)

Ψ6(c) =1

1− cx−1, (A10)

Page 20: Optimal Crosstalk Cancellation for Binaural Audio with … · Optimal Crosstalk Cancellation for Binaural Audio with Two Loudspeakers Edgar Y. Choueiri Princeton University choueiri@princeton.edu

E.Y. CHOUEIRI: OPTIMAL CROSSTALK CANCELLATION 20

and

c1 =

√16γ2(g2 + 1) + 1∓ 1

8γ2, (A11)

c2 =−√

16γ2(g2 + 1) + 1∓ 18γ2

. (A12)

In the time domain, the filter expressed by Eqs. (A1)and (A2) becomes:

h[O]LLI,II

(t) = h[O]RRI,II

(t) = (ψ0 + γψ1) ∗ ψa, (A13)

h[O]LRI,II

(t) = h[O]RLI,II

(t) = [∓ψ0 + gγδ(t+ τc) ∗ ψ1] ∗ ψa,

(A14)

where

ψa = ±ψ2 ∗ ψ3 ± (ψ1 ∓ ψ4) ∗ ψ5ψ6(c1) ∗ ψ6(c2). (A15)

The ψi terms are functions of time, and are the IFTs ofthe Ψi terms, which are functions of frequency.

We now seek the IFT of each of the Ψi terms givenabove.• Ψ0: The IFT of the expression in Eq. (A3) can be

readily found by substituting back 2g cos(ωτc) for x andcarrying out the IFT integration:

ψ0(t) =1

∫ ∞−∞

γ2[±2g cos(ωτc)− g2 − e2iωτc

]dω

= −g2γ2δ(t)± gγ2δ(τc − t)± gγ2δ(t+ τc)−g2γ2δ(t+ 2τc). (A16)

• Ψ1: Making the substitution b = g2 + 1 in Eq. (A4),we get

Ψ1 =√b∓ x, (A17)

which can be expressed as the series expansion

Ψ1 =∞∑m=0

(12m

)(∓x)mb

12−m (A18)

where we have used the binomial coefficient(km

)=

k!m!(k −m)!

if 0 ≤ m ≤ k

= 0 if m < 0 or k < m.

Since 0 < g < 1, we have x = 2g cos(ωτc)| < g2 + 1 = b,and the series in Eq. (A18) always converges. How-ever, as g → 1, b → 2, and when ωτc → n2π withn = 0, 1, 2, 3, 4, . . . , x→ b and the series converges slowly.Replacing x and b by their explicit values, we get

Ψ1 =∞∑m=0

(12m

)(∓2)mgm(g2 + 1)

12−m cosm(ωτc).

(A19)

Since cosm(ωτc) can be written as the finite sum

cosm(ωτc) =m∑k=0

(mk

)2−me−i(2k−m)ωτc , (A20)

and since the IFT of e−i(2k−m)ωτc is

12π

∫ ∞−∞

e−i(2k−m)ωτc dω = δ(2kτc − t−mτc),

the IFT of Ψ1 can be expressed as

ψ1(t) =∞∑m=0

(12m

)(∓g)m

(g2 + 1

) 12−m ×

m∑k=0

(mk

)δ(2kτc − t−mτc). (A21)

• Ψ2: Explicitly, Eq. (A6) is

Ψ2 =sec(ωτc)

4gγ.

The problem is that the IFT of sec(ωτc) cannot be ex-pressed in terms of real delta functions. However, thefunction sec(ωτc) can be expressed as

sec(ωτc) =1√

1− sin2(ωτc)(A22)

if n2π − π

2< ωτc < n2π +

π

2with n = 0, 1, 2, 3, 4, . . .

Furthermore, we note that since

1 ≤ γ ≤ 11− g

and 0 < g < 1, (A23)

the arguments of the inverse cosine function in Eq. (60)obeys the condition:

g2γ2 + γ2 − 12gγ2

≥ 0 (A24)

which leads us to write

0 ≤ φ ≤ π

2. (A25)

In light of this expression and Eq. (56), we conclude thatthe conditions for the validity of Eq. (A22) are alwayssatisfied in Branch-I bands.

Similarly, we find that sec(ωτc) can be expressed as

−1/√

1− sin2(ωτc) for conditions that are always satis-fied for Branch-II bands. Therefore, we can write

sec(ωτc) = ± 1√1− sin2(ωτc)

, (A26)

Page 21: Optimal Crosstalk Cancellation for Binaural Audio with … · Optimal Crosstalk Cancellation for Binaural Audio with Two Loudspeakers Edgar Y. Choueiri Princeton University choueiri@princeton.edu

E.Y. CHOUEIRI: OPTIMAL CROSSTALK CANCELLATION 21

for which we wish to use the expansion

1√1− u

=∞∑m=0

(− 1

2m

)(−1)mum. (A27)

However, this series converges only for |u| < 1. For ourparticular case, u = sin2(ωτc) and the series divergesat ωτc = nπ/2, with n = 1, 3, 5, 7, . . . From the banddivision conditions in Eqs. (56) and (56) we see that thesevalues of ωτc are always outside Branch-I and Branch-IIbands; therefore the convergence of the series is assuredand this allows us to express Eq. (A26) as

sec(ωτc) = ±∞∑m=0

(− 1

2m

)(−1)m sin2m(ωτc). (A28)

Since sin2m(ωτc) can be written as the finite sum

sin2m(ωτc) =2m∑k=0

(2mk

)(−1)k+m4−me2i(k−m)ωτc ,

(A29)and since the IFT of e2i(k−m)ωτc is

12π

∫ ∞−∞

e2i(k−m)ωτc dω = δ(t+ 2kτc − 2mτc), (A30)

the IFT of Ψ2 can be expressed as

ψ2 = ± 14gγ

∞∑m=0

(− 1

2m

)(−1)m ×

2m∑k=0

(2mk

)(−1)k+m4−mδ(t+ 2kτc − 2mτc).

(A31)

• Ψ3: The function 1/√b∓ x, with b = g2 + 1, has a

series expansion in the form of Eq. (A18), but with thefraction 1/2 inside the binomial coefficient replaced by−1/2. Therefore, by analogy with the result expressedin Eq. (A21), we have

ψ3(t) =∞∑m=0

(− 1

2m

)(∓g)m

(g2 + 1

) 12−m ×

m∑k=0

(mk

)δ(2kτc − t−mτc), (A32)

which has the same convergence behavior as that of ψ1.• Ψ4: The IFT of Ψ4 = 2γx = 4γg cos(ωτc) is straight-

forward:

ψ4 = 2γgδ(τc − t) + 2γgδ(t+ τc). (A33)

• Ψ5: Explicitly, Eq. (A9) is

Ψ5 =sec3(ωτc)

(4gγ)3, (A34)

where, following the same arguments as in the case of Ψ2,the function sec3(ωτc) can be expanded in a convergentseries of the form of that in Eq. (A28), but with thefraction −1/2 inside the binomial coefficient replaced by−3/2. Therefore, by analogy with the result expressedin Eq. (A31), we have

ψ5 = ± 1(4gγ)3

∞∑m=0

(− 3

2m

)(−1)m ×

2m∑k=0

(2mk

)(−1)k+m4−mδ(t+ 2kτc − 2mτc).

(A35)

• Ψ6: Eq. (A10) can be written as

Ψ6 =1

1− y(c)(A36)

where

y ≡ c

x=

c

2g cos (ωτc), (A37)

and c represents either c1 or c2, given by Eqs. (A11) and(A12), respectively. We wish to expand the function inEq. (A36) into the power series

σ(c) ≡∞∑p=0

yp(c) (A38)

but this series converges only for

|y(c)| < 1. (A39)

We now show that this convergence condition leads to arestriction on the allowable range of γ and g, but thatthis restriction does not limit the applicability of the IRto real listening configurations.

The inequalities in Eq. (A25), and the band divi-sion conditions in Eq. (56) and (58), imply that x =2g cos(ωτc) is always positive in Branch-I bands and neg-ative in Branch-II bands. Furthermore, we see fromEqs. (A11) and (A12) that, under the conditions inEq. (A23), c1 ≥ 0 and c2 ≤ 0. Therefore, we have

y(c1) = c1/x ≥ 0 in Branch-I bands, (A40)y(c1) = c1/x ≤ 0 in Branch-II bands, (A41)

and

y(c2) = c2/x ≤ 0 in Branch-I bands, (A42)y(c2) = c2/x ≥ 0 in Branch-II bands. (A43)

If we define η+(c) and η−(c) to be the non-dimensionalfrequencies, ωτc, at which y(c) = +1 and y(c) = −1,respectively, we can, in light of the expressions above,restate the convergence condition in Eq. (A39) as:

σ(c1) converges in Branch-I bands ifφ ≤ η+(c1) (A44)

σ(c1) converges in Branch-II bands ifη−(c1) ≤ π − φ (A45)

Page 22: Optimal Crosstalk Cancellation for Binaural Audio with … · Optimal Crosstalk Cancellation for Binaural Audio with Two Loudspeakers Edgar Y. Choueiri Princeton University choueiri@princeton.edu

E.Y. CHOUEIRI: OPTIMAL CROSSTALK CANCELLATION 22

and

σ(c2) converges in Branch-I bands ifφ1 ≤ η−(c2)) (A46)

σ(c2) converges in Branch-II bands ifη+(c2) ≤ π − φ. (A47)

Therefore, for σ(c) to converge in Branch-I and Branch-IIbands, all four inequalities must be satisfied. To expressthese convergence conditions explicitly (i.e., in terms ofconditions on γ and g), we first set y(c) = +1 and y(c) =−1, and solve for η+(c) and η−(c), respectively, to find

η+(c1) = cos−1

(f(g, γ)− 1

16gγ2

)(A48)

η−(c1) = cos−1

(−f(g, γ) + 1

16gγ2

)(A49)

and

η+(c2) = cos−1

(−f(g, γ)− 1

16gγ2

)(A50)

η−(c1) = cos−1

(f(g, γ) + 1

16gγ2

)(A51)

where, for compactness, we have used the function f(g, γ)defined as

f(g, γ) ≡√

16 (g2 + 1) γ2 + 1.

Using these four explicit expressions, along with the defi-nition of φ given by Eq. (60), we find that the inequalitiesin Eqs. (A44) and (A47) lead to the same explicit con-vergence condition:

f(g, γ) + 78(g2 + 1)γ2

≤ 1; (A52)

and the inequalities in Eqs. (A45) and (A46) lead to

f(g, γ) + 98(g2 + 1)γ2

≤ 1. (A53)

Since both of these inequalities need to be satisfied, andsince the latter condition is more stringent than the for-mer, we must satisfy the latter. We can finally statethe condition for σ(c) to converge in both Branch-I andBranch-II bands explicitly in terms of g and γ:√

16 (g2 + 1) γ2 + 1 + 98(g2 + 1)γ2

≤ 1. (A54)

This convergence condition is illustrated in the regionplot of Fig. 9, where the black region denotes the values ofg and γ for which the convergence condition is violated.It is clear that this restriction only slightly limits the

1 2 3 4 5 6 7 8 9 100.0

0.2

0.4

0.6

0.8

1.0

Γ

g

FIG. 9: Region plot showing the allowed values for g and γ(white). The black-shaded region is where the series conver-gence condition in Eq. (A54) is not satisfied, and the grey-shaded region is where the general condition in Eq. (52) isviolated.

range of allowable γ and g, and is not relevant to reallistening geometries, where g ' 1.

Aside from the series convergence condition above, γmust satisfy the general condition given by Eq. (52)(whose region of violation is shaded in grey in Fig. 9).Therefore, we combine both conditions in the followingexpression:

max

(√5 +√

5

2√g2 + 1

, 1

)≤ γ ≤ 1

1− g, (A55)

where the first argument of the max function comes fromsetting the left-hand side of the convergence condition inEq. (A54) to 1, and solving for γ.

Now that we have found the convergence condition forthe series in Eq. (A38), we can express Ψ6 as that seriesand proceed to find its IFT. Replacing y and x in thatseries by their explicit values, we write

Ψ6 =∞∑p=0

(c

2g

)psecp(ωτc). (A56)

The secp(ωτc) term can be expanded in a convergent se-ries of the same form as the series in Eq. (A28), but withthe fraction −1/2 inside the binomial coefficient replacedby −p/2, and this leads to:

Ψ6 =∞∑p=0

(±c2g

)p ∞∑m=0

(−p2m

)(−1)m sin2m(ωτc).

(A57)Finally, recalling the finite sum in Eq. (A29), and theassociated IFT in Eq. (A30), we arrive at the sought

Page 23: Optimal Crosstalk Cancellation for Binaural Audio with … · Optimal Crosstalk Cancellation for Binaural Audio with Two Loudspeakers Edgar Y. Choueiri Princeton University choueiri@princeton.edu

E.Y. CHOUEIRI: OPTIMAL CROSSTALK CANCELLATION 23

expression for the IFT of Ψ6(c):

ψ6(c) =∞∑p=0

(±c2g

)p ∞∑m=0

(−p2m

)(−1)m ×

2m∑k=0

(2mk

)(−1)k+m4−mδ(t+ 2kτc − 2mτc).

(A58)

The complete impulse response of the optimal XTCfilter is assembled according to Eqs. (A13) to (A15), andis valid under the condition stated in Eq. (A55).

APPENDIX B: NUMERICAL VERIFICATION

The optimal XTC IR derived in the previous appendixwas evaluated for the typical case of g = .985 andΓ = 7 dB, and plotted in Fig. 8. To verify the IR’s valid-ity and assess the effect of the number of terms in the se-ries expansions, we calculated its Fourier transform andcompared the resulting spectra to those obtained fromthe frequency-domain expressions of Section IV B. Anexample is shown in Fig. 10 for the Branch-I part of theXTC spectra (top panel) and envelope spectra (bottompanel).

We found that excellent agreement (within a few tenthsof a dB) over all frequencies does not require taking

more than the first few (5-10) terms of the infinite se-ries in the expressions for all the ψ functions constitut-ing the IR, except for ψ1 and ψ3, which, due to theirslow convergence at and near the frequencies ωτc = n2π( n = 0, 1, 2, 3, 4, . . . ), require taking a larger number ofterms. Approximating the infinite series in the expres-sions for ψ1 and ψ3 by a sum having a finite number ofterms causes departures from the correct amplitude spec-tra at and near these frequencies. Due to the logarithmicfrequency scale, the n = 0 departure appears as a slightbass roll-off in the first band (seen as the first dot in thefirst Branch-I band in the bottom panel of Fig. 10), andthe n ≥ 1 departures appear as narrow-band spikes (suchas the one appearing as three vertical dots in the fifthband in the same plot). Increasing the number of termsin the series above 1000 reduces the amplitude of the bassroll-off and pushes it into the subwoofer frequency range,where XTC is not needed, and causes the n ≥ 1 spikesto diminish in amplitude and frequency extent so as tobecome inaudible. (The XTC spectrum is more immunefrom the aforementioned departures, as seen in the toppanel, because it is a ratio of left to right spectra.)

A similar analysis of the Branch-II part of the IR is notshown as the resulting spectra exhibit the same behavioras that described above.

[1] D. H. Cooper and J. L. Bauk, J. Audio Eng. Soc. 37, 3(1989).

[2] Throughout this paper, the words “recording” and “sig-nal” are used interchangeably and are meant to also rep-resent a live feed, or the HRTF-encoded signal for theartificial placement of sounds in a virtual acoustic space.

[3] M. A. Akeroyd et al., J. Acoust. Soc. Am. 121, 1056(2007).

[4] D. B. Ward, J. Acoust. Soc. Am. 110, 1195 (2001).[5] A. Sæbø, Ph.D. thesis, Norwegian University of Science

and Technology, Trondheim, Norway, 2001.[6] Throughout this paper, the word “level” is meant to rep-

resent, generally, a frequency-dependent amplitude.[7] J. Blauert, Spatial Hearing (The MIT Press, Cambridge,

MA, 2001).[8] B. B. Bauer, J. Audio Eng. Soc. 9, 148 (1961).[9] B. S. Atal, M. Hill, and M. R. Schroeder, Apparent Sound

Source Translator, US patent No. 3,236,949. Applicationfiled Nov. 1962, granted Feb 1966.

[10] P. Damaske, J. Acoust. Soc. Am. 50, 1109 (1970).[11] An exception could be made for recordings in which the

specific placement of sound images was made with full ac-counting for crosstalk during playback, e.g., the case ofstereo soundfields constructed with pan-potted mono im-ages and monitored over loudspeakers, common in pop-ular music recording.

[12] C. Hugonnet and P. Walder, Stereophonic Sound Record-ing (John Wiley & Sons, Chichester, UK, 1998).

[13] While, as mentioned above, XTC levels above 20 dB areoften required for robust front-back image disambigua-tion, the larger portion of the direct sound content inacoustic recordings, e.g., performed music, is of frontalorigin and, with playback through frontal loudspeakersat modest XTC, is largely immune to such localizationconfusion.

[14] D. B. Ward and G. Elko, IEEE Signal Process. Lett. 6,106 (1999).

[15] J. J. Lopez and A. Gonzalez, IEEE Signal Process. Lett.6, 106 (1999).

[16] T. Takeuchi, P. A. Nelson, and H. Hamada, J. Acoust.Soc. Am. 109, 958 (2001).

[17] J. Rose, P. Nelson, B. Rafaely, and T. Takeuchi, J.Acoust. Soc. Am. 112, 1992 (2002).

[18] M. R. Bai and C. C. Lee, J. Acoust. Soc. Am. 120, 1976(2006).

[19] O. Kirekby, P. A. Nelson, and H. Hamada, J. Acoust.Soc. Am. 104, 1973 (1998).

[20] O. Kirekby, P. A. Nelson, and H. Hamada, J. Audio Eng.Soc. 46, 387 (1998).

[21] H. S. Kim, P. M. Kim, and H. B. Kim, ETRI J. 22, 11(2000).

[22] J. Yang, W. S. Gan, and S. E. Tan, Acous. Res. Lett.Online 4, 47 (2003).

[23] M. R. Bai, C. C. Tung, and C. C. Lee, J. Acoust. Soc.Am. 117, 2802 (2005).

[24] M. R. Bai and C. C. Lee, J. Sound & Vibration 117,

Page 24: Optimal Crosstalk Cancellation for Binaural Audio with … · Optimal Crosstalk Cancellation for Binaural Audio with Two Loudspeakers Edgar Y. Choueiri Princeton University choueiri@princeton.edu

E.Y. CHOUEIRI: OPTIMAL CROSSTALK CANCELLATION 24

ææ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æææææææææææææææææææææææææ

ææææææææ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æææææææææææææææææææææææææææææææææææææææææææææææææææææææææææææææææææææææææææææææææææææææææ

æææææ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æææææææææææææææææææææææææææææææææ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

ææææ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

ææææææææææææææææææææææææææææææææææ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æ

æææææææææææææææææææææææææææææææææææææææææææææææææææææææææææææææææææææææææææææææææææææ

ΧAOE

I

Branch I P II P I

Π100 Π10 Π2 Π 2Π 3Π-20

-10

0

10

20

30

40100 1000 104 2x104

Ω Τc

dBf HHzL

FIG. 10: XTC spectrum of optimal filter for Branch-I bands,

χ[O]I (ω), shown in top panel, and the associated envelope spec-

trum, S[O]I (ω), shown in bottom panel. The small dots repre-

sent the spectra calculated by taking the Fourier transform ofthe Branch-I part of the IR derived in the previous appendix.(The IR is shown graphically in the top panel of Fig. 8.) Onlythe first 20 terms of the infinite series representing the ψ func-tions were taken, with the exception of the series for ψ1 andψ3, for which the first 2500 terms were used. The hard curvein the top panel is the Branch-I XTC spectrum calculateddirectly from Eq. (64), and the horizontal line in the bottom

panel is the Branch-I envelope spectrum S[O]I (ω) = γ, with

Γ = 7 dB. (Other parameters are the same as for Fig. 2.)Since these spectra are valid only in Branch-I bands, all otherbands are shaded in grey. (The vertical dashed lines representthe frequency bounds of the successive bands, and the branchnumbers of the first five bands are given in the bottom halfof each panel.)

2802 (2005).[25] Y. Kim, O. Deille, and P. A. Nelson, J. Sound & Vibra-

tion 297, 251 (2006).[26] L. H. Kim, J. S. Lim, and K. M. Sung, IEICE Trans.

Fundamentals E85 A, 2159 (2002).[27] W. G. Gardner, 3-D Audio using Loudpseakers (Kluwer

Academic Publishers, Boston, 1998).[28] Conversely, XTC control at frequencies where the inter-

ference at the ear is constructive requires attenuation in-stead of boosting (and implies a dynamic range gain, in-stead of loss). As was shown by [29, 30], and as will bereviewed in Section II C, these attenuations are not prob-lematic as they correspond to frequencies where XTCcontrol is most robust.

[29] T. Takeuchi and P. A. Nelson, J. Acoust. Soc. Am. 112,2786 (2002).

[30] P. A. Nelson and J. F. W. Rose, J. Acoust. Soc. Am.118, 193 (2005).

[31] B. Katz, Mastering Audio (Focal Press, Oxford, UK,2002).

[32] T. Takeuchi and P. A. Nelson, J. Audio Eng. Soc. 5, 981(2007).

[33] A. Schuhmacher, J. Hald, K. B. Rasmussen, and P.C.Hansem, J. Acoust. Soc. Am. 113, 114 (2003).

[34] H. Tokuno, O. Kirkeby, P. A. Nelson, and H. Hamada,IEICE Trans. Fundamentals EE80-A, 809 (1997).

[35] P. M. Morse and K. U. Ingard, Theoretical Acoustics(Princeton University Press, Princeton, NJ, 1968).

[36] P. C. Hansen, Rank-Deficient and Discrete Ill-PosedProblems (SIAM, Philadelphia, PA, 1998).

[37] P. A. Nelson and S. J. Elliott, Active Control of Sound(Academic Press, London, UK, 1993).

[38] M. Bellanger, Digital Processing of Signals (John Wiley& Sons, Chichester, UK, 2000).

[39] W. G. Gardner, J. Audio Eng. Soc. 43, 127 (1995).[40] A. Torger and A. Farina, IEEE Workshop on Applica-

tions of Signal Processing to Audio and Acoustics (PUB-LISHER, New Paltz, New York, 2001), paper No. 198.

[41] M. R. Bai and C. C. Lee, EURASIP J. on Adv. in Sign.Process. 2007, 1 (2006).

[42] This value for the effective inter-ear separation, ∆r =15 cm, is justified by the relatively small loudspeakerspan, following the guidelines in Ref. [29], where it is re-ported that good correlation between the peak frequen-cies in the data calculated using a free-field model, andthose measured with the KEMAR dummy head, can beobtained by taking an effective ∆r ' 13 cm for low val-ues of θ, and ∆r ' 25 cm for large source azimuths. Thelarger value, which is much larger than the minimumdistance between the entrances of the ear canals of thedummy head, reflects the effects of diffraction around thehead.