
Microphone Array Speech Processing

Guest Editors: Sven Nordholm, Thushara Abhayapala, Simon Doclo, Sharon Gannot, Patrick Naylor, and Ivan Tashev

EURASIP Journal on Advances in Signal Processing


Copyright © 2010 Hindawi Publishing Corporation. All rights reserved.

This is a special issue published in volume 2010 of "EURASIP Journal on Advances in Signal Processing." All articles are open access articles distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Editor-in-Chief

Phillip Regalia, Institut National des Télécommunications, France

Associate Editors

Adel M. Alimi, Tunisia; Kenneth Barner, USA; Yasar Becerikli, Turkey; Kostas Berberidis, Greece; Enrico Capobianco, Italy; A. Enis Cetin, Turkey; Jonathon Chambers, UK; Mei-Juan Chen, Taiwan; Liang-Gee Chen, Taiwan; Satya Dharanipragada, USA; Kutluyil Dogancay, Australia; Florent Dupont, France; Frank Ehlers, Italy; Sharon Gannot, Israel; Samanwoy Ghosh-Dastidar, USA; Norbert Goertz, Austria; M. Greco, Italy; Irene Y. H. Gu, Sweden; Fredrik Gustafsson, Sweden; Ulrich Heute, Germany; Sangjin Hong, USA; Jiri Jan, Czech Republic; Magnus Jansson, Sweden; Sudharman K. Jayaweera, USA; Soren Holdt Jensen, Denmark; Mark Kahrs, USA; Moon Gi Kang, South Korea; Walter Kellermann, Germany; Lisimachos P. Kondi, Greece; Alex Chichung Kot, Singapore; Ercan E. Kuruoglu, Italy; Tan Lee, China; Geert Leus, The Netherlands; T.-H. Li, USA; Husheng Li, USA; Mark Liao, Taiwan; Y.-P. Lin, Taiwan; Shoji Makino, Japan; Stephen Marshall, UK; C. Mecklenbrauker, Austria; Gloria Menegaz, Italy; Ricardo Merched, Brazil; Marc Moonen, Belgium; Christophoros Nikou, Greece; Sven Nordholm, Australia; Patrick Oonincx, The Netherlands; Douglas O'Shaughnessy, Canada; Bjorn Ottersten, Sweden; Jacques Palicot, France; Ana Perez-Neira, Spain; Wilfried R. Philips, Belgium; Aggelos Pikrakis, Greece; Ioannis Psaromiligkos, Canada; Athanasios Rontogiannis, Greece; Gregor Rozinaj, Slovakia; Markus Rupp, Austria; William Sandham, UK; B. Sankur, Turkey; Erchin Serpedin, USA; Ling Shao, UK; Dirk Slock, France; Yap-Peng Tan, Singapore; Joao Manuel R. S. Tavares, Portugal; George S. Tombras, Greece; Dimitrios Tzovaras, Greece; Bernhard Wess, Austria; Jar-Ferr Yang, Taiwan; Azzedine Zerguine, Saudi Arabia; Abdelhak M. Zoubir, Germany


Contents

Microphone Array Speech Processing, Sven Nordholm, Thushara Abhayapala, Simon Doclo, Sharon Gannot (EURASIP Member), Patrick Naylor, and Ivan Tashev
Volume 2010, Article ID 694216, 3 pages

Selective Frequency Invariant Uniform Circular Broadband Beamformer, Xin Zhang, Wee Ser, Zhang Zhang, and Anoop Kumar Krishna
Volume 2010, Article ID 678306, 11 pages

First-Order Adaptive Azimuthal Null-Steering for the Suppression of Two Directional Interferers, Rene M. M. Derkx
Volume 2010, Article ID 230864, 16 pages

Musical-Noise Analysis in Methods of Integrating Microphone Array and Spectral Subtraction Based on Higher-Order Statistics, Yu Takahashi, Hiroshi Saruwatari, Kiyohiro Shikano, and Kazunobu Kondo
Volume 2010, Article ID 431347, 25 pages

Microphone Diversity Combining for In-Car Applications, Jurgen Freudenberger, Sebastian Stenzel, and Benjamin Venditti
Volume 2010, Article ID 509541, 13 pages

DOA Estimation with Local-Peak-Weighted CSP, Osamu Ichikawa, Takashi Fukuda, and Masafumi Nishimura
Volume 2010, Article ID 358729, 9 pages

Shooter Localization in Wireless Microphone Networks, David Lindgren, Olof Wilsson, Fredrik Gustafsson, and Hans Habberstad
Volume 2010, Article ID 690732, 11 pages


Hindawi Publishing Corporation, EURASIP Journal on Advances in Signal Processing, Volume 2010, Article ID 694216, 3 pages. doi:10.1155/2010/694216

Editorial

Microphone Array Speech Processing

Sven Nordholm (EURASIP Member),1 Thushara Abhayapala (EURASIP Member),2 Simon Doclo (EURASIP Member),3 Sharon Gannot (EURASIP Member),4 Patrick Naylor (EURASIP Member),5 and Ivan Tashev6

1 Department of Electrical and Computer Engineering, Curtin University of Technology, Perth, WA 6845, Australia
2 College of Engineering & Computer Science, The Australian National University, Canberra, ACT 0200, Australia
3 Institute of Physics, Signal Processing Group, University of Oldenburg, 26111 Oldenburg, Germany
4 School of Engineering, Bar-Ilan University, 52900 Tel Aviv, Israel
5 Department of Electrical and Electronic Engineering, Imperial College, London SW7 2AZ, UK
6 Microsoft Research, USA

Correspondence should be addressed to Sven Nordholm, [email protected]

Received 21 July 2010; Accepted 21 July 2010

Copyright © 2010 Sven Nordholm et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Significant knowledge about microphone arrays has been gained from years of intense research and product development. Numerous applications have been suggested, ranging from large arrays (on the order of more than 100 elements) for use in auditoriums to small arrays with only 2 or 3 elements for hearing aids and mobile telephones. Apart from that, microphone array technology has been widely applied in speech recognition, surveillance, and warfare. Traditional techniques that have been used for microphone arrays include fixed spatial filters, such as frequency invariant beamformers, as well as optimal and adaptive beamformers. These array techniques assume either model knowledge or calibration signal knowledge, as well as localization information, for their design. Thus they usually combine some form of localisation and tracking with the beamforming. Today, contemporary techniques using blind signal separation (BSS) and time-frequency masking have attracted significant attention. Those techniques rely less on an array model and localization and more on the statistical properties of speech signals such as sparseness, non-Gaussianity, and non-stationarity. The main advantage that multiple microphones add, from a theoretical perspective, is spatial diversity, which is an effective tool to combat interference, reverberation, and noise. The underpinning physical feature used is a difference in coherence between the target field (speech signal) and the noise field. Viewing the processing in this way, one can also understand the difficulty of enhancing highly reverberant speech, given that we can only observe the received microphone signals.

This special issue contains contributions to traditional areas of research such as frequency invariant beamforming [1], hands-free operation of microphone arrays in cars [2], and source localisation [3]. The contributions show new ways to study these traditional problems and give new insights into them. Small-size arrays have always attracted many applications and much interest for mobile terminals, hearing aids, and close-up microphones [4]. The novel way to represent small-size arrays leads to a capability to suppress multiple interferers. Abnormalities in noise and speech stemming from processing are largely unavoidable, and nonlinear processing often results in a significant change of character, particularly of the noise. It is thus important to provide new insights into these phenomena, particularly the so-called musical noise [5]. Finally, new and unusual uses of microphone arrays are always interesting to see. Distributed microphone arrays in a sensor network [6] provide a novel approach to finding snipers. This type of processing has good opportunities to grow in interest for new and improved applications.

The contributions found in this special issue can be categorized into three main aspects of microphone array processing: (i) microphone array design based on eigenmode decomposition [1, 4]; (ii) multichannel processing methods [2, 5]; and (iii) source localisation [3, 6].


The paper by Zhang et al., "Selective frequency invariant uniform circular broadband beamformer" [1], describes a design method for Frequency-Invariant (FI) beamforming. FI beamforming is a well-known array signal processing technique used in many applications, such as speech acquisition, acoustic imaging, and communications. However, many existing FI beamformers are designed to have a frequency invariant gain over all angles. This might not be necessary, and if the gain constraint is confined to a specific angle, then the FI performance over that selected region (in frequency and angle) can be expected to improve. Inspired by this idea, the proposed algorithm attempts to optimize the frequency invariant beampattern solely for the mainlobe and relaxes the FI requirement on the sidelobes. This sacrifice in performance in the undesired region is traded off for better performance in the desired region as well as a reduced number of microphones. The objective function is designed to minimize the overall spatial response of the beamformer with a constraint on the gain being smaller than a predefined threshold value across a specific frequency range and at a specific angle. This problem is formulated as a convex optimization problem and the solution is obtained using the Second-Order Cone Programming (SOCP) technique. An analysis of the computational complexity of the proposed algorithm is presented, as well as its performance, which is evaluated via computer simulation for different numbers of sensors and different threshold values. Simulation results show that the proposed algorithm is able to achieve a smaller mean square error of the spatial response gain for the specific FI region compared to existing algorithms.

The paper by Derkx, "First-order adaptive azimuthal null-steering for the suppression of two directional interferers" [4], shows that an azimuth-steerable first-order superdirectional microphone response can be constructed by a linear combination of three eigenbeams: a monopole and two orthogonal dipoles. Although the response of a (rotation symmetric) first-order response can only exhibit a single null, the paper studies a slice through this beampattern lying in the azimuthal plane. In this way, a maximum of two nulls in the azimuthal plane can be defined. These nulls are symmetric with respect to the main-lobe axis. By placing these two nulls on at most two directional sources to be rejected and compensating for the drop in level in the desired direction, these directional sources can be effectively rejected without attenuating the desired source. An adaptive null-steering scheme for adjusting the beampattern, which enables automatic source suppression, is presented. Closed-form expressions for this optimal null-steering are derived, enabling the computation of the azimuthal angles of the interferers. It is shown that the proposed technique has a good directivity index when the angular difference between the desired source and each directional interferer is at least 90 degrees.
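As a rough illustration of how three eigenbeams allow two azimuthal nulls to be placed (a static sketch of the idea only, not Derkx's adaptive scheme; the angles, function name, and normalization below are illustrative assumptions), the azimuthal slice of a first-order response can be written as E(φ) = w0 + w1·cos φ + w2·sin φ, and the weights can be solved from three linear conditions: unit gain at the desired direction and zeros at the two interferer directions.

```python
import numpy as np

def first_order_weights(phi_desired, phi_int1, phi_int2):
    """Weights for E(phi) = w0 + w1*cos(phi) + w2*sin(phi) with unit gain
    at phi_desired and nulls at the two interferer angles (all in radians)."""
    A = np.array([[1.0, np.cos(p), np.sin(p)]
                  for p in (phi_desired, phi_int1, phi_int2)])
    b = np.array([1.0, 0.0, 0.0])   # gain 1 at desired direction, 0 at interferers
    return np.linalg.solve(A, b)

# Example: desired source at 0 degrees, interferers at 110 and -120 degrees.
w = first_order_weights(np.deg2rad(0.0), np.deg2rad(110.0), np.deg2rad(-120.0))
phi = np.deg2rad(np.arange(-180, 181))
pattern = w[0] + w[1] * np.cos(phi) + w[2] * np.sin(phi)   # azimuthal response
```

Generic angle choices yield a nonsingular 3 x 3 system; degenerate choices (for example, an interferer coinciding with the desired direction) do not.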

In the paper by Takahashi et al., "Musical noise analysis in methods of integrating microphone array and spectral subtraction based on higher-order statistics" [5], an objective analysis of musical noise is conducted. The musical noise is generated by two methods of integrating microphone array signal processing and spectral subtraction. To obtain better noise reduction, methods of integrating microphone array signal processing and nonlinear signal processing have been researched. However, nonlinear signal processing often generates musical noise. Since such musical noise causes discomfort to users, it is desirable that it be mitigated. Moreover, it has recently been reported that higher-order statistics are strongly related to the amount of musical noise generated. This implies that it is possible to optimize the integration method from the viewpoint of not only noise reduction performance but also the amount of musical noise generated. Thus, the simplest methods of integration, that is, the delay-and-sum beamformer and spectral subtraction, are analysed and the features of the musical noise generated by each method are clarified. As a result, it is shown that a specific structure of integration is preferable from the viewpoint of the amount of generated musical noise. The validity of the analysis is demonstrated via a computer simulation and a subjective evaluation.

The paper by Freudenberger et al., "Microphone diversity combining for in-car applications" [2], proposes a frequency-domain diversity approach for two or more microphone signals, for example, for in-car applications. The microphones should be positioned separately to ensure diverse signal conditions and incoherent recording of noise. This enables a better compromise for the microphone position with respect to different speaker sizes and noise sources. The work proposes a two-stage approach: in the first stage, the microphone signals are weighted with respect to their signal-to-noise ratio and then summed, similar to maximum-ratio combining. The combined signal is then used as a reference for a frequency-domain least-mean-squares (LMS) filter for each input signal. The output SNR is significantly improved compared to coherence-based noise reduction systems, even if one microphone is heavily corrupted by noise.
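A minimal sketch of the first stage described above, assuming per-bin SNR estimates are already available (the function name, inputs, and normalization are illustrative assumptions, not the authors' exact combining rule):

```python
import numpy as np

def snr_weighted_combine(X, snr):
    """Combine per-bin microphone spectra X (mics x bins) using weights
    proportional to the estimated per-bin SNR (mics x bins), similar in
    spirit to maximum-ratio combining. The combined spectrum could then
    serve as the reference of a per-microphone adaptive post-filter."""
    w = snr / np.maximum(snr.sum(axis=0, keepdims=True), 1e-12)
    return (w * X).sum(axis=0)   # combined reference spectrum, one value per bin
```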

The paper by Ichikawa et al., "DOA estimation with local-peak-weighted CSP" [3], proposes a novel weighting algorithm for Cross-power Spectrum Phase (CSP) analysis to improve the accuracy of direction of arrival (DOA) estimation for beamforming in a noisy environment. A human speaker is used as the sound source, and broadband automobile noise is used as the noise source. The harmonic structure of the human speech spectrum can be used to weight the CSP analysis, because harmonic bins must contain more speech power than the others and thus give more reliable information. However, most conventional methods leveraging harmonic structure require pitch estimation with voiced-unvoiced classification, which is not sufficiently accurate in noisy environments. The suggested approach employs the observed power spectrum, which is directly converted into weights for the CSP analysis by retaining only the local peaks considered to be coming from a harmonic structure. The presented results show that the proposed approach significantly reduces localization errors, and further improvement is observed when it is combined with other weighting algorithms.
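For illustration, a generic weighted CSP (phase-transform) time-delay estimator of the kind described above might look as follows; the local-peak weight construction and all names here are assumptions for the sketch, not the paper's implementation:

```python
import numpy as np

def local_peak_weights(power):
    """Keep only local maxima of an observed power spectrum as weights,
    a simple stand-in for a local-peak selection rule."""
    left, right = np.roll(power, 1), np.roll(power, -1)
    return np.where((power > left) & (power > right), power, 0.0)

def weighted_csp_tdoa(x1, x2, weights, n_fft=1024):
    """Weighted CSP (phase transform) between two microphone frames.
    Returns the lag (in samples) of the strongest peak as the TDOA estimate."""
    X1, X2 = np.fft.rfft(x1, n_fft), np.fft.rfft(x2, n_fft)
    cross = X1 * np.conj(X2)
    csp = weights * cross / np.maximum(np.abs(cross), 1e-12)  # whiten, then weight
    corr = np.fft.fftshift(np.fft.irfft(csp, n_fft))          # lag-domain CSP
    return int(np.argmax(corr)) - n_fft // 2
```

The TDOA estimate can then be mapped to an azimuth estimate given the microphone spacing and the speed of sound.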

The paper by Lindgren et al., "Shooter localization in wireless microphone networks" [6], is an interesting combination of microphone array technology with distributed communications. By detecting the muzzle blast as well as the ballistic shock wave, the microphone array algorithm is able to locate the shooter when the sensors are synchronized. However, in the distributed sensor case, synchronization is either not achievable or very expensive to achieve, and therefore the accuracy of localization comes into question. Field trials are described to support the algorithmic development.

Sven Nordholm
Thushara Abhayapala
Simon Doclo
Sharon Gannot
Patrick Naylor
Ivan Tashev

References

[1] X. Zhang, W. Ser, Z. Zhang, and A. K. Krishna, "Selective frequency invariant uniform circular broadband beamformer," EURASIP Journal on Advances in Signal Processing, vol. 2010, Article ID 678306, 11 pages, 2010.

[2] J. Freudenberger, S. Stenzel, and B. Venditti, "Microphone diversity combining for in-car applications," EURASIP Journal on Advances in Signal Processing, vol. 2010, Article ID 509541, 13 pages, 2010.

[3] O. Ichikawa, T. Fukuda, and M. Nishimura, "DOA estimation with local-peak-weighted CSP," EURASIP Journal on Advances in Signal Processing, vol. 2010, Article ID 358729, 9 pages, 2010.

[4] R. M. M. Derkx, "First-order adaptive azimuthal null-steering for the suppression of two directional interferers," EURASIP Journal on Advances in Signal Processing, vol. 2010, Article ID 230864, 16 pages, 2010.

[5] Y. Takahashi, H. Saruwatari, K. Shikano, and K. Kondo, "Musical-noise analysis in methods of integrating microphone array and spectral subtraction based on higher-order statistics," EURASIP Journal on Advances in Signal Processing, vol. 2010, Article ID 431347, 25 pages, 2010.

[6] D. Lindgren, O. Wilsson, F. Gustafsson, and H. Habberstad, "Shooter localization in wireless sensor networks," in Proceedings of the 12th International Conference on Information Fusion (FUSION '09), pp. 404–411, July 2009.


Hindawi Publishing Corporation, EURASIP Journal on Advances in Signal Processing, Volume 2010, Article ID 678306, 11 pages. doi:10.1155/2010/678306

Research Article

Selective Frequency Invariant Uniform Circular Broadband Beamformer

Xin Zhang,1 Wee Ser,1 Zhang Zhang,1 and Anoop Kumar Krishna2

1 Center for Signal Processing, Nanyang Technological University, 50 Nanyang Avenue, Singapore 639798
2 EADS Innovation Works, EADS Singapore Pte Ltd., No. 41, Science Park Road, 01-30, Singapore 117610

Correspondence should be addressed to Xin Zhang, zhang [email protected]

Received 16 April 2009; Revised 24 August 2009; Accepted 3 December 2009

Academic Editor: Thushara Abhayapala

Copyright © 2010 Xin Zhang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Frequency-Invariant (FI) beamforming is a well-known array signal processing technique used in many applications. In this paper, an algorithm is proposed that attempts to optimize the frequency invariant beampattern solely for the mainlobe and relaxes the FI requirement on the sidelobes. This sacrifice in performance in the undesired region is traded off for better performance in the desired region as well as a reduced number of microphones. The objective function is designed to minimize the overall spatial response of the beamformer with a constraint on the gain being smaller than a predefined threshold value across a specific frequency range and at a specific angle. This problem is formulated as a convex optimization problem and the solution is obtained using the Second-Order Cone Programming (SOCP) technique. An analysis of the computational complexity of the proposed algorithm is presented, as well as its performance, which is evaluated via computer simulation for different numbers of sensors and different threshold values. Simulation results show that the proposed algorithm is able to achieve a smaller mean square error of the spatial response gain for the specific FI region compared to existing algorithms.

1. Introduction

Broadband beamforming techniques using an array of microphones have been applied widely in hearing aids, teleconferencing, and voice-activated human-computer interface applications. Several broadband beamformer designs have been reported in the literature [1–3]. One design approach is to decompose the broadband signal into several narrowband signals and apply narrowband beamforming techniques to each narrowband signal [4]. This approach requires several narrowband processing stages to run simultaneously and is computationally expensive. Another design approach is to use adaptive broadband beamformers. Such techniques use a bank of linear transversal filters to generate the desired beampattern. The filter coefficients can be derived adaptively from the received signals. One classic design example is the Frost beamformer [5]. However, in order to have a similar beampattern over the entire frequency range, a large number of sensors and filter taps will be needed. This again leads to high computational complexity. The third approach to designing broadband beamformers is to use the Frequency-Invariant (FI) beampattern synthesis technique. As the name implies, such beamformers are designed to have a constant spatial gain response over the desired frequency bands.

Over recent years, FI beamforming techniques have developed at a fast pace, and it is difficult to make a distinct classification. However, in order to grasp the literature on FI beamforming at a glance, we classify it loosely into the following three types.

One type of FI beamformer focuses on designs based on array geometry. These include, for example, the 3D sensor array design reported in [6], the rectangular sensor array design reported in [7], and the design using subarrays in [8]. In [9], the FI beampattern is achieved by exploiting the relationship among the frequency responses of the various filters implemented at the output of each sensor.

The second type of FI beamformer is designed on the basis of a least-squares approach. For this type of FI beamformer, the weights of the beamformer are optimized such that the error between the actual beampattern and the desired beampattern is minimized over a range of frequencies. Some of these beamformers are designed in the time-frequency domain [10–12], while others are designed in the eigen-space domain [13].

The third type of FI beamformer is designed based on "signal transformation." For this type of beamformer, the signal received at the sensor array is transformed into a domain in which the frequency response and the spatial response of the signal can be decoupled and hence adjusted independently. This is the principle adopted in [14], where a uniform concentric circular array (UCCA) is designed to achieve the FI beampattern. Excellent results have been produced by this algorithm. One limitation of the UCCA beamformer is that a relatively large number of sensors have to be used to form the concentric circular array.

Inspired by the UCCA beamformer design, a new algorithm has been proposed by the authors of this paper and presented in [15]. The proposed algorithm attempts to optimize the FI beampattern solely for the main lobe, where the signal of interest arrives, and relaxes the FI requirement on the side lobes. As a result, the sacrifice in performance in the undesired region is traded off for better performance in the desired region, and fewer microphones are employed. To achieve this goal, an objective function with a quadratic constraint is designed. This constraint function allows the FI characteristic to be accurately controlled over the specified bandwidth at the expense of other parts of the spectrum which are not of concern to the designer. The objective function is formulated as a convex optimization problem and solved readily by SOCP. Our algorithm has a frequency band of interest from 0.3π to 0.95π. If the sampling frequency is 16000 Hz, the frequency band of interest ranges from 2400 Hz to 7600 Hz. This algorithm can be applied in speech processing, as the labial and fricative sounds of speech mostly lie in the 8th to 9th octave. If the sampling frequency is 8000 Hz, the frequency band of interest is from 1200 Hz to 3800 Hz. This frequency range is useful for respiratory sounds [16].

The aim of this paper is to provide the full details of the design proposed in [15]. In addition, a computational complexity analysis of the proposed algorithm and sensitivity performance evaluations for different numbers of sensors and different constraint parameter values are also included.

The remainder of this paper is organized as follows: in Section 2, the problem formulation is discussed; in Section 3, the proposed beamforming design is described; in Section 4, the design of the beamforming weights using SOCP is shown; numerical results are given in Section 5; and finally, conclusions are drawn in Section 6.

2. Problem Formulation

A uniformly distributed circular sensor array with K microphones is arranged as shown in Figure 1. Each omnidirectional sensor is located at (r cos φ_k, r sin φ_k), where r is the radius of the circle, φ_k = 2kπ/K, and k = 0, ..., K − 1.

In this configuration, the intersensor spacing is fixed at λ/2, where λ is the wavelength of the signals of interest and its minimum value is denoted by λ_min. The radius corresponding to λ_min is given by [14]

$$ r = \frac{\lambda_{\min}}{4\sin(\pi/K)}. \quad (1) $$

Assuming that the circular array lies in a horizontal plane, the steering vector is

$$ \mathbf{a}(f,\phi) = \left[ e^{\,j 2\pi f r \cos(\phi-\phi_0)/c}, \ldots, e^{\,j 2\pi f r \cos(\phi-\phi_{K-1})/c} \right]^T, \quad (2) $$

where T denotes transpose. For convenience, let ω be the normalized angular frequency, ω = 2πf/f_s; let ε be the ratio of the sampling frequency to the maximum frequency, ε = f_s/f_max; and let r be the normalized radius, r = r/λ_min. The steering vector can then be rewritten as

$$ \mathbf{a}(\omega,\phi) = \left[ e^{\,j \omega r \varepsilon \cos(\phi-\phi_0)}, \ldots, e^{\,j \omega r \varepsilon \cos(\phi-\phi_{K-1})} \right]^T. \quad (3) $$
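A minimal numerical sketch of (1)–(3) follows; the microphone count, maximum frequency, and speed of sound are example values chosen for illustration, not taken from the paper:

```python
import numpy as np

K = 20                                   # number of microphones (example value)
c = 343.0                                # speed of sound in m/s (assumption)
lam_min = c / 7600.0                     # smallest wavelength of interest, c / f_max
r = lam_min / (4.0 * np.sin(np.pi / K))  # array radius from (1)
phi_k = 2.0 * np.pi * np.arange(K) / K   # sensor angles on the circle

def steering_vector(f, phi):
    """Steering vector of (2) for a plane wave from azimuth phi at frequency f."""
    return np.exp(1j * 2.0 * np.pi * f * r * np.cos(phi - phi_k) / c)

a = steering_vector(4000.0, np.deg2rad(30.0))   # example evaluation
```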

Figure 2 shows the system structure of the proposed uniform circular array beamformer. The sampled signals after the sensors are represented by the vector X[n] = [x_0(n), x_1(n), ..., x_{K−1}(n)]^T, where n is the sampling instant. These sampled signals are transformed into a set of coefficients via the Inverse Discrete Fourier Transform (IDFT), where each of the coefficients is called a phase mode [17]. The mth phase mode at time instant n can be expressed as

$$ p_m[n] = \sum_{k=0}^{K-1} x_k[n]\, e^{\,j 2\pi k m / K}. \quad (4) $$

These phase modes are passed through an FIR (Finite Impulse Response) filter whose coefficients are denoted b_m[n]. The purpose of this filter is to remove the frequency dependency of the received signal X[n]. The beamformer output y[n] is then determined as the weighted sum of the filtered signals:

$$ y[n] = \sum_{m=-L}^{L} \bigl( p_m[n] * b_m[n] \bigr)\, h_m, \quad (5) $$

where h_m are the phase-mode spatial weighting coefficients (the beamforming weights) and * is the discrete-time convolution operator.

Let M be the total number of phase modes; it is assumed to be odd. As can be seen from Figure 2, the K received signals are transformed into M phase modes, where L = (M − 1)/2.
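The block diagram of Figure 2 and equations (4)–(5) can be prototyped directly; in the sketch below, the FIR coefficients b and spatial weights h are placeholders whose design is the subject of the following sections:

```python
import numpy as np
from scipy.signal import lfilter

def beamformer_output(x, b, h, L):
    """Sketch of (4)-(5): IDFT-style phase-mode decomposition of a block of
    sensor samples x (shape K x N), per-mode FIR filtering with b[idx], and
    the weighted sum with h[idx]; modes are ordered m = -L..L."""
    K = x.shape[0]
    k = np.arange(K)
    y = np.zeros(x.shape[1], dtype=complex)
    for idx, m in enumerate(range(-L, L + 1)):
        p_m = (np.exp(1j * 2 * np.pi * k * m / K)[:, None] * x).sum(axis=0)  # (4)
        y += h[idx] * lfilter(b[idx], [1.0], p_m)                            # (5)
    return y
```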

The corresponding spectrum of the phase modes can be obtained by taking the Discrete-Time Fourier Transform (DTFT) of the phase modes defined in (4):

$$ P_m(\omega) = \sum_{k=0}^{K-1} X_k(\omega)\, e^{\,j 2\pi k m / K} = S(\omega) \sum_{k=0}^{K-1} e^{\,j \omega r \cos(\phi-\phi_k)}\, e^{\,j 2\pi k m / K}, \quad (6) $$

where S(ω) is the spectrum of the source signal. Taking the DTFT on both sides of (5) and using (6), we have

$$ Y(\omega) = \sum_{m=-L}^{L} h_m P_m(\omega) B_m(\omega) = S(\omega) \sum_{m=-L}^{L} h_m \left( \sum_{k=0}^{K-1} e^{\,j \omega r \cos(\phi-\phi_k)}\, e^{\,j 2\pi k m / K} \right) B_m(\omega). \quad (7) $$

Consequently, the response of the beamformer can be expressed as

$$ G(\omega,\phi) = \sum_{m=-L}^{L} h_m \left( \sum_{k=0}^{K-1} e^{\,j \omega r \cos(\phi-\phi_k)}\, e^{\,j 2\pi k m / K} \right) B_m(\omega). \quad (8) $$

In order to obtain an FI response, terms that are functions of ω are grouped together using the Jacobi-Anger expansion [18]:

$$ e^{\,j\beta\cos\gamma} = \sum_{n=-\infty}^{+\infty} j^n J_n(\beta)\, e^{\,jn\gamma}, \quad (9) $$

where J_n(β) is the Bessel function of the first kind of order n. Substituting (9) into (8) and applying properties of the Bessel function, the spatial response of the beamformer can be approximated by

$$ G(\omega,\phi) = \sum_{m=-L}^{L} h_m\, e^{\,jm\phi}\, K\, j^m\, J_m(\omega r)\, B_m(\omega). \quad (10) $$

This process has been described in [13] and its detailed derivation can be found in [14].
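For reference, (10) can be evaluated numerically with standard Bessel-function routines; the sketch below assumes the spatial weights and per-mode filter responses are supplied by the caller:

```python
import numpy as np
from scipy.special import jv   # Bessel function of the first kind, J_v(x)

def response(omega, phi, h, B, K, r_norm):
    """Approximate phase-mode response G(omega, phi) of (10).
    h: spatial weights indexed m = -L..L; B(m, omega): frequency response of
    the mode-m FIR filter (both are design quantities, placeholders here)."""
    L = (len(h) - 1) // 2
    G = 0.0 + 0.0j
    for idx, m in enumerate(range(-L, L + 1)):
        G += h[idx] * np.exp(1j * m * phi) * K * (1j ** m) \
             * jv(m, omega * r_norm) * B(m, omega)
    return G
```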

3. Proposed Novel Beamformer

With the above formulation, we propose the following beampattern synthesis method. The basic idea is to enhance the broadband signals in a specific frequency region and at a certain direction. In order to achieve this goal, the following objective function is proposed:

$$ \min \int_{\omega}\!\int_{\phi} \bigl\| G(\omega,\phi) \bigr\|^2 \, d\omega\, d\phi, \qquad \text{s.t. } \bigl\| G(\omega,\phi_0) - 1 \bigr\| \le \delta, \quad \omega \in [\omega_l, \omega_u], \quad (11) $$

where G(ω,φ) is the spatial response of the beamformer given in (10), ω_l and ω_u are the lower and upper limits of the specified frequency region, respectively, φ_0 is the specified direction, and δ is a predefined threshold value that controls the magnitude of the ripples of the main beam.

Figure 1: Uniform circular array configuration (radius r, kth element at angle φ_k).

In principle, the objective function defined above aims to minimize the square of the spatial gain response across all frequencies and all angles, while constraining the gain to a value of one at the specified angle. The gain constraint is thus relaxed to one angle instead of all angles, so that the FI beampattern in the specified region can be improved. With this constraint setting, the resulting beamformer can enhance broadband desired signals arriving from one direction while attenuating broadband noise received from other directions. The concept behind the objective function is similar to that of the Capon beamformer [19]. One difference is that the Capon beamformer aims to minimize the data-dependent array output power at a single frequency, while the proposed algorithm aims to minimize the data-independent array output power across a wide range of frequencies. Another difference is that the constraint used in the Capon beamformer is a hard constraint, whereas the array gain constraint used in the proposed algorithm is a soft constraint, which results in a higher degree of flexibility.

The proposed algorithm is expected to have lower computational complexity than the UCCA beamformer. The latter is designed to achieve an FI beampattern for all angles, whereas the proposed algorithm focuses only on a specified angle. For the same reason, the proposed algorithm is expected to have a larger degree of freedom, which explains why it obtains a better FI beampattern for a given number of sensors. These performance improvements are supported by computer simulations and will be discussed in the later part of this paper.


Figure 2: The system structure of a uniform circular array beamformer (IDFT of the K sensor signals x_k[n] into phase modes p_m[n], per-mode FIR filters b_m[n], spatial weights h_m, and summation producing y[n]).

The optimization problems defined by (10) and (11) require the optimum values of both the compensation filter and the spatial weightings to be determined simultaneously. As such, Cholesky factorization is used to transform the objective function further into a Second-Order Cone Programming (SOCP) problem. The details of the implementation will be discussed in the following section. It should be noted that when the threshold value δ equals zero, the optimization process becomes a linearly constrained problem.

4. Convex Optimization-Based Implementation

Second-Order Cone Programming (SOCP) is a popular tool for solving convex optimization problems, and it has been used for array pattern synthesis [20–22] since the early paper by Lobo et al. [23]. One advantage of SOCP is that the global optimal solution is guaranteed if it exists, whereas a constrained least-squares optimization procedure looks for a local minimum. Another important advantage is that it is very convenient to include additional linear or convex quadratic constraints, such as a norm constraint on the variable vector, in the problem formulation. The standard form of SOCP can be written as follows:

$$ \min\; \mathbf{b}^T \mathbf{x}, \qquad \text{s.t. } \mathbf{d}_i^T \mathbf{x} + q_i \ge \| \mathbf{A}_i \mathbf{x} + \mathbf{c}_i \|_2, \quad i = 1, \ldots, N, \quad (12) $$

where x ∈ R^m is the variable vector; the parameters are b ∈ R^m, A_i ∈ R^{(n_i−1)×m}, c_i ∈ R^{n_i−1}, d_i ∈ R^m, and q_i ∈ R. The norm appearing in the constraints is the standard Euclidean norm, ‖u‖_2 = (u^T u)^{1/2}.

4.1. Convex Optimization of the Beampattern Synthesis Problem. The following transformations are carried out to convert (11) into the standard form defined by (12).

First, B_m(ω) = Σ_{n=0}^{N_m} b_m[n] e^{−jnω} is substituted into (10), where N_m is the filter order for each phase mode. The spatial response of the beamformer can now be expressed as

$$ G(\omega,\phi) = \sum_{m=-L}^{L} h_m\, e^{\,jm\phi}\, K\, j^m\, J_m(\omega r) \left[ \sum_{n=0}^{N_m} b_m[n]\, e^{-jn\omega} \right]. \quad (13) $$

Using the identity e^{−jnω} = cos(nω) − j sin(nω), (13) becomes

$$ \begin{aligned} G(\omega,\phi) &= \sum_{m=-L}^{L} h_m\, e^{\,jm\phi}\, K\, j^m\, J_m(\omega r) \left[ \sum_{n=0}^{N_m} b_m[n] \bigl( \cos(n\omega) - j\sin(n\omega) \bigr) \right] \\ &= K \sum_{m=-L}^{L} h_m\, e^{\,jm\phi}\, j^m\, J_m(\omega r) \left[ \sum_{n=0}^{N_m} b_m[n]\cos(n\omega) - j \sum_{n=0}^{N_m} b_m[n]\sin(n\omega) \right] \\ &= K \sum_{m=-L}^{L} h_m\, e^{\,jm\phi}\, j^m\, J_m(\omega r) \left[ \mathbf{c}_m \mathbf{b}_m - j\, \mathbf{s}_m \mathbf{b}_m \right], \end{aligned} \quad (14) $$

where b_m = [b_m[0], b_m[1], ..., b_m[N_m]]^T, c_m = [cos(0), cos(ω), ..., cos(N_m ω)], and s_m = [sin(0), sin(ω), ..., sin(N_m ω)]. Here h_m is the spatial weighting in the system structure and b_m is the FIR filter coefficient vector for each phase mode.

Letting u_m = h_m · j^m · b_m, we have

$$ \begin{aligned} G(\omega,\phi) &= K \sum_{m=-L}^{L} e^{\,jm\phi}\, J_m(\omega r)\, \mathbf{c}_m \mathbf{u}_m - j\, K \sum_{m=-L}^{L} e^{\,jm\phi}\, J_m(\omega r)\, \mathbf{s}_m \mathbf{u}_m \\ &= \mathbf{c}(\omega,\phi)\, \mathbf{u} - j\, \mathbf{s}(\omega,\phi)\, \mathbf{u}, \end{aligned} \quad (15) $$

where c(ω,φ) = [K e^{j(−L)φ} J_{−L}(ωr) c_{−L}, ..., K e^{j(L)φ} J_L(ωr) c_L], u = [u_{−L}^T, u_{−L+1}^T, ..., u_L^T]^T, and s(ω,φ) = [K e^{j(−L)φ} J_{−L}(ωr) s_{−L}, ..., K e^{j(L)φ} J_L(ωr) s_L].


Representing the complex spatial response G(ω,φ) by a two-dimensional vector g(ω,φ) that stacks the real and imaginary parts in separate rows, (15) is rewritten in the following form:

$$ \mathbf{g}(\omega,\phi) = \begin{pmatrix} \mathbf{c}(\omega,\phi) \\ -\mathbf{s}(\omega,\phi) \end{pmatrix} \mathbf{u} = \mathbf{A}(\omega,\phi)^H \mathbf{u}. \quad (16) $$

Hence, ‖G(ω,φ)‖² = g^H g = (A(ω,φ)^H u)^H (A(ω,φ)^H u) = u^H A(ω,φ) A(ω,φ)^H u.

The objective function and the constraint inequality defined in (11) can now be written as

$$ \min_{\mathbf{u}}\; \mathbf{u}^H \mathbf{R}\, \mathbf{u}, \qquad \text{s.t. } \bigl\| G(\omega,\phi_0) - 1 \bigr\| \le \delta, \quad \omega \in [\omega_l, \omega_u], \quad (17) $$

where R = ∫_ω ∫_φ A(ω,φ) A(ω,φ)^H dω dφ.
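In an implementation, the integral defining R can be approximated by summing A(ω,φ)A(ω,φ)^H over a grid of frequencies and angles. The sketch below builds the row vectors of (15) and accumulates R; the grid choices and a common FIR order N for all phase modes are assumptions made for illustration:

```python
import numpy as np
from scipy.special import jv

def cs_rows(omega, phi, L, N, K, r_norm):
    """Row vectors c(omega,phi) and s(omega,phi) of (15), each of length
    (2L+1)*(N+1); N is the FIR order used for every phase mode."""
    n = np.arange(N + 1)
    c_blocks, s_blocks = [], []
    for m in range(-L, L + 1):
        factor = K * np.exp(1j * m * phi) * jv(m, omega * r_norm)
        c_blocks.append(factor * np.cos(n * omega))
        s_blocks.append(factor * np.sin(n * omega))
    return np.concatenate(c_blocks), np.concatenate(s_blocks)

def build_R(L, N, K, r_norm, omegas, phis):
    """Grid approximation of R = integral of A A^H over omega and phi."""
    dim = (2 * L + 1) * (N + 1)
    R = np.zeros((dim, dim), dtype=complex)
    for w in omegas:
        for p in phis:
            c, s = cs_rows(w, p, L, N, K, r_norm)
            A_H = np.vstack([c, -s])       # 2 x dim matrix A(omega,phi)^H of (16)
            R += A_H.conj().T @ A_H        # accumulate A A^H
    return R
```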

In order to transform (17) into the SOCP form defined by (12), the cost function must be linear. Since the matrix R is Hermitian and positive definite, it can be decomposed into an upper triangular matrix and its conjugate transpose using Cholesky factorization, that is, R = D^H D, where D is the Cholesky factor of R. Substituting this into (17), we have

$$ \mathbf{u}^H \mathbf{R}\, \mathbf{u} = \mathbf{u}^H \bigl( \mathbf{D}^H \mathbf{D} \bigr) \mathbf{u} = (\mathbf{D}\mathbf{u})^H (\mathbf{D}\mathbf{u}). \quad (18) $$

This further simplifies (17) into the following form:

$$ \min_{\mathbf{u}}\; d^2, \qquad \text{s.t. } d^2 = \| \mathbf{D}\mathbf{u} \|^2, \quad \bigl\| G(\omega,\phi_0) - 1 \bigr\| \le \delta \;\; \text{for } \omega \in [\omega_l, \omega_u]. \quad (19) $$

Denoting by t an upper bound on the norm ‖Du‖ over the feasible choices of u, (19) reduces to

$$ \min_{\mathbf{u}}\; t, \qquad \text{s.t. } \| \mathbf{D}\mathbf{u} \| \le t, \quad \bigl\| G(\omega,\phi_0) - 1 \bigr\| \le \delta \;\; \text{for } \omega \in [\omega_l, \omega_u]. \quad (20) $$

It should be noted that (20) contains I different constraints, obtained by dividing the frequency range spanned by ω uniformly into I points.

Lastly, in order to solve (20) with an SOCP toolbox, we stack t and the coefficients of u together and define y = [t; u]. Let a = [1, 0]^T, so that t = a^T y. As a result, the objective function and the constraints defined in (11) can be expressed as

$$ \min_{\mathbf{y}}\; \mathbf{a}^T \mathbf{y}, \qquad \text{s.t. } \bigl\| [\,\mathbf{0} \;\; \mathbf{D}\,]\, \mathbf{y} \bigr\| \le \mathbf{a}^T \mathbf{y}, \quad \left\| \bigl[\,\mathbf{0} \;\; \mathbf{A}(\omega,\phi_0)^H \bigr]\, \mathbf{y} - \begin{pmatrix} 1 \\ 0 \end{pmatrix} \right\| \le \delta \;\; \text{for } \omega \in [\omega_l, \omega_u], \quad (21) $$

where 0 is the zero matrix with its dimension determined from the context.

Figure 3: The normalized spatial response of the proposed beamformer for ω = [0.3π, 0.95π] (gain in dB versus angle in degrees).

Equation (21) can now be solved with great efficiency using a convex optimization toolbox such as SeDuMi [24].
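The authors use SeDuMi in MATLAB; purely as an illustrative alternative (an assumption, not the authors' code), the same second-order cone program can be written with the Python package cvxpy, which accepts the norm constraints of (20)–(21) directly and passes them to an SOCP-capable solver. The pairs (c, s) below are the row vectors of (15) evaluated at the constraint frequencies and φ_0, for example from the cs_rows helper sketched above.

```python
import numpy as np
import cvxpy as cp

def solve_beamformer(D, cs_list, delta):
    """Sketch of the constrained design (17)/(20): minimize ||D u|| subject to
    |G(omega_i, phi_0) - 1| <= delta at the I constraint frequencies.
    D: Cholesky factor of R; cs_list: list of (c, s) row-vector pairs."""
    dim = D.shape[1]
    u = cp.Variable(dim, complex=True)
    cons = []
    for c, s in cs_list:
        G = (c - 1j * s) @ u                                   # complex response
        cons.append(cp.norm(cp.hstack([cp.real(G) - 1, cp.imag(G)])) <= delta)
    prob = cp.Problem(cp.Minimize(cp.norm(D @ u)), cons)
    prob.solve()
    return u.value
```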

4.2. Computational Complexity. When the Interior-Point Method (IPM) is used to solve the SOCP problem defined in (21), the number of iterations needed is bounded by O(√N), where N is the number of constraints, and the amount of computation per iteration is O(n² Σ_i n_i) [23].

The bulk of the computational requirement of the broadband array pattern synthesis comes from the optimization process. The computational complexity of the optimization process of the proposed algorithm and that of the UCCA algorithm have been calculated and are listed in Table 1.

It can be seen from Table 1 that the proposed algorithm requires a similar amount of computation per iteration but a much smaller number of iterations compared to the UCCA algorithm. The overall computational load of the proposed method is therefore much smaller than that needed by the UCCA algorithm. It should be noted that, as the coefficients are optimized in the phase-mode domain, the comparative computational load presented above is calculated based on the number of phase modes and not the number of sensors. Nevertheless, the larger the number of sensors, the larger the number of phase modes.

5. Numerical Results

In this numerical study, the performance of the proposed beamformer is compared with that of the UCCA beamformer [14] and Yan's beamformer [25] for the specified frequency region. The evaluation metric used to quantify the frequency invariance (FI) characteristics is the mean squared error of the array gain variation at the specified direction. The sensitivity of the proposed algorithm is also evaluated for different numbers of sensors and different threshold values controlling the magnitude of the ripples of the main beam.

Table 1: Computational complexity of different broadband beampattern synthesis methods.

Method | Number of iterations | Amount of computation per iteration
UCCA | O{√(I × M)} | O{(1 + P(1 + N_m))² [2M(I + 1)]}
Equation (11) | O{√(1 + I)} | O{[M(N_m + 1)²][2I + M(N_m + 1) + 1]}

Table 2: Comparison of array gain at each frequency along the desired direction for the three methods.

Normalized frequency (radians/sample) | Proposed beamformer (dB) | Yan's beamformer (dB) | UCCA beamformer (dB)
0.3 | −0.0007 | 0 | 0.6761
0.4 | −0.0248 | −0.8230 | 0.1760
0.5 | 0.0044 | −1.3292 | −0.022
0.6 | −0.0097 | −1.6253 | −0.2186
0.7 | −0.0046 | −1.8789 | −0.6301
0.8 | 0.0085 | −2.9498 | −0.1291
0.9 | −0.0033 | −6.2886 | 0.1477

A uniform circular array consisting of 20 sensors is considered. All sensors are assumed to be perfectly calibrated. The number of phase modes M is set to 17, and thus there are 17 spatial weighting coefficients. The order of the compensation filter is set to 16 for all phase modes. The frequency region of interest is specified to be from 0.3π to 0.95π. The threshold value δ, which controls the magnitude of the ripples of the main beam, is set to 0.1. The specified direction is set to 0°, where the reference microphone is located.

There are several optimization criteria presented in [25]. The one chosen for comparison is the peak-sidelobe-constrained minimax mainlobe spatial response variation (MSRV) design. Its objective is to minimize the maximum MSRV subject to a peak sidelobe constraint. Mathematically, it is expressed as

$$ \min_{\mathbf{h}}\; \sigma, \qquad \text{s.t. } \begin{cases} \mathbf{u}^T(f_0,\phi_0)\, \mathbf{h} = 1, \\ \bigl| [\mathbf{u}(f_k,\theta_q) - \mathbf{u}(f_0,\theta_q)]^T \mathbf{h} \bigr| \le \sigma, \\ \bigl| \mathbf{u}(f_k,\theta_s)^T \mathbf{h} \bigr| \le \varepsilon, \\ f_k \in [f_l, f_u], \quad \theta_q \in \Theta_{\mathrm{ML}}, \quad \theta_s \in \Theta_{\mathrm{SL}}, \end{cases} \quad (22) $$

where f_0 is the reference frequency, chosen to be f_l, and h is the vector of beamforming weights to be optimized. ε is the peak sidelobe constraint, set to 0.036. Θ_ML and Θ_SL represent the mainlobe and sidelobe regions, respectively.

The beampattern obtained with the proposed beamformer for the frequency region of interest is shown in Figure 3, in which the spatial response at 10 uniformly spaced discrete frequencies is superimposed. It can be seen that the proposed beamformer has an approximately constant gain within the frequency region of interest in the specified direction (0°). As the direction deviates from 0°, the FI property becomes poorer. The peak sidelobe level is −8 dB.

Figure 4: The normalized spatial response of the UCCA beamformer for ω = [0.3π, 0.95π] (gain in dB versus angle in degrees).

Figure 5: The normalized spatial response of Yan's beamformer for ω = [0.3π, 0.95π] (gain in dB versus angle in degrees).

The beampattern of the UCCA beamformer is shown in Figure 4. As the proposed algorithm is based on a circular array, only one layer of the UCCA concentric array is used for the numerical study. All other parameter settings remain the same as those used for the proposed algorithm. As shown in the figure, the beampattern of the UCCA beamformer is not as constant as that of the proposed beamformer in the specified direction (0°). The peak sidelobe level, at −6 dB, is also higher than that of the proposed beamformer.

Figure 6: Comparison of the FI characteristic of the proposed beamformer, the UCCA beamformer, and Yan's beamformer at 0 degrees for ω = [0.3π, 0.95π] (mean square error versus normalized frequency).

Figure 7: Directivity versus frequency for the broadband beam pattern shown in Figure 3.

The beampattern of Yan's beamformer is shown in Figure 5. Its frequency invariant characteristic is poorer at the desired direction; however, it has the lowest sidelobe level of the three. From this comparison, we find that, by processing the signal in the phase-mode domain, the frequency range over which the beamformer achieves Frequency Invariant (FI) characteristics is wider.

The mean squared errors of the spatial response gain in the specified direction and across different frequencies for the different methods are shown in Figure 6. It can be seen that the proposed beamformer outperforms both the UCCA beamformer and Yan's beamformer in achieving the FI characteristic at the desired direction. Table 2 tabulates the array gain at each frequency along the desired direction for these three methods.

Figure 8: White noise gain versus frequency for the broadband beam pattern shown in Figure 3.

Furthermore, the performance of the frequency invariant beam pattern obtained by the proposed method is assessed by evaluating the directivity and the white noise gain over the entire frequency band considered, as shown in Figures 7 and 8, respectively. Directivity describes the ability of the array to suppress a diffuse noise field, while white noise gain shows the ability of the array to suppress spatially uncorrelated noise, which can be caused by self-noise of the sensors. Because our array is a circular array, the directivity D(ω) is calculated using the following equation:

$$ D(\omega) = \frac{\Bigl| \sum_{m=-L}^{L} B_m(\omega) \Bigr|^2}{\sum_{m=-L}^{L} \sum_{n=-L}^{L} B_m(\omega)\, B_n^{*}(\omega)\, \mathrm{sinc}\bigl[ (m-n)\, 2\pi \omega r / c \bigr]}, \quad (23) $$

where B_m(ω) is the frequency response of the FIR filter at the mth phase mode and r is the radius of the circle.

As shown in the figure, the directivity has an approximately constant profile, with an average value of 13.1755 dB. The white noise gain ranges from 5.5 dB to 11.3 dB. These positive values represent an attenuation of the self-noise of the microphones. As expected, the lower the frequency, the smaller the white noise gain and the higher the sensitivity to array imperfections. Hence, the proposed beamformer is more sensitive to array imperfections at low frequencies and is most robust to array imperfections at the normalized frequency 0.75π.
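For completeness, (23) is straightforward to evaluate once the per-mode filter responses are known. A small sketch follows, with an explicit sin(x)/x to match the formula; B(m, ω) is assumed to be supplied by the caller, and the conjugation of B_n follows the reconstruction of (23) above:

```python
import numpy as np

def directivity_db(B, L, omega, r, c=343.0):
    """Directivity D(omega) of (23) in dB for a circular phase-mode beamformer.
    B(m, omega): frequency response of the FIR filter of phase mode m."""
    def sinc(x):                        # unnormalized sinc, sin(x)/x
        return 1.0 if abs(x) < 1e-12 else np.sin(x) / x
    num = abs(sum(B(m, omega) for m in range(-L, L + 1))) ** 2
    den = 0.0
    for m in range(-L, L + 1):
        for n in range(-L, L + 1):
            den += (B(m, omega) * np.conj(B(n, omega))
                    * sinc((m - n) * 2.0 * np.pi * omega * r / c)).real
    return 10.0 * np.log10(num / den)
```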

5.1. Sensitivity Study—Number of Sensors. Most FI beamformers reported in the literature employ a large number of sensors. In this study, the number of sensors used is reduced from 20 to 10 and then to 8, and the performances of the proposed FI beamformer, the UCCA beamformer, and Yan's beamformer are compared. The results are shown in Figures 9, 10, 11, 12, 13, and 14. As seen from the simulations, when 10 microphones are employed, the proposed algorithm achieves the best FI performance in the mainlobe region, with a sidelobe level of −8 dB. For the UCCA method and Yan's method, the frequency invariant characteristics are not promising at the desired direction, and higher sidelobes are obtained. When the number of microphones is further reduced to 8, the proposed method is still able to produce a reasonable FI beampattern, whereas the FI property of the beampattern of the UCCA algorithm becomes much poorer in the specified direction.

Figure 9: The normalized spatial response of the proposed FI beamformer for 10 microphones.

Figure 10: The normalized spatial response of the UCCA beamformer for 10 microphones.

Figure 11: The normalized spatial response of Yan's beamformer for 10 microphones.

Figure 12: The normalized spatial response of the proposed FI beamformer for 8 microphones.

5.2. Sensitivity Study—Different Threshold Value δ. In the proposed algorithm, δ is a parameter that defines the allowed ripple in the magnitude of the main-beam spatial gain response. In this section, different values of δ are used to study the sensitivity of the performance of the proposed algorithm to this parameter. Three values, namely δ = 0.001, 0.01, and 0.1, are selected, and the results obtained are shown in Figures 15, 16, and 17, respectively. The specified frequency region of interest remains the same. Figure 18 shows the mean squared error of the array gain at the specified direction (0°) for the three δ values studied.

As shown in the figures, as the value of δ decreases, the FI performance at the specified direction improves. The results also show that the improvement in FI performance in the specified direction is achieved at the cost of an increase in the peak sidelobe level and a poorer FI beampattern in the other directions within the main beam. For example, when the value of δ is 0.001, the peak sidelobe of the spatial response is as high as −5 dB and the beampatterns do not overlap well in the main beam. As δ increases to 0.1, the peak sidelobe of the spatial response is approximately −10 dB (lower) and the beampatterns in the main beam are observed to have relatively good FI characteristics.

Figure 13: The normalized spatial response of the UCCA beamformer for 8 microphones.

Figure 14: The normalized spatial response of Yan's beamformer for 8 microphones.

Figure 15: The normalized spatial response of the proposed beamformer for δ = 0.001.

Figure 16: The normalized spatial response of the proposed beamformer for δ = 0.01.

Figure 17: The normalized spatial response of the proposed beamformer for δ = 0.1.

Figure 18: Comparison of the FI characteristic of the proposed beamformer for δ = 0.001, 0.01, and 0.1 at 0 degrees for ω = [0.3π, 0.95π] (mean square error versus normalized frequency).

6. Conclusion

A selective frequency invariant uniform circular broadband beamformer is presented in this paper. In addition to providing the full details of a recent conference paper by the authors, a complexity analysis and two sensitivity studies on the proposed algorithm are also presented. The proposed algorithm is designed to minimize an objective function of the spatial response gain with a constraint on the gain being smaller than a predefined threshold value across a specified frequency range and in a specified direction. The problem is formulated as a convex optimization problem and the solution is obtained using the Second-Order Cone Programming (SOCP) technique. The complexity analysis shows that the proposed algorithm has a lower computational requirement than the UCCA algorithm for the problem defined. Numerical results show that the proposed algorithm is able to achieve a more frequency-invariant beampattern and a smaller mean square error of the spatial response gain in the specified direction across the specified FI region compared to the UCCA algorithm.

Acknowledgments

The authors would like to acknowledge the helpful discussions with H. H. Chen of the University of Hong Kong on the UCCA algorithm. The authors would also like to thank STMicroelectronics (Singapore) for sponsoring this project. Last but not least, the authors would like to thank the reviewers for their constructive comments and suggestions, which greatly improved the quality of this manuscript.

References

[1] H. Krim and M. Viberg, "Two decades of array signal processing research: the parametric approach," IEEE Signal Processing Magazine, vol. 13, no. 4, pp. 67–94, 1996.

[2] D. H. Johnson and D. E. Dudgeon, Array Signal Processing: Concepts and Techniques, Prentice-Hall, Upper Saddle River, NJ, USA, 1993.

[3] R. A. Monzingo and T. W. Miller, Introduction to Adaptive Arrays, John Wiley & Sons, SciTech, New York, NY, USA, 2004.

[4] B. D. Van Veen and K. M. Buckley, "Beamforming: a versatile approach to spatial filtering," IEEE ASSP Magazine, vol. 5, no. 2, pp. 4–24, 1988.

[5] O. L. Frost III, "An algorithm for linearly constrained adaptive array processing," Proceedings of the IEEE, vol. 60, no. 8, pp. 926–935, 1972.

[6] W. Liu, D. McLernon, and M. Ghogho, "Frequency invariant beamforming without tapped delay-lines," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '07), vol. 2, pp. 997–1000, Honolulu, Hawaii, USA, April 2007.

[7] M. Ghavami, "Wideband smart antenna theory using rectangular array structures," IEEE Transactions on Signal Processing, vol. 50, no. 9, pp. 2143–2151, 2002.

[8] T. Chou, "Frequency-independent beamformer with low response error," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '95), vol. 5, pp. 2995–2998, Detroit, Mich, USA, May 1995.

[9] D. B. Ward, R. A. Kennedy, and R. C. Williamson, "FIR filter design for frequency invariant beamformers," IEEE Signal Processing Letters, vol. 3, no. 3, pp. 69–71, 1996.

[10] A. Trucco and S. Repetto, "Frequency invariant beamforming in very short arrays," in Proceedings of the MTS/IEEE Techno-Ocean (Oceans '04), vol. 2, pp. 635–640, November 2004.

[11] A. Trucco, M. Crocco, and S. Repetto, "A stochastic approach to the synthesis of a robust frequency-invariant filter-and-sum beamformer," IEEE Transactions on Instrumentation and Measurement, vol. 55, no. 4, pp. 1407–1415, 2006.

[12] S. Doclo and M. Moonen, "Design of far-field and near-field broadband beamformers using eigenfilters," Signal Processing, vol. 83, no. 12, pp. 2641–2673, 2003.

[13] L. C. Parra, "Least squares frequency-invariant beamforming," in Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 102–105, New Paltz, NY, USA, October 2005.

[14] S. C. Chan and H. H. Chen, "Uniform concentric circular arrays with frequency-invariant characteristics—theory, design, adaptive beamforming and DOA estimation," IEEE Transactions on Signal Processing, vol. 55, no. 1, pp. 165–177, 2007.

[15] X. Zhang, W. Ser, Z. Zhang, and A. K. Krishna, "Uniform circular broadband beamformer with selective frequency and spatial invariant region," in Proceedings of the 1st International Conference on Signal Processing and Communication Systems (ICSPCS '07), Gold Coast, Australia, December 2007.

[16] W. Ser, T. T. Zhang, J. Yu, and J. Zhang, "Detection of wheezes using a wearable distributed array of microphones," in Proceedings of the 6th International Workshop on Wearable and Implantable Body Sensor Networks (BSN '09), pp. 296–300, Berkeley, Calif, USA, June 2009.

[17] D. E. N. Davies, "Circular arrays," in Handbook of Antenna Design, Peregrinus, London, UK, 1983.

[18] M. Abramowitz and I. A. Stegun, Handbook of Mathematical Functions, Dover, New York, NY, USA, 1965.

[19] J. Capon, "High-resolution frequency-wavenumber spectrum analysis," Proceedings of the IEEE, vol. 57, no. 8, pp. 1408–1418, 1969.

[20] F. Wang, V. Balakrishnan, P. Y. Zhou, J. J. Chen, R. Yang, and C. Frank, "Optimal array pattern synthesis using semidefinite programming," IEEE Transactions on Signal Processing, vol. 51, no. 5, pp. 1172–1183, 2003.

[21] J. Liu, A. B. Gershman, Z.-Q. Luo, and K. M. Wong, "Adaptive beamforming with sidelobe control: a second-order cone programming approach," IEEE Signal Processing Letters, vol. 10, no. 11, pp. 331–334, 2003.

[22] S. Autrey, "Design of arrays to achieve specified spatial characteristics over broadbands," in Signal Processing, J. W. R. Griffiths, Ed., pp. 507–524, Academic Press, New York, NY, USA, 1973.

[23] M. S. Lobo, L. Vandenberghe, S. Boyd, and H. Lebret, "Applications of second-order cone programming," Linear Algebra and Its Applications, vol. 284, no. 1–3, pp. 193–228, 1998.

[24] J. F. Sturm, "Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones," Optimization Methods and Software, vol. 11, no. 1–4, pp. 625–653, 1999.

[25] S. Yan, Y. Ma, and C. Hou, "Optimal array pattern synthesis for broadband arrays," Journal of the Acoustical Society of America, vol. 122, no. 5, pp. 2686–2696, 2007.

Page 21: Microphone Array Speech Processingdownloads.hindawi.com/journals/specialissues/547087.pdf · Editor-in-Chief Phillip Regalia, Institut National des Tel´ ecommunications, France´

Hindawi Publishing CorporationEURASIP Journal on Advances in Signal ProcessingVolume 2010, Article ID 230864, 16 pagesdoi:10.1155/2010/230864

Research Article

First-Order Adaptive Azimuthal Null-Steering forthe Suppression of Two Directional Interferers

Rene M. M. Derkx

Digitial Signal Processing Group, High Tech Campus 36, 5656 AE Eindhoven, The Netherlands

Correspondence should be addressed to Rene M. M. Derkx, [email protected]

Received 21 July 2009; Revised 10 November 2009; Accepted 15 December 2009

Academic Editor: Simon Doclo

Copyright © 2010 Rene M. M. Derkx. This is an open access article distributed under the Creative Commons Attribution License,which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

An azimuth steerable first-order superdirectional microphone response can be constructed by a linear combination of threeeigenbeams: a monopole and two orthogonal dipoles. Although the response of a (rotation symmetric) first-order response canonly exhibit a single null, we will look at a slice through this beampattern lying in the azimuthal plane. In this way, we can definemaximally two nulls in the azimuthal plane which are symmetric with respect to the main-lobe axis. By placing these two nulls onmaximally two directional sources to be rejected and compensating for the drop in level for the desired direction, we can effectivelyreject these directional sources without attenuating the desired source. We present an adaptive null-steering scheme for adjustingthe beampattern so as to obtain this suppression of the two directional interferers automatically. Closed-form expressions for thisoptimal null-steering are derived, enabling the computation of the azimuthal angles of the interferers. It is shown that the proposedtechnique has a good directivity index when the angular difference between the desired source and each directional interferer is atleast 90 degrees.

1. Introduction

In applications such as hands-free communication and voicecontrol systems, the microphone signal does not only containthe desired sound-source (e.g., a speech signal) but can alsocontain undesired directional interferers and backgroundnoise (e.g., diffuse-noise). To reduce the amount of noiseand minimize the influence of interferers, we can use amicrophone array and apply beamforming techniques tosteer the main-lobe of a beam towards the desired source-signal, for example, a speech signal. In this paper, we focuson arrays where the wavelength of the sound is much largethan the size of the array. These arrays are therefore called“Small Microphone Arrays.” When using omnidirectional(monopole) microphones in a small microphone arrayconfiguration, additive beamformers like delay-and-sum arenot able to obtain a sufficient directivity as the beamwidthdeteriorates for larger wavelengths [1, 2]. A common methodto obtain improved directivity is to apply superdirectivebeamforming techniques. In this paper, we will focus onfirst-order superdirective beamforming. (The term “first-order” is used to indicate that the directivity-pattern of the

superdirectional response is constructed by means of a linearcombination of a pressure and velocity (first-order spatialderivative of the pressure field) response.)

A first method to obtain this first-order superdirectivityis by using microphone-arrays with omnidirectionalmicrophone elements and to apply beamforming-techniqueswith asymmetrical filter-coefficients [3]. Basically, thisasymmetrical filtering corresponds to subtraction of signals,like in delay-and-subtract techniques [4, 5] or by takingspatial derivatives of the sound pressure field [6, 7]. Assubtraction leads to smaller signals for low frequencies, afirst-order integrator needs to be applied to equalize thefrequency-response, resulting in an increased sensitivity(20 dB/decade) for sensor-noise and increased sensitivityfor mismatches in microphones characteristics [8, 9] for thelower-frequency-range.

A second method to obtain first-order superdirectivity isby using microphone-arrays with first-order unidirectionalmicrophone elements. As the separate uni-directional micro-phone elements already have a first-order superdirectiveresponse, consisting out of a sum of a pressure and avelocity response, the beamformer can simply be constructed

Page 22: Microphone Array Speech Processingdownloads.hindawi.com/journals/specialissues/547087.pdf · Editor-in-Chief Phillip Regalia, Institut National des Tel´ ecommunications, France´

2 EURASIP Journal on Advances in Signal Processing

M2

θ

M1

M0 x

y

z

M2

φ

M1

M0

x

y

(0, 0)

Figure 1: Circular array geometry with three cardioid microphones.

by a linear combination of the uni-directional microphonesignals. In such an approach, there is no need to apply a first-order integrator (as was the case for omni-directional micro-phone elements), and we avoid a 20 dB/decade increasedsensitivity for sensor-noise [7]. Nevertheless, uni-directionalmicrophones may have a low-frequency roll-off, whichcan be compensated for by means of proper equalizationtechniques. Throughout this paper, we will assume that theuni-directional microphones have a flat frequency response.

We focus on the construction of first-order superdirec-tional beampatterns where the nulls of the beampattern aresteered to the directional interferers, while having a unityresponse in the direction of the desired sound-source. InSection 2, we construct a monopole and two orthogonaldipole responses (known as “eigen-beams” [10, 11]) outof a circular array of three first-order cardioid microphoneelements M0, M1, and M2 (with a heart-shaped directionalpattern), as shown in Figure 1. Here θ and φ are the standardspherical coordinate angles: elevation and azimuth.

Based on these eigenbeams, we are able to constructarbitrary first-order responses that can be steered withthe main-lobe in any azimuthal direction (see Section 2).Although the (rotation symmetric) first-order response canonly exhibit a single null, we will look at a slice throughthe beampattern lying in the azimuthal plane. In thisway, we can define maximally two nulls in the azimuthalplane which are symmetric with respect to the main-lobeaxis. By placing these two nulls on the two directionalsources to be rejected and compensating for the drop inlevel for the desired direction, we can effectively reject thedirectional sources without attenuating the desired source.In Section 3 expressions are derived for this beampatternsynthesis.

To develop an adaptive null-steering algorithm, we firstshow in Section 4 how the superdirective beampattern canbe synthesized via the Generalized Sidelobe Canceller (GSC)[12]. This GSC enables us to optimize a cost-function inan unconstrained manner with a gradient descent search-method that is described in Section 5. Furthermore, the GSCenables tracking of the angles of the separate directionalinterferers, which is validated by means of simulations and

experiments in Section 6. Finally, in Section 7, conclusionsare given.

2. Construction of Eigenbeams

We know from [7, 9] that by using a circular array of at leastthree (omni- or uni-directional microphone) sensors in aplanar geometry and applying signal processing techniques,it is possible to construct a first-order superdirectionalresponse. This superdirectional response can be steeredwith its main-lobe to any desired azimuthal angle andcan be adjusted to have any first-order directivity pattern.As mentioned in the introduction, we will use three uni-directional cardioid microphones (with a heart-shapeddirectional pattern) in a circular configuration, where themain-lobes of the three cardioid responses are pointedoutwards, as shown in Figure 1.

The responses of the three cardioid microphonesM0,M1,and M2 are given by, respectively, E0

c (r, θ,φ), E1c (r, θ,φ), and

E2c (r, θ,φ), having their main-lobes at, respectively, φ = 0,

2π/3, and 4π/3 radians. Assuming that we have no sensor-noise, the nth cardioid microphone response, with n =0, 1, 2, for a harmonic plane-wave with frequency f is ideallygiven by [11]

Enc(r, θ,φ

) = Anejψn . (1)

The magnitude-response An and phase-response ψn of thenth cardioid microphone are given by, respectively:

An = 12

+12

cos(φ − 2nπ

3

)sin θ, (2)

ψn =2π fc

sin θ(xn cosφ + yn sinφ

). (3)

Here c is the speed of sound and xn and yn are the x and ycoordinates of the nth microphone (as shown in Figure 1),given by

xn = r cos(φ − 2nπ

3

),

yn = r sin(φ − 2nπ

3

),

(4)

Page 23: Microphone Array Speech Processingdownloads.hindawi.com/journals/specialissues/547087.pdf · Editor-in-Chief Phillip Regalia, Institut National des Tel´ ecommunications, France´

EURASIP Journal on Advances in Signal Processing 3

1

0

−1

1

0.5

0

−0.5

−1

−1−0.50

0.51

(a) Em(θ,φ)

1

0

−1

1

0.5

0

−0.5

−1

−1−0.50

0.51

(b) E0d(θ,φ)

1

0

−1

1

0.5

0

−0.5

−1

−1−0.50

0.51

(c) Eπ/2d (θ,φ)

Figure 2: Eigenbeams (monopole and two orthogonal dipoles).

with r being the radius of the circle on which the micro-phones are located.

We can simplify (3) as

ψn =2π fcr sin θ cos

(2nπ

3

). (5)

From the three cardioid microphone responses, wecan construct the circular harmonics [7], also known as“eigenbeams” [10, 11]), by using the 3-point Discrete FourierTransform (DFT) with the three microphones as inputs. ThisDFT produces three phase-modes Pi(r, θ,φ) [7] with i =1, 2, 3:

P0(r, θ,φ

) = 13

2∑

n=0

Enc(r, θ,φ

),

P1(r, θ,φ

) = [P2(r, θ,φ

)]∗

= 13

2∑

n=0

Enc(r, θ,φ

)e− j 2πn/3,

(6)

with j = √−1 and ∗ being the complex-conjugate operator.Via the phase-modes, we can construct the monopole as

Em(r, θ,φ

) = 2P0(r, θ,φ

), (7)

and the orthogonal dipoles as

E0d

(r, θ,φ

) = 2[P1(r, θ,φ

)+ P2

(r, θ,φ

)],

Eπ/2d

(r, θ,φ

) = 2 j[P1(r, θ,φ

)− P2(r, θ,φ

)].

(8)

In matrix notation⎡

⎢⎢⎢⎣

Em

E0d

Eπ/2d

⎥⎥⎥⎦= 2

3

⎢⎢⎢⎣

1 1 1

2 −1 −1

0√

3 −√3

⎥⎥⎥⎦

⎢⎢⎢⎣

E0c

E1c

E2c

⎥⎥⎥⎦. (9)

For frequencies with wavelengths larger than the size ofthe array (for wavelengths smaller than the size of the array,

spatial aliasing effects will occur) , that is, r � c/ f , thephase-component ψn, given by (5) can be neglected and theresponses of the eigenbeams for these frequencies are equalto

Em =, 1

E0d

(θ,φ

) = cosφ sin θ,

Eπ/2d

(θ,φ

) = cos(φ − π

2

)sin θ.

(10)

The directivity patterns of these eigenbeams are shown inFigure 2.

The zeroth-order eigenbeam Em represents the monopoleresponse, while the first-order eigenbeams E0

d(θ,φ) andEπ/2d (θ,φ) represent the orthogonal dipole responses.

The dipole can be steered to any angle ϕs by means of aweighted combination of the orthogonal dipole pair:

Eϕsd

(θ,φ

) = cosϕsE0d

(θ,φ

)+ sinϕsE

π/2d

(θ,φ

), (11)

with 0 ≤ ϕs ≤ 2π being the steering angle.Finally, the steered and scaled superdirectional micro-

phone response can be constructed via

E(θ,φ

) = S[αEm + (1− α)E

ϕsd

(θ,φ

)]

= S[α + (1− α) cos

(φ − ϕs

)sin θ

],

(12)

with α ≤ 1 being the parameter for controlling thedirectional pattern of the first-order response and S being anarbitrary scaling factor. Both parameters α and S may alsohave negative values.

Alternatively, we can write the construction of theresponse in matrix-vector notation:

E(θ,φ

) = SFTαRϕsX, (13)

with the pattern-synthesis vector:

Fα =

⎢⎢⎢⎣

α

(1− α)

0

⎥⎥⎥⎦

, (14)

Page 24: Microphone Array Speech Processingdownloads.hindawi.com/journals/specialissues/547087.pdf · Editor-in-Chief Phillip Regalia, Institut National des Tel´ ecommunications, France´

4 EURASIP Journal on Advances in Signal Processing

the rotation-matrix Rϕs :

Rϕs =

⎢⎢⎢⎣

1 0 0

0 cosϕs sinϕs

0 − sinϕs cosϕs

⎥⎥⎥⎦

, (15)

and the input-vector:

X =

⎢⎢⎢⎣

Em

E0d

(θ,φ

)

Eπ/2d

(θ,φ

)

⎥⎥⎥⎦=

⎢⎢⎢⎣

1

cosφ sin θ

sinφ sin θ

⎥⎥⎥⎦. (16)

In the remainder of this paper, we will assume that wehave unity response of the superdirectional microphone fora desired source coming from an arbitrary azimuthal angleφ = ϕs and for an elevation angle θ = π/2 and we want tosuppress two interferers by steering two nulls towards twoazimuthal angles φ = ϕn1 and φ = ϕn2 , also for an elevationangle θ = π/2. Hence, we assume θ = π/2 in the remainderof this paper.

3. Optimal Null-Steering for Two DirectionalInterferers via Direct Pattern Synthesis

3.1. Pattern Synthesis. The first-order response of (12), withthe main-lobe of the response steered to ϕs, has two nulls forα ≤ 1/2, given by (see [13])

ϕn1 ,ϕn2 = ϕs ± arccos( −α

1− α). (17)

If we want to steer two nulls to arbitrary angles ϕn1 andϕn2 , not lying symmetrical with respect to ϕs, it can be seenthat we cannot steer the main-lobe of the first-order responseto ϕs. Therefore, we steer the main-lobe to ϕs and use a scale-factor S under the constraint that a unity response is obtainedat angle ϕs. In matrix notation,

E(θ,φ

) = SFTαRϕsX, (18)

with the rotation-matrix and the pattern-synthesis matrixbeing as in (15) and (14), respectively, with α, ϕs instead ofα,ϕs.

From (12), we see that a unity desired response at angleϕs is obtained when we choose the scale-factor S as

S = 1α + (1− α) cos

(ϕs − ϕs

) , (19)

with α being the parameter for controlling the directionalpattern of the first-order response (similar to the parameterα), ϕs the angle for the desired sound, and ϕs the angle forthe steering (which, in general, is different from ϕs).

Next, we want to place the nulls at ϕn1 and ϕn2 . Hence, wesolve the following system of two equations:

S[α + (1− α) cos

(ϕn1 − ϕs

)] = 0,

S[α + (1− α) cos

(ϕn2 − ϕs

)] = 0.(20)

Solving the two unknowns α and ϕs gives

ϕs = 2 arctanX , (21)

α = sin(Δϕn

)X

cosϕn1 − cosϕn2 + X[sinϕn1 − sinϕn2 + sin

(Δϕn

)] ,

(22)

with

X =sinϕn1 − sinϕn2 ±

√2− 2 cos

(Δϕn

)

cosϕn1 − cosϕn2

, (23)

Δϕn = ϕn1 − ϕn2 . (24)

It is noted that (23) can have two solutions, leading todifferent solutions for ϕs, α, and S. However, the resultingbeampatterns are identical.

As can be seen we get a vanishing denominator in (22)for ϕn1 = ϕs and/or ϕn2 = ϕs. Similarly, this is the case whenΔϕn = ϕn1 − ϕn2 goes to zero. For this latter case, we cancompute the limit of ϕs and α:

limΔϕn→ 0

ϕs = 2 arctan

[sinϕni

cosϕni − 1

]

= ϕni + π, (25)

with i = 1, 2 and

limΔϕn→ 0

α = 12

, (26)

where Δϕn = ϕn1 − ϕn2 .For the case Δϕn = 0, we actually steer a single null

towards the two directional interferers ϕn1 and ϕn2 . Equations(25) and (26) describe the limit-case solution for which thereare an infinite number of solutions that satisfy the system ofequations, given by (21).

3.2. Analysis of Directivity Index. Although the optimizationin this paper is focused on the suppression of two directionalinterferers, it is also important to analyze the noise-reductionperformance for isotropic noise circumstances. We will onlyanalyze the spherical isotropic noise case, for which wecompute the spherical directivity factor QS given by [4, 5]

QS =4πE2

(π/2,ϕs

)

∫ 2πφ=0

∫ πθ=0E2

(θ,φ

)sin θdθ dφ

. (27)

If we combine (27) with (18), we get

QS(ϕ1,ϕ2

) = 6(1− cosϕ1

)(1− cosϕ2

)

5 + 3 cos(ϕ1 − ϕ2

) , (28)

with

ϕ1 = ϕn1 − ϕs, (29)

ϕ2 = ϕn2 − ϕs. (30)

In Figure 3, the contour-plot of the directivity factor QS

is shown with ϕ1 and ϕ2 on the x- and y-axes, respectively.

Page 25: Microphone Array Speech Processingdownloads.hindawi.com/journals/specialissues/547087.pdf · Editor-in-Chief Phillip Regalia, Institut National des Tel´ ecommunications, France´

EURASIP Journal on Advances in Signal Processing 5

2.5

3.5

3.5

2.5

1.5

1

1

2

1.51

2

3

3

3.5

2.5

1.5

2.5

0.5

2.5

1.52

1

1

1

2

2

2

2

2

2.5

2.5

2.53

3

3

3

3

33.5

3.5

1.5

1.50.5

3.5

3

3

12π π

32π

ϕ1 (rad)

12π

π

32π

ϕ2

(rad

)

Figure 3: Contour-plot of the directivity factor QS(ϕ1,ϕ2).

As can be seen in (28), the directivity factor goes tozero if one of the angles ϕn1 or ϕn2 gets close to ϕs. Clearly,a directivity factor which is smaller than unity is not veryuseful in practice. Hence, the pattern synthesis technique isonly useful when the angles ϕn1 and ϕn2 are located in onehalf-plane and the desired source is located around the centerof the opposite half-plane.

It can be found in the appendix that for

ϕ1 = arccos(−1

3

),

ϕ2 = 2π − arccos(−1

3

),

(31)

a maximum directivity factor QS = 4 is obtained. This cor-responds with 6 dB directivity index, defined as 10 log10QS,where the directivity pattern resembles a hypercardioid.Furthermore for (ϕ1,ϕ2) = (π,π) rad. a directivity factorQS = 3 is obtained, corresponding with 4.8 dB directivityindex, where the directivity pattern yields a cardioid. As canbe seen from Figure 3, we can define a usable region, wherethe directivity-factor is QS > 3/4 for π/2 ≤ ϕ1, ϕ2 ≤ 3π/2.

4. Optimal Null-Steering for Two DirectionalInterferers via GSC

4.1. Generalized Sidelobe Canceller (GSC) Structure. Todevelop an adaptive algorithm for steering two nulls towardsthe two directional interferers based on the pattern-synthesistechnique in Section 3, it would be required to use aconstrained optimization technique where we want to main-tain a unity response towards the angle ϕs. For adaptivealgorithms, it is generally easier to adapt in an unconstrainedmanner. Therefore, we first present an alternative schemefor the null-steering, similar to the direct pattern-synthesistechnique as discussed in Section 3, but based on the well-known Generalized Sidelobe Canceller (GSC) [12]. In the

Em

E0d

Eπ/2d

Rϕs Fα

B

Ep

Er1

Er2

w1

w2

− −+E

Figure 4: Generalized Sidelobe Canceller scheme.

GSC scheme, first a prefiltering with a fixed value of ϕs andα is performed, to construct a primary signal with a unityresponse to angle ϕs and two noise references. As the twonoise references do not include the source coming from angleϕs, two noise-canceller weights w1 and w2 can be optimizedin an unconstrained manner. The GSC scheme is shown inFigure 4.

We start by constructing the primary-response as

Ep(θ,φ

) = FTαRϕsX, (32)

with FTα , Rϕs , and X being as defined in the introduction andusing a scale-factor S = 1.

Furthermore, we can create two noise-references via⎡

⎣Er1

(θ,φ

)

Er2

(θ,φ

)

⎦ = BTRϕsX (33)

with a blocking-matrix B [14] given by

B =

⎢⎢⎢⎢⎢⎢⎣

12

0

−12

0

0 1

⎥⎥⎥⎥⎥⎥⎦

. (34)

It is noted that the noise-references Er1 and Er2 are,respectively, a cardioid and a dipole response, with a nullsteered towards the angle of the desired source at azimuthφ = ϕs and elevation θ = π/2.

The primary- and the noise-responses can be used in thegeneralized sidelobe canceller structure, to obtain an outputas

E(θ,φ

) = Ep(θ,φ

)−w1Er1

(θ,φ

)−w2Er2

(θ,φ

). (35)

It is important to note that for any value of ϕs, α, w1, andw2, a unity-response at the output of the GSC is maintainedfor angle φ = ϕs and θ = π/2.

In the next sections we give some details in computingw1

and w2 for the suppression of two directional interferers, asdiscussed in the previous section.

4.2. Optimal GSC Null-Steering for Two Directional Inter-ferers. Using the GSC structure of Figure 4 having a unityresponse at angle φ = ϕs, we can compute the weights w1

Page 26: Microphone Array Speech Processingdownloads.hindawi.com/journals/specialissues/547087.pdf · Editor-in-Chief Phillip Regalia, Institut National des Tel´ ecommunications, France´

6 EURASIP Journal on Advances in Signal Processing

and w2 to steer two nulls towards azimuthal angles ϕn1 andϕn2 , by solving

Ep

2,ϕi

)−w1Er1

2,ϕi

)−w2Er2

2,ϕi

)= 0 (36)

for i = 1, 2.This results in the following relations:

w1 = 2α +2 sin

(ϕ1 − ϕ2

)

sinϕ1 − sin(ϕ1 − ϕ2

)− sinϕ2, (37)

w2 =cosϕ1 − cosϕ2

sinϕ1 − sin(ϕ1 − ϕ2

)− sinϕ2, (38)

where ϕ1 and ϕ2 are defined as given by (29) and (30),respectively.

To eliminate the dependency of α in (37), we will use

w1 = w1 − 2α. (39)

The denominators in (37) and (38) vanish when ϕn1 = ϕsand/or ϕn2 = ϕs. Also when Δϕn = ϕn1 −ϕn2 goes to zero, thedenominator vanishes. In this case, we can compute the limitof w1 and w2:

limΔϕn→ 0

w1 = −2, (40)

limΔϕn→ 0

w2 = sinϕi (41)

with i = 1, 2.For the case Δϕn = 0, we actually steer a single null

towards the two directional interferers ϕn1 and ϕn2 . Equations(40) and (41) describe the limit-case solution for which thereare an infinite number of solutions (w1,w2) that satisfy (36).

From the values of w1 and w2, we can derive thetwo angles of the directional interferers ϑ1 and ϑ2, where(ϑ1, ϑ2) = (ϕ1,ϕ2) or (ϑ1, ϑ2) = (ϕ2,ϕ1). The two anglesare obtained via a computation involving the arctan-functionwith additional sign checking to resolve all four quadrants inthe azimuthal plane and can be computed as

ϑ1, ϑ2 =

⎧⎪⎪⎪⎪⎪⎪⎪⎪⎨

⎪⎪⎪⎪⎪⎪⎪⎪⎩

arctan(N

D

)for : D ≥ 0,

arctan(N

D

)+ π for : D < 0, N ≥ 0,

arctan(N

D

)− π for : D < 0, N < 0,

(42)

with

N = −2(w1w2 ∓ X1)X2

,

D = w31 + 4w2

1 + 4w1 ± 4w2X1

X2(w1 + 2),

(43)

with

X1 =√

(w1 + 2)2(1 + w1 +w22

),

X2 = 4 + 4w1 + w21 + 4w2

2 .(44)

2

2

11

1

2.52.51.

5

2

1

1

1

1

3.5

3.5

2.51.5

0.5 0.5

0.5

0.50.5

1.5

1.5

3

3

3

1.5

0.5

−2 −1 0 1 2

w1

−2

−1

0

1

2

w2

Figure 5: Contour-plot of the directivity factor QS(w1,w2).

Note that with this computation, it is not necessarily truethat ϑ1 = ϕ1 and ϑ2 = ϕ2, that is, we can have a permutationambiguity. Furthermore, we compute the resolved angles ofthe directional interferers as

ϑni = ϑi − ϕs, (45)

where (ϑn1 , ϑn2 ) = (ϕn1 ,ϕn2 ) or (ϑn1 , ϑn2 ) = (ϕn2 ,ϕn1 ).

4.3. Analysis of Directivity Index. Just as for the directpattern synthesis in the previous section, we can analyze thedirectivity factor for spherical isotropic noise. We can insertthe values of w1 and w2 into (27) and (35) and get

QS(w1,w2) = 3w1 + w2

1 +w22 + 1

. (46)

In Figure 5, we show the contour-plot of the directivityfactor with w1 and w2 on the x- and y-axes, respectively.

From Figure 5 and (46), it can be seen that the contoursare concentric circles with the center at coordinate (w1,w2) =(−1/2, 0) where the maximum directivity factor of 4 isobtained.

5. Adaptive Algorithm

5.1. Cost-Function for Directional Interferers. Next, wedevelop an adaptation scheme to adapt two weights in theGSC structure as discussed in the previous Section 4. We aimat obtaining the solution, where a unity response is obtainedat angle ϕs and two nulls are placed at angles ϕn1 and ϕn2 .

We start with

y[k] = p[k]− (w1[k] + 2α)r1[k]− w2[k]r2[k], (47)

with k being the discrete-time index, y[k] the output signal,w1[k] and w2[k] the adaptive weights, r1[k] and r2[k] the

Page 27: Microphone Array Speech Processingdownloads.hindawi.com/journals/specialissues/547087.pdf · Editor-in-Chief Phillip Regalia, Institut National des Tel´ ecommunications, France´

EURASIP Journal on Advances in Signal Processing 7

noise reference signals, and p[k] the primary signal. Theinclusion of the term 2α in (47) is a consequence of the factthat w1[k] is an estimate of w1 (see (39) in which 2α is notincluded).

In the ideal case that we want to obtain a unity responsefor a source-signal s[k] originating from angle ϕs and havean undesired source-signal n1[k] originating from angle ϕn1

together with an undesired source-signal n2[k] originatingfrom angle ϕn2 , we have

p[k] = s[k] +∑

i=1,2

[α + (1− α) cosϕi

]ni[k],

r1[k] =∑

i=1,2

(12− 1

2cosϕi

)ni[k],

r2[k] =∑

i=1,2

sinϕini[k].

(48)

The cost-function J(w1, w2) is defined as a function of w1

and w2 and is given by

J(w1, w2) = E{y2[k]

}, (49)

with E{·} being the expectation operator.Using that E{n1[k]n2[k]} = 0 and E{ni[k]s[k]} = 0 for

i = 1, 2, we can write

J(w1, w2) = E{[p[k]− (w1[k] + 2α)r1[k]− w2[k]r2[k]

]2}

= σ2s [k] +

i=1,2

σ2ni[k]

×[

14w1[k]2 + w2[k]2

+ cos2ϕi

(14w1[k]2 + w1[k]− w2[k]2 + 1

)

+ cosϕi

(−1

2w1[k]2−w1[k]

)+sinϕiw1[k]w2[k]

+ cosϕi sinϕi(−2w2[k]− w1[k]w2[k])]

= σ2s [k] +

i=1,2

[w1[k]− (2 + w1[k]) cosϕi

+2w2[k] sinϕi]2 σ

2ni[k]

4,

(50)

with

σ2s [k] = E

{s2[k]

},

σ2ni[k] = E

{n2i [k]

}.

(51)

We can see that the cost-function is a quadratic-function[15] that can be written in matrix-notation (for convenience,we leave out the index k):

J(w1, w2) = σ2s +

∥∥∥Apw − vp

∥∥∥

2

= σ2s +wTAT

pApw − 2wTATp vp + vTp vp,

(52)

with

Ap =

⎢⎢⎣

σn1

2

(1− cosϕ1

)σn1 sinϕ1

σn2

2

(1− cosϕ2

)σn2 sinϕ2

⎥⎥⎦,

w =⎡

⎣w1

w2

⎦,

vp =⎡

⎣σn1 cosϕ1

σn2 cosϕ2

⎦.

(53)

The singularity of ATpAp can be analyzed by computing

the determinant of Ap and setting this determinant to zero:

σn1σn2

2

[sinϕ2

(1− cosϕ1

)− sinϕ1(1− cosϕ2

)] = 0. (54)

Equation (54) is satisfied when σn1 and/or σn2 are equal tozero, ϕ1 and/or ϕ2 are equal to zero, or when

sinϕ1

1− cosϕ1= sinϕ2

1− cosϕ2≡ cot

(ϕ1

2

)= cot

(ϕ2

2

). (55)

Equation (55) is satisfied only when ϕ1 = ϕ2. This agreeswith the result that was obtained in Section 3.1, where Δϕ =0.

In all other cases (so when ϕ1 /=ϕ2, σn1 > 0 and σn2 > 0),the matrix Ap is nonsingular and the matrix AT

pAp is positivedefinite. Hence, the cost-function is a convex function witha global minimum that can be found by solving the least-squares problem:

wopt =(

ATpAp

)−1ATp vp

= A−1p vp

= 1A

⎣2 sin

(ϕ1 − ϕ2

)

cosϕ1 − cosϕ2

⎦,

(56)

with

A = sinϕ1 − sin(ϕ1 − ϕ2

)− sinϕ2, (57)

similar to the solutions as given in (37) and (38).As an example, we show the contour-plot of the cost-

function 10 log10 J(w1, w2) in Figure 6, for the case whereϕs = π/2, ϕn1 = 0, ϕn2 = π rad., σ2

ni = 1 for i = 1, 2, andσ2s = 0.

As can be seen, the global minimum is obtained for w1 =0 and w2 = 0, resulting in a dipole beampattern. When wechange σ2

n1 /= σ2n2

, the shape of the cost-function will be moreand more stretched, but the global optimum will be obtainedfor the same values of w1 and w2. In the extreme case whenσ2n2= 0 and σ2

n1> 0, we obtain the cost-function as shown

in Figure 7. (It is interesting to note that this cost-function isexactly the same as for the case whereϕs = π/2, ϕn1 = ϕn2 = 0radians with σ2

ni = 1 for i = 1, 2 and σ2s = 0.) Although

still w1 = 0 and w2 = 0 is an optimal solution, it can be

Page 28: Microphone Array Speech Processingdownloads.hindawi.com/journals/specialissues/547087.pdf · Editor-in-Chief Phillip Regalia, Institut National des Tel´ ecommunications, France´

8 EURASIP Journal on Advances in Signal Processing

−5−5

−5

−10

−15−10

0

00

0

05

55

5

5

5 5

5

−2 −1.5 −1 −0.5 0 0.5 1 1.5 2

w1

−2

−1.5

−1

−0.5

0

0.5

1

1.5

2

w2

Figure 6: Contour-plot of the cost-function 10 log10 J(w1, w2) forthe case where ϕs = π/2, ϕn1 = 0, and ϕn2 = π radians.

seen that there is no strict global minimum. For example,also w1 = −2 and w2 = 1 is an optimal solution (yielding acardioid beampattern).

For the situation where there is only a single interfereror the situation where there are two interferers coming from(nearly) the same angle, the resulting beampattern will havea null to this angle, while the other (second) null will beplaced randomly (i.e., the second null is not uniquely definedand the adaptation of this second null is poor). However insituations where we have additive diffuse-noise present, weobtain an extra degree of freedom, for example, optimizationof the directivity index. This is however outside the scope ofthis paper.

5.2. Cost-Function for Isotropic Noise. It is also useful toanalyze the cost-function in the presence of isotropic (i.e.,diffuse) noise. We know from [16] that spherical andcylindrical isotropic noise can be modelled by addinguncorrelated additive white-noise signals d1, d2, and d3 to thethree eigenbeams Em, E0

d, and Eπ/2d with variances σ2d , σ2

dγ, andσ2dγ, respectively, or alternatively with a covariance matrix Kd

given by

Kd = σ2d

⎢⎢⎢⎣

1 0 0

0 γ 0

0 0 γ

⎥⎥⎥⎦. (58)

(for diffuse noise situations, the individual elements arecorrelated. However, due the construction of eigenbeams,the diffuse noise will be decorrelated. Hence, it is allowedto add uncorrelated additive white-noise signals to theseeigenbeams to simulate diffuse-noise situations,) We chooseγ = 1/3 for spherically isotropic noise and γ = 1/2 forcylindrically isotropic noise.

5

−5

−5

−10

−10

−15

−15−5

−5

−10

−10

−15

−15−5

−5

−10

−10

−15

−15−5

−5

−10

−10

−15

−15

0

0

0

0

0

0

0

05

10

10

10

10

5

5

5

5

−2 −1.5 −1 −0.5 0 0.5 1 1.5 2

w1

−2

−1.5

−1

−0.5

0

0.5

1

1.5

2

w2

Figure 7: Contour-plot of the cost-function 10 log10 J(w1, w2) forthe case where ϕs = π/2 and ϕn1 = ϕn2 = 0 radians.

Assuming that there are no directional interferers,we obtain the following primary signal p[k] and noise-references r1[k] and r2[k] in the generalized sidelobe can-celler scheme:

p[k] = s[k] + αd1[k] + (1− α)d2[k]√γ,

r1[k] = 12d1[k]− 1

2d2[k]

√γ,

r2[k] = d3[k]√γ.

(59)

As di[k] with i = 1, 2, 3 and s[k] are mutually uncorre-lated, we can write the cost-function as

J(w1, w2) = σ2s [k] + σ2

d

[(12w1

)2

+ γ(

1 +12w1

)2

+ γw22

]

.

(60)

Just as for the cost-function with two directional interfer-ers, we can write the cost-function for isotropic noise also asa quadratic function in matrix notation:

Jd(w1, w2) = σ2s +

[∥∥Ad w − vd

∥∥2 +γ

1 + γ

]

, (61)

with

Ad =⎡

⎣σd2

√1 + γ 0

0 σd√γ

⎦,

vd =

⎢⎣

−σdγ√1 + γ

0

⎥⎦.

(62)

Page 29: Microphone Array Speech Processingdownloads.hindawi.com/journals/specialissues/547087.pdf · Editor-in-Chief Phillip Regalia, Institut National des Tel´ ecommunications, France´

EURASIP Journal on Advances in Signal Processing 9

It can be easily seen that Ad is positive definite and hencewe have a convex cost-function with a global minimum.Via (56) we can easily compute this minimum of the cost-function, which is obtained by solving the least-squaresproblem:

wopt =(

ATd Ad

)−1ATd vd

= A−1p vp

=

⎢⎣− 2γ

1 + γ0

⎥⎦.

(63)

5.3. Cost-Function for Directional Interferers and IsotropicNoise. In case we have directional interferers as well asisotropic noise and assume that all these noise-componentsare mutually uncorrelated, we can construct the cost-function based on addition of the two cost-functions:

Jp,d(w1, w2) = Jp(w1, w2) + Jd(w1, w2)

= σ2s +

∥∥∥Apw − vp

∥∥∥

2+∥∥Adw − vd

∥∥2 +

σ2dγ

1 + γ

= σ2s +

∥∥∥Ap,dw − vp,d

∥∥∥

2+

σ2dγ

1 + γ,

(64)

with:

Ap,d =⎡

⎣Ap

Ad

⎦,

vp,d =⎡

⎣vp

vd

⎦.

(65)

Since Jp(w1, w2) and Jd(w1, w2) were found to be convex,the sum Jp,d(w1, w2) is also convex. The optimal weights woptcan be obtained by computing

wopt =(

ATp,dAp,d

)−1ATp,dvp,d, (66)

which can be solved numerically via standard SVD tech-niques [15].

5.4. Gradient Search Algorithm. As we know that the cost-function is a convex function with a global minimum, wecan find this optimal solution by means of a steepest descentupdate equation for wi with i = 1, 2 by stepping in thedirection opposite to the surface J(w1, w2) with respect to wi,similar to [5]

wi[k + 1] = wi[k]− μ∇wi J(w1, w2), (67)

with a gradient given by

∇wi J(w1, w2) = ∂J(w1[k], w2[k])∂wi[k]

= ∂E{y2[k]

}

∂wi[k], (68)

and where μ is the update step-size. As in practice, theensemble average E{y2[k]} is not available, we have to use aninstantaneous estimate of the gradient ∇wi J(w1, w2), which iscomputed as

∇wi J(w1, w2) = dy2[k]dwi

= −2{p[k]− (w1 + 2α)r1[k]− w2r2[k]

}ri[k]

= −2y[k]ri[k].(69)

Hence, we can write the update equation as

wi[k + 1] = wi[k] + 2μy[k]ri[k]. (70)

Just as proposed in [5], we can apply a power-normalization such that the convergence speed is indepen-dent of the power:

wi[k + 1] = wi[k] +2μy[k]ri[k]

Pri[k] + ε, (71)

with ε being a small value to prevent zero division and wherethe power-estimate Pri[k] of the i′th reference signal ri[k] canbe computed by a recursive averaging:

Pri[k + 1] = βPri[k] +(1− β)r2

i [k], (72)

with β being a smoothing parameter (lower, but close to 1).The gradient search only needs to be performed in case

one or both of the directional interferers are present. Incase the desired speech is present during the adaptation,the gradient search will not behave robustly in practice.This nonrobust behaviour is caused by leakage of speechin the noise references r1 and r2 due to either variationsof the desired speaker location, microphone mismatchesor reverberation (multipath) effects. To avoid adaptationduring desired speech, we will apply a step-size control factorin the adaptation-rule, given by

Ψ[k] = Pr1 [k] + Pr2 [k]

Pr1 [k] + Pr2 [k] + Pp[k] + ε, (73)

where Pr1 [k] + Pr2 [k] is an estimate of the noise power andPp[k] is an estimate of the primary signal p[k] that contains

mainly desired speech. The power estimate Pp[k] is, just

as for the reference-signal powers Pr1 and Pr2 , obtained viarecursive averaging:

Pp[k + 1] = βPp[k] +(1− β)p2[k]. (74)

We can see that the value of Ψ[k] will be small when thedesired speech is dominating, whileΨ[k] will be much larger(but lower than 1) when either the directional interferers orspherically isotropic noise is dominating. As it is beneficialto have a low amount of noise components in the powerestimate Pp[k], we found that α = 0.25 is a good choice.

Page 30: Microphone Array Speech Processingdownloads.hindawi.com/journals/specialissues/547087.pdf · Editor-in-Chief Phillip Regalia, Institut National des Tel´ ecommunications, France´

10 EURASIP Journal on Advances in Signal Processing

Initialize w1[0] = 0, w2[0] = 0, Pr1 [0] = r21 [0], Pr2 [0] = r2

2 [0] and Pp[0] = p2[0]for k = 0,∞: do

Ψ[k] = Pr1 [k] + Pr2 [k]

Pr1 [k] + Pr2 [k] + Pp[k] + ε

y[k] = p[k]− (w1[k] + 2α)r1[k]− w2[k]r2[k]for i = 1, 2: do

wi[k + 1] = wi[k] +2μy[k]ri[k]

Pri[k] + εΨ[k]

X1 = (−1)i√

(w1[k]2 + 2)2(1 + w1[k] + w2[k]2)

X2 = 4 + 4w1[k] + w1[k]2 + 4w2[k]2

N = −2(w1[k]w2[k] + X1)X2

D = w1[k]3 + 4w1[k]2 + 4w1[k]− 4w2[k]X1

X2(w1[k] + 2)

ϑni = arctan(N

D

)− ϕs

if D < 0 thenϑni = ϑni − π sgn(N)

end ifPri [k + 1] = βPri [k] + (1− β)r2

i [k]Pp[k + 1] = βPp[k] + (1− β)p2[k]

end forend for

Algorithm 1: Optimal null-steering for two directional interferers.

The algorithm now looks as shown in Algorithm 1.As can be seen in the algorithm, the two weights w1[k]

and w2[k] are adapted based on a gradient-search method.Based on these two weights, a computation with arctan-function is performed to obtain the angles of the directional

interferers ϑni with i = 1, 2.

6. Validation

6.1. Directivity Pattern for Directional Interferers. First, weshow the beampatterns for a number of situations where twonulls are placed. In Table 1, we show the computed values forthe direct pattern synthesis for 4 different situations, wherenulls are placed at different angles. Furthermore, we assumethat there is no isotropic noise present.

As was explained in Section 3.1, we can obtain twodifferent sets of solutions for ϕs, α, and S. In Table 1, we showthe set of solutions where α is positive.

Similarly, in Table 2, we show the computed values for w1

and w2 in the GSC structure as explained in Section 4 for thesame situations as for the direct pattern synthesis.

The polar-plots resulting from the computed values inTables 1 and 2 are shown in Figure 8. It is noted that the twoexamples of Section 5.1 where we analyzed the cost-functionare depicted in Figures 8(b) and 8(d).

Table 1: Computed values of ϕs, α, and S for placing two nulls atϕn1 and ϕn2 and having a unity response at ϕs.

ϕn1 ϕn2 ϕs ϕs α S QS(deg) (deg) (deg) (deg)

45 180 90 292.5 0.277 1.141 0.61

0 180 90 90 0 1.0 3.0

0 225 90 112.5 0.277 1.058 3.56

0 0 90 0 0.5 2 0.75

Table 2: Computed values of w1 and w2 for placing two nulls at ϕn1

and ϕn2 and having a unity response at ϕs.

ϕn1 ϕn2 ϕs w1 w2 QS(deg) (deg) (deg)

45 180 90√

2 −12

√2 0.61

0 180 90 0 0 3.0

0 225 90−2

2 +√

2−1

2 +√

23.56

0 0 90 −2 −1 0.75

Page 31: Microphone Array Speech Processingdownloads.hindawi.com/journals/specialissues/547087.pdf · Editor-in-Chief Phillip Regalia, Institut National des Tel´ ecommunications, France´

EURASIP Journal on Advances in Signal Processing 11

0

30

6090

120

150

180

210

240270

300

330

1

2

3

(a)

0

30

6090

120

150

180

210

240270

300

330

0.2

0.4

0.6

0.8

1

(b)

0

30

6090

120

150

180

210

240270

300

330

0.5

1

1.5

(c)

0

30

6090

120

150

180

210

240270

300

330

0.5

1

1.5

2

(d)

Figure 8: Azimuthal polar-plots for the placement of two nulls with nulls placed at (a) 45 and 180 degrees, (b) 0 and 180 degrees, (c) 0 and225 degrees and (d) 0, and 0 degrees (two identical nulls).

0 1 2 3 4 5 6 7 8 9 10×103

k

−4

−3.5

−3

−2.5

−2

−1.5

−1

−0.5

0

0.5

1

ϑ n1

andϑ n

2

ϑn1

ϑn2

ϕni with i = 1, 2

Figure 9: Simulation of the null-steering algorithm with twodirectional interferers only where σ2

n1= σ2

n2= 1.

From the plots in Figure 8, it can be seen that if one ofthe two null-angles is close to the desired source angle (e.g.,in Figure 8(a)), the directivity index becomes worse. Becauseof this poor directivity index, the null-steering method asis proposed in this paper will only be useful when eitherazimuthal angle of the two directional interferers is not veryclose to the azimuthal angle of the desired source. When welimit the main-beam to be steered maximally 90 degrees awayfrom the desired direction, that is, |ϕs − ϕs| < π/2, we avoida poor directivity index. For example, in Figure 8(d) such asituation is shown where the main-beam is steered 90 degreesaway from the desired direction. In case the two directionalinterferers will change quickly from 0 to 180 degrees, theadaptive algorithm will automatically adapt and removesthese two directional interferers at 180 degrees. As only twoweights are used in the adaptive algorithm, the convergenceto the optimal weights will be very fast.

6.2. Gradient Search Algorithm. Next, we validate the track-ing behaviour of the gradient update algorithm, as proposedin Section 5.4. We perform a simulation, where we have adesired source at 90 degrees and where we linearly increasethe angle of a first undesired directional interferer (ranging

Page 32: Microphone Array Speech Processingdownloads.hindawi.com/journals/specialissues/547087.pdf · Editor-in-Chief Phillip Regalia, Institut National des Tel´ ecommunications, France´

12 EURASIP Journal on Advances in Signal Processing

0 1 2 3 4 5 6 7 8 9 10×103

k

−4

−3.5

−3

−2.5

−2

−1.5

−1

−0.5

0

0.5

1

ϑ n1

andϑ n

2

ϑn1

ϑn2

ϕni with i = 1, 2

(a)

0 1 2 3 4 5 6 7 8 9 10×103

k

−4

−3.5

−3

−2.5

−2

−1.5

−1

−0.5

0

0.5

1

ϑ n1

andϑ n

2

ϑn1

ϑn2

ϕni with i = 1, 2

(b)

Figure 10: Simulation of the null-steering algorithm with two directional interferers where σ2n1= σ2

n2= 1 and with a desired source where

σ2s = 1/16 with ϕs = 90 degrees (a) and ϕs = 60 degrees (b).

0 1 2 3 4 5 6 7 8 9 10×103

k

−4

−3.5

−3

−2.5

−2

−1.5

−1

−0.5

0

0.5

1

ϑ n1

andϑ n

2

ϑn1

ϑn2

ϕni with i = 1, 2

(a)

0 1 2 3 4 5 6 7 8 9 10×103

k

−4

−3.5

−3

−2.5

−2

−1.5

−1

−0.5

0

0.5

1

ϑ n1

andϑ n

2

ϑn1

ϑn2

ϕni with i = 1, 2

(b)

Figure 11: Simulation of the null-steering algorithm with two directional interferers where σ2n1= σ2

n2= 1 and with (spherically isotropic)

spherical isotropic noise (γ = 1/3), where σ2d = 1/16 (a) and σ2

d = 1/4 (b).

from 135 to 45 degrees) and we linearly decrease the angleof a second undesired directional interferer (ranging from 30degrees to−90 degrees) in a time-span of 10000 samples. Forthe simulation, we used α = 0.25, μ = 0.02, and β = 0.95.

First, we simulate the situation, where only two direc-

tional interferers are present. The two directional interferers

are uncorrelated white random-noise signals with variance

σ2ni = 1. The results are shown in Figure 9. It can be seen

that ϑn1 and ϑn2 do not cross (in contrast to the angles of the

directional interferers ϕn1 and ϕn2 ). The first null placed at ϑn1

adapts very well, while the second null, placed at ϑn2 , is poorlyadapted. The reason for this was explained in Section 5.1.

Similarly, we simulate the situation with the same twodirectional interferers but now together with a desired

Page 33: Microphone Array Speech Processingdownloads.hindawi.com/journals/specialissues/547087.pdf · Editor-in-Chief Phillip Regalia, Institut National des Tel´ ecommunications, France´

EURASIP Journal on Advances in Signal Processing 13

Figure 12: Microphone array with 3 outward facing cardioidmicrophones.

N3

S

M0

0 rad

π rad

N2

N1M2 M1

1 m

φ

π/2 rad3π/2 rad

Figure 13: Practical setup of the microphone array.

source-signal s[k]. The desired source is modelled as a white-noise signal, with a variance σ2

s = 1/16. The result is shown inFigure 10(a). We see that due to the adaptation-noise (causedby s[k]), there is more variance in the estimates of the angles

ϑn1 and ϑn2 . In contrast to the situation with two directional

interferers only, we see that there is a region where ϑn1 = ϑn2 .To show how the adaptation behaviour looks in presence

of variation in the desired source location, we do a similarsimulation as above, but now with ϕs set to 60 degrees, whilethe desired source is coming from 90 degrees. This meansthat there will be leakage of the desired source signal into thenoise reference signals r1[k] and r2[k]. The results are shownin Figure 10(b). Here, it can be seen that the adaptationshows a small offset if one of the directional source anglescomes close to the desired source angle. For example, at the

0 2 4 6 8 10 12 14 16

t (s)

−0.1

−0.05

0

0.05

0.1

(a) Cardioid to 0 degrees, that is, M0

0 2 4 6 8 10 12 14 16

t (s)

−0.1

−0.05

0

0.05

0.1

(b) Proposed adaptive null-steering algorithm

Figure 14: Results of the real-life experiment (waveform).

0 2 4 6 8 10 12 14 16

t (s)

0

1

2

3

4

5

6

ϑ ni

wit

hi=

1,2

ϑni with i = 1, 2ϕni with i = 1, 2

Figure 15: Results of the real-life experiment (angle estimates).

end of the simulation where k = 10000, this can be clearly

seen for ϑn1 .Finally, we simulate the situation of the same directional

interferers, but now in a spherical isotropic noise situation.As was explained in Section 5.2, isotropic noise can bemodelled by adding uncorrelated additive white-noise to thethree eigenbeams Em, E0

d, and Eπ/2d with variances σ2d , σ2

dγ,and σ2

dγ, respectively. Here γ = 1/3 for spherically isotropicnoise and γ = 1/2 for cylindrically isotropic noise. In oursimulation, we use γ = 1/3. The results are shown in Figures11(a) and 11(b) with variances σ2

d = 1/16 and σ2d = 1/4,

respectively. When the variance of the diffuse noise gets

Page 34: Microphone Array Speech Processingdownloads.hindawi.com/journals/specialissues/547087.pdf · Editor-in-Chief Phillip Regalia, Institut National des Tel´ ecommunications, France´

14 EURASIP Journal on Advances in Signal Processing

0

30

6090

120

150

180

210

240270

300

330

0.5

1

1.5

(a) t = 2.5 seconds

0

30

6090

120

150

180

210

240270

300

330

0.5

1

1.5

(b) t = 6 seconds

0

30

6090

120

150

180

210

240270

300

330

0.5

1

1.5

(c) t = 9.5 seconds

0

30

6090

120

150

180

210

240270

300

330

0.5

1

1.5

(d) t = 13 seconds

0

30

6090

120

150

180

210

240270

300

330

0.5

1

1.5

(e) t = 16.5 seconds

Figure 16: Polar-plot results of the real-life experiment.

larger compared to the directional interferers, the adaptationwill be influenced by the diffuse noise that is present. Thelarger the diffuse noise, the more the final beampatternwill resemble the hypercardioid. If diffuse noise would bedominant over the directional interferers, the estimates ϕn1

and ϕn2 will be equal to 90−109 degrees, and 90+109 degrees,respectively, (or −0.33 and −2.81 radians, resp.).

6.3. Real-Life Experiments. To validate the null-steeringalgorithm in real-life, we used a microphone array with3 outward facing cardioid electret microphones, as shownin Figure 12. As directional cardioid microphones haveopenings on both sides, the microphones are placed inrubber holders, enabling sound to enter both sides of thedirectional microphones.

The type of microphone elements used for this arrayis the Primo EM164 cardioid microphones [17]. Theseelements are placed uniformly on a circle with a radiusof 1 cm. This radius is sufficient for the construction ofeigenbeams up to a frequency of 4 KHz.

For the experiment, we placed the array on a table ina moderately reverberant room (conferencing-room) witha T60 of approximately 200 milliseconds. As shown in thesetup in Figure 13, all directional sources are placed at adistance of 1 meter from the array (at discrete azimuthalangles: φ = 0, π/2, π, and 3π/2 radians), while diffuse noise

was generated via four loudspeakers, placed close to the wallsand each facing diffusers hanging on the walls. The level ofthe diffuse noise is 12 dB lower compared to the directional(interfering) sources. The experiment is done in a time-spanof 17.5 seconds, where we switch the directional sources asshown in Table 3.

We use mutually uncorrelated white random-noisesequences for the directional sources N1, N2, and N3 playedby loudspeakers and use speech for the desired sound-sourceS.

For the algorithm, we use discrete-time signals with asample-rate of 8 KHz. Furthermore, we used α = 0.25, μ =0.001, and β = 0.95.

Figure 14(a) shows the waveform obtained from micro-phone #0 (M0), which is a cardioid pointed with its main-lobe to 0 radians. This waveform is compared with the result-ing waveform of the null-steering algorithm, and is shownin Figure 14(b). As the proposed null-steering algorithm isable to steer nulls toward the directional interferers, the directpart of the interferers is removed effectively (this can be seenby the lower noise-level in Figure 14(b) in the time-framefrom 0–10.5 seconds). In the segment from 10.5–14 seconds(where there is only a single directional interferer at φ = πradians), it can be seen that the null-steering algorithm isable to reject this interferer just as good as the single cardioidmicrophone.

Page 35: Microphone Array Speech Processingdownloads.hindawi.com/journals/specialissues/547087.pdf · Editor-in-Chief Phillip Regalia, Institut National des Tel´ ecommunications, France´

EURASIP Journal on Advances in Signal Processing 15

Table 3: Switching of sound-sources during the real-life experiment.

Source angle φ (rad) 0–3.5 (s) 3.5–7 (s) 7–10.5 (s) 10.5–14 (s) 14 s–17.5 (s)

N1 π/2 active — active — —

N2 π active active — active —

N3 3π/2 — active active — —

S 0 active active active active active

In Figure 15, the resulting angle-estimates from the null-steering algorithm are shown. Here, it can be seen that theangle-estimation for the first three segments of 3.5 secondsis done accurately. For the fourth segment, there is onlya single point interferer. In this segment, only a singleangle-estimation is stable, while the other angle-estimationis highly influenced by the diffuse noise. Finally, in thefifth segment, only diffuse noise is present and the finalbeampattern will optimize the directivity-index, leading to amore hypercardioid beampattern steered with its main-lobeto 0 degrees (as explained in Section 6.2).

Finally, in Figure 16, the resulting polar-patterns fromthe null-steering algorithm are shown for some discretetime-stamps. Again, it becomes clear that the null-steeringalgorithm is able to steer the nulls toward the angles wherethe interferers are coming from.

7. Conclusions

We analyzed the construction of a first-order superdirec-tional response in order to obtain a unity response for adesired azimuthal angle and to obtain a placement of twonulls to undesired azimuthal angles to suppress two direc-tional interferers. We derived a gradient search algorithm toadapt two weights in a generalized sidelobe canceller scheme.Furthermore, we analyzed the cost-function of this gradientsearch algorithm, which was found to be convex. Hencea global minimum is obtained in all cases. From the twoweights in the algorithm and using a four-quadrant inverse-tangent operation, it is possible to obtain estimates of theazimuthal angles where the two directional interferers arecoming from. Simulations and real-life experiments show agood performance in moderate reverberant situations.

Appendix

Proofs

Maximum Directivity Factor QS. We prove that for

QS(ϕ1,ϕ2

) = 6(1− cosϕ1

)(1− cosϕ2

)

5 + 3 cos(ϕ1 − ϕ2

) , (A.1)

with ϕ1,ϕ2 ∈ [0, 2π], a maximum QS = 4 is obtained forϕ1 = arccos (−1/3) and ϕ2 = 2π − arccos (−1/3).

Proof. First, we compute the numerator of the partialderivative ∂QS/∂ϕ1 and set this derivative to zero:

6(1− cosϕ1

)sinϕ1

[5 + 3 cos

(ϕ1 − ϕ2

)]

+ 6(1− cosϕ1

)(1− cosϕ2

)3 sin

(ϕ1 − ϕ2

) = 0.(A.2)

The common factor 6(1 − cosϕ1) can be removed, resultingin

sinϕ1(5 + 3 cos

(ϕ1 − ϕ2

))+ 3

(1− cosϕ1

)sin

(ϕ1 − ϕ2

) = 0.(A.3)

Similarly, setting the partial derivative ∂QS/∂ϕ2 equal tozero, we get

sinϕ2(5 + 3 cos

(ϕ2 − ϕ1

))+ 3

(1− cosϕ2

)sin

(ϕ2 − ϕ1

) = 0.(A.4)

Combining (A.3) and (A.4) gives

sinϕ1

1− cosϕ1= −3 sin

(ϕ1 − ϕ2

)

5 + 3 cos(ϕ1 − ϕ2

)

= 3 sin(ϕ2 − ϕ1

)

5 + 3 cos(ϕ2 − ϕ1

) = − sinϕ2

1− cosϕ2,

(A.5)

or alternatively

2 sin(ϕ1/2

)cos

(ϕ1/2

)

2 sin2(ϕ1/2) = cot

(ϕ1

2

)= −cot

(ϕ2

2

), (A.6)

with ϕ1,ϕ2 ∈ [0,π].From (A.6), we can see that ϕ1/2+ϕ2/2 = π (or ϕ1 +ϕ2 =

2π) and can derive

cosϕ2 = cos(2π − ϕ1

) = cosϕ1, (A.7)

sinϕ2 = sin(2π − ϕ1

) = − sinϕ1. (A.8)

Using (A.7) and (A.8) in (A.1) gives

QS =6(1− cosϕ1

)2

5 + 3(2 cosϕ1 − 1

) = 6(1− cosϕ1

)2

2 + 6 cos2ϕ1= 6(1− x)2

2 + 6x2,

(A.9)

with x = cosϕ1 ∈ [−1, 1].We can compute the optimal value for x by differentia-

tion of (A.9) and setting the result to zero:

− 12(1− x)(2 + 6x2)− 6(1− x)212x = 0

≡ −2− 6x2 − 6x + 6x2 = 0.(A.10)

Solving (A.10) gives x = cosϕ1 = −1/3 and conse-quently, ϕ1 = arccos (−1/3) and ϕ2 = 2π−arccos (−1/3). Via(A.9), we can see that for these values, we have QS = 4.

Page 36: Microphone Array Speech Processingdownloads.hindawi.com/journals/specialissues/547087.pdf · Editor-in-Chief Phillip Regalia, Institut National des Tel´ ecommunications, France´

16 EURASIP Journal on Advances in Signal Processing

Acknowledgment

The author likes to thank Dr. A. J. E. M. Janssen for hisvaluable suggestions.

References

[1] G. W. Elko, F. Pardo, D. Lopez, D. Bishop, and P. Gammel,“Surface-micromachined mems microphone,” in Proceedingsof the 115th AES Convention, p. 1–8, October 2003.

[2] P. L. Chu, “Superdirective microphone array for a set-topvideo conferencing system,” in Proceedings of the IEEE Inter-national Conference on Acoustics, Speech, and Signal Processing(ICASSP ’97), vol. 1, pp. 235–238, Munich, Germany, April1997.

[3] R. L. Pritchard, “Maximum directivity index of a linear pointarray,” Journal of the Acoustical Society of America, vol. 26, no.6, pp. 1034–1039, 1954.

[4] H. Cox, “Super-directivity revisited,” in Proceedings of the 21stIEEE Instrumentation and Measurement Technology Conference(IMTC ’04), vol. 2, pp. 877–880, May 2004.

[5] G. W. Elko and A. T. Nguyen Pong, “A simple first-orderdifferential microphone,” in Proceedings of the IEEE Workshopon Applications of Signal Processing to Audio and Acoustics(WASPAA ’95), pp. 169–172, New Paltz, NY, USA, October1995.

[6] G. W. Elko and A. T. Nguyen Pong, “A steerable and variablefirst-order differential microphone array,” in Proceedings ofthe IEEE International Conference on Acoustics, Speech, andSignal Processing (ICASSP ’97), vol. 1, pp. 223–226, Munich,Germany, April 1997.

[7] M. A. Poletti, “Unified theory of horizontal holographic soundsystems,” Journal of the Audio Engineering Society, vol. 48, no.12, pp. 1155–1182, 2000.

[8] H. Cox, R. M. Zeskind, and M. M. Owen, “Robust adaptivebeamforming,” IEEE Transactions on Acoustics, Speech, andSignal Processing, vol. 35, no. 10, pp. 1365–1376, 1987.

[9] R. M. M. Derkx and K. Janse, “Theoretical analysis of a first-order azimuth-steerable superdirective microphone array,”IEEE Transactions on Audio, Speech and Language Processing,vol. 17, no. 1, pp. 150–162, 2009.

[10] Y. Huang and J. Benesty, Audio Signal Processing for Next Gen-eration Multimedia Communication Systems, Kluwer AcademicPublishers, Dordrecht, The Netherlands, 1st edition, 2004.

[11] H. Teutsch, Modal Array Signal Processing: Principles andApplications of Acoustic Wavefield Decomposition, Springer,Berlin, Germany, 1st edition, 2007.

[12] L. J. Griffiths and C. W. Jim, “An alternative approach to lin-early constrained adaptive beamforming,” IEEE Transactionson Antennas and Propagation, vol. 30, no. 1, pp. 27–34, 1982.

[13] R. M. M. Derkx, “Optimal azimuthal steering of a first-order superdirectional microphone response,” in Proceedingsof the 11th International Workshop on Acoustic Echo and NoiseControl (IWAENC ’08), Seattle, Wash, USA, September 2008.

[14] J.-H. Lee and Y.-H. Lee, “Two-dimensional adaptive arraybeamforming with multiple beam constraints using a general-ized sidelobe canceller,” IEEE Transactions on Signal Processing,vol. 53, no. 9, pp. 3517–3529, 2005.

[15] W. Kaplan, Maxima and Minima with Applications: PracticalOptimization and Duality, John Wiley & Sons, New York, NY,USA, 1999.

[16] B. H. Maranda, “The statistical accuracy of an arctangentbearing estimator,” in Proceedings of the Oceans Conference

(OCEANS ’03), vol. 4, pp. 2127–2132, San Diego, Calif, USA,September 2003.

[17] R. M. M. Derkx, “Spatial harmonic analysis of unidirectionalmicrophones for use in superdirective beamformers,” inProceedings of the 36th International Conference: AutomotiveAudio, Dearborn, Mich, USA, June 2009.

Page 37: Microphone Array Speech Processingdownloads.hindawi.com/journals/specialissues/547087.pdf · Editor-in-Chief Phillip Regalia, Institut National des Tel´ ecommunications, France´

Hindawi Publishing CorporationEURASIP Journal on Advances in Signal ProcessingVolume 2010, Article ID 431347, 25 pagesdoi:10.1155/2010/431347

Research Article

Musical-Noise Analysis in Methods of Integrating MicrophoneArray and Spectral Subtraction Based on Higher-Order Statistics

Yu Takahashi,1 Hiroshi Saruwatari (EURASIP Member),1

Kiyohiro Shikano (EURASIP Member),1 and Kazunobu Kondo2

1 Graduate School of Information Science, Nara Institute of Science and Technology, Nara 630-0192, Japan2 SP Group, Center for Advanced Sound Technologies, Yamaha Corporation, Shizuoka 438-0192, Japan

Correspondence should be addressed to Yu Takahashi, [email protected]

Received 5 August 2009; Revised 3 November 2009; Accepted 16 March 2010

Academic Editor: Simon Doclo

Copyright © 2010 Yu Takahashi et al. This is an open access article distributed under the Creative Commons Attribution License,which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

We conduct an objective analysis on musical noise generated by two methods of integrating microphone array signal processing andspectral subtraction. To obtain better noise reduction, methods of integrating microphone array signal processing and nonlinearsignal processing have been researched. However, nonlinear signal processing often generates musical noise. Since such musicalnoise causes discomfort to users, it is desirable that musical noise is mitigated. Moreover, it has been recently reported that higher-order statistics are strongly related to the amount of musical noise generated. This implies that it is possible to optimize theintegration method from the viewpoint of not only noise reduction performance but also the amount of musical noise generated.Thus, we analyze the simplest methods of integration, that is, the delay-and-sum beamformer and spectral subtraction, and fullyclarify the features of musical noise generated by each method. As a result, it is clarified that a specific structure of integrationis preferable from the viewpoint of the amount of generated musical noise. The validity of the analysis is shown via a computersimulation and a subjective evaluation.

1. Introduction

There have recently been various studies on microphone array signal processing [1]; in particular, the delay-and-sum (DS) [2–4] array and the adaptive beamformer [5–7] are the most conventionally used microphone arrays for speech enhancement. Moreover, many methods of integrating microphone array signal processing and nonlinear signal processing such as spectral subtraction (SS) [8] have been studied with the aim of achieving better noise reduction [9–15]. It has been well demonstrated that such integration methods can achieve higher noise reduction performance than that obtained using conventional adaptive microphone arrays [13] such as the Griffith-Jim array [6]. However, a serious problem exists in such methods: artificial distortion (so-called musical noise [16]) due to nonlinear signal processing. Since the artificial distortion causes discomfort to users, it is desirable that musical noise is controlled through signal processing. However, in almost all nonlinear noise reduction methods, the strength parameter to mitigate musical noise in nonlinear signal processing is determined heuristically. Although there have been some studies on reducing musical noise [16] and on nonlinear signal processing with less musical noise [17], evaluations have mainly depended on subjective tests by humans, and no objective evaluations have been performed to the best of our knowledge.

In our recent study, it was reported that the amount of generated musical noise is strongly related to the difference between higher-order statistics (HOS) before and after nonlinear signal processing [18]. This fact makes it possible to analyze the amount of musical noise arising through nonlinear signal processing. Therefore, on the basis of HOS, we can establish a mathematical metric for the amount of musical noise generated in an objective manner. One of the authors has analyzed single-channel nonlinear signal processing based on the objective metric and clarified the features of the amount of musical noise generated [18, 19].


Figure 1: Block diagram of architecture for spectral subtraction after beamforming (BF+SS).

Figure 2: Block diagram of architecture for channelwise spectral subtraction before beamforming (chSS+BF).

In addition, this objective metric suggests the possibility that methods of integrating microphone array signal processing and nonlinear signal processing can be optimized from the viewpoint of not only noise reduction performance but also the sound quality according to human hearing. As a first step toward achieving this goal, in this study we analyze the simplest case of the integration of microphone array signal processing and nonlinear signal processing by considering the integration of DS and SS. As a result of the analysis, we clarify the musical-noise generation features of the two types of methods of integrating microphone array signal processing and SS.

Figure 1 shows a typical architecture used for the integration of microphone array signal processing and SS, where SS is performed after beamforming. Thus, we call this type of architecture BF+SS. Such a structure has been adopted in many integration methods [11, 15]. On the other hand, the integration architecture illustrated in Figure 2 is an alternative architecture used when SS is performed before beamforming. Such a structure is less commonly used, but some integration methods use this structure [12, 14]. In this architecture, channelwise SS is performed before beamforming, and we call this type of architecture chSS+BF.

We have already tried to analyze such methods of integrating DS and SS from the viewpoint of musical-noise generation on the basis of HOS [20]. However, in the analysis, we did not consider the effect of flooring in SS and the noise reduction performance. On the other hand, in this study we perform an exact analysis considering the effect of flooring in SS and the noise reduction performance. We analyze these two architectures on the basis of HOS and obtain the following results.

(i) The amount of musical noise generated strongly depends on not only the oversubtraction parameter of SS but also the statistical characteristics of the input signal.

(ii) Except for the specific condition that the input signal is Gaussian, the noise reduction performances of the two methods are not equivalent even if we set the same SS parameters.

(iii) Under equivalent noise reduction performance conditions, chSS+BF generates less musical noise than BF+SS for almost all practical cases.

The most important contribution of this paper is that these findings are mathematically proved. In particular, the amount of musical noise generated and the noise reduction performance resulting from the integration of microphone array signal processing and SS are analytically formulated on the basis of HOS. Although there have been many studies on optimization methods based on HOS [21], this is the first time they have been used for musical-noise assessment. The validity of the analysis based on HOS is demonstrated via a computer simulation and a subjective evaluation by humans.

The rest of the paper is organized as follows. In Section 2, the two methods of integrating microphone array signal processing and SS are described in detail. In Section 3, the metric based on HOS used for the amount of musical noise generated is described. Next, the musical-noise analysis of SS, microphone array signal processing, and their integration methods are discussed in Section 4. In Section 5, the noise reduction performances of the two integration methods are discussed, and both methods are compared under equivalent noise reduction performance conditions in Section 6.


Figure 3: Configuration of microphone array and signals.

Moreover, the result of a computer simulation and experimental results are given in Section 7. Following a discussion of the results of the experiments, we give our conclusions in Section 8.

2. Methods of Integrating Microphone Array Signal Processing and SS

In this section, the formulations of the two methods of integrating microphone array signal processing and SS are described. First, BF+SS, which is a typical method of integration, is formulated. Next, an alternative method of integration, chSS+BF, is introduced.

2.1. Sound-Mixing Model. In this study, a uniform linear microphone array is assumed, where the coordinates of the elements are denoted by d_j (j = 1, ..., J) (see Figure 3) and J is the number of microphones. We consider one target speech signal and an additive interference signal. Multiple mixed signals are observed at each microphone element, and the short-time analysis of the observed signals is conducted by a frame-by-frame discrete Fourier transform (DFT). The observed signals are given by

x(f, τ) = h(f) s(f, τ) + n(f, τ),   (1)

where x(f, τ) = [x_1(f, τ), ..., x_J(f, τ)]^T is the observed signal vector, h(f) = [h_1(f), ..., h_J(f)]^T is the transfer function vector, s(f, τ) is the target speech signal, and n(f, τ) = [n_1(f, τ), ..., n_J(f, τ)]^T is the noise signal vector.

2.2. SS after Beamforming. In BF+SS, the single-channel target-speech-enhanced signal is first obtained by beamforming, for example, by DS. Next, single-channel noise estimation is performed by a beamforming technique, for example, the null beamformer [22] or adaptive beamforming [1]. Finally, we extract the resultant target-speech-enhanced signal via SS. The full details of the signal processing are given below.

To enhance the target speech, DS is applied to the observed signal. This can be represented by

y_DS(f, τ) = g_DS(f, θ_U)^T x(f, τ),

g_DS(f, θ_U) = [g_1^(DS)(f, θ_U), ..., g_J^(DS)(f, θ_U)]^T,

g_j^(DS)(f, θ_U) = J^{-1} · exp( −i 2π (f/M) f_s d_j sin θ_U / c ),   (2)

where g_DS(f, θ_U) is the coefficient vector of the DS array and θ_U is the specific fixed look direction known in advance. Also, f_s is the sampling frequency, M is the DFT size, and c is the sound velocity. Finally, we obtain the target-speech-enhanced spectral amplitude based on SS. This procedure can be expressed as

|y_SS(f, τ)| =
  { sqrt( |y_DS(f, τ)|² − β · E_τ[ |n(f, τ)|² ] )   (where |y_DS(f, τ)|² − β · E_τ[ |n(f, τ)|² ] ≥ 0),
  { η · |y_DS(f, τ)|   (otherwise),   (3)

where this procedure is a type of extended SS [23]. Here, y_SS(f, τ) is the target-speech-enhanced signal, β is the oversubtraction parameter, η is the flooring parameter, and n(f, τ) is the estimated noise signal, which can generally be obtained by a beamforming technique such as fixed or adaptive beamforming. E_τ[·] denotes the expectation operator with respect to the time-frame index. For example, n(f, τ) can be expressed as [13]

n(f, τ) = λ(f) g_NBF^T(f) x(f, τ),   (4)

where g_NBF(f) is the filter coefficient vector of the null beamformer [22] that steers the null directivity to the speech direction θ_U, and λ(f) is the gain adjustment term, which is determined in a speech break period. Since the null beamformer can remove the speech signal by steering the null directivity to the speech direction, we can estimate the noise signal. Moreover, a method exists in which independent component analysis (ICA) is utilized as a noise estimator instead of the null beamformer [15].
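To make the BF+SS flow of (2)–(3) concrete, the following is a minimal Python/NumPy sketch, not the authors' implementation. The function names (ds_weights, spectral_subtraction, bf_plus_ss), the STFT array shapes, and the way the noise power E_τ[|n(f, τ)|²] is estimated (a simple long-term average of a single-channel noise reference, for example one obtained with a null beamformer as in (4)) are illustrative assumptions.

    import numpy as np

    def ds_weights(n_mics, freq_bin, dft_size, fs, mic_pos, theta_u, c=340.0):
        # Eq. (2): g_j = J^-1 exp(-i 2*pi (f/M) fs d_j sin(theta_U) / c)
        f_hz = (freq_bin / dft_size) * fs
        return np.exp(-1j * 2 * np.pi * f_hz * mic_pos * np.sin(theta_u) / c) / n_mics

    def spectral_subtraction(amp_in, noise_power_est, beta=2.0, eta=0.2):
        # Eq. (3): oversubtraction in the power domain with flooring
        diff = amp_in ** 2 - beta * noise_power_est
        return np.where(diff >= 0.0, np.sqrt(np.maximum(diff, 0.0)), eta * amp_in)

    def bf_plus_ss(x_stft, noise_est_stft, mic_pos, theta_u, dft_size, fs, beta=2.0, eta=0.2):
        # x_stft: (J, n_bins, n_frames) observed STFT; noise_est_stft: (n_bins, n_frames)
        # single-channel noise estimate, e.g. from a null beamformer as in eq. (4)
        n_mics, n_bins, _ = x_stft.shape
        y_out = np.empty(noise_est_stft.shape)
        for k in range(n_bins):
            g = ds_weights(n_mics, k, dft_size, fs, mic_pos, theta_u)
            y_ds = g @ x_stft[:, k, :]                             # eq. (2), DS output
            noise_pow = np.mean(np.abs(noise_est_stft[k]) ** 2)    # E_tau[|n(f,tau)|^2]
            y_out[k] = spectral_subtraction(np.abs(y_ds), noise_pow, beta, eta)
        return y_out   # enhanced amplitude spectrogram |y_SS(f, tau)|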

2.3. Channelwise SS before Beamforming. In chSS+BF, we first perform SS independently in each input channel and then we derive a multichannel target-speech-enhanced signal by channelwise SS. This can be expressed as


|y_j^(chSS)(f, τ)| =
  { sqrt( |x_j(f, τ)|² − β · E_τ[ |n_j(f, τ)|² ] )   (where |x_j(f, τ)|² − β · E_τ[ |n_j(f, τ)|² ] ≥ 0),
  { η · |x_j(f, τ)|   (otherwise),   (5)

where y_j^(chSS)(f, τ) is the target-speech-enhanced signal obtained by SS at a specific channel j and n_j(f, τ) is the estimated noise signal in the jth channel. For instance, the multichannel noise can be estimated by single-input multiple-output ICA (SIMO-ICA) [24] or a combination of ICA and the projection back method [25]. These techniques can provide the multichannel estimated noise signal, unlike traditional ICA. SIMO-ICA can separate mixed signals not into monaural source signals but into SIMO-model signals at the microphone. Here SIMO denotes the specific transmission system in which the input signal is a single source signal and the outputs are its transmitted signals observed at multiple microphones. Thus, the output signals of SIMO-ICA maintain the rich spatial qualities of the sound sources [24]. Also, the projection back method provides SIMO-model-separated signals using the inverse of an optimized ICA filter [25].

Finally, we extract the target-speech-enhanced signal by applying DS to y_chSS(f, τ) = [y_1^(chSS)(f, τ), ..., y_J^(chSS)(f, τ)]^T. This procedure can be expressed by

y(f, τ) = g_DS^T(f, θ_U) y_chSS(f, τ),   (6)

where y(f, τ) is the final output of chSS+BF. Such a chSS+BF structure performs DS after (multichannel) SS. Since DS is basically signal processing in which the summation of the multichannel signal is taken, it can be considered that interchannel smoothing is applied to the multichannel spectral-subtracted signal. On the other hand, the resultant output signal of BF+SS remains as it is after SS. That is to say, it is expected that the output signal of chSS+BF is more natural (contains less musical noise) than that of BF+SS. In the following sections, we reveal that chSS+BF can output a signal with less musical noise than BF+SS in almost all cases on the basis of HOS.
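For comparison with the BF+SS sketch above, the following is a minimal sketch of the chSS+BF flow (5)–(6). It is again illustrative: the channelwise noise estimates n_j(f, τ) are taken as given (for example from SIMO-ICA or the projection back method), the phase handling (keeping the observed phase) is an assumption of the sketch, and ds_weight_fn stands for any routine returning the DS coefficients of (2) for a frequency bin.

    import numpy as np

    def chss_plus_bf(x_stft, noise_est_stft, ds_weight_fn, beta=2.0, eta=0.2):
        # x_stft, noise_est_stft: complex arrays of shape (J, n_bins, n_frames);
        # noise_est_stft holds the channelwise noise estimates n_j(f, tau).
        n_mics, n_bins, n_frames = x_stft.shape
        y = np.empty((n_bins, n_frames), dtype=complex)
        for k in range(n_bins):
            # Eq. (5): channelwise SS on the amplitude, keeping the observed phase
            noise_pow = np.mean(np.abs(noise_est_stft[:, k, :]) ** 2, axis=1, keepdims=True)
            diff = np.abs(x_stft[:, k, :]) ** 2 - beta * noise_pow
            amp = np.where(diff >= 0.0, np.sqrt(np.maximum(diff, 0.0)),
                           eta * np.abs(x_stft[:, k, :]))
            y_chss = amp * np.exp(1j * np.angle(x_stft[:, k, :]))
            # Eq. (6): delay-and-sum over the channelwise-subtracted signals
            y[k, :] = ds_weight_fn(k) @ y_chss
        return y   # final chSS+BF output y(f, tau)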

3. Kurtosis-Based Musical-Noise Generation Metric

3.1. Introduction. It has been reported by the authors that the amount of musical noise generated is strongly related to the difference between the kurtosis of a signal before and after signal processing. Thus, in this paper, we analyze the amount of musical noise generated through BF+SS and chSS+BF on the basis of the change in the measured kurtosis. Hereinafter, we give details of the kurtosis-based musical-noise metric.

3.2. Relation between Musical-Noise Generation and Kurtosis. In our previous works [18–20], we defined musical noise as the audible isolated spectral components generated through signal processing. Figure 4(b) shows an example of a spectrogram of musical noise in which many isolated components can be observed. We speculate that the amount of musical noise is strongly related to the number of such isolated components and their level of isolation.

Hence, we introduce kurtosis to quantify the isolated spectral components, and we focus on the changes in kurtosis. Since isolated spectral components are dominant, they are heard as tonal sounds, which results in our perception of musical noise. Therefore, it is expected that obtaining the number of tonal components will enable us to quantify the amount of musical noise. However, such a measurement is extremely complicated; so instead we introduce a simple statistical estimate, that is, kurtosis.

This strategy allows us to obtain the characteristics of tonal components. The adopted kurtosis can be used to evaluate the width of the probability density function (p.d.f.) and the weight of its tails; that is, kurtosis can be used to evaluate the percentage of tonal components among the total components. A larger value indicates a signal with a heavy tail in its p.d.f., meaning that it has a large number of tonal components. Also, kurtosis has the advantageous property that it can be easily calculated in a concise algebraic form.

3.3. Kurtosis. Kurtosis is one of the most commonly used HOS for the assessment of non-Gaussianity. Kurtosis is defined as

kurt_x = μ_4 / μ_2²,   (7)

where x is a random variable, kurt_x is the kurtosis of x, and μ_n is the nth-order moment of x. Here μ_n is defined as

μ_n = ∫_{−∞}^{+∞} x^n P(x) dx,   (8)

where P(x) denotes the p.d.f. of x. Note that this μ_n is not a central moment but a raw moment. Thus, (7) is not kurtosis according to the mathematically strict definition, but a modified version; however, we refer to (7) as kurtosis in this paper.
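As a small illustration, the raw-moment kurtosis of (7)–(8) can be estimated from samples as follows; the sample-based evaluation and the Gaussian test signal are assumptions of the sketch, not part of the paper.

    import numpy as np

    def raw_kurtosis(x):
        # Eq. (7) with raw (non-central) moments of eq. (8): mu4 / mu2^2
        x = np.asarray(x, dtype=float)
        return np.mean(x ** 4) / np.mean(x ** 2) ** 2

    # For the power spectrum of a complex Gaussian signal this estimate is close to 6
    rng = np.random.default_rng(0)
    gauss = rng.normal(size=1_000_000) + 1j * rng.normal(size=1_000_000)
    print(raw_kurtosis(np.abs(gauss) ** 2))   # roughly 6.0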

3.4. Kurtosis Ratio. Although we can measure the number of tonal components by kurtosis, it is worth mentioning that kurtosis itself is not sufficient to measure musical noise. This is because the kurtosis of some unprocessed signals such as speech signals is also high, but we do not perceive speech as musical noise. Since we aim to count only the musical-noise components, we should not consider genuine tonal components. To achieve this aim, we focus on the fact that musical noise is generated only in artificial signal processing. Hence, we should consider the change in kurtosis during signal processing. Consequently, we introduce the following kurtosis ratio [18] to measure the kurtosis change:

kurtosis ratio = kurt_proc / kurt_input,   (9)


Figure 4: (a) Observed spectrogram and (b) processed spectrogram.

where kurt_proc is the kurtosis of the processed signal and kurt_input is the kurtosis of the input signal. A kurtosis ratio well above 1 indicates a marked increase in kurtosis as a result of processing, implying that a larger amount of musical noise is generated. On the other hand, a kurtosis ratio close to 1 implies that less musical noise is generated. It has been confirmed that this kurtosis ratio closely matches the amount of musical noise in a subjective evaluation based on human hearing [18].
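A sketch of the kurtosis ratio of (9), computed on power-domain noise spectra before and after processing; the framing of the inputs as power-spectrum arrays is an assumption of the sketch, and any consistent estimate of (7) can be plugged in.

    import numpy as np

    def raw_kurtosis(power):
        # Eq. (7)/(8) with raw moments
        return np.mean(power ** 4) / np.mean(power ** 2) ** 2

    def kurtosis_ratio(power_proc, power_input):
        # Eq. (9): kurt_proc / kurt_input; values well above 1 suggest musical noise
        return raw_kurtosis(power_proc) / raw_kurtosis(power_input)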

4. Kurtosis-Based Musical-Noise Analysis for Microphone Array Signal Processing and SS

4.1. Analysis Flow. In the following sections, we carry out an analysis on musical-noise generation in BF+SS and chSS+BF based on kurtosis. The analysis is composed of the following three parts.

(i) First, an analysis on musical-noise generation in BF+SS and chSS+BF based on kurtosis that does not take noise reduction performance into account is performed in this section.

(ii) The noise reduction performance is analyzed in Section 5, and we reveal that the noise reduction performances of BF+SS and chSS+BF are not equivalent. Moreover, a flooring parameter designed to align the noise reduction performances of BF+SS and chSS+BF is also derived to ensure the fair comparison of BF+SS and chSS+BF.

(iii) The kurtosis-based comparison between BF+SS and chSS+BF under the same noise reduction performance conditions is carried out in Section 6.

In the analysis in this section, we first clarify how kurtosis is affected by SS. Next, the same analysis is applied to DS. Finally, we analyze how kurtosis is increased by BF+SS and chSS+BF. Note that our analysis contains no limiting assumptions on the statistical characteristics of noise; thus, all noises including Gaussian and super-Gaussian noise can be considered.

4.2. Signal Model Used for Analysis. Musical-noise components generated from the noise-only period are dominant in spectrograms (see Figure 4); hence, we mainly focus our attention on musical-noise components originating from input noise signals.

Moreover, to evaluate the resultant kurtosis of SS, we introduce a gamma distribution to model the noise in the power domain [26–28]. The p.d.f. of the gamma distribution for random variable x is defined as

P_GM(x) = (1/(Γ(α) θ^α)) · x^{α−1} exp{−x/θ},   (10)

where x ≥ 0, α > 0, and θ > 0. Here, α denotes the shape parameter, θ is the scale parameter, and Γ(·) is the gamma function. The gamma distribution with α = 1 corresponds to the chi-square distribution with two degrees of freedom. Moreover, it is well known that the mean of x for a gamma distribution is E[x] = αθ, where E[·] is the expectation operator. Furthermore, the kurtosis of a gamma distribution, kurt_GM, can be expressed as [18]

kurt_GM = (α + 2)(α + 3) / (α(α + 1)).   (11)

Moreover, let us consider the power-domain noise signal, x_p, in the frequency domain, which is defined as

x_p = |x_re + i·x_im|² = (x_re + i·x_im)(x_re + i·x_im)* = x_re² + x_im²,   (12)

where x_re is the real part of the complex-valued signal and x_im is its imaginary part, which are independent and identically distributed (i.i.d.) with each other, and the superscript * expresses complex conjugation. Thus, the power-domain signal is the sum of two squares of random variables with the same distribution.

Hereinafter, let x_re and x_im be the signals after DFT analysis of the signal at a specific microphone j, x_j, and we suppose that the statistical properties of x_j are equal to those of x_re and x_im. Moreover, we assume the following: x_j is i.i.d. in each channel, the p.d.f. of x_j is symmetrical, and its mean is zero. These assumptions mean that the odd-order cumulants and moments are zero except for the first order.
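A quick numerical check of this power-domain signal model is sketched below (the choice of shape parameters and sample size is arbitrary): gamma-distributed power samples are drawn and their raw-moment kurtosis is compared against the closed form (11).

    import numpy as np

    def kurt_gamma(alpha):
        # Eq. (11): kurtosis of a gamma-modeled power-domain signal
        return (alpha + 2) * (alpha + 3) / (alpha * (alpha + 1))

    rng = np.random.default_rng(0)
    for alpha in (1.0, 0.5, 0.2):      # alpha = 1 corresponds to Gaussian noise (kurtosis 6)
        p = rng.gamma(alpha, 1.0, size=2_000_000)
        empirical = np.mean(p ** 4) / np.mean(p ** 2) ** 2
        print(alpha, kurt_gamma(alpha), round(empirical, 2))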


Figure 5: Deformation of original p.d.f. of power-domain signal via SS (the p.d.f. is laterally shifted toward zero power by the subtraction, negative components with nonzero probability arise, the corresponding region is compressed by the small positive flooring parameter η, and the floored components are merged with the remaining positive components).

Although kurt_x = 3 if x is a Gaussian signal, note that the kurtosis of a Gaussian signal in the power spectral domain is 6. This is because a Gaussian signal in the time domain obeys the chi-square distribution with two degrees of freedom in the power spectral domain; for such a chi-square distribution, μ_4/μ_2² = 6.

4.3. Resultant Kurtosis after SS. In this section, we analyze the kurtosis after SS. In traditional SS, the long-term-averaged power spectrum of a noise signal is utilized as the estimated noise power spectrum. Then, the estimated noise power spectrum multiplied by the oversubtraction parameter β is subtracted from the observed power spectrum. When a gamma distribution is used to model the noise signal, its mean is αθ. Thus, the amount of subtraction is βαθ. The subtraction of the estimated noise power spectrum in each frequency band can be considered as a shift of the p.d.f. to the zero-power direction (see Figure 5). As a result, negative-power components with nonzero probability arise. To avoid this, such negative components are replaced by observations that are multiplied by a small positive value η (the so-called flooring technique). This means that the region corresponding to the probability of the negative components, which forms a section cut from the original gamma distribution, is compressed by the effect of the flooring. Finally, the floored components are superimposed on the laterally shifted p.d.f. (see Figure 5). Thus, the resultant p.d.f. after SS, P_SS(z), can be written as

P_SS(z) =
  { (1/(θ^α Γ(α))) (z + βαθ)^{α−1} exp{ −(z + βαθ)/θ }   (z ≥ βαη²θ),
  { (1/(θ^α Γ(α))) (z + βαθ)^{α−1} exp{ −(z + βαθ)/θ } + (1/((η²θ)^α Γ(α))) z^{α−1} exp{ −z/(η²θ) }   (0 < z < βαη²θ),   (13)

where z is the random variable of the p.d.f. after SS. The derivation of P_SS(z) is described in Appendix A.

From (13), the kurtosis after SS can be expressed as

kurt_SS = Γ(α) F(α, β, η) / G²(α, β, η),   (14)

where

G(α, β, η) = Γ(βα, α+2) − 2βα Γ(βα, α+1) + β²α² Γ(βα, α) + η⁴ γ(βα, α+2),

F(α, β, η) = Γ(βα, α+4) − 4βα Γ(βα, α+3) + 6β²α² Γ(βα, α+2) − 4β³α³ Γ(βα, α+1) + β⁴α⁴ Γ(βα, α) + η⁸ γ(βα, α+4).   (15)

Here, Γ(b, a) is the upper incomplete gamma function defined as

Γ(b, a) = ∫_b^∞ t^{a−1} exp{−t} dt,   (16)

and γ(b, a) is the lower incomplete gamma function defined as

γ(b, a) = ∫_0^b t^{a−1} exp{−t} dt.   (17)

The detailed derivation of (14) is given in Appendix B. Although Uemura et al. have given an approximated form (lower bound) of the kurtosis after SS in [18], (14) involves no approximation throughout its derivation. Furthermore, (14) takes into account the effect of the flooring technique, unlike [18].
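The closed form (14)–(17) is straightforward to evaluate numerically. The sketch below does so with SciPy (an assumption about tooling, not part of the paper); SciPy's incomplete gamma functions are regularized, so they are multiplied by Γ(a) to match the unnormalized definitions (16)–(17). As a sanity check, with β = 0 and η = 0 the expression reduces to kurt_GM of (11), that is, 6 for α = 1.

    import numpy as np
    from scipy.special import gamma as Gam, gammainc, gammaincc

    def upper_inc(b, a):
        # Eq. (16): unnormalized upper incomplete gamma, integral from b to infinity
        return gammaincc(a, b) * Gam(a)

    def lower_inc(b, a):
        # Eq. (17): unnormalized lower incomplete gamma, integral from 0 to b
        return gammainc(a, b) * Gam(a)

    def kurt_ss(alpha, beta, eta):
        # Eqs. (14)-(15): kurtosis of the power-domain signal after SS
        ba = beta * alpha
        G = (upper_inc(ba, alpha + 2) - 2 * ba * upper_inc(ba, alpha + 1)
             + ba**2 * upper_inc(ba, alpha) + eta**4 * lower_inc(ba, alpha + 2))
        F = (upper_inc(ba, alpha + 4) - 4 * ba * upper_inc(ba, alpha + 3)
             + 6 * ba**2 * upper_inc(ba, alpha + 2) - 4 * ba**3 * upper_inc(ba, alpha + 1)
             + ba**4 * upper_inc(ba, alpha) + eta**8 * lower_inc(ba, alpha + 4))
        return Gam(alpha) * F / G**2

    print(kurt_ss(1.0, 0.0, 0.0))   # ~6.0, i.e. kurt_GM for alpha = 1
    print(kurt_ss(1.0, 2.0, 0.0))   # kurtosis increase caused by oversubtraction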

Figure 6(a) depicts the theoretical kurtosis ratio after SS, kurt_SS/kurt_GM, for various values of the oversubtraction parameter β and flooring parameter η. In the figure, the kurtosis of the input signal is fixed to 6.0, which corresponds to a Gaussian signal.


Figure 6: (a) Theoretical kurtosis ratio after SS for various values of oversubtraction parameter β and flooring parameter η. In this figure, kurtosis of input signal is fixed to 6.0. (b) Theoretical kurtosis ratio after SS for various values of input kurtosis. In this figure, flooring parameter η is fixed to 0.0.

From this figure, it is confirmed that the kurtosis ratio is basically proportional to the oversubtraction parameter β. However, kurtosis does not monotonically increase when the flooring parameter is nonzero. For instance, the kurtosis ratio is smaller than the peak value when β = 4 and η = 0.4. This phenomenon can be explained as follows. For a large oversubtraction parameter, almost all the spectral components become negative due to the larger lateral shift of the p.d.f. by SS. Since flooring is applied to avoid such negative components, almost all the components are reconstructed by flooring. Therefore, the statistical characteristics of the signal never change except for its amplitude if η ≠ 0. Generally, kurtosis does not depend on the change in amplitude; consequently, it can be considered that kurtosis does not markedly increase when a larger oversubtraction parameter and a larger flooring parameter are set.

The relation between the theoretical kurtosis ratio and the kurtosis of the original input signal is shown in Figure 6(b). In the figure, η is fixed to 0.0. It is revealed that the kurtosis ratio after SS rapidly decreases as the input kurtosis increases, even with the same oversubtraction parameter β. Therefore, the kurtosis ratio after SS, which is related to the amount of musical noise, strongly depends on the statistical characteristics of the input signal. That is to say, SS generates a larger amount of musical noise for a Gaussian input signal than for a super-Gaussian input signal. This fact has been reported in [18].

4.4. Resultant Kurtosis after DS. In this section, we analyze the kurtosis after DS, and we reveal that DS can reduce the kurtosis of input signals. Since we assume that the statistical properties of x_re or x_im are the same as that of x_j, the effect of DS on the change in kurtosis can be derived from the cumulants and moments of x_j.

For cumulants, when X and Y are independent random variables, it is well known that the following relation holds:

cum_n(aX + bY) = a^n cum_n(X) + b^n cum_n(Y),   (18)

where cum_n(·) denotes the nth-order cumulant. The cumulants of the random variable X, cum_n(X), are defined by a cumulant-generating function, which is the logarithm of the moment-generating function. The cumulant-generating function C(ζ) is defined as

C(ζ) = log( E[exp{ζX}] ) = Σ_{n=1}^{∞} cum_n(X) ζ^n / n!,   (19)

where ζ is an auxiliary variable and E[exp{ζX}] is the moment-generating function. Thus, the nth-order cumulant cum_n(X) is represented by

cum_n(X) = C^{(n)}(0),   (20)

where C^{(n)}(ζ) is the nth-order derivative of C(ζ). Now we consider the DS beamformer, which is steered to θ_U = 0 and whose array weights are 1/J. Using (18), the resultant nth-order cumulant after DS, K_n^(DS) = cum_n(y_DS), can be expressed by

K_n^(DS) = (1/J^{n−1}) K_n,   (21)

where K_n = cum_n(x_j) is the nth-order cumulant of x_j. Therefore, using (21) and the well-known mathematical relation between cumulants and moments, the power-spectral-domain kurtosis after DS, kurt_DS, can be expressed by

kurt_DS = [ K_8^(DS) + 38 (K_4^(DS))² + 32 K_2^(DS) K_6^(DS) + 288 (K_2^(DS))² K_4^(DS) + 192 (K_2^(DS))⁴ ] / [ 2 (K_4^(DS))² + 16 (K_2^(DS))² K_4^(DS) + 32 (K_2^(DS))⁴ ].   (22)

The detailed derivation of (22) is described in Appendix C.
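As an illustration of (21)–(22), the sketch below evaluates the post-DS power-spectral-domain kurtosis for a hypothetical case in which each x_j is Laplace distributed, whose even-order cumulants have the closed form cum_{2k} = (2k)! b^{2k}/k. The Laplace model and the helper names are purely illustrative assumptions, chosen only because the cumulants are available analytically.

    import math

    def kurt_ds_from_cumulants(K2, K4, K6, K8, J):
        # Eq. (21): cumulants of the DS output (weights 1/J) scale as K_n / J^(n-1)
        c2, c4, c6, c8 = K2 / J, K4 / J**3, K6 / J**5, K8 / J**7
        # Eq. (22): power-spectral-domain kurtosis after DS
        num = c8 + 38 * c4**2 + 32 * c2 * c6 + 288 * c2**2 * c4 + 192 * c2**4
        den = 2 * c4**2 + 16 * c2**2 * c4 + 32 * c2**4
        return num / den

    # Hypothetical example: x_j ~ Laplace(0, b), cum_{2k} = (2k)! b^(2k) / k
    b = 1.0
    K2, K4, K6, K8 = (math.factorial(2 * k) * b ** (2 * k) / k for k in (1, 2, 3, 4))
    for J in (1, 2, 4, 8):
        print(J, round(kurt_ds_from_cumulants(K2, K4, K6, K8, J), 2))
    # J = 1 gives ~30.5; the value decays toward 6 (the Gaussian case) as J grows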


Figure 7: Relation between input kurtosis and output kurtosis after DS for (a) 1-, (b) 2-, (c) 4-, and (d) 8-microphone cases. Solid lines indicate simulation results, broken lines express theoretical plots obtained by (22), and dotted lines show approximate results obtained by (23).

Regarding the power-spectral components obtained from a gamma distribution, we illustrate the relation between input kurtosis and output kurtosis after DS in Figure 7. In the figure, solid lines indicate simulation results and broken lines show theoretical relations given by (22). The simulation results are derived as follows. First, multichannel signals with various values of kurtosis are generated artificially from a gamma distribution. Next, DS is applied to the generated signals. Finally, kurtosis after DS is estimated from the signal resulting from DS. From this figure, it is confirmed that the theoretical plots closely fit the simulation results. The relation between input/output kurtosis behaves as follows: (i) the output kurtosis is very close to a linear function of the input kurtosis, and (ii) the output kurtosis is almost inversely proportional to the number of microphones. These behaviors result in the following simplified (but useful) approximation with an explicit function form:

kurt_DS ≃ J^{−0.7} · (kurt_in − 6) + 6,   (23)

where kurt_in is the input kurtosis. The approximated plots also match the simulation results in Figure 7.
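A rough Monte-Carlo sketch of this relation follows. It assumes one particular way of realizing channel spectra whose power-domain samples are gamma distributed (amplitude taken as the square root of a gamma sample, with independent uniform phase per channel); this construction, the sample size, and the shape parameter are assumptions of the sketch and not the paper's exact simulation setup, so only a rough agreement with (23) should be expected.

    import numpy as np

    rng = np.random.default_rng(0)

    def power_kurtosis(p):
        return np.mean(p ** 4) / np.mean(p ** 2) ** 2            # eq. (7), raw moments

    def kurt_after_ds_mc(alpha, J, n=500_000):
        power = rng.gamma(alpha, 1.0, size=(J, n))               # channelwise power samples
        phase = rng.uniform(0.0, 2 * np.pi, size=(J, n))
        spectra = np.sqrt(power) * np.exp(1j * phase)            # complex channel spectra
        y_ds = spectra.mean(axis=0)                              # DS with weights 1/J
        return power_kurtosis(np.abs(y_ds) ** 2)

    def kurt_ds_approx(kurt_in, J):
        return J ** -0.7 * (kurt_in - 6.0) + 6.0                 # eq. (23)

    alpha = 0.3                                                  # super-Gaussian input
    kurt_in = (alpha + 2) * (alpha + 3) / (alpha * (alpha + 1))  # eq. (11)
    for J in (1, 2, 4, 8):
        print(J, round(kurt_after_ds_mc(alpha, J), 2), round(kurt_ds_approx(kurt_in, J), 2))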

When input signals involve interchannel correlation, the relation between input kurtosis and output kurtosis after DS approaches that for only one microphone. If all input signals are identical signals, that is, the signals are completely correlated, the output after DS also becomes the same as the input signal. In such a case, the effect of DS on the change in kurtosis corresponds to that for only one microphone. However, the interchannel correlation is not equal to one within all frequency subbands for a diffuse noise field, which is a typically considered noise field.


Figure 8: Simulation result for noise with interchannel correlation (solid line) and theoretical effect of DS assuming no interchannel correlation (broken line) in each frequency subband: (a) 1000 Hz and (b) 8000 Hz.

Figure 9: Simulation result for noise with interchannel correlation (solid line), theoretical effect of DS assuming no interchannel correlation (broken line), and kurtosis of the observed signal without any signal processing (dotted line) in the eight-microphone case.

It is well known that the intensity of the interchannel correlation is strong in lower-frequency subbands and weak in higher-frequency subbands for the diffuse noise field [1]. Therefore, in lower-frequency subbands, it can be expected that DS does not significantly reduce the kurtosis of the signal.

As it is well known that the interchannel correlation for a diffuse noise field between two measurement locations can be expressed by the sinc function [1], we can state how array signal processing is affected by the interchannel correlation. However, we cannot know exactly how cumulants are changed by the interchannel correlation because (18) only holds when signals are mutually independent. Therefore, we cannot formulate how kurtosis is changed via DS for signals with interchannel correlation. For this reason, we experimentally investigate the effect of interchannel correlation in the following.

Figures 8 and 9 show preliminary simulation results of DS. In this simulation, SS is first applied to a multichannel Gaussian signal with interchannel correlation in the diffuse noise field. Next, DS is applied to the signal after SS. In the preliminary simulation, the interelement distance between microphones is 2.15 cm. From the results shown in Figures 8(a) and 9, we can confirm that the effect of DS on kurtosis is weak in lower-frequency subbands, although it should be noted that the effect does not completely disappear. Also, the theoretical kurtosis curve is in good agreement with the actual results in higher-frequency subbands (see Figures 8(b) and 9). This is because the interchannel correlation is weak in higher-frequency subbands. Consequently, for a diffuse noise field, DS can reduce the kurtosis of the input signal even if interchannel correlation exists.

If input noise signals contain no interchannel correlation, the distance between microphones does not affect the results. That is to say, the kurtosis change via DS can be well fit to (23). Otherwise, in lower-frequency subbands, it is expected that the mitigation effect of kurtosis by DS degrades with decreasing distance between microphones. This is because the interchannel correlation in lower-frequency subbands increases with decreasing distance between microphones. In higher-frequency subbands, the effect of the distance between microphones is thought to be small.

4.5. Resultant Kurtosis: BF+SS versus chSS+BF. In the previous subsections, we discussed the resultant kurtosis after SS and DS. In this subsection, we analyze the resultant kurtosis for two types of composite systems, that is, BF+SS and chSS+BF, and compare their effect on musical-noise generation. As described in Section 3, it is expected that a smaller increase in kurtosis leads to a smaller amount of musical noise generated.

In BF+SS, DS is first applied to a multichannel input signal. At this point, the resultant kurtosis in the power spectral domain, kurt_DS, can be represented by (23).


Using (11), we can derive a shape parameter for the gamma distribution corresponding to kurt_DS, α̃, as

α̃ = ( sqrt( kurt_DS² + 14 kurt_DS + 1 ) − kurt_DS + 5 ) / ( 2 kurt_DS − 2 ).   (24)

The derivation of (24) is shown in Appendix D. Consequently, using (14) and (24), the resultant kurtosis after BF+SS, kurt_BF+SS, can be written as

kurt_BF+SS = Γ(α̃) F(α̃, β, η) / G²(α̃, β, η).   (25)

In chSS+BF, SS is first applied to each input channel. Thus, the output kurtosis after channelwise SS, kurt_chSS, is given by

kurt_chSS = Γ(α) F(α, β, η) / G²(α, β, η).   (26)

Finally, DS is performed and the resultant kurtosis after chSS+BF, kurt_chSS+BF, can be written as

kurt_chSS+BF = J^{−0.7} [ Γ(α) F(α, β, η) / G²(α, β, η) − 6 ] + 6,   (27)

where we use (23). We should compare kurt_BF+SS and kurt_chSS+BF here. However, one problem still remains: comparison under equivalent noise reduction performance; the noise reduction performances of BF+SS and chSS+BF are not equivalent, as described in the next section. Moreover, the design of a flooring parameter so that the noise reduction performances of both methods become equivalent will be discussed in the next section. Therefore, kurt_BF+SS and kurt_chSS+BF will be compared in Section 6 under equivalent noise reduction performance conditions.
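The two output kurtoses (25) and (27) can be evaluated with a few lines of code. The sketch below is self-contained (it repeats the SciPy-based kurt_ss helper from the sketch after (17)); the function names are illustrative assumptions.

    import numpy as np
    from scipy.special import gamma as Gam, gammainc, gammaincc

    def kurt_ss(alpha, beta, eta):
        # Eqs. (14)-(17); same convention as the sketch after (17)
        ba = beta * alpha
        up = lambda a: gammaincc(a, ba) * Gam(a)    # eq. (16), unnormalized
        lo = lambda a: gammainc(a, ba) * Gam(a)     # eq. (17), unnormalized
        G = up(alpha + 2) - 2 * ba * up(alpha + 1) + ba**2 * up(alpha) + eta**4 * lo(alpha + 2)
        F = (up(alpha + 4) - 4 * ba * up(alpha + 3) + 6 * ba**2 * up(alpha + 2)
             - 4 * ba**3 * up(alpha + 1) + ba**4 * up(alpha) + eta**8 * lo(alpha + 4))
        return Gam(alpha) * F / G**2

    def alpha_from_kurt(k):
        # Eq. (24): shape parameter of the gamma model with power-domain kurtosis k
        return (np.sqrt(k**2 + 14 * k + 1) - k + 5) / (2 * k - 2)

    def kurt_bf_ss(alpha, beta, eta, J):
        # Eq. (25): DS first (kurtosis shrinks via (23)), then SS on the modified statistics
        kurt_ds = J ** -0.7 * ((alpha + 2) * (alpha + 3) / (alpha * (alpha + 1)) - 6) + 6
        return kurt_ss(alpha_from_kurt(kurt_ds), beta, eta)

    def kurt_chss_bf(alpha, beta, eta, J):
        # Eq. (27): channelwise SS first, then DS shrinks the excess kurtosis via (23)
        return J ** -0.7 * (kurt_ss(alpha, beta, eta) - 6.0) + 6.0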

5. Noise Reduction Performance Analysis

In the previous section, we did not discuss the noise reduction performances of BF+SS and chSS+BF. In this section, a mathematical analysis of the noise reduction performances of BF+SS and chSS+BF is given. As a result of this analysis, it is revealed that the noise reduction performances of BF+SS and chSS+BF are not equivalent even if the same parameters are set in the SS part. We then derive a flooring-parameter design strategy for aligning the noise reduction performances of BF+SS and chSS+BF.

5.1. Noise Reduction Performance of SS. We utilize the following index to measure the noise reduction performance (NRP):

NRP = 10 log₁₀( E[n_in] / E[n_out] ),   (28)

where n_in is the power-domain (noise) signal of the input and n_out is the power-domain (noise) signal of the output after processing.

First, we derive the average power of the input signal. We assume that the input signal in the power domain can be modeled by a gamma distribution. Then, the average power of the input signal is given as

E[n_in] = E[x] = ∫_0^∞ x P_GM(x) dx = ∫_0^∞ x · (1/(θ^α Γ(α))) x^{α−1} exp{−x/θ} dx = (1/(θ^α Γ(α))) ∫_0^∞ x^α exp{−x/θ} dx.   (29)

Here, let t = x/θ; then θ dt = dx. Thus,

E[n_in] = (1/(θ^α Γ(α))) ∫_0^∞ (θt)^α exp{−t} θ dt = (θ^{α+1}/(θ^α Γ(α))) ∫_0^∞ t^α exp{−t} dt = θ Γ(α+1)/Γ(α) = θα.   (30)

This corresponds to the mean of a random variable with a gamma distribution.

Next, the average power of the signal after SS is calculated. Here, let z obey the p.d.f. of the signal after SS, P_SS(z), defined by (13); then the average power of the signal after SS can be expressed as

E[n_out] = E[z] = ∫_0^∞ z P_SS(z) dz = ∫_0^∞ (z/(θ^α Γ(α))) (z + βαθ)^{α−1} exp{−(z + βαθ)/θ} dz + ∫_0^{βαη²θ} (z/((η²θ)^α Γ(α))) z^{α−1} exp{−z/(η²θ)} dz.   (31)

We now consider the first term of the right-hand side in (31). We let t = z + βαθ; then dt = dz. As a result,

∫_0^∞ (z/(θ^α Γ(α))) (z + βαθ)^{α−1} exp{−(z + βαθ)/θ} dz
= ∫_{βαθ}^∞ (t − βαθ) · (1/(θ^α Γ(α))) · t^{α−1} exp{−t/θ} dt
= ∫_{βαθ}^∞ (1/(θ^α Γ(α))) t^α exp{−t/θ} dt − ∫_{βαθ}^∞ (βαθ/(θ^α Γ(α))) t^{α−1} exp{−t/θ} dt
= θ · Γ(βα, α+1)/Γ(α) − βαθ · Γ(βα, α)/Γ(α).   (32)


Figure 10: (a) Theoretical noise reduction performance of SS with various oversubtraction parameters β and flooring parameters η. In this figure, kurtosis of input signal is fixed to 6.0. (b) Theoretical noise reduction performance of SS with various values of input kurtosis. In this figure, flooring parameter η is fixed to 0.0.

Also, we deal with the second term of the right-hand side in (31). We let t = z/(η²θ); then η²θ dt = dz, resulting in

∫_0^{βαη²θ} (z/((η²θ)^α Γ(α))) z^{α−1} exp{−z/(η²θ)} dz = (1/((η²θ)^α Γ(α))) ∫_0^{βα} (η²θt)^α exp{−t} η²θ dt = (η²θ/Γ(α)) γ(βα, α+1).   (33)

Using (30), (32), and (33), the noise reduction performance of SS, NRP_SS, can be expressed by

NRP_SS = 10 log₁₀( E[x]/E[z] ) = −10 log₁₀[ Γ(βα, α+1)/Γ(α+1) − β · Γ(βα, α)/Γ(α) + η² γ(βα, α+1)/Γ(α+1) ].   (34)

Figure 10(a) shows the theoretical value of NRP_SS for various values of the oversubtraction parameter β and flooring parameter η, where the kurtosis of the input signal is fixed to 6.0, corresponding to a Gaussian signal. From this figure, it is confirmed that NRP_SS is proportional to β. However, NRP_SS hits a peak when η is nonzero, even for a large value of β. The relation between the theoretical value of NRP_SS and the kurtosis of the input signal is illustrated in Figure 10(b).

In this figure, η is fixed to 0.0. It is revealed that NRP_SS decreases as the input kurtosis increases. This is because the mean of a high-kurtosis signal tends to be small. Since the shape parameter α of a high-kurtosis signal becomes small, the mean αθ corresponding to the amount of subtraction also becomes small. As a result, NRP_SS is decreased as the input kurtosis increases. That is to say, NRP_SS strongly depends on the statistical characteristics of the input signal as well as the values of the oversubtraction and flooring parameters.
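A sketch that evaluates (34) numerically follows; the SciPy regularized incomplete gamma functions absorb the Γ(α+1) and Γ(α) normalizations, and the example parameter values are arbitrary.

    import numpy as np
    from scipy.special import gammainc, gammaincc

    def nrp_ss_db(alpha, beta, eta):
        # Eq. (34): noise reduction performance of SS for a gamma-modeled input
        ba = beta * alpha
        e_out_over_e_in = (gammaincc(alpha + 1, ba)      # Gamma(ba, a+1)/Gamma(a+1)
                           - beta * gammaincc(alpha, ba) # beta * Gamma(ba, a)/Gamma(a)
                           + eta**2 * gammainc(alpha + 1, ba))
        return -10.0 * np.log10(e_out_over_e_in)

    # e.g. Gaussian input (alpha = 1), oversubtraction beta = 2, flooring eta = 0.2
    print(round(nrp_ss_db(1.0, 2.0, 0.2), 2))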

5.2. Noise Reduction Performance of DS. It is well known that the noise reduction performance of DS (NRP_DS) is proportional to the number of microphones. In particular, for spatially uncorrelated multichannel signals, NRP_DS is given as [1]

NRP_DS = 10 log₁₀ J.   (35)

5.3. Resultant Noise Reduction Performance: BF+SS versus chSS+BF. In the previous subsections, the noise reduction performances of SS and DS were discussed. In this subsection, we derive the resultant noise reduction performances of the composite systems of SS and DS, that is, BF+SS and chSS+BF.

The noise reduction performance of BF+SS is analyzed as follows. In BF+SS, DS is first applied to a multichannel input signal. If this input signal is spatially uncorrelated, its noise reduction performance can be represented by 10 log₁₀ J. After DS, SS is applied to the signal. Note that DS affects the kurtosis of the input signal. As described in Section 4.4, the resultant kurtosis after DS can be approximated as J^{−0.7} · (kurt_in − 6) + 6. Thus, SS is applied to the kurtosis-modified signal. Consequently, using (24), (34), and (35),


Figure 11: Comparison of noise reduction performances of chSS+BF with BF+SS for input kurtosis (a) 6, (b) 20, and (c) 80. In this figure, flooring parameter is fixed to 0.2 and number of microphones is 8.

the noise reduction performance of BF+SS, NRP_BF+SS, is given as

NRP_BF+SS = 10 log₁₀ J − 10 log₁₀[ Γ(βα̃, α̃+1)/Γ(α̃+1) − β · Γ(βα̃, α̃)/Γ(α̃) + η² γ(βα̃, α̃+1)/Γ(α̃+1) ]
= −10 log₁₀ (1/(J · Γ(α̃))) [ Γ(βα̃, α̃+1)/α̃ − β · Γ(βα̃, α̃) + η² γ(βα̃, α̃+1)/α̃ ],   (36)

where α̃ is defined by (24). In chSS+BF, SS is first applied to a multichannel input signal; then DS is applied to the resulting signal. Thus, using (34) and (35), the noise reduction performance of chSS+BF, NRP_chSS+BF, can be represented by

NRP_chSS+BF = −10 log₁₀ (1/(J · Γ(α))) [ Γ(βα, α+1)/α − β · Γ(βα, α) + η² γ(βα, α+1)/α ].   (37)

Figure 11 depicts the values of NRP_BF+SS and NRP_chSS+BF. From this result, we can see that the noise reduction performances of both methods are equivalent when the input signal is Gaussian. However, if the input signal is super-Gaussian, NRP_BF+SS exceeds NRP_chSS+BF. This is due to the fact that DS is first applied to the input signal in BF+SS; thus, DS reduces the kurtosis of the signal.


Figure 12: Theoretical kurtosis ratio between BF+SS and chSS+BF for various values of input kurtosis. In this figure, oversubtraction parameter is β = 2.0 and flooring parameter in chSS+BF is (a) η = 0.0, (b) η = 0.1, (c) η = 0.2, and (d) η = 0.4.

Since NRP_SS for a low-kurtosis signal is greater than that for a high-kurtosis signal (see Figure 10(b)), the noise reduction performance of BF+SS is superior to that of chSS+BF.

This discussion implies that NRP_BF+SS and NRP_chSS+BF are not equivalent under some conditions. Thus the kurtosis-based analysis described in Section 4 is biased and requires some adjustment. In the following subsection, we will discuss how to align the noise reduction performances of BF+SS and chSS+BF.
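A sketch comparing (36) and (37) numerically follows; the shape parameter α̃ of (24) is recomputed from the DS-reduced kurtosis of (23), and the regularized SciPy incomplete gammas absorb the Γ(·) normalizations. The function names and example parameters are assumptions of the sketch.

    import numpy as np
    from scipy.special import gammainc, gammaincc

    def ss_power_ratio(alpha, beta, eta):
        # E[n_out]/E[n_in] for SS on a gamma-modeled input, i.e. the bracket of eq. (34)
        ba = beta * alpha
        return (gammaincc(alpha + 1, ba) - beta * gammaincc(alpha, ba)
                + eta**2 * gammainc(alpha + 1, ba))

    def alpha_from_kurt(k):
        # Eq. (24)
        return (np.sqrt(k**2 + 14 * k + 1) - k + 5) / (2 * k - 2)

    def nrp_bf_ss(alpha, beta, eta, J):
        # Eq. (36): DS gain 10 log10 J plus SS applied to the kurtosis-modified signal
        kurt_in = (alpha + 2) * (alpha + 3) / (alpha * (alpha + 1))
        a_tilde = alpha_from_kurt(J ** -0.7 * (kurt_in - 6) + 6)   # eq. (23)
        return 10 * np.log10(J) - 10 * np.log10(ss_power_ratio(a_tilde, beta, eta))

    def nrp_chss_bf(alpha, beta, eta, J):
        # Eq. (37): SS on the original statistics, then the DS gain
        return 10 * np.log10(J) - 10 * np.log10(ss_power_ratio(alpha, beta, eta))

    # Gaussian input (alpha = 1): both are equal; super-Gaussian input: BF+SS is larger
    print(nrp_bf_ss(1.0, 2.0, 0.2, 8), nrp_chss_bf(1.0, 2.0, 0.2, 8))
    print(nrp_bf_ss(0.3, 2.0, 0.2, 8), nrp_chss_bf(0.3, 2.0, 0.2, 8))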

5.4. Flooring-Parameter Design in BF+SS for Equivalent Noise Reduction Performance. In this section, we describe the flooring-parameter design in BF+SS so that NRP_BF+SS and NRP_chSS+BF become equivalent.

Using (36) and (37), the flooring parameter η̃ that makes NRP_BF+SS equal to NRP_chSS+BF is

η̃ = sqrt{ ( α̃ / γ(βα̃, α̃+1) ) · [ ( Γ(α̃)/Γ(α) ) H(α, β, η) − I(α̃, β) ] },   (38)

where

H(α, β, η) = Γ(βα, α+1)/α − β · Γ(βα, α) + η² γ(βα, α+1)/α,   (39)

I(α̃, β) = Γ(βα̃, α̃+1)/α̃ − β · Γ(βα̃, α̃).   (40)

The detailed derivation of (38) is given in Appendix E. By replacing η in (3) with this new flooring parameter η̃, we can align NRP_BF+SS and NRP_chSS+BF to ensure a fair comparison.
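A sketch of the flooring-parameter design (38)–(40) follows; the helper names and the SciPy-based incomplete gammas are assumptions of the sketch. For a Gaussian input (α = 1), α̃ equals α and the matched flooring parameter reduces to the original η, which serves as a quick consistency check.

    import numpy as np
    from scipy.special import gamma as Gam, gammainc, gammaincc

    def upper_inc(b, a):
        return gammaincc(a, b) * Gam(a)     # eq. (16)

    def lower_inc(b, a):
        return gammainc(a, b) * Gam(a)      # eq. (17)

    def alpha_from_kurt(k):
        return (np.sqrt(k**2 + 14 * k + 1) - k + 5) / (2 * k - 2)   # eq. (24)

    def matched_flooring(alpha, beta, eta, J):
        # Eqs. (38)-(40): flooring parameter for BF+SS giving the same NRP as chSS+BF
        kurt_in = (alpha + 2) * (alpha + 3) / (alpha * (alpha + 1))
        a_t = alpha_from_kurt(J ** -0.7 * (kurt_in - 6) + 6)        # alpha-tilde after DS
        H = (upper_inc(beta * alpha, alpha + 1) / alpha
             - beta * upper_inc(beta * alpha, alpha)
             + eta**2 * lower_inc(beta * alpha, alpha + 1) / alpha)                     # eq. (39)
        I = upper_inc(beta * a_t, a_t + 1) / a_t - beta * upper_inc(beta * a_t, a_t)    # eq. (40)
        inner = Gam(a_t) / Gam(alpha) * H - I
        return np.sqrt(a_t / lower_inc(beta * a_t, a_t + 1) * inner)                    # eq. (38)

    print(matched_flooring(1.0, 2.0, 0.2, 8))   # ~0.2 for a Gaussian input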

6. Output Kurtosis Comparison under Equivalent NRP Condition

In this section, using the new flooring parameter for BF+SS, η̃, we compare the output kurtosis of BF+SS and chSS+BF.

Substituting η̃ into (25), the output kurtosis of BF+SS is modified to

kurt_BF+SS = Γ(α̃) F(α̃, β, η̃) / G²(α̃, β, η̃).   (41)


Figure 13: Theoretical kurtosis ratio between BF+SS and chSS+BF for various oversubtraction parameters. In this figure, number of microphones is fixed to 8, and input kurtosis is (a) 6.0 (Gaussian) and (b) 20.0 (super-Gaussian).

Figure 14: Reverberant room used in our simulations (3.9 m × 3.9 m, reverberation time: 200 ms; microphone array with interelement spacing of 2.15 cm; loudspeakers for the target source and for the interferences).

Here, we adopt the following index to compare the resultant kurtosis after BF+SS and chSS+BF:

R = ln( kurt_BF+SS / kurt_chSS+BF ),   (42)

where R expresses the resultant kurtosis ratio between BF+SS and chSS+BF. Note that a positive R indicates that chSS+BF reduces the kurtosis more than BF+SS, implying that less musical noise is generated in chSS+BF. The behavior of R is depicted in Figures 12 and 13. Figure 12 illustrates theoretical values of R for various values of input kurtosis.

In this figure, β is fixed to 2.0 and the flooring parameter in chSS+BF is set to η = 0.0, 0.1, 0.2, and 0.4. The flooring parameter for BF+SS is automatically determined by (38). From this figure, we can confirm that chSS+BF reduces the kurtosis more than BF+SS for almost all input signals with various values of input kurtosis. Theoretical values of R for various oversubtraction parameters are depicted in Figure 13. Figure 13(a) shows that the output kurtosis after chSS+BF is always less than that after BF+SS for a Gaussian signal, even if η is nonzero. On the other hand, Figure 13(b) implies that the output kurtosis after BF+SS becomes less than that after chSS+BF for some parameter settings. However, this only occurs for a large oversubtraction parameter, for example, β ≥ 7, which is not often applied in practical use. Therefore, it can be considered that chSS+BF reduces the kurtosis and musical noise more than BF+SS in almost all cases.

7. Experiments and Results

7.1. Computer Simulations. First, we compare BF+SS and chSS+BF in terms of kurtosis ratio and noise reduction performance. We use 16-kHz-sampled signals as test data, in which the target speech is the original speech convoluted with impulse responses recorded in a room with 200 millisecond reverberation (see Figure 14), and to which an artificially generated spatially uncorrelated white Gaussian or super-Gaussian signal is added. We use six speakers (six sentences) as sources of the original clean speech. The number of microphone elements in the simulation is varied from 2 to 16, and their interelement distance is 2.15 cm each. The oversubtraction parameter β is set to 2.0 and the flooring parameter for BF+SS, η, is set to 0.0, 0.2, 0.4, or 0.8. Note that the flooring parameter in chSS+BF is set to 0.0. In the simulation, we assume that the long-term-averaged power spectrum of noise is estimated perfectly in advance.


Figure 15: Results for Gaussian input signal. (a) Kurtosis ratio and (b) noise reduction performance for BF+SS with various flooring parameters.

Here, we utilize the kurtosis ratio defined in Section 3.4 to measure the difference in kurtosis, which is related to the amount of musical noise generated. The kurtosis ratio is given by

kurtosis ratio = kurt( n_proc(f, τ) ) / kurt( n_org(f, τ) ),   (43)

where n_proc(f, τ) is the power spectra of the residual noise signal after processing, and n_org(f, τ) is the power spectra of the original noise signal before processing. This kurtosis ratio indicates the extent to which kurtosis is increased with processing. Thus, a smaller kurtosis ratio is desirable. Moreover, the noise reduction performance is measured using (28).

Figures 15–17 show the simulation results for a Gaussian input signal. From Figure 15(a), we can see that the kurtosis ratio of chSS+BF decreases almost monotonically with increasing number of microphones. On the other hand, the kurtosis ratio of BF+SS does not exhibit such a tendency regardless of the flooring parameter. Also, the kurtosis ratio of chSS+BF is lower than that of BF+SS for all cases except for η = 0.8. Moreover, we can confirm from Figure 15(b) that the values of noise reduction performance for BF+SS with flooring parameter η = 0.0 and chSS+BF are almost the same. When the flooring parameter for BF+SS is nonzero, the kurtosis ratio of BF+SS becomes smaller but the noise reduction performance degrades. On the other hand, for Gaussian signals, chSS+BF can reduce the kurtosis ratio, that is, reduce the amount of musical noise generated, without degrading the noise reduction performance. Indeed, BF+SS with η = 0.8 reduces the kurtosis ratio more than chSS+BF, but the noise reduction performance of BF+SS is extremely degraded. Furthermore, we can confirm from Figures 16 and 17 that the theoretical kurtosis ratio and noise reduction performance closely fit the experimental results. These findings also support the validity of the analysis in Sections 4, 5, and 6.

Figures 18–20 illustrate the simulation results for a super-Gaussian input signal. It is confirmed from Figure 18(a) that the kurtosis ratio of chSS+BF also decreases monotonically with increasing number of microphones. Unlike the case of the Gaussian input signal, the kurtosis ratio of BF+SS with η = 0.8 also decreases with increasing number of microphones. However, for a lower value of the flooring parameter, the kurtosis ratio of BF+SS is not degraded. Moreover, the kurtosis ratio of chSS+BF is lower than that of BF+SS for almost all cases. For the super-Gaussian input signal, in contrast to the case of the Gaussian input signal, the noise reduction performance of BF+SS with η = 0.0 is greater than that of chSS+BF (see Figure 18(b)). That is to say, the noise reduction performance of BF+SS is superior to that of chSS+BF for the same flooring parameter. This result is consistent with the analysis in Section 5. The noise reduction performance of BF+SS with η = 0.4 is comparable to that of chSS+BF. However, the kurtosis ratio of chSS+BF is still lower than that of BF+SS with η = 0.4. This result also coincides with the analysis in Section 6. On the other hand, the kurtosis ratio of BF+SS with η = 0.8 is almost the same as that of chSS+BF. However, the noise reduction performance of BF+SS with η = 0.8 is lower than that of chSS+BF. Thus, it can be confirmed that chSS+BF reduces the kurtosis ratio more than BF+SS for a super-Gaussian signal under the same noise reduction performance. Furthermore, the theoretical kurtosis ratio and noise reduction performance closely fit the experimental results in Figures 19 and 20.

We also compare speech distortion originating from chSS+BF and BF+SS on the basis of cepstral distortion (CD) [29] for the four-microphone case.


Figure 16: Comparison between experimental and theoretical kurtosis ratios for Gaussian input signal: (a) chSS+BF and (b)–(e) BF+SS with η = 0.0, 0.2, 0.4, and 0.8.


Figure 17: Comparison between experimental and theoretical noise reduction performances for Gaussian input signal: (a) chSS+BF and (b)–(e) BF+SS with η = 0.0, 0.2, 0.4, and 0.8.


Figure 18: Results for super-Gaussian input signal. (a) Kurtosis ratio and (b) noise reduction performance for BF+SS with various flooring parameters.

Table 1: Speech distortion comparison of chSS+BF and BF+SS on the basis of CD for the four-microphone case.

Input noise type     chSS+BF    BF+SS
Gaussian             6.15 dB    6.45 dB
Super-Gaussian       6.17 dB    5.12 dB

The comparison is made under the condition that the noise reduction performances of both methods are almost the same. For the Gaussian input signal, the same parameters β = 2.0 and η = 0.0 are utilized for BF+SS and chSS+BF. On the other hand, β = 2.0 and η = 0.4 are utilized for BF+SS and β = 2.0 and η = 0.0 are utilized for chSS+BF for the super-Gaussian input signal. Table 1 shows the result of the comparison, from which we can see that the amount of speech distortion originating from BF+SS and chSS+BF is almost the same for the Gaussian input signal. For the super-Gaussian input signal, the speech distortion originating from BF+SS is less than that from chSS+BF. This is owing to the difference in the flooring parameter for each method.

In conclusion, all of these results are strong evidence for the validity of the analysis in Sections 4, 5, and 6. These results suggest the following.

(i) Although BF+SS can reduce the amount of musical noise by employing a larger flooring parameter, it leads to a deterioration of the noise reduction performance.

(ii) In contrast, chSS+BF can reduce the kurtosis ratio, which corresponds to the amount of musical noise generated, without degradation of the noise reduction performance.

(iii) Under the same level of noise reduction performance, the amount of musical noise generated via chSS+BF is less than that generated via BF+SS.

(iv) Thus, the chSS+BF structure is preferable from the viewpoint of musical-noise generation.

(v) However, the noise reduction performance of BF+SS is superior to that of chSS+BF for a super-Gaussian signal when the same parameters are set in the SS part for both methods.

(vi) These results imply a trade-off between the amount of musical noise generated and the noise reduction performance. Thus, we should use an appropriate structure depending on the application.

These results should be applicable under different SNR conditions because our analysis is independent of the noise level. In the case of more reverberation, the observed signal tends to become Gaussian because many reverberant components are mixed. Therefore, the behavior of both methods under more reverberant conditions should be similar to that in the case of a Gaussian signal.

7.2. Subjective Evaluation. Next, we conduct a subjective evaluation to confirm that chSS+BF can mitigate musical noise. In the evaluation, we presented two signals processed by BF+SS and by chSS+BF to seven male examinees in random order, who were asked to select which signal they considered to contain less musical noise (the so-called AB method). Moreover, we instructed examinees to evaluate only the musical noise and not to consider the amplitude of the remaining noise. Here, the flooring parameter in BF+SS was automatically determined so that the output SNR of


[Figure 19, panels (a)–(e): kurtosis ratio versus number of microphones (2–16) for (a) chSS+BF and for BF+SS with η = 0.0, 0.2, 0.4, and 0.8; experimental and theoretical curves.]

Figure 19: Comparison between experimental and theoretical kurtosis ratios for super-Gaussian input signal.

BF+SS and chSS+BF was equivalent. We used the preference score as the index of the evaluation, which is the frequency of the selected signal.

In the experiment, three types of noise, (a) artificial spatially uncorrelated white Gaussian noise, (b) recorded railway-station noise emitted from 36 loudspeakers, and (c) recorded human speech emitted from 36 loudspeakers, were used. Note that noises (b) and (c) were recorded in the actual room shown in Figure 14 and therefore include interchannel correlation because they were recordings of actual noise signals.

Each test sample is a 16-kHz-sampled signal, and the target speech is the original speech convoluted with impulse responses recorded in a room with 200 millisecond


[Figure 20, panels (a)–(e): noise reduction performance versus number of microphones (2–16) for (a) chSS+BF and for BF+SS with η = 0.0, 0.2, 0.4, and 0.8; experimental and theoretical curves.]

Figure 20: Comparison between experimental and theoretical noise reduction performances for super-Gaussian input signal.

reverberation (see Figure 14) and to which the above-mentioned recorded noise signal is added. Ten pairs of signals per type of noise, that is, a total of 30 pairs of processed signals, were presented to each examinee.

Figure 21 shows the subjective evaluation results, which confirm that the output of chSS+BF is preferred to that of BF+SS, even for actual acoustic noises including non-Gaussianity and interchannel correlation properties.


[Figure 21: preference score (%) for chSS+BF and BF+SS under three noise conditions: white Gaussian noise, station noise from 36 loudspeakers, and speech from 36 loudspeakers; error bars indicate the 95% confidence interval.]

Figure 21: Subjective evaluation results.

8. Conclusion

In this paper, we analyze two methods of integrating microphone array signal processing and SS, that is, BF+SS and chSS+BF, on the basis of HOS. As a result of the analysis, it is revealed that the amount of musical noise generated via SS strongly depends on the statistical characteristics of the input signal. Moreover, it is also clarified that the noise reduction performances of BF+SS and chSS+BF are different except in the case of a Gaussian input signal. As a result of our analysis under equivalent noise reduction performance conditions, it is shown that chSS+BF reduces musical noise more than BF+SS in almost all practical cases. The results of a computer simulation also support the validity of our analysis. Moreover, by carrying out a subjective evaluation, it is confirmed that the output of chSS+BF is considered to contain less musical noise than that of BF+SS. These analytic and experimental results imply the considerable potential of optimization based on HOS to reduce musical noise.

As future work, it remains necessary to carry out signal analysis based on more general distributions. For instance, analysis using a generalized gamma distribution [26, 27] can lead to more general results. Moreover, an exact formulation of how kurtosis is changed through DS under a coherent condition is still an open problem. Furthermore, the robustness of BF+SS and chSS+BF against low-SNR or more reverberant conditions is not discussed in this paper. In the future, the discussion should involve not only noise reduction performance and musical-noise generation but also such robustness.

Appendices

A. Derivation of (13)

When we assume that the input signal of the power domain can be modeled by a gamma distribution, the amount of subtraction is βαθ. The subtraction of the estimated noise power spectrum in each frequency subband can be considered as a lateral shift of the p.d.f. to the zero-power direction (see Figure 5). As a result of this subtraction, the random variable x is replaced with x + βαθ and the gamma distribution becomes

P_GM(x) = (1/(Γ(α)θ^α)) · (x + βαθ)^(α−1) exp{−(x + βαθ)/θ}   (x ≥ −βαθ).   (A.1)

Since the domain of the original gamma distribution is x ≥ 0, the domain of the resultant p.d.f. is x ≥ −βαθ. Thus, negative-power components with nonzero probability arise, which can be represented by

P_negative(x) = (1/(Γ(α)θ^α)) · (x + βαθ)^(α−1) exp{−(x + βαθ)/θ}   (−βαθ ≤ x ≤ 0),   (A.2)

where P_negative(x) is part of P_GM(x). To remove the negative-power components, the signals corresponding to P_negative(x) are replaced by observations multiplied by a small positive value η. The observations corresponding to (A.2), P_obs(x), are given by

P_obs(x) = (1/(Γ(α)θ^α)) · x^(α−1) exp{−x/θ}   (0 ≤ x ≤ βαθ).   (A.3)

Since a small positive flooring parameter η is applied to (A.3), the scale parameter θ becomes η²θ and the range is changed from 0 ≤ x ≤ βαθ to 0 ≤ x ≤ βαη²θ. Then, (A.3) is modified to

P_floor(x) = (1/(Γ(α)(η²θ)^α)) · x^(α−1) exp{−x/(η²θ)}   (0 ≤ x ≤ βαη²θ),   (A.4)

where P_floor(x) is the probability of the floored components. This P_floor(x) is superimposed on the p.d.f. given by (A.1) within the range 0 ≤ x ≤ βαη²θ. By considering the positive range of (A.1) and P_floor(x), the resultant p.d.f. of SS can be formulated as

P_SS(z) = (1/(θ^α Γ(α))) (z + βαθ)^(α−1) exp{−(z + βαθ)/θ}   (z ≥ βαη²θ),

P_SS(z) = (1/(θ^α Γ(α))) (z + βαθ)^(α−1) exp{−(z + βαθ)/θ} + (1/((η²θ)^α Γ(α))) z^(α−1) exp{−z/(η²θ)}   (0 < z < βαη²θ),   (A.5)

where the variable x is replaced with z for convenience.


B. Derivation of (14)

To derive the kurtosis after SS, the 2nd- and 4th-order moments of z are required. For P_SS(z), the 2nd-order moment is given by

μ₂ = ∫₀^∞ z² · P_SS(z) dz
   = ∫₀^∞ z² (1/(θ^α Γ(α))) (z + βαθ)^(α−1) exp{−(z + βαθ)/θ} dz
     + ∫₀^(βαη²θ) z² (1/((η²θ)^α Γ(α))) z^(α−1) exp{−z/(η²θ)} dz.   (B.1)

We now expand the first term on the right-hand side of (B.1). Here, let t = (z + βαθ)/θ; then θ dt = dz and z = θ(t − βα). Consequently,

∫₀^∞ z² (1/(θ^α Γ(α))) (z + βαθ)^(α−1) exp{−(z + βαθ)/θ} dz
  = ∫_(βα)^∞ θ²(t − βα)² (1/(θ^α Γ(α))) (θt)^(α−1) exp{−t} θ dt
  = (θ²/Γ(α)) ∫_(βα)^∞ (t² − 2βαt + β²α²) t^(α−1) exp{−t} dt
  = (θ²/Γ(α)) [Γ(βα, α+2) − 2βα Γ(βα, α+1) + β²α² Γ(βα, α)].   (B.2)

Next we consider the second term on the right-hand side of (B.1). Here, let t = z/(η²θ); then η²θ dt = dz. Thus,

∫₀^(βαη²θ) z² (1/((η²θ)^α Γ(α))) z^(α−1) exp{−z/(η²θ)} dz
  = ∫₀^(βα) (η²θt)² (1/((η²θ)^α Γ(α))) (η²θt)^(α−1) exp{−t} η²θ dt
  = (η⁴θ²/Γ(α)) ∫₀^(βα) t^(α+1) exp{−t} dt = η⁴θ² γ(βα, α+2)/Γ(α).   (B.3)

As a result, the 2nd-order moment after SS, μ₂^(SS), is a composite of (B.2) and (B.3) and is given as

μ₂^(SS) = (θ²/Γ(α)) [Γ(βα, α+2) − 2βα Γ(βα, α+1) + β²α² Γ(βα, α) + η⁴ γ(βα, α+2)].   (B.4)

In the same manner, the 4th-order moment after SS, μ₄^(SS), can be represented by

μ₄^(SS) = (θ⁴/Γ(α)) [Γ(βα, α+4) − 4βα Γ(βα, α+3) + 6β²α² Γ(βα, α+2) − 4β³α³ Γ(βα, α+1) + β⁴α⁴ Γ(βα, α) + η⁸ γ(βα, α+4)].   (B.5)

Consequently, using (B.4) and (B.5), the kurtosis after SS is given as

kurt_SS = Γ(α) F(α, β, η) / G²(α, β, η),   (B.6)

where

G(α, β, η) = Γ(βα, α+2) − 2βα Γ(βα, α+1) + β²α² Γ(βα, α) + η⁴ γ(βα, α+2),

F(α, β, η) = Γ(βα, α+4) − 4βα Γ(βα, α+3) + 6β²α² Γ(βα, α+2) − 4β³α³ Γ(βα, α+1) + β⁴α⁴ Γ(βα, α) + η⁸ γ(βα, α+4).   (B.7)

C. Derivation of (22)

As described in (12), the power-domain signal is the sum of two squares of random variables with the same distribution. Using (18), the power-domain cumulants K_n^(p) can be written as

K₁^(p) = 2K₁^(2),  K₂^(p) = 2K₂^(2),  K₃^(p) = 2K₃^(2),  K₄^(p) = 2K₄^(2),   (C.1)

where K_n^(2) is the nth-order cumulant in the square domain. Here, the p.d.f. of such a square-domain signal is not symmetrical and its mean is not zero. Thus, we utilize the following relations between the moments and cumulants around the origin:

μ₁ = κ₁,
μ₂ = κ₂ + κ₁²,
μ₄ = κ₄ + 4κ₃κ₁ + 3κ₂² + 6κ₂κ₁² + κ₁⁴,   (C.2)

where μ_n is the nth-order raw moment and κ_n is the nth-order cumulant. Moreover, the square-domain moments μ_n^(2) can be expressed by

μ₁^(2) = μ₂,  μ₂^(2) = μ₄,  μ₄^(2) = μ₈.   (C.3)

Using (C.1)–(C.3), the power-domain moments can be expressed in terms of the 4th- and 8th-order moments in the time domain. Therefore, to obtain the kurtosis after DS in the power domain, the moments and cumulants after DS up to the 8th order are needed.


The 3rd-, 5th-, and 7th-order cumulants are zero because we assume that the p.d.f. of x_j is symmetrical and that its mean is zero. If these conditions are satisfied, the following relations between moments and cumulants hold:

μ₁ = 0,
μ₂ = κ₂,
μ₄ = κ₄ + 3κ₂²,
μ₆ = κ₆ + 15κ₄κ₂ + 15κ₂³,
μ₈ = κ₈ + 35κ₄² + 28κ₂κ₆ + 210κ₂²κ₄ + 105κ₂⁴.   (C.4)

Using (21) and (C.4), the time-domain moments after DS are expressed as

μ₂^(DS) = K₂,
μ₄^(DS) = K₄ + 3K₂²,
μ₆^(DS) = K₆ + 15K₂K₄ + 15K₂³,
μ₈^(DS) = K₈ + 35K₄² + 28K₂K₆ + 210K₂²K₄ + 105K₂⁴,   (C.5)

where μ_n^(DS) is the nth-order raw moment after DS in the time domain. Using (C.2), (C.3), and (C.5), the square-domain cumulants can be written as

K₁^(2) = K₂,
K₂^(2) = K₄ + 2K₂²,
K₃^(2) = K₆ + 12K₄K₂ + 8K₂³,
K₄^(2) = K₈ + 32K₄² + 24K₂K₆ + 144K₂²K₄ + 48K₂⁴,   (C.6)

where K_n^(2) is the nth-order cumulant in the square domain. Moreover, using (C.1), (C.2), and (C.6), the 2nd- and 4th-order power-domain moments can be written as

μ₂^(p) = 2(K₄ + 4K₂²),
μ₄^(p) = 2(K₈ + 38K₄² + 32K₆K₂ + 288K₄K₂² + 192K₂⁴).   (C.7)

As a result, the power-domain kurtosis after DS, kurt_DS, is given as

kurt_DS = (K₈ + 38K₄² + 32K₂K₆ + 288K₂²K₄ + 192K₂⁴) / (2K₄² + 16K₂²K₄ + 32K₂⁴).   (C.8)
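As a quick numerical cross-check (not part of the original paper), note that for a zero-mean Gaussian input all cumulants above the 2nd order vanish, so (C.8) reduces to 192K₂⁴/(32K₂⁴) = 6; a simple Monte Carlo run reproduces this value.

```python
# Sanity check of (C.8) in the Gaussian case: power-domain kurtosis should be 6.
import numpy as np

rng = np.random.default_rng(0)
re, im = rng.standard_normal(10**6), rng.standard_normal(10**6)
p = re**2 + im**2                                   # power-domain signal: sum of two squares
kurt_mc = np.mean(p**4) / np.mean(p**2)**2          # kurtosis defined as mu4 / mu2^2

K2 = 1.0                                            # variance of each Gaussian component
kurt_formula = 192 * K2**4 / (32 * K2**4)           # (C.8) with K4 = K6 = K8 = 0
print(kurt_mc, kurt_formula)                        # both close to 6
```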

D. Derivation of (24)

According to (11), the shape parameter α corresponding to the kurtosis after DS, kurt_DS, is given by the solution of the quadratic equation

kurt_DS = (α + 2)(α + 3) / (α(α + 1)).   (D.1)

This can be expanded as

α²(kurt_DS − 1) + α(kurt_DS − 5) − 6 = 0.   (D.2)

Using the quadratic formula,

α = (−kurt_DS + 5 ± √(kurt_DS² + 14 kurt_DS + 1)) / (2 kurt_DS − 2),   (D.3)

whose denominator is larger than zero because kurt_DS > 1. Here, since α > 0, we must select the appropriate numerator of (D.3). First, suppose that

−kurt_DS + 5 + √(kurt_DS² + 14 kurt_DS + 1) > 0.   (D.4)

This inequality clearly holds when 1 < kurt_DS < 5 because −kurt_DS + 5 > 0 and √(kurt_DS² + 14 kurt_DS + 1) > 0. Thus,

−kurt_DS + 5 > −√(kurt_DS² + 14 kurt_DS + 1).   (D.5)

When kurt_DS ≥ 5, the following relation also holds:

(−kurt_DS + 5)² < kurt_DS² + 14 kurt_DS + 1  ⟺  24 kurt_DS > 24.   (D.6)

Since (D.6) is true when kurt_DS ≥ 5, (D.4) holds. In summary, (D.4) always holds for 1 < kurt_DS < 5 and 5 ≤ kurt_DS. Thus,

−kurt_DS + 5 + √(kurt_DS² + 14 kurt_DS + 1) > 0  for kurt_DS > 1.   (D.7)

Overall,

(−kurt_DS + 5 + √(kurt_DS² + 14 kurt_DS + 1)) / (2 kurt_DS − 2) > 0.   (D.8)

On the other hand, let

−kurt_DS + 5 − √(kurt_DS² + 14 kurt_DS + 1) > 0.   (D.9)

This inequality is not satisfied when kurt_DS > 5 because −kurt_DS + 5 < 0 and √(kurt_DS² + 14 kurt_DS + 1) > 0. Now (D.9) can be modified as

−kurt_DS + 5 > √(kurt_DS² + 14 kurt_DS + 1),   (D.10)


then the following relation also holds for 1 < kurt_DS ≤ 5:

(−kurt_DS + 5)² > kurt_DS² + 14 kurt_DS + 1  ⟺  24 kurt_DS < 24.   (D.11)

This is not true for 1 < kurt_DS ≤ 5. Thus, (D.9) is not appropriate for kurt_DS > 1. Therefore, α corresponding to kurt_DS is given by

α = (−kurt_DS + 5 + √(kurt_DS² + 14 kurt_DS + 1)) / (2 kurt_DS − 2).   (D.12)

E. Derivation of (38)

For 0 < α ≤ 1, which corresponds to a Gaussian or super-Gaussian input signal, it is revealed from the numerical simulation in Section 5.3 that the noise reduction performance of BF+SS is superior to that of chSS+BF. Thus, the following relation holds:

−10 log₁₀ { (1/(J·Γ(ᾱ))) [Γ(βᾱ, ᾱ+1)/ᾱ − β·Γ(βᾱ, ᾱ) + η² γ(βᾱ, ᾱ+1)/ᾱ] }
  ≥ −10 log₁₀ { (1/(J·Γ(α))) [Γ(βα, α+1)/α − β·Γ(βα, α) + η² γ(βα, α+1)/α] },   (E.1)

where ᾱ denotes the shape parameter corresponding to the kurtosis after DS (see (D.12)). This inequality corresponds to

(1/Γ(ᾱ)) [Γ(βᾱ, ᾱ+1)/ᾱ − β·Γ(βᾱ, ᾱ) + η² γ(βᾱ, ᾱ+1)/ᾱ]
  ≤ (1/Γ(α)) [Γ(βα, α+1)/α − β·Γ(βα, α) + η² γ(βα, α+1)/α].   (E.2)

Then, the new flooring parameter η̄ in BF+SS, which makes the noise reduction performance of BF+SS equal to that of chSS+BF, satisfies η̄ ≥ η (≥ 0) because

γ(βᾱ, ᾱ+1)/ᾱ ≥ 0.   (E.3)

Moreover, the following relation for η̄ also holds:

(1/Γ(ᾱ)) [Γ(βᾱ, ᾱ+1)/ᾱ − β·Γ(βᾱ, ᾱ) + η̄² γ(βᾱ, ᾱ+1)/ᾱ]
  = (1/Γ(α)) [Γ(βα, α+1)/α − β·Γ(βα, α) + η² γ(βα, α+1)/α].   (E.4)

This can be rewritten as

η̄² (Γ(α)/Γ(ᾱ)) γ(βᾱ, ᾱ+1)/ᾱ
  = [Γ(βα, α+1)/α − β·Γ(βα, α) + η² γ(βα, α+1)/α]
    − (Γ(α)/Γ(ᾱ)) [Γ(βᾱ, ᾱ+1)/ᾱ − β·Γ(βᾱ, ᾱ)],   (E.5)

and consequently

η̄² = (ᾱ/γ(βᾱ, ᾱ+1)) [ (Γ(ᾱ)/Γ(α)) H(α, β, η) − I(ᾱ, β) ],   (E.6)

where H(α, β, η) is defined by (39) and I(ᾱ, β) is given by (40). Using (E.3) and (E.4), the right-hand side of (E.5) is clearly greater than or equal to zero. Moreover, since Γ(α) > 0, Γ(ᾱ) > 0, ᾱ > 0, and γ(βᾱ, ᾱ+1) > 0, the right-hand side of (E.6) is also greater than or equal to zero. Therefore,

η̄ = √{ (ᾱ/γ(βᾱ, ᾱ+1)) [ (Γ(ᾱ)/Γ(α)) H(α, β, η) − I(ᾱ, β) ] }.   (E.7)

Acknowledgment

This work was partly supported by the MIC Strategic Information and Communications R&D Promotion Programme in Japan.

References

[1] M. Brandstein and D. Ward, Eds., Microphone Arrays: Signal Processing Techniques and Applications, Springer, Berlin, Germany, 2001.

[2] J. L. Flanagan, J. D. Johnston, R. Zahn, and G. W. Elko, "Computer-steered microphone arrays for sound transduction in large rooms," Journal of the Acoustical Society of America, vol. 78, no. 5, pp. 1508–1518, 1985.

[3] M. Omologo, M. Matassoni, P. Svaizer, and D. Giuliani, "Microphone array based speech recognition with different talker-array positions," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP '97), pp. 227–230, Munich, Germany, September 1997.

[4] H. F. Silverman and W. R. Patterson, "Visualizing the performance of large-aperture microphone arrays," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP '99), pp. 962–972, 1999.

[5] O. Frost, "An algorithm for linearly constrained adaptive array processing," Proceedings of the IEEE, vol. 60, pp. 926–935, 1972.

[6] L. J. Griffiths and C. W. Jim, "An alternative approach to linearly constrained adaptive beamforming," IEEE Transactions on Antennas and Propagation, vol. 30, no. 1, pp. 27–34, 1982.

[7] Y. Kaneda and J. Ohga, "Adaptive microphone-array system for noise reduction," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 34, no. 6, pp. 1391–1400, 1986.

[8] S. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 27, no. 2, pp. 113–120, 1979.

[9] J. Meyer and K. Simmer, "Multi-channel speech enhancement in a car environment using Wiener filtering and spectral subtraction," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP '97), pp. 1167–1170, 1997.

[10] S. Fischer and K. D. Kammeyer, "Broadband beamforming with adaptive post filtering for speech acquisition in noisy environment," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP '97), pp. 359–362, 1997.

[11] R. Mukai, S. Araki, H. Sawada, and S. Makino, "Removal of residual cross-talk components in blind source separation using time-delayed spectral subtraction," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP '02), pp. 1789–1792, Orlando, Fla, USA, May 2002.

[12] J. Cho and A. Krishnamurthy, "Speech enhancement using microphone array in moving vehicle environment," in Proceedings of the IEEE Intelligent Vehicles Symposium, pp. 366–371, Graz, Austria, April 2003.

[13] Y. Ohashi, T. Nishikawa, H. Saruwatari, A. Lee, and K. Shikano, "Noise robust speech recognition based on spatial subtraction array," in Proceedings of the International Workshop on Nonlinear Signal and Image Processing, pp. 324–327, 2005.

[14] J. Even, H. Saruwatari, and K. Shikano, "New architecture combining blind signal extraction and modified spectral subtraction for suppression of background noise," in Proceedings of the International Workshop on Acoustic Echo and Noise Control (IWAENC '08), Seattle, Wash, USA, 2008.

[15] Y. Takahashi, T. Takatani, K. Osako, H. Saruwatari, and K. Shikano, "Blind spatial subtraction array for speech enhancement in noisy environment," IEEE Transactions on Audio, Speech and Language Processing, vol. 17, no. 4, pp. 650–664, 2009.

[16] S. B. Jebara, "A perceptual approach to reduce musical noise phenomenon with Wiener denoising technique," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP '06), vol. 3, pp. 49–52, 2006.

[17] Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 6, pp. 1109–1121, 1984.

[18] Y. Uemura, Y. Takahashi, H. Saruwatari, K. Shikano, and K. Kondo, "Automatic optimization scheme of spectral subtraction based on musical noise assessment via higher-order statistics," in Proceedings of the International Workshop on Acoustic Echo and Noise Control (IWAENC '08), Seattle, Wash, USA, 2008.

[19] Y. Uemura, Y. Takahashi, H. Saruwatari, K. Shikano, and K. Kondo, "Musical noise generation analysis for noise reduction methods based on spectral subtraction and MMSE STSA estimation," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP '09), pp. 4433–4436, 2009.

[20] Y. Takahashi, Y. Uemura, H. Saruwatari, K. Shikano, and K. Kondo, "Musical noise analysis based on higher order statistics for microphone array and nonlinear signal processing," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP '09), pp. 229–232, 2009.

[21] P. Comon, "Independent component analysis, a new concept?" Signal Processing, vol. 36, pp. 287–314, 1994.

[22] H. Saruwatari, S. Kurita, K. Takeda, F. Itakura, T. Nishikawa, and K. Shikano, "Blind source separation combining independent component analysis and beamforming," EURASIP Journal on Applied Signal Processing, vol. 2003, no. 11, pp. 1135–1146, 2003.

[23] M. Mizumachi and M. Akagi, "Noise reduction by paired-microphone using spectral subtraction," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP '98), vol. 2, pp. 1001–1004, 1998.

[24] T. Takatani, T. Nishikawa, H. Saruwatari, and K. Shikano, "High-fidelity blind separation of acoustic signals using SIMO-model-based independent component analysis," IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. E87-A, no. 8, pp. 2063–2072, 2004.

[25] S. Ikeda and N. Murata, "A method of ICA in the frequency domain," in Proceedings of the International Workshop on Independent Component Analysis and Blind Signal Separation, pp. 365–371, 1999.

[26] E. W. Stacy, "A generalization of the gamma distribution," The Annals of Mathematical Statistics, pp. 1187–1192, 1962.

[27] K. Kokkinakis and A. K. Nandi, "Generalized gamma density-based score functions for fast and flexible ICA," Signal Processing, vol. 87, no. 5, pp. 1156–1162, 2007.

[28] J. W. Shin, J.-H. Chang, and N. S. Kim, "Statistical modeling of speech signals based on generalized gamma distribution," IEEE Signal Processing Letters, vol. 12, no. 3, pp. 258–261, 2005.

[29] L. Rabiner and B. Juang, Fundamentals of Speech Recognition, Prentice-Hall PTR, 1993.

Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2010, Article ID 509541, 13 pages
doi:10.1155/2010/509541

Research Article

Microphone Diversity Combining for In-Car Applications

Jurgen Freudenberger, Sebastian Stenzel (EURASIP Member), and Benjamin Venditti (EURASIP Member)

Department of Computer Science, University of Applied Sciences Konstanz, Hochschule Konstanz, Brauneggerstr. 55, 78462 Konstanz, Germany

Correspondence should be addressed to Jurgen Freudenberger, [email protected]

Received 1 August 2009; Revised 23 January 2010; Accepted 17 March 2010

Academic Editor: Ivan Tashev

Copyright © 2010 Jurgen Freudenberger et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

This paper proposes a frequency domain diversity approach for two or more microphone signals, for example, for in-car applications. The microphones should be positioned separately to ensure diverse signal conditions and incoherent recording of noise. This enables a better compromise for the microphone position with respect to different speaker sizes and noise sources. This work proposes a two-stage approach. In the first stage, the microphone signals are weighted with respect to their signal-to-noise ratio and then summed, similar to maximum ratio combining. The combined signal is then used as a reference for a frequency domain least-mean-squares (LMS) filter for each input signal. The output SNR is significantly improved compared to coherence-based noise reduction systems, even if one microphone is heavily corrupted by noise.

1. Introduction

With in-car speech applications like hands-free car kits and speech recognition systems, speech is corrupted by engine noise and other noise sources like airflow from electric fans or car windows. For safety and comfort reasons, hands-free telephone systems should provide the same quality of speech as conventional fixed telephones. In practice, however, the speech quality of a hands-free car kit heavily depends on the particular position of the microphone. Speech has to be picked up as directly as possible to reduce reverberation and to provide a sufficient signal-to-noise ratio. The important question of where to place the microphone inside the car is, however, difficult to answer. The position is apparently a compromise for different speaker sizes, because the distance between microphone and speaker depends significantly on the position of the driver and therefore on the size of the driver. Furthermore, noise sources like airflow from electric fans or car windows have to be considered. Placing two or more microphones in different positions enables a better compromise with respect to different speaker sizes and yields more noise robustness.

Today, noise reduction in hands-free car kits and in-car speech recognition systems is usually based on single-channel noise reduction or beamformer arrays [1–3]. Good noise robustness of single microphone systems requires the use of single channel noise suppression techniques, most of them derived from spectral subtraction [4]. Such noise reduction algorithms improve the signal-to-noise ratio, but they usually introduce undesired speech distortion. Microphone arrays can improve the performance compared to single microphone systems. Nevertheless, the signal quality still depends on the speaker position. Moreover, the microphones are located in close proximity. Therefore, microphone arrays are often vulnerable to airflow that might disturb all microphone signals.

Alternatively, multimicrophone setups have been proposed that combine the processed signals of two or more separate microphones. The microphones are positioned separately (e.g., 40 to 80 cm apart) in order to ensure incoherent recording of noise [5–11]. Similar multichannel signal processing systems have been suggested to reduce signal distortion due to reverberation [12, 13]. Basically, all these approaches exploit the fact that speech components in the microphone signals are strongly correlated while the noise components are only weakly correlated if the distance between the microphones is sufficiently large.


The question at hand with distributed arrays is how to combine these microphone signals with possibly rather different signal conditions. In this paper, we consider a diversity technique that combines the processed signals of several separate microphones. The basic idea of our approach is to apply maximum-ratio-combining (MRC) to speech signals, where we propose a frequency domain diversity approach for two or more microphone signals. MRC maximizes the signal-to-noise ratio in the combined signal.

A major issue for the application of maximum-ratio-combining in multimicrophone setups is the estimation of the acoustic transfer functions. In telecommunications, the signal attenuation as well as the phase shift of each transmission path is usually measured to apply MRC. With speech applications we have no means to directly measure the acoustic transfer functions. There exist several blind approaches to estimate the acoustic transfer functions (see, e.g., [14–16]), which have been successfully applied to dereverberation. However, the proposed estimation methods are computationally demanding.

In this paper, we show that maximum-ratio-combining can be achieved without explicit knowledge of the acoustic transfer functions. Proper signal weighting can be achieved based on an estimate of the input signal-to-noise ratio. We propose a two-stage processing of the microphone signals. In the first stage, the microphone signals are weighted with respect to their input signal-to-noise ratio. These weights guarantee maximum-ratio-combining of the signals with respect to the signal magnitudes. To ensure cophasal addition of the weighted signals, we use the combined signal as the reference signal for frequency domain LMS filters in the second stage. These filters adjust the phases of the microphone signals to guarantee coherent signal combining.

The proposed concept is similar to the single channel noise reduction system presented by Mukherjee and Gwee [17]. This system uses spectral subtraction to obtain a crude estimate of the speech signal. This estimate is then used as the reference signal of a single LMS filter. In this paper, we generalize this concept to multimicrophone systems, where our aim is not only noise reduction but also dereverberation of the microphone signals.

The paper is organized as follows: In Section 2, we present some measurement results obtained in a car environment. These results motivate the proposed diversity approach. In Section 3, we present a signal combiner that achieves MRC weighting based on knowledge of the input signal-to-noise ratios. Coherence-based signal combining is discussed in Section 4. In the subsequent section, we consider implementation issues. In particular, we present an estimator for the required input signal-to-noise ratios. Finally, in Section 6, we present some simulation results for different real-world noise situations.

2. Measurement Results

The basic idea of our spectral combining approach is to apply MRC to speech signals. To motivate this approach, we first discuss some measurement results obtained in a car

[Figure 1: input SNR (dB) versus frequency (0–5000 Hz) for mic. 1 and mic. 2.]

Figure 1: Input SNR values for a driving situation at a car speed of 100 km/h.

environment. For these measurements, we used two cardioid microphones with positions suited for car integration. One microphone (denoted by mic. 1) was installed close to the inside mirror. The second microphone (mic. 2) was mounted at the A-pillar.

Figure 1 depicts the SNR versus frequency for a driving situation at a car speed of 100 km/h. From this figure, we observe that the SNR values are quite distinct for these two microphone positions, with differences of up to 10 dB depending on the particular frequency. We also note that the better microphone position is not obvious in this case, because the SNR curves cross several times.

Theoretically, MRC combining of the two input signals would result in an output SNR equal to the sum of the input SNR values. With two inputs, MRC achieves a maximum gain of 3 dB for equal input SNR values. In case the input SNR values are rather different, the sum is dominated by the maximum value. Hence, for the curves in Figure 1 the output SNR would essentially be the envelope of the two curves.

Next we consider the coherence for the noise and speech signals. The corresponding results are depicted in Figure 2. The figure presents measurements for two microphones installed close to the inside mirror in an end-fire beamformer constellation with a microphone distance of 7 cm. The lower figure contains the results for the microphone positions mic. 1 and mic. 2 (distance of 65 cm). From these results, we observe that the noise coherence closely follows the theoretical coherence function (dotted line in Figure 2) of an ideal diffuse sound field [18]. Separating the microphones significantly reduces the noise coherence for low frequencies. On the other hand, both microphone constellations have similar speech coherence. We note that the speech coherence is not ideal, as it has steep dips. The corresponding frequencies will probably be attenuated by a signal combiner that is solely based on coherence.

3. Spectral Combining

In this section, we present the basic system concept. To simplify the discussion, we assume that all signals are stationary and that the acoustic system is linear and time-invariant.


[Figure 2: magnitude-squared coherence |γ_x1x2(f)|² versus frequency (0–5000 Hz) for noise and speech, compared with the theoretical diffuse-field coherence; (a) end-fire pair with 7 cm spacing, (b) microphones mic. 1 and mic. 2.]

Figure 2: Coherence for noise and speech signals for two different microphone positions.

In the subsequent section we consider the modifications for nonstationary signals and time-variant systems.

We consider a scenario with M microphones. The microphone signals yi(k) can be modeled by the convolution of the speech signal x(k) with the impulse response hi(k) of the acoustic system plus additive noise ni(k). Hence, the M microphone signals yi(k) can be expressed as

yi(k) = hi(k) ∗ x(k) + ni(k),   (1)

where ∗ denotes convolution.

To apply the diversity technique, it is convenient to consider the signals in the frequency domain. Let X(f) be the spectrum of the speech signal x(k) and Yi(f) be the spectrum of the ith microphone signal yi(k). The speech signal is linearly distorted by the acoustic transfer function Hi(f) and corrupted by the noise term Ni(f). Hence, the signal observed at the ith microphone has the spectrum

Yi(f) = X(f)Hi(f) + Ni(f).   (2)

In the following, we assume that the speech signal and the channel coefficients are uncorrelated. We assume a complex Gaussian distribution of the noise terms Ni(f). Moreover, we presume that the noise power spectral density λN(f) = E{|Ni(f)|²} is the same for all microphones. This assumption is reasonable for a diffuse sound field.

Our aim is to linearly combine the M microphone signals Yi(f) so that the signal-to-noise ratio in the combined signal X̂(f) is maximized. In the frequency domain, the signal combining can be expressed as

X̂(f) = Σ_{i=1}^{M} Gi(f) Yi(f),   (3)

where Gi(f) is the weight of the ith microphone signal. With (2) we have

X̂(f) = X(f) Σ_{i=1}^{M} Gi(f)Hi(f) + Σ_{i=1}^{M} Gi(f)Ni(f),   (4)

where the first sum represents the speech component and the second sum represents the noise component of the combined signal. Hence, the overall signal-to-noise ratio of the combined signal is

γ(f) = E{|X(f) Σ_{i=1}^{M} Gi(f)Hi(f)|²} / E{|Σ_{i=1}^{M} Gi(f)Ni(f)|²}.   (5)

3.1. Maximum-Ratio-Combining. The optimal combining strategy that maximizes the signal-to-noise ratio in the combined signal X̂(f) is usually called maximal-ratio-combining (MRC) [19]. In this section, we briefly outline the derivation of the MRC weights for completeness. Furthermore, some of the properties of maximal ratio combining are discussed.

Let λX(f) = E{|X(f)|²} be the speech power spectral density. Assuming that the noise power λN(f) is the same for all microphones and that the noise at the different microphones is uncorrelated, we have

γ(f) = λX(f) |Σ_{i=1}^{M} Gi(f)Hi(f)|² / ( λN(f) Σ_{i=1}^{M} |Gi(f)|² ).   (6)

We now consider the term |Σ_{i=1}^{M} Gi(f)Hi(f)|² in the numerator of (6). Using the Cauchy-Schwarz inequality we have

|Σ_{i=1}^{M} Gi(f)Hi(f)|² ≤ Σ_{i=1}^{M} |Gi(f)|² · Σ_{i=1}^{M} |Hi(f)|²   (7)

with equality if Gi(f) = c Hi*(f), where Hi* is the complex conjugate of the channel coefficient Hi. Here c is a real-valued constant common to all weights Gi(f). Thus, for the signal-to-noise ratio we obtain

γ(f) ≤ λX(f) Σ_{i=1}^{M} |Gi(f)|² Σ_{i=1}^{M} |Hi(f)|² / ( λN(f) Σ_{i=1}^{M} |Gi(f)|² )
     = (λX(f)/λN(f)) Σ_{i=1}^{M} |Hi(f)|².   (8)

With the weights Gi(f) = c Hi*(f), we obtain the maximum signal-to-noise ratio of the combined signal as the sum of the signal-to-noise ratios of the M received signals:

γ(f) = Σ_{i=1}^{M} γi(f),   (9)


where

γi(f) = λX(f) |Hi(f)|² / λN(f)   (10)

is the input signal-to-noise ratio of the ith microphone. It is appropriate to choose c as

c_MRC(f) = 1 / Σ_{j=1}^{M} |Hj(f)|².   (11)

This leads to the MRC weights

G_MRC^(i)(f) = c_MRC(f) Hi*(f) = Hi*(f) / Σ_{j=1}^{M} |Hj(f)|²,   (12)

and the estimated (equalized) speech spectrum

X̂ = G_MRC^(1) Y1 + G_MRC^(2) Y2 + G_MRC^(3) Y3 + ···
  = (H1* / Σ_{i=1}^{M} |Hi|²) Y1 + (H2* / Σ_{i=1}^{M} |Hi|²) Y2 + ···
  = H1*(H1X + N1) / Σ_{i=1}^{M} |Hi|² + H2*(H2X + N2) / Σ_{i=1}^{M} |Hi|² + ···
  = X + (H1* / Σ_{i=1}^{M} |Hi|²) N1 + (H2* / Σ_{i=1}^{M} |Hi|²) N2 + ···
  = X + G_MRC^(1) N1 + G_MRC^(2) N2 + ··· ,   (13)

where we have omitted the dependency on f. The estimated speech spectrum X̂(f) is therefore equal to the actual speech spectrum X(f) plus some weighted noise term.

The filter defined in (12) was previously applied to speech dereverberation by Gannot and Moonen in [14], because it ideally equalizes the microphone signals if a sufficiently accurate estimate of the acoustic transfer functions is available. The problem at hand with maximum-ratio-combining is that it is rather difficult and computationally complex to explicitly estimate the acoustic transfer characteristic Hi(f) for our microphone system.

In the next section, we show that MRC combining can be achieved without explicit knowledge of the acoustic channels. The weights for the different microphones can be calculated based on an estimate of the signal-to-noise ratio for each microphone. The proposed filter achieves a signal-to-noise ratio according to (9), but does not guarantee perfect equalization.

3.2. Diversity Combining for Speech Signals. We consider the weights

G_SC^(i)(f) = √( γi(f) / Σ_{j=1}^{M} γj(f) ).   (14)

Assuming the noise power is the same for all microphones and substituting γi(f) by (10) leads to

G_SC^(i)(f) = √( |Hi(f)|² / Σ_{j=1}^{M} |Hj(f)|² ) = |Hi(f)| / √( Σ_{j=1}^{M} |Hj(f)|² ).   (15)

Hence, we have

G_SC^(i)(f) = c_SC(f) |Hi(f)|   (16)

with

c_SC(f) = 1 / √( Σ_{j=1}^{M} |Hj(f)|² ).   (17)

We observe that the weight G_SC^(i)(f) is proportional to the magnitude of the MRC weight Hi*(f), because the factor c_SC is the same for all M microphone signals. Consequently, coherent addition of the sensor signals weighted with the gain factors G_SC^(i)(f) still leads to a combining where the signal-to-noise ratio at the combiner output is the sum of the input SNR values. However, coherent addition requires an additional phase estimate. Let φi(f) denote the phase of Hi(f) at frequency f. Assuming cophasal addition, the estimated speech spectrum is

X̂ = G_SC^(1) e^(−jφ1) Y1 + G_SC^(2) e^(−jφ2) Y2 + G_SC^(3) e^(−jφ3) Y3 + ···
  = (1/c_SC) X + G_SC^(1) e^(−jφ1) N1 + G_SC^(2) e^(−jφ2) N2 + ··· .   (18)

Hence, in the case of stationary signals the term

1/c_SC(f) = √( Σ_{j=1}^{M} |Hj(f)|² )   (19)

can be interpreted as the resulting transfer characteristic of the system. An example is depicted in Figure 3. The upper figure presents the measured transfer characteristics for two microphones in a car environment. Note that the microphones have a high-pass characteristic and attenuate signal components for frequencies below 1 kHz. The lower figure is the curve 1/c_SC(f). The spectral combiner equalizes most of the deep dips in the transfer functions from the mouth of the speaker to the microphones, while the envelope of the transfer functions is not equalized.
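For illustration (not the authors' code; the function name and example values are placeholders), the SNR-based weights (14) can be computed per frequency bin without any channel knowledge:

```python
# Sketch of the SNR-based spectral combining weights, eq. (14): only per-bin input SNR estimates
# gamma[i, k] are required, no acoustic transfer functions.
import numpy as np

def spectral_combining_weights(gamma: np.ndarray) -> np.ndarray:
    """gamma: array of shape (M, K) with input SNR per microphone i and frequency bin k."""
    return np.sqrt(gamma / np.sum(gamma, axis=0, keepdims=True))

# Example with two microphones and three bins (values are arbitrary).
gamma = np.array([[10.0, 1.0, 4.0],
                  [ 2.0, 8.0, 4.0]])
G = spectral_combining_weights(gamma)
print(G)                     # larger weight where the local SNR is higher
print(np.sum(G**2, axis=0))  # the squared weights sum to one in every bin
```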

3.3. Magnitude Combining. One challenge in multimicrophone systems with spatially separated microphones is a reliable phase estimation for the different input signals. For a coherent combining of the speech signals, we have to compensate the phase difference between the speech signals at each microphone. For this, it is sufficient to estimate the phase differences with respect to a reference microphone, for example, to the first microphone: Δi(f) = φ1(f) − φi(f) for all i = 2, ..., M. Cophasal addition is then achieved by

X̂ = G_SC^(1) Y1 + G_SC^(2) e^(jΔ2) Y2 + G_SC^(3) e^(jΔ3) Y3 + ··· .   (20)

However, a reliable estimation of the phase differences is only possible in speech active periods and, furthermore, only for those frequencies where speech is present. Estimating the phase differences as

e^(jΔi(f)) = E{ Y1(f) Yi*(f) / (|Y1(f)| |Yi(f)|) }   (21)

leads to unreliable phase values for time-frequency points without speech. In particular, if Hi(f) = 0 for some frequency f, the estimated phase Δi(f) is undefined. A combining using this estimate leads to additional signal distortions. Additionally, noise correlation would distort the phase estimation. A coarse estimate of the phase difference can also be obtained from the time-shift τi between the speech components in the microphone signals, for example, using the generalized correlation method [20]. The estimate is then Δi(f) ≈ 2πf τi. Note that a combiner using these phase values would in a certain manner be equivalent to a delay-and-sum beamformer. However, for distributed microphone arrays in reverberant environments this phase compensation leads to a poor estimate of the actual phase differences.

Because of the drawbacks that come along with the phase estimation methods described above, we propose another scheme, namely a two-stage combining approach. In the first stage, we use the spectral combining approach described in Section 3.2 with a simple magnitude combining of the microphone signals. For the magnitude combining, the noisy phase of the first microphone signal is adopted for the other microphone signals. This is also visible in Figure 5, where the phase of the noisy spectrum e^(jφ1(f)) is applied to the spectrum at the output of the filter G_SC^(2)(f) before the signals are combined. This leads to the following incoherent combining of the input signals:

X̂(f) = G_SC^(1)(f) Y1(f) + G_SC^(2)(f) |Y2(f)| e^(jφ1(f)) + ··· + G_SC^(M)(f) |YM(f)| e^(jφ1(f)).   (22)

The estimated speech spectrum X̂(f) is equal to

X(f) e^(jφ1(f)) / c̃_SC(f)   (23)

plus some weighted noise terms. It follows from the triangle inequality that

1/c̃_SC(f) ≤ 1/c_SC(f) = √( Σ_{j=1}^{M} |Hj(f)|² ).   (24)

Magnitude combining therefore does not guarantee maximum-ratio-combining. Yet the signal X̂(f) is taken as a reference signal in the second stage, where the phase compensation is done. This coherence-based signal combining scheme is described in the following section.

4. Coherence-Based Combining

As an example of a coherence-based diversity system, we first consider the two-microphone approach by Martin and Vary [5, 6] as depicted in Figure 4. Martin and Vary

[Figure 3: (a) measured transfer characteristics H1(f) and H2(f) (dB) to the two microphones and (b) the overall transfer characteristic 1/c_SC (dB), both over 500–5000 Hz.]

Figure 3: Transfer characteristics to the microphones and of the combined signal.

applied the dereverberation principle of Allen et al. [13] to noise reduction. In particular, they proposed an LMS-based time domain algorithm to combine the different microphone signals. This approach provides effective noise suppression for frequencies where the noise components of the microphone signals are uncorrelated.

However, as we have seen in Section 2, for practical microphone distances in the range of 0.4 to 0.8 m the noise signals are correlated at low frequencies. These correlations reduce the noise suppression capabilities of the algorithm and lead to musical noise.

We will show in this section that a combination of the spectral combining with the coherence-based approach by Martin and Vary reduces these issues.

4.1. Analysis of the LMS Approach. We now present an analysis of the scheme by Martin and Vary as depicted in Figure 4. The filter gi(k) is adapted using the LMS algorithm. For stationary signals x(k), n1(k), and n2(k), the adaptation converges to filter coefficients gi(k) and a corresponding filter transfer function

G_LMS^(i)(f) = E{Yi*(f)Yj(f)} / E{|Yi(f)|²},  i ≠ j,   (25)

that minimizes the expected value

E{ |Yi(f) G_LMS^(i)(f) − Yj(f)|² },   (26)

where E{Yi*(f)Yj(f)} is the cross-power spectrum of the two microphone signals and E{|Yi(f)|²} is the power spectrum of the ith microphone signal.


[Figure 4: block diagram of the two-microphone LMS structure with inputs y1(k) = x(k) ∗ h1(k) + n1(k) and y2(k) = x(k) ∗ h2(k) + n2(k), adaptive filters g1(k) and g2(k), and averaging (factor 0.5) of the two filter outputs to form x̂(k).]

Figure 4: Basic system structure of the LMS approach.

Assuming that the speech signal and the noise signals are uncorrelated, (25) can be written as

G_LMS^(i)(f) = ( E{|X(f)|²} Hi*(f)Hj(f) + E{Ni*(f)Nj(f)} ) / ( E{|X(f)|²} |Hi(f)|² + E{|Ni(f)|²} ).   (27)

For frequencies where the noise components are uncorrelated, that is, E{Ni*(f)Nj(f)} = 0, this formula reduces to

G_LMS^(i)(f) = E{|X(f)|²} Hi*(f)Hj(f) / ( E{|X(f)|²} |Hi(f)|² + E{|Ni(f)|²} ).   (28)

The filter G_LMS^(i)(f) according to (28) is in fact a minimum mean squared error (MMSE) estimate of the signal X(f)Hj(f) based on the signal Yi(f). Hence, the weighted output is a combination of the MMSE estimates of the speech components of the two input signals. This explains the good noise reduction properties of the approach by Martin and Vary.

On the other hand, the coherence of the noise depends strongly on the distance between the microphones. For in-car applications, practical distances are in the range of 0.4 to 0.8 m. Therefore, only the noise components for frequencies above 1 kHz can be considered to be uncorrelated [6].

According to formula (27), the noise correlation leads to a bias

E{Ni*(f)Nj(f)} / E{|Yi(f)|²}   (29)

of the filter transfer function. An approach to correct the filter bias by estimating the noise cross-power density was presented in [21]. Another issue with speech enhancement solely based on the LMS approach is that the speech signals at the microphone inputs may only be weakly correlated for some frequencies, as shown in Section 2. Consequently, these frequency components will be attenuated in the output signals.

In the following, we discuss a modified LMS approach, where we first combine the microphone signals to obtain an improved reference signal for the adaptation of the LMS filters.

4.2. Combining MRC and LMS. To ensure suitable weighting and coherent signal addition, we combine the diversity technique with the LMS approach to process the signals of the different microphones. It is informative to examine the combined approach under ideal conditions, that is, assuming ideal MRC weighting.

Analogously to (13), weighting with the MRC gain factors according to (12) results in the estimate

X̂(f) = X(f) + G_MRC^(1)(f) N1(f) + G_MRC^(2)(f) N2(f) + ··· .   (30)

We now use the estimate X̂(f) as the reference signal for the LMS algorithm. That is, we adapt a filter for each input signal such that the expected value

E{ |Yi(f) G_LMS^(i)(f) − X̂(f)|² }   (31)

is minimized. The adaptation results in the filter transfer functions

G_LMS^(i)(f) = E{Yi*(f) X̂(f)} / E{|Yi(f)|²}.   (32)

Assuming that the speech signal and the noise signals are uncorrelated and substituting X̂(f) according to (30) leads to

G_LMS^(i)(f) = E{Yi*(f) X(f)} / E{|Yi(f)|²}   (33)
             + G_MRC^(i)(f) E{|Ni(f)|²} / E{|Yi(f)|²}   (34)
             + G_MRC^(j)(f) E{Ni*(f)Nj(f)} / E{|Yi(f)|²} + ··· .   (35)

The first term

E{Yi*(f) X(f)} / E{|Yi(f)|²} = Hi(f)* E{|X(f)|²} / ( |Hi(f)|² E{|X(f)|²} + E{|Ni(f)|²} )   (36)


in this sum is the Wiener filter that results in a minimum mean squared error estimate of the signal X(f) based on the signal Yi(f). The Wiener filter equalizes the microphone signal and minimizes the mean squared error between the filter output and the actual speech signal X(f). Note that the phase of the term in (36) is −φi, that is, the filter compensates the phase of the acoustic transfer function Hi(f).

The other terms in the sum can be considered as filter biases, where the term in (34) depends on the noise power density of the ith input. The remaining terms depend on the noise cross-power and vanish for uncorrelated noise signals. However, noise correlation might distort the phase estimation.

Similarly, when we consider the actual reference signal X̂(f) according to (22), the filter equation for G_LMS^(i)(f) contains the term

Hi(f)* E{|X(f)|²} e^(jφ1(f)) / ( c̃_SC(f) ( |Hi(f)|² E{|X(f)|²} + E{|Ni(f)|²} ) )   (37)

with the sought phase Δi(f) = φ1(f) − φi(f). If the correlation of the noise terms is sufficiently small, we obtain the estimated phase

Δi(f) = arg{ G_LMS^(i)(f) }.   (38)

The LMS algorithm implicitly estimates the phase differences between the reference signal X̂(f) and the input signals Yi(f). Hence, the spectra at the outputs of the filters G_LMS^(i)(f) are in phase. This enables a cophasal addition of the signals according to (20).

By estimating the noise power and noise cross-power densities we could correct the biases of the LMS filter transfer functions. Similarly, reducing the noisy signal components in (30) diminishes the filter biases. In the following, we will pursue the latter approach.
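As a rough illustration of this second stage (not the authors' code; array shapes, the normalized step size, and the function name are assumptions), each microphone spectrum can be filtered by one complex coefficient per bin that is adapted against the combined reference, so that the filter outputs end up in phase:

```python
# Sketch of per-channel frequency domain LMS filters adapted towards the combined reference X_hat,
# following (31)-(32): one complex coefficient per frequency bin and channel.
import numpy as np

def flms_update(G, Y_frame, X_ref, mu=0.1, eps=1e-10):
    """One adaptation step.
    G:       complex filter coefficients, shape (M, K)
    Y_frame: current STFT frame of the M microphones, shape (M, K)
    X_ref:   combined reference spectrum of this frame, shape (K,)
    """
    E = X_ref[None, :] - G * Y_frame                                    # error per channel and bin
    G = G + mu * np.conj(Y_frame) * E / (np.abs(Y_frame) ** 2 + eps)    # normalized LMS step
    return G, E

# After convergence, the outputs G[i] * Y_frame[i] are approximately in phase with X_ref,
# so they can be added coherently as in (44).
```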

4.3. Noise Suppression. Maximum-ratio-combining provides an optimum weighting of the M sensor signals. However, it does not necessarily suppress the noisy signal components. We therefore combine the spectral combining with an additional noise suppression filter. Of the numerous noise reduction techniques proposed in the literature, we consider only spectral subtraction [4], which supplements the spectral combining quite naturally. The basic idea of spectral subtraction is to subtract an estimate of the noise floor from an estimate of the spectrum of the noisy signal.

Estimating the overall SNR according to (9), the spectral subtraction filter (see, e.g., [1, page 239]) for the combined signal X̂(f) can be written as

G_NS(f) = √( γ(f) / (1 + γ(f)) ).   (39)

Multiplying this filter transfer function with (14) leads to the term

√( γi(f)/γ(f) ) · √( γ(f)/(1 + γ(f)) ) = √( γi(f) / (1 + γ(f)) ).   (40)

This formula shows that noise suppression can be introduced by simply adding a constant to the denominator term in (14).

Most, if not all, implementations of spectral subtraction are based on an over-subtraction approach, where an overestimate of the noise power is subtracted from the power spectrum of the input signal (see, e.g., [22–25]). Over-subtraction can be included in (40) by using a constant ρ larger than one. This leads to the final gain factor

G_SC^(i)(f) = √( γi(f) / (ρ + γ(f)) ).   (41)

The parameter ρ hardly affects the gain factors for high signal-to-noise ratios, retaining optimum weighting. For low signal-to-noise ratios this term leads to an additional attenuation. The over-subtraction factor is usually a function of the SNR; sometimes it is also chosen differently for different frequency bands [25].

5. Implementation Issues

Real-world speech and noise signals are nonstationary processes. For an implementation of the spectral weighting, we have to consider short-time spectra of the microphone signals and estimate the short-time power spectral densities (PSD) of the speech signal and the noise components.

Therefore, the noisy signal yi(k) is transformed into the frequency domain using a short-time Fourier transform of length L. Each block of L consecutive samples is multiplied with a Hamming window. Subsequent blocks overlap by K samples. Let Yi(κ, ν), Xi(κ, ν), and Ni(κ, ν) denote the corresponding short-time spectra, where κ is the subsampled time index and ν is the frequency bin index.

5.1. System Structure. The processing system for two inputs is depicted in Figure 5. The spectrum X̂(κ, ν) results from incoherent magnitude combining of the input signals:

X̂(κ, ν) = G_SC^(1)(κ, ν) Y1(κ, ν) + G_SC^(2)(κ, ν) |Y2(κ, ν)| e^(jφ1(κ,ν)) + ··· ,   (42)

where

G_SC^(i)(κ, ν) = √( γi(κ, ν) / (ρ + γ(κ, ν)) ).   (43)

The power spectral density of speech signals varies relatively fast over time. Therefore, the FLMS algorithm requires a quick update, that is, a large step size. If the step size is sufficiently large, the magnitudes of the FLMS filters G_LMS^(i)(κ, ν) follow the filters G_SC^(i)(κ, ν). Because the spectra at the outputs of the filters G_LMS^(i)(f) are in phase, we obtain the estimated speech spectrum as

X̂(κ, ν) = G_LMS^(1)(κ, ν) Y1(κ, ν) + G_LMS^(2)(κ, ν) Y2(κ, ν) + ··· .   (44)


[Figure 5: block diagram of the two-input diversity system: windowing and FFT of y1(k) and y2(k); SNR and gain computation of G_SC^(1)(κ, ν) and G_SC^(2)(κ, ν); magnitude combining with the phase e^(jφ1(κ,ν)) to form X̂(κ, ν); FLMS filters G_LMS^(1)(κ, ν) and G_LMS^(2)(κ, ν); IFFT and overlap-add to obtain x̂(k).]

Figure 5: Basic system structure of the diversity system with two inputs.

To perform spectral combining we have to estimate the current signal-to-noise ratio based on the noisy microphone input signals. In the next sections, we propose a simple and efficient method to estimate the noise power spectral densities of the microphone inputs.

5.2. PSD Estimation. Commonly, the noise PSD is estimated in speech pauses, where the pauses are detected using voice activity detection (VAD; see, e.g., [24, 26]). VAD-based methods provide good estimates for stationary noise. However, they may suffer from error propagation if subsequent decisions are not independent. Other methods, like the minimum statistics approach introduced by Martin [23, 27], use a continuous estimation that does not explicitly differentiate between speech pauses and speech active segments.

Our estimation method combines the VAD approach with the minimum statistics (MS) method. Minimum statistics is a robust technique to estimate the power spectral density of non-stationary noise by tracing the minimum of the recursively smoothed power spectral density within a time window of 1 to 2 seconds. We use these MS estimates and a simple threshold test to determine voice activity for each time-frequency point.

The proposed method prevents error propagation, because the MS approach is independent of the VAD. During speech pauses the noise PSD estimation can be enhanced compared with an estimate solely based on minimum statistics. A similar time-frequency dependent VAD was presented by Cohen to enhance the noise power spectral density estimation of minimum statistics [28].

For time-frequency points (κ, ν) where the speech signal is inactive, the noise PSD E{|Ni(κ, ν)|²} can be approximated by recursive smoothing

E{|Ni(κ, ν)|²} ≈ λ_{Y,i}(κ, ν)   (45)

with

λ_{Y,i}(κ, ν) = (1 − α) λ_{Y,i}(κ − 1, ν) + α |Yi(κ, ν)|²,   (46)

where α ∈ (0, 1) is the smoothing parameter.

During speech active periods the PSD can be estimated using the minimum statistics method introduced by Martin [23, 27]. With this approach, the noise PSD estimate is determined by the minimum value

λ_{min,i}(κ, ν) = min_{l ∈ [κ−W+1, κ]} { λ_{Y,i}(l, ν) }   (47)

within a sliding window of W consecutive values of λ_{Y,i}(κ, ν). The noise PSD is then estimated by

E{|Ni(κ, ν)|²} ≈ o_min · λ_{min,i}(κ, ν),   (48)

where o_min is a parameter of the algorithm and should be approximated as

o_min = 1 / E{λ_min}.   (49)

The MS approach provides a rough estimate of the noisepower that strongly depends on the smoothing parameter αand the window size of the sliding window (for details cf.[27]). However, this estimate can be obtained regardless ofspeech being present or not.

The idea of our approach is to approximate the PSD by the MS estimate during speech active periods, while the smoothed input power is used for time-frequency points where speech is absent:

E\{|N_i(\kappa,\nu)|^2\} \approx \beta(\kappa,\nu)\, o_{\min} \cdot \lambda_{\min,i}(\kappa,\nu) + \bigl(1-\beta(\kappa,\nu)\bigr)\,\lambda_{Y,i}(\kappa,\nu),    (50)

where β(κ, ν) ∈ {0, 1} is an indicator function for speech activity which will be discussed in more detail in the next section.

The current signal-to-noise ratio is then obtained by

\gamma_i(\kappa,\nu) = \frac{E\{|Y_i(\kappa,\nu)|^2\} - E\{|N_i(\kappa,\nu)|^2\}}{E\{|N_i(\kappa,\nu)|^2\}},    (51)

assuming that the noise and speech signals are uncorrelated.
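As a concrete illustration, the following NumPy sketch combines the recursive smoothing (46), the minimum statistics search (47)-(48), the activity-dependent switching (50), and the SNR estimate (51). It assumes the STFT of one microphone and a per-bin activity indicator β are already available; the function name and the parameter values (α, W, o_min) are illustrative choices, not the ones used in the paper.

```python
import numpy as np

def estimate_noise_psd_and_snr(Y, beta, alpha=0.1, W=100, o_min=1.5):
    """Noise PSD and SNR estimation following (45)-(51).

    Y    : complex STFT of one microphone, shape (frames, bins)
    beta : speech-activity indicator per time-frequency point, same shape,
           values in {0, 1} (1 = speech active)
    alpha, W, o_min are illustrative parameter choices.
    """
    n_frames, _ = Y.shape
    power = np.abs(Y) ** 2

    lambda_Y = np.zeros_like(power)          # recursively smoothed input power, (46)
    lambda_N = np.zeros_like(power)          # noise PSD estimate, (50)
    lambda_Y[0] = power[0]

    for k in range(1, n_frames):
        lambda_Y[k] = (1.0 - alpha) * lambda_Y[k - 1] + alpha * power[k]

    for k in range(n_frames):
        lo = max(0, k - W + 1)
        lambda_min = lambda_Y[lo:k + 1].min(axis=0)          # minimum statistics, (47)
        # During speech activity use the bias-compensated minimum statistics
        # estimate (48); in speech pauses use the smoothed input power (45).
        lambda_N[k] = beta[k] * o_min * lambda_min + (1.0 - beta[k]) * lambda_Y[k]

    snr = np.maximum(power - lambda_N, 0.0) / np.maximum(lambda_N, 1e-12)   # (51)
    return lambda_N, snr
```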

5.3. Voice Activity Detection. Human speech contains gaps not only in the time but also in the frequency domain. It is therefore reasonable to estimate the voice activity in the time-frequency domain in order to obtain a more accurate VAD.


The VAD function β(κ, ν) can then be calculated based on the current input noise PSD obtained by minimum statistics.

Our aim is to determine for each time-frequency point (κ, ν) whether the speech signal is active or inactive. We therefore consider the two hypotheses H₁(κ, ν) and H₀(κ, ν), which indicate speech presence or absence at the time-frequency point (κ, ν), respectively. We assume that the coefficients X(κ, ν) and N_i(κ, ν) of the short-time spectra of both the speech and the noise signal are complex Gaussian random variables. In this case, the current input power, that is, the squared magnitude |Y_i(κ, ν)|², is exponentially distributed with mean (power spectral density)

\lambda_{Y_i}(\kappa,\nu) = E\{|Y_i(\kappa,\nu)|^2\}.    (52)

Similarly, we define

\lambda_{X_i}(\kappa,\nu) = |H_i(\kappa,\nu)|^2\, E\{|X(\kappa,\nu)|^2\},
\lambda_{N_i}(\kappa,\nu) = E\{|N_i(\kappa,\nu)|^2\}.    (53)

We assume that speech and noise are uncorrelated. Hence, we have

\lambda_{Y_i}(\kappa,\nu) = \lambda_{X_i}(\kappa,\nu) + \lambda_{N_i}(\kappa,\nu)    (54)

during speech active periods and

\lambda_{Y_i}(\kappa,\nu) = \lambda_{N_i}(\kappa,\nu)    (55)

in speech pauses.

In the following, we occasionally omit the dependency on κ and ν in order to keep the notation lucid. The conditional probability density functions of the random variable Y_i = |Y_i(κ, ν)|² are [29]

f(Y_i \mid H_0) = \begin{cases} \dfrac{1}{\lambda_{N_i}} \exp\!\left(\dfrac{-Y_i}{\lambda_{N_i}}\right), & Y_i \ge 0,\\[4pt] 0, & Y_i < 0, \end{cases}    (56)

f(Y_i \mid H_1) = \begin{cases} \dfrac{1}{\lambda_{X_i}+\lambda_{N_i}} \exp\!\left(\dfrac{-Y_i}{\lambda_{X_i}+\lambda_{N_i}}\right), & Y_i \ge 0,\\[4pt] 0, & Y_i < 0. \end{cases}    (57)

Applying Bayes' rule for the conditional speech presence probability

p_i(\kappa,\nu) = P(H_1 \mid Y_i)    (58)

we have [29]

p_i(\kappa,\nu) = \left\{ 1 + \frac{(\lambda_{X_i}+\lambda_{N_i})\,q}{\lambda_{N_i}(1-q)} \exp(-u_i) \right\}^{-1},    (59)

where q(κ, ν) = P(H₀(κ, ν)) is the a priori probability of speech absence and

u_i(\kappa,\nu) = \frac{Y_i\,\lambda_{X_i}}{\lambda_{N_i}(\lambda_{X_i}+\lambda_{N_i})} = \frac{|Y_i(\kappa,\nu)|^2\,\lambda_{X_i}}{\lambda_{N_i}(\lambda_{X_i}+\lambda_{N_i})}.    (60)

The decision rule for the ith channel is based on the conditional speech presence probability

\beta_i(\kappa,\nu) = \begin{cases} 1, & \dfrac{P(H_1 \mid Y_i)}{P(H_0 \mid Y_i)} \ge T,\\[4pt] 0, & \text{otherwise}. \end{cases}    (61)

The parameter T > 0 enables a tradeoff between the two possible error probabilities of voice activity detection. A value T > 1 decreases the probability of a false alarm, that is, β(κ, ν) = 1 when speech is absent. T < 1 reduces the probability of a miss, that is, β(κ, ν) = 0 in the presence of speech. Note that the generalized likelihood-ratio test

\frac{P(H_1 \mid Y_i)}{P(H_0 \mid Y_i)} = \frac{p_i(\kappa,\nu)}{1-p_i(\kappa,\nu)} \ge T    (62)

is, according to the Neyman-Pearson lemma (see e.g., [30]), an optimal decision rule. That is, for a fixed probability of a false alarm it minimizes the probability of a miss and vice versa. The generalized likelihood-ratio test was previously used by Sohn and Sung to detect speech activity in subbands [29, 31].

The test in inequality (62) is equivalent to

p_i(\kappa,\nu)^{-1} = \left\{ 1 + \frac{(\lambda_{X,i}+\lambda_{N,i})\,q}{\lambda_{N,i}(1-q)} \exp(-u_i) \right\} \le \frac{1+T}{T},    (63)

where we have used (59). Solving for |Y_i(κ, ν)|² using (60), we obtain a simple threshold test for the ith microphone

\beta_i(\kappa,\nu) = \begin{cases} 1, & |Y_i(\kappa,\nu)|^2 \ge \lambda_{N,i}(\kappa,\nu)\,\Theta_i(\kappa,\nu),\\ 0, & \text{otherwise}, \end{cases}    (64)

with the threshold

\Theta_i(\kappa,\nu) = \left(1 + \frac{\lambda_{N,i}}{\lambda_{X,i}}\right) \log\!\left(\frac{T\,q\,\bigl(1 + (\lambda_{X,i}/\lambda_{N,i})\bigr)}{1-q}\right).    (65)

This threshold test is equivalent to the decision rule in (61). With this threshold test, speech is detected if the current input power |Y_i(κ, ν)|² is greater than or equal to the average noise power λ_{N,i}(κ, ν) times the threshold Θ_i(κ, ν). This factor depends on the input signal-to-noise ratio λ_{X,i}/λ_{N,i} and the a priori probability of speech absence q(κ, ν).

In order to combine the activity estimates for the different input signals, we use the following rule:

\beta(\kappa,\nu) = \begin{cases} 1, & \text{if } |Y_i(\kappa,\nu)|^2 \ge \lambda_{N,i}\,\Theta_i \text{ for any } i,\\ 0, & \text{otherwise}. \end{cases}    (66)
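A minimal sketch of the per-bin decision rule (64)-(66) is given below. It assumes the noise PSDs λ_{N,i} and the SNRs λ_{X,i}/λ_{N,i} have already been estimated; the default values T = 1.2 and q = 0.5 are the ones reported for the experiments in Section 6, but any other choice can be substituted, and the function name is illustrative.

```python
import numpy as np

def vad_indicator(power, lambda_N, snr, T=1.2, q=0.5):
    """Per-bin voice activity decision following (64)-(66).

    power    : |Y_i(kappa, nu)|^2 for each channel, shape (channels, bins)
    lambda_N : noise PSD estimates lambda_{N,i}, same shape
    snr      : SNR estimates lambda_{X,i}/lambda_{N,i}, same shape
    T, q     : detection threshold and a priori speech-absence probability
    """
    snr = np.maximum(snr, 1e-6)
    # Threshold from (65)
    theta = (1.0 + 1.0 / snr) * np.log(T * q * (1.0 + snr) / (1.0 - q))
    # Per-channel decision (64)
    beta_i = power >= lambda_N * theta
    # Combined decision (66): speech is declared if any channel detects it
    return beta_i.any(axis=0).astype(float)
```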

6. Simulation Results

In this section, we present some simulation results for different noise conditions typical in a car. For our simulations, we consider the same microphone setup as described in Section 2, that is, we use a two-channel diversity system,


Figure 6: Spectrogram of the microphone input (mic. 1 at car speed of 140 km/h, short speaker). The lower figure depicts the results of the voice activity detection (black representing estimated speech activity) with T = 1.2 and q = 0.5.

Figure 7: Estimated and actual noise PSD for mic. 2 at car speed of 140 km/h.

because this is probably the most interesting case for in-car applications.

With respect to three different background noise situations, we recorded driving noise at 100 km/h and 140 km/h. As the third noise situation, we considered the noise which arises from an electric fan (defroster). With an artificial head we recorded speech samples for two different seat positions. From both positions, we recorded two male and two female speech samples, each of a length of 8 seconds. For this purpose, we took the German-speaking speech samples from recommendation P.501 of the International Telecommunication Union (ITU) [32]. Hence, the evaluation was done using four different voices with two different speaker sizes, which leads to 8 different speaker configurations. For all recordings, we used a sampling rate of 11025 Hz. Table 1 contains the average SNR values for the considered noise conditions. The first values in each field are with respect to a short speaker

Table 1: Average input SNR values [dB] from mic. 1/mic. 2 for typical background noise conditions in a car.

SNR IN           100 km/h    140 km/h     defrost
short speaker    1.2/3.1     −0.7/−0.5    1.7/1.3
tall speaker     1.9/10.8    −0.1/7.2     2.4/9.0

Table 2: Log spectral distances with minimum statistics noise PSD estimation and with the proposed noise PSD estimator.

DLS [dB]    100 km/h     140 km/h     defrost
mic. 1      3.93/3.33    2.47/2.07    3.07/1.27
mic. 2      4.6/4.5      3.03/2.33    3.4/1.5

while the second ones are according to a tall person. For all algorithms, we used an FFT length of L = 512 and an overlap of 256 samples. For time windowing we apply a Hamming window.

6.1. Estimating the Noise PSD. The spectrogram of one input signal and the result of the voice activity detection are shown in Figure 6 for the worst case scenario (short speaker at a car speed of 140 km/h). It can be observed that time-frequency points with speech activity are reliably detected. Because the noise PSD is estimated with minimum statistics also during speech activity, the false alarms in speech pauses hardly affect the noise PSD estimation.

In Figure 7, we compare the estimated noise PSD with the actual PSD for the same scenario. The PSD is well approximated with only minor deviations for high frequencies. To evaluate the noise PSD estimation for several driving situations, we calculated the log spectral distance (LSD) as an objective performance measure

D_{LS} = \sqrt{\frac{1}{L}\sum_{\nu}\left[10\log_{10}\frac{\hat{\lambda}_N(\nu)}{\lambda_N(\nu)}\right]^2}    (67)

between the actual noise power spectrum λ_N(ν) and the estimate λ̂_N(ν). From the definition, it is obvious that the LSD can be interpreted as the mean distance between two PSDs in dB. An extended analysis of different distance measures is presented in [33].
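For reference, a small helper that evaluates (67) for two given PSD vectors might look as follows; the function name and the array-based interface are illustrative only.

```python
import numpy as np

def log_spectral_distance(lambda_true, lambda_est):
    """Log spectral distance (67), in dB, between an actual and an estimated PSD."""
    ratio_db = 10.0 * np.log10(lambda_est / lambda_true)
    return np.sqrt(np.mean(ratio_db ** 2))
```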

The log spectral distances of the proposed noise PSD estimator are shown in Table 2. The first number in each field is the LSD achieved with the minimum statistics approach, while the second number is the value for the proposed scheme. Note that every noise situation was evaluated with four different voices (two male and two female). From these results, we observe that the voice activity detection improves the PSD estimation for all considered driving situations.

6.2. Spectral Combining. Next we consider the spectral combining as discussed in Section 3. Figure 8 presents the output SNR values for a driving situation with a car speed of 100 km/h. For this simulation we used ρ = 0, that is, spectral combining without noise suppression. In addition to the output SNR, the curve for ideal maximum-ratio-combining


Figure 8: Output SNR values for spectral combining without additional noise suppression (car speed of 100 km/h, ρ = 0); curves: Out MRC and Ideal MRC.

Figure 9: Output SNR values for the combined approach without additional noise suppression (car speed of 100 km/h, ρ = 0); curves: Out MRC-FLMS and Ideal MRC.

is depicted. This curve is simply the sum of the input SNR values for the two microphones, which we calculated based on the actual noise and speech signals (cf. Figure 1).

We observe that the output SNR curve closely follows the ideal curve but with a loss of 1–3 dB. This loss is essentially caused by the phase differences of the input signals. With the spectral combining approach, only a magnitude combining is possible. Furthermore, the power spectral densities are estimates based on the noisy microphone signals, which leads to an additional loss in the SNR.

6.3. Combining SC and FLMS. The output SNR of the combined approach without additional noise suppression is depicted in Figure 9. It is obvious that the theoretical SNR curve for ideal MRC is closely approximated by the output SNR of the combined system. This is the result of the implicit phase estimation of the FLMS approach, which leads to a coherent combining of the speech signals.

Now we consider the combined approach with additional noise suppression (ρ = 10). Figure 10 presents the corresponding results for a driving situation with a car speed of 100 km/h. The output SNR curve still follows the ideal MRC curve but now with a gain of up to 5 dB.

In Table 3, we compare the output SNR values of the three considered noise conditions for different combining techniques. The first value is the output SNR for a short speaker while the second number represents the result for the tall speaker. The values marked with FLMS correspond to the coherence based FLMS approach with bias compensation

Figure 10: Output SNR values for the combined approach with additional noise suppression (car speed of 100 km/h, ρ = 10); curves: Out MRC-FLMS and Ideal MRC.

Table 3: Output SNR values [dB] for different combining techniques—short/tall speaker.

SNR OUT       100 km/h     140 km/h     defrost
FLMS          8.8/13.3     4.4/9.0      7.8/12.3
SC            16.3/20.9    13.3/18.0    14.9/19.9
SC + FLMS     13.5/17.8    10.5/15.0    12.5/16.9
ideal FLMS    12.6/15.2    10.5/13.3    14.5/17.3

Table 4: Cosh spectral distances for different combining techniques—short/tall speaker.

DCH           100 km/h    140 km/h    defrost
FLMS          0.9/0.9     0.9/1.0     1.2/1.2
SC            1.3/1.4     1.4/1.5     1.5/1.7
SC + FLMS     1.2/1.1     1.2/1.2     1.4/1.5
ideal FLMS    0.9/0.8     1.1/1.0     1.5/1.4

as presented in [21] (see also Section 4.1). The label SC marks results solely based on spectral combining with additional noise suppression as discussed in Sections 3 and 4.3. The results with the combined approach are labeled by SC + FLMS. Finally, the values marked with the label ideal FLMS are a benchmark obtained by using the clean and unreverberant speech signal x(k) as a reference for the FLMS algorithm.

From the results in Table 3, we observe that the spectral combining leads to a significant improvement of the output SNR compared to the coherence based noise reduction. It even outperforms the "ideal" FLMS scheme. However, the spectral combining introduces undesired speech distortions similar to single channel noise reduction. This is also indicated by the results in Table 4, which presents distance values for the different combining systems. As an objective measure of speech distortion, we calculated the cosh spectral distance (a symmetrical version of the Itakura-Saito distance) between the power spectra of the clean input signal (without reverberation and noise) and the output speech signal (filter coefficients were obtained from noisy data).

The benefit of the combined system is also indicated by the results in Table 5, which presents Mean Opinion Score


Table 5: Evaluation of the MOS test.

MOS           100 km/h    140 km/h    defrost    average
FLMS          2.58        2.77        2.10       2.49
SC            3.19        3.15        2.96       3.10
SC + FLMS     3.75        3.73        3.88       3.78
ideal FLMS    3.81        3.67        3.94       3.81

Figure 11: Spectrograms of the input and output signals with the SC + FLMS approach (car speed of 100 km/h, ρ = 10).

(MOS) values for the different algorithms. The MOS test was performed by 24 persons. The test set was presented in randomized order to avoid statistical dependences on the test order. The FLMS approach using the spectral combining output as reference signal and the "ideal" FLMS filter reference approach are rated as the best noise reduction algorithms, and the scores of the combined approach are similar to the results with the reference implementation of the "ideal" FLMS filter solution. From this evaluation, it can also be seen that the FLMS approach with spectral combining outperforms the pure FLMS and the pure spectral combining algorithms in all tested acoustic situations.

The combined approach sounds more natural compared to the pure spectral combining. The SNR and distance values are close to those of the "ideal" FLMS scheme. The speech is free of musical tones. The lack of musical noise can also be seen in Figure 11, which shows the spectrograms of the enhanced speech and the input signals.

7. Conclusions

In this paper, we have presented a diversity technique that combines the processed signals of several separate microphones. The aim of our approach was noise robustness for in-car hands-free applications, because single channel noise suppression methods are sensitive to the microphone location and in particular to the distance between speaker and microphone.

We have shown theoretically that the proposed signal weighting is equivalent to maximum-ratio-combining. Here we have assumed that the noise power spectral densities are equal for all microphone inputs. This assumption might be unrealistic. However, the simulation results for a two-microphone system demonstrate that a performance close to that of MRC can be achieved with real world noise situations.

Moreover, diversity combining is an effective means to reduce signal distortions due to reverberation and therefore improves the speech intelligibility compared to single channel noise reduction. This improvement can be explained by the fact that spectral combining equalizes frequency dips that occur only in one microphone input (cf. Figure 3).

The spectral combining requires an SNR estimate for each input signal. We have presented a simple noise PSD estimator that reliably approximates the noise power for stationary as well as nonstationary noise.

Acknowledgments

Research for this paper was supported by the German Federal Ministry of Education and Research (Grant no. 17 N11 08). Last but not least, the authors would like to thank the reviewers for their constructive comments and suggestions, which greatly improved the quality of this paper.

References

[1] E. Hansler and G. Schmidt, Acoustic Echo and Noise Control: A Practical Approach, John Wiley & Sons, New York, NY, USA, 2004.

[2] P. Vary and R. Martin, Digital Speech Transmission: Enhancement, Coding and Error Concealment, John Wiley & Sons, New York, NY, USA, 2006.

[3] E. Hansler and G. Schmidt, Speech and Audio Processing in Adverse Environments: Signals and Communication Technology, Springer, Berlin, Germany, 2008.

[4] S. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 27, no. 2, pp. 113–120, 1979.

[5] R. Martin and P. Vary, "A symmetric two microphone speech enhancement system theoretical limits and application in a car environment," in Proceedings of the Digital Signal Processing Workshop, pp. 451–452, Helsingoer, Denmark, August 1992.

[6] R. Martin and P. Vary, "Combined acoustic echo cancellation, dereverberation and noise reduction: a two microphone approach," Annales des Telecommunications, vol. 49, no. 7-8, pp. 429–438, 1994.

[7] A. A. Azirani, R. L. Bouquin-Jeannes, and G. Faucon, "Enhancement of speech degraded by coherent and incoherent noise using a cross-spectral estimator," IEEE Transactions


on Speech and Audio Processing, vol. 5, no. 5, pp. 484–487, 1997.

[8] A. Guerin, R. L. Bouquin-Jeannes, and G. Faucon, "A two-sensor noise reduction system: applications for hands-free car kit," EURASIP Journal on Applied Signal Processing, vol. 2003, no. 11, pp. 1125–1134, 2003.

[9] J. Freudenberger and K. Linhard, "A two-microphone diversity system and its application for hands-free car kits," in Proceedings of European Conference on Speech Communication and Technology (INTERSPEECH '05), pp. 2329–2332, Lisbon, Portugal, September 2005.

[10] T. Gerkmann and R. Martin, "Soft decision combining for dual channel noise reduction," in Proceedings of the 9th International Conference on Spoken Language Processing (INTERSPEECH—ICSLP '06), vol. 5, pp. 2134–2137, Pittsburgh, Pa, USA, September 2006.

[11] J. Freudenberger, S. Stenzel, and B. Venditti, "Spectral combining for microphone diversity systems," in Proceedings of European Signal Processing Conference (EUSIPCO '09), pp. 854–858, Glasgow, UK, July 2009.

[12] J. L. Flanagan and R. C. Lummis, "Signal processing to reduce multipath distortion in small rooms," Journal of the Acoustical Society of America, vol. 47, no. 6, pp. 1475–1481, 1970.

[13] J. B. Allen, D. A. Berkley, and J. Blauert, "Multimicrophone signal-processing technique to remove room reverberation from speech signals," Journal of the Acoustical Society of America, vol. 62, no. 4, pp. 912–915, 1977.

[14] S. Gannot and M. Moonen, "Subspace methods for multimicrophone speech dereverberation," EURASIP Journal on Applied Signal Processing, vol. 2003, no. 11, pp. 1074–1090, 2003.

[15] M. Delcroix, T. Hikichi, and M. Miyoshi, "Dereverberation and denoising using multichannel linear prediction," IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 6, pp. 1791–1801, 2007.

[16] I. Ram, E. Habets, Y. Avargel, and I. Cohen, "Multi-microphone speech dereverberation using LIME and least squares filtering," in Proceedings of European Signal Processing Conference (EUSIPCO '08), Lausanne, Switzerland, August 2008.

[17] K. Mukherjee and B.-H. Gwee, "A 32-point FFT based noise reduction algorithm for single channel speech signals," in Proceedings of IEEE International Symposium on Circuits and Systems (ISCAS '07), pp. 3928–3931, New Orleans, La, USA, May 2007.

[18] W. Armbruster, R. Czarnach, and P. Vary, "Adaptive noise cancellation with reference input," in Signal Processing III, pp. 391–394, Elsevier, 1986.

[19] B. Sklar, Digital Communications: Fundamentals and Applications, Prentice Hall, Upper Saddle River, NJ, USA, 2001.

[20] C. Knapp and G. Carter, "The generalized correlation method for estimation of time delay," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 24, no. 4, pp. 320–327, 1976.

[21] J. Freudenberger, S. Stenzel, and B. Venditti, "An FLMS based two-microphone speech enhancement system for in-car applications," in Proceedings of the 15th IEEE Workshop on Statistical Signal Processing (SSP '09), pp. 705–708, 2009.

[22] M. Berouti, R. Schwartz, and J. Makhoul, "Enhancement of speech corrupted by acoustic noise," in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '79), pp. 208–211, Washington, DC, USA, April 1979.

[23] R. Martin, "Spectral subtraction based on minimum statistics," in Proceedings of the European Signal Processing Conference (EUSIPCO '94), pp. 1182–1185, Edinburgh, UK, April 1994.

[24] H. Puder, "Single channel noise reduction using time-frequency dependent voice activity detection," in Proceedings of International Workshop on Acoustic Echo and Noise Control (IWAENC '99), pp. 68–71, Pocono Manor, Pa, USA, September 1999.

[25] A. Juneja, O. Deshmukh, and C. Espy-Wilson, "A multi-band spectral subtraction method for enhancing speech corrupted by colored noise," in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '02), vol. 4, pp. 4160–4164, Orlando, Fla, USA, May 2002.

[26] J. Ramirez, J. C. Segura, C. Benitez, A. de la Torre, and A. Rubio, "A new voice activity detector using subband order-statistics filters for robust speech recognition," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '04), vol. 1, pp. I849–I852, 2004.

[27] R. Martin, "Noise power spectral density estimation based on optimal smoothing and minimum statistics," IEEE Transactions on Speech and Audio Processing, vol. 9, no. 5, pp. 504–512, 2001.

[28] I. Cohen, "Noise spectrum estimation in adverse environments: improved minima controlled recursive averaging," IEEE Transactions on Speech and Audio Processing, vol. 11, no. 5, pp. 466–475, 2003.

[29] J. Sohn and W. Sung, "A voice activity detector employing soft decision based noise spectrum adaptation," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '98), vol. 1, pp. 365–368, 1998.

[30] G. D. Forney Jr., "Exponential error bounds for erasure, list, and decision feedback schemes," IEEE Transactions on Information Theory, vol. 14, no. 2, pp. 206–220, 1968.

[31] J. Sohn, N. S. Kim, and W. Sung, "A statistical model-based voice activity detection," IEEE Signal Processing Letters, vol. 6, no. 1, pp. 1–3, 1999.

[32] ITU-T, Test signals for use in telephonometry, Recommendation ITU-T P.501, International Telecommunication Union, Geneva, Switzerland, 2007.

[33] A. H. Gray Jr. and J. D. Markel, "Distance measures for speech processing," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 24, no. 5, pp. 380–391, 1976.


Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2010, Article ID 358729, 9 pages
doi:10.1155/2010/358729

Research Article

DOA Estimation with Local-Peak-Weighted CSP

Osamu Ichikawa, Takashi Fukuda, and Masafumi Nishimura

IBM Research-Tokyo, 1623-14, Shimotsuruma, Yamato, Kanagawa 242-8502, Japan

Correspondence should be addressed to Osamu Ichikawa, [email protected]

Received 31 July 2009; Revised 18 December 2009; Accepted 4 January 2010

Academic Editor: Sharon Gannot

Copyright © 2010 Osamu Ichikawa et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

This paper proposes a novel weighting algorithm for Cross-power Spectrum Phase (CSP) analysis to improve the accuracy of direction of arrival (DOA) estimation for beamforming in a noisy environment. Our sound source is a human speaker and the noise is broadband noise in an automobile. The harmonic structures in the human speech spectrum can be used for weighting the CSP analysis, because harmonic bins must contain more speech power than the others and thus give us more reliable information. However, most conventional methods leveraging harmonic structures require pitch estimation with voiced-unvoiced classification, which is not sufficiently accurate in noisy environments. In our new approach, the observed power spectrum is directly converted into weights for the CSP analysis by retaining only the local peaks considered to be harmonic structures. Our experiment showed the proposed approach significantly reduced the errors in localization, and it showed further improvements when used with other weighting algorithms.

1. Introduction

The performance of automatic speech recognition (ASR) is severely affected in noisy environments. For example, in automobiles the ASR error rates during high-speed cruising with an open window are generally high. In such situations, the noise reduction of beamforming technology can improve the ASR accuracy. However, all beamformers except for Blind Signal Separation (BSS) require accurate localization to focus on the target sound source. If a beamformer has high performance with acute directivity, then the performance declines greatly if the localization is inaccurate. This means ASR may actually lose accuracy with a beamformer, if the localization is poor in a noisy environment. Accurate localization is critically important for ASR with a beamformer.

For sound source localization, conventional methods include MUSIC [1, 2], Minimum Variance (MV), Delay and Sum (DS), and Cross-power Spectrum Phase (CSP) [3] analysis. For two-microphone systems installed on physical objects such as dummy heads or external ears, approaches with head-related transfer functions (HRTF) have been investigated to model the effect of diffraction and reflection [4]. Profile Fitting [5] can also address the diffraction and reflection with the advantage of reducing the effects of noise sources through localization.

Among these methods, CSP analysis is popular because it is accurate, reliable, and simple. CSP analysis measures the time differences in the signals from two microphones using normalized correlation. The differences correspond to the direction of arrival (DOA) of the sound sources. Using multiple pairs of microphones, CSP analysis can be enhanced for 2D or 3D space localization [6].

This paper seeks to improve CSP analysis in noisy environments with a special weighting algorithm. We assume the target sound source is a human speaker and the noise is broadband noise such as a fan, wind, or road noise in an automobile. Denda et al. proposed weighted CSP analysis using average speech spectrums as weights [7]. The assumption is that a subband with more speech power conveys more reliable information for localization. However, it did not use the harmonic structures of human speech. Because the harmonic bins must contain more speech power than the other bins, they should give us more reliable information in noisy environments. The use of harmonic structures for localization has been investigated in prior art [8, 9], but not for CSP analysis. This work estimated the


Figure 1: An example of CSP.

Figure 2: Average speech spectrum weight.

pitches (F0) of the target sound and extracted localization cues from the harmonic structures based on those pitches. However, the pitch estimation and the associated voiced-unvoiced classification may be insufficiently accurate in noisy environments. Also, it should be noted that not all harmonic bins have distinct harmonic structures. Some bins may not be in the speech formants and be dominated by noise. Therefore, we want a special weighting algorithm that puts larger weights on the bins where the harmonic structures are distinct, without requiring explicit pitch detection and voiced-unvoiced classification.

2. Sound Source Localization Using CSP Analysis

2.1. CSP Analysis. CSP analysis measures the normalized correlations between two-microphone inputs with an Inverse Discrete Fourier Transform (IDFT) as

\varphi_T(i) = \mathrm{IDFT}\left[\frac{S_{1,T}(j) \cdot S_{2,T}(j)^{*}}{|S_{1,T}(j)| \cdot |S_{2,T}(j)|}\right],    (1)

where S_{m,T} is a complex spectrum at the Tth frame observed with microphone m, and ∗ means complex conjugate. The bin number j corresponds to the frequency. The CSP coefficient φ_T(i) is a time-domain representation of the normalized correlation for the i-sample delay. For a stable representation, the CSP coefficients should be processed as a moving average using several frames around T, as long as the sound source is not moving, using

\bar{\varphi}_T(i) = \frac{\sum_{l=-H}^{H} \varphi_{T+l}(i)}{2H+1},    (2)

where 2H + 1 is the number of averaged frames. Figure 1 shows an example of φ_T. In clean conditions, there is a sharp peak for a sound source. The estimated DOA i_T for the sound source is

i_T = \arg\max_i\left(\bar{\varphi}_T(i)\right).    (3)
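A minimal NumPy sketch of (1)-(3) for a single frame is shown below; the windowing, framing, and the moving average (2) are simplified, and all function and variable names are illustrative rather than taken from the paper.

```python
import numpy as np

def csp_coefficients(s1, s2, n_fft=512):
    """CSP coefficients (1) for one frame of two-microphone signals."""
    S1 = np.fft.rfft(s1 * np.hamming(len(s1)), n_fft)
    S2 = np.fft.rfft(s2 * np.hamming(len(s2)), n_fft)
    cross = S1 * np.conj(S2)
    cross /= np.maximum(np.abs(S1) * np.abs(S2), 1e-12)   # phase-only spectrum
    phi = np.fft.irfft(cross, n_fft)                       # time-domain correlation
    return np.fft.fftshift(phi)                            # delay 0 at the center

def estimate_doa(phi, max_delay=7):
    """Estimated delay (3), restricted to +/- max_delay samples."""
    center = len(phi) // 2
    window = phi[center - max_delay:center + max_delay + 1]
    return int(np.argmax(window)) - max_delay
```

In practice, several consecutive frames of `csp_coefficients` would be averaged as in (2) before calling `estimate_doa`.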

2.2. Tracking a Moving Sound Source. If a sound source is moving, the past location or DOA can be used as a cue to the new location. Tracking techniques may use Dynamic Programming (DP), the Viterbi search [10], Kalman Filters, or Particle Filters [11]. For example, to find the series of DOAs that maximize the function for the input speech frames, DP can use the evaluation function Ψ as

\Psi_T(i) = \bar{\varphi}_T(i) \cdot L(k,i) + \max_{i-1 \le k \le i+1}\left(\Psi_{T-1}(k)\right),    (4)

where L(k, i) is a cost function from k to i.

2.3. Weighted CSP Analysis. Equation (1) can be viewed as a summation of each contribution at bin j. Therefore we can introduce a weight W(j) on each bin so as to focus on the more reliable bins, as

\varphi_T(i) = \mathrm{IDFT}\left[\frac{W(j) \cdot S_{1,T}(j) \cdot S_{2,T}(j)^{*}}{|S_{1,T}(j)| \cdot |S_{2,T}(j)|}\right].    (5)

Denda et al. introduced an average speech spectrum for the weights [7] to focus on human speech. Figure 2 shows their weights. We use the symbol W^Denda for later reference to these weights. It does not have any suffix T, since it is time invariant.

Another weighting approach would be to use the local SNR [12], as long as the ambient noise is stationary and measurable. For our evaluation in Section 4, we simply used larger weights where the local SNR is high, as

W^{\mathrm{SNR}}_T(j) = \frac{\max\left(\left(\log\left(|S_T(j)|^2\right) - \log\left(|N_T(j)|^2\right)\right),\, \varepsilon\right)}{K_T},    (6)

where N_T is the spectral magnitude of the average noise, ε is a very small constant, and K_T is a normalizing factor

K_T = \sum_{k} \max\left(\left(\log\left(|S_T(k)|^2\right) - \log\left(|N_T(k)|^2\right)\right),\, \varepsilon\right).    (7)

Figure 3(c) shows an example of the local SNR weights.
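A short sketch of the local SNR weights (6)-(7) is given below, assuming the magnitude spectra of the current frame and of the average noise are available; the clipping constant ε and the function name are illustrative.

```python
import numpy as np

def local_snr_weights(S, N, eps=1e-3):
    """Local SNR weights W_T^SNR(j) following (6) and (7).

    S : magnitude spectrum of the current frame
    N : magnitude spectrum of the average noise
    """
    log_snr = np.log(np.maximum(S, 1e-12) ** 2) - np.log(np.maximum(N, 1e-12) ** 2)
    w = np.maximum(log_snr, eps)
    return w / w.sum()        # division by K_T normalizes the weights to sum to one
```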


Figure 3: Sample spectra and the associated weights: (a) a sample of the average noise spectrum; (b) a sample of the observed noisy speech spectrum; (c) a sample of the local SNR weights; (d) a sample of the local peak weights. The spectra were of the recording with air conditioner noise at an SNR of 0 dB. The noisy speech spectrum (b) was sampled in a vowel segment.

Figure 4: A sample of comb weight (pitch = 300 Hz).

3. Harmonic Structure-Based Weighting

3.1. Comb Weights. If there is accurate information about the pitch and voiced-unvoiced labeling of the input speech, then we can design comb filters [13] for the frames in the voiced segments. The optimal CSP weights will be equivalent to the gain of the comb filters to selectively use those harmonic bins. Figure 4 shows an example of the weights when the pitch is 300 Hz.

Unfortunately, the estimates of the pitch and the voiced-unvoiced classification become inaccurate in noisy environments. Figure 5 shows our tests using the "Pitch command"

Figure 5: A sample waveform (clean) and its pitches detected by SPTK in various SNR situations (25 dB (clean), 10 dB, 5 dB, 0 dB). The threshold of voiced-unvoiced classification was set to 6.0 (SPTK default). For the frames detected as unvoiced, SPTK outputs zero. The test data was prepared by blending noise at different SNRs. The noise was recorded in a car moving on an expressway with a fan at a medium level.

in SPTK-3.0 [14] to obtain the pitch and voiced-unvoiced information. There are many outliers in the low SNR conditions. Many researchers have tried to improve the accuracy of the detection in noisy environments [15], but their solutions require some threshold for voiced-unvoiced


Figure 6: Process to obtain Local Peak Weight (observed spectrum → log power spectrum → DCT to get cepstrum → cut off upper and lower cepstrum → inverse DCT → exponential and normalization to get weights W(ω) → weighted CSP), shown for a voiced frame and for a noise or unvoiced frame.

classification [16]. When noise-corrupted speech is falsely detected as unvoiced, there is little benefit from the CSP weighting.

There is another problem with the uniform adoption of comb weights for all of the bins. Those bins not in the speech formants and degraded by noise may not contain reliable cues even though they are harmonic bins. Such bins should receive smaller weights.

Therefore, in Section 3.2, we explore a new weighting algorithm that does not depend on explicit pitch detection or voiced-unvoiced classification. Our approach is like a continuous converter from an input spectrum to a weight vector, which can be locally large for the bins whose harmonic structures are distinct.

3.2. Proposed Local Peak Weights. We previously proposed a method for speech enhancement called Local Peak Enhancement (LPE) to provide robust ASR even in very low SNR conditions due to driving noises from an open window or loud air conditioner noises [17]. LPE does not leverage pitch information explicitly, but estimates the filters from the observed speech to enhance the speech spectrum. LPE


Figure 7: Microphone installation and the resolution of DOA in the experimental car.

Figure 8: Averaged noise spectrum used in the experiment (Window full open and Fan max).

assumes that pitch information containing the harmonic structure is included in the middle range of the cepstral coefficients obtained with the discrete cosine transform (DCT) from the power spectral coefficients. The LPE filter retrieves information only from that range, so it is designed to enhance the local peaks of the harmonic structures for voiced speech frames. Here, we propose the LPE filter be used for the weights in the CSP approach. This use of the LPE filter is named Local Peak Weight (LPW), and we refer to the CSP with LPW as the Local-Peak-Weighted CSP (LPW-CSP).

Figure 6 shows all of the steps for obtaining the LPW and sample outputs of each step for both a voiced frame and an unvoiced frame. The process is the same for all of the frames, but the generated filters differ depending on whether or not the frame is voiced speech, as shown in the figure.

Here are the details for each step.

(1) Convert the observed spectrum from one of the microphones to a log power spectrum Y_T(j) for each frame, where T and j are the frame number and

Figure 9: System for the evaluation (DFT of both microphone signals S_{1,T}(j) and S_{2,T}(j), weight computation W(j), weighted CSP, smoothing over frames, and DOA determination).

Figure 10: Error rate of frame-based DOA detection (Fan Max: single-weight cases). 1. CSP (Baseline), 2. W-CSP (Comb), 3. W-CSP (LPW), 4. W-CSP (Local SNR), 5. W-CSP (Denda).

the bin index of the DFT. Optionally, we may take a moving average using several frames around T, to smooth the power spectrum for Y_T(j).

(2) Convert the log power spectrum Y_T(j) into the cepstrum C_T(i) by using D(i, j), a DCT matrix:

C_T(i) = \sum_{j} D(i,j) \cdot Y_T(j),    (8)

where i is the bin number of the cepstral coefficients. In our experiments, the size of the DCT matrix is 256 by 256.


Figure 11: Error rate of frame-based DOA detection (Window Full Open: single-weight cases). 1. CSP (Baseline), 2. W-CSP (Comb), 3. W-CSP (LPW), 4. W-CSP (Local SNR), 5. W-CSP (Denda).

Figure 12: Error rate of frame-based DOA detection (Fan Max: combined-weight cases). 1. CSP (Baseline), 6. W-CSP (LPW and Denda), 7. W-CSP (LPW and Local SNR), 8. W-CSP (Local SNR and Denda), 9. W-CSP (LPW and Local SNR and Denda).

(3) The cepstra represent the curvatures of the log power spectra. The lower and higher cepstra include long and short oscillations, while the medium cepstra capture the harmonic structure information. Thus the range of cepstra is chosen by filtering out the lower and upper cepstra in order to cover the possible harmonic structures in the human voice:

\hat{C}_T(i) = \begin{cases} \lambda \cdot C_T(i) & \text{if } (i < I_L) \text{ or } (i > I_H),\\ C_T(i) & \text{otherwise}, \end{cases}    (9)

where λ is a small constant. I_L and I_H correspond to the bin index of the possible pitch range, which

Figure 13: Error rate of frame-based DOA detection (Window Full Open: combined-weight cases). 1. CSP (Baseline), 6. W-CSP (LPW and Denda), 7. W-CSP (LPW and Local SNR), 8. W-CSP (Local SNR and Denda), 9. W-CSP (LPW and Local SNR and Denda).

for human speech is from 100 Hz to 400 Hz. This assumption gives I_L = 55 and I_H = 220, when the sampling frequency is 22 kHz.

(4) Convert Ĉ_T(i) back to the log power spectrum domain V_T(j) by using the inverse DCT:

V_T(j) = \sum_{i} D^{-1}(j,i) \cdot \hat{C}_T(i).    (10)

(5) Then convert back to a linear power spectrum:

w_T(j) = \exp\left(V_T(j)\right).    (11)

(6) Finally, we obtain LPW, after normalizing, as

W^{\mathrm{LPW}}_T(j) = \frac{w_T(j)}{\sum_{k} w_T(k)}.    (12)

For voiced speech frames, LPW will be designed to retain only the local peaks of the harmonic structure, as shown in the bottom-right graph in Figure 6 (see also Figure 3(d)). For unvoiced speech frames, the result will be almost flat due to the lack of local peaks with the target harmonic structure. Unlike the comb weights, the LPW is not uniform over the target frequencies and is more focused on the frequencies where harmonic structures are observed in the input spectrum.
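The steps above can be summarized in a short sketch; it uses SciPy's orthonormal DCT in place of the explicit 256-by-256 DCT matrix D(i, j), and the default constants correspond to the I_L, I_H values and the small λ mentioned in the text. The function name and interface are illustrative, not the paper's implementation.

```python
import numpy as np
from scipy.fftpack import dct, idct

def local_peak_weight(spectrum, i_low=55, i_high=220, lam=1e-3):
    """Local Peak Weight (LPW), steps (1)-(6).

    spectrum : linear power spectrum of one frame from one microphone
    i_low, i_high : cepstral bins bounding the expected pitch range
    lam : small constant attenuating cepstra outside that range
    """
    Y = np.log(np.maximum(spectrum, 1e-12))        # (1) log power spectrum
    C = dct(Y, norm='ortho')                       # (2) cepstrum via DCT
    C_hat = C.copy()                               # (3) keep only mid-range cepstra
    C_hat[:i_low] *= lam
    C_hat[i_high + 1:] *= lam
    V = idct(C_hat, norm='ortho')                  # (4) back to the log spectrum domain
    w = np.exp(V)                                  # (5) linear power spectrum
    return w / w.sum()                             # (6) normalize to obtain the LPW
```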

3.3. Combination with Existing Weights. The proposed LPW and existing weights can be used in various combinations. For the combinations, the two choices are sum and product. In this paper, they are defined as the products of each component for each bin j, because the scale of each component is too different for a simple summation and we


hope to minimize some fake peaks in the weights by using the products of different metrics. Equations (13) to (16) show the combinations we evaluate in Section 4.

W^{\mathrm{LPW\&Denda}}_T(j) = W^{\mathrm{LPW}}_T(j) \cdot W^{\mathrm{Denda}}(j),    (13)

W^{\mathrm{LPW\&SNR}}_T(j) = W^{\mathrm{LPW}}_T(j) \cdot W^{\mathrm{SNR}}_T(j),    (14)

W^{\mathrm{SNR\&Denda}}_T(j) = W^{\mathrm{SNR}}_T(j) \cdot W^{\mathrm{Denda}}(j),    (15)

W^{\mathrm{LPW\&SNR\&Denda}}_T(j) = W^{\mathrm{LPW}}_T(j) \cdot W^{\mathrm{SNR}}_T(j) \cdot W^{\mathrm{Denda}}(j).    (16)
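Since (13)-(16) are plain element-wise products, a generic helper suffices; whether the combined weights are renormalized afterwards is not stated in the paper, so the sketch below simply returns the product.

```python
import numpy as np

def combine_weights(*weights):
    """Element-wise product of CSP weights over frequency bins, as in (13)-(16)."""
    return np.prod(np.stack(weights), axis=0)
```

For example, combine_weights(w_lpw, w_snr, w_denda) corresponds to the three-weight combination (16) evaluated as Case 9 in Section 4.2.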

4. Experiment

In the experimental car, two microphones were installed near the map-reading lights on the ceiling with 12.5 cm between them. We used omnidirectional microphones. The sampling frequency for the recordings was 22 kHz. In this configuration, CSP gives 15 steps from −7 to +7 for the DOA resolution (see Figure 7).

A higher sampling rate might yield higher directional resolution. However, many beamformers do not support higher sampling frequencies because of processing costs and aliasing problems. We also know that most ASR systems work at sampling rates below 22 kHz. These considerations led us to use 22 kHz.

Again, we could have gained directional resolution by increasing the distance between the microphones. In general, a larger baseline distance improves the performance of a beamformer, especially for lower frequency sounds. However, this increases the aliasing problems for higher frequency sounds. Our separation of 12.5 cm was another tradeoff.

Our analysis used a Hamming window and 23-ms-long frames with 10-ms frame shifts. The FFT length was 512. For (2), the length of the moving average was 0.2 seconds.

The test subject speakers were 4 females and 4 males. Each speaker read 50 Japanese commands. These are short phrases for automobiles known as Free Form Command [18]. The total number of utterances was 400. They were recorded in a stationary car, a full-size sedan. The subject speakers sat in the driver's seat. The seat was adjusted to each speaker's preference, so the distance to the microphones varied from approximately 40 cm to 60 cm. Two types of noise were recorded separately in a moving car, and they were combined with the speech data at various SNRs (clean, 10 dB, and 0 dB). The SNRs were measured as ratios of speech power and noise power, ignoring the frequency components below 300 Hz. One of the recorded noises was an air-conditioner at maximum fan speed while driving on a highway with the windows closed. This will be referred to as "Fan Max". The other was of driving noise on a highway with the windows fully opened. This will be referred to as "Window Full Open". Figure 8 compares the average spectra of the two noises. "Window Full Open" contains more power around 1 kHz, and "Fan Max" contains relatively large power around 4 kHz. Although it is not shown in the

graph, "Window Full Open" contains lots of transient noise from the wind and other automobiles.

Figure 9 shows the system used for this evaluation. We used various types of weights for the weighted CSP analysis. The input from one microphone was used to generate the weights. Using both microphones could provide better weights, but in this experiment we used only one microphone for simplicity. Since the baseline (normal CSP) does not use weighting, all of its weights were set to 1.0. The weighted CSP was calculated using (5), with smoothing over the frames using (2). In addition to the weightings, we introduced a lower cut-off frequency of 100 Hz and an upper cut-off frequency of 5 kHz to stabilize the CSP analysis. Finally, the DOA was estimated using (3) for each frame. We did not use the tracking algorithms discussed in Section 2.2, because we wanted to accurately measure the contributions of the various types of weights in a simplified form. Actually, the subject speakers rarely moved when speaking.

The performance was measured as frame-based accuracy. The frames reporting the correct DOA were counted, and that was divided by the total number of speech frames. The correct DOA values were determined manually. The speech segments were determined using clean speech data with a rather strict threshold, so extra segments were not included before or after the phrases.

4.1. Experiment Using Single Weights. We evaluated five types of CSP analysis.

Case 1. Normal CSP (uniform weights, baseline).

Case 2. Comb-Weighted CSP.

Case 3. Local-Peak-Weighted CSP (our proposal).

Case 4. Local-SNR-Weighted CSP.

Case 5. Average-Speech-Spectrum-Weighted CSP (Denda).

Case 2 requires the pitch and voiced-unvoiced information. We used SPTK-3.0 [14] with default parameters to obtain this data. Case 4 requires estimating the noise spectrum. In this experiment, the noise spectrum was continuously updated within the noise segments based on oracle VAD information as

N_T(j) = (1-\alpha) \cdot N_{T-1}(j) + \alpha \cdot S_T(j),
\alpha = \begin{cases} 0.0 & \text{if VAD = active},\\ 0.1 & \text{otherwise}. \end{cases}    (17)

The initial value of the noise spectrum for each utterance file was given by the average of all of the noise segments in that file.
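A sketch of the recursive update (17), assuming the oracle VAD decision for the current frame is available; the function name is illustrative.

```python
def update_noise_spectrum(N_prev, S, vad_active, alpha_noise=0.1):
    """Recursive noise spectrum update following (17).

    N_prev     : previous noise magnitude spectrum estimate N_{T-1}(j)
    S          : magnitude spectrum of the current frame S_T(j)
    vad_active : True when the oracle VAD marks the frame as speech
    """
    alpha = 0.0 if vad_active else alpha_noise
    return (1.0 - alpha) * N_prev + alpha * S
```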

Figures 10 and 11 show the experimental results for "Fan Max" and "Window Full Open", respectively. Case 2 failed to show significant error reduction in both situations. This failure is probably due to bad pitch estimation or poor voiced-unvoiced classification in the noisy environments.


This suggests that the result could be improved by introducing robust pitch trackers and voiced-unvoiced classifiers. However, there is an intrinsic problem since noisier speech segments are more likely to be classified as unvoiced and thus lose the benefit of weighting.

Case 5 failed to show significant error reduction for "Fan Max", but it showed good improvement for "Window Full Open". As shown in Figure 8, "Fan Max" contains more noise power around 4 kHz than around 1 kHz. In contrast, the speech power is usually lower around 4 kHz than around 1 kHz. Therefore, the 4-kHz region tends to be more degraded. However, Denda's approach does not sufficiently lower the weights in the 4-kHz region, because the weights are time-invariant and independent of the noise. Case 3 and Case 4 outperformed the baseline in both situations. For "Fan Max", since the noise was almost stationary, the local-SNR approach can accurately estimate the noise. This is also a favorable situation for LPW, because the noise does not include harmonic components. However, LPW does little for consonants. Therefore, Case 4 had the best results for "Fan Max". In contrast, since the noise is nonstationary for "Window Full Open", Case 3 had slightly fewer errors than Case 4. We believe this is because the noise estimation for the local SNR calculations is inaccurate for nonstationary noises. Considering that the local SNR approach in this experiment used the given and accurate VAD information, the actual performance in the real world would probably be worse than our results. LPW has an advantage in that it does not require either noise estimation or VAD information.

4.2. Experiment Using Combined Weights. We also evaluated some combinations of the weights in Cases 3 to 5. The combined weights were calculated using (13) to (16).

Case 6. CSP weighted with LPW and Denda (Cases 3 and 5).

Case 7. CSP weighted with LPW and Local SNR (Cases 3 and 4).

Case 8. CSP weighted with Local SNR and Denda (Cases 4 and 5).

Case 9. CSP weighted with LPW, Local SNR, and Denda (Cases 3, 4, and 5).

Figures 12 and 13 show the experimental results for "Fan Max" and "Window Full Open", respectively, for the combined-weight cases.

For the combination of two weights, the best combination was dependent on the situation. For "Fan Max", Case 7, the combination of LPW and the local SNR approach, was best, reducing the error by 51% at 0 dB. For "Window Full Open", Case 6, the combination of LPW and Denda's approach, was best, reducing the error by 37% at 0 dB. These results correspond to the discussion in Section 4.1 about how the local SNR approach is suitable for stationary noises, while LPW is suitable for nonstationary noises, and Denda's approach works well with noise concentrated in the lower frequency region.

Case 9, the combination of the three weights, worked well in both situations. Because each weighting method has different characteristics, we expected that their combination would help against variations in the noise. Indeed, the results were almost equivalent to the best combinations of the paired weights in each situation.

5. Conclusion

We proposed a new weighting algorithm for CSP analysis to improve the accuracy of DOA estimation for beamforming in a noisy environment, assuming the source is human speech and the noise is broadband noise such as a fan, wind, or road noise in an automobile.

The proposed weights are extracted directly from the input speech using the midrange of the cepstrum. They represent the local peaks of the harmonic structures. As the process does not involve voiced-unvoiced classification, it does not have to switch its behavior over the voiced-unvoiced transitions.

Experiments showed the proposed local peak weighting algorithm significantly reduced the errors in localization using CSP analysis. A weighting algorithm using the local SNR also reduced the errors, but it did not produce the best results in the nonstationary noise situation in our evaluations. Also, it requires VAD information to estimate the noise spectrum. Our proposed algorithm does not require VAD information, voiced-unvoiced information, or pitch information. It does not assume the noise is stationary. Therefore, it showed advantages in the nonstationary noise situation. Also, it can be combined with existing weighting algorithms for further improvements.

References

[1] D. Johnson and D. Dudgeon, Array Signal Processing, Prentice-Hall, Englewood Cliffs, NJ, USA.

[2] F. Asano, H. Asoh, and T. Matsui, "Sound source localization and separation in near field," IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. E83-A, no. 11, pp. 2286–2294, 2000.

[3] M. Omologo and P. Svaizer, "Acoustic event localization using a crosspower-spectrum phase based technique," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP '94), pp. 273–276, 1994.

[4] K. D. Martin, "Estimating azimuth and elevation from interaural differences," in Proceedings of IEEE ASSP Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA '95), p. 4, 1995.

[5] O. Ichikawa, T. Takiguchi, and M. Nishimura, "Sound source localization using a profile fitting method with sound reflectors," IEICE Transactions on Information and Systems, vol. E87-D, no. 5, pp. 1138–1145, 2004.

[6] T. Nishiura, T. Yamada, S. Nakamura, and K. Shikano, "Localization of multiple sound sources based on a CSP analysis with a microphone array," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '00), vol. 2, pp. 1053–1056, 2000.

[7] Y. Denda, T. Nishiura, and Y. Yamashita, "Robust talker direction estimation based on weighted CSP analysis and


maximum likelihood estimation," IEICE Transactions on Information and Systems, vol. E89-D, no. 3, pp. 1050–1057, 2006.

[8] T. Yamada, S. Nakamura, and K. Shikano, "Robust speech recognition with speaker localization by a microphone array," in Proceedings of the International Conference on Spoken Language Processing (ICSLP '96), vol. 3, pp. 1317–1320, 1996.

[9] T. Nagai, K. Kondo, M. Kaneko, and A. Kurematsu, "Estimation of source location based on 2-D MUSIC and its application to speech recognition in cars," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '01), vol. 5, pp. 3041–3044, 2001.

[10] T. Yamada, S. Nakamura, and K. Shikano, "Distant-talking speech recognition based on a 3-D Viterbi search using a microphone array," IEEE Transactions on Speech and Audio Processing, vol. 10, no. 2, pp. 48–56, 2002.

[11] H. Asoh, I. Hara, F. Asano, and K. Yamamoto, "Tracking human speech events using a particle filter," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '05), vol. 2, pp. 1153–1156, 2005.

[12] J.-M. Valin, F. Michaud, J. Rouat, and D. Letourneau, "Robust sound source localization using a microphone array on a mobile robot," in Proceedings of IEEE International Conference on Intelligent Robots and Systems (IROS '03), vol. 2, pp. 1228–1233, 2003.

[13] H. Tolba and D. O'Shaughnessy, "Robust automatic continuous-speech recognition based on a voiced-unvoiced decision," in Proceedings of the International Conference on Spoken Language Processing (ICSLP '98), p. 342, 1998.

[14] SPTK: http://sp-tk.sourceforge.net/.

[15] M. Wu, D. L. Wang, and G. J. Brown, "A multi-pitch tracking algorithm for noisy speech," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '02), vol. 1, pp. 369–372, 2002.

[16] T. Nakatani, T. Irino, and P. Zolfaghari, "Dominance spectrum based V/UV classification and F0 estimation," in Proceedings of the 8th European Conference on Speech Communication and Technology (Eurospeech '03), pp. 2313–2316, 2003.

[17] O. Ichikawa, T. Fukuda, and M. Nishimura, "Local peak enhancement combined with noise reduction algorithms for robust automatic speech recognition in automobiles," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '08), pp. 4869–4872, 2008.

[18] http://www-01.ibm.com/software/pervasive/embedded viavoice/.


Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2010, Article ID 690732, 11 pages
doi:10.1155/2010/690732

Research Article

Shooter Localization in Wireless Microphone Networks

David Lindgren,1 Olof Wilsson,2 Fredrik Gustafsson (EURASIP Member),2 and Hans Habberstad1

1 Swedish Defence Research Agency, FOI Department of Information Systems, Division of Informatics, 581 11 Linkoping, Sweden
2 Linkoping University, Department of Electrical Engineering, Division of Automatic Control, 581 83 Linkoping, Sweden

Correspondence should be addressed to David Lindgren, [email protected]

Received 31 July 2009; Accepted 14 June 2010

Academic Editor: Patrick Naylor

Copyright © 2010 David Lindgren et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Shooter localization in a wireless network of microphones is studied. Both the acoustic muzzle blast (MB) from the gunfire and the ballistic shock wave (SW) from the bullet can be detected by the microphones and considered as measurements. The MB measurements give rise to a standard sensor network problem, similar to time difference of arrivals in cellular phone networks, and the localization accuracy is good, provided that the sensors are well synchronized compared to the MB detection accuracy. The detection times of the SW depend on both shooter position and aiming angle and may provide additional information besides the shooter location, but again this requires good synchronization. We analyze the approach to base the estimation on the time difference of MB and SW at each sensor, which becomes insensitive to synchronization inaccuracies. Cramer-Rao lower bound analysis indicates how a lower bound of the root mean square error depends on the synchronization error for the MB and the MB-SW difference, respectively. The estimation problem is formulated in a separable nonlinear least squares framework. Results from field trials with different types of ammunition show excellent accuracy using the MB-SW difference for both the position and the aiming angle of the shooter.

1. Introduction

Several acoustic shooter localization systems are today commercially available; see, for instance, [1–4]. Typically, one or more microphone arrays are used, each synchronously sampling acoustic phenomena associated with gunfire. An overview is found in [5]. Some of these systems are mobile, and in [6] it is even described how soldiers can carry the microphone arrays on their helmets. One interesting attempt to find the direction of sound from one microphone only is described in [7]. It is based on direction dependent spatial filters (mimicking the human outer ear) and prior knowledge of the sound waveform, but this approach has not yet been applied to gun shots.

Indeed, less common are shooter localization systems based on singleton microphones geographically distributed in a wireless sensor network. An obvious issue in wireless networks is the sensor synchronization. For localization algorithms that rely on accurate timing, like the ones based on time difference of arrival (TDOA), it is of major importance that synchronization errors are carefully controlled. Regardless of whether the synchronization is solved by using GPS or other techniques, see, for instance, [8–10], the synchronization procedures are associated with costs in battery life or communication resources that usually must be kept at a minimum.

In [11] the synchronization error impact on the sniper localization ability of an urban network is studied by using Monte Carlo simulations. One of the results is that the inaccuracy increased significantly (>2 m) for synchronization errors exceeding approximately 4 ms. 56 small wireless sensor nodes were modeled. Another closely related work that deals with mobile asynchronous sensors is [12], where the estimation bounds with respect to both sensor synchronization and position errors are developed and validated by Monte Carlo simulations. Also [13] should be mentioned, where combinations of directional and omnidirectional acoustic sensors for sniper localization are evaluated by perturbation analysis. In [14], estimation bounds for multiple acoustic arrays are developed and validated by Monte Carlo simulations.

In this paper we derive fundamental estimation bounds for shooter localization systems based on wireless sensor networks, with the synchronization errors in focus. An accurate method independent of the synchronization errors will be analyzed (the MB-SW model), as well as a useful bullet deceleration model. The algorithms are tested on data from a field trial with 10 microphones spread over an area of 100 m and with gunfire at distances up to 400 m. Partial results of this investigation appeared in [15] and almost simultaneously in [12].

The outline is as follows. Section 2 sketches the localization principle and describes the acoustical phenomena that are used. Section 3 gives the estimation framework. Section 4 derives the signal models for the muzzle blast (MB), shock wave (SW), combined MB;SW, and difference MB-SW, respectively. Section 5 derives expressions for the root mean square error (RMSE) Cramer-Rao lower bound (CRLB) for the described models and provides numerical results from a realistic scenario. Section 6 presents the results from field trials, and Section 7 gives the conclusions.

2. Localization Principle

Two acoustical phenomena associated with gunfire will be exploited to determine the shooter's position: the muzzle blast and the shock wave. The principle is to detect and time stamp the phenomena as they reach microphones distributed over an area, and let the shooter's position be estimated by, in a sense, the most likely point, considering the microphone locations and detection times.

The muzzle blast (MB) is the sound that probably most of us associate with a gun shot, the "bang." The MB is generated by the pressure depletion when the bullet leaves the gun barrel. The sound of the MB travels at the speed of sound in all directions from the shooter. Provided that a sufficient number of microphones detect the MB, the shooter's position can be more or less accurately determined.

The shock wave (SW) is formed by supersonic bullets. The SW has (approximately) the shape of an expanding cone, with the bullet trajectory as its axis, and reaches only microphones that happen to be located inside the cone. The SW propagates at the speed of sound in a direction away from the bullet trajectory, but since it is generated by a supersonic bullet, it always reaches the microphone before the MB, if it reaches the microphone at all. A number of SW detections may primarily reveal the direction to the shooter. Extra observations or assumptions on the ammunition are generally needed to deduce the distance to the shooter. The SW detection is also more difficult to utilize than the MB detection, since it depends on the bullet's speed and ballistic behavior.

Figure 1 shows an acoustic recording of gunfire. The first pulse is the SW, which for distant shooters significantly dominates the MB, not least if the bullet passes close to the microphone. The figure shows real data, but a rather ideal case. Usually, and particularly in urban environments, there are reflections and other acoustic effects that make it difficult to accurately determine the MB and SW times. This issue will however not be treated in this work. We will instead assume that the detection error is stochastic with a certain distribution. A more thorough analysis of the SW propagation is given in [16].

Figure 1: Signal from a microphone placed 180 m from a firing gun. Initial bullet speed is 767 m/s. The bullet passes the microphone at a distance of 30 m. The shock wave from the supersonic bullet reaches the microphone before the muzzle blast.

Of course, the MB and SW (when present) can be used in conjunction with each other. One of the ideas exploited later is to utilize the time difference between the MB and SW detections. This way, the localization is independent of the clock synchronization errors that are always present in wireless sensor networks.

3. Estimation Framework

It is assumed throughout this work that

(1) the coordinates of the microphones are known with negligible error,

(2) the arrival times of the MB and SW at each microphone are measured with significant synchronization error,

(3) the shooter position and aim direction are the sought parameters.

Thus, assume that there are M microphones with known positions pk, k = 1, ..., M, in the network detecting the muzzle blast. Without loss of generality, the first S ≤ M of them also detect the shock wave. The detected times are denoted by yk^MB, k = 1, ..., M, and yk^SW, k = 1, ..., S, respectively. Each detected time is subject to a detection error, ek^MB or ek^SW, different for all times, and a clock synchronization error bk specific for each microphone. The firing time t0, shooter position x ∈ R3, and shooting direction α ∈ R2 are unknown parameters.

Page 86: Microphone Array Speech Processingdownloads.hindawi.com/journals/specialissues/547087.pdf · Editor-in-Chief Phillip Regalia, Institut National des Tel´ ecommunications, France´

EURASIP Journal on Advances in Signal Processing 3

Also the bullet speed v and the speed of sound c are unknown. Basic signal models for the detected times as a function of the parameters will be derived in the next section. The notation is summarized in Table 1.

The derived signal models will be of the form

y = h(x, θ; p) + e,  (1)

where y is a vector with the measured detection times, h is a nonlinear function with values in R^(M+S), and where θ represents the unknown parameters apart from x. The error e is assumed to be stochastic; see Section 4.5. Given the sensor locations in p ∈ R^(M×3), nonlinear optimization can be performed to estimate x, using the nonlinear least squares (NLS) criterion:

x = arg min_x min_θ V(x, θ; p),
V(x, θ; p) = ‖y − h(x, θ; p)‖²_R.  (2)

Here, arg min denotes the minimizing argument, min the minimum of the function, and ‖v‖²_Q denotes the Q-norm, that is, ‖v‖²_Q ≜ v^T Q^-1 v. Whenever Q is omitted, Q = I is assumed. The loss function norm R is chosen by consideration of the expected error characteristics. Numerical optimization, for instance the Gauss-Newton method, can here be applied to get the NLS estimate.

In the next section it will become clear that the assumed unknown firing time and the inverse speed of sound enter the model equations linearly. To exploit this fact we identify a sublinear structure in the signal model and apply the weighted least squares method to the parameters appearing linearly, the separable least squares method; see, for instance, [17]. By doing so, the NLS search space is reduced, which in turn significantly reduces the computational burden. For that reason, the signal model (1) is rewritten as

y = hN(x, θN; p) + hL(x, θN; p) θL + e.  (3)

Note that θL enters linearly here. The NLS problem can then be formulated as

x = arg min_x min_(θL,θN) V(x, θN, θL; p),
V(x, θN, θL; p) = ‖y − hN(x, θN; p) − hL(x, θN; p) θL‖²_R.  (4)

Since θL enters linearly, it can be solved for by linear least squares (the arguments of hL(x, θN; p) and hN(x, θN; p) are suppressed for clarity):

θL = arg min_θL V(x, θN, θL; p) = (hL^T R^-1 hL)^-1 hL^T R^-1 (y − hN),  (5a)

PL = (hL^T R^-1 hL)^-1.  (5b)

Here, θL is the weighted least squares estimate and PL is the covariance matrix of the estimation error. This simplifies the nonlinear minimization to

x = arg min_x min_θN V(x, θN, θL; p)
  = arg min_x min_θN ‖y − hN − hL (hL^T R^-1 hL)^-1 hL^T R^-1 (y − hN)‖²_R′,

R′ = R + hL PL hL^T.  (6)

This general separable least squares (SLS) approach will now be applied to four different combinations of signal models for the MB and SW detection times.
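To make the separable structure concrete, the sketch below evaluates the concentrated criterion (6) for given values of x and θN. It is a minimal numpy illustration with hypothetical function and variable names (not from the paper), and it assumes that hN and hL have already been evaluated at the candidate parameters.

```python
import numpy as np

def concentrated_cost(y, h_N, h_L, R):
    """Concentrated (separable) NLS cost of Section 3.

    y   : (m,) measured detection times
    h_N : (m,) nonlinear model part evaluated at (x, theta_N)
    h_L : (m, q) regressors of the linear parameters theta_L
    R   : (m, m) measurement error covariance (the norm in (4))

    Returns the cost of (6) and the linear estimate of (5a).
    """
    Rinv = np.linalg.inv(R)
    P_L = np.linalg.inv(h_L.T @ Rinv @ h_L)        # (5b)
    theta_L = P_L @ h_L.T @ Rinv @ (y - h_N)       # (5a), weighted LS
    resid = y - h_N - h_L @ theta_L                # residual with theta_L eliminated
    R_prime = R + h_L @ P_L @ h_L.T                # modified norm of (6)
    return resid @ np.linalg.solve(R_prime, resid), theta_L
```

An outer numerical search over x and θN, for instance Gauss-Newton or a grid, then only has to evaluate this scalar cost.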

4. Signal Models

4.1. Muzzle Blast Model (MB). According to the clock at microphone k, the muzzle blast (MB) sound is assumed to reach pk at the time

yk = t0 + bk + (1/c)‖pk − x‖ + ek.  (7)

The shooter position x and microphone location pk are in R^n, where generally n = 3. However, both computational and numerical issues occasionally motivate a simplified plane model with n = 2. For all M microphones, the model is represented in vector form as

y = b + hL(x; p) θL + e,  (8)

where

θL = [t0  1/c]^T,  (9a)

hL,k(x; p) = [1  ‖pk − x‖]^T,  (9b)

and where y, b, and e are vectors with elements yk, bk, and ek, respectively. 1M is the vector with M ones, where M might be omitted if there is no ambiguity regarding the dimension. Furthermore, p is M-by-n, where each row is a microphone position. Note that the inverse of the speed of sound enters linearly. The ·L notation indicates that · is part of a linear relation, as described in the previous section. With hN = 0 and hL = hL(x; p), (6) gives

x = arg min_x ‖y − hL (hL^T R^-1 hL)^-1 hL^T R^-1 y‖²_R′,  (10a)

R′ = R + hL (hL^T R^-1 hL)^-1 hL^T.  (10b)

Here, hL depends on x as given in (9b). This criterion has computationally efficient implementations that, in many applications, make the time it takes to do an exhaustive minimization over a, say, 10-meter grid acceptable.


Table 1: Notation. MB, SW, and MB-SW are different models, and L/N indicates if model parameters or signals enter the model linearly (L) or nonlinearly (N).

Variable | MB | SW | MB-SW | Description
M | | | | Number of microphones
S | | | | Number of microphones receiving shock wave, S ≤ M
x | N | N | N | Position of shooter, R^n (n = 2, 3)
pk | N | N | N | Position of microphone k, R^n (n = 2, 3)
yk | L | L | L | Measured detection time for microphone at position pk
t0 | L | L | | Rifle or gun firing time
c | L | N | N | Speed of sound
v | | N | N | Speed of bullet
α | | N | N | Shooting direction, R^(n−1) (n = 2, 3)
bk | L | L | | Synchronization error for microphone k
ek | L | L | L | Detection error at microphone k
r | | N | N | Bullet speed decay rate
dk | | | | Point of origin for shock wave received by microphone k
β | | | | Mach angle, sin β = c/v
γ | | | | Angle between line of sight to shooter and shooting angle


Figure 2: Level curves of the muzzle blast localization criterion based on data from a field trial.

The grid-based minimization of course reduces the risk of settling on suboptimal local minimizers, which otherwise could be a problem when using greedy search methods. The objective function does, however, behave rather well. Figure 2 visualizes (10a) in logarithmic scale for data from a field trial (the norm is R′ = I). Apparently, there are only two local minima.
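For illustration, the exhaustive grid search described above can be sketched as follows. This is a hypothetical numpy implementation, not the authors' code; the function names, the coarse refinement strategy, and the choice of passing R explicitly are ours.

```python
import numpy as np

def mb_cost(x, p, y, R):
    """Criterion (10a) for a candidate shooter position x (plane model).

    p : (M, 2) microphone positions, y : (M,) MB detection times,
    R : (M, M) error covariance. The linear parameters t0 and 1/c
    of (9a) are eliminated by weighted least squares."""
    h_L = np.column_stack([np.ones(len(y)),
                           np.linalg.norm(p - x, axis=1)])   # rows of (9b)
    Rinv = np.linalg.inv(R)
    P_L = np.linalg.inv(h_L.T @ Rinv @ h_L)
    resid = y - h_L @ (P_L @ h_L.T @ Rinv @ y)
    R_prime = R + h_L @ P_L @ h_L.T                          # (10b)
    return resid @ np.linalg.solve(R_prime, resid)

def mb_grid_search(p, y, R, x_min, x_max, step=10.0):
    """Exhaustive minimization of (10a) over a coarse 2-D grid."""
    best_cost, best_x = np.inf, None
    for gx in np.arange(x_min[0], x_max[0], step):
        for gy in np.arange(x_min[1], x_max[1], step):
            cost = mb_cost(np.array([gx, gy]), p, y, R)
            if cost < best_cost:
                best_cost, best_x = cost, np.array([gx, gy])
    return best_x
```

The coarse grid estimate can then be refined by a local numerical search, as done in Section 6.2.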

4.2. Shock Wave Model (SW). In general, the bullet follows a ballistic three-dimensional trajectory. In practice, a simpler model with a two-dimensional trajectory with constant deceleration might suffice. Thus, it will be assumed that the bullet follows a straight line with initial speed v0; see Figure 3. Due to air friction, the bullet decelerates; so when the bullet has traveled the distance ‖dk − x‖, for some point dk on the trajectory, the speed is reduced to

v = v0 − r‖dk − x‖, (11)

where r is an assumed known ballistic parameter. This is a rather coarse bullet trajectory model, compared with, for instance, the curvilinear trajectories proposed by [18], but we use it here for simplicity. This model is also a special case of the ballistic model used in [19].

The shock wave from the bullet trajectory propagates at the speed of sound c with angle βk to the bullet heading. βk is the Mach angle defined as

sin βk = c/v = c/(v0 − r‖dk − x‖).  (12)

dk is now the point where the shock wave that reaches microphone k is generated. The time it takes the bullet to reach dk is

∫0^‖x−dk‖ dξ/(v0 − rξ) = (1/r) log(v0/(v0 − r‖dk − x‖)).  (13)

This time and the wave propagation time from dk to pk sum up to the total time from firing to detection:

yk = t0 + bk + (1/r) log(v0/(v0 − r‖dk − x‖)) + (1/c)‖dk − pk‖ + ek,  (14)

according to the clock at microphone k. Note that the variable names y and e have, for notational simplicity, been reused from the MB model. Below, also h, θN, and θL will be reused. When there is ambiguity, a superscript will indicate exactly which entity is referred to, for instance, yMB, hSW.

It is a little bit tedious to calculate dk. The law of sines gives

sin(90° − βk − γk)/‖dk − x‖ = sin(90° + βk)/‖pk − x‖,  (15)

which together with (12) implicitly defines dk. We have not found any simple closed form for dk; so we solve for dk numerically, and in case of multiple solutions we keep the admissible one (which turns out to be unique). γk is trivially induced by the shooting direction α (and x, pk). Both these angles thus depend on x implicitly.
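One way to carry out this numerical solution is sketched below: combining (12) and (15) reduces the problem to scalar root finding in the Mach angle βk, after which the flight time (13) and the propagation term of (18) follow from the geometry of Figure 3. This is a hypothetical illustration, not the authors' implementation, and it assumes an admissible geometry (the microphone lies inside the shock wave cone and the bullet stays supersonic up to dk).

```python
import numpy as np
from scipy.optimize import brentq

def sw_time_term(dist_px, gamma, v0, c, r):
    """Bullet flight time plus shock wave propagation time, cf. (13) and (18).

    dist_px : ||pk - x||, distance from the shooter to microphone k
    gamma   : angle between the line of sight to the microphone and the aim
    v0, c, r: initial bullet speed, speed of sound, deceleration rate"""
    def mismatch(beta):
        s_geometry = dist_px * np.cos(beta + gamma) / np.cos(beta)  # from (15)
        s_ballistic = (v0 - c / np.sin(beta)) / r                   # from (12)
        return s_geometry - s_ballistic

    beta_lo = np.arcsin(c / v0)    # Mach angle at the muzzle (smallest possible)
    beta_hi = np.pi / 2 - gamma    # beyond this the cone no longer reaches pk
    beta = brentq(mismatch, beta_lo + 1e-6, beta_hi - 1e-6)

    s = (v0 - c / np.sin(beta)) / r                     # ||dk - x||
    dist_dp = dist_px * np.sin(gamma) / np.cos(beta)    # ||dk - pk||, law of sines
    return np.log(v0 / (v0 - r * s)) / r + dist_dp / c  # (13) plus propagation
```

For microphones outside the cone the bracket is empty and no root exists, which corresponds to the case where no shock wave is detected.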



Figure 3: Geometry of supersonic bullet trajectory and shock wave. Given the shooter location x, the shooting direction (aim) α, the bullet speed v, and the speed of sound c, the time it takes from firing the gun to detecting the shock wave can be calculated.

The vector form of the model is

y = b + hN(x, θN; p) + hL(x, θN; p) θL + e,  (16)

where

hL(x, θN; p) = 1,  θL = t0,  θN = [1/c  α^T  v0]^T,  (17)

and where row k of hN(x, θN; p) ∈ R^(S×1) is

hN,k(x, θN; pk) = (1/r) log(v0/(v0 − r‖dk − x‖)) + (1/c)‖dk − pk‖,  (18)

and dk is the admissible solution to (12) and (15).

4.3. Combined Model (MB;SW). In the MB and SW models, the synchronization error has to be regarded as a noise component. In a combined model, each pair of MB and SW detections depends on the same synchronization error, and consequently the synchronization error can be regarded as a parameter (at least for all sensor nodes inside the SW cone). The total signal model could be fused from the MB and SW models as the total observation vector:

yMB;SW = hN^MB;SW(x, θN; p) + hL^MB;SW(x, θN; p) θL + e,  (19)

where

yMB;SW = [yMB; ySW],  (20)

θL = [t0  b^T]^T,  (21)

hL^MB;SW(x, θN; p) = [1M,1  IM; 1S,1  [IS  0S,M−S]],  (22)

θN = [1/c  α^T  v0]^T,  (23)

hN^MB;SW(x, θN; p) = [hL^MB(x; p)[0  1/c]^T; hN^SW(x, θN; p)].  (24)

Here a semicolon denotes vertical stacking of blocks.

4.4. Difference Model (MB-SW). Motivated by accurate localization despite synchronization errors, we study the MB-SW model:

yk^MB-SW = yk^MB − yk^SW
         = hL^MB(x; p) θL^MB − hN^SW(x, θN^SW; p) − hL^SW(x, θN; p) θL^SW + ek^MB − ek^SW,  (25)

for k = 1, 2, ..., S. This rather special model has also been analyzed in [12, 15]. The key idea is that y is by cancellation independent of both the firing time t0 and the synchronization error b. The drawback, of course, is that there are only S equations (instead of a total of M + S) and the detection error increases, ek^MB − ek^SW. However, when the synchronization errors are expected to be significantly larger than the detection errors, and when also S is sufficiently large (at least as large as the number of parameters), this model is believed to give better localization accuracy. This will be investigated later.

There are no parameters in (25) that appear linearly everywhere. Thus, the vector form for the MB-SW model can be written as

yMB-SW = hN^MB-SW(x, θN; p) + e,  (26)

where

hN,k^MB-SW(x, θN; pk) = (1/c)‖pk − x‖ − (1/r) log(v0/(v0 − r‖dk − x‖)) − (1/c)‖dk − pk‖,  (27)

and y = yMB − ySW and e = eMB − eSW. As before, dk is the admissible solution to (12) and (15). The MB-SW least squares criterion is

x = arg min_(x,θN) ‖yMB-SW − hN^MB-SW(x, θN; p)‖²_R,  (28)

which requires numerical optimization. Numerical experiments indicate that this optimization problem is more prone to local minima than (10a) for the MB model; therefore good starting points for the numerical search are essential. One such starting point could, for instance, be the MB estimate xMB. An initial shooting direction could be given by assuming, in a sense, the worst possible case: that the shooter aims at some point close to the center of the microphone network.
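A sketch of this procedure for the plane model (n = 2) is given below. It is a hypothetical illustration rather than the authors' code: the residuals of (26)-(27) are built with the sw_time_term helper sketched in Section 4.2, R = I is assumed for brevity, all S microphones are assumed to lie inside the SW cone, and a derivative-free simplex search performs the minimization of (28).

```python
import numpy as np
from scipy.optimize import minimize

def mbsw_residuals(params, p, y_diff, r):
    """Residuals of the MB-SW model (26)-(27) for the plane model.

    params : [x1, x2, alpha, v0, c]  (position, aim, bullet speed, sound speed)
    p      : (S, 2) microphones with both MB and SW detections
    y_diff : (S,) measured MB minus SW detection times
    r      : assumed known bullet deceleration rate"""
    x, alpha, v0, c = params[:2], params[2], params[3], params[4]
    res = np.empty(len(y_diff))
    for k, pk in enumerate(p):
        diff = pk - x
        dist = np.linalg.norm(diff)
        bearing = np.arctan2(diff[1], diff[0])
        gamma = np.abs((alpha - bearing + np.pi) % (2 * np.pi) - np.pi)
        h_k = dist / c - sw_time_term(dist, gamma, v0, c, r)   # (27)
        res[k] = y_diff[k] - h_k
    return res

def localize_mbsw(p, y_diff, x_mb, alpha0, r=0.63):
    """Minimize criterion (28) with R = I, initialized at the MB estimate."""
    start = np.array([x_mb[0], x_mb[1], alpha0, 800.0, 330.0])  # Section 6.2 values
    cost = lambda q: np.sum(mbsw_residuals(q, p, y_diff, r) ** 2)
    return minimize(cost, start, method="Nelder-Mead").x
```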

4.5. Error Model. At an arbitrary moment, the detection errors and synchronization errors are assumed to be independent stochastic variables with normal distribution:

eMB ∼ N(0, RMB),  (29a)

eSW ∼ N(0, RSW),  (29b)

b ∼ N(0, Rb).  (29c)


For the MB-SW model the error is consequently

eMB-SW ∼ N(0, RMB + RSW).  (29d)

Assuming that S = M in the MB;SW model, the covariance of the summed detection and synchronization errors can be expressed in a simple manner as

RMB;SW = [RMB + Rb  Rb; Rb  RSW + Rb].  (29e)

Note that the correlation structure of the clock synchronization error b enables estimation of these. Note also that the (assumed known) total error covariance, generally denoted by R, dictates the norm used in the weighted least squares criterion. R also impacts the estimation bounds. This will be discussed in the next section.
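As a small illustration, the block covariance (29e) can be constructed as follows for the homogeneous error levels later assumed in (38). The function name is ours and the sketch assumes S = M.

```python
import numpy as np

def mbsw_joint_covariance(sigma_e, sigma_b, M):
    """Covariance (29e) of the stacked MB;SW errors when S = M and
    RMB = RSW = sigma_e**2 * I, Rb = sigma_b**2 * I as in (38)."""
    R_e = sigma_e**2 * np.eye(M)   # detection error covariance
    R_b = sigma_b**2 * np.eye(M)   # synchronization error covariance
    return np.block([[R_e + R_b, R_b],
                     [R_b,       R_e + R_b]])
```

The common off-diagonal blocks are what allow the MB;SW model to treat the clock offsets as estimable parameters.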

4.6. Summary of Models. Four models with different purposes have been described in this section.

(i) MB. Given that the acoustic environment enables reliable detection of the muzzle blast, the MB model promises the most robust estimation algorithms. It also allows global minimization with low-dimensional exhaustive search algorithms. This model is thus suitable for initialization of algorithms based on the subsequent models.

(ii) SW. The SW model extends the MB model with shooting angle, bullet speed, and deceleration parameters, which provide useful information for sniper detection applications. The SW is easier to detect in disturbed environments, particularly when the shooter is far away and the bullet passes closely. However, a sufficient number of microphones are required to be located within the SW cone, and the SW measurements alone cannot be used to determine the distance to the shooter.

(iii) MB;SW. The total MB;SW model keeps all information from the observations and should thus provide the most accurate and general estimation performance. However, the complexity of the estimation problem is large.

(iv) MB-SW. All algorithms based on the models above require that the synchronization error in each microphone either is negligible or can be described with a statistical distribution. The MB-SW model relaxes such assumptions by eliminating the synchronization error by taking differences of the two pulses at each microphone. This also eliminates the shooting time. The final model contains all interesting parameters for the problem, but only one nuisance parameter (actual speed of sound, which further may be eliminated if known sufficiently well).

The different parameter vectors in the relation y = hL(θN) θL + hN(θN) + e are summarized in Table 2.

5. Cramer-Rao Lower Bound

The accuracy of any unbiased estimator η in the rather general model

y = h(η) + e  (30)

is, under not too restrictive assumptions [20], bounded by the Cramer-Rao bound:

Cov(η) ≥ I^-1(ηo),  (31)

where I(ηo) is Fisher's information matrix evaluated at the correct parameter values ηo. Here, the location x is for notational purposes part of the parameter vector η. Also the sensor positions pk can be part of η, if these are known only with a certain uncertainty. The Cramer-Rao lower bound provides a fundamental estimation limit for unbiased estimators; see [20]. This bound has been analyzed thoroughly in the literature, primarily for AOA, TOA, and TDOA [21–23].

The Fisher information matrix for e ∼ N(0, R) takes the form

I(η) = ∇η[h(η)] R^-1 ∇η^T[h(η)].  (32)

The bound is evaluated for a specific location, parameter setting, and microphone positioning, collectively η = ηo.

The bound for the localization error is

Cov(x) ≥ [In  0] I^-1(ηo) [In  0]^T.  (33)

This covariance can be converted to a more convenient scalar value giving a bound on the root mean square error (RMSE) using the trace operator:

RMSE ≥ √( (1/n) tr([In  0] I^-1(ηo) [In  0]^T) ).  (34)

The RMSE bound can be used to compare the information in different models in a simple and unambiguous way, which does not depend on which optimization criterion is used or which numerical algorithm is applied to minimize the criterion.

5.1. MB Case. For the MB case, the entities in (32) are identified by

η = [x^T  θL^T]^T,  h(η) = hL^MB(x; p) θL,  R = RMB + Rb.  (35)

Note that b is accounted for by the error model. The Jacobian ∇η h is an M-by-(n + 2) matrix, n being the dimension of x. The LS solution in (5a), however, gives a shortcut to an M-by-n Jacobian:

∇x[hL θL] = ∇x[hL (hL^T R^-1 hL)^-1 hL^T R^-1 yo]  (36)


Table 2: Summary of parameter vectors for the different models y = hL(θN) θL + hN(θN) + e, where the noise models are summarized in (29a), (29b), (29c), (29d), and (29e). The values of the dimensions assume that the set of microphones giving SW observations is a subset of the MB observations.

Model | Linear parameters | Nonlinear parameters | dim(θ) | dim(y)
MB | θL^MB = [t0  1/c]^T | θN^MB = [ ] | 2 + 0 | M
SW | θL^SW = t0 | θN^SW = [1/c, α^T, v0]^T | 1 + (n + 1) | S
MB;SW | θL^MB;SW = [t0  b^T]^T | θN^MB;SW = [1/c, α^T, v0]^T | (M + 1) + (n + 1) | M + S
MB-SW | θL^MB-SW = [ ] | θN^MB-SW = [1/c, α^T, v0]^T | 0 + (n + 1) | S


Figure 4: Example scenario. A network with 14 sensors deployed for camp protection. The sensors detect intruders, keep track of vehicle movements, and, of course, locate shooters.

for yo = hL(xo; po)θLo, where xo, po, and θo denote the true (unperturbed) values. For the case n = 2 and known p = po, this Jacobian can, with some effort, be expressed explicitly. The equivalent bound is

Cov(x) ≥ [∇x^T[hL θL] R^-1 ∇x[hL θL]]^-1.  (37)

5.2. SW, MB;SW, and MB-SW Cases. The estimation bounds for the SW, MB;SW, and MB-SW cases are obtained analogously to (33), but hardly any analytical expressions are available. The Jacobian is probably best evaluated by finite difference methods.

5.3. Numerical Example. The really interesting question is how the information in the different models relates to each other. We will study a scenario where 14 microphones are deployed in a sensor network to support camp protection; see Figure 4. The microphones are positioned along a road to track vehicles and around the camp site to detect intruders. Of course, the microphones also detect muzzle blasts and shock waves from gunfire, so shooters can be localized and the shooter's target identified.

A plane model (flat camp site) is assumed, x ∈ R², α ∈ R. Furthermore, it is assumed that

Rb = σb² I  (synchronization error covariance),
RMB = RSW = σe² I  (detection error covariance),  (38)

and that α = 0, c = 330 m/s, v0 = 700 m/s, and r = 0.63. The scenario setup implies that all microphones detect the shock wave, so S = M = 14. All bounds presented below are calculated by numerical finite difference methods.
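One way to compute such bounds numerically is sketched below: the Jacobian of the model function is formed by central finite differences and inserted into (32)-(34). This is a generic, hypothetical illustration with names of our own choosing; any of the model functions of Section 4 can be passed in as h.

```python
import numpy as np

def crlb_rmse(h, eta0, R, n_pos=2, delta=1e-6):
    """RMSE bound (34) evaluated at the true parameter vector eta0.

    h     : callable mapping the parameter vector eta to predicted detection
            times (one of the models of Section 4, with fixed sensor positions)
    eta0  : true parameter values; the first n_pos entries are the position x
    R     : covariance of the Gaussian measurement error
    """
    eta0 = np.asarray(eta0, dtype=float)
    m, d = len(h(eta0)), len(eta0)
    J = np.zeros((m, d))
    for i in range(d):                       # central finite differences
        step = np.zeros(d)
        step[i] = delta
        J[:, i] = (h(eta0 + step) - h(eta0 - step)) / (2.0 * delta)
    info = J.T @ np.linalg.solve(R, J)       # Fisher information, (32)
    cov = np.linalg.inv(info)                # CRLB for the full parameter vector
    cov_x = cov[:n_pos, :n_pos]              # position block, (33)
    return np.sqrt(np.trace(cov_x) / n_pos)  # scalar RMSE bound, (34)
```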

MB Model. The localization accuracy using the MB model is bounded below according to

Cov(xMB) ≥ (σe² + σb²) [64  −17; −17  9] · 10⁴.  (39)

The root mean square error (RMSE) is consequently bounded according to

RMSE(xMB) ≥ √( (1/n) tr Cov(xMB) ) ≈ 606 √(σe² + σb²) [m].  (40)

Monte Carlo simulations (not described here) indicate that the NLS estimator attains this lower bound for √(σe² + σb²) < 0.1 s. The dash-dotted curve in Figure 5 shows the bound versus σb for fixed σe = 500 μs. An uncontrolled increase as soon as σb > σe can be noted.

SW Model. The SW model is disregarded here, since the SW detections alone contain no shooter distance information.

MB-SW Model. The localization accuracy using the MB-SW model is bounded according to

Cov(xMB-SW) ≥ σe² [28  5; 5  12] · 10⁵,  (41)

RMSE(xMB-SW) ≥ 1430 σe [m].  (42)

The dashed lines in Figure 5 correspond to the RMSE bound for four different values of σe. Here, the MB-SW model gives at least twice the error of the MB model, provided that there are no synchronization errors. However, in a wireless network we expect the synchronization error to be 10–100 times larger than the detection error, and then the MB-SW error will be substantially smaller than the MB error.

MB;SW Model. The expression for the MB;SW bound is somewhat involved, so the dependence on σb is only presented graphically; see Figure 5. The solid curves correspond to the MB;SW RMSE bound for the same four values of σe as for the MB-SW bound. Apparently, when the synchronization error σb is large compared to the detection error σe, the MB-SW and MB;SW models contain roughly the same amount of information, and the model having the simplest estimator, that is, the MB-SW model, should be preferred. However, when the synchronization error is smaller than 100 times the detection error, the complete MB;SW model becomes more informative.

Figure 5: Cramer-Rao RMSE bound (34) for the MB (40), the MB-SW (42), and the MB;SW models, respectively, as a function of the synchronization error (STD) σb, and for different levels of detection error σe (50–1000 μs); the MB curve is shown for σe = 500 μs.

These results are comparable with the analysis in [12, Figure 4a], where an example scenario with 6 microphones is considered.

5.4. Summary of the CRLB Analysis. The synchronization error level in a wireless sensor network is usually a matter of design tradeoff between performance and the battery costs required by synchronization mechanisms. Based on the scenario example, the CRLB analysis is summarized with the following recommendations.

(i) If σb ≫ σe, then the MB-SW model should be used.

(ii) If σb is moderate, then the MB;SW model should be used.

(iii) Only if σb is very small (σb ≤ σe), the shooting direction is of minor interest, and performance may be traded for simplicity should the MB model be used.

6. Experimental Data

A field trial to collect acoustic data on nonmilitary small arms fire is conducted. 10 microphones are placed around a fictitious camp; see Figure 6. The microphones are placed close to the ground and wired to a common recorder with 16-bit sampling at 48 kHz. A total of 42 rounds are fired from three positions and aimed at a common cardboard target. Three rifles and one pistol are used; see Table 3. Four rounds are fired of each armament at each shooter position, with two exceptions. The pistol is only used at position three. At position three, six instead of four rounds of 308 W are fired. All ammunition types are supersonic. However, when firing from position three, not all microphones are subjected to the shock wave.

Figure 6: Scene of the shooter localization field trial. There are ten microphones, three shooter positions, and a common target.

Light wind, no clouds, and around 24°C are the weather conditions. Little or no acoustic disturbances are present. The terrain is rough. Dense woods surround the test site. There is light bush vegetation within the site. Shooter position 1 is elevated some 20 m; otherwise spots are within ±5 m of a horizontal plane. Ground truth values of the positions are determined with less relative error than 1 m, except for shooter position 1, which is determined with 10 m accuracy.

6.1. Detection. The MB and SW are detected by visual inspection of the microphone signals in conjunction with filtering techniques. For shooter positions 1 and 2, the shock wave detection accuracy is approximately σe^SW ≈ 80 μs, and the muzzle blast error σe^MB is slightly worse. For shooting position 3 the accuracies are generally much worse, since the muzzle blast and shock wave components become intermixed in time.

6.2. Numerical Setup. For simplicity, a plane model is assumed. All elevation measurements are ignored and x ∈ R² and α ∈ R. Localization using the MB model (7) is done by minimizing (10a) over a 10 m grid well covering the area of interest, followed by numerical minimization.

Localization using the MB-SW model (25) is done by numerically minimizing (28). The objective function is subject to local optima; therefore the more robust muzzle blast localization x is used as an initial guess. Furthermore, the direction from x toward the mean point of the microphones (the camp) is used as initial shooting direction α. Initial bullet speed is v = 800 m/s and initial speed of sound is c = 330 m/s. r = 0.63 is used, which is a value derived from the 308 Winchester ammunition ballistics.

6.3. Results. Figure 7 shows, at three enlarged parts of the scene, the resulting position estimates based on the MB model (blue crosses) and based on the MB-SW model (squares).


Table 3: Armament and ammunition used at the trial, and number of rounds fired at each shooter position. Also, the resulting localization RMSE for the MB-SW model for each shooter position. For the Luger Pistol the MB model RMSE is given, since only one microphone is located in the Luger Pistol SW cone.

Type Caliber Weight Velocity Sh. pos. # Rounds RMSE

308 Winchester 7.62 mm 9.55 g 847 m/s 1, 2, 3 4, 4, 6 19, 6, 6 m

Hunting Rifle 9.3 mm 15 g 767 m/s 1, 2, 3 4, 4, 4 6, 5, 6 m

Swedish Mauser 6.5 mm 8.42 g 852 m/s 1, 2, 3 4, 4, 4 40, 6, 6 m

Luger Pistol 9 mm 6.8 g 400 m/s 3 —, —, 4 —, —, 2 m

Apparently, the use of the shock wave significantly improves localization at positions 1 and 2, while rather the opposite holds at position 3. Figure 8 visualizes the shooting direction estimates α. Root mean square errors (RMSEs) of the estimates for the three shooter positions, together with the theoretical bounds (34), are given in Table 4. The practical results indicate that the use of the shock wave from distant shooters cuts the error by at least 75%.

6.3.1. Synchronization and Detection Errors. Since all microphones are recorded by a common recorder, there are actually no timing errors due to inaccurate clocks. This is of course the best way to conduct a controlled experiment, where any uncertainty renders the dataset less useful. From an experimental point of view, it is then simple to add synchronization errors of any desired magnitude off-line. On the dataset at hand, this is however work in progress. At the moment, there are apparently other sources of error worth identifying. It should however be clarified that in the final wireless sensor product, there will always be an unpredictable clock error. As mentioned, detection errors are present, and the expected level of these (80 μs) is used for the bound calculations in Table 4. It is noted that the bounds are in level with, or below, the positioning errors.
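For such an off-line study, clock offsets can be emulated by adding one random offset per microphone to both of its detection times, for example along the following lines (a hypothetical sketch with names of our own choosing; it assumes that the same microphones provide both MB and SW detections).

```python
import numpy as np

def add_clock_offsets(y_mb, y_sw, sigma_b, seed=0):
    """Emulate wireless-network synchronization errors as in (29c) by adding
    one Gaussian clock offset per microphone to both detection times."""
    rng = np.random.default_rng(seed)
    b = rng.normal(0.0, sigma_b, size=len(y_mb))
    return y_mb + b, y_sw + b
```

The MB-SW difference y_mb − y_sw is unaffected by b, which is exactly the property exploited in Section 4.4.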

There are at least two explanations for the bad performance using the MB-SW model at shooter position 3. One is that the number of microphones reached by the shock wave is insufficient to make accurate estimates. There are four unknown model parameters, but for the relatively low speed of pistol ammunition, for instance, only one microphone has a valid shock wave detection. Another explanation is that the increased detection uncertainty (due to SW/MB intermix) impacts the MB-SW model harder, since it relies on accurate detection of both the MB and SW.

6.3.2. Model Errors. No doubt, there are model inaccuracies both in the ballistic and in the acoustic domain. To that end, there are meteorological uncertainties out of our control. For instance, looking at the MB-SW localizations around shooter position 1 in Figure 7 (squares), three clusters are identified that correspond to three ammunition types with different ballistic properties; see the RMSE for each ammunition and position in Table 3. This clustering or bias more likely stems from model errors than from detection errors and could at least partially explain the large gap between the theoretical bound and the RMSE in Table 4.

Table 4: Localization RMSE and theoretical bound (34) for the three different shooter positions using the MB and the MB-SW models, respectively, beside the aim RMSE for the MB-SW model. The aim RMSE is with respect to the aim at x against the target, α′, not with respect to the true direction α. This way the ability to identify the target is assessed.

Shooter position 1 2 3

RMSE(xMB) 105 m 28 m 2.4 m

MB Bound 1 m 0.4 m 0.02 m

RMSE (xMB-SW) 26 m 5.7 m 5.2 m

MB-SW Bound 9 m 0.1 m 0.08 m

RMSE(α′) 0.041◦ 0.14◦ 17◦

Working with three-dimensional data in the plane is of course another model discrepancy that could have a greater impact than we first anticipated. This will be investigated in experiments to come.

6.3.3. Numerical Uncertainties. Finally, we face numerical uncertainties. There is no guarantee that the numerical minimization programs we have used here for the MB-SW model really deliver the global minimum. In a realistic implementation, all available a priori knowledge, qualitative analysis of the SW and MB signals (amplitude, duration, caliber classification, etc.), and basic consistency checks are used to reduce the search space. The reduced search space may then be exhaustively sampled over a grid prior to the final numerical minimization. Simple experiments on an ordinary desktop PC indicate that with an efficient implementation, it is feasible to minimize any of the described model objective functions over a discrete grid with 10^7 points within the time frame of one second. Thus, by allowing, say, one second extra of computation time, the risk of hitting a local optimum could be significantly reduced.

7. Conclusions

We have presented a framework for estimation of shooter location and aiming angle from wireless networks where each node has a single microphone. Both the acoustic muzzle blast (MB) and the ballistic shock wave (SW) contain useful information about the position, but only the SW contains information about the aiming angle. A separable nonlinear least squares (SNLS) framework was proposed to limit the parametric search space and to enable the use of global grid-based optimization algorithms (for the MB model), eliminating potential problems with local minima.

Figure 7: Estimated positions x based on the MB model and on the MB-SW model. The diagrams are enlargements of the interesting areas around the shooter positions. The dashed lines identify the shooting directions.

For a perfectly synchronized network, both MB and SW measurements should be stacked into one large signal model for which SNLS is applied. However, when the synchronization error in the network becomes comparable to the detection error for MB and SW, the performance quickly deteriorates. For that reason, the time difference of MB and SW at each microphone is used, which automatically eliminates any clock offset. The effective number of measurements decreases in this approach, but as the CRLB analysis showed, the root mean square position error is comparable to that of the ideal stacked model, at the same time as the synchronization error distribution may be completely disregarded.

Figure 8: Estimated shooting directions. The relatively slow pistol ammunition is excluded.

The bullet speed occurs as a nuisance parameter in the proposed signal model. Further, the bullet retardation constant was optimized manually. Future work will investigate whether the retardation constant should also be estimated, and whether these two parameters can be used, together with the MB and SW signal forms, to identify the weapon and ammunition.

Acknowledgment

This work is funded by the VINNOVA supported Centre for Advanced Sensors, Multisensors and Sensor Networks, FOCUS, at the Swedish Defence Research Agency, FOI.

References

[1] J. Bedard and S. Pare, "Ferret, a small arms' fire detection system: localization concepts," in Sensors, and Command, Control, Communications, and Intelligence (C3I) Technologies for Homeland Defense and Law Enforcement II, vol. 5071 of Proceedings of SPIE, pp. 497–509, 2003.

[2] J. A. Mazurek, J. E. Barger, M. Brinn et al., "Boomerang mobile counter shooter detection system," in Sensors, and C3I Technologies for Homeland Security and Homeland Defense IV, vol. 5778 of Proceedings of SPIE, pp. 264–282, Bellingham, Wash, USA, 2005.

[3] D. Crane, "Ears-MM soldier-wearable gun-shot/sniper detection and location system," Defence Review, 2008.

[4] "PILAR Sniper Countermeasures System," November 2008, http://www.canberra.com.

[5] J. Millet and B. Balingand, "Latest achievements in gunfire detection systems," in Proceedings of the RTO-MP-SET-107 Battlefield Acoustic Sensing for ISR Applications, Neuilly-sur-Seine, France, 2006.

[6] P. Volgyesi, G. Balogh, A. Nadas, et al., "Shooter localization and weapon classification with soldier-wearable networked sensors," in Proceedings of the 5th International Conference on Mobile Systems, Applications, and Services (MobiSys '07), San Juan, Puerto Rico, 2007.

[7] A. Saxena and A. Y. Ng, "Learning sound location from a single microphone," in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA '09), pp. 1737–1742, Kobe, Japan, May 2009.

[8] W. S. Conner, J. Chhabra, M. Yarvis, and L. Krishnamurthy, "Experimental evaluation of synchronization and topology control for in-building sensor network applications," in Proceedings of the 2nd ACM International Workshop on Wireless Sensor Networks and Applications (WSNA '03), pp. 38–49, San Diego, Calif, USA, September 2003.

[9] O. Younis and S. Fahmy, "A scalable framework for distributed time synchronization in multi-hop sensor networks," in Proceedings of the 2nd Annual IEEE Communications Society Conference on Sensor and Ad Hoc Communications and Networks (SECON '05), pp. 13–23, Santa Clara, Calif, USA, September 2005.

[10] J. Elson and D. Estrin, "Time synchronization for wireless sensor networks," in Proceedings of the International Parallel and Distributed Processing Symposium, 2001.

[11] G. Simon, M. Maroti, A. Ledeczi, et al., "Sensor network-based countersniper system," in Proceedings of the 2nd International Conference on Embedded Networked Sensor Systems (SenSys '04), pp. 1–12, Baltimore, Md, USA, November 2004.

[12] G. T. Whipps, L. M. Kaplan, and R. Damarla, "Analysis of sniper localization for mobile, asynchronous sensors," in Signal Processing, Sensor Fusion, and Target Recognition XVIII, vol. 7336 of Proceedings of SPIE, 2009.

[13] E. Danicki, "Acoustic sniper localization," Archives of Acoustics, vol. 30, no. 2, pp. 233–245, 2005.

[14] L. M. Kaplan, T. Damarla, and T. Pham, "QoI for passive acoustic gunfire localization," in Proceedings of the 5th IEEE International Conference on Mobile Ad-Hoc and Sensor Systems (MASS '08), pp. 754–759, Atlanta, Ga, USA, 2008.

[15] D. Lindgren, O. Wilsson, F. Gustafsson, and H. Habberstad, "Shooter localization in wireless sensor networks," in Proceedings of the 12th International Conference on Information Fusion (FUSION '09), pp. 404–411, Seattle, Wash, USA, 2009.

[16] R. Stoughton, "Measurements of small-caliber ballistic shock waves in air," Journal of the Acoustical Society of America, vol. 102, no. 2, pp. 781–787, 1997.

[17] F. Gustafsson, Statistical Sensor Fusion, Studentlitteratur, Lund, Sweden, 2010.

[18] E. Danicki, "The shock wave-based acoustic sniper localization," Nonlinear Analysis: Theory, Methods & Applications, vol. 65, no. 5, pp. 956–962, 2006.

[19] K. W. Lo and B. G. Ferguson, "A ballistic model-based method for ranging direct fire weapons using the acoustic muzzle blast and shock wave," in Proceedings of the International Conference on Intelligent Sensors, Sensor Networks and Information Processing (ISSNIP '08), pp. 453–458, December 2008.

[20] S. Kay, Fundamentals of Statistical Signal Processing: Estimation Theory, Prentice Hall, Upper Saddle River, NJ, USA, 1993.

[21] N. Patwari, A. O. Hero III, M. Perkins, N. S. Correal, and R. J. O'Dea, "Relative location estimation in wireless sensor networks," IEEE Transactions on Signal Processing, vol. 51, no. 8, pp. 2137–2148, 2003.

[22] S. Gezici, Z. Tian, G. B. Giannakis et al., "Localization via ultra-wideband radios: a look at positioning aspects of future sensor networks," IEEE Signal Processing Magazine, vol. 22, no. 4, pp. 70–84, 2005.

[23] F. Gustafsson and F. Gunnarsson, "Possibilities and fundamental limitations of positioning using wireless communication networks measurements," IEEE Signal Processing Magazine, vol. 22, pp. 41–53, 2005.