Implementation of an acoustic localization
algorithm for video camera steering
Paolo Minero, Mustafa Ergen
{minero,ergen}@eecs.berkeley.edu
University of California at Berkeley
1 Introduction
Advances in wireless networking and sensor network design open new prospects for performing
demanding collaborative sensing and signal processing. This project was motivated by the
ambitious plan of designing and implementing acoustic applications in low-cost distributed
wireless sensor networks. As a first step, we considered the problem of acoustic localization
through a set of distributed sensors. Acoustic localization has been extensively studied for
microphone arrays, since it has several practical applications, e.g. video surveillance, video
conferencing, and home and military applications.
In this report, we first introduce a classic mathematical model for the propagation of wide-
band acoustic sources in a reverberant environment. Then we review two traditional approaches
for acoustic localization, based on the maximum-likelihood and the cross-correlation methods.
Finally, we describe the implementation of a Windows-based wireless networked acoustic sensor
testbed used for video camera steering.
2 Channel model
In the following we consider wideband signals (i.e. 30 Hz to 15 kHz) such as voices, vibrations,
or movements. We assume operation in the near-field regime and that the propagation speed is
known (typically, for an acoustic source in air, 345 m/s). The acoustic source is located in a
reverberant space. Consider the situation in which 2R sensors are randomly distributed in
space and a single acoustic source generates the wavefront x(t) in a multipath environment.
The channel between the source and each sensor is modeled as an LTI system, and the signal
received at each sensor is sampled at a rate fc of one sample per unit of time. Assume that
all sensors sample the received signal synchronously. We assume that the channel response has
a finite number of taps L, and that the channel taps do not vary over N sample times. Starting
at time 1, a sequence of samples yi[1], yi[2], . . . is received at the ith sensor, i = 1, . . . , 2R. The
resulting single-input multiple-output (SIMO) channel is:
yi[n] = Σ_{k=0}^{L−1} hi[k] x[n − k] + wi[n],   i = 1, 2, . . . , 2R   (1)
where hi[n] is the sampled impulse response of the channel between the source and the ith
sensor, and wi[n] is additive background noise, modeled as a white Gaussian process with zero
mean and variance σn². The noise is assumed uncorrelated from sensor to sensor. In vector
form, the SIMO system is equivalent to

yi = Hi x + wi,   i = 1, 2, . . . , 2R   (2)
where

yi = [yi[1] yi[2] . . . yi[N]]T
x = [x[1] x[2] . . . x[N]]T

and Hi is the N × N lower-triangular Toeplitz convolution matrix whose first column is
[hi[0] hi[1] . . . hi[L−1] 0 . . . 0]T:

Hi =
[ hi[0]      0         · · ·              0
  hi[1]      hi[0]     · · ·              0
    ·          ·         ·                ·
  hi[L−1]   hi[L−2]    · · ·              0
    0       hi[L−1]    · · ·              0
    ·          ·         ·                ·
    0        · · ·    hi[L−1]  · · ·  hi[0] ]
3 Problem formulation
We are interested in estimating the location of the source rs = [rsx rsy rsz]T, given the received
samples yi[n] and the location of each sensor ri = [rix riy riz]T, i = 1, . . . , 2R. Let us partition
the sensors into pairs, and let us denote by j1 and j2 the two sensors in the jth pair. In a
multipath channel such as (1), an observable signal characteristic is the time difference of arrival
(TDOA) of a source relative to each pair of sensors. In fact, the signal component propagating
through the direct path is usually stronger than the components due to reflections; hence, the
channel tap due to the direct path is most of the time the largest in magnitude. As such, the
propagation delay in samples τi between the ith sensor and the source can be determined as

τi = arg max_j |hi[j]|,   j = 0, 1, . . . , L − 1   (3)

The TDOA between the two channels in the jth pair of sensors is then obtained as

τj = τj1 − τj2,   j = 1, 2, . . . , R   (4)
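Once tap estimates are available, (3) and (4) reduce to an argmax per sensor. A minimal numpy sketch with hypothetical impulse responses (the tap values below are invented for illustration):

```python
import numpy as np

# Hypothetical impulse responses: the direct path (largest tap) arrives at
# sample 5 for sensor 1 and sample 9 for sensor 2; smaller taps are reflections.
h1 = np.zeros(32); h1[5] = 1.0; h1[12] = 0.4; h1[20] = 0.2
h2 = np.zeros(32); h2[9] = 1.0; h2[15] = 0.3

# Eq. (3): per-sensor propagation delay = index of the largest-magnitude tap.
tau1 = int(np.argmax(np.abs(h1)))
tau2 = int(np.argmax(np.abs(h2)))

# Eq. (4): TDOA of the pair, in samples.
tau = tau1 - tau2
print(tau)  # -4: sensor 1 hears the source 4 samples earlier
```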
Given the TDOA and the sampling rate fc, one can easily compute the relative distance of
the acoustic source from the jth pair of microphones as

Dj = τj v / fc,   j = 1, 2, . . . , R   (5)
where v is the propagation speed of sound. The TDOA estimates of each pair of sensors,
along with the knowledge of the sensor positions, are used to arrive at a source location estimate.
Given Dj, the sound source rs must lie on a three-dimensional hyperboloid with equation
d(rj1, rs) − d(rj2, rs) = Dj, where d(x, y) denotes the Euclidean distance between vectors x
and y. Denote by D = [D1 . . . DR]T the set of distances from the source to every pair of
sensors. The vector D identifies a set of hyperboloids and, as a consequence, the location of the
source must be the unique intersection point of this set of hyperboloids. However, the channel
impulse response hi is usually not known at the ith sensor and must be estimated from the
received samples yi. The channel estimation problem is particularly challenging since we do
not know the transmitted signal x: the channel must be blindly estimated. In the next
section we will derive the optimal ML estimator for the joint estimation of the source signal
x and the channel hi given the yi. Errors in the channel estimates may introduce errors in
D, and a point lying on all the hyperboloids may not exist. Therefore, given a set of R pairs
of sensors, we estimate the source location as the vector rs which minimizes the LS error
with respect to the vector D. Thus, the source location is estimated as
rs = arg min_r Σ_{j=1}^{R} |d(rj1, r) − d(rj2, r) − Dj|²   (6)
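The LS criterion in (6) can be minimized in several ways; the sketch below (numpy, brute-force grid search over hypothetical 2-D sensor positions — the real system works in 3-D with its own solver) illustrates the idea:

```python
import numpy as np

def ls_error(r, pairs, D):
    """Sum of squared hyperbolic residuals of Eq. (6) at candidate location r."""
    return sum((np.linalg.norm(r1 - r) - np.linalg.norm(r2 - r) - Dj) ** 2
               for (r1, r2), Dj in zip(pairs, D))

def locate(pairs, D, grid):
    """Estimate the source as the grid point minimizing the LS error."""
    return min(grid, key=lambda r: ls_error(r, pairs, D))

# Hypothetical setup: two sensor pairs in a 4 m x 3 m room, source at (1, 1).
pairs = [(np.array([0.0, 0.0]), np.array([4.0, 0.0])),
         (np.array([0.0, 3.0]), np.array([4.0, 3.0]))]
src = np.array([1.0, 1.0])
D = [np.linalg.norm(r1 - src) - np.linalg.norm(r2 - src) for r1, r2 in pairs]
grid = [np.array([x, y]) for x in np.linspace(0, 4, 41)
                         for y in np.linspace(0, 3, 31)]
est = locate(pairs, D, grid)   # recovers (1.0, 1.0)
```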
In the next section two methods for the channel and TDOA estimations are introduced.
4 TDOA estimation
4.1 ML estimation
Let us consider the ith sensor. We are interested in jointly estimating the source signal x
and the channel hi given the received samples yi. The block of samples yi can be transformed
to the frequency domain by a DFT of length Nc. In order to approximate the linear convolution
in (1), appropriate zero padding can be applied, so in general Nc > N. In the frequency domain,
the channel model is given by

Yi = Hi X + Wi,   i = 1, 2, . . . , 2R   (7)

where Yi = [Yi[1] Yi[2] . . . Yi[Nc]]T represents the Nc-point DFT of the sampled data received
at the ith sensor. The ML estimation problem can be expressed as:
max_{Hi, X} f(Y | Hi, X)   (8)

It can be shown that the solution to the ML problem is the following:

Ĥi = arg min_{Hi} ‖(I − Hi (HiT Hi)⁻¹ HiT) Y‖²   (9)

X̂ = (HiT Hi)⁻¹ HiT Y   (10)
Given the channel estimate (9), one can compute the propagation delay in samples as

τ = arg max_j |IDFT(Ĥi)[j]|   (11)
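For intuition, the estimator in (10) is an ordinary least-squares projection of Y onto the column space of Hi. A small numpy sketch with made-up dimensions (not the actual blind estimator, which must also solve (9) for the channel):

```python
import numpy as np

rng = np.random.default_rng(0)

H = rng.standard_normal((64, 16))                  # hypothetical known channel matrix
x_true = rng.standard_normal(16)
y = H @ x_true + 0.01 * rng.standard_normal(64)    # noisy observation, as in (7)

# Eq. (10): X = (H^T H)^{-1} H^T Y, computed stably via least squares.
x_hat, *_ = np.linalg.lstsq(H, y, rcond=None)

# Eq. (9) measures the energy of y outside the column space of H;
# with a good channel guess this residual is at the noise level.
residual_energy = np.sum((y - H @ x_hat) ** 2)
```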
Complexity prohibits the use of (9) in real-time applications. More generally, the optimal
strategies proposed in the literature are computationally intense, but tend to be robust in
realistic environments. The majority of practical localization systems are based on less complex
but suboptimal estimators.
4.2 GCC method
The most commonly used method for TDOA estimation is the GCC method, introduced in [2].
The TDOA estimate of each pair of sensors is obtained as the time lag which maximizes the
cross-correlation function between filtered versions of the received signals; the GCC function
is in fact defined as the cross-correlation function of the filtered signals. In [2], the signals at
the two sensors are modeled as:

y1(t) = x(t) + n1(t)
y2(t) = αx(t − D) + n2(t)

where n1(t), n2(t) and x(t) are assumed jointly wide-sense stationary. A voice signal is usually
assumed stationary over 20 ms to 30 ms frames. A property of the autocorrelation function is
that R(τ) ≤ R(0). The cross-correlation of y1(t) and y2(t) is

Ry1y2(τ) = αRxx(τ) ∗ δ(τ − D)   (12)

and it has a peak at τ = D. So, the delay estimate is

D̂ = arg max_τ Ry1y2(τ)   (13)
If the input signal is ergodic, the cross-correlation function can be approximated by the
time-average correlation

Ry1y2(τ) = 1/(T − τ) ∫_τ^T y1(t) y2(t − τ) dt   (14)
The cross-correlation function is related to the cross-spectral density function through an
inverse Fourier transform

Ry1y2(τ) = ∫_{−∞}^{∞} Gy1y2(f) e^{j2πfτ} df   (15)

In order to eliminate the influence of the possible auto-correlation of the source signal, it is
desirable to pre-filter y1 and y2 before their cross-correlation is computed. Various filter
designs have been proposed (whitening filters, Wiener-Hopf, ML for Gaussian sources). When
y1 and y2 are filtered, the cross-correlation function becomes

Ry1y2(τ) = ∫_{−∞}^{∞} Ψ(f) Gy1y2(f) e^{j2πfτ} df   (16)
The GCC technique does not perform well in a reverberant environment (the mathematical
model assumes free-space propagation). A basic approach to dealing with a multipath channel
is to apply a de-emphasizing frequency-dependent weighting. The Phase Transform (PHAT)
method places equal emphasis on each component of the cross-spectrum phase. The
corresponding filter design is

Ψ(f) = 1 / |Gy1y2(f)|   (17)

The resulting peak corresponds to the dominant delay in the reverberant signal. However,
this weighting accentuates components with poor SNR, and it is known that the PHAT filter
performs badly under high reverberation and high noise conditions. We can apply the GCC
method to the channel model in (1). For starters, the model can be written as

yi = Hdi x + zi,   i = 1, 2, . . . , 2R   (18)

where Hdi is the component of the channel response due to the direct path, and zi is colored
noise, given by the sum of wi and the component of the signal due to reverberation.
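As a concrete illustration, a GCC-PHAT TDOA estimator as in (16)-(17) can be sketched in a few lines of numpy (the actual testbed uses Matlab; the function below and its sign convention are our own choices):

```python
import numpy as np

def gcc_phat(y1, y2, fs, max_tau=None):
    """TDOA of y1 relative to y2 via GCC with the PHAT weighting of Eq. (17)."""
    n = len(y1) + len(y2)                 # zero-pad: linear, not circular, correlation
    Y1 = np.fft.rfft(y1, n)
    Y2 = np.fft.rfft(y2, n)
    G = Y1 * np.conj(Y2)                  # cross-spectrum G_{y1 y2}(f)
    G /= np.maximum(np.abs(G), 1e-12)     # PHAT filter: keep phase, drop magnitude
    cc = np.fft.irfft(G, n)               # generalized cross-correlation, Eq. (16)
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))  # center lag 0
    lag = int(np.argmax(np.abs(cc))) - max_shift
    return lag / fs                       # TDOA in seconds
```

Feeding it a white-noise burst and a copy delayed by 20 samples at fs = 8 kHz returns −20/8000 s, i.e. a negative value when the second microphone hears the source later, matching the convention τj = τj1 − τj2 of (4).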
As an example, let us consider the jth pair of microphones, and suppose the direct path
arrives at sensor j1 with zero delay and at sensor j2 with a delay of m samples. A possible
representation of the two channel matrices is

Hdj1 = hj1[0] · I   and   Hdj2 = hj2[m] · Sm   (19)

where I is the N × N identity matrix and Sm is the N × N shift matrix with ones on the mth
subdiagonal, i.e. Hdj1 has hj1[0] on its main diagonal and Hdj2 has hj2[m] on its mth
subdiagonal.
In this specific case we have that

yj1[n] yj2[n + τ] ≤ yj1[n] yj2[n + m] = hj1[0] hj2[m] |x[n]|² + zj1[n] zj2[n + m]   (20)
More generally, the TDOA for a pair of sensors can be directly computed as the index which
maximizes the correlation of the two received data sequences. Thus, the TDOA can be formally
expressed in the following way:

τj = arg max_m Σ_{k=0}^{N} yj1[k] yj2[k + m]   (21)
The advantage of this approach compared to the ML estimation is that the GCC method
is computationally less demanding. On the other hand, the noise is now correlated and, in the
presence of multipath, the maximum of (21) may not correspond to the true delay. In fact,
reverberation causes spurious peaks which may have greater amplitude than the peak due to
the true source, so that choosing the maximum peak may not give accurate results. The
frequency of these spurious peaks increases dramatically as the reverberation time becomes
larger than 200 ms. Also, the method cannot accommodate multi-source scenarios. Although
suboptimal, the method performs well in rooms with low reverberation, and its simplicity
suggests its use in real-time applications.
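A direct time-domain evaluation of (21) is straightforward, if slower than the FFT-based GCC; a numpy sketch (the function name and the symmetric lag search are our own choices):

```python
import numpy as np

def tdoa_direct(y1, y2, max_lag):
    """Eq. (21): the lag m maximizing sum_k y1[k] * y2[k + m]."""
    n = min(len(y1), len(y2))
    best_m, best_c = 0, -np.inf
    for m in range(-max_lag, max_lag + 1):
        if m >= 0:
            c = np.dot(y1[:n - m], y2[m:n])   # sum over k = 0 .. n-m-1
        else:
            c = np.dot(y1[-m:n], y2[:n + m])  # shift the other way for negative m
        if c > best_c:
            best_m, best_c = m, c
    return best_m
```

With the convention of (21), a positive m means the second microphone's signal lags the first.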
5 Implementation
5.1 The synchronization problem
The GCC/PHAT method (17) has been implemented in an acoustic tracking application. This
method assumes that each pair of sensors in the network performs synchronous sensing. By
synchronous sensing we mean that the audio signals must be sampled synchronously and the
sampled data must be shared within each pair to perform the signal processing. On the other
hand, different pairs need not be synchronized, and sampled data can be collected with time
tags at a central node; thus, the synchronization requirement is limited to each pair.
Synchronous sensing implies limited timing errors: when sampling at rates higher than 8 kHz,
the timing error cannot exceed a few microseconds. Besides clock synchronization across
different nodes, perfect synchronous sampling can be achieved by controlling the latencies of
the sampling system. However, achieving clock synchronization of the order of microseconds
in a distributed sensor network is a challenging task, and many audio cards have large and
non-deterministic latencies when asked to start recording. In [3], clock synchronization is
achieved by implementing the RBS algorithm on Linux machines, and the accuracy of the
sampling sub-system is controlled using an "audio server". RBS is a protocol for high-precision
synchronization which, on the other hand, demands a significant communication load. In our
implementation, we avoided the synchronization problem by attaching two microphones to the
same board and sampling the received signal through a common processor. This choice also
solves the problem of the accuracy of the sampling sub-system, since the non-deterministic
latency of the audio card is equal for the two microphones. The resulting network topology is
drawn in Figure 1, where each pair of microphones is connected to a processing board, and the
boards are wirelessly linked to a central node.
[Figure: microphone pairs, each connected to a laptop; the laptops are wirelessly linked to a central laptop that steers the camera]
Figure 1: Network structure
5.2 Hardware and software requirements
The hardware system consists of eight microphones, five laptops and one video camera. The
number of microphones is arbitrary, as long as it is a multiple of two. The microphones are
divided into pairs, and each pair is connected to a laptop. Since the microphone input line is
mono, the microphones must be connected to the laptop through the line in. However, the
line in is un-powered, so the microphones must be powered with a battery. We tested two
types of microphones: the Radio Shack Omnidirectional Boundary Microphone and the Sony
ECM-F8. We used five IBM ThinkPads running Windows XP, and the camera we tested is a
Sony EVI-D71. The microphones cannot be plugged into the line in through a simple
mini-to-mini Y adaptor. Instead, each microphone plugs into a mini-to-RCA connector, and
the two female RCA ends fit into an RCA-to-mini Y adaptor. We used standard connectors
and adaptors available in stores. The remaining laptop is used as an information sink and to
send commands to the camera.
In the following, we will refer to the central laptop as server and to the laptops connected to
the microphones as clients. Clients and server communicate through a TCP connection, so the
laptops must be connected via Wireless LAN or Ethernet. We connected the laptops through
an 802.11a access point. The operating system used in each laptop is Windows XP, and Matlab
(with Data Acquisition Toolbox) must be installed on each machine. The package we developed
consists of the following files:
• server.m: the main application run at the server. Based on the distance estimates, it
performs the LS algorithm for estimating the source location (6) and sends commands
to the camera.
• client.m: the main application run at the client. It estimates the source distance
through the PHAT method (17) and sends information to the server.
• myfun.m and LMS.m: code run at the server while performing the LS algorithm (6).
• detect voice mex.dll : silence/speech detector run at the client.
• MCreateFigure.m and Updater.m: they create and update the main figure displayed at
the server.
• pnet.m, pnet getvar.m, pnet putvar.m, pnet remote.m: they manage the TCP/IP connec-
tion.
• camcmd.exe and Devserv.exe: these applications send commands to the camera through
UDP packets.
5.3 Setup
The core of the code consists of two files, server.m and client.m. Before setting up the
parameters in these files, we need to define a Cartesian set of coordinates for the room, then
fix the camera and microphone locations and measure these positions with respect to the
chosen set of coordinates. In this demo it is assumed that the server knows exactly the
positions of the microphones and the camera, so the coordinates must be stored in server.m.
Then, we need to connect the server and clients to the same sub-network. Each client should
be able to ping the server, and the port chosen for the TCP application should not conflict
with other applications. Next, we need to configure the operating system of each client to use
Line-In for recording; since we are using Windows, this can be done through the Volume
Control panel. Finally, some application-specific parameters may be set in server.m and
client.m; refer to the code documentation for details. The video camera must be connected
to the server.
5.4 Running the demo
Once server.m is run, a window is shown on the server screen (an example is shown in
Figure 2). The figure shows the geometry of the room and the relative positions of the
microphones. A green circle represents the final estimate of the acoustic source location; the
size of the circle is proportional to the estimation error, i.e. the smaller the dot, the more
accurate the estimate. After that, the server waits for connections from the clients. Every
time a packet is received from a client, the distance estimates are updated with the new
information stored in the packet. If at least one distance estimate has changed, the LS
algorithm (6) is run to re-estimate the source location. When the LS algorithm produces an
estimate with a sufficiently small error, the source location is updated on the figure and a
command is sent to steer the camera. The user can stop the server by clicking the 'Stop'
button in the figure.

Figure 2: Server display

Once client.m is run, Matlab samples the sound signal received at the two microphones and
stores a vector of samples corresponding to one second of sound. The two data vectors are
sampled synchronously and are ready to be processed. A simple silence/speech detector
analyses the sampled data to identify frames of voice. If the presence of voice is detected, the
GCC/PHAT algorithm is run to provide an estimate of the source distance. If the information
is reliable, i.e. the estimate is not larger than the distance between the two microphones, the
estimate is sent to the server through the TCP connection. After that, Matlab stores a new
sound sample to process, until a timer set by the user expires.
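The shipped detector detect_voice_mex.dll is a compiled black box; a plausible stand-in is a short-term-energy detector (the frame length and threshold below are arbitrary choices, not the values used by the dll):

```python
import numpy as np

def detect_voice(samples, frame_len=160, threshold=0.01):
    """Flag frames whose mean energy exceeds a fixed threshold."""
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    return np.mean(frames ** 2, axis=1) > threshold
```

A 160-sample frame is 20 ms at 8 kHz, matching the stationarity window mentioned in Section 4.2.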
6 Conclusion and future work
In this project, we implemented a Windows-based wireless sensor network testbed for acoustic
localization. The testbed has been used for a camera steering application. Future work
includes the implementation of the algorithm on ZigBee boards; the major challenge is the
design of an efficient synchronization protocol with minimal communication load. We also aim
to use sensor networks for speech enhancement applications, e.g. acoustic beamforming and
noise suppression.
References
[1] J. C. Chen, K. Yao, and R. E. Hudson, "Source localization and beamforming," IEEE
Signal Processing Magazine, vol. 19, no. 2, pp. 30-39, March 2002.
[2] C. H. Knapp and G. C. Carter, "The generalized correlation method for estimation of
time delay," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24,
no. 4, pp. 320-327, 1976.
[3] J. C. Chen, R. E. Hudson, and K. Yao, "Maximum-likelihood source localization and
unknown sensor location estimation for wideband signals in the near-field," IEEE
Transactions on Signal Processing, vol. 50, pp. 1843-1853, Aug. 2002.
[4] S. Hirsch, Acoustic Tracker documentation, www.mathworks.com.