Master Thesis IMIT/LECS/2007-63
Implementation of
Acoustic Echo Cancellation
For PC Applications
Using MATLAB
Master of Science Thesis
In System on Chip Design
by
Lu Lu
Stockholm, 05/2007
Supervisor: Temujin Gautama (NXP Software Leuven Belgium)
Examiner: Axel Jantsch (ICT/KTH Stockholm Sweden)
Abstract
Communication technology has changed considerably in recent years. Today people are
increasingly interested in hands-free communication using a loudspeaker and a
microphone instead of a conventional telephone handset. However, strong acoustic
coupling between the loudspeaker and the microphone produces a loud echo that
makes conversation difficult. The solution to this problem is to eliminate the echo
with an echo cancellation or echo suppression algorithm. However, traditional methods
are not sufficient.
The objective of this thesis is to find an echo removal algorithm that is
capable of providing convincing results for PC applications. The basic components of an
echo canceller are an adaptive filter and a double-talk detector. The adaptive filter
estimates the echo path; based on this estimate, a replica of the echo is created and subtracted
from the microphone signal, which combines the actual echo and the near-end speech. Double talk
occurs when both ends are talking. The task of a double-talk detector is to sense
double talk and halt the adaptation of the filter in order to avoid divergence. Since
personal computers have advanced rapidly in recent years, this work attempts to
implement the acoustic echo canceller algorithm on a PC with the help of the MATLAB
software.
Acknowledgement
First, I would like to thank my supervisor Temujin Gautama from NXP
Software (Leuven, Belgium) for his patient help and friendly support throughout my
work. Without him, this project would not exist. His advice and constant guidance
helped me through many difficult situations.
Special thanks go to my favorite professor, Luc Bienstman of GroupT Engineering
School (Leuven, Belgium), who generously made a great effort to help make this thesis
project possible and, as always, gave me continuous encouragement and taught me
how to regain confidence when I doubted myself.
Besides my advisers, I would also like to thank all my friends here in Leuven, who are always so
supportive and never let me feel lonely.
Table of Contents:
CONCLUSION 7
CHAPTER I: INTRODUCTION 8
1.1 Need for Echo Cancellation 8
1.2 Basics of Echo Cancellation 11
1.2.1 System overview 11
1.2.2 Adaptive Filter 13
1.2.3 Double talk Detector 13
1.3 Measures of Performance 14
1.3.1 Echo Return Loss Enhancement (ERLE) 14
1.3.2 Near-end Attenuation (NEA) 14
1.4 Thesis Organization 16
CHAPTER II: ECHO CANCELLATION ALGORITHMS 18
2.1 Acoustic Echo Canceller 18
2.2 Acoustic Echo Suppressor 19
2.2.1 Noise Suppression with Spectral Subtraction 19
2.2.2 Acoustic Echo Suppression with Spectral Subtraction 22
2.2.3 Overlapping-windowed FFT 22
2.2.4 Comparison of AEC and AES 25
2.3 Adaptive filters 25
2.3.1 Wiener Filter 26
2.3.2 Least Mean Square Algorithm (LMS) 27
2.3.3 Normalized Least Mean Square Algorithm (NLMS) 28
2.3.4 Problem with NLMS algorithm 29
2.3.5 A Simplified Echo Path Model 31
2.4 Double Talk Detector 33
2.4.1 The Generic Doubletalk Detection Scheme 34
2.4.2 Geigel DTD 34
2.4.3 Normalized Cross-correlation (NCR DTD) 35
2.4.4 Variable Impulse Response (VIRE DTD) 36
2.4.5 Double talk detection performance evaluation 37
CHAPTER III: OTHER ISSUES 39
3.1 Room Impulse Response 39
3.1.1 Measure the Testing Room acoustics 40
3.2 Measure of the delay between the loudspeaker and the microphone 41
3.3 Noise issues 42
3.3.1 Typing noise cancellation based on cross-correlation 43
3.3.2 High Pass Filtering 46
CHAPTER IV: EVALUATION 47
4.1 Requirements of AEC 48
4.2 Speech Stimuli 48
4.3 Acoustic Echo Canceller 49
4.4 Acoustic Echo Suppressor Based on Spectral Subtraction 52
4.4.1 NLMS-Based AES 52
4.4.2 Coloration Effect Filter-Based AES 56
4.5 DTD Performance Evaluation 58
4.5.1 Geigel DTD 60
4.5.2 NCR DTD 62
4.5.3 VIRE DTD 65
4.6 AES with different DTD Algorithms 68
4.6.1 The Influence of the NES and the FES 71
4.6.2 The Noise Performance 73
CHAPTER V: CONCLUSION AND FURTHER WORK
5.1 Summary and Conclusion 75
5.2 Further Works 76
LIST OF ACRONYMS 77
LIST OF FIGURES 78
REFERENCES 80
APPENDIX
CONCLUSION
This thesis covers the implementation of acoustic echo cancellation algorithms and
their analysis based on simulations in MATLAB. It focuses on the Normalized Least Mean Square
(NLMS) algorithm and the recently proposed method by Christof Faller et al., which uses
a simplified echo path model based on a frequency-domain coloration effect filter.
Because double-talk detection is an important part of a successful acoustic echo canceller, several double-talk
detection methods are also studied and analyzed, including the Geigel algorithm, the
Normalized Cross-correlation method (NCR DTD), and the Variable Impulse Response
Double Talk Detector (VIRE DTD). Some possible further work is discussed at the end
of this thesis.
Key words: AEC, AES, NLMS, Coloration-effect filter (CF), DTD, Geigel, NCR, VIRE,
SER, SNR, ERLE, NEA
CHAPTER I
INTRODUCTION
Echo is a delayed and distorted version of an original sound or electrical signal which
is reflected back to the source. If a reflected wave arrives very shortly after the direct
sound, it is perceived as spectral distortion or reverberation. However, when the
reflected wave arrives a few tens of milliseconds after the direct sound, it is heard as a
distinct echo. In data communication, echo can cause significant transmission errors. In
applications like hands-free telecommunications, the echo can, in extreme conditions,
make conversation impossible. Echo has long been a major issue in communication
networks. Hence this thesis is devoted to the investigation and development of an
effective way to control the acoustic echo in hands-free communications.
This chapter gives a general review of the basic techniques in AEC such as the echo
cancellation structure, the adaptive filter, double-talk detector and performance measures.
Section 1.1 addresses the causes of echo and the echo cancellation environment. Section
1.2 details the basics of an acoustic echo removal system and the system structure of the
echo cancellation process. Section 1.3 introduces the two important measures when
evaluating the echo removal performance. Finally, the organization of the thesis is
described.
1.1 Need for Echo Cancellation
There are two types of echo in telecommunication networks, namely electrical
echo and acoustic echo. Electrical echo is due to impedance mismatches at various
points along the transmission medium. This echo can be found in the public-switched
telephone network (PSTN), mobile, and IP phone systems. It is created at
the hybrid connections located at the two-wire / four-wire PSTN conversion
points, as shown in Figure 1.1. It is outside the scope of this thesis.
Figure 1.1: Hybrid Connections and the Resulting Electric Echo
However, the development of hands-free communication systems gave rise to another
kind of echo known as acoustic echo. The sound wave travels from the loudspeaker to the
microphone through vibrations of the circuit or through the open air and generates an echo. Examples of
such systems are mobile phones, VoIP calls using, for instance, Skype, and teleconferencing
for meetings or remote education. Hands-free operation has gained more
and more popularity in recent years. This is the situation addressed in this
thesis. The basic setup of a typical hands-free system is shown in Figure 1.2:
Fig.1.2 Basic setup of a hands-free communication system
Each side of the communication process is called an ‘End’. The end remote from the
speaker is called the far end (FE), and the near end (NE) refers to the end being measured.
The acoustic echo is due to the coupling between the loudspeaker and microphone. The
speech of the far-end speaker is sent to the loudspeaker at the near end, where it is reflected
from the floor, walls and other neighboring objects, then picked up by the near-end
microphone and transmitted back to the far-end speaker, yielding an echo, as
illustrated in Figure 1.3.
Fig 1.3: Generation of acoustic echo through direct coupling and reverberations
Acoustic echo can severely reduce conversation quality. Adaptive cancellation of such
acoustic echoes has become very important in hands-free communication systems.
However, not all echoes reduce voice quality. In order for telephone conversations to
sound natural, callers must be able to hear themselves speaking. For this reason, a short
instantaneous echo, termed side tone, is deliberately inserted. The side tone is coupled
with the caller’s speech from the telephone mouthpiece to the earpiece so that the line
sounds connected, and it also allows the speaker to adjust his/her own speaking level.
Nevertheless, the necessity of the side tone in mobile phones has frequently been
debated. The reason is that side tone causes more problems in the
outdoor environment of the mobile phone. If you sneeze or blow into your
microphone, or when there is wind noise, you hear it loud and clear. Hence nowadays the
side tone in mobile phones is either eliminated or made adjustable by the
user.
1.2 Basics of Echo Cancellation
As stated in the previous section, undesired echoes need to be removed during
telecommunication. From this point on, echo removal methods are
investigated. Echo can be either cancelled in the time domain or suppressed in the frequency
domain. This section first introduces the system schematic of acoustic echo cancellation,
which is the basic structure for all echo removal methods. It then briefly reviews the two
major components of the echo cancelling process, the adaptive filter and the
double-talk detector.
1.2.1 System Overview
Since we know the original signal which goes to the loudspeaker, we can use it to predict
and remove the signal picked up by the microphone. The process of doing this is called
Acoustic Echo Cancellation.
Schematically we can describe an AEC system as in Figure 1.4. The remote speaker
signal, which is referred to as the far-end signal and denoted $x(t)$, passes through the
room acoustic filter $h$, producing an acoustic echo termed $y(t)$. Disregarding the surrounding noise, the microphone receives
the near-end speech signal $v(t)$ together with the echo;
the received signal $z(t)$ thus consists of both $v(t)$ and $y(t)$:
$$z(t) = y(t) + v(t) = f(x, h) + v(t)$$
where $h$ is the room acoustic filter.
The task of the AEC is to model the room acoustic path with an adaptive filter $\hat{h}_t$ as well
as possible and remove the echo signal from the measured signal, yielding a residual
signal $e(t)$ which ideally consists only of the near-end speech. The filter is required
to be adaptive since the echo path in the room is most likely time-varying, which can be
caused by, for example, the movement of objects or the moving of the loudspeaker or
microphone from one place to another. However, to capture the full complexity of an acoustic
echo path, the filter would need to be infinitely long, and a large filter order brings a
high computational load. So evidently there is a trade-off between the complexity and the
performance of the AEC.
The residual signal,
$$e(t) = z(t) - \hat{y}(t) = z(t) - \hat{h}^T x(t),$$
should only consist of the near-end signal, which is the case when the adaptive
filter is close to the echo path: if $\hat{y}(t) \approx y(t)$, then $e(t) \approx v(t)$.
Fig. 1.4 General schematic of Acoustic Echo Cancellation
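The signal model above can be illustrated with a minimal Python sketch. It is not taken from the thesis: the echo path `h`, the signal lengths and the helper `fir_filter` are hypothetical, the taps are indexed from the most recent sample, and the filter estimate is assumed to be perfect so that the residual reduces to the near-end speech.

```python
import random

def fir_filter(h, x):
    """Convolve signal x with FIR taps h (h[0] weights the current sample)."""
    y = []
    for k in range(len(x)):
        acc = 0.0
        for j, tap in enumerate(h):
            if k - j >= 0:
                acc += tap * x[k - j]
        y.append(acc)
    return y

random.seed(0)
n = 200
x = [random.uniform(-1, 1) for _ in range(n)]         # far-end signal x(t)
v = [0.1 * random.uniform(-1, 1) for _ in range(n)]   # near-end speech v(t)
h = [0.5, 0.3, -0.2, 0.1]                             # hypothetical room echo path

y = fir_filter(h, x)                        # echo y(t)
z = [yi + vi for yi, vi in zip(y, v)]       # microphone signal z(t) = y(t) + v(t)

h_hat = list(h)                             # idealized case: perfect echo path estimate
y_hat = fir_filter(h_hat, x)                # echo replica
e = [zi - yh for zi, yh in zip(z, y_hat)]   # residual e(t) = z(t) - y_hat(t)

# With a perfect estimate the residual equals the near-end speech.
max_dev = max(abs(ei - vi) for ei, vi in zip(e, v))
```

In a real system `h_hat` only approximates `h`, so the residual also contains some remaining echo.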
The adaptive filter uses the residual signal $e(t)$ as the error estimate to update the filter
coefficients, but only when there is no near-end speech. When near-end speech is present,
the estimated error is not correct, so it will distort the filter or even cause the
filter to diverge. It is therefore important to determine whether near-end speech is
present or not. Hence an acoustic echo canceller normally includes an
adaptive filter, a double-talk detector to detect whether near-end speech is present, and
possibly a nonlinear processor to eliminate the residual echoes. Each of
them is discussed later in this thesis.
1.2.2 Adaptive Filtering
There are two main types of digital filter: Finite Impulse Response (FIR) and
Infinite Impulse Response (IIR). An IIR filter can normally achieve performance similar to that of an FIR filter
with fewer coefficients and less computation. However, as the complexity of
the filter grows, the order of the IIR filter increases considerably and its computational advantage
diminishes. IIR filters can also become unstable. The filters
used in AEC are therefore usually of the FIR type.
The adaptive filter is the critical part of the AEC which performs the work of estimating
the echo path of the room to get a replica of the echo signal. It needs an adaptive update
to adapt to the environmental change, for example, people moving in the room. An
important issue of the adaptive filter is the convergence speed which measures how fast
the filter converges to the best estimate of the room acoustic path.
Many adaptive filters have been derived and employed for AEC. In this thesis, we
mainly study the standard Normalized Least Mean Square (NLMS) algorithm, which
is long established, has a low computational complexity and has been shown to work well compared to many
newer methods. The recently proposed simplified echo path model using a
frequency-domain coloration filter is also studied and analyzed.
1.2.3 Double Talk Detection
One of the most difficult issues in AEC is to know when the filter should stop or slow
down the adaptation. As discussed in the previous part, it is important to know whether
near-end speech is present while the far-end signal is active. The situation in
which both the near end and the far end are active is referred to as double talk. If double
talk occurs, the error signal $e(t)$ contains not only the echo estimation error but also
the near-end signal. If this signal is used to update the filter coefficients, the filter might create
an artificial echo or even diverge. Detecting double talk is thus a vital yet difficult job, and it is the task of
the double-talk detector.
There are a variety of double-talk detection methods. In this thesis, we consider
three well-known ones: the Geigel algorithm, which is quite simple, the Normalized
Cross-correlation method (NCR) and the Variance of Impulse Response algorithm
(VIRE). All of them are implemented and compared later in this thesis.
1.3 Measures of Performance
To examine the performance of the echo removal algorithms in a standard way,
some performance measures are required. The most important task
of AEC (or AES) is to suppress the echo, so it is necessary to know by how much the
echo is reduced. During double talk, the cancellation or suppression process affects the near-end signal
as well as the undesired echo, so
the amount of near-end attenuation is also of interest.
1.3.1 Echo Return Loss Enhancement (ERLE)
Echo Return Loss Enhancement (ERLE) is the most important measure of how much, in
dB, the echo is suppressed by the acoustic echo cancellation. It is defined as the ratio of the power of
the original echo to the power of the residual echo signal after cancellation, expressed in dB:
$$ERLE = 10 \log_{10}\left(\frac{\sigma_z^2}{\sigma_r^2}\right),$$
where $\sigma_z^2$ is the power of the microphone signal and $\sigma_r^2$ is the power of the residual
echo.
A precise measure of ERLE should be performed in a portion where there is no
near-end signal but only the echo. The higher the ERLE is, the better the AEC works.
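As a small worked example of the ERLE definition, the following Python sketch computes it for a toy far-end-only segment. The sample values and the helper names `power` and `erle_db` are illustrative assumptions, not data or code from the thesis.

```python
import math

def power(signal):
    """Mean power (mean of the squared samples) of a signal."""
    return sum(s * s for s in signal) / len(signal)

def erle_db(mic, residual):
    """ERLE in dB: microphone power over residual power, far-end single talk only."""
    return 10.0 * math.log10(power(mic) / power(residual))

mic = [1.0, -1.0, 1.0, -1.0]         # toy echo-only microphone segment
residual = [0.1, -0.1, 0.1, -0.1]    # toy residual after cancellation
erle = erle_db(mic, residual)        # amplitude reduced by a factor 10, i.e. about 20 dB
```

Reducing the echo amplitude by a factor of ten reduces its power by a factor of one hundred, which corresponds to an ERLE of about 20 dB.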
1.3.2 Near-end Attenuation (NEA)
Near-end attenuation (NEA) is a measure of how much, in dB, the near-end signal is suppressed
by the cancellation process during double talk. It is defined as the ratio of the power of
the near-end signal after suppression to its power before suppression during double
talk:
$$NEA = 10 \log_{10}\left(\frac{\sigma_{aft}^2}{\sigma_{bef}^2}\right),$$
where $\sigma_{aft}^2$ is the power of the near-end speech during DT in the residual signal and $\sigma_{bef}^2$ is
the power of the near-end signal during DT in the microphone signal.
To make the NEA calculation practical, the signals recorded in this thesis
consist of three segments: far-end single talk, double talk and near-end single talk.
The ERLE is calculated on the far-end-only segment. To calculate the NEA, we construct
a second, synthetic signal from the recorded microphone signal in which the
near-end speech during double talk is sign-inverted (counter-phase), obtained by subtracting
twice the near-end part, as shown in Figure 1.5. After passing both signals through
the AEC, we subtract the two residual signals and divide the result by two, which gives
the near-end speech during DT after AEC:
$$\text{Near\_end\_residual} = \left(e_{plus}(t) - e_{minus}(t)\right)/2$$
$$\text{Far\_end\_residual} = \left(e_{plus}(t) + e_{minus}(t)\right)/2$$
where $e_{plus}(t)$ and $e_{minus}(t)$ are the residuals obtained for the recorded and the sign-inverted signal, respectively, and $\sigma_{aft}^2$ is taken as the power of the near-end residual.
Evidently, low near-end attenuation is desired.
Fig 1.5 Composition of the signals used to calculate the NEA
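The counter-phase procedure can be sketched in Python as follows. This is only an illustration under strong assumptions: `toy_canceller` is a hypothetical stand-in that subtracts a fixed echo estimate and applies a fixed gain, so it is exactly linear, whereas a real adaptive AEC only behaves approximately this way. All sample values are made up.

```python
import math

def power(signal):
    """Mean power of a signal."""
    return sum(s * s for s in signal) / len(signal)

def toy_canceller(mic, echo_estimate, gain=0.8):
    """Stand-in for an AEC: subtract a fixed echo estimate, then apply a fixed gain."""
    return [gain * (m - yh) for m, yh in zip(mic, echo_estimate)]

echo = [0.5, -0.4, 0.3, -0.2, 0.1, 0.0]    # echo y(t) during double talk
near = [0.2, 0.1, -0.3, 0.4, -0.1, 0.2]    # near-end speech v(t)
echo_estimate = [0.9 * y for y in echo]    # imperfect echo estimate

z_plus = [y + v for y, v in zip(echo, near)]            # recorded microphone signal
z_minus = [z - 2.0 * v for z, v in zip(z_plus, near)]   # near-end part sign-inverted

e_plus = toy_canceller(z_plus, echo_estimate)
e_minus = toy_canceller(z_minus, echo_estimate)

# Half the difference isolates the near-end part, half the sum the echo part.
near_residual = [(a - b) / 2.0 for a, b in zip(e_plus, e_minus)]
far_residual = [(a + b) / 2.0 for a, b in zip(e_plus, e_minus)]

nea_db = 10.0 * math.log10(power(near_residual) / power(near))
```

Because the stand-in scales everything by 0.8, the near-end residual is 0.8 times the near-end speech and the NEA equals 20·log10(0.8), roughly -1.9 dB.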
1.4 Thesis Organization
This thesis focuses on two main issues of acoustic echo cancellation, namely the
adaptation algorithm and the control of adaptation during double talk.
Chapter 2 presents the theoretical background. First, it reviews and compares the two
major ways to achieve echo removal: the acoustic echo canceller (AEC) in the
time domain and the acoustic echo suppressor (AES) in the frequency domain. The adaptive
filter used to model the acoustic echo path is the central part of the AEC, and
much research effort has been devoted to it. LMS is an old, simple and proven
algorithm which has turned out to work well in comparison with newer, more advanced
algorithms. In this project, we use the normalized LMS (NLMS) for the main filter in the
AEC, since NLMS is so far the most popular algorithm in practice owing to its computational
simplicity. For the frequency-domain adaptive filtering in the AES, the recently
introduced simplified echo path method with a frequency-domain coloration effect filter
is studied. After that, the generic double-talk detection scheme is outlined and
several well-known double-talk detectors are discussed. The Geigel algorithm is simple
and works well when the far-end signal is sufficiently smaller than the near-end speech;
that is, it relies on an assumption about the echo path, so in practice it is not widely applied in echo
cancellation algorithms. The Normalized Cross-correlation method uses the normalized
correlation between the far-end signal and the near-end signal, which
is expected to give more promising results than the Geigel algorithm. The Variance of
Impulse Response algorithm is based on the fact that the presence of near-end speech
causes dramatic variations in the filter taps, and it can give good results. However,
it is more sensitive to the microphone signal than the NCR. Finally, the measures and the
receiver operating characteristic curve used to evaluate the DTD are introduced.
Chapter 3 discusses other issues that arise during the echo cancellation process.
Since the adaptive filter tries to mimic the room acoustics, it is interesting
to find a way to measure the room impulse response and get a general idea of what it
looks like. Secondly, the adaptive filtering algorithms normally require
synchronization between the far-end and near-end signals, because the sound wave needs time to propagate from
the loudspeaker to the microphone. To estimate this delay, a
method based on cross-correlation is adopted. Noise is also a big issue when the quality
of the microphone is poor, as is the case for the internal microphones of laptops.
The noise includes the hard-disk and fan noise from the laptop itself, the typing noise
from the near-end user, as well as the environmental noise. The typing noise is
probably the most annoying one, since the keyboard is always close to the
internal microphone in a laptop. A high-pass filter gives good attenuation
of most of the noise because most noise is concentrated at low frequencies. The nonlinear
processor, as a possible part of an AEC, is introduced in general terms at the end of this chapter.
Chapter 4 is devoted to the evaluation of all the algorithms discussed above. Through a
series of recordings and simulations in MATLAB, we try to find out which adaptive
filtering and double-talk detection algorithms best suit the PC application.
Chapter 5 draws the conclusions and presents possible future work.
CHAPTER II
ECHO CANCELLATION ALGORITHMS
This chapter reviews the theoretical background for echo cancellation.
There are two common ways to remove acoustic echo. The basic method is the
traditional Acoustic Echo Canceller (AEC), which was discussed schematically in section
1.2 and is covered again in section 2.1. The other is the Acoustic Echo
Suppressor (AES). The AES for telephony applications is usually half-duplex: it compares the
strength of both ends and completely shuts off the speech from the direction with lower power.
It is simple but not effective. Full-duplex communication is more
comfortable for real-time conversations. Another AES method, derived from noise
suppression based on spectral subtraction, makes full-duplex operation possible and is
introduced in section 2.2. Section 2.3 presents adaptive filtering algorithms in
detail, including the LMS and NLMS filters, which are derived from the Wiener optimal
filter, and the coloration effect filter in the frequency domain. Section 2.4 discusses different double-talk
detection (DTD) methods individually, together with the DTD
performance evaluation measures.
2.1 Acoustic Echo Canceller
The traditional solution to the acoustic echo problem is the acoustic echo canceller (AEC).
An acoustic echo canceller achieves the echo removal by modeling the echo path impulse
response with an adaptive filter and subtracting echo estimation from the microphone
signal.
The acoustic echo path is assumed to be a linear filter of length $L$, $h = [h_1, h_2, h_3, \ldots, h_L]^T$,
where $(\cdot)^T$ denotes the transpose of a matrix or a
vector. Then the microphone signal is expressed as:
$$z(k) = h^T x(k) + v(k) + n(k)$$
where $x(k) = [x(k-L+1), x(k-L+2), \ldots, x(k)]^T$, so $h^T x(k)$ is the echo signal, $v(k)$ is
the near-end speech and $n(k)$ stands for the ambient noise signal.
A modeling filter $\hat{h} = [\hat{h}_1, \hat{h}_2, \hat{h}_3, \ldots, \hat{h}_L]^T$ is used to approximate the true echo path $h$, where
$L$ is the length of the filter. The echo estimate is
$$\hat{y}(k) = \hat{h}^T x(k)$$
Adaptive algorithms are used to search for the optimum $\hat{h}$. Once the adaptive filter converges,
the residual signal is the echo-cancelled outgoing signal.
The echo signal can be cancelled successfully when the modeling filter approaches the
true echo path. In practice, however, a modeling filter often differs from the true echo
path due to complicated reasons such as speaker nonlinearities and environment changes,
the lack of knowledge about the length of the echo path, and so on, resulting in residual
echo signals.
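To make the idea of an adaptive modeling filter concrete, here is a minimal Python sketch of echo path identification with NLMS, the algorithm the thesis treats in section 2.3.3. It uses the standard textbook NLMS update rather than the thesis implementation. The step size `mu`, the regularization `eps`, the echo path `h_true` and the noise-free, single-talk setting are illustrative assumptions, and the taps are indexed from the most recent sample.

```python
import random

def nlms_identify(x, z, num_taps, mu=0.5, eps=1e-8):
    """Estimate an FIR echo path from far-end x and microphone z with NLMS."""
    h_hat = [0.0] * num_taps
    residual = []
    for k in range(len(x)):
        # Regressor: most recent sample first, zero-padded before the signal starts.
        xk = [x[k - j] if k - j >= 0 else 0.0 for j in range(num_taps)]
        y_hat = sum(w * s for w, s in zip(h_hat, xk))   # echo estimate
        e = z[k] - y_hat                                # residual (error) signal
        norm = sum(s * s for s in xk) + eps             # input power normalization
        h_hat = [w + mu * e * s / norm for w, s in zip(h_hat, xk)]
        residual.append(e)
    return h_hat, residual

random.seed(1)
h_true = [0.6, -0.3, 0.2, 0.1]                          # hypothetical echo path
x = [random.uniform(-1, 1) for _ in range(2000)]        # far-end signal
z = [sum(h_true[j] * x[k - j] for j in range(len(h_true)) if k - j >= 0)
     for k in range(len(x))]                            # echo only, no near-end speech

h_hat, residual = nlms_identify(x, z, num_taps=4)
max_tap_error = max(abs(a - b) for a, b in zip(h_hat, h_true))
```

In this idealized setting the estimate converges to the true taps. With near-end speech or noise in `z`, the update would have to be slowed down or frozen, which is the role of the double-talk detector.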
2.2 Acoustic Echo Suppressor
Unlike the AEC, an acoustic echo suppressor achieves echo attenuation in the frequency
domain and works in a similar manner to the traditional noise suppression
algorithm. The AES can achieve results similar to those of the AEC in a full-duplex way.
2.2.1 Noise Suppression with Spectral Subtraction
Noise suppression is introduced here because echo can be treated in a similar
way to noise, so methods for noise suppression are also of interest for echo
elimination. Various speech enhancement techniques exist for the purpose of eliminating
noise. Spectral subtraction is one of these methods to enhance speech in the presence of
noise. Spectral subtraction for noise suppression basically means that an estimate
$|\hat{N}(f)|$ of the noise magnitude spectrum is subtracted from the instantaneous input
magnitude spectrum $|X(f)|$. The noise can also be attenuated by a certain factor. The
aim of this process is to obtain an audio signal which contains less noise than the original.
The basic flowchart of spectral subtraction is as follows:
Figure.2.1 Noise suppression with spectral subtraction
The noisy speech basically consists of two parts, the clean speech and the noise:
$$x(t) = s(t) + n(t)$$
After Fourier transform:
$$X(f) = S(f) + N(f),$$
and the magnitude of the frequency spectrum can be approximately expressed as:
$$|X(f)| \approx |S(f)| + |N(f)|$$
So the magnitude of the clean speech can be calculated by subtracting the estimate of
the average noise spectrum:
$$|S(f)| \approx |X(f)| - |\hat{N}(f)|$$
and for the phase of the clean signal, the phase information of the noisy speech is
adopted:
$$\angle S(f) \approx \angle X(f)$$
Combining the amplitude and phase information, we get the estimate of the speech
spectrum:
$$S_i(f) = X_i(f) \cdot \left(\frac{\max\left(|X_i(f)|^{\alpha} - \beta\,|\hat{N}(f)|^{\alpha},\, 0\right)}{|X_i(f)|^{\alpha}}\right)^{1/\alpha} = G_i(f) \cdot X_i(f)$$
where $\alpha$ and $\beta$ are the design parameters to control the performance, and $i$ stands for the
$i$-th frame, since the frequency-domain calculation needs the FFT, which is frame-based.
Short-time estimates of $|X_i(f)|$ fluctuate randomly in noise-only frames, resulting
in randomly fluctuating gains $G_i(f)$. Statistical analysis shows
that, after noise suppression, broadband noise is transformed into a signal composed of short-lived tones with
randomly distributed frequencies, called musical noise, which sounds like a warbling or
watery effect on the enhanced speech. These artifacts are due to randomly distributed
spectral peaks in the residual noise spectrum. One possible way to reduce them is to
overestimate the average noise power to lower the peaks, but the original speech signal
might then also be distorted.
A lot of clicking noise also occurs due to the steep changes in the gain function. It can be
removed by adding a gain smoothing function as follows:
$$G_{s,i} = (1 - \text{smooth\_factor}) \cdot G_{s,i-1} + \text{smooth\_factor} \cdot G_i$$
where $G_{s,i}$ is the smoothed gain of frame $i$, and the smooth factor determines the time constant of the exponentially smoothed gain
function. To understand better how the smoothing function works, consider the step gain
function in Figure 2.2 (solid blue line): after applying the gain smoothing we get a
smoothed version of the steep corners (dotted line). The smooth factor in this
figure is 0.01. In this way, the sharp glitches in the gain function are eliminated.
Figure 2.2 Smoothing of a step function
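The behaviour in Figure 2.2 can be reproduced with a short Python sketch of first-order exponential smoothing. The step position, the sequence length and the choice to initialize the smoothed gain with the first input value are assumptions made for illustration. Only the smooth factor of 0.01 comes from the text.

```python
def smooth_gain(gains, smooth_factor):
    """Exponentially smooth a gain sequence, starting from the first gain value."""
    smoothed = []
    previous = gains[0]
    for g in gains:
        # G_s[i] = (1 - smooth_factor) * G_s[i-1] + smooth_factor * G[i]
        previous = (1.0 - smooth_factor) * previous + smooth_factor * g
        smoothed.append(previous)
    return smoothed

step = [0.0] * 100 + [1.0] * 500    # step-shaped gain function
smoothed = smooth_gain(step, smooth_factor=0.01)
```

With a smooth factor of 0.01 the time constant is about 100 frames: one hundred frames after the step, the smoothed gain has covered roughly 63% of the jump.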
2.2.2 Acoustic Echo Suppression with Spectral Subtraction
Echo suppression is basically performed in the same manner as noise suppression. Unlike
AEC, an acoustic echo suppressor achieves echo attenuation through manipulating the
magnitude spectrum of the microphone signal in the frequency domain, while leaving the
phase spectrum untouched.
The adaptive filter in the AES plays the role of the noise density estimation unit in the noise
suppressor: combined with the FFT, it produces an estimate of the echo magnitude
spectrum. The echo spectrum estimate is used together with the
spectrum of the microphone signal to form the gain function:
$$G_i(f) = \left(\frac{\max\left(|Z(f)|^{\alpha} - \beta\,|\hat{Y}(f)|^{\alpha},\, 0\right)}{|Z(f)|^{\alpha}}\right)^{1/\alpha}$$
where $\alpha$ and $\beta$ are the design parameters to control the echo suppression performance. If
the echo is underestimated, $\beta > 1$ is used, and $\beta < 1$ if it is overestimated.
Multiplying the microphone spectrum by the gain function then gives the
spectrum of the residual signal, which is supposed to be echo-free. After
performing the inverse FFT, the echo-suppressed outgoing signal is
obtained as:
$$e(n) = F^{-1}\left[G_i(f)\, Z(f)\right]$$
where $F^{-1}(\cdot)$ denotes the inverse FFT.
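The per-frame processing described above might look as follows in Python. This is a sketch under several assumptions: a naive DFT replaces the FFT to stay dependency-free, no windowing or overlap is applied, $\alpha = 2$ and $\beta = 1$ are arbitrary choices, and the function names and the `tiny` threshold for numerically empty bins are hypothetical.

```python
import cmath

def dft(frame):
    """Naive discrete Fourier transform of a real frame."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]

def idft(spectrum):
    """Naive inverse DFT, returning the real part of the time signal."""
    n = len(spectrum)
    return [sum(spectrum[f] * cmath.exp(2j * cmath.pi * f * t / n)
                for f in range(n)).real / n
            for t in range(n)]

def suppression_gain(mag_z, mag_y_hat, alpha=2.0, beta=1.0, tiny=1e-12):
    """Per-bin gain (max(|Z|^a - b*|Y_hat|^a, 0) / |Z|^a)^(1/a)."""
    gains = []
    for mz, my in zip(mag_z, mag_y_hat):
        if mz <= tiny:
            gains.append(0.0)    # numerically empty bin: nothing to keep
            continue
        numerator = max(mz ** alpha - beta * my ** alpha, 0.0)
        gains.append((numerator / mz ** alpha) ** (1.0 / alpha))
    return gains

def suppress_frame(z_frame, y_hat_frame, alpha=2.0, beta=1.0):
    """Scale the microphone spectrum by the gain; the phase of Z(f) is kept."""
    z_spec = dft(z_frame)
    y_spec = dft(y_hat_frame)
    gains = suppression_gain([abs(c) for c in z_spec],
                             [abs(c) for c in y_spec], alpha, beta)
    return idft([g * c for g, c in zip(gains, z_spec)])

echo = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]
out_echo_only = suppress_frame(echo, echo)        # perfect echo estimate
out_no_echo = suppress_frame(echo, [0.0] * 8)     # zero echo estimate
```

With a perfect echo magnitude estimate the frame is removed completely, and with a zero estimate it passes through unchanged. Real frames fall between these extremes, which is where $\beta$ matters.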
2.2.3 Overlapping-windowed FFT
For the transformation into the spectral domain, the choice of data window and
overlap is also important. Windowing a simple waveform, like $\cos(\omega t)$,
causes its Fourier transform to have non-zero values at frequencies other than $\omega$, an effect
commonly called leakage. The rectangular window is the simplest window and has the
best resolution, but suffers most from leakage. Other
windows, like the Hann, Hamming and Kaiser windows, are more moderate. Hamming has more
leakage than the other two. Kaiser can have the smallest leakage at the price of lower
resolution. On the other hand, performing the FFT on short windowed segments of the
input introduces distortion because of transient effects, so overlapped
windowing is employed. The windows overlap in time; that is, the window is
shifted by only a part of the window length instead of the whole length. The FFT and subsequent processing are
then performed for every window. To restore the signal, the data reconstructed through the
IFFT are summed up at the end.
Different windows are compared, with the results displayed in Figures 2.3, 2.4 and 2.5,
to find out which gives the better result, namely the smaller error. The window is
shifted by half its length each time, and the FFT and IFFT are applied. The error between the
original and the recovered signal is displayed in dB.
The Hann window has a relatively smaller error than the other two windows. The
performance of the Kaiser window depends strongly on the value of beta. For the Kaiser
window, the larger beta is, the wider the main lobe and the smaller the side lobes become.
According to simulations in MATLAB, a beta of around 5.8 gives the
smallest error after performing FFT and IFFT, as shown in Figure 2.5.
Figure 2.3 Error caused by Hann-Windowing FFT
Figure 2.4 Error caused by Hamming-Windowing FFT
Figure 2.5 Error caused by Kaiser-Windowing FFT with different beta (panels: beta = 2.5, 4.5, 5.8 and 6.8)
As a conclusion from the simulation results, the Hann window gives the best performance
in terms of error reduction. Hence, the Hann window is chosen for the overlap-add
method in the AES process.
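One property that makes the Hann window convenient for overlap-add with a half-window shift is that the shifted windows sum to one, so an unmodified signal is rebuilt exactly. The Python sketch below checks this. It is not the MATLAB experiment from the thesis: it assumes a periodic Hann window, omits the FFT/IFFT pair (which is an identity when the spectrum is not modified), and uses a made-up test signal and frame length.

```python
import math

def periodic_hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]

def overlap_add_identity(signal, frame_len):
    """Window frames with 50% overlap and add them back, with no spectral processing."""
    hop = frame_len // 2
    window = periodic_hann(frame_len)
    out = [0.0] * len(signal)
    for start in range(0, len(signal) - frame_len + 1, hop):
        for i in range(frame_len):
            out[start + i] += window[i] * signal[start + i]
    return out

signal = [math.sin(0.1 * t) for t in range(256)]
rebuilt = overlap_add_identity(signal, frame_len=32)

# Skip the first and last half frame, which are covered by one window only.
interior_error = max(abs(a - b) for a, b in zip(signal[16:240], rebuilt[16:240]))
```

Inside the signal, every sample is covered by two windows whose weights add up to one, so the reconstruction error is at the level of floating-point rounding.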
2.2.4 Comparison of AEC and AES
Both AEC and AES have their advantages and disadvantages. The AEC is a well-defined
technique. When the modeling filter approaches the true echo path, an AEC can eliminate
the echo signal successfully without introducing much distortion to the outgoing signal.
However, in reality the modeling filter often differs from the true echo path for
various reasons: for example, the modeling filter may be shorter than the true echo path,
the echo path may change, or the echo path may contain nonlinearities. As a
result, some residual echo may remain. In addition, an AEC is often
computationally expensive. In comparison, an AES is able to achieve higher echo
attenuation and is more robust. In addition, an AES algorithm may have lower
computational complexity, as is the case for the simplified echo path method discussed
later on. However, as in the noise suppressor, this technique sometimes introduces
audible distortions to the outgoing signal.
2.3 Adaptive Filters
The adaptive filter is the central part of the AEC, which is used to mimic the acoustic
echo path. There are numerous adaptive algorithms that are applicable in acoustic echo
cancellation, such as least mean squares (LMS), recursive least squares (RLS) and the affine
projection algorithm (APA). LMS is an old, simple and proven algorithm which has
turned out to work well in comparison with newer, more advanced algorithms. In this
project we use the normalized LMS (NLMS) for the main filter in the AEC, since NLMS is
so far the most popular algorithm in practice owing to its computational simplicity. The
following paragraphs outline the Normalized Least Mean Square algorithm, which is
an adaptation process based on a linear FIR filter. It aims at approximating the room
acoustic path with the best possible model. For the frequency-domain adaptive filtering
method in the AES, the recently introduced simplified echo path method with a
frequency-domain coloration effect filter is studied, which has the advantage of lower
computational complexity and greater robustness.
2.3.1 Wiener Filter
The Wiener filter represents the optimum filter in the sense of the Mean-Squared Error
(MSE). It minimizes a cost function of the filter coefficients, which can be
expressed as:
$$J(\mathbf{w}) = E\{e^2\}$$
where $\mathbf{w}$ stands for the filter coefficients and $E\{e^2\}$ represents the mean
power of the error signal $e(k)$. With the optimal filter coefficients the minimum of the
cost function
$$J(\mathbf{w}_{opt}) = \min(E\{e^2\})$$
is reached.
The error signal is calculated as the difference between the desired signal and the
output of the adaptive filter, $e(k) = d(k) - y(k)$, with $y(k) = \mathbf{w}^T \cdot \mathbf{x}(k)$
and $\mathbf{x}(k) = \{x(k-L+1), x(k-L+2), \ldots, x(k)\}^T$, where L is the length of the filter.
The squared error is then:
$$e^2(k) = d^2(k) - 2\,d(k)\,\mathbf{w}^T\mathbf{x}(k) + \mathbf{w}^T\mathbf{x}(k)\,\mathbf{x}^T(k)\,\mathbf{w}$$
The auto-correlation matrix R is defined by
$$\mathbf{R} = E\{\mathbf{x}(k)\,\mathbf{x}^T(k)\}$$
and the cross-correlation vector is:
$$\mathbf{p} = E\{\mathbf{x}(k)\,d(k)\}$$
Assuming that the desired signal is real and wide-sense stationary with power $\sigma_d^2$,
the cost function can be written as:
$$J(\mathbf{w}) = \sigma_d^2 - 2\,\mathbf{w}^T\mathbf{p} + \mathbf{w}^T\mathbf{R}\,\mathbf{w}$$
The minimum of this function is found at the point where the gradient is zero. The
general gradient of the cost function is:
$$\nabla_{\mathbf{w}}\{J(\mathbf{w})\} = -2\mathbf{p} + 2\mathbf{R}\mathbf{w} = 2(\mathbf{R}\mathbf{w} - \mathbf{p})$$
Setting the gradient to zero leads to the discrete-time Wiener-Hopf equation:
$$\mathbf{w}_{opt} = \mathbf{R}^{-1}\mathbf{p}$$
which gives the filter coefficients of a Wiener filter, optimal in the sense of the MSE.
The Wiener filter is a linear optimum filter. It depends on the known statistics R and p. In
practice, we do not know R and p exactly, and in an adaptive context they may vary slowly
with time. The adaptive filter should be able to track the changes in the statistics,
and hence a changing $\mathbf{w}_{opt}$, so some approximations are necessary. One idea is to
approximate R and p themselves, which leads to the Recursive Least Squares (RLS) algorithm.
Another way is to approximate the gradient, as in the Least Mean Square algorithm presented
in the following section. The LMS algorithm was introduced much earlier than the RLS
algorithm. The RLS algorithms have the advantage of fast convergence, while the LMS
requires far fewer computations. For a PC embedded software application, the low cost of
the LMS method is more attractive.
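As an illustration of the Wiener-Hopf equation, the following minimal sketch estimates R and p by time averaging and solves for the optimal coefficients. The echo path `h_true`, the signal length `N` and the use of white noise as input are assumptions made for this example only; they are not taken from the thesis experiments.

```matlab
% Sketch: estimate R and p from data and solve R * w_opt = p.
% h_true, N and the white-noise input are illustrative assumptions.
L = 4;                               % filter length
N = 5000;                            % number of samples
h_true = [0.5; -0.3; 0.2; 0.1];      % hypothetical echo path (impulse response)
x = randn(N, 1);                     % input (far-end) signal
d = filter(h_true, 1, x);            % desired signal: input filtered by the path

R = zeros(L, L);                     % estimate of E{x(k) x(k)^T}
p = zeros(L, 1);                     % estimate of E{x(k) d(k)}
for k = L:N
    xk = x(k-L+1:k);                 % x(k) = [x(k-L+1), ..., x(k)]^T
    R = R + xk * xk';
    p = p + xk * d(k);
end
R = R / (N - L + 1);
p = p / (N - L + 1);

w_opt = R \ p;                       % Wiener solution without forming inv(R)
h_est = flipud(w_opt);               % x(k) is stored oldest-first, so reverse
```

Because the desired signal here is noise-free, `h_est` reproduces `h_true` up to rounding errors; with measurement noise or near-end speech the estimate would only approximate it.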
2.3.2 Least Mean Square Algorithm (LMS)
The Least Mean Square algorithm is derived from the Steepest Descent method. Instead
of going the direct path from the starting point to the optimum, it is easier to follow the
gradient of the error function, which leads to the optimum iteratively. The gradient, as
shown in Figure 2.6, is a vector pointing in the steepest uphill direction on the error
surface at a given point w(k). The filter coefficients are updated by taking a step opposite
to the gradient direction, going locally “downhill” in the steepest direction to approach the
optimum:
$$\mathbf{w}(k+1) = \mathbf{w}(k) - c \cdot \nabla_{\mathbf{w}}\{J(\mathbf{w})\}$$
Replacing the expectations in the gradient by their instantaneous values gives
$$\nabla_{\mathbf{w}}\{J(\mathbf{w})\} = -2\,\mathbf{x}(k)\,d(k) + 2\,\mathbf{x}(k)\,\mathbf{x}^T(k)\,\mathbf{w} = -2\,\mathbf{x}(k)\,[d(k) - \mathbf{x}^T(k)\,\mathbf{w}] = -2\,\mathbf{x}(k)\,e(k)$$
This leads to the Least Mean-Square algorithm
$$\mathbf{w}(k+1) = \mathbf{w}(k) + \mu\,\mathbf{x}(k)\,e(k)$$
where $\mu = 2c$ is defined as the step size. The step-size parameter $\mu$ controls the
convergence speed of the filter.
Figure 2.6 Gradient of the Error function
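The LMS recursion can be sketched in a few lines of MATLAB. This is only an illustrative system-identification setup: the echo path `h_true`, the step size `mu = 0.01` and the white-noise input are assumed values, chosen so that the filter converges in this toy case.

```matlab
% Sketch of the LMS update w(k+1) = w(k) + mu * x(k) * e(k).
% h_true, mu and the white-noise input are illustrative assumptions.
L = 4;                               % filter length
N = 20000;                           % number of samples
h_true = [0.5; -0.3; 0.2; 0.1];      % hypothetical echo path
x = randn(N, 1);                     % far-end signal
d = filter(h_true, 1, x);            % desired signal (echo only, no near-end speech)

mu = 0.01;                           % step size
w = zeros(L, 1);                     % adaptive filter coefficients
e = zeros(N, 1);                     % error signal
for k = L:N
    xk = x(k-L+1:k);                 % x(k) = [x(k-L+1), ..., x(k)]^T
    e(k) = d(k) - w' * xk;           % e(k) = d(k) - y(k)
    w = w + mu * xk * e(k);          % step against the instantaneous gradient
end
h_est = flipud(w);                   % back to impulse-response order
```

A larger `mu` speeds up the initial convergence but increases the risk of instability, which is the trade-off the step size controls.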
2.3.3 Normalized Least Mean Square Algorithm (NLMS)
The NLMS algorithm is derived from the LMS algorithm. The motivation is that the power of
the input signal varies with time, so the size of the step between two successive
coefficient updates varies as well, and with it the convergence speed. Convergence slows
down for weak signals, while for loud ones the overshoot error increases. The idea is
therefore to adjust the step-size parameter continuously according to the input power.
The step size is normalized by the current input power, resulting in the Normalized Least
Mean Square algorithm, with
$$\mu(k) = \frac{\alpha}{\|\mathbf{x}(k)\|_2^2}$$
where $\alpha$ is again a design parameter that adjusts the convergence speed, with $0 < \alpha < 1$.
NLMS usually converges much more quickly than LMS at very little extra cost, so it is
very commonly used.
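The effect of the normalization can be sketched with an input whose level changes by a factor of 100. The gains, the echo path `h_true` and `alpha = 0.5` are assumptions for this example. The sketch deliberately contains no safeguard against a vanishing input power.

```matlab
% Sketch of NLMS: the step size is alpha divided by the current input power.
% h_true, alpha and the two input levels are illustrative assumptions.
L = 4;                               % filter length
N = 4000;                            % number of samples
h_true = [0.5; -0.3; 0.2; 0.1];      % hypothetical echo path
gain = [0.05 * ones(N/2, 1); 5 * ones(N/2, 1)];   % quiet first half, loud second half
x = gain .* randn(N, 1);             % far-end signal with time-varying power
d = filter(h_true, 1, x);            % echo only, no near-end speech

alpha = 0.5;                         % normalized step size, 0 < alpha < 1
w = zeros(L, 1);
for k = L:N
    xk = x(k-L+1:k);                 % x(k) = [x(k-L+1), ..., x(k)]^T
    e = d(k) - w' * xk;              % error signal
    mu = alpha / (xk' * xk);         % mu(k) = alpha / ||x(k)||^2
    w = w + mu * xk * e;             % LMS update with normalized step
end
h_est = flipud(w);                   % back to impulse-response order
```

The same `alpha` works for both signal levels, whereas a fixed LMS step size would have to be re-tuned for each level.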
2.3.4 The Problem with NLMS:
The performance of the fast-converging NLMS algorithm degrades severely when
doubletalk or only near-end speech is present. The reason is that the update is
proportional to the ratio between the error signal and the power of the far-end signal:
$$\mathbf{w}(k+1) = \mathbf{w}(k) + \frac{\alpha}{\|\mathbf{x}(k)\|_2^2}\,\mathbf{x}(k)\,e(k)$$
During far-end pauses in doubletalk, or when only near-end speech exists, the coefficients
become exceedingly unstable, since the input power approaches zero while the error signal is
relatively large owing to the near-end signal. The filter weights start to diverge.
The LMS algorithm does not suffer from this problem.
There are several possible solutions, which are described in the following:
1. Safety constant
One possibility is simply to add a safety constant $\rho$ to the denominator:
$$\mathbf{w}(k+1) = \mathbf{w}(k) + \frac{\alpha}{\|\mathbf{x}(k)\|_2^2 + \rho}\,\mathbf{x}(k)\,e(k)$$
The value of $\rho$ influences the output quality: increasing it reduces the jitter of
the weights, but also lowers the ERLE.
2. Threshold
Another common and low-cost possibility is to introduce a threshold on the input
power. The weights are kept unchanged if the input power is lower than the
threshold, which avoids large jitter of the weights. This is basically a far-end signal
detector based on the input power.
$$\mathbf{w}(k+1) = \begin{cases} \mathbf{w}(k) + \dfrac{\alpha}{\|\mathbf{x}(k)\|_2^2}\,\mathbf{x}(k)\,e(k) & \text{if } \|\mathbf{x}(k)\|_2^2 > \text{threshold} \\ \mathbf{w}(k) & \text{if } \|\mathbf{x}(k)\|_2^2 < \text{threshold} \end{cases}$$
3. Combination of LMS and NLMS
Both the safety constant and the input threshold depend on the input power. Hence we
introduce a new idea which combines the advantages of NLMS and LMS. Two adaptive
filters are adapted in parallel and weighted by a factor γ (0 < γ < 1). The two filters
contribute the fractions γ and 1 - γ, respectively, to the calculation of the error signal:
$$e = z - \gamma \cdot y_1 - (1 - \gamma) \cdot y_2$$
where y1 is the echo estimate of the NLMS filter and y2 that of the LMS filter. This method
basically tries to find the optimal combination of LMS and NLMS at each time instant,
in order to achieve fast convergence and a relatively large ERLE for echo cancellation while
also gaining stability.
To derive an update rule for γ, we proceed in the same way as for LMS.
The steepest descent method is applied to approach the minimum of the mean-squared
error. Using the instantaneous squared error as an estimate of the mean-squared error,
the gradient with respect to γ is:
$$\frac{\partial e^2}{\partial \gamma} = -2\,(y_1 - y_2)\cdot\left[z - \gamma\,(y_1 - y_2) - y_2\right] = -2\,(y_1 - y_2)\,e$$
so that γ is updated as:
$$\gamma_{i+1} = \gamma_i + c \cdot (y_1 - y_2) \cdot 2e$$
where c is the step-size parameter, corresponding to μ in the LMS algorithm.
Theoretically, γ should be close to 1 during FE (far-end) single talk, which
corresponds to using the NLMS algorithm. This is because the NLMS filter adapts faster
and achieves a higher ERLE than LMS in this situation. During DT (double talk) and NE
(near-end) single talk, γ should approach 0, since the LMS algorithm does not suffer from
the stability problem of NLMS when NES (near-end speech) exists.
We tested this algorithm with a recorded signal consisting of three segments:
far-end speech only, DT, and near-end speech only. The γ value plotted in Figure 2.7
behaves as expected.
Figure 2.7 γ plot (α = 0.01, µ = 0.8)
The MATLAB simulations indicate that the improvement achieved by the new algorithm does
not justify the computational complexity it adds. In conclusion, for the same
ERLE, and according to the listening test results of the three methods, the safety
constant method is chosen as an efficient approach that gives acceptable results.
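The benefit of the safety constant can be sketched by running the same NLMS loop twice, once with ρ = 0 (plain NLMS) and once with ρ > 0, on a signal in which the far end goes nearly silent while near-end speech starts. All signals and parameter values below (`h_true`, `alpha`, `rho_values`, the signal levels) are assumptions chosen for illustration, not the settings used in the thesis.

```matlab
% Sketch: NLMS with and without a safety constant rho in the denominator.
% Signals and parameters are illustrative assumptions.
L = 4;                                       % filter length
N = 4000;                                    % number of samples
h_true = [0.5; -0.3; 0.2; 0.1];              % hypothetical echo path
x = [randn(N/2, 1); 1e-4 * randn(N/2, 1)];   % far end: active, then nearly silent
v = zeros(N, 1);                             % near-end speech (modelled as noise)
v(N/2+L+1:end) = 0.5 * randn(N/2-L, 1);      % starts once the far end is quiet
z = filter(h_true, 1, x) + v;                % microphone = echo + near-end signal

alpha = 0.5;
rho_values = [0, 1];                         % 0: plain NLMS, 1: safety constant
misalign = zeros(size(rho_values));          % final coefficient error per variant
for r = 1:numel(rho_values)
    rho = rho_values(r);
    w = zeros(L, 1);
    for k = L:N
        xk = x(k-L+1:k);
        e = z(k) - w' * xk;
        w = w + alpha * xk * e / (xk' * xk + rho);
    end
    misalign(r) = norm(flipud(w) - h_true);
end
```

In this toy case `misalign(1)` (plain NLMS) becomes very large during the near-end segment, while `misalign(2)` stays small, which mirrors the jitter reduction described above. A larger ρ also slows adaptation when the far end is active, which is the ERLE penalty mentioned in the text.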
2.3.5 A Simplified Echo Path Model
A conventional adaptive algorithm, which aims at approximating the real acoustic echo path,
inherently suffers from the effects of echo path changes and non-linearity. Christof Faller
and Christophe Tournery recently proposed a new AES that does not require the complex
computation of an acoustic echo path estimate. Instead of identifying the echo path
impulse response, the proposed method estimates only the magnitude spectrum of the
echo, which is what is needed for echo suppression. A filter mimicking the coloration effect
of the echo path on the loudspeaker signal is adopted, and the gain filter for the AES is
computed using this coloration effect filter. The proposed AES has low complexity and higher
robustness because its estimate does not rely on identifying the physical echo path.
Coloration in an audio process means that some frequency ranges are attenuated or
amplified while others are not. For the AES it is necessary to know which frequencies of
the loudspeaker signal are attenuated, left unmodified or amplified. A typical
room impulse response consists of the direct sound, which travels from the loudspeaker
directly to the microphone, several early reflections, and then the late reflections, which
form a long, dense tail, as shown in Figure 2.8. The dense late reflections hardly
influence the amplitude of the frequency spectrum. The strong direct sound and the early
reflections are what color the signal. Hence, to obtain the information needed for
echo suppression it is enough to consider only the direct sound and the early reflections,
which reduces the computational complexity.
Figure 2.8 Typical room impulse response
A real-valued “coloration effect filter” $G_v(i,k)$, mimicking the spectral modification
effect of the echo path on the loudspeaker signal, is estimated. To obtain an
approximate echo magnitude spectrum, the estimated delay and coloration effect filter are
applied to the loudspeaker signal spectra,
$$|\hat{Y}(i,k)| = G_v(i,k)\,|X_d(i,k)|$$
where d stands for the delay in samples. Since it takes a certain amount of time
for the loudspeaker signal to reach the microphone, the magnitude spectrum of the echo
is calculated from the delayed loudspeaker signal.
The coloration effect filter is computed as the magnitude of the least squares estimator
$$G_v(i,k) = \left|\frac{E\{X_d^*(i,k)\,Y(i,k)\}}{E\{X_d^*(i,k)\,X_d(i,k)\}}\right|$$
where ∗ denotes the complex conjugate. Since the acoustic echo path is likely to vary in
time, $G_v(i,k)$ is estimated iteratively from the microphone spectrum $Z(i,k)$, with the
expectations replaced by recursive averages:
$$G_v(i,k) = \left|\frac{a_{12}(i,k)}{a_{22}(i,k)}\right|$$
where
$$a_{12}(i,k) = \sigma \cdot X_d^*(i,k)\,Z(i,k) + (1-\sigma)\,a_{12}(i,k-1)$$
$$a_{22}(i,k) = \sigma \cdot X_d^*(i,k)\,X_d(i,k) + (1-\sigma)\,a_{22}(i,k-1)$$
and $\sigma \in [0,1]$ determines the time constant of the exponentially decaying estimation
window.
The magnitude spectrum of the echo signal is then used to form the gain filter:
$$G(i,k) = \left[\frac{\max\left(|Z(i,k)|^{\alpha} - \beta\,|\hat{Y}(i,k)|^{\alpha},\; 0\right)}{|Z(i,k)|^{\alpha}}\right]^{1/\alpha}$$
During double talk, the coloration effect filter will affect the near-end speech and may even
diverge, in the same way as the NLMS algorithm. To prevent this, a double talk detector (or
near-end speech detector) may be needed to freeze the coloration effect filter while
double talk is present.
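The recursive estimate of the coloration effect filter and the resulting gain filter can be sketched as follows. The spectra are synthetic: the "true" coloration `c`, the number of bins and frames, and the values of `sigma`, `alphaExp` and `beta` are assumptions for illustration, and applying the gain to the microphone spectrum in the last line of the loop is the usual spectral-modification step, assumed here rather than quoted from the thesis.

```matlab
% Sketch: recursive coloration-effect estimate and spectral gain filter.
% All spectra and parameter values are illustrative assumptions.
nBins = 8;                                % number of frequency bins
nFrames = 50;                             % number of frames
c = linspace(0.2, 1.0, nBins)';           % hypothetical true coloration per bin
sigma = 0.1;                              % smoothing constant in [0, 1]
alphaExp = 2;                             % exponent alpha of the gain filter
beta = 2;                                 % suppression factor (over-subtraction)

a12 = zeros(nBins, 1);                    % recursive average of Xd^* .* Z
a22 = zeros(nBins, 1);                    % recursive average of Xd^* .* Xd
for k = 1:nFrames
    Xd = randn(nBins, 1) + 1i * randn(nBins, 1);   % delayed loudspeaker spectrum
    Z = c .* Xd;                                   % microphone spectrum (echo only)
    a12 = sigma * conj(Xd) .* Z  + (1 - sigma) * a12;
    a22 = sigma * conj(Xd) .* Xd + (1 - sigma) * a22;
    Gv = abs(a12 ./ a22);                          % coloration effect filter
    Ymag = Gv .* abs(Xd);                          % estimated echo magnitude
    num = max(abs(Z).^alphaExp - beta * Ymag.^alphaExp, 0);
    G = (num ./ (abs(Z).^alphaExp + eps)).^(1 / alphaExp);   % gain filter
    E = G .* Z;                                    % suppressed output spectrum
end
```

With echo only and no near-end speech, `Gv` matches `c` and the gain `G` drops to zero, so the echo is fully suppressed in this idealized case.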
2.4 Double Talk Detector
An important feature of an AEC is its capability to provide full-duplex
service, which means that it allows both ends to speak simultaneously, namely the case
of Double Talk (DT). If DT occurs, the microphone signal used for adaptation
contains not only the echo but also the near-end signal. This can lead to
divergence of the adaptive filter, since the near-end speech acts as strong uncorrelated
noise to the adaptive algorithm. It is therefore necessary to detect when double talk occurs
and to stop the adaptation process. This is done by a double talk detector.
2.4.1 The Generic Doubletalk Detection Scheme
The generic DTD is based on a detection statistic ξ, which is formed from the available
signals, such as the loudspeaker signal, the microphone signal and the output signal. By
comparing ξ with a preset threshold T, double talk is declared or not.
Once double talk is detected, the filter adaptation is disabled for a minimum
period of time Thold. The filter adaptation is resumed once the detection statistic has
indicated the absence of DT continuously over a time Thold.
There are a variety of double talk detectors, which differ in the algorithm used to calculate
the decision statistic ξ. The most popular ones are the Geigel algorithm, the Normalized
Cross-correlation (NCR) method and the Variance Impulse Response (VIRE) algorithm.
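The hold mechanism of the generic scheme can be sketched as a small state machine. The statistic values in `xi`, the threshold `T` and the hold time `Thold` (counted in frames) are made-up numbers, and the comparison direction `xi > T` is an assumption that matches detectors such as Geigel; for a statistic that drops during double talk the comparison would be reversed.

```matlab
% Sketch: threshold decision with a hold time for the filter adaptation.
% xi, T and Thold are illustrative assumptions.
xi = [0.1 0.2 0.9 0.3 0.2 0.1 0.1 0.1 0.8 0.1];   % detection statistic per frame
T = 0.5;                         % threshold: DT declared when xi > T
Thold = 3;                       % hold time in frames

adapt = true(size(xi));          % true where adaptation is allowed
holdCount = 0;                   % remaining frames of the hold period
for n = 1:numel(xi)
    if xi(n) > T
        holdCount = Thold;       % (re)start the hold period on every detection
    end
    if holdCount > 0
        adapt(n) = false;        % adaptation disabled during the hold period
        holdCount = holdCount - 1;
    end
end
```

For the values above, adaptation is disabled in frames 3 to 5 and 9 to 10 and enabled elsewhere.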
2.4.2 Geigel DTD
The most basic algorithm for double talk detection is the one originally developed by
Geigel. It is a simple approach that compares the magnitude of the received microphone
signal with that of the far-end signal. Since the room acoustic path normally attenuates the
far-end signal, DT is declared when the received microphone signal divided by the maximum
of the past far-end samples is larger than a certain threshold.
The decision statistic is calculated as:
$$\xi = \frac{|z(t)|}{\max(|x(t)|, |x(t-1)|, \ldots, |x(t-N+1)|)}$$
If ξ is larger than some preset threshold T, DT is deemed to be occurring, otherwise
not, i.e.
ξ > T: Double talk present
ξ ≤ T: Double talk not present
The choice of T strongly affects the performance of the detector. In the MATLAB
analysis, it can be found by plotting the decision variable and determining which threshold
best distinguishes DT from the far-end signal. The Geigel detector has the
benefit of being computationally simple and needing little memory. However, its
detection performance is quite poor.
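A minimal sketch of the Geigel statistic on a handful of samples is given below. The sample values, the window length `N` and the threshold `T` are invented for illustration, and the far-end window is assumed never to be all zeros, so that the division is safe.

```matlab
% Sketch: Geigel double-talk statistic on short example signals.
% x, z, N and T are illustrative assumptions.
N = 4;                                          % number of far-end samples in the maximum
T = 0.5;                                        % detection threshold
x = [1.0 -0.8 0.6 1.0 -0.9 0.7 0.8 -1.0]';      % far-end (loudspeaker) samples
z = [0.2 -0.3 0.1 0.2  0.9 0.8 -0.2 0.3]';      % microphone samples

xi = zeros(size(z));
for t = 1:numel(z)
    first = max(1, t - N + 1);                  % window x(t-N+1), ..., x(t)
    xi(t) = abs(z(t)) / max(abs(x(first:t)));
end
dt = xi > T;                                    % true where double talk is declared
```

Here only samples 5 and 6, where the microphone level is close to the far-end level, are flagged as double talk.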
2.4.3 Normalized Cross-correlation (NCR) DTD
An alternative method is the normalized cross-correlation algorithm. The microphone
signal z(t) can be expressed as the sum of the echo signal and the near-end speech signal
(NES), where the noise is ignored for the moment:
$$z(t) = y(t) + v(t)$$
Suppose the echo path impulse response of the room is h, such that the echo signal is:
$$y(t) = \mathbf{h}^T\mathbf{x}(t)$$
The power of the measured microphone signal can be written as:
$$\sigma_z^2(t) = \mathbf{h}^T\mathbf{R}_{xx}\mathbf{h} + \sigma_v^2(t)$$
where $\mathbf{R}_{xx} = E\{\mathbf{x}\mathbf{x}^T\}$.
The cross-correlation vector between the loudspeaker signal and the echo is, by
definition:
$$\mathbf{r}_{xy} = E\{\mathbf{x}y\} = \mathbf{R}_{xx}\mathbf{h}$$
yielding:
$$\mathbf{h} = \mathbf{R}_{xx}^{-1}\mathbf{r}_{xy}$$
The power of the microphone signal can then be rewritten as:
$$\sigma_z^2(t) = \mathbf{r}_{xy}^T\mathbf{R}_{xx}^{-1}\mathbf{r}_{xy} + \sigma_v^2(t)$$
When there is no NES present, i.e. $v(t) = 0$, then $z(t) = y(t)$ and
$$\sigma_z^2(t) = \mathbf{r}_{xz}^T\mathbf{R}_{xx}^{-1}\mathbf{r}_{xz}, \quad\text{with } \mathbf{r}_{xz} = E\{\mathbf{x}z\}$$
The detection statistic is suggested as:
$$\xi = \left[\frac{\mathbf{r}_{xz}^T\mathbf{R}_{xx}^{-1}\mathbf{r}_{xz}}{\sigma_z^2(t)}\right]^{1/2}$$
The numerator is the power the measured signal would have if no near-end speech were present,
whereas the denominator is the actual power of the measured signal. Thus if there is no
near-end speech signal present, ξ ≈ 1; otherwise ξ < 1.
The DT decision is formed as
ξ < T: Double Talk present
ξ ≥ T: Double Talk not present
T is selected between 0 and 1.
The NCR method is usually too computationally demanding, as it requires not only the
estimation of the cross-correlation vector $\mathbf{r}_{xz}$ and the far-end covariance matrix
$\mathbf{R}_{xx}$, but also the inversion of the covariance matrix. A practical approach is
adopted for this reason. The room echo path response is approximated by the response of the
adaptive filter, which results in:
$$\mathbf{h} = \mathbf{R}_{xx}^{-1}\mathbf{r}_{xy} \approx \mathbf{w}$$
$$\xi = \frac{\mathbf{r}_{xz}^T\mathbf{w}}{\sigma_z^2(t)} = \frac{\mathbf{w}^T\mathbf{R}_{xx}\mathbf{w}}{\sigma_z^2(t)} = \frac{\sigma_{\hat{y}}^2(t)}{\sigma_z^2(t)}$$
The numerator is the power of the estimated echo signal and the denominator is the actual
microphone signal power. This is the form of the cheap normalized cross-correlation
algorithm. Since we use the Hann-window-based overlap-add method, the decision
statistic is calculated for each window frame.
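The cheap NCR statistic can be sketched per frame as the ratio of estimated echo power to microphone power. The example below assumes a perfectly converged filter (`w` equal to the hypothetical echo path `h_true`), models the near-end speech as white noise, and uses plain rectangular frames instead of Hann-windowed ones to keep the code short; the threshold `T = 0.8` is likewise only an example.

```matlab
% Sketch: cheap normalized cross-correlation statistic, one value per frame.
% Signals, frame length and threshold are illustrative assumptions.
frameLen = 512;                              % samples per frame
h_true = [0.5; -0.3; 0.2; 0.1];              % hypothetical echo path
w = h_true;                                  % assumed: adaptive filter has converged
x = randn(2 * frameLen, 1);                  % far-end signal, two frames
y = filter(h_true, 1, x);                    % true echo
v = [zeros(frameLen, 1); randn(frameLen, 1)];% near-end signal in frame 2 only
z = y + v;                                   % microphone signal
y_hat = filter(w, 1, x);                     % estimated echo (w in impulse-response order)

T = 0.8;                                     % threshold between 0 and 1
xi = zeros(2, 1);
for f = 1:2
    idx = (f - 1) * frameLen + (1:frameLen); % sample indices of frame f
    xi(f) = sum(y_hat(idx).^2) / sum(z(idx).^2);   % echo power / microphone power
end
dt = xi < T;                                 % double talk declared when xi < T
```

In the first frame (far-end single talk) the statistic equals 1, while the near-end signal in the second frame pushes it well below the threshold.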
2.4.4 Variance Impulse Response (VIRE) DTD
The variance impulse response (VIRE) algorithm is a recently introduced method based on the
variance of the adaptive filter coefficients. Since the near-end speech acts as
corrupting noise, it induces dramatic variations in the adaptive filter taps. VIRE uses the
maximum value of the adaptive filter coefficients as a measure of these fluctuations,
smoothed with an exponential forgetting factor:
$$\xi(n) = \lambda \cdot \xi(n-1) + (1-\lambda) \cdot [\gamma - \bar{\gamma}]^2$$
where $\gamma$ is the maximum value of the filter coefficients and $\bar{\gamma}$ is the
expected value of $\gamma$, which is again formed with the exponential forgetting factor:
$$\bar{\gamma}(n) = \lambda \cdot \bar{\gamma}(n-1) + (1-\lambda) \cdot \gamma$$
$$\gamma = \max(h(0), h(1), \ldots, h(k-1))$$
By defining a threshold for the variation of the adaptive filter taps, the DT decision
is made as follows:
ξ > T: Double Talk present
ξ ≤ T: Double Talk not present
The detection is still frame-based and is calculated once at the end of each frame.
Hence the frame length used in this work is normally relatively small. It is also worth
mentioning that the VIRE algorithm is more sensitive to the power of the near-end speech, as
we will see later in the simulations.
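The two recursions can be sketched on a sequence of filter snapshots, one per frame. The tap values in `W`, the forgetting factor `lambda` and the threshold `T` are invented, and the sketch assumes that the squared deviation is taken against the mean from the previous frame, which the formulas above leave open.

```matlab
% Sketch: VIRE statistic from the maximum filter tap of each frame.
% W, lambda and T are illustrative assumptions.
lambda = 0.9;                                  % exponential forgetting factor
T = 1e-3;                                      % detection threshold
W = repmat([0.5; -0.3; 0.2; 0.1], 1, 8);       % filter taps, one column per frame
W(1, 5:6) = [0.9 0.2];                         % taps disturbed in frames 5 and 6

xi = 0;                                        % variance-like statistic
gbar = max(W(:, 1));                           % running mean of the maximum tap
dt = false(1, size(W, 2));
for n = 1:size(W, 2)
    g = max(W(:, n));                          % gamma: maximum tap value
    xi = lambda * xi + (1 - lambda) * (g - gbar)^2;   % uses the previous mean
    gbar = lambda * gbar + (1 - lambda) * g;          % update the running mean
    dt(n) = xi > T;                            % double talk declared when xi > T
end
```

The statistic rises as soon as the taps start to fluctuate in frame 5 and, because of the forgetting factor, decays only slowly afterwards, so the detection persists for some frames.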
2.4.5 Double Talk Detection Performance Evaluation
Certain criteria are necessary to compare different types of DTD, since their
performance cannot be compared directly when different thresholds are used. A
systematic approach is also required to select the value of the threshold.
The criteria used to evaluate DTD performance are as follows:
Probability of false alarm (Pf): the probability of declaring detection when DT does not
exist.
Probability of detection (Pd): the probability of successful detection when DT does exist.
Probability of miss (Pm = 1 - Pd): the probability of detection failure when DT is present.
Pf is calculated when only the far-end signal is present, namely v = 0:
$$P_f = \frac{\sum \xi \cdot X_{active}}{N}$$
where ξ is the output decision of the DTD, $X_{active}$ is the output of the activity detector
and N is the length of the entire far-end speech signal x, which is the first 15 seconds in
our case.
The miss probability Pm is measured as the proportion of near-end speech that remains
undetected when far-end speech also exists. Pm is a meaningful criterion for comparing
different DTD methods fairly, because the disruptive effect of undetected double talk
on an adaptive filter depends on the near-end speech that goes undetected.
$$P_m = 1 - \frac{\sum \xi \cdot X_{active} \cdot V_{active}}{\sum X_{active} \cdot V_{active}} \quad\text{and}\quad P_d = 1 - P_m$$
where ξ is the output decision of the DTD, and $X_{active}$ and $V_{active}$ are the outputs
of the activity detectors for the far end and the near end respectively. The logical AND
ensures that the miss probability is only counted when both NE and FE are active.
A good detection method should maximize Pd while minimizing Pf, even at a low
signal-to-noise ratio. In general, a higher Pd is achieved at the cost of a higher Pf. There
is a trade-off depending on the cost of a false alarm and of a miss. A Receiver Operating
Characteristic (ROC) curve is widely used to characterize detection schemes, as in radar
applications. The values of Pm are plotted against those of Pf for different thresholds in the
ROC curve in order to find a threshold that achieves a certain performance. Also,
given a certain probability of false alarm, one can plot the probability of miss as a function
of the Signal-to-Echo Ratio (SER) or the Signal-to-Noise Ratio (SNR) to evaluate the
DTD algorithm under different speech power conditions or environmental noise
conditions.
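The two probabilities can be computed directly from binary decision and activity vectors. The vectors below are tiny made-up examples (one entry per frame) rather than measured data, used only to show how the sums in the formulas are formed.

```matlab
% Sketch: false-alarm and miss probabilities from binary flags.
% All flag vectors are illustrative assumptions.

% Far-end single-talk segment (v = 0), used for Pf
xiF   = [0 1 0 0 1 0 0 0];      % DTD decisions (1 = DT declared)
XactF = [1 1 1 0 0 1 1 1];      % far-end activity detector output
Pf = sum(xiF .* XactF) / numel(XactF);

% Segment containing double talk, used for Pm and Pd
xiD   = [1 0 1 1 0 0 1 1];      % DTD decisions
XactD = [1 1 1 1 1 0 1 1];      % far-end activity
VactD = [1 1 0 1 1 1 1 0];      % near-end activity
bothActive = XactD .* VactD;    % logical AND of the two activity flags
Pm = 1 - sum(xiD .* bothActive) / sum(bothActive);
Pd = 1 - Pm;
```

For these vectors the result is Pf = 0.125, Pm = 0.4 and Pd = 0.6. Repeating the computation for a range of thresholds gives the points of the ROC curve.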
CHAPTER III:
OTHER ISSUES
Some issues beyond echo cancellation theory are presented in this chapter. To
decide the length of the adaptive filter, we need to know the length of the actual echo
path which the filter is trying to model. Section 3.1 discusses room acoustics
and how to measure them. Another important issue is the synchronization
between the loudspeaker signal and the microphone signal, which is essential for the echo
cancellation process and is covered in Section 3.2. Last but not least, noise is always
an important consideration for speech and audio applications. The noise sources in
hands-free communication are varied and are discussed in Section 3.3.
3.1 Room Acoustics
The AEC is tested in real rooms. An AEC that works well in
one room, however, might not perform adequately in another, because the acoustics of
every room are different. Designers can therefore test an AEC in the type of
room it was designed for, but the user remains
responsible for determining whether the AEC will operate in his or her particular
environment. An AEC solution that was designed to operate in an office may not work
properly in a conference room. If an echo canceller works in one room and not in another,
this is most likely due to a tail length that is too short for the second room.
The tail length of an AEC is the length of time over which it can cancel echoes (in
ms). It is directly related to the reverberation time of the room. As the room
reverberation time increases, a longer tail length is needed in that room. If the
reverberation time is much longer than the tail length, a significant amount of the echo
will remain audible.
Two main factors affect the reverberation time of a room: the room
size, and the materials used to construct the walls and objects in the room. Sound is
mostly absorbed when it strikes walls or other surfaces. If materials are used that absorb
sound well (such as carpet, curtains or acoustic tile), the reverberation dies out more
quickly than if the room contains mostly reflective materials (hardwood or glass). If a
room is small, the sound waves bounce off the walls more frequently and are absorbed
more quickly.
3.1.1 Measuring the Test Room Acoustics
Since the job of the adaptive filter is to model the room acoustics, it is useful to have
an idea of what they look like. In addition, the adaptive filter should be long enough to
cover a sufficient part of the impulse response of the room it is to be operated in.
The impulse response is the response of a system to a Dirac pulse, which
theoretically has infinite amplitude at one point in time and is zero everywhere else.
However, it is impossible to produce a real Dirac pulse in practice. The room impulse
response, namely the acoustic echo path in the room, can be measured in several ways.
As a conceptual method, consider a room with a balloon in it at point p. The balloon pops
and makes a "pou" sound, which is similar (due to its short duration) to a Dirac delta, and
the output h[n] is the sequence of the damped sound. Here h[n] depends on the location
(point p) of the balloon. If we know h[n] at point p of the room, then we actually know
the impulse response of the room at point p. It is then possible to predict its response to
any sound produced at this point. However, it is not easy to apply this method exactly at
our actual recording position. We could approximate it with a sudden loud sound,
but this would not give good results.
A second way is to excite the room with sine waves of different frequencies. Using
sine waves would give the best results, but it is quite time-consuming.
The third choice is to measure with a white noise signal, which in theory has a flat
magnitude spectrum, so the deconvolution in the time domain can
be calculated easily as a division in the frequency domain. This method of recording
white noise has the potential to give a good result, while it can be realized quite easily
using MATLAB and requires nothing more than a computer equipped with a
microphone and a loudspeaker. This is the method that we have chosen.
Figure 3.1 Room Impulse Response in the Scream room in NXP Leuven (x-axis: Time (ms), 0 to 25; y-axis: Amplitude; sampling rate = 8 kHz)
Our test room measures 5 m x 6 m and has hard walls. As can be seen, an adaptive filter
length of approximately 16 ms is needed in our tests, namely 128 taps at a sampling rate
of 8 kHz.
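The white-noise measurement described in Section 3.1.1 can be sketched as a division of spectra. Instead of a real recording, the sketch simulates the microphone signal by convolving the noise with a made-up sparse response `h_room`; the FFT length and the absence of measurement noise are further simplifying assumptions, so a real measurement would be less exact.

```matlab
% Sketch: impulse response estimate by deconvolution in the frequency domain.
% h_room and all lengths are illustrative assumptions; no recording is used.
fs = 8000;                                   % sampling rate in Hz
nTaps = 128;                                 % 16 ms at 8 kHz
N = 4096;                                    % length of the white noise excitation
h_room = zeros(nTaps, 1);
h_room([1 10 40]) = [0.5 -0.2 0.1];          % hypothetical direct sound and reflections

x = randn(N, 1);                             % white noise played by the loudspeaker
z = conv(x, h_room);                         % simulated microphone recording

nfft = 2^nextpow2(numel(z));                 % long enough to avoid circular wrap-around
H = fft(z, nfft) ./ fft(x, nfft);            % deconvolution as a spectral division
h_est = real(ifft(H));
h_est = h_est(1:nTaps);                      % keep the part covered by the filter

tail_ms = 1000 * nTaps / fs;                 % tail length covered by 128 taps
```

Here `tail_ms` evaluates to 16, which corresponds to the 128 taps at 8 kHz quoted above.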
3.2 Measuring the Delay between the Loudspeaker and the Microphone
The delay between the microphone and loudspeaker signals has to be estimated not only for
the new AES algorithm proposed by Christof Faller and Christophe Tournery, but also for
the other AEC and AES algorithms, since all of them need the
two signals to be synchronized.
To estimate the time delay, the cross-correlation is used in this thesis. Cross-correlation
is the standard way of measuring how two signals are correlated. The correlation is
high if the microphone signal is similar to the reference loudspeaker signal, so the peak
of the cross-correlation (CC) indicates the lag at which the two signals are most strongly
correlated.
The cross-correlation can be related to the convolution as:
$$r_{xz}(n) = x^*(-n) * z(n)$$
where the time-reversed complex conjugate of one signal is used in the
convolution. The convolution of two signals in the time domain can
be computed as a multiplication in the frequency domain and converted back to a time
series by the IFFT. The MATLAB code looks as follows:
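x_inv = [flipud(x(index)); zeros(fs, 1)];   % time-reverse the loudspeaker signal, then zero-pad
z = [z(index); zeros(fs, 1)];               % zero padding for the later calculation
temp = fft(x_inv) .* fft(z);                % multiplication in the frequency domain
cc = abs(ifft(temp));                       % back to the time domain (IFFT)
[value, lag] = max(cc);                     % position of the maximum of the CC result
delay = lag - numel(index);                 % remove the offset caused by the time reversal
                                            % (assumes column vectors and 0 <= delay < fs)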
To get a relatively accurate estimation, we calculate the delay for several frames and
average the results.
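The procedure can be checked on synthetic data with a known delay. In the sketch below the loudspeaker signal is white noise and the microphone signal is simply a copy delayed by `d_true` samples; these signals and the values of `N` and `d_true` are assumptions for the test, not recordings from the thesis.

```matlab
% Sketch: FFT-based cross-correlation delay estimate on synthetic signals.
% x, z, N and d_true are illustrative assumptions.
fs = 8000;                                % sampling rate, also used as padding length
N = 2048;                                 % frame length in samples
d_true = 37;                              % known delay in samples

x = randn(N, 1);                          % loudspeaker frame
z = [zeros(d_true, 1); x(1:N-d_true)];    % microphone frame: delayed copy of x

x_inv = [flipud(x); zeros(fs, 1)];        % time-reversed loudspeaker signal, zero-padded
z_pad = [z; zeros(fs, 1)];                % zero-padded microphone signal
cc = abs(ifft(fft(x_inv) .* fft(z_pad))); % cross-correlation via the frequency domain
[~, idx] = max(cc);                       % position of the correlation peak
d_est = idx - N;                          % the time reversal shifts the peak by N samples
```

The peak position has to be corrected by the frame length `N` because the time-reversed signal places zero lag at index `N` rather than at index 1. In practice the estimate would be averaged over several frames, as described above.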
3.3 Noise issues
During hands-free communication in PC applications, a lot of noise may exist and
disturb the speech reaching the microphone. The noise problem is of particular
concern when the internal microphone of a laptop is used. The amplification of
the internal microphone of a laptop is usually high so that it can pick up the
near-end speech. Hence, a variety of noises, such as those of the hard disk and the fan of
the laptop, typing on the keyboard and mouse clicks, as well as various ambient noises,
are likely to be picked up.
The possible noise sources when using the internal microphone of a laptop are illustrated in
Figure 3.2. The hard drive and cooling fan are close to the microphone, so their noise,
together with other mechanical sounds, is transmitted to the internal microphone through
vibrations. The clicking sounds of the keyboard and the mouse are also major noise sources
in this case. Since the keyboard is usually close to the microphone, the typing noise can
be quite loud, which makes it the most annoying noise source. The situation
improves considerably when a good-quality external microphone is used.
During our recording process, using the Trust MC-1200 high-sensitivity external
microphone lowered the noise floor by 10-13 dB compared to the internal microphone.
Figure 3.2 Common Laptop Noise Sources
3.3.1 Typing noise cancellation based on cross-correlation
As mentioned above, the internal microphone is likely to pick up a lot of noise from
sources such as the hard disk and fan of the laptop and from the environment, and the typing
noise of the keyboard can be the most annoying of all, which motivates the search for
a way to reduce it. The most direct way is to use an external
microphone. There are also dedicated keyboards on the market which use special pads
to reduce the typing noise. What we attempt below is to find
an algorithm which could be embedded into our AEC (or AES) algorithms to cancel
the typing noise.
First we need to know what the typing sound looks like before dealing with it. We therefore
recorded typing with the internal microphone of the laptop. Analysis of the
recording shows that each typing sound generally consists of two separate
parts, namely a press sound and a trailing release sound, as can be seen in Figure 3.3.
Figure 3.3 Typical look of a typing sound on the keyboard (panels: one typing sound (press and release); typical press sound for the keyboard of our testing laptop; typical release sound for the keyboard of our testing laptop; x-axis: Time (s))
The first idea that comes to mind is to use a cross-correlation algorithm to recognize the
typing operation and then cancel it out. Two masks are needed, one for the press and one for
the release sound, because the pause between them can vary from one keystroke to another.
The cross-correlation is calculated with a sliding window and normalized so that it is
independent of the input power. Once the cross-correlation value exceeds a certain
threshold, as shown in Figure 3.4, a press or release action is considered to have
occurred, and a scaled mask is subtracted from the input signal.
The way the mask is scaled is important. The projection operation gives a good estimate
of the component of one vector in the direction of another, as defined by:
$$\mathrm{proj}_B A = \frac{\langle A, B\rangle}{\langle B, B\rangle} \cdot B$$
where < > denotes the inner product of two vectors:
$$\langle a, b\rangle = \sum_{i=1}^{n} a_i b_i \quad\text{or}\quad \langle a, b\rangle = a^T b$$
The estimated typing noise is thus calculated as the projection of the actual
microphone signal onto the mask when typing is detected.
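The detection and scaling steps can be sketched for a single window position. The mask and the microphone segment below are short made-up vectors; in the real system the mask would be a recorded press or release sound, and the normalized cross-correlation `ncc` would be compared with a threshold at every shift of the window.

```matlab
% Sketch: normalized cross-correlation and projection-based mask scaling.
% mask and segment are illustrative assumptions.
mask = [0 1 -1 0.5 0]';                              % hypothetical press mask
segment = 0.6 * mask + [0.01 -0.02 0.01 0 0.02]';    % microphone segment: scaled mask plus noise

% Normalized cross-correlation, independent of the input power
ncc = (segment' * mask) / (norm(segment) * norm(mask));

% Projection of the segment onto the mask gives the scaled mask
scale = (segment' * mask) / (mask' * mask);          % <A,B> / <B,B>
estimate = scale * mask;                             % estimated typing noise
residual = segment - estimate;                       % signal after cancellation
```

By construction the residual is orthogonal to the mask, so any remaining typing noise is the part of the keystroke that the mask cannot represent.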
As can be observed from the residual signal, the typing noise cannot be completely
cancelled and is still audible. This is because each individual keystroke can differ somewhat
from the mask, which may indicate that there is no linear model for the typing noise.
Hence, in the following we try to model the noise with an LMS adaptive filter.
The input to the LMS system is a pulse train that marks the typing actions, which can be
generated by the cross-correlation method described above. If a linear time-invariant
model exists for the typing noise, the cancellation should perform well after the filter
has converged, otherwise not.
Through simulation we observed that the typing noise estimate generated by the
LMS filter matches the input signal well only occasionally; in other cases it
leads or lags the input. Hence, we conclude that no linear time-invariant
model is applicable to typing noise cancellation, i.e. the noise varies with time. A much
more sophisticated method is required, which is beyond the scope of this
thesis.
Figure 3.4 Typing noise cancellation (panels: recorded typing noise z(t); cross-correlation result between z(t) and press mask; press flag; cross-correlation result between z(t) and release mask; release flag; residual typing noise; x-axis: Samples)
3.3.2 High Pass Filtering
Analysis of the recorded noise shows that most noise components, including the typing
noise, are concentrated in the low-frequency part of the spectrum. A high-pass filter is
therefore an efficient way to reduce the noise. A second-order Butterworth filter with
a cutoff frequency of 200 Hz is adopted.
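One way to obtain such a filter without any toolbox is to compute the biquad coefficients with the bilinear transform, as sketched below. The sampling rate of 8 kHz is an assumption taken from the measurements elsewhere in this chapter, and the closed-form coefficients are a standard textbook design, not necessarily the routine used in the thesis (with the Signal Processing Toolbox, `butter(2, fc/(fs/2), 'high')` is the usual alternative).

```matlab
% Sketch: second-order Butterworth high-pass via the bilinear transform.
% fs is an assumed sampling rate; the thesis does not state the design routine.
fs = 8000;                                 % sampling rate in Hz
fc = 200;                                  % cutoff frequency in Hz
w0 = 2 * pi * fc / fs;                     % normalized cutoff in rad/sample
q = 1 / sqrt(2);                           % quality factor of a Butterworth section
aq = sin(w0) / (2 * q);

b = [(1 + cos(w0)) / 2, -(1 + cos(w0)), (1 + cos(w0)) / 2];   % numerator
a = [1 + aq, -2 * cos(w0), 1 - aq];                           % denominator
b = b / a(1);                              % normalize so that a(1) = 1
a = a / a(1);

% Frequency response at the cutoff, evaluated on the unit circle
zc = exp(1i * w0);
Hc = polyval(b, zc) / polyval(a, zc);

% Example use: remove a constant offset from a signal
x = ones(2000, 1);                         % pure DC input
y = filter(b, a, x);                       % output decays towards zero
```

The magnitude of `Hc` is 1/sqrt(2), i.e. -3 dB at 200 Hz, and the DC input is removed once the initial transient has died out.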
CHAPTER IV
EVALUATION
So far, we have studied and discussed two echo removal methods: the Acoustic Echo
Canceller (AEC) and the Acoustic Echo Suppressor (AES). Most AEC products are based on
the adaptive LMS or NLMS digital filter, a well-defined algorithm that has been used for
years. To achieve larger echo attenuation without the help of other devices such as a
Nonlinear Processor, the Acoustic Echo Suppressor based on spectral subtraction is a good
option. Besides the combination of NLMS and the FFT, the AES can also adopt a simplified
echo path in the frequency domain, which has an even lower computational complexity. To
deal with the annoying situation during double talk, three Double Talk Detection (DTD)
methods are presented: the Geigel, Normalized Cross-correlation (NCR) and Variance
Impulse Response (VIRE) algorithms. In the following sections, the performance of all
these methods is examined and compared using MATLAB simulations.
Many parameters occur in the algorithms, e.g., the learning rate, the safety constant and
the suppression factor in the AES. They all affect the performance of the AEC or AES in
some way. Hence, parameter tuning plays an important role in the simulation process when
a certain target performance is to be achieved.
The evaluation of the algorithms is primarily based on how much ERLE they achieve, since
echo attenuation is the goal of an AEC. When different algorithms reach a similar ERLE,
the initial convergence time and the near-end attenuation (NEA) become more important. In
real hands-free communication, the volumes of the near-end and far-end speech vary
widely. Hence, it is necessary to check how each algorithm reacts to a change of the
Signal-to-Echo Ratio (SER), which is defined as the ratio between the power of the
near-end speech and the power of the echo signal. Noise has also long been an important
issue for audio and speech systems, so the performance under different noise strengths
needs to be evaluated for each algorithm.
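The two power ratios used throughout this chapter can be sketched in Python as follows.
The sketch assumes that the ERLE is taken as the ratio, in dB, of the microphone (echo)
power to the residual power over an FE single talk segment, which is the usual definition;
the exact definition used in this thesis is the one given in section 1.3. The function
names are hypothetical.

```python
import math

def power(x):
    """Mean power of a sequence of samples."""
    return sum(v * v for v in x) / len(x)

def erle_db(mic, residual):
    """Echo Return Loss Enhancement in dB (assumed: mic power over residual power)."""
    return 10.0 * math.log10(power(mic) / power(residual))

def ser_db(near_end, echo):
    """Signal-to-Echo Ratio in dB: near-end speech power over echo power."""
    return 10.0 * math.log10(power(near_end) / power(echo))
```

Both functions expect segments of non-zero power; in practice they would be applied to
the FE single talk segment (ERLE) and to the separately recorded NES and echo (SER).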
4.1 Requirements for AEC
The performance evaluation of an AEC (or AES) solution is based on specifications and
listening tests. As discussed in section 1.3, several measures exist for the evaluation of
an AEC. The International Telecommunication Union (ITU) has specified criteria for a
number of performance characteristics of an AEC, such as the rate of convergence, the
amount of cancellation and the bandwidth. Although these criteria are necessary, they are
not sufficient to determine whether an AEC is good enough: the performance of an AEC is
quite location-sensitive and noise-sensitive, and a specification can only cover certain
test environments. Hence, evaluation through auditory tests is necessary for a given
application. At the end of the day, how an AEC sounds is the final criterion.
4.2 Speech Stimuli
In hands-free communication systems, the input signal is primarily speech and the output
signal consists of speech disturbed by noise and other speech signals. Speech has highly
time-varying characteristics. It is not stationary, but can be approximated as stationary
over short time intervals. Speech is sometimes quasi-periodic (e.g., vowels) and sometimes
noise-like (e.g., fricatives) or impulse-like (e.g., plosives). Speech also contains
pauses. Speech signals are wide-band, with a frequency content ranging from 100 Hz to
more than 8 kHz. According to the sampling theorem, telephone-band audio signals (with
frequencies between 300 Hz and 3400 Hz) should be sampled at a frequency of at least
6800 Hz (2 x 3400 Hz). In practice, telephone applications usually use a sampling rate of
8 kHz. The most popular choices for VoIP are 8 kHz and 16 kHz. A higher sampling rate
improves the speech quality but also requires a wider bandwidth. Throughout our
simulations, a sampling frequency of 8 kHz is used. In all, speech provides a
non-persistent excitation for the adaptive filters used in AEC.
The speech stimuli consist of two channels: the channel to be played on the NE speaker
(male voice) and the channel to be played on the FE (female voice). The recorded signals
consist of three segments, namely FE single talk, double talk (DT) and NE single talk,
which are used to examine the achieved ERLE and the performance during DT respectively.
Each segment has a length of 15 seconds and there are pauses of 1 second in between.
[Figure: FE (top) and NE (bottom) waveforms versus Time (s), 0 to 50 s, with the segments FE single talk, Double talk and NE single talk marked]
Figure 4.1 Speech stimuli segmentation
The recording is made with a DELL Latitude-D600 laptop. The near-end setup can differ
from one laptop user to another. Different configurations of internal or external
microphones and internal or external loudspeakers differ in the nominal Signal-to-Echo
Ratio (SER) and Signal-to-Noise Ratio (SNR). The SER is defined as the ratio of the power
of the near-end speech (NES) to the power of the echo in the recorded signal. The nominal
SER is obtained from a recording in which the NE and FE speech have a common perceptual
loudness. When an external microphone is used, a higher nominal SER and SNR can be
achieved than with the internal microphone and the same loudspeaker positions.
4.3 Acoustic Echo Canceller
Firstly, the AEC based on a simple NLMS-adapted FIR filter is evaluated through
parameter tuning. The learning rate is always an important parameter for NLMS, since it
controls the convergence speed. Another parameter, the safety constant discussed in
section 2.3.4, is added to the denominator of the NLMS coefficient adaptation equation to
avoid divergence. In order to find a proper value for the safety constant, the resulting
ERLE is plotted against the safety constant for different learning rates, as shown in
Figure 4.2. The simulation is performed with a signal recorded at a common perceptual
loudness of NE and FE speech, using the external microphone and external loudspeaker
setup. The nominal SER in this case is 5 dB. The length of the filter is 128 taps.
[Figure: ERLE (dB) versus Safety constant (0 to 1), with curves for Learning rate = 1.0, 0.5 and 0.1]
Figure 4.2 How ERLE changes with the safety constant
(Three lines have different corresponding learning rate)
It is observed that the ERLE has a peak value, which is reasonable. If the safety constant
is negligible and the step size is therefore large, the filter may over-adapt and take a
longer time to reach the final optimum. Conversely, when the step size is significantly
lowered by the safety constant, the adaptation also takes more steps to approach the
optimum. In both cases the ERLE becomes lower as a consequence of the slower adaptation.
Hence, it is reasonable to obtain a peak value of the ERLE where the filter coefficients
reach the optimum in the minimum number of steps.
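To make the role of the two parameters concrete, the following Python sketch shows the
standard NLMS update with a safety constant in the denominator. It is a minimal
illustration under stated assumptions, not the implementation used for the simulations:
the exact adaptation equation of this thesis is the one in section 2.3.4, and the
function name, the synthetic three-tap echo path and the white-noise excitation are
hypothetical choices made only for the example.

```python
import random

def nlms(far_end, mic, num_taps=128, learning_rate=0.5, safety=0.1):
    """Standard NLMS echo canceller; returns (filter taps, residual signal)."""
    w = [0.0] * num_taps          # adaptive filter taps
    buf = [0.0] * num_taps        # recent far-end samples, newest first
    residual = []
    for x, z in zip(far_end, mic):
        buf = [x] + buf[:-1]
        y_hat = sum(wi * xi for wi, xi in zip(w, buf))   # echo replica
        e = z - y_hat                                    # residual after cancellation
        energy = sum(xi * xi for xi in buf)
        # The safety constant keeps the step bounded when the far-end energy is small.
        step = learning_rate / (safety + energy)
        w = [wi + step * e * xi for wi, xi in zip(w, buf)]
        residual.append(e)
    return w, residual

# Toy check on a synthetic, noise-free echo path (hypothetical values).
random.seed(0)
path = [0.5, -0.3, 0.2]
far = [random.gauss(0.0, 1.0) for _ in range(2000)]
mic = [sum(h * far[n - k] for k, h in enumerate(path) if n - k >= 0)
       for n in range(len(far))]
w, e = nlms(far, mic, num_taps=8, learning_rate=0.5, safety=0.1)
```

In this idealised setting the taps converge to the synthetic path. With real recordings
the safety constant and the learning rate trade convergence speed against stability, as
the curves in Figure 4.2 indicate.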
Next, the learning rate is tuned in a similar manner. The result is illustrated in Figure
4.3. For a similar reason as for the safety constant, higher learning rates result in
over-adaptation and lower values result in overly fine adaptation; both slow down the
convergence. There is an optimal value of α which leads to the fastest convergence.
[Figure: ERLE (dB) versus Learning rate (2 x Alpha), 0 to 2]
Figure 4.3 The effect of Learning rate on ERLE (safety constant = 0.1)
The simulation results in Figure 4.4 show that the NLMS-based AEC can only achieve a
limited amount of attenuation. One reason is that the adaptive filter can never model the
echo path impulse response completely, due to its limited number of filter taps. Another
important reason is that NLMS assumes a linear echo path, whereas in reality the
loudspeaker-room-microphone system is nonlinear. The nonlinearities come from
saturation effects in the amplifiers and loudspeakers. In the listening test, the
residual echo is still audible. During double talk, although the influence on the near-end
signal is small (1 to 2 dB near-end attenuation), the remaining echo is still significant.
Figure 4.4 also demonstrates the improvement obtained by adding the safety constant.
Without a safety constant, the filter becomes unstable when DT starts, so that a lot of
jitter is observed.
[Figure: three waveform panels versus Samples (0 to 4 x 10^5): Recorded signal z(t); Without Safety constant; Safety constant = 0.1]
Figure 4.4 Echo Cancellation Result of NLMS AEC under nominal SER (learning
rate = 1. Notice the y-axis of the figure without safety constant has larger scale)
4.4 Acoustic Echo Suppressor Based on Spectral Subtraction
As seen above, the time-domain AEC can only achieve a very low ERLE. Hence, an echo
suppression filter is added after the NLMS-based echo canceller. The echo estimate from
the NLMS algorithm is transformed to the frequency domain using a Hann window and
subtracted there. Over-suppression can be applied to gain a higher ERLE. An alternative is
the simplified echo path, which only estimates the magnitude spectrum of the echo signal
and leaves out the phase information to reduce the computational load.
4.4.1 NLMS-Based AES
A variety of parameters adjust the performance of an NLMS AES, as discussed in section
2.2: the learning rate and the safety constant of the NLMS process, the exponent α and
the echo suppression ratio β, and the smooth factor of the gain function. All of them
involve a trade-off between echo rejection and speech distortion. In the following
paragraphs, each parameter is tuned individually to investigate its effect. The test
signal is still the nominal recording with 5 dB SER, made with the external microphone and
external loudspeakers. With this recording, it is found that an ERLE of 30 dB is the
minimum required to make the echo in the FE single talk section acceptable.
As discussed in the last section, the learning rate of the NLMS adaptive filter influences
the ERLE as well as the NEA during DT. As observed in Figure 4.5, the larger the learning
rate, the higher the ERLE and the NEA. In fact, the adaptation speed of the NLMS filter is
no longer as crucial for the AES as it is for the AEC. Other parameters can be used to
tune the initial convergence time and the ERLE, e.g. the suppression ratio β and the
smooth factor of the gain function. Hence, unlike the NLMS AEC, which uses a learning
rate of 1, a small learning rate is chosen here to ensure a low NEA.
[Figure: ERLE (dB) and NEA (dB) versus Learning rate (0.1 to 0.9)]
Figure 4.5 The effect of learning rate on the ERLE and the NEA (α = 1, β = 25,
safety factor = 0.1, smooth = 0.1)
The smooth factor, as introduced in section 2.21, controls the fluctuation of the
suppression gain function. The influence of the smooth factor on the ERLE and the NEA is
drawn in Figure 4.6. Larger smoothing (a smaller smooth factor) results in a flatter
suppression gain, which leads to attenuation of both the echo and the NES during DT.
As observed in Figure 4.7, large smoothing (smooth factor = 0.01) strongly suppresses
the NES during DT; the residual speech sounds natural yet has a very low volume. Low
smoothing (smooth factor = 0.99) results in an enormous clicking effect during DT, which
yields artificial sounds. Hence, a smooth factor in the middle range should be chosen. We
use a smooth factor of 0.3.
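The smoothing recursion itself is defined earlier in the thesis and is not reproduced in
this chapter. Assuming it is a first-order recursive average over frames, which is
consistent with a smaller smooth factor meaning heavier smoothing, its effect on one
frequency bin can be sketched in Python as follows. The function name and the initial
gain of 1 are hypothetical.

```python
def smooth_gain(raw_gains, smooth_factor, initial=1.0):
    """Recursively smooth the per-frame gains of one frequency bin.

    Assumed form: g_s(i) = s * g(i) + (1 - s) * g_s(i - 1), so that s close
    to 1 follows the raw gain and s close to 0 changes very slowly.
    """
    smoothed = []
    previous = initial
    for g in raw_gains:
        previous = smooth_factor * g + (1.0 - smooth_factor) * previous
        smoothed.append(previous)
    return smoothed
```

With a factor of 0.99 the smoothed gain jumps almost as abruptly as the raw gain, which
matches the clicking described above, while a factor of 0.01 keeps the gain nearly
constant from frame to frame.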
[Figure: ERLE (dB) and NEA (dB) versus Smooth factor of the Gain function (0 to 1)]
Figure 4.6 The effect of the smooth factor on the ERLE and the NEA (α = 1, β = 25,
Learning rate = 0.2, safety factor = 0.1)
[Figure: three waveform panels versus Samples (0 to 3.5 x 10^5): Recorded signal z(t); Residual signal with smooth = 0.01; Residual signal with smooth = 0.99]
Figure 4.7 The effect of smooth factor on the NES during DT
Recalling the gain function of the AES,

$$G_i(f) = \left( \frac{\max\left( |Z(f)|^{\alpha} - \beta\, |\hat{Y}(f)|^{\alpha},\ 0 \right)}{|Z(f)|^{\alpha}} \right)^{1/\alpha},$$

the optimal values of α and β are found with the minimization function in MATLAB, which
returns, for a given α, the β that minimizes the squared difference between the
achieved ERLE and the desired value. In this way, a series of given α values yields a
series of corresponding β values that achieve 30 dB ERLE, as shown in Figure 4.8. We
choose α = 1, which brings the least computational load, so that β needs to be at least
27 to achieve 30 dB ERLE.
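The gain function can be evaluated per frequency bin as in the following Python sketch.
It assumes that the magnitude spectra of the microphone signal and of the echo estimate
are already available for one frame; the function name and the handling of empty bins
are hypothetical choices.

```python
def suppression_gain(mic_mag, echo_mag, alpha=1.0, beta=27.0):
    """Spectral-subtraction gain for each frequency bin of one frame.

    mic_mag  : magnitudes |Z(f)| of the microphone spectrum
    echo_mag : magnitudes of the estimated echo spectrum
    """
    gains = []
    for z, y in zip(mic_mag, echo_mag):
        if z <= 0.0:
            gains.append(0.0)          # empty bin: nothing to pass through
            continue
        numerator = max(z ** alpha - beta * y ** alpha, 0.0)
        gains.append((numerator / z ** alpha) ** (1.0 / alpha))
    return gains
```

A bin with no estimated echo is passed unchanged (gain 1), whereas a bin in which the
scaled echo estimate exceeds the microphone magnitude is set to zero. A large β therefore
removes more echo but also more near-end speech.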
[Figure: beta versus alpha (0.2 to 2), with a zoomed-in panel for alpha between 0.2 and 1]
Figure 4.8 Alpha – Beta values to achieve 30dB ERLE (Learning rate = 0.2,
safety constant = 0.1, smooth factor = 0.3)
The simulation workload is quite high. To speed up the simulation, which is slowed down
by the for-loops in MATLAB, the NLMS algorithm is implemented as a MEX function, which is
more than 10 times faster.
Simulation and listening tests show that the NLMS AES achieves a much higher ERLE than
the NLMS AEC, at the cost of a higher computational load. During the DT period the echo
is also inaudible, yet the NES is strongly affected, as seen in Figure 4.7. Certain
portions of the NES are attenuated more than others due to sharp changes of the
suppression gain. This results in discontinuities in the residual signal during DT.
4.4.2 Coloration-Effect-Filter-Based AES
As introduced in section 2.3.5, the AES based on a simplified echo path magnitude
spectrum has a much lower computational complexity.
Recalling the equations

$$G_v(i,k) = \frac{a_{12}(i,k)}{a_{22}(i,k)}$$

and

$$a_{12}(i,k) = \sigma\, E\{X_d(i,k)\, Z^{*}(i,k)\} + (1-\sigma)\, a_{12}(i-1,k)$$
$$a_{22}(i,k) = \sigma\, E\{X_d(i,k)\, X_d^{*}(i,k)\} + (1-\sigma)\, a_{22}(i-1,k),$$

σ plays a role similar to the learning rate in the NLMS algorithm: it controls the
adaptation speed of the coloration-effect filter. If it is too large, the attenuation of
both the echo and the NES is high, to the point where the NES is no longer audible. If it
is too small, the initial convergence is too slow, as shown in Figure 4.9. The initial
convergence time is defined as the time it takes for the echo to become totally silent.
The coloration-effect filter also suffers from the divergence problem when NES is
present. Hence, a constant c is added to the denominator in the same way as in the NLMS
algorithm:

$$G_v(i,k) = \frac{a_{12}(i,k)}{c + a_{22}(i,k)}$$

A large constant smooths the variation of the filter taps and slows down the convergence
significantly, but brings less attenuation of the NES during DT, as illustrated in
Figure 4.10. According to the simulation and listening results, a sigma of 0.05 and a
safety constant of 0.01 are chosen to ensure a relatively fast convergence and a small
attenuation of the NES.
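The recursions above can be sketched for a single frequency bin in Python as follows.
This is only an illustration under stated assumptions: the expectation is replaced by the
instantaneous product of the current frame, the magnitude of the cross term is taken to
obtain a real-valued gain, and the function name is hypothetical. The definitions in
section 2.3.5 remain the reference.

```python
def update_coloration_gain(a12, a22, x_bin, z_bin, sigma=0.05, c=0.01):
    """One frame update of the coloration-effect filter for one frequency bin.

    x_bin, z_bin : complex STFT values of the far-end and microphone signals
    a12, a22     : running cross- and auto-spectrum estimates from the last frame
    Returns the updated (a12, a22) and the regularised gain.
    """
    a12 = sigma * (x_bin * z_bin.conjugate()) + (1.0 - sigma) * a12
    a22 = sigma * (x_bin * x_bin.conjugate()).real + (1.0 - sigma) * a22
    gain = abs(a12) / (c + a22)        # c guards against a small denominator
    return a12, a22, gain

# Toy run: a bin where the microphone always carries half the far-end value.
a12, a22, gain = 0j, 0.0, 0.0
for _ in range(2000):
    a12, a22, gain = update_coloration_gain(a12, a22, 1 + 0j, 0.5 + 0j)
```

In the toy run the gain settles slightly below 0.5 because of the safety constant, which
illustrates how c biases the estimate downwards when the far-end power is small.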
[Figure: three waveform panels versus Samples (0 to 5 x 10^4): Recorded Echo; Echo Residual with sigma = 0.01 (slow convergence); Echo Residual with sigma = 0.05 (fast convergence)]
Figure 4.9 Different convergence time corresponding to different sigma
(α = 1, β = 30. Notice the axis of the recorded echo has larger scale)
[Figure: ERLE (dB) and NEA (dB) versus Sigma (0 to 0.8), with curves for c = 0, c = 0.0001, c = 0.001 and c = 0.01]
Figure 4.10 The influence of the sigma to the ERLE and NEA with different safety
constant values (α = 1, β = 30)
Then the optimal values of α and β to achieve 30 dB ERLE are computed again in the
same way as for the NLMS AES, as shown in Figure 4.11. When α = 1, β = 27 is required.
[Figure: beta versus alpha (0.2 to 2), with a zoomed-in panel for alpha between 0.2 and 1]
Figure 4.11 Alpha – Beta values to achieve 30dB ERLE (Sigma = 0.05,
Safety constant = 0.01, smooth factor = 0.3)
To draw an intermediate conclusion, the performance of the AES is superior to that of the
AEC. The AES achieves a higher ERLE and is able to eliminate the echo completely. During
DT, although the NES is affected as well, the background echo is no longer distinct.
Simulation and listening tests show that the Coloration-effect filter AES achieves results
similar to those of the NLMS AES with much less computational complexity. However, its
convergence is slower than that of the NLMS AES.
4.5 DTD Performance Evaluation
All the algorithms discussed above suffer from the DT problem. The NES is attenuated to
a greater or lesser degree by the different methods, especially by the AES, and a lot of
discontinuities occur during DT. Hence we introduce Double Talk Detectors into the AES.
When DT is detected, the adaptation of the filter is frozen for a hold time and a lower β
is adopted to obtain a lower NEA. A clear voice with some residual echo is preferable to
not being able to hear the speaker properly. A longer hold time protects the NES better,
yet leads to a long recovery time from DT to FE single talk, which may bring in a boost of
the echo. The hold time is normally chosen to be tens of milliseconds; a hold time of
32 ms is used throughout this thesis. As stated in section 2.4.5, the performance of a DTD
is evaluated by the probability of false alarm (Pf) during FE single talk and the
probability of miss (Pm) during DT, as defined in Table 4.1. Based on the computed Pf and
Pm, a plot referred to as the Receiver Operating Characteristic (ROC) curve is adopted as
a comparison criterion between different algorithms. In the ROC curve, the probability of
false alarm is plotted against the probability of miss while the threshold is varied. This
curve tells us which threshold achieves a certain DTD performance in terms of Pf and Pm,
and vice versa. For example, we can find the threshold corresponding to a Pf of 0.1. A
typical ROC curve is illustrated in Figure 4.12. It shows the trade-off between correct
and false detections. The smaller the area enclosed by the ROC curve, which indicates
that a low Pf and a low Pm can be attained at the same time, the better the DTD
performance.
             FE single talk    DT
DTD = 0      Correct           Miss
DTD = 1      False alarm       Correct
Table 4.1 Definition of False alarm and Miss
Figure 4.12 Typical ROC curve illustration
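The points of such a ROC curve follow directly from the definitions in Table 4.1. The
Python sketch below assumes a detector that declares double talk when its statistic
exceeds the threshold (for a detector with the opposite convention the comparison is
reversed) and that a per-sample ground-truth label is available; the function name is
hypothetical.

```python
def roc_points(statistic, is_double_talk, thresholds):
    """Return (threshold, Pf, Pm) for each threshold.

    statistic      : detection statistic per sample
    is_double_talk : ground truth per sample (True = DT, False = FE single talk)
    DT is declared whenever the statistic is larger than the threshold.
    """
    single = [s for s, dt in zip(statistic, is_double_talk) if not dt]
    double = [s for s, dt in zip(statistic, is_double_talk) if dt]
    points = []
    for t in thresholds:
        p_false_alarm = sum(s > t for s in single) / len(single)
        p_miss = sum(s <= t for s in double) / len(double)
        points.append((t, p_false_alarm, p_miss))
    return points
```

Sweeping the threshold from low to high moves the operating point from the corner with
many false alarms and no misses to the corner with no false alarms and many misses.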
4.5.1 Geigel DTD
We first implemented the simple Geigel scheme. The most important parameter of a DTD
is the threshold. The threshold drawn in Figure 4.13 lies above the decision variable of
the entire FE single talk segment, in order to obtain a uniform attenuation there.
Because different suppression factors are used for FE single talk and double talk in the
AES, a false alarm during the FE single talk segment results in a sudden boost of the
residual echo, which sounds annoying. Hence, a low probability of false alarm is
required. With such a threshold, however, part of the DT situations are left undetected.
[Figure: Decision variable for Geigel (dB) versus Samples (0 to 4 x 10^5), with a possible threshold marked]
Figure 4.13 The detection statistic for Geigel algorithm for the nominal recording
with external microphone and loudspeakers
A higher threshold reduces the probability of false alarm at the price of an increased
probability of miss during DT; a lower threshold yields more correct DT detections, yet
may cause many false detections and thereby reduce the ERLE. The ROC curve of the Geigel
algorithm is shown in Figure 4.14.
As stated in section 2.4.2, the Geigel DTD operates by comparing the magnitude of the
received signal with that of the far-end signal. Recalling the equation

$$\xi = \frac{|z(t)|}{\max\left( |x(t)|,\ |x(t-1)|,\ \ldots,\ |x(t-N+1)| \right)}$$

ξ > T: double talk present
ξ ≤ T: double talk not present

it is clear that the detector works well when the NES is much stronger than the FES,
namely when the SER is large, as illustrated in Figure 4.15. The SER is changed by
adjusting the NES while keeping the nominal FES. The probability of false alarm is almost
constant because the strength of the echo is kept the same. As discussed before, the
probability of false alarm is required to be as low as possible, so the threshold is
chosen to attain zero probability of false alarm. When the SER decreases, the chance of
missing DT becomes higher, because most of the decision variable falls below the
threshold due to the small NES.
Overall, the Geigel algorithm only works well under the assumptions that the echo path
attenuates the FES and that the NES is sufficiently strong. Geigel is therefore not a
strong candidate in reality, where the echo path and the NES are both unknown.
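The Geigel decision can be sketched in a few lines of Python. The sketch follows the
usual form of the detector, in which the microphone magnitude is compared with the
largest far-end magnitude over the last N samples; the function name, the default
threshold and the treatment of a silent far end are hypothetical choices.

```python
def geigel_dtd(far_end, mic, window=128, threshold=0.5):
    """Return one boolean per microphone sample: True where DT is declared."""
    decisions = []
    for t, z in enumerate(mic):
        start = max(0, t - window + 1)
        recent_max = max(abs(v) for v in far_end[start:t + 1])
        if recent_max > 0.0:
            decisions.append(abs(z) / recent_max > threshold)
        else:
            # Silent far end: any microphone activity cannot be echo.
            decisions.append(abs(z) > 0.0)
    return decisions
```

Because the statistic only compares instantaneous magnitudes, a single threshold suits
one echo path gain; this is the limitation noted above for unknown echo paths.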
[Figure: Probability of Miss versus Probability of False alarm, both from 0 to 1]
Figure 4.14 ROC curve of Geigel Algorithm under nominal SER
[Figure: Probability of miss and Probability of false alarm versus SER (dB), -5 to 10 dB]
Figure 4.15 Probability of miss decreases as the SER is increased by enlarging the
amplitude of near-end speech (Pf = 0)
4.5.2 NCR DTD
The cheap-NCR algorithm is normally adopted for its efficient calculation. Recalling from
section 2.4.3, the detection statistic of cheap NCR is calculated as

$$\xi = \frac{\sigma_{\hat{y}}^{2}(t)}{\sigma_{z}^{2}(t)}$$

ξ < T: double talk present
ξ ≥ T: double talk not present

which is the ratio between the power of the estimated echo and the power of the
microphone signal. Since it needs the time-domain echo estimate from the adaptive filter,
it is easier to apply to the NLMS algorithm than to the simplified echo path.
The convergence speed of the NLMS filter also needs to be lower, to slow down the
erroneous adaptation during DT and thus obtain the right detection statistic. As shown in
Figure 4.16, a smaller learning rate gives a better result.
The ROC curve is drawn in Figure 4.16 to look for the threshold that achieves a low
probability of false alarm. The performance of the NCR also depends on the window size.
The larger the window size, the more precise the calculation of the power and the better
the detection, as shown in Figure 4.17. The ROC curve with a window size of 800
improves the probability of miss by 10% over the one with a window size of 128 at the same
probability of false alarm. However, a larger window size means a longer computation time
and a larger output delay, while real-time communication demands a low processing delay.
Next, the variation of the probabilities against the SER is evaluated, as illustrated in
Figure 4.18. The threshold is chosen to attain zero probability of false alarm and is
fixed for all SER values. The probability of miss increases as the near-end signal gets
weaker, because the detection statistic becomes larger during double talk, so that fewer
DT events are detected. Compared to the Geigel algorithm in Figure 4.15, the variation of
the probability of miss against the SER is smaller for the cheap-NCR DTD, with a similar
probability of false alarm and the same window size. In all, the NCR DTD is a more
reliable method than the Geigel DTD, yet it requires slower adaptation and a longer
processing delay to obtain better detection.
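The statistic described above, the ratio of the estimated echo power to the microphone
power over a sliding window, can be sketched in Python as follows. The function name, the
default threshold and the value returned for an all-zero window are hypothetical; the
power sums are recomputed for every sample for clarity rather than speed.

```python
def cheap_ncr_dtd(echo_estimate, mic, window=128, threshold=0.8):
    """Return (statistics, decisions); a decision is True where DT is declared."""
    statistics, decisions = [], []
    for t in range(len(mic)):
        start = max(0, t - window + 1)
        p_echo = sum(v * v for v in echo_estimate[start:t + 1])
        p_mic = sum(v * v for v in mic[start:t + 1])
        # Equal window lengths, so the ratio of sums equals the ratio of powers.
        xi = p_echo / p_mic if p_mic > 0.0 else 1.0
        statistics.append(xi)
        decisions.append(xi < threshold)   # near-end speech lowers the ratio
    return statistics, decisions
```

During FE single talk with a converged filter the ratio stays near one, and near-end
speech adds microphone power that the echo estimate does not explain, pulling the ratio
below the threshold. A longer window averages more samples, which is the precision versus
delay trade-off discussed above.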
[Figure: Probability of miss versus Probability of false alarm, with curves for Learning rate = 1.0, 0.2 and 0.1]
Figure 4.16 ROC curves of cheap-NCR DTD for NLMS AES with different
Learning rate under nominal SER
[Figure: Probability of miss versus Probability of false alarm, with curves for Window size = 128 and Window size = 800]
Figure 4.17 ROC curve of cheap-NCR DTD with different window size
under nominal SER (learning rate = 0.1)
[Figure: Probability of miss and Probability of false alarm versus SER (dB), -5 to 10 dB]
Figure 4.18 Probability of miss increases as the SER is decreased by reducing the
amplitude of near-end speech (learning rate = 0.1, window size = 128)
4.5.3 VIRE DTD
The VIRE DTD uses the maximum value of the adaptive filter as a measure of the
fluctuations, as presented in section 2.4.4. The variations of the adaptive filter taps are
high for both the NLMS filter and the coloration-effect filter, so the detector can be
applied to both algorithms. The faster the filter adapts, the larger the variations of the
filter coefficients. A large learning rate makes the VIRE DTD work extremely well, as
observed in Figure 4.19 and Figure 4.20. Although a high learning rate brings a large NEA,
as shown in Figure 4.5, the DTD is now used to protect the NES from being over-attenuated,
so a large learning rate can be adopted to secure the good performance of the DTD. Hence,
one advantage of the VIRE DTD is its inherently fast adaptation.
The ROC curves in Figures 4.19 and 4.20 also show that the VIRE DTD can achieve a very
low probability of false alarm and a very low probability of miss at the same time for
both the NLMS and the coloration-effect filter AES, especially for the latter. The
detection statistic of the VIRE algorithm for the coloration-effect filter AES in Figure
4.21 makes it possible to draw a threshold that separates single talk and double talk
completely, unlike the Geigel and NCR algorithms. The detection statistic during double
talk is much larger than during the single-talk period, which leads to the excellent
performance of the VIRE DTD.
[Figure: Probability of miss versus Probability of false alarm, with curves for Learning rate = 1.0, 0.2 and 0.1]
Figure 4.19 ROC curves of VIRE DTD for NLMS AES with different Learning rate
under nominal SER (window size = 128)
[Figure: Probability of miss versus Probability of false alarm, with curves for sigma = 0.05 and sigma = 0.01]
Figure 4.20 ROC curves of VIRE DTD for the Coloration-effect filter AES with
different Learning rate under nominal SER (window size = 128)
[Figure: Detection Statistic in dB unit versus Samples (0 to 4 x 10^5), with the FE Single Talk, Double Talk and NE Single Talk segments and a possible threshold marked]
Figure 4.21 Detection decision obtained using the VIRE DTD for the
Coloration-effect filter AES under nominal SER
After varying the SER by adjusting the power of the NES, similar results are obtained for
the NLMS AES and the Coloration-effect filter AES. Although the performance of the VIRE
DTD is excellent under the nominal SER, it turns out to vary dramatically with the SER, as
we can see in Figure 4.22. When the echo is much larger than the near-end speech, the
near-end signal is weak during DT. After the NLMS filter has converged during FE single
talk, the filter weights are not influenced much by the NES during DT, so the variation of
the filter coefficients is not large enough to classify the current situation as DT, which
leads to a high probability of miss. The VIRE DTD behaves in a similar manner in the
Coloration-effect filter AES, although its performance is quite stable over a certain SER
range, as illustrated in Figure 4.23. A lower threshold covers a larger SER operating
range, because it allows a weaker NES during DT to still be separated from the FE single
talk, as implied by Figure 4.21.
In conclusion, the VIRE DTD achieves the best detection performance among all the
algorithms. In particular, the VIRE DTD embedded in the Coloration-effect filter AES
provides the most promising results. Moreover, compared to the cheap-NCR algorithm, the
VIRE DTD has the advantage of fast adaptation and convergence. However, the VIRE DTD is
quite sensitive to the strength of the NES: it is hard to detect DT when the NES is too
weak. Hence this method has a certain limitation.
[Figure: Probability of miss and Probability of false alarm versus SER (dB), -5 to 10 dB]
Figure 4.22 Probability of miss of the VIRE DTD in the NLMS AES
varies dramatically with near-end speech (Pf = 0)
[Figure: Probability of miss versus SER (dB), -5 to 15 dB, for T = 0.03, T = 0.05 and T = 0.1; the probability of false alarm is zero for all three thresholds]
Figure 4.23 Performance of the VIRE DTD in the Coloration-effect filter AES
4.6 AES with Different DTD Algorithms
In this section, the improvement that each DTD algorithm brings to the AES is evaluated
and compared. In an AES without a DTD, the same suppression ratio is used everywhere,
which results in a large attenuation of the NES during DT and makes conversation during
DT rather difficult. Hence, a DTD is adopted to detect the DT situation. When DT is
detected, the filter adaptation is stopped and the suppression ratio β is lowered in
order to limit the NEA. However, a certain amount of echo attenuation is still required
during the DT period. Hence, a different β is used to obtain 20 dB ERLE during DT, which
is 10 dB lower than during FE single talk.
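The control logic described above, together with the 32 ms hold time from section 4.5,
can be sketched in Python as follows. The function names are hypothetical, and the β
value used for double talk in the usage note is only a placeholder, since the value that
yields 20 dB ERLE is not stated in the text.

```python
def apply_hold(raw_flags, hold_samples):
    """Extend every DT detection by hold_samples samples after it ends."""
    held = []
    counter = 0
    for flag in raw_flags:
        if flag:
            counter = hold_samples          # restart the hold on each detection
        held.append(flag or counter > 0)
        if not flag and counter > 0:
            counter -= 1
    return held

def control_signals(raw_flags, hold_samples, beta_single_talk, beta_double_talk):
    """Per-sample (adapt_filter, beta): freeze adaptation and lower beta in DT."""
    return [(not dt, beta_double_talk if dt else beta_single_talk)
            for dt in apply_hold(raw_flags, hold_samples)]

# 32 ms at the 8 kHz sampling rate used in the simulations.
HOLD_SAMPLES = round(0.032 * 8000)
```

For instance, `control_signals(flags, HOLD_SAMPLES, 27.0, 5.0)` would use β = 27 during FE
single talk and a lower, purely illustrative β = 5 while double talk is held.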
The performance of each method under the nominal SER is compared first, as shown in
Figure 4.24. With the same 30 dB ERLE achieved during the FE single talk segment, it can
be observed that all the DTD algorithms help to reduce the NEA during DT. Overall, the
Coloration-effect filter AES results in less NEA than the NLMS AES. The results also show
that the Coloration-effect filter AES with the VIRE DTD performs best. This is expected
because of the excellent detection capability of the VIRE DTD for the Coloration-effect
filter AES. In the auditory tests, the NES during DT sounds most natural with the
Coloration-effect filter AES with VIRE DTD, with the least attenuation and without
discontinuities.
In practice the near-end speech is an unknown variable, as is the far-end speech. The
effects of the NES and the FES are evaluated separately in section 4.6.1. As discussed in
section 3.3, various kinds of noise exist in hands-free communication. Some noises are
stationary background noise, while others are abrupt sounds. The impact of stationary
noise is easier to test than that of unpredictable sudden noise, and is discussed in
section 4.6.2.
[Figure: waveform panels versus Samples (0 to 3.5 x 10^5): Original microphone signal z(t); NLMS AES without DTD; NLMS AES with Geigel DTD; NLMS AES with cheap-NCR DTD; NLMS AES with VIRE DTD; Coloration-effect Filter AES without DTD; Coloration-effect Filter AES with Geigel DTD; Coloration-effect Filter AES with VIRE DTD]
Figure 4.24 Comparison of different AES methods under nominal SER
4.6.1 The Influence of the NES and the FES
The performance of each DTD algorithm against a varying NES has been studied in section
4.5. The ERLE is affected by the Pf in the FE single talk segment, while the NEA is
determined by the Pm during DT. Figure 4.25 shows how the power of the NES influences the
ERLE and the NEA. The FES is kept the same, so that the detection statistic of each DTD
during the FE single talk section is stable over the whole SER range. The threshold of
each DTD is chosen to achieve zero Pf to avoid an echo boost. In this way, all methods
use the same suppression ratio during FE single talk, so that a similar ERLE is achieved
and maintained over the whole SER range. As observed before, the Pm increases as the NES
diminishes, and therefore the NEA also rises. The results in Figure 4.25 confirm the
analysis in section 4.5.
Now the nominal NES is kept the same and the FES is varied. The result is illustrated in
Figure 4.26. The louder the FES, namely the larger the echo, the larger the ERLE that is
achieved. However, it can be observed that the echo attenuation drops for the VIRE and
Geigel algorithms when the echo is too large. Due to the amplification from the volume
adjustment, the magnitude of the echo in the microphone signal z(t) may be comparable to
or even larger than that of the FES x(t). The performance of the Geigel algorithm and of
the VIRE DTD for the Coloration-effect filter AES depends on the ratio between the
microphone signal z(t) and the loudspeaker signal x(t), and the Pf increases as z(t)
grows. For the VIRE DTD in the NLMS AES, the variation of the filter taps becomes larger
during FE single talk as the microphone signal increases, which also increases the Pf. As
a result, the filter adaptation is slowed down and a low suppression ratio is used with
the Geigel and VIRE DTDs, so that a significant amount of echo is left after cancellation.
The NCR algorithm does not suffer from this problem: it relies on an echo path estimate
based on both the FES and the echo signal, so that the ratio between the estimated echo
and the actual echo is relatively stable, which gives a stable DTD performance.
[Figure: NEA (dB) and ERLE (dB) versus SER (dB), -5 to 10 dB, with curves for NLMS AES, NLMS AES with NCR DTD, NLMS AES with Geigel DTD, NLMS AES with VIRE DTD, CF AES, CF AES with Geigel DTD and CF AES with VIRE DTD]
Figure 4.25 Performance variation against SER with varying NES
[Figure: NEA (dB) and ERLE (dB) versus SER (dB), -5 to 10 dB, with curves for NLMS AES, NLMS AES with NCR DTD, NLMS AES with Geigel DTD, NLMS AES with VIRE DTD, CF AES, CF AES with Geigel DTD and CF AES with VIRE DTD]
Figure 4.26 Performance variation against SER with varying FES
[Figure: Probability of miss and Probability of false alarm versus SER (dB), -5 to 10 dB, with curves for NCR DTD, Geigel DTD, VIRE DTD for CF and VIRE DTD for NLMS]
Figure 4.27 DTD performance variation caused by varying FES
4.6.2 The Noise Performance
To examine the influence of the strength of stationary noise, white noise with
adjusted power is added to the nominal recording. As Figure 4.28 shows, the
resulting ERLE drops as the noise power increases. As the analysis above indicates,
the performance of the DTD determines how the AES behaves overall, so the
influence of the stationary noise on the DTD is examined in Figure 4.29. The
decline of the ERLE is caused on the one hand by the increase of Pf, and on the other
hand by the larger noise contribution to the residual signal.
The NEA is almost invariant because Pm varies little. The Pf of the VIRE DTD for the
Coloration-effect Filter AES barely changes, which guarantees a steady attenuation of
the echo during FE single talk regardless of the noise strength. The Pm rises
as the noise gets stronger, so that some discontinuities occur, as in the NCR and
Geigel methods.
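As an illustration of how the noise level can be set for such a test, the following Python sketch scales white Gaussian noise to a requested SNR before adding it to a recording. It is only a minimal sketch and not the MATLAB code used in this thesis. The function names (power, add_white_noise, measured_snr_db), the sine wave standing in for the nominal recording, and the definition of the SNR over the whole recording are assumptions made for the example.

```python
import math
import random


def power(signal):
    """Mean squared value of a sequence of samples."""
    return sum(s * s for s in signal) / len(signal)


def add_white_noise(signal, snr_db, seed=0):
    """Add white Gaussian noise so that the signal-to-noise ratio equals snr_db.

    Returns the noisy signal and the scaled noise sequence that was added.
    """
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    # Required noise power for the requested SNR (in dB).
    target_noise_power = power(signal) / (10.0 ** (snr_db / 10.0))
    # Gain that brings this particular noise realisation to the target power.
    gain = math.sqrt(target_noise_power / power(noise))
    scaled = [gain * n for n in noise]
    noisy = [s + n for s, n in zip(signal, scaled)]
    return noisy, scaled


def measured_snr_db(signal, noise):
    """SNR in dB computed from the clean signal and the added noise."""
    return 10.0 * math.log10(power(signal) / power(noise))


# Hypothetical stand-in for the nominal recording: 1 s of a 440 Hz tone at 8 kHz.
fs = 8000
recording = [0.5 * math.sin(2.0 * math.pi * 440.0 * n / fs) for n in range(fs)]

for target in (10, 20, 30):
    noisy, noise = add_white_noise(recording, target)
    print(target, measured_snr_db(recording, noise))
```

Because the gain is computed from the power of the actual noise realisation, the measured SNR matches the requested value up to rounding, which makes it straightforward to sweep the SNR axis of a noise test.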
Figure 4.28 Noise Performance under nominal SER
[Plot not reproduced. Panels: NEA (dB) versus SNR (dB); ERLE (dB) versus SNR (dB).
Curves: NLMS AES; NLMS AES with NCR DTD; NLMS AES with Geigel DTD; NLMS AES with
VIRE DTD; CF AES; CF AES with Geigel DTD; CF AES with VIRE DTD.]
Figure 4.29 Comparison of DTD noise performance
[Plot not reproduced. Panels: Probability of false alarm versus SNR (dB); Probability
of miss versus SNR (dB). Curves: NCR DTD; Geigel DTD; VIRE DTD for CF; VIRE DTD for
NLMS.]
CHAPTER V
CONCLUSION & FURTHER WORK
5.1 Summary and Conclusion
Nowadays, conventional and hands-free telephones play an increasingly important
role in meeting people’s communication needs. One of the major problems in
telecommunication over a telephone system is echo. This thesis is devoted to
finding a solution for acoustic echo cancellation during hands-free conversation using a PC.
The basic echo canceller based on the well-known NLMS algorithm is studied first. The
resulting ERLE of the basic AEC is low, so the residual echo is still audible.
Therefore the acoustic echo suppressor is introduced, which is able to eliminate the echo
completely. The NLMS algorithm is first adopted to calculate the magnitude spectrum
of the echo signal in the AES, but it has a high computational complexity. Then a
recently proposed algorithm, which uses a coloration-effect filter to estimate the
magnitude spectrum of the main portion of the acoustic path, is studied and modified to
ease the calculation. The disadvantage of the Coloration-effect Filter AES is its slower
adaptation compared to the NLMS AES. Both AES algorithms are capable of
making the echo inaudible during far-end single talk, but both suffer from near-end
attenuation and discontinuity problems during double talk. Hence, Double Talk
Detection algorithms are investigated, including the Geigel DTD, the (cheap) NCR DTD and
the VIRE DTD. Each DTD algorithm is analyzed and evaluated individually based on
two parameters: Pf and Pm. After that, the DTD algorithms are integrated into each
AES method and compared. From the simulations and listening tests, it is found that
the Coloration-effect Filter AES with VIRE DTD gives the best result, with the
least attenuation and without discontinuities. Yet this result only holds when
the near-end signal picked up by the microphone is strong enough compared to the
far-end speech. Also, the performance only degrades once the noise becomes stronger
than a certain level. In all, the echo cancellation algorithm presented in this thesis
provides a software solution to the problem of echoes in the
telecommunications environment. Furthermore, much effort has been devoted to
the ways of tuning the parameters and to a general framework for evaluating and comparing
different algorithms, as well as to the interpretation of the results. The
proposed algorithm is a pure software approach that does not use any DSP
hardware components. It can run on any PC with MATLAB
installed. In addition, the results obtained were convincing. The quality of the
output speech signals was highly satisfactory and met the goals of this research.
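Since Pf and Pm are the two parameters used to evaluate every DTD in this work, a small Python sketch of how they can be counted from a detector output is given here. It is an illustration under stated assumptions rather than the evaluation code of this thesis: the function name pf_pm and the three per-sample boolean sequences are hypothetical, Pf is assumed to be counted over far-end single talk only, and Pm over double-talk periods only.

```python
def pf_pm(decisions, near_end_active, far_end_active):
    """Estimate the probability of false alarm (Pf) and of miss (Pm).

    decisions[i]       -- True if the DTD declared double talk at sample i
    near_end_active[i] -- True if near-end speech is present (ground truth)
    far_end_active[i]  -- True if far-end speech is present (ground truth)

    Assumption: Pf is the fraction of far-end single-talk samples flagged as
    double talk, and Pm is the fraction of double-talk samples not flagged.
    """
    false_alarms = single_talk = misses = double_talk = 0
    for declared, near, far in zip(decisions, near_end_active, far_end_active):
        if far and not near:
            single_talk += 1
            false_alarms += 1 if declared else 0
        elif far and near:
            double_talk += 1
            misses += 0 if declared else 1
    pf = false_alarms / single_talk if single_talk else 0.0
    pm = misses / double_talk if double_talk else 0.0
    return pf, pm


# Toy example: four samples of far-end single talk followed by four of double talk.
far_end = [True] * 8
near_end = [False] * 4 + [True] * 4
detector = [False, True, False, False, True, False, True, True]
print(pf_pm(detector, near_end, far_end))
```

Sweeping the detection threshold T and recording the resulting (Pf, Pm) pairs is one way to obtain the points of an ROC curve.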
5.2 Possible Further Work
The algorithm was tested entirely ‘off-line’. The test speech was
recorded beforehand as input to the algorithm, and the output was examined after the
simulation. Therefore, a real-time implementation for testing purposes could be the most
interesting future work.
A high background noise level is annoying to the listener during a conversation
and affects the performance of the algorithm. However, background noise is a
natural part of a conversation and conveys the surrounding environment of the
person we talk to. Hence, a noise suppression algorithm is needed to reduce the
background noise to a comfortable level. Moreover, ways to handle the
music noise, which is trickier to solve, could also be studied in the future.
In practice, the echo could still be noticeable due to large variations of the echo path
characteristics. Therefore, the reaction of the algorithm to echo path changes
should be further investigated and evaluated.
LIST OF ACRONYMS
AEC: Acoustic Echo Canceller
AES: Acoustic Echo Suppressor
DT: Double Talk
DTD: Double Talk Detection (Detector)
FES: Far End Speech
LMS: Least Mean Square
NCR: Normalized Cross-correlation
NES: Near End Speech
NLMS: Normalized Least Mean Square
Pf: Probability of false alarm
Pm: Probability of miss
SER: Signal To Echo Ratio
SNR: Signal To Noise Ratio
VIRE: Variance Impulse Response
LIST OF FIGURES
Figure 1.1: Hybrid Connections and the Resulting Electric Echo
Figure 1.2: Basic setup of a hands-free communication system
Figure 1.3: Generation of acoustic echo through direct coupling and reverberations
Figure 1.4: General schematic of Acoustic Echo Cancellation
Figure 1.5: Composition of the signals used to calculate the NEA
Figure 2.1: Noise suppression with spectral subtraction
Figure 2.2: Smoothing of a step function
Figure 2.3: Error caused by Hann-Windowing FFT
Figure 2.4: Error caused by Hamming-Windowing FFT
Figure 2.5: Error caused by Kaiser-Windowing FFT with different beta
Figure 2.6: Gradient of the Error function
Figure 2.7: γ plot
Figure 2.8: Typical room impulse response
Figure 3.1: Room Impulse Response in the Scream room in NXP Leuven
Figure 3.2: Common Laptop Noise Sources
Figure 3.3: Typical look of a typing sound on the keyboard
Figure 3.4: Typing noise cancellation
Figure 4.1: Speech stimuli segmentation
Figure 4.2: How ERLE changes with the safety constant
Figure 4.3: The effect of Learning rate on ERLE (safety constant = 0.1)
Figure 4.4: Echo Cancellation Result of NLMS AEC under nominal SER
Figure 4.5: The effect of learning rate on the ERLE and the NEA
Figure 4.6: The effect of the smooth factor on the ERLE and the NEA
Figure 4.7: The effect of smooth factor on the NES during DT
Figure 4.8: Alpha – Beta values to achieve 30dB ERLE
Figure 4.9: Different convergence time corresponding to different sigma
Figure 4.10: The influence of the sigma to the ERLE and NEA with different safety
constant values
Figure 4.11: Alpha – Beta values to achieve 30dB ERLE
Figure 4.12: Typical ROC curve illustration
Figure 4.13: The detection statistic for Geigel algorithm for the nominal recording with
external microphone and loudspeakers
Figure 4.14: ROC curve of Geigel Algorithm under nominal SER
Figure 4.15: Probability of miss decreases as the SER is increased by enlarging the
amplitude of near-end speech (Pf = 0)
Figure 4.16: ROC curves of cheap-NCR DTD for NLMS AES with different Learning
rate under nominal SER
Figure 4.17: ROC curve of cheap-NCR DTD with different window size
under nominal SER
Figure 4.18: Probability of miss increases as the SER is decreased by reducing the
amplitude of near-end speech
Figure 4.19: ROC curves of VIRE DTD for NLMS AES with different Learning rate
under nominal SER
Figure 4.20: ROC curves of VIRE DTD for the Coloration-effect filter AES with
different Learning rate under nominal SER
Figure 4.21: Detection decision obtained using the VIRE DTD for the
Coloration-effect filter AES under nominal SER
Figure 4.22: Probability of miss of the VIRE DTD in the NLMS AES
varies dramatically with near-end speech
Figure 4.23: Performance of the VIRE DTD in the Coloration-effect filter AES
Figure 4.24: Comparison of different AES methods under nominal SER
Figure 4.25: Performance variation against SER with varying NES
Figure 4.26: Performance variation against SER with varying FES
Figure 4.27: DTD performance variation caused by varying FES
Figure 4.28: Noise Performance under nominal SER
Figure 4.29: Comparison of DTD noise performance
REFERENCES
Christof Faller and Christophe Tournery, “Robust Acoustic Echo Control Using a Simple
Echo Path Model,” Audiovisual Communications Laboratory, EPFL, Lausanne,
Switzerland, 2006.
C. Faller and C. Tournery, “Estimating the delay and coloration effect of the acoustic
echo path for low complexity echo suppression,” in Proc. Intl. Workshop on Acoustic Echo
and Noise Control (IWAENC), Sept. 2005.
Andreas Jakobsson (Karlstad University) and Per Åhgren (Uppsala University), “Acoustic
Echo Cancellation.”
S. F. Boll, “Suppression of acoustic noise in speech using spectral subtraction,” IEEE
Trans. Acoust., Speech, Signal Processing, vol. 27, no. 2, pp. 113–120, Apr. 1979.
K. Ochiai, T. Araseki, and T. Ogihara, “Echo canceller with two echo path models,”
IEEE Trans. on Communications, vol. 25, no. 6, pp. 589–595, June 1977.
Jun H. Cho, Dennis R. Morgan and Jacob Benesty, “An Objective Technique for
Evaluating Doubletalk Detectors in Acoustic Echo Cancellers,” IEEE Trans. on Speech
and Audio Processing, vol. 7, no. 6, Nov. 1999.
Raghavendran, Srinivasaprasath, “Implementation of an Acoustic Echo Canceller Using
Matlab,” 2003.
P. Åhgren, “On System Identification and Acoustic Echo Cancellation,” PhD thesis,
Uppsala University, 2004.
J. Benesty, D. R. Morgan, and J. H. Cho, “A new class of doubletalk detectors
based on cross-correlation,” IEEE Trans. Speech Audio Processing, vol. 8, pp. 168–172,
March 2000.
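Geigel DTD
  Form detection statistic: $\xi = \dfrac{|z(t)|}{\max\bigl(x(t),\, x(t-1),\, \ldots,\, x(t-N+1)\bigr)}$
  Comparison with T: $\xi > T$: double talk present; $\xi \le T$: double talk not present

Cheap NCR DTD
  Form detection statistic: $\xi = \dfrac{\sigma_{\hat{y}}^{2}(t)}{\sigma_{z}^{2}(t)}$
  Comparison with T: $\xi < T$: double talk present; $\xi \ge T$: double talk not present

VIRE DTD
  Form detection statistic: $\xi(n) = \lambda \cdot \xi(n-1) + (1-\lambda) \cdot [\gamma - \bar{\gamma}]^{2}$
    with $\bar{\gamma}(n) = \lambda \cdot \bar{\gamma}(n-1) + (1-\lambda) \cdot \gamma$
    and $\gamma = \max\bigl(h(0),\, h(1),\, \ldots,\, h(k-1)\bigr)$
  Comparison with T: $\xi > T$: double talk present; $\xi \le T$: double talk not present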
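The three detection statistics summarised above can be sketched in Python as follows. This is a hedged illustration, not the MATLAB implementation of this thesis. All function and class names are hypothetical, and the threshold values in the example are arbitrary. Further assumptions: magnitudes of the far-end samples are used in the Geigel maximum (as is usual for this detector), the variances of the cheap NCR statistic are estimated as the mean power over a frame, and the running mean of the VIRE statistic is initialised with the first observed peak and updated after the statistic itself.

```python
def geigel_statistic(z_t, x_recent):
    """xi = |z(t)| / max(|x(t)|, ..., |x(t-N+1)|) for the last N far-end samples."""
    peak = max(abs(v) for v in x_recent)
    if peak == 0.0:
        return 0.0 if z_t == 0.0 else float("inf")
    return abs(z_t) / peak


def mean_power(frame):
    """Mean squared value, used here as a simple variance estimate."""
    return sum(v * v for v in frame) / len(frame)


def cheap_ncr_statistic(y_hat_frame, z_frame):
    """xi = variance of the estimated echo / variance of the microphone signal."""
    pz = mean_power(z_frame)
    return mean_power(y_hat_frame) / pz if pz > 0.0 else 1.0


class VireStatistic:
    """Recursive variance of the peak filter tap, with forgetting factor lam."""

    def __init__(self, lam):
        self.lam = lam
        self.gamma_bar = None  # running mean of the peak tap
        self.xi = 0.0          # running variance of the peak tap

    def update(self, h):
        gamma = max(h)  # peak of the current taps, as written in the table
        if self.gamma_bar is None:
            self.gamma_bar = gamma  # assumption: start the mean at the first peak
        lam = self.lam
        self.xi = lam * self.xi + (1.0 - lam) * (gamma - self.gamma_bar) ** 2
        self.gamma_bar = lam * self.gamma_bar + (1.0 - lam) * gamma
        return self.xi


def double_talk(xi, threshold, present_when_above):
    """Geigel and VIRE flag double talk for xi > T, the cheap NCR for xi < T."""
    return xi > threshold if present_when_above else xi < threshold


# Toy values only; the thresholds 0.6, 0.8 and 0.01 are arbitrary.
xi_geigel = geigel_statistic(1.0, [0.2, -0.8, 0.4])
xi_ncr = cheap_ncr_statistic([0.1, -0.2, 0.1, 0.2], [0.4, 0.1, -0.2, 0.5])
vire = VireStatistic(lam=0.9)
for taps in ([0.1, 0.5, 0.2], [0.1, 0.5, 0.2], [0.1, 0.9, 0.2]):
    xi_vire = vire.update(taps)

print(double_talk(xi_geigel, 0.6, True),
      double_talk(xi_ncr, 0.8, False),
      double_talk(xi_vire, 0.01, True))
```

The opposite direction of the comparison for the cheap NCR DTD follows from its definition: near-end speech raises the power of the microphone signal but not that of the estimated echo, so the ratio falls during double talk.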