
SPEECH ENHANCEMENT USING ADAPTIVE FILTERS AND INDEPENDENT COMPONENT ANALYSIS APPROACH

Tomasz Rutkowski*, Andrzej Cichocki* and Allan Kardec Barros**

*Brain Science Institute RIKEN, Wako-shi, Saitama, JAPAN

**Depto. de Engenharia Eletrica, Universidade Federal do Maranhao, Sao Luis - MA - BRAZIL

email: [email protected], [email protected] and [email protected]
http://www.bsp.brain.riken.go.jp/

ABSTRACT

In this paper we consider the problem of enhancement and extraction of the speech signal of one speaker corrupted by environmental acoustic noise/interference and other speakers, using an array of at least two microphones. The preprocessing unit mimics the human auditory system by roughly emulating the cochlea with a nonuniform bandpass filter bank. We construct the filter bank with subband center frequencies based on an approximation of the target speaker's fundamental frequency. We then apply a blind signal separation method to each subband signal to extract the maximum information representing the target speaker's speech. The desired signal is then reconstructed from the independent components representing every subband. Experiments with office room recordings are presented to confirm the validity and good performance of the proposed method in a real-world environment.

1. INTRODUCTION

We propose a system that approximately emulates the human auditory system using a nonuniform bandpass filter bank that tracks the fundamental frequency of the target speaker. The filter bank consists of bandpass filters, with a bandwidth starting from 100 Hz to 200 Hz for the first filter and doubling for each subsequent filter. For the experiments we use telephone-quality speech signals with a sampling frequency of 8 kHz.
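As a rough illustration, the subband layout implied by this rule can be sketched as follows. This is a minimal sketch assuming the first band spans 100-200 Hz and each subsequent bandwidth doubles; the variable names are ours, not the paper's.

```python
# Sketch (not from the paper): band edges with doubling bandwidths,
# starting from a 100-200 Hz band, for speech sampled at 8 kHz.
fs = 8000
edges = [100.0]      # lower edge of the first band
bw = 100.0           # bandwidth of the first band
while edges[-1] + bw < fs / 2:
    edges.append(edges[-1] + bw)
    bw *= 2          # each successive band is twice as wide
print(edges)         # [100, 200, 400, 800, 1600, 3200]
```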

We assume the hearing system performs a spectrographic analysis of an auditory stimulus at the cochlea, which can be regarded as a bank of nonuniform self-adaptive filters whose outputs are ordered tonotopically. First, the fundamental frequency of the target speaker in the mixture of convolved speech signals is estimated in order to properly design the center frequencies of the filters.
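The paper estimates f0 in the spectral domain (Section 2 and Fig. 1); the sketch below instead uses a simple autocorrelation estimate over a voiced frame, a common alternative offered purely as a stand-in. The function and its default pitch range are our own illustration.

```python
import numpy as np

def estimate_f0(frame, fs, f0_min=70.0, f0_max=400.0):
    """Crude autocorrelation pitch estimate for one voiced frame
    (a stand-in for the paper's spectral-domain f0 estimator)."""
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(fs / f0_max)   # shortest admissible pitch period
    lag_max = int(fs / f0_min)   # longest admissible pitch period
    lag = lag_min + np.argmax(ac[lag_min:lag_max])
    return fs / lag

# usage: a synthetic 150 Hz "voiced" frame sampled at 8 kHz
fs = 8000
t = np.arange(fs // 10) / fs                 # 100 ms frame
frame = np.sin(2 * np.pi * 150 * t) + 0.3 * np.sin(2 * np.pi * 300 * t)
print(estimate_f0(frame, fs))                # approximately 150 Hz
```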

The bank of adaptive bandpass filters processes the available microphone signals around the fundamental frequency of the target speaker and around its harmonics. In the next stage of processing we perform blind source separation (BSS), blind source extraction (BSE) or independent component analysis (ICA) for each frequency subband (bin). Efficient learning algorithms are used to perform BSS/BSE/ICA. Finally, a set of switches is implemented to perform temporal masking and selection/classification of the independent components whose specific features enhance the voice of the target speaker. The main problem in this stage is to decide which of the obtained subband independent components should be discarded and which contain essential information from the target speaker. We apply a spectral measure to solve this problem in every subband. After such processing we have subband signals that carry speech with enhanced target-speaker information. The last stage of our signal processing system performs signal reconstruction. The inverse filter bank is applied to correctly reconstruct the target speaker's voice, avoiding aliasing of the subband-filtered components. Extensive computer simulation results with arrays of two or more microphones confirm the validity and performance of the proposed approach. We present results for real room recordings with natural reverberation from the walls and objects in the room.

2. AUDITORY SYSTEM FILTER BANK

It is well known that the human auditory system can roughly be described as a nonuniform bandpass filter bank, consisting of strongly overlapping bandpass filters

[Fig. 1 diagram: sources s1, s2, s3 → microphones x1, x2 → f0 estimation and auditory filter banks (subbands around f0, ..., Nf0) → BSE1, ..., BSEN → inverse auditory filter bank → outputs y1, ..., yN.]

Fig. 1. Conceptual block diagram of the algorithm, which roughly mimics the auditory system. First it estimates the fundamental frequency f0 and processes the microphone signals using a bank of bandpass filters (as the inner ear does). Then it processes the mixed/convolved signals with a BSE or blind signal deconvolution (BSD) algorithm for each frequency bin. Finally, it performs masking by a set of switches and applies the inverse filter bank.

with bandwidths on the order of 50 to 100 Hz for signals below 500 Hz and up to 5000 Hz for signals at higher frequencies [1]. The hearing system performs a spectrographic analysis of any auditory stimulus at the cochlea, which can be regarded as a bank of nonuniform self-adaptive filters whose outputs are ordered tonotopically. Recently, many sophisticated and biologically plausible models of such auditory filter banks have been proposed. In this paper we employ the idea of speaker voice extraction/enhancement from natural mixtures recorded in real environments. Following this idea, we suggest a preprocessing unit for blind signal separation that transforms the microphone signals into subbands with center frequencies around the fundamental frequency f0 and its harmonics. The fundamental frequency of the target speaker can be estimated on the basis of the microphone signals \tilde{x}_1 and \tilde{x}_2 and/or the estimated signal y(t). Unvoiced speech is noisy due to the random nature of the signal generated at a narrow constriction in the vocal tract for such sounds.

For both voiced and unvoiced excitation, the vocal tract, acting as a filter, amplifies certain sound frequencies while attenuating others. As a periodic signal, voiced speech has spectra consisting of harmonics of the fundamental frequency of the vocal fold vibration; this frequency, often abbreviated f0, is the physical aspect of speech corresponding to perceived pitch. In our algorithm we use this feature to track the speaker, assuming that the fundamental frequency f0 is known or can be estimated. Our filter bank adapts the central frequencies of the subband bins to enhance the signal of the target speaker.

An important property of speech signals is that they are non-stationary, but can be regarded as locally stationary. Roughly speaking, speech can be divided in the time domain into voiced and unvoiced sounds, the former having more structure in the frequency domain than the latter. Indeed, voiced sounds are generally regarded as quasi-periodic. In this matter, some experiments have indicated that humans may be using this voiced structure to separate sounds from the background. Moreover, humans can more easily understand a voiced than an unvoiced sound.

One open problem is how the auditory system segregates those sounds at a higher level (at the auditory cortex). In this work, we suggest that this can be carried out by exploiting the local statistical independence of a pair of subband sounds for each frequency subband (bin).

Our final objective is to develop an efficient algorithm whose output signal y(t) is a modified version of a speech signal si(t), i.e. the signal of interest y(t) = g(si(t)), where g(·) represents an unavoidable distortion filter and a non-linear transformation operator.

Our algorithm also includes the temporal masking characteristic of the auditory system. This can be managed by a set of switches which select the speaker-related information from the subband components after the blind separation/extraction algorithm.
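The spectral measure used for this selection is not spelled out in the paper; one plausible realization, sketched below, scores each candidate component by the fraction of its spectral energy concentrated near the first few harmonics of the estimated f0. All names, defaults, and the tolerance are our assumptions.

```python
import numpy as np

def harmonic_score(y, fs, f0, n_harm=5, tol_hz=20.0):
    """Fraction of spectral energy within tol_hz of the first n_harm
    harmonics of f0 -- one plausible 'spectral measure' for picking
    the target-speaker component in a subband (an assumption, not
    the paper's exact criterion)."""
    spec = np.abs(np.fft.rfft(y)) ** 2
    freqs = np.fft.rfftfreq(len(y), d=1.0 / fs)
    mask = np.zeros_like(freqs, dtype=bool)
    for h in range(1, n_harm + 1):
        mask |= np.abs(freqs - h * f0) < tol_hz
    return spec[mask].sum() / spec.sum()

# the switch keeps the highest-scoring component per subband, e.g.:
# best = max(components, key=lambda y: harmonic_score(y, fs, f0))
```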

Fig. 1 shows the conceptual block diagram of the system. It is composed of four parts. The first one is the fundamental frequency estimation algorithm, operating in the spectral domain, which estimates the fundamental frequency from the mixed signals and also indicates which parts of the speech are voiced. It should be noted that estimation of f0 becomes rather difficult when all speakers are at the same distance from the microphones. The second processing unit is a bank of FIR bandpass filters with center frequencies adapted to the estimated f0

[Fig. 2 diagrams: (a) analysis section: x(n) → H0, H1 → y0(n), y1(n) → decimators → v0(n), v1(n); (b) synthesis section: v0(n), v1(n) → expanders → u0(n), u1(n) → F0, F1 → x̂(n).]

Fig. 2. Filter bank analysis (a) and synthesis (b) sections for only 2 subbands.

for the target speaker. The filter banks process the signals around the fundamental frequency and around its harmonics. The third section is a bank of BSE (blind signal extraction) units for enhancing the desired signal in each frequency subband. Finally, the last section performs signal reconstruction based on the subband signals. The reconstruction section is the inverse auditory filter bank with respect to the second unit of our system. In this way we try to reconstruct the signal from the subbands while avoiding aliasing and frequency distortion problems.

3. ADAPTIVE FILTER BANKS

The filter bank consists of slightly overlapping bandpass filters, with a bandwidth starting from 100 Hz to 200 Hz for the first filter and doubling for each subsequent filter. First, the fundamental frequency of the target speaker in the mixture of convolved speech signals is estimated in order to properly design the center frequencies of the filters. The bank of adaptive bandpass filters processes the available microphone signals around the fundamental frequency of the target speaker and around its harmonics. Following the frequency sensitivity of the human auditory system, we construct filter banks with center frequencies f0, 4f0, 10f0, 22f0, ..., making each subsequent subband twice as wide towards higher frequencies, because human speech carries less sound representation there. The lowpass and highpass FIR filters are implemented in a cascade configuration. In order to achieve perfect reconstruction of the signal, we design analysis and synthesis filters with the following constraints [2] in every section (see Fig. 2 for reference). The following constraint prevents distortion:

F_0(z) H_0(z) + F_1(z) H_1(z) = 2 z^{-l}    (1)

[Fig. 3 plot: magnitude responses, Magnitude (dB) from -70 to 10 versus Normalized Frequency (×π rad/sample) from 0 to 1.]

Fig. 3. Filter bank frequency characteristic for 5 subbands.

and to avoid aliasing problems we apply:

F_0(z) H_0(-z) + F_1(z) H_1(-z) = 0    (2)

The frequency characteristic of an exemplary five-section filter bank is presented in Fig. 3. Every subband, constructed from a pair of lowpass and highpass filters, is always centered around f0 and its higher harmonics, i.e. f0, 4f0, 10f0, 22f0, ..., for the given target speaker.
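For intuition, conditions (1) and (2) can be checked numerically for the simplest two-channel pair (Haar/QMF). This toy example is not the paper's actual filter design, only a sketch showing both constraints being satisfied.

```python
import numpy as np

# Toy Haar/QMF pair checking conditions (1)-(2) numerically.
h0 = np.array([1.0, 1.0]) / np.sqrt(2)   # analysis lowpass  H0(z)
h1 = h0 * np.array([1.0, -1.0])          # analysis highpass H1(z) = H0(-z)
g0 = h1 * np.array([1.0, -1.0])          # synthesis F0(z) = H1(-z)
g1 = -h0 * np.array([1.0, -1.0])         # synthesis F1(z) = -H0(-z)

def alt(c):
    """Coefficients of P(-z) given coefficients of P(z)."""
    return c * (-1.0) ** np.arange(len(c))

# no-distortion condition (1): F0(z)H0(z) + F1(z)H1(z) = 2 z^{-l}
print(np.convolve(g0, h0) + np.convolve(g1, h1))             # [0. 2. 0.] -> 2 z^{-1}
# alias-cancellation condition (2): F0(z)H0(-z) + F1(z)H1(-z) = 0
print(np.convolve(g0, alt(h0)) + np.convolve(g1, alt(h1)))   # [0. 0. 0.]
```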

4. BLIND SOURCE SEPARATION

In order to separate the subband signals we can use any algorithm for BSS/ICA, such as the natural gradient algorithm, SOBI, JADE, RICA, etc. [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16].
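As an off-the-shelf illustration (not one of the algorithms used in this paper), a single frequency bin could be separated with scikit-learn's FastICA. The function below assumes the bandpass-filtered microphone signals for one subband are stacked as columns.

```python
import numpy as np
from sklearn.decomposition import FastICA

def separate_subband(x_sub):
    """Separate one frequency bin. x_sub has shape (n_samples, n_mics):
    the bandpass-filtered microphone signals for this subband. FastICA
    is a stand-in for the BSS/ICA algorithms the paper cites."""
    ica = FastICA(n_components=x_sub.shape[1], random_state=0)
    return ica.fit_transform(x_sub)   # columns are independent components
```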

In this paper we use a novel algorithm for blind source extraction (BSE). Due to space limitations we present here only the final algorithm, without the theoretical justification. Let us consider a single processing unit for extraction of a voiced independent subcomponent (see Fig. 4):

y_i(k) = w_i^T x_i(k) = \sum_{j=1}^{m} w_{ij} x_{ij}(k),    (3)

\varepsilon_i(k) = y_i(k) - \sum_{p=1}^{L} b_{ip} y_i(k-p) = w_i^T x_i(k) - \tilde{y}_i(k),    (4)

where m is the number of microphones, i = 1, 2, ..., N (N is the number of frequency bins), w_i = [w_{i1}, ..., w_{im}]^T, and \tilde{y}_i(k) = \sum_{p=1}^{L} b_{ip} y_i(k-p) is the output of an FIR bandpass filter with suitably chosen center frequency and bandwidth. The coefficients b_{ip} are fixed. It can easily be shown that the weights of the BSE processing unit can be iteratively updated as follows:

w_i = R_{x_i x_i}^{-1} R_{x_i \tilde{y}_i},    w_i \leftarrow w_i / \|w_i\|,    (5)

[Fig. 4 diagram: inputs x_{i1}, x_{i2}, ..., x_{im} weighted by w_{i1}, w_{i2}, ..., w_{im} and summed to y_i(k), which passes through the bandpass filter b_{ip} centered at f_{0i} to give \tilde{y}_i(k).]

Fig. 4. Single processing unit for blind extraction of an independent voiced component (m denotes the number of microphones, typically m = 2).

where

R_{x_i x_i} = \frac{1}{N} \sum_{k=1}^{M} x_i(k) x_i^T(k),    R_{x_i \tilde{y}_i} = \frac{1}{N} \sum_{k=1}^{M} x_i(k) \tilde{y}_i(k).    (6)

It should be noted that by changing the central frequency of the bandpass filter we can, in general, extract different components. Moreover, using the above concept we always extract the desired independent components with the highest energy, so the masking set of switches is not necessary.
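A compact sketch of the extraction unit of Eqs. (3)-(6) follows. It assumes the bandpass coefficients b encode the delays p = 1, ..., L (i.e., b[0] = 0) and uses a sample-mean estimate of the correlation matrices; this is our reading of the update rule, not the authors' reference implementation.

```python
import numpy as np
from scipy.signal import lfilter

def bse_extract(X, b, n_iter=20, seed=0):
    """Blind extraction of one voiced component, following Eqs. (3)-(6).
    X : (m, M) bandpass-filtered microphone signals for one subband
    b : fixed FIR coefficients b_ip of the filter around f0 (b[0] = 0
        so that y_tilde uses only delayed samples, as in Eq. (4))
    Returns the extracted component y and the weight vector w."""
    rng = np.random.default_rng(seed)
    m, M = X.shape
    w = rng.standard_normal(m)
    w /= np.linalg.norm(w)
    Rxx = X @ X.T / M                     # sample estimate of R_{x_i x_i}
    Rxx_inv = np.linalg.inv(Rxx)
    for _ in range(n_iter):
        y = w @ X                         # Eq. (3)
        y_tilde = lfilter(b, [1.0], y)    # \tilde{y}_i(k) = sum_p b_p y(k-p)
        Rxy = X @ y_tilde / M             # sample estimate of R_{x_i \tilde{y}_i}
        w = Rxx_inv @ Rxy                 # Eq. (5)
        w /= np.linalg.norm(w)            # normalization in Eq. (5)
    return w @ X, w
```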

5. MULTICHANNEL BLIND DECONVOLUTION/EQUALIZATION

Instead of instantaneous blind extraction of the subband signals we can apply blind deconvolution/equalization, especially if the bandwidth is relatively large. The simple model for multichannel deconvolution/equalization is

Fig. 5. Room recording plan.

[Fig. 6 panels: mixture from microphone #1; mixture from microphone #2 (closer to target speaker); extracted speaker.]

Fig. 6. Result for extraction of one speaker from a mixture of three speakers recorded using only two microphones (target speaker was close to microphone #2).

shown in Fig. 8. For each subband i we perform the following processing:

y_i(k) = x_{i1}(k) - \sum_{p=0}^{L} b_{ip} x_{i2}(k-p),    (7)

with the update

\Delta b_{ip} = \eta \, y_i(k) \, x_{i2}(k-p),    (8)

where \eta is a small positive learning rate.
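A minimal sketch of this deconvolution unit follows, with the step-size symbol in Eq. (8) read as a learning rate eta; the filter length and step size are illustrative assumptions, not values from the paper.

```python
import numpy as np

def blind_deconv(x1, x2, L=32, eta=1e-3):
    """Two-microphone blind deconvolution unit, Eqs. (7)-(8):
    y(k) = x1(k) - sum_{p=0}^{L} b_p x2(k-p), with the LMS-style
    update Delta b_p = eta * y(k) * x2(k-p)."""
    b = np.zeros(L + 1)
    y = np.zeros(len(x1))
    for k in range(len(x1)):
        # delayed reference samples x2(k), x2(k-1), ..., x2(k-L)
        past = x2[max(0, k - L):k + 1][::-1]
        y[k] = x1[k] - b[:len(past)] @ past       # Eq. (7)
        b[:len(past)] += eta * y[k] * past        # Eq. (8)
    return y, b
```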

6. SIGNAL RECONSTRUCTION FROM INDEPENDENT COMPONENTS

The components carrying maxima around the f0 harmonics obtained from the previous section are taken into the reconstruction. These signals carry speech with enhanced target-speaker information. The last part of our signal processing system performs signal reconstruction. The inverse filter bank is applied to correctly reconstruct the target speaker's voice, avoiding aliasing of the subband-filtered components.

Extensive computer simulation results presented in the next section confirm the validity and performance of the proposed approach. We present results for real room recordings with natural reverberation from the walls and objects in the room.

7. EXPERIMENTS WITH SPEECH SIGNALS RECORDED IN REAL ENVIRONMENT

The real room recordings were made in an empty experimental room, without carpet or any sound-absorbing elements, with many reverberations (easy to notice even during ordinary conversation). We used two or three cardioid condenser boundary microphones,

[Fig. 7 panels: mixture from microphone #1; mixture from microphone #2; mixture from microphone #3; extracted speaker.]

Fig. 7. Result for extraction of one speaker from a mixture of four speakers recorded using three microphones (target speaker was close to microphone #1).

[Fig. 8 diagram: x_{i2} is filtered by the FIR filter b_i and subtracted from x_{i1} to give y_i.]

Fig. 8. Processing unit for blind deconvolution/equalization for m = 2 microphones.

audio-technica PRO44, which can record sounds from a half-cardioid space. This configuration let us record sounds from many directions, similarly to how a human senses with the ears. Boundary microphones make the task more complicated, because they record more reverberation from the surroundings than directional ones. The microphones were connected to a high-quality microphone line amplifier and a professional 20-bit multitrack digital recording system in a PC-class computer. The system allows us to record up to 8 channels simultaneously with 20-bit resolution and a sampling frequency of 44.1 kHz. The following recordings were made using natural voices and sounds from the speakers: (i) two mixed male and female voices speaking different phrases in English; (ii) three male voices speaking different phrases in English; (iii) mixed recordings of male and female voices speaking different phrases in English; (iv) mixed human voices and natural sounds (rain, waterfall) or music. We conducted all experiments with the target speaker positioned closer to the microphones than the other sources. The scheme of our recording conditions is presented in

Fig. 5. Exemplary computer simulation results are shown in Fig. 10 and Fig. 7. For each experiment we have obtained essential enhancement of the target speaker. Due to space limitations, more details will be given during the workshop presentation.

For all performed experiments, considerable speech enhancement has been achieved. More detailed audio experiments will be presented at the conference.

8. CONCLUSIONS AND DISCUSSION

In this paper we have described a multistage subband-based system for extraction and enhancement of a speech signal corrupted by other speakers and other acoustic interference. The proposed approach can be extended to other applications, such as extraction of biomedical signals with a reduced number of sensors. An open problem is how to extract a speaker with lower energy than the other speakers, or a speech signal with specific features, independently of its distance from the microphones.

[Fig. 9 panels: mixture from microphone #1; mixture from microphone #2; mixture from microphone #3; mixture from microphone #4; extracted speaker.]

Fig. 9. Result for extraction of one speaker from a mixture of five speakers recorded using four microphones (target speaker was close to microphone #3).

9. REFERENCES

[1] D. O'Shaughnessy, Speech Communication - Human and Machine, IEEE Press, New York, second edition, 2000.

[2] G. Strang and T. Nguyen, Wavelets and Filter Banks, Wellesley-Cambridge Press, Wellesley MA 02181 USA, 1996.

[3] S. Amari, "ICA of temporally correlated signals - learning algorithm," in Proceedings of ICA'99: International Workshop on Blind Signal Separation and Independent Component Analysis, Aussois, France, Jan. 1999, pp. 13-18.

[Fig. 10 panels: mixture (speech + rain) from microphone #1; mixture (speech + rain) from microphone #2; extracted speaker.]

Fig. 10. Result for extraction of one speaker from a mixture of two speakers talking during heavy rain, recorded using only two microphones (target speaker was close to microphone #2).

[4] S. Amari and A. Cichocki, "Adaptive blind signal processing - neural network approaches," Proceedings of the IEEE, vol. 86, no. 10, pp. 2026-2048, October 1998 (invited paper).

[5] A. K. Barros and A. Cichocki, "RICA - reliable and robust program for independent component analysis," report and Matlab program, Brain Science Institute RIKEN, 2-1 Hirosawa, Wako-shi, Saitama, 351-0198 JAPAN, http://www.riken.nagoya.jp/sensor/allan/RICA or http://go.to/RICA.

[6] A. Belouchrani, K. A. Meraim, and J.-F. Cardoso, "A blind source separation technique using second order statistics," IEEE Transactions on Signal Processing, vol. 45, pp. 434-444, February 1997.

[7] A. Cichocki, R. Thawonmas, and S. Amari, "Sequential blind signal extraction in order specified by stochastic properties," Electronics Letters, vol. 33, no. 1, pp. 64-65, January 1997.

[8] N. Delfosse and P. Loubaton, "Adaptive blind separation of independent sources: a deflation approach," Signal Processing, vol. 45, pp. 59-83, 1995.

[9] A. Hyvarinen and E. Oja, "A fast fixed-point algorithm for independent component analysis," Neural Computation, vol. 9, pp. 1483-1492, 1997.

[10] S. C. Douglas and S.-Y. Kung, "KuicNet algorithm for blind deconvolution," in Proceedings of the 1998 IEEE Workshop on Neural Networks for Signal Processing, New York, 1998, pp. 3-12.

[11] C. Jutten and J. Herault, "Blind separation of sources, Part I: An adaptive algorithm based on neuromimetic architecture," Signal Processing, vol. 24, pp. 1-20, 1991.

[12] L. Molgedey and H. G. Schuster, "Separation of a mixture of independent signals using time-delayed correlations," Physical Review Letters, vol. 72, no. 23, pp. 3634-3637, 1994.

[13] B. A. Pearlmutter and L. C. Parra, "Maximum likelihood blind source separation: A context-sensitive generalization of ICA," in Proceedings of NIPS'96, 1997, vol. 9, pp. 613-619.

[14] L. Tong, V. C. Soon, R. Liu, and Y. Huang, "AMUSE: a new blind identification algorithm," in Proceedings of ISCAS'90, New Orleans, LA, 1990.

[15] J. K. Tugnait, "Blind spatio-temporal equalization and impulse response estimation for MIMO channels using a Godard cost function," IEEE Transactions on Signal Processing, vol. 45, pp. 268-271, January 1997.

[16] S. Choi and A. Cichocki, "Blind separation of nonstationary sources in noisy mixtures," Electronics Letters, vol. 36, pp. 848-849, April 2000.

[17] A. K. Barros and N. Ohnishi, "Removal of quasi-periodic sources from physiological measurements," in Proceedings of ICA'99: International Workshop on Blind Signal Separation and Independent Component Analysis, Aussois, France, Jan. 1999, pp. 185-190.

[18] J. Huang, K.-C. Yen, and Y. Zhao, "Subband-based adaptive decorrelation filtering for co-channel speech separation," IEEE Transactions on Speech and Audio Processing, vol. 8, no. 4, pp. 402-406, July 2000.

[19] A. K. Barros, H. Kawahara, A. Cichocki, S. Kojita, T. Rutkowski, M. Kawamoto, and N. Ohnishi, "Enhancement of a speech signal embedded in noisy environment using two microphones," in Proceedings of the Second International Workshop on ICA and BSS, ICA'2000, Helsinki, Finland, 19-22 June 2000, pp. 423-428.