

Bin Zhu and Evangelia Micheli-Tzanakou
Department of Biomedical Engineering, Rutgers University

Nonstationary Speech Analysis Using Neural Prediction: Extracting Dynamic Features of Individual Speakers from Short Speech Segments for Robust Recognition

As a method of homomorphic signal processing, cepstrum analysis can separate the excitation and the impulse response of the vocal channels when applied to speech signals. For short-time spectrum analysis, overlapping windows are used to divide speech into many frames. For each window, one cepstrum vector is obtained. It is assumed that within each frame (about 30 msec) the speech signal is stationary. However, speech is basically nonstationary over long time intervals. Therefore, one must consider the dynamic changes between frames. Conventional methods often use only the static features of the short-time cepstrum [1]. A neural network can be seen as a nonlinear dynamic system, which may express both the static and dynamic features of the signal at hand. For this purpose, a neural prediction network was designed to extract the inter- and intraframe correlations of cepstrum vectors, so as to obtain the robust features of individual speakers from very short speech segments.

Methods

Cepstrum Analysis

Any recorded signal, such as speech, is a mixture of two components: a pure signal and noise. The relationship between signal and noise is considered to be a linear superposition and can be expressed as:

$x_i = s_i + n_i$  (1)

where $x_i$ is the recorded signal, $s_i$ is the pure signal, $n_i$ is the noise, and $i$ indicates the time interval. Fourier spectrum analysis, as well as signal filtering, can be applied directly to the recorded signals. However, if the relationship between signal and noise is a convolution instead of an addition, as happens in many cases, then this relationship can be expressed as:

$x_i = s_i * n_i$  (2)

where $*$ indicates convolution. Since the system is not linear, Fourier analysis and filtering cannot be applied directly, and deconvolution is needed. In what follows, the method of cepstrum analysis is discussed briefly.

Figure 1 shows the flowchart of the cepstrum analysis. After the discrete Fourier transform (DFT) is performed, Eq. (2) is rewritten as:

$X_i = S_i \cdot N_i$  (3)

where $X_i$ is the DFT of $x_i$, $S_i$ is the DFT of $s_i$, $N_i$ is the DFT of $n_i$, and $\cdot$ indicates multiplication. After taking the log of both sides of Eq. (3), the signals become additive, as in the following equation:

$\hat{X}_i = \hat{S}_i + \hat{N}_i$  (4)

where $\hat{X}_i$ is the log of $X_i$, $\hat{S}_i$ is the log of $S_i$, and $\hat{N}_i$ is the log of $N_i$.

1. Flowchart of cepstrum analysis (speech → Hamming window → DFT → Log → IDFT → cepstrum window → DFT).


2. Stages in the analysis of a speech signal: (a) speech signal (Hamming window) vs. time index, (b) FFT magnitude (Fs = 8 kHz, N = 256) vs. frequency index, (c) cepstrum vs. index, (d) cepstrally smoothed spectrum. Specifics of each graph are given in the figure itself.

In Fig. 1, DFT is the discrete Fourier transform and IDFT is the inverse DFT. Log provides the logarithmic value of the absolute value of its input. The output of the Hamming window, at A, is the filtered (windowed) speech, which is then Fourier transformed; at B we get the DFT results. After the Log and the IDFT are obtained, the resulting signal is passed through the cepstrum window, and at C, the lower part of the cepstrum (16 coefficients) is passed through another DFT, the output of which is obtained at D. This is a smoothed shape of the speech spectrum. The transform length is 256 points, with a sampling rate of 8 kHz. The cepstrum window is rectangular, selecting the lower part of the cepstrum. In this work, only the first 16 coefficients of the cepstrum are used. Figure 2 shows the initial speech signal, its spectrum, and its cepstrum, as well as the cepstrally smoothed spectrum.
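The pipeline of Fig. 1 can be summarized in a short numerical sketch. The following Python fragment is a minimal illustration, not the authors' original code; it assumes NumPy and a single speech frame sampled at 8 kHz, and it uses the frame length, transform length, and 16-coefficient cepstrum window quoted above.

```python
import numpy as np

def cepstrum_features(frame, n_fft=256, n_ceps=16):
    """Minimal sketch of the pipeline in Fig. 1:
    Hamming window -> DFT -> log|.| -> IDFT -> cepstrum window -> DFT."""
    # A: windowed (filtered) speech frame
    windowed = frame * np.hamming(len(frame))
    # B: DFT of the windowed frame
    spectrum = np.fft.fft(windowed, n_fft)
    # log of the absolute value of the spectrum (small floor avoids log(0))
    log_mag = np.log(np.abs(spectrum) + 1e-12)
    # IDFT of the log spectrum gives the (real) cepstrum
    cepstrum = np.fft.ifft(log_mag).real
    # C: rectangular cepstrum window keeps only the first 16 coefficients
    ceps_vector = cepstrum[:n_ceps]
    # D: DFT of the low cepstrum gives a cepstrally smoothed spectrum
    smoothed = np.fft.fft(ceps_vector, n_fft).real
    return ceps_vector, smoothed

# usage sketch: one 30-ms frame (240 samples at 8 kHz), padded to 256 points
frame = np.random.randn(240)          # stand-in for one speech frame
ceps, smooth_spec = cepstrum_features(frame)
```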

Neural Prediction Network

Interframe Neural Predictor

First, the interframe neural predictor is used in order to determine the interframe correlation of the speech signal. The network architecture is depicted in Fig. 3.

$C_i$ is the cepstrum vector at frame $i$. The previous $M$ vectors and the future $N$ vectors are used to predict the current vector $C_i$. This neural predictor is called the M + N predictor. There are no connections within each vector; i.e., each coefficient of the cepstrum vector is processed separately. The backpropagation (BP) training algorithm is used, which is briefly described in the appendix. Figure 4 is a representative learning curve for the network of Fig. 3.
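As an illustration of this M + N setup, the sketch below builds, for each cepstrum coefficient separately, the input context of M previous and N future frames and fits an independent linear predictor per coefficient. It is a simplified stand-in for the backpropagation-trained network of Fig. 3, not the authors' implementation; the array shapes and the least-squares fit are assumptions made for brevity.

```python
import numpy as np

def build_interframe_data(ceps, M=1, N=0):
    """For each frame i, stack the M previous and N future cepstrum vectors
    as predictor inputs and use frame i itself as the target.
    ceps: array of shape (num_frames, 16)."""
    num_frames, dim = ceps.shape
    X, Y = [], []
    for i in range(M, num_frames - N):
        past = ceps[i - M:i]                              # M previous frames
        future = ceps[i + 1:i + 1 + N]                    # N future frames
        X.append(np.concatenate([past, future], axis=0))  # (M+N, dim)
        Y.append(ceps[i])
    return np.array(X), np.array(Y)

def fit_per_coefficient(X, Y):
    """Each of the 16 coefficients is predicted separately (no connections
    across coefficients), here with a simple least-squares linear predictor."""
    num_samples, context, dim = X.shape
    weights = np.zeros((dim, context))
    for c in range(dim):
        A = X[:, :, c]                                    # context values of coefficient c
        weights[c], *_ = np.linalg.lstsq(A, Y[:, c], rcond=None)
    return weights

# usage sketch: a 1+0 predictor (one previous frame, no future frames)
ceps = np.random.randn(100, 16)                           # stand-in cepstrum sequence
X, Y = build_interframe_data(ceps, M=1, N=0)
W = fit_per_coefficient(X, Y)
```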

Inter- and Intraframe Neural Predictor

If both the intraframe and the interframe correlations are considered, the following architecture is applied, as shown in Fig. 5.

In this network, the input consists of the 16 cepstrum coefficients (one vector $C_{i-1}$) of frame $i-1$ (i.e., at time $i-1$). The output is the predicted cepstrum vector $C_i$ of frame $i$ (i.e., at time $i$). The error between the predicted output and the desired output, $C_i$, is used to modify the weights. Figure 6 is a learning curve for the network of Fig. 5.

3. Architecture for interframe correlation.

4. Learning curve for the interframe predictor.
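The inter- and intraframe predictor of Fig. 5 maps the full 16-coefficient vector of frame $i-1$ to a prediction of frame $i$ through a hidden layer, so every output coefficient can draw on every input coefficient. The sketch below is a minimal NumPy version of such a 16-hidden-16 network trained on the squared prediction error; the hidden-layer size, learning rate, and tanh activation are assumptions, since the article does not give them.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, hidden, lr = 16, 20, 0.01            # hidden size and learning rate are assumed

# one-hidden-layer predictor: C_{i-1} (16 coeffs) -> hidden -> predicted C_i (16 coeffs)
W1 = rng.normal(scale=0.1, size=(hidden, dim))
W2 = rng.normal(scale=0.1, size=(dim, hidden))

ceps = rng.normal(size=(200, dim))        # stand-in cepstrum vector sequence

for epoch in range(5):
    total_error = 0.0
    for i in range(1, len(ceps)):
        c_prev, c_target = ceps[i - 1], ceps[i]
        h = np.tanh(W1 @ c_prev)          # hidden activations (forward pass)
        c_pred = W2 @ h                   # predicted cepstrum vector (linear output)
        err = c_pred - c_target           # prediction error drives the weight update
        delta_out = err                   # output-layer delta (linear units)
        delta_hid = (W2.T @ delta_out) * (1.0 - h ** 2)   # hidden-layer delta
        W2 -= lr * np.outer(delta_out, h)
        W1 -= lr * np.outer(delta_hid, c_prev)
        total_error += 0.5 * float(err @ err)
    # total_error now holds the summed squared prediction error for this pass
```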

Results

Text-Dependent Speaker Identification Using the Interframe M + N Predictor

The average speech length is about 0.3 s, and the speakers had different accents. Each speaker had 10 articulations (three times each). One repetition was used for training, and two were used for testing. The maximum correlation seems to occur within two frames, which indicates that the maximum correlation of the speech signal lies within about 60-70 msec. This conclusion is in agreement with the idea of considering speech as a first-order Markov chain.

Experimental Results Using the Inter- and Intraframe Predictor

For text-dependent speaker identification, the recognition rates found are as follows (see Table 1):
• 1+0 interframe predictor: 90.8%
• inter- and intraframe predictor: 88.3%

This result indicates that there is limited intraframe correlation of the cepstrum, since the speaker recognition rate does not improve. However, for speaker-dependent speech recognition:
• 1+0 interframe predictor: 92.4%
• inter- and intraframe predictor: 80.0%

This result indicates that, although the interframe correlation is more suitable for speech recognition, the architecture with inter- and intraframe correlation is more suitable for speaker recognition.

The above two experiments demonstrate that the inter- and intraframe predictor does not extract more information about speaker identity; rather, it suppresses the semantic information of speech while enhancing the dynamic features of speaker information. For long speech processing, it may have applications for text-independent speaker recognition. Also, it can be seen that the interframe predictor preserves good semantic information as well as speaker individuality.

Discussion and Conclusions

For speaker identification, the neural prediction network extracts both static and dynamic features from the cepstrum vectors calculated from speech. We proposed and demonstrated two kinds of neural predictors: one for the interframe correlation and the other for the inter- and intraframe correlation. Experimental results show that the interframe predictor preserved both speaker individuality and semantic information, while the inter- and intraframe predictor extracted speaker information but somewhat suppressed semantic information. For long speech prediction, the inter- and intraframe predictors may have applications in text-independent speaker recognition.

One of the problems during training of the neural predictor is local convergence, because of the backpropagation algorithm used in this application. This effect may lower the efficiency of the prediction. However, we may use the ALOPEX algorithm to reach the optimum prediction [3, 4]. Future work will concentrate on improving the performance by including information about pitch, which is related to speaker individuality. It has been found, however, that pitch detection is not robust.

5. Neural network architecture for inter- and intraframe correlation (input layer, hidden layer, output layer).

6. Learning curve for the neural network given in Figure 5.

Bin Zhu was born in 1968. He received the B.S. and M.S. degrees in electrical engineering from the University of Science and Technology of China in 1986 and 1994, respectively. For his research work, he was awarded the fifth "YiLiDa" Experimental Science Prize and the First "HuangHua" Graduate Award. In 1998, he received an M.S. degree in biomedical engineering from Rutgers, the State University of New Jersey. He is currently a project engineer and team leader at General Devices, a leading company in the emergency medical systems industry. His areas of interest are artificial neural networks, parallel DFT algorithms, biomedical signal processing and instrumentation, hearing aids, and telemedicine.

Evangelia Micheli-Tzanakou received a Ph.D. degree in 1977 from the Physics Department at Syracuse University. From 1977 to 1980, she was a postdoctoral fellow in biophysics at Syracuse University. In 1981, Dr. Tzanakou joined the Department of Electrical Engineering at Rutgers University. In 1990, she became a professor of biomedical engineering and chairperson of the Biomedical Engineering Department at Rutgers. She also served as co-director of the graduate program in biomedical engineering from 1990 to 1995 and is an adjunct professor at the University of Medicine and Dentistry of New Jersey. Dr. Tzanakou is a Founding Fellow of AIMBE, a Fellow of IEEE, and a Fellow of the New Jersey Academy of Medicine. Dr. Tzanakou co-authored a book with S. Deutsch titled Neuroelectric Systems, published by New York University Press; has edited a book to be published by CRC Press as a compendium of her work; and has published over 200 scientific papers in both journals and proceedings. Dr. Tzanakou has served as an associate editor for the IEEE Transactions on Neural Networks, and she is on the Editorial Board of the IEEE Transactions on Information Technology in Biomedicine and the Editorial Board of the International Journal on Advanced Computational Intelligence. She has served as secretary of the IEEE Council on Neural Networks. She is now a member of the Administrative Committee (AdCom) of the IEEE Council on Neural Networks (where she was also appointed chair of Technical Activities), the EMBS, and the IES (where she serves as chair of the Educational Activities). She is also vice chair of the IEEE Awards Board Planning and Policy Committee. Her awards include an Outstanding Advisor Award in 1985 from IEEE and the 1992 Achievement Award of the Society of Women Engineers, and in 1995 she was awarded the NJ Women of Achievement Award for the application of neural networks to engineering in medicine and biology. She has been featured in "Notable Scientists of the 20th Century." Her research interests include neural networks, information processing in the brain, image and signal processing applied to biomedical data, and telemedicine.

References
1. Rabiner LR and Schafer RW: Digital Processing of Speech Signals. Englewood Cliffs: Prentice-Hall, 1978.
2. Iso K and Watanabe T: Speaker-independent word recognition using a neural prediction model. Proc IEEE Int Conf on Acoustics, Speech, and Signal Processing, pp. 441-444, 1990.
3. Melissaratos L and Micheli-Tzanakou E: A parallel implementation of the ALOPEX process.
4. Phan F: Speaker identification through wavelet multiresolution decomposition and ALOPEX. Master's Thesis, Dept. of Biomedical Engineering, Rutgers University, May 1994.
5. Zhu B, et al.: Speaker classification based on combined neural network and fuzzy analysis. Proc Int Conf on DSP, Oct. 1994.

Appendix

Backpropagation Algorithm

With the kth input template, the output of neuron $i$ is $O_{ik}$, and the net input to neuron $j$ is:

$net_{jk} = \sum_i w_{ij} O_{ik}$  (5)

where $i$ is the index for input- or hidden-layer neurons and $w_{ij}$ is the weight between the two layers where neurons $i$ and $j$ are located. Thus:

$O_{jk} = f(net_{jk})$  (6)

Let $\hat{y}_{jk}$ be the real output of the ANN (i.e., $O_{jk} = \hat{y}_{jk}$ for output-layer neurons) and $y_{jk}$ the desired output; the error for template $k$ is:

$E_k = \frac{1}{2}\sum_j (y_{jk} - \hat{y}_{jk})^2$  (7)

and the total error is:

$E = \frac{1}{2}\sum_{k=1}^{N} E_k$  (8)

We then define:

$\delta_{jk} = -\frac{\partial E_k}{\partial net_{jk}}$  (9)

and

$\frac{\partial E_k}{\partial w_{ij}} = -\delta_{jk} O_{ik}$  (10)

If neuron $j$ is in the output layer, then:

$\delta_{jk} = (y_{jk} - \hat{y}_{jk}) f'(net_{jk})$  (11)

If neuron $j$ is in a hidden layer, then:

$\delta_{jk} = f'(net_{jk}) \sum_m \delta_{mk} w_{jm}$  (12)

where neuron $m$ is located in the layer just after that of neuron $j$ (i.e., the layer fed by neuron $j$).

Then, the algorithm is described as follows:
1) Initialize the weights of the ANN.
2) Repeat the following steps until convergence:
(i) From the input layer to the output layer (forward), compute the $net_{jk}$ and $O_{jk}$.
(ii) Calculate $\delta_{jk}$ and $\partial E / \partial w_{ij}$ (backward), and modify the weights:

$w_{ij} = w_{ij} - \gamma \frac{\partial E}{\partial w_{ij}}$  (13)

where $\gamma$ is a constant that controls the speed of convergence.
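To make the appendix concrete, here is a compact NumPy transcription of Eqs. (5)-(13) for a one-hidden-layer network with logistic activations. It is a sketch rather than the code used in the paper: the layer sizes, learning rate, and logistic choice of $f$ are assumptions, and the weights are updated per input template instead of after accumulating the batch error of Eq. (8).

```python
import numpy as np

def f(x):
    """Logistic activation, Eq. (6)."""
    return 1.0 / (1.0 + np.exp(-x))

def f_prime(x):
    s = f(x)
    return s * (1.0 - s)

def backprop_epoch(W1, W2, inputs, targets, gamma=0.1):
    """One pass of Eqs. (5)-(13) over all templates k for a one-hidden-layer net.
    W1: hidden-by-input weights, W2: output-by-hidden weights."""
    for o_in, y in zip(inputs, targets):
        # forward pass, Eqs. (5)-(6): net_jk = sum_i w_ij O_ik, O_jk = f(net_jk)
        net_h = W1 @ o_in
        o_h = f(net_h)
        net_out = W2 @ o_h
        y_hat = f(net_out)
        # output-layer deltas, Eq. (11): delta = (y - y_hat) f'(net)
        delta_out = (y - y_hat) * f_prime(net_out)
        # hidden-layer deltas, Eq. (12): delta_j = f'(net_j) sum_m delta_m w_jm
        delta_h = f_prime(net_h) * (W2.T @ delta_out)
        # weight update, Eqs. (10) and (13): w_ij <- w_ij + gamma delta_jk O_ik
        W2 += gamma * np.outer(delta_out, o_h)
        W1 += gamma * np.outer(delta_h, o_in)
    return W1, W2

# usage sketch with made-up layer sizes and data
rng = np.random.default_rng(1)
W1 = rng.normal(scale=0.5, size=(8, 4))   # hidden x input
W2 = rng.normal(scale=0.5, size=(2, 8))   # output x hidden
X = rng.normal(size=(50, 4))              # 50 input templates
Y = rng.uniform(size=(50, 2))             # desired outputs in (0, 1)
for _ in range(100):
    W1, W2 = backprop_epoch(W1, W2, X, Y)
```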
