Introduction to Signal Processing and Some Applications in Audio Analysis

Md. Khademul Islam Molla

JSPS Research Fellow
Hirose-Minematsu Laboratory

Email: [email protected]

Outline of the presentation

• Basics of discrete-time signals
• Frequency-domain signal analysis
• Basic transformations
    • Fourier transform (FT), short-time FT (STFT)
    • Wavelet transform (WT)
    • Empirical mode decomposition (EMD) and Hilbert spectrum (HS)
• Notable comparisons among FT, WT, and HS
• Some applications in audio processing
• Some open problems to work with

Discrete-time signal

• A digital computer cannot process continuous signals directly.
• The signal must be converted to discrete time with a suitable sampling frequency and quantization.
• The sampling theorem: $F_s \ge 2 f_c$, where $f_c$ is the highest expected signal frequency and $F_s$ is the required sampling frequency.
• Each sample must also be quantized.


[Figure: Signal sampling and signal quantization]

[Figure: Effects of under-sampling]

[Figure: Effects of the required sampling frequency]


• Telephone speech is usually sampled at 8 kHz to capture data up to 4 kHz.
• 16 kHz is generally regarded as sufficient for speech recognition and synthesis.
• The audio standard is a sample rate of 44.1 kHz (CD) or 48 kHz (Digital Audio Tape) to represent frequencies up to 20 kHz.
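The effect of under-sampling can be checked numerically. The sketch below is only an illustration: the helper names `apparent_frequency` and `sample_tone` and the 3 kHz and 5 kHz test tones are assumptions, not part of the original slides. It samples both tones at the 8 kHz telephone rate and shows that the 5 kHz tone, which violates the sampling condition, produces exactly the same samples as the 3 kHz tone.

```python
import math

def apparent_frequency(f_signal, f_sample):
    """Frequency (Hz) at which a real tone of f_signal Hz appears after sampling at f_sample Hz."""
    folded = f_signal % f_sample              # fold into [0, Fs)
    return min(folded, f_sample - folded)     # mirror into [0, Fs/2]

def sample_tone(f_signal, f_sample, n_samples):
    """Samples of cos(2*pi*f*t) taken at t = n / Fs."""
    return [math.cos(2 * math.pi * f_signal * n / f_sample) for n in range(n_samples)]

fs = 8000                                     # telephone-speech rate from the slide
ok_tone = sample_tone(3000, fs, 64)           # 3 kHz < Fs/2: captured correctly
aliased_tone = sample_tone(5000, fs, 64)      # 5 kHz > Fs/2: under-sampled

# After sampling, the 5 kHz tone is indistinguishable from the 3 kHz tone.
same_samples = all(abs(a - b) < 1e-9 for a, b in zip(ok_tone, aliased_tone))
print(apparent_frequency(3000, fs), apparent_frequency(5000, fs), same_samples)
```

This is why an anti-aliasing low-pass filter is normally applied before sampling: once the samples are taken, the two tones can no longer be told apart.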

[Figure: three cosine curves plotted for x from -10 to 10, illustrating amplitude, phase, and frequency: f(x) = 5 cos(x), f(x) = 5 cos(x + 3.14), and f(x) = 5 cos(3x + 3.14)]

Time-domain signals

• The independent variable is time
• The dependent variable is the amplitude
• Most of the information is hidden in the frequency content

[Figure: four time-domain plots (magnitude versus time, 0 to 1): a 2 Hz sinusoid, a 10 Hz sinusoid, a 20 Hz sinusoid, and their sum 2 Hz + 10 Hz + 20 Hz]

Signal Transformation

• Why: to obtain further information from the signal that is not readily available in the raw signal.
• Raw signal: normally the time-domain signal.
• Processed signal: a signal that has been "transformed" by any of the available mathematical transformations.
• Fourier transformation: the most popular transformation between the time and frequency domains.

Frequency domain analysis

• Why frequency information is needed: to see information that is not obvious in the time domain.
• Types of frequency transformation: Fourier transform, Hilbert transform, short-time Fourier transform, Radon transform, wavelet transform, …

Frequency domain analysis

[Figure: analysis maps s(t) (time, t) to S(f) = F[s(t)] (frequency, f); synthesis maps S(f) back to s(t). s(t) and S(f) form a transform pair.]

General transform as a problem-solving tool

• Powerful and complementary to time-domain analysis methods
• The frequency-domain representation shows the signal energy and phase with respect to frequency
• A fast and efficient way to view the signal's information

[Figure: Basic block diagram of signal transformation]


Complex numbers

• Examples: 4.2 + 3.7i, 9.4447 – 6.7i, -5.2 (= -5.2 + 0i)
• General form: Z = a + bi, Re(Z) = a, Im(Z) = b
• Amplitude: A = |Z| = √(a² + b²)
• Phase: θ = ∠Z = tan⁻¹(b/a)

[Figure: Polar coordinates of Z = a + bi, showing the real part a, the imaginary part b, the amplitude A = √(a² + b²), and the phase θ = tan⁻¹(b/a)]


• Frequency spectrum
    • Basically the frequency components (spectral components) of the signal
    • Shows which frequencies exist in the signal
• Fourier transform (FT)
    • One way to find the frequency content
    • Tells how much of each frequency exists in a signal

[Figure: Spectrum of a speech signal]

Fourier Transform

• The Fourier transform decomposes a function into a spectrum of its frequency components.
• The inverse transform synthesizes a function from its spectrum of frequency components.
• The discrete Fourier transform pair is defined (with the common convention of placing the 1/N factor in the inverse transform) as:

$X_k = \sum_{n=0}^{N-1} x_n \, e^{-j 2\pi k n / N}, \qquad x_n = \frac{1}{N} \sum_{k=0}^{N-1} X_k \, e^{j 2\pi k n / N}$

where $X_k$ represents the $k$th frequency component and $x_n$ represents the $n$th sample in the time domain.
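A direct evaluation of the discrete Fourier transform pair can make the roles of $X_k$ and $x_n$ concrete. The sketch below assumes the common convention with the 1/N factor in the inverse transform; the function names and the 64-sample test signal (2 Hz + 10 Hz + 20 Hz) are illustrative choices. It costs O(N²) operations, so in practice a fast Fourier transform routine would be used instead.

```python
import cmath
import math

def dft(x):
    """Analysis: X_k = sum_n x_n * exp(-j*2*pi*k*n/N)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    """Synthesis: x_n = (1/N) * sum_k X_k * exp(+j*2*pi*k*n/N)."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

fs = 64                                        # 64 samples over one second: bin k is k Hz
x = [math.sin(2 * math.pi * 2 * n / fs)
     + math.sin(2 * math.pi * 10 * n / fs)
     + math.sin(2 * math.pi * 20 * n / fs) for n in range(fs)]

X = dft(x)
magnitude = [abs(X[k]) for k in range(fs // 2)]          # amplitude only, up to Fs/2
peaks = sorted(sorted(range(fs // 2), key=lambda k: magnitude[k])[-3:])
round_trip_error = max(abs(a - b) for a, b in zip(x, idft(X)))
print(peaks)                                   # bins holding the three largest magnitudes
```

The three strongest bins fall at the three tone frequencies, and applying `idft` to `X` recovers the samples up to rounding error.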

[Figure: Fourier transform of a 1D signal: time-domain waveforms and their Fourier spectra (amplitude only), with the frequency axis marked at 5, 10, and 15 Hz]

Fourier Transform

• Fourier analysis uses sinusoids as the basis functions in the decomposition
• Fourier transforms give the frequency information, smearing time
• Samples of a function give the temporal information, smearing frequency

[Figure: square signal sw(t) versus t (0 to 10) and its partial Fourier-series sums $sw_N(t) = \sum_{k=1}^{N} [-b_k \sin(kt)]$ for N = 1, 3, 5, 7, 9, 11]

FS synthesis: square wave reconstruction from spectral terms

• Convergence may be slow (~1/k); ideally an infinite number of terms is needed.
• In practice, the series is truncated when the remainder is below the computer tolerance (which leaves an error).
• BUT … Gibbs' phenomenon.
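The slow convergence and the Gibbs overshoot can be reproduced with a few lines of code. This is a sketch under stated assumptions: it uses the textbook series of a unit-amplitude odd square wave, $sw(t) = \frac{4}{\pi} \sum_{k\ \mathrm{odd}} \frac{\sin(kt)}{k}$, which may differ in sign or scaling from the coefficients $b_k$ used on the slides, and the function name is hypothetical.

```python
import math

def square_partial_sum(t, n_max):
    """Partial Fourier series of a unit square wave, using odd harmonics up to n_max."""
    return (4 / math.pi) * sum(math.sin(k * t) / k for k in range(1, n_max + 1, 2))

# Dense grid on (0, pi), where the ideal square wave equals +1.
grid = [i * math.pi / 2000 for i in range(1, 2000)]

# Largest value reached by each truncated series on that half period.
peaks = {n_max: max(square_partial_sum(t, n_max) for t in grid)
         for n_max in (1, 3, 11, 101)}

for n_max, peak in peaks.items():
    print(n_max, round(peak, 3))

# Far from the jump the series does converge towards 1.
print(round(square_partial_sum(math.pi / 2, 101), 3))
```

Adding terms improves the fit in the flat region, but the peak next to the discontinuity does not shrink to 1; it settles at roughly 9% of the jump height above the target, which is the Gibbs phenomenon.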

Stationarity of the signal

• Stationary signal
    • Signals with frequency content unchanged over the entire time
    • All frequency components exist at all times
• Non-stationary signal
    • Frequency changes in time
    • One example: the "chirp signal"

Stationarity of the signal

[Figure: Stationary signal, 2 Hz + 10 Hz + 20 Hz: time-domain plot (magnitude versus time) and spectrum (magnitude versus frequency in Hz). The components occur at all times.]

[Figure: Non-stationary signal, 0.0-0.4: 2 Hz, 0.4-0.7: 10 Hz, 0.7-1.0: 20 Hz: time-domain plot (magnitude versus time) and spectrum (magnitude versus frequency in Hz). The components do not appear at all times.]

Chirp signal

[Figure: two chirp signals, one with frequency 2 Hz to 20 Hz and one with frequency 20 Hz to 2 Hz, each shown in the time domain (magnitude versus time) and in the frequency domain (magnitude versus frequency in Hz). They are different in the time domain but the same in the frequency domain.]

At what time do the frequency components occur? The FT cannot tell!
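The chirp example can be verified numerically. The following sketch (the chirp parameters, sampling rate, and function name are assumptions chosen for illustration) builds a linear chirp rising from 2 Hz to 20 Hz, reverses it in time to obtain the falling chirp, and compares the magnitude spectra of the two.

```python
import cmath
import math

def dft_magnitude(x):
    """Magnitude of the DFT of x, evaluated directly from the definition."""
    N = len(x)
    return [abs(sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N)))
            for k in range(N)]

fs = 256
t = [n / fs for n in range(fs)]                     # one second of samples

# Linear chirp: instantaneous frequency 2 + 18*t Hz, so phase = 2*pi*(2*t + 9*t^2).
up = [math.cos(2 * math.pi * (2 * ti + 9 * ti * ti)) for ti in t]
down = up[::-1]                                     # time-reversed: 20 Hz down to 2 Hz

mag_up = dft_magnitude(up)
mag_down = dft_magnitude(down)
max_diff = max(abs(a - b) for a, b in zip(mag_up, mag_down))

print(up != down, max_diff < 1e-6)                  # different signals, same magnitudes
```

The two waveforms differ sample by sample, yet their magnitude spectra agree to rounding error: the time ordering is carried only by the phase, which an amplitude-only spectrum discards.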

Limitations of Fourier Transform

• The FT only gives which frequency components exist in the signal
• The time and frequency information cannot be seen at the same time
• A time-frequency representation of the signal is needed
• Most signals are non-stationary

ONE SOLUTION: SHORT-TIME FOURIER TRANSFORM (STFT)

Short-Time Fourier Transform

• Dennis Gabor (1946) used the STFT to analyze only a small section of the signal at a time, a technique called windowing the signal.
• The segment of the signal is assumed to be stationary.
• A 3D transform:

$\mathrm{STFT}_X^{(\omega)}(t', f) = \int_t \left[ x(t)\, \omega^*(t - t') \right] e^{-j 2\pi f t}\, dt$

where $\omega(t)$ is the window function. The result is a function of time and frequency.
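A minimal discrete version of the windowed transform is sketched below. The Hann window, the frame length of 64 samples, the hop of 32 samples, and the two-tone test signal are all assumptions made for the example; the slides do not specify them.

```python
import cmath
import math

def stft_magnitude(x, win_len, hop):
    """Magnitude STFT: one row of |DFT| values (bins 0..win_len/2) per windowed frame."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / win_len) for n in range(win_len)]  # Hann
    frames = []
    for start in range(0, len(x) - win_len + 1, hop):
        seg = [x[start + n] * window[n] for n in range(win_len)]
        frames.append([abs(sum(seg[n] * cmath.exp(-2j * cmath.pi * k * n / win_len)
                               for n in range(win_len)))
                       for k in range(win_len // 2 + 1)])
    return frames

fs = 256
# Non-stationary test signal: 16 Hz in the first half second, 48 Hz in the second half.
x = [math.sin(2 * math.pi * (16 if n < fs // 2 else 48) * n / fs) for n in range(fs)]

win_len, hop = 64, 32
frames = stft_magnitude(x, win_len, hop)

# Strongest frequency (Hz) in each frame: bin index times the bin spacing fs / win_len.
dominant_hz = [max(range(len(f)), key=lambda k: f[k]) * fs / win_len for f in frames]
print(dominant_hz)
```

Unlike a single Fourier transform of the whole signal, the per-frame result shows when each tone is present. The bin spacing here is fs / win_len = 4 Hz; a longer window would refine it at the cost of time resolution.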

[Figure: Short-time Fourier transform: an FT is computed for each windowed segment of the signal]

[Figure: Speech signal and its STFT, via a narrow window and via a wide window]

Drawbacks of STFT

• Unchanged window
• Dilemma of resolution
    • Narrow window -> poor frequency resolution
    • Wide window -> poor time resolution
• Heisenberg uncertainty principle
    • One cannot know exactly which frequency exists at which time instant, only which frequency bands exist in which time intervals

Wavelet Transform

To overcome some limitations of the Fourier transform

[Figure: Discrete wavelet decomposition tree: S splits into A1 and D1, A1 into A2 and D2, A2 into A3 and D3]

Wavelet Overview

• Wavelet: a small wave
• Wavelet transforms
    • Provide a way of analyzing waveforms bounded in both frequency and duration
    • Allow signals to be stored more efficiently than by the Fourier transform
    • Are able to better approximate real-world signals
    • Are well suited for approximating data with sharp discontinuities
• "The forest and the trees"
    • Notice gross features with a large "window"
    • Notice small features with a small "window"
• Wavelet transform
    • An alternative approach to the short-time Fourier transform to overcome the resolution problem
    • Similar to STFT: the signal is multiplied with a function
• Multi-resolution analysis
    • Analyze the signal at different frequencies with different resolutions
    • Good time resolution and poor frequency resolution at high frequencies
    • Good frequency resolution and poor time resolution at low frequencies
    • More suitable for signals with short-duration high-frequency components and long-duration low-frequency components

[Figure: Multi-resolution analysis]

Advantages of WT over STFT

• The width of the window is changed as the transform is computed for every spectral component
• The time and frequency resolutions are therefore varied across the time-frequency plane

Principles of WT

• Split up the signal into a bunch of signals
• These represent the same signal, but each corresponds to a different frequency band
• Only provides which frequency bands exist at which time intervals
• Wavelet
    • Small wave
    • Means the window function is of finite length
• Mother wavelet
    • A prototype for generating the other window functions
    • All the windows used are its dilated or compressed and shifted versions


$\mathrm{CWT}_x^{\psi}(\tau, s) = \Psi_x^{\psi}(\tau, s) = \frac{1}{\sqrt{|s|}} \int x(t)\, \psi^*\!\left(\frac{t - \tau}{s}\right) dt$

• $\tau$: translation (the location of the window)
• $s$: scale
• $\psi$: mother wavelet


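Wavelet basis functions:

• Morlet ($\omega_0$ = frequency): $\psi_0(\eta) = \pi^{-1/4}\, e^{j \omega_0 \eta}\, e^{-\eta^2/2}$
• Paul ($m$ = order): $\psi_0(\eta) = \frac{2^m j^m m!}{\sqrt{\pi (2m)!}}\, (1 - j\eta)^{-(m+1)}$
• DOG ($m$ = derivative): $\psi_0(\eta) = \frac{(-1)^{m+1}}{\sqrt{\Gamma\left(m + \frac{1}{2}\right)}}\, \frac{d^m}{d\eta^m}\left(e^{-\eta^2/2}\right)$

DOG: derivative of a Gaussian; m = 2 is the Marr or Mexican hat wavelet.

[Figure: Wavelet bases in the time domain and in the frequency domain]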

Scale of wavelet

• Scale
    • s > 1: dilates the signal
    • s < 1: compresses the signal
• Low frequency -> high scale -> non-detailed global view of the signal -> spans the entire signal
• High frequency -> low scale -> detailed view lasting a short time
• Only a limited interval of scales is necessary

Computation of WT

$\mathrm{CWT}_x^{\psi}(\tau, s) = \frac{1}{\sqrt{|s|}} \int x(t)\, \psi^*\!\left(\frac{t - \tau}{s}\right) dt$

Step 1: The wavelet is placed at the beginning of the signal, and s is set to 1 (the most compressed wavelet).
Step 2: The wavelet function at scale 1 is multiplied by the signal and integrated over all times; the result is then multiplied by $1/\sqrt{s}$.
Step 3: The wavelet is shifted to $t = \tau$, and the transform value at $t = \tau$ and s = 1 is obtained.
Step 4: The procedure is repeated until the wavelet reaches the end of the signal.
Step 5: The scale s is increased by a sufficiently small value, and the above procedure is repeated for all s.
Step 6: Each computation for a given s fills a single row of the time-scale plane.
Step 7: The CWT is obtained once all s have been calculated.
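The steps above can be sketched as a direct (and slow) numerical integration. Everything specific in this example is an assumption made for illustration: the unnormalized Mexican hat as mother wavelet, the grid of five scales, the restriction to shifts in the middle of the signal, and the helper names.

```python
import math

def mexican_hat(t):
    """Mexican hat mother wavelet (second derivative of a Gaussian), unnormalized."""
    return (1.0 - t * t) * math.exp(-t * t / 2.0)

def cwt_row(x, dt, s, tau_indices):
    """CWT coefficients at one scale s for the given shifts (Steps 2 to 4)."""
    row = []
    for i_tau in tau_indices:
        acc = sum(x[i] * mexican_hat((i - i_tau) * dt / s) for i in range(len(x)))
        row.append(acc * dt / math.sqrt(s))       # integrate, then multiply by 1/sqrt(s)
    return row

def best_scale(x, dt, scales, tau_indices):
    """Scale whose row of the time-scale plane holds the largest coefficient magnitude."""
    strength = {s: max(abs(c) for c in cwt_row(x, dt, s, tau_indices)) for s in scales}
    return max(scales, key=lambda s: strength[s])

fs = 256
dt = 1.0 / fs
scales = [0.01, 0.02, 0.04, 0.08, 0.16]           # logarithmic grid (assumed values)
taus = range(fs // 4, 3 * fs // 4)                # central shifts, away from the edges

low = [math.sin(2 * math.pi * 4 * n * dt) for n in range(fs)]     # 4 Hz tone
high = [math.sin(2 * math.pi * 16 * n * dt) for n in range(fs)]   # 16 Hz tone

low_scale = best_scale(low, dt, scales, taus)
high_scale = best_scale(high, dt, scales, taus)
print(low_scale, high_scale)
```

The lower-frequency tone responds most strongly at a larger scale than the higher-frequency tone, which is the inverse relation between scale and frequency described in the slides.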

Time & Frequency Resolution

[Figure: Comparison of transformations: tilings of the time-frequency plane (time on the horizontal axis, frequency on the vertical axis). One region offers better time resolution and poor frequency resolution; the other offers better frequency resolution and poor time resolution.]

• Each box represents an equal portion
• The resolution in the STFT is selected once for the entire analysis

Discretization of WT

• It is necessary to sample the time-frequency (scale) plane.
• At high scale s (lower frequency f), the sampling rate N can be decreased.
• The scale parameter s is normally discretized on a logarithmic grid.
• The most common value is 2.

$N_2 = \frac{s_1}{s_2} N_1 = \frac{f_2}{f_1} N_1$

S    2    4    8    …
N    32   16   8    …

[Figure: Decomposition tree: S splits into A1 and D1, A1 into A2 and D2, A2 into A3 and D3]

Effective and Fast DWT

• The discretized WT is not a true discrete transform
• Discrete wavelet transform (DWT)
    • Provides sufficient information both for analysis and synthesis
    • Reduces the computation time sufficiently
    • Easier to implement
    • Analyzes the signal at different frequency bands with different resolutions
    • Decomposes the signal into a coarse approximation and detail information

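Decomposition with DWT

• Halves the time resolution: only half the number of samples results
• Doubles the frequency resolution: the spanned frequency band is halved

[Figure: Three-level filter bank applied to X[n] (512 samples, 0-1000 Hz), with the decomposition tree S -> A1, D1 -> A2, D2 -> A3, D3]

Stage      Input (samples)   Detail output (samples)    Approximation output (samples)
Filter 1   X[n] (512)        D1: 500-1000 Hz (256)      A1 (256)
Filter 2   A1 (256)          D2: 250-500 Hz (128)       A2 (128)
Filter 3   A2 (128)          D3: 125-250 Hz (64)        A3: 0-125 Hz (64)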

Decomposition of non-stationary signal

[Figure: Wavelet: db4; Level: 6; Signal: 0.0-0.4: 20 Hz, 0.4-0.7: 10 Hz, 0.7-1.0: 2 Hz; components ordered from fH (high frequency) to fL (low frequency)]

Decomposition of non-stationary signal

[Figure: Wavelet: db4; Level: 6; Signal: 0.0-0.4: 2 Hz, 0.4-0.7: 10 Hz, 0.7-1.0: 20 Hz; components ordered from fH (high frequency) to fL (low frequency)]

Reconstruction from WT

• What: how can those components be assembled back into the original signal without loss of information?
• A process after decomposition or analysis; also called synthesis
• How
    • Reconstruct the signal from the wavelet coefficients
    • Where wavelet analysis involves filtering and downsampling, the wavelet reconstruction process consists of upsampling and filtering
    • Upsampling lengthens a signal component by inserting zeros between samples
    • MATLAB commands: idwt and waverec
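The analysis and synthesis stages can be illustrated with the simplest wavelet. The sketch below deliberately swaps in the Haar wavelet for the db4 wavelet used in the slides, because its filters have only two taps and the downsampling and upsampling become explicit; the function names are hypothetical and are not the MATLAB commands mentioned above.

```python
import math

ROOT2 = math.sqrt(2.0)

def haar_analysis(x):
    """One DWT level: low-pass/high-pass filtering followed by downsampling by 2."""
    approx = [(x[2 * i] + x[2 * i + 1]) / ROOT2 for i in range(len(x) // 2)]
    detail = [(x[2 * i] - x[2 * i + 1]) / ROOT2 for i in range(len(x) // 2)]
    return approx, detail

def haar_synthesis(approx, detail):
    """Inverse of one level: upsampling by 2 and filtering, merged into one step."""
    x = []
    for a, d in zip(approx, detail):
        x.extend([(a + d) / ROOT2, (a - d) / ROOT2])
    return x

def decompose(x, levels):
    """Returns the final approximation and the details [D1, D2, ...]."""
    approx, details = list(x), []
    for _ in range(levels):
        approx, d = haar_analysis(approx)
        details.append(d)
    return approx, details

def reconstruct(approx, details):
    """Rebuilds the signal from the coarsest level upwards."""
    for d in reversed(details):
        approx = haar_synthesis(approx, d)
    return approx

# 512-sample test signal, as in the filter-bank slide (signal content is arbitrary).
x = [math.sin(2 * math.pi * 5 * n / 512) + 0.3 * math.sin(2 * math.pi * 60 * n / 512)
     for n in range(512)]

a3, details = decompose(x, 3)
x_rec = reconstruct(a3, details)
max_error = max(abs(u - v) for u, v in zip(x, x_rec))
print([len(d) for d in details], len(a3), max_error < 1e-9)
```

Each level halves the number of samples (256, 128, 64 for D1 to D3 and 64 for A3), and the synthesis stage recovers the original signal to rounding error.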

Wavelet Applications

• Typical application fields: astronomy, acoustics, nuclear engineering, sub-band coding, signal and image processing, neurophysiology, music, magnetic resonance imaging, speech discrimination, optics, fractals, turbulence, earthquake prediction, radar, human vision, and pure mathematics applications
• Sample applications
    • De-noising signals
    • Breakdown detecting
    • Detecting self-similarity
    • Compressing images
    • Identifying pure tone

Signal De-noising

• The highest frequencies appear at the start of the original signal
• The approximations appear less and less noisy
• They also lose progressively more high-frequency information
• In A5, about the first 20% of the signal is truncated

Breakdown Detection

• The discontinuous signal consists of a slow sine wave abruptly followed by a medium sine wave
• The 1st and 2nd level details (D1 and D2) show the discontinuity most clearly
• Things to be detected
    • The site of the change
    • The type of change (a rupture of the signal, or an abrupt change in its first or second derivative)
    • The amplitude of the change

[Figure: Discontinuity points]

Detecting Self-similarity

• Purpose
    • Show how analysis by wavelets can detect a self-similar, or fractal, signal
    • The signal here is the Koch curve, a synthetic signal that is built recursively
• Analysis
    • If a signal is similar to itself at different scales, then the "resemblance index" or wavelet coefficients will also be similar at different scales
    • In the coefficients plot, which shows scale on the vertical axis, this self-similarity generates a characteristic pattern

Image Compression

• Fingerprints
    • The FBI maintains a large database of fingerprints, about 30 million sets of them
    • The cost of storing all this data runs to hundreds of millions of dollars
• Results
    • Values under the threshold are forced to zero, achieving about 42% zeros while retaining almost all (99.96%) of the energy of the original image
    • By turning to wavelets, the FBI has achieved a 15:1 compression ratio, better than the more traditional JPEG compression

Identifying Pure Tone

• Purpose
    • Resolving a signal into constituent sinusoids of different frequencies
    • The signal is a sum of three pure sine waves
• Analysis
    • D1 contains signal components whose period is between 1 and 2
    • Zooming in on detail D1 reveals that each "belly" is composed of 10 oscillations
    • D3 and D4 contain the medium sine frequencies
    • There is a breakdown between approximations A3 and A4 -> the medium frequency has been subtracted
    • Approximations A1 to A3 can be used to estimate the medium sine
    • Zooming in on A1 reveals a period of around 20

Empirical Mode Decomposition
Principle

• Objective: from one observation of x(t), get an AM-FM type representation:

$x(t) = \sum_{k=1}^{K} a_k(t)\, \Psi_k(t)$

with $a_k(\cdot)$ amplitude modulating functions and $\Psi_k(\cdot)$ oscillating functions.
• Idea: "signal = fast oscillations superimposed on slow oscillations".
• Operating mode ("EMD", Huang et al., '98):
    (1) identify, locally in time, the fastest oscillation;
    (2) subtract it from the original signal;
    (3) iterate upon the residual.

[Figure: Empirical mode decomposition principle: an LF sawtooth + a linear FM signal = the composite signal]

Empirical Mode Decomposition
Algorithmic definition

[Figure: The sifting process is applied repeatedly to the signal, yielding the first, second, and third intrinsic mode functions and finally the residue. Summary panel: signal, 1st intrinsic mode function, 2nd intrinsic mode function, 3rd intrinsic mode function, residue.]


Empirical Mode Decomposition
Intrinsic Mode Functions

• Quasi-monochromatic harmonic oscillations
    • #{zero crossings} = #{extrema} ± 1
    • Symmetric envelopes around the y = 0 axis
• IMF ≠ Fourier mode and, in nonlinear situations, IMF = several Fourier modes
• Output of a self-adaptive time-varying filter (≠ standard linear filter)
• Example: two sinusoidal FM components + a Gaussian wave packet

Empirical Mode Decomposition
Instantaneous frequency (IF)

• The analytic version of each IMF $C_i(t)$ is computed using the Hilbert transform as:

$z_i(t) = C_i(t) + jH[C_i(t)] = a_i(t)\, e^{j\theta_i(t)}$

• Hence $z_i(t)$ becomes complex, with phase and amplitude. The IF can then be computed as:

$\omega_i(t) = \frac{d\theta_i(t)}{dt}$

• The Hilbert spectrum (HS) is a triplet: $H(\omega, t) = \{t, \omega_i(t), a_i(t)\}$
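The computation of the amplitude and instantaneous frequency from an analytic signal can be sketched as follows. This is an illustration under assumptions: the analytic signal is built by zeroing the negative-frequency DFT bins (a standard discrete construction, valid here for an even number of samples), a pure 10 Hz tone stands in for one IMF $C_i(t)$, and the function names are hypothetical.

```python
import cmath
import math

def analytic_signal(x):
    """z = x + j*H[x], built by zeroing the negative-frequency DFT bins (len(x) even)."""
    N = len(x)
    X = [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
         for k in range(N)]
    # Keep DC and Nyquist, double the positive frequencies, remove the negative ones.
    gain = [1.0] + [2.0] * (N // 2 - 1) + [1.0] + [0.0] * (N // 2 - 1)
    return [sum(gain[k] * X[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

def instantaneous_frequency_hz(z, fs):
    """Phase increment between neighboring samples, converted from rad/sample to Hz."""
    return [cmath.phase(z[n + 1] * z[n].conjugate()) * fs / (2 * math.pi)
            for n in range(len(z) - 1)]

fs = 256
c = [math.sin(2 * math.pi * 10 * n / fs) for n in range(fs)]   # stand-in for one IMF

z = analytic_signal(c)
amplitude = [abs(v) for v in z]                   # a_i(t)
inst_freq = instantaneous_frequency_hz(z, fs)     # d(theta_i)/dt expressed in Hz
print(round(min(inst_freq), 3), round(max(inst_freq), 3))
```

Taking the phase of the product of one sample with the conjugate of the previous one gives the phase derivative without an explicit unwrapping step. For a real IMF the amplitude and frequency would vary with time, and plotting them against time for every IMF gives the Hilbert spectrum.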


[Figure: A signal (versus time), its spectrum (versus frequency), and its time-frequency representation]

[Figure: A signal and its 1st, 2nd, and 3rd IMFs in the time-frequency plane]


• EMD is a fully data-adaptive decomposition for spectral and time-frequency representation of non-linear and non-stationary time series
• It does not employ any basis function for decomposition
• It produces perfect localization of the signal components in the high-resolution time-frequency space of the time series

Empirical Mode Decomposition
Comparison between HS and STFT

[Figure: Time-frequency representation of two pure tones (100 Hz and 250 Hz) using the Hilbert spectrum (HS) and the STFT]

Empirical Mode Decomposition
Comparison between Wavelet and HS

[Figure: Wavelet and Hilbert spectrum (HS)]

Remarks on FT

• The Fourier transform has a mathematical foundation
• It can be used in robust analysis, since it retains phase information
• The detail of the signal information is limited by the basis (sinusoid) function
• STFT analysis includes some additional cross-spectral energy that degrades the performance in some applications

Remarks on WT

• WT employs data-adaptive basis functions based on their time and frequency scales
• It can produce more detailed signal information in the T-F representation
• WT also performs well in multi-band decomposition
• The reconstruction error of the multi-band representation is much less than with the FT
• It cannot preserve the phase information for perfect reconstruction from T-F space

Remarks on EMD and HS

• EMD is a fully adaptive multi-band decomposition method
• It produces perfect localization of the signal components in T-F space
• HS can represent the instantaneous spectra of the signal
• The signal can be reconstructed with negligible error terms
• It does not have a mathematical foundation yet
• It is difficult to use EMD-based decomposition in robust analysis

Application of DSP in audio analysis

• Audio source separation from a mixture using independent subspace analysis (ISA)
• Audio source separation by spatial localization in the underdetermined case
• Robust pitch estimation using EMD


Audio source separation from mixture using independent subspace analysis (ISA)

Source separation by independent subspace analysis (ISA)

[Figure: Block diagram: audio mixture s(t) -> STFT -> PCA -> basis vector selection -> ICA -> basis vector clustering -> source spectrograms -> ISTFT -> individual sources]

Short-Time Fourier Transform (STFT)

[Figure: The mixture audio is transformed by the STFT (window 30 ms, overlap 20 ms) into the magnitude spectrogram X and the phase information; T-F representation of the mixture]

Proposed separation model

• Mixture spectrogram: $X = \sum_i x_i$
• Source spectrograms: $x_i = B_i A_i$
• $B_i$: invariant frequency n-component basis, $B_i = [b_1^{(i)}, b_2^{(i)}, \ldots, b_n^{(i)}]$
• $A_i$: corresponding amplitude envelope, $A_i = [a_1^{(i)}, a_2^{(i)}, \ldots, a_n^{(i)}]^T$
• $A_i = B_i^T X$, $B_i = X A_i^T$
• Goal: to find independent $B_i$ or $A_i$

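Dimension Reduction

• The number of rows or columns of X exceeds the number of sources
• The aim is to reduce the dimension of X
• Singular value decomposition (SVD) is used:

$X_{n \times k} = U_{n \times n}\, S_{n \times k}\, V_{k \times k}^T$

    • U and V: orthogonal matrices (column-wise)
    • S: diagonal matrix of elements (singular values) $\sigma_1 \ge \sigma_2 \ge \sigma_3 \ge \ldots \ge \sigma_n \ge 0$
• p basis vectors (from U or V) are selected by setting $\phi$ = 0.5 to 0.6 in the inequality

$\frac{1}{\sum_{i=1}^{n} \sigma_i} \sum_{i=1}^{p} \sigma_i \ge \phi$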
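The selection rule can be sketched once the singular values are available. The example below assumes the inequality is read as "the first p singular values account for at least a fraction φ of their total", takes a made-up list of singular values as input, and uses a hypothetical function name; computing the SVD itself is left to a linear algebra library.

```python
def select_num_bases(singular_values, phi):
    """Smallest p such that sum(sigma_1..sigma_p) / sum(all sigma) >= phi.

    Assumes singular_values is sorted in decreasing order, as returned by an SVD.
    """
    total = sum(singular_values)
    running = 0.0
    for p, sigma in enumerate(singular_values, start=1):
        running += sigma
        if running / total >= phi:
            return p
    return len(singular_values)

# Hypothetical singular values of a mixture spectrogram (illustrative numbers only).
sigmas = [9.0, 5.0, 3.0, 2.0, 1.0]

p_low = select_num_bases(sigmas, 0.5)    # 9/20 = 0.45 is not enough, 14/20 = 0.70 is
p_high = select_num_bases(sigmas, 0.9)   # needs 19/20 = 0.95, i.e. four values
print(p_low, p_high)
```

A small threshold keeps only the dominant basis vectors, which is what reduces the dimension before ICA is applied.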

Proposed separation model

• To derive the basis vectors, singular value decomposition (SVD) is applied as PCA
• Some principal components are selected as basis vectors
• Independent component analysis (ICA) is applied to make the bases independent

Independent basis vectors

[Figure: Basis vectors before ICA and after ICA]

** The bases are independent along time frames

Producing source subspaces

[Figure: The bases of the speech signal: time bases and frequency bases]

Source subspaces (cont.)

[Figure: Mixture spectrogram -> PCA + basis selection + ICA -> basis vectors B and A -> KLd-based clustering -> source subspaces B1A1 and B2A2 -> source spectrograms]

Source re-synthesis

• Separated subspaces (spectrograms)
• Append phase information
• Inverse STFT

$S_i(n, k) = x_i \cdot e^{j[\phi(n, k)]}$

with $\phi(n, k)$ the appended phase information.

[Figure: Mixture of speech and bip-bip sound; separated speech; separated bip-bip sound]

Experimental results

[Figure: Signals separated with the proposed algorithm; mixtures and separated signals for speech + bip-bip and for male + female speech]


Audio source separation by spatial localization in the underdetermined case

Localization-based separation

• To avoid dependence on the spectral properties and content of the signals in separation
• To increase the number of sources
• The spatial location is considered
• Binaural mixtures are used instead of a single mixture

Localization-based separation (cont.)

• Consider a multi-source audio situation
• Humans can easily localize and separate the sources with the human auditory system (HAS)
• The binaural cues ITD and ILD are mainly used in source localization
• Separation is performed by applying beamforming and a binary mask

Source localization cues

• Interaural time difference (ITD) between the two microphones' signals (like the two ears of a human)
• Interaural level difference (ILD)

[Figure: ITD and ILD]

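Source localization

• $X_r(\omega)$ and $X_l(\omega)$ are the STFTs of $x_r(t)$ and $x_l(t)$
• ITD and ILD are calculated as

$\mathrm{ITD}(\omega) = \frac{1}{\omega} \{ \phi_l(\omega) - \phi_r(\omega) \}$

$\mathrm{ILD}(\omega) = 20 \log \frac{|X_l(\omega)|}{|X_r(\omega)|}$

where $\phi_r(\omega)$ and $\phi_l(\omega)$ are the unwrapped phases of $X_r(\omega)$ and $X_l(\omega)$, respectively, at frequency $\omega$.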
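The two cues can be computed for a single frequency bin as in the sketch below. Several points are assumptions made for the example: the sign convention (left phase minus right phase, left magnitude over right magnitude), the use of a single DFT frame instead of a full STFT, the wrapped phase difference (adequate only while the true phase difference stays below π), and the synthetic right channel, which is the left channel delayed by 2 samples and attenuated by half.

```python
import cmath
import math

def dft_bin(x, k):
    """Single DFT coefficient X_k of the frame x."""
    N = len(x)
    return sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))

def itd_ild(x_left, x_right, k, fs):
    """ITD (seconds) and ILD (dB) at DFT bin k under the assumed sign convention."""
    X_l, X_r = dft_bin(x_left, k), dft_bin(x_right, k)
    omega = 2 * math.pi * k * fs / len(x_left)        # angular frequency of bin k
    phase_diff = cmath.phase(X_l * X_r.conjugate())   # phi_l - phi_r, wrapped to (-pi, pi]
    itd = phase_diff / omega
    ild = 20 * math.log10(abs(X_l) / abs(X_r))
    return itd, ild

fs = 256
delay_samples, gain = 2, 0.5                          # hypothetical source geometry
x_left = [math.sin(2 * math.pi * 8 * n / fs) for n in range(fs)]
x_right = [gain * math.sin(2 * math.pi * 8 * (n - delay_samples) / fs) for n in range(fs)]

itd, ild = itd_ild(x_left, x_right, k=8, fs=fs)
print(itd, delay_samples / fs)                        # estimated and true delay (s)
print(round(ild, 2))                                  # level difference in dB
```

At higher frequencies the phase difference exceeds π for the same delay, so the wrapped estimate becomes ambiguous; this is the ITD ambiguity that the level difference helps to resolve.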

Source localization (cont.)

• ITD becomes ambiguous at higher frequencies (depending on the microphone spacing)
• ILD dominates there and resolves the problem

[Figure: ITD at low frequency and ITD at high frequency]

Source localization (cont.)

• ITD and ILD are quantized into 50 levels
• Collecting the T-F points that correspond to each quantized ITD/ILD pair produces peaks

Separation by beamforming

• ITD is derived for each of the localized sources
• Spatial beamforming is applied
• Linearly constrained minimum variance beamforming (LCMVB) is used
• The gain is selected based on the spatial locations

Separation by binary mask with HS

• It is needed to avoid the limitations of spatial beamforming
• Separation is performed by binary mask estimation based on ITD/ILD
• The sources are considered disjoint orthogonal in T-F space: not more than one source is active at any T-F point

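Computing ITD and ILD

• Each mixture is transformed to the T-F domain using Hilbert spectra ($H_L$ and $H_R$)
• ITD and ILD are measured as:

$\left( \mathrm{ITD}(\omega, t_f),\ \mathrm{ILD}(\omega, t_f) \right) = \left( \frac{1}{\omega} \angle \frac{H_L(\omega, t_f)}{H_R(\omega, t_f)},\ \left| \frac{H_R(\omega, t_f)}{H_L(\omega, t_f)} \right| \right)$

$\mathrm{ILD}_{dB}(\omega, t_f) = 20 \log_{10} \frac{|H_R(\omega, t_f)|^2}{|H_L(\omega, t_f)|^2}$

where $t_f$ is the time frame.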

ITD-ILD space localization

• ITD and ILD are quantized into 50 levels
• Collecting the T-F points from the HS that correspond to each quantized ITD/ILD pair produces peaks

Source separation

• Each peak region in the histogram refers to a source of the binaural mixtures
• Construct a binary mask $M_i(\omega, t)$ (nullifying the T-F points of interfering sources)
• The HS of the ith source is separated as

$H_i(\omega, t) = M_i(\omega, t)\, H_L(\omega, t)$

• The time-domain ith source is given as

$s_i(t) = \sum_{\omega} H_i(\omega, t) \cos[\phi(\omega, t)]$

Source disjoint orthogonality

• Disjoint orthogonality (DO) of audio sources assumes that not more than one source is active at any T-F point:

$F_1(\omega, t)\, F_2(\omega, t) = 0; \quad \forall \omega, t$

where $F_1$ and $F_2$ are the TFRs of two signals.
• SIR (signal to interference ratio) is used as the basis to measure DO

Source disjoint orthogonality (cont.)

[Figure: Three audio sources s1, s2, s3 and the microphones, with the TFR (frequency versus time) of s1, s2, and s3]


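The SIR of the jth source is defined as:

$\mathrm{SIR}_j = \sum_{\omega} \sum_{t} \frac{X_j(\omega, t)}{Y_j(\omega, t)}; \quad Y_j(\omega, t) \neq 0$

$Y_j(\omega, t) = \sum_{i=1,\, i \neq j}^{N} X_i(\omega, t)$

where $Y_j$ is the sum of the interfering sources.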


• The dimensions of the HS and the STFT of the same signal may be different
• DO is defined as a percentage over the entire TFR region
• The average DO (ADO) of all sources is

$\mathrm{ADO} = \frac{1}{N} \sum_{j=1}^{N} \mathrm{SIR}_j$

where N is the number of sources.


The three mixtures are defined as m1{sp1(-40, 0), sp2(30, 0), ft(0, 0)}, m2{sp1(20, 10), sp2(0, 10), ft(-10, 10)}, m3{sp1(40, 20), sp2(30, 20), ft(-20, 20)}

The separation efficiency is measured as OSSR (original to separated signal ratio), defined as:

$\mathrm{OSSR} = \frac{1}{T} \sum_{t=1}^{T} \log_{10} \left( \frac{\sum_{i=1}^{w} s_{original}^2(t + i)}{\sum_{i=1}^{w} s_{separated}^2(t + i)} \right)$
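A direct implementation of such a windowed energy ratio is sketched below. It assumes the OSSR is read as the average, over T window positions, of the base-10 logarithm of the original-to-separated energy ratio in a window of w samples; the zero-based indexing, the choice T = len(signal) - w, and the test signals are illustrative assumptions. Windows with zero energy would need special handling that is omitted here.

```python
import math

def ossr(original, separated, w):
    """Average log10 ratio of windowed energies (original over separated)."""
    T = len(original) - w                        # number of window positions (assumed)
    total = 0.0
    for t in range(T):
        e_orig = sum(v * v for v in original[t:t + w])
        e_sep = sum(v * v for v in separated[t:t + w])
        total += math.log10(e_orig / e_sep)
    return total / T

# Hypothetical signals: an offset sine keeps every window energy above zero.
original = [1.5 + math.sin(0.1 * n) for n in range(400)]

perfect = ossr(original, original, w=32)                      # identical signals
doubled = ossr(original, [2 * v for v in original], w=32)     # separated twice as large
print(perfect, round(doubled, 4))
```

Under this reading, values close to zero mean that the separated signal carries almost the same windowed energy as the original, which matches the small magnitudes reported in the results table.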

Experimental results (cont.)

The comparative separation efficiency (OSSR) using HS and STFT:

Mixtures   TFR    OSSR of sp1   OSSR of sp2   OSSR of ft
m1         HS     -0.0271        0.0213        0.0264
           STFT    0.0621       -0.0721       -0.0531
m2         HS      0.0211       -0.0851       -0.0872
           STFT    0.0824        0.1202        0.1182
m3         HS      0.0941       -0.0832        0.0225
           STFT   -0.1261        0.1092       -0.0821


• This experiment also compares the DO using HS and STFT as the TFR
• STFT is affected by many factors: the window function and its length, the overlap, and the number of FFT points
• HS is independent of such factors; it is only slightly affected by the number of frequency bins used in the TFR

Experimental results (cont.)

[Figure: The ADO of HS and STFT as a function of the number of frequency bins (N = 3)]

[Figure: The ADO of only the STFT is affected by the window overlap (%)]

• STFT includes more cross-spectral energy terms

[Figure: The TFR of two pure tones using HS and STFT]

• HS always has better DO for audio signals
• DO depends on the resolution of the TFR
• STFT has to satisfy the inequality $\Delta\omega\, \Delta t \ge \frac{1}{2}$
• The frequency resolution of HS is up to the Nyquist frequency
• Its time resolution is up to the sampling rate, and hence it offers better resolution

Remarks

• The separation efficiency is independent of the signal's spectral characteristics
• The performance is affected by the angular separation and the disjointness of the sources
• HS produces better disjointness in the T-F domain and hence better separation
• The binaural mixtures were recorded in an anechoic room of NTT


Robust pitch estimation using EMD

Why EMD in pitch estimation?

Pitch facilitates speech coding, enhancement, recognition, etc.

The autocorrelation (AC) function is the most widely used tool in pitch estimation algorithms

The AC function captures the periodic property of the speech

EMD in pitch estimation (cont.)

The pitch period is the lag, in samples, between two consecutive peaks of the AC function

Sometimes the pitch peak is less prominent, especially due to noise
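A minimal sketch of conventional AC-based pitch estimation is given below, assuming a single frame, a 50-500 Hz search range, and a simple "largest peak" rule. The function names, the mean removal, and the synthetic test frame are illustrative assumptions.

```python
import numpy as np

def normalized_autocorrelation(frame):
    """One-sided autocorrelation, normalized so that lag 0 equals 1."""
    frame = np.asarray(frame, dtype=float)
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    return ac / ac[0] if ac[0] > 0 else ac

def pitch_period_from_ac(frame, fs, f_min=50.0, f_max=500.0):
    """Lag (in samples) of the largest AC peak inside the pitch range."""
    ac = normalized_autocorrelation(frame)
    lag_min = int(fs / f_max)
    lag_max = min(int(fs / f_min), len(ac) - 1)
    return lag_min + int(np.argmax(ac[lag_min:lag_max + 1]))

# Usage: synthetic voiced-like frame with a 200 Hz fundamental.
fs = 20000
t = np.arange(512) / fs
frame = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 400 * t)
lag = pitch_period_from_ac(frame, fs)
print(lag, fs / lag)  # pitch period in samples and pitch in Hz
```

With noisy speech the largest peak in the search range is not always the pitch peak, which is the weakness the EMD-based approach aims to address.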


EMD decomposes any signal into components ordered from higher to lower frequency

It produces the local and global oscillations of the signal

The global oscillation approximately represents the envelope of the signal

The IMF of the global oscillation is used to estimate the pitch

Pitch estimation with EMD

Pitch estimation with EMD (cont.)

There exists an IMF in the EMD domain representing the global oscillation of the AC function

That IMF represents the sinusoid of the pitch period

The pitch is obtained from the frequency of that IMF rather than by finding the pitch peak

Pitch estimation with EMD (cont.)

In the EMD, IMF-5 is the oscillation of the pitch period

It is a crucial step to determine the target IMF representing the sinusoid with the pitch period

The IMFs with oscillations of lower frequency than the pitch can be discarded by energy thresholding

Pitch estimation with EMD (cont.)

A reference pitch is computed by the weighted AC (WAC) method

This pitch information is used to select the IMF with the pitch period

The periodicity of each selected IMF is computed as the pitch period
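One way to picture the selection step is sketched below. It assumes the IMFs are already available from some EMD implementation and picks the IMF whose average zero-crossing period is closest to the reference period. The closest-period rule and both function names are assumptions; the slides do not state the exact criterion.

```python
import numpy as np

def mean_period(x):
    """Average period in samples from positive-going zero crossings."""
    x = np.asarray(x, dtype=float)
    up = np.flatnonzero((x[:-1] < 0) & (x[1:] >= 0))
    if len(up) < 2:
        return None  # too few cycles to measure a period
    return float(np.mean(np.diff(up)))

def select_pitch_imf(imfs, reference_period):
    """Index of the IMF whose period best matches the reference period."""
    best_index, best_error = None, np.inf
    for k, imf in enumerate(imfs):
        period = mean_period(imf)
        if period is None:
            continue
        error = abs(period - reference_period)
        if error < best_error:
            best_index, best_error = k, error
    return best_index

# Usage with synthetic "IMFs" of period 20, 100 and 300 samples.
n = np.arange(1000)
imfs = [np.sin(2 * np.pi * n / p + 0.3) for p in (20, 100, 300)]
print(select_pitch_imf(imfs, reference_period=95))
```

Because the choice depends on the reference period, an error in the rough estimate carries over to the selected IMF.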

Pitch estimation with EMD (cont.)

The peak at zero lag is selected

Two cycles are selected on both sides

The average number of samples per cycle is the periodicity
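The periodicity measurement can be sketched as follows, assuming the selected IMF is indexed over symmetric lags with a known zero-lag position. The peak detector and the function names are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def local_maxima(x):
    """Indices of interior samples that are local maxima."""
    x = np.asarray(x, dtype=float)
    return np.flatnonzero((x[1:-1] > x[:-2]) & (x[1:-1] >= x[2:])) + 1

def imf_periodicity(imf, zero_lag_index, cycles=2):
    """Average peak spacing over `cycles` cycles on each side of zero lag."""
    peaks = local_maxima(imf)
    if len(peaks) == 0:
        raise ValueError("no peaks found")
    # Peak closest to the zero-lag position.
    centre = int(np.argmin(np.abs(peaks - zero_lag_index)))
    lo = max(centre - cycles, 0)
    hi = min(centre + cycles, len(peaks) - 1)
    selected = peaks[lo:hi + 1]
    if len(selected) < 2:
        raise ValueError("not enough cycles around zero lag")
    return float(np.mean(np.diff(selected)))

# Usage: a cosine with a 100-sample period over lags -400..400.
lags = np.arange(-400, 401)
imf = np.cos(2 * np.pi * lags / 100.0)
print(imf_periodicity(imf, zero_lag_index=400))
```

Averaging over several cycles makes the estimate less sensitive to a single misplaced peak.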

Proposed pitch estimation algorithm

Compute the normalized autocorrelation (AC) of the speech frame

Determine the rough pitch period using the WAC method

Apply EMD to the AC function

Select the IMF of the pitch period on the basis of the WAC-based estimate

The period of the selected IMF is the estimated pitch period


The Keele pitch database is used here

20 kHz sampling rate

Frame length is 25.6 ms with a 10 ms shift

Each frame is filtered by a band-pass filter covering the pitch range (50-500 Hz)

Gross pitch error (GPE) is used to measure the performance
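The framing arithmetic and the error metric can be sketched as below. At 20 kHz, a 25.6 ms frame is 512 samples and a 10 ms shift is 200 samples. The 20% deviation threshold in `gross_pitch_error` is a commonly used convention and an assumption here, since the slides do not state the threshold; the band-pass filtering step is not shown.

```python
import numpy as np

def frame_signal(x, fs, frame_ms=25.6, shift_ms=10.0):
    """Split x into overlapping frames; returns shape (n_frames, frame_len)."""
    x = np.asarray(x, dtype=float)
    frame_len = int(round(fs * frame_ms / 1000.0))
    hop = int(round(fs * shift_ms / 1000.0))
    if len(x) < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def gross_pitch_error(estimated_f0, reference_f0, tolerance=0.2):
    """Percentage of frames whose estimate deviates by more than
    `tolerance` (assumed 20%) from the reference pitch."""
    est = np.asarray(estimated_f0, dtype=float)
    ref = np.asarray(reference_f0, dtype=float)
    gross = np.abs(est - ref) > tolerance * ref
    return 100.0 * float(np.mean(gross))

# Usage: one second of signal at 20 kHz.
frames = frame_signal(np.zeros(20000), fs=20000)
print(frames.shape)
print(gross_pitch_error([100, 200, 300, 400], [100, 100, 300, 300]))
```

Only frames with a valid (voiced) reference pitch should be passed to the error function.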


The %GPE of male and female speech at different SNRs is presented here

Total number of frames is 1823

SNR      30dB   20dB   10dB   0dB     -5dB    -15dB
Female   1.90   2.83   3.93   10.12   21.83   64.24
Male     2.15   3.78   5.22   11.89   23.56   66.76

Remarks

The use of EMD makes the pitch estimation method more robust

EMD of the AC function can extract the fundamental oscillation of the signal

The pitch can be easily estimated from the single sinusoid of the fundamental oscillation

It is not affected by prominent non-pitch peaks

Future works

The open problem is to identify the IMF with the pitch period

In the present algorithm, an error in the rough pitch estimate from the ACF can propagate to the final estimate

The performance has not yet been compared with other existing algorithms

Open Problem-1
Instantaneous Pitch (IP) estimation using EMD
•Frame-based pitch estimation is already done
•Paper is accepted by EUROSPEECH 2007
•We have used the WAC-based pitch information to compute the exact pitch (IMF) from the EMD space
•The problem is to compute IP only from the EMD space

[Figure: Three methods of pitch estimation]

Open Problems-2
Voiced/Unvoiced Detection with EMD
•Useful in speech enhancement and speech/speaker recognition
•Paper with preliminary results is published in ICICT2007
•The problem is to derive a better separation region for V/UV and to conduct experiments with large speech data

[Figure: V/UV differentiation]

Open Problems-3
Robust Audio Source Localization
•Localization is done by delay-attenuation computed in the T-F space of binaural mixtures; it is NOT noise robust
•The problem is to derive a mathematical model for robust localization in the underdetermined situation

[Figure: Localization of three sources using TD-LD computed in T-F space]

Open Problems-4
Speech denoising using image processing
•Noisy speech can be represented as an image with a time-frequency (T-F) representation, e.g. a spectrogram
•Image processing algorithms can be used for denoising
•It seems easy for musical/white noise
•The problem is to deal with other noise, even by using binaural mixtures

[Figure: Speech with white noise]

Open Problems-5
Auditory segmentation with binaural mixtures
•Auditory segmentation is the first stage of source separation using auditory scene analysis (ASA)
•The problem is to use binaural mixtures for improved auditory segmentation as source separation
•T-F representations other than FT can be employed

[Figure: Source separation by ASA]

Open Problems-6
Two stage speech enhancement
•Single stage speech enhancement is not efficient in all noisy situations
•For example, musical noise is introduced with binary masking and some thresholding methods
•Noise may not be separated perfectly by using ICA or ISA (independent subspace analysis) based techniques
•Multi-stage enhancement with a suitable order can improve the performance

[Figure: Noisy speech -> First stage enhancement -> Second stage enhancement -> Clean speech]

Open Problems-7
Informative features extraction
•To use spectral dynamics in speech/speaker recognition
•Special types of speech features are required
•How to parameterize the speech signal to represent speech dynamics
•WT and HS based spectral analysis can be studied

[Figure: Mixed signal with its spectrogram]

Open Problems-8
Source based audio indexing
•Useful in multimedia applications and moving audio source separation
•Several new methods could be used for indexing: AdaBoost, Tree-ICA, conditional random field

[Figure: Separation of moving sources. Audio sources s1, s2, s3 at different azimuth angles (0 to 180 degrees); distance marked 1.5 m]

Open Problems-9
Time-series prediction with EMD
•Applicable to financial and environmental time series
•Conventional methods use a Kalman filter (for smoothing) and an AR model for prediction
•EMD can be used as a smoothing filter to enhance the prediction accuracy

[Figure: Non-stationary time series]

Open Problems-10
Heart-rate analysis with ECG data using EMD
•ECG variability analysis in different frequency regions
•Analysis of instantaneous ECG condition
•Abnormality analysis of heart rate using EMD-based spectral modeling

[Figure: Different parts of ECG signal]

The End

Questions/Suggestions Please

Source Separation

Each peak region in the histogram refers to a source of the stereo mixtures

Construct a binary mask $M_i(n,t)$ that nullifies the TF points of interfering sources

The HS of the ith source is separated as

$$H_i(n,t) = M_i(n,t)\, H_L(n,t)$$

The time domain ith source is given as

$$s_i(t) = \sum_{n} H_i(n,t)\cos[\varphi(n,t)]$$
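The masking and resynthesis steps can be sketched as follows, assuming the mixture Hilbert spectrum and its phase are stored as arrays of shape (frequency bins, time samples). The masked spectrum is the element-wise product of the binary mask and the mixture spectrum, and the time-domain source is the sum over bins of amplitude times the cosine of the phase. The function name and array layout are assumptions.

```python
import numpy as np

def separate_source(H_mix, phase, mask):
    """Apply a binary T-F mask and resynthesize one source.

    H_mix: mixture Hilbert spectrum amplitudes, shape (n_bins, n_samples).
    phase: phase of each T-F point, same shape.
    mask:  boolean array, True where the target source dominates.
    """
    H_mix = np.asarray(H_mix, dtype=float)
    phase = np.asarray(phase, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    H_i = np.where(mask, H_mix, 0.0)           # nullify interfering T-F points
    s_i = np.sum(H_i * np.cos(phase), axis=0)  # sum over frequency bins
    return H_i, s_i

# Usage with a toy 2-bin spectrum: keep only the first bin.
H_mix = np.ones((2, 4))
phase = np.zeros((2, 4))
mask = np.array([[True] * 4, [False] * 4])
H_i, s_i = separate_source(H_mix, phase, mask)
print(s_i)
```

Each T-F point is assigned entirely to one source, so the quality of the result depends on how disjoint the sources are in the chosen representation.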
