
Time-Scale Modification of Speech Signals

Bill Floyd

ECE 5525 – Digital Speech Processing

December 14, 2004


Objectives

Introduction
Background
Theory
  Methods
Examples
  Matlab Code
  Short Time Fourier Transform
  Short Time Fourier Transform Magnitude
  Speech Samples
Conclusion
Questions
References

Introduction

Goal
  Speed up or slow down a speech signal while maintaining the approximate pitch
Applications
  Change voice mail playback speed
  Court stenographers: play back proceedings more quickly
  Sound effects
  Etc.

Introduction

Option 1 – Change sample rate
  If you modify the playback sample rate, you change the speed, but the pitch changes as well
    Increase sample rate = higher pitch (chipmunk sound)
    Decrease sample rate = lower pitch (drawn-out echo sound)
Option 2 – Decimate or interpolate the signal
  If you change the number of samples and play them back at the original sample rate, the result is the same as modifying the sample rate
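The effect of Option 2 can be checked numerically. The sketch below is illustrative only: the 10,000 samples/sec rate and the 100 Hz test tone are assumed values, and a pure tone stands in for speech.

```matlab
% Decimating by 2 and playing back at the ORIGINAL sample rate doubles
% both the speed and the pitch. All values here are assumptions.
fs = 10000;                     % samples/sec (assumed)
f0 = 100;                       % test-tone frequency in Hz (assumed)
n  = 0:fs-1;                    % one second of samples
x  = sin(2*pi*f0*n/fs);

y  = x(1:2:end);                % keep every other sample
                                % (the tone is far below the new Nyquist limit)

% Dominant frequency of each signal when both are played at rate fs
Xmag = abs(fft(x));
Ymag = abs(fft(y));
[~, kx] = max(Xmag(1:floor(end/2)));    % search positive frequencies only
[~, ky] = max(Ymag(1:floor(end/2)));
fx = (kx-1) * fs / numel(x);    % 100 Hz
fy = (ky-1) * fs / numel(y);    % 200 Hz

dur_x = numel(x) / fs;          % 1.0 s
dur_y = numel(y) / fs;          % 0.5 s
```

The decimated signal lasts half as long, but its tone has moved up an octave, which is the "chipmunk" effect described above.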

Introduction

Option 3 – Use more complex methods
  These change the speed of the sample while preserving the pitch data
    Short Time Fourier Transform
    Short Time Fourier Transform Magnitude
    Sinusoidal Synthesis
    Linear Prediction Synthesis

Terminology

[Figure: Window Representation. Overlapping windows plotted against sample index (0 to 700) with amplitude 0 to 1; the Window Size and the Frame Rate are labeled.]
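As a rough illustration of the two labels, the sketch below uses the values from the Matlab code later in these slides (window size 200, frame rate 100), where the frame rate is the number of samples between the starts of successive windows. The signal length N is an assumed value, and the slides' own code counts frames slightly differently because it zero-pads the end of the signal.

```matlab
% Illustrative framing arithmetic; N is an assumed signal length.
N = 1000;             % total samples (assumed)
window_size = 200;    % samples per window (value used in the slides' code)
frame_rate  = 100;    % samples between window starts (value used in the slides' code)

% Number of complete windows that fit without zero padding
num_frames = floor((N - window_size) / frame_rate) + 1;   % 9

% First and last sample index (1-based) of every window
starts = (0:num_frames-1) * frame_rate + 1;
stops  = starts + window_size - 1;

% Samples shared by two neighbouring windows
overlap = window_size - frame_rate;                       % 100
```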

Theory

Short Time Fourier Transform Methods
  Chapter 7 in our text (Discrete-Time Speech Signal Processing)
  Refer to the class notes for the mathematical theory of operation
  I will pick up from where Dr. Kepuska stopped in his notes

Short Time Fourier Transform

Short Time Fourier Transform
  Also called the Fairbanks method
  Extract successive short-time segments and then discard the following ones

[Block diagram: Signal -> STFT -> Decimate Samples -> IFFT -> OLA -> Output]


Frame rate factor L
  In the frequency domain, after taking the STFT, you get X(nL, ω)
Form a new signal by
  Y(nL, ω) = X(snL, ω)
  where s = compression factor
Take the Inverse Fourier Transform
Use the Overlap and Add method to form the new signal
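The steps above can be sketched in a few lines. This is a minimal illustration under assumed values (a 150 Hz tone standing in for speech, L = 100, a 200-sample triangular window, s = 2), not the code used to produce the results in these slides; the window is written out by hand so that no toolbox function is needed.

```matlab
% Minimal sketch of Y(nL, w) = X(snL, w): keep every s-th STFT slice and
% overlap-add the kept slices at the original spacing L.
fs = 10000;  L = 100;  win_len = 2*L;  s = 2;    % assumed values
n  = (0:fs-1)';
x  = sin(2*pi*150*n/fs);                         % stand-in for a speech signal

% Triangular window (same values as triang(win_len) for even win_len)
w = 1 - abs(((0:win_len-1)' - (win_len-1)/2) / (win_len/2));

num_frames = floor((numel(x) - win_len) / L) + 1;
kept = 1:s:num_frames;                           % slices that survive
y = zeros((numel(kept)-1)*L + win_len, 1);

for m = 1:numel(kept)
    a = (kept(m)-1)*L + 1;                       % analysis position (s*n*L)
    X = fft(x(a:a+win_len-1) .* w, 1024);        % one STFT slice
    frame = real(ifft(X));                       % back to the time domain
    b = (m-1)*L + 1;                             % synthesis position (n*L)
    y(b:b+win_len-1) = y(b:b+win_len-1) + frame(1:win_len);
end
```

With these values the output has 5100 samples against 10000 in the input, so it plays in roughly half the time at the same sample rate.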


[Figure: two panels of overlapping windows plotted against sample index (0 to 800) with amplitude 0 to 1, labeled X(nL, ω) and Y(nL, ω) = X(2nL, ω).]


[Figure: two panels plotted against sample index with amplitude 0 to 1, labeled "Original Windowed Sequence" and "New Sequence".]


Problems
  Pitch synchronization
    It is highly likely that the pitch periods will not line up properly

Short Time Fourier Transform Magnitude

Short Time Fourier Transform Magnitude
  Problems with the STFT method relate directly to the linear phase component of the STFT
    Time shift = phase change
  An alternate approach is to use only the magnitude portion of the STFT: the Short Time Fourier Transform Magnitude (STFTM)
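The "time shift = phase change" point can be verified with a short numerical check. The test pulse, its length, and the 5-sample delay below are assumed values chosen only for illustration.

```matlab
% A (circular) delay of d samples multiplies the DFT by a linear phase
% term and leaves the magnitude untouched.
N = 64;
n = (0:N-1)';
x = exp(-0.5*((n-20)/3).^2);      % short pulse (assumed test signal)
d = 5;                            % delay in samples (assumed)
x_shift = circshift(x, d);        % x delayed by d samples

X  = fft(x);
Xs = fft(x_shift);
k  = (0:N-1)';

% Largest difference between the two magnitude spectra (close to zero)
mag_diff = max(abs(abs(Xs) - abs(X)));

% Largest difference between Xs and X times the linear phase term (close to zero)
phase_diff = max(abs(Xs - X .* exp(-1j*2*pi*k*d/N)));
```

Because the magnitude is unaffected by where a segment sits in time, working with the magnitude alone sidesteps the phase mismatch between segments.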

Short Time Fourier Transform Magnitude

Compression
  With the Fairbanks method, time slices were discarded
  Now we can just compress the time slices
  Form a new signal by
    |Y(nM, ω)| = |X(nL, ω)|
    where M = compression factor = L / speed (the new spacing between time slices)
    e.g. for speeding up by two, M = L/2

Short Time Fourier Transform Magnitude

Compression
  Take the Inverse Fourier Transform
  Use the Overlap and Add method to form the new signal
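A minimal sketch of the compression step follows. It mirrors the ifft(abs(...)) step that appears in the Matlab code later in these slides and simply re-spaces every frame at M = L / speed; it is not a refined phase-estimation method. The 150 Hz test tone and all parameter values are assumptions.

```matlab
% Keep every slice, take its magnitude, and overlap-add at spacing M.
fs = 10000;  L = 100;  win_len = 2*L;  speed = 2;   % assumed values
M  = L / speed;                                     % 50 samples between output frames
n  = (0:fs-1)';
x  = sin(2*pi*150*n/fs);                            % stand-in for a speech signal

% Triangular window (same values as triang(win_len) for even win_len)
w = 1 - abs(((0:win_len-1)' - (win_len-1)/2) / (win_len/2));

num_frames = floor((numel(x) - win_len) / L) + 1;
y = zeros((num_frames-1)*M + win_len, 1);

for m = 1:num_frames
    a = (m-1)*L + 1;                                % analysis position (n*L)
    Xmag  = abs(fft(x(a:a+win_len-1) .* w, 1024));  % |X(nL, w)|
    frame = real(ifft(Xmag));                       % zero-phase frame
    b = (m-1)*M + 1;                                % synthesis position (n*M)
    y(b:b+win_len-1) = y(b:b+win_len-1) + frame(1:win_len);
end
```

No slice is discarded here; the shorter output (5100 samples for 10000 input samples) comes entirely from placing the frames closer together.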

Short Time Fourier Transform Magnitude

[Figure: two panels of overlapping windows plotted against sample index (0 to 800) with amplitude 0 to 1, labeled X(nL, ω) and Y(nM, ω) = X(nL, ω), with M = L/2.]

Short Time Fourier Transform Magnitude

[Figure: two panels plotted against sample index with amplitude 0 to 1, labeled "Original Windowed Sequence" and "New Sequence".]

Other Methods

Sinusoidal Synthesis (Chapter 9)
  Time-warp the sinewave frequency track and the amplitude function
  This technique has been successful with not only speech but also music, biological, and mechanical signals
  Problems
    Does not maintain the original phase relations
    Suffers from reverberance

Other Methods

Linear Prediction Synthesis
  Use homomorphic and linear prediction results to modify the time base
  The book briefly mentions that this is possible, but I ran out of time before I could investigate this process further

Other Methods

New Techniques
  An Internet search showed several methods that try to improve on what is available now
Software
  Various software programs will change the speed for you
  Adobe Audition is one of the most all-encompassing right now

Matlab Code-Prepare the Workspace

%%%%%%%%%%%%%%%%%
% Prepare Workspace
%%%%%%%%%%%%%%%%

close all;
clear all;

window_size_1 = 200;   % window length in samples
frame_rate_1 = 100;    % samples between the starts of successive windows

% Speed factor (2 = twice as fast)
speed = 2;

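Matlab Code-Load the Speech Signal

%%%%%%%%%%%%%%%%%
% Load Data File
%%%%%%%%%%%%%%%%

% The 's' option returns the typed text as a string
filename = input('Please enter the file name to be used. ', 's');

% wavread was removed from later MATLAB releases (audioread replaces it there)
[sample_data,sample_rate,nbits] = wavread(filename);

num_samples = max(size(sample_data));
loop_time = floor(num_samples/frame_rate_1);

% Zero-pad so the last window is complete; start at num_samples+1 so the final sample is kept
sample_data(num_samples+1:(loop_time+1)*frame_rate_1)=0;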

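Matlab Code-Develop the Window

%%%%%%%%%%%%%%%%%
% Create Windows
%%%%%%%%%%%%%%%%

% Want windows of about 25ms
% File sampled at 10,000 samples/sec
% Window size in samples = 10000 * window duration
% (window_size_1 = 200 samples is 20ms; frame_rate_1 = 100 samples is 10ms)

triangle_30ms = triang(window_size_1);
%triangle_30ms = hamming(window_size_1);

% Sum of the window samples, used to scale the overlap-add output
W0 = sum(triangle_30ms);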

Matlab Code-Window the Entire Speech Signal

%%%%%%%%%%%%%%%%% Window the speech%%%%%%%%%%%%%%%%

for i =0:loop_time-1

window_data(:,i+1)=sample_data((frame_rate_1*i)+1:((i+2)* frame_rate_1)).*triangle_30ms;

end


Matlab Code-Perform the Fast Fourier Transform

%%%%%%%%%%%%%%%%% Create FFT%%%%%%%%%%%%%%%%

for i = 1:loop_time

window_data_fft(:,i) = fft(window_data(:,i),1024);

end


Matlab Code-Recreate the Modified Signal

%%%%%%%%%%%%%%%%%
% Recreate Original Signal
%%%%%%%%%%%%%%%%

%Initialize the recreated signals

reconstructed_signal(1:(loop_time+1)*frame_rate_1)=0;
real_reconstructed_signal(1:(loop_time+1)*frame_rate_1)=0;

modified_reconstructed_signal(1:(loop_time+3)*(frame_rate_1/speed))=0;

modified_reconstructed_signal_compressed(1:(loop_time+3)*(frame_rate_1/speed))=0;


% Perform the ifft

for i = 1:loop_time

recreated_data_ifft(:,i) = ifft(window_data_fft(:,i),1024);
real_recreated_data_ifft(:,i) = ifft(abs(window_data_fft(:,i)),1024);

truncated_recreated_data_ifft(:,i) = recreated_data_ifft(1:window_size_1,i).*(frame_rate_1/W0);

real_truncated_recreated_data_ifft(:,i) = real_recreated_data_ifft(1:window_size_1,i).*(frame_rate_1/W0);

end


% Get back to the original signal

for i=0:loop_time-1

reconstructed_signal((frame_rate_1*i)+1:((i+2)*frame_rate_1)) = reconstructed_signal((frame_rate_1*i)+1:((i+2)*frame_rate_1)) + truncated_recreated_data_ifft(:,i+1)';

real_reconstructed_signal((frame_rate_1*i)+1:((i+2)*frame_rate_1)) = real_reconstructed_signal((frame_rate_1*i)+1:((i+2)*frame_rate_1)) + real_truncated_recreated_data_ifft(:,i+1)';

end



% Get a modified signal by deleting certain parts (STFT)

for i=0:(loop_time-1)/speed

modified_reconstructed_signal((frame_rate_1*i)+1:((i+2)* frame_rate_1)) = modified_reconstructed_signal((frame_rate_1*i)+1:((i+2)*frame_rate_1)) + real_truncated_recreated_data_ifft(:,i*speed+1)';

end



% Initialize the compressed sequence (STFTM)

modified_reconstructed_signal_compressed(1:frame_rate_1+frame_rate_1/speed+1)=truncated_recreated_data_ifft(frame_rate_1-frame_rate_1/speed:window_size_1,1)';

% Get a modified signal by compressing

for i=0:(loop_time-2)

modified_reconstructed_signal_compressed((frame_rate_1/speed*i)+1:(frame_rate_1/speed*i)+window_size_1) = modified_reconstructed_signal_compressed((frame_rate_1/speed*i)+1:(frame_rate_1/speed*i)+window_size_1) + real_truncated_recreated_data_ifft(:,i+2)';

end

Matlab Code-Plot Results

%%%%%%%%%%%%%%%%%
% Plot Results
%%%%%%%%%%%%%%%%

figure;
subplot(211)
plot(sample_data)
title('Original Speech');
v1=axis;
hold on;
subplot(212)
plot(real(modified_reconstructed_signal))
title(['STFT Synthesis w/ Speed = ',num2str(speed),'X']);
v2=axis;

% Use the axis limits of the longer signal for both plots
if speed > 1
subplot(211); axis(v1)
subplot(212); axis(v1)
else
subplot(211); axis(v2)
subplot(212); axis(v2)
end

Matlab Code-Write Sound Files

%%%%%%%%%%%%%%%%% Write sound files%%%%%%%%%%%%%%%%

wavwrite(modified_reconstructed_signal,sample_rate,nbits,'C:\Classes\ECE_5525\tea party fairbanks 2x.wav')


Examples: Baseline Samples

[Embedded sound files in the original slide: STFT, STFTM, Original File, Sample Rate 2X, Sample Rate .5X]

Examples: STFT, Speed 0.5X

[Figure: two panels, "Original Speech" and "STFT Synthesis w/ Speed = 0.5X", amplitude against sample index (x 10^4). A sound file is embedded in the original slide.]

Examples: STFT, Speed 2X

[Figure: two panels, "Original Speech" and "STFT Synthesis w/ Speed = 2X", amplitude against sample index (x 10^4). A sound file is embedded in the original slide.]

Examples: STFT, Speed 4X

[Figure: two panels, "Original Speech" and "STFT Synthesis w/ Speed = 4X", amplitude against sample index (x 10^4). A sound file is embedded in the original slide.]

Examples: STFTM, Speed 0.5X

[Figure: two panels, "Original Speech" and "STFTM Synthesis w/ Speed = 0.5X", amplitude against sample index (x 10^4). A sound file is embedded in the original slide.]

Examples: STFTM, Speed 2X

[Figure: two panels, "Original Speech" and "STFTM Synthesis w/ Speed = 2X", amplitude against sample index (x 10^4). A sound file is embedded in the original slide.]

Examples: STFTM, Speed 4X

[Figure: two panels, "Original Speech" and "STFTM Synthesis w/ Speed = 4X", amplitude against sample index (x 10^4). A sound file is embedded in the original slide.]

More Results

Change in window size
  If the window size becomes too small, then a change in pitch will occur
  The window needs to be 2 to 3 pitch periods long
  I generally used 20 – 30 ms windows
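As a worked example of the "2 to 3 pitch periods" rule, the arithmetic below assumes a hypothetical speaker with a 100 Hz pitch and the 10,000 samples/sec rate mentioned in the code comments. Under those assumptions the rule gives the same 20 to 30 ms range quoted above.

```matlab
fs = 10000;                       % samples/sec (from the code comments)
f0 = 100;                         % pitch in Hz (assumed, hypothetical speaker)

pitch_period = fs / f0;           % 100 samples, i.e. 10 ms

win_min = 2 * pitch_period;       % 200 samples
win_max = 3 * pitch_period;       % 300 samples

win_min_ms = 1000 * win_min / fs; % 20 ms
win_max_ms = 1000 * win_max / fs; % 30 ms
```

A higher-pitched voice has shorter pitch periods, so the same rule would allow a shorter window.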

More Results

Change in frame rate
  If the frame rate decreases too much, then there will be too many samples overlapping to get an intelligible signal

[Figure: overlapping windows plotted against sample index (-50 to 450) with amplitude 0 to 1.]

More Results

Change filter type
  Tried a Hamming window: not much perceptual difference
  Using the window energy (W0) becomes important here
    Frame Rate/W0 is not equal to one
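The scale factor Frame Rate/W0 can be computed for both window types. The sketch below writes the windows out by hand so that no toolbox is needed, and assumes the window size of 200 and frame rate of 100 used in the code.

```matlab
N = 200;                                      % window size in samples
L = 100;                                      % frame rate (samples between windows)
k = (0:N-1)';

% Same values as triang(N) for even N
w_tri = 1 - abs((k - (N-1)/2) / (N/2));

% Same values as hamming(N)
w_ham = 0.54 - 0.46*cos(2*pi*k/(N-1));

W0_tri = sum(w_tri);                          % 100
W0_ham = sum(w_ham);                          % 107.54

gain_tri = L / W0_tri;                        % 1
gain_ham = L / W0_ham;                        % about 0.93
```

For the triangular window the factor is one, so leaving it out makes no difference; for the Hamming window the overlap-added output would be about 7.5 percent too large without it.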

Conclusion

Optimum area
  Frame rate is one half of the window size
  Window size needs to be 2 to 3 pitch periods long
It is possible to easily change the time scale and still maintain the original pitch, although the result is not always natural sounding

Conclusion

Further investigation
  What to do when you want to slow down by more than half
    Using the STFTM means there will be gaps between the sequences

[Figure: windows separated by gaps, plotted against sample index (0 to 1000) with amplitude 0 to 1.]
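The gaps follow directly from the spacing M = L / speed. The arithmetic below assumes the window size of 200 and frame rate of 100 from the code; the list of speeds is chosen only for illustration.

```matlab
L = 100;                         % frame rate (samples between analysis windows)
win_len = 200;                   % window size in samples

speeds = [2 1 0.5 0.25];         % assumed example speed factors
M = L ./ speeds;                 % output spacing: [50 100 200 400]

% Samples between the end of one output window and the start of the next
gap = max(M - win_len, 0);       % [0 0 0 200]
```

At half speed the output windows only just touch, and at any slower speed the spacing exceeds the window size, which leaves stretches with no signal at all.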

Conclusion

Further investigation
  What to do when you want to slow down by more than half
    Could replicate windowed segments

[Figure: windows plotted against sample index (0 to 1000) with amplitude 0 to 1.]

Conclusion

Further investigation
  Use the other methods to determine quality
    Implement Sinusoidal Synthesis
    Implement Linear Predictive Synthesis using linear prediction and homomorphic methods
  Work on synchronizing pitch periods
    Shift samples so that the peaks line up
      Scott and Gerber: Synchronized Overlap and Add (SOLA)
      Cross-correlation of two samples to find the peak
      Use the peaks to line up samples
    Align the window at the same relative location within a pitch period
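The cross-correlation step listed above can be sketched as follows. This is only a toy illustration of the alignment idea, not the published algorithm: the impulse trains standing in for voiced speech, the 100 Hz pitch, and all variable names are assumptions.

```matlab
fs = 10000;                       % samples/sec (assumed)
P  = fs / 100;                    % pitch period for an assumed 100 Hz pitch
n  = (0:399)';

% Impulse trains standing in for two voiced segments whose pitch pulses
% are offset from each other by 25 samples
a = double(mod(n - 10, P) == 0);
b = double(mod(n - 35, P) == 0);

% Cross-correlation for lags within half a pitch period either way
lags  = -P/2 : P/2 - 1;
score = zeros(size(lags));
idx   = (1:numel(a))';
for i = 1:numel(lags)
    k = lags(i);
    valid = (idx + k >= 1) & (idx + k <= numel(b));   % stay inside b
    score(i) = sum(a(idx(valid)) .* b(idx(valid) + k));
end

% The lag with the largest score lines the pulses up
[~, best] = max(score);
best_lag  = lags(best);           % 25
```

Advancing the second segment by best_lag samples before overlap-adding would place its pitch pulses on top of those in the first segment.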

Questions

Are there any questions?


References

Quatieri, Thomas F. Discrete-Time Speech Signal Processing. Prentice Hall, Upper Saddle River, NJ, 2002.

Rabiner, L.R. and Schafer, R.W. Digital Processing of Speech Signals. Prentice Hall, Upper Saddle River, NJ, 1978.

Oppenheim, A.V. and Schafer, R.W. Digital Signal Processing. Prentice Hall, Englewood Cliffs, NJ, 1975.

Scott, R. and Gerber, S. “Pitch Synchronous Time-Compression of Speech,” Proc. Conf. Speech Communications Processing, pp. 63-85, April 1972.

Fairbanks, G., Everitt, W.L., and Jaeger, R.P. “Method for Time or Frequency Compression-Expansion of Speech,” IEEE Transactions Audio and Electroacoustics, vol. AU-2, pp. 7-12, Jan 1954.
