Speech Compression Using Wavelets



1. INTRODUCTION

Speech is a very basic way for humans to convey information to one another. With a bandwidth of only 4 kHz, speech can convey information with the emotion of a human voice. People want to be able to hear someone's voice from anywhere in the world as if the person were in the same room. As a result, greater emphasis is being placed on the design of new and efficient speech coders for voice communication and transmission, and today the applications of speech coding and compression are numerous.

This paper looks at a new technique for analyzing and compressing speech signals using wavelets. Any signal can be represented by a set of scaled and translated versions of a basic function called the mother wavelet. Taking the wavelet transform of the original signal produces a set of wavelet coefficients at different scales and positions. Speech is a non-stationary random process, owing to the time-varying nature of the human speech production system. Non-stationary signals are characterized by numerous transitory drifts, trends and abrupt changes. The localization feature of wavelets, along with their time-frequency resolution properties, makes them well suited for coding speech signals.

The coefficients represent the signal in the wavelet domain, and all data operations can be performed using just these coefficients. In designing a wavelet-based speech coder, the major issues explored in this paper are:

i. Choosing optimal wavelets for speech,

ii. Decomposition level in wavelet transforms,

iii. Threshold criteria for the truncation of coefficients,

iv. Efficiently representing zero valued coefficients and

v. Quantizing and digitally encoding the coefficients.

The performance of the wavelet compression scheme in coding speech signals and the quality of the reconstructed signals are also evaluated.


1.1 Speech Signals

Human speech in its pristine form is an acoustic signal. For the purposes of communication and storage, it is necessary to convert it into an electrical signal; this is accomplished with the help of instruments called 'transducers'. This electrical representation of speech has certain properties:

1. It is a one-dimensional signal, with time as its independent variable.

2. It is random in nature.

3. It is non-stationary, i.e. the frequency spectrum is not constant in time.

4. Although human beings have an audible frequency range of 20 Hz to 20 kHz, human speech has significant frequency components only up to 4 kHz, a property that is exploited in the compression of speech.

Digital representation of speech

With the advent of digital computers, it became natural to exploit their power for processing speech signals, which requires a digital representation of speech. To achieve this, the analog signal is sampled at some frequency and then quantized at discrete levels. Thus, the parameters of digital speech are:

1. Sampling rate

2. Bits per sample

3. Number of channels.

The sound files can be stored and played back on digital computers. Various formats have been proposed by different manufacturers, '.wav' and '.au' to name a few. In this thesis, the '.wav' format is used extensively due to the convenience of recording it with the 'Sound Recorder' software shipped with the Windows OS.
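To make these parameters concrete, the following short Python sketch (an illustration assuming the scipy and numpy packages and a placeholder file name speech.wav) reads a '.wav' file and reports its sampling rate, bits per sample and channel count.

import numpy as np
from scipy.io import wavfile

# Read a PCM .wav file; 'speech.wav' is a placeholder name.
rate, samples = wavfile.read('speech.wav')

# Bit depth follows from the sample dtype (e.g. int16 -> 16 bits per sample).
bits_per_sample = samples.dtype.itemsize * 8
# Mono data comes back as a 1-D array; multi-channel as (n_samples, n_channels).
channels = 1 if samples.ndim == 1 else samples.shape[1]

print('sampling rate:', rate, 'Hz')
print('bits per sample:', bits_per_sample)
print('channels:', channels)
print('duration:', samples.shape[0] / rate, 's')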

1.2 Compression – An Overview

In recent years, large-scale information transfer by remote computing and the development of massive storage and retrieval systems have witnessed tremendous growth. To cope with the growth in the size of databases, additional storage devices need to be installed, and modems and multiplexers have to be continuously upgraded to permit large amounts of data transfer between computers and remote terminals. This increases both cost and equipment. One solution to these problems is compression, whereby the database and the transmission sequence can be encoded efficiently.


1.3 Coding Techniques

There are various methods of coding the speech signal.


2 WEAKNESSES OF FOURIER ANALYSIS

This chapter develops the need and motivation for studying the wavelet transform. Historically, the Fourier transform has been the most widely used tool for signal processing. As signal processing spread to encompass newer classes of signals, the Fourier transform was found unable to satisfy the growing need for processing them. Hence, this chapter begins with a review of Fourier methods; detailed explanation is avoided to rid the discussion of insignificant details. A simple case is presented in which the shortcomings of Fourier methods are expounded. The next chapter concerns wavelet transforms and shows how the drawbacks of the FT are eliminated.

2.1 WAVELETS Vs FOURIER ANALYSIS

A major drawback of Fourier analysis is that in transforming to the frequency domain, the time-domain information is lost. The most important difference between these two kinds of transforms is that individual wavelet functions are localized in time, whereas the Fourier sine and cosine functions are non-local and are active for all time t.
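To make the non-locality concrete, the following sketch (a minimal illustration added here, not part of the original text) builds two signals containing the same two tones in opposite order; their Fourier magnitude spectra are practically identical, so the spectrum alone cannot reveal when each tone occurred.

import numpy as np

fs = 8000                          # sampling rate in Hz
t = np.arange(fs) / fs             # one second of time samples
half = fs // 2

# Signal a: 500 Hz tone followed by a 1500 Hz tone; signal b: reverse order.
a = np.concatenate([np.sin(2 * np.pi * 500 * t[:half]),
                    np.sin(2 * np.pi * 1500 * t[:half])])
b = np.concatenate([np.sin(2 * np.pi * 1500 * t[:half]),
                    np.sin(2 * np.pi * 500 * t[:half])])

# The magnitude spectra coincide although the timing differs completely.
spec_a = np.abs(np.fft.rfft(a))
spec_b = np.abs(np.fft.rfft(b))
print('max spectral difference:', np.max(np.abs(spec_a - spec_b)))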


3 INTRODUCTION TO WAVELETS AND THE CONTINUOUS WAVELET TRANSFORM (CWT)

INTRODUCTION:

This chapter provides a motivation towards the study of wavelets as a tool for signal processing. The drawbacks inherent in the Fourier methods are overcome with wavelets; this fact is demonstrated here. It must be reiterated that the discussion in this chapter is by no means comprehensive and exhaustive. The concepts of time-frequency resolution have been avoided for the sake of simplicity. Instead, the development endeavors to compare the wavelet methods with the Fourier methods, as the reader is expected to be well conversant with the latter.

3.1 Continuous-time Wavelets

Consider a real- or complex-valued continuous-time function ψ(t) with the following two properties:

1. The function integrates to zero:

$$\int_{-\infty}^{\infty} \psi(t)\,dt = 0$$

2. It is square integrable or, equivalently, has finite energy:

$$\int_{-\infty}^{\infty} \lvert \psi(t) \rvert^2\,dt < \infty$$

A function is called a mother wavelet if it satisfies these two properties. Infinitely many functions satisfy them and thus qualify as mother wavelets. The simplest is the Haar wavelet; other examples are the Mexican hat and Morlet wavelets. Apart from these, there are various families of wavelets, such as the Daubechies, Symlet and Coiflet families. In this thesis, the main emphasis is on the Daubechies family, which comprises the db1 to db10 wavelets. They are shown in the following figure.


Fig. 3.1: Some wavelet functions.

3.2 The Continuous Wavelet Transform (CWT)

Consider the following figure, which juxtaposes a sinusoid and a wavelet.

Fig. 3.2: Comparing a sine wave and a wavelet.


As has already been pointed out, a wavelet is a waveform of effectively limited duration that has an average value of zero. Compare wavelets with sine waves, which are the basis of Fourier analysis. Sinusoids do not have limited duration: they extend from minus to plus infinity. And where sinusoids are smooth and predictable, wavelets tend to be irregular and asymmetric.

Fourier analysis consists of breaking up a signal into sine waves of various frequencies (fig. 2.1). Similarly, wavelet analysis is the breaking up of a signal into shifted and scaled versions of the original (or mother) wavelet. Compare the following figure with fig. 2.1.

Fig. 3.3: Decomposition of a signal into wavelets.

The above diagram suggests the existence of a synthesis equation to represent the original signal as a linear combination of wavelets, which are the basis functions for wavelet analysis (recollect that in Fourier analysis, the basis functions are sines and cosines). This is indeed the case. The wavelets in the synthesis equation are multiplied by scalars. To obtain these scalars, we need an analysis equation, just as in the Fourier case.
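The analysis and synthesis equations themselves appeared as figures in the original and were lost in extraction; for completeness, the standard textbook form of the CWT pair is reproduced below (a reconstruction, not the thesis's own typesetting):

$$CWT_x(a,b) = \frac{1}{\sqrt{\lvert a \rvert}} \int_{-\infty}^{\infty} x(t)\, \psi^{*}\!\left(\frac{t - b}{a}\right) dt$$

where a is the scale, b the translation and ψ* the complex conjugate of the mother wavelet. For an admissible wavelet with admissibility constant C_ψ, the signal is recovered by

$$x(t) = \frac{1}{C_\psi} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} CWT_x(a,b)\, \frac{1}{\sqrt{\lvert a \rvert}}\, \psi\!\left(\frac{t - b}{a}\right) \frac{da\, db}{a^2}$$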


4 THE DISCRETE WAVELET TRANSFORM (DWT)

INTRODUCTION

Calculating wavelet coefficients at every possible scale (as in the continuous WT) is a fair amount of work, and it generates an awful lot of data. What if we choose only a subset of scales and positions at which to make our calculations? It turns out, rather remarkably, that if we choose scales and positions based on powers of two (so-called dyadic scales and positions) then our analysis will be much more efficient and just as accurate. We obtain such an analysis from the discrete wavelet transform (DWT).

4.1 DWT defined mathematically

The Discrete Wavelet Transform (DWT) involves choosing scales and positions based on powers of two, the so-called dyadic scales and positions. The mother wavelet is rescaled or "dilated" by powers of two and translated by integers. Specifically, a function f(t) ∈ L²(R) (the space of square-integrable functions) can be represented as

$$f(t) = \sum_{k} a(L,k)\, 2^{-L/2}\, \varphi(2^{-L}t - k) \;+\; \sum_{j=1}^{L} \sum_{k} d(j,k)\, 2^{-j/2}\, \psi(2^{-j}t - k)$$

The function ψ(t) is known as the mother wavelet, while φ(t) is known as the scaling function. The set of functions

$$\{\, 2^{-L/2}\varphi(2^{-L}t - k),\; 2^{-j/2}\psi(2^{-j}t - k) \;\vert\; j \le L;\; j, k, L \in Z \,\}$$

where Z is the set of integers, is an orthonormal basis for L²(R). The numbers a(L,k) are known as the approximation coefficients at scale L, while d(j,k) are known as the detail coefficients at scale j. These approximation and detail coefficients can be expressed as the inner products

$$a(L,k) = 2^{-L/2}\int f(t)\, \varphi(2^{-L}t - k)\, dt, \qquad d(j,k) = 2^{-j/2}\int f(t)\, \psi(2^{-j}t - k)\, dt$$

The above two equations give a mathematical relationship for computing the approximation and detail coefficients, but this procedure is seldom adopted. A more practical approach is to use Mallat's Fast Wavelet Transform algorithm. The Mallat algorithm for the discrete wavelet transform (DWT) is, in fact, a classical scheme in the signal processing community, known as a two-channel subband coder using conjugate quadrature filters or quadrature mirror filters (QMF). It is developed in the following sections.

4.2. VANISHING MOMENTS

The number of vanishing moments of a wavelet indicates the smoothness of the wavelet function as well as the flatness of the frequency response of the wavelet filters (the filters used to compute the DWT). Typically, a wavelet with p vanishing moments satisfies

$$\int_{-\infty}^{\infty} t^{k}\, \psi(t)\, dt = 0, \qquad k = 0, 1, \ldots, p-1$$

Wavelets with a high number of vanishing moments lead to a more compact signal representation and are hence useful in coding applications. However, in general, the length of the filters increases with the number of vanishing moments, and the complexity of computing the DWT coefficients increases with the size of the wavelet filters.

4.3. FAST WAVELET TRANSFORM

The Discrete Wavelet Transform (DWT) coefficients can be computed using Mallat's Fast Wavelet Transform algorithm. This algorithm is sometimes referred to as the two-channel subband coder and involves filtering the input signal based on the wavelet function used.

4.4. IMPLEMENTATION USING FILTERS

To explain the implementation of the Fast Wavelet Transform algorithm, consider the following equations:

$$\frac{1}{2}\,\varphi\!\left(\frac{x}{2}\right) = \sum_{n \in Z} c(n)\, \varphi(x - n)$$

$$\frac{1}{2}\,\psi\!\left(\frac{x}{2}\right) = \sum_{n \in Z} g(n)\, \varphi(x - n)$$

$$\sum_{n \in Z} c(n)\, g(n - 2k) = 0 \quad \text{for all } k \in Z$$

The first equation is known as the twin-scale relation (or the dilation equation) and defines the scaling function. The second equation expresses the wavelet in terms of the scaling function. The third equation is the condition required for the wavelet to be orthogonal to the scaling function and its translates. The coefficients c(k), or {c0, ..., c2N-1}, in the above equations represent the impulse response coefficients of a low-pass filter of length 2N, with a sum of 1 and a norm of 1/√2. The high-pass filter coefficients g(k) are obtained from the low-pass filter using the relationship

$$g(k) = (-1)^{k}\, c(1-k)$$

where k varies over the range (1-(2N-1)) to 1.
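Readers who want to inspect real filter coefficients can do so with the PyWavelets package, as in the sketch below (an addition for illustration; the thesis itself worked with MATLAB-style tools). Note that PyWavelets stores orthogonal filters with unit norm and sum √2, so dividing by √2 recovers the sum-1, norm-1/√2 convention used in the text.

import numpy as np
import pywt

# Daubechies db4: 4 vanishing moments, filter length 2N = 8.
w = pywt.Wavelet('db4')
h = np.array(w.dec_lo)                 # low-pass decomposition filter

# PyWavelets convention: sum(h) = sqrt(2), norm(h) = 1.
print('sum:', h.sum(), ' norm:', np.linalg.norm(h))

# Rescale to the text's convention: sum(c) = 1, norm(c) = 1/sqrt(2).
c = h / np.sqrt(2)
print('sum:', c.sum(), ' norm:', np.linalg.norm(c))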

Starting with a discrete input signal vector s, the first stage of the FWT algorithm decomposes the signal into two sets of coefficients: the approximation coefficients cA1 (low-frequency information) and the detail coefficients cD1 (high-frequency information), as shown in the figure below.

The coefficient vectors are obtained by convolving s with the low-pass filter Lo_D for the approximation and with the high-pass filter Hi_D for the details. This filtering operation is then followed by dyadic decimation, or downsampling by a factor of 2. Mathematically, the two-channel filtering of the discrete signal s is represented by the expressions

$$cA_1(k) = \sum_{n} s(n)\, \mathrm{Lo\_D}(2k - n), \qquad cD_1(k) = \sum_{n} s(n)\, \mathrm{Hi\_D}(2k - n)$$

These equations implement a convolution plus downsampling by a factor of 2 and give the forward fast wavelet transform. If the length of each filter is equal to 2N and the length of the original signal s is equal to n, then the corresponding lengths of the coefficients cA1 and cD1 are given by the formula

$$\mathrm{length}(cA_1) = \mathrm{length}(cD_1) = \left\lfloor \frac{n-1}{2} \right\rfloor + N$$

This shows that the total length of the wavelet coefficients is always slightly greater than the length of the original signal, due to the filtering process used.
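The sketch below (assuming PyWavelets) performs this single-level decomposition on a test vector and checks the length formula above.

import numpy as np
import pywt

n = 1000                               # signal length
s = np.random.randn(n)                 # stand-in for a speech frame

# Single-level DWT with db4 (filter length 2N = 8, so N = 4).
cA1, cD1 = pywt.dwt(s, 'db4')

N = pywt.Wavelet('db4').dec_len // 2
print(len(cA1), len(cD1))              # both equal floor((n - 1) / 2) + N
assert len(cA1) == (n - 1) // 2 + N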

4.5. MULTILEVEL DECOMPOSITION

The decomposition process can be iterated, with successive approximations being decomposed in turn, so that one signal is broken down into many lower-resolution components. This is called the wavelet decomposition tree. The wavelet decomposition of the signal s analyzed at level j has the following structure: [cAj, cDj, ..., cD1]. Looking at a signal's wavelet decomposition tree can reveal valuable information. The diagram below shows the wavelet decomposition to level 3 of a sample signal S.

Since the analysis process is iterative, in theory it can be continued indefinitely. In reality, the decomposition can proceed only until the vector consists of a single sample. Normally, however, there is little or no advantage gained in decomposing a signal beyond a certain level. The selection of the optimal decomposition level in the hierarchy depends on the nature of the signal being analyzed or on some other suitable criterion, such as the low-pass filter cut-off.
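A minimal multilevel sketch (PyWavelets assumed): a level-3 decomposition returns exactly the [cA3, cD3, cD2, cD1] structure described above.

import numpy as np
import pywt

s = np.random.randn(1000)              # stand-in for the sample signal S

# Level-3 wavelet decomposition tree: [cA3, cD3, cD2, cD1].
coeffs = pywt.wavedec(s, 'db4', level=3)
cA3, cD3, cD2, cD1 = coeffs
for name, c in zip(['cA3', 'cD3', 'cD2', 'cD1'], coeffs):
    print(name, len(c))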

4.6. SIGNAL RECONSTRUCTION

The original signal can be reconstructed, or synthesized, using the inverse discrete wavelet transform (IDWT). The synthesis starts with the approximation and detail coefficients cAj and cDj, and reconstructs cAj-1 by upsampling and filtering with the reconstruction filters. The reconstruction filters are designed in such a way as to cancel out the effects of the aliasing introduced in the wavelet decomposition phase. The reconstruction filters (Lo_R and Hi_R), together with the low- and high-pass decomposition filters, form a system known as quadrature mirror filters (QMF). For a multilevel analysis, the reconstruction process can itself be iterated, producing successive approximations at finer resolutions and finally synthesizing the original signal.
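The round trip can be checked numerically; the sketch below (PyWavelets assumed) reconstructs a signal from its level-3 coefficients and confirms that the error is at floating-point level.

import numpy as np
import pywt

s = np.random.randn(1000)
coeffs = pywt.wavedec(s, 'db4', level=3)

# IDWT: perfect reconstruction up to round-off error.
r = pywt.waverec(coeffs, 'db4')[:len(s)]
print('max reconstruction error:', np.max(np.abs(s - r)))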


5 WAVELETS AND SPEECH COMPRESSION

INTRODUCTION

The idea behind signal compression using wavelets is primarily linked to the relative sparseness of the wavelet-domain representation of the signal. Wavelets concentrate speech information (energy and perception) into a few neighbouring coefficients. Therefore, as a result of taking the wavelet transform of a signal, many coefficients will either be zero or have negligible magnitudes.

Another factor comes from psychoacoustic studies. Since our ears are more sensitive to low frequencies than to high frequencies, and our hearing threshold is very high in the high-frequency regions, we used a compression method in which the detail coefficients (corresponding to high-frequency components) of the wavelet transform are thresholded such that the error due to thresholding is inaudible to our ears. Since some of the high-frequency components are discarded, we should expect a smoothed output signal.

In summary, the notion behind compression is based on the concept that the regular signal component can be accurately approximated using a small number of approximation coefficients (at a suitably chosen level) and some of the detail coefficients. Data compression is then achieved by treating small-valued coefficients as insignificant data and discarding them. The process of compressing a speech signal using wavelets involves a number of different stages, each of which is discussed below.

5.1. CHOICE OF WAVELET

The choice of the mother wavelet used in designing high-quality speech coders is of prime importance. Choosing a wavelet that has compact support in both time and frequency, in addition to a significant number of vanishing moments, is essential for an optimal wavelet speech compressor. Strong candidates include the Daubechies D20, D12, D10 and D8 wavelets, all of which concentrate more than 96% of the signal energy in the level 1 approximation coefficients. Wavelets with more vanishing moments provide better reconstruction quality, as they introduce less distortion into the processed speech and concentrate more signal energy in a few neighboring coefficients.

5.2. WAVELET DECOMPOSITION

Wavelets work by decomposing a signal into different resolutions or frequency bands, and this task is carried out by choosing the wavelet function and computing the Discrete Wavelet Transform (DWT). Signal compression is based on the concept that selecting a small number of approximation coefficients (at a suitably chosen level) and some of the detail coefficients can accurately represent regular signal components. Choosing a decomposition level for the DWT usually depends on the type of signal being analyzed or on some other suitable criterion, such as entropy. For the processing of speech signals, decomposition up to scale 5 is adequate, with no further advantage gained in processing beyond scale 5.

5.3. TRUNCATION OF COEFFICIENTS

After calculating the wavelet transform of the speech signal, compression involves truncating the wavelet coefficients below a threshold. Most of the speech energy lies in the high-valued coefficients, which are few, so the small-valued coefficients can be truncated or zeroed out and the remaining ones used to reconstruct the signal. This compression scheme provided a segmental signal-to-noise ratio (SEGSNR) of 20 dB while retaining only 10% of the coefficients. Two different approaches are available for calculating thresholds. The first, known as global thresholding, involves taking the wavelet expansion of the signal and keeping the largest absolute-value coefficients. In this case one can manually set a global threshold, a compression performance target or a relative square norm recovery performance target, so only a single parameter needs to be selected. The second approach, known as by-level thresholding, consists of applying visually determined, level-dependent thresholds to each decomposition level in the wavelet transform.
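A minimal global-thresholding sketch follows (PyWavelets assumed; the 10% retention figure mirrors the text, while the quantile-based threshold choice is illustrative rather than the thesis's exact procedure).

import numpy as np
import pywt

s = np.random.randn(8000)                      # stand-in for a speech signal
coeffs = pywt.wavedec(s, 'db10', level=5)

# Global threshold: keep roughly the largest 10% of coefficients by magnitude.
flat = np.concatenate(coeffs)
thr = np.quantile(np.abs(flat), 0.90)

# Hard thresholding zeroes every coefficient whose magnitude is below thr.
coeffs_t = [pywt.threshold(c, thr, mode='hard') for c in coeffs]

kept = sum(int(np.count_nonzero(c)) for c in coeffs_t)
print('retained coefficients:', kept, 'of', flat.size)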

5.4. ENCODING COEFFICIENTS

Signal compression is achieved by first truncating small-valued coefficients and then efficiently encoding them. One approach is to encode each run of consecutive zero-valued coefficients with two bytes: one byte to indicate that a sequence of zeros occurs in the wavelet transform vector, and the second byte to give the number of consecutive zeros. For further data compaction, a suitable bit-encoding format can be used to quantize and transmit the data at low bit rates. A low bit rate representation can be achieved by using an entropy coder such as Huffman coding or arithmetic coding.
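The two-byte zero-run scheme might look like the sketch below (an illustrative encoding with an assumed escape marker; the thesis does not specify the exact byte layout).

from typing import List

ZERO_FLAG = 0   # assumed marker byte announcing that a run of zeros follows

def encode_zero_runs(q: List[int]) -> List[int]:
    """Run-length encode the zero runs in a quantized coefficient vector."""
    out, i = [], 0
    while i < len(q):
        if q[i] == 0:
            run = 0
            while i < len(q) and q[i] == 0 and run < 255:
                run += 1
                i += 1
            out += [ZERO_FLAG, run]    # two bytes per run of zeros
        else:
            out.append(q[i])           # nonzero values pass through unchanged
            i += 1
    return out

# Example: [5, 0, 0, 0, 0, 7] encodes to [5, ZERO_FLAG, 4, 7].
print(encode_zero_runs([5, 0, 0, 0, 0, 7]))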

5.5. DETECTING VOICED Vs UNVOICED SPEECH FRAMES

In speech there are two major types of excitation: voiced and unvoiced. Voiced sounds are produced when air flows between the vocal cords and causes them to vibrate; voiced speech therefore tends to be periodic in nature. Examples of voiced sounds are English vowels, such as the /a/ in "bay" and the /e/ in "see". Unvoiced sounds result from constricting the vocal tract at some point so that turbulence is produced by air flowing past the constriction. Since unvoiced speech is due to turbulence, it is aperiodic and has a noise-like structure. Some examples of unvoiced English sounds are the /s/ in "so" and the /h/ in "he". In general, at least 90% of the speech energy is retained in the first N/2 transform coefficients if the frame is voiced. For an unvoiced frame, however, the energy is spread across several frequency bands, and the first N/2 coefficients typically hold less than 40% of the total energy. As a result, wavelets are inefficient at coding unvoiced speech. Unvoiced speech frames are infrequent, so by detecting them and encoding them directly (perhaps using entropy coding), no unvoiced data is lost and the quality of the compressed speech remains transparent.
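A detector built on this energy criterion might look like the following sketch (the 90% cutoff echoes the figures in the text, but the exact threshold and frame handling are assumptions, not the thesis's calibrated values).

import numpy as np
import pywt

def is_voiced(frame: np.ndarray, wavelet: str = 'db10') -> bool:
    """Classify a frame as voiced if the first half of its wavelet
    coefficient vector carries at least 90% of the energy."""
    coeffs = np.concatenate(pywt.wavedec(frame, wavelet, level=5))
    total = np.sum(coeffs ** 2)
    low = np.sum(coeffs[:len(coeffs) // 2] ** 2)
    # Voiced frames keep >= 90% of the energy in the first N/2 coefficients;
    # unvoiced frames typically keep under 40%.
    return bool(total > 0 and low / total >= 0.9)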

5.6. PERFORMANCE MEASURES

A number of quantitative parameters can be used to evaluate the performance of the wavelet-based speech coder, in terms of both the reconstructed signal quality after decoding and the compression scores. The following parameters are compared:

5.6.1. SIGNAL TO NOISE RATIO

5.6.2. PEAK SIGNAL TO NOISE RATIO (PSNR)

5.6.3. NORMALISED ROOT MEAN SQUARE ERROR (NRMSE)


5.6.4. RETAINED SIGNAL ENERGY

5.6.5. COMPRESSION RATIOS
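The defining equations for these five measures were figures in the original and did not survive extraction. The standard definitions below are offered as a reconstruction (x is the original signal, r the reconstructed signal, N the number of samples, μx the mean of x; c denotes the full wavelet coefficient vector and ĉ the thresholded one), not the thesis's exact notation:

$$SNR = 10\log_{10}\frac{\sigma_x^2}{\sigma_e^2}, \qquad \sigma_e^2 = \frac{1}{N}\sum_n \big(x(n) - r(n)\big)^2$$

$$PSNR = 10\log_{10}\frac{N X^2}{\lVert x - r \rVert^2}, \qquad X = \max_n \lvert x(n) \rvert$$

$$NRMSE = \sqrt{\frac{\sum_n \big(x(n) - r(n)\big)^2}{\sum_n \big(x(n) - \mu_x\big)^2}}$$

$$\text{Retained energy (\%)} = 100 \cdot \frac{\lVert \hat{c} \rVert^2}{\lVert c \rVert^2}, \qquad \text{Compression ratio} = \frac{\text{bits in original signal}}{\text{bits in compressed representation}}$$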


6. PERFORMANCE OF RECORDED SPEECH CODING

Male and female speech signals were decomposed at scale 3, and level-dependent thresholds were applied using the Birge-Massart strategy. Since the speech files were of short duration, the entire signal was decomposed at once, without framing. A summary of the performance for the different wavelets used is given below.


7. CONCLUSION

Speech coding is currently an active research topic in the areas of Very Large Scale Integration (VLSI) circuit technology and Digital Signal Processing (DSP). The Discrete Wavelet Transform performs very well in the compression of recorded speech signals. For real-time speech processing, however, its performance is not as good; for real-time speech coding it is therefore recommended to use a wavelet with a small number of vanishing moments at a decomposition level of 5 or less. The wavelet-based compression scheme designed here reaches a signal-to-noise ratio of 17.45 dB at a compression ratio of 3.88 using the Daubechies 10 wavelet. The performance of the wavelet scheme, in terms of compression scores and signal quality, is comparable with that of other good techniques such as code-excited linear prediction (CELP) for speech, with much less computational burden. In addition, with wavelets the compression ratio can be varied easily, while most other compression techniques have fixed compression ratios.


8 FURTHER STUDY

This thesis was, admittedly, limited in scope. The main objective was to develop an appreciation for wavelet transforms, discuss their application in the compression of human speech signals and study the effect of a few parameters on the quality of compression. The parameters studied were: sampling frequency, type of wavelet, threshold, and male/female voice.

There are a few topics that were deliberately excluded due to the limitations of our undergraduate study. Some of them are stated below; any or all of these topics can be explored further with a view to achieving better performance. The analysis that we undertook includes only orthogonal, compactly supported wavelets. The reader may find it interesting to study the effect of other wavelets on compression.

Secondly, the sound files that we tested were of limited duration, around 5 seconds. Although the programs will run on larger files (with correspondingly longer computation times), a better approach for such files is to use frames of finite length. This procedure, more common in real-time compression of sound, was not discussed here. Encoding was performed using only run-length encoding; the effect of other encoding schemes on the compression factor has not been studied. In fact, higher compression ratios are expected with coding techniques like Huffman coding. This thesis considered only wavelet analysis, wherein only the approximation coefficients are split. There exists another analysis, called wavelet packet analysis, which splits the detail coefficients too; this was not explored in this thesis. Last but not least, the effect of the wavelet transform on voiced and unvoiced speech is different, and thus the compression ratios ought to differ as well.