
    SPEECH RECOGNITION SYSTEM

    Submitted in partial fulfillment of the requirements for the award of

    Degree of

    BACHELOR OF TECHNOLOGY

    IN

    COMPUTER SCIENCE AND ENGINEERING

    Submitted By:

    ADITYA SHARMA

    Roll No: 1005210005

    2013-2014

    Under the supervision of

    Dr. Y N Singh

    Associate Professor

    DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

    INSTITUTE OF ENGINEERING AND TECHNOLOGY

    LUCKNOW


    Certificate

    This is to certify that the project entitled Speech Recognition System submitted by Aditya

    Sharma, for the award of the degree of Bachelor of Technology in Computer Science and

    Engineering is a record of the bona fide work carried out by him under my guidance and

    supervision at the Department of Computer Science and Engineering, Institute of

    Engineering and Technology, Lucknow.

    This work has not been submitted anywhere else for the award of any other degree.

    Dr. Y N Singh

    Associate Professor

    Dept. of Computer Science and Engineering

    IET Lucknow


    Acknowledgment

    I would like to place on record my deep appreciation and gratitude towards my project supervisor

    Dr. Y N Singh, Associate Professor, Dept. of Computer Science and Engineering, for his invaluable

    support and encouragement. I would also like to express my heartfelt thanks to Mr. Radhe Shyam, who provided me with invaluable guidance in finding the various resources required for the project.

    I would like to thank Dr. Manish Gaur, Associate Professor, Dept. of Computer Science and

    Engineering, for his kindness and graciousness in providing me a disciplined environment for finishing my project.

    I would also like to thank Prof. Lawrence Rabiner of the Univ. of California Santa Barbara whose

    paper inspired me to use Hidden Markov Models in my effort to make an efficient and robust speech recognition system.

    Last but not the least, I would like to thank my parents for their encouragement and support in my

    studies.

    Aditya Sharma

    B. Tech. Final Year

    Computer Science and Engineering

    IET Lucknow


    Abstract

    This report takes a brief look at the basic building blocks of a speech recognition system. It discusses the implementation of the different modules required for the construction of a speech recognition system. The word detector, feature extractor, and HMM training and recognition modules have been described in detail. The objective of the project is the implementation of a connected word speech recognition system using Hidden Markov Models. This involves the design of efficient MATLAB code on a PC. This phase of the project involves the development of a limited domain recognition engine spanning a limited vocabulary and a constrained environment only. The system can recognize single words as well as connected words. The results of the executions that were conducted are also provided. The system shows a high accuracy rate of nearly 90-100% when it is executed in a known environment and with a known user. The report ends with a conclusion and future work.


    Table of Contents

    1. Introduction
    1.1. An overview of Speech Recognition ...................................................................... 7

    1.2. Problem Definition and Scope ............................................................................... 7

    1.3. Design of the System ............................................................................................ 8

    1.4. History................................................................................................................. 9

    1.5. Uses of Speech Recognition System .................................................................... 10

    1.6. Applications ....................................................................................................... 10

    1.7. Speech Recognition Weakness and Flaws ............................................................ 11

    1.8. Related Works.................................................................................................... 11

    1.9. The future of Speech recognition ......................................................................... 12

    1.10. Overview of the Project Report ......................................................................... 12

    2. Types of Speech Recognition Systems

    2.1. Based on Algorithms
    2.1.1. Hidden Markov Model based................................................................... 14

    2.1.2. Dynamic Time Warping Based ................................................................ 14

    2.1.3. Artificial Neural Network based .............................................................. 14

    2.2. Based on ability to recognize words

    2.2.1. Isolated Speech Recognition.................................................................... 15

    2.2.2. Connected Speech .................................................................................. 15

    2.2.3. Continuous Speech ................................................................................. 15

    2.2.4. Spontaneous Speech ............................................................................... 15

    2.3. Based on dependency on user

    2.3.1. Speaker dependent speech recognition ..................................................... 15

    2.3.2. Speaker Independent speech recognition .................................................. 15

    3. Words Detection and Extraction
    3.1. Principle of Word Detection ................................................................................ 16

    3.2. Methodology ...................................................................................................... 16

    3.3. Performance ....................................................................................................... 19

    4. Feature Extraction .................................................................................................. 20

    5. Knowledge Models

    5.1. Acoustic Models
    5.1.1. Word Model .......................................................................................... 24

    5.1.2. Phone Model .......................................................................................... 25

    5.1.2.1. Context Independent Phone Model .................................................... 26

    5.1.2.2. Context Dependent Phone Model....................................................... 26

    5.2. Language Model
    5.2.1. Classification ......................................................................................... 27

    6. Hidden Markov Model
    6.1. HMM and Speech Recognition ............................................................................ 29

    6.2. Three Basic Problems of Hidden Markov Models ................................................. 29

    6.3. Solution to Problem 1 - Probability Evaluation..................................................... 30

    6.3.1. The forward Algorithm ........................................................................... 31

    6.3.2. The Backward Algorithm ........................................................................ 33

    6.3.3. Scaling the Forward and Backward Variables ........................................... 35


    6.4. Solution to Problem 2 - Optimal State Sequence ............................................... 39

    6.4.1. The Viterbi Algorithm ............................................................................ 40

    6.4.2. The Alternative Viterbi Algorithm ........................................................... 41

    6.5. Solution to Problem 3 - Parameter Estimation....................................................... 43

    6.5.1. Initial Estimates of HMM Parameters ...................................................... 47

    7. Implementation....................................................................................................... 49

    7.1. Software Modules............................................................................................... 49

    7.2. Working of the System ....................................................................................... 50

    8. Results and Conclusion ........................................................................................... 54

    8.1. Training ............................................................................................................. 54

    8.2. Recognition........................................................................................................ 54

    8.3. Conclusion ......................................................................................................... 55

    8.4. Future Work ....................................................................................................... 55

    Reading and References.......................................................................................... 56
    Appendix A - Source Code in MATLAB .................................................................. 57


    List of Figures                                                                  Page No.
    1. Fig 1.1 Block Diagram of Speech Recognition System 8

    2. Fig 3.1 Speech Sample 17

    3. Fig 3.2 Energy plot of the sample 17

    4. Fig 3.3 Zero crossing rate of the sample 18

    5. Fig 3.4 Detected word from the sample 18

    6. Fig 4.1 Steps in Feature Extraction 21

    7. Fig 4.2 Mel Scale Filter Bank 22

    8. Fig 5.1 Word based Acoustic Model 25

    9. Fig 5.2 Phone based Acoustic Model 26

    10. Fig 6.1 Diagrammatic Representation of HMM 28

    11. Fig 6.2 Forward Variable Calculation 32

    12. Fig 6.3 Backward Procedure-Induction step 34

    13. Fig 6.4 Baum-Welch method 44

    14. Fig 6.5 Left-Right model of HMM 47

    15. Fig 7.1 Block Diagram of working of the system 51

    16. Fig 7.2 The main interface 52

    17. Fig 7.3 Training 52

    18. Fig 7.4 Recognition 53


    Chapter One

    Introduction

    1.1 An overview of Speech Recognition

    Speech recognition refers to the ability to listen to spoken words (input in audio format), identify the various sounds present in them, and recognize them as words of some known language. Speech recognition in the computer systems domain may then be defined as the ability of computer systems to accept spoken words in audio format - such as wav or raw - and generate their content in text format.

    Speech recognition in the computer domain involves various steps, each with issues attached to it. The steps required to make computers perform speech recognition are: voice recording, word boundary detection, feature extraction, and recognition with the help of knowledge models.

    Word boundary detection is the process of identifying the start and the end of a spoken word in the given sound signal[8]. While analyzing the sound signal, at times it becomes difficult to identify the word boundary. This can be attributed to differences in how people speak, such as their accent and the duration of the pause they leave between words.

    Feature Extraction refers to the process of conversion of sound signal to a form suitable for

    the following stages to use. Feature extraction may include extracting parameters such as

    amplitude of the signal, energy of frequencies, etc.

    Recognition involves mapping the given input (in the form of various features) to one of the known sounds. This may involve the use of various knowledge models for precise identification and ambiguity removal.

    Knowledge models refer to models such as the phone acoustic model, language models, etc., which help the recognition system. To generate the knowledge models one needs to train the system. During the training period one shows the system a set of inputs and the outputs they should map to. This is often called supervised learning.

    1.2 Problem Definition and Scope

    The aim of this project is to build a speech recognition system for the English language. It is a connected word speech recognition system. The system will receive speech input consisting of a series of connected words and it will output the text sequence corresponding to it. The system will use a Continuous Hidden Markov Model for acoustic modeling of the speech.

    Scope: This project has speech recognizing capabilities. It is designed to work in noise-constrained as well as user-constrained environments. The software can recognize a word and convert it into text, which can be linked to an action such as the execution of a command, opening a program, sending mails, etc. It can recognize a single word as well as a sequence of connected words separated by a small pause.

    1.3 Design of the system

    An abstract overview of the system is visualized as a block diagram (Fig 1.1, Block Diagram of Speech Recognition System). The various components are briefed below:

    Word Detector and Extractor: This component is responsible for taking the input from the microphone or from a prerecorded WAV file and detecting the words present in it. In this way the speech signal is segmented into individual words, which can then be processed independently. The word boundary is detected by measuring the Energy and the Zero Crossing Rate of the signal.

    Feature Extractor: This component generates feature vectors from the given signal. It generates Mel Frequency Cepstrum Coefficients (MFCCs) and normalized energy as features that are used to uniquely identify the given sound signal[9].

    Recognizer Module: This is a continuous Hidden Markov Model[7] based

    component. It is the most important part of the system which performs the actual

    recognition of the speech signal by finding the best match in the knowledge

    base.

    Acoustic and Knowledge model: These components define the structure of the sound and how it will be represented in the computer. These models are used to map and model the acoustic characteristics of the given sound signal.

    1.4 History

    The concept of speech recognition started somewhere in the 1940s; practically, the first speech recognition program appeared in 1952 at Bell Labs, and it was about the recognition of a digit in a noise-free environment.

    The 1940s and 1950s are considered the foundational period of speech recognition technology; in this period work was done on the foundational paradigms of speech recognition, that is, automation and information-theoretic models.

    In the 1960s it became possible to recognize small vocabularies (on the order of 10-100 words) of isolated words, based on simple acoustic-phonetic properties of speech sounds. The key technologies developed during this decade were filter banks and time normalization methods.

    In the 1970s medium vocabularies (on the order of 100-1000 words) were recognized using simple template-based pattern recognition methods.

    In the 1980s large vocabularies (1000 words to unlimited) were used, and speech recognition problems based on statistical methods, with a large range of networks for handling language structures, were addressed. The key inventions of this era were the Hidden Markov Model (HMM) and the stochastic language model, which together enabled powerful new methods for handling the continuous speech recognition problem efficiently and with high performance.

    In the 1990s the key technologies developed were the methods for stochastic language understanding, statistical learning of acoustic and language models, and the methods for implementation of large vocabulary speech understanding systems.


    After five decades of research, speech recognition technology has finally entered the marketplace, benefiting users in a variety of ways. The challenge of designing a machine that truly functions like an intelligent human is still a major one going forward.

    1.5 Uses of Speech Recognition System

    Basically, speech recognition is used for two main purposes. The first and foremost is dictation, which in the context of speech recognition is the translation of spoken words into text; the second is controlling the computer, that is, developing software capable of allowing a user to operate different applications by voice.

    Writing by voice lets a person write 150 words per minute or more, if he/she can indeed speak that quickly. This aspect of speech recognition programs creates an easy way of composing text and helps people in that industry compose millions of words digitally in a short time rather than writing them one by one; this way they can save time and effort.

    Speech recognition is an alternative to the keyboard. If you are unable to write or just don't want to type, then speech recognition programs help you do almost anything that you used to do with a keyboard.

    1.6 Applications

    1.6.1 From a medical perspective: People with disabilities can benefit from speech recognition programs. Speech recognition is especially useful for people who have difficulty using their hands; in such cases speech recognition programs are very beneficial and can be used for operating computers. Speech recognition is also used in deaf telephony, such as voicemail-to-text.

    1.6.2 From a military perspective: Speech recognition programs are important from the military perspective; in the Air Force, speech recognition has definite potential for reducing pilot workload. Besides the Air Force, such programs can also be trained for use in helicopters, battle management and other applications.

    1.6.3 From an educational perspective: Individuals with learning disabilities who have problems with thought-to-paper communication (essentially they think of an idea but it is processed incorrectly, causing it to end up differently on paper) can benefit from the software.

    1.7 Speech Recognition Weaknesses and Flaws

    Besides all these advantages and benefits, a hundred percent perfect speech recognition system has yet to be developed. There are a number of factors that can reduce the accuracy and performance of a speech recognition program. The speech recognition process is easy for a human but a difficult task for a machine. Compared with the human mind, speech recognition programs seem less intelligent; for a human, the capability of thinking, understanding and reacting is natural, while for a computer program it is a complicated task: it first needs to understand the spoken words with respect to their meanings, and it has to create a sufficient balance between the words, noise and spaces. A human has a built-in capability of filtering the noise from speech, while a machine requires training; the computer requires help in separating the speech sound from the other sounds.

    A few factors that are considerable in this regard are:

    Homonyms: These are words that are spelled differently and have different meanings but sound the same, for example "there" and "their", "be" and "bee". It is a challenge for a machine to distinguish between such phrases that sound alike.

    Overlapping speech: A second challenge in the process is to understand speech uttered by different users; current systems have difficulty separating simultaneous speech from multiple users.

    Noise factor: The program requires hearing the words uttered by a human distinctly and clearly. Any extra sound can create interference; the system first needs to be placed away from noisy environments, and the speaker must speak clearly, else the machine will get confused and mix up the words.

    1.8 Related Works

    A lot of speech-aware applications are already available in the market. Various dictation software packages have been developed by Dragon[10], IBM and Philips. Genie is an interactive speech recognition software developed by Microsoft. Various voice navigation applications, one developed by AT&T, allow users to control their computer by voice, for example browsing the Internet by voice. Many more applications of this kind are appearing every day.


    The SPHINX speech recognizer from CMU[11] provides the acoustic as well as the language models used for recognition. It is based on Hidden Markov Models (HMMs). The SONIC recognizer[12], developed by the University of Colorado, is another one. There are other recognizers such as XVoice[13] for Linux that take input from IBM's ViaVoice, which now exists only for Windows. Background noise is the worst part of a speech recognition process: it confuses the recognizer and makes it unable to hear what it is supposed to. One such recognizer has been devised for robots that, despite the inevitable motor noises, lets them communicate with people efficiently. This is made possible by using a noise-type-dependent acoustic model corresponding to a performing motion of the robot. Optimizations for speech recognition on an HP SmartBadge IV embedded system have been proposed to reduce the energy consumption while still maintaining the quality of the application.

    Another such scalable system has been proposed for DSR (Distributed Speech Recognition) by combining it with scalable compression, hence reducing the computational load as well as the bandwidth requirement on the server. Various capabilities of current speech recognizers in the field of telecommunications, like Voice Banking and Directory Assistance, have also been described.

    1.9 The future of speech recognition

    Accuracy will become better and better.

    Dictation speech recognition will gradually become accepted.

    Greater use will be made of intelligent systems which will attempt to guess

    what the speaker intended to say, rather than what was actually said, as people

    often misspeak and make unintentional mistakes.

    Microphone and sound systems will be designed to adapt more quickly to

    changing background noise levels, different environments, with better

    recognition of extraneous material to be discarded.

    1.10 Overview of the Project Report

    The contents of the chapters are as follows:

    Chapter 2 discusses the types of speech recognition systems based on algorithms, speaker dependency and their ability to recognize lists of words. It also compares the different types of algorithms based on their application and reliability.


    Chapter 3 explains how the words are detected and extracted from the speech

    samples. It discusses the algorithms and techniques which are required to

    implement a word boundary detector.

    Chapter 4 explains how the features are extracted from the given speech

    samples. It explains how the signal can be segmented into overlapping frames

    and each frame can be transformed into set of multi-dimensional vectors.

    Chapter 5 explains what an acoustic model is and how it is represented using an HMM. It also gives a brief overview of language models.

    Chapter 6 explains the Hidden Markov Model in detail and how it can be used in speech recognition. It explains in detail how a continuous HMM can be implemented for the recognition of the words in a speech signal.

    Chapter 7 explains how the project was implemented in MATLAB. It gives brief

    information about the modules used in the software.

    Chapter 8 concludes the report with a summary of the work done, the next proposed steps to be taken and the further work which can be done.

    An appendix is provided at the end which contains the detailed source code of the project in MATLAB.


    Chapter Two

    Types of Speech Recognition Systems

    Speech recognition systems can be classified on various bases, such as the algorithm used, the ability to recognize words and lists of words, dependency on the user, etc. Some of the classifications are explained below:

    2.1 Based on Algorithms: There are mainly three popular approaches to perform speech recognition, namely Hidden Markov Model (HMM) based, Dynamic Time Warping (DTW) based and Artificial Neural Network based.

    2.1.1 Hidden Markov Model based: Modern general-purpose speech recognition systems are based on Hidden Markov Models[1][2][3][4][5][6]. These are statistical models that

    output a sequence of symbols or quantities. HMMs are used in speech recognition because

    a speech signal can be viewed as a piecewise stationary signal or a short-time stationary

    signal. In a short time-scale (e.g., 10 milliseconds), speech can be approximated as a

    stationary process. Speech can be thought of as a Markov model for many stochastic

    purposes. Another reason why HMMs are popular is because they can be trained

    automatically and are simple and computationally feasible to use.

    2.1.2 Dynamic Time Warping based: Dynamic time warping[2] is an approach that was historically used for speech recognition but has now largely been displaced by the more successful HMM-based approach. Dynamic time warping is an algorithm for measuring similarity between two sequences that may vary in time or speed. For instance, similarities in walking patterns would be detected, even if in one video the person was walking slowly and in another he or she was walking more quickly, or even if there were accelerations and decelerations during the course of one observation. DTW has been applied to video, audio, and graphics; indeed, any data that can be turned into a linear representation can be analyzed with DTW.
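    As an illustration of the idea only (this project is HMM based and does not use DTW), a minimal MATLAB sketch of the classic DTW recursion between two feature sequences might look as follows; the function name and the Euclidean local distance are illustrative assumptions:

        function cost = dtw_distance(X, Y)
            % X (d x N) and Y (d x M) are two sequences of feature vectors, one per column.
            N = size(X, 2); M = size(Y, 2);
            D = inf(N + 1, M + 1);          % accumulated cost matrix
            D(1, 1) = 0;
            for i = 1:N
                for j = 1:M
                    d = norm(X(:, i) - Y(:, j));            % local frame distance
                    % allow a match, an insertion or a deletion of a frame (the warping)
                    D(i + 1, j + 1) = d + min([D(i, j), D(i, j + 1), D(i + 1, j)]);
                end
            end
            cost = D(N + 1, M + 1);          % total alignment cost between the sequences
        end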

    2.1.3 Artificial Neural Network based: Neural networks emerged as an attractive acoustic modeling approach in ASR in the late 1980s. Since then, neural networks have

    been used in many aspects of speech recognition such as phoneme classification, isolated

    word recognition, and speaker adaptation. In contrast to HMMs, neural networks make no

    assumptions about feature statistical properties and have several qualities making them

    attractive recognition models for speech recognition. When used to estimate the

    probabilities of a speech feature segment, neural networks allow discriminative training in

    a natural and efficient manner. Few assumptions on the statistics of input features are made

    with neural networks. However, in spite of their effectiveness in classifying short-time units

    such as individual phones and isolated words, neural networks are rarely successful for

    continuous recognition tasks, largely because of their lack of ability to model temporal dependencies. Thus, one alternative approach is to use neural networks as a pre-processing step, e.g. feature transformation or dimensionality reduction, for HMM-based recognition.

    2.2 Based on ability to recognize words: Speech recognition systems can be divided into a number of classes based on their ability to recognize words and the lists of words they have. A few classes are as under:

    2.2.1 Isolated Speech Recognition: Isolated word recognition usually requires a pause between two utterances; it doesn't mean that it only accepts a single word, but instead it requires one utterance at a time.

    2.2.2 Connected Speech: Connected words or connected speech is similar to isolated speech but allows separate utterances with minimal pause between them.

    2.2.3 Continuous Speech: Continuous speech allows the user to speak almost naturally; it is also called computer dictation.

    2.2.4 Spontaneous Speech: At a basic level, spontaneous speech can be thought of as speech that is natural sounding and not rehearsed. An ASR system with spontaneous speech ability should be able to handle a variety of natural speech features such as words being run together, "ums" and "ahs", and even slight stutters.

    2.3 Based on dependency on user: Based on the dependency on the user's voice, speech recognition systems can be classified as:

    2.3.1 Speaker dependent speech recognition: Speaker-dependent systems work by learning the unique characteristics of a single person's voice, in a way similar to voice recognition. New users must first "train" the software by speaking to it, so the computer can analyze how the person talks. This often means users have to read a few pages of text to the computer before they can use the speech recognition software.

    2.3.2 Speaker Independent speech recognition: Speaker-independent software is designed to recognize anyone's voice, so no training is involved. This means it is the only real option for applications such as interactive voice response systems, where businesses can't ask callers to read pages of text before using the system. The downside is that speaker-independent software is generally less accurate than speaker-dependent software.


    Chapter Three

    Words Detection and Extraction

    This component's responsibility is to accept input from a microphone and forward it to the feature extraction module. Before converting the signal into a suitable or desired form, it also performs the important task of identifying the segments of the sound containing words. It also has a provision for saving the sound into WAV files, which are needed by the training component. The microphone is configured to receive the input signal at a sampling rate of 8000 samples per second with 16 bits per sample and a mono channel.

    3.1 Principle of Word Detection

    In speech recognition it is important to detect when a word is spoken. The system detects the regions of silence; anything other than silence is considered a spoken word by the system. The system uses the energy pattern present in the sound signal and the zero crossing rate to detect the silent regions. Taking both of them is important, as energy alone tends to miss some parts of sounds which are important. This process is also called Voice Activity Detection.

    3.2 Methodology

    For word detection, a sample is broken into frames by taking frame samples every 10 milliseconds. Consecutive segments are separated by an overlapping distance which is nearly 50% of the length of the frame. The energy and zero crossings for this duration are calculated. Energy is calculated by adding the square of the value of the waveform at each instance and then dividing it by the number of instances over the period of the sample. The zero crossing rate is the number of times the value of the wave goes from a negative value to positive or vice-versa.

    The Word Detector assumes that the first 100 milliseconds are silence. It uses the average Energy and average Zero Crossing Rate obtained during this time to characterize the background noise. The upper thresholds for energy and zero crossing are set to 2 times the average values of the background noise. The lower thresholds are set to 0.75 times the upper thresholds.

    While detecting the presence of a word in the sound, if the energy or the zero crossing rate goes above the upper threshold and stays above it for three consecutive samples, a word is assumed to be present and the recording is started. The recording continues till the energy and zero crossing both fall below the lower threshold and stay there for at least 30 milliseconds[8].
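    A minimal MATLAB sketch of the frame energy and zero-crossing computation with the thresholds described above is given below; the signal variable x and the exact frame bookkeeping are illustrative assumptions, not the project code:

        fs       = 8000;                        % sampling rate (samples per second)
        frameLen = round(0.010 * fs);           % 10 ms frames
        hop      = round(frameLen / 2);         % about 50% overlap between frames
        nFrames  = floor((length(x) - frameLen) / hop) + 1;
        energy   = zeros(1, nFrames);  zcr = zeros(1, nFrames);
        for k = 1:nFrames
            frame     = x((k-1)*hop + 1 : (k-1)*hop + frameLen);
            energy(k) = sum(frame .^ 2) / frameLen;          % average energy of the frame
            zcr(k)    = sum(abs(diff(sign(frame)))) / 2;     % zero-crossing count
        end
        nSil   = round(0.100 * fs / hop);       % the first 100 ms is assumed to be silence
        eUp    = 2 * mean(energy(1:nSil));   zUp = 2 * mean(zcr(1:nSil));   % upper thresholds
        eLow   = 0.75 * eUp;                 zLow = 0.75 * zUp;             % lower thresholds
        inWord = (energy > eUp) | (zcr > zUp);  % candidate speech frames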

    Fig 3.1 Speech Sample

    Fig 3.2 Energy plot of the samples


    Fig 3.3 Zero Crossing Rate Plot of the samples

    Fig 3.4 Detected Word from the sample

    By applying the same algorithm after the extraction of a word, all the words present in the speech sample can be extracted in this way, provided they are separated by a small pause. The set of words extracted from the signal is then fed into the feature extractor, where the signal is converted into a set of feature vectors. This word detection algorithm is only applicable when the speech signal is constituted of connected words with a pause between consecutive words.


    In the case of continuous and spontaneous speech it is not possible to apply the Energy and Zero Crossing Rate techniques, because the word boundaries in continuous speech are not visible due to fusion.

    3.3 Performance

    The word detector was developed in MATLAB and tested. The word detector showed good performance and was able to extract all the words from the signal, provided there was a half-second gap between two consecutive words.

    Some words may themselves be constituted of sub-words at some distance. This problem can be solved by setting a threshold distance and merging the frames in case the distance between them is less than the threshold. In the experiment performed, the threshold distance was measured to be 100 frames.


    Chapter Four

    Feature Extraction

    Humans have the capacity to identify different types of sounds (phones). Phones put in a particular order constitute a word. If we want a machine to identify a spoken word, it will have to differentiate between different kinds of sound the way humans perceive them. The point to be noted in the case of humans is that although one word spoken by different people produces different sound waves, humans are able to identify the sound waves as the same. On the other hand, two sounds which are different are perceived as different by humans. The reason is that even when the same phones or sounds are produced by different speakers they have common features. A good feature extractor should extract these features and use them for further analysis and processing. So feature extraction is mainly the extraction of relevant information from the speech blocks. A variety of choices for this task can be applied. The most commonly used methods for speech recognition are linear prediction and Mel-cepstrum coefficient calculation[9]. These measures are widely used, and here are some reasons why:

    These measures provide a good model of the speech signal. This is particularly true in the quasi steady state of voiced regions of speech.

    The way these measures are calculated leads to reasonable source-vocal tract separation. This property leads to a fairly good representation of the vocal tract characteristics.

    The measures have an analytically tractable model.

    Experience has shown that these measures work well in speech recognition applications.

    Other measures to add to the feature vectors are the energy measures and also the delta and acceleration coefficients. Delta coefficients mean that a derivative approximation of some measures (e.g. MFCC coefficients) is added, and the acceleration coefficients are the second derivative approximation of those measures.

    The steps of feature extraction are shown in Fig 4.1[2]:


    Windowing: Windowing is the process in which the speech signal is segmented into overlapping frames. Each segment is called a frame. The length of each frame is 0.01 seconds. The overlap between two frames is kept at up to 50%, i.e. 0.005 seconds. There are various types of windows which can be used: the Rectangular window, the Bartlett window and the Hamming window.

    The system developed uses the Hamming window as it introduces the least amount of distortion. The impulse response of the Hamming window is a raised cosine, given by the window function:

    w(n) = 0.54 + 0.46 cos(n*pi/m), for -m <= n <= m
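    A small MATLAB sketch of this windowing step, assuming a signal x sampled at rate fs; the window here is computed directly from the shifted (0-based) form of the raised-cosine formula above, so no toolbox function is required:

        frameLen = round(0.010 * fs);                  % 0.01 s frames
        hop      = round(0.005 * fs);                  % 0.005 s step, i.e. 50% overlap
        n        = (0:frameLen-1)';
        w        = 0.54 - 0.46 * cos(2*pi*n / (frameLen-1));   % Hamming window, shifted form
        nFrames  = floor((length(x) - frameLen) / hop) + 1;
        frames   = zeros(frameLen, nFrames);
        for k = 1:nFrames
            seg          = x((k-1)*hop + 1 : (k-1)*hop + frameLen);
            frames(:, k) = w .* seg(:);                % one windowed frame per column
        end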


    Mel Scale: After the framing of the signal, an N-point DFT is calculated to analyze its spectrum. The frequencies are mapped onto the mel scale, which is given by:

    mel(f) = 2595 * log10(1 + f/700)

    Fig 4.2 Mel Scale Filter Bank

    The frequencies are picked according to a logarithmic scale as they match the human auditory system. Then the amplitude values corresponding to the mel frequencies are calculated and their logarithms are taken. This sequence of log values is treated as a signal and an inverse discrete cosine transformation is performed on it. The resulting coefficients are called the cepstrum.
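    A rough MATLAB sketch of this computation for one windowed frame is shown below; H is assumed to be a precomputed mel-spaced triangular filter bank (nFilt x (N/2+1)), dct is from the Signal Processing Toolbox, and it plays the role of the inverse cosine transform mentioned above:

        N       = 256;                            % DFT length
        spec    = abs(fft(frame, N));             % magnitude spectrum of the frame
        spec    = spec(1 : N/2 + 1);              % keep the non-negative frequencies
        melSpec = H * spec;                       % output of each mel-spaced filter
        logMel  = log(melSpec);                   % take logarithms of the filter outputs
        c       = dct(logMel);                    % cosine transform of the log values
        mfcc    = c(1:13);                        % keep the first few cepstral coefficients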

    Lifter: A lifter is used to zero out (or cut away) some of the last mel cepstrum coefficients. After this the final mel cepstrum values are found. This is done to remove or discard the unwanted mel coefficients.


    Energy Measures: An extra measure to augment the coefficients derived from the mel-cepstrum is the log of

    signal energy. This means for every frame an extra energy term is added.

    Delta and Acceleration coefficients: Spectral transitions are believed to play an important role in human speech perception. Therefore it is desirable to add information about time differences, i.e. the delta coefficients, and also the acceleration coefficients.
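    A simple way to approximate these derivatives is sketched below in MATLAB for a matrix C holding one cepstral vector per column; real systems often use a regression over several neighbouring frames rather than this plain difference:

        delta    = [zeros(size(C,1), 1), diff(C, 1, 2)];      % first difference across frames
        accel    = [zeros(size(C,1), 1), diff(delta, 1, 2)];  % second difference (acceleration)
        features = [C; delta; accel];                         % static + delta + acceleration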


    Chapter Five

    Knowledge Models

    For speech recognition, the system needs to know how the words sound. For this we need to train

    the system. During the training, using the data given by the user, the system generates acoustic

    model and language model. These models are later used by the system to map a sound to a word

    or a phrase.

    5.1 Acoustic Model

    Features that are extracted by the Feature Extraction module need to be compared against a model to identify the sound that was produced as the word that was spoken. This model is called the Acoustic Model.

    There are two kinds of Acoustic Models:

    1. Word Model

    2. Phone Model

    5.1.1 Word Model

    Word models are generally used for small vocabulary systems. In this model the words are modelled as a whole; thus each word needs to be modelled separately. If we need to add support to recognize a new word, we will have to train the system for that word. In the recognition process, the sound is matched against each of the models to find the best match. This best match is assumed to be the spoken word. Building a model for a word requires us to collect sound files of the word from various users. These sound files are then used to train an HMM model. Figure 5.1 shows a diagrammatic representation of the word based acoustic model.


    Fig 5.1 Word Based Acoustic Model

    5.1.2 Phone Model

    In the phone model, instead of modelling the whole word, we model only parts of words, generally phones, and the word itself is modelled as a sequence of phones. The heard sound is now matched against the parts and the parts are recognized. The recognized parts are put together to form a word. For example, the word "ek" is generated by a combination of the two phones "A" and "k". This is generally useful when we need a large vocabulary system. Adding a new word to the vocabulary is easy: as the sounds of the phones are already known, only the possible sequences of phones for the word, with their probabilities, need to be added to the system. Figure 5.2 shows a diagrammatic representation of the phone based acoustic model.


    Fig 5.2 Phone based Acoustic Model

    Phone models can be further classified into:

    1. Context-Independent Phone Model

    2. Context-Dependent Phone Model

    5.1.2.1 Context-Independent Phone Model

    In this model individual phones are modelled. The context that they occur

    is not modelled. The good thing about this model is that the number of

    phone that have to be modelled is small. Thus the complexity of the

    system is less.

    5.1.2.2 Context-Dependent Phone Model


    While modelling a phone, its neighbors are also considered. This means that "iy" surrounded by "z" and "r" is a separate entity as compared to "iy" surrounded by "h" and "r". This results in a growth of the number of modelled phones, which increases the complexity.

    In both word acoustic model and phone acoustic model we need to model

    silence and filler words too. Filler words are the sounds that humans produce

    between two words.

    Both these models can either be implemented using a Hidden Markov Model

    or a Neural Network. HMM is more widely used technique in automatic

    speech recognition systems.

    5.2 Language Model

    Although there are words that have similar sounding phone, humans generally do not find

    it difficult to recognize the word. This is mainly because they know the context, and also

    have a fairly good idea about what words or phrases can occur in the context. Providing

    this context to a speech recognition system is the purpose of language model. The language

    model specifies what are the valid words in the language and in what sequence they can

    occur.

    5.2.1 Classification

    Language models are classified into several categories:

    Uniform Models: Each word has equal probability of occurrence.

    Stochastic Model: Probability of occurrence of a word depends on the words preceding it.

    Finite State Language: Language uses finite state network to define allowed word sequences.

    Context Free Grammar: Context free grammar can be used to encode which kind of sentences are allowed.
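    As a toy illustration of the stochastic model, a bigram model over a small three-word vocabulary can be built from (hypothetical) counts and used to score a word sequence, sketched here in MATLAB:

        counts = [2 5 1;                      % counts of word i followed by word j
                  4 1 3;
                  1 2 2];
        B      = counts ./ sum(counts, 2);    % row-normalize: B(i,j) = P(word j | word i)
        seq    = [1 2 3];                     % a word sequence as vocabulary indices
        logP   = 0;
        for t = 2:numel(seq)
            logP = logP + log(B(seq(t-1), seq(t)));   % accumulate log bigram probabilities
        end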


    Chapter Six

    Hidden Markov Model

    The Hidden Markov Model (HMM)[1][2][3][4][5][6] is a state machine. The states of the model are represented as nodes and the transitions are represented as edges. The difference in the case of an HMM is that the symbol does not uniquely identify a state. The new state is determined by the symbol and the transition probabilities from the current state to a candidate state.

    Fig 6.1 Diagrammatic Representation of HMM

    The figure above shows a diagrammatic representation of an HMM. Nodes denoted as circles are states. O1 to O5 are observations. Observation O1 takes us to state S1. aij defines the transition probability between Si and Sj. It can be observed that the states also have self-transitions. If we are at state S1 and observation O2 is observed, we can either decide to go to state S2 or stay in state S1. The decision is made depending on the probability of the observation at both the states and the transition probability.

    Thus the HMM model is defined as:

    λ = (Q, O, A, B, π)

    where Q is {qi} (all possible states),

    O is {vi} (all possible observations),

    A is {aij}, where aij = P(Xt+1 = qj | Xt = qi) (transition probabilities),

    B is {bi}, where bi(k) = P(Ot = vk | Xt = qi) (probability of observing symbol vk in state i),

    π is {πi}, where πi = P(X0 = qi) (initial state probabilities).

    Xt denotes the state at time t.

    Ot denotes the observation at time t.

    6.1 HMM and Speech Recognition

    HMM can be classified upon various criteria:

    1. Value of Occurrences

    i) Discrete

    ii) Continuous

    2. Dimension

    i) One Dimensional

    ii) Multi-Dimensional

    3. Probability Density Function

    i) Continuous Density (Gaussian Distribution based)

    ii) Discrete Density (Vector quantization based)

    While using an HMM for recognition, we provide the occurrences (observations) to the model and it returns a number. This number is the probability with which the model could have produced that output. In speech recognition the occurrences are feature vectors rather than just symbols; each occurrence is a group of real numbers. Thus, what we need for speech recognition is a Continuous, Multi-dimensional HMM.
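    For such a continuous HMM the observation probability b_j(o) of a state is usually a Gaussian mixture. A minimal MATLAB sketch with diagonal covariances is given below; the function and variable names are illustrative assumptions:

        function b = emission_prob(o, c, mu, sig2)
            % o: d x 1 feature vector; c: 1 x M mixture weights;
            % mu, sig2: d x M means and diagonal variances of the M mixture components.
            [d, M] = size(mu);
            b = 0;
            for k = 1:M
                q = sum((o - mu(:, k)).^2 ./ sig2(:, k));              % squared, variance-scaled distance
                g = exp(-0.5 * q) / sqrt((2*pi)^d * prod(sig2(:, k))); % diagonal Gaussian density
                b = b + c(k) * g;                                      % weighted sum over mixtures
            end
        end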

    6.2 Three Basic Problems of Hidden Markov Model

    Given the basics of an HMM from the previous section, three basic problems arise when applying the model to a speech recognition task:

    Problem 1

    Given the observation sequence O = (o1, o2, o3, ..., oT) and the model λ = (A, B, π), how is the probability of the observation sequence, given the model, computed? That is, how is P(O|λ) computed efficiently?

    Problem 2

    Given the observation sequence O = (o1, o2, o3, ..., oT) and the model λ = (A, B, π), how is a corresponding state sequence q = (q1, q2, ..., qT) chosen to be optimal in some sense (i.e. best explains the observations)?

    Problem 3

    How are the probability measures λ = (A, B, π) adjusted to maximize P(O|λ)?

    The first problem can be seen as the recognition problem: given some trained models, each representing a word, which model is the most likely one for a given observation sequence? In the second problem an attempt is made to uncover the hidden part of the model. It should be clear that, for all except the case of degenerate models, there is no single "correct" state sequence to be found; it is therefore a problem to be solved as well as possible under some optimality criterion. The third problem can be seen as the training problem: given the training sequences, create a model for each word. The training problem is the crucial one for most applications of HMMs, because it optimally adapts the model parameters to the observed training data, i.e. it creates the best models for the real phenomena.

    6.3 Solution to Problem 1 - Probability Evaluation

    The aim of this problem is to find the probability of the observation sequence O = (o1, o2, o3, ..., oT) given the model λ, i.e. P(O|λ). Because the observations produced by the states are assumed to be independent of each other and of the time t, the probability of the observation sequence O being generated by a certain state sequence q can be calculated as a product:

    P(O|q, λ) = b_q1(o1) · b_q2(o2) ··· b_qT(oT)

    The probability of the state sequence q can be found as:

    P(q|λ) = π_q1 · a_q1q2 · a_q2q3 ··· a_q(T-1)qT

    The joint probability of O and q, i.e. the probability that O and q occur simultaneously, is simply the product of the above two terms, i.e.:

    P(O, q|λ) = P(O|q, λ) P(q|λ)

    The aim was to find P(O|λ), and this probability of O (given the model λ) is obtained by summing the joint probability over all possible state sequences q, giving:

    P(O|λ) = Σ over all q of π_q1 b_q1(o1) a_q1q2 b_q2(o2) ··· a_q(T-1)qT b_qT(oT)

    The interpretation of the above computation is the following. Initially, at time t = 1, the process starts by jumping to state q1 with probability π_q1 and generates the observation symbol o1 with probability b_q1(o1). The clock changes from t to t+1 and a transition from q1 to q2 will occur with probability a_q1q2, and the symbol o2 will be generated with probability b_q2(o2). The process continues in this manner until the last transition is made (at time T), i.e. a transition from q(T-1) to qT will occur with probability a_q(T-1)qT, and the symbol oT will be generated with probability b_qT(oT).

    This direct computation has one major drawback: it is infeasible due to the exponential growth of computations as a function of the sequence length T. To be precise, it needs (2T-1)·N^T multiplications and N^T - 1 additions [1]. Even for small values of N and T, e.g. N = 5 (states) and T = 100 (observations), there is a need for (2·100-1)·5^100 ≈ 1.6·10^72 multiplications and 5^100 - 1 ≈ 8.0·10^69 additions! Clearly a more efficient procedure is required to solve this problem. An excellent tool which cuts the computational requirements to linear, relative to T, is the well-known forward algorithm.

    6.3.1 The Forward Algorithm


    Consider a forward variable α_t(i), defined as:

    α_t(i) = P(o1 o2 ... ot, Xt = qi | λ)

    where t represents time and i is the state. This means that α_t(i) is the probability of the partial observation sequence o1 o2 ... ot (until time t) and being in state i at time t. The forward variable can be calculated inductively, see Fig 6.2.

    Fig 6.2 Forward Variable Calculation

    α_{t+1}(j) is found by summing the forward variables for all N states at time t, multiplied by their corresponding state transition probabilities a_ij, and then multiplying by the emission probability b_j(o_{t+1}). This can be done with the following procedure:


    1. Initialization:

    α_1(i) = π_i b_i(o1), 1 <= i <= N

    2. Induction:

    α_{t+1}(j) = [ Σ_{i=1..N} α_t(i) a_ij ] b_j(o_{t+1}), 1 <= j <= N, 1 <= t <= T-1

    3. Update time: set t = t + 1; return to step 2 if t < T, otherwise terminate.

    4. Termination:

    P(O|λ) = Σ_{i=1..N} α_T(i)

    6.3.2 The Backward Algorithm

    In a similar way a backward variable β_t(i) can be defined as:

    β_t(i) = P(o_{t+1} o_{t+2} ... oT | Xt = qi, λ)

    That is, β_t(i) is the probability of the partial observation sequence from t+1 to the end, given state i at time t and the model λ; it is called the backward probability. In a similar manner (according to the forward algorithm), the backward variable can be calculated inductively, see Fig. 6.3.

    Fig 6.3 Backward Procedure Induction Step

    The backward algorithm includes the following steps:

    1. Initialization:

    β_T(i) = 1, 1 <= i <= N

    2. Induction:

    β_t(i) = Σ_{j=1..N} a_ij b_j(o_{t+1}) β_{t+1}(j), t = T-1, ..., 1, 1 <= i <= N


    3. Update time: set t = t - 1; return to step 2 if t > 0, otherwise terminate the algorithm.

    Note that the initialization in step 1 arbitrarily defines β_T(i) to be 1 for all i.

    6.3.3 Scaling the Forward and Backward Variables

    The calculation of α_t(i) and β_t(i) involves multiplication with probabilities. All these probabilities have a value less than 1 (generally significantly less than 1), and as t starts to grow large, each term of α_t(i) or β_t(i) starts to head exponentially to zero. For sufficiently large t (e.g. 100 or more) the dynamic range of the α_t(i) and β_t(i) computation will exceed the precision range of essentially any machine (even in double precision). The basic scaling procedure multiplies α_t(i) by a scaling coefficient that is dependent only on the time t and independent of the state i. The scaling factor for the forward variable is denoted c_t (scaling is done at every time t for all states i, 1 <= i <= N). This factor will also be used for scaling the backward variable β_t(i). Scaling α_t(i) and β_t(i) with the same scale factor will prove useful in problem 3 (parameter estimation).

    Consider the computation of the forward variable α_t(i). In the scaled variant of the forward algorithm some extra notation will be used: α_t(i) denotes the unscaled forward variable, α^_t(i) (alpha-hat) denotes the scaled and iterated variant of α_t(i), α~_t(i) (alpha-tilde) denotes the local version of α_t(i) before scaling, and c_t represents the scaling coefficient at each time. Here follows the scaled forward algorithm:

    1. Initialization:

    α~_1(i) = π_i b_i(o1),   c_1 = 1 / Σ_{i=1..N} α~_1(i),   α^_1(i) = c_1 α~_1(i)

    2. Induction:

    α~_{t+1}(j) = [ Σ_{i=1..N} α^_t(i) a_ij ] b_j(o_{t+1}),   c_{t+1} = 1 / Σ_{j=1..N} α~_{t+1}(j),   α^_{t+1}(j) = c_{t+1} α~_{t+1}(j)

    3. Update time: set t = t + 1; return to step 2 if t < T, otherwise terminate.

    The ordinary induction step can be found as:

    α_{t+1}(j) = [ Σ_{i=1..N} α_t(i) a_ij ] b_j(o_{t+1})

    Now it is possible to write:

    α^_t(i) = α_t(i) / Σ_{j=1..N} α_t(j)

    As the above equation shows, α^_t(i) is α_t(i) scaled by the sum over all states of α_t(i) when the scaled forward algorithm is applied.

    The termination (step 4) of the scaled forward algorithm, the evaluation of P(O|λ), must be done in a different way. This is because the sum of the α^_T(i) cannot be used, since α^_T(i) is already scaled. However, the following property can be used:

    [ Π_{t=1..T} c_t ] · P(O|λ) = 1, i.e. P(O|λ) = 1 / Π_{t=1..T} c_t

    As the above equation shows, P(O|λ) can be found, but the problem is that if it is used directly the result will still be very small (and probably out of the dynamic range of a computer). If the logarithm is taken on both sides, the following equation can be used:

    log P(O|λ) = - Σ_{t=1..T} log c_t

    This is exactly what is done in the termination step of the scaled forward algorithm.

    The logarithm of P(O|λ) is often just as useful as P(O|λ), because in most cases this measure is used for comparison with other probabilities (for other models).

    The scaled backward algorithm can be found more easily, since it uses the same scale factors as the forward algorithm. The notation used is similar to the forward variable notation: β_t(i) denotes the unscaled backward variable, β^_t(i) denotes the scaled and iterated variant of β_t(i), β~_t(i) denotes the local version of β_t(i) before scaling, and c_t represents the scaling coefficient at each time. Here follows the scaled backward algorithm:

    1. Initialization:

    β^_T(i) = c_T β_T(i) = c_T, 1 <= i <= N

    2. Induction:

    β~_t(i) = Σ_{j=1..N} a_ij b_j(o_{t+1}) β^_{t+1}(j),   β^_t(i) = c_t β~_t(i)


    3. Update time Set t=t-1;

    Return to step 2 if t>0;

    Otherwise, terminate the algorithm
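    The scaled forward recursion above can be sketched compactly in MATLAB; B(:,t) is assumed to hold b_i(o_t) for every state i, and the function and variable names are illustrative, not the project code:

        function [alphaHat, c, logP] = scaled_forward(A, pi0, B)
            [N, T]   = size(B);
            alphaHat = zeros(N, T);  c = zeros(1, T);
            a        = pi0 .* B(:, 1);                 % initialization
            c(1)     = 1 / sum(a);
            alphaHat(:, 1) = c(1) * a;
            for t = 2:T                                % induction, scaling at every time step
                a = (A' * alphaHat(:, t-1)) .* B(:, t);
                c(t) = 1 / sum(a);
                alphaHat(:, t) = c(t) * a;
            end
            logP = -sum(log(c));                       % termination: log P(O|lambda)
        end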

    6.4 Solution to Problem 2 - Optimal State Sequence

    The problem is to find the optimal sequence of states for a given observation sequence and model. Unlike problem one, for which an exact solution can be found, there are several possible ways of solving this problem. The difficulty lies with the definition of the optimal state sequence, that is, there are several possible optimality criteria. One optimality criterion is to choose the states q_t that are individually most likely at each time t. To find this state sequence the following probability variable is needed:

    γ_t(i) = P(X_t = q_i | O, λ)

    That is, the probability of being in state i at time t given the observation sequence O and the model λ. γ_t(i) can also be written in terms of the forward and backward variables:

    γ_t(i) = α_t(i) β_t(i) / P(O|λ) = α_t(i) β_t(i) / Σ_{j=1..N} α_t(j) β_t(j)

    When γ_t(i) is calculated according to the above equation, the most likely state at time t, q*_t, is found by:

    q*_t = argmax_{1<=i<=N} γ_t(i), 1 <= t <= T

    Even if the above equation maximizes the expected number of correct states, there could be some problems with the resulting state sequence, because the state transition probabilities have not been taken into account. For example, what happens when some state transitions have zero probability (a_ij = 0)? This means that the found optimal path may not be valid. Obviously a method generating a path that is guaranteed to be valid would be preferable. Fortunately such a method exists, based on dynamic programming, namely the Viterbi algorithm. Even though γ_t(i) cannot be used for this purpose, it will be useful in problem 3 (parameter estimation).

    6.4.1 The Viterbi Algorithm

    This algorithm is similar to the forward algorithm. The main difference is that the forward algorithm uses a sum over previous states, whereas the Viterbi algorithm uses maximization. The aim of the Viterbi algorithm is to find the single best state sequence, q = (q1, q2, ..., qT), for the given observation sequence O = (o1, o2, ..., oT) and a model λ. Consider the following quantity:

    δ_t(i) = max over q1, ..., q(t-1) of P(q1 q2 ... q(t-1), X_t = q_i, o1 o2 ... ot | λ)

    That is, the probability of observing o1 o2 ... ot using the best path that ends in state i at time t, given the model λ. By induction, δ_{t+1}(j) can be found as:

    δ_{t+1}(j) = [ max_{1<=i<=N} δ_t(i) a_ij ] b_j(o_{t+1})

    To actually retrieve the state sequence, it is necessary to keep track of the argument that maximizes the above equation for each t and j. This is done by saving the argument in an array ψ_t(j). Here follows the complete Viterbi algorithm:

    1. Initialization:

    δ_1(i) = π_i b_i(o1),   ψ_1(i) = 0, 1 <= i <= N


    2. Induction:

    δ_t(j) = [ max_{1<=i<=N} δ_{t-1}(i) a_ij ] b_j(o_t),   ψ_t(j) = argmax_{1<=i<=N} [ δ_{t-1}(i) a_ij ]

    3. Update time: set t = t + 1; return to step 2 if t <= T. Otherwise terminate by picking the best final state, P* = max_{1<=i<=N} δ_T(i) and q*_T = argmax_{1<=i<=N} δ_T(i), and backtracking q*_t = ψ_{t+1}(q*_{t+1}) for t = T-1, ..., 1.

    The same problem as for the forward and backward algorithm occurs here. That is

    the algorithm involves multiplication with probabilities and the precision range will

    be exceeded. This is why an alternative Viterbi algorithm is needed.

    6.4.2 The Alternative Viterbi Algorithm

    As mentioned, the original Viterbi algorithm involves multiplications with probabilities. One way to avoid this is to take the logarithm of the model parameters, so that the multiplications become additions. Obviously this logarithm becomes a problem when some model parameters are zero. This is often the case for A and π, and can be avoided by adding a small number to the matrices. Here follows the alternative Viterbi algorithm:

    1. Preprocessing:

    π~_i = log(π_i),   a~_ij = log(a_ij),   b~_i(o_t) = log(b_i(o_t))

    2. Initialization:

    δ~_1(i) = π~_i + b~_i(o1),   ψ_1(i) = 0

    3. Induction:

    δ~_t(j) = [ max_{1<=i<=N} ( δ~_{t-1}(i) + a~_ij ) ] + b~_j(o_t),   ψ_t(j) = argmax_{1<=i<=N} ( δ~_{t-1}(i) + a~_ij )

    4. Update time

    Set t = t + 1;

    Return to step 3 if t <= T, otherwise continue with:

    (a) Termination: P~* = max_{1<=i<=N} δ~_T(i),   q*_T = argmax_{1<=i<=N} δ~_T(i)

    (b) Backtracking: q*_t = ψ_{t+1}(q*_{t+1})


    (c) Update time

    Set t=t-1;

    Return to step (b) if t>=1;

    Otherwise, terminate the algorithm.
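    A compact MATLAB sketch of this log-domain Viterbi algorithm is given below; B(:,t) again holds b_i(o_t), and the small constant that avoids taking the log of zero, as well as the names, are illustrative assumptions:

        function [q, logPstar] = viterbi_log(A, pi0, B)
            [N, T] = size(B);
            lA = log(A + 1e-300); lPi = log(pi0 + 1e-300); lB = log(B + 1e-300);
            delta = zeros(N, T);  psi = zeros(N, T);
            delta(:, 1) = lPi + lB(:, 1);                       % initialization
            for t = 2:T                                         % induction: additions, not products
                [m, arg]    = max(delta(:, t-1) + lA, [], 1);   % best predecessor for each state
                delta(:, t) = m' + lB(:, t);
                psi(:, t)   = arg';
            end
            [logPstar, qT] = max(delta(:, T));                  % termination
            q = zeros(1, T);  q(T) = qT;
            for t = T-1:-1:1                                    % backtracking
                q(t) = psi(q(t+1), t+1);
            end
        end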

    6.5 Solution to Problem 3 - Parameter Estimation

    The third problem is concerned with the estimation of the model parameters, λ = (A, B, π). The problem can be formulated as: given an observation sequence O, find the model λ, from all possible models, that maximizes P(O|λ). This problem is the most difficult of the three, because there is no known way to analytically find the model parameters that maximize the probability of the observation sequence in closed form. However, the model parameters can be chosen to locally maximize the likelihood P(O|λ). Some commonly used methods for solving this problem are the Baum-Welch method (also known as the expectation-maximization method) and gradient techniques. Both of these methods use iterations to improve the likelihood P(O|λ); however, there are some advantages of the Baum-Welch method compared to the gradient techniques:

    1. Baum-Welch is numerically stable with the likelihood non-decreasing with every

    iteration.

2. Baum-Welch converges to a local optimum.

    3. Baum-Welch has linear convergence.

This is why the Baum-Welch method is used in this project. This section will derive the re-estimation equations used in the Baum-Welch method.

The model λ has three terms to describe, namely the state transition probability distribution A, the initial state distribution π and the observation symbol probability distribution B. Since continuous observation densities are used, B will be represented by cjk, μjk and Σjk. To describe the procedure for re-estimation, the following probability will prove useful:
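In the notation of [1], this quantity is:

    \xi_t(i,j) = P\left( q_t = i,\; q_{t+1} = j \mid O, \lambda \right)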


That is, the probability of being in state i at time t and state j at time t + 1, given the model λ and the observation sequence O. The paths that satisfy the conditions required by the above equation are illustrated in Fig. 6.4.

    Fig 6.4

    By using all the previous equations we can conclude that
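Written out with the forward variable αt(i) and the backward variable βt(i) from the forward and backward algorithms, this is (following [1]):

    \xi_t(i,j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)}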

As mentioned in Problem 2, γt(i) is the probability of being in state i at time t, given the entire observation sequence O and the model λ. Hence
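Following [1]:

    \gamma_t(i) = \sum_{j=1}^{N} \xi_t(i,j)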


If the sum over time t is applied to γt(i), one gets a quantity that can be interpreted as the expected (over time) number of times that state i is visited or, equivalently, the expected number of transitions made from state i (if the time slot t = T is excluded) [1]. If the same summation is done over ξt(i,j), one gets the expected number of transitions from state i to state j. The term γ1(i) will also prove to be useful. Given these definitions it is possible to derive the re-estimation formulas for π and A:
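The expected relative frequency of being in state i at time t = 1 gives, following [1]:

    \bar{\pi}_i = \gamma_1(i)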

    And
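for the transition probabilities, again following [1]:

    \bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}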


The re-estimation of cjk, μjk and Σjk is a bit more complicated. However, if the model had only one state j and only one mixture, it would be a simple averaging task.
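In that single-state, single-mixture case the estimates would reduce to simple sample averages over the T training frames, for example:

    \bar{\mu}_j = \frac{1}{T} \sum_{t=1}^{T} o_t, \qquad \bar{\Sigma}_j = \frac{1}{T} \sum_{t=1}^{T} (o_t - \bar{\mu}_j)(o_t - \bar{\mu}_j)'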

In practice, of course, there are multiple states and multiple mixtures, and there is no direct assignment of the observation vectors to individual states, because the underlying state sequence is unknown. Since the full likelihood of each observation sequence is based on the summation over all possible state sequences, each observation vector ot contributes to the computation of the likelihood for each state j. In other words, instead of assigning each observation vector to a specific state, each observation is assigned to every state and is weighted by the probability that the model was in that state, and using that specific mixture, when the vector was observed. This probability, for state j and mixture k (there are M mixtures), is found by
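In the notation of [1], with N(o; μ, Σ) denoting the Gaussian density:

    \gamma_t(j,k) = \left[ \frac{\alpha_t(j)\, \beta_t(j)}{\sum_{j=1}^{N} \alpha_t(j)\, \beta_t(j)} \right] \left[ \frac{c_{jk}\, \mathcal{N}(o_t; \mu_{jk}, \Sigma_{jk})}{\sum_{m=1}^{M} c_{jm}\, \mathcal{N}(o_t; \mu_{jm}, \Sigma_{jm})} \right]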

    The re-estimation formula for cjk is the ratio between the expected number of times the

    system is in state j using the kth mixture component, and the expected number of times the

    system is in state j. That is:
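Following [1]:

    \bar{c}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j,k)}{\sum_{t=1}^{T} \sum_{k=1}^{M} \gamma_t(j,k)}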

To find μjk and Σjk, one can weight the simple averages by the probability of being in state j and using mixture k when observing ot:
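Following [1]:

    \bar{\mu}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j,k)\, o_t}{\sum_{t=1}^{T} \gamma_t(j,k)}, \qquad \bar{\Sigma}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j,k)\, (o_t - \mu_{jk})(o_t - \mu_{jk})'}{\sum_{t=1}^{T} \gamma_t(j,k)}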


The re-estimation formulas described in this section are based on a single training sample. This is of course not sufficient to obtain reliable estimates, especially when left-right models are used. To get reliable estimates it is convenient to use multiple observation sequences.

6.5.1 Initial Estimates of HMM Parameters

Before the re-estimation formulas can be applied for training, it is important to get good initial parameters so that the re-estimation leads to the global maximum, or as close to it as possible. An adequate choice for π and A is the uniform distribution. Since left-right models are used, however, π will have probability one for the first state and zero for the other states. For example, the left-right model in the figure below has the following initial π and A:
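For instance, for a five-state left-right model of the kind trained later in this project (self-transitions and transitions to the next state only), with the remaining probability mass split uniformly, the initialization could be:

    \pi = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 \end{pmatrix}

    A = \begin{pmatrix}
        0.5 & 0.5 & 0   & 0   & 0   \\
        0   & 0.5 & 0.5 & 0   & 0   \\
        0   & 0   & 0.5 & 0.5 & 0   \\
        0   & 0   & 0   & 0.5 & 0.5 \\
        0   & 0   & 0   & 0   & 1
    \end{pmatrix}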

    Fig 6.5 Left-Right model of HMM


The parameters of the emission distributions need good initial estimates to get rapid and proper convergence. This is done by uniformly segmenting every training sample into the states of the model. After segmentation, all observations belonging to state j are collected from all training samples. Then a clustering algorithm is used to get the initial parameters for state j, and this procedure is repeated for every state. The clustering algorithm used in this project is the well-known k-means algorithm. Before the clustering proceeds one has to choose the number of clusters, K. Here the number of clusters is equal to the number of mixtures, that is, K = M.

The K-Means Algorithm

1. Initialization

    Choose K vectors from the training vectors, here denoted x, at random. These vectors will be the initial centroids μk, which are to be refined.

2. Recursion

    For each vector in the training set, let the vector belong to a cluster k. This is done by choosing the cluster closest to the vector, where d(x, μk) is a distance measure; here the Euclidean distance is used.

3. Test

    Recompute the centroids μk by taking the mean of the vectors that belong to each centroid. This is done for every k. If no vectors belong to some μk, create a new μk by choosing a random vector from x. If the centroids have not changed from the previous step, go to step 4; otherwise go back to step 2.

4. Termination

    From this clustering (done for one state j), the following parameters are found:
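The parameters obtained from the clusters are the mixture weights, means and variances for state j. A minimal MATLAB sketch of this initialization for one state is given below; KMeansInitSketch is an illustrative name rather than a module of this project, and diagonal covariance matrices are assumed for simplicity (X is a dim-by-N matrix of all observation vectors assigned to state j by the uniform segmentation, and M is the number of mixtures, K = M):

    function [c, mu, sigma2] = KMeansInitSketch(X, M)
        [dim, N] = size(X);
        idx = randperm(N);
        mu = X(:, idx(1:M));                       % M random vectors as initial centroids
        assign = zeros(1, N); prevAssign = -ones(1, N);
        while any(assign ~= prevAssign)
            prevAssign = assign;
            for n = 1:N                            % assign each vector to the nearest centroid
                d = sum((mu - repmat(X(:, n), 1, M)).^2, 1);
                [~, assign(n)] = min(d);
            end
            for k = 1:M                            % recompute the centroids
                members = X(:, assign == k);
                if isempty(members)
                    mu(:, k) = X(:, randi(N));     % empty cluster: re-seed with a random vector
                else
                    mu(:, k) = mean(members, 2);
                end
            end
        end
        c = zeros(1, M); sigma2 = zeros(dim, M);
        for k = 1:M                                % weights, means and diagonal variances
            members = X(:, assign == k);
            c(k) = size(members, 2) / N;
            if ~isempty(members)
                sigma2(:, k) = var(members, 1, 2);
            end
        end
    end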


    Chapter Seven

    Implementation

The system was built and tested on a platform with the following specification:

1. System Specification
    1.1. Intel Core i5 CPU @ 2.67 GHz
    1.2. 4 GB of RAM
    1.3. Microsoft Windows 7 x64
    1.4. MATLAB 7.12.0 (R2011a)
    1.5. Microphone

2. Minimum System Requirement
    2.1. Pentium 200 MHz processor
    2.2. 512 MB of RAM
    2.3. Microphone
    2.4. Soundcard

The system was coded in MATLAB and is limited to a command-line interface. The system is divided into various components, each dedicated to a specific task, e.g. speech acquisition, feature extraction, training and recognition. The full source code can be found at the end of the report.

The user can train the model either by recording directly from the microphone or by using samples in pre-recorded WAV files. The same applies for recognition. The project has the constraint that it can recognize either isolated words or a series of connected words separated by a small pause. Each word is detected and extracted by the word detector and then each word is recognized separately.

The sound is recorded at 8000 samples per second with 16 bits per sample, so the quality of the signal is not good enough to sense small differences between two similar signals. The user therefore has to speak clearly.

    7.1 Software Modules


All the software modules were coded in MATLAB, and brief descriptions of the most important modules are given below:

1. GetSpeechSample.m: This function is responsible for acquiring the speech signal either from a WAV file or directly from the microphone. It receives the input at 8000 samples per second and 16 bits per sample, for 6 seconds in the case of training and 10 seconds in the case of recognition. It also trims the first 8000 samples to remove the initial click noise from the microphone, so the user should start speaking only after approximately 1.5 seconds.

2. ExtractFeatures.m: This function receives a speech signal and performs feature extraction on it. It returns a set of 39-dimensional feature vectors, where each vector contains MFCCs, energy, and delta and acceleration coefficients.

3. TrainWord.m: This function receives feature vectors and the string which the recognizer should output when the word is recognized, and generates a model from them, assigning a unique id to it.

4. Recognize.m: This function performs the recognition of the speech sample fed to it from a WAV file or directly from the microphone.

    5. forwardBackwardAlgorithm.m: This function is used to implement the Forward-

    Backward algorithm for training the samples.

6. viterbiAlgorithm.m: This function implements the Viterbi Algorithm, which is used in

    the calculation of the probability score and recognition of the words.

7. UpdateKnowledge.m: This function updates the dictionary whenever a new word is added to the database by training.

    8. GaussianProbability.m: This function calculates the Gaussian probability of a frame of

    the speech sample using the probability distribution function.

    9. ResetDictionary.m: This function is used to reset the dictionary and remove all the

    trained models.

    10. FindNextSegment.m: This function finds the next word by detecting its word boundary

    by comparing the energy and the zero crossing rates.

    7.2 Working of the system

The working of the system can be understood by breaking the full functionality into sub-functions. First, the signal is acquired by the speech acquisition module. This module trims off the first 8000 samples, corresponding to 1 second of the signal, and passes the signal to the word detector. The word detector detects the boundary of each word present in the signal by analysing the energy and the zero crossing rates, and stores the start and finish point of each word in a list. Word segments whose distance is less than the threshold distance are merged together and the list is updated. Each word segment is then fed into the feature extractor module and converted into a set of feature vectors. This is passed on to the recognizer module, which finds the closest match among the models stored in the database. After the recognition of this word is complete, the next word is fetched from the list and the process is repeated.
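A condensed, hypothetical MATLAB sketch of this merging step is shown below (the variable names are illustrative and do not match the appendix code exactly; segList is assumed to be an S-by-2 matrix holding the start and finish frame of each detected segment, sorted by start frame, and 100 frames is used as the merging threshold, as in the appendix):

    % Merge detected word segments whose gap is smaller than the threshold.
    distanceThreshold = 100;                  % minimum gap (in frames) between separate words
    merged = segList(1, :);
    for s = 2:size(segList, 1)
        if segList(s, 1) - merged(end, 2) < distanceThreshold
            merged(end, 2) = segList(s, 2);   % small gap: extend the previous segment
        else
            merged(end+1, :) = segList(s, :); % large gap: start a new word segment
        end
    end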

Before speech recognition can be performed, the individual words first need to be trained. In the training part the user provides input either from a pre-recorded WAV file or directly from the microphone. The training module extracts features from it and generates an HMM for the word. The probabilities are adjusted by performing the expectation-maximization process. The full overview of the working of the system is shown below in figure 7.1.

    Fig 7.1 Block Diagram of Working of the system


    Snapshots

    1. The Main Interface

Fig 7.2 The Main Interface

    2. Training new word

Fig 7.3 Training


    3. Recognition

Fig 7.4 Recognition


    Chapter Eight

    Results and Conclusion

    8.1 Training

Ten of the most frequently used English words were trained and their models were generated. For this I used my own voice; approximately 300 samples in total were taken, with 30 training samples for each word.

Each model was trained with the Expectation-Maximization algorithm, with 7 iterations per model. The total number of states in each model was 5. Each model was stored in a unique .MAT file.

The sound was recorded at 8000 samples per second, 16 bits per sample, mono channel.

    8.2 Recognition

The recognition module was given a WAV file as input which contained a pre-recorded speech sample. Each speech sample contained 5 to 6 words separated by a minimum distance of 100 frames. First the words are detected and extracted from the signal, and then each word is recognized independently.

The recognition was tested using samples both from the user who trained the system and from an unknown user. The environment was also changed, so that it differed from the environment in which the system was trained.

The results of the experiment are as shown below:

Type of condition                          No. of samples   Correct output   Accuracy (%)
Known user and known environment                 30               30             100%
Known user and unknown environment               20               14              80%
Unknown user and known environment               25               12              48%
Unknown user and unknown environment             15                3              20%

Table 8.1 Recognition Results


    8.3 Conclusion

The theory of Hidden Markov Models has been studied thoroughly. Together with the signal processing of speech signals, a speech recognizer has been implemented in MATLAB. A word boundary detector has also been implemented. The performance of the word detector is up to the requirement, and it performed well in all conditions.

An HMM library was built which was used for recognition and also for training a word-based acoustic model. Word models for some of the most frequently used words of the English language were built.

The trained models were used to recognize speech. The recognizer gave its best performance when the user and the environment were both known to the recognizer. It gave the worst performance when the user and the environment were both unknown. This happened because the models were trained in a constrained environment with only one user. This problem can be addressed by including more training data from different persons and in different environments. The other cases produced average results.

    8.4 Future Work

    Further improvements and expansions may be achieved by using one or more of the following

    suggestions:

The speech recognizer is implemented in MATLAB and therefore runs slowly. Implementing the speech recognizer in C or assembly language would be desirable to get a faster execution time.

In a noisy environment, for example in a car, noise reduction algorithms are needed to enhance the signal-to-noise ratio. Useful algorithms can be based on adaptive noise reduction, spectral subtraction or beamforming.

Record a larger evaluation database, for different speakers and different environments, to get more test cases.

Try different settings in the speech recognizer, for example changing the model structure, the number of states or the number of mixtures. More or fewer measures of the speech can be added to the feature vectors, that is, experiment with the feature vector dimension and its content.


    Reading and References

[1] Lawrence R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proceedings of the IEEE, 1989.

[2] B. Plannerer, "An Introduction to Speech Recognition", March 28, 2005.

[3] Christopher M. Bishop, Pattern Recognition and Machine Learning, Chapter 13, pages 605-646.

[4] M. Narasimha Murthy, V. Susheela Devi, Pattern Recognition: An Algorithmic Approach, University Press, Chapters 3, 5 and 9.

[5] B. H. Juang, L. R. Rabiner, "Hidden Markov Models for Speech Recognition", Technometrics, Vol. 33, No. 3, August 1991.

[6] Lawrence Rabiner, Biing-Hwang Juang, Fundamentals of Speech Recognition, Prentice-Hall International, Chapters 6 and 7.

[7] Christophe Couvreur, "Hidden Markov Models and Their Mixtures".

[8] L. R. Rabiner and M. R. Sambur, "An Algorithm for Determining the Endpoints of Isolated Utterances", The Bell System Technical Journal, Vol. 54, No. 2, Feb. 1975.

[9] Mel Frequency Cepstral Coefficient (MFCC) tutorial: http://practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/

[10] http://www.nuance.com/

[11] http://cmusphinx.sourceforge.net/

[12] B. Pellom, "Sonic: The University of Colorado Continuous Speech Recognition System".

[13] http://xvoice.sourceforge.net/


    Appendix A

    Source Code in MATLAB

function [] = Main( )
    clear; clc;
    check = 0;
    fprintf('#===================SPEECH RECOGNITION SYSTEM==================#\n');
    fprintf('#Developed By: Aditya Sharma                                   #\n');
    fprintf('#Branch: CSE Final Year                                        #\n');
    fprintf('#Institute of Engineering and Technology,Lucknow               #\n');
    fprintf('#===============================================================#\n\n');
    fprintf('Choose an option:\n');
    fprintf('1: Train a new word\n');
    fprintf('2: Perform Recognition\n');
    fprintf('3: Reset Knowledge\n');
    fprintf('4: Exit\n\n');
    option = input('Your option is:', 's');
    switch option
        case '1'
            TrainWord;
        case '2'
            Recognize;
        case '3'
            ResetDictionary;
        case '4'
            clc;
            check = 1;
        otherwise
            fprintf('Invalid option!! Retry...\n');
    end
    if check == 0
        %input('');
    else
    end
end

    function [ ] = TrainWord( ) fprintf('Training Mode....\n'); x=GetSpeechSampleW(); fprintf('Enter the string corresponding to this word:'); name=input('','s'); fprintf('Training...\n'); Ini = 0.1; %Initial silence duration in seconds Ts = 0.01; %Frame width in seconds Tsh = 0.005; %Frame shift in seconds Fs = 8000; %Sampling Frequency ZTh = 40; %Zero crossing comparison rate for threshold w_sam = fix(Ts*Fs); %No of Samples/window o_sam = fix(Tsh*Fs); %No of samples/overlap lengthX = length(x); segs = fix((lengthX-w_sam)/o_sam)+1; %Number of segments in speech signal


    sil = fix((Ini-Ts)/Tsh)+1; %Number of segments in silent period win = hamming(w_sam); Limit = o_sam*(segs-1)+1; %Start index of last segment

    FrmIndex = 1:o_sam:Limit; %Vector containing starting index %for each segment ZCR_Vector = zeros(1,segs); %Vector to hold zero crossing rate %for all segments

    %Below code computes and returns zero crossing rates for all segments in %speech sample for t = 1:segs ZCRCounter = 0; nextIndex = (t-1)*o_sam+1; for r = nextIndex+1:(nextIndex+w_sam-1) if (x(r) >= 0) && (x(r-1) >= 0)

    elseif (x(r) >= 0) && (x(r-1) < 0) ZCRCounter = ZCRCounter + 1; elseif (x(r) < 0) && (x(r-1) < 0)

    elseif (x(r) < 0) && (x(r-1) >= 0) ZCRCounter = ZCRCounter + 1; end end ZCR_Vector(t) = ZCRCounter; end

    %Below code computes and returns frame energy for all segments in speech %sample Erg_Vector = zeros(1,segs); for u = 1:segs nextIndex = (u-1)*o_sam+1; Energy = x(nextIndex:nextIndex+w_sam-1).*win; Erg_Vector(u) = sum(abs(Energy)); end

    IMN = mean(Erg_Vector(1:sil)); %Mean silence energy (noise energy) IMX = max(Erg_Vector); %Maximum energy for entire utterance I1 = 0.03 * (IMX-IMN) + IMN; %I1 & I2 are Initial thresholds I2 = 4 * IMN; ITL = min(I1,I2); %Lower energy threshold

    ITU = 5 * ITL; %Upper energy threshold IZC = mean(ZCR_Vector(1:sil)); %mean zero crossing rate for silence region stdev = std(ZCR_Vector(1:sil)); %standard deviation of crossing rate for %silence region IZCT = min(ZTh,IZC+2*stdev); %Zero crossing rate threshold

    flag=1; startpoint=1; segment_count=0;

    while flag==1

    [st,fi,fl]=FindNextSegment(ITU,ITL,IZCT,x,startpoint,Erg_Vector,ZCR_Vector);


    if fl==0 break; end segment_count=segment_count+1; seg_list{segment_count,1}=st; seg_list{segment_count,2}=fi; startpoint=fi; end distance_threshold=100; valid_index=ones(1,segment_count); for l=1:segment_count-1 if valid_index(l)==1 if abs(seg_list{l+1,2}-seg_list{l,2}) ITU) counter1 = counter1 + 1;


    indexi(counter1) = i; end end if counter1==0 flag=0; start=0; finish=0; return; end ITUs = indexi(1); first_hit=ITUs; %Search further forward for frame with energy greater than ITL for j = ITUs:-1:1 if (Erg_Vector(j) < ITL) counter2 = counter2 + 1; indexj(counter2) = j; end end start = indexj(1)+1;

    %BackSearch = min(start,25); BackSearch=25; for m = start:-1:start-BackSearch+1 rate = ZCR_Vector(m); if rate > IZCT ZCRCountb = ZCRCountb + 1; realstart = m; end end if ZCRCountb > 3 start = realstart; %If IZCT is exceeded in more than 3 frames %set start to last index where IZCT is

    exceeded end

    l_c=0; for k=first_hit:length(Erg_Vector) if(Erg_Vector(k)


    end

    function [ pcm ] = GetSpeechSampleW( )

    fprintf('Enter the name of the WAVE file:'); fname=input('','s'); buffer=wavread(fname); FrameRate=8000; [bufferlength,~]=size(buffer); buffer=buffer(FrameRate:bufferlength); pcm=buffer;

    end

    function [ features ] = ExtractFeatures( signal) fs=8000; frameSizeInSec = 0.025; frameShiftInSec= 0.010; hamming=1; preEmphesis=0; totalFilterBanks=26; cepstralOrder=12; lifter=22; deltaWindow=2; deltaWindowWeight = ones(1,2*deltaWindow+1); signal=double(signal); len=length(signal); preEmpSignal=zeros(len,1); preEmpSignal(1)=signal(1); preEmpSignal(2:end)=signal(2:end)-preEmphesis*signal(1:end-1); frameSize=round(fs*frameSizeInSec); frameShift=round(fs*frameShiftInSec); frameNo= floor( 1 + (len - frameSize)/frameShift ); maxMF = 2595*log10(1 + 0.5*fs/700.0); deltaMF=maxMF/(totalFilterBanks+1); f=zeros(totalFilterBanks+2,1); for m=1:totalFilterBanks+2 f(m)=(10^((m-1)*deltaMF/2595)-1)*700.0; end mfcc_tran=zeros(cepstralOrder,totalFilterBanks); for k=1:cepstralOrder for m=1:totalFilterBanks mfcc_tran(k,m)=sqrt(2/totalFilterBanks)*cos(k*pi/totalFilterBanks *

    (m-0.5)); end end n=(1:cepstralOrder)'; lifter_weighting=1+(lifter/2)*sin(pi*n/lifter); k=(1:frameSize)'; h=0.54-0.46*cos(2*pi*(k-1)/(frameSize-1)); mfcc=zeros(cepstralOrder,frameNo); melspec=zeros(totalFilterBanks,frameNo);


    for fr=1:frameNo s=preEmpSignal((fr-1)*frameShift+1:(fr-1)*frameShift+frameSize); if hamming ~= 0 s=s.*h; end fftN=2; while fftN


    newId=GenerateId(); id=newId; modelFileName=['hmms\' int2str(newId) '.mat'];

    for iter=ITERATION_BEGIN:ITERATION_END if iter==1 MIN_SELF_TRANSITION_COUNT=0; vector_sums_i_m=zeros(dim,STATE_NO); var_vec_sums_i_m=zeros(dim,STATE_NO); fr_no_i_m=zeros(STATE_NO); self_tr_fr_no_i_m=zeros(STATE_NO); [dim,fr_no]=size(features); for i=1:STATE_NO begin_fr=round( fr_no*(i-1) /STATE_NO)+1; end_fr=round( fr_no*i /STATE_NO); seg_length=end_fr-begin_fr+1; vector_sums_i_m(:,i) = vector_sums_i_m(:,i) +

    sum(features(:,begin_fr:end_fr),2); var_vec_sums_i_m(:,i) = var_vec_sums_i_m(:,i) + sum(

    features(:,begin_fr:end_fr).*features(:,begin_fr:end_fr) , 2); fr_no_i_m(i)=fr_no_i_m(i)+seg_length; self_tr_fr_no_i_m(i)= self_tr_fr_no_i_m(i) + seg_length-1; end for i=1:STATE_NO mean_vec_i_m(:,i) = vector_sums_i_m(:,i) / fr_no_i_m(i); var_vec_i_m(:,i) = var_vec_sums_i_m(:,i) / fr_no_i_m(i);

    A_i_m(i)=(self_tr_fr_no_i_m(i)+MIN_SELF_TRANSITION_COUNT)/(fr_no_i_m(i)+2*MIN_

    SELF_TRANSITION_COUNT); end else MIN_SELF_TRANSITION_COUNT=0.00; [dim,STATE_NO]=size(mean_vec_i_m); vector_sums_i_m=zeros(dim,STATE_NO); var_vec_sums_i_m=zeros(dim,STATE_NO); fr_no_i_m=zeros(STATE_NO); self_tr_fr_no_i_m=zeros(STATE_NO); total_log_prob = 0; total_fr_no = 0; [log_prob, pr_i_t, pr_self_tr_i_t

    ]=forwardBackwardAlgorithm(features,mean_vec_i_m(:,:),var_vec_i_m(:,:),A_i_m(:

    )); total_log_prob = total_log_prob + log_prob; total_fr_no = total_fr_no + fr_no; for i=1:STATE_NO fr_no_i_m(i)=fr_no_i_m(i)+sum(pr_i_t(i,:));

    self_tr_fr_no_i_m(i)=self_tr_fr_no_i_m(i)+sum(pr_self_tr_i_t(i,1:end-1)); for fr=1:fr_no vector_sums_i_m(:,i) = vector_sums_i_m(:,i) +

    pr_i_t(i,fr)*features(:,fr); var_vec_sums_i_m(:,i) =var_vec_sums_i_m(:,i) +

    pr_i_t(i,fr)*(features(:,fr)-mean_vec_i_m(:,i)).*(features(:,fr)-

    mean_vec_i_m(:,i)); end end old_mean_vec_i_m=mean_vec_i_m;


    old_var_vec_i_m= var_vec_i_m; old_A_i_m=A_i_m; for i=1:STATE_NO; mean_vec_i_m(:,i) = vector_sums_i_m(:,i)/ fr_no_i_m(i); var_vec_i_m(:,i)= var_vec_sums_i_m(:,i) / fr_no_i_m(i); A_i_m(i)=(self_tr_fr_no_i_m(i)+MIN_SELF_TRANSITION_COUNT)

    /(fr_no_i_m(i)+2*MIN_SELF_TRANSITION_COUNT); end var_new_to_old_ratio=var_vec_i_m ./ old_var_vec_i_m; end end save(modelFileName, 'mean_vec_i_m', 'var_vec_i_m', 'A_i_m'); fprintf('The new word is added to the knowledge...\n');

    end

    function [log_prob, pr_i_t, pr_self_tr_i_t ]=forwardBackwardAlgorithm( V,

    mean_vec_i, var_vec_i, A_i ) [dim , N]=size(mean_vec_i); [dim2 , T]=size(V); [log_prob, logfw, logObsevation_i_t ]=forwardAlgorithm(V, mean_vec_i,

    var_vec_i, A_i ); pr_self_tr_i_t=zeros(N,T); logbw=ones(N,T)*(-inf); t=T; logbw(N,T)=log(1-A_i(N)); for t=T-1:-1:1 for i=1:N if i==N logbw(i,t)= log(A_i(i))+ logObsevation_i_t(i,t+1) + logbw(i,t+1); pr_self_tr_i_t(i,t)=exp(logfw(i,t)+ log(A_i(i))+

    logObsevation_i_t(i,t+1) + logbw(i,t+1)-log_prob); else logbw(i,t)=CalculateSum([ (log(A_i(i))+ logObsevation_i_t(i,t+1) +

    logbw(i,t+1)) , (log(1-(A_i(i))) + logObsevation_i_t(i+1,t+1)+ logbw(i+1,t+1)

    )] ); pr_self_tr_i_t(i,t)=exp(logfw(i,t)+

    log(A_i(i))+logObsevation_i_t(i,t+1) + logbw(i,t+1)-log_prob); end end end pr_i_t= exp( logfw+logbw - log_prob ); count_at_t(1:T)=sum(pr_i_t,1); count_at_t=squeeze(count_at_t); if (sum(count_at_t) -T) > 1E-6 diff=sum(count_at_t) -T ; end end

    function [log_pr, varargout]=forwardAlgorithm(V, mean_vec_i, var_vec_i, A_i

    ) [dim , N]=size(mean_vec_i); [dim2 , T]=size(V); logObsevation_i_t=zeros(N,T);


    for t=1:T for i=1:N

    logObsevation_i_t(i,t)=GaussianProbability(V(:,t),mean_vec_i(:,i),var_vec_i(:,

    i)); end end logfw=ones(N,T)*(-inf); t=1; logfw(1,1)=logObsevation_i_t(1,1); for t=2:T i=1; logfw(i,t)=logfw(i,t-1) + log(A_i(i))+logObsevation_i_t(i,t); for i=2:N logfw(i,t)= CalculateSum( [ (logfw(i-1,t-1) +log(1-A_i(i-1)) ) ,

    (logfw(i,t-1) + log(A_i(i))) ] ) + logObsevation_i_t(i,t); end end log_pr=logfw(N,T) + log(1-A_i(N)); varargout(1)= {logfw}; varargout(2)= {logObsevation_i_t}; end

    function [ ] = UpdateKnowledge( word_id,value )

    dictionaryFile='dictionary.mat'; if ( exist(dictionaryFile,'file')) load(dictionaryFile,'dictionary'); dictionary{word_id,1}=word_id; dictionary{word_id,2}=value; save(dictionaryFile,'dictionary'); else dictionary{word_id,1}=word_id; dictionary{word_id,2}=value; save(dictionaryFile,'dictionary'); end end

    function [ ] = Recognize( )

    x=GetSpeechSampleW(); fprintf('Recognizing...\n'); Ini = 0.1; %Initial silence duration in seconds Ts = 0.01; %Frame width in seconds Tsh = 0.005; %Frame shift in seconds Fs = 8000; %Sampling Frequency ZTh = 40; %Zero crossing comparison rate for threshold w_sam = fix(Ts*Fs); %No of Samples/window o_sam = fix(Tsh*Fs); %No of samples/overlap lengthX = length(x); segs = fix((lengthX-w_sam)/o_sam)+1; %Number of segments in speech signal sil = fix((Ini-Ts)/Tsh)+1; %Number of segments in silent period


    win = hamming(w_sam); Limit = o_sam*(segs-1)+1; %Start index of last segment

    FrmIndex = 1:o_sam:Limit; %Vector containing starting index %for each segment ZCR_Vector = zeros(1,segs); %Vector to hold zero crossing rate %for all segments

    %Below code computes and returns zero crossing rates for all segments in %speech sample for t = 1:segs ZCRCounter = 0; nextIndex = (t-1)*o_sam+1; for r = nextIndex+1:(nextIndex+w_sam-1) if (x(r) >= 0) && (x(r-1) >= 0)

    elseif (x(r) >= 0) && (x(r-1) < 0) ZCRCounter = ZCRCounter + 1; elseif (x(r) < 0) && (x(r-1) < 0)

    elseif (x(r) < 0) && (x(r-1) >= 0) ZCRCounter = ZCRCounter + 1; end end ZCR_Vector(t) = ZCRCounter; end

    %Below code computes and returns frame energy for all segments in speech %sample Erg_Vector = zeros(1,segs); for u = 1:segs nextIndex = (u-1)*o_sam+1; Energy = x(nextIndex:nextIndex+w_sam-1).*win; Erg_Vector(u) = sum(abs(Energy)); end

    IMN = mean(Erg_Vector(1:sil)); %Mean silence energy (noise energy) IMX = max(Erg_Vector); %Maximum energy for entire utterance I1 = 0.03 * (IMX-IMN) + IMN; %I1 & I2 are Initial thresholds I2 = 4 * IMN; ITL = min(I1,I2); %Lower energy threshold ITU = 5 * ITL; %Upper energy threshold

    IZC = mean(ZCR_Vector(1:sil)); %mean zero crossing rate for silence region stdev = std(ZCR_Vector(1:sil)); %standard deviation of crossing rate for %silence region IZCT = min(ZTh,IZC+2*stdev); %Zero crossing rate threshold

    flag=1; startpoint=1; segment_count=0;

    while flag==1

    [st,fi,fl]=FindNextSegment(ITU,ITL,IZCT,x,startpoint,Erg_Vector,