
    SPEECH RECOGNITION SYSTEM

    Submitted in partial fulfillment of the requirements for the award of

    Degree of

    BACHELOR OF TECHNOLOGY

    IN

    COMPUTER SCIENCE AND ENGINEERING

    Submitted By:

    ADITYA SHARMA

    Roll No: 1005210005

    2013-2014

    Under the supervision of

    Dr. Y N Singh

    Associate Professor

    DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

    INSTITUTE OF ENGINEERING AND TECHNOLOGY

    LUCKNOW


    Certificate

    This is to certify that the project entitled Speech Recognition System submitted by Aditya

    Sharma, for the award of the degree of Bachelor of Technology in Computer Science and

    Engineering is a record of the bona fide work carried out by him under my guidance and

    supervision at the Department of Computer Science and Engineering, Institute of

    Engineering and Technology, Lucknow.

    This work has not been submitted anywhere else for the award of any other degree.

    Dr. Y N Singh

    Associate Professor

    Dept. of Computer Science and Engineering

    IET Lucknow


    Acknowledgment

    I would like to place on record my deep appreciation and gratitude towards my project supervisor

    Dr. Y N Singh, Associate Professor, Dept. of Computer Science and Engineering, for his invaluable

    support and encouragement. I would also like to express my heartfelt thanks to Mr. Radhe Shyam, who provided me with invaluable guidance in finding the various resources required for the project.

    I would like to thank Dr. Manish Gaur, Associate Professor, Dept. of Computer Science and

    Engineering, for his kindness and graciousness in providing me a disciplined environment for finishing my project.

    I would also like to thank Prof. Lawrence Rabiner of the Univ. of California Santa Barbara whose

    paper inspired me to use Hidden Markov Models in my effort to make an efficient and robust speech recognition system.

    Last but not the least, I would like to thank my parents for their encouragement and support in my

    studies.

    Aditya Sharma

    B. Tech. Final Year

    Computer Science and Engineering

    IET Lucknow


    Abstract

    This report takes a brief look at the basic building blocks of a speech recognition system. It discusses the implementation of the different modules required for the construction of a speech recognition system. The word detector, feature extractor, and HMM training and recognition modules have been described in detail. The objective of the project is the implementation of a connected word speech recognition system using Hidden Markov Models. This involves the design of efficient MATLAB code on a PC. This phase of the project involves the development of a limited domain recognition engine spanning a limited vocabulary and a constrained environment only. The system can recognize single words as well as connected words. The results of the executions that were conducted are also provided. The system shows a high accuracy rate of nearly 90-100% when it is executed in a known environment and with a known user. The report ends with a conclusion and future work.


    Table of Contents

    1. Introduction
    1.1. An overview of Speech Recognition ...................................................................... 7

    1.2. Problem Definition and Scope ............................................................................... 7

    1.3. Design of the System ............................................................................................ 8

    1.4. History................................................................................................................. 9

    1.5. Uses of Speech Recognition System .................................................................... 10

    1.6. Applications ....................................................................................................... 10

    1.7. Speech Recognition Weakness and Flaws ............................................................ 11

    1.8. Related Works.................................................................................................... 11

    1.9. The future of Speech recognition ......................................................................... 12

    1.10. Overview of the Project Report ......................................................................... 12

    2. Types of Speech Recognition Systems

    2.1. Based on Algorithms
    2.1.1. Hidden Markov Model based................................................................... 14

    2.1.2. Dynamic Time Warping Based ................................................................ 14

    2.1.3. Artificial Neural Network based .............................................................. 14

    2.2. Based on ability to recognize words

    2.2.1. Isolated Speech Recognition.................................................................... 15

    2.2.2. Connected Speech .................................................................................. 15

    2.2.3. Continuous Speech ................................................................................. 15

    2.2.4. Spontaneous Speech ............................................................................... 15

    2.3. Based on dependency on user

    2.3.1. Speaker dependent speech recognition ..................................................... 15

    2.3.2. Speaker Independent speech recognition .................................................. 15

    3. Words Detection and Extraction
    3.1. Principle of Word Detection ................................................................................ 16

    3.2. Methodology ...................................................................................................... 16

    3.3. Performance ....................................................................................................... 19

    4. Feature Extraction .................................................................................................. 20

    5. Knowledge Models

    5.1. Acoustic Models
    5.1.1. Word Model .......................................................................................... 24

    5.1.2. Phone Model .......................................................................................... 25

    5.1.2.1. Context Independent Phone Model .................................................... 26

    5.1.2.2. Context Dependent Phone Model....................................................... 26

    5.2. Language Model
    5.2.1. Classification ......................................................................................... 27

    6. Hidden Markov Model
    6.1. HMM and Speech Recognition ............................................................................ 29

    6.2. Three Basic Problems of Hidden Markov Models ................................................. 29

    6.3. Solution to Problem 1 - Probability Evaluation..................................................... 30

    6.3.1. The forward Algorithm ........................................................................... 31

    6.3.2. The Backward Algorithm ........................................................................ 33

    6.3.3. Scaling the Forward and Backward Variables ........................................... 35


    6.4. Solution to Problem 2 - Optimal State Sequence ............................................... 39

    6.4.1. The Viterbi Algorithm ............................................................................ 40

    6.4.2. The Alternative Viterbi Algorithm ........................................................... 41

    6.5. Solution to Problem 3 - Parameter Estimation....................................................... 43

    6.5.1. Initial Estimates of HMM Parameters ...................................................... 47

    7. Implementation....................................................................................................... 49

    7.1. Software Modules............................................................................................... 49

    7.2. Working of the System ....................................................................................... 50

    8. Results and Conclusion ........................................................................................... 54

    8.1. Training ............................................................................................................. 54

    8.2. Recognition........................................................................................................ 54

    8.3. Conclusion ......................................................................................................... 55

    8.4. Future Work ....................................................................................................... 55

    Reading and References.......................................................................................... 56
    Appendix A - Source Code in MATLAB .................................................................. 57


    List of Figures                                                                  Page No.
    1. Fig 1.1 Block Diagram of Speech Recognition System 8

    2. Fig 3.1 Speech Sample 17

    3. Fig 3.2 Energy plot of the sample 17

    4. Fig 3.3 Zero crossing rate of the sample 18

    5. Fig 3.4 Detected word from the sample 18

    6. Fig 4.1 Steps in Feature Extraction 21

    7. Fig 4.2 Mel Scale Filter Bank 22

    8. Fig 5.1 Word based Acoustic Model 25

    9. Fig 5.2 Phone based Acoustic Model 26

    10. Fig 6.1 Diagrammatic Representation of HMM 28

    11. Fig 6.2 Forward Variable Calculation 32

    12. Fig 6.3 Backward Procedure-Induction step 34

    13. Fig 6.4 Baum-Welch method 44

    14. Fig 6.5 Left-Right model of HMM 47

    15. Fig 7.1 Block Diagram of working of the system 51

    16. Fig 7.2 The main interface 52

    17. Fig 7.3 Training 52

    18. Fig 7.4 Recognition 53


    Chapter One

    Introduction

    1.1 An overview of Speech Recognition

    Speech recognition refers to the ability to listen to spoken words (input in audio format), identify the various sounds present in them, and recognize them as words of some known language. Speech recognition in the computer systems domain may then be defined as the ability of computer systems to accept spoken words in audio format - such as wav or raw - and generate their content in text format.

    Speech recognition in the computer domain involves various steps, each with issues attached to it. The steps required to make computers perform speech recognition are: voice recording, word boundary detection, feature extraction, and recognition with the help of knowledge models.

    Word boundary detection is the process of identifying the start and the end of a spoken word in the given sound signal[8]. While analyzing the sound signal, at times it becomes difficult to identify the word boundary. This can be attributed to differences in how people speak, such as their accent and the duration of the pause they leave between words.

    Feature Extraction refers to the process of conversion of sound signal to a form suitable for

    the following stages to use. Feature extraction may include extracting parameters such as

    amplitude of the signal, energy of frequencies, etc.

    Recognition involves mapping the given input (in the form of various features) to one of the known sounds. This may involve the use of various knowledge models for precise identification and ambiguity removal.

    Knowledge models refer to models such as the phone acoustic model, language models, etc., which help the recognition system. To generate the knowledge models one needs to train the system. During the training period one shows the system a set of inputs and the outputs they should map to. This is often called supervised learning.

    1.2 Problem Definition and Scope

    The aim of this project is to build a speech recognition system for the English language. It is a connected word speech recognition system. The system will receive speech input consisting of a series of connected words and it will output the text sequence corresponding to it. The system will use a Continuous Hidden Markov Model for acoustic modeling of the speech.

    Scope: This project has speech recognizing capabilities. It is designed to work in noise-constrained as well as user-constrained environments. The software can recognize a word and convert it into text, which can be linked to an action such as the execution of a command, opening a program, sending mails, etc. It can recognize a single word as well as a sequence of connected words separated by a small pause.

    1.3 Design of the system

    An abstract overview of the system is visualized as a block diagram (Fig 1.1, Block Diagram of Speech Recognition System). The various components are briefed below:

    Word Detector and Extractor: This component is responsible for taking the input from the microphone or from a prerecorded WAV file and detecting the words present in it. In this way the speech signal is segmented into individual words, which can then be processed independently. The word boundary is detected by measuring the Energy and the Zero Crossing Rate of the signal.

    Feature Extractor: This component generates feature vectors from the given signal. It generates Mel Frequency Cepstrum Coefficients (MFCCs) and normalized energy as features that are used to uniquely identify the given sound signal[9].

    Recognizer Module: This is a continuous Hidden Markov Model[7] based

    component. It is the most important part of the system which performs the actual

    recognition of the speech signal by finding the best match in the knowledge

    base.

    Acoustic and Knowledge model: These components define the structure of the sound and how it will be represented in the computer. These models are used to map and model the acoustic characteristics of the given sound signal.

    1.4 History

    The concept of speech recognition started somewhere in the 1940s; practically, the first speech recognition program appeared in 1952 at Bell Labs, and it was about the recognition of a digit in a noise-free environment.

    The 1940s and 1950s are considered the foundational period of speech recognition technology; in this period work was done on the foundational paradigms of speech recognition, that is, automation and information-theoretic models.

    In the 1960s it became possible to recognize small vocabularies (on the order of 10-100 words) of isolated words, based on simple acoustic-phonetic properties of speech sounds. The key technologies developed during this decade were filter banks and time normalization methods.

    In the 1970s medium vocabularies (on the order of 100-1000 words) were recognized using simple template-based pattern recognition methods.

    In the 1980s large vocabularies (1000 words to unlimited) were used, and speech recognition problems based on statistical methods, with a large range of networks for handling language structures, were addressed. The key inventions of this era were the Hidden Markov Model (HMM) and the stochastic language model, which together enabled powerful new methods for handling the continuous speech recognition problem efficiently and with high performance.

    In the 1990s the key technologies developed were the methods for stochastic language understanding, statistical learning of acoustic and language models, and the methods for implementation of large vocabulary speech understanding systems.


    After five decades of research, speech recognition technology has finally entered the marketplace, benefiting users in a variety of ways. The challenge of designing a machine that truly functions like an intelligent human is still a major one going forward.

    1.5 Uses of Speech Recognition System

    Basically, speech recognition is used for two main purposes. The first and foremost is dictation, which in the context of speech recognition is the translation of spoken words into text; the second is controlling the computer, that is, developing software capable of allowing a user to operate different applications by voice.

    Writing by voice lets a person write 150 words per minute or more, if he/she can indeed speak that quickly. This aspect of speech recognition programs creates an easy way of composing text and helps people in that industry compose millions of words digitally in a short time rather than writing them one by one; this way they can save time and effort.

    Speech recognition is an alternative to the keyboard. If you are unable to write or just don't want to type, then speech recognition programs help you do almost anything that you used to do with a keyboard.

    1.6 Applications

    1.6.1 From a medical perspective: People with disabilities can benefit from speech recognition programs. Speech recognition is especially useful for people who have difficulty using their hands; in such cases speech recognition programs are very beneficial and can be used for operating computers. Speech recognition is also used in deaf telephony, such as voicemail-to-text.

    1.6.2 From a military perspective: Speech recognition programs are important from the military perspective; in the Air Force, speech recognition has definite potential for reducing pilot workload. Besides the Air Force, such programs can also be trained for use in helicopters, battle management and other applications.

    1.6.3 From an educational perspective: Individuals with learning disabilities who have problems with thought-to-paper communication (essentially they think of an idea but it is processed incorrectly, causing it to end up differently on paper) can benefit from the software.

    1.7 Speech Recognition Weaknesses and Flaws

    Besides all these advantages and benefits, a hundred percent perfect speech recognition system has yet to be developed. There are a number of factors that can reduce the accuracy and performance of a speech recognition program. The speech recognition process is easy for a human but a difficult task for a machine. Compared with the human mind, speech recognition programs seem less intelligent; for a human, the capability of thinking, understanding and reacting is natural, while for a computer program it is a complicated task: it first needs to understand the spoken words with respect to their meanings, and it has to create a sufficient balance between the words, noise and spaces. A human has a built-in capability of filtering the noise from speech, while a machine requires training; the computer requires help in separating the speech sound from the other sounds.

    A few factors that are considerable in this regard are:

    Homonyms: These are words that are spelled differently and have different meanings but sound the same, for example "there" and "their", "be" and "bee". It is a challenge for a machine to distinguish between such phrases that sound alike.

    Overlapping speech: A second challenge in the process is to understand speech uttered by different users; current systems have difficulty separating simultaneous speech from multiple users.

    Noise factor: The program requires hearing the words uttered by a human distinctly and clearly. Any extra sound can create interference; the system first needs to be placed away from noisy environments, and the speaker must speak clearly, else the machine will get confused and mix up the words.

    1.8 Related Works

    A lot of speech-aware applications are already available in the market. Various dictation software packages have been developed by Dragon[10], IBM and Philips. Genie is an interactive speech recognition software developed by Microsoft. Various voice navigation applications, one developed by AT&T, allow users to control their computer by voice, for example browsing the Internet by voice. Many more applications of this kind are appearing every day.


    The SPHINX speech recognizer from CMU[11] provides the acoustic as well as the language models used for recognition. It is based on Hidden Markov Models (HMMs). The SONIC recognizer[12], developed by the University of Colorado, is another one. There are other recognizers such as XVoice[13] for Linux that take input from IBM's ViaVoice, which now exists only for Windows. Background noise is the worst part of a speech recognition process: it confuses the recognizer and makes it unable to hear what it is supposed to. One such recognizer has been devised for robots that, despite the inevitable motor noises, lets them communicate with people efficiently. This is made possible by using a noise-type-dependent acoustic model corresponding to a performing motion of the robot. Optimizations for speech recognition on an HP SmartBadge IV embedded system have been proposed to reduce the energy consumption while still maintaining the quality of the application.

    Another such scalable system has been proposed for DSR (Distributed Speech Recognition) by combining it with scalable compression, hence reducing the computational load as well as the bandwidth requirement on the server. Various capabilities of current speech recognizers in the field of telecommunications, like Voice Banking and Directory Assistance, have also been described.

    1.9 The future of speech recognition

    Accuracy will become better and better.

    Dictation speech recognition will gradually become accepted.

    Greater use will be made of intelligent systems which will attempt to guess

    what the speaker intended to say, rather than what was actually said, as people

    often misspeak and make unintentional mistakes.

    Microphone and sound systems will be designed to adapt more quickly to

    changing background noise levels, different environments, with better

    recognition of extraneous material to be discarded.

    1.10 Overview of the Project Report

    The contents of the chapters are as follows:

    Chapter 2 discusses the types of speech recognition systems based on algorithms, speaker dependency and their ability to recognize lists of words. It also compares the different types of algorithms based on their application and reliability.


    Chapter 3 explains how the words are detected and extracted from the speech

    samples. It discusses the algorithms and techniques which are required to

    implement a word boundary detector.

    Chapter 4 explains how the features are extracted from the given speech

    samples. It explains how the signal can be segmented into overlapping frames

    and each frame can be transformed into set of multi-dimensional vectors.

    Chapter 5 explains what an acoustic model is and how it is represented using an HMM. It also gives a brief overview of language models.

    Chapter 6 explains the Hidden Markov Model in detail and how it can be used in speech recognition. It explains in detail how a continuous HMM can be implemented for the recognition of the words in a speech signal.

    Chapter 7 explains how the project was implemented in MATLAB. It gives brief

    information about the modules used in the software.

    Chapter 8 concludes the report with a summary of the work done, the next proposed steps to be taken and the further work which can be done.

    An appendix is provided at the end which contains the detailed source code of the project in MATLAB.


    Chapter Two

    Types of Speech Recognition Systems

    Speech recognition systems can be classified on various bases, such as the algorithm used, the ability to recognize words and lists of words, dependency on the user, etc. Some of the classifications are explained below:

    2.1 Based on Algorithms: There are mainly three popular approaches to perform speech recognition, namely Hidden Markov Model (HMM) based, Dynamic Time Warping (DTW) based and Artificial Neural Network based.

    2.1.1 Hidden Markov Model based: Modern general-purpose speech recognition systems are based on Hidden Markov Models[1][2][3][4][5][6]. These are statistical models that

    output a sequence of symbols or quantities. HMMs are used in speech recognition because

    a speech signal can be viewed as a piecewise stationary signal or a short-time stationary

    signal. In a short time-scale (e.g., 10 milliseconds), speech can be approximated as a

    stationary process. Speech can be thought of as a Markov model for many stochastic

    purposes. Another reason why HMMs are popular is because they can be trained

    automatically and are simple and computationally feasible to use.

    2.1.2 Dynamic Time Warping based: Dynamic time warping[2] is an approach that was historically used for speech recognition but has now largely been displaced by the more successful HMM-based approach. Dynamic time warping is an algorithm for measuring similarity between two sequences that may vary in time or speed. For instance, similarities in walking patterns would be detected, even if in one video the person was walking slowly and in another he or she was walking more quickly, or even if there were accelerations and decelerations during the course of one observation. DTW has been applied to video, audio, and graphics; indeed, any data that can be turned into a linear representation can be analyzed with DTW.
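    As an illustration of the idea only (this project is HMM based and does not use DTW), a minimal MATLAB sketch of the classic DTW recursion between two feature sequences might look as follows; the function name and the Euclidean local distance are illustrative assumptions:

        function cost = dtw_distance(X, Y)
            % X (d x N) and Y (d x M) are two sequences of feature vectors, one per column.
            N = size(X, 2); M = size(Y, 2);
            D = inf(N + 1, M + 1);          % accumulated cost matrix
            D(1, 1) = 0;
            for i = 1:N
                for j = 1:M
                    d = norm(X(:, i) - Y(:, j));            % local frame distance
                    % allow a match, an insertion or a deletion of a frame (the warping)
                    D(i + 1, j + 1) = d + min([D(i, j), D(i, j + 1), D(i + 1, j)]);
                end
            end
            cost = D(N + 1, M + 1);          % total alignment cost between the sequences
        end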

    2.1.3 Artificial Neural Network based: Neural networks emerged as an attractive acoustic modeling approach in ASR in the late 1980s. Since then, neural networks have

    been used in many aspects of speech recognition such as phoneme classification, isolated

    word recognition, and speaker adaptation. In contrast to HMMs, neural networks make no

    assumptions about feature statistical properties and have several qualities making them

    attractive recognition models for speech recognition. When used to estimate the

    probabilities of a speech feature segment, neural networks allow discriminative training in

    a natural and efficient manner. Few assumptions on the statistics of input features are made

    with neural networks. However, in spite of their effectiveness in classifying short-time units

    such as individual phones and isolated words, neural networks are rarely successful for

    continuous recognition tasks, largely because of their lack of ability to model temporal dependencies. Thus, one alternative approach is to use neural networks as a pre-processing step, e.g. feature transformation or dimensionality reduction, for HMM-based recognition.

    2.2 Based on ability to recognize words: Speech recognition systems can be divided into a number of classes based on their ability to recognize words and the lists of words they have. A few classes are as under:

    2.2.1 Isolated Speech Recognition: Isolated word recognition usually requires a pause between two utterances; it doesn't mean that it only accepts a single word, but instead it requires one utterance at a time.

    2.2.2 Connected Speech: Connected words or connected speech is similar to isolated speech but allows separate utterances with minimal pause between them.

    2.2.3 Continuous Speech: Continuous speech allows the user to speak almost naturally; it is also called computer dictation.

    2.2.4 Spontaneous Speech: At a basic level, spontaneous speech can be thought of as speech that is natural sounding and not rehearsed. An ASR system with spontaneous speech ability should be able to handle a variety of natural speech features such as words being run together, "ums" and "ahs", and even slight stutters.

    2.3 Based on dependency on user: Based on the dependency on the user's voice, speech recognition systems can be classified as:

    2.3.1 Speaker dependent speech recognition: Speaker-dependent systems work by learning the unique characteristics of a single person's voice, in a way similar to voice recognition. New users must first "train" the software by speaking to it, so the computer can analyze how the person talks. This often means users have to read a few pages of text to the computer before they can use the speech recognition software.

    2.3.2 Speaker Independent speech recognition: Speaker-independent software is designed to recognize anyone's voice, so no training is involved. This means it is the only real option for applications such as interactive voice response systems, where businesses can't ask callers to read pages of text before using the system. The downside is that speaker-independent software is generally less accurate than speaker-dependent software.


    Chapter Three

    Words Detection and Extraction

    This component's responsibility is to accept input from a microphone and forward it to the feature extraction module. Before converting the signal into a suitable or desired form, it also performs the important task of identifying the segments of the sound containing words. It also has a provision for saving the sound into WAV files, which are needed by the training component. The microphone is configured to receive the input signal at a sampling rate of 8000 samples per second with 16 bits per sample and a mono channel.

    3.1 Principle of Word Detection

    In speech recognition it is important to detect when a word is spoken. The system detects the regions of silence; anything other than silence is considered a spoken word by the system. The system uses the energy pattern present in the sound signal and the zero crossing rate to detect the silent regions. Taking both of them is important, as energy alone tends to miss some parts of sounds which are important. This process is also called Voice Activity Detection.

    3.2 Methodology

    For word detection, a sample is broken into frames by taking frame samples every 10 milliseconds. Consecutive segments are separated by an overlapping distance which is nearly 50% of the length of the frame. The energy and zero crossings for this duration are calculated. Energy is calculated by adding the square of the value of the waveform at each instance and then dividing it by the number of instances over the period of the sample. The zero crossing rate is the number of times the value of the wave goes from a negative value to positive or vice-versa.

    The Word Detector assumes that the first 100 milliseconds are silence. It uses the average Energy and average Zero Crossing Rate obtained during this time to characterize the background noise. The upper thresholds for energy and zero crossing are set to 2 times the average values of the background noise. The lower thresholds are set to 0.75 times the upper thresholds.

    While detecting the presence of a word in the sound, if the energy or the zero crossing rate goes above the upper threshold and stays above it for three consecutive samples, a word is assumed to be present and the recording is started. The recording continues till the energy and zero crossing both fall below the lower threshold and stay there for at least 30 milliseconds[8].
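    A minimal MATLAB sketch of the frame energy and zero-crossing computation with the thresholds described above is given below; the signal variable x and the exact frame bookkeeping are illustrative assumptions, not the project code:

        fs       = 8000;                        % sampling rate (samples per second)
        frameLen = round(0.010 * fs);           % 10 ms frames
        hop      = round(frameLen / 2);         % about 50% overlap between frames
        nFrames  = floor((length(x) - frameLen) / hop) + 1;
        energy   = zeros(1, nFrames);  zcr = zeros(1, nFrames);
        for k = 1:nFrames
            frame     = x((k-1)*hop + 1 : (k-1)*hop + frameLen);
            energy(k) = sum(frame .^ 2) / frameLen;          % average energy of the frame
            zcr(k)    = sum(abs(diff(sign(frame)))) / 2;     % zero-crossing count
        end
        nSil   = round(0.100 * fs / hop);       % the first 100 ms is assumed to be silence
        eUp    = 2 * mean(energy(1:nSil));   zUp = 2 * mean(zcr(1:nSil));   % upper thresholds
        eLow   = 0.75 * eUp;                 zLow = 0.75 * zUp;             % lower thresholds
        inWord = (energy > eUp) | (zcr > zUp);  % candidate speech frames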

    Fig 3.1 Speech Sample

    Fig 3.2 Energy plot of the samples


    Fig 3.3 Zero Crossing Rate Plot of the samples

    Fig 3.4 Detected Word from the sample

    By applying the same algorithm after the extraction of a word, all the words present in the speech sample can be extracted in this way, provided they are separated by a small pause. The set of words extracted from the signal is then fed into the feature extractor, where the signal is converted into a set of feature vectors. This word detection algorithm is only applicable when the speech signal is constituted of connected words with a pause between consecutive words.


    In the case of continuous and spontaneous speech it is not possible to apply the Energy and Zero Crossing Rate techniques, because the word boundaries in continuous speech are not visible due to fusion.

    3.3 Performance

    The word detector was developed in MATLAB and tested. The word detector showed good performance and was able to extract all the words from the signal, provided there was a half-second gap between two consecutive words.

    Some words may themselves be constituted of sub-words at some distance. This problem can be solved by setting a threshold distance and merging the frames in case the distance between them is less than the threshold. In the experiment performed, the threshold distance was measured to be 100 frames.


    Chapter Four

    Feature Extraction

    Humans have the capacity to identify different types of sounds (phones). Phones put in a particular order constitute a word. If we want a machine to identify a spoken word, it will have to differentiate between different kinds of sound the way humans perceive them. The point to be noted in the case of humans is that although one word spoken by different people produces different sound waves, humans are able to identify the sound waves as the same. On the other hand, two sounds which are different are perceived as different by humans. The reason is that even when the same phones or sounds are produced by different speakers they have common features. A good feature extractor should extract these features and use them for further analysis and processing. So feature extraction is mainly the extraction of relevant information from the speech blocks. A variety of choices for this task can be applied. The most commonly used methods for speech recognition are linear prediction and Mel-cepstrum coefficient calculation[9]. These measures are widely used, and here are some reasons why:

    These measures provide a good model of the speech signal. This is particularly true in the quasi steady state of voiced regions of speech.

    The way these measures are calculated leads to reasonable source-vocal tract separation. This property leads to a fairly good representation of the vocal tract characteristics.

    The measures have an analytically tractable model.

    Experience has shown that these measures work well in speech recognition applications.

    Other measures to add to the feature vectors are the energy measures and also the delta and acceleration coefficients. Delta coefficients mean that a derivative approximation of some measures (e.g. MFCC coefficients) is added, and the acceleration coefficients are the second derivative approximation of those measures.

    The steps of feature extraction are shown in Fig 4.1[2]:


    Windowing: Windowing is the process in which the speech signal is segmented into overlapping frames. Each segment is called a frame. The length of each frame is 0.01 seconds. The overlap between two frames is kept at up to 50%, i.e. 0.005 seconds. There are various types of windows which can be used: the Rectangular window, the Bartlett window and the Hamming window.

    The system developed uses the Hamming window as it introduces the least amount of distortion. The impulse response of the Hamming window is a raised cosine, given by the window function:

    w(n) = 0.54 + 0.46 cos(n*pi/m), for -m <= n <= m
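    A small MATLAB sketch of this windowing step, assuming a signal x sampled at rate fs; the window here is computed directly from the shifted (0-based) form of the raised-cosine formula above, so no toolbox function is required:

        frameLen = round(0.010 * fs);                  % 0.01 s frames
        hop      = round(0.005 * fs);                  % 0.005 s step, i.e. 50% overlap
        n        = (0:frameLen-1)';
        w        = 0.54 - 0.46 * cos(2*pi*n / (frameLen-1));   % Hamming window, shifted form
        nFrames  = floor((length(x) - frameLen) / hop) + 1;
        frames   = zeros(frameLen, nFrames);
        for k = 1:nFrames
            seg          = x((k-1)*hop + 1 : (k-1)*hop + frameLen);
            frames(:, k) = w .* seg(:);                % one windowed frame per column
        end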


    Mel Scale: After the framing of the signal, an N-point DFT is calculated to analyze its spectrum. The frequencies are mapped onto the mel scale, which is given by:

    mel(f) = 2595 * log10(1 + f/700)

    Fig 4.2 Mel Scale Filter Bank

    The frequencies are picked according to a logarithmic scale as they match the human auditory system. Then the amplitude values corresponding to the mel frequencies are calculated and their logarithms are taken. This sequence of log values is treated as a signal and an inverse discrete cosine transformation is performed on it. The resulting coefficients are called the cepstrum.
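    A rough MATLAB sketch of this computation for one windowed frame is shown below; H is assumed to be a precomputed mel-spaced triangular filter bank (nFilt x (N/2+1)), dct is from the Signal Processing Toolbox, and it plays the role of the inverse cosine transform mentioned above:

        N       = 256;                            % DFT length
        spec    = abs(fft(frame, N));             % magnitude spectrum of the frame
        spec    = spec(1 : N/2 + 1);              % keep the non-negative frequencies
        melSpec = H * spec;                       % output of each mel-spaced filter
        logMel  = log(melSpec);                   % take logarithms of the filter outputs
        c       = dct(logMel);                    % cosine transform of the log values
        mfcc    = c(1:13);                        % keep the first few cepstral coefficients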

    Lifter: A lifter is used to zero out (or cut away) some of the last mel cepstrum coefficients. After this the final mel cepstrum values are found. This is done to remove or discard the unwanted mel coefficients.


    Energy Measures: An extra measure to augment the coefficients derived from the mel-cepstrum is the log of

    signal energy. This means for every frame an extra energy term is added.

    Delta and Acceleration coefficients: Spectral transitions are believed to play an important role in human speech perception. Therefore it is desirable to add information about time differences, i.e. the delta coefficients, and also the acceleration coefficients.
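    A simple way to approximate these derivatives is sketched below in MATLAB for a matrix C holding one cepstral vector per column; real systems often use a regression over several neighbouring frames rather than this plain difference:

        delta    = [zeros(size(C,1), 1), diff(C, 1, 2)];      % first difference across frames
        accel    = [zeros(size(C,1), 1), diff(delta, 1, 2)];  % second difference (acceleration)
        features = [C; delta; accel];                         % static + delta + acceleration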


    Chapter Five

    Knowledge Models

    For speech recognition, the system needs to know how the words sound. For this we need to train

    the system. During the training, using the data given by the user, the system generates acoustic

    model and language model. These models are later used by the system to map a sound to a word

    or a phrase.

    5.1 Acoustic Model

    Features that are extracted by the Feature Extraction module need to be compared against a model to identify the sound that was produced as the word that was spoken. This model is called the Acoustic Model.

    There are two kinds of Acoustic Models:

    1. Word Model

    2. Phone Model

    5.1.1 Word Model

    Word models are generally used for small vocabulary systems. In this model the words are modelled as a whole; thus each word needs to be modelled separately. If we need to add support to recognize a new word, we will have to train the system for that word. In the recognition process, the sound is matched against each of the models to find the best match. This best match is assumed to be the spoken word. Building a model for a word requires us to collect sound files of the word from various users. These sound files are then used to train an HMM model. Figure 5.1 shows a diagrammatic representation of the word based acoustic model.


    Fig 5.1 Word Based Acoustic Model

    5.1.2 Phone Model

    In the phone model, instead of modelling the whole word, we model only parts of words, generally phones, and the word itself is modelled as a sequence of phones. The heard sound is now matched against the parts and the parts are recognized. The recognized parts are put together to form a word. For example, the word "ek" is generated by a combination of the two phones "A" and "k". This is generally useful when we need a large vocabulary system. Adding a new word to the vocabulary is easy: as the sounds of the phones are already known, only the possible sequences of phones for the word, with their probabilities, need to be added to the system. Figure 5.2 shows a diagrammatic representation of the phone based acoustic model.


    Fig 5.2 Phone based Acoustic Model

    Phone models can be further classified into:

    1. Context-Independent Phone Model

    2. Context-Dependent Phone Model

    5.1.2.1 Context-Independent Phone Model

    In this model individual phones are modelled. The context that they occur

    is not modelled. The good thing about this model is that the number of

    phone that have to be modelled is small. Thus the complexity of the

    system is less.

    5.1.2.2 Context-Dependent Phone Model


    While modelling a phone, its neighbors are also considered. This means that "iy" surrounded by "z" and "r" is a separate entity as compared to "iy" surrounded by "h" and "r". This results in a growth of the number of modelled phones, which increases the complexity.

    In both word acoustic model and phone acoustic model we need to model

    silence and filler words too. Filler words are the sounds that humans produce

    between two words.

    Both these models can either be implemented using a Hidden Markov Model

    or a Neural Network. HMM is more widely used technique in automatic

    speech recognition systems.

    5.2 Language Model

    Although there are words that have similar sounding phone, humans generally do not find

    it difficult to recognize the word. This is mainly because they know the context, and also

    have a fairly good idea about what words or phrases can occur in the context. Providing

    this context to a speech recognition system is the purpose of language model. The language

    model specifies what are the valid words in the language and in what sequence they can

    occur.

    5.2.1 Classification

    Language models are classified into several categories:

    Uniform Models: Each word has equal probability of occurrence.

    Stochastic Model: Probability of occurrence of a word depends on the words preceding it.

    Finite State Language: Language uses finite state network to define allowed word sequences.

    Context Free Grammar: Context free grammar can be used to encode which kind of sentences are allowed.
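    As a toy illustration of the stochastic model, a bigram model over a small three-word vocabulary can be built from (hypothetical) counts and used to score a word sequence, sketched here in MATLAB:

        counts = [2 5 1;                      % counts of word i followed by word j
                  4 1 3;
                  1 2 2];
        B      = counts ./ sum(counts, 2);    % row-normalize: B(i,j) = P(word j | word i)
        seq    = [1 2 3];                     % a word sequence as vocabulary indices
        logP   = 0;
        for t = 2:numel(seq)
            logP = logP + log(B(seq(t-1), seq(t)));   % accumulate log bigram probabilities
        end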


    Chapter Six

    Hidden Markov Model

    The Hidden Markov Model (HMM)[1][2][3][4][5][6] is a state machine. The states of the model are represented as nodes and the transitions are represented as edges. The difference in the case of an HMM is that the symbol does not uniquely identify a state. The new state is determined by the symbol and the transition probabilities from the current state to a candidate state.

    Fig 6.1 Diagrammatic Representation of HMM

    The figure above shows a diagrammatic representation of an HMM. Nodes denoted as circles are states. O1 to O5 are observations. Observation O1 takes us to state S1. aij defines the transition probability between Si and Sj. It can be observed that the states also have self-transitions. If we are at state S1 and observation O2 is observed, we can either decide to go to state S2 or stay in state S1. The decision is made depending on the probability of the observation at both the states and the transition probability.

    Thus the HMM model is defined as:

    λ = (Q, O, A, B, π)

    where Q is {qi} (all possible states),

    O is {vi} (all possible observations),

    A is {aij}, where aij = P(Xt+1 = qj | Xt = qi) (transition probabilities),

    B is {bi}, where bi(k) = P(Ot = vk | Xt = qi) (probability of observing symbol vk in state i),

    π is {πi}, where πi = P(X0 = qi) (initial state probabilities).

    Xt denotes the state at time t.

    Ot denotes the observation at time t.

    6.1 HMM and Speech Recognition

    HMM can be classified upon various criteria:

    1. Value of Occurrences

    i) Discrete

    ii) Continuous

    2. Dimension

    i) One Dimensional

    ii) Multi-Dimensional

    3. Probability Density Function

    i) Continuous Density (Gaussian Distribution based)

    ii) Discrete Density (Vector quantization based)

    While using an HMM for recognition, we provide the occurrences (observations) to the model and it returns a number. This number is the probability with which the model could have produced that output. In speech recognition the occurrences are feature vectors rather than just symbols; each occurrence is a group of real numbers. Thus, what we need for speech recognition is a Continuous, Multi-dimensional HMM.
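    For such a continuous HMM the observation probability b_j(o) of a state is usually a Gaussian mixture. A minimal MATLAB sketch with diagonal covariances is given below; the function and variable names are illustrative assumptions:

        function b = emission_prob(o, c, mu, sig2)
            % o: d x 1 feature vector; c: 1 x M mixture weights;
            % mu, sig2: d x M means and diagonal variances of the M mixture components.
            [d, M] = size(mu);
            b = 0;
            for k = 1:M
                q = sum((o - mu(:, k)).^2 ./ sig2(:, k));              % squared, variance-scaled distance
                g = exp(-0.5 * q) / sqrt((2*pi)^d * prod(sig2(:, k))); % diagonal Gaussian density
                b = b + c(k) * g;                                      % weighted sum over mixtures
            end
        end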

    6.2 Three Basic Problems of Hidden Markov Model

    Given the basics of an HMM from the previous section, three basic problems arise when applying the model to a speech recognition task:

    Problem 1

    Given the observation sequence O = (o1, o2, o3, ..., oT) and the model λ = (A, B, π), how is the probability of the observation sequence, given the model, computed? That is, how is P(O|λ) computed efficiently?

    Problem 2

    Given the observation sequence O = (o1, o2, o3, ..., oT) and the model λ = (A, B, π), how is a corresponding state sequence q = (q1, q2, ..., qT) chosen to be optimal in some sense (i.e. best explains the observations)?

    Problem 3

    How are the probability measures λ = (A, B, π) adjusted to maximize P(O|λ)?

    The first problem can be seen as the recognition problem: given some trained models, each representing a word, which model is the most likely one for a given observation sequence? In the second problem an attempt is made to uncover the hidden part of the model. It should be clear that, for all except the case of degenerate models, there is no single "correct" state sequence to be found; it is therefore a problem to be solved as well as possible under some optimality criterion. The third problem can be seen as the training problem: given the training sequences, create a model for each word. The training problem is the crucial one for most applications of HMMs, because it optimally adapts the model parameters to the observed training data, i.e. it creates the best models for the real phenomena.

    6.3 Solution to Problem 1 - Probability Evaluation

    The aim of this problem is to find the probability of the observation sequence O = (o1, o2, o3, ..., oT) given the model λ, i.e. P(O|λ). Because the observations produced by the states are assumed to be independent of each other and of the time t, the probability of the observation sequence O being generated by a certain state sequence q can be calculated as a product:

    P(O|q, λ) = b_q1(o1) · b_q2(o2) ··· b_qT(oT)

    The probability of the state sequence q can be found as:

    P(q|λ) = π_q1 · a_q1q2 · a_q2q3 ··· a_q(T-1)qT

    The joint probability of O and q, i.e. the probability that O and q occur simultaneously, is simply the product of the above two terms, i.e.:

    P(O, q|λ) = P(O|q, λ) P(q|λ)

    The aim was to find P(O|λ), and this probability of O (given the model λ) is obtained by summing the joint probability over all possible state sequences q, giving:

    P(O|λ) = Σ over all q of π_q1 b_q1(o1) a_q1q2 b_q2(o2) ··· a_q(T-1)qT b_qT(oT)

    The interpretation of the above computation is the following. Initially, at time t = 1, the process starts by jumping to state q1 with probability π_q1 and generates the observation symbol o1 with probability b_q1(o1). The clock changes from t to t+1 and a transition from q1 to q2 will occur with probability a_q1q2, and the symbol o2 will be generated with probability b_q2(o2). The process continues in this manner until the last transition is made (at time T), i.e. a transition from q(T-1) to qT will occur with probability a_q(T-1)qT, and the symbol oT will be generated with probability b_qT(oT).

    This direct computation has one major drawback: it is infeasible due to the exponential growth of computations as a function of the sequence length T. To be precise, it needs (2T-1)·N^T multiplications and N^T - 1 additions [1]. Even for small values of N and T, e.g. N = 5 (states) and T = 100 (observations), there is a need for (2·100-1)·5^100 ≈ 1.6·10^72 multiplications and 5^100 - 1 ≈ 8.0·10^69 additions! Clearly a more efficient procedure is required to solve this problem. An excellent tool which cuts the computational requirements to linear, relative to T, is the well-known forward algorithm.

    6.3.1 The Forward Algorithm


    Consider a forward variable α_t(i), defined as:

    α_t(i) = P(o1 o2 ... ot, Xt = qi | λ)

    where t represents time and i is the state. This means that α_t(i) is the probability of the partial observation sequence o1 o2 ... ot (until time t) and being in state i at time t. The forward variable can be calculated inductively, see Fig 6.2.

    Fig 6.2 Forward Variable Calculation

    α_{t+1}(j) is found by summing the forward variables for all N states at time t, multiplied by their corresponding state transition probabilities a_ij, and then multiplying by the emission probability b_j(o_{t+1}). This can be done with the following procedure:


    1. Initialization:

    α_1(i) = π_i b_i(o1), 1 <= i <= N

    2. Induction:

    α_{t+1}(j) = [ Σ_{i=1..N} α_t(i) a_ij ] b_j(o_{t+1}), 1 <= j <= N, 1 <= t <= T-1

    3. Update time: set t = t + 1; return to step 2 if t < T, otherwise terminate.

    4. Termination:

    P(O|λ) = Σ_{i=1..N} α_T(i)

    6.3.2 The Backward Algorithm

    In a similar way a backward variable β_t(i) can be defined as:

    β_t(i) = P(o_{t+1} o_{t+2} ... oT | Xt = qi, λ)

    That is, β_t(i) is the probability of the partial observation sequence from t+1 to the end, given state i at time t and the model λ; it is called the backward probability. In a similar manner (according to the forward algorithm), the backward variable can be calculated inductively, see Fig. 6.3.

    Fig 6.3 Backward Procedure Induction Step

    The backward algorithm includes the following steps:

    1. Initialization:

    β_T(i) = 1, 1 <= i <= N

    2. Induction:

    β_t(i) = Σ_{j=1..N} a_ij b_j(o_{t+1}) β_{t+1}(j), t = T-1, ..., 1, 1 <= i <= N


    3. Update time: set t = t - 1; return to step 2 if t > 0, otherwise terminate the algorithm.

    Note that the initialization in step 1 arbitrarily defines β_T(i) to be 1 for all i.

    6.3.3 Scaling the Forward and Backward Variables

    The calculation of α_t(i) and β_t(i) involves multiplication with probabilities. All these probabilities have a value less than 1 (generally significantly less than 1), and as t starts to grow large, each term of α_t(i) or β_t(i) starts to head exponentially to zero. For sufficiently large t (e.g. 100 or more) the dynamic range of the α_t(i) and β_t(i) computation will exceed the precision range of essentially any machine (even in double precision). The basic scaling procedure multiplies α_t(i) by a scaling coefficient that is dependent only on the time t and independent of the state i. The scaling factor for the forward variable is denoted c_t (scaling is done at every time t for all states i, 1 <= i <= N). This factor will also be used for scaling the backward variable β_t(i). Scaling α_t(i) and β_t(i) with the same scale factor will prove useful in problem 3 (parameter estimation).

    Consider the computation of the forward variable α_t(i). In the scaled variant of the forward algorithm some extra notation will be used: α_t(i) denotes the unscaled forward variable, α^_t(i) (alpha-hat) denotes the scaled and iterated variant of α_t(i), α~_t(i) (alpha-tilde) denotes the local version of α_t(i) before scaling, and c_t represents the scaling coefficient at each time. Here follows the scaled forward algorithm:

    1. Initialization:

    α~_1(i) = π_i b_i(o1),   c_1 = 1 / Σ_{i=1..N} α~_1(i),   α^_1(i) = c_1 α~_1(i)

    2. Induction:

    α~_{t+1}(j) = [ Σ_{i=1..N} α^_t(i) a_ij ] b_j(o_{t+1}),   c_{t+1} = 1 / Σ_{j=1..N} α~_{t+1}(j),   α^_{t+1}(j) = c_{t+1} α~_{t+1}(j)

    3. Update time: set t = t + 1; return to step 2 if t < T, otherwise terminate.

    The ordinary induction step can be found as:

    α_{t+1}(j) = [ Σ_{i=1..N} α_t(i) a_ij ] b_j(o_{t+1})

    Now it is possible to write:

    α^_t(i) = α_t(i) / Σ_{j=1..N} α_t(j)

    As the above equation shows, α^_t(i) is α_t(i) scaled by the sum over all states of α_t(i) when the scaled forward algorithm is applied.

    The termination (step 4) of the scaled forward algorithm, the evaluation of P(O|λ), must be done in a different way. This is because the sum of the α^_T(i) cannot be used, since α^_T(i) is already scaled. However, the following property can be used:

    [ Π_{t=1..T} c_t ] · P(O|λ) = 1, i.e. P(O|λ) = 1 / Π_{t=1..T} c_t

    As the above equation shows, P(O|λ) can be found, but the problem is that if it is used directly the result will still be very small (and probably out of the dynamic range of a computer). If the logarithm is taken on both sides, the following equation can be used:

    log P(O|λ) = - Σ_{t=1..T} log c_t

    This is exactly what is done in the termination step of the scaled forward algorithm.

    The logarithm of P(O|λ) is often just as useful as P(O|λ), because in most cases this measure is used for comparison with other probabilities (for other models).

    The scaled backward algorithm can be found more easily, since it uses the same scale factors as the forward algorithm. The notation used is similar to the forward variable notation: β_t(i) denotes the unscaled backward variable, β^_t(i) denotes the scaled and iterated variant of β_t(i), β~_t(i) denotes the local version of β_t(i) before scaling, and c_t represents the scaling coefficient at each time. Here follows the scaled backward algorithm:

    1. Initialization:

    β^_T(i) = c_T β_T(i) = c_T, 1 <= i <= N

    2. Induction:

    β~_t(i) = Σ_{j=1..N} a_ij b_j(o_{t+1}) β^_{t+1}(j),   β^_t(i) = c_t β~_t(i)


    3. Update time Set t=t-1;

    Return to step 2 if t>0;

    Otherwise, terminate the algorithm
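    The scaled forward recursion above can be sketched compactly in MATLAB; B(:,t) is assumed to hold b_i(o_t) for every state i, and the function and variable names are illustrative, not the project code:

        function [alphaHat, c, logP] = scaled_forward(A, pi0, B)
            [N, T]   = size(B);
            alphaHat = zeros(N, T);  c = zeros(1, T);
            a        = pi0 .* B(:, 1);                 % initialization
            c(1)     = 1 / sum(a);
            alphaHat(:, 1) = c(1) * a;
            for t = 2:T                                % induction, scaling at every time step
                a = (A' * alphaHat(:, t-1)) .* B(:, t);
                c(t) = 1 / sum(a);
                alphaHat(:, t) = c(t) * a;
            end
            logP = -sum(log(c));                       % termination: log P(O|lambda)
        end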

    6.4 Solution to Problem 2 - Optimal State Sequence

    The problem is to find the optimal sequence of states for a given observation sequence and model. Unlike problem one, for which an exact solution can be found, there are several possible ways of solving this problem. The difficulty lies with the definition of the optimal state sequence, that is, there are several possible optimality criteria. One optimality criterion is to choose the states q_t that are individually most likely at each time t. To find this state sequence the following probability variable is needed:

    γ_t(i) = P(X_t = q_i | O, λ)

    That is, the probability of being in state i at time t given the observation sequence O and the model λ. γ_t(i) can also be written in terms of the forward and backward variables:

    γ_t(i) = α_t(i) β_t(i) / P(O|λ) = α_t(i) β_t(i) / Σ_{j=1..N} α_t(j) β_t(j)

    When γ_t(i) is calculated according to the above equation, the most likely state at time t, q*_t, is found by:

    q*_t = argmax_{1<=i<=N} γ_t(i), 1 <= t <= T

    Even if the above equation maximizes the expected number of correct states, there could be some problems with the resulting state sequence, because the state transition probabilities have not been taken into account. For example, what happens when some state transitions have zero probability (a_ij = 0)? This means that the found optimal path may not be valid. Obviously a method generating a path that is guaranteed to be valid would be preferable. Fortunately such a method exists, based on dynamic programming, namely the Viterbi algorithm. Even though γ_t(i) cannot be used for this purpose, it will be useful in problem 3 (parameter estimation).

    6.4.1 The Viterbi Algorithm

    This algorithm is similar to the forward algorithm. The main difference is that the forward algorithm uses a sum over previous states, whereas the Viterbi algorithm uses maximization. The aim of the Viterbi algorithm is to find the single best state sequence, q = (q1, q2, ..., qT), for the given observation sequence O = (o1, o2, ..., oT) and a model λ. Consider the following quantity:

    δ_t(i) = max over q1, ..., q(t-1) of P(q1 q2 ... q(t-1), X_t = q_i, o1 o2 ... ot | λ)

    That is, the probability of observing o1 o2 ... ot using the best path that ends in state i at time t, given the model λ. By induction, δ_{t+1}(j) can be found as:

    δ_{t+1}(j) = [ max_{1<=i<=N} δ_t(i) a_ij ] b_j(o_{t+1})

    To actually retrieve the state sequence, it is necessary to keep track of the argument that maximizes the above equation for each t and j. This is done by saving the argument in an array ψ_t(j). Here follows the complete Viterbi algorithm:

    1. Initialization:

    δ_1(i) = π_i b_i(o1),   ψ_1(i) = 0, 1 <= i <= N


    2. Induction:

    δ_t(j) = [ max_{1<=i<=N} δ_{t-1}(i) a_ij ] b_j(o_t),   ψ_t(j) = argmax_{1<=i<=N} [ δ_{t-1}(i) a_ij ]

    3. Update time: set t = t + 1; return to step 2 if t <= T. Otherwise terminate by picking the best final state, P* = max_{1<=i<=N} δ_T(i) and q*_T = argmax_{1<=i<=N} δ_T(i), and backtracking q*_t = ψ_{t+1}(q*_{t+1}) for t = T-1, ..., 1.

    The same problem as for the forward and backward algorithm occurs here. That is

    the algorithm involves multiplication with probabilities and the precision range will

    be exceeded. This is why an alternative Viterbi algorithm is needed.

    6.4.2 The Alternative Viterbi Algorithm

    As mentioned, the original Viterbi algorithm involves multiplications with probabilities. One way to avoid this is to take the logarithm of the model parameters, so that the multiplications become additions. Obviously this logarithm becomes a problem when some model parameters are zero. This is often the case for A and π, and can be avoided by adding a small number to the matrices. Here follows the alternative Viterbi algorithm:

    1. Preprocessing:

    π~_i = log(π_i),   a~_ij = log(a_ij),   b~_i(o_t) = log(b_i(o_t))

    2. Initialization:

    δ~_1(i) = π~_i + b~_i(o1),   ψ_1(i) = 0

    3. Induction:

    δ~_t(j) = [ max_{1<=i<=N} ( δ~_{t-1}(i) + a~_ij ) ] + b~_j(o_t),   ψ_t(j) = argmax_{1<=i<=N} ( δ~_{t-1}(i) + a~_ij )

    4. Update time

    Set t = t + 1;

    Return to step 3 if t <= T, otherwise continue with:

    (a) Termination: P~* = max_{1<=i<=N} δ~_T(i),   q*_T = argmax_{1<=i<=N} δ~_T(i)

    (b) Backtracking: q*_t = ψ_{t+1}(q*_{t+1})


    (c) Update time

    Set t=t-1;

    Return to step (b) if t>=1;

    Otherwise, terminate the algorithm.
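    A compact MATLAB sketch of this log-domain Viterbi algorithm is given below; B(:,t) again holds b_i(o_t), and the small constant that avoids taking the log of zero, as well as the names, are illustrative assumptions:

        function [q, logPstar] = viterbi_log(A, pi0, B)
            [N, T] = size(B);
            lA = log(A + 1e-300); lPi = log(pi0 + 1e-300); lB = log(B + 1e-300);
            delta = zeros(N, T);  psi = zeros(N, T);
            delta(:, 1) = lPi + lB(:, 1);                       % initialization
            for t = 2:T                                         % induction: additions, not products
                [m, arg]    = max(delta(:, t-1) + lA, [], 1);   % best predecessor for each state
                delta(:, t) = m' + lB(:, t);
                psi(:, t)   = arg';
            end
            [logPstar, qT] = max(delta(:, T));                  % termination
            q = zeros(1, T);  q(T) = qT;
            for t = T-1:-1:1                                    % backtracking
                q(t) = psi(q(t+1), t+1);
            end
        end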

    6.5 Solution to Problem 3 - Parameter Estimation

    The third problem is concerned with the estimation of the model parameters, λ = (A, B, π). The problem can be formulated as: given an observation sequence O, find the model λ, from all possible models, that maximizes P(O|λ). This problem is the most difficult of the three, because there is no known way to analytically find the model parameters that maximize the probability of the observation sequence in closed form. However, the model parameters can be chosen to locally maximize the likelihood P(O|λ). Some commonly used methods for solving this problem are the Baum-Welch method (also known as the expectation-maximization method) and gradient techniques. Both of these methods use iterations to improve the likelihood P(O|λ); however, there are some advantages of the Baum-Welch method compared to the gradient techniques:

    1. Baum-Welch is numerically stable with the likelihood non-decreasing with every

    iteration.

2. Baum-Welch converges to a local optimum.

    3. Baum-Welch has linear convergence.

This is why the Baum-Welch method is used in this project. This section will derive the re-estimation equations used in the Baum-Welch method.

The model λ has three terms to describe, namely the state transition probability distribution A, the initial state distribution π and the observation symbol probability distribution B. Since continuous observation densities are used, B will be represented by cjk, μjk and Σjk. To describe the procedure for re-estimation, the following probability will prove useful:
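In the notation of [1], this quantity is:

    \xi_t(i,j) = P\left( q_t = i,\; q_{t+1} = j \mid O, \lambda \right)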


That is, the probability of being in state i at time t and state j at time t + 1, given the model λ and the observation sequence O. The paths that satisfy the conditions required by the above equation are illustrated in Fig. 6.4.

    Fig 6.4

    By using all the previous equations we can conclude that
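Written out with the forward variable αt(i) and the backward variable βt(i) from the forward and backward algorithms, this is (following [1]):

    \xi_t(i,j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)}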

As mentioned in Problem 2, γt(i) is the probability of being in state i at time t, given the entire observation sequence O and the model λ. Hence
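Following [1]:

    \gamma_t(i) = \sum_{j=1}^{N} \xi_t(i,j)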


If the sum over time t is applied to γt(i), one gets a quantity that can be interpreted as the expected (over time) number of times that state i is visited or, equivalently, the expected number of transitions made from state i (if the time slot t = T is excluded) [1]. If the same summation is done over ξt(i,j), one gets the expected number of transitions from state i to state j. The term γ1(i) will also prove to be useful. Given these definitions it is possible to derive the re-estimation formulas for π and A:
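The expected relative frequency of being in state i at time t = 1 gives, following [1]:

    \bar{\pi}_i = \gamma_1(i)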

    And
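for the transition probabilities, again following [1]:

    \bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}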


The re-estimation of cjk, μjk and Σjk is a bit more complicated. However, if the model had only one state j and only one mixture, it would be a simple averaging task.
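In that single-state, single-mixture case the estimates would reduce to simple sample averages over the T training frames, for example:

    \bar{\mu}_j = \frac{1}{T} \sum_{t=1}^{T} o_t, \qquad \bar{\Sigma}_j = \frac{1}{T} \sum_{t=1}^{T} (o_t - \bar{\mu}_j)(o_t - \bar{\mu}_j)'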

In practice, of course, there are multiple states and multiple mixtures, and there is no direct assignment of the observation vectors to individual states, because the underlying state sequence is unknown. Since the full likelihood of each observation sequence is based on the summation over all possible state sequences, each observation vector ot contributes to the computation of the likelihood for each state j. In other words, instead of assigning each observation vector to a specific state, each observation is assigned to every state and is weighted by the probability that the model was in that state, and using that specific mixture, when the vector was observed. This probability, for state j and mixture k (there are M mixtures), is found by
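In the notation of [1], with N(o; μ, Σ) denoting the Gaussian density:

    \gamma_t(j,k) = \left[ \frac{\alpha_t(j)\, \beta_t(j)}{\sum_{j=1}^{N} \alpha_t(j)\, \beta_t(j)} \right] \left[ \frac{c_{jk}\, \mathcal{N}(o_t; \mu_{jk}, \Sigma_{jk})}{\sum_{m=1}^{M} c_{jm}\, \mathcal{N}(o_t; \mu_{jm}, \Sigma_{jm})} \right]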

    The re-estimation formula for cjk is the ratio between the expected number of times the

    system is in state j using the kth mixture component, and the expected number of times the

    system is in state j. That is:
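Following [1]:

    \bar{c}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j,k)}{\sum_{t=1}^{T} \sum_{k=1}^{M} \gamma_t(j,k)}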

To find μjk and Σjk, one can weight the simple averages by the probability of being in state j and using mixture k when observing ot:
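Following [1]:

    \bar{\mu}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j,k)\, o_t}{\sum_{t=1}^{T} \gamma_t(j,k)}, \qquad \bar{\Sigma}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j,k)\, (o_t - \mu_{jk})(o_t - \mu_{jk})'}{\sum_{t=1}^{T} \gamma_t(j,k)}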


The re-estimation formulas described in this section are based on a single training sample. This is of course not sufficient to obtain reliable estimates, especially when left-right models are used. To get reliable estimates it is convenient to use multiple observation sequences.

6.5.1 Initial Estimates of HMM Parameters

Before the re-estimation formulas can be applied for training, it is important to get good initial parameters so that the re-estimation leads to the global maximum, or as close to it as possible. An adequate choice for π and A is the uniform distribution. Since left-right models are used, however, π will have probability one for the first state and zero for the other states. For example, the left-right model in the figure below has the following initial π and A:
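For instance, for a five-state left-right model of the kind trained later in this project (self-transitions and transitions to the next state only), with the remaining probability mass split uniformly, the initialization could be:

    \pi = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 \end{pmatrix}

    A = \begin{pmatrix}
        0.5 & 0.5 & 0   & 0   & 0   \\
        0   & 0.5 & 0.5 & 0   & 0   \\
        0   & 0   & 0.5 & 0.5 & 0   \\
        0   & 0   & 0   & 0.5 & 0.5 \\
        0   & 0   & 0   & 0   & 1
    \end{pmatrix}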

    Fig 6.5 Left-Right model of HMM


The parameters of the emission distributions need good initial estimates to get rapid and proper convergence. This is done by uniformly segmenting every training sample into the states of the model. After segmentation, all observations belonging to state j are collected from all training samples. Then a clustering algorithm is used to get the initial parameters for state j, and this procedure is repeated for every state. The clustering algorithm used in this project is the well-known k-means algorithm. Before the clustering proceeds one has to choose the number of clusters, K. Here the number of clusters is equal to the number of mixtures, that is, K = M.

The K-Means Algorithm

1. Initialization

    Choose K vectors from the training vectors, here denoted x, at random. These vectors will be the initial centroids μk, which are to be refined.

2. Recursion

    For each vector in the training set, let the vector belong to a cluster k. This is done by choosing the cluster closest to the vector, where d(x, μk) is a distance measure; here the Euclidean distance is used.

3. Test

    Recompute the centroids μk by taking the mean of the vectors that belong to each centroid. This is done for every k. If no vectors belong to some μk, create a new μk by choosing a random vector from x. If the centroids have not changed from the previous step, go to step 4; otherwise go back to step 2.

4. Termination

    From this clustering (done for one state j), the following parameters are found:
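The parameters obtained from the clusters are the mixture weights, means and variances for state j. A minimal MATLAB sketch of this initialization for one state is given below; KMeansInitSketch is an illustrative name rather than a module of this project, and diagonal covariance matrices are assumed for simplicity (X is a dim-by-N matrix of all observation vectors assigned to state j by the uniform segmentation, and M is the number of mixtures, K = M):

    function [c, mu, sigma2] = KMeansInitSketch(X, M)
        [dim, N] = size(X);
        idx = randperm(N);
        mu = X(:, idx(1:M));                       % M random vectors as initial centroids
        assign = zeros(1, N); prevAssign = -ones(1, N);
        while any(assign ~= prevAssign)
            prevAssign = assign;
            for n = 1:N                            % assign each vector to the nearest centroid
                d = sum((mu - repmat(X(:, n), 1, M)).^2, 1);
                [~, assign(n)] = min(d);
            end
            for k = 1:M                            % recompute the centroids
                members = X(:, assign == k);
                if isempty(members)
                    mu(:, k) = X(:, randi(N));     % empty cluster: re-seed with a random vector
                else
                    mu(:, k) = mean(members, 2);
                end
            end
        end
        c = zeros(1, M); sigma2 = zeros(dim, M);
        for k = 1:M                                % weights, means and diagonal variances
            members = X(:, assign == k);
            c(k) = size(members, 2) / N;
            if ~isempty(members)
                sigma2(:, k) = var(members, 1, 2);
            end
        end
    end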


    Chapter Seven

    Implementation

The system was built and tested on a platform with the following specification:

1. System Specification
    1.1. Intel Core i5 CPU @ 2.67 GHz
    1.2. 4 GB of RAM
    1.3. Microsoft Windows 7 x64
    1.4. MATLAB 7.12.0 (R2011a)
    1.5. Microphone

2. Minimum System Requirement
    2.1. Pentium 200 MHz processor
    2.2. 512 MB of RAM
    2.3. Microphone
    2.4. Soundcard

The system was coded in MATLAB and is limited to a command-line interface. The system is divided into various components, each dedicated to a specific task, e.g. speech acquisition, feature extraction, training and recognition. The full source code can be found at the end of the report.

The user can train the model either by recording directly from the microphone or by using samples in pre-recorded WAV files. The same applies for recognition. The project has the constraint that it can recognize either isolated words or a series of connected words separated by a small pause. Each word is detected and extracted by the word detector and then each word is recognized separately.

The sound is recorded at 8000 samples per second with 16 bits per sample, so the quality of the signal is not good enough to sense small differences between two similar signals. The user therefore has to speak clearly.

    7.1 Software Modules


All the software modules were coded in MATLAB, and brief descriptions of the most important modules are given below:

1. GetSpeechSample.m: This function is responsible for acquiring the speech signal either from a WAV file or directly from the microphone. It receives the input at 8000 samples per second and 16 bits per sample, for 6 seconds in the case of training and 10 seconds in the case of recognition. It also trims the first 8000 samples to remove the initial click noise from the microphone, so the user should start speaking only after approximately 1.5 seconds.

2. ExtractFeatures.m: This function receives a speech signal and performs feature extraction on it. It returns a set of 39-dimensional feature vectors, where each vector contains MFCCs, energy, and delta and acceleration coefficients.

3. TrainWord.m: This function receives feature vectors and the string which the recognizer should output when the word is recognized, and generates a model from them, assigning a unique id to it.

4. Recognize.m: This function performs the recognition of the speech sample fed to it from a WAV file or directly from the microphone.

    5. forwardBackwardAlgorithm.m: This function is used to implement the Forward-

    Backward algorithm for training the samples.

6. viterbiAlgorithm.m: This function implements the Viterbi Algorithm, which is used in

    the calculation of the probability score and recognition of the words.

7. UpdateKnowledge.m: This function updates the dictionary whenever a new word is added to the database by training.

    8. GaussianProbability.m: This function calculates the Gaussian probability of a frame of

    the speech sample using the probability distribution function.

    9. ResetDictionary.m: This function is used to reset the dictionary and remove all the

    trained models.

    10. FindNextSegment.m: This function finds the next word by detecting its word boundary

    by comparing the energy and the zero crossing rates.

    7.2 Working of the system

The working of the system can be understood by breaking the full functionality into sub-functions. First, the signal is acquired by the speech acquisition module. This module trims off the first 8000 samples, corresponding to 1 second of the signal, and passes the signal to the word detector. The word detector detects the boundary of each word present in the signal by analysing the energy and the zero crossing rates, and stores the start and finish point of each word in a list. Word segments whose distance is less than the threshold distance are merged together and the list is updated. Each word segment is then fed into the feature extractor module and converted into a set of feature vectors. This is passed on to the recognizer module, which finds the closest match among the models stored in the database. After the recognition of this word is complete, the next word is fetched from the list and the process is repeated.
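A condensed, hypothetical MATLAB sketch of this merging step is shown below (the variable names are illustrative and do not match the appendix code exactly; segList is assumed to be an S-by-2 matrix holding the start and finish frame of each detected segment, sorted by start frame, and 100 frames is used as the merging threshold, as in the appendix):

    % Merge detected word segments whose gap is smaller than the threshold.
    distanceThreshold = 100;                  % minimum gap (in frames) between separate words
    merged = segList(1, :);
    for s = 2:size(segList, 1)
        if segList(s, 1) - merged(end, 2) < distanceThreshold
            merged(end, 2) = segList(s, 2);   % small gap: extend the previous segment
        else
            merged(end+1, :) = segList(s, :); % large gap: start a new word segment
        end
    end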

Before speech recognition can be performed, the individual words first need to be trained. In the training part the user provides input either from a pre-recorded WAV file or directly from the microphone. The training module extracts features from it and generates an HMM for the word. The probabilities are adjusted by performing the expectation-maximization process. The full overview of the working of the system is shown below in figure 7.1.

    Fig 7.1 Block Diagram of Working of the system


    Snapshots

    1. The Main Interface

Fig 7.2 The Main Interface

    2. Training new word

Fig 7.3 Training


    3. Recognition

Fig 7.4 Recognition


    Chapter Eight

    Results and Conclusion

    8.1 Training

Ten of the most frequently used English words were trained and their models were generated. For this I used my own voice; approximately 300 samples in total were taken, with 30 training samples for each word.

Each model was trained with the Expectation-Maximization algorithm, with 7 iterations per model. The total number of states in each model was 5. Each model was stored in a unique .MAT file.

The sound was recorded at 8000 samples per second, 16 bits per sample, mono channel.

    8.2 Recognition

The recognition module was given a WAV file as input which contained a pre-recorded speech sample. Each speech sample contained 5 to 6 words separated by a minimum distance of 100 frames. First the words are detected and extracted from the signal, and then each word is recognized independently.

The recognition was tested using samples both from the user who trained the system and from an unknown user. The environment was also changed, so that it differed from the environment in which the system was trained.

The results of the experiment are as shown below:

Type of condition                          No. of samples   Correct output   Accuracy (%)
Known user and known environment                 30               30             100%
Known user and unknown environment               20               14              80%
Unknown user and known environment               25               12              48%
Unknown user and unknown environment             15                3              20%

Table 8.1 Recognition Results


    8.3 Conclusion

The theory of Hidden Markov Models has been studied thoroughly. Together with the signal processing of speech signals, a speech recognizer has been implemented in MATLAB. A word boundary detector has also been implemented. The performance of the word detector is up to the requirement, and it performed well in all conditions.

An HMM library was built which was used for recognition and also for training a word-based acoustic model. Word models for some of the most frequently used words of the English language were built.

The trained models were used to recognize speech. The recognizer gave its best performance when the user and the environment were both known to the recognizer. It gave the worst performance when the user and the environment were both unknown. This happened because the models were trained in a constrained environment with only one user. This problem can be addressed by including more training data from different persons and in different environments. The other cases produced average results.

    8.4 Future Work

    Further improvements and expansions may be achieved by using one or more of the following

    suggestions:

The speech recognizer is implemented in MATLAB and therefore runs slowly. Implementing the speech recognizer in C or assembly language would be desirable to get a faster execution time.

In a noisy environment, for example in a car, noise reduction algorithms are needed to enhance the signal-to-noise ratio. Useful algorithms can be based on adaptive noise reduction, spectral subtraction or beamforming.

Record a larger evaluation database, for different speakers and different environments, to get more test cases.

Try different settings in the speech recognizer, for example changing the model structure, the number of states or the number of mixtures. More or fewer measures of the speech can be added to the feature vectors, that is, experiment with the feature vector dimension and its content.


    Reading and References

[1] Lawrence R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proceedings of the IEEE, 1989.

[2] B. Plannerer, "An Introduction to Speech Recognition", March 28, 2005.

[3] Christopher M. Bishop, Pattern Recognition and Machine Learning, Chapter 13, pages 605-646.

[4] M. Narasimha Murthy, V. Susheela Devi, Pattern Recognition: An Algorithmic Approach, University Press, Chapters 3, 5 and 9.

[5] B. H. Juang, L. R. Rabiner, "Hidden Markov Models for Speech Recognition", Technometrics, Vol. 33, No. 3, August 1991.

[6] Lawrence Rabiner, Biing-Hwang Juang, Fundamentals of Speech Recognition, Prentice-Hall International, Chapters 6 and 7.

[7] Christophe Couvreur, "Hidden Markov Models and Their Mixtures".

[8] L. R. Rabiner and M. R. Sambur, "An Algorithm for Determining the Endpoints of Isolated Utterances", The Bell System Technical Journal, Vol. 54, No. 2, Feb. 1975.

[9] Mel Frequency Cepstral Coefficient (MFCC) tutorial: http://practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/

[10] http://www.nuance.com/

[11] http://cmusphinx.sourceforge.net/

[12] B. Pellom, "Sonic: The University of Colorado Continuous Speech Recognition System".

[13] http://xvoice.sourceforge.net/


    Appendix A

    Source Code in MATLAB

function [] = Main( )
    clear; clc;
    check = 0;
    fprintf('#===================SPEECH RECOGNITION SYSTEM==================#\n');
    fprintf('#Developed By: Aditya Sharma                                   #\n');
    fprintf('#Branch: CSE Final Year                                        #\n');
    fprintf('#Institute of Engineering and Technology,Lucknow               #\n');
    fprintf('#===============================================================#\n\n');
    fprintf('Choose an option:\n');
    fprintf('1: Train a new word\n');
    fprintf('2: Perform Recognition\n');
    fprintf('3: Reset Knowledge\n');
    fprintf('4: Exit\n\n');
    option = input('Your option is:', 's');
    switch option
        case '1'
            TrainWord;
        case '2'
            Recognize;
        case '3'
            ResetDictionary;
        case '4'
            clc;
            check = 1;
        otherwise
            fprintf('Invalid option!! Retry...\n');
    end
    if check == 0
        %input('');
    else
    end
end

    function [ ] = TrainWord( ) fprintf('Training Mode....\n'); x=GetSpeechSampleW(); fprintf('Enter the string corresponding to this word:'); name=input('','s'); fprintf('Training...\n'); Ini = 0.1; %Initial silence duration in seconds Ts = 0.01; %Frame width in seconds Tsh = 0.005; %Frame shift in seconds Fs = 8000; %Sampling Frequency ZTh = 40; %Zero crossing comparison rate for threshold w_sam = fix(Ts*Fs); %No of Samples/window o_sam = fix(Tsh*Fs); %No of samples/overlap lengthX = length(x); segs = fix((lengthX-w_sam)/o_sam)+1; %Number of segments in speech signal


    sil = fix((Ini-Ts)/Tsh)+1; %Number of segments in silent period win = hamming(w_sam); Limit = o_sam*(segs-1)+1; %Start index of last segment

    FrmIndex = 1:o_sam:Limit; %Vector containing starting index %for each segment ZCR_Vector = zeros(1,segs); %Vector to hold zero crossing rate %for all segments

    %Below code computes and returns zero crossing rates for all segments in %speech sample for t = 1:segs ZCRCounter = 0; nextIndex = (t-1)*o_sam+1; for r = nextIndex+1:(nextIndex+w_sam-1) if (x(r) >= 0) && (x(r-1) >= 0)

    elseif (x(r) >= 0) && (x(r-1) < 0) ZCRCounter = ZCRCounter + 1; elseif (x(r) < 0) && (x(r-1) < 0)

    elseif (x(r) < 0) && (x(r-1) >= 0) ZCRCounter = ZCRCounter + 1; end end ZCR_Vector(t) = ZCRCounter; end

    %Below code computes and returns frame energy for all segments in speech %sample Erg_Vector = zeros(1,segs); for u = 1:segs nextIndex = (u-1)*o_sam+1; Energy = x(nextIndex:nextIndex+w_sam-1).*win; Erg_Vector(u) = sum(abs(Energy)); end

    IMN = mean(Erg_Vector(1:sil)); %Mean silence energy (noise energy) IMX = max(Erg_Vector); %Maximum energy for entire utterance I1 = 0.03 * (IMX-IMN) + IMN; %I1 & I2 are Initial thresholds I2 = 4 * IMN; ITL = min(I1,I2); %Lower energy threshold

    ITU = 5 * ITL; %Upper energy threshold IZC = mean(ZCR_Vector(1:sil)); %mean zero crossing rate for silence region stdev = std(ZCR_Vector(1:sil)); %standard deviation of crossing rate for %silence region IZCT = min(ZTh,IZC+2*stdev); %Zero crossing rate threshold

    flag=1; startpoint=1; segment_count=0;

    while flag==1

    [st,fi,fl]=FindNextSegment(ITU,ITL,IZCT,x,startpoint,Erg_Vector,ZCR_Vector);


    if fl==0 break; end segment_count=segment_count+1; seg_list{segment_count,1}=st; seg_list{segment_count,2}=fi; startpoint=fi; end distance_threshold=100; valid_index=ones(1,segment_count); for l=1:segment_count-1 if valid_index(l)==1 if abs(seg_list{l+1,2}-seg_list{l,2}) ITU) counter1 = counter1 + 1;


    indexi(counter1) = i; end end if counter1==0 flag=0; start=0; finish=0; return; end ITUs = indexi(1); first_hit=ITUs; %Search further forward for frame with energy greater than ITL for j = ITUs:-1:1 if (Erg_Vector(j) < ITL) counter2 = counter2 + 1; indexj(counter2) = j; end end start = indexj(1)+1;

    %BackSearch = min(start,25); BackSearch=25; for m = start:-1:start-BackSearch+1 rate = ZCR_Vector(m); if rate > IZCT ZCRCountb = ZCRCountb + 1; realstart = m; end end if ZCRCountb > 3 start = realstart; %If IZCT is exceeded in more than 3 frames %set start to last index where IZCT is

    exceeded end

    l_c=0; for k=first_hit:length(Erg_Vector) if(Erg_Vector(k)


    end

    function [ pcm ] = GetSpeechSampleW( )

    fprintf('Enter the name of the WAVE file:'); fname=input('','s'); buffer=wavread(fname); FrameRate=8000; [bufferlength,~]=size(buffer); buffer=buffer(FrameRate:bufferlength); pcm=buffer;

    end

    function [ features ] = ExtractFeatures( signal) fs=8000; frameSizeInSec = 0.025; frameShiftInSec= 0.010; hamming=1; preEmphesis=0; totalFilterBanks=26; cepstralOrder=12; lifter=22; deltaWindow=2; deltaWindowWeight = ones(1,2*deltaWindow+1); signal=double(signal); len=length(signal); preEmpSignal=zeros(len,1); preEmpSignal(1)=signal(1); preEmpSignal(2:end)=signal(2:end)-preEmphesis*signal(1:end-1); frameSize=round(fs*frameSizeInSec); frameShift=round(fs*frameShiftInSec); frameNo= floor( 1 + (len - frameSize)/frameShift ); maxMF = 2595*log10(1 + 0.5*fs/700.0); deltaMF=maxMF/(totalFilterBanks+1); f=zeros(totalFilterBanks+2,1); for m=1:totalFilterBanks+2 f(m)=(10^((m-1)*deltaMF/2595)-1)*700.0; end mfcc_tran=zeros(cepstralOrder,totalFilterBanks); for k=1:cepstralOrder for m=1:totalFilterBanks mfcc_tran(k,m)=sqrt(2/totalFilterBanks)*cos(k*pi/totalFilterBanks *

    (m-0.5)); end end n=(1:cepstralOrder)'; lifter_weighting=1+(lifter/2)*sin(pi*n/lifter); k=(1:frameSize)'; h=0.54-0.46*cos(2*pi*(k-1)/(frameSize-1)); mfcc=zeros(cepstralOrder,frameNo); melspec=zeros(totalFilterBanks,frameNo);


    for fr=1:frameNo s=preEmpSignal((fr-1)*frameShift+1:(fr-1)*frameShift+frameSize); if hamming ~= 0 s=s.*h; end fftN=2; while fftN


    newId=GenerateId(); id=newId; modelFileName=['hmms\' int2str(newId) '.mat'];

    for iter=ITERATION_BEGIN:ITERATION_END if iter==1 MIN_SELF_TRANSITION_COUNT=0; vector_sums_i_m=zeros(dim,STATE_NO); var_vec_sums_i_m=zeros(dim,STATE_NO); fr_no_i_m=zeros(STATE_NO); self_tr_fr_no_i_m=zeros(STATE_NO); [dim,fr_no]=size(features); for i=1:STATE_NO begin_fr=round( fr_no*(i-1) /STATE_NO)+1; end_fr=round( fr_no*i /STATE_NO); seg_length=end_fr-begin_fr+1; vector_sums_i_m(:,i) = vector_sums_i_m(:,i) +

    sum(features(:,begin_fr:end_fr),2); var_vec_sums_i_m(:,i) = var_vec_sums_i_m(:,i) + sum(

    features(:,begin_fr:end_fr).*features(:,begin_fr:end_fr) , 2); fr_no_i_m(i)=fr_no_i_m(i)+seg_length; self_tr_fr_no_i_m(i)= self_tr_fr_no_i_m(i) + seg_length-1; end for i=1:STATE_NO mean_vec_i_m(:,i) = vector_sums_i_m(:,i) / fr_no_i_m(i); var_vec_i_m(:,i) = var_vec_sums_i_m(:,i) / fr_no_i_m(i);

    A_i_m(i)=(self_tr_fr_no_i_m(i)+MIN_SELF_TRANSITION_COUNT)/(fr_no_i_m(i)+2*MIN_

    SELF_TRANSITION_COUNT); end else MIN_SELF_TRANSITION_COUNT=0.00; [dim,STATE_NO]=size(mean_vec_i_m); vector_sums_i_m=zeros(dim,STATE_NO); var_vec_sums_i_m=zeros(dim,STATE_NO); fr_no_i_m=zeros(STATE_NO); self_tr_fr_no_i_m=zeros(STATE_NO); total_log_prob = 0; total_fr_no = 0; [log_prob, pr_i_t, pr_self_tr_i_t

    ]=forwardBackwardAlgorithm(features,mean_vec_i_m(:,:),var_vec_i_m(:,:),A_i_m(:

    )); total_log_prob = total_log_prob + log_prob; total_fr_no = total_fr_no + fr_no; for i=1:STATE_NO fr_no_i_m(i)=fr_no_i_m(i)+sum(pr_i_t(i,:));

    self_tr_fr_no_i_m(i)=self_tr_fr_no_i_m(i)+sum(pr_self_tr_i_t(i,1:end-1)); for fr=1:fr_no vector_sums_i_m(:,i) = vector_sums_i_m(:,i) +

    pr_i_t(i,fr)*features(:,fr); var_vec_sums_i_m(:,i) =var_vec_sums_i_m(:,i) +

    pr_i_t(i,fr)*(features(:,fr)-mean_vec_i_m(:,i)).*(features(:,fr)-

    mean_vec_i_m(:,i)); end end old_mean_vec_i_m=mean_vec_i_m;


    old_var_vec_i_m= var_vec_i_m; old_A_i_m=A_i_m; for i=1:STATE_NO; mean_vec_i_m(:,i) = vector_sums_i_m(:,i)/ fr_no_i_m(i); var_vec_i_m(:,i)= var_vec_sums_i_m(:,i) / fr_no_i_m(i); A_i_m(i)=(self_tr_fr_no_i_m(i)+MIN_SELF_TRANSITION_COUNT)

    /(fr_no_i_m(i)+2*MIN_SELF_TRANSITION_COUNT); end var_new_to_old_ratio=var_vec_i_m ./ old_var_vec_i_m; end end save(modelFileName, 'mean_vec_i_m', 'var_vec_i_m', 'A_i_m'); fprintf('The new word is added to the knowledge...\n');

    end

    function [log_prob, pr_i_t, pr_self_tr_i_t ]=forwardBackwardAlgorithm( V,

    mean_vec_i, var_vec_i, A_i ) [dim , N]=size(mean_vec_i); [dim2 , T]=size(V); [log_prob, logfw, logObsevation_i_t ]=forwardAlgorithm(V, mean_vec_i,

    var_vec_i, A_i ); pr_self_tr_i_t=zeros(N,T); logbw=ones(N,T)*(-inf); t=T; logbw(N,T)=log(1-A_i(N)); for t=T-1:-1:1 for i=1:N if i==N logbw(i,t)= log(A_i(i))+ logObsevation_i_t(i,t+1) + logbw(i,t+1); pr_self_tr_i_t(i,t)=exp(logfw(i,t)+ log(A_i(i))+

    logObsevation_i_t(i,t+1) + logbw(i,t+1)-log_prob); else logbw(i,t)=CalculateSum([ (log(A_i(i))+ logObsevation_i_t(i,t+1) +

    logbw(i,t+1)) , (log(1-(A_i(i))) + logObsevation_i_t(i+1,t+1)+ logbw(i+1,t+1)

    )] ); pr_self_tr_i_t(i,t)=exp(logfw(i,t)+

    log(A_i(i))+logObsevation_i_t(i,t+1) + logbw(i,t+1)-log_prob); end end end pr_i_t= exp( logfw+logbw - log_prob ); count_at_t(1:T)=sum(pr_i_t,1); count_at_t=squeeze(count_at_t); if (sum(count_at_t) -T) > 1E-6 diff=sum(count_at_t) -T ; end end

    function [log_pr, varargout]=forwardAlgorithm(V, mean_vec_i, var_vec_i, A_i

    ) [dim , N]=size(mean_vec_i); [dim2 , T]=size(V); logObsevation_i_t=zeros(N,T);


    for t=1:T for i=1:N

    logObsevation_i_t(i,t)=GaussianProbability(V(:,t),mean_vec_i(:,i),var_vec_i(:,

    i)); end end logfw=ones(N,T)*(-inf); t=1; logfw(1,1)=logObsevation_i_t(1,1); for t=2:T i=1; logfw(i,t)=logfw(i,t-1) + log(A_i(i))+logObsevation_i_t(i,t); for i=2:N logfw(i,t)= CalculateSum( [ (logfw(i-1,t-1) +log(1-A_i(i-1)) ) ,

    (logfw(i,t-1) + log(A_i(i))) ] ) + logObsevation_i_t(i,t); end end log_pr=logfw(N,T) + log(1-A_i(N)); varargout(1)= {logfw}; varargout(2)= {logObsevation_i_t}; end

    function [ ] = UpdateKnowledge( word_id,value )

    dictionaryFile='dictionary.mat'; if ( exist(dictionaryFile,'file')) load(dictionaryFile,'dictionary'); dictionary{word_id,1}=word_id; dictionary{word_id,2}=value; save(dictionaryFile,'dictionary'); else dictionary{word_id,1}=word_id; dictionary{word_id,2}=value; save(dictionaryFile,'dictionary'); end end

    function [ ] = Recognize( )

    x=GetSpeechSampleW(); fprintf('Recognizing...\n'); Ini = 0.1; %Initial silence duration in seconds Ts = 0.01; %Frame width in seconds Tsh = 0.005; %Frame shift in seconds Fs = 8000; %Sampling Frequency ZTh = 40; %Zero crossing comparison rate for threshold w_sam = fix(Ts*Fs); %No of Samples/window o_sam = fix(Tsh*Fs); %No of samples/overlap lengthX = length(x); segs = fix((lengthX-w_sam)/o_sam)+1; %Number of segments in speech signal sil = fix((Ini-Ts)/Tsh)+1; %Number of segments in silent period


    win = hamming(w_sam); Limit = o_sam*(segs-1)+1; %Start index of last segment

    FrmIndex = 1:o_sam:Limit; %Vector containing starting index %for each segment ZCR_Vector = zeros(1,segs); %Vector to hold zero crossing rate %for all segments

    %Below code computes and returns zero crossing rates for all segments in %speech sample for t = 1:segs ZCRCounter = 0; nextIndex = (t-1)*o_sam+1; for r = nextIndex+1:(nextIndex+w_sam-1) if (x(r) >= 0) && (x(r-1) >= 0)

    elseif (x(r) >= 0) && (x(r-1) < 0) ZCRCounter = ZCRCounter + 1; elseif (x(r) < 0) && (x(r-1) < 0)

    elseif (x(r) < 0) && (x(r-1) >= 0) ZCRCounter = ZCRCounter + 1; end end ZCR_Vector(t) = ZCRCounter; end

    %Below code computes and returns frame energy for all segments in speech %sample Erg_Vector = zeros(1,segs); for u = 1:segs nextIndex = (u-1)*o_sam+1; Energy = x(nextIndex:nextIndex+w_sam-1).*win; Erg_Vector(u) = sum(abs(Energy)); end

    IMN = mean(Erg_Vector(1:sil)); %Mean silence energy (noise energy) IMX = max(Erg_Vector); %Maximum energy for entire utterance I1 = 0.03 * (IMX-IMN) + IMN; %I1 & I2 are Initial thresholds I2 = 4 * IMN; ITL = min(I1,I2); %Lower energy threshold ITU = 5 * ITL; %Upper energy threshold

    IZC = mean(ZCR_Vector(1:sil)); %mean zero crossing rate for silence region stdev = std(ZCR_Vector(1:sil)); %standard deviation of crossing rate for %silence region IZCT = min(ZTh,IZC+2*stdev); %Zero crossing rate threshold

    flag=1; startpoint=1; segment_count=0;

    while flag==1

    [st,fi,fl]=FindNextSegment(ITU,ITL,IZCT,x,startpoint,Erg_Vector,