SPEECH RECOGNITION



CONTENTS

LIST OF FIGURES

INTRODUCTION

CHAPTER 1: ABSTRACT
1.1 GENERAL
1.2 PRINCIPLES OF SPEECH RECOGNITION

CHAPTER 2: LITERATURE REVIEW

CHAPTER 3: PROPOSED SYSTEM
3.1 DRAWBACKS OF EXISTING SYSTEMS
3.1.1 FACE RECOGNITION
3.1.2 FINGERPRINT
3.1.3 IRIS RECOGNITION
3.1.4 PALM RECOGNITION
3.1.5 HAND GEOMETRY
3.1.6 SIGNATURE RECOGNITION
3.2 COMPARATIVE STUDY WITH EXISTING SYSTEMS
3.3 ADVANTAGES OF VoDAR

CHAPTER 4: BLOCK DIAGRAM
4.1 HUMAN VOICE GENERATION
4.2 SPEECH RECOGNITION
4.3 FEATURE EXTRACTION
4.4 VECTOR QUANTIZATION
4.5 NEED FOR MATLAB

CHAPTER 5: CONCLUSION


    INTRODUCTION

VoDAR (Voice Detection based Attendance Register) is a system that registers
the attendance of each student with high accuracy. Personal identification is
performed using the student's voice, so the defects of the ordinary
attendance-taking practice can be reduced to a large extent. The voice
characteristics of every person differ in one way or another; the vocal
characteristics of two persons are never identical, and this is why voice is
chosen as the platform for personal identification.

Compared to the conventional manual attendance-marking system, VoDAR saves a
lot of time in registering attendance. Further, the method is based on the
speaker's voice recorded in real time, which makes it highly robust against
malpractice. The automated creation of attendance records in the form of an
Excel spreadsheet makes it very convenient to use, access, archive, and print
the attendance from the classroom computer, the department or college server,
or any computer connected to the college intranet. A practical implementation
would need complicated circuitry at considerable expense, so this project
develops a theoretical model rather than a practical one. VoDAR is a
software-based model, and the voice recognition in this project is carried out
using the MATLAB software.


    CHAPTER 1

    ABSTRACT

    1.1 GENERAL

The current practice of taking attendance in a lecture class is simply for the
teacher to call out each roll number and mark the student's response. This is a
time-consuming process and its accuracy is low. Students who are not present in
the classroom are often marked present through malpractice, so there is
considerable opportunity for cheating. It is also cumbersome to enter and
calculate each student's overall attendance, and sometimes a student may fail
to receive attendance even though he or she is present in the class. These
limitations are very difficult to avoid, even when the process is followed
carefully.

Attendance carries a 10% weight in determining each student's internal marks,
so malpractice in taking attendance directly affects the internals of the
students. The idea for this project developed while we were thinking about a
scientific way to register attendance; with it, the defects of the ordinary
attendance-taking practice can be reduced to a large extent. The physical
biometric characteristics that are generally used are the face, iris,
fingerprints, palm prints and hand geometry, while the behavioural
characteristics include the signature and the voice pattern.

Biometrics is the science or technology that analyses and measures biological
data. In computer science it refers to the science or technology of measuring
and analysing the physical or behavioural characteristics of a person for
authentication.


Voice recognition or speaker recognition systems extract features from the
speech using MATLAB and model them for use in recognition. These systems use
the acoustic features present in the speech, which are unique for each
individual. These acoustic patterns depend on the physical characteristics of
the individual (e.g. the size of the mouth and throat) as well as behavioural
characteristics such as speaking style and voice pitch.

Everyone has a distinct voice, different from all others; almost like a
fingerprint, one's voice is unique and can act as an identifier. The human
voice is composed of a multitude of different components, making each voice
different, namely:

    1. Pitch

    2. Tone

1.2 PRINCIPLES OF SPEECH RECOGNITION

Speech is one of the natural forms of communication, and recent developments
have made it possible to use it in security systems. The speaker identification
task uses a speech sample to select the identity of the person who produced the
speech from a population of speakers. In speaker verification, the task is to
use a speech sample to test whether a person who claims to have produced the
speech has in fact done so. This technique makes it possible to use a speaker's
voice to verify their identity and control access to services such as voice
dialing, banking by telephone, attendance marking, telephone shopping, database
access services, information services, voice mail, security control for
confidential information areas, and remote access to computers.

Speaker recognition methods can be divided into text-independent and
text-dependent methods. In a text-independent system, speaker models capture
characteristics of somebody's speech which show up irrespective of what that
person is saying. In a text-dependent system, on the other hand, the recognition of the


speaker's identity is based on his or her speaking one or more specific
phrases, such as passwords, card numbers or PIN codes.

Every speaker recognition technology, whether for identification or
verification and whether text-independent or text-dependent, has its own
advantages and disadvantages and may require different treatment and
techniques; the choice of technology is application specific. At the highest
level, all speaker recognition systems contain two main modules: feature
extraction and feature matching.


    CHAPTER 2

    LITERATURE REVIEW

    This chapter contains a brief overview of voice recognition using MFCC and

    vector quantization. The details of the project were obtained from some of the

    journals listed below.

An international journal paper on speaker identification (2011) describes
speaker recognition as the computing task of validating a user's claimed
identity using characteristics extracted from their voice. Voice recognition is
a combination of the two, in that it uses learned aspects of a speaker's voice
to determine what is being said: such a system cannot recognize speech from
random speakers very accurately, but it can reach high accuracy for the
individual voices it has been trained with, which gives it various applications
in day-to-day life. This paper introduced us to various methods of speaker
identification involving LPC and MFCC feature extraction. Linear prediction is
a mathematical operation which provides an estimate of the current sample of a
discrete signal as a linear combination of several previous samples. The
prediction error, i.e. the difference between the predicted and the actual
value, is called the residual, and feature extraction is implemented using this
idea, whereas in MFCC the log of the signal energy is calculated.
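As a rough illustration of the linear-prediction idea above, the following MATLAB sketch fits a p-th order predictor by least squares on a synthetic signal and computes the residual. The order, the synthetic input and the least-squares formulation are illustrative assumptions, not details taken from the cited paper.

    % Minimal linear-prediction sketch: predict each sample from the
    % previous p samples and compute the residual (prediction error).
    p = 10;                          % predictor order (illustrative choice)
    x = randn(1000, 1);              % stand-in for a short speech segment
    N = length(x);
    X = zeros(N - p, p);             % each row holds the p previous samples
    for k = 1:p
        X(:, k) = x(p + 1 - k : N - k);
    end
    y   = x(p + 1 : N);              % samples to be predicted
    a   = X \ y;                     % least-squares predictor coefficients
    res = y - X * a;                 % residual used as the basis for features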

An international journal paper on the design of an automatic speaker
recognition system (2011) describes the concept of MFCCs. They are derived from
a type of cepstral representation of the audio clip. The difference between the
cepstrum and the Mel-frequency cepstrum is that in the MFC the frequency bands
are equally spaced on the Mel scale, which approximates the human auditory
system's response more closely than the linearly spaced frequency bands used in
the normal cepstrum. The cepstrum is a common transform used to gain
information from a person's speech signal. It can be used to separate the


excitation signal (which contains the words and the pitch) and the transfer
function (which contains the voice quality). It is the result of taking the
Fourier transform of the decibel spectrum as if it were a signal. We use
cepstral analysis in speaker identification because the speech signal has the
particular form above, and the "cepstral transform" of it makes analysis much
simpler. Mathematically,

cepstrum of signal = FT[log{FT(windowed signal)}]

MFCCs are commonly calculated by first taking the Fourier transform of a
windowed excerpt of a signal and mapping the powers of the spectrum obtained
above onto the Mel scale, using triangular overlapping windows. Next the logs
of the powers at each of the Mel frequencies are taken, and the Discrete Cosine
Transform is applied to them (as if they were a signal). The MFCCs are the
amplitudes of the resulting spectrum. The speech input is typically recorded at
a sampling rate above 10000 Hz. This sampling frequency is chosen to minimize
the effects of aliasing in the analog-to-digital conversion. Such sampled
signals can capture all frequencies up to 5 kHz, which covers most of the
energy of sounds generated by humans. The main purpose of the MFCC processor is
to mimic the behaviour of the human ear. In addition, MFCCs are shown to be
less susceptible to such variations. This paper helped us to gain knowledge
about feature extraction using MFCC.

An international journal paper on vector quantization based speaker
identification (2010) describes how a speaker recognition system must be able
to estimate the probability distributions of the computed feature vectors.
Storing every single vector generated in the training mode is impossible, since
these distributions are defined over a high-dimensional space. It is often
easier to start by quantizing each feature vector to one of a relatively small
number of template vectors, with a process called vector quantization. VQ is a
process of taking a large set of feature vectors and producing a smaller set of
vectors that represent the centroids of the distribution. Using these training
data, the features are clustered to form a codebook for each speaker. In the


recognition stage, the data from the tested speaker is compared to the codebook
of each speaker and the difference is measured. These differences are then used
to make the recognition decision.

A survey of biometric recognition systems and their applications (Journal of
Theoretical and Applied Information Technology, 2010) describes the various
biometric systems available and their peculiarities. Human physical
characteristics such as fingerprints, face, voice and iris are known as
biometrics. This survey helped us to understand the various biometric systems
and to choose voice recognition as the system best suited to our application,
i.e. an attendance system. Facial recognition has disadvantages such as system
complexity and long processing time compared to voice; 2D recognition is also
affected by changes in lighting, by the person's age and by whether the person
wears glasses, and it requires costly camera equipment for user identification.
Iris recognition has disadvantages such as a large storage requirement and high
expense. Fingertip recognition can make mistakes due to dryness or dirt on the
finger's skin, as well as with age; it demands a large memory and compression
is required (by a factor of approximately 10). Hence, of all the options, the
most suitable for our application is voice, which is cheap and whose
verification time is only about 5 seconds.

"MFCC and its applications in speaker recognition" (International Journal on
Emerging Technologies, 2010) describes how speech processing has emerged as one
of the important application areas of digital signal processing. Various fields
of research in speech processing are speech recognition, speaker recognition,
speech synthesis, speech coding, etc. The objective of automatic speaker
recognition is to extract, characterize and recognize the information about a
speaker's identity. Feature extraction is the first step in speaker
recognition, and many algorithms have been suggested and developed by
researchers for it. In this work, the Mel Frequency Cepstrum Coefficient
(MFCC) feature has been used for designing a text-dependent speaker


identification system. Some modifications to the existing MFCC feature
extraction technique are also suggested to improve the speaker recognition
efficiency. From this paper we found that as the number of filters in the
filter bank increases, the efficiency also increases, and that the Hanning
window gives higher efficiency than a rectangular window.

"Vector quantization based speaker identification" (International Journal of
Computer Applications, 2010) describes the methodology followed for speaker
identification, which consists of comparing a speech signal from an unknown
speaker to a database of known speakers. The methodology followed in this paper
is a feature extraction process followed by vector quantization of the
extracted features using the k-means algorithm. The k-means algorithm is widely
used in speech processing as a dynamic clustering approach; K is pre-selected
and simply refers to the number of desired clusters. In the recognition phase
an unknown speaker, represented by a sequence of feature vectors {x1, ..., xT},
is compared with the codebooks in the database. For each codebook a distortion
measure is computed, and the speaker with the lowest distortion is chosen. The
VQ-based clustering approach is preferred because it provides a faster speaker
identification process.
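The MATLAB sketch below illustrates the recognition decision described above: the unknown speaker's feature vectors are compared against each stored codebook and the speaker whose codebook gives the lowest average distortion is chosen. The codebooks and feature vectors here are random placeholders standing in for the output of the MFCC and VQ training stages.

    % Assumed setup: 3 enrolled speakers, each with a 16-vector codebook of
    % 12-dimensional features; 'feats' holds the unknown speaker's vectors.
    numSpeakers = 3;  K = 16;  D = 12;  T = 50;
    codebooks = cell(numSpeakers, 1);
    for s = 1:numSpeakers
        codebooks{s} = randn(K, D);          % placeholder trained codebook
    end
    feats = randn(T, D);                     % placeholder test feature vectors

    avgDist = zeros(numSpeakers, 1);
    for s = 1:numSpeakers
        cb = codebooks{s};
        dmin = zeros(T, 1);
        for t = 1:T
            d = sum((cb - repmat(feats(t, :), K, 1)) .^ 2, 2);
            dmin(t) = min(d);                % nearest code vector distortion
        end
        avgDist(s) = mean(dmin);             % average distortion per speaker
    end
    [~, identified] = min(avgDist);          % lowest distortion wins
    fprintf('Identified speaker: %d\n', identified);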


    CHAPTER 3

PROPOSED SYSTEM

    3.1 DRAWBACKS OF EXISTING SYSTEMS

    3.1.1 Face recognition attendance system:

    Humans have a remarkable ability to recognize fellow beings based on facial

    appearance. So, face is a natural human trait for automated biometric

    recognition. Face recognition systems typically utilize the spatial relationship

    among the locations of facial features such as eyes, nose, lips, chin, and the

    global appearance of a face. The forensic and civilian applications of face

recognition technologies pose a number of technical challenges, for example in
static mug-shot photograph matching (e.g., ensuring that the same person is not
requesting multiple passports). The problems associated with illumination,

    gesture, facial makeup, occlusion, and pose variations adversely affect the face

    recognition performance. While face recognition is non-intrusive, robust face

    recognition in non-ideal situations continues to pose challenges.

    3.1.2 Fingerprint recognition attendance system:

    Fingerprint-based recognition has been the longest serving, and popular method

    for person identification. Fingerprints consist of a regular texture pattern

    composed of ridges and valleys. These ridges are characterized by several

    landmark points, known as minutiae, which are mostly in the form of ridge

    endings and ridge bifurcations. The spatial distribution of these minutiae points

    is claimed to be unique to each finger; it is the collection of minutiae points in a

    fingerprint that is primarily employed for matching two fingerprints. In addition

    to minutiae points, there are sweat pores and other details (referred to as

extended or level 3 features) which can be acquired in high-resolution (1000 ppi)

  • 8/12/2019 speach recognition

    11/30

    Department of ECE,Thejus Engg., College,Thrissur Page 11

    fingerprint images. However, there are some disadvantages in this system. If the

    surface of the finger gets damaged and/or has one or more marks on it,

    identification becomes increasingly hard. Furthermore, the system requires the

users' finger surface to have minutiae points or a pattern in order to produce
matching images. This is a limiting factor for the security of the algorithm.

    3.1.3 Iris recognition attendance system:

    The iris is the colored annular ring that surrounds the pupil. Iris images acquired

    under infrared illumination consist of complex texture pattern with numerous

    individual attributes, e.g. stripes, pits, and furrows, which allow for highly

    reliable personal identification. The iris is a protected internal organ whose

    texture is stable and distinctive, even among identical twins (similar to

    fingerprints), and extremely difficult to surgically spoof. However, relatively

    high sensor cost, along with relatively large failure to enroll (FTE) rate reported

    in some studies, and lack of legacy iris databases may limit its usage in some

large-scale government applications.

    3.1.4 Palm print recognition attendance system:

The image of a human palm consists of friction ridges and flexion

    creases. Similar to fingerprints, latent palm print systems utilize minutiae and

    creases for matching. Based on the success of fingerprints in civilian

applications, some attempts have been made to utilize low-resolution palm print images for access control applications. These systems utilize texture features

    which are quite similar to those employed for iris recognition. Palm print

    recognition systems have not yet been deployed for civilian applications (e.g.,

    access control), mainly due to their large physical size and the fact that

    fingerprint identification based on compact and embedded sensors works quite

    well for such applications.

  • 8/12/2019 speach recognition

    12/30

    Department of ECE,Thejus Engg., College,Thrissur Page 12

    3.1.5 Hand Geometry recognition attendance system:

    It is claimed that individuals can be discriminated based on the shape of their

    hands. Person identification using hand geometry utilizes low resolution hand

    images to extract a number of geometrical features such as finger length, width,

    thickness, perimeter, and finger area. The discriminatory power of these features

    is quite limited, and therefore hand geometry systems are employed only for

    verification applications in low security access control and time-and-attendance

applications. Hand geometry systems have a large physical size, so they cannot be easily

    embedded in existing security systems.

    3.1.6 Signature recognition attendance system:

    Signature is a behavioral biometric modality that is used in daily business

    transactions (e.g., credit card purchase). However, attempts to develop highly

    accurate signature recognition systems have not been successful. This is

primarily due to the large intra-class variations in a person's signature over
time. Attempts have been made to improve the signature recognition performance
by capturing dynamic or online signatures, which require a pressure-sensitive
pen pad. Dynamic signatures help in acquiring the shape, speed, acceleration,
pen pressure, and order and speed of strokes during the actual act of signing.
This additional information seems to improve the verification performance (over
static signatures) as well as circumvent signature forgeries. Still, very few
automatic signature verification systems have been deployed.


3.2 COMPARATIVE STUDY WITH EXISTING SYSTEMS

Features           | Eye-Iris   | Eye-Retina | Fingerprint                  | Signature                  | Voice
Reliability        | Very High  | Very High  | High                         | High                       | High
Easiness of use    | Average    | Low        | High                         | High                       | High
Social acceptance  | Low        | Low        | Medium                       | High                       | High
Interference       | Glasses    | Irritation | Dirtiness, injury, roughness | Changeable, easy signature | Noise
Cost               | High       | High       | Medium                       | Low                        | Low
Device required    | Camera     | Camera     | Scanner                      | Optic pen, touch panel     | Microphone

We can conclude from the above table that voice recognition is comparatively
less costly and sufficiently accurate. The device required for voice
recognition is easily available and inexpensive compared to the other systems,
and voice recognition is also socially acceptable.


    3.3 ADVANTAGES OF VoDAR

    3.3.1 Ability to Use Technology remotely

One of the main advantages of voice verification technology is the ability to
use it remotely. Many other types of biometrics, such as fingerprint, retina or
iris biometrics, cannot be used remotely.

Speech recognition technology, in contrast, is easy to use over the phone or
other speaking devices, increasing its usefulness to many companies. The
ability to use it remotely makes it stand out among the many other types of
biometric technology available today.

    3.3.2 Low Cost of Using It

The low cost of this technology is another advantage of voice recognition. The
price of acquiring a voice recognition system is usually quite reasonable,
especially when compared to the price of other biometric systems. These systems
are relatively cheap to implement and maintain, and the equipment needed is low
priced as well. Very little equipment is needed, making voice recognition a
cost-effective option for businesses that are carefully watching their bottom
line.

In many cases, all that is required for these systems to function is the right
biometric software, if the technology is being used remotely over the phone.
The phone acts as the speaking device, so there is no investment in such
devices. For systems used for authentication and verification on site,
businesses only have to purchase a device that users can speak into, along with
the speech recognition software.


    3.3.3 High Reliability Rate

Another of the heralded advantages of voice recognition is the technology's
high reliability rate. Ten to twenty years ago, the reliability of speech
recognition technology was actually quite low.

There were many sources of reliability problems, such as the inability to deal
with background noise or to recognize a voice when an individual had a slight
cold. However, these problems have been dealt with successfully today, giving
this biometric technology a very high reliability rate. Vocal prints can now
easily be used to identify an individual, even if their speech sounds a bit
different due to a cold.

    One of the advantages of these types of systems is that they are designed

    to ignore background noise and focus on the voice, which also has given the

    reliability rate a huge boost.

    3.3.4 Ease of Use and Implementation

Many companies really appreciate the ease of use and implementation that comes
with voice recognition biometrics. Some biometric technologies can be difficult
to introduce into a company and difficult to begin using. Since voice
recognition systems require so little equipment, they can usually be
implemented without adding new equipment and systems. And since they are so
easy to use, companies can often redeploy personnel elsewhere in the company to
improve performance and customer satisfaction.


    3.3.5 Minimally Invasive

One of the major advantages of voice recognition is that it is minimally
invasive. This is very important to the individuals who use these security
devices; many consumers today dislike other forms of biometric technology
because they seem so invasive.

Speech technology only requires individuals to speak and offer a vocal sample,
which is minimally invasive. Since this technology has a high approval rate
among consumers, it can help businesses keep their customers happy with the
service they are providing.


CHAPTER 4

    BLOCK DIAGRAM

The main aim of this project is speaker identification, which consists of
comparing a speech signal from an unknown speaker to a database of known
speakers. The system can recognize a speaker once it has been trained with a
number of speakers. The block diagram shows the fundamental structure of
speaker identification and verification systems, where speaker identification
is the process of determining which registered speaker produced a given
utterance.

On the other hand, speaker verification is the process of rejecting or
accepting the identity claim of a speaker. In most applications, voice is used
as the key to confirm the identity of a speaker, which is known as speaker
verification. The system consists of a microphone connected to a computer


system. The voice input of each student is recorded via the microphone, and
each input is analyzed by the system using the MATLAB software, which is the
software tool of our project. First we store some reference voice signal
waveforms with the help of a microphone and the computer; these stored speech
signals are called corpus sentences. With the help of MATLAB, these waveforms
are analyzed and each speech signal is converted into vector form.

The input voice signals from the students are then also converted into vector
form. After comparing this vector with the corpus sentences, the most similar
corpus sentence is determined. Thus speaker identification is carried out and
the corresponding attendance is marked.

    4.1 HUMAN VOICE GENERATION


Consider the anatomy and physiology of the voice by following the voice from
the lungs to the lips. The breath stream, referred to as the "generator" of the
voice, originates in the lungs. This generator provides a controlled flow of
air which powers the vocal folds by setting them into motion.

The human larynx has three vital functions:

1. Airway protection (prevention of aspiration)
2. Respiration (breathing)
3. Phonation (talking)

    When we speak, the vocal folds approximate and vibrate to produce voice.

    When we breathe the vocal folds open or abduct and allow air to flow from the

    lungs through the mouth and nose and vice versa. When we eat, we reflexively

    stop breathing and the vocal folds approximate to protect the airway and keep

food and drink out of the lungs.

[Figure: speech signal waveform, amplitude vs. time]


The vocal folds do not operate like the strings of a violin but are more
comparable to vibrating lips "buzzing". The vocal tract forms a
three-dimensional cavity, or "resonator", that provides sound modification. The
articulators (the parts of the vocal tract above the larynx, consisting of the
tongue, palate, cheeks, lips, etc.) articulate and filter the sound emanating
from the larynx and to some degree can interact with the laryngeal airflow to
strengthen or weaken it as a sound source. Adult men and women have different
sizes of vocal folds, reflecting the male-female differences in larynx size.
Adult male voices are usually lower pitched and have larger folds. The male
vocal folds (which would be measured vertically in the accompanying diagram)
are between 17 mm and 25 mm in length, while the female vocal folds are between
12.5 mm and 17.5 mm in length. The difference in vocal fold size between men
and women means that they have differently pitched voices. Additionally,
genetics also causes variance within the same sex, with men's and women's
singing voices being categorized into types.

    4.2 SPEECH RECOGNITION

The structure of a typical speech recognition system mainly consists of feature
extraction, training and recognition. Because of the instability of the speech
signal, feature extraction becomes very difficult. Different words have
different features, and for the same word there are differences between
speakers, such as the differences between adults and children or between male
and female voices. Even for the same person and the same word, the signal
changes from one time to another. Nowadays there are several feature extraction
methods used in speech recognition systems. All of them perform well in clean
conditions, but in adverse conditions a fully satisfactory approach to speech
recognition has still not been found.


Compared with these methods, the human auditory system performs well under both
clean and noisy conditions, so one way to make progress is to study our
auditory system and use the results in speech recognition systems. There are
two major approaches to feature extraction: modelling the human voice
production system and modelling the human perception system. For the first
approach, one of the most popular features is the LPC (Linear Prediction
Coefficient) feature. For the second approach, the most popular feature is the
MFCC (Mel-Frequency Cepstrum Coefficient) feature. The main advantage of MFCC
is that it uses Mel frequency scaling, which closely approximates the human
auditory system and is hence more effective than LPC.

    A. Definition of speech recognition:

Speech recognition (also known as Automatic Speech Recognition (ASR) or
computer speech recognition) is the process of converting a speech signal to a
sequence of words by means of an algorithm implemented as a computer program.

    B. Types of Speech Recognition:


Speech recognition systems can be separated into several different classes
according to the types of utterances they are able to recognize:

Isolated Words:
Isolated word recognizers usually require each utterance to have quiet (a lack
of audio signal) on both sides of the sample window. They accept a single word
or single utterance at a time. These systems have "Listen/Not-Listen" states,
where they require the speaker to wait between utterances (usually doing
processing during the pauses). "Isolated utterance" might be a better name for
this class.

Connected Words:
Connected word systems (or more correctly, connected utterance systems) are
similar to isolated word systems, but allow separate utterances to be run
together with a minimal pause between them.

Continuous Speech:
Continuous speech recognizers allow users to speak almost naturally while the
computer determines the content (essentially, computer dictation). Recognizers
with continuous speech capabilities are among the most difficult to create
because they must use special methods to determine utterance boundaries.

Spontaneous Speech:
At a basic level, spontaneous speech can be thought of as speech that is
natural sounding and not rehearsed. An ASR system with spontaneous speech
ability should be able to handle a variety of natural speech features such as
words being run together, "ums" and "ahs", and even slight stutters.

    4.3 FEATURE EXTRACTION


Step 1: Pre-emphasis

In this step the signal is passed through a filter that emphasizes higher
frequencies, increasing the energy of the signal at high frequencies:

Y[n] = X[n] - a X[n-1], with a = 0.95,

so that 95% of any one sample is presumed to originate from the previous
sample.
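A one-line MATLAB sketch of this pre-emphasis filter, applied to a placeholder signal (the signal and its length are illustrative assumptions):

    a = 0.95;                                % pre-emphasis coefficient from the text
    x = randn(16000, 1);                     % stand-in for one second of speech at 16 kHz
    y = [x(1); x(2:end) - a * x(1:end-1)];   % Y[n] = X[n] - a*X[n-1]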

Step 2: Framing

This is the process of segmenting the speech samples obtained from
analog-to-digital conversion (ADC) into small frames with a length in the range
of 20 to 40 ms. The voice signal is divided into frames of N samples, and
adjacent frames are separated by M samples (M < N), so that consecutive frames
overlap. A sketch of this step is given below.
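A short MATLAB sketch of the framing step; the 25 ms frame length, 10 ms frame shift and 16 kHz sampling rate are illustrative assumptions consistent with the 20-40 ms range mentioned above.

    fs = 16000;                          % assumed sampling rate
    N  = round(0.025 * fs);              % 25 ms frame length
    M  = round(0.010 * fs);              % 10 ms frame shift (M < N, so frames overlap)
    y  = randn(fs, 1);                   % stand-in for the pre-emphasised signal
    numFrames = floor((length(y) - N) / M) + 1;
    frames = zeros(numFrames, N);
    for i = 1:numFrames
        start = (i - 1) * M + 1;
        frames(i, :) = y(start : start + N - 1).';   % one frame per row
    end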


Step 3: Windowing

A traditional method of spectral evaluation is reliable only in the case of a
stationary signal, but the nature of the speech signal changes continuously
with time, so for voice, reliability can be ensured only over a short interval.
The audio signal is continuous, processing cannot wait for the last sample, and
processing complexity grows rapidly, so it is important to retain short-term
features. Short-time analysis is therefore performed by windowing the signal,
normally with a Hamming window. The Hamming function is given by

W(n) = 0.54 - 0.46 cos(2πn / (L-1)),  for 0 ≤ n ≤ L-1,
W(n) = 0 otherwise.
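A MATLAB sketch applying the Hamming window above to each frame, building the window directly from the formula; 'frames' is a placeholder for the output of the framing step.

    frames = randn(98, 400);                     % stand-in for the framed signal
    L = size(frames, 2);                         % frame length in samples
    n = 0:L-1;
    w = 0.54 - 0.46 * cos(2 * pi * n / (L - 1)); % Hamming window W(n)
    windowed = frames .* repmat(w, size(frames, 1), 1);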

    Step 4: Fast Fourier Transform

Each frame of N samples is converted from the time domain into the frequency
domain. The Fourier transform converts the convolution of the glottal pulse
U[n] and the vocal tract impulse response H[n] in the time domain into a
multiplication in the frequency domain:

Y(ω) = FFT[h(t) * x(t)] = H(ω) X(ω),

where * denotes time-domain convolution and X(ω), H(ω) and Y(ω) are the Fourier
transforms of x(t), h(t) and y(t) respectively.
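A MATLAB sketch of the FFT step, computing the one-sided power spectrum of each windowed frame; the FFT length and the placeholder frames are illustrative assumptions.

    windowed = randn(98, 400);                     % stand-in for the windowed frames
    NFFT = 512;                                    % assumed FFT length
    spec = abs(fft(windowed, NFFT, 2));            % magnitude spectrum of each row
    powspec = (spec(:, 1:NFFT/2 + 1) .^ 2) / NFFT; % one-sided power spectrum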

    Step 5: Mel Filter Bank Processing

The range of frequencies in the FFT spectrum is very wide and the voice signal
does not follow a linear scale. A bank of filters spaced according to the Mel
scale, as shown in


figure 4, is then applied. The figure shows a set of triangular filters that
are used to compute a weighted sum of spectral components so that the output of
the process approximates a Mel scale. Each filter's magnitude frequency
response is triangular in shape, equal to unity at the centre frequency and
decreasing linearly to zero at the centre frequencies of the two adjacent
filters.
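A MATLAB sketch of a triangular Mel filter bank as described above (unity at each centre frequency, falling linearly to zero at the neighbouring centres). The number of filters and the Mel conversion m = 2595 log10(1 + f/700) are standard choices assumed here rather than values given in the text.

    numFilt = 26;  fs = 16000;  NFFT = 512;
    powspec = rand(98, NFFT/2 + 1);               % stand-in for the power spectra
    mel  = @(f) 2595 * log10(1 + f / 700);        % Hz -> Mel
    imel = @(m) 700 * (10 .^ (m / 2595) - 1);     % Mel -> Hz
    centres = imel(linspace(mel(0), mel(fs/2), numFilt + 2)); % edge/centre freqs
    bins = floor((NFFT + 1) * centres / fs) + 1;  % map centres to FFT bins
    fbank = zeros(numFilt, NFFT/2 + 1);
    for m = 1:numFilt
        for k = bins(m):bins(m+1)                 % rising edge up to unity
            fbank(m, k) = (k - bins(m)) / (bins(m+1) - bins(m));
        end
        for k = bins(m+1):bins(m+2)               % falling edge back to zero
            fbank(m, k) = (bins(m+2) - k) / (bins(m+2) - bins(m+1));
        end
    end
    melspec = powspec * fbank.';                  % weighted sum per filter and frame
    logmel  = log(melspec + eps);                 % log Mel spectrum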

    Step 6: Discrete Cosine Transform

This is the process of converting the log Mel spectrum back into the time
domain using the Discrete Cosine Transform (DCT). The results of the conversion
are called Mel Frequency Cepstrum Coefficients, and the set of coefficients is
called an acoustic vector. Each input utterance is therefore transformed into a
sequence of acoustic vectors.
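A MATLAB sketch of the DCT step, building the DCT-II matrix explicitly so no toolbox is required and keeping the first 13 coefficients, a common choice assumed here; 'logmel' is a placeholder for the log Mel spectrum from the previous step.

    logmel = randn(98, 26);                 % stand-in for the log Mel spectrum
    numCeps = 13;                           % assumed number of coefficients kept
    M = size(logmel, 2);                    % number of Mel filters
    dctMat = zeros(numCeps, M);
    for i = 1:numCeps
        for j = 1:M
            dctMat(i, j) = cos(pi * (i - 1) * (j - 0.5) / M);  % DCT-II basis
        end
    end
    mfcc = logmel * dctMat.';               % one row of MFCCs per frame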

    Step 7: Delta Energy and Delta Spectrum

The voice signal and its frames change over time, for example in the slope of a
formant at its transitions, so features related to the change in cepstral
features over time are added: 13 delta (velocity) features (12 cepstral
features plus energy) and 13 double-delta (acceleration) features, giving 39
features per frame in total.


The energy in a frame for a signal x in a window from time sample t1 to time
sample t2 is given by:

Energy = Σ x²[t]   (summed over t = t1, ..., t2)

[Figure: procedure for forming MFCC]
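A MATLAB sketch of the frame energy and delta features described in Step 7; the simple frame-to-frame difference used for the deltas, and the placeholder inputs, are illustrative assumptions.

    windowed = randn(98, 400);              % stand-in for the windowed frames
    mfcc     = randn(98, 13);               % stand-in for 13 MFCCs per frame
    energy   = sum(windowed .^ 2, 2);       % frame energy: sum of squared samples

    static = [mfcc(:, 2:13), log(energy + eps)];                % 12 cepstra + energy
    delta  = [zeros(1, size(static, 2)); diff(static, 1, 1)];   % velocity features
    deltadelta = [zeros(1, size(delta, 2)); diff(delta, 1, 1)]; % acceleration features
    features = [static, delta, deltadelta];                     % 39 features per frame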

4.4 VECTOR QUANTIZATION

Vector quantization (VQ) is a lossy data compression method based on the
principle of block coding. It is a fixed-to-fixed length algorithm. In the
early days, the design of a vector quantizer was considered a challenging
problem due to the need for multi-dimensional integration. In 1980, Linde,
Buzo, and Gray (LBG) proposed a VQ design algorithm based on a training
sequence.

The main advantage of VQ in pattern recognition is its low computational burden
when compared with other techniques such as dynamic time warping (DTW) and
hidden Markov models (HMM). A VQ is nothing more


than an approximator. The idea is similar to that of "rounding off" (say, to
the nearest integer). In an example of a one-dimensional VQ, every number less
than -2 is approximated by -3, every number between -2 and 0 is approximated by
-1, every number between 0 and 2 is approximated by +1, and every number
greater than 2 is approximated by +3. Note that the approximate values are
uniquely represented by 2 bits, so this is a one-dimensional, 2-bit VQ with a
rate of 2 bits/dimension.

In an example of a two-dimensional VQ, every pair of numbers falling in a
particular region is approximated by the red star associated with that region.
There are 16 regions and 16 red stars, each of which can be uniquely
represented by 4 bits; thus this is a two-dimensional, 4-bit VQ, and its rate
is also 2 bits/dimension. In the above two examples, the red stars are called
code vectors and the regions defined by the


    blue borders are called encoding regions. The set of all code vectors is called

    the codebook and the set of all encoding regions is called the partition of the

space. The performance of a VQ is typically given in terms of the
signal-to-distortion ratio (SDR):

SDR = 10 log10(σ² / D_ave)   (in dB),

where σ² is the variance of the source and D_ave is the average squared-error
distortion. The higher the SDR, the better the performance.
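The MATLAB sketch below builds a small codebook from training feature vectors with a plain k-means loop, a simplified stand-in for the LBG algorithm mentioned above; the training data, codebook size and iteration count are illustrative assumptions.

    train = randn(500, 12);                    % placeholder training feature vectors
    K = 16;                                    % codebook size
    cb = train(1:K, :);                        % initialise code vectors from the data
    for iter = 1:20                            % a few refinement passes
        % Assign each training vector to its nearest code vector.
        assign = zeros(size(train, 1), 1);
        for t = 1:size(train, 1)
            d = sum((cb - repmat(train(t, :), K, 1)) .^ 2, 2);
            [~, assign(t)] = min(d);
        end
        % Move each code vector to the centroid of its encoding region.
        for k = 1:K
            members = train(assign == k, :);
            if ~isempty(members)
                cb(k, :) = mean(members, 1);
            end
        end
    end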

In verification systems two key performance measures are popular: the false
rejection rate (FRR), the rate at which the true speaker is incorrectly
rejected, and the false acceptance rate (FAR), the rate at which an impostor
speaker is incorrectly accepted. By varying the decision threshold, the FAR and
FRR change in opposing directions. For example, raising the threshold lowers
the FAR but increases the FRR, since true claims start to be rejected as the
bar is raised; conversely, if the threshold is lowered, the FRR is reduced but
the FAR increases, since not only are all true claims now accepted but more
false ones are as well. The typical operating point for the selection of the
threshold is where FAR = FRR, termed the equal error rate (EER) condition.
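A MATLAB sketch of the FAR/FRR trade-off described above: a decision threshold is swept over placeholder similarity scores for genuine and impostor claims, and the threshold where the two error rates are closest approximates the equal error rate point. The score distributions are invented purely for illustration.

    trueScores = 2.0 + 0.3 * randn(200, 1);    % scores for genuine claims (higher)
    impScores  = 1.0 + 0.3 * randn(200, 1);    % scores for impostor claims (lower)
    thresholds = linspace(0, 3, 301);
    FRR = zeros(size(thresholds));  FAR = zeros(size(thresholds));
    for i = 1:length(thresholds)
        th = thresholds(i);
        FRR(i) = mean(trueScores < th);        % true speakers rejected
        FAR(i) = mean(impScores >= th);        % impostors accepted
    end
    [~, idx] = min(abs(FAR - FRR));            % closest to the EER condition
    fprintf('EER approx %.2f%% at threshold %.2f\n', ...
            100 * (FAR(idx) + FRR(idx)) / 2, thresholds(idx));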

    4.5 NEED FOR MATLAB

MATLAB is a high-level language and interactive environment for numerical
computation, visualization, and programming. Using MATLAB, we can analyze data,
develop algorithms, and create models and applications. The language, tools,
and built-in math functions enable you to explore multiple approaches and reach
a solution faster than with spreadsheets or traditional programming languages
such as C/C++ or Java. MATLAB has a range of applications, including signal
processing and communications, image and video processing, control systems,
test and measurement, computational finance, and computational biology. Hence
we prefer MATLAB as our software tool. More than a million engineers and
scientists in industry and academia use MATLAB, the language of technical
computing. MATLAB has built-in


mathematical functions to solve science and engineering problems. MATLAB
(matrix laboratory) is a numerical computing environment and fourth-generation
programming language. Developed by MathWorks, MATLAB allows matrix
manipulations, plotting of functions and data, implementation of algorithms,
creation of user interfaces, and interfacing with programs written in other
languages, including C, C++, Java, and Fortran. (Cleve Moler, the chairman of
the computer science department, started developing MATLAB in the late 1970s.
Jack Little recognized its commercial potential and joined with Moler and Steve
Bangert; they rewrote MATLAB in C and founded MathWorks in 1984.)


    CHAPTER 5

    CONCLUSION

From the comparative study of various biometric systems we came to the
conclusion that voice recognition is best suited to our application. The Mel
filter bank is well suited for feature extraction, and the vector quantization
technique is preferred for feature matching when implementing the
voice-detection-based attendance

    system.