Speech recognition

8/12/2019

CONTENTS

LIST OF FIGURES
INTRODUCTION
CHAPTER 1: ABSTRACT
1.1 GENERAL
1.2 PRINCIPLES OF SPEECH RECOGNITION
CHAPTER 2: LITERATURE REVIEW
CHAPTER 3: PROPOSED SYSTEM
3.1 DRAWBACKS OF EXISTING SYSTEMS
3.1.1 FACE RECOGNITION
3.1.2 FINGERPRINT
3.1.3 IRIS RECOGNITION
3.1.4 PALM RECOGNITION
3.1.5 HAND GEOMETRY
3.1.6 SIGNATURE RECOGNITION
3.2 COMPARATIVE STUDY WITH EXISTING SYSTEM
3.3 ADVANTAGES OF VoDAR
CHAPTER 4: BLOCK DIAGRAM
4.1 HUMAN VOICE GENERATION
4.2 SPEECH RECOGNITION
4.3 FEATURE EXTRACTION
4.4 VECTOR QUANTIZATION
4.5 NEED FOR MATLAB
CHAPTER 5: CONCLUSION



Department of ECE, Thejus Engg. College, Thrissur

INTRODUCTION

VoDAR (Voice Detection based Attendance Register) is a system that registers the attendance of each student with high accuracy. Personal identification is performed using the student's voice, so the defects of the ordinary attendance-taking practice can be reduced to a large extent. The voice characteristics of every person differ in one way or another; the vocal characteristics of two persons are never all alike. This is why voice is chosen as the basis for personal identification.

Compared to the conventional manual attendance marking system, it saves a lot of time in registering attendance. Further, the method is based on the speaker's voice recorded in real time, which makes it highly robust against malpractice. The automated creation of the attendance record as an Excel spreadsheet makes it very convenient to use, access, archive, and print the attendance using the classroom computer, the department or college server, or any computer connected to the college intranet. In practice the project needs complicated circuitry whose expense is very high, so this project builds a theoretical model rather than a practical one. VoDAR is a software-based model. Voice recognition in this project is done using MATLAB.




CHAPTER 1

ABSTRACT

1.1 GENERAL

The current practice of taking attendance in a lecture class is simply for each student to call out a roll number, which the teacher then marks. This is a time-consuming process, and its accuracy is low. It is found that students who are not present in the classroom get attendance through malpractice, and there is ample opportunity for students to do so. It is also cumbersome to enter and calculate a student's overall attendance, and sometimes a student may not get attendance even though he or she is present in the class. It is very difficult to avoid these limitations even if the process is followed carefully.

In determining the internal marks of every student, attendance has a 10% weight. Malpractice in taking attendance therefore affects the students' internals. The idea for this project came to us while thinking about a scientific way to register attendance, so that the defects of the ordinary attendance-taking practice can be reduced to a large extent. The physical biometric characteristics generally used are the face, iris, fingerprints, palm prints, and hand geometry; the behavioral characteristics include signature and voice pattern.

Biometrics is the science or technology that measures and analyzes biological data. In computer science it refers to the science or technology that measures and analyzes the physical or behavioral characteristics of a person for authentication.




Voice recognition or speaker recognition systems extract features from speech (here using MATLAB) and model them for use in recognition. These systems use the acoustic features present in speech, which are unique to each individual. These acoustic patterns depend on the physical characteristics of the individual (e.g. the size of the mouth and throat) as well as behavioral characteristics such as speaking style and voice pitch.

Everyone has a distinct voice, different from all others; almost like a

fingerprint, one's voice is unique and can act as an identifier. The human voice

is composed of a multitude of different components, making each voice

different; namely

1. Pitch

2. Tone

1.2 PRINCIPLES OF SPEECH RECOGNITION

Speech is one of the natural forms of communication, and recent developments have made it possible to use it in security systems. In speaker identification, the task is to use a speech sample to select the identity of the person who produced the speech from a population of speakers. In speaker verification, the task is to use a speech sample to test whether a person who claims to have produced the speech has in fact done so. This technique makes it possible to use the speaker's voice to verify their identity and control access to services such as voice dialing, banking by telephone, attendance marking, telephone shopping, database access services, information services, voice mail, security control for confidential information areas, and remote access to computers.
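The difference between the two tasks can be illustrated with a minimal Python sketch. It assumes each enrolled speaker is summarised by a single template vector and that closeness is measured with Euclidean distance; the names `enrolled`, `identify`, `verify` and the threshold value are hypothetical and not taken from the report.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def identify(sample, enrolled):
    """Identification: pick the enrolled speaker whose template is closest."""
    return min(enrolled, key=lambda name: distance(sample, enrolled[name]))

def verify(sample, claimed, enrolled, threshold):
    """Verification: accept the claimed identity only if the sample is close enough."""
    return distance(sample, enrolled[claimed]) <= threshold

# Hypothetical two-dimensional templates, for illustration only.
enrolled = {"student_a": [1.0, 2.0], "student_b": [4.0, 0.5]}
sample = [1.1, 1.9]
print(identify(sample, enrolled))
print(verify(sample, "student_b", enrolled, threshold=0.5))
```

Identification always returns some speaker from the population, while verification answers yes or no for one claimed identity, which is why it needs a threshold.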

Speaker recognition methods can be divided into text-independent and text-dependent methods. In a text-independent system, speaker models capture characteristics of somebody's speech that show up irrespective of what one is saying. In a text-dependent system, on the other hand, the recognition of the speaker's identity is based on his or her speaking one or more specific phrases, such as passwords, card numbers, or PIN codes.

Every speaker recognition technology, whether identification or verification, text-independent or text-dependent, has its own advantages and disadvantages and may require different treatment and techniques. The choice of which technology to use is application specific. At the highest level, all speaker recognition systems contain two main modules: feature extraction and feature matching.




CHAPTER 2

LITERATURE REVIEW

This chapter contains a brief overview of voice recognition using MFCC and

vector quantization. The details of the project were obtained from some of the

journals listed below.

A paper on speaker identification in an international journal (2011) describes speaker recognition as the computing task of validating a user's claimed identity using characteristics extracted from their voice. Voice recognition is a combination of speaker recognition and speech recognition: it uses learned aspects of a speaker's voice to determine what is being said. Such a system cannot recognize speech from random speakers very accurately, but it can reach high accuracy for the individual voices it has been trained with, which gives it various applications in day-to-day life. This paper introduced us to various methods of speaker identification involving LPC and MFCC feature extraction. Linear prediction is a mathematical operation that provides an estimate of the current sample of a discrete signal as a linear combination of several previous samples. The prediction error, i.e. the difference between the predicted and the actual value, is called the residual. Feature extraction is implemented using this idea, whereas in MFCC the log of the signal energy is calculated.
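The linear prediction idea described above can be sketched in a few lines of Python. The sketch assumes the prediction coefficients are already known; how they are estimated is not covered here, and the function names and example signal are hypothetical.

```python
def predict(signal, coeffs, n):
    """Estimate sample n as a linear combination of the previous len(coeffs) samples."""
    return sum(a * signal[n - k] for k, a in enumerate(coeffs, start=1))

def residual(signal, coeffs):
    """Prediction error (actual minus predicted) for every sample with enough history."""
    p = len(coeffs)
    return [signal[n] - predict(signal, coeffs, n) for n in range(p, len(signal))]

# Hypothetical signal that doubles each step; one coefficient a1 = 2 predicts it exactly.
signal = [1.0, 2.0, 4.0, 8.0, 16.0]
print(residual(signal, [2.0]))
```

A small residual means the chosen coefficients describe the signal well, which is why the coefficients themselves can serve as features.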

A paper on the design of an automatic speaker recognition system in an international journal (2011) describes the concept of MFCC. MFCCs are derived from a type of cepstral representation of the audio clip. The difference between the cepstrum and the Mel-frequency cepstrum (MFC) is that in the MFC the frequency bands are equally spaced on the Mel scale, which approximates the human auditory system's response more closely than the linearly spaced frequency bands used in the normal cepstrum. The cepstrum is a common transform used to gain information from a person's speech signal. It can be used to separate the excitation signal (which contains the words and the pitch) and the transfer function (which contains the voice quality). It is the result of taking the Fourier transform of the decibel spectrum as if it were a signal. We use cepstral analysis in speaker identification because the speech signal is of the particular form above, and the "cepstral transform" of it makes analysis considerably simpler.

Mathematically, cepstrum of signal = FT[log{FT(the windowed signal)}]

MFCCs are commonly calculated by first taking the Fourier transform of a windowed excerpt of a signal and mapping the powers of the resulting spectrum onto the Mel scale, using triangular overlapping windows. Next, the logs of the powers at each of the Mel frequencies are taken, and the Discrete Cosine Transform is applied to the list of log powers (as if it were a signal). The MFCCs are the amplitudes of the resulting spectrum. The speech input is typically recorded at a sampling rate above 10000 Hz. This sampling frequency is chosen to minimize the effects of aliasing in the analog-to-digital conversion. Signals sampled at this rate can capture all frequencies up to 5 kHz, which covers most of the energy of the sounds generated by humans. The main purpose of the MFCC processor is to mimic the behavior of the human ear. In addition, MFCCs are shown to be less susceptible to the variations mentioned above. This paper helped us to gain knowledge about feature extraction using MFCC.
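The report does not state which Mel formula it uses. As an illustration only, the sketch below assumes the commonly used mapping mel = 2595 log10(1 + f/700) and shows that points equally spaced on the Mel scale become progressively wider apart in Hz. The function names are hypothetical.

```python
import math

def hz_to_mel(f_hz):
    """Convert a frequency in Hz to Mel (assumed formula, see lead-in)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_spaced_frequencies(f_low, f_high, count):
    """Return `count` frequencies in Hz that are equally spaced on the Mel scale."""
    lo, hi = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (hi - lo) / (count - 1)
    return [mel_to_hz(lo + i * step) for i in range(count)]

# 0 Hz to 5 kHz matches the frequency range mentioned in the text.
freqs = mel_spaced_frequencies(0.0, 5000.0, 6)
print([round(f) for f in freqs])
```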

A paper on vector quantization based speaker identification in an international journal (2010) describes that a speaker recognition system must be able to estimate the probability distributions of the computed feature vectors. Storing every single vector generated in the training mode is impractical, since these distributions are defined over a high-dimensional space. It is often easier to start by quantizing each feature vector to one of a relatively small number of template vectors, a process called vector quantization (VQ). VQ takes a large set of feature vectors and produces a smaller set of vectors that represent the centroids of the distribution. The features of the training data are clustered in this way to form a codebook for each speaker. In the recognition stage, the data from the tested speaker is compared to the codebook of each speaker and the difference is measured. These differences are then used to make the recognition decision.
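The recognition stage described above can be sketched as follows. The sketch assumes the "difference" is the average squared Euclidean distance from each test vector to its nearest codeword; the report does not name the distance measure, and the codebooks shown are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def average_distortion(vectors, codebook):
    """Mean squared distance from each feature vector to its nearest codeword."""
    total = sum(min(squared_distance(v, c) for c in codebook) for v in vectors)
    return total / len(vectors)

def recognise(vectors, codebooks):
    """Return the speaker whose codebook gives the lowest average distortion."""
    return min(codebooks, key=lambda s: average_distortion(vectors, codebooks[s]))

# Hypothetical two-codeword codebooks for two enrolled speakers.
codebooks = {
    "speaker_1": [[0.0, 0.0], [1.0, 1.0]],
    "speaker_2": [[5.0, 5.0], [6.0, 6.0]],
}
test_vectors = [[0.1, 0.0], [0.9, 1.2]]
print(recognise(test_vectors, codebooks))
```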

A survey of biometric recognition systems and their applications (Journal of Theoretical and Applied Information Technology, 2010) describes the various biometric systems available and their peculiarities. Human physical characteristics such as fingerprints, face, voice and iris are known as biometrics. This survey helped us to understand the various biometric systems and to choose voice recognition as the one best suited to our application, i.e. an attendance system. Facial recognition has disadvantages: the system is complex and time consuming compared to voice. In addition, 2D recognition is affected by changes in lighting, by the person's age, and by whether the person wears glasses. It also requires camera equipment for user identification, which is costly. Iris recognition has disadvantages such as a large storage requirement and high expense. Fingerprint recognition can make mistakes due to dryness or dirt on the skin of the finger, as well as with age. It demands a large memory, and compression is required (by a factor of approximately 10). Hence, out of all of these, the most suitable for our application is voice, which is cheap and has a verification time of only 5 seconds.

MFCC and its applications in speaker recognition (International Journal on Emerging Technologies, 2010) describes how speech processing has emerged as one of the important application areas of digital signal processing. The various fields of research in speech processing include speech recognition, speaker recognition, speech synthesis, and speech coding. The objective of automatic speaker recognition is to extract, characterize and recognize the information about speaker identity. Feature extraction is the first step for speaker recognition, and many algorithms have been suggested or developed by researchers for it. In that work, the Mel Frequency Cepstrum Coefficient (MFCC) feature has been used for designing a text-dependent speaker identification system. Some modifications to the existing MFCC technique for feature extraction are also suggested to improve the speaker recognition efficiency. From this paper we found that as the number of filters in the filter bank increases, the efficiency also increases. Also, the Hanning window gives higher efficiency than the rectangular window.

Vector quantization based speaker identification (International Journal of Computer Applications, 2010) describes a methodology for speaker identification, which consists of comparing a speech signal from an unknown speaker to a database of known speakers. The methodology uses a feature extraction process followed by vector quantization of the extracted features using the K-means algorithm. The K-means algorithm is widely used in speech processing as a dynamic clustering approach. K is pre-selected and simply refers to the number of desired clusters. In the recognition phase an unknown speaker, represented by a sequence of feature vectors {x1, ..., xT}, is compared with the codebooks in the database. For each codebook a distortion measure is computed, and the speaker with the lowest distortion is chosen. The VQ-based clustering approach was preferred because it provides a faster speaker identification process.
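Codebook training with K-means can be sketched as below. This is a minimal illustration rather than the cited paper's implementation: it assumes squared Euclidean distance, seeds the centroids with the first K training vectors, and runs a fixed number of iterations. All names and data are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(vector, centroids):
    """Index of the centroid closest to `vector`."""
    return min(range(len(centroids)), key=lambda i: squared_distance(vector, centroids[i]))

def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def kmeans(vectors, k, iterations=20):
    """Cluster `vectors` into k centroids; the result serves as a speaker codebook."""
    centroids = [list(v) for v in vectors[:k]]  # assumption: seed with the first k vectors
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            clusters[nearest(v, centroids)].append(v)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids

# Hypothetical training vectors forming two obvious groups.
training = [[0.0, 0.0], [10.0, 10.0], [0.0, 1.0], [10.0, 11.0]]
codebook = kmeans(training, k=2)
print(codebook)
```

In practice the seeding strategy and stopping rule affect the resulting codebook, so the choices here should be read as placeholders.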




CHAPTER 3

PROPOSED SYSTEM

3.1 DRAWBACKS OF EXISTING SYSTEMS

3.1.1 Face recognition attendance system:

Humans have a remarkable ability to recognize fellow beings based on facial

appearance. So, face is a natural human trait for automated biometric

recognition. Face recognition systems typically utilize the spatial relationship

among the locations of facial features such as eyes, nose, lips, chin, and the

global appearance of a face. The forensic and civilian applications of face recognition technologies pose a number of technical challenges, for instance for static mug-shot photograph matching (e.g., for ensuring that the same person is not requesting multiple passports). The problems associated with illumination,

gesture, facial makeup, occlusion, and pose variations adversely affect the face

recognition performance. While face recognition is non-intrusive, robust face

recognition in non-ideal situations continues to pose challenges.

3.1.2 Fingerprint recognition attendance system:

Fingerprint-based recognition has been the longest-serving and a popular method

for person identification. Fingerprints consist of a regular texture pattern

composed of ridges and valleys. These ridges are characterized by several

landmark points, known as minutiae, which are mostly in the form of ridge

endings and ridge bifurcations. The spatial distribution of these minutiae points

is claimed to be unique to each finger; it is the collection of minutiae points in a

fingerprint that is primarily employed for matching two fingerprints. In addition

to minutiae points, there are sweat pores and other details (referred to as extended or level 3 features) which can be acquired in high-resolution (1000 ppi) fingerprint images. However, this system has some disadvantages. If the surface of the finger is damaged or has one or more marks on it, identification becomes increasingly hard. Furthermore, the system requires the user's finger surface to have minutiae points or a pattern in order to match images. This is a limiting factor for the security of the algorithm.

3.1.3 Iris recognition attendance system:

The iris is the colored annular ring that surrounds the pupil. Iris images acquired

under infrared illumination consist of a complex texture pattern with numerous

individual attributes, e.g. stripes, pits, and furrows, which allow for highly

reliable personal identification. The iris is a protected internal organ whose

texture is stable and distinctive, even among identical twins (similar to

fingerprints), and extremely difficult to surgically spoof. However, relatively

high sensor cost, along with a relatively large failure to enroll (FTE) rate reported in some studies, and the lack of legacy iris databases may limit its usage in some large-scale government applications.

3.1.4 Palm print recognition attendance system:

The image of a human palm consists of palmar friction ridges and flexion creases. Similar to fingerprints, latent palm print systems utilize minutiae and creases for matching. Based on the success of fingerprints in civilian applications, some attempts have been made to utilize low-resolution palm print images for access control applications. These systems utilize texture features which are quite similar to those employed for iris recognition. Palm print

recognition systems have not yet been deployed for civilian applications (e.g.,

access control), mainly due to their large physical size and the fact that

fingerprint identification based on compact and embedded sensors works quite

well for such applications.




3.1.5 Hand Geometry recognition attendance system:

It is claimed that individuals can be discriminated based on the shape of their

hands. Person identification using hand geometry utilizes low resolution hand

images to extract a number of geometrical features such as finger length, width,

thickness, perimeter, and finger area. The discriminatory power of these features

is quite limited, and therefore hand geometry systems are employed only for verification applications in low-security access control and time-and-attendance applications. Hand geometry systems have a large physical size, so they cannot be easily embedded in existing security systems.

3.1.6 Signature recognition attendance system:

Signature is a behavioral biometric modality that is used in daily business

transactions (e.g., credit card purchase). However, attempts to develop highly

accurate signature recognition systems have not been successful. This is primarily due to the large intra-class variations in a person's signature over time. Attempts have been made to improve signature recognition performance by capturing dynamic or online signatures, which require a pressure-sensitive pen-pad. Dynamic signatures help in acquiring the shape, speed, acceleration, pen pressure, and order and speed of strokes during the actual act of signing. This additional information seems to improve the verification performance (over static signatures) as well as to circumvent signature forgeries. Still, very few automatic signature verification systems have been deployed.




3.2 COMPARATIVE STUDY WITH EXISTING SYSTEMS

Features            Eye-Iris    Eye-Retina   Fingerprint                    Signature                    Voice
Reliability         Very High   Very High    High                           High                         High
Ease of use         Average     Low          High                           High                         High
Social Acceptance   Low         Low          Medium                         High                         High
Interference        Glasses     Irritation   Dirtiness, Injury, Roughness   Changeable, Easy signature   Noise
Cost                High        High         Medium                         Low                          Low
Device Required     Camera      Camera       Scanner                        Optic pen, Touch panel       Microphone

We can conclude from the above table that voice recognition is comparatively low in cost and sufficiently accurate. The device required for voice recognition is easily available and cheap compared to other systems. It is also socially acceptable.




3.3 ADVANTAGES OF VoDAR

3.3.1 Ability to Use Technology remotely

One of the main advantages of voice verification technology is the ability to use it remotely. Many other types of biometrics, such as fingerprint, retina or iris biometrics, cannot be used remotely.

Speech recognition technology is easy to use over the phone or other speaking devices, which increases its usefulness to many companies. The ability to use it remotely makes it stand out among the many other types of biometric technology available today.

3.3.2 Low Cost of Using It

The low cost of this technology is another advantage of voice recognition. The price of acquiring a voice recognition system is usually quite reasonable, especially when compared to the price of other biometric systems. These systems are relatively cheap to implement and maintain, and the equipment needed is low priced as well. Very little equipment is needed, making it a cost-effective option for businesses that are carefully watching their bottom line.

In many cases, if the technology is being used remotely over the phone, all that is required is the right biometric software. The phone acts as the speaking device, so there is no investment in such devices. For systems used for authentication and verification on site, businesses only have to purchase a device that users can speak into, along with the speech recognition software.


3.3.3 High Reliability Rate

Another of the heralded advantages of voice recognition is this technology's high reliability rate. 10-20 years ago, the reliability rate of speech recognition technology was actually quite low.

Reliability suffered from many problems, such as the inability to deal with background noise or to recognize voices when an individual had a slight cold. However, these problems have largely been dealt with today, giving this biometric technology a very high reliability rate. Vocal prints can now be used to identify an individual even if their speech sounds a bit different due to a cold.

One of the advantages of these types of systems is that they are designed

to ignore background noise and focus on the voice, which also has given the

reliability rate a huge boost.

3.3.4 Ease of Use and Implementation

Many companies appreciate the ease of use and implementation that comes with voice recognition biometrics. Some biometric technologies can be difficult to implement in a company and difficult to begin using. Since these systems require so little equipment, they can usually be implemented without the addition of new equipment and systems. Since they are so easy to use, companies can often reduce the personnel assigned to them and make use of those staff elsewhere in the company to improve performance and customer satisfaction.


3.3.5 Minimally Invasive

One of the major advantages of voice recognition is that it is minimally invasive. This is very important to the individuals who use these security devices. Many consumers today do not like many forms of biometric technology, since other forms seem so invasive.

Speech technology only requires individuals to speak and offer a vocal sample, which is minimally invasive. Since this

technology has a high approval rate among consumers, it can help businesses

keep their customers happy with the service they are providing.




CHAPTER 4

BLOCK DIAGRAM

The main aim of this project is speaker identification, which consists of comparing a speech signal from an unknown speaker to a database of known speakers. The system can recognize a speaker once it has been trained with a number of speakers. The figure above shows the fundamental formation of speaker identification and verification systems. Speaker identification is the process of determining which registered speaker provides a given speech sample.

On the other hand, speaker verification is the process of accepting or rejecting the identity claim of a speaker. In most applications, voice is used as the key to confirm the identity of a speaker, which is known as speaker verification. The system consists of a microphone connected to a computer system. The voice inputs of each student are recorded via the microphone, and each input is analyzed by the system using the MATLAB software, which is the software tool of our project. First we have to store some reference voice signal waveforms with the help of the microphone and the computer. These stored speech signals are called corpus sentences. With the help of MATLAB, these waveforms are analyzed and each speech signal is converted into vector form.

The input voice signals from the students are also converted into vector form. This vector is compared with the corpus sentences, and the most similar corpus sentence is determined. Thus speaker identification is carried out and the corresponding attendance is marked.
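The final step, marking attendance once speakers have been identified, can be sketched as below. The report produces an Excel spreadsheet from MATLAB; this Python sketch substitutes plain CSV text, which spreadsheet programs can open, and the student names, the `P`/`A` codes and the function names are hypothetical.

```python
import csv
import io

def mark_attendance(roll, identified):
    """Map every name on the roll to 'P' (present) or 'A' (absent)."""
    present = set(identified)
    return {name: ("P" if name in present else "A") for name in roll}

def to_csv(register):
    """Serialise the register as CSV text that a spreadsheet program can open."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Student", "Status"])
    for name, status in register.items():
        writer.writerow([name, status])
    return buffer.getvalue()

# Hypothetical class roll and the names returned by the identification step.
register = mark_attendance(["asha", "binu", "cyril"], ["cyril", "asha"])
print(to_csv(register))
```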

4.1 HUMAN VOICE GENERATION




Consider the anatomy and physiology of the voice by following the voice from

the lungs to the lips. The breath stream, referred to as the "generator" of the

voice, originates in the lungs. This generator provides a controlled flow of air

which powers the vocal folds by setting them into motion.

The human larynx has three vital functions:

1. Airway protection (prevention of aspiration)

2. Respiration (breathing)

3. Phonation (talking)

When we speak, the vocal folds approximate and vibrate to produce voice.

When we breathe the vocal folds open or abduct and allow air to flow from the

lungs through the mouth and nose and vice versa. When we eat, we reflexively

stop breathing and the vocal folds approximate to protect the airway and keep

food and drink out of the lungs. The speech signal is given by

[Figure: speech signal waveform; axis label: amplitude]


The vocal folds do not operate like strings on a violin but are more comparable to vibrating lips "buzzing". The vocal tract forms the three-dimensional cavity, or "resonator", that provides sound modification. The articulators (the parts of the vocal tract above the larynx, consisting of the tongue, palate, cheeks, lips, etc.) articulate and filter the sound emanating from the larynx and to some degree can interact with the laryngeal airflow to strengthen or weaken it as a sound source. Adult men and women have different sizes of vocal fold, reflecting the male-female differences in larynx size. Adult male voices are usually lower-pitched and have larger folds. The male vocal folds (which would be measured vertically in the opposite diagram) are between 17 mm and 25 mm in length. The female vocal folds are between 12.5 mm and 17.5 mm in length. The difference in vocal fold size between men and women means that they have differently pitched voices. Additionally, genetics causes variation within the same sex, with men's and women's singing voices being categorized into types.

4.2 SPEECH RECOGNITION

The structure of a typical speech recognition system mainly consists of feature extraction, training and recognition. Because the speech signal is not stationary, feature extraction from it is very difficult. Each word has different features. For each word there are differences among different persons, such as the differences between adults and children or between male and female speakers. Even for the same person and the same word, the signal changes from one time to another. Nowadays, several feature extraction methods are used in speech recognition systems. All of them perform well in clean conditions. In adverse conditions, a good method for speech recognition systems has still not been found.


Compared with them, the human auditory system always performs well under both clean and noisy conditions. So one way to solve this is to study our auditory system and use the results in the speech recognition system being developed. There are two major approaches to feature extraction: modelling the human voice production system and modelling the human perception system. For the first approach, one of the most popular features is the LPC (Linear Prediction Coefficient) feature. For the second approach, the most popular feature is the MFCC (Mel-Frequency Cepstrum Coefficient) feature. The main advantage of MFCC is that it uses Mel frequency scaling, which closely approximates the human auditory system. Hence it is more effective than LPC.

A. Definition of speech recognition:

Speech recognition (also known as Automatic Speech Recognition (ASR) or computer speech recognition) is the process of converting a speech signal to a sequence of words by means of an algorithm implemented as a computer program.

B. Types of Speech Recognition:




Speech recognition systems can be separated into several different classes according to the types of utterances they are able to recognize. These classes are the following:

Isolated Words:

Isolated word recognizers usually require each utterance to have quiet (lack of an audio signal) on both sides of the sample window. They accept a single word or a single utterance at a time. These systems have "Listen/Not-Listen" states, where they require the speaker to wait between utterances (usually doing processing during the pauses). Isolated Utterance might be a better name for this class.

Connected Words:

Connected word systems (or more correctly 'connected utterances') are similar to isolated word systems, but allow separate utterances to be 'run together' with a minimal pause between them.

Continuous Speech:

Continuous speech recognizers allow users to speak almost naturally, while the computer determines the content. (Basically, it is computer dictation.) Recognizers with continuous speech capabilities are some of the most difficult to create because they must utilize special methods to determine utterance boundaries.

Spontaneous Speech:

At a basic level, this can be thought of as speech that is natural sounding and not rehearsed. An ASR system with spontaneous speech ability should be able to handle a variety of natural speech features such as words being run together, "ums" and "ahs", and even slight stutters.

4.3 FEATURE EXTRACTION




Step 1: Pre-emphasis

In this step the signal is passed through a filter that emphasizes higher frequencies. This increases the energy of the signal at higher frequencies.

Y[n] = X[n] - a X[n-1]

Here a = 0.95, which means that 95% of any one sample is presumed to originate from the previous sample.
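The pre-emphasis filter above can be sketched directly in Python. The treatment of the first sample, which has no predecessor, is an assumption here: it is passed through unchanged.

```python
def pre_emphasis(x, a=0.95):
    """Apply y[n] = x[n] - a * x[n-1]; the first sample is passed through unchanged."""
    if not x:
        return []
    return [x[0]] + [x[n] - a * x[n - 1] for n in range(1, len(x))]

# A constant (low-frequency) signal is strongly attenuated,
# while a rapidly alternating (high-frequency) one is amplified.
print(pre_emphasis([1.0, 1.0, 1.0]))
print(pre_emphasis([1.0, -1.0, 1.0]))
```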

Step 2: Framing

This is the process of segmenting the speech samples obtained from analog-to-digital conversion (ADC) into small frames with a length in the range of 20 to 40 msec. The voice signal is divided into frames of N samples. Adjacent frames are separated by M (M
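Framing can be sketched as follows. The sentence above is cut off in the source, so the sketch assumes only what is stated: frames of N samples whose starting points are M samples apart. The function names, the 25 ms duration and the 10000 Hz rate are illustrative choices within the ranges the text mentions.

```python
def frame_length_in_samples(duration_ms, sample_rate_hz):
    """Number of samples N in a frame of the given duration."""
    return int(sample_rate_hz * duration_ms / 1000)

def frame_signal(samples, frame_len, step):
    """Split samples into frames of frame_len samples whose starts are `step` apart."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, step):
        frames.append(samples[start:start + frame_len])
    return frames

# With step smaller than frame_len, neighbouring frames overlap.
print(frame_signal(list(range(10)), frame_len=4, step=2))
print(frame_length_in_samples(25, 10000))
```

Trailing samples that do not fill a whole frame are dropped in this sketch; padding them is an equally plausible choice.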




Step 3: Windowing

Traditional methods of spectral evaluation are reliable only for a stationary signal. The nature of the voice signal changes continuously with time, so reliability can be ensured only over a short time. The audio signal is continuous, and processing cannot wait for the last sample; processing complexity would also increase exponentially. It is therefore important to retain short-term features. Short-time analysis is performed by windowing the signal. Normally the Hamming window is used. The Hamming function is given by

Hamming window:

\[
W(n) =
\begin{cases}
0.54 - 0.46 \cos\left(\dfrac{2\pi n}{L-1}\right), & 0 \le n \le L-1 \\
0, & \text{otherwise}
\end{cases}
\]
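
Framing and windowing can be illustrated together. The sketch below is a minimal example under stated assumptions: the function names are hypothetical, M is interpreted as the distance in samples between the starts of adjacent frames, a trailing partial frame is simply dropped, and the values N = 4 and M = 2 are toy numbers chosen only to keep the example small.

```python
import math


def hamming(L):
    """W(n) = 0.54 - 0.46 * cos(2*pi*n / (L - 1)) for 0 <= n <= L - 1."""
    if L == 1:
        return [1.0]  # avoid dividing by zero for a one-sample window
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (L - 1)) for n in range(L)]


def frame_signal(x, N, M):
    """Split x into frames of N samples whose start points are M samples apart."""
    frames = []
    start = 0
    while start + N <= len(x):  # assumption: an incomplete last frame is discarded
        frames.append(x[start:start + N])
        start += M
    return frames


def window_frames(frames):
    """Multiply every frame sample-by-sample with a Hamming window of the same length."""
    w = hamming(len(frames[0]))
    return [[s * wn for s, wn in zip(frame, w)] for frame in frames]


signal = [float(i) for i in range(10)]
frames = frame_signal(signal, N=4, M=2)
windowed = window_frames(frames)
```

Choosing M smaller than N makes adjacent frames overlap, so samples that are attenuated at the edge of one window sit near the centre of a neighbouring one.
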

Step 4: Fast Fourier Transform

This step converts each frame of N samples from the time domain into the
frequency domain. The Fourier transform converts the convolution of the
glottal pulse U[n] and the vocal tract impulse response H[n] in the time
domain into a multiplication in the frequency domain. This statement
supports the equation below:

\[
Y(w) = \mathrm{FFT}\left[ H(t) * X(t) \right] = H(w) \cdot X(w)
\]

where * denotes convolution and X(w), H(w) and Y(w) are the Fourier
transforms of X(t), H(t) and Y(t) respectively.
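
The relation between convolution in time and multiplication in frequency can be checked numerically. The following sketch uses a naive discrete Fourier transform rather than a true FFT (an FFT computes the same values faster), and it uses circular convolution, for which the identity holds exactly on finite-length sequences. The sequences `x` and `h` are arbitrary illustrative values, not data from the project.

```python
import cmath


def dft(x):
    """Naive O(N^2) discrete Fourier transform of a list of samples."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]


def circular_convolve(h, x):
    """Circular convolution of two equal-length sequences."""
    N = len(x)
    return [sum(h[m] * x[(n - m) % N] for m in range(N)) for n in range(N)]


x = [1.0, 2.0, 0.0, -1.0]   # illustrative excitation
h = [0.5, 0.25, 0.0, 0.0]   # illustrative impulse response

lhs = dft(circular_convolve(h, x))                  # transform of the convolution
rhs = [Hk * Xk for Hk, Xk in zip(dft(h), dft(x))]   # product of the transforms
max_error = max(abs(a - b) for a, b in zip(lhs, rhs))
```
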

Step 5: Mel Filter Bank Processing

The frequency range of the FFT spectrum is very wide, and the voice signal
does not follow a linear scale. A bank of filters spaced according to the Mel
scale, as shown in figure 4, is therefore applied. The figure shows a set of
triangular filters that are used to compute a weighted sum of spectral
components so that the output of the process approximates a Mel scale. Each
filter's magnitude frequency response is triangular in shape: it is equal to
unity at the centre frequency and decreases linearly to zero at the centre
frequencies of the two adjacent filters.
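
A triangular Mel filter bank of this kind might be built as follows. This is a sketch under assumptions the text does not fix: the Hz-to-Mel conversion uses the commonly quoted formula mel = 2595 log10(1 + f/700), and the filter count, FFT size and sampling rate in the usage lines are illustrative values only. All function names are hypothetical.

```python
import math


def hz_to_mel(f_hz):
    # Commonly used Mel formula; assumed here, the text does not give one.
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filter_bank(num_filters, fft_size, sample_rate):
    """Build triangular filters over the fft_size // 2 + 1 bins of a one-sided spectrum."""
    num_bins = fft_size // 2 + 1
    top_mel = hz_to_mel(sample_rate / 2.0)
    # num_filters + 2 points equally spaced in Mel: two outer edges plus the centres.
    mel_points = [top_mel * i / (num_filters + 1) for i in range(num_filters + 2)]
    bins = [min(num_bins - 1, int((fft_size + 1) * mel_to_hz(m) / sample_rate))
            for m in mel_points]
    bank = []
    for i in range(1, num_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        weights = [0.0] * num_bins
        weights[centre] = 1.0                      # unity at the centre frequency
        for k in range(left + 1, centre):          # rising edge from the left neighbour
            weights[k] = (k - left) / (centre - left)
        for k in range(centre + 1, right):         # falling edge to the right neighbour
            weights[k] = (right - k) / (right - centre)
        bank.append(weights)
    return bank


def apply_filter_bank(power_spectrum, bank):
    """Weighted sum of spectral components for each filter."""
    return [sum(w * p for w, p in zip(weights, power_spectrum)) for weights in bank]


bank = mel_filter_bank(num_filters=10, fft_size=512, sample_rate=16000)
flat_spectrum = [1.0] * 257
mel_energies = apply_filter_bank(flat_spectrum, bank)
```

Because the centres are equally spaced in Mel rather than in Hz, the filters are narrow at low frequencies and progressively wider at high frequencies.
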

Step 6: Discrete Cosine Transform

This step converts the log Mel spectrum back into the time domain using the
Discrete Cosine Transform (DCT). The result of the conversion is called the
Mel Frequency Cepstrum Coefficients (MFCC). The set of coefficients for a
frame is called an acoustic vector. Therefore, each input utterance is
transformed into a sequence of acoustic vectors.
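
The conversion from Mel filter outputs to cepstral coefficients can be sketched as below. The unnormalised DCT-II is used, the small floor before the logarithm is a numerical safeguard, and keeping 12 coefficients while skipping the zeroth one is a common convention assumed here rather than something the text specifies. The function names are hypothetical.

```python
import math


def dct_ii(values):
    """Unnormalised DCT-II: c[k] = sum_n v[n] * cos(pi * k * (n + 0.5) / N)."""
    N = len(values)
    return [sum(v * math.cos(math.pi * k * (n + 0.5) / N) for n, v in enumerate(values))
            for k in range(N)]


def mfcc_from_mel_energies(mel_energies, num_coeffs=12):
    """Log-compress the filter outputs, apply the DCT, keep num_coeffs coefficients."""
    # Assumption: a small floor avoids log(0) for empty filters.
    log_energies = [math.log(max(e, 1e-10)) for e in mel_energies]
    # Assumption: coefficient 0 (overall level) is skipped, coefficients 1..num_coeffs kept.
    return dct_ii(log_energies)[1:num_coeffs + 1]


# One acoustic vector for one frame, from 20 illustrative filter outputs.
acoustic_vector = mfcc_from_mel_energies([float(i + 1) for i in range(20)])
```
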

Step 7: Delta Energy and Delta Spectrum

The voice signal changes from frame to frame, for example in the slope of a
formant at its transitions. Therefore, there is a need to add features related
to the change in cepstral features over time. 13 delta or velocity features
(one for each of the 12 cepstral features plus energy) and 13 double-delta or
acceleration features are added, giving 39 features in total.




The energy in a frame for a signal x in a window from time sample t1 to time
sample t2 is given by the equation below:

\[
\text{Energy} = \sum_{t=t_1}^{t_2} x^2[t]
\]
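
The frame energy and the delta features can be sketched as follows. The text does not give a delta formula, so the central difference d[t] = (c[t+1] - c[t-1]) / 2 with repeated edge frames is an assumption made for illustration. The function names and the toy feature values are hypothetical.

```python
def frame_energy(x, t1, t2):
    """Sum of x[t]**2 for t from t1 to t2 inclusive."""
    return sum(s * s for s in x[t1:t2 + 1])


def delta(features):
    """Central difference over time; the first and last frames are repeated at the edges."""
    T = len(features)
    out = []
    for t in range(T):
        prev = features[max(t - 1, 0)]
        nxt = features[min(t + 1, T - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


def add_deltas(static):
    """Append velocity and acceleration features to each static feature vector."""
    velocity = delta(static)
    acceleration = delta(velocity)
    return [s + v + a for s, v, a in zip(static, velocity, acceleration)]


# Five frames of 13 static features (12 cepstral features plus energy), toy values.
static = [[float(t)] * 13 for t in range(5)]
full = add_deltas(static)   # each frame now holds 13 + 13 + 13 = 39 features
```
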

Procedure for forming MFCC

4.4 VECTOR QUANTISATION

Vector quantization (VQ) is a lossy data compression method based on
the principle of block coding. It is a fixed-to-fixed length algorithm. In
earlier days, the design of a vector quantizer (VQ) was considered to be a
challenging problem due to the need for multi-dimensional integration. In 1980,
Linde, Buzo, and Gray (LBG) proposed a VQ design algorithm based on a
training sequence.

The main advantage of VQ in pattern recognition is its low computational
burden when compared with other techniques such as dynamic time
warping (DTW) and hidden Markov models (HMM). A VQ is nothing more




than an approximator. The idea is similar to that of "rounding-off" (say to the
nearest integer). An example of a 1-dimensional VQ is shown below:

Here, every number less than -2 is approximated by -3. Every number between
-2 and 0 is approximated by -1. Every number between 0 and 2 is
approximated by +1. Every number greater than 2 is approximated by +3. Note
that the approximate values are uniquely represented by 2 bits. This is a
1-dimensional, 2-bit VQ. It has a rate of 2 bits/dimension.
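
The 1-dimensional example can be written out directly. In this sketch the names `CODEBOOK` and `quantize_1d` are hypothetical, and assigning a value lying exactly on a boundary (-2, 0 or 2) to the upper region is an assumption, since the text leaves those cases open.

```python
CODEBOOK = [-3, -1, 1, 3]   # the four approximating values, indexable with 2 bits


def quantize_1d(x):
    """Return (index, approximated value) using the boundaries -2, 0 and 2."""
    if x < -2:
        index = 0
    elif x < 0:
        index = 1
    elif x < 2:
        index = 2
    else:
        index = 3
    return index, CODEBOOK[index]


index, value = quantize_1d(-0.3)
bits = format(index, "02b")   # the 2-bit code that would be stored or transmitted
```

Only the index needs to be stored; the receiver recovers the approximated value by looking the index up in the same codebook.
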

An example of a 2-dimensional VQ is shown below:

Here, every pair of numbers falling in a particular region is approximated by
the red star associated with that region. Note that there are 16 regions and
16 red stars, each of which can be uniquely represented by 4 bits. Thus, this
is a 2-dimensional, 4-bit VQ. Its rate is also 2 bits/dimension. In the above
two examples, the red stars are called code vectors and the regions defined by the




blue borders are called encoding regions. The set of all code vectors is called
the codebook and the set of all encoding regions is called the partition of the
space. The performance of a VQ is typically given in terms of the
signal-to-distortion ratio (SDR):

\[
\mathrm{SDR} = 10 \log_{10} \frac{\sigma^2}{D_{ave}} \quad \text{(in dB)}
\]

where \(\sigma^2\) is the variance of the source and \(D_{ave}\) is the average
squared-error distortion. The higher the SDR, the better the performance.
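
A 2-dimensional encoder and the SDR computation might look like the sketch below. The codebook here is a regular 4 x 4 grid chosen purely for illustration (the original figure is not reproduced in this text), nearest-neighbour encoding by squared Euclidean distance is assumed, and both the variance and the distortion are averaged per component, which is also an assumption. All names are hypothetical.

```python
import math


def nearest_code_vector(point, codebook):
    """Index of the code vector with the smallest squared Euclidean distance to point."""
    return min(range(len(codebook)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, codebook[i])))


def sdr_db(source, codebook):
    """SDR = 10 * log10(variance / average squared error), in dB."""
    count = len(source) * len(source[0])
    mean = sum(v for point in source for v in point) / count
    variance = sum((v - mean) ** 2 for point in source for v in point) / count
    d_ave = sum((v - c) ** 2
                for point in source
                for v, c in zip(point, codebook[nearest_code_vector(point, codebook)])
                ) / count
    if d_ave == 0.0:
        return math.inf   # perfect reconstruction
    return 10.0 * math.log10(variance / d_ave)


# Hypothetical 16-entry codebook: each index fits in 4 bits, i.e. 2 bits/dimension.
codebook = [(x, y) for x in (-3, -1, 1, 3) for y in (-3, -1, 1, 3)]
index = nearest_code_vector((0.9, -2.7), codebook)
quality = sdr_db([(-2.5, 2.5), (2.5, -2.5)], codebook)
```
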

In verification systems two key performance measures are popular: the false
rejection rate (FRR), the rate at which the true speaker is incorrectly rejected,
and the false acceptance rate (FAR), the rate at which an impostor speaker is
incorrectly accepted. By varying the decision threshold, the FAR and FRR
change in opposing directions. For example, raising the threshold lowers the
FAR but increases the FRR, as true claims start to be rejected. Conversely,
lowering the threshold reduces the FRR but increases the FAR, since more
false claims are accepted along with the true ones. The typical operating
point for the selection of the threshold is where FAR = FRR, termed the equal
error rate (EER) condition.
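
The trade-off between FAR and FRR can be illustrated with a small sketch. It assumes that each verification attempt yields a similarity score where higher means a better match, and that a claim is accepted when the score reaches the threshold; with a distance measure such as VQ distortion the comparison would be reversed. The scores below are invented for illustration and are not measurements from this project.

```python
def far_frr(genuine_scores, impostor_scores, threshold):
    """Return (FAR, FRR) when claims scoring >= threshold are accepted."""
    frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
    far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    return far, frr


def equal_error_rate(genuine_scores, impostor_scores):
    """Scan the observed scores as thresholds; pick the one where FAR and FRR are closest."""
    candidates = sorted(set(genuine_scores) | set(impostor_scores))

    def gap(threshold):
        far, frr = far_frr(genuine_scores, impostor_scores, threshold)
        return abs(far - frr)

    best = min(candidates, key=gap)
    far, frr = far_frr(genuine_scores, impostor_scores, best)
    return best, (far + frr) / 2.0


genuine = [0.9, 0.8, 0.7, 0.4]     # hypothetical scores of true speakers
impostor = [0.1, 0.3, 0.5, 0.6]    # hypothetical scores of impostors
threshold, eer = equal_error_rate(genuine, impostor)
```
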

4.5 NEED FOR MATLAB

MATLAB is a high-level language and interactive environment for
numerical computation, visualization, and programming. Using MATLAB, we
can analyze data, develop algorithms, and create models and applications. The
language, tools, and built-in math functions enable us to explore multiple
approaches and reach a solution faster than with spreadsheets or traditional
programming languages, such as C/C++ or Java. MATLAB has a range of
applications, including signal processing and communications, image and video
processing, control systems, test and measurement, computational finance, and
computational biology. Hence we prefer MATLAB as our software tool.

More than a million engineers and scientists in industry and academia use

MATLAB, the language of technical computing. MATLAB has built-in




mathematical functions to solve science and engineering
problems. MATLAB (matrix laboratory) is a numerical computing environment
and fourth-generation programming language. Developed by MathWorks,
MATLAB allows matrix manipulations, plotting of functions and data,
implementation of algorithms, creation of user interfaces, and interfacing with
programs written in other languages, including C, C++, Java, and Fortran.
(Cleve Moler, then chairman of a computer science department, started developing
MATLAB in the late 1970s. Jack Little recognized its commercial potential and
joined with Moler and Steve Bangert. They rewrote MATLAB in C and founded
MathWorks in 1984.)



CHAPTER 5

CONCLUSION

From the comparative study of various biometric systems we came to the
conclusion that voice recognition is best suited to our application. The Mel
filter bank is best suited to feature extraction. The vector quantization
technique is preferred for feature matching in implementing the voice
detected attendance system.
