Speech recognition
8/12/2019 speach recognition
CONTENTS
LIST OF FIGURES
INTRODUCTION
CHAPTER 1: ABSTRACT
1.1 GENERAL
1.2 PRINCIPLES OF SPEECH RECOGNITION
CHAPTER 2: LITERATURE REVIEW
CHAPTER 3: PROPOSED SYSTEM
3.1 DRAWBACKS OF EXISTING SYSTEMS
3.1.1 FACE RECOGNITION
3.1.2 FINGERPRINT
3.1.3 IRIS RECOGNITION
3.1.4 PALM RECOGNITION
3.1.5 HAND GEOMETRY
3.1.6 SIGNATURE RECOGNITION
3.2 COMPARATIVE STUDY WITH EXISTING SYSTEMS
3.3 ADVANTAGES OF VoDAR
CHAPTER 4: BLOCK DIAGRAM
4.1 HUMAN VOICE GENERATION
4.2 SPEECH RECOGNITION
4.3 FEATURE EXTRACTION
4.4 VECTOR QUANTIZATION
4.5 NEED FOR MATLAB
CHAPTER 5: CONCLUSION
Department of ECE,Thejus Engg., College,Thrissur Page 2
INTRODUCTION
VoDAR (Voice Detection based Attendance Register) is a system that registers
the attendance of each student with high accuracy. Personal identification is
performed using the student's voice, so the defects of the ordinary
attendance-taking practice can be reduced to a large extent. The voice
characteristics of every person differ in one way or another; the vocal
characteristics of two persons are never entirely alike. This is why voice is
chosen as the basis for personal identification.
Compared to the conventional manual attendance marking system, VoDAR
saves a lot of time in registering attendance. Further, the method is based on
the speaker's voice recorded in real time, which makes it robust against
malpractice. The automated creation of the attendance record in the form of an
Excel spreadsheet makes it convenient to use, access, archive, and print the
attendance using the classroom computer, a department or college server, or any
computer connected to the college intranet. A practical implementation would
need complicated circuitry at very high expense, so this project builds a
theoretical model rather than a practical one. VoDAR is a software based
model. The process of voice recognition in this project is done using
MATLAB software.
CHAPTER 1
ABSTRACT
1.1 GENERAL
The current practice of taking attendance in a lecture class is simply for each
student to call out a roll number and for the teacher to mark it. This is a
time consuming process and its accuracy is low. It is found that students who
are not present in the classroom sometimes get attendance through malpractice,
and there is great scope for it. It is also complex to enter and calculate each
student's overall attendance, and a student may sometimes fail to get
attendance even though he/she is present in the class. These limitations are
very difficult to avoid, even if the process is followed carefully.
Attendance carries a 10% weight in determining the internals of every student.
Malpractice in taking attendance will therefore affect the internals of
the students. The idea for this project developed while we were thinking about
a scientific way to register attendance, so that the defects in the ordinary
attendance-taking practice can be reduced to a large extent. The physical
biometric characteristics that are generally used are the face, iris,
fingerprints, palm prints and hand geometry; the behavioral characteristics
include signature and voice pattern.
Biometrics is the science or technology which analyses and measures
biological data. In computer science it refers to the science or technology
that measures and analyzes physical or behavioral characteristics of a person
for authentication.
Voice recognition or speaker recognition systems extract features from
speech (here using MATLAB) and model them for use in recognition. These systems
use the acoustic features present in speech, which are unique to each
individual. These acoustic patterns depend on the physical characteristics of
the individual (e.g. the size of the mouth and throat) as well as behavioral
characteristics like speaking style and voice pitch.
Everyone has a distinct voice, different from all others; almost like a
fingerprint, one's voice is unique and can act as an identifier. The human voice
is composed of a multitude of different components that make each voice
different, namely:
1. Pitch
2. Tone
1.2 PRINCIPLES OF SPEECH RECOGNITION
Speech is one of the natural forms of communication. Recent
developments have made it possible to use it in security systems. In the
speaker identification task, a speech sample is used to select the identity of
the person who produced the speech from a population of speakers. In speaker
verification, the task is to use a speech sample to test whether a person who
claims to have produced the speech has in fact done so. This technique makes it
possible to use the speaker's voice to verify their identity and control access
to services such as voice dialing, banking by telephone, attendance marking,
telephone shopping, database access services, information services, voice mail,
security control for confidential information areas, and remote access to
computers.
Speaker recognition methods can be divided into text-independent and
text-dependent methods. In a text-independent system, speaker models capture
characteristics of somebody's speech which show up irrespective of what one is
saying. In a text-dependent system, on the other hand, the recognition of the
speaker's identity is based on his or her speaking one or more specific phrases,
like passwords, card numbers or PIN codes.
Every speaker recognition technology, whether identification or verification,
text-independent or text-dependent, has its own advantages and
disadvantages and may require different treatment and techniques. The choice of
which technology to use is application specific. At the highest level, all
speaker recognition systems contain two main modules: feature extraction and
feature matching.
CHAPTER 2
LITERATURE REVIEW
This chapter contains a brief overview of voice recognition using MFCC and
vector quantization. The details of the project were obtained from some of the
journals listed below.
An international journal paper on speaker identification (2011) describes
speaker recognition as the computing task of validating a user's claimed
identity using characteristics extracted from their voice. Voice recognition
combines speaker recognition and speech recognition: it uses learned aspects of
a speaker's voice to determine what is being said. Such a system cannot
recognize speech from random speakers very accurately, but it can reach high
accuracy for individual voices it has been trained with, which gives us various
applications in day-to-day life. This paper introduced us to various methods of
speaker identification involving LPC and MFCC feature extraction. Linear
prediction is a mathematical operation which provides an estimate of the
current sample of a discrete signal as a linear combination of several previous
samples. The prediction error, i.e. the difference between the predicted and
actual value, is called the residual. Feature extraction is implemented using
this idea, whereas in MFCC the log of the signal energy is calculated.
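The linear prediction idea described above can be sketched in a few lines. This is only an illustration in Python (the report itself uses MATLAB); the function name `lp_residual` and the example coefficients are assumptions chosen for the demonstration, and estimating the coefficients themselves is not shown.

```python
def lp_residual(signal, coeffs):
    """Return the prediction error e[n] = s[n] - sum_k a_k * s[n-1-k].

    `coeffs` holds the predictor weights a_0 .. a_{p-1} (hypothetical values
    supplied by the caller).
    """
    order = len(coeffs)
    residual = []
    for n in range(len(signal)):
        # Predict the current sample from up to `order` previous samples;
        # samples before the start of the signal are treated as zero.
        predicted = sum(
            coeffs[k] * signal[n - 1 - k]
            for k in range(order)
            if n - 1 - k >= 0
        )
        residual.append(signal[n] - predicted)
    return residual


# A signal that doubles at every step is predicted exactly by s[n] = 2 * s[n-1],
# so only the very first sample (which has no history) leaves a residual.
samples = [1.0, 2.0, 4.0, 8.0]
errors = lp_residual(samples, [2.0])
```

A small residual means the previous samples explain the current one well, which is why the predictor coefficients can serve as a compact description of the signal.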
An international journal paper on the design of an automatic speaker
recognition system (2011) describes the concept of MFCC. MFCCs are derived
from a type of cepstral representation of the audio clip. The difference between
the cepstrum and the Mel frequency cepstrum (MFC) is that in the MFC, the
frequency bands are equally spaced on the Mel scale, which approximates the
human auditory system's response more closely than the linearly spaced frequency
bands used in the normal cepstrum. The cepstrum is a common transform used
to gain information from a person's speech signal. It can be used to separate the
excitation signal (which contains the words and the pitch) and the transfer
function (which contains the voice quality). It is the result of taking the
Fourier transform of the decibel spectrum as if it were a signal. We use
cepstral analysis in speaker identification because the speech signal has this
form (an excitation signal shaped by a transfer function), and the "cepstral
transform" makes the two components much easier to separate.
Mathematically, cepstrum of signal = FT[log{FT(the windowed signal)}]
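As a rough illustration of the formula above, the following Python sketch applies a naive discrete Fourier transform twice, with a logarithm of the magnitude in between. The helper names `dft` and `cepstrum` and the small `floor` guard against log(0) are assumptions made for this example. Many references use an inverse transform for the outer step; for the symmetric log-magnitude spectrum of a real frame the two differ only by a scale factor.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) discrete Fourier transform of a sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def cepstrum(frame, floor=1e-12):
    """Compute FT[log |FT(frame)|] for one already-windowed frame."""
    spectrum = dft(frame)
    # The floor avoids taking the logarithm of zero for empty frequency bins.
    log_magnitude = [math.log(max(abs(value), floor)) for value in spectrum]
    return dft(log_magnitude)


# A unit impulse has a flat spectrum of magnitude 1, so its log spectrum
# and therefore its cepstrum are zero everywhere.
impulse_cepstrum = cepstrum([1.0, 0.0, 0.0, 0.0])
```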
MFCCs are commonly calculated by first taking the Fourier transform of a
windowed excerpt of a signal and mapping the powers of the spectrum obtained
onto the Mel scale, using triangular overlapping windows. Next the logs
of the powers at each of the Mel frequencies are taken, and the Discrete Cosine
Transform is applied to them (as if they were a signal). The MFCCs are the
amplitudes of the resulting spectrum. The speech input is typically recorded at
a sampling rate above 10000 Hz. This sampling frequency is chosen to minimize
the effects of aliasing in the analog-to-digital conversion. These sampled
signals can capture all frequencies up to 5 kHz, which cover most of the energy
of sounds that are generated by humans. The main purpose of the MFCC processor
is to mimic the behavior of the human ear. In addition, MFCCs are reported to be
less susceptible to variations in the speech signal. This paper helped us to
gain knowledge about feature extraction using MFCC.
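The report does not give a formula for the Mel scale. A commonly used mapping is mel = 2595 log10(1 + f/700), and the Python sketch below assumes it. The function names and the choice of 10 bands between 0 and 5000 Hz are illustrative assumptions, not values taken from the cited paper.

```python
import math


def hz_to_mel(f_hz):
    """Convert a frequency in Hz to Mel (assumed 2595*log10(1 + f/700) form)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centres(f_low, f_high, n_bands):
    """Centre frequencies (Hz) of bands equally spaced on the Mel scale."""
    low, high = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (high - low) / (n_bands + 1)
    return [mel_to_hz(low + step * (i + 1)) for i in range(n_bands)]


# Bands up to 5 kHz, the highest frequency captured at a 10 kHz sampling rate.
centres = mel_band_centres(0.0, 5000.0, 10)
```

Because the centres are equally spaced in Mel, their spacing in Hz grows with frequency: the filters are dense at low frequencies and sparse at high ones.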
An international journal paper on vector quantization based speaker
identification (2010) describes that a speaker recognition system must be able
to estimate probability distributions of the computed feature vectors. Storing
every single vector generated in the training mode is impractical, since these
distributions are defined over a high-dimensional space. It is often easier to
start by quantizing each feature vector to one of a relatively small number of
template vectors, with a process called vector quantization (VQ). VQ is a
process of taking a large set of feature vectors and producing a smaller set of
vectors that represent the centroids of the distribution. Using the training
data, features are clustered to form a codebook for each speaker. In the
recognition stage, the data from the tested speaker is compared to the codebook
of each speaker and the difference is measured. These differences are then used
to make the recognition decision.
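A minimal Python sketch of this recognition stage follows. It assumes squared Euclidean distance as the "difference" and averages it over all test vectors; the speaker names, codebooks and feature values are made-up illustrations.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def average_distortion(features, codebook):
    """Mean distance from each feature vector to its nearest codeword."""
    total = sum(
        min(squared_distance(vector, codeword) for codeword in codebook)
        for vector in features
    )
    return total / len(features)


def identify(features, codebooks):
    """Return the enrolled speaker whose codebook gives the lowest distortion."""
    return min(
        codebooks,
        key=lambda name: average_distortion(features, codebooks[name]),
    )


# Hypothetical two-dimensional codebooks for two enrolled speakers.
codebooks = {
    "speaker_a": [[0.0, 0.0], [1.0, 1.0]],
    "speaker_b": [[5.0, 5.0], [6.0, 6.0]],
}
test_features = [[0.1, 0.0], [0.9, 1.1]]
best_match = identify(test_features, codebooks)
```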
Survey of biometric recognition systems and their applications
(Journal of Theoretical and Applied Information Technology, 2010)
describes the various biometric systems available and their peculiarities.
Human physical characteristics like fingerprints, face, voice and iris are known
as biometrics. This survey helped us to understand the various biometric systems
and to choose voice recognition as the one best suited to our application,
i.e. the attendance system. Facial recognition has disadvantages: the system is
complex and time consuming compared to voice. In addition, 2D recognition is
affected by changes in lighting, by age, and by whether the person wears
glasses. It requires camera equipment for user identification, which is costly.
Iris recognition has disadvantages such as a large storage requirement and high
cost. Fingerprint recognition can make mistakes due to dryness or dirt on the
finger's skin, as well as with age. It demands a large memory, and compression
is required (by a factor of approximately 10). Hence, out of all of these, the
most suitable for our application is voice, which is cheap and has a
verification time of only 5 seconds.
MFCC and its applications in speaker recognition (International Journal
on Emerging Technologies, 2010) describes how speech processing has emerged
as one of the important application areas of digital signal processing. Various
fields of research in speech processing are speech recognition, speaker
recognition, speech synthesis, speech coding etc. The objective of automatic
speaker recognition is to extract, characterize and recognize the information
about speaker identity. Feature extraction is the first step in speaker
recognition. Many algorithms have been suggested or developed by researchers
for feature extraction. In this work, the Mel Frequency Cepstrum Coefficient
(MFCC) feature has been used for designing a text dependent speaker
identification system. Some modifications to the existing MFCC technique
for feature extraction are also suggested to improve the speaker recognition
efficiency. From this paper we found that as the number of filters in the
filter bank increases, the efficiency also increases. Also, compared to a
rectangular window, the Hanning window gives higher efficiency.
Vector quantization based speaker identification (International
Journal of Computer Applications, 2010) describes a methodology
for speaker identification, which consists of comparing a
speech signal from an unknown speaker to a database of known speakers. The
methodology uses a feature extraction process followed by vector quantization
of the extracted features using the K-means algorithm. The K-means algorithm is
widely used in speech processing as a dynamic clustering approach. K is
pre-selected and simply refers to the number of desired clusters. In the
recognition phase an unknown speaker, represented by a sequence of feature
vectors {x1, ..., xT}, is compared with the codebooks in the database. For each
codebook a distortion measure is computed, and the speaker with the lowest
distortion is chosen. The paper reports that the VQ based clustering approach
gives a faster speaker identification process.
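The codebook training step can be sketched with a plain K-means loop. This Python example is a simplified illustration and not the cited paper's implementation: seeding the centroids with the first K training vectors and running a fixed number of iterations are assumptions made to keep the example deterministic.

```python
def kmeans_codebook(vectors, k, iterations=20):
    """Cluster training vectors into k centroids (the speaker's codebook)."""
    # Assumption: seed the centroids with the first k training vectors.
    centroids = [list(v) for v in vectors[:k]]
    for _ in range(iterations):
        # Assignment step: attach every vector to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(v, centroids[i])),
            )
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                dim = len(members[0])
                centroids[i] = [
                    sum(m[d] for m in members) / len(members) for d in range(dim)
                ]
    return centroids


# Four hypothetical 2-D feature vectors forming two obvious groups.
training = [[0.0, 0.0], [10.0, 10.0], [0.0, 1.0], [10.0, 11.0]]
codebook = kmeans_codebook(training, k=2)
```

Each speaker would get one such codebook, built from that speaker's own training features.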
CHAPTER 3
PROPOSED SYSTEM
3.1 DRAWBACKS OF EXISTING SYSTEMS
3.1.1 Face recognition attendance system:
Humans have a remarkable ability to recognize fellow beings based on facial
appearance. So, face is a natural human trait for automated biometric
recognition. Face recognition systems typically utilize the spatial relationship
among the locations of facial features such as eyes, nose, lips, chin, and the
global appearance of a face. The forensic and civilian applications of face
recognition technologies pose a number of technical challenges for static
mug-shot photograph matching (e.g., for ensuring that the same person is not
requesting multiple passports). The problems associated with illumination,
gesture, facial makeup, occlusion, and pose variations adversely affect the face
recognition performance. While face recognition is non-intrusive, robust face
recognition in non-ideal situations continues to pose challenges.
3.1.2 Fingerprint recognition attendance system:
Fingerprint-based recognition has been the longest serving and a popular method
for person identification. Fingerprints consist of a regular texture pattern
composed of ridges and valleys. These ridges are characterized by several
landmark points, known as minutiae, which are mostly in the form of ridge
endings and ridge bifurcations. The spatial distribution of these minutiae points
is claimed to be unique to each finger; it is the collection of minutiae points in a
fingerprint that is primarily employed for matching two fingerprints. In addition
to minutiae points, there are sweat pores and other details (referred to as
extended or level 3 features) which can be acquired in high resolution (1000 ppi)
fingerprint images. However, there are some disadvantages in this system. If the
surface of the finger gets damaged and/or has one or more marks on it,
identification becomes increasingly hard. Furthermore, the system requires the
user's finger surface to have minutiae points or a pattern in order to obtain
matching images. This is a limiting factor for the security of the
algorithm.
3.1.3 Iris recognition attendance system:
The iris is the colored annular ring that surrounds the pupil. Iris images
acquired under infrared illumination consist of a complex texture pattern with
numerous individual attributes, e.g. stripes, pits, and furrows, which allow for
highly reliable personal identification. The iris is a protected internal organ
whose texture is stable and distinctive, even among identical twins (similar to
fingerprints), and extremely difficult to surgically spoof. However, relatively
high sensor cost, along with the relatively large failure to enroll (FTE) rate
reported in some studies, and the lack of legacy iris databases may limit its
usage in some large-scale government applications.
3.1.4 Palm print recognition attendance system:
The image of a human palm consists of palmar friction ridges and flexion
creases. Similar to fingerprints, latent palm print systems utilize minutiae and
creases for matching. Based on the success of fingerprints in civilian
applications, some attempts have been made to utilize low resolution palm print
images for access control applications. These systems utilize texture features
which are quite similar to those employed for iris recognition. Palm print
recognition systems have not yet been deployed for civilian applications (e.g.,
access control), mainly due to their large physical size and the fact that
fingerprint identification based on compact and embedded sensors works quite
well for such applications.
3.1.5 Hand Geometry recognition attendance system:
It is claimed that individuals can be discriminated based on the shape of their
hands. Person identification using hand geometry utilizes low resolution hand
images to extract a number of geometrical features such as finger length, width,
thickness, perimeter, and finger area. The discriminatory power of these features
is quite limited, and therefore hand geometry systems are employed only for
verification applications in low security access control and time-and-attendance
applications. Hand geometry systems have large physical size, so they cannot be
easily embedded in existing security systems.
3.1.6 Signature recognition attendance system:
Signature is a behavioral biometric modality that is used in daily business
transactions (e.g., credit card purchase). However, attempts to develop highly
accurate signature recognition systems have not been successful. This is
primarily due to the large intra-class variations in a person's signature over
time. Attempts have been made to improve the signature recognition
performance by capturing dynamic or online signatures that require a
pressure-sensitive pen-pad. Dynamic signatures help in acquiring the shape,
speed, acceleration, pen pressure, order and speed of strokes, during the actual
act of signing. This additional information seems to improve the verification
performance (over static signatures) as well as circumvent signature forgeries.
Still, very few automatic signature verification systems have been deployed.
3.2 COMPARATIVE STUDY WITH EXISTING SYSTEMS
Features           Eye-Iris   Eye-Retina  Fingerprint                   Signature                   Voice
Reliability        Very High  Very High   High                          High                        High
Easiness of use    Average    Low         High                          High                        High
Social acceptance  Low        Low         Medium                        High                        High
Interference       Glasses    Irritation  Dirtiness, injury, roughness  Changeable, easy signature  Noise
Cost               High       High        Medium                        Low                         Low
Device required    Camera     Camera      Scanner                       Optic pen, touch panel      Microphone
We can conclude from the above table that voice recognition is comparatively
less costly and sufficiently accurate. The device required for voice recognition
is easily available and low in cost compared to other systems. It is also
socially acceptable.
3.3 ADVANTAGES OF VoDAR
3.3.1 Ability to Use Technology remotely
One of the main advantages of voice verification technology is the ability
to use it remotely. Many other types of biometrics, such as fingerprints,
retina biometrics or iris biometrics, cannot be used remotely.
Speech recognition technology is easy to use over the phone or other speaking
devices, which increases its usefulness to many companies. The ability to use
it remotely makes it stand out among the many other types of biometric
technology available today.
3.3.2 Low Cost of Using It
The low cost of this technology is another of the advantages of voice
recognition. The price of acquiring a voice recognition system is usually quite
reasonable, especially when compared to the price of other biometric
systems. These systems are relatively low cost to implement and maintain, and
the equipment needed is low priced as well. Very little equipment is needed for
these systems, making them a cost effective option for businesses that are
carefully watching their bottom line.
In many cases, all that is required for these systems to function is the
right biometric software, if the technology is being used remotely over
the phone. The phone acts as the speaking device, so there is no investment in
such a device. For systems being used for authentication and verification on
site, businesses only have to purchase a device that users can speak
into, along with the speech recognition software.
3.3.3 High Reliability Rate
Another of the heralded advantages of voice recognition is this technology's
high reliability rate. Some 10 to 20 years ago, the reliability rate of speech
recognition technology was actually quite low.
Several problems reduced reliability, such as
the inability to deal with background noise or the inability to recognize voices
when an individual had a slight cold. However, these problems have largely been
dealt with today, giving this biometric technology a very high reliability
rate. Vocal prints can now be used to identify an individual, even if their
speech sounds a bit different due to a cold.
These systems are also designed to ignore background noise and focus on the
voice, which has given the reliability rate a further boost.
3.3.4 Ease of Use and Implementation
Many companies appreciate the ease of use and implementation that
comes with voice recognition biometrics. Some biometric technologies can be
difficult to implement in a company and difficult to begin using. Since voice
systems require so little equipment, they can usually be implemented without
the addition of new equipment and systems. Since they are so easy to use,
companies can often free up personnel and make use of them elsewhere in
the company to improve performance and customer satisfaction.
3.3.5 Minimally Invasive
One of the major advantages of voice recognition is that it is minimally
invasive. This is very important to individuals that use these security
devices. Many consumers today do not like many forms of biometric technology,
since other forms seem so invasive.
Speech technology only requires individuals to
speak and offer a vocal sample, which is minimally invasive. Since this
technology has a high approval rate among consumers, it can help businesses
keep their customers happy with the service they are providing.
CHAPTER 4
BLOCK DIAGRAM
The main aim of this project is speaker identification, which consists of
comparing a speech signal from an unknown speaker to a database of known
speakers. Once it has been trained with a number of speakers, the system can
recognize the speaker. The figure above shows the fundamental structure of
speaker identification and verification systems. Speaker identification
is the process of determining which registered speaker provides a given speech
sample. Speaker verification, on the other hand, is the process of accepting or
rejecting the identity claim of a speaker. In most applications, voice is
used as the key to confirm the identity of a speaker, which is known as speaker
verification. The system consists of a microphone connected to a computer
system. The voice input of each student is recorded via the microphone, and each
input is analyzed by the system using MATLAB, the software
tool of our project. First we store some reference voice signal waveforms
with the help of the microphone and the computer. These stored speech
signals are called corpus sentences. With the help of MATLAB, these
waveforms are analyzed and each speech signal is converted into vector form.
The input voice signals from the students are then also converted into
vector form. After comparing this vector with the corpus sentences, the
most similar corpus sentence is determined. Thus speaker identification is
carried out and the corresponding attendance is marked.
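The final step, marking attendance once speakers have been identified, might look like the Python sketch below. The roll list, the "P"/"A" status codes, and the use of CSV text (which a spreadsheet program can open) are all assumptions for illustration; the report itself describes an Excel spreadsheet generated from MATLAB.

```python
import csv
import io


def attendance_rows(roll, identified):
    """Mark each student on the roll present ("P") or absent ("A")."""
    present = set(identified)
    return [(name, "P" if name in present else "A") for name in roll]


def to_csv(rows):
    """Render the register as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Student", "Status"])
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical class roll and the speakers identified during one lecture.
roll = ["student_01", "student_02", "student_03"]
register = attendance_rows(roll, identified=["student_03", "student_01"])
register_csv = to_csv(register)
```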
4.1 HUMAN VOICE GENERATION
Consider the anatomy and physiology of the voice by following the voice from
the lungs to the lips. The breath stream, referred to as the "generator" of the
voice, originates in the lungs. This generator provides a controlled flow of air
which powers the vocal folds by setting them into motion.
The human larynx has three vital functions:
1. Airway protection (prevention of aspiration)
2. Respiration (breathing)
3. Phonation (talking)
When we speak, the vocal folds approximate and vibrate to produce voice.
When we breathe, the vocal folds open (abduct) and allow air to flow from the
lungs through the mouth and nose and vice versa. When we eat, we reflexively
stop breathing and the vocal folds approximate to protect the airway and keep
food and drink out of the lungs. The speech signal is given by:
[Figure: speech signal waveform; axis label: amplitude]
The vocal folds do not operate like strings on a violin but actually are more
comparable to vibrating lips "buzzing". The three-dimensional cavity, or
"resonator", provides sound modification. The articulators (the parts of
the vocal tract above the larynx, consisting of tongue, palate, cheek, lips,
etc.) articulate and filter the sound emanating from the larynx and to some
degree can interact with the laryngeal airflow to strengthen or weaken it as a
sound source. Adult men and women have different sizes of vocal fold,
reflecting the male-female differences in larynx size. Adult males usually have
lower-pitched voices and larger folds. The male vocal folds (which would
be measured vertically in the opposite diagram) are between 17 mm and 25 mm
in length. The female vocal folds are between 12.5 mm and 17.5 mm in length.
The difference in vocal fold size between men and women means that they
have differently pitched voices. Additionally, genetics causes variation
within the same sex, with men's and women's singing voices being
categorized into types.
4.2 SPEECH RECOGNITION
The structure of a typical speech recognition system mainly consists of
feature extraction, training and recognition. Because the speech signal is
unstable, feature extraction becomes very difficult. Different words have
different features. For each word there are differences
among different persons, such as the differences between adults and children,
or male and female. Even for the same person and the same word, there are
changes over time. Nowadays, there are several feature extraction
methods used in speech recognition systems. All of them perform well
in clean conditions. In adverse conditions, no good approach to speech
recognition has yet been found.
Compared with these methods, the human auditory system performs well
under both clean and noisy conditions. One way to address this is therefore to
study our auditory system and use the results in the speech recognition systems
being developed. There are two major approaches available for feature
extraction: modelling the human voice production system and modelling the
perception system. For the first approach, one of
the most popular features is the LPC (Linear Prediction Coefficient) feature. For
the second approach, the most popular feature is the MFCC (Mel-Frequency
Cepstrum Coefficient) feature. The main advantage of MFCC is that it uses Mel
frequency scaling, which closely approximates the human auditory system.
Hence it is more effective than LPC.
A. Definition of speech recognition:
Speech recognition (also known as Automatic Speech Recognition (ASR), or
computer speech recognition) is the process of converting a speech signal to a
sequence of words, by means of an algorithm implemented as a computer
program.
B. Types of Speech Recognition:
Speech recognition systems can be separated into several classes by
describing what types of utterances they have the ability to recognize. These
classes are as follows:
Isolated Words:
Isolated word recognizers usually require each utterance to have quiet (lack of
an audio signal) on both sides of the sample window. They accept single words or
single utterances at a time. These systems have "Listen/Not-Listen" states,
where they require the speaker to wait between utterances (usually doing
processing during the pauses). Isolated Utterance might be a better name for
this class.
Connected Words:
Connected word systems (or more correctly 'connected utterances') are similar
to isolated words, but allow separate utterances to be 'run together' with a
minimal pause between them.
Continuous Speech:
Continuous speech recognizers allow users to speak almost naturally, while the
computer determines the content. (Basically, it is computer dictation.)
Recognizers with continuous speech capabilities are some of the most difficult
to create because they utilize special methods to determine utterance
boundaries.
Spontaneous Speech:
At a basic level, it can be thought of as speech that is natural sounding and
not rehearsed. An ASR system with spontaneous speech ability should be able to
handle a variety of natural speech features such as words being run together,
"ums" and "ahs", and even slight stutters.
4.3 FEATURE EXTRACTION
Step 1: Pre-emphasis
In this step the signal is passed through a filter that emphasizes higher
frequencies. This increases the energy of the signal at higher frequencies.
Y[n] = X[n] - a X[n-1]
With a = 0.95, 95% of any one sample is presumed to originate from the
previous sample.
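The pre-emphasis filter above can be sketched in a few lines. This is a minimal illustration in Python rather than the project's own code; the function name pre_emphasis and the choice to treat the sample before the first one as zero are assumptions made for the example.

```python
def pre_emphasis(x, a=0.95):
    """Apply y[n] = x[n] - a * x[n-1] to a list of samples.

    Assumption: the sample before the first one is taken to be 0.
    """
    y = []
    prev = 0.0
    for sample in x:
        y.append(sample - a * prev)
        prev = sample
    return y


# A constant (zero-frequency) signal is almost cancelled after the first
# sample, which shows how low frequencies are attenuated.
signal = [1.0, 1.0, 1.0, 1.0]
emphasized = pre_emphasis(signal)
```

A rapidly alternating input such as [1, -1, 1, -1] is amplified instead, which is the high-frequency emphasis described in the text.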
Step 2: Framing
Framing is the process of segmenting the speech samples obtained from
analog-to-digital conversion (ADC) into small frames with a length in the range
of 20 to 40 msec. The voice signal is divided into frames of N samples.
Adjacent frames are separated by M samples (M < N).
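As a rough sketch of the framing step, the Python function below cuts a list of samples into frames of N samples whose start points are M samples apart. The names frame_signal, frame_len (N) and hop (M) are hypothetical, and dropping the incomplete frame at the end of the signal is an assumption of this example, not something the report specifies.

```python
def frame_signal(samples, frame_len, hop):
    """Split samples into frames of frame_len (N) samples, hop (M) apart.

    Assumption: trailing samples that do not fill a whole frame are dropped.
    """
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop
    return frames


# Ten samples, N = 4, M = 2: consecutive frames share two samples.
frames = frame_signal(list(range(10)), frame_len=4, hop=2)
```

When the hop is smaller than the frame length, consecutive frames overlap, so no part of the signal falls only at a frame edge.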
Step 3: Windowing
Traditional methods of spectral evaluation are reliable only for a stationary
signal, but the nature of a speech signal changes continuously with time; for
voice, reliability can be ensured only over a short time. The audio signal is
also continuous, so processing cannot wait for the last sample, and the
processing complexity increases exponentially. It is therefore important to
retain short-term features. Short-time analysis is performed by windowing the
signal. Normally a Hamming window is used. The Hamming function is given by
Hamming window:
$$W(n) = \begin{cases} 0.54 - 0.46\cos\left(\dfrac{2\pi n}{L-1}\right), & 0 \le n \le L-1 \\ 0, & \text{otherwise} \end{cases}$$
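The Hamming window can be generated and applied to a frame as sketched below. This is an illustrative Python version using only the standard library; the function names hamming and apply_window are assumptions, and the special case for a window of length 1 is added only to avoid a division by zero.

```python
import math


def hamming(L):
    """Return the L-point Hamming window 0.54 - 0.46*cos(2*pi*n/(L-1))."""
    if L == 1:
        return [1.0]  # assumption: a one-sample window is left unchanged
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (L - 1))
            for n in range(L)]


def apply_window(frame):
    """Multiply each sample of a frame by the matching window value."""
    window = hamming(len(frame))
    return [s * w for s, w in zip(frame, window)]


window = hamming(5)
windowed = apply_window([1.0, 1.0, 1.0, 1.0, 1.0])
```

The window tapers the frame towards its ends (0.08 at the edges, 1 at the centre), which reduces the discontinuities introduced by cutting the signal into frames.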
Step 4: Fast Fourier Transform
This step converts each frame of N samples from the time domain into the
frequency domain. In the time domain the speech signal is the convolution of
the glottal pulse U[n] and the vocal tract impulse response H[n]; the Fourier
transform converts this convolution into a multiplication in the frequency
domain, as expressed by the equation below:
Y(w) = FFT[H(t) * X(t)] = H(w) · X(w)
where * denotes convolution and X(w), H(w) and Y(w) are the Fourier transforms
of X(t), H(t) and Y(t) respectively.
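The relation between convolution in time and multiplication in frequency can be checked numerically. The Python sketch below uses a direct discrete Fourier transform for clarity rather than a fast algorithm, and the short sequences h and x are made-up values; both sequences are zero-padded to the length of the convolution result so that the product of the transforms matches the linear convolution.

```python
import cmath


def dft(x):
    """Direct O(N^2) discrete Fourier transform (not a fast algorithm)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                for n in range(N))
            for k in range(N)]


def convolve(h, x):
    """Linear convolution of two finite sequences."""
    y = [0.0] * (len(h) + len(x) - 1)
    for i, hi in enumerate(h):
        for j, xj in enumerate(x):
            y[i + j] += hi * xj
    return y


def zero_pad(x, N):
    """Extend x with zeros to length N."""
    return list(x) + [0.0] * (N - len(x))


h = [1.0, 0.5]        # hypothetical impulse response
x = [1.0, 2.0, 3.0]   # hypothetical excitation
y = convolve(h, x)
N = len(y)

Y = dft(y)
HX = [a * b for a, b in zip(dft(zero_pad(h, N)), dft(zero_pad(x, N)))]
max_error = max(abs(a - b) for a, b in zip(Y, HX))
```

The value of max_error is at the level of floating-point rounding, so the transform of the convolution and the product of the transforms agree.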
Step 5: Mel Filter Bank Processing
The frequency range of the FFT spectrum is very wide, and the voice signal does
not follow a linear scale. A bank of filters spaced according to the Mel scale,
as shown in figure 4, is therefore applied. The figure shows a set of
triangular filters that are used to compute a weighted sum of the spectral
components so that the output of the process approximates a Mel scale. Each
filter's magnitude frequency response is triangular in shape; it is equal to
unity at the centre frequency and decreases linearly to zero at the centre
frequencies of the two adjacent filters.
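A possible construction of such a triangular filter bank is sketched below in Python. The report does not give the Hz-to-Mel conversion, so the commonly used formula mel = 2595 log10(1 + f/700) is assumed here; the 0 to 4000 Hz range, the choice of 10 filters and all function names are likewise assumptions for illustration.

```python
import math


def hz_to_mel(f):
    """Commonly used Hz-to-Mel formula (assumed; not given in the report)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_points(f_low, f_high, n_filters):
    """Return n_filters + 2 frequencies in Hz, equally spaced in Mel."""
    lo, hi = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_filters + 2)]


def triangular_weight(f, left, centre, right):
    """Unity at centre, falling linearly to zero at left and right."""
    if left < f <= centre:
        return (f - left) / (centre - left)
    if centre < f < right:
        return (right - f) / (right - centre)
    return 0.0


def filter_bank_energies(power_spectrum, bin_freqs, points):
    """Weighted sum of spectral components for each triangular filter."""
    energies = []
    for i in range(1, len(points) - 1):
        left, centre, right = points[i - 1], points[i], points[i + 1]
        energies.append(sum(p * triangular_weight(f, left, centre, right)
                            for p, f in zip(power_spectrum, bin_freqs)))
    return energies


points = mel_points(0.0, 4000.0, 10)
```

Each filter uses its neighbours' centre frequencies as its own lower and upper edges, so the filters are narrow at low frequencies and become wider as frequency increases.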
Step 6: Discrete Cosine Transform
In this step the log Mel spectrum is converted back into the time domain using
the Discrete Cosine Transform (DCT). The result of the conversion is called the
Mel Frequency Cepstrum Coefficients. The set of coefficients is called an
acoustic vector. Therefore, each input utterance is transformed into a sequence
of acoustic vectors.
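The conversion from filter bank outputs to cepstral coefficients can be sketched as a logarithm followed by a DCT. The Python example below uses an unnormalised DCT-II, which is one common choice and an assumption here; the function names and the sample energies are hypothetical, and the energies must be positive for the logarithm to be defined.

```python
import math


def dct_ii(values):
    """Unnormalised DCT-II of a list of real values."""
    N = len(values)
    return [sum(v * math.cos(math.pi * k * (n + 0.5) / N)
                for n, v in enumerate(values))
            for k in range(N)]


def mfcc_from_filter_energies(energies, n_coeffs):
    """Take the log of each Mel filter energy, then keep the first
    n_coeffs DCT coefficients as the acoustic vector."""
    log_energies = [math.log(e) for e in energies]
    return dct_ii(log_energies)[:n_coeffs]


# A flat log spectrum has all of its content in coefficient 0.
coeffs = mfcc_from_filter_energies([math.e] * 8, 4)
```

Keeping only the first few coefficients retains the smooth spectral envelope and discards fine detail, which is why the acoustic vector is much shorter than the spectrum.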
Step 7: Delta Energy and Delta Spectrum
The voice signal changes from frame to frame, for example in the slope of a
formant at its transitions. Therefore, there is a need to add features related
to the change in cepstral features over time. Thirteen delta or velocity
features (one for each of the 12 cepstral features plus energy) and thirteen
double delta or acceleration features are added, giving 39 features in total.
The energy in a frame for a signal x in a window from time sample t1 to time
sample t2 is given by the equation below:
$$\text{Energy} = \sum_{t=t_1}^{t_2} x^2[t]$$
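The frame energy and the delta features of Step 7 can be illustrated as follows. The report does not state how the deltas are computed, so this Python sketch assumes a simple central difference between the neighbouring frames, with the first and last frames repeated at the edges; all function names are hypothetical.

```python
def frame_energy(frame):
    """Sum of squared samples in one frame."""
    return sum(s * s for s in frame)


def delta(features):
    """Central difference over time for a list of per-frame vectors.

    Assumption: edge frames are repeated so every frame has two neighbours.
    """
    T = len(features)
    out = []
    for t in range(T):
        prev = features[max(t - 1, 0)]
        nxt = features[min(t + 1, T - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


def add_deltas(features):
    """Append delta and double delta features to each frame's vector."""
    d1 = delta(features)
    d2 = delta(d1)
    return [f + v + a for f, v, a in zip(features, d1, d2)]


# Five frames of 13 static features (12 cepstral values plus energy).
static = [[float(t)] * 13 for t in range(5)]
full = add_deltas(static)
```

Each frame in full holds 13 static, 13 delta and 13 double delta values, matching the 39 features mentioned in the text.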
Procedure for forming MFCC
4.4 VECTOR QUANTIZATION
Vector quantization (VQ) is a lossy data compression method based on
the principle of block coding. It is a fixed-to-fixed length algorithm. In
earlier days, the design of a vector quantizer (VQ) was considered to be a
challenging problem due to the need for multi-dimensional integration. In 1980,
Linde, Buzo, and Gray (LBG) proposed a VQ design algorithm based on a
training sequence.
The main advantage of VQ in pattern recognition is its low computational
burden when compared with other techniques such as dynamic time
warping (DTW) and hidden Markov models (HMM). A VQ is nothing more
than an approximator. The idea is similar to that of "rounding off" (say, to
the nearest integer). An example of a 1-dimensional VQ is shown below:
Here, every number less than -2 is approximated by -3. Every number between
-2 and 0 is approximated by -1. Every number between 0 and 2 is
approximated by +1. Every number greater than 2 is approximated by +3. Note
that the approximate values are uniquely represented by 2 bits. This is a
1-dimensional, 2-bit VQ. It has a rate of 2 bits/dimension.
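The 1-dimensional example can be written directly as code. In this Python sketch the four code values are stored in a list and the quantizer returns an index from 0 to 3, which fits in 2 bits. The names CODEBOOK and quantize_1d are hypothetical, and assigning the boundary values -2, 0 and 2 to the upper region is an assumption, since the text does not say which side they belong to.

```python
CODEBOOK = [-3, -1, 1, 3]  # the four approximate values from the example


def quantize_1d(x):
    """Return the 2-bit index (0..3) of the region containing x.

    Assumption: a value exactly on a boundary goes to the upper region.
    """
    if x < -2:
        return 0
    if x < 0:
        return 1
    if x < 2:
        return 2
    return 3


index = quantize_1d(-2.5)
approx = CODEBOOK[index]
bits = format(index, "02b")
```

Only the index has to be stored or transmitted; the receiver looks up the approximate value in the same codebook.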
An example of a 2-dimensional VQ is shown below:
Here, every pair of numbers falling in a particular region is approximated by
the red star associated with that region. Note that there are 16 regions and 16
red stars, each of which can be uniquely represented by 4 bits. Thus, this is a
2-dimensional, 4-bit VQ. Its rate is also 2 bits/dimension. In the above two
examples, the red stars are called code vectors and the regions defined by the
blue borders are called encoding regions. The set of all code vectors is called
the codebook and the set of all encoding regions is called the partition of the
space. The performance of a VQ is typically given in terms of the
signal-to-distortion ratio (SDR):
$$\mathrm{SDR} = 10 \log_{10}\frac{\sigma^2}{D_{ave}} \quad \text{(in dB)}$$
where $\sigma^2$ is the variance of the source and $D_{ave}$ is the average
squared-error distortion. The higher the SDR, the better the performance.
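A small numerical illustration of a 2-dimensional VQ and its SDR is given below in Python. The 16 code vectors on a regular grid and the four sample points are made-up values, each point is mapped to the code vector with the smallest squared error, and the variance and distortion are both averaged per component; these choices and the function names are assumptions of the sketch.

```python
import math


def nearest_code_vector(point, codebook):
    """Index of the code vector with the smallest squared error."""
    return min(range(len(codebook)),
               key=lambda i: sum((p - c) ** 2
                                 for p, c in zip(point, codebook[i])))


def sdr_db(samples, codebook):
    """10*log10(variance / average squared error), per component.

    Assumption: the distortion is non-zero for the given samples.
    """
    n = len(samples) * len(samples[0])
    mean = sum(sum(s) for s in samples) / n
    variance = sum((v - mean) ** 2 for s in samples for v in s) / n
    distortion = 0.0
    for s in samples:
        c = codebook[nearest_code_vector(s, codebook)]
        distortion += sum((a - b) ** 2 for a, b in zip(s, c))
    distortion /= n
    return 10.0 * math.log10(variance / distortion)


# Hypothetical 4-bit codebook: 16 code vectors on a regular grid.
codebook = [(x, y) for x in (-3, -1, 1, 3) for y in (-3, -1, 1, 3)]
samples = [(-2.5, 2.5), (0.5, -0.5), (2.5, 0.5), (-0.5, -2.5)]

index = nearest_code_vector((0.9, -2.7), codebook)
sdr = sdr_db(samples, codebook)
```

With these values every component is 0.5 away from its code vector, so the average distortion is 0.25 against a source variance of 3.25.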
In verification systems two key performance measures are popular: the false
rejection rate (FRR), the number of times the true speaker is incorrectly
rejected, and the false acceptance rate (FAR), the number of times an impostor
speaker is incorrectly accepted. By varying the decision threshold, the FAR and
FRR change in opposing directions. For example, raising the threshold will
lower the FAR but increase the FRR, as true claims will start to be rejected
since the bar is raised. Conversely, if the threshold is lowered the FRR is
reduced but the FAR will increase, since not only are all true claims now
accepted but more false ones will be as well. The typical operating point for
the selection of the threshold is where FAR = FRR, termed the equal error rate
(EER) condition.
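The trade-off between FAR and FRR can be sketched with a few made-up match scores. In this Python example a higher score means a closer match and a claim is accepted when its score reaches the threshold; both conventions, the score values and the function names are assumptions, and the rates are expressed as fractions of the trials.

```python
def far_frr(genuine_scores, impostor_scores, threshold):
    """Return (FAR, FRR) as fractions for one decision threshold."""
    frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
    far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    return far, frr


def equal_error_point(genuine_scores, impostor_scores):
    """Try every observed score as a threshold and return the one where
    FAR and FRR are closest, together with both rates."""
    def gap(t):
        far, frr = far_frr(genuine_scores, impostor_scores, t)
        return abs(far - frr)

    candidates = sorted(set(genuine_scores) | set(impostor_scores))
    best = min(candidates, key=gap)
    far, frr = far_frr(genuine_scores, impostor_scores, best)
    return best, far, frr


# Hypothetical scores for true-speaker and impostor trials.
genuine = [0.9, 0.8, 0.7, 0.6, 0.4]
impostor = [0.1, 0.2, 0.3, 0.5, 0.65]

threshold, far, frr = equal_error_point(genuine, impostor)
```

For these scores the two rates meet at a threshold of 0.6, where one genuine trial in five is rejected and one impostor trial in five is accepted.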
4.5 NEED FOR MATLAB
MATLAB is a high-level language and interactive environment for
numerical computation, visualization, and programming. Using MATLAB, we
can analyze data, develop algorithms, and create models and applications. The
language, tools, and built-in math functions make it possible to explore
multiple approaches and reach a solution faster than with spreadsheets or
traditional programming languages, such as C/C++ or Java. MATLAB has a range of
applications, including signal processing and communications, image and video
processing, control systems, test and measurement, computational finance, and
computational biology. Hence we prefer MATLAB as our software tool.
More than a million engineers and scientists in industry and academia use
MATLAB, the language of technical computing. MATLAB has built-in
mathematical functions to solve science and engineering
problems. MATLAB (matrix laboratory) is a numerical computing environment
and fourth-generation programming language. Developed by MathWorks,
MATLAB allows matrix manipulations, plotting of functions and data,
implementation of algorithms, creation of user interfaces, and interfacing with
programs written in other languages, including C, C++, Java, and Fortran.
(Cleve Moler, the chairman of a computer science department, started developing
MATLAB in the late 1970s. Jack Little recognized its commercial potential and
joined with Moler and Steve Bangert. They rewrote MATLAB in C and founded
MathWorks in 1984.)
CHAPTER 5
CONCLUSION
From the comparison study of various biometric systems we came to the
conclusion that voice recognition is best suited for our application. The Mel
filter is best suited for feature extraction. The vector quantization technique
is preferred for feature matching to implement the voice detected attendance
system.