Yousef Rabah Page 1 5/18/2023
Speech Recognition Application:
Voice Enabled Phone Directory
Yousef Rabah
Senior Seminar
May 4, 2004
Table of Contents:
I. Introduction
II. Statement of the Problem
III. Proposed Solution
    Sphinx
    Database (ADB)
    Application
IV. Bibliography
V. Appendix (Project Code & Documentation)
    - db.pm
    - adb_tables.sql
    - people.pl
    - people.pm
    - record_wav.pl
    - wav_to_raw.pl
    - get_speech.pl
    - DisplaySpeech.pm
    - DisplaySpeech.pl
    - VEPD.pm
    - VEPD.pl
    - an4.log (generated log file from sphinx)
I. Introduction
Speech recognition is a topic that has long interested me, and I decided to focus on it
this semester. During the semester I read articles from different journals and decided to
build an application on top of Automatic Speech Recognition (ASR). As I read articles and
books, I developed the idea of building a voice enabled phone directory. The ultimate
vision is a system that uses speech as its primary means of communication. The user would
call one number that connects to the phone database, communicate with it using speech,
and have it dial numbers for him/her. Therefore, the user would only need to know one
number in order to call any name in the database via speech.
This system is aimed not only at people who want fast and mobile access to calling
others; its main purpose is to serve people with certain disabilities, who would benefit
from a system that has a ‘hands free’ structure.
Companies such as Advanced Recognition Technologies, Inc. (ART) and Microsoft, among
others, have been integrating speech recognition systems into their software. These voice
command based applications are expected to cover many of the communication aspects of our
daily lives, ranging from telephones to the Internet.
There are two types of speech recognition systems. The first is a ‘speaker
dependent system’, which is designed for a single speaker; it is easy to develop but not
flexible to use. The second is a ‘speaker independent system’, which is designed for any
speaker. It is harder to develop, less accurate and more expensive than the ‘speaker
dependent’ system, but it is more flexible.
The vocabulary size of an Automatic Speech Recognition (ASR) system can range from a
small vocabulary of two words to a very large vocabulary of tens of thousands of words.
The size of the vocabulary affects the complexity, the processing requirements, and the
accuracy of the ASR system.
A number of factors can affect the accuracy and performance of an ASR system, such as
pronunciation and frequency; the speaker's current mood, age, sex, dialect and
inflexions; and background noise. The system has to overcome these obstacles. As an
example, it could use filters to reduce problems such as background noise, coughs and
heavy breathing.
Speech processing requires analog-to-digital conversion, in which the voice's pressure
waves are sampled at regular intervals and converted to numerical values so that they can
be digitally processed. When the samples are replayed at the appropriate rate, the
original sound is reproduced. There are multiple formats in which speech can be saved. As
an example, you could use wav files, a format initially defined by Microsoft for its
multimedia extensions, which can store mono or stereo sound at sampling rates such as
44.1 kHz. Another type is the RAW type, which is the basic digital sound format. The data
file is a stream of bytes in which each value represents the amplitude of a single
sample. It does not contain a header, so in order to replay it correctly the sampling
rate must be known. The model that I am going to build will use these sound formats for
the Voice Enabled Phone Directory (VEPD).
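To make the sampling idea concrete, here is a minimal sketch in Python (the project
itself is written in Perl; Python is used here only for illustration). It samples a pure
tone at regular intervals and stores each amplitude as a 16 bit value, which is what a
headerless RAW stream looks like. The function names, the 440 Hz tone and the 16000 Hz
rate are assumptions chosen for the example, not part of the project code.

```python
import math
import struct

def sample_tone(freq_hz, rate_hz, n_samples):
    """Sample a sine wave at regular intervals and quantize to signed 16-bit."""
    samples = []
    for n in range(n_samples):
        t = n / rate_hz                       # time of the n-th sample in seconds
        amplitude = math.sin(2 * math.pi * freq_hz * t)
        samples.append(int(round(amplitude * 32767)))
    return samples

def to_raw_bytes(samples):
    """Pack samples as little-endian 16-bit integers with no header at all."""
    return struct.pack("<%dh" % len(samples), *samples)

# Hypothetical values: 10 ms of a 440 Hz tone sampled at 16000 Hz.
samples = sample_tone(440, 16000, 160)
raw = to_raw_bytes(samples)
print(len(raw))  # two bytes per sample
```

Nothing in the resulting byte stream records the 16000 Hz rate, which is why a RAW file
can only be replayed correctly when the rate is known in advance.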
The Hidden Markov Model (HMM) is a Markov chain in which the output symbols, or the
probabilistic functions that describe them, are associated with the states or with the
transitions between states. To be specific, a model is described by its graph structure,
which is the number of states and their connections, and by the number of mixtures per
state. The algorithm consists of a set of nodes that are chosen to represent a particular
vocabulary. These nodes are ordered and connected from left to right, and recursive loops
are allowed. Recognition is based on a transition matrix that gives the probability of
changing from one node to another. A good way to understand the HMM is through an
example. If we build a model that recognizes only the word “yes”, then the word is
composed of the two phonemes ‘\ye’ and ‘\s’. This corresponds to the six states of the
two phoneme models (three states per phoneme). To be more accurate, “yes” is composed of
‘\y’ ‘\eh’ ‘\s’. The ASR system cannot know the acoustic states the speaker had in mind;
therefore it tries to find W by reconstructing the most likely sequence of states and
words W that could have generated X. Here W represents the sequence of ‘words’ and X is
the sequence of acoustic sounds.
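The search for the most likely state sequence can be illustrated with the standard
Viterbi algorithm. The sketch below is in Python and is only a toy: it uses one state per
phone of “yes” instead of three, the observation labels "a", "b" and "c" stand in for
acoustic features, and all of the probabilities are made up for the example. It is not
meant to describe how any particular recognizer is implemented.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for the given observations."""
    # best[t][s] = (probability of the best path ending in s at time t, previous state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * trans_p[p].get(s, 0.0) * emit_p[s][obs], p)
                for p in states
            )
        best.append(column)
    # Trace back from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return list(reversed(path))

# Left-to-right model with self-loops (all numbers are hypothetical).
states = ["y", "eh", "s"]
start_p = {"y": 1.0, "eh": 0.0, "s": 0.0}
trans_p = {
    "y": {"y": 0.5, "eh": 0.5},
    "eh": {"eh": 0.5, "s": 0.5},
    "s": {"s": 1.0},
}
emit_p = {
    "y": {"a": 0.8, "b": 0.1, "c": 0.1},
    "eh": {"a": 0.1, "b": 0.8, "c": 0.1},
    "s": {"a": 0.1, "b": 0.1, "c": 0.8},
}

path = viterbi(["a", "a", "b", "c", "c"], states, start_p, trans_p, emit_p)
print(path)
```

The self-loops are what allow one state to account for several consecutive observations,
so the same model can cover a word spoken quickly or slowly.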
The HMM is often referred to as a parametric model because the state of the system at
each time t is completely described by a finite set of parameters. The training algorithm
estimates the HMM parameters by taking a good first guess using the preprocessed speech
data (features) with their associated phoneme labels. The HMM parameters are stored as
files and then retrieved by the training procedure.
Model training is performed by estimating the HMM parameters, and the estimation accuracy
is roughly proportional to the amount of training data. The HMM is well suited for a
speaker-independent system because it describes the speech used during training through
probabilities, or generalizations, and that makes it a good system to use for multiple
speakers.
It is worth noting the difference between an isolated system and a continuous system. An
isolated system takes a single word, either a full word or a letter, at a time. It is the
simplest type because the pauses between words or letters make it easy to find the end
points of each word. The second type, the continuous system, uses full sentences, and it
is therefore much harder to find starting and ending points.
II. Statement of the Problem
The focus of my project is an automatic, speech driven phone directory assistance
system. It is hard to develop a whole system that works in a ‘hands free’ environment
because there are a lot of areas to cover. As I said, the ultimate vision is to have a
‘hands free’ system. I want to build a structure or module that could later be enhanced
to support a ‘hands free’ voice enabled system.
III. Proposed Solution
My solution consists of three parts and I will go through them and explain my approaches
and what I would like to obtain out of each. I will then demonstrate how they all play a
part in the final configuration. Here is a diagram that shows the overview of the models,
and the next paragraphs explain it:
[Figure: overview of the three parts of the solution]
Sphinx:
The first part needed is an ASR system that I can work with in order to build my speech
enabled phone directory. I need a speaker independent system based on HMM that has a
large vocabulary. After researching the matter, I decided to use Sphinx, developed at
Carnegie Mellon University, as my ASR system. In Sphinx, the basic sounds in the language
are classified into phonemes or phones. The phones are distinguished according to their
position within the word (beginning, end, internal, or single) and they are further
refined into context-dependent triphones. The acoustic models are built from the
triphones. Triphones are modeled by HMMs and usually contain three to five states. The
HMM states are clustered into a much smaller number of groups called senones.
The input audio consists of 16 bit samples, at a sampling rate ranging from 8 to 16 kHz,
and is of the .raw type. Training requires good data that consists of spoken text, or
utterances. Each utterance:
- Is converted into a linear sequence of triphone HMMs using the pronunciation
lexicon.
- Is aligned to find the best state sequence, or state alignment, through the HMMs.
For each senone, all of the frames mapped to it in the training data are gathered in
order to build suitable statistical models. The language model consists of:
- Unigrams: the entire set of words and their individual probabilities of
occurrence in the language.
- Bigrams: the conditional probability that word2 immediately follows word1
in the language. The model contains this information for only some subset of the
possible word pairs.
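The two kinds of language model probabilities can be sketched with simple counting. The
Python example below estimates unigram and bigram probabilities from a tiny, invented
corpus; the corpus, the function name and the plain count-based estimate are assumptions
made for illustration and say nothing about how the language model used in this project
was actually built.

```python
from collections import Counter

def train_bigrams(sentences):
    """Estimate unigram and bigram probabilities by counting."""
    unigrams = Counter()
    bigrams = Counter()
    for sentence in sentences:
        words = sentence.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    total = sum(unigrams.values())
    unigram_p = {w: c / total for w, c in unigrams.items()}
    # Simple estimate: P(word2 | word1) = count(word1 word2) / count(word1)
    bigram_p = {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}
    return unigram_p, bigram_p

# Hypothetical training text.
corpus = ["call sam smith", "call george adams", "call sam"]
unigram_p, bigram_p = train_bigrams(corpus)
print(unigram_p["call"])
print(bigram_p[("call", "sam")])
```

Only word pairs that occur in the training text receive a bigram entry, which is why the
model holds information for just a subset of the possible pairs.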
Sphinx also uses a lexicon structure, which is the pronunciation dictionary: a file that
specifies word pronunciations. Pronunciations are specified as linear sequences of
phones. It is also essential to know that the same word or letter can have multiple
pronunciations. The dictionary also includes a silence symbol <sil> to represent the
user’s silence. As an example, ‘ZERO’ is pronounced ‘Z IH R OW’.
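A pronunciation dictionary of this kind can be read into a simple lookup table. The
Python sketch below assumes one entry per line, with the word followed by its phones, and
assumes that alternate pronunciations are marked with a suffix such as (2), as in the CMU
pronouncing dictionary. The sample entries other than ZERO, and the function name, are
illustrative only.

```python
import re

def parse_dictionary(text):
    """Map each word to a list of pronunciations (each a list of phones)."""
    lexicon = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        # Strip an alternate-pronunciation marker such as "(2)" from the word.
        word = re.sub(r"\(\d+\)$", "", parts[0])
        lexicon.setdefault(word, []).append(parts[1:])
    return lexicon

# Hypothetical dictionary fragment.
sample = """\
ZERO     Z IH R OW
ZERO(2)  Z IY R OW
<sil>    SIL
"""

lexicon = parse_dictionary(sample)
print(lexicon["ZERO"])
```

Keeping a list of pronunciations per word lets the recognizer accept any of them for the
same entry.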
Database (ADB)
The second step in the process was building a database that holds the contact
information of the people in the directory. I decided to use PostgreSQL for this part
because I had the book and the installation CD, and I was familiar with its contents.
The database, named ADB, contains a “People” entity, which has these attributes:
1. pid: contains the unique identification for each person, and is of type integer.
2. first_name: contains the first name of a person and is of type varchar(20).
3. last_name: contains the last name of a person and is also of type varchar(20).
4. phone_num: contains the phone number and is of type varchar(12) UNIQUE (which
means that the system will not accept the same number more than once).
5. city: contains the city name and is of type varchar(15).
The primary key is (pid, first_name, last_name).
Here is an example of what the Database contains:
 pid | first_name | last_name | phone_num    | city
-----+------------+-----------+--------------+----------
   1 | Sam        | Smith     | 765-973-2743 | Ramallah
   2 | George     | Adams     | 765-973-2741 | Richmond
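The table described above could be declared roughly as follows. This is a sketch under
stated assumptions: it uses Python with an in-memory SQLite database as a stand-in for
PostgreSQL so that it can be run anywhere, the DDL is a reconstruction from the attribute
list rather than the contents of adb_tables.sql, and the third person is invented purely
to show the UNIQUE constraint at work.

```python
import sqlite3

SCHEMA = """
CREATE TABLE people (
    pid        INTEGER,
    first_name VARCHAR(20),
    last_name  VARCHAR(20),
    phone_num  VARCHAR(12) UNIQUE,
    city       VARCHAR(15),
    PRIMARY KEY (pid, first_name, last_name)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Sam", "Smith", "765-973-2743", "Ramallah"),
        (2, "George", "Adams", "765-973-2741", "Richmond"),
    ],
)

# The UNIQUE constraint rejects a second person with an existing number.
try:
    conn.execute(
        "INSERT INTO people VALUES (3, 'Ann', 'Lee', '765-973-2743', 'Richmond')"
    )
    duplicate_accepted = True
except sqlite3.IntegrityError:
    duplicate_accepted = False

count = conn.execute("SELECT COUNT(*) FROM people").fetchone()[0]
print(duplicate_accepted, count)
```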
The database holds each person’s information and provides the data needed for the phone
directory. In other words, it acts as an address book, but at the same time the
application can select from it just the information it needs. For example, you can either
select all names in the directory, or you can select a specific person by first name or
last name. The Application part will discuss db.pm, people.pm, and people.pl, which are
the scripts that connect to ADB and send and retrieve information through it.
Application
The application is the third item in the deliverables and it serves as the main connector
between the ASR system (Sphinx) and the database (ADB). It provides a simple way to send
information to, and receive information from, both Sphinx and the database. The
programming language used is Perl, along with shell scripting embedded in the Perl code.
The figure below illustrates the overall structure of my application:
[Figure: overall structure of the application]
As seen in the figure above, the overall structure consists of four main stages, which
are recording speech, decoding speech, connecting to the database and finally displaying
the results back to the user.
The main script is the Voice Enabled Phone Directory (VEPD.pm and VEPD.pl), which calls
one script at a time, each with its own duty, and then moves on to the next script. The
best way to go through the architecture is to follow it step by step while explaining
what each script does.
The first script called is the record script (record_wav.pl). Its objective is simple:
when it is called, it displays information telling the user how to record. To start
recording the user presses the space bar, and to stop the user presses the space bar
again. A system call is made to do the recording. It takes a rate of 16000 and an output
option -o, and the recordings are saved as record000, then record001 ... recordNNN in a
directory called wav_files. Although recording can go up to recordNNN, we only want to
deal with two files at a time, because that keeps the structure contained. The files are
of the wav type that I mentioned in the introduction section.
The next script called is wav_to_raw.pl. This script runs through all the wav files,
converts them into raw files and places them in another directory. First it opens the
directory that contains the wav files that were recorded by record_wav.pl. For each file
name that matches .wav, the extension is changed to .raw rather than .wav. A system call
then runs sox (Sound eXchange), a sound file conversion tool, with a rate of 16000 to
convert the wav file and place the result in another directory called raw_files. There is
an option that could be used later on as an enhancement, which is to replay what the user
recorded in the record_wav.pl script. For this, the new raw file output would be
converted back into a wav file so that it can be replayed to the user if that option is
needed.
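The conversion step can be sketched as follows. The project performs it with sox from
Perl; the Python version below only illustrates the idea, using the standard wave module
to drop the header and keep the sample data. Unlike sox it does not resample, so it
assumes the recordings are already at the desired rate. The function name is an
assumption, and the directory names are passed in by the caller.

```python
import os
import wave

def wav_to_raw(wav_dir, raw_dir):
    """Write a headerless .raw copy of every .wav file found in wav_dir."""
    os.makedirs(raw_dir, exist_ok=True)
    converted = []
    for name in sorted(os.listdir(wav_dir)):
        if not name.endswith(".wav"):
            continue
        # Read only the sample frames; the wav header is left behind.
        with wave.open(os.path.join(wav_dir, name), "rb") as wav:
            frames = wav.readframes(wav.getnframes())
        raw_name = name[: -len(".wav")] + ".raw"
        with open(os.path.join(raw_dir, raw_name), "wb") as raw:
            raw.write(frames)
        converted.append(raw_name)
    return converted
```

A call such as wav_to_raw("wav_files", "raw_files") would mirror the directory layout
described above.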
After we get the raw sound file, we can call the get_speech.pl script. First of all, the
raw file used will always be record000. Even if the person wants to add a string, the
second recording, which would be record001, is concatenated (with cat) onto record000, so
in turn record000 includes the first spoken string plus the second one. Once record001
has been concatenated onto record000, record001 is removed, so that if the user wants to
search more or add another string it is saved as record001 again and the process is
repeated. Next we open the directory in which the ASR system (Sphinx) is located, because
the correct raw sound file has to be in this location for Sphinx to try to decode what
was said. So here we just put the raw file in the Sphinx location. The next step is to
define the locations that Sphinx needs: S3BATCH (a variable holding a location) is the
Sphinx application we are running, and the locations of the other resources that must be
passed as arguments to S3BATCH are defined as well. Then the script executes a system
call that runs the program and decodes the speech. A log file is generated that displays
all commands, what occurred, and how the system (Sphinx) arrived at the decoding of the
speech; it shows the process.
Finally, a system call will ‘grep’ (or get) the line of decoded text and put it in a
variable, which is in turn written to another file, DecodedSpeech.txt. This way other
scripts are able to use the generated text.
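Because raw files have no header, joining two recordings amounts to appending bytes. The
Python sketch below illustrates the record000/record001 handling described above; the
project does this with cat from Perl, and the function name and paths here are
assumptions for the example.

```python
import os

def append_recording(first_path, second_path):
    """Append the second raw recording to the first, then delete the second."""
    with open(second_path, "rb") as src, open(first_path, "ab") as dst:
        dst.write(src.read())
    # Remove the second file so the next recording can reuse its name.
    os.remove(second_path)
```

The same approach would not work directly on wav files, since each wav file carries its
own header describing the length of the data.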
Once we have the decoded text, we call DisplaySpeech.pl. The work is done by a single
function, which DisplaySpeech.pl simply calls. First, the function finds out whether the
user wants to search by first name (getFirstChar) or by last name (getLastChar). Before
this script is run, the main menu function in VEPD.pm calls the function that asks the
user whether to search by first or last name; here we open the file and store the result
in a variable. Then we run the record script through a function in order to record wav
files, and after that we run the wav to raw script, which converts the wav files into raw
files and places them in the raw files directory.
Then we get the speech by running get_speech.pl. We open the decoded speech file, which
comes from the Sphinx log file, and use matching and split commands to strip the unwanted
text and get back the decoded speech as a string. There is an option called Play_wav()
that lets the user hear what he/she said, if the user chooses to do so. For the decoded
text, the function connects to ADB to get back names and numbers. Here we connect to the
database through db.pm and get back information through people.pm and people.pl.
The main function of the script db.pm is to connect to the PostgreSQL database called
ADB. It contains functions that prepare SQL statements and execute them. It then either
fetches rows and puts them in an array of rows, or does the reverse and inserts an array
of rows into the database ADB.
The people.pm/pl scripts use db.pm to connect to the database and run the SQL statements
in order to get back the results needed. As an example, if you want to search by first
name, then we are selecting first_name, last_name and phone_num from people where
first_name matches any of the strings passed to ADB.
We use SQLselect from db.pm to connect to ADB and run the SQL statement. The results that
come back are stored in an array, rows, and a status counter is returned as well. If
status is 0, there are no results. If status is 1, there is only one match, and in that
case the user is asked whether or not he/she wants to call that name. If status is more
than one, the user can either add another string or pick from the list, and all of that
is done through a couple of options that the user will see. The people.pl file, depending
on the entry point, calls the matching function, for example getting by first name or by
last name.
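The lookup and the three status cases can be sketched as follows. This Python example
uses an in-memory SQLite database in place of PostgreSQL and does not reproduce
SQLselect or the other functions in db.pm and people.pm. The prefix match, the function
names, and the third person with her phone number are assumptions made for the example.

```python
import sqlite3

def search_by_first_name(conn, spoken):
    """Return (status, rows); status is the number of matching people."""
    rows = conn.execute(
        "SELECT first_name, last_name, phone_num FROM people "
        "WHERE lower(first_name) LIKE (lower(?) || '%')",
        (spoken,),
    ).fetchall()
    return len(rows), rows

def next_step(status):
    """Decide what the application should do for a given status."""
    if status == 0:
        return "no match"
    if status == 1:
        return "ask whether to call"
    return "add another string or pick from the list"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (first_name TEXT, last_name TEXT, phone_num TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [
        ("Sam", "Smith", "765-973-2743"),
        ("George", "Adams", "765-973-2741"),
        ("Sara", "Jones", "765-555-0100"),  # hypothetical entry
    ],
)

for spoken in ("SA", "GEO", "X"):
    status, rows = search_by_first_name(conn, spoken)
    print(spoken, status, next_step(status))
```

Passing the spoken string as a query parameter, rather than pasting it into the SQL text,
keeps decoded speech from being interpreted as part of the statement.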
The VEPD.pm/pl scripts contain functions that execute the other scripts. As an example,
the main menu is called in VEPD.pl and, depending on what the user needs, a function is
called to retrieve, add or view entries, or to quit the program.
(* Note: see the appendix for the scripts, their functionality and documentation)
Bibliography
White, George M. "Natural Language understanding and Speech Recognition."
Communications of the ACM 33 (1990): 74 - 82.
Osada, Hiroyasu. "Evaluation Method for a Voice Recognition System Modeled with
Discrete Markov Chain." IEEE 1997: 1 - 3.
Bradford, James H. "The Human Factors of Speech-Based Interfaces: A Research
Agenda." SIGCHI Bulletin 27 (1995): 61 - 67.
Shneiderman, Ben. "The Limits of Speech Recognition." Communications of the
ACM 43 (2000): 63 - 65.
Danis, Catalina, and John Karat. "Technology-Driven Design of Speech Recognition
Systems" ACM 1995: 17 - 24.
Suhm, Bernhard, et al. "Multimodal Error Correction for Speech User Interfaces"
ACM Transactions on Computer-Human Interaction 8 (2001) 60 - 98.
Brown, M.G., et al. "Open-Vocabulary Speech Indexing for Voice and Video Mail
Retrieval" ACM Multimedia 96 1996: 307 - 316.
Christian, Kevin., et al. "A Comparison of Voice Controlled and Mouse Controlled
Web Browsing" ACM 2000: 72 - 79.
Falavigna, D., et al. "Analysis of Different Acoustic front-ends for Automatic voice
over IP Recognition" Italy 2001.
Simons, Sheryl P. "Voice Recognition Market Trends" Faulkner Information Services
2002.
Becchetti, Claudio, and Lucio Prina Ricotti. Speech Recognition: Theory and C++
Implementation. New York: 1999
Abbott, Kenneth R. Voice Enabling Web Applications: VoiceXML and Beyond. New
York: 2002
Miller, Mark. VoiceXML: 10 Projects to Voice Enable Your Web Site. New York:
2002
Syrdal, A., et al. Applied Speech Technology. Ann Arbor: CRC 1995
Larson, James A. VoiceXML: Introduction to Developing Speech Applications. New
Jersey: 2003