
Yousef Rabah 5/18/2023

Speech Recognition Application:

Voice Enabled Phone Directory

Yousef Rabah

Senior Seminar

May 4, 2004


Table of Contents:

I. Introduction

II. Statement of the Problem

III. Proposed Solution
- Sphinx
- Database (ADB)
- Application

IV. Bibliography

V. Appendix (Project Code & Documentation)
- db.pm
- adb_tables.sql
- people.pl
- people.pm
- record_wav.pl
- wav_to_raw.pl
- get_speech.pl
- DisplaySpeech.pm
- DisplaySpeech.pl
- VEPD.pm
- VEPD.pl
- an4.log (generated log file from sphinx)


I. Introduction

Speech recognition is a topic that has long interested me, and I decided to focus on it this semester. During the semester I read a range of journal articles and decided to build an application on top of Automatic Speech Recognition (ASR). As I read articles and books, I developed the idea of building a voice enabled phone directory. The ultimate vision is a system that uses speech as its primary means of communication. The user would call a single number that connects to the phone database, communicate with it by speech, and have it dial numbers on the user's behalf. The user would therefore need to know only one number in order to call any name in the database by voice.

This system is aimed not only at people who want fast, mobile access to their contacts; its main purpose is to serve people with certain disabilities who would benefit from a system that uses a ‘hands free’ structure.

Companies such as Advanced Recognition Technologies, Inc. (ART) and Microsoft, among others, have been integrating speech recognition systems into their software. Voice command based applications are expected to cover many of the communication aspects of our daily lives, ranging from telephones to the Internet.

There are two types of speech recognition systems. The first is a ‘speaker dependent system’, designed for a single speaker; it is easy to develop but not flexible to use. The second is a ‘speaker independent system’, designed for any speaker. It is harder to develop, less accurate and more expensive than a ‘speaker dependent’ system, but it is more flexible.

The vocabulary size of an Automatic Speech Recognition (ASR) system ranges from a small vocabulary of as few as two words to a very large vocabulary of tens of thousands of words. The size of the vocabulary affects the complexity, the processing requirements, and the accuracy of the ASR system.

A number of factors can affect the accuracy and performance of an ASR system, such as pronunciation and frequency, the speaker's current mood, age, sex, dialect and inflexions, and background noise. The system must overcome these obstacles. For example, it could use filters to deal with some of them, such as background noise, coughs and heavy breathing.

Processing speech requires analog-to-digital conversion, in which the voice's pressure waves are sampled at regular intervals and converted to numerical values so that they can be processed digitally. Replaying the samples at the appropriate rate reproduces the sound. There are several ways to store speech. One is the wav format, initially defined by Microsoft for its multimedia extensions, which stores mono or stereo sound at sampling rates of up to 44 kHz. Another is the RAW type, which is the basic digital sound format: the file is a stream of bytes in which each value represents the amplitude of a single sample. It has no header, so the sampling rate must be known in order to replay it correctly. The model that I am going to build will use these sound formats for the Voice Enabled Phone Directory (VEPD).
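The difference between the two formats can be made concrete with a minimal sketch. It is written in Python with only the standard library, not in the Perl used by the project, and the function names (make_tone, wrap_as_wav, unwrap_wav) are hypothetical. It assumes 16 bit mono samples at 16000 Hz. The wav data is the same stream of samples as the raw data, preceded by a header that records the sampling rate.

```python
import io
import math
import struct
import wave


def make_tone(n_samples=160, rate=16000, freq=440.0):
    """Return n_samples of a sine tone as raw 16 bit little-endian mono PCM."""
    samples = [int(10000 * math.sin(2 * math.pi * freq * i / rate))
               for i in range(n_samples)]
    return struct.pack("<%dh" % n_samples, *samples)


def wrap_as_wav(raw, rate=16000):
    """Put a wav header in front of raw 16 bit mono samples."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)      # mono
        w.setsampwidth(2)      # 2 bytes = 16 bits per sample
        w.setframerate(rate)   # stored in the header, unlike in a raw file
        w.writeframes(raw)
    return buf.getvalue()


def unwrap_wav(wav_bytes):
    """Return (sampling rate, raw samples) read back from wav data."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as r:
        return r.getframerate(), r.readframes(r.getnframes())
```

Calling unwrap_wav(wrap_as_wav(make_tone())) returns the sampling rate together with the original raw bytes. The raw bytes alone carry no rate, which is why a raw file can only be replayed correctly when the rate is known from somewhere else.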


The Hidden Markov Model (HMM) is a Markov chain in which the output symbols, or the probabilistic functions that describe them, are attached to the states or to the transitions between states. To be specific, it is defined by its graph structure, which is the number of states and their connections, and by the number of mixtures per state. The algorithm consists of a set of nodes that are chosen to represent a particular vocabulary. These nodes are ordered and connected from left to right, and recursive loops are allowed. Recognition is based on a transition matrix that gives the probability of changing from one node to another. A good way to understand the HMM is through an example. If we build a model that recognizes only the word “yes”, the word can be treated as the two phonemes ‘\ye’ and ‘\s’; with three states per phoneme model, this corresponds to the six states of the two phoneme models. To be more accurate, “yes” is composed of ‘\y’ ‘\eh’ ‘\s’. The ASR system cannot observe the sequence of states the speaker had in mind, so it tries to find W by reconstructing the most likely sequence of states and words W that could have generated X. Here W represents the sequence of ‘words’ and X is the sequence of acoustic sounds.
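The search for the most likely state sequence can be illustrated with the Viterbi algorithm, a standard technique for this step. The following is a minimal Python sketch, not the decoder used by Sphinx. It assumes a left-to-right model with one state per phoneme of “yes” and three made-up discrete acoustic symbols A, B and C; all probabilities are hypothetical values chosen for illustration.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for the observations."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[t][s] = (log probability of the best path ending in s at time t,
    #               the state that path was in at time t - 1)
    first = observations[0]
    best = [{s: (log(start_p.get(s, 0)) + log(emit_p[s].get(first, 0)), None)
             for s in states}]
    for symbol in observations[1:]:
        row = {}
        for s in states:
            score, prev = max(
                (best[-1][p][0] + log(trans_p[p].get(s, 0)), p) for p in states
            )
            row[s] = (score + log(emit_p[s].get(symbol, 0)), prev)
        best.append(row)

    # Trace back from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for row in reversed(best[1:]):
        state = row[state][1]
        path.append(state)
    path.reverse()
    return path


# Hypothetical left-to-right model: each state may loop or move one step right.
STATES = ["y", "eh", "s"]
START = {"y": 1.0}
TRANS = {"y": {"y": 0.5, "eh": 0.5},
         "eh": {"eh": 0.5, "s": 0.5},
         "s": {"s": 1.0}}
EMIT = {"y": {"A": 0.8, "B": 0.1, "C": 0.1},
        "eh": {"A": 0.1, "B": 0.8, "C": 0.1},
        "s": {"A": 0.1, "B": 0.1, "C": 0.8}}
```

With these values, viterbi(list("AABBC"), STATES, START, TRANS, EMIT) returns the states y, y, eh, eh, s: the recursive loops absorb the repeated symbols, and the left-to-right transitions force the phonemes to appear in order.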

The HMM is often referred to as a parametric model because the state of the system at each time t is completely described by a finite set of parameters. The training algorithm estimates the HMM parameters starting from a good first guess, using the preprocessed speech data (features) with their associated phoneme labels. The HMM parameters are stored in files and retrieved by the training procedure.

Model training is performed by estimating the HMM parameters, and the accuracy of the estimates is roughly proportional to the amount of training data. The HMM is well suited to a speaker-independent system because what it learns from the training speech is expressed as probabilities, or generalizations, which makes it a good system to use for multiple speakers.

It is worth noting the difference between an isolated system and a continuous system. An isolated system handles a single unit at a time, either a full word or a letter. It is the simplest type because the pauses between words or letters make it easy to find the end points of each one. A continuous system handles full sentences, so it is much harder to find the starting and ending points of each word.
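To show why pauses make isolated words easy to delimit, here is a minimal Python sketch of energy based end point detection. It is an illustration only, not the method used by Sphinx. The frame length, the threshold and the minimum pause length are hypothetical values, and the function names are made up for this example.

```python
def frame_levels(samples, frame_len):
    """Mean absolute amplitude of each full frame of samples."""
    return [sum(abs(x) for x in samples[i:i + frame_len]) / frame_len
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def find_words(samples, frame_len=160, threshold=500.0, min_pause_frames=3):
    """Return (start, end) sample indices of loud stretches separated by pauses.

    A word starts at the first frame at or above the threshold and ends at the
    first frame of a pause lasting at least min_pause_frames quiet frames.
    The end index is exclusive.
    """
    levels = frame_levels(samples, frame_len)
    words = []
    start = None   # frame index where the current word began
    quiet = 0      # consecutive quiet frames seen inside the current word
    for i, level in enumerate(levels):
        if level >= threshold:
            if start is None:
                start = i
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= min_pause_frames:
                end = i - quiet + 1   # first frame of the pause
                words.append((start * frame_len, end * frame_len))
                start = None
                quiet = 0
    if start is not None:
        # The signal ended while a word was still open.
        end = len(levels) - quiet
        words.append((start * frame_len, end * frame_len))
    return words
```

For a signal made of two loud bursts separated by silence, find_words returns one (start, end) pair per burst. In continuous speech there are no such pauses between words, so a threshold of this kind finds only the ends of the whole sentence.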

II. Statement of the Problem

The focus of my project is an automatic, speech driven phone directory assistant. It is hard to develop a complete system for a ‘hands free’ environment because there are many areas to cover. As stated above, the ultimate vision is a ‘hands free’ system. I want to build a structure, or module, that can later be enhanced to support a ‘hands free’ voice enabled system.

III. Proposed Solution

My solution consists of three parts. I will go through each of them, explaining my approach and what I would like to obtain from it, and then demonstrate how they all play a part in the final configuration. Here is a diagram that shows an overview of the models; the next paragraphs explain it:


Sphinx:

The first part needed is an ASR system that I can work with in order to build my speech enabled phone directory. I need a speaker independent system based on HMMs that has a large vocabulary. After researching the matter, I decided to use Sphinx, developed at Carnegie Mellon University, as my ASR system. In Sphinx, the basic sounds of the language are classified into phonemes, or phones. The phones are distinguished according to their position within the word (beginning, end, internal, or single) and are further refined into context-dependent triphones. Acoustic models are built from these triphones. Each triphone is modeled by an HMM that usually contains three to five states. The HMM states are clustered into a much smaller number of groups called senones.

The input audio consists of 16 bit samples at a sampling rate of 8 to 16 kHz, stored as a .raw file. Training requires good data in the form of spoken text, or utterances. Each utterance:

- Is converted into a linear sequence of triphone HMMs using the pronunciation lexicon.

- Is aligned against those HMMs to find the best state sequence, or state alignment.

For each senone, all the frames mapped to it in the training data are gathered in order to build a suitable statistical model. The language model consists of:

- Unigrams: the entire set of words and their individual probabilities of occurrence in the language.

- Bigrams: the conditional probability that word 2 immediately follows word 1 in the language. The model contains this information for only some subset of the possible word pairs.
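The two parts of the language model can be sketched with simple counting. This is a minimal Python illustration using plain relative frequencies. It is not the Sphinx language model format or its training tool, and the three training sentences are made up.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Return (unigram_p, bigram_p) estimated by relative frequency."""
    unigrams = Counter()
    bigrams = Counter()
    for sentence in sentences:
        words = sentence.upper().split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))

    # Unigram: P(word) = count(word) / total number of words.
    total = sum(unigrams.values())
    unigram_p = {w: c / total for w, c in unigrams.items()}

    # Bigram: P(word2 | word1) = count(word1 word2) / count(word1 followed by any word).
    first_counts = Counter()
    for (w1, _), c in bigrams.items():
        first_counts[w1] += c
    bigram_p = {(w1, w2): c / first_counts[w1]
                for (w1, w2), c in bigrams.items()}
    return unigram_p, bigram_p


# Hypothetical training text for a phone directory.
CORPUS = ["call sam smith", "call george adams", "call sam"]
```

Training on CORPUS gives every word a unigram probability, but bigram entries exist only for the word pairs that actually occur, such as ("CALL", "SAM"). A pair like ("SAM", "GEORGE") has no entry, which is the sense in which the bigram table covers only a subset of the possible pairs.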


Sphinx also contains the lexicon structure, which is the pronunciation dictionary: a file that specifies how each word is pronounced. Pronunciations are specified as linear sequences of phones. A word or letter can have multiple pronunciations. The dictionary also includes a silence symbol, <sil>, to represent the user’s silence. As an example, ‘ZERO’ is pronounced ‘Z IH R OW’.
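A pronunciation dictionary of this kind is easy to read into a lookup table. The minimal Python sketch below assumes the CMU dictionary text convention: one entry per line, the word followed by its phones, alternate pronunciations marked with a numeric suffix such as ZERO(2), and comment lines starting with ;;;. The sample text, including the second pronunciation of ZERO and the phone used for <sil>, is an illustration and is not taken from the project's files.

```python
import re
from collections import defaultdict


def parse_dictionary(text):
    """Map each word to the list of its pronunciations (each a list of phones)."""
    lexicon = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";;;"):
            continue  # skip blank lines and comments
        word, *phones = line.split()
        # "ZERO(2)" is an alternate pronunciation of "ZERO".
        base = re.sub(r"\(\d+\)$", "", word)
        lexicon[base].append(phones)
    return dict(lexicon)


SAMPLE = """\
;;; hypothetical sample entries
ZERO    Z IH R OW
ZERO(2) Z IY R OW
ONE     W AH N
<sil>   SIL
"""
```

parse_dictionary(SAMPLE)["ZERO"] holds both pronunciations as separate phone lists, so a decoder can accept either one for the same word.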

Database (ADB)

The second step in the process was building a database that holds the contact information of the people in the directory. I decided to use PostgreSQL for this part because I had the book and installation CD and was familiar with its contents.

The database, named ADB, contains a “People” entity with these attributes:

1. pid: the unique identifier for each person, of type integer.

2. first_name: the first name of a person, of type varchar(20).

3. last_name: the last name of a person, also of type varchar(20).

4. phone_number (the phone_num column in the sample output): the phone number, of type varchar(12) UNIQUE (which means that the system will not accept the same number more than once).

5. city: the city name, of type varchar(15).

The primary key is (pid, first_name, last_name).

Here is an example of what the Database contains:

 Pid | first_name | last_name | phone_num    | city
-----+------------+-----------+--------------+----------
   1 | Sam        | Smith     | 765-973-2743 | Ramallah
   2 | George     | Adams     | 765-973-2741 | Richmond


The database holds each person’s information and provides the data needed for the phone directory. In other words, it acts as an address book, but it can also select the information needed by the application. For example, you can select all names in the directory, or select a specific person by first name or last name. The Application part discusses db.pm, people.pm, and people.pl, the scripts that connect to ADB and send and retrieve information.
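The table and the selections described above can be tried out with a minimal Python sketch. It uses the standard library's in-memory SQLite database as a stand-in for PostgreSQL, so it is not the project's adb_tables.sql. The column names follow the sample output, and the function names are hypothetical.

```python
import sqlite3

# Stand-in for the ADB "people" table; SQLite accepts the same column types.
SCHEMA = """
CREATE TABLE people (
    pid        INTEGER,
    first_name VARCHAR(20),
    last_name  VARCHAR(20),
    phone_num  VARCHAR(12) UNIQUE,
    city       VARCHAR(15),
    PRIMARY KEY (pid, first_name, last_name)
)
"""


def build_directory():
    """Create an in-memory database holding the two sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT INTO people VALUES (?, ?, ?, ?, ?)",
        [(1, "Sam", "Smith", "765-973-2743", "Ramallah"),
         (2, "George", "Adams", "765-973-2741", "Richmond")],
    )
    return conn


def all_names(conn):
    """Select all names in the directory."""
    return conn.execute(
        "SELECT first_name, last_name FROM people ORDER BY pid").fetchall()


def by_last_name(conn, last_name):
    """Select a specific person by last name."""
    return conn.execute(
        "SELECT first_name, last_name, phone_num FROM people "
        "WHERE last_name = ?", (last_name,)).fetchall()
```

Because phone_num is declared UNIQUE, inserting a second person with an existing number raises sqlite3.IntegrityError, which is the behaviour the attribute list describes for ADB.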

Application

The application is the third deliverable, and it serves as the main connector between the ASR system (Sphinx) and the database (ADB), making it easy to send and receive information between the two. The programming language used is Perl, with shell commands embedded in the Perl code. The figure below illustrates the overall structure of my application:


As seen in the figure above, the overall structure consists of four main stages: recording speech, decoding speech, connecting to the database, and finally displaying the results back to the user.

The main script is the Voice Enabled Phone Directory (VEPD.pm and VEPD.pl), which calls one script at a time, each with its own duty, before moving on to the next. The best way to go through the architecture is to follow it step by step, explaining what each script does along the way.

The first script called is the record script (record_wav.pl). Its objective is simple: when it is called, it displays instructions so that the user knows how to record. To start recording the user presses the space bar, and to stop the user presses the space bar again. A system call is made to do the recording; it takes a rate of 16000 and an output option -o. The recordings are saved as record000, then record001 ... recordNNN in a directory called wav_files. Although recordings can go up to recordNNN, we want to deal with only two files at a time, because that keeps the structure contained. The files are of the wav type described in the introduction.

The next script called is wav_to_raw.pl. This script runs through all the wav files, converts them to raw format, and places the results in another directory. First it opens the directory that contains the wav files recorded by record_wav.pl. For each file name that matches .wav, the extension is changed from .wav to .raw. A system call then runs sox (Sound eXchange), a sound file converter, at a rate of 16000 to convert the wav data and write the result into another directory called raw_files. There is an option, which could be used later as an enhancement, to replay what the user recorded in the record_wav.pl script. For this, the new raw file is converted back into a wav file so that it can be replayed to the user if needed.
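The conversion step can be sketched as follows, in Python rather than the project's Perl. The function names are hypothetical. The sox options shown (-r for rate, -b for bits per sample, -e for encoding, -c for channels) are those of current SoX releases and are an assumption; the exact command line used in wav_to_raw.pl is not given in the text.

```python
import subprocess
from pathlib import Path


def sox_command(wav_path, raw_dir, rate=16000):
    """Build the sox argument list that converts one wav file to a raw file."""
    wav_path = Path(wav_path)
    # Same base name, .raw extension, placed in the raw directory.
    raw_path = Path(raw_dir) / (wav_path.stem + ".raw")
    return ["sox", str(wav_path),
            "-r", str(rate),            # output sampling rate
            "-b", "16",                 # 16 bits per sample
            "-e", "signed-integer",     # plain PCM encoding
            "-c", "1",                  # mono
            str(raw_path)]


def convert_all(wav_dir="wav_files", raw_dir="raw_files", rate=16000):
    """Convert every .wav file in wav_dir; requires sox to be installed."""
    Path(raw_dir).mkdir(parents=True, exist_ok=True)
    for wav_path in sorted(Path(wav_dir).glob("*.wav")):
        subprocess.run(sox_command(wav_path, raw_dir, rate), check=True)
```

Building the command as a list and passing it to subprocess.run avoids shell quoting problems with file names, and check=True stops the loop if sox reports an error for any file.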

After we get the raw sound file, we can call the get_speech.pl script. The raw file used is always record000. If the user wants to add a string, the second recording, record001, is concatenated (with cat) onto record000, so that record000 in turn contains the first spoken string followed by the second. Once record001 has been appended to record000, record001 is removed, so that if the user wants to search further or add another string, the new recording is again saved as record001 and the process is repeated. Next, the script opens the directory in which the ASR system (Sphinx) is located and places the correct raw sound file there, so that Sphinx can try to decode what was said. The next step is to define the locations Sphinx needs: S3BATCH (a location variable) is the Sphinx application we are running, and the locations of the other resources it requires are passed as arguments to S3BATCH. The script then executes a system call that runs the program and decodes the speech. A log file is generated that records all the commands, what occurred, and how Sphinx arrived at the decoded speech; it shows the whole process.

Finally, a system call uses ‘grep’ to get the line of decoded text and stores it in a variable, which is in turn written to another file, DecodedSpeech.txt. This way other scripts are able to use the generated text.
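Two of these steps can be sketched in Python: appending the second recording to the first, and pulling the decoded line out of a log. This is an illustration only, not the get_speech.pl code. The function names are hypothetical, and the marker string "DECODED:" used in the usage note is a made-up placeholder, because the text does not show which pattern the script greps for in the Sphinx log.

```python
import os
import tempfile


def append_recording(first_raw, second_raw):
    """Append second_raw to first_raw, then delete second_raw.

    Plain byte concatenation is enough because raw files have no header.
    """
    with open(second_raw, "rb") as src, open(first_raw, "ab") as dst:
        dst.write(src.read())
    os.remove(second_raw)


def extract_decoded_line(log_text, marker):
    """Return the text after marker on the last log line containing it."""
    decoded = None
    for line in log_text.splitlines():
        if marker in line:
            decoded = line.split(marker, 1)[1].strip()
    return decoded


def demo():
    """Run append_recording on two tiny temporary files."""
    with tempfile.TemporaryDirectory() as tmp:
        first = os.path.join(tmp, "record000.raw")
        second = os.path.join(tmp, "record001.raw")
        with open(first, "wb") as f:
            f.write(b"\x01\x02")
        with open(second, "wb") as f:
            f.write(b"\x03\x04")
        append_recording(first, second)
        with open(first, "rb") as f:
            combined = f.read()
        return combined, os.path.exists(second)
```

After demo() runs, record000.raw holds the bytes of both recordings in order and record001.raw no longer exists, so the next recording can reuse that name. A call such as extract_decoded_line(log_text, "DECODED:") returns only the text following the marker, which could then be written to DecodedSpeech.txt.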

Once we have the decoded text, we call DisplaySpeech.pl, which simply calls a function contained in DisplaySpeech.pm. First, it finds out whether the user wants to search by first or last name: by first name through getFirstChar, or by last name through getLastChar. Before this script is run, the main menu function in VEPD.pm asks the user whether to search by first or last name; here we open the file that holds that choice and store the result in a variable. Then we run the record script through a function in order to record wav files, followed by the wav to raw script, which converts the wav files into raw files and places them in the raw files directory.

Then we get the speech by running get_speech.pl, open the decoded speech file produced from the Sphinx log file, and use matching and split commands to strip the unwanted text and get back the decoded speech as a string. There is an option called Play_wav() that lets the user hear what was said, if the user chooses to. For the decoded text, the script connects to ADB to get back names and numbers. Here we connect to the database through db.pm and get back information through people.pm and people.pl.

The main function of the db.pm script is to connect to the PostgreSQL database called ADB. It contains functions that prepare SQL statements and execute them. It then either fetches the resulting rows into an array of rows, or does the reverse and inserts an array of rows into ADB.

The people.pm/pl scripts use db.pm to connect to the database and run the SQL statements needed to get back the results. As an example, to search by first name, we select first_name, last_name, phone_num from people where first_name matches any of the strings being searched for in ADB.

We use SQLselect from db.pm to connect to ADB and run the SQL statement. The results are stored in an array, rows, and a status counter is returned with them. If the status is 0, there are no results. If the status is 1, there is exactly one match, and in that case the system asks whether or not the user wants to call that name. If the status is more than one, the user can either add another string or pick from the list, using a couple of options that the user will see. The people.pl file dispatches on the entry point, calling the function that matches it: get by first name, get by last name, etc.
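The search and the three status cases can be sketched in Python, with an in-memory SQLite table standing in for ADB. This is not the people.pm code. The function names are hypothetical, the match is assumed to be an exact, case-insensitive comparison, and the third sample row (Sam Jones and his number) is made up so that one search returns more than one match.

```python
import sqlite3


def make_people():
    """Build a small stand-in for the ADB people table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE people (pid INTEGER, first_name TEXT, "
                 "last_name TEXT, phone_num TEXT UNIQUE, city TEXT)")
    conn.executemany(
        "INSERT INTO people VALUES (?, ?, ?, ?, ?)",
        [(1, "Sam", "Smith", "765-973-2743", "Ramallah"),
         (2, "George", "Adams", "765-973-2741", "Richmond"),
         (3, "Sam", "Jones", "765-555-0100", "Richmond")],  # made-up row
    )
    return conn


def search_by_first_name(conn, decoded_words):
    """Return (status, rows), where status is the number of matches."""
    if not decoded_words:
        return 0, []
    placeholders = ", ".join("?" for _ in decoded_words)
    sql = ("SELECT first_name, last_name, phone_num FROM people "
           "WHERE UPPER(first_name) IN (" + placeholders + ") ORDER BY pid")
    rows = conn.execute(sql, [w.upper() for w in decoded_words]).fetchall()
    return len(rows), rows


def next_action(status):
    """Map the status counter to what the application does next."""
    if status == 0:
        return "no match"
    if status == 1:
        return "offer to call"
    return "refine or pick from list"
```

Passing the decoded words as query parameters, rather than pasting them into the SQL string, keeps unexpected decoder output from changing the statement. With the sample rows, a search for GEORGE gives status 1 and a search for SAM gives status 2, which leads to the list of options.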

The VEPD.pm/pl scripts contain functions that execute the other scripts. As an example, the main menu is called in VEPD.pl and, depending on what the user needs, a function is called to retrieve, add, view, or quit the program.

(* Note: see the appendix for the scripts, their functionality and documentation)


IV. Bibliography

White, George M. "Natural Language understanding and Speech Recognition."

Communications of the ACM 33 (1990): 74 - 82.

Osada, Hiroyasu. "Evaluation Method for a Voice Recognition System Modeled with

Discrete Markov Chain." IEEE 1997: 1 - 3.

Bradford, James H. "The Human Factors of Speech-Based Interfaces: A Research

Agenda." SIGCHI Bulletin 27 (1995): 61 - 67.

Shneiderman, Ben. "The Limits of Speech Recognition." Communications of the ACM 43 (2000): 63 - 65.

Danis, Catalina, and John Karat. "Technology-Driven Design of Speech Recognition

Systems" ACM 1995: 17 - 24.

Suhm, Bernhard, et al. "Multimodal Error Correction for Speech User Interfaces"

ACM Transactions on Computer-Human Interaction 8 (2001) 60 - 98.

Brown, M.G., et al. "Open-Vocabulary Speech Indexing for Voice and Video Mail

Retrieval" ACM Multimedia 96 1996: 307 - 316.

Christian, Kevin., et al. "A Comparison of Voice Controlled and Mouse Controlled

Web Browsing" ACM 2000: 72 - 79.

Falavigna, D., et al. "Analysis of Different Acoustic front-ends for Automatic voice

over IP Recognition" Italy 2001.

Simons, Sheryl P. "Voice Recognition Market Trends" Faulkner Information Services

2002.


Becchetti, Claudio, and Lucio Prina Ricotti. Speech Recognition: Theory and C++ Implementation. New York: 1999.

Abbott, Kenneth R. Voice Enabling Web Applications: VoiceXML and Beyond. New York: 2002.

Miller, Mark. VoiceXML: 10 Projects to Voice Enable Your Web Site. New York: 2002.

Syrdal, A., et al. Applied Speech Technology. Ann Arbor: CRC, 1995.

Larson, James A. VoiceXML: Introduction to Developing Speech Applications. New Jersey: 2003.
