
Integrating the speech recognition system SPHINX
with the STEP system

Lukasz Ratajczak

December 25, 2005
Master's Thesis in Computing Science, 20 credits

Supervisor at CS-UmU: Michael Minock
Examiner: Per Lindström

Umeå University
Department of Computing Science
SE-901 87 UMEÅ
SWEDEN


Abstract

This thesis performs a set of experiments to assess the prospects of speech interfaces to databases. The particular systems integrated in this thesis are the Sphinx-4 speech recognition system, the STEP natural language interface to databases, and the text-to-speech system Festival. Several useful results are obtained. One is a set of guidelines on the configuration of SPHINX with STEP. Another is the implementation of 'barge in', the ability to interrupt the text-to-speech system when the user is uninterested in hearing the full result. Finally, the thesis develops a syntax for the specification of numerical values, a common need in voice-based DB access.


Contents

1 Introduction
   1.1 Motivation
   1.2 Speech over Databases
   1.3 Practical applications
   1.4 Goals
   1.5 Organization of this thesis

2 Background
   2.1 Introduction to speech recognition
       2.1.1 Automatic SR systems
       2.1.2 History of speech recognition
       2.1.3 Practical side of speech recognition
   2.2 Introduction to SPHINX
   2.3 Introduction to STEP
       2.3.1 Spoken Systems expectations
   2.4 Introduction to text to speech generation
       2.4.1 Speech generation
       2.4.2 FESTIVAL
       2.4.3 Weakness of FESTIVAL

3 Integration and Operation of SPHINX
   3.1 Robustness
   3.2 Implementation of 'barge in'
   3.3 Architecture of Sphinx-4
       3.3.1 The Frontend module
       3.3.2 The decoder
       3.3.3 Providing STEP with Sphinx-4
   3.4 Configuration of Sphinx-4 components
   3.5 Grammar
       3.5.1 Structure of grammar
       3.5.2 Dates and numbers
       3.5.3 New words in dictionary
       3.5.4 Run

4 Evaluation of the STEP and SPHINX Integration

5 Conclusions
   5.1 Limitations

6 Acknowledgements

References

A Glossary

B Tests


List of Figures

1.1 Evolution of speech recognition technology by top 1000 U.S. corporations (taken from [3])

2.1 STEP's interface (taken from [13])

3.1 Graph of interaction between STEP and SPHINX


List of Tables

4.1 Accuracy of the system according to the amount of rule expansion
4.2 SpeedTracker1
4.3 SpeedTracker2


Chapter 1

Introduction

1.1 Motivation

Speech Recognition (SR) is a technology which makes life easier and has much to offer for the future. The main advantage of SR is that it can save time: it makes communicating with a computer faster than manual interfaces such as the keyboard. Instead of typing a long report, it is enough to say it. Another advantage of SR is that it can be used to learn a language, where it can help improve one's pronunciation. SR is also one step closer to speech translators, with the computer acting as both interpreter and translator. Handicapped people can use SR to interact with the society around them. SR-related technology may also help mute people in the future: the movement of the lips can be studied, interpreted and then translated by computer into ordinary speech.

1.2 Speech over Databases

Databases often contain useful information. Speech interfaces to databases save users time and allow them to access this information without using their hands. This is why groups such as the CMU Sphinx Group and the Linguistic Data Consortium conduct research to improve the accuracy of speech recognition systems. Speech over databases poses several challenges: each human language has its own phoneme inventory and word-formation rules, so for each language a new database of speech data has to be created.

1.3 Practical applications

Today the most popular speech recognition software packages for Windows are ViaVoice (an IBM product) and Dragon NaturallySpeaking (ScanSoft). Both products have high accuracy; for example, Dragon NaturallySpeaking Preferred 8 reached 99% accuracy in speech-to-text conversion in 2005. The ScanSoft software permits the user to verbally control the Windows environment and to dictate text to certain Windows applications, such as Microsoft Word, Excel, Corel, etc. E-mails and documents can also be read aloud, which is called text-to-speech. SR has also found applications in language courses (Collins, TellMeMore or SuperMemo) for improving the user's pronunciation. After only a few years, the influence of speech technology on the mobile phone market can be observed [11]. In newer cell phones


Figure 1.1: Evolution of speech recognition technology by top 1000 U.S. corporations (taken from [3])

the user is able to select a number by voice and send e-mail. This progressive evolution (see Fig. 1.1) shows a new trend in software which might be the next big phenomenon on the computer market.

1.4 Goals

In this thesis I show how I have integrated STEP, SPHINX and FESTIVAL. The improvement of robustness is also described. Another goal is to improve the accuracy of Sphinx-4 and to adapt the system's environment to speech commands (allowing 'barge in'). New words and pronunciations have been added to the dictionary.

1.5 Organization of this thesis

Chapter 2 is an introduction to speech recognition, its history and a technical overview. The third chapter describes the architecture of Sphinx-4, where the Decoder and the Front End are described, and gives basic information about STEP. It also describes practical applications of SR, the configuration of SPHINX for robustness, and the implementation of barge in. The fourth chapter is devoted to the evaluation of SPHINX (accuracy, adding words, the numbers grammar). Chapter 5 presents tests and conclusions.


Chapter 2

Background

2.1 Introduction to speech recognition

Natural language interfaces with voice recognition are going to play an important role in our world. The speed of typing and handwriting is usually one word per second, so speaking may be the fastest form of communication with a computer [15]. Applications with voice recognition can also be a very helpful tool for handicapped people who have difficulties with typing. The number of organizations working to widen knowledge of voice communication with computers is still growing. This is very important because each language needs an individual approach when setting up the grammar style or continuous speech [20]. Just by surfing the internet we can see how awkward the keyboard and mouse really are. Much time would be saved if communication with the computer were controlled by the user's voice.

Speech recognition (SR) has been defined as the ability of a computer to understand spoken commands or responses, an important factor in human-computer interaction [19]. SR has been available for many years, but it has not been practical due to the high cost of applications and computing resources. SR has seen significant growth in telephony (voice mail, call center management) and voice-to-text (VTT) applications. The benefits of SR are not small: it increases the efficiency of workers who perform extensive typing (medical or insurance environments), assists people with disabilities, and reduces staffing costs in call centers.

2.1.1 Automatic SR systems

ASR systems can be classified according to [3]:

– speakers

• single speaker

• speaker independent

– speech style

• Isolated Word Recognition (IWR)

• Connected Word Recognition (CWR)

• Continuous Speech Recognition (CSR)


– vocabulary size

If an application is speaker-dependent (single speaker), the user has to train the program to recognize his speech; this type of application has the highest recognition rates. Speaker-independent software uses a default set of discrete sounds, which causes lower recognition rates; it is usually used in telephony applications. In IWR systems the speaker has to make long breaks between words; CWR is similar, but the pauses can be much shorter. In CSR the speaker can speak fluently, without any breaks. This type of system has problems detecting individual words, e.g. in the phrase "I scream for ice cream".

2.1.2 History of speech recognition

The large potential and advantages of speech recognition were noticed by the U.S. Department of Defense as early as the late 1940s. They wanted to create an automatic language translator to intercept and decode Russian messages. The project was a large failure, because creating a program that could recognize speech was too big a task for the slow computers of the time. In 1952 Bell Laboratories created a speech recognition system which could identify the digits 0-9. Seven years later MIT developed a system which identified vowel sounds with 93% accuracy [4]. In 1966 the first system with a 50-word vocabulary was tested. "In the early 1970's the SUR^1 program began to produce results in the form of the HARPY system. This system could recognize complete sentences that consisted of a limited range of grammar structures. This program required massive amounts of computing power to work, 50 state of the art computers" [4]. In 1980 a new standard method for computation, Hidden Markov Models, was developed; currently it is the most successful matching method for large vocabularies [5]. Two years later time-delay neural networks (TDNN) were applied to voice recognition; as it turned out later, they do not provide better results than the stochastic approach. In 1987, a system for building small, medium or large vocabulary applications called SPHINX appeared [1]. In 1997 the first continuous speech dictation software was developed by Dragon Systems [9]. "In 2002, TellMe supplied the first global voice portal, and later that year, NetByTel launched the first voice enabler. This enabled users to fill out a web-based data form over the phone" [14].

In the last year, IBM produced a version of ViaVoice whose accuracy reached 99.9%, even in very loud environments.

2.1.3 Practical side of speech recognition

While we are speaking, our words are converted by the microphone into digital pieces of data. Using this data, the computer has to find out which word was spoken by finding phonemes (linguistic units). The audio data is a stream of amplitudes, sampled about 16,000 times per second. It is a wavy line that repeats periodically while the user is speaking. The data in this form is not useful for speech recognition, because it is too difficult to identify and compare with any patterns that correlate with what the speaker said [18]. To make pattern recognition easier, the digital audio is transformed into the frequency domain. The transformation is done using a Fast Fourier Transform (FFT).

^1 Program at Carnegie Mellon University: Speech Understanding Research.


In the frequency domain, we can identify the frequency components of a sound. The FFT analyzes every 1/100th of a second and converts the audio data into the frequency domain. Each 1/100th of a second results in a graph of the amplitudes of the frequency components. The sound is then recognized by matching it to patterns of sounds.
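The transformation step can be sketched in isolation. The following is a minimal illustrative sketch, not Sphinx-4 code: for clarity it uses a naive DFT rather than an optimized FFT with windowing, and finds the dominant frequency bin in one 10 ms window of 16 kHz audio. All class and method names are hypothetical.

```java
// Illustrative sketch: converting one 10 ms window of 16 kHz audio into the
// frequency domain. A real frontend would use an optimized FFT, but the
// frequency-domain view it produces is the same.
public class DftSketch {
    // Returns the magnitude of each frequency bin for the given samples.
    static double[] magnitudeSpectrum(double[] x) {
        int n = x.length;
        double[] mag = new double[n / 2];
        for (int k = 0; k < n / 2; k++) {
            double re = 0, im = 0;
            for (int t = 0; t < n; t++) {
                double angle = 2 * Math.PI * k * t / n;
                re += x[t] * Math.cos(angle);
                im -= x[t] * Math.sin(angle);
            }
            mag[k] = Math.sqrt(re * re + im * im);
        }
        return mag;
    }

    public static void main(String[] args) {
        int n = 160;                        // 10 ms at 16,000 samples/s
        double[] window = new double[n];
        for (int t = 0; t < n; t++) {       // pure 1 kHz tone
            window[t] = Math.sin(2 * Math.PI * 1000 * t / 16000.0);
        }
        double[] mag = magnitudeSpectrum(window);
        int peak = 0;
        for (int k = 1; k < mag.length; k++) {
            if (mag[k] > mag[peak]) peak = k;
        }
        // Bin k corresponds to k * 16000 / 160 Hz, so a 1 kHz tone peaks at bin 10.
        System.out.println("peak bin = " + peak);   // prints "peak bin = 10"
    }
}
```

Such a magnitude spectrum, computed every 1/100th of a second, is the "graph of the amplitudes of the frequency components" described above.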

Statistical models are used to figure out which phoneme was spoken. Phonemes can be extracted by running the waveform through the Fourier Transform (FT), which allows analysis of the waveform in the frequency domain. The analysis can be carried out using a spectrograph to obtain the phonemes.

Each phoneme can take many different forms, because background noise disrupts speech signals and human voices vary. Hence, training tools are run over hundreds of recordings of each phoneme. The tool analyzes each 1/100th of a second of these hundreds of examples and produces a feature number, which is used to estimate how often that feature appears for the given phoneme.

For instance, for the phoneme 'a' there may be a 60% chance of the feature marked '&50' appearing in any 1/100th of a second, a 45% chance of feature '&150' and a 14% chance of feature '&15'; for the phoneme 'f', every 1/100th of a second may have only a 1% chance of each of the features '&50', '&150' and '&15'. Suppose that during recognition five feature numbers are recorded: 15, 50, 50, 50, 150. The recognizer then computes the probability that the sound was an 'a' versus an 'f' (e.g. 14% * 60% * 60% * 60% * 45% = 1.36% chance of 'a', and 1% * 1% * 1% * 1% * 1% = 0.00000001% chance of 'f').
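The arithmetic of this worked example can be written out directly. The sketch below uses hypothetical names and the probabilities taken from the text; it simply multiplies the per-frame feature probabilities for each phoneme candidate.

```java
// Sketch of the worked example above: score the observed feature sequence
// 15, 50, 50, 50, 150 against per-feature probabilities for 'a' and 'f'.
import java.util.Map;

public class PhonemeScore {
    // Multiply the per-frame feature probabilities for one phoneme.
    static double score(Map<Integer, Double> featureProbs, int[] observed) {
        double p = 1.0;
        for (int feature : observed) {
            p *= featureProbs.getOrDefault(feature, 0.0);
        }
        return p;
    }

    public static void main(String[] args) {
        int[] observed = {15, 50, 50, 50, 150};
        // Probabilities from the text: 'a' -> &50: 60%, &150: 45%, &15: 14%
        Map<Integer, Double> a = Map.of(50, 0.60, 150, 0.45, 15, 0.14);
        // 'f' -> 1% for every feature
        Map<Integer, Double> f = Map.of(50, 0.01, 150, 0.01, 15, 0.01);
        System.out.printf("P(a) = %.6f%n", score(a, observed)); // prints "P(a) = 0.013608"
        System.out.printf("P(f) = %.1e%n", score(f, observed)); // prints "P(f) = 1.0e-10"
    }
}
```

The 1.36% vs 0.00000001% comparison in the text is exactly this product: 'a' wins by roughly eight orders of magnitude.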

After extracting the phonemes from the data, the computer can convert them into words, and the words into sentences. The most popular method is the Hidden Markov Model (HMM)^2, where a phoneme is represented by a statistical model (in Sphinx-4, Viterbi^3 decoding is used, with full forward decoding in special cases). During the formation of this acoustic model, the speech signals and input data are changed into a sequence of vectors (features) which represent the signal. Those features are scored against the acoustic model; the scores indicate how similar a particular set of features is to the acoustic model of the phoneme. The purpose of the HMM in this example is to find the best possible sequence of units that fits the given input speech [18].
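As a concrete illustration of the idea, here is a minimal Viterbi decoder over a toy two-state HMM. This is an assumption-laden sketch, not the Sphinx-4 decoder, which searches far larger HMM networks of phoneme states with beam pruning; all names and the toy model are invented for illustration.

```java
// Minimal Viterbi sketch over a toy two-state HMM: find the most likely
// hidden-state sequence for a sequence of observed symbols.
public class ViterbiSketch {
    static int[] viterbi(double[] start, double[][] trans,
                         double[][] emit, int[] obs) {
        int nStates = start.length, T = obs.length;
        double[][] v = new double[T][nStates];   // best path probability
        int[][] back = new int[T][nStates];      // backpointers
        for (int s = 0; s < nStates; s++) {
            v[0][s] = start[s] * emit[s][obs[0]];
        }
        for (int t = 1; t < T; t++) {
            for (int s = 0; s < nStates; s++) {
                for (int prev = 0; prev < nStates; prev++) {
                    double p = v[t - 1][prev] * trans[prev][s] * emit[s][obs[t]];
                    if (p > v[t][s]) { v[t][s] = p; back[t][s] = prev; }
                }
            }
        }
        int[] path = new int[T];
        for (int s = 1; s < nStates; s++) {      // best final state
            if (v[T - 1][s] > v[T - 1][path[T - 1]]) path[T - 1] = s;
        }
        for (int t = T - 1; t > 0; t--) path[t - 1] = back[t][path[t]];
        return path;
    }

    public static void main(String[] args) {
        // Toy model: state 0 mostly emits symbol 0, state 1 mostly symbol 1,
        // and states tend to persist (0.8 self-transition).
        double[] start = {0.5, 0.5};
        double[][] trans = {{0.8, 0.2}, {0.2, 0.8}};
        double[][] emit = {{0.9, 0.1}, {0.1, 0.9}};
        int[] path = viterbi(start, trans, emit, new int[]{0, 0, 1, 1});
        System.out.println(java.util.Arrays.toString(path)); // prints "[0, 0, 1, 1]"
    }
}
```

In a recognizer, the hidden states would be HMM phoneme states and the observations would be the scored feature vectors described above.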

2.2 Introduction to SPHINX

Sphinx-4 is an open source speech recognition system written in the Java programming language. It has been designed by Carnegie Mellon University, Sun Microsystems and Mitsubishi Electric Research Laboratories. Over the last few years the ability of Sphinx-4 to perform multistream decoding has been improving. Factors such as unexpected environmental noise of different levels and types, portability across a growing number of platforms, conformance to different resource requirements, and restructuring of the system architecture drive the continual improvement of the system [18].

^2 A hidden Markov model (HMM) is a statistical model where the system being modeled is assumed to be a Markov process with unknown parameters, and the challenge is to determine the hidden parameters from the observable parameters, based on this assumption. The extracted model parameters can then be used to perform further analysis, for example in pattern recognition applications.

^3 The Viterbi algorithm, named after its developer Andrew Viterbi, is a dynamic programming algorithm for finding the most likely sequence of hidden states, known as the Viterbi path, that results in a sequence of observed events, especially in the context of hidden Markov models. The forward algorithm is a closely related algorithm for computing the probability of a sequence of observed events. (Taken from Wikipedia.)


Figure 2.1: STEP's interface (taken from [13])

The design of Sphinx-4 makes the system portable and easily extensible (each component can be changed at any time), and some modules have been created to cope with different types of noise.

2.3 Introduction to STEP

The STEP system provides natural language interfaces to databases: it takes queries and gives answers in a natural language. STEP is a system dedicated to providing conversational access to standard and relational databases. A full demonstration of STEP over a geography database has been continuously available for anonymous querying at www.cs.umu.se/~mjm. STEP's interface is very simple (see Fig. 2.1): a user types a query in the input field, and the reply is given in an area below.

Currently STEP is about 10,000 lines of LISP code running as a server [10]. The advantage of STEP's phrasal approach is that it avoids many of the difficulties associated with ambiguity in large-scale domain-independent grammars and maps directly to the underlying database relations. Additionally, via fudging operations, STEP finds acceptable parses for many non-grammatical inputs, a common occurrence in practice. Finally, a phrasal approach allows for "easier specification of idiomatic and idiosyncratic domain language" [10]. A disadvantage of the STEP approach is that a specific phrasal lexicon must be authored for each new database.

2.3.1 Spoken Systems expectations

1. Most questions likely to appear must be understood by the system.


2. The answers must be correct and optimal.

3. When a question is not successfully parsed, the system must indicate the nature of the error: lexical, syntactic, semantic or factual.

4. The response time must be acceptable.

5. Command language systems use a natural language for making commands which some receiving system will execute.

6. Database query systems translate natural language queries into formal database queries.

2.4 Introduction to text to speech generation

2.4.1 Speech generation

Speech generation is the artificial production of human speech. A system used for this purpose is called a speech synthesizer. Such systems are usually called text-to-speech (TTS) systems because of their ability to convert text into speech. A text-to-speech system is composed of two parts: a front end and a back end.

– The front end takes input in the form of text and outputs a symbolic linguisticrepresentation.

– The back end takes the symbolic linguistic representation as input and outputsthe synthesized speech waveform.

There are three main technologies used for generating synthetic speech waveforms: concatenative synthesis, formant synthesis and articulatory synthesis.

1. Concatenative synthesis is based on the concatenation (stringing together) of segments of recorded speech. Generally, concatenative synthesis gives the most natural-sounding synthesized speech. The synthesizer has a database of speech samples, which are used during synthesis. This type of synthesis usually demands larger amounts of memory and processor power.

2. Formant synthesis does not use any human speech samples at runtime. Instead, the output synthesized speech is created using an acoustic model. Parameters such as fundamental frequency, voicing, and noise levels are varied over time to create a waveform of artificial speech. This method is sometimes called rule-based synthesis. Many systems using formant synthesis generate robotic-sounding speech. Formant-synthesized speech is usually very reliably intelligible, even at very high speeds; high-speed synthesized speech is often used by the visually impaired for quickly navigating computers with a screen reader. A formant synthesizer is often a smaller program than a concatenative one, because it does not have a database of speech samples, so it can be used in systems with little memory space and processor power.

3. Articulatory synthesis is a method used mostly in academic environments. It is based on computational models of the human vocal tract and the articulation processes occurring there. Few of these models are currently sufficiently advanced


to be used in commercial speech synthesis systems. One example is the NeXT-based system, first marketed in 1994, which provides full articulatory-based text-to-speech conversion using a waveguide or transmission-line analog of the human oral and nasal tracts, controlled by Carré's Distinctive Region Model. That model is, in turn, based on work on formant sensitivity analysis by Fant and others at the Stockholm Speech Technology Lab of the Royal Institute of Technology. This work showed that the formants in a resonant tube can be controlled by just eight parameters that correspond closely to the naturally available articulators in the human vocal tract.

2.4.2 FESTIVAL

Festival is a general multi-lingual speech synthesis system developed at the Centre for Speech Technology Research (CSTR) at the University of Edinburgh. Festival is multi-lingual (currently English, Welsh and Spanish). A number of voices are currently available in Festival. Each is selected via a function named 'voice_*', which sets up the waveform synthesizer, phone set, lexicon, duration and intonation models (and anything else necessary) for that speaker. These voice setup functions are defined in 'lib/voices.scm'.

Festival supports the notion of text modes, where the text file type may be identified, allowing Festival to process the file in an appropriate way. Currently only two types are considered stable, STML and raw, but other types such as HTML, LaTeX, etc. are being developed. The current voice functions are:

1. voice_rab_diphone: A British English male RP speaker, Roger. The lexicon is the computer users' version of the Oxford Advanced Learners' Dictionary, with letter-to-sound rules trained from that lexicon.

2. voice_ked_diphone: An American English male speaker, Kurt. This uses the CMU lexicon, and letter-to-sound rules trained from it. Intonation, as with Roger, is trained from the Boston University FM Radio corpus.

3. voice_kal_diphone: An American English male speaker. Like ked, it uses the CMU lexicon, and letter-to-sound rules trained from it. Intonation, as with Roger, is trained from the Boston University FM Radio corpus.

4. voice_don_diphone: Steve Isard's LPC-based diphone synthesizer, Donovan diphones. The other parts of this voice (lexicon, intonation and duration) are the same as in voice_rab_diphone described above. Although the quality is not as good, it is much faster and the database is much smaller than the others.

5. voice_el_diphone: A male Castilian Spanish speaker, using the Eduardo Lopez diphones.

6. voice_gsw_diphone: This offers a male RP speaker, Gordon, famed for many previous CSTR synthesizers, using the standard diphone module. Its higher levels are very similar to the Roger voice above.


To run Festival you need a Unix machine: Suns (SunOS and Solaris), FreeBSD, Linux, SGIs, HPs and DEC Alphas are supported. For audio hardware, /dev/audio (8-bit and 16-bit on Suns, Linux and FreeBSD) and NCD's NAS network-transparent audio system are supported directly, but Festival also supports the execution of any Unix command that can play audio files.

2.4.3 Weakness of FESTIVAL

The system is quite slow. Although machines are getting faster, it still takes too long to start the system and get it to speak some given text. Even so, on reasonable machines, Festival can generate speech several times faster than it takes to say it; but even if it is five times faster, it will still take 2 seconds to generate a 10-second utterance.

The signal quality of the voices is not very good by today's standard of synthesizers, even given the quality improvements since the last release.


Chapter 3

Integration and Operation of SPHINX

3.1 Robustness

This thesis began with a very preliminary integration of STEP and SPHINX. The first problem concerned the version of the JDK and also the settings of the microphone. We solved this problem by turning on every channel and setting microphone 1 to 100 percent.

The first part of my work was to improve the accuracy of the speech recognition system. This was achieved by making changes in a configuration file. I added the following components and properties, which decreased out-of-grammar occurrences, optimized the grammar, made the end pointer less sensitive (that is, mark less audio as speech), prevented over-threading of the scoring (which could happen if the number of threads is high compared to the size of the active list), and set the amount of silence to be considered as utterance end:

property name="silenceInsertionProbability" value="40"
    The silence insertion probability property.

property name="minScoreablesPerThread" value="10"
    A Sphinx property that controls the minimum number of scoreables sent to a thread. This is used to prevent over-threading of the scoring, which could happen if the number of threads is high compared to the size of the active list. The default is 50.

property name="outOfGrammarProbability" value="1E-21"
    Sphinx property for the probability of entering the out-of-grammar branch.

property name="addOutOfGrammarBranch" value="true"
property name="phoneInsertionProbability" value="1E-21"
    Properties that decrease out-of-grammar occurrences.

property name="optimizeGrammar" value="true"
    Property to control whether grammars are optimized or not.

property name="threshold" value="10"
    The Sphinx property specifying the threshold. If the current signal level is greater than the background level by this threshold, then the current signal is marked as speech. Therefore, a lower threshold makes the end pointer more sensitive, that is, it marks more audio as speech; a higher threshold makes the end pointer less sensitive, that is, it marks less audio as speech.

property name="endSilence" value="300"
    The Sphinx property for the amount of time in silence (in milliseconds) to be considered as utterance end.

property name="startSpeech" value="600"
    The Sphinx property for the minimum amount of time in speech (in milliseconds) to be considered as utterance start.

property name="minScoreablesPerThread" value="50"
    A Sphinx property that controls the minimum number of scoreables sent to a thread. This is used to prevent over-threading of the scoring that could happen if the number of threads is high compared to the size of the active list.

property name="mergeSpeechSegments" value="true"
    The Sphinx property that controls whether to merge discontinuous speech segments (and the non-speech segments between them) in an utterance into one big segment (true), or to treat the individual speech segments as individual utterances (false).
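For concreteness, the fragment below sketches how such properties might appear inside a Sphinx-4 config.xml. The component names and the assignment of properties to components are assumptions based on the standard Sphinx-4 component classes, and may differ from the exact configuration used in this project.

```xml
<!-- Sketch (not the project's actual file): Sphinx-4 properties grouped by
     the components that would typically own them. Component names such as
     "flatLinguist" and "speechMarker" are illustrative. -->
<config>
    <!-- Grammar search space: out-of-grammar handling and optimization -->
    <component name="flatLinguist"
               type="edu.cmu.sphinx.linguist.flat.FlatLinguist">
        <property name="addOutOfGrammarBranch" value="true"/>
        <property name="outOfGrammarProbability" value="1E-21"/>
        <property name="phoneInsertionProbability" value="1E-21"/>
        <property name="silenceInsertionProbability" value="40"/>
        <property name="optimizeGrammar" value="true"/>
    </component>

    <!-- Endpointing: when audio counts as speech, when an utterance ends -->
    <component name="speechClassifier"
               type="edu.cmu.sphinx.frontend.endpoint.SpeechClassifier">
        <property name="threshold" value="10"/>
    </component>
    <component name="speechMarker"
               type="edu.cmu.sphinx.frontend.endpoint.SpeechMarker">
        <property name="startSpeech" value="600"/>
        <property name="endSilence" value="300"/>
    </component>
    <component name="nonSpeechDataFilter"
               type="edu.cmu.sphinx.frontend.endpoint.NonSpeechDataFilter">
        <property name="mergeSpeechSegments" value="true"/>
    </component>

    <!-- Scoring: avoid over-threading when the active list is small -->
    <component name="threadedScorer"
               type="edu.cmu.sphinx.decoder.scorer.ThreadedAcousticScorer">
        <property name="minScoreablesPerThread" value="10"/>
    </component>
</config>
```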

3.2 Implementation of ‘barge in’

The second step was to interrupt the Festival process. While Festival was reading an answer, the user could not break off the reading by voice. I implemented this by intercepting the recognized speech and comparing it with a set of words used to kill the process:

if (resultText.equals("ok") || resultText.equals("enough")
        || resultText.equals("abort")) {
    Process lsProc = Runtime.getRuntime().exec("pkill audsp");
    BufferedReader lsIn = new BufferedReader(
            new InputStreamReader(lsProc.getInputStream()));
    String lsStr;
    while ((lsStr = lsIn.readLine()) != null)
        System.out.println(lsStr);
}

3.3 Architecture of Sphinx-4

The Sphinx-4 architecture has been designed in such a way that we can change each module of it. The code is very modular, with easily replaceable functions. Blocks like the frontend and the decoder are independently replaceable. The main blocks in the Sphinx-4 architecture are the frontend, the decoder and the knowledge base (KB) [6].

3.3.1 The Frontend module

The Frontend module is made up of several communicating blocks. Each block has an input and an output. Each input is connected to the output of its predecessor and inspects the incoming data to determine whether it is speech data or a control signal marking the beginning or end of speech. This design allows the system to be used in live mode, and allows each output to be read. The actual input to the system does not have to be at the first block, but can be at any of the blocks.

Sphinx-4 allows users to run the system on speech signals, different bispectra, spectra, cepstra, etc. Thus, other kinds of features, such as auditory representations, can be plugged in. Additional blocks can also be put between any two blocks [7].

The system has four modes of operation. First, the system can run continuously on a stream of input speech. Second, the system can perform endpointing, determining both the beginning and ending endpoints of a speech segment automatically. Third, the user gives the beginning of a speech segment and the system determines automatically when the speech ends. Fourth, the user gives both the beginning and the end of a speech segment. Endpoint detection is performed by an algorithm that compares three energy threshold levels: two of them describe when speech is starting, and one determines the end of speech. The end pointer thus detects the start and end of speech; non-speech segments are not sent to the decoder [7].
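The chaining idea above (each block's input is its predecessor's output, and stages can be swapped or inserted freely) can be sketched in a few lines. The stage names and the double[] "signal" type below are illustrative only, not Sphinx-4 classes:

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

/** Toy illustration of the FrontEnd pipeline idea. */
public class FrontEndSketch {

    /** Simple pre-emphasis filter: y[i] = x[i] - 0.97 * x[i-1]. */
    public static double[] preEmphasize(double[] x) {
        double[] y = new double[x.length];
        y[0] = x[0];
        for (int i = 1; i < x.length; i++) y[i] = x[i] - 0.97 * x[i - 1];
        return y;
    }

    /** Peak normalization: scale so the largest magnitude becomes 1. */
    public static double[] normalize(double[] x) {
        double max = 0;
        for (double v : x) max = Math.max(max, Math.abs(v));
        double[] y = new double[x.length];
        for (int i = 0; i < x.length; i++) y[i] = (max == 0) ? 0 : x[i] / max;
        return y;
    }

    public static void main(String[] args) {
        // a replaceable pipeline: insert, remove or reorder stages at will
        List<Function<double[], double[]>> pipeline =
                List.of(FrontEndSketch::preEmphasize, FrontEndSketch::normalize);
        double[] data = {1.0, 2.0, 3.0};
        for (Function<double[], double[]> stage : pipeline)
            data = stage.apply(data);
        System.out.println(Arrays.toString(data));
    }
}
```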

3.3.2 The decoder

The decoder block is made up of three modules: the search manager, the acoustic scorer and the linguist.

1. Search Manager
The main task of the search manager is to construct and search a tree of possibilities for the best result. To do this, the module communicates with the acoustic scorer to obtain acoustic scores for incoming data and with the linguist to obtain the needed information.

The search manager keeps a complete history of all active paths in the search. It creates tokens that contain the overall acoustic and language scores of the path at a given point. Each token has an input feature frame identification, a reference to the previous token, and a reference to a SentenceHMM [6], which is a directed state graph where each state in the graph represents a unit of speech. The SentenceHMM reference allows the search manager to relate a token to its senone, word, grammar state and pronunciation [8].

Searching with the help of tokens and the SentenceHMM can be executed in two ways. The first, called depth-first, expands the most promising tokens sequentially in time, so the paths from the first token to the currently active token can be of different lengths. For the second way, called breadth-first, Sphinx-4 uses a Viterbi algorithm [16]: all active tokens are expanded synchronously, making the paths from the first token to the currently active tokens equally long [18].


2. Linguist
The Linguist provides a set of interfaces and classes that are used to define the search graph created by the decoder. Implementations of the Linguist interface are used by the decoder to create a search graph: a directed graph formed by SearchState and SearchStateArc objects. Some implementations of the Linguist build the search graph based upon a Grammar, which represents a graph of words and probabilities. This package provides a number of different implementations of Grammar. A grammar is a directed graph where each node represents a set of words that may be spoken at a particular time. The nodes are connected by arcs, which give the probability of transiting from one node to another. Sphinx-4 has three grammar formats: a word list grammar loader (which generates a flat grammar from a list of words; this loader has been used in this thesis), an N-gram model loader and a finite state transducer (FST) loader [8].

3. Acoustic scorer
The acoustic scorer provides a mechanism for scoring a set of HMM states. Its task is to compute the state output probability for the various states for any given input [18].
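The breadth-first search described for the search manager is essentially a time-synchronous Viterbi pass: all active tokens advance together at each frame. A toy sketch over a hypothetical two-state HMM follows; the probabilities and the integer observation encoding are made up for illustration, and the real search expands tokens over a SentenceHMM rather than fixed arrays:

```java
import java.util.Arrays;

/** Minimal time-synchronous (breadth-first) Viterbi sketch over a toy HMM. */
public class ViterbiSketch {

    /** Returns the most likely state sequence for the observations. */
    public static int[] bestPath(double[] init, double[][] trans,
                                 double[][] emit, int[] obs) {
        int n = init.length, t = obs.length;
        double[][] score = new double[t][n];   // log-probability of best path
        int[][] back = new int[t][n];          // back-pointers for traceback
        for (int s = 0; s < n; s++)
            score[0][s] = Math.log(init[s]) + Math.log(emit[s][obs[0]]);
        // expand all active states synchronously at each time step
        for (int i = 1; i < t; i++) {
            for (int s = 0; s < n; s++) {
                double best = Double.NEGATIVE_INFINITY;
                int arg = 0;
                for (int p = 0; p < n; p++) {
                    double v = score[i - 1][p] + Math.log(trans[p][s]);
                    if (v > best) { best = v; arg = p; }
                }
                score[i][s] = best + Math.log(emit[s][obs[i]]);
                back[i][s] = arg;
            }
        }
        int last = 0;
        for (int s = 1; s < n; s++)
            if (score[t - 1][s] > score[t - 1][last]) last = s;
        int[] path = new int[t];
        path[t - 1] = last;
        for (int i = t - 1; i > 0; i--) path[i - 1] = back[i][path[i]];
        return path;
    }

    public static void main(String[] args) {
        double[] init = {0.6, 0.4};
        double[][] trans = {{0.7, 0.3}, {0.4, 0.6}};
        double[][] emit = {{0.9, 0.1}, {0.2, 0.8}};
        System.out.println(Arrays.toString(
                bestPath(init, trans, emit, new int[]{0, 0, 1})));  // [0, 0, 1]
    }
}
```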

3.3.3 Providing STEP with Sphinx-4

Sphinx-4 has been chosen to connect with the STEP system since it is faster and more accurate than previous versions. Virtues like being open source, platform independence, and the use of the Java Speech Grammar Format (used by speech recognizers to determine what the recognizer should listen to) mean that Sphinx-4 meets our expectations. The figure (see Fig. 3.1) shows how STEP and the speech recognition system are connected. It works as follows:

An XML-based configuration file (name.config.xml) is used to define the names and types of all the components of the system, the connectivity of these components, and the configuration of each of them.

The system turns on the microphone (microphone.startRecording()). After the microphone is turned on successfully, the program enters a loop that repeats the following. The system tries to recognize what the user is saying using the Recognizer.recognize() method. Recognition is stopped when the user stops speaking, which is detected by the end pointer. The speech signals are transformed into a sequence of vectors (features) by the front end component. The vectors are decoded after their last feature. The task of speech recognition is to find the best sequence of units or words; in Sphinx this problem is solved by the search manager, which uses language models, dictionaries and grammars to create an HMM that returns the best result among all candidates.

When an utterance is recognized, the result is returned as text by Result.getBestResultNoFiller(). The result is sent to the socket server as a text message, and then the message is sent to the STEP system. The result of the STEP action is sent to Festival and the answer is read aloud. Each utterance is checked for key words that control the command lines of STEP, for instance:

if (resultText.equals("shutdown")) {
    Process proc = Runtime.getRuntime().exec("pkill audsp");
}
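The hand-off of the recognized text over a socket can be sketched as follows. The one-line-per-message protocol and the echoing stand-in server below are assumptions for illustration; they are not the actual STEP protocol:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

/** Sketch of the text hand-off between the recognizer and STEP. */
public class SocketSketch {

    /** Send one utterance to host:port and return the server's reply line. */
    public static String sendUtterance(String host, int port, String text)
            throws IOException {
        try (Socket socket = new Socket(host, port);
             PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(socket.getInputStream()))) {
            out.println(text);     // ship the recognizer's best result
            return in.readLine();  // answer to be read aloud by Festival
        }
    }

    /** Round trip against a local stand-in server, returning the reply. */
    public static String demo(String utterance) throws Exception {
        try (ServerSocket server = new ServerSocket(0)) {
            Thread t = new Thread(() -> {
                try (Socket s = server.accept();
                     BufferedReader in = new BufferedReader(
                             new InputStreamReader(s.getInputStream()));
                     PrintWriter out = new PrintWriter(s.getOutputStream(), true)) {
                    out.println("answer to: " + in.readLine());
                } catch (IOException ignored) { }
            });
            t.start();
            String reply = sendUtterance("localhost", server.getLocalPort(), utterance);
            t.join();
            return reply;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo("Cities in Germany"));  // answer to: Cities in Germany
    }
}
```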


Figure 3.1: Graph of interaction between STEP and SPHINX


3.4 Configuration of Sphinx-4 components

The configuration manager has two purposes: the first is to determine which components will be used in the system; the second is to determine the details of the configuration of those components. The configuration manager makes it possible to reconfigure Sphinx-4 to use different types of components. For instance, Sphinx-4 is normally configured with a FrontEnd that produces mel frequency cepstral coefficients (MFCCs), but it is possible to change the component and use a different FrontEnd.

The configuration of a Sphinx-4 system is determined by a configuration file. The configuration file defines the names and types of all of the components of the system, the connectivity of these components (which components talk to each other) and the detailed configuration for each of these components. The configuration file of this project is called name.config.xml.

For instance, we can define two components, one of type Component1 and the second of type Component2, as follows:

<config>
    <component name="Comp" type="edu.cmu.sphinx.sample.Component1"/>
    <component name="anotherComp" type="edu.cmu.sphinx.sample.Component2"/>
</config>

Each component can have properties. We can define them inside the component block as follows:

<component name="searchManager"
           type="edu.cmu.sphinx.decoder.search.SimpleBreadthFirstSearchManager">
    <property name="logMath" value="logMath"/>
    <property name="linguist" value="flatLinguist"/>
</component>

The elements and attributes of the configuration file are described below (taken from [17]):

<config>
    Attributes: none. Sub-elements: <component>, <property>, <propertylist>.
    The top-level element. It can have any number of component, property and propertylist sub-elements.

<component>
    Attributes: name (the component name), type (the component type). Sub-elements: <property>, <propertylist>.
    Defines an instance of a component. This element must always have the name and type attributes.

<property>
    Attributes: name (the property name), value (the value of the property). Sub-elements: none.
    Used to define a single property of a component or a global system property. This element must always have the name and value attributes.

<propertylist>
    Attributes: name (the name of the property list). Sub-elements: <item>.
    Used to define a list of strings or components. This element must always have the name attribute. It can have any number of item sub-elements.

<item>
    Attributes: none. Sub-elements: none.
    The contents of this element define a string or a component name.


When a configuration file is loaded, the configuration manager looks for errors, and if at least one error is detected the process is interrupted. Some of the errors that can be detected are [17]:

1. Invalid XML - the file is not a valid XML file.

2. Unknown XML elements - there are unknown elements in the file.

3. Missing, extra or unknown XML attributes - an element has been given the wrong number of attributes.

4. Multiply defined properties - a property for a component has been defined more than once.

5. Bad data type for a property - a given value cannot be converted to the declared type of the property.

6. Multiply defined components - a component can be defined only once.

7. Out-of-range data for a component - the given value for a component property is out of range.
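One of these checks, flagging multiply defined components, can be sketched with the standard Java XML parser. This is an illustration of the check, not Sphinx-4's actual validator:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Sketch of detecting multiply defined components in a config file. */
public class ConfigCheck {

    /** Returns the names of components that are defined more than once. */
    public static Set<String> duplicateComponents(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList comps = doc.getElementsByTagName("component");
        Set<String> seen = new HashSet<>(), dups = new HashSet<>();
        for (int i = 0; i < comps.getLength(); i++) {
            String name = ((Element) comps.item(i)).getAttribute("name");
            if (!seen.add(name)) dups.add(name);  // name was already defined
        }
        return dups;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<config>"
                + "<component name=\"comp\" type=\"a.B\"/>"
                + "<component name=\"comp\" type=\"a.C\"/>"
                + "</config>";
        System.out.println(duplicateComponents(xml));  // [comp]
    }
}
```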

3.5 Grammar

Speech recognition systems provide computers with the ability to listen to user speech and determine what is said. Current technology does not yet support unconstrained speech recognition, the ability to listen to any speech in any context and transcribe it accurately. To achieve reasonable recognition accuracy and response time, current speech recognizers constrain what they listen to by using grammars.

The Java Speech Grammar Format (JSGF) defines a platform-independent way of describing one type of grammar, a rule grammar. It uses a textual representation that is readable and editable by both developers and computers, and can be included in Java source code.

A rule grammar specifies the types of utterances a user might say. For example, asimple grammar might listen for “Cities in Sweden” or “Provinces of Sweden”.

What the user can say depends upon the context: is the user controlling an email application, reading a credit card number, or selecting a font? Applications know the context, so applications are responsible for providing a speech recognizer with appropriate grammars. In our example we have only one type of grammar.

To specify JSGF grammars, I had to add the following lines of code:

<component name="flatLinguist" type="edu.cmu.sphinx.linguist.flat.FlatLinguist">
    <property name="grammar" value="jsgfGrammar"/>
</component>

<component name="jsgfGrammar" type="edu.cmu.sphinx.jsapi.JSGFGrammar">
    <property name="grammarLocation" value="URL of grammar directory"/>
</component>


3.5.1 Structure of grammar

The grammar body defines rules. Each rule is defined in a rule definition, and a rule is defined only once in a grammar. The structure of a definition is as follows:

public <ruleName> = ruleExpansion;

The word ‘public’ is an optional public declaration. White space in the grammar file before the definition, and between the public keyword and the rule name, is ignored. White space is significant within the rule expansion.

The rule expansion defines how the rule may be spoken. It is a logical combination of tokens (text that may be spoken) and references to other rules. The term "expansion" is used because an expansion defines how a rule is expanded when it is spoken - a single rule may expand into many spoken words plus other rules which are themselves expanded. Any rule in a grammar may be declared as public by use of the public keyword. A public rule has three possible uses:

1. It can be referenced within the rule definitions of another grammar by its fully-qualified name or with an import declaration.

2. It can be used as an active rule for recognition. That is, the rule can be used bya recognizer to determine what may be spoken.

3. It can be referenced locally: that is, by any public or non-public rule defined inthe same grammar.

A rule may be defined as a set of alternative expansions separated by vertical bar characters ‘|’ and optionally by white space. For example:

<COUNTRYname> = POLAND | SWEDEN | GERMANY | <otherCountriesNames>;
<TEMPERATURE> = HOT | COLD;

A rule may be defined by a sequence of expansions.

<statement> = in <COUNTRYname> is <TEMPERATURE>;

Any legal expansion may be explicitly grouped using matching parentheses ‘()’. Grouping has high precedence and so can be used to ensure the correct interpretation of rules. It is also useful for improving clarity. For example, because sequences have higher precedence than alternatives, parentheses are required in the following rule definition so that "Cities in SWEDEN" and "Cities in POLAND" are legal.

<action> = Cities in (SWEDEN | POLAND);

* (Kleene star): a rule expansion followed by the asterisk symbol indicates that the expansion may be spoken zero or more times. For example,

<numbers> = (one|two|three|four)*;

So, it allows a user to say ”one one four four”.
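Putting these constructs together, a complete grammar file for the kind of queries used in this project might look as follows. The grammar name and word lists here are illustrative, modeled on the examples above; they are not the project's actual grammar file:

```
#JSGF V1.0;
grammar world;

public <query> = (Cities | Provinces | Capitals | Villages) in <country>;
<country> = SWEDEN | POLAND | GERMANY | FRANCE;
```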


3.5.2 Dates and numbers

A problem we were faced with was the reading and interpreting of dates. In English, it is possible to say a given date in more than one way. For example, 1908 can be read as nineteen eight, nineteen hundred and eight, or even nineteen oh eight. The problem is with interpreting it: if nineteen eight is said, we are not entirely sure if the intended date is 1908 or maybe 198. This problem was solved by implementing special grammar rules which we created for this purpose. Here is our solution:

Digit    -> "one" | 1 | "two" | 2 | ... | "nine" | 9
1Digit   -> Digit | "zero" | 0
2Digit   -> 1Digit * 1Digit | "eleven" | "twelve" | ... | "twenty" |
            "twenty" * Digit | ... | "ninety" * Digit
LT3Digit -> 1Digit | 2Digit
3Digit   -> 1Digit * 1Digit * 1Digit |
            1Digit * "hundred" * ["and"] * LT3Digit |
            1Digit * "hundred" |
            ["a"] * "hundred"
LT4Digit -> 1Digit | 2Digit | 3Digit
4Digit   -> 1Digit * [","] * 1Digit * 1Digit * 1Digit |
            1Digit * "thousand" * ["and"] * LT4Digit |
            1Digit * "thousand" |
            ["a"] * "thousand" |
            2Digit * 2Digit |
            2Digit * "hundred" * ["and"] * 2Digit

With the grammar described above we can describe all numbers up to 9999, but if we want to operate with bigger numbers we have to expand the grammar as follows:

5Digit   -> 1Digit * 1Digit * [","] * 1Digit * 1Digit * 1Digit
5Digit   -> 2Digit * "thousand" * ["and"] * LT4Digit
5Digit   -> 2Digit * "thousand"
6Digit   -> 1Digit * 1Digit * 1Digit * [","] * 1Digit * 1Digit * 1Digit
6Digit   -> 3Digit * "thousand" * ["and"] * LT4Digit
6Digit   -> 3Digit * "thousand"
LT7Digit -> 6Digit | 5Digit | 4Digit | 3Digit | 2Digit | 1Digit
7Digit   -> 1Digit * [","] * 1Digit * 1Digit * 1Digit * [","] * 1Digit *
            1Digit * 1Digit |
            1Digit * "million" * ["and"] * LT7Digit |
            1Digit * "million" |
            ["a"] * "million"

The pattern is fairly clear at this point, and we may continue to a billion, a trillion or a quadrillion, etc. [12]
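Once a phrase matches such rules, its value still has to be computed. A minimal sketch of that step, using the standard accumulate-and-scale algorithm, is given below; it covers the "hundred"/"thousand"/"million" forms but not every reading the grammar above allows (digit-pair forms like "nineteen eight" are omitted), and the class is a hypothetical helper, not part of the project:

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: convert spoken number phrases such as
 *  "nineteen hundred and eight" into integers. */
public class SpokenNumberParser {
    private static final Map<String, Integer> SMALL = new HashMap<>();
    private static final Map<String, Integer> SCALE = new HashMap<>();
    static {
        String[] units = {"zero", "one", "two", "three", "four", "five", "six",
                "seven", "eight", "nine", "ten", "eleven", "twelve", "thirteen",
                "fourteen", "fifteen", "sixteen", "seventeen", "eighteen",
                "nineteen"};
        for (int i = 0; i < units.length; i++) SMALL.put(units[i], i);
        String[] tens = {"twenty", "thirty", "forty", "fifty", "sixty",
                "seventy", "eighty", "ninety"};
        for (int i = 0; i < tens.length; i++) SMALL.put(tens[i], (i + 2) * 10);
        SCALE.put("thousand", 1000);
        SCALE.put("million", 1000000);
    }

    /** Small numbers add into the running group; "hundred" multiplies the
     *  group; "thousand"/"million" close the group into the total. */
    public static int parse(String phrase) {
        int total = 0, current = 0;
        for (String w : phrase.toLowerCase().split("\\s+")) {
            if (w.equals("and") || w.equals("a")) continue;
            if (w.equals("hundred")) {
                current = (current == 0 ? 1 : current) * 100;
            } else if (SCALE.containsKey(w)) {
                total += (current == 0 ? 1 : current) * SCALE.get(w);
                current = 0;
            } else if (SMALL.containsKey(w)) {
                current += SMALL.get(w);
            } else {
                throw new IllegalArgumentException("unknown word: " + w);
            }
        }
        return total + current;
    }

    public static void main(String[] args) {
        System.out.println(parse("nineteen hundred and eight"));  // 1908
        System.out.println(parse("two thousand and five"));       // 2005
        System.out.println(parse("a hundred and twenty three"));  // 123
    }
}
```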

3.5.3 New words in dictionary

The dictionary is available in the Sphinx folder:
edu\cmu\sphinx\model\acoustic\WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz\dict
There are 129255 words and their phonemes included in the dictionary. Before the speech recognition, Sphinx-4 checks words and their order according to the grammar


and builds each possible way of uttering them. After an utterance, Sphinx-4 uses the HMM to establish which word has been spoken: it takes phonemes from the dictionary file and checks the probability of different words. The structure of the dictionary is as follows:

CITIES       S IH T IY Z
CITIES'      S IH T IY Z
CITING       S AY T IH NG
CITISTEEL    S IH T IY S T IY L
CITIZEN      S IH T AH Z AH N
CITIZEN(2)   S IH T IH Z AH N
CITIZEN'S    S IH T AH Z AH N Z
CITIZENRY    S IH T IH Z AH N R IY
CITIZENS     S IH T AH Z AH N Z
CITIZENS(2)  S IH T IH Z AH N Z
CITIZENS'    S IH T IH Z AH N Z
CITIZENSHIP  S IH T IH Z AH N SH IH P
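Entries of this form can be read into a word-to-phonemes map with a few lines of code. The sketch below is a hypothetical helper for illustration, not Sphinx-4's dictionary loader; alternate pronunciations keep their "(2)" suffix as part of the key:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch: parse dictionary lines "WORD PH PH PH" into word -> phonemes. */
public class DictSketch {

    public static Map<String, List<String>> parse(List<String> lines) {
        Map<String, List<String>> dict = new LinkedHashMap<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length < 2) continue;  // skip malformed lines
            dict.put(parts[0], Arrays.asList(parts).subList(1, parts.length));
        }
        return dict;
    }

    public static void main(String[] args) {
        Map<String, List<String>> d = parse(List.of(
                "CITIES S IH T IY Z",
                "CITIZEN S IH T AH Z AH N",
                "CITIZEN(2) S IH T IH Z AH N"));
        System.out.println(d.get("CITIZEN(2)"));  // [S, IH, T, IH, Z, AH, N]
    }
}
```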

3.5.4 Run

To run the system, execute the following:

– To run the voice server, type:
java -mx800m -jar /Home/staff/mjm/sphinx/sphinx4-1.0beta/bin/SRmodule.jar

– Now, in a separate terminal, type:
step <step-root>/apps/world2
and then type:
(voice-ui "Name")
After that you will see a view (Fig. 3.5.4) as follows:

In the following example (see Fig. 3.5.4) we asked the system about "Cities in Germany":


Chapter 4

Evaluation of the STEP and SPHINX Integration

The system has been tested with a list of questions on an Intel Pentium 1.8 GHz machine where 800 MB of RAM was dedicated to it. The average time spent on asking a question is 3.73 s. Utterance accuracy is 98%. The questions in Appendix B have been used for the tests.

The following table shows the output of a component called speedTracker; it shows the times needed to run Sphinx-4 (ruleExpansion at 'COUNTRYNAME' = 3):

Accuracy  DictionaryLoadTime  ruleExpansion  rule at 'COUNTRYNAME'
98%       83.2                33             10
99%       80                  32             9
99%       78.8                31             8
99%       69.5                30             7
100%      65.3                29             6
100%      61.8                28             5
100%      57.5                27             4
100%      53                  26             3
100%      48.2                26             2

Table 4.1: Accuracy of the system according to amount of rules expansion


Name            Count  CurTime    MinTime    MaxTime    AvgTime    TotTime
AM Load         1      7.8720s    7.8720s    7.8720s    7.8720s    7.8720s
DictionaryLoad  1      1.1610s    1.1610s    1.1610s    1.1610s    1.1610s
grammarLoad     1      0.8240s    0.8240s    0.8240s    0.8240s    0.8240s
compile         1      27.6840s   27.6840s   27.6840s   27.6840s   27.6840s
createGStates   1      0.1410s    0.1410s    0.1410s    0.1410s    0.1410s
collectContex   1      1.1110s    1.1110s    1.1110s    1.1110s    1.1110s
expandStates    1      25.2630s   25.2630s   25.2630s   25.2630s   25.2630s
connectNodes    1      1.1670s    1.1670s    1.1670s    1.1670s    1.1670s

Table 4.2: ruleExpansion = 33

Name            Count  CurTime    MinTime    MaxTime    AvgTime    TotTime
AM Load         1      9.4230s    9.4230s    9.4230s    9.4230s    9.4230s
DictionaryLoad  1      1.4450s    1.4450s    1.4450s    1.4450s    1.4450s
grammarLoad     1      0.9530s    0.9530s    0.9530s    0.9530s    0.9530s
compile         1      71.3990s   71.3990s   71.3990s   71.3990s   71.3990s
createGStates   1      0.1500s    0.1500s    0.1500s    0.1500s    0.1500s
collectContex   1      1.4920s    1.4920s    1.4920s    1.4920s    1.4920s
expandStates    1      51.7050s   51.7050s   51.7050s   51.7050s   51.7050s
connectNodes    1      8.0400s    8.0400s    8.0400s    8.0400s    8.0400s

Table 4.3: ruleExpansion = 19


Chapter 5

Conclusions

Speech recognition has a big potential to become an important factor in interaction between humans and computers in the near future. But many users prefer more trustworthy ways to communicate with machines. Differences in timbre of voice, a noisy environment, or difficulties in interpreting sentences which cannot be understood without some existing context cause many mistakes in these systems. Of course, speaker-dependent automatic SR systems offer great accuracy, but those applications are limited by the necessity of carrying out training.

We can also use lip reading to improve SR. The visual part of speech does not carry as much information as the voice signal, so it is easier to process; this visual part can improve the accuracy of the audio part of the signal [2]. Another option is to use different types of acoustic models, dictionaries and language models depending on a person's accent, and to 'understand' ungrammatical sentences. After a short test of the user's accent, consisting of reading some test sentences, the system could choose the best options for that type of accent.

5.1 Limitations

The system should be run with at least 800 MB of RAM. Questions should be given exactly according to the grammar patterns. The grammar file cannot be too big, because of the limited amount of memory.

The system has a North American dictionary, so the best results are achieved with that pronunciation. The user should also have a microphone and speakers. As well, any problems with speaking can be a big disadvantage in using the system. The next limitation concerns the loudness of the environment: the larger the amount of interference, the lower the accuracy of recognizing what was said.


Chapter 6

Acknowledgements

I would like to thank my supervisor, Michael Minock, for his help at each stage of the project and his patience. Most of all, however, I would like to thank my parents, who contributed greatly to my finishing this thesis. Mom, Dad, thanks! I would not forgive myself if I did not also thank Michal Jaskiewicz for the saying "A good pilot will fly even on a door".
This temporary residence in Umea gave me the most exciting moments of my life. Thank you, Sweden and international students!


References

[1] X. Beeferman and A. Acero et al. From CMU Sphinx-II to Microsoft Whisper - making speech recognition usable. Technical report, Microsoft Research, Redmond WA, 98052, USA, 1995.

[2] Josef Chaloupka. Automatic lip reading for audio-visual speech processing and recognition. In Proc. of ICSLP, Jeju Island, Korea, pages 2505–2508, October 2004.

[3] Ramon Lopez-Cozar Delgado. An introduction to spoken dialogue systems: fundamentals, standards and tools, 2004.

[4] Jon, Gloria, and Pete. Voice recognition technology.

[5] Konstantinos Koumpis and Keith Pavitt. Corporate activities in speech recognition and natural language: another "new science"-based technology.

[6] L1 Languages. Historia systemów rozpoznawania mowy w nauczaniu języków obcych (History of speech recognition systems in foreign language teaching). http://www.l1.pl/pl/historia.html, accessed April 2005.

[7] Paul Lamere, Philip Kwok, Evandro Gouvea, Bhiksha Raj, and Peter Wolf. Thecmu sphinx-4 speech recognition system, 2004.

[8] Paul Lamere, Philip Kwok, Evandro Gouvea, Bhiksha Raj, and Peter Wolf. Designof the cmu sphinx-4 decoder, 2004.

[9] Carl M. Rebman Jr. and Casey G. Cegielski. Speech recognition in the human-computer interface. Information and Management, 40:509–519, 2003.

[10] Michael Minock. STEP. http://www.cs.umu.se/~mjm/step/, accessed April 2005.

[11] Sun Microsystems. Odsłuchaj maila (Listen to your mail). http://pl.sun.com/, accessed April 2005.

[12] Michael Minock. Some folklore on parsing english integer descriptions. 2005.

[13] Michael Minock, Anton Flank, and Hans Olofsson. The step system. 2004.

[14] St. Norbert College. History of speech recognition. http://www.snc.edu/, accessed April 2005.

[15] D. Reddy. Speech recognition by machine: a review. Proceedings of the IEEE,4:7–11, 1976.


[16] Godfried T. Toussaint. Viterbi algorithm in text recognition. http://www.cim.mcgill.ca/~latorres/Viterbi/va main.html, accessed December 2000.

[17] Carnegie Mellon University. http://cmusphinx.sourceforge.net/sphinx4/, accessedApril 2005.

[18] Carnegie Mellon University. Sphinx-4. http://cmusphinx.sourceforge.net/sphinx4/,accessed April 2005, 2004.

[19] Wikipedia. Encyclopedia. http://wikipedia.pl, accessed April 2005.

[20] Nicole Yankelovich, Gina-Anne Levow, and Matt Marx. Issues in speech user in-terfaces. 1995.


Appendix A

Glossary

1. Bigram - a group of two written letters, syllables or words, used in statistical analyses of text.

2. Trigram - a group of three written letters, syllables or words, used in statistical analyses of text.

3. Cepstra - a representation of an audio signal, obtained by taking an FT or FFT.

4. Mel frequency bands - a transform applied to the voice signal to turn it into a cepstrum.

5. ActiveList - represents all active states in the speech graph.

6. Pruner - reduces the number of possible paths during the search.

7. Search graph - a graph structure produced by the linguist according to certain criteria, using knowledge from the dictionary, the acoustic model and the language model.

8. Lexicon - consists of words mapped to phonemes.

9. Language model - a language structure (the Java Speech Grammar Format has been used in our system).

10. Acoustic model - contains a representation of sound, created by training on large amounts of acoustic data (the Wall Street Journal model has been used in our system).

11. Search manager - constructs and searches the graph of active paths for the best result.

12. Acoustic scorer - scores the current frame against all the active states in the ActiveList.

13. Linguist - produces the search graph structure.

14. FrontEnd - changes the input signal into a sequence of output features (mel frequency cepstral coefficients, MFCCs, have been used in our system).

15. Phoneme - the smallest contrastive unit in the sound system of a language.


Appendix B

Tests

1. Capitals of Germany

2. Capitals of Sweden

3. Capitals of France

4. Cities in Germany

5. Cities in Sweden

6. Cities in France

7. Provinces of Germany

8. Provinces of Sweden

9. Provinces of France

10. Capitals in Germany

11. Capitals in Sweden

12. Capitals in France

13. Villages in Germany

14. Villages in Sweden

15. Villages in France

16. How many people in Germany

17. How many people in Sweden

18. How many people in France

19. How many Islamic in Germany

20. How many Islamic in Sweden

21. How many Islamic in France


22. Print the countries bordering Germany

23. Print the countries bordering Sweden

24. Print the countries bordering France

25. Print the countries with population over million people

26. Print the countries with population over two million people

27. Print the countries with population less million people

28. Print the countries with population less two million people

29. Print the countries with population under million people