Speech Recognition

Upload: noah-todd

Post on 17-Dec-2015

Page 1: Speech Recognition. Definition Speech recognition is the process of converting an acoustic signal, captured by a microphone or a telephone, to a set of

Speech Recognition

Definition

• Speech recognition is the process of converting an acoustic signal, captured by a microphone or a telephone, to a set of words.

• The recognised words can be an end in themselves, as in applications such as command and control, data entry, and document preparation.

• They can also serve as the input to further linguistic processing in order to achieve speech understanding.

Speech Processing

• Signal processing:
– Convert the audio wave into a sequence of feature vectors

• Speech recognition:
– Decode the sequence of feature vectors into a sequence of words

• Semantic interpretation:
– Determine the meaning of the recognized words

• Dialog Management:
– Correct errors and help get the task done

• Response Generation:
– What words to use to maximize user understanding

• Speech synthesis (Text to Speech):
– Generate synthetic speech from a ‘marked-up’ word string

Dialog Management

• Goal: determine what to accomplish in response to user utterances, e.g.:
– Answer user question
– Solicit further information
– Confirm/Clarify user utterance
– Notify invalid query
– Notify invalid query and suggest alternative

• Interface between user/language processing components and system knowledge base

What you can do with Speech Recognition

• Transcription
– dictation, information retrieval

• Command and control
– data entry, device control, navigation, call routing

• Information access
– airline schedules, stock quotes, directory assistance

• Problem solving
– travel planning, logistics

Transcription and Dictation

• Transcription is transforming a stream of human speech into computer-readable form
– Medical reports, court proceedings, notes
– Indexing (e.g., broadcasts)

• Dictation is the interactive composition of text
– Report, correspondence, etc.

Speech recognition and understanding

• Sphinx system
– speaker-independent
– continuous speech
– large vocabulary

• ATIS system
– air travel information retrieval
– context management

Speech Recognition and Call Centres

• Automate services, lower payroll
• Shorten time on hold
• Shorten agent and client call time
• Reduce fraud
• Improve customer service

Applications related to Speech Recognition

• Speech Recognition
– Figure out what a person is saying.

• Speaker Verification
– Authenticate that a person is who she/he claims to be.
– Limited speech patterns

• Speaker Identification
– Assigns an identity to the voice of an unknown person.
– Arbitrary speech patterns

Many kinds of Speech Recognition Systems

• Speech recognition systems can be characterised by many parameters.

• An isolated-word (Discrete) speech recognition system requires that the speaker pauses briefly between words, whereas a continuous speech recognition system does not.

Spontaneous vs. Scripted

• Spontaneous speech contains disfluencies, pauses, and restarts, and is much more difficult to recognise than speech read from a script.

Enrolment

• Some systems require speaker enrolment: a user must provide samples of his or her speech before using them. Other systems are said to be speaker-independent, in that no enrolment is necessary.

Large vs. Small Vocabularies

• Some of the other parameters depend on the specific task. Recognition is generally more difficult when vocabularies are large with many similar-sounding words.

• When speech is produced in a sequence of words, language models or artificial grammars are used to restrict the combination of words.

• The simplest language model can be specified as a finite-state network, where the permissible words following each word are given explicitly.
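
A finite-state network of this kind can be sketched as a table that lists, for each word, the words allowed to follow it. The vocabulary below and the `<s>` / `</s>` start and end markers are hypothetical, chosen only for illustration; they do not come from the slides.

```python
# Hypothetical toy grammar: each word maps to the set of words allowed to follow it.
# "<s>" and "</s>" are assumed start and end markers.
FOLLOWERS = {
    "<s>": {"call", "dial"},
    "call": {"home", "office"},
    "dial": {"home", "office", "nine"},
    "nine": {"nine", "one"},
    "one": {"one", "</s>"},
    "home": {"</s>"},
    "office": {"</s>"},
}


def is_permitted(words):
    """Return True if every word is an allowed successor of the previous one."""
    previous = "<s>"
    for word in list(words) + ["</s>"]:
        if word not in FOLLOWERS.get(previous, set()):
            return False
        previous = word
    return True
```

A recogniser constrained by such a network only needs to consider the listed successors at each step, which is how the grammar restricts the combination of words.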

Perplexity

• One popular measure of the difficulty of the task, combining the vocabulary size and the language model, is perplexity.

• It is loosely defined as the geometric mean of the number of words that can follow a word after the language model has been applied (Zue, Cole, and Ward, 1995).
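
As a rough illustration of the loose definition, the sketch below computes the geometric mean of a list of branching counts. It also includes the usual probability-based form of perplexity, which is the standard textbook definition rather than something stated in the slides. The function names and input numbers are assumptions made for the example.

```python
import math


def geometric_mean_branching(follower_counts):
    """Geometric mean of the number of words that may follow at each position."""
    log_sum = sum(math.log(n) for n in follower_counts)
    return math.exp(log_sum / len(follower_counts))


def perplexity(word_probabilities):
    """Perplexity of a word sequence, given the model probability of each word."""
    log_sum = sum(math.log2(p) for p in word_probabilities)
    return 2 ** (-log_sum / len(word_probabilities))
```

If the model allows 2 successors at one position and 8 at the next, the geometric mean is 4, so the task is about as hard as choosing among 4 equally likely words at every step.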

• Finally, some external parameters can affect speech recognition system performance. These include the characteristics of the environmental noise and the type and the placement of the microphone.

Properties of Recognizers: Summary

• Speaker Independent vs. Speaker Dependent
• Large Vocabulary (2K-200K words) vs. Limited Vocabulary (2-200)
• Continuous vs. Discrete
• Speech Recognition vs. Speech Verification
• Real Time vs. multiples of real time

Continued

• Spontaneous Speech vs. Read Speech
• Noisy Environment vs. Quiet Environment
• High Resolution Microphone vs. Telephone vs. Cellphone
• Push-and-hold vs. push-to-talk vs. always-listening
• Adapt to speaker vs. non-adaptive
• Low vs. High Latency
• With online incremental results vs. final results
• Dialog Management

Features That Distinguish Products & Applications

• Words, phrases, and grammar
• Models of the speakers
• Speech flow

• Vocabulary: How many words

• How you add new words

• Grammar branching factor (perplexity)

• Available languages

Systems are also defined by Users

• Different Kinds of Users

• One time vs. Frequent users

• Homogeneity

• Technically sophisticated

• Different kinds of users call for different speaker models

Speaker Models

• Speaker Dependent

• Speaker Independent

• Speaker Adaptive

Sample Market: Call Centers

• Automate services, lower payroll
• Shorten time on hold
• Shorten agent and client call time
• Reduce fraud
• Improve customer service

A TIMELINE OF SPEECH RECOGNITION

• 1870s Alexander Graham Bell invents the telephone while trying to develop a speech recognition system for deaf people.

• 1936 AT&T's Bell Labs produced the first electronic speech synthesizer, called the Voder (Dudley, Riesz and Watkins).

• This machine was demonstrated at the 1939 World's Fair by experts who used a keyboard and foot pedals to play the machine and emit speech.

• 1969 John Pierce of Bell Labs said automatic speech recognition would not be a reality for several decades because it requires artificial intelligence.

Early 70s

• Early 1970s: The Hidden Markov Modeling (HMM) approach to speech recognition was invented by Lenny Baum of Princeton University and shared with several ARPA (Advanced Research Projects Agency) contractors including IBM.

• HMM is a complex mathematical pattern-matching strategy that eventually was adopted by all the leading speech recognition companies including Dragon Systems, IBM, Philips, AT&T and others.

70+

• 1971 DARPA (Defense Advanced Research Projects Agency) established the Speech Understanding Research (SUR) program to develop a computer system that could understand continuous speech.

• Lawrence Roberts, who initiated the program, spent $3 million per year of government funds for 5 years. Major SUR project groups were established at CMU, SRI, MIT's Lincoln Laboratory, Systems Development Corporation (SDC), and Bolt, Beranek, and Newman (BBN). It was the largest speech recognition project ever.

• 1978 The popular toy "Speak and Spell" by Texas Instruments was introduced. Speak and Spell used a speech chip which led to huge strides in development of more human-like digital synthesis sound.

80+

• 1982 Covox founded. The company brought digital sound (via The Voice Master, Sound Master and The Speech Thing) to the Commodore 64, Atari 400/800, and finally to the IBM PC in the mid ‘80s.

• 1982 Dragon Systems was founded by speech industry pioneers Drs. Jim and Janet Baker. Dragon Systems is well known for its long history of speech and language technology innovations and its large patent portfolio.

• 1984 SpeechWorks, the leading provider of over-the-telephone automated speech recognition (ASR) solutions, was founded.

90s

• 1993 Covox sells its products out to Creative Labs, Inc.

• 1995 Dragon released discrete word dictation-level speech recognition software. It was the first time dictation speech recognition technology was available to consumers. IBM and Kurzweil followed a few months later.

• 1996 Charles Schwab is the first company to devote resources towards developing a speech recognition IVR system with Nuance. The program, Voice Broker, allows up to 360 simultaneous customers to call in and get quotes on stocks and options, and it handles up to 50,000 requests each day. The system was found to be 95% accurate and set the stage for other companies such as Sears, Roebuck and Co., United Parcel Service of America Inc., and E*Trade Securities to follow in its footsteps.

• 1996 BellSouth launches the world's first voice portal, called Val and later Info By Voice.

95+

• 1997 Dragon introduced "Naturally Speaking", the first "continuous speech" dictation software available (meaning you no longer need to pause between words for the computer to understand what you're saying).

• 1998 Lernout & Hauspie bought Kurzweil. Microsoft invested $45 million in Lernout & Hauspie to form a partnership that will eventually allow Microsoft to use their speech recognition technology in their systems.

• 1999 Microsoft acquired Entropic, giving Microsoft access to what was known as the "most accurate speech recognition system".

2000

2000 Lernout & Hauspie acquired Dragon Systems for approximately $460 million.

2000 TellMe introduces first world-wide voice portal.

2000 NetBytel launched the world's first voice enabler, which includes an on-line ordering application with real-time Internet integration for Office Depot.

2000s

2001 ScanSoft Closes Acquisition of Lernout & Hauspie Speech and Language Assets.

2003 ScanSoft Ships Dragon NaturallySpeaking 7 Medical, Lowers Healthcare Costs through Highly Accurate Speech Recognition.

2003 ScanSoft closes deal to distribute and support IBM ViaVoice Desktop Products.

Signal Variability

• Speech recognition is a difficult problem, largely because of the many sources of variability associated with the signal.

• The acoustic realisations of phonemes, the smallest sound units of which words are composed, are highly dependent on the context in which they appear.

• This phonetic variability is exemplified by the acoustic differences of the phoneme /t/ in two, true, and butter in English.

• At word boundaries, contextual variations can be quite dramatic: in Italian, devo andare can sound like devandare.

More

• Acoustic variability can result from changes in the environment as well as in the position and characteristics of the transducer.

• Within-speaker variability can result from changes in the speaker's physical and emotional state, speaking rate, or voice quality.

• Differences in socio-linguistic background, dialect, and vocal tract size and shape can contribute to across-speaker variability.

What is a speech recognition system?

• Speech recognition is generally used as a human-computer interface for other software. When it functions in this role, three primary tasks need to be performed:

• Pre-processing, the conversion of spoken input into a form the recogniser can process.

• Recognition, the identification of what has been said.

• Communication, to send the recognised input to the application that requested it.

How is pre-processing performed?

• To understand how the first of these functions is performed, we must examine:

• Articulation, the production of the sound.

• Acoustics, the stream of the speech itself.

• Auditory perception, which characterises the ability to understand spoken input.

Articulation

• The science of articulation is concerned with how phonemes are produced. The focus of articulation is on the vocal apparatus of the throat, mouth and nose where the sounds are produced.

• The phonemes themselves need to be classified. The system most often used in speech recognition is the ARPABET (Rabiner and Juang, 1993). The ARPABET was created in the 1970s by and for contractors working on speech processing for the Advanced Research Projects Agency of the U.S. Department of Defense.

ARPABET

• Like most phoneme classifications, the ARPABET separates consonants from vowels.

• Consonants are characterised by a total or partial blockage of the vocal tract.

• Vowels are characterised by strong harmonic patterns and relatively free passage of air through the vocal tract.

• Semi-Vowels, such as the ‘y’ in you, fall between consonants and vowels.

Consonant Classification

• Consonant classification uses the:

• Point of articulation.

• Manner of articulation.

• Presence or absence of voicing.

Acoustics

• Articulation provides valuable information about how speech sounds are produced, but a speech recognition system cannot analyse movements of the mouth.

• Instead, the data source for speech recognition is the stream of speech itself.

• This is an analogue signal: a continuous flow of sound waves and silence.

Important Features (Acoustics)

• Four important features of the acoustic analysis of speech are (Carter, 1984):

• Frequency, the number of vibrations per second a sound produces.

• Amplitude, the loudness of the sound.

• Harmonic structure: added to the fundamental frequency of a sound are other frequencies that contribute to its quality or timbre.

• Resonance.
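
The first two features can be estimated from a sampled signal with very little code. The sketch below is an illustration only: it uses root-mean-square level as a stand-in for amplitude and a zero-crossing count as a crude frequency estimate. Neither technique is named in the slides, and the helper names and the test tone are assumptions.

```python
import math


def rms_amplitude(samples):
    """Root-mean-square level of the samples, a common proxy for loudness."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def zero_crossing_frequency(samples, sample_rate_hz):
    """Crude frequency estimate: a pure tone crosses zero twice per cycle."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )
    duration_s = len(samples) / sample_rate_hz
    return crossings / (2 * duration_s)


def sine(frequency_hz, amplitude, sample_rate_hz, seconds):
    """Generate a pure test tone (hypothetical input, not real speech)."""
    count = int(sample_rate_hz * seconds)
    return [
        amplitude * math.sin(2 * math.pi * frequency_hz * n / sample_rate_hz)
        for n in range(count)
    ]


tone = sine(100, 0.5, 8000, 1.0)
level = rms_amplitude(tone)
estimate_hz = zero_crossing_frequency(tone, 8000)
```

The zero-crossing estimate only works for a clean single tone. Real speech has harmonic structure and resonances, which is why recognisers use spectral analysis instead.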

Auditory perception, hearing speech.

• "Phonemes tend to be abstractions that are implicitly defined by the pronunciation of the words in the language. In particular, the acoustic realisation of a phoneme may heavily depend on the acoustic context in which it occurs. This effect is usually called co-articulation", (Ney, 1994).

• The way a phoneme is pronounced can be affected by its position in a word, neighbouring phonemes and even the word's position in a sentence. This effect is called the co-articulation effect.

• The variability in the speech signal caused by co-articulation and other sources makes speech analysis very difficult.

Human Hearing

• The human ear can detect frequencies from 20Hz to 20,000Hz but it is most sensitive in the critical frequency range, 1000Hz to 6000Hz, (Ghitza, 1994).

• Recent research suggests that humans do not process individual frequencies.

• Instead, we hear groups of frequencies, such as formant patterns, as cohesive units and we are capable of distinguishing them from surrounding sound patterns (Carrell and Opie, 1992).

• This capability, called auditory object formation, or auditory image formation, helps explain how humans can discern the speech of individual people at cocktail parties and separate a voice from noise over a poor telephone channel, (Markowitz, 1995).

Pre-processing Speech

• Like all sounds, speech is an analogue waveform. In order for a recognition system to act on speech, it must be represented digitally.

• All noise patterns, silences, and co-articulation effects must be captured.

• This is accomplished by digital signal processing. The way the analogue speech is processed is one of the most complex elements of a speech recognition system.

Recognition Accuracy

• To achieve high recognition accuracy, the speech representation process should (Markowitz, 1995):

• Include all critical data.

• Remove Redundancies.

• Remove Noise and Distortion.

• Avoid introducing new distortions.

Signal Representation

• In statistically based automatic speech recognition, the speech waveform is sampled at a rate between 6.6 kHz and 20 kHz and processed to produce a new representation as a sequence of vectors containing values of what are generally called parameters.

• The vectors typically comprise between 10 and 20 parameters, and are usually computed every 10 or 20 milliseconds.
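
To make the timing concrete, the sketch below cuts a sampled waveform into overlapping frames, one starting every 10 milliseconds. The 20 ms frame length is an assumption (the slides give only the update interval), and the function name is hypothetical. A real front end would then compute the 10 to 20 parameters from each frame, which this sketch does not attempt.

```python
def frame_signal(samples, sample_rate_hz, frame_ms=20, step_ms=10):
    """Split samples into frames of frame_ms, starting a new frame every step_ms."""
    frame_len = sample_rate_hz * frame_ms // 1000
    step = sample_rate_hz * step_ms // 1000
    frames = []
    start = 0
    # Stop when a full frame no longer fits in the remaining samples.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += step
    return frames


# One second of silence sampled at 16 kHz (hypothetical input).
frames = frame_signal([0.0] * 16000, 16000)
```

At 16 kHz each frame holds 320 samples and one second of audio yields 99 frames, so the recogniser sees roughly 100 parameter vectors per second.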

Parameter Values

• These parameter values are then used in succeeding stages in the estimation of the probability that the portion of waveform just analysed corresponds to a particular phonetic event that occurs in the phone-sized or whole-word reference unit being hypothesised.

• In practice, the representation and the probability estimation interact strongly: what one person sees as part of the representation another may see as part of the probability estimation process.

Emotional State

• Representations aim to preserve the information needed to determine the phonetic identity of a portion of speech while being as impervious as possible to factors such as speaker differences, effects introduced by communications channels, and paralinguistic factors such as the emotional state of the speaker.

• They also aim to be as compact as possible.

• Representations used in current speech recognisers concentrate primarily on properties of the speech signal attributable to the shape of the vocal tract rather than to the excitation, whether generated by a vocal-tract constriction or by the larynx.

• Representations are sensitive to whether the vocal folds are vibrating or not (the voiced/unvoiced distinction), but try to ignore effects due to variations in their frequency of vibration.

Future Improvements in Speech Representation.

• The vast majority of major commercial and experimental systems use representations akin to those described here.

• However, in striving to develop better representations, wavelet transforms (Daubechies, 1990) are being explored, and neural network methods are being used to provide non-linear operations on log spectral representations.

• Work continues on representations more closely reflecting auditory properties (Greenberg, 1988) and on representations reconstructing articulatory gestures from the speech signal (Schroeter & Sondhi, 1994).

• The articulatory approach is attractive because it holds out the promise of a small set of smoothly varying parameters that could deal in a simple and principled way with the interactions that occur between neighbouring phonemes and with the effects of differences in speaking rate and of carefulness of enunciation.

• The ultimate challenge is to match the superior performance of human listeners over automatic recognisers.

• This superiority is especially marked when there is little material to allow adaptation to the voice of the current speaker, and when the acoustic conditions are difficult.

• The fact that it persists even when nonsense words are used shows that it exists at least partly at the acoustic/phonetic level and cannot be explained purely by superior language modelling in the brain.

• It confirms that there is still much to be done in developing better representations of the speech signal, (Rabiner and Schafer, 1978; Hunt, 1993).

Signal Recognition Technologies

• Signal recognition methodologies fall into four categories; most systems will apply one or more in the conversion process.

Template Matching

• Template matching is the oldest and least effective method. It is a form of pattern recognition.

• It was the dominant technology in the 1950s and 1960s.

• Each word or phrase in an application is stored as a template.

• The user input is also arranged into templates at the word level and the best match with a system template is found.

• Although template matching is currently in decline as the basic approach to recognition, it has been adapted for use in word spotting applications. It also remains the primary technology applied to speaker verification (Moore, 1982).
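
The "best match" step can be sketched as a nearest-template search. The example below compares sequences of feature vectors with dynamic time warping, a common way to align utterances of different lengths. The slides do not name the matching method, so treat the choice of DTW, the function names, and the toy one-dimensional templates as assumptions.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two sequences of feature vectors."""
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i frames of a with the first j of b.
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            # Euclidean distance between the two frames being aligned.
            local = sum((x - y) ** 2 for x, y in zip(a[i - 1], b[j - 1])) ** 0.5
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[len(a)][len(b)]


def best_match(utterance, templates):
    """Return the word whose stored template is closest to the utterance."""
    return min(templates, key=lambda word: dtw_distance(utterance, templates[word]))


# Hypothetical stored templates, one feature per frame for brevity.
TEMPLATES = {
    "yes": [[1.0], [2.0], [3.0]],
    "no": [[5.0], [5.0], [1.0]],
}
# The same shape as "yes", but with the first segment held for longer.
spoken = [[1.0], [1.0], [2.0], [3.0]]
recognised = best_match(spoken, TEMPLATES)
```

Because every vocabulary item needs its own stored template, the cost of this search grows with vocabulary size, which is one reason the approach suits small vocabularies.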

Acoustic-Phonetic Recognition

• Acoustic-phonetic recognition functions at the phoneme level. It is an attractive approach to speech as it limits the number of representations that must be stored. In English there are about forty discernible phonemes no matter how large the vocabulary, (Markowitz, 1995).

• Acoustic-phonetic recognition involves three steps:

– Feature extraction.
– Segmentation and labelling.
– Word-level recognition.

• Acoustic phonetic recognition supplanted template matching in the early 1970's.

• The successful ARPA SUR systems highlighted the potential benefits of this approach. Unfortunately, acoustic phonetics was at the time a poorly researched area and many of the expected advances failed to materialise.

• The high degree of acoustic similarity among phonemes, combined with phoneme variability resulting from the co-articulation effect and other sources, creates uncertainty with regard to potential phoneme labels (Cole, 1986).

• If these problems can be overcome, there is certainly an opportunity for this technology to play a part in future speech recognition systems.

Stochastic Processing

• The term stochastic refers to the process of making a sequence of non-deterministic selections from among a set of alternatives.

• They are non-deterministic because the choices during the recognition process are governed by the characteristics of the input and not specified in advance, (Markowitz, 1995).

• Like template matching, stochastic processing requires the creation and storage of models of each of the items that will be recognised.

• It is based on a series of complex statistical or probabilistic analyses. These statistics are stored in a network-like structure called a Hidden Markov Model (HMM), (Paul, 1990).

HMM

• A Hidden Markov Model is made up of states and transitions. Each state of an HMM holds statistics for a segment of a word, which describe the values and variations that are found in the model of that word segment. The transitions allow for speech variations such as:

• The prolonging of a word segment, which causes several recursive (self-loop) transitions in the recogniser.

• The omission of a word segment, which causes a transition that skips a state.

• Stochastic processing using Hidden Markov Models is accurate, flexible, and capable of being fully automated, (Rabiner and Juang, 1986).
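
The self-loop and skip transitions described above can be seen in a small left-to-right model. The sketch below finds the most likely state path with the Viterbi algorithm, which is a standard decoding method for HMMs but is not named in the slides. The three-state model, its probabilities, and the symbols "a", "b", "c" are all made up for illustration.

```python
import math

# Hypothetical 3-state left-to-right word model; all numbers are invented.
# TRANSITIONS[i][j] is the probability of moving from state i to state j:
# i -> i is a self-loop (prolonged segment), i -> i + 2 skips a state (omitted segment).
TRANSITIONS = [
    [0.6, 0.3, 0.1],
    [0.0, 0.7, 0.3],
    [0.0, 0.0, 1.0],
]
# EMISSIONS[i][symbol] is the probability that state i produces the symbol.
EMISSIONS = [
    {"a": 0.8, "b": 0.1, "c": 0.1},
    {"a": 0.1, "b": 0.8, "c": 0.1},
    {"a": 0.1, "b": 0.1, "c": 0.8},
]


def safe_log(p):
    """Natural log that maps probability zero to minus infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(observations):
    """Return (log probability, state path) of the best path starting in state 0."""
    n = len(TRANSITIONS)
    score = [float("-inf")] * n
    score[0] = safe_log(EMISSIONS[0][observations[0]])
    back = []
    for obs in observations[1:]:
        new_score, pointers = [], []
        for j in range(n):
            # Best predecessor state for arriving in state j.
            best_i = max(range(n), key=lambda i: score[i] + safe_log(TRANSITIONS[i][j]))
            arrive = score[best_i] + safe_log(TRANSITIONS[best_i][j])
            new_score.append(arrive + safe_log(EMISSIONS[j][obs]))
            pointers.append(best_i)
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n), key=lambda j: score[j])
    best = score[state]
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return best, path[::-1]


prolonged_score, prolonged_path = viterbi(["a", "a", "b", "c"])
skipped_score, skipped_path = viterbi(["a", "c"])
```

With these numbers, a repeated "a" is absorbed by the self-loop on the first state, and a missing "b" is handled by the transition that jumps straight from the first state to the last.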

Neural networks

• "if speech recognition systems could learn speech knowledge automatically and represent this knowledge in a parallel distributed fashion for rapid evaluation … such a system would mimic the function of the human brain, which consists of several billion simple, inaccurate and slow processors that perform reliable speech processing", (Waibel and Hampshire, 1989).

• An artificial neural network is a computer program that attempts to emulate the biological functions of the human brain. Neural networks are excellent classification systems, and have been effective with noisy, patterned, variable data streams containing multiple, overlapping, interacting and incomplete cues (Markowitz, 1995).

• Neural networks do not require the complete specification of a problem, learning instead through exposure to large amounts of example data. A neural network comprises an input layer, one or more hidden layers, and one output layer. The way in which the nodes and layers of a network are organised is called the network's architecture.

• The allure of neural networks for speech recognition lies in their superior classification abilities.

• Considerable effort has been directed towards development of networks to do word, syllable and phoneme classification.
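
The layered architecture described above can be sketched as a forward pass through a tiny network. Everything specific here is an assumption made for illustration: the layer sizes (3 inputs, 2 hidden nodes, 2 outputs), the hand-picked weights, and the labels "ah" and "ee". A real network would learn its weights from a large amount of example data rather than have them written in by hand.

```python
import math


def dense(inputs, weights, biases):
    """Weighted sum of the inputs for each node in a layer."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def sigmoid(values):
    """Squash each hidden-node value into the range (0, 1)."""
    return [1.0 / (1.0 + math.exp(-v)) for v in values]


def softmax(values):
    """Turn output-node values into probabilities that sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical architecture: 3 input features -> 2 hidden nodes -> 2 outputs.
HIDDEN_W = [[2.0, -1.0, 0.5], [-1.5, 1.0, 2.0]]
HIDDEN_B = [0.0, 0.0]
OUTPUT_W = [[3.0, -3.0], [-3.0, 3.0]]
OUTPUT_B = [0.0, 0.0]
LABELS = ["ah", "ee"]  # illustrative phoneme classes


def classify(features):
    """Run one feature vector through the network and pick the best class."""
    hidden = sigmoid(dense(features, HIDDEN_W, HIDDEN_B))
    probs = softmax(dense(hidden, OUTPUT_W, OUTPUT_B))
    best = max(range(len(LABELS)), key=lambda i: probs[i])
    return LABELS[best], probs


if __name__ == "__main__":
    print(classify([1.0, 0.0, 0.0]))
    print(classify([0.0, 0.0, 1.0]))
```

The output layer has one node per class, so the same structure could in principle be used for word, syllable or phoneme classification by changing the number of output nodes.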

Auditory Models

• The aim of auditory models is to allow a speech recognition system to screen all noise from the signal and concentrate on the central speech pattern, in a similar way to the human brain.

• Auditory modelling offers the promise of being able to develop robust Speech Recognition systems that are capable of working in difficult environments.

• Currently, it is purely an experimental technology.

Performance of Speech Recognition Systems

• Performance of speech recognition systems is typically described in terms of word error rate, which is based on three types of error:

• Deletion: the loss of a word from the original speech. The system outputs "A E I U" while the input was "A E I O U".

• Substitution: the replacement of an element of the input, such as a word, with another. The system outputs "song" while the input was "long".

• Insertion: the system adds an element, such as a word, when no word was input. The system outputs "A E I O U" while the input was "A E I U".
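
Word error rate is commonly computed as WER = (S + D + I) / N, where S, D and I are the numbers of substitutions, deletions and insertions, and N is the number of words in the input. The sketch below is one way to count the three error types, using a minimum edit distance alignment between the input words and the system output. The function names `error_counts` and `word_error_rate` are illustrative, not part of any particular toolkit.

```python
def error_counts(reference, hypothesis):
    """Count (substitutions, deletions, insertions) in a minimum-error alignment.

    reference is what was actually said; hypothesis is the system output.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # table[i][j] = (total errors, substitutions, deletions, insertions)
    # for aligning the first i reference words with the first j output words.
    table = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    table[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        table[i][0] = (i, 0, i, 0)  # all reference words deleted
    for j in range(1, len(hyp) + 1):
        table[0][j] = (j, 0, 0, j)  # all output words inserted
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            total, subs, dels, ins = table[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (total, subs, dels, ins)          # correct word
            else:
                diagonal = (total + 1, subs + 1, dels, ins)  # substitution
            total, subs, dels, ins = table[i - 1][j]
            deletion = (total + 1, subs, dels + 1, ins)
            total, subs, dels, ins = table[i][j - 1]
            insertion = (total + 1, subs, dels, ins + 1)
            table[i][j] = min(diagonal, deletion, insertion)
    _, subs, dels, ins = table[-1][-1]
    return subs, dels, ins


def word_error_rate(reference, hypothesis):
    """Return (S + D + I) / N, where N is the number of reference words."""
    n_words = len(reference.split())
    if n_words == 0:
        raise ValueError("reference must contain at least one word")
    return sum(error_counts(reference, hypothesis)) / n_words


if __name__ == "__main__":
    print(error_counts("A E I O U", "A E I U"))     # one deletion
    print(error_counts("long", "song"))             # one substitution
    print(error_counts("A E I U", "A E I O U"))     # one insertion
    print(word_error_rate("A E I O U", "A E I U"))
```

Because insertions are counted as errors, the word error rate can exceed 100% when the system outputs many more words than were spoken.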

Speech Recognition as Assistive Technology

• Main use is as an alternative hands-free data entry mechanism

• Very effective

• Much faster than switch access

• Mainstream technology

• Used in many applications where hands are needed for other things, e.g. using a mobile phone while driving, or in surgical theatres

• Dictation is a big part of office administration, and commercial speech recognition systems are targeted at this market.

Some interesting facts

• Switch access users who were at around 5 words per minute achieved 80 words per minute with SR

• This allowed them to do state exams

• SR can be used for environmental control systems around the home e.g.

• “Open Curtains”

• People with speech impairment (dysarthric speech) have shown improved articulation after using SR systems, especially discrete systems

Reasons why SR may fail some people

• Crowded room - Cannot have everyone talking at once

• Too many errors because all noises, coughs, throat clearing, etc. are picked up

• Speech not good enough to use it

• Not enough training

• Cognitive overhead too much for some people

• Too demanding physically – Hard work to talk for a long time

• Cannot be bothered with Initial Enrolment

• Drinking – Adversely affects vocal cords

• Smoking, Shouting, Dry Mouth and illness all affect the vocal tract

• Need to drink water

• Room must not be too stuffy

Some links

• The following are links to major speech recognition demos

Carnegie Mellon Speech Demos

• CMU Communicator
– Call: 1-877-CMU-PLAN (268-7526), also 268-5144, or x8-1084
– The information is accurate; you can use it for your own travel planning…

• CMU Universal Speech Interface (USI)

• CMU Movie Line
– Seems to be about apartments now…
– Call: (412) 268-1185

Telephone Demos

• Nuance http://www.nuance.com
– Banking: 1-650-847-7438
– Travel Planning: 1-650-847-7427
– Stock Quotes: 1-650-847-7423

• SpeechWorks http://www.speechworks.com/demos/demos.htm
– Banking: 1-888-729-3366
– Stock Trading: 1-800-786-2571

• MIT Spoken Language Systems Laboratory http://www.sls.lcs.mit.edu/sls/whatwedo/applications.html
– Travel Plans (Pegasus): 1-877-648-8255
– Weather (Jupiter): 1-888-573-8255

• IBM http://www-3.ibm.com/software/speech/
– Mutual Funds, Name Dialing: 1-877-VIA-VOICE