TOPIC 4: SPEECH PROCESSING SYSTEMS
NATURAL LANGUAGE PROCESSING (NLP)
CS-724
Wondwossen Mulugeta (PhD) email: [email protected]
Topics
Topic 5: Speech Processing Systems
Subtopics:
• Introduction, Challenges
• Automatic Speech Recognition (Approaches, Acoustic Modeling, Lexical Modeling)
• Text to Speech System (Text Analysis, Waveform Synthesis)
• Evaluations
INTRODUCTION
Language is the ability to express one’s thoughts by
means of a set of signs, whether graphical, gestural,
acoustic or even musical.
It is a distinctive feature of human beings, who use
such a structured system.
Speech
Speech is a major component of a language
Speech is the oldest means of communication
Levels of speech:
1. Acoustic
2. Phonetic
3. Phonological
4. Morphological
5. Syntactic
6. Semantic
7. Pragmatic
Why Speech Processing?
No visual contact required: helpful in many situations
No special equipment required: the human voice is direct
input to the computer
Can be done while doing other things: people can talk
while working, which improves productivity
Available Services
Windows Operating System
Mobile Phones (Smart Phones)
GPS
Wave Forms
“THE SPACE NEARBY”
“THE AREA AROUND”
Identifying word boundaries is one of the major challenges in Speech Processing Systems. It also depends on how the speaker speaks.
What is a Dialog System?
Dialog systems seek to provide a natural
conversational interaction between the user and the
computer system, e.g.,
User:
“Is there a way I can get to Bole International Airport from here?”
Domains for Dialog Systems
Possible Applications
Travel reservation
Weather forecasting
In-vehicle driver assistance
Call routing
On-line learning environments
Dialog Systems: Information Flow
Must model two-way flow of information
User-to-system
Poses inquiries
Provides clarification
Gives confirmation
System-to-user
Asks for Clarification
Gives Response
Tasks in Speech Processing
1) Speech Coding
Compress a Speech File
Making storage of audio files more compact
2) Speech Synthesis
Construct a speech waveform from words, with good speaker
quality, accent and prosody
3) Speech Recognition
Convert a sound waveform to words
The most relevant and important task in the industry
Tools: Sphinx, ViaVoice & SDK
4) Speaker Recognition/Verification
Concerned with Biometrics
Concerned with:
Speaker Quality
Prosody
Pitch, Accent etc.
Challenges of ASR
Co-articulation:
The articulation of a sound is influenced by its neighboring sounds, so the same phoneme is realized differently in different contexts
Speaker Variation
Many speakers for the system at various times (call centers)
Spontaneity
Naturalness of the speech
Language Modeling
Representation of the language
Noise Robustness
Tolerating and dealing with noise (natural environment)
Research Issues
Many fundamental problems must be solved for
these systems to mature.
Three general areas include:
Automatic Speech Recognition (ASR)
Natural Language Processing (NLP)
Human-computer Interaction (HCI)
The Noisy Channel Model
Search through space of all possible sentences.
Pick the one that is most probable given the
waveform.
Dealing with Noise
What is the most likely sentence out of all sentences
in the language L given some acoustic input O?
Treat acoustic input O as sequence of individual
observations
O = o1,o2,o3,…,ot
Define a sentence as a sequence of words:
W = w1,w2,w3,…,wn
Dealing with Noise
Probabilistic implication: pick the sentence W with the highest probability:
$$\hat{W} = \arg\max_{W \in L} P(W \mid O)$$
We can use Bayes' rule to rewrite this:
$$\hat{W} = \arg\max_{W \in L} \frac{P(O \mid W)\,P(W)}{P(O)}$$
Since the denominator is the same for each candidate sentence W, we can ignore it for the argmax:
$$\hat{W} = \arg\max_{W \in L} P(O \mid W)\,P(W)$$
The noisy channel model
Ignoring the denominator leaves us with two factors:
P(Source) and P(Signal|Source)
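The argmax over P(O|W)P(W) can be illustrated with a minimal sketch. The candidate sentences and all probability values below are made up for illustration; a real recognizer would obtain P(O|W) from an acoustic model and P(W) from a language model, and would search a far larger space. Log probabilities are summed to avoid numerical underflow.

```python
import math

# Hypothetical scores for three candidate transcriptions of one utterance.
# acoustic[w] stands in for P(O | W); prior[w] stands in for P(W).
acoustic = {
    "the space nearby": 0.020,
    "the spays near by": 0.025,
    "this pace nearby": 0.018,
}
prior = {
    "the space nearby": 1e-4,
    "the spays near by": 1e-9,
    "this pace nearby": 1e-6,
}

def decode(candidates, acoustic, prior):
    """Return the candidate W that maximizes log P(O|W) + log P(W)."""
    return max(
        candidates,
        key=lambda w: math.log(acoustic[w]) + math.log(prior[w]),
    )

best = decode(list(acoustic), acoustic, prior)
print(best)
```

Note that the second candidate has the best acoustic score but loses because its prior is very low; this is the role of P(Source) in the model.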
NLP Issue: Semantic Representation
Two Approaches:
Hand-craft the grammar for the application,
using robust parsing to understand meaning
Problem: time, expense
Use statistical approach, generating initial
rules and using annotated tree-banked data
to discover the full rule set
Problem: annotated training data
NLP Issue: Resolving Meaning Using Context
Must maintain knowledge of the conversational
context.
“I’m at Dembel. How do I get to a gas station close to it and a café close to it?”
After request for nearest gas station, user says,
“What is it close to?”
Resolving “it”: anaphora resolution (more on this later)
Another follow-up by the user,
“How about …restaurant?”
Resolving “…” as “nearest”: ellipsis
Resolving Meaning: Discourse Analysis
To resolve such requests, system must track
context of the conversation.
The system needs to keep track of long-distance
relationships between words.
This is typically handled by a discourse
analysis component in the Dialog Manager.
How is this resolved in a speech system?
“Kill him, not leave him.” Vs “Kill him not, leave him.”
Dialog Manager: Discourse Analysis
Anaphora resolution approach:
Use focus mechanism, assuming conversation has focus.
For instance, “gas station” is current focus.
But how about:
“I’m at Dembel. How do I get to a gas station close
to it and then a café close to it?”
Problem: Resolving the two “it”.
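The focus mechanism can be sketched for the simple single-focus case. This is a toy illustration, not a description of any particular dialog manager: the entity list, the pre-tokenized turns, and the rule "the last entity mentioned in a turn becomes the focus" are all assumptions. It handles the follow-up question “What is it close to?” but would not resolve the sentence with two “it” correctly.

```python
# Hypothetical list of entities the system knows about.
ENTITIES = {"Dembel", "gas station", "café"}

def resolve_turns(turns, entities=ENTITIES):
    """Replace "it" with the current focus; update the focus after each turn."""
    focus = None
    resolved_turns = []
    for turn in turns:
        # Pronouns in this turn refer to the focus set by earlier turns.
        resolved = [focus if tok == "it" and focus else tok for tok in turn]
        resolved_turns.append(resolved)
        mentioned = [tok for tok in turn if tok in entities]
        if mentioned:
            focus = mentioned[-1]  # most recently mentioned entity
    return resolved_turns

turns = [
    ["where", "is", "the", "nearest", "gas station"],
    ["what", "is", "it", "close", "to"],
]
resolved = resolve_turns(turns)
print(resolved[1])
```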
Dialog Manager: Clarification
Often cannot satisfy request in one iteration.
The previous example may require clarification
from the user,
“Do you want to go to the gas station first?”
Reading
For Human beings the reading process involves:
Seeing, Thinking, Saying, Hearing
These are highly complex processes and cannot be
imitated
Architecture of TTS system
Raw text or tagged text
→ Text Analysis (Document Structure Detection, Text Normalization, Linguistic Analysis)
→ Tagged text
→ Phonetic Analysis (Grapheme-to-Phoneme Conversion)
→ Tagged phones
→ Prosodic Analysis (Pitch & Duration attachment)
→ Controls
→ Speech Synthesis (Voice Rendering)
TTS Synthesizer System
A text-to-speech synthesizer is a computer-based system that should be able to read any text. The resulting speech should be intelligible and natural.
“Text-to-Speech software is used to convert words from a computer document (e.g. word processor document, web page) into Audible Speech spoken through the computer speaker”
Applications
1. Talking Calculator
2. Smart Phone Features
SMS Reader
Caller Reader
3. Computer generated wiring instruction
4. Aids for the blind
5. Telephone inquiry service (Ethio Telecom 994)
6. Teaching machines
Typical TTS Components
TEXT-TO-SPEECH SYNTHESIZER
Text
→ NATURAL LANGUAGE PROCESSING (Linguistic Formalism, Inference Engines, Logical Inferences)
→ Phonemes, Prosody
→ DIGITAL SIGNAL PROCESSING (Mathematical Models, Algorithms, Computations)
→ Speech
Typical TTS Components
TTS has two components
1. Natural Language Processing Module (NLP)
Linguistic Formalism
Inference Engine
Logical Inferences
2. Digital Signal Processing Module (DSP)
Mathematical Models
Algorithms
Computations
Phonetic Transcription: Phones, Prosody
NLP and DSP Modules
The NLP module is capable of producing a phonetic transcription of the text to be read, together with the desired intonation and rhythm.
It takes in the text as input and gives a narrow phonetic transcription as output, which is forwarded to the DSP module.
The DSP module transforms the symbolic information it receives into natural-sounding speech. The “narrow phonetic transcription” used as the intermediate representation varies from one synthesizer system to another.
NLP Module of typical TTS system
Text Analyzer (Morpho Syntactic Analysis)
Pre-processor
Morphological Analyzer
Contextual Analyzer
Syntactic-Prosodic parser
Letter to Sound Module
Preprocessor
Takes in text as strings of ASCII characters
Transforms text into Broad Segmentation Units (BSUs) from the following set:
• A sequence of characters
• A sequence of digits
• A single punctuation mark or another special character
• A sequence of white space characters
E.g.: Sentence: I know 1,000 words, Dr. Jones.
BSU: (I)( )(know)( )(1)(,)(000)( )(words)(,)( )(Dr)(.)( )(Jones)(.)
Rewrites the BSUs into a list of word-like units and syntax-bearing punctuation marks, called Final Segmentation Units (FSUs).
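The broad segmentation step can be sketched with a single regular expression. This is a minimal illustration, assuming that "a sequence of characters" means a run of ASCII letters; the function name and pattern are not taken from any specific TTS system.

```python
import re

# One alternative per BSU class: letters, digits, white space, or any
# single remaining character (punctuation or another special character).
BSU_PATTERN = re.compile(r"[A-Za-z]+|[0-9]+|\s+|.")

def broad_segments(text):
    """Split text into Broad Segmentation Units without losing characters."""
    return BSU_PATTERN.findall(text)

sentence = "I know 1,000 words, Dr. Jones."
units = broad_segments(sentence)
print(units)
```

Joining the units back together reproduces the input, so no information is lost before the rewriting into FSUs.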
Preprocessor
Sentence end detection (punctuation is ambiguous: a colon may mark a ratio or a time of day, and a period may mark a decimal point or a sentence ending)
Abbreviations (e.g. “e.g.” for “for instance”) are changed to their full form with the help of lexicons
Acronyms (some, such as I.B.M., are read as a sequence of characters; others, such as NASA, are read as ordinary words)
Numbers (once detected, they are interpreted as rational numbers, times of day, dates or ordinals, depending on their context)
Idioms (e.g. “in spite of”, “as a matter of fact”: these are combined into a single FSU using a special lexicon)
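A few of these rewriting steps can be sketched as follows. The abbreviation and idiom lexicons are tiny hypothetical examples, and the underscore used to glue an idiom into one unit is an arbitrary choice made for this sketch. Sentence end detection, acronyms and number interpretation are not covered.

```python
import re

# Hypothetical lexicons; a real system would use much larger ones.
ABBREVIATIONS = {"Dr.": "Doctor", "e.g.": "for instance"}
IDIOMS = ["in spite of", "as a matter of fact"]

def final_segments(text):
    """Produce word-like units and punctuation marks from raw text."""
    # Rejoin digit groups split by a thousands separator: "1,000" -> "1000".
    text = re.sub(r"(?<=\d),(?=\d{3})", "", text)
    # Expand abbreviations to their full form.
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    # Glue each idiom into a single unit.
    for idiom in IDIOMS:
        text = re.sub(re.escape(idiom), idiom.replace(" ", "_"), text,
                      flags=re.IGNORECASE)
    return re.findall(r"[A-Za-z_]+|\d+|[.,;:?!]", text)

fsus = final_segments("I know 1,000 words, Dr. Jones.")
print(fsus)
```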
Morphological Analysis
The task is to propose all possible part-of-speech categories for each word taken individually, on the basis of its spelling.
The part of speech might affect the way it is pronounced.
Words – Function and Content words
Contextual Analysis
Considers words in their context
Reduces the list of their part-of-speech categories to a very restricted number of highly probable hypotheses, given the corresponding possible parts of speech of neighboring words.
Achieved by N-grams, multi-layer perceptrons (neural networks), local stochastic grammars (provided by expert linguists), etc.
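The N-gram option can be sketched with tag bigrams. The candidate tags and the bigram probabilities below are invented for illustration; the brute-force search over all tag sequences is only practical for very short inputs (a real system would use dynamic programming such as the Viterbi algorithm).

```python
from itertools import product

# Hypothetical output of morphological analysis: candidate tags per word.
CANDIDATES = {
    "the": ["DET"],
    "record": ["NOUN", "VERB"],
    "plays": ["NOUN", "VERB"],
}

# Hypothetical tag-bigram probabilities P(tag | previous tag).
BIGRAM = {
    ("<s>", "DET"): 0.6,
    ("DET", "NOUN"): 0.8, ("DET", "VERB"): 0.01,
    ("NOUN", "VERB"): 0.5, ("NOUN", "NOUN"): 0.2,
    ("VERB", "NOUN"): 0.3, ("VERB", "VERB"): 0.05,
}

def best_tags(words):
    """Pick the tag sequence with the highest product of bigram probabilities."""
    def score(tags):
        prob, prev = 1.0, "<s>"
        for tag in tags:
            prob *= BIGRAM.get((prev, tag), 1e-6)  # small floor for unseen pairs
            prev = tag
        return prob
    return max(product(*(CANDIDATES[w] for w in words)), key=score)

tags = best_tags(["the", "record", "plays"])
print(tags)
```

Choosing NOUN for “record” matters for pronunciation, since the noun and the verb are stressed differently.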
Letter to Sound Module
LTS module is responsible for the automatic determination of the phonetic transcription of the incoming text
Cannot just look words up in a pronunciation dictionary
Spelling does not follow the rule “one character = one phoneme”
Examples:
A single character corresponding to two phonemes: x as /ks/
Several characters producing one phoneme: th in thought
A single character pronounced in different ways: c in ancestor, ancient, epic
A single phoneme resulting from several spellings: sh in dish, t in action, c in ancient
Two Basic Strategies
There are two commonly used strategies to produce
audio from text:
1. Dictionary based and
2. Rule-based
Dictionary Based approach
The simplest approach to text-to-phoneme conversion is the dictionary-based approach, where a large dictionary containing all the words of a language and their correct pronunciation is stored by the program.
Determining the correct pronunciation of each word is a matter of looking up each word in the dictionary and replacing the spelling with the pronunciation specified in the dictionary.
Rule based approach
The other approach used for text-to-phoneme
conversion is the rule-based approach, where rules
for the pronunciations of words are applied to
words to work out their pronunciations based on
their spellings.
This is similar to the "sounding out" approach to
learning reading.
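The two strategies are often combined: look the word up first and fall back to rules when it is missing. The sketch below assumes a two-entry lexicon, a handful of toy rules, and ARPAbet-like phone symbols; none of these come from a real pronunciation resource, and the rule set is far too small for real English.

```python
# Hypothetical pronunciation dictionary (ARPAbet-like symbols).
LEXICON = {
    "thought": ["TH", "AO", "T"],
    "dish": ["D", "IH", "SH"],
}

# Toy letter-to-sound rules; multi-letter graphemes come first so that
# "sh" is tried before "s".
RULES = [
    ("sh", ["SH"]), ("th", ["TH"]), ("x", ["K", "S"]),
    ("a", ["AE"]), ("b", ["B"]), ("d", ["D"]), ("f", ["F"]),
    ("i", ["IH"]), ("o", ["AA"]), ("s", ["S"]), ("t", ["T"]),
]

def letter_to_sound(word):
    """Dictionary lookup with a rule-based fallback."""
    word = word.lower()
    if word in LEXICON:
        return list(LEXICON[word])
    phones, i = [], 0
    while i < len(word):
        for grapheme, output in RULES:
            if word.startswith(grapheme, i):
                phones.extend(output)
                i += len(grapheme)
                break
        else:
            i += 1  # no rule for this letter: skip it (toy behavior)
    return phones

print(letter_to_sound("thought"), letter_to_sound("fox"))
```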
Synthesizer technologies
There are two main technologies used for
generating synthetic speech waveforms:
1. concatenative synthesis and
2. formant synthesis (a.k.a: parametric speech
synthesis)
Formant Synthesis
Formant synthesis does not use any human speech samples at runtime. Instead, the output synthesized speech is created using an acoustic model.
Parameters such as frequency and amplitude are varied over time to create a waveform of artificial speech.
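The idea of generating a waveform purely from time-varying parameters can be shown with a heavily simplified sketch. It sums three sinusoids whose frequencies glide between two sets of formant values; real formant synthesizers instead pass an excitation source through resonant filters, so this only illustrates the "no recorded samples, only parameters" principle. The formant values are approximate textbook figures for the vowels /a/ and /i/ and are used here as assumptions.

```python
import math

def synthesize(start_formants, end_formants, amplitudes,
               duration_s, sample_rate=16000):
    """Sum sinusoids whose frequencies move linearly from start to end."""
    n = round(duration_s * sample_rate)
    phases = [0.0] * len(amplitudes)
    total = sum(amplitudes)
    samples = []
    for k in range(n):
        frac = k / max(n - 1, 1)  # position in the utterance, 0.0 to 1.0
        value = 0.0
        for i, amp in enumerate(amplitudes):
            freq = start_formants[i] + frac * (end_formants[i] - start_formants[i])
            value += amp * math.sin(phases[i])
            # Accumulate phase so the frequency glide stays continuous.
            phases[i] += 2 * math.pi * freq / sample_rate
        samples.append(value / total)  # keep output within [-1, 1]
    return samples

# Approximate formants (Hz) for a glide from /a/ to /i/; illustrative only.
wave = synthesize([730, 1090, 2440], [270, 2290, 3010],
                  [1.0, 0.5, 0.25], duration_s=0.1)
print(len(wave))
```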
Concatenative synthesis
Concatenative synthesis is based on the concatenation (or stringing together) of segments of recorded speech.
Generally, concatenative synthesis gives the most natural sounding synthesized speech.
However, natural variation in speech and automated techniques for segmenting the waveforms sometimes result in audible glitches in the output, detracting from the naturalness.
Concatenative Synthesis
Record basic inventory of sounds
Retrieve appropriate sequence of units at run time
Concatenate and adjust durations and pitch
Synthesize waveform
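The retrieve-and-concatenate steps can be sketched with plain lists of samples. The unit names and sample values are hypothetical stand-ins for recorded speech, and a short linear crossfade at each join is one simple way to reduce audible glitches; the duration and pitch adjustment step is not shown.

```python
# Hypothetical inventory: unit name -> recorded samples (toy values).
INVENTORY = {
    "h-e": [0.0, 0.2, 0.4, 0.6],
    "e-l": [0.6, 0.5, 0.4, 0.3],
    "l-o": [0.3, 0.2, 0.1, 0.0],
}

def concatenate(unit_names, overlap=2):
    """Retrieve units and join them with a linear crossfade."""
    units = [INVENTORY[name] for name in unit_names]
    out = list(units[0])  # copy so the inventory is not modified
    for unit in units[1:]:
        n = min(overlap, len(out), len(unit))
        start = len(out) - n
        for i in range(n):
            w = (i + 1) / (n + 1)  # weight of the incoming unit
            out[start + i] = (1 - w) * out[start + i] + w * unit[i]
        out.extend(unit[n:])
    return out

wave = concatenate(["h-e", "e-l", "l-o"])
print(len(wave))
```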
Phonetic Post Processing
In order to increase the intelligibility and the naturalness of synthetic speech, some kind of phonetic post processing is required.
After a first phonemic transcription of each word has been obtained, post processing is applied to account for co-articulatory smoothing. This smoothing results in high quality speech.
Prosody refers to certain properties of the speech signal which are related to audible changes in pitch, loudness and syllable length. This is also referred to as intonation.
DSP Module
Digital signal processing (DSP) is the numerical
manipulation of signals, usually with the intention to
measure, filter, produce or compress continuous
analog signals.
DSP takes in the narrow phonetic transcription and
gives out speech as output
More of a mathematical computation and system
development issue.
Evaluating Speech Systems
System Based Evaluation:
Total system initiative provides low usability.
User Based Evaluation:
Total user initiative introduces higher error
rate.
Thus, mixed initiative approach, balancing
usability and error rate, is taken most often.
Evaluating Speech Systems
Task Success
Was the necessary information exchanged?
Efficiency/Cost
Number of dialog turns, task completion time
Qualitative
ASR rejections, timeouts, help requests
Usability
User satisfaction with ASR, task ease, interaction pace, system response
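The task success, efficiency and qualitative measures can be aggregated from dialog logs. The log format and field names below are hypothetical, and the three logged dialogs are made-up data; usability measures such as user satisfaction would come from questionnaires and are not shown.

```python
# Hypothetical dialog logs; field names are assumptions for this sketch.
dialogs = [
    {"success": True, "turns": 6, "seconds": 42.0, "asr_rejections": 1},
    {"success": True, "turns": 4, "seconds": 30.0, "asr_rejections": 0},
    {"success": False, "turns": 11, "seconds": 90.0, "asr_rejections": 4},
]

def summarize(dialogs):
    """Average task success, efficiency and ASR rejection measures."""
    n = len(dialogs)
    return {
        "task_success_rate": sum(d["success"] for d in dialogs) / n,
        "mean_turns": sum(d["turns"] for d in dialogs) / n,
        "mean_seconds": sum(d["seconds"] for d in dialogs) / n,
        "mean_asr_rejections": sum(d["asr_rejections"] for d in dialogs) / n,
    }

summary = summarize(dialogs)
print(summary)
```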