TRANSCRIPT
Machine Translation & Automated Speech Recognition
Jaime Carbonell, with Richard Stern and Alex Rudnicky
Language Technologies Institute, Carnegie Mellon University
Telephone: +1 412 268 7279 / Fax: +1 412 268 6298
For Samsung, February 20, 2008
Outline of Presentation
Machine Translation (MT):
– Three major paradigms for MT (transfer, example-based, statistical)
– New method: Context-Based MT
Automated Speech Recognition (ASR):
– The Sphinx family of recognizers
Environmental Robustness for ASR
Intelligent Tutoring for English
– Native Accent: Pronunciation product
– Grammar, Vocabulary & other tutoring
Components for Virtual English Village
Language Technologies: Bill of Rights → Technology (e.g.)
“…right information” → IR (search engines)
“…right people” → Routing, personalization
“…right time” → Anticipatory analysis
“…right medium” → Speech: ASR, TTS
“…right language” → Machine translation
“…right level of detail” → Summarization, drill-down
Machine Translation: Context is Needed to Resolve Ambiguity
Example: English → Japanese
Power line – densen (電線)
Subway line – chikatetsu (地下鉄)
(Be) on line – onrain (オンライン)
(Be) on the line – denwachuu (電話中)
Line up – narabu (並ぶ)
Line one’s pockets – kanemochi ni naru (金持ちになる)
Line one’s jacket – uwagi o nijuu ni suru (上着を二重にする)
Actor’s line – serifu (セリフ)
Get a line on – joho o eru (情報を得る)
Sometimes local context suffices (as above) . . . but sometimes not
CONTEXT: More is Better
Examples requiring longer-range context:
– “The line for the new play extended for 3 blocks.”
– “The line for the new play was changed by the scriptwriter.”
– “The line for the new play got tangled with the other props.”
– “The line for the new play better protected the quarterback.”
Better MT requires better context sensitivity
– Syntax-rule-based MT → little or no context
– EBMT → potentially long context, on exact substring matches
– SMT → short context, with proper probabilistic optimization
– CBMT → long context with optimization (but still experimental)
Types of Machine Translation
[Diagram: the MT pyramid, from Source (Arabic) to Target (English) via three routes: Direct (SMT, EBMT); Transfer Rules (syntactic parsing → transfer → text generation); or Interlingua (semantic analysis → sentence planning → text generation).]
Example-Based Machine Translation Example
English: I would like to meet her.
Mapudungun: Ayükefun trawüael fey engu.
English: The tallest man is my father.
Mapudungun: Chi doy fütra chi wentru fey ta inche ñi chaw.
English: I would like to meet the tallest man.
Mapudungun (new): Ayükefun trawüael chi doy fütra chi wentru
Mapudungun (correct): Ayüken ñi trawüael chi doy fütra wentru engu.
Statistical Machine Translation: The Three Ingredients
The three ingredients: an English language model p(E), an English-Korean translation model p(K|E), and a decoder that maps the Korean input K to the English output E*:
E* = argmax_E p(E|K) = argmax_E p(E) · p(K|E)
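As a purely illustrative reading of this equation (not CMU or IBM code), the sketch below scores a few hypothetical English candidates with toy log-probabilities from a language model p(E) and a translation model p(K|E) and returns the argmax; the candidate strings and numbers are invented for the example.

```python
# Illustrative noisy-channel selection: E* = argmax_E p(E) * p(K|E), in log space.
import math

def best_translation(candidates):
    """candidates: list of (english_string, log_p_E, log_p_K_given_E)."""
    return max(candidates, key=lambda c: c[1] + c[2])[0]

candidates = [
    ("the line is busy",      math.log(1e-6), math.log(1e-3)),   # toy numbers
    ("the queue is occupied", math.log(1e-8), math.log(1e-2)),
]
print(best_translation(candidates))   # -> "the line is busy"
```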
Alignments
Example word alignment: English “The proposal will not now be implemented” ↔ French “Les propositions ne seront pas mises en application maintenant”.
Translation models are built in terms of hidden alignments: p(F|E) = Σ_A p(F, A | E)
Each English word “generates” zero or more French words.
Conditioning Joint Probabilities
p(F, A | E) = p(m = length(F) | E) · Π_{j=1..m} p(a_j | a_1..a_{j-1}, f_1..f_{j-1}, m, E) · p(f_j | a_1..a_j, f_1..f_{j-1}, m, E)
The modeling problem is how to approximate these terms to achieve:
Statistically robust estimates
Computationally efficient training
An effective and accurate model
The Gang of Five IBM++ Models
A series of five models is sequentially trained to “bootstrap” a detailed translation model from parallel corpora:
1. Word-by-word translation. Parameters p(f|e) between pairs of words.
2. Local alignments. Adds alignment probabilities
3. Fertilities and distortions. Allow a source word to generate zero or more target words, then place words in position with p(j|i,sentence lengths).
4. Tablets. Groups words into phrases
• Primary idea of Example-Based MT incorporated into SMT
• Currently seeking meaningful (vs. arbitrary) phrases, e.g., NPs, PPs.
5. Non-deficient alignments. Don’t ask. . .
(The alignment probabilities are p(a_j | j, sentence lengths); the fertility probabilities are p(φ(e) | e).)
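To make model 1 concrete, here is a minimal EM sketch in the spirit of IBM Model 1, estimating word-by-word parameters p(f|e) from a toy parallel corpus. The corpus, initialization, and iteration count are assumptions for illustration, not the actual IBM or CMU training code.

```python
# Minimal IBM Model 1 EM sketch on toy data.
from collections import defaultdict
from itertools import product

corpus = [("the house".split(), "la casa".split()),
          ("the book".split(),  "el libro".split()),
          ("a book".split(),    "un libro".split())]

e_vocab = {e for es, _ in corpus for e in es}
f_vocab = {f for _, fs in corpus for f in fs}
t = {(f, e): 1.0 / len(f_vocab) for f, e in product(f_vocab, e_vocab)}  # uniform init

for _ in range(10):                       # EM iterations
    count = defaultdict(float); total = defaultdict(float)
    for es, fs in corpus:
        for f in fs:                      # E-step: expected alignment counts
            z = sum(t[(f, e)] for e in es)
            for e in es:
                c = t[(f, e)] / z
                count[(f, e)] += c; total[e] += c
    for (f, e) in t:                      # M-step: re-estimate p(f|e)
        if total[e] > 0:
            t[(f, e)] = count[(f, e)] / total[e]

print(round(t[("libro", "book")], 2))     # converges toward 1.0
```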
Multi-Engine Machine Translation
MT systems have different strengths:
– Rapidly adaptable: statistical, example-based
– Good grammar: rule-based (linguistic) MT
– High precision in narrow domains: KBMT
– Minority Language MT: Learnable from informant
Combine results of parallel-invoked MT:
– Select best of multiple translations
– Selection based on optimizing combination of:
» Target language joint-exponential model
» Confidence scores of individual MT engines
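A hedged sketch of the selection step: each engine returns a hypothesis and a confidence, a target-language model scores fluency, and the best hypothesis wins under a simple weighted combination. The toy language model and the fixed weights stand in for the joint-exponential model named above.

```python
# Illustrative multi-engine hypothesis selection (weights and LM are toy stand-ins).
def select_best(hypotheses, lm_score, w_lm=0.7, w_conf=0.3):
    """hypotheses: list of (engine_name, translation, engine_confidence in [0, 1])."""
    def score(h):
        _, text, conf = h
        return w_lm * lm_score(text) + w_conf * conf
    return max(hypotheses, key=score)

def toy_lm(text):
    # toy "language model": prefers shorter strings
    return -0.1 * len(text.split())

hyps = [("RBMT", "the discharge point will self comply in", 0.4),
        ("EBMT", "the drop-off point will comply with", 0.5),
        ("SMT",  "unload of the point will take place at", 0.6)]
print(select_best(hyps, toy_lm))
```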
Illustration of Multi-Engine MT
Source chunks: “El punto de descarge” / “se cumplirá en” / “el puente Agua Fria”
Engine 1: “The drop-off point” / “will comply with” / “The cold Bridgewater”
Engine 2: “The discharge point” / “will self comply in” / “the ‘Agua Fria’ bridge”
Engine 3: “Unload of the point” / “will take place at” / “the cold water of bridge”
Key Challenges for Corpus-Based MT
Long-Range Context
– More is better
– How to incorporate it?
– How to do so tractably?
– “I conjecture that MT is AI complete” – H. Simon
State-of-the-Art
– Trigrams → 4- or 5-grams in the LM only (e.g., Google)
– SMT → SMT+EBMT (phrase tables, etc.)
Parallel Corpora
– More is better
– Where to find enough of it?
– What about rare languages?
– “There’s no data like more data” – IBM SMT circa 1992
State-of-the-Art
– Avoids rare languages (GALE only does Arabic & Chinese)
– Symbolic rule-learning from minimalist parallel corpus (AVENUE project at CMU)
Parallel Text: Requiring Less is Better (Requiring None is Best)
Challenge
– There is just not enough parallel text to approach human-quality MT for most major language pairs (we need ~100X more)
– Much parallel text is not on-point (not on domain)
– Rare languages and some distant pairs have virtually no parallel text
Context-Based (CBMT) Approach
– Invented and patented by Meaningful Machines (in New York)
– Requires no parallel text, no transfer rules . . .
– Instead, CBMT needs
» A fully-inflected bilingual dictionary
» A (very large) target-language-only corpus
» A (modest) source-language-only corpus [optional, but preferred]
CBMT System
[Diagram: CBMT system components, roughly: source- and target-language parsers and an n-gram segmenter feed the n-gram builders (translation model), including the Flooder (the non-parallel-text method); indexed resources include the bilingual dictionary, target corpora, optional [source corpora], gazetteers, a cross-language n-gram database, and a cache database; n-gram candidates become approved and stored n-gram pairs via the n-gram connector and edge locker; TTR/substitution requests handle unknown terms; the overlap-based decoder produces the target-language output.]
Step 1: Source Sentence Chunking
– Segment the source sentence into overlapping n-grams via a sliding window (sketched in code after the illustration below)
– Typical n-gram length: 4 to 9 terms
– Each term is a word or a known phrase
– Any sentence length (for the BLEU test: average 27, shortest 8, longest 66 words)
S1 S2 S3 S4 S5 S6 S7 S8 S9
S1 S2 S3 S4 S5
S2 S3 S4 S5 S6
S3 S4 S5 S6 S7
S4 S5 S6 S7 S8
S5 S6 S7 S8 S9
Flooding Set
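A minimal sketch of Step 1, assuming the terms (words or known phrases) have already been identified; the window size of 5 matches the illustration above, while the system reportedly uses 4 to 9 terms.

```python
# Segment a source sentence into overlapping n-grams with a sliding window.
def chunk(tokens, n=5):
    if len(tokens) <= n:
        return [tokens]
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]

sentence = ["S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8", "S9"]
for ngram in chunk(sentence):
    print(" ".join(ngram))
# S1 S2 S3 S4 S5
# S2 S3 S4 S5 S6
# ... down to S5 S6 S7 S8 S9
```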
Step 2: Dictionary Lookup
Using the inflected bilingual dictionary, list all possible target translations for each source word or phrase
Source word-string: S2 S3 S4 S5 S6
Target word lists: T2 = {T2-a, T2-b, T2-c, T2-d}; T3 = {T3-a, T3-b, T3-c}; T4 = {T4-a, T4-b, T4-c, T4-d, T4-e}; T5 = {T5-a}; T6 = {T6-a, T6-b, T6-c}
Step 3: Search Target Text
Using the Flooding Set, search the target text for word-strings containing one word from each group
Find the maximum number of words from the Flooding Set in the minimum-length word-string
– Words or phrases can be in any order
– Ignore function words in initial step (T5 is a function word in this example)
Flooding Set: T2 = {T2-a, T2-b, T2-c, T2-d}; T3 = {T3-a, T3-b, T3-c}; T4 = {T4-a, T4-b, T4-c, T4-d, T4-e}; T5 = {T5-a}; T6 = {T6-a, T6-b, T6-c}
Scanning the target corpus (T(x) = any other target word) yields, for example:
– Target Candidate 1: T3-b T(x) T2-d T(x) T(x) T6-c
– Target Candidate 2: T4-a T6-b T(x) T2-c T3-a
– Target Candidate 3: T3-c T2-b T4-e T5-a T6-a
Reintroduce function words after the initial match (e.g., T5)
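To make Steps 2 and 3 concrete, a small sketch of the flooding search: given the per-term target alternatives, scan a toy target corpus for short windows that cover as many groups as possible, in any order. The data layout and window limit are illustrative, not the production Flooder.

```python
# Find short target-corpus windows that cover the most flooded word groups.
def flood(corpus_tokens, groups, max_window=8):
    """groups: dict like {'S2': {'T2-a', ...}, ...}; returns scored windows."""
    candidates = []
    for start in range(len(corpus_tokens)):
        for end in range(start + 1, min(start + max_window, len(corpus_tokens)) + 1):
            window = corpus_tokens[start:end]
            covered = {g for g, alts in groups.items() if any(w in alts for w in window)}
            if covered:
                candidates.append((len(covered), -(end - start), window))
    # prefer maximum coverage, then the shortest window
    return sorted(candidates, reverse=True)

groups = {"S2": {"T2-c", "T2-d"}, "S3": {"T3-a", "T3-b"},
          "S4": {"T4-a"}, "S6": {"T6-b", "T6-c"}}
corpus = "T(x) T3-b T(x) T2-d T(x) T(x) T6-c T(x) T4-a T6-b T(x) T2-c T3-a".split()
print(flood(corpus, groups)[0])   # the window covering the most groups in the fewest words
```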
Step 4: Score Word-String Candidates
Scoring of candidates based on:
– Proximity (minimize extraneous words in the target n-gram → precision)
– Number of word matches (maximize coverage → recall)
– Regular words given more weight than function words
– Combine results (e.g., optimize F1 or p-norm or …)
Candidate / Proximity / Word Matches / “Regular” Words / Total Scoring
T3-b T(x) T2-d T(x) T(x) T6-c: 3rd / 3rd / 3rd / 3rd
T4-a T6-b T(x) T2-c T3-a: 1st / 2nd / 1st / 2nd
T3-c T2-b T4-e T5-a T6-a: 1st / 1st / 1st / 1st
Target word-string candidates (passed on to Step 5), e.g.:
T3-b T(x3) T2-d T(x5) T(x6) T6-c
T4-a T6-b T(x3) T2-c T3-a
T3-c T2-b T4-e T5-a T6-a
T(x2) T4-a T6-b T(x3) T2-c
T(x1) T2-d T3-c T(x2) T4-b
T(x1) T3-c T2-b T4-e
T6-b T(x11) T2-c T3-a T(x9)
T2-b T4-e T5-a T6-a T(x8)
T6-b T(x3) T2-c T3-a T(x8)
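One plausible way to combine the Step 4 criteria is F1 over coverage (recall) and proximity (precision), as sketched below for the three ranked candidates above; the actual CBMT combination, including the down-weighting of function words, may differ.

```python
# Score a candidate by coverage (recall) and proximity (precision), combined as F1.
def score(candidate_len, matched_groups, total_groups):
    recall = matched_groups / total_groups
    precision = matched_groups / candidate_len
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# the three candidates from the slide: (length, groups matched) out of 5 groups
for name, length, matched in [("Candidate 1", 6, 3), ("Candidate 2", 5, 4), ("Candidate 3", 5, 5)]:
    print(name, round(score(length, matched, 5), 3))
# Candidate 3 (all 5 groups, no extraneous words) scores highest
```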
Step 5: Select Candidates Using Overlap (Propagate context over the entire sentence)
Word-String 1 candidates: T(x1) T3-c T2-b T4-e; T3-c T2-b T4-e T5-a T6-a; T2-b T4-e T5-a T6-a T(x8)
Word-String 2 candidates: T(x2) T4-a T6-b T(x3) T2-c; T4-a T6-b T(x3) T2-c T3-a; T6-b T(x3) T2-c T3-a T(x8)
Word-String 3 candidates: T3-b T(x3) T2-d T(x5) T(x6) T6-c; T4-a T6-b T(x3) T2-c T3-a; T3-c T2-b T4-e T5-a T6-a
Step 5: Select Candidates Using Overlap
Candidates for one source n-gram: T(x1) T3-c T2-b T4-e; T3-c T2-b T4-e T5-a T6-a; T2-b T4-e T5-a T6-a T(x8)
Candidates for the adjacent n-gram: T(x2) T4-a T6-b T(x3) T2-c; T4-a T6-b T(x3) T2-c T3-a; T6-b T(x3) T2-c T3-a T(x8)
Best translations are selected via maximal overlap:
Alternative 1: T(x2) T4-a T6-b T(x3) T2-c T3-a T(x8)
Alternative 2: T(x1) T3-c T2-b T4-e T5-a T6-a T(x8)
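A toy sketch of the overlap operation itself, which reproduces Alternative 2 with a greedy left-to-right join; the real decoder searches over all candidate combinations rather than joining greedily.

```python
# Stitch adjacent candidate word-strings together at their maximal overlap.
def overlap(a, b):
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def join(left, right):
    return left + right[overlap(left, right):]

c1 = "T(x1) T3-c T2-b T4-e".split()
c2 = "T3-c T2-b T4-e T5-a T6-a".split()
c3 = "T2-b T4-e T5-a T6-a T(x8)".split()
print(" ".join(join(join(c1, c2), c3)))
# T(x1) T3-c T2-b T4-e T5-a T6-a T(x8)   (Alternative 2 from the slide)
```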
A (Simple) Real Example of Overlap
Final target sentence: a United States soldier died and two others were injured Monday
N-grams generated from Flooding:
– a United States soldier
– United States soldier died
– soldier died and two others
– died and two others were injured
– two others were injured Monday
Flooding → n-gram fidelity; Overlap → long-range fidelity; the n-grams are connected via overlap
Systran (for comparison): A soldier of the wounded United States died and other two were east Monday
Comparative System Scores (based on the same Spanish test set)
[Chart: BLEU scores. Systran Spanish 0.3859; SDL Spanish 0.5137; Google Chinese ('06 NIST) and Google Arabic ('05 NIST) at 0.5551 and 0.5610; Google Spanish ('08 top language) 0.7209; MM (CBMT) Spanish 0.7447, and 0.7533 non-blind; the chart tops out at 0.85, within the human scoring range.]
Historical CBMT Scoring
[Chart: BLEU scores from Apr 2004 through Dec 2007, blind and non-blind test sets, rising from .3743 in early 2004 through the .60s in 2005-2006 to .7447 (blind) and .7533 (non-blind) by late 2007, approaching the human scoring range (chart ceiling 0.85).]
Illustrative Translation Examples
Un coche bomba estalla junto a una comisaría de policía en Bagdad
MM: A car bomb explodes next to a police station in Baghdad
Systran: A car pump explodes next to a police station of police in bagdad
Hamas anunció este jueves el fin de su cese del fuego con Israel
MM: Hamas announced Thursday the end of the cease fire with israel
Systran: Hamas announced east Thursday the aim of its cease-fire with Israel
Testigos dijeron haber visto seis jeeps y dos vehículos blindados
MM: Witnesses said they have seen six jeeps and two armored vehicles
Systran: Six witnesses said to have seen jeeps and two armored vehicles
A More Complex Example
Un soldado de Estados Unidos murió y otros dos resultaron heridos este lunes
por el estallido de un artefacto explosivo improvisado en el centro de Bagdad,
dijeron funcionarios militares estadounidenses.
MM: A United States soldier died and two others were injured monday by the
explosion of an improvised explosive device in the heart of Baghdad, American
military officials said.
Systran: A soldier of the wounded United States died and other two were east
Monday by the outbreak from an improvised explosive device in the center of
Bagdad, said American military civil employees.
Summing up CBMT
What if a source word or phrase is not in the bilingual dictionary?
– Automatically find near synonyms in the source,
– Replace and retranslate
Current Status
– Developing a Chinese-to-English CBMT system (no results yet)
– Pre-commercial (field-testing)
– CMU participating in improving CBMT technology
Advantages
– Most accurate MT method in existence
– Does not require parallel text
Disadvantages
– Run-time speed: 20 s/word (7/07) → 1.5 s/word (2/08) → 0.1 s/word (12/08)
– Requires heavy-duty servers (not yet miniaturizable)
Compositionality
Adjust the rule to reflect compositionality: the NP rule can be used to translate part of the sentence; keep or add context constraints and eliminate unnecessary ones
Flat Seed Generation
English: The highly qualified applicant visits the company.
German: Der äußerst qualifizierte Bewerber besucht die Firma.
((1,1),(2,2),(3,3),(4,4),(5,5),(6,6))
S::S [det adv adj n v det n] [det adv adj n v det n]
((x1::y1) (x2::y2)….((x4 agr) = *3-sing) …((y3 case) = *nom)…)
Goal: Syntactic Transfer Rules
1) Flat Seed Generation: produce rules from word-aligned sentence pairs, abstracted only to the POS level; no syntactic structure
2) Add compositional structure to the seed rule by exploiting previously learned rules
3) Seeded Version Space Learning: group seed rules by constituent sequences and alignments; seed rules form the S-boundary of the version space; generalize with validation
Seeded Version Space Learning: group seed rules into version spaces (e.g., all rules with the constituent sequence NP v det n …)
Notes: 1) partial order of rules in the VS; 2) generalization via merging
S::S [NP v det n] [NP n v det n]
((x1::y1) (x2::y2)….…
((y1 case) = *nom)…)
NP::NP [det adv adj n][det adv adj n]
((x1::y1)…((y4 agr) = (x4 agr)
….)
Merge two rules by:
1) Deleting a constraint
2) Raising two value constraints to one agreement constraint, e.g. ((x1 num) = *pl), ((x3 num) = *pl) → ((x1 num) = (x3 num))
3) Using the merged rule to translate
Version Spaces (Mitchell, 1980)
[Diagram: a version-space lattice, from specific instances (the S boundary) up to “Anything” (the G boundary), with the “target” concept between the two boundaries; the lattice has depth N and branching factor b.]
Original & Seeded Version Spaces
Version-spaces (Mitchell, 1980)
– Symbolic multivariate learning
– S & G sets define lattice boundaries
– Exponential worst-case: O(b^N)
Seeded Version Spaces (Carbonell, 2002)
– Generality-level hypothesis → seed
– S & G subsets → effective lattice
– Polynomial worst case: O(b^(k/2)), k = 3, 4
Seeded Version Spaces (Carbonell, 2002)
[Diagram: the version-space lattice with its G and S boundaries, the “target” concept, and a rule mapping Xn → Ym.]
Example: “The big book” → “el libro grande”
Det Adj N → Det N Adj
(Y2 num) = (Y3 num); (Y2 gen) = (Y3 gen); (X3 num) = (Y2 num)
Seeded Version Spaces
[Diagram: the same lattice with a seed (best guess) placed near the “target” concept; the search explores only k levels around the seed rather than the full N-level lattice. Example again: Xn → Ym, “The big book” → “el libro grande”.]
Transfer Rule Learning
One SVS per transfer rule
– Alignment + generalization level = seed
– Apply SVS(seed,k) learning method
» High-confidence seed → small k [k too small increases prob(failure)]
» Low-confidence seed → larger k [k too large increases computation & elicitation]
For translation use SVS f-centroid
– If success % > threshold, halt SVS
– If no rule found, k := k + 1 & redo (up to max K)
Compositionality
;;L2: h &niin h rxb b h bxirwt
;;L1: the widespread interest in the election
NP::NP
[Det N Det Adj PP] -> [Det Adj N PP]
((X1::Y1)(X2::Y3)
(X3::Y1)(X4::Y2)
(X5::Y4)
((Y3 num) = (X2 num))
((X2 num) = sg)
((X2 gen) = m)
((Y1 def) = +) )
(The slide labels the parts of this rule: training example, rule type, component sequences, component alignments, agreement constraints, value constraints.)
Automated Speech Recognition at CMU: Sphinx and Related Technologies
Sphinx is the name for a family of speech recognition engines – developed by Reddy, Lee & Rudnicky
– First continuous-speech, speaker-independent large vocabulary system
– Standard continuous HMM technology
– Multiple languages
– Open source code base
– Miniaturized and integrated with MT, dialog, and TTS
– Basis for Native Accent & Carnegie Speech products
Janus system, developed by Waibel and Schultz
– Started as a neural-net approach
– Now standard continuous HMM system
– Multiple languages
– NOT open source code base
– Held a WER advantage over Sphinx until recently (now both are equal)
Sphinx ASR at Carnegie Mellon
Sphinx systems have evolved along two parallel branches
Technology innovation occurs in the Research branch, then transitions to the Live branch
The Live branch focuses on the additional requirements of interactive real-time recognition and on the integration with higher-level components such as spoken language understanding and dialog management
[Timeline, 1986 to 2006: the Research ASR branch runs Sphinx, Sphinx-2, Sphinx-3.x, Sphinx-3.7; the Live ASR branch runs SphinxL, Sphinx-2L, PocketSphinx.]
Some features of Sphinx
Acoustic modeling: CHMM-based representations; feature-space reduction (LDA/MLLT); adaptive training (based on VTLN); speaker adaptation (MLLR); discriminative training
Real-time decoding: GMM selection (4-level); lexical trees with fixed-size tree copying; parameter quantization
Live interaction support: end-point detection; dynamic class-based language models; dynamic language models; partial hypotheses; confidence annotation; dynamic noise processing
Spoken language processing
[Diagram: speech, text, meaning, and action connected by recognition, dialogue/discourse, understanding, generation, and synthesis, all drawing on task & domain knowledge.]
A generic speech system
[Diagram: human speech enters signal processing and the decoder; the parser and post-parser feed the dialog manager, which consults several domain agents and drives the language generator and speech synthesizer (machine speech), plus display and effector outputs.]
Decoding speech
Signal processing: reduce the dimensionality of the signal; noise conditioning
Decoder: transcribe speech to words, using acoustic, lexical, and language models (corpus-based statistical models)
A model for speech recognition
Speech recognition as an estimation problem:
– A = sequence of acoustic observations
– W = string of words
Ŵ = argmax_W P(W | A)
Recognition
A Bayesian formulation of the recognition problem (the “fundamental equation” of speech recognition):
p(W | A) = p(W) · p(A | W) / p(A)
posterior = prior × conditional (“the model”) / evidence, where W is the event and A the evidence
Understanding speech
Parser: extract semantic content from the utterance, driven by a grammar
Post-parser: introduce context and world knowledge into the interpretation, using context and domain agents
(Supporting work: grounding, knowledge engineering, ontology design, language acquisition)
Interacting with the user
Dialog manager: guide the interaction through the task; map user inputs and system state into actions (task schemas, context; task analysis)
Domain agents: interact with the back-end(s); interpret information using domain knowledge (database, live data such as the Web, domain expert; knowledge engineering)
Communicating with the user
Language generator: decide what to say to the user (and how to phrase it), driven by rules and a database
Speech synthesizer: construct the sounds and intonation
(Display and action generators handle non-speech output)
Environmental robustness for the STS
Expected types of disturbance:
– White/broadband noise
– Speech babble
– Single competing speakers
– Other background music
Comments:
– State-of-the-art technologies have been applied mostly to applications where time and computation are unlimited
– Samsung expects that reverberation will not be a factor; if reverberation actually is present in the environment, compensation becomes more difficult
Speech recognition as pattern classification
Major functional components:
– Signal processing to extract features from speech waveforms
– Comparison of features to pre-stored representations
Important design choices:
– Choice of features
– Specific method of comparing features to stored representations
[Diagram: speech → feature extraction → speech features → decision-making procedure → word hypotheses]
Multistage environmental compensation for the Samsung STS
Compensation is a combination of:
– ZCAE separates speech from noise based on direction of arrival
– ANC cancels noise
– CDCN, VTS, DCN provide static and dynamic noise removal, channel compensation, feature normalization
– CBMF restores missing features that are lost due to transient noise
Binaural processing for selective reconstruction (the ZCAE algorithm)
– Short-time Fourier analysis separates speech in time and frequency
– Separate time-frequency components according to direction of arrival, based on time-delay comparisons (ITD)
– Reconstruct the signal from only the components believed to arise from the desired source
– Missing-feature restoration can smooth the reconstruction
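The sketch below illustrates the masking idea behind this kind of binaural selection, not the actual ZCAE algorithm: take STFTs of the left and right channels, estimate each time-frequency cell's interaural delay from the cross-channel phase, keep only cells whose delay is consistent with an assumed broadside target, and resynthesize. Frame size, hop, and delay threshold are assumptions.

```python
# Illustrative time-frequency masking by interaural delay (not the real ZCAE code).
import numpy as np

def select_target(left, right, fs, frame=512, hop=256, max_delay_s=1e-4):
    """left, right: equal-length 1-D numpy arrays (two microphone channels)."""
    win = np.hanning(frame)
    def stft(x):
        frames = [win * x[i:i + frame] for i in range(0, len(x) - frame, hop)]
        return np.fft.rfft(np.array(frames), axis=1)
    L, R = stft(left), stft(right)
    freqs = np.fft.rfftfreq(frame, 1.0 / fs)
    phase_diff = np.angle(L * np.conj(R))
    delay = phase_diff / (2 * np.pi * np.maximum(freqs, 1.0))   # seconds per cell
    mask = np.abs(delay) < max_delay_s                          # keep near-zero-delay cells
    Y = np.where(mask, L, 0.0)
    out = np.zeros(len(left))                                   # overlap-add resynthesis
    for i, frame_sig in enumerate(np.fft.irfft(Y, n=frame, axis=1)):
        out[i * hop:i * hop + frame] += frame_sig
    return out
```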
Performance of zero-crossing-based amplitude estimation (ZCAE)
Sample results with a 45-degree arrival-angle difference (original vs. separated audio, with no reverberation and with a 300-ms reverberation time):
ZCAE provides excellent separation in “dry” rooms but fails in reverberation
Adaptive noise cancellation (ANC)
Speech + noise enters primary channel, correlated noise enters reference channel
The adaptive filter attempts to transform the noise in the reference channel to best resemble the noise in the primary channel, and then subtracts it
Performance degrades when speech leaks into reference channel and in reverberation
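For illustration, a textbook LMS-based canceller of the kind described above; the filter length, step size, and signal conventions are assumptions, not the STS implementation.

```python
# Widrow-style adaptive noise cancellation with an LMS filter (illustrative).
import numpy as np

def lms_anc(primary, reference, taps=32, mu=0.01):
    """primary: speech + noise; reference: correlated noise only (numpy arrays)."""
    w = np.zeros(taps)
    out = np.zeros(len(primary))
    for n in range(taps, len(primary)):
        x = reference[n - taps:n][::-1]        # most recent reference samples
        noise_estimate = w @ x
        out[n] = primary[n] - noise_estimate   # the error signal is the cleaned speech
        w += mu * out[n] * x                   # LMS weight update
    return out
```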
Possible placement of mics for ZCAE and ANC
One possible arrangement:
– Two omni mics for ZCAE processing
– One unidirectional mic (notch toward the speaker) for the ANC reference
Another option is to place the reference mic on the back, away from the speaker
“Classic” compensation for noise and filtering
Model of Signal Degradation:
Compensation procedures attempt to estimate characteristics of filter and additive noise, and then apply inverse operations
Algorithms include CDCN, VTS, DCN
Compensation applied at cepstral level (not to waveforms)
Supersedes Wiener filtering
Compensation results: WSJ task in white noise
VTS provides ~7-dB improvement over baseline
What does compensated speech “sound” like?
Speech reconstructed after CDCN processing: [audio demo of original vs. “recovered” speech at 0 dB, 20 dB, and clean]
Signals that sound “better” do not produce lower error rates
Missing-feature recognition
General approach:
– Determine which cells of a spectrogram-like display are unreliable (or “missing”)
– Ignore missing features or make best guess about their values based on data that are present
Comment:
– Used in the STS for transient compensation and as a supporting technology for ZCAE
Spectrogram corrupted by noise at SNR 15 dB
Some regions are affected far more than others
Ignoring regions in the spectrogram that are corrupted by noise
All regions with SNR less than 0 dB deemed missing (dark blue)
Recognition performed based on colored regions alone
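A small sketch of how such a reliability mask might be computed from a noisy spectrogram and a noise-floor estimate, thresholding at 0 dB local SNR as on this slide; the arrays and the noise estimate are illustrative.

```python
# Mark time-frequency cells with local SNR below 0 dB as "missing" (illustrative).
import numpy as np

def reliability_mask(noisy_power, noise_power_estimate, snr_threshold_db=0.0):
    snr_db = 10 * np.log10(noisy_power / np.maximum(noise_power_estimate, 1e-12)
                           - 1 + 1e-12)       # local SNR per cell
    return snr_db >= snr_threshold_db          # True = reliable, False = missing

# toy example: 3 frequency bands x 4 frames, flat noise floor of 1.0
noisy = np.array([[4.0, 9.0, 1.2, 5.0],
                  [1.1, 1.0, 8.0, 1.3],
                  [6.0, 1.1, 1.0, 7.0]])
print(reliability_mask(noisy, np.ones_like(noisy)))
```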
Robustness Summary
The CMU Robust Group will implement an ensemble of complementary technologies for the Samsung STS:
– ZCAE for speech separation by arrival angle
– Adaptive noise compensation to reduce background interference
– CDCN, VTS, DCN to reduce stationary residual noise and provide channel compensation
– Cluster-based missing-feature compensation to protect against transient interference and to improve ZCAE performance
Most compensation procedures are based on a tight integration of environmental compensation with the core speech recognition engine.
Virtual English Village (VEV) [SAMSUNG SLIDE]
Service Overview
– An online 3D virtual world providing language students with an opportunity to practice and develop conversational skills for pre-defined domains/scenarios
– Provides students with a fun, safe environment to develop conversational skills
– Improves students’:
» Grammar, by correcting student input or suggesting other appropriate variations
» Vocabulary, by suggesting other words
» Pronunciation, through clear synthesized speech and correction of student input
– The service requires the student to have access to a microphone
– A complement to existing English-language services in Korea
Target Market
– Initially targeting Korean elementary to high-school students (~8 to 16 years old) with some familiarity with the English language
Potential Carnegie Contributions to VEV
ASR: Sphinx speech recognition (CMU)
TTS: Festival/Festbox text-to-speech (CMU)
Let’s Go & Olympus dialog systems (CMU)
Native Accent pronunciation tutor (Carnegie Speech)
Grammar & lexicon tutors (Carnegie Speech)
– Story Friends, Story Librarian, …
– Intelligent tutoring architecture
Language parsing/generation (CMU & Carnegie Speech)
Let’s Go Performance
Automated bus schedule/routes for Port Authority of Pittsburgh
Answers phone every evening and on weekends since March 5, 2005
Presently over 52,000 calls from real users
– Variety of noise and transmission conditions
– Variety of dialects and other variants
Estimated Task Success = about 75%
Average number of turns = 12-14
Dialogue characteristics:
– Mixed initiative after first turn
– Open-ended first turn (“What can I do for you?”)
– Change of topic upon comprehension problem (e.g. “what neighborhood are you in?”)
Let’s Go Architecture
ASR: Sphinx-II
Parsing: Phoenix
Dialogue management: Olympus II
Speech synthesis: Festival
Hub: Galaxy
NLG: Rosetta
Olympus II dialogue architecture
Improves timing awareness by separating DM decision-making and awareness into separate modules:
– Improves turn-taking
– Higher success after barge-in
– Reduces system cut-ins
The international spoken-dialogue community uses Olympus II for different systems
– About 10 different applications
» Bus, air reservations, room reservations, movie line, personal minder, etc.
Open Source with Documentation
Speech synthesis
Recognizer = Sphinx-II (enhanced, customized)
Synthesizer = Festival
– Well-known, open source
– Provides very high quality limited domain synthesis
» Some people calling Let’s Go think they are talking to a person
Rosetta Spoken Language Generation
– Generate natural language sentences using templates
Spoken Dialogue and Language Learning
Correction of any of several levels of language within a dialogue while keeping the dialogue on track
– “I want to go the airport”
– “Going to the airport, at what time?”
Adaptation to non-native speech
Uses F0 unit selection (Raux, Black ASRU 2003) for emphasis in synthesis
Lexical entrainment for rapid correction
– Implicit strategy, no break in dialogue
Spoken Dialogue and Language Learning
Lexical entrainment for rapid correction
Collect speech in WOZ setup
Model expected error
Find closest user utterance
[Diagram: target prompts and the ASR hypothesis feed a DP-based alignment; the closest prompt is selected and delivered as an emphasis/confirmation prompt.]