
Machine Translation & Automated Speech Recognition

Jaime Carbonell, with Richard Stern and Alex Rudnicky

Language Technologies Institute, Carnegie Mellon University

Telephone: +1 412 268 7279; Fax: +1 412 268 6298

[email protected]

For Samsung, February 20, 2008


Outline of Presentation

Machine Translation (MT):

– Three major paradigms for MT (transfer, example-based, statistical)

– New method: Context-Based MT

Automated Speech Recognition (ASR):

– The Sphinx family of recognizers

Environmental Robustness for ASR

Intelligent Tutoring for English

– Native Accent: Pronunciation product

– Grammar, Vocabulary & other tutoring

Components for Virtual English Village


Language Technologies: BILL OF RIGHTS → TECHNOLOGY (e.g.)

“…right information” → IR (search engines)

“…right people” → Routing, personalization

“…right time” → Anticipatory analysis

“…right medium” → Speech: ASR, TTS

“…right language” → Machine translation

“…right level of detail” → Summarization, drill-down


Machine Translation: Context is Needed to Resolve Ambiguity

Example (English → Japanese), for the word “line”:

Power line – densen ( 電線 )
Subway line – chikatetsu ( 地下鉄 )
(Be) on line – onrain ( オンライン )
(Be) on the line – denwachuu ( 電話中 )
Line up – narabu ( 並ぶ )
Line one’s pockets – kanemochi ni naru ( 金持ちになる )
Line one’s jacket – uwagi o nijuu ni suru ( 上着を二重にする )
Actor’s line – serifu ( セリフ )
Get a line on – joho o eru ( 情報を得る )

Sometimes local context suffices (as above) . . . but sometimes not


CONTEXT: More is Better

Examples requiring longer-range context:

– “The line for the new play extended for 3 blocks.”

– “The line for the new play was changed by the scriptwriter.”

– “The line for the new play got tangled with the other props.”

– “The line for the new play better protected the quarterback.”

Better MT requires better context sensitivity:

– Syntax-rule-based MT: little or no context

– EBMT: potentially long context, but only on exact substring matches

– SMT: short context, with proper probabilistic optimization

– CBMT: long context with optimization (but still experimental)


Types of Machine Translation

[Diagram: the MT pyramid, from Source (Arabic) to Target (English). Direct methods (SMT, EBMT) translate at the surface level; transfer methods add syntactic parsing and text generation linked by transfer rules; the interlingua route adds semantic analysis and sentence planning.]


Example-Based Machine Translation Example

English: I would like to meet her.
Mapudungun: Ayükefun trawüael fey engu.

English: The tallest man is my father.
Mapudungun: Chi doy fütra chi wentru fey ta inche ñi chaw.

English: I would like to meet the tallest man.
Mapudungun (new): Ayükefun trawüael Chi doy fütra chi wentru
Mapudungun (correct): Ayüken ñi trawüael chi doy fütra wentruengu.


Statistical Machine Translation: The Three Ingredients

[Diagram: the three ingredients. An English language model scores p(E); an English-Korean translation model scores p(K|E); the decoder searches for the best English E* given Korean input K.]

$$E^* = \arg\max_E p(E \mid K) = \arg\max_E p(E)\, p(K \mid E)$$
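The noisy-channel selection above is easy to state in code. Below is a minimal, illustrative sketch: the candidate strings and the two log-probability tables are invented, and a real system would use an n-gram LM for p(E) and an alignment-based translation model for p(K|E).

```python
import math

def decode(source, candidates, lm_logprob, tm_logprob):
    """Select E* = argmax_E log p(E) + log p(K|E)."""
    return max(candidates, key=lambda e: lm_logprob(e) + tm_logprob(source, e))

# Hypothetical scores for two candidate translations of one Korean sentence.
lm = {"the line is busy": math.log(0.02), "the wire is occupied": math.log(0.001)}
tm = {"the line is busy": math.log(0.3),  "the wire is occupied": math.log(0.4)}

best = decode("K", lm.keys(), lm.__getitem__, lambda k, e: tm[e])
print(best)  # the LM prior outweighs the slightly better TM score
```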

Alignments

[Diagram: word alignment between English “The proposal will not now be implemented” and French “Les propositions ne seront pas mises en application maintenant”.]

Translation models are built in terms of hidden alignments:

$$p(F \mid E) = \sum_{A} p(F, A \mid E)$$

Each English word “generates” zero or more French words.


Conditioning Joint Probabilities

$$p(F, A \mid E) = p(m = \mathrm{length}(F) \mid E) \prod_{j=1}^{m} p(a_j \mid a_1^{j-1}, f_1^{j-1}, m, E)\; p(f_j \mid a_1^{j}, f_1^{j-1}, m, E)$$

The modeling problem is how to approximate these terms to achieve:

Statistically robust estimates

Computationally efficient training

An effective and accurate model


The Gang of Five IBM++ Models

A series of five models is sequentially trained to “bootstrap” a detailed translation model from parallel corpora:

1. Word-by-word translation. Parameters p(f|e) between pairs of words.

2. Local alignments. Adds alignment probabilities p(a_j | j, sentence lengths).

3. Fertilities and distortions. Allow a source word to generate zero or more target words via fertility probabilities p(φ(e) | e), then place words in position with p(j | i, sentence lengths).

4. Tablets. Groups words into phrases.

• Primary idea of Example-Based MT incorporated into SMT

• Currently seeking meaningful (vs. arbitrary) phrases, e.g., NPs, PPs

5. Non-deficient alignments. Don’t ask. . .
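Model 1, the first of the five, can be trained with a few lines of EM. The sketch below is illustrative only: it implements word-by-word p(f|e) training (none of the alignment, fertility, or distortion machinery of the later models), and the three-sentence corpus is invented.

```python
from collections import defaultdict

# Toy parallel corpus (English, "foreign") for IBM Model 1 EM training.
corpus = [("the house".split(), "la casa".split()),
          ("the book".split(),  "el libro".split()),
          ("a book".split(),    "un libro".split())]

t = defaultdict(lambda: 1.0)              # t(f|e), uniform initialization

for _ in range(10):                       # EM iterations
    count = defaultdict(float)            # expected counts c(f,e)
    total = defaultdict(float)            # normalizers c(e)
    for e_sent, f_sent in corpus:
        for f in f_sent:                  # E-step: each f word spreads its
            z = sum(t[(f, e)] for e in e_sent)   # mass over the e words
            for e in e_sent:
                p = t[(f, e)] / z
                count[(f, e)] += p
                total[e] += p
    for (f, e), c in count.items():       # M-step: renormalize
        t[(f, e)] = c / total[e]

print(round(t[("libro", "book")], 3))     # converges toward 1.0
```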


Multi-Engine Machine Translation

MT systems have different strengths:

– Rapidly adaptable: statistical, example-based

– Good grammar: rule-based (linguistic) MT

– High precision in narrow domains: KBMT

– Minority-language MT: learnable from informant

Combine results of parallel-invoked MT:

– Select the best of multiple translations

– Selection based on optimizing a combination of:

» Target-language joint-exponential model

» Confidence scores of individual MT engines
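A minimal sketch of the selection step, assuming each engine returns a translation plus its own confidence; the lm_score function and the weights are hypothetical placeholders, not the joint-exponential model itself.

```python
def select(outputs, lm_score, w_lm=0.7, w_conf=0.3):
    """outputs: list of (translation, engine_confidence) pairs.
    Pick the output maximizing a weighted combination of the
    target-LM score and the engine's own confidence."""
    return max(outputs, key=lambda o: w_lm * lm_score(o[0]) + w_conf * o[1])[0]
```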


Illustration of Multi-Engine MT

Source:   El punto de descarge | se cumplirá en | el puente Agua Fria

Engine 1: The drop-off point | will comply with | The cold Bridgewater

Engine 2: The discharge point | will self comply in | the “Agua Fria” bridge

Engine 3: Unload of the point | will take place at | the cold water of bridge


Key Challenges for Corpus-Based MT

Long-Range Context

– More is better

– How to incorporate it?

– How to do so tractably?

– “I conjecture that MT is AI complete” – H. Simon

State-of-the-Art

– Trigrams → 4- or 5-grams in the LM only (e.g., Google)

– SMT → SMT+EBMT (phrase tables, etc.)

Parallel Corpora

– More is better

– Where to find enough of it?

– What about rare languages?

– “There’s no data like more data” – IBM SMT circa 1992

State-of-the-Art

– Avoids rare languages (GALE only does Arabic & Chinese)

– Symbolic rule-learning from minimalist parallel corpus (AVENUE project at CMU)


Parallel Text: Requiring Less is Better (Requiring None is Best)

Challenge:

– There is just not enough parallel text to approach human-quality MT for most major language pairs (we need ~100X more)

– Much parallel text is not on-point (not on domain)

– Rare languages and some distant pairs have virtually no parallel text

Context-Based (CBMT) Approach

– Invented and patented by Meaningful Machines (in New York)

– Requires no parallel text, no transfer rules . . .

– Instead, CBMT needs

» A fully-inflected bilingual dictionary

» A (very large) target-language-only corpus

» A (modest) source-language-only corpus [optional, but preferred]

CBMT System

[Diagram: CBMT system architecture. Components include source- and target-language parsers, an n-gram segmenter, the Flooder (the non-parallel-text method), n-gram builders forming the translation model over indexed target corpora (and optional source corpora), an inflected bilingual dictionary, a cross-language n-gram database with a cache database, an n-gram connector that turns n-gram candidates into approved and stored n-gram pairs, gazetteers, an edge locker, an overlap-based decoder, and a substitution-request path (TTR) for unknown terms.]

Step 1: Source Sentence Chunking

Segment the source sentence into overlapping n-grams via a sliding window (a minimal code sketch follows the diagram below)

Typical n-gram length: 4 to 9 terms

Each term is a word or a known phrase

Any sentence length (for the BLEU test: average 27, shortest 8, longest 66 words)

S1 S2 S3 S4 S5 S6 S7 S8 S9

S1 S2 S3 S4 S5

S2 S3 S4 S5 S6

S3 S4 S5 S6 S7

S4 S5 S6 S7 S8

S5 S6 S7 S8 S9
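A minimal sketch of the Step 1 sliding window, reproducing the diagram above (window length 5 over S1..S9):

```python
def segment(tokens, n=5):
    """Overlapping n-grams via a sliding window. tokens may be words
    or known phrases; n is typically 4 to 9."""
    if len(tokens) <= n:
        return [tokens]
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]

sent = ["S%d" % i for i in range(1, 10)]   # S1..S9 from the diagram
for gram in segment(sent):
    print(" ".join(gram))                  # S1..S5, S2..S6, ..., S5..S9
```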

Step 2: Dictionary Lookup

Using the inflected bilingual dictionary, list all possible target translations for each source word or phrase.

Source word-string: S2 S3 S4 S5 S6

Target word lists (the Flooding Set):

T2: T2-a T2-b T2-c T2-d
T3: T3-a T3-b T3-c
T4: T4-a T4-b T4-c T4-d T4-e
T5: T5-a
T6: T6-a T6-b T6-c

Step 3: Search Target Text

Using the Flooding Set, search the target text for word-strings containing one word from each group.

Find the maximum number of words from the Flooding Set in the minimum-length word-string:

– Words or phrases can be in any order

– Ignore function words in the initial step (T5 is a function word in this example)

Step 3: Search Target Text (Examples)

Scanning the target corpus for word-strings that cover the Flooding Set yields, for example:

Target Candidate 1: T3-b T(x) T2-d T(x) T(x) T6-c

Target Candidate 2: T4-a T6-b T(x) T2-c T3-a

Target Candidate 3: T3-c T2-b T4-e T5-a T6-a

Reintroduce function words after initial match (e.g. T5)
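A rough sketch of the flooding search, assuming the corpus is a token list and each source content word contributes one set of candidate translations (function words are dropped before matching, as above); ranking here is simply (groups matched, shortest window), a crude stand-in for the full Step 4 scoring.

```python
def flood(corpus_tokens, groups, max_len=8):
    """groups: list of sets of candidate target words, one set per source
    content word. Returns the best (matched_count, -length, window)."""
    hits = []
    for i in range(len(corpus_tokens)):
        for j in range(i + 1, min(i + max_len, len(corpus_tokens)) + 1):
            window = corpus_tokens[i:j]
            matched = sum(1 for g in groups if any(w in g for w in window))
            hits.append((matched, -(j - i), window))
    return max(hits)

corpus = "T(x) T3-b T(x) T2-d T(x) T(x) T6-c T(x)".split()
groups = [{"T2-a", "T2-b", "T2-c", "T2-d"}, {"T3-a", "T3-b", "T3-c"},
          {"T4-a", "T4-b", "T4-c", "T4-d", "T4-e"}, {"T6-a", "T6-b", "T6-c"}]
print(flood(corpus, groups))  # finds the T3-b ... T6-c window (3 of 4 groups)
```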


Step 4: Score Word-String Candidates

Scoring of candidates is based on:

– Proximity: minimize extraneous words in the target n-gram (precision)

– Number of word matches: maximize coverage (recall)

– Regular words given more weight than function words

– Combine the results (e.g., optimize F1 or p-norm or …)

Candidate                        Proximity   Word Matches   “Regular” Words   Total
T3-b T(x) T2-d T(x) T(x) T6-c    3rd         3rd            3rd               3rd
T4-a T6-b T(x) T2-c T3-a         1st         2nd            1st               2nd
T3-c T2-b T4-e T5-a T6-a         1st         1st            1st               1st

Step 5: Select Candidates Using Overlap (propagate context over the entire sentence)

Target word-string candidates from the overlapping source word-strings:

Word-String 1 candidates:
T(x2) T4-a T6-b T(x3) T2-c
T(x1) T3-c T2-b T4-e
T(x1) T2-d T3-c T(x2) T4-b

Word-String 2 candidates:
T4-a T6-b T(x3) T2-c T3-a
T3-c T2-b T4-e T5-a T6-a
T3-b T(x3) T2-d T(x5) T(x6) T6-c

Word-String 3 candidates:
T6-b T(x3) T2-c T3-a T(x8)
T2-b T4-e T5-a T6-a T(x8)
T6-b T(x11) T2-c T3-a T(x9)


Step 5: Select Candidates Using Overlap

Chain 1:
T(x2) T4-a T6-b T(x3) T2-c
T4-a T6-b T(x3) T2-c T3-a
T6-b T(x3) T2-c T3-a T(x8)
→ Alternative 1: T(x2) T4-a T6-b T(x3) T2-c T3-a T(x8)

Chain 2:
T(x1) T3-c T2-b T4-e
T3-c T2-b T4-e T5-a T6-a
T2-b T4-e T5-a T6-a T(x8)
→ Alternative 2: T(x1) T3-c T2-b T4-e T5-a T6-a T(x8)

Best translations selected via maximal overlap
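A minimal sketch of overlap-based chaining, using the toy candidates above: adjacent candidates are joined when a long suffix of one equals a prefix of the next, propagating context across the sentence.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def join(chains, candidates, min_olap=3):
    out = []
    for chain in chains:
        for cand in candidates:
            k = overlap(chain, cand)
            if k >= min_olap:
                out.append(chain + cand[k:])
    return out

first  = [["T(x2)", "T4-a", "T6-b", "T(x3)", "T2-c"]]
second = [["T4-a", "T6-b", "T(x3)", "T2-c", "T3-a"]]
third  = [["T6-b", "T(x3)", "T2-c", "T3-a", "T(x8)"]]
for chain in join(join(first, second), third):
    print(" ".join(chain))  # T(x2) T4-a T6-b T(x3) T2-c T3-a T(x8)
```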

A (Simple) Real Example of Overlap

Target sentence: a United States soldier died and two others were injured Monday

N-grams generated from Flooding, connected via Overlap:

a United States soldier
United States soldier died
soldier died and two others
died and two others were injured
two others were injured Monday

Systran (for comparison): A soldier of the wounded United States died and other two were east Monday

Flooding → n-gram fidelity; Overlap → long-range fidelity

Comparative System Scores

[Chart: BLEU scores, all on the same Spanish test set (Google Chinese and Google Arabic scores from the ’06 and ’05 NIST evaluations respectively, shown for reference). Systems compared: Systran Spanish, Google Arabic, Google Chinese, SDL Spanish, Google Spanish (’08 top language), and MM (CBMT) Spanish. Reported scores run from 0.3859 at the low end through 0.5137, 0.5551, 0.5610, and 0.7209, up to 0.7447 for MM Spanish (0.7533 non-blind); the human scoring range extends to roughly 0.85.]

Historical CBMT Scoring

[Chart: CBMT BLEU scores from Apr 2004 through Dec 2007, rising from .3743 to .7447 on blind test data (.7533 non-blind), approaching the human scoring range (up to ~0.85).]

Illustrative Translation Examples

Un coche bomba estalla junto a una comisaría de policía en Bagdad

MM: A car bomb explodes next to a police station in Baghdad

Systran: A car pump explodes next to a police station of police in bagdad

Hamas anunció este jueves el fin de su cese del fuego con Israel

MM: Hamas announced Thursday the end of the cease fire with israel

Systran: Hamas announced east Thursday the aim of its cease-fire with Israel

Testigos dijeron haber visto seis jeeps y dos vehículos blindados

MM: Witnesses said they have seen six jeeps and two armored vehicles

Systran: Six witnesses said to have seen jeeps and two armored vehicles

A More Complex Example

Un soldado de Estados Unidos murió y otros dos resultaron heridos este lunes por el estallido de un artefacto explosivo improvisado en el centro de Bagdad, dijeron funcionarios militares estadounidenses.

MM: A United States soldier died and two others were injured monday by the explosion of an improvised explosive device in the heart of Baghdad, American military officials said.

Systran: A soldier of the wounded United States died and other two were east Monday by the outbreak from an improvised explosive device in the center of Bagdad, said American military civil employees.

Summing up CBMT

What if a source word or phrase is not in the bilingual dictionary?

– Automatically find near-synonyms in the source

– Replace and retranslate

Current Status

– Developing a Chinese-to-English CBMT system (no results yet)

– Pre-commercial (field-testing)

– CMU participating in improving CBMT technology

Advantages

– Most accurate MT method in existence

– Does not require parallel text

Disadvantages

– Run-time speed: 20 s/word (7/07) → 1.5 s/word (2/08) → 0.1 s/word (12/08, projected)

– Requires heavy-duty servers (not yet miniaturizable)


Goal: Syntactic Transfer Rules

1) Flat Seed Generation: produce rules from word-aligned sentence pairs, abstracted only to POS level; no syntactic structure

2) Compositionality: add compositional structure to the seed rule by exploiting previously learned rules

3) Seeded Version Space Learning: group seed rules by constituent sequences and alignments; seed rules form the S-boundary of the version space; generalize with validation

Flat Seed Generation

The highly qualified applicant visits the company.
Der äußerst qualifizierte Bewerber besucht die Firma.
((1,1),(2,2),(3,3),(4,4),(5,5),(6,6))

S::S [det adv adj n v det n] [det adv adj n v det n]
((x1::y1) (x2::y2) … ((x4 agr) = *3-sing) … ((y3 case) = *nom) …)

Compositionality

Adjust the rule to reflect compositionality: an NP rule can be used to translate part of the sentence; keep/add context constraints, eliminate unnecessary ones.

S::S [NP v det n] [NP n v det n]
((x1::y1) (x2::y2) … ((y1 case) = *nom) …)

NP::NP [det adv adj n] [det adv adj n]
((x1::y1) … ((y4 agr) = (x4 agr)) …)

Seeded Version Space Learning

Group seed rules into version spaces:

Notes:
1) Partial order of rules in the VS
2) Generalization via merging

Merge two rules:
1) Deletion of a constraint
2) Raising of two value constraints to one agreement constraint, e.g. ((x1 num) = *pl), ((x3 num) = *pl) → ((x1 num) = (x3 num))
3) Use the merged rule to translate


Version Spaces (Mitchell, 1980)

[Diagram: a version-space lattice over N features with branching factor b, bounded below by the S boundary (specific instances) and above by the G boundary (anything); the “target” concept lies between the two boundaries.]


Original & Seeded Version Spaces

Version spaces (Mitchell, 1980):

– Symbolic multivariate learning

– S & G sets define lattice boundaries

– Exponential worst case: O(b^N)

Seeded Version Spaces (Carbonell, 2002):

– Generality-level hypothesis seed

– S & G subsets → effective lattice

– Polynomial worst case: O(b^{k/2}), k = 3, 4


Seeded Version Spaces (Carbonell, 2002)

[Diagram: the version-space lattice between the S boundary and the G boundary for transfer rules X^n → Y^m, with the “target” concept inside.]

Example: “The big book” → “el libro grande”

Det Adj N → Det N Adj
(Y2 num) = (Y3 num)
(Y2 gen) = (Y3 gen)
(X3 num) = (Y2 num)


Seeded Version Spaces

[Diagram: the seed (best guess at the “target” concept) sits at an intermediate generality level; learning explores only the sub-lattice within distance k of the seed, between the S and G boundaries, rather than the full lattice of depth N. Same example: “The big book” → “el libro grande”.]


Transfer Rule Learning

One SVS per transfer rule:

– Alignment + generalization level = seed

– Apply the SVS(seed, k) learning method

» High-confidence seed → small k [k too small increases prob(failure)]

» Low-confidence seed → larger k [k too large increases computation & elicitation]

For translation, use the SVS f-centroid:

– If success % > threshold, halt SVS

– If no rule found, k := k+1 & redo (to max K)


Sample Learned Rule

;;L2: h &niin h rxb b h bxirwt
;;L1: the widespread interest in the election    (training example)

NP::NP                                  (rule type)
[Det N Det Adj PP] -> [Det Adj N PP]    (component sequences)
((X1::Y1)(X2::Y3)
(X3::Y1)(X4::Y2)
(X5::Y4)                                (component alignments)
((Y3 num) = (X2 num))                   (agreement constraint)
((X2 num) = sg)
((X2 gen) = m)
((Y1 def) = +) )                        (value constraints)


Automated Speech Recognition at CMU: Sphinx and Related Technologies

Sphinx is the name for a family of speech recognition engines, developed by Reddy, Lee & Rudnicky:

– First continuous-speech, speaker-independent, large-vocabulary system

– Standard continuous-HMM technology

– Multiple languages

– Open-source code base

– Miniaturized and integrated with MT, dialog, and TTS

– Basis for Native Accent & Carnegie Speech products

Janus system, developed by Waibel and Schultz:

– Started as a neural-net approach

– Now a standard continuous-HMM system

– Multiple languages

– NOT an open-source code base

– Held a WER advantage over Sphinx until recently (now both are equal)


Sphinx ASR at Carnegie Mellon

Sphinx systems have evolved along two parallel branches.

Technology innovation occurs in the Research branch, then transitions to the Live branch.

The Live branch focuses on the additional requirements of interactive real-time recognition and on integration with higher-level components such as spoken-language understanding and dialog management.

[Timeline, 1986-2006: Research ASR branch: Sphinx → Sphinx-2 → Sphinx-3.x → Sphinx-3.7; Live ASR branch: SphinxL → Sphinx-2L → PocketSphinx.]


Some features of Sphinx

Acoustic modeling: CHMM-based representations; feature-space reduction (LDA/MLLT); adaptive training (based on VTLN); speaker adaptation (MLLR); discriminative training

Real-time decoding: GMM selection (4-level); lexical trees; fixed-size tree copying; parameter quantization

Live interaction support: end-point detection; dynamic class-based language models; dynamic language models; partial hypotheses; confidence annotation; dynamic noise processing


Spoken language processing

[Diagram: the processing chain speech → text → meaning → action. Recognition maps speech to text; understanding, discourse, and dialogue map text to meaning and action, drawing on task & domain knowledge; generation and synthesis map back out to speech.]


A generic speech system

[Diagram: human speech flows through Signal Processing and the Decoder to the Parser and post-parser; the Dialog Manager consults domain agents and drives the Language Generator and Speech Synthesizer, producing machine speech along with display and effector output.]


Decoding speech

Signal processing: reduces the dimensionality of the signal; noise conditioning

Decoder: transcribes speech to words, drawing on acoustic models, lexical models, and language models (corpus-based statistical models)


A model for speech recognition

Speech recognition as an estimation problem:

– A = sequence of acoustic observations

– W = string of words

$$\hat{W} = \arg\max_{W} P(W \mid A)$$


Recognition

A Bayesian formulation of the recognition problem, the “fundamental equation” of speech recognition:

$$p(W \mid A) = \frac{p(W)\, p(A \mid W)}{p(A)}$$

Here p(W|A) is the posterior, p(W) the prior, p(A|W) the conditional (the “model”), and p(A) the evidence; W is the event, A the evidence.
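In log space the argmax is a one-liner, since p(A) is constant across candidate word strings. The sketch below uses two invented toy score tables in place of real language and acoustic models.

```python
import math

def recognize(candidates, lm_logp, am_logp):
    """W^ = argmax_W p(W) p(A|W); p(A) cancels in the argmax."""
    return max(candidates, key=lambda w: lm_logp[w] + am_logp[w])

lm_logp = {"recognize speech": math.log(0.01),
           "wreck a nice beach": math.log(0.0001)}
am_logp = {"recognize speech": math.log(0.20),
           "wreck a nice beach": math.log(0.30)}

print(recognize(list(lm_logp), lm_logp, am_logp))  # the prior wins
```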


Understanding speech

Parser: extracts semantic content from the utterance, guided by a grammar (grounding, knowledge engineering)

Post-parser: introduces context and world knowledge into the interpretation, via context models and domain agents (ontology design, language acquisition)


Interacting with the user

Dialog manager, supported by domain agents:

– Guides the interaction through the task (task schemas; task analysis)

– Maps user inputs and system state into actions

– Interacts with back-end(s): database, live data (e.g. the Web), domain expert

– Interprets information using domain knowledge and context (knowledge engineering)


Communicating with the user

Language generator (rules, database): decides what to say to the user, and how to phrase it; display and action generators handle the non-speech output

Speech synthesizer (rules, database): constructs the sounds and intonation


Environmental robustness for the STS

Expected types of disturbance:

– White/broadband noise

– Speech babble

– Single competing speakers

– Other background music

Comments:

– State-of-the-art technologies have been applied mostly to applications where time and computation are unlimited

– Samsung expects reverberation will not be a factor; if reverb actually is in environment, compensation becomes more difficult


Speech recognition as pattern classification

Major functional components:

– Signal processing to extract features from speech waveforms

– Comparison of features to pre-stored representations

Important design choices:

– Choice of features

– Specific method of comparing features to stored representations

[Diagram: feature extraction turns the speech signal into features; a decision-making procedure compares them to pre-stored representations and emits word hypotheses.]


Multistage environmental compensation for the Samsung STS

Compensation is a combination of:

– ZCAE separates speech from noise based on direction of arrival

– ANC cancels noise

– CDCN, VTS, DCN provide static and dynamic noise removal, channel compensation, feature normalization

– CBMF restores missing features that are lost due to transient noise


Binaural processing for selective reconstruction (the ZCAE algorithm)

Short-time Fourier analysis separates speech in time and frequency

Separate time-frequency components according to direction of arrival, based on time-delay comparisons (ITD)

Reconstruct the signal from only the components that are believed to arise from the desired source

Missing-feature restoration can smooth the reconstruction
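A rough, hypothetical sketch of the binaural masking idea (not the published ZCAE algorithm): estimate a per-cell interaural time difference from the cross-channel phase, keep only the cells whose ITD is near zero (the look direction), and resynthesize.

```python
import numpy as np
from scipy.signal import stft, istft

def itd_mask_separate(left, right, fs, itd_tol=1e-4):
    """Keep time-frequency cells whose estimated ITD is near zero,
    i.e., components arriving from straight ahead."""
    f, t, L = stft(left, fs, nperseg=512)
    _, _, R = stft(right, fs, nperseg=512)
    phase_diff = np.angle(L * np.conj(R))
    with np.errstate(divide="ignore", invalid="ignore"):
        itd = phase_diff / (2 * np.pi * f[:, None])  # seconds per cell
    mask = np.abs(np.nan_to_num(itd)) < itd_tol      # DC row is dropped
    _, out = istft(L * mask, fs, nperseg=512)
    return out
```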


Performance of zero-crossing-based amplitude estimation (ZCAE)

Sample results with a 45-degree arrival-angle difference (original vs. separated audio):

– With no reverberation

– With a 300-ms reverb time

ZCAE provides excellent separation in “dry” rooms but fails in reverberation


Adaptive noise cancellation (ANC)

Speech + noise enters primary channel, correlated noise enters reference channel

The adaptive filter transforms the noise in the reference channel to best resemble the noise in the primary channel, then subtracts it

Performance degrades when speech leaks into reference channel and in reverberation
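A minimal LMS-based sketch of the ANC idea described above; the filter length and step size are illustrative.

```python
import numpy as np

def lms_anc(primary, reference, taps=32, mu=0.01):
    """LMS adaptive noise canceller: predict the primary-channel noise
    from the reference channel and subtract; the residual is speech."""
    primary = np.asarray(primary, dtype=float)
    reference = np.asarray(reference, dtype=float)
    w = np.zeros(taps)
    out = np.zeros_like(primary)
    for n in range(taps, len(primary)):
        x = reference[n - taps:n][::-1]   # most recent reference samples first
        e = primary[n] - w @ x            # error = speech estimate
        w += mu * e * x                   # LMS weight update
        out[n] = e
    return out
```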


Possible placement of mics for ZCAE and ANC

One possible arrangement:

– Two omni mics for ZCAE processing

– One uni mic (notch toward speaker) for ANC reference

Another option is to have reference mic on back, away from speaker


“Classic” compensation for noise and filtering

Model of signal degradation: [diagram] clean speech passes through a linear filter and acquires additive noise

Compensation procedures attempt to estimate characteristics of filter and additive noise, and then apply inverse operations

Algorithms include CDCN, VTS, DCN

Compensation applied at cepstral level (not to waveforms)

Supersedes Wiener filtering


Compensation results: WSJ task in white noise

VTS provides ~7-dB improvement over baseline


What does compensated speech “sound” like?

Speech reconstructed after CDCN processing: [audio demo: original vs. “recovered” speech at 0 dB, 20 dB, and clean]

Signals that sound “better” do not produce lower error rates


Missing-feature recognition

General approach:

– Determine which cells of a spectrogram-like display are unreliable (or “missing”)

– Ignore missing features or make best guess about their values based on data that are present

Comment:

– Used in the STS for transient compensation and as a supporting technology for ZCAE


[Spectrogram: original speech]

[Spectrogram: the same speech corrupted by noise at 15-dB SNR; some regions are affected far more than others]

[Spectrogram: all regions with SNR less than 0 dB deemed missing (dark blue); recognition is performed based on the colored (reliable) regions alone]
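A minimal sketch of the masking idea, assuming per-cell speech and noise power estimates are available; the marginalized score is a crude stand-in for true per-state Gaussian marginalization.

```python
import numpy as np

def reliable_mask(speech_power, noise_power, snr_floor_db=0.0):
    """Mark spectrogram cells whose local SNR falls below the floor
    (0 dB here, as on the slide) as missing."""
    snr_db = 10 * np.log10(np.maximum(speech_power, 1e-12)
                           / np.maximum(noise_power, 1e-12))
    return snr_db >= snr_floor_db          # True = reliable, False = missing

def marginalized_score(frame, template, mask):
    """Score one frame against a template using reliable cells only."""
    return -np.sum(((frame - template) ** 2)[mask])
```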


Robustness Summary

The CMU Robust Group will implement an ensemble of complementary technologies for the Samsung STS:

– ZCAE for speech separation by arrival angle

– Adaptive noise compensation to reduce background interference

– CDCN, VTS, DCN to reduce stationary residual noise and provide channel compensation

– Cluster-based missing-feature compensation to protect against transient interference and to improve ZCAE performance

Most compensation procedures are based on a tight integration of environmental compensation with the core speech recognition engine.

Virtual English Village (VEV) [SAMSUNG SLIDE]

Service Overview:

– An online 3D virtual world providing language students with an opportunity to practice and develop conversational skills for pre-defined domains/scenarios

– Provides students with a fun, safe environment to develop conversational skills

– Improves students’:

» Grammar, by correcting student input or suggesting other appropriate variations

» Vocabulary, by suggesting other words

» Pronunciation, through clear synthesized speech and correction of student input

– Service requires the student to have access to a microphone

– A complement to existing English-language services in Korea

Target Market:

– Initially targeting Korean elementary to high-school students (~8 to 16 years old) with some familiarity with the English language


Potential Carnegie Contributions to VEV

ASR: Sphinx speech recognition (CMU)

TTS: Festival/FestVox text-to-speech (CMU)

Let’s Go & Olympus dialog systems (CMU)

Native Accent pronunciation tutor (Carnegie Speech)

Grammar & lexicon tutors (Carnegie Speech)

– Story Friends, Story Librarian, …

– Intelligent tutoring architecture

Language parsing/generation (CMU & Carnegie Speech)


Let’s Go Performance

Automated bus schedule/routes for Port Authority of Pittsburgh

Answers the phone every evening and on weekends, continuously since March 5, 2005

Presently over 52,000 calls from real users

– Variety of noise and transmission conditions

– Variety of dialects and other variants

Estimated Task Success = about 75%

Average number of turns = 12-14

Dialogue characteristics:

– Mixed initiative after first turn

– Open-ended first turn (“What can I do for you?”)

– Change of topic upon comprehension problem (e.g. “what neighborhood are you in?”)


Let’s Go Architecture

ASR: Sphinx II
Parsing: Phoenix
Dialogue management: Olympus II
NLG: Rosetta
Speech synthesis: Festival
Hub: Galaxy


Olympus II dialogue architecture

Improves timing awareness by separating DM decision-making and awareness into separate modules:

– Improves turn-taking

– Higher success after barge-in

– Reduces system cut-ins

The international spoken-dialogue community uses Olympus II for different systems (about 10 different applications):

» Bus, air reservations, room reservations, movie line, personal minder, etc.

Open source, with documentation


Olympus II architecture

[Architecture diagram]


Speech synthesis

Recognizer = Sphinx-II (enhanced, customized)

Synthesizer = Festival

– Well-known, open source

– Provides very high quality limited domain synthesis

» Some people calling Let’s Go think they are talking to a person

Rosetta Spoken Language Generation

– Generate natural language sentences using templates


Spoken Dialogue and Language Learning

Correction of any of several levels of language within a dialogue while keeping the dialogue on track

– “I want to go the airport”

– “Going to the airport, at what time?”

Adaptation to non-native speech

Uses F0 unit selection (Raux, Black ASRU 2003) for emphasis in synthesis

Lexical entrainment for rapid correction

– Implicit strategy, no break in dialogue


Spoken Dialogue and Language Learning

Lexical entrainment for rapid correction:

– Collect speech in a WOZ (Wizard-of-Oz) setup

– Model the expected error

– Find the closest user utterance

[Diagram: target prompts and the ASR hypothesis feed a DP-based alignment; prompt selection then issues a confirmation prompt with emphasis.]