Speech-to-Speech MT in NESPOLE! Design and Engineering Alon Lavie, Lori Levin Work with: Chad Langley, Tanja Schultz, Dorcas Wallace, Donna Gates, Kay Peterson, Kornel Laskowski MT Class, April 2, 2003


Page 1: Speech-to-Speech MT in NESPOLE!

Design and Engineering

Alon Lavie, Lori Levin

Work with: Chad Langley, Tanja Schultz, Dorcas Wallace, Donna Gates, Kay Peterson, Kornel Laskowski

MT Class, April 2, 2003

Page 2

• Speech-to-speech translation for E-Commerce applications

• Partners: CMU, Univ of Karlsruhe, ITC-irst, UJF-CLIPS, AETHRA, APT-Trentino

• Builds on successful collaboration within C-STAR
• Improved limited-domain speech translation
• Experiment with multimodality and with MEMT
• Showcase-1: Travel and Tourism in Trentino, completed in Nov-2001, demonstrated at IST and HLT
• Showcase-2: expanded travel + medical service

Page 3: NESPOLE! System Overview

• Human-to-human spoken language translation for e-commerce application (e.g. travel & tourism) (Lavie et al., 2002)

• English, German, Italian, and French
• Translation via interlingua
• Translation servers for each language exchange interlingua to perform translation:
  – Speech recognition (Speech → Text)
  – Analysis (Text → Interlingua)
  – Generation (Interlingua → Text)
  – Synthesis (Text → Speech)
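Because the servers meet only at the interlingua, adding a language requires only its own analysis and generation modules. A minimal sketch of that exchange, with toy lookup tables standing in for the real analysis and generation components (all names here are illustrative, not the NESPOLE! API):

```python
# Hypothetical sketch: two language servers communicating only via the
# interlingua (IF).  Toy lookup tables replace the real modules.

def english_analysis(text):
    """Toy English analyzer: maps a known utterance to an IF string."""
    table = {"hello": "c:greeting (greeting=hello)"}
    return table[text.lower()]

def italian_generation(interlingua):
    """Toy Italian generator: maps a known IF string to Italian text."""
    table = {"c:greeting (greeting=hello)": "ciao"}
    return table[interlingua]

def translate_en_to_it(text):
    # Each side sees only the interlingua produced by the other side.
    interlingua = english_analysis(text)    # Text -> Interlingua
    return italian_generation(interlingua)  # Interlingua -> Text

print(translate_en_to_it("Hello"))  # ciao
```

The design choice this illustrates: n languages need 2n modules (one analyzer, one generator each) instead of n(n-1) direct transfer components.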

Page 4: Speech-to-speech in E-commerce

• Augment current passive web E-commerce with live interaction capabilities

• Client starts via web, can easily connect to agent for specific detailed information

• “Thin client” - very little special hardware and software on client PC: browser, MS Netmeeting, Shared Whiteboard

Page 5: NESPOLE! User Interfaces

Page 6: NESPOLE! Translation Monitor

Page 7: NESPOLE! Architecture

Page 8: Distributed S2S Translation over the Internet

Page 9: Language-specific HLT Servers

Page 10: Our Parsing and Analysis Approach

• Goal: A portable and robust analyzer for task-oriented human-to-human speech, parsing utterances into interlingua representations

• Our earlier systems used full semantic grammars to parse complete DAs
  – Useful for parsing spoken language in restricted domains
  – Difficult to port to new domains

• Current focus is on improving portability to new domains (and new languages)

• Approach: Continue to use semantic grammars to parse domain-independent phrase-level arguments and train classifiers to identify DAs

Page 11: Interchange Format

• Interchange Format (IF) is a shallow semantic interlingua for task-oriented domains

• Utterances represented as sequences of semantic dialog units (SDUs)

• IF representation consists of four parts:
  – Speaker
  – Speech act
  – Concepts
  – Arguments

speaker : speech-act +concept* +argument*

The speech act plus the concept sequence together form the Domain Action (DA).
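A minimal sketch of assembling an IF string from those four parts (the helper function and its argument encoding are invented for illustration; the example output mirrors the IF shown on page 13):

```python
# Illustrative IF assembly: speaker : speech-act(+concept)* (arguments).
# The helper and its dict-based argument encoding are assumptions, not
# the NESPOLE! implementation.

def build_if(speaker, speech_act, concepts, arguments):
    """Assemble an IF string from speaker, speech act, concepts, args."""
    domain_action = "+".join([speech_act] + concepts)
    args = ", ".join(f"{k}={v}" for k, v in arguments.items())
    return f"{speaker}:{domain_action} ({args})" if args else f"{speaker}:{domain_action}"

sdu = build_if("c", "give-information", ["disposition", "trip"],
               {"location": "(place-name=val_di_fiemme)"})
print(sdu)
# c:give-information+disposition+trip (location=(place-name=val_di_fiemme))
```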

Page 12: Hybrid Analysis Approach

Text → Argument Parser → Text + Arguments → SDU Segmenter → Text + Arguments + SDUs → DA Classifier → IF

Use a combination of grammar-based phrase-level parsing and machine learning to produce interlingua (IF) representations
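As a rough sketch, the three stages can be chained as plain functions; all three components below are toy stand-ins (the real modules are the SOUP parser, a statistical segmenter, and TiMBL classifiers):

```python
# Toy three-stage analysis pipeline: parse arguments, segment into
# SDUs, classify each SDU's domain action.  All rules are invented.

def argument_parser(words):
    # Label words that match toy phrase grammars; '-' means unparsed.
    labels = {"hello": "greeting=", "vacation": "visit-spec="}
    return [(w, labels.get(w, "-")) for w in words]

def sdu_segmenter(parsed):
    # Toy rule: close an SDU right after a greeting argument.
    sdus, current = [], []
    for word, label in parsed:
        current.append((word, label))
        if label == "greeting=":
            sdus.append(current)
            current = []
    if current:
        sdus.append(current)
    return sdus

def da_classifier(sdu):
    # Choose a domain action from the argument labels present.
    labels = {label for _, label in sdu}
    return "greeting" if "greeting=" in labels else "give-information+trip"

def analyze(words):
    return [da_classifier(sdu) for sdu in sdu_segmenter(argument_parser(words))]

print(analyze(["hello", "a", "vacation"]))
# ['greeting', 'give-information+trip']
```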

Page 13: Hybrid Analysis Approach (Example)

Input: "Hello. I would like to take a vacation in Val di Fiemme."
IF output:
c:greeting (greeting=hello)
c:give-information+disposition+trip (disposition=(who=i, desire), visit-spec=(identifiability=no, vacation), location=(place-name=val_di_fiemme))

Processing stages for "hello i would like to take a vacation in val di fiemme":
1. Argument parsing labels the phrases: greeting=, disposition=, visit-spec=, location=
2. SDU segmentation splits the utterance into SDU1 ("hello") and SDU2 ("i would like to take a vacation in val di fiemme")
3. DA classification assigns greeting to SDU1 and give-information+disposition+trip to SDU2

Page 14: Argument Parsing

• Parse utterances using phrase-level grammars
• SOUP parser (Gavaldà, 2000): stochastic, chart-based, top-down robust parser designed for real-time analysis of spoken language

• Separate grammars based on the type of phrases that the grammar is intended to cover

Page 15: Domain Action Classification

• Identify the DA for each SDU using trainable classifiers

• Two TiMBL (k-NN) classifiers:
  – Speech act
  – Concept sequence

• Binary features indicate presence or absence of arguments and pseudo-arguments
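A toy version of such a classifier, with invented training data, shows the mechanics: binary features mark which arguments occur in an SDU, and the nearest training examples vote on the label:

```python
# Toy k-NN speech-act classifier over binary argument-presence
# features, in the spirit of the TiMBL classifiers (data invented).

TRAIN = [
    # (has greeting=, has disposition=, has location=), speech act
    ((1, 0, 0), "greeting"),
    ((0, 1, 1), "give-information"),
    ((0, 1, 0), "give-information"),
]

def hamming(a, b):
    """Number of feature positions where two vectors disagree."""
    return sum(x != y for x, y in zip(a, b))

def classify(features, k=1):
    neighbors = sorted(TRAIN, key=lambda ex: hamming(ex[0], features))[:k]
    # Majority vote among the k nearest training examples.
    labels = [label for _, label in neighbors]
    return max(set(labels), key=labels.count)

print(classify((0, 1, 1)))  # give-information
print(classify((1, 0, 0)))  # greeting
```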

Page 16: Using the IF Specification

• Use knowledge of the IF specification during DA classification
  – Ensure that only legal DAs are produced
  – Guarantee that the DA and arguments combine to form a valid IF representation
• Strategy: Find the best DA that licenses the most arguments
  – Trust the parser to reliably label arguments
  – Retaining detailed argument information is important for translation

Page 17: Evaluation: Classification Accuracy

• 20-fold cross-validation using the NESPOLE! travel domain database

The database:
                        English   German
  SDUs                   8289      8719
  Domain actions          972      1001
  Speech acts              70        70
  Concept sequences       615       638
  Vocabulary             1946      2815

Most frequent class:
                        English   German
  Speech act             41.4%     40.7%
  Concept sequence       38.9%     40.3%
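The most-frequent-class rows are a majority-class baseline: always predict the commonest label in the training data. A sketch with an invented label distribution (chosen to echo the 41.4% English speech-act figure):

```python
# Majority-class baseline: predict the most frequent label and report
# its share of the data.  The label multiset below is invented.

from collections import Counter

def majority_baseline(labels):
    counts = Counter(labels)
    label, freq = counts.most_common(1)[0]
    return label, freq / len(labels)

label, acc = majority_baseline(
    ["give-information"] * 414 + ["greeting"] * 300 + ["request"] * 286)
print(label, acc)  # give-information 0.414
```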

Page 18: Evaluation: Classification Accuracy

Classification accuracy:
                        English   German
  Speech acts           81.25%    78.93%
  Concept sequences     69.59%    67.08%

Page 19: Evaluation: End-to-End Translation

• English-to-English and English-to-Italian
• Training set: ~8000 SDUs from NESPOLE!
• Test set: 2 dialogs, only client utterances
• Uses IF specification fallback strategy
• Three graders, bilingual English/Italian speakers
• Each SDU graded as perfect, ok, bad, or very bad
• Acceptable translation = perfect + ok
• Majority scores
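The grading scheme can be sketched as a majority vote per SDU followed by the perfect+ok acceptability test (the grade lists below are invented examples):

```python
# Sketch of the grading scheme: three graders score each SDU, the
# majority grade is kept, and "perfect" or "ok" counts as acceptable.

from collections import Counter

ACCEPTABLE = {"perfect", "ok"}

def majority_grade(grades):
    return Counter(grades).most_common(1)[0][0]

def acceptable_rate(all_grades):
    majorities = [majority_grade(g) for g in all_grades]
    return sum(m in ACCEPTABLE for m in majorities) / len(majorities)

rate = acceptable_rate([
    ["ok", "ok", "bad"],             # majority ok       -> acceptable
    ["bad", "bad", "perfect"],       # majority bad      -> not
    ["perfect", "perfect", "ok"],    # majority perfect  -> acceptable
    ["very bad", "very bad", "ok"],  # majority very bad -> not
])
print(rate)  # 0.5
```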

Page 20: Evaluation: End-to-End Translation

Speech recognizer hypotheses, word accuracy rate (WAR): 66.7%, 56.4%

English source input     Target language   Acceptable (OK + Perfect)
Human transcription      English           68.1%
                         Italian           69.7%
SR hypothesis            English           50.4%
                         Italian           50.2%

Page 21: Evaluation: Data Ablation Experiment

[Chart: Classification accuracy (16-fold cross-validation) — mean accuracy vs. training set size (500 to 6009 SDUs), with curves for Speech Act, Concept Sequence, and Domain Action]

Page 22: Domain Portability

• Experimented with porting to a medical assistance domain in NESPOLE!

• Initial medical domain system up and running, with reasonable coverage of flu-like symptoms and chest pain

• Porting the interlingua, grammars, and modules for English, German, and Italian required about 6 person-months in total
  – Interlingua development: ~180 hours
  – Interlingua annotation: ~200 hours
  – Analysis grammars, training: ~250 hours
  – Generation development: ~250 hours

Page 23: New Development Tools

Page 24: Questions?

Page 25: Grammars

• Argument grammar
  – Identifies arguments defined in the IF:
    s[arg:activity-spec=]
      (*[object-ref=any] *[modifier=good] [biking])
  – Covers "any good biking", "any biking", "good biking", "biking", plus synonyms for all three words
• Pseudo-argument grammar
  – Groups common phrases with similar meanings into classes:
    s[=arrival=] (*is *usually arriving)
  – Covers "arriving", "is arriving", "usually arriving", "is usually arriving", plus synonyms
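Why a single rule with starred (optional) tokens covers several surface phrases can be seen by expanding the pattern; the expansion helper below is illustrative, not SOUP code:

```python
# Expand a SOUP-style pattern in which '*' marks an optional token,
# showing why (*any *good biking) covers exactly four phrases.

from itertools import product

def expand(pattern):
    """pattern: list of (token, optional) pairs -> all covered phrases."""
    choices = [[tok, None] if optional else [tok] for tok, optional in pattern]
    phrases = []
    for combo in product(*choices):
        # Drop the omitted optional tokens and join the rest.
        phrases.append(" ".join(t for t in combo if t))
    return phrases

print(expand([("any", True), ("good", True), ("biking", False)]))
# ['any good biking', 'any biking', 'good biking', 'biking']
```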

Page 26: Grammars

• Cross-domain grammar
  – Identifies simple domain-independent DAs:
    s[greeting]
      ([greeting=first_meeting] *[greet:to-whom=])
  – Covers "nice to meet you", "nice to meet you donna", "nice to meet you sir", plus synonyms
• Shared grammar
  – Contains low-level rules accessible by all other grammars

Page 27: Segmentation

• Identify SDU boundaries between argument parse trees

• Insert a boundary if either parse tree is from cross-domain grammar

• Otherwise, use a simple statistical model

F([A1][A2]) = C([A1][A2]) / ( C([A1]) C([A2]) )

where C(·) counts occurrences of argument labels and adjacent label pairs in the training data.
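One plausible count-based reading of the statistical model (cf. the segmentation work in Lavie et al., 1997); the counts, labels, and threshold below are invented for illustration:

```python
# Hedged sketch of a count-based boundary model: score an adjacent
# argument pair by how often it co-occurred within one SDU in training,
# relative to the counts of each argument alone; a low score suggests
# the pair rarely shares an SDU, so insert a boundary.  All numbers
# here are invented.

PAIR_COUNT = {("disposition=", "location="): 40}                     # C([A1][A2])
ARG_COUNT = {"greeting=": 50, "disposition=": 60, "location=": 50}   # C([A])

def cooccurrence(a1, a2):
    pair = PAIR_COUNT.get((a1, a2), 0)
    return pair / (ARG_COUNT[a1] * ARG_COUNT[a2])

def insert_boundary(a1, a2, threshold=0.005):
    # Low co-occurrence -> likely an SDU boundary between the pair.
    return cooccurrence(a1, a2) < threshold

print(insert_boundary("greeting=", "disposition="))  # True  (never co-occur)
print(insert_boundary("disposition=", "location="))  # False
```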

Page 28: Using the IF Specification

• Check if the best speech act and concept sequence form a legal IF

• If not, test alternative combinations of speech acts and concept sequences from ranked set of possibilities

• Select the best combination that licenses the most arguments

• Drop any arguments not licensed by the best DA
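The fallback selection described above can be sketched as a search over ranked speech acts and concept sequences against a toy legality table (the real system consults the IF specification; all names here are invented):

```python
# Sketch of the IF-specification fallback: test ranked (speech act,
# concept sequence) combinations, skip illegal DAs, and keep the legal
# DA that licenses the most parsed arguments.

LEGAL = {
    # (speech act, concept sequence) -> arguments the DA licenses
    ("give-information", "disposition+trip"): {"disposition", "location"},
    ("give-information", "trip"): {"location"},
}

def best_domain_action(ranked_sas, ranked_css, arguments):
    best, best_licensed = None, -1
    for sa in ranked_sas:
        for cs in ranked_css:
            licensed = LEGAL.get((sa, cs))
            if licensed is None:          # illegal DA: skip it
                continue
            n = len(set(arguments) & licensed)
            if n > best_licensed:
                best, best_licensed = (sa, cs), n
    return best, best_licensed

da, n = best_domain_action(["request", "give-information"],
                           ["disposition+trip", "trip"],
                           ["disposition", "location"])
print(da, n)  # ('give-information', 'disposition+trip') 2
```

Arguments outside the winning DA's licensed set would then be dropped, which is why the fallback keeps the drop rate low.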

Page 29: Grammar Development and Classifier Training

• Four steps:
  1. Write argument grammars
  2. Parse training data
  3. Obtain segmentation counts
  4. Train DA classifiers

• Steps 2-4 are automated to simplify testing new grammars

• Translation servers include a development mode for testing new grammars

Page 30: Evaluation: IF Specification Fallback

• 182 SDUs required classification
• 4% had illegal DAs
• 29% had illegal IFs
• Mean arguments per SDU: 1.47

Changed by fallback:
  Speech act          5%
  Concept sequence    26%
  Domain action       29%

Arguments dropped per SDU:
  Without fallback    0.38
  With fallback       0.07

Page 31: Evaluation: Data Ablation Experiment

• 16-fold cross-validation setup
• Test set size (# SDUs): 400
• Training set sizes (# SDUs): 500, 1000, 2000, 3000, 4000, 5000, 6009 (all data)
• Data from previous C-STAR system
• No use of IF specification

Page 32: Future Work

• Alternative segmentation models, feature sets, and classification methods
• Multiple argument parses
• Evaluate portability and robustness
  – Collect dialogues in a new domain
  – Create argument and full DA grammars for a small development set of dialogues
  – Assess portability by comparing grammar development times and examining grammar reusability
  – Assess robustness by comparing performance on unseen data

Page 33: References

• Cattoni, R., M. Federico, and A. Lavie. 2001. Robust Analysis of Spoken Input Combining Statistical and Knowledge-Based Information Sources. In Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, Trento, Italy.

• Daelemans, W., J. Zavrel, K. van der Sloot, and A. van den Bosch. 2000. TiMBL: Tilburg Memory Based Learner, version 3.0, Reference Guide. ILK Technical Report 00-01. http://ilk.kub.nl/~ilk/papers/ilk0001.ps.gz

• Gavaldà, M. 2000. SOUP: A Parser for Real-World Spontaneous Speech. In Proceedings of the IWPT-2000, Trento, Italy.

• Gotoh, Y. and S. Renals. 2000. Sentence Boundary Detection in Broadcast Speech Transcripts. In Proceedings of the International Speech Communication Association Workshop: Automatic Speech Recognition: Challenges for the New Millennium, Paris.

• Lavie, A., F. Metze, F. Pianesi, et al. 2002. Enhancing the Usability and Performance of NESPOLE! – a Real-World Speech-to-Speech Translation System. In Proceedings of HLT-2002, San Diego, CA.

Page 34: References (continued)

• Lavie, A., C. Langley, A. Waibel, et al. 2001. Architecture and Design Considerations in NESPOLE!: a Speech Translation System for E-commerce Applications. In Proceedings of HLT-2001, San Diego, CA.

• Lavie, A., D. Gates, N. Coccaro, and L. Levin. 1997. Input Segmentation of Spontaneous Speech in JANUS: a Speech-to-speech Translation System. In Dialogue Processing in Spoken Language Systems: Revised Papers from ECAI-96 Workshop, E. Maier, M. Mast, and S. Luperfoy (eds.), LNCS series, Springer Verlag.

• Lavie, A. 1996. GLR*: A Robust Grammar-Focused Parser for Spontaneously Spoken Language. PhD dissertation, Technical Report CMU-CS-96-126, Carnegie Mellon University, Pittsburgh, PA.

• Munk, M. 1999. Shallow Statistical Parsing for Machine Translation. Diploma Thesis, Karlsruhe University.

• Stevenson, M. and R. Gaizauskas. 2000. Experiments on Sentence Boundary Detection. In Proceedings of ANLP and NAACL-2000, Seattle.

• Woszczyna, M., M. Broadhead, D. Gates, et al. 1998. A Modular Approach to Spoken Language Translation for Large Domains. In Proceedings of AMTA-98, Langhorne, PA.