
Wolfgang Wahlster

German Research Center for Artificial Intelligence, DFKI GmbH

Stuhlsatzenhausweg 3, 66123 Saarbruecken, Germany

phone: (+49 681) 302-5252/4162

fax: (+49 681) 302-5341

e-mail: [email protected]

WWW: http://www.dfki.de/~wahlster

Dagstuhl 2000

Pervasive Speech and Language Technology

Dagstuhl 2000 © Wolfgang Wahlster, DFKI

Pervasive Speech and Language Technology

Speech-controlled coffee machine: "A cappuccino in 10 minutes, please!"

Dictation: "Send the following email to Mark Maybury: Hi Mark, please forward the following agenda to your project partners!"

Speech-based car navigation: "Let's go to Baker Street in Berkeley!"

Speech-enabled music selection: "I would like to hear Mozart's Piano Concerto No. 3!"


Pervasive Speech and Language Technology

"Show me all CNN news of the last 3 months that feature Bill Clinton discussing health care!"

"I would like to make an appointment with Dr. Kuremastu in Kyoto next week!"

"What has Jim Hendler said about DAML during our recent Dagstuhl seminar?"

Information on demand

Audio Mining

Speech-to-Speech Translation


Three Levels of Language Processing

Speech Input

Speech Recognition: What has the speaker said? 100 alternatives. Knowledge sources: acoustic language models, word lists.

Speech Analysis: What has the speaker meant? 10 alternatives. Knowledge sources: grammar, lexical meaning.

Speech Understanding: What does the speaker want? Unambiguous understanding in the dialog context. Knowledge sources: discourse context, knowledge about the domain of discourse.

Reduction of uncertainty from level to level


Challenges for Language Engineering

Each dimension is listed in order of increasing complexity.

Input Conditions: close-speaking microphone/headset with push-to-talk; telephone with pause-based segmentation; open microphone, GSM quality

Naturalness: isolated words; read continuous speech; spontaneous speech

Adaptability: speaker dependent; speaker independent; speaker adaptive

Dialog Capabilities: monolog dictation; information-seeking dialog; multiparty negotiation

Verbmobil: open microphone with GSM quality, spontaneous speech, speaker adaptive, multiparty negotiation


Context-Sensitive Speech-to-Speech Translation

German: Wann fährt der nächste Zug nach Hamburg ab?

English: When does the next train to Hamburg depart?

German: Wo befindet sich das nächste Hotel?

English: Where is the nearest hotel?

Verbmobil Server


Mobile Speech-to-Speech Translation of Spontaneous Dialogs

As the name Verbmobil suggests, the system supports verbal communication with foreign dialog partners in mobile situations:

1 face-to-face conversations

2 telecommunication


Mobile Speech-to-Speech Translation of Spontaneous Dialogs

Verbmobil Speech Translation Server

Solution: Conference call. The Verbmobil Speech Translation Server is accessed by GSM mobile phones.


Speech-to-Speech Translation


The Control Panel of Verbmobil


General Speech Recognition Task

Audio Signal -> Recognizers (German, English, Japanese) -> Word Hypotheses Graph


Word Hypotheses Graphs (WHGs)

WHGs realize the interface between acoustic and linguistic processing

Edge = Word

Best Hypothesis

Acoustic Score
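A minimal Python sketch of how a best hypothesis could be read off a word hypotheses graph. It assumes that nodes are integer time points, that every edge carries one word plus an acoustic score expressed as a cost (lower is better, for example a negative log-likelihood), and that edges always run forward in time. The edge format, the function name `best_hypothesis`, and the toy scores are illustrative assumptions, not Verbmobil's actual data structures.

```python
from collections import defaultdict

def best_hypothesis(edges, start, end):
    """Return (total_cost, words) of the cheapest path from start to end.

    edges: list of (from_node, to_node, word, cost) with from_node < to_node.
    Returns None if the end node cannot be reached.
    """
    outgoing = defaultdict(list)
    for src, dst, word, cost in edges:
        outgoing[src].append((dst, word, cost))

    # best[node] = (cost of the cheapest path from start, words on that path)
    best = {start: (0.0, [])}
    # Nodes are time-ordered, so ascending order is a topological order.
    for node in range(start, end):
        if node not in best:
            continue  # not reachable from the start node
        cost_so_far, words = best[node]
        for dst, word, cost in outgoing[node]:
            candidate = (cost_so_far + cost, words + [word])
            if dst not in best or candidate[0] < best[dst][0]:
                best[dst] = candidate
    return best.get(end)

# Toy graph: each edge is one word hypothesis with a hypothetical cost.
whg = [
    (0, 1, "when", 1.2), (0, 1, "then", 1.9),
    (1, 2, "does", 0.8), (1, 3, "thus", 3.5),
    (2, 3, "the", 0.5),
    (3, 4, "train", 1.0), (3, 4, "rain", 1.4),
]
cost, words = best_hypothesis(whg, 0, 4)
print(words)  # ['when', 'does', 'the', 'train']
```

Keeping the whole graph instead of only this single path is what lets later linguistic stages pick a different path when the acoustically best one does not parse.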


Massive Data Collection Efforts

The so-called Partitur (German word for musical score) orchestrates fifteen strata of annotations: Transliteration Variant 1, Transliteration Variant 2, Lexical Orthography, Canonical Pronunciation, Manual Phonological Segmentation, Automatic Phonological Segmentation, Word Segmentation, Prosodic Segmentation, Dialog Acts, Noises, Superimposed Speech, Syntactic Category, Word Category, Syntactic Function, Prosodic Boundaries.

3,200 dialogs (182 hours) with 1,658 speakers and 79,562 turns, distributed on 56 CDs (21.5 GB)


Extracting Statistical Properties from Large Corpora

Machine learning for the integration of statistical properties into symbolic models for speech recognition, parsing, dialog processing, and translation

Corpora: transcribed speech data; segmented speech with prosodic labels; annotated dialogs with dialog acts; treebanks and predicate-argument structures; aligned bilingual corpora

Learned models: hidden Markov models; neural nets and multilayered perceptrons; probabilistic automata; probabilistic grammars; probabilistic transfer rules


From a Multi-Agent Architecture to a Multi-Blackboard Architecture

Multi-Agent Architecture (modules M1 to M6 communicate directly)

Each module must know which module produces what data

Direct communication between modules

Each module has only one instance

Heavy data traffic for moving copies around

Multiparty and telecooperation applications are impossible

Software: ICE and ICE Master

Basic Platform: PVM

Multi-Blackboard Architecture (modules M1 to M6 communicate through blackboards BB 1 to BB 3)

All modules can register for each blackboard dynamically

No direct communication between modules

Each module can have several instances

No copies of representation structures (word lattice, VIT chart)

Multiparty and telecooperation applications are possible

Software: PCA and Module Manager

Basic Platform: PVM
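The blackboard idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class `Blackboard`, its methods `register` and `post`, and the toy modules are hypothetical names, and the sketch runs in a single process, whereas the slide names PCA, a Module Manager, and PVM as the real infrastructure.

```python
class Blackboard:
    """A shared store for one kind of representation (e.g. word lattices)."""

    def __init__(self, name):
        self.name = name
        self.entries = []       # one shared copy of every posted item
        self.subscribers = []   # callbacks of the registered module instances

    def register(self, callback):
        """Modules can register at any time, i.e. dynamically."""
        self.subscribers.append(callback)

    def post(self, item):
        self.entries.append(item)
        for callback in list(self.subscribers):
            callback(item)      # subscribers get a reference, not a copy


audio = Blackboard("audio")
whg = Blackboard("word_hypotheses_graph")
vits = Blackboard("vits")

def recognizer(signal):
    # Toy stand-in for a speech recognizer: "recognizes" whitespace tokens.
    whg.post({"words": signal.split()})

def make_parser(tag):
    # Several parser instances can listen on the same blackboard.
    def parser(lattice):
        vits.post((tag, len(lattice["words"])))
    return parser

audio.register(recognizer)
whg.register(make_parser("chunk"))
whg.register(make_parser("hpsg"))

audio.post("we meet in saarbruecken")
print(vits.entries)  # [('chunk', 4), ('hpsg', 4)]
```

No module calls another module directly here: the recognizer only knows the blackboard it writes to, so a further parser can be added without touching it.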


A Multi-Blackboard Architecture for the Combination of Results from Deep and Shallow Processing Modules

Blackboards: Audio Data; Word Hypotheses Graph with Prosodic Labels; VITs (Underspecified Discourse Representations)

Modules: Command Recognizer; Spontaneous Speech Recognizer; Channel/Speaker Adaptation; Prosodic Analysis; Statistical Parser; Dialog Act Recognition; Chunk Parser; HPSG Parser; Semantic Construction; Robust Dialog Semantics; Semantic Transfer; Generation


The Use of Prosodic Information at All Processing Stages

Input: Speech Signal, Word Hypotheses Graph

Multilingual Prosody Module. Prosodic features: duration, pitch, energy, pause

Prosodic information passed on: boundary information, sentence mood, accented words, prosodic feature vector

Parsing: Search Space Restriction

Dialog Understanding: Dialog Act Segmentation and Recognition

Translation: Constraints for Transfer

Generation: Lexical Choice

Speech Synthesis: Speaker Adaptation


Competing Strategies for Robust Speech Translation

Concurrent processing modules combine deep semantic translation with shallow surface-oriented translation methods.

Input: Word Lattice

Expensive, but precise translation: principled and compositional syntactic and semantic analysis; semantic-based transfer of Verbmobil Interface Terms (VITs) as sets of underspecified DRS

Cheap, but approximate translation: case-based translation; dialog-act based translation; statistical translation

Each strategy is subject to a timeout check and delivers results with confidence values.

Selection of best result

Acceptable Translation Rate
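The interplay of concurrent translation strategies, a timeout, and confidence values can be illustrated with a small Python sketch. It assumes that each engine is a function returning a `(text, confidence)` pair; the function `translate`, the two toy engines, and all confidence values are hypothetical and only stand in for the real deep and shallow modules.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

def translate(segment, engines, timeout_s):
    """Run all engines concurrently; return (engine, text, confidence) of the
    most confident result that arrived before the timeout, or None."""
    pool = ThreadPoolExecutor(max_workers=len(engines))
    futures = {pool.submit(fn, segment): name for name, fn in engines.items()}
    done, _late = wait(futures, timeout=timeout_s)
    pool.shutdown(wait=False)           # do not block on engines still running
    finished = []
    for future in done:
        if future.exception() is None:  # ignore engines that crashed
            text, confidence = future.result()
            finished.append((confidence, futures[future], text))
    if not finished:
        return None
    confidence, name, text = max(finished)
    return name, text, confidence

def deep_translation(segment):
    time.sleep(0.5)                     # simulates an expensive, precise analysis
    return ("deep: " + segment, 0.95)

def shallow_translation(segment):
    return ("shallow: " + segment, 0.60)  # cheap, approximate, always fast
```

With `engines = {"semantic_transfer": deep_translation, "statistical": shallow_translation}`, a call with `timeout_s=0.2` falls back to the shallow result, while a generous timeout lets the more confident deep result win.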


Integrating Shallow and Deep Analysis Components in a Multi-Blackboard Architecture

Augmented Word Hypotheses Graph

Chunk Parser, Statistical Parser, HPSG Parser: each delivers partial VITs

Chart with a combination of partial VITs

Robust Dialog Semantics: combination and knowledge-based reconstruction of complete VITs

Complete and Spanning VITs


VHG: A Packed Chart Representation of Partial Semantic Representations

Incremental chart construction and anytime processing

Rule-based combination and transformation of partial UDRS coded as VITs

Selection of a spanning analysis using a bigram model for VITs (trained on a treebank of 24 k VITs)

Chunk parser using cascaded finite-state transducers

Statistical LR parser trained on treebank

Very fast HPSG parser

Semantic Construction
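Selecting a spanning analysis with a bigram model can be sketched as a Viterbi-style search over chart edges. The Python example below is a simplified illustration: it assumes each chart edge is a `(start, end, label)` triple, where the label stands in for a partial VIT, and that the bigram model is available as a function returning log-probabilities. The function name, the labels, and the scores are hypothetical.

```python
def best_spanning_analysis(edges, n, bigram_logprob):
    """Find the edge sequence covering positions 0..n with the best bigram score.

    edges: list of (start, end, label) with start < end.
    bigram_logprob(prev_label, label): log-probability of label after prev_label.
    Returns (score, path) or None if no sequence spans the input.
    """
    # best[(position, last_label)] = (score, path of edges ending at position)
    best = {(0, "START"): (0.0, [])}
    for pos in range(n):
        states = [(key, val) for key, val in best.items() if key[0] == pos]
        for (_, prev), (score, path) in states:
            for start, end, label in edges:
                if start != pos:
                    continue
                new_score = score + bigram_logprob(prev, label)
                key = (end, label)
                if key not in best or new_score > best[key][0]:
                    best[key] = (new_score, path + [(start, end, label)])
    finals = [val for key, val in best.items() if key[0] == n]
    return max(finals, key=lambda val: val[0]) if finals else None

# Hypothetical bigram table; unseen pairs get a low default score.
TABLE = {
    ("START", "np"): -0.5, ("np", "vp"): -0.4,
    ("START", "s_frag"): -1.5, ("s_frag", "pp"): -1.0,
    ("np", "v"): -1.0, ("v", "pp"): -1.2,
}

def logprob(prev, label):
    return TABLE.get((prev, label), -5.0)

chart = [(0, 1, "np"), (0, 2, "s_frag"), (1, 3, "vp"), (2, 3, "pp"), (1, 2, "v")]
score, path = best_spanning_analysis(chart, 3, logprob)
print(path)  # [(0, 1, 'np'), (1, 3, 'vp')]
```

Because the search only ever extends what is already in the chart, it can be stopped at any time and still return the best spanning sequence found so far, which fits the anytime processing mentioned above.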


The Understanding of Spontaneous Speech Repairs

Original utterance: I need a car next Tuesday oops Monday

Reparandum (original utterance): Tuesday

Hesitation (editing phase): oops

Reparans (repair phase): Monday

Recognition of substitutions, then transformation of the word hypotheses graph

Result: I need a car next Monday

Verbmobil technology understands speech repairs and extracts the intended meaning.

Dictation systems like ViaVoice, VoiceXpress, FreeSpeech, and Naturally Speaking cannot deal with spontaneous speech and transcribe the corrupted utterances.
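The substitution of a reparandum by its reparans can be illustrated on a plain token list. The Python sketch below is a deliberately simple heuristic, not the Verbmobil method, which operates on the word hypotheses graph: it assumes a fixed set of editing terms, treats the word before the hesitation as the reparandum, and extends the reparandum backwards when the speaker retraces to a word already said. All names and the window size are assumptions.

```python
EDITING_TERMS = {"oops", "uh", "um", "äh"}

def repair(tokens, max_reparandum=4):
    """Remove hesitations and the material they correct from a token list."""
    out = []
    for i, tok in enumerate(tokens):
        if tok.lower() in EDITING_TERMS:
            if out and i + 1 < len(tokens):
                restart = tokens[i + 1]          # first word of the reparans
                if restart in out[-max_reparandum:]:
                    # Retracing: cut back to the last earlier occurrence.
                    cut = len(out) - 1 - out[::-1].index(restart)
                else:
                    # Plain substitution: drop only the preceding word.
                    cut = len(out) - 1
                out = out[:cut]
            # The editing term itself is always dropped.
        else:
            out.append(tok)
    return out

print(" ".join(repair("I need a car next Tuesday oops Monday".split())))
# I need a car next Monday
print(" ".join(repair("Wir treffen uns in Mannheim äh in Saarbrücken".split())))
# Wir treffen uns in Saarbrücken
```

A real system would need prosodic and syntactic evidence to decide how far the reparandum reaches; the fixed window here is only a placeholder for that decision.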


Automatic Understanding and Correction of Speech Repairs in Spontaneous Telephone Dialogs

German: Wir treffen uns in Mannheim, äh, in Saarbrücken. (We are meeting in Mannheim, oops, in Saarbruecken.)

English: We are meeting in Saarbruecken.


Robust Dialog Semantics: Combining and Completing Partial Representations

Let us meet the late afternoon to catch the train to Frankfurt

The preposition 'in' is missing in all paths through the word hypotheses graph. A temporal NP is transformed into a temporal modifier using an underspecified temporal relation:

[temporal_np(V1)] -> [typeraise_to_mod(V1, V2)] & V2

The modifier is applied to a proposition:

[type(V1, prop), type(V2, mod)] -> [apply(V2, V1, V3)] & V3

Let us meet (in) the late afternoon to catch the train to Frankfurt
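The type-raising rule and the application rule for temporal modifiers can be mimicked with two small functions. The Python sketch below assumes a toy representation in which every partial analysis is a `Partial` with a kind and a tuple of content; `Partial`, `complete`, and the content tuples are hypothetical and far simpler than real VITs.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Partial:
    kind: str        # "prop", "temporal_np", or "mod"
    content: tuple

def typeraise_to_mod(np):
    """temporal_np(V1) -> mod(V2) with an underspecified temporal relation."""
    if np.kind != "temporal_np":
        raise ValueError("expected a temporal NP")
    return Partial("mod", ("temp_rel_underspecified", np.content))

def apply_mod(mod, prop):
    """type(V1, prop), type(V2, mod) -> the modified proposition V3."""
    if mod.kind != "mod" or prop.kind != "prop":
        raise ValueError("expected a modifier and a proposition")
    return Partial("prop", prop.content + (mod.content,))

def complete(fragments):
    """Combine one proposition with all (type-raised) temporal fragments."""
    items = [typeraise_to_mod(f) if f.kind == "temporal_np" else f
             for f in fragments]
    props = [f for f in items if f.kind == "prop"]
    if len(props) != 1:
        return None                      # nothing unambiguous to attach to
    result = props[0]
    for mod in (f for f in items if f.kind == "mod"):
        result = apply_mod(mod, result)
    return result

fragments = [Partial("prop", ("meet", "we")),
             Partial("temporal_np", ("late_afternoon",))]
print(complete(fragments))
```

The temporal relation stays underspecified on purpose: the missing preposition is never guessed, yet the two fragments end up in one connected representation.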


Integrating Deep and Shallow Processing: Combining Results from Concurrent Translation Threads

Translation threads: Statistical Translation, Dialog-Act Based Translation, Semantic Transfer, Case-Based Translation

Alternative Translations with Confidence Values

Selection Module

Segment 1: "If you prefer another hotel," translated by Semantic Transfer

Segment 2: "please let me know." translated by Case-Based Translation
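A selection module of this kind can be sketched as a per-segment maximum over confidence values. In the Python example below, the function `select_translations` and all confidence numbers are invented for illustration; only the two segments and the winning engines follow the slide.

```python
def select_translations(segments):
    """For each segment, keep the alternative with the highest confidence.

    segments: list of lists of (engine, text, confidence).
    Returns a list of (engine, text), one entry per segment.
    """
    chosen = []
    for alternatives in segments:
        engine, text, _conf = max(alternatives, key=lambda alt: alt[2])
        chosen.append((engine, text))
    return chosen

# Hypothetical confidence values for the two segments of one turn.
turn = [
    [("semantic_transfer", "If you prefer another hotel,", 0.91),
     ("statistical", "If you rather another hotel,", 0.62)],
    [("dialog_act", "please tell.", 0.40),
     ("case_based", "please let me know.", 0.88)],
]
result = select_translations(turn)
print(" ".join(text for _engine, text in result))
# If you prefer another hotel, please let me know.
```

Selecting per segment rather than per turn is what allows a single output sentence to mix results from different translation threads.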


Unit Selection Algorithm

Sentence to synthesize: I have time on Monday.

Figure: graph over the tokens of the sentence, with edges running in edge direction from a start node S to an end node E.


Linguatronic: Spoken Dialogs with Mercedes-Benz

Microphone

Push-to-talk Switch

Please call Doris Wahlster.

Open the left window in the back.

I want to hear the weather channel.

When will I reach the next gas station?

Where is the next parking lot?

Speech control of: cellular phone, radio, windows / AC, route guidance system

Option for S-, C-, and E-Class of Mercedes and BMW

Speaker-independent, garbage models for non-speech (blinker, AC, wheels)


International Research Trends in Multilingual Systems

Multilingual and Mobile Communication Assistants

Multimodal Interfaces: SmartKom

Speech-based Web Access to Multilingual Web Pages: WAP Phones, WebTV

Multilingual Audio Retrieval and Audio Mining: Discussions, Lecture Notes, Organizers

Multilingual Indexing and Annotation of Videos: Video Archives, News Archives

Call Centers, E-Commerce, Mobile Travel Assistance, Telephone Translations

Dialog Translation: Verbmobil

Multilingual Language Technology: Speech Recognition, Language Understanding, Language Generation, and Speech Synthesis

Spontaneous Speech, Robust Processing and Translation, Semantic and Pragmatic Understanding


Conclusion I

Real-world problems in language technology, like the understanding of spoken dialogs, speech-to-speech translation, and multimodal dialog systems, can only be cracked by the combined muscle of deep and shallow processing approaches.

In a multi-blackboard architecture based on packed representations on all processing levels (speech recognition, parsing, semantic processing, translation, generation), using charts with underspecified representations (e.g. UDRS), the results of concurrent processing threads can be combined in an incremental fashion.


Conclusion II

All results of concurrent processing modules should come with a confidence value, so that a selection module can choose the most promising result at each processing stage.

Packed representations together with formalisms for underspecification capture the uncertainties in each processing phase, so that the uncertainties can be reduced by linguistic, discourse, and domain constraints as soon as they become applicable.


Conclusion III

Deep processing can be used for merging, completing, and repairing the results of shallow processing strategies.

Shallow methods can be used to guide the search in deep processing.

Statistical methods must be augmented by symbolic models (e.g. class-based language modelling, word order normalization as part of statistical translation).

Statistical methods can be used to learn operators or selection strategies for symbolic processes.

It is much more than a balancing act... (see Klavans and Resnik 1996)


Open Problems for the Next Decade

Problems with current machine learning approaches

Expensive data collection

Cognitively unrealistic training data

Data sparseness

Problems with current hand-crafted knowledge sources

Brittleness

Domain dependence

Limited scalability


A Speculative Conclusion (+50 years)

-500 years: Oral Society. News and knowledge are passed orally. No mass storage, no automatic processing, no automatic retrieval.

TODAY: Textual Society. News and knowledge are passed textually. Mass storage of texts, text processing, text retrieval.

+50 years: Oral Society. News and knowledge are passed orally. Mass storage of speech, speech processing, audio retrieval.
