TRANSCRIPT
A comprehensive framework for multimodal meaning representation
Ashwani Kumar, Laurent Romary
Laboratoire Loria, Vandœuvre-lès-Nancy
Overview - 1
Context: Conception phase of the EU IST/MIAMM project (Multidimensional Information Access using Multiple Modalities - with DFKI, TNO, Sony, Canon)
Study of the design factors for a future haptic PDA-like device
Underlying application: multidimensional access to a musical database
Overview - 2
Objectives:
Design and implementation of a unified representation language within the MIAMM demonstrator
• MMIL: Multimodal Interface Language
“Blind” application of (Bunt & Romary 2002)
Methodology
Basic components
Represent the general organization of any semantic structure
Parameterized by
• data categories taken from a common registry
• application-specific data categories
General mechanisms
To make the framework operational
General categories
Descriptive categories available to all formats
+ strict conformance to existing standards
MIAMM - wheel mode
MIAMM architecture
[Architecture diagram. Components: Dialogue Manager with its dependencies — MultiModal Fusion (MMF), Dialogue History, Action Planner (AP); MiaDoMo and the database; speech input via Microphone (Headset), Continuous Speech Recognizer (word/phoneme lattice, word/phoneme sequence), Speech Analysis and Structural Analysis (SPIN); speech output via Language Generation, Speech Synthesis and Speaker (sentences, scheduling information); Visual-Haptic Processing (VisHapTac) with Haptic Device, Display, Haptic Processor, Haptic-Visual Interpretation and Generation, and the visual configuration.]
Various processing steps - 1
Reco:
Provides word lattices
Out of our scope (MPEG-7 word and phone lattice module)
SPIN:
Template-based (en-de) or TAG-based (fr) dependency structures
Low-level semantic constructs
Various processing steps - 2
MMF (Multimodal Fusion):
Fully interpreted structures
Referential (MMILId) and temporal anchoring
Dialogue history update
AP (Action Planner):
Generates MIAMM internal actions
• Requests to MiaDoMo
• Actions to be generated (Language + VisHapTac)
Various processing steps - 3
VisHapTac:
Informs MMF of the current graphical and haptic configuration (hierarchies of objects, focus, selection)
MMIL must answer all of those needs, but not all at the same time
Main characteristics of MMIL
Basic ontology
Events and participants (organized as hierarchies)
Restrictions on events and participants
Relations among these
Additional mechanisms
Temporal anchoring of events
Ranges and alternatives
Representation
Flat meta-model
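As an illustration (not project code), the flat meta-model can be sketched as plain data structures: events and participants are flat nodes carrying feature restrictions, and relations connect them by id rather than by nesting. All names below are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    id: str
    features: dict = field(default_factory=dict)  # restrictions, e.g. evtType

@dataclass
class Relation:
    source: str   # id of the dependent node
    target: str   # id of the node it attaches to
    type: str     # e.g. "propContent", "object"

@dataclass
class MMILComponent:
    events: dict = field(default_factory=dict)        # id -> Node
    participants: dict = field(default_factory=dict)  # id -> Node
    relations: list = field(default_factory=list)

    def add_event(self, id, **features):
        self.events[id] = Node(id, features)

    def add_participant(self, id, **features):
        self.participants[id] = Node(id, features)

    def relate(self, source, target, type):
        self.relations.append(Relation(source, target, type))

# Rebuilding a fragment of the SPIN-O example used later in the talk:
comp = MMILComponent()
comp.add_event("e1", evtType="play", lex="vorspielen")
comp.add_participant("p2", objType="tune", refType="definite")
comp.relate("p2", "e1", "object")
```

Because the structure is flat, adding a new level or relation type never forces a change to the containment structure — only a new node or edge.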
MMIL meta-model (UML)
[UML diagram: a Struct Node class with a LevelName attribute (NMTOKEN), with dependency associations (cardinalities 0..*, 0..1 and 1..1) linking the MMIL, Event, Time and Participant levels.]
[Diagram: the meta-model is parameterized by a DatCat specification (a subset of the DatCat Registry plus application-dependent DatCats) under interoperability conditions (GMT); a dialect (expansion trees, DatCat styles + vocabularies) yields a concrete semantic markup language (e.g. MMIL).]
An overview of data categories
Underlying ontology for a variety of formats
Distinction between abstract definition and implementation (e.g. in XML)
Standardization objective: implementing a reference registry for NLP applications
Wider set of DatCats than just semantics
ISO 11179 (metadata registries) as a reference standard for implementing such a registry
DatCat example: Addressee
/Addressee/
Definition: the entity that is the intended hearer of a speech event. The scope of this data category is extended to deal with any multimodal communication event (e.g. haptics and tactile)
Source: (implicit) an event, whose evtType should be /Speak/
Target: a participant (user or system)
Styles and vocabularies
Style: design choice to implement a data category as an XML element, a database field, etc.
Vocabulary: the names to be provided for a given style
E.g. (for /Addressee/):
Style: element
Vocabulary: {"addressee"}
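To illustrate, here is a hypothetical rendering function that serializes one data-category instance under a given style and vocabulary. The function name and signature are illustrative assumptions, not part of any MIAMM specification.

```python
# Hypothetical sketch: a style plus a vocabulary determines how a
# data category such as /Addressee/ is serialized in XML.

def render(datcat, value, style, vocabulary):
    """Serialize one data-category instance under a given style."""
    name = vocabulary[datcat]           # vocabulary maps DatCat -> concrete name
    if style == "element":
        return f"<{name}>{value}</{name}>"
    if style == "attribute":
        return f'{name}="{value}"'
    raise ValueError(f"unsupported style: {style}")

print(render("/Addressee/", "p1", "element", {"/Addressee/": "addressee"}))
# -> <addressee>p1</addressee>
print(render("/Starting point/", "1991-01-01T00:00:00", "attribute",
             {"/Starting point/": "startPoint"}))
# -> startPoint="1991-01-01T00:00:00"
```

Multilingual vocabularies then amount to swapping the name table while keeping the abstract data category fixed.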
Note: multilingualism
Time stamping
/Starting point/
• Def: indicates the beginning of the event
• Values: dateTime
• Anchor: time level
Style: attribute
Vocabulary: {"startPoint"}
Example
<event id="e4">
  <evtType>yearPeriod</evtType>
  <lex>1991</lex>
  <tempSpan startPoint="1991-01-01T00:00:00"
            endPoint="1991-12-31T23:59:59"/>
</event>
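The temporal anchoring above can be read back with ordinary XML tooling; a minimal sketch using only the Python standard library, with the end point written as a valid dateTime:

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Parse the yearPeriod event and recover its temporal span.
xml = """<event id="e4">
  <evtType>yearPeriod</evtType>
  <lex>1991</lex>
  <tempSpan startPoint="1991-01-01T00:00:00" endPoint="1991-12-31T23:59:59"/>
</event>"""

event = ET.fromstring(xml)
span = event.find("tempSpan")
start = datetime.fromisoformat(span.get("startPoint"))
end = datetime.fromisoformat(span.get("endPoint"))
assert start <= end  # a well-formed span never runs backwards
print(event.get("id"), (end - start).days)  # e4 364
```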
Application: a family of formats
Openness: a requirement for MIAMM
Specific formats for input and output of each module
Each format is defined within the same generic MMIL framework:
• Same meta-model for all
• Specific DatCat specification for each
The MIAMM family of formats
SPIN-O, MMF-O, AP-O, VisHapTac-O, MMF-I, MMIL+
The specifications provide typing information for all these formats
SPIN-O example
"Spiel mir den Lied bitte vor" (Please play the song)
[Dependency graph: event e0 (evtType=speak, dialogueAct=request) has speaker p1 (objectType=user) and propositional content e1 (evtType=play, lex=vorspielen); e1 has destination p1 and object p2 (objType=tune, refType=definite, refStatus=pending).]
<mmilComponent>
  <event id="e0">
    <evtType>speak</evtType>
    <dialogueAct>request</dialogueAct>
    <speaker target="p1"/>
  </event>
  <event id="e1">
    <evtType>play</evtType>
    <lex>vorspielen</lex>
  </event>
  <participant id="p1">
    <objType>user</objType>
  </participant>
  <participant id="p2">
    <objType>tune</objType>
    <refType>definite</refType>
    <refStatus>pending</refStatus>
  </participant>
  <relation source="e1" target="e0" type="propContent"/>
  <relation source="p1" target="e1" type="destination"/>
  <relation source="p2" target="e1" type="object"/>
</mmilComponent>
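A consumer such as MMF can treat this flat structure as a graph: relations are edges between ids, so interpretation is a traversal rather than a tree walk. A minimal sketch over a stripped-down copy of the example (features not needed for the traversal are omitted):

```python
import xml.etree.ElementTree as ET

# Relations point from dependents to heads, so finding the object of the
# "play" event means finding the "object" relation targeting that event.
doc = ET.fromstring("""
<mmilComponent>
  <event id="e0"><evtType>speak</evtType></event>
  <event id="e1"><evtType>play</evtType><lex>vorspielen</lex></event>
  <participant id="p1"><objType>user</objType></participant>
  <participant id="p2"><objType>tune</objType></participant>
  <relation source="e1" target="e0" type="propContent"/>
  <relation source="p1" target="e1" type="destination"/>
  <relation source="p2" target="e1" type="object"/>
</mmilComponent>""")

nodes = {n.get("id"): n for n in doc if n.tag in ("event", "participant")}
play = next(e for e in doc.iter("event")
            if e.findtext("evtType") == "play")
obj_id = next(r.get("source") for r in doc.iter("relation")
              if r.get("type") == "object" and r.get("target") == play.get("id"))
print(nodes[obj_id].findtext("objType"))  # tune
```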
• The use of perceptual grouping
Reference domains and visual contexts
[Figure: a visual context of three objects, a triangle and two circles; "these three objects" denotes the whole set, "the triangle" the singleton, and "the two circles" the circle pair.]
• The use of salience
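As a toy sketch of how perceptual grouping could support reference resolution (the grouping criterion and all names are illustrative assumptions, not the MIAMM algorithm): objects sharing a perceptual property form candidate reference domains, and a definite plural such as "the two circles" picks out the group of matching cardinality.

```python
from itertools import groupby

# Visual context: (object id, perceived shape) pairs.
context = [("o1", "triangle"), ("o2", "circle"), ("o3", "circle")]

def groups_by_shape(objs):
    """Group the visual context into candidate reference domains by shape."""
    key = lambda o: o[1]
    return {shape: [oid for oid, _ in g]
            for shape, g in groupby(sorted(objs, key=key), key)}

def resolve(shape, count, objs):
    """Resolve a definite description like 'the two circles'."""
    group = groups_by_shape(objs).get(shape, [])
    return group if len(group) == count else None  # None: no unique match

print(resolve("circle", 2, context))    # ['o2', 'o3']  ("the two circles")
print(resolve("triangle", 1, context))  # ['o1']        ("the triangle")
```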
VisHapTac-O
[Diagram: event e0 (visual haptic state) linked by a description relation to participant set1 (the participant setting), which contains s1, s2, …, s25; set2 sub-divides into s2-1, s2-2, s2-3, with inFocus and inSelection attention statuses.]
VisHapTac output - 1
<mmilComponent>
  <event id="e0">
    <evtType>HGState</evtType>
    <visMode>galaxy</visMode>
    <tempSpan startPoint="2000-01-20T14:12:06"
              endPoint="2002-01-20T14:12:13"/>
  </event>
  <participant id="set1">
    …
  </participant>
  <relation type="description" source="set1" target="e0"/>
</mmilComponent>
VisHapTac output - 2
<participant id="set1">
  …
  <participant id="s1">
    <Name>Let it be</Name>
  </participant>
  <participant id="set2">
    <individuation>set</individuation>
    <attentionStatus>inFocus</attentionStatus>
    <participant id="s2-1">
      <Name>Lady Madonna</Name>
    </participant>
    …
    <participant id="s2-3">
      <attentionStatus>inSelection</attentionStatus>
      <Name>Revolution 9</Name>
    </participant>
  </participant>
  …
</participant>
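Reading the attention state back out of such an output is a simple scan over participants; a sketch over a trimmed copy of the example:

```python
import xml.etree.ElementTree as ET

# Collect each participant's attentionStatus (inFocus / inSelection),
# which is how MMF can learn the current haptic-visual attention state.
doc = ET.fromstring("""
<mmilComponent>
  <participant id="set2">
    <attentionStatus>inFocus</attentionStatus>
    <participant id="s2-3">
      <attentionStatus>inSelection</attentionStatus>
      <Name>Revolution 9</Name>
    </participant>
  </participant>
</mmilComponent>""")

status = {p.get("id"): p.findtext("attentionStatus")
          for p in doc.iter("participant")
          if p.findtext("attentionStatus")}
print(status)  # {'set2': 'inFocus', 's2-3': 'inSelection'}
```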
Conclusion
Most of the properties we wanted are fulfilled:
Uniformity, incrementality, partiality, openness and extensibility
Discussion point:
Semantic adequacy:
• Not a direct input to an inference system (except for the underlying ontology)
• Semantics provided through the specification