THE DIME CORPUS
Presented by Sergio Coria
IIMAS-UNAM. March, 2006
LUIS PINEDA ET AL.
ANNOTATION TEAM
Haydé Castellanos
Isaac Castillo
Sergio Coria
Javier Cuétara
Varinia Estrada
Laura I. González
Laura P. Hernández
Fernanda López
Isabel López
Ivonne López
Iván Moreno
Valentina Muñoz
Patricia Pérez
Laura L. Rosales
Arturo Wong
DIME Corpus
What is it?
Multimodal: speech & video corpus [Villaseñor et al., 2001]
26 dialogues developed under the Wizard of Oz protocol
Task-oriented domain
Domain: computer-aided kitchen design
Original resources: audio (FEA and WAV) and video (AVI) files, orthographic transcriptions (TXT)
Volunteers: 14 people (users), aged 30 on average, with C.S.-related backgrounds
Tasks performed by every user:
1) Simple: rearrange and complete the furnishing of a small kitchen according to a furniture layout shown on a sheet of paper.
2) Complex: completely furnish a medium-sized kitchen.
DIME Corpus
How was it created?
DIME Corpus
What was it created for?
To offer empirical information for studying the use of, and interaction between, spoken language, deictic* gestures and graphical context during human-computer interaction
Long-term goal: to build a conversational system able to talk in Mexican Spanish about task-oriented domains
* pointing, showing
Prosodic information: why do we need it?
To analyze the relationship between intonation and dialogue acts in speech, in order to find patterns that help improve ASR and dialogue management efficiency
For speech repair analysis
For Sp-ToBI evaluation (maybe a specific proposal)
To share the corpus with the research community
Annotation layers
Orthographic (based on utterance segmentation)
Layers:
- Allophones
- Phonetic syllables
- Words
- Intonation (INTSINT)
- ToBI (Break Indices only)
- Sentence mood (declarative, interrogative, imperative)
- Dialogue acts (DIME-DAMSL)
- Parts of Speech (P.O.S.)
- Discourse markers
- Speech repairs
Orthographic transcription
Dialogue: s6-t1-g1 (d12)
Number of utterance files: 117
Length of dialogue: 7 min 19 s
Estimated number of turns: 69
utt1 : s: do you want me to move or insert some object into the kitchen?
utt2 : u: <noise> no
utt3 : can you move the stove to the left?
utt4 : s: <noise> where to?
utt5 : u: <noise> to <sil> to the right
utt6 : s: <no-vocal> to the right
utt7 : <no-vocal> okay
utt8 : how far do you want me to move it?
utt9 : u: in the middle of the space between the window and the wall
utt10 : s: okay
utt11 : do you want me to move this object <sil> to here?
utt12 : u: no
utt13 : to the other side
S: System (Wizard)
U: User (volunteer)
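For readers who want to process these transcriptions, here is a small sketch of how such lines might be parsed; the format is inferred from the excerpt above, and the regular expression is an assumption, not part of the corpus tools:

import re

# Pattern inferred from the excerpt: "uttN : [speaker:] text", where the
# speaker mark (s: or u:) is omitted when the same speaker continues.
UTT_RE = re.compile(r"^utt(\d+)\s*:\s*(?:([su])\s*:\s*)?(.*)$")

def parse_utterance(line, last_speaker=None):
    # Return (utt_id, speaker, text); reuse last_speaker on continuations.
    m = UTT_RE.match(line.strip())
    if not m:
        raise ValueError(f"not an utterance line: {line!r}")
    utt_id, speaker, text = m.groups()
    return int(utt_id), speaker or last_speaker, text

print(parse_utterance("utt3 : can you move the stove to the left?", "u"))
# -> (3, 'u', 'can you move the stove to the left?')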
Allophones
Based on the Mexbet phonetic alphabet, including context rules (Cuétara, 2004)
PERL scripts read the orthographic transcriptions and produce preliminary, non-time-aligned allophone annotations (default “average” times) → TranscribeMex (Villaseñor & Cuétara, 2004)
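TranscribeMex itself applies Cuétara's context-dependent rules; purely as an illustration of the idea of producing a default-time allophone tier from orthography, here is a heavily simplified sketch in which both the letter-to-allophone map and the 80 ms default duration are invented:

# Invented letter-to-allophone map; the real tool applies Mexbet
# context rules (Cuétara, 2004), not a plain lookup table.
LETTER_TO_ALLOPHONE = {"c": "k", "o": "o", "i": "i", "n": "n", "a": "a"}
DEFAULT_DUR = 0.080  # assumed "average" segment duration, in seconds

def default_annotation(word, start=0.0):
    # Produce a preliminary, non-time-aligned allophone tier for one word.
    tier, t = [], start
    for letter in word:
        tag = LETTER_TO_ALLOPHONE.get(letter, letter)
        tier.append((round(t, 3), round(t + DEFAULT_DUR, 3), tag))
        t += DEFAULT_DUR
    return tier

print(default_annotation("cocina"))
# -> [(0.0, 0.08, 'k'), (0.08, 0.16, 'o'), ..., (0.4, 0.48, 'a')]

Annotators then drag these default boundaries to the real segment edges, as described below.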
Allophones
Annotators manually align the default allophone annotations and use them for the following layers
Annotators use the audio file, oscillogram, 3D color spectrogram (to see formants) and pitch track (to tell sound from silence) in the CSLU SpeechView software
Allophones
Annotation includes stressed vowels
d15, utt1: Do you want me to move or insert some object into the kitchen? (¿Quieres que desplace o traiga algún objeto a la cocina?)
[Figures: pitch track, 3D color spectrogram, and allophone tier (with zoom in)]
Phonetic syllables
Utterances are segmented into syllables as the speaker uttered them (we annotate what we hear)
Annotated with allophone tags
Resyllabification: the original syllables of the words can be realigned on a phonetic basis (synalepha, where adjacent vowels across a word boundary merge into one phonetic syllable, and other cases)
Phonetic syllables
New syllables may emerge and others may disappear
This layer is similar to the allophone layer
Phonetic syllables
Phonetic syllables (zoom in)
Conventions for word annotation
- Accented vowel: tagged with _7 (from Mexbet)
- Every tag must be written in lower case
- The letter ñ: tagged with n~ (from Mexbet)
Words (orthographic)
Annotated with the English alphabet
Words (zoom in): they do not necessarily coincide with the phonetic syllables
Orthographic accent
ToBI (Break Indices only)
ToBI stands for Tones and Break Indices, but we annotate break indices only
Aligned with the words layer
Based on Sp-ToBI (Beckman et al., 2002) and (Gurlekian et al.), but modified for our project
The tags used:
- 0 (zero) = syllabic reduction via vowel contact between words
- 1 = word boundary
- 4 = intonational phrase boundary
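As a sketch only, a break-index tier aligned to the word tier can be thought of as (word, index) pairs; the words come from utterance d15, utt1 shown earlier, but which boundaries receive a 0 here is an invented guess:

# Hypothetical break-index tier for "¿Quieres que desplace o traiga
# algún objeto a la cocina?", one index per right-hand word boundary.
# 0 = vowel-contact reduction, 1 = word boundary, 4 = intonational phrase boundary.
break_index_tier = [
    ("quieres", 1), ("que", 1), ("desplace", 0), ("o", 1), ("traiga", 0),
    ("algún", 1), ("objeto", 0), ("a", 1), ("la", 1), ("cocina", 4),
]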
ToBI (break indices only): aligned with words (or silences)
ToBI (break indices only) (zoom in)
INTSINT
INTSINT: International Transcription System for Intonation (Hirst et al., 1991)
Semi-automatic
Based on the MOMEL (Modélisation de Mélodie) algorithm (Hirst & Espesser, 1993), implemented in the M.E.S. software (Motif Environment for Speech)
INTSINT
MOMEL obtains a stylized F0 contour by determining target points (T.P.), which are joined using spline functions
M.E.S. requires the annotator to perceptually verify the automatically generated stylized contour (listening to the original versus the stylized contour)
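As a rough sketch of the "target points joined by splines" idea (MOMEL itself fits a quadratic spline through automatically detected points; the target values below are invented):

import numpy as np
from scipy.interpolate import CubicSpline

# Hypothetical MOMEL-style target points: (time in seconds, F0 in Hz).
target_t = np.array([0.10, 0.45, 0.80, 1.20, 1.55])
target_f0 = np.array([120.0, 180.0, 150.0, 210.0, 110.0])

# Join the target points with a smooth spline: this stylized contour is
# what the annotator listens to and compares against the original F0.
spline = CubicSpline(target_t, target_f0)
t = np.linspace(target_t[0], target_t[-1], 200)
stylized_f0 = spline(t)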
[Figure: M.E.S. graphic interface, showing the original F0 contour (violet), the stylized contour checked by the annotator (blue), and a target point]
d15, utt1: ¿Quieres que desplace o traiga algún objeto a la cocina? (Do you want me to move or insert some object into the kitchen?)
INTSINT perceptual verification
Manual modification of inflection points on the stylized contour is sometimes necessary to make it match the original contour
INTSINT perceptual verification
The most common modifications are:
- Eliminating points at the end, which appear on a silent (or background-noise) interval
- Inserting (not too many) new points where the stylized contour needs them
- Moving points: left, right, up or down
[Figure] d12, utt15: ¿Este objeto... hacia acá? (This object... to here?): original F0 contour (violet), MOMEL target points (red squares), and target points modified by the annotator (blue circles)
[Figure] d12, utt23: ¿Ahí está bien? (Is it fine there?): this target point was moved from beyond the right border to the left
[Figure] d12, utt24: Ahí está bien. (It is fine there.): these two target points were deleted by the annotator because they lie on a silent (background-noise) interval
INTSINT (cont.)
Once the stylized contour has been verified by the annotator, INTSINT tags can be automatically generated by M.E.S. (based on the target points)
INTSINT tags are assigned on a relative basis:
M: medium, B: bottom, T: top, U: upstep, D: downstep, S: same, H: higher, L: lower
INTSINT (cont.)
M.E.S. looks for Top and Bottom points and annotates them as T and B
The first point of the sequence is tagged as M (medium) unless it has already been tagged as T or B
INTSINT (cont.)
From the second point onward, tags are assigned by comparing the previous and the next points to the current one
The INTSINT transcription is thus an alphabetic sequence of tags, with one tag per target point in the stylized contour
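A minimal sketch of this tagging logic in Python; the exact rules and thresholds used by M.E.S. are richer, so the step_ratio boundary separating a small step (U/D) from a large jump (H/L) is an assumption made for illustration:

def intsint_tags(f0_targets, step_ratio=1.05):
    # One tag per target point, following the rules described above.
    tags = []
    top, bottom = max(f0_targets), min(f0_targets)
    for i, f0 in enumerate(f0_targets):
        if f0 == top:
            tags.append("T")          # absolute Top
        elif f0 == bottom:
            tags.append("B")          # absolute Bottom
        elif i == 0:
            tags.append("M")          # first point defaults to Medium
        else:
            prev = f0_targets[i - 1]  # relative tags compare with the previous point
            if f0 == prev:
                tags.append("S")
            elif f0 > prev:
                tags.append("U" if f0 <= prev * step_ratio else "H")
            else:
                tags.append("D" if f0 >= prev / step_ratio else "L")
    return "".join(tags)

print(intsint_tags([180, 180, 240, 160, 200, 195, 120]))  # -> "MSTLHDB"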
[Figure: INTSINT tags automatically generated and aligned with the target points]
d15, utt1: ¿Quieres que desplace o traiga algún objeto a la cocina? (Do you want me to move or insert some object into the kitchen?)
[Figure: an INTSINT annotation file in ETI format, automatically created by M.E.S.; each tag refers to the right-hand boundary]
Prefer ToBI or INTSINT?
ToBI:
- It's phonological
- No tools for automatic annotation
- Models for Spanish are still tentative: Sp-ToBI (Beckman et al., 2002) and (Gurlekian et al.)
Prefer ToBI or INTSINT?
INTSINT:
- It's phonetic
- A tool is available for automatic annotation (MOMEL-MES)
- The model is complete enough
Prefer ToBI or INTSINT?
Our first choice is INTSINT, although the complete ToBI scheme could be used in future annotation stages
Dialogue acts: DIME-DAMSL
Based on DIME-DAMSL (Pineda et al., 2006)
DIME-DAMSL:
- Adapted from DAMSL, Dialogue Act Markup in Several Layers (Allen & Core, 1997), to the specific requirements of the DIME Corpus
- Deictic (graphical) act annotation added
The annotator uses the orthographic transcription, video and audio to annotate (looking ahead is allowed)
The DIME-DAMSL annotation format is an attribute-value list
Software tools used: text editor, audio/video player
DIME-DAMSL
Each utterance in a dialogue is annotated as a tag list
A label is an attribute-value expression, as follows:
- attribute = value
- value = a basic tag, or...
- value = [...], i.e. an attribute-value list
- {tag1 | tag2 | ...} means that the value is one of the listed options
DIME-DAMSL
Only attributes used in the utterance are annotated
This version adds one level to annotate utterances which have a compound in the graphical modality
DIME-DAMSL
DIME-DAMSL Notation
comm-status = {uninterp | mono | abandoned}
info-level = {task | task-mngmt | comm-mngmt | levels-list | other}
forward-looking-function = [
  dec = {assert | reassert | other},
  info-request = {y/n-question | pronom-question | imp-question},
  influ-addressee-fut-act = {action-dir | open-option},
  commit-speaker-fut-act = {offer | commitment},
  conventional = {open | close},
  performative,
  exclamation,
  other
]
backward-looking-function = [
  agreement = [reference, {accept | accept-part | maybe | hold | reject | reject-part}],
  und = [reference, {NUS | ack | back-channel | rep-rephr | compl | correction}],
  answer = {reference}
]
mod = [
  graph = {point-obj | point-zone | point-traject | display | point-coord-obj | move-obj | add-obj | delete-obj | graph-plan | visual}
]
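Purely as an illustration of this notation (the actual annotations live in Excel forms, shown next), one utterance's attribute-value list could be written as a nested Python structure; the utterance and the chosen values are invented:

# Hypothetical DIME-DAMSL tag list for an utterance such as
# "do you want me to move this object to here?"; only the
# attributes actually used in the utterance are annotated.
utterance_annotation = {
    "info-level": "task",
    "forward-looking-function": {
        "info-request": "y/n-question",
    },
    "mod": {
        "graph": ["point-obj", "point-zone"],  # deictic acts on the screen
    },
}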
Excel forms are used for D.A. tagging, covering:
- Transaction boundaries
- Charges and credits on the obligations and common ground structures
- DIME-DAMSL tagging
[Figure: Excel form for transaction boundaries]
[Figure: Excel forms for obligations and common ground with charges & credits, and for DIME-DAMSL annotation]
Kappas are computed for each of the three tagging stages:
- Transaction boundaries
- Charges & credits on the obligations and common ground planes
- DIME-DAMSL tags
Want to see the complete forms for tagging on Excel?
Dialogue act annotation can be very subjective
An agreement measure is necessary to evaluate the consistency of the tagging
Measure: the kappa statistic, proposed by Carletta (1996), based on Siegel & Castellan (1988)
Dialogue acts: DIME-DAMSL
Tagging-agreement measure
Kappa statistic:
- The kappa statistic measures the agreement achieved among taggers beyond chance: K = (P(A) - P(E)) / (1 - P(E))
- K = 1.0 means absolute agreement; K = 0.0 means no agreement beyond chance
- P(A): proportion of times the annotators agree [0.0 to 1.0]
- P(E): proportion of agreement expected by chance [0.0 to 1.0]
Kappa statistic
How good are our annotations?
- If K > 0.8, the annotations have good consistency
- If 0.67 < K < 0.8, the annotations are not very consistent, but can be used to draw tentative conclusions
- If K < 0.67, the annotations are not consistent
We use Excel spreadsheet implementations to compute a series of kappa statistics
How to compute Kappa?
k taggers, N utterances, m labels
n_ij: number of taggers who assigned the j-th label to the i-th utterance
C_j: number of times label j is used (the sum of n_ij over the utterances)
[Table: N × m matrix of n_ij counts, with utterances as rows and labels as columns]
How to compute Kappa?
Calculating expected agreement P(E):
- The proportion of assignments with label j: p_j = C_j / (N k)
- The expected proportion of agreement on label j: p_j^2
- The total expected agreement over all labels: P(E) = Σ_{j=1}^{m} p_j^2
How to compute Kappa?
Calculating non-adjusted agreement P(A):
- Agreement among the taggers on the i-th utterance: P_i = (1 / (k(k-1))) Σ_{j=1}^{m} n_ij (n_ij - 1)
- The total ratio of agreement is the average over all tagged utterances: P(A) = (1/N) Σ_{i=1}^{N} P_i
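Putting the two quantities together, here is a compact sketch of the whole computation (it mirrors the Siegel & Castellan formulation above; it is not the project's Excel implementation):

import numpy as np

def kappa(counts):
    # counts is an N x m matrix with counts[i][j] = n_ij, the number of
    # taggers who assigned label j to utterance i.
    counts = np.asarray(counts, dtype=float)
    N, m = counts.shape
    k = counts[0].sum()                 # taggers per utterance (assumed constant)

    p_j = counts.sum(axis=0) / (N * k)  # proportion of assignments per label
    P_E = (p_j ** 2).sum()              # expected chance agreement

    P_i = (counts * (counts - 1)).sum(axis=1) / (k * (k - 1))
    P_A = P_i.mean()                    # observed agreement

    return (P_A - P_E) / (1 - P_E)

# 3 taggers, 4 utterances, 2 labels:
print(kappa([[3, 0], [2, 1], [3, 0], [0, 3]]))  # -> 0.625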
[Results tables: kappa for transaction boundaries]
[Results tables: kappas for charges/credits on the obligations plane (obligations credits)]
[Results tables: kappas for charges/credits on the common ground plane (agreement credits)]
[Results tables: kappas for DIME-DAMSL tags]
Sentence mood
Declarative, interrogative and imperative modalities are considered:
- Declarative: no interrogative intonation, no verbs in imperative mood
- Interrogative: interrogative intonation
- Imperative: verb in imperative mood
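These three rules form a small decision list; a hypothetical sketch in Python, where both inputs stand for the annotator's judgments rather than automatic detectors:

def sentence_mood(interrogative_intonation, imperative_verb):
    # Decision rules from the slide above.
    if interrogative_intonation:
        return "interrogative"
    if imperative_verb:
        return "imperative"
    return "declarative"  # no interrogative intonation, no imperative verb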
Parts of Speech (POS)
Ivan’s talk
Discourse markers
Ivan’s talk
Speech repairs
Ivan’s talk
Tagging statistics of the DIME Corpus
Total number of utterances in the corpus: 5,369
Total utterances to be annotated: 5,369 utts × 12 levels = 64,428
Tagging statistics: March 6th
LEVEL                       TAGGED UTTS.       %
Allophones (T54)                   1,282   23.9%
Phonetic syllables                 1,282   23.9%
Words                              1,165   21.7%
Default MOMEL                      5,369  100.0%
Verified MOMEL                       723   13.5%
INTSINT                              100    1.9%
Break Indices (from ToBI)          1,165   21.7%
Sentence Mood                        743   13.8%
Parts of Speech (P.O.S.)           1,122   20.9%
Discourse Markers                    930   17.3%
Speech Repairs                       656   12.2%
DIME-DAMSL                           254    4.7%
TOTAL                             14,791   23.0%
Global tagging advance: 14,791 / 64,428 = 23.0%
Most of the tagging process has been manual, time-consuming and labor-intensive
The DIME-DAMSL tagging data represent a preliminary approach to the task
The produced data can be used to build automatic taggers
Final comments
Other prosodic data could be generated to enrich the DIME Corpus resources:
- Intensity parameters
- Pause durations
- Vowel durations
- Syllable durations
- Stressed syllables
- Some lexical features (cue words, cue phrases)
Thank you
REFERENCES
Beckman, M., Díaz-Campos, M., Tevis McGory, J. & Morgan, T. (2002). Intonation across Spanish, in the Tones and Break Indices framework.
Carletta, J. (1996). Assessing agreement on classification tasks: the kappa statistic. Human Communication Research Centre, University of Edinburgh, Scotland.
Cuétara, J. (2004). The Mexbet phonetic alphabet.
Gurlekian, J., Rodríguez, H., Colantoni, L. & Torres, H. Development of a Prosodic Database for an Argentine Spanish Text to Speech System. LIS-CONICET, Argentina.
Hirst, D. & Espesser, R. (1993). The MOMEL algorithm.
Hirst, D., Nicolas, P. & Espesser, R. (1991). Coding the F0 of a continuous text in French: an experimental approach.
Siegel, S. & Castellan, N. J. (1988). Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill.
Villaseñor, L., Massé, A. & Pineda, L. A. (2001). The DIME Corpus. ENC'01, 3er Encuentro Internacional de Ciencias de la Computación, SMCC-INEGI, Aguascalientes, México, September 2001.
INTSINT perceptual verification
Llisterri et al. (1996) evaluated MOMEL's efficiency and failures, finding for Spanish:
- 72.85% of the errors: missing target points in final rising contours (not detected by MOMEL), which had to be manually inserted
- 15.43% of the errors: missing target points in initial position
- These errors are linked to the beginning and end of sentences in cases where a pause occurs
Intonation tagging
INTSINT scheme: using the MOMEL algorithm (Hirst, 2000) in the M.E.S. software (Espesser, 1999)
Intonation tagging
¿Me puedes recorrer el el fregadero un poco hacia <sil> hacia el frigobar?
(Can you move the the sink a little bit toward <sil> toward the minibar?)
Intonation tagging
INTSINT tag set:
M: medium, T: top, B: bottom
H: higher, L: lower, U: upstep, D: downstep, S: same
Tone tags are concatenated into an INTSINT word (iw)
[Figure: tone tags plotted over time]
Intonation tagging
[Figures: two utterances with their automatically generated INTSINT words]
MSTLHDLUHLHDSDLUHBUS
MTLHDLUHLHDDLUHBU