THE DIME CORPUS
Presented by Sergio Coria
IIMAS-UNAM. March, 2006
LUIS PINEDA ET AL.
ANNOTATION TEAM
Haydé Castellanos
Isaac Castillo
Sergio Coria
Javier Cuétara
Varinia Estrada
Laura I. González
Laura P. Hernández
Fernanda López
Isabel López
Ivonne López
Iván Moreno
Valentina Muñoz
Patricia Pérez
Laura L. Rosales
Arturo Wong
DIME Corpus
What is it?
Multimodal: speech & video corpus [Villaseñor et al., 2001]
26 dialogues developed under the Wizard of Oz protocol
Task-oriented domain
Domain: computer-aided kitchen design
Original resources: audio (FEA and WAV) and video (AVI) files, orthographic transcriptions (TXT)
Volunteers: 14 people (users), aged 30 on average, with C.S.-related backgrounds
Tasks performed by every user:
1) Simple: rearrange and complete the furnishing of a small kitchen according to a furniture layout shown on a sheet of paper.
2) Complex: completely furnish a medium-sized kitchen.
DIME Corpus
How was it created?
DIME Corpus
What was it created for?
To offer empirical information for studying the use of, and interaction between, spoken language, deictic* gestures and graphical context during human-computer interaction
Long-term goal: to build a conversational system able to talk in Mexican Spanish about task-oriented domains
* pointing, showing
Prosodic information: why do we need it?
To analyze the relationship between intonation and dialogue acts in speech, in order to find patterns that help improve ASR and dialogue management efficiency
For speech repair analysis
For Sp-ToBI evaluation (maybe a specific proposal)
To share the corpus with the research community
Annotation layers
Orthographic (based on utterance segmentation)
Layers:
- Allophones
- Phonetic syllables
- Words
- Intonation (INTSINT)
- ToBI (Break Indices only)
- Sentence mood (declarative, interrogative, imperative)
- Dialogue acts (DIME-DAMSL)
- Parts of Speech (P.O.S.)
- Discourse markers
- Speech repairs
Orthographic transcription
Dialogue: s6-t1-g1 (d12)
Number of utterance files: 117
Length of dialogue: 7 min 19 s
Estimated number of turns: 69
utt1 : s: do you want me to move or insert some object into the kitchen?
utt2 : u: <noise> no
utt3 : can you move the stove to the left?
utt4 : s: <noise> where to?
utt5 : u: <noise> to <sil> to the right
utt6 : s: <no-vocal> to the right
utt7 : <no-vocal> okay
utt8 : how far do you want me to move it?
utt9 : u: in the middle of the space between the window and the wall
utt10 : s: okay
utt11 : do you want me to move this object <sil> to here?
utt12 : u: no
utt13 : to the other side
S: System (Wizard)
U: User (volunteer)
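For readers who want to process these transcriptions, here is a small sketch of how such lines might be parsed; the format is inferred from the excerpt above, and the regular expression is an assumption, not part of the corpus tools:

import re

# Pattern inferred from the excerpt: "uttN : [speaker:] text", where the
# speaker mark (s: or u:) is omitted when the same speaker continues.
UTT_RE = re.compile(r"^utt(\d+)\s*:\s*(?:([su])\s*:\s*)?(.*)$")

def parse_utterance(line, last_speaker=None):
    # Return (utt_id, speaker, text); reuse last_speaker on continuations.
    m = UTT_RE.match(line.strip())
    if not m:
        raise ValueError(f"not an utterance line: {line!r}")
    utt_id, speaker, text = m.groups()
    return int(utt_id), speaker or last_speaker, text

print(parse_utterance("utt3 : can you move the stove to the left?", "u"))
# -> (3, 'u', 'can you move the stove to the left?')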
Allophones
Based on the Mexbet phonetic alphabet, including context rules (Cuétara, 2004)
PERL scripts read the orthographic transcriptions and produce preliminary, non-time-aligned allophone annotations (default “average” times) → TranscribeMex (Villaseñor & Cuétara, 2004)
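TranscribeMex itself applies Cuétara's context-dependent rules; purely as an illustration of the idea of producing a default-time allophone tier from orthography, here is a heavily simplified sketch in which both the letter-to-allophone map and the 80 ms default duration are invented:

# Invented letter-to-allophone map; the real tool applies Mexbet
# context rules (Cuétara, 2004), not a plain lookup table.
LETTER_TO_ALLOPHONE = {"c": "k", "o": "o", "i": "i", "n": "n", "a": "a"}
DEFAULT_DUR = 0.080  # assumed "average" segment duration, in seconds

def default_annotation(word, start=0.0):
    # Produce a preliminary, non-time-aligned allophone tier for one word.
    tier, t = [], start
    for letter in word:
        tag = LETTER_TO_ALLOPHONE.get(letter, letter)
        tier.append((round(t, 3), round(t + DEFAULT_DUR, 3), tag))
        t += DEFAULT_DUR
    return tier

print(default_annotation("cocina"))
# -> [(0.0, 0.08, 'k'), (0.08, 0.16, 'o'), ..., (0.4, 0.48, 'a')]

Annotators then drag these default boundaries to the real segment edges, as described below.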
Allophones
Annotators manually align the default allophone annotations and use them for the following layers
Annotators use the audio file, oscillogram, 3D color spectrogram (to see formants) and pitch track (to tell sound from silence) in the CSLU SpeechView software
Allophones
Annotation includes stressed vowels
d15, utt1: Do you want me to move or insert some object into the kitchen? (¿Quieres que desplace o traiga algún objeto a la cocina?)
[Figures: pitch track, 3D color spectrogram, and allophone tier (with zoom in)]
Phonetic syllables
Utterances are segmented into syllables as the speaker uttered them (we annotate what we hear)
Annotated with allophone tags
Resyllabification: the original syllables of the words can be realigned on a phonetic basis (synalepha, where adjacent vowels across a word boundary merge into one phonetic syllable, and other cases)
Phonetic syllables
New syllables may emerge and others may disappear
This layer is similar to the allophone layer
Phonetic syllables
Phonetic syllables (zoom in)
Conventions for word annotation
- Accented vowel: tagged with _7 (from Mexbet)
- Every tag must be written in lower case
- The letter ñ: tagged with n~ (from Mexbet)
Words (orthographic)
Annotated with the English alphabet
Words (zoom in): they do not necessarily coincide with the phonetic syllables
Orthographic accent
ToBI (Break Indices only)
ToBI stands for Tones and Break Indices, but we annotate break indices only
Aligned with the words layer
Based on Sp-ToBI (Beckman et al., 2002) and (Gurlekian et al.), but modified for our project
The tags used:
- 0 (zero) = syllabic reduction via vowel contact between words
- 1 = word boundary
- 4 = intonational phrase boundary
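As a sketch only, a break-index tier aligned to the word tier can be thought of as (word, index) pairs; the words come from utterance d15, utt1 shown earlier, but which boundaries receive a 0 here is an invented guess:

# Hypothetical break-index tier for "¿Quieres que desplace o traiga
# algún objeto a la cocina?", one index per right-hand word boundary.
# 0 = vowel-contact reduction, 1 = word boundary, 4 = intonational phrase boundary.
break_index_tier = [
    ("quieres", 1), ("que", 1), ("desplace", 0), ("o", 1), ("traiga", 0),
    ("algún", 1), ("objeto", 0), ("a", 1), ("la", 1), ("cocina", 4),
]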
ToBI (break indices only): aligned with words (or silences)
ToBI (break indices only) (zoom in)
INTSINT
INTSINT: International Transcription System for Intonation (Hirst et al., 1991)
Semi-automatic
Based on the MOMEL (Modélisation de Mélodie) algorithm (Hirst & Espesser, 1993), implemented in the M.E.S. software (Motif Environment for Speech)
INTSINT
MOMEL obtains a stylized F0 contour by determining target points (T.P.), which are joined using spline functions
M.E.S. requires the annotator to perceptually verify the automatically generated stylized contour (listening to the original versus the stylized contour)
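As a rough sketch of the "target points joined by splines" idea (MOMEL itself fits a quadratic spline through automatically detected points; the target values below are invented):

import numpy as np
from scipy.interpolate import CubicSpline

# Hypothetical MOMEL-style target points: (time in seconds, F0 in Hz).
target_t = np.array([0.10, 0.45, 0.80, 1.20, 1.55])
target_f0 = np.array([120.0, 180.0, 150.0, 210.0, 110.0])

# Join the target points with a smooth spline: this stylized contour is
# what the annotator listens to and compares against the original F0.
spline = CubicSpline(target_t, target_f0)
t = np.linspace(target_t[0], target_t[-1], 200)
stylized_f0 = spline(t)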
[Figure: M.E.S. graphic interface, showing the original F0 contour (violet), the stylized contour checked by the annotator (blue), and a target point]
d15, utt1: ¿Quieres que desplace o traiga algún objeto a la cocina? (Do you want me to move or insert some object into the kitchen?)
INTSINT perceptual verification
Manual modification of inflection points on the stylized contour is sometimes necessary to make it match the original contour
INTSINT perceptual verification
The most common modifications are:
- Eliminating points at the end, which appear on a silent (or background-noise) interval
- Inserting (not too many) new points where the stylized contour needs them
- Moving points: left, right, up or down
[Figure] d12, utt15: ¿Este objeto... hacia acá? (This object... to here?): original F0 contour (violet), MOMEL target points (red squares), and target points modified by the annotator (blue circles)
[Figure] d12, utt23: ¿Ahí está bien? (Is it fine there?): this target point was moved from beyond the right border to the left
[Figure] d12, utt24: Ahí está bien. (It is fine there.): these two target points were deleted by the annotator because they lie on a silent (background-noise) interval
INTSINT (cont.)
Once the stylized contour has been verified by the annotator, INTSINT tags can be automatically generated by M.E.S. (based on the target points)
INTSINT tags are assigned on a relative basis:
M: medium, B: bottom, T: top, U: upstep, D: downstep, S: same, H: higher, L: lower
INTSINT (cont.)
M.E.S. looks for Top and Bottom points and annotates them as T and B
The first point of the sequence is tagged as M (medium) unless it has already been tagged as T or B
INTSINT (cont.)
From the second point onward, tags are assigned by comparing the previous and the next points to the current one
The INTSINT transcription is thus an alphabetic sequence of tags, with one tag per target point in the stylized contour
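A minimal sketch of this tagging logic in Python; the exact rules and thresholds used by M.E.S. are richer, so the step_ratio boundary separating a small step (U/D) from a large jump (H/L) is an assumption made for illustration:

def intsint_tags(f0_targets, step_ratio=1.05):
    # One tag per target point, following the rules described above.
    tags = []
    top, bottom = max(f0_targets), min(f0_targets)
    for i, f0 in enumerate(f0_targets):
        if f0 == top:
            tags.append("T")          # absolute Top
        elif f0 == bottom:
            tags.append("B")          # absolute Bottom
        elif i == 0:
            tags.append("M")          # first point defaults to Medium
        else:
            prev = f0_targets[i - 1]  # relative tags compare with the previous point
            if f0 == prev:
                tags.append("S")
            elif f0 > prev:
                tags.append("U" if f0 <= prev * step_ratio else "H")
            else:
                tags.append("D" if f0 >= prev / step_ratio else "L")
    return "".join(tags)

print(intsint_tags([180, 180, 240, 160, 200, 195, 120]))  # -> "MSTLHDB"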
[Figure: INTSINT tags automatically generated and aligned with the target points]
d15, utt1: ¿Quieres que desplace o traiga algún objeto a la cocina? (Do you want me to move or insert some object into the kitchen?)
[Figure: an INTSINT annotation file in ETI format, automatically created by M.E.S.; each tag refers to the right-hand boundary]
Prefer ToBI or INTSINT?
ToBI:
- It's phonological
- No tools for automatic annotation
- Models for Spanish are still tentative: Sp-ToBI (Beckman et al., 2002) and (Gurlekian et al.)
Prefer ToBI or INTSINT?
INTSINT:
- It's phonetic
- A tool is available for automatic annotation (MOMEL-MES)
- The model is complete enough
Prefer ToBI or INTSINT?
Our first choice is INTSINT, although the complete ToBI scheme could be used in future annotation stages
Dialogue acts: DIME-DAMSL
Based on DIME-DAMSL (Pineda et al., 2006)
DIME-DAMSL:
- Adapted from DAMSL, Dialogue Act Markup in Several Layers (Allen & Core, 1997), to the specific requirements of the DIME Corpus
- Deictic (graphical) act annotation added
The annotator uses the orthographic transcription, video and audio to annotate (looking ahead is allowed)
The DIME-DAMSL annotation format is an attribute-value list
Software tools used: text editor, audio/video player
DIME-DAMSL
Each utterance in a dialogue is annotated as a tag list
A label is an attribute-value expression, as follows:
- attribute = value
- value = a basic tag, or...
- value = [...], i.e. an attribute-value list
- {tag1 | tag2 | ...} means that the value is one of the listed options
DIME-DAMSL
Only attributes used in the utterance are annotated
This version adds one level to annotate utterances which have a compound in the graphical modality
DIME-DAMSL
DIME-DAMSL Notation
comm-status = {uninterp | mono | abandoned}
info-level = {task | task-mngmt | comm-mngmt | levels-list | other}
forward-looking-function = [
  dec = {assert | reassert | other},
  info-request = {y/n-question | pronom-question | imp-question},
  influ-addressee-fut-act = {action-dir | open-option},
  commit-speaker-fut-act = {offer | commitment},
  conventional = {open | close},
  performative,
  exclamation,
  other
]
backward-looking-function = [
  agreement = [reference, {accept | accept-part | maybe | hold | reject | reject-part}],
  und = [reference, {NUS | ack | back-channel | rep-rephr | compl | correction}],
  answer = {reference}
]
mod = [
  graph = {point-obj | point-zone | point-traject | display | point-coord-obj | move-obj | add-obj | delete-obj | graph-plan | visual}
]
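Purely as an illustration of this notation (the actual annotations live in Excel forms, shown next), one utterance's attribute-value list could be written as a nested Python structure; the utterance and the chosen values are invented:

# Hypothetical DIME-DAMSL tag list for an utterance such as
# "do you want me to move this object to here?"; only the
# attributes actually used in the utterance are annotated.
utterance_annotation = {
    "info-level": "task",
    "forward-looking-function": {
        "info-request": "y/n-question",
    },
    "mod": {
        "graph": ["point-obj", "point-zone"],  # deictic acts on the screen
    },
}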
Excel forms are used for D.A. tagging, covering:
- Transaction boundaries
- Charges and credits on the obligations and common ground structures
- DIME-DAMSL tagging
[Figure: Excel form for transaction boundaries]
[Figure: Excel forms for obligations and common ground with charges & credits, and for DIME-DAMSL annotation]
Kappas are computed for each of the three tagging stages:
- Transaction boundaries
- Charges & credits on the obligations and common ground planes
- DIME-DAMSL tags
Want to see the complete forms for tagging on Excel?
Dialogue act annotation can be very subjective
An agreement measure is necessary to evaluate the consistency of the tagging
Measure: the kappa statistic, proposed by Carletta (1996), based on Siegel & Castellan (1988)
Dialogue acts: DIME-DAMSL
Tagging-agreement measure
Kappa statistic:
- The kappa statistic measures the agreement achieved among taggers beyond chance: K = (P(A) - P(E)) / (1 - P(E))
- K = 1.0 means absolute agreement; K = 0.0 means no agreement beyond chance
- P(A): proportion of times the annotators agree [0.0 to 1.0]
- P(E): proportion of agreement expected by chance [0.0 to 1.0]
Kappa statistic
How good are our annotations?
- If K > 0.8, the annotations have good consistency
- If 0.67 < K < 0.8, the annotations are not very consistent, but can be used to draw tentative conclusions
- If K < 0.67, the annotations are not consistent
We use Excel spreadsheet implementations to compute a series of kappa statistics
How to compute Kappa?
k taggers, N utterances, m labels
n_ij: number of taggers who assigned the j-th label to the i-th utterance
C_j: number of times label j is used (the sum of n_ij over the utterances)
[Table: N × m matrix of n_ij counts, with utterances as rows and labels as columns]
How to compute Kappa?
Calculating expected agreement P(E):
- The proportion of assignments with label j: p_j = C_j / (N k)
- The expected proportion of agreement on label j: p_j^2
- The total expected agreement over all labels: P(E) = Σ_{j=1}^{m} p_j^2
How to compute Kappa?
Calculating non-adjusted agreement P(A):
- Agreement among the taggers on the i-th utterance: P_i = (1 / (k(k-1))) Σ_{j=1}^{m} n_ij (n_ij - 1)
- The total ratio of agreement is the average over all tagged utterances: P(A) = (1/N) Σ_{i=1}^{N} P_i
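Putting the two quantities together, here is a compact sketch of the whole computation (it mirrors the Siegel & Castellan formulation above; it is not the project's Excel implementation):

import numpy as np

def kappa(counts):
    # counts is an N x m matrix with counts[i][j] = n_ij, the number of
    # taggers who assigned label j to utterance i.
    counts = np.asarray(counts, dtype=float)
    N, m = counts.shape
    k = counts[0].sum()                 # taggers per utterance (assumed constant)

    p_j = counts.sum(axis=0) / (N * k)  # proportion of assignments per label
    P_E = (p_j ** 2).sum()              # expected chance agreement

    P_i = (counts * (counts - 1)).sum(axis=1) / (k * (k - 1))
    P_A = P_i.mean()                    # observed agreement

    return (P_A - P_E) / (1 - P_E)

# 3 taggers, 4 utterances, 2 labels:
print(kappa([[3, 0], [2, 1], [3, 0], [0, 3]]))  # -> 0.625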
[Results tables: kappa for transaction boundaries]
[Results tables: kappas for charges/credits on the obligations plane (obligations credits)]
[Results tables: kappas for charges/credits on the common ground plane (agreement credits)]
[Results tables: kappas for DIME-DAMSL tags]
Sentence mood
Declarative, interrogative and imperative modalities are considered:
- Declarative: no interrogative intonation, no verbs in imperative mood
- Interrogative: interrogative intonation
- Imperative: verb in imperative mood
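These three rules form a small decision list; a hypothetical sketch in Python, where both inputs stand for the annotator's judgments rather than automatic detectors:

def sentence_mood(interrogative_intonation, imperative_verb):
    # Decision rules from the slide above.
    if interrogative_intonation:
        return "interrogative"
    if imperative_verb:
        return "imperative"
    return "declarative"  # no interrogative intonation, no imperative verb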
Parts of Speech (POS)
Ivan’s talk
Discourse markers
Ivan’s talk
Speech repairs
Ivan’s talk
Tagging statistics of the DIME Corpus
Total number of utterances in the corpus: 5,369
Total utterances to be annotated: 5,369 utts × 12 levels = 64,428
Tagging statistics: March 6th
LEVEL                       TAGGED UTTS.       %
Allophones (T54)                   1,282   23.9%
Phonetic syllables                 1,282   23.9%
Words                              1,165   21.7%
Default MOMEL                      5,369  100.0%
Verified MOMEL                       723   13.5%
INTSINT                              100    1.9%
Break Indices (from ToBI)          1,165   21.7%
Sentence Mood                        743   13.8%
Parts of Speech (P.O.S.)           1,122   20.9%
Discourse Markers                    930   17.3%
Speech Repairs                       656   12.2%
DIME-DAMSL                           254    4.7%
TOTAL                             14,791   23.0%
Global tagging advance: 14,791 / 64,428 = 23.0%
Most of the tagging process has been manual, time-consuming and labor-intensive
The DIME-DAMSL tagging data represent a preliminary approach to the task
The produced data can be used to build automatic taggers
Final comments
Other prosodic data could be generated to enrich the DIME Corpus resources:
- Intensity parameters
- Pause durations
- Vowel durations
- Syllable durations
- Stressed syllables
- Some lexical features (cue words, cue phrases)
Thank you
REFERENCES
Beckman, M., Díaz-Campos, M., Tevis McGory, J. & Morgan, T. (2002). Intonation across Spanish, in the Tones and Break Indices framework.
Carletta, J. (1996). Assessing agreement on classification tasks: the kappa statistic. Human Communication Research Centre, University of Edinburgh, Scotland.
Cuétara, J. (2004). The Mexbet phonetic alphabet.
Gurlekian, J., Rodríguez, H., Colantoni, L. & Torres, H. Development of a Prosodic Database for an Argentine Spanish Text to Speech System. LIS-CONICET, Argentina.
Hirst, D. & Espesser, R. (1993). The MOMEL algorithm.
Hirst, D., Nicolas, P. & Espesser, R. (1991). Coding the F0 of a continuous text in French: an experimental approach.
Siegel, S. & Castellan, N. J. (1988). Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill.
Villaseñor, L., Massé, A. & Pineda, L. A. (2001). The DIME Corpus. ENC'01, 3er Encuentro Internacional de Ciencias de la Computación, SMCC-INEGI, Aguascalientes, México, September 2001.
INTSINT perceptual verification
Llisterri et al. (1996) evaluated MOMEL's efficiency and failures, finding for Spanish:
- 72.85% of the errors: missing target points in final rising contours (not detected by MOMEL), which had to be manually inserted
- 15.43% of the errors: missing target points in initial position
- These errors are linked to the beginning and end of sentences in cases where a pause occurs
Intonation tagging
INTSINT scheme: using the MOMEL algorithm (Hirst, 2000) in the M.E.S. software (Espesser, 1999)
Intonation tagging
¿Me puedes recorrer el el fregadero un poco hacia <sil> hacia el frigobar?
(Can you move the the sink a little bit toward <sil> toward the minibar?)
Intonation tagging
INTSINT tag set:
M: medium, T: top, B: bottom
H: higher, L: lower, U: upstep, D: downstep, S: same
Tone tags are concatenated into an INTSINT word (iw)
[Figure: tone tags plotted over time]
Intonation tagging
[Figures: two utterances with their automatically generated INTSINT words]
MSTLHDLUHLHDSDLUHBUS
MTLHDLUHLHDDLUHBU