
Problems in the annotation of spoken language corpora

Joaquim Llisterri
Grup de Fonètica, Departament de Filologia Espanyola, Universitat Autònoma de Barcelona
[email protected]
http://liceu.uab.cat/~joaquim

Sonderforschungsbereich “Mehrsprachigkeit”, Universität Hamburg
25 July 2007

http://liceu.uab.cat/~joaquim/language_resources/Hamburg_07/Hamburg_07.html

Levels of annotation
Orthographic representation
Segmental annotation
Suprasegmental annotation
Final remarks


Levels of annotation

“Corpus annotation is the practice of adding interpretative linguistic information to a corpus”

LEECH, G. (2005) “Adding linguistic annotation”, in WYNNE, M. (Ed.) Developing Linguistic Corpora: a Guide to Good Practice. Oxford: Oxbow Books. pp. 17-29. http://www.ahds.ac.uk/creating/guides/linguistic-corpora/chapter2.htm

• The annotation of a corpus can be conceived as a set of hierarchically organized layers
• Layers usually represent linguistic levels of analysis
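The layered view can be sketched as a small data structure in which each tier holds time-aligned labels for one linguistic level and all tiers share a single timeline. This is only an illustration: the class names, tier names, times and labels below are invented and do not reproduce the format of any particular annotation tool.

```python
from dataclasses import dataclass, field


@dataclass
class Interval:
    start: float  # seconds
    end: float    # seconds
    label: str


@dataclass
class Tier:
    name: str
    intervals: list = field(default_factory=list)

    def labels_between(self, start, end):
        """Return the labels of intervals that overlap the span [start, end)."""
        return [i.label for i in self.intervals if i.start < end and i.end > start]


# Hypothetical three-layer annotation of the Spanish word "casa".
corpus = {
    "orthographic": Tier("orthographic", [Interval(0.00, 0.40, "casa")]),
    "pos": Tier("pos", [Interval(0.00, 0.40, "NOUN")]),
    "phonetic": Tier("phonetic", [
        Interval(0.00, 0.08, "k"), Interval(0.08, 0.20, "a"),
        Interval(0.20, 0.30, "s"), Interval(0.30, 0.40, "a"),
    ]),
}


def layers_at(tiers, start, end):
    """Collect, for every tier, the labels overlapping a time span."""
    return {name: tier.labels_between(start, end) for name, tier in tiers.items()}
```

Calling `layers_at(corpus, 0.0, 0.2)` reads all layers "vertically" for the first syllable, in the way a score is read across its staves.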

Figure: musical score
Figure: EXMARaLDA Partitur Editor

• The levels of annotation are established according to the aims of the research to be carried out with the corpus
  • Pragmatics and discourse analysis
  • Grammar of spoken language
  • Phonetics and phonology
  • …
• Different kinds of labels are used for different annotation levels
  • Phonetic symbols: phonetic labelling
  • Morphological tags: POS (part-of-speech) tagging
  • Syntactic labels: parsing

• Phonetic labelling
• POS tagging (CLiC, Centre de Llenguatge i Computació, http://clic.fil.ub.es)
• Parsing (CLiC, Centre de Llenguatge i Computació, http://clic.fil.ub.es)


Orthographic representation

• The orthographic representation corresponds to a representation of the speakers' utterances using the standard spelling of a given language
• Also known as transliteration
• The orthographic level is common to all speech and spoken corpora

“[…] the words that appear in an orthographic transcription of a speech event constitute only a partial representation of the original speech event. To supplement this record of the event, the analyst can capture other features, by making either a prosodic or phonetic transcription, and can also record contextual features. However, […] the record remains inevitably partial.”

THOMPSON, P. (2005) "Spoken language corpora", in WYNNE, M. (Ed.) Developing Linguistic Corpora: a Guide to Good Practice. Oxford: Oxbow Books. pp. 59-70. http://www.ahds.ac.uk/creating/guides/linguistic-corpora/chapter5.htm

Problems in spontaneous speech

• Punctuation
  • Adding punctuation implies a segmentation decided by the transcriber
  • Lack of punctuation decreases the legibility of the text
  • Avoidance of more difficult punctuation marks like “;”
• Non-standard forms (situational, social or geographic variation)
• Vocal semi-lexical forms
• Disfluencies: self-repairs, word fragments
• Unintelligible fragments

Recommendations

Preliminary Recommendations on Spoken Texts. EAGLES Document EAG-TCWG-STP/P, May 1996. http://www.ilc.cnr.it/EAGLES96/spokentx/spokentx.html

• Use conventional spelling forms as they appear in a standard dictionary. This also applies to contractions, reduced word forms, apostrophes, dialect forms, interjections and vocalised semi-lexical events
• If more than one orthographic form is possible or if non-standard spellings or spelling variations are necessary, maintain a lexicon of the spelling forms used in the transcription
• Represent numbers, abbreviations, acronyms and spelled words in full orthographic form as pronounced by the speaker

Recommendations

SENIA, F. - van VELDEN, J.G. (1997) Specifications of orthographic transcription and lexicon conventions. LRE-4001 SpeechDat Technical Report SD1.3.2, Final version, 10 January 1997. http://www.speechdat.org/speechdat/deliverables/public/SD132V24.PDF

• Normal lexical items will be represented by their spellings in the normal way
• It is possible to include a very restricted number of markings for regular variations in pronunciation, provided that they are documented and no more than two or three regular variations are indicated
• Abbreviations should be represented by their full orthographic forms, unless they are spoken in their abbreviated form
• Number sequences will be spelled out to reflect what was said
• If a speaker pronounces letters, acronyms or abbreviations as a word, then these should be spelled out as words
• No punctuation will be provided in the transcription other than those symbols used for special transcription purposes
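Conventions of this kind lend themselves to automatic checking. The sketch below flags two of the violations listed above: digits that were not spelled out, and punctuation in the transcription. It is a minimal illustration, not part of the SpeechDat specification; the square-bracket marker syntax and the punctuation set are assumptions made for the example.

```python
import re

# Hypothetical marker syntax: anything in square brackets counts as a
# "special transcription symbol" and is exempt from the checks.
MARKER = re.compile(r"\[[a-z]+\]")
PUNCTUATION = re.compile(r"[.,;:!?¿¡\"()]")


def check_transcription(line):
    """Return a list of convention violations found in one transcribed line."""
    problems = []
    text = MARKER.sub(" ", line)  # drop the exempt markers before checking
    for token in text.split():
        if any(ch.isdigit() for ch in token):
            problems.append(f"digits not spelled out: {token}")
        if PUNCTUATION.search(token):
            problems.append(f"punctuation in transcription: {token}")
    return problems
```

A checker like this cannot decide how a number was actually said (for instance "five five five" versus "five hundred and fifty-five"); that remains the transcriber's decision.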

• Enriched orthographic representation
  • Incorporates information which cannot be represented with conventional spelling
  • Used in pragmatics, discourse and conversation analysis, among other fields
• Phenomena included in an enriched orthographic representation need to be encoded

SPERBERG-McQUEEN, C.M. - BURNARD, L. (Eds.) (2007) "7 Transcriptions of Speech", in TEI P5: Guidelines for Electronic Text Encoding and Interchange. The TEI Consortium: Oxford, Providence, Charlottesville, Nancy. http://www.tei-c.org/release/doc/tei-p5-doc/html/TS.html

• <u> (utterance) a stretch of speech usually preceded and followed by silence or by a change of speaker.
• <pause/> a pause either between or within utterances.
• <vocal> (Vocalized semi-lexical) any vocalized but not necessarily lexical phenomenon, for example voiced pauses, non-lexical backchannels, etc.
• <kinesic> (Non-vocalized quasi-lexical) any communicative phenomenon, not necessarily vocalized, for example a gesture, frown, etc.
• <event> any phenomenon or occurrence, not necessarily vocalized or communicative, for example incidental noises or other events affecting communication.
• <writing> (Writing) a passage of written text revealed to participants in the course of a spoken text.
• <shift/> marks the point at which some paralinguistic feature of a series of utterances by any one speaker changes.
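To illustrate how such elements combine, the following sketch parses a small invented fragment with Python's standard `xml.etree.ElementTree` module and extracts speakers, pauses, events and shifts. The fragment is an assumption built for the example: it uses the element names listed above, omits the TEI namespace and header for brevity, and wraps everything in a plain `<div>`.

```python
import xml.etree.ElementTree as ET

# Invented fragment; element names follow the list above, the content is made up.
SAMPLE = """<div>
  <u who="#A">so <pause/> I went to the <vocal><desc>laughs</desc></vocal> station</u>
  <event><desc>door slams</desc></event>
  <u who="#B"><shift feature="loud" new="f"/>you did what</u>
</div>"""


def summarize(xml_text):
    """List speakers, events and shifts, and count pauses, in a fragment."""
    root = ET.fromstring(xml_text.strip())
    return {
        "speakers": [u.get("who") for u in root.iter("u")],
        "pauses": len(list(root.iter("pause"))),
        "events": [e.findtext("desc") for e in root.iter("event")],
        "shifts": [(s.get("feature"), s.get("new")) for s in root.iter("shift")],
    }
```

Because the paralinguistic information lives in markup rather than in ad hoc spelling, the words of the transcription stay in standard orthography and the enrichment can be queried or ignored as needed.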

<shift/> in tempo
• a - allegro (fast)
• aa - very fast
• acc - accelerando (getting faster)
• l - lento (slow)
• ll - very slow
• rall - rallentando (getting slower)

<shift/> in loud (loudness)
• f - forte (loud)
• ff - very loud
• cresc - crescendo (getting louder)
• p - piano (soft)
• pp - very soft
• dimin - diminuendo (getting softer)

<shift/> in pitch (pitch range)
• high - high pitch-range
• low - low pitch-range
• wide - wide pitch-range
• narrow - narrow pitch-range
• asc - ascending
• desc - descending
• monot - monotonous
• scand - scandent, each succeeding syllable higher than the last, generally ending in a falling tone

<shift/> in tension
• sl - slurred
• lax - lax, a little slurred
• ten - tense
• pr - very precise
• st - staccato, every stressed syllable being doubly stressed
• leg - legato, every syllable receiving more or less equal stress

<shift/> in rhythm
• rh - beatable rhythm
• arrh - arrhythmic, particularly halting
• spr - spiky rising, with markedly higher unstressed syllables
• spf - spiky falling, with markedly lower unstressed syllables
• glr - glissando rising, like spiky rising but the unstressed syllables, usually several, also rise in pitch relative to each other
• glf - glissando falling, like spiky falling but with the unstressed syllables also falling in pitch relative to each other

<shift/> in voice (voice quality)
• whisp - whisper
• breath - breathy
• husk - husky
• creak - creaky
• fals - falsetto
• reson - resonant
• giggle - unvoiced laugh or giggle
• laugh - voiced laugh
• trem - tremulous
• sob - sobbing
• yawn - yawning
• sigh - sighing
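A closed inventory like this one can be kept in machine-readable form so that every <shift/> in a corpus is checked against it. The sketch below copies the feature names and values from the lists above; the dictionary and function names are invented, and any value not listed here (for example one that resets a feature) would have to be added by the project that uses it.

```python
# Feature names and values as listed above; the variable name is illustrative.
SHIFT_VALUES = {
    "tempo": {"a", "aa", "acc", "l", "ll", "rall"},
    "loud": {"f", "ff", "cresc", "p", "pp", "dimin"},
    "pitch": {"high", "low", "wide", "narrow", "asc", "desc", "monot", "scand"},
    "tension": {"sl", "lax", "ten", "pr", "st", "leg"},
    "rhythm": {"rh", "arrh", "spr", "spf", "glr", "glf"},
    "voice": {"whisp", "breath", "husk", "creak", "fals", "reson",
              "giggle", "laugh", "trem", "sob", "yawn", "sigh"},
}


def is_valid_shift(feature, new):
    """True if `new` is a listed value for the given <shift/> feature."""
    return new in SHIFT_VALUES.get(feature, set())
```

For instance, `is_valid_shift("tempo", "acc")` holds, while `is_valid_shift("loud", "acc")` does not, because "acc" is a tempo value.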

• “A full definition of the sense of the values provided for each feature should be provided in the encoding description section of the text header”

• "Keep it simple"
• "Document everything adequately"

SENIA, F. - van VELDEN, J.G. (1997) Specifications of orthographic transcription and lexicon conventions. LRE-4001 SpeechDat Technical Report SD1.3.2, Final version, 10 January 1997. http://www.speechdat.org/speechdat/deliverables/public/SD132V24.PDF


Segmental annotation

• Segmental annotation concerns the phonetic representation of the utterances pronounced by the speakers
• Levels of segmental annotation
  • Citation or canonical form
    • Words are transcribed in their canonical form, as pronounced in isolation in careful speech
  • Broad transcription or phonotypical transcription
    • Phonological transcription plus regular or predictable contextual phonetic phenomena
    • SAMPA
  • Narrow transcription
    • Phonetic transcription with allophones closely representing the phonetic realization
    • X-SAMPA
  • Acoustic-phonetic transcription
    • Representation of acoustic-phonetic events which can be observed in the waveform

GIBBON, D. - MOORE, R. - WINSKI, R. (Eds.) (1998) Spoken Language System and Corpus Design. Berlin: Mouton De Gruyter. (Handbook of Standards and Resources for Spoken Language Systems, I)

SAMPA

• SAM (Speech Assessment Methods) Phonetic Alphabet (1987-1989)

John Wells, University College London
http://www.phon.ucl.ac.uk/home/sampa/home.htm

• Only 7-bit ASCII characters
• Phonological transcription: only contrastive symbols are used
• Some symbols for allophones have been introduced for certain languages
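The 7-bit ASCII constraint can be made concrete with a small conversion sketch. The table below covers only a handful of IPA-to-SAMPA correspondences and is not a complete or language-specific inventory; the function names are invented for the example.

```python
# A few IPA-to-SAMPA correspondences; full tables are defined per language.
IPA_TO_SAMPA = {
    "ʃ": "S", "ʒ": "Z", "θ": "T", "ð": "D",
    "ŋ": "N", "ɲ": "J", "ʎ": "L",
    "ə": "@", "ɛ": "E", "ɔ": "O",
}


def to_sampa(ipa):
    """Convert an IPA string symbol by symbol; unknown symbols pass through."""
    return "".join(IPA_TO_SAMPA.get(ch, ch) for ch in ipa)


def is_7bit_ascii(text):
    """True if every character fits in 7-bit ASCII."""
    return all(ord(ch) < 128 for ch in text)
```

With this table, `to_sampa("ʃɛf")` gives `"SEf"`, which passes `is_7bit_ascii`, whereas the IPA original does not.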

Catalan SAMPA

http://liceu.uab.es/~joaquim/language_resources/SAMPA_Catalan.html

X-SAMPA

• Extended SAM (Speech Assessment Methods) Phonetic Alphabet

John Wells, University College London
http://www.phon.ucl.ac.uk/home/sampa/x-sampa.htm

• ASCII equivalents for all IPA symbols, including diacritics and tonal marks
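Because X-SAMPA also encodes diacritics, some of its symbols are longer than one character (for example `_h` for aspiration), so converting back to IPA needs longest-match tokenisation rather than a character-by-character lookup. The sketch below shows the idea with a deliberately small subset of the X-SAMPA table; the function name and the choice of symbols are illustrative only.

```python
# Small subset of X-SAMPA; the real table is much larger.
XSAMPA_TO_IPA = {
    '"': "ˈ", "%": "ˌ", ":": "ː",
    "_h": "ʰ",
    "~": "\u0303",  # combining tilde (nasalization)
    "4": "ɾ", "B": "β", "G": "ɣ", "?": "ʔ", "S": "ʃ", "@": "ə",
}
MAX_LEN = max(len(key) for key in XSAMPA_TO_IPA)


def to_ipa(xsampa):
    """Convert X-SAMPA to IPA, trying the longest symbol first at each position."""
    out, i = [], 0
    while i < len(xsampa):
        for size in range(MAX_LEN, 0, -1):
            chunk = xsampa[i:i + size]
            if chunk in XSAMPA_TO_IPA:
                out.append(XSAMPA_TO_IPA[chunk])
                i += size
                break
        else:
            # No table entry starts here: copy the character unchanged.
            out.append(xsampa[i])
            i += 1
    return "".join(out)
```

For example, `to_ipa('"laGo')` yields `ˈlaɣo` and `to_ipa("t_ha:")` yields `tʰaː`.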


Suprasegmental annotation

• Prosodic or suprasegmental phenomena
  • Stress / Accent
  • Melody / Intonation
  • Rate
  • Rhythm
  • Pauses
  • Voice quality
• Some of the suprasegmental elements are annotated in enriched orthographic representations

THOMPSON, P. (2005) "Spoken Language Corpora", in WYNNE, M. (Ed.) Developing Linguistic Corpora: a Guide to Good Practice. Oxford: Oxbow Books. pp. 59-70. http://ahds.ac.uk/guides/linguistic-corpora/chapter5.htm

SAMPROSA

• SAM (Speech Assessment Methods) Prosodic Alphabet

John Wells, University College London
http://www.phon.ucl.ac.uk/home/sampa/samprosa.htm

• SAMPROSA - Local tone

• Most of the problems are found in the transcription of intonation (melody + stress)
• Continuous variations of three physical parameters have to be transformed into a linguistically meaningful symbolic (discrete) representation

“The standard system for annotating prosody (stress, intonation, etc.) is ToBI (= Tones and Break Indices), which comes with its own speech-processing platform. Its phonological model originated with Pierrehumbert (1980). The system is partially automated, but needs to be substantially adapted for fresh languages and dialects.”

“ToBI is well supported by dedicated software and a committed research community. On the other hand, it has met with criticism, and two alternative annotation systems worth examining are INTSINT (see Hirst 1991) and TSM – tonetic stress marks (see Knowles et al. 1996).”

LEECH, G. (2005) “Adding linguistic annotation”, in WYNNE, M. (Ed.) Developing Linguistic Corpora: a Guide to Good Practice. Oxford: Oxbow Books. pp. 17-29. http://www.ahds.ac.uk/creating/guides/linguistic-corpora/chapter2.htm

ToBI (Tones and Break Indices)

• Phonological representation based on the autosegmental-metrical model

BECKMAN, M. E. - HIRSCHBERG, J. - SHATTUCK-HUFNAGEL, S. (2005) "The original ToBI system and the evolution of the ToBI framework", in JUN, S.-A. (Ed.), Prosodic Typology. The Phonology of Intonation and Phrasing. Oxford: Oxford University Press. pp. 9-54. http://www.ling.ohio-state.edu/~tobi/JunBook/BeckHirschShattuckToBI.pdf

http://www.ling.ohio-state.edu/~tobi/

• Orthographic tier
• Break index tier
• Tone tier
  • Phrasal tones
  • Pitch accents
  • Boundary tones
• Miscellaneous tier
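The tier structure can be sketched as a record per word that carries its orthographic form, the break index after it, and any tone labels aligned with it. The sketch assumes the conventions of the original English ToBI (break indices from 0 to 4; pitch accents marked with `*`, phrasal tones with `-`, boundary tones with `%`); the class, the classification rule and the example utterance are simplifications made for illustration, and the miscellaneous tier is left out.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str          # orthographic tier
    break_index: int   # break index tier: juncture after the word, 0 to 4
    tones: tuple = ()  # tone tier labels aligned with the word


def classify_tone(label):
    """Rough classification of a tone label by its diacritic."""
    if "*" in label:
        return "pitch accent"
    if label.endswith("%"):
        return "boundary tone"  # also covers combined labels such as L-L%
    if label.endswith("-"):
        return "phrasal tone"
    return "unknown"


def validate(words):
    """Return a list of problems found in the break index and tone tiers."""
    errors = []
    for w in words:
        if w.break_index not in range(5):
            errors.append(f"{w.text}: break index {w.break_index} out of range")
        for tone in w.tones:
            if classify_tone(tone) == "unknown":
                errors.append(f"{w.text}: unrecognised tone label {tone}")
    return errors


# Hypothetical annotation of a short English utterance.
utterance = [
    Word("Marianna", 1, ("H*",)),
    Word("made", 1),
    Word("the", 1),
    Word("marmalade", 4, ("H*", "L-L%")),
]
```

A check of this kind only verifies that labels are well formed; deciding which label fits a given contour still depends on the phonological analysis of the language.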

• ToBI (Tones and Break Indices)
  • Heavily dependent on a phonological model
  • Needs adaptation for particular languages
  • To some extent, it presupposes prior knowledge of the expected intonational phenomena

INTSINT (International Transcription System for Intonation)

Daniel Hirst, Laboratoire Parole et Langage, Université de Provence, Aix-en-Provence

CAMPIONE, E. - HIRST, D. - VÉRONIS, J. (2000) "Automatic stylisation and symbolic coding of F0: Implementations of the INTSINT model", in BOTINIS, A. (Ed.) Intonation: Analysis, Modelling and Technology. Dordrecht: Kluwer Academic Publishers. pp. 185-208. http://www.up.univ-mrs.fr/~veronis/pdf/2000Campione.pdf

INTSINT

• F0 detection
• Stylization with target points
• Coding with INTSINT labels
• Absolute tones
• Relative iterative tones
• Relative non-iterative tones

• Symbolic representation of the F0 contour in discrete categories
• Based on target points with values for time and F0 which are coded with INTSINT symbols
• Perceptual equivalence between the stylized and the actual melodic contour
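The step from target points to symbols can be illustrated with a deliberately simplified heuristic. It uses the INTSINT symbol inventory (T, M, B as absolute tones; H, S, L as relative non-iterative tones; U, D as relative iterative tones), but the decision rules and the semitone thresholds below are invented for the example and are not the procedure of the published implementations.

```python
import math


def code_intsint(targets, same_tol=0.5, step_tol=3.0):
    """Label (time_s, f0_hz) target points with INTSINT-style symbols.

    Simplified rules (assumptions): the first point is M; the highest and
    lowest points are T and B; other points are coded relative to the
    previous point, with the interval measured in semitones.
    """
    f0_values = [f0 for _, f0 in targets]
    top, bottom = max(f0_values), min(f0_values)
    labels, prev = [], None
    for _, f0 in targets:
        if prev is None:
            label = "M"
        elif f0 == top:
            label = "T"
        elif f0 == bottom:
            label = "B"
        else:
            semitones = 12 * math.log2(f0 / prev)
            if abs(semitones) <= same_tol:
                label = "S"
            elif abs(semitones) <= step_tol:
                label = "U" if semitones > 0 else "D"
            else:
                label = "H" if semitones > 0 else "L"
        labels.append(label)
        prev = f0
    return labels


# Invented target points (time in seconds, F0 in Hz).
targets = [(0.1, 120), (0.3, 180), (0.5, 150), (0.7, 140), (0.9, 140), (1.2, 90)]
```

The point of the sketch is the reduction itself: six continuous (time, F0) pairs become a six-symbol string that can be stored, searched and compared like any other annotation tier.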

• Praat implementation
  • C. Auran, Laboratoire Parole et Langage, Université de Provence: http://www.lpl.univ-aix.fr/~auran/english/ressources.html
  • G. Rolland, Institut de la Communication Parlée, Grenoble: http://www.icp.inpg.fr/~loeven/Praat/momel_english.html


Final remarks

• Orthographic representation
  • Standards for encoding: TEI (Text Encoding Initiative)
  • Different transcription/transliteration practices
• Segmental annotation
  • Choice of annotation levels
  • Standard for transcription symbols: IPA and computer-readable equivalents (SAMPA, X-SAMPA)
• Suprasegmental annotation
  • Different systems for different approaches to annotation: phonological (ToBI) or phonetic (INTSINT)
  • Not in conflict, but complementary
• Choices in annotation depend on the research objectives
• Be eclectic if needed, but ensure reusability
  • Automatic conversion between systems
  • Document everything
