Joaquim Llisterri, Grup de Fonètica, Departament de Filologia Espanyola
Sonderforschungsbereich “Mehrsprachigkeit”, Universität Hamburg
25 July 2007
Problems in the annotation of spoken language corpora
Joaquim Llisterri
Grup de Fonètica, Departament de Filologia Espanyola, Universitat Autònoma de Barcelona
http://liceu.uab.cat/~joaquim
http://liceu.uab.cat/~joaquim/language_resources/Hamburg_07/Hamburg_07.html
Problems in the annotation of spoken language corpora
Levels of annotation
Orthographic representation
Segmental annotation
Suprasegmental annotation
Final remarks
Levels of annotation
“Corpus annotation is the practice of adding interpretative linguistic information to a corpus”
LEECH, G. (2005) “Adding linguistic annotation”, in WYNNE, M. (Ed.) Developing Linguistic Corpora: a Guide to Good Practice. Oxford: Oxbow Books. pp. 17-29.
http://www.ahds.ac.uk/creating/guides/linguistic-corpora/chapter2.htm
Levels of annotation
• The annotation of a corpus can be conceived as a set of hierarchically organized layers
• Layers usually represent linguistic levels of analysis
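The layered view can be sketched as a small data structure, assuming time-aligned tiers as in a score layout. The class names `Interval` and `Tier`, the tier names and the sample labels are illustrative assumptions and do not come from any particular annotation tool.

```python
from dataclasses import dataclass, field


@dataclass
class Interval:
    start: float  # seconds
    end: float    # seconds
    label: str


@dataclass
class Tier:
    name: str
    intervals: list = field(default_factory=list)

    def add(self, start, end, label):
        self.intervals.append(Interval(start, end, label))

    def labels_between(self, start, end):
        """Return labels of intervals fully contained in [start, end]."""
        return [i.label for i in self.intervals
                if i.start >= start and i.end <= end]


# Hypothetical two-layer annotation of one short utterance.
orthographic = Tier("orthographic")
orthographic.add(0.00, 0.40, "good")
orthographic.add(0.40, 0.90, "morning")

pos = Tier("pos")
pos.add(0.00, 0.40, "ADJ")
pos.add(0.40, 0.90, "NOUN")

corpus_layers = {t.name: t for t in (orthographic, pos)}

# Query every layer over the time span of the second word.
for name, tier in corpus_layers.items():
    print(name, tier.labels_between(0.40, 0.90))
```

Keeping each level in its own tier means a new level of analysis can be added later without touching the existing ones.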
Levels of annotation
Musical score
Levels of annotation
EXMARaLDA Partitur Editor
Levels of annotation
• The levels of annotation are established according to the aims of the research to be carried out with the corpus
• Pragmatics and discourse analysis
• Grammar of spoken language
• Phonetics and phonology
• …
Levels of annotation
• Different kinds of labels are used for different annotation levels
• Phonetic symbols: phonetic labelling
• Morphological tags: POS (part-of-speech) tagging
• Syntactic labels: Parsing
Levels of annotation
• Phonetic labelling
Levels of annotation
• POS tagging
CLiC, Centre de Llenguatge i Computació
http://clic.fil.ub.es
Levels of annotation
• Parsing
CLiC, Centre de Llenguatge i Computació
http://clic.fil.ub.es
Orthographic representation
• The orthographic representation is a rendering of the speakers' utterances in the standard spelling of a given language
• Also known as transliteration
• The orthographic level is common to all speech and spoken corpora
Orthographic representation
“[…] the words that appear in an orthographic transcription of a speech event constitute only a partial representation of the original speech event. To supplement this record of the event, the analyst can capture other features, by making either a prosodic or phonetic transcription, and can also record contextual features. However, […] the record remains inevitably partial.”
THOMPSON, P. (2005) "Spoken language corpora", in WYNNE, M. (Ed.) Developing Linguistic Corpora: a Guide to Good Practice. Oxford: Oxbow Books. pp. 59-70.
http://www.ahds.ac.uk/creating/guides/linguistic-corpora/chapter5.htm
Orthographic representation
Problems in spontaneous speech
• Punctuation
• Adding punctuation implies a segmentation decided by the transcriber
• Lack of punctuation decreases legibility of the text
• Avoidance of more difficult punctuation marks like “;”
Orthographic representation
Problems in spontaneous speech
• Non-standard forms (situational, social or geographic variation)
• Vocal semi-lexical forms
• Disfluencies: self-repairs, word fragments
• Unintelligible fragments
Orthographic representation
Recommendations
Preliminary Recommendations on Spoken Texts. EAGLES Document EAG-TCWG-STP/P, May 1996.
http://www.ilc.cnr.it/EAGLES96/spokentx/spokentx.html
Orthographic representation
• Use conventional spelling forms as they appear in a standard dictionary. This also applies to contractions, reduced word forms, apostrophes, dialect forms, interjections and vocalised semi-lexical events
Orthographic representation
• If more than one orthographic form is possible or if non-standard spellings or spelling variations are necessary, maintain a lexicon of the spelling forms used in the transcription
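A lexicon of spelling forms could be collected along the following lines. This is only a sketch: the table `VARIANT_TO_STANDARD`, its English entries and the function name are hypothetical, and a real project would define its own list.

```python
from collections import defaultdict

# Hypothetical mapping from non-standard spellings seen in transcripts
# to the dictionary form chosen by the project.
VARIANT_TO_STANDARD = {
    "gonna": "going to",
    "wanna": "want to",
    "'cause": "because",
}


def build_spelling_lexicon(tokens):
    """Collect, for each standard form, the variant spellings actually used."""
    lexicon = defaultdict(set)
    for token in tokens:
        standard = VARIANT_TO_STANDARD.get(token.lower())
        if standard is not None:
            lexicon[standard].add(token.lower())
    return {form: sorted(variants) for form, variants in lexicon.items()}


tokens = "I'm gonna go 'cause I wanna see it".split()
print(build_spelling_lexicon(tokens))
```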
Orthographic representation
• Represent numbers, abbreviations, acronyms and spelled words in full orthographic form as pronounced by the speaker
Orthographic representation
Recommendations
SENIA, F. - van VELDEN, J.G. (1997) Specifications of orthographic transcription and lexicon conventions. LRE-4001 SpeechDat Technical Report SD1.3.2, Final version, 10 January 1997.
http://www.speechdat.org/speechdat/deliverables/public/SD132V24.PDF
Orthographic representation
• Normal lexical items will be represented by their spellings in the normal way
• It is possible to include a very restricted number of markings for regular variations in pronunciation, provided that they are documented and no more than two or three regular variations are indicated
Orthographic representation
• Abbreviations should be represented by their full orthographic forms, unless they are spoken in their abbreviated form
• Number sequences will be spelled out to reflect what was said
Orthographic representation
• If a speaker pronounces letters, acronyms or abbreviations as a word, then these should be spelled out as words
• No punctuation will be provided in the transcription other than those symbols used for special transcription purposes
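Conventions such as these lend themselves to automatic checking. The sketch below flags digits and punctuation in a transcription line; it assumes, purely for illustration, that special transcription symbols are written in square brackets (for example `[noise]`), which is not a marker set taken from the cited report.

```python
import re

# Assumption: special transcription symbols are written in square brackets,
# e.g. "[noise]"; the actual marker set is project-specific.
MARKER = re.compile(r"\[[a-z]+\]")
DIGITS = re.compile(r"\d")
PUNCTUATION = re.compile(r"[.,;:!?]")


def check_transcription(line):
    """Return a list of convention violations found in one transcription line."""
    problems = []
    text = MARKER.sub(" ", line)  # markers are allowed, so ignore them
    if DIGITS.search(text):
        problems.append("digits: spell numbers out as spoken")
    if PUNCTUATION.search(text):
        problems.append("punctuation: not allowed outside special markers")
    return problems


print(check_transcription("call me at 5, please"))
print(check_transcription("call me at five [noise] please"))
```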
Orthographic representation
• Enriched orthographic representation
• Incorporates information which cannot be represented with conventional spelling
• Used in pragmatics, discourse and conversation analysis, among other fields
Orthographic representation
• Phenomena included in an enriched orthographic representation need to be encoded
SPERBERG-McQUEEN, C.M. - BURNARD, L. (Eds.) (2007) "7 Transcriptions of Speech", in TEI P5: Guidelines for Electronic Text Encoding and Interchange. The TEI Consortium: Oxford, Providence, Charlottesville, Nancy.
http://www.tei-c.org/release/doc/tei-p5-doc/html/TS.html
Orthographic representation
• <u> (utterance) a stretch of speech usually preceded and followed by silence or by a change of speaker.
• <pause/> a pause either between or within utterances.
• <vocal> (Vocalized semi-lexical) any vocalized but not necessarily lexical phenomenon, for example voiced pauses, non-lexical backchannels, etc.
Orthographic representation
• <kinesic> (Non-vocalized quasi-lexical) any communicative phenomenon, not necessarily vocalized, for example a gesture, frown, etc.
• <event> any phenomenon or occurrence, not necessarily vocalized or communicative, for example incidental noises or other events affecting communication.
Orthographic representation
• <writing> (Writing) a passage of written text revealed to participants in the course of a spoken text.
• <shift/> marks the point at which some paralinguistic feature of a series of utterances by any one speaker changes.
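As an illustration of how these elements combine, the sketch below builds a single utterance with Python's standard `xml.etree.ElementTree`. The speaker identifier `#spk1`, the sample words and the attribute values are made up; the attribute names (`who`, `dur`, `feature`, `new`) follow my reading of the TEI guidelines cited above and should be checked against the release actually in use.

```python
import xml.etree.ElementTree as ET

# One utterance by a hypothetical speaker "spk1".
u = ET.Element("u", who="#spk1")
u.text = "well "

# A pause inside the utterance (duration value is an invented example).
pause = ET.SubElement(u, "pause", dur="PT0.5S")
pause.tail = " I think "

# A vocalized semi-lexical event, described in a <desc> child.
vocal = ET.SubElement(u, "vocal")
desc = ET.SubElement(vocal, "desc")
desc.text = "laughs"
vocal.tail = " "

# From this point on the speaker is louder (feature "loud", value "f").
shift = ET.SubElement(u, "shift", feature="loud", new="f")
shift.tail = "it works"

print(ET.tostring(u, encoding="unicode"))
```

Text between the empty elements is stored in each element's `tail`, which is how `ElementTree` represents mixed content.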
Orthographic representation
<shift/> in tempo
• a - allegro (fast)
• aa - very fast
• acc - accelerando (getting faster)
• l - lento (slow)
• ll - very slow
• rall - rallentando (getting slower)
Orthographic representation
<shift/> in loud (loudness)
• f - forte (loud)
• ff - very loud
• cresc - crescendo (getting louder)
• p - piano (soft)
• pp - very soft
• dimin - diminuendo (getting softer)
Orthographic representation
<shift/> in pitch (pitch range)
• high - high pitch-range
• low - low pitch-range
• wide - wide pitch-range
• narrow - narrow pitch-range
• asc - ascending
• desc - descending
• monot - monotonous
• scand - scandent, each succeeding syllable higher than the last, generally ending in a falling tone
Orthographic representation
<shift/> in tension
• sl - slurred
• lax - lax, a little slurred
• ten - tense
• pr - very precise
• st - staccato, every stressed syllable being doubly stressed
• leg - legato, every syllable receiving more or less equal stress
Orthographic representation
<shift/> in rhythm
• rh - beatable rhythm
• arrh - arrhythmic, particularly halting
• spr - spiky rising, with markedly higher unstressed syllables
• spf - spiky falling, with markedly lower unstressed syllables
• glr - glissando rising, like spiky rising but the unstressed syllables, usually several, also rise in pitch relative to each other
• glf - glissando falling, like spiky falling but with the unstressed syllables also falling in pitch relative to each other
Orthographic representation
<shift/> in voice (voice quality)
• whisp - whisper
• breath - breathy
• husk - husky
• creak - creaky
• fals - falsetto
• reson - resonant
Orthographic representation
<shift/> in voice (voice quality)
• giggle - unvoiced laugh or giggle
• laugh - voiced laugh
• trem - tremulous
• sob - sobbing
• yawn - yawning
• sigh - sighing
Orthographic representation
• “A full definition of the sense of the values provided for each feature should be provided in the encoding description section of the text header”
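Once the values are documented they can also be checked mechanically. The following sketch copies the value lists given above into a lookup table; the names `SHIFT_VALUES` and `is_valid_shift` are hypothetical, and only the values listed here are included, so a project would add any further values it documents in its own header.

```python
# Value sets copied from the lists above; keys are the <shift/> features.
SHIFT_VALUES = {
    "tempo": {"a", "aa", "acc", "l", "ll", "rall"},
    "loud": {"f", "ff", "cresc", "p", "pp", "dimin"},
    "pitch": {"high", "low", "wide", "narrow", "asc", "desc",
              "monot", "scand"},
    "tension": {"sl", "lax", "ten", "pr", "st", "leg"},
    "rhythm": {"rh", "arrh", "spr", "spf", "glr", "glf"},
    "voice": {"whisp", "breath", "husk", "creak", "fals", "reson",
              "giggle", "laugh", "trem", "sob", "yawn", "sigh"},
}


def is_valid_shift(feature, new):
    """True if `new` is a documented value for the given shift feature."""
    return new in SHIFT_VALUES.get(feature, set())


print(is_valid_shift("loud", "ff"))     # documented value
print(is_valid_shift("tempo", "ff"))    # "ff" belongs to "loud", not "tempo"
```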
Orthographic representation
• "Keep it simple"
• "Document everything adequately"
SENIA, F. - van VELDEN, J.G. (1997) Specifications of orthographic transcription and lexicon conventions. LRE-4001 SpeechDat Technical Report SD1.3.2, Final version, 10 January 1997.
http://www.speechdat.org/speechdat/deliverables/public/SD132V24.PDF
Segmental annotation
• Segmental annotation concerns the phonetic representation of the utterances pronounced by the speakers
• Levels of segmental annotation
GIBBON, D. - MOORE, R. - WINSKI, R. (Eds.) (1998) Spoken Language System and Corpus Design. Berlin: Mouton De Gruyter. (Handbook of Standards and Resources for Spoken Language Systems, I)
Segmental annotation
• Citation or canonical form
• Words are transcribed in their canonical form, as pronounced in isolation in careful speech
Segmental annotation
• Broad transcription or phonotypical transcription
• Phonological transcription plus regular or predictable contextual phonetic phenomena
SAMPA
Segmental annotation
• Narrow transcription
• Phonetic transcription with allophones closely representing the phonetic realization
X-SAMPA
Segmental annotation
• Acoustic-phonetic transcription
• Representation of acoustic-phonetic events which can be observed in the waveform
Segmental annotation
SAMPA
• SAM (Speech Assessment Methods) Phonetic Alphabet (1987-1989)
http://www.phon.ucl.ac.uk/home/sampa/home.htm
John Wells, University College London
Segmental annotation
SAMPA
• Only 7-bit ASCII characters
• Phonological transcription: only contrastive symbols are used
• Some symbols for allophones have been introduced for certain languages
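Because SAMPA is plain ASCII, conversion to IPA is a table lookup. The sketch below uses a small, deliberately partial table of symbols that are common to many language-specific SAMPA sets; the function name and the greedy longest-match strategy are my own choices, and a real converter would load the full table for the language in question.

```python
# A few SAMPA symbols shared across many language-specific tables (partial).
SAMPA_TO_IPA = {
    "S": "ʃ", "Z": "ʒ", "T": "θ", "D": "ð", "N": "ŋ",
    "J": "ɲ", "L": "ʎ", "@": "ə", "E": "ɛ", "O": "ɔ",
    "tS": "tʃ", "dZ": "dʒ",
}

# SAMPA symbols must all be 7-bit ASCII.
assert all(ord(c) < 128 for c in "".join(SAMPA_TO_IPA))


def sampa_to_ipa(transcription):
    """Convert a SAMPA string to IPA using greedy longest-match lookup."""
    max_len = max(len(symbol) for symbol in SAMPA_TO_IPA)
    out = []
    i = 0
    while i < len(transcription):
        for size in range(max_len, 0, -1):
            chunk = transcription[i:i + size]
            if chunk in SAMPA_TO_IPA:
                out.append(SAMPA_TO_IPA[chunk])
                i += len(chunk)
                break
        else:
            # Symbols outside this partial table are passed through unchanged.
            out.append(transcription[i])
            i += 1
    return "".join(out)


print(sampa_to_ipa("tSE"))
```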
Segmental annotation
Catalan SAMPA
http://liceu.uab.es/~joaquim/language_resources/SAMPA_Catalan.html
Segmental annotation
X-SAMPA
• Extended SAM (Speech Assessment Methods) Phonetic Alphabet
http://www.phon.ucl.ac.uk/home/sampa/x-sampa.htm
John Wells, University College London
Segmental annotation
X-SAMPA
• ASCII equivalents for all IPA symbols, including diacritics and tone marks
Suprasegmental annotation
• Prosodic or suprasegmental phenomena
• Stress / Accent
• Melody / Intonation
• Rate
• Rhythm
• Pauses
• Voice quality
Suprasegmental annotation
• Some of the suprasegmental elements are annotated in enriched orthographic representations
Suprasegmental annotation
THOMPSON, P. (2005) "Spoken Language Corpora", in WYNNE, M. (Ed.) Developing Linguistic Corpora: a Guide to Good Practice. Oxford: Oxbow Books. pp. 59-70.
http://ahds.ac.uk/guides/linguistic-corpora/chapter5.htm
Suprasegmental annotation
SAMPROSA
• SAM (Speech Assessment Methods) Prosodic Alphabet
http://www.phon.ucl.ac.uk/home/sampa/samprosa.htm
John Wells
University College London
Suprasegmental annotation
• SAMPROSA - Local tone
Suprasegmental annotation
• Most of the problems are found in the transcription of intonation (melody + stress)
• Continuous variations of three physical parameters have to be transformed into a linguistically meaningful symbolic (discrete) representation
Suprasegmental annotation
“The standard system for annotating prosody (stress, intonation, etc.) is ToBI (= Tones and Break Indices), which comes with its own speech-processing platform. Its phonological model originated with Pierrehumbert (1980). The system is partially automated, but needs to be substantially adapted for fresh languages and dialects.”
LEECH, G. (2005) “Adding linguistic annotation”, in WYNNE, M. (Ed.) Developing Linguistic Corpora: a Guide to Good Practice. Oxford: Oxbow Books. pp. 17-29.
http://www.ahds.ac.uk/creating/guides/linguistic-corpora/chapter2.htm
Suprasegmental annotation
“ToBI is well supported by dedicated software and a committed research community. On the other hand, it has met with criticism, and two alternative annotation systems worth examining are INTSINT (see Hirst 1991) and TSM — tonetic stress marks (see Knowles et al. 1996).”
LEECH, G. (2005) “Adding linguistic annotation”, in WYNNE, M. (Ed.) Developing Linguistic Corpora: a Guide to Good Practice. Oxford: Oxbow Books. pp. 17-29.
http://www.ahds.ac.uk/creating/guides/linguistic-corpora/chapter2.htm
Suprasegmental annotation
ToBI (Tones and Break Indices)
• Phonological representation based on the metrical autosegmental model
BECKMAN, M. E. - HIRSCHBERG, J. - SHATTUCK-HUFNAGEL, S. (2005) "The original ToBI system and the evolution of the ToBI framework", in JUN, S.-A. (Ed.) Prosodic Typology. The Phonology of Intonation and Phrasing. Oxford: Oxford University Press. pp. 9-54.
http://www.ling.ohio-state.edu/~tobi/JunBook/BeckHirschShattuckToBI.pdf
http://www.ling.ohio-state.edu/~tobi/
Suprasegmental annotation
ToBI (Tones and Break Indices)
• Orthographic tier
• Break index tier
• Tone tier
• Phrasal tones
• Pitch accents
• Boundary tones
• Miscellaneous tier
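The tier structure can be sketched as time-stamped labels checked against a closed inventory. The inventory below is a partial one based on the English ToBI conventions (break indices 0 to 4 and a subset of the tone labels); other languages need their own inventory, and the names `Label` and `invalid_labels` are hypothetical.

```python
from dataclasses import dataclass

# Partial inventory, assuming the English ToBI conventions.
BREAK_INDICES = {"0", "1", "2", "3", "4"}
PITCH_ACCENTS = {"H*", "L*", "L+H*", "L*+H", "H+!H*"}
PHRASE_ACCENTS = {"L-", "H-"}
BOUNDARY_TONES = {"L%", "H%", "L-L%", "L-H%", "H-L%", "H-H%"}
TONES = PITCH_ACCENTS | PHRASE_ACCENTS | BOUNDARY_TONES


@dataclass
class Label:
    time: float  # seconds from the start of the recording
    text: str


def invalid_labels(tone_tier, break_tier):
    """Return the label texts that are not in the assumed inventory."""
    bad = [label.text for label in tone_tier if label.text not in TONES]
    bad += [label.text for label in break_tier
            if label.text not in BREAK_INDICES]
    return bad


# Invented example: one faulty label on each tier.
tone_tier = [Label(0.30, "H*"), Label(0.90, "L-L%"), Label(1.00, "X*")]
break_tier = [Label(0.40, "1"), Label(0.90, "4"), Label(1.00, "7")]
print(invalid_labels(tone_tier, break_tier))
```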
Suprasegmental annotation
ToBI (Tones and Break Indices)
Suprasegmental annotation
• ToBI (Tones and Break Indices)
• Heavily dependent on a phonological model
• Needs adaptation for particular languages
• To some extent, it presupposes prior knowledge of the expected intonational phenomena
Suprasegmental annotation
INTSINT (International Transcription System for Intonation)
Daniel Hirst, Laboratoire Parole et Langage, Université de Provence, Aix-en-Provence
CAMPIONE, E. - HIRST, D. - VÉRONIS, J. (2000) "Automatic stylisation and symbolic coding of F0: Implementations of the INTSINT model", in BOTINIS, A. (Ed.) Intonation: Analysis, Modelling and Technology. Dordrecht: Kluwer Academic Publishers. pp. 185-208.
http://www.up.univ-mrs.fr/~veronis/pdf/2000Campione.pdf
Suprasegmental annotation
INTSINT
• F0 detection
Suprasegmental annotation
INTSINT
• Stylization with target points
Suprasegmental annotation
INTSINT
• Coding with INTSINT labels
Suprasegmental annotation
INTSINT
• Absolute tones
Suprasegmental annotation
INTSINT
• Relative iterative tones
Suprasegmental annotation
INTSINT
• Relative non-iterative tones
Suprasegmental annotation
INTSINT
• Symbolic representation of the F0 contour in discrete categories
• Based on target points with values for time and F0 which are coded with INTSINT symbols
• Perceptual equivalence between the stylized and the actual melodic contour
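The coding step can be illustrated with a deliberately simplified rule set. This is not the procedure of the implementations cited in this talk, which estimate a speaker's key and range; it is a toy sketch assuming the usual INTSINT symbols (T, M, B for absolute tones; H, S, L for relative non-iterative tones; U, D for relative iterative tones). The rules, the 5 Hz tolerance and the target values are all assumptions made for the example.

```python
def code_intsint(targets, same_tolerance_hz=5.0):
    """Assign an INTSINT-style symbol to each (time, f0) target point.

    Simplified rules (assumptions): the first target is M, the overall
    maximum is T, the overall minimum is B; otherwise a target is coded
    relative to the previous one. Local peaks and valleys get H and L,
    intermediate steps in a rise or fall get U and D.
    """
    f0 = [value for _, value in targets]
    top, bottom = max(f0), min(f0)
    symbols = []
    for i, value in enumerate(f0):
        if i == 0:
            symbols.append("M")
        elif value == top:
            symbols.append("T")
        elif value == bottom:
            symbols.append("B")
        else:
            previous = f0[i - 1]
            following = f0[i + 1] if i + 1 < len(f0) else None
            if abs(value - previous) <= same_tolerance_hz:
                symbols.append("S")
            elif value > previous:
                is_peak = following is None or following < value
                symbols.append("H" if is_peak else "U")
            else:
                is_valley = following is None or following > value
                symbols.append("L" if is_valley else "D")
    return symbols


# Invented target points: (time in seconds, F0 in Hz).
targets = [(0.10, 120.0), (0.35, 180.0), (0.60, 150.0),
           (0.80, 135.0), (1.00, 140.0), (1.20, 90.0)]
print(code_intsint(targets))
```

The point of the sketch is only the mapping from continuous F0 values to a small closed set of symbols; the thresholds would have to be tuned or replaced by the published procedure.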
Suprasegmental annotation
INTSINT
• Praat implementation
C. Auran, Laboratoire Parole et Langage, Université de Provence
http://www.lpl.univ-aix.fr/~auran/english/ressources.html
G. Rolland, Institut de la Communication Parlée, Grenoble
http://www.icp.inpg.fr/~loeven/Praat/momel_english.html
Final remarks
• Orthographic representation
• Standards for encoding: TEI (Text Encoding Initiative)
• Different transcription/transliteration practices
Final remarks
• Segmental annotation
• Choice of annotation levels
• Standard for transcription symbols: IPA and computer-readable equivalents (SAMPA, X-SAMPA)
Final remarks
• Suprasegmental annotation
• Different systems for different approaches to annotation: phonological (ToBI) or phonetic (INTSINT)
• Not in conflict, but complementary
Final remarks
• Choices in annotation depend on the research objectives
• Be eclectic if needed, but ensure reusability
• Automatic conversion between systems
• Document everything