© Elaine Uí Dhonnchadha, Alessio Frenda, Brian Vaughan
Issues in Designing a Corpus of Spoken Irish
Elaine Uí Dhonnchadha, Alessio Frenda, Brian Vaughan
Centre for Language and Communication Studies, Trinity College Dublin, Ireland.
LREC-2012: SALTMIL-AfLaT Workshop on “Language technology for normalisation of less-resourced languages”: Issues in Designing a Corpus of Spoken Irish
Overview
Linguistic Background
Corpus Design
Pilot Corpus
  Data Collection and Recording
  Transcription
  Corpus Processing
Future work
Irish
Indo-European - Celtic language
Verb-initial language (VSO)
Irish is the first official language of Ireland; English is the second official language.
Irish is spoken as a first language (L1) in only a small number of areas known as Gaeltachtaí.
Irish is learned at school as a second language by the majority of the population.
Irish Speaking Regions
Na Gaeltachtaí (numbered on the map)
1. Donegal
2. Mayo
3. Galway
4. Kerry
5. Cork
6. Waterford
7. Meath
Irish
1.6 million of the 3.9 million population report proficiency in the spoken language.
The number of native speakers is 64 thousand.
These sociolinguistic conditions mean that a comprehensive spoken corpus can play a vital role in promoting and preserving the spoken language.
Motivation
Linguistic research
  language change, language contact …
  phonology, syntax, semantics, pragmatics, discourse etc.
Lexicography (new Irish-English dictionary project due to start in 2013)
Teaching materials
Speech Recognition
Existing Resources
Spoken Language Collections
  Caint Chonamara (1964), 1.2 million words
  Iorras Aithneach Irish (pub. 2007)
  Doegen Records Web Project (1928-1931) (various dialects)
  Other dialectal studies (without audio files)
Motivation
Various difficulties …
  one dialect, or one year
  Different dialects but mainly songs, stories, monologues
  Very little dialogue
  Book and CD format (pdf)
  Some phonetic transcriptions but not other linguistic annotation
  Limited searchability
Motivation
Need a spoken corpus which is:
  Dialectally balanced
  Diachronically balanced
  Gender/age balanced
  L1 and L2 speakers
  Text aligned with audio/video file
  Linguistically annotated
Corpus Design
We examined the design of a number of corpora:
  London-Lund Corpus of Spoken English
  Lancaster/IBM Spoken English Corpus (SEC)
  Corpus of Spoken New Zealand English
  British National Corpus (BNC)
  COREC (Corpus oral de referencia del Español Contemporáneo)
  CLIPS (Corpora e Lessici dell’Italiano Parlato e Scritto)
  ICE (The International Corpus of English)
  CGN (Corpus Gesproken Nederlands)
Corpus Design
One common feature shared by the more recent corpora surveyed here is the extent of naturalistic conversational material they include.
Our design is heavily influenced by ICE and CGN
Corpus Design
Dialogues (420, 70%)
  Private (250, 42%)
    [r] Face-to-face conversations (120, 20%)
    [r] Phone calls (50, 8.5%)
    [r] Video calls (50, 8.5%)
    [r] Interviews with teachers of Irish (30, 5%)
  Public (170, 28%)
    [r] Classroom Lessons (40, 7%)
    Broadcast Discussions (40, 7%)
    Broadcast Interviews (40, 7%)
    Parliamentary Debates (20, 3%)
    [r] Legal cross-examinations (10, 1.5%)
    [r] Business Transactions (20, 3%)
Monologues (180, 30%)
  Unscripted (90, 15%)
    Spontaneous Commentaries (40, 7%)
    Unscripted Speeches (20, 3%)
    Demonstrations (20, 3%)
    [r] Legal Presentations (10, 1.5%)
  Scripted (90, 15%)
    Broadcast News (40, 7%)
    Broadcast Talks (40, 7%)
    Non-broadcast Talks (10, 1%)
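The allocation on this slide can be checked mechanically. The following is a minimal Python sketch, not part of the GaLa tooling: the counts are copied from the slide, while the dictionary layout and the names DESIGN, subtotal and share are assumptions made for illustration. It confirms that each subtotal equals the sum of its categories.

```python
# Counts copied from the design slide; the nesting and names are illustrative.
DESIGN = {
    "Dialogues": {
        "Private": {
            "Face-to-face conversations": 120,
            "Phone calls": 50,
            "Video calls": 50,
            "Interviews with teachers of Irish": 30,
        },
        "Public": {
            "Classroom Lessons": 40,
            "Broadcast Discussions": 40,
            "Broadcast Interviews": 40,
            "Parliamentary Debates": 20,
            "Legal cross-examinations": 10,
            "Business Transactions": 20,
        },
    },
    "Monologues": {
        "Unscripted": {
            "Spontaneous Commentaries": 40,
            "Unscripted Speeches": 20,
            "Demonstrations": 20,
            "Legal Presentations": 10,
        },
        "Scripted": {
            "Broadcast News": 40,
            "Broadcast Talks": 40,
            "Non-broadcast Talks": 10,
        },
    },
}


def subtotal(node):
    """Recursively sum the counts below a category or group of categories."""
    if isinstance(node, int):
        return node
    return sum(subtotal(child) for child in node.values())


def share(node, total):
    """Percentage of the overall design taken by a category, to one decimal."""
    return round(100 * subtotal(node) / total, 1)


TOTAL = subtotal(DESIGN)
```

Calling subtotal(DESIGN["Dialogues"]["Private"]) reproduces the slide's subtotal for private dialogues, and share() recomputes the percentages, which the slide appears to round.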
Corpus Design
Our design considers the following variables:
  Time frame
  Dialectal variation
  Sociolinguistic variation
  Gender and age
  Context and subject matter
Time Frame
We have decided upon three time periods:
  P1: 1930-1971
  P2: 1972-1995
  P3: 1996-present
Dialectal Variation
We aim to cover the main dialects of Irish in equal measure, i.e. not proportionally to the number of speakers of each dialect (which may have varied over the years):
  Ulster (north)
  Connaught (west)
  Munster (south)
Sociolinguistic Variation
We aim to include Irish speakers from all linguistic backgrounds:
  ‘Traditional’ native speakers (L1)
  Non-native speakers (L2)
  ‘Non-traditional’ native speakers (L1), i.e. those who were raised through Irish by L1 or L2 parents, typically in a non-Gaeltacht setting
Gender and Age Variation
We aim to represent both males and females proportionally
We aim to represent different generations i.e. young adults, middle aged and elderly speakers
Content Variation
We aim to record conversations in a variety of contexts (informal, work, leisure, education etc.) and cover a variety of topics.
Overall we aim for a spoken corpus of approximately 2 million words.
Pilot Corpus - GaLa
Pilot Corpus
Funded by Foras na Gaeilge
P3: 1996-present (contemporary)
Dialogues
Mainly public broadcast dialogues (mp3 podcasts of radio interviews and discussions).
We also carried out a small amount of video recording of private dialogue conversations.
Data Collection
Four pairs of volunteers agreed to be video recorded in informal conversation in the Speech Communications Laboratory, TCD
Video recorded using a Sony HDR-XR500v High Definition Handycam.
The audio was recorded in two ways: using the onboard camera microphone, and using two Sennheiser MKH-60 shotgun microphones and an Edirol 4-channel HD Audio recorder.
Podcast Extracts
70 x 8 min. audio extracts were transcribed giving 102,000 words of transcribed speech (8.5 hours approx.).
We also aligned and formatted some existing transcripts:
  Frenda (2011), material transcribed for PhD research, TCD (20K)
  Wigger (2000), Caint Chonamara (10K)
  Dillon, G., material transcribed for PhD research, TCD (5K)
Overall total: 140,000 words (approx.), 106 transcripts, 151 speakers
Transcription
Spoken and written language differ in a number of important respects.
The syntactic structure of spontaneous spoken utterances is usually simpler.
Spontaneous speech contains repetitions, false starts and hesitations, and is accompanied by non-verbal communication such as gesture or tone of voice.
Dialectal pronunciations deviate substantially from standard orthographical representations.
Transcription Guidelines
Phonetic or orthographic transcription
We examined a number of transcription conventions already in use, including:
  CHAT: The CHAT (Codes for the Human Analysis of Transcripts) system is a comprehensive standard for transcribing and encoding the characteristics of spoken language (MacWhinney, 2000).
  LINDSEI: Louvain International Database of Spoken English Interlanguage, transcription guidelines http://www.uclouvain.be/en-307849.html
  LDC: Linguistic Data Consortium http://www.ldc.upenn.edu/Creating/creating_annotated.shtml#Transcription
CHAT Guidelines
CHAT (Codes for the Human Analysis of Transcripts) (MacWhinney, 2000).
These guidelines were developed for the transcription of spoken interactions between children and their carers in order to study child language acquisition.
They cover inaudible segments, phonetic fragments, repetitions, overlaps, interruptions, trailing off, foreign words, proper nouns and numbers etc.
CHAT Guidelines
The guidelines are very comprehensive, but there are a few drawbacks to implementing them in full:
  It can slow down the transcription process considerably
  Some codes are quite subjective (short, medium and long pauses)
  Others are difficult to implement (retracings and reformulations)
LDC Transcription Guidelines
LDC guidelines advocate simplicity
Keep the rules to a minimum in order to make transcription as easy as possible for the transcriber, which increases transcription speed, accuracy and consistency
In addition, automatic procedures are used when possible
Transcription
On average it takes 30 minutes to orthographically transcribe 1 minute of audio material.
The transcription process must be as straightforward and intuitive as possible.
Minimum number of codes and keystrokes:
  [repeated material], xxx, < … > [?], [% comment], @laugh etc., @eng, filled pauses {yeah, ehm, uh..}
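A small set of codes like this is easy to strip before automatic processing. The sketch below is a hedged illustration in Python, not the project's own procedure. The slide lists the codes but not their exact syntax, so the readings used here are assumptions loosely modelled on CHAT: square brackets enclose repeated material, xxx marks unintelligible speech, < … > [?] marks an uncertain transcription, [% …] is a comment, @laugh is an event, and @eng is a suffix on an English word. The function name clean_turn is hypothetical.

```python
import re


def clean_turn(text: str) -> str:
    """Strip transcription codes, keeping the words a tagger should see.

    The meaning assigned to each code is an assumption (see the lead-in).
    Filled pauses such as 'ehm' are deliberately kept.
    """
    text = re.sub(r"\[%[^\]]*\]", " ", text)            # [% comment] -> dropped
    text = re.sub(r"<([^>]*)>\s*\[\?\]", r"\1", text)   # <words> [?] -> words
    text = re.sub(r"\[[^\]]*\]", " ", text)             # [repeated material] -> dropped
    text = re.sub(r"\bxxx\b", " ", text)                # unintelligible -> dropped
    text = re.sub(r"(\w+)@eng\b", r"\1", text)          # word@eng -> word
    text = re.sub(r"(?<!\w)@\w+", " ", text)            # @laugh and similar events
    return " ".join(text.split())                       # normalise whitespace
```

Applied to a turn containing a bracketed repetition such as "[tá tá] tá tuairiscí", the function keeps only the final "tá tuairiscí". The uncertain-transcription rule runs before the generic bracket rule so that the angle brackets are removed together with the [?] marker.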
Transcription
Dialectal variation
  maith ‘good’ /mah/ or /maɪ/
  an-mhaith ‘very good’ /ənə'wa/ or /ənə'waɪ/ or /ənə'va/
Initial mutations
ag déanamh ‘doing’ /ə d´ianəv/ or /ə ʤanu/ (not a’ déanamh)
standard orthography
Transcription Guidelines
Advantages to using standard orthography:
  It makes the job of transcription easier and quicker for transcribers
  It helps minimise spelling inconsistencies among transcribers, as only standard spelling is used, apart from predefined lists of permitted exceptions
Attempting to represent actual pronunciation in orthography is difficult and prone to inconsistency. It can be more accurately captured in a separate phonetic transcription layer (which may be partially generated from the orthography).
Transcription Guidelines
Standard orthography facilitates corpus querying and lexical searches
Standard orthography facilitates automatic text processing, such as part-of-speech tagging and parsing
Transcription codes for some linguistic features (e.g. co-articulation effects, elision etc.) would require specialist training for transcribers, in order to ensure accuracy and consistency, and are better undertaken as a separate task.
Transcription Software
We tested several pieces of freely-available transcription and annotation software (e.g. Praat, ELAN, Anvil, CLAN, Xtrans, Transcriber)
We chose Transcriber http://trans.sourceforge.net
  It has a straightforward user interface
  It facilitates alignment of the audio and text transcription in XML format
  Audio duration and word count information at a glance
  Transcripts can be conveniently exported as text
Transcription Software
It handles a variety of audio file types, including .wav, .mp3 (podcasts) and .ogg
The later version of the software, TranscriberAG, can handle video as well as audio
It facilitates the annotation of various features of spontaneous speech (overlap, interruptions, coughs, laughs, etc.) as well as linguistic categories (e.g. proper nouns, human/animate etc.) if desired
It can be used with foot pedals for increased speed if necessary
Transcribers
Audio segments of 8 min. in duration, taken from broadcast discussions and interviews in Raidio na Gaeltachta podcasts.
Panel of 22 transcribers recruited
Workpackages were sent via e-mail to members of the panel, who worked from home (filenames, speaker ids).
They returned a time-aligned transcription and timesheet for each workpackage completed.
Transcription Checking
Each transcript was checked for accuracy against the audio file by a member of the project team.
In the case of new video recordings, the transcripts were also anonymised, i.e. names and places which could identify the participants were replaced by fictitious names to ensure anonymity.
Corpus Processing
Corpus Metadata
XCES Corpus Encoding Standard
Part-of-Speech Tagging
SketchEngine Corpus Query Tool
Corpus Metadata
All relevant details related to speakers, transcripts and transcribers are recorded in a database.
Each speaker is given a speaker code which is used in the transcript in place of the speaker’s name, in order to make speakers less recognisable.
Speaker attributes such as dialect, language acquisition type (L1-G, L1-NG, L2), gender and age etc. are recorded where known.
Corpus Metadata
The corpus database is used to generate XML corpus headers, and to facilitate ongoing monitoring of word counts for the various corpus design categories.
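Monitoring word counts per design category amounts to a grouped sum over the transcript records. The following Python sketch uses the standard sqlite3 module to show the idea. The slides do not describe the actual database, so the table name, column names, transcript ids and word counts below are all invented placeholders, not project data.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical transcript table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE transcript ("
        " id TEXT PRIMARY KEY, category TEXT, period TEXT, words INTEGER)"
    )
    # Placeholder rows for illustration only; not real corpus figures.
    rows = [
        ("demo001", "Broadcast Interviews", "P3", 1450),
        ("demo002", "Broadcast Interviews", "P3", 1380),
        ("demo003", "Broadcast Discussions", "P3", 1502),
        ("demo004", "Face-to-face conversations", "P3", 2100),
    ]
    conn.executemany("INSERT INTO transcript VALUES (?, ?, ?, ?)", rows)
    return conn


def words_by_category(conn: sqlite3.Connection) -> dict:
    """Return the running word count for each corpus design category."""
    cursor = conn.execute(
        "SELECT category, SUM(words) FROM transcript "
        "GROUP BY category ORDER BY category"
    )
    return dict(cursor.fetchall())
```

Comparing the result of words_by_category() with the targets in the corpus design would show which categories still need material.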
XCES – XML Corpus Encoding Std.
For each transcript, the output of the Transcriber software was transformed into TEI compliant XCES (XML Corpus Encoding Standard) format using a Perl script and data from the corpus database.
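The project's conversion was done with a Perl script that is not shown in these slides. The Python sketch below only illustrates the general shape of such a step, under stated assumptions: that Transcriber output contains Turn elements with a speaker attribute and Sync markers between stretches of text, and that the corpus database can supply per-speaker metadata as a dictionary. The function name, the sample input and the metadata values are invented for illustration, and the output mirrors the doc/speaker_turn layout of the deck's example rather than a validated XCES document.

```python
import xml.etree.ElementTree as ET

# Invented sample in the assumed Transcriber layout (not real corpus data).
SAMPLE_TRS = """<Trans><Episode><Section>
<Turn speaker="spk1" startTime="0" endTime="4.2"><Sync time="0"/>
dia dhuit
<Sync time="2.1"/>
conas atá tú
</Turn>
</Section></Episode></Trans>"""


def trs_to_turns(trs_xml: str, speakers: dict, doc_attrs: dict) -> ET.Element:
    """Build a <doc> element with one <speaker_turn> per Transcriber turn.

    speakers maps a Transcriber speaker id to a dict of attributes taken
    from the corpus database (a hypothetical structure).
    """
    root = ET.fromstring(trs_xml)
    doc = ET.Element("doc", doc_attrs)
    for turn in root.iter("Turn"):
        # itertext() collects the text found between the <Sync/> markers.
        text = " ".join(" ".join(turn.itertext()).split())
        if not text:
            continue  # skip turns with no transcribed words
        meta = speakers.get(turn.get("speaker", ""), {})
        out = ET.SubElement(doc, "speaker_turn", meta)
        out.text = text
    return doc
```

For example, trs_to_turns(SAMPLE_TRS, {"spk1": {"code": "SPK_A", "dialect": "Mumhan"}}, {"id": "demo0001"}) yields a doc element whose single speaker_turn carries the speaker's metadata, and ET.tostring(doc, encoding="unicode") serialises it.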
Speech Turns
All of the transcripts to date involve conversations between at least two participants (dialogues).
It is quite common, particularly in radio interviews, for spoken interactions to take place between speakers with different dialects or between native and non-native speakers.
Speech Turns
In order to create sub-corpora on the basis of dialect, native/non-native status, speaker, age, gender etc., these features must be recorded at the level of the speaker turn rather than for the transcript as a whole.
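Once attributes sit on each speaker turn, building a sub-corpus reduces to filtering turns by attribute value. The Python sketch below illustrates this. The attribute names dialect, actype and code follow the XCES example in this deck, while the function name, the sample document and its placeholder text are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Invented sample with placeholder text; attribute names follow the deck.
SAMPLE_DOC = """<doc id="demo0001">
<speaker_turn code="SPK_A" dialect="Ulaidh" actype="L1 Gaeltacht">turn one</speaker_turn>
<speaker_turn code="SPK_B" dialect="Mumhan" actype="L1 Gaeltacht">turn two</speaker_turn>
<speaker_turn code="SPK_A" dialect="Ulaidh" actype="L1 Gaeltacht">turn three</speaker_turn>
</doc>"""


def select_turns(doc_xml: str, **criteria) -> list:
    """Return the text of speaker turns whose attributes match all criteria."""
    doc = ET.fromstring(doc_xml)
    return [
        (turn.text or "").strip()
        for turn in doc.iter("speaker_turn")
        if all(turn.get(key) == value for key, value in criteria.items())
    ]
```

For instance, select_turns(SAMPLE_DOC, dialect="Ulaidh") keeps only the turns of the Ulster speaker, even though the transcript as a whole mixes dialects. Metadata recorded only at transcript level could not support this selection.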
XML - XCES
<doc id = "irbs0012" title = "Barrscéalta 08 October 2010" period = "1996-pres" medium = "broadcast-radio" spokentype = "interview" text_source = "GALA-TCD" av_source = "RnaG podcast">
<speaker_turn id = "200" code = "RNG_ANC" dialect = "Ulaidh" gender = "Bain" actype = "L1 Gaeltacht" year = "2010">
caidé méid airgid a chosnódh sé na bádaí seo a thabhairt suas chun dáta agus cloígh lena rialacha úra atá tagtha isteach?
</speaker_turn>
<speaker_turn id = "559" code = "RNG_LCI" dialect = "Mumhan" gender = "Fir" actype = "L1 Gaeltacht?" year = "2010">
Bhuel ehm braitheann sé sin ar chaighdeán an bháid, abair, agus níl aon dabht faoi ach go bhfuil sé costasach, abair, [tá tá] tá tuairiscí …
Part-of-Speech Tagging
All transcripts are lemmatised and POS tagged
Using finite-state tools (xfst/foma) and Constraint Grammar (VISL cg3)
Future Work
Extensive data collection is required
Archives need to be examined for suitable material (diachronic corpus)
Quality control procedures for transcription standards need to be formalised
Testing and enhancement of POS tagging tools for spoken language
Websites
GaLa TCD Website https://www.scss.tcd.ie/SLP/gala/index.utf8.html
GaLa in the SketchEngine http://the.sketchengine.co.uk/
Go raibh maith agat!
Thank you!