Issues in Designing a Corpus of Spoken Irish

Elaine Uí Dhonnchadha, Alessio Frenda, Brian Vaughan

Centre for Language and Communication Studies, Trinity College Dublin, Ireland.

LREC-2012: SALTMIL-AfLaT Workshop on “Language technology for normalisation of less-resourced languages”: Issues in Designing a Corpus of Spoken Irish

Overview

Linguistic Background
Corpus Design
Pilot Corpus
  Data Collection and Recording
  Transcription
  Corpus Processing
Future work


Irish

Indo-European, Celtic language
Verb-initial language (VSO)
Irish is the first official language of Ireland; English is the second official language.
Irish is spoken as a first language (L1) in only a small number of areas known as Gaeltachtaí.
Irish is learned at school as a second language by the majority of the population.


Irish Speaking Regions

Na Gaeltachtaí (numbered 1 to 7 on the map)
1. Donegal
2. Mayo
3. Galway
4. Kerry
5. Cork
6. Waterford
7. Meath


Irish

1.6 million of the 3.9 million population report proficiency in the spoken language.

The number of native speakers is 64,000.

These sociolinguistic conditions mean that a comprehensive spoken corpus can play a vital role in promoting and preserving the spoken language.


Motivation

Linguistic research
  language change, language contact …
  phonology, syntax, semantics, pragmatics, discourse etc.
Lexicography (new Irish-English dictionary project due to start in 2013)
Teaching materials
Speech Recognition


Existing Resources

Spoken Language Collections
  Caint Chonamara (1964), 1.2 million words
  Iorras Aithneach Irish (pub. 2007)
  Doegen Records Web Project (1928-1931) (various dialects)
  Other dialectal studies (without audio files)


Motivation

Various difficulties …
  One dialect, or one year
  Different dialects but mainly songs, stories, monologues
  Very little dialogue
  Book and CD format (pdf)
  Some phonetic transcriptions but no other linguistic annotation
  Limited searchability


Motivation

Need a spoken corpus which is:
  Dialectally balanced
  Diachronically balanced
  Gender/age balanced
  L1 and L2 speakers
  Text aligned with audio/video file
  Linguistically annotated


Corpus Design

We examined the design of a number of corpora:
  London-Lund Corpus of Spoken English
  Lancaster/IBM Spoken English Corpus (SEC)
  Corpus of Spoken New Zealand English
  British National Corpus (BNC)
  COREC (Corpus oral de referencia del Español Contemporáneo)
  CLIPS (Corpora e Lessici dell’Italiano Parlato e Scritto)
  ICE (The International Corpus of English)
  CGN (Corpus Gesproken Nederlands)


Corpus Design

One common feature shared by the more recent corpora surveyed here is the extent of naturalistic conversational material they include.

Our design is heavily influenced by ICE and CGN


Corpus Design

Dialogues (420, 70%)
  Private (250, 42%)
    [r] Face-to-face conversations (120, 20%)
    [r] Phone calls (50, 8.5%)
    [r] Video calls (50, 8.5%)
    [r] Interviews with teachers of Irish (30, 5%)
  Public (170, 28%)
    [r] Classroom Lessons (40, 7%)
    Broadcast Discussions (40, 7%)
    Broadcast Interviews (40, 7%)
    Parliamentary Debates (20, 3%)
    [r] Legal cross-examinations (10, 1.5%)
    [r] Business Transactions (20, 3%)
Monologues (180, 30%)
  Unscripted (90, 15%)
    Spontaneous Commentaries (40, 7%)
    Unscripted Speeches (20, 3%)
    Demonstrations (20, 3%)
    [r] Legal Presentations (10, 1.5%)
  Scripted (90, 15%)
    Broadcast News (40, 7%)
    Broadcast Talks (40, 7%)
    Non-broadcast Talks (10, 1%)
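The arithmetic behind the design categories can be checked with a short script. This is a minimal sketch: only the counts come from the slide, while the nested-dictionary layout and the names `DESIGN`, `group_totals` and `share` are assumptions made for illustration. It sums the leaf categories and reports each group's share of the overall total (the slide's own percentages are rounded).

```python
# Counts copied from the design slide; the nested-dict layout is an
# illustrative assumption, not the project's own data structure.
DESIGN = {
    "Dialogues": {
        "Private": {
            "Face-to-face conversations": 120,
            "Phone calls": 50,
            "Video calls": 50,
            "Interviews with teachers of Irish": 30,
        },
        "Public": {
            "Classroom Lessons": 40,
            "Broadcast Discussions": 40,
            "Broadcast Interviews": 40,
            "Parliamentary Debates": 20,
            "Legal cross-examinations": 10,
            "Business Transactions": 20,
        },
    },
    "Monologues": {
        "Unscripted": {
            "Spontaneous Commentaries": 40,
            "Unscripted Speeches": 20,
            "Demonstrations": 20,
            "Legal Presentations": 10,
        },
        "Scripted": {
            "Broadcast News": 40,
            "Broadcast Talks": 40,
            "Non-broadcast Talks": 10,
        },
    },
}


def group_totals(design):
    """Return {(top_level, group): count}, summed over the leaf categories."""
    return {
        (top, group): sum(leaves.values())
        for top, groups in design.items()
        for group, leaves in groups.items()
    }


def share(count, total):
    """Percentage of the total, rounded to one decimal place."""
    return round(100 * count / total, 1)


totals = group_totals(DESIGN)
grand_total = sum(totals.values())
for (top, group), count in totals.items():
    print(f"{top} / {group}: {count} ({share(count, grand_total)}%)")
```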


Corpus Design

Our design considers the following variables:
  Time frame
  Dialectal variation
  Sociolinguistic variation
  Gender and age
  Context and subject matter


Time Frame

We have decided upon three time periods:
  P1: 1930-1971
  P2: 1972-1995
  P3: 1996-present


Dialectal Variation

We aim to cover the main dialects of Irish in equal measure, i.e. not proportionally to the number of speakers of each dialect (which may have varied over the years):
  Ulster (north)
  Connaught (west)
  Munster (south)


Sociolinguistic Variation

We aim to include Irish speakers from all linguistic backgrounds:
  ‘Traditional’ native speakers (L1)
  Non-native speakers (L2)
  ‘Non-traditional’ native speakers (L1), i.e. those who were raised through Irish by L1 or L2 parents, typically in a non-Gaeltacht setting


Gender and Age Variation

We aim to represent both males and females proportionally.

We aim to represent different generations, i.e. young adults, middle-aged and elderly speakers.


Content Variation

We aim to record conversations in a variety of contexts (informal, work, leisure, education etc.) and cover a variety of topics.

Overall we aim for a spoken corpus of approximately 2 million words.

Pilot Corpus - GaLa


Pilot Corpus

Funded by Foras na Gaeilge
P3: 1996-present (contemporary)
Dialogues
Mainly public broadcast dialogues (mp3 podcasts of radio interviews and discussions).
We also carried out a small amount of video recording of private conversations.


Data Collection

Four pairs of volunteers agreed to be video recorded in informal conversation in the Speech Communications Laboratory, TCD.

Video was recorded using a Sony HDR-XR500v High Definition Handycam.

The audio was recorded in two ways: using the onboard camera microphone, and using two Sennheiser MKH-60 shotgun microphones and an Edirol 4-channel HD Audio recorder.


Podcast Extracts

70 x 8 min. audio extracts were transcribed, giving 102,000 words of transcribed speech (8.5 hours approx.).

We also aligned and formatted some existing transcripts:
  Frenda (2011), material transcribed for PhD research, TCD (20K)
  Wigger (2000), Caint Chonamara (10K)
  Dillon, G., material transcribed for PhD research, TCD (5K)

Overall total: 140,000 words (approx.), 106 transcripts, 151 speakers


Transcription

Spoken and written language differ in a number of important respects.

The syntactic structure of spontaneous spoken utterances is usually simpler.

Spontaneous speech contains repetitions, false starts and hesitations, and relies on non-verbal communication such as gesture or tone of voice.

Dialectal pronunciations deviate substantially from standard orthographical representations.


Transcription Guidelines

Phonetic or orthographic transcription
We examined a number of transcription conventions already in use, including:
  CHAT: The CHAT (Codes for the Human Analysis of Transcripts) System is a comprehensive standard for transcribing and encoding the characteristics of spoken language (MacWhinney, 2000).
  LINDSEI: Louvain International Database of Spoken English Interlanguage, transcription guidelines: http://www.uclouvain.be/en-307849.html
  LDC: Linguistic Data Consortium: http://www.ldc.upenn.edu/Creating/creating_annotated.shtml#Transcription


CHAT Guidelines

CHAT (Codes for the Human Analysis of Transcripts) (MacWhinney, 2000).

These guidelines were developed for the transcription of spoken interactions between children and their carers in order to study child language acquisition.

They cover inaudible segments, phonetic fragments, repetitions, overlaps, interruptions, trailing off, foreign words, proper nouns, numbers etc.


CHAT Guidelines

The guidelines are very comprehensive, but there are a few drawbacks to implementing them in full:
  Doing so can slow down the transcription process considerably.
  Some codes are quite subjective (short, medium and long pauses).
  Others are difficult to implement (retracings and reformulations).


LDC Transcription Guidelines

LDC guidelines advocate simplicity.
Keep the rules to a minimum in order to make transcription as easy as possible for the transcriber, which increases transcription speed, accuracy and consistency.
In addition, automatic procedures are used when possible.


Transcription

On average it takes 30 minutes to orthographically transcribe 1 minute of audio material.

The transcription process must be as straightforward and intuitive as possible.

Minimum number of codes and keystrokes:
  [repeated material], xxx, < … > [?], [% comment], @laugh etc., @eng, filled pauses {yeah, ehm, uh..}
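To illustrate how a small code set keeps downstream processing simple, the sketch below strips the codes listed above from a turn before counting words. It is a sketch under stated assumptions, not the project's tooling: the slide does not define the codes, so treating `xxx` as unintelligible speech, bracketed material as non-lexical, and `@...` as an annotation suffix or standalone event marker are assumptions, as are the names `plain_words`, `BRACKETED`, `AT_CODE` and `ANGLE`. Filled pauses are kept as ordinary tokens here.

```python
import re

# Assumed interpretations of the codes (not defined on the slide):
BRACKETED = re.compile(r"\[[^\]]*\]")  # [repeated material], [?], [% comment]
AT_CODE = re.compile(r"@\w+")          # @laugh, @eng (standalone or suffixed)
ANGLE = re.compile(r"[<>]")            # delimiters of < ... > spans
UNINTELLIGIBLE = "xxx"                 # assumed marker for unintelligible speech


def plain_words(turn: str) -> list[str]:
    """Return the lexical tokens of a turn with transcription codes removed."""
    text = BRACKETED.sub(" ", turn)
    text = AT_CODE.sub(" ", text)
    text = ANGLE.sub(" ", text)
    # Trim surrounding punctuation, then drop empty strings and 'xxx'.
    words = [w.strip(".,?!…") for w in text.split()]
    return [w for w in words if w and w != UNINTELLIGIBLE]


print(plain_words("Bhuel ehm [tá tá] tá tuairiscí …"))
```

Whether repeated material and filled pauses should count towards corpus word totals is a design decision; this sketch excludes the former and includes the latter purely for illustration.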


Transcription

Dialectal variation
  maith ‘good’: /mah/ or /maɪ/
  an-mhaith ‘very good’: /ənə'wa/ or /ənə'waɪ/ or /ənə'va/
Initial mutations
  ag déanamh ‘doing’: /ə d´ianəv/ or /ə ʤanu/ (not a’ déanamh)
Standard orthography


Advantages of using standard orthography:
  It makes the job of transcription easier and quicker for transcribers.
  It helps minimise spelling inconsistencies among transcribers, as only standard spelling is used, apart from predefined lists of permitted exceptions.

Attempting to represent actual pronunciation in orthography is difficult and prone to inconsistency. It can be more accurately captured in a separate phonetic transcription layer (which may be partially generated from the orthography).


Standard orthography facilitates corpus querying and lexical searches

Standard orthography facilitates automatic text processing, such as part-of-speech tagging and parsing

Transcription codes for some linguistic features (e.g. co-articulation effects, elision etc.) would require specialist training for transcribers, in order to ensure accuracy and consistency, and are better undertaken as a separate task.


Transcription Software

We tested several pieces of freely available transcription and annotation software (e.g. Praat, ELAN, Anvil, CLAN, Xtrans, Transcriber).

We chose Transcriber (http://trans.sourceforge.net):
  It has a straightforward user interface.
  It facilitates alignment of the audio and text transcription in XML format.
  Audio duration and word count information are available at a glance.
  Transcripts can be conveniently exported as text.


Transcription Software

It handles a variety of audio file types, including .wav, .mp3 (podcasts) and .ogg.

The later version of the software, TranscriberAG, can handle video as well as audio.

It facilitates the annotation of various features of spontaneous speech (overlap, interruptions, coughs, laughs, etc.) as well as linguistic categories (e.g. proper nouns, human/animate etc.) if desired.

It can be used with foot pedals for increased speed if necessary.


Transcribers

Audio segments of 8 min. in duration: broadcast discussions and interviews from Raidió na Gaeltachta podcasts.

A panel of 22 transcribers was recruited.

Workpackages (filenames, speaker ids) were sent via e-mail to members of the panel, who worked from home.

They returned a time-aligned transcription and timesheet for each workpackage completed.


Transcription Checking

Each transcript was checked for accuracy against the audio file by a member of the project team.

In the case of new video recordings, the transcripts were also anonymised, i.e. names and places which could identify the participants were replaced by fictitious names to ensure anonymity.


Corpus Processing

Corpus Metadata
XCES Corpus Encoding Standard
Part-of-Speech Tagging
SketchEngine Corpus Query Tool


Corpus Metadata

All relevant details related to speakers, transcripts and transcribers are recorded in a database.

Each speaker is given a speaker code which is used in the transcript in place of the speaker’s name, in order to make speakers less recognisable.

Speaker attributes such as dialect, language acquisition type (L1-G, L1-NG, L2), gender and age are recorded where known.


Corpus Metadata

The corpus database is used to generate XML corpus headers, and to facilitate ongoing monitoring of word counts for the various corpus design categories.
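Monitoring word counts per design category amounts to a grouped sum over the transcript records. The sketch below shows the idea with an in-memory SQLite database. Everything about the schema is an assumption (the slides do not describe the database): the table `transcript`, its columns, and the function `words_per_category` are hypothetical, and the rows are made-up placeholders, not project data.

```python
import sqlite3

# Hypothetical schema: one row per transcript, with its design category
# and word count. The rows below are invented placeholders.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transcript (id TEXT PRIMARY KEY, category TEXT, words INTEGER)"
)
conn.executemany(
    "INSERT INTO transcript VALUES (?, ?, ?)",
    [
        ("t001", "Broadcast Interviews", 1450),
        ("t002", "Broadcast Discussions", 1520),
        ("t003", "Broadcast Interviews", 1380),
    ],
)


def words_per_category(connection):
    """Return {category: total word count} across all transcripts."""
    rows = connection.execute(
        "SELECT category, SUM(words) FROM transcript "
        "GROUP BY category ORDER BY category"
    )
    return dict(rows.fetchall())


counts = words_per_category(conn)
for category, words in counts.items():
    print(f"{category}: {words} words")
```

Comparing these running totals against the targets in the corpus design would show which categories are still under-represented.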


XCES – XML Corpus Encoding Std.

For each transcript, the output of the Transcriber software was transformed into TEI-compliant XCES (XML Corpus Encoding Standard) format using a Perl script and data from the corpus database.
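The shape of such a transformation can be sketched as follows. This is not the project's Perl script: it is a Python illustration that assumes Transcriber's `.trs` layout of `Turn` elements carrying a `speaker` attribute, and it emits `doc` and `speaker_turn` elements named after those in the XML excerpt later in these slides. The function name `trs_to_doc`, the `speakers` lookup (standing in for the corpus database) and the sample input are assumptions.

```python
import xml.etree.ElementTree as ET


def trs_to_doc(trs_xml: str, doc_attrs: dict, speakers: dict) -> ET.Element:
    """Build a <doc> element with one <speaker_turn> per Transcriber <Turn>.

    `speakers` maps a Transcriber speaker id to the attributes held for that
    speaker (code, dialect, gender, ...); here it stands in for the database.
    """
    root = ET.fromstring(trs_xml)
    doc = ET.Element("doc", doc_attrs)
    for turn in root.iter("Turn"):
        attrs = dict(speakers.get(turn.get("speaker", ""), {}))
        out = ET.SubElement(doc, "speaker_turn", attrs)
        # Join the text around <Sync/> markers and normalise whitespace.
        out.text = " ".join("".join(turn.itertext()).split())
    return doc


# Minimal hand-written input in the assumed .trs layout.
SAMPLE_TRS = """<Trans>
  <Episode>
    <Section type="report" startTime="0" endTime="4.0">
      <Turn speaker="spk1" startTime="0" endTime="4.0">
        <Sync time="0"/>
        Bhuel ehm braitheann sé sin
        <Sync time="2.1"/>
        ar chaighdeán an bháid
      </Turn>
    </Section>
  </Episode>
</Trans>"""

SPEAKERS = {"spk1": {"code": "RNG_LCI", "dialect": "Mumhan", "gender": "Fir"}}

doc = trs_to_doc(SAMPLE_TRS, {"id": "irbs0012"}, SPEAKERS)
print(ET.tostring(doc, encoding="unicode"))
```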


Speech Turns

All of the transcripts to date involve conversations between at least two participants (dialogues).

It is quite common, particularly in radio interviews, for spoken interactions to take place between speakers with different dialects or between native and non-native speakers.


Speech Turns

In order to create sub-corpora on the basis of dialect, native/non-native status, speaker, age, gender etc., these features must be recorded at the level of the speaker turn rather than for the transcript as a whole.
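Recording attributes on each turn makes sub-corpus selection a simple filter. The sketch below is illustrative only: the sample is abridged from the project's XML excerpt and closed off so that it parses, and the helper `select_turns` is a hypothetical name, not part of any project tool.

```python
import xml.etree.ElementTree as ET

# Abridged from the project's XML excerpt; closing tags added so it parses.
SAMPLE = """<doc id="irbs0012" period="1996-pres">
  <speaker_turn id="200" code="RNG_ANC" dialect="Ulaidh" gender="Bain"
                actype="L1 Gaeltacht" year="2010">caidé méid airgid a chosnódh sé</speaker_turn>
  <speaker_turn id="559" code="RNG_LCI" dialect="Mumhan" gender="Fir"
                actype="L1 Gaeltacht?" year="2010">Bhuel ehm braitheann sé sin</speaker_turn>
</doc>"""


def select_turns(doc, **criteria):
    """Return the speaker turns whose attributes match every criterion."""
    return [
        turn
        for turn in doc.iter("speaker_turn")
        if all(turn.get(key) == value for key, value in criteria.items())
    ]


doc = ET.fromstring(SAMPLE)
ulster = select_turns(doc, dialect="Ulaidh")
for turn in ulster:
    print(turn.get("code"), turn.text)
```

Because the filter works on turns, a single radio interview between speakers of different dialects can contribute to more than one sub-corpus.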


XML - XCES

<doc id = "irbs0012" title = "Barrscéalta 08 October 2010" period = "1996-pres" medium = "broadcast-radio" spokentype = "interview" text_source = "GALA-TCD" av_source = "RnaG podcast">
<speaker_turn id = "200" code = "RNG_ANC" dialect = "Ulaidh" gender = "Bain" actype = "L1 Gaeltacht" year = "2010">
caidé méid airgid a chosnódh sé na bádaí seo a thabhairt suas chun dáta agus cloígh lena rialacha úra atá tagtha isteach?
</speaker_turn>
<speaker_turn id = "559" code = "RNG_LCI" dialect = "Mumhan" gender = "Fir" actype = "L1 Gaeltacht?" year = "2010">
Bhuel ehm braitheann sé sin ar chaighdeán an bháid, abair, agus níl aon dabht faoi ach go bhfuil sé costasach, abair, [tá tá] tá tuairiscí …


Part-of-Speech Tagging

All transcripts are lemmatised and POS tagged using finite-state tools (xfst/foma) and Constraint Grammar (VISL cg3).


Future Work

Extensive data collection is required.
Archives need to be examined for suitable material (diachronic corpus).
Quality control procedures for transcription standards need to be formalised.
Testing and enhancement of POS tagging tools for spoken language.


Websites

GaLa TCD Website: https://www.scss.tcd.ie/SLP/gala/index.utf8.html

GaLa in the SketchEngine: http://the.sketchengine.co.uk/

Go raibh maith agat!

Thank you!
