Oral data collection for the Sereer language
Laurent Besacier, Floriane Brennus, Sylvie Voisin
L. Besacier (LIG) Sereer (TALAF 2018) 26th of September 2018 1 / 29
1 Introduction
2 Speech data collection with Lig-Aikuma
3 Starting a documentation project for Sereer
4 For people interested in Lig-Aikuma
Introduction
Linguistic Fieldwork, Language Documentation and Description
In linguistic fieldwork, the notions of documentation and description have often been used interchangeably
Documentation gathers primary data (audio and visual recordings, their transcription and translation)
Description produces secondary data of an increasingly interpretative nature (description of the underlying structures of the language, such as lexicons and grammars)
Language documentation is not only for field linguists but also for the language community itself (future generations may want to learn about their ancestors’ language)
Introduction
Linguistic fieldwork needs means and tools
Bird (2009): computational linguistics techniques need renewal to accelerate the documentation and description of the world’s endangered linguistic heritage
Liberman (2011): towards very large scale phonetics
Adda (2012): speech technology as an instrument for phonetics and linguistics
Thieberger (2013, 2016): need for tools appropriate to the discipline; need for renewal in data access, analysis and publication; reflection on methods for language documentation
Holton et al. (2017): tools and standards for linguistic data processing and archiving are still lacking
Introduction
Linguistic fieldwork needs means and tools
Conferences and workshops
VLSP 2011 - Very Large Scale Phonetics workshop [1]
Digital Tools Summit in the Humanities [2]
Spoken Language Technologies for Under-resourced Languages (SLTU) [3]
Language Documentation Tools Summit [4]
Computational Methods for Endangered Language Documentation and Description (CMLD) [5]
[1] http://www.phon.ox.ac.uk/jcoleman/MiningVLSP.pdf
[2] http://www.iath.virginia.edu/dtsummit
[3] http://www.mica.edu.vn/sltu/
[4] https://sites.google.com/site/ldtoolssummit/home
[5] http://lattice.cnrs.fr/cmld/
Introduction
Linguistic fieldwork needs means and tools
Workgroups
Special Interest Group: Under-resourced Languages (SIGUL) [6]
Summer schools
CIDLeS 2014: Coding for Language Communities [7]
BigDataSpeech 2018 [8], summer school on using computational tools in linguistic and phonetic research
SENELANGUES 2015 [9], spring school on West African language description
[6] www.elra.info/en/sig/sigul
[7] www.cidles.eu/summer-school-coding-for-language-communities-2014
[8] http://bigdataspeech.alwaysdata.net
[9] senelangues2015.ucad.sn
Introduction
Some Language Documentation Challenges
Endangered-language corpora are often small, scattered and mostly not freely available, which prevents interesting cross-lingual or cross-dialect studies and limits their wider use by other linguists or computer scientists
Classical computational linguistics (CL) and speech processing mostly rely on supervised algorithms and will not scale to the targeted endangered languages, which are poorly described and unannotated
Most endangered languages are unwritten, so the main raw material is the speech signal, often recorded in poor conditions (remote rural areas, elderly speakers, etc.)
Introduction
Our Vision
field linguist
cyber field linguist
Introduction
Computational Language Documentation and Description
Recast language documentation and description as highly interdisciplinary research
Where field linguistics leverages computational models and machine learning
Focus on endangered and/or unwritten languages (speech!)
Relies on
Large and (as much as possible) naturalistic speech corpora ...
... automatically processed ...
... to provide (radically) new machine assistance to the field linguist / dialectologist
Speech data collection with Lig-Aikuma
Aikuma: The origin
Initial Android app dedicated to speakers of endangered languages
Developed by Steven Bird and Florian Hanke (Hanke and Bird, 2013)
Goal: collecting speech at a large scale with self-recording features:
Recording
Respeaking
– Concept introduced by Woodbury (2003)
– First tested by Bird (2010)
Translation
Speech data collection with Lig-Aikuma
Lig-Aikuma
From Aikuma to Lig-Aikuma
New branch specified with BULB field linguists
From self-recording to more controlled recording
Lig-Aikuma V3
https://lig-aikuma.imag.fr and Play Store
Contact: [email protected] and [email protected]
Open source on LIG Forge
Online video tutorial and manuals
Speech data collection with Lig-Aikuma
Lig-Aikuma
Features                                                        Aikuma  Lig-Aikuma
Recording and documentation                                     ✓       ✓
Respeaking and oral translation                                 ✓       ✓
Extras: sync. and sharing, geolocalisation, textless interface  ✓       ✓
Elicitation (text-image-video) mode                             ✗       ✓
Check mode                                                      ✗       ✓
User profiles, consent form, metadata                           ✗       ✓
Automatic backup of interrupted sessions                        ✗       ✓
Multilingual interface and user feedback                        ✗       ✓
Documentation (samples, tutorial, ...)                          ✗       ✓
Export to Elan                                                  ✗       ✓
Speech data collection with Lig-Aikuma
Lig-Aikuma
Recording speech in a very simple way
Respeaking an existing recording or an external audio file
Translation is the same as respeaking, except that the language changes
Elicitation records speech based on an external resource (text, images)
Check mode (in progress)
Speech data collection with Lig-Aikuma
Metadata
Spoken languages
– language of the recording
– mother tongue
– extra languages
– quick input using language codes (en, fra)
Extra information
– geographical region of origin of the speaker
Personal information
– name
– age
– gender
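The metadata fields listed above can be pictured as one small record per speaker. The following is only a sketch: the class name, field names and JSON serialization are assumptions made for illustration and do not describe how Lig-Aikuma stores metadata internally. The example values are hypothetical.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class SpeakerMetadata:
    """Hypothetical container mirroring the metadata fields on the slide."""
    recording_language: str                  # language code of the recording
    mother_tongue: str                       # language code of the speaker's mother tongue
    extra_languages: List[str] = field(default_factory=list)
    region_of_origin: str = ""               # geographical region of origin of the speaker
    name: str = ""
    age: Optional[int] = None
    gender: str = ""

    def to_json(self) -> str:
        # Keep non-ASCII characters readable in the output.
        return json.dumps(asdict(self), ensure_ascii=False, indent=2)


# Hypothetical speaker, for illustration only.
example = SpeakerMetadata(
    recording_language="srr",
    mother_tongue="srr",
    extra_languages=["fra"],
    name="Speaker 1",
    age=30,
)
print(example.to_json())
```

Keeping the language fields as short codes matches the "quick input using language codes" item; everything else is free text.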
Speech data collection with Lig-Aikuma
Metadata
Figure: (1) Figure: (2) Figure: (3)
1 Click "Select from list"
2 A list appears, with a filter search box at the top
3 Type in the language name or code to filter the list
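The filtering behaviour in step 3 can be sketched as a case-insensitive substring match on either the language code or the language name. This is an illustration only: the language table and the function name are assumptions, and the matching rule used by the app itself is not stated on the slide.

```python
from typing import List, Tuple

# Small hypothetical language table: (code, name) pairs.
LANGUAGES: List[Tuple[str, str]] = [
    ("en", "English"),
    ("fra", "French"),
    ("srr", "Sereer"),
    ("wol", "Wolof"),
]


def filter_languages(query: str,
                     languages: List[Tuple[str, str]] = LANGUAGES) -> List[Tuple[str, str]]:
    """Return the entries whose code or name contains the query (case-insensitive)."""
    q = query.strip().lower()
    if not q:
        # An empty search box shows the full list.
        return list(languages)
    return [(code, name) for code, name in languages
            if q in code.lower() or q in name.lower()]


print(filter_languages("fr"))    # matches by code or name
print(filter_languages("wolof"))
```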
Speech data collection with Lig-Aikuma
Recording mode
Figure: (1) Figure: (2) Figure: (3) Figure: (4)
1 Start a recording
2 Currently recording
3 Recording is done, now save it!
4 Recording saved (see popup), back to home view
Speech data collection with Lig-Aikuma
Respeaking mode
Alternately play and record audio segments
Listen to the latest recorded segment
Validate once it is done
Speech data collection with Lig-Aikuma
Respeaking mode
(Optionally) Play and record any segment
Each pair of original and respoken segments is aligned on a row
Both segments can be played
Respoken segments can be recorded again
Then validate! A popup reports the success or failure of the respeaking
Speech data collection with Lig-Aikuma
Translation mode
Translation works the same way as the respeaking mode
The only difference lies in the name of the produced file
160118-152218 fra 58a rspk
160118-152218 eng 58a trsl
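The two names above differ in their language code and in their final field (rspk versus trsl). A minimal sketch of how such names could be split into fields follows. Several points are assumptions: the field separator (the code accepts spaces or underscores, since the separator may have been altered in these slides), the label given to the third field, whose meaning is not stated here, and all function and class names.

```python
import re
from typing import NamedTuple

# Mapping from the final field to the mode that produced the file.
MODES = {"rspk": "respeaking", "trsl": "translation"}


class RecordingName(NamedTuple):
    timestamp: str   # first field, kept as an opaque string
    language: str    # language code of the produced recording
    tag: str         # third field; its meaning is not given on the slide
    mode: str        # "respeaking" or "translation"


def parse_recording_name(name: str) -> RecordingName:
    """Split a produced file name into its four fields (separator: space or underscore)."""
    parts = re.split(r"[ _]+", name.strip())
    if len(parts) != 4:
        raise ValueError(f"expected 4 fields, got {len(parts)}: {name!r}")
    timestamp, language, tag, suffix = parts
    if suffix not in MODES:
        raise ValueError(f"unknown suffix {suffix!r} in {name!r}")
    return RecordingName(timestamp, language, tag, MODES[suffix])


print(parse_recording_name("160118-152218 fra 58a rspk"))
print(parse_recording_name("160118-152218 eng 58a trsl"))
```

A name carrying a file extension would need the extension stripped first; that case is left out of the sketch.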
Speech data collection with Lig-Aikuma
Elicitation mode
Dedicated to elicitation of resources
✓ Text: speak written text
– words or sentences
✓ Image, Video: implemented recently
Speech data collection with Lig-Aikuma
Elicitation from text
Key Features
Record your speech
Listen to the recording
Go to next item or validate and/or quit
Text items
Items: words or sentences
Loaded from the imported text file
One line at a time
– each ending with "##"
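Reading such a text file into a list of items can be sketched as below. Only the "one item per line, ending with ##" convention comes from the slide; the function name, the tolerance for blank lines and for lines without the terminator, and the sample content are assumptions made for illustration.

```python
from typing import List

TERMINATOR = "##"


def parse_text_items(text: str) -> List[str]:
    """Return one elicitation item per non-empty line, without the trailing '##'."""
    items = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.endswith(TERMINATOR):
            line = line[:-len(TERMINATOR)].rstrip()
        if line:
            items.append(line)
    return items


# Hypothetical file content: two sentences and one single word.
sample = "Where is the market? ##\n\nThe children are playing. ##\nwater ##\n"
for number, item in enumerate(parse_text_items(sample), start=1):
    print(number, item)
```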
Speech data collection with Lig-Aikuma
Elicitation from image
Key Features
Visualize an image
Record your speech
Go to next item or validate and/or quit
Image items
Loaded from a directory containing only images
– JPG or PNG formats
Speech data collection with Lig-Aikuma
Elicitation from video
Key Features
Watch a video
Record your speech
Go to next item or validate and/or quit
Video items
Loaded from a directory containing only videos
– AVI or MP4 formats
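Both the image and the video elicitation modes expect a directory that contains only files of the supported formats. Before copying a directory to the device, this can be checked with a short script such as the sketch below. The function and constant names are hypothetical, and rejecting a directory as soon as one unexpected file is found is a design choice of this sketch, not documented app behaviour.

```python
from pathlib import Path
from typing import List, Set

IMAGE_EXTENSIONS = {".jpg", ".png"}   # formats listed for image elicitation
VIDEO_EXTENSIONS = {".avi", ".mp4"}   # formats listed for video elicitation


def list_items(directory: str, extensions: Set[str]) -> List[Path]:
    """Return the files of a directory in sorted order, or fail if any file
    has an extension outside the allowed set (compared case-insensitively)."""
    files = sorted(p for p in Path(directory).iterdir() if p.is_file())
    unexpected = [p.name for p in files if p.suffix.lower() not in extensions]
    if unexpected:
        raise ValueError(f"unsupported files in {directory}: {unexpected}")
    return files
```

For example, `list_items("stimuli/videos", VIDEO_EXTENSIONS)` would return the videos in presentation order, where `stimuli/videos` is a made-up path.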
Speech data collection with Lig-Aikuma
Download and Install
Download URL: https://lig-aikuma.imag.fr/download/
From the Android device
In a web browser, type in the URL
A popup window may ask you to confirm the installation
Installation from sources other than the Play Store must be authorized. To do so:
→ Settings → Security tab → Check the "Unknown sources" box
From a computer
Download the application on your computer
Email the application file (MainActivity.apk) to an account synced on the Android device
On your Android device, open the attached file; an installation popup window should appear
Starting a documentation project for Sereer
Documentation of Sereer
3rd language of Senegal
Several variants (5)
Speech elicitation from videos (Trajectoire corpus)
Phonology described in the paper
Starting a documentation project for Sereer
Field recordings in Sereer
Recordings done mainly in Lyon
An association of Senegalese students was contacted to find speakers
5 speakers were successfully recorded with Lig-Aikuma
Sessions failed for 2 speakers (1 not fluent in Sereer, 1 not seriously engaged in the recording process)
Better to favour a formal setting (recordings in a lab rather than at the speaker’s home)
Starting a documentation project for Sereer
Data collected so far
376 speech files from 5 speakers in Sereer
333 speech translations in French
1 hour of recordings in total
Sereer transcription has started
Automatic annotation with the Persephone tool [10] is planned
[10] https://pypi.org/project/persephone
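Two quick figures can be derived from the counts above: the share of Sereer files that already have a French translation, and a rough mean duration per file. The second figure rests on assumptions: the slide does not say whether the 1 hour covers only the 376 Sereer files or the French translations as well, and "1h" is taken as exactly 3600 seconds. The function names are illustrative.

```python
def translation_coverage(n_source_files: int, n_translated_files: int) -> float:
    """Fraction of source recordings that have a translation."""
    return n_translated_files / n_source_files


def mean_duration_seconds(total_hours: float, n_files: int) -> float:
    """Average duration per file, given a total duration in hours."""
    return total_hours * 3600 / n_files


coverage = translation_coverage(376, 333)
print(f"translated: {coverage:.1%}")

# Assumption: the whole hour is spread over the 376 Sereer files only.
print(f"mean duration: {mean_duration_seconds(1.0, 376):.1f} s")
```

If the hour also includes the 333 French translations, the divisor becomes 709 and the mean duration drops accordingly.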
For people interested in Lig-Aikuma
For people interested in Lig-Aikuma
Lig-Aikuma in 90mn: https://lig-aikuma.imag.fr/lig-aikuma-in-90mn/
For people interested in Lig-Aikuma
Questions?
Thank you