BULB - Breaking the Unwritten Language Barrier
Gilles Adda, Martine Adda-Decker, Odette Ambouroue, Laurent Besacier, David Blachon, Helene Bonneau-Maynard, Pierre Godard, Fatima Hamlaoui, Dmitry Idiatov, Guy-Noel Kouarata, Lori Lamel,
Emmanuel-Moselly Makasso, Joseph Mariani, Annie Rialland, Sebastian Stuker, Mark Van de Velde, Francois Yvon, Elodie Gauthier,
Sabine Zerbian
L. Besacier (LIG) (LIG) BULB Talk at CMLD - Feb 2018 2d of Feb 2018 1 / 25
1 Context
2 Methodology
3 Data Collections
4 Machine Assisted Analyses
5 Conclusion
Context
Computational Language Documentation
Recast language documentation and description as highly interdisciplinary research
Where field linguistics leverages computational models and machine learning
Focus on endangered and/or unwritten languages (speech!)
Relies on
  Large and (as much as possible) naturalistic speech corpora ...
  ... automatically (or semi-automatically) processed ...
  ... to provide a (radically) new machine assistance to the field linguist
Context
Endangered Languages
7000 languages, half of them unwritten, half of them endangered.
Massive extinction in the coming decades (the clock is ticking!)
Too few field linguists to document them
What can we do? Propose machine assistance to field linguists, but very few languages (1%) have access to language technology.
Open questions
  How to use automatic methods as instruments for field linguists?
  How to propose innovative speech data collection methodologies?
Context
The BULB Project
French-German three-year project
Methodology
Recording Methodology
New methodology, inspired by pioneering work by Steven Bird and Mark Liberman.
Collect a large speech corpus of the source language (U-language):
  Re-speaking in the source language by a reference speaker.
  Spoken translation into a target language and processing with high-quality spoken language technologies (language in contact and lingua franca).
Methodology
Recording Methodology
(Figures courtesy of Gilles and Martine Adda)
Methodology
Data Processing Methodology
"Cyberlinguists of the future will have to devise algorithms to decipher the recordings that were made before this mass extinction event." S. Bird [1]
The Rosetta stone metaphor
  Use automatic methods / probabilistic learning / deep learning
  Automatic alignment between re-speaking signal and translation signal (parallel speech)
  Automatic alignment between re-speaking signal and translation text (parallel speech / text)
  Automatic alignment between elicited speech and images (parallel speech / images)
[1] https://theconversation.com/computing-gives-us-tools-to-preserve-disappearing-languages-60235
Methodology
Data Processing Methodology
We want to discover lexical units in an unknown and unwritten language with little or no supervision
Segmentation problem [2]: transform a continuous flow of phonemes into a sequence of word forms
  m/u/r/a/m/ɓ/ɔ/ŋ/l/a/m/a/k/a/l/a → mur / amɓɔŋ / la / makala
Alignment problem: map unknown units to known units
  mur amɓɔŋ la makala
  how does the man make donuts
[2] Example from Fatima Hamlaoui (Zentrum für Allgemeine Sprachwissenschaft)
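The segmentation problem above can be read as choosing boundary positions in the phoneme sequence. A minimal sketch, assuming the boundaries are already known (the function name `segment` and the hand-picked boundary set are illustrative, not part of the project's tools):

```python
def segment(phonemes, boundaries):
    """Split a phoneme list into word forms at the given boundary positions.

    `boundaries` holds indices i such that a new word starts at phonemes[i].
    """
    words, start = [], 0
    for end in sorted(boundaries) + [len(phonemes)]:
        words.append("".join(phonemes[start:end]))
        start = end
    return words


# Phoneme sequence from the slide, split on "/".
phonemes = "m/u/r/a/m/ɓ/ɔ/ŋ/l/a/m/a/k/a/l/a".split("/")

# Boundaries chosen by hand to reproduce the reference segmentation;
# in the real task these positions are exactly what has to be discovered.
gold_boundaries = {3, 8, 10}

print(" ".join(segment(phonemes, gold_boundaries)))
```

With 16 phonemes there are 15 possible boundary positions, so an unsupervised model has to pick one segmentation among 2^15 candidates for this single utterance.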
Methodology
Data Processing Methodology
m/u/r/a/m/ɓ/ɔ/ŋ/l/a/m/a/k/a/l/a
  SEGMENTATION → mur amɓɔŋ la makala
  ALIGNMENT → how does the man make donuts
Segmentation [3] seen as a pre-processing task before alignment (monolingual problem)
Alignment between source phonemes and target words to infer a segmentation on the source side
Joint modeling of segmentation and alignment
[3] Example from Fatima Hamlaoui (Zentrum für Allgemeine Sprachwissenschaft)
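The second option above, inferring a source-side segmentation from an alignment, can be sketched as follows. This is one simple reading, assuming a soft-alignment matrix is already available: each phoneme is attached to its highest-weighted target word, and a boundary is placed wherever that word changes. The function names, the abstract phonemes and the weights are all made up for illustration.

```python
def best_target(row):
    """Index of the target word with the highest alignment weight."""
    return max(range(len(row)), key=lambda j: row[j])


def boundaries_from_alignment(soft_alignment):
    """Boundary positions where the best-aligned target word changes.

    soft_alignment[i][j] is the weight linking source phoneme i to target word j.
    """
    best = [best_target(row) for row in soft_alignment]
    return [i for i in range(1, len(best)) if best[i] != best[i - 1]]


def apply_boundaries(phonemes, boundaries):
    """Cut the phoneme list at the given positions and join each piece."""
    cuts = [0] + list(boundaries) + [len(phonemes)]
    return ["".join(phonemes[a:b]) for a, b in zip(cuts, cuts[1:])]


# Toy data (hypothetical): five abstract phonemes, two target words.
toy_phonemes = list("abcde")
toy_alignment = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.3, 0.7],
    [0.2, 0.8],
    [0.1, 0.9],
]

toy_boundaries = boundaries_from_alignment(toy_alignment)
print(apply_boundaries(toy_phonemes, toy_boundaries))
```

A known limitation of this heuristic: two adjacent source words aligned to the same target word are never split, which is one motivation for modeling segmentation and alignment jointly.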
Methodology
Data Processing Methodology
Challenges
scarcity of the data
evaluation: we need manual phonetic (or pseudo-graphemic) transcriptions
introduction of noise during phonetic transcription
choices in data representation
  e.g. tones? prosodic markers? additional resources?
  → models able to take advantage of that information
identify prior knowledge to extract meaningful linguistic units (including cross-lingual priors)
vast literature on unsupervised segmentation and unsupervised alignment, but much less (and more recent) work on joint learning
Data Collections
Documentation of Bantu Languages
Three northwestern Bantu languages:
  1 Basaa (A43, Cameroon), 300,000 speakers
  2 Myene (B10, Gabon), 46,000 speakers, probably endangered
  3 Embosi (C25, Congo-Brazzaville), 150,000 speakers
(Figure courtesy of Gilles and Martine Adda)
Well described, with competent native-speaker linguists and basic electronic resources
A complex morphology (both nominal and verbal)
Challenging lexical and postlexical phonologies
Contrastive tones at both lexical and grammatical levels
Data Collections
Lig-Aikuma
The origins
  Smartphone application developed by S. Bird et al. [4]
From Aikuma to Lig-Aikuma
  Full pipeline: recording - respeaking - translating
  Elicitation of speech from text, images or videos
  More detailed metadata information (speaker profiles)
  Automatic generation of consent form
  Better feedback to the user (show waveform during re-speaking/translation)
  Geolocalisation of the recordings
  Coarse-grained annotation (speech/non-speech)
  Adapted for larger screens of tablets (10 inches)
  Localized in English, French and German
Lig-Aikuma V3
  https://lig-aikuma.imag.fr and Play Store
  Contact: [email protected] and [email protected]
[4] Bird, Steven, Hanke, Florian R., Adams, Oliver, and Lee, Haejoong. Aikuma: A mobile app for collaborative language documentation. ACL, 2014.
Data Collections
Lig-Aikuma
Recording speech in a very simple way
Respeaking an existing recording or an external audio file
Translation is the same as respeaking, except that the language changes
Elicitation records speech based on an external resource (text, images)
Check mode (in progress)
Data Collections
Field recordings
Basaa Embosi
Data Collections
Sound samples
Basaa Embosi
re-speaking translating
Data Collections
Data collected with Lig-Aikuma
Table: Re-spoken and translated speech collected so far
Language   Re-spoken   Translated
Embosi     55h         30h
Myene      45h         44h
Basaa      55h         25h
Released an Embosi5k corpus for computational language documentation experiments [5]
  used during the JSALT Summer Workshop 2017 at CMU
LREC 2018 publications
  P. Godard et al. A Very Low Resource Language Speech Corpus for Computational Language Documentation Experiments
  F. Hamlaoui et al. BULBasaa: A Dataset for Comp. Linguistics
[5] http://github.com/besacier/mboshi-french-parallel-corpus
Machine Assisted Analyses
Technologies Needed
French LVCSR (adapted to domain) - LIMSI
Language-independent phoneme recognition - KIT
  Make use of multilingual models
  Make use of modern DNNs
  Unsupervised phoneme boundary detection
  Make use of alternative models, e.g., articulatory features
Alignment of French word-level sentences and phoneme sequences in Basaa/Myene/Embosi - LIG+LIMSI
Machine Assisted Analyses
Language Independent Phoneme Discovery and Transcription (KIT)
Multilingual phoneme recognition is suboptimal, since not all phonemes might be covered
Three-step approach pursued
  Find phoneme boundaries
  Classify articulatory features (AF) within the segments
  Cluster segments into phoneme-like units, propose phoneme identity based on AF
Mueller et al. DBLSTM based multilingual articulatory feature extraction for language documentation. IEEE ASRU 2017
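The third step above (clustering segments into phoneme-like units from their articulatory features) can be illustrated with a deliberately simplified sketch. It assumes steps one and two have already produced one hard AF decision per segment, and it groups segments with identical feature values; the feature names, the segment data and the exact-match grouping are assumptions for illustration only, not the method of the cited system, which works on model outputs rather than hand-written labels.

```python
from collections import defaultdict


def cluster_by_features(segments):
    """Group segment ids whose predicted articulatory features are identical.

    `segments` is a list of (segment_id, features) pairs, where `features`
    maps a feature name to its predicted value.
    """
    clusters = defaultdict(list)
    for seg_id, features in segments:
        # Sort by feature name so that dict ordering does not matter.
        key = tuple(sorted(features.items()))
        clusters[key].append(seg_id)
    return dict(clusters)


# Hypothetical output of boundary detection + AF classification.
toy_segments = [
    (0, {"voiced": True, "nasal": True, "place": "bilabial"}),
    (1, {"voiced": True, "nasal": False, "place": "alveolar"}),
    (2, {"voiced": True, "nasal": True, "place": "bilabial"}),
]

for features, seg_ids in cluster_by_features(toy_segments).items():
    print(dict(features), "->", seg_ids)
```

Each resulting cluster plays the role of one phoneme-like unit, and its shared feature bundle is what lets a phoneme identity be proposed for it.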
Machine Assisted Analyses
Unsupervised Word Unit Discovery (LIG-LIMSI)
Monolingual
  Experiments with various non-parametric Bayesian word segmentation models
  Study of the impact of tones on word segmentation
  Preliminary experiments with Adaptor Grammars
  Character or phone sequences vs. lattice inputs
  Pipeline scoring using (language-independent) phoneme transcription
  Ondel et al. Bayesian Models for Unit Discovery on a Very Low Resource Language. In IEEE ICASSP 2018
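Non-parametric Bayesian word segmentation models are often built on a Dirichlet-process unigram model. The sketch below shows only the predictive word probability of such a model, as an assumed example of the model family rather than the exact systems used in the project. The phoneme inventory size, `alpha`, `p_end` and the toy counts are hypothetical, and each character is treated as one phoneme.

```python
from collections import Counter


def base_prob(word, n_phonemes, p_end=0.5):
    """Base distribution P0: uniform phonemes, geometric word length."""
    length = len(word)
    return (1.0 / n_phonemes) ** length * p_end * (1 - p_end) ** (length - 1)


def predictive_prob(word, counts, n_phonemes, alpha=1.0):
    """P(next word = word | words generated so far) under a DP unigram model.

    Frequent words are reused in proportion to their count; unseen words
    fall back on the base distribution, weighted by alpha.
    """
    total = sum(counts.values())
    return (counts[word] + alpha * base_prob(word, n_phonemes)) / (total + alpha)


# Hypothetical lexicon state after a few words have been segmented.
counts = Counter(["la", "la", "mur"])
N_PHONEMES = 16  # assumed inventory size, for illustration only

for candidate in ["la", "mur", "ba"]:
    print(candidate, predictive_prob(candidate, counts, N_PHONEMES))
```

A sampler for segmentation would use this quantity to compare the hypothesis "boundary here" against "no boundary here" at each position; that inference loop is left out of the sketch.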
Cross-Lingual (using neural End-to-End approaches)
  Zanon Boito et al. Unwritten languages demand attention too! Word discovery with encoder-decoder models. In IEEE ASRU 2017.
  Godard et al. Unsupervised Word Segmentation from Speech with Attention. Submitted to NAACL 2018.
Machine Assisted Analyses
Example of Soft-Alignment Matrices
Conclusion
Conclusion
Ergonomic, specialized data collection tool for documentary linguists in the field
  Lig-Aikuma
Collected valuable data in three Bantu languages
  Partly processed so far
  Partly published so far (watch out for LREC publications)
  More to become available in the future
Progress on language technologies for documenting unwritten languages
  Automatic phone transcription and phone set discovery
  Automatic word discovery, mono- and cross-lingual
  Currently making the (first) technologies available to the public, e.g., via web services
Conclusion
Tools for field linguists
Close collaboration with computer scientists to analyze their needs
First steps taken in the project
  Organization of training activities
    July 3rd 2015: Workshop on Natural Language Processing Technology for Linguists in Paris at LPP
    January 25th and 26th 2016: Linguistic training workshop for language technology experts in Paris
  Built prototypes of tools
    Developed technologies now on the brink of usability
    Plans to publish some tools as web services on existing / established platforms
    LigAikuma as standalone project already available at production grade