BULB - Breaking the Unwritten Language Barrier

Gilles Adda, Martine Adda-Decker, Odette Ambouroue, Laurent Besacier, David Blachon, Helene Bonneau-Maynard, Pierre Godard, Fatima Hamlaoui, Dmitry Idiatov, Guy-Noel Kouarata, Lori Lamel, Emmanuel-Moselly Makasso, Joseph Mariani, Annie Rialland, Sebastian Stuker, Mark Van de Velde, Francois Yvon, Elodie Gauthier, Sabine Zerbian

L. Besacier (LIG), BULB Talk at CMLD, 2 February 2018

1 Context

2 Methodology

3 Data Collections

4 Machine Assisted Analyses

5 Conclusion

Context

Computational Language Documentation

Recast language documentation and description as highly interdisciplinary research

Where field linguistics leverages computational models and machine learning

Focus on endangered and/or unwritten languages (speech!)

Relies on

Large and (as much as possible) naturalistic speech corpora ...
... automatically (or semi-automatically) processed ...
... to provide a (radically) new machine assistance to the field linguist

Endangered Languages

7000 languages, half of them unwritten, half of them endangered.

Massive extinction in the coming decades (the clock is ticking!)

Too few field linguists to document them

What can we do? Propose machine assistance to field linguists, but very few languages (1%) have access to language technology.

Open questions
How to use automatic methods as instruments for field linguists?
How to propose innovative speech data collection methodologies?

The BULB Project

Three-year French-German project

Methodology

Recording Methodology

New methodology, inspired by pioneering work by Steven Bird and Mark Liberman.

Collect a large speech corpus of the source language (U-language):

Re-speaking in the source language by a reference speaker.

Spoken translation into a target language (a contact language or lingua franca), then processing with high-quality spoken language technologies.

(Figures courtesy of Gilles and Martine Adda)

Data Processing Methodology

"Cyberlinguists of the future will have to devise algorithms to decipher the recordings that were made before this mass extinction event." S. Bird [1]

The Rosetta stone metaphor
Use automatic methods / probabilistic learning / deep learning

Automatic alignment between re-speaking signal and translation signal (parallel speech)
Automatic alignment between re-speaking signal and translation text (parallel speech / text)
Automatic alignment between elicited speech and images (parallel speech / images)

[1] https://theconversation.com/computing-gives-us-tools-to-preserve-disappearing-languages-60235

We want to discover lexical units in an unknown and unwritten language with little or no supervision

Segmentation problem [2]: transform a continuous flow of phonemes into a sequence of word forms

m/u/r/a/m/ɓ/ɔ/ŋ/l/a/m/a/k/a/l/a → mur / amɓɔŋ / la / makala

Alignment problem: map unknown units to known units

mur amɓɔŋ la makala

how does the man make donuts

[2] Example from Fatima Hamlaoui (Zentrum für Allgemeine Sprachwissenschaft)
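To make the segmentation problem concrete, here is a minimal Python sketch that treats a segmentation as a set of boundary positions over the phoneme sequence from the example. The function name `apply_boundaries` is an illustrative assumption and not part of any BULB tool; the boundary positions are simply read off the segmented form shown on the slide.

```python
from typing import List, Sequence


def apply_boundaries(phonemes: Sequence[str], boundaries: Sequence[int]) -> List[str]:
    """Cut a phoneme sequence after the given positions (counted in phonemes)."""
    words, start = [], 0
    for end in sorted(boundaries):
        words.append("".join(phonemes[start:end]))
        start = end
    words.append("".join(phonemes[start:]))  # last word runs to the end
    return words


phonemes = "m/u/r/a/m/ɓ/ɔ/ŋ/l/a/m/a/k/a/l/a".split("/")

# Boundaries after the 3rd, 8th and 10th phoneme reproduce the slide's segmentation.
words = apply_boundaries(phonemes, [3, 8, 10])
print(words)

# Each of the len(phonemes) - 1 gaps either is a boundary or is not, so a model
# without supervision has to choose among 2 ** (n - 1) candidate segmentations.
n_candidates = 2 ** (len(phonemes) - 1)
print(n_candidates)
```

Even for this 16-phoneme utterance the search space holds tens of thousands of candidates, which is why the models discussed later rely on priors or on the translation to guide the choice.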

[Figure: the phoneme sequence m/u/r/a/m/ɓ/ɔ/ŋ/l/a/m/a/k/a/l/a is linked by SEGMENTATION to "mur amɓɔŋ la makala", which is linked by ALIGNMENT to "how does the man make donuts"]

Segmentation [3] seen as a pre-processing task before alignment (monolingual problem)

Alignment between source phonemes and target words to infer a segmentation on the source side

Joint modeling of segmentation and alignment

[3] Example from Fatima Hamlaoui (Zentrum für Allgemeine Sprachwissenschaft)

Challenges

scarcity of the data

evaluation: we need manual phonetic (or pseudo-graphemic) transcriptions

introduction of noise during phonetic transcription

choices in data representation, e.g. tones? prosodic markers? additional resources?
→ models able to take advantage of that information

identify prior knowledge to extract meaningful linguistic units (including cross-lingual priors)

vast literature on unsupervised segmentation and unsupervised alignment, but much less (and more recent) on joint learning
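The evaluation point above can be illustrated with a small sketch. Once a manual reference segmentation exists, a hypothesis can be scored by boundary precision, recall and F-score. This is a common choice for word segmentation, but the talk does not state which metrics the project uses, so treat the metric and the helper names `boundary_positions` and `boundary_prf` as assumptions.

```python
from typing import List, Set, Tuple

Word = List[str]  # a word is a list of phoneme symbols


def boundary_positions(words: List[Word]) -> Set[int]:
    """Internal word-boundary positions, counted in phonemes."""
    positions, offset = set(), 0
    for word in words[:-1]:  # the utterance end is trivial and not scored
        offset += len(word)
        positions.add(offset)
    return positions


def boundary_prf(hyp: List[Word], ref: List[Word]) -> Tuple[float, float, float]:
    """Boundary precision, recall and F-score of hyp against ref."""
    h, r = boundary_positions(hyp), boundary_positions(ref)
    correct = len(h & r)
    precision = correct / len(h) if h else 0.0
    recall = correct / len(r) if r else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


ref = [list("mur"), list("amɓɔŋ"), list("la"), list("makala")]
hyp = [list("mur"), list("amɓɔŋ"), list("lamakala")]  # one boundary missed
precision, recall, f_score = boundary_prf(hyp, ref)
print(precision, recall, f_score)
```

Because the score depends entirely on the reference, noise introduced during manual phonetic transcription propagates directly into the evaluation.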

Data Collections

Documentation of Bantu Languages

Three northwestern Bantu languages:
1 Basaa (A43, Cameroon), 300,000 speakers
2 Myene (B10, Gabon), 46,000 speakers, probably endangered
3 Embosi (C25, Congo-Brazzaville), 150,000 speakers

(Figure courtesy of Gilles and Martine Adda)

Well described, with competent native-speaker linguists and basic electronic resources

A complex morphology (both nominal and verbal)

Challenging lexical and postlexical phonologies

Contrastive tones at both lexical and grammatical levels

Lig-Aikuma

The origins
  Smartphone application developed by S. Bird et al. [4]

From Aikuma to Lig-Aikuma
  Full pipeline: recording - respeaking - translating
  Elicitation of speech from text, images or videos
  More detailed metadata information (speaker profiles)
  Automatic generation of consent form
  Better feedback to the user (show waveform during re-speaking/translation)
  Geolocalisation of the recordings
  Coarse-grained annotation (speech / non-speech)
  Adapted for larger screens of tablets (10 inches)
  Localized in English, French and German

Lig-Aikuma V3
  https://lig-aikuma.imag.fr and Play Store
  contact [email protected] and [email protected]

[4] Bird, Steven; Hanke, Florian R.; Adams, Oliver; Lee, Haejoong. Aikuma: A mobile app for collaborative language documentation. ACL, 2014.

Recording speech in a very simple way

Respeaking an existing recording or an external audio file

Translation is the same as respeaking, except the language changes

Elicitation records speech based on an external resource (text, images)

Check mode (in progress)

Field recordings

[Figures: Basaa, Embosi]

Sound samples

[Audio samples: Basaa and Embosi, re-speaking and translating]

Data collected with Lig-Aikuma

Table: Re-spoken and translated speech collected so far

Language   Re-spoken   Translated
Embosi     55h         30h
Myene      45h         44h
Basaa      55h         25h

Released an Embosi5k corpus for computational language documentation experiments [5]

Used during the JSALT Summer Workshop 2017 at CMU

LREC 2018 publications

P. Godard et al. A Very Low Resource Language Speech Corpus for Computational Language Documentation Experiments
F. Hamlaoui et al. BULBasaa: A Dataset for Comp. Linguistics

[5] http://github.com/besacier/mboshi-french-parallel-corpus

Machine Assisted Analyses

Technologies Needed

French LVCSR (adapted to domain) - LIMSI

Language-independent phoneme recognition - KIT
  Make use of multilingual models
  Make use of modern DNNs
  Unsupervised phoneme boundary detection
  Make use of alternative models, e.g., articulatory features

Alignment of French word-level sentences and phoneme sequences in Basaa/Myene/Embosi - LIG+LIMSI

Language Independent Phoneme Discovery and Transcription (KIT)

Multilingual phoneme recognition is suboptimal, since not all phonemes might be covered

Three-step approach pursued

  Find phoneme boundaries
  Classify articulatory features (AF) within the segments
  Cluster segments into phoneme-like units, propose phoneme identity based on AF

Mueller et al. DBLSTM based multilingual articulatory feature extraction for language documentation. IEEE ASRU 2017
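The three-step approach can be pictured with a toy sketch. It assumes that steps 1 and 2 have already produced segments with articulatory-feature probabilities, and it replaces real clustering with a much cruder stand-in: thresholding each feature and grouping segments that share the same binary signature. The feature inventory, the probabilities and the signature-to-phoneme table are all hypothetical; the cited system uses a DBLSTM feature extractor that is not reproduced here.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

AF_NAMES = ("voiced", "nasal", "labial")  # hypothetical feature inventory

# Hypothetical lookup used to propose a phoneme identity for a signature.
AF_TO_PHONEME: Dict[Tuple[int, ...], str] = {
    (1, 1, 1): "m",
    (1, 0, 1): "b",
    (0, 0, 1): "p",
}

Segment = Tuple[float, float, Sequence[float]]  # (start_s, end_s, AF probabilities)


def signature(af_probs: Sequence[float], threshold: float = 0.5) -> Tuple[int, ...]:
    """Binarise AF probabilities into a signature such as (1, 1, 1)."""
    return tuple(int(p >= threshold) for p in af_probs)


def cluster_segments(segments: List[Segment]) -> Dict[Tuple[int, ...], List[int]]:
    """Group segment indices by AF signature (a stand-in for real clustering)."""
    clusters: Dict[Tuple[int, ...], List[int]] = defaultdict(list)
    for index, (_, _, af_probs) in enumerate(segments):
        clusters[signature(af_probs)].append(index)
    return dict(clusters)


# Invented output of steps 1 (boundaries) and 2 (AF classification).
segments: List[Segment] = [
    (0.00, 0.08, (0.9, 0.8, 0.9)),
    (0.08, 0.15, (0.1, 0.2, 0.7)),
    (0.15, 0.22, (0.8, 0.9, 0.6)),
]

clusters = cluster_segments(segments)
proposals = {sig: AF_TO_PHONEME.get(sig, "?") for sig in clusters}
print(clusters)
print(proposals)
```

Working from articulatory features rather than a fixed phoneme inventory is what lets the approach propose units for sounds that the multilingual training languages do not cover; signatures missing from the table are reported as "?" here.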

Unsupervised Word Unit Discovery (LIG-LIMSI)

Monolingual

Experiments with various non-parametric Bayesian word segmentation models
Study of the impact of tones on word segmentation
Preliminary experiments with Adaptor Grammars
Character or phone sequences vs. lattice inputs
Pipeline scoring using (language independent) phoneme transcription
Ondel et al. Bayesian Models for Unit Discovery on a Very Low Resource Language. In IEEE ICASSP 2018

Cross-Lingual (using neural End-to-End approaches)

Zanon Boito et al. Unwritten languages demand attention too! Word discovery with encoder-decoder models. In IEEE ASRU 2017.
Godard et al. Unsupervised Word Segmentation from Speech with Attention. Submitted to NAACL 2018.
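For the monolingual, non-parametric Bayesian models mentioned on this slide, a well-known instance is the Dirichlet-process unigram model, where the probability of the next word mixes its count so far with a base distribution over phoneme strings. The talk does not say which models were tried, so the sketch below is only an assumed example of the family; the function names and the values chosen for `alpha`, `p_end` and the inventory size are illustrative.

```python
from collections import Counter
from typing import Tuple

Word = Tuple[str, ...]  # a word is a tuple of phoneme symbols


def base_prob(word: Word, n_phonemes: int, p_end: float = 0.5) -> float:
    """P0(w): uniform phoneme choice and a geometric word-length prior."""
    length = len(word)
    return (1.0 / n_phonemes) ** length * (1.0 - p_end) ** (length - 1) * p_end


def dp_unigram_prob(word: Word, counts: Counter, alpha: float, n_phonemes: int) -> float:
    """Predictive probability of `word` given the words generated so far."""
    total = sum(counts.values())
    return (counts[word] + alpha * base_prob(word, n_phonemes)) / (total + alpha)


# Invented counts of previously hypothesised words.
counts = Counter({("l", "a"): 3, ("m", "u", "r"): 1})

p_seen = dp_unigram_prob(("l", "a"), counts, alpha=1.0, n_phonemes=10)
p_new = dp_unigram_prob(("k", "a"), counts, alpha=1.0, n_phonemes=10)
print(p_seen, p_new)
```

The model prefers reusing word forms it has already posited, while still reserving mass for new, preferably short, forms. No lexicon is fixed in advance, which is the sense in which such models are non-parametric.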

Example of Soft-Alignment Matrices
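One way to read such a matrix, in the spirit of the attention-based word discovery cited earlier, is to link each source phoneme to the target word with the highest weight and to place a boundary wherever that word changes. The sketch below assumes a matrix with one row per phoneme and one column per target word. The weights are invented for illustration and make no claim about which translation word each phoneme really corresponds to; `segment_from_alignment` is a hypothetical helper name.

```python
from typing import List, Sequence


def segment_from_alignment(
    phonemes: Sequence[str], attention: Sequence[Sequence[float]]
) -> List[str]:
    """attention[i][j] is the weight linking source phoneme i to target word j."""
    if not phonemes:
        return []
    # Hard-assign every phoneme to its most strongly aligned target word.
    best = [max(range(len(row)), key=row.__getitem__) for row in attention]
    words, current = [], [phonemes[0]]
    for i in range(1, len(phonemes)):
        if best[i] != best[i - 1]:  # aligned target word changes: cut here
            words.append("".join(current))
            current = []
        current.append(phonemes[i])
    words.append("".join(current))
    return words


phonemes = "l/a/m/a/k/a/l/a".split("/")
attention = [  # invented weights for two target words
    [0.9, 0.1],
    [0.8, 0.2],
    [0.3, 0.7],
    [0.2, 0.8],
    [0.1, 0.9],
    [0.4, 0.6],
    [0.2, 0.8],
    [0.1, 0.9],
]
print(segment_from_alignment(phonemes, attention))
```

The hard argmax discards the uncertainty visible in a soft-alignment matrix, so diffuse rows such as the sixth one above are exactly where the inferred boundaries are least reliable.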

Conclusion

Ergonomic, specialized data collection tool for documentary linguists in the field

  Lig-Aikuma

Collected valuable data in three Bantu languages

  Partly processed so far
  Partly published so far (watch out for LREC publications)
  More to become available in the future

Progress on language technologies for documenting unwritten languages

  Automatic phone transcription and phone set discovery
  Automatic word discovery, mono- and cross-lingual
  Currently making (first) technologies available to the public, e.g., via web services

Tools for field linguists

Close collaboration with computer scientists to analyze their needs

  First steps taken in the project
  Organization of training activities
    July 3rd 2015: Workshop on Natural Language Processing Technology for Linguists in Paris at LPP
    January 25th and 26th 2016: Linguistic training workshop for language technology experts in Paris

Built prototypes of tools

  Developed technologies now on the brink of usability
  Plans to publish some tools as web services on existing / established platforms
  Lig-Aikuma as standalone project already available at production grade

Questions?

Thank you
