The Human Language Project: Uniting Computational Linguistics with Documentary Linguistics

Steven Bird

University of Melbourne & University of Pennsylvania

Fieldwork as a Computational Problem

• three data types

• three kinds of metadata

• relations

• computational challenge

• http://www.ldc.upenn.edu/sb/fieldwork/

• this isn't computational linguistics

Convergences

• concern with data

• use of speech data

• bilingual text

Convergences: Bitext + morph = IGT

• bilingual text

• morphologically analyzed text

• comparative wordlists

• bilingual lexicons
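The "Bitext + morph = IGT" idea can be sketched as a small data structure: a morphologically analyzed source line paired with a free translation gives one interlinear glossed text (IGT) entry. This is only an illustrative sketch. The class name `IGTLine`, its fields, and the sample forms are hypothetical placeholders, not real language data and not a format taken from the talk.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IGTLine:
    """One interlinear entry (hypothetical layout, for illustration only)."""
    morphemes: List[List[str]]  # one list of morphemes per source word
    glosses: List[List[str]]    # one list of glosses per source word
    translation: str            # free translation in the contact language

    def __post_init__(self) -> None:
        # Every morpheme needs exactly one gloss.
        if len(self.morphemes) != len(self.glosses):
            raise ValueError("word counts differ between tiers")
        for morphs, gls in zip(self.morphemes, self.glosses):
            if len(morphs) != len(gls):
                raise ValueError(f"morpheme/gloss mismatch in {morphs}")

    def render(self) -> str:
        """Return the three tiers with words and glosses column-aligned."""
        words = ["-".join(m) for m in self.morphemes]
        gls = ["-".join(g) for g in self.glosses]
        widths = [max(len(w), len(g)) for w, g in zip(words, gls)]
        top = "  ".join(w.ljust(n) for w, n in zip(words, widths)).rstrip()
        mid = "  ".join(g.ljust(n) for g, n in zip(gls, widths)).rstrip()
        return f"{top}\n{mid}\n'{self.translation}'"


# Invented placeholder forms, not data from any real language.
line = IGTLine(
    morphemes=[["kata", "m"], ["lu"]],
    glosses=[["dog", "PL"], ["run"]],
    translation="The dogs run.",
)
print(line.render())
```

Keeping the morpheme and gloss tiers as parallel lists makes the alignment check explicit, which is the part of IGT that plain bilingual text lacks.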

Documentary and Descriptive Linguistics

Nikolaus Himmelmann (1998) "Documentary and Descriptive Linguistics" Linguistics 36:161-195

Documentation types: Interlinear text

Guwamu, Peter Austin (2010)

Documentation types: Lexicons

Kröger, F. Buli-English dictionary: With an Introductory Grammar and an Index. Münster: Lit, 1992.

Documentary and Descriptive Linguistics: Use of Computation

Nikolaus Himmelmann (1998) "Documentary and Descriptive Linguistics" Linguistics 36:161-195

• documentarists

• innovation, tool development

• descriptivists

• Evans, Hyman

Karaim CD-ROM

Eva Csato and David Nathan

Nathan, D. (1998) The spoken Karaim CD: Sound, text, lexicon and "active morphology" for language learning multimedia, Proceedings of the Ninth Annual Conference on Turkish Linguistics.

Where's the science?

After years of neglect in which linguistics lost sight of the value of empirical field research, new life has finally been breathed into this fundamentally important component of our discipline. But in the process, linguistic fieldwork has ironically lost sight of linguistics! That is, if by linguistics one means the scientific study of language, fieldwork ideology and practice have gone askew. The major movements and individuals that we can thank for the resurgence of interest in linguistic fieldwork all promote (in words or deeds) approaches to field research that fall far short of the tenets of science. Examples of such misguided directions include (a) the endangered languages movement, (b) language documentation, and (c) the "Dixon school". In my talk, I expose the failings of these non-scientific approaches to linguistic field research and set out what would be required for linguistic fieldwork to qualify as truly scientific and thus be entitled to recognition as an essential subfield within linguistics per se.

Paul Newman -- Linguistic Fieldwork as a Scientific Enterprise, International Conference on Language

Key Questions

• What does computational linguistics offer to the problem of documenting and describing the world's languages?

• How can CL help improve the descriptive value of language documentation?

• three places where this might happen

Basic Oral Language Documentation

Pilot project: Synopsis of 1 week in Moife

1. Discussions re orthography, literacy

2. training, practice, listening, tone orthography experiment

3. training in oral transcription and translation; gave out recorders

4. re-assigned recorders

5. (Saturday)

6. oral transcription, vitality survey, orthography recommendations

7. more oral transcription

Pilot project

Main Phase

Preparation

• Batteries

• Date

• Identifiers

Training

Basic Oral Language Documentation: Overview of one week's activity...

Oral Annotation Protocol

Transcription

Cross Checking

Evaluation

• What is the quality of the collected materials?

• Can we correctly establish the phonemic inventory of the language from the recorded materials?

• What semantic domains are covered?

• What can trained linguists get from the raw transcripts?

Back to the computational questions...

Axioms

• Limited funding, but costs for local participation are negligible

• Cannot assume continuous presence of a linguist: primary collection work is "unsupervised"

• Cannot assume an orthography

• Can give training in documentation, but not description

• Contact language has every conceivable resource

• No time limit

Transcription

• contact-language orthography: issues with normalisation

• lexical inventory, diphone inventory

• sense tagging

• multiple instances of one story

• ASR?

• resegmentation

• active learning in interlinear text glossing
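As a rough illustration of the "lexical inventory, diphone inventory" bullet, the sketch below counts word forms and adjacent symbol pairs in a set of transcripts. It assumes, purely for simplicity, that each written character stands for one segment. Transcripts in a contact-language orthography would generally violate this (digraphs, inconsistent spellings), which is the normalisation issue noted above. The function name `inventories` and the sample strings are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def inventories(transcripts: Iterable[str]) -> Tuple[Counter, Counter]:
    """Count word forms and character bigrams ("diphones") in transcripts.

    Assumption: one character = one segment. '#' marks word boundaries.
    """
    lexical: Counter = Counter()
    diphones: Counter = Counter()
    for line in transcripts:
        for word in line.lower().split():
            lexical[word] += 1
            padded = f"#{word}#"
            # Slide a two-symbol window across the padded word.
            diphones.update(padded[i:i + 2] for i in range(len(padded) - 1))
    return lexical, diphones


# Invented placeholder strings, not real transcripts.
lexical, diphones = inventories(["ba na", "na ba na"])
print(lexical.most_common())
print(diphones.most_common())
```

Pairs that never occur, or occur only once, are candidates for re-checking against the recordings. That use is a suggestion here, not a procedure described in the talk.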

MT to help with eliciting morphology?

• problems with recording and translating isolated words

• short complete sentences with translations

• fix nouns and vary the form of the verb?

• bilingual texts as the key means by which a user would train the system

MT as the measure of adequacy?

• inspect MT output to see what is lost

• supply a corrected version when it gets something wrong

• supply other examples, much as you would do with a child
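One very simple way to "inspect MT output to see what is lost" is to compare the system's output with a reference translation and list the reference words that are missing. The sketch below does only that, using whitespace tokens. It is not an evaluation metric from the talk, and `lost_tokens` and the example sentences are hypothetical.

```python
from collections import Counter


def lost_tokens(reference: str, mt_output: str) -> Counter:
    """Return reference tokens (with counts) that the MT output lacks.

    Assumption: lowercased whitespace tokens are a good enough unit;
    word order and paraphrase are ignored.
    """
    ref = Counter(reference.lower().split())
    hyp = Counter(mt_output.lower().split())
    # Counter subtraction keeps only tokens with a positive remainder.
    return ref - hyp


# Invented example sentences.
missing = lost_tokens("the two dogs ran away", "the dogs ran")
print(sorted(missing))
```

A speaker's corrected version could be fed back through the same comparison, so each round of correction shows which distinctions the system still drops.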

Data mining

Bird (1999) Multidimensional exploration of online linguistic field data. NELS 29: 33-50.
