Data Elicitation for AVENUE Lori Levin Alison Alvarez Jeff Good (MPI Leipzig) Bob Frederking Erik Peterson Language Technologies Institute Carnegie Mellon University

Post on 12-Jan-2016

Page 1

Data Elicitation for AVENUE

Lori Levin
Alison Alvarez
Jeff Good (MPI Leipzig)
Bob Frederking
Erik Peterson

Language Technologies Institute
Carnegie Mellon University

Page 2

Outline

Elicitation
The Functional-Typological Corpus
Corpus Creation
Feature Detection
Corpus Navigation

Page 3

The Elicitation Tool

Page 4

Input to the Elicitation Tool: corpus of minimal pairs

Eliciting from Spanish

# 1,2,3 {Sg,pl} person pronouns
newpair
srcsent: Canto
context:
comment:

newpair
srcsent: Canté
context:
comment:

newpair
srcsent: Estoy cantando
context:
comment:

newpair
srcsent: Cantaste
context:
comment:

Eliciting from English

# 1,2,3 {Sg,pl} person pronouns
newpair
srcsent: I sing
context:
comment:

newpair
srcsent: I sang
context:
comment:

newpair
srcsent: I am singing
context:
comment:

newpair
srcsent: You sang
context:
comment:

Page 5

Output of the elicitation process

newpair
srcsent: Tú caíste
tgtsent: eymi ütrünagimi
aligned: ((1,1),(2,2))
context: tú = Juan [masculino, 2a persona del singular]
comment: You (John) fell

newpair
srcsent: Tú estás cayendo
tgtsent: eymi petu ütünagimi
aligned: ((1,1),(2 3,2 3))
context: tú = Juan [masculino, 2a persona del singular]
comment: You (John) are falling

newpair
srcsent: Tú caíste
tgtsent: eymi, ütrunagimi
aligned: ((1,1),(2,2))
context: tú = María [femenino, 2a persona del singular]
comment: You (Mary) fell
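The records above follow a simple field-per-line layout, so they can be read with a few lines of Python. The sketch below is only an illustration: the function names `parse_records` and `parse_alignment` are hypothetical, and it assumes each field sits on its own line after a `newpair` marker, which is a reading of the slide rather than a documented file format.

```python
import re

# Field names as they appear on the slides.
FIELDS = ("srcsent", "tgtsent", "aligned", "context", "comment")


def parse_records(text):
    """Split elicitation text into one dict per `newpair` record."""
    records = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and "#" header lines
        if line == "newpair":
            current = {}
            records.append(current)
            continue
        key, sep, value = line.partition(":")
        if current is not None and sep and key in FIELDS:
            current[key] = value.strip()
    return records


def parse_alignment(spec):
    """Turn "((1,1),(2 3,2 3))" into [([1], [1]), ([2, 3], [2, 3])]."""
    pairs = []
    for src, tgt in re.findall(r"\(([\d ]+),([\d ]+)\)", spec):
        pairs.append(([int(n) for n in src.split()],
                      [int(n) for n in tgt.split()]))
    return pairs


SAMPLE = """\
newpair
srcsent: Tú estás cayendo
tgtsent: eymi petu ütünagimi
aligned: ((1,1),(2 3,2 3))
context: tú = Juan [masculino, 2a persona del singular]
comment: You (John) are falling
"""

records = parse_records(SAMPLE)
alignment = parse_alignment(records[0]["aligned"])
print(records[0]["tgtsent"], alignment)
```

Each alignment pair keeps source and target word positions as lists, so a many-to-many link such as `(2 3,2 3)` stays a single unit.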

Page 6

Elicitation Corpus

"Elicitation Corpus" refers to the list of sentences in the major language.
    Not yet translated or aligned

Field workers call it a questionnaire.

Page 7

The elicitation corpus is useful as

Input to automatic rule learning
Test suite for machine translation (at ARL)
Fieldwork questionnaire

The consultant can do some of the tedious parts by himself/herself.

Page 8

AVENUE Elicitation Corpora

The Functional-Typological Corpus
    Designed to elicit elements of meaning that may have morpho-syntactic realization
The Structural Elicitation Corpus
    Based on sentence structures from the Penn TreeBank

Page 9

The Functional Typological Corpus

</feature>

<feature>
  <feature-name>c-my-polarity</feature-name>
  <value><value-name>polarity-positive</value-name></value>
  <value><value-name>polarity-negative</value-name></value>
  <note>Stick to the two obvious values of polarity for now.</note>
</feature>

Feature Name: c-my-polarity
Values: positive, negative
Note: Stick to the two obvious values of polarity for now.
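As a rough illustration of the XML-to-text step, the sketch below reads the `c-my-polarity` feature definition and prints a human-readable summary. It uses Python's standard `xml.etree.ElementTree` in place of the project's own formatting scripts, and the function name `describe_feature` is hypothetical. The element names are taken from the slide; value names are printed unabbreviated.

```python
import xml.etree.ElementTree as ET

FEATURE_XML = """
<feature>
  <feature-name>c-my-polarity</feature-name>
  <value><value-name>polarity-positive</value-name></value>
  <value><value-name>polarity-negative</value-name></value>
  <note>Stick to the two obvious values of polarity for now.</note>
</feature>
"""


def describe_feature(xml_text):
    """Return a plain-text summary of one <feature> element."""
    feature = ET.fromstring(xml_text.strip())
    name = feature.findtext("feature-name")
    values = [v.findtext("value-name") for v in feature.findall("value")]
    note = feature.findtext("note", default="")
    lines = [f"Feature Name: {name}", "Values: " + ", ".join(values)]
    if note:
        lines.append(f"Note: {note}")
    return "\n".join(lines)


summary = describe_feature(FEATURE_XML)
print(summary)
```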

Page 10

Functional Typological Corpus

In XML
    XSLT scripts can format it into human-readable text or into data structures.
Currently contains around 50 features and a few hundred values.
Based on the Lingua checklist (Comrie and Smith, 1977), other fieldwork checklists, and other typological taxonomies.
Still under development

Page 11

Functional Typological Corpus: Representation of “Who is at the meeting”

((subj ((np-my-general-type pronoun-type)
        (np-my-person person-unk)
        (np-my-number num-sg)
        (np-my-animacy anim-human)
        (np-my-function fn-predicatee)
        (np-d-my-distance-from-speaker distance-neutral)
        (np-my-emphasis emph-no-emph)
        (np-my-info-function info-neutral)
        (np-pronoun-exclusivity exclusivity-n/a)
        (np-pronoun-antecedent-function antecedent-n/a)
        (np-pronoun-reflexivity reflexivity-n/a)))
 (predicate ((loc-roles loc-general-at)))

Continued on next slide

Page 12

Continued: “Who is at the meeting”

 (c-my-copula-type locative)
 (c-my-secondary-type secondary-copula)
 (c-my-polarity polarity-positive)
 (c-my-function fn-main-clause)
 (c-my-general-type open-question)
 (gap-function gap-copula-subject)
 (c-my-sp-act sp-act-request-information)
 (c-v-my-grammatical-aspect gram-aspect-neutral)
 (c-v-my-absolute-tense present)
 (c-v-my-phase-aspect durative)
 (c-my-headedness-rc rc-head-n/a)
 (c-my-minor-type minor-n/a)
 (c-my-restrictivess-rc rc-restrictive-n/a)
 (c-my-answer-type ans-n/a)
 (c-my-imperative-degree imp-degree-n/a)
 (c-my-actor's-status actor-neutral)
 (c-my-focus-rc focus-n/a)
 (c-my-gaps-function gap-n/a)
 (c-my-relative-tense relative-n/a)
 (c-my-ynq-type ynq-n/a)
 (c-my-actor's-sem-role actor-sem-role-neutral)
 (c-v-my-lexical-aspect state))
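The parenthesized notation above is a nested list of feature-value pairs, so it can be read into nested dictionaries with a very small parser. The following is a minimal sketch, not the project's own reader: `tokenize`, `parse`, and `to_dict` are hypothetical names, and the input is a shortened fragment of the "Who is at the meeting" structure. Tokens are split only on whitespace and parentheses, so names such as `c-my-actor's-status` or `exclusivity-n/a` stay intact.

```python
import re


def tokenize(text):
    """Split into "(", ")" and atoms (anything without spaces or parens)."""
    return re.findall(r"[()]|[^\s()]+", text)


def parse(tokens):
    """Consume tokens and return a nested list; atoms stay as strings."""
    token = tokens.pop(0)
    if token == "(":
        items = []
        while tokens[0] != ")":
            items.append(parse(tokens))
        tokens.pop(0)  # drop the closing ")"
        return items
    return token


def to_dict(pairs):
    """Convert a list of [name, value] pairs into a (possibly nested) dict."""
    result = {}
    for name, value in pairs:
        result[name] = to_dict(value) if isinstance(value, list) else value
    return result


FRAGMENT = """
((subj ((np-my-general-type pronoun-type)
        (np-my-person person-unk)
        (np-my-number num-sg)))
 (predicate ((loc-roles loc-general-at)))
 (c-my-copula-type locative)
 (c-my-polarity polarity-positive))
"""

structure = to_dict(parse(tokenize(FRAGMENT)))
print(structure["subj"]["np-my-number"], structure["c-my-copula-type"])
```

The sketch assumes balanced parentheses and exactly one value per feature; a Multiply, which lists several values per feature, would need a different value type.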

Page 13

Why is the corpus represented as a set of feature structures?

Multiple elicitation languages
    Generate the English and Spanish elicitation corpora from the same internal representation
Easy to add a new elicitation language
    Write a GenKit grammar to generate sentences from the same internal representation

Page 14

Why is the corpus represented as a set of feature structures?

The feature structure represents things that are not expressed in the major language.
    These things show up as comments in the elicitation corpus:
        “I am singing” (comment: female)
    May eventually use pictures and discourse context
We actually want to elicit the meaning associated with the feature structure.
    English and Spanish are just vehicles for getting at the meaning.

Page 15

Corpus Creation Tools

The elicitation corpus can be changed and new corpora can be created.

Page 16

Motivation for Corpus Creation Tools

Make new corpora easily
    Add a new tense (e.g., remote past) and automatically get all the combinations with other features
    Make a specialized corpus for a limited semantic domain or a specific language family

Page 17

Motivation for Corpus Creation Tools

Combinatorics
    For example, all combinations of person, number, gender, tense, etc.
    Too much bookkeeping for a human corpus creator, and too time-consuming

Page 18

Where do the feature structures come from?

A linguist formulates a Multiply.
The Multiply specifies a set of feature structures.

Page 19

A Multiply

((subj ((np-my-general-type pronoun-type common-noun-type)
        (np-my-person person-first person-second person-third)
        (np-my-number num-sg num-pl)
        (np-my-biological-gender bio-gender-male bio-gender-female)
        (np-my-function fn-predicatee)))
 {[(predicate ((np-my-general-type common-noun-type)
               (np-my-definiteness definiteness-minus)
               (np-my-person person-third)
               (np-my-function predicate)))
   (c-my-copula-type role)]
  [(predicate ((adj-my-general-type quality-type)))
   (c-my-copula-type attributive)]
  [(predicate ((np-my-general-type common-noun-type)
               (np-my-person person-third)
               (np-my-definiteness definiteness-plus)
               (np-my-function predicate)))
   (c-my-copula-type identity)]}
 (c-my-secondary-type secondary-copula)
 (c-my-polarity #all)
 (c-my-function fn-main-clause)
 (c-my-general-type declarative)
 (c-my-speech-act sp-act-state)
 (c-v-my-grammatical-aspect gram-aspect-neutral)
 (c-v-my-lexical-aspect state)
 (c-v-my-absolute-tense past present future)
 (c-v-my-phase-aspect durative))

This multiply expands to 288 feature structures.
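The expansion of a Multiply can be pictured as a Cartesian product over the listed values, followed by removal of excluded combinations. The sketch below is an illustration only, not the AVENUE implementation. It keeps just the features of this Multiply that have more than one value, and it assumes that `#all` for polarity stands for the two polarity values. A plain product yields 432 combinations. The exclusion used here (common nouns occur only in the third person) is a guess on our part, chosen because it reproduces the 288 stated on the slide. The real exclusion list is not shown.

```python
from itertools import product

# Features of the Multiply that take more than one value.
CHOICES = {
    "np-my-general-type": ["pronoun-type", "common-noun-type"],
    "np-my-person": ["person-first", "person-second", "person-third"],
    "np-my-number": ["num-sg", "num-pl"],
    "np-my-biological-gender": ["bio-gender-male", "bio-gender-female"],
    # The three bracketed predicate alternatives, one per copula type.
    "c-my-copula-type": ["role", "attributive", "identity"],
    # "#all" is assumed to mean both polarity values.
    "c-my-polarity": ["polarity-positive", "polarity-negative"],
    "c-v-my-absolute-tense": ["past", "present", "future"],
}


def expand(choices):
    """Yield one flat feature structure (dict) per value combination."""
    names = list(choices)
    for combo in product(*(choices[name] for name in names)):
        yield dict(zip(names, combo))


def allowed(fs):
    """Hypothetical exclusion: common nouns are third person only."""
    return not (fs["np-my-general-type"] == "common-noun-type"
                and fs["np-my-person"] != "person-third")


everything = list(expand(CHOICES))
kept = [fs for fs in everything if allowed(fs)]
print(len(everything), len(kept))
```

Adding a new tense value to `c-v-my-absolute-tense` would automatically produce every combination with the other features, which is the bookkeeping the corpus creation tools are meant to take over.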

Page 20

There is a GUI for making Multiplies

Demo may be available

Page 21

GenKit Grammar

Use GenKit for generation

;;declarative
(<s> ==> (<np> <vp> <np> <sc>)
     (((x0 c-my-general-type) =c declarative)
      ((x2 verb-form) = fin)
      ((x3 c-my-copula-type) = (x0 c-my-copula-type))
      ((x4 d-speaker-gender) = (x0 d-speaker-gender))
      ((x4 d-hearer-gender) = (x0 d-hearer-gender))
      ((x4 d-my-formality) = (x0 d-my-formality))
      ((x3 np-my-number) = (x0 np-my-number))
      ((x3 np-my-animacy) = (x0 np-my-animacy))
      ((x3 np-my-biological-gender) = (x0 np-my-biological-gender))
      (x3 = (x0 predicate))
      (x1 = (x0 subj))
      (x2 = x0)))

Page 22

GenKit Lexicon

;;Pronouns
(word ((cat n) (root you) (pred pro) (np-my-person person-second) (np-my-animacy anim-human) (np-my-general-type pronoun-type)))
(word ((cat n) (root I) (pred pro) (np-my-person person-first) (np-my-number num-sg) (np-my-animacy anim-human) (np-my-general-type pronoun-type)))
(word ((cat n) (root we) (pred pro) (np-my-person person-first) (np-my-number num-pl) (np-my-animacy anim-human) (np-my-general-type pronoun-type)))
(word ((cat n) (root we) (pred pro) (np-my-person person-first) (np-my-number num-dual) (np-my-animacy anim-human) (np-my-general-type pronoun-type)))
(word ((cat n) (root she) (pred pro) (np-my-person person-third) (np-my-number num-sg) (np-my-biological-gender bio-gender-female) (np-my-animacy anim-human) (np-my-general-type pronoun-type)))

Page 23

Comments are also generated

I & one female & sang

Use comments for things that are not expressed in English.

Page 24

Convert to Elicitation Format (input to Elicitation Tool)

original: WHO & IS AT THE BOX &
full comment:
Sentence: WHO IS AT THE BOX

original: I &ONE-WOMAN & AM PN_FEMALE &ONE-WOMAN & &
full comment: NP1: ONE-WOMAN
Sentence: I AM PN_FEMALE

original: WILL I &ONE-WOMAN & BE THE TEACHER &
full comment: NP1: ONE-WOMAN
Sentence: WILL I BE THE TEACHER

Page 25

Eight Basic Steps for Corpus Creation

1. Write FVD and format into data structure
2. Gather Exclusions (restrictions on co-occurrence of features)
3. Design the Multiply
4. Get a full set of Feature Structures
5. Design Grammar and Comments
6. Design Lexicon
7. Generate Sentences from Feature Structures
8. Convert to Elicitation Format

Page 26

Can make other types of corpora

The Elicitation Corpus does not have to be functional-typological

Page 27

Alternative Corpora: The Medical Corpus

Feature: Body-Parts
Values
    part-hand    Restrictions:
    part-finger  Restrictions:
    part-tooth   Restrictions: symptom_redness symptom_scratch symptom_numbness symptom_cut symptom_lump symptom_rash symptom_puncture symptom_bruise symptom_frozen
    part-eye     Restrictions: symptom_rash
    part-arm     Restrictions:

((subj ((body-parts #all)
 (Poss ((np-my-general-type pronoun-type)
        (np-my-person #all)
        (np-my-number num-sg num-pl)
        (np-my-animacy anim-human)
        (np-my-use possessive)))
 (Pred ((symptoms #all))
 (c-my-general-type declarative)
 (c-my-speech-act sp-act-state)
 (c-v-my-grammatical-aspect gram-aspect-neutral)
 (c-v-my-lexical-aspect state)
 (c-v-my-absolute-tense present));

The Result:
YOUR ARM IS RED
YOUR ARM IS SCRATCHED
YOUR ARM IS NUMB
YOUR ARM IS NIL
YOUR ARM HAS A/N INFECTION
…

Page 28

Feature Detection

Identify meaning components that have morpho-syntactic consequences in the language that is being elicited.
    The gender of the subject is marked on the verb in Hebrew.
    The gender of the subject has no morpho-syntactic realization in Mapudungun.

Page 29

Feature Detection: Spanish

The girl saw a red book.
((1,1)(2,2)(3,3)(4,4)(5,6)(6,5))
La niña vió un libro rojo

A girl saw a red book
((1,1)(2,2)(3,3)(4,4)(5,6)(6,5))
Una niña vió un libro rojo

I saw the red book
((1,1)(2,2)(3,3)(4,5)(5,4))
Yo vi el libro rojo

I saw a red book.
((1,1)(2,2)(3,3)(4,5)(5,4))
Yo vi un libro rojo

Feature: definiteness
Values: definite, indefinite
Function-of-*: subj, obj
Marked-on-head-of-*: no
Marked-on-dependent: yes
Marked-on-governor: no
Marked-on-other: no
Add/delete-word: no
Change-in-alignment: no
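Two of the fields in the fact record, Add/delete-word and Change-in-alignment, can be read directly off a minimal pair by comparing the two translations and their alignments. The sketch below shows one possible way to do that. `alignment_pairs` and `compare_minimal_pair` are hypothetical names, not the AVENUE feature detector. Deciding whether a difference is marked on the head, a dependent, or the governor needs syntactic information, which this sketch does not attempt.

```python
import re


def alignment_pairs(spec):
    """Read "((1,1)(2,2)(4,5))" into a set of (source, target) positions."""
    return {(int(s), int(t)) for s, t in re.findall(r"\((\d+),\s*(\d+)\)", spec)}


def compare_minimal_pair(first, second):
    """Each argument is a (target_sentence, alignment_string) tuple."""
    tokens_a, tokens_b = first[0].split(), second[0].split()
    same_length = len(tokens_a) == len(tokens_b)
    changed = []
    if same_length:
        # 1-based positions where the two translations differ.
        changed = [i + 1 for i, (a, b) in enumerate(zip(tokens_a, tokens_b))
                   if a != b]
    same_alignment = alignment_pairs(first[1]) == alignment_pairs(second[1])
    return {
        "Add/delete-word": "no" if same_length else "yes",
        "Change-in-alignment": "no" if same_alignment else "yes",
        "changed-positions": changed,
    }


definite = ("Yo vi el libro rojo", "((1,1)(2,2)(3,3)(4,5)(5,4))")
indefinite = ("Yo vi un libro rojo", "((1,1)(2,2)(3,3)(4,5)(5,4))")
record = compare_minimal_pair(definite, indefinite)
print(record)
```

For the Spanish object pair, the only difference is the third target word (el versus un), with no added word and no change in alignment, which agrees with the fact record on the slide.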

Page 30

Feature Detection: Chinese

A girl saw a red book.

((1,2)(2,2)(3,3)(3,4)(4,5)(5,6)(5,7)(6,8))

有 一个 女人 看见 了 一本 红色 的 书 。

The girl saw a red book.

((1,1)(2,1)(3,3)(3,4)(4,5)(5,6)(6,7))

女人 看见 了 一本 红色的 书

Feature: definiteness

Values: definite, indefinite

Function-of-*: subject

Marked-on-head-of-*: no

Marked-on-dependent: no

Marked-on-governor: no

Add/delete-word: yes

Change-in-alignment: no

Page 31

Feature Detection: Chinese

I saw the red book
((1, 3)(2, 4)(2, 5)(4, 1)(5, 2))
红色的 书, 我 看见 了

I saw a red book.
((1,1)(2,2)(2,3)(2, 4)(4,5)(5,6))
我 看见 了 一本 红色的 书 。

Feature: definiteness
Values: definite, indefinite
Function-of-*: object
Marked-on-head-of-*: no
Marked-on-dependent: no
Marked-on-governor: no
Add/delete-word: yes
Change-in-alignment: yes

Page 32

Feature Detection: Hebrew

A girl saw a red book.
((2,1) (3,2)(5,4)(6,3))
ילדה ראתה ספר אדום

The girl saw a red book
((1,1)(2,1)(3,2)(5,4)(6,3))
הילדה ראתה ספר אדום

I saw a red book.
((2,1)(4,3)(5,2))
ראיתי ספר אדום

I saw the red book.
((2,1)(3,3)(3,4)(4,4)(5,3))
ראיתי את הספר האדום

Feature: definiteness
Values: definite, indefinite
Function-of-*: subj, obj
Marked-on-head-of-*: yes
Marked-on-dependent: yes
Marked-on-governor: no
Add-word: no
Change-in-alignment: no

Page 33

Feature detection feeds into

Corpus Navigation: which minimal pairs to pursue next.
    Don’t pursue gender in Mapudungun
    Do pursue definiteness in Hebrew
Morphology Learning:
    Morphological learner identifies the forms of the morphemes
    Feature detection identifies the functions
Rule learning:
    Rule learner will have to learn a constraint for each morpho-syntactic marker that is discovered
    E.g., Adjectives and nouns agree in gender, number, and definiteness in Hebrew.

Page 34

Other uses of Feature Detection

A human-readable reference grammar can be generated from fact records.
    A human analyst knows Northern Ostyak, and then has to translate a document in Eastern Ostyak.
    The only reference grammar of Eastern Ostyak is written in Hungarian, which the analyst does not speak.
    An Eastern Ostyak consultant who speaks Russian translates the Elicitation Corpus from Russian to Eastern Ostyak.
    The analyst learns about Eastern Ostyak from automatically generated fact records.

Page 35

Other uses of Feature Detection

I’m not really sure whether the only grammar of Eastern Ostyak is written in Hungarian. There is one reference grammar of Northern Ostyak written in English (by Irina Nikolaeva). All other Ostyak materials are in Hungarian, Russian, and German.

The Ostyaks are subsistence hunters, and Eastern Ostyak is nearly extinct, so there is no real need for government translators.

Other Siberian and Central Asian languages with similar scarcity of resources may be important.

Page 36

Other uses of Feature Detection

Help a field worker
    Instead of “Elicit by day; analyze by night” (in order to know what to elicit the next day), go to sleep and look at the automatically generated analysis in the morning.

We have been working with people at EMELD and MPI Leipzig.

Page 37

Corpus Navigation

While the Elicitation Corpus for any one target language (TL) can be kept to a reasonable size, the universal Elicitation Corpus must check for all phenomena that might occur in any language.

Since the universal corpus cannot be kept to a reasonable size, Corpus Navigation is necessary.

Facts discovered about a particular TL early in the process constrain what needs to be looked for later in the process for that TL. Thus this is a dynamic process, different for each TL.

Page 38

Corpus Navigation: search

Search process, with the informant in the inner loop, expanding the search states he/she is given as SL sentences by supplying the corresponding TL sentences and alignments.

Analogously to game search, there is an "opening book" of moves (SL sentences to check for all languages), used until enough information has been gathered to make intelligent search choices.

The heuristic function driving the search process is Relative Info Gain: RIG(Y|X) = [H(Y) - H(Y|X)]/H(Y)

The system reduces the remaining entropy in its knowledge of the language as much as possible.
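To make the RIG formula concrete, the sketch below estimates H(Y), H(Y|X), and RIG(Y|X) from a list of observed (x, y) pairs. The slides do not define X and Y. Treating X as the outcome of eliciting a candidate sentence and Y as a still-unknown property of the language is one possible reading, and the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)


def relative_info_gain(pairs):
    """RIG(Y|X) = [H(Y) - H(Y|X)] / H(Y), estimated from (x, y) samples."""
    h_y = entropy(Counter(y for _, y in pairs).values())
    if h_y == 0:
        return 0.0  # nothing left to learn about Y
    h_y_given_x = 0.0
    for x, n in Counter(x for x, _ in pairs).items():
        y_counts = Counter(y for xx, y in pairs if xx == x)
        h_y_given_x += (n / len(pairs)) * entropy(y_counts.values())
    return (h_y - h_y_given_x) / h_y


# X fully determines Y: all uncertainty about Y is removed.
informative = [("a", 0), ("a", 0), ("b", 1), ("b", 1)]
# X tells us nothing about Y.
uninformative = [("a", 0), ("a", 1), ("b", 0), ("b", 1)]

print(relative_info_gain(informative), relative_info_gain(uninformative))
```

RIG ranges from 0 (X is uninformative about Y) to 1 (X removes all uncertainty about Y), so candidates can be compared on a common scale regardless of how large H(Y) is.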

There should also be a cost factor, estimating the human effort required to expand the node.

To make the process efficient enough, we will create "decision graphs", similar to RETE networks, that cache information so only the information that changes needs to be recomputed.