The Linguistic-Core Approach to Structured Translation and Analysis of Low-Resource Languages

November 2014 Project Review at ARL
Jaime Carbonell (CMU) & Team
MURI via ARO (PM: Joseph Myers)

The Faculty

CMU: Jaime Carbonell, Noah Smith, Lori Levin, Chris Dyer
USC-ISI: Kevin Knight, David Chiang (Notre Dame)
MIT: Regina Barzilay
UT Austin: Jason Baldridge

Supporting roles: 2 other PhDs, 8 Grad Students, 3 Postdocs, N UGs

LCMT: The Elevator Pitch

• The fundamental challenge
– “Modern” MT requires massive parallel data
– There are 7000+ languages with scant parallel data
– Rule-based MT requires extensive effort from trained linguists

• The linguistic-core approach
– Goal: 90% of the linguistic benefit with 10% of the linguist effort
– Annotation deep and light, linguistics “lay” bilinguals
– Augmented with machine learning from bilingual and monolingual text

• Accomplishments to date
– Theory: GFL, graph-semantics, AMR & other parsers, sparse ML training, linguistically-anchored models, … 40+ papers
– Tool suites: GFL, TurboParser, MT-in-works, Morph, SuperTag, …
– Languages: Kinyarwanda, Malagasy, Swahili, Yoruba

The Setting

• MURI Languages
– Kinyarwanda: Bantu (7.5M speakers)
– Malagasy: Malayo-Polynesian (14.5M)
– Swahili: Bantu (5M native, 150M 2nd/3rd)
– Yoruba: Niger-Congo (22+M)

• Morpho-syntactics
– Swahili: Anamwona, “he is seeing him/her”

Which MT Paradigms are Best? Towards Filling the Table

Rows: source-language resources (S). Columns: target-language resources (T).

           Large T   Med T   Small T
Large S    SMT       LCMT    LCMT
Med S      LCMT      ???     ???
Small S    LCMT      ???     ???

• “Old” DARPA MT: Large S, Large T
– Arabic → English; Chinese → English

Evolutionary Tree of MT Paradigms Leading up to LCMT

[Figure: timeline running from 1950 through 1980 to 2014. Nodes: Transfer MT, Decoding MT, Analogy MT, Large-scale TMT, Interlingua MT, Example-based MT, Context-Based MT, Statistical MT, Phrasal SMT, Transfer MT w stat phrases, SMT with syntax, LCMT.]

Linguistically omnivorous parsing

[Figure: linguistic universals, a GFL annotated corpus, an unannotated corpus, and a small CCG lexicon feed parsers that produce dependencies and Abstract Meaning Representations. Site labels in the figure: CMU, Texas, MIT, ISI.]

Dependencies example: He has been writing a letter.

Abstract Meaning Representation example:

(j / join-01
   :ARG0 (p / person
      :name (p2 / name :op1 "Pierre" :op2 "Vinken")
      :age (t / temporal-quantity :quant 61 :unit (y / year)))
   :ARG1 (b / board)
   :prep-as (d2 / director
      :mod (e / executive :polarity -))
   :time (d / date-entity :month 11 :day 29))

Original Vision

[Figure: project organization diagram.
Teams: Linguistic Core Team (LL, JB, SV, JC); Linguistic Analyzers Team (NS, RB, JB); MT Systems Team (KK, DC, SV, JC).
Boxes: Parser, Taggers, Morph. Analyzers; Hand-built Linguistic Core; Triple Gold Data; Triple Ungold Data; MT Visualizations and logs; MT Features; MT Error Analysis; MT Systems; Inference Algorithms; Elicitation corpus; Data selection for annotation.
Data: Parallel, Monolingual, Elicited, Related language, Multi-parallel, Comparable.]

Current Vision

[Figure: project organization diagram.
Teams: Linguistic Core Team (LL, JB, CD, JC); MT Systems Team (KK, DC, CD, JC).
Boxes: Parser, Taggers, Semantic analyzers; Triple Gold and GFL annotated; String/tree/graph transducers; Complex Morph Analyzers; Dependency parses; MT/TA Error Analysis; MT Systems and TA modules; Semantic Parsing Algorithms; Definiteness/Discourse; Elicitation corpus; Data selection for annotation.
Data: Parallel, Monolingual, Elicited.]

PFA Node Alignment Algorithm Example

• Tree-tree aligner enforces equivalence constraints and optimizes over terminal alignment scores (words/phrases)

• Resulting aligned nodes are highlighted in figure

• Transfer rules are partially lexicalized and read off tree.

LCMT: NLP Workflow and Tools

[Figure: workflow from annotated data and unannotated data to analysis tools; a legend marks toolsuite software (more to come). Components: GFL annotator Framework; Supervised POS Taggers; Semisupervised POS Taggers; Unsupervised Dependency Parsers; Semisupervised Dependency Parsers; Supervised Dependency & AMR Parsers; Tree-Graph Syn/Sem trx. Site labels in the figure: Texas, CMU, CMU (current), CMU + Texas, MIT, ISI.]

Machine Translation Paradigms

• Phrase-based MT (LCMT 20+% of effort)
• Morph-Syntax-based MT (LCMT 30+%)
• Meaning-based MT (LCMT 40+%)

[Figure: pipeline diagrams: source string → target string; source string → source tree → target tree → target string; source tree → target tree; source string → meaning representation → target string.]

[Figure: bar chart with axis from 0 to 35, labelled NIST 2009 c2e.]

Some Key Results to Date

• Theory of transducers (string, tree, graph)
• Massive lexical borrowing across diverse languages
• Linguistic universals
– Dependencies, semantic roles, conservation, AMR, discourse, …
• Statistical learning over strings, trees, graphs
– Bayesian, HMM/CRF, active sampling model parameters
– Parsing into deep semantics (AMR)
• MT demonstrations: focus on M, K, S, Y, but also across ~20 languages (WMT honors, synthetic phrases)
• A suite of 11 serious software modules and tools (morphology, variable-depth linguistic annotation, dependency parsing, MT, …)
• Current scientific challenges
– Is general graph topology induction possible?
– Bridging structural divergences via semi-universals?
– Semantic invariance: lexical, structural, non-propositional?

List of “Firsts” for the Linguistic Core

First use of models incorporating linguistic knowledge in the form of hand-written morpho-grammatical rules combined with limited-volume corpus statistics.

First use of models of lexical "borrowing" from other (major) languages to improve translation and analysis of low-resource languages (publication in prep).

First efficient and exact probabilistic model for structured prediction with arbitrary syntactic and semantic dependencies derived from the input language.

First exploitation of large monolingual foreign text collections (vs bilingually translated collections) to improve low-density MT, by treating foreign text as a mapped/encoded version of English.

First application of formal graph transduction theory to natural language analysis; earlier efforts applied string transduction and tree transduction theory only.

First substantial corpora annotated cheaply by novices and used to build effective NLP tools.

First statistical parser to map language into the Abstract Meaning Representation of semantics.

First to show that, for resource-impoverished languages, a multilingual parser based on language universals outperforms a parser trained on the target language.

First analyses to prove formally and demonstrate empirically that inference in dependency parsing is computationally easy in the average case (despite being NP-hard in the worst case).

External Honors for the LC Project

• Best human judgments of English-Russian translations at WMT 2013
• Best BLEU on Hindi-English translation at WMT 2014
• Best student paper, ACL 2014: Low-Rank Tensors for Scoring Dependency Structures. Tao Lei, Yu Xin, Yuan Zhang, Regina Barzilay and Tommi Jaakkola. http://people.csail.mit.edu/taolei/papers/acl2014.pdf
• Best paper, honorable mention, ACL 2014: A Discriminative Graph-Based Parser for the Abstract Meaning Representation. Jeffrey Flanigan, Sam Thomson, Jaime Carbonell, Chris Dyer and Noah A. Smith. http://www.cs.cmu.edu/~jmflanig/flanigan+etal.acl2014.pdf
• Best paper, runner up, EMNLP 2014: Language Modeling with Power Low Rank Ensembles. http://www.aclweb.org/anthology/D/D14/D14-1158.pdf
• Best paper (one of four), NIPS 2014: Conditional Random Field Autoencoders for Unsupervised Structured Prediction. Waleed Ammar, Chris Dyer, and Noah A. Smith

Lexical Borrowing of Common Words

Swahili morphology using a crowdsourced lexicon
Patrick Littell, Lori Levin, Chris Dyer

• No provenance: the root of the word was collected by hand.
• [GUESS1]: the root is inferred from the Kamusi lexicon part-of-speech tag, including noun class.
• [GUESS2]: the root is from Kamusi, but no noun class is given.
• [GUESS3]: possible English loan word
• [GUESS4]: complete guess

FST written by Patrick Littell
Lexicon extracted from dictionaries and textbooks

Parsing Progress (F1)

On CoNLL Dataset: 88.72 (CMU), 89.44 (MIT)

CCG Supertagging

[Figure: candidate CCG supertags for the words “the lazy dogs wander”, drawn from np/n, n/n, n, np, s\np, (s\np)/np, …, scored by an HMM with linguistically-motivated priors.]
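To make the supertag notation concrete, the following is a minimal sketch of how CCG function application combines tags for “the lazy dogs wander”. The tag chosen for each word is an assumption made for illustration (the slide shows several candidates per word), and the tuple encoding and function names are hypothetical; this is not the project's supertagger, which additionally scores tag sequences with an HMM.

```python
# An atomic category is a string ("n", "np", "s").
# A complex category is a tuple (result, slash, argument).

def forward_apply(left, right):
    # Forward application: X/Y followed by Y yields X.
    if isinstance(left, tuple) and left[1] == "/" and left[2] == right:
        return left[0]
    return None

def backward_apply(left, right):
    # Backward application: Y followed by X\Y yields X.
    if isinstance(right, tuple) and right[1] == "\\" and right[2] == left:
        return right[0]
    return None

def combine(left, right):
    # Try forward application first, then backward application.
    result = forward_apply(left, right)
    return result if result is not None else backward_apply(left, right)

# Assumed tag assignment, one supertag per word.
THE = ("np", "/", "n")      # np/n
LAZY = ("n", "/", "n")      # n/n
DOGS = "n"                  # n
WANDER = ("s", "\\", "np")  # s\np (intransitive reading)

noun = combine(LAZY, DOGS)                 # n/n with n
noun_phrase = combine(THE, noun)           # np/n with n
sentence = combine(noun_phrase, WANDER)    # np with s\np
print(noun, noun_phrase, sentence)
```

Choosing the transitive tag (s\np)/np for “wander” instead would leave the derivation waiting for an object, which is the kind of ambiguity a supertagger has to resolve.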

Parsing into AMR (ACL 2014 honorable mention for best paper)

Approximately 11000 guards patrol the 1200 - kilometre border between Russia and Afghanistan.

(p / patrol-01
   :ARG0 (g / guard
      :quant (a2 / approximately :op1 11000))
   :ARG1 (b / border
      :quant (d4 / distance-quantity
         :unit (k2 / kilometer)
         :quant 1200)
      :location (b2 / between
         :op1 (c / country :name (n / name :op1 "Russia"))
         :op2 (c2 / country :name (n2 / name :op1 "Afghanistan")))))

Concept fragments:
a2 / approximately; 11000; p / patrol-01; g / guard; (d4 / distance-quantity :unit (k2 / kilometer)); 1200; b / border; b2 / between; (c / country :name (n / name :op1 "Russia")); (c2 / country :name (n2 / name :op1 "Afghanistan"))

Add relations

New Results: 61% F1

CMU and ISI
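As a small illustration of the notation, the sketch below reads the variable/concept pairs off the AMR string shown above with a regular expression. It only inspects an existing AMR; it is not the parser described on the slide, and the names `INSTANCE` and `concepts` are hypothetical.

```python
import re

# The AMR from the slide, written on one line.
AMR = ('(p / patrol-01 :ARG0 (g / guard :quant (a2 / approximately :op1 11000)) '
       ':ARG1 (b / border :quant (d4 / distance-quantity :unit (k2 / kilometer) '
       ':quant 1200) :location (b2 / between '
       ':op1 (c / country :name (n / name :op1 "Russia")) '
       ':op2 (c2 / country :name (n2 / name :op1 "Afghanistan")))))')

# "(" variable "/" concept, e.g. "(p / patrol-01".
INSTANCE = re.compile(r'\(\s*([A-Za-z]\w*)\s*/\s*([^\s()]+)')

def concepts(amr):
    """Map each AMR variable to the concept it instantiates."""
    return {var: concept for var, concept in INSTANCE.findall(amr)}

found = concepts(AMR)
for var, concept in found.items():
    print(var, concept)

# Sanity check: parentheses in a well-formed AMR are balanced.
balanced = AMR.count("(") == AMR.count(")")
```

The relations (:ARG0, :quant, :location, …) are ignored here; recovering them is the "add relations" step on the slide.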

Unsupervised Part-of-Speech Tagging

[Figure: bar chart of V-measure (higher is better) for Arabic, Basque, Danish, Greek, Hungarian, Italian, Kinyarwanda (Kin.), Malagasy (Mal.), Turkish, and the average (Ave.). Series: conditional random field autoencoder, classic hidden Markov model, featurized hidden Markov model.]
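For readers unfamiliar with the metric, here is a minimal sketch of V-measure as it is usually defined: the harmonic mean of homogeneity and completeness between gold tags and induced clusters. The toy tag sequences are invented for illustration and are not project data; the function names are hypothetical.

```python
import math
from collections import Counter

def _entropy(counts, n):
    # Shannon entropy (natural log) of a distribution given by raw counts.
    return -sum(c / n * math.log(c / n) for c in counts if c)

def v_measure(gold, pred, beta=1.0):
    """V-measure between gold labels and predicted cluster ids."""
    n = len(gold)
    joint = Counter(zip(gold, pred))
    gold_counts = Counter(gold)
    pred_counts = Counter(pred)
    h_gold = _entropy(gold_counts.values(), n)
    h_pred = _entropy(pred_counts.values(), n)
    # Conditional entropies H(gold | pred) and H(pred | gold).
    h_gold_given_pred = -sum(c / n * math.log(c / pred_counts[p])
                             for (g, p), c in joint.items())
    h_pred_given_gold = -sum(c / n * math.log(c / gold_counts[g])
                             for (g, p), c in joint.items())
    homogeneity = 1.0 if h_gold == 0 else 1.0 - h_gold_given_pred / h_gold
    completeness = 1.0 if h_pred == 0 else 1.0 - h_pred_given_gold / h_pred
    if homogeneity + completeness == 0:
        return 0.0
    return (1 + beta) * homogeneity * completeness / (beta * homogeneity + completeness)

gold_tags = ["DT", "NN", "VBZ", "DT", "NN"]   # toy gold POS tags
perfect = v_measure(gold_tags, [0, 1, 2, 0, 1])     # clusters match tags up to renaming
collapsed = v_measure(gold_tags, [0, 0, 0, 0, 0])   # everything in one cluster
print(perfect, collapsed)
```

Because the score depends only on how items are grouped, induced cluster ids never need to be mapped to tag names, which is why it suits unsupervised taggers.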

Automatic Classification of the Communicative Functions of Definiteness

Pipeline: annotated corpus (semantics of definiteness) → syntactic features extracted from dependency parser → logistic regression classifier → predicted semantic functions of definiteness: 78.2% accuracy

Why Definiteness:
• One instance of non-propositional semantics
• Major determinant of word order
• Wildly divergent in morpho-syntactic expression
• Problems in word alignment and language models

Integrating Alignment and Decipherment for Better Low-Density MT

• Small bilingual Malagasy/English text (need to align words [Brown et al 93])
• Large Malagasy monolingual text (need to decipher [Dou & Knight 13])
• Decipherment helps word alignment
• Decipherment helps machine translation

[Figure: BLEU comparison that includes the joint model.]

ISI jointly with CMU/Texas/MIT

Graph Formalisms for Language Understanding and Generation

Investigating:
• Linguistically adequate representations
• Efficient algorithms

Using them in:
• Text → Meaning (NLU)
• Meaning → Text (NLG)
• Meaning-based MT

Known results for String Automata Algorithms and Tree Automata Algorithms (the Graph Automata Algorithms column is the open one):

N-best answer extraction
– Strings: … paths through a WFSA (Viterbi, 1967; Eppstein, 1998)
– Trees: … trees in a weighted forest (Jiménez & Marzal, 2000; Huang & Chiang, 2005)

Unsupervised EM training
– Strings: Forward-backward EM (Baum/Welch, 1971; Eisner 2003)
– Trees: Tree transducer EM training (Graehl & Knight, 2004)

Determinization, minimization
– Strings: … of weighted string acceptors (Mohri, 1997)
– Trees: … of weighted tree acceptors (Borchardt & Vogler, 2003; May & Knight, 2005)

Intersection
– Strings: WFSA intersection
– Trees: Tree acceptor intersection

Application of transducers
– Strings: string → WFST → WFSA
– Trees: tree → TT → weighted tree acceptor

Composition of transducers
– Strings: WFST composition (Pereira & Riley, 1996)
– Trees: Many tree transducers not closed under composition (Maletti et al 09)

Software tools
– Strings: Carmel, OpenFST
– Trees: Tiburon (May & Knight 10)

ISI jointly with CMU

[Figure: resources and components for Tajik named entity (NE) recognition.
Resources: Tajik Corpus from Leipzig Archive; Tajik and Persian Wikipedias; Tajik Reference Grammar (Perry, 2005); PerLex Persian Lexicon (Sagot and Walther, 2010); Persian Treebank (Rasooli et al., 2013).
Components: NE Gold Standard, native speaker (Azim); NE Pyrite Standard, linguist (Alexa, Lori); Morphology Lists for NE (Alexa, David); Morphological Analyzers (Swabha, Chris); Gazetteers (Pat, Chris); Brown Clusters (Kartik); IPA Converter (Kartik, Pat, Chris); Named Entity Recognizer (Kartik, Chris); Tajik POS Tagger (Chris); Persian-Tajik Converter (Chris); Supervision (Azim); Supervision (David).]

New Results for Graph Automata for Mapping Between Text and Meaning

                               Strings            Graphs
                               FSA     CFG        DAG acceptor       HRG
probabilistic                  yes     yes        yes                yes
intersects with finite-state   yes     yes        yes                yes
EM training                    yes     yes        yes                yes
transduction                   O(n)    O(n^3)     O(|Q|^{T+1} n)     O((3^d n)^{T+1})
implemented                    yes     yes        yes                yes

d = graph degree for AMR, high in practice
T = treewidth complexity for AMR, low in practice (2-3)

Next Steps (high level overview)

• Finalize MT systems: K, M, S, Y
– Package and make available externally
– Possibly integrate with government translator workbench
– Compare with Govt systems when available and appropriate (e.g. Malagasy with Carl Rubino)
• Complete scientific investigations (graph transduction, MT with AMR, supertagging parsing, borrowing++, …)
• Document and distribute tool suites (rapid annotation, morphology, CCG supertagging, dependency parsing, AMR parsing, generation, end-to-end MT, lexicon borrowing, ML modules, …) 15 +/-
• Publish, publish, publish (40+ papers and counting)
• Detailed next steps at the end of each major presentation

Jaime Carbonell, CMU

THANK YOU!

Supplementary Slides

Select/show as needed for discussion period

Tag Dictionary Generalization

[Figure: graph linking token annotations, type annotations, and a raw corpus. Token annotation example: “the dog walks” tagged DT NN VBZ. Type annotation examples: the → DT, dog → NN. Node kinds: tokens (TOK_the_1, TOK_dog_2, TOK_the_4, TOK_thug_5), types (TYPE_the, TYPE_thug, TYPE_dog), context features (PREV_<b>, PREV_the, NEXT_walks), and affix features (PRE1_t, PRE2_th, SUF1_g). Any arbitrary features could be added.]

RULE 1: DT(these) → 这
RULE 2: VBP(include) → 中包括
RULE 4: NNP(France) → 法国
RULE 5: CC(and) → 和
RULE 6: NNP(Russia) → 俄罗斯
RULE 8: NP(NNS(astronauts)) → 宇航 , 员
RULE 9: PUNC(.) → .
RULE 10: NP(x0:DT, CD(7), NNS(people)) → x0 , 7 人
RULE 11: VP(VBG(coming), PP(IN(from), x0:NP)) → 来自 , x0
RULE 13: NP(x0:NNP, x1:CC, x2:NNP) → x0 , x1 , x2
RULE 14: VP(x0:VBP, x1:NP) → x0 , x1
RULE 15: S(x0:NP, x1:VP, x2:PUNC) → x0 , x1 , x2
RULE 16: NP(x0:NP, x1:VP) → x1 , 的 , x0

Chinese string: 这 7 人中包括来自法国和俄罗斯的宇航员 .

English string: “These 7 people include astronauts coming from France and Russia”

Spans covered in the derivation, from largest to smallest:
• “these 7 people”
• “include astronauts coming from France and Russia”
• “astronauts coming from France and Russia”
• “coming from France and Russia”
• “France and Russia”
• “these”, “include”, “astronauts”, “France”, “&”, “Russia”, “.”
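The rules above can be applied mechanically. The following is a minimal sketch, assuming a hand-built English parse tree and a first-match rule order; the tree encoding and the helper names (`T`, `match`, `translate`) are hypothetical and this is not the project's MT decoder, which would also weight competing rules.

```python
import re

def T(label, *children):
    # A tree node is (label, [children]); a leaf word is a plain string.
    return (label, list(children))

VAR = re.compile(r"(x\d+):(\S+)")   # pattern variable such as "x0:NP"

def match(pattern, tree, bindings):
    """Match a rule left-hand side against a tree, filling variable bindings."""
    if isinstance(pattern, str):
        m = VAR.fullmatch(pattern)
        if m:  # variable: binds any subtree with the required root label
            if isinstance(tree, tuple) and tree[0] == m.group(2):
                bindings[m.group(1)] = tree
                return True
            return False
        return tree == pattern  # literal word
    if not isinstance(tree, tuple) or tree[0] != pattern[0]:
        return False
    if len(tree[1]) != len(pattern[1]):
        return False
    return all(match(p, t, bindings) for p, t in zip(pattern[1], tree[1]))

RULES = [
    (T("DT", "these"), ["这"]),
    (T("VBP", "include"), ["中包括"]),
    (T("NNP", "France"), ["法国"]),
    (T("CC", "and"), ["和"]),
    (T("NNP", "Russia"), ["俄罗斯"]),
    (T("NP", T("NNS", "astronauts")), ["宇航", "员"]),
    (T("PUNC", "."), ["."]),
    (T("NP", "x0:DT", T("CD", "7"), T("NNS", "people")), ["x0", "7", "人"]),
    (T("VP", T("VBG", "coming"), T("PP", T("IN", "from"), "x0:NP")), ["来自", "x0"]),
    (T("NP", "x0:NNP", "x1:CC", "x2:NNP"), ["x0", "x1", "x2"]),
    (T("VP", "x0:VBP", "x1:NP"), ["x0", "x1"]),
    (T("S", "x0:NP", "x1:VP", "x2:PUNC"), ["x0", "x1", "x2"]),
    (T("NP", "x0:NP", "x1:VP"), ["x1", "的", "x0"]),
]

def translate(tree):
    """Rewrite a tree into target tokens using the first matching rule."""
    for pattern, rhs in RULES:
        bindings = {}
        if match(pattern, tree, bindings):
            out = []
            for item in rhs:
                out.extend(translate(bindings[item]) if item in bindings else [item])
            return out
    raise ValueError(f"no rule matches subtree rooted at {tree[0]}")

english = T("S",
    T("NP", T("DT", "these"), T("CD", "7"), T("NNS", "people")),
    T("VP", T("VBP", "include"),
        T("NP", T("NP", T("NNS", "astronauts")),
            T("VP", T("VBG", "coming"),
                T("PP", T("IN", "from"),
                    T("NP", T("NNP", "France"), T("CC", "and"), T("NNP", "Russia")))))),
    T("PUNC", "."))

tokens = translate(english)
print(" ".join(tokens))
```

Rule 16 is where the reordering happens: the English post-modifier “coming from France and Russia” is emitted before 的 and the head noun.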

Model Minimization

[Figure: tag lattice over “<b> The man saw the saw <b>” at positions 0 to 6, with states <b>, DT, NN, VBD and edge weights ranging from 0.2 to 1.0.]

Linguistically opportunistic parsing

[Figure: linguistic universals, a GFL annotated corpus, an unannotated corpus, and a small CCG lexicon feed parsers that produce dependencies and Abstract Meaning Representations. Site labels in the figure: CMU, Texas, MIT, ISI. Dependencies example: He has been writing a letter.]

Fragmentary Unlabeled Dependency Grammar (Schneider, O’Connor, Saphra, Bamman, Faruqui, Smith, Dyer, and Baldridge, 2013)

• Represents unlabeled dependencies
• Special handling for:
– multiword expressions
– coordination
– anaphora
• Allows underspecification
• Graph fragment language for easy annotation

Graph Fragment Language (GFL)

Provide a detailed analysis of coordination:

{Our three} > weapons > are < $a
$a :: {fear surprise efficiency} :: {and~1 and~2}
ruthless > efficiency

Or focus just on the high level:

(Our three weapons*) > are < (fear surprise and ruthless efficiency)

Malagasy example: “Gaddafi has referred to protesters as rodents in his rambling speeches.”

Provide detailed syntactic dependency structure:

(((Ataon’ < (ny > mpanao < fihetsiketsehana)) < hoe < mpikiky < manko) < (i > Gaddafi))
Atoan’ < noho < (ny_1 > kabariny < lavareny)

Or focus on predicate/arguments:

Ataon’ < (ny_1 mpanao* fihetsiketsehana)
Atoan’ < (hoe* mpikiky manko)
Atoan’ < (i Gaddafi*)
Atoan’ < noho < (ny_2 kabariny lavareny)

GFL (CMU/Texas) & AMR (ISI)

The classic: “Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .”

GFL annotation:

join < [ Pierre Vinken ]
join < board
join < as < director
join < [Nov. 29]
nonexecutive > director
61 > years > old > [ Pierre Vinken ]

Penn Treebank parse:

( (S
    (NP-SBJ
      (NP (NNP Pierre) (NNP Vinken) )
      (, ,)
      (ADJP (NP (CD 61) (NNS years) ) (JJ old) )
      (, ,) )
    (VP (MD will)
      (VP (VB join)
        (NP (DT the) (NN board) )
        (PP-CLR (IN as)
          (NP (DT a) (JJ nonexecutive) (NN director) ))
        (NP-TMP (NNP Nov.) (CD 29) )))
    (. .) ))
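The simple arrow chains in the Pierre Vinken annotation can be read off programmatically. The sketch below assumes each arrow points from a dependent to its head (so `nonexecutive > director` makes “director” the head) and treats a square-bracketed group as a single multiword node. It handles only flat chains, not the parentheses, braces, or `::` constructs of full GFL, and `chain_edges` is a hypothetical helper rather than part of the GFL tool suite.

```python
import re

# A bracketed multiword node, an arrow, or a plain token.
TOKEN = re.compile(r"\[[^\]]*\]|[<>]|[^\s<>\[\]]+")

def chain_edges(line):
    """Return (dependent, head) pairs for one chain such as 'a > b < c'."""
    items = []
    for tok in TOKEN.findall(line):
        if tok.startswith("["):
            # Normalize "[ Pierre Vinken ]" to the single node "Pierre Vinken".
            tok = " ".join(tok.strip("[]").split())
        items.append(tok)
    edges = []
    # Items alternate node, arrow, node, arrow, node, ...
    for i in range(1, len(items) - 1, 2):
        left, arrow, right = items[i - 1], items[i], items[i + 1]
        if arrow == ">":
            edges.append((left, right))   # left depends on right
        elif arrow == "<":
            edges.append((right, left))   # right depends on left
        else:
            raise ValueError(f"expected an arrow, found {arrow!r}")
    return edges

annotation = [
    "join < [ Pierre Vinken ]",
    "join < board",
    "join < as < director",
    "join < [Nov. 29]",
    "nonexecutive > director",
    "61 > years > old > [ Pierre Vinken ]",
]
all_edges = [edge for line in annotation for edge in chain_edges(line)]
for dependent, head in all_edges:
    print(f"{dependent} -> {head}")
```

Words such as “will”, “the”, and “a” never appear in the annotation, which illustrates the underspecification GFL allows.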

Probabilistic Graph Grammars

WE CAN DERIVE AND TRANSFORM SEMANTIC GRAPHS:

[Figure: semantic graph for “the boy wants the girl to believe he is wanted”, with nodes WANT, BELIEVE, WANT, BOY, GIRL connected by instance, ARG0, and ARG1 edges.]

WITH BASIC RULES LIKE THIS:

[Figure: graph grammar rule for “the boy wants something involving himself”: a WANT node with instance, ARG0, and ARG1 edges, involving B and the nonterminal X.]


Deciphering Foreign Language

[Figure: English text and foreign text (not translations of each other) → deciphering engine → bilingual word-for-word dictionary.]

[Figure: accuracy of learned bilingual dictionary plotted against how much foreign text (running words) is used; test on Spanish. Curves: ngram-based (Dou & Knight 2012) and dependency-based (Dou & Knight 2013).]

Linguistic analysis helps substantially!

Constituent Structure Trees

Strength: You can use tests for constituency (movement, deletion, substitution, coordination) to get reproducible results for corpus annotation.

Weaknesses:
1. Tests for constituency sometimes fail to provide reproducible results. The five trees (based on an exercise in Radford, 1988) have each been proposed in a published paper and can each be defended by tests for constituency.
2. People do not have uniform intuitions about which tree is “correct”.

Morpho-syntactics

Iñupiaq (North Slope Alaska): Tauqsiġñiaġviŋmuŋniaŋitchugut. ‘We won’t go to the store.’

Mathematical Foundations for Semantics-Based Machine Translation

• Previous MT systems have been based on clean string automata and tree automata

• General-purpose algorithms have been worked out (in part by MT scientists), with wide applicability
– Software toolkits even implement those algorithms

• But new models of meaning-based MT deal in semantic graph structures

• Foreign string → Meaning graph → English string

• QUESTION: Do efficient, general-purpose algorithms for graph automata exist to support these linguistic models?

General-Purpose Algorithms for Manipulating Linguistic Structures: Acceptors

• String Acceptors: successfully applied to speech recognition
• Tree Acceptors: successfully applied to syntax-based MT
• Graph Acceptors: now being applied to semantics-based MT

Membership checking
– Strings: … of string (length n) in WFSA. O(n) if WFSA is determinized.
– Trees: … of tree in forest. O(n) if determinized.
– Graphs: … of graph in hyperedge-replacement grammar (HERG) (Drewes 97). New algorithm: Chiang (forthcoming), O((2^d n)^{k+1}) : d & n properties of individual grammar

k-best
– Strings: … best k paths through a WFSA with n states and e edges (Viterbi 67; Eppstein 98). O(e + n log n + k log k)
– Trees: … trees in a weighted forest (Jiménez & Marzal 00; Huang & Chiang 05). O(e + n k log k)
– Graphs: … graphs in weighted HERG. Efficient Huang & Chiang results carry over.

EM training of probabilistic weights
– Strings: Forward-backward EM (Baum/Welch 71; Eisner 03). O(n)
– Trees: Tree acceptor training (Graehl & Knight 04). O(n)
– Graphs: Efficient Graehl & Knight results carry over.

Intersection
– Strings: WFSA intersection. O(n^2) classical
– Trees: Tree acceptor intersection. O(n^2) classical
– Graphs: Graph acceptor intersection. NOT CLOSED (in general)

co-PI supported under MURI project

General-Purpose Algorithms for Feature Structures (Graphs)

Each row lists the String World, Tree World, and Graph World entries.

Acceptor
– String World: Finite-state acceptors
– Tree World: Tree automata
– Graph World: HRG

Transducer
– String World: Finite-state transducers
– Tree World: Tree transducers
– Graph World: Synchronous HRG

Membership checking
– String World: O(n)
– Tree World: O(n) for trees, O(n^3) for strings
– Graph World: O(n^{k+1}) for graphs

N-best
– String World: … paths through a WFSA (Viterbi, 1967; Eppstein, 1998)
– Tree World: … trees in a weighted forest (Jiménez & Marzal, 2000; Huang & Chiang, 2005)
– Graph World: … graphs in a weighted forest

EM training
– String World: Forward-backward EM (Baum/Welch, 1971; Eisner 2003)
– Tree World: Tree transducer EM training (Graehl & Knight, 2004)
– Graph World: EM on forests of graphs

Intersection
– String World: WFSA intersection
– Tree World: Tree acceptor intersection
– Graph World: Not closed

Transducer composition
– String World: WFST composition (Pereira & Riley, 1996)
– Tree World: Many tree transducers not closed under composition (Maletti et al 09)
– Graph World: Not closed

General tools
– String World: Carmel, OpenFST
– Tree World: Tiburon (May & Knight 10)
– Graph World: Bolinas

Functional Collaboration

[Figure: project organization diagram.
Teams: Linguistic Core Team (LL, JB, SV, JC); MT Systems Team (KK, DC, SV, JC).
Boxes: Parser, Taggers, Morph. Analyzers; Triple Gold Data; Triple Ungold Data; MT Visualizations and logs; MT Features; MT Error Analysis; MT Systems; Inference Algorithms; Elicitation corpus; Data selection for annotation.
Data: Parallel, Monolingual, Elicited.]

Malagasy Resources

                             Tokens      Types     Hapax
Bible (Year 1)               579,578     19,460    8,401
Leipzig corpus (Year 2)      618,282     41,462    23,659
CMU Global Voices (Year 2)   2,148,976   84,744    46,627
Total                        3,346,836   115,172   62,517

Malagasy - English Resources

                             eng-Tokens  eng-Types  mlg-Tokens  mlg-Types
Bible (Year 1)               584,872     13,084     579,578     19,460
CMU Global Voices (Year 2)   1,785,472   63,357     2,148,976   84,744
Total                        2,370,344   67,790     3,346,836   115,172
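The three columns of the Malagasy Resources table can be computed as in the minimal sketch below, assuming whitespace tokenization and lowercasing (the project's actual tokenization is not stated on the slide). The sample string is a toy example, not project data, and `corpus_stats` is a hypothetical helper.

```python
from collections import Counter

def corpus_stats(text):
    """Count tokens, distinct types, and hapax legomena (types seen once)."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    hapax = sum(1 for c in counts.values() if c == 1)
    return {"tokens": len(tokens), "types": len(counts), "hapax": hapax}

sample = "ny teny ny tany ny olona"   # toy string for illustration only
stats = corpus_stats(sample)
print(stats)

# Tokens add up across corpora, but types do not: a word shared by two
# corpora is one type in the union.
merged = corpus_stats(sample + " ny olona tsara")
```

This is consistent with the Total row above: the token counts sum to 3,346,836, while the 115,172 total types are fewer than the sum of the per-corpus type counts.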

Evolutionary Tree of MT Paradigms Prior to LCMT

[Figure: timeline running from 1950 through 1980 to 2012. Nodes: Transfer MT, Decoding MT, Analogy MT, Large-scale TMT, Interlingua MT, Example-based MT, Context-Based MT, Statistical MT, Phrasal SMT, Transfer MT w stat phrases, SMT on syntax struct., LCMT.]

Model Parameters

• Distribution over number of arguments given the parent tag

• Weights for selection features, shared across all set sizes

• Weights for ordering features

All parameters are shared across languages

Malagasy Language Modeling

Model          Data    Seq. X-ent   Word X-ent   Total X-ent   Perplexity   OOVs
3-gram+char    Bible   10.35        7.66         18.01         264,323      23.94%
3-gram+char    GV      7.02         1.14         8.16          286.0        3.30%
3-gram+morph   GV      7.02         0.90         7.92          241.4        3.30%
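The table's columns appear to be related by simple arithmetic: total cross-entropy is the sum of the sequence and word cross-entropies, and perplexity is 2 raised to the total. The sketch below checks this under the assumption that cross-entropy is measured in bits; that unit is not stated on the slide, and the function name is hypothetical.

```python
def perplexity(cross_entropy_bits):
    """Perplexity implied by a cross-entropy in bits: 2 ** H."""
    return 2.0 ** cross_entropy_bits

# (sequence x-ent, word x-ent, reported perplexity) copied from the table.
rows = {
    "3-gram+char / Bible": (10.35, 7.66, 264323.0),
    "3-gram+char / GV": (7.02, 1.14, 286.0),
    "3-gram+morph / GV": (7.02, 0.90, 241.4),
}

for name, (seq, word, reported) in rows.items():
    total = seq + word
    implied = perplexity(total)
    print(f"{name}: total={total:.2f} bits, implied={implied:.1f}, reported={reported}")
```

The implied values agree with the reported perplexities up to the two-decimal rounding of the cross-entropy figures, which supports the base-2 reading.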

• Successes
– Malagasy analyzer has << 100% coverage, but we still get substantial gains
• Year 3 Goals
– Improve word sequence model with morphosyntactic information
– Improve coverage of Malagasy morphological phenomena
– Incorporation in MT system
– Kinyarwanda analyzer/generator under development

How CMU, ISI, UT and MIT collaborate

• Monthly teleconference calls
– Focused on management and project coordination
– Technical topics follow when appropriate
• Semi-annual face-to-face meetings
– Last ones in Nov 2012 and March 2013
– Include students/postdocs, etc. Focused on science
• Much more frequent focused calls/chats/etc.
– Data collection, annotations, SW APIs, brainstorming new algorithms, …
– Sharing/reviewing results and papers
• Website/repository + shared SW/data sets + papers + more goodies
– www.linguisticcore.info
• Student exchanges (e.g. week, month, summer)
• Occasional individual faculty trips
• Combined research (GFL, AMR parsing, CCG parsing, decipherment, …)
