TRANSCRIPT
The Linguistic-Core Approach to Structured Translation and Analysis
of Low-Resource Languages
November 2014 Project Review at ARL
Jaime Carbonell (CMU) & Team
MURI via ARO (PM: Joseph Myers)

The Faculty
CMU: Jaime Carbonell, Noah Smith, Lori Levin, Chris Dyer
USC-ISI: Kevin Knight
Notre Dame: David Chiang
MIT: Regina Barzilay
UT Austin: Jason Baldridge
Supporting roles: 2 other PhDs, 8 grad students, 3 postdocs, N undergrads
LCMT: The Elevator Pitch
• The fundamental challenge
  – “Modern” MT requires massive parallel data
  – There are 7,000+ languages with scant parallel data
  – Rule-based MT requires extensive trained-linguist effort
• The linguistic-core approach
  – Goal: 90% of the linguistic benefit with 10% of the linguist effort
  – Annotation both deep and light; linguistics from “lay” bilinguals
  – Augmented with machine learning from bilingual and monolingual text
• Accomplishments to date
  – Theory: GFL, graph semantics, AMR & other parsers, sparse ML training, linguistically-anchored models, … 40+ papers
  – Tool suites: GFL, TurboParser, MT-in-works, Morph, SuperTag, …
  – Languages: Kinyarwanda, Malagasy, Swahili, Yoruba
The Setting
• MURI Languages
  – Kinyarwanda: Bantu (7.5M speakers)
  – Malagasy: Malayo-Polynesian (14.5M)
  – Swahili: Bantu (5M native, 150M 2nd/3rd)
  – Yoruba: Niger-Congo (22+M)
• Morpho-syntactics example, Swahili: Anamwona, “he is seeing him/her”
Which MT Paradigms are Best? Towards Filling the Table

                     Large T   Med T   Small T
Source:   Large S    SMT       LCMT    LCMT
          Med S      LCMT      ???     ???
          Small S    LCMT      ???     ???

• “Old” DARPA MT: Large S, Large T
  – Arabic-English; Chinese-English
Evolutionary Tree of MT Paradigms, Leading up to LCMT
[Figure: evolutionary tree spanning 1950 to 2014, with paradigms Decoding MT, Transfer MT, Interlingua MT, Analogy MT, Example-based MT, Context-Based MT, Large-scale TMT, Statistical MT, Phrasal SMT, Transfer MT with statistical phrases, and SMT with syntax, all leading to LCMT.]
Linguistically Omnivorous Parsing

Inputs: linguistic universals, GFL annotated corpus, unannotated corpus, small CCG lexicon
Outputs: parsers (CMU, Texas, MIT)

Example dependencies: “He has been writing a letter.”
Example Abstract Meaning Representation (ISI, CMU, MIT):
(j / join-01 :ARG0 (p / person :name (p2 / name :op1 "Pierre" :op2 "Vinken") :age (t / temporal-quantity :quant 61 :unit (y / year))) :ARG1 (b / board) :prep-as (d2 / director :mod (e / executive :polarity -)) :time (d / date-entity :month 11 :day 29))
Original Vision (workflow diagram)

Teams: Linguistic Core Team (LL, JB, SV, JC); Linguistic Analyzers Team (NS, RB, JB); MT Systems Team (KK, DC, SV, JC)
Components: parsers, taggers, morphological analyzers; hand-built linguistic core; triple gold data; triple ungold data; MT visualizations and logs; MT features; MT error analysis; MT systems; inference algorithms
Data: parallel, monolingual, elicited, related-language, multi-parallel, comparable; elicitation corpus; data selection for annotation
Current Vision (workflow diagram)

Teams: Linguistic Core Team (LL, JB, CD, JC); Linguistic Analyzers Team (NS, RB, JB); MT Systems Team (KK, DC, CD, JC)
Components: parsers, taggers, semantic analyzers; hand-built linguistic core; triple gold and GFL-annotated data; string/tree/graph transducers; complex morphological analyzers; dependency parses; MT/TA error analysis; MT systems and TA modules; semantic parsing algorithms; definiteness/discourse
Data: parallel, monolingual, elicited, related-language, multi-parallel, comparable; elicitation corpus; data selection for annotation
PFA Node Alignment Algorithm Example
• The tree-tree aligner enforces equivalence constraints and optimizes over terminal alignment scores (words/phrases)
• Resulting aligned nodes are highlighted in the figure
• Transfer rules are partially lexicalized and read off the tree
LCMT: NLP Workflow and Tools (diagram)

Inputs: annotated data; unannotated data
Tools: supervised POS taggers; semi-supervised POS taggers; unsupervised dependency parsers; GFL annotator framework; supervised dependency & AMR parsers; semi-supervised dependency parsers; tree-graph syntactic/semantic transducers
Sites: CMU, Texas, MIT, ISI
(= tool-suite software; more to come)
Machine Translation Paradigms
• Phrase-based MT (LCMT 20+% of effort): source string → target string
• Morph-syntax-based MT (LCMT 30+%): source string → source tree → target tree → target string
• Meaning-based MT (LCMT 40+%): source string → source tree → meaning representation → target tree → target string
[Chart: NIST 2009 c2e scores, 0-35 scale]
Some Key Results to Date
• Theory of transducers (string, tree, graph)
• Massive lexical borrowing across diverse languages
• Linguistic universals
  – Dependencies, semantic roles, conservation, AMR, discourse, …
• Statistical learning over strings, trees, graphs
  – Bayesian, HMM/CRF, active sampling of model parameters
  – Parsing into deep semantics (AMR)
• MT demonstrations: focus on M, K, S, Y, but also across ~20 languages (WMT honors, synthetic phrases)
• A suite of 11 serious software modules and tools (morphology, variable-depth linguistic annotation, dependency parsing, MT, …)
• Current scientific challenges
  – Is general graph-topology induction possible?
  – Bridging structural divergences via semi-universals?
  – Semantic invariance: lexical, structural, non-propositional?
List of “Firsts” for the Linguistic Core
• First use of models incorporating linguistic knowledge in the form of hand-written morpho-grammatical rules combined with limited-volume corpus statistics.
• First use of models of lexical “borrowing” from other (major) languages to improve translation and analysis of low-resource languages (publication in prep).
• First efficient and exact probabilistic model for structured prediction with arbitrary syntactic and semantic dependencies derived from the input language.
• First exploitation of large monolingual foreign text collections (vs. bilingually translated collections) to improve low-density MT, by treating foreign text as a mapped/encoded version of English.
• First application of formal graph transduction theory to natural language analysis; earlier efforts applied string and tree transduction theory only.
• First substantial corpora annotated cheaply by novices and used to build effective NLP tools.
• First statistical parser to map language into an abstract meaning representation of semantics.
• First to show that, for resource-impoverished languages, a multilingual parser based on language universals outperforms a parser trained only on the target language.
• First analyses to prove formally, and demonstrate empirically, that inference in dependency parsing is computationally easy in the average case (despite being NP-hard in the worst case).
External Honors for the LC Project
• Best human judgments of English-Russian translations at WMT 2013
• Best BLEU on Hindi-English translation at WMT 2014
• Best student paper, ACL 2014: Low-Rank Tensors for Scoring Dependency Structures. Tao Lei, Yu Xin, Yuan Zhang, Regina Barzilay and Tommi Jaakkola. http://people.csail.mit.edu/taolei/papers/acl2014.pdf
• Best paper, honorable mention, ACL 2014: A Discriminative Graph-Based Parser for the Abstract Meaning Representation. Jeffrey Flanigan, Sam Thomson, Jaime Carbonell, Chris Dyer and Noah A. Smith. http://www.cs.cmu.edu/~jmflanig/flanigan+etal.acl2014.pdf
• Best paper, runner-up, EMNLP 2014: Language Modeling with Power Low Rank Ensembles. http://www.aclweb.org/anthology/D/D14/D14-1158.pdf
• Best paper (one of four), NIPS 2014: Conditional Random Field Autoencoders for Unsupervised Structured Prediction. Waleed Ammar, Chris Dyer, and Noah A. Smith
Lexical Borrowing of Common Words

Swahili Morphology Using a Crowdsourced Lexicon
Patrick Littell, Lori Levin, Chris Dyer
• No provenance: the root of the word was collected by hand.
• [GUESS1]: the root is inferred from the Kamusi lexicon part-of-speech tag, including noun class.
• [GUESS2]: the root is from Kamusi, but no noun class is given.
• [GUESS3]: possible English loan word.
• [GUESS4]: complete guess.
FST written by Patrick Littell; lexicon extracted from dictionaries and textbooks.
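The actual FST is not shown on the slide. As a minimal sketch of what morphological analysis of a Swahili verb like Anamwona involves, here is a toy segmenter with a hand-picked affix inventory (standard textbook subject/tense/object prefixes) and a tiny invented root lexicon; it is illustrative only, not the Littell FST or the Kamusi lexicon.

```python
# Toy Swahili verb segmenter: strip a subject prefix, a tense marker, and an
# optional object prefix, then look the remainder up in a tiny root lexicon.
# Affix inventories and the lexicon are illustrative, not the project's FST.

SUBJECT = {"ni": "1sg", "u": "2sg", "a": "3sg", "tu": "1pl", "m": "2pl", "wa": "3pl"}
TENSE = {"na": "PRES", "li": "PAST", "ta": "FUT"}
OBJECT = {"ni": "1sg.OBJ", "ku": "2sg.OBJ", "mw": "3sg.OBJ", "m": "3sg.OBJ"}
ROOTS = {"ona": "see", "penda": "love", "soma": "read"}

def segment(verb):
    """Return (morphemes, glosses) for the first analysis found, else None."""
    for s in sorted(SUBJECT, key=len, reverse=True):      # longest match first
        if not verb.startswith(s):
            continue
        rest1 = verb[len(s):]
        for t in TENSE:
            if not rest1.startswith(t):
                continue
            rest2 = rest1[len(t):]
            candidates = [("", None, rest2)]              # object prefix is optional
            for o in sorted(OBJECT, key=len, reverse=True):
                if rest2.startswith(o):
                    candidates.append((o, OBJECT[o], rest2[len(o):]))
            for o, ogloss, root in candidates:
                if root in ROOTS:
                    morphs = [s, t] + ([o] if o else []) + [root]
                    gloss = [SUBJECT[s], TENSE[t]] + ([ogloss] if ogloss else []) + [ROOTS[root]]
                    return morphs, gloss
    return None

print(segment("anamwona"))   # a-na-mw-ona: 3sg-PRES-3sg.OBJ-see
```

A real FST composes these steps declaratively and handles morphophonological alternations; the point here is just the prefix-stacking structure of the Swahili verb.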
Parsing Progress (F1)
On the CoNLL dataset: 88.72 (CMU), 89.44 (MIT)
CCG Supertagging

Example: “the lazy dogs wander”, with candidate categories such as np/n, n/n, n, np, s\np, (s\np)/np, …
Approach: HMM supertagger with linguistically-motivated priors.
Parsing into AMR (ACL 2014 honorable mention for best paper)

Approximately 11000 guards patrol the 1200 - kilometre border between Russia and Afghanistan.

(p / patrol-01 :ARG0 (g / guard :quant (a2 / approximately :op1 11000)) :ARG1 (b / border :quant (d4 / distance-quantity :unit (k2 / kilometer) :quant 1200) :location (b2 / between :op1 (c / country :name (n / name :op1 "Russia")) :op2 (c2 / country :name (n2 / name :op1 "Afghanistan")))))

[Figure: concept fragments identified per word, then relations added to connect them.]
New results: 61% F1 (CMU and ISI)
Unsupervised Part-of-Speech Tagging

[Chart: V-measure (higher is better) on Arabic, Basque, Danish, Greek, Hungarian, Italian, Kinyarwanda, Malagasy, Turkish, and average, comparing the conditional random field autoencoder against the classic and featurized hidden Markov models.]
Automatic Classification of the Communicative Functions of Definiteness

Pipeline: annotated corpus (semantics of definiteness) + syntactic features extracted from a dependency parser → logistic regression classifier → predicted semantic functions of definiteness: 78.2% accuracy

Why definiteness:
• One instance of non-propositional semantics
• Major determinant of word order
• Wildly divergent in morpho-syntactic expression
• Problems in word alignment and language models
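The pipeline's final step is a standard sparse logistic regression over syntactic features. A self-contained sketch follows; the feature names, training data, and labels are fabricated for illustration (the reported 78.2% came from the project's real annotations, not anything like this toy).

```python
import math

# Toy logistic regression over sparse syntactic features, in the spirit of
# the definiteness classifier. Everything below is made up for illustration.

def predict(weights, feats):
    z = sum(weights.get(f, 0.0) for f in feats)
    return 1.0 / (1.0 + math.exp(-z))

def train(data, epochs=200, lr=0.5):
    """Plain stochastic gradient ascent on the log-likelihood."""
    weights = {}
    for _ in range(epochs):
        for feats, label in data:
            error = label - predict(weights, feats)
            for f in feats:
                weights[f] = weights.get(f, 0.0) + lr * error
    return weights

# label 1 = anaphoric definite, label 0 = first mention (invented examples)
data = [
    (["det=the", "head=dog", "prev_mention=yes"], 1),
    (["det=the", "head=cat", "prev_mention=yes"], 1),
    (["det=a", "head=dog", "prev_mention=no"], 0),
    (["det=a", "head=idea", "prev_mention=no"], 0),
]
w = train(data)
print(predict(w, ["det=the", "prev_mention=yes"]) > 0.5)
```

The real classifier's features came from dependency parses (head, governing preposition, modifiers, and so on); the learning machinery is the same.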
Integrating Alignment and Decipherment for Better Low-Density MT

• Small bilingual Malagasy/English text (need to align words [Brown et al. 93])
• Large Malagasy monolingual text (need to decipher [Dou & Knight 13])
• Joint modeling: decipherment helps word alignment, and decipherment helps machine translation (BLEU)
(ISI jointly with CMU/Texas/MIT)
Graph Formalisms for Language Understanding and Generation

Investigating: linguistically adequate representations; efficient algorithms.
Using them in: Text → Meaning (NLU); Meaning → Text (NLG); meaning-based MT.

String automata vs. tree automata algorithms:
• N-best answer extraction: … paths through a WFSA (Viterbi, 1967; Eppstein, 1998); … trees in a weighted forest (Jiménez & Marzal, 2000; Huang & Chiang, 2005)
• Unsupervised EM training: forward-backward EM (Baum/Welch, 1971; Eisner, 2003); tree transducer EM training (Graehl & Knight, 2004)
• Determinization, minimization: … of weighted string acceptors (Mohri, 1997); … of weighted tree acceptors (Borchardt & Vogler, 2003; May & Knight, 2005)
• Intersection: WFSA intersection; tree acceptor intersection
• Application of transducers: string → WFST → WFSA; tree → TT → weighted tree acceptor
• Composition of transducers: WFST composition (Pereira & Riley, 1996); many tree transducers not closed under composition (Maletti et al., 2009)
• Software tools: Carmel, OpenFST; Tiburon (May & Knight, 2010)

Graph automata algorithms: the third column this project is filling in.
(ISI jointly with CMU)
Tajik Named Entity Recognition: Resources and Components (diagram)
• NE gold standard, native speaker (Azim)
• NE “pyrite” standard, linguist (Alexa, Lori)
• Morphology lists for NE (Alexa, David)
• Morphological analyzers (Swabha, Chris)
• Gazetteers (Pat, Chris)
• Brown clusters (Kartik)
• Tajik corpus from the Leipzig archive
• Supervision (Azim; David)
• Tajik and Persian Wikipedias
• Tajik reference grammar (Perry, 2005)
• PerLex Persian lexicon (Sagot and Walther, 2010)
• IPA converter (Kartik, Pat, Chris)
• Named entity recognizer (Kartik, Chris)
• Persian treebank (Rasooli et al., 2013)
• Tajik POS tagger (Chris)
• Persian-Tajik converter (Chris)
New Results for Graph Automata for Mapping Between Text and Meaning

                          Strings                                  Graphs
                     FSA      CFG       DAG acceptor     HRG
probabilistic        yes      yes       yes              yes
intersects with
finite-state         yes      yes       yes              yes
EM training          yes      yes       yes              yes
transduction         O(n)     O(n^3)    O(|Q|^(T+1) n)   O((3dn)^(T+1))
implemented          yes      yes       yes              yes

d = graph degree for AMR, high in practice
T = treewidth for AMR, low in practice (2-3)
Next Steps (high-level overview)
• Finalize MT systems: K, M, S, Y
  – Package and make available externally
  – Possibly integrate with government translator workbench
  – Compare with government systems when available and appropriate (e.g. Malagasy with Carl Rubino)
• Complete scientific investigations (graph transduction, MT with AMR, supertagging parsing, borrowing++, …)
• Document and distribute tool suites (rapid annotation, morphology, CCG supertagging, dependency parsing, AMR parsing, generation, end-to-end MT, lexicon borrowing, ML modules, …): 15 +/-
• Publish, publish, publish (40+ papers and counting)
• Detailed next steps at the end of each major presentation

Jaime Carbonell, CMU
THANK YOU!
Supplementary Slides
Select/show as needed for discussion period
Tag Dictionary Generalization

Raw corpus: “the dog walks” → DT NN VBZ
Token annotations: TOK_the_1, TOK_dog_2, TOK_the_4, TOK_thug_5
Type annotations: TYPE_the, TYPE_thug, TYPE_dog
Context features: NEXT_walks, PREV_<b>, PREV_the
Character features: PRE1_t, PRE2_th, SUF1_g
Any arbitrary features could be added.
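The feature templates named above are easy to make concrete. A small extractor sketch follows; the template names mirror the slide (TOK_, TYPE_, PREV_/NEXT_, PRE/SUF), but the function itself is an illustrative reconstruction, not the project's code.

```python
# Sketch of the slide's feature templates: token-level (TOK_), type-level
# (TYPE_), context (PREV_/NEXT_), and character prefix/suffix features.
# "<b>" marks the sentence boundary, as on the slide.

def token_features(tokens, i):
    w = tokens[i]
    feats = [f"TOK_{w}_{i + 1}", f"TYPE_{w}"]
    feats.append(f"PREV_{tokens[i - 1]}" if i > 0 else "PREV_<b>")
    feats.append(f"NEXT_{tokens[i + 1]}" if i + 1 < len(tokens) else "NEXT_<b>")
    feats += [f"PRE1_{w[:1]}", f"PRE2_{w[:2]}", f"SUF1_{w[-1:]}"]
    return feats

print(token_features(["the", "dog", "walks"], 1))
# e.g. TOK_dog_2, TYPE_dog, PREV_the, NEXT_walks, PRE1_d, PRE2_do, SUF1_g
```

Because type-level features like TYPE_thug fire on every occurrence of a word type, a tag learned for one token generalizes to unseen tokens of the same type, which is the point of the slide.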
Example: tree-to-string rules deriving “These 7 people include astronauts coming from France and Russia.” → 这 7 人 中包括 来自 法国 和 俄罗斯 的 宇航 员 .

RULE 1: DT(these) → 这
RULE 2: VBP(include) → 中包括
RULE 4: NNP(France) → 法国
RULE 5: CC(and) → 和
RULE 6: NNP(Russia) → 俄罗斯
RULE 8: NP(NNS(astronauts)) → 宇航 , 员
RULE 9: PUNC(.) → .
RULE 10: NP(x0:DT, CD(7), NNS(people)) → x0 , 7 , 人
RULE 11: VP(VBG(coming), PP(IN(from), x0:NP)) → 来自 , x0
RULE 13: NP(x0:NNP, x1:CC, x2:NNP) → x0 , x1 , x2
RULE 14: VP(x0:VBP, x1:NP) → x0 , x1
RULE 15: S(x0:NP, x1:VP, x2:PUNC) → x0 , x1 , x2
RULE 16: NP(x0:NP, x1:VP) → x1 , 的 , x0

Covered source spans: “these”, “Russia”, “astronauts”, “.”, “include”, “France”, “and”; “France and Russia”; “coming from France and Russia”; “astronauts coming from France and Russia”; “these 7 people”; “include astronauts coming from France and Russia”
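These rules can be applied mechanically to re-derive the Chinese output. Below is a hedged sketch: trees as nested tuples, a simple first-match pattern matcher, with the English parse tree inferred from the covered spans listed on the slide. The rule inventory is the slide's; the matcher and tree encoding are my reconstruction, not the project's decoder.

```python
# Tree-to-string rule application: each rule pairs a source-tree pattern
# with a target template; "x0", "x1", ... are variables. Specific rules are
# listed before more general ones so first-match works.

RULES = [
    (("DT", "these"), ["这"]),
    (("VBP", "include"), ["中包括"]),
    (("NNP", "France"), ["法国"]),
    (("CC", "and"), ["和"]),
    (("NNP", "Russia"), ["俄罗斯"]),
    (("NP", ("NNS", "astronauts")), ["宇航", "员"]),
    (("PUNC", "."), ["."]),
    (("NP", "x0", ("CD", "7"), ("NNS", "people")), ["x0", "7", "人"]),         # RULE 10
    (("VP", ("VBG", "coming"), ("PP", ("IN", "from"), "x0")), ["来自", "x0"]),  # RULE 11
    (("NP", "x0", "x1", "x2"), ["x0", "x1", "x2"]),                             # RULE 13
    (("VP", "x0", "x1"), ["x0", "x1"]),                                         # RULE 14
    (("NP", "x0", "x1"), ["x1", "的", "x0"]),                                   # RULE 16
    (("S", "x0", "x1", "x2"), ["x0", "x1", "x2"]),                              # RULE 15
]

def match(pattern, tree, binds):
    if isinstance(pattern, str):
        if pattern.startswith("x") and pattern[1:].isdigit():
            binds[pattern] = tree        # variable: bind the subtree
            return True
        return pattern == tree           # literal label or word
    return (isinstance(tree, tuple) and len(tree) == len(pattern)
            and all(match(p, t, binds) for p, t in zip(pattern, tree)))

def translate(tree):
    for pattern, template in RULES:
        binds = {}
        if match(pattern, tree, binds):
            out = []
            for item in template:
                out += translate(binds[item]) if item in binds else [item]
            return out
    raise ValueError(f"no rule for {tree}")

tree = ("S",
        ("NP", ("DT", "these"), ("CD", "7"), ("NNS", "people")),
        ("VP", ("VBP", "include"),
               ("NP", ("NP", ("NNS", "astronauts")),
                      ("VP", ("VBG", "coming"),
                             ("PP", ("IN", "from"),
                                    ("NP", ("NNP", "France"), ("CC", "and"),
                                           ("NNP", "Russia")))))),
        ("PUNC", "."))
print(" ".join(translate(tree)))
# 这 7 人 中包括 来自 法国 和 俄罗斯 的 宇航 员 .
```

RULE 16 is where the interesting reordering happens: the reduced relative clause moves before the noun and 的 is inserted, which no phrase-pair model captures as cleanly.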
Model Minimization

[Figure: HMM tagging lattice for “<b> The man saw the saw <b>” (positions 0-6) over states <b>, DT, NN, VBD, with transition probabilities such as 1.0, 0.8, 0.2, 0.4, 0.7, 0.3, 0.6.]
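Decoding such a lattice is standard Viterbi. Here is a toy sketch in the spirit of the figure; the transition and emission probabilities below are made up for illustration (the figure's own numbers are not fully recoverable from the slide).

```python
import math

# Viterbi decoding over a toy bigram HMM: tag "the man saw the saw" with
# states DT/NN/VBD and boundary symbol <b>. Probabilities are illustrative.

TRANS = {"<b>": {"DT": 0.8, "NN": 0.2},
         "DT": {"NN": 1.0},
         "NN": {"VBD": 0.6, "NN": 0.1, "<b>": 0.3},
         "VBD": {"DT": 0.7, "<b>": 0.3}}
EMIT = {"DT": {"the": 1.0},
        "NN": {"man": 0.5, "saw": 0.5},
        "VBD": {"saw": 1.0}}
TINY = 1e-12   # stand-in for zero probability, to stay in log space

def logp(table, a, b):
    return math.log(table.get(a, {}).get(b, TINY))

def viterbi(words):
    states = list(EMIT)
    V = [{s: logp(TRANS, "<b>", s) + logp(EMIT, s, words[0]) for s in states}]
    back = [{}]
    for t in range(1, len(words)):
        V.append({}); back.append({})
        for s in states:
            prev = max(states, key=lambda p: V[t - 1][p] + logp(TRANS, p, s))
            V[t][s] = V[t - 1][prev] + logp(TRANS, prev, s) + logp(EMIT, s, words[t])
            back[t][s] = prev
    last = max(states, key=lambda s: V[-1][s] + logp(TRANS, s, "<b>"))
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

print(viterbi("the man saw the saw".split()))   # ['DT', 'NN', 'VBD', 'DT', 'NN']
```

Note how the second “saw” is tagged NN rather than VBD because DT only transitions to NN: the ambiguity is resolved by the path, not the word.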
Linguistically Opportunistic Parsing (supplementary; repeats the earlier parsing-overview diagram)

Inputs: linguistic universals, GFL annotated corpus, unannotated corpus, small CCG lexicon → parsers (CMU, Texas, MIT)
Examples: dependencies for “He has been writing a letter.”; AMR for the Pierre Vinken sentence (ISI, CMU, MIT)
Fragmentary Unlabeled Dependency Grammar
(Schneider, O’Connor, Saphra, Bamman, Faruqui, Smith, Dyer, and Baldridge, 2013)
• Represents unlabeled dependencies
• Special handling for multiword expressions, coordination, and anaphora
• Allows underspecification
• Graph fragment language for easy annotation

Graph Fragment Language (GFL)

{Our three} > weapons > are < $a
$a :: {fear surprise efficiency} :: {and~1 and~2}
ruthless > efficiency
Provide a detailed analysis of coordination…

(Our three weapons*) > are < (fear surprise and ruthless efficiency)
Or focus just on the high level…

(((Ataon’ < (ny > mpanao < fihetsiketsehana)) < hoe < mpikiky < manko) < (i > Gaddafi))
Ataon’ < noho < (ny_1 > kabariny < lavareny)
Provide detailed syntactic dependency structure…

Ataon’ < (ny_1 mpanao* fihetsiketsehana)
Ataon’ < (hoe* mpikiky manko)
Ataon’ < (i Gaddafi*)
Ataon’ < noho < (ny_2 kabariny lavareny)
Or focus on predicate/arguments…

“Gaddafi has referred to protesters as rodents in his rambling speeches.”
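The chain notation above reads mechanically: in `a > b`, a attaches under b; in `a < b`, b attaches under a. A minimal reader for flat chains only follows, ignoring brackets, braces, multiword `*` markers, and `$`-variables; it is an illustrative sketch, not the project's GFL tooling.

```python
# Reader for a simplified GFL-style dependency chain: tokens joined by ">"
# (left token attaches to the right one) or "<" (right token attaches to
# the left one). Returns (head, dependent) pairs.

def chain_to_edges(chain):
    """'a > b < c' -> [(head, dependent), ...]"""
    parts = chain.split()
    tokens, ops = parts[0::2], parts[1::2]
    edges = []
    for k, op in enumerate(ops):
        left, right = tokens[k], tokens[k + 1]
        edges.append((right, left) if op == ">" else (left, right))
    return edges

# Flat chains in the style of the Pierre Vinken GFL annotation:
print(chain_to_edges("join < as < director"))
print(chain_to_edges("61 > years > old"))
```

Full GFL then unions the edges from many such fragments, which is what lets an annotator mix detailed and high-level analyses of the same sentence.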
GFL (CMU/Texas) & AMR (ISI)
The classic: “Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .”

AMR:
(j / join-01 :ARG0 (p / person :name (p2 / name :op1 "Pierre" :op2 "Vinken") :age (t / temporal-quantity :quant 61 :unit (y / year))) :ARG1 (b / board) :prep-as (d2 / director :mod (e / executive :polarity -)) :time (d / date-entity :month 11 :day 29))

GFL:
join < [ Pierre Vinken ]
join < board
join < as < director
join < [Nov. 29]
nonexecutive > director
61 > years > old > [ Pierre Vinken ]

Penn Treebank:
( (S (NP-SBJ (NP (NNP Pierre) (NNP Vinken) ) (, ,) (ADJP (NP (CD 61) (NNS years) ) (JJ old) ) (, ,) ) (VP (MD will) (VP (VB join) (NP (DT the) (NN board) ) (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director) )) (NP-TMP (NNP Nov.) (CD 29) ))) (. .) ))
Probabilistic Graph Grammars

WE CAN DERIVE AND TRANSFORM SEMANTIC GRAPHS:
[Figure: semantic graphs with BOY, GIRL, WANT, and BELIEVE instance nodes linked by ARG0/ARG1 edges, for “the boy wants something involving himself” and “the boy wants the girl to believe he is wanted”, derived WITH BASIC RULES LIKE THIS: a rule rewriting an ARG1 variable X under a WANT instance.]
Example Parsing into AMR (supplementary; repeats the earlier AMR parsing example)

Approximately 11000 guards patrol the 1200 - kilometre border between Russia and Afghanistan.

(p / patrol-01 :ARG0 (g / guard :quant (a2 / approximately :op1 11000)) :ARG1 (b / border :quant (d4 / distance-quantity :unit (k2 / kilometer) :quant 1200) :location (b2 / between :op1 (c / country :name (n / name :op1 "Russia")) :op2 (c2 / country :name (n2 / name :op1 "Afghanistan")))))

[Figure: concept fragments identified per word, then relations added to connect them.]
Deciphering Foreign Language

[Chart: accuracy of the learned bilingual dictionary vs. amount of foreign text (running words), tested on Spanish; dependency-based decipherment (Dou & Knight 2013) beats n-gram-based (Dou & Knight 2012). Linguistic analysis helps substantially!]

Pipeline: English text + foreign text (not translations of each other) → deciphering engine → bilingual word-for-word dictionary
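Dou & Knight learn the dictionary with EM decipherment over n-gram or dependency contexts. As a deliberately naive stand-in that only shows the setting, two corpora that are not translations of each other, here is a frequency-rank baseline; it is illustrative only and is not their method.

```python
from collections import Counter

# Crude decipherment baseline: hypothesize that the i-th most frequent
# foreign word translates the i-th most frequent English word. Real
# decipherment instead runs EM over context statistics.

def rank_match(english_tokens, foreign_tokens, n=3):
    eng = [w for w, _ in Counter(english_tokens).most_common(n)]
    frn = [w for w, _ in Counter(foreign_tokens).most_common(n)]
    return dict(zip(frn, eng))

english = "the cat saw the dog and the dog saw the cat".split()
foreign = "le chat vit le chien et le chien vit le chat".split()
print(rank_match(english, foreign))
```

Frequency rank alone breaks down quickly on real text; EM decipherment succeeds because it also matches the contexts words occur in, and the dependency-based variant matches syntactic contexts, which is exactly the chart's point.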
Constituent Structure Trees

Strength: You can use tests for constituency (movement, deletion, substitution, coordination) to get reproducible results for corpus annotation.
Weaknesses:
1. Tests for constituency sometimes fail to provide reproducible results. The five trees (based on an exercise in Radford, 1988) have each been proposed in a published paper, and each can be defended by tests for constituency.
2. People do not have uniform intuitions about which tree is “correct”.
Morpho-syntactics: Iñupiaq (North Slope Alaska)
Tauqsiġñiaġviŋmuŋniaŋitchugut. ‘We won’t go to the store.’
Mathematical Foundations for Semantics-Based Machine Translation
• Previous MT systems have been based on clean string automata and tree automata
• General-purpose algorithms have been worked out (in part by MT scientists), with wide applicability
  – software toolkits even implement those algorithms
• But new models of meaning-based MT deal in semantic graph structures
• Foreign string → meaning graph → English string
• QUESTION: Do efficient, general-purpose algorithms for graph automata exist to support these linguistic models?
General-Purpose Algorithms for Manipulating Linguistic Structures: Acceptors

String acceptors: successfully applied to speech recognition
Tree acceptors: successfully applied to syntax-based MT
Graph acceptors: now being applied to semantics-based MT

• Membership checking
  – of a string (length n) in a WFSA: O(n) if the WFSA is determinized
  – of a tree in a forest: O(n) if determinized
  – of a graph in a hyperedge-replacement grammar (HERG) (Drewes 97); new algorithm: Chiang (forthcoming), O((2dn)^(k+1)), where d and n are properties of the individual grammar
• k-best
  – best k paths through a WFSA with n states and e edges (Viterbi 67; Eppstein 98): O(e + n log n + k log k)
  – trees in a weighted forest (Jiménez & Marzal 00; Huang & Chiang 05): O(e + n k log k)
  – graphs in a weighted HERG: the efficient Huang & Chiang results carry over
• EM training of probabilistic weights
  – forward-backward EM (Baum/Welch 71; Eisner 03): O(n)
  – tree acceptor training (Graehl & Knight 04): O(n)
  – graphs: the efficient Graehl & Knight results carry over
• Intersection
  – WFSA intersection: O(n^2), classical
  – tree acceptor intersection: O(n^2), classical
  – graph acceptor intersection: NOT CLOSED (in general)

(co-PI supported under MURI project)
General-Purpose Algorithms for Feature Structures (Graphs)

                      String World                 Tree World                            Graph World
Acceptor              Finite-state acceptors       Tree automata                         HRG
Transducer            Finite-state transducers     Tree transducers                      Synchronous HRG
Membership checking   O(n)                         O(n) for trees; O(n^3) for strings    O(n^(k+1)) for graphs
N-best                paths through a WFSA         trees in a weighted forest            graphs in a weighted forest
                      (Viterbi 1967;               (Jiménez & Marzal 2000;
                      Eppstein 1998)               Huang & Chiang 2005)
EM training           forward-backward EM          tree transducer EM training           EM on forests of graphs
                      (Baum/Welch 1971;            (Graehl & Knight 2004)
                      Eisner 2003)
Intersection          WFSA intersection            tree acceptor intersection            not closed
Transducer            WFST composition             many tree transducers not closed      not closed
composition           (Pereira & Riley 1996)       under composition (Maletti et al. 09)
General tools         Carmel, OpenFST              Tiburon (May & Knight 10)             Bolinas
Functional Collaboration
[Diagram: repeats the Original Vision workflow from earlier (teams, components, and data flows).]
Malagasy Resources

                            Tokens      Types     Hapax
Bible (Year 1)              579,578     19,460    8,401
Leipzig corpus (Year 2)     618,282     41,462    23,659
CMU Global Voices (Year 2)  2,148,976   84,744    46,627
Total                       3,346,836   115,172   62,517

Malagasy-English Resources

                            eng-Tokens  eng-Types  mlg-Tokens  mlg-Types
Bible (Year 1)              584,872     13,084     579,578     19,460
CMU Global Voices (Year 2)  1,785,472   63,357     2,148,976   84,744
Total                       2,370,344   67,790     3,346,836   115,172
Evolutionary Tree of MT Paradigms, Prior to LCMT
[Figure: timeline 1950 to 2012 with paradigms Decoding MT, Transfer MT, Interlingua MT, Analogy MT, Example-based MT, Context-Based MT, Large-scale TMT, Statistical MT, Phrasal SMT, Transfer MT with statistical phrases, and SMT on syntax structures, preceding LCMT.]
Model Parameters
• Distribution over number of arguments given the parent tag
• Weights for selection features, shared across all set sizes
• Weights for ordering features
All parameters are shared across languages
Malagasy Language Modeling

Model          Data   Seq. X-ent  Word X-ent  Total X-ent  Perplexity  OOVs
3-gram+char    Bible  10.35       7.66        18.01        264,323     23.94%
3-gram+char    GV     7.02        1.14        8.16         286.0       3.30%
3-gram+morph   GV     7.02        0.90        7.92         241.4       3.30%

• Successes
  – The Malagasy analyzer has << 100% coverage, but we still get substantial gains
• Year 3 goals
  – Improve the word sequence model with morphosyntactic information
  – Improve coverage of Malagasy morphological phenomena
  – Incorporation in the MT system
  – Kinyarwanda analyzer/generator under development
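The perplexity column follows directly from the total cross-entropy, assuming the cross-entropies are base-2 bits per word: PPL = 2^H. A quick check against the table's rows (they agree up to rounding of the reported cross-entropies):

```python
# Perplexity is two to the total cross-entropy (bits per word): PPL = 2 ** H.
# Checking the Malagasy language-modeling table's rows.

rows = [("3-gram+char, Bible", 18.01, 264323),
        ("3-gram+char, GV", 8.16, 286.0),
        ("3-gram+morph, GV", 7.92, 241.4)]
for name, xent, table_ppl in rows:
    print(f"{name}: 2**{xent} = {2 ** xent:,.1f} (table: {table_ppl:,})")
```

The 0.24-bit drop from adding morphology (8.16 to 7.92) is what buys the perplexity reduction from 286.0 to 241.4.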
How CMU, ISI, UT, and MIT Collaborate
• Monthly teleconference calls
  – Focused on management and project coordination
  – Technical topics follow when appropriate
• Semi-annual face-to-face meetings
  – Last ones in Nov 2012 and March 2013
  – Include students, postdocs, etc.; focused on science
• Much more frequent focused calls/chats/etc.
  – Data collection, annotations, SW APIs, brainstorming new algorithms, …
  – Sharing/reviewing results and papers
• Website/repository + shared SW/data sets + papers + more goodies
  – www.linguisticcore.info
• Student exchanges (e.g. week, month, summer)
• Occasional individual faculty trips
• Combined research (GFL, AMR parsing, CCG parsing, decipherment, …)