multilingual language processing - umiacs
Multilingual Language Processing 1 Hal Daumé III ([email protected])
Multilingual Language Processing
Hal Daumé III
Computer Science
University of Maryland
Blair Linguistics Club
19 Nov 2014
Piyush Rai (Duke)
Lyle Campbell (U Hawaii)
Sujith Ravi (Google)
Adam Teichert (JHU)
Statistics, Typology and NLP 2 Hal Daumé III ([email protected])
Why study O(100) languages?
➢ What makes a language a human language?
➢ What properties of “Language” can be learned from, or exploited in, text?
➢ Computational challenge of dealing with large, uncertain data sets
➢ You never know what language will be important tomorrow
➢ Pairwise models of language don't scale
➢ Hard to find linguists or translators for minority languages
Typical NLP pipeline
Analysis: Source Words → Source Morphology → Source Syntax → Source Shallow Semantics → Source Semantics → Interlingua
Generation: Interlingua → Target Semantics → Target Shallow Semantics → Target Syntax → Target Morphology → Target Words
Words: The man ate a sandwich
Morphology: The man eat+past a sandwich
Tagging: DT NN VB DT NN
Parsing: [S [NP The man] [VP ate [NP a sandwich]]]
Role labeling: Agent (the man), Theme (a sandwich)
Interpretation: ∃ a ∃ t ∃ e man(a) & sandwich(t) & eat(e,a,t) & past(e)
A unified approach
Raw Text
Linguistic Features
Annotated Treebanks
Typological Features
VO ⊃ PreP
PostP ⊃ OV
Parallel Data
Afrikaans, Albanian, Amuzgo, Arabic, Arabic (Syrian), Armenian, Armenian, Azerbaijani, Basque, Bulgarian, Burmese, Byzantine, Cakchiquel, Chamorro, Cherokee, Chinantec, Czech, Danish, Dutch, English, Esperanto, Estonian, Finnish, French, Gaelic, German, Greek, Gujarati, Haitian Creole, Hebrew, Hiligaynon, Hindi, Hungarian, Icelandic, Indonesian, Irish, Italian, Jacalteco, Kannada, K'ekchí, Klingon, Korean, Latin, Latvian, Lithuanian, Low German, Macedonian, Malagasy, Malayalam, Mam, Mam of Todos, Mandan, Mandarin, Maori, Nahuatl, Ndebele, Norwegian, Orya, Persian, Polish, Portuguese, Potawatomi, Quiché, Romanian, Romani, Russian, Serbian, Shona, Slovak, Somali, Spanish, Swahili, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian, Uma, Urdu, Uspanteco, Vietnamese
How does (e.g.) syntax work?
➢ Get some linguists to annotate text with syntactic structures
➢ Estimate a probabilistic context-free grammar from those structures
➢ Use that PCFG to parse new “test” sentences
➢ Works for any language for which we have annotated text!
Annotated Treebanks
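The "estimate a grammar from annotated structures" step can be sketched as relative-frequency counting over treebank rules. This is only a minimal illustration: the two-tree treebank, the tuple encoding of trees, and the function names `rules` and `estimate_pcfg` are all assumptions made here, not part of the talk.

```python
from collections import Counter

# Toy treebank (hypothetical): a tree is (label, child, ...),
# and a leaf child is a plain string.
TREEBANK = [
    ("S", ("NP", ("D", "the"), ("N", "man")),
          ("VP", ("V", "ate"), ("NP", ("D", "a"), ("N", "sandwich")))),
    ("S", ("NP", ("D", "the"), ("N", "man")),
          ("VP", ("V", "slept"))),
]

def rules(tree):
    """Yield (lhs, rhs) for every rule used in a tree."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield label, rhs
    for c in children:
        if not isinstance(c, str):
            yield from rules(c)

def estimate_pcfg(treebank):
    """Relative-frequency estimate: p(lhs -> rhs) = count(rule) / count(lhs)."""
    rule_counts = Counter(r for t in treebank for r in rules(t))
    lhs_counts = Counter()
    for (lhs, _), n in rule_counts.items():
        lhs_counts[lhs] += n
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}

pcfg = estimate_pcfg(TREEBANK)
print(pcfg[("VP", ("V", "NP"))])  # VP -> V NP is one of two VP expansions seen
```

Nothing in the counting is language-specific, which is the point of the last bullet: any language with a treebank gets a grammar the same way.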
Multilinguality as a source of x-fer
English: The man ate a tasty sandwich
D N V D J N
[S [NP The man] [VP ate [NP a tasty sandwich]]]
French: Le homme a mangé un sandwich savoureux
D N A V D N J
[S [NP Le homme] [VP a mangé [NP un sandwich savoureux]]]
Spanish: El hombre se comió un bocadillo sabroso
D N A V D N J
[S [NP El hombre] [VP se comió [NP un bocadillo sabroso]]]
English PCFG
French PCFG
Spanish PCFG
ϴ
[Berg-Kirkpatrick & Klein; ACL10]
[Iwata, Mochihashi & Sawada; ACL10]
+21% on average over 8 languages
English, Dutch, Danish, Swedish, Spanish, Portuguese, Slovene, Chinese
See also: Snyder, Barzilay et al. ...
Typology can help language processing
Language processing can help typology
Statistics is the mediator
Implicational Universals
English: I eat dinner in restaurants.
French: je mange le dîner dans les restaurants
  (I eat the dinner in the restaurants)
Japanese: boku-wa bangohan-o resutoran-de taberu
  (I-topic dinner-obj restaurants-in eat)
Hindi: main raat ka khaana restra mein khaata hoon
  (I night-of-meal restaurants in eat am)
English, French: Verb-Object (VO), Prepositional (PreP)
Japanese, Hindi: Object-Verb (OV), Postpositional (PostP)
VO ⊃ PreP
PostP ⊃ OV
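As a small illustration of what checking an implication like VO ⊃ PreP involves, the sketch below tabulates the two features for the four example languages and lists counterexamples. The feature values are read off the example sentences; the dictionary layout and the helper names `contingency` and `counterexamples` are assumptions for illustration.

```python
from collections import Counter

# Feature values read off the four example sentences.
LANGS = {
    "English":  {"VO": True,  "PreP": True},
    "French":   {"VO": True,  "PreP": True},
    "Japanese": {"VO": False, "PreP": False},  # OV, postpositional
    "Hindi":    {"VO": False, "PreP": False},  # OV, postpositional
}

def contingency(langs, a, b):
    """Count languages for each (a, b) combination of feature values."""
    return Counter((f[a], f[b]) for f in langs.values())

def counterexamples(langs, a, b):
    """Languages where a holds but b does not, i.e. violations of a ⊃ b."""
    return [name for name, f in langs.items() if f[a] and not f[b]]

table = contingency(LANGS, "VO", "PreP")
violations = counterexamples(LANGS, "VO", "PreP")
print(table, violations)
```

With only four languages this shows nothing statistically. It just makes the bookkeeping behind the 2x2 counts explicit.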
The Typologist's Life
      PreP  PostP
VO    16    0
OV    3     11
(Greenberg, 1963): based on 30 diversely sampled languages
Now, repeat for lots of feature pairs
Difficulties with Typical Approach
A ⊃ B (99%) is uninteresting when ∅ ⊃ B (99%)
Search process is tedious
Sampling problem when many languages considered
Process is inherently noisy
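The first difficulty can be made concrete by comparing the conditional rate P(B | A) with the base rate P(B). The sketch below uses made-up data in which B holds almost everywhere regardless of A; the data and the function name `implication_strength` are hypothetical.

```python
def implication_strength(rows, a, b):
    """Return (P(b | a), P(b)) over rows where both features are known."""
    known = [r for r in rows if r.get(a) is not None and r.get(b) is not None]
    with_a = [r for r in known if r[a]]
    p_b_given_a = sum(r[b] for r in with_a) / len(with_a)
    p_b = sum(r[b] for r in known) / len(known)
    return p_b_given_a, p_b

# Hypothetical data: B holds in 99 of 100 languages; A holds in every other one.
rows = [{"A": i % 2 == 0, "B": i != 0} for i in range(100)]
p_cond, p_marg = implication_strength(rows, "A", "B")
print(p_cond, p_marg)
```

Here A ⊃ B looks strong on its own, but the conditional rate is no higher than the base rate, so A carries no information about B.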
A Typological Database
➢ 2150 Languages
  ➢ 35 language families
  ➢ 275 language genera
➢ 139 Features
  ➢ 11 feature categories
➢ Sparsely sampled
  ➢ 85% missing data
Typological Map: VO
Typological Map: PreP
An Initial Model
➢ Consider two features --> 2xN matrix
➢ First, generate the first column with prior probability π1
➢ Next, decide if the implication holds
➢ Finally, generate the second column:
  ➢ With probability π2 if feature 1 is not “+” or if the implication doesn't hold
  ➢ Forced to be “+” otherwise
VO:   ++-++?+??+-+?+-
PreP: +?+-+++--?-+-++
Problems:
➢ Cannot handle noisy data
➢ Doesn't address sampling problem
Graphical model variables: m, π1, π2, f1, f2
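The generative story can be sketched as a forward sampler. This is a minimal sketch under assumptions: the function name `sample_matrix`, the prior `p_m` on whether the implication holds, and the use of Python's `random` module are all illustrative choices, not details given in the talk.

```python
import random

def sample_matrix(n, pi1, pi2, p_m, seed=0):
    """Sample the implication indicator m and n (f1, f2) pairs.

    pi1, pi2: prior probabilities of "+" for features 1 and 2.
    p_m: assumed prior probability that the implication f1 ⊃ f2 holds.
    """
    rng = random.Random(seed)
    # Step 1: generate feature 1 for every language from its prior.
    f1 = [rng.random() < pi1 for _ in range(n)]
    # Step 2: decide whether the implication holds.
    m = rng.random() < p_m
    # Step 3: feature 2 is forced to "+" when m holds and f1 is "+",
    # and is drawn from its own prior otherwise.
    f2 = [True if (m and a) else rng.random() < pi2 for a in f1]
    return m, list(zip(f1, f2))

m, pairs = sample_matrix(200, pi1=0.6, pi2=0.3, p_m=1.0)
print(m, pairs[:5])
```

Because f2 is forced whenever m holds, a single observed counterexample rules out m entirely, which is why this version cannot cope with noisy data.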
Fixing the Noise Problem
➢ Assume language-specific noise
➢ Model remains unchanged, except a new variable causes “f” to be flipped
Graphical model variables: m, π1, π2, f1, f2, e1, e2, ε
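One way to read the flip variable is as a per-observation indicator e that fires with probability ε and inverts the true feature value. The sketch below assumes that reading; the function names `observe` and `observe_language`, and passing one noise rate per language, are assumptions for illustration.

```python
import random

def observe(f, eps, rng):
    """Observed value of a true feature value f under flip noise."""
    e = rng.random() < eps        # e: flip indicator for this observation
    return (not f) if e else f

def observe_language(f1, f2, eps, seed=0):
    """Apply independent flips e1, e2 to one language's two feature values."""
    rng = random.Random(seed)
    return observe(f1, eps, rng), observe(f2, eps, rng)

print(observe_language(True, False, eps=0.05))
```

With noise in the model, an apparent counterexample only lowers the probability that the implication holds instead of excluding it.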
Fixing the Sampling Problem
➢ Hierarchical Bayes prior...
Per-language variables (repeated for each language): f1, f2, e1, e2, ε
Implication variables in the hierarchy: m0, mIE, mGer, mRom, mAus, mOce
Inference
➢ Binomials get Beta priors
  ➢ m ~ Uniform
  ➢ ε ~ Beta with 5% mean, 0-10% with 50% probability
➢ Everything else gets uniform priors
➢ Inference by Gibbs sampling
  ➢ Plus a rejection sampler subroutine
Three Models
Flat – All languages independent
LingHier – Typological Hierarchy
DistHier – Obtained by clustering positionally
Automatically Extracting Implications
➢ Search only over pairs with:
  ➢ 250 langs for which both features are known
  ➢ 15 languages for which both hold simultaneously
  ➢ When f1 is true, f2 is true with >50% probability
➢ Reduces space from 19,000 to 3442
➢ Sort by probability that m is true
➢ Evaluate:
  ➢ Compare the models' restoration accuracy against each other
  ➢ Compare against well-known implications
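The pre-filter over feature pairs might look like the sketch below. The thresholds come from the slide, but treating them as "at least" bounds, the dictionary encoding of the data, and the function name `candidate_pairs` are assumptions made here.

```python
from itertools import permutations

def candidate_pairs(data, features, min_known=250, min_both=15, min_cond=0.5):
    """Return ordered pairs (f1, f2) that pass the three filters.

    data: dict language -> dict feature -> True/False (missing key = unknown).
    """
    out = []
    for f1, f2 in permutations(features, 2):
        known = [(v[f1], v[f2]) for v in data.values() if f1 in v and f2 in v]
        both = sum(1 for a, b in known if a and b)
        n_f1 = sum(1 for a, _ in known if a)
        if (len(known) >= min_known and both >= min_both
                and n_f1 > 0 and both / n_f1 > min_cond):
            out.append((f1, f2))
    return out

# Tiny hypothetical data set, with thresholds scaled down to match.
data = {
    "L1": {"OV": True, "PostP": True},
    "L2": {"OV": True, "PostP": True},
    "L3": {"OV": False, "PostP": False},
    "L4": {"OV": True},                    # PostP unknown
}
pairs = candidate_pairs(data, ["OV", "PostP"], min_known=3, min_both=2)
print(pairs)
```

For 139 features there are 139 × 138 = 19,182 ordered pairs, which is in line with the slide's figure of roughly 19,000, though how the slide counted pairs is an assumption here.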
Restoration Accuracy by Model
Top Implications – LingHier
[Table: top-ranked implications (implicant ⊃ implicand), each with an analysis. Features involved: PostP, PreP, OV, VO, SV, Gen-N, N-Gen, Adj-N, N-Adj, Dem-N, Num-N, Noun-RelC, Suffixing, Prefixing, Strong prefixing, Little affixation, Tense Suf., Plural prefix, Subord. Suffix, Init. Subord., Final Sub. Word, Negative word, Intr. verb, No question prt., Oblig. subj. pron, No pron poss afx, Labial-velars, No uvulars, Many vowels, High+Mid F.V.s, No fricatives, No tones. Analyses: Greenberg #2a, #2b, #3, #4, #18 and #27b (some as converses), Hawkins XVI (for postpositional languages), Lehmann, Operator-operand principle (Lehmann), Appeal to economy, Clear explanation, See paper, ???]
Notes
➢ If you think this stuff is interesting, you should read the Dunn et al. Nature paper
➢ Main claim:
  ➢ All of this typology stuff is bogus
  ➢ Once you account for “genetic” influences
➢ Directly contradicts what I've just told you
➢ Who is right?
Automatic Induction of Syntax
➢ INPUT: A pile of text
➢ OUTPUT: Syntactic structures of this text
➢ Current approaches are mostly based on dependency formalisms
The man ate a big sandwich
D N V D J N
Dependency arcs: MOD (The → man), SUBJ (man → ate), OBJ (sandwich → ate), MOD (a → sandwich), MOD (big → sandwich)
Probabilistic Models of Syntax
D N V D J N
p(V|0,r)
p(N|V,l)
p(D|N,l)
p(N|V,r)
p(D|N,l)
p(J|N,l)
p(Data) = p(V|0,r) p(N|V,l) p(D|N,l) p(N|V,r) p(J|N,l) p(D|N,l)
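The product above can be evaluated mechanically once each factor p(child | head, direction) has a value. In the sketch below the arcs mirror the factors on the slide, while the numeric parameter values are invented purely for illustration.

```python
import math

# One (child_tag, head_tag, direction) triple per factor; "0" is the root symbol.
ARCS = [
    ("V", "0", "r"),
    ("N", "V", "l"),
    ("D", "N", "l"),
    ("N", "V", "r"),
    ("J", "N", "l"),
    ("D", "N", "l"),
]

# Hypothetical parameter values, not estimates from any data.
P = {
    ("V", "0", "r"): 0.9,
    ("N", "V", "l"): 0.5,
    ("N", "V", "r"): 0.4,
    ("D", "N", "l"): 0.6,
    ("J", "N", "l"): 0.2,
}

def log_prob(arcs, params):
    """Sum of log p(child | head, direction) over all arcs in the tree."""
    return sum(math.log(params[a]) for a in arcs)

p_data = math.exp(log_prob(ARCS, P))
print(p_data)
```

Working in log space is a practical choice here: products of many small probabilities underflow quickly on real sentences.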
Inferring Tags from the Structure
➢ INPUT: The man ate a big sandwich (with its dependency structure)
➢ OUTPUT: D N V D J N
➢ Baseline:
  ➢ Random guessing: 4% accuracy
Sources of Knowledge
➢ Seeds (frequent words for each tag)
  ➢ N: membro, milhoes, obras
  ➢ D: as [the,2f] o [the,1m] os [the,2m]
  ➢ V: afector, gasta, juntar
  ➢ P: com, como, de, em
➢ Typological rules:
  ➢ Art ← Noun
  ➢ Prp → Noun
➢ Tag knowledge:
  ➢ Open class
  ➢ Closed class
Preliminary Results
[Bar chart: No Seeds vs. Seeds, with series No O/C and Open/Closed; vertical axis 0 to 60]
Preliminary Results: Open/Closed
[Two bar charts, NO SEEDS and SEEDS: No Rules, Art<-N, Prp->N, Both; vertical axis 20 to 60]
Where does the tree come from?
Kingman's Coalescent
➢ A standard model for the genealogy of a population
➢ Each organism has exactly one parent (haploid)
➢ Thus, the genealogy is a tree
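For readers who have not met the coalescent, the sketch below samples a tree from it in the standard way: going backwards in time, with k lineages remaining, wait an exponential time with rate k(k-1)/2 and then merge a uniformly chosen pair. The function name `sample_coalescent` and the output format are assumptions for illustration, not the talk's implementation.

```python
import random

def sample_coalescent(n, seed=0):
    """Sample a Kingman coalescent tree over n leaves.

    Returns (root_id, merges), where merges is a list of
    (time, left_child, right_child, parent) going backwards in time.
    """
    rng = random.Random(seed)
    lineages = list(range(n))            # leaves are 0..n-1
    next_id, t, merges = n, 0.0, []
    while len(lineages) > 1:
        k = len(lineages)
        rate = k * (k - 1) / 2           # each of the k-choose-2 pairs merges at rate 1
        t += rng.expovariate(rate)       # waiting time until the next merge
        a, b = rng.sample(lineages, 2)   # uniformly chosen pair of lineages
        lineages.remove(a)
        lineages.remove(b)
        lineages.append(next_id)
        merges.append((t, a, b, next_id))
        next_id += 1
    return lineages[0], merges

root, merges = sample_coalescent(5)
for time, left, right, parent in merges:
    print(f"t={time:.3f}: {left} + {right} -> {parent}")
```

Every node other than the root is merged exactly once, so the result is always a binary tree with n - 1 internal nodes.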
An infinite tree...
Graphical model on a coalescent
Understanding language relationships
Modeling errors
The Balkans
Linguistic areas
Classic linguistic areas
Model desiderata
Pitman-Yor Processes
Generative Story
Discovered results
Shared features
Reconstruction accuracies
Conclusions + Future Steps
➢ Can infer IUs from data (WALS)
  ➢ Old ones and new ones
  ➢ Can handle the sampling problem
➢ Can use typology to help tagging
  ➢ Open/closed
  ➢ Simple features
➢ Infer tree structure, too
  ➢ Don't assume features: just IUs!
  ➢ Infer multiple languages simultaneously
  ➢ Feedback from text to IUs