Evaluating an Agglutinative Segmentation Model for ParaMor
Christian Monson
Jaime Carbonell
Alon Lavie
Lori Levin
Carnegie Mellon University

Turkish Morphology – Beads on a String
One Turkish Word: götür + ül + m + üyor + sun
götür: take
ül: passive
m: negative
üyor: present progressive
sun: 2nd person singular
"You are not being taken"

Computational Morphology Improves:
Machine Translation: Turkish-English (Oflazer, 2007); Czech-English (Goldwater and McClosky, 2005)
Speech Recognition: Finnish (Creutz, 2006)
Grapheme-to-Phoneme Conversion: German (Demberg, 2007)
Information Retrieval: English, German, Finnish (Kurimo et al., 2008)

Morphology is Complex
Operations: Suffix, Prefix, Reduplication, …
Purpose: Inflection vs. Derivation
Morphophonology
Ambiguity

Complexity Demands Time and Expertise
Kemal Oflazer: Expert on Turkish and on computational morphology
Time: 3 to 4 months to manually build a basic Turkish analyzer, plus lexicon development and maintenance

The Solution
Raw Text
Unsupervised Morphology Induction

Techniques for Unsupervised Morphology Induction
Transition Likelihood
Harris (1955) – Finite State Automata
Bernhard (2007)
Statistical or Minimum Description Length
Goldsmith (2001, 2006)
Creutz’s Morfessor (2006)
The Paradigm
Snover (2002)
ParaMor (2004, 2007)

What is a Paradigm?
götür + ül + m + üyor + sun
take + passive + negative + present progressive + 2nd person singular

Paradigms Structure Inflectional Morphology
Paradigm: Mutually substitutable morphological operations
Voice: ül
Polarity: m
Tense & Aspect: üyor, yecek
Person & Number: um (1st person singular), sun (2nd person singular), Ø (3rd person singular), uz
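
A minimal sketch of what "mutually substitutable" means for the Person & Number slot. The base string and the helper name `fill_slot` are illustrative assumptions, and the sketch only concatenates strings (it ignores morphophonology, as the slides do):

```python
# Person & Number suffixes named on the slides; "" stands for the null suffix Ø.
PERSON_NUMBER = ["um", "sun", "", "uz"]


def fill_slot(base, paradigm):
    """Return one word form per suffix in the paradigm (hypothetical helper)."""
    return [base + suffix for suffix in paradigm]


# götür + ül + m + üyor, with the Person & Number slot left open.
forms = fill_slot("götürülmüyor", PERSON_NUMBER)
for form in forms:
    print(form)
```

Every suffix in the list fills the same slot, which is the property ParaMor later looks for in raw text.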

The ParaMor Algorithm
Paradigm: Mutually substitutable strings
Candidate Stems and candidate suffixes: each candidate paradigm proposes 1 Morpheme Boundary

The ParaMor Algorithm
Simplifying Assumptions
Suffixes only: 70% of the World’s Languages are Suffixing (Dryer, 2005)
No morphophonology
Only a High-Level Overview

The ParaMor Algorithm
Identify Paradigms in 3 Steps
1. Search for candidate paradigms
2. Cluster candidates modeling the same paradigm
3. Filter
Segment Words Using the discovered paradigms

This Presentation
Example Search (Full Description in Monson et al., SIGMORPHON 2007)
Agglutinative Segmentation Model
Evaluation
Results
This Paper
2 Filters Adapted from Harris (1955) and Goldsmith (2006)

Search for Candidate Paradigms
Spanish Example
Word List: 50,000 Types (autorizaciones, buscabamos, costas, importadoras, vallas, …)
Propose a morpheme boundary at every character boundary in every word
Consolidate identical candidate suffixes into paradigm seeds
Paradigm seed: s (10697)
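
The two steps above can be sketched as follows. This is an illustration under stated assumptions, not ParaMor's code: the function names and the toy word list are hypothetical, and treating the end of the word as a boundary (so that the null suffix Ø becomes a candidate) is an assumption made so that paradigms such as "Ø s" can be formed later.

```python
from collections import defaultdict


def paradigm_seeds(words):
    """Map each candidate suffix to the set of candidate stems it follows."""
    stems_by_suffix = defaultdict(set)
    for word in set(words):
        # Propose a boundary at every character boundary. cut == len(word)
        # yields the null suffix "" (Ø), which is an assumption of this sketch.
        for cut in range(1, len(word) + 1):
            stems_by_suffix[word[cut:]].add(word[:cut])
    return stems_by_suffix


def rank_seeds(stems_by_suffix):
    """Order seeds by stem count, largest first (ties broken alphabetically)."""
    return sorted(stems_by_suffix, key=lambda s: (-len(stems_by_suffix[s]), s))


# Toy word list built from words on the slide; the real list has 50,000 types.
WORDS = ["costa", "costas", "costar", "valla", "vallas",
         "importadora", "importadoras"]
seeds = paradigm_seeds(WORDS)
ranking = rank_seeds(seeds)
print(sorted(seeds["s"]))
```

On the full Spanish word list, the count attached to a seed (10697 for s) is the number of candidate stems collected this way.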

Search for Candidate Paradigms
s (10697) → Ø s (5513)
costaØ costas, importadoraØ importadoras, vallaØ vallas, …
Identify the most frequent mutually replaceable candidate suffix
Stems that occur with one suffix in a paradigm will likely occur with other suffixes in that paradigm

Search for Candidate Paradigms
s (10697) → Ø s (5513) → Ø r s (281)
costar costaØ costas
A parameter halts the introduction of suffixes when the most frequent mutually replaceable candidate suffix severely decreases the stem count
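
A rough sketch of this greedy search, assuming a simple halting rule: stop when the best new suffix would keep less than a fraction `min_keep` of the current stems. The parameter name, its default value, and the helper names are hypothetical; the slides say only that a parameter halts the search when the stem count drops severely.

```python
from collections import defaultdict


def index_suffixes(words):
    """Candidate suffix -> set of candidate stems (Ø is the empty string)."""
    stems_by_suffix = defaultdict(set)
    for word in set(words):
        for cut in range(1, len(word) + 1):
            stems_by_suffix[word[cut:]].add(word[:cut])
    return stems_by_suffix


def grow_paradigm(seed, stems_by_suffix, min_keep=0.5):
    """Greedily add the suffix that keeps the most stems, until the halt rule fires."""
    suffixes = [seed]
    stems = set(stems_by_suffix[seed])
    while True:
        best, best_stems = None, set()
        for cand in sorted(stems_by_suffix):
            if cand in suffixes:
                continue
            shared = stems & stems_by_suffix[cand]
            if len(shared) > len(best_stems):
                best, best_stems = cand, shared
        # Halt when nothing is left to add or the stem count drops too far.
        if best is None or len(best_stems) < min_keep * len(stems):
            return sorted(suffixes), stems
        suffixes.append(best)
        stems = best_stems


WORDS = ["costa", "costas", "costar", "valla", "vallas",
         "importadora", "importadoras"]
index = index_suffixes(WORDS)
strict = grow_paradigm("s", index, min_keep=0.5)
loose = grow_paradigm("s", index, min_keep=0.3)
print(strict)
print(loose)
```

With the stricter setting the toy search stops at "Ø s"; with the looser one it goes on to "Ø r s", which only the stem costa supports.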

Search for Candidate Paradigms
Move on to the next most frequent paradigm seed
Search paths: candidate suffixes (stem count)
s (10697) → Ø s (5513) → Ø r s (281)
a (9020) → a o (2325) → a o os (1418) → a as o os (899)
n (6039) → Ø n (1863) → Ø n r (512) → Ø do n r (357) → Ø da das do dos n ndo r ron (115)
es (2750) → Ø es (845)
an (1784) → a an (1045) → a an ar (417) → a an ar ó (355) → a ada adas ado ados an ar aron ó (148)
...
rado (167) → rada rado (89) → rada rado rados (67) → rada radas rado rados (53) → ra rada radas rado rados ran rar raron ró (23)
strado (15) → strada strado (12) → strada strado stró (9) → strada strado strar stró (8) → strada stradas strado strar stró (7)

A Few of the 42 Final Paradigms
4 Suffixes: Ø menente mente s
11 Suffixes: a amente as illa illas o or ora oras ores os
41 Suffixes: a aba aban acion aciones ación ada adas ado ador adora adoras adores ados amos an ando ante antes ar ara aran aremos arla arlas arlo arlos arme aron arse ará arán aré aría arían ase e en ándose é ó
29 Suffixes: e edor edora edoras edores en er erlo erlos erse erá erán ería erían ida idas ido idos iendo iera ieran ieron imiento imientos iéndose ió í ía ían
20 Suffixes: ida idas ido idor idores idos imos ir iremos irle irlo irlos irse irá irán iré iría irían ía ían
29 Suffixes: ce cedores cemos cen cer cerlo cerlos cerse cerá cerán cería cida cidas cido cidos ciendo ciera cieran cieron cimiento cimientos cimos ció cí cía cían zca zcan zco
6 Suffixes: Ø es idad idades mente ísima
Paradigm types highlighted: Number on Nouns; Number & Gender on Adjectives; Verbal Suffixes

The ParaMor Algorithm
Segment Words Using the discovered paradigms: Agglutinative Segmentation Model

Segment Words Using the Paradigms
administradas: ‘Feminine gender nouns under administration’
administr + ad + a + s
ad: Past Participle
a: Feminine
s: Plural
Substitute suffixes within a paradigm: administradas ends in s, and administrada (Ø in place of s) is also in corpus, so propose administrada + s
Recovers multiple morpheme boundaries from candidate paradigms which each propose single morpheme boundaries
Baseline ParaMor: single morpheme boundary in each analysis of each word
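
The contrast between the baseline and the agglutinative model can be sketched as below. This is an illustration under assumptions, not ParaMor's implementation: the three paradigms are abbreviated from the lists above, the four-word corpus is invented for the example, and all function names are hypothetical.

```python
def suffix_boundaries(word, paradigms, corpus):
    """Propose a cut before each suffix that can be swapped for a paradigm mate."""
    cuts = set()
    for paradigm in paradigms:
        for suffix in paradigm:
            # Skip Ø (it adds no cut inside the word) and suffixes that do not fit.
            if not suffix or suffix == word or not word.endswith(suffix):
                continue
            stem = word[: len(word) - len(suffix)]
            # Substitution test: stem + another suffix is also in the corpus.
            if any(stem + other in corpus for other in paradigm if other != suffix):
                cuts.add(len(stem))
    return sorted(cuts)


def baseline_analyses(word, cuts):
    """Baseline: one analysis per cut, each with a single morpheme boundary."""
    return [word[:c] + " + " + word[c:] for c in cuts]


def agglutinative_analysis(word, cuts):
    """Agglutinative model: one analysis that contains every proposed boundary."""
    edges = [0] + list(cuts) + [len(word)]
    return " + ".join(word[a:b] for a, b in zip(edges, edges[1:]))


# Abbreviated paradigms ("" is Ø) and a toy corpus, both assumptions.
PARADIGMS = [
    {"", "s"},
    {"a", "as", "o", "os"},
    {"ada", "adas", "ado", "ados", "ar"},
]
CORPUS = {"administradas", "administrada", "administrado", "administrar"}

cuts = suffix_boundaries("administradas", PARADIGMS, CORPUS)
print(baseline_analyses("administradas", cuts))
print(agglutinative_analysis("administradas", cuts))
```

Each paradigm contributes one boundary; merging them gives the multi-morpheme analysis shown on the slide.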

Morpho Challenge 2007
Peer operated competition for unsupervised morphology induction algorithms
4 languages
English (384,904 Types)
German (1,266,160 Types)
Finnish (2,206,720 Types)
Turkish (617,299 Types)
2 methods of evaluation
Linguistic – Morpheme Identification (Today)
Information Retrieval
ParaMor: Developed on Spanish, Parameters Frozen

Combine ParaMor and Morfessor
Morfessor: Freely available unsupervised morphology induction system (Creutz, 2006)
Combining ParaMor and Morfessor performs better than either system alone (Monson et al., 2007)

Linguistic Evaluation
[Bar chart: F1 (axis 20 to 60) on English, German, Finnish, and Turkish for Bernhard, Morfessor, ParaMor & Morfessor (Baseline), and Agglutinative ParaMor & Morfessor. Bar labels: 60.8, 56.3, 52.9, 54.1, 48.2, 48.5, 24.7, 52.0, 50.7, 47.2, 47.8, 53.4, 40.6, 38.5, 43.2, 46.7]
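
For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall. The sketch below scores boundary positions of a single word, which is a simplifying assumption: the slides do not describe how Morpho Challenge computes its linguistic score, and this code does not reproduce that procedure. Function names are hypothetical.

```python
def cut_positions(segmentation):
    """Turn 'administr + ad + a + s' into the set of boundary offsets {9, 11, 12}."""
    positions, offset = set(), 0
    for morph in segmentation.split(" + ")[:-1]:
        offset += len(morph)
        positions.add(offset)
    return positions


def boundary_prf(predicted, gold):
    """Precision, recall, and F1 over two sets of boundary offsets."""
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = cut_positions("administr + ad + a + s")
single = cut_positions("administrada + s")
precision, recall, f1 = boundary_prf(single, gold)
print(precision, recall, f1)
```

A single-boundary analysis can be fully precise yet miss most boundaries, which is why recovering several boundaries per word matters for agglutinative languages.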

ParaMor: State-of-the-Art Unsupervised Morphology Induction System
ParaMor
Identifies paradigms: the organizing structure of inflectional morphology
Segments words: as discovered paradigms suggest
Our Agglutinative Segmentation Model
Significantly improves morpheme identification, particularly for agglutinative languages

The Next Steps for ParaMor
Beyond Suffixes
English, German, Finnish, Turkish, and Spanish are all primarily suffixing
Straightforward extension to ParaMor for Prefixes
More Challenging: Reduplication, Infixation, etc.

Beyond ParaMor
Improve Performance
Segmentation F1 of 50-60% is state of the art!
Morphophonology is the primary culprit
Simply splitting words cannot identify alternate forms of the same morpheme
Cluster Paradigm Fragments
15: a aba ada adas ado ados an ando ar aron arse ará arán aría ó22 Stems: anunci- aplic- apoy- celebr- concentr- …
15: a aba ada adas ado ados an ando ar ara aron arse ará arán ó23 Stems: anunci- apoy- confirm- consider- declar- …
ParaMorIdentify
SearchClusterFilter
SegmentEvaluationResults
78Carnegie Mellon
Christian Monson
Cluster Paradigm FragmentsParaMor
IdentifySearchClusterFilter
SegmentEvaluationResults
15: a aba ada adas ado ados an ando ar aron arse ará arán aría ó22 Stems: anunci- aplic- apoy- celebr- concentr- …
15: a aba ada adas ado ados an ando ar ara aron arse ará arán ó23 Stems: anunci- apoy- confirm- consider- declar- …
anunci+aaplic+aapoy+a…anunci+abaaplic+abaapoy+aba…anunci+aría…
79Carnegie Mellon
Christian Monson
Cluster Paradigm Fragments
anunci+aapoy+aconfirm+a…anunci+abaapoy+abaconfirm+aba…anunci+ara…
anunci+aaplic+aapoy+a…anunci+abaaplic+abaapoy+aba…anunci+aría…
ParaMorIdentify
SearchClusterFilter
SegmentEvaluationResults
15: a aba ada adas ado ados an ando ar aron arse ará arán aría ó22 Stems: anunci- aplic- apoy- celebr- concentr- …
15: a aba ada adas ado ados an ando ar ara aron arse ará arán ó23 Stems: anunci- apoy- confirm- consider- declar- …
80Carnegie Mellon
Christian Monson
Cluster Paradigm Fragments
anunci+aapoy+aconfirm+a…anunci+abaapoy+abaconfirm+aba…anunci+ara…
anunci+aaplic+aapoy+a…anunci+abaaplic+abaapoy+aba…anunci+aría…
Cosine Similarity
0.664
ParaMorIdentify
SearchClusterFilter
SegmentEvaluationResults
15: a aba ada adas ado ados an ando ar aron arse ará arán aría ó22 Stems: anunci- aplic- apoy- celebr- concentr- …
15: a aba ada adas ado ados an ando ar ara aron arse ará arán ó23 Stems: anunci- apoy- confirm- consider- declar- …
81Carnegie Mellon
Christian Monson
Cluster Paradigm Fragments
17: a aba aban ada adas ado ados an ando ar ara aron arse ará arán aría ó
Cosine Similarity: 0.715
15: a aba aban ada adas ado ados an ando ar aron arse ará arán ó
25 Stems: anunci- aplic- apoy- celebr- consider- …
15: a aba ada adas ado ados an ando ar aron arse ará arán aría ó
22 Stems: anunci- aplic- apoy- celebr- concentr- …
15: a aba ada adas ado ados an ando ar ara aron arse ará arán ó
23 Stems: anunci- apoy- confirm- consider- declar- …
16: a aba ada adas ado ados an ando ar ara aron arse ará arán aría ó
Cosine Similarity: 0.664
Continue Clustering Until
Any merger would place in the same cluster 2 suffixes which share no stem in the corpus
11: a e en ida idas ido idos iendo ieron ió ía
15 Stems: culpl- discut- emit- part- recib- reun- transmit- un- vend- viv- …
In all, 23 initial candidate paradigms joined this cluster
88Carnegie Mellon
Christian Monson
Cluster Overlapping Candidates
Greedy bottom-up agglomerative clustering
Merge most similar candidate paradigms
Cosine similarity: cos(X, Y) = |X ∩ Y| / sqrt(|X| |Y|)
X, Y: Sets of boundary annotated supporting types
Halting condition
The corpus must contain paradigmatic evidence for each pair of suffixes in a cluster:
Two suffixes may not be in the same cluster if they share no common candidate stem in the corpus
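The clustering step on this slide can be sketched as follows. This is a minimal illustration, assuming the usual set form of cosine similarity, |X ∩ Y| / sqrt(|X| |Y|), over boundary-annotated types written as stem+suffix. The names `supporting_types`, `cosine`, `may_merge`, and the toy `stems_of` table are hypothetical and not taken from ParaMor itself.

```python
from math import sqrt

def supporting_types(stems, suffixes):
    """Boundary-annotated types covered by a candidate paradigm, e.g. 'anunci+aba'."""
    return {f"{stem}+{suffix}" for stem in stems for suffix in suffixes}

def cosine(x, y):
    """Set cosine similarity: |X ∩ Y| / sqrt(|X| * |Y|)."""
    if not x or not y:
        return 0.0
    return len(x & y) / sqrt(len(x) * len(y))

def may_merge(cluster_a, cluster_b, stems_of):
    """Halting condition: every pair of suffixes in the merged cluster
    must share at least one candidate stem in the corpus."""
    merged = cluster_a | cluster_b
    return all(stems_of[s] & stems_of[t] for s in merged for t in merged)

# Toy corpus evidence (illustrative only): suffix -> stems seen with that suffix.
stems_of = {
    "a": {"anunci", "apoy", "viv"},
    "aba": {"anunci", "apoy"},
    "ada": {"anunci", "apoy"},
    "ida": {"viv"},
}
```

In this toy table, `aba` and `ida` never occur on the same stem, so a merger that would put them in one cluster is blocked, while `a`, `aba`, and `ada` may be clustered together.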
89Carnegie Mellon
Christian Monson
90Carnegie Mellon
Christian Monson
A Few of the 42 Final Paradigm Clusters
4 Suffixes: Ø menente mente s
11 Suffixes: a amente as illa illas o or ora oras ores os
41 Suffixes: a aba aban acion aciones ación ada adas ado ador adora adoras adores ados amos an ando ante antes ar ara aran aremos arla arlas arlo arlos arme aron arse ará arán aré aría arían ase e en ándose é ó
29 Suffixes: e edor edora edoras edores en er erlo erlos erse erá erán ería erían ida idas ido idos iendo iera ieran ieron imiento imientos iéndose ió í ía ían
20 Suffixes: ida idas ido idor idores idos imos ir iremos irle irlo irlos irse irá irán iré iría irían ía ían
29 Suffixes: ce cedores cemos cen cer cerlo cerlos cerse cerá cerán cería cida cidas cido cidos ciendo ciera cieran cieron cimiento cimientos cimos ció cí cía cían zca zcan zco
6 Suffixes: Ø es idad idades mente ísima
91Carnegie Mellon
Christian Monson
Spanish Derivation and Clitics
92Carnegie Mellon
Christian Monson
Morphophonology in ParaMor
93Carnegie Mellon
Christian Monson
94Carnegie Mellon
Christian Monson
Morpheme to Feature Mapping
El árbol cayó (The tree fell)
((TENSE past) (LEXICAL-ASPECT activity) ...
 (SUBJ ((NUM sg) (PERSON 3sg) ...)))
Los árboles cayeron (The trees fell)
((TENSE past) (LEXICAL-ASPECT activity) ...
 (SUBJ ((NUM pl) (PERSON 3sg) ...)))
Parse tree for both sentences: S → NP VP; NP → Det N; VP → V
Subject Number marked in 3 places:
1. on N head with Ø = sg, es = pl
2. on dependent Det with El = sg, Los = pl
3. on governing V with ó = sg, eron = pl
95Carnegie Mellon
Christian Monson
96Carnegie Mellon
Christian Monson
IR Evaluation
Task Based Evaluation
Information retrieval
Data from CLEF (Cross Language Evaluation Forum)
Short two-sentence queries
About international news topics
Binary relevance assessments
About 50 queries and 20K relevance judgments for each language
Okapi term weighting
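The results that follow are reported as average precision. With binary relevance assessments, that measure can be computed per query as sketched below. This assumes the standard uninterpolated definition; the exact variant used in the evaluation is not stated on the slide, and the function names are illustrative.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Uninterpolated average precision for one query with binary relevance."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant document
    return precision_sum / len(relevant)

def mean_average_precision(runs):
    """Mean over queries; `runs` is a list of (ranking, relevant_ids) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

A system that segments words well should match more relevant documents near the top of the ranking, which raises this score.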
97Carnegie Mellon
Christian Monson
IR Evaluation
[Bar chart: Average Precision (axis ticks 25, 35, 45) by language (English, German, Finnish, Turkish) for Bernhard, Morfessor, ParaMor, and ParaMor & Morfessor]
Bar values as extracted, English: 39.4 39.6 39.3 37.2
Bar values as extracted, German: 47.3 48.1 46.0 39.6
Bar values as extracted, Finnish: 49.2 38.8 37.9 36.9
No Morphological Analysis: 31.2 (English-only build), 32.3 (builds adding German and Finnish)
Turkish Was Not Evaluated for the IR Task
101Carnegie Mellon
Christian Monson
102Carnegie Mellon
Christian Monson
Is Beads-on-a-String Model Adequate?
103Carnegie Mellon
Christian Monson
A Sample of 894 Languages (Dryer, 2005)
104Carnegie Mellon
Christian Monson
86% Have Affixational Morphology
Affixing Languages
Little Affixation
105Carnegie Mellon
Christian Monson
70% are Suffixing
Primarily Prefixation
Little Affixation
Significant Suffixation
106Carnegie Mellon
Christian Monson
107Carnegie Mellon
Christian Monson
Paradigms Do Not Describe Derivation
[Diagram: mis + inform + ation / er / ment; mis + manage + ment / ation]
110Carnegie Mellon
Christian Monson
111Carnegie Mellon
Christian Monson
Morphology is Complex – Fusion
götür ül me yecek sin
take | passive | negative | present | 2nd person singular
You are not taken
112Carnegie Mellon
Christian Monson
Morphology is Complex – Fusion
götür ül mez sin
take | passive | negative-present | 2nd person singular
You are not taken
114Carnegie Mellon
Christian Monson
115Carnegie Mellon
Christian Monson
Search for Candidate Paradigms
Size of Search Space
Huge: 2^|candidate suffixes|
Most candidate suffixes have no common stems
Still Exponential
Greedily searched space: O(|candidate suffixes|)
This example: 0.1% of the searched space
[Figure: candidate s (10697) extended to Ø s (5513); example types: autorizaciones, buscabamos, costa / costas, importadora / importadoras, valla / vallas, …]
116Carnegie Mellon
Christian Monson
117Carnegie Mellon
Christian Monson
Some Candidates are Errors
1st Ø s
2nd a as o os
3rd Ø ba ban da das do dos n ndo r ron rse rá rán
5th a aba aban ada adas ado ados an ando ar aron arse ará arán ó
11th ta tamente tas to tos
12th Ø ba ción da das do dos n ndo r ron rá rán ría
13th a aba ada adas ado ados an ando ar aron ará arán e en ó
30th a e en ida idas ido idos iendo ieron ió ía
1000th Ø g gs
1566th ido idos ir iré
2000th lia liana
3000th Ø a anar
4000th Ø e ince
8000th trada trarnos
Stem-internal boundary hypothesis
Correct
Incorrect
118Carnegie Mellon
Christian Monson
119Carnegie Mellon
Christian Monson
Enable Cross-Lingual Communication
7000 languages in the world
6.66 billion people
Half speak one of the 10 largest languages
Half don’t!
120Carnegie Mellon
Christian Monson
121Carnegie Mellon
Christian Monson
Preliminary Linguistic Evaluation
Built 2 styles of answer keys
For 2 languages

            Inflectional Only            Inflectional & Derivational
            English       German         English       German
            P     R       P     R        P     R       P     R
ParaMor     33.0  81.4    42.8  68.6     48.9  53.6    60.0  33.5
Morfessor   53.3  47.0    38.7  44.2     73.6  34.0    66.9  37.1

Morfessor
Freely available unsupervised morphology induction system
Statistical model of morphology
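The slides report precision (P) and recall (R) separately. To compare the two systems with one number, the harmonic mean (F1) can be computed from each P/R pair, as in the sketch below. F1 is not reported on the slides; it is shown here only as a convenience, and the dictionary layout is an assumption.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (P, R) pairs copied from the table, keyed by (system, answer key, language).
scores = {
    ("ParaMor", "Inflectional Only", "English"): (33.0, 81.4),
    ("ParaMor", "Inflectional Only", "German"): (42.8, 68.6),
    ("ParaMor", "Inflectional & Derivational", "English"): (48.9, 53.6),
    ("ParaMor", "Inflectional & Derivational", "German"): (60.0, 33.5),
    ("Morfessor", "Inflectional Only", "English"): (53.3, 47.0),
    ("Morfessor", "Inflectional Only", "German"): (38.7, 44.2),
    ("Morfessor", "Inflectional & Derivational", "English"): (73.6, 34.0),
    ("Morfessor", "Inflectional & Derivational", "German"): (66.9, 37.1),
}

f1_scores = {key: f1(p, r) for key, (p, r) in scores.items()}
```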
125Carnegie Mellon
Christian Monson
126Carnegie Mellon
Christian Monson
Where is the Statistics?
ParaMor has no explicit statistical model
Each candidate paradigm is a minimal description
MDL has close ties to Bayesian statistics
[Figure: bottom-up search paths through the candidate paradigm network; each candidate is followed by its count]
s 10697 → Ø s 5513 → Ø r s 281
a 9020 → a o 2325 → a o os 1418 → a as o os 899
n 6039 → Ø n 1863 → Ø n r 512 → Ø do n r 357 → Ø da das do dos n ndo r ron 115
es 2750 → Ø es 845
an 1784 → a an 1045 → a an ar 417 → a an ar ó 355 → a ada adas ado ados an ar aron ó 148
…
rado 167 → rada rado 89 → rada rado rados 67 → rada radas rado rados 53 → ra rada radas rado rados ran rar raron ró 23
strado 15 → strada strado 12 → strada strado stró 9 → strada strado strar stró 8 → strada stradas strado strar stró 7
127Carnegie Mellon
Christian Monson
128Carnegie Mellon
Christian Monson
Some Model Fragments of Paradigms
1st Ø s
2nd a as o os
3rd Ø ba ban da das do dos n ndo r ron rse rá rán
5th a aba aban ada adas ado ados an ando ar aron arse ará arán ó
11th ta tamente tas to tos
12th Ø ba ción da das do dos n ndo r ron rá rán ría
13th a aba ada adas ado ados an ando ar aron ará arán e en ó
30th a e en ida idas ido idos iendo ieron ió ía
1000th Ø g gs
1566th ido idos ir iré
2000th lia liana
3000th Ø a anar
4000th Ø e ince
8000th trada trarnos
15 Suffixes from the ar Verbal Paradigm (Which has more than 30)
Here’s 15 More Suffixes from the ar Verbal Paradigm
130Carnegie Mellon
Christian Monson
131Carnegie Mellon
Christian Monson
The ParaMor Algorithm
Spanish Example
Raw Text
Word List: 50,000 Types
autorizaciones
buscabamos
costar
importadora
vallas
…
A priori, each character boundary is a candidate morpheme boundary
Propose multiple analyses of each word
Each analysis contains exactly 1 morpheme boundary
v + allas
va + llas
val + las
vall + as
valla + s
vallas + Ø
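The analysis step described on this slide, where every character boundary is a candidate morpheme boundary and each analysis has exactly one boundary, can be sketched in a few lines. The function name `candidate_analyses` is illustrative, and Ø stands for the empty suffix as on the slide.

```python
def candidate_analyses(word):
    """All (stem, suffix) splits of a word with exactly one boundary.

    The boundary may fall after any character, including the last one,
    in which case the suffix is the null suffix, written Ø.
    """
    return [(word[:i], word[i:] or "Ø") for i in range(1, len(word) + 1)]

for stem, suffix in candidate_analyses("vallas"):
    print(f"{stem} + {suffix}")
```

For `vallas` this produces the six analyses listed above, from `v + allas` through `vallas + Ø`.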
134Carnegie Mellon
Christian Monson
Paradigms
Paradigm: Mutually substitutable morphological operations
ül m umumØuz
üyoryecek
The ParaMor Algorithm
135Carnegie Mellon
Christian Monson
Paradigms
Paradigm: Mutually substitutable strings
ül m umumØuz
üyoryecek
The ParaMor Algorithm
136Carnegie Mellon
Christian Monson
The ParaMor Algorithm
Consolidate identical candidate suffixes into paradigm seeds
s (10697)
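Consolidating identical candidate suffixes amounts to grouping every single-boundary analysis by its suffix and collecting the stems that occur with it. The sketch below assumes that the count shown next to a seed, such as 10697 for s, is the number of such stems. The function name and the tiny word list are illustrative only.

```python
from collections import defaultdict

def paradigm_seeds(words):
    """Map each candidate suffix to the set of candidate stems it attaches to."""
    stems_of = defaultdict(set)
    for word in words:
        for i in range(1, len(word) + 1):
            suffix = word[i:] or "Ø"  # Ø marks the null suffix
            stems_of[suffix].add(word[:i])
    return dict(stems_of)

words = ["autorizaciones", "buscabamos", "costas", "importadoras", "vallas", "valla"]
seeds = paradigm_seeds(words)
print(len(seeds["s"]), sorted(seeds["as"]))
```

Each key of `seeds` is a one-suffix seed from which a search for larger candidate paradigms can start.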
137Carnegie Mellon
Christian Monson
Search for Candidate Paradigms
Bottom-Up
Greedy
Begin search with the most frequent candidate suffix
s (10697): autorizaciones buscabamos costas importadoras vallas …
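A bottom-up greedy search of this kind might look like the sketch below: start from one seed suffix and repeatedly add the suffix that keeps the most stems in common. The stopping rule (`min_ratio`, the fraction of stems that must survive each step) is an assumption made for illustration; the slide does not state ParaMor's actual criterion, and all names here are hypothetical.

```python
def greedy_search(seed, stems_of, min_ratio=0.25):
    """Grow a candidate paradigm from `seed`, one suffix at a time.

    stems_of maps each candidate suffix to its set of candidate stems.
    At each step the suffix sharing the most stems with the current
    candidate is added; the search stops when too few stems would remain.
    """
    suffixes = {seed}
    stems = set(stems_of[seed])
    while True:
        best, best_stems = None, set()
        for suffix, suffix_stems in stems_of.items():
            if suffix in suffixes:
                continue
            shared = stems & suffix_stems
            if len(shared) > len(best_stems):
                best, best_stems = suffix, shared
        if best is None or len(best_stems) < min_ratio * len(stems):
            return suffixes, stems
        suffixes.add(best)
        stems = best_stems

# Toy evidence (illustrative only).
stems_of = {
    "s": {"costa", "valla", "importadora", "autorizacione"},
    "Ø": {"costa", "valla", "importadora"},
    "r": {"costa"},
}
seed = max(stems_of, key=lambda suffix: len(stems_of[suffix]))  # most frequent suffix
found_suffixes, found_stems = greedy_search(seed, stems_of, min_ratio=0.5)
```

Because only one extension is followed at each step, the number of candidates examined grows with the number of suffixes rather than with the number of suffix subsets.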
138Carnegie Mellon
Christian Monson
139Carnegie Mellon
Christian Monson
8240 Selected Candidate Paradigms
1st Ø s
2nd a as o os
3rd Ø ba ban da das do dos n ndo r ron rse rá rán
5th a aba aban ada adas ado ados an ando ar aron arse ará arán ó
11th ta tamente tas to tos
12th Ø ba ción da das do dos n ndo r ron rá rán ría
13th a aba ada adas ado ados an ando ar aron ará arán e en ó
30th a e en ida idas ido idos iendo ieron ió ía
1000th Ø g gs
1566th ido idos ir iré
2000th lia liana
3000th Ø a anar
4000th Ø e ince
8000th trada trarnos
140Carnegie Mellon
Christian Monson
Some Candidates Model Paradigms
141Carnegie Mellon
Christian Monson
Some Candidates are Errors
142Carnegie Mellon
Christian Monson
The ParaMor Algorithm
Identify Paradigms in 3 Steps
1. Search for candidate paradigms
2. Cluster candidates modeling the same paradigm
3. Filter
Segment Words Using the discovered paradigms
2 Filters New to ParaMor
143Carnegie Mellon
Christian Monson
1. Spurious String Similarities
3000th Ø a anar
From:
allØ amØ gØ sØ
alla ama ga sa
allanar amanar ganar sanar
Supported by 8 Short Types
Exclude Short Types from the Induction Vocabulary
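Excluding short types can be sketched as a simple length filter applied before paradigm induction. The cutoff `min_length=5` is an assumption chosen so that the eight short types in the example above are dropped; the slide does not give the value ParaMor uses.

```python
def induction_vocabulary(types, min_length=5):
    """Keep only types long enough to be used for paradigm induction."""
    return {t for t in types if len(t) >= min_length}

types = {
    "all", "am", "g", "s",
    "alla", "ama", "ga", "sa",
    "allanar", "amanar", "ganar", "sanar",
}
kept = induction_vocabulary(types)
```

With the short types gone, no stem remains that occurs with all of Ø, a, and anar, so the spurious candidate loses its support.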
147Carnegie Mellon
Christian Monson
2. Suffix-Internal Boundary Hypotheses
Incorrect: 3rd Ø ba ban da das do dos n ndo r ron rse rá rán
Correct: 5th a aba aban ada adas ado ados an ando ar aron arse ará arán ó
Adapt the Transition Likelihood Approach
Similar to Goldsmith (2006)
151Carnegie Mellon
Christian Monson
2. Suffix-Internal Boundary Hypotheses
3rd Ø ba ban da das do dos n ndo r ron rse rá rán
From The Candidate Stems: acompaña- anuncia- aplica- apoya- celebra- considera- controla- desarrolla- desplaza- disputa- eleva- enfrenta- forma- halla- integra- lanza- llama- llega- lleva- ocupa- pasa- presenta- realiza- registra- toma-
Entropy: 0.00
Removed
5th a aba aban ada adas ado ados an ando ar aron arse ará arán ó
From The Candidate Stems: acompañ- anunci- aplic- apoy- celebr- consider- desarroll- desplaz- disput- elev- enfrent- form- hall- integr- lanz- llam- lleg- llev- ocup- pas- present- realiz- registr- tom-
Entropy: 3.49
ParaMor discards candidates whose entropy falls below a threshold parameter
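The entropy filter can be sketched as below, assuming the entropy is taken over the distribution of stem-final characters of a candidate's stems. Under that assumption the two stem lists on this slide give 0.00 and 3.49 bits. The threshold value and the function names are illustrative; the slide says only that the threshold is a parameter.

```python
from collections import Counter
from math import log2

def stem_final_entropy(stems):
    """Entropy (in bits) of the distribution of the stems' final characters."""
    counts = Counter(stem[-1] for stem in stems)
    total = sum(counts.values())
    return sum((c / total) * log2(total / c) for c in counts.values())

def keep_candidate(stems, threshold=0.5):
    """Discard a candidate whose stem-final entropy falls below the threshold."""
    return stem_final_entropy(stems) >= threshold

stems_3rd = ("acompaña anuncia aplica apoya celebra considera controla desarrolla "
             "desplaza disputa eleva enfrenta forma halla integra lanza llama llega "
             "lleva ocupa pasa presenta realiza registra toma").split()
stems_5th = ("acompañ anunci aplic apoy celebr consider desarroll desplaz disput "
             "elev enfrent form hall integr lanz llam lleg llev ocup pas present "
             "realiz registr tom").split()
```

Every stem of the 3rd candidate ends in a, so the boundary almost certainly sits inside the suffix and the candidate is removed; the stems of the 5th candidate end in many different characters, so it is kept.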
155Carnegie Mellon
Christian Monson
156Carnegie Mellon
Christian Monson
Morphology is Complex – Operations
Prefixation
Reduplication
Infixation
Suffixation
161Carnegie Mellon
Christian Monson
Morphology is Complex – Purpose
Inflection
götür ül m üyor sun
take | passive | negative | present progressive | 2nd person singular
Derivation
mis inform ation
mis = negative, ation = abstract noun
165Carnegie Mellon
Christian Monson
Morphology is Complex – Morphophonology
götür ül m üyor sun
take | passive | negative | present progressive | 2nd person singular
You are not being taken
götür ül me yecek sin
take | passive | negative | future | 2nd person singular
You will not be taken
With future yecek: negative m → me, 2nd person singular sun → sin
171Carnegie Mellon
Christian Monson
Morphology is Complex – Ambiguity
Hungarian: mentek
men + tek: go + Present.2nd.Plural, ‘yinz go’
men + t + ek: go + PastParticiple + Plural, ‘those who have gone’
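The ambiguity shown here can be reproduced with a small enumeration of all ways to read a word as a stem followed by suffixes. The tiny lexicon below contains only the morphemes from this slide and is an assumption for illustration, not a real Hungarian analyzer.

```python
def segmentations(word, stems, suffixes):
    """All analyses of `word` as one stem followed by zero or more suffixes."""
    results = []

    def extend(rest, morphs):
        if not rest:
            results.append(morphs)
            return
        for suffix in suffixes:
            if rest.startswith(suffix):
                extend(rest[len(suffix):], morphs + [suffix])

    for stem in stems:
        if word.startswith(stem):
            extend(word[len(stem):], [stem])
    return results

analyses = segmentations("mentek", stems=["men"], suffixes=["tek", "t", "ek"])
for morphs in analyses:
    print(" + ".join(morphs))
```

Both analyses are consistent with the surface string, so a segmenter needs context or further evidence to choose between them.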
173Carnegie Mellon
Christian Monson
174Carnegie Mellon
Christian Monson
Paradigms Do Not Describe Derivation
[Diagram: mis + inform + ation / er / ement]
Paradigm Based Implies
Strong at inflectional morphology
178Carnegie Mellon
Christian Monson
179Carnegie Mellon
Christian Monson
The Next Steps for ParaMor
Scaling Paradigm Induction
Currently 50,000 types
Up to larger vocabularies
Down for languages with few resources
Parameter settings need tuning
Scaling Down Segmentation
Currently 300,000 to 2.2 million types
The larger the vocabulary, the more likely a particular stem will occur in more than one surface form
180Carnegie Mellon
Christian Monson