Evaluating an Agglutinative Segmentation Model for ParaMor

Christian Monson, Jaime Carbonell, Alon Lavie, Lori Levin

Carnegie Mellon University

Page 1: Evaluating an Agglutinative Segmentation Model for ParaMor

Carnegie Mellon

Christian Monson

Evaluating an Agglutinative Segmentation Model for ParaMor

Christian Monson

Jaime Carbonell

Alon Lavie

Lori Levin

Carnegie Mellon University

Page 2: Evaluating an Agglutinative Segmentation Model for ParaMor

Turkish Morphology – Beads on a String

One Turkish Word

götür (take) + ül (passive) + m (negative) + üyor (present progressive) + sun (2nd person singular)

You are not being taken

Page 3: Evaluating an Agglutinative Segmentation Model for ParaMor

Computational Morphology Improves:

Machine Translation: Turkish-English (Oflazer, 2007); Czech-English (Goldwater and McClosky, 2005)

Speech Recognition: Finnish (Creutz, 2006)

Grapheme-to-Phoneme Conversion: German (Demberg, 2007)

Information Retrieval: English, German, Finnish (Kurimo et al., 2008)

Page 4: Evaluating an Agglutinative Segmentation Model for ParaMor

Morphology is Complex

Operations: Suffix, Prefix, Reduplication, …

Purpose: Inflection vs. Derivation

Morphophonology

Ambiguity

Page 5: Evaluating an Agglutinative Segmentation Model for ParaMor

Complexity Demands Time and Expertise

Kemal Oflazer: Expert on Turkish and on computational morphology

Time: 3-4 months to manually build a basic Turkish analyzer, plus lexicon development and maintenance

Page 6: Evaluating an Agglutinative Segmentation Model for ParaMor

The Solution

Raw Text

Unsupervised Morphology Induction

Page 7: Evaluating an Agglutinative Segmentation Model for ParaMor

The Solution

Raw Text

?

Page 8: Evaluating an Agglutinative Segmentation Model for ParaMor

Techniques for Unsupervised Morphology Induction

Transition Likelihood

Harris (1955) – Finite State Automata

Bernhard (2007)

Page 9: Evaluating an Agglutinative Segmentation Model for ParaMor

Techniques for Unsupervised Morphology Induction

Transition Likelihood

Harris (1955) – Finite State Automata

Bernhard (2007)

Minimum Description Length

Goldsmith (2001, 2006)

Creutz’s Morfessor (2006)

Page 10: Evaluating an Agglutinative Segmentation Model for ParaMor

Techniques for Unsupervised Morphology Induction

Transition Likelihood

Harris (1955) – Finite State Automata

Bernhard (2007)

Statistical or Minimum Description Length

Goldsmith (2001, 2006)

Creutz’s Morfessor (2006)

The Paradigm

Snover (2002)

ParaMor (2004, 2007)

Page 11: Evaluating an Agglutinative Segmentation Model for ParaMor

What is a Paradigm?

götür (take) + ül (passive) + m (negative) + üyor (present progressive) + sun (2nd person singular)

Page 12: Evaluating an Agglutinative Segmentation Model for ParaMor

Paradigms Structure Inflectional Morphology

götür (take) + ül (passive) + m (negative) + üyor (present progressive) + sun (2nd person singular)

Person & Number

Page 13: Evaluating an Agglutinative Segmentation Model for ParaMor

Paradigms Structure Inflectional Morphology

götür (take) + ül (passive) + m (negative) + üyor (present progressive) + um

Person & Number: um (1st person singular)

Page 14: Evaluating an Agglutinative Segmentation Model for ParaMor

Paradigms Structure Inflectional Morphology

götür (take) + ül (passive) + m (negative) + üyor (present progressive) + um

Person & Number: um, Ø (3rd person singular)

Page 15: Evaluating an Agglutinative Segmentation Model for ParaMor

Paradigms Structure Inflectional Morphology

götür (take) + ül (passive) + m (negative) + üyor (present progressive) + um

Person & Number: um, Ø, uz

Page 16: Evaluating an Agglutinative Segmentation Model for ParaMor

Paradigms Structure Inflectional Morphology

götür (take) + ül (passive) + m (negative) + üyor (present progressive) + um

Paradigm: um, Ø, uz

Paradigm: Mutually substitutable morphological operations

Page 17: Evaluating an Agglutinative Segmentation Model for ParaMor

Paradigms Structure Inflectional Morphology

Voice: ül

Polarity: m

Tense & Aspect: üyor, yecek

Person & Number: um, Ø, uz

Page 18: Evaluating an Agglutinative Segmentation Model for ParaMor

Paradigms Structure Inflectional Morphology

Paradigm: Mutually substitutable morphological operations

Paradigms: ül | m | üyor, yecek | um, Ø, uz

Page 19: Evaluating an Agglutinative Segmentation Model for ParaMor

The ParaMor Algorithm

Paradigm: Mutually substitutable strings

ül | m | üyor, yecek | um, Ø, uz

Page 20: Evaluating an Agglutinative Segmentation Model for ParaMor

The ParaMor Algorithm

ül | m | üyor, yecek | um, Ø, uz

Paradigm

Candidate Stems

1 Morpheme Boundary

Page 21: Evaluating an Agglutinative Segmentation Model for ParaMor

The ParaMor Algorithm

Only a High-Level Overview

Simplifying Assumptions

Suffixes only: 70% of the World’s Languages are Suffixing (Dryer, 2005)

No morphophonology

Page 22: Evaluating an Agglutinative Segmentation Model for ParaMor

The ParaMor Algorithm

Identify Paradigms in 3 Steps

Roadmap: ParaMor: Identify

Page 23: Evaluating an Agglutinative Segmentation Model for ParaMor

The ParaMor Algorithm

Identify Paradigms in 3 Steps

1. Search for candidate paradigms

Roadmap: ParaMor: Identify (Search)

Page 24: Evaluating an Agglutinative Segmentation Model for ParaMor

The ParaMor Algorithm

Identify Paradigms in 3 Steps

1. Search for candidate paradigms

2. Cluster candidates modeling the same paradigm

Roadmap: ParaMor: Identify (Search, Cluster)

Page 25: Evaluating an Agglutinative Segmentation Model for ParaMor

The ParaMor Algorithm

Identify Paradigms in 3 Steps

1. Search for candidate paradigms

2. Cluster candidates modeling the same paradigm

3. Filter

Roadmap: ParaMor: Identify (Search, Cluster, Filter)

Page 26: Evaluating an Agglutinative Segmentation Model for ParaMor

The ParaMor Algorithm

Identify Paradigms in 3 Steps

1. Search for candidate paradigms

2. Cluster candidates modeling the same paradigm

3. Filter

Segment Words: Using the discovered paradigms

Roadmap: ParaMor: Identify (Search, Cluster, Filter), Segment

Page 27: Evaluating an Agglutinative Segmentation Model for ParaMor

This Presentation

Identify Paradigms in 3 Steps

1. Search for candidate paradigms

2. Cluster candidates modeling the same paradigm

3. Filter

Segment Words: Using the discovered paradigms

Example Search

Full Description in Monson et al. (SIGMORPHON 2007)

Roadmap: ParaMor: Identify (Search, Cluster, Filter), Segment; Evaluation; Results

Page 28: Evaluating an Agglutinative Segmentation Model for ParaMor

This Presentation

Identify Paradigms in 3 Steps

1. Search for candidate paradigms

2. Cluster candidates modeling the same paradigm

3. Filter

Segment Words: Using the discovered paradigms

Agglutinative Segmentation Model

Page 29: Evaluating an Agglutinative Segmentation Model for ParaMor

This Paper

Identify Paradigms in 3 Steps

1. Search for candidate paradigms

2. Cluster candidates modeling the same paradigm

3. Filter

Segment Words: Using the discovered paradigms

2 Filters Adapted from Harris (1955) and Goldsmith (2006)

Page 30: Evaluating an Agglutinative Segmentation Model for ParaMor

The ParaMor Algorithm

Identify Paradigms in 3 Steps

1. Search for candidate paradigms

2. Cluster candidates modeling the same paradigm

3. Filter

Segment Words: Using the discovered paradigms

Page 31: Evaluating an Agglutinative Segmentation Model for ParaMor

Search for Candidate Paradigms

Spanish Example

Word List (50,000 Types): autorizaciones, buscabamos, costas, importadoras, vallas, …

Propose a morpheme boundary at every character boundary in every word

Consolidate identical candidate suffixes into paradigm seeds

Paradigm seed (candidate suffix, stem count): s (10697)
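
The two steps on this slide (a boundary at every character position, then consolidation of identical candidate suffixes) can be sketched as follows. This is a minimal illustration, not ParaMor's code: the function name `paradigm_seeds` and the toy word list are assumptions, and the empty string stands in for the Ø suffix.

```python
from collections import defaultdict


def paradigm_seeds(words):
    """Map every candidate suffix to the set of candidate stems it follows."""
    seeds = defaultdict(set)
    for word in set(words):
        # One boundary after each character; the final boundary yields the
        # empty suffix, written "" here and shown as Ø on the slides.
        for cut in range(1, len(word) + 1):
            seeds[word[cut:]].add(word[:cut])
    return seeds


# Toy word list (hypothetical, far smaller than the 50,000-type list).
words = ["costa", "costas", "costar", "valla", "vallas",
         "importadora", "importadoras"]
seeds = paradigm_seeds(words)

# Rank seeds by how many candidate stems they attach to, most frequent first.
ranked = sorted(seeds, key=lambda suffix: (-len(seeds[suffix]), suffix))
```

Each key of `seeds` is a one-suffix paradigm seed, and the size of its stem set plays the role of the counts shown beside each seed on the slide.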

Page 32: Evaluating an Agglutinative Segmentation Model for ParaMor

Search for Candidate Paradigms

Identify the most frequent mutually replaceable candidate suffix

Stems that occur with one suffix in a paradigm will likely occur with other suffixes in that paradigm

Search path (candidate suffixes, stem count): s (10697); Ø s (5513)

Word list: autorizaciones, buscabamos, costaØ costas, importadoraØ importadoras, vallaØ vallas, …

Page 33: Evaluating an Agglutinative Segmentation Model for ParaMor

Search for Candidate Paradigms

A parameter halts the introduction of suffixes when the most frequent mutually replaceable candidate suffix severely decreases the stem count

Search path (candidate suffixes, stem count): s (10697); Ø s (5513); Ø r s (281)

Word list: autorizaciones, buscabamos, costar costaØ costas, importadoraØ importadoras, vallaØ vallas, …
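
The growth of a seed into a larger candidate paradigm can be sketched as a greedy loop. The sketch below is an assumption-laden simplification: the slides say only that a parameter halts growth when the stem count drops severely, so the `min_ratio` threshold, its default value, and the names `build_seeds` and `grow` are all hypothetical.

```python
from collections import defaultdict


def build_seeds(words):
    """Candidate suffix -> candidate stems, one boundary per character."""
    seeds = defaultdict(set)
    for word in set(words):
        for cut in range(1, len(word) + 1):
            seeds[word[cut:]].add(word[:cut])
    return seeds


def grow(seed_suffix, seeds, min_ratio=0.5):
    """Greedily add the suffix shared by the most current stems.

    Growth halts when the best addition would keep fewer than
    min_ratio of the current stems (an assumed halting rule).
    """
    suffixes = [seed_suffix]
    stems = set(seeds[seed_suffix])
    while True:
        best, best_stems = None, set()
        for candidate in sorted(seeds):
            if candidate in suffixes:
                continue
            shared = stems & seeds[candidate]
            if len(shared) > len(best_stems):
                best, best_stems = candidate, shared
        if best is None or len(best_stems) < min_ratio * len(stems):
            return suffixes, stems
        suffixes.append(best)
        stems = best_stems


words = ["costa", "costas", "costar", "valla", "vallas",
         "importadora", "importadoras"]
seeds = build_seeds(words)
strict = grow("s", seeds)                  # stops before adding "r"
loose = grow("s", seeds, min_ratio=0.25)   # accepts the drop to one stem
```

On this toy list the seed `s` first gains Ø, which keeps all three stems, and gains `r` only under the looser threshold, mirroring the drop from 5513 to 281 stems on the slide.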

Page 34: Evaluating an Agglutinative Segmentation Model for ParaMor

Search for Candidate Paradigms

Move on to the next most frequent paradigm seed

Search paths (candidate suffixes, stem count):

a (9020)

s (10697); Ø s (5513); Ø r s (281)

Page 35: Evaluating an Agglutinative Segmentation Model for ParaMor

Search for Candidate Paradigms

Search paths (candidate suffixes, stem count):

a (9020); a o (2325); a o os (1418); a as o os (899)

s (10697); Ø s (5513); Ø r s (281)

Page 36: Evaluating an Agglutinative Segmentation Model for ParaMor

Search for Candidate Paradigms

Search paths (candidate suffixes, stem count):

n (6039); Ø n (1863); Ø n r (512); Ø do n r (357); Ø da das do dos n ndo r ron (115)

a (9020); a o (2325); a o os (1418); a as o os (899)

s (10697); Ø s (5513); Ø r s (281)

Page 37: Evaluating an Agglutinative Segmentation Model for ParaMor

Search for Candidate Paradigms

Search paths (candidate suffixes, stem count):

es (2750); Ø es (845)

n (6039); Ø n (1863); Ø n r (512); Ø do n r (357); Ø da das do dos n ndo r ron (115)

a (9020); a o (2325); a o os (1418); a as o os (899)

s (10697); Ø s (5513); Ø r s (281)

Page 38: Evaluating an Agglutinative Segmentation Model for ParaMor

Search for Candidate Paradigms

Search paths (candidate suffixes, stem count):

an (1784); a an (1045); a an ar (417); a an ar ó (355); a ada adas ado ados an ar aron ó (148)

es (2750); Ø es (845)

n (6039); Ø n (1863); Ø n r (512); Ø do n r (357); Ø da das do dos n ndo r ron (115)

a (9020); a o (2325); a o os (1418); a as o os (899)

s (10697); Ø s (5513); Ø r s (281)

Page 39: Evaluating an Agglutinative Segmentation Model for ParaMor

Search for Candidate Paradigms

Search paths (candidate suffixes, stem count):

strado (15); strada strado (12); strada strado stró (9); strada strado strar stró (8); strada stradas strado strar stró (7)

rado (167); rada rado (89); rada rado rados (67); rada radas rado rados (53); ra rada radas rado rados ran rar raron ró (23)

...

an (1784); a an (1045); a an ar (417); a an ar ó (355); a ada adas ado ados an ar aron ó (148)

es (2750); Ø es (845)

n (6039); Ø n (1863); Ø n r (512); Ø do n r (357); Ø da das do dos n ndo r ron (115)

a (9020); a o (2325); a o os (1418); a as o os (899)

s (10697); Ø s (5513); Ø r s (281)

Page 40: Evaluating an Agglutinative Segmentation Model for ParaMor

The ParaMor Algorithm

Identify Paradigms in 3 Steps

1. Search for candidate paradigms

2. Cluster candidates modeling the same paradigm

3. Filter

Segment Words: Using the discovered paradigms

Page 41: Evaluating an Agglutinative Segmentation Model for ParaMor

A Few of the 42 Final Paradigms

4 Suffixes: Ø menente mente s

11 Suffixes: a amente as illa illas o or ora oras ores os

41 Suffixes: a aba aban acion aciones ación ada adas ado ador adora adoras adores ados amos an ando ante antes ar ara aran aremos arla arlas arlo arlos arme aron arse ará arán aré aría arían ase e en ándose é ó

29 Suffixes: e edor edora edoras edores en er erlo erlos erse erá erán ería erían ida idas ido idos iendo iera ieran ieron imiento imientos iéndose ió í ía ían

20 Suffixes: ida idas ido idor idores idos imos ir iremos irle irlo irlos irse irá irán iré iría irían ía ían

29 Suffixes: ce cedores cemos cen cer cerlo cerlos cerse cerá cerán cería cida cidas cido cidos ciendo ciera cieran cieron cimiento cimientos cimos ció cí cía cían zca zcan zco

6 Suffixes: Ø es idad idades mente ísima

Page 42: Evaluating an Agglutinative Segmentation Model for ParaMor

A Few of the 42 Final Paradigms

Number on Nouns

Page 43: Evaluating an Agglutinative Segmentation Model for ParaMor

A Few of the 42 Final Paradigms

Number & Gender on Adjectives

Page 44: Evaluating an Agglutinative Segmentation Model for ParaMor

A Few of the 42 Final Paradigms

Verbal Suffixes

Page 45: Evaluating an Agglutinative Segmentation Model for ParaMor

The ParaMor Algorithm

Identify Paradigms in 3 Steps

1. Search for candidate paradigms

2. Cluster candidates modeling the same paradigm

3. Filter

Segment Words: Using the discovered paradigms

Agglutinative Segmentation Model

Page 46: Evaluating an Agglutinative Segmentation Model for ParaMor

Segment Words Using the Paradigms

administradas: ‘Feminine gender nouns under administration’

Page 47: Evaluating an Agglutinative Segmentation Model for ParaMor

Segment Words Using the Paradigms

administr + ad + a + s

ad: Past Participle; a: Feminine; s: Plural

Page 48: Evaluating an Agglutinative Segmentation Model for ParaMor

Segment Words Using the Paradigms

administradas

Page 49: Evaluating an Agglutinative Segmentation Model for ParaMor

Segment Words Using the Paradigms

administradas

administrada: Also in corpus

Page 50: Evaluating an Agglutinative Segmentation Model for ParaMor

Segment Words Using the Paradigms

administradas, administrada

Page 51: Evaluating an Agglutinative Segmentation Model for ParaMor

Segment Words Using the Paradigms

administradas, administradaØ

Page 52: Evaluating an Agglutinative Segmentation Model for ParaMor

Segment Words Using the Paradigms

administr + ad + a + s

Recovers multiple morpheme boundaries from candidate paradigms which each propose single morpheme boundaries
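
One way to read the segmentation slides is as a suffix substitution test: a boundary is placed before a suffix when swapping it for another suffix of the same paradigm yields a word that is also in the corpus, and the boundaries proposed by different paradigms are then merged. The sketch below follows that reading. The function name `segment`, the two-word vocabulary, and the abbreviated paradigms (subsets of three of the lists above, with "" standing in for Ø) are assumptions for illustration.

```python
def segment(word, vocabulary, paradigms):
    """Split a word at every boundary supported by some paradigm."""
    boundaries = set()
    for paradigm in paradigms:
        for suffix in paradigm:
            # Skip Ø (no visible boundary) and suffixes that do not fit.
            if not suffix or len(suffix) >= len(word):
                continue
            if not word.endswith(suffix):
                continue
            stem = word[: len(word) - len(suffix)]
            # Substitution test: does the stem occur with another suffix
            # of the same paradigm somewhere in the corpus?
            if any(stem + other in vocabulary
                   for other in paradigm if other != suffix):
                boundaries.add(len(stem))
    pieces, start = [], 0
    for cut in sorted(boundaries):
        pieces.append(word[start:cut])
        start = cut
    pieces.append(word[start:])
    return pieces


# Abbreviated, hypothetical subsets of the discovered paradigms.
paradigms = [
    {"", "mente", "s"},
    {"a", "as", "o", "os"},
    {"ada", "adas", "ado", "ados", "ar"},
]
vocabulary = {"administradas", "administrada"}
pieces = segment("administradas", vocabulary, paradigms)
```

Each paradigm contributes at most one boundary per matching suffix, yet the merged result is the multi-boundary analysis administr + ad + a + s.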

Page 53: Evaluating an Agglutinative Segmentation Model for ParaMor

Segment Words Using the Paradigms

administr + ad + a + s

Baseline ParaMor: single morpheme boundary in each analysis of each word

Page 54: Evaluating an Agglutinative Segmentation Model for ParaMor

Morpho Challenge 2007

Peer operated competition for unsupervised morphology induction algorithms

Page 55: Evaluating an Agglutinative Segmentation Model for ParaMor

Morpho Challenge 2007

Peer operated competition for unsupervised morphology induction algorithms

4 languages

English (384,904 Types)

German (1,266,160 Types)

Finnish (2,206,720 Types)

Turkish (617,299 Types)

Page 56: Evaluating an Agglutinative Segmentation Model for ParaMor

Morpho Challenge 2007

Peer operated competition for unsupervised morphology induction algorithms

4 languages

English (384,904 Types)

German (1,266,160 Types)

Finnish (2,206,720 Types)

Turkish (617,299 Types)

2 methods of evaluation

Linguistic – Morpheme Identification

Information Retrieval

Page 57: Evaluating an Agglutinative Segmentation Model for ParaMor

Morpho Challenge 2007

Peer operated competition for unsupervised morphology induction algorithms

4 languages

English (384,904 Types)

German (1,266,160 Types)

Finnish (2,206,720 Types)

Turkish (617,299 Types)

2 methods of evaluation

Linguistic – Morpheme Identification

Information Retrieval

Today

Page 58: Evaluating an Agglutinative Segmentation Model for ParaMor

Morpho Challenge 2007

Peer operated competition for unsupervised morphology induction algorithms

4 languages

English (384,904 Types)

German (1,266,160 Types)

Finnish (2,206,720 Types)

Turkish (617,299 Types)

2 methods of evaluation

Linguistic – Morpheme Identification

Information Retrieval

Developed on Spanish

Parameters Frozen

Page 59: Evaluating an Agglutinative Segmentation Model for ParaMor

Combine ParaMor and Morfessor

Morfessor: Freely available unsupervised morphology induction system (Creutz, 2006)

Combine ParaMor and Morfessor: Performs better than either system alone (Monson et al., 2007)

Page 60: Evaluating an Agglutinative Segmentation Model for ParaMor

Linguistic Evaluation

[Bar chart: F1 (axis ticks 20, 40, 60) by language: English, German, Finnish, Turkish. Bar labels: ParaMor & Morfessor, Morfessor. Annotation: Baseline. Value labels in this build: 50.7, 47.2]
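
The charts in this part of the talk report F1. Under the standard definition, F1 is the harmonic mean of precision and recall, which can be written out as below. The slides do not list the underlying precision and recall, so the numbers in the example are hypothetical placeholders, and the function name `f1` is an assumption.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical values on a 0-100 scale, not taken from the slides.
balanced = f1(50.0, 50.0)
skewed = f1(90.0, 10.0)
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a system with lopsided precision and recall scores well below one with the same average but balanced values.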

Page 61: Evaluating an Agglutinative Segmentation Model for ParaMor

Linguistic Evaluation

[Bar chart: F1 (axis ticks 20, 40, 60) by language: English, German, Finnish, Turkish. Bar labels: Morfessor, ParaMor & Morfessor, Agglutinative P & M. Value labels in this build: 56.3, 50.7, 47.2]

Page 62: Evaluating an Agglutinative Segmentation Model for ParaMor

Linguistic Evaluation

[Bar chart: F1 (axis ticks 20, 40, 60) by language: English, German, Finnish, Turkish. Bar labels: Bernhard, Morfessor, ParaMor & Morfessor, Agglutinative P & M. Value labels in this build: 60.8, 56.3, 50.7, 47.2]

Page 63: Evaluating an Agglutinative Segmentation Model for ParaMor

Linguistic Evaluation

[Bar chart: F1 (axis ticks 20, 40, 60) by language: English, German, Finnish, Turkish. Bar labels: Bernhard, Morfessor, ParaMor & Morfessor, Agglutinative P & M. Value labels in this build: 60.8, 56.3, 52.9, 50.7, 47.2, 47.8, 53.4]

Page 64: Evaluating an Agglutinative Segmentation Model for ParaMor

Linguistic Evaluation

[Bar chart: F1 (axis ticks 20, 40, 60) by language: English, German, Finnish, Turkish. Bar labels: Bernhard, Morfessor, ParaMor & Morfessor, Agglutinative P & M. Value labels in this build: 60.8, 56.3, 52.9, 54.1, 50.7, 47.2, 47.8, 53.4]

Page 65: Evaluating an Agglutinative Segmentation Model for ParaMor

Linguistic Evaluation

[Bar chart: F1 (axis ticks 20, 40, 60) by language: English, German, Finnish, Turkish. Bar labels: Bernhard, Morfessor, ParaMor & Morfessor, Agglutinative P & M. Value labels in this build: 60.8, 56.3, 52.9, 54.1, 48.2, 50.7, 47.2, 47.8, 53.4, 40.6, 43.2]

Page 66: Evaluating an Agglutinative Segmentation Model for ParaMor

Linguistic Evaluation

[Bar chart: F1 (axis ticks 20, 40, 60) by language: English, German, Finnish, Turkish. Bar labels: Bernhard, Morfessor, ParaMor & Morfessor, Agglutinative P & M. Value labels in this build: 60.8, 56.3, 52.9, 54.1, 48.2, 48.5, 50.7, 47.2, 47.8, 53.4, 40.6, 43.2]

Page 67: Evaluating an Agglutinative Segmentation Model for ParaMor

Linguistic Evaluation

[Bar chart: F1 (axis ticks 20, 40, 60) by language: English, German, Finnish, Turkish. Bar labels: Bernhard, Morfessor, ParaMor & Morfessor, Agglutinative P & M. Value labels in this build: 60.8, 56.3, 52.9, 54.1, 48.2, 48.5, 24.7, 50.7, 47.2, 47.8, 53.4, 40.6, 43.2]

Page 68: Evaluating an Agglutinative Segmentation Model for ParaMor

Linguistic Evaluation

[Bar chart: F1 (axis ticks 20, 40, 60) by language: English, German, Finnish, Turkish. Bar labels: Bernhard, Morfessor, ParaMor & Morfessor, Agglutinative P & M. Value labels in this build: 60.8, 56.3, 52.9, 54.1, 48.2, 48.5, 24.7, 50.7, 47.2, 47.8, 53.4, 40.6, 38.5, 43.2]

Page 69: Evaluating an Agglutinative Segmentation Model for ParaMor

Linguistic Evaluation

[Bar chart: F1 (axis ticks 20, 40, 60) by language: English, German, Finnish, Turkish. Systems: Bernhard, Morfessor, ParaMor & Morfessor, Agglutinative P & M. Bar values shown on this build: 60.8, 56.3, 52.9, 54.1, 48.2, 48.5, 24.7, 50.7, 47.2, 47.8, 53.4, 40.6, 38.5, 43.2, 46.7]

Page 70: Evaluating an Agglutinative Segmentation Model for ParaMor

Linguistic Evaluation

[Bar chart: F1 (axis ticks 20, 40, 60) by language: English, German, Finnish, Turkish. Systems: Bernhard, Morfessor, ParaMor & Morfessor, Agglutinative P & M. Bar values shown on this build: 60.8, 56.3, 52.9, 54.1, 48.2, 48.5, 24.7, 52.0, 50.7, 47.2, 47.8, 53.4, 40.6, 38.5, 43.2, 46.7]

Page 71: Evaluating an Agglutinative Segmentation Model for ParaMor

ParaMor: State-of-the-Art Unsupervised Morphology Induction System

ParaMor

Identifies paradigms: the organizing structure of inflectional morphology

Segments words as discovered paradigms suggest

Our Agglutinative Segmentation Model

Significantly improves morpheme identification

Particularly for agglutinative languages

Page 72: Evaluating an Agglutinative Segmentation Model for ParaMor

The Next Steps for ParaMor

Beyond Suffixes

English, German, Finnish, Turkish, and Spanish are all primarily suffixing

Straightforward extension to ParaMor for prefixes

More challenging: reduplication, infixation, etc.

Page 73: Evaluating an Agglutinative Segmentation Model for ParaMor


Beyond ParaMor

Improve Performance

Segmentation F1 of 50-60% is state of the art!

Morphophonology is the primary culprit

Simply splitting words cannot identify alternate forms of the same morpheme

Page 74: Evaluating an Agglutinative Segmentation Model for ParaMor


Page 75: Evaluating an Agglutinative Segmentation Model for ParaMor


Page 76: Evaluating an Agglutinative Segmentation Model for ParaMor

The ParaMor Algorithm

Identify Paradigms in 3 Steps

1. Search for candidate paradigms

2. Cluster candidates modeling the same paradigm

3. Filter

Segment Words Using the discovered paradigms

Page 77: Evaluating an Agglutinative Segmentation Model for ParaMor

Cluster Paradigm Fragments

15: a aba ada adas ado ados an ando ar aron arse ará arán aría ó
22 Stems: anunci- aplic- apoy- celebr- concentr- …

15: a aba ada adas ado ados an ando ar ara aron arse ará arán ó
23 Stems: anunci- apoy- confirm- consider- declar- …

Page 78: Evaluating an Agglutinative Segmentation Model for ParaMor

Cluster Paradigm Fragments

15: a aba ada adas ado ados an ando ar aron arse ará arán aría ó
22 Stems: anunci- aplic- apoy- celebr- concentr- …

15: a aba ada adas ado ados an ando ar ara aron arse ará arán ó
23 Stems: anunci- apoy- confirm- consider- declar- …

anunci+a aplic+a apoy+a … anunci+aba aplic+aba apoy+aba … anunci+aría …

Page 79: Evaluating an Agglutinative Segmentation Model for ParaMor

Cluster Paradigm Fragments

15: a aba ada adas ado ados an ando ar aron arse ará arán aría ó
22 Stems: anunci- aplic- apoy- celebr- concentr- …
anunci+a aplic+a apoy+a … anunci+aba aplic+aba apoy+aba … anunci+aría …

15: a aba ada adas ado ados an ando ar ara aron arse ará arán ó
23 Stems: anunci- apoy- confirm- consider- declar- …
anunci+a apoy+a confirm+a … anunci+aba apoy+aba confirm+aba … anunci+ara …

Page 80: Evaluating an Agglutinative Segmentation Model for ParaMor

Cluster Paradigm Fragments

15: a aba ada adas ado ados an ando ar aron arse ará arán aría ó
22 Stems: anunci- aplic- apoy- celebr- concentr- …
anunci+a aplic+a apoy+a … anunci+aba aplic+aba apoy+aba … anunci+aría …

15: a aba ada adas ado ados an ando ar ara aron arse ará arán ó
23 Stems: anunci- apoy- confirm- consider- declar- …
anunci+a apoy+a confirm+a … anunci+aba apoy+aba confirm+aba … anunci+ara …

Cosine Similarity: 0.664
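The 0.664 figure is consistent with cosine similarity computed over the two sets of boundary-annotated supporting types, |X ∩ Y| / sqrt(|X| · |Y|). The sketch below is only an illustration: the slide elides most stems, so the stem names are placeholders, and the count of 16 shared stems is an assumption chosen because it reproduces the reported value.

```python
from math import sqrt

def supporting_types(stems, suffixes):
    """All boundary-annotated types stem+suffix licensed by one candidate."""
    return {f"{stem}+{suffix}" for stem in stems for suffix in suffixes}

def set_cosine(x, y):
    """Cosine similarity of two sets viewed as binary indicator vectors."""
    if not x or not y:
        return 0.0
    return len(x & y) / sqrt(len(x) * len(y))

# Suffix sets copied from the two paradigm fragments on the slide.
suffixes_1 = "a aba ada adas ado ados an ando ar aron arse ará arán aría ó".split()
suffixes_2 = "a aba ada adas ado ados an ando ar ara aron arse ará arán ó".split()

# Hypothetical stems (assumption): 16 shared, 6 only in the first fragment
# (22 in total), 7 only in the second fragment (23 in total).
shared = [f"shared{i}" for i in range(16)]
stems_1 = shared + [f"only1_{i}" for i in range(6)]
stems_2 = shared + [f"only2_{i}" for i in range(7)]

x = supporting_types(stems_1, suffixes_1)   # 22 stems x 15 suffixes
y = supporting_types(stems_2, suffixes_2)   # 23 stems x 15 suffixes
similarity = set_cosine(x, y)
print(round(similarity, 3))  # 0.664
```

Under these assumptions the overlap is 16 stems times 14 shared suffixes, or 224 types, against set sizes of 330 and 345.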

Page 81: Evaluating an Agglutinative Segmentation Model for ParaMor

Cluster Paradigm Fragments

15: a aba ada adas ado ados an ando ar aron arse ará arán aría ó
22 Stems: anunci- aplic- apoy- celebr- concentr- …

15: a aba ada adas ado ados an ando ar ara aron arse ará arán ó
23 Stems: anunci- apoy- confirm- consider- declar- …

16: a aba ada adas ado ados an ando ar ara aron arse ará arán aría ó
Cosine Similarity: 0.664

Page 82: Evaluating an Agglutinative Segmentation Model for ParaMor

Cluster Paradigm Fragments

15: a aba aban ada adas ado ados an ando ar aron arse ará arán ó
25 Stems: anunci- aplic- apoy- celebr- consider- …

15: a aba ada adas ado ados an ando ar aron arse ará arán aría ó
22 Stems: anunci- aplic- apoy- celebr- concentr- …

15: a aba ada adas ado ados an ando ar ara aron arse ará arán ó
23 Stems: anunci- apoy- confirm- consider- declar- …

16: a aba ada adas ado ados an ando ar ara aron arse ará arán aría ó
Cosine Similarity: 0.664

Page 83: Evaluating an Agglutinative Segmentation Model for ParaMor

Cluster Paradigm Fragments

15: a aba aban ada adas ado ados an ando ar aron arse ará arán ó
25 Stems: anunci- aplic- apoy- celebr- consider- …

15: a aba ada adas ado ados an ando ar aron arse ará arán aría ó
22 Stems: anunci- aplic- apoy- celebr- concentr- …

15: a aba ada adas ado ados an ando ar ara aron arse ará arán ó
23 Stems: anunci- apoy- confirm- consider- declar- …

16: a aba ada adas ado ados an ando ar ara aron arse ará arán aría ó
Cosine Similarity: 0.664

17: a aba aban ada adas ado ados an ando ar ara aron arse ará arán aría ó
Cosine Similarity: 0.715

Page 84: Evaluating an Agglutinative Segmentation Model for ParaMor

Cluster Paradigm Fragments

15: a aba aban ada adas ado ados an ando ar aron arse ará arán ó
25 Stems: anunci- aplic- apoy- celebr- consider- …

15: a aba ada adas ado ados an ando ar aron arse ará arán aría ó
22 Stems: anunci- aplic- apoy- celebr- concentr- …

15: a aba ada adas ado ados an ando ar ara aron arse ará arán ó
23 Stems: anunci- apoy- confirm- consider- declar- …

16: a aba ada adas ado ados an ando ar ara aron arse ará arán aría ó
Cosine Similarity: 0.664

17: a aba aban ada adas ado ados an ando ar ara aron arse ará arán aría ó
Cosine Similarity: 0.715

Continue Clustering Until: any merger would place in the same cluster 2 suffixes which share no stem in the corpus

Page 85: Evaluating an Agglutinative Segmentation Model for ParaMor

Cluster Paradigm Fragments

15: a aba aban ada adas ado ados an ando ar aron arse ará arán ó
25 Stems: anunci- aplic- apoy- celebr- consider- …

15: a aba ada adas ado ados an ando ar aron arse ará arán aría ó
22 Stems: anunci- aplic- apoy- celebr- concentr- …

15: a aba ada adas ado ados an ando ar ara aron arse ará arán ó
23 Stems: anunci- apoy- confirm- consider- declar- …

16: a aba ada adas ado ados an ando ar ara aron arse ará arán aría ó
Cosine Similarity: 0.664

17: a aba aban ada adas ado ados an ando ar ara aron arse ará arán aría ó
Cosine Similarity: 0.715

Continue Clustering Until: any merger would place in the same cluster 2 suffixes which share no stem in the corpus

11: a e en ida idas ido idos iendo ieron ió ía
15 Stems: culpl- discut- emit- part- recib- reun- transmit- un- vend- viv- …

Page 86: Evaluating an Agglutinative Segmentation Model for ParaMor

Cluster Paradigm Fragments

15: a aba aban ada adas ado ados an ando ar aron arse ará arán ó
25 Stems: anunci- aplic- apoy- celebr- consider- …

15: a aba ada adas ado ados an ando ar aron arse ará arán aría ó
22 Stems: anunci- aplic- apoy- celebr- concentr- …

15: a aba ada adas ado ados an ando ar ara aron arse ará arán ó
23 Stems: anunci- apoy- confirm- consider- declar- …

16: a aba ada adas ado ados an ando ar ara aron arse ará arán aría ó
Cosine Similarity: 0.664

17: a aba aban ada adas ado ados an ando ar ara aron arse ará arán aría ó
Cosine Similarity: 0.715

In all, 23 initial candidate paradigms joined this cluster

Page 87: Evaluating an Agglutinative Segmentation Model for ParaMor


Page 88: Evaluating an Agglutinative Segmentation Model for ParaMor

Cluster Overlapping Candidates

Greedy bottom-up agglomerative clustering

Merge most similar candidate paradigms

Cosine similarity over the sets X and Y of boundary annotated supporting types: $\cos(X, Y) = \frac{|X \cap Y|}{\sqrt{|X|\,|Y|}}$

Halting condition: The corpus must contain paradigmatic evidence for each pair of suffixes in a cluster

Two suffixes may not be in the same cluster if they share no common candidate stem in the corpus
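The clustering loop and its halting condition can be sketched as follows. This is a minimal illustration under stated assumptions, not ParaMor's implementation: the function names, the toy Spanish vocabulary, and the rule that clusters with no shared types are never merged are all hypothetical choices made for this example.

```python
from itertools import combinations
from math import sqrt

def stems_by_suffix(vocabulary):
    """Map every candidate suffix ("" stands for Ø) to the stems it attaches to."""
    table = {}
    for word in vocabulary:
        for i in range(1, len(word) + 1):
            table.setdefault(word[i:], set()).add(word[:i])
    return table

def merge_allowed(suffixes, table):
    """Halting condition: each pair of suffixes must share a stem in the corpus."""
    return all(table.get(f, set()) & table.get(g, set())
               for f, g in combinations(sorted(suffixes), 2))

def cosine(x, y):
    """Set cosine similarity; 0.0 if either set is empty."""
    if not x or not y:
        return 0.0
    return len(x & y) / sqrt(len(x) * len(y))

def cluster(candidates, vocabulary):
    """Greedy bottom-up merging of (stems, suffixes) candidates."""
    table = stems_by_suffix(vocabulary)
    clusters = [({(t, f) for t in stems for f in suffixes}, set(suffixes))
                for stems, suffixes in candidates]
    while True:
        best = None
        for (i, (x_types, x_sufs)), (j, (y_types, y_sufs)) in combinations(
                enumerate(clusters), 2):
            score = cosine(x_types, y_types)
            # Assumption: clusters with no shared types are never merged.
            if score == 0 or not merge_allowed(x_sufs | y_sufs, table):
                continue
            if best is None or score > best[0]:
                best = (score, i, j)
        if best is None:          # no permitted merger remains
            return clusters
        _, i, j = best
        merged = (clusters[i][0] | clusters[j][0], clusters[i][1] | clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

# Toy data (hypothetical), not taken from the talk's corpus.
vocabulary = ("habla hablaba hablan hablar canta cantaba cantan cantar "
              "come comen comía vive viven vivía").split()
candidates = [
    ({"habl", "cant"}, {"a", "aba", "an"}),
    ({"habl", "cant"}, {"a", "an", "ar"}),
    ({"com", "viv"}, {"e", "en", "ía"}),
]
result = cluster(candidates, vocabulary)
for _, suffixes in result:
    print(sorted(suffixes))
```

On this toy input the two ar-verb fragments merge, while the third candidate stays separate because, for example, a and e share no stem in the vocabulary.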

Page 89: Evaluating an Agglutinative Segmentation Model for ParaMor


Page 90: Evaluating an Agglutinative Segmentation Model for ParaMor

A Few of the 42 Final Paradigm Clusters

4 Suffixes Ø menente mente s

11 Suffixes a amente as illa illas o or ora oras ores os

41 Suffixes a aba aban acion aciones ación ada adas ado ador adora adoras adores ados amos an ando ante antes ar ara aran aremos arla arlas arlo arlos arme aron arse ará arán aré aría arían ase e en ándose é ó

29 Suffixes e edor edora edoras edores en er erlo erlos erse erá erán ería erían ida idas ido idos iendo iera ieran ieron imiento imientos iéndose ió í ía ían

20 Suffixes ida idas ido idor idores idos imos ir iremos irle irlo irlos irse irá irán iré iría irían ía ían

29 Suffixes ce cedores cemos cen cer cerlo cerlos cerse cerá cerán cería cida cidas cido cidos ciendo ciera cieran cieron cimiento cimientos cimos ció cí cía cían zca zcan zco

6 Suffixes Ø es idad idades mente ísima

Page 91: Evaluating an Agglutinative Segmentation Model for ParaMor

Spanish Derivation and Clitics

4 Suffixes Ø menente mente s

11 Suffixes a amente as illa illas o or ora oras ores os

41 Suffixes a aba aban acion aciones ación ada adas ado ador adora adoras adores ados amos an ando ante antes ar ara aran aremos arla arlas arlo arlos arme aron arse ará arán aré aría arían ase e en ándose é ó

29 Suffixes e edor edora edoras edores en er erlo erlos erse erá erán ería erían ida idas ido idos iendo iera ieran ieron imiento imientos iéndose ió í ía ían

20 Suffixes ida idas ido idor idores idos imos ir iremos irle irlo irlos irse irá irán iré iría irían ía ían

29 Suffixes ce cedores cemos cen cer cerlo cerlos cerse cerá cerán cería cida cidas cido cidos ciendo ciera cieran cieron cimiento cimientos cimos ció cí cía cían zca zcan zco

6 Suffixes Ø es idad idades mente ísima

Page 92: Evaluating an Agglutinative Segmentation Model for ParaMor

Morphophonology in ParaMor

4 Suffixes Ø menente mente s

11 Suffixes a amente as illa illas o or ora oras ores os

41 Suffixes a aba aban acion aciones ación ada adas ado ador adora adoras adores ados amos an ando ante antes ar ara aran aremos arla arlas arlo arlos arme aron arse ará arán aré aría arían ase e en ándose é ó

29 Suffixes e edor edora edoras edores en er erlo erlos erse erá erán ería erían ida idas ido idos iendo iera ieran ieron imiento imientos iéndose ió í ía ían

20 Suffixes ida idas ido idor idores idos imos ir iremos irle irlo irlos irse irá irán iré iría irían ía ían

29 Suffixes ce cedores cemos cen cer cerlo cerlos cerse cerá cerán cería cida cidas cido cidos ciendo ciera cieran cieron cimiento cimientos cimos ció cí cía cían zca zcan zco

6 Suffixes Ø es idad idades mente ísima

Page 93: Evaluating an Agglutinative Segmentation Model for ParaMor


Page 94: Evaluating an Agglutinative Segmentation Model for ParaMor

Morpheme to Feature Mapping

El árbol Ø cayó (The tree fell)
((TENSE past) (LEXICAL-ASPECT activity) ... (SUBJ ((NUM sg) (PERSON 3sg) ...)))

Los árboles cayeron (The trees fell)
((TENSE past) (LEXICAL-ASPECT activity) ... (SUBJ ((NUM pl) (PERSON 3sg) ...)))

[Parse tree shown for both sentences: S → NP VP; NP → Det N; VP → V]

Subject Number marked in 3 places:

1. on N head with Ø = sg, es = pl

2. on dependent Det with El = sg, Los = pl

3. on governing V with ó = sg, eron = pl

Page 95: Evaluating an Agglutinative Segmentation Model for ParaMor


Page 96: Evaluating an Agglutinative Segmentation Model for ParaMor

IR Evaluation

Task Based Evaluation: Information retrieval

Data from CLEF (Cross-Language Evaluation Forum)

Short two-sentence queries

About international news topics

Binary relevance assessments

About 50 queries and 20K relevance judgments for each language

Okapi term weighting

Page 97: Evaluating an Agglutinative Segmentation Model for ParaMor

IR Evaluation

[Bar chart: Average Precision (axis ticks 25, 35, 45) by language: English, German, Finnish, Turkish. Systems: Bernhard, Morfessor, ParaMor, ParaMor & Morf. Bar values shown on this build: 39.4, 39.6, 39.3, 37.2. Reference line: 31.2 – No Morphological Analysis]

Page 98: Evaluating an Agglutinative Segmentation Model for ParaMor

IR Evaluation

[Bar chart: Average Precision (axis ticks 25, 35, 45) by language: English, German, Finnish, Turkish. Systems: Bernhard, Morfessor, ParaMor, ParaMor & Morfessor. Bar values shown on this build: 39.4, 39.6, 47.3, 48.1, 39.3, 37.2, 46.0, 39.6. Reference line: 32.3 – No Morphological Analysis]

Page 99: Evaluating an Agglutinative Segmentation Model for ParaMor

IR Evaluation

[Bar chart: Average Precision (axis ticks 25, 35, 45) by language: English, German, Finnish, Turkish. Systems: Bernhard, Morfessor, ParaMor, ParaMor & Morfessor. Bar values shown on this build: 39.4, 39.6, 47.3, 48.1, 49.2, 38.8, 39.3, 37.2, 46.0, 39.6, 37.9, 36.9. Reference line: 32.3 – No Morphological Analysis]

Page 100: Evaluating an Agglutinative Segmentation Model for ParaMor

IR Evaluation

Turkish Was Not Evaluated for the IR Task

[Bar chart: Average Precision (axis ticks 25, 35, 45) by language: English, German, Finnish, Turkish. Systems: Bernhard, Morfessor, ParaMor, ParaMor & Morfessor. Bar values shown: 39.4, 39.6, 47.3, 48.1, 49.2, 38.8, 39.3, 37.2, 46.0, 39.6, 37.9, 36.9]

Page 101: Evaluating an Agglutinative Segmentation Model for ParaMor


Page 102: Evaluating an Agglutinative Segmentation Model for ParaMor

Is the Beads-on-a-String Model Adequate?

Page 103: Evaluating an Agglutinative Segmentation Model for ParaMor


A Sample of 894 Languages (Dryer, 2005)

Page 104: Evaluating an Agglutinative Segmentation Model for ParaMor


86% Have Affixational Morphology

Affixing Languages

Little Affixation

Page 105: Evaluating an Agglutinative Segmentation Model for ParaMor

70% are Suffixing

Primarily Prefixation

Little Affixation

Significant Suffixation

Page 106: Evaluating an Agglutinative Segmentation Model for ParaMor


Page 107: Evaluating an Agglutinative Segmentation Model for ParaMor


Paradigms Do Not Describe Derivation

inform ationmiser

ment

manage mentmis

Page 108: Evaluating an Agglutinative Segmentation Model for ParaMor


Paradigms Do Not Describe Derivation

inform ationmiser

ment

manage mentmisation

Page 109: Evaluating an Agglutinative Segmentation Model for ParaMor


Paradigms Do Not Describe Derivation

inform ationmiser

ment

manage mentmisation

Page 110: Evaluating an Agglutinative Segmentation Model for ParaMor


Page 111: Evaluating an Agglutinative Segmentation Model for ParaMor

Morphology is Complex – Fusion

götür ül me yecek sin

take passive negative present 2nd person singular

You are not taken

Page 112: Evaluating an Agglutinative Segmentation Model for ParaMor

Morphology is Complex – Fusion

götür ül mez sin

take passive negative-present 2nd person singular

You are not taken

Page 113: Evaluating an Agglutinative Segmentation Model for ParaMor

Morphology is Complex – Fusion

götür ül mez sin

take passive negative-present 2nd person singular

You are not taken

Page 114: Evaluating an Agglutinative Segmentation Model for ParaMor


Page 115: Evaluating an Agglutinative Segmentation Model for ParaMor

Search for Candidate Paradigms

Size of Search Space

Huge: 2^|candidate suffixes|

Most candidate suffixes have no common stems

Still Exponential

Greedily searched space: O(|candidate suffixes|)

This example: 0.1% of the searched space

[Search lattice excerpt]

s 10697: autorizaciones buscabamos …

Ø s 5513: costaØ costas importadoraØ importadoras vallaØ vallas …

Page 116: Evaluating an Agglutinative Segmentation Model for ParaMor


Page 117: Evaluating an Agglutinative Segmentation Model for ParaMor


Some Candidates are Errors

1st Ø s

2nd a as o os

3rd Ø ba ban da das do dos n ndo r ron rse rá rán

5th a aba aban ada adas ado ados an ando ar aron arse ará arán ó

11th ta tamente tas to tos

12th Ø ba ción da das do dos n ndo r ron rá rán ría

13th a aba ada adas ado ados an ando ar aron ará arán e en ó

30th a e en ida idas ido idos iendo ieron ió ía

1000th Ø g gs

1566th ido idos ir iré

2000th lia liana

3000th Ø a anar

4000th Ø e ince

8000th trada trarnos


Stem-internal boundary hypothesis

Correct

Incorrect

Page 118: Evaluating an Agglutinative Segmentation Model for ParaMor


Page 119: Evaluating an Agglutinative Segmentation Model for ParaMor

Enable Cross-Lingual Communication

7000 languages in the world

6.66 billion people

Half speak one of the 10 largest languages

Half don’t!

Page 120: Evaluating an Agglutinative Segmentation Model for ParaMor


Page 121: Evaluating an Agglutinative Segmentation Model for ParaMor

Preliminary Linguistic Evaluation

Built 2 styles of answer keys

For 2 languages

            Inflectional Only          Inflectional & Derivational
            English      German        English      German
            P     R      P     R       P     R      P     R

Page 122: Evaluating an Agglutinative Segmentation Model for ParaMor

Preliminary Linguistic Evaluation

Built 2 styles of answer keys

For 2 languages

            Inflectional Only          Inflectional & Derivational
            English      German        English      German
            P     R      P     R       P     R      P     R
ParaMor     33.0  81.4   42.8  68.6    48.9  53.6   60.0  33.5

Page 123: Evaluating an Agglutinative Segmentation Model for ParaMor

Preliminary Linguistic Evaluation

            Inflectional Only          Inflectional & Derivational
            English      German        English      German
            P     R      P     R       P     R      P     R
ParaMor     33.0  81.4   42.8  68.6    48.9  53.6   60.0  33.5
Morfessor

Morfessor: Freely available unsupervised morphology induction system

Statistical model of morphology

Page 124: Evaluating an Agglutinative Segmentation Model for ParaMor

Preliminary Linguistic Evaluation

            Inflectional Only          Inflectional & Derivational
            English      German        English      German
            P     R      P     R       P     R      P     R
ParaMor     33.0  81.4   42.8  68.6    48.9  53.6   60.0  33.5
Morfessor   53.3  47.0   38.7  44.2    73.6  34.0   66.9  37.1

Morfessor: Freely available unsupervised morphology induction system

Statistical model of morphology
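The precision (P) and recall (R) pairs on this slide can be folded into a single F1 score, the measure used in the later evaluation charts. The sketch below applies the standard harmonic-mean formula. It assumes the flattened numbers map onto columns in the order the slide lists them (Inflectional Only English, German; then Inflectional & Derivational English, German). The derived F1 values are computed here and do not appear on the slide.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (system, answer key, language) -> (P, R), read off the slide's table.
scores = {
    ("ParaMor", "Inflectional Only", "English"): (33.0, 81.4),
    ("ParaMor", "Inflectional Only", "German"): (42.8, 68.6),
    ("ParaMor", "Inflectional & Derivational", "English"): (48.9, 53.6),
    ("ParaMor", "Inflectional & Derivational", "German"): (60.0, 33.5),
    ("Morfessor", "Inflectional Only", "English"): (53.3, 47.0),
    ("Morfessor", "Inflectional Only", "German"): (38.7, 44.2),
    ("Morfessor", "Inflectional & Derivational", "English"): (73.6, 34.0),
    ("Morfessor", "Inflectional & Derivational", "German"): (66.9, 37.1),
}

for (system, key, language), (p, r) in scores.items():
    print(f"{system:10s} {key:28s} {language:8s} F1 = {f1(p, r):.1f}")
```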

Page 125: Evaluating an Agglutinative Segmentation Model for ParaMor


Page 126: Evaluating an Agglutinative Segmentation Model for ParaMor

Where Are the Statistics?

ParaMor has no explicit statistical model

Each candidate paradigm is a minimal description

MDL has close ties to Bayesian statistics

[Search lattice: candidate suffix sets, each followed by its count]

s 10697
Ø s 5513
Ø r s 281

a 9020
a o 2325
a o os 1418
a as o os 899

n 6039
Ø n 1863
Ø n r 512
Ø do n r 357
Ø da das do dos n ndo r ron 115

es 2750
Ø es 845

an 1784
a an 1045
a an ar 417
a an ar ó 355
a ada adas ado ados an ar aron ó 148

rado 167
rada rado 89
rada rado rados 67
rada radas rado rados 53
ra rada radas rado rados ran rar raron ró 23

strado 15
strada strado 12
strada strado stró 9
strada strado strar stró 8
strada stradas strado strar stró 7

...

Page 127: Evaluating an Agglutinative Segmentation Model for ParaMor


Page 128: Evaluating an Agglutinative Segmentation Model for ParaMor


1st Ø s

2nd a as o os

3rd Ø ba ban da das do dos n ndo r ron rse rá rán

5th a aba aban ada adas ado ados an ando ar aron arse ará arán ó

11th ta tamente tas to tos

12th Ø ba ción da das do dos n ndo r ron rá rán ría

13th a aba ada adas ado ados an ando ar aron ará arán e en ó

30th a e en ida idas ido idos iendo ieron ió ía

1000th Ø g gs

1566th ido idos ir iré

2000th lia liana

3000th Ø a anar

4000th Ø e ince

8000th trada trarnos

Some Model Fragments of Paradigms

15 Suffixes from the ar Verbal Paradigm (which has more than 30)

Page 129: Evaluating an Agglutinative Segmentation Model for ParaMor


1st Ø s

2nd a as o os

3rd Ø ba ban da das do dos n ndo r ron rse rá rán

5th a aba aban ada adas ado ados an ando ar aron arse ará arán ó

11th ta tamente tas to tos

12th Ø ba ción da das do dos n ndo r ron rá rán ría

13th a aba ada adas ado ados an ando ar aron ará arán e en ó

30th a e en ida idas ido idos iendo ieron ió ía

1000th Ø g gs

1566th ido idos ir iré

2000th lia liana

3000th Ø a anar

4000th Ø e ince

8000th trada trarnos

Some Model Fragments of Paradigms

Here Are 15 More Suffixes from the ar Verbal Paradigm

Page 130: Evaluating an Agglutinative Segmentation Model for ParaMor


Page 131: Evaluating an Agglutinative Segmentation Model for ParaMor

The ParaMor Algorithm

Spanish Example

Raw Text

Page 132: Evaluating an Agglutinative Segmentation Model for ParaMor

The ParaMor Algorithm

Raw Text

Word List: 50,000 Types

autorizaciones buscabamos costar importadora vallas …

Page 133: Evaluating an Agglutinative Segmentation Model for ParaMor

The ParaMor Algorithm

Raw Text

autorizaciones buscabamos costar importadora vallas …

v + allas, va + llas, val + las, vall + as, valla + s, vallas + Ø

A priori, each character boundary is a candidate morpheme boundary

Propose multiple analyses of each word

Each analysis contains exactly 1 morpheme boundary
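The per-word analyses listed for vallas can be generated mechanically. This is a minimal sketch, assuming a word is a plain string and that the empty suffix is written Ø as on the slide. The function name is illustrative.

```python
def candidate_analyses(word):
    """Every split of `word` into stem + suffix with exactly one boundary.

    The final split leaves an empty suffix, written here as Ø.
    """
    return [(word[:i], word[i:] or "Ø") for i in range(1, len(word) + 1)]

for stem, suffix in candidate_analyses("vallas"):
    print(f"{stem} + {suffix}")
```

A word of n characters yields n analyses, so the 50,000-type word list produces a large but still linear number of candidate stem and suffix pairs.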

Page 134: Evaluating an Agglutinative Segmentation Model for ParaMor

Paradigms

Paradigm: Mutually substitutable morphological operations

ül m umumØuz

üyoryecek

The ParaMor Algorithm

Page 135: Evaluating an Agglutinative Segmentation Model for ParaMor

Paradigms

Paradigm: Mutually substitutable strings

ül m umumØuz

üyoryecek

The ParaMor Algorithm

Page 136: Evaluating an Agglutinative Segmentation Model for ParaMor

The ParaMor Algorithm

Raw Text

autorizaciones buscabamos costar importadora vallas …

v + allas, va + llas, val + las, vall + as, valla + s, vallas + Ø

Consolidate identical candidate suffixes into paradigm seeds

s 10697
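Consolidation can be pictured as grouping every candidate split by its suffix, so that each suffix collects the stems it was seen with. The sketch below is an assumption about how that bookkeeping might look, not ParaMor's code. It uses the five sample words shown on the slide, so the counts are tiny compared with the 10697 reported for s. Note that Ø trivially pairs with every word in the list.

```python
from collections import defaultdict

def paradigm_seeds(vocabulary):
    """Group all single-boundary splits by candidate suffix.

    Returns a mapping suffix -> set of stems; the empty suffix is keyed as Ø.
    """
    seeds = defaultdict(set)
    for word in vocabulary:
        for i in range(1, len(word) + 1):
            seeds[word[i:] or "Ø"].add(word[:i])
    return seeds

vocabulary = ["autorizaciones", "buscabamos", "costar", "importadora", "vallas"]
seeds = paradigm_seeds(vocabulary)

# Rank single-suffix seeds by how many stems support them.
ranked = sorted(seeds, key=lambda suffix: len(seeds[suffix]), reverse=True)
print(sorted(seeds["s"]))
```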

Page 137: Evaluating an Agglutinative Segmentation Model for ParaMor

Search for Candidate Paradigms

Bottom-Up

Greedy

Begin search with the most frequent candidate suffix

s 10697: autorizaciones buscabamos costas importadoras vallas …
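One way to picture the bottom-up greedy search is as repeated growth of a suffix set: start from the most frequent suffix and keep adding whichever suffix preserves the most stems. The slides do not state the stopping rule, so the `min_ratio` threshold below is a hypothetical stand-in, as is the choice to exclude Ø as a starting point (inferred from the slide starting at s rather than Ø). The vocabulary is a toy sample.

```python
from collections import defaultdict

def paradigm_seeds(vocabulary):
    """Map each candidate suffix (Ø for empty) to the stems it occurs with."""
    seeds = defaultdict(set)
    for word in vocabulary:
        for i in range(1, len(word) + 1):
            seeds[word[i:] or "Ø"].add(word[:i])
    return seeds

def greedy_search(seeds, start, min_ratio=0.25):
    """Grow a candidate paradigm upward from one suffix.

    min_ratio is a hypothetical stopping threshold: stop when the best
    extension would keep fewer than min_ratio of the current stems.
    """
    suffixes, stems = {start}, set(seeds[start])
    while True:
        best_suffix, best_stems = None, set()
        for suffix, suffix_stems in seeds.items():
            if suffix in suffixes:
                continue
            common = stems & suffix_stems
            if len(common) > len(best_stems):
                best_suffix, best_stems = suffix, common
        if best_suffix is None or len(best_stems) < min_ratio * len(stems):
            return suffixes, stems
        suffixes.add(best_suffix)
        stems = best_stems

vocabulary = ["autorizaciones", "buscabamos", "costa", "costas",
              "importadora", "importadoras", "valla", "vallas"]
seeds = paradigm_seeds(vocabulary)
# Assumption: Ø alone is not used as a starting seed.
start = max((f for f in seeds if f != "Ø"), key=lambda f: len(seeds[f]))
suffixes, stems = greedy_search(seeds, start)
print(sorted(suffixes), sorted(stems))
```

On this toy list the search starts at s and moves up to the set Ø s, supported by costa, importadora, and valla, mirroring the lattice step shown in the talk.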

Page 138: Evaluating an Agglutinative Segmentation Model for ParaMor


Page 139: Evaluating an Agglutinative Segmentation Model for ParaMor

8240 Selected Candidate Paradigms

1st Ø s

2nd a as o os

3rd Ø ba ban da das do dos n ndo r ron rse rá rán

5th a aba aban ada adas ado ados an ando ar aron arse ará arán ó

11th ta tamente tas to tos

12th Ø ba ción da das do dos n ndo r ron rá rán ría

13th a aba ada adas ado ados an ando ar aron ará arán e en ó

30th a e en ida idas ido idos iendo ieron ió ía

1000th Ø g gs

1566th ido idos ir iré

2000th lia liana

3000th Ø a anar

4000th Ø e ince

8000th trada trarnos


Page 140: Evaluating an Agglutinative Segmentation Model for ParaMor


1st Ø s

2nd a as o os

3rd Ø ba ban da das do dos n ndo r ron rse rá rán

5th a aba aban ada adas ado ados an ando ar aron arse ará arán ó

11th ta tamente tas to tos

12th Ø ba ción da das do dos n ndo r ron rá rán ría

13th a aba ada adas ado ados an ando ar aron ará arán e en ó

30th a e en ida idas ido idos iendo ieron ió ía

1000th Ø g gs

1566th ido idos ir iré

2000th lia liana

3000th Ø a anar

4000th Ø e ince

8000th trada trarnos

Some Candidates Model Paradigms

Page 141: Evaluating an Agglutinative Segmentation Model for ParaMor


Some Candidates are Errors

1st Ø s

2nd a as o os

3rd Ø ba ban da das do dos n ndo r ron rse rá rán

5th a aba aban ada adas ado ados an ando ar aron arse ará arán ó

11th ta tamente tas to tos

12th Ø ba ción da das do dos n ndo r ron rá rán ría

13th a aba ada adas ado ados an ando ar aron ará arán e en ó

30th a e en ida idas ido idos iendo ieron ió ía

1000th Ø g gs

1566th ido idos ir iré

2000th lia liana

3000th Ø a anar

4000th Ø e ince

8000th trada trarnos


Page 142: Evaluating an Agglutinative Segmentation Model for ParaMor

The ParaMor Algorithm

Identify Paradigms in 3 Steps

1. Search for candidate paradigms

2. Cluster candidates modeling the same paradigm

3. Filter

2 Filters New to ParaMor

Segment Words Using the discovered paradigms

Page 143: Evaluating an Agglutinative Segmentation Model for ParaMor


1st Ø s

2nd a as o os

3rd Ø ba ban da das do dos n ndo r ron rse rá rán

5th a aba aban ada adas ado ados an ando ar aron arse ará arán ó

11th ta tamente tas to tos

12th Ø ba ción da das do dos n ndo r ron rá rán ría

13th a aba ada adas ado ados an ando ar aron ará arán e en ó

30th a e en ida idas ido idos iendo ieron ió ía

1000th Ø g gs

1566th ido idos ir iré

2000th lia liana

3000th Ø a anar

4000th Ø e ince

8000th trada trarnos

1. Spurious String Similarities

Page 144: Evaluating an Agglutinative Segmentation Model for ParaMor


1st Ø s

2nd a as o os

3rd Ø ba ban da das do dos n ndo r ron rse rá rán

5th a aba aban ada adas ado ados an ando ar aron arse ará arán ó

11th ta tamente tas to tos

12th Ø ba ción da das do dos n ndo r ron rá rán ría

13th a aba ada adas ado ados an ando ar aron ará arán e en ó

30th a e en ida idas ido idos iendo ieron ió ía

1000th Ø g gs

1566th ido idos ir iré

2000th lia liana

3000th Ø a anar

4000th Ø e ince

8000th trada trarnos

1. Spurious String Similarities

From: allØ amØ gØ sØ alla ama ga sa allanar amanar ganar sanar

Page 145: Evaluating an Agglutinative Segmentation Model for ParaMor


1st Ø s

2nd a as o os

3rd Ø ba ban da das do dos n ndo r ron rse rá rán

5th a aba aban ada adas ado ados an ando ar aron arse ará arán ó

11th ta tamente tas to tos

12th Ø ba ción da das do dos n ndo r ron rá rán ría

13th a aba ada adas ado ados an ando ar aron ará arán e en ó

30th a e en ida idas ido idos iendo ieron ió ía

1000th Ø g gs

1566th ido idos ir iré

2000th lia liana

3000th Ø a anar

4000th Ø e ince

8000th trada trarnos

1. Spurious String Similarities

From: allØ amØ gØ sØ alla ama ga sa allanar amanar ganar sanar

Supported by 8 Short Types

Page 146: Evaluating an Agglutinative Segmentation Model for ParaMor


1st Ø s

2nd a as o os

3rd Ø ba ban da das do dos n ndo r ron rse rá rán

5th a aba aban ada adas ado ados an ando ar aron arse ará arán ó

11th ta tamente tas to tos

12th Ø ba ción da das do dos n ndo r ron rá rán ría

13th a aba ada adas ado ados an ando ar aron ará arán e en ó

30th a e en ida idas ido idos iendo ieron ió ía

1000th Ø g gs

1566th ido idos ir iré

2000th lia liana

3000th Ø a anar

4000th Ø e ince

8000th trada trarnos

ParaMorIdentify

SearchClusterFilter

SegmentEvaluationResults

From: allØ amØ gØ sØ alla ama ga saallanar amanar ganar sanar

Exclude Short Types from the Induction Vocabulary

1. Spurious String Similarities
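A minimal sketch of why excluding short types removes this kind of spurious scheme, using the twelve types listed in the "From:" line as a toy vocabulary. The helper names `supporting_stems` and `exclude_short_types` and the five-character cutoff are illustrative assumptions, not ParaMor's actual code or parameter setting.

```python
def supporting_stems(vocabulary, suffixes):
    """Return the stems that occur in the vocabulary with every suffix.

    The empty string stands for the null suffix (Ø).
    """
    vocab = set(vocabulary)
    candidates = None
    for suffix in suffixes:
        if suffix:
            stems = {w[:-len(suffix)] for w in vocab
                     if w.endswith(suffix) and len(w) > len(suffix)}
        else:
            stems = set(vocab)  # Ø: the word itself is the stem
        candidates = stems if candidates is None else candidates & stems
    return candidates or set()


def exclude_short_types(vocabulary, min_length):
    """Keep only types with at least `min_length` characters (hypothetical cutoff)."""
    return {w for w in vocabulary if len(w) >= min_length}


# Toy vocabulary: the types behind the 3000th scheme "Ø a anar".
vocabulary = {"all", "am", "g", "s",
              "alla", "ama", "ga", "sa",
              "allanar", "amanar", "ganar", "sanar"}
scheme = ["", "a", "anar"]

before = supporting_stems(vocabulary, scheme)
after = supporting_stems(exclude_short_types(vocabulary, 5), scheme)
```

With the short types removed, no stem still occurs with all three suffixes, so the scheme loses its support in this toy setting.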

Page 147: Evaluating an Agglutinative Segmentation Model for ParaMor

1st Ø s

2nd a as o os

3rd Ø ba ban da das do dos n ndo r ron rse rá rán

5th a aba aban ada adas ado ados an ando ar aron arse ará arán ó

11th ta tamente tas to tos

12th Ø ba ción da das do dos n ndo r ron rá rán ría

13th a aba ada adas ado ados an ando ar aron ará arán e en ó

30th a e en ida idas ido idos iendo ieron ió ía

1000th Ø g gs

1566th ido idos ir iré

2000th lia liana

3000th Ø a anar

4000th Ø e ince

8000th trada trarnos

2. Suffix-Internal Boundary Hypotheses

Page 148: Evaluating an Agglutinative Segmentation Model for ParaMor

2. Suffix-Internal Boundary Hypotheses

1st Ø s

2nd a as o os

3rd Ø ba ban da das do dos n ndo r ron rse rá rán

5th aØ aba aban ada adas ado ados an ando ar aron arse ará arán ó

11th ta tamente tas to tos

12th Ø ba ción da das do dos n ndo r ron rá rán ría

13th a aba ada adas ado ados an ando ar aron ará arán e en ó

30th a e en ida idas ido idos iendo ieron ió ía

1000th Ø g gs

1566th ido idos ir iré

2000th lia liana

3000th Ø a anar

4000th Ø e ince

8000th trada trarnos

Incorrect

Correct

Page 149: Evaluating an Agglutinative Segmentation Model for ParaMor

Incorrect

Correct

1st Ø s

2nd a as o os

3rd Ø ba ban da das do dos n ndo r ron rse rá rán

5th aØ aba aban ada adas ado ados an ando ar aron arse ará arán ó

11th ta tamente tas to tos

12th Ø ba ción da das do dos n ndo r ron rá rán ría

13th a aba ada adas ado ados an ando ar aron ará arán e en ó

30th a e en ida idas ido idos iendo ieron ió ía

1000th Ø g gs

1566th ido idos ir iré

2000th lia liana

3000th Ø a anar

4000th Ø e ince

8000th trada trarnos

2. Suffix-Internal Boundary Hypotheses

Page 150: Evaluating an Agglutinative Segmentation Model for ParaMor

1st Ø s

2nd a as o os

3rd Ø ba ban da das do dos n ndo r ron rse rá rán

5th aØ aba aban ada adas ado ados an ando ar aron arse ará arán ó

11th ta tamente tas to tos

12th Ø ba ción da das do dos n ndo r ron rá rán ría

13th a aba ada adas ado ados an ando ar aron ará arán e en ó

30th a e en ida idas ido idos iendo ieron ió ía

1000th Ø g gs

1566th ido idos ir iré

2000th lia liana

3000th Ø a anar

4000th Ø e ince

8000th trada trarnos

2. Suffix-Internal Boundary Hypotheses

Adapt the Transition Likelihood Approach

Similar to Goldsmith (2006)

Page 151: Evaluating an Agglutinative Segmentation Model for ParaMor

1st Ø s

2nd a as o os

3rd Ø ba ban da das do dos n ndo r ron rse rá rán

5th a aba aban ada adas ado ados an ando ar aron arse ará arán ó

11th ta tamente tas to tos

12th Ø ba ción da das do dos n ndo r ron rá rán ría

13th a aba ada adas ado ados an ando ar aron ará arán e en ó

30th a e en ida idas ido idos iendo ieron ió ía

1000th Ø g gs

1566th ido idos ir iré

2000th lia liana

3000th Ø a anar

4000th Ø e ince

8000th trada trarnos

From The Candidate Stems:

acompaña- anuncia- aplica- apoya- celebra- considera- controla- desarrolla- desplaza- disputa- eleva- enfrenta- forma- halla- integra- lanza- llama- llega- lleva- ocupa- pasa- presenta- realiza- registra- toma-

2. Suffix-Internal Boundary Hypotheses

Page 152: Evaluating an Agglutinative Segmentation Model for ParaMor

1st Ø s

2nd a as o os

3rd Ø ba ban da das do dos n ndo r ron rse rá rán

5th a aba aban ada adas ado ados an ando ar aron arse ará arán ó

11th ta tamente tas to tos

12th Ø ba ción da das do dos n ndo r ron rá rán ría

13th a aba ada adas ado ados an ando ar aron ará arán e en ó

30th a e en ida idas ido idos iendo ieron ió ía

1000th Ø g gs

1566th ido idos ir iré

2000th lia liana

3000th Ø a anar

4000th Ø e ince

8000th trada trarnos

From The Candidate Stems:

acompañ- anunci- aplic- apoy- celebr- consider- desarroll- desplaz- disput- elev- enfrent- form- hall- integr- lanz- llam- lleg- llev- ocup- pas- present- realiz- registr- tom-

From The Candidate Stems:

acompaña- anuncia- aplica- apoya- celebra- considera- controla- desarrolla- desplaza- disputa- eleva- enfrenta- forma- halla- integra- lanza- llama- llega- lleva- ocupa- pasa- presenta- realiza- registra- toma-

2. Suffix-Internal Boundary Hypotheses

Page 153: Evaluating an Agglutinative Segmentation Model for ParaMor

1st Ø s

2nd a as o os

3rd Ø ba ban da das do dos n ndo r ron rse rá rán

5th a aba aban ada adas ado ados an ando ar aron arse ará arán ó

11th ta tamente tas to tos

12th Ø ba ción da das do dos n ndo r ron rá rán ría

13th a aba ada adas ado ados an ando ar aron ará arán e en ó

30th a e en ida idas ido idos iendo ieron ió ía

1000th Ø g gs

1566th ido idos ir iré

2000th lia liana

3000th Ø a anar

4000th Ø e ince

8000th trada trarnos

From The Candidate Stems:

acompañ- anunci- aplic- apoy- celebr- consider- desarroll- desplaz- disput- elev- enfrent- form- hall- integr- lanz- llam- lleg- llev- ocup- pas- present- realiz- registr- tom-

Entropy: 3.49

From The Candidate Stems:

acompaña- anuncia- aplica- apoya- celebra- considera- controla- desarrolla- desplaza- disputa- eleva- enfrenta- forma- halla- integra- lanza- llama- llega- lleva- ocupa- pasa- presenta- realiza- registra- toma-

Entropy: 0.00

2. Suffix-Internal Boundary Hypotheses

Page 154: Evaluating an Agglutinative Segmentation Model for ParaMor

1st Ø s

2nd a as o os

3rd Ø ba ban da das do dos n ndo r ron rse rá rán

5th a aba aban ada adas ado ados an ando ar aron arse ará arán ó

11th ta tamente tas to tos

12th Ø ba ción da das do dos n ndo r ron rá rán ría

13th a aba ada adas ado ados an ando ar aron ará arán e en ó

30th a e en ida idas ido idos iendo ieron ió ía

1000th Ø g gs

1566th ido idos ir iré

2000th lia liana

3000th Ø a anar

4000th Ø e ince

8000th trada trarnos

From The Candidate Stems:

acompañ- anunci- aplic- apoy- celebr- consider- desarroll- desplaz- disput- elev- enfrent- form- hall- integr- lanz- llam- lleg- llev- ocup- pas- present- realiz- registr- tom-

Entropy: 3.49

From The Candidate Stems:

acompaña- anuncia- aplica- apoya- celebra- considera- controla- desarrolla- desplaza- disputa- eleva- enfrenta- forma- halla- integra- lanza- llama- llega- lleva- ocupa- pasa- presenta- realiza- registra- toma-

Entropy: 0.00

Removed

ParaMor discards candidates whose entropy falls below a threshold parameter

2. Suffix-Internal Boundary Hypotheses
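The entropy filter can be sketched as follows, assuming the entropy is taken over the distribution of stem-final characters and measured in bits. That reading is an assumption, although it is consistent with the slide: under it, the two stem lists shown above give values that round to 3.49 and 0.00. The function names and the threshold value of 1.0 are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def final_char_entropy(stems):
    """Entropy in bits of the distribution of stem-final characters."""
    counts = Counter(stem[-1] for stem in stems)
    total = sum(counts.values())
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def keep_candidate(stems, threshold):
    """Keep a candidate only if its entropy is not below the threshold."""
    return final_char_entropy(stems) >= threshold


# Candidate stems as displayed on the slide, without the trailing hyphen.
stems_without_a = ("acompañ anunci aplic apoy celebr consider desarroll "
                   "desplaz disput elev enfrent form hall integr lanz llam "
                   "lleg llev ocup pas present realiz registr tom").split()
stems_with_a = ("acompaña anuncia aplica apoya celebra considera controla "
                "desarrolla desplaza disputa eleva enfrenta forma halla "
                "integra lanza llama llega lleva ocupa pasa presenta realiza "
                "registra toma").split()

ENTROPY_THRESHOLD = 1.0  # hypothetical value; the slide does not give one

entropy_without_a = final_char_entropy(stems_without_a)
entropy_with_a = final_char_entropy(stems_with_a)
```

Every stem in the second list ends in "a", so the next character is fully predictable and the entropy is zero. This is the signal that the hypothesized boundary falls inside a suffix.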

Page 155: Evaluating an Agglutinative Segmentation Model for ParaMor

Page 156: Evaluating an Agglutinative Segmentation Model for ParaMor

Morphology is Complex – Operations

Prefixation

Suffixation

Page 157: Evaluating an Agglutinative Segmentation Model for ParaMor

Prefixation

Reduplication

Suffixation

Morphology is Complex – Operations

Page 158: Evaluating an Agglutinative Segmentation Model for ParaMor

Prefixation

Reduplication

Infixation

Suffixation

Morphology is Complex – Operations

Page 159: Evaluating an Agglutinative Segmentation Model for ParaMor

Prefixation

Reduplication

Infixation

Suffixation

Morphology is Complex – Operations

Page 160: Evaluating an Agglutinative Segmentation Model for ParaMor

Prefixation

Reduplication

Infixation

Suffixation

Morphology is Complex – Operations

Page 161: Evaluating an Agglutinative Segmentation Model for ParaMor

Inflection

Morphology is Complex – Purpose

götür ül m üyor sun

take / passive / negative / present progressive / 2nd person singular

Page 162: Evaluating an Agglutinative Segmentation Model for ParaMor

Inflection

götür ül m üyor sun

take / passive / negative / present progressive / 2nd person singular

Derivation

inform

Morphology is Complex – Purpose

Page 163: Evaluating an Agglutinative Segmentation Model for ParaMor

Inflection

Morphology is Complex – Purpose

götür ül m üyor sun

take / passive / negative / present progressive / 2nd person singular

Derivation

inform ation

abstract noun

Page 164: Evaluating an Agglutinative Segmentation Model for ParaMor

Inflection

götür ül m üyor sun

take / passive / negative / present progressive / 2nd person singular

Derivation

mis inform ation

mis: negative

ation: abstract noun

Morphology is Complex – Purpose

Page 165: Evaluating an Agglutinative Segmentation Model for ParaMor

götür ül m üyor sun

take / passive / negative / present progressive / 2nd person singular

You are not being taken

Morphology is Complex – Morphophonology

Page 166: Evaluating an Agglutinative Segmentation Model for ParaMor

götür ül m yecek sun

take / passive / negative / future / 2nd person singular

You will not be taken

Morphology is Complex – Morphophonology

Page 167: Evaluating an Agglutinative Segmentation Model for ParaMor

götür ül m yecek sun

take / passive / negative / future / 2nd person singular

You will not be taken

Morphology is Complex – Morphophonology

Page 168: Evaluating an Agglutinative Segmentation Model for ParaMor

götür ül me yecek sun

take / passive / negative / future / 2nd person singular

You will not be taken

Morphology is Complex – Morphophonology

Page 169: Evaluating an Agglutinative Segmentation Model for ParaMor

götür ül me yecek sin

take / passive / negative / future / 2nd person singular

You will not be taken

Morphology is Complex – Morphophonology

Page 170: Evaluating an Agglutinative Segmentation Model for ParaMor

götür ül me yecek sin

take / passive / negative / future / 2nd person singular

You will not be taken

Morphology is Complex – Morphophonology

Page 171: Evaluating an Agglutinative Segmentation Model for ParaMor

Morphology is Complex – Ambiguity

Hungarian: mentek

men +tek
go +Present.2nd.Plural
‘yinz go’

Page 172: Evaluating an Agglutinative Segmentation Model for ParaMor

Hungarian: mentek

men +tek
go +Present.2nd.Plural
‘yinz go’

men +t +ek
go +PastParticiple +Plural
‘those who have gone’

Morphology is Complex – Ambiguity
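The ambiguity can be made concrete with a small sketch that enumerates every way to split a word into a stem followed by one or more suffixes. The function name `segmentations` and the tiny lexicon are hypothetical and hold only the morphemes shown on the slide. This illustrates the problem and is not ParaMor's segmentation algorithm.

```python
def segmentations(word, stems, suffixes):
    """List every analysis of `word` as a stem plus one or more suffixes.

    Assumes every suffix is a non-empty string.
    """
    results = []

    def extend(rest, analysis):
        if not rest:
            results.append(analysis)
            return
        for suffix in sorted(suffixes):
            if rest.startswith(suffix):
                extend(rest[len(suffix):], analysis + [suffix])

    for stem in sorted(stems):
        if word.startswith(stem) and len(word) > len(stem):
            extend(word[len(stem):], [stem])
    return results


# Toy lexicon containing only the morphemes from the slide.
toy_stems = {"men"}
toy_suffixes = {"tek", "t", "ek"}

analyses = segmentations("mentek", toy_stems, toy_suffixes)
```

Both analyses cover the same surface string, so a segmenter needs context or extra evidence to choose between them.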

Page 173: Evaluating an Agglutinative Segmentation Model for ParaMor

Page 174: Evaluating an Agglutinative Segmentation Model for ParaMor

Paradigms Do Not Describe Derivation

mis inform ation

Page 175: Evaluating an Agglutinative Segmentation Model for ParaMor

mis inform ation

er

Paradigms Do Not Describe Derivation

Page 176: Evaluating an Agglutinative Segmentation Model for ParaMor

mis inform ation

er

ement

Paradigms Do Not Describe Derivation

Page 177: Evaluating an Agglutinative Segmentation Model for ParaMor

mis inform ation

er

ement

Paradigm Based Implies: Strong at inflectional morphology

Paradigms Do Not Describe Derivation

Page 178: Evaluating an Agglutinative Segmentation Model for ParaMor

Page 179: Evaluating an Agglutinative Segmentation Model for ParaMor

The Next Steps for ParaMor

Scaling Paradigm Induction

Currently 50,000 types

Up to larger vocabularies

Down for languages with few resources

Parameter settings need tuning

Scaling Down Segmentation

Currently 300,000 to 2.2 million types

The larger the vocabulary, the more likely a particular stem will occur in more than one surface form

Page 180: Evaluating an Agglutinative Segmentation Model for ParaMor
