Language-specific representations:
Learning segmentation and co-occurrence restrictions in Arabic

Itamar Kastner and Frans Adriaans
New York University
http://www.nyu.edu/projects/itamar

NECPhon
University of Delaware, November 7, 2015

Kastner and Adriaans (NYU) Language-specific representations NECPhon 2015 1 / 17

The segmentation problem

[lʊk ðrz̩əbɔjwɪðɪzhætændəˈdɔgi] (idealized somewhat)

The problem: segmentation crosslinguistically.

1. Modeling segmentation: how might the infant accomplish this? (Brent and Cartwright 1996; Lignos and Yang 2010; Phillips and Pearl 2015b; a.m.o.)
2. Crosslinguistic? Do English learners track the same distributions as Arabic learners?
3. A “proto-lexicon” emerges from the segmented phones.
       Is it correct?
       Is it useful for the acquisition of morphology?

The segmentation problem

The input stream is linear. Semitic morphology isn’t.

istaqbal ‘greeted, welcomed’    qabil ‘received’    taqabbal ‘accepted (responsibility)’
qabla ‘before’                  qabiːla ‘tribe’     qibla ‘direction of prayer’

Consonantal roots have phonological/semantic content. Vocalic templates have morphosyntactic content (McCarthy 1981; Bat-El 1994; Doron 2003; Arad 2005; Ussishkin 2005; Borer 2013; Kastner 2015).

Form  Template   Example    Gloss                         Notes
1     XaYaZ      katab      ‘he wrote’
2     XaYYaZ     kattab     ‘he made someone write’       Never unaccusative, often intensive/iterative
3     Xa:YaZ     ka:tab     ‘he corresponded’             Applicative
4     aXYaZ      aktab      ‘he dictated’                 Never unaccusative, often causative
5     taXaYYaZ   takattab   ‘he was made to write’        Middle
6     taXa:YaZ   taka:tab   ‘he corresponded with’        Middle applicative
7     inXaYaZ    inkatab    ‘he subscribed’               Middle or passive, never active, no direct object
8     iXtaYaZ    iktatab    ‘he copied’                   Often reflexive
10    istaXYaZ   istaktab   ‘he asked someone to write’   Often reflexive, some other meanings
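
The root-and-template pattern in the list of forms can be made concrete with a small sketch. It assumes that X, Y and Z stand for the three root consonants in order and that every other symbol in a template is copied unchanged; the function name `apply_template` and the list `TEMPLATES` are hypothetical and only serve this illustration.

```python
def apply_template(root, template):
    """Fill the X, Y, Z slots of a template with the consonants of a triliteral root."""
    if len(root) != 3:
        raise ValueError("this sketch only handles three-consonant roots")
    slots = dict(zip("XYZ", root))
    # Non-slot symbols (vowels, length marks, affixal consonants) are copied as-is.
    return "".join(slots.get(ch, ch) for ch in template)


# Templates as listed on the slide (forms 1-8 and 10).
TEMPLATES = ["XaYaZ", "XaYYaZ", "Xa:YaZ", "aXYaZ", "taXaYYaZ",
             "taXa:YaZ", "inXaYaZ", "iXtaYaZ", "istaXYaZ"]

forms = [apply_template("ktb", t) for t in TEMPLATES]
print(forms)
print(apply_template("qbl", "istaXYaZ"))
```

The same root consonants surface discontinuously in every form, which is why a learner tracking only adjacent segments in the full CV string sees very different strings for related words.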


The segmentation problem

One approach to segmentation: the learner tracks distributions (want, dog, …).
Learn the proto-lexicon as a result.
This is less promising for Semitic.

Today:

Hypothesis: focus only on the consonants.
Two simulations testing this model.
    Experiment 1: evaluate the segmentation.
    Experiment 2: evaluate the proto-lexicon.
Ideas for future work.

Hypothesis

The segmentation mechanism operates on representations that are language-specific.
Assume that the segmentation mechanism itself does not vary across languages. (Phillips and Pearl 2015a)
Arabic (Semitic?) acquisition is facilitated by dividing the input stream into Cs and Vs.
The learner pays attention to a representation consisting only of Cs.

Hypothesis

Adults track consonant co-occurrence probabilities across intervening vowels in artificial grammars. (Newport and Aslin 2004; Bonatti et al. 2005; Keidel et al. 2007)
Infants show different acquisition graphs for Cs and Vs. (Werker and Tees 1984; Polka and Werker 1994)
    American 4-month-olds can distinguish German vowels. Less so at 6 months. Pretty bad by 9 months.
    American 7-month-olds are still good at foreign consonants, including /kˀ/, /qˀ/, /ʈ/ and /t̪/. Ability lost by age 0;11.
11-month-olds learn different generalizations over consonants and vowels in artificial language experiments. (Hochmann et al. 2011)

Hypothesis

What we know:
    Adults can tell Cs from Vs in artificial language experiments.
    Infants can distinguish Cs from Vs.

What we don’t know:
    Does this ability help in acquisition of natural languages?

Today:
    Yes, if you track Cs and Vs this helps you in acquisition of natural language.
    Needed: a model and methods of evaluating it.

Experiment 1: Data

Run a segmentation algorithm on different representations in English and Arabic.

English: subset of CHILDES. (Bernstein-Ratner 1987; Goldwater et al. 2009)
Arabic:
    A few datasets of parsed Gigaword newswire. (Graff 2003; Pasha et al. 2014)
    A subset of the Emirati infant-directed speech corpus EMALAC. (Ntelitheos and Idrissi 2015)

Representations:

Full CV   yawanttoseethebook
C-only    ywnttsthbk
V-only    aaoeeeoo
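
The three representations can be derived mechanically from the unsegmented input. The sketch below is one possible way to do so, assuming an orthographic vowel set of a, e, i, o, u (so that y and w count as consonants, as in the slide's example); the names `VOWELS` and `project` are hypothetical, and the preprocessing actually used for the corpora is not stated on the slide.

```python
VOWELS = set("aeiou")  # assumption: orthographic vowels; "y" and "w" are consonants


def project(utterance, keep):
    """Return the Full, C-only or V-only representation of an unsegmented utterance."""
    if keep == "full":
        return utterance
    if keep == "c":
        return "".join(ch for ch in utterance if ch not in VOWELS)
    if keep == "v":
        return "".join(ch for ch in utterance if ch in VOWELS)
    raise ValueError(f"unknown representation: {keep!r}")


utterance = "yawanttoseethebook"
for name in ("full", "c", "v"):
    print(name, project(utterance, name))
```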

Experiment 1: Methods

Unigram segmentation algorithm (Goldwater et al. 2009)

Language model: performs Bayesian inference on the input.
Maximize probability of a hypothesized segmentation of the corpus given the observed data.
Generates words and boundaries, producing unigrams.

yawanttoseethebook
    ya.wanttosee.thebook
    yawa.ntt.oseeth.eb.ook
    ya.want.to.see.the.book

yawanttoseethebook
yawantmybook
yawantmybook
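
To give a sense of the hypothesis space the model searches, the following sketch enumerates every possible segmentation of a short string. It is only an illustration of the search problem, not part of the segmentation algorithm itself; the function name `all_segmentations` is hypothetical and the input is assumed to be non-empty.

```python
from itertools import product


def all_segmentations(s):
    """Yield every way of placing word boundaries between the characters of s."""
    for choices in product([False, True], repeat=len(s) - 1):
        words, start = [], 0
        for i, boundary in enumerate(choices, start=1):
            if boundary:
                words.append(s[start:i])
                start = i
        words.append(s[start:])
        yield ".".join(words)


print(sorted(all_segmentations("yawa")))
# A string of n symbols has 2 ** (n - 1) candidate segmentations.
print(2 ** (len("yawanttoseethebook") - 1))
```

Because the number of candidates doubles with every added symbol, exhaustive comparison is impractical for a whole corpus, which motivates inference by sampling.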


Experiment 1: Results

Recall, Precision and F-Measure for each representation:

[Figure: recall, precision and F-measure by representation; not preserved in this transcript.]

⇒ C-only helps in Arabic but hampers segmentation in English.
⇒ Hypothesis supported.
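
For readers unfamiliar with the evaluation measures, here is a minimal sketch of precision, recall and F-measure for one utterance. It assumes scoring over word tokens, where a predicted word counts as correct only if both of its edges match the gold segmentation; the slides do not say whether tokens, boundaries or lexicon entries were scored, and the names `spans` and `token_prf` are hypothetical.

```python
def spans(words):
    """Map a list of words to the set of (start, end) character spans they occupy."""
    out, start = set(), 0
    for w in words:
        out.add((start, start + len(w)))
        start += len(w)
    return out


def token_prf(predicted, gold):
    """Token precision, recall and F-measure for one non-empty utterance."""
    p_spans, g_spans = spans(predicted), spans(gold)
    correct = len(p_spans & g_spans)
    precision = correct / len(p_spans)
    recall = correct / len(g_spans)
    f = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f


gold = "ya.want.to.see.the.book".split(".")
pred = "ya.wanttosee.thebook".split(".")
print(token_prf(pred, gold))
```

In this example only "ya" is recovered with both edges correct, so an under-segmented hypothesis is penalized on recall more than on precision.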

Experiment 2

Hypothesis supported, but you want the lexicon to aid further acquisition. (Phillips and Pearl 2015a)

Further evaluation, further acquisition:

Phonotactics are learned early on, within the first year. (Jusczyk et al. 1994)
Does the segmented proto-lexicon support phonological generalizations that the infant would need to learn?
Restriction against homorganic consonant pairs (OCP-Place). (Greenberg 1950; McCarthy 1989; Berent and Shimron 1997; Frisch et al. 2004)
    *dadam
    madad: possible but under-represented
    tasaba: possible but under-represented


Experiment 2: Methods

1. Take four segmentations of the Arabic data:
       The result of C-only.
       The result of Full.
       An unsegmented baseline.
       Correct segmentation (gold standard).
2. For each segmentation, calculate O/E ratio (Frisch et al. 2004):
       For non-identical labial biphones.
       For non-identical coronal biphones.
       For non-identical dorsal biphones.
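
The O/E computation in step 2 might look roughly like the sketch below. It assumes that biphones are adjacent consonants in consonant-only word forms and that the expected count of a pair is the product of its two positional frequencies divided by the total number of pairs; both points are assumptions of this illustration, as are the name `oe_ratio`, the simplified labial class and the invented toy lexicon.

```python
from collections import Counter


def oe_ratio(words, place_class):
    """O/E for non-identical consonant pairs drawn from one place class."""
    pairs = [(w[i], w[i + 1]) for w in words for i in range(len(w) - 1)]
    total = len(pairs)
    first = Counter(c1 for c1, _ in pairs)    # frequency as first member of a pair
    second = Counter(c2 for _, c2 in pairs)   # frequency as second member of a pair
    observed = sum(1 for c1, c2 in pairs
                   if c1 != c2 and c1 in place_class and c2 in place_class)
    expected = sum(first[c1] * second[c2] / total
                   for c1 in place_class for c2 in place_class if c1 != c2)
    return observed / expected if expected else float("nan")


LABIALS = {"b", "f", "m"}  # assumption: a simplified labial class
# Invented consonant skeletons, for illustration only.
toy_lexicon = ["ktb", "qbl", "bmr", "fth", "mlk", "rkb", "slm", "hbs"]
print(oe_ratio(toy_lexicon, LABIALS))
```

A ratio below 1 means the pair type occurs less often than chance would predict, which is the signature of an OCP-Place effect; a segmentation that yields ratios close to those of the gold standard preserves the evidence a learner would need.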

Experiment 2: Results

⇒ The consonant-only segmentation performs closest to the gold standard.
⇒ Not enough data in the IDS dataset for good generalizations, but coronals follow the pattern (observed = 32, even fewer for labials and dorsals).

Discussion

New contributions:

An explicit way of modeling how the infant might pay attention to consonants more than vowels, and ways of evaluating this model.
In Semitic, separating consonants from vowels is beneficial for segmentation and phonological learning.
Less information is more helpful, even though vowels are necessary and can disambiguate word boundaries. (Phillips and Pearl 2015b)

mnql:    manqal ‘grill’          or   man qa:l ‘who said’
klmtn:   kalimatani ‘word.du’    or   kul matn ‘every content’
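
The ambiguity in these examples can be reproduced with a short sketch that lists every way of splitting a consonant skeleton into known items. The function name `segmentations` and the toy lexicon are hypothetical; the lexicon simply contains the consonant skeletons of the forms glossed above.

```python
def segmentations(skeleton, lexicon):
    """All ways to split a consonant skeleton into items from a (toy) lexicon."""
    if not skeleton:
        return [[]]
    results = []
    for end in range(1, len(skeleton) + 1):
        head = skeleton[:end]
        if head in lexicon:
            for rest in segmentations(skeleton[end:], lexicon):
                results.append([head] + rest)
    return results


# Consonant skeletons of the glossed forms; an illustrative assumption.
toy_lexicon = {"mnql", "mn", "ql", "klmtn", "kl", "mtn"}
print(segmentations("mnql", toy_lexicon))
print(segmentations("klmtn", toy_lexicon))
```

Each skeleton has both a one-word and a two-word parse, so the consonant-only representation cannot tell them apart without the vowels.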

Discussion

Distinguishing consonants from vowels might further aid in bootstrapping syntactic and semantic patterns.
    Nespor et al. (2003): learner assigns consonantal chunks to objects, then fills in the grammar with vowels.
    This is exactly how templates function in Semitic.
How does the child learn to use precisely this representation?
    How do you know what to pay attention to?
    Pay attention to all representations and interpolate Bayesian priors until discarding one hypothesis?
    How does the learner evaluate competing hypotheses?
Learning phonotactics?
Hebrew and other Semitic languages?


Conclusions

Segmentation crosslinguistically: same algorithm, different representations.
    Focus not on the algorithm itself but on what the learner attunes to.
    In Arabic, C-only is better than paying attention to full CV.
    Our results provide a first computational test of this hypothesis in natural language data.
Segmentation and phonotactics.
An explicit, testable model that can generate further predictions.
We are now in a position to ask additional questions about the acquisition of morphology.
If you’re learning a language such as Arabic, you might be learning:
    word boundaries,
    phonotactics,
    word classes,
    abstract semantic sets,
    and morphosyntactic frames
all at once.

Thank you!

Frans Adriaans (co-author)
Maria Gouskova
Alec Marantz
Participants in Frans and Maria’s Spring 2014 Seminar in Phonology
The NYU Phonetics and Experimental Phonology Lab

References

Arad, Maya. 2005. Roots and Patterns: Hebrew Morpho-syntax. Springer.
Bat-El, Outi. 1994. Stem modification and cluster transfer in Modern Hebrew. Natural Language and Linguistic Theory 12: 571–596.
Berent, Iris, and Joseph Shimron. 1997. The representation of Hebrew words: Evidence from the Obligatory Contour Principle. Cognition 64: 39–72.
Bernstein-Ratner, Nan. 1987. The phonology of parent-child speech. In Children’s language, eds. K. Nelson and A. van Kleeck, Vol. 6. Hillsdale, NJ: Erlbaum.
Bonatti, Luca L., Marcela Peña, Marina Nespor, and Jacques Mehler. 2005. Linguistic Constraints on Statistical Computations: The Role of Consonants and Vowels in Continuous Speech Processing. Psychological Science 16 (6): 451–459.
Borer, Hagit. 2013. Structuring sense, Vol. 3: Taking Form. Oxford: Oxford University Press.
Brent, Michael R., and Timothy A. Cartwright. 1996. Distributional regularity and phonotactic constraints are useful for segmentation. Cognition 61: 93–125.
Doron, Edit. 2003. Agency and voice: The semantics of the Semitic templates. Natural Language Semantics 11: 1–67.
Frisch, Stefan A., Janet B. Pierrehumbert, and Michael B. Broe. 2004. Similarity avoidance and the OCP. Natural Language and Linguistic Theory 22: 179–228.
Goldwater, Sharon, Thomas L. Griffiths, and Mark Johnson. 2009. A Bayesian framework for word segmentation: Exploring the effects of context. Cognition 112: 21–54.
Graff, David. 2003. Arabic Gigaword Corpus. Linguistic Data Consortium, Philadelphia, PA.
Greenberg, Joseph H. 1950. The patterning of root morphemes in Semitic. Word 5: 162–181.
Hochmann, Jean-Remy, Silvia Benavides-Varela, Marina Nespor, and Jacques Mehler. 2011. Consonants and vowels: different roles in early language acquisition. Developmental Science 14 (6): 1445–1458.
Jusczyk, Peter W., Paul A. Luce, and Jan Charles-Luce. 1994. Infants’ sensitivity to phonotactic patterns in the native language. Journal of Memory and Language 33: 630–645.
Kastner, Itamar. 2015. Nonconcatenative morphology with concatenative syntax. In Proceedings of the 45th annual meeting of the North East Linguistic Society (NELS 45), eds. Thuy Bui and Deniz Özyıldız, Vol. 2, 83–96. Amherst, MA: GLSA.
Keidel, James L., Rick L. Jenison, Keith R. Kluender, and Mark S. Seidenberg. 2007. Does grammar constrain statistical learning? Commentary on Bonatti, Peña, Nespor, and Mehler (2005). Psychological Science 18 (10): 922–923.
Lignos, Constantine, and Charles Yang. 2010. Recession segmentation: Simpler online word segmentation using limited resources. In Proceedings of the fourteenth conference on computational natural language learning, 88–97.
McCarthy, John J. 1981. A prosodic theory of nonconcatenative morphology. Linguistic Inquiry 12: 373–418.
McCarthy, John J. 1989. Linear order in phonological representation. Linguistic Inquiry 20: 71–99.
Nespor, Marina, Marcela Peña, and Jacques Mehler. 2003. On the different roles of vowels and consonants in speech processing and language acquisition. Lingue e linguaggio 2 (2): 203–230.
Newport, Elissa L., and Richard N. Aslin. 2004. Learning at a distance I. Statistical learning of non-adjacent dependencies. Cognitive Psychology 48 (2): 127–162.
Ntelitheos, Dimitrios, and Ali Idrissi. 2015. Language growth in child Emirati Arabic. In 29th annual symposium on Arabic linguistics. The University of Wisconsin-Milwaukee.
Pasha, Arfath, Mohamed Al-Badrashiny, Mona Diab, Ahmed El Kholy, Ramy Eskander, Nizar Habash, Manoj Pooleery, Owen Rambow, and Ryan M. Roth. 2014. MADAMIRA: A fast, comprehensive tool for morphological analysis and disambiguation of Arabic. In LREC 2014.
Phillips, Lawrence, and Lisa Pearl. 2015a. Evaluating language acquisition strategies: A cross-linguistic look at early segmentation. Ms., UC Irvine.
Phillips, Lawrence, and Lisa Pearl. 2015b. The utility of cognitive plausibility in language acquisition modeling: Evidence from word segmentation. Cognitive Science.
Polka, Linda, and Janet F. Werker. 1994. Developmental changes in perception of nonnative vowel contrasts. Journal of Experimental Psychology: Human Perception and Performance 20: 421–435.
Ussishkin, Adam. 2005. A fixed prosodic theory of nonconcatenative templatic morphology. Natural Language and Linguistic Theory 23: 169–218.
Werker, Janet F., and Richard C. Tees. 1984. Cross-language speech perception: Evidence for perceptual reorganization during the first year of life. Infant Behavior and Development 7: 49–63.

The GGJ unigram algorithm in a nutshell

1. Let $w = w_1 \ldots w_N$ be words in the proposed segmentation.
2. For each word $w_i$, decide if it is a new lexical item.
       $n$ is the number of previously generated words ($i - 1$) and $\alpha_0$ is a parameter.
       $P(w_i \text{ is novel}) = \frac{\alpha_0}{n + \alpha_0}$
       $P(w_i \text{ is not novel}) = \frac{n}{n + \alpha_0}$
3. If $w_i$ is novel, generate its word form $\ell$.
   If $w_i$ is not novel, look up its word form (rich-get-richer/Dirichlet).
       $n_\ell$ is the number of times $\ell$ has appeared in the $n$ words so far.
       $P(w_i = \ell \mid w_i \text{ is not novel}) = \frac{n_\ell}{n}$
4. After each word, generate an utterance boundary with probability $p_s$.
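
Steps 2 and 3 can be combined into a single probability for the next word, sketched below. The base distribution `base_prob` over novel word forms (uniform symbols with geometric length) is a hypothetical stand-in, the value of `alpha0` is arbitrary, and the utterance-boundary term of step 4 is left out to keep the example short.

```python
from collections import Counter


def base_prob(word, n_symbols=26, p_stop=0.5):
    """Hypothetical base distribution: uniform symbols, geometric word length."""
    return (1 - p_stop) ** (len(word) - 1) * p_stop * (1 / n_symbols) ** len(word)


def word_prob(word, history, alpha0):
    """P(w_i = word | previous words): novel-word term plus rich-get-richer term."""
    n = len(history)
    n_word = Counter(history)[word]
    p_novel = alpha0 / (n + alpha0) * base_prob(word)
    p_reuse = n / (n + alpha0) * (n_word / n) if n else 0.0
    return p_novel + p_reuse


def sequence_prob(words, alpha0=20.0):
    """Probability of a word sequence, generated one word at a time."""
    prob = 1.0
    for i, w in enumerate(words):
        prob *= word_prob(w, words[:i], alpha0)
    return prob


segmented = ["ya", "want", "the", "book", "ya", "want", "my", "book"]
unsegmented = ["yawantthebook", "yawantmybook"]
print(sequence_prob(segmented) > sequence_prob(unsegmented))
```

Under these assumptions, a segmentation that reuses short words scores higher than one that treats each utterance as a single long novel word, which is the pressure that drives the model toward a compact lexicon.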


Inference via Gibbs sampling:

First, initialize word boundaries randomly (or not).
Then, sample each potential word boundary in turn and compare the two hypotheses. Repeat 20,000 times.
Example: say at a certain point you have the following segmentation,

1. yaw.antto.see.thebook
       h1: yaw.antto.see.thebook (selected)    $P(h_1 \mid h^-) = \frac{\alpha_0 P(\text{yaw})}{5 + \alpha_0}$
       h2: y.aw.antto.see.thebook              $P(h_2 \mid h^-) = \frac{\alpha_0 P(\text{y})}{5 + \alpha_0} \cdot \frac{\alpha_0 P(\text{aw})}{6 + \alpha_0}$
2. yaw.antto.see.thebook
       h1: yaw.antto.see.thebook               $P(h_1 \mid h^-) = \frac{\alpha_0 P(\text{yaw})}{5 + \alpha_0}$
       h2: ya.w.antto.see.thebook (selected)   $P(h_2 \mid h^-) = \frac{\alpha_0 P(\text{ya})}{5 + \alpha_0} \cdot \frac{\alpha_0 P(\text{w})}{6 + \alpha_0}$
3. ya.w.antto.see.thebook
       h1: ya.wantto.see.thebook (selected)    $P(h_1 \mid h^-) = \frac{\alpha_0 P(\text{wantto})}{5 + \alpha_0}$
       h2: ya.w.antto.see.thebook              $P(h_2 \mid h^-) = \frac{\alpha_0 P(\text{w})}{5 + \alpha_0} \cdot \frac{\alpha_0 P(\text{antto})}{6 + \alpha_0}$
4. yaw.antto.see.thebook ⇒ ya.wantto.see.thebook
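
A single boundary decision of the kind shown in the example can be sketched as follows. The code implements only the novel-word terms that appear in the example's formulas (a full sampler would also count earlier occurrences of each word), and the base distribution `base_prob`, the value of `alpha0` and the function names are hypothetical choices for illustration.

```python
import random


def base_prob(word, n_symbols=26, p_stop=0.5):
    """Hypothetical base distribution: uniform symbols, geometric word length."""
    return (1 - p_stop) ** (len(word) - 1) * p_stop * (1 / n_symbols) ** len(word)


def novel_prob(word, n, alpha0):
    """alpha0 * P(word) / (n + alpha0), the novel-word term from the example."""
    return alpha0 * base_prob(word) / (n + alpha0)


def p_no_boundary(left, right, n, alpha0):
    """Normalized probability of h1 (one word) against h2 (two words)."""
    h1 = novel_prob(left + right, n, alpha0)
    h2 = novel_prob(left, n, alpha0) * novel_prob(right, n + 1, alpha0)
    return h1 / (h1 + h2)


def sample_boundary(left, right, n, alpha0, rng=random):
    """Return True if a boundary is placed between left and right."""
    return rng.random() >= p_no_boundary(left, right, n, alpha0)


# Step 1 of the example: keep "yaw" (h1) or split it into "y" + "aw" (h2)?
p_h1 = p_no_boundary("y", "aw", n=5, alpha0=20.0)
print(p_h1)
print(sample_boundary("y", "aw", n=5, alpha0=20.0))
```

Because the decision is sampled rather than maximized, the less probable hypothesis is sometimes chosen, which lets the sampler move away from poor early segmentations over many iterations.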

A tweak: the bigram model

what’sthat (though children under-segment as well)

1. Let $w = w_1 \ldots w_N$ be words in the proposed segmentation.
2. For each word pair $\langle w_{i-1}, w_i \rangle$, decide if it is a novel bigram.
3. If not novel, choose a lexical form for $w_i$ from among those that have been generated following $w_{i-1}$.
4. If novel, then decide if $w_i$ will be a novel unigram and proceed as in the unigram model.

Hochmann et al. (2011), Figure 1 [figure not preserved in this transcript]