# Statistical Methods in Historical Linguistics

Post on 11-May-2015

1.603 views

Embed Size (px)

DESCRIPTION

Talk given at UCLA, March 2013TRANSCRIPT

<ul><li>1.Statistical methods for Historical LinguisticsRobin J. Ryder Centre de Recherche en Mathmatiques de la DcisioneeUniversit Paris-Dauphine e 7 March 2013UCLARobin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 1 / 98</li></ul><p>2. IntroductionA large number of recent papers describe computationally-intensivestatistical methods for Historical LinguisticsIncreased computational powerAdvances in statistical methodologyComplex linguistic questions which cannot be answered withtraditional methods Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 2 / 98 3. CaveatsI am not a linguistI am a statisticianThese papers were not written by me; most gures were created bythe papers authorsI use the word evolution in a broad senseAll models are wrong, but some are useful Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 3 / 98 4. Aims of this talkReview of several recent papers on statistical models for HistoricalLinguisticsStatisticians wont replace linguistsWhen done correctly, collaborations between statisticians and linguistscan provide useful resultsBrief introduction to some major statistical concepts + statisticalcomments Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 4 / 98 5. Statistics slidesDierent colourFeel free to ignore the equationsUseful (essential?) if you wish to collaborate with statisticiansLimitations of statistical toolsStatistics should not be a black box Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 5 / 98 6. Advantages of statistical methodsAnalyse (very) large datasetsTest multiple hypothesesCross-validationEstimate uncertainty Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 6 / 98 7. Statistical method in a nutshell1 Collect data2 Design model3 Perform inference (MCMC, ...)4 Check convergence5 In-model validation (is our inference method able to answer questionsfrom our model?)6 Model mis-specication analysis (do we need a more complex model?)7 ConcludeIn general, it is more dicult to perform inference for a more complexmodel. Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 7 / 98 8. Contents1 Swadesh: glottochronology2 Gray and Atkinson: Language phylogenies3 Lieberman et al., Pagel et al.: Frequency of use4 Perreault and Mathew, Atkinson: Phoneme expansion5 Dunn et al.: Word-order universals6 Bouchard-Ct et al: automated cognatesoe7 Overall conclusionsTomorrow: phylogenetics for core vocabularyRobin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 8 / 98 9. Outline1 Swadesh: glottochronology2 Gray and Atkinson: Language phylogenies3 Lieberman et al., Pagel et al.: Frequency of use4 Perreault and Mathew, Atkinson: Phoneme expansion5 Dunn et al.: Word-order universals6 Bouchard-Ct et al: automated cognatesoe7 Overall conclusionsRobin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 9 / 98 10. Swadesh (1952) LEXICO-STATISTIC DATING OF PREHISTORIC ETHNIC CONTACTSWithSpecialReference North toAmericanIndiansand EskimosMORRIS SWADESH PREHISTORY refersto the long period of early measuring amountof radioactivity going the stillhumansocietybefore writing was availablefortheon. Consequently, is possible to determine itrecording events. In a fewplaces it gives way of withincertainlimitsof accuracythetimedepthofto the modernepoch of recorded history muchasany archeological whichcontains suitablebit siteaas six or eightthousand yearsago; in manyareasof bone, wood, grass, or any otherorganic sub-this happened only in the last few centuries. stance.Everywhereprehistory represents greatobscureaLexicostatisticdating makes use of very dif-depthwhichscienceseeks to penetrate. And in-ferent materialfromcarbondating,but the broaddeed powerful means have been foundforillumi- theoreticalprin~iple similar. Researchesby theisnatingthe unrecordedpast,including evidencethe presentauthorand several otherscholarswithinof archeological findsand that of the geographicthe last few years have revealedthat the funda-distribution culturalfactsin the earliestknown of mentaleveryday vocabularyof any language-asperiods. Much dependson thepainstaking analy- againstthe specializedor "cultural"vocabulary-sis and comparison data, and on the effectiveofchanges at a relativelyconstantrate. The per-readingof theirimplications. Very importantis centage of retainedelementsin a suitable testthecombined of all theevidence,use linguistic andvocabularythereforeindicatesthe elapsed time.ethnographic well as archeological, as biological, aWherever speechcommunity comesto be dividedand geological. And it is essentialconstantlyto into two or more parts so that linguistic changeseek new meansof expandingand rendering moregoes separatewaysin each of thenew speechcom-accurateour deductionsabout prehistory. munities, percentage commonretainedvo- theofOne of the mostsignificantrecenttrendsin thecabularygives an index of theamountof timethat Robin J. Ryder (Paris Dauphine)Statistics and Historical Linguistics UCLA March 2013 10 / 98 11. Question at handAim: dating the MRCA (Most Recent Common Ancestor) of a pair oflanguages.Data: core vocabulary (Swadesh lists). 215 or 100 words. Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 11 / 98 12. all doggoodlive rightspit two and drinkgrass liver(side) splitvomit animaldrygreen long river road squeezewalk ashes dull gutslouse atroot stab warm dust hairman backrope stand earhandmany wash bad rotten star earthhemeat water barkroundstick eatheadmoon rubstonewe egghear becausemother saltwet eyeheart belly sand straight fall heavy suck what bigmountain say farheresunwhen bird mouth fathitscratchswell bite name where father hold seaswim blacknarrow fear horn white bloodnear seetailhowseed tenwho blow neck featherhunt bone newsewthat wide few breast nightsharpthere ght husbandwifenose shortthey breathe reIwindnotsing thick burnshicewingoldsitthin child veifoneskin thinkwipe clawoat inotherskythis owkill with cloud sleepthouperson coldowerkneeplay smallthree y know woman comesmellthrowpullRobin countfogJ. Ryder (Paris Dauphine) lake Statistics and Historical Linguistics smoke UCLA March 2013 woods 9812 /tie 13. AssumptionsSwadesh assumed that core vocabulary evolves at a constant rate (throughtime, space and meanings). Given a pair of languages with percentage Cof shared cognates, and a constant retention rate r , the age t of theMRCA islog C t=2 log rThe constant r was estimated using a pair of languages for which the ageof the MRCA is known. Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 13 / 98 14. Issues with glottochronologyMany statistical shortcomings. Mainly:1 Simplistic model2 No evaluation of uncertainty of estimates3 Only small amounts of data are usedSee Bergsland and Vogt (1962) for debunking of glottochronology. Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 14 / 98 15. What has changed?More elaborate models + model misspecication analysesWe can estimate the uncertainty (easier to answer I dont know)Large amounts of dataTomorrow: new analysis of Bergsland and Vogts data. Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 15 / 98 16. Outline1 Swadesh: glottochronology2 Gray and Atkinson: Language phylogenies3 Lieberman et al., Pagel et al.: Frequency of use4 Perreault and Mathew, Atkinson: Phoneme expansion5 Dunn et al.: Word-order universals6 Bouchard-Ct et al: automated cognatesoe7 Overall conclusionsRobin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 16 / 98 17. nature of effusive silicic lavas , and with textural observations atCompeting interests statement The authors declare that they have no competing nancialthe outcrop scale down to the microscale911 (Fig. 1). As opposed tointerests.the common view that explosive volcanism is dened as involvingGray and Atkinsonfragmentation of magma during ascent1, we conclude that frag-mentation may play an equally important role in reducing theCorrespondence and requests for materials should be addressed to H.M.G.(hmg@seismo.berkeley.edu).likelihood of explosive behaviour, by facilitating magma degassing.Because shear-induced fragmentation depends so strongly on therheology of the ascending magma, our ndings are in a broadersense equivalent to Eichelbergers hypothesis1 that higher viscosity ..............................................................of magma may favour non-explosive degassing rather thanhinder it, albeit with the added complexity of shear-inducedLanguage-tree divergence timesfragmentation. Asupport the Anatolian theoryReceived 19 May; accepted 15 November 2003; doi:10.1038/nature02138.1. Eichelberger, J. C. Silicic volcanism: ascent of viscous magmas from crustal reservoirs. Annu. Rev.of Indo-European originEarth Planet. Sci. 23, 4163 (1995).2. Dingwell, D. B. Volcanic dilemma: Flow or blow? Science 273, 10541055 (1996). Russell D. Gray & Quentin D. Atkinson3. Papale, P. Strain-induced magma fragmentation in explosive eruptions. Nature 397, 425428 (1999).4. Dingwell, D. B. & Webb, S. L. Structural relaxation in silicate melts and non-Newtonian melt rheologyDepartment of Psychology, University of Auckland, Private Bag 92019,in geologic processes. Phys. Chem. Miner. 16, 508516 (1989).5. Webb, S. L. & Dingwell, D. B. The onset of non-Newtonian rheology of silcate melts. Phys. Chem.Auckland 1020, New Zealand.............................................................................................................................................................................Miner. 17, 125132 (1990).6. Webb, S. L. & Dingwell, D. B. Non-Newtonian rheology of igneous melts at high stresses and strainLanguages, like genes, provide vital clues about human history1,2.rates: experimental results for rhyolite, andesite, basalt, and nephelinite. J. Geophys. Res. 95, The origin of the Indo-European language family is the most1569515701 (1990).intensively studied, yet still most recalcitrant, problem of his-7. Newman, S., Epstein, S. & Stolper, E. Water, carbon dioxide and hydrogen isotopes in glasses from theca. 1340 A.D. eruption of the Mono Craters, California: Constraints on degassing phenomena andtorical linguistics3. Numerous genetic studies of Indo-Europeaninitial volatile content. J. Volcanol. Geotherm. Res. 35, 7596 (1988). origins have also produced inconclusive results4,5,6. Here we8. Villemant, B. & Boudon, G. Transition from dome-forming to plinian eruptive styles controlled by analyse linguistic data using computational methods derivedH2O and Cl degassing. Nature 392, 6569 (1998).9. Polacci, M., Papale, P. & Rosi, M. Textural heterogeneities in pumices from the climactic eruption offrom evolutionary biology. We test two theories of Indo-Mount Pinatubo, 15 June 1991, and implications for magma ascent dynamics. Bull. Volcanol. 63, European origin: the Kurgan expansion and the Anatolian8397 (2001). farming hypotheses. The Kurgan theory centres on possible10. Tuffen, H., Dingwell, D. B. & Pinkerton, H. Repeated fracture and healing of silicic magma generatesarchaeological evidence for an expansion into Europe and theow banding and earthquakes? Geology 31, 10891092 (2003).11. Stasiuk, M. V. et al. Degassing during magma ascent in the Mule Creek vent (USA). Bull. Volcanol. 58,Near East by Kurgan horsemen beginning in the sixth millen-117130 (1996). nium BP 7,8. In contrast, the Anatolian theory claims that Indo-12. Goto, A. A new model for volcanic earthquake at Unzen Volcano: Melt rupture model. Geophys. Res.European languages expanded with the spread of agricultureLett. 26, 25412544 (1999).from Anatolia around 8,0009,500 years BP 9. In striking agree-13. Mastin, L. G. Insights into volcanic conduit ow from an open-source numerical model. Geochem.Geophys. Geosyst. 3, doi:10.1029/2001GC000192 (2002). ment with the Anatolian hypothesis, our analysis of a matrix of14. Proussevitch, A. A., Sahagian, D. L. & Anderson, A. T. Dynamics of diffusive bubble growth in 87 languages with 2,449 lexical items produced an estimated agemagmas: Isothermal case. J. Geophys. Res. 3, 2228322307 (1993).range for the initial Indo-European divergence of between 7,80015. Lensky, N. G., Lyakhovsky, V. & Navon, O. Radial variations of melt viscosity around growing bubblesand gas overpressure in vesiculating magmas. Earth Planet. Sci. Lett. 186, 16 (2001).and 9,800 years BP. These results were robust to changes in coding16. Rust, A. C. & Manga, M. Effects of bubble deformation on the viscosity of dilute suspensions. procedures, calibration points, rooting of the trees and priors inJ. Non-Newtonian Fluid Mech. 104, 5363 (2002). the bayesian analysis.NATURE | VOL 426 | 27 NOVEMBER 2003 | www.nature.com/nature 2003 Nature Publishing Group435 Robin J. Ryder (Paris Dauphine)Statistics and Historical Linguistics UCLA March 2013 17 / 98 18. Swadesh lists, better analysedUse Swadesh lists for 87 Indo-European languages, and a phylogeneticmodel from GeneticsAssume a tree-like model of evolutionBayesian inference via MCMC (Markov Chain Monte Carlo)Reconstruct trees and datesMain parameter of interest: age of the root (Proto-Indo-European,PIE) Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 18 / 98 19. Lexical trees Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 19 / 98 20. Correcting the issues with glottochronologyReturning to the issues with glottochronology:1 Simplistic model Slightly better, but the model of evolution isrudimentary2 No evaluation of uncertainty of estimates Bayesian inference3 Only small amounts of data are used Large number of languagesreduces variability of estimates Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 20 / 98 21. Why be Bayesian?In the settings described in this talk, it usually makes sense to useBayesian inference, because:The models are complexEstimating uncertainty is paramountThe output of one model is used as the input of anotherWe are interested in complex functions of our parameters Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 21 / 98 22. Frequentist statisticsStatistical inference deals with estimating an unknown parameter given some data D.In the frequentist view of statistics, has a true xed (deterministic)value.Uncertainty is measured by condence intervals, which are notintuitive to interpret: if I get a 95% CI of [80 ; 120] (i.e. 100 20)for , I cannot say that there is a 95% probability that belongs tothe interval [80 ; 120]. Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 22 / 98 23. Frequentist statisticsStatistical inference deals with estimating an unknown...</p>