

Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects, pages 26–35, Valencia, Spain, April 3, 2017. ©2017 Association for Computational Linguistics

Computational analysis of Gondi dialects

Taraka Rama and Çağrı Çöltekin and Pavel Sofroniev
Department of Linguistics
University of Tübingen, Germany
[email protected]
[email protected]
[email protected]

Abstract

This paper presents a computational analysis of Gondi dialects spoken in central India. We present a digitized data set of the dialect area, and analyze the data using different techniques from dialectometry, deep learning, and computational biology. We show that the methods largely agree with each other and with the earlier non-computational analyses of the language group.

1 Introduction

Gondi languages are spoken across a large region in the central part of India (cf. figure 1). The languages belong to the Dravidian language family and are closely related to Telugu, a major literary language spoken in South India. The Gondi languages received wide attention in comparative linguistics (Burrow and Bhattacharya, 1960; Garapati, 1991; Smith, 1991) due to their dialectal variation. On the one hand, the languages look like a dialect chain while, on the other hand, some of the dialects are shown to exhibit high levels of mutual unintelligibility (Beine, 1994).

Smith (1991) and Garapati (1991) perform classical comparative analyses of the dialects and classify the Gondi dialects into two subgroups: Northwest and Southeast. Garapati (1991) compares Gondi dialects where most of the dialects belong to the Northwest subgroup and only three dialects belong to the Southeast subgroup. In a different study, Beine (1994) collected lexical word lists transcribed in the International Phonetic Alphabet (IPA) for 210 concepts belonging to 46 sites and attempted to perform a classification based on word similarity. Beine (1994) determines two words to be cognate (having descended from the same common ancestor) if they are identical in form and meaning. The average similarity between two sites is determined as the average number of identical words between the two sites. The author describes the results qualitatively and does not perform any quantitative analysis. Until now, there has been no computational analysis of the lexical word lists to determine the exact relationship between these languages. We digitize the dataset and then perform a computational analysis.

Recent years have seen an increase in the number of computational methods applied to the study of both dialect and language classification. For instance, Nerbonne (2009) applied Levenshtein distance for the classification of Dutch and German dialects. Nerbonne finds that the classification offered by Levenshtein distance largely agrees with the traditional dialectological knowledge of the Dutch and German areas. In this paper, we apply dialectometric analysis to the Gondi language word lists.

In the related field of computational historical linguistics, Gray and Atkinson (2003) applied Bayesian phylogenetic methods from computational biology to date the age of the Proto-Indo-European language tree. The authors use cognate judgments given by historical linguists to infer both the topology and the root age of the Indo-European family. In parallel to this work, Kondrak (2009) applied phonetically motivated string similarity measures and word-alignment inspired methods for the purpose of determining if two words are cognates or not. This work was followed by List (2012) and Rama (2015), who employed statistical and string kernel methods for determining cognates in multilingual word lists.

In typical dialectometric studies (Nerbonne, 2009), the assumption that all the pronunciations of a particular word are cognates is often justified by the data. However, we cannot assume that this is the case in Gondi dialects, since there is a significant amount of lexical replacement due to borrowing (from contact) and internal lexical innovations. Moreover, the previous comparative linguistic studies classify the Gondi dialects using sound correspondences and lexical cognates. In this paper, we will use the Pointwise Mutual Information (Wieling et al., 2009) method for obtaining sound change matrices and use the matrix to automatically identify cognates.

Figure 1: The Gondi language area with major cities in focus. The dialect/site codes and the geographical distribution of the codes are based on Beine (1994).

The comparative linguistic research classified the Gondi dialects into five different genetic groups (cf. table 1). However, the exact branching of the Gondi dialects is still an open question. In this paper, we apply both dialectometric and phylogenetic approaches to determine the exact branching structure of the dialects.

The paper is organized as follows. In section 2, we describe the dataset and the gold standard dialect classification used in our experiments. In section 3, we describe the various techniques for computing and visualizing the dialectal differences. In section 4, we describe the results of the different analyses. We conclude the paper in section 5.

2 Datasets

The word lists for our experiments are derived from the fieldwork of Beine (1994). Beine (1994) provides multilingual word lists for 210 meanings in 46 sites in central India, which are shown in figure 1. In the following sections, we use the Glottolog classification (Nordhoff and Hammarström, 2011) as gold standard to evaluate the various analyses. Glottolog is an openly available resource that summarizes the genetic relationships of the world's dialects and languages from published scholarly linguistic articles. For reference, we provide the Glottolog classification¹ of the Gondi dialects in table 1. The Glottolog classification is derived from comparative linguistics (Garapati, 1991; Smith, 1991) and dialect mutual intelligibility tests (Beine, 1994).

Dialect codes: gdh, gam, gar, gse, glb, gtd, gkt, gch, prg, gka, gwa, grp, khu, ggg, gcj, bhe, pmd, psh, pkh, ght
Classification: Northwest Gondi, Northern Gondi

Dialect codes: rui, gki, gni, dog, gut, gra, lxg
Classification: Northwest Gondi, Southern Gondi

Dialect codes: met, get, mad, gba, goa, mal, gja, gbh, mbh
Classification: Southeast Gondi, General Southeast Gondi, Hill Maria-Koya, Hill Maria

Dialect codes: mku, mdh, ktg, mud, mso, mlj, gok
Classification: Southeast Gondi, General Southeast Gondi, Muria

Dialect codes: bhm, bhb, bhs
Classification: Southeast Gondi, General Southeast Gondi, Bison Horn Maria

Table 1: Classification of the 46 sites according to Glottolog (Nordhoff and Hammarström, 2011).

The whole dialect region is divided into two major groups, Northwest Gondi and Southeast Gondi, which are further divided into five major subgroups: Northern Gondi, Southern Gondi, Hill Maria, Bison Horn Maria, and Muria. Northern Gondi and Southern Gondi belong to the Northwest Gondi branch, whereas the rest of the subgroups belong to the Southeast Gondi branch. It has to be noted that there is no gold standard for the internal structure of the dialects belonging to each dialect group.

¹ http://glottolog.org/resource/languoid/id/gond1265

3 Methods for comparing and visualizing dialectal differences

We use the IPA transcribed data to compute both unweighted and weighted string similarity/distance between two words. We use the same IPA data to train LSTM autoencoders introduced by Rama and Çöltekin (2016) and project the autoencoder based distances onto a map.

As mentioned earlier, the dialectometric analyses typically assume that all words that share the same meaning are cognates. However, as shown by Garapati (1991), some Gondi dialects exhibit a clear tree structure. Both dialectometric and autoencoder methods only provide an aggregate amount of similarity between dialects and do not work with cognates directly. The methods are sensitive to lexical differences only through high dissimilarity of phonetic strings. Since lexical and phonetic differences are likely to indicate different processes of linguistic change, we also analyze the categorical differences due to lexical borrowings/changes. For this purpose, we perform automatic cognate identification and then use the inferred cognates to perform both Bayesian phylogenetic analysis and dialectometric analysis.

3.1 Dialectometry

3.1.1 Computing aggregate distances

In this subsection, we describe how Levenshtein distance and autoencoder based methods are employed for computing site-site distances.

Levenshtein distance: Levenshtein distance is defined as the minimum number of edit operations (insertion, deletion, and substitution) that are required to transform one string into another. We use the Gabmap (Nerbonne et al., 2011) implementation of Levenshtein distance to compute site-site differences.
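The edit-distance computation described above can be sketched in a few lines of Python. This is a minimal illustration, not Gabmap's actual implementation; the `normalized_levenshtein` helper (division by the longer string's length, a common dialectometric convention) is our own addition.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn string a into string b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def normalized_levenshtein(a, b):
    """Length-normalized variant often used for site-site comparison."""
    return levenshtein(a, b) / max(len(a), len(b), 1)
```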

Autoencoders: Rama and Çöltekin (2016) introduced LSTM autoencoders for the purpose of dialect classification. Autoencoders were employed by Hinton and Salakhutdinov (2006) for reducing the dimensionality of images and documents. Autoencoders learn a dense representation that can be used for clustering the documents and images.

An autoencoder network consists of two parts: an encoder and a decoder. The encoder network takes a word as input and transforms the word into a fixed dimension representation. The fixed dimension representation is then supplied as an input to a decoder network that attempts to reconstruct the input word. In our paper, both the encoder and decoder networks are Long Short-Term Memory networks (Hochreiter and Schmidhuber, 1997).

In this paper, each word is represented as a sequence (x_1, ..., x_T) of one-hot vectors of dimension |P|, where P is the set of IPA symbols across the dialects (58 in total). The encoder is an LSTM network that transforms each word into h ∈ R^k, where k is predetermined beforehand (in this paper, k is assigned a value of 32). The decoder consists of another LSTM network that takes h as input at each timestep to predict an output representation. Each output representation is then supplied to a softmax function to yield x̂_t ∈ R^|P|. The autoencoder network is trained using the binary cross-entropy function

  −∑_t [x_t log(x̂_t) + (1 − x_t) log(1 − x̂_t)]

where x_t is a one-hot vector and x̂_t is the output of the softmax function at timestep t, to learn the parameters of both the encoder and decoder LSTMs. Following Rama and Çöltekin (2016), we use a bidirectional LSTM as the encoder network and a unidirectional LSTM as the decoder network. Our autoencoder model was implemented using Keras (Chollet, 2015) with Tensorflow (Abadi et al., 2016) as the backend.
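The binary cross-entropy objective above can be written out directly. The sketch below computes only the loss (not the Keras/Tensorflow training code the authors used); the names `x_seq` (one-hot target vectors) and `xhat_seq` (softmax outputs) are our own.

```python
import math

def binary_cross_entropy(x_seq, xhat_seq):
    """Sum over timesteps t and vector dimensions of
    -(x*log(xhat) + (1-x)*log(1-xhat)), as in the loss above."""
    total = 0.0
    for x, xhat in zip(x_seq, xhat_seq):
        for xt, pt in zip(x, xhat):
            total -= xt * math.log(pt) + (1 - xt) * math.log(1 - pt)
    return total
```

A reconstruction close to the one-hot targets yields a loss near zero, while a near-uniform output is penalized heavily.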

3.1.2 Visualization

We use Gabmap, a web-based application for dialectometric analysis, for visualizing the site-site distances (Nerbonne et al., 2011; Leinonen et al., 2016).² Gabmap provides a number of methods for analyzing and visualizing dialect data. Below, we present maps and graphics that are results of multi-dimensional scaling (MDS) and clustering.

For all analyses, Gabmap aggregates the differences calculated over individual items (concepts) to a site-site distance matrix. With phonetic data, it uses site-site differences based on string edit distance with a higher penalty for vowel–consonant alignments and a lower penalty for the alignments of sound pairs that differ only in IPA diacritics. With binary data, Gabmap uses Hamming distances to compute the site-site differences. The cognate clusters obtained from the automatic identification procedure (section 3.2.1) form categories which are analyzed using binary distances. Finally, we also visualize the distances from the autoencoders (section 3.1.1) using Gabmap.

² Available at http://gabmap.nl/.
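Hamming distance over categorical (cognate-class) features is straightforward to sketch. The site labels and feature vectors below are hypothetical, purely to illustrate how binary/categorical differences aggregate to a site-site matrix.

```python
def hamming(u, v):
    """Fraction of positions (concepts) where two sites differ."""
    assert len(u) == len(v)
    return sum(a != b for a, b in zip(u, v)) / len(u)

# Hypothetical sites, one cognate-class label per concept:
sites = {
    "A": [1, 1, 0, 2],
    "B": [1, 1, 0, 1],
    "C": [2, 0, 1, 1],
}
# Site-site distance matrix over all pairs:
dist = {(s, t): hamming(sites[s], sites[t]) for s in sites for t in sites}
```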

Gabmap provides various agglomerative hierarchical clustering methods for clustering analyses. In all the results below, we use Ward's method for calculating cluster differences. For our analyses, we present the clustering results on (color) maps and dendrograms. Since the clustering is known to be relatively unstable, we also present probabilistic dendrograms that are produced by noisy clustering (Nerbonne et al., 2008). In noisy clustering, a single cluster analysis is performed a large number of times (∼100) by adding a small amount of noise to the distance matrix that is proportional to the standard deviation of the original distance matrix. The combined analysis then provides statistical support for the branches in a dendrogram.
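The noisy clustering idea can be illustrated with a toy version: perturb the distance matrix many times with noise proportional to its standard deviation and count how often a grouping recurs. A simple closest-pair check stands in for Ward's method here; all names, the noise scale, and the run count are illustrative, not Gabmap's settings.

```python
import random
import statistics

def pair_support(dist, pair, runs=100, noise_scale=0.2, seed=0):
    """Fraction of noisy runs in which `pair` is the closest pair of
    sites -- a toy analogue of branch support in noisy clustering."""
    rng = random.Random(seed)
    sigma = statistics.pstdev(dist.values()) * noise_scale
    hits = 0
    for _ in range(runs):
        noisy = {k: v + rng.gauss(0, sigma) for k, v in dist.items()}
        if min(noisy, key=noisy.get) == pair:
            hits += 1
    return hits / runs
```

A pair whose distance is clearly smaller than all others survives nearly every noisy run, mirroring the high support values reported in the probabilistic dendrograms below.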

Multi-dimensional scaling (MDS) is a useful analysis and visualization technique for verifying the clustering results and visualizing the dialect continuum. A site-site (linguistic) distance matrix represents each site in a multi-dimensional space. MDS 'projects' these distances to a smaller dimensional space that can be visualized easily. In dialect data, the distances in the few most important MDS dimensions correlate highly with the original distances, and these dimensions often correspond to linguistically meaningful dimensions. Below, we also present maps where areas around linguistically similar locations are plotted using similar colors.

3.2 Phylogenetic approaches

3.2.1 Automatic cognate detection

Given a multilingual word list for a concept, the automatic cognate detection procedure (Hauer and Kondrak, 2011) can be broken into two parts:

1. Compute a pairwise similarity score for all word pairs in the concept.

2. Supply the pairwise similarity matrix to a clustering algorithm to output clusters that show high similarity with one another.

The Needleman-Wunsch algorithm (NW; Needleman and Wunsch (1970); the similarity counterpart of Levenshtein distance) is a possible choice for computing the similarity between two words. The NW algorithm maximizes similarity whereas Levenshtein distance minimizes the distance between two words. The NW algorithm assigns a score of 1 for a character match and a score of −1 for a character mismatch. Unlike Levenshtein distance, the NW algorithm assigns a penalty score for opening a gap (deletion operation) and a penalty for gap extension, which models the fact that deletion operations occur in chunks (Jäger, 2013).
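A minimal NW sketch follows. For brevity it uses a single linear gap penalty rather than the separate gap-opening and gap-extension penalties described above; the scoring values are the simple match/mismatch scores from the text, and the function name is our own.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score with a linear gap
    penalty (a simplification of the affine open/extend scheme)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        curr = [i * gap]
        for j, cb in enumerate(b, 1):
            curr.append(max(prev[j] + gap,        # gap in b
                            curr[j - 1] + gap,    # gap in a
                            prev[j - 1] + (match if ca == cb else mismatch)))
        prev = curr
    return prev[-1]
```

Note the `max` in place of Levenshtein's `min`: NW maximizes a similarity score instead of minimizing an edit cost.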

The NW algorithm is not sensitive to different sound segment pairs, but a realistic algorithm should assign a higher similarity score to sound correspondences such as /l/ ∼ /r/ than to sound correspondences such as /p/ ∼ /r/. The weighted Needleman-Wunsch algorithm takes a segment-segment similarity matrix as input and then aligns the two strings to maximize the similarity between the two words.

In dialectometry (Wieling et al., 2009), the segment-segment similarity matrix is estimated using pointwise mutual information (PMI). The PMI score for two sounds x and y is defined as follows:

  pmi(x, y) = log ( p(x, y) / (p(x) p(y)) )    (1)

where p(x, y) is the probability of x and y being matched in a pair of cognate words, whereas p(x) is the probability of x. A positive PMI value between x and y indicates that the probability of x being aligned with y in a pair of cognates is higher than what would be expected by chance. Conversely, a negative PMI value indicates that an alignment of x with y is more likely the result of chance than of shared inheritance.
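Equation (1) can be estimated directly from a list of aligned segment pairs. The sketch below is a simplified illustration of that estimation (gap alignments are excluded, and the probability estimates are plain relative frequencies); the function name and input format are our own.

```python
import math
from collections import Counter

def pmi_matrix(alignments):
    """Estimate pmi(x, y) = log(p(x, y) / (p(x) p(y))) from a list of
    aligned segment pairs, e.g. [("l", "r"), ("p", "b"), ...]."""
    pair_counts = Counter(alignments)
    seg_counts = Counter()
    for x, y in alignments:
        seg_counts[x] += 1
        seg_counts[y] += 1
    n_pairs = sum(pair_counts.values())
    n_segs = sum(seg_counts.values())
    return {
        (x, y): math.log((c / n_pairs) /
                         ((seg_counts[x] / n_segs) * (seg_counts[y] / n_segs)))
        for (x, y), c in pair_counts.items()
    }
```

Segment pairs that co-occur more often than their individual frequencies predict receive positive PMI, as described above.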

The PMI based computation requires a prior list of plausible cognates for computing a weighted similarity matrix between sound segments. In the initial step, we extract cross-lingual word pairs that have a Levenshtein distance of less than 0.5 and treat them as a list of plausible cognates. The PMI estimation procedure is as follows:

1. Compute alignments between the word pairs using a vanilla Needleman-Wunsch algorithm.³

2. The computed alignments from step 1 are then used to compute the similarity between segments x, y according to the following formula:

³ We set the gap-opening penalty to −2.5 and the gap-extension penalty to −1.75.


3. The PMI matrix obtained from step 2 is used to realign the word pairs and generate a new list of segment alignments. The new list of alignments is employed to compute a new PMI matrix.

4. Steps 2 and 3 are repeated until the difference between PMI matrices reaches zero.

In our experience, five iterations were sufficient to reach convergence. At this stage, we use the PMI matrix to compute a word similarity matrix between the words belonging to a single meaning. The word similarity matrix was converted into a word distance matrix using the following transformation: (1 + exp(x))⁻¹, where x is the PMI score between two words. We use the InfoMap clustering algorithm (List et al., 2016) for the purpose of identifying cognate clusters.
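The similarity-to-distance transform and the final clustering step can be sketched as follows. The connected-components clustering below (union-find under a distance threshold) is only a simple stand-in for the InfoMap algorithm actually used; the threshold value and all names are illustrative.

```python
import math

def pmi_to_distance(score):
    """Map a word-pair PMI similarity score x to (1 + exp(x))^-1,
    a distance in (0, 1): high similarity -> distance near 0."""
    return 1.0 / (1.0 + math.exp(score))

def threshold_clusters(words, dist, threshold=0.5):
    """Group words whose pairwise distance is below `threshold` into
    connected components -- a stand-in for InfoMap clustering."""
    parent = {w: w for w in words}
    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path halving
            w = parent[w]
        return w
    for i, u in enumerate(words):
        for v in words[i + 1:]:
            if dist[(u, v)] < threshold:
                parent[find(u)] = find(v)
    clusters = {}
    for w in words:
        clusters.setdefault(find(w), []).append(w)
    return list(clusters.values())
```

Note the transform's direction: a strongly positive PMI score (likely cognates) maps to a distance near 0, and a strongly negative score to a distance near 1.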

3.2.2 Bayesian phylogenetic inference

Bayesian phylogenetics originated in evolutionary biology and works by inferring the evolutionary relationships (trees) between the DNA sequences of species. The same method can be applied to binary traits of species (Yang, 2014). A binary trait is typically the presence or absence of an evolutionary character in a biological organism. Computational biologists employ a probabilistic substitution model θ that models the transition probabilities from 0 → 1 and 1 → 0. The substitution matrix would be a 2 × 2 matrix in the case of a binary data matrix.

An evolutionary tree that explains the relationship between languages consists of a topology (τ) and branch lengths (T). The likelihood of the binary data given a tree is computed using the pruning algorithm (Felsenstein, 1981). Ideally, identifying the best tree would involve exhaustive enumeration of the trees and calculating the likelihood of the binary matrix for each tree. However, the number of possible binary tree topologies grows factorially ((2n − 3)!!, where n is the number of languages), and hence the enumeration is intractable even for a small number (20) of languages. The inference problem is then to estimate the joint posterior density of τ, θ, T.
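The double factorial (2n − 3)!! grows very quickly, which is why exhaustive enumeration is hopeless. A quick computation, following the count given in the text, illustrates the scale (the function name is our own):

```python
def num_tree_topologies(n):
    """(2n - 3)!!: the number of binary tree topologies over n
    languages, as given in the text (product of odd numbers 3..2n-3)."""
    count = 1
    for k in range(3, 2 * n - 2, 2):
        count *= k
    return count
```

For n = 4 languages there are only 15 topologies, but for the 20 languages mentioned in the text the count already exceeds 10²⁰, and for the 46 Gondi sites it is astronomically larger.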

The Bayesian phylogenetic inference program (MrBayes;⁴ Ronquist and Huelsenbeck (2003)) requires a binary matrix (languages × number of clusters) of 0s and 1s, where each column shows whether a language is present in a cluster or not. The cognate clusters are converted into a binary matrix of 0s and 1s in the following manner. A word for a meaning would belong to one or more cognate sets. For example, in the case of German, Swedish, and Hindi, the word for dog in German 'Hund' and Swedish 'hund' would belong to the same cognate set, while Hindi 'kutta' would belong to a different category. The binary trait matrix for these languages for a single meaning, dog, would be as in table 2. A Bayesian phylogenetic analysis employs a Markov chain Monte Carlo procedure to navigate across the tree space. In this paper, we ran two independent runs until the trees inferred by the two runs did not differ beyond a threshold of 0.01. In summary, we ran both chains for 4 million states and sampled trees at every 500 states to avoid auto-correlation. We then discarded the initial one million states as burn-in and generated a summary tree of the post burn-in runs (Felsenstein, 2004). The summary tree consists of only those branches which occur more than 50% of the time in the posterior sample, consisting of 25,000 trees.

German   Hund   1  0
Swedish  hund   1  0
Hindi    kutta  0  1

Table 2: Binary matrix for meaning "dog".

⁴ http://mrbayes.sourceforge.net/
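The conversion from cognate clusters to the binary trait matrix of table 2 can be sketched as follows; the function name and input format are our own, using the German/Swedish/Hindi example from the text.

```python
def cognate_binary_matrix(cognate_sets):
    """Convert the cognate clusters for one meaning into a binary
    presence/absence matrix (one row per language, one column per
    cognate set), as in table 2."""
    rows = {}
    for j, cluster in enumerate(cognate_sets):
        for lang in cluster:
            rows.setdefault(lang, [0] * len(cognate_sets))[j] = 1
    return rows

# The "dog" example: {Hund, hund} form one cognate set, kutta another.
matrix = cognate_binary_matrix([{"German", "Swedish"}, {"Hindi"}])
```

In the full analysis, the per-meaning columns for all 210 meanings are concatenated into one languages × clusters matrix for MrBayes.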

4 Results

In this section, we present visualizations of differences in the language area using MDS and noisy clustering.

4.1 String edit distance

In the left map in figure 2, the first three MDS dimensions are mapped to RGB color space, visualizing the differences between the locations. Note that the major dialectal differences outlined in table 1 are visible in this visualization. For example, the magenta and yellow-green regions separate the Bison Horn Maria and the Hill Maria groups from the surrounding areas with sharp contrasts. The original linguistic distances and the distances based on the first three MDS dimensions correlate with r = 0.90, hence retaining about 81% of the variation in the original distances. The middle map in figure 2 displays only the first dimension, which seems to represent a difference between north and south. On the other hand, the right map (second MDS dimension) seems to indicate a difference between Bison Horn Maria (and to some extent Muria) and the rest.

Figure 2: MDS analysis performed by Gabmap with string edit distance. The left map shows the first three MDS dimensions mapped to RGB color space. The middle map shows only the first dimension, and the right map shows the second MDS dimension. The first three dimensions correlate with the original distances with r = 0.73, r = 0.55 and r = 0.41, respectively, and the first three dimensions together with r = 0.90.

Figure 3: Clustering analysis performed by Gabmap with string edit distance using Ward's method; the colors in the map indicate 5 dialect groups. Probabilistic dendrogram from the default Gabmap analysis (string edit distance).

The clustering results are also complementary to the MDS analysis. The 5-way cluster map presented in figure 3 indicates the expected dialect groups described in table 1. Despite some unexpected results in the detailed clustering, the probabilistic dendrogram presented in figure 3 also shows that the main dialect groups are stable across noisy clustering experiments. For instance, the Bison Horn Maria group (bhm, bhs, bhb) presented on the top part of the dendrogram indicates a very stable group: these locations are clustered together in all the noisy clustering experiments. Similarly, the next three locations (mso, mud, mlj, belonging to the Muria area) also show a very strong internal consistency, and combine with the Bison Horn Maria group in 72% of the noisy clustering experiments. However, other members of the Muria group (mdh, mku, ktg, gok at the bottom of the probabilistic dendrogram) often seem to be placed apart from the rest of the group.

4.2 Binary distances

Next, we present the MDS analysis based on lexical distances in figure 4. For this analysis, we identify cognates for each meaning (cf. section 3.2.1), and treat the cognate clusters found in each location as the only (categorical) features for the analysis. The overall picture seems to be similar to the analysis based on the phonetic data, although the north-south differences are more visible in this analysis. Besides the first three dimensions (left map), both the first (middle map) and second (right map) dimensions indicate differences between north and south. The left map shows that there is a gradual transition from the Northern dialects (gtd, gkt, prg, ggg, khu, bhe, gcj, pmd, psh, pkh) to the rest of the northern dialects that share borders with the Muria and Southern dialects. There is a transition from the Southern dialects to the Hill Maria dialects, where the Hill Maria dialects do not show much variation.

Figure 4: MDS analysis performed by Gabmap with categorical differences. The left map shows the first three MDS dimensions mapped to RGB color space. The middle map shows only the first dimension, and the right map shows the second MDS dimension. The first three dimensions correlate with the original distances with r = 0.77, r = 0.53 and r = 0.41, respectively, and the first three dimensions together with r = 0.94.

Figure 5: The map shows the results of the hierarchical clustering (left) based on the binary matrix. Probabilistic dendrogram from the Gabmap analysis with Hamming distances.

The clustering analysis of the binary matrix from the cognate detection step is projected on the geographical map of the region in figure 5. The map retrieves the five clear subgroups listed in table 1. Then, we perform a noisy clustering analysis of the Hamming distance matrix, which is shown in the same figure. The dendrogram places the Bison Horn Maria dialects (bhm, bhs, bhb) along with the eastern dialects of the Muria subgroup. It also places all the Northern Gondi dialects into a single cluster with high confidence. The dendrogram also places all the southern dialects into a single cluster. On the other hand, the dendrogram incorrectly places the Hill Maria dialects along with the western dialects of the Muria subgroup. With slight variation in the details, the cluster analysis and the probabilistic dendrogram presented in figure 5 are similar to the analysis based on phonetic differences.

4.3 Autoencoder distances

The MDS analysis of the autoencoder-based distances is shown in figure 6. The RGB color map of the first three dimensions shows the five dialect regions. The figure shows a clear boundary between the Northern and Southern Gondi dialects. The map shows the Bison Horn Maria region in a distinct blue color that does not show much variance. The autoencoder MDS dimensions correlate the highest with the autoencoder distance matrix. The first dimension (middle map in figure 6) clearly distinguishes the Northern dialects from the rest. The second dimension distinguishes the Southern Gondi dialects and the Muria dialects from the rest of the dialects.

Figure 6: MDS analysis performed by Gabmap with autoencoder differences. The left map shows the first three MDS dimensions mapped to RGB color space. The middle map shows only the first dimension, and the right map shows the second MDS dimension. The first three dimensions correlate with the original distances with r = 0.74, r = 0.57, and r = 0.49, respectively, and together with r = 0.92.

Figure 7: Probabilistic dendrogram from the Gabmap analysis with autoencoder distances. The clustering result is similar to the left map in figure 3.

The clustering analysis of the autoencoder distances is projected onto the geographical map in figure 7. The map retrieves the five subgroups in table 1. The noisy clustering clearly puts the Bison Horn Maria group into a single cluster. It also places all the northern dialects into a single group with 100% confidence. On the other hand, the dendrogram splits the Southern Gondi dialects into eastern and western parts. The eastern parts are placed along with the Hill Maria dialects. The clustering analysis also splits the Muria dialects into three parts. However, the dendrogram places gok (an eastern Muria dialect) incorrectly with Far Western Muria (mku).
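The noisy clustering procedure (Nerbonne et al., 2008) re-clusters the distance matrix many times under random perturbations and records how often each group recurs; the recurrence fractions are the confidence values on the dendrogram. A minimal sketch on toy distances (not the Gondi data) might look like this:

```python
import numpy as np
from collections import Counter
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)

# Toy distances between four sites A..D, converted to condensed form
# (hypothetical values, not the Gondi data).
sites = ["A", "B", "C", "D"]
D = squareform(np.array([
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.7, 0.8],
    [0.8, 0.7, 0.0, 0.2],
    [0.9, 0.8, 0.2, 0.0],
]))

# Noisy clustering: repeatedly add random noise to the distances,
# re-cluster, and count how often each group recurs.
runs = 200
support = Counter()
for _ in range(runs):
    noisy = np.clip(D + rng.normal(0.0, 0.5 * D.std(), size=D.shape), 0.0, None)
    labels = fcluster(linkage(noisy, method="average"), t=2, criterion="maxclust")
    for k in set(labels):
        support[frozenset(s for s, l in zip(sites, labels) if l == k)] += 1

# Fraction of noisy runs in which A and B form a cluster together.
print(support[frozenset({"A", "B"})] / runs)
```

A group that survives most perturbations (a fraction close to 1) is a stable cluster; groups that appear only sporadically are artifacts of a single clustering run.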

4.4 Bayesian analysis

The summary tree of the Bayesian analysis is shown in figure 8. The figure also shows the percentage of posterior trees in which each branch occurs. The tree clearly divides Northwest Gondi from the Southeast Gondi groups. The tree places all the Northern Gondi dialects into a single group in 99% of the trees. The southern dialects are split into two branches, with rui, dog, and gki branching off from the common Northwest Gondi later than the rest of the Southern Gondi dialects. The tree clearly splits the Hill Maria dialects from the rest of the Southeast Gondi dialects. The tree also places all the Bison Horn Maria dialects into a single group but does not separate them from the rest of the Muria dialects.

Figure 8: The majority consensus tree of the Bayesian posterior trees.
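The branch percentages are clade frequencies in the posterior sample: a clade's support is the fraction of sampled trees that contain it, and the majority consensus keeps the clades found in more than half of the sample. A minimal sketch, using a toy posterior rather than actual MrBayes output:

```python
from collections import Counter

# Each posterior tree is represented by its set of clades (subsets of taxa).
# Toy sample of four "posterior" trees over five taxa, for illustration only.
posterior = [
    [{"a", "b"}, {"a", "b", "c"}, {"d", "e"}],
    [{"a", "b"}, {"a", "b", "c"}, {"d", "e"}],
    [{"a", "b"}, {"c", "d"}, {"c", "d", "e"}],
    [{"a", "b"}, {"a", "b", "c"}, {"d", "e"}],
]

# Count how often each clade occurs across the posterior sample.
counts = Counter(frozenset(c) for tree in posterior for c in tree)

# The majority consensus keeps clades found in more than half of the trees,
# annotated with their support percentage (the numbers on the branches).
consensus = {clade: 100 * n // len(posterior)
             for clade, n in counts.items() if n > len(posterior) / 2}
print(sorted((sorted(c), p) for c, p in consensus.items()))
```

Here {a, b} appears in every sampled tree (100% support), {a, b, c} and {d, e} in three of four (75%), while {c, d} occurs too rarely to enter the consensus.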

5 Conclusion

In this paper, we analyzed the Gondi dialects using tools from dialectometry and computational historical linguistics. The dialectometric analysis correctly retrieves all the subgroups in the region. However, the edit distance and the autoencoder distances differ in the noisy clustering analysis. On the other hand, the noisy clustering analysis of the binary cognate matrix yields the best results. The Bayesian tree based on the cognate analysis also retrieves the top-level subgroups correctly but does not clearly distinguish the Bison Horn Maria group from the Muria dialects. In fact, the Bayesian tree agrees the most with the gold standard classification from Glottolog.

The contributions of this paper are as follows: we digitized a multilingual lexical wordlist for 46 dialects, applied both dialectometric and phylogenetic methods to the classification of the dialects, and found that the phylogenetic methods perform best when compared to the gold standard classification.

Acknowledgments

The authors thank the five reviewers for the comments which helped improve the paper. The first and third authors are supported by the ERC Advanced Grant 324246 EVOLAEMP, which is gratefully acknowledged. The code and the data for the experiments are available at https://github.com/PhyloStar/Gondi-Dialect-Analysis

References

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. 2016. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467.

David K. Beine. 1994. A sociolinguistic survey of the Gondi-speaking communities of central India. Master's thesis, San Diego State University, San Diego.

Thomas Burrow and S. Bhattacharya. 1960. A comparative vocabulary of the Gondi dialects. Journal of the Asiatic Society, 2:73–251.

François Chollet. 2015. Keras. GitHub repository: https://github.com/fchollet/keras.

Joseph Felsenstein. 1981. Evolutionary trees from DNA sequences: A maximum likelihood approach. Journal of Molecular Evolution, 17(6):368–376.

Joseph Felsenstein. 2004. Inferring phylogenies. Sinauer Associates, Sunderland, Massachusetts.

Umamaheshwar Rao Garapati. 1991. Subgrouping of the Gondi dialects. In B. Lakshmi Bai and B. Ramakrishna Reddy, editors, Studies in Dravidian and general linguistics: a festschrift for Bh. Krishnamurti, pages 73–90. Centre of Advanced Study in Linguistics, Osmania University.

Russell D. Gray and Quentin D. Atkinson. 2003. Language-tree divergence times support the Anatolian theory of Indo-European origin. Nature, 426(6965):435–439.

Bradley Hauer and Grzegorz Kondrak. 2011. Clustering semantically equivalent words into cognate sets in multilingual lists. In Proceedings of 5th International Joint Conference on Natural Language Processing, pages 865–873, Chiang Mai, Thailand, November. Asian Federation of Natural Language Processing.

Geoffrey E. Hinton and Ruslan R. Salakhutdinov. 2006. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Gerhard Jäger. 2013. Phylogenetic inference from word lists using weighted alignment with empirically determined weights. Language Dynamics and Change, 3(2):245–291.

Grzegorz Kondrak. 2009. Identification of cognates and recurrent sound correspondences in word lists. Traitement Automatique des Langues et Langues Anciennes, 50(2):201–235, October.

Therese Leinonen, Çağrı Çöltekin, and John Nerbonne. 2016. Using Gabmap. Lingua, 178:71–83.

Johann-Mattis List, Philippe Lopez, and Eric Bapteste. 2016. Using sequence similarity networks to identify partial cognates in multilingual wordlists. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 599–605, Berlin, Germany, August. Association for Computational Linguistics.

Johann-Mattis List. 2012. LexStat: Automatic detection of cognates in multilingual wordlists. In Proceedings of the EACL 2012 Joint Workshop of LINGVIS & UNCLH, pages 117–125, Avignon, France, April. Association for Computational Linguistics.

Saul B. Needleman and Christian D. Wunsch. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453.

John Nerbonne, Peter Kleiweg, Wilbert Heeringa, and Franz Manni. 2008. Projecting dialect differences to geography: Bootstrap clustering vs. noisy clustering. In Christine Preisach, Lars Schmidt-Thieme, Hans Burkhardt, and Reinhold Decker, editors, Data Analysis, Machine Learning, and Applications. Proc. of the 31st Annual Meeting of the German Classification Society, pages 647–654, Berlin. Springer.

John Nerbonne, Rinke Colen, Charlotte Gooskens, Peter Kleiweg, and Therese Leinonen. 2011. Gabmap – a web application for dialectology. Dialectologia, Special Issue II:65–89.

John Nerbonne. 2009. Data-driven dialectology. Language and Linguistics Compass, 3(1):175–198.

Sebastian Nordhoff and Harald Hammarström. 2011. Glottolog/Langdoc: Defining dialects, languages, and language families as collections of resources. In Proceedings of the First International Workshop on Linked Science, volume 783, pages 53–58.

Taraka Rama and Çağrı Çöltekin. 2016. LSTM autoencoders for dialect analysis. In Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), pages 25–32, Osaka, Japan, December. The COLING 2016 Organizing Committee.

Taraka Rama. 2015. Automatic cognate identification with gap-weighted string subsequences. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1227–1231.

Fredrik Ronquist and John P. Huelsenbeck. 2003. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics, 19(12):1572–1574.

Ian Smith. 1991. Interpreting conflicting isoglosses: Historical relationships among the Gondi dialects. In B. Lakshmi Bai and B. Ramakrishna Reddy, editors, Studies in Dravidian and general linguistics: a festschrift for Bh. Krishnamurti, pages 27–48. Centre of Advanced Study in Linguistics, Osmania University.

Martijn Wieling, Jelena Prokić, and John Nerbonne. 2009. Evaluating the pairwise string alignment of pronunciations. In Proceedings of the EACL 2009 Workshop on Language Technology and Resources for Cultural Heritage, Social Sciences, Humanities, and Education, pages 26–34. Association for Computational Linguistics.

Ziheng Yang. 2014. Molecular evolution: A statistical approach. Oxford University Press, Oxford.
