
Proceedings of the 2009 Workshop on Graph-based Methods for Natural Language Processing, ACL-IJCNLP 2009, pages 14–22, Suntec, Singapore, 7 August 2009. © 2009 ACL and AFNLP

Bipartite spectral graph partitioning to co-cluster varieties and sound correspondences in dialectology

Martijn Wieling
University of Groningen
The Netherlands
[email protected]

John Nerbonne
University of Groningen
The Netherlands
[email protected]

Abstract

In this study we used bipartite spectral graph partitioning to simultaneously cluster varieties and sound correspondences in Dutch dialect data. While clustering geographical varieties with respect to their pronunciation is not new, the simultaneous identification of the sound correspondences giving rise to the geographical clustering presents a novel opportunity in dialectometry. Earlier methods aggregated sound differences and clustered on the basis of aggregate differences. The determination of the significant sound correspondences which co-varied with cluster membership was carried out on a post hoc basis. Bipartite spectral graph clustering simultaneously seeks groups of individual sound correspondences which are associated, even while seeking groups of sites which share sound correspondences. We show that the application of this method results in clear and sensible geographical groupings and discuss the concomitant sound correspondences.

1 Introduction

Exact methods have been applied successfully to the analysis of dialect variation for over three decades (Séguy, 1973; Goebl, 1982; Nerbonne et al., 1999), but they have invariably functioned by first probing the linguistic differences between each pair of a range of varieties (sites, such as Whitby and Bristol in the UK) over a body of carefully controlled material (say the pronunciation of the vowel in the word 'put'). Second, the techniques AGGREGATE over these linguistic differences, in order, third, to seek the natural groups in the data via clustering or multidimensional scaling (MDS) (Nerbonne, 2009).

Naturally techniques have been developed to determine which linguistic variables weigh most heavily in determining affinity among varieties. But all of the following studies separate the determination of varietal relatedness from the question of its detailed linguistic basis. Kondrak (2002) adapted a machine translation technique to determine which sound correspondences occur most regularly. His focus was not on dialectology, but rather on diachronic phonology, where the regular sound correspondences are regarded as strong evidence of historical relatedness. Heeringa (2004: 268–270) calculated which words correlated best with the first, second and third dimensions of an MDS analysis of aggregate pronunciation differences. Shackleton (2005) used a database of abstract linguistic differences in trying to identify the British sources of American patterns of speech variation. He applied principal component analysis to his database to identify the common components among his variables. Nerbonne (2006) examined the distance matrices induced by each of two hundred vowel pronunciations automatically extracted from a large American collection, and subsequently applied factor analysis to the covariance matrices obtained from the collection of vowel distance matrices. Prokić (2007) analyzed Bulgarian pronunciation using an edit distance algorithm and then collected commonly aligned sounds. She developed an index to measure how characteristic a given sound correspondence is for a given site.

To study varietal relatedness and its linguistic basis in parallel, we apply bipartite spectral graph partitioning. Dhillon (2001) was the first to use spectral graph partitioning on a bipartite graph of documents and words, effectively clustering groups of documents and words simultaneously. Consequently, every document cluster has a direct connection to a word cluster; the document clustering implies a word clustering and vice versa. In his study, Dhillon (2001) also demonstrated that his algorithm worked well on real-world examples.

The usefulness of this approach is not limited to clustering documents and words simultaneously. For example, Kluger et al. (2003) used a somewhat adapted bipartite spectral graph partitioning approach to successfully cluster microarray data simultaneously in clusters of genes and conditions.

In summary, the contribution of this paper is to apply a graph-theoretic technique, bipartite spectral graph partitioning, to a new sort of data, namely dialect pronunciation data, in order to solve an important problem, namely how to recognize groups of varieties in this sort of data while simultaneously characterizing the linguistic basis of the group. It is worth noting that, in isolating the linguistic basis of varietal affinities, we thereby hope to contribute technically to the study of how cognitive and social dynamics interact in language variation. Although we shall not pursue this explicitly in the present paper, our idea is very simple. The geographic signal in the data is a reflection of the social dynamics, where geographic distance is the rough operationalization of social contact. In fact, dialectometry is already successful in studying this. We apply techniques to extract (social) associations among varieties and (linguistic) associations among the speech habits which the similar varieties share. The latter, linguistic associations, are candidates for cognitive explanation. Although this paper cannot pursue the cognitive explanation, it will provide the material which a cognitive account might seek to explain.

The remainder of the paper is structured as follows. Section 2 presents the material we studied, a large database of contemporary Dutch pronunciations. Section 3 presents the methods, both the alignment technique used to obtain sound correspondences, as well as the bipartite spectral graph partitioning we used to simultaneously seek affinities in varieties as well as affinities in sound correspondences. Section 4 presents our results, while Section 5 concludes with a discussion and some ideas on avenues for future research.

2 Material

In this study we use the most recent broad-coverage Dutch dialect data source available: data from the Goeman-Taeldeman-Van Reenen-project (GTRP; Goeman and Taeldeman, 1996; Van den Berg, 2003). The GTRP consists of digital transcriptions for 613 dialect varieties in the Netherlands (424 varieties) and Belgium (189 varieties), gathered during the period 1980–1995. For every variety, a maximum of 1876 items was narrowly transcribed according to the International Phonetic Alphabet. The items consist of separate words and phrases, including pronominals, adjectives and nouns. A detailed overview of the data collection is given in Taeldeman and Verleyen (1999).

Because the GTRP was compiled with a view to documenting both phonological and morphological variation (De Schutter et al., 2005) and our purpose here is the analysis of sound correspondences, we ignore many items of the GTRP. We use the same 562-item subset as introduced and discussed in depth in Wieling et al. (2007). In short, the 1876-item word list was filtered by selecting only single-word items, plural nouns (the singular form was preceded by an article and therefore not included), base forms of adjectives instead of comparative forms, and the first-person plural verb instead of other forms. We omit words whose variation is primarily morphological, as we wish to focus on sound correspondences. In all varieties the same lexeme was used for a single item.

Because the GTRP transcriptions of Belgian varieties are fundamentally different from transcriptions of Netherlandic varieties (Wieling et al., 2007), we will restrict our analysis to the 424 Netherlandic varieties. The geographic distribution of these varieties including province names is shown in Figure 1. Furthermore, note that we will not look at diacritics, but only at the 82 distinct phonetic symbols. The average length of every item in the GTRP (without diacritics) is 4.7 tokens.

3 Methods

To obtain the clearest signal of varietal differences in sound correspondences, we ideally want to compare the pronunciations of each variety with a single reference point. We might have used the pronunciations of a proto-language for this purpose, but these are not available. There are also no pronunciations in standard Dutch in the GTRP, and transcribing the standard Dutch pronunciations ourselves would likely have introduced between-transcriber inconsistencies. Heeringa (2004: 274–276) identified pronunciations in the variety of Haarlem as being the closest to standard Dutch.


Figure 1: Distribution of GTRP localities including province names

Because Haarlem was not included in the GTRP varieties, we chose the transcriptions of Delft (also close to standard Dutch) as our reference transcriptions. See the discussion section for a consideration of alternatives.

3.1 Obtaining sound correspondences

To obtain the sound correspondences for every site in the GTRP with respect to the reference site Delft, we used an adapted version of the regular Levenshtein algorithm (Levenshtein, 1965).

The Levenshtein algorithm aligns two (phonetic) strings by minimizing the number of edit operations (i.e. insertions, deletions and substitutions) required to transform one string into the other. For example, the Levenshtein distance between [lEIk@n] and [likh8n], two Dutch variants of the word 'seem', is 4:

lEIk@n    delete E      1
lIk@n     subst. i/I    1
lik@n     insert h      1
likh@n    subst. 8/@    1
likh8n
                       ---
                        4

The corresponding alignment is:

l  E  I  k  -  @  n
l  -  i  k  h  8  n
   1  1     1  1
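For readers who want to reproduce this distance, the following is a minimal sketch of the plain Levenshtein algorithm in Python (our own illustration, not code from the paper); the phonetic symbols are kept in the ASCII form in which they appear here.

def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions (unit costs)
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))          # distances for the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]                          # deleting the first i symbols of a
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution (free if equal)
        prev = curr
    return prev[-1]

# The example from the text, two Dutch variants of 'seem':
print(levenshtein("lEIk@n", "likh8n"))  # prints 4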

When all edit operations have the same cost, multiple alignments yield a Levenshtein distance of 4 (i.e. by aligning the [i] with the [E] and/or by aligning the [@] with the [h]). To obtain only the best alignments we used an adaptation of the Levenshtein algorithm which uses automatically generated segment substitution costs. This procedure was proposed and described in detail by Wieling et al. (2009) and resulted in significantly better individual alignments than using the regular Levenshtein algorithm.

In brief, the approach consists of obtaining initial string alignments by using the Levenshtein algorithm with a syllabicity constraint: vowels may only align with (semi-)vowels, and consonants only with consonants, except for syllabic consonants which may also be aligned with vowels. After the initial run, the substitution cost of every segment pair (a segment can also be a gap, representing insertion and deletion) is calculated according to a pointwise mutual information procedure assessing the statistical dependence between the two segments. That is, if two segments are aligned more often than would be expected on the basis of their frequency in the dataset, the cost of substituting the two symbols is set relatively low; otherwise it is set relatively high. After the new segment substitution costs have been calculated, the strings are aligned again based on the new segment substitution costs. The previous two steps are then iterated until the string alignments remain constant. Our alignments were stable after 12 iterations.
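As a rough sketch of this cost computation (under our own simplifying assumptions about the data layout; the function and variable names are not taken from the paper's implementation), the PMI-based costs could be derived from counts of aligned segment pairs as follows, with frequently co-occurring pairs receiving low costs:

import math
from collections import Counter

def pmi_costs(aligned_pairs):
    """aligned_pairs: list of (segment1, segment2) tuples collected from all
    current alignments, with '-' standing for a gap (insertion/deletion).
    Returns a non-negative substitution cost for every observed pair."""
    pair_counts = Counter(aligned_pairs)
    seg_counts = Counter(s for pair in aligned_pairs for s in pair)
    n_pairs = sum(pair_counts.values())
    n_segs = sum(seg_counts.values())
    costs = {}
    for (x, y), c in pair_counts.items():
        p_xy = c / n_pairs
        p_x, p_y = seg_counts[x] / n_segs, seg_counts[y] / n_segs
        # high PMI = aligned more often than expected by chance = low cost
        costs[(x, y)] = -math.log2(p_xy / (p_x * p_y))
    # shift so that all costs are non-negative before re-aligning with them
    lowest = min(costs.values())
    return {pair: cost - lowest for pair, cost in costs.items()}

Costs are shifted to be non-negative here so that the aligner's minimization remains well defined; the exact rescaling used by Wieling et al. (2009) may differ.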

After obtaining the final string alignments, we use a matrix to store the presence or absence of each segment substitution for every variety (with respect to the reference place). We therefore obtain an m × n matrix A of m varieties (in our case 423; Delft was excluded as it was used as our reference site) by n segment substitutions (in our case 957; not all possible segment substitutions occur). A value of 1 in A (i.e. A_ij = 1) indicates the presence of segment substitution j in variety i, while a value of 0 indicates the absence. We experimented with frequency thresholds, but decided against applying one in this paper as their application seemed to lead to poorer results. We postpone a consideration of frequency-sensitive alternatives to the discussion section.
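The construction of A can be sketched as follows (again our own illustration; the input format is an assumption, not the paper's data structure):

import numpy as np

def build_matrix(substitutions_per_variety):
    """substitutions_per_variety: dict mapping each variety name to the set of
    segment substitutions (pairs) found in its alignments against the reference.
    Returns the binary variety-by-substitution matrix A plus row/column labels."""
    varieties = sorted(substitutions_per_variety)
    substitutions = sorted({s for subs in substitutions_per_variety.values() for s in subs})
    column = {s: j for j, s in enumerate(substitutions)}
    A = np.zeros((len(varieties), len(substitutions)))
    for i, variety in enumerate(varieties):
        for s in substitutions_per_variety[variety]:
            A[i, column[s]] = 1  # substitution s occurs at least once in this variety
    return A, varieties, substitutions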


3.2 Bipartite spectral graph partitioning

An undirected bipartite graph can be represented by G = (R, S, E), where R and S are two sets of vertices and E is the set of edges connecting vertices from R to S. There are no edges between vertices in a single set. In our case R is the set of varieties, while S is the set of sound segment substitutions (i.e. sound correspondences). An edge between r_i and s_j indicates that the sound segment substitution s_j occurs in variety r_i. It is straightforward to see that matrix A is a representation of an undirected bipartite graph.

Spectral graph theory is used to find the principal properties and structure of a graph from its graph spectrum (Chung, 1997). Dhillon (2001) was the first to use spectral graph partitioning on a bipartite graph of documents and words, effectively clustering groups of documents and words simultaneously. Consequently, every document cluster has a direct connection to a word cluster. In similar fashion, we would like to obtain a clustering of varieties and corresponding segment substitutions. We therefore apply the multipartitioning algorithm introduced by Dhillon (2001) to find k clusters:

1. Given the m × n variety-by-segment-substitution matrix A as discussed previously, form

   A_n = D_1^(-1/2) A D_2^(-1/2)

   with D_1 and D_2 diagonal matrices such that D_1(i, i) = Σ_j A_ij and D_2(j, j) = Σ_i A_ij.

2. Calculate the singular value decomposition (SVD) of the normalized matrix A_n,

   SVD(A_n) = U Λ V^T,

   and take the l = ⌈log2 k⌉ singular vectors u_2, ..., u_(l+1) and v_2, ..., v_(l+1).

3. Compute

   Z = [ D_1^(-1/2) U_[2,...,l+1] ]
       [ D_2^(-1/2) V_[2,...,l+1] ]

   i.e. stack the rescaled variety (row) and segment-substitution (column) singular vectors.

4. Run the k-means algorithm on Z to obtain the k-way multipartitioning.

To illustrate this procedure, we will co-cluster the following variety-by-segment-substitution matrix A in k = 2 groups.

                        [2]:[I]  [d]:[w]  [-]:[@]
Vaals (Limburg)            0        1        1
Sittard (Limburg)          0        1        1
Appelscha (Friesland)      1        0        1
Oudega (Friesland)         1        0        1

We first construct matrices D_1 and D_2. D_1 contains the total number of edges from every variety (in the same row) on the diagonal, while D_2 contains the total number of edges from every segment substitution (in the same column) on the diagonal. Both matrices are shown below.

D_1 = [ 2 0 0 0
        0 2 0 0
        0 0 2 0
        0 0 0 2 ]

D_2 = [ 2 0 0
        0 2 0
        0 0 4 ]

We can now calculate A_n using the formula displayed in step 1 of the multipartitioning algorithm:

A_n = [ 0   .5   .35
        0   .5   .35
        .5  0    .35
        .5  0    .35 ]

Applying the SVD to A_n yields:

U = [ -.5   .5   .71
      -.5   .5   .71
      -.5  -.5   0
      -.5  -.5   0  ]

Λ = [ 1   0    0
      0   .71  0
      0   0    0 ]

V = [ -.5   -.71  -.5
      -.5    .71  -.5
      -.71   0     .71 ]

To cluster in two groups, we look at the second singular vectors (i.e. columns) of U and V and compute the 1-dimensional vector Z:

Z = [ .35  .35  -.35  -.35  -.5  .5  0 ]^T

Note that the first four values correspond with the places (Vaals, Sittard, Appelscha and Oudega) and the final three values correspond to the segment substitutions ([2]:[I], [d]:[w] and [-]:[@]).

After running the k-means algorithm on Z, the items are assigned to one of two clusters as follows:

[ 1  1  2  2  2  1  1 ]^T
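The four steps above can be reproduced with standard numerical libraries. The following is a minimal sketch (our own code with our own names, assuming numpy, scipy and scikit-learn; it is not the implementation used for the experiments), applied to the toy matrix A. Up to the sign ambiguity of the SVD and the arbitrary numbering of the k-means labels, it recovers the same grouping.

import numpy as np
from scipy.linalg import svd
from sklearn.cluster import KMeans

def bipartite_cocluster(A, k):
    """Cluster the rows (varieties) and columns (segment substitutions) of the
    binary matrix A into k groups; every row and column is assumed to contain
    at least one 1."""
    d1 = A.sum(axis=1)                       # D_1 diagonal: edges per variety
    d2 = A.sum(axis=0)                       # D_2 diagonal: edges per substitution
    An = A / np.sqrt(d1)[:, None] / np.sqrt(d2)[None, :]      # step 1
    U, _, Vt = svd(An, full_matrices=False)                   # step 2
    l = int(np.ceil(np.log2(k)))
    U_sel, V_sel = U[:, 1:l + 1], Vt.T[:, 1:l + 1]             # 2nd to (l+1)-th vectors
    Z = np.vstack([U_sel / np.sqrt(d1)[:, None],               # step 3
                   V_sel / np.sqrt(d2)[:, None]])
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(Z)    # step 4
    return labels[:A.shape[0]], labels[A.shape[0]:]

# Toy example: rows are Vaals, Sittard, Appelscha, Oudega;
# columns are [2]:[I], [d]:[w] and [-]:[@].
A = np.array([[0., 1., 1.],
              [0., 1., 1.],
              [1., 0., 1.],
              [1., 0., 1.]])
variety_labels, substitution_labels = bipartite_cocluster(A, 2)
print(variety_labels, substitution_labels)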


Figure 2: Visualizations of co-clustering varieties (y-axis) and segment substitutions (x-axis) in 2 (left), 3 (middle) and 4 (right) clusters

The clustering shows that Appelscha and Oudega are clustered together and linked to the clustered segment substitution [2]:[I] (cluster 2). Similarly, Vaals and Sittard are clustered together and linked to the clustered segment substitutions [d]:[w] and [-]:[@] (cluster 1). Note that the segment substitution [-]:[@] (an insertion of [@]) is actually not meaningful for the clustering of the varieties (as can also be observed in A), because the bottom value of the second column of V corresponding to this segment substitution is 0. It could therefore just as likely be grouped in cluster 2. Nevertheless, the k-means algorithm always assigns every item to a single cluster.

In the following section we will report the results on clustering in two, three and four groups.¹

4 Results

After running the multipartitioning algorithm² we obtained a two-way clustering in k clusters of varieties and segment substitutions. Figure 2 tries to visualize the simultaneous clustering in two dimensions. A black dot is drawn if the variety (y-axis) contains the segment substitution (x-axis). The varieties and segments are sorted in such a way that the clusters are clearly visible (and marked) on both axes.

To visualize the clustering of the varieties, we created geographical maps in which we indicate the cluster of each variety by a distinct pattern. The division in 2, 3 and 4 clusters is shown in Figure 3.

¹ We also experimented with clustering in more than four groups, but the k-means clustering algorithm did not give stable results for these groupings. It is possible that the random initialization of the k-means algorithm caused the instability of the groupings, but since we are ignoring the majority of information contained in the alignments, it is more likely that this causes a decrease in the number of clusters we can reliably detect.

² The implementation of the multipartitioning algorithm was obtained from http://adios.tau.ac.il/SpectralCoClustering

In the following subsections we will discuss the most important geographical clusters together with their simultaneously derived sound correspondences. For brevity, we will only focus on explaining a few derived sound correspondences for the most important geographical groups. The main point to note is that besides a sensible geographical clustering, we also obtain linguistically sensible results.

Note that the connection between a cluster of varieties and sound correspondences does not necessarily imply that those sound correspondences only occur in that particular cluster of varieties. This can also be observed in Figure 2, where sound correspondences in a particular cluster of varieties also appear in other clusters (but less densely).³

The Frisian area

The division into two clusters clearly separates the Frisian language area (in the province of Friesland) from the Dutch language area. This is the expected result, as Heeringa (2004: 227–229) also measured Frisian as the most distant of all the language varieties spoken in the Netherlands and Flanders. It is also expected in light of the fact that Frisian even has the legal status of a different language rather than a dialect of Dutch. Note that the separate “islands” in the Frisian language area (see Figure 3) correspond to the Frisian cities, which are generally found to deviate from the rest of the Frisian language area (Heeringa, 2004: 235–241).

³ In this study, we did not focus on identifying the most important sound correspondences in each cluster. See the Discussion section for a possible approach to rank the sound correspondences.


Figure 3: Clustering of varieties in 2 clusters (left), 3 clusters (middle) and 4 clusters (right)

A few interesting sound correspondences between the reference variety (Delft) and the Frisian area are displayed in the following table and discussed below.

Reference  [2]  [2]  [a]  [o]  [u]  [x]  [x]  [r]
Frisian    [I]  [i]  [i]  [E]  [E]  [j]  [z]  [x]

In the table we can see that the Dutch /a/ or /2/ is pronounced [i] or [I] in the Frisian area. This well-known sound correspondence can be found in words such as kamers 'rooms', Frisian [kIm@s] (pronunciation from Anjum), or draden 'threads', Frisian [trIdn] (Bakkeveen). In addition, the Dutch (long) /o/ and /u/ both tend to be realized as [E] in words such as bomen 'trees', Frisian [bjEm@n] (Bakkeveen), or koeien 'cows', Frisian [kEi] (Appelscha).

We also identify clustered correspondences of [x]:[j] where Dutch /x/ has been lenited, e.g. in geld (/xElt/) 'money', Frisian [jIlt] (Grouw), but note that [x]:[g] as in [gElt] (Franeker) also occurs, illustrating that sound correspondences from another cluster (i.e. the rest of the Netherlands) can indeed also occur in the Frisian area. Another sound correspondence co-clustered with the Frisian area is the Dutch /x/ and Frisian [z] in zeggen (/zEx@/) 'say', Frisian [siz@] (Appelscha).

Besides the previous results, we also note some problems. First, the accusative first-person plural pronoun ons 'us' lacks the nasal in Frisian, but the correspondence was not tallied in this case because the nasal consonant is also missing in Delft.

Second, some apparently frequent sound correspondences result from historical accidents, e.g. [r]:[x] corresponds regularly in the Dutch:Frisian pair [dor]:[trux] 'through'. Frisian has lost the final [x], and Dutch has either lost a final [r] or experienced metathesis. These two sorts of examples might be treated more satisfactorily if we were to compare pronunciations not to a standard language, but rather to a reconstruction of a proto-language.

The Limburg area

The division into three clusters separates the southern Limburg area from the rest of the Dutch and Frisian language area. This result is also in line with previous studies investigating Dutch dialectology; Heeringa (2004: 227–229) found the Limburg dialects to deviate most strongly from the other dialects within the Netherlands-Flanders language area once Frisian was removed from consideration.

Some important segment correspondences for Limburg are displayed in the following table and discussed below.

Reference  [r]  [r]  [k]  [n]  [n]  [w]
Limburg    [ʀ]  [ʁ]  [x]  [ʀ]  [ʁ]  [f]

Southern Limburg uses more uvular versions of /r/, i.e. the trill [ʀ], but also the voiced uvular fricative [ʁ]. These occur in words such as over 'over, about', but also in breken 'to break', i.e. both pre- and post-vocalically.


The bipartite clustering likewise detected examples of the famous “second sound shift”, in which Dutch /k/ is lenited to /x/, e.g. in ook 'also' realized as [ox] in Epen and elsewhere. Interestingly, when looking at other words there is less evidence of lenition: in the words maken 'to make', gebruiken 'to use', koken 'to cook', and kraken 'to crack', only two Limburg varieties document a [x] pronunciation of the expected stem-final [k], namely Kerkrade and Vaals. The limited linguistic application does appear to be geographically consistent, but Kerkrade pronounces /k/ as [x] where Vaals lenites further to [s] in words such as ruiken 'to smell', breken 'to break', and steken 'to sting'. Further, there is no evidence of lenition in words such as vloeken 'to curse', spreken 'to speak', and zoeken 'to seek', which are lenited in German (fluchen, sprechen, suchen).

Some regular correspondences merely reflected other, and sometimes more fundamental, differences. For instance, we found correspondences between [n] and [ʀ] or [ʁ] for Limburg, but this turned out to be a reflection of the older plurals in -r. For example, in the word wijf 'woman', plural wijven in Dutch, wijver in Limburg dialect. Dutch /w/ is often realized as [f] in the word tarwe 'wheat', but this is due to the elision of the final schwa, which results in a pronunciation such as [taʀ@f], in which the standard final devoicing rule of Dutch is applicable.

The Low Saxon area

Finally, the division in four clusters also separates the varieties from Groningen and Drenthe from the rest of the Netherlands. This result differs somewhat from the standard scholarship on Dutch dialectology (see Heeringa, 2004), according to which the Low Saxon area should include not only the provinces of Groningen and Drenthe, but also the province of Overijssel and the northern part of the province of Gelderland. It is nonetheless the case that Groningen and Drenthe normally are seen to form a separate northern subgroup within Low Saxon (Heeringa, 2004: 227–229).

A few interesting sound correspondences are displayed in the following table and discussed below.

Reference  [@]  [@]  [@]  [-]  [a]
Low Saxon  [m]  [N]  [ð]  [ʔ]  [e]

The best-known characteristic of this area, the so-called “final n” (slot n), is instantiated strongly in words such as strepen 'stripes', realized as [strepm] in the northern Low Saxon area. It would be pronounced [strep@] in standard Dutch, so the difference shows up as an unexpected correspondence of [@] with [m], [N] and [ð].

The pronunciation of this area is also distinctive in normally pronouncing words with initial glottal stops [ʔ] rather than initial vowels, e.g. af 'finished' is realized as [ʔOf] (Schoonebeek). Furthermore, the long /a/ is often pronounced [e], as in kaas 'cheese', [kes] in Gasselte, Hooghalen and Norg.

5 Discussion

In this study, we have applied a novel method to dialectology in simultaneously determining groups of varieties and their linguistic basis (i.e. sound segment correspondences). We demonstrated that the bipartite spectral graph partitioning method introduced by Dhillon (2001) gave sensible clustering results in the geographical domain as well as for the concomitant linguistic basis.

As mentioned above, we did not have transcriptions of standard Dutch, but instead we used transcriptions of a variety (Delft) close to the standard language. While the pronunciations of most items in Delft were similar to standard Dutch, there were also items which were pronounced differently from the standard. While we do not believe that this will change our results significantly, using standard Dutch transcriptions produced by the transcribers of the GTRP corpus would make the interpretation of sound correspondences more straightforward.

We indicated in Section 4 that some sound correspondences, e.g. [r]:[x], would probably not occur if we used a reconstructed proto-language as a reference instead of the standard language. A possible way to reconstruct such a proto-language is by multiply aligning (see Prokić et al., 2009) all pronunciations of a single word and using the most frequent phonetic symbol at each position in the reconstructed word. It would be interesting to see if using such a reconstructed proto-language would improve the results by removing sound correspondences such as [r]:[x].
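A toy sketch of that reconstruction idea (our own illustration; the multiple alignment itself is assumed to have been computed already, e.g. as in Prokić et al. (2009), with gaps written as '-'):

from collections import Counter

def reconstruct_proto(aligned_pronunciations):
    """Given multiply aligned pronunciations of one word (equal-length strings
    with '-' for gaps), keep the most frequent symbol at every position."""
    proto = []
    for column in zip(*aligned_pronunciations):
        symbol, _ = Counter(column).most_common(1)[0]
        if symbol != "-":
            proto.append(symbol)
    return "".join(proto)

# Hypothetical aligned variants of one word (illustrative only, not GTRP data):
print(reconstruct_proto(["lEIk-@n", "l-ikh8n", "l-ik-@n"]))  # prints 'lik@n'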

In this study we did not investigate methods to identify the most important sound correspondences. A possible option would be to create a ranking procedure based on the uniqueness of the sound correspondences in a cluster, i.e. a sound correspondence is given a high importance when it only occurs in the designated cluster, while its importance goes down when it also occurs in other clusters.
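One way to operationalize such a ranking (a sketch of our own, assuming the matrix A and the cluster labels from Section 3 are available) would be to score every correspondence by the share of its occurrences that fall inside the cluster in question:

import numpy as np

def rank_correspondences(A, variety_labels, cluster):
    """A: binary variety-by-correspondence matrix; variety_labels: cluster id per
    variety (row). Returns column indices sorted from most to least exclusive to
    `cluster`, together with the exclusiveness scores."""
    inside = A[np.asarray(variety_labels) == cluster].sum(axis=0)
    total = A.sum(axis=0)
    importance = np.divide(inside, total,
                           out=np.zeros_like(total, dtype=float), where=total > 0)
    ranking = np.argsort(importance)[::-1]
    return ranking, importance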

While sound segment correspondences function well as a linguistic basis, it might also be fruitful to investigate morphological distinctions present in the GTRP corpus. This would enable us to compare the similarity of the geographic distributions of pronunciation variation on the one hand and morphological variation on the other.

As this study was the first to investigate the effectiveness of a co-clustering approach in dialectometry, we focused on the original bipartite spectral graph partitioning algorithm (Dhillon, 2001). Investigating other approaches, such as biclustering algorithms for biology (Madeira and Oliveira, 2004) or an information-theoretic co-clustering approach (Dhillon et al., 2003), would be highly interesting.

It would likewise be interesting to attempt to incorporate frequency, by weighting correspondences that occur frequently more heavily than those which occur only infrequently. While it stands to reason that more frequently encountered variation would signal dialectal affinity more strongly, it is also the case that inverse frequency weightings have occasionally been applied (Goebl, 1982), and have been shown to function well. We have the sense that the last word on this topic has yet to be spoken, and that empirical work would be valuable.

Our paper has not addressed the interaction between cognitive and social dynamics directly, but we feel it has improved our vantage point for understanding this interaction. In dialect geography, social dynamics are operationalized as geography, and bipartite spectral graph partitioning has proven itself capable of detecting the effects of social contact, i.e. the latent geographic signal in the data. Other dialectometric techniques have done this as well.

Linguists have rightly complained, however, that the linguistic factors have been neglected in dialectometry (Schneider, 1988: 176). The current approach does not offer a theoretical framework to explain cognitive effects such as phonemes corresponding across many words, but does enumerate them clearly. This paper has shown that bipartite graph clustering can detect the linguistic basis of dialectal affinity. If deeper cognitive constraints are reflected in that basis, then we are now in an improved position to detect them.

Acknowledgments

We would like to thank Assaf Gottlieb for sharing the implementation of the bipartite spectral graph partitioning method. We also would like to thank Peter Kleiweg for supplying the L04 package which was used to generate the maps in this paper. Finally, we are grateful to Jelena Prokić and the anonymous reviewers for their helpful comments on an earlier version of this paper.

References

Fan Chung. 1997. Spectral graph theory. American Mathematical Society.

Georges De Schutter, Boudewijn van den Berg, Ton Goeman, and Thera de Jong. 2005. Morfologische Atlas van de Nederlandse Dialecten (MAND) Deel 1. Amsterdam University Press, Meertens Instituut - KNAW, Koninklijke Academie voor Nederlandse Taal- en Letterkunde, Amsterdam.

Inderjit Dhillon, Subramanyam Mallela, and Dharmendra Modha. 2003. Information-theoretic co-clustering. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 89–98. ACM, New York, NY, USA.

Inderjit Dhillon. 2001. Co-clustering documents and words using bipartite spectral graph partitioning. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 269–274. ACM, New York, NY, USA.

Hans Goebl. 1982. Dialektometrie: Prinzipien und Methoden des Einsatzes der Numerischen Taxonomie im Bereich der Dialektgeographie. Österreichische Akademie der Wissenschaften, Wien.

Ton Goeman and Johan Taeldeman. 1996. Fonologie en morfologie van de Nederlandse dialecten. Een nieuwe materiaalverzameling en twee nieuwe atlasprojecten. Taal en Tongval, 48:38–59.

Wilbert Heeringa. 2004. Measuring Dialect Pronunciation Differences using Levenshtein Distance. Ph.D. thesis, Rijksuniversiteit Groningen.

Yuval Kluger, Ronen Basri, Joseph Chang, and Mark Gerstein. 2003. Spectral biclustering of microarray data: Coclustering genes and conditions. Genome Research, 13(4):703–716.

Grzegorz Kondrak. 2002. Determining recurrent sound correspondences by inducing translation models. In Proceedings of the Nineteenth International Conference on Computational Linguistics (COLING 2002), pages 488–494, Taipei. COLING.

Vladimir Levenshtein. 1965. Binary codes capable of correcting deletions, insertions and reversals. Doklady Akademii Nauk SSSR, 163:845–848.

Sara Madeira and Arlindo Oliveira. 2004. Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 1(1):24–45.

John Nerbonne, Wilbert Heeringa, and Peter Kleiweg. 1999. Edit distance and dialect proximity. In David Sankoff and Joseph Kruskal, editors, Time Warps, String Edits and Macromolecules: The Theory and Practice of Sequence Comparison, 2nd ed., pages v–xv. CSLI, Stanford, CA.

John Nerbonne. 2006. Identifying linguistic structure in aggregate comparison. Literary and Linguistic Computing, 21(4):463–476. Special Issue, J. Nerbonne & W. Kretzschmar (eds.), Progress in Dialectometry: Toward Explanation.

John Nerbonne. 2009. Data-driven dialectology. Language and Linguistics Compass, 3(1):175–198.

Jelena Prokić, Martijn Wieling, and John Nerbonne. 2009. Multiple sequence alignments in linguistics. In Lars Borin and Piroska Lendvai, editors, Language Technology and Resources for Cultural Heritage, Social Sciences, Humanities, and Education, pages 18–25.

Jelena Prokić. 2007. Identifying linguistic structure in a quantitative analysis of dialect pronunciation. In Proceedings of the ACL 2007 Student Research Workshop, pages 61–66, Prague, June. Association for Computational Linguistics.

Edgar Schneider. 1988. Qualitative vs. quantitative methods of area delimitation in dialectology: A comparison based on lexical data from Georgia and Alabama. Journal of English Linguistics, 21:175–212.

Jean Séguy. 1973. La dialectométrie dans l'atlas linguistique de Gascogne. Revue de Linguistique Romane, 37(145):1–24.

Robert G. Shackleton, Jr. 2005. English-American speech relationships: A quantitative approach. Journal of English Linguistics, 33(2):99–160.

Johan Taeldeman and Geert Verleyen. 1999. De FAND: een kind van zijn tijd. Taal en Tongval, 51:217–240.

Boudewijn van den Berg. 2003. Phonology & Morphology of Dutch & Frisian Dialects in 1.1 million transcriptions. Goeman-Taeldeman-Van Reenen project 1980–1995, Meertens Instituut Electronic Publications in Linguistics 3. Meertens Instituut (CD-ROM), Amsterdam.

Martijn Wieling, Wilbert Heeringa, and John Nerbonne. 2007. An aggregate analysis of pronunciation in the Goeman-Taeldeman-Van Reenen-Project data. Taal en Tongval, 59:84–116.

Martijn Wieling, Jelena Prokić, and John Nerbonne. 2009. Evaluating the pairwise alignment of pronunciations. In Lars Borin and Piroska Lendvai, editors, Language Technology and Resources for Cultural Heritage, Social Sciences, Humanities, and Education, pages 26–34.