Regression Analysis on Levenshtein-Pointwise Mutual Information Segment Distance Across Languages and Acoustic Distance

Eliza Margaretha, Martijn Wieling, John Nerbonne

University of Groningen

Abstract

We compare phonetic segment distances induced by the Levenshtein algorithm with pointwise mutual information (PMI) weights across three languages: Dutch, German and Bulgarian. Our results show that Dutch Levenshtein-PMI segment distances correlate significantly with those of both Bulgarian and German. While the Dutch-Bulgarian pair yields a rather low correlation, the Dutch-German pair yields a high one. Furthermore, we are interested in bridging the phonetic-linguistic and distributional approaches. We examine how well vowel quality explains Levenshtein-PMI segment distances by reporting the correlation between Levenshtein-PMI distances and acoustic vowel distances derived from formant measurements.

1 Introduction

A phonetic segment distance measures how far one phonetic segment is from another, that is, how dissimilar two segments are. It serves as fundamental information for a wide range of research, for instance speech recognition and spoken language processing. From transcribing spoken discourse (Geutner, et al., 1998) to studies in dialectology (Nerbonne, et al., 1996), segment distances are of great importance for predicting the different pronunciations that arise from speaker differences such as geographic location, gender, and the size and shape of the vocal tract.

In addition to the classic Levenshtein algorithm, Wieling, et al. (2009) proposed a variant that uses PMI-generated weights, which performed best among the variants they compared. In this work, we compare the segment distances induced by this Levenshtein-PMI method across Dutch, German and Bulgarian. We explore whether the segment distances of different languages correlate significantly with each other, and thus whether segment distances in one language can be used to predict those in other languages.

The Levenshtein-PMI method attempts to estimate segment distances automatically. Wieling, et al. (2009) report that the method is very strongly related to the PairHMM method, whose segment distances correlate highly with acoustic distances. In this work, we seek the direct relationship between the Levenshtein-PMI method and vowel quality. More generally, we would like to understand the relationship between distributional and phonetic-linguistic approaches by determining how well Levenshtein-PMI distances can be estimated from acoustic distances.

A previous study by Ellison (1992) made a similar attempt to derive content from the distribution of word information rather than from acoustic data. He proposed a methodology for constructing unsupervised, cipher-independent and language-independent machine learning systems for learning phonology. His induction model learns from the surface information of words taken from a lexicon, and using this information alone it was able to classify segments into consonants and vowels.

This report is structured as follows. The next section briefly explains the Levenshtein-PMI method, formant measurement and the Mantel test. We describe our datasets and methodology in section 3 and report the results in section 4. Finally, sections 5 and 6 present the discussion and a summary of our work.

2 Literature

This section highlights previous work related to our research. First, we describe the Levenshtein-PMI method by which we obtain our segment distances. Second, we present the concept of vowel quality and the formant measurements used to obtain the acoustic distances. Last, we explain the Mantel test, a suitable method for assessing the significance of a correlation between distance matrices.

2.1 Levenshtein using Pointwise Mutual Information-generated segment distances

The Levenshtein algorithm (Levenshtein, 1965) is a well-known and widely applied string distance measure. Its alignments can also be used to compute segment distances, based on how often segment x is aligned with segment y. In the context of string alignment, an insertion is regarded as an alignment of a segment against a gap, a deletion as an alignment of a gap against a segment, and a substitution as an alignment of two segments (Wieling, et al., 2009).

The following is an example of a string alignment between two different pronunciations of milk in Dutch. The first string is /molke/, which is a Frisian dialect form, and the second is /melek/, from another dialect spoken in several regions of the Netherlands such as Limburg. In the example, we find a substitution (S) between the vowels /o/ and /e/, a deletion (D) as the alignment between a gap and /e/, and an insertion (I) as the alignment of /e/ against a gap. For computing segment distances, we are most interested in the substitution operation.

m  o  l  -  k  e
m  e  l  e  k  -
   S     D     I
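As a minimal sketch of how such an alignment can be obtained, the following Python function computes a unit-cost Levenshtein distance and traces back one optimal alignment. It is an illustration only, not the L04 implementation used in this work: the function name `align`, the unit costs and the tie-breaking order are assumptions, and the I/D labels follow the convention of the text above (segment against gap is I, gap against segment is D).

```python
def align(a, b):
    """Return (distance, alignment) for strings a and b.

    The alignment is a list of (segment_a, segment_b, op) triples, where op is
    "M" (match), "S" (substitution), "I" (segment of a against a gap) or
    "D" (gap against a segment of b). All non-match operations cost 1.
    """
    n, m = len(a), len(b)
    # dist[i][j] holds the edit distance between a[:i] and b[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = dist[i - 1][j - 1] + (a[i - 1] != b[j - 1])
            dist[i][j] = min(diagonal, dist[i - 1][j] + 1, dist[i][j - 1] + 1)

    # Trace back from the bottom-right cell to recover one optimal alignment.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            ops.append((a[i - 1], b[j - 1], "M" if a[i - 1] == b[j - 1] else "S"))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append((a[i - 1], "-", "I"))
            i -= 1
        else:
            ops.append(("-", b[j - 1], "D"))
            j -= 1
    ops.reverse()
    return dist[n][m], ops


distance, alignment = align("molke", "melek")
print(distance)
for row in zip(*alignment):
    print(" ".join(row))
```

For the two pronunciations above the unit-cost distance is 3. Several alignments share this cost, so the traceback may return a different but equally cheap alignment than the one shown in the text.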

PMI, proposed by Church & Hanks (1990), was originally used to measure word association norms. It compares the probability of observing two variables x and y together (the joint probability) with the probability of observing them independently (by chance). Applied to the Levenshtein algorithm, PMI adjusts an alignment distance by assigning a weight that reflects how frequently the two segments are aligned. It is therefore able to express whether an alignment is nearer or further than other similar alignments.

$$\mathrm{PMI}(x, y) = \log_2 \frac{p(x, y)}{p(x)\,p(y)}$$

From the viewpoint of information distribution, a segment distance can be seen as how far the distribution of segment x is from the distribution of segment y. Wieling, et al. (2009) defined the components of the PMI of a segment pair x and y as follows for the purpose of generating segment distances:

• $p(x, y)$ is the relative occurrence of the aligned segments x and y in the whole dataset. Specifically, $p(x, y)$ is computed as the number of times x and y occur at the same position in two aligned strings X and Y, divided by the total number of aligned segments.

• $p(x)$ and $p(y)$ are the relative occurrences of x and y, respectively, in the whole dataset, namely the number of occurrences of x or y divided by the total number of segment occurrences.

The PMI value increases with the co-occurrence of the segment pair. If segments x and y are likely to co-occur, $p(x, y)$ will be much larger than $p(x)\,p(y)$, and consequently $\mathrm{PMI}(x, y)$ will be much larger than 0. Conversely, negative PMI values indicate that the segments are unlikely to co-occur. So that frequently corresponding segments receive low distances, the PMI values are transformed into segment distances by subtracting each PMI value from 0 and adding the maximum PMI value.
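The PMI computation and its transformation into distances can be sketched as follows. This is a hedged illustration rather than the L04 code: the function name `pmi_distances` is hypothetical, aligned pairs are treated as unordered, and segment frequencies are counted from the aligned pairs themselves, which is a simplifying assumption.

```python
import math
from collections import Counter


def pmi_distances(aligned_pairs):
    """Turn aligned segment pairs into PMI-based distances.

    aligned_pairs: list of (x, y) segment pairs collected from string alignments.
    Returns {(x, y): distance}, where distance = max(PMI) - PMI(x, y).
    """
    # Treat (x, y) and (y, x) as the same alignment (assumption).
    pair_counts = Counter(tuple(sorted(pair)) for pair in aligned_pairs)
    # Count segment occurrences over all aligned pairs (assumption).
    segment_counts = Counter(seg for pair in aligned_pairs for seg in pair)
    n_pairs = sum(pair_counts.values())
    n_segments = sum(segment_counts.values())

    pmi = {}
    for (x, y), count in pair_counts.items():
        p_xy = count / n_pairs
        p_x = segment_counts[x] / n_segments
        p_y = segment_counts[y] / n_segments
        pmi[(x, y)] = math.log2(p_xy / (p_x * p_y))

    # Subtract each PMI value from 0 and add the maximum PMI value.
    max_pmi = max(pmi.values())
    return {pair: max_pmi - value for pair, value in pmi.items()}


# Hypothetical toy counts, for illustration only.
toy_pairs = [("a", "e")] * 8 + [("a", "o")] + [("e", "o")]
print(pmi_distances(toy_pairs))
```

In this toy input the most strongly associated pair receives distance 0 and all other pairs receive positive distances, which matches the intent of the transformation.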

Segment distances are trained with an iterative procedure. First, string alignments are generated using a Levenshtein algorithm that does not allow vowels to be aligned with consonants. Second, the PMI value for every segment pair is calculated and transformed into a distance. Third, the Levenshtein algorithm is applied with these segment distances to create a new set of alignments. Steps 2 and 3 are repeated until convergence, that is, until the difference between two consecutive iterations is close to 0.

2.2 Vowel Quality and Formant measurement

Vowel quality is the property that makes one vowel sound different from another, for example /iː/ as in sheep from /ɪ/ as in ship (McArthur, 1998). The quality of a vowel is determined by the position of the parts of the vocal tract (the parts of the anatomy which produce vocal sounds) during pronunciation, i.e. the tongue, lips and lower jaw, and by the resulting size and shape of the mouth and pharynx.

The most common way to measure vowel quality from acoustic signals is formant measurement (Leinonen, 2010). Formants indicate where energy is concentrated in the acoustic signal, i.e. at the lowest resonance frequencies (Peterson & Barney, 1952). At a resonance frequency, the signal oscillates at a larger amplitude than at other frequencies. These vocal resonances characterize distinguishable vowel sounds.

The first two formants are the most distinguishing features, and the third formant is useful when pronunciation is strongly affected by the position of the lips (Ladefoged, 2005). Figure 2.1 illustrates how vowels are distinguished via formant measurements. A formant appears as a darker band in a spectrogram. The two vowels shown have similar first formants, but the second formant of one is higher than that of the other. The third formant provides additional information for distinguishing them.

Figure 2.1 Illustration of Formant Measurements (Leinonen, 2010)

An acoustic distance between two vowels can be obtained by calculating the Euclidean distance between their formant values (Wieling, et al., 2007). To generalize the acoustic distance, non-linguistic, speaker-dependent differences in the acoustic signal, such as pitch, need to be normalized. A common way to do so is band-pass filtering with Bark or Mel filters. Because human perception of pitch is not linear, the linear Hertz frequency should be transformed to the non-linear, almost logarithmic, Bark or Mel scale.
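The sketch below illustrates the Euclidean formant distance on the Hertz, Bark and Mel scales. The text does not state which conversion formulas were used, so the two formulas here (the common 2595·log10 Mel formula and Traunmüller's Bark approximation) are assumptions, as are the function names and the formant values, which are hypothetical.

```python
import math


def hz_to_mel(f_hz):
    """Mel value of a frequency in Hz (common 2595 * log10 formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def hz_to_bark(f_hz):
    """Bark value of a frequency in Hz (Traunmueller's approximation)."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53


def acoustic_distance(formants_a, formants_b, scale=None):
    """Euclidean distance between two formant tuples, optionally rescaled."""
    if scale is not None:
        formants_a = [scale(f) for f in formants_a]
        formants_b = [scale(f) for f in formants_b]
    return math.dist(formants_a, formants_b)


# Hypothetical (F1, F2) values in Hz, for illustration only.
vowel_i = (280.0, 2250.0)
vowel_u = (310.0, 870.0)
print(acoustic_distance(vowel_i, vowel_u))                    # raw Hertz
print(acoustic_distance(vowel_i, vowel_u, scale=hz_to_bark))  # Bark scale
print(acoustic_distance(vowel_i, vowel_u, scale=hz_to_mel))   # Mel scale
```

Passing three-element tuples instead of two-element ones gives the distance over the first three formants.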

In addition to the Bark and Mel scales, Lobanov (1971) suggested a z-score transformation intended to normalize per speaker. The z-score transformation thus helps to even out the voice differences between men and women. While the Bark and Mel scales operate on one vowel token at a time, the z-score transformation makes use of information across vowels. Further vowel normalization methods, such as Gerstman's range normalization and Miller's formant-ratio model, are discussed and compared by Adank (2003).
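A minimal sketch of per-speaker z-score normalization follows. The function names, the use of the population standard deviation and the formant values are assumptions made for illustration; the second speaker is simply a scaled copy of the first, to show that a uniform speaker difference disappears after normalization.

```python
from statistics import mean, pstdev


def lobanov(speaker_formants):
    """Z-score normalize one speaker's formants.

    speaker_formants: {vowel: (F1, F2, ...)} in Hz for a single speaker.
    Each formant is standardized over all vowels of that speaker.
    """
    n_formants = len(next(iter(speaker_formants.values())))
    columns = [[v[k] for v in speaker_formants.values()] for k in range(n_formants)]
    means = [mean(col) for col in columns]
    sds = [pstdev(col) for col in columns]  # population SD (assumption)
    return {
        vowel: tuple((values[k] - means[k]) / sds[k] for k in range(n_formants))
        for vowel, values in speaker_formants.items()
    }


def average_over_speakers(normalized_speakers):
    """Average the z-scores of each vowel over a list of normalized speakers."""
    vowels = normalized_speakers[0].keys()
    return {
        vowel: tuple(mean(col) for col in zip(*(spk[vowel] for spk in normalized_speakers)))
        for vowel in vowels
    }


# Hypothetical (F1, F2) values in Hz; speaker_b is speaker_a scaled by 1.2.
speaker_a = {"i": (280.0, 2250.0), "a": (730.0, 1100.0), "u": (310.0, 870.0)}
speaker_b = {v: tuple(1.2 * f for f in fs) for v, fs in speaker_a.items()}
print(average_over_speakers([lobanov(speaker_a), lobanov(speaker_b)]))
```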

2.3 Mantel Test

Normally we compare independent objects when carrying out a regression analysis. In other words, we assume that the observations to be correlated are independent. However, the entries of a distance matrix are typically dependent in some ways (Manly, 1994).

Acoustic distances derived from the first and second formants are dependent, in particular because they obey the triangle inequality. According to this inequality, the direct distance between two objects A and C can never exceed the sum of the distances from A to B and from B to C via a third object B. The direct distance is therefore constrained by, and dependent on, the other distances. This concept is illustrated in Figure 2.2.

On the other hand, Levenshtein-PMI distances can be viewed as independent, as they do not necessarily obey the triangle inequality. Moreover, since Levenshtein-PMI distances are derived from distributional information, they are not guaranteed to be distances or metrics in the mathematical sense.

Comparing Levenshtein-PMI distances to acoustic distances therefore means comparing an independent matrix with a dependent one. In such a case, it is essential to test whether the relationship between the matrices is truly significant. Computing their correlation coefficient and testing its significance in the usual way is not sufficient.

The Mantel test, introduced by Mantel (1967), is a widely used test for this purpose. It was originally proposed for identifying space and time clustering of disease. The Mantel test is based on randomization, i.e. a permutation test. To assess the significance of the correlation between two distance matrices, their correlation is compared with the correlations obtained after permutation, i.e. the correlations between one original matrix and all possible permuted versions of the other matrix, in which the rows and columns are randomly permuted.

The null hypothesis is that there is no relationship between the two matrices. If it holds, the correlation coefficients of the permuted matrices should be equally likely to be larger or smaller than the observed correlation. An observation value can be used to show a positive relationship. Specifically, we compute the observation value by adding 1 for every replicate in which $r(PD_1, D_2) \geq r(D_1, D_2)$, where $D_1$ and $D_2$ are the distance matrices and $PD_1$ is a permuted $D_1$, and then dividing the sum by the number of replicates.

Figure 2.2 Triangle Inequality

Strictly speaking, we would need to perform the comparison for all possible permutations. However, the number of permutations grows enormously as the matrix grows larger. Therefore, a Monte-Carlo test (Metropolis & Ulam, 1949) is a good alternative. It suggests that taking a small random sample of the possible replicates should be sufficient.
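The Mantel test with Monte-Carlo sampling described above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the software used for our results: the function names are hypothetical, the matrices are assumed to be square and symmetric, rows and columns are permuted jointly, and the proportion is computed as described in the text (count divided by the number of replicates).

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def lower_triangle(matrix, order):
    """Entries below the diagonal after re-ordering rows and columns by `order`."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(1, n) for j in range(i)]


def mantel(d1, d2, replicates=999, seed=0):
    """Return (observed correlation, proportion of replicates with r >= observed)."""
    rng = random.Random(seed)
    identity = list(range(len(d1)))
    v2 = lower_triangle(d2, identity)
    observed = pearson(lower_triangle(d1, identity), v2)
    count = 0
    for _ in range(replicates):
        order = identity[:]
        rng.shuffle(order)  # one random permutation applied to rows and columns
        if pearson(lower_triangle(d1, order), v2) >= observed:
            count += 1
    return observed, count / replicates


# Hypothetical toy matrices: distances between 8 points on a line, and a scaled copy.
points = list(range(8))
d1 = [[abs(p - q) for q in points] for p in points]
d2 = [[2.0 * abs(p - q) for q in points] for p in points]
print(mantel(d1, d2))
```

A small proportion means that random re-labelings of one matrix rarely reproduce a correlation as high as the observed one.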

3 Dataset

Our Dutch data come from the digital Dutch dialect transcriptions of the Goeman-Taeldeman-Van Reenen-Project (GTRP) as used by Wieling, et al. (2007). The dataset consists of dialect varieties from 424 different locations in the Netherlands. For each variety, there are transcriptions of 562 different words, which altogether comprise 82 Dutch phonetic segment types.

The Bulgarian data, on the other hand, were collected from various sources, namely students' theses at the University of Sofia, published monographs, dictionaries, and the archive of the Ideographic Dictionary of Bulgarian Dialects (Prokić, et al., 2009). The dataset contains transcriptions of 152 words from 197 locations all over Bulgaria, and there are 67 different segment types including diacritics and suprasegmentals, i.e. vocal effects such as emphasis or prosody.

Additionally, we made use of the German dataset (Nerbonne & Siedle, 2005) which

consists of 78 segment types. The transcriptions of 196 words were collected from 186

locations in Germany for the Kleiner Deutscher Lautatlas project.

For each language, the L04 program¹ developed by Peter Kleiweg was employed to compute the Levenshtein-PMI segment distances. The program produces a matrix in which the rows and columns designate segment types and each cell (x, y) holds the distance between the two corresponding segment types x and y. Each pair of segments that is never aligned is given a very high penalty, which results in a very high distance.

Since the segment labels of the different languages are written in different Extended Speech Assessment Methods Phonetic Alphabet (X-SAMPA) formats, we convert the X-SAMPA labels to their corresponding International Phonetic Alphabet (IPA) symbols. Then, we identify the segment labels shared by each language pair, namely Dutch and Bulgarian, and Dutch and German. For the shared segments in a language pair, we collect all segment alignments which have low distances in both languages. Additionally, we count the vowel alignments and consonant alignments separately.

Table 3.1 Dutch, Bulgarian, and German Segment Figures

Language Pair         Shared Types   Segment Alignments   Vowel Alignments   Consonant Alignments
Dutch and Bulgarian   43             235                  92                 143
Dutch and German      71             870                  261                609

Dutch and Bulgarian share 43 identical phonetic segments in our data. In total, there are 903 segment alignments, but only 235 of them have low distances, consisting of 92 vowel alignments and 143 consonant alignments. The Dutch and German data share 71 identical phonetic segments. There are 870 alignments with low distances, including 261 vowel alignments and 609 consonant alignments. These figures are summarized in Table 3.1.

The normal Q-Q plots of the Dutch and Bulgarian low segment distances in Figure 3.2 (a) and (b), which plot observed values against expected normal values, suggest that the data are normally distributed, with one outlier in Dutch. Similarly, the Q-Q plots of the Dutch and German low segment distances show that the data are normally distributed (see Appendix I.2). The segment distances vary from 0 to 5000.

The box plot of the data in Figure 3.1 shows that the Dutch and Bulgarian medians are close to each other and that most of the data overlap. Thus, we expect the data to be fairly similar, i.e. without a significant difference. The box plot of the Dutch and German data shows a comparable pattern (see Appendix I.2).

¹ http://www.let.rug.nl/~kleiweg/L04/Manuals/leven.html

Our acoustic data were obtained from Pols, et al. (1973), containing the first three formants of 50 Dutch male speakers, and from Van Nierop, et al. (1973), containing those of 25 female speakers. The formants were averaged over all speakers, and the acoustic distances were computed as the Euclidean distances between the formant values. In total, there are 36 vowel alignments in the acoustic data. All of these alignments also appear in our Dutch Levenshtein-PMI data.

Besides the raw Hertz frequencies of the formants, we use the formants transformed to the Bark and Mel scales. Since raw Hertz frequency is linear whereas our perception is not, formants on the nonlinear Bark and Mel scales should fit the nature of our perception better.

Additionally, we apply the z-score transformation to our acoustic data in the following manner. The raw Hertz values are transformed to standardized z-scores for each speaker, so as to normalize the differences over all the vowels per speaker. Then, the z-scores of each vowel are averaged over all speakers.

4 Results and Analysis

In this section, we present regression analyses for different setups. First, we compare Levenshtein-PMI distances across languages. Second, we compare Levenshtein-PMI distances with several variants of acoustic distances.

Figure 3.1 Box Plot of Bulgarian and Dutch Data

Figure 3.2 Q-Q Plots of (a) Bulgarian and (b) Dutch Data

4.1 Comparing Dutch, Bulgarian and German Levenshtein-PMI Distances

We analyze Levenshtein-PMI distances across languages as follows. We compare two variable pairs, Dutch and Bulgarian, and Dutch and German, over all available cases, namely all segment alignments occurring in both languages. The value of each variable for each case is the corresponding segment distance. For example, the value of a given vowel alignment in Dutch is its Dutch segment distance, which we compare with the distance of the same alignment in Bulgarian and in German. We carry out the task by performing a regression analysis on the two sets of Levenshtein-PMI distances and computing their correlation coefficient to measure the effect size. The task is modeled in Figure 4.1 below.

                     Dutch                       Bulgarian/German
Segment alignment    Levenshtein-PMI distance    Levenshtein-PMI distance

Figure 4.1 Regression Analysis Model on Comparing Levenshtein-PMI Distances

The scatter plot in Figure 4.2 (a) visualizes the relationship between the Dutch and Bulgarian data, and the one in (b) that between the Dutch and German data. Each point in the scatter plots represents a case, where x is the segment distance in Dutch and y is the segment distance in Bulgarian or German. We treat Dutch as the independent variable, which to some extent determines the values of the dependent variable (Bulgarian or German).

A straight line (the regression line) is drawn in each scatter plot, suggesting a linear dependency. The line has the formula $\hat{y} = a + bx$, where a is the intercept and b is the slope. Each $\hat{y}$ on the regression line is the predicted segment distance in the other language (Bulgarian or German), estimated from the corresponding Dutch segment distance. The difference between the actual and predicted segment distances is the residual, $e_i = y_i - \hat{y}_i$. Least-squares regression finds the line that minimizes the sum of squared residuals over all segment alignments.
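For illustration, a simple least-squares fit with its residuals and coefficient of determination can be written as below. Our reported coefficients come from a statistics package, so this is only a sketch; the function names and the toy distances are hypothetical.

```python
def least_squares(xs, ys):
    """Return (intercept a, slope b) of the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def residuals(xs, ys, a, b):
    """Residuals e_i = y_i - (a + b * x_i)."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]


def r_squared(xs, ys, a, b):
    """Proportion of the variation in ys accounted for by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    ss_error = sum(e ** 2 for e in residuals(xs, ys, a, b))
    return 1.0 - ss_error / ss_total


# Hypothetical segment distances in two languages, for illustration only.
predictor = [0.0, 1000.0, 2000.0, 3000.0]
response = [1500.0, 2100.0, 1900.0, 2600.0]
a, b = least_squares(predictor, response)
print(a, b, r_squared(predictor, response, a, b))
```

For a simple regression with one predictor, the value returned by `r_squared` equals the square of the Pearson correlation between the two variables.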

The points for the Dutch and Bulgarian data are more scattered than those for the Dutch and German data, which are fairly concentrated near the regression line. Although the points look moderately random, the Dutch and German points show a slight trend: the residuals become larger as the distances become larger.

To examine the residuals more closely, we plot them against the predicted values in Figure 4.3 (a). The residuals support linearity, since the points are moderately random and widely spread. They are also reasonably normally distributed, as shown by the P-P plot in Figure 4.3 (b). The Dutch and German data behave similarly, with more data points (see Appendix I.2). They also show the trend mentioned before, i.e. the residuals become larger as the distances become larger.

Figure 4.2 Scatter Plots of (a) Dutch and Bulgarian, (b) Dutch and German

According to our SPSS results given in Figure 4.4, the regression line for Dutch and Bulgarian is $\hat{y} = 1568.562 + 0.3x$. Using this regression line, we can calculate the predicted Bulgarian segment distance for a given Dutch distance. For example, for a vowel alignment whose distance is 1556 in Dutch, the fitted line gives a predicted alignment distance in Bulgarian of $\hat{y} = 2053.362$. If we allow 5% error, i.e. with a 95% confidence interval, the mean distance of this alignment in Bulgarian should lie within $2053.362 \pm 1083 = (970, 3136)$, where 1083 is the standard error of $\hat{y}$ for the specific value $x = 1556$. In our data, the actual distance is 1675, which indeed lies in the interval.

Figure 4.4 Dutch and Bulgarian Regression Line Coefficients

The t-statistic is 5.454, indicating that the relationship between the Dutch and Bulgarian data is significant at $p < 0.001$. In other words, Dutch segment distances can be considered a significant predictor of Bulgarian segment distances. The correlation coefficient $r = 0.336$ shows that Dutch and Bulgarian have a low positive correlation.

The coefficient of determination, that is, the square of the correlation coefficient ($r^2$), gives the proportion of variability in a dataset that is accounted for by a regression model (Moore & McCabe, 2006). It compares the variation explained by the explanatory variable, i.e. the Dutch segment distances in our case, to the total variation in the whole dataset. It therefore expresses how much of the dependent variable the independent variable explains. It also indicates how well future outcomes can be predicted by the model.

From an ANOVA point of view, as depicted in Figure 4.5, the coefficient of determination is computed as the regression sum of squares divided by the total sum of squares. For Dutch and Bulgarian, the coefficient shows that Dutch segment distances account for approximately 11% of the variation in Bulgarian segment distances. Figure 4.5 also shows that Dutch distances have a significant effect on Bulgarian distances, as the F-statistic is significant ($p < 0.001$).

Figure 4.3 Plots of Dutch and Bulgarian Residuals

Figure 4.5 ANOVA Summary of Dutch and Bulgarian Data

In the case of Dutch and German (see Appendix I.2), the t-statistic of 23.925 also indicates a significant relationship ($p < 0.001$). The regression line is $\hat{y} = 879.010 + 0.550x$. The correlation ($r = 0.630$) is stronger than that between Dutch and Bulgarian. Almost 40% of the variation in German segment distances is accounted for by Dutch segment distances.

Furthermore, we compare vowel alignments and consonant alignments separately. Both vowel and consonant alignments yield significant correlations at $p < 0.001$. Vowel alignments obtain higher correlations than consonant alignments. Dutch and Bulgarian vowel alignments correlate at $r = 0.418$, which means that nearly 18% of the variation in Bulgarian vowel distances can be predicted from Dutch vowel distances. Their consonant alignments, on the other hand, correlate at $r = 0.339$, that is, roughly 11% of the variation in Bulgarian consonant distances is accounted for by Dutch consonant distances.

Table 4.1 Dutch, Bulgarian and German Levenshtein-PMI Distances Correlations (p < 0.001)

Language Pair         Alignment Set   Pearson Correlation (r)   Explanatory Size (r²)
Dutch and Bulgarian   All             0.336                     0.113
Dutch and Bulgarian   Vowel           0.418                     0.178
Dutch and Bulgarian   Consonant       0.339                     0.115
Dutch and German      All             0.630                     0.397
Dutch and German      Vowel           0.620                     0.384
Dutch and German      Consonant       0.587                     0.345

For Dutch and German, the vowel distances correlate at $r = 0.620$, so Dutch vowel distances account for over 38% of the variation in German vowel distances. The consonant distances have a slightly lower correlation of $r = 0.587$, suggesting that approximately 35% of the variation in German consonant distances is accounted for by Dutch consonant distances.

4.2 Comparing Levenshtein-PMI Distances to Acoustic Distances

Our second task is to compare the segment distances produced by the distributional approach with the common assessment based on vowel quality, i.e. the phonetic-linguistic approach. Specifically, we compare Dutch Levenshtein-PMI distances to acoustic distances.

Since we are interested in how well acoustic distances explain Levenshtein-PMI distances, we set the acoustic distance as the explanatory variable and the Levenshtein-PMI distance as the response variable. As in the previous task, the cases are segment alignments and the values of the variables are the distances under the corresponding approaches.

We evaluate each variant of the acoustic distances described in section 3, namely raw Hertz frequency and frequency transformed to the Bark scale, the Mel scale and z-scores. For each variant, we compute the Pearson correlation coefficient (r) and the coefficient of determination showing the explanatory size (r²), for the first two and the first three formants. The results are presented in Table 4.2.

Table 4.2 Dutch Levenshtein-PMI and Acoustic Distances Correlations

Acoustic variation   Number of first formants   Pearson Correlation (r)   Explanatory Size (r²)   Significance (p-value)
Hertz                2                          0.481                     0.231                   0.003
Hertz                3                          0.426                     0.181                   0.010
Z-scores             2                          0.720                     0.518                   0.000
Z-scores             3                          0.640                     0.410                   0.000
Bark Scale           2                          0.616                     0.379                   0.000
Bark Scale           3                          0.517                     0.267                   0.001
Mel Scale            2                          0.603                     0.364                   0.000
Mel Scale            3                          0.507                     0.257                   0.002

The correlations between acoustic distances using raw Hertz and Levenshtein-PMI distances are not remarkable. Raw Hertz with the first two formants yields a correlation of $r = 0.481$, accounting for 23% of the variation in Levenshtein-PMI distances. Taking the third formant into account slightly lowers the correlation coefficient to $r = 0.426$, meaning that the acoustic distance accounts for 18% of the variation in Levenshtein-PMI distances.

Normalizing the raw Hertz values does improve the results. Our acoustic z-score distances yield the best correlations, $r = 0.720$ for the first two formants and $r = 0.640$ for the first three formants. Both results are significant at $p < 0.001$. Using the first two formants, the z-score distances account for nearly 52% of the variation in Levenshtein-PMI distances. Including the third formant does not refine the results but worsens them, lowering the explanatory size by more than 10 percentage points: only 41% of the variation in Levenshtein-PMI distances is accounted for by acoustic z-score distances with the first three formants.

The Bark and Mel scales produce similar results, although Bark is marginally better than Mel. Almost 38% of the variation in Levenshtein-PMI distances is explained by acoustic distances on the Bark scale with the first two formants ($r = 0.616$), and over 36% is explained on the Mel scale, also with the first two formants ($r = 0.603$). The third formant again worsens the results. With the third formant, acoustic distances on the Bark scale predict nearly 27% of the variation in Levenshtein-PMI distances ($r = 0.517$), and those on the Mel scale predict almost 26% ($r = 0.507$). While the results with the first two formants on the Bark and Mel scales are significant at $p < 0.001$, the significance with the first three formants drops to $p < 0.005$.

As mentioned in section 2.3, a p-value alone is not sufficient for validating the significance of a correlation coefficient between distance matrices. Although Levenshtein-PMI distances can be regarded as independent, acoustic distances are not. Since we are comparing an independent object with a dependent one, we need to apply the Mantel test to the significance of their correlation coefficient. Instead of testing all possible permutations, we use Monte-Carlo sampling with 10000 replicates. The outcomes of the Mantel test with Monte-Carlo sampling are given in Table 4.3.

Table 4.3 Mantel Test Results of Dutch Levenshtein-PMI and Acoustic Distances

Acoustic variation   Observation value   Significance (p-value)
Hertz 2              0.168               0.013
Hertz 3              0.132               0.035
Z-score 2            0.410               1e-04
Z-score 3            0.317               3e-04
Bark 2               0.303               2e-04
Bark 3               0.206               0.002
Mel 2                0.286               2e-04
Mel 3                0.195               0.004

The significance levels in the Mantel test are in line with the significance levels of the correlation coefficients in Table 4.2. That table shows that z-scores with the first two and three formants, the Bark scale with two formants and the Mel scale with two formants are significant at $p < 0.001$. Table 4.3 shows that these variants also have very low p-values in the Mantel test, which means that correlations as high as the observed ones rarely arise when the rows and columns are permuted. Thus, the two kinds of distances are clearly related and their correlation is truly significant.

5 Discussion

Our results show that Dutch Levenshtein-PMI distances are able to predict distances in German and Bulgarian. The Dutch prediction of German, which has similar characteristics to Dutch, is much better than the prediction of Bulgarian, which has different characteristics. Dutch and German both belong to the Germanic languages. Since they descend from the same parent language, they share a wide range of similarities, including types of consonants, vowels and accents (Auwera & König, 1994).

Bulgarian, on the other hand, belongs to the Slavonic languages, which are mainly spoken in Eastern Europe. Bulgarian therefore has phonetic properties that differ from those of Dutch. Since the sound systems of Slavonic languages are rich in consonants, speakers of Slavonic languages are less accustomed to pronouncing vowels. They typically find vowels difficult to pronounce, and they pronounce them differently from Dutch speakers.

Another issue to take into account is that a symbol of the International Phonetic Alphabet (IPA) does not necessarily denote exactly the same sound in different languages. The alphabet was originally defined on the basis of English. A sound which is similar but not identical to an English sound may be assigned the same symbol. For instance, the /i/ sound in Bulgarian might be pronounced slightly differently from English /i/, but it is labeled with the same symbol /i/.

Comparing Levenshtein-PMI distances with the various transformations of acoustic distances shows that the z-score transformation yields the best results. The Bark and Mel scales help to normalize the formants to match human perception, which is nonlinear. They improve the estimation of Levenshtein-PMI distances by more than 10 percentage points. However, the z-score transformation suits our acoustic data better, since the data were collected from male and female speakers and the z-score transformation normalizes the differences over all vowels per speaker. Z-scores therefore help to smooth out speaker differences related to gender. They improve the predictions by nearly 30 percentage points.

The third formant appears not to be useful in our experiments and even leads to poorer outcomes. This is not unusual, as it was also observed by Wieling, et al. (2007). We suspect that the pronunciations in our data are not much determined by lip position, which greatly affects the third formant. Instead of helping to distinguish vowels, the third formant seems to blur the differences among them.

6 Summary

We have described two segment distance comparison tasks. First, we compared Levenshtein-PMI segment distances between two pairs of languages, namely Dutch-Bulgarian and Dutch-German. Second, we compared Levenshtein-PMI distances with several variants of acoustic distances derived from formant measurements. For both tasks, we found significant correlations between the variables compared.

Our results reveal that the Levenshtein-PMI distances of Dutch are able to predict those of Bulgarian and German. That is to say, the Levenshtein-PMI distances of one language can predict the distances in other languages. We also find that prediction is better for a language whose characteristics are similar to those of the predictor than for a language whose characteristics differ. In our work in particular, Dutch predicts German better than it predicts Bulgarian. Dutch distances account for up to 40% of the variation in German distances. In the Bulgarian case, Dutch distances account for approximately 11% of the variation.
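As a rough illustration of what "accounting for" a share of the variation means, the sketch below computes the Pearson correlation r between two lists of segment distances and its square. It assumes that the percentages are r² values from a simple linear regression with one predictor. The function name `pearson_r` and the distance values are hypothetical and are not taken from our data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical distances for the same five segment pairs in two languages.
dutch = [0.10, 0.25, 0.40, 0.55, 0.70]
german = [0.15, 0.20, 0.50, 0.45, 0.80]

r = pearson_r(dutch, german)
# With a single predictor, r squared is the proportion of variation in the
# dependent distances that the regression line explains.
print(f"r = {r:.3f}, r^2 = {r * r:.3f}")
```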

Additionally, we show that vowel quality, as represented by acoustic distances, correlates reasonably highly with Levenshtein-PMI distances. This implies that the phonetic-linguistic approach has a significant relationship with the distribution information approach, and that the former can explain the latter to some extent. In our case, we evaluate how well Dutch acoustic distances predict the Dutch Levenshtein-PMI distances. We show that the former can account for up to 52% of the variation in the latter.

Acoustic distances using raw Hertz frequencies of the first two formants account for about 23% of the variation in Levenshtein-PMI distances. Normalizing the raw Hertz frequencies does improve the results. The Bark and Mel scales transform the linear Hertz scale into a nonlinear frequency scale in order to match human perception. Acoustic distances on the Bark and Mel scales produce comparable results, with Bark slightly better than Mel. They account for about 37% of the variation in Levenshtein-PMI distances.
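The scale conversions can be sketched as follows, using two widely used closed-form approximations: Traunmüller's formula for Bark and the common 2595 log10(1 + f/700) formula for Mel. These particular formulas are an assumption for illustration and may differ from the conversions used for the reported results. The helper `vowel_distance` (Euclidean distance over the first two formants) and the formant values are likewise hypothetical.

```python
from math import dist, log10


def hz_to_bark(f):
    """Approximate Hertz-to-Bark conversion (Traunmueller's formula)."""
    return 26.81 * f / (1960.0 + f) - 0.53


def hz_to_mel(f):
    """Approximate Hertz-to-Mel conversion (common 2595/700 formula)."""
    return 2595.0 * log10(1.0 + f / 700.0)


def vowel_distance(a, b, scale=lambda f: f):
    """Euclidean distance between two vowels given as (F1, F2) in Hertz.

    `scale` converts each formant before the distance is taken; the
    default leaves the values in raw Hertz.
    """
    return dist([scale(f) for f in a], [scale(f) for f in b])


# Made-up formant values in Hertz, not measurements from this study.
vowel_i = (300.0, 2200.0)
vowel_a = (700.0, 1300.0)

print("Hertz:", vowel_distance(vowel_i, vowel_a))
print("Bark: ", vowel_distance(vowel_i, vowel_a, hz_to_bark))
print("Mel:  ", vowel_distance(vowel_i, vowel_a, hz_to_mel))
```

Both scales compress high frequencies, so on the Bark and Mel scales a given F2 difference in Hertz counts for less than the same difference in F1.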

The best prediction is attained by the z-score transformation. Since our acoustic data combine male and female speakers and the z-score transformation normalizes the differences of all vowels per speaker, it helps to smooth out the differences between men and women.

Since we compare independent Levenshtein-PMI distances to dependent acoustic distances, we also test the significance of their correlations. We do so by carrying out a Mantel test, which confirms the significance. In particular, for correlations with the normalized acoustic distances (z-score, Bark and Mel scale) using the first two formants, the p-values are very low, indicating that there is a relationship between the two compared distances.
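A Mantel test of this kind can be sketched as a permutation test on two distance matrices over the same set of segments. The code below is a minimal illustration under stated assumptions, not the procedure used for the reported p-values: the function names, the one-sided p-value, the number of permutations, and the two small matrices are all made up for the example.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def upper_triangle(matrix, order):
    """Flatten the cells above the diagonal after reordering the items."""
    n = len(order)
    return [matrix[order[i]][order[j]]
            for i in range(n) for j in range(i + 1, n)]


def mantel_test(a, b, permutations=999, seed=0):
    """Return (observed r, one-sided permutation p-value).

    Rows and columns of `b` are permuted together, which keeps the
    dependence among cells that share an item intact.
    """
    identity = list(range(len(a)))
    flat_a = upper_triangle(a, identity)
    observed = pearson_r(flat_a, upper_triangle(b, identity))
    rng = random.Random(seed)
    at_least = 1  # the observed arrangement counts as one permutation
    for _ in range(permutations):
        order = identity[:]
        rng.shuffle(order)
        if pearson_r(flat_a, upper_triangle(b, order)) >= observed:
            at_least += 1
    return observed, at_least / (permutations + 1)


# Made-up symmetric distance matrices over five items.
a = [
    [0.0, 1.0, 2.0, 3.0, 4.0],
    [1.0, 0.0, 1.5, 2.5, 3.5],
    [2.0, 1.5, 0.0, 1.2, 2.2],
    [3.0, 2.5, 1.2, 0.0, 1.1],
    [4.0, 3.5, 2.2, 1.1, 0.0],
]
b = [
    [0.0, 0.9, 2.3, 2.8, 4.1],
    [0.9, 0.0, 1.4, 2.7, 3.2],
    [2.3, 1.4, 0.0, 1.0, 2.5],
    [2.8, 2.7, 1.0, 0.0, 1.3],
    [4.1, 3.2, 2.5, 1.3, 0.0],
]
r_ab, p_ab = mantel_test(a, b)
print(f"r = {r_ab:.3f}, p = {p_ab:.3f}")
```

Permuting items rather than individual cells is the point of the test: the cells of a distance matrix are not independent observations, so the usual significance test for a correlation coefficient does not apply directly.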

Appendix I Data

I.1 Dutch and Bulgarian Data

I.2 Dutch and German Data

Appendix II Results

II.1 Results of Levenshtein-PMI Dutch and German Segment Distance Comparison

Bibliography

Adank, P. M. (2003). Vowel Normalization: a perceptual acoustic study of Dutch vowels.

Wageningen: Ponsen & Looijen.

Auwera, J. v., & König, E. (1994). The Germanic Languages. London: Routledge.

Church, K. W., & Hanks, P. (1990). Word association norms, mutual information, and

lexicography. Comput. Linguist., 16(1), 22-29.

Ellison, T. M. (1992). The Machine Learning of Phonological Structure. PhD Thesis,

University of Western Australia, Department of Computer Science.

Geutner, P., Finke, M., & Waibel, A. (1998). Phonetic-Distance-Based Hypothesis Driven

Lexical Adaptation For Transcribing Multilingual Broadcast News. In Proceedings of the

International Conference on Spoken Language Processing.

Ladefoged, P. (2005). Vowels and Consonants: An Introduction to the Sounds of Languages

(2nd ed.). Malden, MA: Blackwell.

Leinonen, T. (2010). An Acoustic Analysis of Vowel Pronunciation in Swedish Dialects. PhD

Thesis, Groningen.

Levenshtein, V. (1965). Binary codes capable of correcting deletions, insertions and

reversals. Soviet Physics Doklady, 163(4), 845-848.

Lobanov, B. M. (1971). Classification of Russian Vowels Spoken by Different Speakers. J.

Acoust. Soc. Am., 49, 606-608.

Manly, B. F. (1994). Multivariate Statistical Methods: A Primer (2nd ed.). USA: Chapman

and Hall.

Mantel, N. (1967). The Detection of Disease Clustering and a Generalized Regression

Approach. Cancer Research, 27(2), 209-220.

McArthur, T. (1998). "VOWEL QUALITY" Concise Oxford Companion to the English

Language. Retrieved May 5, 2010, from Oxford Reference Online:

http://www.oxfordreference.com/views/ENTRY.html?subview=Main&entry=t29.e1288

Metropolis, N., & Ulam, S. (1949). The Monte Carlo Method. Journal of the American

Statistical Association, 44(247), 335-341.

Moore, D. S., & McCabe, G. P. (2006). Introduction to the Practice of Statistics (5th ed.).

New York: W. H. Freeman.

Nerbonne, J., & Siedle, C. (2005). Dialektklassifikation auf der Grundlage aggregierter

Ausspracheunterschiede. Zeitschrift für Dialektologie und Linguistik, 72(2), 129–147.

Nerbonne, J., Heeringa, W., van den Hout, E., van der Kooi, P., Otten, S., & van de Vis, W.

(1996). Phonetic Distance between Dutch Dialects. In G. Durieux, W. Daelemans, & S. Gillis

(Eds.), CLIN VI: Proc. of the Sixth CLIN Meeting (pp. 185-202). Antwerp: Centre for

Dutch Language and Speech (UIA).

Peterson, G., & Barney, H. (1952). Control methods used in a study of the vowels.

J. Acoust. Soc. Am., 24(2), 175-184.

Pols, L. C., Tromp, H. R., & Plomp, R. (1973). Frequency analysis of Dutch vowels from 50

male speakers. The Journal of the Acoustical Society of America, 53, 1093–1101.

Prokić, J., Nerbonne, J., Zhobov, V., Osenova, P., Simov, K., Zastrow, T., et al. (2009). The

computational analysis of Bulgarian dialect pronunciation. Serdica Journal of Computing.

Statistical Consulting Group. (n.d.). How can I perform a Mantel test in R? Retrieved May 8,

2010, from UCLA: Academic Technology Services:

http://www.ats.ucla.edu/stat/R/faq/mantel_test.htm

Van Nierop, D. J., Pols, L. C., & Plomp, R. (1973). Frequency analysis of Dutch vowels from

25 female speakers. Acustica, 29, 110–118.

Wieling, M., Heeringa, W., & Nerbonne, J. (2007). An Aggregate Analysis of Pronunciation

in the Goeman-Taeldeman-van Reenen-Project Data. Taal en Tongval, 59(1), 84-116.

Wieling, M., Leinonen, T., & Nerbonne, J. (2007). Inducing sound segment differences using

Pair Hidden Markov Models. SigMorPhon '07: Proceedings of Ninth Meeting of the ACL

Special Interest Group in Computational Morphology and Phonology (pp. 48-56).

Prague, Czech Republic: Association for Computational Linguistics.

Wieling, M., Prokić, J., & Nerbonne, J. (2009). Evaluating the pairwise string alignment of

pronunciations. Proceedings of the EACL 2009 Workshop on Language Technology and

Resources for Cultural Heritage, Social Sciences, Humanities, and Education (pp. 26-34).

Athens, Greece: Association for Computational Linguistics.

Zwicker, E. (1961). Subdivision of the audible frequency range into critical bands. The

Journal of the Acoustical Society of America, 33(2), 248.
