
A variation in illusion: Vowel epenthesis after /h/ in Japanese loanwords

Master’s thesis by

Isabelle LIN

co-supervised by

Dr Emmanuel DUPOUX and Dr Sharon PEPERKAMP

Submitted in June 2014

For partial completion of the academic requirements in

the Research Master in Cognitive Science (EHESS, ENS, Paris V) and Majeure ‘Research Master in Cognitive Science’ (HEC Paris)


Abstract

When we hear sequences of sounds that are not attested in our

native language, we tend to misperceive them and hear instead a

sequence that is more probable in our native system. Looking at

loanword adaptation of such illegal segments, we can gain insight

into the way non-native sequences are perceived. We examined

through two behavioral tasks the perception of /hp/ and /kp/

clusters by native speakers of Japanese. Our aim was to test the

influence of coarticulation in the perception of a more likely

sequence. We found that coarticulation information does influence

the illusory vowel perceived in those clusters, although not all types of

coarticulation have the same degree of influence. In a

second phase, we ran the experimental stimuli through speech

recognition models trained on a corpus of spoken Japanese. These

models are either purely acoustic, or integrate some measure of

phonotactic constraints. Comparing the responses of the

participants to those of the models can give us an inkling of the

processes involved in loan adaptation.

Declaration of Originality

Previous studies in Japanese loanword adaptations have shown that

illusory vowel segments perceived in illegal consonant clusters are

influenced by coarticulation information contained in said clusters.

In this project, we aim to compare the effects of different types of

coarticulation (/ɑ/, /i/, /u/, /e/ or /o/), as well as that of more or less

strongly coarticulated environments (/hp/ or /kp/). In particular, for

the case of /h/, current theories about loanword adaptation would

lead to different predictions. Our results could provide additional

evidence to inform this debate. In addition to behavioral tasks, we

also explore the possible use of speech recognition systems to

model this phenomenon.


Acknowledgements

This work was made possible thanks to the help and insights of many people, to whom I wish to express my most heartfelt thanks. First and foremost, I am extremely grateful to my supervisors Dr Emmanuel Dupoux and Sharon Peperkamp, whose previous work this project has stemmed from. Ideas for this study originated in their collaboration with Dr Yuki Hirose, whose notes on an experiment to examine coarticulation in the velar fricative /χ/ served as a basis to create the test stimuli used in our experiments. With their advice, the project was made more detailed and another approach to our research question was defined through the possibility of using speech recognition models as well as behavioral data to test our predictions. Literature review done during the first half of my internship at the Laboratoire de Sciences Cognitives et Psycholinguistique (LSCP) allowed me to adjust the experimental design for both experiments according to more precise predictions. I would like to thank Dr Akiko Takemura for proofing the recruitment notices and experimental instructions, and Abdellah Fourtassi for letting me use his HMM models of Japanese for the computational part of this project. Members of the LSCP writing group (Dr Christina Bergmann, Dr Alejandrina Cristià, Alexander Martin, Élise Michon), helped in proofreading parts of this thesis. Making the final stimuli from the recordings, designing the counterbalancing, coding the experiments in Python language, recruiting, testing, and data collecting were done by me. All mistakes are my own. I would also like to thank all the members of LSCP in

general, who provided such an amazing environment in which to

learn how to do research. Being still new to the field, I am indebted

to many of them for their kind and professional advice.

Lastly, I thank Olivier Wang for accompanying me on my

recruiting outings during weekends, for his insights on

programming, but most of all for his constant personal support.


Table of Contents

Abstract

Acknowledgements

1. Introduction

1.1 Objectives

1.2 Predictions

2. Study Methodology

2.1 Participants

2.2 Experiment 1. Vowel-labelling

2.2.1 Choice of method

2.2.2 Stimuli

2.2.3 Procedure

2.2.4 Results

2.2.5 Discussion

2.3 Experiment 2. ABX discrimination

2.3.1 Choice of method

2.3.2 Stimuli

2.3.3 Procedure

2.3.4 Results

2.3.5 Discussion

3. Speech Recognition Models

3.1 Choice of method

3.2 Stimuli

3.3 Procedure

3.4 Results

3.5 Discussion

4. Discussion and concerns

5. Appendices

6. References

7. Other resources


1. Introduction

Just as social communities can integrate members from various origins, languages can adopt words from other languages. This process is known as loanword adaptation. A borrowing language (henceforth Lb) takes a word from a source language (henceforth Ls) and subjects it to a range of phonological, semantic or even syntactic modifications, until it enters the Lb lexicon as a loanword. This often happens with words expressing notions related to another culture (e.g. ‘kimono’, a loan from Japanese in English), or technical terms (e.g. /ki:bo:do/ ‘keyboard’, a loan from English in Japanese).

Historically, borrowing is most active during times of intensive contact between communities. In some languages, such as Japanese, loans from different sources build up categories in the lexicon (lexical strata), traceable to periods of academic, diplomatic and economic relations with other cultures. The resulting loans are more or less well blended in the lexicon, both in terms of frequency of use and similarity to words of the native stratum. For Ls speakers, the adaptation is often such that a loanword in Lb is not easily recognizable as coming from their own language (e.g. from English in Japanese, /ʃimpuru/ ‘simple’¹). This is mainly because of the phonological adaptations that have taken place during borrowing. The sound inventory and phonotactic rules differ from language to language, and loanword adaptation must take these differences into account, using various strategies to make the loan abide by Lb phonology. For instance, in the previous example, the sound /sɪ/ does not exist in Japanese phonology and is adapted to the native sound /ʃi/ instead. Similarly, consonant clusters such as /pl/ are also illegal, so they are adapted by inserting vowels into the cluster. This second strategy is called vowel epenthesis. Looking into these kinds of strategies can provide insights into the way we process sounds from other languages in a context different from actual L2 acquisition.

¹ The Japanese vowel is actually an unrounded /ɯ/, but we will use /u/ for simplicity throughout this paper. The stimuli used during the experiments contained /u/, since our aim was to test for adaptation from foreign categories. Previous studies have also shown that /u/ is easily mapped to /ɯ/ during adaptation (Dupoux, Kakehi, Hirose, Pallier & Mehler, 1999).

The literature on loanword adaptation offers multiple theories to account for the origin of the pronunciation changes that words undergo when they are borrowed into another language. Phonological theories suggest a two-step model, in which the sounds of the source language are perceived faithfully, and then adapted according to the borrowing language’s phonological rules during production (Lovins 1975, Paradis & LaCharité 1997, Uffmann 2006). However, experimental data from Lb speakers seem to suggest that the influence of native phonotactics extends to the perception of source language sounds. This is the basis for the perception-based theory (Peperkamp and Dupoux 2003, Vendelin and Peperkamp 2006, Dupoux, Parlato, Frota, Hirose, & Peperkamp, 2011). Instead of hearing consonant clusters that are illegal in their native system, Japanese speakers would perceive illusory epenthetic vowels between the consonants (Dupoux et al., 1999). Moreover, (1) the nature of the perceptual epenthetic vowel appears to depend on the characteristics of the vowels from the speaker’s native language, and (2) the choice of the epenthetic vowel depends on acoustic cues contained in the consonant cluster (Dupoux et al., 2011). These cues come from coarticulation: the consonant sounds produced are influenced by the preceding or following vowel sounds. According to this theory, loanword adaptation would be a single-step process, in which native phonotactic constraints already shape perception and thus phonetic categorisation.

1.1 Objectives

The case of Japanese is particularly interesting when examining loanword adaptation and vowel epenthesis. First, Japanese has phonotactic constraints against consonant clusters, except some starting with the nasal /n/ (Itô & Mester, 1995, Labrune, 2006). This is reflected in the syllabary writing system (kana), where only /n/ has a separate grapheme, all other symbols representing a CV combination (e.g. there are graphemes for /hɑ/, /hi/, /hu/, /he/ and /ho/, but none for /h/). Second, the proportion of loanwords in basic vocabulary is substantial: 6.3% from non-Chinese foreign Ls and 2.3% hybrid (Haspelmath & Tadmor, 2009). We consider only non-Chinese foreign Ls, because Chinese source words do not present consonant clusters either. A hybrid loan is a mix of a native word and a loan (e.g. /karaoke/, from 空 /kara/ ‘empty’ + /oke/ ‘orche(stra)’), or a mix of loans that has no corresponding source word (e.g. /aisukiandi:/ ‘ice’ + ‘candy’: popsicle).

In this study we aim to test whether the choice of epenthetic vowel in Japanese for a consonant cluster starting with the fricative /h/ can be accounted for by coarticulation cues contained in said fricative. Looking at existing loanwords, it seems that the epenthetic vowel in such contexts is identical to the vowel preceding the cluster. The same phenomenon is observed when the source word ends with /h/. This could be explained by a phonological rule of vowel copying through /h/. However, the acoustic characteristics of /h/ may provide an alternative explanation. Sometimes described as the devoiced version of an adjacent vowel (Ladefoged and Maddieson, 1996), /h/ contains more coarticulation than other consonants, including /x/ and /ç/, which are both adapted to /h/ in the case of German loanwords in Japanese (Lovins, 1975, Irwin,). We can also observe adaptations of /x/ and /ç/ to Japanese /h/ in loans from other languages. Although a great number of loans come from source words containing these sounds, the exemplars where the source word ended with one of them or included it in a consonant cluster are fewer, and mostly fall in the category of proper names or specialized concepts:

(1) /bahha/ ‘Bach’ (German)
(2) /hohhohausu/ ‘Hochhaus (University)’ (German)
(3) /ihhiroman/ ‘Ich-Roman’ (German)
(4) /ahoro:toru/ ‘axolotl’ (Nahuatl)
(5) /uhuta/ ‘Ukhta’ (Russian)

We can also look at another source for adaptations from /hC/ clusters or words ending in /h/. Some vocabulary lists for Japanese learners of foreign languages include a pronunciation approximation in Japanese syllabary. These transcriptions can give more data on adaptations, even if the transcription does not correspond to an actual loanword. For instance, Shinohara (1997) worked on data from oral and written adaptation of French words by native Japanese speakers. Transcriptions of German words in such lists provide more instances of /x/ and /ç/ adapted to /h/ with a copy vowel (Yamamoto, これなら覚えら得るどいつ語単語帳 Korenara oboerareru doitsugo tangochou, NHK press 2009).

By contrast, other consonants do not behave the way /h/ does during adaptation. For instance, /kC/ clusters or words ending in /k/ are usually adapted using the default epenthetic vowel in Japanese, /u/ (Lovins, 1975). /u/ is the shortest vowel in Japanese, and can be devoiced (Han, 1962, Labrune, 2006), making it a phonetically minimal element. If epenthetic repair takes place for a source that contained no vowel segment, the minimal element seems to be a good candidate. Loans from sources with /kC/ clusters or ending in /k/ are much more frequent:

(6) /abekku/ ‘avec’ (French)
(7) /miruku/ ‘milk’ (English)
(8) /kurasu/ ‘class’ (English)
(9) /akuʃon/ ‘action’ (English)
(10) /bekutoru/ ‘Vektor’ (German)

This difference between /h/ and /k/ could be due to the fact that no vowel-copying rule applies to /k/, but it might also be because other consonants contain less coarticulation information than /h/. To test the influence of coarticulation, we examined Japanese speakers’ perception of /hC/ and /kC/ clusters in a forced-choice vowel-labelling task and an ABX discrimination task. We recorded items such as /ɑhpɑ/ and /ihpi/, then digitally exchanged their consonant clusters (cross-splicing) to construct stimuli where the context vowel (e.g. /ɑ/) did not match the coarticulation cues contained in /h/ (e.g. /i/). We will denote such items as V1CcoartpV1, where coart stands for the vowel whose coarticulation cues the cluster carries (e.g. ahipa, with V1 = /ɑ/ and coart = /i/). All stimuli were non-words, and could not become a word after epenthesis. In both tasks, we should observe different outcomes depending on whether adaptation follows a vowel-copying phonological rule or is determined by coarticulation cues.

Figure 1. Cross-splicing stimuli. Items with natural clusters were recorded, then digitally manipulated so that the coarticulation information contained in the consonant cluster (colour) did not match the context vowel (black)

In parallel, we will run the experimental stimuli used for these two tasks through speech recognition systems based on Hidden Markov Models (HMM) created with HTK Speech Recognition Toolkit (CUED). These models were trained on the Corpus of Spontaneous Japanese (Kokken, 2003), and then adapted for speaker variation using other recordings from the speakers who produced the stimuli. This will give us a rough idea of what a purely acoustic model optionally equipped with bigram transition probabilities will predict for the transcription of non-native sounds into Japanese phonemes.

1.2 Predictions

If the choice of epenthetic vowel after /h/ is determined by a vowel-copying process, we should observe no difference in the adaptation of the non-words of the form V1hpV1 in the vowel labelling task. Items such as ahapa, ahepa, ahipa, ahopa and ahupa should all be mapped to /ɑhpɑ/ (faithful perception) or /ɑhɑpɑ/ (adaptation with vowel-copy). However, if the choice hinges on coarticulation cues contained in /h/, we should observe some degree of variation in the epenthetic vowel. If coarticulation predicts the epenthetic vowel, we should observe:

ahapa → /ɑhɑpɑ/
ahipa → /ɑhipɑ/
ahupa → /ɑhupɑ/
ahepa → /ɑhepɑ/
ahopa → /ɑhopɑ/

Furthermore, from the observations by Lovins (1975) and Labrune (2006), and the transcription of German words in Japanese syllabary (Yamamoto, 2012), we predict that ahepa might be mapped to /ɑhipɑ/. Control items will contain the voiceless stop /k/, which should lead to less coarticulation. They should not exhibit such a variation in the epenthetic vowel regardless of the context from


which /k/ originated. The default epenthetic vowel would then be /u/, and:

akapa, akipa, akupa, akepa, akopa → /ɑkupɑ/
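The two competing sets of predictions can be summarised in a short Python sketch. It is only an illustration of the reasoning above, not part of the experimental scripts: the function names and the hypothesis labels `"vowel_copy"` and `"coarticulation"` are hypothetical, vowels are written in plain Roman letters, and the sketch encodes only the epenthesis outcome (it ignores the possibility of faithful perception, and the possible mapping of ahepa to /ɑhipɑ/).

```python
VOWELS = ("a", "i", "u", "e", "o")
DEFAULT_VOWEL = "u"  # default epenthetic vowel assumed for /k/ items


def predicted_epenthetic_vowel(v1, coart, consonant, hypothesis):
    """Predicted epenthetic vowel for an item of the form V1-C(coart)-p-V1.

    v1         -- context vowel
    coart      -- vowel whose coarticulation cues the cluster carries
    consonant  -- "h" or "k"
    hypothesis -- "vowel_copy" or "coarticulation" (hypothetical labels)
    """
    if v1 not in VOWELS or coart not in VOWELS:
        raise ValueError("v1 and coart must be one of a, i, u, e, o")
    if consonant == "k":
        # Control items: default vowel regardless of context or coarticulation.
        return DEFAULT_VOWEL
    if consonant != "h":
        raise ValueError("consonant must be 'h' or 'k'")
    if hypothesis == "vowel_copy":
        return v1
    if hypothesis == "coarticulation":
        return coart
    raise ValueError("unknown hypothesis")


def predicted_adaptation(v1, coart, consonant, hypothesis):
    """Predicted adapted form, e.g. 'ahipa' for ahipa under coarticulation."""
    vowel = predicted_epenthetic_vowel(v1, coart, consonant, hypothesis)
    return f"{v1}{consonant}{vowel}p{v1}"
```

For the item ahipa, `predicted_adaptation("a", "i", "h", "vowel_copy")` gives the vowel-copy form and `predicted_adaptation("a", "i", "h", "coarticulation")` the coarticulation-driven form, which is exactly where the two accounts diverge.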

While no real words could arise from epenthesis on the stimuli, it is interesting to note the following adaptation, whose source contains /akpa/:

(11) /akupatoku/ ‘Akpatok (island name)’ (Inuktitut)

We expect performance in the ABX task to depend on that in the vowel-labelling task. That is, the more often two items are mapped to the same adaptation, the harder it will be to distinguish them from one another. For example, if akapa and akupa are both mapped to /ɑkupɑ/, ABX trials comparing this pair (e.g. A: akapa, B: akupa, X: akapa → X = A) would be more difficult than if they were respectively mapped to /ɑkɑpɑ/ and /ɑkupɑ/. Such trials would likely result in longer reaction times as well as higher error rates. We could also compare the discrimination of coarticulated items from items containing a full medial vowel: if ahapa is mapped to /ɑhɑpɑ/, then an ahapa (splice) vs. ahapa (full medial vowel) comparison will be difficult.
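The link between the two tasks can be made concrete with a small sketch: given a table of how each item is perceived, the pairs expected to be hard to discriminate are those that share an adaptation. The function name and the example mapping below are hypothetical and purely illustrative; they are not results.

```python
from itertools import combinations


def confusable_pairs(adaptations):
    """Return the item pairs that are mapped to the same adaptation.

    adaptations -- dict from item label to its perceived (adapted) form.
    Pairs are returned in alphabetical order of the item labels.
    """
    return [
        (a, b)
        for a, b in combinations(sorted(adaptations), 2)
        if adaptations[a] == adaptations[b]
    ]


# Hypothetical mapping: /k/ items collapse onto the default vowel,
# /h/ items follow their coarticulation cues.
example = {
    "akapa": "akupa",
    "akupa": "akupa",
    "ahapa": "ahapa",
    "ahipa": "ahipa",
}
hard_pairs = confusable_pairs(example)
```

Under this hypothetical mapping, only the akapa/akupa pair would be predicted to yield longer reaction times and more errors in ABX trials.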


2. Study Methodology

To test our hypotheses, we conducted two experiments, a vowel-labelling

study and an ABX discrimination study. Both were

programmed in Python (Python Software Foundation,

2013) and conducted at the Laboratoire de Sciences Cognitives et

Psycholinguistique, 29, rue d’Ulm, Paris, France. The procedures

used are similar to those used in Dupoux et al. (1999, 2011).

2.1 Participants

Participants were 25 native speakers of Japanese (mean age 30, 5

males, 20 females) recruited in Paris. Recruitment was conducted

by notices placed in libraries and shops frequented by the Japanese

residents in Paris, Japanese newspapers (OVNI…) and on Japanese

community websites (MixB bulletin board, 日本人会 Nihonjinkai).

Some participants were recruited directly from French classes for

newly arrived Japanese visitors, as well as the Japanese House in

the International University Students Residence (Cité Universitaire).

We searched for people who arrived in Paris recently and only had

limited exposure to languages allowing the consonant clusters

tested, such as French or English. It is possible to find participants

who meet those criteria because a significant part of the Japanese

community in Paris came to France accompanying a family

member, and have not yet had time to learn French. Among our

participants were also recently arrived exchange students and

people on Working Holiday (1 year stay). Recruitment notices,

instructions during the experiments and debriefing were all done in

Japanese. Participants completed both experiments (about one

hour, including pauses and debriefing), and were compensated 10€.

2.2 Experiment 1. Vowel-labelling

2.2.1 Choice of method

We aimed to test the participants’ online decision regarding the

presence or absence of an epenthetic vowel, as well as that vowel’s

identity. Searches through general Japanese dictionaries (Kindaichi,

Yamada, Shibata, Sakai, Kuramochi & Yamada, 2000) as well as

loanwords dictionaries (Arakawa, 1977, Kamiya, 2002) and corpora

(Haspelmath & Tadmor, 2009) showed that loans adapted from

source words containing /hC/ clusters or ending in /h/ were relatively

rare, and usually limited to very specialized vocabulary. In existing


loans, the epenthetic vowel after /h/ does vary according to the

vowel preceding /h/, but it is impossible to tease apart the roles of

vowel context and coarticulation.

A review of the literature pointed to several possible

paradigms to test perceptual epenthesis. Some studies on

phonotactic repair of illegal sequences have used syllable count

tasks. Pitt (1998) showed that by lengthening the liquids in [tl] or

[sr] sequences, unattested in English, native speakers are more

likely to report that they heard a disyllable. Berent, Lennertz, Jun,

Moreno & Smolensky (2008) tested Korean listeners’ perception of

non-words like [lbif] against [ləbif], showing that epenthetic repair

of the former leads to reports of it being a disyllable, just like the

latter. However, if we want to look at a variation in the choice of

epenthetic vowel, syllable count alone will not suffice. While

transcription tasks are well-suited to the study of loanword

adaptation in languages using the Roman alphabet (Hallé, Segui,

Frauenfelder & Meunier, 1998, Davidson, 2007), in the case of

Japanese, the influence of the writing system can affect our results.

As mentioned before, the native syllabary system does not allow for

the transcription of consonant clusters (except the case of /n/):

using this system would severely limit the potential ‘no vowel’

responses if we ask our participants to transcribe consonant

clusters. On the other hand, while the Roman alphabet is also well-

known to most native speakers of Japanese, asking for a full

transcription of a token like ahupa might result in more variability

than we want to study: /hu/ in Japanese is realized as its allophone

[ɸu], and could possibly be transcribed as ‘fu’ (e.g. the loan

/huirumu/ [ɸuirumu] ‘film’, from English). We therefore used a paradigm

similar to that of Dupoux et al. (1999, 2011). Participants were given

a partial transcription of the test tokens, and asked to make a

decision on the central segment of the audio stimuli.

2.2.2 Stimuli

All recorded items were non-words produced by a trained

phonetician, native speaker of Dutch, and were stressed on the first

syllable. Obstruent clusters are allowed in Dutch, and [h] is present

in its sound inventory (Ladefoged & Maddieson, 1996). Incidentally,

Dutch was also one of the main source languages for loans in

Japanese history. Recordings were made in a soundproof booth, at 16

bits mono with a sampling rate of 44.1 kHz. Recorded items were of

the following structure:

/hp/ clusters : V1hpV1 (e.g. ahpa)

/kp/ clusters : V1kpV1 (e.g. akpa)


where V1 is one of the vowels /ɑ, e, i, o, u/

(see Appendix for full list of recorded items)

Test stimuli were then obtained by cross-splicing /hp/, and

/kp/ from natural clusters preceded by the different vowels using

Praat speech analysis software (Boersma & Weenink). Clusters were

cut out at zero-crossings. We decided to splice the entire cluster

rather than /h/ or /k/ alone based on previous results showing that

the second consonant of the cluster could also contain traces of

coarticulation (Dupoux & Nakamura, in preparation). We mostly

used the pitch tracker in Praat to determine the boundaries of the

consonant clusters, double-checking the spectrograms manually in

case of doubt. All test items were cross-spliced, including those

whose coarticulation information matched the V1 context (e.g.

ahapa). This is to avoid the possible effect of a cross-splicing

artefact in the other items. Tokens like ahapa were made by splicing

the /hp/ cluster from one recording of /ɑhpɑ/ to a second one.
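The splicing itself was done in Praat. Purely as an illustration of the principle of cutting at zero-crossings, the operation can be sketched in plain Python on lists of samples. The function names and the representation of a sound as a list of numbers are assumptions of this sketch, not a description of the Praat procedure.

```python
def nearest_zero_crossing(samples, index):
    """Return the sample index closest to `index` at which the waveform
    changes sign (or is exactly zero)."""

    def is_crossing(i):
        if not 0 < i < len(samples):
            return False
        return samples[i] == 0 or (samples[i - 1] < 0) != (samples[i] < 0)

    # Search outwards from the requested position, nearest candidates first.
    for offset in range(len(samples)):
        for i in (index - offset, index + offset):
            if is_crossing(i):
                return i
    raise ValueError("no zero-crossing found")


def cross_splice(base, base_span, donor, donor_span):
    """Replace the cluster of `base` with the cluster of `donor`.

    base_span and donor_span are approximate (start, end) sample indices
    of the consonant cluster; both are snapped to the nearest
    zero-crossing before cutting, to avoid clicks at the joins.
    """
    b_start, b_end = (nearest_zero_crossing(base, i) for i in base_span)
    d_start, d_end = (nearest_zero_crossing(donor, i) for i in donor_span)
    return base[:b_start] + donor[d_start:d_end] + base[b_end:]
```

In this sketch, a matched item such as ahapa would be obtained by calling `cross_splice` with two different recordings of /ɑhpɑ/ as `base` and `donor`.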

The main difficulty in making these stimuli rested in the fact

that the resulting items had to sound as natural as possible. While

splices from other tokens of the same V1 context were usually easy to

make, base recordings to make items with mismatched V1 and

coarticulation had to be carefully selected so that the spliced clusters

blended in seamlessly. For example, some /p/ bursts contained more

energy than others, and splicing them in another context renders this

very salient. To build test items that would be comparable in the analysis,

the spliced cluster for one type of coarticulation (e.g. /hap/) was the

same across all V1 contexts (ehape, ihapi, ohapo, uhapu) except when V1

matches coarticulation (ahapa).

Figure 2. Splicing out /hap/ cluster from a recording of /ɑhpɑ/. The section in pale red was saved as the /hap/ cluster to be spliced in contexts where V1 is /e, i, o, u/. To make ahapa, a /hap/ cluster was spliced from another token of /ɑhpɑ/ and inserted in place of the selection. The cluster was considered to start at the zero-crossing immediately after pitch (in blue) disappeared, and to end at the zero-crossing after the /p/ burst.


Each condition (/h/ or /k/) contained 25 items. The mean

duration of an item was 598.0 ms (SD = 55.5 ms). Each item was

presented three times during the experiment, so that a participant

heard and responded to a list of 25*2*3 = 150 items. Items were

presented in a randomized order, with the additional condition that

items starting with the same V1C could not occur within 3 trials (for

example if trial 1 is akepa, the next two trials cannot be items

starting with /ɑk/). The order of stimuli was thus different for every

subject. Stimuli are detailed in Appendix.
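A constrained random order of this kind can be generated as in the following sketch. It is not the script used in the experiment: the item labels, function names and the restart strategy are assumptions made for illustration. Simple reshuffling until the constraint happens to hold would almost never succeed with 150 trials, so the sketch builds the list trial by trial and restarts if it reaches a dead end.

```python
import random

VOWELS = "aiueo"
CONSONANTS = "hk"
REPETITIONS = 3
MIN_GAP = 3  # two items sharing V1C must not fall within 3 consecutive trials


def build_trials():
    """All V1 x coarticulation x consonant items, each repeated 3 times."""
    items = [
        f"{v1}{c}{coart}p{v1}"
        for c in CONSONANTS
        for v1 in VOWELS
        for coart in VOWELS
    ]
    return items * REPETITIONS


def constrained_shuffle(trials, rng, max_restarts=1000):
    """Random order in which no two items with the same V1C (the first
    two letters of the label) occur within MIN_GAP consecutive trials."""
    for _ in range(max_restarts):
        pool = list(trials)
        order = []
        while pool:
            recent = {item[:2] for item in order[-(MIN_GAP - 1):]}
            eligible = [i for i, item in enumerate(pool)
                        if item[:2] not in recent]
            if not eligible:
                break  # dead end: start again from scratch
            order.append(pool.pop(rng.choice(eligible)))
        else:
            return order
    raise RuntimeError("could not build a valid order")


order = constrained_shuffle(build_trials(), random.Random(0))
```

Passing a differently seeded `random.Random` for each participant would give every subject a different order, as in the experiment.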

2.2.3 Procedure

Participants sat in front of a computer in a soundproof booth. All

instructions were displayed on screen in Japanese, and additional

explanations were given orally when necessary. Participants heard

the test stimuli through headphones. They were told that they

would hear words without meaning, all of the form “V1C?pV1” (e.g.

ah?pa). They were to indicate whether they perceived a vowel in the

consonant cluster, and if so, which vowel was perceived. Questions

appeared on the computer screen, in the Roman alphabet, in the

form “V1C?pV1” (e.g. ah?pa), with the choices ‘無’(no vowel), ‘a’, ‘e’,

‘i’, ‘o’, ‘u’ (forced choice). We used the Japanese character for ‘none’

as it is the choice traditionally given in multiple choice questions on

Japanese administrative forms. Participants were asked to reply as

fast as possible, and their response triggered the next trial with an

ISI of 1 s. Responses were given by pressing labeled keys on a

QWERTY keyboard. A training session of 10 items (items with a full

medial vowel, not used in this task) preceded the actual task. No

feedback was given as the correct answer would always have been

‘no vowel’. For each trial, we collected the data ‘response’ (‘no

vowel’, ‘a’, ‘i’, ‘u’, ‘e’, ‘o’) and ‘reaction time’. We did not use

equipment to collect precise reaction times : these data points will

mainly serve to exclude trials where the reply came too fast

(accidental keypress) or too slowly (no longer reflecting first

impression). The experiment lasted about 10 minutes.


Figure 3. Paradigm for forced choice vowel-labelling task. Participants were asked to reply as fast as possible after hearing the audio stimulus, and their key press triggered the next trial with an ISI of 1 s. Questions were presented on screen.

2.2.4 Results

Since we have 3 within subjects independent variables (Consonant

(2) x V1 (5) x Coarticulation (5)), and our dependent variable is

nominal with 6 possible responses (a, i, u, e, o, no_vowel), we had to

simplify our data to conduct statistical analyses. The proportions of

responses in each combination of coarticulation and V1 are plotted

in Figure 4 (/hp/ clusters) and Figure 5 (/kp/ clusters) below.

Figure 4. Items with /hp/ clusters, all responses: percentages of responses in each category (a, i , u, e, o, no vowel) according to V1 (smaller graduations) and coarticulation in the consonant cluster (categories delimited by white lines).


Qualitatively, the results for /hp/ items show that /e/ and /i/

coarticulation give rise to a strong /i/ epenthesis effect (dark green),

regardless of the V1 context. This concurs with the findings of

Dupoux et al. (2011) that /i/ coarticulation cues can lead to reports

of /i/ epenthesis. Such an effect also seems to arise from /e/ and /i/

V1 contexts even when the coarticulation information points to

another vowel. It seems furthermore that V1 context also has an

influence on the choice of epenthetic vowel: /ɑ/ epenthesis happens

in very small proportions, and almost only in V1 = /ɑ/ contexts, but

when both context and coarticulation point to /ɑ/, 67% of the

responses were also /ɑ/. This appears to be the case of /o/ also,

albeit to a lesser degree.

However, when we look at the same graph for /kp/ clusters,

the picture is quite different (Figure 5 below).

Figure 5. Items with /kp/ clusters, all responses: percentages of responses in each category (a, i , u, e, o, no vowel) according to V1 (smaller graduations) and coarticulation in the consonant cluster (categories delimited by white lines).

From the appearance of this graph, /kp/ clusters mostly

induce /u/ epenthesis, with /e/ and /i/ coarticulation still generating a

small measure of /i/ responses even in other V1 contexts. We can

also observe that ‘no vowel’ responses are fewer. This could be

explained by the fact that the /kp/ clusters were produced with a

released /k/ to keep the durations comparable between /hp/ and

/kp/ items. Therefore, /kp/ clusters were more likely to contain

schwa-like material.

For all our graphs and tests, we analysed the responses of

25 participants. One additional participant was tested, but reported

in the debriefing phase that they were a professional musician.

Previous results suggest that musicians tend to report less

epenthesis, as they are generally more sensitive to the temporal

cues of the acoustic signal (Dupoux et al., in preparation).


Subsequent analysis of this participant’s data indeed showed that

they gave more than 60% ‘no vowel’ responses (against an

average of 11% for the other participants, SE = 0.12). Consequently,

we excluded this participant’s data from all analyses. One other

subject spoke fluent English, but their results did not differ

significantly from the other participants’, so we still included them

in the analysis.

Additionally, we filtered reaction times so only responses

made between 200 ms and 3860 ms were considered in our analysis

(µ = 1305 ms, SD = 1277 ms).
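As a minimal sketch, this filter amounts to keeping the trials whose reaction time lies inside the window. The trial representation (a dict with a hypothetical `rt_ms` key) and the inclusive bounds are assumptions of the sketch. Note that the upper bound is numerically close to the mean plus two standard deviations (1305 + 2 × 1277 = 3859 ms), which may be how it was chosen.

```python
RT_MIN_MS = 200   # faster replies are treated as accidental keypresses
RT_MAX_MS = 3860  # slower replies no longer reflect a first impression


def filter_by_reaction_time(trials, rt_min=RT_MIN_MS, rt_max=RT_MAX_MS):
    """Keep only the trials whose reaction time (in ms) is within bounds.

    trials -- list of dicts with at least an 'rt_ms' key (hypothetical name).
    """
    return [t for t in trials if rt_min <= t["rt_ms"] <= rt_max]


example_trials = [
    {"item": "ahipa", "response": "i", "rt_ms": 150},
    {"item": "akapa", "response": "u", "rt_ms": 200},
    {"item": "ahapa", "response": "a", "rt_ms": 1305},
    {"item": "ohepo", "response": "i", "rt_ms": 3860},
    {"item": "ukopu", "response": "u", "rt_ms": 4000},
]
kept = filter_by_reaction_time(example_trials)
```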

Epenthesis effect

We first examined the amount of epenthesis responses elicited by

each consonant type. Our analyses were done in R (R Core

Team, 2014). We conducted a within subjects one-way ANOVA with

the factor Consonant on the dependent variable ‘no vowel’

response. We found a significant effect of Consonant (F(1,24) = 6.8,

p < .02).

Figure 6. Mean percentages of ‘no vowel’ responses as a function of consonant. Error bars represent standard error of the mean.

This confirms that items containing /kp/ clusters do elicit

more epenthesis responses. However, we need more precision to

qualify the choice of epenthetic vowels. Subsequently, we will limit

our analysis to ‘vowel’ responses (/ɑ/, /i/, /u/, /e/ and /o/ responses).

Since /i/ responses seemed to have a certain importance in both

consonant conditions, we will first focus on these responses, coding

all other responses as ‘other vowel’.


Context and coarticulation effects

To look at context and coarticulation effects, we simplified our data

according to the possible responses. For instance, for /i/ responses,

we considered responses, V1 context and Coarticulation to be either

‘i’ or ‘non i’. We then conducted analyses for /hp/ and /kp/ clusters,

and plotted the percentage of ‘i’ responses according to V1 context

and coarticulation. For each Consonant condition, we conducted a

mixed 2-way ANOVA with the fixed factors V1 context and

Coarticulation, and the random factor Subject, using the nlme

package in R (Pinheiro, Bates, DebRoy, Sarkar & R Core Team,

2014).
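The binary recoding described above can be illustrated with a short Python sketch (the analyses themselves were run in R). The field names `v1`, `coart` and `response` are hypothetical, and the trials are assumed to be already restricted to ‘vowel’ responses.

```python
from collections import defaultdict


def binarise(value, target):
    """Collapse a five-vowel label into 'target' vs 'non target'."""
    return target if value == target else f"non {target}"


def percent_target(trials, target):
    """Percentage of `target` responses in each (V1 context, Coarticulation) cell."""
    cells = defaultdict(lambda: [0, 0])  # cell -> [target responses, all responses]
    for trial in trials:
        cell = (binarise(trial["v1"], target), binarise(trial["coart"], target))
        cells[cell][1] += 1
        if trial["response"] == target:
            cells[cell][0] += 1
    return {cell: 100.0 * hits / total for cell, (hits, total) in cells.items()}


# Hypothetical trials, for illustration only.
trials = [
    {"v1": "i", "coart": "i", "response": "i"},
    {"v1": "i", "coart": "i", "response": "u"},
    {"v1": "a", "coart": "i", "response": "i"},
    {"v1": "o", "coart": "u", "response": "u"},
]
table = percent_target(trials, "i")
print(table)
```

Each of the four cells of the resulting table corresponds to one bar in the Context by Coarticulation plots.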

‘a’ responses

For /hp/ items, we found main effects of V1 context (F(1, 1442) =

41.8 , p<.0001) and Coarticulation (F(1, 1442) = 34.7, p<.0001). We

also found an interaction of V1 context and Coarticulation (F(1, 1442)

= 877.6, p<.0001). Context and Coarticulation both lead to some degree of ‘a’ responses, but ‘a’ Context indeed seems to have a multiplicative effect on the proportion of ‘a’ responses when Coarticulation is also ‘a’. Looking back at the responses given in

/hap/ conditions in Figure 4, ‘a’ coarticulation seems to be weakest,

and the responses were much more varied depending on context.

This could be explained by lexical influences: some participants

reported during debriefing that they did think of the example

/bahha/ ‘Bach’ during the task.

For /kp/ items, ‘a’ responses were few in all contexts (<12%,

chance level = 20%). We found a marginally significant effect of V1 context (F(1, 1442) = 3.7, p<.06) but no effect of Coarticulation (p>.5). There was no

interaction of V1 context and Coarticulation (p>.5). This is consistent

with our prediction that /kp/ clusters would contain less

coarticulation information.

Figure 7. Context and Coarticulation effects for /hp/ and /kp/ items on ‘a’ responses. Context appears to have a multiplicative effect on the proportion of ‘a’ responses for /hp/ items when coarticulation information also points to ‘a’. Error bars are standard errors of the mean.


‘i’ responses

For /hp/ items, we found main effects of V1 context (F(1, 1442) = 87.4,

p<.0001) and Coarticulation (F(1, 1442) = 41.6, p<.0001). We also

found an interaction of V1 context and Coarticulation (F(1, 1442) =

10.5, p<.002). Context and Coarticulation both lead to a high proportion of ‘i’ responses, as seen earlier in Figure 4. Dupoux et al.

(2011) had also noted a very strong coarticulation effect of /i/. In

addition, /hip/ clusters were slightly palatalized in our test items,

which could have contributed to more distinctiveness. It should be

noted that the Japanese /hi/ is also palatalized.

For /kp/ items, we found main effects of V1 context (F(1,

1442) = 8.4, p<.005) and Coarticulation (F(1, 1442) = 21.9, p<.0001).

We also found an interaction of V1 context and Coarticulation (F(1,

1442) = 76.3, p<.0001). /i/ coarticulation seems to be present even in

/kp/ items. We can examine the acoustic distances between

different clusters by looking at the traces of vowel formants

contained in the clusters (Figure 8 below). While /kp/ clusters are

overall less distinctive in terms of F1, they remain distinctive

in terms of F2. /kip/ and /kep/ in particular are especially distinct

from the other /kp/ clusters, which could explain the coarticulation

effect for ‘i’ responses.

Figure 8. Formant traces contained in the coarticulated /hp/ and /kp/ clusters. /kp/ clusters appear to be less distinctive in terms of F1, but still quite distinct in terms of F2.

Figure 9. Context and Coarticulation effects for /hp/ and /kp/ items on ‘i’ responses. /i/ coarticulation appears to be strong enough to induce more ‘i’ responses even in the less coarticulated /kp/ clusters. Error bars are standard errors of the mean.


‘e’ responses

For /hp/ items, we found a main effect of V1 context (F(1, 1442) =

14.6, p<.0001) but not of Coarticulation (p>.5). There was no

interaction of V1 context and Coarticulation (p>.5). Looking at

Figure 4, we can observe that ‘e’ context and coarticulation mainly

induced ‘i’ responses. This could be explained by the fact that they

are acoustically very similar in our test items (Figure 8).

For /kp/ items, we found a main effect of V1 context (F(1,

1442) = 8.2, p<.005) but not of Coarticulation (p>.5). There was no

interaction of V1 context and Coarticulation (p>.4).

It seems that ‘e’ responses are only induced by Context

effects: the acoustic similarity between /i/ and /e/ coarticulation

could explain this phenomenon. /hp/ clusters were indeed slightly

palatalized for both coarticulations. While the Japanese /hi/ is

palatalized, this is not the case for /he/. Furthermore, /e/ is the least

devoiced vowel in Japanese, and /e/ to /i/ adaptation was frequent in

early Chinese loans (Labrune, 2006). Modern kana transcriptions of

German words such as lecht (transcribed to /rehito/) also appear to

exhibit this effect (Yamamoto, 2009). It is also interesting to

observe that licht is transcribed to /rihito/.

Figure 10. Context and Coarticulation effects for /hp/ and /kp/ items on ‘e’ responses. ‘e’ responses appear to be solely triggered by Context effects. The acoustic similarity between /i/ and /e/ clusters in our test items could account for this phenomenon, which is also present in modern transcriptions. Error bars are standard errors of the mean.

‘o’ responses

For /hp/ items, we found main effects of V1 context (F(1, 1442) =

26.5, p<.0001) and Coarticulation (F(1, 1442) = 43.1, p<.0001). We

also found an interaction of V1 context and Coarticulation (F(1, 1442)

= 66.5, p<.0001). Context and Coarticulation both lead to ‘o’

responses, with an increase when both point to ‘o’.

For /kp/ items, very few ‘o’ responses were given in all

situations (1%, chance level = 20%). We found no effect of either V1 context (p>.05) or Coarticulation (p>.1). There was no interaction of V1 context and Coarticulation (p>.5). /kp/ clusters

were mainly mapped to ‘u’ responses.

Figure 11. Context and Coarticulation effects for /hp/ and /kp/ items on ‘o’ responses. ‘o’ responses appear to be influenced by both Context and Coarticulation effects in /hp/ clusters, but are almost completely absent from /kp/ clusters. Error bars are standard errors of the mean.

‘u’ responses

For /hp/ items, we found a main effect of V1 context (F(1, 1442) =

75.4, p<.0001) and of Coarticulation (F(1, 1442) = 19.5, p<.0001).

There was no interaction of V1 context and Coarticulation (p>.1).

The effects seem to have been additive in this case.

For /kp/ items, we found a main effect of Coarticulation (F(1,

1442) = 8.4, p<.005) but not of V1 context (p>.5). There was no

interaction of V1 context and Coarticulation (p>.3). As /u/ is expected

to be the default epenthetic vowel in /kp/ clusters, it seems

reasonable that Context effects would not appear. However, as

seen earlier, /i/ coarticulation seems to be strong enough to have an influence even in /kp/ clusters, thus explaining the effect of

Coarticulation.

Figure 12. Context and Coarticulation effects for /hp/ and /kp/ items on ‘u’ responses. ‘u’ responses appear to be influenced by both Context and Coarticulation effects in /hp/ clusters. In /kp/ clusters, ‘u’ is expected to be the default epenthetic vowel, yielding only to the cases of strong /i/ coarticulation. Error bars are standard errors of the mean.


2.2.5 Discussion

We find that Consonant type has an influence on the amount of

epenthesis responses. /hp/ clusters lead to more ‘no vowel’

responses, possibly because /k/ was released in /kp/ clusters.

As predicted, /kp/ clusters yielded mostly default /u/ epenthesis, with the surprising finding that /i/ coarticulation in /k/ was strong enough to elicit some degree of /i/ epenthesis even in /kp/ clusters. Our results reproduce the strong effect of /i/

epenthesis observed by Dupoux et al. (2011). Instead of containing a

partially excised vowel, our test items contained no medial vowel to

begin with, which gives further evidence of the effect.

As predicted, coarticulation plays an important part in

determining the nature of the epenthetic vowel in /hp/ clusters.

However, the strength of the coarticulation effect was not identical

for all vowels. Both /i/ and /e/ coarticulation led to a slightly palatalized /h/, which is closer to the native Japanese category /hi/ (also palatalized) than to /he/ (not palatalized). /ɑ/ coarticulation was weaker, and /hap/ clusters were thus more prone to Context effects; most ‘a’ responses were induced when V1 was also /ɑ/.

Looking at the formant traces contained in both types of clusters, there does indeed appear to be an observable influence of coarticulation, which makes them acoustically distinguishable from one another.

This vowel-labelling task does, however, have one drawback. To make a decision in the forced-choice question, participants must explicitly segment the acoustic signal into consonants and

vowels. For native speakers of Japanese, this is made all the more

difficult by the fact that they are used to syllables as minimal

elements rather than phonemes. This is why we carried out a

second experiment, an ABX discrimination task, in which

participants are asked to make a judgement on the overall similarity

of entire non-words rather than on the identity of a single segment.


2.3 Experiment 2. ABX discrimination

2.3.1 Choice of method

AX and ABX discrimination paradigms allow for a comparison of

overall similarity between tokens, rather than focusing on a single

acoustic detail in the signal. Davidson & Shaw (2011) made a

comparison between both paradigms in a study on phonotactic

repair of illegal consonant clusters in English. They found that for

word-length stimuli, AX comparisons tended to be performed by

scanning fine acoustic details in both tokens rather than comparing

their overall temporal and spectral properties. However, ABX

paradigms could be reduced to an AX paradigm if the tokens are

physically too similar: rather than doing all comparisons, it becomes

easier to compare only the latter two. This is why we used a cross-talker version of the ABX paradigm, such as the one implemented by Dupoux et al. (1999, 2011). The three items A, B and X are produced by different speakers, thus requiring a comparison of the three utterances at a more abstract level of representation.

2.3.2 Stimuli

Stimuli for this experiment included those of the vowel-labelling

task, as well as similarly constructed tokens from 2 other people

(one female, one male). Those additional speakers were also

trained phoneticians, respectively native speakers of Argentinian

Spanish and American English. We chose to record speakers with

different native languages containing /h/ and /k/ in their sound

inventories so as to create more variability and increase the

dissimilarity between the compared tokens in terms of fine-grained

acoustic properties. This way, any comparison between the three

utterances had to be made on the basis of more abstract

representations resulting from the integration of the phonetic

details of each token rather than a direct comparison of said details.

More types of non-words were used than in Experiment 1. We

compared 4 types of pairs:

NfV: Natural cluster vs Full vowel (ahpa – ahapa) 10 pairs

Nsp: Natural cluster vs Spliced cluster (ahpa – ahipa) 40 pairs

spsp: Spliced cluster vs Spliced cluster (ahipa – ahopa) 100 pairs

spfV: Spliced cluster vs Full vowel (ahipa – ahipa) 50 pairs

We predicted that items which were mapped to the same

representation in the vowel-labelling task should be harder to

differentiate in the ABX task. For instance, if /u/ is the default


epenthetic vowel for items with /kp/ clusters, then those items,

natural or spliced, should all be mapped to /V1kupV1/ (if akpa, akipa

→ /akupa/, akpa-akipa comparison will be hard), and thus be

difficult to differentiate from one another, but also from the item

V1kupV1 (akipa – akupa should also be hard).

Each pair of non-words could be presented in 4 different

trials: ABA, ABB, BAA, BAB. As there were 200 possible pairs, there

were 800 possible trials. This would have been too much for every

single participant to go through. Instead, the design was counterbalanced such that each subject heard 400 trials, 200 with /h/ and 200 with /k/, including the same number of trials for each type of comparison, V1, full vowels and coarticulation. Since we maintained the order of the voices for A, B and X, one group of subjects heard A1 B2 A3 and B1 A2 A3 for a given pair while the other group heard B1 A2 B3 and A1 B2 B3. This way, only one token was heard twice.
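The trial counts above can be checked with a small sketch. The pair counts and the orders heard by each group are taken from the text; the variable names and the group labels 1 and 2 are assumptions made for illustration.

```python
# Number of pairs per comparison type, as listed in the text.
PAIR_COUNTS = {"NfV": 10, "Nsp": 40, "spsp": 100, "spfV": 50}

# The four possible trial orders for a pair (A, B).
ORDERS = ("ABA", "ABB", "BAA", "BAB")

# Orders heard by each group of subjects (group labels are hypothetical).
GROUP_ORDERS = {1: ("ABA", "BAA"), 2: ("BAB", "ABB")}

n_pairs = sum(PAIR_COUNTS.values())
n_possible_trials = n_pairs * len(ORDERS)
trials_per_subject = {
    group: n_pairs * len(orders) for group, orders in GROUP_ORDERS.items()
}

print(n_pairs, n_possible_trials, trials_per_subject)
```

Between them, the two groups cover all four orders of every pair, while each subject hears only half of the possible trials.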

2.3.3 Procedure

Participants were the same as in Experiment 1. They sat in front of a

computer in a soundproof booth. All instructions were displayed on

screen in Japanese, and additional explanations were given orally

when necessary. Participants heard the test stimuli through

headphones. They were told that they would hear triplets of words

without meaning, of the same kind as the ones heard previously.

The last word would be either identical to the first one, or to the

second one. They were to indicate which one they thought it was by

pressing on the left (X=A) and right (X=B) arrow keys. Participants

were asked to reply as fast as possible, and their response triggered

the next trial with an ISI of 1 s. They were told to make a random

guess if they were not sure. A training session of 10 items preceded

the actual task. Training items were taken at random from the test

stimuli of the other group. Feedback was given during training by

means of a green O (correct) or a red X (incorrect) appearing on the

screen following the response. No feedback was given during test

trials and the screen remained black. For each trial, we collected the

data ‘response’ (‘A’ or ‘B’) and ‘reaction time’. We did not use

equipment to collect precise reaction times, but since there are only

2 options, we can compare the reaction times to get a rough idea of how difficult the decision was. Additionally, these data points serve to exclude trials where the reply came too fast (accidental keypress) or too slowly (no longer reflecting first impression). The experiment

lasted about 45 minutes, and included 3 self-paced breaks,

scheduled every 100 trials. A message on the screen told participants that they could take a break, which they confirmed by pressing the space bar. Hitting the space bar again launched the following block.

Figure 13. ABX discrimination task paradigm. A, B, and X were produced by 3 different speakers (2 females, 1 male), and played at 500ms intervals. Keypress triggered next trial with 1 s ISI.

2.3.4 Results

We first looked at the percentage of correct responses across types

of comparison and consonant. We conducted a mixed ANOVA with

Consonant and Comparison Type as within-subjects factors, and

included Subjects and Group as random factors. We found a

significant effect of Type (F(3, 9251) = 33.5, p<.0001) but not of

Consonant (p>.05). There was an interaction between Type and

Consonant (F(3, 9251) = 12.1, p<.0001). We plotted in Figure 14

below the percentages of correct answers in ABX in each type of

comparison and for /h/ and /k/ comparisons.


Figure 14. ABX discrimination task. Percentage of correct answers across types of comparison and consonant. Error bars are standard errors of the mean.

Reaction times seemed to complement the observations on

percentage of correct answers: lower proportion of correct answers

corresponded to higher mean reaction times. A mixed ANOVA on

reaction times with within-subjects variables Pair Type, Consonant

and Correctness with Subjects as a random factor showed a main

effect of Correctness (F(1,9243) = 46.2, p <.0001), an interaction

between Pair Type and Consonant (F(3, 9243) = 2.7, p< 0.05) and an

interaction between Type and Correctness (F(3, 9243) = 8.1, p< 0.0001).

Figure 15. ABX discrimination task. Mean reaction times across types of comparison and consonant. Error bars are standard errors of the mean.

If we define a measure of perceptual distance between two tokens,

we could estimate the difficulty of each comparison. This could be

done by referring to the results of Experiment 1. However, since we did not test natural clusters and full medial vowel items in Experiment 1, for time and logistical reasons, we need to make some assumptions:


- natural clusters : roughly equal to the splices where the

consonant cluster came from another token of the same V1

context (e.g. ahpa≃ahapa)

- items with a full medial vowel : we assume that if our

participants were presented with a token including a full

medial vowel, there would be no ambiguity in identifying

said vowel (e.g. ahopa → ‘o’, ahapa → ‘a’)

These hypotheses allowed us to estimate a ‘perceptual’ distance for

a given pair of stimuli. As an example, let us take akipa : in the first

experiment, akipa elicited a certain percentage of responses of each

category. If we use these percentages as the coefficients of a

vector, we can then compute the Euclidian distance between any two such vectors. Let a and b be the vectors representing A and B in an

ABX trial:

$$d(a, b) = \sqrt{\sum_{i} (a_i - b_i)^2}$$

This measure would approximate the perceptual similarity between

two items: the larger its value, the more distinct the two items are.

We can then plot the percentage of correct answers in the ABX task

against this value for each pair.
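As a minimal sketch of this distance measure, the Python code below builds a response vector from the counts of each response category and computes the Euclidian distance between two such vectors. Three things here are assumptions, not taken from the text: the response counts are invented, the vector is assumed to include the ‘no vowel’ category, and responses are expressed as proportions between 0 and 1. The function names are hypothetical.

```python
import math

# Response categories of the vowel-labelling task; including 'no vowel'
# in the vector is an assumption.
CATEGORIES = ("a", "i", "u", "e", "o", "no vowel")


def response_vector(counts):
    """Turn response counts into a vector of proportions (one per category)."""
    total = sum(counts.get(c, 0) for c in CATEGORIES)
    return [counts.get(c, 0) / total for c in CATEGORIES]


def euclidean_distance(a, b):
    """Euclidian distance between two response vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical counts for a spliced item, for illustration only.
spliced = response_vector({"i": 30, "u": 60, "no vowel": 10})
# A full medial vowel is assumed to be identified without ambiguity.
full_u = response_vector({"u": 1})
full_i = response_vector({"i": 1})

print(euclidean_distance(spliced, full_u))
print(euclidean_distance(full_u, full_i))
```

Under these assumptions the distance is 0 for two items with identical response profiles and reaches its maximum, the square root of 2, for two items that are each mapped unambiguously to a different category.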

To simplify the data, we grouped together the conditions involving

a full medial vowel (natural cluster – full vowel and spliced cluster –

full vowel) and the conditions involving only clusters (spliced cluster – spliced cluster and natural cluster – spliced cluster).

[Figure: percentage of correct answers in the ABX task (y-axis, 50% to 100%) against Euclidian distance A-B (x-axis, 0 to 1.6) for cluster vs full vowel comparisons, with separate linear fits for /h/ and /k/ items (R² values of 0.1842 and 0.552 as reported in the plot).]


While the R² values of the linear regressions were all below .6, qualitatively a greater Euclidian distance between two tokens did seem to result in better performance. Looking at the Euclidian distances among /hp/ tokens and comparing them to the distances between /kp/ cluster tokens (Figure 16b), we also see confirmation that /kp/ tokens were less perceptually distinct from one another.

Due to time and word limit reasons, further analyses of

these data will be conducted later, outside the scope of this written

work.

2.3.5 Discussion

Ideally, we would have liked to test different participants in

experiment 1 and experiment 2. However, due to logistic and time

constraints, this was not possible. Testing different subjects could

have allowed for more fine-grained analyses. For instance we

derived our estimation of perceptual distance from the responses in

the vowel-labelling task, but we only had actual data for the splice1-

splice2 comparisons. We had to make assumptions about the responses to the items we did not present in Experiment 1 for lack of time.

Additionally, having the other two voices in the vowel-

labelling task, for example with different groups of subjects, could

also have yielded better perceptual distance estimations. This

might improve the correlation between the results of both tasks.

We kept only one voice in the vowel-labelling task for simplicity, as subjects started with this experiment to familiarize themselves with the type of stimuli used in both tasks.

Figure 16a. Percentage of correct answers in ABX task against Euclidian distance between the compared tokens. Cluster vs full vowel comparisons.

Figure 16b. Percentage of correct answers in ABX task (y-axis, 0% to 100%) against Euclidian distance between the compared tokens (x-axis, 0 to 1.2). Cluster vs cluster comparisons, with separate linear fits for /h/ and /k/ items (R² values of 0.4689 and 0.0226 as reported in the plot).


3. Speech Recognition Models

3.1 Choice of method

While looking at behavioral data can give us direct insight into

subjects’ perception of our stimuli, it is also interesting to look at

what computational models can predict for the perception of non-

native segments. Speech recognition algorithms take into account

the sound structure of their target language and map an acoustic

input to the most plausible segments in this structure. We will be

looking in particular at Hidden Markov Models created with HTK

Speech Recognition Toolkit (CUED, 2009).

Hidden Markov Models (HMMs) can represent time series

with a succession of states. HTK in particular performs speech

recognition by sequencing the speech signal into multiple evenly

spaced parameter vectors, based on the approximation that for an

instant t the signal is stationary and can thus be fully described by

those parameter coefficients. In HMMs, the output of a state is

accessible while the state itself is hidden. However, each state has a

probability distribution over the various outputs. HTK represents

output distributions by Gaussian Mixture Densities. Considering the

sequence of outputs generated, estimations can therefore be made

on the sequence of states. The way this relates to speech

recognition is that models can be trained to recognize particular

target sequences using multiple exemplars of those targets. The

parameters of a model can be estimated with increasing precision

as more exemplars are given. Then, when presented with an

unknown sequence, the likelihood that each of the possible models

generated it is computed, and the most likely model is chosen as

the transcription.
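The principle of picking the most likely model can be illustrated with a deliberately simplified sketch. It is not how HTK works internally: the toy models below use discrete output symbols instead of Gaussian mixture densities, their parameters are invented, and the likelihood is computed with the textbook forward algorithm. All names are hypothetical.

```python
def forward_likelihood(obs, start, trans, emit):
    """P(obs | model) for a discrete-output HMM, via the forward algorithm.

    start[s]    : probability of starting in state s
    trans[r][s] : probability of moving from state r to state s
    emit[s][o]  : probability that state s outputs symbol o
    """
    n_states = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[r] * trans[r][s] for r in range(n_states)) * emit[s][symbol]
            for s in range(n_states)
        ]
    return sum(alpha)


# Two invented left-to-right models that differ only in their output distributions.
START = [1.0, 0.0]
TRANS = [[0.5, 0.5], [0.0, 1.0]]
MODELS = {
    "model_A": [[0.9, 0.1], [0.2, 0.8]],
    "model_B": [[0.1, 0.9], [0.8, 0.2]],
}

observed = [0, 0, 1, 1]  # an "unknown" sequence of output symbols
scores = {
    name: forward_likelihood(observed, START, TRANS, emit)
    for name, emit in MODELS.items()
}
best = max(scores, key=scores.get)
print(scores, best)
```

The model whose output distributions best match the observed sequence receives the highest likelihood and is returned as the transcription.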

3.2 Stimuli

Since the aim is to see what the model would predict for the

perception of coarticulated consonant clusters, we ran the speech

recognition models on the cross-spliced stimuli used in the two

experiments. The one difference is that we kept short silences at

the beginning and the end of the sound files, to allow for the

model’s tendency to transcribe speech boundaries at the beginning

and end of a file.


3.3 Procedure

Starting from a model of Japanese generated previously by

Emmanuel Dupoux’s PhD student Abdellah Fourtassi, we used the

annotated part of the Corpus of Spontaneous Japanese (about 500k words, 45 hours of speech) to retrain 20 different models. Each model was retrained on a separate part of the 45 hours, balancing

for male and female speakers, using the retraining pipeline made by

Dupoux et al. The retrained models were then readapted for

speaker variation and recording conditions using our recordings

with a full medial vowel. The final models were then made to

transcribe the cross-spliced stimuli used in the experiments. Such

models transcribe the acoustic signal phoneme by phoneme and

can be considered purely acoustic. Later, we added bigram

transition probabilities to the model: bigram models can take into

account some phonotactic constraints, as illegal sequences will

simply be composed of states with a quasi-null transition

probability in-between.
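To make the idea of bigram transition probabilities concrete, here is a minimal sketch that estimates them from phoneme sequences by relative frequency. It is not the HTK procedure used for the models; the toy sequences, the function name and the small probability floor standing in for the quasi-null probability of unseen transitions are all assumptions.

```python
from collections import Counter, defaultdict


def train_bigrams(sequences, floor=1e-6):
    """Return a function prob(prev, nxt) estimated from phoneme sequences.

    Unseen transitions receive the small `floor` value instead of zero,
    so the values are not renormalised (a simplification).
    """
    pair_counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            pair_counts[prev][nxt] += 1

    def prob(prev, nxt):
        total = sum(pair_counts[prev].values())
        if total == 0:
            return floor
        return max(pair_counts[prev][nxt] / total, floor)

    return prob


# Toy phonotactically legal sequences, for illustration only.
prob = train_bigrams([
    ["a", "h", "a", "p", "a"],
    ["a", "k", "u", "p", "a"],
    ["u", "h", "i", "p", "u"],
])
print(prob("h", "a"), prob("h", "p"))
```

Because /h/ is never directly followed by /p/ in the training sequences, that transition keeps only the floor probability, which is how an illegal cluster ends up penalised during decoding.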

The resulting transcriptions were of course imperfect. An

accuracy of 60% with speech recognition systems is usually already

considered good. Transcriptions were generated as MLF files, which can be read in any basic text editor.

For example, below is a transcription of uhepu:

    "/retrain/csplice/uhepu.rec"
    0 200000 <s> -112.915863
    200000 1100000 k -570.235291
    1100000 2300000 u -907.622375
    2300000 3600000 h -1078.534424
    3600000 4600000 e: -822.209900
    4600000 6100000 t -960.711792
    6100000 6800000 u -544.606140
    6800000 8800000 </s> -1401.380615

The first two columns indicate the time points, the 3rd

column gives the phoneme transcription at each time point, and the

last column gives a measure of how accurately the transcription

model fitted the output. The more negative this score is, the better

the fit.

Silences and closures are problematic for speech

recognition, because their models are very similar. This explains the

/k/ at the beginning of the transcription. Stops were often

hallucinated at the beginning and end of the tokens. While

automating (in Python) the coding process of these transcriptions

for data analysis, we adopted some criteria:

- stops at boundaries were ignored

- we looked for segments contained within 2 consonants


- if the first consonant was /h/ or /k/, then the middle

segment was taken to be the choice of epenthetic vowel (or

lack thereof)

- if the middle segment was a consonant or noise, the trial

was coded ‘no vowel’

- if there were multiple vowel segments in the middle

segment, the one with the most negative score was kept.

- long vowels were transcribed as their shorter counterparts
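The criteria above can be sketched as follows. This is an illustrative reimplementation, not the script actually used: the function names, the set of stops and the handling of edge cases (for example, a token with no closing consonant) are assumptions. The choice of the most negative score follows the criterion as stated in the list.

```python
VOWELS = {"a", "i", "u", "e", "o"}
STOPS = {"p", "t", "k", "b", "d", "g"}  # assumed inventory of stops
BOUNDARIES = {"<s>", "</s>"}

EXAMPLE_REC = [
    '"/retrain/csplice/uhepu.rec"',
    "0 200000 <s> -112.915863",
    "200000 1100000 k -570.235291",
    "1100000 2300000 u -907.622375",
    "2300000 3600000 h -1078.534424",
    "3600000 4600000 e: -822.209900",
    "4600000 6100000 t -960.711792",
    "6100000 6800000 u -544.606140",
    "6800000 8800000 </s> -1401.380615",
]


def parse_rec(lines):
    """Parse 'start end label score' lines into (label, score) pairs."""
    segments = []
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skips the quoted file-name line
        _start, _end, label, score = fields
        # Long vowels (e.g. 'e:') are mapped to their short counterparts.
        segments.append((label.rstrip(":"), float(score)))
    return segments


def code_trial(segments):
    """Return the epenthetic vowel, 'no vowel', or None if nothing can be coded."""
    segs = [s for s in segments if s[0] not in BOUNDARIES]
    # Ignore stops hallucinated at the token boundaries.
    if segs and segs[0][0] in STOPS:
        segs = segs[1:]
    if segs and segs[-1][0] in STOPS:
        segs = segs[:-1]
    for i, (label, _score) in enumerate(segs):
        if label not in ("h", "k"):
            continue
        # Collect the vowels between this consonant and the next non-vowel.
        middle = []
        for nxt in segs[i + 1:]:
            if nxt[0] not in VOWELS:
                break
            middle.append(nxt)
        else:
            continue  # no closing consonant after this /h/ or /k/
        if not middle:
            return "no vowel"
        # Criterion stated in the text: keep the most negative score.
        return min(middle, key=lambda seg: seg[1])[0]
    return None


print(code_trial(parse_rec(EXAMPLE_REC)))
```

Applied to the uhepu transcription shown earlier, the initial /k/ is discarded as a boundary stop and the segment between /h/ and /t/ is coded as the epenthetic vowel.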

We thus obtained transcriptions of 150 cross-spliced non-words per model, for a total of 3000 transcriptions (150 × 20 models). With the monophone

models, we would expect there to be only effects of coarticulation.

Context effects should appear mainly with the addition of bigrams.

3.4 Results

Monophone models

Qualitatively, results also showed differences between /hp/ items

and /kp/ items.

Figure 18. Items with /hp/ and /kp/ clusters, all responses: percentages of responses in each category (a, i, u, e, o, no vowel) according to V1 (smaller graduations) and coarticulation in the consonant cluster (categories delimited by white lines).

/hp/ clusters appeared to elicit more varied responses than /kp/ clusters. Additionally, the models seemed biased towards ‘a’ responses in /h/ clusters. Coarticulation with /i/ and /e/ also induced greater proportions of ‘i’ and ‘e’ results, but not as much as was observed in the behavioral data. ANOVAs on ‘no vowel’ responses according to consonant for the 3 speakers showed significant effects of Consonant only for the two female speakers (ac: F(10,10) = 28.1, p<.0001; sh: F(10,10) = 18.9, p<.0001; on the left in Figure 19 below).

Figure 19. Percentage of ‘no vowel’ transcriptions for /hp/ and /kp/ clusters by monophone models

Using the ‘vowel’ response transcriptions, we conducted mixed

ANOVAs for each type of response (‘a’, ‘i’, ‘u’, ‘e’, ‘o’), with

Consonant, Coarticulation and V1 context as within-subjects factors,

and Model as a random effect. This is comparable to treating the

models as subjects in the vowel-labelling task (1 model ≃ 1

participant).

‘a’ responses

We found main effects of Consonant (F(1, 2973) = 0.8, p<.0001) and

Coarticulation (F(1, 2973) = 59.5, p<.0001). We also found an

interaction of Consonant and Coarticulation (F(1, 2973) = 83.5,

p<.0001) and an interaction of V1 Context and Coarticulation (F(1,

2973) = 4.1, p<.05) . There was no main effect of V1 Context (p>.3)

nor interaction of V1 Context and Consonant (p>.05) or three way

interaction between V1 Context, Coarticulation and Consonant

(p>.3). Since the monophone model is a purely acoustic one, strong

effects of coarticulation can indeed be expected. As with human subjects, most ‘a’ responses occurred when both V1 context and Coarticulation pointed to /a/.


Figure 20. Context and Coarticulation effects for /hp/ and /kp/ items on ‘a’ responses by monophone models. Error bars are standard errors of the mean.

‘e’ responses

We found a main effect of Coarticulation (F(1, 2973) = 20.3, p<.0001).

There were no main effects of Consonant (p>.5) or V1 Context

(p>.05). We found an interaction of Consonant and Coarticulation

(F(1, 2973) = 14.7, p<.0001) and a three-way-interaction of V1

Context, Consonant and Coarticulation (F(1, 2973) = 5.7, p<.02). Figure 21 below shows a surprising phenomenon: ‘e’ responses were more frequent in contexts where V1 was another vowel.

Figure 21. Context and Coarticulation effects for /hp/ and /kp/ items on ‘e’ responses by monophone models. Error bars are standard errors of the mean.

‘i’ responses

We found main effects of Consonant (F(1, 2973) = 13.1, p<.0005),

Coarticulation (F(1, 2973) = 32.3, p<.0001) and V1 Context (F(1, 2973)

= 26.0, p<.0001). We found an interaction of V1 Context and

Coarticulation (F(1, 2973) = 40.2, p<.0001) and a three-way-

interaction of V1 Context, Consonant and Coarticulation (F(1, 2973)

= 3.9, p<.05) . Interestingly, the percentage of ‘i’ transcriptions never

exceeded 50% even when Coarticulation and Context concurred. By


contrast, participants in the vowel-labelling task gave over 95% of ‘i’

responses in the same situation. It would seem that the models gave more ‘conservative’ transcriptions than the subjects, as most of the

other transcriptions in this situation were either the default

epenthesis ‘u’, or ‘no vowel’. The model can transcribe consonant

clusters as containing no vowel because /i/ or /u/ devoicing in

Japanese can sometimes produce a surface realization close to a

consonant cluster.

Figure 22. Context and Coarticulation effects for /hp/ and /kp/ items on ‘i’ responses by monophone models. Error bars are standard errors of the mean.

‘o’ responses




Coarticulation (F(1, 2973) = 31.4, p<.0001), Consonant and

Coarticulation (F(1, 2973) = 161.1, p<.0001), V1 Context and

Consonant (F(1, 2973) = 7.0, p<.01) and a three-way-interaction of V1

Context, Consonant and Coarticulation (F(1, 2973) = 40.3, p<.0001) .

‘o’ transcriptions were few in general. Some /kup/ clusters were

mapped to /o/, as /u/ in our stimuli was more rounded than the

native category.


Figure 23. Context and Coarticulation effects for /hp/ and /kp/ items on ‘o’ responses by monophone models. Error bars are standard errors of the mean.

‘u’ responses

We found a main effect of Consonant (F(1, 2973) = 41.2, p<.0001), and no other significant factor (Coarticulation (p>.5), V1 Context (F(1, 2973) = 6.4, p>.5)). The models do seem to use /u/ as the default epenthetic vowel to some degree, leading to a large number of ‘u’ transcriptions in both /h/ and /k/ contexts (37% for /h/ contexts, and 53% for /k/ contexts). For /h/ contexts, this is more than what participants gave; for /k/ contexts, it is less.

Figure 20. Context and Coarticulation effects for /hp/ and /kp/ items on ‘u’ responses by monophone models. Error bars are standard errors of the mean.

Discussion

The monophone models also transcribed epenthesis and showed an

influence of coarticulation. Except for ‘u’ transcriptions, there were

main effects of Coarticulation for all vowel responses. Model

transcriptions appeared to be somewhat more ‘conservative’ than

subject responses. As the default epenthetic vowel, /u/ was chosen

much more frequently to epenthetise /hp/ clusters than in

Experiment 1. Effects of /i/ and /e/ coarticulation were also less

pronounced. Looking at the transcription data in detail, it also appeared that transcriptions varied from speaker to speaker. For better comparability with the vowel-labelling task, we should perhaps restrict our analyses to the transcriptions for the voice used in Experiment 1. However, since this work with model transcriptions is mostly exploratory, and for time reasons, we decided to do the general analysis including all speakers here. Qualitatively, we could observe a stronger resemblance between experimental and model data when looking only at that voice:

Figure 24. Items with /hp/ and /kp/ clusters, all responses: percentages of responses in each category (a, i, u, e, o, no vowel) according to V1 (smaller graduations) and coarticulation in the consonant cluster (separated by white lines). Model transcriptions appear to show more similarity with experimental data if we look only at the responses to the voice used in Experiment 1. Panels: subject responses; model transcriptions.


Bigram models

By adding transitional probability information to the monophone

models, we also transcribed the same stimuli using

bigram models. Bigram models are a way to approximate

phonotactic constraints, as illegal sequences will have very low

transitional probabilities. We should then expect to observe more

‘vowel’ transcriptions. Qualitatively, this is indeed the case (Figure

24 below).
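To illustrate how transition probabilities can penalise illegal clusters, the following minimal Python sketch estimates smoothed bigram probabilities from a small invented word list and scores candidate transcriptions. The toy lexicon, the add-alpha smoothing and all function names are assumptions made for illustration only; they are not the HTK models or the corpus used in this work.

```python
from collections import Counter
import math

def train_bigram(words, alpha=0.1):
    """Return a function prob(prev, nxt) giving add-alpha smoothed P(nxt | prev)."""
    symbols = sorted({s for w in words for s in w} | {"#"})
    pair_counts = Counter()
    prev_counts = Counter()
    for w in words:
        seq = ["#"] + list(w) + ["#"]  # '#' marks word boundaries
        for prev, nxt in zip(seq, seq[1:]):
            pair_counts[(prev, nxt)] += 1
            prev_counts[prev] += 1

    def prob(prev, nxt):
        return (pair_counts[(prev, nxt)] + alpha) / (
            prev_counts[prev] + alpha * len(symbols)
        )

    return prob

def log_score(word, prob):
    """Sum of log transition probabilities over a candidate transcription."""
    seq = ["#"] + list(word) + ["#"]
    return sum(math.log(prob(p, n)) for p, n in zip(seq, seq[1:]))

# Invented CV-only word list (assumption): no consonant clusters are attested.
toy_lexicon = ["kupa", "hako", "kuki", "hapi", "paku", "hiku", "kapu", "hupo"]
prob = train_bigram(toy_lexicon)

# Candidate transcriptions of the same stimulus, with and without a vowel.
candidates = ["ahpa", "ahapa", "ahupa"]
scores = {c: log_score(c, prob) for c in candidates}
for c in candidates:
    print(c, round(scores[c], 2))
```

In this toy setting, the cluster transcription ‘ahpa’ scores lower than either epenthesised candidate, because the h → p transition is never attested in the word list and only receives the smoothing mass.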

Figure 24. Items with /hp/ and /kp/ clusters, all responses: percentages of responses in each category (a, i, u, e, o, no vowel) according to V1 (smaller graduations) and coarticulation in the consonant cluster (separated by white lines). Model transcriptions appear to show more similarity with experimental data if we look only at the responses to the voice used in Experiment 1.

‘No vowel’ transcriptions accounted for only 3.1% of all

transcriptions. ANOVAs on ‘no vowel’ responses according to

consonant for the 3 speakers showed no significant effects for any

speaker (p>.05).
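For readers unfamiliar with the statistic, the F ratio of a one-factor ANOVA such as the one above can be sketched in a few lines of Python. This is only an illustration of the computation; the function name and the 0/1 data are invented for the example, and it is not the analysis script used in this work.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA over lists of values."""
    all_values = [x for g in groups for x in g]
    n_total = len(all_values)
    grand_mean = sum(all_values) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variability of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variability of observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Invented 0/1 codes (1 = 'no vowel' transcription); NOT the thesis data.
h_items = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
k_items = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
f_stat, df1, df2 = one_way_anova_f([h_items, k_items])
print(f"F({df1}, {df2}) = {f_stat:.2f}")
```

A small F ratio, as in this toy case, means that the difference between the group means is small relative to the variability within each group.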

Figure 25. Percentage of ‘no vowel’ transcriptions for /hp/ and /kp/ clusters by bigram models



For context and coarticulation effects, we carried out the same

analyses as for the monophone models.

‘a’ responses

We found main effects of Consonant (F(1, 2973) = 80.2, p<.0001)

and Coarticulation (F(1, 2973) = 59.5, p<.0001). We also found an

interaction of Consonant and Coarticulation (F(1, 2973) = 48.8,

p<.0001) and an interaction of V1 Context and Coarticulation (F(1,

2973) = 4.3, p<.001), V1 Context and Consonant (F(1, 2973) = 4.3,

p<.05) and Coarticulation and Consonant (F(1, 2973) = 29.2,

p<.0001). In the plots below, ‘a’ responses are mainly

triggered by V1 = /a/ contexts. The fact that proportions of ‘a’

responses are lower in each context than for monophone models is

explained by the greater overall number of ‘a’ responses in

the bigram models’ transcriptions, especially in /h/ contexts. This

bias could suggest that /h/ → /ɑ/ transitions (/hɑ/) are more

frequent in the Japanese lexicon than the other combinations.

Figure 26. Context and Coarticulation effects for /hp/ and /kp/ items on ‘a’ responses for bigram models. Error bars are standard errors of the mean.

‘e’ responses

We found main effects of Coarticulation (F(1, 2973) = 25.9, p<.0001)

and V1 Context (F(1, 2973) = 4.9, p<.005). We found an interaction of

Consonant and Coarticulation (F(1, 2973) = 5.9, p<.0) and V1 Context

and Coarticulation (F(1, 2973) = 8.7, p<.005). The plot below suggests

that the models were more sensitive to /e/ coarticulation than

human subjects, possibly because the corpus they were trained on

contained input from speakers from more varied geographical

origins than our subjects (mostly from the Tokyo area).


Figure 27. Context and Coarticulation effects for /hp/ and /kp/ items on ‘e’ responses for bigram models. Error bars are standard errors of the mean.

‘i’ responses




Coarticulation (F(1, 2973) = 54.0, p<.0001). Just like with

monophone transcriptions, we found that proportions of ‘i’

transcriptions were lower than in the vowel-labelling task.

Figure 7. Context and Coarticulation effects for /hp/ and /kp/ items on ‘i’ responses for bigram models. Error bars are standard errors of the mean.

‘o’ responses

We found main effects of Consonant (F(1, 2973) = 32.2, p<.0001) and

Coarticulation (F(1, 2973) = 58.2, p<.0001). We found interactions

of V1 Context and Coarticulation (F(1, 2973) = 8.9, p<.005),

Consonant and Coarticulation (F(1, 2973) = 185.4, p<.0001), V1

Context and Consonant (F(1, 2973) = 8.9, p<.0002) and a three-way

interaction of V1 Context, Consonant and Coarticulation (F(1, 2973)

= 36.3, p<.0001). As with monophone models, ‘o’ transcriptions


were few in general. Some /kup/ clusters were mapped to /o/, as /u/

in our stimuli was more rounded than the native category.

Figure 28. Context and Coarticulation effects for /hp/ and /kp/ items on ‘o’ responses for bigram models. Error bars are standard errors of the mean.

‘u’ responses

We found a main effect of Consonant (F(1, 2973) = 65.9, p<.0001), and no

other significant factor (Coarticulation (p>.5), V1 Context (F(1, 2973)

= 6.4, p>.5)). Bigram models gave more ‘u’ transcriptions in /k/

clusters than monophone models (64% against 53% for monophone

models). This suggests that the k → u transition probability is

higher than that between k and any other vowel.

3.5 Discussion

Overall, bigram models’ transcriptions allowed for more influence

of V1 context. Coarticulation still played a significant part in

predicting the transcription. As with monophone results,

considering transcriptions only for the voice used in the behavioral

experiment could have given more comparable results.

Following on from these analyses, we could also try to model

results for the ABX task, using comparisons between the


transcriptions given for all stimuli used. Again, due to time

limitations, further analyses will be done outside the scope of this

written work.
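One hypothetical way of deriving ABX predictions from transcriptions is sketched below in Python: X is matched to whichever of A or B has the transcription closest to that of X, and identical distances count as a tie (predicted confusion). The edit-distance decision rule, the function names and the example transcriptions are assumptions made for illustration; this analysis was not carried out in the present work.

```python
def edit_distance(a, b):
    """Levenshtein distance between two symbol sequences."""
    prev = list(range(len(b) + 1))
    for i, sa in enumerate(a, start=1):
        curr = [i]
        for j, sb in enumerate(b, start=1):
            cost = 0 if sa == sb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def predict_abx(trans_a, trans_b, trans_x):
    """Return 'A', 'B' or 'tie' depending on which transcription is closer to X."""
    d_a = edit_distance(trans_a, trans_x)
    d_b = edit_distance(trans_b, trans_x)
    if d_a < d_b:
        return "A"
    if d_b < d_a:
        return "B"
    return "tie"

# Hypothetical model transcriptions (assumption, not actual model output).
# If the cluster item is transcribed with an epenthetic vowel, it becomes
# indistinguishable from the item that really contains the vowel.
with_epenthesis = predict_abx("ahapa", "ahapa", "ahapa")
without_epenthesis = predict_abx("ahpa", "ahapa", "ahpa")
print(with_epenthesis, without_epenthesis)
```

Under this rule, a model that epenthesises the same vowel in the cluster item as in the full-vowel item predicts a tie, which would correspond to poor discrimination in the ABX task.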

It is notable that speech

recognition models also show variation in the epenthesis effect.

For monophone models, this suggests that acoustic coarticulation

cues were sufficient to switch the choice of epenthetic vowel when

coarticulation and V1 context did not concur. Bigram models

approximate some measure of top-down phonological knowledge

by describing phonotactic constraints with variations in transition

probabilities. With the addition of this information, coarticulation

still influenced the transcriptions, while frequent sequences

modulated the proportions of responses corresponding to the

coarticulation information.


4. Discussion and concerns

In this project, we examined the influence of coarticulation cues on

the choice of epenthetic vowel after /h/ and /k/ by native speakers of

Japanese. Our results suggest that coarticulation does indeed play

an important part in predicting participants’ answers, but not to the

same degree for every vowel. /i/ coarticulation in particular strongly

predicted /i/ epenthesis, even when the vowel preceding /h/ was not

/i/. This would be problematic if we tried to account for epenthesis

after /h/ with a vowel-copy rule. The fact that /i/ coarticulation also

induced /i/ epenthesis after /k/, where the default epenthetic vowel

should be /u/, also supports the idea that epenthesis repair happens

during perception. By contrast, two-step phonological theories

would predict accurate perception of the /kp/ cluster,

and subsequent repair with the default /u/ epenthesis. Looking at

the way our test stimuli were transcribed by speech recognition

algorithms based on Hidden Markov models, we could observe the

responses of purely acoustic systems optionally equipped with

some degree of phonotactic knowledge. These models were also

sensitive to the influence of coarticulation. Trained on Japanese

speech, these models approximated the perception of Japanese

speakers. In the case of monophone models, transcriptions were

based solely on acoustic information. Epenthesis according to

coarticulation in these models’ transcription suggests that

coarticulation cues are acoustically salient enough to be perceived

as an epenthetic vowel. Bigram models modulated the influence of

coarticulation with the addition of transition probabilities.

The experiments having been carried out in Paris, we could

only recruit and test a limited number of participants in the period

of time allotted for this project. Using all three voices in

the identification task would have allowed for an analysis of speaker

effects, as well as better data for correlating the two experiments.

This would also have provided more complete comparisons for the

model transcriptions. A more accurate measure of reaction times

could also

Further analyses could be done on both our experimental

and model data. However, they would exceed the scope of this

written work, submitted for partial fulfillment of the requirements

for academic year 2013-2014 in the Research Master in Cognitive

Science (EHESS, ENS, Paris V). We plan to refine these analyses in

the near future.


5. Appendices

Items recorded at 16 kHz:

Natural clusters   ahpa    ehpe    ihpi    ohpo    uhpu
                   akpa    ekpe    ikpi    okpo    ukpu
ABX                ahapa   ehepe   ihipi   ohopo   uhupu
                   akapa   ekepe   ikipi   okopo   ukupu

Non-word test stimuli V1C?pV1, obtained by cross-splicing

recorded items with the Praat speech analysis software (Boersma &

Weenink).

Colour indicates the origin of /hp/ (for example, ehape was created by

cross-splicing /hp/ from ahpa into ehpe). On the diagonal, items

such as ahapa were created by cross-splicing /hp/ from another

recording of ahpa, to neutralize possible effects of a splicing

artefact and make it more comparable to other spliced items.

Control stimuli V1k?pV1 were obtained by a similar process.

            V1 choice
V2 choice   /a/     /e/     /i/     /o/     /u/
/a/         ahapa   ehape   ihapi   ohapo   uhapu
/e/         ahepa   ehpe    ihepi   ohepo   uhepu
/i/         ahipa   ehipe   ihpi    ohipo   uhipu
/o/         ahopa   ehope   ihopi   ohpo    uhopu
/u/         ahupa   ehupe   ihupi   ohupo   uhpu
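The crossing of flanking vowel (V1) and cluster origin (V2) can be sketched as follows. This Python snippet only generates item labels, not audio; the function names are invented for illustration, and the diagonal labels follow the ahapa example given in the text above (V2 written out).

```python
VOWELS = ["a", "e", "i", "o", "u"]

def stimulus_name(v1, v2, consonant="h"):
    """Label for an item with flanking vowel v1 whose cluster was spliced
    from a recording with vowel v2 (e.g. v1='e', v2='a' gives 'ehape')."""
    return f"{v1}{consonant}{v2}p{v1}"

def stimulus_grid(consonant="h"):
    """Rows: origin of the spliced cluster (V2); columns: flanking vowel (V1)."""
    return {
        v2: [stimulus_name(v1, v2, consonant) for v1 in VOWELS]
        for v2 in VOWELS
    }

grid_h = stimulus_grid("h")  # test items
grid_k = stimulus_grid("k")  # control items
for v2 in VOWELS:
    print(f"/{v2}/", " ".join(grid_h[v2]))
```

Each grid contains 5 × 5 = 25 labels, one per combination of V1 and V2.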


6. References

Arakawa, S. (1977). Gairaigo jiten [loanword dictionary]. Kadokawa, Tokyo.

Berent, I., Lennertz, T., Jun, J., Moreno, M. A., & Smolensky, P. (2008). Language universals in human brains. Proceedings of the National Academy of Sciences, 105(14), 5321-5325.

Berent, I., Lennertz, T., & Balaban, A. (2011). Language universals and misidentification: a two-way street. Language and Speech, 55(3), 311–330.

Chang, C. B. (2012). Phonetics vs. phonology in loanword adaptation: Revisiting the role of the bilingual. In Proceedings of the 34th Annual Meeting of the Berkeley Linguistics Society: General Session and Parasession on Information Structure, Berkeley, CA. Berkeley Linguistics Society.

Crawford, C. (2007). The role of loanword diffusion in changing adaptation patterns: A study of coronal stops in Japanese borrowings. Working Papers of the Cornell Phonetics Laboratory, 16, 32-56.

Davidson, L. (2007). The relationship between the perception of non-native phonotactics and loanword adaptation. Phonology, 24, 261–286.

Davidson, L. (2011). Phonetic, phonemic, and phonological factors in cross-language discrimination of phonotactic contrasts. Journal of Experimental Psychology: Human Perception and Performance, 37(1), 270.

Davidson, L., & Shaw, J. A. (2012). Sources of illusion in consonant cluster perception. Journal of Phonetics, 40(2), 234-248.

Dupoux, E., Kakehi, K., Hirose, Y., Pallier, C., & Mehler, J. (1999). Epenthetic vowels in Japanese: A perceptual illusion? JEP: HPP 25:1568-1578.

Dupoux, E., Parlato, E., Frota, S., Hirose, Y., & Peperkamp, S. (2011). Where do illusory vowels come from?. Journal of Memory and Language, 64(3), 199-210.

Goldsmith, J. A. (1995) The Handbook of Phonological Theory. Blackwell Handbooks in Linguistics. Blackwell Publishers. pp. 817–838.

Hallé, P. A., Segui, J., Frauenfelder, U., & Meunier, C. (1998). Processing of illegal consonant clusters: A case of perceptual assimilation?. Journal of experimental psychology: Human perception and performance, 24(2), 592.

Han, M. (1962). Unvoicing of vowels in Japanese. Onsei no kenkyuu, 10, 81–100.

Irwin, M. (2011). Loanwords in Japanese. John Benjamins Publishing.


Itō, Junko; Mester, R. Armin (1995). "Japanese phonology". In Goldsmith, John A. The Handbook of Phonological Theory. Blackwell Handbooks in Linguistics. Blackwell Publishers. pp. 817–838.

Kamiya, T. (1994). Tuttle new dictionary of loanwords in Japanese: a user's guide to Gairaigo. Tuttle Publishing.

Kindaichi, K., Yamada, T., Shibata, T., Sakai, K., Kuramochi, Y., & Yamada, A. (2000). Shinmeikai Kokugo Jiten [Shinmeikai Japanese-Japanese Dictionary]. Sanseido, Tokyo.

Labrune, L. (2006). La phonologie du japonais, Collection Linguistique, Société de Linguistique de Paris n°90, éditions Peeters, Leuven.

Labrune, L. (2008). Principes d’organisation phonémique des emprunts occidentaux composés abrégés, Revue d’Etudes Japonaises Université Paris 7, 2008, pp. 107-121.

LaCharité, D. & Paradis, C. (2005). Category preservation and proximity versus phonetic approximation in loanword adaptation. LI 36. 223–258.

Ladefoged, P., & Maddieson, I. (1996). The sounds of the world's languages. Blackwell Publishing.

Lovins, J. B. (1975). Loanwords and the phonological structure of Japanese. Bloomington: Indiana University Linguistics Club.

Paradis, C. & LaCharité, D. (1997). Preservation and minimality in loanword adaptation. Journal of Linguistics 33:379-430

Peperkamp, S., Vendelin, I. & Nakamura, K. (2008). On the perceptual origin of loanword adaptations: experimental evidence from Japanese. Phonology, 25, 129-164.

Peperkamp, S. (2005). A psycholinguistic theory of loanword adaptations. In: M. Ettlinger, N. Fleischer & M. Park-Doob (eds.) Proceedings of the 30th Annual Meeting of the Berkeley Linguistics Society. Berkeley, CA: The Society, 341-352.

Peperkamp, S. & Dupoux, E. (2003). Reinterpreting loanword adaptations: The role of perception. In: M.J. Solé, D. Recasens & J. Romero (eds.) Proceedings of the 15th International Congress of Phonetic Sciences. Adelaide: Causal Productions, 367-370.

Pitt, M. (1998). Phonological processes and the perception of phonotactically illegal consonant clusters. Perception & Psychophysics, 60(6), 941–951.

Schmidt, C. (2009). “Chapter 21: Loanwords in Japanese”. In Haspelmath, M., & Tadmor, U. Loanwords in the World's Languages: A Comparative Handbook, Walter de Gruyter, Language Arts & Disciplines. pp. 545-574.


Shinohara, S. (2000). Default accentuation and foot structure in Japanese: Evidence from adaptations of French words. JEAL 9:55-96.

Smith, J. L. (2006) Loan phonology is not all perception: Evidence from Japanese loan doublets. In Timothy J. Vance and Kimberly A. Jones (eds.), Japanese/Korean Linguistics 14, 63-74. Stanford: CSLI.

Tews, A. (2008). Japanese geminate perception in nonsense words involving German [f] and [x]. Gengo Kenkyu, 133, 133-145.

Yamamoto, A. (2009). Korenara oboerareru doitsugo tangochou. NHK Press.

7. Other resources

Boersma, P. & Weenink, D. (2014). Praat: doing phonetics by computer [Computer program]. Version 5.3.77, retrieved from http://www.praat.org

Breen, J. & Ahlström, K. (2014), Denshi Jisho, online Japanese Dictionary. WWWJDIC project, Electronic Dictionary Research Group. http://jisho.org

Cambridge University Engineering Department (CUED), Hidden Markov Models Toolkit HTK. http://htk.eng.cam.ac.uk/

MixB classifieds, Furansu keijiban. http://fra.mixb.net

National Institute for Japanese Language (NIJLA), Communications Research Laboratory (CRL), Tokyo Institute of Technology (TITech). (2003) Corpus of Spontaneous Japanese

Pinheiro J, Bates D, DebRoy S, Sarkar D and R Core Team (2014). nlme: Linear and Nonlinear Mixed Effects Models. R package version 3.1-117, http://CRAN.R-project.org/package=nlme.

Python Software Foundation (2013). Python Language Reference. Version 2.7.6, retrieved from http://www.python.org

R Core Team (2014). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. Version 3.1.0 "Spring Dance", retrieved from http://www.R-project.org

