A variation in illusion: Vowel epenthesis after /h/ in Japanese loanwords
Master’s thesis by
Isabelle LIN
co-supervised by
Dr Emmanuel DUPOUX and Dr Sharon PEPERKAMP
Submitted in June 2014
For partial completion of the academic requirements in
the Research Master in Cognitive Science (EHESS, ENS, Paris V) and Majeure ‘Research Master in Cognitive Science’ (HEC Paris)
Vowel epenthesis after /h/ in Japanese loanwords 1
Abstract
When we hear sequences of sounds that are not attested in our
native language, we tend to misperceive them and hear instead a
sequence that is more probable in our native system. Looking at
loanword adaptation of such illegal segments, we can gain insight
into the way non-native sequences are perceived. We examined
through two behavioral tasks the perception of /hp/ and /kp/
clusters by native speakers of Japanese. Our aim was to test the
influence of coarticulation in the perception of a more likely
sequence. We found that coarticulation information does influence
the illusory vowel perceived in those clusters, although not all types of
coarticulation have the same degree of influence. In a
second phase, we ran the experimental stimuli through speech
recognition models trained on a corpus of spoken Japanese. These
models are either purely acoustic, or integrate some measure of
phonotactic constraints. Comparing the responses of the
participants to those of the models can give us an inkling of the
processes involved in loan adaptation.
Declaration of Originality
Previous studies in Japanese loanword adaptations have shown that
illusory vowel segments perceived in illegal consonant clusters are
influenced by coarticulation information contained in said clusters.
In this project, we aim to compare the effects of different types of
coarticulation (/ɑ/, /i/, /u/, /e/ or /o/), as well as that of more or less
strongly coarticulated environments (/hp/ or /kp/). In particular, for
the case of /h/, current theories about loanword adaptation would
lead to different predictions. Our results could provide additional
evidence to inform this debate. In addition to behavioral tasks, we
also explore the possible use of speech recognition systems to
model this phenomenon.
Acknowledgements
This work was made possible thanks to the help and insights of many people, to whom I wish to express my most heartfelt thanks. First and foremost, I am extremely grateful to my supervisors Dr Emmanuel Dupoux and Dr Sharon Peperkamp, from whose previous work this project stems. Ideas for this study originated in their collaboration with Dr Yuki Hirose, whose notes on an experiment to examine coarticulation in the velar fricative /χ/ served as a basis to create the test stimuli used in our experiments. With their advice, the project was made more detailed and another approach to our research question was defined through the possibility of using speech recognition models as well as behavioral data to test our predictions. The literature review done during the first half of my internship at the Laboratoire de Sciences Cognitives et Psycholinguistique (LSCP) allowed me to adjust the experimental design for both experiments according to more precise predictions. I would like to thank Dr Akiko Takemura for proofing the recruitment notices and experimental instructions, and Abdellah Fourtassi for letting me use his HMM models of Japanese for the computational part of this project. Members of the LSCP writing group (Dr Christina Bergmann, Dr Alejandrina Cristià, Alexander Martin, Élise Michon) helped in proofreading parts of this thesis. Making the final stimuli from the recordings, designing the counterbalancing, coding the experiments in Python, recruiting, testing, and collecting data were done by me. All mistakes are my own. I would also like to thank all the members of LSCP in
general, who provided such an amazing environment in which to
learn how to do research. Being still new to the field, I am indebted
to many of them for their kind and professional advice.
Lastly, I thank Olivier Wang for accompanying me on my
recruiting outings during weekends, for his insights on
programming, but most of all for his constant personal support.
Table of Contents
Abstract
Acknowledgements
1. Introduction
1.1 Objectives
1.2 Predictions
2. Study Methodology
2.1 Participants
2.2 Experiment 1. Vowel-labelling
2.2.1 Choice of method
2.2.2 Stimuli
2.2.3 Procedure
2.2.4 Results
2.2.5 Discussion
2.3 Experiment 2. ABX discrimination
2.3.1 Choice of method
2.3.2 Stimuli
2.3.3 Procedure
2.3.4 Results
2.3.5 Discussion
3. Speech Recognition Models
3.1 Choice of method
3.2 Stimuli
3.3 Procedure
3.4 Results
3.5 Discussion
4. Discussion and concerns
5. Appendices
6. References
7. Other resources
1. Introduction
Just as social communities can integrate members from various origins, languages can adopt words from other languages. This process is known as loanword adaptation. A borrowing language (henceforth Lb) takes a word from a source language (henceforth Ls) and subjects it to a range of phonological, semantic or even syntactic modifications, until it enters the Lb lexicon as a loanword. This often happens with words expressing notions related to another culture (e.g. ‘kimono’, a loan from Japanese in English), or with technical terms (e.g. /ki:bo:do/ ‘keyboard’, a loan from English in Japanese).
Historically, borrowing is most active during times of intensive contact between communities. In some languages, such as Japanese, loans from different sources build up categories in the lexicon (lexical strata), traceable to periods of academic, diplomatic and economic relations with other cultures. The resulting loans are more or less well blended in the lexicon, both in terms of frequency of use and similarity to words of the native stratum. For Ls speakers, the adaptation is often such that a loanword in Lb is not easily recognizable as coming from their own language (e.g. from English in Japanese, /ʃimpuru/ ‘simple’¹). This is mainly because of the phonological adaptations that have taken place during borrowing. The sound inventory and phonotactic rules differ from language to language, and loanword adaptation must take these differences into account, using various strategies to make the loan abide by Lb phonology. For instance, in the previous example, the sequence /sɪ/ does not exist in Japanese phonology and is adapted to the native sequence /ʃi/ instead. Similarly, consonant clusters such as /pl/ are also illegal, so they are adapted by inserting vowels into the cluster. This second strategy is called vowel epenthesis. Looking into these kinds of strategies can provide insights into the way we process sounds from other languages in a context different from actual L2 acquisition.
Literature on loanword adaptation offers multiple theories to account for the origin of the pronunciation changes that words undergo when they are borrowed into another language. Phonological theories suggest a two-step model, in which the sounds of the source language are perceived faithfully, and then adapted according to the borrowing language’s phonological rules during production (Lovins 1975, Paradis & LaCharité 1997, Uffmann 2006).
However, experimental data from Lb speakers seem to suggest that the influence of native phonotactics extends to the perception of source language sounds. This is the basis for the perception-based theory (Peperkamp and Dupoux 2003, Vendelin and Peperkamp 2006, Dupoux, Parlato, Frota, Hirose, & Peperkamp, 2011). Instead of hearing consonant clusters that are illegal in their native system, Japanese speakers would perceive illusory epenthetic vowels between the consonants (Dupoux et al., 1999). Moreover, (1) the nature of the perceptual epenthetic vowel appears to depend on the characteristics of the vowels from the speaker’s native language, and (2) the choice of the epenthetic vowel depends on acoustic cues contained in the consonant cluster (Dupoux et al., 2011). These cues come from coarticulation: the consonant sounds produced are influenced by the preceding or following vowel sounds. According to this theory, loanword adaptation would be a single-step process, in which native phonotactic constraints already shape perception and thus phonetic categorisation.
¹ The Japanese vowel is actually an unrounded /ɯ/, but we will use /u/ for simplicity throughout this paper. The stimuli used during the experiments contained /u/, since our aim was to test for adaptation from foreign categories. Previous studies have also shown that /u/ is easily mapped to /ɯ/ during adaptation (Dupoux, Kakehi, Hirose, Pallier & Mehler, 1999).
1.1 Objectives
The case of Japanese is particularly interesting when examining loanword adaptation and vowel epenthesis. First, Japanese has phonotactic constraints against consonant clusters, except some starting with the nasal /n/ (Itô & Mester, 1995, Labrune, 2006). This is reflected in the syllabary writing system (kana), where only /n/ has a separate grapheme, all other symbols representing a V or CV combination (e.g. there are graphemes for /hɑ/, /hi/, /hu/, /he/ and /ho/, but none for /h/). Second, the proportion of loanwords in basic vocabulary is substantial: 6.3% from non-Chinese foreign Ls and 2.3% hybrid (Haspelmath & Tadmor, 2009). We consider only non-Chinese foreign Ls, because Chinese source words do not present consonant clusters either. A hybrid loan is a mix of a native word and a loan (e.g. /karaoke/ ‘空 /kara/, empty’+‘orche(stra) /oke/’), or a mix of loans that has no corresponding source word (e.g. /aisukiandi:/ ‘ice’+‘candy’: popsicle).
In this study we aim to test whether the choice of epenthetic vowel in Japanese for a consonant cluster starting with the fricative /h/ can be accounted for by coarticulation cues contained in said fricative. Looking at existing loanwords, it seems that the epenthetic vowel in such contexts is identical to the vowel preceding the cluster. The same phenomenon is observed when the source word ends with /h/. This could be explained by a phonological rule of vowel copying through /h/. However, the acoustic characteristics of /h/ may provide an alternative explanation. Sometimes described as the devoiced version of an adjacent vowel (Ladefoged and Maddieson, 1996), /h/ contains more coarticulation than other consonants, including /x/ and /ç/, which are both adapted to /h/ in the case of German loanwords in Japanese (Lovins, 1975, Irwin,). We can also observe adaptations of /x/ and /ç/ to Japanese /h/ in loans from other languages. Although a great number of loans come from source words containing these sounds, the exemplars where the source word ended with one of them or included it in a consonant cluster are fewer, and mostly fall in the category of proper names or specialized concepts:
(1) /bahha/ ‘Bach’ (German)
(2) /hohhohausu/ ‘Hochhaus (University)’ (German)
(3) /ihhiroman/ ‘Ich-Roman’ (German)
(4) /ahoro:toru/ ‘axolotl’ (Nahuatl)
(5) /uhuta/ ‘Ukhta’ (Russian)
We can also look at another source for adaptations from /hC/ clusters or words ending in /h/. Some vocabulary lists for Japanese learners of foreign languages include a pronunciation approximation in Japanese syllabary. These transcriptions can give more data on adaptations, even if the transcription does not correspond to an actual loanword. For instance, Shinohara (1997) worked on data from oral and written adaptation of French words by native Japanese speakers. Transcriptions of German words in such lists provide more instances of /x/ and /ç/ to /h/ with a copy vowel (Yamamoto, これなら覚えら得るどいつ語単語帳 Korenara oboerareru doitsugo tangochou, NHK press 2009).
By contrast, other consonants do not behave the way /h/ does during adaptation. For instance, /kC/ clusters or words ending in /k/ are usually adapted using the default epenthetic vowel in Japanese, /u/ (Lovins, 1975). /u/ is the shortest vowel in Japanese, and can be devoiced (Han, 1962, Labrune, 2006), making it a phonetically minimal element. If epenthetic repair takes place for a source that contained no vowel segment, the minimal element seems to be a good candidate. Loans from sources with /kC/ clusters or ending in /k/ are much more frequent:
(6) /abekku/ ‘avec’ (French)
(7) /miruku/ ‘milk’ (English)
(8) /kurasu/ ‘class’ (English)
(9) /akuʃon/ ‘action’ (English)
(10) /bekutoru/ ‘Vektor’ (German)
This difference between /h/ and /k/ could be due to the fact that no vowel-copying rule applies to /k/, but it might also be because other consonants contain less coarticulation information than /h/. To test the influence of coarticulation, we examined Japanese speakers’ perception of /hC/ and /kC/ clusters in a forced-choice vowel-labelling task and an ABX discrimination task. We recorded items such as /ɑhpɑ/ and /ihpi/, then digitally exchanged their consonant clusters (cross-splicing) to construct stimuli where the context vowel (e.g. /ɑ/) did not match the coarticulation cues contained in /h/ (e.g. /i/). We will denote such items as V1CcoartpV1 (e.g. ahipa). All stimuli were non-words, and could not become a word after epenthesis. In both tasks, we should observe different outcomes depending on whether adaptation follows a vowel-copying phonological rule or is determined by coarticulation cues.
Figure 1. Cross-splicing stimuli. Items with natural clusters were recorded, then digitally manipulated so that the coarticulation information contained in the consonant cluster (colour) did not match the context vowel (black)
In parallel, we will run the experimental stimuli used for these two tasks through speech recognition systems based on Hidden Markov Models (HMM) created with HTK Speech Recognition Toolkit (CUED). These models were trained on the Corpus of Spontaneous Japanese (Kokken, 2003), and then adapted for speaker variation using other recordings from the speakers who produced the stimuli. This will give us a rough idea of what a purely acoustic model optionally equipped with bigram transition probabilities will predict for the transcription of non-native sounds into Japanese phonemes.
1.2 Predictions
If the choice of epenthetic vowel after /h/ is determined by a vowel-copying process, we should observe no difference in the adaptation of the non-words of the form V1hpV1 in the vowel labelling task. Items such as ahapa, ahepa, ahipa, ahopa and ahupa should all be mapped to /ɑhpɑ/ (faithful perception) or /ɑhɑpɑ/ (adaptation with vowel-copy). However, if the choice hinges on coarticulation cues contained in /h/, we should observe some degree of variation in the epenthetic vowel. If coarticulation predicts the epenthetic vowel, we should observe:
ahapa → /ɑhɑpɑ/
ahipa → /ɑhipɑ/
ahupa → /ɑhupɑ/
ahepa → /ɑhepɑ/
ahopa → /ɑhopɑ/
Furthermore, from the observations by Lovins (1975) and Labrune (2006), and the transcription of German words in Japanese syllabary (Yamamoto, 2012), we predict that ahepa might be mapped to /ɑhipɑ/. Control items will contain the voiceless stop /k/, which should carry less coarticulation. They should not exhibit such variation in the epenthetic vowel, regardless of the context from which /k/ originated. The default epenthetic vowel would then be /u/, and:
akapa, akipa, akupa, akepa, akopa → /ɑkupɑ/
While no real words could arise from epenthesis on the stimuli, it is interesting to note the following adaptation, whose source contains /akpa/:
(11) /akupatoku/ ‘Akpatok (island name)’ (Inuktitut)
We expect performance in the ABX task to depend on that in the vowel-labelling task. That is, the more two items are mapped to the same adaptation, the harder it will be to distinguish them from one another. For example, if akapa and akupa are both mapped to /ɑkupɑ/, ABX trials comparing this pair (e.g. A: akapa, B: akupa, X: akapa → X=A) would be more difficult than if they were respectively mapped to /ɑkɑpɑ/ and /ɑkupɑ/. Such trials would likely result in longer reaction times as well as higher error rates. We could also compare the discrimination of coarticulated items from items containing a full medial vowel: if ahapa is mapped to /ɑhɑpɑ/, then an ahapa (splice) vs. ahapa (full medial vowel) comparison will be difficult.
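The link between the two tasks can be illustrated with a short Python sketch. It is only an illustration of the prediction, not the analysis used in this study: the function name and the two percept mappings are hypothetical and simply restate the two scenarios described above.

```python
def same_percept(item_a, item_b, percept):
    """Return True when two stimuli are mapped to the same adapted form,
    in which case an ABX trial contrasting them is predicted to be hard."""
    return percept[item_a] == percept[item_b]


# Scenario 1 (assumed): /k/ carries little coarticulation, default /u/ epenthesis.
default_u = {"akapa": "akupa", "akupa": "akupa"}
# Scenario 2 (assumed): coarticulation determines the epenthetic vowel.
coarticulation = {"akapa": "akapa", "akupa": "akupa"}

hard_under_default = same_percept("akapa", "akupa", default_u)
hard_under_coart = same_percept("akapa", "akupa", coarticulation)
```

Under the first mapping the akapa/akupa pair collapses onto one percept and is predicted to be hard to discriminate; under the second it is not.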
2. Study Methodology
To test our hypotheses, we conducted two experiments, a
vowel-labelling study and an ABX discrimination study. Both were
programmed in Python (Python Software Foundation,
2013) and conducted at the Laboratoire de Sciences Cognitives et
Psycholinguistique, 29, rue d’Ulm, Paris, France. The procedures
used are similar to those used in Dupoux et al. (1999, 2011).
2.1 Participants
Participants were 25 native speakers of Japanese (mean age 30, 5
males, 20 females) recruited in Paris. Recruitment was conducted
by notices placed in libraries and shops frequented by the Japanese
residents in Paris, Japanese newspapers (OVNI…) and on Japanese
community websites (MixB bulletin board, 日本人会 Nihonjinkai).
Some participants were recruited directly from French classes for
newly arrived Japanese visitors, as well as the Japanese House in
the International University Students Residence (Cité Universitaire).
We searched for people who arrived in Paris recently and only had
limited exposure to languages allowing the consonant clusters
tested, such as French or English. It is possible to find participants
who meet those criteria because a significant part of the Japanese
community in Paris came to France accompanying a family
member, and have not yet had time to learn French. Among our
participants were also recently arrived exchange students and
people on Working Holiday (1-year stay). Recruitment notices,
instructions during the experiments and debriefing were all done in
Japanese. Participants completed both experiments (about one
hour, including pauses and debriefing), and were compensated 10€.
2.2 Experiment 1. Vowel-labelling
2.2.1 Choice of method
We aimed to test the participants’ online decision regarding the
presence or absence of an epenthetic vowel, as well as that vowel’s
identity. Searches through general Japanese dictionaries (Kindaichi,
Yamada, Shibata, Sakai, Kuramochi & Yamada, 2000) as well as
loanwords dictionaries (Arakawa, 1977, Kamiya, 2002) and corpora
(Haspelmath & Tadmor, 2009) showed that loans adapted from
source words containing /hC/ clusters or ending in /h/ were relatively
rare, and usually limited to very specialized vocabulary. In existing
loans, the epenthetic vowel after /h/ does vary according to the
vowel preceding /h/, but it is impossible to tease apart the roles of
vowel context and coarticulation.
A review of the literature pointed to several possible
paradigms to test perceptual epenthesis. Some studies on
phonotactic repair of illegal sequences have used syllable count
tasks. Pitt (1998) showed that when the liquids in [tl] or
[sr] sequences, unattested in English, are lengthened, native speakers are more
likely to report that they heard a disyllable. Berent, Lennertz, Jun,
Moreno & Smolensky (2008) tested Korean listeners’ perception of
non-words like [lbif] against [ləbif], showing that epenthetic repair
of the former leads to reports of it being a disyllable, just like the
latter. However, if we want to look at a variation in the choice of
epenthetic vowel, syllable count alone will not suffice. While
transcription tasks are well-suited to the study of loanword
adaptation in languages using the Roman alphabet (Hallé, Segui,
Frauenfelder & Meunier, 1998, Davidson, 2007), in the case of
Japanese, the influence of the writing system can affect our results.
As mentioned before, the native syllabary system does not allow for
the transcription of consonant clusters (except the case of /n/):
using this system would severely limit the potential ‘no vowel’
responses if we ask our participants to transcribe consonant
clusters. On the other hand, while the Roman alphabet is also well
known to most native speakers of Japanese, asking for a full
transcription of a token like ahupa might result in more variability
than we want to study: /hu/ in Japanese is realized as its allophone
[ɸu], and could possibly be transcribed as ‘fu’ (e.g. the loan
/huirumu/ [ɸuirumu] ‘film’, English). We therefore used a paradigm
similar to that of Dupoux et al. (1999, 2011). Participants were given
a partial transcription of the test tokens, and asked to make a
decision on the central segment of the audio stimuli.
2.2.2 Stimuli
All recorded items were non-words produced by a trained
phonetician, native speaker of Dutch, and were stressed on the first
syllable. Obstruent clusters are allowed in Dutch, and [h] is present
in its sound inventory (Ladefoged & Maddieson, 1996). Incidentally,
Dutch was also one of the main source languages for loans in
Japanese history. Recordings were made in a soundproof booth, at 16
bits mono with a sampling rate of 44.1 kHz. Recorded items were of
the following structure:
/hp/ clusters: V1hpV1 (e.g. ahpa)
/kp/ clusters: V1kpV1 (e.g. akpa)
where V1 is one of the vowels /ɑ, e, i, o, u/
(see Appendix for full list of recorded items)
Test stimuli were then obtained by cross-splicing /hp/ and
/kp/ from natural clusters preceded by the different vowels using
Praat speech analysis software (Boersma & Weenink). Clusters were
cut out at zero-crossings. We decided to splice the entire cluster
rather than /h/ or /k/ alone based on previous results showing that
the second consonant of the cluster could also contain traces of
coarticulation (Dupoux & Nakamura, in preparation). We mostly
used the pitch tracker in Praat to determine the boundaries of the
consonant clusters, double-checking the spectrograms manually in
case of doubt. All test items were cross-spliced, including those
whose coarticulation information matched the V1 context (e.g.
ahapa). This is to avoid the possible effect of a cross-splicing
artefact in the other items. Tokens like ahapa were made by splicing
the /hp/ cluster from one recording of /ɑhpɑ/ to a second one.
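The splicing itself was done by hand in Praat. The following Python sketch only illustrates the principle of cutting at zero-crossings and is not the procedure used for the stimuli: waveforms are represented as plain lists of samples, the function names are hypothetical, and the approximate cluster boundaries are assumed to be supplied by the user (e.g. from a pitch track).

```python
def nearest_zero_crossing(samples, index):
    """Return the position closest to `index` where the waveform changes
    sign, so that a cut made there avoids an audible click."""
    crossings = [
        i for i in range(1, len(samples))
        if (samples[i - 1] < 0) != (samples[i] < 0)
    ]
    if not crossings:
        raise ValueError("no zero crossing found")
    return min(crossings, key=lambda i: abs(i - index))


def cross_splice(carrier, carrier_span, donor, donor_span):
    """Replace the cluster in `carrier` with the cluster taken from `donor`.

    Each span is an approximate (start, end) pair of sample indices; both
    ends are moved to the nearest zero crossing before cutting.
    """
    c_start, c_end = (nearest_zero_crossing(carrier, i) for i in carrier_span)
    d_start, d_end = (nearest_zero_crossing(donor, i) for i in donor_span)
    return carrier[:c_start] + donor[d_start:d_end] + carrier[c_end:]


# Toy waveforms standing in for two recordings (not real audio).
carrier = [1, 2, -1, -2, 1, 2, -1, -2]
donor = [5, -5, -6, 5, 6, -5]
spliced = cross_splice(carrier, (2, 6), donor, (1, 5))
```

With real recordings the lists would hold the samples of the 44.1 kHz files, and the spliced result would still need to be checked by ear, as described below.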
The main difficulty in making these stimuli rested in the fact
that the resulting items had to sound as natural as possible. While
splices from other tokens of the same V1 context were usually easy to
make, base recordings to make items with mismatched V1 and
coarticulation had to be carefully selected so that the spliced clusters
blended in seamlessly. For example, some /p/ bursts contained more
energy than others, and splicing them in another context renders this
very salient. To build test items that would be comparable in the analysis,
the spliced cluster for one type of coarticulation (e.g. /hap/) was the
same across all V1 contexts (ehape, ihapi, ohapo, uhapu) except when V1
matches coarticulation (ahapa).
Figure 2. Splicing out /hap/ cluster from a recording of /ɑhpɑ/. The section in pale red was saved as the /hap/ cluster to be spliced in contexts where V1 is /e, i, o, u/. To make ahapa, a /hap/ cluster was spliced from another token of /ɑhpɑ/ and inserted in place of the selection. The cluster was considered to start at the zero-crossing immediately after pitch (in blue) disappeared, and to end at the zero-crossing after the /p/ burst.
Each condition (/h/ or /k/) contained 25 items. The mean
duration of an item was 598.0 ms (SD = 55.5 ms). Each item was
presented three times during the experiment, so that a participant
heard and responded to a list of 25*2*3 = 150 items. Items were
presented in a randomized order, with the additional condition that
items starting with the same V1C could not occur within 3 trials (for
example if trial 1 is akepa, the next two trials cannot be items
starting with /ɑk/). The order of stimuli was thus different for every
subject. Stimuli are detailed in the Appendix.
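The trial list and its ordering constraint can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the experiment script itself: the item spelling (V1 + C + coarticulation vowel + "p" + V1), the function names and the restart strategy are all hypothetical choices made for the example.

```python
import random
from itertools import product

VOWELS = "aeiou"
CONSONANTS = "hk"
REPETITIONS = 3
MIN_GAP = 3  # trials sharing a V1C onset must be at least this far apart


def build_trials():
    """25 items per consonant (5 V1 x 5 coarticulations), each repeated 3 times."""
    items = [v1 + c + coart + "p" + v1
             for c, v1, coart in product(CONSONANTS, VOWELS, VOWELS)]
    return items * REPETITIONS


def constrained_shuffle(trials, rng, max_attempts=1000):
    """Randomly order trials so that no two items with the same V1C onset
    fall within MIN_GAP trials of each other; restart on dead ends."""
    for _ in range(max_attempts):
        remaining = list(trials)
        order = []
        while remaining:
            recent = {item[:2] for item in order[-(MIN_GAP - 1):]}
            candidates = [item for item in remaining if item[:2] not in recent]
            if not candidates:
                break  # dead end: start over with a fresh attempt
            choice = rng.choice(candidates)
            remaining.remove(choice)
            order.append(choice)
        else:
            return order
    raise RuntimeError("no valid order found")


order = constrained_shuffle(build_trials(), random.Random(0))
```

Seeding the generator differently for each participant would give each of them a different order, as in the experiment.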
2.2.3 Procedure
Participants sat in front of a computer in a soundproof booth. All
instructions were displayed on screen in Japanese, and additional
explanations were given orally when necessary. Participants heard
the test stimuli through headphones. They were told that they
would hear words without meaning, all of the form “V1C?pV1” (e.g.
ah?pa). They were to indicate whether they perceived a vowel in the
consonant cluster, and if so, which vowel was perceived. Questions
appeared on the computer screen, in the Roman alphabet, in the
form “V1C?pV1” (e.g. ah?pa), with the choices ‘無’(no vowel), ‘a’, ‘e’,
‘i’, ‘o’, ‘u’ (forced choice). We used the Japanese character for ‘none’
as it is the choice traditionally given in multiple choice questions on
Japanese administrative forms. Participants were asked to reply as
fast as possible, and their response triggered the next trial with an
ISI of 1 s. Responses were given by pressing labeled keys on a
QWERTY keyboard. A training session of 10 items (items with a full
medial vowel, not used in this task) preceded the actual task. No
feedback was given as the correct answer would always have been
‘no vowel’. For each trial, we collected the data ‘response’ (‘no
vowel’, ‘a’, ‘i’, ‘u’, ‘e’, ‘o’) and ‘reaction time’. We did not use
equipment to collect precise reaction times: these data points
mainly serve to exclude trials where the reply came too fast
(accidental keypress) or too slowly (no longer reflecting first
impression). The experiment lasted about 10 minutes.
Figure 3. Paradigm for forced choice vowel-labelling task. Participants were asked to reply as fast as possible after hearing the audio stimulus, and their key press triggered the next trial with an ISI of 1 s. Questions were presented on screen.
2.2.4 Results
Since we have 3 within-subject independent variables (Consonant
(2) x V1 (5) x Coarticulation (5)), and our dependent variable is
nominal with 6 possible responses (a, i, u, e, o, no_vowel), we had to
simplify our data to conduct statistical analyses. The proportions of
responses in each combination of coarticulation and V1 for /hp/
clusters are plotted in Figure 4 below.
Figure 4. Items with /hp/ clusters, all responses: percentages of responses in each category (a, i, u, e, o, no vowel) according to V1 (smaller graduations) and coarticulation in the consonant cluster (categories delimited by white lines).
Qualitatively, the results for /hp/ items show that /e/ and /i/
coarticulation give rise to a strong /i/ epenthesis effect (dark green),
regardless of the V1 context. This concurs with the findings of
Dupoux et al. (2011) that /i/ coarticulation cues can lead to reports
of /i/ epenthesis. Such an effect also seems to arise from /e/ and /i/
V1 contexts even when the coarticulation information points to
another vowel. It seems furthermore that V1 context also has an
influence on the choice of epenthetic vowel: /ɑ/ epenthesis happens
in very small proportions, and almost only in V1 = /ɑ/ contexts, but
when both context and coarticulation point to /ɑ/, 67% of the
responses were also /ɑ/. This also appears to be the case for /o/,
albeit to a lesser degree.
However, when we look at the same graph for /kp/ clusters,
the picture is quite different (Figure 5 below).
Figure 5. Items with /kp/ clusters, all responses: percentages of responses in each category (a, i, u, e, o, no vowel) according to V1 (smaller graduations) and coarticulation in the consonant cluster (categories delimited by white lines).
From the appearance of this graph, /kp/ clusters mostly
induce /u/ epenthesis, with /e/ and /i/ coarticulation still generating a
small measure of /i/ responses even in other V1 contexts. We can
also observe that ‘no vowel’ responses are fewer. This could be
explained by the fact that the /kp/ clusters were produced with a
released /k/ to keep the durations comparable between /hp/ and
/kp/ items. Therefore, /kp/ clusters were more likely to contain
schwa-like material.
For all our graphs and tests, we analysed the responses of
25 participants. One additional participant was tested, but reported
in the debriefing phase that they were a professional musician.
Previous results suggest that musicians tend to report less
epenthesis, as they are generally more sensitive to the temporal
cues of the acoustic signal (Dupoux et al., in preparation).
Subsequent analysis of this participant’s data indeed showed that
they gave more than 60% ‘no vowel’ responses (against an
average of 11% for the other participants, SE = 0.12). Consequently,
we excluded this participant’s data from all analyses. One other
subject spoke fluent English, but their results did not differ
significantly from the other participants’, so we still included them
in the analysis.
Additionally, we filtered reaction times so only responses
made between 200 ms and 3860 ms were considered in our analysis
(µ = 1305 ms, SD = 1277 ms).
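The reaction-time filter can be sketched in Python as below. The trial format (a dictionary with an `rt_ms` key) and the function name are assumptions made for the example, and the bounds are treated as inclusive, which the text does not specify. Note that the upper bound is close to the mean plus two standard deviations (1305 + 2 × 1277 = 3859 ms), although the text does not state how it was chosen.

```python
RT_MIN_MS = 200   # faster replies are treated as accidental key presses
RT_MAX_MS = 3860  # slower replies no longer reflect a first impression


def filter_trials(trials):
    """Keep only trials whose reaction time lies within the accepted window
    (bounds assumed inclusive)."""
    return [t for t in trials if RT_MIN_MS <= t["rt_ms"] <= RT_MAX_MS]


# Hypothetical trials, for illustration only.
example_trials = [
    {"item": "ahipa", "response": "i", "rt_ms": 150},
    {"item": "akupa", "response": "u", "rt_ms": 1200},
    {"item": "ahapa", "response": "a", "rt_ms": 4200},
]
kept = filter_trials(example_trials)
```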
Epenthesis effect
We first examined the amount of epenthesis responses elicited by
each consonant type. Our analyses were done in R (R Core
Team, 2014). We conducted a within-subject one-way ANOVA with
the factor Consonant on the dependent variable ‘no vowel’
response. We found a significant effect of Consonant (F(1,24) = 6.8,
p < .02).
Figure 6. Mean percentages of ‘no vowel’ responses as a function of consonant. Error bars represent standard error of the mean.
This confirms that items containing /kp/ clusters elicit
more epenthesis responses. However, a finer-grained analysis is
needed to characterize the choice of epenthetic vowel. In what
follows, we limit our analysis to ‘vowel’ responses (/ɑ/, /i/, /u/,
/e/ and /o/ responses).
Since /i/ responses seemed to have a certain importance in both
consonant conditions, we will first focus on these responses, coding
all other responses as ‘other vowel’.
Context and coarticulation effects
To look at context and coarticulation effects, we simplified our data
according to the possible responses. For instance, for /i/ responses,
we considered responses, V1 context and Coarticulation to be either
‘i’ or ‘non i’. We then conducted analyses for /hp/ and /kp/ clusters,
and plotted the percentage of ‘i’ responses according to V1 context
and coarticulation. For each Consonant condition, we conducted a
mixed 2-way ANOVA with the fixed factors V1 context and
Coarticulation, and the random factor Subject, using the nlme
package in R (Pinheiro, Bates, DebRoy, Sarkar & R Core Team,
2014).
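The binary recoding described above (‘i’ versus ‘non i’ for response, V1 context and Coarticulation) can be illustrated with a short sketch. The trial tuples are invented, and the helper names `recode` and `percent_target` are hypothetical; the statistical analysis itself was carried out in R with nlme and is not reproduced here.

```python
from collections import defaultdict


def recode(value, target):
    """Collapse a vowel label to either the target vowel or 'non <target>'."""
    return target if value == target else "non " + target


def percent_target(trials, target):
    """Percentage of target responses in each (V1 context, Coarticulation) cell.

    trials: iterable of (v1, coarticulation, response) vowel labels,
    assumed to be already restricted to 'vowel' responses.
    """
    counts = defaultdict(lambda: [0, 0])  # cell -> [target responses, all responses]
    for v1, coart, response in trials:
        cell = (recode(v1, target), recode(coart, target))
        counts[cell][1] += 1
        if response == target:
            counts[cell][0] += 1
    return {cell: 100.0 * hits / total for cell, (hits, total) in counts.items()}


# Invented example trials, for illustration only.
example = [
    ("i", "i", "i"),
    ("i", "i", "i"),
    ("i", "u", "u"),
    ("a", "i", "i"),
    ("a", "o", "u"),
    ("e", "a", "i"),
]
table = percent_target(example, "i")
```

Each of the five vowels gets its own recoding in turn, so the same trials feed five separate 2 × 2 tables.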
‘a’ responses
For /hp/ items, we found main effects of V1 context (F(1, 1442) =
41.8 , p<.0001) and Coarticulation (F(1, 1442) = 34.7, p<.0001). We
also found an interaction of V1 context and Coarticulation (F(1, 1442)
= 877.6, p<.0001). Context and Coarticulation both lead to some
degree of ‘a’ responses, but ‘a’ Context does seem to have a
multiplicative effect on the proportion of ‘a’ responses when
Coarticulation is also ‘a’. Looking back at the responses given in
/hap/ conditions in Figure 4, ‘a’ coarticulation seems to be weakest,
and the responses were much more varied depending on context.
This could be explained by lexical influences: some participants
reported during debriefing that they did think of the example
/bahha/ ‘Bach’ during the task.
For /kp/ items, ‘a’ responses were few in all contexts (<12%,
chance level = 20%). We found a marginal effect of V1 context
(F(1, 1442) = 3.7, p<.06) and no effect of Coarticulation (p>.5).
There was no interaction of V1 context and Coarticulation (p>.5).
The absence of a Coarticulation effect is consistent with our
prediction that /kp/ clusters would contain less coarticulation
information.
Figure 7. Context and Coarticulation effects for /hp/ and /kp/ items on ‘a’ responses. Context appears to have a multiplicative effect on the proportion of ‘a’ responses for /hp/ items when coarticulation information also points to ‘a’. Error bars are standard errors of the mean.
‘i’ responses
For /hp/ items, we found main effects of V1 context (F(1, 1442) = 87.4,
p<.0001) and Coarticulation (F(1, 1442) = 41.6, p<.0001). We also
found an interaction of V1 context and Coarticulation (F(1, 1442) =
10.5, p<.002). Context and Coarticulation both lead to a high
proportion of ‘i’ responses, as seen earlier in Figure 4. Dupoux et al.
(2011) had also noted a very strong coarticulation effect of /i/. In
addition, /hip/ clusters were slightly palatalized in our test items,
which could have contributed to more distinctiveness. It should be
noted that the Japanese /hi/ is also palatalized.
For /kp/ items, we found main effects of V1 context (F(1,
1442) = 8.4, p<.005) and Coarticulation (F(1, 1442) = 21.9, p<.0001).
We also found an interaction of V1 context and Coarticulation (F(1,
1442) = 76.3, p<.0001). /i/ coarticulation seems to be present even in
/kp/ items. We can examine the acoustic distances between
different clusters by looking at the traces of vowel formants
contained in the clusters (Figure 8 below). While /kp/ clusters are
overall less distinctive in terms of F1, they remain distinctive
in terms of F2. /kip/ and /kep/ in particular are especially distinct
from the other /kp/ clusters, which could explain the coarticulation
effect for ‘i’ responses.
Figure 8. Formant traces contained in the coarticulated /hp/ and /kp/ clusters. /kp/ clusters appear to be less distinctive in terms of F1, but still quite distinct in terms of F2.
Figure 9. Context and Coarticulation effects for /hp/ and /kp/ items on ‘i’ responses. /i/ coarticulation appears to be strong enough to induce more ‘i’ responses even in the less coarticulated /kp/ clusters. Error bars are standard errors of the mean.
‘e’ responses
For /hp/ items, we found a main effect of V1 context (F(1, 1442) =
14.6, p<.0001) but not of Coarticulation (p>.5). There was no
interaction of V1 context and Coarticulation (p>.5). Looking at
Figure 4, we can observe that ‘e’ context and coarticulation mainly
induced ‘i’ responses. This could be explained by the fact that /i/
and /e/ are acoustically very similar in our test items (Figure 8).
For /kp/ items, we found a main effect of V1 context (F(1,
1442) = 8.2, p<.005) but not of Coarticulation (p>.5). There was no
interaction of V1 context and Coarticulation (p>.4).
It seems that ‘e’ responses are only induced by Context
effects: the acoustic similarity between /i/ and /e/ coarticulation
could explain this phenomenon. /hp/ clusters were indeed slightly
palatalized for both coarticulations. While the Japanese /hi/ is
palatalized, this is not the case for /he/. Furthermore, /e/ is the least
devoiced vowel in Japanese, and /e/ to /i/ adaptation was frequent in
early Chinese loans (Labrune, 2006). Modern kana transcriptions of
German words such as lecht (transcribed to /rehito/) also appear to
exhibit this effect (Yamamoto, 2009). It is also interesting to
observe that licht is transcribed to /rihito/.
Figure 10. Context and Coarticulation effects for /hp/ and /kp/ items on ‘e’ responses. ‘e’ responses appear to be solely triggered by Context effects. The acoustic similarity between /i/ and /e/ clusters in our test items could account for this phenomenon, which is also present in modern transcriptions. Error bars are standard errors of the mean.
‘o’ responses
For /hp/ items, we found main effects of V1 context (F(1, 1442) =
26.5, p<.0001) and Coarticulation (F(1, 1442) = 43.1, p<.0001). We
also found an interaction of V1 context and Coarticulation (F(1, 1442)
= 66.5, p<.0001). Context and Coarticulation both lead to ‘o’
responses, with an increase when both point to ‘o’.
For /kp/ items, very few ‘o’ responses were given in all
situations (1%, chance level = 20%). We found no effect of V1
context (p>.05) or of Coarticulation (p>.1). There was no
interaction of V1 context and Coarticulation (p>.5). /kp/ clusters
were mainly mapped to ‘u’ responses.
Figure 11. Context and Coarticulation effects for /hp/ and /kp/ items on ‘o’ responses. ‘o’ responses appear to be influenced by both Context and Coarticulation effects in /hp/ clusters, but are almost completely absent from /kp/ clusters. Error bars are standard errors of the mean.
‘u’ responses
For /hp/ items, we found a main effect of V1 context (F(1, 1442) =
75.4, p<.0001) and of Coarticulation (F(1, 1442) = 19.5, p<.0001).
There was no interaction of V1 context and Coarticulation (p>.1).
The effects seem to have been additive in this case.
For /kp/ items, we found a main effect of Coarticulation (F(1,
1442) = 8.4, p<.005) but not of V1 context (p>.5). There was no
interaction of V1 context and Coarticulation (p>.3). As /u/ is expected
to be the default epenthetic vowel in /kp/ clusters, it seems
reasonable that Context effects would not appear. However, as
seen earlier, /i/ coarticulation seems to be strong enough to have
an influence even in /kp/ clusters, which would explain the effect of
Coarticulation.
Figure 12. Context and Coarticulation effects for /hp/ and /kp/ items on ‘u’ responses. ‘u’ responses appear to be influenced by both Context and Coarticulation effects in /hp/ clusters. In /kp/ clusters, ‘u’ is expected to be the default epenthetic vowel, yielding only to the cases of strong /i/ coarticulation. Error bars are standard errors of the mean.
2.2.5 Discussion
We find that Consonant type has an influence on the amount of
epenthesis responses. /hp/ clusters lead to more ‘no vowel’
responses, possibly because /k/ was released in /kp/ clusters.
As predicted, /kp/ clusters yielded mostly default /u/
epenthesis, with the surprising finding that /i/ coarticulation
in /k/ was strong enough to elicit some degree of /i/ epenthesis even
in /kp/ clusters. Our results reproduce the strong effect of /i/
epenthesis observed by Dupoux et al. (2011). Instead of containing a
partially excised vowel, our test items contained no medial vowel to
begin with, which provides further evidence for the effect.
As predicted, coarticulation plays an important part in
determining the nature of the epenthetic vowel in /hp/ clusters.
However, the strength of the coarticulation effect was not identical
for all vowels. /i/ and /e/ coarticulation lead to a slightly palatalized
/h/, closer to the native Japanese category of /hi/, also palatalized,
than /he/, not palatalized. /ɑ/ coarticulation was weaker and /hap/
clusters were thus more prone to Context effects, including the case
where V1 is also /ɑ/, where most ‘a’ responses are induced.
Looking at the formant traces contained in both types of
clusters, there does indeed appear to be an observable influence of
coarticulation, which makes them acoustically distinguishable from
one another.
This vowel-labelling task has one drawback, however. To
make a decision in the forced-choice question, participants must
explicitly segment the acoustic signal into consonants and
vowels. For native speakers of Japanese, this is made all the more
difficult by the fact that they are used to syllables as minimal
elements rather than phonemes. This is why we carried out a
second experiment, an ABX discrimination task, in which
participants are asked to make a judgement on the overall similarity
of entire non-words rather than on the identity of a single segment.
2.3 Experiment 2. ABX discrimination
2.3.1 Choice of method
AX and ABX discrimination paradigms allow for a comparison of
overall similarity between tokens, rather than focusing on a single
acoustic detail in the signal. Davidson & Shaw (2011) made a
comparison between both paradigms in a study on phonotactic
repair of illegal consonant clusters in English. They found that for
word-length stimuli, AX comparisons tended to be performed by
scanning fine acoustic details in both tokens rather than comparing
their overall temporal and spectral properties. However, ABX
paradigms could be reduced to an AX paradigm if the tokens are
physically too similar: rather than doing all comparisons, it becomes
easier to compare only the latter two. This is why we are using a
cross-talker version of the ABX paradigm, such as the one
implemented by Dupoux et al. (1999, 2011). The three items A, B
and X are produced by different speakers, thus requiring a
comparison of the three utterances at a more abstract level of
representation.
2.3.2 Stimuli
Stimuli for this experiment included those of the vowel-labelling
task, as well as similarly constructed tokens from 2 other people
(one female, one male). Those additional speakers were also
trained phoneticians, respectively native speakers of Argentinian
Spanish and American English. We chose to record speakers with
different native languages containing /h/ and /k/ in their sound
inventories so as to create more variability and increase the
dissimilarity between the compared tokens in terms of fine-grained
acoustic properties. This way, any comparison between the three
utterances had to be made on the basis of more abstract
representations resulting from the integration of the phonetic
details of each token rather than a direct comparison of said details.
More types of non-words were used than in Experiment 1. We
compared 4 types of pairs:
NfV: Natural cluster vs Full vowel (ahpa – ahapa) 10 pairs
Nsp: Natural cluster vs Spliced cluster (ahpa – ahipa) 40 pairs
spsp: Spliced cluster vs Spliced cluster (ahipa – ahopa) 100 pairs
spfV: Spliced cluster vs Full vowel (ahipa – ahipa) 50 pairs
We predicted that items which were mapped to the same
representation in the vowel-labelling task should be harder to
differentiate in the ABX task. For instance, if /u/ is the default
epenthetic vowel for items with /kp/ clusters, then those items,
natural or spliced, should all be mapped to /V1kupV1/ (if akpa, akipa
→ /akupa/, akpa-akipa comparison will be hard), and thus be
difficult to differentiate from one another, but also from the item
V1kupV1 (akipa – akupa should also be hard).
Each pair of non-words could be presented in 4 different
trials: ABA, ABB, BAA, BAB. As there were 200 possible pairs, there
were 800 possible trials. This would have been too many for a
single participant to go through. Instead, the design was
counterbalanced such that each subject heard 400 trials, 200 with /h/
and 200 with /k/, including the same number of trials for each type
of comparison, V1, full vowel and coarticulation. Since we
maintained the order of the voices for A, B and X, one group of
subjects heard A1 B2 A3 and B1 A2 A3 for a given pair while the other
group heard B1 A2 B3 and A1 B2 B3. This way, only one token was
heard twice.
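The trial arithmetic above can be checked with a small sketch. The pair counts come from the list of comparison types; the function `orders_for_group` is a hypothetical simplification that gives each group the same two orders for every pair, which is only one way to read the counterbalancing described in the text.

```python
# Presentation orders for a pair (A, B): the third letter is the X item.
ORDERS = ["ABA", "ABB", "BAA", "BAB"]

# Number of pairs per comparison type, as listed in the Stimuli section.
PAIR_COUNTS = {"NfV": 10, "Nsp": 40, "spsp": 100, "spfV": 50}

n_pairs = sum(PAIR_COUNTS.values())   # total number of pairs
n_trials = n_pairs * len(ORDERS)      # all possible trials


def orders_for_group(group):
    """Assumed split: group 1 hears the orders where X repeats A, group 2 where X repeats B."""
    target = "A" if group == 1 else "B"
    return [order for order in ORDERS if order[2] == target]


trials_per_subject = n_pairs * len(orders_for_group(1))
```

Under this reading, each group hears two of the four orders for every pair, which halves the number of trials per subject.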
2.3.3 Procedure
Participants were the same as in Experiment 1. They sat in front of a
computer in a soundproof booth. All instructions were displayed on
screen in Japanese, and additional explanations were given orally
when necessary. Participants heard the test stimuli through
headphones. They were told that they would hear triplets of words
without meaning, of the same kind as the ones heard previously.
The last word would be either identical to the first one, or to the
second one. They were to indicate which one they thought it was by
pressing on the left (X=A) and right (X=B) arrow keys. Participants
were asked to reply as fast as possible, and their response triggered
the next trial with an ISI of 1 s. They were told to make a random
guess if they were not sure. A training session of 10 items preceded
the actual task. Training items were taken at random from the test
stimuli of the other group. Feedback was given during training by
means of a green O (correct) or a red X (incorrect) appearing on the
screen following the response. No feedback was given during test
trials and the screen remained black. For each trial, we collected the
data ‘response’ (‘A’ or ‘B’) and ‘reaction time’. We did not use
equipment to collect precise reaction times, but since there were only
2 options, we can compare the reaction times to get a rough idea of
how difficult the decision was. Additionally, these data points serve
to exclude trials where the reply came too fast (accidental keypress)
or too slowly (no longer reflecting a first impression). The experiment
lasted about 45 minutes, and included 3 self-paced breaks,
scheduled every 100 trials. A message on the screen told participants
that they could take a break, which they confirmed by pressing
the space bar. Hitting the space bar again launched the following
block.
Figure 13. ABX discrimination task paradigm. A, B, and X were produced by 3 different speakers (2 females, 1 male), and played at 500ms intervals. Keypress triggered next trial with 1 s ISI.
2.3.4 Results
We first looked at the percentage of correct responses across types
of comparison and consonant. We conducted a mixed ANOVA with
Consonant and Comparison Type as within-subjects factors, and
included Subjects and Group as random factors. We found a
significant effect of Type (F(3, 9251) = 33.5, p<.0001) but not of
Consonant (p>.05). There was an interaction between Type and
Consonant (F(3, 9251) = 12.1, p<.0001). We plotted in Figure 14
below the percentages of correct answers in ABX in each type of
comparison and for /h/ and /k/ comparisons.
Figure 14. ABX discrimination task. Percentage of correct answers across types of comparison and consonant. Error bars are standard errors of the mean.
Reaction times seemed to complement the observations on
percentage of correct answers: lower proportions of correct answers
corresponded to higher mean reaction times. A mixed ANOVA on
reaction times with the within-subjects variables Pair Type, Consonant
and Correctness, and with Subjects as a random factor, showed a main
effect of Correctness (F(1, 9243) = 46.2, p<.0001), an interaction
between Pair Type and Consonant (F(3, 9243) = 2.7, p<.05) and an
interaction between Pair Type and Correctness (F(3, 9243) = 8.1,
p<.0001).
Figure 15. ABX discrimination task. Mean reaction times across types of comparison and consonant. Error bars are standard errors of the mean.
If we define a measure of perceptual distance between two tokens,
we can estimate the difficulty of each comparison. This can be
done by referring to the results of Experiment 1. However, since we
did not test natural clusters and full medial vowel items in
Experiment 1 for time and logistical reasons, we need to
make some assumptions:
- natural clusters : roughly equal to the splices where the
consonant cluster came from another token of the same V1
context (e.g. ahpa≃ahapa)
- items with a full medial vowel : we assume that if our
participants were presented with a token including a full
medial vowel, there would be no ambiguity in identifying
said vowel (e.g. ahopa → ‘o’, ahapa → ‘a’)
These hypotheses allowed us to estimate a ‘perceptual’ distance for
a given pair of stimuli. As an example, let us take akipa : in the first
experiment, akipa elicited a certain percentage of responses of each
category. If we use these percentages as the coefficients of a
vector, we can then compute the Euclidian distance between any
such 2 vectors. Let a and b be the vectors representing A and B in an
ABX trial:
$$ d(a, b) = \sqrt{\sum_{i} (a_i - b_i)^2} $$
This measure would approximate the perceptual similarity between
two items: the larger its value, the more distinct the two items are.
We can then plot the percentage of correct answers in the ABX task
against this value for each pair.
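The distance measure can be sketched as follows. The response proportions below are invented for illustration, the helper names are hypothetical, and including ‘no vowel’ as a sixth vector component is an assumption about how the vectors were built. The full-vowel item is coded as an unambiguous response, following the assumption stated above.

```python
import math

# Response categories used as vector components ('no vowel' included by assumption).
CATEGORIES = ("a", "i", "u", "e", "o", "no vowel")


def response_vector(proportions):
    """Turn a {category: proportion} mapping into a fixed-order vector."""
    return [proportions.get(category, 0.0) for category in CATEGORIES]


def euclidean_distance(a, b):
    """Euclidian distance between two response vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Invented proportions for a spliced /kp/ item (not the values from Experiment 1).
akipa = response_vector({"i": 0.3, "u": 0.6, "no vowel": 0.1})
# Full medial vowel item: assumed to be identified without ambiguity.
akupa = response_vector({"u": 1.0})

d = euclidean_distance(akipa, akupa)
```

With proportions rather than percentages, two items that are each labelled unanimously but differently lie at the maximum distance of √2 ≈ 1.41.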
To simplify the data, we grouped together the conditions involving
a full medial vowel (natural cluster – full vowel and spliced cluster –
full vowel) and the conditions involving only clusters (spliced
cluster – spliced cluster and natural cluster – spliced cluster).
[Plot, cluster vs full vowel comparisons: percentage of correct answers (y-axis, 50% to 100%) against A-B Euclidian distance (x-axis, 0 to 1.6), with /h/ and /k/ items plotted separately and one linear fit per consonant (R² = 0.1842 and R² = 0.552).]
Although the R² values of the linear regressions were all below .6,
qualitatively a greater Euclidian distance between two tokens did
seem to result in better performance. Looking at the Euclidian
distances among /hp/ tokens and comparing them to the distances
between /kp/ cluster tokens (Figure 16b), we also see confirmation
that /kp/ tokens were less perceptually distinct from one another.
Due to time and word limit constraints, further analyses of
these data will be conducted later, outside the scope of this written
work.
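For reference, the R² of a simple linear fit of percentage correct against perceptual distance can be computed as in the sketch below. This is a generic ordinary least squares computation with a hypothetical function name, not the procedure used to produce the trend lines in the figures, and the test values are invented.

```python
def r_squared(xs, ys):
    """R² of the ordinary least squares line fitted to the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual and total sums of squares.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Invented points lying exactly on a line: the fit explains all the variance.
perfect = r_squared([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
# Invented points with no linear trend: the fit explains none of it.
flat = r_squared([0.0, 1.0, 2.0], [0.0, 1.0, 0.0])
```

An R² below .6 therefore means that more than 40% of the variance in percentage correct is left unexplained by the distance measure.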
2.3.5 Discussion
Ideally, we would have tested different participants in
Experiment 1 and Experiment 2. However, due to logistic and time
constraints, this was not possible. Testing different subjects could
have allowed for more fine-grained analyses. For instance, we
derived our estimate of perceptual distance from the responses in
the vowel-labelling task, but we only had actual data for the
splice1-splice2 comparisons. For the items that we did not present in
Experiment 1 for lack of time, we had to rely on hypotheses about
the responses.
Additionally, including the other two voices in the
vowel-labelling task, for example with different groups of subjects,
could also have yielded better estimates of perceptual distance. This
might improve the correlation between the results of both tasks.
We kept only one voice in the vowel-labelling task for simplicity,
as subjects started with this experiment to familiarize themselves
with the type of stimuli used in both tasks.
Figure 16a. Percentage of correct answers in ABX task against Euclidian distance between the compared tokens. Cluster vs full vowel comparisons
[Plot, cluster vs cluster comparisons: percentage of correct answers (y-axis, 0% to 100%) against A-B Euclidian distance (x-axis, 0 to 1.2), with /h/ and /k/ items plotted separately and one linear fit per consonant (R² = 0.4689 and R² = 0.0226).]
Figure 16b. Percentage of correct answers in ABX task against Euclidian distance between the compared tokens. Cluster vs cluster comparisons
3. Speech Recognition Models
3.1 Choice of method
While looking at behavioral data can give us direct insight into
subjects’ perception of our stimuli, it is also interesting to look at
what computational models can predict for the perception of non-
native segments. Speech recognition algorithms take into account
the sound structure of their target language and map an acoustic
input to the most plausible segments in this structure. We will be
looking in particular at Hidden Markov Models created with HTK
Speech Recognition Toolkit (CUED, 2009).
Hidden Markov Models (HMMs) can represent time series
with a succession of states. HTK in particular performs speech
recognition by sequencing the speech signal into multiple evenly
spaced parameter vectors, based on the approximation that for an
instant t the signal is stationary and can thus be fully described by
those parameter coefficients. In HMMs, the output of a state is
accessible while the state itself is hidden. However, each state has a
probability distribution over the various outputs. HTK represents
output distributions by Gaussian Mixture Densities. Considering the
sequence of outputs generated, estimations can therefore be made
on the sequence of states. This relates to speech
recognition in that models can be trained to recognize particular
target sequences using multiple exemplars of those targets. The
parameters of a model can be estimated with increasing precision
as more exemplars are given. Then, when presented with an
unknown sequence, the likelihood that each of the possible models
generated it is computed, and the most likely model is chosen as
the transcription.
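The principle of picking the most likely model can be illustrated with a toy sketch. It is deliberately much simpler than HTK: outputs are discrete symbols rather than Gaussian mixture densities over parameter vectors, the likelihood is computed with the forward algorithm, and the two models and all their probabilities are invented for illustration.

```python
def forward_likelihood(obs, start, trans, emit):
    """P(obs | model) for a discrete-output HMM, computed with the forward algorithm.

    start[s]    : probability of starting in state s
    trans[p][s] : probability of moving from state p to state s
    emit[s][o]  : probability that state s outputs symbol o
    """
    n_states = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n_states)) * emit[s][o]
            for s in range(n_states)
        ]
    return sum(alpha)


# Two invented left-to-right models sharing the same structure.
START = [1.0, 0.0]
TRANS = [[0.5, 0.5], [0.0, 1.0]]
MODELS = {
    "X": [[0.9, 0.1], [0.1, 0.9]],  # state 0 favours symbol 0, state 1 favours symbol 1
    "Y": [[0.1, 0.9], [0.9, 0.1]],  # the reverse pattern
}

observed = [0, 0, 1, 1]  # an "unknown" output sequence
likelihoods = {
    name: forward_likelihood(observed, START, TRANS, emit)
    for name, emit in MODELS.items()
}
best = max(likelihoods, key=likelihoods.get)
```

Here model X is selected, because a run of 0 symbols followed by a run of 1 symbols is far more probable under its output distributions than under those of model Y.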
3.2 Stimuli
Since the aim is to see what the model would predict for the
perception of coarticulated consonant clusters, we ran the speech
recognition models on the cross-spliced stimuli used in the two
experiments. The one difference is that we kept short silences at
the beginning and the end of the sound files, to allow for the
model’s tendency to transcribe speech boundaries at the beginning
and end of a file.
3.3 Procedure
Starting from a model of Japanese generated previously by
Emmanuel Dupoux’s PhD student Abdellah Fourtassi, we used the
annotated part of the Corpus of Spontaneous Japanese (about 500k
words, 45 hours of speech) to retrain 20 different models. Each
model was retrained on a separate part of the 45 hours, balancing
for male and female speakers, using the retraining pipeline made by
Dupoux et al. The retrained models were then readapted for
speaker variation and recording conditions using our recordings
with a full medial vowel. The final models were then made to
transcribe the cross-spliced stimuli used in the experiments. Such
models transcribe the acoustic signal phoneme by phoneme and
can be considered purely acoustic. Later, we added bigram
transition probabilities to the model: bigram models can take into
account some phonotactic constraints, as illegal sequences will
simply be composed of states with a quasi-null transition
probability in-between.
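How bigram transition probabilities encode phonotactics can be shown with a small sketch. The four romanized phoneme strings are a made-up toy corpus, not the Corpus of Spontaneous Japanese, and the estimates are plain relative frequencies without smoothing, so an unattested sequence gets exactly zero here rather than the quasi-null probability mentioned above.

```python
from collections import Counter


def bigram_probabilities(sequences):
    """Estimate P(next | previous) by relative frequency over phoneme sequences."""
    pair_counts = Counter()
    first_counts = Counter()
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            pair_counts[(prev, nxt)] += 1
            first_counts[prev] += 1
    return {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}


# Toy corpus of phoneme sequences (invented for illustration).
corpus = [list("hito"), list("hako"), list("kuki"), list("kapu")]
probs = bigram_probabilities(corpus)

p_h_then_i = probs[("h", "i")]           # attested: /h/ followed by a vowel
p_h_then_p = probs.get(("h", "p"), 0.0)  # unattested cluster /hp/
```

A recognizer that multiplies acoustic scores by such probabilities will strongly disfavour any transcription containing /hp/ or /kp/, which is why more ‘vowel’ transcriptions are expected once bigrams are added.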
The resulting transcriptions were of course imperfect. An
accuracy of 60% with speech recognition systems is usually already
considered good. Transcriptions were generated as MLF files,
which can be read in any plain-text editor.
For example, below is a transcription of uhepu:
"/retrain/csplice/uhepu.rec"
0 200000 <s> -112.915863
200000 1100000 k -570.235291
1100000 2300000 u -907.622375
2300000 3600000 h -1078.534424
3600000 4600000 e: -822.209900
4600000 6100000 t -960.711792
6100000 6800000 u -544.606140
6800000 8800000 </s> -1401.380615
The first two columns indicate the time points, the 3rd
column gives the phoneme transcription at each time point, and the
last column gives a measure of how accurately the transcription
model fitted the output. The more negative this score is, the better
the fit.
Silences and closures are problematic for speech
recognition, because their models are very similar. This explains the
/k/ at the beginning of the transcription. Stops were often
hallucinated at the beginning and end of the tokens. While
automating (in Python) the coding process of these transcriptions
for data analysis, we adopted some criteria:
- stops at boundaries were ignored
- we looked for segments contained within 2 consonants
- if the first consonant was /h/ or /k/, then the middle
segment was taken to be the choice of epenthetic vowel (or
lack thereof)
- if the middle segment was a consonant or noise, the trial
was coded ‘no vowel’
- if there were multiple vowel segments in the middle
segment, the one with the most negative score was kept.
- long vowels were transcribed as their shorter counterparts
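The coding criteria above can be sketched in Python as follows. This is one possible reading of the criteria, not the script actually used: the function names, the set of stop labels and the treatment of any non-vowel label as a consonant or noise are assumptions. The example input is the uhepu transcription shown earlier.

```python
VOWELS = {"a", "i", "u", "e", "o"}
STOPS = {"p", "t", "k", "b", "d", "g"}  # assumed set of stop labels
BOUNDARIES = {"<s>", "</s>"}


def parse_rec(text):
    """Read 'start end label score' rows; skip the file-name line and boundary marks."""
    segments = []
    for line in text.strip().splitlines():
        fields = line.split()
        if len(fields) != 4 or fields[2] in BOUNDARIES:
            continue
        label = fields[2].rstrip(":")  # long vowels -> short counterparts
        segments.append((label, float(fields[3])))
    return segments


def code_epenthesis(segments):
    """Return the epenthetic vowel, 'no vowel', or None if the item cannot be coded."""
    segs = list(segments)
    while segs and segs[0][0] in STOPS:   # ignore stops at the boundaries
        segs.pop(0)
    while segs and segs[-1][0] in STOPS:
        segs.pop()
    for i, (label, _) in enumerate(segs):
        if label in ("h", "k"):
            j = i + 1
            candidates = []
            while j < len(segs) and segs[j][0] in VOWELS:
                candidates.append((segs[j][1], segs[j][0]))
                j += 1
            if j == len(segs):
                return None               # no closing consonant found
            if not candidates:
                return "no vowel"         # consonant or noise right after /h/ or /k/
            return min(candidates)[1]     # keep the most negative score
    return None


uhepu = '''
"/retrain/csplice/uhepu.rec"
0 200000 <s> -112.915863
200000 1100000 k -570.235291
1100000 2300000 u -907.622375
2300000 3600000 h -1078.534424
3600000 4600000 e: -822.209900
4600000 6100000 t -960.711792
6100000 6800000 u -544.606140
6800000 8800000 </s> -1401.380615
'''
coded = code_epenthesis(parse_rec(uhepu))
```

For this example the hallucinated initial /k/ is dropped, the long vowel is shortened, and the segment between /h/ and the following consonant gives the code ‘e’.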
We thus obtained transcriptions of 150 cross-spliced non-words per
model, for a total of 3000 transcriptions (20 models × 150 items).
With the monophone models, we would expect to find only effects of
coarticulation. Context effects should appear mainly with the
addition of bigrams.
3.4 Results
Monophone models
Qualitatively, results also showed differences between /hp/ items
and /kp/ items.
Figure 18. Items with /hp/ and /kp/ clusters, all responses: percentages of responses in each category (a, i, u, e, o, no vowel) according to V1 (smaller graduations) and coarticulation in the consonant cluster (categories delimited by white lines).
/hp/ clusters appeared to elicit more varied responses than /kp/
clusters. Additionally, the models seemed biased towards ‘a’
responses in /h/ clusters. Coarticulation with /i/ and /e/ also induced
greater proportions of ‘i’ and ‘e’ results, but not as much as was
observed in the behavioral data. ANOVAs on ‘no vowel’ responses
according to consonant for the 3 speakers showed significant
effects of Consonant only for the two female speakers (ac: F(10,10)
= 28.1, p<.0001; sh: F(10,10) = 18.9, p<.0001; both shown on the left
in Figure 19 below).
Figure 19. Percentage of ‘no vowel’ transcriptions for /hp/ and /kp/ clusters by monophone models
Using the ‘vowel’ response transcriptions, we conducted mixed
ANOVAs for each type of response (‘a’, ‘i’, ‘u’, ‘e’, ‘o’), with
Consonant, Coarticulation and V1 context as within-subjects factors,
and Model as a random effect. This is comparable to treating the
models as subjects in the vowel-labelling task (1 model ≃ 1
participant).
‘a’ responses
We found main effects of Consonant (F(1, 2973) = 0.8, p<.0001) and
Coarticulation (F(1, 2973) = 59.5, p<.0001). We also found an
interaction of Consonant and Coarticulation (F(1, 2973) = 83.5,
p<.0001) and an interaction of V1 Context and Coarticulation (F(1,
2973) = 4.1, p<.05) . There was no main effect of V1 Context (p>.3)
nor interaction of V1 Context and Consonant (p>.05) or three-way
interaction between V1 Context, Coarticulation and Consonant
(p>.3). Since the monophone model is a purely acoustic one, strong
effects of coarticulation can indeed be expected. As with human
subjects, most ‘a’ responses occurred when both V1 context and
Coarticulation pointed to /a/.
Figure 20. Context and Coarticulation effects for /hp/ and /kp/ items on ‘a’ responses by monophone models. Error bars are standard errors of the mean.
‘e’ responses
We found a main effect of Coarticulation (F(1, 2973) = 20.3, p<.0001).
There were no main effects of Consonant (p>.5) or V1 Context
(p>.05). We found an interaction of Consonant and Coarticulation
(F(1, 2973) = 14.7, p<.0001) and a three-way interaction of V1
Context, Consonant and Coarticulation (F(1, 2973) = 5.7, p<.02).
Figure 21 below shows a surprising phenomenon: ‘e’ responses were
more frequent in contexts where V1 was another vowel.
Figure 21. Context and Coarticulation effects for /hp/ and /kp/ items on ‘e’ responses by monophone models. Error bars are standard errors of the mean.
‘i’ responses
We found main effects of Consonant (F(1, 2973) = 13.1, p<.0005),
Coarticulation (F(1, 2973) = 32.3, p<.0001) and V1 Context (F(1, 2973)
= 26.0, p<.0001). We found an interaction of V1 Context and
Coarticulation (F(1, 2973) = 40.2, p<.0001) and a three-way
interaction of V1 Context, Consonant and Coarticulation (F(1, 2973)
= 3.9, p<.05). Interestingly, the percentage of ‘i’ transcriptions never
exceeded 50% even when Coarticulation and Context concurred. By
contrast, participants in the vowel-labelling task gave over 95% of ‘i’
responses in the same situation. It would seem that the models gave
more ‘conservative’ transcriptions than the subjects, as most of the
other transcriptions in this situation were either the default
epenthesis ‘u’, or ‘no vowel’. The model can transcribe consonant
clusters as containing no vowel because /i/ or /u/ devoicing in
Japanese can sometimes produce a surface realization close to a
consonant cluster.
Figure 22. Context and Coarticulation effects for /hp/ and /kp/ items on ‘i’ responses by monophone models. Error bars are standard errors of the mean.
‘o’ responses
We found main effects of Consonant (F(1, 2973) = 23.4, p<.0001),
Coarticulation (F(1, 2973) = 28.8, p<.0001) and V1 Context (F(1, 2973)
= 6.4, p<.02). We found an interaction of V1 Context and
Coarticulation (F(1, 2973) = 31.4, p<.0001), Consonant and
Coarticulation (F(1, 2973) = 161.1, p<.0001), V1 Context and
Consonant (F(1, 2973) = 7.0, p<.01) and a three-way interaction of V1
Context, Consonant and Coarticulation (F(1, 2973) = 40.3, p<.0001).
‘o’ transcriptions were few in general. Some /kup/ clusters were
mapped to /o/, as /u/ in our stimuli was more rounded than the
native category.
Figure 23. Context and Coarticulation effects for /hp/ and /kp/ items on ‘o’ responses by monophone models. Error bars are standard errors of the mean.
‘u’ responses
We found a main effect of Consonant (F(1, 2973) = 41.2, p<.0001), and
no other significant effect (Coarticulation: p>.5; V1 Context:
F(1, 2973) = 6.4, p>.5). The models do seem to use /u/ as the default
epenthetic vowel to some degree, leading to a substantial number of
‘u’ transcriptions in both /h/ and /k/ contexts (37% for /h/ contexts,
and 53% for /k/ contexts). For /h/ contexts, this is more than what
participants produced; for /k/ contexts, it is less.
Figure 20. Context and Coarticulation effects for /hp/ and /kp/ items on ‘u’ responses by monophone models. Error bars are standard errors of the mean.
Discussion
The monophone models also transcribed epenthesis and showed an
influence of coarticulation. Except for ‘u’ transcriptions, there were
main effects of Coarticulation for all vowel responses. Model
transcriptions appeared to be somewhat more ‘conservative’ than
subject responses. As the default epenthetic vowel, /u/ was chosen
much more frequently to epenthetise /hp/ clusters than in
Experiment 1. Effects of /i/ and /e/ coarticulation were also less
pronounced. Looking at the transcription data in detail, it also
appeared that transcriptions varied from speaker to speaker. For better
comparability with the vowel-labelling task, we should arguably
restrict our analyses to the transcriptions of the voice used in
Experiment 1. However, since this work with model transcriptions is
mostly exploratory, and for reasons of time, we ran the
general analysis including all speakers here. Qualitatively,
restricting the comparison to that voice yields a stronger
resemblance between experimental and model data:
Figure 24. Items with /hp/ and /kp/ clusters, all responses: percentages of responses in each category (a, i, u, e, o, no vowel) according to V1 (smaller graduations) and coarticulation in the consonant cluster (separated by white lines). Panels: subject responses; model transcriptions. Model transcriptions appear more similar to the experimental data when we consider only the responses to the voice used in Experiment 1.
Bigram models
We also transcribed the same stimuli with bigram models, which add
transitional probability information to the monophone models.
Bigram models are a way to approximate
phonotactic constraints, as illegal sequences will have very low
transitional probabilities. We should therefore expect to observe more
‘vowel’ transcriptions. Qualitatively, this is indeed the case (Figure
24 below).
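To make the idea of transitional probabilities concrete, here is a minimal sketch in Python of how bigram probabilities can be estimated from phoneme sequences. It is an illustration only: the toy corpus, the function name bigram_probabilities and the smoothing constant are assumptions made for this example, and they stand in for neither the Corpus of Spontaneous Japanese nor the HTK training procedure used for the actual models.

```python
from collections import Counter

# Hypothetical toy corpus of phoneme sequences (assumption for illustration;
# this is NOT the Corpus of Spontaneous Japanese).
toy_corpus = [
    ["h", "a", "k", "u"],
    ["k", "u", "p", "a"],
    ["h", "i", "k", "u"],
    ["k", "a", "h", "o"],
    ["p", "a", "k", "u"],
]


def bigram_probabilities(sequences, smoothing=0.01):
    """Estimate P(next | previous) with additive smoothing over the inventory."""
    inventory = sorted({phone for seq in sequences for phone in seq})
    pair_counts = Counter()
    prev_counts = Counter()
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            pair_counts[(prev, nxt)] += 1
            prev_counts[prev] += 1
    probs = {}
    for prev in inventory:
        # Smoothing keeps unseen transitions small but non-zero.
        denom = prev_counts[prev] + smoothing * len(inventory)
        for nxt in inventory:
            probs[(prev, nxt)] = (pair_counts[(prev, nxt)] + smoothing) / denom
    return probs


probs = bigram_probabilities(toy_corpus)
# The unattested cluster h -> p receives a much lower probability
# than the attested consonant-vowel transition h -> a.
print(probs[("h", "p")] < probs[("h", "a")])
```

In this toy setting a consonant-consonant transition such as h → p is never observed, so its estimated probability stays close to zero, which is the sense in which a bigram model can approximate a phonotactic constraint.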
Figure 24. Items with /hp/ and /kp/ clusters, all responses: percentages of responses in each category (a, i, u, e, o, no vowel) according to V1 (smaller graduations) and coarticulation in the consonant cluster (separated by white lines). Model transcriptions appear more similar to the experimental data when we consider only the responses to the voice used in Experiment 1.
‘No vowel’ transcriptions accounted for only 3.1% of all
transcriptions. ANOVAs on ‘no vowel’ responses by
consonant for each of the 3 speakers showed no significant effect for any
speaker (p>.05).
Figure 25. Percentage of ‘no vowel’ transcriptions for /hp/ and /kp/ clusters by monophone models
For context and coarticulation effects, we carried out the same
analyses as for the monophone models.
‘a’ responses
We found main effects of Consonant (F(1, 2973) = 80.2, p<.0001)
and Coarticulation (F(1, 2973) = 59.5, p<.0001). We also found an
interaction of Consonant and Coarticulation (F(1, 2973) = 48.8,
p<.0001) and interactions of V1 Context and Coarticulation (F(1,
2973) = 4.3, p<.001), V1 Context and Consonant (F(1, 2973) = 4.3,
p<.05) and Coarticulation and Consonant (F(1, 2973) = 29.2,
p<.0001). As the plots below show, ‘a’ responses are mainly
triggered by V1 = a contexts. The fact that proportions of ‘a’
responses are lower in each context than for monophone models is
explained by the overall greater number of ‘a’ responses in
the bigram models’ transcriptions, especially in /h/ contexts. This
bias could suggest that /h/ → /ɑ/ transitions are more
frequent in the Japanese lexicon than the other combinations.
Figure 26. Context and Coarticulation effects for /hp/ and /kp/ items on ‘a’ responses for bigram models. Error bars are standard errors of the mean.
‘e’ responses
We found main effects of Coarticulation (F(1, 2973) = 25.9, p<.0001)
and V1 Context (F(1, 2973) = 4.9, p<.005). We found interactions of
Consonant and Coarticulation (F(1, 2973) = 5.9, p<.0) and of V1 Context
and Coarticulation (F(1, 2973) = 8.7, p<.005). The plot below suggests
that the models were more sensitive to /e/ coarticulation than
human subjects, possibly because the corpus they were trained on
contained input from speakers of more varied geographical
origins than our subjects (mostly from the Tokyo area).
Figure 27. Context and Coarticulation effects for /hp/ and /kp/ items on ‘e’ responses for bigram models. Error bars are standard errors of the mean.
‘i’ responses
We found main effects of Consonant (F(1, 2973) = 32.4, p<.0001),
Coarticulation (F(1, 2973) = 30.5, p<.0001) and V1 Context (F(1, 2973)
= 23.9, p<.0001). We found an interaction of V1 Context and
Coarticulation (F(1, 2973) = 54.0, p<.0001). As with the
monophone transcriptions, the proportion of ‘i’
transcriptions was lower than in the vowel-labelling task.
Figure 7. Context and Coarticulation effects for /hp/ and /kp/ items on ‘i’ responses for bigram models. Error bars are standard errors of the mean.
‘o’ responses
We found main effects of Consonant (F(1, 2973) = 32.2, p<.0001) and
Coarticulation (F(1, 2973) = 58.2, p<.0001). We found an interaction
of V1 Context and Coarticulation (F(1, 2973) = 8.9, p<.005),
Consonant and Coarticulation (F(1, 2973) = 185.4, p<.0001), V1
Context and Consonant (F(1, 2973) = 8.9, p<.0002) and a three-way
interaction of V1 Context, Consonant and Coarticulation (F(1, 2973)
= 36.3, p<.0001). As with monophone models, ‘o’ transcriptions
were rare overall. Some /kup/ clusters were mapped to /o/, as the /u/
in our stimuli was more rounded than the native category.
Figure 28. Context and Coarticulation effects for /hp/ and /kp/ items on ‘o’ responses for bigram models. Error bars are standard errors of the mean.
‘u’ responses
We found a main effect of Consonant (F(1, 2973) = 65.9, p<.0001) and no
other significant effect (Coarticulation: p>.5; V1 Context: F(1, 2973)
= 6.4, p>.5). Bigram models gave more ‘u’ transcriptions in /k/
clusters than monophone models (64% against 53%). This suggests that
the k → u transition probability is higher than that between k and
any other vowel.
3.5 Discussion
Overall, bigram models’ transcriptions allowed for more influence
of V1 context. Coarticulation still played a significant part in
predicting the transcription. As with monophone results,
considering transcriptions only for the voice used in the behavioral
experiment could have given more comparable results.
Following on these analyses, we could also try to model
results for the ABX task, using comparisons between the
transcriptions given for all stimuli used. Again, due to time
limitations, further analyses will be done outside the scope of this
written work.
It is extremely interesting to observe that speech
recognition models also show a variation in the epenthesis effect.
For monophone models, this suggests that acoustic coarticulation
cues were sufficient to switch the choice of epenthetic vowel when
coarticulation and V1 context did not concur. Bigram models
approximate some measure of top-down phonological knowledge
by describing phonotactic constraints with variations in transition
probabilities. With the addition of this information, coarticulation
still influenced the transcriptions, while frequent sequences
modulated the proportions of responses corresponding to the
coarticulation information.
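The way transition probabilities can modulate an acoustically driven choice can be sketched as follows. This is a deliberately simplified, single-slot illustration and not the HTK decoding used for our transcriptions: the function pick_vowel, the weight parameter, and all numerical values are hypothetical and were chosen only to show the mechanism.

```python
import math

VOWELS = ["a", "e", "i", "o", "u"]


def pick_vowel(acoustic_loglik, bigram_prob, prev="h", nxt="p", weight=1.0):
    """Pick the vowel V maximising acoustic score + weight * log P(prev->V->nxt)."""
    def score(vowel):
        transition = (math.log(bigram_prob[(prev, vowel)])
                      + math.log(bigram_prob[(vowel, nxt)]))
        return acoustic_loglik[vowel] + weight * transition
    return max(VOWELS, key=score)


# Made-up acoustic log-likelihoods for a cluster carrying /i/ coarticulation.
acoustic_loglik = {"a": -5.8, "e": -7.0, "i": -5.0, "o": -8.0, "u": -5.5}

# Made-up transition probabilities in which h -> a is the most frequent.
bigram_prob = {("h", v): 0.15 for v in VOWELS}
bigram_prob[("h", "a")] = 0.40
bigram_prob.update({(v, "p"): 0.20 for v in VOWELS})

acoustic_only = pick_vowel(acoustic_loglik, bigram_prob, weight=0.0)
with_bigram = pick_vowel(acoustic_loglik, bigram_prob, weight=1.0)
print(acoustic_only, with_bigram)  # i a
```

With the weight set to zero the choice follows the coarticulation cue alone, whereas adding the transition term lets a frequent sequence override a small acoustic advantage, which mirrors the qualitative difference between monophone and bigram transcriptions.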
4. Discussion and concerns
In this project, we examined the influence of coarticulation cues on
the choice of epenthetic vowel after /h/ and /k/ by native speakers of
Japanese. Our results suggest that coarticulation does indeed play
an important part in predicting participants’ answers, but not to the
same degree for every vowel. /i/ coarticulation in particular strongly
predicted /i/ epenthesis, even when the vowel preceding /h/ was not
/i/. This would be problematic for any account of epenthesis
after /h/ based on a vowel-copy rule. The fact that /i/ coarticulation also
induced /i/ epenthesis after /k/, where the default epenthetic vowel
should be /u/, further supports the idea that epenthetic repair happens
during perception: two-step phonological theories
would instead predict accurate perception of the /kp/ cluster,
followed by repair with the default /u/ epenthesis. Looking at
the way our test stimuli were transcribed by speech recognition
algorithms based on Hidden Markov models, we could observe the
responses of purely acoustic systems optionally equipped with
some degree of phonotactic knowledge. These models were also
sensitive to the influence of coarticulation. Trained on Japanese
speech, these models approximated the perception of Japanese
speakers. In the case of monophone models, transcriptions were
based solely on acoustic information. Epenthesis according to
coarticulation in these models’ transcription suggests that
coarticulation cues are acoustically salient enough to be perceived
as an epenthetic vowel. Bigram models modulated the influence of
coarticulation with the addition of transition probabilities.
The experiments having been carried out in Paris, we could
only recruit and test a limited number of participants in the period
of time allotted for this project. Using all three voices in
the identification task would have allowed for an analysis of speaker
effects, as well as better data for correlating the two experiments.
This would also have provided more complete comparisons for the
model transcriptions. A more accurate measure of reaction times
could also
Further analyses could be done on both our experimental
and model data. However, they would exceed the scope of this
written work, submitted for partial fulfillment of the requirements
for academic year 2013-2014 in the Research Master in Cognitive
Science (EHESS, ENS, Paris V). We plan to refine these analyses in
the near future.
5. Appendices
Items recorded at 16 kHz:
Natural clusters   ahpa   ehpe   ihpi   ohpo   uhpu
                   akpa   ekpe   ikpi   okpo   ukpu
ABX                ahapa  ehepe  ihipi  ohopo  uhupu
                   akapa  ekepe  ikipi  okopo  ukupu
Non-word test stimuli V1C?pV1, obtained by cross-splicing
recorded items with the Praat speech analysis software (Boersma &
Weenink):
Colour indicates the origin of /hp/ (for example, ehape was created by
cross-splicing /hp/ from ahpa into ehpe). On the diagonal, items
such as ahapa were created by cross-splicing /hp/ from another
recording of ahpa, to neutralize possible effects of a splicing
artefact and make them more comparable to the other spliced items.
Control stimuli V1k?pV1 were obtained by a similar process.
V1 choice
V2 choice   /a/     /e/     /i/     /o/     /u/
/a/         ahapa   ehape   ihapi   ohapo   uhapu
/e/         ahepa   ehpe    ihepi   ohepo   uhepu
/i/         ahipa   ehipe   ihpi    ohipo   uhipu
/o/         ahopa   ehope   ihopi   ohpo    uhopu
/u/         ahupa   ehupe   ihupi   ohupo   uhpu
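The naming scheme of the spliced stimuli can be sketched in a few lines of Python. This only enumerates item labels; it does not perform any splicing, which was done in Praat. The function names are hypothetical, and the sketch assumes that every cell, including the diagonal, is labelled on the pattern of ahapa (V1 + consonant + coarticulation vowel + p + V1).

```python
VOWELS = ["a", "e", "i", "o", "u"]


def stimulus_name(v1, coart, consonant="h"):
    """Label for the item whose cluster was spliced from the `coart` recording."""
    return f"{v1}{consonant}{coart}p{v1}"


def build_grid(consonant="h"):
    """Map (V1, coarticulation vowel) pairs to stimulus labels."""
    return {
        (v1, coart): stimulus_name(v1, coart, consonant)
        for v1 in VOWELS
        for coart in VOWELS
    }


test_grid = build_grid("h")     # test stimuli with /hp/
control_grid = build_grid("k")  # control stimuli with /kp/
print(test_grid[("e", "a")])    # ehape: /hp/ from ahpa spliced into ehpe
```

Each grid has 25 cells, one per combination of the five V1 frames and the five coarticulation sources.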
6. References
Arakawa, S. (1977). Gairaigo jiten [loanword dictionary]. Kadokawa, Tokyo.
Berent, I., Lennertz, T., Jun, J., Moreno, M. A., & Smolensky, P. (2008). Language universals in human brains. Proceedings of the National Academy of Sciences, 105(14), 5321-5325.
Berent, I., Lennertz, T., & Balaban, A. (2011). Language universals and misidentification: a two-way street. Language and Speech, 55(3), 311–330.
Chang, C. B. (2012). Phonetics vs. phonology in loanword adaptation: Revisiting the role of the bilingual. In Proceedings of the 34th Annual Meeting of the Berkeley Linguistics Society: General Session and Parasession on Information Structure, Berkeley, CA. Berkeley Linguistics Society.
Crawford, C. (2007). The role of loanword diffusion in changing adaptation patterns: A study of coronal stops in Japanese borrowings. Working Papers of the Cornell Phonetics Laboratory, 16, 32-56.
Davidson, L. (2007). The relationship between the perception of non-native phonotactics and loanword adaptation. Phonology, 24, 261–286.
Davidson, L. (2011). Phonetic, phonemic, and phonological factors in cross-language discrimination of phonotactic contrasts. Journal of Experimental Psychology: Human Perception and Performance, 37(1), 270.
Davidson, L., & Shaw, J. A. (2012). Sources of illusion in consonant cluster perception. Journal of Phonetics, 40(2), 234-248.
Dupoux, E., Kakehi, K., Hirose, Y., Pallier, C., & Mehler, J. (1999). Epenthetic vowels in Japanese: A perceptual illusion? Journal of Experimental Psychology: Human Perception and Performance, 25, 1568-1578.
Dupoux, E., Parlato, E., Frota, S., Hirose, Y., & Peperkamp, S. (2011). Where do illusory vowels come from?. Journal of Memory and Language, 64(3), 199-210.
Goldsmith, J. A. (1995) The Handbook of Phonological Theory. Blackwell Handbooks in Linguistics. Blackwell Publishers. pp. 817–838.
Hallé, P. A., Segui, J., Frauenfelder, U., & Meunier, C. (1998). Processing of illegal consonant clusters: A case of perceptual assimilation?. Journal of experimental psychology: Human perception and performance, 24(2), 592.
Han, M. (1962). Unvoicing of vowels in Japanese. Onsei no kenkyuu, 10, 81–100.
Irwin, M. (2011). Loanwords in Japanese. John Benjamins Publishing.
Itō, Junko; Mester, R. Armin (1995). "Japanese phonology". In Goldsmith, John A. The Handbook of Phonological Theory. Blackwell Handbooks in Linguistics. Blackwell Publishers. pp. 817–838.
Kamiya, T. (1994). Tuttle new dictionary of loanwords in Japanese: a user's guide to Gairaigo. Tuttle Publishing.
Kindaichi, K., Yamada, T., Shibata, T., Sakai, K., Kuramochi, Y., & Yamada, A. (2000). Shinmeikai Kokugo Jiten [Shinmeikai Japanese-Japanese Dictionary]. Sanseido, Tokyo.
Labrune, L. (2006). La phonologie du japonais, Collection Linguistique, Société de Linguistique de Paris n°90, éditions Peeters, Leuven.
Labrune, L. (2008). Principes d’organisation phonémique des emprunts occidentaux composés abrégés, Revue d’Etudes Japonaises Université Paris 7, 2008, pp. 107-121.
LaCharité, D. & Paradis, C. (2005). Category preservation and proximity versus phonetic approximation in loanword adaptation. Linguistic Inquiry, 36, 223–258.
Ladefoged, P., & Maddieson, I. (1996). The sounds of the world's languages. Blackwell Publishing.
Lovins, J. B. (1975). Loanwords and the phonological structure of Japanese. Bloomington: Indiana University Linguistics Club.
Paradis, C. & LaCharité, D. (1997). Preservation and minimality in loanword adaptation. Journal of Linguistics 33:379-430
Peperkamp, S., Vendelin, I. & Nakamura, K. (2008). On the perceptual origin of loanword adaptations: experimental evidence from Japanese. Phonology, 25, 129-164.
Peperkamp, S. (2005). A psycholinguistic theory of loanword adaptations. In: M. Ettlinger, N. Fleischer & M. Park-Doob (eds.) Proceedings of the 30th Annual Meeting of the Berkeley Linguistics Society. Berkeley, CA: The Society, 341-352.
Peperkamp, S. & Dupoux, E. (2003). Reinterpreting loanword adaptations: The role of perception. In: M.J. Solé, D. Recasens & J. Romero (eds.) Proceedings of the 15th International Congress of Phonetic Sciences. Adelaide: Causal Productions, 367-370.
Pitt, M. (1998). Phonological processes and the perception of phonotactically illegal consonant clusters. Perception & Psychophysics, 60(6), 941–951.
Schmidt, C., (2009) “Chapter 21: Loanwords in Japanese”, In Haspelmath, M., & Tadmor, U. Loanwords in the World's Languages: A Comparative Handbook, Walter de Gruyter, Language Arts & Disciplines. p.545-574.
Shinohara, S. (2000). Default accentuation and foot structure in Japanese: Evidence from adaptations of French words. Journal of East Asian Linguistics, 9, 55-96.
Smith, J. L. (2006) Loan phonology is not all perception: Evidence from Japanese loan doublets. In Timothy J. Vance and Kimberly A. Jones (eds.), Japanese/Korean Linguistics 14, 63-74. Stanford: CSLI.
Tews, A. (2008). Japanese geminate perception in nonsense words involving German [f] and [x]. Gengo Kenkyu, 133, 133-145.
Yamamoto, A. (2009). Korenara oboerareru doitsugo tangochou, NHK press
7. Other resources
Boersma, P. & Weenink, D. (2014). Praat: doing phonetics by computer [Computer program]. Version 5.3.77, retrieved from http://www.praat.org
Breen, J. & Ahlström, K. (2014), Denshi Jisho, online Japanese Dictionary. WWWJDIC project, Electronic Dictionary Research Group. http://jisho.org
Cambridge University Engineering Department (CUED), Hidden Markov Models Toolkit HTK. http://htk.eng.cam.ac.uk/
MixB classifieds, Furansu keijiban. http://fra.mixb.net
National Institute for Japanese Language (NIJLA), Communications Research Laboratory (CRL), Tokyo Institute of Technology (TITech). (2003) Corpus of Spontaneous Japanese
Pinheiro J, Bates D, DebRoy S, Sarkar D and R Core Team (2014). nlme: Linear and Nonlinear Mixed Effects Models. R package version 3.1-117, http://CRAN.R-project.org/package=nlme.
Python Software Foundation (2013). Python Language Reference. Version 2.7.6, retrieved from http://www.python.org
R Core Team (2014). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. Version 3.1.0 "Spring Dance", retrieved from http://www.R-project.org