CHILDES: Japanese corpora and language acquisition research
Susanne Miyata Aichi Shukutoku University
Outline
1. Child speech research in Japan (Pre‐CHILDES)
2. CHILDES Data now
3. Start of CHILDES in Japan
4. The development of CLAN based tools
5. CHILDES influence on research (Summary)
6. Outlook
Child speech research in Japan (Pre‐CHILDES)
Noji Data (print data)
Noji, J. (1973‐74) Yooji no gengo seikatsu no jittai I‐IV. Bunka Hyoron Shuppan.
1 boy, 0;0‐6;11 (ca. 62,000 utt.), diary data (starting 1948)
consecutive records of all (?) child utterances + context
based on hand‐written instant records of the parents (father: a linguist specialized in dialect research)
Child speech research in Japan (Pre‐CHILDES)
Okubo Data (hand‐written print data)
National Institute for Japanese Language (1981‐83). Child Language Materials. 6 vols.
1 boy, 1;0‐4;0 (ca. 90,000 utt.)
based on tape‐recordings of weekly mother‐child conversations
published as hand‐written transcripts
Child speech research in Japan (Pre‐CHILDES)
Noji Data (print) 0;0‐6;11 (ca. 62,000 utt.)
Okubo Data (hand‐written) 1;0‐4;0 (ca. 90,000 utt.)
Much research before 1990 compared findings from English research with the Noji and/or Okubo data, often combined with (unpublished) data collected by the researcher.
Searching for a single item (e.g., negation) would take 2‐3 months.
Child speech research in Japan (Pre‐CHILDES)
Noji Data (print) 0;0‐6;11 (ca. 62,000 utt.)
Okubo Data (hand‐written) 1;0‐4;0 (ca. 90,000 utt.)
+ freely available (library)
+ longitudinal, extensive, reliable
- single cases => further data collection
- print (paper & pencil analysis) => digitization (around 1990)
Jpn CHILDES Data now (10 Nov 2010; violet = in prep.)
JL1 children (ttl 419,000 utt.)
• Aki Corpus: weekly 1;6‐3;0 (M‐C conversation); mor; 48,000 utt.
• Ryo Corpus: weekly 1;6‐3;0 (M‐C conversation); mor; 9,000 utt.
• Tai Corpus: weekly 1;6‐3;1 (M‐C conversation); audio‐linked, mor; 81,000 utt.
• Noji Corpus: diary data 0;0‐7;0 (C utterances with context); 62,000 utt.
• Ishii Corpus: weekly 0;6‐3;8 (F‐C conversation); video‐linked; 95,000 utt.
• Hamasaki Corpus: bimonthly 2;2‐3;7 (M‐C conversation); 37,000 utt.
• MiiPro Corpus: 2 girls, 2 boys (1 girl also F‐C conversations); weekly 1;2‐3;0 (M‐C conversation), audio‐linked, mor; monthly 3;0‐5;0 (M‐C conversation), audio‐linked, mor; 84,000 utt.
• Inaba Corpus Frog Story: 90 stories, ages 3 to 11, 10 children each; audio‐linked; 3,600 utt.
JL1 adults (ttl 50,000 utt.)
• Sakura Corpus: 18 student conversations @ 20 min., groups of 4 students (same sex, opp. sex); audio‐linked, mor; 15,000 utt.
• CallFriend Corpus: 30 telephone conversations; audio‐linked; 33,000 utt.
• Inaba Corpus Frog Story: 50 stories; audio‐linked; 2,000 utt.
JL2 adults (ttl 2,700 utt.)
• Inaba Corpus Frog Story: 50 stories, Jpn level 1‐5, 10 adults each; audio‐linked; 2,700 utt.
Clinical
• Aphasia Corpus: 5 student‐patient conversations; audio‐linked
... and lots of unpublished data
Start of CHILDES in Japan
In the late 1980s:
Noji data: syntactic analysis (Morikawa)
Aki, Ryo, Tai corpora: preparation for CHILDES (Miyata)
=> discussion about CHAT format for Japanese
Oshima call => JCHAT Project 1993
but why a whole project...?!
Start of CHILDES in Japan
CLAN worked only on restricted Latin script (ASCII) => definition of orthography necessary
1) Modern pronunciation is not reflected by the Jpn syllable script: si ‐> ʃi, tu ‐> tsu, ti ‐> tʃi, etc.
Should JCHAT adopt
the modern pronunciation (shi, tsu, chi) <= L2 researchers
or the historical pronunciation (si, tu, ti) <= grammarians ...?
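The two candidate spellings differ in a regular way for the syllables named above, so a transcript written in one convention can in principle be converted to the other. A minimal Python sketch, assuming only the three correspondences listed on this slide (si/shi, tu/tsu, ti/chi); the table `GRAMMARIAN_TO_MODERN` and the function `to_modern` are hypothetical names for illustration and are not part of CLAN or JCHAT:

```python
import re

# Hypothetical table: only the three syllables mentioned on the slide.
GRAMMARIAN_TO_MODERN = {"si": "shi", "tu": "tsu", "ti": "chi"}

# One pattern that matches any of the source syllables.
_SYLLABLE = re.compile("|".join(GRAMMARIAN_TO_MODERN))

def to_modern(word: str) -> str:
    """Rewrite si/tu/ti as shi/tsu/chi; everything else is left untouched."""
    return _SYLLABLE.sub(lambda m: GRAMMARIAN_TO_MODERN[m.group(0)], word)

print(to_modern("tiizu"))  # the "cheese" example in the other spelling
```

A full conversion would need many more correspondences than these three; the point is only that the choice of orthography is a convention that can be mapped mechanically once it is defined.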
Start of CHILDES in Japan
2) No spaces to indicate word boundaries.
chiizu o tabeta no? “did [you] eat the cheese?”
chi [long] zu   o     ta be ta    no
cheese          ACC   eat‐PAST    FINP
チ ー ズ を 食 べ た の 。 <= three different scripts
Katakana: syllable script (loan words)
Hiragana: syllable script
Kanji: Chinese ideograms
(Kana/Kanji)
Changes of script type give some hints, but inconsistent ones.
No common intuitive consensus on what constitutes a word. (Student surveys revealed a broad range of variation and inconsistency.)
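How inconsistent the script hints are can be shown by cutting the example sentence wherever the script type changes. The following Python sketch is an illustration only; it classifies characters by the prefix of their Unicode character name, and the helper names `script_of` and `split_on_script_change` are hypothetical:

```python
import unicodedata
from itertools import groupby

def script_of(char: str) -> str:
    """Classify a character by the prefix of its Unicode name."""
    name = unicodedata.name(char, "")
    if name.startswith("KATAKANA"):
        return "katakana"
    if name.startswith("HIRAGANA"):
        return "hiragana"
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "kanji"
    return "other"

def split_on_script_change(text: str) -> list:
    """Cut the string wherever the script type changes."""
    return ["".join(chars) for _, chars in groupby(text, key=script_of)]

print(split_on_script_change("チーズを食べたの"))
```

The result separates チーズ and を correctly, but it cuts the verb 食べた after its first character and leaves the final particle の attached to the verb ending, so script changes alone cannot serve as a word boundary rule.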
Start of CHILDES in Japan
2) No spaces to indicate word boundaries.
Should JCHAT attach particles? chiizuo tabetano チーズを 食べたの。
or separate particles? chiizu o tabeta no チーズ を 食べた の。
How can we distinguish auxiliaries from inflections? (<= universal inventory of morph. codes)
tabe‐ta eat‐PAST
tabe‐rare‐ta eat‐PASS‐PAST
tabe‐te‐ru eat‐ASP‐PRES <= interpreted as inflection = 3 morphemes
or tabe‐te (i)‐ru eat‐CONN exist‐PRES <= interpreted as auxiliary = 4 morphemes
(ongoing grammaticalization process from auxiliaries to inflections)
Word definition influences MLU results => lively MLU discussion in the 1990s
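The effect of the word definition on MLU can be made concrete with a small calculation. The Python sketch below is not the CLAN MLU program; it assumes that spaces separate words and ASCII hyphens separate morphemes, and the two-utterance sample is invented for illustration:

```python
import re

def count_morphemes(utterance: str) -> int:
    """Count morphemes, assuming spaces separate words and hyphens separate morphemes."""
    return len([m for m in re.split(r"[\s-]+", utterance) if m])

def mlu(utterances) -> float:
    """Mean length of utterance: total morphemes divided by number of utterances."""
    return sum(count_morphemes(u) for u in utterances) / len(utterances)

# The same two utterances under the two analyses discussed above.
as_inflection = ["chiizu o tabe-te-ru", "tabe-ta"]
as_auxiliary = ["chiizu o tabe-te i-ru", "tabe-ta"]

print(mlu(as_inflection), mlu(as_auxiliary))
```

The same speech sample yields a higher MLU under the auxiliary analysis, which is why the segmentation rules had to be agreed on before MLU values from different studies could be compared.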
Start of CHILDES in Japan
JCHAT Project:
‐ Discussions about JCHAT format and MLU
‐ CHILDES Workshops
‐ Research reports
=> developed into an independent research society (from 1999):
The Japanese Society for Language Sciences (JSLS) http://www.jsls.jpn.org/
The development of CLAN based tools
1) MLU
Definitions of words and morphemes
MLU program already available in CLAN
The development of CLAN based tools
1) MLU 2) JMOR: morphological analysis (Naka & Miyata, 1999)
Should the MOR lexicon be based on Latin script or on Japanese script? (CHILDES moving from ASCII to Unicode)
a) Problems with ambiguous reading of Kanji (Chinese ideograms)
(Kanji were introduced with Chinese loan words, then applied to Jpn words as well => 2 readings for the same Kanji.)
食品 shoku hin
食べもの tabe mo no
both: “food”
Chinese reading: shoku vs. Jpn reading: tabe (compare: beverage vs. drink)
=> no one‐to‐one correspondence between ideogram and pronunciation
possible for humans, but difficult for computers
The development of CLAN based tools
1) MLU 2) JMOR
Should the MOR lexicon be based on Latin script or on Japanese script?
a) Kanji (Chinese ideogram) reading is ambiguous.
b) For many words, different writings are possible:
e.g.: ichigo “strawberry”
イチゴ Katakana, syllable script (loan words): scientific
いちご Hiragana, syllable script: soft & natural
苺 Kanji, Chinese ideogram: literary‐refined
<= three different scripts with stylistic differences
even more difficult for computers
The development of CLAN based tools
1) MLU 2) JMOR
Should the MOR lexicon be based on Latin script or on Japanese script?
a) Kanji (Chinese ideogram) reading is ambiguous.
b) For many words, different writings are possible.
If KanaKanji….. => huge lexicon with
• at least two readings for each ideogram
• triple entries anticipating possible stylistic variations
• complex relations between script and grammatical form
=> leads to difficulties with linguistic analysis programs
The development of CLAN based tools
1) MLU 2) JMOR (Naka & Miyata, 1999; Miyata & Naka, 2004)
Our solution (at least for now):
CHAT with Latin script on the main line and Jpn script on a dependent tier. Analysis is based on the Latin script main line.
*CHI: chiizu o tabeta no ?
%ort: チーズを食べたの。
%mor: n|chiizu ptl:case|o v|tabe‐PAST ptl:final|no ?
(but two transcription lines have to be entered...!)
ca. 96% accuracy in combination with POST (written by C. Parisse)
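Because every item on the %mor tier has the shape part‐of‐speech|stem‐SUFFIX, the tier can be read by very simple programs. The Python sketch below is not part of CLAN; `parse_mor` is a hypothetical helper, it handles only the tag shapes that occur in the example above, and it uses an ASCII hyphen where the slide shows a typographic one:

```python
def parse_mor(tier: str) -> list:
    """Split a %mor tier into (part_of_speech, stem, [suffixes]) tuples."""
    items = []
    for token in tier.split():
        if "|" not in token:  # utterance terminator such as "?" or "."
            continue
        pos, form = token.split("|", 1)
        stem, *suffixes = form.split("-")
        items.append((pos, stem, suffixes))
    return items

tier = "n|chiizu ptl:case|o v|tabe-PAST ptl:final|no ?"
parsed = parse_mor(tier)

# One morpheme per stem plus one per suffix.
morpheme_count = sum(1 + len(suffixes) for _, _, suffixes in parsed)
print(parsed, morpheme_count)
```

For the example utterance this gives four words and five morphemes, the kind of count that MLU and the later tools build on.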
The development of CLAN based tools
1) MLU 2) JMOR 3) DSSJ (Miyata, Hirakawa, Itoh, MacWhinney, Otomo, Shirai, Sirai & Sugiura, 2009)
developed between 1999 and 2009
MLU discussion ‐> looking for a more specific grammatical measure
=> adaptation of Developmental Sentence Score (DSS; Lee, 1974)
DSS command already in CLAN!
based on %mor info!
The development of CLAN based tools
1) MLU 2) JMOR 3) DSSJ
Example (DSS output):
Sentence       |IP |PP |PV |SV |NG |CNJ|IR |WHQ|S|
Tom likes him. |   |2  |3  |   |   |   |   |   |1|
personal pronoun: him = 2 points
primary verb: like‐s = 3 points
sentence point (adult‐like) = 1 point
total = 6 points
DSS score:
avg overall score (based on 100 utterances)
avg area score (e.g. score on verb inflection)
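The arithmetic behind these scores can be sketched in a few lines of Python. This is an illustration of the calculation described on the slide, not the CLAN DSS command; the function names are hypothetical and the column abbreviations are copied from the output above:

```python
from statistics import mean

# Column abbreviations copied from the DSS output shown above.
CATEGORIES = ("IP", "PP", "PV", "SV", "NG", "CNJ", "IR", "WHQ")

def sentence_score(points: dict, sentence_point: int) -> int:
    """Sum the category points of one utterance and add the sentence point."""
    return sum(points.get(cat, 0) for cat in CATEGORIES) + sentence_point

def overall_score(sentences) -> float:
    """Average sentence score over a sample of (points, sentence_point) pairs."""
    return mean(sentence_score(p, s) for p, s in sentences)

def area_score(sentences, category: str) -> float:
    """Average points earned in a single category across the sample."""
    return mean(p.get(category, 0) for p, _ in sentences)

tom_likes_him = ({"PP": 2, "PV": 3}, 1)
print(sentence_score(*tom_likes_him))  # 2 + 3 + 1
```

In practice the overall score would be averaged over the full sample of utterances rather than a single sentence.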
The development of CLAN based tools
1) MLU 2) JMOR 3) DSSJ
=> adaptation of Developmental Sentence Score (Lee, 1974)
But: Japanese grammar is different from English (simple translation impossible)
No general model of JL1 acquisition available (only studies on details)
• analysis of 8 longitudinal corpora (+ new corpus of 4 children)
• screening of 20 grammatical areas for items suited for DSSJ
• extraction of 104 items covering 11 areas and 5 stages
• reconfirmed with cross‐sectional data of 84 children 2;8‐5;2
• Jpn DSS file for the CLAN program DSS, working on the %mor tier
The development of CLAN based tools
1) MLU 2) JMOR 3) DSSJ
[Figure: Mean DSSJ overall scores in comparison to MLUm and MLUw for six age groups (2;8, 3;2, 3;8, 4;2, 4;8, 5;2). x‐axis: Age Group; y‐axis: Score/Value (0‐6). Adapted from Miyata, Hirakawa, Itoh, MacWhinney, Otomo, Shirai, Sirai & Sugiura, 2009.]
The development of CLAN based tools
1) MLU 2) JMOR 3) DSSJ 4) GRASP (Sagae 2008; Jpn version: Miyata, 2009)
Syntax analyser: Dependency Grammar combined with Case Grammar
also based on the %mor line!
“did [you] eat the cheese?”
*CHI: chiizu o tabeta no ?
%ort: チーズを食べたの。
%mor: n|chiizu ptl:case|o v|tabe‐PAST ptl:final|no ?
%xgra: 1|3|OBJ 2|1|CASP 3|0|ROOT 4|3|SFP 5|3|PUNCT
%xgra format: Order|Dependent on|Grammatical Role
generated automatically!
high accuracy!!!
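Each %xgra item is a triple of word position, position of the head, and relation label, so the dependency structure can be recovered with a few lines of code. The Python sketch below is an illustration only and not part of GRASP or CLAN; the names `Dependency`, `parse_gra`, and `dependents_of` are hypothetical:

```python
from typing import List, NamedTuple

class Dependency(NamedTuple):
    index: int     # position of the word in the utterance (1-based)
    head: int      # position of the word it depends on (0 = root)
    relation: str  # grammatical role label

def parse_gra(tier: str) -> List[Dependency]:
    """Turn 'index|head|relation' items into Dependency records."""
    deps = []
    for item in tier.split():
        index, head, relation = item.split("|")
        deps.append(Dependency(int(index), int(head), relation))
    return deps

def dependents_of(deps: List[Dependency], head: int) -> List[int]:
    """Positions of all words that depend directly on the given head."""
    return [d.index for d in deps if d.head == head]

deps = parse_gra("1|3|OBJ 2|1|CASP 3|0|ROOT 4|3|SFP 5|3|PUNCT")
root = next(d.index for d in deps if d.head == 0)
print(root, dependents_of(deps, root))
```

In the example the verb (position 3) is the root; the object noun, the sentence‐final particle, and the punctuation depend on it, while the case particle depends on the noun.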
CHILDES influence on research
Pre‐CHILDES
• printed speech corpora
CHAT / MLU
• definitions for orthography & word boundaries
• => first online corpora
MOR / DSSJ
• definitions for morphemes & grammar tags
• => corpora with morph tags; JL1 acquisition model
GRASP / ?…?
• definitions for syntactic relations
• => corpora with morph & syntactic tags; CaseDep Model
CHILDES influence on research
• Online CHILDES/TalkBank corpora
  Hayashi (1 boy 1;0‐2;5, Jp‐Danish)
  Miyata (3 boys 1;6‐3;0)
  Hamasaki (1 boy 2;2‐3;4)
  Ishii (1 boy 0;8‐3;8)
  Noji (1 boy 0;0‐6;0, diary)
  Nisisawa & Miyata (2 girls 1;2‐5;0)
  Miyata & Nisisawa (2 boys 1;2‐5;0)
  Sakura (18 student conversations)
  CallFriend (30 telephone conversations)
  Inaba (in work) (190 frog stories, 3;0‐11;0, L1 & L2 adults)
... and many, many unpublished corpora!
CHILDES influence on research
outlook (en rose):
stronger cooperation between languages (e.g.: Jpn GRASP developed in cooperation with Chinese, English, & Spanish)
cross‐linguistic comparison (e.g.: cross‐linguistically comparable measures with conversion tables)
better documentation of tool development (e.g.: platform for technical papers and manuals)
Basic Tool Kit (set of analysis & measurement tools to be developed first in a “new” language, with the help of standards and tools developed by “earlier” languages)
CHILDES influence on research
outlook:
stronger cooperation between languages
more cross‐linguistic comparison
better documentation of tool development
Basic Tool Kit
Thank you for listening!