CHILDES: Japanese corpora and language acquisition research

Susanne Miyata Aichi Shukutoku University

Outline

1. Child speech research in Japan (Pre‐CHILDES)

2. CHILDES Data now

3. Start of CHILDES in Japan

4. The development of CLAN based tools

5. CHILDES influence on research (Summary)

6. Outlook

Child speech research in Japan (Pre-CHILDES)

Noji Data (print data)
Noji, J. (1973-74). Yooji no gengo seikatsu no jittai I-IV. Bunka Hyoron Shuppan.

1 boy, 0;0-6;11 (ca. 62,000 utt.), diary data (starting 1948)

consecutive records of all (?) child utterances + context

based on instant hand-written records by the parents (the father was a linguist specializing in dialect research)

Okubo Data (published as hand-written transcripts)
National Institute for Japanese Language (1981-83). Child Language Materials. 6 vols.

1 boy, 1;0-4;0 (ca. 90,000 utt.)

based on tape recordings of weekly mother-child conversations


Noji Data (print): 0;0-6;11 (ca. 62,000 utt.)
Okubo Data (hand-written): 1;0-4;0 (ca. 90,000 utt.)

Much research before 1990 was based on comparisons of research on English with the Noji and/or Okubo data, often combined with (unpublished) data collected by the researcher.

Searching for a single item (e.g. negation) would take 2-3 months.

+ freely available (library)
+ longitudinal, extensive, reliable
- single cases => further data collection
- print (paper & pencil analysis) => digitalization (around 1990)

Japanese CHILDES data now (10 Nov 2010; violet = in prep.)

JL1 children (total 419,000 utt.)
•  Aki Corpus: weekly, 1;6-3;0 (M-C conversation), mor, 48,000 utt.
•  Ryo Corpus: weekly, 1;6-3;0 (M-C conversation), mor, 9,000 utt.
•  Tai Corpus: weekly, 1;6-3;1 (M-C conversation), audio-linked, mor, 81,000 utt.
•  Noji Corpus: diary data, 0;0-7;0 (C utterances with context), 62,000 utt.
•  Ishii Corpus: weekly, 0;6-3;8 (F-C conversation), video-linked, 95,000 utt.
•  Hamasaki Corpus: bimonthly, 2;2-3;7 (M-C conversation), 37,000 utt.
•  MiiPro Corpus: 2 girls, 2 boys (1 girl also F-C conversations); weekly 1;2-3;0 and monthly 3;0-5;0 (M-C conversation), audio-linked, mor, 84,000 utt.
•  Inaba Corpus (Frog Story): 90 stories, age 3 to 11, 10 children each, audio-linked, 3,600 utt.

JL1 adults (total 50,000 utt.)
•  Sakura Corpus: 18 student conversations @ 20 min., groups of 4 students (same sex, opp. sex), audio-linked, mor, 15,000 utt.
•  CallFriend Corpus: 30 telephone conversations, audio-linked, 33,000 utt.
•  Inaba Corpus (Frog Story): 50 stories, audio-linked, 2,000 utt.

JL2 adults (total 2,700 utt.)
•  Inaba Corpus (Frog Story): 50 stories, Japanese level 1-5, 10 adults each, audio-linked, 2,700 utt.

Clinical
•  Aphasia Corpus: 5 student-patient conversations, audio-linked

... and lots of unpublished data

Start of CHILDES in Japan

In the late 1980s:

Noji data: syntactic analysis (Morikawa)
Aki, Ryo, and Tai corpora: preparation for CHILDES (Miyata)

=> discussion about a CHAT format for Japanese

Oshima call => JCHAT Project (1993)
But why a whole project...?!


CLAN worked only with a restricted Latin script (ASCII) => a definition of the orthography was necessary

1) Modern pronunciation is not reflected by the Japanese syllabic script: si -> ʃi, tu -> tsu, ti -> tʃi, etc.
Should JCHAT adopt the modern pronunciation (shi, tsu, chi), as favoured by L2 researchers, or the historical pronunciation (si, tu, ti), as favoured by grammarians?
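
The difference between the two spellings can be pictured as a small respelling table. The Python sketch below is only an illustration: the name `to_hepburn` and the three-entry table are assumptions introduced here, covering just the correspondences named above. It is not the JCHAT convention itself.

```python
# Hypothetical respelling table: only the three correspondences named above.
HISTORICAL_TO_MODERN = {"si": "shi", "tu": "tsu", "ti": "chi"}


def to_hepburn(word: str) -> str:
    """Respell a historical-style romanized word with modern-pronunciation spellings."""
    out = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in HISTORICAL_TO_MODERN:
            # Replace the two-letter syllable and skip past it.
            out.append(HISTORICAL_TO_MODERN[pair])
            i += 2
        else:
            # Copy any other letter unchanged.
            out.append(word[i])
            i += 1
    return "".join(out)


print(to_hepburn("tiizu"))  # chiizu
print(to_hepburn("matu"))   # matsu
```

Whichever spelling is chosen, the point of a fixed convention is that every transcriber writes the same string for the same word, so that searches and counts stay comparable across corpora.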


2) No spaces to indicate word boundaries.

chiizu o tabeta no?   "Did [you] eat the cheese?"

chi [long] zu   o     ta be ta   no
cheese          ACC   eat-PAST   FINP

チーズを食べたの。 <= three different scripts

A change of script type gives some hints, but inconsistent ones:
Katakana: syllabary (loan words)
Hiragana: syllabary
Kanji: Chinese ideograms (Kana/Kanji)

There is no common intuitive consensus on what constitutes a word. (Student surveys revealed a broad range of variation and inconsistency.)


Should JCHAT attach particles?   chiizuo tabetano     チーズを食べたの。
Or separate particles?           chiizu o tabeta no   チーズを食べたの。

How can we distinguish auxiliaries from inflections?

tabe-ta          eat-PAST              <= universal inventory of morphological codes
tabe-rare-ta     eat-PASS-PAST
tabe-te-ru       eat-ASP-PRES          <= interpreted as inflection = 3 morphemes
or
tabe-te (i)-ru   eat-CONN exist-PRES   <= interpreted as auxiliary = 4 morphemes

(ongoing grammaticalization process from auxiliaries to inflections)

Word definition influences MLU results => lively MLU discussion in the 1990s
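
To make the effect on MLU concrete, here is a minimal Python sketch that counts units under the competing segmentations. The function names are hypothetical, and the counting rule (spaces separate words, hyphens separate morphemes) is a simplification assumed for this illustration. It is not how the CLAN MLU program works internally.

```python
def units_per_utterance(utterance: str, split_morphemes: bool) -> int:
    """Count words (space-separated) or morphemes (also split at hyphens)."""
    if split_morphemes:
        utterance = utterance.replace("-", " ")
    return len(utterance.split())


def mlu(utterances: list[str], split_morphemes: bool) -> float:
    """Mean length of utterance over a list of transcribed utterances."""
    counts = [units_per_utterance(u, split_morphemes) for u in utterances]
    return sum(counts) / len(counts)


# Attaching vs. separating particles changes the word count (MLUw).
print(mlu(["chiizuo tabetano"], split_morphemes=False))    # 2.0
print(mlu(["chiizu o tabeta no"], split_morphemes=False))  # 4.0

# Inflection vs. auxiliary reading changes the morpheme count (MLUm).
print(mlu(["tabe-te-ru"], split_morphemes=True))    # 3.0
print(mlu(["tabe-te i-ru"], split_morphemes=True))  # 4.0
```

The same child utterance therefore yields different MLU values depending purely on the transcription convention, which is why the format had to be agreed on before corpora could be compared.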


JCHAT Project:
- Discussions about JCHAT format and MLU
- CHILDES workshops
- Research reports

=> developed into an independent research society (from 1999):

The Japanese Society for Language Sciences (JSLS) http://www.jsls.jpn.org/

The development of CLAN based tools

1) MLU

Definitions of words and morphemes
MLU program already available in CLAN

2) JMOR: morphological analysis (Naka & Miyata, 1999)

Should the MOR lexicon be based on Latin script or on Japanese script? (CHILDES was moving from ASCII to Unicode.)

a) Problems with ambiguous readings of Kanji (Chinese ideograms)

(Kanji were introduced with Chinese loan words, then applied to Japanese words as well => 2 readings for the same Kanji.)

shoku hin    ta be mo no
食品         食べもの       both: "food"

Chinese reading: shoku vs. Japanese reading: tabe (compare: beverage vs. drink)
=> no one-to-one correspondence between ideogram and pronunciation

possible for humans, but difficult for computers


a) Kanji (Chinese ideogram) reading is ambiguous.
b) For many words, different spellings are possible,
e.g. ichigo "strawberry":

イチゴ   Katakana   syllabary (loan words)   scientific
いちご   Hiragana   syllabary                soft & natural
苺       Kanji      Chinese ideogram         literary-refined

<= three different scripts with stylistic differences

even more difficult for computers


If the lexicon were Kana/Kanji-based ... => huge lexicon with
•  at least two readings for each ideogram
•  triple entries anticipating possible stylistic variations
•  complex relations between script and grammatical form

=> leads to difficulties with linguistic analysis programs
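
The lexicon-size argument can be illustrated with the strawberry example. In the following Python sketch, both dictionaries and the function `gloss` are hypothetical miniatures made up for this illustration. They are not excerpts from the actual JMOR lexicon.

```python
# Hypothetical miniature lexicon keyed by Japanese-script spellings:
# every stylistic variant of the same word needs its own entry.
KANA_KANJI_LEXICON = {
    "イチゴ": "ichigo",  # Katakana
    "いちご": "ichigo",  # Hiragana
    "苺": "ichigo",      # Kanji
}

# The same word in a Latin-script lexicon needs a single entry.
LATIN_LEXICON = {"ichigo": "strawberry"}


def gloss(spelling: str) -> str:
    """Look up a Japanese-script spelling via its romanized form."""
    return LATIN_LEXICON[KANA_KANJI_LEXICON[spelling]]


print(len(KANA_KANJI_LEXICON), len(LATIN_LEXICON))  # 3 1
print(gloss("苺"))  # strawberry
```

With a romanized main line, the analysis program only ever sees the single key `ichigo`, whatever script the transcriber would have chosen in Japanese orthography.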


JMOR (Naka & Miyata, 1999; Miyata & Naka, 2004)

Our solution (at least for now):

CHAT with Latin script on the main line and Japanese script on a dependent tier. Analysis is based on the Latin-script main line.

*CHI: chiizu o tabeta no ?
%ort: チーズを食べたの。
%mor: n|chiizu ptl:case|o v|tabe-PAST ptl:final|no ?

but two transcription lines have to be entered...!

ca. 96% accuracy in combination with POST (written by C. Parisse)
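
As a rough illustration of how the tiers relate, the following Python sketch splits the example utterance into its tiers and breaks the %mor tier into part-of-speech tag, stem, and suffixes. All names (`MorToken`, `parse_tiers`, `parse_mor`) are hypothetical, and the parsing rules are simplified assumptions: real %mor tiers use further delimiters that this sketch ignores, and none of this is CLAN code.

```python
from typing import NamedTuple


class MorToken(NamedTuple):
    pos: str             # part-of-speech tag, e.g. "n" or "ptl:case"
    stem: str            # stem, e.g. "tabe"
    suffixes: list[str]  # suffix codes, e.g. ["PAST"]


def parse_tiers(lines: list[str]) -> dict[str, str]:
    """Map each tier label (e.g. '*CHI', '%mor') to its text."""
    tiers = {}
    for line in lines:
        # Split at the first colon only; tags like 'ptl:case' contain colons too.
        label, _, text = line.partition(":")
        tiers[label] = text.strip()
    return tiers


def parse_mor(mor: str) -> list[MorToken]:
    """Split a %mor tier into tokens of the form pos|stem-SUFFIX."""
    tokens = []
    for item in mor.split():
        if "|" not in item:  # utterance terminator such as '?' or '.'
            continue
        pos, _, form = item.partition("|")
        stem, *suffixes = form.split("-")
        tokens.append(MorToken(pos, stem, suffixes))
    return tokens


utterance = [
    "*CHI: chiizu o tabeta no ?",
    "%ort: チーズを食べたの。",
    "%mor: n|chiizu ptl:case|o v|tabe-PAST ptl:final|no ?",
]
tiers = parse_tiers(utterance)
for token in parse_mor(tiers["%mor"]):
    print(token.pos, token.stem, token.suffixes)
```

Because the main line and the %mor tier have the same number of words, each tag can be matched to its word by position, which is what later tools rely on.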


3) DSSJ (Miyata, Hirakawa, Itoh, MacWhinney, Otomo, Shirai, Sirai & Sugiura, 2009)

developed between 1999 and 2009
MLU discussion -> looking for a more specific grammatical measure
=> adaptation of the Developmental Sentence Score (DSS; Lee, 1974)

A DSS command already exists in CLAN, based on %mor information!


Example (DSS output):

Sentence       |IP |PP |PV |SV |NG |CNJ|IR |WHQ|S|
Tom likes him. |   |2  |3  |   |   |   |   |   |1|

personal pronoun: him            2 points
primary verb: like-s             3 points
sentence point (adult-like)      1 point
total                          = 6 points

DSS score:
average overall score (based on 100 utterances)
average area score (e.g. score on verb inflection)
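
The arithmetic behind these scores can be sketched in a few lines of Python. The function names are hypothetical, and the averaging rule for area scores (mean over all sentences, counting unscored sentences as 0) is an assumption made for this illustration. The authoritative procedure is the one implemented by the CLAN DSS command.

```python
from statistics import mean

# Column labels taken from the DSS output example above.
AREAS = ["IP", "PP", "PV", "SV", "NG", "CNJ", "IR", "WHQ", "S"]


def sentence_score(points: dict[str, int]) -> int:
    """Sum the points a sentence earned across all areas."""
    return sum(points.get(area, 0) for area in AREAS)


def overall_score(sample: list[dict[str, int]]) -> float:
    """Average sentence score over the sample."""
    return mean(sentence_score(p) for p in sample)


def area_score(sample: list[dict[str, int]], area: str) -> float:
    """Average points in one area; sentences without points count as 0 (assumption)."""
    return mean(p.get(area, 0) for p in sample)


tom_likes_him = {"PP": 2, "PV": 3, "S": 1}
print(sentence_score(tom_likes_him))  # 6
```

A full analysis would apply `overall_score` and `area_score` to a sample of 100 scored utterances rather than to a single sentence.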



=> adaptation of the Developmental Sentence Score (Lee, 1974)
But: Japanese grammar is different from English grammar (simple translation impossible)
No general model of JL1 acquisition was available (only studies on details)

•  analysis of 8 longitudinal corpora (+ new corpus of 4 children)
•  screening of 20 grammatical areas for items suited for DSSJ
•  extraction of 104 items covering 11 areas and 5 stages
•  reconfirmed with cross-sectional data of 84 children 2;8-5;2
•  Japanese DSS file for the CLAN program DSS, working on the %mor tier



[Figure: Mean DSSJ overall scores in comparison to MLUm and MLUw for six age groups (2;8, 3;2, 3;8, 4;2, 4;8, 5;2). x-axis: age group; y-axis: score/value (0-6). Adapted from Miyata, Hirakawa, Itoh, MacWhinney, Otomo, Shirai, Sirai & Sugiura, 2009.]


4) GRASP (Sagae, 2008; Japanese version: Miyata, 2009)

syntax analyser: Dependency Grammar combined with Case Grammar
also based on the %mor line!

"did [you] eat the cheese?"
*CHI: chiizu o tabeta no ?
%ort: チーズを食べたの。
%mor: n|chiizu ptl:case|o v|tabe-PAST ptl:final|no ?
%xgra: 1|3|OBJ 2|1|CASP 3|0|ROOT 4|3|SFP 5|3|PUNCT

%xgra format: Order|Dependent on|Grammatical role

generated automatically!
high accuracy!!!
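
The Order|Dependent on|Grammatical role triples can be read back into a small data structure. The Python sketch below is a hedged illustration: `Dependency` and `parse_xgra` are hypothetical names, and treating head 0 as the root follows the example tier above rather than any formal specification consulted here.

```python
from typing import NamedTuple


class Dependency(NamedTuple):
    index: int  # position of the word in the utterance (1-based)
    head: int   # position of the word it depends on (0 = root)
    role: str   # grammatical role label, e.g. "OBJ"


def parse_xgra(tier: str) -> list[Dependency]:
    """Parse 'order|head|ROLE' triples separated by spaces."""
    deps = []
    for item in tier.split():
        index, head, role = item.split("|")
        deps.append(Dependency(int(index), int(head), role))
    return deps


words = ["chiizu", "o", "tabeta", "no", "?"]
xgra = "1|3|OBJ 2|1|CASP 3|0|ROOT 4|3|SFP 5|3|PUNCT"

for dep in parse_xgra(xgra):
    head_word = "ROOT" if dep.head == 0 else words[dep.head - 1]
    print(f"{words[dep.index - 1]} --{dep.role}--> {head_word}")
```

Read this way, the example says that chiizu is the object of tabeta, that the case particle o attaches to chiizu, and that the final particle and the punctuation both attach to the verb.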

CHILDES influence on research

Pre-CHILDES
• printed speech corpora

CHAT, MLU
• definitions for orthography & word boundaries
• => first online corpora

MOR, DSSJ
• definitions for morphemes & grammar tags
• => corpora with morph tags; JL1 acquisition model

GRASP, ?…?
• definitions for syntactic relations
• => corpora with morph & syntactic tags; CaseDep Model


•  Online CHILDES/TalkBank corpora
Hayashi (1 boy 1;0-2;5, Jp-Danish)
Miyata (3 boys 1;6-3;0)
Hamasaki (1 boy 2;2-3;4)
Ishii (1 boy 0;8-3;8)
Noji (1 boy 0;0-6;0, diary)
Nisisawa & Miyata (2 girls 1;2-5;0)
Miyata & Nisisawa (2 boys 1;2-5;0)
Sakura (18 student conversations)
CallFriend (30 telephone conversations)
Inaba (in work) (190 frog stories, 3;0-11;0, L1 & L2 adults)

... and many, many unpublished corpora!


Outlook (en rose):

stronger cooperation between languages (e.g.: Japanese GRASP developed in cooperation with Chinese, English, & Spanish)

more cross-linguistic comparison (e.g.: cross-linguistically comparable measures with conversion tables)

better documentation of tool development (e.g.: platform for technical papers and manuals)

Basic Tool Kit (set of analysis & measurement tools to be developed first in a "new" language, with the help of standards and tools developed by "earlier" languages)

Thank you for listening!
