2015-10-03 Slides used at the Japan Association for English Corpus Studies (JAECS) workshop

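Statistical Methods for Corpus Research Using Web Applications

41st Conference of the Japan Association for English Corpus Studies (JAECS), 2015/10/03 @ Aichi University, Nagoya Campus

Atsushi Mizumoto (Kansai University)

Self-introduction

Takeuchi & Mizumoto (Eds.) (2012)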

http://mizumot.com/handbook

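Data and analysis methods used in the book

• MS Excel (only for what it can handle)

• IBM SPSS

• R, a free data-analysis environment

• Past JAECS workshops (Tabata, 2004; Jin, 2007; Tanaka & Kobayashi, 2009; Sakaue, 2013)

• Past LET national conference workshops (Kobayashi, 2011; Sakaue, 2012, 2014)

• “R passes SPSS in scholarly use” (Muenchen, 2014)

However... R is a command-line interface (CLI).

R can also be used through a GUI: http://socserv.mcmaster.ca/jfox/Misc/Rcmdr/Rcmdr-screenshot.html

R Commander (EZR) and similar tools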

http://www.jichi.ac.jp/saitama-sct/SaitamaHP.files/statmedEN.html

MacR, a Mac app for using R through a GUI

https://sites.google.com/site/casualmacr/home

(FYI) HAD, which does everything in Excel

http://norimune.net/had

One step further in convenience (or rather, ease): Web applications

From experience so far...

Akano, Hori, & Tono (Eds.) (2014); Ishikawa, Maeda, & Yamazaki (Eds.) (2010)

Convenient web tools:

http://www.kisnet.or.jp/nappa/software/star/

http://www.m-sugaya.jp/python/

http://www.hju.ac.jp/~kiriki/anova4/index.html

What I usually do in R

• Prepare the source data as csv, xls, etc.

• Load the data into R

• Analyze with functions from packages

Convenient web tools of the kind I wanted to build:

http://hoxom-hist.appspot.com/hist.html

http://www.wessa.net/rwasp_cronbach.wasp

Since 2012: Shiny

http://shiny.rstudio.com/


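• Analyses can be reproduced with the sample data from the quantitative chapters of the “Handbook.”

• You learn how to read the output.

• You can easily run the analyses yourself.

• Graphs are emphasized.

• Just copy and paste data from Excel.

langtest.jp

http://langtest.jp

Just copy and paste data from Excel here.

Matrices work too.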
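As a rough illustration of why pasting from Excel works, the sketch below parses tab-separated clipboard text into a numeric matrix. It is a minimal sketch under stated assumptions: `parse_pasted_table` is a hypothetical helper written for this note, not code taken from langtest.jp, and the pasted values are made up.

```python
def parse_pasted_table(text):
    """Split tab-separated clipboard text (as copied from Excel) into numeric rows."""
    rows = []
    for line in text.strip().splitlines():
        cells = line.split("\t")          # Excel separates copied cells with tabs
        rows.append([float(cell) for cell in cells])
    return rows


# Hypothetical 2x2 block copied from a spreadsheet.
pasted = "15\t25\n25\t15"
table = parse_pasted_table(pasted)
print(table)
```

A single column of numbers parses the same way, giving one value per row.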

All code is published in the app and on GitHub.

https://github.com/mizumot

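Caveats

• Anyone can use it... which is exactly why it is risky.

• There is no documentation.

• It is a little slow because R runs on the server.

• There is no flexibility (improvements are planned as requests come in).

• No code is left behind, so reproducibility is poor.

Target users and purposes

• Undergraduates and master's students: run the analyses from the “Handbook” and similar sources hands-on, then use them in graduation and master's theses.

• Doctoral students and researchers doing quantitative research: check analysis methods, then read the code and use R themselves (langtest.jp alone will likely feel insufficient).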

http://www.routledgetextbooks.com/textbooks/9781138024571/

Countries where it has been used (as of 2015/09/30)

http://langtest.jp/

https://twitter.com/CorpusTan/status/640876418801405953

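Statistical Methods for Corpus Research Using Web Applications

Basic analysis flow

1. Create a word list from a concordancer or website, or extract the frequencies of specific words or phrases (lemmatization, frequency normalization)

2. Analyze with statistical software such as R

Statistical analysis for corpus research using web applications

1. Descriptive and inferential statistics

2. Statistical tests and effect sizes

3. Correlation and multivariate analysis

4. Reproducibility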

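Population and sample

[Figure: a sample (known) is extracted as a part of the whole population (unknown); data analysis (Σ, F, t, p...) is used to estimate and infer the population.]

http://www.urano-ken.com/blog/2013/08/05/let2013-workshop/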


Population: μ = 15.3

Sample A: M = 14.7

Sample B: M = 15.9

Sample C: M = 15.2

Sample D: M = 15.4

Sample E: M = 15.1

The realized value differs from sample to sample.
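The point that each sample yields a different mean can be simulated. This is a minimal sketch, assuming a normally distributed population with mean 15.3; the standard deviation (3.0), the population size, the sample size (30), and the random seed are all arbitrary choices made for illustration, not values from the slides.

```python
import random
import statistics

random.seed(2015)  # arbitrary seed so the run is repeatable

# Hypothetical population centred on 15.3 (SD of 3.0 is an assumption).
population = [random.gauss(15.3, 3.0) for _ in range(10_000)]
mu = statistics.mean(population)

# Draw five samples of 30 and record each sample mean M.
sample_means = {
    name: statistics.mean(random.sample(population, 30)) for name in "ABCDE"
}

print(f"population mean = {mu:.2f}")
for name, m in sample_means.items():
    print(f"sample {name}: M = {m:.2f}")
```

Each run with a different seed gives different sample means scattered around the population mean, which is why a single M is only an estimate of μ.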


Population: μ = ?

Sample A: M = 14.7

In practice, μ is estimated by treating M as μ.

[Figure: histogram of scores; x-axis: Score (30 to 80), y-axis: Frequency (0 to 20); M = 50.59]

On the representativeness of corpora

The web may not be “representative of anything other than itself,” as Kilgarriff and Grefenstette (2003: 333) point out – “but then neither are other corpora” (Boulton, 2012).

e.g., The web as “corpus”

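Hands-on 1: Basic statistics

1. Open langtest.jp

2. Open “Basic Statistics Calculator”

3. Copy and paste only the numbers from “語数” (word count) under “(1)記述統計” (descriptive statistics) in JAECS2015data

http://langtest.jp

A mean of 30 points and a standard deviation of 10 points

M and SD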
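For readers who want to see what M and SD amount to outside the web app, here is a minimal sketch using Python's standard library. The word counts are invented for illustration and are not the workshop's JAECS2015data file.

```python
import statistics

# Hypothetical word counts per text (not the JAECS2015data values).
word_counts = [312, 287, 401, 356, 298, 334, 379, 290]

n = len(word_counts)
mean = statistics.mean(word_counts)
sd = statistics.stdev(word_counts)  # sample SD, i.e. n - 1 in the denominator

print(f"n = {n}, M = {mean:.2f}, SD = {sd:.2f}")
```

`statistics.pstdev` would give the population SD instead; which one a given tool reports is worth checking before comparing results.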

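Be careful with frequency data

• Applying the same procedure to a word list (frequency data) built from a corpus is meaningless.

• Choose an analysis method suited to the type of data.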






Which analyses are most common?

“Quantitative research methods and study quality in Learner Corpus Research,” Paquot & Plonsky (2015 @ LCR), reported by Dr. Akira Murakami: https://twitter.com/mrkm_a/status/642802550928998400

• Chi-square test: 22%

• Correlation: 17%

• ANOVA: 12%

• t-test: 11%

• log-linear analysis: 10%

• Followed by non-parametric techniques, multiple regression, logistic regression, etc.

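Chi-square test

Marginal totals only:

          Corpus A   Corpus B   Total
Word X                            40
Word Y                            40
Total        40         40        80

Expected frequencies:

          Corpus A   Corpus B   Total
Word X       20         20        40
Word Y       20         20        40
Total        40         40        80

Observed frequencies (on the slide, expected values are on the left and observed values on the right):

          Corpus A   Corpus B   Total
Word X       15         25        40
Word Y       25         15        40
Total        40         40        80

Image of the chi-square value: the discrepancy between the expected and observed frequencies.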

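[Figure: chi-square distribution curve for df = 1; x-axis: Chi-square value (0 to 6), y-axis: relative frequency (probability density, 0.0 to 1.0); the curve runs from “same” to “different”. The sample is a part extracted from the population, from which the whole is inferred.]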

Computing the discrepancy (the chi-square value) from the observed and expected frequencies:

(15-20)^2/20 + (25-20)^2/20 + (25-20)^2/20 + (15-20)^2/20 = 5
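The same arithmetic can be written as a short program. This is a minimal sketch of the Pearson chi-square statistic without a continuity correction; the function names are illustrative. Note that R's `chisq.test` applies Yates' continuity correction to 2×2 tables by default, so its output for this table can differ from 5.

```python
observed = [[15, 25],   # Word X in Corpus A, Corpus B
            [25, 15]]   # Word Y in Corpus A, Corpus B


def expected_counts(table):
    """Expected cell counts under independence: row total * column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square(table):
    """Sum of (observed - expected)^2 / expected over all cells."""
    exp = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, exp)
        for o, e in zip(obs_row, exp_row)
    )


expected = expected_counts(observed)
chi2 = chi_square(observed)
print(expected, chi2)
```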

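[Figure: chi-square distribution curve for df = 1 with the discrepancy marked; x-axis: Chi-square value (0 to 6), y-axis: relative frequency (probability density); the curve runs from “same” to “different”.]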

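How to read the results

• X-squared: the chi-square value (the larger it is, the larger the discrepancy)

• df: degrees of freedom, (number of rows − 1) × (number of columns − 1)

• p-value: a p value below 0.05 indicates a significant difference

• If the expected value of any cell is below 5, the chi-square test is inaccurate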
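To make df and the p value concrete, the sketch below computes both for a 2×2 table. It relies on the fact that, for df = 1 only, the upper-tail probability of a chi-square value x equals erfc(sqrt(x / 2)); other degrees of freedom need an incomplete gamma function (for example `scipy.stats.chi2.sf`). The function names are illustrative.

```python
import math


def degrees_of_freedom(table):
    """(number of rows - 1) * (number of columns - 1)."""
    return (len(table) - 1) * (len(table[0]) - 1)


def p_value_df1(chi2):
    """Upper-tail probability of a chi-square value; valid for df = 1 only."""
    return math.erfc(math.sqrt(chi2 / 2))


table = [[15, 25], [25, 15]]
df = degrees_of_freedom(table)
p = p_value_df1(5.0)   # 5.0 is the chi-square value worked out for this table

print(f"df = {df}, p = {p:.4f}")
```

With df = 1, a chi-square value of about 3.84 corresponds to p = .05, so the value of 5 falls in the “significant” region.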

Residual analysis

Hands-on 2: Chi-square test

1. Open langtest.jp

2. Open “Chi-square Test”

3. Copy and paste the relevant part of “(2)カイ2乗” (chi-square) in JAECS2015data

http://langtest.jp

Collocation measures

• Mutual information (MI)

• t-score, z-score

• Dice coefficient, Jaccard coefficient, cosine similarity, Simpson coefficient
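Three of these measures can be sketched with their common textbook definitions. This is a simplified illustration, not the formulas of any particular tool: the expected co-occurrence frequency here ignores the collocation span, which many concordancers factor in, and the frequencies are invented.

```python
import math


def collocation_measures(pair_freq, node_freq, collocate_freq, corpus_size):
    """MI, t-score, and Dice from raw frequencies (span adjustment omitted)."""
    expected = node_freq * collocate_freq / corpus_size
    return {
        "MI": math.log2(pair_freq / expected),
        "t": (pair_freq - expected) / math.sqrt(pair_freq),
        "Dice": 2 * pair_freq / (node_freq + collocate_freq),
    }


# Hypothetical frequencies: the pair occurs 50 times in a 1,000,000-word corpus.
scores = collocation_measures(
    pair_freq=50, node_freq=1000, collocate_freq=500, corpus_size=1_000_000
)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

MI tends to reward rare pairs while the t-score rewards frequent ones, which is one reason several measures are usually reported together.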

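Ishikawa (2012); Ishikawa (2008)

Running the analysis in [logo]

← Used in Shiny

← Used in Shiny

Use this part in R

↑ Refers to files in the working directory, etc.

Running the analysis in [logo] (chi-square test)

Hands-on 3

Multiply every cell value in the “Kobayashi (2015) example” by 10 and check how the p value changes before and after.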
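The effect of multiplying every cell by 10 can also be checked in code. This is a minimal sketch: the counts below are hypothetical (they are not the Kobayashi (2015) example, whose values are not given in the slides), the chi-square statistic is computed without a continuity correction, and the p value uses the df = 1 identity p = erfc(sqrt(x / 2)).

```python
import math


def chi_square(table):
    """Pearson chi-square statistic (no continuity correction)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return sum(
        (obs - r * c / n) ** 2 / (r * c / n)
        for row, r in zip(table, row_totals)
        for obs, c in zip(row, col_totals)
    )


def p_value_df1(chi2):
    """Upper-tail probability for df = 1 only."""
    return math.erfc(math.sqrt(chi2 / 2))


small = [[12, 8], [9, 11]]                        # hypothetical counts
large = [[10 * cell for cell in row] for row in small]

chi2_small, chi2_large = chi_square(small), chi_square(large)
p_small, p_large = p_value_df1(chi2_small), p_value_df1(chi2_large)

print(f"original: chi2 = {chi2_small:.3f}, p = {p_small:.4f}")
print(f"x10:      chi2 = {chi2_large:.3f}, p = {p_large:.4f}")
```

The proportions in the two tables are identical, yet the chi-square value grows tenfold and the p value drops below .05, which shows how strongly the test depends on sample size.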

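Introduction to effect sizes

Statistically significant

p < .05

• Problems with statistical tests
  - They are affected by sample size.
  - They give only a yes/no judgment of significance.
  - The p value does not show the substantive difference.

Effect size

• Effect size
  - It is not affected by sample size.
  - It shows the magnitude of an effect.
  - It lets you check the substantive difference.

• APA 6th treats reporting it as “essential”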

Cumming (2012)

Stop the blind faith in p values

APA 6th (2009); Okubo & Okada (2009)

The “statistical reform”

Basic and Applied Social Psychology: p values (null hypothesis significance testing) banned!

http://www.tandfonline.com/doi/abs/10.1080/01973533.2015.1012991#.Vb3tuJPtlBd

"it is important to note that one cannot use the chi-square value as a measure of effect size, i.e. as an indication of how strong the correlation between the two investigated variables is. This is due to the fact that the chi-square value is dependent on the effect size, but also on the sample size."

Gries (2009, p. 196)

http://www.mizumot.com/method/06-05_Kobayashi.pdf

In corpus linguistics too

“log ratio as a means of taking effect size into consideration in the ranking of keyword results is being incorporated into a number of programs” (p. 105).

Culpeper, J., & Demmen, J. (2015). Keywords. In D. Biber & R. Reppen (Eds.), The Cambridge handbook of English corpus linguistics (pp. 90–105). Cambridge University Press.

log ratio = “the binary log of the ratio of relative frequencies” (http://cass.lancs.ac.uk/?p=1133)
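The definition quoted above translates directly into code. This is a minimal sketch with invented frequencies; `log_ratio` is an illustrative name, and the function does not handle a frequency of zero, which real tools deal with in their own ways.

```python
import math


def log_ratio(freq_a, size_a, freq_b, size_b):
    """Binary log of the ratio of relative frequencies in two corpora."""
    relative_a = freq_a / size_a
    relative_b = freq_b / size_b
    return math.log2(relative_a / relative_b)


# Hypothetical word: 200 hits per million words in corpus A, 50 in corpus B.
lr = log_ratio(200, 1_000_000, 50, 1_000_000)
print(f"log ratio = {lr:.2f}")
```

Each additional point of log ratio means a doubling of the relative frequency, so a value of about 2 here corresponds to the word being four times as frequent in corpus A.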

https://benjamins.com/#catalog/journals/ijcl.20.3.01ant/details

http://www.laurenceanthony.net/software/protant/

CasualConc https://sites.google.com/site/casualconcj/

Version 2.0: uses the effect size r for keyword extraction

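langtest.jp: Cramer's V (with 95% CI)

* For a cross-tabulation in which either the number of rows or the number of columns is 2, M (the smaller of the two) is 2, and V equals the phi coefficient (fourfold point correlation coefficient).

(Common) benchmarks:

V = 0.1 small effect

V = 0.3 medium effect

V = 0.5 large effect

V takes values from 0 to 1 (like the absolute value of a correlation coefficient).

Cramer's V for a 2×2 contingency table (= phi coefficient): take the absolute value of the (fourfold point) correlation coefficient

= 0.25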
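The value of 0.25 can be reproduced with the standard definition V = sqrt(chi-square / (N × (M − 1))), where M is the smaller of the number of rows and columns. This is a minimal sketch without the confidence interval that langtest.jp also reports; the function names are illustrative.

```python
import math


def chi_square(table):
    """Pearson chi-square statistic (no continuity correction)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return sum(
        (obs - r * c / n) ** 2 / (r * c / n)
        for row, r in zip(table, row_totals)
        for obs, c in zip(row, col_totals)
    )


def cramers_v(table):
    """Cramer's V; equals the phi coefficient when M = 2."""
    n = sum(sum(row) for row in table)
    m = min(len(table), len(table[0]))
    return math.sqrt(chi_square(table) / (n * (m - 1)))


observed = [[15, 25], [25, 15]]
v = cramers_v(observed)
v_times_ten = cramers_v([[10 * cell for cell in row] for row in observed])
print(v, v_times_ten)
```

Multiplying every cell by 10 leaves V unchanged, which illustrates the claim that an effect size does not depend on sample size.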

[Figure: plot of the four cells: Corpus A × Word X, Corpus A × Word Y, Corpus B × Word X, Corpus B × Word Y]

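langtest.jp: odds ratio (shown only for 2×2 contingency tables; with 95% CI)

Compared with Word Y, Word X is more likely to be used in Corpus B than in Corpus A.

Corpus A (Word X): 15/25 = 0.60

Corpus B (Word X): 25/15 = 1.6667

Odds ratio: 0.6/1.6667 = 0.36

An odds ratio of 1 means there is no difference between the two corpora. A value above 1 means the word is more likely to be used in Corpus A; a value below 1, in Corpus B.

(1 / 0.36 = 2.778 times)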
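The odds ratio above, together with one common form of 95% confidence interval, can be sketched as follows. The interval here is the Wald interval on the log odds ratio, which is an assumption made for illustration; langtest.jp may compute its interval differently, so the bounds need not match the app exactly.

```python
import math


def odds_ratio_with_ci(x_in_a, x_in_b, y_in_a, y_in_b, z=1.96):
    """Odds ratio for Word X vs. Word Y across corpora A and B, with a Wald CI."""
    odds_a = x_in_a / y_in_a          # odds of X (vs. Y) in corpus A
    odds_b = x_in_b / y_in_b          # odds of X (vs. Y) in corpus B
    ratio = odds_a / odds_b
    # Standard error of the log odds ratio.
    se = math.sqrt(1 / x_in_a + 1 / x_in_b + 1 / y_in_a + 1 / y_in_b)
    lower = math.exp(math.log(ratio) - z * se)
    upper = math.exp(math.log(ratio) + z * se)
    return ratio, lower, upper


ratio, lower, upper = odds_ratio_with_ci(15, 25, 25, 15)
print(f"OR = {ratio:.2f}, 95% CI [{lower:.3f}, {upper:.3f}]")
```

Under this method the whole interval lies below 1, which agrees with the reading that Word X is relatively more characteristic of Corpus B.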

http://ucrel.lancs.ac.uk/llwizard.html

Hands-on 4: Calculating effect sizes

1. Open langtest.jp

2. Open “Chi-square Test”

3. Copy and paste the relevant part of “(2)カイ2乗” (chi-square) in JAECS2015data

4. Check the odds ratio and Cramer's V (phi coefficient)

http://langtest.jp






Correlation coefficient

Bands of r used for interpretation:

.00 to ±.20

±.20 to ±.40

±.40 to ±.70

±.70 to ±1.00

Akano, Hori, & Tono (2014), 「英語教師のためのコーパス活用ガイド」 (p. 204)

[Figures: scatterplots of English proficiency against total essay word count and against the number of errors in the essay, followed by example scatterplots for r = .00, .30, .70, .90 and r = .00, −.30, −.70, −.90]

The effect size for a correlation is the correlation coefficient itself;

the common benchmarks are 0.1 (small), 0.3 (medium), and 0.5 (large).
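A correlation coefficient and its benchmark label can be computed as below. This is a minimal sketch: the proficiency scores and essay word counts are invented, and `effect_size_label` is a hypothetical helper that simply applies the 0.1 / 0.3 / 0.5 benchmarks to the absolute value of r.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def effect_size_label(r):
    """Apply the 0.1 / 0.3 / 0.5 benchmarks to |r|."""
    size = abs(r)
    if size >= 0.5:
        return "large"
    if size >= 0.3:
        return "medium"
    if size >= 0.1:
        return "small"
    return "negligible"


# Hypothetical data: proficiency scores and total words per essay.
proficiency = [420, 510, 480, 610, 550, 700, 660, 590]
essay_words = [180, 220, 260, 240, 300, 310, 280, 330]

r = pearson_r(proficiency, essay_words)
print(f"r = {r:.2f} ({effect_size_label(r)})")
```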

Hands-on 5: Correlation

1. Open langtest.jp

2. Open “Correlation”

3. Copy and paste the relevant part of “(3)相関・多変量” (correlation and multivariate analysis) in JAECS2015data

http://langtest.jp

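Multivariate analysis

(Rough) purposes

• Cluster analysis → classify the data

• Exploratory factor analysis → explore the latent factors in the data

• Principal component analysis → compress and combine the data

• Correspondence analysis → compress the data (summarize it in fewer dimensions)

Image of principal component analysis

Compress the information across variables to create a “composite score” (principal component)

Image of correspondence analysis

Reorder the rows and columns so that the relationship (correlation) between them is maximized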

http://www.mizumot.com/files/2009_corpus2.pdf

Tabata, T. (1995). Narrative style and the frequencies of very common words: A corpus-based approach to Dickens's first person and third person narratives. English Corpus Studies, 2, 91–109. Retrieved from http://www.lang.osaka-u.ac.jp/~tabata/papers/1995.pdf

[Figure: word plot of the 100 most common words on the first two principal components; x-axis: 1st PC (20.15 %), y-axis: 2nd PC (8.15 %)]

Figure 1. First person narratives versus Third person narratives: Word-plot (for the 100 most common words of the narrative corpus).

[Figure: plot of text segments (David, Esther, Pip, SB, PP, OT, NN, BH, TTC, OMF, ED) on the first two principal components, grouped into first person narratives and third person narratives; x-axis: 1st PC, y-axis: 2nd PC]

Figure 2. First person narratives versus Third person narratives: Texts in 4000-word segments (based on the 100 most common words of the narrative corpus).

http://www.lang.osaka-u.ac.jp/~tabata/papers/1995.pdf

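Uchida, S. (2015). 「CEFR レベルに基づいた教材コーパス—レベル別基準特性の抽出に向けて」 [A teaching-materials corpus based on CEFR levels: Toward extracting level-specific criterial features]. English Corpus Studies, 22, 87–100.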

Tono, Y. (2013). Criterial feature extraction using parallel learner corpora and machine learning. In A. Díaz-Negrillo, N. Ballier, & P. Thompson (Eds.), Automatic treatment and analysis of learner corpus data (pp. 169–203). Amsterdam/Philadelphia: John Benjamins.

Hands-on 6: Multivariate analysis

1. Open langtest.jp

2. Look at the three apps: "Cluster Analysis", "Principal Component Analysis", and "Correspondence Analysis"

3. Copy and paste the relevant part of “(3)相関・多変量” (correlation and multivariate analysis) in JAECS2015data

http://langtest.jp

For reference

http://www.lang.osaka-u.ac.jp/~tabata/JAECS2004/multi.html

http://www.lang.osaka-u.ac.jp/~tabata/JAECS2004/JAECS2004hand.pdf

4. Reproducibility


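1. Create a word list from a concordancer or website, or extract the frequencies of specific words or phrases (lemmatization, frequency normalization)

2. Analyze with statistical software such as R

How reproducible is corpus research?

Maeda & Yamamori (Eds.) (2004)

“Write the necessary information properly. Write the information so that the study can be replicated. Write so that readers can understand it easily.” (p. 172)

Porte (2012)

Write the information needed for replication and meta-analysis

“No. Absolutely not.”

• No means or standard deviations reported.

• Number of participants or total counts unclear.

• No reliability coefficients or similar reported.

• Only p values reported (lots of asterisks).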

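Information needed to reproduce (an analysis)

• Sample size, mean, standard deviation

• Correlation coefficients (paired data, SEM, etc.)

• Reliability coefficients (regression to the mean, correction for attenuation, etc.)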

The “statistical reform” in L2 research

• The “statistical reform” is also under way in L2 research.

• Journals are publishing their policies in editorials, guidelines, and special issues.

http://onlinelibrary.wiley.com/doi/10.1111/lang.2015.65.issue-S1/issuetoc

Larson-Hall, J., & Plonsky, L. (2015). Reporting and interpreting quantitative research findings: What gets reported and recommendations for the field. Language Learning, 65/Supp. 1, 125–157. doi:10.1111/lang.12115

1. Improved reporting of descriptive statistics

2. Reporting effect sizes and their confidence intervals

3. Reporting the reliability of measurement instruments

4. Emphasis on data visualization

5. Making data publicly available

Reproducibility is the foundation of research

• Secondary use of data should be encouraged. For example, publish the data you used online (taking care with personal information).

• If code such as R scripts is also published, anyone can reproduce the analysis.

For corpus research

• Keep the data and notes from each step of the analysis, and report as much as possible in the paper.

• If research is your profession, spare no effort in publishing data and code and in practicing the reproduction of analyses.

How do I actually do it?

http://mizumot.com/files/ecs2015.html

http://onlinelibrary.wiley.com/doi/10.1111/lang.12134/full

http://www.iris-database.org/iris/app/home/index;jsessionid=CB9E46535FA0D81136CADA87BC414BA0

Open Science Framework https://osf.io/

Dataverse Project http://dataverse.org/

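Summary

• langtest.jp
  - Checking analyses from the “Handbook” and similar sources
  - A bridge to R

• Statistical analysis for corpus research: descriptive and inferential statistics, tests and effect sizes, correlation and multivariate analysis

• The advancing “statistical reform” and greater research transparency

For those who want to start corpus research using [logo]

http://www.slideshare.net/langstat/presentations

http://www.slideshare.net/sakaue/presentations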
