(Semi-) Big Data Corpora: New Challenges and New Solutions for Corpus Linguists

Tobias Gärtner, July 20th, 2016



Content

1. Introduction

2. Technical Prerequisites

3. Multivariate Procedures


New methods in corpus linguistics

Let’s assume:

Your corpus contains 7,834 texts with 4,441,087 tokens (ICLE & ICNALE)

You want to analyse the present perfect (+ continuous) active

Additionally, you assume an influence of several other variables

You don’t have an army of PhD students and student assistants


Why do we need to apply new methods?

Thus:

Manual counting is tedious

Manual counting is error-prone

Manual counting is expensive (in both time and money)

And you still need to analyse your findings


Why do we need to apply new methods?

Univariate statistical procedures become misleading once several variables are to be assessed simultaneously

Cross-tabulations become complex and confusing as they grow exponentially with the number of variables

ANOVAs and linear regressions return misleading (actually plain wrong) results for non-normally distributed count data
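
The growth of cross-tabulations can be made concrete with a minimal sketch. The variables and their numbers of levels below are hypothetical and only illustrate the principle: the number of cells is the product of the levels of all variables, so every added variable multiplies the table size.

```python
from math import prod

# Hypothetical categorical variables and their numbers of levels
# (illustrative values only, not the actual coding of the corpus).
levels = {"country": 19, "gender": 2, "major": 3, "tense_form": 2}

def crosstable_cells(level_counts):
    """Number of cells in a full cross-table of the given variables."""
    return prod(level_counts)

# Table size after adding one variable at a time.
cells = [
    crosstable_cells(list(levels.values())[:k])
    for k in range(1, len(levels) + 1)
]
print(cells)
```

With continuous variables such as text length the problem gets worse, because they would first have to be binned.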


Data Sources

Sources

  Internal

    Texts: BGSU1043, BGSU1089, . . .

    Database: ICLE Database, ICNALE Database

  External

    MRC Psycholinguistic Database

    Academic Word List


Possible independent variables to analyse

Variables shown in the figure (legend categories: Surface Structures, Psychological Variables, Degree of Academic Writing, Social Variables):

N Words, N Sentences, Avg. Sentence Length, Avg. Word Length, N Functional Words, N Lexical Words, N Active Sentences, N Passive Sentences, Negation, Subordinate Sentences, Type-Token Ratio, Flesch-Kincaid Reading-Ease, Log(N Words)

Academic Word List Score, Mean Cohesion (POS), Sd Cohesion (POS), Mean Cohesion (Word), Sd Cohesion (Word), Reader Visibility, Writer Visibility, N “not”

Mean N Phonemes, Mean N Letters, Mean N Syllables, Kucera-Francis Frequency, Age of Acquisition, Familiarity, Concreteness, Imagery, Mean Pavio Meaningfulness, Mean Colorado Meaningfulness

Gender, Age, Years of Tuition, Years at University, Major, Academic Genre, N Words, Country

Figure 1: Overview of independent variables
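
A few of the surface variables can be computed directly from plain text. The following is a minimal sketch with a naive regular-expression tokeniser; the function name and the sample text are hypothetical, and a real pipeline would rely on the tagger and parser output instead.

```python
import re

def surface_measures(text):
    """Compute a few surface variables for one non-empty text."""
    # Naive sentence split on ., ! or ?; abbreviations are not handled.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Naive word tokens: runs of letters and apostrophes, lower-cased.
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    n_words = len(tokens)
    return {
        "n_words": n_words,
        "n_sentences": len(sentences),
        "avg_sentence_length": n_words / len(sentences),
        "avg_word_length": sum(len(t) for t in tokens) / n_words,
        "type_token_ratio": len(set(tokens)) / n_words,
        "n_not": tokens.count("not"),
    }

sample = "I have not seen the film. I have been reading the book."
measures = surface_measures(sample)
print(measures)
```

The type-token ratio depends strongly on text length, which is one reason to include a length variable alongside it.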


Machine readable texts with meta information

Diagram nodes: Plain Text (ICLE, ICNALE), POS-Tagger, Parser, koRpus, LSA, pscl

Figure 2: Plain text to part-of-speech tagged XML files to an R data frame


The present perfect as a syntactical tree

(a) Present Perfect: [S [VP [VBP have] ... [VP [VBN Participle]]]]

(b) Present Perfect Progressive: [S [VP [VBP have] ... [VP [VBN been] [VP [VBG Gerund]]]]]

Figure 3: Syntactical trees using the Penn Treebank Tag Set


The present perfect in XQuery

    (: every VP that is a direct child of a sentence node :)
    for $i in collection("corpus")//eTree[@Label="S"]/eTree[@Label="VP"]
    where $i/eTree/@Label="VBP"
      and $i/eTree/eLeaf/@Text="have"
      and $i/eTree/@Label="VP"
      and $i/eTree/eTree/@Label="VBN"
      (: exclude "have been ...", i.e. the progressive and passive forms :)
      and not($i/eTree/eTree/eLeaf/@Text="been")
    return base-uri($i)

Listing 1: XQuery code for the present perfect
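
For readers without an XQuery processor, the same pattern can be sketched in Python with the standard library. The element and attribute names (eTree, eLeaf, Label, Text) follow Listing 1; the sample document, the root element and the function names are assumptions made for illustration. Unlike the listing, which returns the URI of each matching document, this sketch counts matches within one XML string.

```python
import xml.etree.ElementTree as ET

# Hypothetical parser output for "I have seen ..." and "... have been reading".
SAMPLE = """
<root>
  <eTree Label="S">
    <eTree Label="NP"><eTree Label="PRP"><eLeaf Text="I"/></eTree></eTree>
    <eTree Label="VP">
      <eTree Label="VBP"><eLeaf Text="have"/></eTree>
      <eTree Label="VP">
        <eTree Label="VBN"><eLeaf Text="seen"/></eTree>
      </eTree>
    </eTree>
  </eTree>
  <eTree Label="S">
    <eTree Label="VP">
      <eTree Label="VBP"><eLeaf Text="have"/></eTree>
      <eTree Label="VP">
        <eTree Label="VBN"><eLeaf Text="been"/></eTree>
        <eTree Label="VP"><eTree Label="VBG"><eLeaf Text="reading"/></eTree></eTree>
      </eTree>
    </eTree>
  </eTree>
</root>
"""

def is_present_perfect(vp):
    """Mirror the where-clause: each condition holds if any node satisfies it."""
    children = vp.findall("eTree")
    grandchildren = vp.findall("eTree/eTree")
    return (
        any(c.get("Label") == "VBP" for c in children)
        and any(leaf.get("Text") == "have" for leaf in vp.findall("eTree/eLeaf"))
        and any(c.get("Label") == "VP" for c in children)
        and any(g.get("Label") == "VBN" for g in grandchildren)
        and not any(
            leaf.get("Text") == "been" for leaf in vp.findall("eTree/eTree/eLeaf")
        )
    )

def count_present_perfect(xml_text):
    root = ET.fromstring(xml_text)
    # Counterpart of //eTree[@Label="S"]/eTree[@Label="VP"]
    vps = root.findall(".//eTree[@Label='S']/eTree[@Label='VP']")
    return sum(is_present_perfect(vp) for vp in vps)

n_hits = count_present_perfect(SAMPLE)
print(n_hits)
```

Only the first sentence matches; the second is excluded by the "been" condition, as in the listing.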


Collinearity

Diagram nodes: Corpus, Database, Tense and Aspect, Independent Variables, Factor Analysis, Variance Inflation Factor, Gauß Elimination, Linearly Independent Independent Variables, Multi-Collinearity

Figure 4: Procedure to avoid (multi-) collinearity
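
One step of the procedure, the variance inflation factor (VIF), can be sketched without external libraries. The sketch uses the identity that the VIF of predictor j is the j-th diagonal element of the inverse correlation matrix, and inverts that matrix by Gauss-Jordan elimination. The data and function names are hypothetical, and this is not necessarily how the values in the talk were computed.

```python
from statistics import mean, pstdev

def correlation_matrix(columns):
    """Pearson correlation matrix of a list of equally long columns."""
    n = len(columns[0])
    z = []
    for col in columns:
        m, s = mean(col), pstdev(col)
        z.append([(v - m) / s for v in col])
    return [[sum(a * b for a, b in zip(zi, zj)) / n for zj in z] for zi in z]

def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: perfect collinearity")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def variance_inflation_factors(columns):
    """VIF_j = j-th diagonal element of the inverse correlation matrix."""
    inv = invert(correlation_matrix(columns))
    return [inv[j][j] for j in range(len(columns))]

# Hypothetical predictors: x2 is almost exactly 2 * x1, x3 is nearly unrelated.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.1, 4.1, 5.8, 7.8, 10.1, 12.1]
x3 = [5.0, 1.0, 4.0, 2.0, 6.0, 3.0]
vifs = variance_inflation_factors([x1, x2, x3])
print([round(v, 2) for v in vifs])
```

A common rule of thumb treats a VIF above 10 as a sign of problematic collinearity; the first two predictors exceed it by far, the third stays close to 1.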


Linear Regression

[Histogram: x-axis "PP and PPP Constructions per Text" (0 to 20), y-axis "Frequency" (0 to 3500)]

Figure 5: Histogram of the present perfect active and the present perfect progressive active in ICLE & ICNALE


Linear Regressions

Three reasons why a linear regression is not a smart idea:

1. The dependent variable is by no means normally distributed (Anderson-Darling test: p < 2.2e−16)

2. The distribution is heavily skewed (skewness = 2.70988)

3. The dependent variable is count data, i.e. consists solely of integers
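
The second and third point can be checked on any vector of counts. Below is a minimal sketch using the moment-based skewness g1 = m3 / m2^1.5; the slides do not say which skewness estimator was used, and the counts are hypothetical.

```python
def skewness(values):
    """Moment-based sample skewness g1 = m3 / m2**1.5."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical counts per text: many zeros and a long right tail.
counts = [0] * 9 + [1] * 5 + [2] * 3 + [3, 4, 9]

g1 = skewness(counts)
is_count_data = all(isinstance(c, int) and c >= 0 for c in counts)
print(round(g1, 2), is_count_data)
```

A symmetric distribution has a skewness of 0, so a clearly positive value signals the long right tail that a linear regression with normal errors cannot represent.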


Count Data Regressions

So why not use classical count data regressions?

3534 out of 7834 texts (≈ 45%) do not contain the required construction!


Excess Zero Problem

There are several regression types that deal with the excess zero problem:

Hurdle models

Ordinal models

Zero-inflated models


Excess Zero Problem

Model                                   AIC-Value   Drop in AIC in %
Linear                                   30072.76   Reference Model
Poisson                                  23627.33   -21.43
Negative Binomial                        22244.58   -26.03
Ordinal                                  22270.14   -25.94
Hurdle (Poisson)                         22901.65   -23.85
Hurdle (Neg.Binomial)                    22799.20   -24.19
Zero-Infl. (Poisson/Logit)               22848.17   -24.02
Zero-Infl. (Poisson/Probit)              22861.85   -23.98
Zero-Infl. (Poisson/Cauchit)             22807.94   -24.16
Zero-Infl. (Poisson/Compl.Log.-Log.)     22833.85   -24.07
Zero-Infl. (Neg.Bin./Logit)              22132.78   -26.40
Zero-Infl. (Neg.Bin./Probit)             22133.28   -26.40
Zero-Infl. (Neg.Bin./Cauchit)            22135.46   -26.39
Zero-Infl. (Neg.Bin./Compl.Log.-Log.)    22135.61   -26.39
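
The last column appears to be the relative change in AIC against the linear reference model. A minimal sketch, assuming that definition, recomputes it for three rows of the table; the function name is hypothetical.

```python
LINEAR_AIC = 30072.76  # AIC of the linear reference model, from the table

def aic_drop_percent(aic, reference=LINEAR_AIC):
    """Relative change in AIC versus the reference model, in percent."""
    return (aic - reference) / reference * 100

# (AIC, reported drop in %) for three rows of the table
reported = {
    "Poisson": (23627.33, -21.43),
    "Negative Binomial": (22244.58, -26.03),
    "Zero-Infl. (Neg.Bin./Probit)": (22133.28, -26.40),
}

computed = {
    name: round(aic_drop_percent(aic), 2) for name, (aic, _) in reported.items()
}
print(computed)
```

A lower AIC indicates a better trade-off between fit and model complexity, so the more negative the drop, the better the model compared with the linear regression.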


Zero-Inflated Neg. Bin./Probit Regression

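\[
P(Y_i = y_i) =
\begin{cases}
\omega_i + (1 - \omega_i)\,\bigl(1 + \alpha\lambda_i^{c}\bigr)^{-\lambda_i^{1-c}/\alpha}, & y_i = 0 \\[6pt]
(1 - \omega_i)\,\dfrac{\Gamma\bigl(y_i + \lambda_i^{1-c}/\alpha\bigr)}{y_i!\;\Gamma\bigl(\lambda_i^{1-c}/\alpha\bigr)}\,\bigl(1 + \alpha\lambda_i^{c}\bigr)^{-\lambda_i^{1-c}/\alpha}\,\bigl(1 + \lambda_i^{-c}/\alpha\bigr)^{-y_i}, & y_i > 0
\end{cases}
\]

where \(\omega_i = \Phi(x_i'\beta)\) is the probability of an excess zero (probit link, \(\Phi\) being the standard normal distribution function)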
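
The probability function can be checked numerically. The following is a minimal sketch with the index c as a parameter; c = 1 (the common NB2 form) is an assumed default, since the slide does not state which value was used, and the function names and parameter values are hypothetical.

```python
from math import exp, lgamma, log
from statistics import NormalDist

def zinb_pmf(y, lam, alpha, omega, c=1.0):
    """Zero-inflated negative binomial probability P(Y = y).

    lam: mean of the count part, alpha: dispersion parameter,
    omega: probability of an excess zero, c: index of the NB family.
    """
    r = lam ** (1.0 - c) / alpha                   # lambda^(1-c) / alpha
    log_nb = (
        lgamma(y + r) - lgamma(y + 1) - lgamma(r)  # Gamma(y+r) / (y! Gamma(r))
        - r * log(1.0 + alpha * lam ** c)          # (1 + alpha*lam^c)^(-r)
        - y * log(1.0 + lam ** (-c) / alpha)       # (1 + lam^(-c)/alpha)^(-y)
    )
    nb = exp(log_nb)
    return omega + (1.0 - omega) * nb if y == 0 else (1.0 - omega) * nb

def probit_omega(linear_predictor):
    """Probit link: standard normal CDF of the linear predictor."""
    return NormalDist().cdf(linear_predictor)

omega = probit_omega(-0.5)  # hypothetical linear predictor
probs = [zinb_pmf(y, lam=2.0, alpha=0.4, omega=omega) for y in range(200)]
total = sum(probs)
expected = sum(y * p for y, p in enumerate(probs))
print(round(total, 6), round(expected, 6))
```

The probabilities sum to 1 and the expected count equals (1 − ω)λ, which is a quick sanity check for any implementation of the formula.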


Average Marginal Effects

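\[
\mathrm{AME}_i = \beta_i \, \frac{1}{n} \sum_{k=1}^{n} \phi(x_k'\beta) \qquad (1)
\]

where \(\phi\) is the standard normal density and \(x_k\) the regressor vector of text \(k\)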
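
Equation (1) translates almost literally into code. A minimal sketch for a probit model follows; coefficients and data are hypothetical and serve only to show the mechanics.

```python
from statistics import NormalDist

def probit_ame(beta, rows):
    """Average marginal effects of a probit model.

    beta: coefficients, intercept first.
    rows: one list of regressor values per observation, each starting
          with 1.0 for the intercept.
    """
    pdf = NormalDist().pdf
    # Mean standard normal density over all linear predictors x_k' beta.
    scale = sum(
        pdf(sum(b * x for b, x in zip(beta, row))) for row in rows
    ) / len(rows)
    return [b * scale for b in beta]

# Hypothetical coefficients and observations, for illustration only.
beta = [0.2, -0.5]
rows = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
ame = probit_ame(beta, rows)
print([round(a, 4) for a in ame])
```

Because the averaged density is the same for every coefficient, all AMEs are the coefficients multiplied by one common factor. In Table 1 this factor is about 0.13 (e.g. 0.163 / 1.2515).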


The Results I

Table 1: Zero-Inflated Negative Binomial Regression with Probit Link

Dependent variable: Present Perfect and Present Perfect Continuous

Columns 2 to 4: Zero-Inflated Neg. Bin. with Log. link; columns 5 to 8: Zero-Inflated Binomial with Probit link

Variable                    β-Coef.    S.E.  Sig   β-Coef.     AME      S.E.  Sig
(Intercept)                  0.8308  0.4779  .      1.2515   0.163    2.5103
Combined Text Length         0.3174  0.0424  ***   -1.1624  -0.151    0.2856  ***
Text Length Relations       -0.0370  0.0392        -0.5867  -0.076    0.2852  *
Type-Token Ratio            -0.1541  0.4098        -6.3566  -0.828    1.5727  ***
Cohesion (µ)                -0.1653  0.3487         1.5112   0.197    1.5061
Cohesion (σ)                -0.8735  0.7246        -3.4195  -0.445    2.7403
Academic Word List          -0.0108  0.3589        -0.8702  -0.113    2.3809
Reader/Writer Visibility 1  -0.0096  0.0230        -0.3583  -0.047    0.2306


The Results II

Variable                    β-Coef.    S.E.  Sig   β-Coef.     AME      S.E.  Sig
Reader/Writer Visibility 2  -0.0851  0.0361  *     -0.9777  -0.127    0.6859
Grade (truncated)           -0.0055  0.0159         0.1541    0.02    0.0966
Years of English Tuition    -0.0066  0.0079        -0.0139  -0.002    0.0350
Age (truncated)             -0.0189  0.0058  **    -0.0246  -0.003    0.0491
Female                       0.0624  0.0387         0.5358    0.07    0.3035  .
  Default = Male
Social Sciences             -0.3159  0.1340  *      0.1415   0.018    0.2818
Science                     -0.3562  0.1249  **     0.0200   0.003    0.3054
  Default = Humanities
MRC Length                   0.0981  0.0386  *      0.2061   0.027    0.1863
MRC Frequency               -0.0437  0.0237  .     -0.1096  -0.014    0.1360
MRC Concreteness            -0.0005  0.0217         0.1852   0.024    0.0834  *
Belgium                      0.4296  0.1772  *     -0.5692  -0.074    3.0118
Botswana                    -0.1483  0.1929        -0.2078  -0.027    1.1476
Bulgaria                     0.4849  0.1795  **    -0.1022  -0.013    1.5073
Czech Republic               0.1927  0.1966         1.1815   0.154    1.3433
ESL                         -0.3318  0.1930  .      0.9345   0.122    1.3812
Finland                      0.6938  0.1808  ***    0.3254   0.042    1.4068
Greater China               -0.1601  0.1844         0.6470   0.084    1.2497


The Results III

Variable                    β-Coef.    S.E.  Sig   β-Coef.     AME      S.E.  Sig
Germanic                     0.2204  0.1815         1.3361   0.174    1.1574
IndoThai                    -0.8334  0.2687  **     0.3768   0.049    1.3308
Italy                        0.5099  0.1883  **    -0.1500   -0.02    1.8934
Japan                       -0.1109  0.1875         0.8262   0.108    1.2544
Korea                       -0.6147  0.2900  *      1.1176   0.146    1.2690
Norway                       0.7539  0.1711  ***   -3.8321  -0.499  136.0929
Poland                       0.2089  0.1857         0.7670     0.1    1.7382
Russia                       0.1917  0.2009         1.5672   0.204    1.1829
Spain                        0.6052  0.1860  **     0.4621    0.06    1.4366
Sweden                       0.7369  0.1701  ***   -3.5256  -0.459  147.8183
Turkey                       0.0886  0.2032         1.0227   0.133    1.4696
  Default = NS

Log(theta)                   0.8715  0.0720  ***

Note: Significance Levels: . p<0.1; * p<0.05; ** p<0.01; *** p<0.001
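
The count-part coefficients are easier to read after exponentiation. This sketch assumes that "Log. link" in the table header denotes the usual log link of the negative binomial part; under that assumption exp(β) is the factor by which the expected count changes, other variables held constant and relative to the reference category (native speakers for the country variables). The three coefficients are copied from the table.

```python
from math import exp

# Count-part coefficients copied from Table 1 (assumed log link).
count_coefficients = {
    "Finland": 0.6938,
    "Korea": -0.6147,
    "Age (truncated)": -0.0189,
}

# With a log link, exp(beta) is the multiplicative change in the expected count.
rate_ratios = {name: round(exp(b), 3) for name, b in count_coefficients.items()}
print(rate_ratios)
```

Under this reading, texts from the Finnish subcorpus are expected to contain roughly twice as many present perfect constructions as native-speaker texts, and texts from the Korean subcorpus roughly half as many, within the count component of the model.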
