(Semi-) Big Data Corpora: New Challenges and New Solutions for Corpus Linguists

Tobias Gartner July 20th, 2016

Content

1. Introduction

2. Technical Prerequisites

3. Multivariate Procedures


New methods in corpus linguistics

Let’s assume:

Your corpus contains 7,834 texts with 4,441,087 tokens (ICLE & ICNALE)

You want to analyse the present perfect (+ continuous) active

Additionally, you assume an influence of several other variables

You don’t have an army of PhD students and student assistants


Why do we need to apply new methods?

Thus:

Manual counting is tedious

Manual counting is error prone

Manual counting is expensive (in both time and money)

and you still need to analyse your findings


Why do we need to apply new methods?

Univariate statistical procedures become misleading once several variables are to be assessed simultaneously

Crosstables become complex and confusing as they grow exponentially with the number of variables

ANOVAs and linear regressions return misleading (actually plain wrong) results for non-normally distributed count data
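
The growth of crosstables can be made concrete with a minimal sketch: a full cross-tabulation has one cell per combination of levels, so its size is the product of the level counts. The function name and the level counts below are illustrative assumptions, not values from the corpus.

```python
from math import prod


def crosstab_cells(levels_per_variable):
    """Number of cells in a full cross-tabulation of categorical variables."""
    return prod(levels_per_variable)


# Hypothetical example: every variable has three levels.
for n_variables in range(1, 7):
    print(n_variables, crosstab_cells([3] * n_variables))
```

With k variables of m levels each the table has m^k cells, which is why inspecting such tables by eye stops being practical after a handful of variables.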


Data Sources

Sources

  Internal

    Texts: BGSU1043, BGSU1089, . . .

    Database: ICLE Database, ICNALE Database

  External

    MRC Psycholinguistic Database

    Academic Word List


Possible independent variables to analyse

Figure 1: Overview of independent variables. The figure groups the variables into four categories: surface structures, psychological variables, degree of academic writing, and social variables.

Variables shown: N Words; N Sentences; Avg. Sentence Length; Avg. Word Length; N Functional Words; N Lexical Words; N Active Sentences; N Passive Sentences; Negation; Subordinate Sentences; Type-Token Ratio; Flesch-Kincaid Reading-Ease; Log(N Words); Academic Word List Score; Mean Cohesion (POS); Sd Cohesion (POS); Mean Cohesion (Word); Sd Cohesion (Word); Reader Visibility; Writer Visibility; N “not”; Mean N Phonemes; Mean N Letters; Mean N Syllables; Kucera-Francis Frequency; Age Of Acquisition; Familiarity; Concreteness; Imagery; Mean Paivio Meaningfulness; Mean Colorado Meaningfulness; Gender; Age; Years Of Tuition; Years At University; Major; Academic Genre; N Words; Country.


Machine-readable texts with meta information

Figure 2: From plain text to part-of-speech-tagged XML files to an R data frame. Components shown: plain text (ICLE, ICNALE), POS tagger, parser, koRpus, LSA, pscl.


The present perfect as a syntactical tree

(a) Present Perfect

[S [VP [VBP have] ... [VP [VBN Participle]]]]

(b) Present Perfect Progressive

[S [VP [VBP have] ... [VP [VBN been] [VP [VBG Gerund]]]]]

Figure 3: Syntactical trees using the Penn Treebank Tag Set

The present perfect in XQuery

    (: every VP that is a direct child of an S node :)
    for $i in collection("corpus")//eTree[@Label="S"]/eTree[@Label="VP"]
    where $i/eTree/@Label="VBP"
      and $i/eTree/eLeaf/@Text="have"
      and $i/eTree/@Label="VP"
      and $i/eTree/eTree/@Label="VBN"
      (: exclude "have been ...", i.e. the progressive :)
      and not($i/eTree/eTree/eLeaf/@Text="been")
    return base-uri($i)

Listing 1: XQuery code for the present perfect
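
For readers without an XQuery processor, the same tree pattern can be sketched in Python with the standard library. The element and attribute names (eTree, eLeaf, Label, Text) are taken from Listing 1; the two sample sentences, the ROOT wrapper and the function names are assumptions made for illustration. Unlike the query, which returns the URI of each matching document, the sketch simply counts matches in one XML string.

```python
import xml.etree.ElementTree as ET

# Hypothetical parse trees for "I have seen" and "I have been waiting".
PERFECT = """<eTree Label="ROOT"><eTree Label="S">
  <eTree Label="NP"><eTree Label="PRP"><eLeaf Text="I"/></eTree></eTree>
  <eTree Label="VP">
    <eTree Label="VBP"><eLeaf Text="have"/></eTree>
    <eTree Label="VP"><eTree Label="VBN"><eLeaf Text="seen"/></eTree></eTree>
  </eTree>
</eTree></eTree>"""

PROGRESSIVE = """<eTree Label="ROOT"><eTree Label="S">
  <eTree Label="NP"><eTree Label="PRP"><eLeaf Text="I"/></eTree></eTree>
  <eTree Label="VP">
    <eTree Label="VBP"><eLeaf Text="have"/></eTree>
    <eTree Label="VP">
      <eTree Label="VBN"><eLeaf Text="been"/></eTree>
      <eTree Label="VP"><eTree Label="VBG"><eLeaf Text="waiting"/></eTree></eTree>
    </eTree>
  </eTree>
</eTree></eTree>"""


def is_present_perfect(vp):
    """Apply the conditions of the where clause to one VP node."""
    child_labels = [c.get("Label") for c in vp.findall("eTree")]
    child_words = [l.get("Text") for l in vp.findall("eTree/eLeaf")]
    grandchild_labels = [g.get("Label") for g in vp.findall("eTree/eTree")]
    grandchild_words = [l.get("Text") for l in vp.findall("eTree/eTree/eLeaf")]
    return ("VBP" in child_labels and "have" in child_words
            and "VP" in child_labels and "VBN" in grandchild_labels
            and "been" not in grandchild_words)


def count_present_perfect(xml_text):
    """Count VP nodes directly under an S node that match the pattern."""
    root = ET.fromstring(xml_text)
    hits = 0
    for node in root.iter("eTree"):
        if node.get("Label") == "S":
            for vp in node.findall("eTree"):
                if vp.get("Label") == "VP" and is_present_perfect(vp):
                    hits += 1
    return hits


print(count_present_perfect(PERFECT), count_present_perfect(PROGRESSIVE))
```

Like the query, each condition is existential: it is enough that some child or grandchild satisfies it, so the sketch inherits the same permissiveness.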


Collinearity

Figure 4: Procedure to avoid (multi-)collinearity. Elements shown: corpus; database; tense and aspect; independent variables; multi-collinearity; factor analysis; variance inflation factor; Gauß elimination; linearly independent independent variables.
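
One of the steps named in Figure 4, Gauß elimination, can be sketched as follows: row-reducing the data matrix reveals which columns are exact linear combinations of earlier ones. This is a minimal illustration under stated assumptions, not the procedure actually used for the study; the function name and the toy data (a third column that is the sum of the first two) are hypothetical.

```python
from fractions import Fraction


def independent_columns(rows):
    """Return the indices of columns that are not exact linear
    combinations of earlier columns (pivot columns of the echelon form)."""
    matrix = [[Fraction(value) for value in row] for row in rows]
    n_rows, n_cols = len(matrix), len(matrix[0])
    pivots = []
    current = 0  # row in which the next pivot is placed
    for col in range(n_cols):
        pivot_row = next(
            (r for r in range(current, n_rows) if matrix[r][col] != 0), None
        )
        if pivot_row is None:
            continue  # column depends on the pivot columns found so far
        matrix[current], matrix[pivot_row] = matrix[pivot_row], matrix[current]
        for r in range(current + 1, n_rows):
            factor = matrix[r][col] / matrix[current][col]
            if factor != 0:
                matrix[r] = [a - factor * b
                             for a, b in zip(matrix[r], matrix[current])]
        pivots.append(col)
        current += 1
        if current == n_rows:
            break
    return pivots


# Toy data: column 2 equals column 0 + column 1 in every row.
toy = [[3, 5, 8], [2, 7, 9], [4, 1, 5], [6, 2, 8]]
print(independent_columns(toy))
```

Exact arithmetic with Fraction avoids rounding issues in this toy case; near-collinearity, as opposed to exact dependence, is what a variance inflation factor would flag.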


Linear Regression

Figure 5: Histogram of the present perfect active and the present perfect progressive active in ICLE & ICNALE (x-axis: PP and PPP constructions per text, 0 to 20; y-axis: frequency, 0 to 3500)


Linear Regressions

Three reasons why a linear regression is not a smart idea:

1. The dependent variable is by no means normally distributed (Anderson-Darling test: p < 2.2e−16)

2. The distribution is heavily skewed (skewness = 2.70988)

3. The dependent variable is count data, i.e. consists solely of integers
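
As a rough illustration of the second point, the moment coefficient of skewness can be computed in a few lines. The slides do not say which skewness estimator produced 2.70988, so the formula below (third central moment divided by the 1.5th power of the second) is an assumption, and the counts are invented toy data.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2 ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical counts per text: many zeros, a few large values.
toy_counts = [0] * 9 + [1] * 5 + [2] * 3 + [3] * 2 + [8]
print(skewness(toy_counts))
```

A symmetric sample yields a value of zero; a long right tail, typical of counts with many zeros, yields a clearly positive value.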


Count Data Regressions

So why not use classical count data regressions?

3534 out of 7834 texts (≈ 45%) do not contain the required construction!
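
A quick way to see why so many zeros are a problem is to compare the observed share of zeros with the share a Poisson distribution with the same mean would imply, which is exp(−mean). The sketch below uses invented toy counts, since the slides report only the number of zero texts and not the mean count; the function names are hypothetical.

```python
from math import exp


def observed_zero_share(counts):
    """Fraction of observations that are exactly zero."""
    return sum(1 for c in counts if c == 0) / len(counts)


def poisson_zero_share(counts):
    """Zero probability of a Poisson distribution with the sample mean."""
    return exp(-sum(counts) / len(counts))


# Hypothetical counts per text with 45 % zeros.
toy_counts = [0] * 45 + [1] * 15 + [2] * 15 + [3] * 10 + [5] * 10 + [8] * 5
print(observed_zero_share(toy_counts), poisson_zero_share(toy_counts))

# Share of zero texts reported on the slide.
print(3534 / 7834)
```

When the observed share is far above the Poisson-implied share, as in the toy data, a plain Poisson regression cannot reproduce the zeros.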


Excess Zero Problem

There are several regression types that deal with the excess zero problem:

Hurdle models

Ordinal models

Zero-inflated models


Excess Zero Problem

Model                                     AIC-Value   Drop in AIC in %
Linear                                     30072.76   Reference Model
Poisson                                    23627.33   -21.43
Negative Binomial                          22244.58   -26.03
Ordinal                                    22270.14   -25.94
Hurdle (Poisson)                           22901.65   -23.85
Hurdle (Neg.Binomial)                      22799.20   -24.19
Zero-Infl. (Poisson/Logit)                 22848.17   -24.02
Zero-Infl. (Poisson/Probit)                22861.85   -23.98
Zero-Infl. (Poisson/Cauchit)               22807.94   -24.16
Zero-Infl. (Poisson/Compl.Log.-Log.)       22833.85   -24.07
Zero-Infl. (Neg.Bin./Logit)                22132.78   -26.4
Zero-Infl. (Neg.Bin./Probit)               22133.28   -26.4
Zero-Infl. (Neg.Bin./Cauchit)              22135.46   -26.39
Zero-Infl. (Neg.Bin./Compl.Log.-Log.)      22135.61   -26.39
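
The percentage column appears to be the relative change in AIC against the linear reference model. A minimal sketch of that calculation, using a few AIC values from the table (the function name is an assumption, and last-digit differences are possible because the printed AIC values are themselves rounded):

```python
# AIC values copied from the table above.
AIC = {
    "Linear": 30072.76,
    "Poisson": 23627.33,
    "Negative Binomial": 22244.58,
    "Zero-Infl. (Neg.Bin./Logit)": 22132.78,
}


def aic_drop_percent(aic, reference):
    """Relative change in AIC against the reference model, in percent."""
    return 100.0 * (aic - reference) / reference


for model, aic in AIC.items():
    if model != "Linear":
        print(model, round(aic_drop_percent(aic, AIC["Linear"]), 2))
```

Lower AIC indicates a better trade-off between fit and complexity, so the most negative percentage marks the preferred model.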


Zero-Inflated Neg. Bin./Probit Regression

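\[
P(Y_i = y_i) =
\begin{cases}
\omega_i + (1-\omega_i)\,(1+\alpha\lambda_i^{c})^{-\lambda_i^{1-c}/\alpha}, & y_i = 0 \\[6pt]
(1-\omega_i)\,\dfrac{\Gamma(y_i + \lambda_i^{1-c}/\alpha)}{y_i!\,\Gamma(\lambda_i^{1-c}/\alpha)}\,(1+\alpha\lambda_i^{c})^{-\lambda_i^{1-c}/\alpha}\,(1+\lambda_i^{-c}/\alpha)^{-y_i}, & y_i > 0
\end{cases}
\]

where $\omega = \Phi(X'\beta)$, with $\Phi$ the standard normal distribution function (probit link)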
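
A minimal Python sketch of this probability function may help to check it numerically. It assumes the parameterisation with mean λ, dispersion α, index c and zero-inflation probability ω, in which the zero case is ω + (1 − ω)(1 + αλ^c)^(−λ^(1−c)/α) and the last factor of the positive case is (1 + λ^(−c)/α)^(−y). The function names and the parameter values are illustrative only.

```python
from math import erf, exp, lgamma, log, sqrt


def probit_probability(eta):
    """Standard normal distribution function, used as the probit link."""
    return 0.5 * (1.0 + erf(eta / sqrt(2.0)))


def zinb_pmf(y, lam, alpha, c, omega):
    """Zero-inflated negative binomial probability of the count y."""
    size = lam ** (1 - c) / alpha  # lambda^(1-c) / alpha
    log_nb = (lgamma(y + size) - lgamma(y + 1) - lgamma(size)
              - size * log(1 + alpha * lam ** c)
              - y * log(1 + lam ** (-c) / alpha))
    nb = exp(log_nb)
    if y == 0:
        return omega + (1 - omega) * nb
    return (1 - omega) * nb


# Hypothetical parameters: the probabilities should sum to one and
# the mean should equal (1 - omega) * lam.
probs = [zinb_pmf(y, 2.0, 0.5, 0.0, 0.3) for y in range(400)]
print(sum(probs), sum(y * p for y, p in enumerate(probs)))
```

Working on the log scale with lgamma avoids overflow of the gamma function for larger counts.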


Average Marginal Effects

\[
\mathrm{AME}_i = \beta_i \, \frac{1}{n} \sum_{k=1}^{n} \phi(x_k'\beta) \tag{1}
\]

where $\phi$ is the standard normal density
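
Equation (1) can be sketched directly: evaluate the standard normal density at each observation's linear predictor, average, and scale each coefficient by that average. The function names and the toy data below are assumptions for illustration; the actual estimates come from the fitted model.

```python
from math import exp, pi, sqrt


def normal_pdf(z):
    """Standard normal density."""
    return exp(-0.5 * z * z) / sqrt(2.0 * pi)


def average_marginal_effects(beta, rows):
    """AME_i = beta_i * mean over observations of phi(x_k' beta)."""
    mean_density = sum(
        normal_pdf(sum(b * x for b, x in zip(beta, row))) for row in rows
    ) / len(rows)
    return [b * mean_density for b in beta]


# Hypothetical coefficients and two observations whose linear predictor is 0.
beta = [1.0, -2.0]
rows = [[1.0, 0.5], [2.0, 1.0]]
print(average_marginal_effects(beta, rows))
```

Because every coefficient is multiplied by the same averaged density, the marginal effects keep the signs and relative sizes of the probit coefficients.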


The Results I

Table 1: Zero-Inflated Negative Binomial Regression with Probit Link

Dependent variable: Present Perfect and Present Perfect Continuous

                             Zero-Inflated Neg. Bin.   Zero-Inflated Binomial
                             with Log. link            with Probit link
Variable                     β-Coef.    S.E.           β-Coef.     AME      S.E.
(Intercept)                   0.8308  0.4779 .          1.2515   0.163    2.5103
Combined Text Length          0.3174  0.0424 ***       -1.1624  -0.151    0.2856 ***
Text Length Relations        -0.0370  0.0392           -0.5867  -0.076    0.2852 *

Type-Token Ratio             -0.1541  0.4098           -6.3566  -0.828    1.5727 ***
Cohesion (µ)                 -0.1653  0.3487            1.5112   0.197    1.5061
Cohesion (σ)                 -0.8735  0.7246           -3.4195  -0.445    2.7403

Academic Word List           -0.0108  0.3589           -0.8702  -0.113    2.3809

Reader/Writer Visibility 1   -0.0096  0.0230           -0.3583  -0.047    0.2306


The Results II

                             Zero-Inflated Neg. Bin.   Zero-Inflated Binomial
                             with Log. link            with Probit link
Variable                     β-Coef.    S.E.           β-Coef.     AME      S.E.
Reader/Writer Visibility 2   -0.0851  0.0361 *         -0.9777  -0.127    0.6859

Grade (truncated)            -0.0055  0.0159            0.1541    0.02    0.0966
Years of English Tuition     -0.0066  0.0079           -0.0139  -0.002    0.0350
Age (truncated)              -0.0189  0.0058 **        -0.0246  -0.003    0.0491

Female                        0.0624  0.0387            0.5358    0.07    0.3035 .
  Default = Male

Social Sciences              -0.3159  0.1340 *          0.1415   0.018    0.2818
Science                      -0.3562  0.1249 **         0.0200   0.003    0.3054
  Default = Humanities

MRC Length                    0.0981  0.0386 *          0.2061   0.027    0.1863
MRC Frequency                -0.0437  0.0237 .         -0.1096  -0.014    0.1360
MRC Concreteness             -0.0005  0.0217            0.1852   0.024    0.0834 *

Belgium                       0.4296  0.1772 *         -0.5692  -0.074    3.0118
Botswana                     -0.1483  0.1929           -0.2078  -0.027    1.1476
Bulgaria                      0.4849  0.1795 **        -0.1022  -0.013    1.5073
Czech Republic                0.1927  0.1966            1.1815   0.154    1.3433
ESL                          -0.3318  0.1930 .          0.9345   0.122    1.3812
Finland                       0.6938  0.1808 ***        0.3254   0.042    1.4068
Greater China                -0.1601  0.1844            0.6470   0.084    1.2497


The Results III

                             Zero-Inflated Neg. Bin.   Zero-Inflated Binomial
                             with Log. link            with Probit link
Variable                     β-Coef.    S.E.           β-Coef.     AME      S.E.
Germanic                      0.2204  0.1815            1.3361   0.174    1.1574
IndoThai                     -0.8334  0.2687 **         0.3768   0.049    1.3308
Italy                         0.5099  0.1883 **        -0.1500   -0.02    1.8934
Japan                        -0.1109  0.1875            0.8262   0.108    1.2544
Korea                        -0.6147  0.2900 *          1.1176   0.146    1.2690
Norway                        0.7539  0.1711 ***       -3.8321  -0.499  136.0929
Poland                        0.2089  0.1857            0.7670     0.1    1.7382
Russia                        0.1917  0.2009            1.5672   0.204    1.1829
Spain                         0.6052  0.1860 **         0.4621    0.06    1.4366
Sweden                        0.7369  0.1701 ***       -3.5256  -0.459  147.8183
Turkey                        0.0886  0.2032            1.0227   0.133    1.4696
  Default = NS

Log(theta)                    0.8715  0.0720 ***

Note: Significance Levels: . p<0.1; * p<0.05; ** p<0.01; *** p<0.001