
The idea Corpus Data Mining Experiment Modelling

Rules and Exceptions in Language Dynamics:

a Quantitative Investigation

Martina Pugliese

Sapienza Università di Roma, Dipartimento di Fisica

Joint work with Prof. V. Loreto, C. Cuskley, C. Castellano, F. Colaiori, F. Tria

Final Seminar for the PhD program

Roma, 29th October 2014


Human Languages: Rules versus Exceptions

The past tense as the object of investigation

Human languages are structured by syntactic rules, which present exceptions in the form of irregularities

Past tense formation is a typical example: one rule (the regular -ed form) and many irregular forms

play → played (regular)

swim → swam (irregular)

sneak → snuck? sneaked?

I saw him run after a gilded

butterfly: and when he caught it,

he let it go again; and after it

again; and over and over he comes,

and again; catched it again (...)

— W. Shakespeare, Coriolanus


Outline of the work

Tackling the problem from different points of view

Succinct summary of the literature in the field

Regularization is the expected phenomenon

Irregularization is typically considered irrelevant

Frequency plays the leading role: low-frequency verbs are more prone to regularize

The sociolinguistic environment of speakers influences the regularity of verbs

We will explore the past tense problem with three parallel approaches:

1 Data Mining on a Linguistic Corpus

2 Experiment on novel verbal forms

3 Agent-based modelling of competing inflections


(Ir)Regularity of verbs: a corpus perspective

The Corpus of Historical American English (CoHA)

CoHA: $400 \cdot 10^{6}$ written words in the period 1810-2009

The data we used

verbs in 1830-1989: 16 decades, diachronic perspective

dataset confined to the size of the first decade ($\approx 2.1 \cdot 10^{6}$ tokens)

threshold on frequency to define regularity

Core verbs and extended vocabulary analyses

I: irregularity proportion (irregular past tokens / total past tokens)

f: frequency of lemma

Root: set of lemmas sharing a verb root (e.g. go, forego, undergo, ...)

Core: existing in every decade

Extended: entire vocabulary
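As an illustration of the definition of I, here is a minimal Python sketch that computes the irregularity proportion per lemma from past-tense token counts. The function name, the input layout and the counts are assumptions made for this example; they are not taken from the CoHA analysis.

```python
from collections import defaultdict


def irregularity_proportion(past_tokens):
    """Return I per lemma: irregular past tokens / total past tokens.

    past_tokens: iterable of (lemma, is_irregular, count) tuples.
    Lemmas without any past-tense token are left out (I is undefined).
    """
    irregular = defaultdict(int)
    total = defaultdict(int)
    for lemma, is_irregular, count in past_tokens:
        total[lemma] += count
        if is_irregular:
            irregular[lemma] += count
    return {lemma: irregular[lemma] / total[lemma]
            for lemma in total if total[lemma] > 0}


# Made-up counts for a single decade, for illustration only.
toy_counts = [
    ("sneak", True, 30),    # "snuck"
    ("sneak", False, 70),   # "sneaked"
    ("play", False, 200),   # "played"
    ("swim", True, 50),     # "swam"
]
I = irregularity_proportion(toy_counts)
```

With these toy counts, sneak gets I = 0.3, play gets I = 0 and swim gets I = 1.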



Language as an open system

The number of types changes in time

Types counted: all, mostly I, mostly R and undefined (mostly I and mostly R are separated by a threshold on I at 0.5)

[Figure: number of types per decade, 1830-1980, for verbs and roots. Panel A: all types. Panel B: mostly R types. Panel C: mostly IR types. Panel D: undefined types.]

Overall increase in vocabulary size

mostly R and undefined types increase in number

mostly I types are constant in number



The situation of the core roots in the last decade

An evolving picture of two opposite behaviours

stable IR (IR in every decade); stable R (R in every decade); active (0 < I < 1 in at least one decade, threshold at 1%)
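The three labels can be sketched as a small classifier over a root's I values, one per decade. This is only an illustration: the function name is hypothetical, and reading the 1% threshold as a tolerance eps = 0.01 around I = 0 and I = 1 is an assumption, since the slide does not spell out how the threshold is applied.

```python
def classify_root(I_by_decade, eps=0.01):
    """Label a core root from its I values, one value per decade.

    Assumption: the 1% threshold is a tolerance, so I <= eps counts as
    regular (R) and I >= 1 - eps counts as irregular (IR).
    """
    values = list(I_by_decade)
    if not values:
        raise ValueError("need at least one decade")
    if any(eps < v < 1 - eps for v in values):
        return "active"          # mixed usage in at least one decade
    if all(v >= 1 - eps for v in values):
        return "stable IR"       # irregular in every decade
    if all(v <= eps for v in values):
        return "stable R"        # regular in every decade
    # Jumps between the two extremes without a mixed decade:
    # not covered by the three definitions, so left unlabelled.
    return "unclassified"
```

For example, `classify_root([1.0, 0.6, 0.2])` returns `"active"`, while `classify_root([0.0, 0.0, 0.005])` returns `"stable R"`.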

[Figure: core roots in the (f, I) plane; f on a logarithmic axis from 10^-6 to 1, I from 0 to 1, color scale from 0 to 1.]

Color coding: average I across time

Arrows: trajectory from first to last defined occurrence

Purple curves: binned values

Verbs are in a dynamic state: some regularize, some irregularize, some are stable



How do core roots navigate the plane (f, I)?

The active roots pattern in a cloud

$d = \sqrt{(\delta f)^2 + (\delta I)^2}, \qquad \delta f = \frac{\Delta f}{\Delta t}, \qquad \delta I = \frac{\Delta I}{\Delta t}$
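The displacement d can be computed directly from two points of a root's trajectory in the (f, I) plane. The sketch below assumes that the differences are taken between the first and the last observation of the root and that time is measured in decades; neither choice is stated on the slide, and the function name is hypothetical.

```python
import math


def displacement(f_first, I_first, t_first, f_last, I_last, t_last):
    """Return d = sqrt(delta_f**2 + delta_I**2), where delta_f and
    delta_I are the changes in f and I divided by the elapsed time.

    Assumption: the two points are the first and last observations
    of the root, with t expressed in decades.
    """
    dt = t_last - t_first
    if dt <= 0:
        raise ValueError("t_last must be later than t_first")
    delta_f = (f_last - f_first) / dt
    delta_I = (I_last - I_first) / dt
    return math.hypot(delta_f, delta_I)


# Made-up trajectory: I drops from 0.8 to 0.5 over 15 decades
# while f doubles from 1e-4 to 2e-4.
d_example = displacement(1e-4, 0.8, 0, 2e-4, 0.5, 15)
```

Because f is typically orders of magnitude smaller than I, the value of d for a root like this one is dominated by the change in I.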

[Figure: d versus f on logarithmic axes (d from 10^-8 to 10^-1, f from 10^-6 to 1), with a color scale from 0 to 1.]

f: frequency averaged over time

Bigger points: active roots, color-coded with average I across time

Smaller dots: the stable roots



Active roots and the variation in I

Confirmation of two opposite and balanced forces

∆I > 0: irregularization; ∆I < 0: regularization

f: frequency averaged over time

decreasing trend with increasing f

the numbers in the two subplots are comparable

[Figure: two subplots against f on logarithmic axes: ∆I for roots with ∆I > 0, and −∆I for roots with ∆I < 0.]



Phonological classification of roots

Classes as phonological attractors

Regularization: broad application of the dominant rule, the individual fails in retrieving the irregular form (morphological level)

What drives irregularization?

What is the source of activity in dynamic verbs?

What keeps the number of irregular types constant?

↓

Phonological classification of roots:

Roots with I > 0 classified according to the phonological change from infinitive to past tense → e.g., sing-sang and ring-rang

Size of a class is proportional to its number of members

Frequency of a class is the sum of the frequencies of its members

Classes may work as attractors, conditioning dynamics
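The two class-level quantities, size and summed frequency, can be sketched as a simple aggregation over roots. The class labels and frequencies below are made up for illustration, and the function name is hypothetical; only the sing/ring grouping comes from the slide.

```python
from collections import defaultdict


def summarize_classes(roots):
    """Aggregate roots into phonological classes.

    roots: iterable of (root, class_label, frequency) tuples.
    Returns {class_label: (size, f_sum)}, where size is the number of
    member roots and f_sum is the sum of their frequencies.
    """
    size = defaultdict(int)
    f_sum = defaultdict(float)
    for _root, label, frequency in roots:
        size[label] += 1
        f_sum[label] += frequency
    return {label: (size[label], f_sum[label]) for label in size}


# Hypothetical labels and frequencies, for illustration only.
toy_roots = [
    ("sing", "i->a", 3e-4),
    ("ring", "i->a", 1e-4),
    ("burn", "+t", 2e-4),
]
classes = summarize_classes(toy_roots)
```

Here the class containing sing and ring has size 2 and a summed frequency of 4e-4.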


Dynamics of classes: a clarified picture

The evolution of four pivotal classes

Bubbles are classes, grey points at the bottom are regular roots

[Figure: classes in the (f_sum, I) plane, with f_sum on a logarithmic axis and I from 0 to 1; four further panels show the classes of burn, dwell, hide and sing.]

Points are less scattered, window is narrower


Is irregularity backed by a cognitive process?

How do individuals choose the past tense ending in the first place?

If I do not know or recall a verb, which ending do I choose?

↓

Does this change depending on whether I am a native speaker?



How does the experiment work

Providing the past tense of non-existing verbs (non-verbs)

Non-verbs are built using the phonological distance with existing verbs and are categorized as Regular, Irregular and Duplicate
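The slide does not say which phonological distance is used. As a stand-in, the sketch below uses the plain Levenshtein edit distance on spellings to rank existing verbs by closeness to a candidate non-verb. The function names, the candidate "spling" and the verb list are all assumptions made for this example.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insertions, deletions
    and substitutions all cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def nearest_verbs(non_verb, verbs):
    """Return (distance, verb) pairs sorted from closest to farthest."""
    return sorted((levenshtein(non_verb, v), v) for v in verbs)


# Hypothetical non-verb compared with a few existing verbs.
ranking = nearest_verbs("spling", ["sing", "play", "spring"])
```

A real implementation would operate on phonemic transcriptions rather than on spellings, so the distances here are only indicative.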

[Flowchart of the experiment, from START to END. Information part: ask the user for information about languages ("Is English your first language? Do you speak other languages?").]



The irregular rates divided by stimulus category

Non-native and native outcomes are different

[Figure: irregular rate (0 to 0.7) by non-verb category (Regular, Duplicate, Irregular), for natives and non-natives.]



How many, and which, irregular responses do we get?

Different choices for natives and non-natives

The irregular responses can be divided into 7 linguistic categories

[Figure: number of users versus number of irregular responses (0 to 16), for natives, non-natives and all users.]

[Figure: frequency of each irregular category (Reg., VC Lev., Other, VC+d, Fin+t/d, VC+t, Ruck.) for CoHA, all users, natives and non-natives.]

Native speakers weigh the regular category more than non-natives


A three-state model of competing inflections

The rules of the model

Before          After
s       h       s    h
I       I       I    I
R       R       R    R
I       R       I    M
R       I       R    M
I       M       I    I
R       M       R    R
M(I)    I       I    I
M(R)    I       M    M
M(I)    R       M    M
M(R)    R       R    R
M(I)    M       I    I
M(R)    M       R    R

s: speaker; h: hearer

Three possible inflections: R (regular), I (irregular), M (mixed: both endings possible)

At each time step, s and h interact over a lemma whose frequency is f

When h does not have the uttered inflection, h adds it

When h has the uttered inflection, both s and h delete the other one

At each time step, a randomly selected agent is replaced, with probability r, by one who only has R inflections
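The rules above can be sketched as a small agent-based simulation for a single lemma. Several details are assumptions not stated on the slide: M(I) and M(R) are read as a mixed speaker uttering I or R, a mixed speaker picks either form with equal probability, an interaction takes place with probability f at each step, and the agents that do not start in I start in R. All names and default values are hypothetical.

```python
import random


def interact(speaker, hearer, uttered):
    """Apply one row of the rule table; return the new (speaker, hearer)."""
    if hearer == uttered or hearer == "M":
        # The hearer has the uttered form: both keep only that form.
        return uttered, uttered
    # The hearer lacks the uttered form: the hearer adds it.
    return speaker, "M"


def simulate(n_agents=200, f=0.5, r=0.01, rho_I0=0.8,
             steps=200_000, seed=0):
    """Return the final fractions of agents in states I, R and M."""
    rng = random.Random(seed)
    n_irregular = int(round(rho_I0 * n_agents))
    agents = ["I"] * n_irregular + ["R"] * (n_agents - n_irregular)
    for _ in range(steps):
        # Assumption: the lemma is used with probability f per step.
        if rng.random() < f:
            s, h = rng.sample(range(n_agents), 2)
            # Assumption: a mixed speaker utters I or R at random.
            uttered = agents[s] if agents[s] != "M" else rng.choice("IR")
            agents[s], agents[h] = interact(agents[s], agents[h], uttered)
        # Replacement by a newcomer who only has the R inflection.
        if rng.random() < r:
            agents[rng.randrange(n_agents)] = "R"
    return {state: agents.count(state) / n_agents for state in "IRM"}
```

Calling `simulate` for several values of r at fixed f gives a rough numerical view of how the final fractions depend on the ratio r/f; the time scales and population size used here are arbitrary.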


The model exhibits a transition

Analytical and numerical solutions

$n = r/f$

$\rho_X$ represents the fraction of individuals in the X inflection

The model is analytically solvable

[Figure: $\rho_I$ and $\rho_R$ (0 to 1) versus $n$ (0 to 0.08); curves labelled $\rho_I^{(1)}$, $\rho_I^{(2)}$, $\rho_I^{(3)}$, with initial conditions $\rho_I(0) = 0.8$, $0.5$, $0.3$.]

High-frequency verbs tend to stay I, low-frequency verbs tend to become R


Conclusions and Perspectives

What the work shows in a nutshell

Frequency alone does not necessarily predict the fate of verbs

Vocabulary lemmas change in a complex way as a result of several factors

Activity in irregularity proportion is mostly located in an intermediate frequency window

Phonological classification clarifies the existence of clusters of verbs behaving in a similar way

Native and non-native speakers tend to show opposite preferences regarding regularity

Modelling sheds light on the stationary states of lemmas and uncovers a transition

Thanks for bearing with me!
