TRANSCRIPT
The idea Corpus Data Mining Experiment Modelling
Rules and Exceptions in Language Dynamics:
a Quantitative Investigation
Martina Pugliese
Sapienza Università di Roma, Dipartimento di Fisica
Joint work with Prof. V. Loreto, C. Cuskley, C. Castellano, F. Colaiori, F. Tria
Final Seminar for the PhD program
Roma, 29th October 2014
Human Languages: Rules versus Exceptions
The past tense as the object of investigation
Human languages are structured into syntactic rules, which present exceptions in the form of irregularities
The past tense formation is a typical example of a rule (regular, -ed form) with many irregular forms
play → played
swim → swam
sneak → snuck? sneaked?
I saw him run after a gilded
butterfly: and when he caught it,
he let it go again; and after it
again; and over and over he comes,
and again; catched it again (...)
— W. Shakespeare, Coriolanus
Outline of the work
Tackling the problem from different points of view
Succinct summary of the literature in the field
Regularization is the expected phenomenon
Irregularization is typically considered irrelevant
Frequency plays the leading role: low-frequency verbs are more prone to regularize
The sociolinguistic environment of speakers influences the regularity of verbs
We will explore the past tense problem with three parallel approaches:
1 Data Mining on a Linguistic Corpus
2 Experiment on novel verbal forms
3 Agent-based modelling of competing inflections
(Ir)Regularity of verbs: a corpus perspective
The Corpus of Historical American English (CoHA)
CoHA: 400 · 10^6 written words in the period 1810–2009
The data we used
verbs in 1830–1989: 16 decades, diachronic perspective
dataset confined to the size of the first decade (≈ 2.1 · 10^6 tokens)
threshold on frequency to define regularity
Core verbs and extended vocabulary analyses
I: irregularity proportion (irreg. past tokens / tot. past tokens)
f: frequency of lemma
Root: set of lemmas sharing a verb root (e.g. go, forego, undergo, ...)
Core: existing in every decade
Extended: entire vocabulary
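The definitions of I and f above can be illustrated with a minimal Python sketch. The function names, the choice of returning None when a lemma has no past-tense tokens, and the example counts are all assumptions made for illustration; they are not taken from the corpus or from the original analysis.

```python
def irregularity_proportion(irregular_past, regular_past):
    """I = irregular past tokens / total past tokens.

    Returns None when there are no past-tense tokens at all
    (assumption: I is treated as undefined in that case).
    """
    total = irregular_past + regular_past
    if total == 0:
        return None
    return irregular_past / total


def relative_frequency(lemma_tokens, corpus_tokens):
    """f = tokens of the lemma / tokens in the corpus slice (assumed definition)."""
    return lemma_tokens / corpus_tokens


# Made-up counts for one lemma in one decade (illustration only).
example_i = irregularity_proportion(irregular_past=3, regular_past=9)
example_f = relative_frequency(lemma_tokens=42, corpus_tokens=2_100_000)
print(example_i, example_f)
```

For a root, the same functions could be applied after summing the counts of all lemmas sharing that root.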
Language as an open system
The number of types changes in time
Type categories: all; mostly I and mostly R (threshold on I at 0.5); undefined
[Figure: number of verb roots per decade (1830–1980), four panels. A: all types; B: mostly R types; C: mostly IR types; D: undefined types.]
Overall increase in vocabulary size
mostly R and undefined types increase in number
mostly I types are constant in number
The situation of the core roots in the last decade
An evolving picture of two opposite behaviours
stable IR (IR in every decade); stable R (R in every decade); active (0 < I < 1 in at least one decade, threshold at 1%)
[Figure: core roots in the (f, I) plane; f on a logarithmic axis from 10^-6 to 1, I from 0 to 1; colour bar from 0 to 1.]
Color coding: average I across time
Arrows: trajectory from first to last defined occurrence
Purple curves: binned values
Verbs are in a dynamic state: some regularize, some irregularize, some are stable
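The split into stable IR, stable R and active roots can be sketched as below. This is only one possible reading of the "threshold at 1%": it assumes that I within 0.01 of 1 counts as IR and I within 0.01 of 0 counts as R. The function name, the `tol` parameter and the fallback label "unclassified" are hypothetical.

```python
def classify_root(i_by_decade, tol=0.01):
    """Classify a core root from its irregularity proportion in each decade.

    i_by_decade: list of I values, one per decade.
    tol: assumed meaning of the 1% threshold (distance from 0 or 1).
    """
    # Active: strictly between the two extremes in at least one decade.
    if any(tol < i < 1 - tol for i in i_by_decade):
        return "active"
    if all(i >= 1 - tol for i in i_by_decade):
        return "stable IR"
    if all(i <= tol for i in i_by_decade):
        return "stable R"
    # Jumps between the extremes without intermediate values
    # (hypothetical label, not a category used on the slide).
    return "unclassified"


print(classify_root([1.0, 1.0, 0.995]))
print(classify_root([0.0, 0.005, 0.0]))
print(classify_root([1.0, 0.6, 0.2]))
```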
How do core roots navigate the plane (f, I)?
The active roots pattern in a cloud
$d = \sqrt{(\delta f)^2 + (\delta I)^2}, \qquad \delta f = \frac{\Delta f}{\Delta t}, \qquad \delta I = \frac{\Delta I}{\Delta t}$
[Figure: d versus f on logarithmic axes (d from 10^-8 to 10^-1, f from 10^-6 to 1); colour bar from 0 to 1.]
f: frequency averaged over time
Bigger points: active roots, color-coded with average I across time
Smaller dots: the stable roots
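As a rough illustration of the quantity d, the sketch below computes it for a single root. It assumes that Δf and ΔI are taken between the first and last defined occurrence of the root and that Δt is measured in decades; the slide does not state this explicitly, and the function name and arguments are hypothetical.

```python
import math


def displacement_rate(f_first, i_first, f_last, i_last, dt):
    """d = sqrt((Δf/Δt)^2 + (ΔI/Δt)^2) for one root.

    dt: elapsed time between the two observations (assumed: in decades).
    """
    delta_f = (f_last - f_first) / dt
    delta_i = (i_last - i_first) / dt
    return math.hypot(delta_f, delta_i)


# Made-up values: a root whose I drops from 0.9 to 0.3 over 15 decades.
print(displacement_rate(f_first=1e-4, i_first=0.9,
                        f_last=2e-4, i_last=0.3, dt=15))
```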
Active roots and the variation in I
Confirmation of two opposite and balanced forces
∆I > 0: irregularization; ∆I < 0: regularization
f: frequency averaged over time
decreasing trend with increasing f
the numbers in the two subplots are comparable
[Figure: two panels versus f on logarithmic axes: ∆I for roots with ∆I > 0, and −∆I for roots with ∆I < 0.]
Phonological classification of roots
Classes as phonological attractors
Regularization: broad application of the dominant rule; the individual fails to retrieve the irregular form (morphological level)
What drives irregularization?
What is the source of activity in dynamic verbs?
What keeps the number of irregular types constant?
↓
Phonological classification of roots:
Roots with I > 0 are classified according to the phonological change from infinitive to past tense → e.g., sing-sang and ring-rang
Size of a class is proportional to its number of members
Frequency of a class is the sum of the frequencies of its members
Classes may work as attractors, conditioning dynamics
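The bookkeeping for classes (size as number of members, frequency as the sum of member frequencies) might look like the sketch below. The class label "i→a", the function name and all numerical values are invented for illustration; only the sing-sang and ring-rang pairing comes from the slide.

```python
from collections import defaultdict


def summarize_classes(roots):
    """Group roots with I > 0 by phonological class.

    roots: iterable of (root, class_label, frequency, I) tuples.
    Returns {class_label: {"size": int, "f_sum": float, "members": list}}.
    """
    members = defaultdict(list)
    f_sum = defaultdict(float)
    for root, label, freq, irregularity in roots:
        if irregularity > 0:  # only roots with I > 0 are classified
            members[label].append(root)
            f_sum[label] += freq
    return {
        label: {"size": len(names), "f_sum": f_sum[label], "members": names}
        for label, names in members.items()
    }


# Made-up frequencies and I values (illustration only).
toy_roots = [
    ("sing", "i→a", 2e-4, 1.0),
    ("ring", "i→a", 1e-4, 0.9),
    ("play", "regular", 5e-4, 0.0),  # I = 0, so it is left out
]
print(summarize_classes(toy_roots))
```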
Dynamics of classes: a clarified picture
The evolution of four pivotal classes
Bubbles are classes, grey points at the bottom are regular roots
[Figure: I versus fsum (logarithmic axis) for all classes, with four further panels for the classes burn, dwell, hide and sing.]
Points are less scattered, window is narrower
Is irregularity backed by a cognitive process?
How do individuals choose the past tense ending in the first place?
If I do not know or recall a verb, which ending do I choose?
↓
Does this change depending on whether I am a native speaker?
How does the experiment work
Providing the past tense of non-existing verbs (non-verbs)
Non-verbs are built using the phonological distance with existing verbs and are categorized as Regular, Irregular and Duplicate
[Flowchart of the experiment, from START to END. Information part: ask the user for information about languages ("Is English your first language?", "Do you speak other languages?").]
The irregular rates divided by stimulus category
Non-native and native outcomes are different
[Figure: irregular response rate (0 to 0.7) by non-verb category (Regular, Duplicate, Irregular), for natives and non-natives.]
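The rates in such a chart could be tabulated along the lines of the sketch below, which counts the share of irregular responses per speaker group and non-verb category. The data layout, the function name and the toy responses are assumptions for illustration and are not the experimental data.

```python
from collections import defaultdict


def irregular_rates(responses):
    """Share of irregular responses per (group, category).

    responses: iterable of (group, category, is_irregular) tuples,
    one per response given by a participant.
    """
    counts = defaultdict(lambda: [0, 0])  # [irregular, total]
    for group, category, is_irregular in responses:
        counts[(group, category)][0] += int(is_irregular)
        counts[(group, category)][1] += 1
    return {key: irr / total for key, (irr, total) in counts.items()}


# Toy responses (invented, not results from the experiment).
toy_responses = [
    ("native", "Irregular", True),
    ("native", "Irregular", False),
    ("native", "Regular", False),
    ("non-native", "Irregular", True),
]
print(irregular_rates(toy_responses))
```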
How many, and which, irregular responses do we get?
Different choices for natives and non-natives
The irregular responses can be divided into 7 linguistic categories
[Figure: number of users versus number of irregular responses (0 to 16), for natives, non-natives and all users.]
[Figure: frequency of each response category (Reg., VC, Lev., Other, VC+d, Fin+t/d, VC+t, Ruck.) for CoHA, all users, natives and non-natives.]
Native speakers favour the regular category more than non-natives do
A three-state model of competing inflections
The rules of the model
Before          After
s      h        s      h
I      I        I      I
R      R        R      R
I      R        I      M
R      I        R      M
I      M        I      I
R      M        R      R
M(I)   I        I      I
M(R)   I        M      M
M(I)   R        M      M
M(R)   R        R      R
M(I)   M        I      I
M(R)   M        R      R
s: speaker; h: hearer
Three possible inflections: R (regular), I (irregular), M (mixed: both endings possible)
At each time step, s and h interact over a lemma whose frequency is f
When h does not have the uttered inflection, he appends it
When h has the uttered inflection, both s and h delete the other one
At each time step, a randomly selected agent is replaced with probability r with one who only has R inflections
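The rules above can be turned into a small simulation; the following is a sketch under stated assumptions rather than the model code used in the work. It assumes a single lemma, that an interaction takes place with probability f at each step, that a mixed speaker utters I or R with equal probability, and that agents start as either I or R. All names and default parameter values are hypothetical.

```python
import random


def apply_rule(s, h, uttered):
    """Return (s_after, h_after) given the inflection uttered by the speaker."""
    if h == uttered or h == "M":
        # Hearer has the uttered inflection: both delete the other one.
        return uttered, uttered
    # Hearer lacks it: appends it and becomes mixed; speaker is unchanged.
    return s, "M"


def simulate(n_agents=100, f=0.5, r=0.01, rho_i0=0.8, steps=10_000, seed=0):
    """Simulate one lemma and return the final fractions of I, R and M agents."""
    rng = random.Random(seed)
    n_irregular = round(rho_i0 * n_agents)
    agents = ["I"] * n_irregular + ["R"] * (n_agents - n_irregular)
    for _ in range(steps):
        if rng.random() < f:  # assumption: interaction occurs with probability f
            s, h = rng.sample(range(n_agents), 2)
            # Assumption: a mixed speaker picks either ending with equal probability.
            uttered = agents[s] if agents[s] != "M" else rng.choice("IR")
            agents[s], agents[h] = apply_rule(agents[s], agents[h], uttered)
        if rng.random() < r:  # replacement by an agent with only R inflections
            agents[rng.randrange(n_agents)] = "R"
    return {state: agents.count(state) / n_agents for state in "IRM"}


print(simulate(f=0.5, r=0.01))
```

Varying r and f while keeping the other parameters fixed is one way to explore how the final fractions depend on the ratio r/f.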
The model exhibits a transition
Analytical and numerical solutions
n = r/f
ρ_X represents the fraction of individuals in the X inflection
The model is analytically solvable
[Figure: ρ_I and ρ_R versus n (0 to 0.08); curves labelled ρ_I^(1), ρ_I^(2), ρ_I^(3) and initial conditions ρ_I(0) = 0.8, 0.5, 0.3.]
High-frequency verbs tend to stay I, low-frequency verbs tend to become R
Conclusions and Perspectives
What the work shows in a nutshell
Frequency alone does not necessarily predict the fate of verbs
Vocabulary lemmas change in a complex way as a result of several factors
Activity in irregularity proportion is mostly located in an intermediate frequency window
Phonological classification clarifies the existence of clusters of verbs behaving in a similar way
Native and non-native speakers tend to have opposing views of regularity preference
Modelling sheds light on the stationary states of lemmas and uncovers a transition
Thanks for bearing with me!