self-organization of the sound inventories: an explanation based on complex networks

Post on 17-Dec-2015


Self-Organization of the Sound Inventories: An Explanation based on Complex Networks

Overview of the Talk

• Motivation

• Approach & Objective

• Principle of Occurrence in Consonant Inventories

• Principle of Co-Occurrence in Consonant Inventories

• Findings

• Conclusions and Future Work

Sabda Bramha: Sound is Eternity

sabda-brahma su-durbodham pranendriya-mano-mayam ananta-param gambhiram durvigahyam samudra-vat

– Sound is eternal and also very difficult to comprehend. It manifests within the life air, the senses, and the mind. It is unlimited and unfathomable, just like the ocean.

• Several living organisms can produce sound

– They emit sound signals to communicate

– These signals are mapped to certain symbols (meanings) in the brain

– E.g., mating calls, danger alarms

Signals and Symbols & § ۞ ☼ ♥

Human Communication

• Human beings also produce sound signals

• Unlike other organisms, they can concatenate these sounds to produce new messages – Language

• Language is one of the primary causes/effects of human intelligence

Human Speech Sounds

• Human speech sounds are called phonemes – the smallest unit of a language

• Phonemes are characterized by certain distinctive features like

I. Place of articulation

II. Manner of articulation

III. Phonation

[Figure: Mermelstein’s Model]

Types of Phonemes

Vowels: /a/, /i/, /u/

Consonants: /p/, /t/, /k/

Diphthongs: /ai/

Choice of Phonemes

• How does a language choose a set of phonemes in order to build its sound inventory?

• Is the process arbitrary?

• Certainly Not!

• What are the forces affecting this choice?

Forces of Choice

[Figure: a speaker utters /a/ to a listener / learner]

Speaker: desires “ease of articulation”

Listener / Learner: desires “perceptual contrast” / “ease of learnability”

A Linguistic System – How does it look?

The forces shaping the choice are opposing – Hence there has to be a non-trivial solution

Vowels: A (Partially) Solved Mystery

• Languages choose vowels based on maximal perceptual contrast.

• For instance, if a language has three vowels, then in more than 95% of the cases they are /a/, /i/, and /u/.

[Figure: the vowels /u/, /a/, and /i/, with each pair labeled “Maximally Distinct”]

Consonants: A puzzle

• Research: from 1929 to date

• No single satisfactory explanation of the organization of the consonant inventories

– The set of features that characterize consonants is much larger than that of vowels

– No single force is sufficient to explain this organization

– Rather a complex interplay of forces goes on in shaping these inventories

[Figure: Jigsaw]

The Approach & Objective

• We adopt a Complex Network Approach to attack the problem of consonant inventories

• We try to figure out the principle of the distribution of the occurrence of consonants over languages

• We also attempt to figure out the co-occurrence patterns (if any) that are found across the consonant inventories

Principle of Occurrence

• PlaNet – The “Phoneme-Language Network”

– A bipartite network N=(VL,VC,E)

– VL : Nodes representing languages of the world

– VC : Nodes representing consonants

– E : Set of edges which run between VL and VC

• There is an edge e ∈ E between two nodes vl ∈ VL and vc ∈ VC if the consonant c occurs in the language l.

[Figure: The Structure of PlaNet. Language nodes (L1, L2, L3, L4) on one side and consonant nodes (/m/, /ŋ/, /p/, /d/, /s/, /θ/) on the other, joined by edges]

Construction of PlaNet

• Data Source: UCLA Phonological Segment Inventory Database (UPSID)

• Number of nodes in VL is 317

• Number of nodes in VC is 541

• Number of edges in E is 7022
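The construction above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the four toy inventories are hypothetical (they only reuse the language and consonant labels from the PlaNet figure), and `build_planet` is a name introduced here. With the real UPSID data the same counts would be 317, 541, and 7022.

```python
from collections import Counter

def build_planet(inventories):
    """Build the bipartite network N = (VL, VC, E) from {language: set of consonants}."""
    language_nodes = set(inventories)
    edges = {(lang, c) for lang, consonants in inventories.items() for c in consonants}
    consonant_nodes = {c for _, c in edges}
    return language_nodes, consonant_nodes, edges

# Hypothetical toy inventories; the memberships are NOT taken from UPSID.
toy_inventories = {
    "L1": {"/m/", "/p/", "/s/"},
    "L2": {"/m/", "/ŋ/", "/p/", "/d/"},
    "L3": {"/p/", "/d/", "/s/", "/θ/"},
    "L4": {"/m/", "/p/"},
}

V_L, V_C, E = build_planet(toy_inventories)

# Degree of a consonant node = number of languages it occurs in.
consonant_degree = Counter(c for _, c in E)
# Degree of a language node = its inventory size.
language_degree = Counter(lang for lang, _ in E)

print(len(V_L), len(V_C), len(E))
```

Storing the edges as (language, consonant) pairs keeps the two node sets separate, which is all the later degree computations need.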

Degree Distribution

• Degree of a node is defined as the number of edges connected to the node.

• Degree Distribution (DD) is the fraction of nodes, pk, having degree equal to k.

• The Cumulative Degree Distribution (CDD) is the fraction of nodes, Pk, having degree greater than or equal to k.
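As a small sketch of these definitions, the following computes pk and Pk from a list of node degrees. It assumes the usual convention that the cumulative distribution Pk counts nodes with degree at least k; the degree list is a made-up example and the function names are introduced here for illustration.

```python
from collections import Counter

def degree_distribution(degree_list):
    """p_k: fraction of nodes whose degree is exactly k."""
    n = len(degree_list)
    counts = Counter(degree_list)
    return {k: counts[k] / n for k in sorted(counts)}

def cumulative_degree_distribution(degree_list):
    """P_k: fraction of nodes whose degree is at least k (assumed convention)."""
    n = len(degree_list)
    return {k: sum(1 for d in degree_list if d >= k) / n
            for k in sorted(set(degree_list))}

# Hypothetical degrees of six consonant nodes.
example_degrees = [3, 4, 2, 1, 2, 1]
p = degree_distribution(example_degrees)
P = cumulative_degree_distribution(example_degrees)
print(p)
print(P)
```

The cumulative form is the one usually plotted on log-log axes because it smooths out the noise in the tail.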

Degree Distribution of PlaNet

[Figure: degree distribution of the language nodes. x-axis: language inventory size (degree k); y-axis: pk]

pk = beta(k) with α = 7.06 and β = 47.64:

$$p_k = \frac{\Gamma(54.7)}{\Gamma(7.06)\,\Gamma(47.64)}\; k^{6.06}\,(1-k)^{46.64}$$

kmin = 5, kmax = 173, kavg = 21

[Figure: cumulative degree distribution of the consonant nodes on log-log axes. x-axis: degree of a consonant, k; y-axis: Pk]

$$P_k = k^{-0.71}$$ with an exponential cut-off

• DD of the language nodes follows a β-distribution

• DD of the consonant nodes follows a power-law with an exponential cut-off

Distribution of Consonants over Languages follows a power-law

Preferential Attachment: The Key to Power Law

• Power law distributions observed in

– Social Networks

– Biological Networks

– Internet Graphs

– Citation Networks

• These distributions emerge due to preferential attachment

[Figure: dollar signs accumulating, labeled “RICH” and “RICHER”]

Synthesis of PlaNet

Given: VL = {L1, L2, ..., L317} sorted in ascending order of their degrees, and 541 unlabeled nodes in VC.

Step 0: All nodes in VC have degree 0.

Step t+1:

Choose a language node Lj (in order) with cardinality kj (inventory size)

for c running from 1 to kj do

Connect Lj preferentially with a consonant node Ci ∈ VC, to which it is not already connected, with probability

$$\Pr(C_i) = \frac{d_i^{\alpha} + \epsilon}{\sum_{x \in V^*} \left(d_x^{\alpha} + \epsilon\right)}$$

where di = degree of node Ci at step t, V* = subset of VC not connected to Lj at t, and ε is the smoothing parameter.
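The synthesis procedure can be sketched as follows. This is an illustrative reading, not the authors' implementation: the function name `synthesize_planet` is introduced here, and degrees are read as they stood at the start of each language's step (one interpretation of "degree at step t"). The default values of α and ε are the ones reported on the simulation slide; the call at the bottom uses tiny made-up sizes.

```python
import random

def synthesize_planet(inventory_sizes, n_consonants, alpha=1.44, eps=0.5, seed=0):
    """Grow a synthetic language-consonant network by preferential attachment."""
    rng = random.Random(seed)
    degree = [0] * n_consonants            # Step 0: every consonant has degree 0
    inventories = []
    for k in sorted(inventory_sizes):      # languages in ascending order of degree
        if k > n_consonants:
            raise ValueError("inventory larger than the consonant set")
        snapshot = list(degree)            # degrees d_i at step t (assumption)
        chosen = set()
        for _ in range(k):
            # V*: consonants not yet connected to the current language
            candidates = [i for i in range(n_consonants) if i not in chosen]
            weights = [snapshot[i] ** alpha + eps for i in candidates]
            pick = rng.choices(candidates, weights=weights, k=1)[0]
            chosen.add(pick)
        for i in chosen:                   # the new edges take effect at step t+1
            degree[i] += 1
        inventories.append(chosen)
    return inventories, degree

# Tiny hypothetical run: three languages, eight consonants.
synthetic_inventories, synthetic_degree = synthesize_planet([5, 2, 3], n_consonants=8)
print(synthetic_degree)
```

`random.choices` normalizes the weights itself, so the denominator of Pr(Ci) never has to be computed explicitly; the smoothing term ε keeps every weight positive even when a consonant still has degree 0.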

[Figure: The Preferential Mechanism of Synthesis. Language nodes L1, L2, L3, L4 shown after step 3 and after step 4]

Simulation Result

The parameters α and ε are 1.44 and 0.5 respectively.

The results are averaged over 100 runs

[Figure: Pk versus degree (k) on log-log axes for PlaNet, PlaNetsyn, and PlaNetrand]

Principle of Co-occurrence

• Consonants tend to co-occur in groups or communities

• These groups tend to be organized around a few distinctive features (based on: manner of articulation, place of articulation & phonation) – Principle of feature economy

[Figure: “If a language has in its inventory ... then it will also tend to have ...”, illustrated with the plosives below]

plosive     voiced   voiceless
bilabial    /b/      /p/
dental      /d/      /t/

How to Capture these Co-occurrences?

• PhoNet – The “Phoneme-Phoneme Network”

– A weighted network N=(VC,E)

– VC : Nodes representing consonants

– E : Set of edges which run between the nodes in VC

• There is an edge e ∈ E between two nodes vc1, vc2 ∈ VC if the consonants c1 and c2 co-occur in a language. The number of languages in which c1 and c2 co-occur defines the edge-weight of e. The number of languages in which c1 occurs defines the node-weight of vc1.

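[Figure: example PhoNet subgraph on the consonants /kw/, /k′/, /k/, and /d′/, annotated with node weights and edge weights (values shown: 42, 14, 38, 13, 283, 17, 50, 39)]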

Construction of PhoNet

• Data Source : UPSID

• Number of nodes in VC is 541

• Number of edges is 34012
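A minimal sketch of how such a weighted co-occurrence network could be derived from language inventories is shown below. The toy inventories are hypothetical (not UPSID data) and `build_phonet` is a name introduced here for illustration.

```python
from collections import Counter
from itertools import combinations

def build_phonet(inventories):
    """Return (node_weight, edge_weight) for the consonant co-occurrence network."""
    node_weight = Counter()   # number of languages in which a consonant occurs
    edge_weight = Counter()   # number of languages in which a pair co-occurs
    for consonants in inventories.values():
        node_weight.update(consonants)
        # Sort so that each unordered pair always maps to the same key.
        for a, b in combinations(sorted(consonants), 2):
            edge_weight[(a, b)] += 1
    return node_weight, edge_weight

# Hypothetical toy inventories; the memberships are NOT taken from UPSID.
toy_phonet_inventories = {
    "L1": {"/m/", "/p/", "/s/"},
    "L2": {"/m/", "/ŋ/", "/p/", "/d/"},
    "L3": {"/p/", "/d/", "/s/", "/θ/"},
    "L4": {"/m/", "/p/"},
}

node_weight, edge_weight = build_phonet(toy_phonet_inventories)
print(edge_weight[("/m/", "/p/")], node_weight["/p/"])
```

Every language contributes one unit of weight to each pair of consonants in its inventory, so a language with k consonants touches k(k-1)/2 edges.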

[Figure: PhoNet]

Community Structures in PhoNet

• Radicchi et al. algorithm (for unweighted networks) – Counts the number of triangles that an edge is part of. Inter-community edges have a low count, so they are removed.

• Modification for a weighted network like PhoNet

– Look for triangles, where the weights on the edges are comparable.

– If the weights are comparable, then the group of consonants co-occurs frequently; otherwise it does not.

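– Measure strength S for each edge (u,v) in PhoNet, where S is

$$S = \frac{w_{uv}}{\sqrt{\sum_{i \in V_C - \{u,v\}} \left(w_{ui} - w_{vi}\right)^2}} \quad \text{if } \sqrt{\sum_{i \in V_C - \{u,v\}} \left(w_{ui} - w_{vi}\right)^2} > 0, \text{ else } S = \infty$$

– Remove edges with S less than a threshold η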
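The strength measure and the thresholding step can be sketched as below. This is an illustration under stated assumptions: a missing edge is treated as having weight 0, the function names are introduced here, and the four-node weighted graph is a toy example rather than data from PhoNet.

```python
import math

def edge_strength(weights, nodes, u, v):
    """S for edge (u, v): w_uv divided by the distance between the two
    endpoints' weight profiles towards every other node."""
    def w(a, b):
        # Undirected lookup; a missing edge counts as weight 0 (assumption).
        return weights.get((a, b), weights.get((b, a), 0))
    diff_sq = sum((w(u, i) - w(v, i)) ** 2 for i in nodes if i not in (u, v))
    denom = math.sqrt(diff_sq)
    return math.inf if denom == 0 else w(u, v) / denom

def prune(weights, nodes, eta):
    """Keep only the edges whose strength is at least the threshold eta."""
    return {edge: wt for edge, wt in weights.items()
            if edge_strength(weights, nodes, *edge) >= eta}

# Toy weighted graph: a tight triangle (1, 2, 3) plus a weak edge to node 4.
toy_nodes = [1, 2, 3, 4]
toy_weights = {(1, 2): 100, (1, 3): 110, (2, 3): 101, (3, 4): 10}

s_12 = edge_strength(toy_weights, toy_nodes, 1, 2)
s_34 = edge_strength(toy_weights, toy_nodes, 3, 4)
kept = prune(toy_weights, toy_nodes, eta=1.0)
print(round(s_12, 2), round(s_34, 3), sorted(kept))
```

In the toy graph the triangle edges have comparable weights towards their common neighbour, so their strengths are large, while the weak edge to node 4 falls below the threshold and is removed.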

[Figure: worked example on a six-node weighted graph (nodes 1 to 6). The edge weights are converted to strengths S, and the low-strength edges are removed by thresholding (panel labeled “η>1”)]

Community Formation

For different values of η we get different sets of communities

Consonant Societies!

[Figure: communities obtained at η = 1.25, η = 0.72, η = 0.60, and η = 0.35]

Evaluation of the Communities: Occurrence Ratio

• Hypothesis: The communities obtained from the algorithm should be found frequently in UPSID

• We define the occurrence ratio to capture the “intensity” of occurrence,

$$O_L = \frac{M}{N - (R_{top} - 1)}$$

– N is the number of consonants in C (ranked by the ascending order of frequency of occurrence), M is the number of consonants of C that occur in a language L, and Rtop is the rank of the highest ranking consonant in L that is also present in C

– If a high-frequency consonant is present in L it is not necessary that the low-frequency one should be present; but if a lower one is already present then it is expected that the higher one must be present

Computing Occurrence Ratio: An Example

[Figure: a community C containing /k/, /kh/, and /kw/ (ranks R = 1, 2, 3) compared with the inventories of three languages L1, L2, and L3]

L1: M=3, N=3, Rtop=1, OL = 3/3 = 1

L2: M=2, N=3, Rtop=2, OL = 2/2 = 1

L3: M=2, N=3, Rtop=1, OL = 2/3 ≈ 0.67
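The three cases above can be checked with a short sketch. The names `occurrence_ratio`, `c1`, `c2`, and `c3` are placeholders introduced here (the consonant-to-rank assignment of the figure is not reproduced), and returning 0 when no member of C occurs in L is an assumption, since Rtop is undefined in that case.

```python
def occurrence_ratio(community_ranks, inventory):
    """O_L = M / (N - (R_top - 1)) for a community C and a language inventory L.

    community_ranks maps each consonant of C to its rank (1 = highest rank).
    """
    present = [rank for c, rank in community_ranks.items() if c in inventory]
    if not present:
        return 0.0                      # assumption: no member of C occurs in L
    n = len(community_ranks)            # N: size of the community
    m = len(present)                    # M: members of C found in L
    r_top = min(present)                # R_top: best rank among those members
    return m / (n - (r_top - 1))

# Placeholder community with ranks 1, 2, 3.
ranks = {"c1": 1, "c2": 2, "c3": 3}

o_l1 = occurrence_ratio(ranks, {"c1", "c2", "c3"})   # M=3, Rtop=1
o_l2 = occurrence_ratio(ranks, {"c2", "c3", "x"})    # M=2, Rtop=2
o_l3 = occurrence_ratio(ranks, {"c1", "c2", "x"})    # M=2, Rtop=1
print(o_l1, o_l2, o_l3)
```

The denominator only counts the members ranked at or below Rtop, which is why a language missing the rank-1 consonant is not penalized for it.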

Average Occurrence Ratio

• A given community has an occurrence ratio in each language L in UPSID

• We average this ratio over all L as,

$$O_{av} = \frac{\sum_{L \in UPSID} O_L}{L_{occur}}$$

where Loccur is the number of languages where at least one of the members of C has occurred
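A sketch of the averaging step follows. It reads the average as the sum of OL over all languages divided by Loccur, and it assumes OL = 0 for a language that contains no member of C. The function names, the placeholder consonants, and the four toy inventories are all introduced here for illustration.

```python
def occurrence_ratio(community_ranks, inventory):
    """O_L = M / (N - (R_top - 1)); 0 if no member of C is in the inventory."""
    present = [rank for c, rank in community_ranks.items() if c in inventory]
    if not present:
        return 0.0
    return len(present) / (len(community_ranks) - (min(present) - 1))

def average_occurrence_ratio(community_ranks, inventories):
    """O_av: sum of O_L over all languages, divided by L_occur."""
    total = sum(occurrence_ratio(community_ranks, inv) for inv in inventories)
    # L_occur: languages in which at least one member of C occurs.
    l_occur = sum(1 for inv in inventories
                  if any(c in inv for c in community_ranks))
    return total / l_occur if l_occur else 0.0

# Placeholder community and four toy inventories (the last has no member of C).
community = {"c1": 1, "c2": 2, "c3": 3}
languages = [{"c1", "c2", "c3"}, {"c2", "c3", "x"}, {"c1", "c2", "x"}, {"x", "y"}]

o_av = average_occurrence_ratio(community, languages)
print(round(o_av, 4))
```

Dividing by Loccur rather than by the total number of languages means that languages in which the community never appears neither raise nor lower the average.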

Results of the Evaluation

Consonants show patterns of co-occurrence in 80% or more of the world’s languages

[Figure annotations: η > 0.3, Oav > 0.8]

The Binding Force of the Communities: Feature Economy

• Feature Entropy: The idea is borrowed from information theory

• For a community C of size N, let there be pf consonants for which a particular feature f is present and qf other consonants for which f is absent – the probability that a consonant chosen from C has f is pf /N and that it does not have f is qf /N or (1 - pf /N)

• Feature entropy can therefore be defined as

$$FE = \sum_{f \in F} \left( -\frac{p_f}{N} \log \frac{p_f}{N} - \frac{q_f}{N} \log \frac{q_f}{N} \right)$$

where F is the set of all features present in the consonants in C

• Essentially the number of bits needed to transmit the entire information about C through a channel.

Computing Feature Entropy

Lower FE -> C1 economizes on the number of features

Higher FE -> C2 does not economize on the number of features
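The contrast between an economical and a non-economical community can be illustrated with a short sketch. The two toy communities and their feature sets are hypothetical examples chosen here, not the C1 and C2 of the slide; logarithms are taken to base 2 because the definition speaks of bits, and the term 0·log 0 is taken as 0.

```python
import math

def feature_entropy(community):
    """FE of a community given as {consonant: set of features}."""
    n = len(community)
    all_features = set().union(*community.values())   # F
    total = 0.0
    for f in all_features:
        p = sum(1 for feats in community.values() if f in feats) / n  # p_f / N
        q = 1.0 - p                                                   # q_f / N
        for x in (p, q):
            if x > 0:                 # 0 * log(0) is taken as 0
                total -= x * math.log2(x)
    return total

# Hypothetical community built around few features (voicing x place, all plosives).
economical = {
    "/b/": {"voiced", "bilabial", "plosive"},
    "/p/": {"voiceless", "bilabial", "plosive"},
    "/d/": {"voiced", "dental", "plosive"},
    "/t/": {"voiceless", "dental", "plosive"},
}
# Hypothetical community of the same size that draws on many more features.
mixed = {
    "/b/": {"voiced", "bilabial", "plosive"},
    "/s/": {"voiceless", "alveolar", "fricative"},
    "/m/": {"voiced", "bilabial", "nasal"},
    "/k/": {"voiceless", "velar", "plosive"},
}

fe_economical = feature_entropy(economical)
fe_mixed = feature_entropy(mixed)
print(fe_economical, round(fe_mixed, 3))
```

A feature shared by every member (here "plosive" in the first community) contributes nothing to FE, which is how the measure rewards communities that reuse a small set of features.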

If the Inventories had Evolved by Chance!

• Construction of PhoNetrand

– For each consonant c let the frequency of occurrence in UPSID be denoted by fc.

– Let there be 317 bins each corresponding to a language in UPSID.

– fc bins are then chosen uniformly at random and the consonant c is packed into these bins without repetition.

– Thus the consonant inventories of the 317 languages corresponding to the bins are generated.

– PhoNetrand can be constructed from these new consonant inventories in the same way as PhoNet.

• Cluster PhoNetrand by the method proposed earlier
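The bin-packing step that generates the random inventories can be sketched as follows. The function name, the three consonant frequencies, and the six-bin example are hypothetical; with UPSID the number of bins would be 317 and the frequencies fc would come from the data.

```python
import random

def random_inventories(frequencies, n_languages=317, seed=0):
    """Scatter each consonant c into f_c distinct, uniformly chosen language bins."""
    rng = random.Random(seed)
    inventories = [set() for _ in range(n_languages)]
    for consonant, f_c in frequencies.items():
        # sample() draws without replacement, so no bin receives c twice.
        for bin_index in rng.sample(range(n_languages), f_c):
            inventories[bin_index].add(consonant)
    return inventories

# Hypothetical frequencies for three consonants, packed into six bins.
toy_frequencies = {"/p/": 5, "/t/": 3, "/q/": 1}
random_bins = random_inventories(toy_frequencies, n_languages=6)

observed = {c: sum(1 for inv in random_bins if c in inv) for c in toy_frequencies}
print(observed)
```

Each consonant keeps its original frequency of occurrence, so any difference between PhoNet and PhoNetrand comes from how consonants are grouped, not from how common they are.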

[Figure: Comparison between PhoNet and PhoNetrand. x-axis: community size; y-axis: average feature entropy; one curve each for PhoNet and PhoNetrand]

The curve shows the average feature entropy of the communities of a particular size versus the community size

Our Findings

• The distribution of the occurrence of consonants over languages follows a power-law behavior;

• A preferential attachment-based model can reproduce this distribution of occurrence to a very close approximation (mean error ~0.01);

• The patterns of co-occurrence of the consonants, reflected through communities in PhoNet, are observed in 80% or more of the world's languages;

• Such patterns of co-occurrence would not have emerged if the consonant inventories had evolved just by chance.

The Epilogue

• How to explain preferential attachment?

– Perhaps it is due to the linguistic heterogeneity involved in the process of language change (at the microscopic level)

– Consonants belonging to languages that are prevalent among the speakers in one generation have a higher (and higher) chance of getting transmitted to the speakers of the subsequent generations

– The above heterogeneity manifests as preferential attachment at the mesoscopic level

• What is the origin of feature economy?

– Perhaps it is the outcome of the interplay of the functional forces such as the perceptual contrast and ease of learnability that is reflected as feature economy

[Figure: Indo-European family of languages]

Danke! (Thank you!)
