![Page 1: A Social Network Approach to Unsupervised Induction of Syntactic Clusters for Bengali Monojit Choudhury Microsoft Research India monojitc@microsoft.com](https://reader035.vdocuments.mx/reader035/viewer/2022062721/56649f1e5503460f94c357b9/html5/thumbnails/1.jpg)
A Social Network Approach to
Unsupervised Induction of Syntactic Clusters for Bengali
Monojit Choudhury
Microsoft Research India
Co-authors
Chris Biemann
University of Leipzig
Joydeep Nath, Animesh Mukherjee, Niloy Ganguly
Indian Institute of Technology Kharagpur
Language – A Complex System
Structure: phones → words, words → phrases, phrases → sentences, sentences → discourse
Function: communication through recursive syntax and compositional semantics
Dynamics: evolution, language change
Computational Linguistics
Study of language using computers
Study of language-using computers
Natural Language Processing:
  Speech recognition
  Machine translation
  Automatic summarization
  Spell checkers, information retrieval & extraction, …
Labeling of Text
Lexical Category (POS tags)
Syntactic Category (Phrases, chunks)
Semantic Role (Agent, theme, …)
Sense
Domain dependent labeling (genes, proteins, …)
How to define the set of labels?
How to (learn to) predict them automatically?
Distributional Hypothesis
“A word is characterized by the company it keeps” – Firth, 1957
Syntax: function words (Harris, 1968)
Semantics: content words
Outline
Defining Context
Syntactic Network of Words
Complex Network – Theory & Applications
Chinese Whispers: Clustering the Network
Experiments
Topological Properties of the Networks
Evaluation
Future work
Feature Words
Estimate the unigram frequencies
Feature words: Most frequent m words
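The selection step above can be sketched in a few lines. This is only an illustration, assuming the corpus is already tokenized into a list of strings; the function name `select_feature_words` is hypothetical and not from the talk.

```python
from collections import Counter


def select_feature_words(tokens, m):
    """Return the m most frequent tokens, most frequent first.

    `tokens` is assumed to be an already tokenized corpus (a list of strings).
    """
    unigram_counts = Counter(tokens)  # estimate the unigram frequencies
    return [word for word, _ in unigram_counts.most_common(m)]


# Toy corpus, purely illustrative.
toy_tokens = "the cat saw the dog and the dog saw the cat".split()
print(select_feature_words(toy_tokens, 2))
```

In the experiments m ranges over {25, 50, 100, 200}, so the feature set stays small compared with the vocabulary.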
Feature Vector
From the familiar to the exotic, the collection is a delight
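Feature vector for "familiar" (one row per context position, one column per feature word):

| Position | fw1 (the) | fw2 (to) | … | fw199 (is) | fw200 (from) |
| --- | --- | --- | --- | --- | --- |
| p-2 | 0 | 0 | … | 0 | 1 |
| p-1 | 1 | 0 | … | 0 | 0 |
| p1 | 0 | 1 | … | 0 | 0 |
| p2 | 1 | 0 | … | 0 | 0 |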
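A minimal sketch of how such a vector could be built for one occurrence of a word. It assumes "familiar" is the target word (the only position in the example sentence consistent with the rows shown), uses a four-word feature list instead of the 200 on the slide, and flattens the four rows into one list. The name `context_vector` is hypothetical.

```python
def context_vector(tokens, position, feature_words, offsets=(-2, -1, 1, 2)):
    """Binary context vector for the token at `position`.

    The result has one block of len(feature_words) entries per offset
    (p-2, p-1, p1, p2); an entry is 1 if that feature word occurs at
    that offset from the target word.
    """
    index = {word: i for i, word in enumerate(feature_words)}
    m = len(feature_words)
    vector = [0] * (len(offsets) * m)
    for row, offset in enumerate(offsets):
        j = position + offset
        if 0 <= j < len(tokens) and tokens[j] in index:
            vector[row * m + index[tokens[j]]] = 1
    return vector


# Example sentence from the slide, lowercased, punctuation dropped.
tokens = "from the familiar to the exotic the collection is a delight".split()
feature_words = ["the", "to", "is", "from"]  # stand-ins for fw1, fw2, fw199, fw200
vector = context_vector(tokens, tokens.index("familiar"), feature_words)
print(vector)
```

Read four entries at a time, the printed list corresponds to the rows p-2, p-1, p1, p2.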
Syntactic Network of Words
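[Figure: network whose nodes are the words light, color, red, blue, blood, sky, heavy, weight. One edge label reads 1 – cos(red, blue); the other numbers shown in the figure are 100, 20, 1, 1.]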
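The quantity 1 – cos(red, blue) can be computed from two feature vectors as below. This is a sketch only: the vectors are made up, and the slide does not say how the value is turned into an edge weight or thresholded, so none of that is shown.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """1 - cos(u, v): 0 for parallel vectors, 1 for orthogonal ones."""
    return 1.0 - cosine(u, v)


# Hypothetical context-count vectors for two words.
red = [100, 20, 1, 1]
blue = [90, 25, 0, 2]
print(cosine_distance(red, blue))
```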
The Chinese Whispers Algorithm
[Figure: weighted network over the words light, color, red, blue, blood, sky, heavy, weight, with edge weights 0.9, 0.5, 0.9, 0.7, 0.8, -0.5]
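The slides show the graph but not the update rule. The sketch below follows the usual description of Chinese Whispers: every node starts in its own class, then nodes are visited in random order and each adopts the class with the largest total edge weight among its neighbors, with ties broken at random. The function name, the fixed iteration count, the seed, and the edge list are assumptions for illustration; the edge list is not read off the figure.

```python
import random


def chinese_whispers(edges, iterations=20, seed=0):
    """Cluster an undirected weighted graph given as {(u, v): weight}.

    Returns a dict mapping each node to a class label (labels are node names).
    """
    rng = random.Random(seed)
    neighbors = {}
    for (u, v), weight in edges.items():
        neighbors.setdefault(u, {})[v] = weight
        neighbors.setdefault(v, {})[u] = weight

    labels = {node: node for node in neighbors}  # one class per node initially
    nodes = list(neighbors)
    for _ in range(iterations):
        rng.shuffle(nodes)  # random visiting order in every pass
        for node in nodes:
            scores = {}
            for neighbor, weight in neighbors[node].items():
                label = labels[neighbor]
                scores[label] = scores.get(label, 0.0) + weight
            best = max(scores.values())
            candidates = sorted(l for l, s in scores.items() if s == best)
            labels[node] = rng.choice(candidates)  # random tie-breaking
    return labels


# Hypothetical edges over the words used on the slide.
example_edges = {
    ("red", "blue"): 0.9,
    ("light", "heavy"): 0.9,
    ("red", "light"): 0.5,
    ("sky", "blood"): 0.8,
    ("color", "weight"): 0.7,
    ("sky", "color"): 0.5,
}
print(chinese_whispers(example_edges))
```

Because both the visiting order and the tie-breaking are random, different seeds can give different partitions, and the number of clusters is not fixed in advance.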
Experiments
Corpus: Anandabazaar Patrika (17M words)
We build networks $G_{n,m}$
n: corpus size – {1M, 2M, 5M, 10M, 17M}
m: number of feature words – {25, 50, 100, 200}
Number of nodes: 5000
Number of edges: ~150,000
Topological Properties: Cumulative Degree Distribution
CDD: $P_k$ is the probability that a randomly chosen node has degree ≥ k
[Plot: $P_k$ against $k$ for $G_{17M,50}$]
$P_k \sim -\log(k)$, hence $p_k = -\frac{dP_k}{dk} \sim \frac{1}{k}$: Zipfian distribution!
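The CDD as defined on this slide can be computed directly from a list of node degrees. A minimal sketch with a made-up degree list; the function name is hypothetical.

```python
def cumulative_degree_distribution(degrees):
    """Map each k in 1..max degree to P_k, the fraction of nodes with degree >= k."""
    n = len(degrees)
    max_degree = max(degrees)
    return {
        k: sum(1 for d in degrees if d >= k) / n
        for k in range(1, max_degree + 1)
    }


# Toy degree sequence, purely illustrative.
toy_degrees = [1, 2, 2, 3]
for k, p_k in cumulative_degree_distribution(toy_degrees).items():
    print(k, p_k)
```

Plotting $P_k$ against $\log(k)$ is one way to check whether the curve falls roughly linearly, which is the behavior the slide reports.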
Topological Properties: Clustering Coefficient
Measures transitivity of the network or equivalently the proportion of triangles
Very small for random graphs, high for social networks
Mean CC for $G_{17M,50}$: 0.53
[Plot: CC vs. degree]
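As an illustration of "proportion of triangles", here is a sketch of the standard unweighted local clustering coefficient and its mean over all nodes. Whether the reported 0.53 uses exactly this unweighted definition is an assumption; the function names and the toy graph are hypothetical.

```python
from itertools import combinations


def local_clustering(adjacency, node):
    """Fraction of pairs of neighbors of `node` that are themselves connected.

    `adjacency` maps each node to the set of its neighbors (undirected graph).
    """
    neighbors = adjacency[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # no neighbor pairs, so no triangles are possible
    linked_pairs = sum(1 for a, b in combinations(neighbors, 2) if b in adjacency[a])
    return 2.0 * linked_pairs / (k * (k - 1))


def mean_clustering(adjacency):
    """Average of the local clustering coefficients over all nodes."""
    return sum(local_clustering(adjacency, n) for n in adjacency) / len(adjacency)


# Toy graph: a triangle a-b-c with one extra node d attached to c.
toy_graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
print(mean_clustering(toy_graph))
```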
Topological Properties: Cluster Size Distribution
[Two plots of cluster size against rank: variation with n (m = 50), and variation with m (n = 17M)]
Evaluation: Tag Entropy
w: {t1, t6, t9}
Tag_w: 1 0 0 0 0 1 0 0 1 0
Cluster C: {w1, w2, w3, w4}
Tag_w1: 1 0 0 0 0 1 0 0 1 0
Tag_w2: 0 0 1 0 0 1 0 0 1 0
Tag_w3: 0 0 0 0 0 1 0 0 1 0
Tag_w4: 1 0 1 0 0 1 0 0 1 0
Per-tag result: 1 0 1 0 0 0 0 0 0 0
TE(C) = 2
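The slide gives the example but not the formula for TE(C). One reading that reproduces the value 2 is to sum, over the tags, the binary entropy of the fraction of words in the cluster that carry each tag. The sketch below implements that reading; treat the formula as an assumption, not as the definition used in the talk.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a yes/no variable with probability p of 'yes'."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def tag_entropy(tag_vectors):
    """Sum of per-tag binary entropies over the words of one cluster.

    `tag_vectors` holds one binary tag vector per word (assumed formula).
    """
    n_words = len(tag_vectors)
    return sum(binary_entropy(sum(column) / n_words) for column in zip(*tag_vectors))


# The four tag vectors of the example cluster on the slide.
cluster = [
    [1, 0, 0, 0, 0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 1, 0, 0, 1, 0],
    [1, 0, 1, 0, 0, 1, 0, 0, 1, 0],
]
print(tag_entropy(cluster))
```

Under this reading, tags that all words share (or that no word has) contribute nothing, and only the tags on which the words disagree add to the total.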
Mean Tag Entropy
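$\mathrm{MTE} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{TE}(C_i)$
$\text{Weighted MTE} = \frac{\sum_{i=1}^{N} |C_i| \, \mathrm{TE}(C_i)}{\sum_{i=1}^{N} |C_i|}$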
Caveat: placing every word in its own cluster gives MTE = WMTE = 0
Baseline: Every word in a single cluster
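Both averages are short to compute once the per-cluster tag entropies are known. A sketch with hypothetical function names and made-up inputs; it takes the TE values as given and does not compute them.

```python
def mean_tag_entropy(tag_entropies):
    """MTE: plain average of TE(C_i) over the N clusters."""
    return sum(tag_entropies) / len(tag_entropies)


def weighted_mean_tag_entropy(cluster_sizes, tag_entropies):
    """WMTE: average of TE(C_i) weighted by cluster size |C_i|."""
    weighted_sum = sum(size * te for size, te in zip(cluster_sizes, tag_entropies))
    return weighted_sum / sum(cluster_sizes)


# Made-up clustering: one cluster of 4 words with TE = 2, one singleton with TE = 0.
sizes = [4, 1]
entropies = [2.0, 0.0]
print(mean_tag_entropy(entropies), weighted_mean_tag_entropy(sizes, entropies))
```

The weighting matters because a few very large clusters hold most of the words: a clustering with many small pure clusters and one large mixed cluster can have a low MTE and still a high WMTE.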
Tag Entropy vs. Corpus Size
% reduction in tag entropy, m = 50

| Corpus size (n) | 1M | 2M | 5M | 10M | 17M |
| --- | --- | --- | --- | --- | --- |
| MTE | 74.49 | 75.14 | 76.09 | 78.29 | 74.94 |
| WMTE | 17.46 | 18.68 | 24.23 | 27.56 | 30.60 |
The Bigger, the Worse!
[Plot: tag entropy (y-axis) against cluster size (x-axis)]
Clusters …
Big ones are the bad ones: a mix of everything!
Medium sized clusters are good
http://banglaposclusters.googlepages.com/home
| Rank | Size | Type |
| --- | --- | --- |
| 5 | 596 | Proper nouns, titles and posts |
| 6 | 352 | Possessive case of nouns (common, proper, verbal) and pronouns |
| 8 | 133 | Nouns (common, verbal) forming compounds with “do” or “be” |
| 11 | 44 | Number-Classifier (e.g. 1-TA, ekaTA) |
| 12 | 84 | Adjectives |
More Observations
Words are split into
  First name vs. Surnames
  Animate nouns-poss vs. Inanimate noun-poss
  Nouns-acc vs. Nouns-poss vs. Nouns-loc
  Verb-finite vs. Verb-infinitive
Syntactic or semantic?
  Nouns related to professions, months, days of week, stars, players etc.
Advantages
No labeled data required: A good solution to resource scarcity
No prior class information: Circumvents issues related to tag set definition
Computational definition of Class
Understanding the structure of language (syntax) and its evolution
Danke für Ihre Aufmerksamkeit.
Dieses ist „vom Übersetzer übersetzt worden, der“ von Phasen Microsoft Beta ist.
Thank you for your attention
This has been translated by "Translator Beta" from Microsoft Live.
Related Work
Harris, 68: Distributional hypothesis for syntactic classes
Miller and Charles, 91: Function words as features
Finch and Chater, 92; Schütze, 93, 95; Clark, 00; Rapp, 05; Biemann, 06: The general technique
Haghighi and Klein, 06; Goldwater and Griffiths, 07: Bayesian approach to unsupervised POS tagging
Dasgupta and Ng, 07: Bengali POS induction through morphological features
Medium and Low Frequency Words
Neighboring (window 4) co-occurrences ranked by log-likelihood thresholded by θ
Two words are connected iff they share at least 4 neighbors
| Language | English | Finnish | German |
| --- | --- | --- | --- |
| Nodes | 52857 | 85627 | 137951 |
| Edges | 691241 | 702349 | 1493571 |
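The connection rule on this slide (two words are linked iff they share at least 4 neighbors) can be sketched as follows. The input is assumed to be a mapping from each word to the set of its significant co-occurrence neighbors, already ranked and thresholded; the log-likelihood step itself is not shown, and the names and toy data are hypothetical.

```python
from itertools import combinations


def shared_neighbor_edges(neighbor_sets, min_shared=4):
    """Connect two words iff they share at least `min_shared` neighbors.

    Returns {(word_a, word_b): number of shared neighbors}.
    """
    edges = {}
    for a, b in combinations(sorted(neighbor_sets), 2):
        shared = len(neighbor_sets[a] & neighbor_sets[b])
        if shared >= min_shared:
            edges[(a, b)] = shared
    return edges


# Toy neighbor sets, purely illustrative.
toy_neighbors = {
    "red": {"the", "a", "is", "very", "car"},
    "blue": {"the", "a", "is", "very", "sky"},
    "weight": {"the", "a", "of"},
}
print(shared_neighbor_edges(toy_neighbors))
```

The all-pairs loop is quadratic in the number of words, so graphs of the sizes in the table above would need an inverted index from neighbors to words; that optimization is left out to keep the rule visible.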
Construction of Lexicon
Each word is assigned a unique tag based on the word class it belongs to
  Class 1: sky, color, blood, weight
  Class 2: red, blue, light, heavy
Ambiguous words:
  High and medium frequency words that formed singleton clusters
  Possible tags of neighboring clusters
Training and Evaluation
Unsupervised training of trigram HMM using the clusters and lexicon
Evaluation:
  Tag a text for which a gold standard is available
  Estimate the conditional entropy $H(T|C)$ and the related perplexity $2^{H(T|C)}$
Final results: English – 2.05 (619/345), Finnish – 3.22 (625/466), German – 1.79 (781/440)
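For reference, the conditional entropy of gold tags T given induced clusters C, and the perplexity derived from it, can be estimated from token-level (tag, cluster) pairs as sketched below. The estimator uses plain relative frequencies; the function names and the sample pairs are hypothetical.

```python
import math
from collections import Counter


def conditional_entropy(tag_cluster_pairs):
    """H(T|C) in bits, estimated from a sequence of (gold_tag, cluster) pairs."""
    pairs = list(tag_cluster_pairs)
    n = len(pairs)
    joint_counts = Counter(pairs)
    cluster_counts = Counter(cluster for _, cluster in pairs)
    entropy = 0.0
    for (_, cluster), count in joint_counts.items():
        # p(t, c) * log2 p(t | c)
        entropy -= (count / n) * math.log2(count / cluster_counts[cluster])
    return entropy


def perplexity(tag_cluster_pairs):
    """2 ** H(T|C): the effective number of gold tags per cluster."""
    return 2.0 ** conditional_entropy(tag_cluster_pairs)


# Made-up tagged tokens: cluster C1 is pure, cluster C2 mixes two tags evenly.
sample = [("At", "C1"), ("At", "C1"), ("NN", "C2"), ("JJ", "C2")]
print(conditional_entropy(sample), perplexity(sample))
```

A perplexity of 1 means every cluster maps to exactly one gold tag, so lower values indicate purer clusters.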
Example
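From the familiar to the exotic, the collection is a delight

| Word | POS tag | Cluster |
| --- | --- | --- |
| From | Prep | C200 |
| the | At | C1 |
| familiar | JJ | C331 |
| to | Prep | C5 |
| the | At | C1 |
| exotic | JJ | C331 |
| the | At | C1 |
| collection | NN | C221 |
| is | V | C3 |
| a | At | C1 |
| delight | NN | C220 |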