
Page 1: Noun Sense Induction and Disambiguation using Graph-Based Distributional Semantics

Seminar at University of Leipzig, 8 September 2016, Leipzig, Germany

Noun Sense Induction and Disambiguation using Graph-Based Distributional Semantics

Alexander Panchenko, Johannes Simon, Martin Riedl and Chris Biemann

Technische Universität Darmstadt, LT Group, Computer Science Department, Germany


Page 2: Noun Sense Induction and Disambiguation using Graph-Based Distributional Semantics

Summary

▶ Panchenko A., Simon J., Riedl M., Biemann C. "Noun Sense Induction and Disambiguation using Graph-Based Distributional Semantics". In Proceedings of KONVENS 2016, Bochum, Germany.

▶ An approach to word sense induction and disambiguation.
▶ The method is unsupervised and knowledge-free.
▶ Sense induction by clustering of word similarity networks.
▶ Feature aggregation w.r.t. the induced inventory.
▶ Comparable to the state-of-the-art unsupervised WSD (SemEval'13 participants and various sense embeddings).
▶ Open source implementation: github.com/tudarmstadt-lt/JoSimText


Page 3: Noun Sense Induction and Disambiguation using Graph-Based Distributional Semantics

Motivation for Unsupervised Knowledge-Free Word Sense Disambiguation

▶ A word sense disambiguation (WSD) system:
  ▶ Input: a word and its context.
  ▶ Output: a sense of this word.
▶ Surveys: Agirre and Edmonds (2007) and Navigli (2009).
▶ Knowledge-based approaches rely on hand-crafted resources, such as WordNet.
▶ Supervised approaches learn from hand-labeled training data, such as SemCor.
▶ Problem 1: hand-crafted lexical resources and training data are expensive to create, often inconsistent and domain-dependent.
▶ Problem 2: these methods assume a fixed sense inventory:
  ▶ senses emerge and disappear over time;
  ▶ different applications require different granularities of the sense inventory.
▶ An alternative route is the unsupervised knowledge-free approach:
  ▶ learn an interpretable sense inventory;
  ▶ learn a disambiguation model.


Page 4: Noun Sense Induction and Disambiguation using Graph-Based Distributional Semantics

Contribution

▶ The contribution is a framework that relies on induced inventories as a pivot for learning contextual feature representations and disambiguation.

▶ We rely on the JoBimText framework for distributional semantics (Biemann and Riedl, 2013), adding word sense disambiguation functionality on top of it.

▶ The advantage of our method, compared to prior art, is that it can integrate several types of context features in an unsupervised way.

▶ The method achieves state-of-the-art results in unsupervised WSD.


Page 5: Noun Sense Induction and Disambiguation using Graph-Based Distributional Semantics

Method: Data-Driven Noun Sense Modelling

1. Computation of a distributional thesaurus
   ▶ using distributional semantics
2. Word sense induction
   ▶ using ego-network clustering of related words
3. Building a disambiguation model of the induced senses
   ▶ by feature aggregation w.r.t. the induced sense inventory


Page 6: Noun Sense Induction and Disambiguation using Graph-Based Distributional Semantics

Method: Distributional Thesaurus of Nouns using the JoBimText framework

▶ A distributional thesaurus (DT) is a graph of word similarities, such as "(Python, Java, 0.781)".
▶ We used the JoBimText framework (Biemann and Riedl, 2013):
  ▶ efficient computation of nearest neighbours for all words;
  ▶ state-of-the-art performance (Riedl, 2016).
▶ For each noun in the corpus, retrieve the 200 most similar nouns.


Page 7: Noun Sense Induction and Disambiguation using Graph-Based Distributional Semantics

Method: Distributional Thesaurus of Nouns using the JoBimText framework (cont.)

▶ For each noun in the corpus, retrieve the l = 200 most similar nouns (a minimal sketch of steps 3-7 follows this list):

1. Extract word, feature and word-feature frequencies:
   ▶ using dependency-based features, such as amod(•, grilled) or prep_for(•, dinner), extracted with the Malt parser (Nivre et al., 2007);
   ▶ dependencies are collapsed in the same way as in the Stanford dependencies.
2. Discard rare words, features and word-feature pairs (t < 3).
3. Normalize word-feature scores using Local Mutual Information (LMI):

$$\mathrm{LMI}(i,j) = f_{ij} \cdot \mathrm{PMI}(i,j) = f_{ij} \cdot \log \frac{f_{ij} \cdot \sum_{i,j} f_{ij}}{f_{i*} \cdot f_{*j}}$$

4. Rank word features by LMI.
5. Prune all but the p = 1000 most significant features per word.
6. Compute word similarities as the number of features shared by two words:

$$\mathrm{sim}(t_i, t_j) = |\{k : f_{ik} > 0 \wedge f_{jk} > 0\}|$$

7. Return the l = 200 most related words per word.
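To make steps 3-7 concrete, here is a minimal, self-contained Python sketch (the actual implementation is the distributed JoBimText/JoSimText pipeline; all words, features and counts below are toy values, not from the paper):

```python
import math
from collections import Counter, defaultdict

# Toy word-feature counts f(w, c); in practice these come from dependency
# parses of a large corpus (step 1) after frequency pruning (step 2).
wf = Counter({("python", "amod(•,scripting)"): 12,
              ("python", "prep_of(skin,•)"): 5,
              ("java",   "amod(•,scripting)"): 9,
              ("boa",    "prep_of(skin,•)"): 7})

total = sum(wf.values())             # N = sum_{i,j} f_ij
f_w = defaultdict(int)               # f_i*: word marginals
f_c = defaultdict(int)               # f_*j: feature marginals
for (w, c), f in wf.items():
    f_w[w] += f
    f_c[c] += f

def lmi(w, c):
    """Local Mutual Information: f_ij * log(f_ij * N / (f_i* * f_*j))."""
    f = wf[(w, c)]
    return f * math.log(f * total / (f_w[w] * f_c[c])) if f else 0.0

# Steps 4-5: rank features per word by LMI, keep the top p (p = 1000 in the paper).
p = 1000
feats = defaultdict(list)
for (w, c) in wf:
    feats[w].append(c)
top = {w: set(sorted(cs, key=lambda c: lmi(w, c), reverse=True)[:p])
       for w, cs in feats.items()}

# Step 6: similarity = number of shared significant features.
def sim(w1, w2):
    return len(top[w1] & top[w2])

print(sim("python", "java"), sim("python", "boa"))   # -> 1 1
```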


Page 8: Noun Sense Induction and Disambiguation using Graph-Based Distributional Semantics

Method: Noun Sense Induction via Ego-Network Clustering

▶ The "furniture" and the "data" sense clusters of the word "table".▶ Graph clustering using the Chinese Whispers algorithm (Biemann, 2006).


Page 9: Noun Sense Induction and Disambiguation using Graph-Based Distributional Semantics

Method: Noun Sense Induction via Ego-Network Clustering (cont.)

▶ Process one word per iteration.
▶ Construct an ego-network of the word:
  ▶ use dependency-based distributional word similarities;
  ▶ the ego-network size (N): the number of related words;
  ▶ the ego-network connectivity (n): how strongly the neighbours are related; this parameter controls the granularity of the sense inventory.
▶ Graph clustering using the Chinese Whispers algorithm (a compact re-implementation is sketched below).
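For illustration, a compact re-implementation of the Chinese Whispers idea on a hypothetical ego-network of "table" (real ego-networks use the top-N DT neighbours of the ego word; the edge weights here are invented):

```python
import random
from collections import defaultdict

# Toy ego-network of "table": edges between its nearest neighbours,
# weighted by distributional similarity (the ego word itself is excluded).
edges = {("desk", "chair"): 0.8, ("chair", "furniture"): 0.7,
         ("desk", "furniture"): 0.6, ("row", "column"): 0.9,
         ("column", "matrix"): 0.7, ("row", "matrix"): 0.6}

graph = defaultdict(dict)
for (a, b), w in edges.items():
    graph[a][b] = w
    graph[b][a] = w

def chinese_whispers(graph, iterations=20, seed=0):
    """Each node repeatedly adopts the label with the highest total edge
    weight among its neighbours; the number of clusters is not fixed."""
    rng = random.Random(seed)
    labels = {node: node for node in graph}   # start: one cluster per node
    nodes = list(graph)
    for _ in range(iterations):
        rng.shuffle(nodes)                    # randomized update order
        for node in nodes:
            scores = defaultdict(float)
            for nb, w in graph[node].items():
                scores[labels[nb]] += w
            if scores:
                labels[node] = max(scores, key=scores.get)
    return labels

clusters = defaultdict(list)
for node, label in chinese_whispers(graph).items():
    clusters[label].append(node)
print(list(clusters.values()))
# e.g. [['desk', 'chair', 'furniture'], ['row', 'column', 'matrix']]
```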


Page 10: Noun Sense Induction and Disambiguation using Graph-Based Distributional Semantics

Method: Disambiguation of Induced Noun Senses

▶ Learning a disambiguation model P(s_i | C) for each of the induced senses s_i ∈ S of the target word w in context C = {c_1, ..., c_m}.

▶ We use the Naïve Bayes model:

$$P(s_i \mid C) = \frac{P(s_i) \prod_{j=1}^{|C|} P(c_j \mid s_i)}{P(c_1, \ldots, c_m)}$$

▶ The best sense given the context C (a log-space sketch follows):

$$s_i^* = \arg\max_{s_i \in S} \; P(s_i) \prod_{j=1}^{|C|} P(c_j \mid s_i)$$
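A minimal sketch of this argmax, computed in log-space to avoid floating-point underflow; the sense labels and probabilities are hypothetical:

```python
import math

def best_sense(context, priors, cond, alpha=1e-5):
    """s* = arg max_s P(s) * prod_j P(c_j | s), in log-space; unseen
    features fall back to the smoothing constant alpha (see next slide)."""
    def log_score(s):
        return math.log(priors[s]) + sum(math.log(cond[s].get(c, alpha))
                                         for c in context)
    return max(priors, key=log_score)

# Hypothetical numbers for two induced senses of "table".
priors = {"table#0": 0.4, "table#1": 0.6}            # cluster-size prior P(s_i)
cond = {"table#0": {"column": 0.01,   "chair": 0.0001},
        "table#1": {"column": 0.0001, "chair": 0.02}}
print(best_sense(["column"], priors, cond))          # -> table#0
```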


Page 11: Noun Sense Induction and Disambiguation using Graph-Based Distributional Semantics

Method: Disambiguation of Induced Noun Senses (cont.)

▶ The prior probability of each sense is computed based on the largest-cluster heuristic:

$$P(s_i) = \frac{|s_i|}{\sum_{s_i \in S} |s_i|}$$

▶ We extract sense representations by aggregating features from all words of the cluster s_i.

▶ Probability of the feature c_j given the sense s_i (a toy aggregation sketch follows):

$$P(c_j \mid s_i) = \frac{1-\alpha}{\Lambda_i} \sum_{k}^{|s_i|} \lambda_k \, \frac{f(w_k, c_j)}{f(w_k)} + \alpha$$

▶ To normalize the score, we divide it by the sum of all the weights, $\Lambda_i = \sum_{k}^{|s_i|} \lambda_k$.
▶ α is a small number, e.g. 10⁻⁵, added for smoothing.
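A sketch of the aggregation formula above, assuming λ_k is the DT similarity of cluster word w_k to the target word; all counts are toy values:

```python
def sense_feature_prob(cluster, feature, f, f_w, alpha=1e-5):
    """P(c_j | s_i): weighted average of the normalized word-feature
    counts over all cluster words, smoothed by alpha.
    `cluster` maps word w_k -> weight lambda_k; `f` holds f(w_k, c_j)
    and `f_w` holds the word totals f(w_k)."""
    Lambda = sum(cluster.values())    # normalizer over the weights
    s = sum(lam * f.get((w, feature), 0) / f_w[w]
            for w, lam in cluster.items())
    return (1 - alpha) / Lambda * s + alpha

# Hypothetical counts: a "furniture" sense cluster of "table".
f = {("desk", "amod(•,wooden)"): 10, ("chair", "amod(•,wooden)"): 6}
f_w = {"desk": 100, "chair": 80}
cluster = {"desk": 0.8, "chair": 0.7}
print(sense_feature_prob(cluster, "amod(•,wooden)", f, f_w))  # ~0.088
```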


Page 12: Noun Sense Induction and Disambiguation using Graph-Based Distributional Semantics

Method: Disambiguation of Induced Noun Senses (cont.)

▶ To build a WSD model we need to extract from the corpus:
  1. the distributional thesaurus;
  2. sense clusters;
  3. word-feature frequencies.
▶ Sense representations are obtained by "averaging" the feature representations of the words in the sense clusters.


Page 13: Noun Sense Induction and Disambiguation using Graph-Based Distributional Semantics

Feature Extraction: Single Models

▶ The method requires sparse word-feature counts f(w_k, c_j).
▶ We demonstrate the approach on the following four types of features (a toy extraction sketch follows this list):

1. Features based on sense clusters: Cluster
   ▶ Features: words from the induced sense clusters;
   ▶ Weights: similarity scores.
2. Dependency features: Dep_target, Dep_all
   ▶ Features: syntactic dependencies attached to the word, e.g. “subj(•,type)” or “amod(digital,•)”;
   ▶ Weights: LMI scores.
3. Dependency word features: Dep_word
   ▶ Features: words extracted from all syntactic dependencies attached to a target word. For instance, the feature “subj(•,write)” would result in the feature “write”;
   ▶ Weights: LMI scores.
4. Trigram features: Trigram_target, Trigram_all
   ▶ Features: pairs of left and right words around the target word, e.g. “typing_•_or” and “digital_•_.”;
   ▶ Weights: LMI scores.
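For illustration, minimal extractors for the Trigram and Dep_word features; the helper names and the padding tokens <s>/</s> are our own conventions, not from the paper:

```python
def trigram_feature(tokens, i):
    """Trigram feature: the pair of words left and right of the target
    at index i, e.g. 'typing_•_or'; sentence edges padded with <s>/</s>."""
    left = tokens[i - 1] if i > 0 else "<s>"
    right = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
    return f"{left}_•_{right}"

def dep_word_feature(dep):
    """Dep_word: reduce a dependency feature to its word,
    e.g. 'subj(•,write)' -> 'write'."""
    inside = dep[dep.index("(") + 1:dep.rindex(")")]
    return [t for t in inside.split(",") if t != "•"][0]

tokens = "touch typing keyboard or digital keyboard .".split()
print(trigram_feature(tokens, 2))         # -> typing_•_or
print(trigram_feature(tokens, 5))         # -> digital_•_.
print(dep_word_feature("subj(•,write)"))  # -> write
```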


Page 14: Noun Sense Induction and Disambiguation using Graph-Based Distributional Semantics

Feature Combination: Combined Models

▶ Feature-level combination of features:
  ▶ union of context features of different types, such as dependencies and trigrams;
  ▶ "stacking" of feature spaces.
▶ Meta-level combination of features (see the sketch after this list):
  1. Independent sense classifications by single models.
  2. Aggregation of the predictions with one of three strategies:
     ▶ Majority selects the sense s_i chosen by the largest number of single models.
     ▶ Ranks. First, the results of each single-model classification are ranked by their confidence P̂(s_i | C): the sense most suitable to the context obtains rank one, and so on. Finally, we assign the sense with the least sum of ranks.
     ▶ Sum. This strategy assigns the sense with the largest sum of classification confidences, i.e., $\sum_i \hat{P}_i(s_k \mid C)$, where i is the index of the single model.
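A sketch of the three aggregation strategies over hypothetical per-model confidence scores:

```python
from collections import Counter

def combine(predictions, strategy="majority"):
    """Aggregate per-model confidences {sense: P(s|C)} into one choice;
    `predictions` holds one dict per single model."""
    if strategy == "majority":          # most frequent top-1 sense
        votes = Counter(max(p, key=p.get) for p in predictions)
        return votes.most_common(1)[0][0]
    if strategy == "sum":               # largest sum of confidences
        totals = Counter()
        for p in predictions:
            totals.update(p)
        return max(totals, key=totals.get)
    if strategy == "ranks":             # rank 1 = most confident
        rank_sums = Counter()
        for p in predictions:
            for rank, s in enumerate(sorted(p, key=p.get, reverse=True), 1):
                rank_sums[s] += rank
        return min(rank_sums, key=rank_sums.get)

models = [{"table#0": 0.7, "table#1": 0.3},
          {"table#0": 0.4, "table#1": 0.6},
          {"table#0": 0.9, "table#1": 0.1}]
for strategy in ("majority", "sum", "ranks"):
    print(strategy, combine(models, strategy))   # all -> table#0 here
```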


Page 15: Noun Sense Induction and Disambiguation using Graph-Based Distributional Semantics

Corpora used for experiments

Corpus       # Tokens       Size       Text Type
Wikipedia    1.863 · 10⁹    11.79 GB   encyclopaedic
ukWaC        1.980 · 10⁹    12.05 GB   Web pages

Table: Corpora used for training our models.


Page 16: Noun Sense Induction and Disambiguation using Graph-Based Distributional Semantics

Results: Evaluation on the “Python-Ruby-Jaguar” (PRJ) dataset: 3 words, 60 contexts, 2 senses per word

▶ A simple dataset: 60 contexts, 2 homonymous senses per word.
▶ The models based on the meta-combinations are not shown for brevity, as they did not improve the performance of the presented models in terms of F-score.


Page 17: Noun Sense Induction and Disambiguation using Graph-Based Distributional Semantics

Results: Evaluation on the TWSI dataset: 1012 nouns, 145140 contexts, 2.33 senses per word


Page 18: Noun Sense Induction and Disambiguation using Graph-Based Distributional Semantics

Results: the TWSI dataset: effect of the corpus choice on the WSD performance

▶ The 10 best models according to the F-score on the TWSI dataset.
▶ Trained on the Wikipedia and ukWaC corpora.


Page 19: Noun Sense Induction and Disambiguation using Graph-Based Distributional Semantics

Results: Evaluation on the SemEval 2013 Task 13 dataset: 20 nouns, 1848 contexts


Page 20: Noun Sense Induction and Disambiguation using Graph-Based Distributional Semantics

Thank you!


Page 21: Noun Sense Induction and Disambiguation using Graph-Based Distributional Semantics

Word Embeddings for WSD using Graph-Based Distributional Semantics

▶ Pelevina M., Arefiev N., Biemann C., Panchenko A. "Making Sense of Word Embeddings". In Proceedings of the 1st Workshop on Representation Learning for NLP, ACL 2016, Berlin, Germany. Best Paper Award.

▶ An approach to learn word sense embeddings.
▶ The same approach as presented above, but using word2vec instead of JoBimText: dense vs. sparse feature representations.


Page 22: Noun Sense Induction and Disambiguation using Graph-Based Distributional Semantics

Overview of the contribution

Prior methods:

▶ Induce an inventory by clustering of word instances (Li and Jurafsky, 2015)
▶ Use existing inventories (Rothe and Schütze, 2015)

Our method:

▶ Input: word embeddings
▶ Output: word sense embeddings
▶ Word sense induction by clustering of word ego-networks
▶ Word sense disambiguation based on the induced sense representations (a sense-vector pooling sketch follows)
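For intuition, a minimal sketch of how a sense vector can be pooled from the word vectors of its cluster members: a similarity-weighted average, which roughly follows the pooling idea of SenseGram; all vectors and weights below are toy values:

```python
import numpy as np

def sense_vector(cluster, word_vectors):
    """Pool a sense cluster into one vector: a similarity-weighted
    average of its members' word vectors, length-normalized."""
    v = sum(sim * word_vectors[w] for w, sim in cluster.items())
    return v / np.linalg.norm(v)

# Toy 2-d vectors for a "furniture" cluster of "table".
word_vectors = {"desk": np.array([0.1, 0.9]), "chair": np.array([0.2, 0.8])}
cluster = {"desk": 0.8, "chair": 0.7}        # neighbour -> similarity weight
print(sense_vector(cluster, word_vectors))   # unit vector ~ [0.17, 0.99]
```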


Page 23: Noun Sense Induction and Disambiguation using Graph-Based Distributional Semantics

Learning Word Sense Embeddings


Page 24: Noun Sense Induction and Disambiguation using Graph-Based Distributional Semantics

Word Sense Induction: Ego-Network Clustering

▶ The "furniture" and the "data" sense clusters of the word "table".▶ Graph clustering using the Chinese Whispers algorithm (Biemann, 2006).


Page 25: Noun Sense Induction and Disambiguation using Graph-Based Distributional Semantics

Neighbours of Word and Sense Vectors

Vector    Nearest Neighbours
table     tray, bottom, diagram, bucket, brackets, stack, basket, list, parenthesis, cup, trays, pile, playfield, bracket, pot, drop-down, cue, plate
table#0   leftmost#0, column#1, randomly#0, tableau#1, top-left#0, indent#1, bracket#3, pointer#0, footer#1, cursor#1, diagram#0, grid#0
table#1   pile#1, stool#1, tray#0, basket#0, bowl#1, bucket#0, box#0, cage#0, saucer#3, mirror#1, birdcage#0, hole#0, pan#1, lid#0

▶ Neighbours of the word “table" and its senses produced by our method.
▶ The neighbours of the initial vector belong to both senses.
▶ The neighbours of the sense vectors are sense-specific.


Page 26: Noun Sense Induction and Disambiguation using Graph-Based Distributional Semantics

Word Sense Disambiguation

1. Context Extraction
   ▶ use context words around the target word
2. Context Filtering
   ▶ based on each context word’s relevance for disambiguation
3. Sense Choice
   ▶ maximize the similarity between the context vector and the sense vectors (a minimal sketch follows)
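A minimal sketch of the sense choice, assuming the context vector is the mean of the (filtered) context word vectors and similarity is cosine; all vectors are toy values:

```python
import numpy as np

def disambiguate(context_vectors, sense_vectors):
    """Pick the sense whose vector is most cosine-similar to the
    mean vector of the context words."""
    ctx = np.mean(context_vectors, axis=0)
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(sense_vectors, key=lambda s: cos(ctx, sense_vectors[s]))

# Hypothetical 3-d vectors; real ones come from word2vec and pooled clusters.
sense_vectors = {"table#0": np.array([1.0, 0.1, 0.0]),   # "data" sense
                 "table#1": np.array([0.0, 0.2, 1.0])}   # "furniture" sense
context = [np.array([0.9, 0.0, 0.1]),                    # e.g. "row"
           np.array([0.8, 0.3, 0.0])]                    # e.g. "column"
print(disambiguate(context, sense_vectors))              # -> table#0
```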


Page 27: Noun Sense Induction and Disambiguation using Graph-Based Distributional Semantics

Word Sense Disambiguation: Example


Page 28: Noun Sense Induction and Disambiguation using Graph-Based Distributional Semantics

Evaluation on SemEval 2013 Task 13 dataset: comparison to the state-of-the-art

Model                        Jacc.   Tau     WNDCG   F.NMI   F.B-Cubed
AI-KU (add1000)              0.176   0.609   0.205   0.033   0.317
AI-KU                        0.176   0.619   0.393   0.066   0.382
AI-KU (remove5-add1000)      0.228   0.654   0.330   0.040   0.463
Unimelb (5p)                 0.198   0.623   0.374   0.056   0.475
Unimelb (50k)                0.198   0.633   0.384   0.060   0.494
UoS (#WN senses)             0.171   0.600   0.298   0.046   0.186
UoS (top-3)                  0.220   0.637   0.370   0.044   0.451
La Sapienza (1)              0.131   0.544   0.332   –       –
La Sapienza (2)              0.131   0.535   0.394   –       –
AdaGram, α = 0.05, 100 dim   0.274   0.644   0.318   0.058   0.470
w2v                          0.197   0.615   0.291   0.011   0.615
w2v (nouns)                  0.179   0.626   0.304   0.011   0.623
JBT                          0.205   0.624   0.291   0.017   0.598
JBT (nouns)                  0.198   0.643   0.310   0.031   0.595
TWSI (nouns)                 0.215   0.651   0.318   0.030   0.573


Page 29: Noun Sense Induction and Disambiguation using Graph-Based Distributional Semantics

Conclusion

▶ Novel approach for learning word sense embeddings.

▶ Can use existing word embeddings as input.

▶ WSD performance comparable to the state-of-the-art systems.

▶ Source code and pre-trained models: https://github.com/tudarmstadt-lt/SenseGram


Page 30: Noun Sense Induction and Disambiguation using Graph-Based Distributional Semantics

Evaluation based on the TWSI dataset: a large-scale dataset for development


Page 31: Noun Sense Induction and Disambiguation using Graph-Based Distributional Semantics

Thank you!
