CoNLL TRANSCRIPT
1
Combining Lexical and Syntactic Features for
Supervised Word Sense Disambiguation
Saif Mohammad, Univ. of Toronto
http://www.cs.toronto.edu/~smm
Ted Pedersen, Univ. of Minnesota, Duluth
http://www.d.umn.edu/~tpederse
2
Word Sense Disambiguation
Harry cast a bewitching spell.
We understand the target word "spell" in this context to mean a charm or incantation,
not "reading out letter by letter" or "a period of time".
Automatically identifying the intended sense of a word based on its context is hard!
Best accuracies are often around 65%-75%.
3
WSD as Classification
Learn a model for a given target word from a corpus of manually sense-tagged training examples.
The model assigns the target word a sense based on the context in which it occurs; the context is represented by a feature set.
Evaluate the model on a held-out test set.
4
Motivations
Lexical features do "reasonably well" at supervised WSD (Duluth systems in Senseval-2; Pedersen, NAACL-2001).
POS features do "reasonably well" too.
Complementary or redundant?
Complementary? Find the simplest ways to represent instances and combine results to improve performance.
Redundant? We can reduce the feature space without affecting performance.
5
Decision Trees
Assigns a sense to an instance by asking a series of questions.
Questions correspond to features of the instance and depend on previous answers.
In the tree:
The topmost node is called the root.
Each node corresponds to a feature.
Each value of a feature has a branch.
Each path terminates in a sense (a leaf).
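A minimal sketch of learning such a per-word sense classifier, using scikit-learn's DecisionTreeClassifier as a stand-in for the Weka J48 trees the slides mention; the toy instances, sense labels, and binary context features are invented for illustration:

```python
# Sketch: learning a per-word sense classifier as a decision tree.
# scikit-learn stands in for Weka's J48; the data is a toy example.
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

# Each training instance: binary features describing the context of "spell".
train = [
    ({"has_cast": 1, "has_letter": 0}, "incantation"),
    ({"has_cast": 0, "has_letter": 1}, "spell_out"),
    ({"has_cast": 0, "has_letter": 0, "has_dry": 1}, "period_of_time"),
]
X_dicts, y = zip(*train)

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(X_dicts)          # feature dicts -> numeric matrix

tree = DecisionTreeClassifier(random_state=0).fit(X, list(y))

# Classify a new context: "Harry cast a bewitching spell"
test = vec.transform([{"has_cast": 1}])  # absent features default to 0
print(tree.predict(test))                # -> ['incantation']
```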
6
WSD Tree
[Figure: example decision tree for WSD. The root node tests Feature 1; internal nodes test Features 2, 3, and 4; each branch is labeled with a feature value (0 or 1); each leaf assigns one of Senses 1-4.]
7
Why Decision Trees?
Many kinds of features can contribute to WSD performance.
Many learning algorithms result in comparable classifiers when given the same set of features.
A learned decision tree captures interactions among features.
Many implementations are available (e.g., Weka's J48).
8
Lexical Features
Surface form: the observed form of the target word.
Unigrams and bigrams: one- and two-word sequences, identified with the Ngram Statistics Package.
http://www.d.umn.edu/~tpederse/nsp.html
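A minimal sketch of extracting these lexical features in Python; this is a naive stand-in for NSP (a Perl package), and the tokenization and feature-naming scheme are invented for illustration:

```python
# Sketch of lexical feature extraction: surface form, unigrams, bigrams.
def lexical_features(tokens, target_index):
    feats = {"surface=" + tokens[target_index]: 1}  # observed form of target
    for w in tokens:                                # one-word sequences
        feats["uni=" + w] = 1
    for w1, w2 in zip(tokens, tokens[1:]):          # two-word sequences
        feats["bi=%s_%s" % (w1, w2)] = 1
    return feats

tokens = "harry cast a bewitching spell".split()
print(lexical_features(tokens, target_index=4))
```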
9
POS Features
The surrounding POS tags can indicate different senses:
Why did Jack turn/VB against/IN his/PRP$ team/NN
Why did Jack turn/VB left/NN at/IN the/DT crossing
Individual word POS features: P-2, P-1, P0, P1, P2.
Used individually and in combination.
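A minimal sketch of building the P-2 through P2 features from an already-tagged sentence (the slides use the Brill Tagger for tagging; the boundary-padding convention here is an assumption):

```python
# Sketch of the POS window features: tags of the target word and its
# two neighbours on either side.
def pos_window(tags, i, width=2):
    feats = {}
    for offset in range(-width, width + 1):
        j = i + offset
        feats["P%d" % offset] = tags[j] if 0 <= j < len(tags) else "NONE"
    return feats

# "Why did Jack turn/VB left/NN at/IN the/DT crossing"
tags = ["WRB", "VBD", "NNP", "VB", "NN", "IN", "DT", "NN"]
print(pos_window(tags, i=3))   # target = "turn"
# -> {'P-2': 'VBD', 'P-1': 'NNP', 'P0': 'VB', 'P1': 'NN', 'P2': 'IN'}
```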
10
Part of Speech Tagging
Brill Tagger: open source and easy to understand.
Guaranteed Pre-Tagging: manually tag the target words; implemented in BrillPatch.
11
Parse Features
Head word of the target phrase: the hard work, the hard surface.
Head word of the parent phrase: fasten the line, cross the line.
Target and parent phrase POS: noun phrase, verb phrase, ...
Used individually and in combination.
Obtained via the Collins Parser.
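A minimal sketch of reading these features off a parse tree with NLTK; real head finding, as in the Collins Parser, uses per-category head rules, so the toy rules below (a verb heads a VP, anything else is headed by its rightmost word) are a deliberate simplification:

```python
# Sketch of the parse features: head word and POS of the phrase containing
# the target, and of its parent phrase.
from nltk import Tree

def head_word(phrase):
    # Toy head rules; real head-finding tables are much richer.
    leaves = phrase.leaves()
    return leaves[0] if phrase.label() == "VP" else leaves[-1]

def parse_features(tree, target):
    # Locate the target leaf, then take the smallest phrase above its
    # preterminal and that phrase's parent.
    leaf = next(p for p in tree.treepositions("leaves") if tree[p] == target)
    phrase, parent = tree[leaf[:-2]], tree[leaf[:-3]]
    return {"head": head_word(phrase), "phrase_pos": phrase.label(),
            "parent_head": head_word(parent), "parent_pos": parent.label()}

t = Tree.fromstring(
    "(S (NP (PRP he)) (VP (VBD crossed) (NP (DT the) (NN line))))")
print(parse_features(t, "line"))
# -> {'head': 'line', 'phrase_pos': 'NP',
#     'parent_head': 'crossed', 'parent_pos': 'VP'}
```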
12
Experiments
How accurate are simple classifiers based on a single feature type?
How complementary or redundant are lexical and syntactic features?
Is it possible (in theory at least) to combine just a few very simple classifiers and achieve near state-of-the-art results?
13
Experiments
Learn a decision tree based on a single feature type: surface, unigram, bigram, POS, parse, ...
Combine pairs of these trees via a simple ensemble technique: a weighted vote.
14
Sense-Tagged Data
Senseval-2 data: 4328 test instances, 8611 training instances; 73 nouns, verbs, and adjectives.
Senseval-1 data: 8512 test instances, 13276 training instances; 35 nouns, verbs, and adjectives.
line, hard, interest, and serve data: 4149, 4337, 4378, and 2476 instances.
50,000 sense-tagged instances in all!
15
Lexical Features
Feature        Sval-2  Sval-1  line   hard   serve  interest
Majority       47.7%   56.3%   54.3%  81.5%  42.2%  54.9%
Surface Form   49.3%   62.9%   54.3%  81.5%  44.2%  64.0%
Unigram        55.3%   66.9%   74.5%  83.4%  73.3%  75.7%
Bigram         55.1%   66.9%   72.9%  89.5%  72.1%  79.9%
16
POS Features
Feature   Sval-2  Sval-1  line   hard   serve  interest
Majority  47.7%   56.3%   54.3%  81.5%  42.2%  54.9%
P-2       47.1%   57.5%   54.9%  81.6%  60.3%  56.0%
P-1       49.6%   59.2%   56.2%  82.1%  60.2%  62.7%
P0        49.9%   60.3%   54.3%  81.6%  58.0%  64.0%
P1        53.1%   63.9%   54.2%  81.6%  73.0%  65.3%
P2        48.9%   59.9%   54.3%  81.7%  75.7%  62.3%
17
Combining POS Features
Feature set           Sval-2  Sval-1  line   hard   serve  interest
Majority              47.7%   56.3%   54.3%  81.5%  42.2%  54.9%
P0, P1                54.3%   66.7%   54.1%  81.9%  60.2%  70.5%
P-1, P0, P1           54.6%   68.0%   60.4%  84.8%  73.0%  78.8%
P-2, P-1, P0, P1, P2  54.6%   67.8%   62.3%  86.2%  75.7%  80.6%
18
Parse Features
Feature            Sval-2  Sval-1  line   hard   serve  interest
Majority           47.7%   56.3%   54.3%  81.5%  42.2%  54.9%
Head Word          51.7%   64.3%   54.7%  87.8%  47.4%  69.1%
Parent Word        50.0%   60.6%   59.8%  84.5%  57.2%  67.8%
Phrase POS         52.9%   58.5%   54.3%  81.5%  41.4%  54.9%
Parent Phrase POS  52.7%   57.9%   54.3%  81.7%  41.6%  54.9%
19
Discussion
Lexical and syntactic features perform comparably. Do they get the same instances right?
Are there instances disambiguated by one feature set and not by the other?
How complementary are the individual feature sets?
20
Measures
Baseline Ensemble: the accuracy of a hypothetical ensemble that predicts the sense correctly only if both individual feature sets do so.
Optimal Ensemble: the accuracy of a hypothetical ensemble that predicts the sense correctly if either of the individual feature sets does so.
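These two reference points are easy to compute once we know, per test instance, whether each classifier was right. A minimal sketch with invented correctness vectors:

```python
# Sketch of the two hypothetical reference ensembles.
a_correct = [True, True, False, False, True]   # lexical classifier right?
b_correct = [True, False, True, False, True]   # syntactic classifier right?

n = len(a_correct)
baseline = sum(a and b for a, b in zip(a_correct, b_correct)) / n  # both right
optimal = sum(a or b for a, b in zip(a_correct, b_correct)) / n    # either right
print(baseline, optimal)   # -> 0.4 0.8
```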
21
Our Ensemble Approach
We use a weighted-vote ensemble to decide the sense of a target word.
For a given test instance, it takes the output of two classifiers (one lexical and one syntactic) and sums the probabilities associated with each possible sense.
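A minimal sketch of this weighted vote, assuming each classifier outputs a probability distribution over senses; the distributions and sense names below are invented:

```python
# Sketch of the weighted-vote ensemble: sum the two classifiers' per-sense
# probabilities and pick the sense with the highest total.
def weighted_vote(lexical_probs, syntactic_probs):
    senses = set(lexical_probs) | set(syntactic_probs)
    totals = {s: lexical_probs.get(s, 0.0) + syntactic_probs.get(s, 0.0)
              for s in senses}
    return max(totals, key=totals.get)

lexical_probs = {"incantation": 0.6, "spell_out": 0.3, "period_of_time": 0.1}
syntactic_probs = {"incantation": 0.4, "spell_out": 0.5, "period_of_time": 0.1}
print(weighted_vote(lexical_probs, syntactic_probs))   # -> 'incantation'
```

Summing distributions lets each classifier contribute in proportion to its confidence, rather than casting a single hard vote.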
22
Best Combinations
Data (Majority)   Set 1           Set 2              Base   Ours   Optimal  Best
Sval-2 (47.7%)    Unigrams 55.3%  P-1,P0,P1 55.3%    43.6%  57.0%  67.9%    66.7%
Sval-1 (56.3%)    Unigrams 66.9%  P-1,P0,P1 68.0%    57.6%  71.1%  78.0%    81.1%
line (54.3%)      Unigrams 74.5%  P-1,P0,P1 60.4%    55.1%  74.2%  82.0%    88.0%
hard (81.5%)      Bigrams 89.5%   Head,Parent 87.7%  86.1%  88.9%  91.3%    83.0%
serve (42.2%)     Unigrams 73.3%  P-1,P0,P1 73.0%    58.4%  81.6%  89.9%    83.0%
interest (54.9%)  Bigrams 79.9%   P-1,P0,P1 78.8%    67.6%  83.2%  90.1%    89.0%
23
Conclusions
There is a reasonable amount of complementarity across lexical and syntactic features.
Simple lexical and part-of-speech features can be combined to achieve state-of-the-art results.
Future work: how best to capitalize on the complementarity?
24
Senseval-3
Approx. 8000 training and 4000 test instances. English lexical sample task.
Training data collected via Open Mind Word Expert.
Comparative results unveiled at ACL workshop!
25
Software and Data
SyntaLex: WSD using lexical and syntactic features.
posSenseval: POS-tags data in Senseval-2 format using the Brill Tagger.
parseSenseval: parses Brill Tagger output using the Collins Parser.
BrillPatch: supports Guaranteed Pre-Tagging.
Packages to convert the line, hard, serve, and interest data to Senseval-1 and Senseval-2 data formats.
http://www.d.umn.edu/~tpederse/code.html
http://www.d.umn.edu/~tpederse/data.html
26
Individual Word POS: Senseval-1
Feature   All    Nouns  Verbs  Adj.
Majority  56.3%  57.2%  56.9%  64.3%
P-2       57.5%  58.2%  58.6%  64.0%
P-1       59.2%  62.2%  58.2%  64.3%
P0        60.3%  62.5%  58.2%  64.3%
P1        63.9%  65.4%  64.4%  66.2%
P2        59.9%  60.0%  60.8%  65.2%
27
Individual Word POS: Senseval-2
Feature   All    Nouns  Verbs  Adj.
Majority  47.7%  51.0%  39.7%  59.0%
P-2       47.1%  51.9%  38.0%  57.9%
P-1       49.6%  55.2%  40.2%  59.0%
P0        49.9%  55.7%  40.6%  58.2%
P1        53.1%  53.8%  49.1%  61.0%
P2        48.9%  50.2%  43.2%  59.4%
28
Parse Features: Senseval-1
Feature        All    Nouns  Verbs  Adj.
Majority       56.3%  57.2%  56.9%  64.3%
Head Word      64.3%  70.9%  59.8%  66.9%
Parent Word    60.6%  62.6%  60.3%  65.8%
Phrase         58.5%  57.5%  57.2%  66.2%
Parent Phrase  57.9%  58.1%  58.3%  66.2%
29
Parse Features: Senseval-2
Feature        All    Nouns  Verbs  Adj.
Majority       47.7%  51.0%  39.7%  59.0%
Head           51.7%  58.5%  39.8%  64.0%
Parent         50.0%  56.1%  40.1%  59.3%
Phrase         48.3%  51.7%  40.3%  59.5%
Parent Phrase  48.5%  53.0%  39.1%  60.3%