Taal- en spraaktechnologie · Word Sense Disambiguation (WSD), Lecture 2


Page 1: Taal- en spraaktechnologie

Sophia Katrenko

Utrecht University, the Netherlands

Sophia Katrenko Lecture 2

Page 2: Outline

1 Covered so far

2 Today
  Machine learning: what is it?
  Evaluation measures
  Word Sense Disambiguation (WSD)

Page 3: Covered last time

Collocation extraction methods

Page 4: Today

Today we discuss Chapter 7, and more precisely
1 intro to machine learning
2 word sense disambiguation techniques

Page 5: Intro to Machine Learning (ML)

Page 6: Why learning?

When talking about learning w.r.t. natural languages, we consider at least two aspects:

1 (first and second) language acquisition
2 language understanding and generation by a machine

Here, we focus on the second.

Page 7: Machine learning and languages

Learning by a machine can be used to
1 model morphological, syntactic, semantic and pragmatic analysis of a natural language
2 solve application tasks, such as information extraction, summarization, machine translation, and others.

The second group does not exclude input from the first.

Page 8: Example

Software from http://cogcomp.cs.illinois.edu/page/demos/.

Utrecht University has concentrated its leading research into fifteen research focus areas.

PoS Tagging:
NNP/Utrecht NNP/University VBZ/has VBN/concentrated PRP$/its VBG/leading NN/research IN/into NN/fifteen NN/research NN/focus NNS/areas ./.

Shallow parsing:
[NP Utrecht University] [VP has concentrated] [NP its leading research] [PP into] [NP fifteen research focus areas] .

Named entity recognition:
[ORG Utrecht University] has concentrated its leading research into fifteen research focus areas.

Page 12: Main learning notions (1)

Learning involves three components: task T, experience E, and performance measure P.

The goal of learning is to perform well w.r.t. some performance measure P on task T given some past experience or observations.

Consider, for example, weather prediction given previous observations.

Page 15: Main learning notions (2)

But:
What is experience? Is it direct or implicit?
Do given observations reflect the task/goal?
Does the number of observations matter? What about noisy data?

Page 18: Main learning notions (3)

Today, we consider
1 Learning tasks: regression and classification
2 Learning types: supervised, unsupervised, and semi-supervised
3 Evaluation measures: accuracy, precision, recall and F-score
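The evaluation measures listed above can be sketched directly from the counts of true/false positives and negatives on a binary task. A minimal illustration (the counts below are hypothetical, not from the lecture):

```python
def accuracy(tp, tn, fp, fn):
    # fraction of all decisions that were correct
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    # fraction of predicted positives that are truly positive
    return tp / (tp + fp)

def recall(tp, fn):
    # fraction of true positives that were found
    return tp / (tp + fn)

def f_score(p, r):
    # harmonic mean of precision and recall (F1)
    return 2 * p * r / (p + r)

# e.g. 8 true positives, 2 false positives, 8 false negatives, 82 true negatives
p, r = precision(8, 2), recall(8, 8)   # 0.8 and 0.5
f1 = f_score(p, r)
```

Note that accuracy can look high on imbalanced data (here 90%) even when recall is poor, which is why precision, recall and F-score are reported separately.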

Page 19: Main learning notions (4)

Formally, let observations (training data) (X, Y) be defined as (X, Y) ∈ X × Y on the input space X and the output space Y.

Pairs (X, Y) are random variables distributed according to the unknown distribution D.

We denote the observed data points by (x_i, y_i) and say that they are independently and identically distributed according to D.

The goal is to construct a hypothesis h such that for any instance from the input space X it predicts its label from the output space Y, i.e. h : X → Y.

Page 23: Main learning notions (5)

Let also every example x_i ∈ X, i = 1, ..., n be represented by a fixed number of features, x_i = (x_i1, ..., x_ik).

For instance, for the task 'is a given word a noun?', X is a collection of words, Y = {0, 1}, and training examples are of the form {w, y}, where w ∈ X and y ∈ Y, as in {Utrecht, 1}, {in, 0}, ...

For PoS tagging, |Y| > 2 and is represented by tags NNS, IN, NN, and others.
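A toy sketch of the setup above for the task 'is a given word a noun?'. The two features here (capitalization and word length) are illustrative choices, not the lecture's:

```python
def features(word):
    # represent each example by a fixed number of features (k = 2):
    # (is the word capitalized?, word length)
    return (int(word[0].isupper()), len(word))

# training examples {w, y} with w in X and y in Y = {0, 1}
training = [("Utrecht", 1), ("in", 0), ("University", 1), ("the", 0)]
X = [features(w) for w, _ in training]
Y = [y for _, y in training]

# a trivial hypothesis h : X -> Y that predicts "noun" iff capitalized
def h(x):
    return x[0]
```

Such a hypothesis is of course far too crude for real text (lowercase nouns, capitalized sentence starts), but it shows the shape of the mapping h : X → Y.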

Page 26: Main learning notions (6)

Classification: h : X → Y where Y is discrete (a set of categories). If Y = {+1, −1}, then it is a binary classification task.
Regression: the output is continuous (a real number).

There are different types of learning:
  supervised
  unsupervised
  semi-supervised
  active

Page 29: Main learning notions (7)

Supervised learning requires a training set (as described above), which is used by an algorithm to produce a function (hypothesis).

Unsupervised learning uses no labeled data, and its goal is to reveal hidden structure in data.

Semi-supervised learning takes as input both labeled (a small amount) and unlabeled data.

In the active learning scenario, a learning algorithm queries a human expert for the true labels of examples it selects according to some criterion (e.g., an example the algorithm is not certain about).
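The active learning scenario can be sketched on toy one-dimensional data. Everything below is a hypothetical illustration: the classifier is a nearest-centroid rule, and "uncertainty" is taken to be distance to the decision boundary:

```python
labeled = [(1.0, 0), (2.0, 0), (8.0, 1), (9.0, 1)]   # supervised: (x, y) pairs
unlabeled = [3.0, 5.2, 7.5]                          # no labels available

def centroids(data):
    # mean of each class; the "training" step of the nearest-centroid rule
    xs0 = [x for x, y in data if y == 0]
    xs1 = [x for x, y in data if y == 1]
    return sum(xs0) / len(xs0), sum(xs1) / len(xs1)

def predict(x, c0, c1):
    # assign x to the class with the nearer centroid
    return 0 if abs(x - c0) < abs(x - c1) else 1

c0, c1 = centroids(labeled)
boundary = (c0 + c1) / 2.0

# active learning: query the expert for the unlabeled point the model is
# least certain about, i.e. the one closest to the decision boundary
query = min(unlabeled, key=lambda x: abs(x - boundary))
```

After the expert labels the queried point, it would be added to the labeled set and the centroids recomputed; repeating this loop is what makes active learning cheaper than labeling everything.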

Page 33: Main learning notions (7): examples

How does this relate to natural language processing?

Most research in NLP (at least initially) has concerned supervised learning: parsing (treebanks available for training), named entity recognition systems, text categorization, and others.

It has shifted towards semi-supervised learning because of the cost of human labour (e.g., for parsing, Steedman '02).

Semi-supervised methods perform quite well compared to heavy supervised systems.

Unsupervised learning is used when clustering words/documents based on their similarity.

Active learning is less studied, but is becoming more popular in the NLP community (e.g., text annotation by Tomanek et al. '09, and anaphora resolution by Gasperin '09 at the Workshop on Active Learning for NLP).

Page 38: Main learning notions (8)

Empirical risk: the risk of the target function t is the minimum over all possible hypotheses g and is called the Bayes risk, R* = inf_g R(g).

Since the underlying distribution is unknown, the quality of h is usually measured by the empirical error in Eq. 1.

    R_n(h) = (1/n) Σ_{i=1}^{n} ℓ(h(x_i), y_i)    (1)

Zero-one loss: several loss functions have been proposed in the literature so far, the best known of which is the zero-one loss (Eq. 2). This loss is a function that outputs 1 any time a method errs on a data point (h(x_i) ≠ y_i) and 0 otherwise.

    ℓ(h(x_i), y_i) = I_{h(x_i) ≠ y_i}    (2)
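Eqs. 1 and 2 translate almost directly into code: the empirical risk of a hypothesis h is its average zero-one loss over the n observed data points. The data and hypothesis below are hypothetical:

```python
def zero_one_loss(y_pred, y_true):
    # Eq. 2: 1 whenever the hypothesis errs, 0 otherwise
    return 1 if y_pred != y_true else 0

def empirical_risk(h, data):
    # Eq. 1: R_n(h) = (1/n) * sum of the losses over the sample
    return sum(zero_one_loss(h(x), y) for x, y in data) / len(data)

# toy hypothesis: predict 1 for positive inputs
h = lambda x: 1 if x > 0 else 0
data = [(2, 1), (-1, 0), (3, 0), (-4, 0)]
# h errs only on (3, 0), so the empirical risk is 1/4
```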

Page 41: Main learning notions (9)

At first glance the goal of any learning algorithm should be to minimize the empirical error R_n(h), which is often referred to as empirical risk minimization.

This turns out to be insufficient, as some methods can perform well on the training set but be less accurate on new data points.

In structural risk minimization (Eq. 3) not only the empirical error is taken into account, but the complexity (capacity) of h as well. In Eq. 3, pen(h) stands for a penalty that reflects the complexity of a hypothesis.

    g_n = argmin_{h ∈ H} [ R_n(h) + pen(h) ]    (3)
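A toy sketch of Eq. 3: among candidate hypotheses, pick the one minimizing empirical risk plus a complexity penalty. The penalty here (lam times a complexity score attached to each hypothesis) is a hypothetical stand-in for pen(h):

```python
def srm_select(hypotheses, data, lam=0.1):
    # hypotheses is a list of (complexity, h) pairs
    def objective(item):
        complexity, h = item
        emp = sum(1 for x, y in data if h(x) != y) / len(data)
        return emp + lam * complexity      # R_n(h) + pen(h)
    return min(hypotheses, key=objective)

data = [(0, 0), (1, 0), (2, 1), (3, 1)]
hypotheses = [
    (1, lambda x: 1 if x >= 2 else 0),        # simple threshold rule
    (3, lambda x: 1 if x in (2, 3) else 0),   # memorizes the positives
]
complexity, best = srm_select(hypotheses, data)
```

Both candidates fit the training data perfectly, so plain empirical risk minimization cannot tell them apart; the penalty term breaks the tie in favour of the simpler rule, which also generalizes to unseen inputs such as x = 5.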

Page 44: Taal- en spraaktechnologie · Word Sense Disambiguation (WSD) Machine learning and languages Learning by a machine can be used to 1 model morphological, syntactic, semantic and pragmatic

Covered so farToday

Machine learning: what is it?Evaluation measuresWord Sense Disambiguation (WSD)

Main learning notions (10)

Bias and variance

If h∗ is the best function in H with R(h∗) = inf_{h ∈ H} R(h), then the difference |R(h∗) − R∗| is called the approximation error, or bias.

A quantity that measures how far a hypothesis h in H is from the best hypothesis h∗ is referred to as the estimation error, or variance (|R(h∗) − Rn(h)|).

Bias does not depend on the data used during the training phase, whereas variance always does.

Variance is equal to zero if the predictions of a method do not change and are always the same regardless of the training data.

Bias is equal to zero if a classifier outputs the optimal prediction.


Main learning notions (12)

Consider, for instance, binary classification, where each example has to be classified either as positive or as negative.

Positive examples on which the method errs are referred to as false negatives (FN), and negative examples which it misclassifies are called false positives (FP).

Those examples that are classified correctly are either true positives (TP) or true negatives (TN).


Main learning notions (13)

Accuracy is defined as the fraction of all examples that were classified correctly (Eq. 4). Accuracy is often used when the data set is balanced (i.e., the numbers of positive and negative examples are roughly the same).

Acc = (TP + TN) / (TP + TN + FP + FN)    (4)

Precision reflects how many of the examples that were classified as positive really are true positives (Eq. 5).

precision = TP / (TP + FP)    (5)


Main learning notions (14)

Recall shows what fraction of the true positives were found by the method (Eq. 6).

recall = TP / (TP + FN)    (6)

The F1 score is defined as the harmonic mean of precision and recall (Eq. 7).

F1 = 2 · precision · recall / (precision + recall)    (7)
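All four measures follow directly from the confusion-matrix counts; a minimal sketch in Python (the counts are toy numbers for illustration):

```python
# Compute accuracy, precision, recall, and F1 from confusion-matrix counts.
def evaluate(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)          # Eq. 4
    precision = tp / (tp + fp)                          # Eq. 5
    recall = tp / (tp + fn)                             # Eq. 6
    f1 = 2 * precision * recall / (precision + recall)  # Eq. 7
    return accuracy, precision, recall, f1

# Hypothetical counts: 8 TP, 5 TN, 2 FP, 1 FN.
acc, p, r, f1 = evaluate(tp=8, tn=5, fp=2, fn=1)
```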


Word Sense Disambiguation (WSD)


WSD

We already discussed polysemy and homonymy last time. Consider, for instance, how many senses bank has in WordNet 3.0 (http://wordnetweb.princeton.edu/perl/webwn).


WSD

So what is WordNet (Miller et al., 1990)?

A wide-coverage computational lexicon of English which exploits psycholinguistic theories.

Concepts are expressed as sets of synonyms (synsets): { bank7_n, cant2_n, camber2_n }

A word sense is a word occurring in a synset, e.g. bank7_n is the seventh sense of the noun bank.

There are also semantic relations between synsets (e.g., hypernymy, meronymy, entailment), and lexical relations between word senses (e.g., antonymy, nominalization).


WSD

Sentence: Utrecht University has concentrated its leading research into fifteen research focus areas.

Number of WordNet senses per word:
Utrecht (1) × University (3) × has (19) × concentrated (8) × its (1) × leading (4) × research (2) × into (1) × fifteen (1) × research (2) × focus (7) × areas (6)

= 306,432 interpretations!

Note that I already assumed the correct PoS tags here! Utrecht has only 1 sense, and is therefore monosemous, while focus is polysemous.
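The number of readings is simply the product of the per-word sense counts, which is easy to check:

```python
from math import prod

# WordNet sense counts for each token of the example sentence, as on the slide.
sense_counts = [1, 3, 19, 8, 1, 4, 2, 1, 1, 2, 7, 6]
interpretations = prod(sense_counts)
print(interpretations)  # 306432
```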


WSD references

WSD is the task of automatically finding out which sense of a word is activated by its use in a particular context.

Navigli R. Word Sense Disambiguation: A Survey. ACM Computing Surveys, 41(2), ACM Press, 2009, pp. 1-69.

Agirre E. and Edmonds P. Word Sense Disambiguation: Algorithms and Applications. New York, USA: Springer, 2006.

Ide N. and Véronis J. Word Sense Disambiguation: The State of the Art. Computational Linguistics, 24(1), 1998, pp. 1-40.


WSD approaches

WSD has typically been seen as a supervised problem: classification given a fixed number of senses.

Grouping words having the same sense together in an unsupervised way (clustering) is called word sense discrimination.

WSD is important for many NLP applications (e.g., machine translation).

It has been a popular topic for decades: have a look at Senseval-1 (1998) up to SemEval (2010)!

http://www.senseval.org/


Senseval

Senseval has introduced the following tasks:

lexical sample: only a selected number of words are tagged according to their senses. E.g., in Senseval-1, these were 35 words of different PoS, such as accident, bother, bitter.

all-words: all content (open-class) words in a text have to be annotated ⇒ more realistic, but also more difficult.

lexical substitution: find an alternative substitute word or phrase for a target word in context (McCarthy and Navigli, 2007), whereby both synonyms need to be found and the context needs to be disambiguated.

cross-lingual disambiguation: disambiguate a target word by labeling it with the appropriate translation in other languages (Lefever and Hoste, 2009).


Senseval

Even though primarily for English, Senseval has expanded its language list to Basque, Chinese, Czech, Danish, Dutch, English, Estonian, Italian, Japanese, Korean, Spanish, Swedish.

Example of lexical substitution

Input: "The packed screening of about 100 high-level press people loved the film as well"
Output: synonyms for the target film: movie (5); picture (3)

Example of cross-lingual disambiguation

Input: "I'll buy a train or coach ticket"
Output: translations of the target coach in other languages
DE: Bus (3); Linienbus (2); Omnibus (2); Reisebus (2)
NL: autobus (3); bus (3); busvervoer (1); toerbus (1)


WSD

So what is a word sense?

R. Navigli: a word sense is a commonly accepted meaning of a word:

We are fond of fruit such as kiwifruit and banana.

The kiwi bird is the national bird of New Zealand.

1 But is the number of senses per word really fixed?

2 What about the boundaries between senses: are they rigid?


WSD

So why is it difficult? Consider the distribution of senses (source: MacCartney; Navigli).


WSD: Baselines

take the most frequent sense (MFS) in the corpus (or the first WordNet sense)

yields around 50-60% accuracy on the lexical sample task with WordNet senses

is a strong baseline (why?)
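A sketch of the MFS baseline on a toy sense-annotated sample (the sense labels and counts below are invented for illustration):

```python
from collections import Counter

# Hypothetical sense-annotated training occurrences of 'bank'.
train_senses = ["bank_1"] * 6 + ["bank_2"] * 3 + ["bank_3"] * 1

# The baseline ignores the context entirely and always predicts the
# sense that was most frequent in the training corpus.
mfs = Counter(train_senses).most_common(1)[0][0]

def predict(context):
    return mfs  # same answer for every test occurrence

print(predict("she sat on the river bank"))  # bank_1
```

Because sense distributions are highly skewed (see the previous slide), always guessing the dominant sense is hard to beat.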


WSD approaches

WSD approaches we consider today/next time:

Supervised (Gale et al., 1992)

Dictionary-based (Lesk, 1986; simplified Lesk)

Minimally supervised (Yarowsky, 1995)

Unsupervised (Mihalcea, 2009; Ponzetto and Navigli, 2010)

Our own work on using qualia structures for word sense induction (2008)


WSD approaches: data

Supervised WSD needs training data! Then the steps are as follows:

extract features from the training/test set

train an ML method on the training set

apply the model to the test data

Sense-annotated corpora for the all-words task:

SemCor: 200K words from the Brown corpus with WordNet senses

SENSEVAL 3: 2081 tagged content words


WSD approaches: data

SemCor 3.0

<wf cmd=ignore pos=DT>The</wf>
<wf cmd=done rdf=group pos=NNP lemma=group wnsn=1 lexsn=1:03:00:: pn=group>Fulton County Grand Jury</wf>
<wf cmd=done pos=VB lemma=say wnsn=1 lexsn=2:32:00::>said</wf>
<wf cmd=done pos=NN lemma=friday wnsn=1 lexsn=1:28:00::>Friday</wf>
<wf cmd=ignore pos=DT>an</wf>
<wf cmd=done pos=NN lemma=investigation wnsn=1 lexsn=1:09:00::>investigation</wf>
<wf cmd=ignore pos=IN>of</wf>


Supervised WSD

We have already talked about the Naive Bayes approach. How can we use it for WSD?

we aim at selecting the most probable sense s of a given word w, described by a set of features f1 . . . fn: argmax_{s ∈ S} P(s|f)

s = argmax_{si ∈ S} P(si|w) = argmax_{si ∈ S} P(w|si)P(si) / P(w) = argmax_{si ∈ S} P(w|si)P(si)

we also naively assume all features to be independent:

s = argmax_{si ∈ S} P(si) ∏_{j=1}^{n} P(fj|si)


Supervised WSD

we have to calculate P(si):

P(si) = freq(si, w) / freq(w)

we have to calculate P(fj|si):

P(fj|si) = freq(fj, si) / freq(si)

don't forget smoothing!
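A minimal Naive Bayes WSD sketch with add-one (Laplace) smoothing. The training contexts for bank and the choice of bag-of-context-words features are invented for illustration:

```python
import math
from collections import Counter

# Toy sense-annotated contexts for 'bank' (hypothetical data).
train = [
    ("river water bank shore", "bank_river"),
    ("bank loan money interest", "bank_finance"),
    ("money deposit bank account", "bank_finance"),
]

prior = Counter(s for _, s in train)          # counts for P(si)
word_counts = {s: Counter() for s in prior}   # counts for P(fj|si)
for ctx, s in train:
    word_counts[s].update(ctx.split())
vocab = {w for ctx, _ in train for w in ctx.split()}

def classify(context):
    best, best_lp = None, float("-inf")
    for s in prior:
        lp = math.log(prior[s] / len(train))  # log P(si)
        total = sum(word_counts[s].values())
        for f in context.split():
            # add-one smoothing: unseen features do not zero out the product
            lp += math.log((word_counts[s][f] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = s, lp
    return best
```

For example, classify("the bank of the river") picks bank_river, while classify("loan from the bank") picks bank_finance.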


Supervised WSD

Naive Bayes was used by Gale, Church, and Yarowsky (1992), who

disambiguated 6 words (duty, drug, land, language, position, sentence) with 2 senses each

used contexts of varying size

achieved around 90% accuracy

concluded that wide contexts are useful, as well as non-immediately surrounding words


Dictionary-based WSD

Introduced in 1986 by Lesk; it uses the following steps:

Retrieve all sense definitions of the target word from a machine-readable dictionary

Compare them with the sense definitions of the words in context

Choose the sense with the most overlap


Dictionary-based WSD

Example (MacCartney)

pine
(a) a kind of evergreen tree with needle-shaped leaves
(b) to waste away through sorrow or illness

cone
(a) a solid body which narrows to a point
(b) something of this shape, whether solid or hollow
(c) fruit of certain evergreen trees

A simplified version of Lesk's method (Kilgarriff and Rosenzweig, 2000) measures the overlap between each sense definition and the words in the context, rather than between pairs of sense definitions.
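A sketch of the simplified Lesk variant on the cone example above; the tokenization and the small stop-word list are simplifying assumptions:

```python
# Simplified Lesk: pick the sense whose definition shares the most
# words with the context of the target word.
cone_senses = {
    "a": "a solid body which narrows to a point",
    "b": "something of this shape, whether solid or hollow",
    "c": "fruit of certain evergreen trees",
}
STOP = {"a", "of", "the", "to", "which", "this", "whether", "something", "from"}

def tokens(text):
    return {w.strip(",.").lower() for w in text.split()} - STOP

def simplified_lesk(senses, context):
    # choose the sense definition with the largest word overlap with the context
    return max(senses, key=lambda s: len(tokens(senses[s]) & tokens(context)))

print(simplified_lesk(cone_senses, "pine cones hung from the evergreen trees"))  # c
```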


Minimally supervised WSD

Introduced in 1995 by Yarowsky; based on two assumptions:

one sense per discourse: the sense of a target word tends to be preserved consistently within a single discourse (e.g., a document about finance)

one sense per collocation: the sense of a target word can be determined by looking at the words nearby (e.g., river for bank1_n and finance for bank2_n, respectively)

Paper: http://acl.ldc.upenn.edu/P/P95/P95-1026.pdf


Minimally supervised WSD

It's a bootstrapping method:

1 start from a small seed set of manually annotated data Dl

2 learn a decision-list classifier from Dl

3 use the learned classifier to label the unlabeled data Du

4 move high-confidence examples to Dl

5 repeat from step 2
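The loop above can be sketched on toy data. The sentences, seed labels, and the crude confidence rule (all matching context words must agree on one sense) are invented stand-ins for the real decision-list machinery:

```python
from collections import defaultdict

# Seed set Dl: a few manually labeled contexts of 'plant' (toy data).
labeled = {
    "plant life is fragile": "A",
    "the manufacturing plant closed": "B",
}
unlabeled = [  # Du
    "microscopic plant life was found",
    "the manufacturing plant hired employees",
]

def learn_rules(labeled):
    # crude decision-list stand-in: keep context words seen with exactly one sense
    votes = defaultdict(set)
    for sent, sense in labeled.items():
        for w in sent.split():
            votes[w].add(sense)
    return {w: next(iter(s)) for w, s in votes.items() if len(s) == 1}

for _ in range(3):  # steps 2-5: learn, label Du, move confident examples to Dl
    rules = learn_rules(labeled)
    for sent in list(unlabeled):
        evidence = {rules[w] for w in sent.split() if w in rules}
        if len(evidence) == 1:          # all matching rules agree -> high confidence
            labeled[sent] = evidence.pop()
            unlabeled.remove(sent)
```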


Minimally supervised WSD

Source: Yarowsky (1995).


Minimally supervised WSD

Decision list: a sequence of "if/else if/else" rules

If f1, then class 1
Else if f2, then class 2
. . .
Else class n

Collocational features are identified from the tagged data:

Word immediately to the left or right of the target:
The window bars3_n were broken.

Pair of words to the immediate left or right of the target:
The world's largest bar1_n is here in New York.


Minimally supervised WSD

For all collocational features the log-likelihood ratio is computed, and they are ordered according to it:

log [ P(sensei | fj) / P(sensek | fj) ]    (8)

What does the log-likelihood ratio really mean?
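Concretely, it measures how strongly a feature favors one sense over the other: far above zero favors sensei, far below zero favors sensek, near zero is uninformative. A sketch with invented counts and a small smoothing constant to keep the log finite:

```python
import math

# How often the feature occurs with each sense (hypothetical counts).
count = {("A", "life"): 80, ("B", "life"): 2}

def log_likelihood_ratio(f, s1="A", s2="B", alpha=0.1):
    # For a fixed feature f, the ratio of P(sense|f) terms reduces to a
    # ratio of counts; alpha-smoothing avoids log(0) for unseen pairs.
    n1 = count.get((s1, f), 0) + alpha
    n2 = count.get((s2, f), 0) + alpha
    return math.log(n1 / n2)

print(log_likelihood_ratio("life"))  # strongly positive -> 'life' favors sense A
```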


Minimally supervised WSD

Quote from Yarowsky (1995):

"New data are classified by using the single most predictive piece of disambiguating evidence that appears in the target context. By not combining probabilities, this decision-list approach avoids the problematic complex modeling of statistical dependencies encountered in other frameworks."


Minimally supervised WSD

Initial decision list for plant (abbreviated), source: Yarowsky (1995)

LogL  Collocation                          Sense
8.10  plant life                           A
7.58  manufacturing plant                  B
7.39  life (within ±2-10 words)            A
7.20  manufacturing (within ±2-10 words)   B
6.27  animal (within ±2-10 words)          A
4.70  equipment (within ±2-10 words)       B
4.39  employee (within ±2-10 words)        B
4.30  assembly plant                       B
4.10  plant closure                        B
3.52  plant species                        A
3.48  automate (within ±2-10 words)        B
3.45  microscopic plant                    A
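Applying such a list is a single pass down the ranked rules: the first matching collocation decides the sense. A sketch using a few rules from the table; the naive substring match stands in for the real window-based feature tests:

```python
# Abbreviated decision list for 'plant' (LogL, collocation, sense), sorted by LogL.
rules = [
    (8.10, "plant life", "A"),
    (7.58, "manufacturing plant", "B"),
    (7.39, "life", "A"),
    (7.20, "manufacturing", "B"),
]

def classify(context):
    for logl, collocation, sense in rules:
        if collocation in context:     # naive substring check for illustration
            return sense
    return None                        # no rule fired

print(classify("automated manufacturing plant in Fremont"))   # B
print(classify("divide life into plant and animal kingdoms")) # A
```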


Concrete noun categorization task


1 Lexical representation/categorization in cognitive science

a lexical concept is represented by a set of features (Rapp & Caramazza, 1991; Gonnerman et al., 1997)

lexical concepts are atomic representations, and "conceptual relations . . . can be captured by the sets of inferential relations drawn from elementary and complex concepts" (Almeida, 1999): the thesis of conceptual atomism (Fodor, 1990)

2 Categorization in computational linguistics

word-space models (Sahlgren, 2006; Lenci, Baroni, and others)


Data

44 concrete nouns to be categorized into:

2 categories (natural kind and artifact)
3 categories (vegetable, animal, and artifact)
6 categories (green, fruitTree, bird, groundAnimal, vehicle, and tool)


Generative Lexicon Theory

Pustejovsky (1998) proposed a linguistically motivated approach to modelling categories. Semantic descriptions use 4 levels of linguistic representation:

argument structure ("specification of the number and type of logical arguments")

event structure ("definition of the event type of an expression")

qualia structure ("a structural differentiation of the predicative force for a lexical item")

lexical inheritance structure ("identification of how a lexical structure is related to other structures in the type lattice")


Approach (cont’d)

How can we acquire qualia information? Some of the methods proposed in the past:

Hearst, 1992 (hyperonymy)

Girju, 2007 (part-whole relations)

Cimiano and Wenderoth, 2007: predefined patterns for all 4 roles; ranking of the results according to some measures

Yamada et al., 2007: fully supervised; focuses on the acquisition of telic information


Approach (cont’d)

We make use of the patterns defined by Cimiano andWenderoth, 2007

role    pattern
formal  x NN is VBZ (a DT|the DT) kind NN of IN
        x NN is VBZ
        x NN and CC other JJ
        x NN or CC other JJ
telic   purpose NN of IN (a DT)* x NN is VBZ
        purpose NN of IN p NNP is VBZ
        (a DT|the DT)* x NN is VBZ used VVN to TO
        p NNP are VBP used VVN to TO

Table: Patterns: some examples
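As an illustration of how such surface patterns can be applied to PoS-tagged text, here is a minimal sketch (not the authors' code): it encodes a single telic pattern, `(a DT|the DT)* x NN is VBZ used VVN to TO`, as a regular expression over tokens written in an assumed `word_TAG` format, and extracts the seed noun together with the candidate filler verb.

```python
import re

# One telic pattern as a regex over "word_TAG" tokens (assumed format).
# The noun before "is used to" is the seed; the verb after "to" is the
# candidate telic filler.
TELIC = re.compile(
    r'(?:(?:a|the)_DT\s+)?(\w+)_NN\s+is_VBZ\s+used_VV[ND]\s+to_TO\s+(\w+)_VV'
)

def extract_telic(tagged_snippet):
    """Return (seed, filler) pairs, e.g. ('hammer', 'hit')."""
    return TELIC.findall(tagged_snippet)
```

For example, the tagged snippet `a_DT hammer_NN is_VBZ used_VVN to_TO hit_VV nails_NNS` yields the pair (hammer, hit); a full system would need one such expression per pattern and role.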


Approach (cont’d)

role          pattern
constitutive  (a DT|the DT)* x NN is VBZ made VVN (up RP)* of IN
              (a DT|the DT)* x NN comprises VVZ
              (a DT|the DT)* x NN consists VVZ of IN
              p NNP are VBP made VVN (up RP)* of IN
              p NNP comprise VVP
agentive      to TO * a DT new JJ x NN
              to TO * a DT complete JJ x NN
              to TO * new JJ p NNP
              to TO * complete JJ p NNP
              a DT new JJ x NN has VHZ been VBN
              a DT complete JJ x NN has VHZ been VBN

Table: Patterns: some examples


Approach (cont’d)

The categorization procedure consists of the following steps:

- extraction of the passages containing candidates for the role fillers using the patterns (Google, 50 snippets per pattern)
- PoS tagging of all passages
- extraction of the candidate role fillers from the tagged passages using the patterns
- building a word-space model, where rows correspond to the words provided by the organizers of the challenge and columns are the qualia elements for a selected role (clustering with the CLUTO toolkit)
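The word-space step can be sketched as follows; this is an illustrative reconstruction (not the original pipeline), in which each row vector simply counts how often a qualia filler was extracted for a word, and the resulting rows are what a clustering toolkit such as CLUTO would then group.

```python
from collections import Counter

# Illustrative sketch of the word-space model: rows are target words,
# columns are the qualia fillers extracted for one role, and cell
# (w, f) counts how often filler f was extracted for word w.
def word_space(extractions):
    """extractions: iterable of (word, filler) pairs."""
    counts = Counter(extractions)
    words = sorted({w for w, _ in counts})
    fillers = sorted({f for _, f in counts})
    matrix = [[counts[(w, f)] for f in fillers] for w in words]
    return words, fillers, matrix
```

In this layout, two words that share many fillers (e.g. chisel and knife, both "tools") end up with similar row vectors, which is what lets the clustering step group them.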


Evaluation

We use two evaluation measures (Zhao and Karypis, 2004):

Entropy:

$$p_{ij} = \frac{m_{ij}}{m_j}, \qquad H(c_j) = -\sum_{i=1}^{L} p_{ij} \log p_{ij} \qquad (9)$$

$$H(C) = \sum_{j=1}^{K} \frac{m_j}{m} H(c_j) \qquad (10)$$

Purity:

$$Pu(c_j) = \max_{i=1,\dots,L} p_{ij}, \qquad Pu(C) = \sum_{j=1}^{K} \frac{m_j}{m} Pu(c_j) \qquad (11)$$

where $C = \{c_1, \dots, c_K\}$ is the output clustering, $L$ is the number of classes ("gold" senses), $m_{ij}$ is the number of words of class $i$ in cluster $j$ (a class is a gold cluster), $m_j$ is the number of words in cluster $j$, and $m$ is the overall number of words to cluster.
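Entropy and purity as defined above can be computed directly from the cluster assignments. A minimal sketch (the function name is illustrative; the natural logarithm is assumed, which only rescales entropy):

```python
import math
from collections import Counter

def entropy_purity(gold, clusters):
    """Clustering entropy H(C) and purity Pu(C) as in Zhao & Karypis
    (2004). `gold` and `clusters` are parallel lists: gold[i] is the
    gold class of word i, clusters[i] the cluster it was assigned to."""
    m = len(gold)
    by_cluster = {}
    for g, c in zip(gold, clusters):
        by_cluster.setdefault(c, Counter())[g] += 1
    H, Pu = 0.0, 0.0
    for counts in by_cluster.values():
        mj = sum(counts.values())
        # per-cluster entropy H(c_j) and purity Pu(c_j), weighted by m_j/m
        Hj = -sum((mij / mj) * math.log(mij / mj) for mij in counts.values())
        Puj = max(counts.values()) / mj
        H += (mj / m) * Hj
        Pu += (mj / m) * Puj
    return H, Pu
```

A perfect clustering gives H(C) = 0 and Pu(C) = 1; lower entropy and higher purity both indicate clusters that align better with the gold classes.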


Evaluation

clustering   entropy   purity
2-way        0.59      0.80
3-way        0.00      1.00
6-way        0.13      0.89
2-way>1      0.70      0.77
3-way>1      0.14      0.96
6-way>1      0.23      0.82

Table: Performance using formal role only


What are the most representative elements in the clusters?

The similarity between elements in a cluster is measured as follows:

$$z_j^I = \frac{s_j^I - \mu_l^I}{\delta_l^I} \qquad (12)$$

where $s_j^I$ stands for the average similarity between object $j$ and the other objects in the same cluster, $\mu_l^I$ is the average of the $s_j^I$ values over all objects in the $l$-th cluster, and $\delta_l^I$ is the standard deviation of these similarities.
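Equation (12) can be sketched as follows, assuming a precomputed pairwise similarity matrix (the function name and data layout are illustrative):

```python
import statistics

def internal_zscores(sim, members):
    """Internal z-scores (Eq. 12): for each object j in a cluster,
    z_j = (s_j - mu) / delta, where s_j is the average similarity of j
    to the other members, and mu and delta are the mean and (population)
    standard deviation of those averages over the cluster.
    `sim` is a dict-of-dicts similarity matrix; clusters of size >= 2
    are assumed."""
    avg = {
        j: sum(sim[j][k] for k in members if k != j) / (len(members) - 1)
        for j in members
    }
    mu = statistics.mean(avg.values())
    delta = statistics.pstdev(avg.values())
    return {j: (avg[j] - mu) / delta if delta > 0 else 0.0 for j in avg}
```

Objects with the lowest internal z-scores are flagged as outliers in their cluster, while those with the highest form the cluster core.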


What are the most representative elements in the clusters?

- the core of the cluster representing tools is formed by chisel, followed by knife and scissors, as they have the largest internal z-scores. The same cluster wrongly contains rocket, but according to the internal z-score it is an outlier (it has the lowest z-score in the cluster)
- bowl, cup, bottle and kettle all have the lowest internal z-scores in the cluster of vehicles; the core of that cluster is formed by truck and motorcycle


Descriptive and discriminative features: 3-way clustering

Cl    Features
VEG   fruit (41.3%), vegetables (28.3%), crop (14.6%), food (3.4%), plant (2.5%)
ANI   animal (43.3%), bird (23.0%), story (6.6%), pet (3.5%), waterfowl (2.4%)
ART   tool (31.0%), vehicle (15.3%), weapon (5.4%), instrument (4.4%), container (3.9%)
VEG   fruit (21.0%), vegetables (14.3%), animal (11.6%), crop (7.4%), tool (2.5%)
ANI   animal (22.1%), bird (11.7%), tool (10.1%), fruit (7.4%), vegetables (5.1%)
ART   tool (15.8%), animal (14.8%), bird (7.9%), vehicle (7.8%), fruit (6.8%)


Results: telic role

seed        extractions
helicopter  to rescue
rocket      to propel
chisel      to cut, to chop, to clean
hammer      to hit
kettle      to boil, to prepare
bowl        to serve
pencil      to draw, to create
spoon       to serve
bottle      to store, to pack

Table: Some extractions for the telic role


Results: constitutive role

seed        extractions
helicopter  a section, a body
rocket      a section, a part, a body
motorcycle  a frame, a part, a structure
truck       a frame, a segment, a program, a compartment
telephone   a tranceiver, a handset, a station
kettle      a pool, a cylinder
bowl        a corpus, a piece
pen         an ink, a component
spoon       a surface, a part
chisel      a blade
hammer      a handle, a head
bottle      a container, a component, a wall, a segment, a piece

Table: Some extractions for the constitutive role


Results per role

role          clustering   entropy   purity   comments
formal        6-way        0.13      0.89     all 44 words
agentive      6-way        0.54      0.61     43 words
constitutive  6-way        0.51      0.61     28 words

Table: Performance using one role only


Results: formal and agentive roles combined

Figure: A combination of the formal and the agentive roles


The best performance

The best results are obtained by combining the formal role with the agentive one.

clustering   entropy   purity
2-way        0.59      0.80
3-way        0.00      1.00
6-way        0.09      0.91

Table: Performance using formal and agentive roles

Interestingly, the worst performance on 2-way clustering is achieved by combining the formal and constitutive roles (entropy of 0.92, purity of 0.66).


Error analysis

1 Errors due to the extraction procedure:
  - incorrect PoS tagging / sentence boundary detection
  - patterns do not always provide correct extractions/features ("chicken and other stories")

2 Ambiguous words ("in fact, scottish gardens are starting to see many more butterflies including peacocks")

3 Features that do not suffice to discriminate among all categories


Error analysis (cont’d)

1 6-way clustering always fails to discriminate well between tools and vehicles. Containers (a bowl, a kettle, a cup, a bottle) are always placed in the cluster of vehicles (instead of tools). This is the only type of error for 6-way clustering.

2 In 2-way clustering, vegetables are usually not considered natural objects.


Conclusions

1 the formal role is already sufficient for the identification of vegetables, animals and artifacts (perfect clustering)

2 a combination of the formal and agentive roles provides the best performance on 6-way clustering (in line with Pustejovsky, 2001)

3 no combination of roles accounts well for natural objects and artifacts


To summarize

Today, we have looked at:

- machine learning problems
- WSD methods
