a resource and tool for super sense tagging of italian texts lrec 2010, malta – 19-21/05/2010...

A resource and tool for Super Sense Tagging of Italian Texts

LREC 2010, Malta – 19-21/05/2010

Giuseppe Attardi*, Alessandro Lenci*+

Stefano Dei Rossi*, Simonetta Montemagni+

Giulia Di Pietro*, Maria Simi*

* Università di Pisa

+ ILC - CNR, Pisa

Summary

Why Super Sense tagging

Preliminary results

Improving an existing resource

Building a new resource

A new tagger for the task

Discussion on the results

Future work

Semantic tagging

Named Entity Recognition (NER)
  Simple ontologies: person, organization, location, …
  Limited semantic/syntactic coverage
  High accuracy

Word Sense Disambiguation
  Identifying WordNet senses
  Tens of thousands of specific “word senses”
  All open class words covered, domain-independent
  Inadequate performance

Super Senses

Super Senses
  Introduced by Ciaramita and Altun (2006)

WordNet super senses
  Noun and verb synsets mapped to 41 general semantic classes (lexicographic categories)
  26 noun categories; 15 verb categories

Example: “Clara Harris[person], one of the guests[person] in the box[artifact], stood up[motion] and demanded[communication] water[substance]”

Super Sense Tagging

For English (Ciaramita and Altun, 2006)
  Training on SemCor (Senseval-3)
  Discriminative HMM, trained with an averaged perceptron algorithm
  Average F-Score on 41 categories: 77.18

For Italian (Picca, Gliozzo, Ciaramita, 2008)
  Trained on MultiSemCor (Bentivoglio et al.)
  Average F-Score on 41 categories: 62.90

Improving MultiSemCor

Problems
  Smaller size (64% of English corpus)
  Incomplete alignment (sense in Eng., no sense in Ita.)
  PoS coarseness
  Word by word translation

Strategy
  Retagging, adding morphology

Results
  Average F-Score: 64.95 (same algorithm; 45 categories)

Further work

Our requirements
  Integration of an SST tagger in the TANL pipeline
  Useful model for annotating realistic Italian texts

Two directions for improvement
  A brand new resource for SST
  A new algorithm for SST, based on Maximum Entropy

Building the new resource

ISST - Italian Syntactic-Semantic Treebank
  305,547 tokens
  81,236 content words annotated at the lexico-semantic level, including IWN (ItalWordNet) senses

ILI* mapping from IWN to WN senses

* Inter-Lingual Index

[Diagram: ISST Corpus <Lemma, sense> → IWN → ILI → WordNet → Supersense]

From Italian senses to English super senses

Starting from sense Si:

1. If Si is in the ILI, return the corresponding Se

2. If not, look for the first hyperonym of Si that is in the ILI and return the corresponding Se

In both cases return the super sense of Se in WN
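The two-step lookup above can be sketched in Python as follows. This is only an illustration: the function name `map_to_supersense`, the three dictionaries and the toy identifiers are assumptions made for the example, not the actual data structures or identifiers of ItalWordNet, the ILI or WordNet.

```python
from typing import Dict, Optional, Tuple


def map_to_supersense(
    sense: str,
    ili: Dict[str, str],
    hypernym: Dict[str, str],
    wn_supersense: Dict[str, str],
) -> Optional[Tuple[str, bool]]:
    """Return (supersense, came_from_hypernym), or None if nothing is found."""
    current: Optional[str] = sense
    via_hypernym = False
    visited = set()  # guards against cycles in the toy hierarchy
    while current is not None and current not in visited:
        visited.add(current)
        english = ili.get(current)
        # Entries whose English synset has no super sense in the toy table are skipped.
        if english is not None and english in wn_supersense:
            return wn_supersense[english], via_hypernym
        # Step 2: climb to the hypernym and try again.
        current = hypernym.get(current)
        via_hypernym = True
    return None


# Toy data (hypothetical identifiers and labels, for illustration only).
ILI = {"it:B": "en:B", "it:C": "en:C"}
HYPERNYM = {"it:A": "it:B"}  # it:A has no ILI entry, its hypernym it:B does
WN_SUPERSENSE = {"en:B": "noun.state", "en:C": "noun.event"}

print(map_to_supersense("it:C", ILI, HYPERNYM, WN_SUPERSENSE))
print(map_to_supersense("it:A", ILI, HYPERNYM, WN_SUPERSENSE))
print(map_to_supersense("it:Z", ILI, HYPERNYM, WN_SUPERSENSE))
```

The boolean in the result separates senses mapped directly through the ILI from those mapped through a hypernym. The sketch keeps climbing until it finds a mapped ancestor, which is one possible reading of "first hyperonym in the ILI".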

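[Figure: worked example of the mapping, from the ISST Corpus through ItalWordNet and the ILI to WordNet and the Supersense]

ISST Corpus, Token and <Lemma, Sense>:
  L'          <lo, 0>
  atmosfera   <atmosfera, 4>
  di          <di, 0>
  festa       <festa, 1>

ItalWordNet, Synset and ILI:
  n#24931     n #08770969
  n#16564     –
  n#12484     #04692559 (labelled Hyperonym)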

ISST-SST after mapping

Tokens with super-sense (direct ILI, ILI from hyp), tokens with ambiguous super-sense, tokens without super-sense:

            direct ILI   ILI from hyp   ambiguous   without
noun            43,908          1,741       3,492    38,266
verb            10,088             60       1,351    29,260
adjective        3,219          1,519         118    16,492
adverb               0              0           0    13,812
Total           57,215          3,320       4,961    97,830

Revision

Mapping of adverbs in adv.all (~ 10,000)

Listing of possible super senses
  Alternatives for nouns: 2-6
  Alternatives for verbs: 3-10

An ad-hoc tool for revision

Difficulties
  Aspectual verbs: “continuare a …” (to continue to …), “stare per …” (to be about to …)
  Support verbs: “prestare attenzione a …” (to pay attention to …), “dar una mano …” (to lend a hand …)

ISST-SST after revision

            Tokens with super sense   Tokens without super sense
noun                         69,360                       11,545
verb                         27,667                        7,075
adjective                    17,478                        4,649
adverb                       12,232                        1,596
Total                       126,737                       24,865
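As a quick reading aid, the share of tokens that carry a super sense after revision can be derived from these counts. The coverage ratio is not reported on the slides; it is a derived quantity computed here under the assumption that the two columns partition the tokens of each part of speech.

```python
# Counts copied from the table: (with super sense, without super sense).
COUNTS = {
    "noun": (69_360, 11_545),
    "verb": (27_667, 7_075),
    "adjective": (17_478, 4_649),
    "adverb": (12_232, 1_596),
}


def coverage(with_ss: int, without_ss: int) -> float:
    """Fraction of tokens that carry a super sense."""
    return with_ss / (with_ss + without_ss)


total_with = sum(w for w, _ in COUNTS.values())
total_without = sum(wo for _, wo in COUNTS.values())

for pos, (w, wo) in COUNTS.items():
    print(f"{pos:<10} {coverage(w, wo):.1%}")
print(f"{'Total':<10} {coverage(total_with, total_without):.1%}")
```

The column sums reproduce the Total row (126,737 and 24,865), which gives an overall coverage of about 83.6%.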

Super Sense Tagger

Adapting a generic chunker, part of the Tanl pipeline

Maximum Entropy classifier
  Effective for chunking since it does not assume independence of features
  Dynamic programming to select the sequences of tags with higher probability

The tagger is flexible and customizable for different tasks
  Specialization of class FeatureExtractor
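The dynamic programming step can be illustrated with a standard Viterbi search over per-token scores. This is a generic sketch and not the Tanl implementation: the `log_prob` callback stands in for the Maximum Entropy classifier, and the names `best_tag_sequence`, `LOCAL` and `TRANSITION` and all the numbers are made up for the example.

```python
import math
from typing import Callable, List, Optional, Sequence


def best_tag_sequence(
    n_tokens: int,
    tags: Sequence[str],
    log_prob: Callable[[int, Optional[str], str], float],
) -> List[str]:
    """Viterbi search for the tag sequence with the highest total log-score.

    log_prob(i, prev_tag, tag) stands in for the classifier's score of `tag`
    at token i given the previous tag; prev_tag is None at position 0.
    """
    if n_tokens == 0:
        return []
    # best[i][t] = (score of the best sequence ending in t at position i, backpointer)
    best = [{t: (log_prob(0, None, t), None) for t in tags}]
    for i in range(1, n_tokens):
        column = {}
        for t in tags:
            prev = max(tags, key=lambda p: best[i - 1][p][0] + log_prob(i, p, t))
            column[t] = (best[i - 1][prev][0] + log_prob(i, prev, t), prev)
        best.append(column)
    # Follow the backpointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for i in range(n_tokens - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    path.reverse()
    return path


# Toy scores for a two-token input (hypothetical numbers).
TAGS = ["noun.person", "verb.motion"]
LOCAL = [
    {"noun.person": 0.9, "verb.motion": 0.1},
    {"noun.person": 0.4, "verb.motion": 0.6},
]
TRANSITION = {
    ("noun.person", "noun.person"): 0.3,
    ("noun.person", "verb.motion"): 0.7,
    ("verb.motion", "noun.person"): 0.5,
    ("verb.motion", "verb.motion"): 0.5,
}


def toy_log_prob(i: int, prev_tag: Optional[str], tag: str) -> float:
    score = math.log(LOCAL[i][tag])
    if prev_tag is not None:
        score += math.log(TRANSITION[(prev_tag, tag)])
    return score


print(best_tag_sequence(2, TAGS, toy_log_prob))
```

Scores are summed in log space, so the search cost grows linearly with the number of tokens and quadratically with the number of tags.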

Features

No external resources, no first sense heuristics

Local features
  Token attribute features (attribute, then token offsets)
    POSTAG -2 -1 0 1 2
    CPOSTAG -1 0
  Form features (attribute, pattern, then token offsets)
    FORM ^\p{Lu} -1 +1
    FORM ^\p{Lu}*$ 0

Global features
  Whether a word in the document was previously annotated with a given tag
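The feature templates listed above could be realised roughly as follows. This is a hypothetical Python analogue, not the Tanl `FeatureExtractor` class: the function names, the feature string format and the POS tags of the toy sentence are assumptions for illustration. Python's `re` module does not support `\p{Lu}`, so the two patterns are emulated with `unicodedata`.

```python
import unicodedata
from typing import Dict, List, Set

Token = Dict[str, str]  # keys assumed here: FORM, POSTAG, CPOSTAG


def starts_with_upper(form: str) -> bool:
    # Mirrors ^\p{Lu}: the first character is in Unicode category Lu.
    return bool(form) and unicodedata.category(form[0]) == "Lu"


def is_all_upper(form: str) -> bool:
    # Mirrors ^\p{Lu}*$: every character is in category Lu
    # (like the pattern, this also accepts the empty string).
    return all(unicodedata.category(c) == "Lu" for c in form)


def local_features(tokens: List[Token], i: int) -> List[str]:
    def at(offset: int):
        j = i + offset
        return tokens[j] if 0 <= j < len(tokens) else None

    feats = []
    for off in (-2, -1, 0, 1, 2):  # POSTAG window
        tok = at(off)
        if tok is not None:
            feats.append(f"POSTAG[{off}]={tok['POSTAG']}")
    for off in (-1, 0):  # CPOSTAG window
        tok = at(off)
        if tok is not None:
            feats.append(f"CPOSTAG[{off}]={tok['CPOSTAG']}")
    for off in (-1, 1):  # capitalised neighbours
        tok = at(off)
        if tok is not None and starts_with_upper(tok["FORM"]):
            feats.append(f"FORM[{off}]:initial-upper")
    if is_all_upper(tokens[i]["FORM"]):  # current token entirely uppercase
        feats.append("FORM[0]:all-upper")
    return feats


def global_features(form: str, seen_tags: Dict[str, Set[str]]) -> List[str]:
    # seen_tags maps a word form to the tags it received earlier in the document.
    return [f"PREV_TAG={tag}" for tag in sorted(seen_tags.get(form, ()))]


# Toy sentence; the POS tags are illustrative placeholders.
SENTENCE = [
    {"FORM": "L'", "POSTAG": "RD", "CPOSTAG": "R"},
    {"FORM": "atmosfera", "POSTAG": "S", "CPOSTAG": "S"},
    {"FORM": "di", "POSTAG": "E", "CPOSTAG": "E"},
    {"FORM": "festa", "POSTAG": "S", "CPOSTAG": "S"},
]

print(local_features(SENTENCE, 1))
print(global_features("festa", {"festa": {"noun.event"}}))
```

Offsets that fall outside the sentence simply produce no feature, which is one of several reasonable conventions for sentence boundaries.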

Detailed results

Results for Italian

Improvement for Italian due to:
  the new corpus
  the different algorithm and the tuning of features

                         Precision   Recall      F1
Italian, Picca et al.        62.26    63.57   62.90
Italian, our                 79.92    78.30   79.10
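The F1 column can be cross-checked against the precision and recall columns, assuming it is the usual harmonic mean of the two (the slides do not spell out the formula). Because the inputs are already rounded to two decimals, the recomputed value may differ from the reported one in the last digit.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as given in the table.
REPORTED = {
    "Italian, Picca et al.": (62.26, 63.57, 62.90),
    "Italian, our": (79.92, 78.30, 79.10),
}

for system, (p, r, reported) in REPORTED.items():
    print(f"{system}: recomputed F1 = {f1(p, r):.2f}, reported = {reported:.2f}")
```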

Analysis of improvement

Improvement due to new corpus
  MultiSemCor vs ISST-SST, ME tagger
  About +4.5 on the F1 score

Improvement due to new algorithm and features
  Ciaramita-Altun tagger vs ME tagger, on ISST-SST
  About +10 on the F1 score

Conclusions

Significant improvement in accuracy for SS tagging

The tagger has been used to annotate the Italian Wikipedia

Examples of queries made possible on the semantic index
  Who feels emotions? (the subject of a verb.emotion)
  What did Edison invent/create/discover …? (Edison as the subject of a verb.creation)

Completion of the ISST-SST resource can further improve accuracy
