A Resource and Tool for Super Sense Tagging of Italian Texts
LREC 2010, Malta – 19-21/05/2010
Giuseppe Attardi*, Alessandro Lenci*+, Stefano Dei Rossi*, Simonetta Montemagni+, Giulia Di Pietro*, Maria Simi*
* Università di Pisa
+ ILC - CNR, Pisa
Summary
Why Super Sense tagging
Preliminary results
Improving an existing resource
Building a new resource
A new tagger for the task
Discussion on the results
Future work
Semantic tagging
Named Entity Recognition (NER)
  Simple ontologies: person, organization, location, …
  Limited semantic/syntactic coverage
  High accuracy
Word Sense Disambiguation
  Identifying WordNet senses
  Tens of thousands of specific “word senses”
  All open-class words covered, domain-independent
  Inadequate performance
Super Senses
Introduced by Ciaramita and Altun (2006)
WordNet super senses
  Noun and verb synsets mapped to 41 general semantic classes (lexicographic categories)
  26 noun categories; 15 verb categories
Example: “Clara Harris[person], one of the guests[person] in the box[artifact], stood up[motion] and demanded[communication] water[substance]”
Super Sense Tagging
For English (Ciaramita and Altun, 2006)
  Training on SemCor (Senseval-3)
  Discriminative HMM, trained with an averaged perceptron algorithm
  Average F-score on 41 categories: 77.18
For Italian (Picca, Gliozzo, Ciaramita, 2008)
  Trained on MultiSemCor (Bentivoglio et al.)
  Average F-score on 41 categories: 62.90
Improving MultiSemCor
Problems
  Smaller size (64% of the English corpus)
  Incomplete alignment (sense in English, no sense in Italian)
  PoS coarseness
  Word-by-word translation
Strategy
  Retagging, adding morphology
Results
  Average F-score: 64.95 (same algorithm; 45 categories)
Further work
Our requirements
  Integration of an SST tagger in the TANL pipeline
  Useful model for annotating realistic Italian texts
Two directions for improvement
  A brand new resource for SST
  A new algorithm for SST, based on Maximum Entropy
Building the new resource
ISST - Italian Syntactic-Semantic Treebank
  305,547 tokens
  81,236 content words annotated at the lexico-semantic level, including IWN (ItalWordNet) senses
ILI* mapping from IWN to WN senses
* Inter-Lingual Index
[Diagram: ISST Corpus <Lemma, sense> → IWN → ILI → WordNet → Supersense]
From Italian senses to English super senses
Starting from sense Si:
1. If Si is in the ILI, return the corresponding Se
2. If not, look for the first hyperonym in the ILI and return the corresponding Se
In both cases return the super sense of Se in WN
[Diagram: example of the mapping from the ISST Corpus through ItalWordNet and the ILI to a WordNet Supersense]
ISST Corpus
  Token      <Lemma, Sense>
  L’         <lo, 0>
  atmosfera  <atmosfera, 4>
  di         <di, 0>
  festa      <festa, 1>
ItalWordNet
  Synset     ILI
  n#24931    n#08770969
  n#16564    –
  n#12484    #04692559   (hyperonym)
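The two-step lookup described above (direct ILI link first, otherwise the first hyperonym that has one) can be sketched in Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name, the three dictionaries, the toy identifiers (`it#1`, `en#10`, ...) and the assumption that each synset has at most one hyperonym are all hypothetical.

```python
from typing import Dict, Optional


def supersense_for(sense: str,
                   ili: Dict[str, str],
                   hyperonym: Dict[str, str],
                   wn_supersense: Dict[str, str]) -> Optional[str]:
    """Return the WordNet super sense reachable from an Italian sense, or None.

    ili:           Italian synset id -> English (WordNet) synset id
    hyperonym:     Italian synset id -> id of its hyperonym (assumed unique)
    wn_supersense: English synset id -> super sense label
    """
    current: Optional[str] = sense
    visited = set()  # guards against cycles in the toy hierarchy
    while current is not None and current not in visited:
        visited.add(current)
        if current in ili:  # step 1, or step 2 once we have climbed
            return wn_supersense.get(ili[current])
        current = hyperonym.get(current)  # climb one level
    return None  # no ancestor is linked through the ILI


# Toy data (hypothetical identifiers and labels).
ili = {"it#1": "en#10"}
hyperonym = {"it#2": "it#1", "it#3": "it#2"}
wn_supersense = {"en#10": "noun.event"}

direct = supersense_for("it#1", ili, hyperonym, wn_supersense)
inherited = supersense_for("it#3", ili, hyperonym, wn_supersense)
unmapped = supersense_for("it#9", ili, hyperonym, wn_supersense)
print(direct, inherited, unmapped)
```

Senses for which the lookup returns None correspond to the tokens left without a super sense, which is what the manual revision step later addresses.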
ISST-SST after mapping
            Tokens with super-sense      Tokens with ambiguous   Tokens without
            direct ILI    ILI from hyp   super-sense             super-sense
noun        43,908        1,741          3,492                   38,266
verb        10,088        60             1,351                   29,260
adjective   3,219         1,519          118                     16,492
adverb      0             0              0                       13,812
Total       57,215        3,320          4,961                   97,830
Revision
Mapping of adverbs in adv.all (~10,000)
Listing of possible super senses
  Alternatives for nouns: 2-6
  Alternatives for verbs: 3-10
An ad-hoc tool for revision
Difficulties
  Aspectual verbs: “continuare a …” (to continue to), “stare per …” (to be about to)
  Support verbs: “prestare attenzione a …” (to pay attention to), “dar una mano …” (to give a hand)
ISST-SST after revision
            Tokens with super sense   Tokens without super sense
noun        69,360                    11,545
verb        27,667                    7,075
adjective   17,478                    4,649
adverb      12,232                    1,596
Total       126,737                   24,865
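As a quick reading aid, the share of tokens that carry a super sense after revision can be derived from the counts in the table above. The snippet below is a small sketch; the variable and function names are illustrative, and only the counts come from the slide.

```python
# (tokens with super sense, tokens without super sense), from the table above
counts = {
    "noun": (69360, 11545),
    "verb": (27667, 7075),
    "adjective": (17478, 4649),
    "adverb": (12232, 1596),
}


def coverage(with_ss: int, without_ss: int) -> float:
    """Fraction of tokens that received a super sense."""
    return with_ss / (with_ss + without_ss)


total_with = sum(w for w, _ in counts.values())
total_without = sum(wo for _, wo in counts.values())

for pos, (w, wo) in counts.items():
    print(f"{pos:10s} {coverage(w, wo):.1%}")
print(f"{'Total':10s} {coverage(total_with, total_without):.1%}")
```

The per-category sums reproduce the Total row of the table (126,737 and 24,865).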
Super Sense Tagger
Adapting a generic chunker, part of the Tanl pipeline
Maximum Entropy classifier
  Effective for chunking since it does not assume independence of features
  Dynamic programming to select sequences of tags with higher probability
The tagger is flexible and customizable for different tasks
  Specialization of class FeatureExtractor
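The dynamic programming step can be illustrated with a Viterbi-style search over per-token tag probabilities. This is a generic sketch, not the Tanl code: it assumes a hypothetical callback `log_prob(i, prev_tag, tag)` that returns the classifier's log-probability of `tag` at position `i` given the previous tag (with `prev_tag` set to None at the first token).

```python
import math
from typing import Callable, List, Optional, Sequence


def best_tag_sequence(n_tokens: int,
                      tags: Sequence[str],
                      log_prob: Callable[[int, Optional[str], str], float]
                      ) -> List[str]:
    """Return the tag sequence with the highest total log-probability."""
    if n_tokens == 0:
        return []
    # best[i][t] = (score of best path ending in tag t at position i, previous tag)
    best = [{t: (log_prob(0, None, t), None) for t in tags}]
    for i in range(1, n_tokens):
        column = {}
        for t in tags:
            column[t] = max((best[i - 1][p][0] + log_prob(i, p, t), p)
                            for p in tags)
        best.append(column)
    # Follow the back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(n_tokens - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    path.reverse()
    return path


# Toy model with two tags: a greedy choice would start with "A" (0.6),
# but the best full sequence is B B (0.4 * 0.9 = 0.36).
def toy_log_prob(i: int, prev: Optional[str], tag: str) -> float:
    if i == 0:
        return math.log({"A": 0.6, "B": 0.4}[tag])
    table = {"A": {"A": 0.5, "B": 0.5}, "B": {"A": 0.1, "B": 0.9}}
    return math.log(table[prev][tag])


decoded = best_tag_sequence(2, ["A", "B"], toy_log_prob)
print(decoded)
```

The toy example shows why sequence-level search matters: picking the locally most probable tag at each token would not find the most probable sequence.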
Features
No external resources, no first sense heuristics
Local features
  Token attribute features
    POSTAG   -2 -1 0 1 2
    CPOSTAG  -1 0
  Form features
    FORM ^\p{Lu}     -1 +1
    FORM ^\p{Lu}*$   0
Global features
  Whether a word in the document was previously annotated with a given tag
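The feature templates above can be read as "attribute, pattern, token offsets". The following Python sketch shows one way such templates might be turned into feature strings. It is an assumption-laden illustration rather than the Tanl FeatureExtractor: the function names, the feature string format, the token dictionaries and the POS tag values are all hypothetical. Because Python's standard `re` module has no `\p{Lu}` class, uppercase letters are detected with `unicodedata` instead.

```python
import unicodedata
from typing import Dict, List, Optional, Set


def is_upper(ch: str) -> bool:
    """True if ch is an uppercase letter (Unicode category Lu)."""
    return unicodedata.category(ch) == "Lu"


def extract_features(tokens: List[Dict[str, str]], i: int,
                     previous_tags: Optional[Dict[str, Set[str]]] = None
                     ) -> List[str]:
    """Build feature strings for token i from the templates on the slide."""
    feats = []
    n = len(tokens)
    for off in (-2, -1, 0, 1, 2):          # POSTAG -2 -1 0 1 2
        if 0 <= i + off < n:
            feats.append(f"POSTAG[{off}]={tokens[i + off]['POSTAG']}")
    for off in (-1, 0):                    # CPOSTAG -1 0
        if 0 <= i + off < n:
            feats.append(f"CPOSTAG[{off}]={tokens[i + off]['CPOSTAG']}")
    for off in (-1, 1):                    # FORM ^\p{Lu} at -1 and +1
        if 0 <= i + off < n:
            form = tokens[i + off]["FORM"]
            if form and is_upper(form[0]):
                feats.append(f"FORM_CAP[{off}]")
    if all(is_upper(c) for c in tokens[i]["FORM"]):   # FORM ^\p{Lu}*$ at 0
        feats.append("FORM_ALLCAPS[0]")
    # Global feature: tags already assigned to this word form in the document.
    if previous_tags:
        for tag in sorted(previous_tags.get(tokens[i]["FORM"], ())):
            feats.append(f"PREV_TAG={tag}")
    return feats


# Hypothetical tokens and tag values, for illustration only.
tokens = [
    {"FORM": "Edison", "POSTAG": "SP", "CPOSTAG": "S"},
    {"FORM": "invented", "POSTAG": "V", "CPOSTAG": "V"},
    {"FORM": "NASA", "POSTAG": "SP", "CPOSTAG": "S"},
]
middle = extract_features(tokens, 1)
last = extract_features(tokens, 2, previous_tags={"NASA": {"noun.group"}})
print(middle)
print(last)
```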
Detailed results
Results for Italian
Improvement for Italian due to:
  the new corpus
  the different algorithm and the tuning of features
                         Precision   Recall   F1
Italian (Picca et al.)   62.26       63.57    62.90
Italian (ours)           79.92       78.30    79.10
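The F1 column is the harmonic mean of precision and recall, F1 = 2PR / (P + R). A short check of the figures reported above, with an illustrative function name:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


picca = f1_score(62.26, 63.57)
ours = f1_score(79.92, 78.30)
print(f"Picca et al.: {picca:.2f}  ours: {ours:.2f}")
```

Both values agree with the reported F1 scores up to rounding of the published precision and recall.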
Analysis of improvement
Improvement due to new corpus
  MultiSemCor vs ISST-SST, ME tagger
  about +4.5 on the F1 score
Improvement due to new algorithm and features
  Ciaramita-Altun tagger vs ME tagger, on ISST-SST
  about +10 on the F1 score
Conclusions
Significant improvement in accuracy for SS tagging
The tagger has been used to annotate the Italian Wikipedia
Examples of queries made possible on the semantic index
  Who feels emotions? (the subject of a verb.emotion)
  What did Edison invent/create/discover …? (Edison as the subject of a verb.creation)
Completion of the ISST-SST resource can further improve accuracy