Page 1: TÍTULO GENÉRICO Concept Indexing for Automated Text Categorization Enrique Puertas Sanz epuertas@uem.es Universidad Europea de Madrid

TÍTULO GENÉRICO

Concept Indexing for Automated Text Categorization

Enrique Puertas Sanz

epuertas@uem.es

Universidad Europea de Madrid

OUTLINE

• Motivation
• Concept indexing with WordNet synsets
• Concept indexing in ATC
• Experiments set-up
• Summary of results & discussion
• Updated results
• Conclusions & current work

MOTIVATION

• Most popular & effective model for thematic ATC
• IR-like text representation
• ML feature selection, learning classifiers

[Figure: pre-classified documents and categories go through representation & learning to produce the classifier(s); new documents go through representation to become new document instances, which classification turns into categorized new documents]

MOTIVATION

• Bag of Words

• Binary

• TF

• TF*IDF

• Stoplist

• Stemming

• Feature Selection
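The weighting options listed above (binary, TF, TF*IDF, with a stop list) can be illustrated with a minimal Python sketch. The toy documents, the stop list and the function names are illustrative assumptions, not material from the talk; IDF is taken here as log(N/df), which is only one of several common variants, and stemming is left out.

```python
import math
from collections import Counter

# Toy corpus and stop list: illustrative assumptions, not data from the talk.
DOCS = [
    "the car is in the garage",
    "the wagon and the car",
    "a train wagon",
]
STOPLIST = {"the", "is", "in", "and", "a"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop-listed words."""
    return [w for w in text.lower().split() if w not in STOPLIST]


def bag_of_words(docs):
    """Return binary, TF and TF*IDF vectors (as dicts) for each document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        binary = {t: 1 for t in tf}
        # IDF taken as log(N / df); other variants exist.
        tfidf = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        vectors.append({"binary": binary, "tf": dict(tf), "tfidf": tfidf})
    return vectors


vectors = bag_of_words(DOCS)
```

In this toy corpus "garage" occurs in a single document, so it receives a higher TF*IDF weight than "car", which occurs in two.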

MOTIVATION

• Text representation requirements in thematic ATC
• Semantic characterization of text content
• Words convey an important part of the meaning
• But we must deal with polysemy and synonymy
• Must allow effective learning
• Thousands to tens of thousands of attributes lead to noise (lower effectiveness) & lack of efficiency

CONCEPT INDEXING WITH WORDNET SYNSETS

• Using vectors of synsets instead of word stems
• Ambiguous words mapped to correct senses
• Synonyms mapped to same synsets

[Figure: the words "automobile", "car" and "wagon" marked in running text and linked to WordNet synsets]

N036030448 {automobile, car, wagon}

N206726781 {train wagon, wagon}
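As a rough illustration of indexing by synsets rather than words, the sketch below counts synset codes for a sense-tagged document. The two synset codes and their members are copied from the slide; the lookup table, the `(word, synset_code)` input format and the function name `synset_vector` are assumptions made for this example. The code does no disambiguation itself: it assumes the sense tags are already available, as in a hand-tagged corpus.

```python
from collections import Counter

# Synset codes and members copied from the slide; the lookup structure and
# the sense-tagged input format are illustrative assumptions.
SYNSETS = {
    "N036030448": {"automobile", "car", "wagon"},
    "N206726781": {"train wagon", "wagon"},
}


def synset_vector(tagged_tokens):
    """Count synset codes for (word, synset_code) pairs.

    The synset code is assumed to come from a disambiguated corpus or
    from a WSD step; this function does no disambiguation itself.
    """
    counts = Counter()
    for word, code in tagged_tokens:
        if word not in SYNSETS.get(code, set()):
            raise ValueError(f"{word!r} is not a member of synset {code}")
        counts[code] += 1
    return counts


# Hypothetical sense-tagged document.
doc = [
    ("automobile", "N036030448"),
    ("car", "N036030448"),
    ("wagon", "N206726781"),
]
vector = synset_vector(doc)
```

The synonyms "automobile" and "car" collapse into one feature, while "wagon" is counted under whichever sense it was tagged with.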

CONCEPT INDEXING WITH WORDNET SYNSETS

• Considerable controversy in IR
• Assumed potential for improving text representation
• Mixed experimental results, ranging from very good [Gonzalo et al. 98] to bad [Voorhees 98]
• Recent review in [Stokoe et al. 03]
• A problem of state-of-the-art WSD effectiveness
• But ATC is different!

CONCEPT INDEXING IN ATC

• Apart from the potential...
• We have much more information about ATC categories than about IR queries
• The lack of WSD effectiveness can be less harmful because of term (feature) selection
• But we have new problems!
• Data sparseness & noise
• Most terms are rare (Zipf's Law), which gives bad estimates
• Categories with few documents give bad estimates and lack information

CONCEPT INDEXING IN ATC

• Concept indexing helps to solve IR & new ATC problems
• Text ambiguity in IR & ATC
• Data sparseness & noise in ATC
• Fewer indexing units of higher quality (selection), so probably better estimates
• Categories with few documents: why not enrich the representation with WordNet semantic relations?
• Hyperonymy, meronymy, etc.
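One possible reading of "enriching the representation with semantic relations" is to add each concept's hypernyms as extra features, so that documents mentioning different but related concepts share some indexing units. The sketch below is only an assumption about how that could look: the concept labels, the `HYPERNYM` table and the function `enrich_with_hypernyms` are all hypothetical, and a real system would read the relations from WordNet.

```python
from collections import Counter

# Hypothetical hypernym links between made-up concept labels; a real system
# would read these relations from WordNet.
HYPERNYM = {
    "car": "motor_vehicle",
    "truck": "motor_vehicle",
    "motor_vehicle": "vehicle",
}


def enrich_with_hypernyms(concept_counts, max_levels=1):
    """Add each concept's ancestors (up to max_levels) with the same count."""
    enriched = Counter(concept_counts)
    for concept, count in concept_counts.items():
        current = concept
        for _ in range(max_levels):
            current = HYPERNYM.get(current)
            if current is None:
                break
            enriched[current] += count
    return enriched


doc_a = enrich_with_hypernyms(Counter({"car": 2}))
doc_b = enrich_with_hypernyms(Counter({"truck": 1}))
# Features the two documents have in common after enrichment.
shared = set(doc_a) & set(doc_b)
```

Before enrichment the two toy documents share no features; afterwards both contain the parent concept, which may help when a category has very few training documents.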

CONCEPT INDEXING IN ATC

• Literature review
• As in IR, mixed results, ranging from good [Fukumoto & Suzuki, 01] to bad [Scott, 98]
• Notably, researchers use words in synsets instead of the synset codes themselves
• Still lacking: concept indexing evaluation in ATC over a representative range of selection strategies and learning algorithms

EXPERIMENTS SETUP

• Primary goal
• Comparing terms vs. correct synsets as indexing units
• Requires a perfectly disambiguated collection (SemCor)
• Secondary goals
• Comparing perfect WSD with simple methods
• More scalability, less accuracy
• Comparing terms with/without stemming and stop-listing
• Nature of SemCor (genre + topic classification)

EXPERIMENTS SETUP

• Overview of parameters
• Binary classifiers vs. multi-class classifiers
• Three concept indexing representations
• Correct WSD (CD)
• WSD by POS Tagger (CF)
• WSD by corpus frequency (CA)

EXPERIMENTS SETUP

• Overview of parameters
• Four term indexing representations
• No Stemming, No Stoplist (BNN)
• No Stemming, with Stoplist (BNS)
• With Stemming, without Stoplist (BSN)
• With Stemming and Stoplist (BSS)

EXPERIMENTS SETUP

• Levels of selection with Information Gain (IG)
• No selection (NOS)
• Top 1% (S01)
• Top 10% (S10)
• IG>0 (S00)
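The selection levels above can be sketched as follows, assuming IG is computed for the binary feature "term occurs in the document" against the class labels. The toy documents, labels and function names are illustrative assumptions; the slides do not describe the exact implementation used in the experiments.

```python
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)


def information_gain(docs, labels, term):
    """IG of the binary feature 'term occurs in the document'."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (len(with_term) / n) * entropy(with_term) \
        + (len(without) / n) * entropy(without)
    return entropy(labels) - conditional


def select_top(docs, labels, fraction):
    """Keep the top `fraction` of terms ranked by IG (at least one term)."""
    vocab = sorted({t for d in docs for t in d})
    ranked = sorted(vocab, key=lambda t: information_gain(docs, labels, t),
                    reverse=True)
    keep = max(1, round(fraction * len(vocab)))
    return ranked[:keep]


def select_positive(docs, labels):
    """Keep every term whose IG is strictly positive (the IG>0 level)."""
    vocab = sorted({t for d in docs for t in d})
    return [t for t in vocab if information_gain(docs, labels, t) > 0]


# Toy data: each document is a set of terms; labels are hypothetical.
DOCS = [{"goal", "match"}, {"goal", "team"}, {"vote", "team"}, {"vote", "law"}]
LABELS = ["sport", "sport", "politics", "politics"]

top_terms = select_top(DOCS, LABELS, 0.4)
positive_terms = select_positive(DOCS, LABELS)
```

In the toy data "team" appears equally often in both classes, so its IG is zero and the IG>0 level drops it, while "goal" and "vote" separate the classes perfectly and rank first.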

EXPERIMENTS SETUP

• Learning algorithms
• Naïve Bayes
• kNN
• C4.5
• PART
• SVMs
• AdaBoost+Naïve Bayes
• AdaBoost+C4.5

EXPERIMENTS SETUP

• Evaluation metrics
• F1 (harmonic mean of recall and precision)
• Macroaverage
• Microaverage
• K-fold cross validation (k=10 in our experiments)
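For reference, F1 is 2PR / (P + R), where P is precision and R is recall. The sketch below contrasts macro- and micro-averaging of F1 over categories. The per-category counts are hypothetical and serve only to show how the two averages can differ.

```python
def f1(tp, fp, fn):
    """F1 = harmonic mean of precision and recall; 0.0 when undefined."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(counts):
    """Average the per-category F1 scores (each category weighs the same)."""
    return sum(f1(*c) for c in counts.values()) / len(counts)


def micro_f1(counts):
    """Pool tp/fp/fn over all categories, then compute a single F1."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1(tp, fp, fn)


# Hypothetical (tp, fp, fn) counts per category, for illustration only.
COUNTS = {"frequent": (90, 10, 10), "rare": (1, 3, 3)}
macro = macro_f1(COUNTS)
micro = micro_f1(COUNTS)
```

With these counts the macroaverage is pulled down by the poorly classified rare category, whereas the microaverage is dominated by the frequent one, which is why reporting both is informative when category sizes are skewed.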

SUMMARY OF RESULTS & DISCUSSION

• Overview of results

[Figure: overview of results, one panel for binary classification and one for multi-class classification]

SUMMARY OF RESULTS & DISCUSSION

• CD > C* weakly supports that accurate WSD is required
• BNN > B* does not support that stemming & stop-listing are NOT required
• Genre/topic orientation
• Most importantly, CD > B* does not support that synsets are better indexing units than words (stemmed & stop-listed or not)

UPDATED RESULTS

• Recent results combining synsets & words (no stemming, no stop-listing, binary problem)

• NB S00, C4.5 S00, S01, S10

• SVM S01, ABNB S00, S00, S10

CONCLUSIONS & CURRENT WORK

• Synsets are NOT a better representation, but IMPROVE the bag-of-words representation

• We are testing semantic relations (hyperonymy) on SemCor

• More work is required on Reuters-21578
• We will have to address WSD, initially with the approaches described in this work

THANK YOU !