Mapping Between Taxonomies

Elena Eneva

11 Dec 2001

Advanced IR Seminar

Mapping Between Taxonomies

Taxonomies: formal systems of orderly classification of knowledge, which are designed for a specific purpose

Companies organizing information in various ways (e.g. one for marketing, another for product development)

Approach

[Diagram, repeated over several animation frames: the same documents (shown as "abc") organized under two taxonomies, by country (German, French) and by industry (Textile, Automobile).]

Datasets

Two classification schemes:

Reuters 2001 (807900 docs): Topics (127), Industry categories (871), Regions (376)

Hoovers-255 and Hoovers-28 (4286 docs): industry categories (28), industry categories (255)

Learning

2 separate methods of learning for the documents:

Old doc category -> new doc category

Doc contents -> new category

Combined method:

Weighted average based on confidence

Final result determined by a decision tree

One combined learner: used both old category and contents as features
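The first method (old doc category -> new doc category) can be illustrated with a minimal sketch. It assumes the learner simply remembers, for each old category, the new category seen most often with it in training. The function names and the toy training pairs are hypothetical and are not taken from the talk's data or tools.

```python
from collections import Counter, defaultdict

def fit_category_map(pairs):
    """Learn old-category -> most frequent new-category from (old, new) pairs."""
    counts = defaultdict(Counter)
    for old, new in pairs:
        counts[old][new] += 1
    # For each old category, keep the new category that co-occurred most often.
    return {old: c.most_common(1)[0][0] for old, c in counts.items()}

def predict_category(mapping, old, default=None):
    """Look up the new category; fall back to `default` for unseen old labels."""
    return mapping.get(old, default)

# Hypothetical (old label, new label) training pairs, for illustration only.
train = [("German", "Automobile"), ("German", "Automobile"),
         ("German", "Textile"), ("French", "Textile")]
mapping = fit_category_map(train)
```

A learner of this kind knows nothing about the words in a document, so it can only be as accurate as the overlap between the two taxonomies allows.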

Simple Learners

Simple Decision Tree (C4.5): learns probabilities of new categories based on 1 kind of feature:

Old categories (doesn’t know about documents/words)

Word-based classification (doesn’t know about old categories)

Naïve Bayes (rainbow):

Old categories (doesn’t know about documents/words)

Word-based classification (doesn’t know about old categories)

Support Vector Machine (SVM-Light):

Word-based classification (doesn’t know about old categories), linear kernel [results will be reported in the final paper]
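For the word-based case, a minimal sketch of a multinomial Naïve Bayes classifier with add-one smoothing may make the idea concrete. This is not the rainbow implementation used in the talk. The function names, the smoothing choice and the toy documents are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (list_of_words, category). Returns the counts NB needs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, cat in docs:
        class_counts[cat] += 1
        word_counts[cat].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def predict_nb(model, words):
    """Return the category with the highest log posterior for the given words."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_cat, best_score = None, -math.inf
    for cat, n_docs in class_counts.items():
        total_words = sum(word_counts[cat].values())
        score = math.log(n_docs / total_docs)  # log prior
        for w in words:
            if w in vocab:  # words never seen in training are skipped
                # Add-one (Laplace) smoothing, an assumption of this sketch.
                score += math.log((word_counts[cat][w] + 1)
                                  / (total_words + len(vocab)))
        if score > best_score:
            best_cat, best_score = cat, score
    return best_cat

# Hypothetical toy documents, for illustration only.
docs = [(["engine", "car"], "Automobile"),
        (["car", "wheel"], "Automobile"),
        (["cotton", "fabric"], "Textile")]
model = train_nb(docs)
```

Calling `predict_nb(model, ["cotton"])` on this toy data selects "Textile", even though "Automobile" has the larger prior.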

Learning

[Diagram: two separate learners (DT, NB, SVM), one using the document content and one using the document labels.]

Combined Learners

Weighted Average: voting scheme

Combination Decision Tree: takes the outputs and confidences of two of the simple learners, predicts new category
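The talk does not give the exact formula behind the weighted-average vote. One plausible reading is sketched below: each simple learner contributes its predicted category weighted by its confidence, and the category with the largest total wins. The function name and the example confidences are hypothetical.

```python
from collections import defaultdict

def weighted_vote(predictions):
    """predictions: list of (category, confidence) pairs, one per learner.

    Returns the category with the largest summed confidence.
    """
    totals = defaultdict(float)
    for category, confidence in predictions:
        totals[category] += confidence
    return max(totals, key=totals.get)

# Hypothetical outputs of a label-based and a word-based learner.
winner = weighted_vote([("Textile", 0.9), ("Automobile", 0.6)])
```

With only two learners, this scheme reduces to trusting whichever learner is more confident. The Combination Decision Tree instead learns when to trust each one.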

Learning

[Diagram: one DT using both the content and the label, and, separately, the outputs of the content-based and label-based learners (DT, NB, SVM) being combined by voting or by a 3rd classifier.]
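For the single learner that uses both the content and the label, one way to present both kinds of evidence to a classifier is to place word counts and an old-category indicator in the same feature dictionary. This is a sketch under that assumption. The feature naming scheme is hypothetical and is not the encoding used in the experiments.

```python
from collections import Counter

def combined_features(words, old_category):
    """Merge bag-of-words counts with a one-hot indicator for the old category."""
    features = Counter(f"word={w}" for w in words)
    # The prefix keeps the label feature from colliding with any word feature.
    features[f"old_category={old_category}"] = 1
    return dict(features)

# Hypothetical document, for illustration only.
example = combined_features(["car", "car", "engine"], "German")
```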

Results: Words Only

5-fold cross validation

[Bar chart "Words Only": % accuracy on 28p255 and 255p28 for words only NB and words only DT.]
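The 5-fold cross validation used for these results can be sketched as follows. The talk does not say how the folds were drawn, so the round-robin split here is an assumption, as are the function names. Any pair of `fit` and `predict` functions can be plugged in.

```python
def k_fold_indices(n_items, k=5):
    """Yield (train_indices, test_indices) for each of k folds (needs n_items >= k)."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

def cross_val_accuracy(items, labels, fit, predict, k=5):
    """Average % accuracy over k folds for a given fit/predict pair."""
    scores = []
    for train, test in k_fold_indices(len(items), k):
        model = fit([items[j] for j in train], [labels[j] for j in train])
        correct = sum(predict(model, items[j]) == labels[j] for j in test)
        scores.append(100.0 * correct / len(test))
    return sum(scores) / len(scores)

# Hypothetical baseline: always predict the most frequent training label.
def fit_majority(train_items, train_labels):
    return max(set(train_labels), key=train_labels.count)

def predict_majority(model, item):
    return model
```

Each document is used for testing exactly once, and the reported figure is the mean over the five held-out folds.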

Results: Categories Only

5-fold cross validation

[Bar chart "Categories Only": % accuracy on 28p255 and 255p28 for categs only NB and categs only DT.]

Results: Combination

5-fold cross validation

[Bar chart "Combination": % accuracy on 28p255 and 255p28 for Combination Vote and Combination Comb.]

Results (% accuracy)

words only     NB       DT
28p255         21.14    7.9
255p28         53.2     17.5

categs only    NB       DT
28p255         26.19    26.19
255p28         100      100

Combination    Vote     Comb
28p255         28.05    30.26
255p28         100      100

Remarks

Hierarchy (old classes) usually ignored

Shown that it helps

Learners are not the issue

Better way of understanding

Old label (or hierarchy path) is meta data

Remaining Work

SVM results (running even as we speak)

Repeat experiments on Reuters-2001: internal hierarchies, missing labels, less correlated types of classes

Results in standard evaluation format

Future Work

Try with a web dataset (Google and Yahoo! hierarchies)

Hierarchies of more levels

Meta data (for non-text sources)

Related Literature

A Study of Approaches to Hypertext Categorization, Y. Yang, S. Slattery, R. Ghani, Journal of Intelligent Information Systems, Volume 18, Number 2, March 2002 (to appear).

Learning Mappings between Data Schemas, A. Doan, P. Domingos, and A. Levy. Proceedings of the AAAI-2000 Workshop on Learning Statistical Models from Relational Data, 2000, Austin, TX.

Questions and Suggestions

The end.

DT accuracy vs Vocabulary size

[Line chart: train accuracy and test accuracy (% accuracy) against vocabulary size (10, 100, 500, 1000, 2000).]

Taxonomies

Formal systems of orderly classification of knowledge, which are designed for a specific purpose

Change of purpose, change of taxonomies

Businesses often need and keep the information in several structures

Important to be able to automatically map between taxonomies

Useful Mappings

Companies organizing information in various ways (e.g. one for marketing, another for product development)

Personal online bookmark classification

Search engines (e.g. Google <-> Yahoo)

EU Committee for Standardization: “detailed overview of the existing taxonomies officially used in the EU, in order to derive general concepts such as: information organisation, properties, multilinguality, keywords, etc. and, last but not least, the mapping between.”