Post on 03-Jul-2015

ws.nju.edu.cn

An Empirical Study of Vocabulary Relatedness and Its Application to Recommender Systems

Gong Cheng, Saisai Gong, Yuzhong Qu

State Key Laboratory for Novel Software Technology, Nanjing University, China

gcheng@nju.edu.cn

Presented at ISWC2011

Gong Cheng (程龚) gcheng@nju.edu.cn 2 of 36

Vocabulary matching

Measuring term similarity

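[Figure: term matching between two vocabularies. Terms on one side: FullProfessor, FacultyMember, AssistantProfessor. Terms on the other side: Professor, Faculty, AssistantProfessor. Similarity values shown: 0.9, 0.8, 1.0]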

Vocabulary matching

Vocabulary distance

Measuring vocabulary similarity

[Figure: pairwise similarity between five vocabularies: Semantic Web for Research Communities (SWRC), eBiquity Person, Foundational Model of Anatomy (FMA), GALEN, and NCBI organismal classification (NCBITaxon). Similarity values shown: 0.8, 0.5, 0.5, 0.6, 0.02]

Vocabulary matching

Vocabulary distance

Vocabulary relatedness

Measuring vocabulary relatedness

[Figure: two vocabularies. Terms of the first: FullProfessor, FacultyMember, AssistantProfessor. Terms of the second: PhD, Postgraduate-Research-Degree, EngD]

not that similar, but somewhat related

Contributions

How to measure vocabulary relatedness?

6 measures, from 4 aspects

How about vocabulary relatedness in real-life cases?

Empirical analysis of 2,996 vocabularies and 4 billion other RDF triples

Where to apply vocabulary relatedness?

Post-selection vocabulary recommendation in vocabulary search

Outline

Data set

Vocabulary relatedness

Post-selection vocabulary recommendation

Conclusions

Data set statistics

Crawled from February 2010 to May 2011 by

Data set distributions

RDF documents over pay-level domains

Data set distributions

Vocabularies over top-level domains

Vocabulary relatedness

6 numerical measures, from 4 aspects

Semantic relatedness

Explicit

Implicit

Hybrid

Content similarity

Expressivity closeness

Distributional relatedness

Comparison

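Measure 1: explicit semantic relatedness

[Figure: example vocabularies v1, v2, v3 linked by explicit relations (owl:imports, rdfs:seeAlso, owl:priorVersion) and the resulting graph G_E, with edge labels 1 and 2]

$$R_{S_E}(v_i, v_j) = \frac{1}{\text{weight of a shortest path between } v_i \text{ and } v_j \text{ in } G_E}$$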
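
The path-based measures (explicit, implicit, hybrid) share one shape: the reciprocal of the weight of a shortest path in a vocabulary graph. The following is a minimal sketch of that idea, not the authors' implementation. It assumes an undirected graph stored as a nested dict, made-up edge weights, a score of 0 for unconnected vocabularies and a score of 1 for a vocabulary compared with itself; the function names are hypothetical.

```python
import heapq


def build_graph(edges):
    """Build an undirected weighted graph (assumption: relations are symmetric)."""
    graph = {}
    for a, b, weight in edges:
        graph.setdefault(a, {})[b] = weight
        graph.setdefault(b, {})[a] = weight
    return graph


def shortest_path_weight(graph, source, target):
    """Dijkstra's algorithm; returns None when target is unreachable."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            return d
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, weight in graph.get(node, {}).items():
            nd = d + weight
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return None


def path_relatedness(graph, vi, vj):
    """Reciprocal of the shortest-path weight between two vocabularies."""
    weight = shortest_path_weight(graph, vi, vj)
    if weight is None:
        return 0.0  # assumption: unconnected vocabularies are unrelated
    if weight == 0:
        return 1.0  # assumption: a vocabulary is fully related to itself
    return 1.0 / weight


# Hypothetical example graph; the weights are illustrative only.
G_E = build_graph([("v1", "v2", 1.0), ("v2", "v3", 2.0)])
print(path_relatedness(G_E, "v1", "v3"))
```

The same function can be applied to G_I or G_E+I by passing a different edge list.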

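Measure 2: implicit semantic relatedness

[Figure: terms t2, t3, t4 linked by term-level relations (owl:inverseOf, rdfs:subClassOf) and the resulting graph G_I over vocabularies v2, v3, v4, with edge labels 1 and 2]

$$R_{S_I}(v_i, v_j) = \frac{1}{\text{weight of a shortest path between } v_i \text{ and } v_j \text{ in } G_I}$$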

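Measure 3: hybrid semantic relatedness

[Figure: graph G_{E+I} over vocabularies v1, v2, v3, v4, with edge labels 1, 2 and 1]

$$R_{S_{E+I}}(v_i, v_j) = \frac{1}{\text{weight of a shortest path between } v_i \text{ and } v_j \text{ in } G_{E+I}}$$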

Empirical analysis (1)

Statistical properties of G_E, G_I and G_{E+I}

Empirical analysis (2)

Explicit relations between vocabularies

Measure 4: content similarity

Harmonic mean

Maximum similarity between their labels
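
The slide names only two ingredients: a maximum similarity between labels, and a harmonic mean. One way these could fit together is sketched below. Everything beyond those two ingredients is an assumption: the string similarity (Python's `difflib.SequenceMatcher`), the averaging of best matches per direction, and what the harmonic mean combines. All function names are hypothetical.

```python
from difflib import SequenceMatcher
from statistics import harmonic_mean


def label_similarity(a, b):
    """String similarity in [0, 1]; the choice of SequenceMatcher is an assumption."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def term_similarity(labels_a, labels_b):
    """Maximum similarity between the labels of two terms."""
    return max(
        (label_similarity(a, b) for a in labels_a for b in labels_b),
        default=0.0,
    )


def directional_score(vocab_a, vocab_b):
    """Assumption: average, over terms of vocab_a, of the best match in vocab_b."""
    if not vocab_a:
        return 0.0
    total = 0.0
    for labels_a in vocab_a.values():
        total += max(
            (term_similarity(labels_a, labels_b) for labels_b in vocab_b.values()),
            default=0.0,
        )
    return total / len(vocab_a)


def content_similarity(vocab_a, vocab_b):
    """Assumption: harmonic mean of the two directional scores."""
    ab = directional_score(vocab_a, vocab_b)
    ba = directional_score(vocab_b, vocab_a)
    if ab == 0.0 or ba == 0.0:
        return 0.0
    return harmonic_mean([ab, ba])


# Each vocabulary maps a term identifier to its list of labels (illustrative data).
vocab_1 = {"FullProfessor": ["Full Professor"], "FacultyMember": ["Faculty Member"]}
vocab_2 = {"Professor": ["Professor"], "Faculty": ["Faculty"]}
print(content_similarity(vocab_1, vocab_2))
```

The harmonic mean penalises one-sided matches: a small vocabulary fully covered by a large one scores lower than two vocabularies that cover each other.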

Empirical analysis (3)

86 label-like properties

rdfs:label, dc:title, and their subproperties (e.g. skos:prefLabel)

and local name

[Pie chart "Terms and their labels" (w/, w/o): slices of 63.67% and 36.33%]

[Pie chart "Vocabulary distribution" (w/, w/o): slices of 36.21% and 63.79%]

Measure 5: expressivity closeness

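[Figure: example terms tp, tq, tr and the meta-level terms (MetaTerms) used to describe them: rdfs:domain, owl:inverseOf, rdf:type, owl:TransitiveProperty]

Jaccard similarity between the MetaTerms sets of two vocabularies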
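
As a small illustration of the Jaccard coefficient on sets of meta-level terms, here is a sketch with made-up sets. The function name and the convention of returning 0 for two empty sets are assumptions.

```python
def expressivity_closeness(meta_terms_a, meta_terms_b):
    """Jaccard coefficient of two sets of meta-level terms."""
    a, b = set(meta_terms_a), set(meta_terms_b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets are treated as unrelated
    return len(a & b) / len(union)


# Illustrative MetaTerms sets for two hypothetical vocabularies.
meta_v1 = {"rdf:type", "rdfs:domain", "owl:inverseOf"}
meta_v2 = {"rdf:type", "rdfs:domain", "rdfs:range", "owl:TransitiveProperty"}
print(expressivity_closeness(meta_v1, meta_v2))  # 2 shared / 5 total = 0.4
```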

Empirical analysis (4)

4,978 meta-level terms, 469 (9.42%) in >1 vocabulary

Most popular meta-level terms

1. rdf:type

2. rdfs:domain

3. rdfs:range

4. …

and after excluding language constructs

10.13 meta-level terms per vocabulary

≤20 meta-level terms in 92.96% of vocabularies

but hundreds in Cyc

Measure 6: distributional relatedness

Distributional profile

$$DP(v) = \left( p(v_1 \mid v),\ p(v_2 \mid v),\ \ldots,\ p(v_n \mid v) \right)$$

$$R_D(v_i, v_j) = \cos\left( DP(v_i), DP(v_j) \right)$$
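
The distributional measure compares two profile vectors by their cosine. The sketch below shows that computation on made-up data. How the conditional probabilities are estimated is not stated on the slide; normalising co-occurrence counts per vocabulary is an assumption here, as are the function names and the example counts.

```python
import math


def distributional_profile(cooccurrence, v, vocabularies):
    """Profile of v over a fixed vocabulary order.

    Assumption: each probability is estimated as a normalised co-occurrence count.
    """
    row = cooccurrence.get(v, {})
    total = sum(row.values())
    if total == 0:
        return [0.0] * len(vocabularies)
    return [row.get(u, 0) / total for u in vocabularies]


def cosine(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # assumption: an empty profile is unrelated to everything
    return dot / (norm_x * norm_y)


def distributional_relatedness(cooccurrence, vi, vj, vocabularies):
    return cosine(
        distributional_profile(cooccurrence, vi, vocabularies),
        distributional_profile(cooccurrence, vj, vocabularies),
    )


# Illustrative counts: how often each vocabulary is used together with others.
vocabularies = ["v1", "v2", "v3"]
counts = {
    "a": {"v1": 3, "v2": 1},
    "b": {"v1": 6, "v2": 2},
    "c": {"v3": 5},
}
print(distributional_relatedness(counts, "a", "b", vocabularies))
print(distributional_relatedness(counts, "a", "c", vocabularies))
```

Because cosine ignores vector length, "a" and "b" come out as maximally related even though their raw counts differ by a factor of two.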

Empirical analysis (5)

Instantiation found for 1,874 (62.55%) vocabularies

Most popular vocabularies (excluding languages)

Empirical analysis (6)

Co-instantiation found for 9,763 pairs of vocabularies

Most popular vocabulary co-instantiation (excluding languages)

Comparison

Agreement between measures

Spearman’s rank correlation coefficient (ρ∈[-1,1])

Single-link hierarchical clustering
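
For readers unfamiliar with Spearman's coefficient, the sketch below computes it as the Pearson correlation of average ranks, which handles ties. The input scores are made up, and returning 0 when one list is constant (where ρ is undefined) is an assumption. How the slide's study paired up the scores of two measures is not shown here.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0  # assumption: rho is undefined for a constant list
    return cov / math.sqrt(var_x * var_y)


# Illustrative scores given by two measures to the same four vocabulary pairs.
measure_a = [0.1, 0.4, 0.4, 0.9]
measure_b = [0.2, 0.3, 0.5, 0.8]
print(spearman(measure_a, measure_b))
```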

Post-selection vocabulary recommendation

Relatedness-based ranking

Ranking by single measure:

Ranking by multiple measures:

Popularity-based re-ranking

Number of pay-level domains instantiating vi

Degree of influence of popularity

Evaluation settings

20 “selections” randomly selected from 1,302 moderate-sized vocabularies

Depth-10 pooling with

2 experts

Ratings

Closely related: 2

Somewhat related: 1

Unrelated: 0

Metric: NDCG
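
NDCG has several common variants, and the slide does not say which one was used. The sketch below assumes the exponential-gain form with a log2 position discount, applied to the 0/1/2 ratings listed above. The function names and the example ratings are hypothetical.

```python
import math


def dcg(ratings, k):
    """Discounted cumulative gain at depth k (assumed gain: 2^rating - 1)."""
    return sum(
        (2 ** rating - 1) / math.log2(position + 2)
        for position, rating in enumerate(ratings[:k])
    )


def ndcg(ranked_ratings, judged_ratings, k):
    """DCG of a ranking divided by the DCG of the ideal ordering.

    ranked_ratings: ratings of the recommended vocabularies, in ranked order.
    judged_ratings: ratings of every judged vocabulary for this selection.
    """
    ideal = sorted(judged_ratings, reverse=True)
    ideal_dcg = dcg(ideal, k)
    if ideal_dcg == 0.0:
        return 0.0  # assumption: no related vocabulary exists, score is 0
    return dcg(ranked_ratings, k) / ideal_dcg


# Illustrative ratings: 2 = closely related, 1 = somewhat related, 0 = unrelated.
ranking = [1, 2, 0]
judged = [2, 1, 0, 0]
print(ndcg(ranking, judged, 1))
print(ndcg(ranking, judged, 3))
```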

Gold standard

739 assessments

Agreement between experts

80%

or 91% when “closely related = somewhat related = related”

[Pie chart "Assessments" (Closely related, Somewhat related, Unrelated): slices of 7.85%, 10.55% and 81.60%]

Evaluation results: individual measures

56.88% isolated vocabularies in G_E

37.45% uninstantiated vocabularies

Evaluation results: combinations of measures

Relatedness vs. popularity

NDCG@1 vs. number of pay-level domains instantiating it

Conclusions

Vocabulary-level relatedness

4 aspects, 6 measures

Empirical analysis

Statistical findings

Comparison

Post-selection vocabulary recommendation

Relatedness-based ranking

Popularity-based re-ranking

Evaluation

Falcons Ontology Search

http://ws.nju.edu.cn/falcons/ontologysearch/

Take away

Vocabulary meta-descriptions are incomplete.

Terms lack labels.

Co-instantiated ∝ explicitly related

http://ws.nju.edu.cn/falcons/ontologysearch/
