Post on 03-Jul-2015
An Empirical Study of Vocabulary Relatedness
and Its Application to Recommender Systems
Gong Cheng, Saisai Gong, Yuzhong Qu
State Key Laboratory for Novel Software Technology, Nanjing University, China
gcheng@nju.edu.cn
Presented at ISWC2011
Gong Cheng (程龚) gcheng@nju.edu.cn 2 of 36
ws .nju.edu.cn
Vocabulary matching
Measuring term similarity
[Figure: two term lists, (FullProfessor, FacultyMember, AssistantProfessor) and (Professor, Faculty, AssistantProfessor), connected by term similarity scores 0.9, 0.8, and 1.0]
Vocabulary matching
Vocabulary distance
Measuring vocabulary similarity
[Figure: pairwise scores between the vocabularies Semantic Web for Research Communities (SWRC), eBiquity Person, Foundational Model of Anatomy (FMA), GALEN, and NCBI organismal classification (NCBITaxon); values shown: 0.8, 0.5, 0.5, 0.6, 0.02]
Vocabulary matching
Vocabulary distance
Vocabulary relatedness
Measuring vocabulary relatedness
[Figure: the terms FullProfessor, FacultyMember, and AssistantProfessor shown alongside PhD, Postgraduate-Research-Degree, and EngD]
Not that similar, but somewhat related
Contributions
How to measure vocabulary relatedness?
6 measures, from 4 aspects
How about vocabulary relatedness in real-life cases?
Empirical analysis of 2,996 vocabularies and 4 billion other RDF triples
Where to apply vocabulary relatedness?
Post-selection vocabulary recommendation in vocabulary search
Outline
Data set
Vocabulary relatedness
Post-selection vocabulary recommendation
Conclusions
Data set statistics
Crawled from February 2010 to May 2011 by
Data set distributions
RDF documents over pay-level domains
Data set distributions
Vocabularies over top-level domains
Outline
Data set
Vocabulary relatedness
Post-selection vocabulary recommendation
Conclusions
Vocabulary relatedness
6 numerical measures, from 4 aspects
Semantic relatedness
Explicit
Implicit
Hybrid
Content similarity
Expressivity closeness
Distributional relatedness
Comparison
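Measure 1: explicit semantic relatedness
Explicit relations between vocabularies: owl:imports, rdfs:seeAlso, owl:priorVersion
[Figure: graph G_E over vocabularies v1, v2, v3, with edge weights 1 and 2]
$$R_{S_E}(v_i, v_j) = \frac{1}{\text{weight of a shortest path between } v_i \text{ and } v_j \text{ in } G_E}$$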
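The formula on this slide takes relatedness to be the reciprocal of the weight of a shortest path between two vocabularies in the graph G_E. A minimal sketch of that computation follows. It assumes an undirected graph with positive edge weights, uses Dijkstra's algorithm as one standard way to find the shortest path, and returns 0.0 for disconnected pairs. The example weights, the function names, and the handling of disconnected pairs are illustrative assumptions, not taken from the slides.

```python
import heapq


def shortest_path_weight(graph, source, target):
    """Dijkstra's algorithm on a graph given as {node: {neighbor: weight}}."""
    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == target:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, weight in graph.get(node, {}).items():
            candidate = dist + weight
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return float("inf")  # no path between source and target


def path_relatedness(graph, vi, vj):
    """1 / shortest-path weight; 0.0 for disconnected pairs (an assumption)."""
    if vi == vj:
        raise ValueError("relatedness is only computed for distinct vocabularies")
    weight = shortest_path_weight(graph, vi, vj)
    return 0.0 if weight == float("inf") else 1.0 / weight


# Hypothetical undirected graph: v1 -(1)- v2 -(2)- v3
example_graph = {
    "v1": {"v2": 1.0},
    "v2": {"v1": 1.0, "v3": 2.0},
    "v3": {"v2": 2.0},
}
print(path_relatedness(example_graph, "v1", "v3"))  # 1 / (1 + 2)
```

The same function would apply unchanged to any weighted vocabulary graph, since only the edge set differs between graphs.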
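Measure 2: implicit semantic relatedness
Relations between terms: owl:inverseOf, rdfs:subClassOf
[Figure: terms t2, t3, t4 linked by owl:inverseOf and rdfs:subClassOf, and the resulting graph G_I over vocabularies v2, v3, v4, with edge weights 1 and 2]
$$R_{S_I}(v_i, v_j) = \frac{1}{\text{weight of a shortest path between } v_i \text{ and } v_j \text{ in } G_I}$$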
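The figure on this slide relates vocabularies through relations between their terms (owl:inverseOf, rdfs:subClassOf). The sketch below shows one way to lift term-level triples to vocabulary-level edges. The slides do not say how a term is assigned to its vocabulary or how edge weights are chosen, so the namespace heuristic, the function names, and the example URIs are all assumptions, and no weights are computed here.

```python
def namespace_of(term_uri):
    """Assumed heuristic: a term's vocabulary is its URI up to the last '#' or '/'."""
    cut = max(term_uri.rfind("#"), term_uri.rfind("/"))
    return term_uri[:cut + 1]


def lift_to_vocabulary_edges(term_triples):
    """Map (subject, predicate, object) term triples to vocabulary pairs.

    Returns {frozenset({vocab_a, vocab_b}): set of predicates linking them}.
    Triples whose terms share a vocabulary are skipped.
    """
    edges = {}
    for subject, predicate, obj in term_triples:
        vs, vo = namespace_of(subject), namespace_of(obj)
        if vs != vo:
            edges.setdefault(frozenset((vs, vo)), set()).add(predicate)
    return edges


# Hypothetical term-level triples across three vocabularies
example_triples = [
    ("http://example.org/v2#t2", "owl:inverseOf", "http://example.org/v3#t3"),
    ("http://example.org/v3#t3", "rdfs:subClassOf", "http://example.org/v4#t4"),
    ("http://example.org/v2#a", "rdfs:subClassOf", "http://example.org/v2#b"),
]
for pair, predicates in lift_to_vocabulary_edges(example_triples).items():
    print(sorted(pair), sorted(predicates))
```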
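Measure 3: hybrid semantic relatedness
[Figure: graph G_{E+I} over vocabularies v1, v2, v3, v4, with edge weights 1, 2, and 1]
$$R_{S_{E+I}}(v_i, v_j) = \frac{1}{\text{weight of a shortest path between } v_i \text{ and } v_j \text{ in } G_{E+I}}$$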
Empirical analysis (1)
Statistical properties of G_E, G_I, and G_{E+I}
Empirical analysis (2)
Explicit relations between vocabularies
Measure 4: content similarity
Harmonic mean
Maximum similarity between their labels
Empirical analysis (3)
86 label-like properties
rdfs:label, dc:title, and their subproperties (e.g. skos:prefLabel)
and local name
[Pie chart: Terms and their labels; slices 63.67% and 36.33%; legend: w/, w/o]
[Pie chart: Vocabulary distribution; slices 36.21% and 63.79%; legend: w/, w/o]
Measure 5: expressivity closeness
[Figure: terms tp, tq, tr described with the meta-level terms rdfs:domain, owl:inverseOf, owl:TransitiveProperty, and rdf:type, which are collected into MetaTerms]
Jaccard
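This slide names MetaTerms and Jaccard. Read as the Jaccard coefficient between the sets of meta-level terms used by two vocabularies, the computation can be sketched as below. The function name, the example sets, and the choice to return 0.0 when both sets are empty are assumptions made for illustration.

```python
def jaccard(meta_terms_a, meta_terms_b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two collections of meta-level terms."""
    a, b = set(meta_terms_a), set(meta_terms_b)
    if not a and not b:
        return 0.0  # assumption: two empty sets are treated as unrelated
    return len(a & b) / len(a | b)


# Hypothetical meta-level term sets for two vocabularies
meta_vi = {"rdf:type", "rdfs:domain", "owl:inverseOf"}
meta_vj = {"rdf:type", "rdfs:domain", "owl:TransitiveProperty"}
print(jaccard(meta_vi, meta_vj))  # 2 shared terms out of 4 distinct terms
```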
Empirical analysis (4)
4,978 meta-level terms, 469 (9.42%) in >1 vocabulary
Most popular meta-level terms
1. rdf:type
2. rdfs:domain
3. rdfs:range
4. …
and after excluding language constructs
10.13 meta-level terms per vocabulary
≤20 meta-level terms in 92.96% of vocabularies
but hundreds in Cyc
Measure 6: distributional relatedness
Distributional profile
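$$\mathrm{DP}(v) = \big(p(v_1 \mid v),\ p(v_2 \mid v),\ \ldots,\ p(v_n \mid v)\big)$$
$$R_D(v_i, v_j) = \cos\big(\mathrm{DP}(v_i),\ \mathrm{DP}(v_j)\big)$$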
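Distributional relatedness is given here as the cosine between the distributional profiles of two vocabularies. The sketch below computes that cosine for profiles stored as sparse dictionaries. How the conditional probabilities in a profile are estimated is not recoverable from the slide, so the profiles are taken as given; the vocabulary keys and probability values are made up for illustration.

```python
import math


def cosine(profile_a, profile_b):
    """Cosine of two sparse vectors given as {dimension: value} dictionaries."""
    shared = profile_a.keys() & profile_b.keys()
    dot = sum(profile_a[k] * profile_b[k] for k in shared)
    norm_a = math.sqrt(sum(x * x for x in profile_a.values()))
    norm_b = math.sqrt(sum(x * x for x in profile_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: an all-zero profile is related to nothing
    return dot / (norm_a * norm_b)


# Hypothetical distributional profiles DP(vi) and DP(vj)
dp_vi = {"foaf": 0.5, "dc": 0.5}
dp_vj = {"foaf": 1.0}
print(cosine(dp_vi, dp_vj))
```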
Empirical analysis (5)
Instantiation found for 1,874 (62.55%) vocabularies
Most popular vocabularies (excluding languages)
Empirical analysis (6)
Co-instantiation found for 9,763 pairs of vocabularies
Most popular vocabulary co-instantiation (excluding languages)
Vocabulary relatedness
6 numerical measures, from 4 aspects
Semantic relatedness
Explicit
Implicit
Hybrid
Content similarity
Expressivity closeness
Distributional relatedness
Comparison
Agreement between measures
Spearman’s rank correlation coefficient (ρ∈[-1,1])
Single-link hierarchical clustering
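Agreement between two measures is reported with Spearman's rank correlation coefficient. As a reminder of what that coefficient computes, here is a minimal pure-Python sketch: it ranks the scores of each measure (ties get their average rank) and takes the Pearson correlation of the ranks. The score lists are invented, and this is not a reconstruction of the authors' tooling.

```python
import math


def average_ranks(values):
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two score lists of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    if sd_x == 0.0 or sd_y == 0.0:
        raise ValueError("rho is undefined when one list is constant")
    return cov / (sd_x * sd_y)


# Hypothetical scores that two measures assign to the same four vocabulary pairs
scores_measure_a = [0.9, 0.1, 0.5, 0.3]
scores_measure_b = [0.8, 0.2, 0.4, 0.6]
print(spearman_rho(scores_measure_a, scores_measure_b))
```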
Outline
Data set
Vocabulary relatedness
Post-selection vocabulary recommendation
Conclusions
Ranking by single measure:
Ranking by multiple measures:
Relatedness-based ranking
Popularity-based re-ranking
Number of pay-level domains instantiating vi
Degree of influence of popularity
Evaluation settings
20 “selections” randomly selected from 1,302 moderate-sized vocabularies
Depth-10 pooling with
2 experts
Ratings
Closely related: 2
Somewhat related: 1
Unrelated: 0
Metric: NDCG
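With graded ratings of 2, 1, and 0 and NDCG as the metric, a ranking can be scored as in the sketch below. NDCG has several variants, and the slides do not say which one was used. This sketch assumes the rating itself as the gain and a log2(rank + 1) discount, and the ratings in the example are invented.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount, rank starting at 1."""
    return sum(gain / math.log2(rank + 2) for rank, gain in enumerate(gains))


def ndcg_at_k(ranked_ratings, pooled_ratings, k):
    """DCG of the top-k ranking divided by the DCG of the best possible top-k."""
    ideal = sorted(pooled_ratings, reverse=True)[:k]
    best = dcg(ideal)
    if best == 0.0:
        return 0.0  # assumption: score 0 when nothing in the pool is related
    return dcg(ranked_ratings[:k]) / best


# Hypothetical ratings: 2 = closely related, 1 = somewhat related, 0 = unrelated
ranking = [1, 2, 0]      # ratings of the recommended vocabularies, in ranked order
pool = [2, 1, 0, 0]      # ratings of all assessed vocabularies for this selection
print(ndcg_at_k(ranking, pool, 3))
```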
Gold standard
739 assessments
Agreement between experts
80%
or 91% when “closely related = somewhat related = related”
[Pie chart: Assessments; slices 7.85%, 10.55%, and 81.60%; legend: Closely related, Somewhat related, Unrelated]
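The two agreement figures on this slide differ only in whether "closely related" and "somewhat related" are merged into a single "related" class. The sketch below computes simple percent agreement both ways. The ratings are invented, and simple percent agreement is an assumption, since the slides do not name the agreement statistic.

```python
def percent_agreement(ratings_a, ratings_b, collapse=False):
    """Share of items on which two raters agree.

    With collapse=True, ratings 2 and 1 are merged into a single "related" class.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    if collapse:
        ratings_a = [min(r, 1) for r in ratings_a]
        ratings_b = [min(r, 1) for r in ratings_b]
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return matches / len(ratings_a)


# Hypothetical ratings from two experts on five vocabulary pairs
expert_1 = [2, 1, 0, 0, 1]
expert_2 = [1, 1, 0, 0, 0]
print(percent_agreement(expert_1, expert_2))                 # three-level scale
print(percent_agreement(expert_1, expert_2, collapse=True))  # related vs. unrelated
```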
Evaluation results --- individual measures
56.88% isolated vocabularies in G_E
37.45% uninstantiated vocabularies
Evaluation results --- combinations of measures
Relatedness vs. popularity
NDCG@1 vs. number of pay-level domains instantiating it
Outline
Data set
Vocabulary relatedness
Post-selection vocabulary recommendation
Conclusions
Conclusions
Vocabulary-level relatedness
4 aspects, 6 measures
Empirical analysis
Statistical findings
Comparison
Post-selection vocabulary recommendation
Relatedness-based ranking
Popularity-based re-ranking
Evaluation
Falcons Ontology Search
http://ws.nju.edu.cn/falcons/ontologysearch/
Take away
Vocabulary meta-descriptions are incomplete.
Terms lack labels.
Co-instantiated ∝ explicitly related
http://ws.nju.edu.cn/falcons/ontologysearch/