Dr. Harald Sack
Hasso-Plattner-Institut for IT-Systems Engineering
University of Potsdam
Summarization and Relevance of Linked Data Facts
4. Leipziger Semantic Web Tag 2012, Leipzig, 25.09.2012
Tuesday, 25 September 2012
Summarization and Relevance of Linked Data Facts
• Linked Open Data and Semantics
• The Importance of Being Relevant
• The Most Important Facts
• Heuristics for Fact Relevance
• Relevance Evaluation
Dr. Harald Sack, Hasso-Plattner-Institut Potsdam, Leipziger Semantic Web Tage, 24./25.09.2012, Leipzig
There are more than 30 Billion facts in the Linked Data Universe
http://www4.wiwiss.fu-berlin.de/lodcloud/state/
State of LOD as a knowledge base
• LOD facts are encoded in RDF
• related ontologies only provide ‘shallow’ semantics
• poor data quality
  • inconsistencies
  • ambiguities
  • redundancies
• mapping and interlinking are still a problem
Consider a single LOD Dataset
Aldous Huxley
• e.g., Albert Einstein
• > 600 facts
• > 70 properties
• no given order of facts
• no given relevance of facts
How to find out what facts are important?
• dbpedia:Albert_Einstein rdf:type yago:AmericanVegetarians .
vs.
• dbpedia:Albert_Einstein rdf:type yago:TheoreticalPhysicist .
Which fact is more important?
Importance depends on the Context
• Context ‘Nutrition’: dbpedia:Albert_Einstein rdf:type yago:AmericanVegetarians .
• Context ‘Science’: dbpedia:Albert_Einstein rdf:type yago:TheoreticalPhysicist .
What can it be used for?
• Making Decisions
  • which fact(s) should be considered to make a decision?
  • how should the (relevant) facts be weighted to make the right decision?
  • Applications: Recommendation, Exploratory Search, ...
• Creating Summarizations
  • “entity summarization ... produce a version of the original [entity] description that is more concise, yet containing sufficient information for users to quickly identify the underlying entity.” [1]
[1] Gong Cheng, Thanh Tran, and Yuzhong Qu. RELIN: relatedness and informativeness-based centrality for entity summarization. In Proc. of the 10th Int. Conf. on The Semantic Web - Volume Part I, ISWC'11, Bonn, Germany. Springer-Verlag, 2011, pp. 114–129.
Google Knowledge Graph
• Summarizations of facts related to a given entity
• on average 192 facts per entity [2]
[2] Gong Cheng, Thanh Tran, and Yuzhong Qu. RELIN: relatedness and informativeness-based centrality for entity summarization. In Proc. of the 10th Int. Conf. on The Semantic Web - Volume Part I, ISWC'11, Berlin, Heidelberg. Springer-Verlag, 2011, pp. 114–129.
Google Knowledge Graph
• the Google users decide with their queries what is important
• Queries:
  • subject + object: “Einstein ETH Zürich”
  • subject + property: “Einstein birthplace”
• Collaborative Filtering
Entity Summarization
RDF Graph:
dbpedia:Albert_Einstein rdf:type yago:AmericanVegetarian .
dbpedia:Albert_Einstein rdf:type yago:TheoreticalPhysicist .
dbpedia:Albert_Einstein foaf:givenName "Albert" .
dbpedia:Albert_Einstein foaf:familyName "Einstein" .
Feature set FS(dbpedia:Albert_Einstein):
f1 <rdf:type, yago:AmericanVegetarian>
f2 <rdf:type, yago:TheoreticalPhysicist>
f3 <foaf:givenName, "Albert">
f4 <foaf:familyName, "Einstein">
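A feature set like the one above can be derived mechanically from the outgoing triples of an entity. The following is a minimal sketch, assuming triples are held as plain (subject, property, value) string tuples; the helper name `feature_set` is hypothetical and not part of any RDF library.

```python
# Triples of the example graph as (subject, property, value) tuples.
TRIPLES = [
    ("dbpedia:Albert_Einstein", "rdf:type", "yago:AmericanVegetarian"),
    ("dbpedia:Albert_Einstein", "rdf:type", "yago:TheoreticalPhysicist"),
    ("dbpedia:Albert_Einstein", "foaf:givenName", "Albert"),
    ("dbpedia:Albert_Einstein", "foaf:familyName", "Einstein"),
]


def feature_set(entity, triples):
    """Return FS(entity): all <property, value> pairs with entity as subject."""
    return [(p, o) for s, p, o in triples if s == entity]


fs = feature_set("dbpedia:Albert_Einstein", TRIPLES)
for i, (prop, value) in enumerate(fs, start=1):
    print(f"f{i} <{prop}, {value}>")
```

Keeping features as (property, value) pairs mirrors the f1 to f4 notation and leaves the subject implicit.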
Entity Summarization
Given a feature set FS(e) of an entity e and a positive integer k < |FS(e)|, the problem of entity summarization is to select Summ(e) ⊂ FS(e) such that |Summ(e)| = k. Summ(e) is called a summary of e. [3]
[3] Gong Cheng, Thanh Tran, and Yuzhong Qu. RELIN: relatedness and informativeness-based centrality for entity summarization. In Proc. of the 10th Int. Conf. on The Semantic Web - Volume Part I, ISWC'11, Bonn, Germany. Springer-Verlag, 2011, pp. 114–129.
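The definition only fixes the size of the summary, not how its members are chosen. One common reading is to rank the features by some relevance weight and keep the top k. The sketch below assumes that reading; the function name `summarize` and the scores are invented purely for illustration.

```python
def summarize(features, weight, k):
    """Select Summ(e): the k features of FS(e) with the highest weight."""
    if not 0 < k < len(features):
        raise ValueError("k must satisfy 0 < k < |FS(e)|")
    return sorted(features, key=weight, reverse=True)[:k]


fs = [
    ("rdf:type", "yago:AmericanVegetarian"),
    ("rdf:type", "yago:TheoreticalPhysicist"),
    ("foaf:givenName", "Albert"),
    ("foaf:familyName", "Einstein"),
]
# Hypothetical relevance scores, not taken from any dataset.
scores = {fs[0]: 0.1, fs[1]: 0.9, fs[2]: 0.4, fs[3]: 0.6}

summary = summarize(fs, scores.get, k=2)
print(summary)
```

Any of the heuristics discussed later could take the place of the hand-written `scores` table.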
How to determine relevant LOD facts?
graph analysis
(usage) statistics
semantic analysis
linguistic analysis
Heuristics for Feature Relevance
(1) (Manual) Whitelisting
(2) RDF-properties connecting objects of same rdf:type
(3) Inverse and symmetric RDF-properties
(4) Disambiguations
(5) Bidirectional (wiki)Links
(6) Linguistic Co-occurrences
(7) Frequency of features shared with entities of same type
(8) Frequency of features shared with ‘neighborhood’ entities [4,5]
[4] Jörg Waitelonis and Harald Sack. Towards exploratory video search using linked data. Multimedia Tools and Applications, 59:645–672, 2012.
[5] Andreas Thalhammer, Ioan Toma, Antonio J. Roa-Valverde, and Dieter Fensel. Leveraging usage data for linked data movie entity summarization. In Proc. of the 2nd Int. Ws. on Usage Analysis and the Web of Data (USEWOD2012).
Heuristics for Feature Relevance
(1) (Manual) Whitelisting
dbpedia:Albert_Einstein dbpedia-owl:residence dbpedia:Switzerland .
dbpedia:Albert_Einstein rdf:type dbpedia-owl:Person .
dbpedia:Switzerland rdf:type dbpedia-owl:Place .
Locations are (manually) considered to be important for persons...
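A whitelist of this kind can be expressed as a set of type patterns. The sketch below assumes a pattern is a (subject type, property, object type) triple; the names `WHITELIST`, `TYPES` and `is_whitelisted` are hypothetical.

```python
# Hypothetical manual whitelist: (subject type, property, object type).
WHITELIST = {
    ("dbpedia-owl:Person", "dbpedia-owl:residence", "dbpedia-owl:Place"),
}

# rdf:type assignments of the example entities.
TYPES = {
    "dbpedia:Albert_Einstein": {"dbpedia-owl:Person"},
    "dbpedia:Switzerland": {"dbpedia-owl:Place"},
}


def is_whitelisted(subject, prop, obj):
    """True if some (type(subject), prop, type(obj)) pattern is whitelisted."""
    return any(
        (s_type, prop, o_type) in WHITELIST
        for s_type in TYPES.get(subject, ())
        for o_type in TYPES.get(obj, ())
    )


print(is_whitelisted("dbpedia:Albert_Einstein",
                     "dbpedia-owl:residence",
                     "dbpedia:Switzerland"))
```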
Heuristics for Feature Relevance
(2) RDF-properties connecting objects of same rdf:type
dbpedia:Albert_Einstein rdf:type yago:AmericanVegetarian .
dbpedia:Albert_Einstein rdf:type dbpedia-owl:Scientist .
dbpedia:Albert_Einstein dbpedia-owl:doctoralAdvisor dbpedia:AlfredKleiner .
dbpedia:AlfredKleiner rdf:type dbpedia-owl:Scientist .
dbpedia:BillCosby rdf:type yago:AmericanVegetarian .
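Heuristic (2) can be sketched as a filter that keeps only those triples whose subject and object share at least one rdf:type. The type assignments below follow one reading of the example graph (in particular the type of dbpedia:BillCosby is an assumption), and `same_type_links` is a hypothetical helper name.

```python
# rdf:type assignments, as read from the example graph (partly assumed).
TYPES = {
    "dbpedia:Albert_Einstein": {"yago:AmericanVegetarian", "dbpedia-owl:Scientist"},
    "dbpedia:AlfredKleiner": {"dbpedia-owl:Scientist"},
    "dbpedia:BillCosby": {"yago:AmericanVegetarian"},
}

LINKS = [
    ("dbpedia:Albert_Einstein", "dbpedia-owl:doctoralAdvisor", "dbpedia:AlfredKleiner"),
]


def same_type_links(links, types):
    """Keep triples whose subject and object share at least one rdf:type."""
    return [
        (s, p, o)
        for s, p, o in links
        if types.get(s, set()) & types.get(o, set())
    ]


print(same_type_links(LINKS, TYPES))
```

Einstein and Bill Cosby also share a type, but no property connects them, so the heuristic yields no fact for that pair.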
Heuristics for Feature Relevance
(3) Inverse and Symmetric RDF properties
dbpedia:Albert_Einstein rdf:type dbpedia-owl:Scientist .
dbpedia:AlfredKleiner rdf:type dbpedia-owl:Scientist .
dbpedia:Albert_Einstein dbpedia-owl:doctoralAdvisor dbpedia:AlfredKleiner .
dbpedia:AlfredKleiner dbpedia-owl:doctoralStudent dbpedia:Albert_Einstein .
dbpedia-owl:doctoralAdvisor owl:inverseOf dbpedia-owl:doctoralStudent .
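One way to operationalize heuristic (3) is to check whether a fact is confirmed by its inverse statement in the data. The sketch below assumes the inverse relations are available as a lookup table; a symmetric property would simply map to itself. The names `INVERSE_OF` and `has_inverse` are hypothetical.

```python
# owl:inverseOf declarations, stored in both directions.
INVERSE_OF = {
    "dbpedia-owl:doctoralAdvisor": "dbpedia-owl:doctoralStudent",
    "dbpedia-owl:doctoralStudent": "dbpedia-owl:doctoralAdvisor",
}

TRIPLES = {
    ("dbpedia:Albert_Einstein", "dbpedia-owl:doctoralAdvisor", "dbpedia:AlfredKleiner"),
    ("dbpedia:AlfredKleiner", "dbpedia-owl:doctoralStudent", "dbpedia:Albert_Einstein"),
}


def has_inverse(triple, triples, inverse_of):
    """True if the inverse statement of the triple is also in the data."""
    s, p, o = triple
    inv = inverse_of.get(p)
    return inv is not None and (o, inv, s) in triples


fact = ("dbpedia:Albert_Einstein", "dbpedia-owl:doctoralAdvisor", "dbpedia:AlfredKleiner")
print(has_inverse(fact, TRIPLES, INVERSE_OF))
```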
Heuristics for Feature Relevance
(4) Disambiguations
dbpedia:Sumerian dbpedia-owl:wikiPageDisambiguates dbpedia:Cuneiform_script, dbpedia:Sumer, dbpedia:Sumerian_religion, dbpedia:Sumerian_language, dbpedia:Sumerian_art .
Heuristics for Feature Relevance
(5) Bidirectional (Wiki)Links
dbpedia:Albert_Einstein dbprop:wikilink dbpedia:AlfredKleiner .
dbpedia:AlfredKleiner dbprop:wikilink dbpedia:Albert_Einstein .
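Detecting bidirectional links amounts to keeping only those link pairs whose reverse is also present. A minimal sketch, assuming wikilinks are stored as (source, target) pairs; the one-way link to dbpedia:Physics is a hypothetical addition for contrast.

```python
# (source, target) pairs of dbprop:wikilink statements.
WIKILINKS = {
    ("dbpedia:Albert_Einstein", "dbpedia:AlfredKleiner"),
    ("dbpedia:AlfredKleiner", "dbpedia:Albert_Einstein"),
    ("dbpedia:Albert_Einstein", "dbpedia:Physics"),  # hypothetical one-way link
}


def bidirectional(links):
    """Return the links whose reverse link is also present."""
    return {(a, b) for a, b in links if (b, a) in links}


print(sorted(bidirectional(WIKILINKS)))
```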
Heuristics for Feature Relevance
(6) Linguistic Co-occurrences
dbpedia:Albert_Einstein dbpprop:fields dbpedia:Physics .
dbpedia:Albert_Einstein foaf:givenName "Albert" .
dbpedia:Albert_Einstein foaf:familyName "Einstein" .
dbpedia:Albert_Einstein rdf:type dbpedia:Violinists .
Heuristics for Feature Relevance
(7) Frequency of features shared with entities of same type
[Figure: RDF graph relating dbpedia:Albert_Einstein, dbpedia:AlfredKleiner, dbpedia:Archimedes and dbpedia:Niels_Bohr to dbpedia-owl:Scientist via rdf:type]
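Heuristic (7) needs, for each feature of an entity, a count of how many other entities of the same type share it. The sketch below computes that share as a fraction; the feature sets are hypothetical, and how the frequency is turned into a relevance score is left open here.

```python
from collections import Counter

# Hypothetical feature sets of entities typed dbpedia-owl:Scientist.
FEATURES = {
    "dbpedia:Albert_Einstein": {
        ("dbpprop:fields", "dbpedia:Physics"),
        ("rdf:type", "dbpedia:Violinists"),
    },
    "dbpedia:Niels_Bohr": {("dbpprop:fields", "dbpedia:Physics")},
    "dbpedia:Archimedes": {("dbpprop:fields", "dbpedia:Mathematics")},
    "dbpedia:AlfredKleiner": {("dbpprop:fields", "dbpedia:Physics")},
}


def shared_frequency(entity, features):
    """Fraction of the other same-type entities sharing each feature of entity."""
    others = [fs for e, fs in features.items() if e != entity]
    counts = Counter(f for fs in others for f in fs)
    return {f: counts[f] / len(others) for f in features[entity]}


freq = shared_frequency("dbpedia:Albert_Einstein", FEATURES)
for feature, share in sorted(freq.items()):
    print(feature, round(share, 2))
```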
Heuristics for Feature Relevance
(8) Frequency of features shared with ‘neighborhood’ entities
(1) Determine the neighborhood entities N_{e,k} ⊂ E of an entity e ∈ E
(2) The frequency of features shared with neighbors determines feature importance: for all features f in FS(e) of entity e, A_{e,f} and B_{e,f} are sets of items sharing the same feature, where A_{e,f} ⊂ N_{e,k} and B_{e,f} ⊂ E
(3) The weight w_e(f) of a feature f for an entity e is determined as
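The weighting formula itself is not reproduced in this transcript. Purely as an illustration of how the two sets can be combined, the sketch below uses a TF-IDF-style weight, |A_{e,f}| · log(|E| / |B_{e,f}|). This choice is an assumption and not necessarily the weighting used in [5]; the movie data and the function name `feature_weight` are hypothetical.

```python
import math


def feature_weight(feature, neighbors, all_entities, features):
    """TF-IDF-style weight (an assumption): |A| * log(|E| / |B|)."""
    # |A_{e,f}|: neighbors of e that share the feature.
    a = sum(1 for n in neighbors if feature in features[n])
    # |B_{e,f}|: all entities in E that share the feature.
    b = sum(1 for x in all_entities if feature in features[x])
    return a * math.log(len(all_entities) / b) if b else 0.0


# Hypothetical data: entity m1 with the single neighbor m2.
FEATURES = {
    "m1": {"actor:X", "genre:Drama"},
    "m2": {"actor:X", "genre:Drama"},
    "m3": {"genre:Drama"},
    "m4": {"genre:Drama"},
}
E = list(FEATURES)
NEIGHBORS = ["m2"]

w_actor = feature_weight("actor:X", NEIGHBORS, E, FEATURES)
w_genre = feature_weight("genre:Drama", NEIGHBORS, E, FEATURES)
print(w_actor, w_genre)
```

Under this assumed weighting, a feature shared with neighbors but rare in the whole dataset scores high, while one shared by every entity scores zero.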
A Ground Truth is Hard to Find
• A ground truth for relevance evaluation of facts is hard to find. We need:
  • a sufficient number of arbitrary and independent facts
  • a sufficient number of people with different opinions to ask
• Idea: Crowd Sourcing
• Game based approach for evaluation
Game Based Evaluation Approach
• Idea: a Quiz Game. Create questions from LOD facts
• Hypotheses:
  • If the user is able to answer the question correctly, it is rather likely that the fact behind the answer is well known and of some importance for the related entity.
  • If the user is not able to answer the question or gives the wrong answer, it is rather likely that the fact behind the answer is not well known and maybe not of high importance for the related entity.
Game Based Evaluation Approach
Jörg Waitelonis, Nadine Ludwig, Magnus Knuth, Harald Sack: Whoknows? - Evaluating Linked Data Heuristics with a Quiz that cleans up DBpedia. International Journal of Interactive Technology and Smart Education (ITSE), Emerald Group, Vol. 8, 2011 (3).
http://tinyurl.com/whoknowsgame
Game Based Evaluation Approach
Lina Wolf, Magnus Knuth, Johannes P. Osterhoff, Harald Sack: RISQ! Renowned Individuals Semantic Quiz – A Jeopardy like Quiz Game for Ranking Facts. In Proc. of 7th Int. Conf. on Semantic Systems I-SEMANTICS, 07.-09. Sept., 2011, Graz, Austria, ACM, 2011, pp. 71-78.
http://apps.facebook.com/hpi-risq
Game Based Evaluation Approach
• WhoKnows? Movies! game adapted for entity summarization in the movie domain
• restricted set of entities (movies) and related properties
• Questions are generated out of RDF triples, e.g.
fb:en.pulp_fiction :hasActor fb:en.john_travolta .
Question: John Travolta is the actor of ...?
Correct Answer: Pulp Fiction
http://bit.ly/WhoKnowsMovies
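Generating a question from a triple can be sketched with a per-property text template. The template wording, the label table and the function name `make_question` below are hypothetical and only mimic the Pulp Fiction example; they are not taken from the game's implementation.

```python
# Hypothetical per-property question template and entity labels.
TEMPLATES = {":hasActor": "{obj} is the actor of ...?"}
LABELS = {
    "fb:en.pulp_fiction": "Pulp Fiction",
    "fb:en.john_travolta": "John Travolta",
}


def make_question(triple, labels, templates):
    """Turn a (subject, property, object) triple into (question, correct answer)."""
    s, p, o = triple
    return templates[p].format(obj=labels[o]), labels[s]


question, answer = make_question(
    ("fb:en.pulp_fiction", ":hasActor", "fb:en.john_travolta"), LABELS, TEMPLATES
)
print("Question:", question)
print("Correct Answer:", answer)
```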
Game Based Ground Truth
• WhoKnows? Movies has been played 690 times by 217 players
• All 2,829 triples have been played at least once, 2,314 triples at least three times.
• In total 8,308 questions have been played of which 4,716 have been answered correctly.
• Overall result of relevance ranking: http://yovisto.com/labs/iswc2012/
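Following the hypotheses of the game, a per-triple relevance estimate can be sketched as the share of correctly answered questions. The sketch below assumes a simple answer log of (triple id, answered correctly) pairs, which is hypothetical; the threshold of three plays echoes the figures above, and the overall ratio uses the reported totals.

```python
from collections import defaultdict


def correct_ratios(answers, min_plays=3):
    """Share of correct answers per triple, for triples asked at least min_plays times."""
    asked = defaultdict(int)
    correct = defaultdict(int)
    for triple, is_correct in answers:
        asked[triple] += 1
        correct[triple] += int(is_correct)
    return {t: correct[t] / n for t, n in asked.items() if n >= min_plays}


# Hypothetical answer log: (triple id, answered correctly?)
LOG = [("t1", True), ("t1", True), ("t1", False), ("t2", False), ("t2", True)]
ratios = correct_ratios(LOG)
print(ratios)

# Overall share of correct answers from the reported totals.
overall = 4716 / 8308
print(round(overall, 3))
```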
Evaluation of Fact Ranking Heuristics
• Heuristics based on User-Based Entity Summarization (UBES), i.e. heuristic (8), frequency of features shared with ‘neighborhood’ entities, with usage-based neighborhood determination using data from Freebase and the HetRec2011 MovieLens2k dataset
• Comparison with results derived from Google Knowledge Graph (GKG) and random results
• Kendall τ rank correlation used to compare the orderings of GKG and UBES with our ground truth
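Kendall's τ compares two orderings by counting item pairs that appear in the same order (concordant) versus in opposite order (discordant). The sketch below implements the tie-free variant, often called τ-a; which variant was used in the evaluation is not stated here, and the example rankings are hypothetical.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall tau-a for two rankings of the same items (ordered lists, no ties)."""
    pos_a = {item: i for i, item in enumerate(rank_a)}
    pos_b = {item: i for i, item in enumerate(rank_b)}
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        # Positive product: the pair is ordered the same way in both rankings.
        product = (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n = len(rank_a)
    return (concordant - discordant) / (n * (n - 1) / 2)


ground_truth = ["a", "b", "c", "d"]   # hypothetical ground-truth ordering
system = ["a", "c", "b", "d"]         # hypothetical system ordering
tau = kendall_tau(ground_truth, system)
print(round(tau, 3))
```

A value of 1 means identical orderings, -1 means exactly reversed orderings, and values near 0 indicate no rank agreement.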
Evaluation of Fact Ranking Heuristics
• Feature Ranking for the cast of a movie
• Average difference of both UBES and GKG to random is significant
Summary of Results
• LOD fact relevance can be determined via heuristics based on statistical, linguistic, and semantic features
• Ground truth for fact relevance can be created with the help of a game based approach
• Ground truth dataset and evaluation dataset are publicly available (Thalhammer, Knuth, Sack: Evaluating Entity Summarizations Using a Game-Based Ground Truth, ISWC 2012)
Contact:
Harald Sack
Hasso-Plattner-Institut für Softwaresystemtechnik
Universität Potsdam
Prof.-Dr.-Helmert-Str. 2-3
D-14482 Potsdam
Homepage: http://bit.ly/HaraldSack-HPI http://www.yovisto.com/
Blog: http://moresemantic.blogspot.com/
E-Mail: [email protected]
Twitter: lysander07 / biblionomicon / yovisto