summarization and relevance of linked data facts

Dr. Harald Sack, Hasso-Plattner-Institut for IT-Systems Engineering, University of Potsdam. Summarization and Relevance of Linked Data Facts. 4. Leipziger Semantic Web Tag 2012, Leipzig, 25.09.2012 (Tuesday, 25 September 2012).

TRANSCRIPT

Page 1: Summarization and Relevance of Linked Data Facts

Dr. Harald Sack
Hasso-Plattner-Institut for IT-Systems Engineering
University of Potsdam

Summarization and Relevance of Linked Data Facts

4. Leipziger Semantic Web Tag 2012
Leipzig, 25.09.2012

Tuesday, 25 September 2012

Page 2: Summarization and Relevance of Linked Data Facts

• Linked Open Data and Semantics
• The Importance of Being Relevant
• The Most Important Facts
• Heuristics for Fact Relevance
• Relevance Evaluation

Page 3: Summarization and Relevance of Linked Data Facts

Dr. Harald Sack, Hasso-Plattner-Institut Potsdam, Leipziger Semantic Web Tage, 24./25.09.2012, Leipzig

There are more than 30 Billion facts in the Linked Data Universe

http://www4.wiwiss.fu-berlin.de/lodcloud/state/

Page 4: Summarization and Relevance of Linked Data Facts

State of LOD as a knowledge base

• LOD facts are encoded in RDF
• related ontologies only provide 'shallow' semantics
• poor data quality
  • inconsistencies
  • ambiguities
  • redundancies
• mapping and interlinking are still a problem

Page 5: Summarization and Relevance of Linked Data Facts

Consider a single LOD Dataset

Aldous Huxley

• e.g., Albert Einstein
• > 600 facts
• > 70 properties
• no given order of facts
• no given relevance of facts

Page 6: Summarization and Relevance of Linked Data Facts

How to find out what facts are important?

Page 7: Summarization and Relevance of Linked Data Facts

• dbpedia:Albert_Einstein rdf:type yago:AmericanVegetarians .

vs.

• dbpedia:Albert_Einstein rdf:type yago:TheoreticalPhysicist .

Which fact is more important?

Page 8: Summarization and Relevance of Linked Data Facts

• dbpedia:Albert_Einstein rdf:type yago:AmericanVegetarians .

vs.

• dbpedia:Albert_Einstein rdf:type yago:TheoreticalPhysicist .

Importance depends on the Context

Context: 'Nutrition'

Page 9: Summarization and Relevance of Linked Data Facts

• dbpedia:Albert_Einstein rdf:type yago:AmericanVegetarians .

vs.

• dbpedia:Albert_Einstein rdf:type yago:TheoreticalPhysicist .

Context: 'Science'

Importance depends on the Context

Page 10: Summarization and Relevance of Linked Data Facts

What can it be used for?

Page 11: Summarization and Relevance of Linked Data Facts

What can it be used for?

• Making Decisions
  • which fact(s) should be considered to make a decision?
  • how should the (relevant) facts be weighted to make the right decision?
  • Applications: Recommendation, Exploratory Search, ...
• Creating Summarizations
  • "entity summarization ... produce a version of the original [entity] description that is more concise, yet containing sufficient information for users to quickly identify the underlying entity." [1]

[1] Gong Cheng, Thanh Tran, and Yuzhong Qu. RELIN: relatedness and informativeness-based centrality for entity summarization. In: Proc. of the 10th Int. Conf. on The Semantic Web - Volume Part I, ISWC'11. Bonn, Germany: Springer-Verlag, 2011, pp. 114–129.

Page 12: Summarization and Relevance of Linked Data Facts

Google Knowledge Graph

• Summarizations of facts related to a given entity

• on average 192 facts per entity [2]

[2] Gong Cheng, Thanh Tran, and Yuzhong Qu. RELIN: relatedness and informativeness-based centrality for entity summarization. In Proc. of the 10th int. conf. on The Semantic Web - Vol. Part I, ISWC’11, pages 114–129, Berlin, Heidelberg, 2011. Springer-Verlag.

Page 13: Summarization and Relevance of Linked Data Facts

Google Knowledge Graph

Page 14: Summarization and Relevance of Linked Data Facts

Google Knowledge Graph

• the Google users decide with their queries what is important
• Queries:
  • subject + object: "Einstein ETH Zürich"
  • subject + property: "Einstein birthplace"
• Collaborative Filtering

Page 15: Summarization and Relevance of Linked Data Facts

Entity Summarization

RDF Graph:

• dbpedia:Albert_Einstein rdf:type yago:AmericanVegetarian .
• dbpedia:Albert_Einstein rdf:type yago:TheoreticalPhysicist .
• dbpedia:Albert_Einstein foaf:givenName "Albert" .
• dbpedia:Albert_Einstein foaf:familyName "Einstein" .

Page 16: Summarization and Relevance of Linked Data Facts


Entity Summarization

RDF Graph:

• dbpedia:Albert_Einstein rdf:type yago:AmericanVegetarian .
• dbpedia:Albert_Einstein rdf:type yago:TheoreticalPhysicist .
• dbpedia:Albert_Einstein foaf:givenName "Albert" .
• dbpedia:Albert_Einstein foaf:familyName "Einstein" .

FS(dbpedia:Albert_Einstein):

f1 <rdf:type, yago:AmericanVegetarian>
f2 <rdf:type, yago:TheoreticalPhysicist>
f3 <foaf:givenName, "Albert">
f4 <foaf:familyName, "Einstein">

Page 17: Summarization and Relevance of Linked Data Facts

Entity Summarization

Given a feature set FS(e) of an entity e and a positive integer k < |FS(e)|, the problem of entity summarization is to select Summ(e) ⊂ FS(e) such that |Summ(e)| = k. Summ(e) is called a summary of e. [3]

[3] Gong Cheng, Thanh Tran, and Yuzhong Qu. RELIN: relatedness and informativeness-based centrality for entity summarization. In: Proc. of the 10th Int. Conf. on The Semantic Web - Volume Part I, ISWC'11. Bonn, Germany: Springer-Verlag, 2011, pp. 114–129.
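
The definition above can be read as a top-k selection over FS(e). The following is a minimal Python sketch of that reading, not the RELIN method itself: the function names and the toy relevance scores are assumptions made for illustration, and any scoring function could be plugged in.

```python
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]   # (subject, property, value)
Feature = Tuple[str, str]       # (property, value)


def feature_set(triples: List[Triple], entity: str) -> List[Feature]:
    """Collect FS(e): every (property, value) pair whose subject is the entity."""
    return [(p, o) for s, p, o in triples if s == entity]


def summarize(features: List[Feature], k: int,
              score: Callable[[Feature], float]) -> List[Feature]:
    """Return Summ(e): the k highest-scoring features, with 0 < k < |FS(e)|."""
    if not 0 < k < len(features):
        raise ValueError("k must satisfy 0 < k < |FS(e)|")
    return sorted(features, key=score, reverse=True)[:k]


TRIPLES = [
    ("dbpedia:Albert_Einstein", "rdf:type", "yago:AmericanVegetarian"),
    ("dbpedia:Albert_Einstein", "rdf:type", "yago:TheoreticalPhysicist"),
    ("dbpedia:Albert_Einstein", "foaf:givenName", "Albert"),
    ("dbpedia:Albert_Einstein", "foaf:familyName", "Einstein"),
]

# Hypothetical relevance scores, invented for this example only.
TOY_SCORES = {
    ("rdf:type", "yago:AmericanVegetarian"): 0.2,
    ("rdf:type", "yago:TheoreticalPhysicist"): 0.9,
    ("foaf:givenName", "Albert"): 0.6,
    ("foaf:familyName", "Einstein"): 0.7,
}

fs = feature_set(TRIPLES, "dbpedia:Albert_Einstein")
summary = summarize(fs, k=2, score=lambda f: TOY_SCORES[f])
```

With these toy scores the summary keeps the physicist type and the family name; the hard part, which the remaining slides address, is where such scores come from.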

Page 18: Summarization and Relevance of Linked Data Facts

How to determine relevant LOD facts?

graph analysis

(usage) statistics

semantic analysis

linguistic analysis

Page 19: Summarization and Relevance of Linked Data Facts

Heuristics for Feature Relevance

(1) (Manual) Whitelisting
(2) RDF-properties connecting objects of same rdf:type
(3) Inverse and symmetric RDF-properties
(4) Disambiguations
(5) Bidirectional (wiki)Links
(6) Linguistic Co-occurrences
(7) Frequency of features shared with entities of same type
(8) Frequency of features shared with 'neighborhood' entities [4,5]

[4] Jörg Waitelonis and Harald Sack. Towards exploratory video search using linked data. Multimedia Tools and Applications, 59:645–672, 2012.
[5] Andreas Thalhammer, Ioan Toma, Antonio J. Roa-Valverde, and Dieter Fensel. Leveraging usage data for linked data movie entity summarization. In Proc. of the 2nd Int. Ws. on Usage Analysis and the Web of Data (USEWOD2012).

Page 20: Summarization and Relevance of Linked Data Facts

Heuristics for Feature Relevance
(1) (Manual) Whitelisting

[Figure: RDF graph. dbpedia:Albert_Einstein (rdf:type dbpedia-owl:Person) is linked via dbpedia-owl:residence to dbpedia:Switzerland (rdf:type dbpedia-owl:Place).]

Locations are (manually) considered to be important for persons...
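
A whitelist of this kind can be sketched as a set of (subject type, object type) pairs. The Python below is an illustrative assumption about how such a check might look, not the implementation behind the slides; the names WHITELIST, types_of and whitelisted_facts are hypothetical.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical, hand-maintained whitelist: "places matter for persons".
WHITELIST: Set[Tuple[str, str]] = {("dbpedia-owl:Person", "dbpedia-owl:Place")}


def types_of(triples: List[Triple], node: str) -> Set[str]:
    """All rdf:type values asserted for a node."""
    return {o for s, p, o in triples if s == node and p == "rdf:type"}


def whitelisted_facts(triples: List[Triple], entity: str) -> List[Tuple[str, str]]:
    """Facts of the entity whose (subject type, object type) pair is whitelisted."""
    subject_types = types_of(triples, entity)
    hits = []
    for s, p, o in triples:
        if s != entity or p == "rdf:type":
            continue
        object_types = types_of(triples, o)
        if any((st, ot) in WHITELIST for st in subject_types for ot in object_types):
            hits.append((p, o))
    return hits


TRIPLES = [
    ("dbpedia:Albert_Einstein", "rdf:type", "dbpedia-owl:Person"),
    ("dbpedia:Switzerland", "rdf:type", "dbpedia-owl:Place"),
    ("dbpedia:Albert_Einstein", "dbpedia-owl:residence", "dbpedia:Switzerland"),
    ("dbpedia:Albert_Einstein", "foaf:givenName", "Albert"),
]

relevant = whitelisted_facts(TRIPLES, "dbpedia:Albert_Einstein")
```

Here only the residence fact passes, because its object is typed as a place; the literal name has no type and is ignored by this heuristic.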

Page 21: Summarization and Relevance of Linked Data Facts

Heuristics for Feature Relevance
(2) RDF-properties connecting objects of same rdf:type

[Figure: RDF graph. dbpedia:Albert_Einstein and dbpedia:AlfredKleiner are connected by dbpedia-owl:doctoralAdvisor; rdf:type edges link the entities dbpedia:Albert_Einstein, dbpedia:AlfredKleiner and dbpedia:BillCosby to the classes yago:AmericanVegetarian and dbpedia-owl:Scientist.]

Page 22: Summarization and Relevance of Linked Data Facts

Heuristics for Feature Relevance
(3) Inverse and Symmetric RDF properties

[Figure: RDF graph. dbpedia:Albert_Einstein and dbpedia:AlfredKleiner are both of rdf:type dbpedia-owl:Scientist and are linked by dbpedia-owl:doctoralAdvisor in one direction and dbpedia-owl:doctoralStudent in the other.]

dbpedia-owl:doctoralAdvisor owl:inverseOf dbpedia-owl:doctoralStudent .
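
One way to read this heuristic: a fact counts as relevant when the graph also contains the matching statement in the opposite direction under the inverse property. The Python sketch below assumes that reading; the INVERSE table and the function names are hypothetical, and a symmetric property can be modelled as being its own inverse.

```python
from typing import Optional, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical table of owl:inverseOf declarations, stored in one direction only.
INVERSE = {"dbpedia-owl:doctoralAdvisor": "dbpedia-owl:doctoralStudent"}


def inverse_of(prop: str) -> Optional[str]:
    """Look up the inverse property in either direction, or None if undeclared."""
    if prop in INVERSE:
        return INVERSE[prop]
    for p, q in INVERSE.items():
        if q == prop:
            return p
    return None


def has_inverse_statement(triples: Set[Triple], fact: Triple) -> bool:
    """True if (o, inverse(p), s) is asserted alongside the fact (s, p, o)."""
    s, p, o = fact
    q = inverse_of(p)
    return q is not None and (o, q, s) in triples


GRAPH = {
    ("dbpedia:Albert_Einstein", "dbpedia-owl:doctoralAdvisor", "dbpedia:AlfredKleiner"),
    ("dbpedia:AlfredKleiner", "dbpedia-owl:doctoralStudent", "dbpedia:Albert_Einstein"),
    ("dbpedia:Albert_Einstein", "dbpedia-owl:residence", "dbpedia:Switzerland"),
}

advisor_fact = ("dbpedia:Albert_Einstein", "dbpedia-owl:doctoralAdvisor", "dbpedia:AlfredKleiner")
residence_fact = ("dbpedia:Albert_Einstein", "dbpedia-owl:residence", "dbpedia:Switzerland")
```

In this toy graph the advisor fact is confirmed from both sides, while the residence fact has no declared inverse and is not selected.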

Page 23: Summarization and Relevance of Linked Data Facts

Heuristics for Feature Relevance
(4) Disambiguations

[Figure: RDF graph. dbpedia:Sumerian is linked via dbpedia-owl:wikiPageDisambiguates to dbpedia:Cuneiform_script, dbpedia:Sumer, dbpedia:Sumerian_religion, dbpedia:Sumerian_language and dbpedia:Sumerian_art.]

Page 24: Summarization and Relevance of Linked Data Facts

Heuristics for Feature Relevance
(5) Bidirectional (Wiki)Links

[Figure: RDF graph. dbpedia:Albert_Einstein and dbpedia:AlfredKleiner link to each other via dbprop:wikilink in both directions.]

Page 25: Summarization and Relevance of Linked Data Facts

Heuristics for Feature Relevance
(6) Linguistic Co-occurrences

[Figure: RDF graph around dbpedia:Albert_Einstein with the features dbpprop:fields dbpedia:Physics, foaf:givenName "Albert", foaf:familyName "Einstein" and rdf:type dbpedia:Violinists.]

Page 26: Summarization and Relevance of Linked Data Facts

Heuristics for Feature Relevance
(7) Frequency of features shared with entities of same type

[Figure: RDF graph. dbpedia:Albert_Einstein, dbpedia:AlfredKleiner, dbpedia:Archimedes and dbpedia:Niels_Bohr shown with rdf:type edges to dbpedia-owl:Scientist.]

Page 27: Summarization and Relevance of Linked Data Facts

Heuristics for Feature Relevance
(8) Frequency of features shared with 'neighborhood' entities

(1) Determine neighborhood entities N_{e,k} ⊂ E of an entity e ∈ E

(2) Frequency of features shared with neighbors determines feature importance:
For all features f ∈ FS(e) of entity e: A_{e,f} and B_{e,f} are sets of items sharing the same features, where A_{e,f} ⊂ N_{e,k} and B_{e,f} ⊂ E

(3) The weight w_e(f) of a feature f for an entity e is determined as [formula not preserved in the transcript]
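
The weighting formula itself did not survive in this transcript. Purely as an illustration, the Python sketch below assumes a TF-IDF-style weight, |A_{e,f}| · log(|E| / |B_{e,f}|), which rewards features that are frequent among the neighbors but rare overall. This choice of formula, the function name and the toy movie data are all assumptions and should not be read as the formula shown on the slide.

```python
import math
from typing import Dict, Set, Tuple

Feature = Tuple[str, str]


def feature_weight(feature: Feature,
                   neighbors: Set[str],
                   all_entities: Set[str],
                   features_of: Dict[str, Set[Feature]]) -> float:
    """Assumed TF-IDF-style weight: |A| * log(|E| / |B|).

    A = neighbors that share the feature (subset of N_{e,k}),
    B = all entities that share the feature (subset of E).
    """
    a = sum(1 for n in neighbors if feature in features_of[n])
    b = sum(1 for x in all_entities if feature in features_of[x])
    if b == 0:
        return 0.0
    return a * math.log(len(all_entities) / b)


# Toy data, invented for this example.
FEATURES_OF = {
    "pulp_fiction":   {("director", "tarantino"), ("language", "english")},
    "kill_bill":      {("director", "tarantino"), ("language", "english")},
    "reservoir_dogs": {("director", "tarantino"), ("language", "english")},
    "grease":         {("director", "kleiser"), ("language", "english")},
}
E = set(FEATURES_OF)
N = {"kill_bill", "reservoir_dogs"}   # assumed neighborhood of pulp_fiction

w_director = feature_weight(("director", "tarantino"), N, E, FEATURES_OF)
w_language = feature_weight(("language", "english"), N, E, FEATURES_OF)
```

Under this assumed weighting the director outranks the language, because every entity shares the language and it therefore carries no distinguishing information.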

Page 28: Summarization and Relevance of Linked Data Facts

A Ground Truth is Hard to Find

• A ground truth for relevance evaluation of facts is hard to find. We need:
  • a sufficient number of arbitrary and independent facts
  • a sufficient number of people to ask with different opinions
• Idea: Crowd Sourcing
• Game based approach for evaluation

Page 29: Summarization and Relevance of Linked Data Facts

Game Based Evaluation Approach

• Idea: a Quiz Game
  Create Questions from LOD facts
• Hypotheses:
  • If the user is able to answer the question correctly, it is rather likely that the fact behind the answer is well known and of some importance for the related entity.
  • If the user is not able to answer the question or gives the wrong answer, it is rather likely that the fact behind the answer is not well known and maybe not of high importance for the related entity.

Page 30: Summarization and Relevance of Linked Data Facts

Game Based Evaluation Approach

Jörg Waitelonis, Nadine Ludwig, Magnus Knuth, Harald Sack: Whoknows? - Evaluating Linked Data Heuristics with a Quiz that cleans up DBpedia. International Journal of Interactive Technology and Smart Education (ITSE), Emerald Group, Vol. 8, 2011 (3).

http://tinyurl.com/whoknowsgame

Page 31: Summarization and Relevance of Linked Data Facts

Game Based Evaluation Approach

Lina Wolf, Magnus Knuth, Johannes P. Osterhoff, Harald Sack: RISQ! Renowned Individuals Semantic Quiz – A Jeopardy like Quiz Game for Ranking Facts. In Proc. of 7th Int. Conf. on Semantic Systems I-SEMANTICS, 07.-09. Sept., 2011, Graz, Austria, ACM, 2011, pp. 71-78.

http://apps.facebook.com/hpi-risq

Page 32: Summarization and Relevance of Linked Data Facts

Game Based Evaluation Approach

• WhoKnows? Movies! game adapted for entity summarization in the movie domain

• restricted set of entities (movies) and related properties

• Questions are generated out of RDF triples, e.g.

fb:en.pulp_fiction :hasActor fb:en.john_travolta .

Question: John Travolta is the actor of ...?
Correct Answer: Pulp Fiction

http://bit.ly/WhoKnowsMovies
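
Generating a question from a triple can be sketched with one text template per property. The Python below is a hedged illustration only: the TEMPLATES table, the label helper and the way labels are derived from the identifiers are assumptions, not the game's actual implementation.

```python
from typing import Tuple

Triple = Tuple[str, str, str]

# Hypothetical question templates, one per property.
TEMPLATES = {":hasActor": "{object} is the actor of ...?"}


def label(identifier: str) -> str:
    """Derive a readable label, e.g. 'fb:en.john_travolta' -> 'John Travolta'."""
    return identifier.split(".")[-1].replace("_", " ").title()


def make_question(triple: Triple) -> Tuple[str, str]:
    """Turn (subject, property, object) into (question text, correct answer)."""
    s, p, o = triple
    question = TEMPLATES[p].format(object=label(o))
    return question, label(s)


question, answer = make_question(
    ("fb:en.pulp_fiction", ":hasActor", "fb:en.john_travolta")
)
```

The object of the triple is placed in the question and the subject becomes the expected answer, which matches the Pulp Fiction example on the slide.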

Page 33: Summarization and Relevance of Linked Data Facts

Game Based Ground Truth

• WhoKnows? Movies! has been played 690 times by 217 players
• All 2,829 triples have been played at least once, 2,314 triples at least three times.
• In total 8,308 questions have been played, of which 4,716 have been answered correctly.
• Overall result of relevance ranking:

http://yovisto.com/labs/iswc2012/

Page 34: Summarization and Relevance of Linked Data Facts

Evaluation of Fact Ranking Heuristics

• Heuristics based on User-Based Entity Summarization (UBES), i.e. heuristic (8), frequency of features shared with 'neighborhood' entities, with usage-based neighborhood determination and data from Freebase and the HetRec2011 MovieLens2k dataset
• Comparison with results derived from Google Knowledge Graph (GKG) and random results
• Kendall τ rank correlation used to evaluate the orderings of GKG and UBES with our ground truth
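
For reference, Kendall's τ compares two orderings by counting item pairs that appear in the same relative order (concordant) versus the opposite order (discordant). The Python sketch below implements the plain τ-a variant for two rankings of the same items without ties; whether the evaluation used this exact variant is not stated on the slide, and the actor rankings are invented toy data.

```python
from itertools import combinations
from typing import List


def kendall_tau(rank_a: List[str], rank_b: List[str]) -> float:
    """Kendall tau-a for two tie-free rankings of the same items.

    Returns (concordant - discordant) / (n * (n - 1) / 2), a value in [-1, 1].
    """
    pos_a = {item: i for i, item in enumerate(rank_a)}
    pos_b = {item: i for i, item in enumerate(rank_b)}
    if set(pos_a) != set(pos_b):
        raise ValueError("both rankings must contain the same items")
    n = len(pos_a)
    if n < 2:
        raise ValueError("need at least two items")
    concordant = discordant = 0
    for x, y in combinations(pos_a, 2):
        # Same sign of the position differences means the pair agrees.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Toy rankings of a movie's cast, invented for illustration.
ground_truth = ["travolta", "jackson", "thurman", "willis"]
heuristic = ["travolta", "thurman", "jackson", "willis"]
tau = kendall_tau(ground_truth, heuristic)
```

A value of 1 means identical orderings, -1 means exactly reversed, and values near 0 are what a random ordering is expected to produce on average.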

Page 35: Summarization and Relevance of Linked Data Facts

Evaluation of Fact Ranking Heuristics

• Feature ranking for the cast of a movie
• The average difference of both UBES and GKG to random is significant

Page 36: Summarization and Relevance of Linked Data Facts

Summary of Results

•LOD fact relevance can be determined via heuristics based on statistical, linguistic, and semantic features

•Ground Truth for fact relevance can be created with the help of a game based approach

•Ground truth dataset and evaluation dataset are publicly available

(Thalhammer, Knuth, Sack: Evaluating Entity Summarizations Using a Game-Based Ground Truth, ISWC 2012)

Page 37: Summarization and Relevance of Linked Data Facts

Contact:
Harald Sack
Hasso-Plattner-Institut für Softwaresystemtechnik
Universität Potsdam
Prof.-Dr.-Helmert-Str. 2-3
D-14482 Potsdam

Homepage: http://bit.ly/HaraldSack-HPI
http://www.yovisto.com/
Blog: http://moresemantic.blogspot.com/
E-Mail: [email protected]
Twitter: lysander07 / biblionomicon / yovisto