Slide 1
Prof. Ansgar Scherp – [email protected]
Ansgar Scherp
Mining and Managing Large-scale Linked Open Data
GVDB, Nörten-Hardenberg, May 25, 2016
Thanks to: Chifumi Nishioka, Renata Dividino, Thomas Gottron, and many more …
Slide 2
Team Knowledge Discovery @
Ansgar Scherp
Ahmed Saleh
Chifumi Nishioka
Falk Böschen
Mohammad Abdel-Qader
Till Blume
Anke Koslowski (Secretariat)
Henrik Schmidt (Engineer)
Lukas Galke
Florian Mai
&
Slide 3
Linked Open Data (LOD) Cloud
• Publishing and interlinking data on the web
• Different quality, purpose, and sources
• Using the Resource Description Framework (RDF)

World Wide Web         LOD Cloud
Documents              Data
Hyperlinks via <a>     Typed Links
HTML                   RDF
Addresses (URIs)       Addresses (URIs)
Slide 4
Relevance of Linked Data?
Slide 5
1000+ Datasets, 50+ Billion Triples
[Figure: LOD cloud diagrams, Linked Data: May ’07 and August ’14]
Domains: Media, Geographic, Publications, Web 2.0, eGovernment, Cross-Domain, Life Sciences, Social Networking
Source: http://lod-cloud.net
Slide 6
LOD on One Slide: Example Graph
RDF Triple: biglynx:matt-briggs (Subject) rdf:type (Predicate) foaf:Person (Object)
Fully qualified URI using vocabulary prefixes:
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix biglynx: <http://biglynx.co.uk/people/> .
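To make the prefix mechanism concrete, here is a minimal sketch of how the prefixed names on this slide resolve to full URIs. The helper `expand` is hypothetical and not part of any RDF library; the rdf namespace is written with `www.w3.org`, as in the standard RDF vocabulary.

```python
# Prefix table taken from the slide; `expand` is a hypothetical helper.
PREFIXES = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "biglynx": "http://biglynx.co.uk/people/",
}


def expand(qname: str) -> str:
    """Turn a prefixed name such as 'foaf:Person' into a full URI."""
    prefix, local = qname.split(":", 1)
    return PREFIXES[prefix] + local


# The triple from the slide as (subject, predicate, object).
triple = tuple(expand(t) for t in ("biglynx:matt-briggs", "rdf:type", "foaf:Person"))
```

Here `triple[0]` becomes `http://biglynx.co.uk/people/matt-briggs`, the subject URI of the example.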
Slide 7
LOD on One Slide: Example Graph
[Figure: example RDF graph]
Nodes: biglynx:matt-briggs, foaf:Person, biglynx:Director, …
Edge labels: rdf:type, rdf:type …
Fully qualified URI using vocabulary prefixes:
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix biglynx: <http://biglynx.co.uk/people/> .
Slide 8
LOD on One Slide: Example Graph
[Figure: example RDF graph]
Nodes: biglynx:matt-briggs, biglynx:dave-smith, foaf:Person, biglynx:Director, dp:London, _1:point, “-0.118”, “51.509”, …
Edge labels: rdf:type, foaf:knows, foaf:based_near, ex:loc, wgs84:lat, wgs84:long
Annotations: Types, Properties, Entity
Slide 9
Motivation for the SchemEX Index
• Single entry point to query the LOD cloud
• Search for data sources containing entities like
– ‘Persons, who are Politicians and Actors’
– ‘Research data sets’
– ‘Scientific publications’
Query
    SELECT ?x
    FROM …
    WHERE {
      ?x rdf:type ex:Actor .
      ?x rdf:type ex:Politician .
    }
[Figure: Query, Index, steps 1 and 2]
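The lookup idea behind the query on this slide can be sketched as follows: a toy schema-level index maps sets of RDF types to the data sources in which entities with exactly those types were seen. The index contents, the `example.org` source names, and the function `sources_for` are illustrative assumptions; the actual SchemEX index structure is richer than this.

```python
from typing import Dict, FrozenSet, Set

# Hypothetical toy index: each key is a set of RDF types observed together
# on some entity, each value the data sources (contexts) where that happened.
INDEX: Dict[FrozenSet[str], Set[str]] = {
    frozenset({"ex:Actor", "ex:Politician"}): {"http://example.org/a.rdf"},
    frozenset({"ex:Actor"}): {"http://example.org/b.rdf"},
    frozenset({"ex:Actor", "ex:Politician", "foaf:Person"}): {"http://example.org/c.rdf"},
}


def sources_for(required_types: Set[str]) -> Set[str]:
    """Return all sources holding entities that carry every required type."""
    hits: Set[str] = set()
    for type_set, sources in INDEX.items():
        if required_types <= type_set:  # query types are a subset of the pattern
            hits |= sources
    return hits


# Corresponds to: ?x rdf:type ex:Actor . ?x rdf:type ex:Politician .
result = sources_for({"ex:Actor", "ex:Politician"})
```

Only the sources `a.rdf` and `c.rdf` are returned, because `b.rdf` contains actors that are not also politicians.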
Slide 10
Input Data for SchemEX
• Quads: <subject> <predicate> <object> <context>
• Example:
<http://biglynx.co.uk/people/matt-briggs>
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
<http://xmlns.com/foaf/0.1/Person>
<http://biglynx.co.uk/people/matt-briggs.rdf>
[Figure: triple biglynx:matt-briggs rdf:type foaf:Person with context <http://biglynx.co.uk/people/matt-briggs.rdf>; labels: LOD Cloud, Dataset]
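A quad like the example above can be split into its four parts with a few lines of code. This is a minimal sketch that only handles quads made of four `<...>` URIs; literals and blank nodes are deliberately ignored, and the names `Quad` and `parse_quad` are made up for illustration.

```python
import re
from typing import NamedTuple


class Quad(NamedTuple):
    subject: str
    predicate: str
    obj: str
    context: str


# Matches four <...> terms, optionally followed by a terminating dot.
_URI_QUAD = re.compile(r"^\s*<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.?\s*$")


def parse_quad(line: str) -> Quad:
    """Parse one URI-only quad; raise ValueError for anything else."""
    match = _URI_QUAD.match(line)
    if match is None:
        raise ValueError(f"not a URI-only quad: {line!r}")
    return Quad(*match.groups())


line = (
    "<http://biglynx.co.uk/people/matt-briggs> "
    "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> "
    "<http://xmlns.com/foaf/0.1/Person> "
    "<http://biglynx.co.uk/people/matt-briggs.rdf>"
)
quad = parse_quad(line)
```

The fourth field, `quad.context`, is the data source that an index entry would point to.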
Slide 11
SchemEX Idea
• Schema-level index SchemEX
– Assign RDF entities to graph patterns
– Map graph patterns to data sources (context)
– Defined over entities, but store the context
• Construction of schema-level index
– Stream-based for scalability
– Stratified bi-simulation for detecting patterns
– Little loss of accuracy
[KGS+12]
Slide 12
Building the Index from a Stream
• Stream of quads coming from a LD crawler
… Q16, Q15, Q14, Q13, Q12, Q11, Q10, Q9, Q8, Q7, Q6, Q5, Q4, Q3, Q2, Q1
[Figure: FiFo cache over the stream; numbered nodes assigned to C1, C2, C3]
+ Reasonable accuracy at cache size of 50k
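The trade-off between cache size and accuracy can be illustrated with a small sketch. It assumes a strongly simplified index that only groups entities by their set of rdf:type values and uses a bounded first-in-first-out cache of entities; the function `build_index` and the toy stream are hypothetical, `cache_size` is assumed to be at least 1, and the actual SchemEX construction (including bi-simulation) is not reproduced here.

```python
from collections import OrderedDict, defaultdict

RDF_TYPE = "rdf:type"


def build_index(quads, cache_size):
    """Map type sets to contexts using a bounded FIFO cache of entities."""
    index = defaultdict(set)  # frozenset of types -> set of contexts
    cache = OrderedDict()     # subject -> (types, contexts), oldest first

    def flush(types, contexts):
        if types:
            index[frozenset(types)] |= contexts

    for subject, predicate, obj, context in quads:
        if subject not in cache:
            if len(cache) >= cache_size:
                # Evict the oldest entity and write what is known about it.
                _, (types, contexts) = cache.popitem(last=False)
                flush(types, contexts)
            cache[subject] = (set(), set())
        types, contexts = cache[subject]
        contexts.add(context)
        if predicate == RDF_TYPE:
            types.add(obj)

    for types, contexts in cache.values():  # flush what is left at stream end
        flush(types, contexts)
    return dict(index)


stream = [
    ("ex:a", RDF_TYPE, "ex:Actor", "c1"),
    ("ex:b", RDF_TYPE, "ex:Actor", "c2"),
    ("ex:a", RDF_TYPE, "ex:Politician", "c1"),
]
index = build_index(stream, cache_size=2)
small = build_index(stream, cache_size=1)
```

With `cache_size=2` the entity `ex:a` is indexed under the complete type set {ex:Actor, ex:Politician}. With `cache_size=1` it is evicted too early and its two types end up in separate index entries, which is the kind of accuracy loss a small window causes.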
Slide 13
Full BTC 2011 Data Set: 2.17 Bn Triples
Cache size: 50 k
Winner BTC’11
+ Linear runtime with respect to number of triples
+ Memory consumption scales with window size
Slide 14
[GSK+13]
Generalization
Specialization
Result list with examples
Inspired by Google
Slide 15
LODatio Under the Hood
• Hybrid database with off-the-shelf components
[Figure: architecture. Labels: SPARQL, Query translation, Retrieve Data Sources, Rank, Snippets, Generalize, Specialize, Count, Select]
Slide 16
LOD on One Slide: Recap
[Figure: the example graph from Slide 8, annotated with Type Set (TS) and Property Set (PS)]
Information theoretic analyses of LOD
• How much information is encoded in TS and PS?
• … information encoded, once TS or PS is known?
• … to which degree are TS and PS redundant?
• Example: 20% of PLDs do not need TS (6% for PS)
[GKS15]
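The questions on this slide map onto standard entropy measures. The following sketch computes the entropy of the type sets and the conditional entropy of property sets given type sets for a handful of made-up entities; the data and function names are illustrative assumptions, not the analysis code behind [GKS15].

```python
from collections import Counter
from math import log2


def entropy(items):
    """Shannon entropy (in bits) of the empirical distribution over items."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())


def conditional_entropy(pairs):
    """H(Y | X) for a list of (x, y) pairs, computed as H(X, Y) - H(X)."""
    return entropy(pairs) - entropy([x for x, _ in pairs])


# Hypothetical entities, each described by (type set, property set).
entities = [
    (frozenset({"foaf:Person"}), frozenset({"foaf:knows"})),
    (frozenset({"foaf:Person"}), frozenset({"foaf:name"})),
    (frozenset({"ex:City"}), frozenset({"wgs84:lat", "wgs84:long"})),
    (frozenset({"ex:City"}), frozenset({"wgs84:lat", "wgs84:long"})),
]
h_ts = entropy([ts for ts, _ in entities])   # H(TS)
h_ps_given_ts = conditional_entropy(entities)  # H(PS | TS)
```

In this toy data H(TS) is 1 bit and H(PS | TS) is 0.5 bits: knowing the type set removes all uncertainty about the property set for cities, but not for persons.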
Slide 17
Käfer et al.’s Temporal Analysis of LOD
• Data on the cloud changes a lot
• 29 weekly LOD snapshots of ~100 million triples
• Still running since May 2012 (now 200+ weeks)
• But vocabularies defining RDF types and properties are highly static, e.g., RDF, FOAF
Changes?
[Figure: LOD cloud ~2012 and LOD cloud ~2014]
[Käfer et al., 2013] T. Käfer, A. Abdelrahman, J. Umbrich, P. O'Byrne, A. Hogan: Observing Linked Data Dynamics. ESWC 2013: 213-227
Slide 18
But: Do Changes Occur in PS and TS?
• Analysis: expected conditional entropy over time
• $H(X \mid Y)$: entropy of $X$ given that $Y$ is known
[Figure: $H(PS \mid TS = ts)$ and $H(TS \mid PS = ps)$ over time]
• Observation: types become less important
• Changes in the use of TS and PS?!
Slide 19
Changes over Time
• Extended characteristic sets: ECS = PS ∪ TS
[Figure: # of ECS per week. Avg.: 83.898 ECS per week]
[DSG+13]
• Avg. 73% of ECS re-occur next week (orange)
• Avg. 35% of ECS remain unchanged (blue)
• Avg. 20% of entity sets of ECS change / week
[Neumann and Moerkotte, 2011] Thomas Neumann, Guido Moerkotte: Characteristic sets: Accurate cardinality estimation for RDF queries with multiple joins. ICDE 2011: 984-994
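A re-occurrence figure like the one reported on this slide can be computed along the following lines. The sketch assumes that an ECS is the union of an entity's property set and type set, and that a snapshot is given as a mapping from subject URI to a (type set, property set) pair; both function names and the two toy snapshots are made up.

```python
def extended_characteristic_sets(entities):
    """Return the distinct ECS (assumed here: PS united with TS) of a snapshot.

    `entities` maps a subject URI to a (type_set, property_set) pair.
    """
    return {frozenset(ts | ps) for ts, ps in entities.values()}


def reoccurrence_rate(ecs_this_week, ecs_next_week):
    """Fraction of this week's ECS that appear again next week."""
    if not ecs_this_week:
        return 0.0
    return len(ecs_this_week & ecs_next_week) / len(ecs_this_week)


week1 = {
    "ex:a": ({"foaf:Person"}, {"foaf:knows"}),
    "ex:b": ({"ex:City"}, {"wgs84:lat"}),
}
week2 = {
    "ex:a": ({"foaf:Person"}, {"foaf:knows"}),
    "ex:b": ({"ex:City"}, {"wgs84:lat", "wgs84:long"}),
}
rate = reoccurrence_rate(
    extended_characteristic_sets(week1), extended_characteristic_sets(week2)
)
```

In this toy case one of the two ECS from the first week survives unchanged, so `rate` is 0.5.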
Slide 20
Temporal Dynamics of the Entities?
• Notion of entity motivated by ECS: entity is a set of triples sharing the same subject URI
• Example:
– 1 entity
– 4 triples
w.l.o.g.
• Useful to keep LOD caches up-to-date?
• Can we predict when LOD sources will change?
Slide 21
Dynamics Function $\Theta$
• Definition of $\Theta$ over change rate function $\delta$
[Figure: $X$ over time between $t_i$ and $t_j$; $\Theta$ is monotone and non-negative]
• Approximation as step function over changes:
$\Theta_{t_i}^{t_j}(X) \approx \sum_{k=i+1}^{j} \delta(X_{t_{k-1}}, X_{t_k})$
[DGS+14]
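The step-function approximation on this slide sums the change rate over consecutive snapshots. Below is a minimal sketch, assuming snapshots are plain sets of triples and using the Jaccard distance as one possible choice of the change rate function; the slide itself does not fix a particular distance, and both function names are hypothetical.

```python
def delta(old, new):
    """Assumed change rate: share of triples added or removed (Jaccard distance)."""
    union = old | new
    if not union:
        return 0.0
    return len(old ^ new) / len(union)


def dynamics(snapshots, i, j):
    """Step-function approximation: sum of delta over snapshots i+1 .. j."""
    return sum(delta(snapshots[k - 1], snapshots[k]) for k in range(i + 1, j + 1))


# Hypothetical snapshots of one source at t_0, t_1, t_2, as sets of triples.
snapshots = [
    {("ex:a", "foaf:knows", "ex:b")},
    {("ex:a", "foaf:knows", "ex:b"), ("ex:a", "foaf:knows", "ex:c")},
    {("ex:a", "foaf:knows", "ex:c")},
]
theta = dynamics(snapshots, 0, 2)
```

Each step contributes 0.5 here, so `theta` is 1.0. Because every summand is non-negative, the value can only grow as the interval is extended, which matches the monotone, non-negative behaviour noted on the slide.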
Slide 22
Update Strategies for LOD Sources
• Apply strategies from keeping caches of WWW documents up-to-date to maintain LOD caches
• Assumptions
– LOD is fetched from various sources
– Sources are scored and prioritized based on strategy
– Data of a source is fetched only when the operation can be entirely executed
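The assumptions above suggest a simple scheduling loop: rank the sources by the score of the chosen strategy and fetch a source only if it fits completely into what is left of the budget. The sketch below is one possible reading; in particular, skipping a source that does not fit and continuing with the next one is an assumption, as are the function name, the scores, and the sizes.

```python
def plan_update(scores, sizes, budget):
    """Pick sources to refresh, best score first, within a size budget.

    A source is only selected if it fits entirely into the remaining budget;
    skipping a too-large source and trying the next one is an assumption.
    """
    selected = []
    remaining = budget
    for source in sorted(scores, key=scores.get, reverse=True):
        if sizes[source] <= remaining:
            selected.append(source)
            remaining -= sizes[source]
    return selected


# Hypothetical scores (higher = more urgent) and source sizes in triples.
scores = {"a.example": 0.9, "b.example": 0.7, "c.example": 0.2}
sizes = {"a.example": 60, "b.example": 50, "c.example": 30}
plan = plan_update(scores, sizes, budget=100)
```

With a budget of 100 triples the plan contains `a.example` and `c.example`: the second-ranked source `b.example` is skipped because it no longer fits once `a.example` has been fetched.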
Slide 23
Scheduling Update Strategies
a) HTTP Header [Dividino et al., 2014a]
b) Age or Last Visited [Dasdan et al., 2009, Cho and Garcia-Molina, 2000]
c) PageRank [Page et al., 1999, Boldi et al., 2004, Baeza-Yates et al., 2005]
d) LOD Source Size
e) Change Ratio [Douglis et al., 1997, Cho et al., 2002, Tan et al., 2007]
f) Change Rate [Olston et al., 2002, Ntoulas et al., 2004, Dividino et al., 2013]
g) History Information: Dynamics [Dividino et al., 2014b]
We borrow strategies developed for the WWW and metrics for data change analysis in the LOD cloud.
Slide 24
e) Change Ratio
• Captures the change frequency of the data (freshness)
• Percentage of data items in the cache that are up-to-date
[Figure: ranking over time, from sources which changed (most) to sources that did not change or changed less]
Slide 25
f) Change Rate
• Data from sources which are less similar to their previous update (snapshot) should be updated first
• Comparison of two RDF data sets
– $X$: Set of triple statements
– $\delta$: Numeric expression (distance)
[Figure: example of $\delta$ between $t_i$ and $t_j$ over time]
Slide 26
g) History Information: Dynamics
• Data from sources which evolve most in a given period of time should be updated first
• Uses both history information and change rate
[Figure: $X$ over time between $t_i$ and $t_j$]
$\Theta_{t_i}^{t_j}(X) \approx \sum_{k=i+1}^{j} \delta(X_{t_{k-1}}, X_{t_k})$
Slide 27
Evaluation
Idea: simulation of limitations of available computational resources (network bandwidth, computation time)
[Figure: update from $t_i$ to $t_{i+1}$ at 100%]
Which sources to prioritise in an update?
Slide 28
Evaluation: Single Step Update
Which strategy is the most appropriate one to keep the cache up-to-date?
[Figure: single update from $t_i$ to $t_{i+1}$ at relative bandwidths of 5%, 15%, 40%, 60%, 75%, 95%, and 100%]
Slide 29
Evaluation: Iterative Updates
Simulates a LOD search engine continuously updating its caches
[Figure: repeated updates at $t_i$, $t_{i+1}$, $t_{i+2}$, … at relative bandwidths of 5%, 15%, 40%, 60%, 75%, 95%, and 100%]
Slide 30
Dataset
• Dynamic Linked Data Observatory
• Weekly snapshots, 14 M triples
• 154 snapshots (approx. 3 years)
• 590 data sources (PLD)

Top 10 largest data sources       Average size
dbpedia.org                       3,406,364.5
edgarwrap.ontologycentral.com     982,631.0
dbtune.org                        864,107.6
dbtropes.org                      787,299.9
data.linkedct.org                 498,986.3
aims.fao.org                      416,708.9
www.legislation.gov.uk            399,601.6
kent.zpr.fer.hr                   387,034.8
identi.ca                         278,316.2
webenemasuno.linkeddata.es        250,557.9
Slide 31
Metrics: Precision & Recall
• Precision: portion of cached data that is actually up-to-date
• Recall: portion of data in the LOD cloud that is identical to the cached data
[Figure: overlap of cached data and actual data on the LOD cloud (w.r.t. the 590 sources considered)]
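Both metrics can be written down directly when the cache and the current state of the LOD cloud are represented as sets of triples. This is a minimal sketch under that assumption; the function name, the toy triples, and the convention of returning 1.0 for an empty set are illustrative choices.

```python
def precision_recall(cached, actual):
    """Compare cached triples with the triples currently on the LOD cloud."""
    up_to_date = cached & actual
    precision = len(up_to_date) / len(cached) if cached else 1.0
    recall = len(up_to_date) / len(actual) if actual else 1.0
    return precision, recall


cached = {("ex:a", "foaf:knows", "ex:b"), ("ex:a", "foaf:name", "Matt")}
actual = {
    ("ex:a", "foaf:knows", "ex:b"),
    ("ex:a", "foaf:knows", "ex:c"),
    ("ex:a", "foaf:age", "42"),
    ("ex:a", "foaf:mbox", "mailto:a@example.org"),
}
precision, recall = precision_recall(cached, actual)
```

Here one of the two cached triples is still valid (precision 0.5), and the cache covers one of the four triples that are currently published (recall 0.25).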
Slide 32
Results: Single Step Update
[Figure: single update at relative bandwidths of 5%, 15%, 40%, 60%, 75%, 95%, and 100%; two result plots at 5%]
Slide 33
Results: Iterative Updates
[Figure: iterative updates at relative bandwidths of 5%, 15%, 40%, 60%, 75%, 95%, and 100%; two result plots at 5%]
Slide 34
Results: Iterative Updates
[Figure: iterative updates at relative bandwidths of 5%, 15%, 40%, 60%, 75%, 95%, and 100%; two result plots at 15%]
Slide 35
Results: Iterative Updates
[Figure: iterative updates at relative bandwidths of 5%, 15%, 40%, 60%, 75%, 95%, and 100%; two result plots at 40%]
Slide 36
Results: Summary
Best strategies: ones which capture the change behaviour over time
Especially for low relative bandwidth
Slide 37
Dynamics Function $\Theta$: Revisited
[Figure: $X$ over time between $t_i$ and $t_j$]
• Can we predict when LOD sources will change?
• Notion of dynamics to compute periodicities!
• Dynamics as vector of changes:
Slide 38
Temporal Clustering of Entities
• Dynamics as vector:
[Figure: change (log scale) over time]
• Clustering with k-means++ to find patterns
• 165 snapshots
• 65,044 entities
• 7 patterns (after optimizing k)
[NS15]
Slide 39
Periodicity of Entity Dynamics
• Examples: ,
• Convolution-based algorithm [Elfeky et al., 2005]

Cluster   # of entities   Most likely periodicity
C1        12,982          66
C2        168             23
C3        35              1
C4        12              1
C5        1               1
C6        1,541           56
C7        30              37
CS        50,275

• Entities of found in several clusters (C1, C3, C4, C5, C6)
• No changes (CS): 77.29%
• CS: entities from and
[Elfeky et al., 2005] Mohamed G. Elfeky, Walid G. Aref, Ahmed K. Elmagarmid: Periodicity Detection in Time Series Databases. IEEE Trans. Knowl. Data Eng. 17(7): 875-887 (2005)
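To give an intuition for what a "most likely periodicity" is, the following sketch scores each candidate lag by how often a change is followed by another change exactly that many snapshots later. This is a deliberately simplified, autocorrelation-style stand-in and not the convolution-based algorithm of [Elfeky et al., 2005]; the function name and the toy change vectors are assumptions.

```python
def most_likely_period(changes):
    """Return the lag p at which change events most consistently repeat.

    `changes` is a 0/1 list with one entry per snapshot. Returns None if
    the entity never changes.
    """
    n = len(changes)
    best_period, best_score = None, 0.0
    for p in range(1, n // 2 + 1):
        # Positions with a change that still have a partner p steps later.
        starts = [i for i in range(n - p) if changes[i]]
        if not starts:
            continue
        score = sum(changes[i + p] for i in starts) / len(starts)
        if score > best_score:  # strict comparison: ties keep the smaller period
            best_period, best_score = p, score
    return best_period


# Hypothetical entity that changes in every third snapshot.
every_third = [1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0]
period = most_likely_period(every_third)
static = most_likely_period([0] * 12)
```

For the first vector the detected period is 3; for an entity without any changes the function returns `None`, which corresponds to the static group CS on the slide.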
Slide 40
Application Areas: More than One!
• Searching for LOD sources [GSK+13, KGS+12]
• Strategies for updating data caches [DGS15]
• Programming queries against LOD [SSS12]
• Recommending LOD vocabularies [SGS16]
Foundation for Future Data-driven Applications
Slide 41
Summary: KDD in Social Media & DL
How to deal with the vast amount of content related to research and innovation?
• H2020 INSO-4 project, duration: 04/2016-03/2019
• Data mining & visualization tools enabling information professionals to deal with large corpora
• Website: http://www.moving-project.eu/
Slide 42
Got Interested?
Knowledge Discovery at ZBW
Contact me!
Prof. Dr. Ansgar Scherp
• Email: [email protected]
• Twitter: https://twitter.com/ansgarscherp
• Slideshare: http://de.slideshare.net/ascherp
• KD-Website:
http://www.zbw.eu/en/research/knowledge-discovery/
http://www.kd.informatik.uni-kiel.de/en/
Slide 43
References
[DGS15] R. Dividino, T. Gottron, A. Scherp: Strategies for Efficiently Keeping Local Linked Open Data Caches Up-To-Date. International Semantic Web Conference (2) 2015: 356-373
[DGS+14] R. Dividino, T. Gottron, A. Scherp, G. Gröner: From Changes to Dynamics: Dynamics Analysis of Linked Open Data Sources. PROFILES@ESWC 2014
[GKS15] T. Gottron, M. Knauf, A. Scherp: Analysis of schema structures in the Linked Open Data graph based on unique subject URIs, pay-level domains, and vocabulary usage. Distributed and Parallel Databases 33(4): 515-553 (2015)
[DSG+13] R. Dividino, A. Scherp, G. Gröner, T. Gottron: Change-a-LOD: Does the Schema on the Linked Data Cloud Change or Not? COLD 2013
[GSK+13] T. Gottron, A. Scherp, B. Krayer, A. Peters: LODatio: using a schema-level index to support users in finding relevant sources of linked data. K-CAP 2013: 105-108
[KGS+12] M. Konrath, T. Gottron, S. Staab, A. Scherp: SchemEX - Efficient construction of a data catalogue by stream-based indexing of linked data. J. Web Sem. 16: 52-58 (2012)
[NS15] C. Nishioka, A. Scherp: Temporal Patterns and Periodicity of Entity Dynamics in the Linked Open Data Cloud. K-CAP 2015
[SGS16] J. Schaible, T. Gottron, A. Scherp: TermPicker: Enabling the Reuse of Vocabulary Terms by Exploiting Data from the Linked Open Data Cloud. ESWC, Springer, 2016
[SSS12] S. Scheglmann, A. Scherp, S. Staab: Declarative Representation of Programming Access to Ontologies. ESWC 2012: 659-673
Slide 44
a) HTTP Header
• Data from sources which have been changed since the last update should be updated first
HTTP Response
HEADER
…
Last-Modified: Tue, 15 Nov 1994 12:45:26 GMT
CONTENT
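Checking the header from this slide against the time of the last cache update takes only a few lines with the Python standard library. The sketch simply trusts the value the server reports; the function name `changed_since` and the two comparison dates are made up for illustration.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def changed_since(last_modified_header, last_update):
    """True if the Last-Modified value is later than our last update time."""
    modified = parsedate_to_datetime(last_modified_header)
    return modified > last_update


header = "Tue, 15 Nov 1994 12:45:26 GMT"
# Cache last refreshed before the modification: the source should be updated.
needs_update = changed_since(header, datetime(1994, 11, 1, tzinfo=timezone.utc))
# Cache last refreshed after the modification: nothing to do.
still_fresh = not changed_since(header, datetime(1994, 12, 1, tzinfo=timezone.utc))
```

Both timestamps are timezone-aware, so the comparison is well defined; a naive `datetime` for `last_update` would raise a `TypeError` here.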
Slide 45
b) Age or Last Visited
• Time elapsed from last update (the difference between query time and last update time)
• It guarantees that every source is updated after a period
[Figure: ranking, from sources that were updated longest ago to sources that have been recently updated]
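The age-based ranking can be sketched as a sort over the time elapsed since each source was last updated. Source names, dates, and the function `rank_by_age` are hypothetical.

```python
from datetime import datetime


def rank_by_age(last_updates, now):
    """Order sources from longest-ago update to most recent update."""
    return sorted(last_updates, key=lambda s: now - last_updates[s], reverse=True)


# Hypothetical last-update times per source.
last_updates = {
    "a.example": datetime(2016, 5, 1),
    "b.example": datetime(2016, 5, 20),
    "c.example": datetime(2016, 4, 10),
}
ranking = rank_by_age(last_updates, now=datetime(2016, 5, 25))
```

The source that has waited longest, `c.example`, comes first. Since the age of a source keeps growing until it is fetched, every source eventually reaches the top of the ranking.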
Slide 46
c) PageRank and d) Source Size
• PageRank captures popularity/importance of the LOD source
• Data from sources with highest PageRank are updated first
• LOD source size: data from the biggest/smallest LOD sources should be updated first
[Figure: ranking, from sources with higher PR to sources with lower PR]
Slide 47
Results: Single Step Update
[Figure: single update at relative bandwidths of 5%, 15%, 40%, 60%, 75%, 95%, and 100%]