Linked Data & DBpedia M. Freudenberg, K. Müller & M. Ackermann AKSW/KILT - Leipzig DBpedia Association


Post on 24-May-2018



AKSW/KILT

- Knowledge Integration and Language Technology (KILT)
- Part of Agile Knowledge Engineering and Semantic Web (AKSW)

Tim Berners-Lee

- British computer scientist
- Director of the W3C
- Inventor of the World Wide Web (1989 at CERN)
- Motivated by his frustration with disconnected islands of information (about scientists, projects and results)
- Published the first website: http://info.cern.ch/hypertext/WWW/TheProject.html

From the WWW to the Web of Data

- applying the principles of the WWW to data
- data is relationships, not only properties

TimBL’s next leap: from WWW to WOD

Use Linked Data to build a Web Of Data

- Applying the principles of the WWW to data
- Data is relationships → not only properties
- The more data you have to connect together → the more you can find out
- Using Linked Data to:
  - bridge disciplines and domains (by linking their data)
  - unlock the potential of island repositories

→ Don't hoard your data; if possible, share it.

Watch TimBL's TED talk about Linked Data.

Linked Data Principles

1. Use HTTP URIs as identifiers for resources
   → so people can look up the names
2. Provide data at the location of URIs
   → to provide data for interested parties
3. Include links to other resources
   → so people can discover more things
   → bridging disciplines and domains
   → the more linked resources, the more one can find out
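Principles 1 and 2 can be sketched as an HTTP lookup of a resource URI that asks for RDF instead of HTML. This is a minimal illustration that only prepares the request and does not send it; the helper name `build_lookup_request` is hypothetical, and whether a given server honours the `text/turtle` media type is an assumption.

```python
from urllib.request import Request


def build_lookup_request(uri: str, media_type: str = "text/turtle") -> Request:
    """Prepare (but do not send) a GET request asking for RDF about a resource."""
    # The Accept header is how a client states which representation it wants.
    return Request(uri, headers={"Accept": media_type})


req = build_lookup_request("http://dbpedia.org/resource/Siemens")
print(req.get_method(), req.full_url, req.get_header("Accept"))
```

Passing the prepared request to `urllib.request.urlopen` would perform the actual lookup.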


RDF - Resource Description Framework

- The basic statement in RDF is a so-called 'triple'.
- Statements have the form subject > predicate > object

Subject: http://dbpedia.org/resource/Siemens
Predicate: label
Object: "Siemens"

Knowledge Graphs

● Combining multiple triples yields a graph

● Resources can link to other resources inside or outside the current graph/dataset

● A knowledge base of this style is considered a Knowledge Graph (KG)

The Data (in RDF/XML)

<rdf:Description rdf:about="http://dbpedia.org/resource/Siemens">
  <rdfs:label>Siemens</rdfs:label>
  <dbo:type rdf:resource="http://dbpedia.org/resource/Aktiengesellschaft"/>
  <dbo:location rdf:resource="http://dbpedia.org/resource/Munich"/>
</rdf:Description>

<rdf:Description rdf:about="http://dbpedia.org/resource/Munich">
  <dbo:country rdf:resource="http://dbpedia.org/resource/Germany"/>
</rdf:Description>

The Data (in Turtle)

<http://dbpedia.org/resource/Siemens>
    dbo:type <http://dbpedia.org/resource/Aktiengesellschaft> ;
    rdfs:label "Siemens"@de ;
    dbo:location <http://dbpedia.org/resource/Munich> .

<http://dbpedia.org/resource/Munich>
    dbo:country <http://dbpedia.org/resource/Germany> .
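To make the idea of "data is relationships" concrete, the statements above can be held as plain triples and traversed by following links. This is a minimal sketch using only the standard library, not an RDF toolkit; the `Triple` type and the helper names are assumptions made for illustration.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


DBR = "http://dbpedia.org/resource/"
DBO = "http://dbpedia.org/ontology/"

# A graph is simply a set of triples.
graph = {
    Triple(DBR + "Siemens", DBO + "location", DBR + "Munich"),
    Triple(DBR + "Munich", DBO + "country", DBR + "Germany"),
}


def objects(graph, subject, predicate):
    """All objects reachable from `subject` via `predicate`."""
    return sorted(t.obj for t in graph
                  if t.subject == subject and t.predicate == predicate)


def country_of_company(graph, company):
    """Follow two links: company -> location -> country."""
    return [country
            for city in objects(graph, company, DBO + "location")
            for country in objects(graph, city, DBO + "country")]


print(country_of_company(graph, DBR + "Siemens"))
```

Neither triple states the company's country directly; it only emerges by connecting the two statements.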


Linked Data vs Open Data

4. Final principle: Open your data using open licenses

• Not all linked data is open

• Licensed data can still profit from using standards

• Can be enriched with links to Linked Data

• Can be accessed by standard tools


5 ★ Linked Open Data


Why publish Linked Data

• Ease of discovery through linking

• Easy to consume by humans and machines

• Reduce data redundancy

• Support collaboration & interoperability

• Add value and visibility


Benefits of Linked Data: Consumer View

- discover more related data by following links
- reuse the data of other datasets
- combine data safely from different sources
- formulate sophisticated queries → example in appendix
- query data over multiple repositories
- semantic enrichment of text resources
- semantic features for machine learning models (e.g. deep learning, word embeddings, etc.)

Benefits of Linked Data: Publisher View

- link data to any other resource on the web, thereby increasing the value of your data
- make your data discoverable (via links)
- exhaustive descriptions of large and changing domains (Gene Ontology, Human Disease Ontology)
- structured representation of large, versatile datasets (Knowledge Graphs, Thesauri, Taxonomies)
- deal with unstructured data (text) in ways no DB schema could
- data and schemata use the same format (RDF)
- store metadata alongside the actual data (e.g. DCAT)

Linked Open Data

LOD-Cloud 2014

Linked datasets under open access:
- 1014 datasets
- any subject
- over 50B triples
- over 100M links

DBpedia

● First public Knowledge Graph
● Has become the focal point of the so-called "Linked Open Data Cloud"
● Is the most universal dataset (since it's based on Wikipedia)
● Links actively to many relevant Linked Open Datasets
● Is a link destination for many other datasets

(more on DBpedia later…)

Other Linked Data Sets: Freebase

● Managed and hosted by Google until 2015
● Now (in part) subsumed by Wikidata
● Extracted structured data from Wikipedia and other sources
● Available in RDF
● Differences to DBpedia
  ○ Freebase used several sources (but DBpedia+ does as well)
  ○ Freebase could be edited directly by users
  ○ Ontology and mappings were not coordinated by a community
    → never established a community which enriched or validated the data; mostly generated by crawlers

Wikidata

● Initiated by Wikimedia Deutschland e.V. in 2012
● Free knowledge base about the world that can be read and edited by humans and machines alike
● Can offer a variety of statements from different sources
● DBpedia is extracting information from Wikidata to fuse it with knowledge from Wikipedia
● Goal is to provide a single point of truth for facts in Wikipedia across different language versions

Other Datasets

● Geonames
  ○ geographical database that covers all countries
  ○ contains over eleven million place names
  ○ e.g. http://www.geonames.org/3399415/fortaleza.html
● Linked Open Vocabularies (LOV)
  ○ keeps track of available open ontologies and provides them as a graph
  ○ search for available ontologies, open for reuse
  ○ e.g. http://lov.okfn.org/dataset/lov/vocabs/foaf
● Lexvo.org
  ○ information about languages, words, characters, and other human language-related entities
  ○ e.g. http://www.lexvo.org/page/iso639-3/deu

Excursus: Ontologies

This is a concise introduction to ontologies and their role as schemata in Linked Data.

(No worries, we keep this short ;)


Levels of Knowledge


Different Perceptions


Conceptualization


Ontologies in Computer Science

● An ontology has a common language (symbols, expressions) → Syntax
● The meaning of symbols and expressions is clear → Semantics
● Symbols and expressions with similar semantics are grouped in concepts (classes) → Conceptualization
● Concepts are organized in a hierarchical way → Taxonomy
● Concepts might be related to others → Relations
● Implicit knowledge can be made explicit → Reasoning

Ontology, a definition

“An ontology is an explicit, formal specification of a shared conceptualization.”

(after Thomas R. Gruber, 1993; this wording follows Studer et al., 1998)


Example


Axioms

Axioms are statements in the ontology that are explicitly asserted and accepted as true without proof.

Implicit knowledge can be made explicit by logical inference: → Reasoning over an ontology

Source for the ontology-related slides: http://www.slideshare.net/SergeLinckels/semantic-web-ontologies

Ontology Language

To express ontologies in a formal, machine-readable way, so that we can reason over the knowledge they describe, we need a specialized language.
→ most common: Web Ontology Language (OWL)
● represents rich and complex knowledge about things
● based on a subset of First Order Logic (FOL)
● can be used to verify the consistency of a knowledge base
● can make implicit knowledge explicit
● like the data it conceptualizes, it can be serialized in RDF
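What "making implicit knowledge explicit" means can be sketched with the simplest kind of reasoning: propagating types along a subclass hierarchy. This is a toy illustration in plain Python and not an OWL reasoner; the function name `infer_types` and the small class hierarchy are assumptions chosen for the example.

```python
def infer_types(asserted, subclass_of):
    """Return asserted classes plus all superclasses implied by the hierarchy."""
    inferred = set(asserted)
    frontier = list(asserted)
    while frontier:
        cls = frontier.pop()
        for parent in subclass_of.get(cls, ()):
            if parent not in inferred:
                inferred.add(parent)
                frontier.append(parent)
    return inferred


# Axioms (illustrative): Company is a kind of Organisation, which is a kind of Agent.
SUBCLASS_OF = {"Company": ["Organisation"], "Organisation": ["Agent"]}

# Only "Company" is stated explicitly; the other two types are derived.
print(sorted(infer_types({"Company"}, SUBCLASS_OF)))
```

A real reasoner handles far more than subclass axioms, but the principle is the same: derived statements follow from the asserted ones.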


How to utilize Linked Data Standards

Any OWL ontology/taxonomy can be used in a non-LD context.
- through their ability to link resources, RDF-based ontologies can easily be amalgamated, which makes them reusable
- extending ontologies to fit narrower use cases
- reducing ontologies of a certain area to fit a broader scope
- separating semantic structure (classes, properties) from use-case-specific restrictions (e.g. cardinalities) → SHACL
- Example: DataID

The W3C, responsible for common standards on the Web, is focusing on RDF-based standards in many fields.

Incremental adoption of LD technologies

Linked Data standards and technologies are manifold and, at times, confusing.
Fortunately, introducing Linked Data into existing IT environments can be accomplished incrementally:
● Collect data without given schemata/ontologies
  ○ Very helpful when dealing with semi-structured or unstructured data
● Use RDF views on top of existing DBMS
  ○ With an easy-to-change R2RML mapping
● Develop a domain ontology over time ('Open World Assumption')
  ○ Especially useful in fast-changing domains
● Enrich data with every iteration of your data management cycle
  ○ See ALIGNED methods for more
● Start using LD-based tooling, e.g. RelFinder:

http://www.visualdataweb.org/relfinder/demo.swf?obj1=Sm9obiBDbGVlc2V8aHR0cDovL2RicGVkaWEub3JnL3Jlc291cmNlL0pvaG5fQ2xlZXNl&obj2=VGVycnkgR2lsbGlhbXxodHRwOi8vZGJwZWRpYS5vcmcvcmVzb3VyY2UvVGVycnlfR2lsbGlhbQ==&obj3=TW9udHkgUHl0aG9ufGh0dHA6Ly9kYnBlZGlhLm9yZy9yZXNvdXJjZS9Nb250eV9QeXRob24=&name=REJwZWRpYSAobWlycm9yKSAoZnJvbSBVUkwgcGFyYW1ldGVycyk=&abbreviation=ZGJwMTQ2MjczNzExMTgxMQ==&description=TGlua2VkIERhdGEgdmVyc2lvbiBvZiBXaWtpcGVkaWEu&endpointURI=aHR0cDovL2RicGVkaWEub3JnL3NwYXJxbA==&dontAppendSPARQL=ZmFsc2U=&defaultGraphURI=aHR0cDovL2RicGVkaWEub3Jn&isVirtuoso=dHJ1ZQ==&useProxy=dHJ1ZQ==&method=UE9TVA==&autocompleteLanguage=ZW4=&autocompleteURIs=aHR0cDovL3d3dy53My5vcmcvMjAwMC8wMS9yZGYtc2NoZW1hI2xhYmVs&ignoredProperties=aHR0cDovL2RicGVkaWEub3JnL29udG9sb2d5L3dpa2lQYWdlV2lraUxpbmssaHR0cDovL2RicGVkaWEub3JnL3Byb3BlcnR5L3dpa2lQYWdlVXNlc1RlbXBsYXRlLGh0dHA6Ly9kYnBlZGlhLm9yZy9wcm9wZXJ0eS93aWtpbGluayxodHRwOi8vZGJwZWRpYS5vcmcvcHJvcGVydHkvd29yZG5ldF90eXBlLGh0dHA6Ly9wdXJsLm9yZy9kYy90ZXJtcy9zdWJqZWN0LGh0dHA6Ly93d3cudzMub3JnLzE5OTkvMDIvMjItcmRmLXN5bnRheC1ucyN0eXBlLGh0dHA6Ly93d3cudzMub3JnLzIwMDIvMDcvb3dsI3NhbWVBcyxodHRwOi8vd3d3LnczLm9yZy8yMDA0LzAyL3Nrb3MvY29yZSNzdWJqZWN0&abstractURIs=aHR0cDovL2RicGVkaWEub3JnL29udG9sb2d5L2Fic3RyYWN0&imageURIs=aHR0cDovL2RicGVkaWEub3JnL29udG9sb2d5L3RodW1ibmFpbCxodHRwOi8veG1sbnMuY29tL2ZvYWYvMC4xL2RlcGljdGlvbg==&linkURIs=aHR0cDovL3B1cmwub3JnL29udG9sb2d5L21vL3dpa2lwZWRpYSxodHRwOi8veG1sbnMuY29tL2ZvYWYvMC4xL2hvbWVwYWdlLGh0dHA6Ly94bWxucy5jb20vZm9hZi8wLjEvcGFnZQ==&maxRelationLegth=Mg==


Linked Data in the context of Big Data

In 2006, Clive Humby coined the phrase "the new oil" for (digital) data, heralding the ever-expanding realm of what is now summarised as: Big Data.

What role does Linked Data play in the context of buzzwords like:
● Big Data
● Smart Data (e.g. Smart Data Forum)

The four V’s of Big Data


The four V’s heatmap for Linked Data

A Gartner study in 2013 found:
- many organizations find the "variety" dimension a much bigger challenge than volume or velocity.

Linked Data to the rescue:
- combine multiple sources with different structures
- while retaining the flexibility to add new ones
- without adapting schemata
- query combined data, or multiple sources at once
- detect patterns in the data

Linked Data in the context of Big Data

- Linked Data can describe any kind of data
  - no matter the amount or domain
- especially useful for:
  - graph-structured data
    - social media, knowledge graphs
  - (multi-)lingual data
    - easy incorporation of unstructured data
    - perfect for annotation purposes
  - data for complex domains
    - e.g. taxonomies in life sciences
  - ontologies, metadata and provenance information
    - ontologies are modeled with OWL, an RDF extension

(Diagram labels: Big Data, Smart Data, Linked Data)


Linked Data in Research

● Computer science: especially graph- and NLP-related, QA, AI
  ○ e.g. IBM: Natural Language Understanding of Unstructured Data
● Life sciences: to describe complex domains (large ontologies & taxonomies)
  ○ e.g. Human Disease Ontology
● (Digital) Humanities: to manage (record, annotate…) large text records
  ○ e.g. Homer Multitext Project
● Libraries: recording of metadata and interlinking it with other institutions
  ○ e.g. Deutsche Nationalbibliothek

Linked Data in Industry

● Online Search (large Knowledge Graphs)
  ○ e.g. Google
● Social Media (social network analysis)
  ○ e.g. Facebook
● Publishing Industry (large text corpora annotation)
  ○ e.g. Wolters Kluwer, NYT
● Broadcasting (ontology-centered Linked Data services)
  ○ e.g. BBC
● (Open) Government Data
  ○ e.g. US publishing data as RDF: http://data.gov

DBpedia… a fused, multi-domain, multilingual dataset

● Is a crowd-sourced community effort to extract structured information from Wikipedia and Wikidata.

● Enriches the extracted information with a semantic layer.

● Provides a query service and many additional tools.

A web of knowledge


DBpedia History

● The DBpedia project was started in 2006 as a collaboration of Freie Universität Berlin, Leipzig University and the University of Mannheim

● Has been a key factor in the rapid growth of the LOD initiative and the overall success of Linked Data

● http://wiki.dbpedia.org

Example: http://dbpedia.org/page/Leipzig_University

http://en.wikipedia.org/wiki/Monty_Python ⇔ http://dbpedia.org/resource/Monty_Python
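The correspondence between a Wikipedia article and its DBpedia resource shown above can be sketched as a simple URL rewrite. This is an illustrative helper for the English edition only; the function name `wikipedia_to_dbpedia` is hypothetical, and special cases such as redirects or unusual characters in titles are deliberately ignored.

```python
from urllib.parse import urlparse

DBPEDIA_RESOURCE = "http://dbpedia.org/resource/"


def wikipedia_to_dbpedia(url: str) -> str:
    """Map an English Wikipedia article URL to the matching DBpedia resource URI."""
    parsed = urlparse(url)
    prefix = "/wiki/"
    if parsed.netloc != "en.wikipedia.org" or not parsed.path.startswith(prefix):
        raise ValueError(f"not an English Wikipedia article URL: {url}")
    # The article title is reused unchanged as the local name of the resource.
    return DBPEDIA_RESOURCE + parsed.path[len(prefix):]


print(wikipedia_to_dbpedia("http://en.wikipedia.org/wiki/Monty_Python"))
```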


Some statistics

The latest release (2015-10):
● was extracted from 127 language editions
● describes up to 20 million things
● with 8.8 billion RDF statements (triples)
● mirrors every Wikipedia page = 6.2M things
  ○ of which 4.6M have abstracts,
  ○ 955K have geo coordinates
  ○ and 1.54M have depictions
● "In general we observed a significant growth in raw infobox and mapping-based statements of close to 10%".

Structure of DBpedia

● Structured in language datasets with multiple subsets
● Each sub-dataset specializes in a certain type of data
● Mapping-based types and facts are governed by the DBpedia Ontology

DBpedia Ontology

● A cross-domain ontology
● maintained and extended by the community in the DBpedia Mappings Wiki
● manually created based on the most commonly used infoboxes
● currently covers 685 classes which form a subsumption hierarchy and are described by 2,795 different properties
● subsumption hierarchy with a maximal depth of 5

(Figure: DBpedia Ontology extract)

DBpedia Mappings Wiki

● a community effort to:
  ○ develop an ontology schema
  ○ provide mappings from Wikipedia infobox properties to this ontology
→ creating an alignment between Wikipedia and DBpedia
→ eliminating name variations in properties and classes
→ big boost for precision

http://mappings.dbpedia.org/

Extracting a DBpedia

● Wikipedia articles consist mostly of free text
● but also comprise various types of structured information
● depending on the template used for a specific article (e.g. Actor, Village etc.)
● including: infobox templates, categorisation information, images, geo-coordinates, links to external web pages, disambiguation pages, redirects between pages, other language links

Wikipedia Article Structure

● Title
● Abstract
● Infoboxes
● Geo-coordinates
● Categories
● Images
● Links
  ○ other language versions
  ○ other Wikipedia pages
  ○ to the Web
  ○ redirects
  ○ disambiguations

Infobox Encoding


DIEF - DBpedia Information Extraction Framework

● extracts structured information from Wikipedia and turns it into a rich knowledge base
  ○ Mapping-Based Infobox Extraction
  ○ Raw Infobox Extraction
  ○ Feature Extraction
  ○ Statistical Extraction
● Updated to adapt to changes in Wikipedia
● Expanded for new knowledge extraction methods
  ○ e.g. by multiple GSoC projects (table extraction, NIF, ...)
● Open source code in Scala & Java

DBpedia Live

● Wikipedia articles are continuously revised at a very high rate

● English Wikipedia, in June 2013, had approximately 3.3 million edits per month (roughly 77 edits per minute)

● DBpedia Live was developed to keep DBpedia in synchronization with Wikipedia

● works on a continuous stream of updates from Wikipedia and processes that stream on the fly

Accessing and Querying DBpedia

● per resource view
  ○ Linked Data interfaces: http://dbpedia.org/page/Immanuel_Kant
● navigation view
  ○ LodLive Browser: http://en.lodlive.it/?http://dbpedia.org/resource/Immanuel_Kant
● querying for resources
  ○ SPARQL (introduced later)
  ○ DBpedia Lookup Service

DBpedia Lookup Service

● REST service to query for DBpedia resources
● index of DBpedia resources, including alternative names/labels (page redirects, disambiguation links, …)
● search by complete keywords and prefix search
● results ranked by relevance (PageRank)
● filtering by DBpedia ontology classes

http://lookup.dbpedia.org/api/search/KeywordSearch?QueryClass=place&QueryString=berlin

http://lookup.dbpedia.org/api/search/KeywordSearch?QueryClass=person&QueryString=berlin
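The two example requests above differ only in their query parameters, so they can be built programmatically. This is a minimal sketch that constructs the URL without sending it; the helper name `keyword_search_url` is hypothetical, and the endpoint and parameter names are taken from the slide as-is, so they may differ in the currently deployed service.

```python
from typing import Optional
from urllib.parse import urlencode

# Endpoint as given on the slide; its current availability is not guaranteed.
LOOKUP_ENDPOINT = "http://lookup.dbpedia.org/api/search/KeywordSearch"


def keyword_search_url(query: str, query_class: Optional[str] = None) -> str:
    """Build a Lookup keyword-search URL, optionally filtered by ontology class."""
    params = {}
    if query_class:
        params["QueryClass"] = query_class
    params["QueryString"] = query
    # urlencode takes care of escaping spaces and special characters.
    return LOOKUP_ENDPOINT + "?" + urlencode(params)


print(keyword_search_url("berlin", "place"))
print(keyword_search_url("berlin", "person"))
```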


DBpedia internationalised

● non-English versions of DBpedia offer
  ○ coverage of more entities
  ○ more detailed or up-to-date information for entities associated with the particular countries
● international mapping community
  ○ helps in the provision of localized DBpedia datasets for 125 languages
● 15 DBpedia chapters (by language)
  ○ autonomous management of mappings
  ○ organisation of the local community
  ○ hosting of datasets and services
● canonicalized datasets
  ○ facts derived from localized Wikipedias, but only statements for resources also present in the English DBpedia

DBpedia Association

- founded in 2014, based in Leipzig
- goal: supporting the DBpedia community and providing free data and services to the general public
  - Data Releases
  - Software Maintenance
  - Dissemination
  - Data Accessibility
  - Communication (internal and external)
- persons and organisations can become members:
  - gaining support for all DBpedia-specific problems (queries, tools etc.)
  - deciding on the future of DBpedia
  - acquiring help for creating and linking their own datasets

A need for information

„Which films starred John Cleese without any other members of Monty Python?“


SPARQL Protocol and RDF Query Language

● RDF data query language
  ○ also query and data transfer protocol specifications (HTTP-based)
● graph-data oriented, designed independently of ontologies and related reasoning
  ○ but some SPARQL implementations can provide reasoning (e.g. RDFS+)
● declarative approach with several similarities to SQL

tutorial: https://jena.apache.org/tutorials/sparql.html

Basic Graph Patterns


Graph Group Pattern


Filtering Unwanted Results


SPARQL: Combination of Complex Graph Patterns


SPARQL - additional constructs

● alternative result types:
  ○ ASK ⇒ true, if a valid binding can be found
  ○ CONSTRUCT ⇒ create a new graph from result bindings
● combinators and modifiers for queries/graph patterns:
  ○ UNION, MINUS, subqueries
  ○ LIMIT, OFFSET, DISTINCT, ORDER BY
● property paths as a regular language (*, +, ^, {n,m}; the counted form {n,m} is not part of the final SPARQL 1.1 recommendation and is only available as an implementation extension)
  ○ e.g. rel:hasParent / rel:hasChild{2} / rel:hasFriend+
● sizable library of functions and operators for resources and literal values

find it all at: http://www.w3.org/TR/sparql11-query/
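The question about films starring John Cleese without any other members of Monty Python combines a basic graph pattern with a negative filter. The sketch below mimics that logic over a toy in-memory graph; it is not a SPARQL engine. The film identifiers are placeholders rather than real DBpedia data, and the predicate `ex:memberOf` is a hypothetical stand-in, since the property DBpedia actually uses for group membership is not given in the slides. The query string is likewise an untested assumption of how the same idea might look in SPARQL 1.1.

```python
STARRING = "dbo:starring"
MEMBER_OF = "ex:memberOf"  # hypothetical predicate, chosen for illustration

# Toy graph: the films are placeholders, not actual filmography data.
TRIPLES = {
    ("ex:Film_A", STARRING, "dbr:John_Cleese"),
    ("ex:Film_A", STARRING, "dbr:Michael_Palin"),
    ("ex:Film_B", STARRING, "dbr:John_Cleese"),
    ("ex:Film_B", STARRING, "ex:Some_Other_Actor"),
    ("ex:Film_C", STARRING, "dbr:Michael_Palin"),
    ("dbr:John_Cleese", MEMBER_OF, "dbr:Monty_Python"),
    ("dbr:Michael_Palin", MEMBER_OF, "dbr:Monty_Python"),
}

# Assumed SPARQL 1.1 formulation of the same question (not run against DBpedia).
SPARQL_QUERY = """
SELECT DISTINCT ?film WHERE {
  ?film dbo:starring dbr:John_Cleese .
  FILTER NOT EXISTS {
    ?film dbo:starring ?other .
    ?other ex:memberOf dbr:Monty_Python .
    FILTER (?other != dbr:John_Cleese)
  }
}
"""


def match(triples, s=None, p=None, o=None):
    """Match one triple pattern; None plays the role of a variable."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


def films_without_other_members(triples, actor, group):
    members = {t[0] for t in match(triples, p=MEMBER_OF, o=group)}
    result = []
    for film, _, _ in match(triples, p=STARRING, o=actor):
        cast = {t[2] for t in match(triples, s=film, p=STARRING)}
        other_members = (cast & members) - {actor}
        if not other_members:  # corresponds to FILTER NOT EXISTS
            result.append(film)
    return sorted(result)


print(films_without_other_members(TRIPLES, "dbr:John_Cleese", "dbr:Monty_Python"))
```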


End of session 1

Grab a coffee…

Next session: helpful LD technologies for:
- NLP
- Link Discovery
- Data Fusion

A common use case for integrating LD technologies

NIF - Natural Language Processing Interchange Format

● RDF/OWL-based

● utilizes various existing standards: RDF, OWL2, PROV Ontology, ITSRDF, …

● promotes stable URIs to identify primary text, its structure, annotations and their meta-data


Interoperability for Language Data and Tools

Structural Interoperability: uniform data format and structure of annotations ⇒ RDF & NIF vocabulary

Conceptual Interoperability: identical vocabularies/taxonomies for annotations (or linkage to a common reference vocabulary) ⇒ Ontology of Linguistic Annotations (OLiA), GOLD, ...

Access Interoperability: uniform, widespread, easily adoptable method for access ⇒ REST

NIF: String Relations and Text Structure


Integration of NIF service results


Linking OWL Ontologies for Conceptual Interoperability

● Ontology of Linguistic Annotations
  ○ linking specific annotation tag sets into linguistic reference models/ontologies
  ○ machine-actionable, granular representations of the semantics of tags (beyond string values)

NIF: Further Widespread Requirements Covered

● Provenance and Confidence for Annotations

● Multiple Alternative Annotations

exdoc:2_offset_23_29
    nif:anchorOf "Berlin" ;
    itsrdf:taIdentRef <http://dbpedia.org/resource/Berlin> ;
    nif-ann:taIdentConf "0.9"^^xsd:decimal ;
    nif-ann:taIdentProv exdoc:eEntityProdServiceInvocation ;
    nif:annotationUnit [
        itsrdf:taIdentRef <http://dbpedia.org/resource/Berlin,%20Nevada> ;
        nif-ann:taIdentConf "0.32"^^xsd:decimal ;
        nif-ann:taIdentProv exdoc:eEntityExpServiceInvocation
    ] .
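The identifier in the example above encodes the character offsets of the annotated string (23 to 29 for "Berlin"). How such an identifier could be derived from a text is sketched below; the document URI, the sample sentence and the helper name `offset_uri` are assumptions for illustration, and the `_offset_<begin>_<end>` pattern is simply copied from the slide rather than taken from the NIF specification.

```python
def offset_uri(doc_uri: str, text: str, mention: str, start: int = 0):
    """Locate `mention` in `text` and build an offset-based identifier for it."""
    begin = text.index(mention, start)  # raises ValueError if the mention is absent
    end = begin + len(mention)
    return f"{doc_uri}_offset_{begin}_{end}", begin, end


# Hypothetical primary text in which "Berlin" happens to start at offset 23.
TEXT = "Welcome to the city of Berlin"
uri, begin, end = offset_uri("http://example.org/doc/2", TEXT, "Berlin")
print(uri, TEXT[begin:end])
```

Because the offsets are stable for a fixed text, different tools annotating the same string arrive at the same identifier, which is what allows their results to be merged.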


Available NIF Resources

● Corpora
● Services
  ○ Tokenisation
  ○ Annotation
  ○ Validation
  ○ Combining Tool Outputs
● Documentation, Specs

NIF Corpora

● Brown Corpus

● AQUAINT News Corpus

● NER corpora
  ○ RSS-500, Reuters-128, KORE 50
  ○ Microposts NEEL
  ○ ACE Multilingual

● DBpedia abstract corpora
  ○ English, French, German, Dutch, ...

NIF Services

● OpenNLP
  ○ POS tags
● Stanford NLP
  ○ POS tags, lemmatization
● Snowball
  ○ Stemming
● DBpedia Spotlight
● Validation (via RDFUnit)

Mapping Languages

● Help to create class mappings between a source dataset and a target RDF ontology
  ○ e.g. table headings to RDF predicates (e.g. rdfs:label)

Mapping Languages cont.

● R2RML: Only supports mappings between relational databases and RDF

● RML: Extension of R2RML and supports other input data formats such as CSV, JSON, XML

● SML: Extension of R2RML and supports other input formats such as CSV
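The core idea behind these mapping languages, assigning an RDF predicate to each column of a tabular source, can be sketched in a few lines. This is a plain Python illustration and not R2RML, RML or SML syntax; the column names, the base URI and the function name `rows_to_triples` are assumptions.

```python
import csv
import io

# Declarative part: which column becomes which predicate (illustrative choices).
COLUMN_TO_PREDICATE = {"name": "rdfs:label", "city": "dbo:location"}


def rows_to_triples(csv_text: str, base: str, key: str = "id"):
    """Turn each CSV row into triples, using the key column to mint the subject URI."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = base + row[key]
        for column, predicate in COLUMN_TO_PREDICATE.items():
            value = row.get(column)
            if value:  # skip empty cells instead of emitting empty literals
                triples.append((subject, predicate, value))
    return triples


SAMPLE = "id,name,city\nSiemens,Siemens,Munich\n"
for triple in rows_to_triples(SAMPLE, "http://example.org/resource/"):
    print(triple)
```

Keeping the mapping as data, separate from the conversion loop, is what makes it easy to change when the source schema changes.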


ETL Frameworks

● Extract Transform Load (ETL)
● Common in Data Warehousing
● Extract phase:
  ○ Extract data from data sources (e.g. CSV, JSON, database, etc.)
● Transform phase:
  ○ Transform data for storing in the target format
  ○ Boilerplating/Normalization
  ○ Content Enrichment (e.g. loading of geo-information)
● Load phase:
  ○ Load data into the target data store (e.g. Virtuoso)

ETL Frameworks - LDIF

● Linked Data Integration Framework (LDIF)

● Hadoop-based ETL pipeline
● Supports provenance metadata
● Components:
  ○ Scheduler
  ○ Data Import (Crawl, SPARQL, Dump)
  ○ ETL
● Custom mapping language

ETL Frameworks - Unified Views

● Joint project between Semantic Web Company and Semantica.cz
  ○ Supported by the LOD2 FP7 project
● Components:
  ○ Frontend UI
  ○ Backend
  ○ Database
  ○ Scheduler
● Possible to add custom plugins

Link Discovery Frameworks

● Finding links between related data items in different datasets
● Use cases:
  ○ owl:sameAs links
  ○ Class mappings (e.g. like R2RML)
  ○ Data transformation
● Survey
● Matching algorithms:
  ○ string similarity, geo-location matching, regular expressions, etc.
● Link discovery strategies
  ○ Rule based: using predefined rules to find matching data items
  ○ Statistical based: using machine learning techniques to find matching data items
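A rule-based strategy with a string-similarity matcher can be sketched as follows. This is a naive illustration using the standard library's `difflib`, comparing every pair of labels; it is not how LIMES or Silk work internally, and the threshold, the labels and the function name `discover_links` are assumptions.

```python
from difflib import SequenceMatcher


def discover_links(source: dict, target: dict, threshold: float = 0.9):
    """Propose owl:sameAs links between resources whose labels are similar enough."""
    links = []
    for s_uri, s_label in source.items():
        for t_uri, t_label in target.items():
            # Rule: case-insensitive label similarity must reach the threshold.
            score = SequenceMatcher(None, s_label.lower(), t_label.lower()).ratio()
            if score >= threshold:
                links.append((s_uri, "owl:sameAs", t_uri))
    return links


source = {"ex1:c1": "SIEMENS AG"}
target = {"ex2:a": "Siemens AG", "ex2:b": "Munich"}
print(discover_links(source, target))
```

The pairwise comparison is quadratic in the number of resources, which is why dedicated frameworks invest in reducing the number of comparisons.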


Link Discovery Frameworks - LIMES

● Link discovery framework for MEtric Spaces

● Fast, large-scale link discovery using specification language


Link Discovery Frameworks - Silk

● UI driven linking framework

● Uses its own specification language

● Supports data transformation

Data Fusion

● Fusing of multiple records representing the same real-world object into a single, consistent, and clean representation (Bleiholder & Naumann 2008)

● Possible use cases:
  ○ Same value for the same property in all datasets (e.g. name)
  ○ Different values for the same property in all datasets (e.g. age)
  ○ New information
● Problems:
  ○ No unique IDs
  ○ Real-world data is dirty, big and complex
  ○ No training data for many linkage applications
  ○ Trustworthiness of external data

Data Fusion - Strategies

● Rule based
  ○ Use the observed value from the most recently updated source
  ○ Take the average/maximum/minimum for numerical values
  ○ Idea is to improve efficiency
● Statistically based
  ○ Unsupervised/supervised strategies:
    ■ Vote: take the value which is supported by the largest number of sources
    ■ Quality based: evaluate trustworthiness
      ● Web-link-based, IR-based, Bayesian, graphical model
    ■ Relation based
      ● Extends quality-based methods and considers relationships between sources (e.g. copying data around, etc.)
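Three of the simpler strategies listed above (most recent source, average, and vote) can be sketched as small functions that resolve conflicting values for one property. The function names and sample values are illustrative assumptions; quality-based and relation-based fusion are not covered here.

```python
from collections import Counter


def fuse_most_recent(observations):
    """Rule based: take the value from the most recently updated source.

    `observations` is a list of (value, iso_date) pairs; ISO dates sort as strings.
    """
    return max(observations, key=lambda obs: obs[1])[0]


def fuse_average(values):
    """Rule based: average of numerical values."""
    return sum(values) / len(values)


def fuse_vote(values):
    """Vote: the value supported by the largest number of sources.

    Ties are broken alphabetically so the result is deterministic.
    """
    counts = Counter(values)
    best = max(counts.values())
    return sorted(v for v, c in counts.items() if c == best)[0]


print(fuse_most_recent([(41, "2015-01-01"), (42, "2016-03-01")]))
print(fuse_average([10, 20]))
print(fuse_vote(["Munich", "Munich", "München"]))
```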


Data Fusion - LD-FusionTool

● Developed in conjunction with the Unified Views project
● Features:
  ○ Resolution of schema and identity conflicts
  ○ Resolution of data conflicts
  ○ Quality Assessment
  ○ Provenance Tracking
● No machine learning based fusion

Data Fusion - Sieve

● Developed in conjunction with the LDIF project
● Features:
  ○ Resolution of data conflicts
  ○ Quality Assessment
  ○ Provenance Tracking
  ○ Support for Plugins

Data Fusion - Sieve

Examples


ALIGNED - Software & Data Engineering

● quality-centric software and data engineering
● research project funded by Horizon 2020 (EC)
● will develop new ways to build and maintain IT systems that use big data on the web

ALIGNED - One Page


ALIGNED - Goals

● New methodology for parallel software and data engineering of web-scale information systems.
  ○ Linked Data as the unifying foundation for system specification, process and tool integration.
  ○ Support evolution of software dependent on heterogeneous, complex data of varying quality with an independent lifecycle.

ALIGNing Problem: the example of DBpedia

Lots of code & a lot more data
● Wikipedia evolves over time
  ○ Infobox templates are changed, merged, deleted
  ○ New formatting templates
  ○ Structural differences per language edition
● DBpedia Ontology and Mappings change as well
● Code should adapt to all the changes
  ○ hard at this (data) scale
→ Data quality will suffer

Unit-testing to the rescue?

● Software & data testing
● Straightforward for software (since the 70s)
● Preliminary for (RDF) data
  ○ RDFUnit, SPIN…
  ○ W3C Data Shapes WG

Data testing
● Generation: manual, (semi-)automatic, ...
● Linking: data & software tests

RDF Unit

http://rdfunit.aksw.org


RDF Unit

● Input: ontologies, updated Data Quality Patterns (DQP), and the datasets to test against
● Produces Data Unit Test Cases automatically by applying DQPs to the axioms of an ontology
  ○ User-defined test cases are added as well
● Runs all Data Unit Test Cases against a given dataset
● Generates Test Case Result data for every triple violating one of these DQPs
● (evaluate Test Case Results to change triples, software or DQP/Test Cases, then run RDF Unit again…)
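The notion of a data unit test derived from an ontology axiom can be sketched with a single pattern: a property with a declared range must only point at resources of that class. This is a toy check in plain Python and does not use RDFUnit or its API; the function name `range_violations`, the sample triples and the assumed range of `dbo:location` are illustrative.

```python
def range_violations(triples, types, prop, expected_class):
    """Return every triple using `prop` whose object lacks the expected type."""
    return sorted(
        t for t in triples
        if t[1] == prop and expected_class not in types.get(t[2], set())
    )


# Data under test (illustrative): the second statement points at a non-place.
TRIPLES = [
    ("dbr:Siemens", "dbo:location", "dbr:Munich"),
    ("dbr:Siemens", "dbo:location", "dbr:Aktiengesellschaft"),
]
TYPES = {"dbr:Munich": {"dbo:Place"}}

# Test case instantiated from the assumed axiom "range of dbo:location is dbo:Place".
violations = range_violations(TRIPLES, TYPES, "dbo:location", "dbo:Place")
print(violations)
```

Each violating triple corresponds to one test case result; fixing the data (or the extraction code) and rerunning the check closes the loop described above.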


DBpedia+ Workflow


FREME: Multilingual Content Enrichment & Curation

● two year H2020 innovation action: bridge language and data

● driven by four business use cases; business partners:
  ○ vistatec - translation, localisation, content creation/curation
  ○ tilde - language & terminology services
  ○ agroknow - agriculture & food information & research
  ○ wripl - content optimisation & personalisation, SEO

Current State and Challenges for Content Enrichment


Contribution of FREME

various target user groups:
● developers
● content authors
● content architects
● ...

several access modes:
● graphical interfaces
● programmatically
● official endpoints and local service instance

FREMEs e-Services

● e-Entity: for enriching content with information on named entities
● e-Link: for enrichment with linked data sources
● e-Terminology: for detecting terms and enriching them with term-related information
● e-Translation: for providing custom machine translation systems
● e-Publishing: for exporting enriched content in the ePub format

FREME Demo


Linked Data in FREME

● service interoperability and integration using NIF

● low entrance barrier: start for test / limited volumes immediately, just using REST queries

● utilization of several popular Linked Data knowledge bases: Europeana (cultural heritage data), ORCID (research identifiers), ONLD (organisation names), Library of Congress Authorities

Named Entity Spotting, Recognition and Linking


Choice of the Most Appropriate NER/NEL Tool

● Topic/domain of
  ○ the training data used
  ○ the knowledge bases linked against
● Overall performance (precision, recall, ...)
● Support for / performance on specific entity categories (companies, artists, …)

GERBIL - Sustainable Entity Annotation Benchmarks

● unified experiment setups

● extensible for additional services and datasets

● experiment results as Linked Data resources
  ○ easy documentation
  ○ improving reproducibility

Smart Data Web (SDW)

- BMWi-funded project
- Main goal: data collection for German industry using state-of-the-art extraction and enrichment technologies
- Use cases:
  - Supply Chain Management
  - Market Research

SDW - Knowledge Graph

- AKSW/KILT group responsible for Knowledge Graph

- Curated sources:
  - DBpedia, PermID, GRID, etc.
- Uncurated sources:
  - Twitter, news feeds, etc.
- Data quality and persistence
  - RDFUnit: test-driven data-debugging framework
  - LIMES: link discovery for the web of data

SDW - Use Case: Supply Chain Management

● Public KG is fed continuously with information from news channels, websites, etc.

● Corporate/Internal KG is connected to public KG
● Detection of potential problems in supply chain
  ○ Need to check suppliers regularly
    ■ Check for compliance
    ■ Quality of products
    ■ Who else is being supplied by this supplier?
    ■ strike, natural disaster, insolvency, etc.
    ■ ...
  ○ Get information about potential problems as quickly as possible

SDW - Use Case: Market Research

● Public KG is fed continuously with information from news channels, websites, etc.

● Finding more information about the value chain:
  ○ Potential leads/potential customers
  ○ Competition
  ○ Potential suppliers
  ○ Price development on the market
  ○ Customer satisfaction
● Connecting information from different data silos
  ○ Finding out about new relations

SDW - NLP

- Extraction of company and traffic events using state of the art NLP and machine learning technologies

- Demo

SDW - Linked Data

● Common public knowledge graph (KG)
  ○ Modular corporate ontology
  ○ ETL pipeline for different datasets
  ○ Fusion of different datasets and web data
● Unique URIs for all entities (through knowledge fusion)
● Store meta-data (e.g. provenance) for each RDF statement
● Use KG for NLP tasks (e.g. Named Entity Recognition, Disambiguation, etc.)
● Use KG for Enterprise Search
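The two points on unique URIs and per-statement provenance can be illustrated with a minimal sketch. It assumes that fusion is driven by a precomputed mapping from source-specific identifiers to one canonical URI, and that provenance is recorded as the graph name of an N-Quads line. All `example.org` URIs, the `SAME_AS` table and the function names are hypothetical; the SDW pipeline itself is not shown in the slides. Objects are restricted to IRIs here to keep the serialization short.

```python
from typing import Iterable, List, NamedTuple


class Statement(NamedTuple):
    """An RDF statement plus the graph IRI naming the dataset it came from."""
    subject: str
    predicate: str
    obj: str
    source: str  # provenance, serialized as the N-Quads graph name


# Hypothetical result of a link discovery step: source IDs -> canonical KG URI.
SAME_AS = {
    "http://example.org/permid/123": "http://example.org/kg/AcmeCorp",
    "http://example.org/grid/abc": "http://example.org/kg/AcmeCorp",
}


def canonical(uri: str) -> str:
    """Return the unique KG URI for an entity, or the URI itself if unmapped."""
    return SAME_AS.get(uri, uri)


def fuse(statements: Iterable[Statement]) -> List[Statement]:
    """Rewrite subjects and objects to canonical URIs while keeping provenance."""
    return sorted({
        Statement(canonical(s.subject), s.predicate, canonical(s.obj), s.source)
        for s in statements
    })


def to_nquad(statement: Statement) -> str:
    """Serialize one statement as an N-Quads line (IRI objects only in this sketch)."""
    return (f"<{statement.subject}> <{statement.predicate}> "
            f"<{statement.obj}> <{statement.source}> .")


fused = fuse([
    Statement("http://example.org/permid/123",
              "http://example.org/ontology/headquarters",
              "http://dbpedia.org/resource/Leipzig",
              "http://example.org/graph/permid"),
    Statement("http://example.org/grid/abc",
              "http://example.org/ontology/supplies",
              "http://example.org/kg/Globex",
              "http://example.org/graph/grid"),
])
nquads = [to_nquad(s) for s in fused]
print("\n".join(nquads))
```

After fusion both statements describe the same subject URI, while the fourth element of each line still records which source dataset contributed it.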

Use Case: Introducing Linked Data into an established IT environment

● Task 1: Support transformation of relational database data from different sources to RDF
  ○ Establish a transformation process from SQL data to RDF.
  ○ Combine multiple DBs into a single RDF dataset
● Task 2: Generate software components that let any user manipulate the underlying data
  ○ Based on the domain description (ontology)
● Task 3: Enable data quality checks using data constraints
  ○ Applying an iterative process of developing an ontology, providing test results and correcting both data and ontology in a changing overall software environment.

Task 1: Transforming DB-data into RDF

● A mapping language like R2RML can generate an RDF (or triple) view of the relational data
  ○ Databases like Virtuoso already include all tools needed for this mapping and provide an automated mapping function if needed.
  ○ Multiple approaches for the automated creation of R2RML mappings exist, creating a wrapper layer on top of the relational database (RDB)
  ○ A different approach is query rewriting SPARQL → SQL, which needs an ontology adaptation of the RDB schema
● Multiple mappings for additional databases
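The idea of a triple view over relational data can be sketched without any RDF tooling. The example below is not R2RML itself: it uses a plain Python dictionary that mimics what a mapping states (a subject URI template, a class, and one predicate per column) and emits N-Triples lines from an in-memory SQLite table. The table, the `example.org` vocabulary and all function names are assumptions made for illustration.

```python
import sqlite3
from typing import Dict, List

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://example.org/"  # hypothetical namespace

# Hypothetical mapping, loosely modelled on what an R2RML triples map declares.
MAPPING: Dict = {
    "table": "supplier",
    "subject_template": BASE + "supplier/{id}",
    "class": BASE + "ontology/Supplier",
    "columns": {
        "name": BASE + "ontology/name",
        "city": BASE + "ontology/city",
    },
}


def escape_literal(value: str) -> str:
    """Escape backslashes, quotes and newlines for an N-Triples string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def rows_to_ntriples(conn: sqlite3.Connection, mapping: Dict) -> List[str]:
    """Turn every row of the mapped table into N-Triples lines."""
    columns = list(mapping["columns"])
    column_list = ", ".join(columns)
    table = mapping["table"]
    query = f"SELECT id, {column_list} FROM {table} ORDER BY id"
    triples = []
    for row in conn.execute(query):
        subject = mapping["subject_template"].format(id=row[0])
        rdf_class = mapping["class"]
        triples.append(f"<{subject}> <{RDF_TYPE}> <{rdf_class}> .")
        for column, value in zip(columns, row[1:]):
            if value is None:
                continue  # a NULL cell produces no triple
            predicate = mapping["columns"][column]
            literal = escape_literal(str(value))
            triples.append(f'<{subject}> <{predicate}> "{literal}" .')
    return triples


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE supplier (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany("INSERT INTO supplier VALUES (?, ?, ?)",
                 [(1, "Acme", "Leipzig"), (2, "Globex", None)])

triples = rows_to_ntriples(conn, MAPPING)
print("\n".join(triples))
```

Adding a second database then amounts to a second mapping whose output is appended to the same triple list, which is the "multiple mappings" case from the last bullet.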

Task 2: ALIGNED Tool - Semantic Booster

● Given a domain description as input (ontology, DB schema, etc.)

● Optional mapping to an existing DB schema

● Automatic creation of a Booster Specification

● Automatic generation of a SQL based DB-specification.

● The data is propagated through the Booster web interface

Semantic Booster - Features

● Generating high-quality software components from precise models with metadata annotations.

● Using Model-Driven Engineering techniques to generate a complete information system.

● key feature: Allows for the smooth transition of data when the underlying database is updated.

● Enables domain experts to develop systems conformant with existing standards, datasets or systems.

Booster Web Interface

Task 3: Data Quality Validation

Option A:
● Use the form validation methods of the Booster Web Interface to validate user input directly
  ○ Useful for simple domains without many restrictions

Option B:
● Use RDFUnit to automatically generate data unit tests
  ○ for more complex domains

Creating data unit tests with RDF Unit

● Common restrictions (e.g. cardinalities, domain/range) are automatically transformed into unit tests by RDFUnit.
● More complex restrictions can be inserted
  ○ with new unit patterns → enabling RDFUnit to generate the test itself
  ○ using custom SPARQL queries → the user defines the pattern to look for
● A failed unit test produces an error object in RDF serialization
  ○ containing the necessary metadata to pinpoint the offending triple
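What a data unit test for a cardinality restriction checks can be sketched in plain Python. This is not RDFUnit's API, and RDFUnit works with SPARQL-based tests rather than Python loops; the sketch only imitates the logic of one generated test ("every Supplier has at most one name") and the kind of metadata a failure report carries. The vocabulary, the sample triples and the function name are assumptions.

```python
from collections import Counter
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/"  # hypothetical namespace
SUPPLIER = EX + "ontology/Supplier"
NAME = EX + "ontology/name"


def check_max_cardinality(triples: List[Triple], cls: str, prop: str,
                          max_count: int = 1) -> List[Dict]:
    """Report every instance of `cls` with more than `max_count` values for `prop`."""
    instances = {s for s, p, o in triples if p == RDF_TYPE and o == cls}
    counts = Counter(s for s, p, _ in triples if p == prop and s in instances)
    return [
        {
            "resource": resource,      # which entity violates the restriction
            "property": prop,          # which property is affected
            "count": counts[resource],
            "message": f"expected at most {max_count} value(s), found {counts[resource]}",
        }
        for resource in sorted(instances)
        if counts[resource] > max_count
    ]


triples: List[Triple] = [
    (EX + "supplier/1", RDF_TYPE, SUPPLIER),
    (EX + "supplier/1", NAME, "Acme"),
    (EX + "supplier/2", RDF_TYPE, SUPPLIER),
    (EX + "supplier/2", NAME, "Globex"),
    (EX + "supplier/2", NAME, "Globex Corp."),
]

violations = check_max_cardinality(triples, SUPPLIER, NAME)
for violation in violations:
    print(violation["resource"], "-", violation["message"])
```

Each violation record names the resource and property involved, which is the information needed to trace a failure back to the offending triples.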

Evaluating RDF Unit results

● Test case results can be used to implement a (semi-)automatic process to improve the tested data or its generating software
● Or to provide statistics about the quality of a dataset
● ...

Completing the tasks

In addition to validation processes, any number of tasks can be executed on the given data
● e.g. extracting NIF annotations on linguistic data
● Spotlight Entity recognition
● etc.

Summary

● WOD: applying the principles of the WWW to data
● Bridging disciplines and domains (by linking their data)
● Linked Data makes Smart Data out of Big Data
● Many Linked Data Standards can be reused for Big Data
● DBpedia can be used for many domains and processes
● Linked Data can be applied in many different parts of commercial environments

Q&A