Scientific RDF Databases Michael Mertens K.U.Leuven


Post on 23-Feb-2016

Page 1: Scientific RDF Databases

Scientific RDF Databases

Michael Mertens
K.U.Leuven

Page 2: Scientific RDF Databases

Outline

• Introduction to RDF
• RDF Databases
• Advantages for scientific R&D
• In practice
• Criticism

2

Page 4: Scientific RDF Databases

RDF: Resource Description Framework

• Originally: metadata data model

• Now: general method for the conceptual description of web resources (Semantic Web)

Introduction

4

Page 5: Scientific RDF Databases

• Traditional Web in 2009:

Introduction

• Sharing documents
• URL as retrieval mechanism
• HTML as standard format
• Hypertext links

Image taken from “The Emerging Web of Linked Data”, Chris Bizer

5

> Semantic Web

Page 6: Scientific RDF Databases

• Data on the web
  – HTML describes documents and the links between them
  – Semantic Web:
    • Publish data in RDF, OWL, XML, ..
    • Describe arbitrary things: people, books, events, ..
    • Link between these concepts
    • Machine-readable, web-accessible databases

Introduction

6

> Semantic Web

Page 7: Scientific RDF Databases

• Tim Berners-Lee: LINKED DATA
• Connected, structured data
• 3 simple principles:
  – URLs for conceptual things
  – Looking up a URL returns useful data about that thing
  – Relationships link to other URLs

Introduction

7

> Semantic Web > Linked Data

Page 8: Scientific RDF Databases

Introduction

8

• Before: scientific data was usually not shared
• Pharmaceutical drug discovery
  – A lot of scattered data
• DrugBank, ClinicalTrials.gov, Health Care and Life Science
  – Genomics data, protein data, ..
• A question nobody had examined before:
  “What proteins are involved in signal transduction AND are related to pyramidal neurons?”

Example taken from “Tim Berners-Lee on the next Web”

> Semantic Web > Linked Data > Example

Page 9: Scientific RDF Databases

Introduction

9

• The web: 223,000 hits, 0 results

Example taken from “Tim Berners-Lee on the next Web”

> Semantic Web > Linked Data > Example

Page 10: Scientific RDF Databases

Introduction

10

• Linked Data: 32 hits, 32 results

Example taken from “Tim Berners-Lee on the next Web”

DRD1, 1812 adenylate cyclase activation
ADRB2, 154 adenylate cyclase activation
ADRB2, 154 arrestin mediated desensitization of G-protein coupled …
DRD1IP, 50632 dopamine receptor signaling pathway
DRD1, 1812 dopamine receptor, adenylate cyclase activating pathway
DRD2, 1813 dopamine receptor, adenylate cyclase inhibiting pathway
GRM7, 2917 G-protein coupled receptor protein signaling pathway
GNG3, 2785 G-protein coupled receptor protein signaling pathway
GNG12, 55970 G-protein coupled receptor protein signaling pathway
DRD2, 1813 G-protein coupled receptor protein signaling pathway
ADRB2, 154 G-protein coupled receptor protein signaling pathway
CALM3, 808 G-protein coupled receptor protein signaling pathway
HTR2A, 3356 G-protein coupled receptor protein signaling pathway
DRD1, 1812 G-protein signaling, coupled to cyclic nucleotide second…
SSTR5, 6755 G-protein signaling, coupled to cyclic nucleotide second…
MTNR1A, 4543 G-protein signaling, coupled to cyclic nucleotide …
HTR6, 3362 G-protein signaling, coupled to cyclic nucleotide second …
GRIK2, 2898 glutamate signaling pathway
GRIN1, 2902 glutamate signaling pathway
GRIN2A, 2903 glutamate signaling pathway
GRIN2B, 2904 glutamate signaling pathway
ADAM10, 102 integrin-mediated signaling pathway
GRM7, 2917 negative regulation of adenylate cyclase activity
LRP1, 4035 negative regulation of Wnt receptor signaling pathway
ADAM10, 102 Notch receptor processing
ASCL1, 429 Notch signaling pathway
HTR2A, 3356 serotonin receptor signaling pathway
ADRB2, 154 transmembrane receptor protein tyrosine kinase …
PTPRG, 5793 transmembrane receptor protein tyrosine kinase …
EPHA4, 2043 transmembrane receptor protein tyrosine kinase …
NRTN, 4902 transmembrane receptor protein tyrosine kinase …
CTNND1, 1500 Wnt receptor signaling pathway

> Semantic Web > Linked Data > Example

Page 11: Scientific RDF Databases

Introduction

11

Example taken from “Tim Berners-Lee on the next Web”

PREFIX go: <http://purl.org/obo/owl/GO#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX mesh: <http://purl.org/commons/record/mesh/>
# The sc: and ro: prefixes used below are not declared on the slide.
SELECT ?genename ?processname
WHERE {
  graph <http://purl.org/commons/hcls/pubmesh> {
    ?paper ?p mesh:D017966 .
    ?article sc:identified_by_pmid ?paper .
    ?gene sc:describes_gene_or_gene_product_mentioned_by ?article .
  }
  graph <http://purl.org/commons/hcls/goa> {
    ?protein rdfs:subClassOf ?res .
    ?res owl:onProperty ro:has_function .
    ?res owl:someValuesFrom ?res2 .
    ?res2 owl:onProperty ro:realized_as .
    ?res2 owl:someValuesFrom ?process .
    graph <http://purl.org/commons/hcls/20070416/classrelations> {
      { ?process <http://purl.org/obo/owl/obo#part_of> go:GO_0007166 }
      union
      { ?process rdfs:subClassOf go:GO_0007166 }
    }
    ?protein rdfs:subClassOf ?parent .
    ?parent owl:equivalentClass ?res3 .
    ?res3 owl:hasValue ?gene .
  }
  graph <http://purl.org/commons/hcls/gene> {
    ?gene rdfs:label ?genename
  }
  graph <http://purl.org/commons/hcls/20070416> {
    ?process rdfs:label ?processname
  }
}

> Semantic Web > Linked Data > Example

Related to Pyramidal Neurons

Part of Signal Transduction

Used 4 sources

Page 12: Scientific RDF Databases

Introduction

12

> Semantic Web > Linked Data

Page 13: Scientific RDF Databases

Introduction

13

> Semantic Web > Linked Data

Page 14: Scientific RDF Databases

• What do we need?
  – Identifiers: URIs
  – Linking mechanism: HTTP
  – Vocabulary: Web Ontology Language (OWL)
  – Serialization: RDF/XML

Introduction

14

> Semantic Web > Linked Data

Page 15: Scientific RDF Databases

• Identifiers: URIs
  – Use of HTTP URLs
  – Link to “resources”
  – Possibly many documents per resource
  – Shift to non-information resources:

Introduction

15

> Semantic Web > Linked Data

http://dbpedia.org/resource/London

HTML: http://dbpedia.org/page/London
RDF: http://dbpedia.org/data/London.rdf
N-Triples: http://dbpedia.org/data/London.ntriples

Page 16: Scientific RDF Databases

• Linking mechanism: HTTP
  – Accessible through generic data browsers
  – Can be crawled by search engines
  – Connects different sources
  – In contrast, Web APIs each expose a different interface

Introduction

16

> Semantic Web > Linked Data

Page 17: Scientific RDF Databases

• Vocabulary: Web Ontology Language (OWL)
  – Knowledge representation language
  – Designed to be interpreted by computers
  – Describes data in terms of classes, individuals (instances of classes) and property assertions (relationships)

Introduction

17

> Semantic Web > Linked Data

<owl:Class rdf:ID="Money">
  <rdfs:subClassOf rdf:resource="http://www.w3.org/2002/07/owl#Thing"/>
</owl:Class>
<owl:DatatypeProperty rdf:ID="currency">
  <rdfs:domain rdf:resource="#Money"/>
  <rdfs:range rdf:resource="http://www.w3.org/2001/XMLSchema#string"/>
</owl:DatatypeProperty>

Page 18: Scientific RDF Databases

• Vocabulary: Web Ontology Language (OWL)
  – Knowledge representation language
  – Designed to be interpreted by computers
  – Describes data in terms of classes, individuals (instances of classes) and property assertions (relationships)
  – Different URIs that identify the same thing are linked with ‘owl:sameAs’
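
The effect of ‘owl:sameAs’ can be illustrated with a small sketch. It assumes triples are held as plain Python tuples and groups identifiers with a union-find structure; the identifiers (`srcA:gene42` and so on), the `ex:` properties and the helper names are hypothetical and do not come from any real dataset or library.

```python
SAME_AS = "owl:sameAs"

# Hypothetical identifiers from three sources that describe one gene.
triples = [
    ("srcA:gene42", SAME_AS, "srcB:G-42"),
    ("srcB:G-42", SAME_AS, "srcC:0042"),
    ("srcA:gene42", "ex:label", "gene 42"),
    ("srcC:0042", "ex:locatedOn", "ex:chr1"),
]

def canonical_map(store):
    """Map each identifier to one representative of its owl:sameAs group."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for s, p, o in store:
        if p == SAME_AS:
            parent[find(s)] = find(o)
    return {x: find(x) for x in list(parent)}

def merge(store):
    """Rewrite subjects and objects to their representative; drop sameAs links."""
    canon = canonical_map(store)
    return {
        (canon.get(s, s), p, canon.get(o, o))
        for s, p, o in store
        if p != SAME_AS
    }

merged = merge(triples)
```

After merging, the label stated by source A and the location stated by source C hang off a single subject, which is the kind of cross-source linking the slide refers to.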

Introduction

18

> Semantic Web > Linked Data

Page 19: Scientific RDF Databases

• Based on triples
  – Subject, predicate, object
• Resources are identified by URIs
• URIs can be dereferenced to look up RDF information
• RDF information links to other URIs

RDF: Resource Description Framework

19

< http://dbpedia.org/resource/London,
  http://dbpedia.org/ontology/country,
  http://dbpedia.org/resource/United_Kingdom >
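
As a minimal sketch of the triple model, a triple can be held as a Python tuple and looked up with a pattern in which `None` acts as a wildcard. The first triple is the one shown here; the second triple and the `match` helper are illustrative assumptions.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]

DBR = "http://dbpedia.org/resource/"
DBO = "http://dbpedia.org/ontology/"

# The first triple is the one from the slide; the second is an
# illustrative assumption added so that the triples link to each other.
triples: List[Triple] = [
    (DBR + "London", DBO + "country", DBR + "United_Kingdom"),
    (DBR + "United_Kingdom", DBO + "capital", DBR + "London"),
]

def match(store: List[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> List[Triple]:
    """Return every triple that agrees with the pattern; None matches anything."""
    return [
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]

# All statements whose subject is London.
about_london = match(triples, s=DBR + "London")
```

Because the object of one triple can be the subject of another, following matches from URI to URI walks the graph.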

Page 20: Scientific RDF Databases

20

RDF: Resource Description Framework

Page 21: Scientific RDF Databases

21

RDF: Resource Description Framework

Page 22: Scientific RDF Databases

22

RDF: Resource Description Framework

This looks a lot like XML...

Why don’t we just use XML?

Page 23: Scientific RDF Databases

RDF: <Page, author, Name>

XML: the same statement can be encoded in many different ways, for example:

<document href="Page">
  <author>Name</author>
</document>

<document>
  <details>
    <uri>Page</uri>
    <author>Name</author>
  </details>
</document>

<author>
  <uri>Page</uri>
  <name>Name</name>
</author>
...

RDF vs XML

23

Page 24: Scientific RDF Databases

• RDF/XML: standardized by W3C

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://en.wikipedia.org/wiki/Tony_Benn">
    <dc:title>Tony Benn</dc:title>
    <dc:publisher>Wikipedia</dc:publisher>
  </rdf:Description>
</rdf:RDF>

• N3 or Turtle: more human-readable

@prefix dc: <http://purl.org/dc/elements/1.1/> .
<http://en.wikipedia.org/wiki/Tony_Benn>
    dc:title "Tony Benn" ;
    dc:publisher "Wikipedia" .
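
Both serializations carry the same two triples. The sketch below reads the RDF/XML example with Python's standard `xml.etree.ElementTree` and prints the statements as N-Triples lines. The helpers `simple_triples` and `to_ntriples` are hypothetical names, and the code only handles the flat, literal-valued shape shown here; it is not a general RDF/XML parser.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

RDF_XML = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://en.wikipedia.org/wiki/Tony_Benn">
    <dc:title>Tony Benn</dc:title>
    <dc:publisher>Wikipedia</dc:publisher>
  </rdf:Description>
</rdf:RDF>"""

def simple_triples(rdf_xml):
    """Extract (subject, predicate, literal) triples from flat rdf:Description blocks."""
    root = ET.fromstring(rdf_xml)
    out = []
    for desc in root.findall("{%s}Description" % RDF_NS):
        subject = desc.get("{%s}about" % RDF_NS)
        for prop in desc:
            # ElementTree writes qualified names as "{namespace}local";
            # joining the two parts gives the predicate URI.
            ns, local = prop.tag[1:].split("}")
            out.append((subject, ns + local, prop.text))
    return out

def to_ntriples(triples):
    """Serialize literal-valued triples as N-Triples lines (no escaping)."""
    return ['<%s> <%s> "%s" .' % t for t in triples]

lines = to_ntriples(simple_triples(RDF_XML))
for line in lines:
    print(line)
```

In practice a dedicated RDF library would be used instead of hand-written parsing; the point is only that the syntax differs while the triples stay the same.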

RDF: Serialization

24

Page 26: Scientific RDF Databases

• Also called “triple store”
• Data in the form of triples: subject – predicate – object
• Dominant query language: SPARQL

RDF Databases

26

PREFIX abc: <nul://sparql/exampleOntology#>
SELECT ?capital ?country
WHERE {
  ?x abc:cityname ?capital ;
     abc:isCapitalOf ?y .
  ?y abc:countryname ?country ;
     abc:isInContinent abc:Africa .
}
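
To make the query semantics concrete, the following sketch evaluates the same four triple patterns against a tiny in-memory set of triples with a naive nested-loop join. The sample data, the `ex:` node names and the `solve` function are assumptions made for illustration; a real triple store would use indexes and a query optimizer.

```python
ABC = "nul://sparql/exampleOntology#"

# Hypothetical sample data; the slide shows only the query.
triples = {
    ("ex:c1", ABC + "cityname", "Nairobi"),
    ("ex:c1", ABC + "isCapitalOf", "ex:k"),
    ("ex:k", ABC + "countryname", "Kenya"),
    ("ex:k", ABC + "isInContinent", ABC + "Africa"),
    ("ex:c2", ABC + "cityname", "Paris"),
    ("ex:c2", ABC + "isCapitalOf", "ex:f"),
    ("ex:f", ABC + "countryname", "France"),
    ("ex:f", ABC + "isInContinent", ABC + "Europe"),
}

def solve(patterns, store, binding=None):
    """Yield variable bindings that satisfy all triple patterns."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in store:
        new = dict(binding)
        ok = True
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # A variable must keep the value it was first bound to.
                if new.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield from solve(rest, store, new)

# The four triple patterns of the WHERE clause.
query = [
    ("?x", ABC + "cityname", "?capital"),
    ("?x", ABC + "isCapitalOf", "?y"),
    ("?y", ABC + "countryname", "?country"),
    ("?y", ABC + "isInContinent", ABC + "Africa"),
]

results = sorted((b["?capital"], b["?country"]) for b in solve(query, triples))
```

The shared variables `?x` and `?y` are what join the patterns together, much as join conditions do in SQL.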

Page 27: Scientific RDF Databases

• Built on W3C’s “Linked Data”
• Subset of “graph databases”
• Nodes (entities), edges (relationships), properties
• Directed, labeled graph structure (predicate URI as label)

RDF Databases

27

Page 28: Scientific RDF Databases

Graph View

28

Image taken from w3.org

Page 29: Scientific RDF Databases

• Only standardised NoSQL database
• In contrast to a normal RDBMS:
  – Very flexible data model
    • Does not require a fixed table schema
  – Information as the most basic building blocks
    • Enables improvements on data-intensive operations
• Examples: eBay, Facebook, Digg, ..

RDF Databases

29

Page 30: Scientific RDF Databases

• Scalable: distributed design
• Self-documenting data
  – Vocabulary identified in OWL or RDFS definitions
  – Allows multiple schemata
• Open
  – Discover new data sources at run-time
• Often weak consistency guarantees
  – Solved with additional middleware

RDF Databases

30

Page 31: Scientific RDF Databases

Limitations of Relational Databases:

• Not directly visible to web agents
• Primary-foreign key relationships
  – Meaning is implicit, unspecified semantics
• No relationships across separate databases
• Parent-child relationships are not natural
  – “Self-joins” for each level in the hierarchy
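
The hierarchy point can be sketched as follows. In a relational table each extra level of a parent-child hierarchy typically needs another self-join, whereas over triples the same question becomes a graph traversal (SPARQL 1.1 later added property paths such as `rdfs:subClassOf*` for this). The class names and the `ancestors` helper below are hypothetical.

```python
from collections import deque

SUBCLASS = "rdfs:subClassOf"

# Hypothetical hierarchy, written as triples.
triples = [
    ("ex:Dog", SUBCLASS, "ex:Mammal"),
    ("ex:Mammal", SUBCLASS, "ex:Animal"),
    ("ex:Animal", SUBCLASS, "ex:LivingThing"),
    ("ex:Cat", SUBCLASS, "ex:Mammal"),
]

def ancestors(node, store, predicate=SUBCLASS):
    """Return all nodes reachable from `node` by following `predicate` edges."""
    seen = set()
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for s, p, o in store:
            if s == current and p == predicate and o not in seen:
                seen.add(o)
                queue.append(o)
    return seen

dog_ancestors = ancestors("ex:Dog", triples)
```

The traversal does not need to know the depth of the hierarchy in advance, which is what the fixed number of self-joins cannot offer.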

31

RDF Databases

Page 33: Scientific RDF Databases

Advantages for Scientific R&D

• Studies continue to show that research in all fields is increasingly collaborative

• Example: genomic research
  – Complex data distributed over many datasets
    • Entrez Gene (EG), Gene Ontology (GO), Swiss-Prot, GenBank, ..

33

Page 34: Scientific RDF Databases

• Problem = lack of well-defined standards
  – Integration nightmare:
    • Data scattered, different formats, lacking information
    • Synonyms, ambiguity
  – Changing models:
    • Maintenance not feasible
  – Understanding and reasoning:
    • Need for connecting ontologies
• Challenge: syntactic and semantic heterogeneity

34

Advantages for Scientific R&D

Page 35: Scientific RDF Databases

• Localization of resources
  – Identify relevant web resources
• Data formats
  – Resources are represented in HTML, TXT, images, ..
• Synonyms
  – Researchers can name their own data differently

35

Integration of Databases > Challenges

Page 36: Scientific RDF Databases

• Ambiguity
  – E.g. “insulin” can represent a drug, protein, gene, ..
• Relations
  – One-to-one / one-to-many between identifiers
• Granularity
  – Can cause missing data, ..

36

Integration of Databases > Challenges

Page 37: Scientific RDF Databases

• Data Warehouse Approach
  – Translate data into one local database
  – Eliminates unavailability and slow response
  – Allows data processing and optimization
  – Maintenance problem
    • Evolution of content and structure
  – Examples: BioWarehouse, Biozon, DataFoundry

37

Integration of Databases > Approaches

Page 38: Scientific RDF Databases

• Federated Database Approach
  – Translate queries for the individual sources
  – Easier to maintain (e.g. adding a new source)
  – Poor performance
  – Examples: BioKleisli, DiscoveryLink, QIS

38

Integration of Databases > Approaches

Page 39: Scientific RDF Databases

• Semantic Web Approach
  – No need to map data models
  – Relies on standardized ontologies
  – Less work, better performance
  – But only if sources comply

39

Integration of Databases > Approaches

Page 41: Scientific RDF Databases

In Practice

• Scientists need:
  – Access to data
  – Ability to utilize data
  – Ability to handle uncertainty

41

Page 42: Scientific RDF Databases

In Practice

• Linked Open Data:
  – “We all need the same databases, for different decisions or applications”
  – Complements data in internal/licensed sources
  – Stimulates cross-scientific sharing

42

Page 43: Scientific RDF Databases

• Biological data: Human Genome Project
  – Increase in web-accessible databases
    • GenBank, Gene Ontology, UniProt, PhenoDB, ..
  – Integration is the key problem
  – Increase in RDF availability

43

Examples

Page 44: Scientific RDF Databases

• YeastHub
  – Registration of web-accessible databases
    • Metadata according to Dublin Core standards, using RSS 1.0 to describe an ontology
  – Data conversion
    • XML or RDB to RDF conversion
      – (e.g. unique ID = RDF ID, the remaining columns are properties)
  – Data integration
    • Ad hoc RDF queries
    • Form-based queries (supervised)
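
The conversion rule mentioned above (unique ID becomes the RDF ID, the remaining columns become properties) can be sketched in a few lines. This is only an illustration of the rule, not YeastHub's actual code; the base URI, table name, column names and values are placeholders.

```python
def row_to_triples(base_uri, table, key_column, row):
    """Turn one relational row into triples.

    The key column becomes the subject URI; every other non-null
    column becomes a property of that subject.
    """
    subject = "%s%s/%s" % (base_uri, table, row[key_column])
    return [
        (subject, "%s%s#%s" % (base_uri, table, column), value)
        for column, value in row.items()
        if column != key_column and value is not None
    ]

# Hypothetical table and row; URIs, column names and values are placeholders.
row = {"gene_id": "G001", "gene_name": "geneA", "chromosome": "I"}
triples = row_to_triples("http://example.org/", "genes", "gene_id", row)

for triple in triples:
    print(triple)
```

Once every source is converted this way, rows from different databases that share a subject URI can be queried together.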

44

Examples

Page 46: Scientific RDF Databases

• Feasibility
  – Human behavior and personal preferences
• ‘Database hugging’
  – Organizations tend to keep data to themselves
• Censorship and privacy

46

Criticism

Page 47: Scientific RDF Databases

• Is published data reusable in research?
  – Requires:
    • Provenance information
    • Quality
    • Attribution
    • Consistency
    • ...
  – Out-of-context data fails to respect scientific research methodology

47

Criticism

Page 48: Scientific RDF Databases

• Bringing Web 2.0 to bioinformatics (2008). Zhang Zhang, Kei-Hoi Cheung and Jeffrey P. Townsend

• Semantic web approach to database integration in life sciences (2006). Kei-Hoi Cheung, Andrew K. Smith, Kevin Y.L. Yip, Christopher J.O. Baker and Mark B. Gerstein

• Integrating large biomedical knowledge resources with RDF (2007). Satya S. Sahoo, Olivier Bodenreider, Kelly Zeng and Amit Sheth

• RDF/RDFS-based Relational Database Integration (2006). Huajun Chen, Zhaohui Wu, Heng Wang and Yuxin Mao

48

References

Page 49: Scientific RDF Databases

• Has anyone ever worked with linked (RDF) data before? What are your experiences?

• Will the semantic web grow to become the Giant Global Graph?

• Why haven’t RDF databases taken off like Relational Databases?

49

Discussion