Big Data Management: Storing and Querying the Semantic Web. Artem Chebotko, Department of Computer Science, University of Texas – Pan American, chebotkoa@utpa.edu, http://faculty.utpa.edu/chebotkoa. October 17, 2012.

Page 1: October 17, 2012

Big Data Management: Storing and Querying the Semantic Web

Artem Chebotko

Department of Computer Science

University of Texas – Pan American

chebotkoa@utpa.edu

http://faculty.utpa.edu/chebotkoa

October 17, 2012

Page 2: October 17, 2012

Background: Data Management

Data Base

File System

Legacy Database

Relational Database

Object-Oriented Database

XML Database

RDF Database

NoSQL Database

Page 3: October 17, 2012

Background: Big Data

Big Data

Web-Scale Data

Many companies work at this level: Google, Yahoo!, LinkedIn, Facebook, Twitter, Amazon, Walmart, etc.

Many more companies will have to meet Big Data this decade

“As of 2012, about 2.5 exabytes of data are created each day, and the number is doubling every 40 months or so” and

“Walmart collects more than 2.5 petabytes of data every hour” (Harvard Business Review, October 2012)

1 EB = 1,000,000 TB

1 PB = 1,000 TB
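To put the quoted figures on one scale, here is a small arithmetic sketch. It uses only the numbers on this slide (2.5 EB per day, 2.5 PB per hour, doubling every 40 months); the function and constant names are illustrative, and the growth model assumes a smooth exponential, which the quote only approximates with "or so".

```python
# Unit conversions taken from the slide.
TB_PER_PB = 1_000
TB_PER_EB = 1_000_000

def worldwide_tb_per_day():
    """2.5 EB created per day, expressed in TB."""
    return 2.5 * TB_PER_EB

def walmart_tb_per_day():
    """2.5 PB collected per hour, expressed in TB per day."""
    return 2.5 * TB_PER_PB * 24

def growth_factor(months, doubling_months=40):
    """Growth multiplier after `months`, assuming a steady doubling period."""
    return 2 ** (months / doubling_months)

print(worldwide_tb_per_day())   # TB created worldwide per day
print(walmart_tb_per_day())     # TB collected by Walmart per day
print(growth_factor(80))        # multiplier after two doubling periods
```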

Page 4: October 17, 2012

Background: Big Data

What can you do with 1 PB of data?

Data Scientist: The Sexiest Job of the 21st Century (HBR, October 2012)

Data Management Skills

Programming Skills

Data Mining and Data Analysis Skills

Social Skills

Business Understanding

Page 5: October 17, 2012

The Semantic Web – a neat, meaningful mate for the messy, unstructured Big Data

Page 6: October 17, 2012

WWW and Semantic Web

World Wide Web – Web of Linked Documents

Enormous collection of information (Big Data) intended for people to share and use

Keyword-based search

Semantic Web – Web of Data

An emerging vision to make information collected by WWW processable by machines

Computational knowledge-based search/answering

Big Data

Page 7: October 17, 2012

Motivating Example

Web Search Example:

Find a professor in UTPA who authored an article published in Data & Knowledge Engineering in 2009.

This information is available on two different pages of my website

welcome.html publications.html

Page 8: October 17, 2012

Example (cont): traditional search

Google search in Nov. 2009 finds 184 documents

One of them mentions my name

• It is not displayed on the first page of the results

• It contains my name and affiliation, but no information about the DKE article

Google search in Oct. 2010 finds ~19,800 documents

Four of them mention my name and affiliation

• They are not displayed on the first page of the results

• No information about the DKE article

2011 & 2012: no noticeable improvement

Page 9: October 17, 2012

Example (cont): traditional search

What went wrong?

Keyword-based search interprets my query as a list of syntactic words: professor, UTPA, data, article, publish, knowledge, engineering, 2009

It searches for a document that contains as many matching words as possible

PageRank is “biased” towards keyword ‘UTPA’

Moreover, my two pieces of information are viewed as lists of syntactic words. The pieces are not linked!
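The failure described above can be sketched in a few lines. This is a toy word-overlap scorer, not how Google ranks pages; the page contents are hypothetical stand-ins for welcome.html and publications.html, and `keyword_score` is an illustrative name.

```python
def keyword_score(query_words, document_text):
    """Count how many query words occur in the document as exact tokens."""
    doc_words = set(document_text.lower().split())
    return sum(1 for word in query_words if word in doc_words)

# The query as a keyword engine sees it: a flat list of words.
query = ["professor", "utpa", "data", "article", "publish",
         "knowledge", "engineering", "2009"]

# Hypothetical page contents (assumptions made for illustration only).
pages = {
    "welcome.html": "artem chebotko professor utpa computer science",
    "publications.html": "article data knowledge engineering 2009",
}

scores = {name: keyword_score(query, text) for name, text in pages.items()}
for name, score in scores.items():
    print(name, score, "of", len(query))
```

Each page matches only part of the query, and nothing in the scoring connects the professor on one page to the article on the other.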

Page 10: October 17, 2012

Example (cont): semantic search

How can we do better?

Encode the two pieces of information as machine-interpretable data

Link them

Express (automatically) the natural language query in a machine-friendly query language

Page 11: October 17, 2012

Example (cont): encoding

<resource1> <type> <Professor>.
<resource1> <name> "Artem Chebotko".
<resource1> <worksIn> <resource2>.
<resource2> <type> <University>.
<resource2> <name> "UTPA".

<resource3> <type> <Journal>.
<resource3> <title> "Data & …".
<resource3> <published> <resource4>.
<resource4> <type> <Article>.
<resource4> <title> "Semantics …".
<resource4> <year> "2009".
<resource4> <author> <resource5>.

<resource1> <sameAs> <resource5>.
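As a rough illustration of how the sameAs statement links the two pages, the triples can be held as plain tuples and the alias rewritten to a single identifier. This is a minimal sketch: `merge_same_as` is a hypothetical helper that handles only one level of aliasing, and the elided titles are kept exactly as they appear on the slide.

```python
# Each statement is a (subject, predicate, object) tuple.
triples = [
    ("resource1", "type", "Professor"),
    ("resource1", "name", "Artem Chebotko"),
    ("resource1", "worksIn", "resource2"),
    ("resource2", "type", "University"),
    ("resource2", "name", "UTPA"),
    ("resource3", "type", "Journal"),
    ("resource3", "title", "Data & …"),
    ("resource3", "published", "resource4"),
    ("resource4", "type", "Article"),
    ("resource4", "title", "Semantics …"),
    ("resource4", "year", "2009"),
    ("resource4", "author", "resource5"),
    ("resource1", "sameAs", "resource5"),
]

def merge_same_as(statements):
    """Rewrite every alias to its canonical resource and drop the sameAs links."""
    alias = {o: s for s, p, o in statements if p == "sameAs"}
    canon = lambda term: alias.get(term, term)
    return [(canon(s), p, canon(o)) for s, p, o in statements if p != "sameAs"]

merged = merge_same_as(triples)
print(("resource4", "author", "resource1") in merged)
```

After merging, the article's author and the professor are the same node, which is what lets a single query span both sources.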

Page 12: October 17, 2012

Example (cont): linked data!

[Figure: RDF graph of the encoded triples. Nodes resource1 to resource5 are connected to each other and to their values (Professor, University, Journal, Article, "Artem Chebotko", "UTPA", "Data & ...", "Semantics ...", "2009") by edges labeled type, name, worksIn, title, published, year, author, and sameAs.]

Information from two sources was integrated

Page 13: October 17, 2012

Example (cont): query

Find a professor in UTPA who authored an article published in Data & Knowledge Engineering in 2009.

SELECT ?name
WHERE {
  ?p <type> <Professor>.
  ?p <name> ?name.
  ?p <worksIn> ?u.
  ?u <type> <University>.
  ?u <name> "UTPA".
  ?j <type> <Journal>.
  ?j <title> "Data & …".
  ?j <published> ?a.
  ?a <type> <Article>.
  ?a <title> "Semantics …".
  ?a <year> "2009".
  ?a <author> ?p.
}

Result: ?name = "Artem Chebotko"

This is the exact answer to our question
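To make the evaluation concrete, here is a minimal sketch of how such a query could be answered by matching each triple pattern against the data and carrying variable bindings forward. It is a naive nested-loop matcher written for illustration, not a SPARQL engine; the data is the slide's example with resource5 already replaced by resource1 (an assumption standing in for sameAs reasoning), and the function names are hypothetical.

```python
triples = [
    ("resource1", "type", "Professor"),
    ("resource1", "name", "Artem Chebotko"),
    ("resource1", "worksIn", "resource2"),
    ("resource2", "type", "University"),
    ("resource2", "name", "UTPA"),
    ("resource3", "type", "Journal"),
    ("resource3", "title", "Data & …"),
    ("resource3", "published", "resource4"),
    ("resource4", "type", "Article"),
    ("resource4", "title", "Semantics …"),
    ("resource4", "year", "2009"),
    ("resource4", "author", "resource1"),  # resource5 merged into resource1
]

def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` equals `triple`, or return None."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):                      # variable
            if extended.setdefault(term, value) != value:
                return None                           # conflicts with earlier binding
        elif term != value:                           # constant must match exactly
            return None
    return extended

def evaluate(patterns, data):
    """Join the patterns one at a time, keeping only consistent bindings."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [b2 for b in bindings for t in data
                    if (b2 := match_pattern(pattern, t, b)) is not None]
    return bindings

query = [
    ("?p", "type", "Professor"), ("?p", "name", "?name"),
    ("?p", "worksIn", "?u"), ("?u", "type", "University"),
    ("?u", "name", "UTPA"), ("?j", "type", "Journal"),
    ("?j", "title", "Data & …"), ("?j", "published", "?a"),
    ("?a", "type", "Article"), ("?a", "title", "Semantics …"),
    ("?a", "year", "2009"), ("?a", "author", "?p"),
]

names = [b["?name"] for b in evaluate(query, triples)]
print(names)
```

The shared variables (?p, ?u, ?j, ?a) act as join keys, which is why the answer can combine facts that originated on different pages.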

Page 14: October 17, 2012

Semantic Web Technologies

Page 15: October 17, 2012

Semantic Web Current State

Semantic search/indexing: http://sindice.com/

• Over 664 million Semantic Web documents as of October 2012

• ~400 million Semantic Web documents in 2011

• ~140 million Semantic Web documents in 2010

• ~70 million in 2009

Page 16: October 17, 2012

Semantic Web Current State (cont)

Semantic Web datasets:

DBpedia (~2 billion triples)

US Census Data (>1 billion triples)

UniProt (>600 million triples)

BestBuy (>27 million triples)

The Semantic Web can potentially grow to the size of the Web (>22 billion pages)

Page 17: October 17, 2012

Linking Open Data Project (March 2009)

Page 18: October 17, 2012

Linking Open Data Project (Sept 2010)

Page 19: October 17, 2012

Linking Open Data Project (Sept 2011)

Page 20: October 17, 2012

Semantic Web Data Management: Research at UTPA

Page 21: October 17, 2012

Semantic Web Data Management: Research at UTPA

Roadmap:

Research Goals and Current Projects

S2ST

ProvBase

Future Directions

Page 22: October 17, 2012

Research Goal and Current Projects

Goal: efficient storage and querying of large Semantic Web data sets

Projects:

S2ST: Relational RDF Database Management System (RRDBMS) http://s2stproject.cs.panam.edu/

ProvBase: Semantic Web Database in the Cloud http://provbase.cs.panam.edu/

Page 23: October 17, 2012

S2ST Overview

http://s2stproject.cs.panam.edu/

Page 24: October 17, 2012

S2ST Definition

Page 25: October 17, 2012

S2ST Architecture

Page 26: October 17, 2012

S2ST Main Functions

Create logical schema

The user specifies a template for a database schema that will store RDF data

Very flexible. Supports the following approaches to schema design:

• Generic

• Schema-aware

• Schema-oblivious

• Data-driven

• User-driven

• Hybrid
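Two of the listed approaches can be contrasted with a small SQLite sketch. It assumes one common reading of the terms: schema-oblivious means a single generic triple table, and schema-aware means tables derived from the vocabulary (here, one table per property). The table and column names are illustrative and are not S2ST's actual schema templates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Schema-oblivious: one generic table stores every triple.
    CREATE TABLE triple (s TEXT, p TEXT, o TEXT);

    -- Schema-aware: one table per property (names are assumptions).
    CREATE TABLE prop_name (s TEXT, o TEXT);
    CREATE TABLE prop_works_in (s TEXT, o TEXT);
""")

# The same two facts stored under both designs.
conn.executemany("INSERT INTO triple VALUES (?, ?, ?)", [
    ("resource1", "name", "Artem Chebotko"),
    ("resource1", "worksIn", "resource2"),
])
conn.execute("INSERT INTO prop_name VALUES (?, ?)", ("resource1", "Artem Chebotko"))
conn.execute("INSERT INTO prop_works_in VALUES (?, ?)", ("resource1", "resource2"))

# The generic design filters on the predicate column ...
generic = conn.execute(
    "SELECT o FROM triple WHERE s = ? AND p = ?", ("resource1", "name")).fetchall()
# ... while the schema-aware design picks the table instead.
aware = conn.execute(
    "SELECT o FROM prop_name WHERE s = ?", ("resource1",)).fetchall()
print(generic, aware)
```

The generic layout needs no schema changes when new properties appear, whereas the per-property layout keeps each table narrow; a hybrid design would mix the two.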

Page 27: October 17, 2012

S2ST Main Functions (cont)

Schema mapping

Creates physical schema and database schema in an RDBMS.

Data mapping

Maps RDF triples into relational tuples and inserts them into the database

Query mapping

Maps SPARQL queries into SQL that can be evaluated by an RDBMS

Most complex mapping
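The data and query mappings can be illustrated on the simplest possible layout. The sketch below assumes a single triple(s, p, o) table and translates only a basic graph pattern into SQL self-joins; it is not the semantics-preserving translation used by S2ST, and `bgp_to_sql` is a hypothetical helper.

```python
import sqlite3

def bgp_to_sql(select_var, patterns):
    """Translate a basic graph pattern into one SQL query over triple(s, p, o)."""
    first_seen = {}              # variable -> column where it was first bound
    conditions, params = [], []
    for i, pattern in enumerate(patterns):
        for col, term in zip("spo", pattern):
            column = f"t{i}.{col}"
            if term.startswith("?"):
                if term in first_seen:       # repeated variable becomes a join
                    conditions.append(f"{column} = {first_seen[term]}")
                else:
                    first_seen[term] = column
            else:                            # constant becomes a filter
                conditions.append(f"{column} = ?")
                params.append(term)
    tables = ", ".join(f"triple AS t{i}" for i in range(len(patterns)))
    sql = f"SELECT DISTINCT {first_seen[select_var]} FROM {tables}"
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql, params

# Schema mapping: create the table. Data mapping: one tuple per RDF triple.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triple (s TEXT, p TEXT, o TEXT)")
conn.executemany("INSERT INTO triple VALUES (?, ?, ?)", [
    ("resource1", "type", "Professor"),
    ("resource1", "name", "Artem Chebotko"),
    ("resource1", "worksIn", "resource2"),
    ("resource2", "type", "University"),
    ("resource2", "name", "UTPA"),
])

# Query mapping: a four-pattern SPARQL-style query becomes a four-way self-join.
patterns = [
    ("?p", "type", "Professor"),
    ("?p", "name", "?name"),
    ("?p", "worksIn", "?u"),
    ("?u", "name", "UTPA"),
]
sql, params = bgp_to_sql("?name", patterns)
rows = conn.execute(sql, params).fetchall()
print(sql)
print(rows)
```

Even this simple case shows why query mapping is the hardest of the three: every triple pattern adds a join, and features beyond basic patterns need careful handling of unbound values.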

Page 28: October 17, 2012

SPARQL-to-SQL Query Translation

Generic

Reusable

Semantics preserving

Correct

Page 29: October 17, 2012

S2ST Fact Sheet

Next-generation relational RDF store

Relational RDF Database Management System

Supports user-driven schema design like in relational databases

Supports semantics-preserving SPARQL-to-SQL query translation

Supports generic schema, data and query mapping algorithms

Supports ~20 RDBMS backends, including Oracle, DB2, PostgreSQL, MySQL, and SQL Server

Page 30: October 17, 2012

S2ST Applications

VIEW

Scientific workflow provenance metadata management

GEO-SEED

Web services RDF data management

Page 31: October 17, 2012

Future Directions

Inference support

Query optimization

Data mapping algorithms

Data browsing interface

Distributed data management

Testing and performance evaluation

Data and query visualization

Applications

Page 32: October 17, 2012

ProvBase Overview

http://provbase.cs.panam.edu/

Page 33: October 17, 2012

ProvBase: Distributed RDF Provenance Database

Based on Hadoop

Hadoop Wins Terabyte Sort Benchmark: One of Yahoo's Hadoop clusters sorted 1 terabyte of data in 209 seconds, which beat the previous record of 297 seconds in the annual general purpose (daytona) terabyte sort benchmark. This is the first time that either a Java or an open source program has won.

http://hadoop.apache.org

Page 34: October 17, 2012

ProvBase: Distributed RDF Provenance Database

Based on HBase

HBase is the Hadoop database. It's an open-source, distributed, column-oriented store modeled after the Google paper, Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop.

Sample BigTable:

http://hbase.apache.org
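The Bigtable paper describes its data model as a sorted map from (row key, column, timestamp) to a value. The toy class below mimics that shape in memory to show how RDF statements might be laid out in such a store; it is not the HBase client API, `TinyTable` is a hypothetical name, and the row and column layout is an assumption rather than ProvBase's actual design.

```python
from collections import defaultdict

class TinyTable:
    """In-memory imitation of a Bigtable-style map: (row, column, timestamp) -> value."""

    def __init__(self):
        self._cells = defaultdict(dict)   # row key -> {(column, timestamp): value}

    def put(self, row, column, value, timestamp):
        self._cells[row][(column, timestamp)] = value

    def get(self, row, column):
        """Return the most recent version of a cell, or None if it is absent."""
        versions = [(ts, value)
                    for (col, ts), value in self._cells.get(row, {}).items()
                    if col == column]
        return max(versions, key=lambda v: v[0])[1] if versions else None

    def scan(self, start_row, stop_row):
        """Yield row keys in sorted order within [start_row, stop_row)."""
        for row in sorted(self._cells):
            if start_row <= row < stop_row:
                yield row

# Assumed layout: subject as row key, predicate as a column in family "p".
table = TinyTable()
table.put("resource1", "p:name", "Artem Chebotko", timestamp=1)
table.put("resource4", "p:year", "2009", timestamp=1)
table.put("resource4", "p:author", "resource1", timestamp=1)

print(table.get("resource1", "p:name"))
print(list(table.scan("resource1", "resource4")))
```

Because rows are kept sorted by key, range scans over neighboring keys are cheap, so the choice of row key largely determines which queries such a store answers efficiently.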

Page 35: October 17, 2012

ProvBase Architecture

Page 36: October 17, 2012

Future Directions

SPARQL optional graph pattern support

Inference support

Query optimization

GUI

Testing and performance evaluation

Data and query visualization

Applications

Page 37: October 17, 2012

Other Projects

Relational Algebra Toolkit (RAT) http://rat.cs.panam.edu

The University of Texas Provenance Benchmark (UTPB) http://faculty.utpa.edu/chebotkoa/utpb

Student Research Organizer

k-Nearest Keyword Search in RDF Graphs

Page 38: October 17, 2012

Thank You!

Questions?

Artem Chebotko

Department of Computer Science

University of Texas – Pan American

chebotkoa@utpa.edu

http://faculty.utpa.edu/chebotkoa