XML Amsterdam 2012 Semantic Technologies for Big Data Marin Dimitrov (Ontotext)

Page 1: Semantic Technologies for Big Data

XML Amsterdam 2012

Semantic Technologies for Big Data

Marin Dimitrov (Ontotext)

Page 2: Semantic Technologies for Big Data

XML Amsterdam 2012

#2 Semantic Technologies for Big Data Sep 2012

Page 3: Semantic Technologies for Big Data

About Ontotext

• Provides products and services for creating, managing and exploiting semantic data

– Founded in 2000

– Offices in Bulgaria, USA and UK

• Major clients and industries

– Media & Publishing (BBC, Press Association)

– HCLS (AstraZeneca, UCB)

– Cultural Heritage (The British Museum, The National Archives, Polish National Museum, Dutch Public Library)

– Defense and Homeland Security

Page 4: Semantic Technologies for Big Data

Outline

• Semantic Technologies for the Enterprise

• Semantic Technologies for Big Data

• Success stories

Page 5: Semantic Technologies for Big Data

SEMANTIC TECHNOLOGIES FOR THE ENTERPRISE

Page 6: Semantic Technologies for Big Data

The need for a smarter Web

• “The Semantic Web is an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation.” (Tim Berners-Lee, 2001)

• “PricewaterhouseCoopers believes a Web of data will develop that fully augments the document Web of today. You’ll be able to find pieces of data sets from different places, aggregate them without warehousing, and analyze them in a more straightforward, powerful way than you can now.” (PWC, May 2009)

Page 7: Semantic Technologies for Big Data

Linked Data

• Linked Data is a set of principles that allows publishing, querying and consumption of RDF data, distributed across different servers

• Design principles

– Use unambiguous identifiers for resources (URIs)

– Use HTTP URIs (dereference-able)

– Provide useful information for URI lookups

– Interlink resources
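
The lookup principle can be sketched as an HTTP request that asks for an RDF description of a resource through content negotiation. This is a minimal illustration, not taken from the talk: the example URI is hypothetical, the helper name `build_lookup_request` is made up, and the request is only built, not sent.

```python
from urllib.request import Request

# Media types a Linked Data client might ask for (Turtle preferred, RDF/XML as fallback).
RDF_MEDIA_TYPES = "text/turtle, application/rdf+xml;q=0.9"


def build_lookup_request(resource_uri: str) -> Request:
    """Build (but do not send) an HTTP GET asking for an RDF description of a resource."""
    if not resource_uri.startswith(("http://", "https://")):
        raise ValueError("Linked Data identifiers should be dereferenceable HTTP URIs")
    return Request(resource_uri, headers={"Accept": RDF_MEDIA_TYPES})


# Hypothetical resource URI, used only for illustration.
request = build_lookup_request("http://example.org/id/amsterdam")
print(request.get_method(), request.full_url)
print("Accept:", request.get_header("Accept"))
```

Passing the request to `urllib.request.urlopen` would perform the actual lookup; a server that follows the principles would answer with triples that mention further HTTP URIs to follow.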

Page 8: Semantic Technologies for Big Data

The Semantic Web timeline

[Figure: timeline of Semantic Web standards and initiatives, 1999 to 2012. Items shown: RDF, DAML+OIL, OWL, SPARQL, SPARQL 1.1, OWL 2, RDFa, RIF, RDB2RDF, LOD, HCLS, SKOS, SAWSDL, RDF 2, PIL, GLD, LDP, SSN]

Page 9: Semantic Technologies for Big Data

Enterprise Information Management Challenges

• Many disparate data sources and data silos

• Many point-to-point interfaces

• Data sources with similar/inconsistent information

• Complex data integration processes inadequate for changing business requirements

• Most of the knowledge is hidden in texts

• Difficult to integrate & analyse structured data and text

Page 10: Semantic Technologies for Big Data

Semantic Web and Linked Data Opportunities for the Enterprise

• Simplify the information integration processes

– Flexible, easy to evolve data model

– Bottom-up / incremental integration

– Efficiently integrate structured and unstructured data

• Provide an enterprise metadata layer

– Unified metadata vocabulary for the enterprise

– Align the legacy data silos

– Improve the information sharing and reuse

Page 11: Semantic Technologies for Big Data

Semantic Web and Linked Data Opportunities for the Enterprise (2)

• Discovery and enrichment of information

– Interlink people, organisations, events, etc.

– Enrich enterprise content with structured annotations

– Discover implicit links and relationships

• Unified access to information within the enterprise

– Simplified infrastructure based on open web standards

• Information interchange across a value chain

– Easy publishing and consumption of Linked Data

• Augments existing IT assets and technologies

– No need for disruptive replacement

Page 12: Semantic Technologies for Big Data

XML and RDF: friends or foes?

• Complement each other

– XML best for content, structure and interchange format

– RDF for metadata layer and semantics

• Typical use case

– Many XML content data sources

  • Content stored in an XML store (XQuery and XSLT)

– Structured data sources & external Linked Data

  • RDF-ized and stored in an RDF store (SPARQL)

– Metadata extracted from content

  • Stored in an RDF store (SPARQL)

  • Semantic search and metadata-driven content delivery
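
The "metadata extracted from content" step can be sketched as follows: the content stays in XML, and a few fields are lifted into subject/predicate/object triples destined for an RDF store. This is an illustrative sketch only. The sample document, the `http://example.org/` namespace and the `extract_metadata` helper are assumptions; the Dublin Core terms namespace is real.

```python
import xml.etree.ElementTree as ET

# Made-up sample document; the full text would remain in the XML store.
ARTICLE_XML = """
<article id="a42">
  <title>Semantic Technologies for Big Data</title>
  <author>Marin Dimitrov</author>
  <body>Full text stays in the XML store.</body>
</article>
"""

BASE = "http://example.org/"      # hypothetical namespace for minted URIs
DC = "http://purl.org/dc/terms/"  # Dublin Core terms namespace


def extract_metadata(xml_text):
    """Return (subject, predicate, object) triples describing one XML article."""
    root = ET.fromstring(xml_text)
    subject = BASE + "article/" + root.attrib["id"]
    triples = []
    for tag, predicate in (("title", DC + "title"), ("author", DC + "creator")):
        element = root.find(tag)
        if element is not None and element.text:
            triples.append((subject, predicate, element.text.strip()))
    return triples


metadata = extract_metadata(ARTICLE_XML)
for triple in metadata:
    print(triple)
```

Only the metadata crosses over into the RDF layer; the `body` element is deliberately left behind, which mirrors the division of labour described on the slide.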

Page 13: Semantic Technologies for Big Data

BBC Sports

(c) BBC

Page 14: Semantic Technologies for Big Data

Added value of RDF

• Explicit semantics

– Intended meaning of entities and relations

• Global identifiers (URIs)

• Simple and flexible graph-based data model

• Easier data mapping & integration

– Bottom-up / incremental data integration with owl:sameAs

• Inference of implicit information

• Working with distributed information

– Linked Data, federated SPARQL
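
Bottom-up integration with owl:sameAs can be illustrated with a small sketch: two silos describe the same entity under different URIs, and a single sameAs link lets their statements be read as one description. The sample URIs and the `merge_same_as` helper are hypothetical. A real RDF store would handle this through inference rather than by rewriting triples, so treat this as a picture of the effect, not of any product's implementation.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def merge_same_as(triples):
    """Rewrite triples so that identifiers linked by owl:sameAs share one canonical URI."""
    parent = {}

    def find(node):
        # Union-find lookup with path halving.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for s, p, o in triples:
        if p == OWL_SAME_AS:
            root_s, root_o = find(s), find(o)
            if root_s != root_o:
                # Arbitrary but deterministic choice: the smaller URI becomes canonical.
                low, high = sorted((root_s, root_o))
                parent[high] = low

    return {(find(s), p, find(o)) for s, p, o in triples if p != OWL_SAME_AS}


# Made-up data from two hypothetical silos plus one link between them.
silo_triples = [
    ("http://crm.example.org/customer/17", "http://example.org/name", "ACME Ltd"),
    ("http://billing.example.org/account/903", "http://example.org/balance", "1200"),
    ("http://crm.example.org/customer/17", OWL_SAME_AS, "http://billing.example.org/account/903"),
]

merged = merge_same_as(silo_triples)
for triple in sorted(merged):
    print(triple)
```

Neither silo had to change its schema or identifiers; adding more links later extends the integration incrementally.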

Page 15: Semantic Technologies for Big Data

Added value of RDF (2)

• Descriptive / agile schema

– Open World Assumption, don’t restrict predicates

– Generated dynamically from data

• Queries based on meaning

– Not depending on structure / order of statements

• Data and queries may use different vocabularies

• Exploratory queries

• Choice of OWL 2 profiles

– Trade-off: features vs. performance

– New profiles may emerge in the future
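
Two of the points above, queries that do not depend on the order of statements and exploratory queries that need no schema knowledge, can be shown with a toy triple-pattern matcher. It is a rough stand-in for a single SPARQL triple pattern, not a SPARQL engine; the `ex:` names and the `match_pattern` helper are made up for illustration.

```python
def match_pattern(triples, pattern):
    """Return one binding dict per triple that matches the pattern.

    Pattern terms starting with '?' are variables; all other terms must match exactly.
    """
    results = []
    for triple in triples:
        bindings = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable used twice must bind to the same value both times.
                if bindings.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            results.append(bindings)
    return results


# Made-up sample data.
triples = [
    ("ex:alice", "ex:worksFor", "ex:acme"),
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:bob", "ex:worksFor", "ex:acme"),
]

# Query by meaning: who works for ex:acme?
employees = match_pattern(triples, ("?person", "ex:worksFor", "ex:acme"))
print(employees)

# Exploratory query: everything known about ex:alice, without knowing the predicates in advance.
about_alice = match_pattern(triples, ("ex:alice", "?p", "?o"))
print(about_alice)
```

Shuffling the input statements changes only the order of the results, never their content, because the match depends on what each statement says rather than on where it sits.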

Page 16: Semantic Technologies for Big Data

SEMANTIC TECHNOLOGIES FOR BIG DATA

Page 17: Semantic Technologies for Big Data

The three V’s of Big Data

• Velocity

– Streaming, sensor, real-time data

– Solution: distributed processing & storage

– Semantic challenge: stream reasoning

• Volume

– Petabytes of data

– Solution: distributed processing & storage

– Semantic challenge: distributed reasoning & querying

• Variety

– Structured, semi-structured and unstructured data

– Semantic Technologies (RDF) are a good fit

Page 18: Semantic Technologies for Big Data

Types of Big Data (NIST)

• Type 1

– Velocity (-), Volume (-), Variety (+)

– Perfect fit for Semantic Technologies

• Type 2

– Velocity and/or Volume, Variety (-)

– Only horizontal scalability required; traditional approaches are a good enough fit

• Type 3

– All V’s

– Semantic Technologies not a good fit yet, but moving in that direction
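
The three types can be written down as a small decision function. This is one reading of the slide, with each V reduced to a yes/no flag; the function name `big_data_type` is made up, and combinations the slide does not mention return `None`.

```python
from typing import Optional


def big_data_type(velocity: bool, volume: bool, variety: bool) -> Optional[int]:
    """Map the three V's to the slide's Type 1, 2 or 3; None if the slide does not cover the case."""
    if variety and not velocity and not volume:
        return 1  # Variety only: described as a perfect fit for Semantic Technologies
    if (velocity or volume) and not variety:
        return 2  # Velocity and/or Volume: horizontal scalability is enough
    if velocity and volume and variety:
        return 3  # All V's: not a good fit yet
    return None


print(big_data_type(velocity=False, volume=False, variety=True))
print(big_data_type(velocity=True, volume=False, variety=False))
print(big_data_type(velocity=True, volume=True, variety=True))
```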

Page 19: Semantic Technologies for Big Data

Semantic Technologies for Volume and Velocity

• Promising ongoing research

• Distributed inference with Hadoop/Storm

• Stream reasoning

– Continuous queries

– Continuous (dynamic) semantics

• SPARQL to Pig translation

• Distributed RDF stores on top of NoSQL

• C-SPARQL, EP-SPARQL, CQELS
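
The idea of a continuous query can be sketched as a standing query that is re-evaluated over a sliding time window as triples arrive. The sketch below does not use the syntax or semantics of C-SPARQL, EP-SPARQL or CQELS; the `WindowedCounter` class, the `ex:` names and the timestamps are assumptions, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque


class WindowedCounter:
    """Continuously counts stream triples with a given predicate inside a sliding time window."""

    def __init__(self, predicate, window_seconds):
        self.predicate = predicate
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def push(self, timestamp, triple):
        """Feed one timestamped triple and return the current count within the window."""
        if triple[1] == self.predicate:
            self._timestamps.append(timestamp)
        # Evict matches that fell out of the window ending at `timestamp`.
        while self._timestamps and self._timestamps[0] <= timestamp - self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps)


# Made-up sensor stream: (timestamp in seconds, triple).
stream = [
    (0, ("ex:sensor1", "ex:observed", "21.5")),
    (4, ("ex:sensor2", "ex:observed", "22.0")),
    (5, ("ex:sensor1", "ex:locatedIn", "ex:room7")),
    (12, ("ex:sensor1", "ex:observed", "21.9")),
]

query = WindowedCounter(predicate="ex:observed", window_seconds=10)
counts = [query.push(ts, triple) for ts, triple in stream]
print(counts)  # [1, 2, 2, 2]
```

Unlike a one-off query against a store, the answer here changes as the window slides: the observation at second 0 stops counting once the stream reaches second 12.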

Page 20: Semantic Technologies for Big Data

Linked Open Data Cloud (Sep 2011)

(c) Cyganiak & Jentzsch

Page 21: Semantic Technologies for Big Data

From Big Linked Data to Linked Big Data

• Big Linked Data

– Big Data approach adopted by the Linked Data community

  • In particular handling Volume and Velocity

– Exponential growth of Linked Data in the last 5 years

• Linked Big Data

– Linked Data approach adopted by the Big Data community

– RDF data model for Variety

– Enrich Big Data with metadata and semantics to enable more powerful analytics on top of it

– Interlink Big Data sets

– Simplify data access and data integration

Page 22: Semantic Technologies for Big Data

SUCCESS STORIES

Page 23: Semantic Technologies for Big Data

Typical Use Cases for Linked Data and Semantic Technologies

• Publish / consume Linked Data across enterprises

– Linked Data is not necessarily free data

– Facilitate data interchange within the value chain

• Information integration within the enterprise

– Integrated asset management / align data silos

– Master Data Management

• Knowledge discovery and semantic search

– Integrate structured and unstructured data

– Enrich and interlink information

– Semantic search and exploration of information

Page 24: Semantic Technologies for Big Data

Semantic Information Integration (Ontotext)

Page 25: Semantic Technologies for Big Data

The National Archives (Ontotext)

• Challenge

– Large archive of various UK Government websites since 1997

– Lots of duplicated information & documents

– Inefficient search & navigation

• Semantic Knowledge Base project goals

– Integrate multiple data sources

– Extract information & metadata from archived documents

– Interlink the web archive with data.gov.uk and LOD data

– Advanced search & navigation of the archive

Page 26: Semantic Technologies for Big Data

The National Archives (Ontotext)

[Figure: architecture of the Semantic Knowledge Base. Labels shown: A, B, C, D; Data Transformation and Integration; Annotation Process (GATE Teamware); Semantic Annotation; Identity Resolution; Semantic Repository; SKB Ontologies; O1, O2, O3; Factual Knowledge (TNA data, LOD, data.gov.uk); Semantic Index; Semantic annotations; Front Ends: 3rd party Ontology Editors, Semantic Search, SPARQL, graph exploration]

Page 27: Semantic Technologies for Big Data

The National Archives (Ontotext)

• The numbers

– 2.5 billion input files

– 40TB compressed archive data

– 10 billion RDF triples stored in OWLIM

– 33,000 EC2 hours used on AWS

– Dynamic EC2 cluster (180 instances average, 500 max)

• Major challenges

– Complex pre-processing of documents

– De-duplication of information & documents

– EC2/RRS performance & reliability
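
A back-of-the-envelope reading of these numbers, as a hedged sketch only: it assumes that "EC2 hours" means instance-hours and that the 180-instance average held for the whole run, neither of which the slide states.

```python
# Figures from the slide.
ec2_instance_hours = 33_000      # assumption: "EC2 hours" are instance-hours
average_instances = 180
rdf_triples = 10_000_000_000
input_files = 2_500_000_000

# Implied elapsed time if the average cluster size applied throughout.
wall_clock_hours = ec2_instance_hours / average_instances
wall_clock_days = wall_clock_hours / 24

# Average number of stored triples per input file.
triples_per_file = rdf_triples / input_files

print(f"about {wall_clock_hours:.0f} hours ({wall_clock_days:.1f} days) of cluster time")
print(f"about {triples_per_file:.0f} triples per input file on average")
```

Under those assumptions the processing corresponds to roughly a week of continuous cluster time, and each input file contributes only a handful of triples on average.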

Page 28: Semantic Technologies for Big Data

Dutch Public Library (Ontotext + Dayon)

• Challenge

– Many disparate data sources, inefficient search

• Goals

– Data integration

– Automated metadata generation

– Open search platform

• Numbers

– 500 heterogeneous data sources

– 40 million cultural heritage artifacts to be described

– 6-8 billion triples to be stored into the knowledge base

Page 29: Semantic Technologies for Big Data

Linked Life Data (Ontotext)

• Challenge

– Disparate, heterogeneous and unaligned data silos lock valuable biomedical information

• Goals

– Semantic warehouse integrating and interlinking public biomedical data sources

– Interactive discovery and exploration

• Numbers

– 25+ heterogeneous biomedical data sources integrated

– 1 billion entities described

– 5.5 billion RDF triples

Page 30: Semantic Technologies for Big Data

Linked Life Data (Ontotext)

Page 31: Semantic Technologies for Big Data

Linked Life Data-as-a-Service (Ontotext)

• More data sources

• Large scale text mining over the LOD cloud

• Adapted for specific use cases

• UCB use case

– 2 billion entities described

– 11 billion RDF triples

Page 32: Semantic Technologies for Big Data

Dynamic Semantic Publishing (Ontotext)

• Challenge

– Difficult & slow to aggregate content from various sources

• Goals

– Metadata generation for news (semantic annotation)

– Interlink & categorize content

– Metadata driven web pages

• Numbers

– Nearly real-time processing & annotation required

– Tens of millions of SPARQL queries to the knowledge base per day

Page 33: Semantic Technologies for Big Data

Trillion RDF triples (Franz Inc.)

• Use case

– Use RDF for the customer management database of a telecom

• Challenge

– 4,000 triples per customer, more than a trillion for the whole customer base

• Numbers

– 1 trillion triples stored in AllegroGraph by Franz Inc.

  • Hardware requirements undisclosed

  • The 310 billion triple result used an 8-CPU system with 2TB RAM
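
The scale behind these figures can be checked with simple arithmetic. This is a sketch under stated assumptions: the trillion is taken as exactly 10^12 and 2TB as 2 × 10^12 bytes, and the derived values are illustrative ratios, not numbers reported by Franz Inc.

```python
# Figures from the slide.
triples_per_customer = 4_000
total_triples = 10**12                 # "more than a trillion", taken here as exactly 10^12

# Customer base at which 4,000 triples each add up to a trillion.
customers_at_one_trillion = total_triples // triples_per_customer

# RAM available per triple in the 310 billion triple configuration.
ram_bytes = 2 * 10**12                 # assumption: 2TB read as decimal terabytes
triples_in_310b_result = 310 * 10**9
ram_bytes_per_triple = ram_bytes / triples_in_310b_result

print(f"{customers_at_one_trillion:,} customers")
print(f"{ram_bytes_per_triple:.2f} bytes of RAM per triple")
```

A customer base of a quarter of a billion is what pushes the store past the trillion mark, and the 310 billion triple system had only a few bytes of RAM available per stored triple.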

Page 34: Semantic Technologies for Big Data

uRiKA (Cray/YarcData)

• Big Data appliance for graph analytics

– Based on the Threadstorm™ architecture

– Up to 8K processors, 512TB RAM, 350TB/hr IO throughput

• In-memory RDF database

• SPARQL 1.0 engine

(c) YarcData

Page 35: Semantic Technologies for Big Data

TAKEAWAYS

Page 36: Semantic Technologies for Big Data

Semantic Technologies for Big Data

• Rich ecosystem of Semantic Technologies since 1999

• Strong Enterprise focus in the last 5 years

• Semantic Technologies provide an opportunity for reducing the cost and complexity of data integration

• Common metadata layer for the enterprise

• More powerful ways to find and explore information

• RDF complements XML within the enterprise

• Semantic Technologies are a good fit for Big Data’s Variety

Page 37: Semantic Technologies for Big Data

Semantic Technologies for Big Data

• Velocity and Volume still challenging for Semantic Technologies, but lots of progress in that direction

• Linked Data will grow into Big Linked Data, but Big Data will also benefit from evolving into Linked Big Data

• Interesting success stories for Semantic Technologies in Big Data scenarios

Page 38: Semantic Technologies for Big Data

THANK YOU!
