designing etl processes using semantic web technologiesdolap06/presentations/09_skoutas.pdf · d....

32
Designing ETL Processes Using Semantic Web Technologies Dimitrios Skoutas Alkis Simitsis {dskoutas,asimi}@dblab.ece.ntua.gr National Technical University of Athens Dept. of Electrical and Computer Engineering http://www.dblab.ece.ntua.gr

Upload: others

Post on 11-Jun-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Designing ETL Processes Using Semantic Web Technologiesdolap06/Presentations/09_skoutas.pdf · D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 8 Overview of our approach

Designing ETL ProcessesUsing Semantic Web Technologies

Dimitrios Skoutas Alkis Simitsis{dskoutas,asimi}@dblab.ece.ntua.gr

National Technical University of AthensDept. of Electrical and Computer Engineering

http://www.dblab.ece.ntua.gr

Page 2: Designing ETL Processes Using Semantic Web Technologiesdolap06/Presentations/09_skoutas.pdf · D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 8 Overview of our approach

D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 2

IntroductionDatastore AnnotationOntology GenerationETL TransformationsConclusions

Outline

Page 3: Designing ETL Processes Using Semantic Web Technologiesdolap06/Presentations/09_skoutas.pdf · D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 8 Overview of our approach

D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 3

IntroductionIntroductionDatastore AnnotationOntology GenerationETL TransformationsConclusions

Outline

Page 4: Designing ETL Processes Using Semantic Web Technologiesdolap06/Presentations/09_skoutas.pdf · D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 8 Overview of our approach

D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 4

Extract-Transform-Load (ETL)

Page 5: Designing ETL Processes Using Semantic Web Technologiesdolap06/Presentations/09_skoutas.pdf · D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 8 Overview of our approach

D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 5

Problem DescriptionConceptual design of ETL processes

is a critical task performed at the early stages of a DW projectdescribe the integration of data from heterogeneous sources into the Data Warehouse

Two main goalsspecify inter-schema mappingsidentify appropriate transformations

Page 6: Designing ETL Processes Using Semantic Web Technologiesdolap06/Presentations/09_skoutas.pdf · D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 8 Overview of our approach

D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 6

MotivationThe problem of heterogeneity in data sources

structural heterogeneitydata stored under different schemata

semantic heterogeneitydifferent naming conventions

e.g., homonyms, synonymsdifferent representation formats

e.g., units of measurement, currencies, encodingsdifferent ranges of values

Page 7: Designing ETL Processes Using Semantic Web Technologiesdolap06/Presentations/09_skoutas.pdf · D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 8 Overview of our approach

D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 7

Lack of precise metadata/documentationoften available in natural language format

Current approaches are not sufficientcommercial tools

focus on the graphical representation and design of the ETL transformations

research works focus on the representation and formal description of the ETL transformations

The identification of the necessary transformations and the inter-schema mappings

needs to be done manuallyis time consuming and error prone

Motivation

Page 8: Designing ETL Processes Using Semantic Web Technologiesdolap06/Presentations/09_skoutas.pdf · D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 8 Overview of our approach

D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 8

Overview of our approachKey idea

an ontology-based approach to facilitate the conceptual design of an ETL scenario

An ontology is a “formal, explicit specification of a shared conceptualization”describes the knowledge in a domain in terms of classes, properties, and relationships between themmachine processableformal semanticsreasoning mechanisms

The Web Ontology Language (OWL) is used as the language for the ontology

W3C recommendationbased on Description Logics

Page 9: Designing ETL Processes Using Semantic Web Technologiesdolap06/Presentations/09_skoutas.pdf · D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 8 Overview of our approach

D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 9

Overview of our approachMethod

Construct an appropriate application vocabulary

Annotate the data sources

Generate the application ontology

Apply reasoning techniques to

select relevant sourcesto identify required transformations

Page 10: Designing ETL Processes Using Semantic Web Technologiesdolap06/Presentations/09_skoutas.pdf · D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 8 Overview of our approach

D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 10

Paris/Rome/Athens

A motivating example

Above 200 Euros

prices in Dollars

From 500 to 1500 Euros

software/hardware

only software products

Paris/Rome/Athens

Rome/Athens

software

hardware

Page 11: Designing ETL Processes Using Semantic Web Technologiesdolap06/Presentations/09_skoutas.pdf · D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 8 Overview of our approach

D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 11

IntroductionDatastoreDatastore AnnotationAnnotationOntology GenerationETL TransformationsConclusions

Outline

Page 12: Designing ETL Processes Using Semantic Web Technologiesdolap06/Presentations/09_skoutas.pdf · D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 8 Overview of our approach

D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 12

Datastore AnnotationCreating the application vocabulary

a set of common terms to describe the sources and the application requirements

V = (VC, VP, VF, VT, fP, fF, fT)

VC: concepts of the domain

VP: concept features fP : VP VC

VF: representation formats fF : VF VP

VT: feature values fT : VT VF ∪ VP

Page 13: Designing ETL Processes Using Semantic Web Technologiesdolap06/Presentations/09_skoutas.pdf · D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 8 Overview of our approach

D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 13

Datastore AnnotationA relational datastore DS comprises

a set of relations RDS

a set of attributes ADS

Relation mappingsfRM : RDS VC

maps each relation to a primitive concept

Attribute mappingsMAM ⊆ ADS × VP

maps attributes to features

Page 14: Designing ETL Processes Using Semantic Web Technologiesdolap06/Presentations/09_skoutas.pdf · D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 8 Overview of our approach

D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 14

Datastore AnnotationAttribute annotation

IR,p = (φ, min, max, T, n, R΄, Γf, Γα)

φ: the representation format

min, max, T: range of values

n: cardinality

R΄: referenced relation (foreign keys)

Γf, Γα: function and attributes used for aggregation

Page 15: Designing ETL Processes Using Semantic Web Technologiesdolap06/Presentations/09_skoutas.pdf · D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 8 Overview of our approach

D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 15

Reference example (cont’d)

Application vocabularyVC = {product, store}

VPproduct = {pid, pName, quantity, price, type, storage}

VPstore = {sid, sName, city, street}

VFpid = {source_pid, dw_pid}

VFsid = {source_sid, dw_sid}

VFprice = {dollars, euros}

VTtype = {software, hardware}

VTcity = {paris, rome, athens}

Datastore mappings

Datastore annotation

Page 16: Designing ETL Processes Using Semantic Web Technologiesdolap06/Presentations/09_skoutas.pdf · D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 8 Overview of our approach

D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 16

IntroductionDatastore AnnotationOntology GenerationOntology GenerationETL TransformationsConclusions

Outline

Page 17: Designing ETL Processes Using Semantic Web Technologiesdolap06/Presentations/09_skoutas.pdf · D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 8 Overview of our approach

D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 17

Ontology GenerationAn OWL ontology is generated based on the application vocabulary and the annotations

The ontology consists ofa set of primitive classes corresponding to the specified concepts, representation formats and ranges or sets of valuesa set of properties corresponding to the specified features of the concepts of the domaina set of defined classes representing the datastores, based on the datastore mappings and annotation

A reasoner is used to infer the class hierarchy

Page 18: Designing ETL Processes Using Semantic Web Technologiesdolap06/Presentations/09_skoutas.pdf · D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 8 Overview of our approach

D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 18

Reference example (cont’d)The class hierarchy Definition for class

DS1_Products

Page 19: Designing ETL Processes Using Semantic Web Technologiesdolap06/Presentations/09_skoutas.pdf · D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 8 Overview of our approach

D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 19

IntroductionDatastore AnnotationOntology GenerationETL TransformationsETL TransformationsConclusions

Outline

Page 20: Designing ETL Processes Using Semantic Web Technologiesdolap06/Presentations/09_skoutas.pdf · D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 8 Overview of our approach

D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 20

Generating ETL transformationsCommon types of ETL operations

Page 21: Designing ETL Processes Using Semantic Web Technologiesdolap06/Presentations/09_skoutas.pdf · D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 8 Overview of our approach

D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 21

Generating ETL transformationsTwo main steps

select relevant sources to populate each DW element

identify required data transformations between the sources and the DW

Page 22: Designing ETL Processes Using Semantic Web Technologiesdolap06/Presentations/09_skoutas.pdf · D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 8 Overview of our approach

D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 22

Generating ETL transformationsSelect relevant sources

a source relation RS, represented by class CS

a target relation RT, represented by class CT

RS is provider for RT, ifCS and CT have a common superclass

ensures that the integrated data records have the same semantics

CS and CT are not disjointprevents data integration between datastores with conflicting constraints

Page 23: Designing ETL Processes Using Semantic Web Technologiesdolap06/Presentations/09_skoutas.pdf · D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 8 Overview of our approach

D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 23

Generating ETL transformationsIdentifying transformations (I)

use project operations to exclude attributes not mapped to the ontology

use concatenate/split operations for n:1/1:n mappings

use join operations for foreign key attributes

Page 24: Designing ETL Processes Using Semantic Web Technologiesdolap06/Presentations/09_skoutas.pdf · D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 8 Overview of our approach

D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 24

Generating ETL transformationsIdentifying transformations (II)

if CS ≡ CT or CS ⊏ CT , no transformations are required

if CT ⊏ CS, for each additional constraint of CT the corresponding select, aggregate or not null transformation is added

else, as previous plus convert operations

Page 25: Designing ETL Processes Using Semantic Web Technologiesdolap06/Presentations/09_skoutas.pdf · D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 8 Overview of our approach

D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 25

Generating ETL transformationsIdentifying transformations (III)

use add operations for missing attributes

use union operations for data coming from multiple sources

use distribute operations to load data to more than one relations in the DW

use detect duplicates operations for target attributes having a not null constraint

Page 26: Designing ETL Processes Using Semantic Web Technologiesdolap06/Presentations/09_skoutas.pdf · D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 8 Overview of our approach

D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 26

Reference example (cont’d)Reasoning on the mappings s(location, city, street)c(street, number, street)

Page 27: Designing ETL Processes Using Semantic Web Technologiesdolap06/Presentations/09_skoutas.pdf · D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 8 Overview of our approach

D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 27

Reference example (cont’d)Reasoning on the definitions

f(pid, Source_Pid, DW_Pid)σ(price, From_500_To_1500)nn(quantity)σ(type, {software})

Page 28: Designing ETL Processes Using Semantic Web Technologiesdolap06/Presentations/09_skoutas.pdf · D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 8 Overview of our approach

D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 28

IntroductionDatastore AnnotationOntology GenerationETL TransformationsConclusionsConclusions

Outline

Page 29: Designing ETL Processes Using Semantic Web Technologiesdolap06/Presentations/09_skoutas.pdf · D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 8 Overview of our approach

D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 29

ConclusionsAn application ontology to

model the domainformally specify the semantics of the datastore schemata

Automated reasoning techniques toidentify relevant sources for populating the DWinfer required conceptual transformations for propagating data from the sources to the DW

Page 30: Designing ETL Processes Using Semantic Web Technologiesdolap06/Presentations/09_skoutas.pdf · D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 8 Overview of our approach

D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 30

Current and Future WorkExtend the approach to non-relational sources

Evaluation on real-world ETL scenarios

Maintenance/adaptation of the ETL workflow

Automated design of the conceptual DW schema

Page 31: Designing ETL Processes Using Semantic Web Technologiesdolap06/Presentations/09_skoutas.pdf · D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 8 Overview of our approach

D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 31

Thank You

Page 32: Designing ETL Processes Using Semantic Web Technologiesdolap06/Presentations/09_skoutas.pdf · D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 8 Overview of our approach

D. Skoutas, A. Simitsis @ DOLAP'06, Arlington, USA, 10/11/2006 32

Questions