a taxonomy of etl activities panos vassiliadis 1, alkis simitsis 2, eftychia baikousi 1 (1)...

30
A Taxonomy of ETL Activities Panos Vassiliadis 1 , Alkis Simitsis 2 , Eftychia Baikousi 1 (1) University of Ioannina {pvassil,ebaikou}@cs.uoi.gr (2) HP Labs [email protected]

Upload: alaina-patterson

Post on 04-Jan-2016

222 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: A Taxonomy of ETL Activities Panos Vassiliadis 1, Alkis Simitsis 2, Eftychia Baikousi 1 (1) University of Ioannina {pvassil,ebaikou}@cs.uoi.gr (2) HP Labs

A Taxonomy of ETL Activities

Panos Vassiliadis1, Alkis Simitsis2, Eftychia Baikousi1

(1) University of Ioannina{pvassil,ebaikou}@cs.uoi.gr

(2) HP [email protected]

Page 2: A Taxonomy of ETL Activities Panos Vassiliadis 1, Alkis Simitsis 2, Eftychia Baikousi 1 (1) University of Ioannina {pvassil,ebaikou}@cs.uoi.gr (2) HP Labs

Outline

Motivation

Normal Form

Taxonomy

Design Patterns

Conclusions

DOLAP'09, Hong Kong, Nov. 2009 2

Page 3: A Taxonomy of ETL Activities Panos Vassiliadis 1, Alkis Simitsis 2, Eftychia Baikousi 1 (1) University of Ioannina {pvassil,ebaikou}@cs.uoi.gr (2) HP Labs

Outline

Motivation

Normal Form

Taxonomy

Design Patterns

Conclusions

DOLAP'09, Hong Kong, Nov. 2009 3

Page 4: A Taxonomy of ETL Activities Panos Vassiliadis 1, Alkis Simitsis 2, Eftychia Baikousi 1 (1) University of Ioannina {pvassil,ebaikou}@cs.uoi.gr (2) HP Labs

DOLAP'09, Hong Kong, Nov. 2009 4

A developer’s problemUlysses is a developer in the IT department of his

company & he has a problem: he was assigned the task to populate a data mart of the organization

with source data =>

He needs to construct an ETL workflow

Page 5: A Taxonomy of ETL Activities Panos Vassiliadis 1, Alkis Simitsis 2, Eftychia Baikousi 1 (1) University of Ioannina {pvassil,ebaikou}@cs.uoi.gr (2) HP Labs

DOLAP'09, Hong Kong, Nov. 2009 5

A developer’s problem

Ulysses has a variety of solutions:Off-the-self ETL toolPL/SQL scripts Embedded SQL in Java, C/C++, …Python, Perl, … scripts…

Page 6: A Taxonomy of ETL Activities Panos Vassiliadis 1, Alkis Simitsis 2, Eftychia Baikousi 1 (1) University of Ioannina {pvassil,ebaikou}@cs.uoi.gr (2) HP Labs

DOLAP'09, Hong Kong, Nov. 2009 6

Ulysses also knows that…

Relational DBMSs perform query optimization using a simple model of data (the relational one)

a (relational) algebra both at the logical and physical levels

Software engineers construct their (O-O) code based onbest practices, or patterns, that allow the efficient construction, maintenance, testing, and execution of code

quality measures that allow the assessment of the quality of the constructed software

Page 7: A Taxonomy of ETL Activities Panos Vassiliadis 1, Alkis Simitsis 2, Eftychia Baikousi 1 (1) University of Ioannina {pvassil,ebaikou}@cs.uoi.gr (2) HP Labs

DOLAP'09, Hong Kong, Nov. 2009 7

Ulysses also knows that…

Relational DBMSs perform query optimization using a simple model of data (the relational one)

a (relational) algebra both at the logical and physical levels

Software engineers construct their (O-O) code based onbest practices, or patterns, that allow the efficient construction, maintenance, testing, and execution of code

quality measures that allow the assessment of the quality of the constructed software

Wouldn’t it be nice if we had similar tools for ETL flows?

Page 8: A Taxonomy of ETL Activities Panos Vassiliadis 1, Alkis Simitsis 2, Eftychia Baikousi 1 (1) University of Ioannina {pvassil,ebaikou}@cs.uoi.gr (2) HP Labs

DOLAP'09, Hong Kong, Nov. 2009 8

Can we help Ulysses? Yes, we can!

We can have a uniform model

to cover a large variety of ETL activities with complicated semantics and provide the framework for the optimization of ETL flows

a taxonomy of ETL activities based on their internals and the elementary components that characterize them with respect to their properties

a set of archetype patterns that combine ETL activities in a practical manner that can be used by an ETL engine for optimization, tuning, parallelization, …

Page 9: A Taxonomy of ETL Activities Panos Vassiliadis 1, Alkis Simitsis 2, Eftychia Baikousi 1 (1) University of Ioannina {pvassil,ebaikou}@cs.uoi.gr (2) HP Labs

Outline

Motivation

Normal Form

Taxonomy

Design Patterns

Conclusions

DOLAP'09, Hong Kong, Nov. 2009 9

Page 10: A Taxonomy of ETL Activities Panos Vassiliadis 1, Alkis Simitsis 2, Eftychia Baikousi 1 (1) University of Ioannina {pvassil,ebaikou}@cs.uoi.gr (2) HP Labs

DOLAP'09, Hong Kong, Nov. 2009 10

Rationale & Summary of Contribution

Is there a uniform, reference way to describe ETL activities?

Can we classify both simple activities and complicated activities in an meaningful manner?

Is it possible to come up with a simple “algebra” of operations for ETL activities and workflows?

Page 11: A Taxonomy of ETL Activities Panos Vassiliadis 1, Alkis Simitsis 2, Eftychia Baikousi 1 (1) University of Ioannina {pvassil,ebaikou}@cs.uoi.gr (2) HP Labs

DOLAP'09, Hong Kong, Nov. 2009 11

Rationale & Summary of Contribution

We introduce a uniform modeling notation it works for all kinds of ETL activities and ETL flows

both, experience and practice verifies that

it is based on a mapping to structure of matter Particle ↔ Simple operation Atom ↔ Simple activity Molecule ↔ Complex activity Compound ↔ ETL flow

typically, atoms and molecules can be represented as scriptsa particle is a certain function call inside the scriptETL tools provide atoms and not molecules

Page 12: A Taxonomy of ETL Activities Panos Vassiliadis 1, Alkis Simitsis 2, Eftychia Baikousi 1 (1) University of Ioannina {pvassil,ebaikou}@cs.uoi.gr (2) HP Labs

DOLAP'09, Hong Kong, Nov. 2009 12

Particle

Output schema

Input schema

A1

A2

A3

A4

A5

A6

A4

A5

A6

A1

A2

A3

A7New field: A7

Rejected fields

ETL Atoms

Merging & Routing Particle

Output schema 1Input schema 1

A1

A2

A3

A4

A5

A6

A4

A5

A6A1

A2

A3

Rejected fields

Output schema 2

A1

A2

A3

Output schema 3

A1

A2

A3

A4

A5

A6

A4

A5

A6

Input schema 2

A1

A2

A3

A4

A5

A6

A7

A8

ETL atom

an ETL activity that performs exactly one job

it involves exactly one ETL particle

it can be unary or n-ary

it may contain multiple output schemata

it may project out input attributes and generate new ones at the output

Page 13: A Taxonomy of ETL Activities Panos Vassiliadis 1, Alkis Simitsis 2, Eftychia Baikousi 1 (1) University of Ioannina {pvassil,ebaikou}@cs.uoi.gr (2) HP Labs

DOLAP'09, Hong Kong, Nov. 2009 13

ETL Molecules

Merging Particle

Output schema 1Input schema 1

A1

A2

A3

A4

A5

A6

A4

A5

A6A1

A2

A3

Rejected fields

Output schema 2

A1

A2

A3

Output schema 3

A1

A2

A3

A4

A5

A6

A4

A5

A6

Input schema 2

A1

A2

A3

A4

A5

A6

A7

A8

Routing Particle

Transformation Particles

ETL molecule

it transfers data from input to output schemata

but there is a linear workflow of particles in between these two groups of schemata.

Page 14: A Taxonomy of ETL Activities Panos Vassiliadis 1, Alkis Simitsis 2, Eftychia Baikousi 1 (1) University of Ioannina {pvassil,ebaikou}@cs.uoi.gr (2) HP Labs

Benefits

A normal form is a single, reference way to theoretically discuss properties of ETL activitiesSeveral results can be expressed under this formalism

logical optimization is a prominent caseeasy to show that results around swapping of activities (ICDE’05, TKDE’05) fit nicely in this paper

e.g., when can we change the particle order within the same molecule? what kind of (inter-/intra-) molecule operations can we do?

compose molecules (to be able to form flows)split or merge molecules (i.e., SW modules)exchange particles between molecules or swap molecules (for optimization reasons)

DOLAP'09, Hong Kong, Nov. 2009 14

Page 15: A Taxonomy of ETL Activities Panos Vassiliadis 1, Alkis Simitsis 2, Eftychia Baikousi 1 (1) University of Ioannina {pvassil,ebaikou}@cs.uoi.gr (2) HP Labs

Benefits

Example inter-/intra-molecule operations

Coupling

Activity swapping

DOLAP'09, Hong Kong, Nov. 2009 15

ba

b

b

b

...

a

abba

aIb,k

ObbOa,j

Page 16: A Taxonomy of ETL Activities Panos Vassiliadis 1, Alkis Simitsis 2, Eftychia Baikousi 1 (1) University of Ioannina {pvassil,ebaikou}@cs.uoi.gr (2) HP Labs

Outline

Motivation

Normal Form

Taxonomy

Design Patterns

Conclusions

DOLAP'09, Hong Kong, Nov. 2009 16

Page 17: A Taxonomy of ETL Activities Panos Vassiliadis 1, Alkis Simitsis 2, Eftychia Baikousi 1 (1) University of Ioannina {pvassil,ebaikou}@cs.uoi.gr (2) HP Labs

DOLAP'09, Hong Kong, Nov. 2009 17

Taxonomy of ETL activities

The taxonomy is based on interrelationship of input-output w.r.t.

schemata

processing of incoming tuples

Exploit taxonomical characteristics forlogical & physical optimization of a workflow

parallelization of activities

any other tuning for improving efficiency, resilience to failures

Page 18: A Taxonomy of ETL Activities Panos Vassiliadis 1, Alkis Simitsis 2, Eftychia Baikousi 1 (1) University of Ioannina {pvassil,ebaikou}@cs.uoi.gr (2) HP Labs

Taxonomy of ETL activities

DOLAP'09, Hong Kong, Nov. 2009 18

blocking

semi-blocking

non-blocking

# inputs

Final classificatio

n

Physical-level characteristics

unary

N-ary

Other

Page 19: A Taxonomy of ETL Activities Panos Vassiliadis 1, Alkis Simitsis 2, Eftychia Baikousi 1 (1) University of Ioannina {pvassil,ebaikou}@cs.uoi.gr (2) HP Labs

DOLAP'09, Hong Kong, Nov. 2009 19

Applicability

Page 20: A Taxonomy of ETL Activities Panos Vassiliadis 1, Alkis Simitsis 2, Eftychia Baikousi 1 (1) University of Ioannina {pvassil,ebaikou}@cs.uoi.gr (2) HP Labs

DOLAP'09, Hong Kong, Nov. 2009 20

Benefits

Example uses of taxonomy as heuristics for optimization1:1

locally processedeasily exchangeable with one another (under conditions)easily parallelizable

N:1aggregators are blocking / semi-blocking at bestorder or hash sensitivehard to parallelize

1:Ntuple producersorder generators

Binary, primary flowssemi-blockingcan be treated as local if right input can be hashed in main memory

Routersstrong possibility to parallelize and enhance pipelining

Restmostly blocking, very hard to include in pipelining

Page 21: A Taxonomy of ETL Activities Panos Vassiliadis 1, Alkis Simitsis 2, Eftychia Baikousi 1 (1) University of Ioannina {pvassil,ebaikou}@cs.uoi.gr (2) HP Labs

Outline

Motivation

Normal Form

Taxonomy

Design Patterns

Conclusions

DOLAP'09, Hong Kong, Nov. 2009 21

Page 22: A Taxonomy of ETL Activities Panos Vassiliadis 1, Alkis Simitsis 2, Eftychia Baikousi 1 (1) University of Ioannina {pvassil,ebaikou}@cs.uoi.gr (2) HP Labs

DOLAP'09, Hong Kong, Nov. 2009 22

Design Patterns

So far, we have given Ulysses the database related foundations

a notation / model to express ETL molecules (activities) in a uniform way

a taxonomy for the particles (elementary operations) that compose molecules

a set of cases where we can show that optimizations can be decided on the grounds of the above

Now we discuss design related foundations

Page 23: A Taxonomy of ETL Activities Panos Vassiliadis 1, Alkis Simitsis 2, Eftychia Baikousi 1 (1) University of Ioannina {pvassil,ebaikou}@cs.uoi.gr (2) HP Labs

DOLAP'09, Hong Kong, Nov. 2009 23

Design patterns for ETL

Apart from a normal form for ETL activities, we strive to establish the benefits of having normal forms for combinations of activities

We have introduced butterflies and a set of ETL design patterns (TPC-TC’09) having ‘appropriate’ combinations of activities based on their taxonomical characteristics

These patterns have been proved to be beneficiary for many purposes (design, efficiency, benchmarking)

Page 24: A Taxonomy of ETL Activities Panos Vassiliadis 1, Alkis Simitsis 2, Eftychia Baikousi 1 (1) University of Ioannina {pvassil,ebaikou}@cs.uoi.gr (2) HP Labs

DOLAP'09, Hong Kong, Nov. 2009 24

γA,Β

4

V

S

R Z

γA

5

W

100000

100000

sel1=0.6

sel2=0.1

sel4=0.2

p1=0.003

p2=0.004

p4=0.001

σA<600

1

σA>300

2

sel5=0.5

p5=0.005

wv A=A

3

sel3=0.2

p3=0.001

Butterflies for ETL

Page 25: A Taxonomy of ETL Activities Panos Vassiliadis 1, Alkis Simitsis 2, Eftychia Baikousi 1 (1) University of Ioannina {pvassil,ebaikou}@cs.uoi.gr (2) HP Labs

DOLAP'09, Hong Kong, Nov. 2009 25

(a) Line (b) Wishbone

(c) Primary Flow (d) Tree

(e) Flat Hierarchy - Fork (f) Deep Hierarchy

Butterfly classes

V WγA,B

3

R σA>300

1

σB>400

2

γΑ

4

V

S

R

W

σA<600

1

γΒ

2

wv A=A

3

WR σA>300

1

σB>400

2

wv A=A

3

SQ

R σA>50 X

S σA<50

2

Y

wv A=A

3

DW

Z

wv A=A

4

1

γA

2

V

Z

σA>300

1

γA,Β

3

W

γB4

Y

R

σA=500

3

V

U

γB,C

5

W

TγA

2

γA

6

Q

R σA>300

1γC

4

Z

Page 26: A Taxonomy of ETL Activities Panos Vassiliadis 1, Alkis Simitsis 2, Eftychia Baikousi 1 (1) University of Ioannina {pvassil,ebaikou}@cs.uoi.gr (2) HP Labs

DOLAP'09, Hong Kong, Nov. 2009 26

Benefits

Design Guidelinesallows designers to communicate designs (document, maintain, …) better

Performancethe synthesis of ETL scenarios should be performed in a manner that allows the underlying engine to exploit them.

Beneficiary for the purpose of benchmarking ETL as well

Page 27: A Taxonomy of ETL Activities Panos Vassiliadis 1, Alkis Simitsis 2, Eftychia Baikousi 1 (1) University of Ioannina {pvassil,ebaikou}@cs.uoi.gr (2) HP Labs

DOLAP'09, Hong Kong, Nov. 2009 27

Benefits in efficiency

In V. Tziovara’s MSc @ UoI (also DOLAP’07) it is shown that we can introduce sorters intentionally to a workflow to exploit common sorting of data for sequences of order-dependent operatorsDifferent butterflies can benefit from different policies

Page 28: A Taxonomy of ETL Activities Panos Vassiliadis 1, Alkis Simitsis 2, Eftychia Baikousi 1 (1) University of Ioannina {pvassil,ebaikou}@cs.uoi.gr (2) HP Labs

DOLAP'09, Hong Kong, Nov. 2009 28

Benefits in scheduling

Different scheduling policies can be used for different butterflies (A. Karagiannis MSc, UoI)

Page 29: A Taxonomy of ETL Activities Panos Vassiliadis 1, Alkis Simitsis 2, Eftychia Baikousi 1 (1) University of Ioannina {pvassil,ebaikou}@cs.uoi.gr (2) HP Labs

Outline

Motivation

Normal Form

Taxonomy

Design Patterns

Conclusions

DOLAP'09, Hong Kong, Nov. 2009 29

Page 30: A Taxonomy of ETL Activities Panos Vassiliadis 1, Alkis Simitsis 2, Eftychia Baikousi 1 (1) University of Ioannina {pvassil,ebaikou}@cs.uoi.gr (2) HP Labs

DOLAP'09, Hong Kong, Nov. 2009 30

Conclusions

We have presented and discusseda uniform model to cover a large variety of ETL activities with complicated semantics and provide the framework for the optimization of ETL flowsa taxonomy for the internals of ETL activities that allows us to treat them as black-boxes in optimization, scheduling, parallelization, …a set of design patterns for combinations of ETL activities in archetype patterns

Future work on optimizationoptimization techniquesalternative physical implementations for the same logical operationsresource allocation and scheduling policiespossibilities for parallelization