towards a benchmark for etl workflows panos vassiliadis anastasios karagiannis vasiliki tziovara...
Post on 18-Dec-2015
221 views
TRANSCRIPT
![Page 1: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center](https://reader036.vdocuments.mx/reader036/viewer/2022062421/56649d225503460f949f8f3d/html5/thumbnails/1.jpg)
Towards a Benchmark for ETL Workflows
Panos VassiliadisAnastasios KaragiannisVasiliki Tziovara
Alkis Simitsis
Univ. of Ioannina
Almaden Research Center
![Page 2: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center](https://reader036.vdocuments.mx/reader036/viewer/2022062421/56649d225503460f949f8f3d/html5/thumbnails/2.jpg)
2
Goal of this work
• The goal of this paper is to contribute towards a benchmark that can serve as the experimental testbed for the assessment of ETL methods or tools concerning their basic behavioral properties (measures) over a broad range of ETL workflows.
![Page 3: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center](https://reader036.vdocuments.mx/reader036/viewer/2022062421/56649d225503460f949f8f3d/html5/thumbnails/3.jpg)
3
Roadmap
• Motivation• Goals & parameters of the benchmark• Micro-macro view of an ETL workflow• Open issues
![Page 4: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center](https://reader036.vdocuments.mx/reader036/viewer/2022062421/56649d225503460f949f8f3d/html5/thumbnails/4.jpg)
4
Roadmap
• Motivation• Goals & parameters of the benchmark• Micro-macro view of an ETL workflow• Open issues
![Page 5: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center](https://reader036.vdocuments.mx/reader036/viewer/2022062421/56649d225503460f949f8f3d/html5/thumbnails/5.jpg)
5
Add_SPK1
SUPPKEY=1
SK1
DS.PS1.PKEY, LOOKUP_PS.SKEY,
SUPPKEY
$2€
COST DATE
DS.PS2 Add_SPK2
SUPPKEY=2
SK2
DS.PS2.PKEY, LOOKUP_PS.SKEY,
SUPPKEYCOST DATE=SYSDATE
AddDate CheckQTY
QTY>0
U
DS.PS1
Log
rejected
Log
rejected
A2EDate
NotNULL
Log
rejected
Log
rejected
Log
rejected
DIFF1
DS.PS_NEW1.PKEY,DS.PS_OLD1.PKEYDS.PS_NEW
1
DS.PS_OLD
1
DW.PARTSUPP
Aggregate1
PKEY, DAYMIN(COST)
Aggregate2
PKEY, MONTHAVG(COST)
V2
V1
TIME
DW.PARTSUPP.DATE,DAY
FTP1S1_PARTSU
PP
S2_PARTSUPP
FTP2
DS.PS_NEW
2
DIFF2
DS.PS_OLD
2
DS.PS_NEW2.PKEY,DS.PS_OLD2.PKEY
Sources DW
DSA
ETL workflows
![Page 6: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center](https://reader036.vdocuments.mx/reader036/viewer/2022062421/56649d225503460f949f8f3d/html5/thumbnails/6.jpg)
6
Off-line vs. Active ETL
• Off-line mode: the workflow is executed during a specific time window of some hours (typically at night), when the systems are not servicing their end-users. – Due to the low load of both the source systems and the
warehouse, both the refreshment of data and any other administrative activities (cleanups, auditing, etc) are easier to complete.
• Active mode: the sources continuously (with high frequency) try to send new data to the warehouse. – Not necessarily instantly; rather, small groups of data are
collected and sent to the warehouse for further processing.
• Differences of the two modes:– frequency of the workflow execution– load incurred to all the systems involved.
![Page 7: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center](https://reader036.vdocuments.mx/reader036/viewer/2022062421/56649d225503460f949f8f3d/html5/thumbnails/7.jpg)
7
Problem & Motivation
• When experimenting with ETL wokflows– what are the important problem parameters
& what are the realistic values for them?– what test suites should we use?
• We have faced the problem twice so far:– Logical optimization of the ETL process
(transposition of activities to speed up the workflow – ICDE05, TKDE05)
– Physical optimization of the ETL process (which physical operators to use for optimal execution of the process – DOLAP07)
![Page 8: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center](https://reader036.vdocuments.mx/reader036/viewer/2022062421/56649d225503460f949f8f3d/html5/thumbnails/8.jpg)
8
Logical Optimization
PARTSPARTS1
A2E(DATE)
σ(€COST)
γ(DATE)
1 987
2 5 6
UNN(€COST)
3
$2€($COST)
4
PARTS2
PARTS
PARTS1
σ(€COST)
U
γ(DATE)
1
97
3
2 8_2 6
NN(€COST)
σ(€COST)
8_1
$2€($COST)
4
PARTS2A2E(DATE)
5
Can we push selection early enough?
Can we aggregate before $2€ takes place?
How about naming conflicts?
![Page 9: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center](https://reader036.vdocuments.mx/reader036/viewer/2022062421/56649d225503460f949f8f3d/html5/thumbnails/9.jpg)
9
Problem & Motivation
• Existing standards are insufficient– TPC-H– TPC-DS
• Practical cases are not publishable• We resort in devising our own ad-
hoc test scenarios– either through a specific set of
scenarios– or, through a scenario generator
![Page 10: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center](https://reader036.vdocuments.mx/reader036/viewer/2022062421/56649d225503460f949f8f3d/html5/thumbnails/10.jpg)
10
Roadmap
• Motivation• Goals & parameters of the benchmark• Micro-macro view of an ETL workflow• Open issues
![Page 11: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center](https://reader036.vdocuments.mx/reader036/viewer/2022062421/56649d225503460f949f8f3d/html5/thumbnails/11.jpg)
11
Goal of this work
• We are interested in understanding– The fundamental families of activities
performed in an ETL scenario– The frequent ways with which activities
and recordsets interconnect in an ETL scenario
– The important parameters to be tuned in an experiment & the appropriate values for them
– The appropriate measures to be measured during an experiment
![Page 12: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center](https://reader036.vdocuments.mx/reader036/viewer/2022062421/56649d225503460f949f8f3d/html5/thumbnails/12.jpg)
12
The goal is not …
• To compare existing systems…• To provide specialized performance
measures for very specific tasks in the overall process…
• To exhaustively enumerate all the possible alternatives for specific operations…
• …
![Page 13: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center](https://reader036.vdocuments.mx/reader036/viewer/2022062421/56649d225503460f949f8f3d/html5/thumbnails/13.jpg)
13
Fundamental goals of any ETL workflow
• Effectiveness– Does the workflow execution reach the
maximum possible (or, at least, the minimum tolerable) level of data freshness, completeness and consistency in the warehouse within the necessary time (or resource) constraints?
– Is the workflow execution resilient to occasional failures?
• Efficiency– How fast is the workflow executed?– What resource overheads does the workflow
incur at the source and the warehouse side?
![Page 14: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center](https://reader036.vdocuments.mx/reader036/viewer/2022062421/56649d225503460f949f8f3d/html5/thumbnails/14.jpg)
14
Effectiveness
• Bring the data from the sources to the DW s.t.:– Periodically, as many of them as possible are
loaded. – The loaded data are as fresh as possible– At “checkpoint” timestamps, all data have been
loaded in the warehouse– Data respect both database and business rules.
• Potential problems :– Recovery from failures. – Missing changes at the source. – Invalid records due to failed transactions.
![Page 15: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center](https://reader036.vdocuments.mx/reader036/viewer/2022062421/56649d225503460f949f8f3d/html5/thumbnails/15.jpg)
15
Efficiency
• Data exchange between sources and warehouse is done as fast as possible
• Minimal overhead is incurred – at the source systems. – at the data warehouse.
![Page 16: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center](https://reader036.vdocuments.mx/reader036/viewer/2022062421/56649d225503460f949f8f3d/html5/thumbnails/16.jpg)
16
Problem parameters
• the size of the workflow (i.e., the number of nodes contained in the graph),
• the structure of the workflow (i.e., the variation of the nature of the involved nodes and their interconnection as the workflow graph)
• the size of input data originating from the sources,
• the overall selectivity of the workflow, based on the selectivities of the activities of the workflow,
• the values of probabilities of failure.
![Page 17: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center](https://reader036.vdocuments.mx/reader036/viewer/2022062421/56649d225503460f949f8f3d/html5/thumbnails/17.jpg)
17
Measures
• Measures for data freshness and data consistency– (M1.1) Percentage of data that violate
business rules.– (M1.2) Percentage of data that should
be present at their appropriate warehouse targets, but they are not.
• Measures for the resilience to failures – (M2) Percentage of successfully
resumed workflow executions.
![Page 18: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center](https://reader036.vdocuments.mx/reader036/viewer/2022062421/56649d225503460f949f8f3d/html5/thumbnails/18.jpg)
18
Measures
• Speed of the overall process– (M3.1) Throughput of regular workflow
execution (this may also be measured as total completion time).
– (M3.2) Throughput of workflow execution including a specific percentage of failures and their resumption.
– (M3.3) Average latency per tuple in regular execution.
![Page 19: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center](https://reader036.vdocuments.mx/reader036/viewer/2022062421/56649d225503460f949f8f3d/html5/thumbnails/19.jpg)
19
Measures• Overheads
– (M4.1) Min/Max/Avg/ timeline of memory consumed by the ETL process at the source system.
– (M4.2) Time needed to complete the processing of a certain number of OLTP transactions in the presence (as opposed to the absence) of ETL software at the source, in regular source operation.
– (M4.3) The same as 4.2, but in the case of source failure, where ETL tasks are to be performed too, concerning the recovered data.
– (M4.4) Min/Max/Avg/ timeline of memory consumed by the ETL process at the warehouse system.
– (M4.5) (active warehousing) Time needed to complete the processing of a certain number of decision support queries in the presence (as opposed to the absence) of ETL software at the warehouse, in regular operation.
– (M4.6) The same as M4.5, but in the case of any (source or warehouse) failure, where ETL tasks are to be performed too at the warehouse side.
![Page 20: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center](https://reader036.vdocuments.mx/reader036/viewer/2022062421/56649d225503460f949f8f3d/html5/thumbnails/20.jpg)
20
Goal of this work
• Covered:The important parameters to be tuned in
an experiment & the appropriate values for them
The appropriate measures to be measured during an experiment
• Pending: The fundamental families of activities
performed in an ETL scenario The frequent ways with which activities
and recordsets interconnect in an ETL scenario
![Page 21: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center](https://reader036.vdocuments.mx/reader036/viewer/2022062421/56649d225503460f949f8f3d/html5/thumbnails/21.jpg)
21
Roadmap
• Motivation• Goals & parameters of the benchmark• Micro-macro view of an ETL workflow• Open issues
![Page 22: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center](https://reader036.vdocuments.mx/reader036/viewer/2022062421/56649d225503460f949f8f3d/html5/thumbnails/22.jpg)
22
Formal treatment
• We model an ETL workflow as a directed acyclic graph G(V,E). – Each node vV is either an activity a or
a recordset r. – An edge (a,b)E denotes that b
receives data from node a for further processing.
![Page 23: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center](https://reader036.vdocuments.mx/reader036/viewer/2022062421/56649d225503460f949f8f3d/html5/thumbnails/23.jpg)
23
Formal treatment• Intra-node constraints
– Each recordset r is a pair (r.name, r.schema), with the schema being a finite list of attribute names.
– Each activity a is a tuple (N,I,O,S,A). N is the activity’s name. I is a finite set of input schemata. O is a finite set of output schemata. S is a declarative description of the relationship of its output schema with its input schemata in an appropriate language (without delving into algorithmic or implementation issues). A is the algorithm chosen for activity’s execution.
• Inter-node constraints – The data consumer of a recordset cannot be another
recordset. Still, more than one consumer is allowed for recordsets.
– Each activity must have at least one provider, either another activity or a recordset. When an activity has more than one data providers, these providers can be other activities or activities combined with recordsets.
– Feedback of data is not allowed; i.e., the data consumer of an activity cannot be the same activity.
![Page 24: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center](https://reader036.vdocuments.mx/reader036/viewer/2022062421/56649d225503460f949f8f3d/html5/thumbnails/24.jpg)
24
Resulting problem…
• Due to the intra-node characteristics, we need a “taxonomy” for ETL activities (micro-level)
• Due to the infinite possibilities of connecting nodes (activities and recordsets) we need a set of “design patterns” as abstractions of how frequently encountered ETL graphs look like (macro-level)
![Page 25: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center](https://reader036.vdocuments.mx/reader036/viewer/2022062421/56649d225503460f949f8f3d/html5/thumbnails/25.jpg)
25
Micro level
• Micro level: derive a set of fundamental classes, where frequently encountered activities can be classified
• Why a taxonomy of ETL activities?– Impossible to predict any possible
script / algorithm / operator /…– No algebra available right now
Not necessary only for the benchmark, but orthogonally for other tasks (e.g.,
optimization, statistics, etc)
![Page 26: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center](https://reader036.vdocuments.mx/reader036/viewer/2022062421/56649d225503460f949f8f3d/html5/thumbnails/26.jpg)
26
![Page 27: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center](https://reader036.vdocuments.mx/reader036/viewer/2022062421/56649d225503460f949f8f3d/html5/thumbnails/27.jpg)
27
Logical classification
• Row-level operations, which are locally applied to a single row.
• Router operations, which locally decide, for each row, which of the many (output) destinations it should be sent to.
• Unary Grouper operations, which transform a set of rows to a single row.
• Unary Holistic operations, which perform a transformation to the entire data set. These are usually blocking operations.
• Binary or N-ary operations, which combine many inputs into one output.
![Page 28: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center](https://reader036.vdocuments.mx/reader036/viewer/2022062421/56649d225503460f949f8f3d/html5/thumbnails/28.jpg)
28
Micro level
blocking
semi-blocking
non-blocking
binary# inputs
Final classification
Physical-level characteristics
unary
N-ary
row-level
router grouper
![Page 29: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center](https://reader036.vdocuments.mx/reader036/viewer/2022062421/56649d225503460f949f8f3d/html5/thumbnails/29.jpg)
29
Micro-level
Unary N-ary
Blocking HolisticGrouper
HolisticGrouper
Semi-blocking Grouper Grouper
Non-blocking Row-levelRouter
![Page 30: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center](https://reader036.vdocuments.mx/reader036/viewer/2022062421/56649d225503460f949f8f3d/html5/thumbnails/30.jpg)
30
Macro level
• Even harder!• How to derive a set of typical
structural patterns for an ETL scenario?– Top down: delve to the fundamental
constituents of such a scenario– Bottom up: explore scenarios and try to
abstract common parts
• We did a little bit of both, and derived a fundamental pattern of structure
![Page 31: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center](https://reader036.vdocuments.mx/reader036/viewer/2022062421/56649d225503460f949f8f3d/html5/thumbnails/31.jpg)
31
A butterfly is an ETL workflow that consists of three distinct components:
• Body: a central, detailed point of persistence (fact or dimension table) that is populated with the data produced by the left wing.
• Left wing: sources, activities and intermediate results. Performs extraction, cleaning and transformation + loads the data to the body.
• Right wing: materialized views, reports, spreadsheets, as well as the activities that populate them, to support reporting and analysis
Butterflies to the rescue!
![Page 32: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center](https://reader036.vdocuments.mx/reader036/viewer/2022062421/56649d225503460f949f8f3d/html5/thumbnails/32.jpg)
32
γA,Β
4
V
S
R Z
γA
5
W
100000
100000
sel1=0.6
sel2=0.1
sel4=0.2
p1=0.003
p2=0.004
p4=0.001
σA<600
1
σA>300
2
sel5=0.5
p5=0.005
wv A=A
3
sel3=0.2
p3=0.001
Butterflies to the rescue!
![Page 33: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center](https://reader036.vdocuments.mx/reader036/viewer/2022062421/56649d225503460f949f8f3d/html5/thumbnails/33.jpg)
33
γA,Β
4
V
S
R Z
γA
5
W
100000
100000
sel1=0.6
sel2=0.1
sel4=0.2
p1=0.003
p2=0.004
p4=0.001
σA<600
1
σA>300
2
sel5=0.5
p5=0.005
wv A=A
3
sel3=0.2
p3=0.001
Butterflies to the rescue!
![Page 34: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center](https://reader036.vdocuments.mx/reader036/viewer/2022062421/56649d225503460f949f8f3d/html5/thumbnails/34.jpg)
34
Butterfly classes
• Butterflies constitute a fundamental pattern of reference– Line– Balanced butterfly
• Left-winged variants (heavy of the ETL part)– Primary flow– Wishbone– Tree
• Right winged variants (heavy on the “reporting” part)– Fork
• Irregular variants
![Page 35: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center](https://reader036.vdocuments.mx/reader036/viewer/2022062421/56649d225503460f949f8f3d/html5/thumbnails/35.jpg)
35
1 2 3 5 6 8
Not Null (Part Key, Order Key, Supp Key)
Currency (Ext. Price, Discount,
Tax)Derive Fnc
(Profit)
Sum (Profit), Sum (Ext. Price)Group by (Part Key, Line Status)
Return Status = True
Sum (Profit), Sum (Ext. Price)Group by (Part Key)
11
Line Status <> ‘Delivered’
Line Item.D+
DW.LView01
View02
View03
4DW.D+9
Line Status = ‘Delivered’7
10
12
Line
![Page 36: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center](https://reader036.vdocuments.mx/reader036/viewer/2022062421/56649d225503460f949f8f3d/html5/thumbnails/36.jpg)
36
1 2 3
6
Not Null (CustKey) SK(custkey) PhoneFormat
New - Old
Customer.new
CUSTOMER
7C.D+
Error
4 5
SK(custkey) PhoneFormat
Customer.old
Cnew
Cold
Wishbone
![Page 37: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center](https://reader036.vdocuments.mx/reader036/viewer/2022062421/56649d225503460f949f8f3d/html5/thumbnails/37.jpg)
37
Order
1
2L-status
L-cust3
L-ord
4 DW.Order
SK (OrderStatus)
SK (CustKey)
SK (Order Key)
SCD T1Insert or Update
Primary Flow
![Page 38: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center](https://reader036.vdocuments.mx/reader036/viewer/2022062421/56649d225503460f949f8f3d/html5/thumbnails/38.jpg)
38
7
4SK(custkey)
DIFF I,U
PS1.new PS1new
PS1old
1
Sort(Pkey, SuppKey)
135
DIFF I,U
PS2.new PS2new
PS2old
2
Sort(Pkey, SuppKey)
6
DIFF I,U
PS3.new PS3new
PS3old
3
Sort(Pkey, SuppKey)
10
UNION SORTED (Pkey)
DW.PS11
PS.D 12
Sum (AvailQty)Group by (Part Key)
View04
Insert or Update
Insert or Update
9
SK(custkey)
8
SK(custkey)
Tree
![Page 39: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center](https://reader036.vdocuments.mx/reader036/viewer/2022062421/56649d225503460f949f8f3d/html5/thumbnails/39.jpg)
39
Line Item.D+
2 3 4
DW.Lineitem
6 7
8
12
View05
View06
View08
SK(Part Key, Order Key, Supp Key)
Date Key (Ship Date, Receipt
Date)
Currency (Ext. Price, Discount,
Tax)
Derive Fnc (Profit)
1
Sum (Profit), Sum (Ext. Price)Group by (Part Key, Line Status)
10 View07
Sum (Profit), Sum (Ext. Price)Group by (Part Key, Line Status)
Sum (Profit), Avg (Discount)Group by (Part Key, Supp Key)
Avg (Profit), Avg(Ext. Price)Group by (Part Key, Line Status)5
D+.LI
13
11
9
Fork
![Page 40: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center](https://reader036.vdocuments.mx/reader036/viewer/2022062421/56649d225503460f949f8f3d/html5/thumbnails/40.jpg)
40
Part Supp+
Supplier+
4
1 2
DW.PS
View097
8 10
View12
View10
5 12 14
View13
View11
SK(PartKey, SuppKey) Derive Fnc
(Total Cost)
SK(SuppKey)Phone Format
(Phone)
SP_Supp Key = S_Supp Key
Max (S. C.), Min (S. C.)Group by (Nation Key, Part Key)
Max (S. C.), Min (S. C.)Group by (Part Key)
Sum (T. C.) Group by (Nation Key, Supp Key)
Sum (T. C.)Group by (Supp Key)
3
D+.PS
DW.S 6
D+.S
9 11
1513
Full recomputation for all right-wing views
Balanced Butterfly
![Page 41: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center](https://reader036.vdocuments.mx/reader036/viewer/2022062421/56649d225503460f949f8f3d/html5/thumbnails/41.jpg)
41
SK+RK(partkey)
2
1
DIFFPart.new PART
Part.old 5P.DU
P.D++ 4
Most recent PART3
SK+RK(partkey)
Balanced ButterflySlowly Changing Dimension of
Type II
Not a typical butterfly…
![Page 42: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center](https://reader036.vdocuments.mx/reader036/viewer/2022062421/56649d225503460f949f8f3d/html5/thumbnails/42.jpg)
42
Roadmap
• Motivation• Goals & parameters of the benchmark• Micro-macro view of an ETL workflow• Open issues
![Page 43: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center](https://reader036.vdocuments.mx/reader036/viewer/2022062421/56649d225503460f949f8f3d/html5/thumbnails/43.jpg)
43
Open Issues• Data sizes
– the numbers given by TPC-H can be a valid point of reference for data warehouse contents.
– Important: fraction of source data over the warehouse contents. Values in the range 0.01 to 0.7?
• Selectivity of the left wing of a butterfly – Values between 0.5 and 1.2?
• Failure rates – Range of 10-4 and 10-2?
• Workflow size – Although we provide scenarios of small scale, medium–size
and large-size scenarios are also needed.
• A careful assignment of values based on a large number of real-world case studies (that we do not possess) should be a topic for a full-fledged benchmark.
![Page 44: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center](https://reader036.vdocuments.mx/reader036/viewer/2022062421/56649d225503460f949f8f3d/html5/thumbnails/44.jpg)
44
Open Issues
• Nature of data (now: relational; also: XML, multimedia, …)
• Active vs. off-line modus operandi• Auxiliary structures and processes
(indexes, backup & maintenance scenarios, etc)
• Parallelism and Partitioning
![Page 45: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center](https://reader036.vdocuments.mx/reader036/viewer/2022062421/56649d225503460f949f8f3d/html5/thumbnails/45.jpg)
45
Message to you
• We need a commonly agreed benchmark that realistically reflects real-world ETL scenarios
• Butterflies to the rescue• Feedback from the industry is
necessary – workflow complexity – the frequencies of typically encountered ETL
operations)
![Page 46: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center](https://reader036.vdocuments.mx/reader036/viewer/2022062421/56649d225503460f949f8f3d/html5/thumbnails/46.jpg)
46
Thank you!
All pictures are imported from MS Clipart
![Page 47: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center](https://reader036.vdocuments.mx/reader036/viewer/2022062421/56649d225503460f949f8f3d/html5/thumbnails/47.jpg)
47
Auxiliary slides
![Page 48: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center](https://reader036.vdocuments.mx/reader036/viewer/2022062421/56649d225503460f949f8f3d/html5/thumbnails/48.jpg)
48
![Page 49: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center](https://reader036.vdocuments.mx/reader036/viewer/2022062421/56649d225503460f949f8f3d/html5/thumbnails/49.jpg)
49
V WγA,B
3
R σA>300
1
σB>400
2
Line
![Page 50: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center](https://reader036.vdocuments.mx/reader036/viewer/2022062421/56649d225503460f949f8f3d/html5/thumbnails/50.jpg)
50
γΑ
4
V
S
R
W
σA<600
1
γΒ
2
wv A=A
3
Wishbone
![Page 51: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center](https://reader036.vdocuments.mx/reader036/viewer/2022062421/56649d225503460f949f8f3d/html5/thumbnails/51.jpg)
51
WR σA>300
1
σB>400
2
wv A=A
3
S
Primary Flow
![Page 52: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center](https://reader036.vdocuments.mx/reader036/viewer/2022062421/56649d225503460f949f8f3d/html5/thumbnails/52.jpg)
52
Q
R σA>50 X
S σA<50
2
Y
wv A=A
3
DW
Z
wv A=A
4
1
Tree
![Page 53: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center](https://reader036.vdocuments.mx/reader036/viewer/2022062421/56649d225503460f949f8f3d/html5/thumbnails/53.jpg)
53
γA
2
V
Z
σA>300
1
γA,Β
3
W
γB4
Y
R
Fork
![Page 54: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center](https://reader036.vdocuments.mx/reader036/viewer/2022062421/56649d225503460f949f8f3d/html5/thumbnails/54.jpg)
54
σA=500
3
V
U
γB,C
5
W
TγA
2
γA
6
Q
R σA>300
1γC
4
Z
Right Deep Hierarchy
![Page 55: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center](https://reader036.vdocuments.mx/reader036/viewer/2022062421/56649d225503460f949f8f3d/html5/thumbnails/55.jpg)
55
Body
Right wing Left wing
n1
n2
nm
n1
n2
nk
V
![Page 56: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center](https://reader036.vdocuments.mx/reader036/viewer/2022062421/56649d225503460f949f8f3d/html5/thumbnails/56.jpg)
56
R
t1
t2A B
S
R t2 A B
S
Original state
(a) Sequential execution
t1
R
A B
S
(b) pipelining
t1t2
R1 A1 B1
S1
t1
(c) partitioning
Rk Ak Bk
Sk
tk
...tn
t3
...tn
t3
...tn
t2k...tn
Partitioning & parallelism
![Page 57: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center](https://reader036.vdocuments.mx/reader036/viewer/2022062421/56649d225503460f949f8f3d/html5/thumbnails/57.jpg)
57
Statistics per pattern
Legend:
•N+M (left wing + right wing)
•INCR: incremental maintenance
•I/U: insert and/or update
•FULL: full recomputation
![Page 58: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center](https://reader036.vdocuments.mx/reader036/viewer/2022062421/56649d225503460f949f8f3d/html5/thumbnails/58.jpg)
58