
Page 1: ETL Queues for Active Data Warehousing

ETL Queues for Active Data Warehousing

Alexis Karakasidis

Panos Vassiliadis

Evaggelia Pitoura

Dept. of Computer Science, University of Ioannina

Page 2: ETL Queues for Active Data Warehousing

IQIS'05, 17 June 2005, Baltimore MD, USA

Forecast

• We demonstrate that we can employ queue theory to predict the behavior of an Active ETL process

• We discuss implementation issues in order to achieve several nice properties concerning minimal system overhead and high freshness of data

Page 3: ETL Queues for Active Data Warehousing


Contents

• Problem description

• System Architecture & Theoretical Analysis

• Experiments

• Conclusions and Future Work

Page 4: ETL Queues for Active Data Warehousing


Contents

• Problem description

• System Architecture & Theoretical Analysis

• Experiments

• Conclusions and Future Work

Page 5: ETL Queues for Active Data Warehousing


Active Data Warehousing

• Traditionally, data warehouse refreshment has been performed off-line, through Extraction-Transformation-Loading (ETL) software.

• Active Data Warehousing refers to a new trend where data warehouses are updated as frequently as possible, to accommodate the high demands of users for fresh data.

• Issues that come up:
– How to design an Active DW?
– How can we implement an Active DW?

Page 6: ETL Queues for Active Data Warehousing


Issues and Goals of this paper

• Smooth upgrade of the software at the source
– The modification of the software configuration at the source side is minimal.

• Minimal overhead of the source system

• No data losses are allowed

• Maximum freshness of data
– The response time for the transport, cleaning, transformation and loading of a new source record to the DW should be small and predictable

• Stable interface at the warehouse side
– The architecture should scale up with respect to the number of sources and data consumers at the DW

Page 7: ETL Queues for Active Data Warehousing


Contributions

• We set up the architectural framework and the issues that arise for the case of active data warehousing.

• We develop the theoretical framework for the problem, by employing queue theory for the prediction of the performance of the system.

– We provide a taxonomy for ETL tasks that allows treating them as black-box tasks.

– Then, standard queue theory techniques can be applied for the design of an ETL workflow.

• We provide technical solutions for the implementation of our reference architecture, achieving the aforementioned goals

• We prove our results through extensive experimentation.

Page 8: ETL Queues for Active Data Warehousing


Related work

• Obviously, work in the field of ETL is related
– must be customized for active DW

• Streams, due to the nature of the data
– still, all related work is on continuous queries, not updates

• Huge amount of work in materialized view refreshment
– orthogonal to our problem

• Web services
– due to the fact that, in our architecture, the DW exports Web services to the sources

Page 9: ETL Queues for Active Data Warehousing


Contents

• Problem description

• System Architecture & Theoretical Analysis

• Experiments

• Conclusions and Future Work

Page 10: ETL Queues for Active Data Warehousing


ETL workflows

[Figure: an example ETL workflow spanning the Sources, the DSA and the DW. Labels in the figure include the sources S1_PARTSUPP and S2_PARTSUPP, FTP1/FTP2, DS.PS_NEW and DS.PS_OLD snapshots with DIFF1/DIFF2 on PKEY, Add_SPK1/Add_SPK2 (SUPPKEY=1, SUPPKEY=2), SK1/SK2 (lookup through LOOKUP_PS.SKEY), NotNULL, a $2€ conversion on COST, A2EDate on DATE, AddDate (DATE=SYSDATE), CheckQTY (QTY>0), a union (U), the target DW.PARTSUPP, Aggregate1 (PKEY, DAY; MIN(COST)) feeding V1, Aggregate2 (PKEY, MONTH; AVG(COST)) feeding V2, a TIME lookup on DW.PARTSUPP.DATE, and logs of rejected rows.]

Page 11: ETL Queues for Active Data Warehousing


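Queue Theory for ETL

• We can model various kinds of ETL transformations as queues, which we call ETL queues

• Each queue has an incoming arrival rate λ and a mean service time 1/μ

• Little’s Law: N = λ·T (N: mean number of records in the system, T: mean time a record spends in it)

• M/M/1 queue (Poisson arrivals, exponential service times, a single server)
– Mean response time W = 1/(μ − λ)
– Mean queue length L = ρ/(1 − ρ), ρ = λ/μ

[Figure: a queue feeding a single server.]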
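As a quick numeric check of the M/M/1 formulas on this slide, the following Python sketch computes ρ, W and L for a single ETL queue and confirms that they satisfy Little's Law. The function name and the rates used (λ = 20, μ = 25 records per second) are illustrative assumptions, not values from the talk.

```python
def mm1_metrics(arrival_rate, service_rate):
    """Return (utilization, mean response time, mean queue length) for an M/M/1 queue."""
    if arrival_rate >= service_rate:
        raise ValueError("unstable queue: arrival rate must be below service rate")
    rho = arrival_rate / service_rate             # utilization, rho = lambda / mu
    response = 1.0 / (service_rate - arrival_rate)  # W = 1 / (mu - lambda)
    length = rho / (1.0 - rho)                    # L = rho / (1 - rho)
    return rho, response, length


# Illustrative (assumed) rates, in records per second.
rho, response, length = mm1_metrics(arrival_rate=20.0, service_rate=25.0)
print(f"rho={rho:.2f}  W={response:.3f} s  L={length:.2f}")

# Little's Law: L should equal lambda * W.
assert abs(length - 20.0 * response) < 1e-9
```

The function refuses rates with λ ≥ μ, since the formulas only hold for a stable queue.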

Page 12: ETL Queues for Active Data Warehousing


Queue Theory for ETL

• Queues can be combined to form queue networks

• Jackson networks: networks where each queue can be solved independently (under reasonable constraints)

• We can use queue theory to predict the behavior of the Active Data Warehouse

Page 13: ETL Queues for Active Data Warehousing


How to predict the behavior of the Active Data Warehouse

1. Compose ETL queues in a Jackson network to simulate the implementation of the Active Data Staging Area (ADSA)

2. Then, solve the Jackson network and relate the parameters of the ADSA, specifically:
– Source arrival rate (i.e., rate of record production at the source)
– Overall service time (i.e., time that a record spends in the ADSA)
– Mean queue length (i.e., no. of records in the network)
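The two steps above can be sketched for the simplest case: a linear chain of ETL queues in which each activity forwards only a fraction of its input (as a filter would). Under the Jackson-network assumptions, each node is treated as an independent M/M/1 queue. The function name `solve_chain`, the service rates and the pass fractions are hypothetical values chosen for illustration.

```python
def solve_chain(source_rate, stages):
    """Solve a linear chain of M/M/1 ETL queues.

    stages: list of (name, service_rate, pass_fraction) tuples, in flow order.
    Returns (per-node results, total mean number of records in the network).
    """
    results = []
    rate_in = source_rate
    for name, mu, pass_fraction in stages:
        if rate_in >= mu:
            raise ValueError(f"{name} is unstable: arrival rate {rate_in} >= service rate {mu}")
        rho = rate_in / mu
        results.append({
            "node": name,
            "arrival_rate": rate_in,
            "mean_delay": 1.0 / (mu - rate_in),   # W_i = 1 / (mu_i - lambda_i)
            "mean_length": rho / (1.0 - rho),     # L_i = rho_i / (1 - rho_i)
        })
        rate_in *= pass_fraction                  # only accepted records flow downstream
    total_length = sum(r["mean_length"] for r in results)
    return results, total_length


# Hypothetical chain: a filter that passes 90% of its input, then a surrogate-key step.
nodes, total = solve_chain(
    source_rate=1000.0,
    stages=[("filter", 2000.0, 0.9), ("surrogate_key", 1500.0, 1.0)],
)
for n in nodes:
    print(n["node"], round(n["arrival_rate"], 1), round(n["mean_delay"], 5), round(n["mean_length"], 3))

# By Little's Law, the mean time a record entering the chain spends in it is N / lambda.
print("mean records in network:", round(total, 3))
print("mean time in network (s):", round(total / 1000.0, 5))
```

The three printed quantities correspond to the three parameters listed above: source arrival rate, overall time in the ADSA, and mean queue length.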

Page 14: ETL Queues for Active Data Warehousing


Taxonomy of ETL transformations

• Filters

• Transformers

• Binary Operators

• Generic model

[Figure: generic model of an ETL activity (ETL), with edges labeled P_ai and P_ri and a "rejected" output.]

Page 15: ETL Queues for Active Data Warehousing


System Architecture

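[Figure: system architecture. Labels in the figure: Source, with a source flow regulator (S FlowR); ADSA, containing several ETL activities and two WS Clients; DW, exposing two Web services (WS).]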

Page 16: ETL Queues for Active Data Warehousing


Contents

• Problem description

• System Architecture & Theoretical Analysis

• Experiments

• Conclusions and Future Work

Page 17: ETL Queues for Active Data Warehousing


Experimentation environment

• Source: an application in C that uses an ISAM library

• ADSA implemented in Sun JDK 1.4

• Web Services platform:
– Apache Axis 1.1 [AXIS04]
– Xerces XML parser
– Apache Tomcat 1.3.29

• DW implemented over MySQL 4.1

• Configuration:
– Source: PIII 700MHz with 256MB memory, SuSE Linux 8.1
– DW: Pentium 4 2.8GHz with 1GB memory, Mandrake Linux, ADSA included
– Department’s LAN for the network

• Source operates at full capacity

Page 18: ETL Queues for Active Data Warehousing


First set of experiments

• A first set of experiments over a simple configuration, to determine fundamental architectural choices

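• Issues
– Smooth upgrade of the source software
– UDP vs TCP
– Source Overhead
– Data delay
– Topology

[Figure: the simple configuration. Labels in the figure: Source with source flow regulator (S FlowR), ADSA with a WS Client, and a Web service (WS) at the DW.]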

Page 19: ETL Queues for Active Data Warehousing


Experimentation results

• Smooth upgrade: not more than 100 lines of code modified

• UDP resulted in 35% data loss, due to ADSA overflow => TCP a clear choice

• Source overhead is highly dependent on row blocking:
– Source overhead is 1.7% with a source flow regulator, vs 34% without
– WS mode (blocking vs non-blocking) has no effect
– Medium size packets seem to work better
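Row blocking means that records are grouped into packets before they are sent, so the per-send cost is paid once per packet rather than once per record. A minimal sketch of that idea follows. The class name `RowBlocker`, the `send` callback and the block size of 50 are assumptions for illustration and do not reproduce the authors' source flow regulator.

```python
class RowBlocker:
    """Buffers records and hands them to `send` in blocks of `block_size`."""

    def __init__(self, send, block_size=50):
        if block_size < 1:
            raise ValueError("block_size must be at least 1")
        self._send = send
        self._block_size = block_size
        self._buffer = []

    def write(self, record):
        """Queue one record; emit a packet once the buffer is full."""
        self._buffer.append(record)
        if len(self._buffer) >= self._block_size:
            self.flush()

    def flush(self):
        """Emit any buffered records as one (possibly short) packet."""
        if self._buffer:
            self._send(list(self._buffer))
            self._buffer.clear()


packets = []                       # stands in for the connection to the ADSA
blocker = RowBlocker(packets.append, block_size=50)
for i in range(120):
    blocker.write({"row": i})
blocker.flush()                    # send the final partial packet so no record is left behind
print([len(p) for p in packets])   # sizes of the packets that were sent
```

The explicit `flush()` at the end matters for the "no data losses" goal: without it, a trailing partial block would stay in the buffer.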

Page 20: ETL Queues for Active Data Warehousing


Data Freshness

• We count the time to carry all records from source to DW

• We empty the ADSA with 3 policies:
– Immediate transport
– We simulate a slower ADSA by removing 50, 100, 150, 200, 250 and 300 records from the queue every 0.1 sec
– We remove 500, 1000, 1500, 2000, 2500 and 3000 records every 1 sec

• Source max rate is about 1250 records / sec

• Findings:
– Small packet sizes result in small delays
– There is a threshold emptying rate (the source rate) below which the queue grows without bound
– We can achieve data freshness time equal to data insertion time when we continuously empty a small size queue
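The threshold finding can be reproduced with a simple deterministic sketch: records arrive at a constant rate, and a fixed number of records is removed from the queue at every tick. The arrival rate of 1250 records/sec and the removal sizes come from the slide. The constant-rate arrival model, the function name and the 10-second horizon are simplifying assumptions.

```python
def simulate_queue(arrival_per_sec, removed_per_tick, tick_sec, duration_sec):
    """Return the queue size after each tick for constant arrivals and batch removals."""
    sizes = []
    queue = 0.0
    ticks = int(round(duration_sec / tick_sec))
    for _ in range(ticks):
        queue += arrival_per_sec * tick_sec      # records produced during this tick
        queue -= min(queue, removed_per_tick)    # remove at most what is queued
        sizes.append(queue)
    return sizes


for removed in (50, 100, 150):
    sizes = simulate_queue(arrival_per_sec=1250, removed_per_tick=removed,
                           tick_sec=0.1, duration_sec=10)
    print(f"remove {removed} per 0.1 s -> final queue size {sizes[-1]:.0f}")
```

Under these assumptions the queue keeps growing whenever fewer than 125 records are removed per 0.1 sec (that is, below 1250 records/sec) and stays bounded otherwise. A real source is burstier, so measured queues will not be exactly empty above the threshold.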

Page 21: ETL Queues for Active Data Warehousing


Data Freshness

[Figure: queue size over time, emptying rate 150 records per 0.1 sec. Y-axis: size of queue (# elements), 0 to 2000. X-axis: time (secs), 0 to about 75.]

Page 22: ETL Queues for Active Data Warehousing


Data Freshness

[Figure: six panels of queue size over time, one per emptying rate. Y-axis: size of queue (# elements). X-axis: time (secs).
– Emptying rate 50 records per 0.1 sec: y-axis 0 to 70000, x-axis 0 to about 215 secs
– Emptying rate 100 records per 0.1 sec: y-axis 0 to 35000, x-axis 0 to about 111 secs
– Emptying rate 150 records per 0.1 sec: y-axis 0 to 2000, x-axis 0 to about 75 secs
– Emptying rate 200 records per 0.1 sec: y-axis 0 to 2000, x-axis 0 to about 77 secs
– Emptying rate 250 records per 0.1 sec: y-axis 0 to 1600, x-axis 0 to about 77 secs
– Emptying rate 300 records per 0.1 sec: y-axis 0 to 2000, x-axis 0 to about 76 secs]

Page 23: ETL Queues for Active Data Warehousing


Data Freshness

[Figure: time to complete transfer from ADSA to DW. Y-axis: time (secs), 0 to 250. X-axis: queue emptying rate, with ticks at 500, 1000, 1500, 2000, 2500 and 3000.]

Page 24: ETL Queues for Active Data Warehousing


Experiments including transformation scenarios

• We enrich the previous configuration with several ETL activities in the ADSA

• Based on the previous results, we have fixed:
– 2-tier architecture, ADSA at the DW
– Source Flow Regulation with medium size packets
– TCP for network connection
– Non-blocking calling of DW WS’s

Page 25: ETL Queues for Active Data Warehousing


Scenarios to measure data freshness

[Figure: four scenarios, (a) to (d). Each connects a Source with its source flow regulator (S FlowR) to the ADSA and then to Web services (WS) at the DW. The ETL activities in the ADSA, as far as they can be recovered from the figure labels, are:
– (a) a WS Client only
– (b) Filter 10%, SK, GB Sum, WS Client
– (c) Filter 10%, Filter 2%, SK, GB Sum, two WS Clients
– (d) Filter 10%, Filter 6%, Filter 2%, two Replace(ment) activities, SK, GB Sum, two WS Clients]

Page 26: ETL Queues for Active Data Warehousing


Goals of the experiments

• Steadiness of the system
– System is steady whenever service rate is higher than arrival rate; transient effects disappear

• Source overhead
– Medium size blocking is still a winner

• Throughput for ADSA
– The ADSA is only one packet behind the source
– Avg. delay per row ~0.9 msec for all scenarios

• Success of theoretical prediction
– Half a packet underestimation

Page 27: ETL Queues for Active Data Warehousing


Contents

• Problem description

• System Architecture & Theoretical Analysis

• Experiments

• Conclusions and Future Work

Page 28: ETL Queues for Active Data Warehousing


Conclusions

• We can employ queue theory to predict the behavior of an Active ETL process

• We have proposed an architectural configuration with
– Minimal source overhead
– No effect on the source due to the operation of an ADSA
– No packet losses, due to the usage of TCP
– Small delay in the ADSA, especially if row blocking in medium size blocks is used

Page 29: ETL Queues for Active Data Warehousing


Future Work

• Combine our configuration with results in the optimization of ETL processes (ICDE’05)

• Fault tolerance

• Experiment with higher client loads at the warehouse side

• Scale up the number of sources involved

Page 30: ETL Queues for Active Data Warehousing


Thank you!

Page 31: ETL Queues for Active Data Warehousing


Backup Slides

Page 32: ETL Queues for Active Data Warehousing


Grand View

[Figure: grand view. Labels in the figure: three Sources, each with a Source Application, producing plain data; the ADSA, with operators labeled σ, γ, GROUP and SK; and the DW, which receives clean, reconciled, possibly aggregated data to be loaded.]

Page 33: ETL Queues for Active Data Warehousing


Jackson’s Theorem and ETL queues

Jackson’s Theorem. If in an open network the condition λi < µi · mi holds for every i ∈ {1, ..., N} (with mi standing for the number of servers at node i), then the steady-state probability of the network can be expressed as the product of the state probabilities of the individual nodes:

π(k1, …, kN) = π1(k1) π2(k2) ... πN(kN)

Therefore, we can solve this class of networks in four steps:

1. Solve the traffic equations to find λi for each queuing node i

2. Determine separately for each queuing system i its steady-state probabilities πi(ki)

3. Determine the global steady-state probabilities π(k1, …, kN)

4. Derive the desired global performance measures

From step 1, we can derive the mean delay and queue length for each node.
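The four steps can be illustrated for single-server nodes (mi = 1), where each node behaves as an M/M/1 queue with πi(k) = (1 − ρi)·ρi^k and ρi = λi/µi. The sketch below solves the traffic equations by repeated substitution, which is adequate for this small feed-forward example. The routing matrix, the rates and the function names are hypothetical.

```python
def solve_traffic(external, routing, iterations=1000):
    """Solve lambda_i = external_i + sum_j lambda_j * routing[j][i] by iteration."""
    n = len(external)
    lam = list(external)
    for _ in range(iterations):
        lam = [external[i] + sum(lam[j] * routing[j][i] for j in range(n))
               for i in range(n)]
    return lam


def joint_probability(state, lam, mu):
    """Product-form probability pi(k_1, ..., k_N) for single-server nodes."""
    prob = 1.0
    for k, lam_i, mu_i in zip(state, lam, mu):
        rho = lam_i / mu_i
        if rho >= 1.0:
            raise ValueError("Jackson condition lambda_i < mu_i violated")
        prob *= (1.0 - rho) * rho ** k        # M/M/1 marginal: (1 - rho) * rho^k
    return prob


# Hypothetical 3-node network: node 0 sends 90% of its output to node 1,
# node 1 sends everything to node 2, and node 2 delivers to the warehouse.
external = [100.0, 0.0, 0.0]
routing = [[0.0, 0.9, 0.0],
           [0.0, 0.0, 1.0],
           [0.0, 0.0, 0.0]]
mu = [200.0, 150.0, 120.0]

lam = solve_traffic(external, routing)                                   # step 1
print("arrival rates:", lam)
print("P(all queues empty):", joint_probability([0, 0, 0], lam, mu))     # steps 2 and 3
print("mean queue lengths:", [(a / m) / (1 - a / m) for a, m in zip(lam, mu)])  # step 4
```

Because the joint probability factorizes, the per-node mean queue lengths in the last line depend only on the λi found in step 1.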

Page 34: ETL Queues for Active Data Warehousing


Source Code Alterations

Original routine:

Open_isam_File() {
    … opening_isam_file_commands …
}

Altered routine:

Open_isam_File() {
    … opening_isam_file_commands …
    if (open == success)
        DWFlowR_socket_open()
}

Original routine:

Write_record_to_File() {
    … insert_record_commands …
}

Altered routine:

Write_record_to_File() {
    … insert_record_commands …
    if (write == success)
        write_to_SFlowR()
}

Original routine:

Close_isam_File() {
    … closing_isam_file_commands …
}

Altered routine:

Close_isam_File() {
    … closing_isam_file_commands …
    if (close == success)
        DWFlowR_socket_close()
}

Page 35: ETL Queues for Active Data Warehousing


First set of experiments


Page 37: ETL Queues for Active Data Warehousing


Source overhead

[Figure: time to insert 1 000 000 records. Y-axis: completion time (secs), 0 to 1200. X-axis: number of records sent simultaneously (1, 100, 1000). Series: plain, non blocking invocation, blocking invocation.]

Page 38: ETL Queues for Active Data Warehousing


Topology and source overhead

[Figure: time to insert 1 000 000 records to the Source in relation to the topology used. Y-axis: time (secs), 780 to 920. Configurations: plain, 1-tier, 2-tier (Mediator at Source Host), 2-tier (Mediator at DW Host), 3-tier.]

Page 39: ETL Queues for Active Data Warehousing


Second set of experiments

Page 40: ETL Queues for Active Data Warehousing


Source overhead

[Figure: source overhead. Y-axis: time (secs), 0 to 140. Series: plain operation, and packet sizes at source of 1, 10, 25, 50 and 75 rows/packet.]

Page 41: ETL Queues for Active Data Warehousing


Throughput for ETL operations

[Figure: throughput capability of ETL operations. Y-axis: packets / sec, 0 to 500. Operations: Filter 2%, Filter 6%, Filter 10%, Aggregate (group by sum), Transform (Surrogate Key), Transform (Replace).]

Page 42: ETL Queues for Active Data Warehousing


Scenarios to measure data freshness

[Figure: four panels showing the average number of packets in queue over time (seconds, 0 to about 90).
– Scenario a, with various service rates (~20, ~23, ~27 and ~33 packets / sec): y-axis 0 to 160 packets
– Scenario b @ ~23 packets / sec: y-axis 0 to 8 packets; series FILTER_10_01, GBSUM_01, SK_01, WS_01
– Scenario c @ ~23 packets / sec: y-axis 0 to 10 packets; series FILTER_10_01, FILTER_2_01, GBSUM_01, SK_01, WS_GB_01 (listed twice in the legend)
– Scenario d @ ~23 packets / sec: y-axis 0 to 20 packets; series FILTER_10_01, FILTER_2_01, FILTER_6_01, GBSUM_01, REP_01, REP_02, SK_01, WS_GB_01, WS_UPD2_01]

Page 43: ETL Queues for Active Data Warehousing


Data Delay

[Figure: data delay. Y-axis: time (secs), 88.5 to 93. Bars: scenario (a), scenario (b), scenario (c) STORE, scenario (c) GROUPBY, scenario (d) STORE, scenario (d) GROUPBY.]

Page 44: ETL Queues for Active Data Warehousing


Theoretical prediction vs. actual measurements of average queue length for scenario (c) in packets

Activity        Measured    Theoretical Prediction    Difference
FILTER_10_01    0.160       0.056                     0.104
FILTER_02_01    0.134       0.047                     0.087
SK_01           0.154       0.054                     0.100
GB_SUM_01       0.137       0.048                     0.089
WS_GB           0.091       0.031                     0.059
WS_GB_UPD       0.100       0.035                     0.066

Page 45: ETL Queues for Active Data Warehousing


Theoretical Predictions and Actual Measurements

• In most cases, we underestimate the actual queue size by half a packet (i.e., 25 records)

• We overestimate the actual queue size when we simulate slow servers, esp. in the combination of large timeouts and large packets

• Reasons for the discrepancies:
– Simulation of slower rates through timeouts
– Due to the row-blocking approach, the granule of transport is a single packet