extract, transform, load (etl) - ulisboa · pdf fileextract, transform, load (etl) sad2 07/8...

29
H.Galhardas SAD 2007/08 Extract, Transform, Load (ETL) H.Galhardas SAD 2007/08 Overview General ETL issues: Data staging area (DSA) Building dimensions Building fact tables – Extract – Load – Transformation/cleaning Commercial (and open-source) tools The AJAX data cleaning and transformation framework

Upload: truongdieu

Post on 31-Jan-2018

234 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Extract, Transform, Load (ETL) - ULisboa · PDF fileExtract, Transform, Load (ETL) SAD2 07/8 H.Galhardas Overview •General ETL issues: –Data staging area (DSA) –Building dimensions

H.GalhardasSAD 2007/08

Extract, Transform, Load (ETL)

H.GalhardasSAD 2007/08

Overview• General ETL issues:

– Data staging area (DSA)– Building dimensions– Building fact tables– Extract– Load– Transformation/cleaning

• Commercial (and open-source) tools• The AJAX data cleaning and transformation

framework

Page 2: Extract, Transform, Load (ETL) - ULisboa · PDF fileExtract, Transform, Load (ETL) SAD2 07/8 H.Galhardas Overview •General ETL issues: –Data staging area (DSA) –Building dimensions

H.GalhardasSAD 2007/08

DW PhasesDesign phase

– Modeling, DB design, source selection,…Loading phase

– First load/population of the DW– Based on all data in sources

Refreshment phase– Keep the DW up-to-date wrt. source data

changes

H.GalhardasSAD 2007/08

The ETL Process• The most underestimated process in DW development

• The most time-consuming process in DW development– Often, 80% of development time is spent on ETL

Extract– Extract relevant data

Transform– Transform data to DW format– Build keys, etc.– Cleansing of data

Load– Load data into DW– Build aggregates, etc.

Page 3: Extract, Transform, Load (ETL) - ULisboa · PDF fileExtract, Transform, Load (ETL) SAD2 07/8 H.Galhardas Overview •General ETL issues: –Data staging area (DSA) –Building dimensions

H.GalhardasSAD 2007/08

DataWarehouse

ExtractTransformLoadRefresh

OLAP Engine

AnalysisQueryReportsData mining

Monitor&

IntegratorMetadata

Data Sources Front-End Tools

Serve

Data Marts

Operational DBs

othersources

Data Storage

OLAP Server

Data Staging

DW ArchitectureDW Architecture

H.GalhardasSAD 2007/08

Operational DBs Data

Warehouse

othersources

ExtractTransformLoadRefresh

Data Staging

ETL processETL process

Implemented using anETL tool!

Ex: SQLServer 2005Integration Services

Page 4: Extract, Transform, Load (ETL) - ULisboa · PDF fileExtract, Transform, Load (ETL) SAD2 07/8 H.Galhardas Overview •General ETL issues: –Data staging area (DSA) –Building dimensions

H.GalhardasSAD 2007/08

Data Staging Area

• Transit storage for data underway in the ETLprocess– Transformations/cleansing done here

• No user queries (some do it)• Sequential operations (few) on large data volumes

– Performed by central ETL logic– Easily restarted– No need for locking, logging, etc.– RDBMS or flat files? (DBMS have become better at this)

• Finished dimensions copied from DSA to relevantmarts

H.GalhardasSAD 2007/08

ETL construction processPlan

1)Make high-level diagram of source-destination flow2)Test, choose and implement ETL tool3)Outline complex transformations, key generation and job

sequence for every destination tableConstruction of dimensions

4)Construct and test building static dimension5)Construct and test change mechanisms for one dimension6)Construct and test remaining dimension builds

Construction of fact tables and automation7)Construct and test initial fact table build8)Construct and test incremental update9)Construct and test aggregate build10)Design, construct, and test ETL automation

Page 5: Extract, Transform, Load (ETL) - ULisboa · PDF fileExtract, Transform, Load (ETL) SAD2 07/8 H.Galhardas Overview •General ETL issues: –Data staging area (DSA) –Building dimensions

H.GalhardasSAD 2007/08

Building Dimensions

Static dimension table– Assignment of keys: production keys to DW using

table– Combination of data sources: find common key?

Handling dimension changes– Slowly changing dimensions– Find newest DW key for a given production key– Table for mapping production keys to DW keys

must be updatedLoad of dimensions

– Small dimensions: replace– Large dimensions: load only changes

H.GalhardasSAD 2007/08

Building fact tables

Two types of load:• Initial load

– ETL for all data up till now– Done when DW is started the first time– Often problematic to get correct historical data– Very heavy -large data volumes

• Incremental update– Move only changes since last load– Done periodically (../month/week/day/hour/...) after DW start– Less heavy -smaller data volumes

• Dimensions must be updated before facts– The relevant dimension rows for new facts must be in place– Special key considerations if initial load must be performed

again

Page 6: Extract, Transform, Load (ETL) - ULisboa · PDF fileExtract, Transform, Load (ETL) SAD2 07/8 H.Galhardas Overview •General ETL issues: –Data staging area (DSA) –Building dimensions

H.GalhardasSAD 2007/08

Types of data sourcesNon-cooperative sources�Snapshot sources –provides only full copy of source�Specific sources –each is different, e.g., legacy systems�Logged sources –writes change log (DB log)�Queryable sources –provides query interface, e.g., SQLCooperative sources�Replicated sources –publish/subscribe mechanism�Call back sources –calls external code (ETL) when changes

occur�Internal action sources –only internal actions when changes

occur (DB triggers is an example)

Extract strategy is very dependent on the source types

H.GalhardasSAD 2007/08

Extract phaseGoal: fast extract of relevant data�Extract from source systems can take a long time

Types of extracts:�Extract applications (SQL): co-existence with other

applications�DB unload tools: must faster than SQL-based extracts�Extract applications sometimes the only solution

Often too time consuming to ETL– Extracts can take days/weeks– Drain on the operational systems– Drain on DW systems

=> Extract/ETL only changes since last load (delta)

Page 7: Extract, Transform, Load (ETL) - ULisboa · PDF fileExtract, Transform, Load (ETL) SAD2 07/8 H.Galhardas Overview •General ETL issues: –Data staging area (DSA) –Building dimensions

H.GalhardasSAD 2007/08

Computing deltas• Much faster to only ”ETL” changes since last load

A number of methods can be used• Store sorted total extracts in DSA

– Delta can easily be computed from current+last extract+ Always possible+ Handles deletions- Does not reduce extract time

• Put update timestamp on all rows– Updated by DB trigger– Extract only where ”timestamp > time for last extract”+ Reduces extract time+/- Less operational overhead- Cannot (alone) handle deletions- Source system must be changed

H.GalhardasSAD 2007/08

Load (1)Goal: fast loading into DW

– Loading deltas is much faster than total load

• SQL-based update is slow– Large overhead (optimization,locking,etc.) for every SQL call– DB load tools are much faster– Some load tools can also perform UPDATEs

• Index on tables slows load a lot– Drop index and rebuild after load– Can be done per partition

• Parallellization– Dimensions can be loaded concurrently– Fact tables can be loaded concurrently– Partitions can be loaded concurrently

Page 8: Extract, Transform, Load (ETL) - ULisboa · PDF fileExtract, Transform, Load (ETL) SAD2 07/8 H.Galhardas Overview •General ETL issues: –Data staging area (DSA) –Building dimensions

H.GalhardasSAD 2007/08

Load (2)• Relationships in the data

– Referential integrity must be ensured– Can be done by loader

• Aggregates– Must be built and loaded at the same time as the detail data– Today, RDBMSes can often do this

• Load tuning– Load without log– Sort load file first– Make only simple transformations in loader– Use loader facilities for building aggregates– Use loader within the same database

• Should DW be on-line 24*7?– Use partitions or several sets of tables

H.GalhardasSAD 2007/08

Overview• General ETL issues:

– Data staging area (DSA)– Building dimensions– Building fact tables– Extract– Load– Transformation/cleaning

• Commercial (and open-source) tools• The AJAX data cleaning and transformation

framework

Page 9: Extract, Transform, Load (ETL) - ULisboa · PDF fileExtract, Transform, Load (ETL) SAD2 07/8 H.Galhardas Overview •General ETL issues: –Data staging area (DSA) –Building dimensions

H.GalhardasSAD 2007/08

Data Cleaning

Activity of converting source data into targetdata without errors, duplicates, andinconsistencies, i.e.,

Cleaning and Transforming to get…

High-quality data!

H.GalhardasSAD 2007/08

Why Data Cleaning andTransformation?

Data in the real world is dirtyincomplete: lacking attribute values, lacking certain

attributes of interest, or containing only aggregate data• e.g., occupation=“”

noisy: containing errors or outliers (spelling, phoneticand typing errors, word transpositions, multiple valuesin a single free-form field)

• e.g., Salary=“-10”inconsistent: containing discrepancies in codes or

names (synonyms and nicknames, prefix and suffixvariations, abbreviations, truncation and initials)

• e.g., Age=“42” Birthday=“03/07/1997”• e.g., Was rating “1,2,3”, now rating “A, B, C”• e.g., discrepancy between duplicate records

Page 10: Extract, Transform, Load (ETL) - ULisboa · PDF fileExtract, Transform, Load (ETL) SAD2 07/8 H.Galhardas Overview •General ETL issues: –Data staging area (DSA) –Building dimensions

H.GalhardasSAD 2007/08

Why Is Data Dirty?

• Incomplete data comes from:– non available data value when collected– different criteria between the time when the data was collected

and when it is analyzed.– human/hardware/software problems

• Noisy data comes from:– data collection: faulty instruments– data entry: human or computer errors– data transmission

• Inconsistent (and redundant) data comes from:– Different data sources, so non uniform naming conventions/data

codes– Functional dependency and/or referential integrity violation

H.GalhardasSAD 2007/08

Why Is Data CleaningImportant?

• Data warehouse needs consistentintegration of quality data– Data extraction, cleaning, and transformation comprises

the majority of the work of building a data warehouse

• No quality data, no quality decisions!– Quality decisions must be based on quality data (e.g.,

duplicate or missing data may cause incorrect or evenmisleading statistics)

Page 11: Extract, Transform, Load (ETL) - ULisboa · PDF fileExtract, Transform, Load (ETL) SAD2 07/8 H.Galhardas Overview •General ETL issues: –Data staging area (DSA) –Building dimensions

H.GalhardasSAD 2007/08

Types of data cleaning

• Conversion, parsing and normalization– Text coding, date formats, etc.– Most common type of cleansing

• Special-purpose cleansing– Normalize spellings of names, addresses, etc.– Remove duplicates, e.g., duplicate customers– Domain-independent cleansing

• Approximate, ”fuzzy” joins on not-quite-matchingkeys

H.GalhardasSAD 2007/08

Data quality vs cleaning• Data quality = Data cleaning +

– Data enrichment• enhancing the value of internally held data by appending related

attributes from external sources (for example, consumerdemographic attributes or geographic descriptors).

– Data profiling• analysis of data to capture statistics (metadata) that provide

insight into the quality of the data and aid in the identification ofdata quality issues.

– Data monitoring• deployment of controls to ensure ongoing conformance of data to

business rules that define data quality for the organization.– Data stewards responsible for data quality– DW-controlled improvement– Source-controlled improvement– Construct programs to check data quality

Page 12: Extract, Transform, Load (ETL) - ULisboa · PDF fileExtract, Transform, Load (ETL) SAD2 07/8 H.Galhardas Overview •General ETL issues: –Data staging area (DSA) –Building dimensions

H.GalhardasSAD 2007/08

Overview• General ETL issues:

– Data staging area (DSA)– Building dimensions– Building fact tables– Extract– Load– Transformation/cleaning

• Commercial (and open-source) tools• The AJAX data cleaning and transformation

framework

H.GalhardasSAD 2007/08

ETL toolsETL tools from the big vendors, e.g.,

– �Oracle Warehouse Builder– �IBM DB2 Warehouse Manager– �Microsoft Integration Services

Offer much functionality at a reasonable price (included…)– �Data modeling– �ETL code generation– �Scheduling DW jobs– �…

Many others– �Hundreds of tools– �Often specialized tools for certain jobs (insurance cleansing,…)

The ”best” tool does not exist– �Choose based on your own needs– �Check first if the ”standard tools”from the big vendors are ok

Page 13: Extract, Transform, Load (ETL) - ULisboa · PDF fileExtract, Transform, Load (ETL) SAD2 07/8 H.Galhardas Overview •General ETL issues: –Data staging area (DSA) –Building dimensions

H.GalhardasSAD 2007/08

ETL and data quality tools• http://www.etltool.com/• Magic Quadrant for Data Quality Tools, 2007• Some open source ETL tools:

– Talend– Enhydra Octopus– Clover.ETL

• Not so many open source quality/cleaningtools

H.GalhardasSAD 2007/08

Application context

Integrate data from different sources• E.g.,populating a DW from different operational data stores

• Eliminate errors and duplicates within a single source• E.g., duplicates in a file of customers

• Migrate data from a source schema into a differentfixed target schema

• E.g., discontinued application packages

• Convert poorly structured data into structured data• E.g., processing data collected from the Web

Page 14: Extract, Transform, Load (ETL) - ULisboa · PDF fileExtract, Transform, Load (ETL) SAD2 07/8 H.Galhardas Overview •General ETL issues: –Data staging area (DSA) –Building dimensions

H.GalhardasSAD 2007/08

The AJAX data transformationand cleaning framework

H.GalhardasSAD 2007/08

Motivating example (1)

DirtyData(paper:String)

Data Cleaning & Transformation

Events(eventKey, name)

Publications(pubKey, title, eventKey, url, volume, number, pages, city, month, year)Authors(authorKey, name)

PubsAuthors(pubKey, authorKey)

Page 15: Extract, Transform, Load (ETL) - ULisboa · PDF fileExtract, Transform, Load (ETL) SAD2 07/8 H.Galhardas Overview •General ETL issues: –Data staging area (DSA) –Building dimensions

H.GalhardasSAD 2007/08

Motivating example (2)

[1] Dallan Quass, Ashish Gupta, Inderpal Singh Mumick, and JenniferWidom. Making Views Self-Maintainable for Data Warehousing. InProceedings of the Conference on Parallel and DistributedInformation Systems. Miami Beach, Florida, USA, 1996[2] D. Quass, A. Gupta, I. Mumick, J. Widom, Making views self-maintianable for data warehousing, PDIS’95

DirtyData

Data Cleaning & Transformation

PDIS | Conference onParallel andDistributedInformation Systems

Events

QGMW96| Making Views Self-Maintainablefor Data Warehousing |PDIS| null | null | null | null | Miami Beach | Florida, USA | 1996

PublicationsAuthors

DQua | Dallan Quass

AGup | Ashish Gupta

JWid | Jennifer Widom…..

QGMW96 | DQua

QGMW96 | AGup….

PubsAuthors

H.GalhardasSAD 2007/08

Modeling a data cleaning process

A data cleaning process is modeled by a directed acyclic graph of data transformations

DirtyData

DirtyAuthors

Authors

Duplicate Elimination

Extraction

Standardization

Formatting

DirtyTitles... DirtyEvents

CitiesTags

Page 16: Extract, Transform, Load (ETL) - ULisboa · PDF fileExtract, Transform, Load (ETL) SAD2 07/8 H.Galhardas Overview •General ETL issues: –Data staging area (DSA) –Building dimensions

H.GalhardasSAD 2007/08

Existing technology• Ad-hoc programs written in a programming language

like C or Java or using an RDBMS proprietarylanguage– Programs difficult to optimize and maintain

• Data transformation scripts using an ETL(Extraction-Transformation-Loading) or a data quality tool

H.GalhardasSAD 2007/08

Problems of ETL and data qualitysolutions (1)

The semantics of some data transformationsis defined in terms of their implementationalgorithms

App. Domain 1

App. Domain 2

App. Domain 3

Data cleaning transformations

...

Page 17: Extract, Transform, Load (ETL) - ULisboa · PDF fileExtract, Transform, Load (ETL) SAD2 07/8 H.Galhardas Overview •General ETL issues: –Data staging area (DSA) –Building dimensions

H.GalhardasSAD 2007/08

There is a lack of interactive facilities to tunea data cleaning application program

Problems of ETL and data qualitysolutions (2)

Dirty Data

Cleaning process

Clean data Rejected data

H.GalhardasSAD 2007/08

AJAX features• An extensible data quality framework

– Logical operators as extensions of relational algebra– Physical execution algorithms

• A declarative language for logical operators– SQL extension

• A debugger facility for tuning a data cleaningprogram application– Based on a mechanism of exceptions

Page 18: Extract, Transform, Load (ETL) - ULisboa · PDF fileExtract, Transform, Load (ETL) SAD2 07/8 H.Galhardas Overview •General ETL issues: –Data staging area (DSA) –Building dimensions

H.GalhardasSAD 2007/08

AJAX features• An extensible data quality framework

– Logical operators as extensions of relational algebra– Physical execution algorithms

• A declarative language for logical operators– SQL extension

• A debugger facility for tuning a data cleaningprogram application– Based on a mechanism of exceptions

H.GalhardasSAD 2007/08

Logical level: parametricoperators

• View: arbitrary SQL query• Map: iterator-based one-to-many mapping with

arbitrary user-defined functions• Match: iterator-based approximate join• Cluster: uses an arbitrary clustering function• Merge: extends SQL group-by with user-defined

aggregate functions• Apply: executes an arbitrary user-defined

algorithm

Map Match

Merge

ClusterView

Apply

Page 19: Extract, Transform, Load (ETL) - ULisboa · PDF fileExtract, Transform, Load (ETL) SAD2 07/8 H.Galhardas Overview •General ETL issues: –Data staging area (DSA) –Building dimensions

H.GalhardasSAD 2007/08

Logical level

DirtyData

DirtyAuthors

Authors

Duplicate Elimination

Extraction

Standardization

Formatting

DirtyTitles...

CitiesTags

H.GalhardasSAD 2007/08

Logical level

DirtyData

DirtyAuthors

Map

Cluster

Match

Merge

Authors

Map

Map

Duplicate Elimination

Extraction

Standardization

Formatting

DirtyTitles...

CitiesTags

DirtyData

DirtyAuthors

TC

NL

Authors

SQL Scan

Java Scan

Physical level

DirtyTitles...

Java Scan

Java Scan

CitiesTags

Page 20: Extract, Transform, Load (ETL) - ULisboa · PDF fileExtract, Transform, Load (ETL) SAD2 07/8 H.Galhardas Overview •General ETL issues: –Data staging area (DSA) –Building dimensions

H.GalhardasSAD 2007/08

AJAX features• An extensible data quality framework

– Logical operators as extensions of relational algebra– Physical execution algorithms

• A declarative language for logical operators– SQL extension

• A debugger facility for tuning a data cleaningprogram application– Based on a mechanism of exceptions

H.GalhardasSAD 2007/08

Match• Input: 2 relations• Finds data records that correspond to the

same real object• Calls distance functions for comparing field

values and computing the distancebetween input tuples

• Output: 1 relation containing matchingtuples and possibly 1 or 2 relationscontaining non-matching tuples

Page 21: Extract, Transform, Load (ETL) - ULisboa · PDF fileExtract, Transform, Load (ETL) SAD2 07/8 H.Galhardas Overview •General ETL issues: –Data staging area (DSA) –Building dimensions

H.GalhardasSAD 2007/08

Example

Cluster

Match

Merge

Duplicate Elimination

Authors

DirtyAuthors

MatchAuthors

H.GalhardasSAD 2007/08

ExampleCREATE MATCH MatchDirtyAuthors

FROM DirtyAuthors da1, DirtyAuthors da2

LET distance = editDistance(da1.name, da2.name)

WHERE distance < maxDist

INTO MatchAuthorsCluster

Match

Merge

Duplicate Elimination

Authors

DirtyAuthors

MatchAuthors

Page 22: Extract, Transform, Load (ETL) - ULisboa · PDF fileExtract, Transform, Load (ETL) SAD2 07/8 H.Galhardas Overview •General ETL issues: –Data staging area (DSA) –Building dimensions

H.GalhardasSAD 2007/08

ExampleCREATE MATCH MatchDirtyAuthors

FROM DirtyAuthors da1, DirtyAuthors da2

LET distance = editDistance(da1.name, da2.name)

WHERE distance < maxDist

INTO MatchAuthors

Input:DirtyAuthors(authorKey, name)861|johann christoph freytag822|jc freytag819|j freytag814|j-c freytag

Output:MatchAuthors(authorKey1, authorKey2, name1, name2)861|822|johann christoph freytag| jc freytag822|814|jc freytag|j-c freytag ...

Cluster

Match

Merge

Duplicate Elimination

Authors

DirtyAuthors

MatchAuthors

H.GalhardasSAD 2007/08

Implementation of the matchoperator

∀ s1∈ S1, s2 ∈ S2

(s1, s2) is a match ifeditDistance (s1, s2) < maxDist

Page 23: Extract, Transform, Load (ETL) - ULisboa · PDF fileExtract, Transform, Load (ETL) SAD2 07/8 H.Galhardas Overview •General ETL issues: –Data staging area (DSA) –Building dimensions

H.GalhardasSAD 2007/08

Nested loopS1 S2

...

• Very expensive evaluation when handlinglarge amounts of data

Þ Need alternative execution algorithms forthe same logical specification

editDistance

H.GalhardasSAD 2007/08

A database solution

CREATE TABLE MatchAuthors AS

SELECT authorKey1, authorKey2, distance

FROM (SELECT a1.authorKey authorKey1, a2.authorKey authorKey2,editDistance (a1.name, a2.name) distanceFROM DirtyAuthors a1, DirtyAuthors a2)

WHERE distance < maxDist;

· No optimization supported for aCartesian product with external functioncalls

Page 24: Extract, Transform, Load (ETL) - ULisboa · PDF fileExtract, Transform, Load (ETL) SAD2 07/8 H.Galhardas Overview •General ETL issues: –Data staging area (DSA) –Building dimensions

H.GalhardasSAD 2007/08

Window scanningS

n

H.GalhardasSAD 2007/08

Window scanningS

n

Page 25: Extract, Transform, Load (ETL) - ULisboa · PDF fileExtract, Transform, Load (ETL) SAD2 07/8 H.Galhardas Overview •General ETL issues: –Data staging area (DSA) –Building dimensions

H.GalhardasSAD 2007/08

Window scanningS

n

· May loose some matches

H.GalhardasSAD 2007/08

String distance filtering

S1 S2

maxDist = 1

John Smith

John Smit

Jogn Smith

John Smithe

length

length- 1

length

length + 1editDistance

Page 26: Extract, Transform, Load (ETL) - ULisboa · PDF fileExtract, Transform, Load (ETL) SAD2 07/8 H.Galhardas Overview •General ETL issues: –Data staging area (DSA) –Building dimensions

H.GalhardasSAD 2007/08

Annotation-based optimization

• The user specifies types of optimization• The system suggests which algorithm

to use

Ex:CREATE MATCHING MatchDirtyAuthors

FROM DirtyAuthors da1, DirtyAuthors da2

LET dist = editDistance(da1.name, da2.name)

WHERE dist < maxDist% distance-filtering: map= length; dist = abs %INTO MatchAuthors

H.GalhardasSAD 2007/08

DEFINE FUNCTIONS ASChoose.uniqueString(OBJECT[]) RETURN STRINGTHROWS CiteSeerExceptionGenerate.generateId(INTEGER) RETURN STRINGNormal.removeCitationTags(STRING) RETURN STRING(600)

DEFINE ALGORITHMS ASTransitiveClosureSourceClustering(STRING)

DEFINE INPUT DATA FLOWS ASTABLE DirtyData(paper STRING (400));TABLE City(city STRING (80),citysyn STRING (80))KEY city,citysyn;

DEFINE TRANSFORMATIONS AS

CREATE MAPPING mapKeDiDaFROM DirtyData DdLET keyKdd = generateId(1){SELECT keyKdd AS paperKey, Dd.paper AS paperKEY paperKeyCONSTRAINT NOT NULL mapKeDiDa.paper}

Declarative specification

Page 27: Extract, Transform, Load (ETL) - ULisboa · PDF fileExtract, Transform, Load (ETL) SAD2 07/8 H.Galhardas Overview •General ETL issues: –Data staging area (DSA) –Building dimensions

H.GalhardasSAD 2007/08

AJAX features• An extensible data quality framework

– Logical operators as extensions of relational algebra– Physical execution algorithms

• A declarative language for logical operators– SQL extension

• A debugger facility for tuning a data cleaningprogram application– Based on a mechanism of exceptions

H.GalhardasSAD 2007/08

Management of exceptions

• Problem: to mark tuples not handled by thecleaning criteria of an operator

• Solution: to specify the generation ofexception tuples within a logical operator– exceptions are thrown by external functions– output constraints are violated

Page 28: Extract, Transform, Load (ETL) - ULisboa · PDF fileExtract, Transform, Load (ETL) SAD2 07/8 H.Galhardas Overview •General ETL issues: –Data staging area (DSA) –Building dimensions

H.GalhardasSAD 2007/08

Debugger facility

• Supports the (backward and forward) dataderivation of tuples wrt an operator todebug exceptions

• Supports the interactive data modificationand, in the future, the incrementalexecution of logical operators

H.GalhardasSAD 2007/08

Architecture

Page 29: Extract, Transform, Load (ETL) - ULisboa · PDF fileExtract, Transform, Load (ETL) SAD2 07/8 H.Galhardas Overview •General ETL issues: –Data staging area (DSA) –Building dimensions

H.GalhardasSAD 2007/08

To see it working....

Workshop-UQ?Tomorrow at Tagus Park,

10H-16H