
Page 1: ETL Process ETL: Overview

ETL Process

ETL: Overview

Two steps
- From the sources to the staging area
  - Extraction of data from the sources
  - Creation / detection of differential updates
  - Creation of LOAD files
- From the staging area to the base database
  - Data cleaning and tagging
  - Preparation of integrated data sets
- Continuous data provision for the DWH
- Assurance of consistency with regard to the DWH data sources

Efficient methods essential → minimize offline time
Rigorous tests essential → ensure data quality

Page 2: ETL Process ETL: Overview

ETL Process

ETL Process

Frequently the most elaborate part of data warehousing
- Variety of sources
- Heterogeneity
- Data volume
- Complexity of the transformation
  - Schema and instance integration
  - Data cleansing
- Hardly any consistent methodology or system support, but a variety of tools available

Page 3: ETL Process ETL: Overview

ETL Process

ETL Process

Extraction: selecting a subset of the data from the sources and providing it for transformation
Transformation: fitting the data to the predefined schema and quality requirements
Load: physical insertion of the data from the staging area into the data warehouse (including necessary aggregations)

Page 4: ETL Process ETL: Overview

ETL Process

Definition Phase of the ETL Process

[Figure: definition phase of the ETL process. Starting from source-data analysis of the data sources (OLTP, legacy systems, external sources), the objects are selected, the transformations are created, and the ETL routines are created for the DWH. Metadata management with a repository accompanies all steps and holds the analysis requirements, the data model and conventions, the documentation / operational data catalog, the data-quality rules, the transformation rules, the success criteria for the load routines, and the ETL jobs (mappings, key transformations, normalization).]

Page 5: ETL Process ETL: Overview

Extraction of Data from Sources

Extraction

Task
- Regular extraction of change data from the sources
- Data provision for the DWH

Distinction
- Time of extraction
- Type of extracted data

Page 6: ETL Process ETL: Overview

Extraction of Data from Sources

Point in Time

Synchronous notification
- Source propagates each change

Asynchronous notification
- Periodic
  - Sources produce extracts regularly
  - DWH regularly scans the dataset
- Event-driven
  - DWH requests changes before each annual reporting
  - Source notifies after every X changes
- Query-controlled
  - DWH queries for changes before any actual access

Page 7: ETL Process ETL: Overview

Extraction of Data from Sources

Type of Data

Flow: integrate all changes into the DWH
- Short positions, trades
- Must accommodate changes

Stock: the point in time is essential and must be set
- Number of employees in a store at the end of the month
- Stock at the end of the year

Value per unit: depends on the unit and other dimensions
- Exchange rate at a point in time
- Gold price on a stock exchange

Page 8: ETL Process ETL: Overview

Extraction of Data from Sources

Type of Data

Snapshots: the source always provides the complete data set
- New supplier directory, new price list, etc.
- Changes must be detected
- History must be depicted correctly

Logs: the source provides every change
- Transaction logs, application-controlled logging
- Changes can be imported efficiently

Net logs: the source provides net changes
- Catalog updates, snapshot deltas
- No complete history possible
- Changes can be imported efficiently

Page 9: ETL Process ETL: Overview

Extraction of Data from Sources

Point in Time of Data Provision

| Source ...                              | Method                | Timeliness             | Workload on DWH        | Workload on sources    |
| creates files periodically              | batch runs, snapshots | depending on frequency | low                    | low                    |
| propagates each change                  | trigger, replication  | maximum                | high                   | very high              |
| creates extracts on request, before use | very hard             | maximum                | medium                 | medium                 |
| application-driven                      | application-driven    | depending on frequency | depending on frequency | depending on frequency |

Page 10: ETL Process ETL: Overview

Extraction of Data from Sources

Point in Time of Data Provision

Comments on the three previous options:
- Many systems (mainframes) are not accessible online
- Contradicts the idea of the DWH: more workload on the sources
- Technically not efficiently implementable

Page 11: ETL Process ETL: Overview

Extraction of Data from Sources

Extraction from Legacy Systems

Very dependent on the application
Access to host systems without online access
- Access via batch jobs, report writers, scheduling
Data in non-standard databases without APIs
- Programming in PL/1, COBOL, Natural, IMS, ...
Unclear semantics, fields used for several purposes, "speaking" keys, missing documentation, domain knowledge held only by a few people
But: commercial tools are available

Page 12: ETL Process ETL: Overview

Extraction of Data from Sources

Differential Snapshot Problem

Many sources provide only the full dataset
- Molecular biology databases
- Customer lists, employee lists
- Product catalogues

Problem
- Repeated import of all data is inefficient
- Duplicates need to be detected

Algorithms to compute delta files
- Hard for very large files

[Labio, Garcia-Molina 1996]

Page 13: ETL Process ETL: Overview

Extraction of Data from Sources

Scenario

Sources provide snapshots as a file F
- Unordered set of records (K, A1, ..., An)

Given: F1, F2 with f1 = |F1|, f2 = |F2|
Compute the smallest set O = {INS, DEL, UPD}* with O(F1) = F2
→ the Differential Snapshot Problem

O is not unique:
O1 = {INS(X), ∅, DEL(X)} ≡ O2 = {∅, ∅, ∅}

Page 14: ETL Process ETL: Overview

Extraction of Data from Sources

Scenario

Example: the differential snapshot algorithm compares the old file F1 with the new file F2 and sends the resulting delta to the DWH.

F1: (K4, t, r, ...), (K102, p, q, ...), (K104, k, k, ...), (K202, a, a, ...)
F2: (K3, t, r, ...), (K102, p, q, ...), (K103, t, h, ...), (K104, k, k, ...), (K202, b, b, ...)

Delta: INS K3, DEL K4, INS K103, UPD K202: ...

Page 15: ETL Process ETL: Overview

Extraction of Data from Sources

Assumptions

Computing a consecutive sequence of differential snapshots
- Files from 1.1.2010, 1.2.2010, 1.3.2010, ...

Cost model
- All operations in main memory are free
- IO counts the number of records (sequential reads)
- No consideration of block sizes

Size of main memory: M (records)
File size: |Fx| = fx (records)
Files are generally larger than main memory

Page 16: ETL Process ETL: Overview

Extraction of Data from Sources

DSnaive – Nested Loop

Computing O
- Read record R from F1
- Read F2 sequentially and compare against R
  - R not in F2 → O := O ∪ {DEL(R)}
  - R in F2 → O := O ∪ {UPD(R)} / ignore

Problem: INS is not found
- Auxiliary structure necessary
- Array with the IDs from F2 (generated on the fly)
- Mark matched records; a final pass yields the INS entries

Number of IO operations: f1 · f2 + δ

Improvements?
- Stop the scan of F2 as soon as R has been found
- Load partitions of size M from F1: (f1 / M) · f2

Page 17: ETL Process ETL: Overview

Extraction of Data from Sources

DSsmall – small files

Assumption: main memory M > f1 (or f2)
Computing O (see the sketch below)
- Read F1 completely into main memory
- Read F2 sequentially (record S)
  - S ∈ F1: O := O ∪ {UPD(S)} / ignore
  - S ∉ F1: O := O ∪ {INS(S)}
  - Mark S in F1 (bit array)
- Finally, for each record R ∈ F1 without a mark: O := O ∪ {DEL(R)}

Number of IO operations: f1 + f2 + δ

Improvements
- Sort F1 in main memory → faster lookup
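A minimal Python sketch of the DSsmall approach; the record layout and the function names are illustrative assumptions, not part of the slides. F1 is loaded into a dictionary, F2 is scanned once, and every unmarked F1 key yields a deletion.

def ds_small(f1_records, f2_records):
    """Compute the delta O as a list of (op, key, attrs) entries."""
    old = {key: attrs for key, attrs in f1_records}   # read F1 completely
    seen = set()                                      # plays the role of the bit array
    delta = []
    for key, attrs in f2_records:                     # read F2 sequentially
        if key in old:
            seen.add(key)                             # mark S in F1
            if old[key] != attrs:
                delta.append(("UPD", key, attrs))
        else:
            delta.append(("INS", key, attrs))
    for key in old:                                   # unmarked records were deleted
        if key not in seen:
            delta.append(("DEL", key, None))
    return delta

# Example matching the scenario slide:
F1 = [("K4", "t,r"), ("K102", "p,q"), ("K104", "k,k"), ("K202", "a,a")]
F2 = [("K3", "t,r"), ("K102", "p,q"), ("K103", "t,h"), ("K104", "k,k"), ("K202", "b,b")]
print(ds_small(F1, F2))
# [('INS', 'K3', 't,r'), ('INS', 'K103', 't,h'), ('UPD', 'K202', 'b,b'), ('DEL', 'K4', None)]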

Page 18: ETL Process ETL: Overview

Extraction of Data from Sources

DSsort – Sort-Merge

General case: M ≪ f1 and M ≪ f2
Assumption: F1 is sorted
Sort F2 in secondary storage
- Read F2 in partitions Pi with |Pi| = M
- Sort each Pi in main memory and write it out as a run Fi
- Merge all runs Fi
- Assumption: M > √f2 → IO: 4 · f2

Keep the sorted F2 for the next DS (it becomes F1 there)
- Per DS only F2 needs to be sorted

Computing O
- Open the sorted F1 and F2
- Merge (parallel reads with skipping)

Number of IO operations: f1 + 5 · f2 + δ

Page 19: ETL Process ETL: Overview

Extraction of Data from Sources

DSsort2 – Interleaved

Sorted F1 given
Computing O
- Read F2 in partitions Pi with |Pi| = M
- Sort each Pi in main memory and write it out as a run F2,i
- Merge all runs F2,i and simultaneously compare against F1

Number of IO operations: f1 + 4 · f2 + δ

Page 20: ETL Process ETL: Overview

Extraction of Data from Sources

DShash – Partitioned Hash

Computing O
- Hash F2 into partitions Pi with |Pi| = M/2
- The hash function has to guarantee Pi ∩ Pj = ∅ for all i ≠ j
- The partitions are "equivalence classes" w.r.t. the hash function
- F1 is already partitioned (kept from the previous run)
- F1 and F2 are partitioned with the same hash function
- Read and merge P1,i and P2,i pairwise in main memory (see the sketch below)

Number of IO operations: f1 + 3 · f2 + δ
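A minimal Python sketch of the partitioned-hash idea; the function names are illustrative assumptions, and it reuses ds_small from the sketch above. Both files are split into disjoint partitions with the same hash function, and corresponding partitions are diffed pairwise in main memory.

from collections import defaultdict

def partition(records, num_partitions):
    """Hash-partition (key, attrs) records; hash(key) selects the partition."""
    parts = defaultdict(list)
    for key, attrs in records:
        parts[hash(key) % num_partitions].append((key, attrs))
    return parts

def ds_hash(f1_records, f2_records, num_partitions=4):
    p1 = partition(f1_records, num_partitions)   # in reality kept from the previous run
    p2 = partition(f2_records, num_partitions)
    delta = []
    for i in range(num_partitions):              # each partition pair fits into main memory
        delta.extend(ds_small(p1.get(i, []), p2.get(i, [])))   # ds_small: see sketch above
    return delta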

Page 21: ETL Process ETL: Overview

Extraction of Data from Sources

Why not simply . . .

UNIX diff?
- diff requires / considers the surroundings of the records (context lines)
- Here: the records are not ordered

In the database with SQL?
- Requires reading each relation three times

INSERT INTO delta
SELECT 'UPD', ... FROM F1, F2
WHERE F1.K = F2.K AND F1.W <> F2.W
UNION
SELECT 'INS', ... FROM F2
WHERE NOT EXISTS (...)
UNION
SELECT 'DEL', ... FROM F1
WHERE NOT EXISTS (...)

Page 22: ETL Process ETL: Overview

Extraction of Data from Sources

Comparison – Features

|         | IO          | Remarks                                                                                                        |
| DSnaive | f1 · f2     | not competitive; auxiliary data structure required                                                            |
| DSsmall | f1 + f2     | only for small files                                                                                           |
| DSsort2 | f1 + 4 · f2 |                                                                                                                |
| DShash  | f1 + 3 · f2 | non-overlapping hash function; partition size hard to estimate; assumptions about the distribution (sampling) |

Extensions of DShash for "worse" hash functions are known

Page 23: ETL Process ETL: Overview

Extraction of Data from Sources

Further DS Approaches

Number of partitions / runs larger than the number of file descriptors in the OS
- Hierarchical external sorting methods

Compression: compress the files
- Larger partitions / runs
- Better chance of performing the comparisons in main memory
- Faster in practice (cf. the assumptions of the cost model)

"Window" algorithm
- Assumption: the files have a "fuzzy" order
- Merge with a sliding window over both files
- Returns many redundant INS-DEL pairs
- Number of IO operations: f1 + f2

Page 24: ETL Process ETL: Overview

Extraction of Data from Sources

DS with Timestamp

Assumption: records are (K, A1, ..., An, T)
- T: timestamp of the last change

Computing O
- Keep track of T_old: the last update (max{T} of F1)
- Read F2 sequentially
- Entries with T > T_old are of interest
- But: INS or UPD? (see the sketch below)

Another problem: DEL is not found
The timestamp only spares the attribute comparison
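A minimal Python sketch of timestamp-based extraction; the record layout and names are illustrative assumptions. It makes the limitation explicit: changed records are detected, but without the old key set INS and UPD cannot be distinguished, and DEL never shows up.

def changed_since(f2_records, t_old, old_keys=None):
    """Records are (key, attrs, ts); t_old is max{T} of the previous snapshot F1."""
    delta = []
    for key, attrs, ts in f2_records:
        if ts > t_old:
            if old_keys is None:
                delta.append(("INS_OR_UPD", key, attrs))   # ambiguous without F1's keys
            else:
                delta.append(("UPD" if key in old_keys else "INS", key, attrs))
    return delta                                           # DEL entries can never appear here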

Page 25: ETL Process ETL: Overview

Data Load

Load

Task
- Efficient incorporation of external data into the DWH

Critical point
- Loading operations may block the entire DWH (write lock on the fact table)

Aspects
- Triggers
- Integrity constraints
- Index updates
- Update or insert?

Page 26: ETL Process ETL: Overview

Data Load

Set based

Use of standard interfaces: PRO*SQL, JDBC, ODBC, ...
Works in the normal transaction context
Triggers, indexes and constraints remain active
- Manual deactivation possible

No large-scale locks
Locks can be reduced by intermediate COMMITs
- Not an issue in Oracle: read operations are never blocked (MVCC)

Use of prepared statements
Partly proprietary extensions (arrays) available

Page 27: ETL Process ETL: Overview

Data Load

BULK Load

DB-specific extensions for loading large amounts of data
Runs (usually) in a special context
- Oracle: DIRECT PATH option in the loader
- Complete table lock
- No consideration of triggers or constraints
- Indexes are not updated until afterwards
- No transactional context
- No logging
- Checkpoints for recovery

In practice: BULK uploads

Page 28: ETL Process ETL: Overview

Data Load

Example: ORACLE sqlldr

[Figure: SQL*Loader architecture. Input data files and a loader control file are read by SQL*Loader, which writes a log file, a bad file for bad records, and a discard file for rejected records, and loads the remaining data into the tables and indexes of the database. Source: Oracle 11g Documentation]

Page 29: ETL Process ETL: Overview

Data Load

Example: ORACLE sqlldr (2)

Control file:

LOAD DATA
INFILE 'bier.dat'
REPLACE INTO TABLE getraenke (
  bier_name           POSITION(1)  CHAR(35),
  bier_preis          POSITION(37) ZONED(4,2),
  bier_bestellgroesse POSITION(42) INTEGER,
  getraenk_id         "getraenke_seq.nextval")

Data file bier.dat:

Ilmenauer Pils                      4490 100
Erfurter Bock                       6400  80
Magdeburger Weisse                  1290  20
Anhaltinisch Flüssig                8800 200

Page 30: ETL Process ETL: Overview

Data Load

BULK Load Example

Many options
- Treatment of exceptions (bad file)
- Data transformations
- Checkpoints
- Optional fields
- Conditional loading into multiple tables
- Conditional loading of records
- REPLACE or APPEND
- Parallel load
- ...

Page 31: ETL Process ETL: Overview

Data Load

Direct Path Load

[Figure: conventional path vs. direct path load in Oracle. In the conventional path, SQL*Loader and user processes generate SQL commands that pass through SQL command processing, space management (allocate new extents, adjust the fill level, find and fill partial blocks), and the buffer cache (queue management, conflict resolution) before database blocks are read and written; the direct path writes database blocks directly. Source: Oracle 11g Documentation]

Page 32: ETL Process ETL: Overview

Data Load

Multi-Table-Insert in Oracle

Insert into multiple tables or into one table multiple times (e.g., for pivoting)

INSERT ALL
  INTO Quartal_Verkauf VALUES (Produkt_Nr, Jahr || '/Q1', Umsatz_Q1)
  INTO Quartal_Verkauf VALUES (Produkt_Nr, Jahr || '/Q2', Umsatz_Q2)
  INTO Quartal_Verkauf VALUES (Produkt_Nr, Jahr || '/Q3', Umsatz_Q3)
  INTO Quartal_Verkauf VALUES (Produkt_Nr, Jahr || '/Q4', Umsatz_Q4)
SELECT ... FROM ...

Page 33: ETL Process ETL: Overview

Data Load

Multi-Table-Insert in Oracle (2)

Conditional insert

INSERT ALL
  WHEN ProdNr IN (SELECT ProdNr FROM Werbe_Aktionen) THEN
    INTO Aktions_Verkauf VALUES (ProdNr, Quartal, Umsatz)
  WHEN Umsatz > 1000 THEN
    INTO Top_Produkte VALUES (ProdNr)
SELECT ... FROM ...

Page 34: ETL Process ETL: Overview

Data Load

Merge in Oracle

Merge: attempted insert; on error (violation of a key constraint) → update

MERGE INTO Kunden K USING Neukunden N
ON (N.Name = K.Name AND N.GebDatum = K.GebDatum)
WHEN MATCHED THEN
  UPDATE SET K.Name = N.Name, K.Vorname = N.Vorname,
             K.GebDatum = N.GebDatum
WHEN NOT MATCHED THEN
  INSERT VALUES (MySeq.NextVal, N.Name, N.Vorname, N.GebDatum)

Page 35: ETL Process ETL: Overview

Data Load

The ETL Process: Transformation Tasks

[Figure: the ETL process and its transformation tasks. Data flows from the operational sources through (1) extraction with instance extraction and transformation, (2) an intermediate store, (3) integration with instance matching and integration, and (4) filtering and aggregation into (5) the data warehouse. A parallel metadata flow carries instance characteristics (real metadata), translation rules, mappings from source to target schemas, and filtering and aggregation rules; scheduling, logging, monitoring, recovery, and backup accompany the entire process. After Rahm, Do 2000]

Page 36: ETL Process ETL: Overview

Data Load

Method: Source – Staging Area – BaseDB

[Figure: source 1 (an RDBMS) and source 2 (IMS) are mapped to relational schemas Q1 and Q2 in the staging area and from there into the integrated schema / data cube of the base database.]

The BULK load is only the first step
Subsequent loads
- INSERT INTO ... SELECT ...
- Logging can be switched off
- Parallelizable

Page 37: ETL Process ETL: Overview

Data Load

Transformation Tasks

When loading
- Simple conversions (for the LOAD file)
- Record orientation (tuples)
- Preparation for the BULK loader → mostly scripts or 3GL

In the data staging area
- Set-oriented calculations
- Inter- and intra-relation comparisons
- Comparison with the base database → duplicates
- Tagging of records
- SQL

Loading into the BaseDB
- Bulk load
- Set-oriented inserts without logging

Page 38: ETL Process ETL: Overview

Data Load

Task: Source – Staging Area – BaseDB

What to do, where, and when?
- No fixed task assignment

|                      | Extraction: Source → Staging Area                 | Load: Staging Area → BaseDB   |
| Access type          | record-oriented                                   | set-oriented                  |
| Available databases  | one source (update file)                          | many sources                  |
| Available datasets   | depending on the source: all, all changes, deltas | BaseDB additionally available |
| Programming language | scripts (Perl, AWK, ...) or 3GL                   | SQL, PL/SQL                   |

Page 39: ETL Process ETL: Overview

Transformation Tasks

Transformation

Problem
- Data in the staging area is not in the format of the base database
- The structure of the data varies
  - Staging area: schema close to the source
  - BaseDB: multidimensional schema
  - Structural heterogeneity

Aspects
- Data transformation
- Schema transformation

Page 40: ETL Process ETL: Overview

Transformation Tasks

Data and schema heterogeneity

Main data source: OLTP systems
Secondary sources:
- Documents from in-house legacy archives
- Documents from the Internet via WWW, FTP
  - Unstructured: access via search engines, ...
  - Semi-structured: access via search engines, mediators, wrappers, etc., as XML documents or similar

Basic problem: heterogeneity of the sources

Page 41: ETL Process ETL: Overview

Transformation Tasks

Aspects of Heterogeneity

Different data models
- Due to autonomous decisions on the acquisition of systems in the divisions
- Different and differently powerful modeling constructs
- Application semantics can be specified to varying degrees
→ The mapping between data models is ambiguous

Example: relational model vs. object-oriented modeling vs. XML

[Figure: the entity Kunde with the attributes Name, Vorname, PLZ modeled as a relational table, as a class, and as an XML tree.]

Page 42: ETL Process ETL: Overview

Transformation Tasks

Aspects of Heterogeneity (2)

Different models for the same real-world facts
- Due to design autonomy
- Even within the same data model, different modeling is possible, e.g., because of different modeling perspectives of the DB designers

[Figure: Kunde(Name, Vorname, Geschlecht, ...) modeled with a gender attribute vs. Kunde(Name, Vorname, ...) specialized into the subtypes Mann and Frau.]

Page 43: ETL Process ETL: Overview

Transformation Tasks

Aspects of Heterogeneity (3)

Different representations of the data
- Different data types possible
- Different ranges of the supported data types
- Different internal representations of the data
- Also, different "values" of a data type may represent the same information

Page 44: ETL Process ETL: Overview

Transformation Tasks

Data Error Classification

Data errors
- Single data sources
  - Schema level (missing integrity constraints, poor schema design): illegal values, violated attribute dependencies, violated uniqueness, violated referential integrity
  - Data level (data entry errors): missing values, misspellings, wrong values, wrong references, cryptic values, embedded values, wrong assignments, contradictory values, transpositions, duplicates, data conflicts
- Integrated data sources
  - Schema level (heterogeneous data models and schemas): structural heterogeneity, semantic heterogeneity, schematic heterogeneity
  - Data level (overlapping, contradictory, and inconsistent data): contradictory values, different representations, different precision, different aggregation levels, duplicates

[Rahm, Do 2000; Leser, Naumann 2007]

Page 45: ETL Process ETL: Overview

Schema Heterogeneity

Schema Heterogeneity

Cause: design autonomy → different models
- Different normalization
- What is a relation, what is an attribute, what is a value?
- Distribution of the data across tables
- Redundancies from the source systems
- Keys

Not well supported in SQL
- INSERT has only one target table
- SQL accesses data, not schema elements
- Usually requires programming

Page 46: ETL Process ETL: Overview

Schema Heterogeneity

Schema Mapping

Data transformation between heterogeneous schemas
- Old but recurring problem
- Usually, experts write complex queries or programs
- Time-intensive
  - Requires expertise in the domain, the schemas, and the query languages
  - XML makes it even harder: XML Schema, XQuery

Idea: automation
- Given: two schemas and a high-level mapping between them
- Wanted: a query for the data transformation

Page 47: ETL Process ETL: Overview

Schema Heterogeneity

Why is schema mapping difficult?

Generation of the "right" query, taking into account
- the source and target schemas,
- the mapping,
- and the user's intention: semantics!

Guarantee that the transformed data conforms to the target schema
- Flat or nested
- Integrity constraints

Efficient data transformation

Page 48: ETL Process ETL: Overview

Schema Heterogeneity

Schema Mapping: Normalized vs. Denormalized

1:1 associations are represented differently
- By occurrence in the same tuple
- By a foreign-key relationship

[Schemas: Bier(bID, name, alkoholgehalt) and Produktsorte(pFK, bezeichnung) vs. the denormalized Produkt(pID, name, hersteller, produktsorte).]

SELECT bID AS pID, name, NULL AS hersteller,
       NULL AS produktsorte FROM Bier
UNION
SELECT NULL AS pID, NULL AS name, NULL AS hersteller,
       bezeichnung AS produktsorte FROM Produktsorte

Page 49: ETL Process ETL: Overview

Schema Heterogeneity

Schema Mapping: Normalized vs. Denormalized (2)

[Schemas as before: Bier(bID, name, alkoholgehalt), Produktsorte(pFK, bezeichnung), and the denormalized Produkt(pID, name, hersteller, produktsorte).]

SELECT bID AS pID, name, NULL AS hersteller,
       bezeichnung AS produktsorte
FROM Bier, Produktsorte
WHERE bID = pFK

Only one of four possible interpretations!

Page 50: ETL Process ETL: Overview

Schema Heterogeneity

Schema Mapping: Normalized vs. Denormalized (3)

[Schemas: the denormalized source Produkt(name, hersteller, produktsorte) and the normalized targets Bier(bID, name, alkoholgehalt) and Produktsorte(pFK, bezeichnung).]

Requires key generation: a Skolem function SK that supplies a unique value with respect to its input (e.g., the concatenation of all values)

Bier := SELECT SK(name) AS bID, name,
               NULL AS alkoholgehalt FROM Produkt
Produktsorte := SELECT SK(name) AS pFK,
               produktsorte AS bezeichnung FROM Produkt

Page 51: ETL Process ETL: Overview

Schema Heterogeneity

Schema mapping: Nested vs. Flat

1:1 associations are represented differently
- E.g., as nested elements
- Or via a foreign-key relationship

[Figure: the flat relational schemas Bier(bID, name, alkoholgehalt), Produkt(pID, name, produktsorte), and Produktsorte(bezeichnung) vs. a nested structure in which Bier(name) and Produktsorte(bezeichnung) are nested inside Produkt(name, hersteller, produktsorte).]

Page 52: ETL Process ETL: Overview

Schema Heterogeneity

Difficulties

Example: Source(ID, Name, Street, ZIP-Code, Revenue)

Target schema #1
Customer(ID, Name, Revenue)
Address(ID, Street, ZIP-Code)
- Requires 2 scans of the source table

INSERT INTO Customer ... SELECT ...
INSERT INTO Address ... SELECT ...

Target schema #2
PremCustomer(ID, Name, Revenue)
NormCustomer(ID, Name, Revenue)
- Requires 2 scans of the source table

INSERT INTO PremCustomer ... SELECT ... WHERE Revenue >= X
INSERT INTO NormCustomer ... SELECT ... WHERE Revenue < X

Page 53: ETL Process ETL: Overview

Schema Heterogeneity

Difficulties (2)

Schemas
P1(Id, Name, Gender)
P2(Id, Name, M, W)
P31(Id, Name), P32(Id, Name)

P1 → P2
INSERT INTO P2 (id, name, 'T', 'F') ... SELECT ...
INSERT INTO P2 (id, name, 'F', 'T') ... SELECT ...

P3 → P1
INSERT INTO P1 (id, name, 'female') ... SELECT ... FROM P31
INSERT INTO P1 (id, name, 'male') ... SELECT ... FROM P32

The number of values must be fixed; a new gender → all queries have to be changed

Page 54: ETL Process ETL: Overview

Data Errors

Data Errors

Example table Person:

| KNr | Name        | Geb.datum  | Alter | Geschl. | Telefon | PLZ   | Email             |
| 34  | Meier, Tom  | 21.01.1980 | 35    | M       | 999-999 | 39107 | null              |
| 34  | Tina Möller | 18.04.78   | 29    | W       | 763-222 | 36999 | null              |
| 35  | Tom Meier   | 32.05.1969 | 27    | F       | 222-231 | 39107 | [email protected] |

Reference table Ort:

| PLZ   | Ort       |
| 39107 | Magdeburg |
| 36996 | Spanien   |
| 95555 | Illmenau  |

Annotated error types: uniqueness violated, different representations, contradictory values, missing values (e.g., default values), referential integrity violated, duplicates, spelling or typing errors, wrong or invalid values, incomplete data.

Page 55: ETL Process ETL: Overview

Data Errors

Avoiding Data Errors

| Avoiding of                    | by                                        |
| wrong data types               | data type definitions, domain constraints |
| wrong values                   | CHECK constraints                         |
| missing values                 | NOT NULL                                  |
| invalid foreign key references | FOREIGN KEY                               |
| duplicates                     | UNIQUE, PRIMARY KEY                       |
| inconsistencies                | transactions                              |
| outdated data                  | replication, materialized views           |

However, in practice:
- Lack of metadata and integrity constraints, ...
- Input errors, ignorance, ...
- Heterogeneity
- ...

Page 56: ETL Process ETL: Overview

Data Errors

Phases of Data Processing

[Figure: phases of data processing. Collection/selection; data profiling (identify and quantify data-quality problems, recognize error types and causes); data cleaning (standardization/normalization, error correction, duplicate detection and merging); transformation (aggregation / feature extraction, dimensionality reduction / sampling, discretization); usage.]

Page 57: ETL Process ETL: Overview

Data Errors

Data Profiling

Analysis of the content and structure of individual attributes
- Data type, range, distribution and variance, occurrence of null values, uniqueness, patterns (e.g., dd/mm/yyyy)

Analysis of dependencies between the attributes of a relation
- "Fuzzy" keys
- Functional dependencies, potential primary keys, "fuzzy" dependencies
- Needed because:
  - No explicit constraints are specified
  - But most of the data satisfies them

Analysis of overlaps between attributes of different relations
- Redundancies, foreign key relationships

Page 58: ETL Process ETL: Overview

Data Errors

Data Profiling (2)

Missing or incorrect values
- Calculated vs. expected cardinality (e.g., number of branches, gender of clients)
- Number of null values, minimum / maximum, variance

Data or input errors
- Sorting and manual inspection
- Similarity tests

Duplicates
- Number of tuples vs. attribute cardinality

Page 59: ETL Process ETL: Overview

Data Errors

Data Profiling with SQL

SQL queries for simple profiling tasks
- Schema, data types: queries against the schema catalog
- Range of values

select min(A), max(A), count(distinct A)
from Tabelle

- Data errors, default values

select City, count(*) as Numb
from Customer group by City order by Numb

  - Ascending: input errors, e.g., Illmenau: 1, Ilmenau: 50
  - Descending: undocumented default values, e.g., AAA: 80

Page 60: ETL Process ETL: Overview

Data Errors

Data Cleaning

Detect and eliminate inconsistencies, contradictions, and errors in the data with the aim of improving its quality
Also called cleansing or scrubbing
Up to 80% of the effort in DW projects
Cleaning in the DW: part of the ETL process

Page 61: ETL Process ETL: Overview

Data Errors

Data Quality and Data Cleaning

[Figure: data quality criteria checked by column analysis, dependency analysis, relationship analysis, and rule-based analysis: validity of individual values and of multiple values; consistency via rule-based analysis (business and data rules, defects); referential integrity (integrity violations, orphans, cardinalities); correctness via statistical control (min, max, mean, median, standard deviation, ...); consistency of data types, field lengths, and value ranges; key uniqueness (uniqueness of primary and candidate keys); freedom from redundancy (normalization degree 1st/2nd/3rd NF, duplicate checking); uniqueness (analysis of the metadata); completeness (fill-level analysis of entities and attributes); accuracy (analysis of total and decimal digits of numeric attributes); uniformity (format analysis for numeric attributes, time units, and strings).]

Page 62: ETL Process ETL: Overview

Data Errors

Normalization and Standardization

Data type conversion: varchar → int
Encodings: 1: address unknown, 2: old address, 3: current address, 4: address of spouse, ...
Normalization: mapping into a unified format (see the sketch below)
- Date: 03/01/11 → 01. März 2011
- Currency: $ → €
- Strings to uppercase

Tokenization: "Saake, Gunter" → "Saake", "Gunter"
Discretization of numeric values
Domain-specific transformations
- Codd, Edgar Frank → Edgar Frank Codd
- Str. → Street
- Addresses from address databases
- Industry-specific product names
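A minimal Python sketch of a few of the normalization and tokenization steps listed above; the rules, date format, and helper names are illustrative assumptions only.

from datetime import datetime

ABBREVIATIONS = {"Str.": "Street"}          # small domain-specific replacement table

def normalize_date(value):
    """'03/01/11' -> '2011-03-01' (assuming mm/dd/yy input, as in the slide example)."""
    return datetime.strptime(value, "%m/%d/%y").date().isoformat()

def tokenize_name(value):
    """'Saake, Gunter' -> ('Saake', 'Gunter')."""
    last, first = [t.strip() for t in value.split(",", 1)]
    return last, first

def normalize_string(value):
    """Expand abbreviations and map to uppercase."""
    for short, full in ABBREVIATIONS.items():
        value = value.replace(short, full)
    return value.upper()

print(normalize_date("03/01/11"))         # 2011-03-01
print(tokenize_name("Saake, Gunter"))     # ('Saake', 'Gunter')
print(normalize_string("Haupt Str. 12"))  # HAUPT STREET 12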

Page 63: ETL Process ETL: Overview

Data Errors

Data Transformation

Well supported in SQL
- Many functions in the language standard
- String functions, decoding, date conversion, formulas, system variables, ...
- Create functions in PL/SQL and use them in SQL

Data
"Pause, Lilo" ⇒ "Pause", "Lilo"
"Prehn, Leo" ⇒ "Prehn", "Leo"

SQL
INSERT INTO customers (last_name, first_name)
SELECT SubStr(name, 0, inStr(name, ',') - 1),
       SubStr(name, inStr(name, ',') + 1)
FROM rawdata;

Page 64: ETL Process ETL: Overview

Data Errors

Duplicate Detection

Identify semantically equivalent records, i.e., records that represent the same real-world object
See also: record linkage, object identification, duplicate elimination, merge/purge
- Merge: detect duplicates
- Purge: selection / computation of the "best" representative per class

| CustomerNr | Name          | Address         |
| 3346       | Just Vorfan   | Hafenstrasse 12 |
| 3346       | Justin Forfun | Hafenstr. 12    |
| 5252       | Lilo Pause    | Kuhweg 42       |
| 5268       | Lisa Pause    | Kuhweg 42       |
| ⊥          | Ann Joy       | Domplatz 2a     |
| ⊥          | Anne Scheu    | Domplatz 28     |

Page 65: ETL Process ETL: Overview

Data Errors

Duplicate Detection: Comparisons

Typical comparison rules

if ssn1 = ssn2 then match
else if name1 = name2 then
  if firstname1 = firstname2 then
    if adr1 = adr2 then match
    else unmatch
  else if adr1 = adr2 then match_household
else if adr1 = adr2 then ...

Naive approach: "all vs. all"
- O(n²) comparisons
- Maximum accuracy (depending on the rules)
- Far too expensive

Page 66: ETL Process ETL: Overview

Data Errors

Duplicate Detection: Principle

[Figure: duplicate detection principle. The cross product R × S of two record sets R = {r1, r2, r3, ...} and S = {s1, s2, s3, ...} is reduced by partitioning the search space; a comparison function classifies the remaining pairs into matches (M) and non-matches (U).]

Page 67: ETL Process ETL: Overview

Data Errors

Partitioning

Blocking
- Division of the search space into disjoint blocks
- Duplicates are searched for only within a block

Sorted neighborhood [Hernandez, Stolfo 1998]
- Sort the data based on a selected key
- Compare within a sliding window

Multi-pass technique
- Transitive closure over different sort orders

Page 68: ETL Process ETL: Overview

Data Errors

Sorted Neighborhood

1. Compute a key for each record
   - e.g., SSN + "first 3 characters of Name" + ...
   - Account for typical errors: 0 vs. O, Soundex, neighboring keys, ...
2. Sort by key
3. Scan the list sequentially
4. Compare within a window W, |W| = w (see the sketch below)
   - Which tuples really need to be compared?

Complexity
- Key generation: O(n); sorting: O(n · log(n)); comparing: O((n/w) · w²) = O(n · w)
- Total: O(n · log(n)) or O(n · w)
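A minimal Python sketch of the sorted-neighborhood method; the key construction, the similarity test, and the record layout are illustrative assumptions, not part of the slides.

def sorting_key(record):
    """Blocking key: ZIP code plus the first 3 characters of the name."""
    return (record["plz"], record["name"][:3].upper())

def is_duplicate(r1, r2):
    """Placeholder comparison function (a real one would use edit distance etc.)."""
    return r1["plz"] == r2["plz"] and r1["name"][:4].upper() == r2["name"][:4].upper()

def sorted_neighborhood(records, window=3):
    """Return candidate duplicate pairs found within the sliding window."""
    ordered = sorted(records, key=sorting_key)            # O(n log n)
    pairs = []
    for i, rec in enumerate(ordered):
        for other in ordered[max(0, i - window + 1):i]:   # only the w-1 predecessors
            if is_duplicate(rec, other):
                pairs.append((other["name"], rec["name"]))
    return pairs

customers = [
    {"name": "Just Vorfan",   "plz": "39107"},
    {"name": "Justin Forfun", "plz": "39107"},
    {"name": "Lilo Pause",    "plz": "36999"},
]
print(sorted_neighborhood(customers))   # [('Just Vorfan', 'Justin Forfun')]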

Page 69: ETL Process ETL: Overview

Data Errors

Sorted Neighborhood: Problems

Poor accuracy
- The sorting criterion always prefers certain attributes
- Are the first letters more important for identity than the last ones?
- Is the surname more important than the house number?

Increase the window size?
- Not helpful
- The dominance of an attribute remains the same, but the runtime deteriorates rapidly

Page 70: ETL Process ETL: Overview

Data Errors

Multi-pass technique

Sort by multiple criteria and identify duplicates in each pass
Form the transitive closure of the duplicates up to a given path length (see the sketch below)

[Figure: the records A, B, C in two different sort orders]
1st run: "A matches B"
2nd run: "B matches C"
Transitivity: "A matches C"
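A minimal Python sketch (my own illustration): combining the match pairs from several passes in a union-find structure yields the transitive closure, i.e. the duplicate groups. The slide additionally bounds the path length, which this sketch omits.

def duplicate_groups(match_pairs):
    """Group records that are transitively connected by match pairs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b in match_pairs:
        union(a, b)

    groups = {}
    for x in parent:
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())

# Run 1 found (A, B), run 2 found (B, C) -> transitively A, B, C form one group.
print(duplicate_groups([("A", "B"), ("B", "C")]))   # [{'A', 'B', 'C'}]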

Page 71: ETL Process ETL: Overview

Data Errors

Comparison functions

Comparison functions for fields (strings A and B), including:
- Edit distance: number of edit operations (insert, delete, change) needed to transform A into B
- q-grams: comparison of the sets of all substrings of A and B of length q
- Jaro distance and Jaro-Winkler distance: consideration of common characters (within half the string length) and transposed characters (appearing at another position)

Page 72: ETL Process ETL: Overview

Data Errors

Edit Distance

Levenshtein distance:
- Number of edit operations (insert, delete, modify) needed to transform A into B
- Example:
  edit_distance("Qualität", "Quantität") = 2
  ⇒ update(3, 'n'), insert(4, 't')
- Application (see also the sketch below):

select P1.Name, P2.Name
from Produkt P1, Produkt P2
where edit_distance(P1.Name, P2.Name) <= 2
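A minimal Python sketch of the Levenshtein distance via dynamic programming; edit_distance is not a built-in SQL function, so the query above assumes a user-defined function along these lines.

def edit_distance(a, b):
    """Minimum number of insert / delete / replace operations turning a into b."""
    prev = list(range(len(b) + 1))               # distances for the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # delete ca
                            curr[j - 1] + 1,     # insert cb
                            prev[j - 1] + cost)) # replace or keep
        prev = curr
    return prev[-1]

print(edit_distance("Qualität", "Quantität"))    # 2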

Page 73: ETL Process ETL: Overview

Data Errors

q-Grams

Set of all substrings of length q
Qualität_3 := { __Q, _Qu, Qua, ual, ali, lit, itä, tät, ät_, t__ }
Observation: strings with a small edit distance have many common q-grams; for edit distance k they share at least

max(|A|, |B|) − 1 − (k − 1) · q

common q-grams
Positional q-grams: extension with the position in the string
Qualität := { (-1, __Q), (0, _Qu), (1, Qua), ... }
- Filtering for efficient comparison (see the sketch below):
  - COUNT: number of common q-grams
  - POSITION: position difference between corresponding q-grams ≤ k
  - LENGTH: difference in string lengths ≤ k
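A minimal Python sketch of padded q-grams and the COUNT filter; the padding character and the function names are illustrative assumptions.

def qgrams(s, q=3, pad="_"):
    """Set of all padded substrings of length q, e.g. qgrams('Qualität', 3)."""
    padded = pad * (q - 1) + s + pad * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

def qgram_overlap(a, b, q=3):
    """Number of common q-grams (the COUNT filter)."""
    return len(qgrams(a, q) & qgrams(b, q))

print(sorted(qgrams("Qualität")))               # the ten 3-grams listed above
print(qgram_overlap("Qualität", "Quantität"))   # number of shared 3-grams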

Page 74: ETL Process ETL: Overview

Data Errors

Data Conflicts

Data conflict: two duplicates have different attribute values for a semantically identical attribute
- In contrast to violations of integrity constraints

Data conflicts arise
- Within one information system (intra-source) and
- When integrating multiple information systems (inter-source)

Prerequisite: duplicates, i.e., identity has already been established
Required: conflict resolution (purging, reconciliation)

Page 75: ETL Process ETL: Overview

Data Errors

Data Conflicts: Origins

Lack of integrity constraints or consistency checks
With redundant schemas
Through partial information
With the emergence of duplicates
Incorrect entries
- Typing errors, transmission errors
- Incorrect calculation results

Obsolete entries
- Different update times
  - Adequate timeliness of a source
  - Delayed updates
- Forgotten updates

Page 76: ETL Process ETL: Overview

Data Errors

Data Conflicts: Remedies

Reference tables for exact value mappings
- For example cities, countries, product names, codes, ...

Similarity measures
- For typos, language variants (Meier, Mayer, ...)

Standardization and transformation
Use of background knowledge (metadata)
- For example conventions (typical spellings)
- Ontologies, thesauri, dictionaries for the treatment of homonyms, synonyms, ...

At integration time
- Preference ordering over the data sources according to relevance, trust, opening times, etc.
- Conflict resolution functions

Page 77: ETL Process ETL: Overview

ELT

ETL vs. ELT

ELT = Extract-Load-Transform
- Variant of the ETL process in which the data is transformed after the load
- Objective: transformation with SQL statements in the target database
- Dispenses with special ETL engines

[Figure: the data flows from the sources (Quellen) into the data warehouse via extract and load; the transformation takes place inside the data warehouse.]

Page 78: ETL Process ETL: Overview

ELT

ELT

Extraction
- Queries optimized for the database (e.g., SQL)
- Extraction can also be monitored with monitors
- Automatic extraction is difficult (e.g., when data structures change)

Load
- Parallel processing of SQL statements
- Bulk load (assumption: no write access to the target system)
- No record-based logging

Transformation
- Utilization of the set operations of the DW transformation component
- Complex transformations by means of procedural languages (e.g., PL/SQL)
- Specific statements (e.g., CTAS in Oracle)

Page 79: ETL Process ETL: Overview

ELT

Summary

ETL as the process of transferring data from the source systems into the DWH
ETL and data quality typically make up 80% of the effort in DWH projects!
- Slow queries are annoying
- Incorrect results make the DWH useless

Parts of the transformation step
- Schema level: schema mapping and schema transformation
- Instance level: data cleaning