
Page 1: ETL Process ETL: Overview

ETL Process

ETL: Overview

Two steps
- From the sources to the staging area
  - Extraction of data from the sources
  - Creation / detection of differential updates
  - Creation of LOAD files
- From the staging area to the base database
  - Data cleaning and tagging
  - Preparation of integrated data sets
- Continuous data provision for the DWH
- Assurance of consistency with regard to the DWH data sources

Efficient methods essential → minimize offline time
Rigorous tests essential → ensure data quality

Page 2: ETL Process ETL: Overview

ETL Process

ETL Process

Frequently the most elaborate part of data warehousing
- Variety of sources
- Heterogeneity
- Data volume
- Complexity of the transformation
  - Schema and instance integration
  - Data cleansing
- Hardly any consistent methodology or system support, but a variety of tools available

Page 3: ETL Process ETL: Overview

ETL Process

ETL Process

Extraction: selecting a subset of the data from the sources and providing it for transformation
Transformation: fitting the data to the predefined schema and quality requirements
Load: physical insertion of the data from the staging area into the data warehouse (including necessary aggregations)

Page 4: ETL Process ETL: Overview

ETL Process

Definition Phase of the ETL Process

[Figure: definition phase of the ETL process. Starting from source-data analysis of the data sources (OLTP, legacy systems, external sources), the objects are selected, the transformations are created, and the ETL routines are created for the DWH. Metadata management with a repository accompanies all steps and holds the analysis requirements, the data model and conventions, the documentation / operational data catalog, the data-quality rules, the transformation rules, the success criteria for the load routines, and the ETL jobs (mappings, key transformations, normalization).]

Page 5: ETL Process ETL: Overview

Extraction of Data from Sources

Extraction

Task
- Regular extraction of change data from the sources
- Data provision for the DWH

Distinction
- Time of extraction
- Type of extracted data

Page 6: ETL Process ETL: Overview

Extraction of Data from Sources

Point in Time

Synchronous notification
- Source propagates each change

Asynchronous notification
- Periodic
  - Sources produce extracts regularly
  - DWH regularly scans the dataset
- Event-driven
  - DWH requests changes before each annual reporting
  - Source notifies after every X changes
- Query-controlled
  - DWH queries for changes before any actual access

Page 7: ETL Process ETL: Overview

Extraction of Data from Sources

Type of Data

Flow: integrate all changes into the DWH
- Short positions, trades
- Must accommodate changes

Stock: the point in time is essential and must be set
- Number of employees in a store at the end of the month
- Stock at the end of the year

Value per unit: depends on the unit and other dimensions
- Exchange rate at a point in time
- Gold price on a stock exchange

Page 8: ETL Process ETL: Overview

Extraction of Data from Sources

Type of Data

Snapshots: the source always provides the complete data set
- New supplier directory, new price list, etc.
- Changes must be detected
- History must be depicted correctly

Logs: the source provides every change
- Transaction logs, application-controlled logging
- Changes can be imported efficiently

Net logs: the source provides net changes
- Catalog updates, snapshot deltas
- No complete history possible
- Changes can be imported efficiently

Page 9: ETL Process ETL: Overview

Extraction of Data from Sources

Point in Time of Data Provision

| Source ...                              | Method                | Timeliness             | Workload on DWH        | Workload on sources    |
| creates files periodically              | batch runs, snapshots | depending on frequency | low                    | low                    |
| propagates each change                  | trigger, replication  | maximum                | high                   | very high              |
| creates extracts on request, before use | very hard             | maximum                | medium                 | medium                 |
| application-driven                      | application-driven    | depending on frequency | depending on frequency | depending on frequency |

Page 10: ETL Process ETL: Overview

Extraction of Data from Sources

Point in Time of Data Provision

Comments on the three previous options:
- Many systems (mainframes) are not accessible online
- Contradicts the idea of the DWH: more workload on the sources
- Technically not efficiently implementable

Page 11: ETL Process ETL: Overview

Extraction of Data from Sources

Extraction from Legacy Systems

Very dependent on the application
Access to host systems without online access
- Access via batch jobs, report writers, scheduling
Data in non-standard databases without APIs
- Programming in PL/1, COBOL, Natural, IMS, ...
Unclear semantics, fields used for several purposes, "speaking" keys, missing documentation, domain knowledge held only by a few people
But: commercial tools are available

Page 12: ETL Process ETL: Overview

Extraction of Data from Sources

Differential Snapshot Problem

Many sources provide only the full dataset
- Molecular biology databases
- Customer lists, employee lists
- Product catalogues

Problem
- Repeated import of all data is inefficient
- Duplicates need to be detected

Algorithms to compute delta files
- Hard for very large files

[Labio, Garcia-Molina 1996]

Page 13: ETL Process ETL: Overview

Extraction of Data from Sources

Scenario

Sources provide snapshots as a file F
- Unordered set of records (K, A1, ..., An)

Given: F1, F2 with f1 = |F1|, f2 = |F2|
Compute the smallest set O = {INS, DEL, UPD}* with O(F1) = F2
→ the Differential Snapshot Problem

O is not unique:
O1 = {INS(X), ∅, DEL(X)} ≡ O2 = {∅, ∅, ∅}

Page 14: ETL Process ETL: Overview

Extraction of Data from Sources

Scenario

Example: the differential snapshot algorithm compares the old file F1 with the new file F2 and sends the resulting delta to the DWH.

F1: (K4, t, r, ...), (K102, p, q, ...), (K104, k, k, ...), (K202, a, a, ...)
F2: (K3, t, r, ...), (K102, p, q, ...), (K103, t, h, ...), (K104, k, k, ...), (K202, b, b, ...)

Delta: INS K3, DEL K4, INS K103, UPD K202: ...

Page 15: ETL Process ETL: Overview

Extraction of Data from Sources

Assumptions

Computing a consecutive sequence of differential snapshots
- Files from 1.1.2010, 1.2.2010, 1.3.2010, ...

Cost model
- All operations in main memory are free
- IO counts the number of records (sequential reads)
- No consideration of block sizes

Size of main memory: M (records)
File size: |Fx| = fx (records)
Files are generally larger than main memory

Page 16: ETL Process ETL: Overview

Extraction of Data from Sources

DSnaive – Nested Loop

Computing O
- Read record R from F1
- Read F2 sequentially and compare against R
  - R not in F2 → O := O ∪ {DEL(R)}
  - R in F2 → O := O ∪ {UPD(R)} / ignore

Problem: INS is not found
- Auxiliary structure necessary
- Array with the IDs from F2 (generated on the fly)
- Mark matched records; a final pass yields the INS entries

Number of IO operations: f1 · f2 + δ

Improvements?
- Stop the scan of F2 as soon as R has been found
- Load partitions of size M from F1: (f1 / M) · f2

Page 17: ETL Process ETL: Overview

Extraction of Data from Sources

DSsmall – small files

Assumption: main memory M > f1 (or f2)
Computing O (see the sketch below)
- Read F1 completely into main memory
- Read F2 sequentially (record S)
  - S ∈ F1: O := O ∪ {UPD(S)} / ignore
  - S ∉ F1: O := O ∪ {INS(S)}
  - Mark S in F1 (bit array)
- Finally, for each record R ∈ F1 without a mark: O := O ∪ {DEL(R)}

Number of IO operations: f1 + f2 + δ

Improvements
- Sort F1 in main memory → faster lookup
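A minimal Python sketch of the DSsmall approach; the record layout and the function names are illustrative assumptions, not part of the slides. F1 is loaded into a dictionary, F2 is scanned once, and every unmarked F1 key yields a deletion.

def ds_small(f1_records, f2_records):
    """Compute the delta O as a list of (op, key, attrs) entries."""
    old = {key: attrs for key, attrs in f1_records}   # read F1 completely
    seen = set()                                      # plays the role of the bit array
    delta = []
    for key, attrs in f2_records:                     # read F2 sequentially
        if key in old:
            seen.add(key)                             # mark S in F1
            if old[key] != attrs:
                delta.append(("UPD", key, attrs))
        else:
            delta.append(("INS", key, attrs))
    for key in old:                                   # unmarked records were deleted
        if key not in seen:
            delta.append(("DEL", key, None))
    return delta

# Example matching the scenario slide:
F1 = [("K4", "t,r"), ("K102", "p,q"), ("K104", "k,k"), ("K202", "a,a")]
F2 = [("K3", "t,r"), ("K102", "p,q"), ("K103", "t,h"), ("K104", "k,k"), ("K202", "b,b")]
print(ds_small(F1, F2))
# [('INS', 'K3', 't,r'), ('INS', 'K103', 't,h'), ('UPD', 'K202', 'b,b'), ('DEL', 'K4', None)]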

Page 18: ETL Process ETL: Overview

Extraction of Data from Sources

DSsort – Sort-Merge

General case: M ≪ f1 and M ≪ f2
Assumption: F1 is sorted
Sort F2 in secondary storage
- Read F2 in partitions Pi with |Pi| = M
- Sort each Pi in main memory and write it out as a run Fi
- Merge all runs Fi
- Assumption: M > √f2 → IO: 4 · f2

Keep the sorted F2 for the next DS (it becomes F1 there)
- Per DS only F2 needs to be sorted

Computing O
- Open the sorted F1 and F2
- Merge (parallel reads with skipping)

Number of IO operations: f1 + 5 · f2 + δ

Page 19: ETL Process ETL: Overview

Extraction of Data from Sources

DSsort2 – Interleaved

Sorted F1 given
Computing O
- Read F2 in partitions Pi with |Pi| = M
- Sort each Pi in main memory and write it out as a run F2,i
- Merge all runs F2,i and simultaneously compare against F1

Number of IO operations: f1 + 4 · f2 + δ

Page 20: ETL Process ETL: Overview

Extraction of Data from Sources

DShash – Partitioned Hash

Computing O
- Hash F2 into partitions Pi with |Pi| = M/2
- The hash function has to guarantee Pi ∩ Pj = ∅ for all i ≠ j
- The partitions are "equivalence classes" w.r.t. the hash function
- F1 is already partitioned (kept from the previous run)
- F1 and F2 are partitioned with the same hash function
- Read and merge P1,i and P2,i pairwise in main memory (see the sketch below)

Number of IO operations: f1 + 3 · f2 + δ
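A minimal Python sketch of the partitioned-hash idea; the function names are illustrative assumptions, and it reuses ds_small from the sketch above. Both files are split into disjoint partitions with the same hash function, and corresponding partitions are diffed pairwise in main memory.

from collections import defaultdict

def partition(records, num_partitions):
    """Hash-partition (key, attrs) records; hash(key) selects the partition."""
    parts = defaultdict(list)
    for key, attrs in records:
        parts[hash(key) % num_partitions].append((key, attrs))
    return parts

def ds_hash(f1_records, f2_records, num_partitions=4):
    p1 = partition(f1_records, num_partitions)   # in reality kept from the previous run
    p2 = partition(f2_records, num_partitions)
    delta = []
    for i in range(num_partitions):              # each partition pair fits into main memory
        delta.extend(ds_small(p1.get(i, []), p2.get(i, [])))   # ds_small: see sketch above
    return delta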

Page 21: ETL Process ETL: Overview

Extraction of Data from Sources

Why not simply . . .

UNIX diff?
- diff requires / considers the surroundings of the records (context lines)
- Here: the records are not ordered

In the database with SQL?
- Requires reading each relation three times

INSERT INTO delta
SELECT 'UPD', ... FROM F1, F2
WHERE F1.K = F2.K AND F1.W <> F2.W
UNION
SELECT 'INS', ... FROM F2
WHERE NOT EXISTS (...)
UNION
SELECT 'DEL', ... FROM F1
WHERE NOT EXISTS (...)

Page 22: ETL Process ETL: Overview

Extraction of Data from Sources

Comparison – Features

|         | IO          | Remarks                                                                                                        |
| DSnaive | f1 · f2     | not competitive; auxiliary data structure required                                                            |
| DSsmall | f1 + f2     | only for small files                                                                                           |
| DSsort2 | f1 + 4 · f2 |                                                                                                                |
| DShash  | f1 + 3 · f2 | non-overlapping hash function; partition size hard to estimate; assumptions about the distribution (sampling) |

Extensions of DShash for "worse" hash functions are known

Page 23: ETL Process ETL: Overview

Extraction of Data from Sources

Further DS Approaches

Number of partitions / runs larger than the number of file descriptors in the OS
- Hierarchical external sorting methods

Compression: compress the files
- Larger partitions / runs
- Better chance of performing the comparisons in main memory
- Faster in practice (cf. the assumptions of the cost model)

"Window" algorithm
- Assumption: the files have a "fuzzy" order
- Merge with a sliding window over both files
- Returns many redundant INS-DEL pairs
- Number of IO operations: f1 + f2

Page 24: ETL Process ETL: Overview

Extraction of Data from Sources

DS with Timestamp

Assumption: records are (K, A1, ..., An, T)
- T: timestamp of the last change

Computing O
- Keep track of T_old: the last update (max{T} of F1)
- Read F2 sequentially
- Entries with T > T_old are of interest
- But: INS or UPD? (see the sketch below)

Another problem: DEL is not found
The timestamp only spares the attribute comparison
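A minimal Python sketch of timestamp-based extraction; the record layout and names are illustrative assumptions. It makes the limitation explicit: changed records are detected, but without the old key set INS and UPD cannot be distinguished, and DEL never shows up.

def changed_since(f2_records, t_old, old_keys=None):
    """Records are (key, attrs, ts); t_old is max{T} of the previous snapshot F1."""
    delta = []
    for key, attrs, ts in f2_records:
        if ts > t_old:
            if old_keys is None:
                delta.append(("INS_OR_UPD", key, attrs))   # ambiguous without F1's keys
            else:
                delta.append(("UPD" if key in old_keys else "INS", key, attrs))
    return delta                                           # DEL entries can never appear here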

Page 25: ETL Process ETL: Overview

Data Load

Load

Task
- Efficient incorporation of external data into the DWH

Critical point
- Loading operations may block the entire DWH (write lock on the fact table)

Aspects
- Triggers
- Integrity constraints
- Index updates
- Update or insert?

Page 26: ETL Process ETL: Overview

Data Load

Set based

Use of standard interfaces: PRO*SQL, JDBC, ODBC, ...
Works in the normal transaction context
Triggers, indexes and constraints remain active
- Manual deactivation possible

No large-scale locks
Locks can be reduced by intermediate COMMITs
- Not an issue in Oracle: read operations are never blocked (MVCC)

Use of prepared statements
Partly proprietary extensions (arrays) available

Page 27: ETL Process ETL: Overview

Data Load

BULK Load

DB-specific extensions for loading large amounts of data
Runs (usually) in a special context
- Oracle: DIRECT PATH option in the loader
- Complete table lock
- No consideration of triggers or constraints
- Indexes are not updated until afterwards
- No transactional context
- No logging
- Checkpoints for recovery

In practice: BULK uploads

Page 28: ETL Process ETL: Overview

Data Load

Example: ORACLE sqlldr

[Figure: SQL*Loader architecture. Input data files and a loader control file are read by SQL*Loader, which writes a log file, a bad file for bad records, and a discard file for rejected records, and loads the remaining data into the tables and indexes of the database. Source: Oracle 11g Documentation]

Page 29: ETL Process ETL: Overview

Data Load

Example: ORACLE sqlldr (2)

Control file:

LOAD DATA
INFILE 'bier.dat'
REPLACE INTO TABLE getraenke (
  bier_name           POSITION(1)  CHAR(35),
  bier_preis          POSITION(37) ZONED(4,2),
  bier_bestellgroesse POSITION(42) INTEGER,
  getraenk_id         "getraenke_seq.nextval")

Data file bier.dat:

Ilmenauer Pils                      4490 100
Erfurter Bock                       6400  80
Magdeburger Weisse                  1290  20
Anhaltinisch Flüssig                8800 200

Page 30: ETL Process ETL: Overview

Data Load

BULK Load Example

Many options
- Treatment of exceptions (bad file)
- Data transformations
- Checkpoints
- Optional fields
- Conditional loading into multiple tables
- Conditional loading of records
- REPLACE or APPEND
- Parallel load
- ...

Page 31: ETL Process ETL: Overview

Data Load

Direct Path Load

[Figure: conventional path vs. direct path load in Oracle. In the conventional path, SQL*Loader and user processes generate SQL commands that pass through SQL command processing, space management (allocate new extents, adjust the fill level, find and fill partial blocks), and the buffer cache (queue management, conflict resolution) before database blocks are read and written; the direct path writes database blocks directly. Source: Oracle 11g Documentation]

Page 32: ETL Process ETL: Overview

Data Load

Multi-Table-Insert in Oracle

Insert into multiple tables or into one table multiple times (e.g., for pivoting)

INSERT ALL
  INTO Quartal_Verkauf VALUES (Produkt_Nr, Jahr || '/Q1', Umsatz_Q1)
  INTO Quartal_Verkauf VALUES (Produkt_Nr, Jahr || '/Q2', Umsatz_Q2)
  INTO Quartal_Verkauf VALUES (Produkt_Nr, Jahr || '/Q3', Umsatz_Q3)
  INTO Quartal_Verkauf VALUES (Produkt_Nr, Jahr || '/Q4', Umsatz_Q4)
SELECT ... FROM ...

Page 33: ETL Process ETL: Overview

Data Load

Multi-Table-Insert in Oracle (2)

Conditional insert

INSERT ALL
  WHEN ProdNr IN (SELECT ProdNr FROM Werbe_Aktionen) THEN
    INTO Aktions_Verkauf VALUES (ProdNr, Quartal, Umsatz)
  WHEN Umsatz > 1000 THEN
    INTO Top_Produkte VALUES (ProdNr)
SELECT ... FROM ...

Page 34: ETL Process ETL: Overview

Data Load

Merge in Oracle

Merge: attempted insert; on error (violation of a key constraint) → update

MERGE INTO Kunden K USING Neukunden N
ON (N.Name = K.Name AND N.GebDatum = K.GebDatum)
WHEN MATCHED THEN
  UPDATE SET K.Name = N.Name, K.Vorname = N.Vorname,
             K.GebDatum = N.GebDatum
WHEN NOT MATCHED THEN
  INSERT VALUES (MySeq.NextVal, N.Name, N.Vorname, N.GebDatum)

Page 35: ETL Process ETL: Overview

Data Load

The ETL Process: Transformation Tasks

[Figure: the ETL process and its transformation tasks. Data flows from the operational sources through (1) extraction with instance extraction and transformation, (2) an intermediate store, (3) integration with instance matching and integration, and (4) filtering and aggregation into (5) the data warehouse. A parallel metadata flow carries instance characteristics (real metadata), translation rules, mappings from source to target schemas, and filtering and aggregation rules; scheduling, logging, monitoring, recovery, and backup accompany the entire process. After Rahm, Do 2000]

Page 36: ETL Process ETL: Overview

Data Load

Method: Source – Staging Area – BaseDB

[Figure: source 1 (an RDBMS) and source 2 (IMS) are mapped to relational schemas Q1 and Q2 in the staging area and from there into the integrated schema / data cube of the base database.]

The BULK load is only the first step
Subsequent loads
- INSERT INTO ... SELECT ...
- Logging can be switched off
- Parallelizable

Page 37: ETL Process ETL: Overview

Data Load

Transformation Tasks

When loading
- Simple conversions (for the LOAD file)
- Record orientation (tuples)
- Preparation for the BULK loader → mostly scripts or 3GL

In the data staging area
- Set-oriented calculations
- Inter- and intra-relation comparisons
- Comparison with the base database → duplicates
- Tagging of records
- SQL

Loading into the BaseDB
- Bulk load
- Set-oriented inserts without logging

Page 38: ETL Process ETL: Overview

Data Load

Task: Source – Staging Area – BaseDB

What to do, where, and when?
- No fixed task assignment

|                      | Extraction: Source → Staging Area                 | Load: Staging Area → BaseDB   |
| Access type          | record-oriented                                   | set-oriented                  |
| Available databases  | one source (update file)                          | many sources                  |
| Available datasets   | depending on the source: all, all changes, deltas | BaseDB additionally available |
| Programming language | scripts (Perl, AWK, ...) or 3GL                   | SQL, PL/SQL                   |

Page 39: ETL Process ETL: Overview

Transformation Tasks

Transformation

Problem
- Data in the staging area is not in the format of the base database
- The structure of the data varies
  - Staging area: schema close to the source
  - BaseDB: multidimensional schema
  - Structural heterogeneity

Aspects
- Data transformation
- Schema transformation

Page 40: ETL Process ETL: Overview

Transformation Tasks

Data and schema heterogeneity

Main data source: OLTP systems
Secondary sources:
- Documents from in-house legacy archives
- Documents from the Internet via WWW, FTP
  - Unstructured: access via search engines, ...
  - Semi-structured: access via search engines, mediators, wrappers, etc., as XML documents or similar

Basic problem: heterogeneity of the sources

Page 41: ETL Process ETL: Overview

Transformation Tasks

Aspects of Heterogeneity

Different data models
- Due to autonomous decisions on the acquisition of systems in the divisions
- Different and differently powerful modeling constructs
- Application semantics can be specified to varying degrees
→ The mapping between data models is ambiguous

Example: relational model vs. object-oriented modeling vs. XML

[Figure: the entity Kunde with the attributes Name, Vorname, PLZ modeled as a relational table, as a class, and as an XML tree.]

Page 42: ETL Process ETL: Overview

Transformation Tasks

Aspects of Heterogeneity (2)

Different models for the same real-world facts
- Due to design autonomy
- Even within the same data model, different modeling is possible, e.g., because of different modeling perspectives of the DB designers

[Figure: Kunde(Name, Vorname, Geschlecht, ...) modeled with a gender attribute vs. Kunde(Name, Vorname, ...) specialized into the subtypes Mann and Frau.]

Page 43: ETL Process ETL: Overview

Transformation Tasks

Aspects of Heterogeneity (3)

Different representations of the data
- Different data types possible
- Different ranges of the supported data types
- Different internal representations of the data
- Also, different "values" of a data type may represent the same information

Page 44: ETL Process ETL: Overview

Transformation Tasks

Data Error Classification

Data errors
- Single data sources
  - Schema level (missing integrity constraints, poor schema design): illegal values, violated attribute dependencies, violated uniqueness, violated referential integrity
  - Data level (data entry errors): missing values, misspellings, wrong values, wrong references, cryptic values, embedded values, wrong assignments, contradictory values, transpositions, duplicates, data conflicts
- Integrated data sources
  - Schema level (heterogeneous data models and schemas): structural heterogeneity, semantic heterogeneity, schematic heterogeneity
  - Data level (overlapping, contradictory, and inconsistent data): contradictory values, different representations, different precision, different aggregation levels, duplicates

[Rahm, Do 2000; Leser, Naumann 2007]

Page 45: ETL Process ETL: Overview

Schema Heterogeneity

Schema Heterogeneity

Cause: design autonomy → different models
- Different normalization
- What is a relation, what is an attribute, what is a value?
- Distribution of the data across tables
- Redundancies from the source systems
- Keys

Not well supported in SQL
- INSERT has only one target table
- SQL accesses data, not schema elements
- Usually requires programming

Page 46: ETL Process ETL: Overview

Schema Heterogeneity

Schema Mapping

Data transformation between heterogeneous schemas
- Old but recurring problem
- Usually, experts write complex queries or programs
- Time-intensive
  - Requires expertise in the domain, the schemas, and the query languages
  - XML makes it even harder: XML Schema, XQuery

Idea: automation
- Given: two schemas and a high-level mapping between them
- Wanted: a query for the data transformation

Page 47: ETL Process ETL: Overview

Schema Heterogeneity

Why is schema mapping difficult?

Generation of the "right" query, taking into account
- the source and target schemas,
- the mapping,
- and the user's intention: semantics!

Guarantee that the transformed data conforms to the target schema
- Flat or nested
- Integrity constraints

Efficient data transformation

Page 48: ETL Process ETL: Overview

Schema Heterogeneity

Schema Mapping: Normalized vs. Denormalized

1:1 associations are represented differently
- By occurrence in the same tuple
- By a foreign-key relationship

[Schemas: Bier(bID, name, alkoholgehalt) and Produktsorte(pFK, bezeichnung) vs. the denormalized Produkt(pID, name, hersteller, produktsorte).]

SELECT bID AS pID, name, NULL AS hersteller,
       NULL AS produktsorte FROM Bier
UNION
SELECT NULL AS pID, NULL AS name, NULL AS hersteller,
       bezeichnung AS produktsorte FROM Produktsorte

Page 49: ETL Process ETL: Overview

Schema Heterogeneity

Schema Mapping: Normalized vs. Denormalized (2)

[Schemas as before: Bier(bID, name, alkoholgehalt), Produktsorte(pFK, bezeichnung), and the denormalized Produkt(pID, name, hersteller, produktsorte).]

SELECT bID AS pID, name, NULL AS hersteller,
       bezeichnung AS produktsorte
FROM Bier, Produktsorte
WHERE bID = pFK

Only one of four possible interpretations!

Page 50: ETL Process ETL: Overview

Schema Heterogeneity

Schema Mapping: Normalized vs. Denormalized (3)

[Schemas: the denormalized source Produkt(name, hersteller, produktsorte) and the normalized targets Bier(bID, name, alkoholgehalt) and Produktsorte(pFK, bezeichnung).]

Requires key generation: a Skolem function SK that supplies a unique value with respect to its input (e.g., the concatenation of all values)

Bier := SELECT SK(name) AS bID, name,
               NULL AS alkoholgehalt FROM Produkt
Produktsorte := SELECT SK(name) AS pFK,
               produktsorte AS bezeichnung FROM Produkt

Page 51: ETL Process ETL: Overview

Schema Heterogeneity

Schema mapping: Nested vs. Flat

1:1 associations are represented differently
- E.g., as nested elements
- Or via a foreign-key relationship

[Figure: the flat relational schemas Bier(bID, name, alkoholgehalt), Produkt(pID, name, produktsorte), and Produktsorte(bezeichnung) vs. a nested structure in which Bier(name) and Produktsorte(bezeichnung) are nested inside Produkt(name, hersteller, produktsorte).]

Page 52: ETL Process ETL: Overview

Schema Heterogeneity

Difficulties

Example: Source(ID, Name, Street, ZIP-Code, Revenue)

Target schema #1
Customer(ID, Name, Revenue)
Address(ID, Street, ZIP-Code)
- Requires 2 scans of the source table

INSERT INTO Customer ... SELECT ...
INSERT INTO Address ... SELECT ...

Target schema #2
PremCustomer(ID, Name, Revenue)
NormCustomer(ID, Name, Revenue)
- Requires 2 scans of the source table

INSERT INTO PremCustomer ... SELECT ... WHERE Revenue >= X
INSERT INTO NormCustomer ... SELECT ... WHERE Revenue < X

Page 53: ETL Process ETL: Overview

Schema Heterogeneity

Difficulties (2)

Schemas
P1(Id, Name, Gender)
P2(Id, Name, M, W)
P31(Id, Name), P32(Id, Name)

P1 → P2
INSERT INTO P2 (id, name, 'T', 'F') ... SELECT ...
INSERT INTO P2 (id, name, 'F', 'T') ... SELECT ...

P3 → P1
INSERT INTO P1 (id, name, 'female') ... SELECT ... FROM P31
INSERT INTO P1 (id, name, 'male') ... SELECT ... FROM P32

The number of values must be fixed; a new gender → all queries have to be changed

Page 54: ETL Process ETL: Overview

Data Errors

Data Errors

Example table Person:

| KNr | Name        | Geb.datum  | Alter | Geschl. | Telefon | PLZ   | Email             |
| 34  | Meier, Tom  | 21.01.1980 | 35    | M       | 999-999 | 39107 | null              |
| 34  | Tina Möller | 18.04.78   | 29    | W       | 763-222 | 36999 | null              |
| 35  | Tom Meier   | 32.05.1969 | 27    | F       | 222-231 | 39107 | [email protected] |

Reference table Ort:

| PLZ   | Ort       |
| 39107 | Magdeburg |
| 36996 | Spanien   |
| 95555 | Illmenau  |

Annotated error types: uniqueness violated, different representations, contradictory values, missing values (e.g., default values), referential integrity violated, duplicates, spelling or typing errors, wrong or invalid values, incomplete data.

Page 55: ETL Process ETL: Overview

Data Errors

Avoiding Data Errors

| Avoiding of                    | by                                        |
| wrong data types               | data type definitions, domain constraints |
| wrong values                   | CHECK constraints                         |
| missing values                 | NOT NULL                                  |
| invalid foreign key references | FOREIGN KEY                               |
| duplicates                     | UNIQUE, PRIMARY KEY                       |
| inconsistencies                | transactions                              |
| outdated data                  | replication, materialized views           |

However, in practice:
- Lack of metadata and integrity constraints, ...
- Input errors, ignorance, ...
- Heterogeneity
- ...

Page 56: ETL Process ETL: Overview

Data Errors

Phases of Data Processing

[Figure: phases of data processing. Collection/selection; data profiling (identify and quantify data-quality problems, recognize error types and causes); data cleaning (standardization/normalization, error correction, duplicate detection and merging); transformation (aggregation / feature extraction, dimensionality reduction / sampling, discretization); usage.]

Page 57: ETL Process ETL: Overview

Data Errors

Data Profiling

Analysis of the content and structure of individual attributes
- Data type, range, distribution and variance, occurrence of null values, uniqueness, patterns (e.g., dd/mm/yyyy)

Analysis of dependencies between the attributes of a relation
- "Fuzzy" keys
- Functional dependencies, potential primary keys, "fuzzy" dependencies
- Needed because:
  - No explicit constraints are specified
  - But most of the data satisfies them

Analysis of overlaps between attributes of different relations
- Redundancies, foreign key relationships

Page 58: ETL Process ETL: Overview

Data Errors

Data Profiling (2)

Missing or incorrect values
- Calculated vs. expected cardinality (e.g., number of branches, gender of clients)
- Number of null values, minimum / maximum, variance

Data or input errors
- Sorting and manual inspection
- Similarity tests

Duplicates
- Number of tuples vs. attribute cardinality

Page 59: ETL Process ETL: Overview

Data Errors

Data Profiling with SQL

SQL queries for simple profiling tasks
- Schema, data types: queries against the schema catalog
- Range of values

select min(A), max(A), count(distinct A)
from Tabelle

- Data errors, default values

select City, count(*) as Numb
from Customer group by City order by Numb

  - Ascending: input errors, e.g., Illmenau: 1, Ilmenau: 50
  - Descending: undocumented default values, e.g., AAA: 80

Page 60: ETL Process ETL: Overview

Data Errors

Data Cleaning

Detect and eliminate inconsistencies, contradictions, and errors in the data with the aim of improving its quality
Also called cleansing or scrubbing
Up to 80% of the effort in DW projects
Cleaning in the DW: part of the ETL process

Page 61: ETL Process ETL: Overview

Data Errors

Data Quality and Data Cleaning

[Figure: data quality criteria checked by column analysis, dependency analysis, relationship analysis, and rule-based analysis: validity of individual values and of multiple values; consistency via rule-based analysis (business and data rules, defects); referential integrity (integrity violations, orphans, cardinalities); correctness via statistical control (min, max, mean, median, standard deviation, ...); consistency of data types, field lengths, and value ranges; key uniqueness (uniqueness of primary and candidate keys); freedom from redundancy (normalization degree 1st/2nd/3rd NF, duplicate checking); uniqueness (analysis of the metadata); completeness (fill-level analysis of entities and attributes); accuracy (analysis of total and decimal digits of numeric attributes); uniformity (format analysis for numeric attributes, time units, and strings).]

Page 62: ETL Process ETL: Overview

Data Errors

Normalization and Standardization

Data type conversion: varchar → int
Encodings: 1: address unknown, 2: old address, 3: current address, 4: address of spouse, ...
Normalization: mapping into a unified format (see the sketch below)
- Date: 03/01/11 → 01. März 2011
- Currency: $ → €
- Strings to uppercase

Tokenization: "Saake, Gunter" → "Saake", "Gunter"
Discretization of numeric values
Domain-specific transformations
- Codd, Edgar Frank → Edgar Frank Codd
- Str. → Street
- Addresses from address databases
- Industry-specific product names
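A minimal Python sketch of a few of the normalization and tokenization steps listed above; the rules, date format, and helper names are illustrative assumptions only.

from datetime import datetime

ABBREVIATIONS = {"Str.": "Street"}          # small domain-specific replacement table

def normalize_date(value):
    """'03/01/11' -> '2011-03-01' (assuming mm/dd/yy input, as in the slide example)."""
    return datetime.strptime(value, "%m/%d/%y").date().isoformat()

def tokenize_name(value):
    """'Saake, Gunter' -> ('Saake', 'Gunter')."""
    last, first = [t.strip() for t in value.split(",", 1)]
    return last, first

def normalize_string(value):
    """Expand abbreviations and map to uppercase."""
    for short, full in ABBREVIATIONS.items():
        value = value.replace(short, full)
    return value.upper()

print(normalize_date("03/01/11"))         # 2011-03-01
print(tokenize_name("Saake, Gunter"))     # ('Saake', 'Gunter')
print(normalize_string("Haupt Str. 12"))  # HAUPT STREET 12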

Page 63: ETL Process ETL: Overview

Data Errors

Data Transformation

Well supported in SQL
- Many functions in the language standard
- String functions, decoding, date conversion, formulas, system variables, ...
- Create functions in PL/SQL and use them in SQL

Data
"Pause, Lilo" ⇒ "Pause", "Lilo"
"Prehn, Leo" ⇒ "Prehn", "Leo"

SQL
INSERT INTO customers (last_name, first_name)
SELECT SubStr(name, 0, inStr(name, ',') - 1),
       SubStr(name, inStr(name, ',') + 1)
FROM rawdata;

Page 64: ETL Process ETL: Overview

Data Errors

Duplicate Detection

Identify semantically equivalent records, i.e., records that represent the same real-world object
See also: record linkage, object identification, duplicate elimination, merge/purge
- Merge: detect duplicates
- Purge: selection / computation of the "best" representative per class

| CustomerNr | Name          | Address         |
| 3346       | Just Vorfan   | Hafenstrasse 12 |
| 3346       | Justin Forfun | Hafenstr. 12    |
| 5252       | Lilo Pause    | Kuhweg 42       |
| 5268       | Lisa Pause    | Kuhweg 42       |
| ⊥          | Ann Joy       | Domplatz 2a     |
| ⊥          | Anne Scheu    | Domplatz 28     |

Page 65: ETL Process ETL: Overview

Data Errors

Duplicate Detection: Comparisons

Typical comparison rules

if ssn1 = ssn2 then match
else if name1 = name2 then
  if firstname1 = firstname2 then
    if adr1 = adr2 then match
    else unmatch
  else if adr1 = adr2 then match_household
else if adr1 = adr2 then ...

Naive approach: "all vs. all"
- O(n²) comparisons
- Maximum accuracy (depending on the rules)
- Far too expensive

Page 66: ETL Process ETL: Overview

Data Errors

Duplicate Detection: Principle

[Figure: duplicate detection principle. The cross product R × S of two record sets R = {r1, r2, r3, ...} and S = {s1, s2, s3, ...} is reduced by partitioning the search space; a comparison function classifies the remaining pairs into matches (M) and non-matches (U).]

Page 67: ETL Process ETL: Overview

Data Errors

Partitioning

Blocking
- Division of the search space into disjoint blocks
- Duplicates are searched for only within a block

Sorted neighborhood [Hernandez, Stolfo 1998]
- Sort the data based on a selected key
- Compare within a sliding window

Multi-pass technique
- Transitive closure over different sort orders

Page 68: ETL Process ETL: Overview

Data Errors

Sorted Neighborhood

1. Compute a key for each record
   - e.g., SSN + "first 3 characters of Name" + ...
   - Account for typical errors: 0 vs. O, Soundex, neighboring keys, ...
2. Sort by key
3. Scan the list sequentially
4. Compare within a window W, |W| = w (see the sketch below)
   - Which tuples really need to be compared?

Complexity
- Key generation: O(n); sorting: O(n · log(n)); comparing: O((n/w) · w²) = O(n · w)
- Total: O(n · log(n)) or O(n · w)
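A minimal Python sketch of the sorted-neighborhood method; the key construction, the similarity test, and the record layout are illustrative assumptions, not part of the slides.

def sorting_key(record):
    """Blocking key: ZIP code plus the first 3 characters of the name."""
    return (record["plz"], record["name"][:3].upper())

def is_duplicate(r1, r2):
    """Placeholder comparison function (a real one would use edit distance etc.)."""
    return r1["plz"] == r2["plz"] and r1["name"][:4].upper() == r2["name"][:4].upper()

def sorted_neighborhood(records, window=3):
    """Return candidate duplicate pairs found within the sliding window."""
    ordered = sorted(records, key=sorting_key)            # O(n log n)
    pairs = []
    for i, rec in enumerate(ordered):
        for other in ordered[max(0, i - window + 1):i]:   # only the w-1 predecessors
            if is_duplicate(rec, other):
                pairs.append((other["name"], rec["name"]))
    return pairs

customers = [
    {"name": "Just Vorfan",   "plz": "39107"},
    {"name": "Justin Forfun", "plz": "39107"},
    {"name": "Lilo Pause",    "plz": "36999"},
]
print(sorted_neighborhood(customers))   # [('Just Vorfan', 'Justin Forfun')]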

Page 69: ETL Process ETL: Overview

Data Errors

Sorted Neighborhood: Problems

Poor accuracy
- The sorting criterion always prefers certain attributes
- Are the first letters more important for identity than the last ones?
- Is the surname more important than the house number?

Increase the window size?
- Not helpful
- The dominance of an attribute remains the same, but the runtime deteriorates rapidly

Page 70: ETL Process ETL: Overview

Data Errors

Multi-pass technique

Sort by multiple criteria and identify duplicates in each pass
Form the transitive closure of the duplicates up to a given path length (see the sketch below)

[Figure: the records A, B, C in two different sort orders]
1st run: "A matches B"
2nd run: "B matches C"
Transitivity: "A matches C"
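A minimal Python sketch (my own illustration): combining the match pairs from several passes in a union-find structure yields the transitive closure, i.e. the duplicate groups. The slide additionally bounds the path length, which this sketch omits.

def duplicate_groups(match_pairs):
    """Group records that are transitively connected by match pairs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b in match_pairs:
        union(a, b)

    groups = {}
    for x in parent:
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())

# Run 1 found (A, B), run 2 found (B, C) -> transitively A, B, C form one group.
print(duplicate_groups([("A", "B"), ("B", "C")]))   # [{'A', 'B', 'C'}]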

Page 71: ETL Process ETL: Overview

Data Errors

Comparison functions

Comparison functions for fields (strings A and B), including:
- Edit distance: number of edit operations (insert, delete, change) needed to transform A into B
- q-grams: comparison of the sets of all substrings of A and B of length q
- Jaro distance and Jaro-Winkler distance: consideration of common characters (within half the string length) and transposed characters (appearing at another position)

Page 72: ETL Process ETL: Overview

Data Errors

Edit Distance

Levenshtein distance:
- Number of edit operations (insert, delete, modify) needed to transform A into B
- Example:
  edit_distance("Qualität", "Quantität") = 2
  ⇒ update(3, 'n'), insert(4, 't')
- Application (see also the sketch below):

select P1.Name, P2.Name
from Produkt P1, Produkt P2
where edit_distance(P1.Name, P2.Name) <= 2
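A minimal Python sketch of the Levenshtein distance via dynamic programming; edit_distance is not a built-in SQL function, so the query above assumes a user-defined function along these lines.

def edit_distance(a, b):
    """Minimum number of insert / delete / replace operations turning a into b."""
    prev = list(range(len(b) + 1))               # distances for the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # delete ca
                            curr[j - 1] + 1,     # insert cb
                            prev[j - 1] + cost)) # replace or keep
        prev = curr
    return prev[-1]

print(edit_distance("Qualität", "Quantität"))    # 2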

Page 73: ETL Process ETL: Overview

Data Errors

q-Grams

Set of all substrings of length q
Qualität_3 := { __Q, _Qu, Qua, ual, ali, lit, itä, tät, ät_, t__ }
Observation: strings with a small edit distance have many common q-grams; for edit distance k they share at least

max(|A|, |B|) − 1 − (k − 1) · q

common q-grams
Positional q-grams: extension with the position in the string
Qualität := { (-1, __Q), (0, _Qu), (1, Qua), ... }
- Filtering for efficient comparison (see the sketch below):
  - COUNT: number of common q-grams
  - POSITION: position difference between corresponding q-grams ≤ k
  - LENGTH: difference in string lengths ≤ k
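A minimal Python sketch of padded q-grams and the COUNT filter; the padding character and the function names are illustrative assumptions.

def qgrams(s, q=3, pad="_"):
    """Set of all padded substrings of length q, e.g. qgrams('Qualität', 3)."""
    padded = pad * (q - 1) + s + pad * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

def qgram_overlap(a, b, q=3):
    """Number of common q-grams (the COUNT filter)."""
    return len(qgrams(a, q) & qgrams(b, q))

print(sorted(qgrams("Qualität")))               # the ten 3-grams listed above
print(qgram_overlap("Qualität", "Quantität"))   # number of shared 3-grams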

Page 74: ETL Process ETL: Overview

Data Errors

Data Conflicts

Data conflict: two duplicates have different attribute values for a semantically identical attribute
- In contrast to violations of integrity constraints

Data conflicts arise
- Within one information system (intra-source) and
- When integrating multiple information systems (inter-source)

Prerequisite: duplicates, i.e., identity has already been established
Required: conflict resolution (purging, reconciliation)

Page 75: ETL Process ETL: Overview

Data Errors

Data Conflicts: Origins

Lack of integrity constraints or consistency checks
With redundant schemas
Through partial information
With the emergence of duplicates
Incorrect entries
- Typing errors, transmission errors
- Incorrect calculation results

Obsolete entries
- Different update times
  - Adequate timeliness of a source
  - Delayed updates
- Forgotten updates

Page 76: ETL Process ETL: Overview

Data Errors

Data Conflicts: Remedies

Reference tables for exact value mappings
- For example cities, countries, product names, codes, ...

Similarity measures
- For typos, language variants (Meier, Mayer, ...)

Standardization and transformation
Use of background knowledge (metadata)
- For example conventions (typical spellings)
- Ontologies, thesauri, dictionaries for the treatment of homonyms, synonyms, ...

At integration time
- Preference ordering over the data sources according to relevance, trust, opening times, etc.
- Conflict resolution functions

Page 77: ETL Process ETL: Overview

ELT

ETL vs. ELT

ELT = Extract-Load-Transform
- Variant of the ETL process in which the data is transformed after the load
- Objective: transformation with SQL statements in the target database
- Dispenses with special ETL engines

[Figure: the data flows from the sources (Quellen) into the data warehouse via extract and load; the transformation takes place inside the data warehouse.]

Page 78: ETL Process ETL: Overview

ELT

ELT

Extraction
- Queries optimized for the database (e.g., SQL)
- Extraction can also be monitored with monitors
- Automatic extraction is difficult (e.g., when data structures change)

Load
- Parallel processing of SQL statements
- Bulk load (assumption: no write access to the target system)
- No record-based logging

Transformation
- Utilization of the set operations of the DW transformation component
- Complex transformations by means of procedural languages (e.g., PL/SQL)
- Specific statements (e.g., CTAS in Oracle)

Page 79: ETL Process ETL: Overview

ELT

Summary

ETL as the process of transferring data from the source systems into the DWH
ETL and data quality typically make up 80% of the effort in DWH projects!
- Slow queries are annoying
- Incorrect results make the DWH useless

Parts of the transformation step
- Schema level: schema mapping and schema transformation
- Instance level: data cleaning