Copyright © 2004 SAS Institute Inc. All rights reserved.

ETL and Data Quality

Agenda

Scope of SAS Data Quality

Key features

Integration with ETL

Performance benchmarks

Scope of Data Quality

Data Profiling

Parsing and Standardization

Linking and Matching

Data Augmentation and Integration

Real-Time Data Quality

Address Validation

Data Quality Methodology

Data Integration

Data Profiling

Parsing and Standardization

Data Linking

Data Profiling; scoping the benefits

“Data profiling is the process of systematically scanning and analyzing the contents of all the columns in tables of interest.”

• Identify data defects
• Table column analysis
  − Frequency distribution
  − Min/max/outlier detection
  − Data type analysis
  − Unique/null analysis

• Metadata analysis
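The column analyses listed above can be sketched in a few lines of Python. This is only an illustration of the idea, not the SAS or DataFlux implementation; the function name `profile_column` and the sample `ages` column are made up for the example.

```python
from collections import Counter

def profile_column(values):
    """Return basic profiling metrics for one column of raw values."""
    # Treat None and empty strings as nulls (an assumption for this sketch).
    non_null = [v for v in values if v is not None and v != ""]
    freq = Counter(non_null)
    return {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "unique": len(freq),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "top": freq.most_common(3),  # frequency distribution, most common first
        "types": dict(Counter(type(v).__name__ for v in non_null)),
    }

# Hypothetical column: the value 230 stands out as a likely defect.
ages = [34, 41, None, 34, 29, 230, 41, 34]
report = profile_column(ages)
```

In a report like this, the null count, the unexpected maximum and any mixed data types are the first candidates for a closer look.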

Data Profiling – Example SAS report

Parsing …

Address data

Mr Adam S Smith Jr

• Name Prefix: Mr
• First Name: Adam
• Middle Name: S
• Last Name: Smith
• Name Suffix: Jr

Loan data

15 yr. Fixed, 6.5 percent, 2% orig + 1 point

• Term: 15 years
• Rate: 6.5%
• Points: 1%
• Origination Fee: 2%
• Type: Fixed
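As a rough illustration of the name example, the sketch below splits a name into the same five tokens. The `PREFIXES` and `SUFFIXES` vocabularies and the function `parse_name` are assumptions for this example; a real parse definition relies on a much larger vocabulary and on pattern rules.

```python
PREFIXES = {"mr", "mrs", "ms", "dr"}   # hypothetical vocabulary
SUFFIXES = {"jr", "sr", "ii", "iii"}   # hypothetical vocabulary

def parse_name(raw):
    """Split a personal name into prefix, first, middle, last and suffix."""
    tokens = raw.replace(".", "").split()
    parsed = {"prefix": "", "first": "", "middle": "", "last": "", "suffix": ""}
    if tokens and tokens[0].lower() in PREFIXES:
        parsed["prefix"] = tokens.pop(0)
    if tokens and tokens[-1].lower() in SUFFIXES:
        parsed["suffix"] = tokens.pop()
    if tokens:
        parsed["first"] = tokens.pop(0)
    if tokens:
        parsed["last"] = tokens.pop()
    # Whatever remains between first and last name is treated as middle name(s).
    parsed["middle"] = " ".join(tokens)
    return parsed

result = parse_name("Mr Adam S Smith Jr")
```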

… and Standardisation

IBM, Intl Bus Machines, I.B.M., ibm → IBM

VP Sales, Sales Director, V.P. Sales → VP Sales

Avenue, Ave, AVENUE, ave. → AVE

Ernst and Young, E&Y, Ernst Young → Ernst & Young
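A standardization scheme of this kind is essentially a lookup from known variants to one preferred value. The sketch below assumes a plain dictionary called `SCHEME`, built from the variants on this slide, and a helper `standardize`; both names are illustrative only.

```python
# Variant (lower-cased) -> standard value, taken from the examples on the slide.
SCHEME = {
    "ibm": "IBM", "intl bus machines": "IBM", "i.b.m.": "IBM",
    "vp sales": "VP Sales", "sales director": "VP Sales", "v.p. sales": "VP Sales",
    "avenue": "AVE", "ave": "AVE", "ave.": "AVE",
    "ernst and young": "Ernst & Young", "e&y": "Ernst & Young",
    "ernst young": "Ernst & Young",
}

def standardize(value, scheme=SCHEME):
    """Return the standard form of value, or value unchanged if it is unknown."""
    key = " ".join(value.lower().split())  # ignore case and extra whitespace
    return scheme.get(key, value)
```

Values that are not in the scheme pass through unchanged, so they can be reviewed and added to the scheme later.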

Linking and Matching: clustering and de-duplication

Customer             Count  Cluster
Ernst & Young        8      3
Ernst and Young      3      3
Ernest Young         3      3
Ernstein & Young     1      3
E & Y                1      3
Peter Hans-Joachim   1      4
Petra Hans-Joachim   1      4
Hans-Joachim         1      4

This is also used for applications like “house-holding”
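Clustering on a match code can be sketched as follows. The `match_code` function here is a deliberately simple stand-in (lower-case, strip punctuation, drop a hypothetical noise word); it is not the SAS match-code algorithm, and the sample list is an assumption for the example.

```python
import re
from collections import defaultdict

NOISE = {"and"}  # hypothetical noise-word list

def match_code(name):
    """Toy match code: lower-case alphanumeric tokens without noise words."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return "".join(t for t in tokens if t not in NOISE)

def cluster(records):
    """Group records that share a match code and number the clusters from 1."""
    groups = defaultdict(list)
    for rec in records:
        groups[match_code(rec)].append(rec)
    return {cid: members for cid, members in enumerate(groups.values(), start=1)}

customers = ["Ernst & Young", "Ernst and Young", "Peter Hans-Joachim",
             "ernst young", "Petra Hans-Joachim"]
clusters = cluster(customers)
```

Because this toy code matches only on an exact key, it keeps Peter and Petra Hans-Joachim apart, whereas the slide places them in the same cluster; that difference is what fuzzy match definitions are meant to cover.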

Direct Marketing - House-holding: clustering

Customer           Address                   Family    Entity Type  Household
Brittany Huff      3144 Huelani Dr           Huff      Ind          1
Kraus, Petra       Schlossallee 9            Kraus     Ind          212
Peter Kraus        Schlossallee 9            Kraus     Ind          212
Kuester, Petra     Am Neunen Rhienhafen 10   Kuester   Ind          414
Dietrich Kuester   Am Neunen Rhienhafen 10   Kuester   Ind          414
Kuester, Kirstin   Am Neunen Rhienhafen 10   Kuester   Ind          414
Ernst and Young    100 1st Avenue                      Bus          593
Ernst & Young      100 First Avenue                    Bus          593

Duplicates and ROI for CRM

Size of   Number of   Percentage   Annual Marketing   Annual Marketing   Annual Savings   Annual Savings
Cluster   Records     of Total     Cost               Cost (Reduced)     (sample)         (scaled)
1         149,259     72.90%       $965,407           $965,407           $0               $0
2         34,598      16.90%       $223,780           $111,890           $111,890         $634,287
3         11,538      5.60%        $74,628            $24,876            $49,752          $282,036
4         4,128       2.00%        $26,700            $6,675             $20,025          $113,518
5         1,735       0.80%        $11,222            $2,244             $8,978           $50,893
6         798         0.40%        $5,161             $860               $4,301           $24,383
7         455         0.20%        $2,943             $420               $2,523           $14,300
Total     202,511     98.80%       $1,309,841         $1,112,373         $197,468         $1,119,417
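The reduced-cost and sample-savings columns appear consistent with mailing once per cluster instead of once per record, that is, dividing each row's cost by its cluster size. That reading is an inference from the figures, not something the slide states. Under that assumption the sample savings can be recomputed as follows:

```python
# (cluster size, annual marketing cost in dollars), taken from the table.
rows = [(1, 965_407), (2, 223_780), (3, 74_628), (4, 26_700),
        (5, 11_222), (6, 5_161), (7, 2_943)]

def savings(cluster_size, cost):
    """Saving when a cluster of duplicates is contacted once instead of cluster_size times."""
    reduced_cost = cost / cluster_size
    return cost - reduced_cost

# Sum the unrounded savings, then round once at the end.
total_sample_savings = sum(savings(size, cost) for size, cost in rows)
```

Rounded to whole dollars, the per-row results and the total agree with the "Annual Savings (sample)" column.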

Data Augmentation

Adding value from reference data

Can be prototyped in df Architect

Join on match codes

Customer        Address                  CASS_ZIP  CASS_ZIP+4  Lat      Long
Frank Johns     239 N Edgeworth          27401     2217        36.0748  -79.796
Graham, A.      4305 Central Avenue      28205     5659        35.2134  -80.7791
Dale R Aagand   1515 Lord Ashley Drive   27330     9368        35.4097  -79.2821
Leola Asberg    161 Northfork Road       29429     5557        34.3586  -77.8942
Walter Asberg   161 Norfork Road         29429     5557        34.3586  -77.8942

Integrating business mergers

• Gas: 2520 customers; remove duplicates (91 records); unique gas customers: 2429
• Electricity: 3622 customers; remove duplicates (630 records); unique electricity customers: 2992
• All customers: 5421
• Single customer view: 3577

Name: MR F N CARRERAS

Address: 65 BINGLEY AVE BROWNHILLS WALSALL WEST MIDLANDS WS8 7JL

Agenda

Scope of SAS Data Quality

Key features

Integration with ETL

Performance benchmarks

Deterministic Matching: business benefits

Experience based

Customizable

Fast performance

Modest hardware requirements

Consistent results

Auditable matching

Fuzzy matching with sensitivity tuning

Phonetics
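To give a feel for what sensitivity tuning on a phonetic key means, here is a toy sketch. It is not the SAS algorithm: the consonant-skeleton key, the function name `phonetic_key` and the mapping from sensitivity to key length are all assumptions made for illustration.

```python
def phonetic_key(word, sensitivity=85):
    """Toy phonetic key: first letter plus following consonants, cut by sensitivity."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    skeleton = letters[0]
    for c in letters[1:]:
        if c in "AEIOUYHW":       # ignore vowels and weak consonants
            continue
        if c != skeleton[-1]:     # do not repeat the last kept letter
            skeleton += c
    # Hypothetical scale: a higher sensitivity keeps more of the key,
    # so fewer names are treated as the same.
    length = max(1, sensitivity // 20)
    return skeleton[:length]
```

With this sketch, "Mayer" and "Meier" share a key at any sensitivity, while "Smith" and "Smart" share a key at sensitivity 50 but not at 95.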

Localization

Support for Latin1, Latin2 and Greek characters

Locales
• German, Germany (DEDEU)
• Italian, Italy (ITITA)
• English, US (ENUSA)
• English, UK (ENGBR)
• English, Australia (ENAUS)
• Portuguese, Brazil (PTBRA)
• Dutch, Netherlands (NLNLD)
• French, France (FRFRA)
• Spanish, Spain (ESESP) - BETA

Vocabulary Editor

Turkish organization noise words

Examples of matched Names (German Locale)

Christine,Krüger            Krueger,Christa
Eva Schneider               Eva-Maria Schneider
Johannes Mayer              Hans Meier
Andreas Wernicke            Andreas von Wernicke
von Kleist, Reinhold        Kleist,Reinhold von
Rühl Michael                Michael Rühl
......
Richter, Hildegard          Richter, Hilde
Wagner, Dr. Helmut          Dres.med.Helmut Wagner
Dipl.-Volksw. Rolf Bender   Rolf Bender
Guenter Moeller             Günther Möller
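Several of these pairs differ only in umlaut spelling, word order, titles or name particles. The sketch below builds a comparison key that removes exactly those differences. The `NOISE` list, the crude "th" to "t" folding and the function `name_key` are assumptions for this example; pairs such as Christine/Christa or Mayer/Meier need the fuzzier logic of a real match definition and are not covered.

```python
import re

# Fold German umlauts and sharp s to their ASCII transliterations.
UMLAUTS = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})
NOISE = {"dr", "dres", "med", "dipl", "volksw", "von"}  # hypothetical noise words

def name_key(name):
    """Order-independent key without titles, particles or umlaut variants."""
    folded = name.lower().translate(UMLAUTS).replace("th", "t")
    tokens = re.findall(r"[a-z]+", folded)
    return " ".join(sorted(t for t in tokens if t not in NOISE))
```

Two names are treated as a match in this sketch when their keys are equal, for example `name_key("Guenter Moeller") == name_key("Günther Möller")`.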

Examples of matched Addresses (German Locale)

......
an der Talsperre 1          a. d. Talsperre 1
Clausenfad 2                Klausenpfad 2
Momsenstr. 15               Mommsenstr. 15
Konrad Adenauer Allee 18    Konrad-Adenauer-Allee 18
Loewenstrasse 30 A          Löwenstr. 30a
Sankt Georgen Weg 8         St. Georgenweg 8
Ring Str. 18                Ringstr. 18
Rheinacker 29               Am Rheinacker 29
Sandstr. 42a/34-026         Sandstr. 42A
Bergstraße 36               Bergstr. 36

DataFlux profiling report

Profiling: analysis

Building a standardization scheme

SAS Data Quality Blue Fusion Architecture

[Architecture diagram. Stages: data analysis (rule discovery); build schemes (editors); execution (parse, match, link). Components shown: df Base, dfProfile, dfMatch, dfCustomize, Blue Fusion SDK, dfIntelliServer, SAS Data Quality Server, Blue Fusion. Usage labels: end user application, business modeling, real time, batch.]

Typical deployment architecture

[Deployment diagram. Labels shown: Network; Data Steward; Data steward team QKBs; SAS Integration Technologies; Data Quality Server; Development QKBs; Test QKBs; Production QKB; ODBC; Legacy; ERP; Web BI Server; Metadata Server; Data Warehouse; SAS Access; SAS Data Surveyors for ERP; ETL Server; COM, J2EE, WebSphere, Tibco; Data sources; SAS Load; IntelliServer.]

DataFlux business user functions

Data profiling

Design and test of standardization schemes

Build and refine localization rules

Refine and test match definitions and sensitivities

Merge standardization schemes

Manual editing

Manual review and selection of ‘surviving records’

New data types and data quality definitions

SAS DQ Server functions

Data access and movement

Staging area: multiple sources

Column transformations

First pass clustering and sampling of data for purification team

Applying standardization schemes

Applying matching rules

De-duplication

Merging manually cleansed data

Validation and augmenting with external data

Key features of SAS DQ Solution: Business user

Non technical Windows interface for rules database maintenance

Data profiling with graphical analysis of data patterns

Standardization scheme build and identification functions

Prototyping data quality with df Architect

Fuzzy matching; manual or automatic de-duplication

Customizable data types and rules definitions

SAP connector * DQ add-ins

Key features of SAS DQ Solution: IT user

Ability to read QKB created by business users

DQ functions integrated into column transformations

Threaded SQL and SORT and I/O; multi-platform deployment

Data augmentation and enhancement

ETL Studio option for graphical management of data integration

Address Validation

DfConnector for SAP to dfIntelliServer

Agenda

Scope of SAS Data Quality

Key features

Integration with ETL

Performance benchmarks

Integrated ETL and Data Quality

Garbage in, Garbage out

Data Quality is a significant issue for companies implementing warehousing-based solutions

Fail to address data quality issues and your project is more likely to fail.

Integrating data quality processing into the ETL workflow makes ETLQ an intelligent process.

Source: TDWI Report Series: Data Quality and the Bottom Line, Wayne Eckerson, The Data Warehousing Institute, 2002.

Integrated ETL and Data Quality

Integrated into the ETL process for ‘in-stream’ quality

Standardizes key corporate data such as customer, product, vendor and organisation codes

Identifies, links and cleanses related or duplicate entries in the customer, vendor, and product files

Downstream reporting and analysis are accurate, reliable and consistent

‘Super’ cluster logic

#  Name             Address    Phone  Mcode #1  Mcode #2  Mcode #3
1  John Smith       Oxford St  123    ~A$$$     +A$$$     {A$$$
2  Fred Jones       Regent St  456    ~B$$$     +B$$$     {B$$$
3  John Smith       Oxford St  124    ~A$$$     +A$$$     {C$$$
4  Frederick Jones  Regent St  457    ~B$$$     +B$$$     {D$$$
5  Alison Smith                123    ~C$$$     +C$$$     {A$$$
6  Kirsten Brown    Regent St  456    ~D$$$     +B$$$     {B$$$
7  Mr J. Smith      Flat 4     123    ~E$$$     +D$$$     {A$$$
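One way to read the ‘super’ cluster table is that records sharing any one of the three match codes belong together, and that such links chain transitively. Under that assumption, which the slide does not spell out, the clusters can be derived with a small union-find sketch; the function name `super_clusters` is made up for the example.

```python
# Record number -> (Mcode #1, Mcode #2, Mcode #3), taken from the table.
records = {
    1: ("~A$$$", "+A$$$", "{A$$$"),
    2: ("~B$$$", "+B$$$", "{B$$$"),
    3: ("~A$$$", "+A$$$", "{C$$$"),
    4: ("~B$$$", "+B$$$", "{D$$$"),
    5: ("~C$$$", "+C$$$", "{A$$$"),
    6: ("~D$$$", "+B$$$", "{B$$$"),
    7: ("~E$$$", "+D$$$", "{A$$$"),
}

def super_clusters(records):
    """Merge records that share at least one match code, transitively."""
    parent = {rid: rid for rid in records}

    def find(x):
        # Follow parent links to the cluster root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}  # match code -> first record that carried it
    for rid, codes in records.items():
        for code in codes:
            if code in first_seen:
                parent[find(rid)] = find(first_seen[code])
            else:
                first_seen[code] = rid

    groups = {}
    for rid in records:
        groups.setdefault(find(rid), []).append(rid)
    return sorted(groups.values())

clusters = super_clusters(records)
```

For the seven rows above this yields two super clusters: records 1, 3, 5 and 7 (linked by name, address or phone 123) and records 2, 4 and 6.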

Case study - Workflow Clustering

Customer database

Pass 1: name, family name and id

Pass 2: name, family name and dob

Pass 3: name, family name and tax id

Each pass produces clusters and non clustered records

Merge clusters and give to data stewards to build best record
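The workflow can be sketched as a cascade of passes. This sketch assumes that records left unclustered by one pass are fed into the next pass with a different key; the slide's diagram suggests this but does not state it. The field names and the sample records are hypothetical.

```python
from collections import defaultdict

PASSES = [
    ("name", "family_name", "id"),
    ("name", "family_name", "dob"),
    ("name", "family_name", "tax_id"),
]

def cluster_pass(records, key_fields):
    """Cluster records on key_fields; return (clusters, non clustered records)."""
    groups, leftovers = defaultdict(list), []
    for rec in records:
        key = tuple(rec.get(f) for f in key_fields)
        if None in key:
            leftovers.append(rec)       # a key field is missing in this pass
        else:
            groups[key].append(rec)
    clusters = []
    for members in groups.values():
        if len(members) > 1:
            clusters.append(members)
        else:
            leftovers.extend(members)   # a group of one is not a cluster
    return clusters, leftovers

def workflow(records):
    """Run all passes in order and collect clusters for the data stewards."""
    all_clusters, remaining = [], list(records)
    for fields in PASSES:
        clusters, remaining = cluster_pass(remaining, fields)
        all_clusters.extend(clusters)
    return all_clusters, remaining

# Hypothetical sample records.
r1 = {"name": "Petra", "family_name": "Kraus", "id": "7", "dob": "1970-01-01"}
r2 = {"name": "Petra", "family_name": "Kraus", "id": "7"}
r3 = {"name": "Peter", "family_name": "Kraus", "dob": "1965-05-05"}
r4 = {"name": "Peter", "family_name": "Kraus", "dob": "1965-05-05", "tax_id": "T1"}
r5 = {"name": "Hans", "family_name": "Meier", "tax_id": "T9"}
found, unclustered = workflow([r1, r2, r3, r4, r5])
```

Here pass 1 clusters r1 with r2, pass 2 clusters r3 with r4, and r5 remains unclustered after all three passes.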

Address validation

[Flow diagram: external data and internal data are each given a match-code; left join on match-code; retained data; best record processing.]

Address validation

External reference data: xaddr1,,,MA, xaddr2,,,MB, xaddr3,,,MC, xaddr4,,,MD, etc

Internal data: iaddr5,,,MH, iaddr6,,,MJ, iaddr7,,,MK, iaddr8,,,ML, etc

Records sharing a match-code: xaddr5,,,ME, iaddr9,,,ME, xaddr6,,,MF, iaddr10,,,MF, xaddr7,,,MG, iaddr11,,,MG

Legend: x = external, i = internal, M = match-code
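The left join on match-code can be sketched with the placeholder records from this slide. Every internal record is retained; those whose match-code also occurs in the external reference data pick up the external address. The variable and function names are assumptions for the example.

```python
# Match-code -> external reference address (placeholders from the slide).
external = {"MA": "xaddr1", "MB": "xaddr2", "MC": "xaddr3", "MD": "xaddr4",
            "ME": "xaddr5", "MF": "xaddr6", "MG": "xaddr7"}

# (internal address, match-code)
internal = [("iaddr5", "MH"), ("iaddr6", "MJ"), ("iaddr7", "MK"), ("iaddr8", "ML"),
            ("iaddr9", "ME"), ("iaddr10", "MF"), ("iaddr11", "MG")]

def left_join(internal, external):
    """Keep every internal record; attach the external address when the code matches."""
    return [(addr, code, external.get(code)) for addr, code in internal]

joined = left_join(internal, external)
validated = [row for row in joined if row[2] is not None]
unmatched = [row for row in joined if row[2] is None]
```

The `validated` rows can then go on to best record processing, while the `unmatched` rows remain candidates for manual review.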

Integrating Data Quality into ETL

Agenda

Scope of SAS Data Quality

Key features

Integration with ETL

Performance benchmarks

Data Quality Server matching; 3 columns

[Chart: Windows NT Server. Real time (HMS) against observations (millions, 0 to 4) for three series: Win NT1, Win NT2, Win NT3. Data labels in extraction order: 15:43.0, 24:51.7, 35:58.0, 20:17.9, 33:15.3, 44:41.3, 03:25.0, 07:02.5, 10:43.0, 17:13.0, 07:09.0, 09:10.9]

[Chart: Sun Solaris. Real time (HMS) against observations (millions, 0 to 4) for three series: Sun Solaris1, Sun Solaris2, Sun Solaris3. Data labels in extraction order: 0:14:46, 0:19:37, 0:49:04, 0:32:16, 0:48:16, 1:04:11, 0:09:54, 0:04:59, 0:12:16, 0:24:47, 0:36:54, 0:16:06]

[Chart: AIX. Real time (HMS) against observations (millions, 0 to 4) for three series: AIX1, AIX2, AIX3. Data labels in extraction order: 1:09:39, 0:56:21, 1:26:47, 1:55:54, 1:48:50, 0:17:08, 0:33:34, 0:50:15, 0:28:42, 1:12:23, 0:36:00, 2:25:43]

Hardware

Hardware: Manufacturer SUN; Model SUN-Fire-480R; Processors 4 x 900 MHz; Hard disks 14 x 35 GB; RAM 8 GB

Software: OS SUN OS 5.9; SAS 9.1

Data: Type SAS Data Set; Variables 9; Records 1, 2, 3, 4 million

Hardware: Manufacturer DELL; Model PowerEdge 2600; Type dual-processor; Processors 2 x XEON 2.4 GHz; Hard disks 4 x 17.5 GB; RAM 2 GB

Software: OS WIN NT Server 32 bit; SAS 9.1

Data: Type SAS Data Set; Variables 9; Records 1, 2, 3, 4 million

Ack. Martin Neudecker, SAS Germany

SAS9 scalable architecture

SAS/ACCESS threaded reads

Data Surveyors for ERP

Threaded procedures

Threaded Open Metadata Server

Threaded distributed technology

SAS/CONNECT using MP CONNECT

Thank you for your time
