ETL and Data Quality
Copyright © 2004, SAS Institute Inc. All rights reserved.
Agenda
Scope of SAS Data Quality
Key features
Integration with ETL
Performance benchmarks
Scope of Data Quality
Data Profiling
Parsing and Standardization
Linking and Matching
Data Augmentation and Integration
Real-Time Data Quality
Address Validation
Data Quality Methodology
Data Integration
Data Profiling
Parsing and Standardization
Data Linking
Data Profiling: scoping the benefits
“Data profiling is the process of systematically scanning and analyzing the contents of all the columns in tables of interest.”
• Identify data defects
• Table column analysis
  − Frequency distribution
  − Min/max/outlier detection
  − Data type analysis
  − Unique/null analysis
• Metadata analysis
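A minimal sketch of the table column analysis listed above, in plain Python. The function name `profile_column` and the sample values are illustrative assumptions, not part of any SAS or DataFlux product; outlier detection is left out, and min/max assume the column holds values of one comparable type.

```python
from collections import Counter

def profile_column(values):
    """Summarise one column: null/unique counts, min/max, types, top frequencies."""
    # Treat None and empty strings as nulls (an assumption for this sketch).
    non_null = [v for v in values if v is not None and v != ""]
    return {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "unique": len(set(non_null)),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "types": Counter(type(v).__name__ for v in non_null),
        "top_values": Counter(non_null).most_common(3),
    }

# Hypothetical ZIP code column with one duplicate and two nulls.
zip_codes = ["27401", "28205", "27330", "29429", "29429", None, ""]
profile = profile_column(zip_codes)
print(profile)
```

Running the same function over every column of a table gives a first view of the defects a profiling pass is meant to surface.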
Parsing …
Address data
Mr Adam S Smith Jr
Name Prefix: Mr
First Name: Adam
Middle Name: S
Last Name: Smith
Name Suffix: Jr
Loan data
15 yr. Fixed, 6.5 percent, 2% orig + 1 point
Term: 15 years
Rate: 6.5%
Points: 1%
Origination Fee: 2%
Type: Fixed
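The name example above can be sketched as a small token-based parser. This is an illustration only: the prefix and suffix lists and the function `parse_name` are assumptions made for the sketch, and a production parser would rely on a much richer rules base.

```python
# Hypothetical vocabularies; a real rules base would be far larger.
PREFIXES = {"mr", "mrs", "ms", "dr"}
SUFFIXES = {"jr", "sr", "ii", "iii"}

def parse_name(full_name):
    """Split a name into prefix, first, middle, last and suffix tokens."""
    tokens = full_name.replace(".", "").split()
    parsed = {"prefix": "", "first": "", "middle": "", "last": "", "suffix": ""}
    if tokens and tokens[0].lower() in PREFIXES:
        parsed["prefix"] = tokens.pop(0)
    if tokens and tokens[-1].lower() in SUFFIXES:
        parsed["suffix"] = tokens.pop()
    if tokens:
        parsed["first"] = tokens.pop(0)
    if tokens:
        parsed["last"] = tokens.pop()
    # Whatever remains between first and last name is treated as middle name(s).
    parsed["middle"] = " ".join(tokens)
    return parsed

result = parse_name("Mr Adam S Smith Jr")
print(result)
```

The sketch assumes Western name order; a single-token input ends up in the first-name field, which a real parser would handle differently.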
… and Standardisation
IBM, Intl Bus Machines, I.B.M., ibm → IBM
VP Sales, Sales Director, V.P. Sales → VP Sales
Avenue, Ave, AVENUE, ave. → AVE
Ernst and Young, E&Y, Ernst Young → Ernst & Young
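A standardisation scheme of this kind can be sketched as a lookup table from known variants to one standard form. The dictionary `SCHEME` and the function `standardize` below are assumptions for illustration; they reuse only the variants shown on the slide.

```python
# Variant -> standard form, keyed on lower-cased, whitespace-normalised text.
SCHEME = {
    "ibm": "IBM", "intl bus machines": "IBM", "i.b.m.": "IBM",
    "vp sales": "VP Sales", "sales director": "VP Sales", "v.p. sales": "VP Sales",
    "avenue": "AVE", "ave": "AVE", "ave.": "AVE",
    "ernst and young": "Ernst & Young", "e&y": "Ernst & Young",
    "ernst young": "Ernst & Young",
}

def standardize(value, scheme=SCHEME):
    """Return the standard form of a value, or the value unchanged if unknown."""
    key = " ".join(value.lower().split())
    return scheme.get(key, value)

for raw in ["I.B.M.", "AVENUE", "Ernst Young", "Sales Director"]:
    print(raw, "->", standardize(raw))
```

Unknown values pass through untouched, so the scheme can be built up incrementally as new variants are found during profiling.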
Linking and Matching: clustering and de-duplication
Customer             Count   Cluster
Ernst & Young        8       3
Ernst and Young      3       3
Ernest Young         3       3
Ernstein & Young     1       3
E & Y                1       3
Peter Hans-Joachim   1       4
Petra Hans-Joachim   1       4
Hans-Joachim         1       4
This is also used for applications like “house-holding”
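Clustering of this kind can be sketched by reducing each name to a match code and grouping records that share a code. The code below is a toy illustration and not the product's match-code algorithm: `match_code`, `cluster`, the noise-word list and the consonant-skeleton rule are all assumptions. The `sensitivity` parameter shows the idea of tuning: a lower value keeps fewer characters per word and therefore matches more loosely.

```python
import re
from collections import defaultdict

NOISE_WORDS = {"and"}  # hypothetical noise-word list

def match_code(name, sensitivity=4):
    """Toy match code: consonant skeleton of each word, cut to `sensitivity` chars."""
    parts = []
    for word in re.findall(r"[a-z]+", name.lower()):
        if word in NOISE_WORDS:
            continue
        skeleton = re.sub(r"[aeiou]", "", word)
        if skeleton:
            parts.append(skeleton[:sensitivity].upper())
    return " ".join(parts)

def cluster(names, sensitivity=4):
    """Group names that produce the same match code."""
    clusters = defaultdict(list)
    for name in names:
        clusters[match_code(name, sensitivity)].append(name)
    return dict(clusters)

names = ["Ernst & Young", "Ernst and Young", "Ernest Young",
         "Ernstein & Young", "E & Y"]
loose = cluster(names, sensitivity=4)
strict = cluster(names, sensitivity=5)
print(loose)
print(strict)
```

At sensitivity 4 the first four spellings share one code; at sensitivity 5 "Ernstein & Young" drops out. The abbreviation "E & Y" is not caught by this rule at all and would need a standardisation scheme entry first.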
Direct Marketing - House-holding clustering
Customer           Address                   Family    Household   Entity Type
Brittany Huff      3144 Huelani Dr           Huff      1           Ind
Kraus, Petra       Schlossallee 9            Kraus     212         Ind
Peter Kraus        Schlossallee 9            Kraus     212         Ind
Kuester, Petra     Am Neunen Rhienhafen 10   Kuester   414         Ind
Dietrich Kuester   Am Neunen Rhienhafen 10   Kuester   414         Ind
Kuester, Kirstin   Am Neunen Rhienhafen 10   Kuester   414         Ind
Ernst and Young    100 1st Avenue                      593         Bus
Ernst & Young      100 First Avenue                    593         Bus
Duplicates and ROI for CRM
Size of   Number of   Percentage   Annual Marketing   Annual Marketing   Annual Savings   Annual Savings
Cluster   Records     of Total     Cost               Cost (Reduced)     (sample)         (scaled)
1         149,259     72.90%       $965,407           $965,407           $0               $0
2         34,598      16.90%       $223,780           $111,890           $111,890         $634,287
3         11,538      5.60%        $74,628            $24,876            $49,752          $282,036
4         4,128       2.00%        $26,700            $6,675             $20,025          $113,518
5         1,735       0.80%        $11,222            $2,244             $8,978           $50,893
6         798         0.40%        $5,161             $860               $4,301           $24,383
7         455         0.20%        $2,943             $420               $2,523           $14,300
Total     202,511     98.80%       $1,309,841         $1,112,373         $197,468         $1,119,417
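The figures in the table are consistent with a simple rule: once a cluster of n duplicate records is collapsed to one record, its marketing cost falls to 1/n of the original. The sketch below reproduces the reduced-cost and sample-savings columns under that assumption; the function name `dedup_effect` is illustrative. The scaled column is not reproduced because the slide does not state the scaling factor.

```python
def dedup_effect(cluster_size, annual_cost):
    """Return (reduced cost, savings) when each cluster collapses to one record."""
    reduced_cost = annual_cost / cluster_size
    return reduced_cost, annual_cost - reduced_cost

# (size of cluster, annual marketing cost) taken from the table above.
rows = [(1, 965407), (2, 223780), (3, 74628), (4, 26700),
        (5, 11222), (6, 5161), (7, 2943)]

effects = [dedup_effect(size, cost) for size, cost in rows]
total_reduced = sum(reduced for reduced, _ in effects)
total_savings = sum(saving for _, saving in effects)

for (size, cost), (reduced, saving) in zip(rows, effects):
    print(f"{size}  ${cost:,}  ${reduced:,.0f}  ${saving:,.0f}")
print(f"Total  ${total_reduced:,.0f}  ${total_savings:,.0f}")
```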
Data Augmentation
Adding value from reference data
Can be prototyped in df Architect
Join on match codes
Customer        Address                  CASS_ZIP   CASS_ZIP+4   Lat       Long
Frank Johns     239 N Edgeworth          27401      2217         36.0748   -79.796
Graham, A.      4305 Central Avenue      28205      5659         35.2134   -80.7791
Dale R Aagand   1515 Lord Ashley Drive   27330      9368         35.4097   -79.2821
Leola Asberg    161 Northfork Road       29429      5557         34.3586   -77.8942
Walter Asberg   161 Norfork Road         29429      5557         34.3586   -77.8942
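The join on match codes can be sketched as a left join: every customer record is kept, and reference columns are filled in where a match code is found. The function `augment`, the field names and the match-code values "MA" and "MB" are placeholders assumed for this sketch (match codes are taken as already generated upstream); the last customer is a made-up record included only to show the unmatched case.

```python
def augment(customers, reference):
    """Left join on match code; reference columns are None when nothing matches."""
    ref_by_code = {row["match_code"]: row for row in reference}
    augmented = []
    for customer in customers:
        ref = ref_by_code.get(customer["match_code"], {})
        augmented.append({
            **customer,
            "cass_zip": ref.get("cass_zip"),
            "lat": ref.get("lat"),
            "long": ref.get("long"),
        })
    return augmented

reference = [
    {"match_code": "MA", "cass_zip": "29429", "lat": 34.3586, "long": -77.8942},
    {"match_code": "MB", "cass_zip": "27401", "lat": 36.0748, "long": -79.796},
]
customers = [
    {"customer": "Leola Asberg", "address": "161 Northfork Road", "match_code": "MA"},
    {"customer": "Walter Asberg", "address": "161 Norfork Road", "match_code": "MA"},
    {"customer": "Frank Johns", "address": "239 N Edgeworth", "match_code": "MB"},
    {"customer": "Hypothetical Customer", "address": "1 Unknown Street", "match_code": "MZ"},
]

augmented = augment(customers, reference)
for row in augmented:
    print(row)
```

Because the two Asberg addresses share a match code despite the spelling difference, both pick up the same ZIP and coordinates.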
Integrating business mergers
Gas: 2520 customers; remove duplicates (91 records); 2429 unique gas customers
Electricity: 3622 customers; remove duplicates (630 records); 2992 unique electricity customers
All customers: 5421
Single customer view: 3577
Example record
Name: MR F N CARRERAS
Address: 65 BINGLEY AVE BROWNHILLS WALSALL WEST MIDLANDS WS8 7JL
Agenda
Scope of SAS Data Quality
Key features
Integration with ETL
Performance benchmarks
Deterministic Matching: business benefits
Experience based
Customizable
Fast performance
Modest hardware requirements
Consistent results
Auditable matching
Fuzzy matching with sensitivity tuning
Phonetics
Localization
Support for Latin1, Latin2 and Greek characters
Locales
• German, Germany (DEDEU)
• Italian, Italy (ITITA)
• English, US (ENUSA)
• English, UK (ENGBR)
• English, Australia (ENAUS)
• Portuguese, Brazil (PTBRA)
• Dutch, Netherlands (NLNLD)
• French, France (FRFRA)
• Spanish, Spain (ESESP) - BETA
Examples of matched Names (German Locale)
Christine,Krüger            Krueger,Christa
Eva Schneider               Eva-Maria Schneider
Johannes Mayer              Hans Meier
Andreas Wernicke            Andreas von Wernicke
von Kleist, Reinhold        Kleist,Reinhold von
Rühl Michael                Michael Rühl
......
Richter, Hildegard          Richter, Hilde
Wagner, Dr. Helmut          Dres.med.Helmut Wagner
Dipl.-Volksw. Rolf Bender   Rolf Bender
Guenter Moeller             Günther Möller
Examples of matched Addresses (German Locale)
......
an der Talsperre 1         a. d. Talsperre 1
Clausenfad 2               Klausenpfad 2
Momsenstr. 15              Mommsenstr. 15
Konrad Adenauer Allee 18   Konrad-Adenauer-Allee 18
Loewenstrasse 30 A         Löwenstr. 30a
Sankt Georgen Weg 8        St. Georgenweg 8
Ring Str. 18               Ringstr. 18
Rheinacker 29              Am Rheinacker 29
Sandstr. 42a/34-026        Sandstr. 42A
Bergstraße 36              Bergstr. 36
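Some of the address pairs above differ only in umlaut spelling, abbreviation, spacing or punctuation, and those can be reconciled with simple locale-specific normalisation. The sketch below is an assumption-laden illustration (`address_key` and its three rules are not the product's German locale definition) and deliberately covers only the easy cases.

```python
import re

# Transliterate German special characters to their ASCII spellings.
UMLAUTS = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})

def address_key(address):
    """Toy comparison key for German street addresses."""
    text = address.lower().translate(UMLAUTS)
    text = re.sub(r"strasse\b", "str", text)   # Bergstrasse -> Bergstr
    text = re.sub(r"\bsankt\b", "st", text)    # Sankt -> St
    return re.sub(r"[^a-z0-9]", "", text)      # drop spaces and punctuation

pairs = [
    ("Konrad Adenauer Allee 18", "Konrad-Adenauer-Allee 18"),
    ("Loewenstrasse 30 A", "Löwenstr. 30a"),
    ("Sankt Georgen Weg 8", "St. Georgenweg 8"),
    ("Ring Str. 18", "Ringstr. 18"),
    ("Bergstraße 36", "Bergstr. 36"),
]
for left, right in pairs:
    print(left, "|", right, "|", address_key(left) == address_key(right))
```

Pairs such as Clausenfad/Klausenpfad or Momsenstr./Mommsenstr. are not matched by these rules; they need the phonetic and fuzzy matching mentioned earlier in the deck.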
SAS Data Quality Blue Fusion Architecture
[Diagram. Stages: data analysis (rule discovery); build schemes (editors); execution (parse, match, link). Components shown: df Base, dfProfile, dfMatch, dfCustomize, Blue Fusion SDK, dfIntelliServer, SAS Data Quality Server, end-user applications, business modeling, Blue Fusion. Modes shown: real time, batch.]
Typical deployment architecture
[Diagram. Components shown: data sources (ODBC, Legacy, ERP, Web); SAS Access; SAS Data Surveyors for ERP; ETL Server; SAS Load; Data Warehouse; BI Server; Metadata Server; Data Quality Server with Development, Test and Production QKBs; data stewards and data steward team QKBs; SAS Integration Technologies; IntelliServer; COM, J2EE, WebSphere, Tibco; all connected over the network.]
DataFlux business user functions
Data profiling
Design and test of standardization schemes
Build and refine localization rules
Refine and test match definitions and sensitivities
Merge standardization schemes
Manual editing
Manual review and selection of ‘surviving records’
New data types and data quality definitions
SAS DQ Server functions
Data access and movement
Staging area: multiple sources
Column transformations
First pass clustering and sampling of data for purification team
Applying standardization schemes
Applying matching rules
De-duplication
Merging manually cleansed data
Validation and augmenting with external data
Key features of SAS DQ Solution
Business user
Non-technical Windows interface for rules database maintenance
Data profiling with graphical analysis of data patterns
Standardization scheme build and identification functions
Prototyping data quality with df Architect
Fuzzy matching; manual or automatic de-duplication
Customizable data types and rules definitions
SAP connector * DQ add-ins
Key features of SAS DQ Solution
IT user
Ability to read QKB created by business users
DQ functions integrated into column transformations
Threaded SQL and SORT and I/O; multi-platform deployment
Data augmentation and enhancement
ETL Studio option for graphical management of data integration
Address Validation
DfConnector for SAP to dfIntelliServer
Agenda
Scope of SAS Data Quality
Key features
Integration with ETL
Performance benchmarks
Integrated ETL and Data Quality
Garbage in, Garbage out
Data Quality is a significant issue for companies implementing warehousing-based solutions
Fail to address data quality issues and your project is more likely to fail.
Integrating data quality processing into the ETL workflow makes ETLQ an intelligent process.
Source: TDWI Report Series: Data Quality and the Bottom Line, Wayne Eckerson, The Data Warehousing Institute, 2002.
Integrated ETL and Data Quality
Integrated into the ETL process for ‘in-stream’ quality
Standardizes key corporate data such as customer, product, vendor and organisation codes
Identifies, links and cleanses related or duplicate entries in the customer, vendor, and product files
Downstream reporting and analysis are accurate, reliable and consistent
‘Super’ cluster logic
#   Name              Address     Phone   Mcode #1   Mcode #2   Mcode #3
1   John Smith        Oxford St   123     ~A$$$      +A$$$      {A$$$
2   Fred Jones        Regent St   456     ~B$$$      +B$$$      {B$$$
3   John Smith        Oxford St   124     ~A$$$      +A$$$      {C$$$
4   Frederick Jones   Regent St   457     ~B$$$      +B$$$      {D$$$
5   Alison Smith                  123     ~C$$$      +C$$$      {A$$$
6   Kirsten Brown     Regent St   456     ~D$$$      +B$$$      {B$$$
7   Mr J. Smith       Flat 4      123     ~E$$$      +D$$$      {A$$$
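One way to read the table above is that records belong to the same 'super' cluster when they are linked, directly or through other records, by any shared match code. Under that assumption the grouping can be sketched with a union-find structure; the function `super_clusters` is illustrative and not taken from the product.

```python
def super_clusters(records):
    """Group record ids that are transitively linked by any shared match code."""
    parent = {rec_id: rec_id for rec_id in records}

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[max(root_a, root_b)] = min(root_a, root_b)

    first_seen = {}
    for rec_id, codes in records.items():
        for code in codes:
            if code in first_seen:
                union(rec_id, first_seen[code])
            else:
                first_seen[code] = rec_id

    clusters = {}
    for rec_id in records:
        clusters.setdefault(find(rec_id), []).append(rec_id)
    return sorted(clusters.values())

# Record number -> (Mcode #1, Mcode #2, Mcode #3) from the table above.
records = {
    1: ("~A$$$", "+A$$$", "{A$$$"),
    2: ("~B$$$", "+B$$$", "{B$$$"),
    3: ("~A$$$", "+A$$$", "{C$$$"),
    4: ("~B$$$", "+B$$$", "{D$$$"),
    5: ("~C$$$", "+C$$$", "{A$$$"),
    6: ("~D$$$", "+B$$$", "{B$$$"),
    7: ("~E$$$", "+D$$$", "{A$$$"),
}
result = super_clusters(records)
print(result)
```

With the slide's data this yields two super clusters: records 1, 3, 5 and 7 (linked through the first and third match codes) and records 2, 4 and 6.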
Case study - Workflow Clustering
Customer database
Pass 1: name, family name and id → clusters; non-clustered records go to pass 2
Pass 2: name, family name and dob → clusters; non-clustered records go to pass 3
Pass 3: name, family name and tax id → clusters; non-clustered records remain
Merge clusters and give to data stewards to build best record
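The three-pass workflow can be sketched as a cascade in which each pass clusters on its own key and hands only the leftover records to the next pass. This follows one reading of the slide's diagram; the function `run_passes`, the field names and the sample records are all hypothetical, and exact key equality stands in for real match codes.

```python
from collections import defaultdict

def run_passes(records, passes):
    """Cluster in successive passes; unclustered records flow to the next pass."""
    clusters = []
    remaining = list(records)
    for key_fields in passes:
        groups = defaultdict(list)
        leftover = []
        for rec in remaining:
            key = tuple(rec[field] for field in key_fields)
            if None in key:
                leftover.append(rec)   # a missing value cannot match in this pass
            else:
                groups[key].append(rec)
        for members in groups.values():
            if len(members) > 1:
                clusters.append(members)
            else:
                leftover.extend(members)
        remaining = leftover
    return clusters, remaining

passes = [("name", "family_name", "id"),
          ("name", "family_name", "dob"),
          ("name", "family_name", "tax_id")]

# Hypothetical customer records.
records = [
    {"rec": 1, "name": "Peter", "family_name": "Kraus", "id": 10, "dob": "1970-01-01", "tax_id": "T1"},
    {"rec": 2, "name": "Peter", "family_name": "Kraus", "id": 10, "dob": None, "tax_id": None},
    {"rec": 3, "name": "Petra", "family_name": "Kuester", "id": 20, "dob": "1980-05-05", "tax_id": "T2"},
    {"rec": 4, "name": "Petra", "family_name": "Kuester", "id": 21, "dob": "1980-05-05", "tax_id": None},
    {"rec": 5, "name": "Dietrich", "family_name": "Kuester", "id": 30, "dob": None, "tax_id": "T3"},
    {"rec": 6, "name": "Dietrich", "family_name": "Kuester", "id": 31, "dob": None, "tax_id": "T3"},
    {"rec": 7, "name": "Brittany", "family_name": "Huff", "id": 40, "dob": "1990-09-09", "tax_id": "T4"},
]

clusters, unclustered = run_passes(records, passes)
cluster_ids = [[rec["rec"] for rec in members] for members in clusters]
print(cluster_ids, [rec["rec"] for rec in unclustered])
```

The merged clusters would then go to the data stewards for best-record selection, while the unclustered remainder is treated as unique.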
Address validation
External data: generate match-code
Internal data: generate match-code
Left join of external data and internal data on match-code
Retained data
Best record processing
Address validation
External reference data: xaddr1,,,MA, xaddr2,,,MB, xaddr3,,,MC, xaddr4,,,MD, etc
Internal data: iaddr5,,,MH, iaddr6,,,MJ, iaddr7,,,MK, iaddr8,,,ML, etc
Overlap (shared match-codes): xaddr5,,,ME, iaddr9,,,ME, xaddr6,,,MF, iaddr10,,,MF, xaddr7,,,MG, iaddr11,,,MG
x = external, i = internal, M = match-code
Agenda
Scope of SAS Data Quality
Key features
Integration with ETL
Performance benchmarks
Windows NT Server
[Chart: real time (H:M:S) against observations (millions, 0 to 4); series Win NT1, Win NT2, Win NT3. Data labels as extracted, series assignment not recoverable: 15:43.0, 24:51.7, 35:58.0, 20:17.9, 33:15.3, 44:41.3, 03:25.0, 07:02.5, 10:43.0, 17:13.0, 07:09.0, 09:10.9.]
Sun Solaris
[Chart: real time (H:M:S) against observations (millions, 0 to 4); series Sun Solaris1, Sun Solaris2, Sun Solaris3. Data labels as extracted, series assignment not recoverable: 0:14:46, 0:19:37, 0:49:04, 0:32:16, 0:48:16, 1:04:11, 0:09:54, 0:04:59, 0:12:16, 0:24:47, 0:36:54, 0:16:06.]
AIX
[Chart: real time (H:M:S) against observations (millions, 0 to 4); series AIX1, AIX2, AIX3. Data labels as extracted, series assignment not recoverable: 1:09:39, 0:56:21, 1:26:47, 1:55:54, 1:48:50, 0:17:08, 0:33:34, 0:50:15, 0:28:42, 1:12:23, 0:36:00, 2:25:43.]
Hardware
Hardware: Manufacturer SUN; Model SUN-Fire-480R; Processors 4 x 900 MHz; Hard disks 14 x 35 GB; RAM 8 GB
Software: OS SUN OS 5.9; SAS 9.1
Data: Type SAS Data Set; Variables 9; Records (DS) 1, 2, 3, 4 million

Hardware: Manufacturer DELL; Model PowerEdge 2600; Type dual-processor; Processors 2 x XEON 2.4 GHz; Hard disks 4 x 17.5 GB; RAM 2 GB
Software: OS WIN NT Server 32 bit; SAS 9.1
Data: Type SAS Data Set; Variables 9; Records (DS) 1, 2, 3, 4 million

Acknowledgement: Martin Neudecker, SAS Germany
SAS9 scalable architecture
SAS/ACCESS threaded reads
Data Surveyors for ERP
Threaded procedures
Threaded Open Metadata Server
Threaded distributed technology
SAS/CONNECT using MP CONNECT