© Ecclesiastical Insurance Office plc 2011 Cleansing/Matching Data Using the SAS Data Quality Toolset Nigel Light, Data Governance and Quality Analyst Ecclesiastical Insurance

Page 1: Cleansing/Matching Data Using the SAS Data Quality Toolset

Cleansing/Matching Data Using the SAS Data Quality Toolset

Nigel Light, Data Governance and Quality Analyst, Ecclesiastical Insurance

Page 2: Cleansing/Matching Data Using the SAS Data Quality Toolset

Ecclesiastical Insurance – niches where we conduct business

• Charitable Insurer with over 125 years of experience (aim of giving £50M to charity in next 3 years)

• Insure a diverse mix of organisations and risks aligned to our core values

Page 3: Cleansing/Matching Data Using the SAS Data Quality Toolset

Data Quality tools – why buy one?

Data Quality doesn’t need a toolset.

Without one, using standard database/spreadsheet software, you can:

• Match and deduplicate data
• Write code to
  • Profile data
  • Reformat data
  • Identify where data does not match expected patterns or values

But, with a data quality toolset, this is all generally available ‘out of the box’ (and there are often additional features too)

Page 4: Cleansing/Matching Data Using the SAS Data Quality Toolset

Cleansing data using SAS Dataflux – the basics

Things that a Data Quality tool, such as SAS, can do….

• Parsing
  • ie Breaking out a string of data into its standard elements

Eg Address string “123 Brookside Close, Henleze, Bristol, BS2 7BJ”

Can be ‘parsed’ into individual data elements:

House Number : 123
Street : Brookside Close
Address line 2 : Henleze
City : Bristol
Postcode : BS2 7BJ
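The parsing step above can be sketched in a few lines of Python. This is only an illustration of the idea, not how the SAS tool parses addresses: the function name `parse_address`, the field names and the assumption that elements are comma-separated with the postcode last are all hypothetical.

```python
import re

# Rough UK postcode shape, used only to spot the postcode element in this sketch.
POSTCODE_RE = re.compile(r"^[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}$", re.IGNORECASE)


def parse_address(address: str) -> dict:
    """Split a comma-separated UK address into named elements (illustrative only)."""
    parts = [p.strip() for p in address.split(",") if p.strip()]
    parsed = {"house_number": None, "street": None,
              "address_line_2": None, "city": None, "postcode": None}

    # Take the postcode from the end if the last element looks like one.
    if parts and POSTCODE_RE.match(parts[-1]):
        parsed["postcode"] = parts.pop().upper()

    # The first element is assumed to be "<number> <street>".
    if parts:
        first = parts.pop(0)
        m = re.match(r"^(\d+[A-Za-z]?)\s+(.*)$", first)
        if m:
            parsed["house_number"], parsed["street"] = m.group(1), m.group(2)
        else:
            parsed["street"] = first

    # The last remaining element is treated as the city, anything before it as line 2.
    if parts:
        parsed["city"] = parts.pop()
    if parts:
        parsed["address_line_2"] = ", ".join(parts)
    return parsed


result = parse_address("123 Brookside Close, Henleze, Bristol, BS2 7BJ")
```

Real address data is rarely this tidy, which is where a knowledge base of parsing rules earns its keep.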

Page 5: Cleansing/Matching Data Using the SAS Data Quality Toolset

Cleansing data using SAS – the basics

Also….

• Standardisation
  • Putting data into a defined standard form (for the defined data element)

Eg1 Phone number, standardised to a defined standard form

07891425687 standardised to (07891) 425687
01452 678923 standardised to (01452) 678923
078 123 98756 standardised to (07812) 398756

Eg2 Suffix Ltd, Lmtd, Ltd., Limited standardised to Ltd.

Eg3 Name Bob, Bobby, Rob, Robert etc standardised to Robert
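A minimal sketch of the three standardisation examples, assuming simple lookup tables and an 11-digit UK phone number. The table contents and the function names `standardise_phone` and `standardise_token` are hypothetical stand-ins for what a knowledge base would supply.

```python
import re

# Hypothetical lookup tables; a real knowledge base would be far larger.
SUFFIXES = {"ltd": "Ltd.", "lmtd": "Ltd.", "ltd.": "Ltd.", "limited": "Ltd."}
GIVEN_NAMES = {"bob": "Robert", "bobby": "Robert", "rob": "Robert", "robert": "Robert"}


def standardise_phone(raw: str):
    """Return '(NNNNN) NNNNNN' for an 11-digit UK number, else None."""
    digits = re.sub(r"\D", "", raw)  # drop spaces, brackets and any other non-digits
    if len(digits) != 11 or not digits.startswith("0"):
        return None  # not enough information to standardise
    return f"({digits[:5]}) {digits[5:]}"


def standardise_token(raw: str, table: dict) -> str:
    """Map a value to its standard form, leaving unknown values unchanged."""
    cleaned = raw.strip()
    return table.get(cleaned.lower(), cleaned)


phones = [standardise_phone(p) for p in ("07891425687", "01452 678923", "078 123 98756")]
suffix = standardise_token("Lmtd", SUFFIXES)
name = standardise_token("Bobby", GIVEN_NAMES)
```

Returning `None` for values that cannot be standardised keeps bad input visible rather than silently guessing.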

Page 6: Cleansing/Matching Data Using the SAS Data Quality Toolset

Cleansing data using SAS – the basics

It can also….

• Pattern identification of invalid data items
  • Identification of inappropriate or incomplete field contents based on defined element values

Eg Postcode must be one of 6 formats: XX99 9XX, XX9 9XX etc

• A postcode of eg G19L 9P2 would be recognised as invalid.
• Ditto GL21 3 (incomplete)

NB It cannot identify invalid entries, eg postcodes of valid format which do not exist (these need to be matched to a reference dataset, eg the Royal Mail postcode file)
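The pattern check can be illustrated by reducing each value to its shape (X for a letter, 9 for a digit) and comparing that shape with the allowed list. This is a sketch under that assumption, not the tool's implementation; the names `to_pattern` and `has_valid_postcode_pattern` are hypothetical, and the six shapes listed are the standard UK postcode layouts.

```python
# The six standard UK postcode shapes, written in the X/9 notation used above.
VALID_PATTERNS = {"X9 9XX", "X99 9XX", "XX9 9XX", "XX99 9XX", "X9X 9XX", "XX9X 9XX"}


def to_pattern(value: str) -> str:
    """Replace every letter with X and every digit with 9, keeping other characters."""
    shape = []
    for ch in value.strip().upper():
        if ch.isalpha():
            shape.append("X")
        elif ch.isdigit():
            shape.append("9")
        else:
            shape.append(ch)
    return "".join(shape)


def has_valid_postcode_pattern(value: str) -> bool:
    """True when the value has one of the allowed shapes (it may still not exist)."""
    return to_pattern(value) in VALID_PATTERNS


checks = {pc: has_valid_postcode_pattern(pc) for pc in ("BS2 7BJ", "G19L 9P2", "GL21 3")}
```

As the note above says, a value such as ZZ99 9ZZ passes this check purely on shape; only a lookup against a reference file can confirm that a postcode exists.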

Page 7: Cleansing/Matching Data Using the SAS Data Quality Toolset

Cleansing data using SAS – the basics

And….

• Profile data
  • Gain an understanding of, and insight into, data from a specified source

Eg For a particular field

• What are the 5 most common values? The 5 least common?
• What are the maximum and minimum values?
• What is the type of data in the field (eg alphabetic? numeric? date?)
• What is the longest/shortest value? (alphabetic field)
• What is the average value? (numeric field)

etc
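The profiling questions above map onto a handful of standard-library calls. The sketch below is hypothetical (the function names `profile_field` and `infer_type` and the returned keys are assumptions) and only distinguishes numeric, alphabetic and mixed fields; date detection is left out.

```python
from collections import Counter
from statistics import mean


def infer_type(texts):
    """Classify a field as 'numeric', 'alphabetic' or 'mixed' from its text values."""
    if texts and all(t.isdigit() for t in texts):
        return "numeric"
    if texts and all(t.replace(" ", "").isalpha() for t in texts):
        return "alphabetic"
    return "mixed"


def profile_field(values, top_n=5):
    """Return simple profile statistics for one column of values."""
    # Treat None and blank strings as missing; compare everything else as text.
    present = [str(v).strip() for v in values if v is not None and str(v).strip()]
    ranked = Counter(present).most_common()  # most frequent first
    profile = {
        "rows": len(values),
        "missing": len(values) - len(present),
        "type": infer_type(present),
        "most_common": ranked[:top_n],
        "least_common": ranked[::-1][:top_n],
        "shortest": min(present, key=len) if present else None,
        "longest": max(present, key=len) if present else None,
    }
    if profile["type"] == "numeric":
        numbers = [int(t) for t in present]
        profile.update(minimum=min(numbers), maximum=max(numbers), average=mean(numbers))
    return profile


town_profile = profile_field(["Bristol", "Bath", "Bristol", "", None, "Gloucester"])
```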

Page 8: Cleansing/Matching Data Using the SAS Data Quality Toolset

Cleansing data using SAS – the basics

How?

• SAS - Quality Knowledge Base (QKB)
  • Set of pre-defined ‘out-of-the-box’ templates
  • Target specific
  • Location specific

• Can also define your own values, eg to accommodate clergy salutations

The Most Reverend and Right Honourable the Lord Archbishop of Canterbury (www.crockford.org.uk)

Page 9: Cleansing/Matching Data Using the SAS Data Quality Toolset

Cleansing data using SAS – the basics

BUT

• SAS Data Quality cannot do ‘magic’ and fix all data problems

• Data needs to be of a certain ‘standard’ for the QKB to work satisfactorily

• Eg A phone number entry of ‘Ext 2378’ cannot be standardised

• The UK QKB may also find certain Eastern European and Asian name standardisation hard

Page 10: Cleansing/Matching Data Using the SAS Data Quality Toolset

Cleansing data using SAS Data Quality – grouping and deduplication

Grouping

De-duplication

Page 11: Cleansing/Matching Data Using the SAS Data Quality Toolset

Cleansing data using SAS Data Quality – be aware

Look the same? – yes

Same address? – yes (possibly)

Same birthday? - yes

Same surname? - yes

Same initial? – no (possible – but not this pair)

Same people? – NO

To be 100% sure, you need a unique identifier, eg NI Number. Otherwise a human decision is required to identify whether they are the same.

Page 12: Cleansing/Matching Data Using the SAS Data Quality Toolset

SAS Data Quality – ‘fuzzy’ matching

Matching can be simple, eg using conventional tools

• Does A = B (exact match – including the number of spaces) eg ‘Smith’ = ‘Smith’
• Does A = B (using ‘wild card’ % characters) eg ‘Smith’ = ‘Smi%%’

But what about … if A looks ‘so similar’ to B they can be considered a ‘likely’ match?

eg ‘Smith’ = ‘Smith’?
eg2 = ‘Smythe’?
eg3 = ‘Smtih’?
eg4 = ‘Smith-Jones’?

= a ‘fuzzy’ match (the degree of ‘fuzziness’ can be varied – ie akin to matching non-identical twins)

This is achieved in SAS by using Match Codes
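To give a feel for the idea, the toy function below reduces a name to a short code so that similar-looking names produce the same code. It is emphatically not the SAS match code algorithm, which is driven by the QKB; `toy_match_code` and its `length` parameter (a crude stand-in for the adjustable degree of fuzziness) are hypothetical.

```python
import re

VOWELS = set("AEIOUY")  # Y is treated as a vowel in this toy scheme


def toy_match_code(name: str, length: int = 4) -> str:
    """Reduce a name to a crude code; a shorter length gives a fuzzier match."""
    letters = re.sub(r"[^A-Z]", "", name.upper())  # drop spaces, hyphens, punctuation
    if not letters:
        return ""
    # Keep the first letter, then drop every later vowel.
    skeleton = letters[0] + "".join(c for c in letters[1:] if c not in VOWELS)
    # Collapse runs of the same letter (eg TT becomes T).
    collapsed = re.sub(r"(.)\1+", r"\1", skeleton)
    return collapsed[:length]


codes = {n: toy_match_code(n) for n in ("Smith", "Smythe", "Smtih", "Smith-Jones")}
strict = {n: toy_match_code(n, length=8) for n in ("Smith", "Smith-Jones")}
```

With the default length all four names share one code, so they would be grouped as likely matches; with `length=8` ‘Smith-Jones’ gets a different code from ‘Smith’, which shows how tightening the setting trades fewer false matches for more missed ones.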

Page 13: Cleansing/Matching Data Using the SAS Data Quality Toolset

Probabilistic record linkage – Brain overload?

http://en.wikipedia.org/wiki/Record_linkage

Luckily, the tool does all of this for you….

Jaro-Winkler Distance

Low Levenshtein Distance

Phonetic Algorithm

Etc…
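Of the measures named above, Levenshtein distance is the simplest to show: it counts the single-character insertions, deletions and substitutions needed to turn one string into another. The sketch below is a plain textbook implementation; the `similarity` helper, which scales the distance to a 0 to 1 score, is an illustrative assumption rather than anything the tool exposes.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character edits needed to turn a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Scale the distance to a score between 0 (different) and 1 (identical)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


distances = {other: levenshtein("Smith", other)
             for other in ("Smith", "Smythe", "Smtih", "Smith-Jones")}
```

Note that a simple transposition such as ‘Smtih’ costs two edits under this measure, which is one reason tools combine several techniques rather than relying on one.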

Page 14: Cleansing/Matching Data Using the SAS Data Quality Toolset

SAS Data Quality – match codes

Matching – when to use match codes?

• Postcodes, email addresses and phone numbers

• Addresses

• Names

Page 15: Cleansing/Matching Data Using the SAS Data Quality Toolset

Matching - Final tips

Matching is rarely a straightforward, exact process. It requires perseverance. Success rates can be improved by:

Understanding the data (ie identifying the data nuances)

Experimenting with different matching techniques

Removing any ‘noise’ from the match strings

Using a ‘cascading degree of confidence’ – retaining the strongest match

However, there is often a trade-off between the number of false matches accepted and missing the occasional ‘true’ match
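One way to read ‘cascading degree of confidence’ is to try the strictest rule first and fall back to looser rules only when it finds nothing unambiguous. The sketch below assumes that reading; the rule labels, the `norm` noise-removal helper and the record fields are all hypothetical.

```python
import re


def norm(text: str) -> str:
    """Remove 'noise': upper-case the text and strip everything but letters and digits."""
    return re.sub(r"[^A-Z0-9]", "", text.upper())


def cascade_match(record, candidates, rules):
    """Try rules strongest first; accept the first rule giving exactly one candidate."""
    for label, key in rules:
        target = key(record)
        hits = [c for c in candidates if key(c) == target]
        if len(hits) == 1:
            return label, hits[0]
    return None, None  # nothing unambiguous: leave for a human decision


RULES = [
    ("exact name + postcode", lambda r: (r["name"], r["postcode"])),
    ("normalised name + postcode", lambda r: (norm(r["name"]), norm(r["postcode"]))),
    ("normalised name only", lambda r: norm(r["name"])),
]

candidates = [
    {"id": 1, "name": "St Mary's Church", "postcode": "GL1 1AA"},
    {"id": 2, "name": "St. Marys Church", "postcode": "BS2 7BJ"},
]
label, match = cascade_match({"name": "ST MARYS CHURCH", "postcode": "bs2 7bj"},
                             candidates, RULES)
```

Here the exact rule finds nothing, the second rule finds candidate 2 alone, and the loosest rule is never reached (it would have matched both candidates, which is exactly the false-match risk described above).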

Page 16: Cleansing/Matching Data Using the SAS Data Quality Toolset

Example

Business Issue – Theft of metals from church roofs, linked to the high demand for metals

Page 17: Cleansing/Matching Data Using the SAS Data Quality Toolset

Example

Mitigation steps

Include application of Smartwater

Uniquely links metal to a location and is now a pre-requisite for obtaining insurance for a church with Ecclesiastical

Page 18: Cleansing/Matching Data Using the SAS Data Quality Toolset

Example

Business Process :

Page 19: Cleansing/Matching Data Using the SAS Data Quality Toolset

Example

Data Issues :

• ‘House of God…’

• ‘Many to many’

• Transposition of key data elements

• Standardisation

Page 20: Cleansing/Matching Data Using the SAS Data Quality Toolset

Matching and Cleansing Methodology

• Remove non-Anglican church entries from the policy file

• Validate the Smartwater supplied policy number and attempt to match it

• ‘Break out’ church name

• Standardise and match on church name, postcode/short-postcode and town

• Score and de-duplicate, retaining the highest scoring match

• Output confident matches …and where not confident, suggest alternatives, to permit data correction
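The last two steps (score, de-duplicate, then output confident matches or suggest alternatives) can be sketched as follows. The scores, the 0.9 threshold and the identifiers are invented for illustration; the source does not say how matches were actually scored.

```python
from collections import defaultdict


def best_matches(scored_pairs, confident_at=0.9):
    """Split locations into confident matches and ones needing manual review.

    scored_pairs: iterable of (location_id, policy_id, score) tuples.
    """
    by_location = defaultdict(list)
    for location, policy, score in scored_pairs:
        by_location[location].append((score, policy))

    confident, review = {}, {}
    for location, options in by_location.items():
        options.sort(reverse=True)  # highest score first
        top_score, top_policy = options[0]
        runner_up = options[1][0] if len(options) > 1 else None
        # Confident only if the best score is high enough and not tied.
        if top_score >= confident_at and (runner_up is None or runner_up < top_score):
            confident[location] = top_policy
        else:
            review[location] = [policy for _, policy in options]
    return confident, review


pairs = [
    ("L1", "P10", 0.95), ("L1", "P11", 0.60),  # clear winner
    ("L2", "P20", 0.70), ("L2", "P21", 0.65),  # best score too low
    ("L3", "P30", 0.92), ("L3", "P31", 0.92),  # tie at the top
]
confident, review = best_matches(pairs)
```

Keeping the full list of alternatives for the uncertain locations is what allows someone to correct the data rather than simply being told that no match was found.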

Example

Page 21: Cleansing/Matching Data Using the SAS Data Quality Toolset

Final Results :

• 85% of Smartwater locations were confidently matched to a policy

• 4% of the remaining policies had a single, more confidently matched alternative to the policy specified by Smartwater

• The remainder of the Smartwater entries had multiple possibilities
  • these required a manual decision to be made

• Currently working to maintain the level of Data Quality

• … and all users of the system can be more confident of the data used in the process

Example

Page 22: Cleansing/Matching Data Using the SAS Data Quality Toolset

Questions?

Thank you. Any questions?

[email protected]