De-Duplication
A not so simple problem
Covers Appendix Part 5
False?
• False positives occur when records are identified as a group of duplicates but do NOT represent the same customer
• False negatives occur when actual redundant representations of the same customer are NOT identified
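To make the two error types concrete, here is a minimal sketch that counts them for a matcher's output. The record keys and both groupings are invented for illustration; they are not the keys shown elsewhere in these slides.

```python
from itertools import combinations

def pairs_from_groups(groups):
    """Expand groups of record keys into the set of unordered key pairs."""
    pairs = set()
    for group in groups:
        for a, b in combinations(sorted(group), 2):
            pairs.add((a, b))
    return pairs

def count_errors(predicted_groups, true_groups):
    """Return (false_positive_pairs, false_negative_pairs)."""
    predicted = pairs_from_groups(predicted_groups)
    actual = pairs_from_groups(true_groups)
    false_positives = predicted - actual   # flagged, but different customers
    false_negatives = actual - predicted   # same customer, but missed
    return false_positives, false_negatives

# Hypothetical truth: keys 1 and 2 are one customer, keys 3 and 4 are another.
true_groups = [{1, 2}, {3, 4}]
# Hypothetical matcher output: wrongly pulls 3 in with 1 and 2, and misses 3-4.
predicted_groups = [{1, 2, 3}]

fp, fn = count_errors(predicted_groups, true_groups)
print(sorted(fp))  # [(1, 3), (2, 3)]
print(sorted(fn))  # [(3, 4)]
```

Counting pairs rather than groups means one over-merged group can produce several false positives at once, which is usually what a business cares about when mail goes to the wrong person.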
• Customer Name – only personal names
• Postal Address – only United States address formats
• Tax ID – could be a personal National Insurance Number or another unique identifier
Identical
Would you argue that these are NOT duplicate customers?
Abbreviation
The abbreviation of first and middle names is a common challenge:
Does a matching Tax ID guarantee that a variation is a duplicate? What about when Tax ID is missing?
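One way to explore the abbreviation question is a comparison that treats an initial as compatible with any given name starting with that letter. The sketch below is only an illustration under that assumption; the function names and the sample names are made up, and it assumes the last token of a name is the family name.

```python
def name_parts(full_name):
    """Lower-case a name, turn periods into spaces, and split into tokens."""
    return full_name.lower().replace(".", " ").split()

def tokens_compatible(a, b):
    """Tokens agree if equal, or if one is a one-letter initial of the other."""
    if a == b:
        return True
    shorter, longer = sorted((a, b), key=len)
    return len(shorter) == 1 and longer.startswith(shorter)

def names_compatible(name_a, name_b):
    """Family names must match exactly; given names are compared in order.
    A missing middle name is treated as compatible (an assumption)."""
    a, b = name_parts(name_a), name_parts(name_b)
    if not a or not b or a[-1] != b[-1]:
        return False
    # zip stops at the shorter list, so an absent middle name is ignored
    return all(tokens_compatible(x, y) for x, y in zip(a[:-1], b[:-1]))

print(names_compatible("John Michael Smith", "J. M. Smith"))  # True
print(names_compatible("J. Smith", "Jane Smith"))             # True
print(names_compatible("John Smith", "Jane Smith"))           # False
```

Note that "J. Smith" is compatible with both John and Jane, so name compatibility alone cannot decide a match; it needs supporting evidence such as Tax ID or address.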
Marriage
Marriages can be good for people but possibly bad for their data:
Did the hyphenated last name on Key 252 help overcome the change of address and missing Tax ID? How do you know if Keys 261 and/or 262 are truly the same customer as Key 263?
False Positives
For Keys 312 & 313, do you think the matching Tax ID and similar name indicate possible duplication of Key 311 despite the different postal address?
For Keys 322 & 323, do you think the exact same postal address and similar name indicate possible duplication of Key 321 despite the missing Tax IDs?
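The two questions above can be turned into explicit rules so that the trade-offs are visible. The following is one possible rule set, not a recommended answer: the 0.8 threshold, the three outcome labels and the record layout are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Rough string similarity between 0.0 and 1.0 (case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def classify_pair(rec_a, rec_b, name_threshold=0.8):
    """Return 'match', 'review' or 'non-match' for two customer records."""
    name_close = similarity(rec_a["name"], rec_b["name"]) >= name_threshold
    tax_a, tax_b = rec_a.get("tax_id"), rec_b.get("tax_id")
    both_tax = bool(tax_a) and bool(tax_b)
    same_tax = both_tax and tax_a == tax_b
    same_address = rec_a["address"].lower() == rec_b["address"].lower()

    if both_tax and not same_tax:
        return "non-match"      # two different identifiers on file
    if same_tax and name_close:
        # Same Tax ID but a different address may just be a house move
        return "match" if same_address else "review"
    if same_address and name_close:
        return "review"         # Tax ID missing: could be relatives
    return "non-match"

# Invented sample records
a = {"name": "Mary Jones", "address": "1 Elm St", "tax_id": "123"}
moved = {"name": "Mary Jones", "address": "9 Oak Ave", "tax_id": "123"}
no_tax = {"name": "Mary Jones", "address": "1 Elm St", "tax_id": None}
print(classify_pair(a, moved))   # review
print(classify_pair(a, no_tax))  # review
```

Sending borderline pairs to a "review" outcome instead of forcing a yes or no is one way to trade false positives against manual effort.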
Same Address
A common challenge is the same family name and the exact same postal address
What goes in Report Appendix?
• Discuss deduplication
  – What is your business strategy?
• Show via a flow chart how you would attempt deduplication
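As a starting point for thinking about the boxes in such a flow chart, a deduplication process is often broken into four stages: normalize, block, compare, group. The sketch below shows that generic shape only; the matching rule, the blocking key and the record fields are placeholder assumptions, not a prescribed strategy.

```python
from collections import defaultdict
from itertools import combinations

def normalize(record):
    """Stage 1: standardize case and whitespace; None becomes an empty string."""
    return {k: " ".join(str(v or "").lower().split()) for k, v in record.items()}

def blocking_key(record):
    """Stage 2: only compare records that share a cheap key (last name token)."""
    tokens = record["name"].split()
    return tokens[-1] if tokens else ""

def is_match(a, b):
    """Stage 3: placeholder rule: same non-empty Tax ID, or same name and address."""
    if a["tax_id"] and a["tax_id"] == b["tax_id"]:
        return True
    return a["name"] == b["name"] and a["address"] == b["address"]

def deduplicate(records):
    """Stage 4: return lists of record indexes judged to be the same customer."""
    cleaned = [normalize(r) for r in records]
    blocks = defaultdict(list)
    for idx, rec in enumerate(cleaned):
        blocks[blocking_key(rec)].append(idx)

    parent = list(range(len(cleaned)))   # union-find merges matched pairs

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for members in blocks.values():
        for i, j in combinations(members, 2):
            if is_match(cleaned[i], cleaned[j]):
                parent[find(j)] = find(i)

    groups = defaultdict(list)
    for idx in range(len(cleaned)):
        groups[find(idx)].append(idx)
    return [g for g in groups.values() if len(g) > 1]

# Invented sample data
records = [
    {"name": "Jane Doe", "address": "1 Main St", "tax_id": "111"},
    {"name": "JANE  DOE", "address": "1 main st", "tax_id": None},
    {"name": "John Doe", "address": "1 Main St", "tax_id": None},
    {"name": "J. Doe", "address": "5 Park Rd", "tax_id": "111"},
]
print(deduplicate(records))  # [[0, 1, 3]]
```

Each function corresponds to one box, and each decision inside `is_match` corresponds to a diamond, so swapping in a different business strategy means changing that one stage.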
Mailing List Management Functional Requirements
• Set out what the new system will do.
• You have some experience with this from CS22120 Group Project.
• An attempt to describe, logically, the functionality of the system.
• You need to describe it, NOT build it.
Requirements
• Functional
  – What is it supposed to do
• Non-Functional requirements
  – Computer Environment
  – Personnel
  – Web based
Some functions
• Set up required fields
• Add, Modify and Delete Fields
• Import initial list
  – Field matching
  – Excel, CSV programs
• Add, Modify and Delete Records
• Merge records from externally purchased files
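The "field matching" step of an import can be pictured as a mapping from the supplier's column headers to the system's required fields. This is a small sketch of that idea for a CSV source; the header names, the internal field names and the sample row are hypothetical.

```python
import csv
import io

# Hypothetical mapping from a supplier's column headers to our required fields
FIELD_MAP = {"Full Name": "name", "Addr": "address", "Postcode": "post_code"}
REQUIRED = ["name", "address", "post_code"]

def import_list(csv_text, field_map=FIELD_MAP):
    """Read CSV text and rename matched columns to the internal field names."""
    rows = []
    for raw in csv.DictReader(io.StringIO(csv_text)):
        row = {field_map[col]: (val or "").strip()
               for col, val in raw.items() if col in field_map}
        for field in REQUIRED:       # required fields with no source column stay blank
            row.setdefault(field, "")
        rows.append(row)
    return rows

sample = "Full Name,Addr,Postcode,Notes\nAnn Lee, 2 High St ,AB1 2CD,vip\n"
print(import_list(sample))
# [{'name': 'Ann Lee', 'address': '2 High St', 'post_code': 'AB1 2CD'}]
```

Columns with no entry in the mapping (here `Notes`) are dropped, which is a design choice a requirements document would need to state either way.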
Mailing List Functionality cont’d
• Cleanse using the Postcode Address File (PAF)
  – Contains all addresses in the UK
  – Use to correct an address from its post code
  – Can add the correct:
    • Street name
    • Post town
    • County
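Functionally, cleansing against the PAF amounts to a lookup keyed by post code that overwrites or fills in the address fields. The sketch below uses a tiny invented dictionary as a stand-in; the real PAF is a licensed data set with its own layout, and the sketch assumes one street per post code, which does not always hold.

```python
# Stand-in for a PAF lookup keyed by normalized post code. The entry is
# invented; the real file has its own format and many more fields.
PAF_SAMPLE = {
    "AB12CD": {"street": "High Street", "post_town": "Exampleton",
               "county": "Exampleshire"},
}

def normalize_post_code(post_code):
    """Upper-case the post code and remove all whitespace."""
    return "".join(post_code.upper().split())

def cleanse(record, paf=PAF_SAMPLE):
    """Return a copy of the record with street, post town and county taken
    from the lookup; records with unknown post codes are returned unchanged."""
    entry = paf.get(normalize_post_code(record.get("post_code", "")))
    if entry is None:
        return dict(record)
    return {**record, **entry}

dirty = {"name": "Ann Lee", "street": "Hgih St", "post_code": "ab1 2cd"}
print(cleanse(dirty)["street"])  # High Street
```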
Sorting
• Sort by
  – Post code
  – Geographic Areas
  – Job Title
  – SIC codes
  – Turnover (Ascending/Descending/Random)
  – And combinations of above
Mailing List Functionality cont’d
• Select Number of records to deliver and maybe by
  – Post code
  – Job Title
  – SIC codes
  – Turnover (Ascending/Descending/Random)
  – Add false “ghosts”
  – File formats
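A selection function along these lines might look like the sketch below. It assumes that "ghosts" are decoy records the list owner plants so that reuse of a delivered list can be traced; that reading, the field names and the sample companies are all assumptions made for illustration.

```python
def select_for_delivery(records, count, sic_code=None, ghosts=()):
    """Pick up to `count` records, optionally filtered by SIC code and
    ordered by descending turnover, then append the ghost records."""
    pool = [r for r in records if sic_code is None or r["sic_code"] == sic_code]
    pool.sort(key=lambda r: r["turnover"], reverse=True)
    return pool[:count] + list(ghosts)

# Invented sample data
companies = [
    {"name": "A Ltd", "sic_code": "6201", "turnover": 5},
    {"name": "B Ltd", "sic_code": "6201", "turnover": 9},
    {"name": "C Ltd", "sic_code": "4711", "turnover": 7},
]
ghost = {"name": "Ghost Co", "sic_code": "6201", "turnover": 0}

delivery = select_for_delivery(companies, 1, sic_code="6201", ghosts=[ghost])
print([r["name"] for r in delivery])  # ['B Ltd', 'Ghost Co']
```

Here the ghosts are added on top of the requested count; whether they should instead count towards it is a requirement worth stating explicitly.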
Product?
• Must be able to distribute software
  – How?
  – Web or local OS
  – Hardware Platform
Competition
• Mailing Houses
  – Data discs
  – Web
  – Mailing list management services
• Software Companies
  – Dedupe software
  – Mailing List Management Software
• CHECK THESE OUT FOR THE REPORT