Clarity Solution Group presentation at the Chief Data Officer Insurance 2016
TRANSCRIPT
Driving Improved Customer Experience via Entity Resolution and Machine Learning
Resolving Entities with Machine Learning
2
Agenda
Proprietary and Confidential - ©2016 Clarity Solution Group, LLC
- Business Problem: Why? What Value? Business Context
- ML Examples
- Cleanse and Entity Resolution (HDFS to HDFS)
  Many moving pieces; collaboration with business user is key
  Cleanse, e.g.:
  - Consider only features of interest
  - Move to upper case
  - Remove punctuation
  - Standardization (e.g., truncate US zip codes to 5 characters)
  - ….
- Entity Resolution
  [Thumbnail: standardized data, pairwise distances, logistic regression with trained threshold, hierarchical clustering]
- Key Considerations
3
The Business Problem
Order from chaos: a common definition of Partners and Partners' Clients drives improved growth, service, and innovation
[Diagram labels: Client Business Partners; Business Partner Clients]
- Financial services firm with ~1,000,000 direct partner relationships and significant duplication
- Classic entity resolution issue
- Problem multiplies exponentially with duplication within Partners' Clients
4
The Business Opportunity
Resolving duplication “noise” has many positive impacts on customer experience and business measures
[Diagram labels: Client Business Partners; Business Partner Clients]
- Improved service
- Decreased repetitive, labor-intensive activity
- Accelerated client onboarding
- Increased revenue
- Improved network visibility driving cross-selling
- Improved bottom line
5
Machine Learning – Some Notorious Examples
6
The Overall Process
[Diagram: Storage, Cleanse, Entity Resolution, Storage]
! Defining the end-to-end solution scope is key
Cleanse, e.g.:
- Remove punctuation
- Standardization (e.g., truncate US zip codes to 5 characters)
- ….
Data Points:
- Integration: ~8 weeks
- Machine learning dev.: ~6 weeks
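The cleanse examples listed in the deck (features of interest, upper case, punctuation removal, 5-character US zip codes) could be sketched roughly as follows. This is only an illustration: the field names and the sample record are made up, since the slides do not show the actual schema or tooling.

```python
import string

# Hypothetical field names; the slides do not specify a record schema.
FEATURES_OF_INTEREST = ("name", "address", "zip")

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def cleanse(record: dict) -> dict:
    """Keep only the features of interest and standardize each value."""
    cleaned = {}
    for field in FEATURES_OF_INTEREST:
        value = str(record.get(field, ""))
        value = value.upper()                    # move to upper case
        value = value.translate(_PUNCT_TABLE)    # remove punctuation
        value = " ".join(value.split())          # collapse repeated whitespace
        cleaned[field] = value
    # Standardization example from the slide: 5-character US zip codes.
    cleaned["zip"] = cleaned["zip"][:5]
    return cleaned


# Made-up record for illustration; "phone" is dropped as not of interest.
raw = {
    "name": "Spacca Pizzeria, Inc.",
    "address": "123 W. Sunnyside Ave.",
    "zip": "60640-1234",
    "phone": "555-0100",
}
print(cleanse(raw))
```

Keeping the cleanse step as a pure function of one record makes it easy to apply per row before any pairwise comparison.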
8
Entity Resolution via Machine Learning
[Diagram: Standardized Data, all-pairs distances (Dist.), Logistic Regression with Trained Threshold, Hierarchical Clustering into groups]
Creating all pairs and computing each distance:
- Dist. = 0: same
- Dist. = 1: not the same
Training the Machine: visually ID a subset as match or non-match
Logistic Regression: determining the best separating line
Trained Threshold, applied to the full dataset:
Dist.  Match
0.9    No
0      Yes
0.2    Yes
…      No
Hierarchical Clustering: grouping entries based on their distance
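The "training the machine" step can be sketched as a one-feature logistic regression on hand-labeled distances. This is a minimal illustration under stated assumptions, not the presenters' code: the gradient-descent fit, the learning rate, the step count, and most of the labeled points are hypothetical (only 0.9 = No, 0 = Yes, and 0.2 = Yes come from the slide).

```python
import math

# Hand-labeled subset of pair distances: (distance, is_match).
# 0.9 -> No, 0 -> Yes and 0.2 -> Yes come from the slide; the other points
# are made-up values added so the example has enough data to fit.
labeled = [(0.0, 1), (0.1, 1), (0.2, 1), (0.3, 1),
           (0.6, 0), (0.7, 0), (0.9, 0), (1.0, 0)]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_threshold(data, lr=1.0, steps=5000):
    """Fit p(match | d) = sigmoid(w * d + b) by plain gradient descent on the
    log-loss and return the distance at which p(match) = 0.5."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(steps):
        grad_w = sum((sigmoid(w * d + b) - y) * d for d, y in data) / n
        grad_b = sum((sigmoid(w * d + b) - y) for d, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    # w ends up negative (larger distance means less likely a match), so the
    # separating point is where w * d + b = 0.
    return -b / w


threshold = fit_threshold(labeled)


def is_match(distance: float) -> bool:
    """Apply the trained threshold to any pair distance in the full dataset."""
    return distance < threshold


print(round(threshold, 2), is_match(0.2), is_match(0.9))
```

With a single distance feature the "best separating line" reduces to one cut-off value, which is why the slide can show it as a trained threshold.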
9
Entity Resolution: Behind the Scenes
! 1.2 million entries
- Filter out exact matches
- Use common features to parallelize the calculation
- Label unique entries without clustering
[Diagram: Standardized Data with entries tagged US, IT, US, IT, CA regrouped as US, US, IT, IT, CA; then distance, Logistic Regression (Trained Threshold from a labeled subset), Hierarchical Clustering]
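The three "behind the scenes" steps can be sketched as below. It assumes the common feature is the country column, as the US / IT / CA tags in the diagram suggest; the record layout and the extra exact duplicate are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Records as (name, address, country). Values echo the deck's example, plus
# one exact duplicate added here for illustration.
records = [
    ("SPACCANAPOLI PIZZERIA", "123 WEST SUNNYSIDE", "USA"),
    ("SPACCA PIZZERIA", "123 W SUNNYSIDE AVE", "USA"),
    ("SPACCANAPOLI PIZZERIA", "123 WEST SUNNYSIDE", "IT"),
    ("SPACCA PIZZERIA", "123 W SUNNYSIDE AVE", "IT"),
    ("PIZZERIA LIBRETTO", "221 OSSINGTON AVENUE", "CA"),
    ("PIZZERIA LIBRETTO", "221 OSSINGTON AVENUE", "CA"),  # exact duplicate
]

# 1) Filter out exact matches (dict.fromkeys keeps first-seen order).
unique_records = list(dict.fromkeys(records))

# 2) Use a common feature (assumed: country) to split the work into blocks
#    that can be processed independently, for example one worker per block.
blocks = defaultdict(list)
for rec in unique_records:
    blocks[rec[2]].append(rec)

# 3) Blocks with a single entry need no clustering; label them directly.
singletons = {c: recs[0] for c, recs in blocks.items() if len(recs) == 1}
candidate_pairs = {c: list(combinations(recs, 2))
                   for c, recs in blocks.items() if len(recs) > 1}


def n_pairs(n: int) -> int:
    """Number of unordered pairs among n entries."""
    return n * (n - 1) // 2


print(len(unique_records), sorted(singletons),
      sum(len(p) for p in candidate_pairs.values()), n_pairs(len(unique_records)))
print(n_pairs(1_200_000))  # all-pairs count for 1.2 million entries
```

Comparing every pair of 1.2 million entries would mean roughly 7.2e11 distance computations, which is why restricting comparisons to entries that share a feature matters so much.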
10
An Example
Input
Name                   Address               Country  Group
SPACCANAPOLI PIZZERIA  123 WEST SUNNYSIDE    USA      ????
SPACCA PIZZERIA        123 W SUNNYSIDE AVE   USA      ????
SPACCANAPOLI PIZZERIA  123 WEST SUNNYSIDE    IT       ????
SPACCA PIZZERIA        123 W SUNNYSIDE AVE   IT       ????
PIZZERIA LIBRETTO      221 OSSINGTON AVENUE  CA       ????
Output
Name                   Address               Country  Group
SPACCANAPOLI PIZZERIA  123 WEST SUNNYSIDE    USA      USA_1
SPACCA PIZZERIA        123 W SUNNYSIDE AVE   USA      USA_1
SPACCANAPOLI PIZZERIA  123 WEST SUNNYSIDE    IT       IT_1
SPACCA PIZZERIA        123 W SUNNYSIDE AVE   IT       IT_1
PIZZERIA LIBRETTO      221 OSSINGTON AVENUE  CA       CA_1
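One way to get from the input table to the output groups is sketched below. It assumes single-linkage clustering cut at the trained threshold, which is equivalent to taking connected components over the pairs judged a match; the slides do not say which linkage was used, and the `matched_pairs` list is a hypothetical result of the threshold step rather than something computed here.

```python
def assign_groups(records, matched_pairs):
    """records: list of (name, address, country).
    matched_pairs: list of (i, j) index pairs judged to be a match.
    Returns one 'COUNTRY_n' group label per record."""
    parent = list(range(len(records)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in matched_pairs:
        # Only merge entries that share the blocking feature (country).
        if records[i][2] == records[j][2]:
            parent[find(j)] = find(i)

    counters = {}        # country -> number of groups labeled so far
    label_of_root = {}   # cluster root index -> group label
    labels = []
    for idx, (_, _, country) in enumerate(records):
        root = find(idx)
        if root not in label_of_root:
            counters[country] = counters.get(country, 0) + 1
            label_of_root[root] = f"{country}_{counters[country]}"
        labels.append(label_of_root[root])
    return labels


records = [
    ("SPACCANAPOLI PIZZERIA", "123 WEST SUNNYSIDE", "USA"),
    ("SPACCA PIZZERIA", "123 W SUNNYSIDE AVE", "USA"),
    ("SPACCANAPOLI PIZZERIA", "123 WEST SUNNYSIDE", "IT"),
    ("SPACCA PIZZERIA", "123 W SUNNYSIDE AVE", "IT"),
    ("PIZZERIA LIBRETTO", "221 OSSINGTON AVENUE", "CA"),
]
matched_pairs = [(0, 1), (2, 3)]  # hypothetical output of the threshold step
groups = assign_groups(records, matched_pairs)
print(groups)
```

The Canadian entry has no candidate partner, so it receives its own label without any clustering.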
11
Findings and Data Points
Business Value
• Reduction in effort: ~3 FTEs
• Increased client onboarding: ~30%
• Individual / anecdotal evidence of increased cross-selling and loyalty
Technical Measures
• ~500K duplicates identified from ~1.2MM total customer records
• Job parallelism reduced run-time from >> 24 hours to 15 minutes
• Run-time enabled an overnight process, with capability to run intra-day if needed
12
Key Considerations
Consideration: Machine Learning is not a stand-alone exercise
Implication: Outline end-to-end process with business application integration points
Consideration: Business collaboration in the “training” process is critical
Implication: Ensure heavy degree of subject matter expert involvement
Consideration: Recognize the importance of technique in the solution
Implication: Leverage a data science process: problem to hypothesis to technique selection
Consideration: Underlying technology is not “one size fits all”
Implication: Machine Learning / Big Data solutions require customization and corresponding investment in people
13
Questions?
14
Appendix
15
Entity Resolution: Jaccard Distance on Shingles
Jaccard Distance
1) ‘CLARITY SOLUTION GROUP’:
['CLA', 'LAR', 'ARI', 'RIT', 'ITY', 'TY ', 'Y S', ' SO', 'SOL', 'OLU', 'LUT', 'UTI', 'TIO', 'ION', 'ON ', 'N G', ' GR', 'GRO', 'ROU', 'OUP']
2) ‘CLARITY SOL GR’:
['CLA', 'LAR', 'ARI', 'RIT', 'ITY', 'TY ', 'Y S', ' SO', 'SOL', 'OL ', 'L G', ' GR']
$d_J = 1 - \frac{|S_1 \cap S_2|}{|S_1| + |S_2| - |S_1 \cap S_2|}$, ranging from 0 to 1, where $S_1$ and $S_2$ are the shingle sets of strings 1) and 2)
[Diagram: Clean Data, pairwise distances (Dist.), Logistic Regression with Trained Threshold from a labeled subset, Hierarchical Clustering]
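The shingle lists and distance formula on this slide can be reproduced with a few lines of Python. This is a minimal sketch; the function names are illustrative, and the shingle length of 3 is read off the slide's example.

```python
def shingles(text: str, k: int = 3) -> set:
    """All overlapping substrings of length k (spaces included)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard_distance(a: str, b: str, k: int = 3) -> float:
    """1 - |A ∩ B| / (|A| + |B| - |A ∩ B|); 0 means same, 1 means disjoint."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 0.0  # two strings too short to shingle are treated as identical
    inter = len(sa & sb)
    return 1.0 - inter / (len(sa) + len(sb) - inter)


d = jaccard_distance("CLARITY SOLUTION GROUP", "CLARITY SOL GR")
print(len(shingles("CLARITY SOLUTION GROUP")),
      len(shingles("CLARITY SOL GR")),
      round(d, 3))
```

For the two strings on the slide, 10 of the 22 distinct shingles are shared, so the distance is 12/22, about 0.545.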
16
Entity Resolution via Machine Learning
[Diagram repeated with entries labeled P, A, E: Standardized Data; creating all pairs and computing each distance (Dist. = 0 same, Dist. = 1 not the same); training the machine by visually identifying a subset as match or non-match; Logistic Regression determining the best separating line (Trained Threshold); Hierarchical Clustering grouping entries based on their distance]
17
Entity Resolution via Machine Learning
! 1.2 million entries
- Filter out exact matches
- Use common features to parallelize the calculation
- Label unique entries without clustering
[Diagram repeated with entries labeled P, A, E: Standardized Data with entries tagged US, IT, US, IT, CA regrouped as US, US, IT, IT, CA; then distance, Logistic Regression (Trained Threshold from a labeled subset), Hierarchical Clustering]