Large Scale Entity Resolution, LexisNexis
Large Scale Entity Resolution: Tools for Finding the Important Needle in the Haystack
Mary Galvin, Technical Consultant, LexisNexis
Kodak Global Directions ‘13

Upload: kodak-alaris-document-imaging

Post on 13-Jan-2015

DESCRIPTION

Large Scale Entity Resolution: Tools for Finding the Important Needle in the Haystack. Global Directions Conference 2013

TRANSCRIPT

Page 1: Large Scale Entity Resolution, LexisNexis

Large Scale Entity Resolution Tools for Finding the Important Needle in the Haystack

Mary Galvin, Technical Consultant, LexisNexis Kodak Global Directions ‘13

Page 2: Large Scale Entity Resolution, LexisNexis

Strategies for Entity Resolution to Reveal Hidden Connections

Page 3: Large Scale Entity Resolution, LexisNexis

Semantics

1. ‘Entity’: A thing with distinct and independent existence containing enough attributes to uniquely set it apart from something else.

2. ‘Entity Resolution’: The processes and methodologies used to uncover instances where the same ‘entity’ is referred to across disparate sources of digital information (i.e., records, news stories, blogs/microblogs, etc.).

Page 4: Large Scale Entity Resolution, LexisNexis

Large Scale Entity Resolution Use Case #1

Page 5: Large Scale Entity Resolution, LexisNexis

Scenario

Healthcare insurers need better analytics to identify drug-seeking behavior and schemes that recruit members to use their membership fraudulently.

Groups of people collude to source scheduled drugs through multiple members to avoid detection by rules-based systems. Providers recruit members to provide and escalate services that are not rendered.

Result

The analysis detected social groups that are sourcing Vicodin and other scheduled drugs. It identifies the prescribers and pharmacies involved, helping the insurer focus investigations and intervene strategically to mitigate risk.

Large Scale Entity Resolution Use Case #2

Non-Social: Almost every prescription is in social isolation (> 96%)

Social: Large % of prescriptions show socialization (long tail)
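One way to picture how resolved entities can surface a colluding group is to link members who resolve to a shared attribute and then read off the connected clusters. The sketch below is only an illustration under stated assumptions: the member IDs, the links, and the prescription counts are all hypothetical, and the slides do not say which algorithm the actual analysis used.

```python
from collections import defaultdict

def find_groups(links):
    """Return connected groups of members given pairwise links (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for member in parent:
        groups[find(member)].add(member)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])

# Hypothetical links: pairs of members resolved to a shared address or phone.
links = [("m1", "m2"), ("m2", "m3"), ("m4", "m5")]
# Hypothetical counts of scheduled-drug prescriptions per member.
rx_counts = {"m1": 4, "m2": 6, "m3": 5, "m4": 1, "m5": 0, "m6": 2}

groups = find_groups(links)
group_totals = {tuple(g): sum(rx_counts.get(m, 0) for m in g) for g in groups}
```

Members with no links (here `m6`) never enter a group, which corresponds to the "Non-Social" case; a group whose total stands out would be a candidate for investigation.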

Page 6: Large Scale Entity Resolution, LexisNexis

Large Scale Entity Resolution Challenges

1. Permanence/Persistence
2. Transparency
3. Spatial and Temporal Considerations
4. Source Credibility Considerations
5. Completeness

Page 7: Large Scale Entity Resolution, LexisNexis

Entity Resolution Methodologies

Rules-Based:

− Based on logic (IF/ELSE or SWITCH statements)

− Example: If field values 1, 2 and 5 from source ‘a’ are equivalent to values 3, 6 and 7 in source ‘b’, respectively, then declare a match.
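The rules-based example above can be sketched as a single hard-coded rule. This is a minimal illustration, not the presenter's implementation: the function name `rules_match`, the sample records, and the case-insensitive comparison are assumptions added here.

```python
def rules_match(record_a, record_b, field_pairs):
    """Declare a match only if every paired field is present and equal."""
    for field_a, field_b in field_pairs:
        value_a = record_a.get(field_a)
        value_b = record_b.get(field_b)
        if value_a is None or value_b is None:
            return False  # a missing value cannot satisfy the rule
        # Assumption: values are compared ignoring case and outer whitespace.
        if str(value_a).strip().lower() != str(value_b).strip().lower():
            return False
    return True

# Fields 1, 2 and 5 of source 'a' are paired with fields 3, 6 and 7 of source 'b'.
FIELD_PAIRS = [(1, 3), (2, 6), (5, 7)]

# Hypothetical records keyed by field number.
record_a = {1: "Mary", 2: "Smith", 5: "20001"}
record_b = {3: "mary", 6: "SMITH", 7: "20001"}
is_rule_match = rules_match(record_a, record_b, FIELD_PAIRS)
```

Every new source layout or language needs another hand-written rule of this kind, which is the maintenance cost that rules-based systems carry.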

Statistics-Based:

− Based on computation of weights and thresholds; a match is declared only when the sum of all weights surpasses a certain threshold

− Example: Threshold = 29. Each field (Field 1 through Field 4) in Source A and Source B receives a score; the sum of the individual field scores (based on specificity values) is compared against the threshold.
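The weight-and-threshold idea can be sketched in a few lines. The threshold of 29 comes from the slide; the field names, the per-field weights, and the rule that only exactly agreeing fields contribute are hypothetical choices made for illustration.

```python
THRESHOLD = 29  # threshold shown on the slide

def match_score(record_a, record_b, field_weights):
    """Sum the weights of the fields on which both records agree."""
    total = 0.0
    for field, weight in field_weights.items():
        value_a, value_b = record_a.get(field), record_b.get(field)
        if value_a is not None and value_a == value_b:
            total += weight
    return total

def is_statistical_match(record_a, record_b, field_weights, threshold=THRESHOLD):
    """A match is declared only when the summed score surpasses the threshold."""
    return match_score(record_a, record_b, field_weights) > threshold

# Hypothetical weights: more specific fields carry more weight.
WEIGHTS = {"name": 12, "street": 15, "city": 8, "state": 4}

source_a = {"name": "M GALVIN", "street": "110 E ELM ST", "city": "Ashland", "state": "Ohio"}
source_b = {"name": "M GALVIN", "street": "110 E ELM ST", "city": "Newton", "state": "Ohio"}
source_c = {"name": "M GALVIN", "street": "212 E MAIN ST", "city": "Ashland", "state": "Ohio"}
```

Here `source_a` and `source_b` agree on name, street and state (12 + 15 + 4 = 31), which surpasses 29, while `source_a` and `source_c` agree on name, city and state (24) and fall short.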

Page 8: Large Scale Entity Resolution, LexisNexis

Choosing the Right Methodology

Methodology: Rules-Based
Pros:
• High Precision
• Optimal for Small Datasets
Cons:
• Heavy Maintenance Required
• Performance Degradation as Rule Set and Datasets Increase
• Re-writing of Rules Required as Additional Languages are Present

Methodology: Statistics-Based
Pros:
• Language Agnostic
• Entity Agnostic
• Optimal for Large Datasets
Cons:
• Overkill for Small Datasets

Page 9: Large Scale Entity Resolution, LexisNexis

Why Statistically-Based Systems Excel

“The advantage of this [statistical] approach over hand-coded rules is that the models develop probabilistic rules of which human experts are often not aware. We noticed that many of the rules that the system had automatically learned from the data differed in subtle but important ways from the rules established by human experts.”
- Ray Kurzweil, How To Create A Mind (in reference to using statistical approaches for speech recognition technology)

Page 10: Large Scale Entity Resolution, LexisNexis

Consideration #1: “Dirty” Data

US Consumer Data: Frequent Zip Code Patterns

US Consumer Data: Frequent Phone Number Values

International Cargo Shipping Data: Shipper Names
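Frequent-value listings like those named above are typically produced by simple profiling: count how often each value occurs in a field and inspect the top of the list for placeholders. A minimal sketch, assuming a made-up list of phone numbers; the helper name `frequent_values` is hypothetical.

```python
from collections import Counter

def frequent_values(values, top_n=3):
    """Return the top_n most common non-empty values with their share of rows."""
    cleaned = [v.strip() for v in values if v and v.strip()]
    counts = Counter(cleaned)
    total = len(cleaned)
    return [(value, count / total) for value, count in counts.most_common(top_n)]

# Hypothetical phone field: placeholder values crowd out real numbers.
phones = [
    "000-000-0000", "999-999-9999", "000-000-0000",
    "202-555-0143", "999-999-9999", "000-000-0000",
]
top_phones = frequent_values(phones, top_n=2)
```

A value that accounts for an implausibly large share of a field (half of the rows in this toy list) is a sign of dirty data and should carry little or no weight when matching.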

Page 11: Large Scale Entity Resolution, LexisNexis

Consideration #2: Incomplete Data

Null Field Value Scenarios

Partial Field Value Scenarios

Cluster #   F Name   M Name   L Name
1           Sardar   Khan     Niazi
2           S.       K.       Niazi
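The two clusters above differ only in that one holds initials where the other holds full names. A hedged sketch of a comparison that tolerates partial and null values follows; the function names and the decision to treat a null as compatible are assumptions, not something the slide specifies.

```python
def name_part_compatible(a, b):
    """True if two name parts are equal, or one is an initial of the other."""
    a, b = a.strip(". ").lower(), b.strip(". ").lower()
    if not a or not b:
        return True  # assumption: a null value neither confirms nor rules out a match
    if len(a) == 1 or len(b) == 1:
        return a[0] == b[0]
    return a == b

def names_compatible(name_a, name_b):
    """Compare (first, middle, last) tuples part by part."""
    return all(name_part_compatible(x, y) for x, y in zip(name_a, name_b))

cluster_1 = ("Sardar", "Khan", "Niazi")
cluster_2 = ("S.", "K.", "Niazi")
clusters_compatible = names_compatible(cluster_1, cluster_2)
```

Compatibility is weaker evidence than equality: an initial agrees with many full names, so in a statistics-based system a partial agreement would reasonably earn a lower score than a full one.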

Page 12: Large Scale Entity Resolution, LexisNexis

Consideration #3: Semi-Structured Data

International Postal Addresses

OFFICE # 406 4TH FLOOR SUNNY PLAZA HASRAT MOHANI ROAD I.I

101 ZUBAIDA GARDEN NEAR AWAMI MARKAZ SHAHRAH-E-FAISAL,KARACHI

101 BLOCK E FIRST FLOOR ZUBAIDA GARDENS NEAR AWAMI MARKAZ SHAHRAH-E-FA,KARACHI

E-101 ZUBAIDA GARDENS NEAR AWAMI MARKAZ SHAHRAH-E-FAISAL,KARACHI
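Because these addresses do not split cleanly into fields, one simple way to compare them is by token overlap. The sketch below uses Jaccard similarity over upper-cased alphanumeric tokens; this is an illustrative technique chosen here, not necessarily what the presenter's system does.

```python
import re

def address_tokens(address):
    """Upper-case the address and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[A-Z0-9]+", address.upper()))

def jaccard(address_a, address_b):
    """Shared tokens divided by all distinct tokens across both addresses."""
    tokens_a, tokens_b = address_tokens(address_a), address_tokens(address_b)
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

office = "OFFICE # 406 4TH FLOOR SUNNY PLAZA HASRAT MOHANI ROAD I.I"
garden = "101 ZUBAIDA GARDEN NEAR AWAMI MARKAZ SHAHRAH-E-FAISAL,KARACHI"
gardens = "E-101 ZUBAIDA GARDENS NEAR AWAMI MARKAZ SHAHRAH-E-FAISAL,KARACHI"

same_place_score = jaccard(garden, gardens)
different_place_score = jaccard(office, gardens)
```

The two Zubaida Garden variants share nine of eleven distinct tokens, while the Sunny Plaza address shares none with them. Exact token matching still misses GARDEN versus GARDENS, which is where stemming or fuzzy token comparison would come in.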

Page 13: Large Scale Entity Resolution, LexisNexis

Consideration #4: Semi-Structured Data

US Postal Addresses

Street Name        City Name     State Name
939 JEFFERSON ST   Bakersfield   California
110 E ELM ST       Ashland       North Carolina
426 NEW YORK AVE   Newton        Ohio
212 E MAIN ST      Brookfield    Connecticut
1900 EAGLE DR      Middletown    Maryland

Average Specificity: Street Name 19.63, City Name 11.12, State Name 5.03

Location: 14.03
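The slide does not say how specificity is computed. One common way to quantify how identifying a value is, assumed here purely for illustration, is its information content: the negative base-2 logarithm of its relative frequency, so that rare values score high and common ones score low. The toy state list below is hypothetical.

```python
import math
from collections import Counter

def specificity(values):
    """Map each distinct value to -log2(relative frequency), in bits."""
    counts = Counter(values)
    total = len(values)
    return {value: -math.log2(count / total) for value, count in counts.items()}

def average_specificity(values):
    """Average the per-value specificity over all rows of the field."""
    scores = specificity(values)
    return sum(scores[v] for v in values) / len(values)

# Hypothetical field contents: a few states repeated many times.
states = ["Ohio", "Ohio", "Ohio", "Ohio",
          "Maryland", "Maryland", "California", "Connecticut"]

state_scores = specificity(states)
state_average = average_specificity(states)
```

Under this assumption a field with many distinct values, such as street name, averages more bits than a field with few, such as state name, which is consistent with the ordering of the averages on the slide.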

Page 14: Large Scale Entity Resolution, LexisNexis

Entity Resolution Benefits

Which Scenario is More Optimal for Your Business?

Page 15: Large Scale Entity Resolution, LexisNexis

Entity Resolution Vision

• Across industry and government, many initiatives and missions boil down to 4 primary entity types:
  • People
  • Businesses/Organizations
  • Locations
  • Assets

• A deeper understanding of entities and their interconnections translates to:
  • Increased successes in cracking fraud, waste and abuse
  • Better matching of people to people across social networks
  • Stronger indicators of supply chain risk for the enterprise

Page 16: Large Scale Entity Resolution, LexisNexis

Entity Resolution Vision

From a technical implementation standpoint, can scientific findings pertaining to the neocortex help us further revolutionize entity resolution technology as it stands today?

• Our statistical approach has us heading in the right direction
• We are continuously finding new ways to represent the hierarchical nature of entities
• We should take heed of the brain’s innate ability to “prune”, while possibly looking at ways to emulate “pruning” so that unnecessary retention of data with little to no value doesn’t continue to bog the enterprise down

Page 17: Large Scale Entity Resolution, LexisNexis

Mary Galvin
Technical Consultant
LexisNexis Special Services, Inc. (LNSSI)
LexisNexis | Risk Solutions
202.595.4043 Mobile

Q&A