A Primer on Entity Resolution

Uploaded by benjamin-bengfort on 07-Aug-2015

Category: Data & Analytics

TRANSCRIPT

Page 1: A Primer on Entity Resolution

A Primer on Entity Resolution

Page 2: A Primer on Entity Resolution

Workshop Objectives

● Introduce entity resolution theory and tasks
● Similarity scores and similarity vectors
● Pairwise matching with the Fellegi Sunter algorithm
● Clustering and blocking for deduplication
● Final notes on entity resolution

Page 3: A Primer on Entity Resolution

Entity Resolution Theory

Page 4: A Primer on Entity Resolution

Entity Resolution refers to techniques that identify, group, and link digital mentions or

manifestations of some object in the real world.

Page 5: A Primer on Entity Resolution

In the Data Science Pipeline, ER is generally a wrangling technique.

(Figure: the Data Science Pipeline, in which ingestion and normalization feed wrangling; computation and storage support feature analysis, model builds, model selection and monitoring, cross validation, and an API, with feedback from interaction.)

Page 6: A Primer on Entity Resolution

Information Quality

- Creation of high quality data sets
- Reduction in the number of instances in machine learning models
- Reduction in the amount of covariance, and therefore collinearity, of predictor variables
- Simplification of relationships

Page 7: A Primer on Entity Resolution

Graph Analysis Simplification and Connection

(Figure: a graph linking several redacted email addresses to the people Ben, Rebecca, Allen, Tony, and Selma; resolving many addresses to one person both simplifies and connects the graph.)

Page 8: A Primer on Entity Resolution

Machine Learning and ER

- Heterogeneous data: unstructured records
- Larger and more varied datasets
- Multi-domain and multi-relational data
- Varied applications (web and mobile)

Parallel, probabilistic methods are required.**

** Although this is often debated in various related domains.

Page 9: A Primer on Entity Resolution

Entity Resolution Tasks

Deduplication

Record Linkage

Canonicalization

Referencing

Page 10: A Primer on Entity Resolution

Deduplication

- Primary consideration in ER
- Cluster records that correspond to the same real world entity, normalizing a schema
- Reduces the number of records in the dataset
- Variant: compute a cluster representative

Page 11: A Primer on Entity Resolution

Record Linkage

- Match records from one deduplicated data store to another (bipartite)

- K-partite linkage links records in multiple data stores and their various associations

- Generally proposed in relational data stores, but more frequently applied to unstructured records from various sources.

Page 12: A Primer on Entity Resolution

Referencing

- Also known as entity disambiguation
- Match noisy records to a clean, deduplicated reference table that is already canonicalized
- Generally used to atomize multiple records to some primary key and donate extra information to the record

Page 13: A Primer on Entity Resolution

Canonicalization

- Compute a representative record
- Generally the "most complete" record
- Imputation of missing attributes via merging
- Attribute selection based on the most likely candidate for downstream matching

Page 14: A Primer on Entity Resolution

Notation

- R: set of records
- M: set of matches
- N: set of non-matches
- E: set of entities
- L: set of links

Compare (Mt, Nt, Et, Lt) ⇔ (Mp, Np, Ep, Lp), where t = true and p = predicted.

Page 15: A Primer on Entity Resolution

Key Assumptions

- Every entity refers to a real world object (e.g. there are no "fake" instances)
- References or sources (for record linkage) include no duplicates (integrity constraints)
- If two records are identical, they are true matches: (x, x) ∈ Mt

Page 17: A Primer on Entity Resolution

Similarity

Page 18: A Primer on Entity Resolution

At the heart of any entity resolution task is the computation of similarity or distance.

Page 19: A Primer on Entity Resolution

For two records, x and y, compute a similarity vector over their aligned attribute pairs:

[match_score(x_attr, y_attr) for x_attr, y_attr in zip(x, y)]

where match_score is a per-attribute function that computes either a boolean (match, not match) or a real valued distance score:

match_score ∈ [0,1]*
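As a runnable sketch of the comprehension above (not code from the slides), records can be keyed on shared attribute names; the `exact_match` scorer and the example attributes are illustrative assumptions:

```python
# Minimal sketch of building a similarity vector, assuming records are
# dicts and each shared attribute has its own match_score function.

def exact_match(a, b):
    """Boolean match score: 1 if the values are equal, else 0."""
    return 1 if a == b else 0

def similarity_vector(x, y, scorers):
    """Apply a per-attribute match_score function to each attribute pair."""
    return [score(x[attr], y[attr]) for attr, score in scorers.items()]

x = {"manufacturer": "phantom efx", "price": 19.99}
y = {"manufacturer": "phantom efx", "price": 17.24}
scorers = {"manufacturer": exact_match, "price": exact_match}

print(similarity_vector(x, y, scorers))  # [1, 0]
```

Real valued scorers (edit distance, numeric difference) slot into the same `scorers` mapping without changing the driver.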

Page 20: A Primer on Entity Resolution

x = {
    'id': 'b0000c7fpt',
    'title': 'reel deal casino shuffle master edition',
    'description': 'reel deal casino shuffle master edition is ...',
    'manufacturer': 'phantom efx',
    'price': 19.99,
    'source': 'amazon',
}

y = {
    'id': '17175991674191849246',
    'name': 'phantom efx reel deal casino shuffle master edition',
    'description': 'reel deal casino shuffle master ed. is ...',
    'manufacturer': None,
    'price': 17.24,
    'source': 'google',
}

Page 21: A Primer on Entity Resolution

# similarity vector is a match score of:

# [name_score, description_score, manufacturer_score, price_score]

# Boolean Match

similarity(x,y) == [0, 1, 0, 0]

# Real Valued Match

similarity(x,y) == [0.83, 1.0, 0, 2.75]

Page 22: A Primer on Entity Resolution

Match Scores Reference

String Matching
- Edit Distance: Levenshtein, Smith-Waterman, Affine
- Alignment: Jaro-Winkler, Soft-TFIDF, Monge-Elkan
- Phonetic: Soundex, Translation

Distance Metrics
- Euclidean, Manhattan, Minkowski
- Text Analytics: Jaccard, TFIDF, Cosine similarity

Relational Matching
- Set Based: Dice, Tanimoto (Jaccard), Common Neighbors, Adar Weighted
- Aggregates: Average values, Max/Min values, Medians, Frequency (Mode)

Other Matching
- Numeric distance, Boolean equality, Fuzzy matching, Domain specific
- Gazettes: Lexical matching, Named Entities (NER)
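To make two entries from the reference concrete, here is a hedged sketch of Levenshtein edit distance (string matching) and Jaccard similarity over token sets (text analytics); both are standard formulations, not code from the slides:

```python
def levenshtein(s, t):
    """Levenshtein edit distance via a rolling dynamic programming row."""
    m, n = len(s), len(t)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                       # deletion
                        dp[j - 1] + 1,                   # insertion
                        prev + (s[i - 1] != t[j - 1]))   # substitution
            prev = cur
    return dp[n]

def jaccard(a, b):
    """Jaccard similarity of two whitespace-tokenized strings."""
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b) if a | b else 1.0

print(levenshtein("kitten", "sitting"))                  # 3
print(jaccard("reel deal casino", "reel deal poker"))    # 0.5
```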

Page 23: A Primer on Entity Resolution

Fellegi Sunter

Page 24: A Primer on Entity Resolution

Pairwise Matching: Given a vector of attribute match scores for a pair of records (x, y), compute Pmatch(x, y).

Page 25: A Primer on Entity Resolution

Weighted Sum + Threshold

Pmatch = sum(weight * score for weight, score in zip(weights, vector))

- weights should sum to one
- determine a weight for each attribute match score
- higher weights for more predictive features
  - e.g. email is more predictive than username
  - the attribute value also contributes to predictability
- if the weighted score > threshold, then match
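A minimal runnable sketch of the weighted sum rule above; the example weights and the 0.5 threshold are illustrative assumptions:

```python
def weighted_match(vector, weights, threshold=0.5):
    """Weighted-sum pairwise matching: Pmatch is the dot product of the
    attribute match scores with weights that sum to one."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights should sum to one"
    p_match = sum(w * s for w, s in zip(weights, vector))
    return p_match, p_match > threshold

# name weighted more heavily than description or manufacturer
p, is_match = weighted_match([0.83, 1.0, 0.0], [0.5, 0.3, 0.2])
# weighted score 0.5*0.83 + 0.3*1.0 + 0.2*0.0 = 0.715 exceeds the threshold
```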

Page 26: A Primer on Entity Resolution

Rule Based Approach

- Formulate rules about the construction of a match for attribute collections:

  if score_name > 0.75 and score_price > 0.6

- Although formulating rules is hard, domain specific rules can be applied, making this a typical approach for many applications.
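The slide's example condition can be expressed directly as a predicate; the attribute names and thresholds mirror the example and are otherwise arbitrary:

```python
def rule_match(scores):
    """Rule based matching: the slide's example condition on two attributes."""
    return scores["name"] > 0.75 and scores["price"] > 0.6

print(rule_match({"name": 0.83, "price": 0.7}))   # True
print(rule_match({"name": 0.83, "price": 0.5}))   # False
```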

Page 27: A Primer on Entity Resolution

Modern record linkage theory was formalized in 1969 by Ivan Fellegi and Alan Sunter who proved that the probabilistic decision rule they described was optimal when the comparison attributes were conditionally independent.

Their pioneering work “A Theory For Record Linkage” remains the mathematical foundation for many record linkage applications even today.

Fellegi, Ivan P., and Alan B. Sunter. "A theory for record linkage." Journal of the American Statistical Association 64.328 (1969): 1183-1210.

Page 28: A Primer on Entity Resolution

Record Linkage Model

For two record sets, A and B, and a record pair r = (a, b) ∈ A × B:

γ(r) = [γ1(r), ..., γK(r)] is the similarity vector, where each γi is some match score function for the record pair.

M is the match set and U the non-match set.

Page 29: A Primer on Entity Resolution

Record Linkage Model

Probabilistic linkage is based on the likelihood ratio:

R(r) = P(γ(r) | r ∈ M) / P(γ(r) | r ∈ U)

Linkage Rule L(tl, tu), with lower and upper thresholds:

- R(r) ≥ tu → Match
- tl < R(r) < tu → Uncertain
- R(r) ≤ tl → Non-Match
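Under the conditional independence assumption discussed later, R(r) factors into per-attribute ratios. This sketch uses hypothetical per-attribute agreement probabilities and thresholds, chosen only for illustration:

```python
from math import prod

def likelihood_ratio(gamma, m_probs, u_probs):
    """R(r) = P(gamma | M) / P(gamma | U), assuming conditional
    independence: an agreeing component contributes m_i / u_i and a
    disagreeing one contributes (1 - m_i) / (1 - u_i)."""
    return prod(
        (m / u) if g else ((1 - m) / (1 - u))
        for g, m, u in zip(gamma, m_probs, u_probs)
    )

def decide(r, t_lower, t_upper):
    """Apply the linkage rule L(tl, tu) to a likelihood ratio."""
    if r >= t_upper:
        return "match"
    if r <= t_lower:
        return "non-match"
    return "uncertain"

# hypothetical per-attribute agreement probabilities
m_probs, u_probs = [0.9, 0.8], [0.1, 0.3]
r = likelihood_ratio([1, 1], m_probs, u_probs)   # (0.9/0.1) * (0.8/0.3) = 24
print(decide(r, t_lower=0.1, t_upper=10))        # match
```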

Page 30: A Primer on Entity Resolution

Linkage Rule Error

- Type I Error: a non-match is called a match.
- Type II Error: a match is called a non-match.

Page 31: A Primer on Entity Resolution

Optimizing a Linkage Rule

L*(t*l, t*u) is optimal in Γ (similarity vector space) at error levels (μ, λ) if:

- L* bounds the Type I and Type II errors at μ and λ
- L* has the least conditional probability of not making a decision, i.e. it minimizes the uncertainty range in R(r)

Page 32: A Primer on Entity Resolution

L* Discovery

Given N records in Γ (e.g. N similarity vectors):

1. Sort the records decreasing by R(r) = m(γ) / u(γ).
2. Select indices n and n′ such that the cumulative Type I error over pairs 1, ..., n is at most μ and the cumulative Type II error over pairs n′, ..., N is at most λ; the pairs n+1, ..., n′−1 fall in the uncertain region.

Page 33: A Primer on Entity Resolution

Practical Application of FS

γ is high dimensional: the m(γ) and u(γ) computations are inefficient.

Typically a naive Bayes assumption is made about the conditional independence of the features in γ given a match or a non-match.

Computing P(γ | r ∈ M) requires knowledge of matches:

- Supervised machine learning with a training set.
- Expectation Maximization (EM) to train parameters.
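With a labeled training set, the per-attribute m and u probabilities can be estimated by simple counting; this is a hedged supervised sketch on boolean similarity vectors (EM would estimate the same quantities without labels):

```python
def estimate_probs(vectors, labels):
    """Estimate m_i = P(attribute i agrees | match) and
    u_i = P(attribute i agrees | non-match) from labeled boolean
    similarity vectors (label 1 = match, 0 = non-match)."""
    k = len(vectors[0])
    m_counts, u_counts = [0] * k, [0] * k
    n_match = sum(labels)
    n_non = len(labels) - n_match
    for gamma, y in zip(vectors, labels):
        counts = m_counts if y else u_counts
        for i, g in enumerate(gamma):
            counts[i] += g
    return ([c / n_match for c in m_counts],
            [c / n_non for c in u_counts])

vectors = [[1, 1], [1, 0], [0, 0], [0, 1]]
labels = [1, 1, 0, 0]
m_probs, u_probs = estimate_probs(vectors, labels)
print(m_probs, u_probs)  # [1.0, 0.5] [0.0, 0.5]
```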

Page 34: A Primer on Entity Resolution

Machine Learning Parameters: Supervised Methods

- Decision Trees
  - Cochinwala, Munir, et al. "Efficient data reconciliation." Information Sciences 137.1 (2001): 1-15.
- Support Vector Machines
  - Bilenko, Mikhail, and Raymond J. Mooney. "Adaptive duplicate detection using learnable string similarity measures." Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2003.
  - Christen, Peter. "Automatic record linkage using seeded nearest neighbour and support vector machine classification." Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2008.
- Ensembles of Classifiers
  - Chen, Zhaoqi, Dmitri V. Kalashnikov, and Sharad Mehrotra. "Exploiting context analysis for combining multiple entity resolution systems." Proceedings of the 2009 ACM SIGMOD International Conference on Management of data. ACM, 2009.
- Conditional Random Fields
  - Gupta, Rahul, and Sunita Sarawagi. "Answering table augmentation queries from unstructured lists on the web." Proceedings of the VLDB Endowment 2.1 (2009): 289-300.

Page 35: A Primer on Entity Resolution

Machine Learning Parameters: Unsupervised Methods

- Expectation Maximization
  - Winkler, William E. "Overview of record linkage and current research directions." Bureau of the Census, 2006.
  - Herzog, Thomas N., Fritz J. Scheuren, and William E. Winkler. Data Quality and Record Linkage Techniques. Springer Science & Business Media, 2007.
- Hierarchical Clustering
  - Ravikumar, Pradeep, and William W. Cohen. "A hierarchical graphical model for record linkage." Proceedings of the 20th conference on Uncertainty in artificial intelligence. AUAI Press, 2004.

Active Learning Methods

- Committee of Classifiers
  - Sarawagi, Sunita, and Anuradha Bhamidipaty. "Interactive deduplication using active learning." Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2002.
  - Tejada, Sheila, Craig A. Knoblock, and Steven Minton. "Learning object identification rules for information integration." Information Systems 26.8 (2001): 607-633.

Page 36: A Primer on Entity Resolution

Implementing Papers

Luckily, all of these models are in Scikit-Learn.

Considerations:
- Building training sets is hard:
  - Most records are easy non-matches
  - Record pairs can be ambiguous
- Class imbalance: more negatives than positives

Machine Learning + Fellegi Sunter is the state of the art.

Page 37: A Primer on Entity Resolution

Clustering & Blocking

Page 38: A Primer on Entity Resolution

To obtain a supervised training set, start by using clustering, then add active learning techniques to propose items to knowledge engineers for labeling.

Page 39: A Primer on Entity Resolution

Advantages to Clusters

- Resolution decisions are not made simply on pairwise comparisons, but search a larger space.
- Can use a variety of algorithms such that:
  - The number of clusters is not known in advance
  - There are numerous small, singleton clusters
  - The input is a pairwise similarity graph

Page 40: A Primer on Entity Resolution

Requirement: Blocking

- The naive approach is |R|² comparisons.
- Consider 100,000 products from 10 online stores: that is 1,000,000,000,000 comparisons.
- At 1 μs per comparison, that is 11.6 days.
- Most pairs are not going to be matches.
- Can we block on product category?
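A blocking sketch: candidate pairs are generated only within blocks that share a key, avoiding the full |R|² comparison. The product-category key and the toy records are illustrative assumptions:

```python
from collections import defaultdict
from itertools import combinations

def block_pairs(records, key):
    """Yield candidate pairs only within blocks sharing a blocking key."""
    blocks = defaultdict(list)
    for r in records:
        blocks[key(r)].append(r)
    for block in blocks.values():
        yield from combinations(block, 2)

records = [
    {"id": 1, "category": "games"},
    {"id": 2, "category": "games"},
    {"id": 3, "category": "music"},
]
pairs = list(block_pairs(records, key=lambda r: r["category"]))
print(len(pairs))  # 1 candidate pair instead of 3
```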

Page 41: A Primer on Entity Resolution

Canopy Clustering

- Often used as a pre-clustering optimization for approaches that must do pairwise comparisons, e.g. K-Means or Hierarchical Clustering

- Can be run in parallel, and is often used in Big Data systems (implementations exist in MapReduce on Hadoop)

- Use distance metric on similarity vectors for computation.

Page 42: A Primer on Entity Resolution

Canopy Clustering

The algorithm begins with two thresholds, T1 and T2, the loose and tight distances respectively, where T1 > T2.

1. Remove a point from the set and start a new "canopy".
2. For each point in the set, assign it to the new canopy if the distance is less than the loose distance T1.
3. If the distance is also less than T2, remove it from the original set completely.
4. Repeat until there are no more data points to cluster.
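The four steps above can be sketched directly; the 1-D points, the absolute-difference distance, and the thresholds are illustrative assumptions:

```python
def canopy_cluster(points, distance, t1, t2):
    """Canopy clustering with loose threshold t1 and tight threshold t2
    (t1 > t2), following the four steps above."""
    assert t1 > t2
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)        # 1. start a new canopy
        canopy = [center]
        still = []
        for p in remaining:
            d = distance(center, p)
            if d < t1:                   # 2. within the loose distance
                canopy.append(p)
            if d >= t2:                  # 3. remove only tight-distance points
                still.append(p)
        remaining = still                # 4. repeat on what is left
        canopies.append(canopy)
    return canopies

points = [0.0, 0.1, 0.5, 5.0]
canopies = canopy_cluster(points, distance=lambda a, b: abs(a - b),
                          t1=1.0, t2=0.2)
print(canopies)  # [[0.0, 0.1, 0.5], [0.5], [5.0]]
```

Note that 0.5 appears in two canopies: points within the loose distance but outside the tight distance may overlap, which is what makes canopies a safe pre-clustering step.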

Pages 43-49: A Primer on Entity Resolution (canopy clustering illustration)
Page 50: A Primer on Entity Resolution

Canopy Clustering

By setting threshold values relatively permissively, canopies will capture more data. In practice, most canopies will contain only a single point and can be ignored. Pairwise comparisons are made between the similarity vectors inside each canopy.

Page 51: A Primer on Entity Resolution

Final Notes

Page 52: A Primer on Entity Resolution

Data Preparation

Good data preparation can go a long way toward good results, and is most of the work.

- Data Normalization
- Schema Normalization
- Imputation

Page 53: A Primer on Entity Resolution

Data Normalization

- convert to all lower case, remove whitespace
- run a spell checker to remove known typographical errors
- expand abbreviations, replace nicknames
- perform lookups in lexicons
- tokenize, stem, or lemmatize words
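A few of these steps can be sketched in one small function; the nickname and abbreviation tables are tiny illustrative stand-ins for real lexicons:

```python
# hypothetical lookup tables standing in for real lexicons
NICKNAMES = {"joe": "joseph", "ben": "benjamin"}
ABBREVIATIONS = {"ed.": "edition", "jr.": "junior"}

def normalize(text):
    """Lowercase, collapse whitespace, expand abbreviations and nicknames."""
    tokens = text.lower().split()
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]
    tokens = [NICKNAMES.get(t, t) for t in tokens]
    return " ".join(tokens)

print(normalize("Reel Deal  Casino Shuffle Master Ed."))
# reel deal casino shuffle master edition
```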

Page 54: A Primer on Entity Resolution

Schema Normalization

- match attribute names (title → name)
- compound attributes (full name → first, last)
- nested attributes, particularly boolean attributes
- deal with set and list valued attributes
- segment records from raw text

Page 55: A Primer on Entity Resolution

Imputation

- How do you deal with missing values? Set all to nan or None; remove empty strings.
- How do you compare missing values? Omit them from the similarity vector?
- Fill in missing values with an aggregate (mean) or with some default value.
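The mean-fill strategy above can be sketched for a single numeric attribute; the record shape and prices are illustrative:

```python
def impute(records, attr, default=None):
    """Fill missing values of a numeric attribute with the mean of the
    observed values, falling back to a default when nothing is observed."""
    observed = [r[attr] for r in records if r.get(attr) is not None]
    fill = sum(observed) / len(observed) if observed else default
    return [dict(r, **{attr: r[attr] if r.get(attr) is not None else fill})
            for r in records]

records = [{"price": 19.99}, {"price": None}, {"price": 17.24}]
filled = impute(records, "price")
# the missing price becomes the mean of 19.99 and 17.24
```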

Page 56: A Primer on Entity Resolution

Canonicalization

Merge information from duplicates to a representative entity that contains maximal information - consider downstream resolution.

Name, Email, Phone, Address
Joe Halifax, [email protected], null, New York, NY
Joseph Halifax Jr., null, (212) 123-4444, 130 5th Ave Apt 12, New York, NY

merged →

Joseph Halifax, [email protected], (212) 123-4444, 130 5th Ave Apt 12, New York, NY
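One simple "most complete" merge strategy is to prefer the longest non-null value per attribute; this is a sketch of that heuristic, not the slide's exact procedure, and the sample records are abbreviated:

```python
def canonicalize(records):
    """Merge duplicate records into a representative entity, preferring
    the longest non-null value per attribute ("most complete")."""
    attrs = {a for r in records for a in r}
    merged = {}
    for a in attrs:
        values = [r.get(a) for r in records if r.get(a) not in (None, "null")]
        merged[a] = max(values, key=lambda v: len(str(v))) if values else None
    return merged

dupes = [
    {"name": "Joe Halifax", "phone": None},
    {"name": "Joseph Halifax Jr.", "phone": "(212) 123-4444"},
]
print(canonicalize(dupes))
```

Note that picking a full name this way can disagree with downstream matching needs, which is why the slide stresses choosing attributes for the most likely downstream candidate.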

Page 57: A Primer on Entity Resolution

Evaluation

- Number of predicted matching pairs, cluster level metrics
- Precision/Recall → F1 score

                Match          Miss
Actual Match    True Match     False Miss     |A|
Actual Miss     False Match    True Miss      |B|
                |P(A)|         |P(B)|         total
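Pairwise precision, recall, and F1 follow directly from the table above; this sketch compares predicted and actual match-pair sets, with toy pairs for illustration:

```python
def pairwise_f1(predicted, actual):
    """Precision, recall, and F1 over predicted vs. actual match pairs."""
    predicted, actual = set(predicted), set(actual)
    tp = len(predicted & actual)                      # true matches
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(actual) if actual else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f1 = pairwise_f1({(1, 2), (3, 4)}, {(1, 2), (5, 6)})
print(p, r, f1)  # 0.5 0.5 0.5
```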

Page 58: A Primer on Entity Resolution

Conclusion