Large-Scale Deduplication with Constraints using Dedupalog
Arvind Arasu et al.
Definitions
Deduplication: the process of identifying references in data records that refer to the same real-world entity.
Collective Deduplication: a generalization of deduplication in which multiple related entity types in a set of records are deduplicated simultaneously.
Weaknesses of Prior Approaches
Allow clustering of only a single entity type in isolation, so they cannot answer queries such as: how many distinct papers were in ICDE 2008?
Ignore constraints, or use constraints in an ad-hoc way that prevents users from flexibly combining them.
Constraints
“ICDE” and “Conference on Data Engineering” are the same conference.
Conference references in different cities are in different years.
Author references that do not share any common coauthors do not refer to the same author.
Dedupalog
Collective deduplication
Declarative
Domain independent
Expressive enough to encode many constraints (hard and soft)
Scales to large datasets
Dedupalog
1. Provide a set of input tables that contain references to be deduplicated and other useful info (e.g. results of similarity computation)
2. Define a list of entity references to deduplicate (e.g. authors, papers, publishers)
3. Define a Dedupalog program
4. Execute the Dedupalog program
Notation
Entity reference:
Clustering Relation (these are duplicates):
Input Tables (example)
Entity Reference Tables (example)
Dedupalog Program (example)
*Conflicts arising from the rules are detected by the system and reported to the user.
Soft-complete Rules
“papers with similar titles are likely duplicates”
Paper references whose titles appear in TitleSimilar are likely to be clustered together.
Paper references whose titles do not appear in TitleSimilar are not likely to be clustered together.
Soft-incomplete Rules
“papers with very similar titles are likely duplicates”
Paper references whose titles appear in TitleVerySimilar are likely to be clustered together.
*This rule says nothing about paper references whose titles do not appear in TitleVerySimilar.
Hard Rules
“the publisher references listed in the table PublisherEQ must be clustered together”
“the publisher references in PublisherNEQ must not be clustered together”
must-link
cannot-link
Hard rules may only contain a positive body.
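The must-link/cannot-link view suggests a standard enforcement mechanism. Below is a minimal sketch, assuming must-links and cannot-links arrive as pair lists; it uses union-find (a common technique, not necessarily the paper's implementation) to merge must-linked references and surface the kind of conflict that Dedupalog reports to the user. All names are illustrative.

```python
# Illustrative enforcement of hard must-link/cannot-link rules via
# union-find. A cannot-link pair that ends up inside one must-linked
# group is a conflict.

class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def check_hard_rules(must_link, cannot_link):
    """Return the cannot-link pairs violated by the must-links."""
    uf = UnionFind()
    for a, b in must_link:
        uf.union(a, b)
    return [(a, b) for a, b in cannot_link if uf.find(a) == uf.find(b)]
```

For instance, must-links (p1, p2) and (p2, p3) place all three references in one group, so a cannot-link (p1, p3) is reported as a conflict.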
Complex Hard Rules
“whenever we cluster two papers, we must also cluster the publishers of those papers”
Such constraints are central to collective deduplication
At most one entity reference is allowed in the body of the rule as in this example.
Complex Negative Rules
“two distinct author references on a single paper cannot be the same person”
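To make the rule concrete, here is a small illustrative sketch (the input format is an assumption, not the paper's schema) that derives the cannot-link pairs implied by each paper's author list:

```python
# Derive cannot-link pairs from the rule "two distinct author
# references on a single paper cannot be the same person".
from itertools import combinations

def author_cannot_links(paper_authors):
    """paper_authors: dict mapping paper id -> list of author reference ids.
    Returns the set of author-reference pairs that must not be clustered."""
    pairs = set()
    for authors in paper_authors.values():
        # every pair of distinct author references on the same paper
        for a, b in combinations(sorted(set(authors)), 2):
            pairs.add((a, b))
    return pairs
```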
Recursive Rules
“Authors that do not share common coauthors are unlikely to be duplicates”
These constraints require inspecting the current clustering, hence the recursion.
Recursion is only allowed in soft rules.
Cost
The cost of a clustering, J*, with respect to a constraint, gamma, is the number of tuples on which the constraint and the clustering disagree. If gamma is soft-complete, its cost on J* counts disagreements in both directions: pairs that gamma says to cluster but J* separates, and pairs that J* clusters together without support from gamma.
Cost
The cost is 2 because of two violations:
1) d belongs in the same cluster as c
2) c does not belong in the same cluster as b
*The goal of Dedupalog is to find a clustering that incurs the minimum total cost.
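As an illustration of this cost, here is a minimal sketch, assuming a clustering is given as a reference-to-cluster map and a soft-complete constraint as the set of pairs it wants clustered; the function name and encoding are hypothetical, not from the paper:

```python
# Cost of a clustering under a single soft-complete constraint:
# every pair on which the clustering and the constraint disagree
# (separated-but-wanted, or clustered-but-unwanted) costs 1.
from itertools import combinations

def soft_complete_cost(clusters, should_cluster):
    """clusters: dict mapping each reference to a cluster id.
    should_cluster: set of frozenset pairs the constraint wants together."""
    cost = 0
    for u, v in combinations(sorted(clusters), 2):
        same = clusters[u] == clusters[v]
        wanted = frozenset((u, v)) in should_cluster
        if same != wanted:  # disagreement in either direction is penalized
            cost += 1
    return cost
```

For example, with clusters {'a': 1, 'b': 1, 'c': 2} and a constraint wanting (a, b) and (b, c) together, the cost is 1: the pair (b, c) is separated while every other pair agrees.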
Main Algorithm
Minimizing this cost turns out to be NP-hard, even for a single soft-complete constraint.
For a large fragment of Dedupalog, the following algorithm is a constant factor approximation of the optimal.
Clustering Graphs
A clustering graph is a pair (V, Phi):
V is a set of nodes, one per entity reference.
Phi is a symmetric function that assigns each edge, i.e., each pair of nodes (u, v), a label:
[+] : soft-plus
[-] : soft-minus
[=] : hard-plus
[≠] : hard-minus
Clustering Graphs
Uniformly choose a random permutation of the nodes.
This gives a partial order on the edges.
Harden each edge in order: change soft edges into hard edges, applying these two rules:
A clustering is all [=]-connected components.
Guarantees: a constant-factor approximation of the optimal clustering, for a large fragment of Dedupalog.
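The hardening procedure above belongs to the family of pivot-based correlation clustering approximations. As a rough sketch only (it ignores hard edges and the paper's edge-ordering details, and all names are mine, not the paper's), a pivot algorithm in that family looks like:

```python
# Pivot-style correlation clustering sketch: pick nodes in a random
# order; each unclustered pivot grabs its unclustered [+]-neighbors.
# Simplified: only soft [+]/[-] edges, no hard must/cannot-links.
import random

def pivot_cluster(nodes, plus_edges, seed=None):
    """plus_edges: set of frozenset pairs labeled [+]; every other
    pair is treated as [-]. Returns a list of clusters (sets)."""
    rng = random.Random(seed)
    order = list(nodes)
    rng.shuffle(order)              # random permutation of the nodes
    unclustered = set(nodes)
    clusters = []
    for pivot in order:
        if pivot not in unclustered:
            continue
        cluster = {pivot} | {v for v in unclustered
                             if frozenset((pivot, v)) in plus_edges}
        unclustered -= cluster
        clusters.append(cluster)
    return clusters
```

With a [+]-clique {a, b, c} and an isolated node d, any permutation yields the clusters {a, b, c} and {d}: whichever clique member is picked first grabs the other two.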
Creating the Clustering Graph
Perform forward voting.
Perform backward propagation.
This creates a clustering graph for each entity reference relation (i.e., Publisher!, Paper!, and Author!).
Physical Implementation and Optimization
Implicit representation of edges: some edge values are stored implicitly.
Choosing edge orderings: edges are ordered so that [+] edges are processed first.
Sort optimization.
Experiments—Cora dataset
•Standard: matching titles and running correlated clustering
•NEQ: Standard + an additional hard rule constraint: a list of conference papers known to be distinct from their journal papers
Experiments—Cora dataset
•Standard: Clustering based on string similarity
•Soft Constraint: Standard + the soft constraint "papers must be in a single conference"
•Hard Constraint: Standard + the hard constraint "papers must be in a single conference"
Experiments—ACM dataset
•No Constraints: String similarity and correlation clustering
•Constraints: No Constraints + the hard constraint "references with different years do not refer to the same conference"
The ACM subset spans 1988–1990; the full ACM dataset has 436,000 references.
Experiments—ACM dataset
Helps catch errors in records. Added a hard rule that says: "If two references refer to the same paper, then they must refer to the same conference."
On the subset it found 5 references that contained incorrect years
On the full dataset it found 152 suspect papers
Performance
Performance
•Vanilla: Clustering of the references by conference with a single soft constraint
•[=]: Vanilla with two additional hard constraints
•HMorphism: Vanilla + [=] + clustering conferences and papers with the constraint "conference papers appear in only one conference"
•NoStream: Vanilla with sort-optimization off
Interactive Deduplication
Manual clustering of Cora took a couple of hours, reaching 98% precision and recall.
Obtaining ground truth for the ACM subset took only 4 hours.
Conclusions
Proposed a novel language, Dedupalog
Validated its practicality on two datasets
Proved, via a novel algorithm, that a large syntactic fragment of Dedupalog admits a constant-factor approximation.
Strengths
Convey their language and algorithm effectively
Mostly good examples that help readers understand their contribution
Strong and meaningful results
Good contribution
Weaknesses
Some mislabeled figure references and occasional typos can cause grief.