Large-Scale Deduplication with Constraints using Dedupalog Arvind Arasu et al.

Post on 20-Dec-2015

Page 1: Large-Scale Deduplication with Constraints using Dedupalog Arvind Arasu et al

Large-Scale Deduplication with Constraints using Dedupalog

Arvind Arasu et al.

Page 2

Definitions

Deduplication: the process of identifying references in data records that refer to the same real-world entity.

Collective Deduplication: a generalization of deduplication in which one deduplicates references to several related types of real-world entities in a set of records at the same time.

Page 3

Weaknesses of Prior Approaches

•Allow clustering of only a single entity type in isolation, so they can’t answer queries such as: how many distinct papers were in ICDE 2008?

•Ignore constraints, or use constraints in an ad-hoc way, which prevents users from flexibly combining constraints.

Page 4

Constraints

“ICDE” and “Conference on Data Engineering” are the same conference.

Conference references in different cities are in different years.

Author references that do not share any common coauthors do not refer to the same author.

Page 5

Dedupalog

•Collective deduplication

•Declarative

•Domain independent

•Expressive enough to encode many constraints (hard and soft)

•Scales to large datasets

Page 6

Dedupalog

1. Provide a set of input tables that contain references to be deduplicated and other useful info (e.g. results of similarity computation)

2. Define a list of entity references to deduplicate (e.g. authors, papers, publishers)

3. Define a Dedupalog program

4. Execute the Dedupalog program

Page 7

Notation

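Entity reference relations are marked with "!", e.g. Paper!, Publisher!, Author!

Clustering relations (these are duplicates) are marked with "*", e.g. Paper*, Publisher*, Author*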

Page 8

Input Tables (example)

Page 9

Entity Reference Tables (example)

Page 10

Dedupalog Program (example)

*Conflicts that arise between the rules are detected by the system and reported to the user.

Page 11

Soft-complete Rules

“papers with similar titles are likely duplicates”

Paper references whose titles appear in TitleSimilar are likely to be clustered together.

Paper references whose titles do not appear in TitleSimilar are not likely to be clustered together.

Page 12

Soft-incomplete Rules

“papers with very similar titles are likely duplicates”

Paper references whose titles appear in TitleVerySimilar are likely to be clustered together.

*This rule says nothing about paper references whose titles do not appear in TitleVerySimilar.
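The difference between soft-complete and soft-incomplete rules can be sketched in Python. This is only an illustration and not Dedupalog syntax: the function names, the "+"/"-" vote labels, and the sample paper references are assumptions made for the example.

```python
from itertools import combinations

def soft_complete_votes(refs, similar):
    """Soft-complete: pairs listed in `similar` vote "+", every other pair votes "-"."""
    listed = {frozenset(pair) for pair in similar}
    return {
        (a, b): "+" if frozenset((a, b)) in listed else "-"
        for a, b in combinations(sorted(refs), 2)
    }

def soft_incomplete_votes(refs, very_similar):
    """Soft-incomplete: pairs listed in `very_similar` vote "+"; other pairs get no vote."""
    listed = {frozenset(pair) for pair in very_similar}
    return {
        (a, b): "+"
        for a, b in combinations(sorted(refs), 2)
        if frozenset((a, b)) in listed
    }

# Hypothetical paper references and similarity table.
papers = {"p1", "p2", "p3"}
title_pairs = {("p2", "p1")}
print(soft_complete_votes(papers, title_pairs))
print(soft_incomplete_votes(papers, title_pairs))
```

The soft-complete version has an opinion about every pair, while the soft-incomplete version stays silent about pairs that are not listed.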

Page 13

Hard Rules

must-link: “the publisher references listed in the table PublisherEQ must be clustered together”

cannot-link: “the publisher references in PublisherNEQ must not be clustered together”

Hard rules may only contain a positive body.
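As a rough illustration of what must-link and cannot-link mean for a finished clustering, the Python sketch below checks a clustering against two tables of pairs. The function name, the `cluster_of` mapping, and the publisher strings are hypothetical; the sketch only checks rules and does not show how Dedupalog enforces them.

```python
def violated_hard_rules(cluster_of, must_link, cannot_link):
    """Return the hard-rule pairs that a clustering breaks.

    cluster_of maps each reference to a cluster id.
    """
    broken = []
    for a, b in must_link:
        if cluster_of[a] != cluster_of[b]:
            broken.append(("must-link", a, b))
    for a, b in cannot_link:
        if cluster_of[a] == cluster_of[b]:
            broken.append(("cannot-link", a, b))
    return broken

# Hypothetical contents of PublisherEQ and PublisherNEQ.
publisher_eq = [("ACM", "Assoc. for Computing Machinery")]
publisher_neq = [("ACM", "IEEE")]
clustering = {"ACM": 0, "Assoc. for Computing Machinery": 1, "IEEE": 0}
print(violated_hard_rules(clustering, publisher_eq, publisher_neq))
```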

Page 14

Complex Hard Rules

“whenever we cluster two papers, we must also cluster the publishers of those papers”

Such constraints are central to collective deduplication

At most one entity reference is allowed in the body of the rule as in this example.
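A minimal Python sketch of this kind of constraint, assuming paper clusters are given as sets and a hypothetical `publisher_of` mapping links each paper reference to its publisher reference: every pair of papers in one cluster induces a must-link between their publishers.

```python
from itertools import combinations

def publisher_must_links(paper_clusters, publisher_of):
    """Derive publisher pairs that must be clustered because their papers were."""
    links = set()
    for cluster in paper_clusters:
        publishers = sorted({publisher_of[paper] for paper in cluster})
        links.update(combinations(publishers, 2))
    return links

# Hypothetical data: p1 and p2 were clustered, so their publishers must be too.
paper_clusters = [{"p1", "p2"}, {"p3"}]
publisher_of = {"p1": "pub_a", "p2": "pub_b", "p3": "pub_c"}
print(publisher_must_links(paper_clusters, publisher_of))
```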

Page 15

Complex Negative Rules

“two distinct author references on a single paper cannot be the same person”
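The quoted constraint can be read as generating cannot-link pairs. The Python sketch below shows that reading under the assumption that author references are grouped per paper in a hypothetical `authors_on_paper` mapping; it is not Dedupalog syntax.

```python
from itertools import combinations

def author_cannot_links(authors_on_paper):
    """Every two distinct author references on the same paper get a cannot-link."""
    links = set()
    for authors in authors_on_paper.values():
        links.update(combinations(sorted(set(authors)), 2))
    return links

# Hypothetical author references, keyed by paper reference.
authors_on_paper = {"p1": ["a1", "a2", "a3"], "p2": ["a4"]}
print(author_cannot_links(authors_on_paper))
```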

Page 16

Recursive Rules

“Authors that do not share common coauthors are unlikely to be duplicates”

These constraints require inspecting the current clustering, hence the recursion.

Recursion is only allowed in soft rules.
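To see why such a rule depends on the current clustering, consider the Python sketch below. The helper `share_coauthor`, the `coauthors` mapping, and the cluster ids are all assumptions for illustration: two author references share a coauthor once any of their coauthor references fall into the same cluster, so the answer changes as author references get clustered.

```python
def share_coauthor(a, b, coauthors, cluster_of):
    """True if a coauthor of `a` and a coauthor of `b` are currently in one cluster."""
    clusters_a = {cluster_of[c] for c in coauthors[a]}
    clusters_b = {cluster_of[c] for c in coauthors[b]}
    return not clusters_a.isdisjoint(clusters_b)

# Hypothetical data: a1 wrote with x1, a2 wrote with x2.
coauthors = {"a1": {"x1"}, "a2": {"x2"}}
before = {"x1": 0, "x2": 1}  # x1 and x2 not yet clustered
after = {"x1": 0, "x2": 0}   # x1 and x2 now clustered together
print(share_coauthor("a1", "a2", coauthors, before))
print(share_coauthor("a1", "a2", coauthors, after))
```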

Page 17

Cost

If gamma is soft-complete, then its cost on a clustering J* is the number of tuples on which the constraint, gamma, and the clustering disagree.

Page 18

Cost

The cost is 2 because of two violations:

1) d belongs in the same cluster as c

2) c does not belong in the same cluster as b

* The goal of Dedupalog is to find a clustering that incurs the minimum cost.
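The cost of a soft-complete constraint can be sketched in Python as a count of disagreeing pairs. The references a to d, the clusters, and the `similar` table below are assumed data chosen to reproduce the two violations listed above; the slide's own figure is not available in this transcript.

```python
from itertools import combinations

def soft_complete_cost(clusters, similar):
    """Count pairs on which a soft-complete constraint and a clustering disagree."""
    cluster_of = {ref: i for i, cluster in enumerate(clusters) for ref in cluster}
    listed = {frozenset(pair) for pair in similar}
    cost = 0
    for a, b in combinations(sorted(cluster_of), 2):
        together = cluster_of[a] == cluster_of[b]
        should_be_together = frozenset((a, b)) in listed
        if together != should_be_together:
            cost += 1
    return cost

# Assumed example: c and d are similar but split; b and c are together but not similar.
clusters = [{"a", "b", "c"}, {"d"}]
similar = [("a", "b"), ("a", "c"), ("c", "d")]
print(soft_complete_cost(clusters, similar))  # 2
```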

Page 19

Main Algorithm

Finding a minimum-cost clustering turns out to be NP-hard, even for a single soft-complete constraint.

For a large fragment of Dedupalog, the following algorithm is a constant-factor approximation of the optimum.

Page 20

Clustering Graphs

A clustering graph is a pair (V,Phi):

•V is a set of nodes, each of which corresponds to an entity reference.

•Phi is a symmetric function that assigns each pair of nodes (u,v), i.e. each edge, one of the labels:

[+] : soft-plus

[-] : soft-minus

[=] : hard-plus

[≠] : hard-minus

Page 21

Clustering Graphs

•Uniformly choose a random permutation of the nodes. This gives a partial order on the edges.

•Harden each edge in order: change soft edges into hard edges, and apply these two rules:

•A clustering is all [=] connected components.
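The hardening procedure can be sketched in Python as follows. Several parts are assumptions rather than the authors' exact method: the two rules are not reproduced in this transcript, so the sketch assumes they are "[=] and [=] give [=]" and "[=] and [≠] give [≠]"; edges are ranked by the positions of their endpoints in the permutation; and unlabelled pairs default to [-]. The sketch makes no claim about approximation quality.

```python
import random
from itertools import combinations, permutations

SOFT_PLUS, SOFT_MINUS, HARD_PLUS, HARD_MINUS = "[+]", "[-]", "[=]", "[≠]"
HARD = {HARD_PLUS, HARD_MINUS}

def edge(u, v):
    return frozenset((u, v))

def propagate(nodes, labels):
    """Apply the two assumed rules until nothing changes:
    [=](u,v) with [=](v,w) forces [=](u,w); [=](u,v) with [≠](v,w) forces [≠](u,w)."""
    changed = True
    while changed:
        changed = False
        for u, v, w in permutations(nodes, 3):
            forced = labels[edge(v, w)]
            if labels[edge(u, v)] != HARD_PLUS or forced not in HARD:
                continue
            current = labels[edge(u, w)]
            if current == forced:
                continue
            if current in HARD:
                raise ValueError(f"conflicting hard edges at {u!r}, {w!r}")
            labels[edge(u, w)] = forced
            changed = True

def cluster(nodes, known_labels, seed=None):
    """Harden edges in a random order and return the [=] connected components."""
    nodes = list(nodes)
    # Unlabelled pairs default to soft-minus (an assumption of this sketch).
    labels = {edge(u, v): SOFT_MINUS for u, v in combinations(nodes, 2)}
    labels.update({edge(u, v): lab for (u, v), lab in known_labels.items()})
    order = nodes[:]
    random.Random(seed).shuffle(order)
    rank = {n: i for i, n in enumerate(order)}
    propagate(nodes, labels)  # close the initial hard edges first
    for e in sorted(labels, key=lambda e: sorted(rank[n] for n in e)):
        if labels[e] == SOFT_PLUS:
            labels[e] = HARD_PLUS
        elif labels[e] == SOFT_MINUS:
            labels[e] = HARD_MINUS
        else:
            continue  # already hard, nothing to harden
        propagate(nodes, labels)
    # After propagation [=] is transitive, so a node's [=] neighbours are its cluster.
    return {
        frozenset([u] + [v for v in nodes if v != u and labels[edge(u, v)] == HARD_PLUS])
        for u in nodes
    }

print(cluster("abcd", {("a", "b"): SOFT_PLUS, ("c", "d"): SOFT_PLUS}, seed=1))
```

With a [≠] edge between a and c and [+] edges a-b and b-c, the outcome depends on the permutation: either {a, b} or {b, c} forms first, and the other [+] edge is then forced to [≠].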

Guarantees:

Page 22

Creating the Clustering Graph

•Perform forward voting

•Perform backward propagation

•Creates a clustering graph for each entity reference relation (i.e. Publisher!, Paper!, and Author!)

Page 23

Physical Implementation and Optimization

•Implicit Representation of Edges: implicitly store some edge values

•Choosing Edge Orderings: order edges so that [+] edges are processed first

•Sort Optimization
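One plausible reading of the implicit edge representation is that a common label is never stored and is returned as a default instead. The Python sketch below assumes that [-] is the implicit label; the class name and its methods are hypothetical and not taken from the system.

```python
class ImplicitEdges:
    """Edge labels where one label (assumed here to be [-]) is never stored."""

    def __init__(self, default="[-]"):
        self.default = default
        self._explicit = {}

    def set(self, u, v, label):
        key = frozenset((u, v))
        if label == self.default:
            # Storing the default label would waste space, so drop the entry.
            self._explicit.pop(key, None)
        else:
            self._explicit[key] = label

    def get(self, u, v):
        return self._explicit.get(frozenset((u, v)), self.default)

    def __len__(self):
        return len(self._explicit)

edges = ImplicitEdges()
edges.set("p1", "p2", "[+]")
print(edges.get("p2", "p1"), edges.get("p1", "p3"), len(edges))
```

Since most pairs of references are not similar, only the comparatively few non-default edges take up storage in this sketch.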

Page 24

Experiments—Cora dataset

•Standard: matching titles and running correlation clustering

•NEQ: Standard + an additional hard rule that lists conference papers known to be distinct from their journal versions

Page 25

Experiments—Cora dataset

•Standard: Clustering based on string similarity

•Soft Constraint: Standard + the soft constraint “papers must be in a single conference”

•Hard Constraint: Standard + the hard constraint “papers must be in a single conference”

Page 26

Experiments—ACM dataset

•No Constraints: String similarity and correlation clustering

•Constraints: No Constraints + the hard constraint “references with different years do not refer to the same conference”

ACM subset: spans 1988-1990

Full ACM dataset: 436,000 references

Page 27

Experiments—ACM dataset

Dedupalog helps catch errors in records.

Added a hard rule that says: “If two references refer to the same paper, then they must refer to the same conference”

On the subset, it found 5 references that contained incorrect years.

On the full dataset, it found 152 suspect papers.

Page 28

Performance

Page 29

Performance

•Vanilla: Clustering of the references by conference with a single soft constraint

•[=]: Vanilla with two additional hard constraints

•HMorphism: Vanilla + [=] + clustering of conferences and papers with the constraint “conference papers appear in only one conference”

•NoStream: Vanilla with sort-optimization off

Page 30

Interactive Deduplication

Manual clustering of Cora took a couple of hours, with 98% precision and recall.

Obtaining ground truth for the ACM subset took only 4 hours.

Page 31

Conclusions

Proposed a novel language, Dedupalog

Validated its practicality on two datasets

Proved, using a novel algorithm, that a large syntactic fragment of Dedupalog admits a constant-factor approximation

Page 32

Strengths

Convey their language and algorithm effectively

Mostly good examples that help readers understand their contribution

Strong and meaningful results

Good contribution

Page 33

Weaknesses

Some mislabeled figure references and occasional typos can cause grief.