Large-Scale Deduplication with Constraints using Dedupalog
Arvind Arasu et al.
Definitions
Deduplication: the process of identifying references in data records that refer to the same real-world entity.
Collective Deduplication: a generalization of deduplication in which multiple related entity types in a set of records are deduplicated simultaneously.
Weaknesses of Prior Approaches
Allow clustering of only a single entity type in isolation, so they cannot answer queries such as: how many distinct papers were in ICDE 2008?
Ignore constraints, or use constraints in an ad-hoc way that prevents users from flexibly combining them.
Constraints
“ICDE” and “Conference on Data Engineering” are the same conference.
Conference references in different cities are in different years.
Author references that do not share any common coauthors do not refer to the same author.
Dedupalog
Collective deduplication
Declarative
Domain independent
Expressive enough to encode many constraints (hard and soft)
Scales to large datasets
Dedupalog
1. Provide a set of input tables that contain references to be deduplicated and other useful info (e.g. results of similarity computation)
2. Define a list of entity references to deduplicate (e.g. authors, papers, publishers)
3. Define a Dedupalog program
4. Execute the Dedupalog program
Notation
Entity reference:
Clustering Relation (these are duplicates):
Input Tables (example)
Entity Reference Tables (example)
Dedupalog Program (example)
*Conflicts arising from the rules are detected by the system and reported to the user.
Soft-complete Rules
“papers with similar titles are likely duplicates”
Paper references whose titles appear in TitleSimilar are likely to be clustered together.
Paper references whose titles do not appear in TitleSimilar are not likely to be clustered together.
Soft-incomplete Rules
“papers with very similar titles are likely duplicates”
Paper references whose titles appear in TitleVerySimilar are likely to be clustered together.
*This rule says nothing about paper references whose titles do not appear in TitleVerySimilar.
Hard Rules
“the publisher references listed in the table PublisherEQ must be clustered together”
“the publisher references in PublisherNEQ must not be clustered together”
must-link
cannot-link
Hard rules may only contain a positive body.
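The must-link/cannot-link view suggests a standard enforcement mechanism. Below is a minimal sketch, assuming must-links and cannot-links arrive as pair lists; it uses union-find (a common technique, not necessarily the paper's implementation) to merge must-linked references and surface the kind of conflict that Dedupalog reports to the user. All names are illustrative.

```python
# Illustrative enforcement of hard must-link/cannot-link rules via
# union-find. A cannot-link pair that ends up inside one must-linked
# group is a conflict.

class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def check_hard_rules(must_link, cannot_link):
    """Return the cannot-link pairs violated by the must-links."""
    uf = UnionFind()
    for a, b in must_link:
        uf.union(a, b)
    return [(a, b) for a, b in cannot_link if uf.find(a) == uf.find(b)]
```

For instance, must-links (p1, p2) and (p2, p3) place all three references in one group, so a cannot-link (p1, p3) is reported as a conflict.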
Complex Hard Rules
“whenever we cluster two papers, we must also cluster the publishers of those papers”
Such constraints are central to collective deduplication
At most one entity reference is allowed in the body of the rule as in this example.
Complex Negative Rules
“two distinct author references on a single paper cannot be the same person”
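To make the rule concrete, here is a small illustrative sketch (the input format is an assumption, not the paper's schema) that derives the cannot-link pairs implied by each paper's author list:

```python
# Derive cannot-link pairs from the rule "two distinct author
# references on a single paper cannot be the same person".
from itertools import combinations

def author_cannot_links(paper_authors):
    """paper_authors: dict mapping paper id -> list of author reference ids.
    Returns the set of author-reference pairs that must not be clustered."""
    pairs = set()
    for authors in paper_authors.values():
        # every pair of distinct author references on the same paper
        for a, b in combinations(sorted(set(authors)), 2):
            pairs.add((a, b))
    return pairs
```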
Recursive Rules
“Authors that do not share common coauthors are unlikely to be duplicates”
These constraints require inspecting the current clustering, hence the recursion.
Recursion is only allowed in soft rules.
Cost
The cost of a clustering, J*, with respect to a constraint, gamma, is the number of tuples on which the constraint and the clustering disagree. If gamma is soft-complete, its cost on J* counts disagreements in both directions: pairs that gamma says to cluster but J* separates, and pairs that J* clusters together without support from gamma.
Cost
The cost is 2 because of two violations:
1) d belongs in the same cluster as c
2) c does not belong in the same cluster as b
*The goal of Dedupalog is to find a clustering that incurs the minimum total cost.
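As an illustration of this cost, here is a minimal sketch, assuming a clustering is given as a reference-to-cluster map and a soft-complete constraint as the set of pairs it wants clustered; the function name and encoding are hypothetical, not from the paper:

```python
# Cost of a clustering under a single soft-complete constraint:
# every pair on which the clustering and the constraint disagree
# (separated-but-wanted, or clustered-but-unwanted) costs 1.
from itertools import combinations

def soft_complete_cost(clusters, should_cluster):
    """clusters: dict mapping each reference to a cluster id.
    should_cluster: set of frozenset pairs the constraint wants together."""
    cost = 0
    for u, v in combinations(sorted(clusters), 2):
        same = clusters[u] == clusters[v]
        wanted = frozenset((u, v)) in should_cluster
        if same != wanted:  # disagreement in either direction is penalized
            cost += 1
    return cost
```

For example, with clusters {'a': 1, 'b': 1, 'c': 2} and a constraint wanting (a, b) and (b, c) together, the cost is 1: the pair (b, c) is separated while every other pair agrees.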
Main Algorithm
Minimizing this cost turns out to be NP-hard, even for a single soft-complete constraint.
For a large fragment of Dedupalog, the following algorithm is a constant factor approximation of the optimal.
Clustering Graphs
A clustering graph is a pair (V, Phi):
V is a set of nodes, one per entity reference.
Phi is a symmetric function that assigns each edge, i.e., each pair of nodes (u, v), a label:
[+] : soft-plus
[-] : soft-minus
[=] : hard-plus
[≠] : hard-minus
Clustering Graphs
Uniformly choose a random permutation of the nodes.
This gives a partial order on the edges.
Harden each edge in order: change soft edges into hard edges, applying these two rules:
A clustering is all [=]-connected components.
Guarantees: a constant-factor approximation of the optimal clustering, for a large fragment of Dedupalog.
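The hardening procedure above belongs to the family of pivot-based correlation clustering approximations. As a rough sketch only (it ignores hard edges and the paper's edge-ordering details, and all names are mine, not the paper's), a pivot algorithm in that family looks like:

```python
# Pivot-style correlation clustering sketch: pick nodes in a random
# order; each unclustered pivot grabs its unclustered [+]-neighbors.
# Simplified: only soft [+]/[-] edges, no hard must/cannot-links.
import random

def pivot_cluster(nodes, plus_edges, seed=None):
    """plus_edges: set of frozenset pairs labeled [+]; every other
    pair is treated as [-]. Returns a list of clusters (sets)."""
    rng = random.Random(seed)
    order = list(nodes)
    rng.shuffle(order)              # random permutation of the nodes
    unclustered = set(nodes)
    clusters = []
    for pivot in order:
        if pivot not in unclustered:
            continue
        cluster = {pivot} | {v for v in unclustered
                             if frozenset((pivot, v)) in plus_edges}
        unclustered -= cluster
        clusters.append(cluster)
    return clusters
```

With a [+]-clique {a, b, c} and an isolated node d, any permutation yields the clusters {a, b, c} and {d}: whichever clique member is picked first grabs the other two.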
Creating the Clustering Graph
Perform forward voting.
Perform backward propagation.
This creates a clustering graph for each entity reference relation (i.e., Publisher!, Paper!, and Author!).
Physical Implementation and Optimization
Implicit representation of edges: some edge values are stored implicitly.
Choosing edge orderings: edges are ordered so that [+] edges are processed first.
Sort optimization.
Experiments—Cora dataset
•Standard: matching titles and running correlated clustering
•NEQ: Standard + an additional hard rule constraint: a list of conference papers known to be distinct from their journal papers
Experiments—Cora dataset
•Standard: Clustering based on string similarity
•Soft Constraint: Standard + the soft constraint "papers must be in a single conference"
•Hard Constraint: Standard + the hard constraint "papers must be in a single conference"
Experiments—ACM dataset
•No Constraints: String similarity and correlation clustering
•Constraints: No Constraints + the hard constraint "references with different years do not refer to the same conference"
The ACM subset spans 1988–1990; the full ACM dataset has 436,000 references.
Experiments—ACM dataset
Helps catch errors in records. Added a hard rule that says: "If two references refer to the same paper, then they must refer to the same conference."
On the subset it found 5 references that contained incorrect years
On the full dataset it found 152 suspect papers
Performance
Performance
•Vanilla: Clustering of the references by conference with a single soft constraint
•[=]: Vanilla with two additional hard constraints
•HMorphism: Vanilla + [=] + clustering conferences and papers with the constraint "conference papers appear in only one conference"
•NoStream: Vanilla with sort-optimization off
Interactive Deduplication
Manual clustering of Cora took a couple of hours, reaching 98% precision and recall.
Obtaining ground truth for the ACM subset took only 4 hours.
Conclusions
Proposed a novel language, Dedupalog
Validated its practicality on two datasets
Proved, via a novel algorithm, that a large syntactic fragment of Dedupalog admits a constant-factor approximation.
Strengths
Convey their language and algorithm effectively
Mostly good examples that help readers understand their contribution
Strong and meaningful results
Good contribution
Weaknesses
Some mislabeled figure references and occasional typos can cause grief.