Entity Resolution with Evolving Rules
Youzhong Ma, 2010-9-25, Lab of WAMDM
Outline
Motivations
ER Related concepts
ER properties
Conclusions
Entity Resolution background
Naïve ER Approach Vs. New Approach
ER Related concepts
Suppose market A merges with market B.
They have to combine their customer databases.
The same person may appear in both markets' customer DBs, but with some attributes that differ.
How should this be handled?
ER Rule
Boolean functions: determine whether two records represent the same entity (true or false).
Distance functions: measure how different (or similar) two records are.
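The two kinds of rule can be sketched in a few lines of Python. This is only an illustration: the record layout, the field names `name` and `zip`, and the choice of `difflib.SequenceMatcher` as the similarity measure are assumptions, not something the slides specify.

```python
from difflib import SequenceMatcher

# Hypothetical records; the slides do not give attribute values.
r1 = {"name": "John", "zip": "54321"}
r2 = {"name": "John", "zip": "54321"}
r3 = {"name": "Jon", "zip": "54321"}

def same_name_and_zip(a, b):
    """Boolean rule: True if the two records are judged to be the same entity."""
    return a["name"] == b["name"] and a["zip"] == b["zip"]

def name_distance(a, b):
    """Distance rule: 0.0 for identical names, approaching 1.0 as they diverge."""
    return 1.0 - SequenceMatcher(None, a["name"], b["name"]).ratio()
```

A Boolean rule gives a yes/no answer that can drive clustering directly, while a distance rule needs a threshold or a clustering criterion on top of it.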
ER Example
ER procedure
original records set S = {r1, r2, r3, r4}
ER input Pi = {{r1}, {r2}, {r3}, {r4}}
B1: Pname, E1 = {{r1, r2, r3}, {r4}} (6 comps)
B2: Pname ∧ Pzip, E2 = {{r1, r2}, {r3}, {r4}}
Naïve approach: 6 comps
Evolving rule: 3 comps
The evolving-rule approach only works if the ER algorithm satisfies certain properties and B2 is stricter than B1.
So one contribution of this paper is to explore under what conditions, and for what ER algorithms, incremental approaches are feasible.
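The comparison counts in the example (6 for the naïve approach, 3 for the evolving-rule approach) can be reproduced with a minimal sketch. The attribute values below are hypothetical, chosen only so that the partitions match E1 and E2 from the slide. The clustering used here, a transitive closure over all pairwise matches, is an assumed stand-in for the ER algorithm, and `resolve` and `resolve_within` are illustrative names.

```python
from itertools import combinations

# Hypothetical attribute values that reproduce the partitions on the slide.
RECORDS = {
    "r1": {"name": "John", "zip": "54321"},
    "r2": {"name": "John", "zip": "54321"},
    "r3": {"name": "John", "zip": "11111"},
    "r4": {"name": "Bob", "zip": None},
}

def p_name(a, b):
    return RECORDS[a]["name"] == RECORDS[b]["name"]

def p_zip(a, b):
    za, zb = RECORDS[a]["zip"], RECORDS[b]["zip"]
    return za is not None and za == zb

def b1(a, b):
    """Old rule B1: Pname."""
    return p_name(a, b)

def b2(a, b):
    """New, stricter rule B2: Pname AND Pzip."""
    return p_name(a, b) and p_zip(a, b)

def resolve(ids, rule):
    """Cluster ids by the transitive closure of rule; return (partition, comparisons)."""
    ids = list(ids)
    parent = {r: r for r in ids}

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    comps = 0
    for a, b in combinations(ids, 2):
        comps += 1  # one rule evaluation per record pair
        if rule(a, b):
            parent[find(a)] = find(b)
    clusters = {}
    for r in ids:
        clusters.setdefault(find(r), set()).add(r)
    return {frozenset(c) for c in clusters.values()}, comps

def resolve_within(partition, rule):
    """Evolving-rule step: only compare records that share a cluster of the old result."""
    result, total = set(), 0
    for cluster in partition:
        sub, comps = resolve(sorted(cluster), rule)
        result |= sub
        total += comps
    return result, total

e1, comps_b1 = resolve(RECORDS, b1)                  # old result under B1
naive_e2, naive_comps = resolve(RECORDS, b2)         # rerun from scratch under B2
evolved_e2, evolved_comps = resolve_within(e1, b2)   # reuse E1 under B2
```

Both routes give the same E2, but the incremental one only compares pairs inside {r1, r2, r3}. The shortcut is safe here because B2 is stricter than B1: any pair that matches under B2 already shares a cluster in E1.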
original records set S = {r1, r2, r3, r4}
ER input Pi = {{r1}, {r2}, {r3}, {r4}}
B1: Pname ∧ Pzip, E1 = {{r1, r2}, {r3}, {r4}} (6 comps)
B2: Pname ∧ Phone, E2 = {{r1}, {r2, r3}, {r4}} (3 comps)
Pname: Ename = {{r1, r2, r3}, {r4}}
Pzip: Ezip = {{r1, r2}, {r3}, {r4}}
Materialization!
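One way to read the materialization idea: when the new rule is not stricter than the old one, keep the ER result of each individual predicate and start from the result of a predicate the new rule still contains. The sketch below assumes hypothetical attribute values, conjunctive rules written as tuples of attribute names, and a transitive-closure clustering. `materialized` and `evolve` are illustrative names, not from the source.

```python
from itertools import combinations

# Hypothetical attribute values that reproduce Ename, Ezip and E2 from the slide.
RECORDS = {
    "r1": {"name": "John", "zip": "54321", "phone": "123"},
    "r2": {"name": "John", "zip": "54321", "phone": "987"},
    "r3": {"name": "John", "zip": "11111", "phone": "987"},
    "r4": {"name": "Bob", "zip": None, "phone": "000"},
}

def matches(a, b, attrs):
    """Conjunctive rule: every listed attribute is present and equal."""
    return all(
        RECORDS[a][k] is not None and RECORDS[a][k] == RECORDS[b][k]
        for k in attrs
    )

def resolve(ids, attrs):
    """Cluster ids by the transitive closure of the rule; return (partition, comparisons)."""
    ids = sorted(ids)
    cluster_of = {r: {r} for r in ids}
    comps = 0
    for a, b in combinations(ids, 2):
        comps += 1
        if matches(a, b, attrs) and cluster_of[a] is not cluster_of[b]:
            merged = cluster_of[a] | cluster_of[b]
            for r in merged:
                cluster_of[r] = merged
    return {frozenset(c) for c in cluster_of.values()}, comps

# Materialize one ER result per predicate of the old rule Pname AND Pzip.
materialized = {k: resolve(RECORDS, (k,))[0] for k in ("name", "zip")}

def evolve(new_attrs):
    """Apply the new rule inside the clusters of a materialized predicate it shares."""
    shared = next(k for k in new_attrs if k in materialized)
    result, total = set(), 0
    for cluster in materialized[shared]:
        sub, comps = resolve(cluster, new_attrs)
        result |= sub
        total += comps
    return result, total

e2, comps = evolve(("name", "phone"))  # new rule: Pname AND Phone
```

Starting from Ename, only the three pairs inside {r1, r2, r3} are compared. The trade-off is the extra storage and upfront work needed to keep one partition per predicate.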
ER properties
Two important properties for ER algorithms that enable efficient rule evolution for match-based clustering
Rule Monotonicity (RM)
Context Free (CF)
Rule Monotonicity (RM)
Pname ∧ Pzip ≤ Pname
B1: Pname ∧ Pzip, E1 = {{r1, r2}, {r3}, {r4}}
B2: Pname, E2 = {{r1, r2, r3}, {r4}}
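Read against this example, rule monotonicity says that when B1 is stricter than B2 (B1 ≤ B2), the result E1 refines E2: every cluster of E1 sits inside some cluster of E2. That reading is an interpretation of the slide rather than a quoted definition. The check itself is small, and `refines` is an illustrative name.

```python
def refines(finer, coarser):
    """True if every cluster of `finer` is contained in some cluster of `coarser`."""
    return all(any(c <= d for d in coarser) for c in finer)

# Partitions from the slide.
E1 = [{"r1", "r2"}, {"r3"}, {"r4"}]   # result under B1: Pname AND Pzip
E2 = [{"r1", "r2", "r3"}, {"r4"}]     # result under B2: Pname

e1_refines_e2 = refines(E1, E2)
e2_refines_e1 = refines(E2, E1)
```

Here E1 refines E2 but not the other way round, which is what allows a stricter rule to be evaluated inside the clusters produced by a looser one.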
Context Free (CF)
General Incremental VS. Context Free
Order independent VS. Rule Monotonicity
An ER algorithm is order independent if the ER result is the same regardless of the order in which the records are processed.
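Order independence can be checked by brute force on a small input: run the algorithm on every permutation of the records and confirm that only one distinct result appears. The sketch below assumes hypothetical names for the four records and a simple one-record-at-a-time resolver that merges each new record with every existing cluster it matches. It stands in for whatever ER algorithm is under test.

```python
from itertools import permutations

# Hypothetical name attribute for each record.
NAMES = {"r1": "John", "r2": "John", "r3": "John", "r4": "Bob"}

def resolve(ordered_ids):
    """Process records in the given order, merging each into the clusters it matches."""
    clusters = []
    for r in ordered_ids:
        merged = {r}
        rest = []
        for c in clusters:
            if any(NAMES[r] == NAMES[x] for x in c):
                merged |= c        # r matches a member, so absorb the cluster
            else:
                rest.append(c)
        clusters = rest + [merged]
    return frozenset(frozenset(c) for c in clusters)

# Collect the distinct results over all 24 processing orders.
results = {resolve(order) for order in permutations(NAMES)}
order_independent = len(results) == 1
```

An exhaustive check like this is only practical for a handful of records; for larger inputs a random sample of orderings is a more realistic test.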
Existing properties in literature
Experiments
Conclusions
Proposes a new ER approach with evolving rules
Exploits the properties (RM, CF) of ER algorithms that enable efficient rule evolution
Provides guidance to designers of ER algorithms
Some problems
How are the comparison rules generated?
How can ER algorithms be designed so that they satisfy the RM and CF properties?
How can the ER algorithms be implemented in the MapReduce framework?
Sincere thanks to everyone in the Web Group