Linking Records with Value Diversity Pei Li University of Milan – Bicocca Advisor : Andrea Maurino Supervisors@ AT&T Labs - Research: Xin Luna Dong, Divesh Srivastava October, 2012

Linking Records with Value Diversity

Pei Li
University of Milan – Bicocca

Advisor: Andrea Maurino
Supervisors at AT&T Labs - Research: Xin Luna Dong, Divesh Srivastava

October 2012

Some Statistics from DBLP

• How many Wei Wang’s are there?
• What are their authoring histories?

Some Statistics from YellowPages

• Are there any business chains?
• If yes, which businesses are their members?

Record Linkage

• What is record linkage (entity resolution)?
  • Input: a set of records
  • Output: clustering of records
  • A critical problem in data integration and data cleaning
  • “A reputation for world-class quality is profitable, a ‘business maker’.” – William E. Winkler
• Current work (surveyed in [Elmagarmid, 07], [Koudas, 06]):
  • assume that records of the same entities are consistent
  • often focus on different representations of the same value
    • e.g., “IBM” and “International Business Machines”

New Challenges

• In reality, we observe value diversity of entities
  • Values can evolve over time
    • Catholic Healthcare (1986 - 2012) → Dignity Health (2012 -)
  • Different records of the same group can have “local” values

ID | Name | Address | Phone | URL
001 | F.B. Insurance | Vernon 76384 TX | 877 635-4684 | txfb-ins.com
002 | F.B. Insurance #1 | Lufkin 75901 TX | 936 634-7285 | txfb.org
003 | F.B. Insurance #5 | Cibolo 78108 TX | 877 635-4684 |

  • Some sources may provide erroneous values

ID | Name | URL | Source
001 | Meekhof Tire Sales & Service Inc | www.meekhoftire.com | Src. 1
002 | Meekhof Tire Sales & Service Inc | www.napaautocare.com | Src. 2

My Goal

• To improve the linkage quality of integrated data with fairly high diversity
  • linking temporal records [VLDB ’11] [VLDB ’12 demo] [FCS Journal ’12]
  • linking records of the same group [under preparation for SIGMOD ’13]

Outline

• Motivation
• Linking temporal records
  • Decay
  • Temporal clustering
  • Demo
• Linking records of the same group
• Related work
• Conclusions & Future work

[Figure: timeline of DBLP records r1-r12, with time axis 1991 and 2004-2011]

r1: Xin Dong, R. Polytechnic Institute
r2: Xin Dong, University of Washington
r3: Xin Dong, University of Washington
r4: Xin Luna Dong, University of Washington
r5: Xin Luna Dong, AT&T Labs-Research
r6: Xin Luna Dong, AT&T Labs-Research
r7: Dong Xin, University of Illinois
r8: Dong Xin, University of Illinois
r9: Dong Xin, Microsoft Research
r10: Dong Xin, University of Illinois
r11: Dong Xin, Microsoft Research
r12: Dong Xin, Microsoft Research

• How many authors?
• What are their authoring histories?

[Figure: the same timeline of records r1-r12, shown with three different groupings]

• Ground truth
  • 3 authors
• Solution 1: requiring high value consistency
  • 5 authors (false negatives)
• Solution 2: matching records w. similar names
  • 2 authors (false positives)

Opportunities

ID | Name | Affiliation | Co-authors | Year
r1 | Xin Dong | R. Polytechnic Institute | Wozny | 1991
r2 | Xin Dong | University of Washington | Halevy, Tatarinov | 2004
r7 | Dong Xin | University of Illinois | Han, Wah | 2004
r3 | Xin Dong | University of Washington | Halevy | 2005
r4 | Xin Luna Dong | University of Washington | Halevy, Yu | 2007
r8 | Dong Xin | University of Illinois | Wah | 2007
r9 | Dong Xin | Microsoft Research | Wu, Han | 2008
r10 | Dong Xin | University of Illinois | Ling, He | 2009
r11 | Dong Xin | Microsoft Research | Chaudhuri, Ganti | 2009
r5 | Xin Luna Dong | AT&T Labs-Research | Das Sarma, Halevy | 2009
r6 | Xin Luna Dong | AT&T Labs-Research | Naumann | 2010
r12 | Dong Xin | Microsoft Research | He | 2011

• Smooth transition
• Seldom erratic changes
• Continuity of history

Intuitions

[Table: records r1-r12 with name, affiliation, co-authors, and year, as on the "Opportunities" slide]

• Less penalty on different values over time
• Less reward on the same value over time
• Consider records in time order for clustering

Outline

• Motivation
• Linking temporal records
  • Decay
  • Temporal clustering
  • Demo
• Linking records of the same group
• Related work
• Conclusions & Future work

Disagreement Decay

• Intuition: different values over a long time are not a strong indicator of referring to different entities.
  • University of Washington (01-07)
  • AT&T Labs-Research (07-date)
• Definition (Disagreement decay)
  • Disagreement decay of attribute A over time ∆t is the probability that an entity changes its A-value within time ∆t.

Agreement Decay

• Intuition: the same value over a long time is not a strong indicator of referring to the same entity.
  • Adam Smith (1723-1790)
  • Adam Smith (1965-)
• Definition (Agreement decay)
  • Agreement decay of attribute A over time ∆t is the probability that different entities share the same A-value within time ∆t.
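The two definitions can be made concrete with a rough empirical estimate. The sketch below is an assumption, not the learning procedure behind the slides: it takes labeled records as hypothetical `(entity, value, year)` tuples and counts record pairs that lie at most ∆t apart. The function name `estimate_decay` is also hypothetical.

```python
from itertools import combinations

def estimate_decay(records, dt):
    """Crude pair-counting estimate of (disagreement, agreement) decay.

    records: list of (entity_id, value, year) tuples with known entity labels.
    Only pairs whose time gap is at most dt are counted.
    """
    same_total = same_diff = 0      # pairs of records of the same entity
    other_total = other_equal = 0   # pairs of records of different entities
    for (e1, v1, t1), (e2, v2, t2) in combinations(records, 2):
        if abs(t1 - t2) > dt:
            continue
        if e1 == e2:
            same_total += 1
            same_diff += v1 != v2    # the entity changed its value
        else:
            other_total += 1
            other_equal += v1 == v2  # different entities share a value
    disagree = same_diff / same_total if same_total else 0.0
    agree = other_equal / other_total if other_total else 0.0
    return disagree, agree

# Hypothetical labeled affiliation records for two entities "a" and "b".
labeled = [("a", "UW", 2004), ("a", "UW", 2005), ("a", "AT&T", 2009), ("b", "UW", 2004)]
short_gap = estimate_decay(labeled, 1)
long_gap = estimate_decay(labeled, 5)
```

On this toy input the estimated disagreement decay grows with the gap, which matches the intuition that an entity is more likely to have changed its value over a longer time.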

Decay Curves

• Decay curves of address learnt from European Patent data

[Figure: disagreement decay and agreement decay (y-axis: decay, 0 to 1) versus ∆ Year (x-axis, 0 to 25)]

Patent records: 1871
Real-world inventors: 359
In years: 1978 - 2003

Applying Decay

• E.g.
  • r1 <Xin Dong, Uni. of Washington, 2004>
  • r2 <Xin Dong, AT&T Labs-Research, 2009>
• Similarity without decay:
  • w(name) = w(affi.) = .5
  • sim(r1, r2) = .5*1 + .5*0 = .5 (un-match)
• Decayed similarity:
  • w(name, ∆t=5) = 1 - d_agree(name, ∆t=5) = .95
  • w(affi., ∆t=5) = 1 - d_disagree(affi., ∆t=5) = .1
  • sim(r1, r2) = (.95*1 + .1*0)/(.95 + .1) = .9 (match)
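The arithmetic of this example can be checked with a few lines of Python. This is a minimal sketch: the helper name `weighted_similarity` is an assumption, and the weights are simply the numbers quoted on the slide.

```python
def weighted_similarity(attribute_sims, weights):
    """Weighted average of per-attribute similarities, normalized by total weight."""
    total = sum(weights.values())
    return sum(weights[a] * attribute_sims[a] for a in weights) / total

# r1 and r2 agree on the name (similarity 1) and disagree on the affiliation (similarity 0).
sims = {"name": 1.0, "affiliation": 0.0}

# Without decay: both attributes carry weight .5.
plain = weighted_similarity(sims, {"name": 0.5, "affiliation": 0.5})

# With decay at a 5-year gap: weights .95 and .1, as on the slide.
decayed = weighted_similarity(sims, {"name": 0.95, "affiliation": 0.1})

print(round(plain, 2), round(decayed, 2))  # prints: 0.5 0.9
```

The disagreement on affiliation barely counts after five years, so the normalized score rises from .5 to about .9.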

Applying Decay

[Table: records r1-r12 with name, affiliation, co-authors, and year, as on the "Opportunities" slide]

• Able to detect changes!
• All records are merged into the same cluster!!

Decayed Similarity & Traditional Clustering

[Figure: bar chart of F-1, precision, and recall (0 to 1) for PARTITION, CENTER, MERGE, and DECAY]

• Decay improves recall over baselines by 23-67%

Patent records: 1871
Real-world inventors: 359
In years: 1978 - 2003

Outline

• Motivation
• Linking temporal records
  • Decay
  • Temporal clustering
  • Demo
• Linking records of the same group
• Related work
• Conclusions & Future work


Early Binding

• Compare a new record with existing clusters

• Make eager merging decision for each record

• Maintain the earliest/latest timestamp for its last value
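The three steps above can be sketched as a single pass over the records in time order. This is only an illustration under stated assumptions: the record-cluster similarity `latest_value_sim`, the threshold, and the record layout are hypothetical, and the sketch compares against a cluster's latest record instead of tracking the earliest/latest timestamps.

```python
def early_binding(records, sim, threshold):
    """Process records in time order and eagerly attach each to its best cluster."""
    clusters = []
    for record in sorted(records, key=lambda r: r["year"]):
        best = max(clusters, key=lambda c: sim(record, c), default=None)
        if best is not None and sim(record, best) >= threshold:
            best.append(record)          # eager merge, never revisited
        else:
            clusters.append([record])    # start a new cluster
    return clusters

def latest_value_sim(record, cluster):
    """Hypothetical similarity: compare with the latest record in the cluster."""
    last = cluster[-1]
    return 0.5 * (record["name"] == last["name"]) + \
           0.5 * (record["affiliation"] == last["affiliation"])

sample = [
    {"id": "r2", "name": "Xin Dong", "affiliation": "University of Washington", "year": 2004},
    {"id": "r3", "name": "Xin Dong", "affiliation": "University of Washington", "year": 2005},
    {"id": "r7", "name": "Dong Xin", "affiliation": "University of Illinois", "year": 2004},
]
clusters = early_binding(sample, latest_value_sim, threshold=0.75)
cluster_ids = [[r["id"] for r in c] for c in clusters]
```

Because each decision is final, a wrong merge early in the pass cannot be undone later.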


Early Binding

ID | Name | Affiliation | Co-authors | From | To
r2 | Xin Dong | Univ. of Washington | Halevy, Tatarinov | 2004 | 2004
r3 | Xin Dong | Univ. of Washington | Halevy | 2004 | 2005
r1 | Xin Dong | R. P. Institute | Wozny | 1991 | 1991
r7 | Dong Xin | University of Illinois | Han, Wah | 2004 | 2004
r8 | Dong Xin | University of Illinois | Wah | 2004 | 2007
r4 | Xin Luna Dong | Univ. of Washington | Halevy, Yu | 2004 | 2007
r9 | Dong Xin | Microsoft Research | Wu, Han | 2008 | 2008
r10 | Dong Xin | University of Illinois | Ling, He | 2009 | 2009
r5 | Xin Luna Dong | AT&T Labs-Research | Das Sarma, Halevy | 2009 | 2009
r11 | Dong Xin | Microsoft Research | Chaudhuri, Ganti | 2008 | 2009
r6 | Xin Luna Dong | AT&T Labs-Research | Naumann | 2009 | 2010
r12 | Dong Xin | Microsoft Research | He | 2008 | 2011

[Figure: records grouped into clusters C1, C2, and C3]

• Avoid a lot of false positives!
• Earlier mistakes prevent later merging!!

Adjusted Binding

• Compare earlier records with clusters created later
• Proceed in EM-style
  1. Initialization: Start with the result of initialized clustering
  2. Estimation: Compute record-cluster similarity
  3. Maximization: Choose the optimal clustering
  4. Termination: Repeat until the results converge or oscillate
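The four steps can be sketched as an iterative reassignment loop. This is a minimal sketch under stated assumptions, not the authors' implementation: the record-cluster similarity is passed in as a parameter, the stand-in `share_affiliation` is hypothetical, and "optimal clustering" is simplified to moving each record to its most similar cluster.

```python
def adjusted_binding(records, initial, sim, max_rounds=20):
    """Iteratively reassign every record to its most similar cluster.

    records: dict record_id -> record; initial: dict record_id -> cluster label;
    sim(record, members) -> float, where members excludes the record itself.
    """
    assignment = dict(initial)                       # 1. initialization
    seen = {tuple(sorted(assignment.items()))}
    for _ in range(max_rounds):
        labels = sorted(set(assignment.values()))
        updated = {}
        for rid, record in records.items():
            scores = {}
            for label in labels:                     # 2. estimation
                members = [records[o] for o, l in assignment.items()
                           if l == label and o != rid]
                if members:
                    scores[label] = sim(record, members)
            # 3. maximization: pick the best-scoring cluster for this record
            updated[rid] = max(scores, key=scores.get) if scores else assignment[rid]
        assignment = updated
        state = tuple(sorted(assignment.items()))
        if state in seen:                            # 4. converged or oscillating
            break
        seen.add(state)
    return assignment

def share_affiliation(record, members):
    """Hypothetical similarity: fraction of members with the same affiliation."""
    return sum(m["affiliation"] == record["affiliation"] for m in members) / len(members)

records = {
    "r7": {"affiliation": "UI", "year": 2004},
    "r8": {"affiliation": "UI", "year": 2007},
    "r9": {"affiliation": "MSR", "year": 2008},
    "r10": {"affiliation": "UI", "year": 2009},
    "r11": {"affiliation": "MSR", "year": 2009},
}
initial = {"r7": "A", "r8": "A", "r9": "B", "r10": "B", "r11": "B"}
result = adjusted_binding(records, initial, share_affiliation)
```

In this toy run, r10 starts in cluster "B" by mistake and moves to cluster "A" in the first round. The second round changes nothing, so the loop stops.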

Adjusted Binding

• Compute similarity by
  • Consistency: consistency in evolution of values
  • Continuity: continuity of records in time

[Figure: four cases (Case 1 to Case 4) for the position of the record time stamp r.t relative to the cluster time stamps C.early and C.late]

sim(r, C) = cont(r, C) * cons(r, C)

Adjusted Binding

[Figure: clusters C3, C4, and C5 over the records r7: DongXin@UI - 2004, r8: DongXin@UI - 2007, r9: DongXin@MSR - 2008, r10: DongXin@UI - 2009, r11: DongXin@MSR - 2009, r12: DongXin@MSR - 2011]

• r10 has higher continuity with C4
• r8 has higher continuity with C4
• Once r8 is merged to C4, r7 has higher continuity with C4

Adjusted Binding

[Figure: rows grouped into clusters C1, C2, and C3]

ID | Name | Affiliation | Co-authors | Year
r1 | Xin Dong | R. Polytechnic Institute | Wozny | 1991
r2 | Xin Dong | University of Washington | Halevy, Tatarinov | 2004
r3 | Xin Dong | University of Washington | Halevy | 2005
r4 | Xin Luna Dong | University of Washington | Halevy, Yu | 2007
r5 | Xin Luna Dong | AT&T Labs-Research | Das Sarma, Halevy | 2009
r6 | Xin Luna Dong | AT&T Labs-Research | Naumann | 2010
r7 | Dong Xin | University of Illinois | Han, Wah | 2004
r8 | Dong Xin | University of Illinois | Wah | 2007
r9 | Dong Xin | Microsoft Research | Wu, Han | 2008
r10 | Dong Xin | University of Illinois | Ling, He | 2009
r11 | Dong Xin | Microsoft Research | Chaudhuri, Ganti | 2009
r12 | Dong Xin | Microsoft Research | He | 2011

• Correctly cluster all records

Temporal Clustering

Patent records: 1871
Real-world inventors: 359
In years: 1978 - 2003

[Figure: bar chart of F-1, precision, and recall (0 to 1) for PARTITION, CENTER, MERGE, DECAY, ADJUST, and FULL ALGO.]

• Full algorithm has the best result
• Adjusted Clustering improves recall without reducing precision much

Experimental Results

• Data sets:

Data set | #Records | #Entities | Years
Patent | 1871 | 359 | 1978-2003
DBLP-XD | 72 | 8 | 1991-2010
DBLP-WW | 738 | 18+potpourri | 1992-2011

[Figure: bar charts of F-1, precision, and recall (0 to 1) for PARTITION, CENTER, MERGE, and FULL ALGO.: (a) Results of XD data, (b) Results of WW data]


Demonstration

• CHRONOS: Facilitating History Discovery by Linking Temporal Records

ITIS Lab: http://www.itis.disco.unimib.it

Outline

• Motivation
• Linking temporal records
  • Decay
  • Temporal clustering
  • Demo
• Linking records of the same group
• Related work
• Conclusions & Future work

• Are there any business chains?
• If yes, which businesses are their members?
• Ground truth
  • 2 chains
• Solution 1: require high value consistency
  • 0 chains
• Solution 2: match records w. same name
  • 1 chain

Challenges

ID | name | phone | state | URL domain
r1 | Taco Casa | | AL | tacocasa.com
r2 | Taco Casa | 900 | AL | tacocasa.com
r3 | Taco Casa | 900 | AL | tacocasa.com, tacocasatexas.com
r4 | Taco Casa | 900 | AL |
r5 | Taco Casa | 900 | AL |
r6 | Taco Casa | 701 | TX | tacocasatexas.com
r7 | Taco Casa | 702 | TX | tacocasatexas.com
r8 | Taco Casa | 703 | TX | tacocasatexas.com
r9 | Taco Casa | 704 | TX |
r10 | Elva’s Taco Casa | | TX | tacodemar.com

• Erroneous values
• Different local values
• Scalability: 6.8M records

Two-Stage Linkage – Stage I

• Stage I: Identify cores containing listings very likely to belong to the same chain
  • Require strong robustness in presence of possibly erroneous values (graph theory)
  • High scalability

[Table: records r1-r10, as on the "Challenges" slide]
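One way to picture Stage I is as a graph over listings. The sketch below is a simplified stand-in and not the authors' core-identification algorithm. It assumes that two listings are linked when they have the same name and share a phone or a URL domain. It also assumes that a listing whose removal would split its connected component may carry an erroneous value, so such listings are set aside before cores are read off. All function names are hypothetical.

```python
from itertools import combinations

def build_graph(records):
    """Link records that have the same name and share a phone or a URL domain."""
    graph = {rid: set() for rid in records}
    for a, b in combinations(records, 2):
        ra, rb = records[a], records[b]
        same_phone = ra["phone"] is not None and ra["phone"] == rb["phone"]
        shared_url = bool(ra["urls"] & rb["urls"])
        if ra["name"] == rb["name"] and (same_phone or shared_url):
            graph[a].add(b)
            graph[b].add(a)
    return graph

def components(graph, skip=frozenset()):
    """Connected components of the graph, ignoring the nodes in `skip`."""
    seen, result = set(skip), []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            comp.add(node)
            for nxt in graph[node] - seen:
                seen.add(nxt)
                stack.append(nxt)
        result.append(comp)
    return result

def candidate_cores(records):
    """Components of size >= 2 after setting aside records that bridge components."""
    graph = build_graph(records)
    baseline = len(components(graph))
    bridges = {n for n in graph if len(components(graph, skip={n})) > baseline}
    return [c for c in components(graph, skip=bridges) if len(c) >= 2]

def rec(name, phone, state, *urls):
    return {"name": name, "phone": phone, "state": state, "urls": set(urls)}

listings = {
    "r1": rec("Taco Casa", None, "AL", "tacocasa.com"),
    "r2": rec("Taco Casa", "900", "AL", "tacocasa.com"),
    "r3": rec("Taco Casa", "900", "AL", "tacocasa.com", "tacocasatexas.com"),
    "r4": rec("Taco Casa", "900", "AL"),
    "r5": rec("Taco Casa", "900", "AL"),
    "r6": rec("Taco Casa", "701", "TX", "tacocasatexas.com"),
    "r7": rec("Taco Casa", "702", "TX", "tacocasatexas.com"),
    "r8": rec("Taco Casa", "703", "TX", "tacocasatexas.com"),
    "r9": rec("Taco Casa", "704", "TX"),
    "r10": rec("Elva’s Taco Casa", None, "TX", "tacodemar.com"),
}
cores = candidate_cores(listings)
```

On the Taco Casa listings, r3 is the only record that bridges the AL and TX groups through its second URL domain. Setting it aside leaves two candidate cores, and r3, r9, and r10 remain for Stage II. The brute-force bridge test is quadratic and would not scale to 6.8M records, so it is meant only to illustrate the robustness idea.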

Two-Stage Linkage – Stage II

• Stage II: Cluster cores and remaining records into chains.
  • Collect strong evidence from cores and leverage in clustering
  • No penalty on local values

[Table: records r1-r10, as on the "Challenges" slide, annotated step by step]

• Reward strong evidence
• Apply weak evidence
• No penalty on local values

Experimental Evaluation

• Data set
  • 6.8M records from YellowPages.com
• Effectiveness:
  • Precision / Recall / F-measure (avg.): .96 / .96 / .96
• Efficiency:
  • 6.9 hrs for single-machine solution
  • 40 mins for Hadoop solution
• 80K chains and 1M records in chains

Chain name | # Stores
USPS - United States Post Office | 12,776
SUBWAY | 11,278
State Farm Insurance | 8,711
McDonald's | 7,450
Edward Jones | 6,781
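The slides do not say how precision, recall, and F-measure are computed. A common choice for clustering results, assumed here, is the pairwise measure: a pair of records counts as a predicted match if both land in the same predicted cluster. The function names and the toy clusters below are hypothetical.

```python
from itertools import combinations

def same_cluster_pairs(clusters):
    """All unordered record pairs that share a cluster."""
    return {frozenset(p) for c in clusters for p in combinations(sorted(c), 2)}

def pairwise_prf(predicted, truth):
    """Pairwise precision, recall, and F-measure of a predicted clustering."""
    pred, gold = same_cluster_pairs(predicted), same_cluster_pairs(truth)
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 1.0
    recall = tp / len(gold) if gold else 1.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

truth = [["r1", "r2", "r3"], ["r7", "r8"]]
predicted = [["r1", "r2"], ["r3"], ["r7", "r8"]]
precision, recall, f1 = pairwise_prf(predicted, truth)
```

Here every predicted pair is correct, but two of the four true pairs are missed because r3 was split off, so precision is 1.0 and recall is 0.5.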

Experimental Evaluation II

Sample | #Records | #Chains | Chain size | #Single-biz records
Random | 2062 | 30 | [2, 308] | 503
AI | 2446 | 1 | 2446 | 0
UB | 322 | 7 | [2, 275] | 5
FBIns | 1149 | 14 | [33, 269] | 0

Related Work

• Record similarity:
  • Probabilistic linkage
    • Classification-based approaches: classify records by probabilistic model [Fellegi, ’69]
  • Deterministic linkage
    • Distance-based approaches: apply a distance metric to compute the similarity of each attribute, and take the weighted sum as record similarity [Dey, 08]
    • Rule-based approaches: apply domain knowledge to match records [Hernandez, 98]
• Record clustering
  • Transitive rule [Hernandez, 98]
  • Optimization problem [Wijaya, 09]
  • …

Conclusions

• In some applications, record linkage needs to be tolerant of value diversity
• When linking temporal records, time decay allows tolerance of evolving values
• When linking group members, two-stage linkage leverages strong evidence and tolerates different local values


Future Work


Data Integration

Temporal Database

Data Quality


Thanks!
