dataengconf sf16 - entity resolution in data pipelines using spark

Slides @ www.jakequist.com/go/dataengconf

Upload: hakka-labs

Post on 16-Apr-2017

TRANSCRIPT

Page 1: DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

Slides @ www.jakequist.com/go/dataengconf

Page 2: DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

http://www.umiacs.umd.edu/~getoor/Tutorials/ER_VLDB2012.pdf

Page 3: DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

Entity Resolution

Page 4: DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

Talk Structure

Layer 1: Naive ER

Layer 2: Graphical ER

Layer 3: Big Data ER

Layer 4: Temporal ER

Layer 5: Learned ER

Page 5: DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

Naive ER

Page 6: DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

Entity Resolution

Suppose we have the following data:

ID  Name           Website          Geo
A   Facebook       facebook.com     Menlo Park, CA
B   FB             facebook.com     CA
C   Joe's Cookies  joescookies.com  San Francisco, CA

Page 7: DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

Entity Resolution

Suppose we have the following data:

ID  Name           Website          Geo
A   Facebook       facebook.com     Menlo Park, CA
B   FB             facebook.com     CA
C   Joe's Cookies  joescookies.com  San Francisco, CA
D   JoesCookies    facebook.com     San Francisco, CA

Page 8: DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

Entity Resolution

Suppose we have the following data:

ID  Name           Website          Geo
A   Facebook       facebook.com     Menlo Park, CA
B   FB             facebook.com     CA
C   Joe's Cookies  joescookies.com  San Francisco, CA
D   JoesCookies    facebook.com     San Francisco, CA
E   JoesCookies    NULL             New York, NY

Page 9: DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

Fundamental Concept

Match entities on the similarity of their properties

Page 10: DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

Example: Company Similarity

Page 11: DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

Example: Company Similarity
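The code on the similarity slides did not survive in this transcript. As a rough illustration of "match on the similarity of properties", here is a minimal sketch of a pairwise company score. The weights, the field names and the use of `difflib` are assumptions made for this example, not the speaker's scoring rules.

```python
from difflib import SequenceMatcher

# Hypothetical weights, chosen only for illustration.
NAME_WEIGHT = 100
WEBSITE_MATCH = 50
WEBSITE_MISMATCH = -100


def normalize(name):
    """Lowercase a name and drop spaces and punctuation."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def name_similarity(a, b):
    """Similarity ratio in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def company_similarity(x, y):
    """Score a pair of company records; higher means more likely the same entity."""
    score = NAME_WEIGHT * name_similarity(x["name"], y["name"])
    # A missing website (None) contributes nothing either way.
    if x["website"] and y["website"]:
        score += WEBSITE_MATCH if x["website"] == y["website"] else WEBSITE_MISMATCH
    return score


A = {"name": "Facebook", "website": "facebook.com"}
B = {"name": "FB", "website": "facebook.com"}
C = {"name": "Joe's Cookies", "website": "joescookies.com"}
E = {"name": "JoesCookies", "website": None}
```

With these assumed weights, A and B score positive because the shared website outweighs the weak name match, while A and C score negative.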

Page 12: DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

Problems

• What about when match arity != 2?

• Entities can’t duplicate across matches

• O(N^2) isn’t great either

Page 13: DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

Graphical ER

Page 14: DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

Think Like a Graph

[Graph: nodes A, B, C, D, E]

ID  Name           Website          Geo
A   Facebook       facebook.com     Menlo Park, CA
B   FB             facebook.com     CA
C   Joe's Cookies  joescookies.com  San Francisco, CA
D   JoesCookies    facebook.com     San Francisco, CA
E   JoesCookies    NULL             New York, NY

Page 15: DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

Think Like a Graph

[Graph: nodes A, B, C, D, E with ten weighted edges; weights shown: 150, 50, -100, -100, 50, 50, 50, 50, -150, -150]

Page 16: DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

Think Like a Graph

[Graph: nodes A, B, C, D, E with ten weighted edges; weights shown: 150, 50, -100, -100, 50, 50, 50, 50, -150, -150]

Page 17: DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

Key Concept: Cliques

Page 18: DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

Think Like a Clique

[Graph: nodes A, B, C, D, E with ten weighted edges; weights shown: 150, 50, -100, -100, 50, 50, 50, 50, -150, -150]

Possible cliques =>

{A}, {B}, {C}, {D}, {E}, {E, A}, {E, B}, {E, C}, {E, D}, {A, B}, {A, C}, {A, D}, {B, C}, {B, D}, {C, D}, {E, A, B}, {E, A, C}, {E, A, D}, {E, B, C}, {E, B, D}, {E, C, D}, {A, B, C}, {A, B, D}, {A, C, D}, {B, C, D}, {E, A, B, C}, {E, A, B, D}, {E, A, C, D}, {E, B, C, D}, {A, B, C, D}, {E, A, B, C, D}

Page 19: DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

Recurring Theme: Powerset
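To make the powerset idea concrete, here is a small sketch that enumerates every non-empty subset of a set of entities as a candidate clique. The function name `candidate_cliques` is made up for this example.

```python
from itertools import chain, combinations


def candidate_cliques(entities):
    """Yield every non-empty subset of `entities` as a frozenset."""
    items = list(entities)
    sizes = range(1, len(items) + 1)
    for subset in chain.from_iterable(combinations(items, k) for k in sizes):
        yield frozenset(subset)


# Five entities give 2**5 - 1 = 31 candidate cliques.
cliques = list(candidate_cliques("ABCDE"))
print(len(cliques))
```

The count doubles with every added entity, which is why this only works directly on small groups.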

Page 20: DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

Scoring Cliques

from above
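The scoring formula on this slide is missing from the transcript. One plausible reading, assumed here, is that a clique's score is the sum of the pairwise scores of its members. The edge weights below are hypothetical, since the transcript lists the graph's weights but not which edge each belongs to.

```python
from itertools import combinations

# Hypothetical pairwise scores keyed by unordered pair of entity IDs.
EDGE_SCORES = {
    frozenset("AB"): 150,
    frozenset("CD"): 50,
    frozenset("AC"): -100,
}


def clique_score(clique, edge_scores=EDGE_SCORES, default=0):
    """Sum the pairwise scores inside a clique; a singleton scores 0."""
    pairs = combinations(sorted(clique), 2)
    return sum(edge_scores.get(frozenset(pair), default) for pair in pairs)
```

Under this assumption, adding a member with a strongly negative edge pulls the whole clique's score down, so {A, B, C} scores lower than {A, B}.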

Page 21: DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

Overlapping Cliques

[Figure: two copies of the graph over nodes A, B, C, D, E, labelled A = 0.75 and B = 0.55]

Page 22: DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

Overlapping Cliques

An entity can’t belong to more than one clique.

When we choose a clique, we must ensure no other cliques use any of those entities.

Page 23: DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

Clique Choosing

Page 24: DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

Clique Choosing

Page 25: DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

Recap

• Given a dataset of entities…

• Take the powerset of those entities => every possible clique

• Score all the cliques

• Walk the cliques in descending score order, choosing each clique whose entities have not yet been touched
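The last step of the recap can be sketched as a greedy pass over already-scored cliques. The input format (a list of `(clique, score)` pairs) and the sample scores are assumptions for this example.

```python
def choose_cliques(scored_cliques):
    """Greedily pick non-overlapping cliques, best score first."""
    touched = set()  # entities already claimed by a chosen clique
    chosen = []
    ranked = sorted(scored_cliques, key=lambda pair: pair[1], reverse=True)
    for clique, _score in ranked:
        if touched.isdisjoint(clique):
            chosen.append(clique)
            touched.update(clique)
    return chosen


# Hypothetical scores for a handful of candidate cliques.
scored = [
    ({"A", "B"}, 150),
    ({"A", "B", "C"}, 50),
    ({"C", "D"}, 50),
    ({"E"}, 0),
    ({"C"}, 0),
    ({"D"}, 0),
]
result = choose_cliques(scored)
```

Here {A, B} wins first, which rules out {A, B, C}; {C, D} and {E} are then still free. A greedy pass like this is simple but is not guaranteed to find the best overall partition.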

Page 26: DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

ER on Bigger Data

Page 27: DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

Challenges

• Get potential matches on the same machine

• Avoid using powerset(n) for large n

Page 28: DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

Locality-Sensitive Hashing (LSH)

Basic Idea: Use MapReduce to get likely matches onto the same machines

“Johnathon” => “John”
“Sequoia Capital, LLC” => “Sequoia”
[37.773972, -122.431297] => [37.73, -122.43]
“app.example.com” => “example.com”
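A minimal sketch of the bucketing idea follows. It uses simple coarsening keys (name prefix, rounded coordinates, last two domain labels) as a stand-in for a real LSH family; the key functions and their parameters are assumptions for this example and are not meant to reproduce the exact values on the slide.

```python
from collections import defaultdict


def name_key(name, length=4):
    """Coarse name key: lowercase prefix of the name."""
    return name.strip().lower()[:length]


def geo_key(lat, lon, places=2):
    """Coarse location key: coordinates rounded to a fixed number of decimals."""
    return (round(lat, places), round(lon, places))


def domain_key(host):
    """Coarse website key: last two labels of the host name.

    Naive on purpose; it mishandles suffixes such as co.uk.
    """
    return ".".join(host.lower().split(".")[-2:])


def bucket(records, key_fn):
    """Group records by key, as a map/shuffle step would do across machines."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[key_fn(rec)].append(rec)
    return dict(buckets)


names = ["Johnathon", "John", "Sequoia Capital, LLC", "Sequoia"]
by_name = bucket(names, name_key)
```

In Spark the same grouping would typically be expressed as a `keyBy` followed by `groupByKey`, so that records sharing a key land in the same partition.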

Page 29: DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

Locality-Sensitive Hashing

Page 30: DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

Locality-Sensitive Hashing

Page 31: DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

Problems

• What if our entities have missing properties?

Page 32: DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

Locality-Sensitive Hashing

A: Joe’s Cookies
B: Joe’s Cookie’s, joescookies.com
C: joescookies.com

LSH on “name”: A => “Joe Cookie”, B => “Joe Cookie”, C => “”

Page 33: DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

Multilevel LSH

• Basic Idea: Use LSH multiple times on converging cliques
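One way to read "converging cliques" is that each pass groups records on a different property and the groups are merged whenever they share a record. The sketch below takes that reading and uses union-find to do the merging; the record contents, key functions and merging strategy are assumptions for this example.

```python
from collections import defaultdict


def find(parent, x):
    """Return the root of x, compressing the path as we go."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def name_key(rec):
    """Lowercase, drop apostrophes and trailing 's' per word; '' if no name."""
    name = rec.get("name") or ""
    words = name.lower().replace("'", "").replace("’", "").split()
    return " ".join(w.rstrip("s") for w in words)


def website_key(rec):
    return rec.get("website") or ""


def multilevel_groups(records, key_fns):
    """Run one bucketing pass per key function and merge overlapping buckets."""
    parent = {rid: rid for rid in records}
    for key_fn in key_fns:
        buckets = defaultdict(list)
        for rid, rec in records.items():
            key = key_fn(rec)
            if key:  # a missing property simply sits out this pass
                buckets[key].append(rid)
        for ids in buckets.values():
            for other in ids[1:]:
                parent[find(parent, other)] = find(parent, ids[0])
    groups = defaultdict(set)
    for rid in records:
        groups[find(parent, rid)].add(rid)
    return sorted(groups.values(), key=sorted)


records = {
    "A": {"name": "Joe's Cookies", "website": None},
    "B": {"name": "Joe's Cookie's", "website": "joescookies.com"},
    "C": {"name": None, "website": "joescookies.com"},
    "D": {"name": "Sequoia", "website": "example.com"},
}
groups = multilevel_groups(records, [name_key, website_key])
```

A name-only pass leaves C on its own because it has no name; adding the website pass links C to B, and therefore to A.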

Page 34: DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

A: Joe’s Cookies
B: Joe’s Cookie’s, joescookies.com
C: joescookies.com

LSH on “name”: A => “Joe Cookie”, B => “Joe Cookie”, C => “”

LSH on “website”: B => “joescookies.com”, C => “joescookies.com”

[Figure labels: Clique #1, Clique #2, Clique #3]

Page 35: DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

Clique Choosing

• We now have all potential cliques, spread across the cluster

• We now need to choose the best cliques

• Remember: choosing one clique invalidates others

• Fundamentally a Serial Algorithm!

Page 36: DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

Clique Choosing

RDD[T].toLocalIterator() : Iterator[T]

• Produces an iterator on the driver that seamlessly iterates over every partition, fetching one partition at a time

Page 37: DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

Clique Choosing

Page 38: DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

Clique Choosing

uh oh

Page 39: DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

Challenge

• We need to keep track of which entities we’ve “touched”

• But using a HashSet means we will start eating a lot of memory

Page 40: DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

Primer: Bloom Filters

trait BloomFilter[T] {
  def mightContain(obj: T): Boolean
  def put(obj: T): Unit
}

example: 1 MB @ 0.5% error (false-positive rate) => 130 KB
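The clique-choosing slides that follow have no surviving content, so here is a rough sketch of how the two pieces could fit together: a small Bloom filter with the same two operations, and a serial chooser that uses it in place of a HashSet. The class, the hashing scheme and the sample cliques are assumptions for this example; `sorted_cliques` stands in for the iterator that `toLocalIterator` would hand to the driver.

```python
import hashlib
import math


class BloomFilter:
    """Minimal Bloom filter: no false negatives, tunable false-positive rate."""

    def __init__(self, expected_items, error_rate):
        # Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        bits = -expected_items * math.log(error_rate) / math.log(2) ** 2
        self.num_bits = max(8, math.ceil(bits))
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, obj):
        # Derive k bit positions from two halves of one digest (double hashing).
        digest = hashlib.sha256(repr(obj).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def put(self, obj):
        for pos in self._positions(obj):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, obj):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(obj))


def choose_cliques(sorted_cliques, touched):
    """Yield cliques, best first, skipping any whose entities may be touched."""
    for clique in sorted_cliques:
        if not any(touched.might_contain(entity) for entity in clique):
            for entity in clique:
                touched.put(entity)
            yield clique


# Hypothetical cliques, already sorted by descending score.
cliques = [("A", "B"), ("A", "B", "C"), ("C", "D"), ("E",)]
chosen = list(choose_cliques(cliques, BloomFilter(100, 0.001)))
```

The trade-off is that a false positive makes the chooser skip a clique it could have taken, so the error rate bounds how often a valid match is dropped rather than how often a wrong one is accepted.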

Page 41: DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

Clique Choosing w/ Bloom Filters

Page 42: DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

Clique Choosing w/ Bloom Filters

Page 43: DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

Recap

• Challenge: Get data to the right machine. Solution: Use Locality-Sensitive Hashing

• Challenge: Choose the best cliques. Solution: Use a serial iterator and Bloom filters to keep memory low

Page 44: DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

Temporal ER

Page 45: DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

Temporal Entity Resolution

T1               T2
Ms Sally Smith   Mrs Sally Doe
thefacebook.com  facebook.com
Zen Payroll      Gusto

Page 46: DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

Temporal Entity Resolution

A: Zen Payroll, zenpayroll.com
B: Gusto, gusto.com

[Graph: edge between A and B with weight -1000]

Page 47: DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

Temporal Entity Resolution

A: Zen Payroll, zenpayroll.com
B: Gusto, gusto.com
C: Zen Payroll <=> Gusto, zenpayroll.com <=> gusto.com

[Graph: nodes A, B, C; edge weights shown: +100, +100, -1000]

Page 48: DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

Iterative Poison Pills

• Basic Idea: Use ER techniques we’ve already established

• Introduce “poison pills” that can break up cliques if temporal properties don’t match

• Iteratively use the poison pills to match on increasingly temporally-aware entities
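As a loose illustration of the kick-out step, the sketch below treats a "poison pill" as a validity window on a property value and removes clique members observed outside that window. The record fields, the `PoisonPill` shape and the years are hypothetical; the talk's actual mechanism is not spelled out in the transcript.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    rid: str
    domain: str
    year: int  # year in which the record was observed


@dataclass(frozen=True)
class PoisonPill:
    """Hypothetical rule: `domain` identifies the entity only in this window."""
    domain: str
    first_year: int  # inclusive
    last_year: int   # inclusive


def apply_pills(clique, pills):
    """Split a clique into (kept, kicked_out) using temporal validity windows."""
    kept, kicked = [], []
    for rec in clique:
        windows = [p for p in pills if p.domain == rec.domain]
        # A record with no applicable rule is kept as-is.
        ok = not windows or any(p.first_year <= rec.year <= p.last_year for p in windows)
        (kept if ok else kicked).append(rec)
    return kept, kicked


# A clique produced by regular ER, with made-up observation years.
clique = [
    Record("r1", "gusto.com", 2016),
    Record("r2", "gusto.com", 2010),
    Record("r3", "zenpayroll.com", 2014),
]
pills = [
    PoisonPill("gusto.com", 2015, 9999),
    PoisonPill("zenpayroll.com", 0, 2015),
]
kept, kicked = apply_pills(clique, pills)
```

In an iterative scheme, the kicked-out records would go back into the next round of regular ER, where they can join a different clique.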

Page 49: DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

Temporal Poison Pills

Records (labelled A to E on the slide): gusto.com (Payroll) 2016; gusto.com (Travel) 2010; gusto.com < 2015; gusto.com, zenpayroll.com > 2015; zenpayroll.com (Payroll) 2014

Perform Regular ER => {A, C, D, E} {B, E}

Kick Out Entities That Don’t Match Temporal Requirements => {A, D} gusto.com < 2015; {B, E} gusto.com > 2015, zenpayroll < 2014; {C, E} gusto, 2016

Perform Regular ER (now with more temporal fields available) => {A, C, D} {B, C, E}

Page 50: DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

Temporal Entity Resolution

• Very Computationally Expensive

• Requires Significant Tuning & Tweaking to Keep Tractable

• Considered one of the Holy Grails of ER

Page 51: DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

Learned ER

Page 52: DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

Recap

• Gorilla in the room: All of our scoring has been manual

Page 53: DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

Supervised Learning ER

• Basic Idea: Use a training set to learn the weights in our scoring functions

• Disclaimer: Only proceed with this if you have very complex scoring properties
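A minimal sketch of learning scoring weights from labelled pairs, using logistic regression trained by batch gradient descent in plain Python. The feature layout (name similarity, same website, same geo), the training pairs and the hyperparameters are all invented for this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def match_probability(weights, bias, features):
    """Probability that a pair of records refers to the same entity."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


def train_weights(examples, epochs=2000, learning_rate=0.5):
    """Fit weights by gradient descent on the logistic loss.

    examples: list of (features, label), label 1 = same entity, 0 = different.
    """
    n = len(examples[0][0])
    weights, bias = [0.0] * n, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n, 0.0
        for features, label in examples:
            error = match_probability(weights, bias, features) - label
            for i, x in enumerate(features):
                grad_w[i] += error * x
            grad_b += error
        weights = [w - learning_rate * g / len(examples) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(examples)
    return weights, bias


# Hypothetical training pairs: (name_similarity, same_website, same_geo) -> label
EXAMPLES = [
    ((1.0, 1, 1), 1),
    ((0.4, 1, 0), 1),
    ((0.9, 1, 0), 1),
    ((0.3, 0, 1), 0),
    ((0.2, 0, 0), 0),
    ((0.8, 0, 0), 0),
]
weights, bias = train_weights(EXAMPLES)
```

On this toy data the learned weights lean heavily on the website feature, since it alone separates the matching pairs from the non-matching ones.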

Page 54: DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

Supervised Learning ER

Page 55: DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

Supervised Learning ER

Page 56: DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

More Learning Opts

• Gradient Descent: What if we viewed the system as having an overall “error”? We can then use Gradient Descent to search for the solution that minimizes it.

• Very very computationally intense

Page 57: DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark

Questions?

Thanks!

[email protected]

Page 58: DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark