Yael Elmatad, Senior Data Scientist, Tapad at MLconf NYC - 4/15/16

Page 1:

Beyond the Classifier, Inspiration from Engineering Algorithms

Yael Elmatad, Data Scientist at Tapad, @y_s_e

ML Conf NYC, April 15, 2016

Page 2:

Introduction to Tapad

Tapad is a marketing technology company that seeks to bridge the gap between users’ various screens.

Page 3:

Tapad’s Solution: The Device Graph™

Page 4:

Modeling Identity is Hard

1. Identifier persistence and accuracy

2. Conflicting data

3. Grouping keys / Transitive properties

4. User Privacy and Data Governance

5. Use case flexibility

Page 6:

Focus: Identifier Persistence & Groupings

Grouping keys

How can we effectively, at scale, determine groups of identifiers?

Identifier Persistence

How can we make sure that these identifiers are persistent in time?

Spoiler Alert

No classifiers, recommender systems, or community detection in sight.

Page 7:

Grouping: Connected Components

● Over 1.4 billion devices in each weekly Device Graph

● There are 6.6 billion connections between these devices

Question:

How do we determine connected components at scale?

Previous attempts:

Various graph-based databases and frameworks (Giraph, GraphX, Cassovary). With these, we were not able to identify clusters at scale.

Current solution:

An algorithm that runs in a logarithmic number of rounds (Hash-to-Min, described next).

Page 8:

Connected Component Basics: Label Propagation

Initialization: assign self as cluster label.

Nodes: A B C D

Cluster Label (Temp): A B C D

Iterations: ask each neighbor for its current label, then take the min of neighbors and self.

Nodes: A B C D

Cluster Label (Temp): A A B C

Stop iterating when no labels change relative to the previous iteration.
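A minimal sketch of the label propagation scheme above, in Python. The graph is assumed to be a path A-B-C-D, which is consistent with the labels shown after one iteration; the function names and the adjacency-list representation are illustrative, not taken from the talk.

```python
def propagate_once(labels, neighbors):
    """One iteration: each node takes the min of its own label and its neighbors' labels."""
    return {
        v: min([labels[v]] + [labels[u] for u in neighbors[v]])
        for v in labels
    }


def label_propagation(neighbors):
    """Iterate until no label changes; return (labels, number of iterations that changed a label)."""
    labels = {v: v for v in neighbors}  # initialization: self as cluster label
    rounds = 0
    while True:
        updated = propagate_once(labels, neighbors)
        if updated == labels:  # stop when nothing changed
            return labels, rounds
        labels, rounds = updated, rounds + 1


# Assumed example graph: the path A-B-C-D.
path = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
first = propagate_once({v: v for v in path}, path)
final, rounds = label_propagation(path)
```

On this path, `first` reproduces the labels A A B C from the slide, and the minimum label needs three iterations to reach D, one per hop.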

Page 9:

Need a More Efficient Solution: Hash-to-Min

Standard message passing takes O(d) rounds, where d = cluster diameter.

Reference: arXiv:1203.5387v2 (arXiv.org > cs)

Page 10:

Hash-to-Min: Initialization

For node v, assign the minimum of v and its neighbors as the cluster label, and a cluster C(v), which is the set of v plus v’s neighbors.

Nodes: A B C D E

Cluster labels: A A B C C

v    C(v)
A    (A,B)
B    (A,B,C)
C    (B,C,D,E)
D    (C,D)
E    (C,E)

Page 11:

Hash-to-Min: Round 1

For each C(v), vmin = minimal member of C(v).

Broadcast C(v) to vmin, and broadcast vmin to all other members of C(v).

Each node v then merges all the C(v) and vmin it receives.

State before the round:

Nodes: A B C D E

Cluster labels: A A B C C

v    C(v)
A    (A,B)
B    (A,B,C)
C    (B,C,D,E)
D    (C,D)
E    (C,E)

Page 12:

Hash-to-Min: Round 1 (result)

Nodes: A B C D E

Cluster labels: A A A B B

v    C(v)
A    (A,B,C)
B    (A,B,C,D,E)
C    (A,C,D,E)
D    (B)
E    (B)

Page 13:

Hash-to-Min: Round 2 + Completion

Nodes: A B C D E

Cluster labels: A A A A A

v    C(v)
A    (A,B,C,D,E)
B    (A,B)
C    (A)
D    (A)
E    (A)

Iterations cease when no updates are made to any C(v).

Completes in O(log(d)) rounds, where d = cluster diameter.

Page 14:

Hash-to-Min: Round 2 + Completion (final state)

Nodes: A B C D E

Cluster labels: A A A A A

v    C(v)
A    (A,B,C,D,E)
B    (A)
C    (A)
D    (A)
E    (A)
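The rounds above can be sketched in Python as follows. This is an in-memory illustration, not Tapad's distributed implementation. Two points are assumptions: the edges A-B, B-C, C-D, C-E are read off the initial C(v) table, and "all other members" is taken to exclude both vmin and the sending node v, which is the reading that reproduces the tables on the slides. The fallback for a node that receives no message is a defensive choice of this sketch and is never triggered in the example.

```python
def init_clusters(edges):
    """C(v) = v plus v's neighbors."""
    clusters = {}
    for u, v in edges:
        clusters.setdefault(u, {u}).add(v)
        clusters.setdefault(v, {v}).add(u)
    return clusters


def hash_to_min_round(clusters):
    """One round: send C(v) to vmin, send vmin to the other members, merge what arrives."""
    inbox = {v: set() for v in clusters}
    for v, members in clusters.items():
        vmin = min(members)
        inbox[vmin] |= members            # broadcast C(v) to vmin
        for u in members:
            if u != v and u != vmin:      # assumed reading of "all other members"
                inbox[u].add(vmin)        # broadcast vmin to u
    # Defensive assumption: a node that received nothing keeps its previous set.
    return {v: (received or clusters[v]) for v, received in inbox.items()}


def hash_to_min(edges):
    """Iterate until no C(v) changes; the cluster label of v is min(C(v))."""
    clusters = init_clusters(edges)
    while True:
        updated = hash_to_min_round(clusters)
        if updated == clusters:
            return {v: min(members) for v, members in clusters.items()}
        clusters = updated


# Edges inferred from the initial C(v) table.
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("C", "E")]
round1 = hash_to_min_round(init_clusters(edges))
round2 = hash_to_min_round(round1)
labels = hash_to_min(edges)
```

Here `round1` and `round2` match the C(v) tables shown after Round 1 and Round 2, and `labels` assigns A to every node.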

Page 15:

Once we have connected components (CC), how do we label them?

First labeling scheme:

Labeled by the lowest device ID participating in the cluster.

Example:

[Figure: a cluster of devices A, B, C, D, E, labeled A]

Only 78% of devices maintain their label after 1 week.

Page 16:

Why 22% Change? ID Expiration & Creation

Label Device Expires:

[Figure: a cluster before and after its label device expires]

New Lowest ID Created:

[Figure: a cluster before and after a new lowest device ID is created]
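A small Python illustration of why min-ID labels are fragile. The cluster contents below are hypothetical toy data, and `min_id_label` is an illustrative name for the first labeling scheme (label = lowest device ID in the cluster).

```python
def min_id_label(cluster):
    """First labeling scheme: the lowest device ID in the cluster."""
    return min(cluster)


# Hypothetical cluster in week 1.
week1 = {"B", "C", "D"}

# Case 1: the label device (B) expires; the surviving devices get a new label.
after_expiry = week1 - {"B"}

# Case 2: a new device with a lower ID (A) joins; every device gets a new label.
after_creation = week1 | {"A"}

label_week1 = min_id_label(week1)
label_after_expiry = min_id_label(after_expiry)
label_after_creation = min_id_label(after_creation)
```

In both cases the membership of the cluster barely changes, yet the label of every remaining device does.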

Page 17:

Why 22% Change? Splits and Merges

Cluster Splits:

[Figure: one labeled cluster splitting into two clusters]

Clusters Merge:

[Figure: two labeled clusters merging into one cluster]

Page 18:

Only a small fraction of label changes are of the Merge/Split variety

Type of change                  Percent
Device Expiration & Creation    > 75%
Cluster Merges & Splits         < 25%

Page 19:

Solution? Map onto Stable-Marriage Problem

Definition of “Stable Marriage”

Given n men and n women, where each person has ranked all members of the opposite sex in order of preference, marry the men and women together such that there are no two people of opposite sex who would both rather have each other than their current partners. When there are no such pairs of people, the set of marriages is deemed stable.

(Wikipedia definition)

Page 20:

Stable-Marriage (By Negation)

Want to pair triangles to circles.

Unstable Match:

[Figure: a circle and a triangle that are matched to other partners but prefer each other; annotation: "Prefer Each Other"]

A stable solution is defined as the lack of these instabilities.

The Gale-Shapley algorithm is a method for finding stable solutions.
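The definition by negation can be made concrete with a short Python check that lists the unstable ("blocking") pairs of a given matching. The names `blocking_pairs`, `circle_prefs`, and `triangle_prefs`, and the two-by-two toy preferences, are hypothetical and only serve the illustration.

```python
def blocking_pairs(matching, circle_prefs, triangle_prefs):
    """Return (circle, triangle) pairs that both prefer each other to their current partners.

    matching maps each circle to its triangle; preference lists are ordered best first.
    """
    circle_of = {t: c for c, t in matching.items()}
    pairs = []
    for c, prefs in circle_prefs.items():
        for t in prefs:
            if t == matching[c]:
                break  # triangles after this point are less preferred than c's partner
            ranks = triangle_prefs[t]
            if ranks.index(c) < ranks.index(circle_of[t]):
                pairs.append((c, t))  # t also prefers c to its current partner
    return pairs


# Hypothetical toy instance: both circles prefer p, both triangles prefer x.
circle_prefs = {"x": ["p", "q"], "y": ["p", "q"]}
triangle_prefs = {"p": ["x", "y"], "q": ["x", "y"]}

unstable = blocking_pairs({"x": "q", "y": "p"}, circle_prefs, triangle_prefs)
stable = blocking_pairs({"x": "p", "y": "q"}, circle_prefs, triangle_prefs)
```

A matching is stable exactly when this list is empty.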

Page 21:

Gale-Shapley Algorithm

[Figure: two groups of nodes to be paired, a, b, c and β, ɣ, δ]

(Psst… Lloyd Shapley shared the 2012 Nobel Prize in Economics for this work)

Page 22:

Gale-Shapley Pre-Iteration (GS0): Rankings

Rankings over β, ɣ, δ:

Rank: (β,ɣ,δ)

Rank: (β,δ,ɣ)

Rank: (δ,ɣ,β)

Rankings over a, b, c:

Rank: (c,b,a)

Rank: (b,a,c)

Rank: (c,a,b)

[Figure: nodes a, b, c and β, ɣ, δ, each annotated with one of the rankings above]

Page 23:

GS1: Circles “Propose” to Triangles

[Figure: the same nodes and rankings as on the GS0 slide, with each circle proposing to its top-ranked triangle]

Page 24:

GS1: Triangles tentatively accept best proposal

[Figure: the same nodes and rankings as on the GS0 slide, with tentative engagements marked]

Page 25:

GS2: Unengaged circles try again

[Figure: the same nodes and rankings as on the GS0 slide, with unengaged circles proposing to their next choice]

Page 26:

GS2: Triangles again tentatively accept best offer

[Figure: the same nodes and rankings as on the GS0 slide, with updated tentative engagements]

Page 27:

GS3: Iterations terminate when all triangles/circles are paired

[Figure: the same nodes and rankings as on the GS0 slide, with the final pairing]
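The GS0 to GS3 walkthrough can be sketched in Python as below. This is the textbook proposer-side Gale-Shapley procedure for equal-sized groups with complete, strict rankings, not Tapad's production variant. The transcript does not preserve which ranking belongs to which node, so the assignment here is an assumption: a, b, c are taken to be the proposing circles, and the rankings are assigned in the order listed. The Greek labels are spelled out as ASCII names.

```python
def gale_shapley(proposer_prefs, acceptor_prefs):
    """Return a stable matching {proposer: acceptor}; preference lists are ordered best first."""
    rank = {
        t: {p: i for i, p in enumerate(prefs)}
        for t, prefs in acceptor_prefs.items()
    }
    next_choice = {p: 0 for p in proposer_prefs}  # index of the next acceptor to try
    engaged_to = {}                               # acceptor -> proposer (tentative)
    free = list(proposer_prefs)
    while free:
        p = free.pop()
        t = proposer_prefs[p][next_choice[p]]
        next_choice[p] += 1
        current = engaged_to.get(t)
        if current is None:
            engaged_to[t] = p                     # t was free: tentatively accept
        elif rank[t][p] < rank[t][current]:
            engaged_to[t] = p                     # t trades up; old partner is free again
            free.append(current)
        else:
            free.append(p)                        # t rejects p; p will try the next choice
    return {p: t for t, p in engaged_to.items()}


# Assumed assignment of the rankings from the GS0 slide.
circles = {
    "a": ["beta", "gamma", "delta"],
    "b": ["beta", "delta", "gamma"],
    "c": ["delta", "gamma", "beta"],
}
triangles = {
    "beta": ["c", "b", "a"],
    "gamma": ["b", "a", "c"],
    "delta": ["c", "a", "b"],
}
matching = gale_shapley(circles, triangles)
```

Under this assumed assignment, a and b both propose to beta first, beta keeps b, and a is accepted by gamma on its second proposal, which is consistent with the two proposal rounds shown on the slides.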

Page 28:

How do we use it at Tapad?

Considerations:

● How do you rank the best labels for your cluster?

● Needs to run at scale, for 100 million label pairs.

● Needs to run in a distributed fashion (MapReduce).

● Needs to be able to handle ties.

● Needs to handle label expiry and new label creation.
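The talk does not say how Tapad ranks labels or breaks ties. As one hedged possibility, the Python sketch below assumes that clusters and last week's labels rank each other by how many devices they share, with ties broken by name. The function name, the input dictionaries, and the overlap criterion are all assumptions of this sketch.

```python
from collections import Counter, defaultdict


def overlap_preferences(old_label_of, new_cluster_of):
    """Build preference lists from device overlap (an assumed ranking criterion).

    old_label_of:   device -> label it carried last week
    new_cluster_of: device -> cluster it belongs to this week
    """
    overlap = Counter()
    for device, cluster in new_cluster_of.items():
        label = old_label_of.get(device)  # new devices carry no old label
        if label is not None:
            overlap[(cluster, label)] += 1

    cluster_prefs = defaultdict(list)
    label_prefs = defaultdict(list)
    # Larger overlap first; ties broken deterministically by (cluster, label) name.
    for (cluster, label), _ in sorted(overlap.items(), key=lambda kv: (-kv[1], kv[0])):
        cluster_prefs[cluster].append(label)
        label_prefs[label].append(cluster)
    return dict(cluster_prefs), dict(label_prefs)


# Hypothetical toy data: d5 is a newly created device.
old_label_of = {"d1": "L1", "d2": "L1", "d3": "L1", "d4": "L2"}
new_cluster_of = {"d1": "c1", "d2": "c1", "d3": "c2", "d4": "c2", "d5": "c2"}
cluster_prefs, label_prefs = overlap_preferences(old_label_of, new_cluster_of)
```

Lists built this way are incomplete and the two sides can differ in size, so a matching step on top of them would need a Gale-Shapley variant that allows unmatched clusters (which receive fresh labels) and unmatched labels (which expire).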

Page 29:

Results & Cluster Stability

Metric:

The % of devices that maintain their cluster label after x weeks.

            Min ID Based    Gale-Shapley Based
1 week      78%             98%
8 weeks     33%             87%
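A minimal Python sketch of the stability metric. It assumes the denominator is the set of devices present in the earlier snapshot, so an expired device counts as not retained; the talk does not specify this. The snapshots are hypothetical toy data, not Tapad results.

```python
def label_retention(labels_then, labels_now):
    """Fraction of devices from the earlier snapshot that still carry the same cluster label."""
    if not labels_then:
        return 0.0
    kept = sum(
        1 for device, label in labels_then.items()
        if labels_now.get(device) == label
    )
    return kept / len(labels_then)


# Hypothetical snapshots: device A (the label device) expires, so B, C, D are relabeled.
labels_then = {"A": "A", "B": "A", "C": "A", "D": "A", "X": "X"}
labels_now = {"B": "B", "C": "B", "D": "B", "X": "X"}
retention = label_retention(labels_then, labels_now)
```

Only X keeps its label in this toy case, so `retention` is 0.2.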

Page 30:

Conclusion

Many challenges which get thrown at data scientists can potentially be solved by deterministic engineering algorithms.

Being familiar with these algorithms prevents data scientists from reinventing the wheel.

Once you start using these algorithms, you start seeing use cases for them everywhere (we use connected components in no fewer than three parts of our graph-building process).

Page 31:

Thank you!

Thanks to the Data Science/Engineering teams at Tapad

Read our blog: http://engineering.tapad.com

Careers: http://www.tapad.com/about-us/careers/openings

(Data Science & Engineering!)

Follow us on twitter: @tapad, @tapadeng

Contact me: [email protected], @y_s_e