
Social Network Signatures: A Framework and Experimental Results

Shawndra Hill, Assistant Professor

Operations and Information Management Department

Wharton, University of Pennsylvania

Kick-off Meeting, July 28, 2008

First Year Review, August 27, 2009

ONR MURI: NexGeNetSci

Social network signatures

Theory
• First principles
• Rigorous math
• Algorithms
• Proofs

Data Analysis
• Correct statistics
• Only as good as underlying data

Numerical Experiments
• Simulation
• Synthetic, clean data

Lab Experiments
• Stylized
• Controlled
• Clean, real-world data

Field Exercises
• Semi-Controlled
• Messy, real-world data

Real-World Operations
• Unpredictable
• After action reports in lieu of data

Motivating example: Repetitive Subscription Fraud

• Large telecommunications company
• Telecom service
• Long experience with fraud detection
• Sophisticated models based on record linkage

Motivating example: Repetitive Subscription Fraud

• Lots of people can’t pay their bill, but they want phone service anyway:

            Account 1                            Account 2
Name        Ted Hanley                           Debra Handley
Address     14 Pearl Dr, St Peters, MN           14 Pearl Dr, St Peters, MN
Balance     $208.00                              $142.00
Status      Disconnected 2/19/04 (nonpayment)    Connected 2/22/04

Motivating Example: Repetitive Fraud

How can we identify that it is the same person behind both accounts?

Motivating Example: Challenges

• This is a problem of record linkage and graph matching, but because of obfuscation, we can only count on entity matching.

• But the number of potential matches is huge:

Connect pool: 10 K/day, 300 K/month
Restrict pool: 5 K/day, 150 K/month
300 K × 150 K = 45 billion comparisons

• If we have an efficient representation of entities, we might be able to make a dent…
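
The comparison count on this slide can be checked with a few lines of arithmetic. This is only a sketch: the 30-day month is an assumption chosen because it reproduces the slide's monthly pool sizes, and the variable names are illustrative.

```python
# Assumed 30-day month; it reproduces the 300 K and 150 K monthly pools.
DAYS_PER_MONTH = 30

connect_pool = 10_000 * DAYS_PER_MONTH   # new connections per month
restrict_pool = 5_000 * DAYS_PER_MONTH   # restricted accounts per month

# Naive matching compares every new connection with every restricted account.
comparisons = connect_pool * restrict_pool
print(f"{comparisons:,}")  # 45,000,000,000
```

The product grows with both pools, which is why a compact per-entity representation matters before any matching is attempted.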

Prior Work: Representation

• Because we are interested in entities, and to facilitate efficient storage, we represent the entire graph as a union of entity graphs.

• These are our atomic units of analysis: each is a signature of the node’s behavior.

• Storing hundreds of millions of small graphs is much more efficient than storing one massive graph, especially in an indexed database.

• Pros: efficiency, recursion. Cons: redundancy.

Example signature:

2222222222   100.3
1111111111    90.1
3213232423    27.0
9098765453    11.3
8876457326     5.4
2122121212     3.0
9908989898     0.9
8887878787     0.1
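
One way to picture the "union of entity graphs" is a keyed store with one small weighted neighbor list per entity. The sketch below is illustrative only: `build_signatures`, the top-k truncation, and the sample records are assumptions, not the representation used in the cited work.

```python
from collections import defaultdict

def build_signatures(edges, k=3):
    """Return {entity: [(neighbor, weight), ...]} keeping the k heaviest links."""
    totals = defaultdict(lambda: defaultdict(float))
    for src, dst, weight in edges:
        # Each link is stored under both endpoints: redundant, but every
        # entity graph can then be fetched with a single keyed lookup.
        totals[src][dst] += weight
        totals[dst][src] += weight
    signatures = {}
    for entity, neighbors in totals.items():
        # Heaviest links first; ties broken by neighbor id for determinism.
        ranked = sorted(neighbors.items(), key=lambda item: (-item[1], item[0]))
        signatures[entity] = ranked[:k]  # top-k cut is an assumption
    return signatures

# Hypothetical records: (entity, linked entity, weight).
edges = [
    ("A", "2222222222", 100.3),
    ("A", "1111111111", 90.1),
    ("A", "3213232423", 27.0),
    ("A", "9098765453", 11.3),
    ("B", "2222222222", 5.0),
]
signatures = build_signatures(edges, k=3)
print(signatures["A"])
```

Storing each link twice is the redundancy the slide lists as a con; the payoff is that comparing two entities only touches two short lists.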

Applying the Method

• Real World Success:

– We identify 50-100 of these cases per day
– 95% match rate
– 85% block rate
– Credited with saving telecom millions of dollars
– By far the most reliable matching criterion is the entity-based matching

*We also demonstrate our method on email and clickstream data

LIMITATION: WE DO NOT SEE FALSE NEGATIVES, SCALE

Other References

S. Hill, M.F. Farone, S. Lombardi, M. Gorgoglione. Using Context for Online Re-Identification. To be submitted to ICDM 2009.

S. Hill and A. Nagle. Social Network Signatures: A Random Graph Approximation Framework for Re-Identification and Experimental Results. Computational Aspects of Social Networks, 2009.

S. Hill and F. Provost. The myth of the double-blind review?: Author identification using only citations. SIGKDD Explorations Newsletter, 5(2):179–184, 2003.

S. Hill, D. K. Agarwal, R. Bell, and C. Volinsky. Building an effective representation for dynamic networks. Journal of Computational and Graphical Statistics, 15(3):584–608, 2006.

R. Holzer, B. Malin, and L. Sweeney. Email alias detection using social network analysis. In LinkKDD ’05: Proceedings of the 3rd international workshop on link discovery, pages 52–57, New York, NY, USA, 2005. ACM.

S. Mehrotra, D. Kalashnikov. Learning importance of relationships for reference disambiguation. In UCI Technical Report RESCUE-04-23, pages 04–23, 2004.

A. Sung, J. Xu and Q. Liu. Behaviour mining for fraud detection. In Professor Sidney Morris, editor, Journal of Research and Practice in Information Technology, volume 39, pages 3–18. Australian Computer Society Inc., 2007.

L. Sweeney, B. Malin. Re-identification of DNA through an automated linkage process. Proc AMIA Symp., pages 423–7, 2001.

C. Hilas and J. Sahalos. User profiling for fraud detection in telecommunication networks. In International Conference on Technology and Automation, pages 382–387, 2005.

L. Sweeney. Guaranteeing anonymity when sharing medical data, the Datafly system. In Journal of the American Medical Informatics Association, 1997.

C. Cortes, D. Pregibon, and C. Volinsky. Computational methods for dynamic graphs. Journal of Computational and Graphical Statistics, 12:950–970, 2003.

E. Minkov, W. Cohen, and A. Ng. Contextual search and name disambiguation in email using graphs. In SIGIR ’06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 27–34, New York, NY, USA, 2006. ACM.

Re-identification works well. BUT why/when?

– Network is not highly clustered (random vs. small world vs. scale free)

– Limited missing links

– Limited change in behavior from one time period to another

Generate Networks and Control Properties

– Network is not highly clustered (random vs. small world vs. scale free)

– Limited missing links

– Limited change in behavior from one time period to another

Random

G(n,p) is a labeled graph with vertex set V(G) = {1,2,…,n}, in which every one of the possible $\binom{n}{2} = n(n-1)/2$ edges exists with probability 0 < p < 1, independent of any other edges. The random graph G(n,m) consists of n nodes or vertices, joined by m links or edges which are placed between pairs of vertices chosen uniformly at random.

M. E. J. Newman, Random graphs as models of networks, in Handbook of Graphs and Networks, S. Bornholdt and H. G. Schuster (eds.), Wiley-VCH, Berlin (2003).
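
The G(n,p) definition translates directly into a generator: visit each of the n(n-1)/2 vertex pairs once and keep it with probability p. This is a minimal standard-library sketch; the function name `gnp` and the adjacency-set layout are illustrative choices.

```python
import random
from itertools import combinations

def gnp(n, p, seed=None):
    """Sample G(n, p) as {vertex: set of neighbours} on vertices 1..n."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(1, n + 1)}
    # Each possible edge is included independently with probability p.
    for u, v in combinations(range(1, n + 1), 2):
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj

graph = gnp(100, 0.38, seed=1)
mean_degree = sum(len(nbrs) for nbrs in graph.values()) / len(graph)
# The expected degree is (n - 1) * p = 37.62; a single sample will be near it.
print(round(mean_degree, 1))
```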

Small World

A small world network is a graph in which any two nodes are likely to be connected through a short sequence of intermediate nodes.

Watts and Strogatz define the following properties of a small world graph:

1. The clustering coefficient C is much larger than that of a random graph with the same number of vertices and average number of edges per vertex.

2. The characteristic path length L is almost as small as L for the corresponding random graph.

Scale-free

A scale-free network is a network whose degree distribution follows a power law, at least asymptotically. That is, the fraction P(k) of nodes in the network having k connections to other nodes goes for large values of k as

$P(k) \sim k^{-\gamma}$

where γ is a constant whose value is typically in the range 2 < γ < 3.

Simulation - Erdős-Rényi

Compare Graph to Itself

$\mathrm{Overlap}(A, B) = |A \cap B|$

NodeA   NodeB   Overlap   Match?
1       1       5         Match
1       2       3         Non-Match
1       3       2         Non-Match
2       2       1         Match
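
The comparison table can be produced by scoring every node pair with the size of the intersection of their neighbor sets. The sketch below uses a small hypothetical graph (not the one behind the slide's numbers), and `compare` is an illustrative helper name.

```python
def overlap(a, b):
    """Number of neighbours the two signatures share."""
    return len(a & b)

def compare(graph_a, graph_b):
    """Score every (NodeA, NodeB) pair; same id means a true match."""
    rows = []
    for node_a, nbrs_a in graph_a.items():
        for node_b, nbrs_b in graph_b.items():
            label = "Match" if node_a == node_b else "Non-Match"
            rows.append((node_a, node_b, overlap(nbrs_a, nbrs_b), label))
    return rows

# Hypothetical 4-node graph, compared to itself.
graph = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2}, 4: {1}}
rows = compare(graph, graph)
for row in rows:
    print(row)
```

When a graph is compared to itself, each match row scores the node's full degree, which is the baseline that perturbation later erodes.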

Simulation - Erdős-Rényi

Compare Graph to Perturbed Version of Itself

$\mathrm{Overlap}(A, B) = |A \cap B|$

NodeA   NodeB   Overlap   Match?
1       1       4         Match
1       2       2         Non-Match
1       3       1         Non-Match
2       2       1         Match

Testbed

Experimental Setup:

• 3 types of graph structure with different parameters (controlled degree distribution, CC, etc.)
• Missing data
• Dynamics

1. Start with graph at time t0

2. Manipulate graph to time tn

3. Compare new graph to old graph by performing pair-wise comparisons

4. Get the match score (overlap) distribution and non-match score distribution

----------------

1. Estimate the amount of overlap between the distributions using our framework.

2. Evaluate performance based on TPR (assuming we want to operate above a threshold that gives few or no false positives)
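
The four setup steps and the TPR evaluation can be sketched end to end for the missing-data case. Everything here is an assumption made for illustration: the Erdős-Rényi generator, uniform random link removal as the "manipulation", and a threshold set at the largest non-match score so that no false positives occur.

```python
import random
from itertools import combinations

def gnp(n, p, rng):
    adj = {v: set() for v in range(n)}
    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj

def drop_links(adj, fraction, rng):
    """Step 2: remove a random fraction of links (missing data)."""
    edges = [(u, v) for u in adj for v in adj[u] if u < v]
    removed = set(rng.sample(edges, int(fraction * len(edges))))
    kept = {u: set() for u in adj}
    for u, v in edges:
        if (u, v) not in removed:
            kept[u].add(v)
            kept[v].add(u)
    return kept

def score_distributions(old, new):
    """Steps 3-4: pair-wise overlaps, split into match and non-match."""
    match, non_match = [], []
    for a in old:
        for b in new:
            score = len(old[a] & new[b])
            (match if a == b else non_match).append(score)
    return match, non_match

def tpr_without_false_positives(match, non_match):
    threshold = max(non_match)  # operate above every non-match score
    return sum(score > threshold for score in match) / len(match)

rng = random.Random(0)
g0 = gnp(100, 0.38, rng)       # step 1: graph at time t0
gn = drop_links(g0, 0.5, rng)  # step 2: graph at time tn
match, non_match = score_distributions(g0, gn)
tpr = tpr_without_false_positives(match, non_match)
print(tpr)
```

The pair-wise loop is quadratic in the number of nodes, which is acceptable for a 100-node testbed and is exactly the cost the estimation framework is meant to avoid at scale.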

Random

For a given clustering coefficient (0.63), the degree (63, 126, 252) and size (100, 200, 300) of the graph matter. We can estimate the mean of the non-match population based on the mean of the match population.

Same clustering cc, different degree, size

Small World

For a given clustering coefficient (0.63), the degree (63, 126, 252) and size (100, 200, 300) of the graph matter. We can estimate the mean of the non-match population based on the mean of the match population.

Same clustering cc, different degree, size

Scale Free

For a given clustering coefficient (0.38), the degree (19.5, 40.5, 52.8), size (100, 200, 300)

Estimating Match/Non-Match Distributions

If we assume an Erdős-Rényi random graph, the degree distribution is binomial:

$P(\deg(v) = k) = \binom{n-1}{k} p^k (1-p)^{n-1-k}$

where k is the degree, n is the number of nodes, and p is the likelihood of a link between any two nodes.

Match:

$\mu_m = (n-1)p$

$\sigma_m = \sqrt{(n-1)p(1-p)}$

Non-Match:

$\mu_{nm} = p\,\mu_m = (n-1)p^2$

$\sigma_{nm} = \sqrt{(n-1)p^2(1-p^2)}$

Clustering coefficient:

$cc = p = \mu_m/(n-1)$
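
Under the random graph approximation, the match and non-match overlap scores are binomial with n - 1 trials and success probabilities p and p² respectively, so their means and standard deviations follow in closed form. A small sketch, with illustrative function names and example values of n and p chosen only to give round numbers:

```python
import math

def match_moments(n, p):
    """Mean and standard deviation of the match overlap, Binomial(n - 1, p)."""
    mean = (n - 1) * p
    std = math.sqrt((n - 1) * p * (1 - p))
    return mean, std

def non_match_moments(n, p):
    """Mean and standard deviation of the non-match overlap, Binomial(n - 1, p^2)."""
    mean = (n - 1) * p ** 2
    std = math.sqrt((n - 1) * p ** 2 * (1 - p ** 2))
    return mean, std

n, p = 101, 0.5  # example values, not taken from the experiments
print(match_moments(n, p))      # (50.0, 5.0)
print(non_match_moments(n, p))  # mean 25.0, standard deviation about 4.33
```

Because cc = p in this model, knowing the match mean and the clustering coefficient is enough to place the non-match distribution without computing a single non-match pair.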

Estimating Overlap of Match/Non-Match Distributions

Set the probability density functions for the match and non-match distributions equal and solve for x, the point at which the two distributions intersect.

Above this point, we expect more matches than non-matches for each value of x:

$n_m \frac{1}{\sigma_m \sqrt{2\pi}} \exp\left(-\frac{(x-\mu_m)^2}{2\sigma_m^2}\right) = n_{nm} \frac{1}{\sigma_{nm} \sqrt{2\pi}} \exp\left(-\frac{(x-\mu_{nm})^2}{2\sigma_{nm}^2}\right)$

where $n_m$ and $n_{nm}$ are taken here to be the numbers of match and non-match comparisons.

Alternatively, we can set $x = \mu_{nm} + 4\sigma_{nm}$, the point above more than 99.99% of non-matches under the normal approximation.

In either case, we can use the Z-score to estimate the number of positives and negatives above x.

From there we can calculate the True Positive Rate above x: the percent of all the matches that we label as matches.
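
The 4σ variant of the threshold is easy to sketch with the normal approximation: place x four non-match standard deviations above the non-match mean, then read the predicted TPR off the match distribution's upper tail. The function names and the example moments (n = 101, p = 0.5) are illustrative assumptions.

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def predicted_tpr(mu_m, sigma_m, mu_nm, sigma_nm, k=4.0):
    """Threshold x = mu_nm + k * sigma_nm and the share of matches above it."""
    threshold = mu_nm + k * sigma_nm
    z = (threshold - mu_m) / sigma_m  # Z-score of x within the match distribution
    return threshold, 1.0 - normal_cdf(z)

# Example moments for n = 101, p = 0.5 under the random graph approximation.
threshold, tpr = predicted_tpr(50.0, 5.0, 25.0, math.sqrt(18.75))
print(round(threshold, 2), round(tpr, 3))
```

Solving the intersection equation instead of fixing k would account for the much larger non-match population; the fixed-k rule trades that precision for a one-line computation.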

Random: Estimates and Simulations

Legend: Match, Non-Match

Performance on Random Graphs

           100 Nodes, cc = .38           300 Nodes, cc = .63
%Links     Actual TPR   Predicted TPR    Actual TPR   Predicted TPR
0          0            0                0            0
10         0.101        0.206            0.077        0.089
20         0.260        0.401            0.297        0.319
30         0.420        0.560            0.653        0.552
40         0.570        0.692            0.807        0.739
50         0.810        0.805            0.860        0.862
60         0.860        0.860            0.930        0.936
70         0.870        0.910            0.980        0.975
80         0.900        0.945            0.997        0.992
90         0.980        0.968            1.000        0.998
100        0.980        0.982            1.000        1.000

What about more “realistic” graph structures?

Network                  Size     Clustering Coefficient   Average Path Length   Degree Exponent
Internet, domain level   32711    0.24                     3.56                  2.1
Internet, router level   228298   0.03                     9.51                  2.1
WWW                      153127   0.11                     3.1                   In = 2.1, out = 2.45
E-mail                   56969    0.03                     4.95                  1.81
Software                 1376     0.06                     6.39                  2.5
Electronic Circuits      329      0.34                     3.17                  2.5
Language                 460902   0.437                    2.67                  2.7
Math. Co-authorship      70975    0.59                     9.50                  2.5
Food Web                 154      0.15                     3.40                  1.13
Metabolic System         778      -                        3.2                   In = out = 2.2

Xiao Fan Wang and Guanrong Chen. Complex networks: small-world, scale-free and beyond. Circuits and Systems Magazine, IEEE, 3(1):6–20, 2003.

Estimating Match/Non-Match Distributions

If we assume an Erdős-Rényi random graph, the degree distribution is binomial:

$P(\deg(v) = k) = \binom{n-1}{k} p^k (1-p)^{n-1-k}$

where k is the degree, n is the number of nodes, and p is the likelihood of a link between any two nodes.

Match:

$\mu_m = (n-1)p$

$\sigma_m = \sqrt{(n-1)p(1-p)}$

Non-Match:

$\mu_{nm} = p\,\mu_m = (n-1)p^2$

$\sigma_{nm} = \sqrt{(n-1)p^2(1-p^2)}$

Clustering coefficient:

$cc = p = \mu_m/(n-1)$

cc = number of closed triplets / number of triplets
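
The triplet definition of cc can be computed directly from adjacency sets: a triplet is a path u - center - v, and it is closed when u and v are also linked. A minimal sketch with an illustrative function name and two toy graphs:

```python
from itertools import combinations

def clustering_coefficient(adj):
    """Global cc = number of closed triplets / number of triplets."""
    closed = 0
    triplets = 0
    for center, nbrs in adj.items():
        for u, v in combinations(nbrs, 2):
            triplets += 1        # path u - center - v
            if v in adj[u]:
                closed += 1      # u and v are linked, so the triplet is closed
    return closed / triplets if triplets else 0.0

triangle = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}}
path = {1: {2}, 2: {1, 3}, 3: {2}}
print(clustering_coefficient(triangle))  # 1.0
print(clustering_coefficient(path))      # 0.0
```

For non-random structures this measured value is what would stand in for p when the random graph formulas are used as an approximation.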

Small World: Estimates and Simulations

Legend: Match, Non-Match

Performance on Watts-Strogatz Small World Graphs with p = .2

           100 Nodes, cc = .38           300 Nodes, cc = .63
%Links     Actual TPR   Predicted TPR    Actual TPR   Predicted TPR
0          0            0                0            0
10         0.161        0.062            0.026        0.030
20         0.312        0.250            0.123        0.316
30         0.468        0.350            0.270        0.590
40         0.591        0.510            0.440        0.753
50         0.695        0.710            0.598        0.886
60         0.779        0.760            0.738        0.980
70         0.843        0.850            0.848        0.983
80         0.894        0.920            0.923        0.996
90         0.930        0.960            0.967        1
100        0.957        0.970            0.989        1

*Note: CC same as random graphs, average degree is different

What about “realistic” graph structures with non-normal degree distributions?

RMAT Package to simulate scale free, highly clustered graphs (cc = .38)

If we remove the top 10% high degree nodes, we perform well using our strategy

Downside: We don’t build a model for every node “type”
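
Removing the highest-degree nodes before matching can be sketched as a simple pruning pass. The helper `drop_top_degree`, its tie-breaking rule, and the toy hub graph are hypothetical; the slides do not describe how the pruning was implemented.

```python
def drop_top_degree(adj, fraction=0.10):
    """Remove the given fraction of nodes with the highest degree."""
    n_drop = int(len(adj) * fraction)
    # Highest degree first; ties broken by node id for determinism.
    ranked = sorted(adj, key=lambda node: (-len(adj[node]), node))
    dropped = set(ranked[:n_drop])
    return {
        node: nbrs - dropped       # also erase links into dropped nodes
        for node, nbrs in adj.items()
        if node not in dropped
    }

# Hypothetical 10-node graph: node 0 is a hub linked to everyone else.
graph = {0: set(range(1, 10))}
for node in range(1, 10):
    graph[node] = {0}
graph[1].add(2)
graph[2].add(1)

pruned = drop_top_degree(graph)
print(pruned[1], pruned[3])
```

Hubs appear in almost every signature, so they inflate non-match overlap; pruning them pulls the degree distribution closer to the regime where the normal approximation holds.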

Scale Free: Estimates and Simulations

Legend: Match, Non-Match

Missing Data vs. Dynamics

$\mu_{mx} = \mu_m - (p_x \mu_m)$

where $p_x$ is the percentage of missing links. The corresponding non-match mean, $\mu_{nmx}$, is expressed in terms of $cc$, $\mu_m$ and $p_x$.

Panels: Missing Data, Dynamics

Random: Missing Data

The lower the clustering coefficient, the better the TPR and AUC (not shown) performance. Degree matters for a given clustering coefficient.

*TPR = TP / P = TP / (TP + FN)

Small World: Missing Data

The lower the clustering coefficient, the better the performance with respect to TPR and AUC (not shown). For the same clustering coefficient, you can handle more missing data when the degree is higher.

Back to the Real World – Reality Mining – Call Detail Data

– Cell phone usage of 100 MIT students, faculty and staff
– Consider only within-subject calls
– Performance in the baseline: TPR = .347 predicted, .342 actual at the 50/50 cutoff

*Nathan Eagle, Alex Pentland

Next Steps – beyond the random graph approximation

• Citation, IM, call detail, comScore data

Next Steps - behavior

• Models of behavior
• Models of emerging graphs

Scale Free

Next Steps - Similarity Scores

A large amount of the overlap can be overcome by using a score that takes the degree of the connected nodes into account.

TPR & CC=0.37 Overlap

Missing   100        200      300
100       0          0        0
90        0.048387   0.0403   0.1445
80        0.174603   0.2421   0.3593
70        0.390625   0.3046   0.4648
60        0.4375     0.4062   0.5898
50        0.421875   0.5625   0.625
40        0.640625   0.5392   0.6953
30        0.578125   0.5859   0.6875
20        0.6875     0.5703   0.7265
10        0.6875     0.6328   0.7421
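
The slides do not say which degree-aware score is intended. As one possible illustration, the sketch below swaps in an Adamic-Adar-style weighting, where each shared neighbor counts for 1 / log(1 + degree) so that shared hubs contribute little. The function name, the weighting, and the sample degrees are all assumptions.

```python
import math

def weighted_overlap(nbrs_a, nbrs_b, degree):
    """Overlap in which each shared neighbour is down-weighted by its degree.

    Adamic-Adar-style weighting, used here only as an illustrative choice.
    """
    return sum(1.0 / math.log(1.0 + degree[v]) for v in nbrs_a & nbrs_b)

# Hypothetical degrees: a hub everyone contacts and a rarely contacted friend.
degree = {"hub": 500, "friend": 3}
a_old = {"hub", "friend"}

via_hub = weighted_overlap(a_old, {"hub"}, degree)
via_friend = weighted_overlap(a_old, {"friend"}, degree)
print(via_hub < via_friend)  # True
```

With plain overlap both candidates would score 1; the weighted score prefers the candidate that shares the rare contact, which is the kind of separation the slide is after.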

Next Steps – Context

• Experiments – in the lab
  S. Hill, M.F. Farone, S. Lombardi, M. Gorgoglione (2009) Using Context for Online Re-Identification.

• Reality Mining data sets

• Cross cultural

Next Steps

• Close loop and apply to real world
  – IM
  – Telecom
  – Citation Networks
• Skewed, scale-free degree distributions
• Small vs. large networks
• Top-k analysis
• Explore different overlap scores
• Combine overlap score with additional information
• Explore the question of how one hides on a social network
• Cost of waiting for new information

Re-identification works well. BUT why/when?

– Network is not highly clustered (random vs. small world vs. scale free)

– Limited missing links

– Limited change in behavior from one time period to another

If you have an estimate of the above, and the degree distribution is normal, you can estimate performance!

Take Aways

• A guarantee for re-identification for a certain class of graphs

• A framework that does not rely on pair-wise comparisons, one of the main challenges of the extant approach

• Missing data doesn’t hurt re-identification performance as much as noise

• Pruning noisy links should prove to be beneficial

• If you want to hide, change your behavior significantly!

More …

• C. Cortes, D. Pregibon, and C. Volinsky. Computational Methods for Dynamic Graphs. Journal of Computational and Graphical Statistics, 12:950–970, 2003.

• S. Hill, D. K. Agarwal, R. Bell, C. Volinsky. Building an effective representation for dynamic networks. Journal of Computational and Graphical Statistics, 15(3):584–608, 2006.

• S. Hill and A. Nagle. Social Network Signatures: A Random Graph Approximation for Re-Identification and Experimental Results. In Proceedings of Computational Aspects of Social Networks, pp 23-33, 2009.

• S. Hill, M.F. Farone, S. Lombardi, M. Gorgoglione. Using Context for Online Re-Identification.

Thanks

ONR MURI: NexGeNetSci
