Social Network Signatures: A Framework and Experimental Results
Shawndra Hill, Assistant Professor
Operations and Information Management Department
Wharton, University of Pennsylvania
Kick-off Meeting, July 28, 2008
First Year Review, August 27, 2009
ONR MURI: NexGeNetSci
Social network signatures
• Theory: first principles; rigorous math; algorithms; proofs
• Data Analysis: correct statistics; only as good as underlying data
• Numerical Experiments: simulation; synthetic, clean data
• Lab Experiments: stylized; controlled; clean, real-world data
• Field Exercises: semi-controlled; messy, real-world data
• Real-World Operations: unpredictable; after action reports in lieu of data
Motivating example: Repetitive Subscription Fraud
• Large telecommunications company
• Telecom service
• Long experience with fraud detection
• Sophisticated models based on record linkage
Motivating example: Repetitive Subscription Fraud
• Lots of people can’t pay their bill, but they want phone service anyway:
            Account 1                            Account 2
Name        Ted Hanley                           Debra Handley
Address     14 Pearl Dr, St Peters, MN           14 Pearl Dr, St Peters, MN
Balance     $208.00                              $142.00
Status      Disconnected 2/19/04 (nonpayment)    Connected 2/22/04
Motivating Example: Repetitive Fraud
How can we identify that it is the same person behind both accounts?
Motivating Example: Challenges
• This is a problem of record linkage and graph matching, but because of obfuscation, we can only count on entity matching.
• But the number of potential matches is huge…
– Connect pool: 10 K/day, 300 K/month
– Restrict pool: 5 K/day, 150 K/month
– 45 billion comparisons
• If we have an efficient representation of entities, we might be able to make a dent…
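
The 45 billion figure is consistent with pairing every monthly connect with every monthly restrict. A quick arithmetic check, using only the pool sizes quoted on the slide (the variable names are illustrative):

```python
# Monthly pool sizes quoted on the slide.
connects_per_month = 300_000    # accounts in the connect pool
restricts_per_month = 150_000   # accounts in the restrict pool

# Naive matching compares every connect against every restrict.
pairwise_comparisons = connects_per_month * restricts_per_month
print(f"{pairwise_comparisons:,} comparisons")  # 45,000,000,000 comparisons
```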
Prior Work: Representation
• Because we are interested in entities, and to facilitate efficient storage, we represent the entire graph as a union of entity graphs.
• These are our atomic units of analysis, a signature of the node’s behavior.
• Storing hundreds of millions of small graphs is much more efficient than storing one massive graph, especially in an indexed database.
• Pros: efficiency, recursion. Cons: redundancy.
Example signature:
2222222222   100.3
1111111111    90.1
3213232423    27.0
9098765453    11.3
8876457326     5.4
2122121212     3.0
9908989898     0.9
8887878787     0.1
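
A minimal sketch of how such per-entity signatures could be assembled, assuming the input is a list of weighted (source, destination, weight) records and that a signature keeps only the k heaviest neighbors. The function name `build_signatures`, the record layout, and the sample values are assumptions for illustration, not the representation used in the cited work.

```python
from collections import defaultdict

def build_signatures(records, k=3):
    """Return {node: [(neighbor, weight), ...]} with the k heaviest neighbors per node."""
    weights = defaultdict(lambda: defaultdict(float))
    for source, destination, weight in records:
        weights[source][destination] += weight  # accumulate repeated contacts
    return {
        node: sorted(nbrs.items(), key=lambda item: item[1], reverse=True)[:k]
        for node, nbrs in weights.items()
    }

# Hypothetical records: (source, destination, weight).
records = [
    ("A", "2222222222", 100.3),
    ("A", "1111111111", 90.1),
    ("A", "3213232423", 27.0),
    ("A", "9098765453", 11.3),
    ("B", "1111111111", 5.0),
]
signatures = build_signatures(records, k=3)
print(signatures["A"])
```

Each signature is a small, independently stored graph, which is what makes indexed lookup cheap; the cost is that an edge appears in the signatures of both endpoints.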
Applying the Method
• Real World Success:
– We identify 50-100 of these cases per day
– 95% match rate
– 85% block rate
– Credited with saving telecom millions of dollars
– By far the most reliable matching criterion is the entity-based matching
*We also demonstrate our method on email and clickstream data
LIMITATION: WE DO NOT SEE FALSE NEGATIVES, SCALE
Other References
S. Hill, M.F. Farone, S. Lombardi, M. Gorgoglione. Using Context for Online Re-Identification. To be submitted to ICDM 2009.
S. Hill and Akash Nagle. Social Network Signatures: A Random Graph Approximation Framework for Re-Identification and Experimental Results. Computational Aspects of Social Networks, 2009.
S. Hill and F. Provost. The myth of the double-blind review?: Author identification using only citations. SIGKDD Explorations Newsletter, 5(2):179–184, 2003.
S. Hill, D. K. Agarwal, R. Bell, and C. Volinsky. Building an effective representation for dynamic networks. Journal of Computational and Graphical Statistics, 15(3):584–608, 2006.
R. Holzer, B. Malin, and L. Sweeney. Email alias detection using social network analysis. In Link KDD ’05: Proceedings of the 3rd international workshop on link discovery, pages 52–57, New York, NY, USA, 2005. ACM.
S. Mehrotra, D. Kalashnikov. Learning importance of relationships for reference disambiguation. In UCI Technical Report RESCUE-04-23, pages 04–23, 2004.
A. Sung, J. Xu and Q. Liu. Behaviour mining for fraud detection. In Professor Sidney Morris, editor, Journal of Research and Practice in Information Technology, volume 39, pages 3–18. Australian Computer Society Inc., 2007.
L. Sweeney, B. Malin. Re-identification of dna through an automated linkage process. Proc AMIA Symp., pages 423–7, 2001.
C. Hilas and J. Sahalos. User profiling for fraud detection in telecommunication networks. In International Conference on Technology and Automation, pages 382–387, 2005.
L. Sweeney. Guaranteeing anonymity when sharing medical data, the datafly system. In Journal of the American Medical Informatics Association, 1997
C. Cortes, D. Pregibon, and C. Volinsky. Computational methods for dynamic graphs. Journal of Computational and Graphical Statistics, 12:950–970, 2003.
E. Minkov, W. Cohen, and A. Ng. Contextual search and name disambiguation in email using graphs. In SIGIR ’06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 27–34, New York, NY, USA, 2006. ACM.
Re-identification works well. BUT why/when?
– Network is not highly clustered (random vs. small world vs. scale free)
– Limited missing links
– Limited change in behavior from one time period to another
Generate Networks and Control Properties
– Network is not highly clustered (random vs. small world vs. scale free)
– Limited missing links
– Limited change in behavior from one time period to another
Random
G(n,p) is a labeled graph with vertex set V(G) = {1,2,…,n}, in which every one of the possible $\binom{n}{2}$ edges exists with probability 0 < p < 1, independent of any other edges. The random graph G(n,m) consists of n nodes or vertices, joined by m links or edges which are placed between pairs of vertices chosen uniformly at random.
M. E. J. Newman, Random graphs as models of networks, in Handbook of Graphs and Networks, S. Bornholdt and H. G. Schuster (eds.), Wiley‐VCH, Berlin (2003).
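
As a hedged illustration of the G(n,p) definition, the sketch below samples such a graph with the standard library only. The function name `gnp_random_graph` and the adjacency-set layout are choices made for this example, not part of the original testbed.

```python
import random
from itertools import combinations

def gnp_random_graph(n, p, seed=None):
    """Sample G(n, p): each of the n*(n-1)/2 possible edges exists with probability p."""
    rng = random.Random(seed)
    neighbors = {v: set() for v in range(1, n + 1)}
    for u, v in combinations(range(1, n + 1), 2):
        if rng.random() < p:  # independent coin flip per possible edge
            neighbors[u].add(v)
            neighbors[v].add(u)
    return neighbors

graph = gnp_random_graph(n=100, p=0.38, seed=0)
mean_degree = sum(len(nbrs) for nbrs in graph.values()) / len(graph)
print(round(mean_degree, 2))  # should land near (n - 1) * p = 37.62
```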
Small World
A small world network is a graph in which any two nodes are likely to be connected through a short sequence of intermediate nodes.
Watts and Strogatz define the following properties of a small world graph:
1. The clustering coefficient C is much larger than that of a random graph with the same number of vertices and average number of edges per vertex.
2. The characteristic path length L is almost as small as L for the corresponding random graph.
Scale-free
A scale‐free network is a network whose degree distribution follows a power law, at least asymptotically. That is, the fraction P(k) of nodes in the network having k connections to other nodes goes for large values of k as
$P(k) \sim k^{-\gamma}$
where $\gamma$ is a constant whose value is typically in the range $2 < \gamma < 3$.
Simulation - Erdős-Rényi
Compare Graph to Itself
$\mathrm{Overlap}(A, B) = |A \cap B|$
NodeA   NodeB   Overlap   Match?
1       1       5         Match
1       2       3         Non-Match
1       3       2         Non-Match
2       2       1         Match
Simulation - Erdős-Rényi
Compare Graph to Perturbed Version of Itself
$\mathrm{Overlap}(A, B) = |A \cap B|$
NodeA   NodeB   Overlap   Match?
1       1       4         Match
1       2       2         Non-Match
1       3       1         Non-Match
2       2       1         Match
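
The overlap score can be sketched as a set intersection over neighbor sets. The two small graphs below are hypothetical and do not reproduce the table values; they only show a node compared with itself (match) and with other nodes (non-match) after one link is removed.

```python
def overlap(neighbors_a, neighbors_b):
    """Overlap score: number of neighbors shared by the two signatures."""
    return len(set(neighbors_a) & set(neighbors_b))

# Hypothetical neighbor sets at two time periods.
before = {1: {2, 3, 4, 5, 6}, 2: {1}, 3: {1, 4}}
after = {1: {2, 3, 4, 5}, 2: {1}, 3: {1, 4}}  # node 1 lost its link to node 6

for a in before:
    for b in after:
        label = "Match" if a == b else "Non-Match"
        print(a, b, overlap(before[a], after[b]), label)
```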
Testbed
Experimental Setup:
• 3 types of graph structure with different parameters (controlled degree distribution, CC, etc.)
• Missing data
• Dynamics
1. Start with graph at time t_0
2. Manipulate graph to time t_n
3. Compare new graph to old graph by performing pair-wise comparisons
4. Get the match score (overlap) distribution and non-match score distribution
----------------
1. Estimate the amount of overlap between the distributions using our framework.
2. Evaluate performance based on TPR (assuming we want to operate above a threshold that gives few or no false positives)
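
The procedure above can be sketched end to end for the missing-data case. This is a simplified stand-in, not the original testbed: it assumes an Erdős-Rényi graph, models perturbation as independent random link removal, and sets the threshold at the largest non-match score so that no false positives are admitted. All function names are illustrative.

```python
import random
from itertools import combinations

def random_graph(n, p, rng):
    """Erdős-Rényi graph as a dict of neighbor sets."""
    nbrs = {v: set() for v in range(n)}
    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            nbrs[u].add(v)
            nbrs[v].add(u)
    return nbrs

def drop_links(nbrs, fraction, rng):
    """Copy of the graph with each edge removed independently with probability `fraction`."""
    out = {v: set() for v in nbrs}
    for u in nbrs:
        for v in nbrs[u]:
            if u < v and rng.random() >= fraction:
                out[u].add(v)
                out[v].add(u)
    return out

def tpr_at_zero_fp(old, new):
    """Share of true matches scoring above every non-match score."""
    match = [len(old[v] & new[v]) for v in old]
    non_match = [len(old[u] & new[v]) for u in old for v in new if u != v]
    threshold = max(non_match)
    return sum(score > threshold for score in match) / len(match)

rng = random.Random(0)
g0 = random_graph(100, 0.38, rng)   # graph at time t_0
gn = drop_links(g0, 0.2, rng)       # graph at time t_n with about 20% of links missing
tpr = tpr_at_zero_fp(g0, gn)
print(f"TPR at zero false positives: {tpr:.3f}")
```

Note that this sketch performs all pair-wise comparisons explicitly, which is exactly the cost the analytical framework is meant to avoid; it is useful only as a check on small graphs.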
Random
For a given clustering coefficient (0.63), the degree (63, 126, 252) and size (100, 200, 300) of the graph matter. We can estimate the mean of the non-match population based on the mean of the match population.
[Figure: same clustering coefficient (cc), different degree and size]
Small World
For a given clustering coefficient (0.63), the degree (63, 126, 252) and size (100, 200, 300) of the graph matter. We can estimate the mean of the non-match population based on the mean of the match population.
[Figure: same clustering coefficient (cc), different degree and size]
Scale Free
For a given clustering coefficient (0.38), the degree (19.5, 40.5, 52.8), size (100, 200, 300)
Estimating Match/Non-Match Distributions
If we assume an Erdős-Rényi random graph, the degree distribution is binomial:
$P(\deg(v) = k) = \binom{n-1}{k} p^k (1-p)^{n-1-k}$
where k is degree, n is number of nodes and p is likelihood of a link between any two nodes.
Match:
$\mu_m = (n-1)p$
$\sigma_m = \sqrt{(n-1)p(1-p)}$
Non-Match:
$\mu_{nm} = \mu_m p = (n-1)p^2$
$\sigma_{nm} = \sqrt{(n-1)p^2(1-p^2)}$
$cc = p = \mu_m/(n-1)$
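
A small sketch of these moments as code, assuming σ denotes the standard deviation (the square root of the binomial variance). The function name `match_nonmatch_moments` is illustrative.

```python
import math

def match_nonmatch_moments(n, p):
    """Mean and standard deviation of match and non-match overlap scores under G(n, p)."""
    mu_m = (n - 1) * p                                  # a node compared with itself: its degree
    sigma_m = math.sqrt((n - 1) * p * (1 - p))
    mu_nm = mu_m * p                                    # shared neighbors of two different nodes
    sigma_nm = math.sqrt((n - 1) * p**2 * (1 - p**2))
    return mu_m, sigma_m, mu_nm, sigma_nm

mu_m, sigma_m, mu_nm, sigma_nm = match_nonmatch_moments(n=100, p=0.38)
print(round(mu_m, 2), round(mu_nm, 2))
```

Only n and p are needed, so the non-match mean follows directly from the match mean and the clustering coefficient.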
Estimating Overlap of Match/Non-Match Distributions
Set probability density functions for match and non-match distributions equal and solve for x, the point at which the two distributions intersect.
Above this point, we expect more matches than non-matches for each value of x.
$\frac{n_m}{\sigma_m\sqrt{2\pi}} \exp\left(-\frac{(x-\mu_m)^2}{2\sigma_m^2}\right) = \frac{n_{nm}}{\sigma_{nm}\sqrt{2\pi}} \exp\left(-\frac{(x-\mu_{nm})^2}{2\sigma_{nm}^2}\right)$
Alternatively, we can set $x = \mu_{nm} + 4\sigma_{nm}$, the point above 99.999% of non-matches.
In either case, we can use the Z-score to estimate the number of positives and negatives above x.
From there we can calculate the True Positive Rate above x – percent of all the matches that we label match.
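
The cutoff-based alternative can be sketched with a normal approximation and the standard library's error function. The inputs below are hypothetical round numbers, and the function name `tpr_above_cutoff` is an assumption for this example.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def tpr_above_cutoff(mu_m, sigma_m, mu_nm, sigma_nm, k_sigma=4.0):
    """Cutoff k_sigma deviations above the non-match mean, and the share of matches above it."""
    x = mu_nm + k_sigma * sigma_nm
    z = (x - mu_m) / sigma_m          # Z-score of the cutoff under the match distribution
    return x, 1.0 - normal_cdf(z)

# Hypothetical moments for illustration.
cutoff, tpr = tpr_above_cutoff(mu_m=37.62, sigma_m=4.83, mu_nm=14.30, sigma_nm=3.50)
print(round(cutoff, 2), round(tpr, 3))
```

The closer the two means are relative to their spreads, the larger the Z-score of the cutoff and the lower the predicted TPR.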
Random: Estimates and Simulations
[Figure: match and non-match score distributions]
Performance on Random Graphs
          100 Nodes, cc = .38           300 Nodes, cc = .63
%Links    Actual TPR   Predicted TPR    Actual TPR   Predicted TPR
0         0            0                0            0
10        0.101        0.206            0.077        0.089
20        0.260        0.401            0.297        0.319
30        0.420        0.560            0.653        0.552
40        0.570        0.692            0.807        0.739
50        0.810        0.805            0.860        0.862
60        0.860        0.860            0.930        0.936
70        0.870        0.910            0.980        0.975
80        0.900        0.945            0.997        0.992
90        0.980        0.968            1.000        0.998
100       0.980        0.982            1.000        1.000
What about more “realistic” graph structures?
Network                  Size     Clustering Coefficient   Average Path Length   Degree Exponent
Internet, domain level   32711    0.24                     3.56                  2.1
Internet, router level   228298   0.03                     9.51                  2.1
www                      153127   0.11                     3.1                   In = 2.1, out = 2.45
E-mail                   56969    0.03                     4.95                  1.81
Software                 1376     0.06                     6.39                  2.5
Electronic Circuits      329      0.34                     3.17                  2.5
Language                 460902   0.437                    2.67                  2.7
Math. Co-authorship      70975    0.59                     9.50                  2.5
Food Web                 154      0.15                     3.40                  1.13
Metabolic System         778      -                        3.2                   In = out = 2.2
Xiao Fan Wang and Guanrong Chen. Complex networks: small-world, scale-free and beyond. Circuits and Systems Magazine, IEEE, 3(1):6–20, 2003.
Estimating Match/Non-Match Distributions
If we assume an Erdős-Rényi random graph, the degree distribution is binomial:
$P(\deg(v) = k) = \binom{n-1}{k} p^k (1-p)^{n-1-k}$
where k is degree, n is number of nodes and p is likelihood of a link between any two nodes.
Match:
$\mu_m = (n-1)p$
$\sigma_m = \sqrt{(n-1)p(1-p)}$
Non-Match:
$\mu_{nm} = \mu_m p = (n-1)p^2$
$\sigma_{nm} = \sqrt{(n-1)p^2(1-p^2)}$
$cc = p = \mu_m/(n-1)$
$cc = \text{number of closed triplets} / \text{number of triplets}$
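
The triplet-based definition of the clustering coefficient can be sketched as below: every pair of neighbors of a node forms a triplet centred on that node, and the triplet is closed when the two neighbors are also linked. The function name and the toy graph are illustrative.

```python
from itertools import combinations

def clustering_coefficient(neighbors):
    """Global clustering coefficient: closed triplets / all connected triplets."""
    closed = 0
    total = 0
    for centre, nbrs in neighbors.items():
        for a, b in combinations(sorted(nbrs), 2):
            total += 1                  # triplet a - centre - b
            if b in neighbors[a]:
                closed += 1             # a and b are linked, so the triplet is closed
    return closed / total if total else 0.0

# Toy graph: a triangle (1, 2, 3) with a tail node 4 attached to node 3.
toy = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
cc = clustering_coefficient(toy)
print(cc)  # 0.6
```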
Small World: Estimates and Simulations
[Figure: match and non-match score distributions]
Performance on Watts–Strogatz Small World Graphs with p = .2
          100 Nodes, cc = .38           300 Nodes, cc = .63
%Links    Actual TPR   Predicted TPR    Actual TPR   Predicted TPR
0         0            0                0            0
10        0.161        0.062            0.026        0.030
20        0.312        0.250            0.123        0.316
30        0.468        0.350            0.270        0.590
40        0.591        0.510            0.440        0.753
50        0.695        0.710            0.598        0.886
60        0.779        0.760            0.738        0.980
70        0.843        0.850            0.848        0.983
80        0.894        0.920            0.923        0.996
90        0.930        0.960            0.967        1
100       0.957        0.970            0.989        1
*Note: CC same as random graphs, average degree is different
What about “realistic” graph structures with non-normal degree distributions?
RMAT package to simulate scale free, highly clustered graphs (cc = .38)
If we remove the top 10% high degree nodes, we perform well using our strategy
Downside: We don’t build a model for every node “type”
Scale Free: Estimates and Simulations
[Figure: match and non-match score distributions]
Missing Data vs. Dynamics
$\mu_{mx} = \mu_m - (p_x \mu_m)$
$\mu_{nmx} = \mu_m \times [cc - (cc \times p_x)]$
where $p_x$ is the percentage of missing links
[Figure panels: Missing Data; Dynamics]
Random: Missing Data
The lower the clustering coefficient, the better the TPR and AUC (not shown) performance. Degree matters for a given clustering coefficient.
*TPR = TP / P = TP / (TP + FN)
Small World: Missing Data
The lower the clustering coefficient, the better the performance with respect to TPR and AUC (not shown). For the same clustering coefficient, you can handle more missing data when the degree is higher.
Back to the Real World – Reality Mining – Call Detail Data
- Cell phone usage of 100 MIT students, faculty and staff
- Consider only within-subject calls
- Performance in the baseline – TPR = .347 predicted, .342 actual at the 50/50 cutoff
*Nathan Eagle, Alex Pentland
Next Steps – beyond the random graph approximation
• Citation, IM, call detail, comScore data
Next Steps - behavior
• Models of behavior
• Models of emerging graphs
Scale Free
Next Steps - Similarity Scores
A large amount of the overlap can be overcome by using a score that takes the degree of the connected nodes into account
TPR & CC=0.37 Overlap
Missing   100        200      300
100       0          0        0
90        0.048387   0.0403   0.1445
80        0.174603   0.2421   0.3593
70        0.390625   0.3046   0.4648
60        0.4375     0.4062   0.5898
50        0.421875   0.5625   0.625
40        0.640625   0.5392   0.6953
30        0.578125   0.5859   0.6875
20        0.6875     0.5703   0.7265
10        0.6875     0.6328   0.7421
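
The slides do not say which degree-aware score is intended. One possible form, shown here purely as an assumption, down-weights shared neighbors that are themselves high-degree hubs (an inverse-log-degree weighting in the spirit of Adamic-Adar). The function name, the `1 + degree` smoothing, and the toy data are all hypothetical.

```python
import math

def degree_weighted_overlap(nbrs_a, nbrs_b, degree):
    """Sum over shared neighbors of 1 / log(1 + degree): hubs contribute less."""
    shared = set(nbrs_a) & set(nbrs_b)
    return sum(1.0 / math.log(1 + degree[v]) for v in shared)

# Hypothetical degrees: a hub that almost everyone contacts, and a rarely contacted friend.
degree = {"hub": 1000, "friend": 3}
signature = {"hub", "friend"}

score_via_hub = degree_weighted_overlap(signature, {"hub"}, degree)
score_via_friend = degree_weighted_overlap(signature, {"friend"}, degree)
print(round(score_via_hub, 3), round(score_via_friend, 3))
```

Under plain overlap both comparisons would score 1; with the weighting, sharing the rare neighbor counts for more, which is the behavior that would help separate matches from non-matches in scale-free graphs.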
Next Steps – Context
• Experiments – in the lab
S. Hill, M.F. Farone, S. Lombardi, M. Gorgoglione (2009) Using Context for Online Re-Identification.
• Reality Mining data sets
• Cross cultural
Next Steps
• Close loop and apply to real world
– IM
– Telecom
– Citation Networks
• Skewed, scale free degree distributions
• Small vs. large networks
• Top-k analysis
• Explore different overlap scores
• Combine overlap score with additional information
• Explore the question of how does one hide on a social network
• Cost of waiting for new information
Re-identification works well. BUT why/when?
– Network is not highly clustered (random vs. small world vs. scale free)
– Limited missing links
– Limited change in behavior from one time period to another
If you have an estimate of the above – and the degree distribution is normal – you can estimate performance!
Take Aways
• A guarantee for re-identification for a certain class of graphs
• A framework that does not rely on pair-wise comparisons, one of the main challenges of the extant approach
• Missing data doesn’t hurt re-identification performance as much as noise
• Pruning noisy links should prove to be beneficial
• If you want to hide, change your behavior significantly!
More …
• C. Cortes, D. Pregibon, and C. Volinsky. Computational Methods for Dynamic Graphs. Journal of Computational and Graphical Statistics, 12:950–970, 2003.
• S. Hill, D. K. Agarwal, R. Bell, C. Volinsky. Building an effective representation for dynamic networks. Journal of Computational and Graphical Statistics, 15(3):584–608, 2006.
• S. Hill and A. Nagle. Social Network Signatures: A Random Graph Approximation for Re-Identification and Experimental Results. In Proceedings of Computational Aspects of Social Networks, pp 23-33, 2009.
• S. Hill, M.F. Farone, S. Lombardi, M. Gorgoglione. Using Context for Online Re-Identification.
Thanks
ONR MURI: NexGeNetSci