studying user footprints in different online social networks

Studying User Footprints in Different Online Social Networks

Anshu Malhotra 1, Luam Totti 2, Wagner Meira Jr. 2, Ponnurangam Kumaraguru 1, Virgílio Almeida 2

1 Indraprastha Institute of Information Technology, New Delhi, India
2 Universidade Federal de Minas Gerais, Belo Horizonte, Brazil

August 2012

Upload: precog

Post on 10-May-2015

DESCRIPTION

With the growing popularity and usage of online social media services, people now have accounts (sometimes several) on multiple and diverse services like Facebook, LinkedIn, Twitter and YouTube. Publicly available information can be used to create a digital footprint of any user of these social media services. Generating such digital footprints can be very useful for personalization, profile management, and detecting malicious user behavior. A very important application of analyzing users’ online digital footprints is to protect users from potential privacy and security risks arising from the large amount of publicly available user information. We extracted information about user identities on different social networks through Social Graph API, FriendFeed, and Profilactic; we collated our own dataset to create the digital footprints of the users. We used username, display name, description, location, profile image, and number of connections to generate the digital footprints of the user. We applied context-specific techniques (e.g. Jaro-Winkler similarity, WordNet-based ontologies) to measure the similarity of the user profiles on different social networks. We specifically focused on Twitter and LinkedIn. In this paper, we present the analysis and results from applying automated classifiers for disambiguating profiles belonging to the same user from different social networks. UserID and Name were found to be the most discriminative features for disambiguating user profiles. Using the most promising set of features and similarity metrics, we achieved accuracy, precision and recall of 98%, 99%, and 96%, respectively.

TRANSCRIPT

Page 1: Studying user footprints in different online social networks

Studying User Footprints in Different Online Social Networks

Anshu Malhotra 1, Luam Totti 2, Wagner Meira Jr. 2, Ponnurangam Kumaraguru 1, Virgílio Almeida 2

1 Indraprastha Institute of Information Technology, New Delhi, India

2 Universidade Federal de Minas Gerais, Belo Horizonte, Brazil

August 2012

Page 2: Studying user footprints in different online social networks

Online Digital Footprints

- Users commonly register and access accounts (sometimes several) on multiple and diverse online services like Facebook, LinkedIn, Twitter and YouTube

- The set of all information related to the user, either provided directly or observed from the user’s interaction, is often called the user’s online digital footprint [6]

Page 3: Studying user footprints in different online social networks

Linking User’s Online Accounts

- To create a user’s digital footprint, the user’s multiple accounts must be known.

- We call this process linking a user’s online accounts.

Page 4: Studying user footprints in different online social networks

Linking User’s Online Accounts

- Linking user accounts from different services can serve several purposes [1, 4, 10, 11, 12, 13]:
  - Centralize user information, enforcing data consistency and simplifying account maintenance
  - Enrich recommendation systems
  - Cross-system personalization
  - Enable cross-system characterization and pattern analysis
  - Assess and possibly prevent unwanted information leakage, thereby protecting users from various privacy and security threats

Page 5: Studying user footprints in different online social networks

Main Challenges

- Users may choose different, unrelated usernames on different services, and these may also be unrelated to their real names [5]

- People with common names tend to have similar usernames [8, 17]

- Users may enter inconsistent and misleading information across their profiles [5], either unintentionally or, often, deliberately in order to preserve privacy

- Heterogeneity in the network structure and profile fields among the services

Page 6: Studying user footprints in different online social networks

Existing Techniques

- Various techniques have been proposed for unifying / disambiguating users’ various profiles across different online services:
  - Techniques based on FOAF ontology & graphs [4, 9, 10, 11]
  - Techniques based on user generated tags [5, 12, 13]
  - Techniques based on usernames [8, 17]
  - Techniques based on user profile attributes [1, 2, 3, 6, 7, 14]

Page 7: Studying user footprints in different online social networks

Limitations of Existing Techniques

- Specificity to certain types of social networks

- Dependency on identifiers such as email IDs and instant messenger IDs, which might not be publicly available

- Use of simple text matching algorithms for comparing complex profile fields

- Manual and experimental assignment of weights and thresholds, which can be subjective and does not scale

Page 8: Studying user footprints in different online social networks

Limitations of Existing Techniques

- Use of small datasets (the largest having 5,000 users), with data collection and evaluation done manually in some approaches

- Real-world evaluation has not been done for most of the techniques

- Computationally expensive

Page 9: Studying user footprints in different online social networks

Major Contributions of our Work

- A scalable supervised learning approach for linking users’ accounts from different services

- Evaluation of different context-specific similarity metrics for comparing different profile fields

- Results on a large dataset of linked user accounts across Twitter and LinkedIn

- Evaluation of the system’s performance for discovering new accounts for a given user.

Page 10: Studying user footprints in different online social networks

User Profile Disambiguation - System Architecture

- Account Correlation Extractor: collates the dataset of user profiles known to belong to the same user across different social networks

- Profile Crawler: crawls the public profile information from user accounts on these services

- A user’s Online Digital Footprints are generated after Feature Extraction and Selection

- Various Classifiers are trained on account pairs belonging to the same user and on pairs belonging to different users; they are then used to disambiguate user profiles, i.e. to classify a given input profile pair as belonging to the same user or not

Page 11: Studying user footprints in different online social networks

User Profile Disambiguation - System Architecture

Figure: System Architecture

Page 12: Studying user footprints in different online social networks

Dataset Collection - 1st Stage

- The training and testing dataset was collated from two types of sources:
  - Social Aggregators: services that allow users to specify their multiple accounts in order to create a unified feed. We crawled 883,668 users from FriendFeed¹ and 38,755 users from Profilactic².
  - Social Graph³: an API that constructs and provides social interaction data, including information about users’ accounts on multiple services. Of the 14 million users collected, only 3.9 million had useful information.

¹ http://friendfeed.com/
² http://www.profilactic.com/
³ http://code.google.com/apis/socialgraph/docs/

Page 13: Studying user footprints in different online social networks

Dataset Collection - Example

twitter,justinbieber youtube,kidrauhl youtube,justinbieber

twitter,aplusk facebook,ashton

youtube,felipeneto youtube,felipenetovlog

youtube,maspoxavida twitter,pecesiqueira

youtube,jp youtube,MysteryGuitarMan

youtube,descealetra twitter,cauemoura

twitter,jcillpam11six facebook,jeremiahcillpam11six

twitter,cjdances facebook,cjperrydances

twitter,sirhilton facebook,richardsrueda
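The records above appear to list, per line, the accounts known to belong to one user as whitespace-separated `service,username` tokens. The following is a minimal sketch of reading such a record and enumerating its cross-service account pairs. The record layout is inferred from the example only, and the function names `parse_record` and `cross_service_pairs` are hypothetical, not taken from the authors' system.

```python
from itertools import combinations


def parse_record(line):
    """Split one record into (service, username) tuples.

    Assumes whitespace-separated 'service,username' tokens, as in the example.
    """
    accounts = []
    for token in line.split():
        service, username = token.split(",", 1)
        accounts.append((service, username))
    return accounts


def cross_service_pairs(accounts):
    """Return every pair of accounts in a record that sit on different services."""
    return [(a, b) for a, b in combinations(accounts, 2) if a[0] != b[0]]


record = parse_record("twitter,justinbieber youtube,kidrauhl youtube,justinbieber")
pairs = cross_service_pairs(record)
for left, right in pairs:
    print(left, "<->", right)
```

Pairs on the same service (the two YouTube accounts here) are skipped, since the later analysis compares profiles across services.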

Page 14: Studying user footprints in different online social networks

Dataset Collection - 2nd Stage

- Publicly available information on each account was then collected from each service

- Four services were initially chosen for the analysis
  - Twitter, LinkedIn, YouTube and Flickr

- Six profile fields were chosen for the analysis
  - UserID, display name, description, location, connections, image

Page 15: Studying user footprints in different online social networks

Dataset Collection

- Due to the high percentage of missing fields, YouTube and Flickr were excluded

- All further analysis in this work refers only to Twitter and LinkedIn accounts

Figure: Percentage of missing values (Missing (%), axis from 0 to 80) for the Name, Location, Description and Image fields on Twitter, LinkedIn, YouTube and Flickr.

Page 16: Studying user footprints in different online social networks

Profile Similarity

- A profile may be seen as an N-dimensional vector, where each component is a profile field [14]

- Therefore, comparing profiles can be done component-wise

- However, the components are of very distinct natures and hence demand different similarity methods for comparison

- In this work we evaluated different approaches for comparing each profile field
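As a rough illustration of the component-wise comparison described above, the sketch below represents a profile as a small record and applies one metric per field to produce a similarity vector. The `Profile` class, the `exact_match` placeholder metric, and the choice to return `None` when a field is missing are all assumptions made for the example; the slides do not specify how the authors' system is structured.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Profile:
    # A subset of the six profile fields, kept small for the example.
    userid: str
    name: str
    description: Optional[str] = None
    location: Optional[str] = None


def exact_match(a: str, b: str) -> float:
    """Placeholder metric: 1.0 if the values match ignoring case, else 0.0."""
    return 1.0 if a.strip().lower() == b.strip().lower() else 0.0


def similarity_vector(
    p: Profile, q: Profile, metrics: Dict[str, Callable[[str, str], float]]
) -> Dict[str, Optional[float]]:
    """Compare two profiles field by field, one metric per field."""
    vector = {}
    for field_name, metric in metrics.items():
        a, b = getattr(p, field_name), getattr(q, field_name)
        # Assumption: a missing field yields None rather than a score.
        vector[field_name] = None if a is None or b is None else metric(a, b)
    return vector


p = Profile("jdoe", "John Doe", location="New Delhi")
q = Profile("john_doe", "john doe")
vec = similarity_vector(
    p, q, {"userid": exact_match, "name": exact_match, "location": exact_match}
)
print(vec)
```

In practice each field would get its own context-specific metric in place of `exact_match`.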

Page 17: Studying user footprints in different online social networks

Similarity Metrics

- UserID & Display Name:
  - Jaro-Winkler [15] similarity (JW) is well suited to comparing short strings, hence it was used for both of these fields

- Description (desc): Punctuation and stop words were removed from the fields. The words were then lemmatized and converted to lower case to produce the final token set.
  - TF-IDF: Cosine similarity between the two token sets using their tf-idf vector space representation
  - Jaccard (Jacc): Jaccard’s similarity score between the two token sets
  - Ontology (Ont): Wu-Palmer [16] similarity between the WordNet-based ontologies of each description field
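Two of the metrics named above are easy to sketch in plain Python: Jaro-Winkler for short strings and Jaccard for token sets. This is a textbook formulation, not the authors' code. The prefix scaling factor `p=0.1` and the maximum prefix length of 4 are the commonly used defaults and are assumptions here, as is returning 0.0 for two empty token sets. Stop-word removal and lemmatization are omitted.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    # Find characters that match within the sliding window.
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3.0


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted by the length of the common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * p * (1.0 - base)


def jaccard(tokens_a, tokens_b) -> float:
    """Jaccard score between two token collections (0.0 if both are empty)."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
print(jaccard(["phd", "student", "data"], ["data", "student", "mining"]))
```

Note that this returns a similarity (higher means more alike), which is how the metric is used as a feature here even where the slides call it a distance.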

Page 18: Studying user footprints in different online social networks

Similarity Metrics

- Location (Loc): Tokens were extracted from the location fields of both profiles by removing punctuation and converting them to lower case.
  - Sub-string (Substr): Normalized score of the number of tokens from one field value present as a substring of the other field value
  - Geo-distance (Geo): Euclidean distance between the two locations using their latitudes and longitudes (Google Maps Geocoding API⁴).
  - Jaro-Winkler distance
  - Jaccard’s score

⁴ https://developers.google.com/maps
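The sub-string and geo-distance metrics for the location field can be sketched as follows. The slides do not say how the sub-string score is normalized, so dividing by the number of tokens in the first field is an assumption. Geocoding is left out: `geo_distance` takes latitude/longitude pairs that are assumed to have been resolved already, and it computes a plain Euclidean distance in degrees, as described, rather than a great-circle distance.

```python
import math
import string

# Map every punctuation character to a space so "Delhi,India" splits cleanly.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def location_tokens(value: str):
    """Lower-case the field and split it into tokens, dropping punctuation."""
    return value.translate(_PUNCT_TO_SPACE).lower().split()


def substring_score(loc_a: str, loc_b: str) -> float:
    """Fraction of tokens of loc_a that occur as a substring of loc_b."""
    tokens = location_tokens(loc_a)
    if not tokens:
        return 0.0
    target = " ".join(location_tokens(loc_b))
    hits = sum(1 for token in tokens if token in target)
    return hits / len(tokens)


def geo_distance(coord_a, coord_b) -> float:
    """Euclidean distance between two (latitude, longitude) pairs, in degrees."""
    (lat1, lon1), (lat2, lon2) = coord_a, coord_b
    return math.hypot(lat1 - lat2, lon1 - lon2)


print(substring_score("Delhi", "New Delhi, India"))
print(substring_score("New Delhi, India", "Delhi"))
print(geo_distance((0.0, 0.0), (3.0, 4.0)))
```

The score is asymmetric under this normalization, which is why the two calls above differ; a symmetric variant could take the maximum of both directions.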

Page 19: Studying user footprints in different online social networks

Similarity Metrics

- Profile Image (img):
  - The profile image was downloaded and stored locally
  - Each image was then scaled down to 48 × 48 pixels using cubic spline interpolation
  - Each image was then converted to grayscale.
  - Each image could then be represented as a vector of values from 0 to 255, to which functions for computing Mean Square Error (mse), Peak Signal-to-Noise Ratio (psnr), and Levenshtein (ls) were applied to quantify profile image similarity
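Once two images are reduced to equal-length vectors of grayscale values, MSE and PSNR follow directly from their standard definitions. The sketch below assumes the download, resize and grayscale steps have already happened and takes flat lists of integers in the range 0 to 255. The Levenshtein variant is omitted because the slides do not say how it was applied or normalized.

```python
import math


def mse(img_a, img_b) -> float:
    """Mean square error between two equal-length pixel vectors."""
    if len(img_a) != len(img_b):
        raise ValueError("images must have the same number of pixels")
    return sum((a - b) ** 2 for a, b in zip(img_a, img_b)) / len(img_a)


def psnr(img_a, img_b, max_value: int = 255) -> float:
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    error = mse(img_a, img_b)
    if error == 0:
        return math.inf
    return 10 * math.log10(max_value ** 2 / error)


# Tiny 4-pixel vectors stand in for the 48 x 48 = 2304-pixel images.
a = [0, 0, 0, 0]
b = [0, 0, 0, 10]
print(mse(a, b))
print(round(psnr(a, b), 2))
```

Lower MSE and higher PSNR both indicate more similar images.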

Page 20: Studying user footprints in different online social networks

Similarity Metrics

- Number of Connections (conn): For Twitter, the number of connections of a user u is the number of users that u follows. For LinkedIn, it is the number of users in the private network of user u

- The number of connections in different services can assume different ranges, with different meanings
  - Normalized (norm): Each connection value c was normalized to the range [0..1] using the smallest and greatest connection values observed in each service. The similarity was then taken as the unsigned difference between the two values.
  - Class: Each value was assigned an (equally sized) class denoting how big it was. Five classes were used in this work (0-4). The similarity was taken as the difference between the two classes.
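Both connection metrics reduce to a few lines of arithmetic. The sketch below uses min-max normalization per service and, for the class variant, assumes that "equally sized" means equal-width intervals over the normalized range; an equal-frequency split would be another reasonable reading. The example ranges are invented for illustration.

```python
def normalize(value: float, smallest: float, greatest: float) -> float:
    """Min-max normalize a connection count to [0, 1] for one service."""
    if greatest == smallest:
        return 0.0
    return (value - smallest) / (greatest - smallest)


def norm_difference(c_a, range_a, c_b, range_b) -> float:
    """Unsigned difference between the two normalized connection counts."""
    return abs(normalize(c_a, *range_a) - normalize(c_b, *range_b))


def size_class(value, smallest, greatest, n_classes: int = 5) -> int:
    """Assign a class 0..n_classes-1 (assumption: equal-width intervals)."""
    scaled = normalize(value, smallest, greatest)
    return min(int(scaled * n_classes), n_classes - 1)


def class_difference(c_a, range_a, c_b, range_b) -> int:
    """Absolute difference between the two size classes."""
    return abs(size_class(c_a, *range_a) - size_class(c_b, *range_b))


# Hypothetical observed (smallest, greatest) ranges per service.
twitter_range = (0, 1000)
linkedin_range = (0, 500)
print(norm_difference(500, twitter_range, 250, linkedin_range))
print(class_difference(450, twitter_range, 150, linkedin_range))
```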

Page 21: Studying user footprints in different online social networks

Evaluation Experiments

- Feature Analysis: To analyze the discriminative capacity of different profile attributes and similarity metrics for disambiguating user profiles

- Matching Profiles: To test the effectiveness of supervised learning approaches for classifying two profiles as belonging to the same user

- Discovering Candidate Profiles: To evaluate the performance of our framework for discovering new accounts for a given known user

- The analysis was done using a dataset of account pairs of 29,129 unique users

Page 22: Studying user footprints in different online social networks

Feature Analysis

         UserID  Name    Description             Connections
         JW      JW      Jacc    TF-IDF  Ont      Norm   Class
IG       0.548   0.812   0.286   0.323   0.161    0.000  0.009
Relief   0.434   0.521   0.134   0.180   0.113    0.002  0.095
MDL      0.379   0.562   0.274   0.300   0.188   -0.006  0.006
Gini     0.151   0.217   0.084   0.092   0.051    0.000  0.003

         Location                        Image
         JW      Jacc    Substr  Geo     MSE     PSNR    LS
IG       0.232   0.337   0.350   0.520   0.183   0.184   0.215
Relief   0.108   0.041   0.039   0.227   0.157   0.158   0.188
MDL      0.158   0.233   0.270   0.488   0.205   0.205   0.227
Gini     0.067   0.098   0.102   0.146   0.051   0.051   0.061

Table: Discriminative capacity of each pair <feature, metric> according to four different approaches.
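To give a feel for what the IG row measures, here is a minimal sketch of information gain for one similarity score against the Match / Non Match label. The slides do not describe how continuous scores were discretized, so the equal-width binning over [0, 1] and the default of 10 bins are assumptions, and the toy inputs are invented rather than taken from the dataset.

```python
import math
from collections import Counter


def entropy(labels) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels, n_bins: int = 10) -> float:
    """Entropy of the labels minus their entropy after binning the feature.

    Assumes feature values lie in [0, 1] and uses equal-width bins.
    """
    bins = {}
    for value, label in zip(values, labels):
        index = min(int(value * n_bins), n_bins - 1)
        bins.setdefault(index, []).append(label)
    conditional = sum(len(ys) / len(labels) * entropy(ys) for ys in bins.values())
    return entropy(labels) - conditional


# Toy data: a score that separates the classes perfectly, and one that does not.
labels = [1, 1, 0, 0]
print(information_gain([0.95, 0.90, 0.10, 0.20], labels))
print(information_gain([0.50, 0.50, 0.50, 0.50], [1, 0, 1, 0]))
```

With balanced binary labels the gain is bounded by 1 bit, which is consistent with the scale of the IG values in the table.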

Page 23: Studying user footprints in different online social networks

Feature Analysis - Box Plots

Figure: Box plots for the UserID, Name and Location features. Each panel compares Match and Non Match pairs. Panels: userid jw, name jw, location jw, location jaccard and location substring (axes from 0 to 1), and location geo (axis from 0 to 300).

Page 24: Studying user footprints in different online social networks

Feature Analysis - Box Plots

Figure: Box plots for the Description, Connections and Image features. Each panel compares Match and Non Match pairs. Panels: description jaccard (axis from 0 to 0.5), description tf-idf (0 to 1), description ontology (0 to 0.1), connections class (0 to 5), image psnr (0 to 30), and image levenshtein (0.8 to 1).

Page 25: Studying user footprints in different online social networks

Matching User Profiles

- The most promising similarity metrics and features were used to train classifiers for the task of detecting profiles that belong to the same user

- Similarity Vector: E.g.: <userid_jw, desc_jaccard, loc_geo>

- Training Set:
  - Positive Examples: Similarity vectors for the account pairs of the dataset
  - Negative Examples: An equal number of negative examples synthesized by randomly pairing accounts from different users and calculating their similarity vectors
  - A total of 58,258 training instances
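The construction of the training set can be sketched as below: each linked pair yields a positive example, and an equal number of negatives is drawn by pairing an account of one user with an account of a different user. The function name, the fixed random seed and the toy `similarity` callable are hypothetical; the slides state only that the pairing was random.

```python
import random


def build_training_set(linked_pairs, similarity, seed: int = 0):
    """Return (similarity_vector, label) examples with balanced classes.

    linked_pairs: list of (twitter_profile, linkedin_profile) for the same user.
    similarity:   callable mapping two profiles to a similarity vector.
    """
    n = len(linked_pairs)
    if n < 2:
        raise ValueError("need at least two users to synthesize negatives")
    rng = random.Random(seed)
    positives = [(similarity(t, l), 1) for t, l in linked_pairs]
    negatives = []
    while len(negatives) < n:
        i, j = rng.randrange(n), rng.randrange(n)
        if i == j:
            continue  # same user: this would be a positive pair, so skip it
        negatives.append((similarity(linked_pairs[i][0], linked_pairs[j][1]), 0))
    return positives + negatives


# Toy profiles: plain strings, with a one-component similarity vector.
pairs = [("alice", "alice_li"), ("bob", "bob_li"), ("carol", "carol_li")]
toy_similarity = lambda t, l: (1.0 if l.startswith(t) else 0.0,)
data = build_training_set(pairs, toy_similarity)
print(len(data), sum(label for _, label in data))
```

With one positive and one negative per user, 29,129 users give 2 × 29,129 = 58,258 instances, which matches the total on the slide.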

Page 26: Studying user footprints in different online social networks

Matching User Profiles

- After training, the classifiers were tested with Twitter-LinkedIn profile pairs to be classified as a “Match” or a “Non Match”

- A “Match” means that the two given input profiles belong to the same user, while “Non Match” means they don’t

- Classifiers used:
  - Naïve Bayes
  - Decision Tree
  - SVM
  - kNN

Page 27: Studying user footprints in different online social networks

Matching User Profiles

- Results were generated for all possible combinations of profile features and similarity metrics using 10-fold cross validation.

- As shown below, we achieved accuracy, precision and recall of 98%, 99% and 96%, respectively, for the best feature set

               Accuracy  Precision  Recall  F1
Naïve Bayes    0.980     0.996      0.964   0.980
Decision Tree  0.965     0.994      0.936   0.964
SVM            0.972     0.988      0.956   0.971
kNN            0.898     0.998      0.798   0.887

Table: Results for multiple classifiers using the feature set {name_jw, userid_jw, loc_geo, desc_tfidf, img_ls, conn_norm}.
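For reference, the four reported measures relate to the confusion-matrix counts as in the sketch below. The counts used in the example are invented for illustration and are not from the paper; the last lines only check that F1 recomputed from the rounded Naïve Bayes precision and recall agrees with the table.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int):
    """Accuracy, precision, recall and F1 from confusion-matrix counts.

    Assumes every denominator is non-zero.
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Invented counts: 100 true matches and 100 true non-matches.
acc, prec, rec, f1 = classification_metrics(tp=96, fp=1, tn=99, fn=4)
print(round(acc, 3), round(prec, 3), round(rec, 3), round(f1, 3))

# Consistency check against the Naive Bayes row of the table.
print(round(f1_from(0.996, 0.964), 3))
```

High precision with lower recall, as in the kNN row, means few wrong matches are reported but more true matches are missed.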

Page 28: Studying user footprints in different online social networks

Discovering New User Profiles

- The results are very good on a static dataset, but what if we don’t have candidates to match against a known user profile?

- A system was developed for retrieving candidate profiles that are possible matches for a known account from some other service.

- A part (one fifth) of the true positive data was reserved as the testing set

- A Naïve Bayes classifier was trained with the remaining set and was modified to return the probability that the two input profiles belong to the same user
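A Naïve Bayes classifier that returns a match probability instead of a hard label might look like the sketch below. The slides do not say which Naïve Bayes variant was used, so the Gaussian likelihood, the small variance floor of `1e-6`, and the function names are all assumptions; the training vectors are toy values.

```python
import math


def fit_gaussian_nb(vectors, labels):
    """Estimate a class prior and per-feature (mean, variance) for each class."""
    model = {}
    for cls in set(labels):
        rows = [v for v, y in zip(vectors, labels) if y == cls]
        stats = []
        for column in zip(*rows):
            mean = sum(column) / len(column)
            # Small floor keeps the variance strictly positive (assumption).
            var = sum((x - mean) ** 2 for x in column) / len(column) + 1e-6
            stats.append((mean, var))
        model[cls] = (len(rows) / len(labels), stats)
    return model


def match_probability(model, vector, positive=1) -> float:
    """Posterior probability that the similarity vector is a Match."""
    log_scores = {}
    for cls, (prior, stats) in model.items():
        score = math.log(prior)
        for x, (mean, var) in zip(vector, stats):
            score += -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
        log_scores[cls] = score
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_scores.values())
    weights = {cls: math.exp(s - top) for cls, s in log_scores.items()}
    return weights[positive] / sum(weights.values())


vectors = [(0.90, 0.80), (0.95, 0.70), (0.10, 0.20), (0.20, 0.10)]
labels = [1, 1, 0, 0]
model = fit_gaussian_nb(vectors, labels)
print(match_probability(model, (0.90, 0.75)))
print(match_probability(model, (0.15, 0.15)))
```

Returning a probability rather than a label is what makes it possible to rank several candidate profiles against each other.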

Page 29: Studying user footprints in different online social networks

Discovering Candidate User Profiles

- We query Twitter’s API using the LinkedIn display name for each profile pair from the testing dataset

- For each of the profiles returned from the Twitter API, we compute the similarity vector with the LinkedIn profile of the user

- We then use the trained classifier to return the probability that each of these profiles belongs to the same user

- We rank the Twitter profiles in decreasing order of their probabilities

- Ideally the correct Twitter profile (of the profile pair from the testing set) should be at the top of this ranking
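The ranking steps above can be sketched as a single function. The Twitter API query is not shown: `candidates` stands for whatever profiles the search returned, and `similarity` and `probability` are hypothetical callables standing in for the similarity-vector computation and the trained classifier. The toy versions below exist only to make the example runnable.

```python
def rank_candidates(known_profile, candidates, similarity, probability):
    """Score each candidate against the known profile and sort best-first.

    Returns a list of (probability, candidate) tuples in decreasing order.
    """
    scored = [(probability(similarity(known_profile, c)), c) for c in candidates]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored


# Toy stand-ins: profiles are display-name strings, vectors have one component.
def toy_similarity(known, candidate):
    if known == candidate:
        return (1.0,)
    return (0.5,) if candidate.startswith(known[:4]) else (0.0,)


def toy_probability(vector):
    return vector[0]


ranked = rank_candidates(
    "john doe", ["johnd", "john doe", "jane"], toy_similarity, toy_probability
)
for score, candidate in ranked:
    print(f"{score:.2f}  {candidate}")
```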

Page 30: Studying user footprints in different online social networks

Discovering Candidate User Profiles

Figure: Relation between the rank position r and the percentage of times the right profile is found at a position lower than or equal to r. Axes: Rank position (5 to 20) against Profiles Found (%) (40 to 80). Series: All Features, Best Features.
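The quantity plotted in this figure is a cumulative hit rate: for each r, the share of test users whose correct profile appears at rank r or better. A minimal sketch, assuming 1-based rank positions and `None` for users whose correct profile was not among the retrieved candidates; the positions below are invented and are not the paper's data.

```python
def found_within_rank(rank_positions, r: int) -> float:
    """Percentage of test users whose correct profile is ranked at r or better.

    rank_positions: 1-based rank of the correct profile per test user,
                    or None if it was not retrieved at all.
    """
    hits = sum(1 for pos in rank_positions if pos is not None and pos <= r)
    return 100.0 * hits / len(rank_positions)


# Invented ranks for five test users.
positions = [1, 1, 3, 7, None]
for r in (1, 3, 10):
    print(r, found_within_rank(positions, r))
```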

Page 31: Studying user footprints in different online social networks

Discovering Candidate User Profiles

- In 64% of the cases the right profile was found in the first position of the ranking when using all features, while this value was 49% for the set of the best features

- This means that using all features instead of only the best ones can help to disambiguate between the possible candidates.

- In 75% of the cases the right profile was in the first 3 positions of the ranking

- This suggests the system can be used in a semi-supervised manner

Page 32: Studying user footprints in different online social networks

Conclusions & Results

- Applied automated techniques to identify accounts belonging to the same user in different online services

- Only publicly available information was extracted and used

- Proposed and evaluated multiple similarity metrics, comparing their discriminative capacity for the task of profile linking

- UserID and Name, when compared using the Jaro-Winkler metric, were the most discriminative features

Page 33: Studying user footprints in different online social networks

Conclusions & Results

- For the best set of features and similarity metrics we achieved accuracy, precision and recall of 98%, 99% and 96%, respectively

- Evaluated the system’s performance for discovering a user’s profile on Twitter given their display name on LinkedIn

- Using all features instead of only the most discriminative ones has shown better results

- The system may be used to match user profiles automatically with 64% accuracy, or in a semi-supervised manner to narrow down candidate profiles

Page 34: Studying user footprints in different online social networks

Future Work

- Incorporate more profile fields

- Generalize our model to include other social networks

- Adapt our system to handle missing and incorrect profile attributes

Page 35: Studying user footprints in different online social networks

For any further information, please write [email protected]

precog.iiitd.edu.in

Page 36: Studying user footprints in different online social networks

Bibliography I

[1] Carmagnola, F., and Cena, F. User identification for cross-system personalisation. Inf. Sci. 179, 1-2 (Jan. 2009), 16–32.

[2] Carmagnola, F., Osborne, F., and Torre, I. Cross-systems identification of users in the social web. In 8th IADIS Int. Conf. WWW/INTERNET, Rome, Italy (2009), pp. 129–134.

[3] Carmagnola, F., Osborne, F., and Torre, I. User data distributed on the social web: how to identify users on different social systems and collecting data about them. In Proceedings of the 1st International Workshop on Information Heterogeneity and Fusion in Recommender Systems (New York, NY, USA, 2010), HetRec ’10, ACM, pp. 9–15.

[4] Golbeck, J., and Rothstein, M. Linking social networks on the web with FOAF: a semantic web case study. AAAI’08, pp. 1138–1143.

[5] Iofciu, T., Fankhauser, P., Abel, F., and Bischoff, K. Identifying users across social tagging systems, 2011.

Page 37: Studying user footprints in different online social networks

Bibliography II

[6] Irani, D., Webb, S., Li, K., and Pu, C. Large online social footprints: an emerging threat. In CSE ’09 (Aug. 2009), vol. 3, pp. 271–276.

[7] Kontaxis, G., Polakis, I., Ioannidis, S., and Markatos, E. Detecting social network profile cloning. In PerCom (Mar. 2011), pp. 295–300.

[8] Perito, D., Castelluccia, C., Kaafar, M. A., and Manils, P. How unique and traceable are usernames? In PETS (2011), pp. 1–17.

[9] Rowe, M. Applying semantic social graphs to disambiguate identity references. In 6th Annual European Semantic Web Conference (ESWC2009) (June 2009), pp. 461–475.

[10] Rowe, M. Interlinking distributed social graphs. In LDOW2009 (April, Spring 2009).

Page 38: Studying user footprints in different online social networks

Bibliography III

[11] Rowe, M., and Ciravegna, F. Harnessing the social web: The science of identity disambiguation. In Web Science Conference (2010).

[12] Szomszor, M., Alani, H., Cantador, I., O’Hara, K., and Shadbolt, N. Semantic modelling of user interests based on cross-folksonomy analysis. ISWC ’08, pp. 632–648.

[13] Szomszor, M. N., Cantador, I., and Alani, H. Correlating user profiles from multiple folksonomies. HT ’08, pp. 33–42.

[14] Vosecky, J., Hong, D., and Shen, V. Y. User identification across multiple social networks. In NDT’09 (July 2009).

[15] Winkler, W. E. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research (1990), pp. 354–359.

Page 39: Studying user footprints in different online social networks

Bibliography IV

[16] Wu, Z., and Palmer, M. Verbs semantics and lexical selection. In Proceedings of the 32nd annual meeting on Association for Computational Linguistics (Stroudsburg, PA, USA, 1994), ACL ’94, Association for Computational Linguistics, pp. 133–138.

[17] Zafarani, R., and Liu, H. Connecting corresponding identities across communities, 2009.