People's Interests in Social Networks

Group Members:
07005029 Abhinav Gokari
07005030 Sudheer Kumar
07d05004 Ignatius Pereira
07d05019 Praveen Dhanala

Under the guidance of Prof. Pushpak Bhattacharyya

Outline
- Motivation
- Social Networks and Homophily
- Experiments
- Statistical Methods Used
- Advantages of Analyzing Social Networks for Homophily
- Conclusion
- References

Motivation
Social networks are an ongoing phenomenon: Orkut, Facebook, Twitter, etc.
Almost one tenth of the world's population uses Facebook.
There is great scope for innovation and development in social computing, which deals with creating social contexts through the use of software and technology.
Interesting problems arise, such as:
- Social network analysis
- Target marketing and improving e-commerce
- Friendship (or relationship) suggestions

Social Networks
A social network is a social structure made up of individuals (or organizations) called "nodes", which are tied (connected) by one or more specific types of interdependency such as friendship or kinship.
Online social networks are attribute-independent networks.
In online social networks, a relationship between two individuals is mutually self-defined and binary.
Friendship is not functional, and the reasons for it can be subtle, e.g. an offline/online meeting, a common workplace, or pure visual interest.

Homophily
Homophily is the tendency of individuals to associate or bond with others who have a similar set of interests or attributes ("birds of a feather flock together").
People choose friends who share common interests and characteristics.
The principle of homophily is one of the most general and least contested theoretical principles in sociology.

Homophily (contd.)
A social system is homophilous if contacts are more similar to one another than to strangers in terms of their individual attributes and behavior.
If homophily is a robust aspect of human behavior, it can be used to deduce a particular person's attributes from his/her friends' attributes in an online social network.
We shall now examine the following experiments to observe homophily in social networks.

Expt. based on the paper by Apoorv Agarwal et al.
A travel site with mutually self-declared friendships is chosen.
A data set of 181 nodes with 1214 friendship links is selected, along with additional information such as the users' attributes and characteristics.
Each user can select hobbies from a list of 26 predefined hobbies.
Each user also has additional characteristics such as languages spoken and country of residence.

- Based on Predicting Interests of People on Online Social Networks by Apoorv Agarwal et al., 2009.

Contd.
The information gathered on the website was:
Friends Network: This is the mutually self-declared friends network matrix.
Hobbies: Members declare their hobbies by clicking on boxes next to a list of 26 possible hobbies.
Languages Spoken: There is a list of 139 languages from which members select a maximum of three languages they speak.
Age group: The age group is given as a range (for example under 20, 20-25, 26-30), from which the user chooses one. There are 12 ranges in total.

Statistics about the data
F - This is the 181 by 181 friends network matrix. If person p1 has a friend p2, then F[p1,p2] will be 1; otherwise it will be 0.

H - This is the hobbies matrix, 181 by 26: 181 for the number of people and 26 for the number of different hobbies a person may have. For example, if p50 has three hobbies (Acting, Dancing and Theatre), then H[p50, Acting], H[p50, Dancing] and H[p50, Theatre] will be 1 and all the other cells in row H[p50] will be 0.

L - This is the matrix of languages spoken by people. It is built exactly like H, except that the columns are the different languages a person can speak. This matrix is therefore 181 by 139.

P - This is the matrix of places visited by people. It is built like H and L, except that the columns are the different places visited by a person. This matrix is therefore 181 by 263.
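The binary matrices described above can be sketched as follows. This is a minimal illustration on made-up toy data (4 people, 3 hobbies); the variable names and the sample friendships are assumptions, not the data set from the paper.

```python
import numpy as np

# Hypothetical toy data: the real data set has 181 people and 26 hobbies.
hobbies = ["Acting", "Dancing", "Theatre"]
friend_pairs = [(0, 1), (1, 2)]  # mutually self-declared friendships
declared = {0: ["Acting", "Dancing"], 1: ["Theatre"], 2: [], 3: ["Acting"]}
n = 4

# F[p1, p2] = 1 if p1 and p2 are friends, 0 otherwise.
F = np.zeros((n, n), dtype=int)
for p1, p2 in friend_pairs:
    F[p1, p2] = 1
    F[p2, p1] = 1  # friendship is mutual, so F is symmetric

# H[p, h] = 1 if person p declared hobby h, 0 otherwise.
H = np.zeros((n, len(hobbies)), dtype=int)
for person, names in declared.items():
    for name in names:
        H[person, hobbies.index(name)] = 1
```

The L and P matrices would be built the same way, with languages or places as the columns.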

The hypothesis of the experiment is that there is a correlation between mutually self-declared friendship links in online social networks and the attributes listed in the profiles of those friends, presumably because of homophily.

The GFHF algorithm is extremely sensitive to the correctness of the weight matrices. Thus, GFHF allows us to test our hypothesis.
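The slides do not spell out GFHF. Assuming it refers to the Gaussian Fields and Harmonic Functions method for semi-supervised learning, a minimal sketch of its closed-form solution is given below: scores on unlabeled nodes are obtained from the labeled ones through the graph Laplacian of the weight matrix. The function name and argument layout are illustrative assumptions.

```python
import numpy as np

def gfhf_predict(W, labeled_idx, labels):
    """Harmonic-function label propagation (sketch, not the paper's code).

    W           : symmetric, non-negative weight matrix (n x n)
    labeled_idx : indices of the labeled nodes
    labels      : 0/1 labels for those nodes, in the same order
    Returns a score in [0, 1] for every node.
    """
    n = W.shape[0]
    labeled_idx = np.asarray(labeled_idx)
    unlabeled_idx = np.setdiff1d(np.arange(n), labeled_idx)

    laplacian = np.diag(W.sum(axis=1)) - W
    L_uu = laplacian[np.ix_(unlabeled_idx, unlabeled_idx)]
    W_ul = W[np.ix_(unlabeled_idx, labeled_idx)]
    f_l = np.asarray(labels, dtype=float)

    # Solve L_uu f_u = W_ul f_l. The pseudo-inverse keeps the solve defined
    # when an unlabeled node has no links (its score then becomes 0).
    f_u = np.linalg.pinv(L_uu) @ (W_ul @ f_l)

    f = np.empty(n)
    f[labeled_idx] = f_l
    f[unlabeled_idx] = f_u
    return f
```

Because the unlabeled scores depend entirely on W, a weight matrix that carries no real signal (such as a random one) should give poor predictions, which is the property the experiment relies on.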

GFHF Example

Based on the paper by Amir Saffari et al., 2010

Support Vector Machines
A classifier derived from statistical learning theory by Vapnik et al. in 1992.

Currently, SVMs are widely used in object detection and recognition, content-based image retrieval, text recognition, biometrics, speech recognition, regression analysis, etc.

V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1992

Linear classifier
The goal of statistical classification is to use an object's characteristics to identify which class (or group) it belongs to. A linear classifier achieves this by making a classification decision based on the value of a linear combination of the characteristics. An object's characteristics are also known as feature values and are typically presented to the machine in a vector called a feature vector.

SVM in test
MATLAB Support Vector Machine Toolbox
The toolbox provides routines for support vector classification and support vector regression. A GUI is included which allows the visualisation of simple classification and regression problems. (The MATLAB optimisation toolbox, or an alternative quadratic programming routine, is required.)
http://www.isis.ecs.soton.ac.uk/isystems/kernel/
The Support Vector Machine was run on sample data.
http://users.ecs.soton.ac.uk/srg/publications/pdf/SVM.pdf
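The linear classifier described on the slide above reduces to taking the sign of a weighted sum of the feature values. A minimal sketch, with a hypothetical weight vector w and bias b chosen only for illustration:

```python
import numpy as np

def linear_classify(x, w, b):
    """Return +1 or -1 from the sign of the linear combination w.x + b."""
    score = float(np.dot(w, x) + b)
    return 1 if score >= 0 else -1

# Hypothetical parameters; an SVM would learn w and b from training data.
w = np.array([1.0, -2.0])
b = 0.5
```

For example, the feature vector (2, 0) gives a score of 2.5 and is assigned to class +1, while (0, 1) gives -1.5 and is assigned to class -1.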

Why SVMs
Experiment by Thorsten Joachims on text categorization with support vector machines.
Text categorization is the classification of documents into a fixed number of predefined categories, where each document can be in one, multiple or no category at all.
SVMs are well suited to categorization tasks with many features.
SVMs are robust and do not require parameter tuning.
Thorsten Joachims, Text Categorization with Support Vector Machines: Learning with Many Relevant Features, 1998.

Why SVMs (contd.)
SVMs are based on the Structural Risk Minimization principle.
The idea of structural risk minimization is to find a hypothesis h for which we can guarantee the lowest true error, i.e. the lowest probability that h will make an error on an unseen, randomly selected test sample.
SVMs are universal learners.
Their ability to learn is independent of the dimensionality of the feature space.
With the use of a simple kernel function, they can be used to learn polynomial classifiers.

SVMs v/s MLPs
Experiment by Barabino et al. on Support Vector Machines v/s Multi-Layer Perceptrons (MLPs) in particle identification in physics.
SVMs are based on minimization of structural risk, whereas MLPs are based on minimization of empirical risk.

Findings:
1) SVMs and MLPs show very similar performance; SVMs perform as well as MLPs.
2) SVMs work well with large training sets drawn from input spaces of small dimension.
M. Barabino et al., Support Vector Machines versus Multi-Layer Perceptrons in Particle Identification, 1999.

Back to the Expt.
To accept or reject our research hypothesis, we consider the prediction capability of GFHF using two weight matrices:
- Randomly generated binary weight matrix (Gr)
- Self-declared friends network (Gf)
To incorporate the effect of other attributes, a Support Vector Machine (SVM) is used along with GFHF.
Two feature sets are used with the SVM:
- The set with only personal characteristics (Sc)
- The set with all the hobbies except the one being predicted (St)

Contd.
GFHF is run 30 times, each time for a random configuration of ni labeled data points, where ni ∈ N = (10, 30, 50, 70, 90).
These predictions are calculated for all 26 hobbies under consideration.
Therefore, for each weight matrix, Gr and Gf, we get a corresponding 26 x 5 x 30 matrix, where 26 is the number of hobbies, 5 is the number of labeled-set sizes, and 30 is the number of trials.
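The protocol above can be sketched as a triple loop that fills a hobbies x sizes x trials accuracy array. This is only an outline under stated assumptions: accuracy is measured on the nodes left unlabeled in each trial, a score of 0.5 or more counts as a positive prediction, and `majority_baseline` is a stand-in predictor (not GFHF) so that the sketch runs on its own.

```python
import numpy as np

def majority_baseline(W, labeled_idx, labels):
    """Stand-in predictor: every node gets the majority label of the labeled set."""
    guess = 1.0 if np.mean(labels) >= 0.5 else 0.0
    return np.full(W.shape[0], guess)

def run_trials(predict, W, H, sizes=(10, 30, 50, 70, 90), n_trials=30, seed=0):
    """Return accuracies of shape (n_hobbies, len(sizes), n_trials)."""
    rng = np.random.default_rng(seed)
    n, n_hobbies = H.shape
    acc = np.zeros((n_hobbies, len(sizes), n_trials))
    for h in range(n_hobbies):
        for s, n_labeled in enumerate(sizes):
            for t in range(n_trials):
                # Random configuration of labeled data points for this trial.
                labeled = rng.choice(n, size=n_labeled, replace=False)
                test = np.setdiff1d(np.arange(n), labeled)
                scores = predict(W, labeled, H[labeled, h])
                predicted = scores[test] >= 0.5
                acc[h, s, t] = np.mean(predicted == (H[test, h] == 1))
    return acc
```

Averaging the result over the last axis, `acc.mean(axis=2)`, gives one accuracy per hobby and labeled-set size, which is the kind of figure a results table would report.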

Explanation of Results
The table shows the accuracy of running GFHF with the random matrix (Gr) and with the friends matrix (Gf) for 26 hobbies and across 3 different training set sizes (numbers of labeled data points).
The numbers are averages over the 30 trials with the same configuration.
The second-to-last column shows the average difference in accuracy between Gf and Gr across all training set sizes, and the last column shows the difference in accuracy between St and Gf, again as an average across all training set sizes.

Contd.
The results show that in most cases Gf performs significantly better than Gr, which implies that the underlying friends network is in fact important for prediction.
For some hobbies, the difference in performance between Gf and Gr is extremely high. These are precisely the hobbies that over 50% of the people in the network have.

Contd.
There are quite a few hobbies for which the friends network does not provide any useful information.

We see that the friends network does not consistently help over a random network if the hobby has a relative incidence of 41% or less.

At 47% and above, the friends network consistently outperforms the random network.

Contd.
The results corresponding to Sc and St are also similar.

In general, St performs better than Sc, which performs better than Gf.

From this table we also observe that as we increase the data, prediction accuracy increases for the SVM.

Expt. based on the paper by Akshay Patil
The data was gathered from a large online social networking site.
The data is essentially a huge network of interconnected nodes, with nodes representing actual people or users and the ties between them denoting relationships in the social network.
Each node also stores information about the individual user. This information makes up the node or user profile, and is essentially a list of attribute: value pairs.

- Akshay N Patil. Homophily Based Link Prediction in Social Networks. 2009

Statistics of Data Set

Definitions
The nodes are distinguished as:
- Class of Near Nodes N(u): nodes within a 2-hop radius
- Class of Far Nodes F(u): all nodes other than Near nodes
We introduce a t-bit vector associated with every pair of nodes (to denote the attributes of a node), in which we place a 1 at the ith position if the two nodes match on attribute Ai, or a 0 if they do not match.

Contd.
Now, for each attribute Ai in the network, we define a 2 × 2 contingency matrix as shown in Table 3.1, where
C00: Pairs of nodes in FS not matching on Ai.
C01: Pairs of nodes in FS matching on Ai.
C10: Pairs of nodes in NS not matching on Ai.
C11: Pairs of nodes in NS matching on Ai.
| Cij | = kij
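The four counts kij can be gathered in one pass over all node pairs. The sketch below assumes that FS and NS denote the sets of far and near pairs, that the near pairs are already known, and that a single attribute is stored per node; the function name and data layout are illustrative only.

```python
from itertools import combinations

def contingency(attr, near_pairs):
    """Build the 2 x 2 contingency counts for one attribute.

    attr       : dict mapping node -> attribute value
    near_pairs : set of frozenset({u, v}) for pairs within a 2-hop radius
    Returns k, where k[i][j] counts pairs with i = 1 for near (0 for far)
    and j = 1 if the two nodes match on the attribute (0 otherwise).
    """
    k = [[0, 0], [0, 0]]
    for u, v in combinations(sorted(attr), 2):
        i = 1 if frozenset((u, v)) in near_pairs else 0
        j = 1 if attr[u] == attr[v] else 0
        k[i][j] += 1
    return k

# Hypothetical example: four users with a country attribute.
attr = {"a": "IN", "b": "IN", "c": "US", "d": "US"}
near = {frozenset(("a", "b")), frozenset(("c", "d")), frozenset(("a", "c"))}
k = contingency(attr, near)
```

In this toy example the three far pairs all differ on the attribute, while two of the three near pairs match on it.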

χ² (chi-square) Measure
The statistical measure we use to detect homophily is the χ² (chi-square) measure.
The χ² measure aggregates the deviation of observed values from the expected values under the independence hypothesis.
The independence hypothesis in our case can be stated as follows: an attribute plays no role in the classification of a node into a particular class Cij.

where Ai refers to a particular attribute, C refers to the classes defined, klm refers to the number of users in class l having value m for attribute Ai, and n refers to the total number of users.

The larger the χ² value, the lower the belief in the independence hypothesis, and hence the larger the role played by the particular attribute in relationship formation.
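The slide's own formula is not reproduced in this transcript. As a hedged illustration, the sketch below computes the standard Pearson chi-square statistic for a contingency table, where the expected count of each cell under independence is (row total × column total) / n. It assumes every row and column total is non-zero.

```python
def chi_square(k):
    """Pearson chi-square statistic for a contingency table k (list of rows)."""
    n = sum(sum(row) for row in k)
    row_tot = [sum(row) for row in k]
    col_tot = [sum(col) for col in zip(*k)]
    chi2 = 0.0
    for i, row in enumerate(k):
        for j, observed in enumerate(row):
            # Expected count if class and attribute match were independent.
            expected = row_tot[i] * col_tot[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2
```

A table whose cells equal their expected counts gives 0, and the value grows as the observed counts move away from what independence would predict.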

We can rewrite the formula for the χ² measure in known terms, using the probabilities of each class/attribute and the independence hypothesis, as follows:

In this way, we calculate the χ² value associated with each attribute in the network.

The Odds Ratio
The χ² measure assesses how statistically unlikely the lack of association between similarity on an attribute and the probability of a social relationship is.
What the χ² measure cannot tell us is whether the association is positive or negative.
Yet we need such a directional measure to test the principle of homophily, which predicts a positive relationship. A negative relationship would imply negative homophily: a tendency for individuals to associate not with those like themselves, but with those who are different.
We therefore also compute the odds ratio for each attribute.
The odds ratio is simply the odds that two similar individuals are connected divided by the odds that two dissimilar individuals are connected.
The odds ratio for an attribute can be defined as follows:
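The slide's formula is not reproduced in this transcript. One reading of the verbal definition, using 2 × 2 contingency counts and treating "connected" as "near" (an assumption), is sketched below. It requires k10 and k01 to be non-zero.

```python
def odds_ratio(k):
    """Odds ratio from 2 x 2 counts k[i][j].

    i = 1 for near pairs (0 for far), j = 1 for pairs matching on the
    attribute (0 otherwise).
    Odds for similar pairs    : k11 / k01
    Odds for dissimilar pairs : k10 / k00
    """
    k00, k01 = k[0]
    k10, k11 = k[1]
    return (k11 * k00) / (k10 * k01)
```

A value above 1 points to positive homophily on the attribute, a value below 1 to negative homophily, and a value of 1 to no association.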

Explanation of Results
Trends that are visible from the online social network results are as follows:
Geographical location is the strongest factor affecting how relationships shape up in a social network.
The results also indicate that relationships are more likely to develop between individuals belonging to the same age group.
Religious affiliation and ethnicity are also dominant factors in relationship formation, as demonstrated by attributes like religion and languages spoken by individuals.
Likings, hobbies, etc. are less likely to influence how ties are made in a social network. Relationships are less likely to be formed between individuals who, for example, enjoy the same movies or music or read the same books.

Advantages of analyzing Homophily
Some offline friendships may be absent in online social communities, and are thus detectable. Friends may not know that they are members of the same online community. This is especially true for young online communities or users new to the system. Facebook has a "People You May Know" feature, which suggests that people who are possibly friends connect online.
Advantages in target marketing and e-commerce are straightforward. For example, Orkut shows ads on our profile which are based on our profile information.
Information spread in social networks is being used in diverse fields such as marketing campaigns.

Contd.
Link prediction may also turn out to be useful for suggesting links that are likely to develop in the future, thus steering the evolution of a social community.

In the case of large organizations or companies, there is often an official hierarchy for collaboration and interaction. Methods for link prediction could be effectively used to uncover beneficial interactions or collaborations that have not yet been fully utilized, which would otherwise be hidden by this official hierarchy.

Conclusion
It has been widely observed that social networks exhibit homophily.

We have seen how to detect homophily and some important applications of this phenomenon.

More research has to be done on different kinds of sample data to analyze homophily more accurately and exploit it.

References
Apoorv Agarwal, Owen Rambow, Nandini Bhardwaj. Predicting Interests of People on Online Social Networks. In the Proceedings of IEEE CSE 09, 12th IEEE International Conference on Computational Science and Engineering, IEEE Computer Society Press, Vancouver, Canada, 2009.
Akshay N Patil. Homophily Based Link Prediction in Social Networks. 2009
Miller McPherson, Lynn Smith-Lovin, and James M. Cook. Birds of a feather: Homophily in social networks. Annual Review of Sociology, 2001.

References (contd.)
V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1992
Thorsten Joachims. Text Categorization with Support Vector Machines: Learning with Many Relevant Features, 1998.
M. Barabino et al. Support Vector Machines versus Multi-Layer Perceptrons in Particle Identification, 1999.
Amir Saffari, Christian Leistner, Horst Bischof. Semi-supervised Learning in Vision. CVPR, San Francisco, 2010
http://www.dtreg.com/svm.htm
Wikipedia
