Social Sub-groups II
Outline
“How?”
- Review group-finding strategies
- “Evade”: PCA (= SVD for the math-oriented!)
- Theory problem: What should group structure be?
“Why?”
Wayne Baker
• Social structure in a place where there should be none
Scott Feld
• What causes clustering in a network? Opportunity and interests
Examples from Add Health & Prosper
Practical:
• Software & program examples
Next week: Roles & Blockmodels
Strategies for identifying primary groups:
Search:
1) Fit measure: Identify a measure of groupness (usually a function of the number of ties that fall within groups compared to the number of ties that fall between groups).
2) Algorithm to maximize fit: Once we have the index, we need a clever method for searching through the network to maximize the fit. See “Jiggle”, “Factions”, etc.
Destroy:
Break apart the network in strategic ways, removing the weakest parts first; what’s left are your primary groups. See “edge betweenness”, “MCL”.
Evade:
Don’t look directly; instead, find a simpler problem that correlates. Examples: generalized cluster analysis, factor analysis, RM.
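To make the “Search” strategy concrete, here is a minimal Python sketch of its two pieces: a fit measure and a search that tries to maximize it. The fit measure is Newman's modularity, which is an assumption on my part (the slide only asks for some function of within-group versus between-group ties), and the search is a naive one-node-at-a-time hill climb, not the procedure used by Jiggle or Factions. The function names are hypothetical.

```python
def modularity(adj, labels):
    """Fit measure (assumed here): Newman's modularity for an undirected 0/1 matrix."""
    n = len(adj)
    degree = [sum(row) for row in adj]
    two_m = sum(degree)  # every undirected tie is counted twice
    q = 0.0
    for g in set(labels):
        members = [i for i in range(n) if labels[i] == g]
        within = sum(adj[i][j] for i in members for j in members)
        total_degree = sum(degree[i] for i in members)
        # share of ties inside the group minus the share expected by chance
        q += within / two_m - (total_degree / two_m) ** 2
    return q


def hill_climb(adj, labels):
    """Naive search: move one node at a time while the fit keeps improving."""
    labels = list(labels)
    best = modularity(adj, labels)
    improved = True
    while improved:
        improved = False
        for node in range(len(labels)):
            home = labels[node]
            for g in sorted(set(labels)):
                if g == home:
                    continue
                labels[node] = g
                fit = modularity(adj, labels)
                if fit > best + 1e-12:
                    best, home, improved = fit, g, True
            labels[node] = home  # keep the best group found for this node
    return labels, best


# Two triangles (0-1-2 and 3-4-5) joined by a single bridging tie 2-3.
two_triangles = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
# Start with node 2 deliberately placed in the wrong group.
found, fit = hill_climb(two_triangles, [0, 0, 1, 1, 1, 1])
```

A hill climb like this can stall in a local optimum on real data, which is why the search programs named on the slide use cleverer moves or good starting values.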
Methods: How do we identify primary groups in a network?
Strategies for identifying primary groups:
Search:
- UCINET’s Factions
- R’s FastGreedy
- PAJEK’s generalized blockmodeling
- Frank’s KliqueFinder
Destroy:
- Edge-betweenness reduction
- MCL flow model
Evade:
- Leading eigenvector model
- Clustering a distance (or other) matrix
- Principal component / factor / SVD methods
- RNM
Hybrids: Use a simple evade technique for starting values and then use a search technique (CROWDS, JIGGLE).
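As one illustration of an “Evade” method, the sketch below splits a network in two using the sign of the leading eigenvector of the modularity matrix. I am assuming that the “leading eigenvector model” on the slide refers to Newman's method; the slide does not say. The eigenvector is found by plain power iteration with a diagonal shift, and the function name, the starting vector and the iteration count are illustrative choices.

```python
def leading_eigenvector_split(adj, iterations=500):
    """Two-group split from the sign of the leading eigenvector of B = A - k k^T / 2m."""
    n = len(adj)
    degree = [sum(row) for row in adj]
    two_m = sum(degree)
    b = [[adj[i][j] - degree[i] * degree[j] / two_m for j in range(n)]
         for i in range(n)]
    # Shifting by a bound on the spectral radius makes every eigenvalue
    # non-negative, so power iteration converges to the largest one.
    shift = max(sum(abs(x) for x in row) for row in b)
    v = [float(i + 1) for i in range(n)]  # arbitrary non-symmetric start
    for _ in range(iterations):
        w = [sum(b[i][j] * v[j] for j in range(n)) + shift * v[i]
             for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return [0 if x < 0 else 1 for x in v]


# Two triangles (0-1-2 and 3-4-5) joined by a single bridging tie 2-3.
two_triangles = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
split = leading_eigenvector_split(two_triangles)
```

Which group gets label 0 and which gets label 1 is arbitrary; only the split itself is meaningful. No search over partitions happens here, which is the sense in which the method “evades” the hard problem.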
Strategies for identifying primary groups: Evade
Factor Analysis: Treat the adjacency/similarity matrix as a set of N variables and look for latent factors that explain the variance in the data.
[Diagram: single-indicator model. SES is measured by Income and IQ by Math Score; path labels shown: 1.0 and 0.0]
We often use simple indicators and assume they measure our concepts
[Diagram: multiple-indicator model. Latent concepts: SES, IQ. Indicators shown: Income, Occupation, Highest Degree, House Size, Reading Score, Languages Spoken, Math Score]
But we don’t have to! We can imagine that each latent concept causes our indicators, and build a measurement model.
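$\text{Income} = \lambda_1(\text{ses}) + \epsilon_1$
$\text{Occupation} = \lambda_2(\text{ses}) + \epsilon_2$
$\text{HouseSize} = \lambda_3(\text{ses}) + \epsilon_3$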
In a network, we assume that the tie pattern is an imperfect measure of an underlying latent structure that we can explain with similar factors. Instead of lots of “measurements” we have many columns in the adjacency (similarity) matrix, and we can summarize them with factor scores.
- Works best if the similarity matrix has more information, so multiple account data are perfect.
- Or you can transform the data in some way to add more information (e.g., use a distance matrix).
/* this section builds info on how to weight dyads for in-group, out-group. */
twostp=((adjmat+adjmat`)>0)*adjmat;  /* make it either direction w. the first term */
ttie=adjmat#twostp;                  /* =1 if tie contributes to a transitive triple */
ttie=((ttie+ttie`));
adjraw=adjmat;
adjmat=(adjmat+adjmat`);             /* force it to be symmetric, 1=asym 2=reciped */
adjmat=adjmat-diag(adjmat);          /* remove any self ties */
d2=reachlim((adjmat>0),3);
/* re-weight to bias toward recip ties */
wm_4 = (d2=1)#(adjmat=2)#8;          /* recip direct ties */
wm_2a = (d2=1)#(adjmat=1)#4;         /* unrecip direct ties */
wm_1 = 2*(d2=2);                     /* ties 2-steps out */
wm_p5 = 0*(d2=3);                    /* ties 3-steps out - note it's zeroed out here */
wm=wm_4+wm_2a+wm_1+wm_p5+(3*(ttie/(max(ttie)))); /* transitivity is at the end */
wm=wm-diag(wm);
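For readers who do not use SAS/IML, the weighting scheme above can be mirrored in plain Python. This is a sketch under stated assumptions, not the original program: `reachlim` is not defined in these slides, so I assume it returns geodesic distances up to the given limit (0 where a pair is not reached) and stand in a small breadth-first search for it. The guard against a network with no transitive ties is also my addition, and all function names are hypothetical.

```python
def matmul(a, b):
    """Ordinary matrix product for square list-of-lists matrices (IML `*`)."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def limited_distance(sym, max_steps):
    """Assumed stand-in for reachlim: distance up to max_steps, 0 if not reached."""
    n = len(sym)
    dist = [[0] * n for _ in range(n)]
    for source in range(n):
        frontier, seen = {source}, {source}
        for step in range(1, max_steps + 1):
            frontier = {j for i in frontier for j in range(n)
                        if sym[i][j] and j not in seen}
            for j in frontier:
                dist[source][j] = step
            seen |= frontier
    return dist


def dyad_weights(adj):
    """Weight dyads: 8 reciprocated, 4 unreciprocated, 2 two-step, 0 three-step,
    plus up to 3 for ties that sit in transitive triples."""
    n = len(adj)
    either = [[1 if adj[i][j] + adj[j][i] > 0 else 0 for j in range(n)]
              for i in range(n)]
    twostp = matmul(either, adj)
    ttie = [[adj[i][j] * twostp[i][j] for j in range(n)] for i in range(n)]
    ttie = [[ttie[i][j] + ttie[j][i] for j in range(n)] for i in range(n)]
    # symmetrize (1 = asymmetric, 2 = reciprocated) and drop self ties
    sym = [[0 if i == j else adj[i][j] + adj[j][i] for j in range(n)]
           for i in range(n)]
    d = limited_distance([[1 if x > 0 else 0 for x in row] for row in sym], 3)
    top = max(max(row) for row in ttie)
    wm = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if d[i][j] == 1:
                w = 8.0 if sym[i][j] == 2 else 4.0
            elif d[i][j] == 2:
                w = 2.0
            else:
                w = 0.0  # three steps out (or unreached) is zeroed
            if top > 0:  # guard added here; not in the SAS code
                w += 3.0 * ttie[i][j] / top
            wm[i][j] = w
    return wm
```

The resulting matrix plays the role of `wm` above: a valued, symmetric similarity matrix that carries more information than the raw 0/1 ties.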
Here is code I used in the PROSPER data:
/* run factor analysis. Note nfactors is a high value, should only take those
w. EV > 2, but this gives us room... */
proc factor rotate=varimax min=&minev out=factset data=symmat nfactors=175
outstat=fscores noprint;
run; quit;
Strategies for identifying primary groups: Evade
Result:
Each column is a person; the values are that person’s loadings on each retained factor.
Sociogram for a single school
The problem is that there are no necessary connectivity checks: you can get “groups” that are disconnected.
Biggest strengths are:
a) Really fast
b) Allows for overlapping groups
c) Gives you “embeddedness” scores based on factor loadings
Strategies for identifying primary groups: Hybrid
The Crowds Algorithm
1. Identify members of network bicomponents; remove people not included.
2. Cluster the reduced network.
   - Identify optimal number of groups (TREEWALK):
     - For each level of the cluster partition tree do (BFS):
       - Move up the tree from smaller to larger groups.
       - If the fit for both groups is improved by joining them, then do so.
       - If not, then identify group at that level.
   - End TREEWALK.
Do until all groups are identified (GLOBAL LOOP):
3. Evaluate node fit. Do until nodes cannot be moved:
   For each identified cluster do (GRPCHECK):
   - Ensure group is a bicomponent.
   - Calculate effect on group a of moving node j to group a.
   - Calculate effect on j's present group of removing j.
   - If there is a positive net gain to moving j from own group to a, then do so.
   End.
4. Identify bridging members.
   - If removing j from group a would improve the fit of group a, AND assigning j to any other group would lower the fit for that group, then j is considered a bridge. Place all bridges in a separate class.
5. Group check. Check returns to combining groups.
   - If merging groups would improve the fit of all groups to be merged, then do so.
   - Evaluate bridges, to be sure that they are not bridging two groups that have now merged.
End Global loop.
Return to first question: What is a group?
• The simple notions of a complete clique are difficult to square w. real-world data.
• Density is an indicator, but subject to over-grouping (no connectivity) and star-patterns.
• Groups are likely internally differentiated, with “core” vs. “periphery” members.
• Most sociological theories of groups rest on transitive closure and short distances.
• There’s a sense that members are equal: a tight-knit group.
• The group should be fairly small: face-to-face scale.
• The social processes underlying the group turn on reciprocity, trust, communication, homogeneity of norms & beliefs.
• Almost all require a comparative set: in-group to out-group. It is relational, not essential.
• Cross-cutting social circles would lead us to expect overlapping groups, but in practice most methods do not do that, as it’s analytically too cumbersome.
Practically, group detection is hard and most methods will give you (slightly) different results. You can compare results using a Rand statistic (proportion of pairs similarly categorized in two partitions), but for small settings these differences can matter.
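The Rand statistic mentioned above is easy to compute directly from its definition (the proportion of pairs categorized the same way in two partitions). A minimal Python sketch, with a hypothetical function name and partitions given as lists of group labels in the same node order:

```python
from itertools import combinations


def rand_statistic(labels_a, labels_b):
    """Proportion of node pairs treated alike (together in both partitions,
    or apart in both) by two partitions of the same nodes."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        1 for i, j in pairs
        if (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
    )
    return agree / len(pairs)


same = rand_statistic([0, 0, 1, 1], [5, 5, 9, 9])    # identical up to relabeling
crossed = rand_statistic([0, 0, 1, 1], [0, 1, 0, 1])  # agree on only 2 of 6 pairs
```

The statistic depends only on which nodes are grouped together, not on the group labels, so two methods that number their groups differently can still be compared.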
Social Sub-groups: why look?
Wayne Baker: The Social Structure of a National Securities Market:
1) Behavioral assumptions of economic actors
2) Micro-structure of networks
3) Macro-structure of networks
4) Price consequences
Under standard economic assumptions, people should act rationally and act only on price. This would result in expansive and homogeneous (i.e., random) networks. It is, in fact, this structure that allows microeconomic theory to predict that prices will settle to an optimal equilibrium.
Baker’s Model:
He makes two assumptions in contrast to standard economic assumptions:
a) that people do not have access to perfect information, and
b) that some people act opportunistically.
He then shows how these assumptions change the underlying mechanisms in the market, focusing on price volatility as a marker for uncertainty.
The key on the exchange floor is the “market makers”: people who keep the process active and keep trading alive, and thus do not ‘hoard’ (which would lower profits system-wide).
Baker’s Model:
Micronetworks: Actors should trade extensively and widely. Why might they not?
A) Physical factors (noise and distance)
B) Avoid risk and build trust
Macro-Networks: Should be undifferentiated. Why not?
A) Large crowds should be more differentiated than small crowds. Why?
Price consequences: Markets should clear. They often don’t. Why?
Network differentiation reduces economic efficiency, leading to less information and more volatile prices
Baker: Use frequency of exchange to identify the network, resulting in:
Baker finds that the structure of this network significantly (and differentially) affects the price volatility of the network
Groups found w. NEGOPY
The one other program you should know about is NEGOPY. Negopy is a program that combines elements of the density based approach and the graph theoretic approach to find groups and positions. Like CROWDS, NEGOPY assigns people both to groups and to ‘outsider’ or ‘between’ group positions. It also tells you how many groups are in the network.
It’s a DOS based program, and a little clunky to use, but NEGWRITE.MOD will translate your data into NEGOPY format if you want to use it.
There are many other approaches. If you’re interested in some specifically designed for very large networks (10,000+ nodes), I’ve developed something I call Recursive Neighborhood Means that seems to work fairly well.
Baker: Because size is the primary determinant of clustering in this setting, he concludes that the standard economic assumption of large market = efficient is unwarranted.
Scott Feld: Focal Organization of Social Ties
Feld wants to look at the effects of constraint & opportunity for mixing, to situate relational activity within a wider context.
The contexts form “Foci”: “A social, psychological, legal or physical entity around which joint activities are organized” (p. 1016).
People with similar foci will be clustered together. He contrasts this with social balance theory.
Claim: much of the clustering attributed to interpersonal balance processes is really due to focal clustering.
(Note that this is not a theoretically fair critique, given that balance theory can easily accommodate non-personal balance factors (like smoking or group membership), but it is a good empirical critique: most researchers haven’t properly accounted for foci.)
Observed Clustering within Adolescent Social Networks
• On average, 65% of a school’s adolescents are in cohesive sub-groups.
• 87% of all relations are within sub-groups.
• The average sub-group has 22 members.
• The average diameter for a sub-group is 3 steps.
• The mean segregation index is .96 (1=Complete, 0=Random).
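To show how a segregation index on this scale (1 = complete, 0 = random) can be computed, here is a minimal Python sketch of Freeman's segregation index in its simplest undirected form: the shortfall of observed cross-group ties relative to the number expected if ties fell on pairs at random. The slide does not say exactly which variant was used, so treat the formula and the function name as assumptions.

```python
from itertools import combinations


def segregation_index(edges, group_of):
    """(expected cross-group ties - observed cross-group ties) / expected.

    edges: list of undirected ties (a, b); group_of: dict node -> group label.
    Assumes at least two groups, so the expected count is positive.
    """
    nodes = sorted(group_of)
    pairs = list(combinations(nodes, 2))
    cross_pairs = sum(1 for a, b in pairs if group_of[a] != group_of[b])
    observed_cross = sum(1 for a, b in edges if group_of[a] != group_of[b])
    # if ties were placed at random, each tie would cross groups with
    # probability cross_pairs / len(pairs)
    expected_cross = len(edges) * cross_pairs / len(pairs)
    return (expected_cross - observed_cross) / expected_cross


# Two triangles joined by one bridging tie: 7 ties, only 1 crosses groups.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
groups = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
seg = segregation_index(edges, groups)
```

Values below zero are possible when there are more cross-group ties than chance would produce.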
Network Characteristics of Sub Groups
Observed Clustering within Adolescent Social Networks
Distribution of characteristic within groups, relative to school distribution:
Grade       34%
Race        65%
College     84%
GPA         86%
Activities  79%
Smoking     74%
Group Data in Add Health
[Figure: sociograms of selected groups. Panels: Groups 23 & 24, Group 1, Group 15, Group 18]
Group data in Add Health: Inter-Group Relations
[Figure: network of relations among numbered groups, drawn with directed arrows. Legend: Mostly Seniors, Mostly Juniors, Mostly Sophomores, Mostly Freshmen, Mixed Grades]
Group data in Prosper
We have 368 network observations based on 2 cohorts observed over 5 waves in 2 states. Using a variant of the CROWDS algorithm, I identified groups in every network.
- This results in about 4,500 groups averaging about 10 kids each, though some settings are really too cohesive to break into small bits, resulting in “peer groups” of 40ish kids.
Table 1. All groups with > 40 members:
state  cohort  wave  school  group  grpsize  grpnumbc  grppctbc
1      2       1     112     5      45       2         0.82222
1      1       2     112     4      73       2         0.91781
2      1       2     160     11     41       1         0.90244
1      1       1     220     1      45       1         0.93333
2      2       3     262     1      42       1         1.00000
1      1       5     306     1      53       1         0.98113
1      1       5     306     5      66       1         0.87879
2      2       5     351     2      45       2         0.84444
Table 2. Mean network descriptives.
Variable   Mean        Std Dev    Min        Max
NumGrps    13.3287671  8.1827593  2.0000000  50.0000000
pisolate   0.0295607   0.0245523  0          0.1343284
pliaison   0.0391871   0.0422634  0          0.3750000
jfoptmod   0.5605613   0.0661626  0.2668055  0.7366568
Network Group Characteristics
Table 3. Descriptive stats for group-level structure scores.
Variable      Label                                                 N     Mean    Std     Min     Max
grpsize       Number of people in group                             4865  10.025  5.759   1.0     73.0
group         Group label                                           4865  56.644  200.1   0       888.0
igrpties      Sum of within-group ties                              4865  26.461  23.683  0       220.0
s_ogrpties    Sum of ties sent to out-groups                        4865  10.829  8.493   0       99.0
r_ogrpties    Sum of ties received from out-groups                  4865  10.829  9.332   0       111.0
ingrprat      Ratio of in group ties to out-group ties              4482  1.590   2.169   0       49.0
grpsegs       Freeman Segregation index, group specific             4539  0.655   0.164   -0.032  1.0
avgogtrcvd    Per member ties received from other groups            4865  1.055   0.783   0       7.0
avgogtsent    Per member ties sent to other groups                  4865  1.100   0.777   0       6.0
grpden        Density of within group ties                          4777  0.294   0.170   0       1.0
grptran       Transitivity of within group ties                     4379  0.446   0.205   0       1.0
grprecp       Reciprocity of within group ties                      4433  0.393   0.181   0       1.0
grpdst        Mean distance btwn reachable pairs, directed          4433  1.800   0.474   1.0     4.64
grprchbl      Proportion pairs reachable, directed                  4433  0.675   0.231   0.029   1.0
grpdst_sym    Mean distance btwn reachable pairs, undirected        4433  1.777   0.438   1.0     5.50
grprchbl_sym  Proportion pairs reachable, undirected                4433  0.978   0.124   0.044   1.0
grppctbc      Proportion of members in largest bicomponent          4160  0.828   0.191   0.125   1.0
grpnumbc      Number of Bicomponents within group                   4160  1.131   0.377   1.0     5.0
avgpop        Average popularity of members, percentile normalized  4865  0.528   0.171   0.013   0.96
grpcntrlzn    Closeness centralization of the group                 4263  0.431   0.318   0       5.60
AVG USE        wave1   wave2   wave3   wave4   wave5
setting        0.0003  0.0004  0.0018  0.0051  0.0097
group          0.0018  0.0081  0.0139  0.0581  0.1102
person         0.0488  0.0825  0.1985  0.3458  0.5290
ICC - setting  0.0060  0.0049  0.0085  0.0124  0.0149
ICC - group    0.0359  0.0898  0.0665  0.1472  0.1795

IRT USE        wave1   wave2   wave3   wave4   wave5
setting        0.0014  0.0020  0.0052  0.0103  0.0152
group          0.0067  0.0223  0.0400  0.1016  0.1660
person         0.1893  0.2657  0.4377  0.6317  0.8646
ICC - setting  0.0073  0.0068  0.0108  0.0139  0.0145
ICC - group    0.0352  0.0788  0.0880  0.1470  0.1739

AVG DEV        wave1   wave2   wave3   wave4   wave5
setting        0.0003  0.0005  0.0005  0.0015  0.0015
group          0.0055  0.0103  0.0158  0.0302  0.0319
person         0.0751  0.1084  0.1781  0.2364  0.3009
ICC - setting  0.0043  0.0046  0.0024  0.0056  0.0045
ICC - group    0.0685  0.0866  0.0815  0.1140  0.0969

IRT DEV        wave1   wave2   wave3   wave4   wave5
setting        0.0025  0.0032  0.0030  0.0132  0.0090
group          0.0366  0.0523  0.0686  0.0989  0.1058
person         0.3446  0.4030  0.5157  0.5948  0.6753
ICC - setting  0.0065  0.0070  0.0050  0.0186  0.0114
ICC - group    0.0978  0.1173  0.1197  0.1531  0.1429

TGRAD_R        wave1   wave2   wave3   wave4   wave5
setting        0.0066  0.0075  0.0155  0.0051  0.0160
group          0.1004  0.1202  0.1852  0.2040  0.2159
person         0.5686  0.5992  0.6724  0.6605  0.6905
ICC - setting  0.0098  0.0103  0.0178  0.0058  0.0173
ICC - group    0.1552  0.1729  0.2276  0.2397  0.2501

setting        0.0048  0.0088  0.0057  0.0029  0.0025
group          0.0448  0.0623  0.0656  0.0468  0.0240
person         0.9319  0.9629  0.8969  0.8253  0.7735
ICC - setting  0.0049  0.0085  0.0059  0.0033  0.0031
ICC - group    0.0505  0.0690  0.0735  0.0564  0.0324
Group data in Prosper
Cells are Coef. (SE).
                            Group Size (a)                        Reciprocity (b)                       Transitivity (b)
Fixed Effects               Model 1            Model 2            Model 1            Model 2            Model 1            Model 2
School Level
Intercept                   2.370*** (0.027)   2.372*** (0.027)   0.384*** (0.007)   0.382*** (0.006)   0.429*** (0.009)   0.433*** (0.009)
PA School                                      -0.117* (0.056)                       0.029 (0.016)                         -0.009 (0.019)
Treatment School                               -0.027 (0.053)                        0.018 (0.013)                         0.012 (0.018)
Group Level
Group Delinquency (IRT)     -0.151** (0.052)   -0.018 (0.054)     -0.101*** (0.019)  -0.007 (0.022)     -0.087*** (0.023)  0.041 (0.027)
Group Drinking (%)          0.206* (0.090)     0.143 (0.094)      0.123** (0.038)    0.072* (0.035)     0.135** (0.038)    0.105** (0.036)
Family Attachment                              0.213* (0.100)                        -0.030 (0.033)                        0.114* (0.048)
Grades                                         0.088** (0.029)                       0.045* (0.019)                        0.061** (0.020)
Religious Attendance                           0.001 (0.016)                         0.009 (0.007)                         0.018** (0.007)
School Attachment                              -0.016 (0.012)                        0.006 (0.005)                         0.002 (0.006)
Friends Outside of School                      0.016 (0.009)                         -0.021*** (0.005)                     -0.013** (0.005)
Free Lunch (%)                                 -0.317*** (0.064)                     -0.015 (0.034)                        -0.064 (0.043)
Two-Parent Family (%)                          0.020 (0.093)                         -0.012 (0.058)                        -0.027 (0.057)
Male Group                                     -0.029 (0.042)                        -0.038** (0.011)                      0.017 (0.018)
Female Group                                   -0.052 (0.038)                        0.103*** (0.013)                      0.085*** (0.016)
White Group                                    -0.026 (0.036)                        0.028** (0.011)                       -0.001 (0.014)
Group Size                                                                           -0.003*** (0.001)                     -0.006*** (0.001)
Random Effects (Variance Components)
Between (level-2)           0.025***           0.026***           0.001              0.000              0.002**            0.002***
Within (level-1)            2.270              2.170              0.034              0.029              0.040              0.034
***p<.001, **p<.01, *p<.05
Note: SEs are robust (adjusted for clustering) and variables are grand centered.
a Model is hierarchical overdispersed Poisson
b Model is hierarchical linear