changing with time: modelling and detecting user lifecycle periods in online community platforms

CHANGING WITH TIME: MODELLING AND DETECTING USER LIFECYCLE PERIODS IN ONLINE COMMUNITY PLATFORMS DR. MATTHEW ROWE SCHOOL OF COMPUTING AND COMMUNICATIONS @MROWEBOT | [email protected] International Conference on Social Informatics 2013 Kyoto, Japan


DESCRIPTION

Presentation at the International Conference on Social Informatics 2013

TRANSCRIPT

Page 1: Changing with Time: Modelling and Detecting User Lifecycle Periods in Online Community Platforms

CHANGING WITH TIME: MODELLING AND DETECTING USER LIFECYCLE PERIODS IN ONLINE COMMUNITY PLATFORMS DR. MATTHEW ROWE SCHOOL OF COMPUTING AND COMMUNICATIONS @MROWEBOT | [email protected]

International Conference on Social Informatics 2013 Kyoto, Japan

Page 2: Changing with Time: Modelling and Detecting User Lifecycle Periods in Online Community Platforms

Offline Personal Development


Time

Primary School High School University Postgrad Postdoc Lecturing

Page 3: Changing with Time: Modelling and Detecting User Lifecycle Periods in Online Community Platforms

Offline Personal Development


Time

Primary School High School University Postgrad Postdoc Lecturing

Offline, we develop in terms of both our interests and social networks

Page 4: Changing with Time: Modelling and Detecting User Lifecycle Periods in Online Community Platforms

User Development in Online Communities


- Understanding user development enables:
  - Churn prediction (the focus of this paper)
  - Stage-based recommendations (future work)
- Studied thus far in isolated dimensions:
  - Socially (telecoms networks: Miritello et al. 2013)
  - Lexically (online communities: McAuley & Leskovec 2013)
- Without considering user development:
  a) Relative to earlier signals
  b) Relative to the community of interaction

Page 5: Changing with Time: Modelling and Detecting User Lifecycle Periods in Online Community Platforms

Modelling User Lifecycles


Offline lifecycle periods (over time): Primary School, High School, University, Postgrad, Postdoc, Lecturing

Lifecycle periods of a potential question-answering system user (conjecture!): Novice Users, Asking Questions, Asking & Answering Questions, Answering Questions

In reality we do not know the labels; however, we can split a user's lifetime, from first post to last post, into equal time intervals: 1, 2, 3, …, n

Yet users distribute their activity non-uniformly across their lifecycles

Page 6: Changing with Time: Modelling and Detecting User Lifecycle Periods in Online Community Platforms

Lifecycle Periods and User Properties


- Capture period-specific user properties (in period s):
  - In-degree distribution: relative frequency distribution of senders to user u in period s
  - Out-degree distribution: relative frequency distribution of recipients from user u in period s
  - Term distribution: relative frequency distribution of terms used by u in period s
- Divide the lifetime into equal activity periods (1, 2, 3, …, n): #posts in period 1 = #posts in period 2, and so on for every period s

Page 7: Changing with Time: Modelling and Detecting User Lifecycle Periods in Online Community Platforms

Datasets: Online Community Platforms


1. Facebook 'Open University' Groups
   - Containing discussions about courses and degrees
2. SAP Community Network
   - Question-answering system for SAP technologies
3. Server Fault
   - Stack Overflow subsidiary site for server-related issues

- For each dataset we set the number of lifecycle periods (n) to 20

examination of user lifecycles we used data collected from Facebook, the SAP Community Network (SAP) and Server Fault. Table 1 provides summary statistics of the datasets, where we only considered users who had posted more than 40 times within their lifetime on the platform.¹ The Facebook dataset was collected from groups discussing Open University courses, where users talked about their issues with the courses and guidance on studying. The SAP Community Network is a community question answering system related to SAP technologies where users post questions and provide answers related to technical issues. Similarly, Server Fault is a platform that is part of the Stack Overflow question answering site collection² where users post questions related to server-related issues. We divided each platform's users up into 80%/20% splits for training (and analysis) and testing, using the former in this section to examine user development and the latter split for our later detection experiments.

Table 1. Statistics of the online community platform datasets.

Platform      Time Span                 Post Count  User Count
Facebook      [18-08-2007, 24-01-2013]  118,432     4,745
SAP           [15-12-2003, 20-07-2011]  427,221     32,926
Server Fault  [01-08-2008, 31-03-2011]  234,790     33,285

3.1 Defining Lifecycle Periods

In order to examine how users develop over time we needed some means to segment a user's lifetime (i.e. from the first date at which they post to the date of their final post) into discrete intervals. Prior work [6, 2, 5] has demonstrated the extent to which users develop at their own pace and thus evolve according to their own 'personal clock' [5]. Hence, for deriving the lifecycle periods of users within the platforms we adopted an activity-slicing approach that divided a user's lifetime into 20 discrete time intervals, emulating the approach in [2], but with an equal proportion of activity within each period. This approach functions as follows: we derive the set of interval tuples ($\{[t_i, t_j]\} \in T$) by first deriving the chunk size (i.e. the number of posts in a single period) for each user; we then sort the posts in ascending date order, before deriving the start and end points of each interval in an incremental manner. This derives the set of time intervals $T$ that are specific to a given user.
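The activity-slicing steps described above (chunk size, sort, incremental start and end points) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `activity_slices`, the use of numeric timestamps, and the choice to let the final period absorb any remainder posts are all assumptions.

```python
from typing import List, Sequence, Tuple


def activity_slices(post_times: Sequence[float],
                    n_periods: int = 20) -> List[Tuple[float, float]]:
    """Split a user's posts into n_periods intervals with (almost) equal post counts.

    post_times are assumed to be numeric timestamps (e.g. Unix epoch seconds).
    Returns one (start, end) tuple per lifecycle period.
    """
    if len(post_times) < n_periods:
        raise ValueError("need at least one post per lifecycle period")
    ordered = sorted(post_times)            # ascending date order
    chunk = len(ordered) // n_periods       # chunk size: posts per period
    intervals = []
    for i in range(n_periods):
        start = i * chunk
        # Assumption: the last period takes any posts left over by integer division.
        end = len(ordered) if i == n_periods - 1 else start + chunk
        intervals.append((ordered[start], ordered[end - 1]))
    return intervals
```

With exactly 40 posts and 20 periods every interval covers two posts, which matches the minimum activity threshold mentioned in the dataset description.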

3.2 Modelling User Properties

Based on the defined lifecycle periods, we now move on to defining user properties, capturing social and lexical dynamics, and tracking how the properties of users change over time.

In-degree and Out-degree Distributions. Starting with social dynamics, we assessed the in-degree and out-degree distributions of users: the in-degree

¹ Choosing 40 posts so that we had at least 2 posts per lifecycle period.
² http://stackoverflow.com/

Page 8: Changing with Time: Modelling and Detecting User Lifecycle Periods in Online Community Platforms


We can assess users' properties in each of these lifecycle periods for:
1. Property changes relative to earlier properties
2. Property changes relative to the online community's properties

Page 9: Changing with Time: Modelling and Detecting User Lifecycle Periods in Online Community Platforms

Assessing User Evolution: Period Cross-Entropy


- Cross-entropy: 'uncertainty' of one distribution (Q) relative to another (P)
- Computed between period s and earlier periods, choosing the minimum cross-entropy

that we eschew time-comparative assessments of how a user is changing relative to earlier properties. To inform such cross-period assessment we examined the users' in-degree, out-degree and term distributions across lifecycle periods by computing the cross-entropy of one probability distribution with respect to another distribution from an earlier lifecycle period, and then selecting the distribution that minimises cross-entropy. Assuming we have a probability distribution ($P$) formed from a given lifecycle period ($[t, t']$), and a probability distribution ($Q$) from an earlier lifecycle period, then we define the cross-entropy between the distributions as follows:

$$H(P, Q) = -\sum_{x} p(x) \log q(x) \qquad (5)$$

In the same vein as the earlier entropy analysis, we derived the period cross-entropy for each platform's users throughout their lifecycles and then derived the mean cross-entropy for the 20 lifecycle periods. Figure 2 presents the cross-entropies derived for the different platforms and user properties. We observe that for each distribution and each platform cross-entropies reduce throughout users' lifecycles, suggesting that users do not tend to exhibit behaviour that has not been seen previously. For instance, for the in-degree distribution the cross-entropy gauges the extent to which the users who contact a given user at a given lifecycle stage differ from those who have contacted him previously, where a larger value indicates greater divergence. We find that consistently across the platforms, users are contacted by people who have contacted them before and that fewer novel users appear. The same is also true for the out-degree distributions: users contact fewer new people than they did before. This is symptomatic of community platforms where despite new users arriving within the platform, users form sub-communities in which they interact and communicate with the same individuals. Figure 2(c) also demonstrates that users tend to reuse language over time and thus produce a gradually decaying cross-entropy curve.
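The period cross-entropy described above, i.e. the minimum of $H(P, Q)$ over all earlier periods, can be sketched as below. This is an illustrative reading only: the helper names are hypothetical, the natural logarithm is assumed (the text does not state a base), and the small constant `eps` standing in for $q(x) = 0$ is an assumption, since the text does not say how unseen items are handled.

```python
import math
from collections import Counter
from typing import Dict, Hashable, Iterable, List


def rel_freq(items: Iterable[Hashable]) -> Dict[Hashable, float]:
    """Relative frequency distribution of the observed items."""
    counts = Counter(items)
    total = sum(counts.values())
    return {item: c / total for item, c in counts.items()}


def cross_entropy(p: Dict[Hashable, float], q: Dict[Hashable, float],
                  eps: float = 1e-9) -> float:
    """H(P, Q) = -sum_x p(x) log q(x); eps replaces q(x) = 0 (an assumption)."""
    return -sum(px * math.log(max(q.get(x, 0.0), eps)) for x, px in p.items())


def period_cross_entropy(periods: List[Dict[Hashable, float]], s: int) -> float:
    """Minimum cross-entropy between period s (0-based, s >= 1) and any earlier period."""
    p = periods[s]
    return min(cross_entropy(p, q) for q in periods[:s])
```

For example, with `periods = [rel_freq("aabb"), rel_freq("ab"), rel_freq("aaab")]`, calling `period_cross_entropy(periods, 2)` compares the third period against both earlier ones and keeps the smaller value.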

[Figure 2: three panels, (a) In-degree, (b) Out-degree and (c) Lexical, each plotting Cross Entropy (y-axis) against Lifecycle Stages (x-axis, 0 to 1) for Facebook, SAP and Server Fault.]

Figure 2. Cross-entropies derived from comparing users' in-degree, out-degree and lexical term distributions with previous lifecycle periods. We see a consistent reduction in the cross-entropies over time.

3) Community Contrasts (Community Cross-Entropy):

For the third inspection of user lifecycles and how user properties change, we examined how users compare with the platform in which they are interacting over the same time interval. We used the in-degree, out-degree and term distributions and compared them with the same distributions derived globally over the same time periods. For the global probability distributions we used the same means as for forming user-specific distributions, but rather than using the set of posts that a given user had authored ($P_{u_i}$) to derive the probability distribution, we instead used all posts. For instance, for the global in-degree distribution we used the frequencies of received messages for all users. Given the discrete probability distribution of a user from a time interval ($P_{[t,t']}$), and the global probability distribution over the same time interval ($Q_{[t,t']}$), we derived the cross-entropy as above between the distributions: $H(P_{[t,t']}, Q_{[t,t']})$.

As before, we derived the community cross-entropy for each platform's users over their lifetimes and then calculated the mean community cross-entropy for the lifecycle periods. Figure 3 presents the plots of the cross-entropies for the in-degree, out-degree and term distributions over the lifecycle periods. We find that for all platforms the community cross-entropy of users' in-degree increases over time, indicating that a given user tends to diverge in his properties from users of the platform. For instance, for the community cross-entropy of the in-degree distribution the divergence towards later parts of the lifecycle indicates that users who reply to a given user differ from the repliers in the entire community. This complements cross-period findings from above where we see a reduction in cross-entropy, thus suggesting that users form sub-communities within which interaction is consistently performed (i.e. reduction in new users joining). We find a similar effect for the out-degree of the users where divergence from the community is evident towards the latter stages of users' lifecycles. The term distribution demonstrates differing effects however: for Facebook and SAP we find that the community cross-entropy reduces initially before rising again towards the end of the lifecycle, while for Server Fault there is a clear increase in community cross-entropy towards the latter portions of users' lifecycles, suggesting that the language used by the users actually tends to diverge from that of the community in a linear manner. This effect is consistent with the findings of Danescu et al. [2] where users adapt their language to the community to begin with, before then diverging towards the end.
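As a rough illustration of the community cross-entropy for the term distribution, the sketch below compares one user's terms in a time interval with the terms of all posts in the same interval. The function names and the whitespace tokenisation are assumptions, as is the natural logarithm. Because the community's posts are assumed to include the user's own posts, every term the user uses has non-zero community probability and no smoothing is needed.

```python
import math
from collections import Counter
from typing import Dict, List


def term_distribution(posts: List[str]) -> Dict[str, float]:
    """Relative frequency of terms over a set of posts (naive whitespace tokens)."""
    counts = Counter(term for post in posts for term in post.lower().split())
    total = sum(counts.values())
    return {term: c / total for term, c in counts.items()}


def community_cross_entropy(user_posts: List[str], all_posts: List[str]) -> float:
    """H(P, Q) with P from the user's posts and Q from all posts in the interval.

    Assumes user_posts is a subset of all_posts, so q(term) > 0 for every user term.
    """
    p = term_distribution(user_posts)
    q = term_distribution(all_posts)
    return -sum(px * math.log(q[term]) for term, px in p.items())
```

A higher value means the user's vocabulary in that interval is less well predicted by the community's vocabulary, which is the sense of divergence used in the analysis.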

V. MINING LIFECYCLE TRAJECTORIES

Inspecting how communities of users develop we have concentrated on assessments at the macro-level on each platform, examining how the social dynamics and lexical dynamics of communities of users have changed over time. We now turn to examining how individual users evolve throughout their lifecycle periods. Understanding how individual users develop over time in online community platforms allows for churners to be predicted, as we shall demonstrate in the following section through our experiments, and also

[Fig. 1: six panels plotted against Lifecycle Stages (x-axis, 0 to 1) for Facebook, SAP and Server Fault. Panels (a) In-degree, (b) Out-degree and (c) Lexical show Period Cross-Entropy (y-axis: Time-period Cross Entropy); panels (d) In-degree, (e) Out-degree and (f) Lexical show Community Cross-Entropy (y-axis: Distribution Cross Entropy).]

Fig. 1. Changes in user properties throughout lifecycle periods based on: period cross-entropy (Fig. 1(a), 1(b), 1(c)); and community cross-entropy (Fig. 1(d), 1(e), 1(f)).

probability distributions were formed using the same means as above for the user-specific distributions, but instead using all posts rather than those by a given user from which to form the edges and term distributions. Therefore, given the discrete probability distribution of a user from a time interval ($P_{[t,t']}$), and the global probability distribution over the same time interval ($Q_{[t,t']}$), we derived the cross-entropy, as above, between the distributions: $H(P_{[t,t']}, Q_{[t,t']})$.

As before, we derived the community cross-entropy for each platform's users over their lifetimes and then calculated the mean community cross-entropy for the lifecycle periods. Fig. 1(d), Fig. 1(e) and Fig. 1(f) present the plots of the cross-entropies for the in-degree, out-degree and term distributions over the lifecycle periods. We find that for all platforms the community cross-entropy of users' in-degree increases over time, indicating that a given user tends to diverge in his properties from users of the platform. For instance, for the community cross-entropy of the in-degree distribution the divergence towards later parts of the lifecycle indicates that users who reply to a given user differ from the repliers in the entire community. This complements cross-period findings from above where we see a reduction in cross-entropy, thus suggesting that users form sub-communities in which interaction is consistently performed within (i.e. reduction

Convergence on prior properties

Page 10: Changing with Time: Modelling and Detecting User Lifecycle Periods in Online Community Platforms

Assessing User Evolution: Community Cross-Entropy


- Computed between properties in period s of user u and properties of the community in period s
  - I.e. term distribution of entire community


Convergence on community properties Divergence from the community

Page 11: Changing with Time: Modelling and Detecting User Lifecycle Periods in Online Community Platforms


How can we use these developmental signals to detect the lifecycle period of a user?

Page 12: Changing with Time: Modelling and Detecting User Lifecycle Periods in Online Community Platforms

Detecting Lifecycle Periods: Engineering Features


- Growth feature: proportionate change in a measure
- Dataset formalisation:

derive the probability distribution, we instead used all posts. For instance, for the global in-degree distribution we used the frequencies of received messages for all users. Given the discrete probability distribution of a user from a time interval ($P_{[t,t']}$), and the global probability distribution over the same time interval ($Q_{[t,t']}$), we derived the cross-entropy as above between the distributions: $H(P_{[t,t']}, Q_{[t,t']})$.

Again, as with period cross-entropies, we find churners' signals to have a lower magnitude than non-churners, suggesting that non-churners' properties tend to diverge from the community as they progress throughout their lifetime within the online community platforms. There are some noticeably noisy signals however, in particular for Facebook and the in-degree distribution and lexical term distributions of non-churners. Generally for each signal we see a growing curve towards later lifecycle periods for both churners and non-churners, while the magnitudes of the curves are the salient differentiating feature.

[Figure 4: nine panels, each plotting Community Cross Entropy (y-axis) against Lifecycle Stages (x-axis, 0 to 1) for Churners and Non-churners: (a) In-degree - Facebook, (b) In-degree - SAP, (c) In-degree - Server Fault, (d) Out-degree - Facebook, (e) Out-degree - SAP, (f) Out-degree - Server Fault, (g) Lexical - Facebook, (h) Lexical - SAP, (i) Lexical - Server Fault.]

Figure 4: Community cross-entropy comparing churners with non-churners along: in-degree (Figure 4(a), 4(b) and 4(c)), out-degree (Figure 4(d), 4(e) and 4(f)) and lexical term distributions (Figure 4(g), 4(h) and 4(i)).

5. CHURN PREDICTION MODEL

Our analysis of the differences between churners and non-churners exposed latent descriptions and signals of how these groups of users develop throughout their lifecycles, finding different development signals in terms of each measure's magnitude and rate of change. In this section we turn to the problem of engineering a model to predict churners by using our prior insights to build features.

5.1 Feature Engineering

Our analysis results indicate that churners and non-churners differed from one another in terms of: (i) decay rates for certain measures, i.e. out-degree period cross-entropy; and (ii) the magnitude of features, i.e. lexical period cross-entropy. Therefore we define two types of features that our prediction model uses: (i) rates and (ii) magnitudes, where each feature is measured for a given lifecycle period. To ease feature definition and model specification, we alter the lifecycle period notation from the existing interval tuple set (i.e. $[t, t'] \in T$) to use a set of discrete single elements: $s \in S$, where $S = \{1, 2, \ldots, 20\}$. Magnitude features are defined as a given user's measure taken at a given lifecycle period: $m(u, s)$, where the measure for user $u$ is taken at lifecycle period $s$. Rates are defined as changes in measures from one lifecycle period to the next:

$$\delta_m(u, s) = \frac{dm}{ds} = \frac{m(u, s+1) - m(u, s)}{m(u, s)} \qquad (5)$$

Where $\delta_m$ is indexed by the given measure (i.e. in-degree period cross-entropy), using the above magnitude function to return the magnitude of a given measure ($m$) for user $u$ at the allotted lifecycle period. Thus a feature vector ($x$) is formed for a single user using these rate and magnitude features:

$$x = [m_1(u, 2), \ldots, m_1(u, 19), m_2(u, 2), \ldots, m_2(u, 19), \ldots, \delta_{m_1}(u, 2), \ldots, \delta_{m_1}(u, 18), \delta_{m_2}(u, 2), \ldots, \delta_{m_2}(u, 18)]$$

As a result of using both rates and magnitudes from each of the 20 lifecycle periods, aside from the first and last one for magnitudes and the first and last two for rates, we are provided with at most 210 features, given the provision of 18 magnitude features for each of the 6 measures and 17 rate features for each of the 6 measures. As we will describe within the experiments section, we vary the features used between different: (i) feature sets, i.e. in-degree based features, community cross-entropy based features, etc.; and (ii) lifecycle periods. On this latter aspect we address one of the research questions that has driven this work: how early into a user's lifecycle can we predict them churning? By constraining the features to early time periods and thus iteratively increasing the number of features to use, and hence the lifecycle periods that the user's information covers, we demonstrate the earliest point, for each platform, at which we can accurately predict churners.
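The feature count above (6 × 18 magnitudes plus 6 × 17 rates, giving 210) can be checked with a small sketch that assembles one user's vector. The function names, the dictionary layout and the ordering of measures are hypothetical, and the sketch assumes every measure value is non-zero so that the rate is defined.

```python
from typing import Dict, List


def rate(m: List[float], s: int) -> float:
    """Rate for 1-based period s: (m(s+1) - m(s)) / m(s); m[0] holds period 1."""
    return (m[s] - m[s - 1]) / m[s - 1]


def feature_vector(measures: Dict[str, List[float]]) -> List[float]:
    """Build [magnitudes..., rates...] from 20 per-period values per measure.

    Magnitudes use periods 2..19 (18 values); rates use periods 2..18 (17 values).
    """
    names = sorted(measures)  # fixed, arbitrary measure order (an assumption)
    magnitudes = [measures[n][s - 1] for n in names for s in range(2, 20)]
    rates = [rate(measures[n], s) for n in names for s in range(2, 19)]
    return magnitudes + rates
```

Passing six measures, each with 20 period values, yields a vector of length 210; restricting the period ranges gives the smaller early-lifecycle feature sets mentioned in the text.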

5.2 Model Definition and Learning

We are provided with, for each of our three online community platforms, a training dataset and a testing dataset each taking the following form: $\{(x_i, y_i)\} \in D$, where $x_i$ defines the feature vector of user $u_i$ and $y_i$ defines the class label of the user, taking a value from the set $\{0, 1\}$: 1 if the user churned and 0 otherwise. In defining the prediction model our goal was to induce a function that predicts the probability of churning based on a user's feature vector $f: \mathbb{R}^n \to [0, 1]$. We define this function as follows for an arbitrary feature vector $x$ and learnt model weights ($w$) as

in new users joining). We find a similar effect for the out-degree of the users where divergence from the community is evident towards the latter stages of users' lifecycles. The term distribution demonstrates differing effects however: for Facebook and SAP we find that the community cross-entropy reduces initially before rising again towards the end of the lifecycle, while for Server Fault there is a clear increase in community cross-entropy towards the latter portions of users' lifecycles, suggesting that the language used by the users actually tends to diverge from that of the community in a linear manner. This effect is consistent with the findings of Danescu et al. [2] where users adapt their language to the community to begin with, before then diverging towards the end.

4 Lifecycle Period Detection

The above analysis unearthed the development that users go through across the three community platforms, based on different properties and development indicators. We now turn to the problem of detecting the lifecycle period that a given user is in based on his prior development. We characterise this task as a multi-class classification problem in which our goal is to induce a function that returns a user's lifecycle period: $f: \mathbb{R}^n \to S$, where the domain is an $n$-dimensional feature vector representation of a user's evolution and the co-domain is an integer representation of the lifecycle period ($S = \{6, 7, \ldots, 20\}$).⁴

4.1 Feature Engineering

In the previous section we found that different lifecycle periods can be characterised and thus detected based on the evolution of the user into the lifecycle period: i.e. when inspecting the final lifecycle period of users on each platform in terms of their lexical period cross-entropy, we find that this is less than previous periods, and that the rate of decay has levelled off somewhat. Hence we can use the growth rate from one lifecycle period to the next as information for detecting the lifecycle period of the user. We define this growth rate as follows:⁵ $\delta_m(s) = \big(m(s+1) - m(s)\big)/m(s)$, where $m(s)$ denotes a convenience function that returns the measure value (e.g. in-degree period cross-entropy) for the lifecycle period $s$; thus $\delta_m(s) < 0$ indicates decay, $\delta_m(s) = 0$ indicates no change, and $\delta_m(s) > 0$ indicates growth. The growth rate therefore forms a single growth feature in our dataset, and by looking back $k$ lifecycle periods we produce $k$ growth features. Based on the 3 user properties (in-degree, out-degree, and lexical distributions) and 3 development indicators (entropy, period cross-entropy and community cross-entropy) we have a total of 9 measures that the convenience function $m$ can take, with $k$ growth features derived for each measure ($m \in M$). Our datasets, both for training and testing data, take

⁴ We use this integer representation for legibility and consider periods from 6 onwards to provide sufficient training data.

⁵ This growth rate is equivalent to proportionate growth rates used in population models.


the following form: $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, where x denotes the feature vector of a given user and y denotes the class label (lifecycle period) of the user. Hence the feature vector is a characterisation of the user up until and including the lifecycle period s and describes how the user has developed beforehand. The feature vector contains growth features of the user across different measures: $x = [\delta m_1(s-1), \ldots, \delta m_1(s-k), \delta m_2(s-1), \ldots, \delta m_2(s-k)]$. This format indicates that for both measures ($m_1$ and $m_2$) we include k growth features. We maintain the splits mentioned above by using the 80% of users analysed above for the training set and the held-out 20% of users for the testing split. For this latter dataset we hid the class labels and detected the label using the model below.
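The growth-feature construction above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`growth_rate`, `growth_features`, `feature_vector`), the representation of a measure as a mapping from period number to value, and the toy values are all hypothetical and not taken from the paper. The sketch also assumes m(s) is never zero, since the growth rate is undefined in that case.

```python
from typing import Dict, List, Mapping

# A measure m is represented (hypothetically) as {period number: measure value}.
Measure = Mapping[int, float]


def growth_rate(m: Measure, s: int) -> float:
    """Proportionate growth from period s to s + 1: (m(s+1) - m(s)) / m(s)."""
    # Assumes m[s] != 0; the growth rate is undefined otherwise.
    return (m[s + 1] - m[s]) / m[s]


def growth_features(m: Measure, s: int, k: int) -> List[float]:
    """Growth rates for the k periods preceding s: [dm(s-1), ..., dm(s-k)]."""
    return [growth_rate(m, s - j) for j in range(1, k + 1)]


def feature_vector(measures: Dict[str, Measure], s: int, k: int) -> List[float]:
    """Concatenate the k growth features of every measure, in dict order."""
    x: List[float] = []
    for m in measures.values():
        x.extend(growth_features(m, s, k))
    return x


# Toy values for one measure over periods 1..4 (made up for illustration).
toy_measure = {1: 2.0, 2: 1.0, 3: 1.0, 4: 1.5}
print(growth_features(toy_measure, s=4, k=3))  # [0.5, 0.0, -0.5]
```

In the toy example the user is in period 4: the measure grew from period 3 to 4, was flat from 2 to 3, and decayed from 1 to 2. With two measures and k = 3, `feature_vector` would return six values, matching the layout of x given above.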

4.2 Vector Space Detection Model

To induce a detection function that can perform multi-class classification we use a vector space representation of classes and their boundaries to identify the most similar, or rather proximal, class to an arbitrary user's feature vector. We define this function as: $f(x) = \arg\max_{s \in S} \mathrm{sim}(x, s)$, where sim is derived using a given similarity function and thus chooses the class (lifecycle period) that maximises similarity. We vary the similarity function ($\mathrm{sim}(x, s)$) through four measures:

1. Cosine Similarity: Measures the cosine of the angle between the user's feature vector x and the class centroid vector from the training data $p_s$.

2. Euclidean Distance: Measures the distance between the vectors and then takes the reciprocal of this distance to derive the similarity measure, as the reciprocal distance is maximised when the Euclidean distance is minimised.

3. Mahalanobis Distance: Accounts for the variance in the class distribution from which the centroid vector is derived by including the covariance matrix $\Sigma_s$ of the class s: $\mathrm{sim}_{mah} = \left( (x - p_s)\, \Sigma_s\, (x - p_s) \right)^{-\frac{1}{2}}$

4. Spearman Rank Correlation Coefficient: Measures the extent to which a monotonically increasing or decreasing association exists between the feature vector and class centroid: 1 indicates a strong positive association, −1 indicates a strong negative association.

5 Experiments

5.1 Experimental Setup

For the experiments we tested the performance of different feature sets and similarity functions: for the feature sets we tested in-degree (i.e. in-degree entropy, in-degree period cross-entropy, etc.), out-degree and lexical features, followed by entropy (i.e. in-degree entropy, out-degree entropy, etc.), period cross-entropy and community cross-entropy features, before then combining all features together. We set the value of k, the number of previous periods to derive the growth rates from, to 5, and tested the performance of detecting all lifecycle periods ($S = \{6, 7, \ldots, 20\}$). We applied the classification accuracy measures of


Class label (lifecycle period)

Using growth features from the previous k lifecycle periods

m = measure of user property (e.g. in-degree distribution) by a developmental function (e.g. community cross-entropy)

= Decay = Growth

Page 13: Changing with Time: Modelling and Detecting User Lifecycle Periods in Online Community Platforms

Detecting Lifecycle Periods: Vector Space Model


¨  Goal: induce a surjective function that returns a user’s lifecycle period from his growth features

¨  Hence, we can vary the similarity function:

1.  Cosine similarity (class centroid similarity)
2.  Reciprocal of Euclidean distance (to class centroid)
3.  Reciprocal of Mahalanobis distance (considering class covariance)
4.  Spearman Rank Correlation Coefficient (monotonic association)

in new users joining). We find a similar effect for the out-degree of the users where divergence from the community is evident towards the latter stages of users' lifecycles. The term distribution demonstrates differing effects however: for Facebook and SAP we find that the community cross-entropy reduces initially before rising again towards the end of the lifecycle, while for Server Fault there is a clear increase in community cross-entropy towards the latter portions of users' lifecycles, suggesting that the language used by the users actually tends to diverge from that of the community in a linear manner. This effect is consistent with the findings of Danescu et al. [2] where users adapt their language to the community to begin with, before then diverging towards the end.

4 Lifecycle Period Detection


n-dimension growth feature vector

Choose the most similar class from vector space proximity/similarity

Page 14: Changing with Time: Modelling and Detecting User Lifecycle Periods in Online Community Platforms

Detecting Lifecycle Periods: Evaluation


¨  Setup:
    ¤  Set the number of prior lifecycle periods (k) to 5
    ¤  Detected all periods in the closed interval [6,20]
    ¤  80% users for training (analysed above), 20% test
        -  Used the former to induce centroids and covariance matrices
    ¤  Varied feature sets (e.g. just in-degree, just cross-entropy, etc.)

¨  Accuracy Measures:
    ¤  F1: F-measure (harmonic mean of precision & recall, beta=1)
    ¤  MCC: Matthews (not mine!) Correlation Coefficient

¨  Baselines:
    ¤  Random Model (implicit in MCC)
    ¤  Naïve Bayes

Page 15: Changing with Time: Modelling and Detecting User Lifecycle Periods in Online Community Platforms

Detecting Lifecycle Periods: Results


precision and recall as macro averages over the classes under inspection ($s \in S$) and then took the harmonic mean of precision and recall as the f-measure (F1 score). We used two baselines in our experiments: (i) a random guesser model, formed from the probability of success in a single Bernoulli trial per class; and (ii) the naive Bayes classifier. The latter compares the vector space model against an existing generative model that is regularly used in multi-class classification tasks, while the former measures the performance of each vector space model against the random model using the Matthews correlation coefficient (mcc).
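The macro-averaged precision and recall, and the F1 score taken as their harmonic mean, can be sketched as follows. This is an illustrative reading of the description above rather than the evaluation code used in the experiments. The function names and toy labels are hypothetical, and scoring a class as 0 when it has no predictions (or no true instances) is an assumption, since the text does not say how such cases were handled.

```python
from typing import List, Sequence, Tuple


def macro_precision_recall(y_true: Sequence[int], y_pred: Sequence[int],
                           classes: Sequence[int]) -> Tuple[float, float]:
    """Average per-class precision and recall with equal weight per class."""
    precisions: List[float] = []
    recalls: List[float] = []
    pairs = list(zip(y_true, y_pred))
    for c in classes:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        # Assumption: a class with an empty denominator scores 0.
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    return sum(precisions) / len(precisions), sum(recalls) / len(recalls)


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (beta = 1)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Toy labels for two lifecycle periods (made up for illustration).
y_true = [6, 6, 7, 7]
y_pred = [6, 7, 7, 7]
p, r = macro_precision_recall(y_true, y_pred, classes=[6, 7])
print(round(p, 3), round(r, 3), round(f1_score(p, r), 3))
```

In the toy case period 6 has precision 1 and recall 0.5, while period 7 has precision 2/3 and recall 1, so the macro averages are 5/6 and 3/4. Note that this F1 is computed from the macro-averaged precision and recall, as the text describes, which can differ from averaging per-class F1 scores.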

Table 2. F1 scores on the different platforms when detecting the user lifecycle periods using different detection models and feature sets, with the Matthews Correlation Coefficient in parentheses to show improvement over the random model baseline.

Platform      Feature Set           Cosine          Euclidean          Mahalanobis     Spearman
Facebook      In-degree             0.677 (0.637)   0.757 (0.730)      0.706 (0.659)   0.672 (0.627)
              Out-degree            0.609 (0.582)   0.751 (0.718)      0.703 (0.665)   0.592 (0.553)
              Lexical               0.653 (0.632)   0.757 (0.730)      0.739 (0.700)   0.629 (0.601)
              Entropy               0.674 (0.618)   0.757 (0.730)      0.676 (0.621)   0.654 (0.602)
              Period Cross Entropy  0.650 (0.590)   0.774** (0.746)    0.630 (0.586)   0.647 (0.589)
              Comm’ Cross Entropy   0.643 (0.592)   0.760 (0.732)      0.657 (0.610)   0.671 (0.621)
              All                   0.676 (0.614)   0.757 (0.730)      0.659 (0.608)   0.686 (0.633)

SAP           In-degree             0.582 (0.520)   0.665 (0.652)      0.426 (0.376)   0.583 (0.527)
              Out-degree            0.597 (0.571)   0.658 (0.647)      0.600 (0.588)   0.574 (0.541)
              Lexical               0.583 (0.521)   0.665 (0.652)      0.431 (0.378)   0.558 (0.499)
              Entropy               0.522 (0.468)   0.665 (0.652)      0.470 (0.418)   0.532 (0.467)
              Period Cross Entropy  0.643 (0.591)   0.656 (0.651)      0.434 (0.377)   0.640 (0.590)
              Comm’ Cross Entropy   0.546 (0.497)   0.708*** (0.677)   0.529 (0.475)   0.520 (0.466)
              All                   0.619 (0.565)   0.665 (0.652)      0.423 (0.364)   0.640 (0.590)

Server Fault  In-degree             0.671 (0.631)   0.748 (0.721)      0.718 (0.664)   0.667 (0.619)
              Out-degree            0.635 (0.613)   0.760 (0.727)      0.732 (0.683)   0.608 (0.580)
              Lexical               0.666 (0.631)   0.748 (0.721)      0.711 (0.663)   0.643 (0.595)
              Entropy               0.669 (0.631)   0.748 (0.721)      0.703 (0.637)   0.654 (0.602)
              Period Cross Entropy  0.701 (0.650)   0.774** (0.747)    0.622 (0.584)   0.702 (0.660)
              Comm’ Cross Entropy   0.650 (0.597)   0.738 (0.710)      0.709 (0.647)   0.651 (0.603)
              All                   0.698 (0.637)   0.748 (0.721)      0.706 (0.637)   0.680 (0.632)

Significance codes: p-value < 0.001 '***'; < 0.01 '**'; < 0.05 '*'; < 0.1 '.'

5.2 Detection Results

Table 2 shows the detection results across the platforms using different feature sets and similarity functions. For all platforms Euclidean distance performed best, with the best-performing feature set varying between platforms: period cross-entropy for Facebook and Server Fault, and community cross-entropy for SAP. This suggests that information about the user evolving with respect to his earlier properties and/or the community is sufficient to detect the period of the user. We performed significance testing using the Mann-Whitney test of the difference in performance between the best model from each platform and the next best performing model, and found the differences to all be significant (as indicated in the table). For all of our tested models we significantly outperformed the random model baseline, as indicated by the mcc values within the parentheses. We omitted the performance of the naive Bayes classifier for legibility; however, it suffices to add that it achieved low F1 scores across all platforms and feature sets (0.092, 0.098 and 0.126 for Facebook, SAP and

Key: F1 (MCC)

Page 16: Changing with Time: Modelling and Detecting User Lifecycle Periods in Online Community Platforms

Conclusions and Future Work


¨  Modelling user development provides signals for detecting lifecycle periods
    ¤  Based on social and lexical dynamics

¨  Users’ development is influenced by the community:
    ¤  Convergence, then divergence socially and lexically

¨  Users tend to not exhibit unseen behaviour
    ¤  I.e. reuse terms, maintain existing relationships

¨  Future Work:
    ¤  Addressing the necessity for knowledge of prior lifecycle stages
    ¤  Stage-based recommendation

Page 17: Changing with Time: Modelling and Detecting User Lifecycle Periods in Online Community Platforms

@mrowebot [email protected]

http://www.lancaster.ac.uk/staff/rowem/

Questions?