
User Behaviour Modeling on Financial Message Boards

Pritha DN and Sahaj Biyani

Abstract

Online social communities such as discussion boards and message boards are evolving fast in their usage, bringing people with similar interests together. From a social and anthropological standpoint, they are more interesting to study than Online Social Networks because they connect people from different backgrounds and histories, most often with no offline links. Various theories exist in sociology about the behavior of users in online forums. In this paper, we study the applicability of one such theory, “Participation Inequality”, to financial message boards. We consider the activity of a user, the user’s network interaction structure and the content of the user’s postings, and employ Machine Learning techniques to identify, cluster and infer the roles of users exhibiting similar behavior.

1 Introduction

1.1 Participation inequality

In Internet culture, the 1% rule is a rule of thumb pertaining to the participation of users in an Internet community, stating that only 1% of the users of a website actively create new content, while the other 99% of the participants only lurk. Variants include the 1-9-90 rule (sometimes called the 90–9–1 principle or the 89:10:1 ratio), which states that in a collaborative website such as a wiki, 90% of the participants of a community only view content, 9% of the participants edit content, and 1% of the participants actively create new content. A related observation is that 1% of users generate the majority of revenue in free-to-play games. Similar rules are known in information science, such as the 80/20 rule known as the Pareto principle, which states that 20 percent of a group will produce 80 percent of the activity, however the activity may be defined.
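Rules of this kind can be checked directly once per-user post counts are available. The following Python sketch is purely illustrative: the function name `top_share` and the toy post counts are assumptions introduced here, not data from any real community.

```python
def top_share(post_counts, top_fraction):
    """Fraction of all posts made by the most active `top_fraction` of users."""
    if not post_counts:
        raise ValueError("post_counts must not be empty")
    ranked = sorted(post_counts, reverse=True)
    # At least one user is always counted in the "top" group.
    k = max(1, round(len(ranked) * top_fraction))
    total = sum(ranked)
    return sum(ranked[:k]) / total if total else 0.0


# Hypothetical community of 100 users: 1 heavy poster, 9 occasional, 90 silent.
counts = [500] + [20] * 9 + [0] * 90
print(f"share of posts by top 1% of users:  {top_share(counts, 0.01):.3f}")
print(f"share of posts by top 10% of users: {top_share(counts, 0.10):.3f}")
```

A strongly skewed community yields a top-1% share close to 1, whereas perfectly equal participation yields a share close to the fraction itself.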

1.2 User behaviour in online communities

Online communities form a fundamental part of the web today, and a large portion of the Internet’s traffic is driven by and through them. These communities are where the majority of web users share content, seek support, and socialize. Of particular relevance is identifying the behavioral patterns or roles that emerge from the community (e.g. experts, leaders, ignored users), as well as assessing the distribution of users assuming different roles. There is currently no standard or agreed list of behavior types for describing the activities of users in online communities. In this paper, we consider the activity of a user, the user’s network interaction structure and the content of the user’s postings, and employ Machine Learning techniques to identify, cluster and infer the roles of users exhibiting similar behavior.


2 Motivation and Related Work

Earlier work on user behavior modeling on financial message boards has primarily dealt with predicting the market status of a stock. In this paper, we aim to study user behavior on financial message boards from a sociological point of view. Previous research [1] demonstrated Participation Inequality by studying users of four Digital Health Social Networks (DHSNs): the AlcoholHelpCenter, DepressionCenter, PanicCenter, and StopSmokingCenter sites. It is interesting to ask whether a similar pattern exists in communities from a different domain. Hence we analyze user behavior on financial message boards: we identify, cluster and label user roles, and study the interactions among the resulting groups. From a Machine Learning perspective, the problem is one of unsupervised clustering.

3 Dataset

3.1 Investors Hub and data retrieval

Investors Hub (“iHub”) is an online forum where investors gather and share market insights in a dynamic environment using a discussion platform. Investors Hub has been online for over 15 years and currently has 549,380 members who have posted 118,971,723 messages on 25,091 boards. It hosts message boards broadly on the topics of the stock market, commodities and foreign exchange. It offers both premium (paid) and free forums. We focus on the Stock Market Message Boards, which cover US-listed and Canadian-listed stocks, and analyze the free message boards for US-listed stocks. We crawled this data from the website, gathering information from a total of 6278 boards and 53,491 users with 5.6 million posts.

Figure 1: Volume of posts across boards

Figure 2: User activity across message boards

Figure 3: Activity of users on exact same boards

Figure 4: Number of posts initiated by each user

Figure 5: Number of replies made by each user

Figure 6: Number of replies each user receives

Figure 7: Number of replies across Message Boards

Figure 8: Average User response time

3.2 Initial data analysis

In this section, we present the results of a statistical analysis of the gathered data. We find that many message boards are inactive and hardly record any conversation. Out of all the boards, only about 1600 have more than 100 messages, and these boards account for 97% of active users (users who have commented at least once on any board). Figure 1 depicts the volume of posts on each message board. We then analyze the activity of users across all message boards. We find that 57% of the users are active on only one message board, and that 53.1% of users are active on the exact same message boards. Figures 2 and 3 show the activity of users across all message boards. We further analyze the activity of users in terms of the number of posts they initiate and the number of replies they make. We find that 19.3% of the users have never initiated a post and 21.7% of the users have never replied to another user’s post. Figures 4 and 5 show the distributions of posts initiated and replies made by each user. We also observe that 33.6% of the users do not get any replies to the posts they initiate. Figure 6 shows the distribution of the number of replies each user receives. We also find that 18.4% of the message boards do not receive any reply posts. Figure 7 shows the distribution of replies across all boards. Finally, we analyze the average time a user takes to reply to another user’s post. Figure 8 shows the distribution of average response times over all users.

#   Feature                    Description
1   Initiation Rate            Number of threads a user initiated over time
2   Reply Rate                 Number of replies a user makes over time
3   Out-Degree                 Number of users a user replies to
4   In-Degree                  Number of users who reply to a user
5   Activity Across Boards     Number of boards the user is active on / total number of boards
6   Followers                  Number of followers
7   Replier Share              AVG[proportion of replies a user gets on a board]
8   Reply Share                AVG[proportion of replies a user makes on a board]
9   Response Time              Average time to respond to a reply
10  Volume of Posted Content   Average length of post
11  Links in Posts             Number of links the user has posted

Table 1: Features
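Percentages such as those reported in Section 3.2 can be derived from a flat list of post records. The sketch below assumes a hypothetical record layout of (author, board, parent author), where the parent author is `None` for a post that starts a thread; the layout, the function name `activity_summary` and the sample records are illustrative assumptions, not the crawler's actual schema.

```python
from collections import defaultdict

# Hypothetical records: (author, board, parent_author).
posts = [
    ("ann", "AAPL", None),
    ("bob", "AAPL", "ann"),
    ("ann", "TSLA", None),
    ("cat", "TSLA", "ann"),
    ("bob", "AAPL", "ann"),
    ("dan", "GE", None),
]


def activity_summary(posts):
    """Share of users on a single board, never initiating, never replying."""
    boards = defaultdict(set)
    initiated = defaultdict(int)
    replied = defaultdict(int)
    for author, board, parent in posts:
        boards[author].add(board)
        if parent is None:
            initiated[author] += 1
        else:
            replied[author] += 1
    users = list(boards)
    n = len(users)
    return {
        "single_board": sum(len(boards[u]) == 1 for u in users) / n,
        "never_initiated": sum(initiated[u] == 0 for u in users) / n,
        "never_replied": sum(replied[u] == 0 for u in users) / n,
    }


summary = activity_summary(posts)
print(summary)
```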

4 Behavior Modeling and Approach

4.1 Feature selection

Drawing on the existing literature on user roles in social networks [3], we consider the activity of the user, the egocentric network structure of the user and the content of the user’s postings as features for our clustering task. Features 3 and 4 represent the “Reply to” and “Replies to” network structure. Features 10 and 11 represent the content or quality of a user’s posts. The rest of the features represent the activity of the user on the message boards. Not all of the features listed in Table 1 are of equal importance in the clustering task. Features can be redundant and inter-correlated, which could lead to erroneous clustering results. Because no ground-truth data is available, there is no reference against which to identify the most significant features. We therefore employ Principal Component Analysis (PCA) for this task.
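As an illustration of the two network features (Out-Degree and In-Degree in Table 1), the following Python sketch counts distinct reply partners from a list of (replier, original poster) pairs. The function name `reply_degrees`, the input format and the decision to ignore self-replies are assumptions made for this example.

```python
from collections import defaultdict


def reply_degrees(replies):
    """Map each user to (out_degree, in_degree) over distinct reply partners."""
    replies_to = defaultdict(set)   # users that a user has replied to
    replied_by = defaultdict(set)   # users who have replied to a user
    for replier, poster in replies:
        if replier == poster:
            continue  # assumption: self-replies are ignored
        replies_to[replier].add(poster)
        replied_by[poster].add(replier)
    users = set(replies_to) | set(replied_by)
    return {u: (len(replies_to[u]), len(replied_by[u])) for u in users}


# Hypothetical reply pairs; a repeated pair counts only once.
degrees = reply_degrees([
    ("bob", "ann"),
    ("cat", "ann"),
    ("bob", "ann"),
    ("ann", "bob"),
])
print(degrees)
```

Using sets rather than counters means the degrees measure how many different users someone interacts with, not how many replies were exchanged.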

4.2 Min-Max Normalization

To operate on features that fall in different ranges, it is important to normalize all feature values to a fixed range. Since we use the Euclidean distance as the distance metric, we normalize the data using Min-Max Normalization to fit the range [0, 1].
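A minimal NumPy sketch of column-wise Min-Max Normalization follows. The function name `min_max_normalize`, the toy matrix and the handling of constant columns (mapped to 0) are assumptions made for illustration.

```python
import numpy as np


def min_max_normalize(X):
    """Rescale every column of X linearly into the range [0, 1]."""
    X = np.asarray(X, dtype=float)
    lo = X.min(axis=0)
    span = X.max(axis=0) - lo
    # Assumption: a constant column carries no information, so it maps to 0.
    span[span == 0] = 1.0
    return (X - lo) / span


# Three hypothetical users with features on very different scales.
raw = [[1, 100, 5],
       [2, 300, 5],
       [3, 200, 5]]
normalized = min_max_normalize(raw)
print(normalized)
```

Without this step, the feature with the largest numeric range would dominate the Euclidean distance.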

4.3 Principal Component Analysis

PCA is a data reduction technique in which possibly correlated features are transformed into a smaller number of factors called principal components. We use the scree plot to determine the number of principal components to consider. The scree plot graphs the eigenvalue against the component number. To determine the appropriate number of components, we look for an “elbow” in the scree plot. Figure 9 shows the scree plot we obtained. The component number is taken to be the point at which the remaining eigenvalues are relatively small and all about the same size.


For our dataset, we find that the first 5 components account for 95% of the variance in the data. Thus we build a new set of features from the original feature set using the first 5 principal components.
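The eigenvalues that a scree plot displays, and the number of components needed to reach a given share of variance, can be computed as in the NumPy sketch below. The function name `pca_fit`, the 95% default and the synthetic data are assumptions for illustration; the data is a stand-in and not the iHub feature matrix.

```python
import numpy as np


def pca_fit(X, variance_target=0.95):
    """Return (eigenvalues, n_components, projected) for the rows of X."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                  # center each feature
    # SVD of the centered data: the rows of Vt are the principal axes.
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    eigenvalues = S ** 2 / (len(X) - 1)      # variance along each axis
    cumulative = np.cumsum(eigenvalues) / eigenvalues.sum()
    # Smallest number of components whose cumulative share reaches the target.
    n_components = min(int(np.searchsorted(cumulative, variance_target)) + 1, len(S))
    projected = Xc @ Vt[:n_components].T
    return eigenvalues, n_components, projected


# Synthetic stand-in: 4 observed features driven by 2 latent factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = np.array([[1.0, 0.0, 1.0, 0.0],
                   [0.0, 1.0, 0.0, 1.0]])
X = latent @ mixing + 0.01 * rng.normal(size=(200, 4))

eigenvalues, n_components, projected = pca_fit(X)
print("eigenvalues (scree plot heights):", np.round(eigenvalues, 4))
print("components kept:", n_components, "projected shape:", projected.shape)
```

Plotting `eigenvalues` against the component index gives the scree plot; the sharp drop after the leading components is the elbow.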

Figure 9: Scree Plot

Figure 10: Silhouette Coefficient Plot

Figure 11: Elbow Plot

Figure 12: Feature Importance

4.4 Unsupervised Clustering

4.4.1 Determining K in the K-means

One way to select K for the K-means algorithm is to try different values of K, plot the K-means objective versus K, and look for the “elbow point” in the plot. By plotting the Within Group Sum of Squares against the number of clusters, we can visually examine the best point to choose for the number of clusters. Initially, adding clusters adds much information, but at some point the marginal gain drops, giving an angle in the graph that indicates the number of clusters we should aim for. We narrow the choice down to K = 4 and K = 5, as they look promising from the plot.

The Silhouette Coefficient is a measure of how close each point in one cluster is to points in the neighboring clusters, and thus provides a way to assess parameters such as the number of clusters visually. This measure has a range of [-1, 1]. Silhouette coefficients near +1 indicate that the sample is far away from the neighboring clusters. A value of 0 indicates that the sample is on or very close to the decision boundary between two neighboring clusters, and negative values indicate that the sample might have been assigned to the wrong cluster. For each datum i, let a(i) be the average dissimilarity of i with all other data within the same cluster, and let b(i) be the lowest average dissimilarity of i to any other cluster of which i is not a member. We can interpret a(i) as how well i is assigned to its cluster (the smaller the value, the better the assignment). The Silhouette Coefficient is given by:

\[
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}
\]

Based on our observations from the elbow plot and the Silhouette Coefficients, we choose K = 4. Figure 11 shows the elbow plot and Figure 10 shows the Silhouette Coefficient plot for our dataset.
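The Silhouette Coefficient can be computed directly from pairwise distances. The NumPy sketch below uses the Euclidean distance as the dissimilarity and sets the coefficient of a singleton cluster to 0, a common convention; the function name `silhouette_values` and the four toy points are assumptions for illustration.

```python
import numpy as np


def silhouette_values(X, labels):
    """Per-sample silhouette coefficients for at least two clusters."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all samples.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = np.unique(labels)
    s = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # convention: 0 for a cluster with a single member
        # a(i): mean distance to the other members of the same cluster.
        a = dist[i, own].sum() / (n_own - 1)
        # b(i): lowest mean distance to the members of any other cluster.
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        denom = max(a, b)
        s[i] = (b - a) / denom if denom > 0 else 0.0
    return s


# Two compact, well-separated pairs of points.
points = [[0, 0], [0, 1], [10, 0], [10, 1]]
good = silhouette_values(points, [0, 0, 1, 1])
bad = silhouette_values(points, [0, 1, 0, 1])
print("mean silhouette, sensible labels:", good.mean())
print("mean silhouette, poor labels:    ", bad.mean())
```

Averaging the per-sample values for each candidate K gives a single score per K, which can then be compared alongside the elbow plot.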

4.4.2 Cluster validation

In this step, we use the four clusters obtained as a labeled ground-truth dataset to train a Random Forest ensemble, and obtain the relative importance of each of the 12 features in the classification task. We try Random Forests with 5, 10, 20 and 50 trees and obtain the maximum cross-validation score with 10 trees. Figure 12 shows the obtained feature importance values. We do not consider features with less than 1% importance, and are therefore left with only 9 features. The dropped features include the content volume and the average number of hyperlinks posted. Next, we cluster the users on the selected important features using the K-means algorithm with K = 4. K-means depends on the initial centroid seeds that are chosen, and converges when the cluster assignments no longer change. We use 300 iterations for a single run. The algorithm aims to minimize the Within Cluster Sum of Squares and might not converge to the global optimum, so we run it with 10 different centroid seeds and keep the best result. Comparing the clusters now obtained with the clusters obtained from the PCA projection, the clusters are of similar size and have 99.99% overlap among their members. The clustering we chose has a silhouette coefficient of 0.789, an average inter-cluster distance of 2.51 and an average intra-cluster distance of 1.01.
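The restart strategy described above can be sketched in NumPy as follows: each run starts from different randomly chosen seeds, stops when the assignments no longer change or after 300 iterations, and the run with the lowest Within Cluster Sum of Squares is kept. This is an illustrative implementation of Lloyd's algorithm under those assumptions, not the code used in the study; the function names and the toy data are hypothetical.

```python
import numpy as np


def kmeans_once(X, k, rng, max_iter=300):
    """One K-means run from random seeds; returns (labels, centroids, wcss)."""
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # copies k rows
    labels = None
    for _ in range(max_iter):
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dist.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments stopped changing: converged
        labels = new_labels
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    wcss = float(((X - centroids[labels]) ** 2).sum())
    return labels, centroids, wcss


def kmeans_best_of(X, k, n_init=10, seed=0):
    """Run K-means n_init times and keep the run with the lowest WCSS."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    runs = [kmeans_once(X, k, rng) for _ in range(n_init)]
    return min(runs, key=lambda run: run[2])


# Toy one-dimensional data with two obvious groups.
points = [[0.0], [1.0], [2.0], [100.0], [101.0], [102.0]]
labels, centroids, wcss = kmeans_best_of(points, k=2)
print(labels, centroids.ravel(), wcss)
```

Recording `wcss` for a range of K values gives the data for an elbow plot.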


Cluster     Share of users
Cluster 1   91.73%
Cluster 2   6.44%
Cluster 3   1.13%
Cluster 4   0.73%

Table 2: User distribution in each cluster

Figure 13: Comparison of cluster composition and the content contribution

Figure 14: Interaction among users across clusters

Figure 15: Out-degree feature distribution across clusters

Figure 16: In-degree feature distribution across clusters

5 Methodology

In this section, we summarize the methodology of our implementation. First, considering three aspects (the activity of a user, the user’s network interaction structure and the content of the user’s postings), we extract 12 features. To find the most significant of these features, we perform PCA and use the first 5 principal components, as determined by the scree plot. We use the data projected in this step as the set to be clustered. We construct 4 clusters using the K-means clustering technique; K = 4 is selected by considering the Silhouette Coefficients and the elbow plot. The aim is then to obtain the important features from our set of 11 features. We use a Random Forest ensemble to obtain the feature importances, using the labeled data from the previous step as input. Nine of the input features are found to have a relative importance greater than 1%, and they are subjected to K-means clustering, with the Euclidean distance as the distance metric, to obtain the final groups of users.
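The pipeline summarized above can be sketched end to end with scikit-learn. The use of scikit-learn, the function name `cluster_users` and the synthetic input are assumptions made for illustration; the source does not state which library was used, and the sketch fixes the forest at 10 trees rather than repeating the cross-validated search over tree counts.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MinMaxScaler


def cluster_users(features, n_clusters=4, n_components=5,
                  min_importance=0.01, seed=0):
    """Normalize, project, cluster, rank features, then re-cluster."""
    scaled = MinMaxScaler().fit_transform(features)
    projected = PCA(n_components=n_components).fit_transform(scaled)

    # First clustering on the PCA projection supplies provisional labels.
    first = KMeans(n_clusters=n_clusters, n_init=10, max_iter=300,
                   random_state=seed).fit_predict(projected)

    # A Random Forest trained on those labels ranks the original features.
    forest = RandomForestClassifier(n_estimators=10, random_state=seed)
    forest.fit(scaled, first)
    keep = forest.feature_importances_ >= min_importance

    # Final clustering on the retained features only.
    final = KMeans(n_clusters=n_clusters, n_init=10, max_iter=300,
                   random_state=seed).fit_predict(scaled[:, keep])
    return final, keep


# Synthetic stand-in for the user-feature matrix: 4 groups, 11 features.
rng = np.random.default_rng(0)
centers = rng.uniform(0, 10, size=(4, 11))
features = np.vstack([c + rng.normal(scale=0.3, size=(50, 11)) for c in centers])

final_labels, kept = cluster_users(features)
print("cluster sizes:", np.bincount(final_labels))
print("features kept:", int(kept.sum()), "of", kept.size)
```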

5.0.1 Role Inference

The final step is role inference. Table 2 shows the size of each cluster obtained. For each feature, thresholds are determined that divide its values into three levels: Low, Medium and High. Sociology attributes different characteristics to different user roles. Figures 15 and 16 show the out-degree and in-degree feature value distributions, respectively, in the four clusters. Among the four clusters obtained, we find one whose behavior is marked by low initiation, low engagement and low activity, and assign the label “Lurkers” to this cluster. The second cluster exhibits medium initiation, medium reply share and medium engagement; we label it “Contributors”. There is an interesting third group distinguished by a high replier and reply rate but a low post initiation value; we label this cluster “Debaters”. The last cluster is characterized by high initiation, high user engagement and high interaction; we assign the label “Super Users” to this cluster.
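The labeling rules above can be expressed as a small lookup over binned cluster centroids. In the Python sketch below, the cut-offs of 1/3 and 2/3 on normalized values, the three summary keys (`initiation`, `reply`, `engagement`) and the exact rule order are hypothetical choices made for illustration; the thresholds actually used are not given in the text.

```python
def level(value, low_cut=1 / 3, high_cut=2 / 3):
    """Bin a normalized feature value into 'low', 'medium' or 'high'."""
    if value < low_cut:
        return "low"
    if value < high_cut:
        return "medium"
    return "high"


def infer_role(centroid):
    """Assign a role label to a cluster centroid (dict of values in [0, 1])."""
    init = level(centroid["initiation"])
    reply = level(centroid["reply"])
    engage = level(centroid["engagement"])
    if init == "high" and engage == "high":
        return "Super Users"
    if init == "low" and reply == "high":
        return "Debaters"
    if init == "low" and reply == "low" and engage == "low":
        return "Lurkers"
    if init == reply == engage == "medium":
        return "Contributors"
    return "Unlabeled"  # assumption: anything else is left for manual review


# Hypothetical centroids, one per cluster.
centroids = [
    {"initiation": 0.05, "reply": 0.10, "engagement": 0.02},
    {"initiation": 0.50, "reply": 0.50, "engagement": 0.50},
    {"initiation": 0.10, "reply": 0.90, "engagement": 0.60},
    {"initiation": 0.90, "reply": 0.80, "engagement": 0.95},
]
roles = [infer_role(c) for c in centroids]
print(roles)
```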

6 Results

We obtain four significant user roles from our dataset: Lurkers, Contributors, Debaters and Super Users. We find that 91.73% of the users in the dataset are Lurkers. This observation is in accordance with the theory of participation inequality, and thus supports the 1% rule and the 90–9–1 principle. We also observe that 72% of the posts in our dataset are made by the Contributors and Super Users, though they constitute only 7.14% of our user base. Figure 13 illustrates the disparity between the number of users in each cluster and the volume of content the users in that cluster generate. Figure 14 depicts the interaction (via replies) of users across clusters. Interestingly, the users in Cluster 3, which we label as Debaters, interact mostly with users of the same cluster.

7 Conclusion

In this paper we have presented one approach to labeling the behavior of users of online communities. We presented a method to capture the behavioral characteristics of users as numeric attributes and explained how Machine Learning can be employed to infer the role that a given user plays. A key contribution of this paper is that our results support the theory of Participation Inequality on financial message boards.

References

[1] Van Mierlo, T. The 1% Rule in Four Digital Health Social Networks, in J Med Internet Res, 16(2):e33, 2014.

[2] Matthew Rowe, Miriam Fernandez and Harith Alani. ‘Modelling and analysis of user behaviour in online communities’.

[3] Mathilde Forestier, Anna Stavrianou, Julien Velcin, and Djamel A. Zighed. ‘Roles in social networks: methodologies and research issues’.
