[ieee 2012 international conference on advances in social networks analysis and mining (asonam 2012)...



Page 1: [IEEE 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012) - Istanbul (2012.08.26-2012.08.29)] 2012 IEEE/ACM International Conference on Advances

User interests modeling in online forums

Na Ni and Yaodong Li

Institute of Automation, Chinese Academy of Sciences

95 Zhongguancun East Road, Beijing, China

Email: {na.ni, yaodong.li}@ia.ac.cn

Abstract: This paper studies the problem of user modeling in online forums from a personality viewpoint. A novel hierarchical user profiling mechanism is proposed, which utilizes the user-generated content, the reply relations among users and the topics of the discussions. The hierarchical model represents the users’ interests across different topics. The obtained user profiles are applied to three forum-related tasks: new discussions recommendation, external news articles recommendation and user retrieval. The experimental results show that, compared with traditional methods, the hierarchical user profiling approach achieves better performance in all three tasks.

Keywords-online forum; user modeling; hierarchical model

I. INTRODUCTION

The online discussion board, or forum, is a type of social media promoted by Web 2.0, where people gather to discuss in depth a specific topic they are interested in. In forums like Digg¹, people participate in discussions to discover valuable information or to share their opinions and knowledge with others. In this type of forum, it is useful to find out what the users are interested in.

Existing work on user modeling can be classified into content-based methods and collaborative filtering methods. These methods have been used to facilitate personalized search or recommendation in social media such as newsgroups, blogs and microblogs. But the online forum has its own characteristics: its content is usually noisier, and the relations among its users are not as tight as those in blogs or microblogs. Meanwhile, existing work on user modeling for discussions is restricted to specific tasks such as thread recommendation.

In online forums, a user’s interests are reflected in the content he generates, the users he has exchanged opinions with and the topics of the discussions he has participated in. Using this information, a hierarchical user profiling approach is proposed to model how a user’s interests differ across topics. The model contains two layers: a cross-domain layer and an inner-domain layer. The cross-domain layer describes a user’s interest across different domains, and the inner-domain layer describes a user’s interest within a specific domain. To evaluate the effectiveness of this approach, the hierarchical user profiles are applied to three forum-related tasks: new discussions recommendation, external news articles recommendation and user retrieval. The framework of this study is shown in Figure 1.

¹ http://digg.com/

Fig. 1: User modeling framework

II. METHODOLOGY

As shown in Figure 1, the hierarchical user model proposed in this paper has two layers. In the cross-domain layer, a user’s model is viewed as a distribution across the domains that he is interested in, which can be represented as $P_C(u) = \{(c_i, p(c_i|u)) \mid c_i \in C\}$. Applying Bayes’ rule:

$$p(c_j|u) = \frac{p(u|c_j)\,p(c_j)}{p(u)} \qquad (1)$$

where $p(c_j)$ is the probability of cluster $c_j$ in the training set.
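As a minimal sketch of Eq. (1), the cross-domain profile of one user can be computed from per-domain likelihoods and domain priors. The function name, the dictionary layout and the fallback that normalises over domains when $p(u)$ is not supplied are assumptions made for illustration; the paper itself obtains $p(u)$ from PageRank on the whole collection.

```python
def cross_domain_profile(p_u_given_c, p_c, p_u=None):
    """Return {domain: p(c|u)} following Eq. (1).

    p_u_given_c: {domain: p(u|c)} for a single user
    p_c:         {domain: p(c)} domain priors from the training set
    p_u:         p(u); if None, normalise over domains instead (assumption)
    """
    joint = {c: p_u_given_c.get(c, 0.0) * p_c[c] for c in p_c}
    evidence = p_u if p_u is not None else sum(joint.values())
    if evidence == 0:
        return {c: 0.0 for c in p_c}
    return {c: value / evidence for c, value in joint.items()}


# Hypothetical numbers, chosen only to exercise the function.
profile = cross_domain_profile(
    {"tech": 0.02, "sports": 0.01},
    {"tech": 0.5, "sports": 0.5},
)
print(profile)
```

With the fallback normalisation the returned values sum to one, which makes profiles of different users directly comparable.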

We estimate the probability $p(u|c_j)$ of a user appearing in a given domain from the directed relations among users in that domain. A user can reply to (RT) or be replied to by (RB) other users in a discussion. Using these directed relations, we build two directed graphs, $G_{rt}$ and $G_{rb}$, over the users of a given domain. In these graphs, each user corresponds to a vertex, and the weight of the directed edge $G_{rt}(i, j)$ or $G_{rb}(i, j)$ denotes the number of times that user $i$ has replied to user $j$, or has been replied to by user $j$, in the target domain. In graph $G_{rt}$, a user has a higher probability in the cluster if he has replied to more users, while in graph $G_{rb}$ he has a higher probability if he has been replied to by more users. The PageRank algorithm [1] is applied to the directed graphs $G_{rt}$ and $G_{rb}$ to compute the users’ probability $p(u|c_j)$ in a given domain:

$$p(u|c_j) = \frac{\delta}{N} + (1-\delta) \sum_{u' \in U_{c_j}} \mathrm{rank}(u', c_j)\, G(u, u') \qquad (2)$$

in which $\delta$ is a damping factor, $U_{c_j}$ is the set of users in domain $c_j$, and $N$ is the number of users in $U_{c_j}$. Analogously, the probability $p(u)$ of a candidate user appearing in the entire collection can also be calculated by running PageRank on the directed graph built from the reply relations among users in the entire collection.
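The iteration behind Eq. (2) can be sketched as a plain power iteration over reply counts. Several details here are assumptions, since the paper does not state them: edge weights are normalised per source user $u'$ so that the ranks stay a probability distribution, rank held by users with no incoming weight is spread uniformly, and $\delta = 0.15$ is only a placeholder value. The function name `reply_pagerank` is hypothetical.

```python
def reply_pagerank(counts, delta=0.15, iters=100, tol=1e-10):
    """Power iteration for Eq. (2) on the reply graph G_rt.

    counts[(u, v)] is the number of times user u has replied to user v.
    Rank flows from v to u, so users who reply to many others score higher.
    """
    users = sorted({x for edge in counts for x in edge})
    n = len(users)
    # Total weight attached to each u' over all edges G(u, u').
    col_total = {v: 0.0 for v in users}
    for (u, v), w in counts.items():
        col_total[v] += w
    rank = {u: 1.0 / n for u in users}
    for _ in range(iters):
        new = {u: delta / n for u in users}
        # Rank held by users nobody replied to (assumption: spread uniformly).
        dangling = sum(rank[v] for v in users if col_total[v] == 0)
        for (u, v), w in counts.items():
            new[u] += (1 - delta) * rank[v] * w / col_total[v]
        for u in users:
            new[u] += (1 - delta) * dangling / n
        diff = sum(abs(new[u] - rank[u]) for u in users)
        rank = new
        if diff < tol:
            break
    return rank


# Hypothetical reply counts among three users in one domain.
ranks = reply_pagerank({("a", "b"): 3, ("a", "c"): 1, ("b", "c"): 2})
print(ranks)
```

Running the same function on the transposed counts would give the $G_{rb}$ variant, and running it on counts from the whole collection would give an estimate of $p(u)$.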

In the inner-domain layer, the profile $P_{CW}(u|c_j) = \{(w_i, p(w_i|\theta_u, c_j)) \mid w_i \in V\}$ of a user $u$ in domain $c_j$ is calculated from the content of the posts that he has published within that domain. In this paper, the value of $p(w_i|\theta_u, c_j)$ is estimated using a language model.
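The paper does not say which language model or smoothing scheme is used. One common choice, shown here purely as an assumed illustration, is a unigram model with Jelinek-Mercer smoothing against the word distribution of the whole domain; the whitespace tokenisation, the function name and the value of `lam` are likewise hypothetical.

```python
from collections import Counter


def user_language_model(user_posts, domain_posts, lam=0.5):
    """Smoothed unigram estimate of p(w | theta_u, c_j).

    user_posts:   posts written by the user in the domain
    domain_posts: all posts in the domain (should include user_posts)
    lam:          weight of the user's own counts (assumed value)
    """
    user_tf = Counter(w for post in user_posts for w in post.lower().split())
    bg_tf = Counter(w for post in domain_posts for w in post.lower().split())
    user_len = sum(user_tf.values())
    bg_len = sum(bg_tf.values())
    if bg_len == 0:
        return {}
    model = {}
    for w in bg_tf:
        p_user = user_tf[w] / user_len if user_len else 0.0
        model[w] = lam * p_user + (1 - lam) * bg_tf[w] / bg_len
    return model


# Hypothetical posts for a single domain.
domain_posts = ["apple releases new phone", "new phone review", "stock market falls"]
inner_profile = user_language_model(["new phone review"], domain_posts)
print(inner_profile["phone"], inner_profile["stock"])
```

Smoothing keeps every vocabulary word at a non-zero probability, which matters when the profile is later used to score unseen posts.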

2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining

978-0-7695-4799-2/12 $26.00 © 2012 IEEE

DOI 10.1109/ASONAM.2012.122


III. APPLICATIONS AND EXPERIMENTS

A. Data set

We collected 13,872 discussions involving 64,374 users from the online discussion forum Digg between November and December 2011. The discussions have been manually classified into domains, which are used directly in building the hierarchical user model. To verify the effectiveness of our method, we also implemented several existing user modeling methods: the content-based method (“CON”); relation-based methods built on the “reply to”, “reply by” and “co-occurrence” relations between two users (denoted “CF-RT”, “CF-RB” and “CF-CO”, respectively); and the method proposed in [2] (“CON+C+G”).

B. New discussions recommendation

When the first post $d_r$ of a new discussion is given, we classify the new discussion into a domain $c_d$ of the training set according to $d_r$ and compute a user’s interest in it as $p(d_r, c_d|u) = p(d_r|c_d, u)\,p(c_d|u)$. When the author of a new discussion is given, the discussion is recommended to other users via the relation-based user model. Mean Reciprocal Rank (MRR) and Precision@N are used as the evaluation metrics. The performance of our methods and of the baseline approaches is shown in Table I. Both hierarchical approaches (“H-RT” and “H-RB”) work better than all the other methods adopted in this task: “CON”, “CON+C+G” and the relation-based user models.
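The score $p(d_r, c_d|u)$ can be sketched in log space as follows. How $p(d_r|c_d, u)$ is computed is not spelled out in the paper; the unigram likelihood under the user’s inner-domain profile, the probability floor for unseen words and all names below are assumptions for illustration.

```python
import math


def discussion_score(first_post, domain, inner_profiles, cross_profile, floor=1e-9):
    """Log of p(d_r | c_d, u) * p(c_d | u) for one user.

    inner_profiles: {domain: {word: p(w | theta_u, c_d)}} for the user
    cross_profile:  {domain: p(c_d | u)} for the user
    floor:          probability given to unseen words (assumed value)
    """
    p_domain = cross_profile.get(domain, 0.0)
    if p_domain == 0.0:
        return float("-inf")
    lm = inner_profiles.get(domain, {})
    log_lik = sum(math.log(lm.get(w, floor)) for w in first_post.lower().split())
    return log_lik + math.log(p_domain)


# Two hypothetical users sharing a word profile but differing in domain interest.
inner = {"tech": {"phone": 0.4, "review": 0.3, "new": 0.3}}
score_a = discussion_score("new phone", "tech", inner, {"tech": 0.8, "sports": 0.2})
score_b = discussion_score("new phone", "tech", inner, {"tech": 0.1, "sports": 0.9})
print(score_a, score_b)
```

Ranking all users by this score for a new discussion yields the recommendation list that the metrics are then computed on.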

TABLE I: New discussions recommendation results

Method     P@1     P@5     P@10    P@20    P@30    MRR
CON        0.1048  0.1036  0.0935  0.0801  0.0709  0.2364
CON+C+G    0.1036  0.1099  0.0842  0.0811  0.0691  0.2367
CF-RT      0.0248  0.0288  0.0318  0.0313  0.0300  0.0803
CF-RB      0.0225  0.0284  0.0306  0.0322  0.0300  0.0795
CF-CO      0.0248  0.0297  0.0311  0.0293  0.0276  0.0779
H-RB       0.1149  0.1153  0.1011  0.0827  0.0723  0.2500
H-RT       0.1214  0.1162  0.1158  0.1066  0.0908  0.2655
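For reference, the two metrics reported in Table I can be computed as in the sketch below. These are the standard definitions of Precision@N and Mean Reciprocal Rank; the function names and the toy rankings are illustrative and not taken from the paper.

```python
def precision_at_n(ranked, relevant, n):
    """Fraction of the top-n ranked items that are relevant."""
    return sum(1 for item in ranked[:n] if item in relevant) / n


def mean_reciprocal_rank(runs):
    """Mean of 1/rank of the first relevant item over all (ranked, relevant) runs.

    A run with no relevant item in its ranking contributes 0.
    """
    total = 0.0
    for ranked, relevant in runs:
        for pos, item in enumerate(ranked, start=1):
            if item in relevant:
                total += 1.0 / pos
                break
    return total / len(runs)


# Toy rankings: the first hit is at rank 2 in one run and rank 1 in the other.
p_at_2 = precision_at_n(["a", "b", "c", "d"], {"b", "d"}, 2)
mrr = mean_reciprocal_rank([(["a", "b"], {"b"}), (["x", "y"], {"x"})])
print(p_at_2, mrr)
```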

C. News articles recommendation

The external news articles are first classified into the existing discussion domains using an SVM. Given a user $u$, the candidate articles are ranked according to the value $\cos(p(w|d_{new}, c_d), p(w|\theta_u)) \cdot p(c_d|u)$. In our experiment, for the users in the training set, we collect the source articles of the discussions they have dug but not participated in. These news articles are used as the ground truth. About 4 articles are downloaded for each of the 1,152 users, giving a recommendation list of 4,580 news articles. Table II shows the results obtained by our method “H-RT” and the baseline approaches “CON” and “CON+C+G”.
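The ranking value above can be sketched as a cosine similarity between two word distributions, weighted by the user’s interest in the article’s domain. The tuple layout of `articles`, the function names and the sample numbers are assumptions; the SVM classification step is taken as already done and appears only as the `domain` field.

```python
import math


def cosine(p, q):
    """Cosine similarity between two sparse word distributions."""
    dot = sum(p[w] * q[w] for w in p.keys() & q.keys())
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0


def rank_articles(articles, user_lm, cross_profile):
    """Order article ids by cos(p(w|d_new, c_d), p(w|theta_u)) * p(c_d|u).

    articles: list of (article_id, domain, word_distribution)
    """
    scored = [
        (cosine(dist, user_lm) * cross_profile.get(domain, 0.0), article_id)
        for article_id, domain, dist in articles
    ]
    return [article_id for _, article_id in sorted(scored, key=lambda t: t[0], reverse=True)]


# Hypothetical candidate articles and user profile.
articles = [
    ("n2", "sports", {"match": 1.0}),
    ("n1", "tech", {"phone": 0.5, "review": 0.5}),
]
ranking = rank_articles(articles, {"phone": 0.6, "review": 0.4}, {"tech": 0.7, "sports": 0.3})
print(ranking)
```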

TABLE II: News articles recommendation results

Method     P@1     P@5     P@10    P@20    P@30    MRR
CON        0.2483  0.1100  0.0733  0.0500  0.0396  0.3301
CON+C+G    0.2440  0.1083  0.0731  0.0498  0.0403  0.3245
H-RT       0.2683  0.1161  0.0781  0.0514  0.0414  0.3475

D. User retrieval

We use the titles of the discussions in the training set to generate 30 queries for the user retrieval task. The average query length is 5.67 words, and the average number of relevant users per query is 14.83. When a query $q$ is given together with a target domain $c_q$ for the results, the domain of the discussion title that produced the query is taken as the target domain. In this task, the evaluation metrics are Mean Average Precision (MAP) and Precision@N. The results of user retrieval using different user models are shown in Table III. The table shows that, when the target domain is assigned, the hierarchical model improves P@1, P@5, P@10 and MAP, while P@20 and P@30 remain close to those of the baselines.

TABLE III: User retrieval results

Method     P@1     P@5     P@10    P@20    P@30    MAP
CON        0.3000  0.2733  0.2500  0.1850  0.1544  0.4186
CON+C+G    0.3333  0.2800  0.2567  0.1883  0.1622  0.4334
H-RT       0.3667  0.3067  0.2677  0.1833  0.1611  0.4462
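The MAP column in Table III follows the usual definition: average precision is computed per query over the ranked users and then averaged across queries. The sketch below uses that standard definition; the function names and the toy ranking are illustrative only.

```python
def average_precision(ranked, relevant):
    """Mean of precision values at each rank where a relevant user appears."""
    hits, total = 0, 0.0
    for pos, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / pos
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Average of average_precision over all (ranked, relevant) query runs."""
    return sum(average_precision(ranked, relevant) for ranked, relevant in runs) / len(runs)


# Toy query: relevant users appear at ranks 1 and 3.
ap = average_precision(["u1", "u2", "u3", "u4"], {"u1", "u3"})
print(ap)
```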

The experimental results show that: (1) In the task of new discussions recommendation, the relation-based models do not achieve good performance. This suggests that the relations among users in online forums are formed temporarily by discussions and are not as tight as those in other social media. (2) The hierarchical user profile proposed in this paper achieves the best MRR or MAP in all three tasks. This suggests that a user’s interests usually span more than one domain, so modeling a user hierarchically is reasonable and effective.

IV. CONCLUSION

A novel hierarchical user modeling approach is proposed to model the interests of users in online forums. The method is applied to three tasks related to online forums: new discussions recommendation, external news articles recommendation and user retrieval. The experimental results demonstrate that users’ interests in online forums can be better modeled by utilizing information such as the user-generated content, the reply relationships among users and the domains that the discussions belong to. In particular, the proposed hierarchical user modeling approach outperforms traditional methods in all three tasks. The influence of temporal information on users’ profiles needs to be considered in future work.

ACKNOWLEDGMENT

This work is sponsored by NSFC (under grant 61072084).

REFERENCES

[1] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford InfoLab, 1999.

[2] G.-R. Xue, J. Han, Y. Yu, and Q. Yang. User language model for collaborative personalized search. ACM Trans. Inf. Syst., 27:11:1–11:28, March 2009.
