[ieee 2012 international conference on advances in social networks analysis and mining (asonam 2012)...

Extracting Celebrities from Online Discussions

Mathilde Forestier, Julien Velcin

Eric Laboratory, University of Lyon

5, avenue Pierre Mendes France

69676 Bron Cedex, France

Phone: 00 33 4 78 77 31 54

[email protected]

Anna Stavrianou

Xerox Research Centre Europe

6 chemin de Maupertuis,

38240 Meylan, France

Phone: 00 33 4 76 61 50 35

[email protected]

Djamel Zighed

Eric Laboratory, University of Lyon

5, avenue Pierre Mendes France

69676 Bron Cedex, France

Phone: 00 33 4 78 77 31 54

[email protected]

Abstract: Online discussions became increasingly widespread with the Web 2.0: no matter the distance, whether you know the person or not, you can discuss and exchange ideas with people all over the world through forums, blogs, and newsgroups. The news websites have extensively used forums in order to encourage the reader to be a real participant in the information media. This paper aims at automatically extracting the celebrities from such discussions. We propose certain meta-criteria and we provide an evaluation on a dataset of 35,175 posts written by 14,443 users. The results show that one of the proposed meta-criteria succeeds in extracting celebrities and allows for further improvements.

I. INTRODUCTION

The Web 2.0 has introduced an incredibly simple way of

interaction: regardless of the distance or whether you know the

person or not, you can discuss with people from all over

the world through the Web 2.0 applications (blogs, e-mails,

dedicated media, etc.). In particular, forum debates on news

websites are a very representative case of interaction between

people using the Web 2.0.

In this kind of interaction, as in real life, people play a

social role [1]. Goffman [2] defines the social role as “the

enactment of rights and duties attached to a given status”. In

other words, people have some regularities in their behaviour

and this behaviour can be analysed in order to figure out the

social role. Golder and Donath [3] study the Usenet newsgroup

and they define the celebrity social role in order to represent

people who are recognised in and by their community.

As a result, in this paper we provide three major contribu-

tions: the theoretical formalisation of the ‘celebrity’ social role

inspired by and based on previous anthropological studies; the

experiments of this theoretical framework on data extracted

from a news website using three different meta-criteria and

a baseline-criterion; and a discussion about the social role

extraction, and the relation with the kind of data we deal with.

The paper continues as follows. At first, we discuss the

related work concerning the topic of social role recognition

in online media. Then, we present the theoretical framework

based on the study of Golder and Donath [3] extended with

three meta-criteria to take into account the special features

of the forum debates on news websites. We continue by

presenting the experimental framework and the dataset we

used, and we discuss the evaluation of the different

criteria applied. We conclude by commenting on the results

and discussing the outcome.

II. RELATED WORK

The analysis of the social role has become an important

subject of research for researchers in humanities and computer

science. The existing works about the extraction of social roles

can be separated into two areas: the identification of explicit

and of non-explicit roles [4].

A. The non-explicit roles

The non-explicit roles refer to social roles that are not

pre-defined but extracted from the data using statistical and

probabilistic measures e.g., blockmodels [5][6][7]. These roles

can also be understood as positions [8], e.g., a secretary or

a manager in a firm. Blockmodels operate on a social network of relations: given a pre-defined equivalence, blockmodeling partitions the actors so that the relation matrix splits into blocks of 0s and 1s [8][9]. There exist

several equivalences depending on the research object: the

structural equivalence, the regular equivalence, the strong

equivalence and the automorphic equivalence. The structural

equivalence aims at grouping actors with similar interests

while the regular and the automorphic equivalence represent

more the sociological notion of the social role: people with

the same role have to be linked with people sharing the same

role.

B. The explicit roles

The explicit roles refer to predefined roles that can also be found in virtual discussions. Two roles are most commonly studied by the research community: the influencer [10][11] and the expert [12][13]. The influencer can influence other people in the social network, which makes this role interesting for viral marketing or advertisement placement. The expert is someone who knows his or her subject and may provide help to other people, e.g., in the Java Forum [13].

Golder and Donath [3] carried out an anthropological study

and identified six social roles: the celebrity, the newbie, the

lurker, the flamer, the troll and the ranter. They defined the

ranter as a social role that could appear to be “statistically

similar to celebrities”. Actually, the ranter is a “prolific writer,

[who] posts with great frequency on a particular issue or

issues and is unique in his or her lengthy posts and single

mindedness. Ranters exhibit some troll-like characteristics

[...]”.

2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining

978-0-7695-4799-2/12 $26.00 © 2012 IEEE

DOI 10.1109/ASONAM.2012.61

322


C. Discussion about the related work

The social role study is quite an old domain of research in

social sciences [2][15]. Nevertheless, as we saw in this section,

the social role is studied nowadays on the new platforms of

interaction, e.g., databases of emails, Usenet newsgroups, etc.

We base our analysis on the existing work closest to ours: the study of Golder and Donath

[3] and on one of the social roles they identified, that of the

celebrity. They define the celebrity as “the prototypical central

figure. Celebrities are prolific posters who spend a great deal of

time and energy contributing to their newsgroup community.

Because they post so often, everybody knows them”. Golder

and Donath study this social role on Usenet newsgroups. Our

data, i.e., the forum debates on news websites, have some

special features that differ from the Usenet newsgroups. Even

if we separate the topics of the forum debates (each news

website contains topics where the articles are classified in,

e.g., politics, economics, etc.), the user communities extracted

from these topics are larger than most of Usenet newsgroups.

Thus, in this paper, we formalize and adapt Golder and Donath’s framework to our data.

III. THEORETICAL FRAMEWORK

Let us represent a set of forum discussions as a directed

graph G = (V,E) whose nodes represent forum participants

and the edges show who replies to whom. The edges are

weighted and the weight shows the number of posts between

two connected users. The weighted in-degree of a node $v \in V$ is $\deg^-_G(v)$, whereas the weighted out-degree is $\deg^+_G(v)$. We also denote the set of posts authored by the user represented by the node $v$ as $p(v)$, the set of all existing posts as $p$, and the average number of posts per user in the specific set of discussions as $\bar{p}$. Finally, the posts that belong to the threads initiated by the user $v$ are denoted as $\mathrm{thr}(v)$, so that a post $p' \notin \mathrm{thr}(v)$ lies in a thread that $v$ did not initiate.

In Table I we present the characteristics (aggregated) defined

by Golder and Donath and we show the respective formalized

criteria that identify the post-reply behaviour of the user within

a discussion.

We have applied the criteria to our set of data in three

different ways. The constants presented in Table I are also

clarified in the meta-criteria described below.
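As a rough illustration of the notation introduced in this section, the following Python sketch builds the weighted reply graph from a list of (replying user, replied-to user) pairs and computes the weighted in- and out-degrees. The input format and the names `build_reply_graph` and `weighted_degrees` are assumptions made for illustration; the paper does not describe its implementation.

```python
from collections import Counter


def build_reply_graph(replies):
    """Return edge weights {(u, v): number of posts in which u replies to v}."""
    return Counter(replies)


def weighted_degrees(edges):
    """Return (in_degree, out_degree) counters of weighted degrees per user."""
    in_deg, out_deg = Counter(), Counter()
    for (src, dst), weight in edges.items():
        out_deg[src] += weight  # replies written by src
        in_deg[dst] += weight   # replies received by dst
    return in_deg, out_deg


# Hypothetical toy data: each pair is (author of the reply, author replied to).
replies = [("ann", "bob"), ("ann", "bob"), ("bob", "ann"), ("cat", "ann")]
edges = build_reply_graph(replies)
in_deg, out_deg = weighted_degrees(edges)
```

In this toy example, "ann" receives two replies and writes two, so both of her weighted degrees equal 2, while "cat" has an out-degree of 1 and an in-degree of 0.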

A. Meta-criteria 1

All the criteria presented in Table I have to be satisfied in order to choose a potential celebrity. The constants are given certain values. For example, the in-degree and out-degree values are chosen to be higher than 2 if the average in- and out-degree in the dataset is lower than 2. The threshold and constant values can always change according to the data and to how selective we are.

The interesting people are identified according to the

pseudo-code in Table II. Once the interesting people are

identified, they are ranked in descending order by the total

number of their posts. The pseudo-code is applied in the same

way per-topic as well as for the whole dataset.

TABLE II. PSEUDO-CODE

| Pseudo-code, $\forall v \in V$ | Comment |
| --- | --- |
| if $\lvert p(v) \rvert > \bar{p}$ | number of posts higher than the average |
| and $\deg^-_G(v) > \overline{\deg^-_G}$ | in-degree higher than average in-degree |
| and $\deg^{w-}_G(v) > \overline{\deg^{w-}_G}$ | weighted in-degree higher than average weighted in-degree |
| and $\deg^+_G(v) > \overline{\deg^+_G}$ | out-degree higher than average out-degree |
| and $\exists p' \in p(v) : p' \notin \mathrm{thr}(v)$ | the user $v$ has participated in threads s/he has not initiated |
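The selection and ranking of Meta-criteria 1 could be written along the following lines. This is a minimal sketch, not the authors' code: it assumes that the per-user statistics are already available in a dictionary, that the averages are taken over all users in that dictionary, and that the key names (`posts`, `in_deg`, `w_in_deg`, `out_deg`, `posts_outside_own_threads`) are hypothetical.

```python
from statistics import mean


def meta_criteria_1(users):
    """Select users that satisfy every condition, ranked by number of posts.

    users maps a user name to a dict of precomputed statistics (assumed keys).
    """
    keys = ("posts", "in_deg", "w_in_deg", "out_deg")
    # Average of each statistic over all users (assumption: no user is excluded).
    avg = {k: mean(u[k] for u in users.values()) for k in keys}
    selected = [
        name
        for name, u in users.items()
        if all(u[k] > avg[k] for k in keys)      # strictly above each average
        and u["posts_outside_own_threads"] > 0   # posted in a thread s/he did not start
    ]
    # Descending order by total number of posts.
    return sorted(selected, key=lambda n: users[n]["posts"], reverse=True)


# Hypothetical toy data.
users = {
    "ann": {"posts": 10, "in_deg": 4, "w_in_deg": 8, "out_deg": 5, "posts_outside_own_threads": 6},
    "bob": {"posts": 7, "in_deg": 3, "w_in_deg": 5, "out_deg": 4, "posts_outside_own_threads": 0},
    "cat": {"posts": 1, "in_deg": 0, "w_in_deg": 0, "out_deg": 1, "posts_outside_own_threads": 1},
    "dan": {"posts": 8, "in_deg": 3, "w_in_deg": 6, "out_deg": 4, "posts_outside_own_threads": 3},
    "eve": {"posts": 1, "in_deg": 0, "w_in_deg": 0, "out_deg": 0, "posts_outside_own_threads": 1},
}
ranking = meta_criteria_1(users)
```

Here "bob" is above every average but only posts in threads he started, so he is filtered out by the last condition; "ann" and "dan" remain and are ordered by post count.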

B. Meta-criteria 2

In this case, the criteria of ’Meta-criteria 1’ are all taken into account in the same way. The ranking, though, is done in a different way: by taking into account the participation of the user in different forums. The participation is the number of forums in which a user $v$ participates: $|\mathrm{forums}(v)|$. As a result, the final ranking is done in descending order according to the user’s average forum participation multiplied by the number of posts:

$$\mathrm{participation}(v) \cdot |p(v)|, \quad \text{where } \mathrm{participation}(v) = \frac{|\mathrm{forums}(v)|}{\max(|\mathrm{forums}|)}$$

C. Meta-criteria 3

In this meta-criteria we additionally take into account the

citations. As Golder and Donath specify, a celebrity is cited

even if s/he has not participated in the discussion. We identify

two citation types: the citations of names and the quotations

of texts. Regarding the identification of celebrities, citing the

name of a user is considered more important than quoting the

respective text because it shows that the user is known and

probably well respected. In this meta-criteria we consider the

ranking of ’Meta-criteria 2’ and we differentiate it by adding

the number of citations and quotations:

$$|\mathrm{citations}(v)| \cdot w_c + |\mathrm{quotations}(v)| \cdot w_q + (\mathrm{participation}(v) \cdot |p(v)|) \cdot w_{\text{Meta-criteria 2}}$$

where $w_c$ is the weight we give to the citations, $w_q$ is the weight we give to the quotations and $w_{\text{Meta-criteria 2}}$ is the weight we give to the ranking score of ’Meta-criteria 2’. The weights are set relative to each other according to which term we consider more important.

After applying all three meta-criteria we end up with three

different rankings of users depending on how likely they are to

be a celebrity. Note that not all the users are ranked, but only

the users who satisfy the meta-criteria. The rest of the users

are considered to have a score of 0 for the meta-criteria.
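The ranking scores of Meta-criteria 2 and 3 can be sketched as follows. Several points are assumptions rather than facts from the paper: `max_forums` is read as the largest number of forums any single user participates in, the weight values are placeholders (the paper does not report them) chosen only so that name citations count more than quotations, and all function names are hypothetical.

```python
def participation(n_forums_user, max_forums):
    """Share of forums the user participates in (assumed normalisation)."""
    return n_forums_user / max_forums


def score_mc2(n_posts, n_forums_user, max_forums):
    """Meta-criteria 2 score: participation(v) * |p(v)|."""
    return participation(n_forums_user, max_forums) * n_posts


def score_mc3(n_citations, n_quotations, mc2, w_c=2.0, w_q=1.0, w_mc2=1.0):
    """Meta-criteria 3 score: weighted citations and quotations plus the MC2 score.

    The default weights are illustrative placeholders, not values from the paper.
    """
    return n_citations * w_c + n_quotations * w_q + mc2 * w_mc2


# Hypothetical candidates that already passed the Meta-criteria 1 filter.
max_forums = 4
ann_mc2 = score_mc2(n_posts=10, n_forums_user=2, max_forums=max_forums)
dan_mc2 = score_mc2(n_posts=8, n_forums_user=4, max_forums=max_forums)
ann_mc3 = score_mc3(n_citations=1, n_quotations=0, mc2=ann_mc2)
dan_mc3 = score_mc3(n_citations=0, n_quotations=3, mc2=dan_mc2)
```

In this toy case "ann" has more posts, but "dan" is present in every forum and therefore outranks her under Meta-criteria 2 (8.0 against 5.0).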

TABLE I. POST-REPLY BEHAVIOUR CRITERIA BASED ON THE ARTICLE OF GOLDER AND DONATH [3].

| Characteristics | Post-reply criteria | Formalized criteria |
| --- | --- | --- |
| High contribution in the discussion (large volume of posts) and a “magnitude variation in posting frequency”. | The number of posts per author should be higher than the average. | $\lvert p(v) \rvert \ge \bar{p}$, for $v \in V$ |
| Communicative skills (not just a ’robot’ that sends messages). | High in-degree and out-degree values. | $\deg^-_G(v) \ge \alpha$ and $\deg^+_G(v) \ge \beta$, where $\alpha, \beta \in \mathbb{N}^{+*}$ are constants |
| Not a Ranter. | Participation in threads not initiated by the same person and higher in-degree value than the average. | $\exists p' \in p(v) : p' \notin \mathrm{thr}(v)$ and $\deg^-_G(v) > \overline{\deg^-_G}$, for $v \in V$ |

IV. EXPERIMENTAL FRAMEWORK

The previous section presented the theoretical framework. In order to experiment with it, we have created an experimental framework, as explained in this section.

Initially, by using a parser we extract the forum debates

and the respective topics from a news website. The parser

aims at extracting the user posts, the user’s pseudonym and

the structural relation, i.e., the relation based on the “reply

to” structure of the forums. We, then, extract the citation

relations, i.e., when a user cites another user within his post

and the quotation relations, i.e., when a user quotes a part of

a previous post in his post [16]. All this information is saved

into a database. Finally, we use the meta-criteria defined in

section III and a baseline to rank the users by their score.

We test these three meta-criteria on the American version

of the HuffingtonPost (see footnote 1). We parsed and analysed 57 forums on

three topics: politics, media and living.

TABLE III. NUMBER OF USERS, POSTS, CITATIONS, QUOTATIONS AND AVERAGE NUMBER OF POSTS PER USER.

| | Politics | Media | Living | All |
| --- | --- | --- | --- | --- |
| #Users | 4,547 | 5,974 | 3,667 | 11,443 |
| #Posts | 12,725 | 14,176 | 8,274 | 35,175 |
| #Posts / #Users | 2.8 | 2.4 | 2.3 | 3.1 |
| #Citations | 350 | 461 | 183 | 994 |
| #Quotations | 117 | 153 | 146 | 476 |

Table III describes the data. The media community is the biggest one, but it does not have the highest ratio of posts per user. The living community is the smallest one. The ratio of posts per user is higher for the whole community (3.1) than for any single topic (below 3). This can be explained by the fact that users participate in several topics, i.e., the topic communities overlap: a user who posts in two topics is counted once in each topic but only once in the whole dataset. This is also why there are 11,443 users in our dataset, a number smaller than the sum of the users of the three topics.
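The arithmetic behind this observation can be checked directly from the figures of Table III. The short Python sketch below recomputes the posts-per-user ratios and the gap between the summed topic communities and the number of distinct users; the variable names are illustrative only.

```python
# Figures copied from Table III.
users = {"Politics": 4547, "Media": 5974, "Living": 3667}
posts = {"Politics": 12725, "Media": 14176, "Living": 8274}
unique_users_all = 11443

# Posts per user for each topic, rounded as in the table.
ratios = {topic: round(posts[topic] / users[topic], 1) for topic in users}

# Posts are disjoint across topics, so they simply add up.
total_posts = sum(posts.values())
ratio_all = round(total_posts / unique_users_all, 1)

# Users are not disjoint: the surplus counts the extra topic memberships
# of users who are active in more than one topic.
sum_topic_users = sum(users.values())
surplus_memberships = sum_topic_users - unique_users_all
```

The posts add up exactly to 35,175, whereas the topic communities add up to 14,188 users, which is 2,745 more than the 11,443 distinct users; dividing the same posts by fewer users is what raises the overall ratio to 3.1.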

V. EVALUATION

As we saw in section III, we carried out three experiments

based on three meta-criteria and a baseline criterion: #posts,

Meta-criteria 1, 2 and 3.

Footnote 1: http://www.huffingtonpost.com/?country=US

In order to evaluate the ranking of the users based on the criteria defined, for the three topics as well as the whole dataset, we retrieve an additional piece of information from the HuffingtonPost website: the number of fans that a user has. In our opinion, the number of fans is a good validation criterion: when a user has a lot of fans, s/he appears to be recognised by his/her community and, thus, s/he is a celebrity. Hence, we retrieve and store this information for all the users of our dataset.

By observing the data, we applied some basic statistics in order to find a good fan-threshold. We noticed that, at the moment of the evaluation, the maximum number of fans a user has is 9,061 and the average number of fans is 377. Looking at the quartiles, the first quartile is about 39 fans, the second quartile (median) is about 159 and the third quartile is about 436. In other words, half of the users have a low number of fans (at most 159) and only 25% of the population has more than 436 fans. We finally choose a minimum of 800 fans, a threshold reached by 13% of the users.

Hence, we consider that when at least 800 users have decided

to become a fan of a certain user, then, this user is really

recognized by the community s/he belongs to. Actually, we

assume that when a user becomes a fan of someone else,

s/he already knows this other person (in our case through his

written activities). We realize that this threshold is not general

and it may change on another dataset but we had to make a

choice, knowing the type of our data, in order to evaluate our

model.

Some of the users removed their HuffingtonPost account between the time they were writing in the forums and the time we extracted the number of fans per user. We decided to keep them in the dataset but not to consider them as celebrities. In our opinion, a celebrity should remain active through time, so a user who has decided to remove his/her account cannot be a celebrity.

Fig. 1. ROC curves for the three meta-criteria for the whole dataset.

We also calculate the precision at a cut-off value n of the ranking [17]. The precision@n is well suited to our task because it evaluates only the n best-ranked users rather than the whole dataset. The formula for the precision@n is the following:

$$\text{precision@}n = \frac{\text{correctly assigned celebrity users in } n \text{ ranked users}}{n},$$

where n is a cut-off value, discussed in the next section.
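A minimal Python sketch of this measure, combined with the fan-based ground truth described earlier in this section, could look as follows. It assumes that a user counts as a celebrity when s/he has at least 800 fans, and that users missing from the fan dictionary (for example, removed accounts) are treated as non-celebrities; the function name and the data layout are hypothetical.

```python
def precision_at_n(ranked_users, fans, n=20, fan_threshold=800):
    """Fraction of the n best-ranked users whose fan count reaches the threshold.

    ranked_users: user names in descending order of score.
    fans: {user name: number of fans}; missing users count as 0 fans.
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")
    top = ranked_users[:n]
    hits = sum(1 for user in top if fans.get(user, 0) >= fan_threshold)
    # The denominator is n, as in the formula, even if fewer users are ranked.
    return hits / n


# Hypothetical toy ranking and fan counts ("d" has removed his/her account).
ranked = ["a", "b", "c", "d"]
fans = {"a": 900, "b": 100, "c": 800}
p_at_4 = precision_at_n(ranked, fans, n=4)
p_at_1 = precision_at_n(ranked, fans, n=1)
```

With this toy data, two of the four ranked users reach the threshold, so the precision@4 is 0.5, while the precision@1 is 1.0 because the top-ranked user is a celebrity.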

VI. RESULTS

Figure 1 shows the ROC curve for the three meta-criteria

for the whole dataset. Note that the tendency is the same for

the three topics.

It is surprising, at first, to see that the ROC curves of our

three meta-criteria become straight. Actually, the meta-criteria

help us to extract a sub-population of the users that satisfies all

the criteria. Then, they are ranked by their user activity (meta-

criteria 1), adding as a condition the number of forums a user

has participated in (meta-criteria 2), and including the name

and the text quotations (meta-criteria 3). This means that if a user does not satisfy the meta-criteria, his/her ranking score is zero. Thus, if we look at the whole population, only 8% of the users satisfy the meta-criteria 1 and 2, and the remaining 92% have a score of 0, so they are all tied at the same rank; this tie produces the straight segment of the ROC curves.
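The effect of such a large tie on a ROC curve can be reproduced with a small Python sketch. It computes one ROC point per distinct score, so that all users sharing the score 0 are added at once and form a single straight segment. This is an illustrative computation under the assumption that both classes are non-empty; it is not the evaluation code used in the paper, and the function name is hypothetical.

```python
from itertools import groupby


def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points, ties grouped.

    scores: ranking scores; labels: True for celebrities, False otherwise.
    Assumes at least one positive and one negative label.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    pairs = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Users with the same score cannot be ordered, so they yield one point.
    for _, group in groupby(pairs, key=lambda p: p[0]):
        for _, is_celebrity in group:
            if is_celebrity:
                tp += 1
            else:
                fp += 1
        points.append((fp / neg, tp / pos))
    return points


# Hypothetical toy data: four of six users fail the criteria and score 0.
scores = [5, 3, 0, 0, 0, 0]
labels = [True, False, True, False, False, False]
points = roc_points(scores, labels)
```

The last two points are (0.25, 0.5) and (1.0, 1.0): the four zero-scored users, one of whom is a celebrity, are covered by one straight line, which mirrors the shape described above.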

Hence, we want to evaluate our model for the top n ranked

users. This is why, in order to have a better understanding

of the situation we “zoomed” in the ROC curves of Figure

1 for the top-n best ranked users. The idea behind this is to

see if our model is better for the best ranked even if we left

out some celebrities. In order to zoom, we choose the number

of users who satisfy all the criteria, e.g., 1,207 for all topics.

Then, we zoomed in on these users for each meta-criteria and

the baseline-criterion (the number of posts).

Figure 2 shows the results after the zoom. The meta-criteria 2 gives better results than all the other meta-criteria and the baseline for two topics and the whole dataset. This means that the number of forums in which a user participates plays a role in the celebrity recognition, since the ranking by the number of forums improved the celebrity identification. The meta-criteria 2 is clearly better than the baseline for the media topic, followed by the meta-criteria 3. Indeed, considering the name citations and the text quotations seems to give good results for the media topic, so we still believe that this information has to be taken into account.

Fig. 2. Zoom in on the interesting part of the ROC curves.

We choose n = 20 for our experiments to compute the precision@n. We assume that a list of 20 users is adequate if our task is to present the few top celebrity users to a new user of a news website. These top users would represent the ones who are recognised by the forum community that the new user is planning to join. Table IV shows the results for the precision@20.

TABLE IV. PRECISION@20 FOR EACH TOPIC AND THE WHOLE DATASET.

| | Politics | Living | Media | All |
| --- | --- | --- | --- | --- |
| #Posts | 0.579 | 0.667 | 0.632 | 0.745 |
| Meta-criteria 1 | 0.579 | 0.667 | 0.632 | 0.708 |
| Meta-criteria 2 | 0.737 | 0.55 | 0.895 | 0.755 |
| Meta-criteria 3 | 0.722 | 0.4 | 0.818 | 0.692 |

Concerning the precision@20, the meta-criteria 2 improves the results for all experiments except for the living topic. The number of forums in which a user participates seems to be a good indicator for the celebrity recognition task. The meta-criteria 1 does not give any improvement compared to the baseline-criterion, so ranking by the number of posts does not seem to be the most appropriate ranking for our task. The addition of the name citations and the text quotations improves the results compared to the baseline in two topics. Nevertheless, the results for the living topic are poor.

We actually identified an issue with the algorithm that extracts the name citations [16]. For example, a user who uses the pseudonym “obamaa1” will be identified as a celebrity because the name “Obama” is mentioned several times inside the comments of the forum. It is obvious, though, that the users write about the politician and not about the user who chose this pseudonym. Our algorithm performs two checks to make sure that a word is a name citation (cf. [16]), but in this case it still yields a false positive.

The reason why the meta-criteria 2 gives poor results for the living topic may be that it is a very small community. This is not, though, a sufficient explanation and further investigation is needed.

In the next section we will discuss the results obtained

and put into perspective the celebrity social role in the news

website context.


VII. DISCUSSION & PERSPECTIVES

a) A celebrity is not just a prolific poster: if it were, the results of the baseline would be the best for each topic and for the whole dataset. As we could see, we obtained better results with the Meta-criteria 2. This leads us to the conclusion that even if the number of posts is an essential criterion, it is not sufficient to identify the celebrities in an online discussion.

b) A celebrity is more global than local (to a topic): in the beginning we expected the meta-criteria 3 to give better results for the topics than for the whole dataset. Indeed, we expected the users to participate more in their subject of interest than in the whole website, e.g., it is not the same thing to participate in the politics topic and in the living topic. Surprisingly, the opposite was true: the results were better for the whole dataset than for the separate topics. So, in the news forum debates, a celebrity seems to be more global to the website and less specialised than in a newsgroup.

c) A celebrity in a news website is different from a celebrity in a Usenet newsgroup: as we saw, the meta-criteria 2 improved the results. This means that the number of forums in which a user participates is important. This is easy to understand: the more forums a user is present on, the more chances s/he has to meet other users. In other words, a user who participates in only one forum can only meet the users who also participate in that forum, whereas a user who participates in several forums “meets” many other users. This increases the chances of being recognised.

d) A celebrity should be cited and quoted but this is not sufficient: the text quotations and the name citations improve the results on two topics compared to the baseline. Thus, we still think that this information has to be considered in the celebrity recognition task. Future experiments will allow us to decide whether this is a good choice or not. Furthermore, we would like to test different values for the weights $w_{\text{Meta-criteria 2}}$, $w_c$ and $w_q$ explained in section III.

e) A celebrity should post interesting messages: Golder

and Donath, in their anthropological study, read the posts

written by the users. This information is helpful to determine

if a user is a celebrity or not. However, this is difficult to do automatically and, as our experiments showed, the text quotation does not seem to be sufficient on its own to recognise the celebrity. In order to improve the recognition of interesting posts, we plan to use an opinion extraction model [18] or to measure the average length of the threads initiated by a user, i.e., to determine automatically whether a post is interesting for the community or not.

VIII. CONCLUSION

In conclusion, this paper provides three contributions: the

theoretical formalisation of the celebrity social role inspired

by previous anthropological studies, the experiments based on

this framework on a new type of data (using three different

meta-criteria and a baseline) and a discussion regarding the

celebrity social role recognition focusing on the forum debates

on news websites.

As we saw, the number of posts, in the baseline and in the ranking for the meta-criteria 1, is not sufficient to find the celebrities in news website forums. However, ranking by the number of forums in which a user has participated (meta-criteria 2) improved the results. This indicates that the social role extraction has to be thought of as a function of the context of the interaction, i.e., of the data analysed.

We evaluated our model in two ways: we created the ROC curves, zooming in on the top-ranked users, and we calculated the precision@n. Indeed, we are more interested in ranking the first users well than in ranking the whole dataset. The precision@20 for the meta-criteria 2 gave better results than the baseline and the two other meta-criteria, except for the living topic. So far, our experimental results are encouraging, and further experiments are to be carried out in the future.

REFERENCES

[1] E. Gleave, H. Welser, T. Lento, and M. Smith, “A conceptual and operational definition of social role in online community,” in 42nd Hawaii International Conference on System Sciences. IEEE, 2009, pp. 1–11.

[2] E. Goffman, The presentation of self in everyday life. New York: Anchor, 1959.

[3] S. Golder and J. Donath, “Social roles in electronic communities,” Internet Research, vol. 5, pp. 19–22, 2004.

[4] M. Forestier, A. Stavrianou, J. Velcin, and D. A. Zighed, “Roles in social networks: Methodologies and research issues,” Web Intelligence and Agent Systems, vol. 10, no. 1, pp. 117–133, 2012.

[5] E. Airoldi, D. Blei, S. Fienberg, and E. Xing, “Mixed membership stochastic blockmodels,” The Journal of Machine Learning Research, vol. 9, pp. 1981–2014, 2008.

[6] P. Doreian, V. Batagelj, and A. Ferligoj, Generalized blockmodeling. Cambridge Univ Pr, 2005.

[7] S. Wasserman and C. Anderson, “Stochastic a posteriori blockmodels: Construction and assessment,” Social Networks, vol. 9, no. 1, pp. 1–36, 1987.

[8] S. Borgatti and M. Everett, “Notions of position in social network analysis,” Sociological methodology, vol. 22, pp. 1–35, 1992.

[9] D. White and K. Reitz, “Graph and semigroup homomorphisms on networks of relations,” Social Networks, vol. 5, no. 2, pp. 193–234, 1983.

[10] N. Agarwal, H. Liu, L. Tang, and P. S. Yu, “Identifying the influential bloggers in a community,” in WSDM ’08: Proceedings of the international conference on Web search and web data mining. New York, NY, USA: ACM, 2008, pp. 207–218.

[11] P. Domingos, “Mining social networks for viral marketing,” IEEE Intelligent Systems, vol. 20, no. 1, pp. 80–82, 2005.

[12] C. Campbell, P. Maglio, A. Cozzi, and B. Dom, “Expertise identification using email communications,” in Proceedings of the twelfth international conference on Information and knowledge management. ACM, 2003, pp. 528–531.

[13] J. Zhang, M. Ackerman, and L. Adamic, “Expertise networks in online communities: Structure and algorithms,” in Proc. of the 16th International conference on World Wide Web, 2007, pp. 221–230.

[14] H. Welser, G. Kossinets, S. Marc, and D. Cosley, “Finding social roles in wikipedia,” in annual meeting of the American Sociological Association, Boston, MA, AllAcademic, 2008.

[15] S. Mind, “Society: From the standpoint of a social behaviorist,” Chicago: The, 1934.

[16] M. Forestier, J. Velcin, and Z. Djamel, “Extracting social network to understand interaction,” International Conference on Advances in Social Networks Analysis and Mining, pp. 213–219, 2011.

[17] G. Salton and M. Lesk, “Computer evaluation of indexing and text processing,” Journal of the ACM, vol. 15, no. 1, pp. 8–36, 1968.

[18] A. Stavrianou, J. Velcin, and J. Chauchat, “Definition and Measures of an Opinion Model for Mining Forums,” in International Conference on Advances in Social Network Analysis and Mining, 2009. IEEE, 2009, pp. 188–193.
