[ieee 2012 international conference on advances in social networks analysis and mining (asonam 2012)...
Extracting Celebrities from Online Discussions
Mathilde Forestier, Julien Velcin
Eric Laboratory, University of Lyon
5, avenue Pierre Mendes France
69676 Bron Cedex, France
Phone: 00 33 4 78 77 31 54
Anna Stavrianou
Xerox Research Centre Europe
6 chemin de Maupertuis,
38240 Meylan, France
Phone: 00 33 4 76 61 50 35
Djamel Zighed
Eric Laboratory, University of Lyon
5, avenue Pierre Mendes France
69676 Bron Cedex, France
Phone: 00 33 4 78 77 31 54
Abstract—Online discussions became increasingly widespread with the Web 2.0: no matter the distance, whether you know the person or not, you can discuss and exchange ideas with people all over the world through forums, blogs, and newsgroups. News websites have made extensive use of forums in order to encourage readers to become real participants in the information media. This paper aims at automatically extracting the celebrities from such discussions. We propose certain meta-criteria and we provide an evaluation on a dataset of 35,175 posts written by 14,443 users. The results show that one of the proposed meta-criteria succeeds in extracting celebrities and allows for further improvements.
I. INTRODUCTION
The Web 2.0 has introduced an incredibly simple way of
interacting: regardless of the distance or whether you know the
person or not, you can hold discussions with people from all
over the world through Web 2.0 applications (blogs, e-mails,
dedicated media, etc.). In particular, forum debates on news
websites are a very representative case of interaction between
people using the Web 2.0.
In this kind of interaction, as in real life, people play a
social role [1]. Goffman [2] defines the social role as “the
enactment of rights and duties attached to a given status”. In
other words, people show regularities in their behaviour,
and this behaviour can be analysed in order to figure out the
social role. Golder and Donath [3] study Usenet newsgroups
and define the celebrity social role in order to represent
people who are recognised in and by their community.
In this paper we provide three major contributions:
the theoretical formalisation of the ‘celebrity’ social role,
inspired by and based on previous anthropological studies;
experiments with this theoretical framework on data extracted
from a news website, using three different meta-criteria and
a baseline criterion; and a discussion of social role
extraction and its relation to the kind of data we deal with.
The paper continues as follows. First, we discuss the
related work concerning the topic of social role recognition
in online media. Then, we present the theoretical framework
based on the study of Golder and Donath [3], extended with
three meta-criteria to take into account the special features
of the forum debates on news websites. We continue by
presenting the experimental framework and the dataset we
used, and we discuss the evaluation of the different
criteria applied. We conclude by commenting on the results
and discussing the outcome.
II. RELATED WORK
The analysis of the social role has become an important
research subject in the humanities and computer science.
Existing work on the extraction of social roles can be
separated into two areas: the identification of explicit
roles and of non-explicit roles [4].
A. The non-explicit roles
The non-explicit roles refer to social roles that are not
pre-defined but extracted from the data using statistical and
probabilistic measures, e.g., blockmodels [5][6][7]. These roles
can also be understood as positions [8], e.g., a secretary or
a manager in a firm. Blockmodels are based on a social
network of relations. Using a pre-defined equivalence, the
aim of blockmodeling is to sort people into blocks of 0/1
as a function of the pre-defined equivalence [8][9]. Several
equivalences exist, depending on the research object: the
structural equivalence, the regular equivalence, the strong
equivalence and the automorphic equivalence. The structural
equivalence aims at grouping actors with similar interests,
while the regular and the automorphic equivalences are closer
to the sociological notion of the social role: people with
the same role have to be linked with people sharing the same
role.
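To make the notion of equivalence concrete, the following minimal sketch groups nodes that are structurally equivalent in the strict sense, i.e., that have exactly the same in-neighbours and out-neighbours. It is an illustration only and not one of the blockmodeling methods of [5][6][7]; the function name, the edge list and the node names are hypothetical.

```python
from collections import defaultdict

def structural_equivalence_classes(edges, nodes):
    """Group nodes with identical in- and out-neighbour sets."""
    out_nb = defaultdict(set)
    in_nb = defaultdict(set)
    for src, dst in edges:
        out_nb[src].add(dst)
        in_nb[dst].add(src)
    classes = defaultdict(list)
    for v in nodes:
        # Two nodes share a key only if they are tied to the same others.
        key = (frozenset(out_nb[v]), frozenset(in_nb[v]))
        classes[key].append(v)
    return sorted(sorted(members) for members in classes.values())

# Hypothetical toy network: "a" and "b" both reply only to "m".
edges = [("a", "m"), ("b", "m"), ("m", "c")]
nodes = ["a", "b", "c", "m"]
classes = structural_equivalence_classes(edges, nodes)
print(classes)
```

In this toy network, "a" and "b" end up in the same class because they have the same ties, while "c" and "m" each form their own class.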
B. The explicit roles
The explicit roles refer to predefined roles that can also be
found in virtual discussions. Two basic roles are most commonly
studied by the research community: the influencer [10][11]
and the expert [12][13]. The influencer can influence other
people in the social network, which makes this role interesting
for viral marketing or advertisement placement. The expert
is someone who knows a subject well and may provide help to
other people, e.g., in the Java Forum [13].
Golder and Donath [3] carried out an anthropological study
and identified six social roles: the celebrity, the newbie, the
lurker, the flamer, the troll and the ranter. They defined the
ranter as a social role that could appear to be “statistically
similar to celebrities”. Actually, the ranter is a “prolific writer,
[who] posts with great frequency on a particular issue or
issues and is unique in his or her lengthy posts and single
mindedness. Ranters exhibit some troll-like characteristics
[...]”.
2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining
978-0-7695-4799-2/12 $26.00 © 2012 IEEE
DOI 10.1109/ASONAM.2012.61
C. Discussion about the related work
The study of social roles is a long-standing research domain
in the social sciences [2][15]. Nevertheless, as we saw in this
section, the social role is nowadays studied on the new platforms
of interaction, e.g., databases of emails, Usenet newsgroups, etc.
We base our analysis on the work in the literature that is
closest to ours, that of Golder and Donath [3], and on one of
the social roles they identified, that of the celebrity.
They define the celebrity as “the prototypical central
figure. Celebrities are prolific posters who spend a great deal of
time and energy contributing to their newsgroup community.
Because they post so often, everybody knows them”. Golder
and Donath study this social role on Usenet newsgroups. Our
data, i.e., the forum debates on news websites, have some
special features that differ from the Usenet newsgroups. Even
if we separate the topics of the forum debates (each news
website contains topics under which the articles are classified,
e.g., politics, economics, etc.), the user communities extracted
from these topics are larger than most Usenet newsgroups.
Thus, in this paper, we formalize Golder and Donath’s
framework and adapt it to our data.
III. THEORETICAL FRAMEWORK
Let us represent a set of forum discussions as a directed
graph $G = (V, E)$ whose nodes represent forum participants
and whose edges show who replies to whom. The edges are
weighted and the weight shows the number of posts between
two connected users. The weighted in-degree of a node $v \in V$
is $\deg^-_G(v)$, whereas the weighted out-degree is $\deg^+_G(v)$.
We also denote the set of posts authored by a user who is
represented by the node $v$ as $p(v)$, the total existing posts
as $p$ and the average number of posts in the specific set of
discussions as $\bar{p}$. Finally, the posts that belong to the set of
threads initiated by the user $v$ will be denoted as $thr(v)$.
In Table I we present the characteristics (aggregated) defined
by Golder and Donath and we show the respective formalized
criteria that identify the post-reply behaviour of the user within
a discussion.
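As a minimal sketch of this representation, the snippet below builds the weighted reply edges and the weighted in- and out-degrees from a list of posts. It assumes that each post is available as a hypothetical pair (author, replied_to), where replied_to is None for a post that opens a thread; the function names and the pseudonyms are illustrative and do not come from our implementation.

```python
from collections import Counter, defaultdict

def build_reply_graph(replies):
    """Map each directed edge (author, replied_to) to its number of posts."""
    edges = Counter()
    for author, target in replies:
        if target is not None:  # thread-opening posts create no edge
            edges[(author, target)] += 1
    return edges

def weighted_degrees(edges):
    """Return (in_degree, out_degree) dictionaries summing edge weights."""
    in_deg = defaultdict(int)
    out_deg = defaultdict(int)
    for (src, dst), weight in edges.items():
        out_deg[src] += weight
        in_deg[dst] += weight
    return dict(in_deg), dict(out_deg)

# Hypothetical posts: "ann" replies twice to "bob", and so on.
replies = [("ann", "bob"), ("ann", "bob"), ("bob", "ann"),
           ("cat", "ann"), ("dan", None)]
edges = build_reply_graph(replies)
in_deg, out_deg = weighted_degrees(edges)
print(in_deg, out_deg)
```

Users who never reply, or who are never replied to, are simply absent from the corresponding dictionary, which corresponds to a degree of 0.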
We have applied the criteria to our set of data in three
different ways. The constants presented in Table I are also
clarified in the following described meta-criteria.
A. Meta-criteria 1
All the criteria presented in Table I have to be satisfied
for a user to be selected as a potential celebrity. The
constants are given certain values. For example, the in-degree
and out-degree values are chosen to be higher than 2 if the
average in- and out-degree in the dataset is lower than 2. The
threshold and constant values can always be changed according
to the data and to how selective we want to be.
The interesting people are identified according to the
pseudo-code in Table II. Once the interesting people are
identified, they are ranked in descending order by the total
number of their posts. The pseudo-code is applied in the same
way per-topic as well as for the whole dataset.
TABLE II
PSEUDO-CODE

| Pseudo-code, $\forall v \in V$ | Comment |
|---|---|
| if $\lvert p(v) \rvert > \bar{p}$ | number of posts higher than the average |
| and ($\deg^-_G(v) > \overline{\deg^-_G}$) | in-degree higher than average in-degree |
| and ($\deg^{w-}_G(v) > \overline{\deg^{w-}_G}$) | weighted in-degree higher than average weighted in-degree |
| and ($\deg^+_G(v) > \overline{\deg^+_G}$) | out-degree higher than average out-degree |
| and ($\exists p' \in p(v) : p' \notin thr(v)$) | the user $v$ has participated in threads s/he has not initiated |
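The selection and ranking of Meta-criteria 1 can be sketched as follows. This is a hedged illustration of the pseudo-code of Table II, not our actual implementation: it assumes that the per-user statistics have already been computed and stored in a dictionary, and the key names ('posts', 'in_deg', 'w_in_deg', 'out_deg', 'foreign_thread') as well as the sample values are hypothetical.

```python
from statistics import mean

def meta_criteria_1(users):
    """Return the users passing all the criteria, ranked by number of posts."""
    numeric = ("posts", "in_deg", "w_in_deg", "out_deg")
    avg = {k: mean(u[k] for u in users.values()) for k in numeric}
    selected = [
        name for name, u in users.items()
        if all(u[k] > avg[k] for k in numeric)  # strictly above each average
        and u["foreign_thread"]  # posted in a thread s/he did not initiate
    ]
    return sorted(selected, key=lambda name: users[name]["posts"], reverse=True)

# Hypothetical statistics; "cat" posts a lot but only in her own threads.
users = {
    "ann": {"posts": 10, "in_deg": 4, "w_in_deg": 9, "out_deg": 5, "foreign_thread": True},
    "bob": {"posts": 8, "in_deg": 3, "w_in_deg": 6, "out_deg": 4, "foreign_thread": True},
    "cat": {"posts": 12, "in_deg": 5, "w_in_deg": 8, "out_deg": 6, "foreign_thread": False},
    "dan": {"posts": 1, "in_deg": 0, "w_in_deg": 0, "out_deg": 1, "foreign_thread": True},
    "eve": {"posts": 1, "in_deg": 0, "w_in_deg": 0, "out_deg": 0, "foreign_thread": True},
}
ranking = meta_criteria_1(users)
print(ranking)
```

The user "cat" has the highest number of posts but is excluded as a potential ranter, which is the behaviour that distinguishes this selection from a ranking by number of posts alone.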
B. Meta-criteria 2
In this case, the criteria of ‘Meta-criteria 1’ are all taken
into account in the same way. The ranking, though, is done
differently, by taking into account the participation of
the user in different forums. The participation is based on the
number of forums in which a user $v$ participates, $|forums(v)|$.
As a result, the final ranking is done in descending order
according to the user’s average forum participation multiplied
by the number of posts:
$$participation(v) \cdot |p(v)|, \quad \text{where} \quad participation(v) = \frac{|forums(v)|}{\max(|forums|)}$$
C. Meta-criteria 3
In this meta-criteria we additionally take into account the
citations. As Golder and Donath specify, a celebrity is cited
even if s/he has not participated in the discussion. We identify
two citation types: the citations of names and the quotations
of texts. Regarding the identification of celebrities, citing the
name of a user is considered more important than quoting the
respective text because it shows that the user is known and
probably well respected. In this meta-criteria we consider the
ranking of ‘Meta-criteria 2’ and we differentiate it by adding
the number of citations and quotations:
$$|citations(v)| \cdot w_c + |quotations(v)| \cdot w_q + (participation(v) \cdot |p(v)|) \cdot w_{Meta-criteria2}$$
where $w_c$ is the weight we decide to give to the citations, $w_q$
is the weight we give to the quotations and $w_{Meta-criteria2}$
is the weight we give to the ranking calculation of
‘Meta-criteria 2’. The different weights are set relative to each
other according to what we give more importance to.
After applying all three meta-criteria we end up with three
different rankings of users according to their likelihood of
being a celebrity. Note that not all the users are ranked, but
only the users who satisfy the meta-criteria. The rest of the
users are considered to have a score of 0 for the meta-criteria.
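The ranking scores of Meta-criteria 2 and 3 can be sketched as below. The sketch makes two assumptions that are not fixed by the text: max(|forums|) is read as the largest number of forums any single user participates in, and the weight values (w_c = 2, w_q = 1, w_mc2 = 1) are purely illustrative, chosen only so that name citations count more than text quotations. All function names and data are hypothetical.

```python
def participation(forums_of, v):
    """|forums(v)| divided by the largest forum count of any user (assumed)."""
    max_forums = max(len(forums) for forums in forums_of.values())
    return len(forums_of[v]) / max_forums

def score_mc2(v, forums_of, posts_of):
    """Meta-criteria 2 score: participation(v) * |p(v)|."""
    return participation(forums_of, v) * posts_of[v]

def score_mc3(v, forums_of, posts_of, citations, quotations,
              w_c=2.0, w_q=1.0, w_mc2=1.0):
    """Meta-criteria 3 score: weighted citations, quotations and MC2 score."""
    return (citations.get(v, 0) * w_c
            + quotations.get(v, 0) * w_q
            + score_mc2(v, forums_of, posts_of) * w_mc2)

# Hypothetical users who already satisfy the criteria of Meta-criteria 1.
forums_of = {"ann": {"f1", "f2", "f3", "f4"}, "bob": {"f1", "f2"}}
posts_of = {"ann": 10, "bob": 16}
citations = {"bob": 1}
quotations = {"ann": 2}

mc2 = {v: score_mc2(v, forums_of, posts_of) for v in forums_of}
mc3 = {v: score_mc3(v, forums_of, posts_of, citations, quotations)
       for v in forums_of}
print(mc2, mc3)
```

In this toy example "bob" writes more posts than "ann", but "ann" is ranked first by both scores because she participates in twice as many forums. Users who do not satisfy the criteria would not be scored at all and would keep a score of 0.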
IV. EXPERIMENTAL FRAMEWORK
We saw in the previous section the theoretical framework. In
order to experiment with it, we have created an experimental
framework, as explained in this section.

TABLE I
POST-REPLY BEHAVIOUR CRITERIA BASED ON THE ARTICLE OF GOLDER AND DONATH [3].

| Characteristics | Post-reply criteria | Formalized criteria |
|---|---|---|
| High contribution in the discussion (large volume of posts) and a “magnitude variation in posting frequency”. | The number of posts per author should be higher than the average. | $\lvert p(v) \rvert \geq \bar{p}$, for $v \in V$ |
| Communicative skills (not just a ‘robot’ that sends messages). | High in-degree and out-degree values. | $\deg^-_G(v) \geq \alpha$ and $\deg^+_G(v) \geq \beta$, where $\alpha, \beta \in \mathbb{N}^{+*}$ are constants |
| Not a Ranter. | Participation in threads not initiated by the same person and higher in-degree value than the average. | $\exists p' \in p(v) : p' \notin thr(v)$ and $\deg^-_G(v) > \overline{\deg^-_G}$, for $v \in V$ |

Initially, using a parser, we extract the forum debates
and the respective topics from a news website. The parser
extracts the user posts, the users’ pseudonyms and
the structural relation, i.e., the relation based on the “reply
to” structure of the forums. We then extract the citation
relations, i.e., when a user cites another user within his/her
post, and the quotation relations, i.e., when a user quotes a
part of a previous post in his/her post [16]. All this
information is saved into a database. Finally, we use the
meta-criteria defined in section III and a baseline to rank
the users by their score.
We test these three meta-criteria on the American version
of the HuffingtonPost¹. We parsed and analysed 57 forums on
three topics: politics, media and living.
TABLE III
NUMBER OF USERS, POSTS, CITATIONS, QUOTATIONS AND AVERAGE NUMBER OF POSTS PER USER.

| | Politics | Media | Living | All |
|---|---|---|---|---|
| #Users | 4,547 | 5,974 | 3,667 | 11,443 |
| #Posts | 12,725 | 14,176 | 8,274 | 35,175 |
| #Posts / #Users | 2.8 | 2.4 | 2.3 | 3.1 |
| #Citations | 350 | 461 | 183 | 994 |
| #Quotations | 117 | 153 | 146 | 476 |

Table III describes the data. We can see that the media
community is the biggest one, but it does not have the highest
ratio of posts per user. The living community is the smallest
one. The ratio of posts per user is higher for the whole
community (3.1) than for any single topic (<3). We can explain
this phenomenon by the fact that users participate in several
topics, i.e., the topic communities overlap. This is also why
there are 11,443 users in our dataset, a number smaller than
the sum of the users from the three topics.
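The arithmetic behind this observation can be checked directly from the values of Table III. The short sketch below only recomputes the ratios and the difference between the per-topic user counts and the number of distinct users; the variable names are ours.

```python
# Values copied from Table III.
users = {"Politics": 4547, "Media": 5974, "Living": 3667}
posts = {"Politics": 12725, "Media": 14176, "Living": 8274}
all_users, all_posts = 11443, 35175

# Posts per user for each topic and for the whole dataset.
ratios = {topic: round(posts[topic] / users[topic], 1) for topic in users}
overall_ratio = round(all_posts / all_users, 1)

# Posts belong to exactly one topic, so they sum to the total, whereas
# a user active in several topics is counted once per topic.
topic_memberships = sum(users.values())
repeated_memberships = topic_memberships - all_users

print(ratios, overall_ratio, topic_memberships, repeated_memberships)
```

The per-topic user counts add up to 14,188, i.e., 2,745 more than the 11,443 distinct users, while the posts add up exactly to 35,175. Dividing the same posts by fewer distinct users is what yields the higher overall ratio.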
V. EVALUATION
As we saw in section III, we carried out experiments with
three meta-criteria and a baseline criterion: #posts (the
baseline) and Meta-criteria 1, 2 and 3.
¹ http://www.huffingtonpost.com/?country=US
In order to evaluate the ranking of the users based on the
criteria defined, for the three topics as well as the whole
dataset, we retrieve an additional piece of information from
the HuffingtonPost website: the number of fans that a user has.
In our opinion, the number of fans is a good validation
criterion: when a user has a lot of fans, s/he seems to be
recognised by his/her community and, thus, s/he is a celebrity.
Hence, we retrieve and store this information for all the users
of our dataset.
By observing the data, we applied some basic statistics in
order to find a good fan threshold. We noticed that, at the
moment of the evaluation, the maximum number of fans a user
has is 9,061 and the average number of fans is 377. Looking
at the quartiles, the first quartile is about 39 fans, the second
quartile (median) is about 159 and the third quartile is about
436. Half of the users have a low number of fans (at most 159)
and only 25% of the population has more than 436 fans. We
finally chose a minimum of 800 fans, a threshold reached by
13% of the users.
Hence, we consider that when 800 users have decided
to become fans of a certain user, this user is really
recognized by the community s/he belongs to. We
assume that when a user becomes a fan of someone else,
s/he already knows this other person (in our case through his/her
written activities). We realize that this threshold is not general
and that it may change on another dataset, but we had to make a
choice, knowing the type of our data, in order to evaluate our
model.
Some of the users removed their HuffingtonPost account
between the time they were writing in forums and the
time we were extracting the number of fans per user. We
decided to keep them in the dataset but not to consider them as
celebrities. In our opinion, a celebrity should remain active
through time, so if a user has decided to remove his/her
account s/he cannot be a celebrity.
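The labelling rule described above can be sketched as follows. The fan counts are hypothetical and serve only as an illustration; a removed account is represented by None, which is our own convention. Computing the quartiles over the remaining accounts only is also an assumption of this sketch.

```python
from statistics import quantiles

FAN_THRESHOLD = 800  # minimum number of fans chosen in the text

def is_celebrity(fans):
    """Label a user from the fan count; None marks a removed account."""
    return fans is not None and fans >= FAN_THRESHOLD

# Hypothetical fan counts, for illustration only (not our data).
fans = {"ann": 9061, "bob": 800, "cat": 159, "dan": None, "eve": 39}
labels = {user: is_celebrity(count) for user, count in fans.items()}

# Quartiles over the accounts that still exist, as one way to inspect
# the distribution before fixing a threshold.
known = [count for count in fans.values() if count is not None]
q1, median, q3 = quantiles(known, n=4)
print(labels, q1, median, q3)
```

With this rule a user with exactly 800 fans is labelled as a celebrity, and a user whose account was removed is never labelled as one, whatever his/her former activity.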
We also calculate the precision at a cut-off value n of the
ranking [17]. The precision@n is appropriate here because we
want to evaluate the n best ranked users rather than the whole
dataset. The formula for the precision@n is the following:
$$\text{precision@}n = \frac{\text{correctly assigned celebrity users in } n \text{ ranked users}}{n},$$
where n is a cut-off value, discussed in the next section.
Fig. 1. ROC curves for the three meta-criteria for the whole dataset.
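A minimal sketch of this measure is given below. It follows the formula literally, dividing by n, and assumes that the ranking contains at least n users; the function name and the toy data are hypothetical.

```python
def precision_at_n(ranked_users, celebrities, n):
    """Fraction of the n best ranked users that are true celebrities."""
    top = ranked_users[:n]
    hits = sum(1 for user in top if user in celebrities)
    return hits / n

# Hypothetical ranking (best first) and validated celebrities.
ranked_users = ["ann", "bob", "cat", "dan"]
celebrities = {"ann", "cat"}

print(precision_at_n(ranked_users, celebrities, 4))
print(precision_at_n(ranked_users, celebrities, 1))
```

Only the order of the first n users matters, so two rankings that differ beyond position n obtain the same precision@n.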
VI. RESULTS
Figure 1 shows the ROC curve for the three meta-criteria
for the whole dataset. Note that the tendency is the same for
the three topics.
It is surprising, at first, to see that the ROC curves of our
three meta-criteria become straight lines. The reason is that the
meta-criteria extract a sub-population of the users that satisfies
all the criteria. These users are then ranked by their activity
(meta-criteria 1), by additionally considering the number of forums
a user has participated in (meta-criteria 2), and by including the
name citations and the text quotations (meta-criteria 3). This means
that if a user does not satisfy the meta-criteria his/her ranking
score is zero. Thus, if we look at the whole population, only 8% of
the population satisfies the meta-criteria 1 and 2, and 92% has
a score of 0 so they are all ranked in the same place.
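The effect of these tied scores on the ROC curve can be reproduced with a small sketch. The function below is our own illustrative implementation, not the tool used to draw Figure 1: it treats all the users with the same score as one group, so the whole zero-score group adds a single point to the curve. It assumes at least one celebrity and one non-celebrity; the scores and labels are hypothetical.

```python
def roc_points(scores, labels):
    """Return the (FPR, TPR) points, one per distinct score value."""
    pos = sum(labels.values())
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    for threshold in sorted(set(scores.values()), reverse=True):
        # All users tied at this score enter the ranking together.
        for user, score in scores.items():
            if score == threshold:
                if labels[user]:
                    tp += 1
                else:
                    fp += 1
        points.append((fp / neg, tp / pos))
    return points

# Hypothetical data: four of the six users have a score of 0.
scores = {"ann": 12.0, "bob": 10.0, "cat": 0, "dan": 0, "eve": 0, "fay": 0}
labels = {"ann": True, "bob": False, "cat": True,
          "dan": False, "eve": False, "fay": False}
points = roc_points(scores, labels)
print(points)
```

In this example the curve has only four points, and the last segment, from (0.25, 0.5) to (1.0, 1.0), covers all the users with a score of 0. With 92% of the population tied at 0, most of the curve is such a straight segment.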
Hence, we want to evaluate our model for the top n ranked
users. This is why, in order to better understand the
situation, we “zoomed” in on the ROC curves of Figure
1 for the top-n best ranked users. The idea behind this is to
see whether our model performs better for the best ranked users,
even if it leaves out some celebrities. In order to zoom, we take
the number of users who satisfy all the criteria, e.g., 1,207 for
all topics. Then, we zoom in on these users for each meta-criteria
and the baseline-criterion (the number of posts).
Figure 2 shows the results after the zoom. The meta-criteria
2 gives better results than all the other meta-criteria and the
baseline for two topics and the whole dataset. This means that
the number of forums in which a user participates plays a role
in the celebrity recognition, since the ranking by the number
of forums improved the celebrity identification. The
meta-criteria 2 is clearly better than the baseline for the media
topic, followed by the meta-criteria 3. Indeed, considering
the name citations and the text quotations seems to give
good results for the media topic. We still believe that this
information has to be taken into account.
Fig. 2. Zoom in on the interesting part of the ROC curves.
We choose n = 20 for our experiments to compute the
precision@n. We assume that a list of 20 users is adequate
if our task is to present the few top celebrity users to a new
user of a news website. These top users would represent the
ones who are recognised by the forum community that the new
user is planning to join. Table IV shows the results
for the precision@20.
TABLE IV
PRECISION@20 FOR EACH TOPIC AND THE WHOLE DATASET.

| | Politics | Living | Media | All |
|---|---|---|---|---|
| #Posts | 0.579 | 0.667 | 0.632 | 0.745 |
| Meta-criteria 1 | 0.579 | 0.667 | 0.632 | 0.708 |
| Meta-criteria 2 | 0.737 | 0.55 | 0.895 | 0.755 |
| Meta-criteria 3 | 0.722 | 0.4 | 0.818 | 0.692 |

Concerning the precision@20, the meta-criteria 2 improves
the results for all experiments except for the living topic. The
number of forums in which a user participates seems to be a good
indicator for the celebrity recognition task. The meta-criteria
1 does not give any improvement compared to the
baseline-criterion. The ranking by the number of posts seems not to
be the most appropriate ranking for our task. The addition
of the name citations and the text quotations improves the
results compared to the baseline in two topics. Nevertheless,
the results for the living topic are quite bad.
We identified an issue with the algorithm that extracts
the name citations [16]. For example, a
user who uses the pseudonym “obamaa1” will be identified as
a celebrity because of the name “Obama” mentioned several
times inside the comments of the forum. It is obvious, though,
that the users write about the politician and not about the user
who chose this pseudonym. In order to make sure that a word is
a name citation, our algorithm checks two conditions (cf. [16]).
In this case, however, we face a false positive.
The reason why the meta-criteria 2 gives poor results for
the living topic may be that it is a very small community. This
is not, though, a sufficient explanation and further investigation
is needed.
In the next section we discuss the results obtained
and put into perspective the celebrity social role in the news
website context.
VII. DISCUSSION & PERSPECTIVES
a) A celebrity is not just a prolific poster: if it were, the
results of the baseline would be better for each topic and for
the whole dataset. As we could see, we obtained better results
for the Meta-criteria 2. This leads us to the conclusion that
even if the number of posts is an essential criterion, it is not
sufficient to identify the celebrities in an online discussion.
b) A celebrity is more global than local (to a topic): in
the beginning we expected the meta-criteria 3 to
give better results for the topics than for the whole dataset.
We expected that the users participate more in their
subject of interest than in the whole website, e.g., it is not the
same thing to participate in the politics topic and in the living
topic. Surprisingly, it was the opposite. The results were better
for the whole dataset than for the separate topics. So, in the
news forum debates, a celebrity seems to be more global to
the website and less specialised than in a newsgroup.
c) A celebrity in a news website is different from a
celebrity in a Usenet newsgroup: as we saw, the meta-criteria
2 improved the results. This means that the number of forums
in which a user participates is important. This is easy to
understand: the more forums a user is present on,
the more chances s/he has to meet other users. In other words,
if a user participates only in one forum s/he can only meet
users that also participate in that forum, but if s/he participates
in several s/he “meets” many other users. This automatically
increases the chances of being recognised.
d) A celebrity should be cited and quoted but this is
not sufficient: the text quotations and the name citations
improve some results on two topics compared to the baseline.
Thus, we still think that this information has to be
considered in the celebrity recognition task. Future
experiments will allow us to decide whether this is a good
choice or not. Furthermore, we would like to test
different values for the weights $w_{Meta-criteria2}$, $w_c$ and
$w_q$ explained in section III.
e) A celebrity should post interesting messages: Golder
and Donath, in their anthropological study, read the posts
written by the users. This information is helpful to determine
if a user is a celebrity or not. However, this is difficult to
do in an automatic way and, as our experiments showed, the text
quotation seems not to be sufficient on its own to recognise the
celebrity. In order to improve the recognition of interesting
posts, we are considering using an opinion extraction model [18]
or measuring the average length of a thread initiated by a user,
i.e., automatically determining whether a post is interesting for
the community or not.
VIII. CONCLUSION
In conclusion, this paper provides three contributions: the
theoretical formalisation of the celebrity social role inspired
by previous anthropological studies, experiments based on
this framework on a new type of data (using three different
meta-criteria and a baseline) and a discussion regarding the
recognition of the celebrity social role, focusing on the forum
debates on news websites.
As we saw, the number of posts, in the baseline and in
the ranking for the meta-criteria 1, is not sufficient to find
the celebrities in news website forums. However, ranking by
the number of forums in which a user has participated
(meta-criteria 2) improved the results. This shows
that the social role extraction can only be thought of as a
function of the context of the interaction, i.e., of the data
analysed.
We evaluated our model in two ways: we created the
ROC curves, zooming in on the top ranked users, and we
calculated the precision@n. We are more interested
in ranking the first users well than the global dataset. The
precision@20 for the meta-criteria 2 gave better results than
the baseline and the two other meta-criteria, except on the
living topic. So far, our experimental results are encouraging,
and further experiments are to be carried out in the future.
REFERENCES
[1] E. Gleave, H. Welser, T. Lento, and M. Smith, “A conceptual and operational definition of social role in online community,” in 42nd Hawaii International Conference on System Sciences. IEEE, 2009, pp. 1–11.
[2] E. Goffman, The presentation of self in everyday life. New York: Anchor, 1959.
[3] S. Golder and J. Donath, “Social roles in electronic communities,” Internet Research, vol. 5, pp. 19–22, 2004.
[4] M. Forestier, A. Stavrianou, J. Velcin, and D. A. Zighed, “Roles in social networks: Methodologies and research issues,” Web Intelligence and Agent Systems, vol. 10, no. 1, pp. 117–133, 2012.
[5] E. Airoldi, D. Blei, S. Fienberg, and E. Xing, “Mixed membership stochastic blockmodels,” The Journal of Machine Learning Research, vol. 9, pp. 1981–2014, 2008.
[6] P. Doreian, V. Batagelj, and A. Ferligoj, Generalized blockmodeling. Cambridge Univ Pr, 2005.
[7] S. Wasserman and C. Anderson, “Stochastic a posteriori blockmodels: Construction and assessment,” Social Networks, vol. 9, no. 1, pp. 1–36, 1987.
[8] S. Borgatti and M. Everett, “Notions of position in social network analysis,” Sociological methodology, vol. 22, pp. 1–35, 1992.
[9] D. White and K. Reitz, “Graph and semigroup homomorphisms on networks of relations,” Social Networks, vol. 5, no. 2, pp. 193–234, 1983.
[10] N. Agarwal, H. Liu, L. Tang, and P. S. Yu, “Identifying the influential bloggers in a community,” in WSDM ’08: Proceedings of the international conference on Web search and web data mining. New York, NY, USA: ACM, 2008, pp. 207–218.
[11] P. Domingos, “Mining social networks for viral marketing,” IEEE Intelligent Systems, vol. 20, no. 1, pp. 80–82, 2005.
[12] C. Campbell, P. Maglio, A. Cozzi, and B. Dom, “Expertise identification using email communications,” in Proceedings of the twelfth international conference on Information and knowledge management. ACM, 2003, pp. 528–531.
[13] J. Zhang, M. Ackerman, and L. Adamic, “Expertise networks in online communities: Structure and algorithms,” in Proc. of the 16th International conference on World Wide Web, 2007, pp. 221–230.
[14] H. Welser, G. Kossinets, S. Marc, and D. Cosley, “Finding social roles in wikipedia,” in annual meeting of the American Sociological Association, Boston, MA, AllAcademic, 2008.
[15] G. H. Mead, Mind, Self, and Society: From the Standpoint of a Social Behaviorist. Chicago: University of Chicago Press, 1934.
[16] M. Forestier, J. Velcin, and D. Zighed, “Extracting social network to understand interaction,” International Conference on Advances in Social Networks Analysis and Mining, pp. 213–219, 2011.
[17] G. Salton and M. Lesk, “Computer evaluation of indexing and text processing,” Journal of the ACM, vol. 15, no. 1, pp. 8–36, 1968.
[18] A. Stavrianou, J. Velcin, and J. Chauchat, “Definition and Measures of an Opinion Model for Mining Forums,” in International Conference on Advances in Social Network Analysis and Mining, 2009. IEEE, 2009, pp. 188–193.