[IEEE 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012) - Istanbul (2012.08.26-2012.08.29)] 2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining - Extracting Celebrities from Online Discussions


  • Extracting Celebrities from Online Discussions

    Mathilde Forestier, Julien Velcin
    Eric Laboratory, University of Lyon
    5, avenue Pierre Mendes France
    69676 Bron Cedex, France
    Phone: 00 33 4 78 77 31 54

    Anna Stavrianou
    Xerox Research Centre Europe
    6 chemin de Maupertuis, 38240 Meylan, France
    Phone: 00 33 4 76 61 50 35
    anna.stavrianou@xrce.xerox.com

    Djamel Zighed
    Eric Laboratory, University of Lyon
    5, avenue Pierre Mendes France
    69676 Bron Cedex, France
    Phone: 00 33 4 78 77 31 54


    Abstract. Online discussions became increasingly widespread with the Web 2.0: no matter the distance, whether you know the person or not, you can discuss and exchange ideas with people all over the world through forums, blogs, and newsgroups. News websites have used forums extensively in order to encourage readers to be real participants in the information media. This paper aims at automatically extracting the celebrities from such discussions. We propose certain meta-criteria and we provide an evaluation on a dataset of 35,175 posts written by 14,443 users. The results show that one of the proposed meta-criteria succeeds in extracting celebrities and allows for further improvements.

    I. INTRODUCTION

    The Web 2.0 has introduced an incredibly simple way of interaction: regardless of the distance or whether you know the person or not, you can discuss with people from all over the world through the Web 2.0 applications (blogs, e-mails, dedicated media, etc.). In particular, forum debates on news websites are a very representative case of interaction between people using the Web 2.0.

    In this kind of interaction, as in real life, people play a social role [1]. Goffman [2] defines the social role as the enactment of rights and duties attached to a given status. In other words, people have some regularities in their behaviour and this behaviour can be analysed in order to figure out the social role. Golder and Donath [3] study the Usenet newsgroup and they define the celebrity social role in order to represent people who are recognised in and by their community.

    In this paper we provide three major contributions: the theoretical formalisation of the celebrity social role, inspired by and based on previous anthropological studies; the experiments of this theoretical framework on data extracted from a news website using three different meta-criteria and a baseline-criterion; and a discussion about the social role extraction and its relation with the kind of data we deal with.

    The paper continues as follows. First, we discuss the related work concerning the topic of social role recognition in online media. Then, we present the theoretical framework based on the study of Golder and Donath [3], extended with three meta-criteria to take into account the special features of the forum debates on news websites. We continue by presenting the experimental framework and the dataset we used, and we discuss the evaluation of the different criteria applied. We conclude by commenting on the results and discussing the outcome.


    II. RELATED WORK

    The analysis of the social role has become an important subject of research for researchers in humanities and computer science. The existing works about the extraction of social roles can be separated into two areas: the identification of explicit and of non-explicit roles [4].

    A. The non-explicit roles

    The non-explicit roles refer to social roles that are not pre-defined but extracted from the data using statistical and probabilistic measures, e.g., blockmodels [5][6][7]. These roles can also be understood as positions [8], e.g., a secretary or a manager in a firm. Blockmodels are based on a social network of relations. Using a pre-defined equivalence, the aim of blockmodeling is to sort people into blocks of 0/1 as a function of the pre-defined equivalence [8][9]. Several equivalences exist depending on the research object: the structural equivalence, the regular equivalence, the strong equivalence and the automorphic equivalence. The structural equivalence aims at grouping actors with similar interests, while the regular and the automorphic equivalences are closer to the sociological notion of the social role: people with the same role have to be linked with people sharing the same role.

    B. The explicit roles

    The explicit roles refer to predefined roles that could also be found in virtual discussions. Two basic roles are most commonly dealt with by the research community: the influencer [10][11] and the expert [12][13]. The influencer could influence other people in the social network, and thus this role is interesting for viral marketing or for advertisement placing. The expert is someone who knows his subject and may provide help to other people, e.g., in the Java Forum [13].

    Golder and Donath [3] carried out an anthropological study and identified six social roles: the celebrity, the newbie, the lurker, the flamer, the troll and the ranter. They defined the ranter as a social role that could appear to be statistically similar to celebrities. Actually, the ranter is a prolific writer, [who] posts with great frequency on a particular issue or issues and is unique in his or her lengthy posts and single mindedness. Ranters exhibit some troll-like characteristics [...].

    2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining

    978-0-7695-4799-2/12 $26.00 © 2012 IEEE
    DOI 10.1109/ASONAM.2012.61




  • C. Discussion about the related work

    The social role study is quite an old domain of research in social sciences [2][15]. Nevertheless, as we saw in this section, the social role is studied nowadays on the new platforms of interaction, e.g., databases of emails, Usenet newsgroups, etc.

    We based our analysis on the work in the literature that is closest to ours: the work of Golder and Donath [3], and on one of the social roles they identified, that of the celebrity. They define the celebrity as the prototypical central figure. Celebrities are prolific posters who spend a great deal of time and energy contributing to their newsgroup community. Because they post so often, everybody knows them. Golder and Donath study this social role on Usenet newsgroups. Our data, i.e., the forum debates on news websites, have some special features that differ from the Usenet newsgroups. Even if we separate the topics of the forum debates (each news website contains topics under which the articles are classified, e.g., politics, economics, etc.), the user communities extracted from these topics are larger than most Usenet newsgroups. Thus, in this paper, we formalize and adapt Golder and Donath's framework to our data.


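    III. THEORETICAL FRAMEWORK

    Let us represent a set of forum discussions as a directed graph $G = (V, E)$ whose nodes represent forum participants and whose edges show who replies to whom. The edges are weighted and the weight shows the number of posts between two connected users. For a node $v \in V$, the in-degree is $\deg^-_G(v)$ and the out-degree is $\deg^+_G(v)$; the weighted in-degree is $\deg^{w-}_G(v)$, and an overline, as in $\overline{\deg^-_G}$, denotes the average over all nodes.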


    We also denote the set of posts authored by a user who is represented by the node $v$ as $p(v)$, the total existing posts as $p$ and the average number of posts in the specific set of discussions as $\bar{p}$. Finally, the posts that belong to the set of threads that are not initiated by the user $v$ will be denoted as $thr(v)$.

    In Table I we present the characteristics (aggregated) defined by Golder and Donath and we show the respective formalized criteria that identify the post-reply behaviour of the user within a discussion.

    We have applied the criteria to our set of data in three different ways. The constants presented in Table I are also clarified in the meta-criteria described below.
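    The graph notation above can be made concrete with a minimal Python sketch. It assumes the reply structure is available as (replier, replied-to) pairs, one per reply post, and that an edge points from the replier to the person replied to; the function names and the input format are illustrative and not taken from the paper.

```python
from collections import defaultdict


def build_reply_graph(replies):
    """Aggregate (replier, replied_to) pairs into weighted directed edges.

    The weight of an edge is the number of reply posts from the first
    user to the second (assumed edge direction: replier -> replied-to).
    """
    weight = defaultdict(int)
    for src, dst in replies:
        weight[(src, dst)] += 1
    return dict(weight)


def degrees(weight):
    """Return in-degree, out-degree, weighted in-degree, weighted out-degree."""
    in_deg, out_deg = defaultdict(int), defaultdict(int)
    w_in, w_out = defaultdict(int), defaultdict(int)
    for (src, dst), w in weight.items():
        out_deg[src] += 1   # one more distinct user that src replies to
        in_deg[dst] += 1    # one more distinct user replying to dst
        w_out[src] += w     # number of reply posts written by src
        w_in[dst] += w      # number of reply posts received by dst
    return dict(in_deg), dict(out_deg), dict(w_in), dict(w_out)


replies = [("bob", "alice"), ("bob", "alice"), ("carol", "alice"), ("alice", "bob")]
graph = build_reply_graph(replies)
in_deg, out_deg, w_in, w_out = degrees(graph)
```

    In this toy input, alice is replied to by two distinct users and receives three reply posts in total, which is the difference between the in-degree and the weighted in-degree.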

    A. Meta-criteria 1

    All the criteria presented in Table I have to be satisfied in order to choose a potential celebrity. The constants are given certain values. For example, the in-degree and out-degree values are chosen to be higher than 2 if, in the dataset, the average in- and out-degree is lower than 2. The threshold and constant values can always change according to the data and to how selective we are.

    The interesting people are identified according to the pseudo-code in Table II. Once the interesting people are identified, they are ranked in descending order by the total number of their posts. The pseudo-code is applied in the same way per topic as well as for the whole dataset.


    TABLE II

    Pseudo-code ($\forall v \in V$)                               Comment
    if $|p(v)| > \bar{p}$                                         number of posts higher than the average
    and $\deg^-_G(v) > \overline{\deg^-_G}$                       in-degree higher than average in-degree
    and $\deg^{w-}_G(v) > \overline{\deg^{w-}_G}$                 weighted in-degree higher than average weighted in-degree
    and $\deg^+_G(v) > \overline{\deg^+_G}$                       out-degree higher than average out-degree
    and $\exists p \in p(v) : p \in thr(v)$                       the user v has participated in threads s/he has not initiated
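    The pseudo-code of Table II, followed by the ranking by number of posts, could be sketched in Python as follows. This is only an illustration under stated assumptions: the `UserStats` record and its field names are hypothetical, the averages are taken over all users passed in, and the per-user statistics are assumed to have been computed beforehand.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class UserStats:
    n_posts: int              # |p(v)|
    in_deg: int               # in-degree
    w_in_deg: int             # weighted in-degree
    out_deg: int              # out-degree
    in_foreign_thread: bool   # posted in a thread started by someone else


def meta_criteria_1(stats):
    """Return the users passing every test of Table II, ranked by post count."""
    avg_posts = mean(s.n_posts for s in stats.values())
    avg_in = mean(s.in_deg for s in stats.values())
    avg_w_in = mean(s.w_in_deg for s in stats.values())
    avg_out = mean(s.out_deg for s in stats.values())
    selected = [
        user for user, s in stats.items()
        if s.n_posts > avg_posts
        and s.in_deg > avg_in
        and s.w_in_deg > avg_w_in
        and s.out_deg > avg_out
        and s.in_foreign_thread
    ]
    # Rank in descending order by the total number of posts.
    return sorted(selected, key=lambda u: stats[u].n_posts, reverse=True)


stats = {
    "a": UserStats(10, 4, 6, 4, True),
    "b": UserStats(9, 4, 6, 4, False),   # only posts in own threads
    "c": UserStats(1, 0, 0, 0, True),
    "d": UserStats(1, 0, 0, 0, True),
    "e": UserStats(8, 3, 5, 3, True),
}
ranking = meta_criteria_1(stats)
```

    Users who fail any single test are left out of the ranking altogether, which corresponds to the score of 0 mentioned later in the text.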

    B. Meta-criteria 2

    In this case, the criteria of Meta-criteria 1 are all taken into account in the same way. The ranking, though, is done in a different way: by taking into account the participation of the user in different forums. The participation is based on the number of forums in which a user $v$ participates, $|forums(v)|$. As a result, the final ranking is done in descending order according to the user's average forum participation multiplied by the number of posts:

    $participation(v) \times |p(v)|$, where $participation(v) = \frac{|forums(v)|}{\max(|forums|)}$
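    As a rough illustration of this ranking score, the sketch below computes the participation ratio times the number of posts for a set of already selected users. It assumes that max(|forums|) is the largest number of forums any single user takes part in, which is one possible reading of the formula; the function and argument names are hypothetical.

```python
def meta_criteria_2_scores(selected, forums, n_posts):
    """Score each selected user by participation(v) * |p(v)|.

    forums:  dict user -> set of forum ids the user posted in
    n_posts: dict user -> number of posts
    Assumption: max(|forums|) is the maximum over all users in `forums`.
    """
    max_forums = max(len(f) for f in forums.values())
    return {u: len(forums[u]) / max_forums * n_posts[u] for u in selected}


forums = {"a": {1, 2}, "e": {1, 2, 3, 4}, "c": {1}}
n_posts = {"a": 10, "e": 8, "c": 1}
scores = meta_criteria_2_scores(["a", "e"], forums, n_posts)
ranking = sorted(scores, key=scores.get, reverse=True)
```

    Here user e overtakes user a despite having fewer posts, because e is active in twice as many forums.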

    C. Meta-criteria 3

    In this meta-criteria we additionally take into account the citations. As Golder and Donath specify, a celebrity is cited even if s/he has not participated in the discussion. We identify two citation types: the citations of names and the quotations of texts. Regarding the identification of celebrities, citing the name of a user is considered more important than quoting the respective text because it shows that the user is known and probably well respected. In this meta-criteria we consider the ranking of Meta-criteria 2 and we differentiate it by adding the number of citations and quotations:

    $|citations(v)| \times w_c + |quotations(v)| \times w_q + (participation(v) \times |p(v)|) \times w_{Metacriteria2}$

    where $w_c$ is the weight we decide to give to the citations, $w_q$ is the weight we give to the quotations and $w_{Metacriteria2}$ is the weight we give to the ranking calculation of Meta-criteria 2. The different weights are proportional to each other according to what we give more importance to.
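    The weighted sum can be written as a small Python function. The default weight values below are placeholders chosen only so that name citations count more than quotations, as the text suggests; the paper does not state the values that were actually used.

```python
def meta_criteria_3_score(n_citations, n_quotations, mc2_score,
                          w_c=2.0, w_q=1.0, w_mc2=1.0):
    """Weighted sum of name citations, text quotations and the Meta-criteria 2 score.

    The default weights are hypothetical placeholders (w_c > w_q).
    """
    return n_citations * w_c + n_quotations * w_q + mc2_score * w_mc2


# A user cited by name 3 times, quoted twice, with a Meta-criteria 2 score of 5.0.
score = meta_criteria_3_score(3, 2, 5.0)
```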

    After applying all three meta-criteria we end up with three different rankings of users depending on their likelihood of being a celebrity. Note that not all the users are ranked, but only the users who satisfy the meta-criteria. The rest of the users are considered to have a score of 0 for the meta-criteria.

    TABLE I

    Characteristics: High contribution in the discussion (large volume of posts) and a magnitude variation in posting frequency.
    Post-reply criteria: The number of posts per author should be higher than the average.
    Formalized criteria: $|p(v)| > \bar{p}$, for $v \in V$

    Characteristics: Communicative skills (not just a robot that sends messages).
    Post-reply criteria: High in-degree and out-degree values.
    Formalized criteria: $\deg^-_G(u) > \alpha$ and $\deg^+_G(u) > \beta$, where $\alpha, \beta \in \mathbb{N}^+$ are constants

    Characteristics: Not a Ranter.
    Post-reply criteria: Participation in threads not initiated by the same person and higher in-degree value than the average.
    Formalized criteria: $\exists p \in p(v) : p \in thr(v)$ and $\deg^-_G(v) > \overline{\deg^-_G}$, for $v \in V$

    IV. EXPERIMENTAL FRAMEWORK

    We saw in the previous section the theoretical framework. In order to experiment with it, we have created an experimental framework, as explained in this section.

    Initially, by using a parser we extract the forum debates and the respective topics from a news website. The parser aims at extracting the user posts, the users' pseudonyms and the structural relation, i.e., the relation based on the "reply to" structure of the forums. We then extract the citation relations, i.e., when a user cites another user within his post, and the quotation relations, i.e., when a user quotes a part of a previous post in his post [16]. All this information is saved into a database. Finally, we use the meta-criteria defined in section III and a baseline to rank the users by their score.

    We test these three meta-criteria on the American version of the HuffingtonPost. We parsed and analysed 57 forums on three topics: politics, media and living.
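    To give an idea of what extracting a citation relation may involve, here is a deliberately naive Python sketch that looks for other users' pseudonyms in the text of a post. It is not the extraction method of [16], which is not described here; whole-token, case-insensitive matching and the function name are assumptions made for illustration.

```python
import re


def find_name_citations(author, text, pseudonyms):
    """Return the set of other users whose pseudonym appears in `text`.

    Naive assumption: a citation is a case-insensitive occurrence of the
    pseudonym that is not embedded in a longer word-like token.
    """
    cited = set()
    for name in pseudonyms:
        if name == author:
            continue  # a user mentioning their own name is not a citation
        pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            cited.add(name)
    return cited


post = "I agree with Bob_42, unlike bobby."
cited = find_name_citations("alice", post, ["alice", "Bob_42", "bob"])
```

    A real system would also have to deal with misspelled or abbreviated pseudonyms and with pseudonyms that are common words, which this sketch ignores.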



    TABLE III

                       Politics    Media     Living    All
    #Users             4,547       5,974     3,667     11,443
    #Posts             12,725      14,176    8,274     35,175
    #Posts / #Users    2.8         2.4       2.3       3.1
    #Citations         350         461       183       994
    #Quotations        117         153       146       476

    Table III describes the data. We can see that the media community is the biggest one, but it does not have the highest ratio of posts per user. The living community is the smallest one. The ratio of posts per user is higher for the whole community (i.e., 3.1) than for the topic ones (i.e.,

  • Fig. 1. ROC curves for the three meta-criteria for the whole dataset.

    where n is a cut-off value, discussed in the next section.


    Figure 1 shows the ROC curve for the three meta-criteria for the whole dataset. Note that the tendency is the same for the three topics.

    It is surprising, at first, to see that the ROC curves of our three meta-criteria become straight. Actually, the meta-criteria help us to extract a sub-population of the users that satisfies all the criteria. Then, they are ranked by their user activity (meta-criteria 1), adding as a condition the number of forums a user has participated in (meta-criteria 2), and including the name citations and the text quotations (meta-criteria 3). This means that if a user does not satisfy the meta-criteria, his/her ranking score is zero. Thus, if we look at the whole population, only 8% of the population satisfy meta-criteria 1 and 2, and 92% have a score of 0, so they are all ranked in the same place.

    Hence, we want to evaluate our model for the top n ranked users. This is why, in order to have a better understanding of the situation, we zoomed in on the ROC curves of Figure 1 for the top-n best ranked users. The idea behind this is to see if our model is better for the best ranked users even if we leave out some celebrities. In order to zoom, we choose the number of users who satisfy all the criteria, e.g., 1,207 for all topics. Then, we zoomed in on these users for each meta-criteria and the baseline-criterion (the number of posts).

    Figure 2 shows the results after the zoom. Meta-criteria 2 gives better results than all the other meta-criteria and the baseline for two topics and the whole dataset. This means that the number of forums in which a user participates plays a role in the celebrity recognition, since the ranking by the number of forums improved the celebrity identification. Meta-criteria 2 is clearly better for the media topic compared to the baseline, followed by meta-criteria 3. Actually, considering the name citations and the text quotations seems to give some good results for the media topic. We still believe that this information has...
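    The straight portion of the ROC curves discussed above follows directly from tied scores, which a small Python sketch can show. It assumes a hypothetical ground truth (a boolean celebrity label per user) and treats all users sharing a score as a single threshold step; neither the labels nor the function name come from the paper.

```python
def roc_points(scores, labels):
    """Return ROC points (false positive rate, true positive rate).

    scores: dict user -> ranking score (0 for users failing the meta-criteria)
    labels: dict user -> True if the user is a celebrity (hypothetical ground truth)
    Users with equal scores are added together, producing one straight segment.
    """
    pos = sum(1 for is_celebrity in labels.values() if is_celebrity)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative label")
    points = [(0.0, 0.0)]
    tp = fp = 0
    for threshold in sorted(set(scores.values()), reverse=True):
        for user, score in scores.items():
            if score == threshold:
                if labels[user]:
                    tp += 1
                else:
                    fp += 1
        points.append((fp / neg, tp / pos))
    return points


scores = {"a": 5.0, "b": 3.0, "c": 0, "d": 0, "e": 0, "f": 0}
labels = {"a": True, "b": False, "c": True, "d": False, "e": False, "f": False}
points = roc_points(scores, labels)
```

    In this toy example the four users with a score of 0 form one tie group, so the curve jumps from (0.25, 0.5) to (1.0, 1.0) in a single straight line, which is the effect observed when most of the population is unranked.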

