data mining for analyzing social media
TRANSCRIPT
06/05/2013
Data mining for analyzing social media
Social networks, video/picture sharing, opinions, news websites, blogs, knowledge sharing, microblogging
Presentation: J. Velcin, http://mediamining.univ-lyon2.fr/people/velcin
Seminar at Dalhousie University – 4/18/2013 – Julien Velcin
Ecosystem of ERIC Lab
- Lyon
- BSc & MSc degrees
- BI, data mining, statistics
- 2 teams: SID & DMD
- Academics
- Companies
Research landscape
[Diagram labels: Data, ETL, Data warehouse, Online analysis, Data mining, Knowledge, Decision, Complex data integration, Multidimensional modeling, Data Mining & Decision (DMD)]
Data Mining & Decision (DMD)
Goal: coping with complex data
- e.g. social media: heterogeneous, voluminous, interconnected, evolving
- Sources: social networks, microblogging, video/picture sharing, opinion sharing, news websites, blogs, knowledge sharing
- Practical issues and approaches: recommendation, summarization, information retrieval, multicriteria analysis, machine learning, graph analysis, complex data analysis, topological learning, text mining
Outline
- The big picture
- Modeling and analyzing online discussions
- Semi-supervised clustering
- Focus on Project ImagiWeb
- Future lines of research
Section 1 The big picture
From facts to people: the essential role of media
- A long-standing question
- Social representation through the media [Lippman,22] [Moscovici,76] [Newman and Block,06]
- Digital monitoring of the Web [Chateauraynaud,03]
[Figure label: Public event]
Information overload
[Infographic; image credit: Go-Globe.com]
Data journalism
- Crucial need to capture the meaning of the voluminous data produced by modern social media, in order to design new search engines
- In particular (MSND workshop @ WWW'12):
  - “How to surface the best comments, videos and pictures from a variety of sources in real time and then how to verify them?”
  - “How to quickly surface the best comments and work out which ones are worth investigating further?”
  - “How to identify quickly the key influencers on any particular story, so they can get inside information or interview them for their news outlets?”
Salvaged by (media) curation?
- Term originating from art, appears around 2011
- Three-step process:
  - Aggregate: gathering
  - Editorialize: sorting, categorizing, summarizing, presenting…
  - Disseminate: contextualizing, sharing
- Important role of the curator
- Difference between “full curation” and automatic editing (e.g., paper.li)
- Many platforms (Scoop.it!, Storify, Storiful, Hopflow, Stumbleupon, Patch…): http://socialcompare.com/fr/comparison/curation-platforms-amplify-knowledge-plaza-storify
[Rosenbaum,11]
A case study: the “HuffPost”
- Linked with social networks
- Topically indexed
- Available on various devices
- Commented news
- Community of bloggers
- Journalists can play the roles of both curator and community manager
Section 2 Modeling and analyzing online discussions
Online discussions
- Motivation:
  - Abundant data available, often underused
  - Crucial for gauging people's opinions
- Contributions:
  - Recommending key messages [Stavrianou et al.,09,10]
  - Extracting the latent social network [Forestier et al.,11]
  - Detecting celebrities in online forums [Forestier et al.,12]
  - Surfacing roles with unsupervised mechanisms [Anokhin et al.,12]
Anatomy of an online discussion
[Figure: discussion thread with messages posted by participants A, B, C and D]
Recommending key messages
- “Interesting” message: popular, opinionated, pioneering, etc.
- Formalization of 6 criteria + simple aggregation
- Comparison with manually labelled data on 8 French forums
- Results for the a priori evaluation:
  - F1-measure ranges from 0.2 to 0.3 for a single criterion
  - F1-measure equals 0.48 for the aggregated criteria (simple mean)
- Results for the a posteriori evaluations: [results figure]
[Stavrianou et al.,09,10]
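The aggregation and evaluation steps mentioned on this slide can be sketched as follows. This is a minimal illustration, assuming each of the six criteria has already been scored in [0, 1] for every message; the scores, the 0.5 threshold and all function names are hypothetical and not taken from the cited papers.

```python
def aggregate(scores):
    """Simple mean of the per-criterion scores of one message."""
    return sum(scores) / len(scores)

def recommend(messages, threshold=0.5):
    """Ids of messages whose aggregated score reaches the (assumed) threshold."""
    return {mid for mid, scores in messages.items() if aggregate(scores) >= threshold}

def f1_measure(predicted, gold):
    """F1 between the recommended set and the manually labelled set."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)

# Hypothetical scores for the 6 criteria of three messages.
messages = {
    "m1": [0.9, 0.8, 0.7, 0.6, 0.9, 0.5],
    "m2": [0.1, 0.2, 0.0, 0.3, 0.1, 0.2],
    "m3": [0.6, 0.4, 0.7, 0.5, 0.6, 0.5],
}
gold = {"m1", "m2"}  # hypothetical manual labels
selected = recommend(messages)
score = f1_measure(selected, gold)
```

Averaging keeps every criterion on an equal footing; a weighted mean would be a natural variant if some criteria proved more reliable than others.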
Extracting the (latent) social network
- Latent SN = reply-to links + name citation + text quotation
- Name citation: misspellings, compound names, abbreviations… (what about “obama49”?)
  - Our solution: edit distance, Soundex, PoS tagging to detect nouns
- Text quotation: copy-paste without quotation marks, rephrasing…
  - Our solution: string matching, locality principle (comparing nearby messages), use of quotation marks when provided
[Forestier et al.,11]
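Two of the ingredients listed on this slide, edit distance for name citations and string matching restricted to nearby messages for text quotations, can be sketched as below. Soundex and PoS tagging are left out, and the tolerance `max_dist`, the `window` size and the minimum passage length `min_len` are assumptions made for the example, not values from the cited work.

```python
from difflib import SequenceMatcher

def edit_distance(a, b):
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def cited_users(message, usernames, max_dist=1):
    """Usernames cited in a message, tolerating small misspellings."""
    words = [w.strip(".,;:!?\"'()").lower() for w in message.split()]
    return {u for u in usernames
            if any(edit_distance(w, u.lower()) <= max_dist for w in words if w)}

def quoted_message(index, thread, window=5, min_len=20):
    """Locality principle: look for a shared passage only in the `window`
    messages preceding thread[index]; return the index of the best match."""
    best, best_size = None, 0
    for j in range(max(0, index - window), index):
        matcher = SequenceMatcher(None, thread[j], thread[index], autojunk=False)
        size = matcher.find_longest_match(0, len(thread[j]), 0, len(thread[index])).size
        if size >= min_len and size > best_size:
            best, best_size = j, size
    return best

thread = [
    "The new season starts in September on Monday nights.",
    "Not sure about that.",
    "starts in September on Monday nights? Finally!",
]
citations = cited_users("I agree with Johnn, as usual.", ["john", "maria"])
source = quoted_message(2, thread)
```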
Detecting celebrities
- Modeling the forum discussion with a graph G=(V,E)
  - vertex v = forum participant
  - edge e = link (implicit or explicit) between two participants
- Weighted in-degree of v: deg-(v)
- Weighted out-degree of v: deg+(v)
- p(v) = set of messages posted by v
- p~ = average number of messages
- thr(v) = set of threads not initiated by v
[Forestier et al.,12]
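The weighted degrees defined on this slide can be computed directly from a list of weighted links. The sketch below assumes each link is a triple (source, target, weight), read as "source replied to, cited or quoted target"; that direction convention and the sample data are assumptions.

```python
from collections import defaultdict

def weighted_degrees(links):
    """Return (in_degree, out_degree) dictionaries for a weighted directed graph."""
    in_degree = defaultdict(float)
    out_degree = defaultdict(float)
    for source, target, weight in links:
        out_degree[source] += weight   # deg+(source)
        in_degree[target] += weight    # deg-(target)
    return in_degree, out_degree

# Hypothetical links between three participants.
links = [("A", "B", 1.0), ("C", "B", 2.0), ("B", "A", 1.0)]
in_degree, out_degree = weighted_degrees(links)
```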
Detecting celebrities
- Extracting social roles from a SN is a key issue [Fisher et al.,06] [Himelboim et al.,09] [Forestier et al.,12]
- Some examples of roles:
  - Leader: highly active user who initiates discussion threads and keeps them going
  - Expert: user who is particularly active on a restricted number of topics
  - Celebrity: public figure well known to the participants
  - Flamer: user with negative behavior, who can generate conflicts
  - Lurker: user with low participation in the discussion
- In the following, we focus on the explicit “celebrity” role within online discussion forums
Detecting celebrities
- Formalization of the criteria given by [Golder and Donath,04]
Detecting celebrities
- Based on these atomic criteria, we define 3 meta-criteria:
  - meta-criterion 1: all the basic criteria must be satisfied (necessary conditions); the retained users are ranked in descending order of their total number of posts
  - meta-criterion 2: same, but ranked by the user's average forum participation multiplied by the number of posts
  - meta-criterion 3: same, but also taking name citation and text quotation into account
- Evaluation measure: compare the ranking given by our meta-criteria with the number of fans of each user (>800), used as gold standard
- Dataset:
  - 57 forums from the US version of the Huffington Post
  - 3 topics: politics, media, living
  - Overall 11,443 unique users and 35,175 posts
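The first two meta-criteria follow a filter-then-rank pattern, sketched below. The basic criteria themselves come from [Golder and Donath,04] and are not reproduced in this transcript, so the single in-degree test used here is only a placeholder; the field names and values are hypothetical as well.

```python
def rank_candidates(users, criteria, score):
    """Keep the users satisfying every basic criterion, then rank them by
    `score` in descending order."""
    kept = [u for u in users if all(check(u) for check in criteria)]
    return sorted(kept, key=score, reverse=True)

# Hypothetical user profiles.
users = [
    {"name": "u1", "posts": 40, "avg_participation": 0.5, "in_degree": 12.0},
    {"name": "u2", "posts": 90, "avg_participation": 0.1, "in_degree": 30.0},
    {"name": "u3", "posts": 70, "avg_participation": 0.9, "in_degree": 1.0},
]
# Placeholder for the basic criteria (an assumption, not the real ones).
criteria = [lambda u: u["in_degree"] >= 5.0]

# Meta-criterion 1: rank by total number of posts.
meta1 = rank_candidates(users, criteria, score=lambda u: u["posts"])
# Meta-criterion 2: rank by average participation multiplied by number of posts.
meta2 = rank_candidates(users, criteria,
                        score=lambda u: u["avg_participation"] * u["posts"])
```

The resulting rankings would then be compared with the ranking induced by each user's number of fans.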
[Figure] [Forestier et al.,12]
Surfacing roles
- New collaboration between [partner logos]
- Bottom-up “emerging” roles: [figure]
Surfacing roles
- Discussions about 6 popular TV shows from the TWOP forums
- The parent-child relationship is restored using the “quote” mechanism:
  - check the previous 20 messages in the thread;
  - a parent has to contain at least 95% of the quoted text.
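The parent restoration rule (20-message window, 95% containment) can be sketched as follows. How "contains 95% of the quoted text" is measured is not stated in the transcript; this sketch assumes it is the length of the longest contiguous passage shared with the candidate parent, divided by the length of the quote, and it returns the most recent qualifying message.

```python
from difflib import SequenceMatcher

def quote_coverage(quote, candidate):
    """Fraction of the quote found as one contiguous passage in the candidate."""
    if not quote:
        return 0.0
    matcher = SequenceMatcher(None, quote, candidate, autojunk=False)
    match = matcher.find_longest_match(0, len(quote), 0, len(candidate))
    return match.size / len(quote)

def find_parent(thread, index, quote, window=20, min_coverage=0.95):
    """Index of the most recent of the `window` previous messages that contains
    at least `min_coverage` of the quoted text, or None."""
    first = max(0, index - window)
    for j in range(index - 1, first - 1, -1):
        if quote_coverage(quote, thread[j]) >= min_coverage:
            return j
    return None

thread = [
    "I think the finale was great and well written",
    "Totally disagree",
    "the finale was great? Really?",
]
parent = find_parent(thread, 2, "the finale was great")
```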
Surfacing roles
- Profiling users with temporal-aware features:
  - weighted in-degree,
  - weighted out-degree,
  - node in-g-index,
  - node out-g-index,
  - catalytic power,
  - number of posts,
  - cross-topic entropy.
- The role identification procedure is applied to the time series of feature vectors of 1,263 forum users.
- Using moving time windows (size = 1 week, shift = 1 day)
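The moving-window construction (size = 1 week, shift = 1 day) can be illustrated on two of the listed features. The sketch assumes days are integers and that cross-topic entropy is the Shannon entropy of a user's posts across topics within the window; that definition, like the function names, is an assumption rather than the one used in the cited paper.

```python
import math
from collections import Counter

def topic_entropy(topics):
    """Shannon entropy (in bits) of the topic distribution of a list of posts."""
    counts = Counter(topics)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def windowed_features(posts, first_day, last_day, size=7, shift=1):
    """posts: list of (day, topic) pairs for one user.
    Returns one (window_start, number_of_posts, topic_entropy) triple per window."""
    series = []
    start = first_day
    while start + size - 1 <= last_day:
        in_window = [topic for day, topic in posts if start <= day < start + size]
        series.append((start, len(in_window), topic_entropy(in_window)))
        start += shift
    return series

# Hypothetical user: three posts over ten days.
series = windowed_features([(0, "a"), (1, "b"), (8, "a")], first_day=0, last_day=9)
```

Each user thus yields a time series of feature vectors, which is what the clustering step takes as input.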
Surfacing roles
- Clustering time series
  - Basic k-means algorithm
  - Hartigan's index used for estimating the best k
[Anokhin et al.,12]
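A compact sketch of basic k-means combined with Hartigan's index, H(k) = (W(k)/W(k+1) - 1)(n - k - 1), where W(k) is the within-cluster sum of squares. The deterministic farthest-point seeding and the threshold of 10 (a commonly cited rule of thumb) are assumptions of this sketch; the transcript does not say which initialization or threshold was used.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_point_init(points, k):
    """Deterministic seeding: first point, then the point farthest from the chosen centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers

def kmeans_wss(points, k, n_iter=100):
    """Run basic k-means and return the within-cluster sum of squares W(k)."""
    centers = farthest_point_init(points, k)
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            clusters[nearest].append(p)
        updated = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if updated == centers:
            break
        centers = updated
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

def hartigan_index(points, k):
    """Hartigan's index; assumes W(k+1) is non-zero."""
    n = len(points)
    return (kmeans_wss(points, k) / kmeans_wss(points, k + 1) - 1) * (n - k - 1)

def choose_k(points, k_max, threshold=10.0):
    """Smallest k whose Hartigan index does not exceed the threshold."""
    for k in range(1, k_max + 1):
        if hartigan_index(points, k) <= threshold:
            return k
    return k_max

# Two well-separated groups of hypothetical 2-D feature vectors.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (100, 100), (100, 101), (101, 100), (101, 101), (100.5, 100.5)]
best_k = choose_k(points, k_max=3)
```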
Surfacing roles
- Some observations: [figure]
Section 4 Semi-supervised clustering
Temporal-driven clustering
- Goal: detecting typical patterns over time
- How to deal with temporally described entities?
- Applications:
  - Evolution of nations' political states (proof of concept)
  - Trajectories over roles
  - Evolution of entities' images (cf. ImagiWeb)
[Figure: entities φ1 and φ2 observed at timestamps t1, t2, t3; observations x1d to x6d]
Temporal-driven clustering
- Detect typical evolution patterns of individuals in the dataset:
  - phases through which the entity collection went over time
  - trajectory of entities through the different phases
Temporal-aware constrained clustering
- The resulting partition must ensure:
  - descriptive coherence of clusters;
  - temporal coherence of clusters;
  - contiguous segmentation of the observations belonging to an entity
- Objective function to minimize (inspired by semi-supervised clustering [Wagstaff and Cardie,00]) + use of a K-Means-like algorithm:
  [Equation with two terms: (a) temporal-aware dissimilarity measure, (b) contiguity penalty measure]
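The objective combines a temporal-aware dissimilarity term and a contiguity penalty term, but the formula itself is not preserved in this transcript. The sketch below therefore uses illustrative functional forms chosen for this example (a product-based dissimilarity in [0, 1] and a Gaussian-shaped penalty controlled by β and δ); they should be read as assumptions, not as the published definitions.

```python
import math

def ta_dissimilarity(obs, centroid, max_desc_sq, max_time_sq):
    """Illustrative temporal-aware dissimilarity: grows with both the descriptive
    and the temporal squared distance (each normalized by its maximum)."""
    _, time, desc = obs
    c_time, c_desc = centroid
    d_desc = sum((a - b) ** 2 for a, b in zip(desc, c_desc)) / max_desc_sq
    d_time = (time - c_time) ** 2 / max_time_sq
    return 1 - (1 - d_desc) * (1 - d_time)

def contiguity_penalty(t_i, t_j, beta, delta):
    """Illustrative penalty: equal to beta for simultaneous observations,
    fading as the time gap grows relative to delta."""
    return beta * math.exp(-0.5 * ((t_i - t_j) / delta) ** 2)

def objective(observations, assignment, centroids, beta, delta,
              max_desc_sq, max_time_sq):
    """observations: list of (entity, time, descriptive_vector);
    assignment: cluster index of each observation;
    centroids: list of (time, descriptive_vector)."""
    cost = sum(ta_dissimilarity(o, centroids[k], max_desc_sq, max_time_sq)
               for o, k in zip(observations, assignment))
    for i in range(len(observations)):
        for j in range(i + 1, len(observations)):
            same_entity = observations[i][0] == observations[j][0]
            if same_entity and assignment[i] != assignment[j]:
                cost += contiguity_penalty(observations[i][1], observations[j][1],
                                           beta, delta)
    return cost

# Hypothetical entity "A" observed at three timestamps.
observations = [("A", 0, (0.0,)), ("A", 1, (0.0,)), ("A", 2, (1.0,))]
centroids = [(0.5, (0.0,)), (2.0, (1.0,))]
split_cost = objective(observations, [0, 0, 1], centroids, 0.003, 3, 1.0, 4.0)
```

Under these assumptions, a penalty is paid only when two observations of the same entity land in different clusters, which is what pushes the partition toward contiguous segments.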
Experiments on political dataset
- 23 countries, 60 years
- 207 political, demographic, social and economic variables
- Running TDCK-Means (8 clusters, β = 0.003 and δ = 3)
[Result figures on the political dataset]
Section 5 Focus on Project ImagiWeb
http://eric.univ-lyon2.fr/~jvelcin/imagiweb
Project ImagiWeb
- Goal of the ANR project ImagiWeb: analyzing the life cycle (production, diffusion, evolution) of images through the Web 2.0
- Strong points:
  - Joint analysis of opinions, topics, social networks…
  - Involvement of (true) researchers in LLSSH
- Partners:
  - ERIC: data mining, machine learning
  - LIA: text/opinion mining, information retrieval
  - CEPEL: social scientists, specialists in political studies
  - XRCE: information extraction, NLP
  - AMI Software: digital monitoring
  - EDF R&D: end user, semiology study
Project ImagiWeb
[Figure labels (translated from French): entities (e.g., companies, politicians); Internet users; emitted image; perceived image; comments; communication media (websites, brochures, etc.); analysis of expression data; analysis of populations; feedback; IMAGIWEB; emitters; receivers]
Platform for performing the annotation
- Web applications designed for annotating ~10k tweets + 200 blog comments; 22 annotators are working on it right now!
- Output: (mφ ; mt ; mp ; ma ; mt ; ms)
Capturing an image's evolution over time
- Input: set of tuples (mφ ; mt ; mp ; ma ; mt ; ms)
- Some good questions:
  - What is an image?
  - How to summarize the set of (temporally situated and spatially located) opinions?
- First insight: investigating time series analysis, temporal-driven clustering, graphical models…
- Fortunately we'll have a full-time post-doc to work on it!
Recent work on opinion mining
- Participation in SemEval 2013
  - Task 2.B: discriminating positive (+) from negative (-) opinions (+ neutral)
- Very recent work: improving basic Naive Bayes (NB) by using background knowledge (seed lists)
  - 6/35 and 3/16 on the official tweet dataset!
- Results on our own datasets: [results table]
[paper just submitted]
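One simple way to inject seed lists into a basic Naive Bayes classifier is to add pseudo-counts for the seed words of each class. The sketch below illustrates that idea only; the transcript does not describe how the submitted system actually uses its background knowledge, so the class name, the `seed_weight` parameter and the toy data are all assumptions.

```python
import math
from collections import Counter, defaultdict

class SeededNaiveBayes:
    """Multinomial Naive Bayes whose word counts are boosted by seed lists."""

    def __init__(self, seeds, seed_weight=5.0, alpha=1.0):
        self.seeds = seeds              # {label: iterable of seed words}
        self.seed_weight = seed_weight  # pseudo-count given to each seed word
        self.alpha = alpha              # Laplace smoothing

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        # Background knowledge: seed words count as extra observations.
        for label, words in self.seeds.items():
            for word in words:
                self.word_counts[label][word] += self.seed_weight
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, tokens):
        n_docs = sum(self.class_counts.values())
        scores = {}
        for label, n in self.class_counts.items():
            counts = self.word_counts[label]
            total = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n / n_docs)
            for w in tokens:
                if w in self.vocab:  # unknown words are ignored
                    score += math.log((counts[w] + self.alpha) / total)
            scores[label] = score
        return max(scores, key=scores.get)

# Toy training set and hypothetical seed lists.
model = SeededNaiveBayes({"+": ["great"], "-": ["awful"]}).fit(
    [["good", "movie"], ["bad", "movie"]], ["+", "-"])
positive = model.predict(["great", "movie"])
negative = model.predict(["awful", "plot"])
```

The seed words never occur in the toy training set, so the two predictions are driven entirely by the pseudo-counts.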
Section 6 Future lines of research
An integrated view
Research + tools + applications
- Ongoing research
  - Structured temporal-driven clustering (M. A. Rizoiu, PhD student)
  - Bridging the gap between topics and concepts (M. A. Rizoiu, PhD student)
  - Multi-document summarization of online discussions (C. Cercel, PhD student, in collaboration with the Polytechnic Institute of Bucharest)
  - Bottom-up, dynamic extraction of roles (A. Lumbreras, PhD student, in collaboration with Technicolor)
  - Dynamic joint extraction of topics and opinions (M. Dermouche, PhD student, in collaboration with AMI Software)
  - Extracting opinionated images from tweets and blogs in an unsupervised way (Y. Kim, post-doc, in collaboration with LIA)
An integrated view
- Tools
  - MediaMining: a fully open-access platform for analyzing online discussions
- Applications
  - Reputation management services
    => Project ImagiWeb, with specialists in political studies (2012-2015, ~860k)
  - Discourse analysis in public opinion
    => Project DANuM, with linguists (2013-2014, 23k)
    => Project ALICE, with social scientists and specialists in communication (just submitted)
  - The next step: data-mining-based services for “curation support”, with specialists in communication and journalists
Focus on the collaboration DAL/Lyon
- 3 possible scientific contributions:
  - Labeling hierarchical topic models
  - Labeling dynamic topic models
  - Visualization of hierarchical/dynamic topic models
[Figure labels: Artificial Neuronal Network, Neuroscience, Optimization, Efficiency (statistics), Learning theory, Vision chip, Generative model, Graphical models, Neural networks, Background, Computer vision, Markov decision process, Computational complexity theory]
References (excerpt)
- Anokhin, N., Lanagan, J. and Velcin, J. (2012), Social Citation: Finding Roles in Social Networks. An Analysis of TV-Series Web Forums. Second International Workshop on Mining Communities and People Recommenders (COMMPER), in conjunction with ECML/PKDD, Bristol, UK.
- Dermouche, M., Velcin, J., Loudcher, S. and Khouas, L. (2013), Une nouvelle mesure pour l'évaluation des méthodes d'extraction de thématiques : la Vraisemblance Généralisée. Actes de la 13ème Conférence Francophone sur l'Extraction et la Gestion des Connaissances (EGC), Toulouse, France.
- Forestier, M., Stavrianou, A., Velcin, J. and Zighed, D.A. (2012), Roles in Social Networks: Methodologies and Research Issues. Web Intelligence and Agent Systems: An International Journal (WIAS).
- Musat, C., Velcin, J., Rizoiu, M.A. and Trausan-Matu, S. (2011), Improving Topic Evaluation Using Conceptual Knowledge. Proceedings of the 22nd International Joint Conference on Artificial Intelligence (IJCAI), Barcelona, Spain.
- Rizoiu, M.A., Velcin, J. and Lallich, S. (2012), Structuring typical evolutions using Temporal-Driven Constrained Clustering. Proceedings of the 24th IEEE International Conference on Tools with Artificial Intelligence (ICTAI), Athens, Greece. Best student paper award.
- Stavrianou, A., Velcin, J. and Chauchat, J.H. (2009), A combination of opinion mining and social network techniques for discussion analysis. Revue des Nouvelles Technologies de l'Information (RNTI), Cepadues.
Thank you!