1
Temporal and semantic analysis of richly typed social networks from
user-generated content sites on the web
Zide Meng. Supervisors: Fabien Gandon, Catherine Faron-Zucker
2
Question Title
Question Content
Question User
Question Comments
Answer Content
Answer User
Question Tags
Answer Votes
Question Votes
3
4
Some facts about Q&A sites
• Traffic statistics in Oct. 2016 from Quantcast.com
  – 49.9M unique devices visit Stackoverflow
  – 52.9M unique devices visit Answer.com
  – 3.9M unique devices visit YahooAnswer
• To compare:
  – 211.8M unique devices visit Youtube.com
  – 147.6M unique devices visit Facebook.com
• For scale, Stackoverflow's traffic is roughly 1/4 of Youtube.com's and 1/3 of Facebook.com's
5
6
Site info of StackOverflow.com
• total questions: 12.7M
• unanswered: 3.5M
• total answers: 20.1M
• total users: 6.2M
• questions/min: 2.93
• answers/min: 4.66
Source: https://api.stackexchange.com/docs/info#filter=default&site=stackoverflow&run=true (accessed 2 Nov 2016)
7
What is a Q&A site?
8
Detect topics and activities of users
how to export jar?
Topic detection
Temporal analysis
9
Detect topics and activities of groups
temporal dynamics
community detection
10
Toward the functionalities
• Expert identification
• Question routing
• Community evolution
• Burst topic detection
• Event detection
• etc.
11
Research Question (RQ1)
• How can we formalize user-generated content?
12
Research Question (RQ2)
• How can we identify the common topics binding users together?
13
Research Question (RQ3)
• How can we detect topic-based overlapping communities?
14
Research Question (RQ4)
• How can we generate a semantic label for topics?
Java Development
Database
15
Research Question (RQ5)
• How can we extract topic-based expertise and temporal dynamics?
16
Agenda
• Background & Motivation
• RQ1: Formalize user-generated content
• RQ2: An efficient topic modeling method
• RQ3: Overlapping community detection
• RQ4: From a BOW to semantic labels
• RQ5: Temporal Topic Expertise Activity
• Conclusions and Perspectives
17
Overview
[Figure: processing pipeline. Stages: Data Preparation, Data Integration, Data Analysis, Application. Components: Q&A Data, Open Data, Other Data; Schema Mapping, Data Enrichment, Data Inter-Linking; Integrated Dataset; Community Detection, Topic Extraction, Temporal Analysis; Applications.]
18
RQ1: How to formalize UGC?
19
massive unstructured Q&A content
20
Formalize UGC with semantic schema
• From unstructured to structured
• Explicit information (questions, answers…)
• Implicit information (interest, expertise…)
[Figure: the original Q/A data is turned into Q/A triples by information extraction (SIOC & FOAF, existing work) and into user expertise and user interest by data mining (QASM).]
21
QASM vocabulary
22
Formalize distribution
RQ1 Discussion
We propose the QASM vocabulary to formalize both explicit and implicit information in user-generated content (Q&A sites).
How to extract implicit information from the original explicit information?
24
RQ2: How to find topics?
25
What kinds of topics are they talking about?
26
What is a topic?
• E.g., given two documents, one on the topic "music" and one on "cooking":
  – "guitar" and "singer" are more likely to appear in the document about "music"
  – "recipe" and "pizza" are more likely to appear in the document about "cooking"
  – "the" and "a" appear equally often in both
• Score (0 ~ 1) -> relevance (weak ~ strong)
• Topic "music":
  "guitar"  "singer"  "recipe"  "pizza"  "the"   "a"
  0.4       0.3       0.1       0.1      0.05    0.05
27
Latent Dirichlet Allocation (LDA)
28
Replace Document with User
• Original LDA: Document-Topic-Word
• In our problem: User-Topic-Tag
29
How to get the topic assignment?
• User and Tag are observed information
• Topic is hidden information
• Distributions to estimate: P(topic|user) and P(tag|topic)
30
How to get the distributions?
• Gibbs sampling (iterative loop):
  – sample a new topic assignment Z_i
  – update the distributions
• User-Topic distribution: θ_uk
• Topic-Tag distribution: θ_kw
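The loop above can be sketched as a collapsed Gibbs sampler. This is a minimal illustration, not the implementation used in the thesis: the hyperparameters `alpha` and `beta`, the iteration count, the function name `gibbs_lda` and the toy tag lists are all assumptions made for the example.

```python
import random

def gibbs_lda(user_tags, n_topics, n_iter=50, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for a User-Topic-Tag model (toy sketch)."""
    rng = random.Random(seed)
    vocab = sorted({tag for tags in user_tags for tag in tags})
    tag_id = {tag: i for i, tag in enumerate(vocab)}
    n_vocab = len(vocab)
    n_uk = [[0] * n_topics for _ in user_tags]       # user-topic counts
    n_kw = [[0] * n_vocab for _ in range(n_topics)]  # topic-tag counts
    n_k = [0] * n_topics                             # tags assigned to each topic
    z = []                                           # topic assignment of every tag

    def update(u, k, w, delta):
        n_uk[u][k] += delta
        n_kw[k][w] += delta
        n_k[k] += delta

    # Random initial assignment.
    for u, tags in enumerate(user_tags):
        z.append([])
        for tag in tags:
            k = rng.randrange(n_topics)
            z[u].append(k)
            update(u, k, tag_id[tag], +1)

    for _ in range(n_iter):
        for u, tags in enumerate(user_tags):
            for i, tag in enumerate(tags):
                w = tag_id[tag]
                update(u, z[u][i], w, -1)  # remove the current assignment
                weights = [
                    (n_uk[u][k] + alpha) * (n_kw[k][w] + beta) / (n_k[k] + n_vocab * beta)
                    for k in range(n_topics)
                ]
                z[u][i] = rng.choices(range(n_topics), weights=weights, k=1)[0]  # sample a new Z_i
                update(u, z[u][i], w, +1)  # update the counts behind the distributions

    user_topic = [
        [(n_uk[u][k] + alpha) / (len(tags) + n_topics * alpha) for k in range(n_topics)]
        for u, tags in enumerate(user_tags)
    ]
    topic_tag = [
        {vocab[w]: (n_kw[k][w] + beta) / (n_k[k] + n_vocab * beta) for w in range(n_vocab)}
        for k in range(n_topics)
    ]
    return user_topic, topic_tag

# Hypothetical tag lists of three users.
user_tags = [
    ["html", "css", "layout", "html"],
    ["java", "tomcat", "jsp", "java"],
    ["html", "css", "java"],
]
user_topic, topic_tag = gibbs_lda(user_tags, n_topics=2)
```

Each row of `user_topic` and each entry of `topic_tag` is a probability distribution; which topic index ends up as "web" or "java" depends on the random seed.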
31
Intuition behind LDA
• How to create a user tag list
32
Topic-Tag distribution
User-Topic distribution
Loop: choose a topic, choose a tag
Generated user tag list: html, css, eclipse, mysql, layout, ……
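The generative intuition (choose a topic, then choose a tag, and repeat) can be sketched as follows. The probabilities are illustrative toy values and the helper names `draw` and `generate_tag_list` are hypothetical.

```python
import random

# Toy distributions; all numbers are illustrative assumptions.
user_topic = {"web-dev": 0.3, "java-dev": 0.6, "database": 0.1}
topic_tag = {
    "web-dev": {"java": 0.1, "mysql": 0.1, "tomcat": 0.2, "html": 0.6},
    "java-dev": {"java": 0.6, "mysql": 0.1, "tomcat": 0.2, "html": 0.1},
    "database": {"java": 0.1, "mysql": 0.6, "tomcat": 0.2, "html": 0.1},
}

def draw(dist, rng):
    """Draw one key from a {key: probability} mapping."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys], k=1)[0]

def generate_tag_list(n_tags, rng):
    """Generate a user's tag list: pick a topic, then pick a tag from that topic."""
    tags = []
    for _ in range(n_tags):
        topic = draw(user_topic, rng)             # choose a topic
        tags.append(draw(topic_tag[topic], rng))  # choose a tag
    return tags

tag_list = generate_tag_list(8, random.Random(0))
```

Inference (for example Gibbs sampling) runs this story backwards: it observes only the tag lists and recovers the two distributions.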
33
Output of LDA (User-Topic-Tag)
User-Topic distribution:
Web Development    0.3
Java Development   0.6
Database           0.1
Topic-Tag distribution:
topic              java   mysql   tomcat   html
Web Development    0.1    0.1     0.2      0.6
Java Development   0.6    0.1     0.2      0.1
Database           0.1    0.6     0.2      0.1
Short summary
• Goal: we want to find topics, overlapping communities, user expertise, user activities…
• Method: LDA may solve the problems
• But LDA has problems, e.g. it is slow and complex
• Find an efficient and simple topic modeling method
34
35
Some empirical findings
• Frequently used tags are more general
• The first tag normally indicates the domain
• Each question has 1~5 tags indicating its key points
-> Find topics based on tags
36
Solution: Prefix Tree structure
Questions and their tags:
Q1: html css element
Q2: html layout float
Q3: html layout css-layout
Q4: html forms select input
Q5: html forms autocomplete
Resulting prefix tree:
html
  css
    element
  layout
    float
    css-layout
  forms
    select
      input
    autocomplete
1. The root tag can be used to represent the children tags
2. Tags in a tree belong to the same topic
3. The order is maintained in the tree structure
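Building such a prefix tree from the five example questions could look like the sketch below. The nested-dictionary layout and the function name `build_prefix_trees` are assumptions for illustration; tags are inserted in the order in which they appear on each question.

```python
def build_prefix_trees(questions):
    """Insert each ordered tag list into a forest of prefix trees.

    Every node stores how many tag lists passed through it and its children.
    """
    roots = {}
    for tags in questions:
        level = roots
        for tag in tags:
            node = level.setdefault(tag, {"count": 0, "children": {}})
            node["count"] += 1
            level = node["children"]
    return roots

questions = [
    ["html", "css", "element"],
    ["html", "layout", "float"],
    ["html", "layout", "css-layout"],
    ["html", "forms", "select", "input"],
    ["html", "forms", "autocomplete"],
]
trees = build_prefix_trees(questions)
html = trees["html"]
```

Here `html` is the single root, reached by all five questions, with `css`, `layout` and `forms` as its children.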
37
html
HTML prefix tree for StackOverflow
38
Combine Prefix Trees
• Why: some trees should be in the same topic
• How: compute the root tag similarity matrix
• Output: combine trees to get topics
[Figure: prefix trees rooted at html, javascript and java, linked by their root-tag similarities]
similarity   html   javascript   java
html         1      0.89         0.35
javascript   0.89   1            0.29
java         0.35   0.29         1
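One simple way to combine trees from such a matrix is to merge every pair of roots whose similarity reaches a threshold. The sketch below uses the similarities from the slide; the threshold of 0.5, the union-find merging strategy and the function name are assumptions, since the slides do not state how the merge is performed.

```python
def combine_roots(roots, similarity, threshold):
    """Group root tags whose pairwise similarity reaches the threshold."""
    parent = {root: root for root in roots}

    def find(root):
        # Follow parent links to the group representative.
        while parent[root] != root:
            parent[root] = parent[parent[root]]
            root = parent[root]
        return root

    for (a, b), score in similarity.items():
        if score >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for root in roots:
        groups.setdefault(find(root), []).append(root)
    return sorted(sorted(group) for group in groups.values())

similarity = {
    ("html", "javascript"): 0.89,
    ("html", "java"): 0.35,
    ("javascript", "java"): 0.29,
}
# The 0.5 threshold is a hypothetical choice.
topics = combine_roots(["html", "javascript", "java"], similarity, threshold=0.5)
```

With this threshold the html and javascript trees end up in one topic and the java tree stays on its own.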
39
How to get the topic-tag distribution?
• MLE (Maximum Likelihood Estimation): each tag count is divided by the topic total
Tag counts in Topic1 (Web-dev, total 100): html:50, javascript:50, mysql:20, forms:20, layout:10
Topic-tag values of Topic1 (Web-dev): html:0.5, javascript:0.5, mysql:0.2, forms:0.2, layout:0.1
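The maximum likelihood estimate is just a normalization of counts. The sketch below uses four of the tag counts from the slide (those that add up to the topic total of 100); the function name is hypothetical.

```python
def mle_topic_tag(tag_counts):
    """Estimate P(tag | topic) as count(tag in topic) / total count in topic."""
    total = sum(tag_counts.values())
    return {tag: count / total for tag, count in tag_counts.items()}

counts = {"html": 50, "mysql": 20, "forms": 20, "layout": 10}
distribution = mle_topic_tag(counts)
```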
40
Topic-tag distribution
[Figure: probability of each tag under each topic (axes: tags, topics, probability). Example annotations: sql and database are highly related; eclipse is not related.]
41
Topic Extraction experiment setup
• Dataset
  – Stackoverflow (2008/08 to 2009/09)
  – 103K users
  – 242K questions and 870K answers
• Baseline Algorithms
  – LDA (Latent Dirichlet Allocation)
42
Topic extraction evaluation metric
• Metric: Perplexity
  – how likely a model would generate the test dataset
• Example (two models learned from the training set):
  model     html   width   css
  model 1   0.9    0.05    0.05
  model 2   0.4    0.4     0.2
  test case 1: html width css
  test case 2: html width
  test case 3: html width
  test case 4: html width css
  model 1 is less likely to generate the test cases; model 2 is more likely
• Higher probability of generating the test dataset -> lower perplexity score -> better performance
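The example can be checked numerically. The sketch below treats each model as a single distribution over tags and uses the standard per-token definition of perplexity, exp(-average log-likelihood); this is an assumption, as the slides do not give the exact formula used in the experiments.

```python
import math

def perplexity(model, test_cases):
    """Per-token perplexity of a tag distribution on a list of tag lists."""
    log_likelihood = 0.0
    n_tokens = 0
    for case in test_cases:
        for tag in case:
            log_likelihood += math.log(model[tag])
            n_tokens += 1
    return math.exp(-log_likelihood / n_tokens)

model_1 = {"html": 0.9, "width": 0.05, "css": 0.05}
model_2 = {"html": 0.4, "width": 0.4, "css": 0.2}
test_cases = [
    ["html", "width", "css"],
    ["html", "width"],
    ["html", "width"],
    ["html", "width", "css"],
]
ppl_1 = perplexity(model_1, test_cases)
ppl_2 = perplexity(model_2, test_cases)
```

Model 2 obtains the lower perplexity (about 2.9 against about 6.3), because model 1 gives very little probability to `width`, which appears in every test case.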
43
Perplexity score compared with LDA (lower is better)
[Figure: perplexity of LDA vs. our method (marked WE)]
44
Scalability compared with LDA
[Figure: scalability of LDA vs. our method]
RQ2 Discussion
• If two tags co-occur many times, they should be in the same topic (in the same topic tree in our method)
• The probability of a tag given a topic can be approximated by its relative frequency in that topic if the observed data is large enough
• A question's tag list is short (3~5 tags), which makes it hard for LDA to get very good results
• Tested on a Flickr dataset, the method also generates meaningful topics
• TTD: a simple and efficient topic modeling method that preserves topic quality
46
RQ3: How to detect communities?
47
Which users are interested in the same topic?
48
Existing community detection approaches
We focus on a simple, efficient, topic-based, overlapping community detection method
49
How to get the user-topic distribution?
• topic-tag distribution (from the last step) + user tag list
User tag list with per-tag counts: Java ×22, jsp ×15, tomcat ×11, html ×9, ……
Each topic score is the sum over the user's tags of (topic-tag probability × tag count):
topic      ×22    ×15    ×11    ×9     score
Web-Dev    0.10   0.20   0.30   0.30   11.2
Java-Dev   0.50   0.20   0.25   0.05   17.2
C#-Dev     0.05   0.05   0.15   0.10   4.4
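The weighted sum on this slide can be reproduced directly from its numbers. In the sketch below the tag counts and probabilities are those of the slide; the variable and function names are hypothetical, and the columns are kept positional so that no assumption is needed about which tag belongs to which count.

```python
# Counts of four tags in one user's tag list, in slide order.
tag_counts = [22, 15, 11, 9]

# Topic-tag probabilities for the same four tags, one row per topic.
topic_tag = {
    "Web-Dev": [0.10, 0.20, 0.30, 0.30],
    "Java-Dev": [0.50, 0.20, 0.25, 0.05],
    "C#-Dev": [0.05, 0.05, 0.15, 0.10],
}

def user_topic_scores(tag_counts, topic_tag):
    """Score each topic as sum(P(tag | topic) * count of that tag for the user)."""
    return {
        topic: sum(p * c for p, c in zip(probabilities, tag_counts))
        for topic, probabilities in topic_tag.items()
    }

scores = user_topic_scores(tag_counts, topic_tag)
```

The scores come out as 11.2, 17.2 and 4.4, matching the slide; they are not normalized, so they do not have to sum to 1.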
50
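User-topic distribution
[Figure: interest of each user in each topic (axes: users, topics), ranging from low interest to high interest; example: user12960 with the topics web-dev and java-dev]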
51
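How to find overlapping communities
• use the user-topic distribution
• each topic represents one community
• for example, with threshold 0.3, a user belongs to every community in which the user's score reaches the threshold
[Figure: example users with user-topic scores 0.75, 0.32, 0.42, 0.15, 0.78, 0.23, 0.15, 0.78 and 0.82 over the topics Web-Dev, Java-Dev and C#-Dev, assigned to overlapping communities]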
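The thresholding step can be sketched as below. The scores are taken from the slide, but how they are grouped into users and topics is an assumption (the figure layout is not recoverable), and the user names are hypothetical.

```python
# Assumed grouping of the slide's scores into three users.
user_topic = {
    "user_a": {"Web-Dev": 0.75, "Java-Dev": 0.32, "C#-Dev": 0.42},
    "user_b": {"Web-Dev": 0.15, "Java-Dev": 0.78, "C#-Dev": 0.23},
    "user_c": {"Web-Dev": 0.15, "Java-Dev": 0.78, "C#-Dev": 0.82},
}

def overlapping_communities(user_topic, threshold):
    """Put a user into every topic community where the score reaches the threshold."""
    communities = {}
    for user, scores in user_topic.items():
        for topic, score in scores.items():
            if score >= threshold:
                communities.setdefault(topic, set()).add(user)
    return communities

communities = overlapping_communities(user_topic, threshold=0.3)
```

Under this grouping `user_a` belongs to all three communities, which is exactly the overlap a hard partition could not express.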
52
Overlapping Community Detection experiment setup
• Dataset
  – Stackoverflow (2008/08 to 2009/09)
  – 103K users
  – 242K questions and 870K answers
• Compared Algorithms
  – SLPA (Speaker-listener Label Propagation Algorithm)
  – LDA (Latent Dirichlet Allocation)
  – Ward (hierarchical clustering algorithm)
53
Evaluation metrics
• Metrics, computed with both Jaccard similarity and cosine similarity:
  – avg_inner: users in a community are closer to each other
  – avg_rand: users in a community are far away from users outside
  – avg_center: users in a community are closer to the center
  – nmi = avg_inner/avg_rand
• The larger the better
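The two similarity measures, and one plausible reading of avg_inner, can be sketched as follows. The slides do not give the exact formula for avg_inner; taking it as the mean pairwise similarity between members of a community is an assumption, and the example community is hypothetical.

```python
import math
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two tag sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {key: weight} dicts."""
    dot = sum(u[key] * v[key] for key in u if key in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def avg_inner(members, similarity):
    """Assumed definition: mean pairwise similarity inside one community."""
    pairs = list(combinations(members, 2))
    return sum(similarity(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0

# Hypothetical community of three users described by their tag sets.
community = [{"html", "css"}, {"html", "javascript"}, {"html", "css", "layout"}]
score = avg_inner(community, jaccard)
```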
54
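Jaccard Similarity
[Figure: four panels (avg_inner, avg_rand, avg_center, nmi); our method (marked WE) is highlighted in all four]
55
Cosine Similarity
[Figure: four panels (avg_inner, avg_rand, avg_center, nmi); our method (WE) is highlighted in two panels and LDA in two]
LDA wins on some metrics, but LDA has the "sum to 1" restriction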
56
RQ3 Discussion
• Both our method and LDA can detect topic-based communities; graph-based and clustering methods cannot.
• Compared with LDA, our method does not have a "sum-to-1" restriction (a high interest in one community does not necessarily lower the interest in another community)
• Our method is simple and efficient compared with LDA
57
RQ4: How to label topics?
58
Bag of words is hard to view and manage
Topic 1
Topic 2
59
Existing topic labeling approaches
60
Using External Knowledge: Link to DBpedia
For example, the tag "Java" has several candidate resources:
http://dbpedia.org/resource/Java
http://dbpedia.org/resource/Java_(programming_language)
Disambiguation is needed to choose between them.
61
Disambiguation
• Content cosine similarity between the tag description in StackOverflow and each candidate's description, for the tag "Java":
  – Java (Place) description: 0.31
  – Java (P.L.) description: 0.58
• Method 1: Babelfy (Moro et al. 2014)
• Method 2: DBpedia Lookup service
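The content-similarity idea can be sketched with plain bag-of-words vectors. The descriptions below are short hypothetical stand-ins, not the real StackOverflow or DBpedia texts, and whitespace tokenization without weighting is an assumption.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Lower-cased whitespace tokenization into word counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical descriptions for illustration only.
tag_description = "java is a general purpose object oriented programming language"
candidates = {
    "http://dbpedia.org/resource/Java": "java is an island of indonesia",
    "http://dbpedia.org/resource/Java_(programming_language)":
        "java is a general purpose programming language",
}

tag_vector = bag_of_words(tag_description)
best = max(candidates, key=lambda uri: cosine(tag_vector, bag_of_words(candidates[uri])))
```

The candidate whose description shares more vocabulary with the tag description is selected, here the programming-language resource.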
62
Find missing links and resources
• SPARQL queries to DBpedia
http://dbpedia.org/resource/Apache_Tomcat
http://dbpedia.org/resource/Java_p.l.
http://dbpedia.org/resource/eclipse
http://dbpedia.org/resource/j2ee
63
Find the center
Java Development
Algorithms to find central node
the central node
64
Algorithms to find the center
• InDegree (ID)
• Betweenness Centrality (BC)
• Degree Centrality (DC)
• PageRank (Page 1999) (PR)
• Random
• Top (sorted topic-tag distribution)
• Most (most selected by the above algorithms)
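As an illustration of the simplest option in this list, the sketch below picks the node with the highest in-degree in a small directed graph of linked resources. The edges and node names are hypothetical; in practice they would come from links between DBpedia resources.

```python
from collections import Counter

def in_degree_center(edges):
    """Return the node with the highest in-degree (ties broken alphabetically)."""
    in_degree = Counter()
    for source, target in edges:
        in_degree[target] += 1
        in_degree[source] += 0  # make sure source nodes are counted as nodes too
    return max(sorted(in_degree), key=lambda node: in_degree[node])

# Hypothetical links between resources of one topic.
edges = [
    ("Apache_Tomcat", "Java"),
    ("Eclipse", "Java"),
    ("J2EE", "Java"),
    ("J2EE", "Apache_Tomcat"),
]
center = in_degree_center(edges)
```

The other centrality measures in the list follow the same pattern: score every node, then take the best-scored one as the topic label.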
65
Topic Labeling experiment: Survey
66
Evaluation: NDCG
• Normalized Discounted Cumulative Gain
  – used to compare two ranked lists
  – Perfect match: NDCG=1.0
  – Completely wrong: NDCG=0.0
Ground truth (survey result): HTML 10, Firefox 8, Web-Development 7, CSS 1, Browser 0
Algorithm output: HTML 0.81, Web-Development 0.52, Firefox 0.31, CSS 0.02, Browser 0.01
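For the example above, NDCG can be computed as in the following sketch. It uses the common formulation with gain rel_i / log2(i + 1) at rank i; the slides do not say which DCG variant was used in the experiments, so this choice is an assumption.

```python
import math

def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount (ranks start at 1)."""
    return sum(gain / math.log2(rank + 2) for rank, gain in enumerate(gains))

def ndcg(ground_truth, algorithm_scores):
    """Rank labels by algorithm score and compare against the ideal ordering."""
    ranked = sorted(algorithm_scores, key=algorithm_scores.get, reverse=True)
    gains = [ground_truth[label] for label in ranked]
    ideal = dcg(sorted(ground_truth.values(), reverse=True))
    return dcg(gains) / ideal if ideal else 0.0

ground_truth = {"HTML": 10, "Firefox": 8, "Web-Development": 7, "CSS": 1, "Browser": 0}
algorithm_scores = {
    "HTML": 0.81, "Web-Development": 0.52, "Firefox": 0.31, "CSS": 0.02, "Browser": 0.01,
}
score = ndcg(ground_truth, algorithm_scores)
```

The only disagreement is the swap of Firefox and Web-Development at ranks 2 and 3, so the score stays just below 1.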
67
Experiment:NDCG
NDCG=1 : perfect match
68
RQ4 Discussion
• Many words cannot be linked to DBpedia
• One label is not enough; 2 or 3 labels are much better
• By using the external knowledge base DBpedia, we propose a method to automatically generate semantic labels for a bag of words.
69
RQ5: How to model temporal dynamics?
70
Who is active now?
On which topic are they active?
In which topic do they have expertise?
71
Related Work
72
LDA -> TTEA: Temporal, Expertise, Activity
73
TTEA Model details
How to get topic assignment?
User-Topic distribution
Topic-Word distribution
Topic-Tag distribution
Topic-Time distribution
Topic-Expertise distribution
74
How to get the distributions?
• Gibbs sampling
Sample a new Zi
Update distributions
Intuition behind TTEA (Temporal)
Topic-Tag distribution
User-Topic distribution
User tag list: html, css, eclipse, mysql, layout, ……
75
Topic-Time distribution
June 2016
Intuition behind TTEA (Expertise)
Topic-Tag distribution
User-Topic distribution
User tag list: html, css, eclipse, mysql, layout, ……
76
Topic-Exp distribution
June 2016 52
77
Experiments and Evaluations
• StackOverflow dataset (07/2008-11/2013)
78
Topic Extraction experiment setup
• Baseline Algorithms
  – TEM (Yang 2013b): topic, expertise
  – UQA (Guo 2008b): topic, categories
  – GrosToT (Hu 2014): topic, temporal
  – TTEA (ours): topic, expertise, activity, temporal
79
Question Routing Task
• For each question in the test dataset, we recommend 5, 10, 20 or 30 users
• MSC: the number of successful predictions
[Figure: msc@5, msc@10, msc@20 and msc@30 (y-axis from 0 to 0.45) for TTEA-ACT, TEM, UQA, GROSTOT and RANDOM; our method is marked WE]
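A success-at-k measure in the spirit of msc@k can be sketched as follows. The exact definition is not given on the slide; counting a prediction as successful when at least one of the top k recommended users actually answered the question, and reporting the fraction of such questions, is an assumption. The question and user identifiers are hypothetical.

```python
def success_at_k(recommended, answerers, k):
    """Fraction of questions with at least one real answerer among the top k recommendations."""
    if not recommended:
        return 0.0
    hits = sum(
        1
        for question, ranking in recommended.items()
        if set(ranking[:k]) & answerers.get(question, set())
    )
    return hits / len(recommended)

# Hypothetical rankings and ground truth.
recommended = {"q1": ["u3", "u1", "u7"], "q2": ["u2", "u5", "u9"]}
answerers = {"q1": {"u1"}, "q2": {"u8"}}

msc_at_1 = success_at_k(recommended, answerers, k=1)
msc_at_2 = success_at_k(recommended, answerers, k=2)
```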
80
Best answer prediction
• When we recommend 100 users (out of 6.2M users) for each test question, in around 44% of cases one of them not only answers the question but also receives the highest vote.
81
Temporal illustrations
[Figure: activity by month, day and hour, at the global level and at the user level. Panel labels: users in the same topic behave differently; user in different topics.]
82
RQ5 Discussion
• TTEA: an extended LDA model to extract expertise, activity, and temporal dynamics.
• The extracted information could benefit question routing and expert detection tasks.
83
Agenda
• Background & Motivation
• RQ1: Formalize user-generated content
• Apply LDA on user-generated content
• RQ2: An efficient topic modeling method
• RQ3: Overlapping community detection
• RQ4: From a BOW to semantic labels
• RQ5: Temporal Topic Expertise Activity
• Conclusions and Perspectives
84
Overview of contributions
• temporal and semantic analysis of richly typed social networks from user-generated-content sites on the web
• key points:
  – temporal analysis
  – semantic analysis
  – social networks
  – user-generated content
Community/topic evolution
Topic Extraction
Community Detection
Question Answer site
85
Detailed answers to questions
• RQ1: QASM: formalize implicit and explicit content
• RQ2: TTD: a simple and fast topic modeling method
• RQ3: a TTD-based overlapping community detection method
• RQ4: a DBpedia-based topic labeling method
• RQ5: TTEA: a joint model of Topic, Temporal, Expertise and Activity
86
Limitations & Perspectives
• RQ1: How to formalize UGC?
  – formalize single platform vs. cross platform
• RQ2: How to detect topics?
  – automatically generate tags from content
• RQ3: How to find overlapping communities?
  – combine with graph-based community detection
• RQ4: How to generate labels for a BOW?
  – use extra knowledge bases or create links
• RQ5: How to extract temporal dynamics and expertise?
  – use all the extracted information to provide more functionalities
87
Publications
• Zide Meng, Fabien L. Gandon, Catherine Faron-Zucker, Ge Song: Detecting topics and overlapping communities in question and answer sites. Journal of Social Network Analysis and Mining, 2015
• Zide Meng, Fabien L. Gandon, Catherine Faron-Zucker: Overlapping Community Detection and Temporal Analysis on Q&A Sites. Journal of Web Intelligence and Agent Systems, 2016 (to appear)
• Zide Meng, Fabien L. Gandon, Catherine Faron-Zucker: Joint model of topics, expertises, activities and trends for question answering web applications. IEEE/WIC/ACM 2016 (to appear)
• Zide Meng, Fabien L. Gandon, Catherine Faron-Zucker: Simplified detection and labeling of overlapping communities of interest in Q&A sites. IEEE/WIC/ACM Web Intelligence 2015
• Zide Meng, Fabien L. Gandon, Catherine Faron-Zucker: QASM: a Q&A Social Media system based social semantics. ISWC 2014
• Zide Meng, Fabien L. Gandon, Catherine Faron-Zucker, Ge Song: Empirical study on overlapping community detection in question and answer sites. ASONAM 2014: 344-348
• Jean-Michel Dalle, Catherine Faron-Zucker, Fabien L. Gandon, Mathieu Lacage, Zide Meng: Online Knowledge Triage: Searching, Detecting, Labelling and Orienting User Generated Content. WWW (Companion Volume) 2016
88
Thank you!
Evaluations (1/4)
Topic -> Perplexity
[Figure: perplexity score vs. number of topics; our method is marked WE]
93
Topic
Topic over tag distribution (topic: web development)
html 0.75 (very related)
css 0.23
eclipse 0.02 (not related)
94
Temporal
Topic over time distribution (topic: web development)
May 0.75 (very popular)
June 0.23
July 0.02 (not popular)
95
Expertise
User's expertise over topic distribution
Web-Dev 0.75 (has high expertise)
Java-Dev 0.23
C#-Dev 0.02 (has low expertise)
96
Activity
Topic over user distribution (topic: web development)
[Figure: three users with probabilities 0.75 (very active), 0.23 and 0.02 (not active)]
97