Text Processing of Social Media III Saarland University (17/7/2014)
Text Processing of Social Media III: User Geolocation; Twitter POS Tagging; Semantic and Discourse Analysis of Social Media; Restrictions and Ethics of Social Media Usage
Timothy Baldwin
Talk Outline
1 Geolocation Prediction (cont.)
   Results
2 Other Pre-processing Tasks
3 Lexical Semantic Analysis of Twitter
4 Discourse Analysis of Web User Forums
   Introduction
   Experimental Setup
   Experiments and Analysis
   The Crowning Glory: Enhanced IR
   Other NLP Research over Web Forums
5 Restrictions and Ethics of Social Media Usage
6 Overall Summary
Experimental Setup
• Datasets:
  - North America dataset (NA, Roller et al. [2012]): 500K users, 38M tweets
  - World dataset (WORLD, Han et al. [2012]): 1.4M users, 192M tweets
• Evaluation metrics:
  - Accuracy (Acc)
  - Accuracy within 161km (Acc@161), e.g., Frankfurt and Darmstadt
  - Country-level accuracy (Acc@C)
  - Median error distance
Source(s): Han et al. [2014]
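The four metrics above can be sketched in a few lines. This is a minimal illustration, not the evaluation code of Han et al.: the record schema (`city`, `country`, `coord`), the haversine distance and the Earth radius constant are all assumptions made for the example.

```python
import math
from statistics import median

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumption of this sketch)

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

def evaluate(predicted, gold):
    """predicted/gold: parallel lists of dicts with 'city', 'country', 'coord' (hypothetical schema)."""
    n = len(gold)
    errors = [haversine_km(p["coord"], g["coord"]) for p, g in zip(predicted, gold)]
    return {
        "Acc": sum(p["city"] == g["city"] for p, g in zip(predicted, gold)) / n,
        "Acc@161": sum(e <= 161 for e in errors) / n,  # within 161 km of the gold city
        "Acc@C": sum(p["country"] == g["country"] for p, g in zip(predicted, gold)) / n,
        "Median": median(errors),                      # median error distance in km
    }
```

Predicting Darmstadt for a Frankfurt user counts as wrong under Acc but right under Acc@161 and Acc@C, since the two cities are only a few tens of kilometres apart.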
Experimental Parameters
• Many variables and parameters to explore, including:
  - feature set: all tokens vs. LIWs (vs. L2 regularisation)
  - learner: multinomial naive Bayes (NB), Kullback-Leibler divergence (KL), logistic regression (LR)
  - location representation: city, k-D tree partitioned earth grid (Roller et al. [2012])
  - language: English only, multilingual
  - data: geotagged data, geotagged and non-geotagged data, metadata
Source(s): Han et al. [2014]
Multinomial Naive Bayes
• The basic formulation for multinomial NB is:

$$P(c_i \mid D) \propto P(c_i) \prod_{j=1}^{|V|} \frac{P(t_j \mid c_i)^{N_{D,t_j}}}{N_{D,t_j}!}$$

where $N_{D,t_j}$ is the frequency of term $t_j$ in test document $D$, $V$ is the set of all terms, and (with additive smoothing):

$$P(t \mid c_i) = \frac{1 + \sum_{k=1}^{|D|} N_{k,t}\, P(c_i \mid D_k)}{|V| + \sum_{j=1}^{|V|} \sum_{k=1}^{|D|} N_{k,t_j}\, P(c_i \mid D_k)}$$

• In practice, use addition of log-likelihoods rather than product of likelihoods
Source(s): McCallum and Nigam [1998]
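A minimal sketch of the log-space computation, assuming hard training labels (so each $P(c_i \mid D_k)$ is 0 or 1) and dropping the factorial term, which is the same for every class and does not affect the argmax. Function names and the toy data layout are illustrative, not taken from the cited work.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, label) pairs. Returns (log_prior, log_likelihood, vocab)."""
    class_docs = Counter(label for _, label in docs)
    term_counts = defaultdict(Counter)
    for tokens, label in docs:
        term_counts[label].update(tokens)
    vocab = {t for tokens, _ in docs for t in tokens}
    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_like = {}
    for c in class_docs:
        total = sum(term_counts[c].values())
        # additive smoothing: (1 + count of t in class) / (|V| + all term counts in class)
        log_like[c] = {t: math.log((1 + term_counts[c][t]) / (len(vocab) + total))
                       for t in vocab}
    return log_prior, log_like, vocab

def predict_nb(tokens, log_prior, log_like, vocab):
    """Sum log-likelihoods instead of multiplying probabilities (avoids underflow)."""
    counts = Counter(t for t in tokens if t in vocab)  # unseen terms are ignored
    scores = {c: log_prior[c] + sum(n * log_like[c][t] for t, n in counts.items())
              for c in log_prior}
    return max(scores, key=scores.get)
```

For user geolocation, each "document" would be the concatenated tweets of one user and each class a city or grid cell.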
Results using LIWs (NB, WORLD)
Features    Acc    Acc@161  Acc@C  Median
Most Freq.  0.003  0.062    0.947  3089
Full        0.171  0.308    0.831  571
CHI         0.233  0.402    0.850  385
MaxCHI      0.238  0.412    0.848  356
LOGLIKE     0.191  0.343    0.836  489
IG          0.184  0.336    0.838  491
IGR         0.260  0.450    0.811  260
MEW         0.183  0.326    0.836  520
ICF         0.209  0.359    0.841  533
GEO         0.188  0.336    0.834  491
Ripley      0.236  0.432    0.849  306
Source(s): Han et al. [2014]
Models and Location Representation (NA)
Partition  Method  Acc    Acc@161  Acc@C  Median
k-D tree   KL      0.117  0.344    –      469
k-D tree   KL+IGR  0.161  0.437    –      273
k-D tree   NB      0.122  0.367    –      404
k-D tree   NB+IGR  0.153  0.432    –      280
City       NB      0.171  0.308    0.831  571
City       NB+IGR  0.260  0.450    0.811  260
City       LR      0.129  0.232    0.756  878
City       LR+IGR  0.229  0.406    0.842  369
* Acc is not comparable between different class representations
Source(s): Han et al. [2014]
Models and Location Representation (WORLD)
Partition  Method  Acc    Acc@161  Acc@C  Median
City       NB      0.081  0.200    0.807  886
City       NB+IGR  0.126  0.262    0.684  913
k-D tree   KL      0.116  0.283    –      564
k-D tree   KL+IGR  0.121  0.286    –      602
k-D tree   NB      0.119  0.289    –      553
k-D tree   NB+IGR  0.134  0.290    –      577

Summary:
• Feature selection improves geolocation prediction accuracy
• Model and location representation choices have less impact than on NA
Source(s): Han et al. [2014]
Adding Non-geotagged Data
• In addition to the geotagged tweets from each user, we often have non-geotagged tweets, which we can potentially use to expand the training/test user representation

Train  Test      Acc    Acc@161  Acc@C  Median
G      G         0.126  0.262    0.684  913
G+NG   G         0.170  0.323    0.733  615
G      G+NG      0.187  0.366    0.835  398
G+NG   G+NG      0.280  0.492    0.878  170
G      G-small   0.121  0.258    0.675  960
G      NG-small  0.114  0.248    0.666  1057

• Incorporating NG improves the prediction accuracy
• The difference between G-small and NG-small is minor
Source(s): Han et al. [2014]
Exploration of Language Influence
• All results to date on English data; some languages are highly predictive of location (e.g. Japanese, Finnish)
• Investigate interaction between language and geolocation accuracy:
  - Partition: city
  - Learner: multinomial naive Bayes
  - Training: IGR on geotagged multilingual data

Method                         Acc    Acc@161  Acc@C  Median
Per-language majority class    0.107  0.189    0.693  2805
Unified multilingual model     0.196  0.343    0.772  466
Monolingual partitioned model  0.255  0.425    0.802  302

Table: WORLD in multilingual setting

Language is a good indicator of location (EN hard!)
Source(s): Han et al. [2014]
User Metadata in Tweets
• Examples of user-declared location in public profile:
  - Calgary, Alberta
  - heat of Arizona
  - north east side of indy
  - -iN A Veryy Dope Place (:
  - hugging my big sister
• Examples of user-declared real names in public profile:
  - Michael Jordan
  - Yuji Matsumoto
  - Hinrich Schutze

Train NB classifier for each of user-declared location, timezone, self-description, registered real name, each represented as char n-grams
Source(s): Han et al. [2014]
Exploration of User Metadata (WORLD)
Classifier  Acc    Acc@161  Acc@C  Median
text        0.280  0.492    0.878  170
loc         0.405  0.525    0.834  92
tz          0.064  0.171    0.565  1330
desc        0.048  0.117    0.526  2907
rname       0.045  0.109    0.550  2611

• User-declared metadata contained in the tweet JSON object is highly predictive of location, esp. the self-declared location
Source(s): Han et al. [2014]
Stacking Metadata Classifiers
[Figure: Stacking-based Geolocation Prediction. Level-0 classifiers (TEXT, LOC, TZ, DESC, RNAME) are applied to the tweet text, location, time zone, description and real name fields; their level-0 predictions feed a level-1 logistic regression classifier, which produces the final prediction.]

Features       Acc    Acc@161  Acc@C  Median
0. text        0.280  0.492    0.878  170
1. 0. + loc    0.483  0.653    0.903  14
2. 1. + tz     0.490  0.665    0.917  9
3. 2. + desc   0.490  0.666    0.919  9
4. 3. + rname  0.491  0.667    0.919  9
Source(s): Han et al. [2014]
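The stacking idea can be sketched as follows: each level-0 classifier emits a probability distribution over locations, the distributions are concatenated, and a level-1 logistic regression is trained on the result. This is a toy, dependency-free illustration under stated assumptions: the function names, the hand-rolled gradient descent and its hyperparameters are invented for the example and are not the setup of Han et al.

```python
import math

def stack(*level0_distributions):
    """Concatenate per-classifier class distributions into one level-1 feature vector."""
    return [p for dist in level0_distributions for p in dist]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def train_level1(X, y, n_classes, epochs=200, lr=0.5):
    """Multinomial logistic regression fitted by batch gradient descent."""
    n_feats = len(X[0])
    W = [[0.0] * n_feats for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        gW = [[0.0] * n_feats for _ in range(n_classes)]
        gb = [0.0] * n_classes
        for x, label in zip(X, y):
            p = softmax([sum(w * v for w, v in zip(W[c], x)) + b[c]
                         for c in range(n_classes)])
            for c in range(n_classes):
                err = p[c] - (1.0 if c == label else 0.0)  # gradient of cross-entropy
                gb[c] += err
                for j, v in enumerate(x):
                    gW[c][j] += err * v
        for c in range(n_classes):
            b[c] -= lr * gb[c] / len(X)
            for j in range(n_feats):
                W[c][j] -= lr * gW[c][j] / len(X)
    return W, b

def predict_level1(W, b, x):
    scores = [sum(w * v for w, v in zip(W[c], x)) + b[c] for c in range(len(W))]
    return max(range(len(scores)), key=scores.__getitem__)
```

In a real stacking setup the level-0 outputs used to train the level-1 model would normally come from held-out data (e.g. cross-validation), so that the level-1 learner does not simply trust over-fitted level-0 predictions.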
Temporal Influence
• Can a model trained on “old” data generalise to “new” data?

WORLD: 10K time-homogeneous users
Features      Acc    Acc@161  Acc@C  Median
1. text       0.280  0.492    0.878  170
2. loc        0.405  0.525    0.834  92
3. tz         0.064  0.171    0.565  1330
1. + 2. + 3.  0.490  0.665    0.917  9

LIVE: 32K time-heterogeneous users
Features      Acc    Acc@161  Acc@C  Median
1. text       0.268  0.510    0.901  151
2. loc        0.326  0.465    0.813  306
3. tz         0.065  0.160    0.525  1529
1. + 2. + 3.  0.406  0.614    0.901  40
Source(s): Han et al. [2014]
Prediction Confidence I
More geolocatable user:
• Porting my mobile to Telstra is a brilliant idea, #vodafail
• @USER1 @USER2 @USER3 actually Kevin Rudd also has an active weibo account.
• @USER good memory, I can hardly remember the day I came to Melbourne.

Less geolocatable user:
• happy birthday to me
• i just finished my hw, oooh, too much
• Yes! all things are diffcult before they re easy

Not all users are equally predictable
Prediction Confidence II
• Rank users by confidence: probability (AP), probability ratio of 1st and 2nd prediction (PR), geo-proximity in top-10 predictions (PC), accumulated counts (FN) and weights (FW) of optimised features:

[Figure: Acc@161 (y-axis) against Recall (x-axis, 0.05 to 0.95) for each confidence measure: Absolute Probability (AP), Prediction Coherence (PC), Prediction Ratio (PR), Feature Number (FN), Feature Weight (FW)]
Source(s): Han et al. [2014]
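The recall-based evaluation can be sketched as: rank users by a confidence score, keep the top fraction, and measure accuracy on that subset only. The snippet below shows this for the AP and PR measures; the function names and the rounding rule for the cut-off are assumptions of this example, not details given in the source.

```python
def absolute_probability(dist):
    """AP: probability of the top-ranked class."""
    return max(dist)

def probability_ratio(dist):
    """PR: ratio of the highest to the second-highest class probability."""
    top = sorted(dist, reverse=True)
    return top[0] / top[1] if top[1] > 0 else float("inf")

def accuracy_at_recall(confidences, correct, recall):
    """Accuracy over the `recall` fraction of users with the highest confidence."""
    ranked = sorted(zip(confidences, correct), key=lambda pair: pair[0], reverse=True)
    k = max(1, round(recall * len(ranked)))  # number of users kept (assumed rounding)
    return sum(ok for _, ok in ranked[:k]) / k
```

If a confidence measure is informative, accuracy should fall as recall rises, because progressively less predictable users are admitted.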
The Road Ahead
• Network features are highly effective in user geolocation (more so than text features: Backstrom et al. [2010], Jurgens [2013]); much work to be done in combining the two
• Message-level geolocation is still very much an unsolved task
• How to keep the model temporally relevant?
• Interaction between lexical normalisation and geolocation
Summary
• User geolocation: supervised text-based multi-class classification problem
• Location Indicative Words improve model effectiveness and efficiency
• Model and location partition choices are less crucial than feature selection
• Adding non-geotagged data, language partitioning, and incorporating user metadata all improve the prediction accuracy
Twitter POS Tagging
• How is POS tagging for social media data (focusing on Twitter) different to POS tagging for any other text source?
  - deterministically taggable tokens (URLs, emoticons)
  - higher proportion of OOV words → lexical normalisation, beef up novel word handling rules, add word cluster information
  - lower reliability of casing → add more gazetteers
  - lots of untagged, little tagged data → incorporate semi-supervised retraining (e.g. bootstrapping)
  - some POS tag distinctions hard to make in social media → tweak POS tagset to remove certain distinctions (and add others)
Source(s): Gimpel et al. [2011], Derczynski et al. [2013], Owoputi et al. [2013]
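As a rough illustration of "deterministically taggable tokens", the sketch below tags URLs and a few simple emoticons by regular expression and leaves everything else for a statistical tagger. The patterns are deliberately simplistic and are not those of the cited taggers; the tag `U` is the CMU URL tag, while `E` for emoticons is an assumption of this example.

```python
import re

# Illustrative patterns only; real Twitter tokenisers use far more careful rules.
RULES = [
    ("U", re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)),  # URL
    ("E", re.compile(r"^[:;=8][-o*']?[)(\]\[dDpP/\\|]$")),        # simple emoticons
]

def pretag(tokens):
    """Return (token, tag) pairs; tag is None when a statistical tagger must decide."""
    tagged = []
    for tok in tokens:
        tag = next((t for t, pattern in RULES if pattern.match(tok)), None)
        tagged.append((tok, tag))
    return tagged
```

Tokens that come back with `None` would then be passed to the trained tagger, with the pre-assigned tags held fixed.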
Penn ↔ CMU POS Tagset
Penn POS tag(s)      CMU POS tag
NN, NNS              N
PRP, WP              O
NNP, NNPS            ^
MD, V*               V
J*                   A
RB, WRB              R
UH                   !
WDT, DT, WP$, PRP$   D
IN, TO               P
CC                   &
RP                   T
EX, PDT              X
CD                   $
—                    # (hashtag)
—                    @ (mention)
—                    U (URL)
...
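The rows listed on this slide can be written as a lookup, with the wildcard entries (`V*`, `J*`) handled as prefix rules. This sketch covers only the rows shown; the slide elides the remainder of the tagset, so any other Penn tag returns `None` rather than a guess.

```python
# Exact-match rows from the slide.
PENN_TO_CMU = {
    "NN": "N", "NNS": "N",
    "PRP": "O", "WP": "O",
    "NNP": "^", "NNPS": "^",
    "MD": "V",
    "RB": "R", "WRB": "R",
    "UH": "!",
    "WDT": "D", "DT": "D", "WP$": "D", "PRP$": "D",
    "IN": "P", "TO": "P",
    "CC": "&",
    "RP": "T",
    "EX": "X", "PDT": "X",
    "CD": "$",
}

# Wildcard rows: V* -> V, J* -> A.
PREFIX_RULES = [("V", "V"), ("J", "A")]

def penn_to_cmu(tag):
    """Map a Penn Treebank tag to its CMU tag, or None if the slide does not list it."""
    if tag in PENN_TO_CMU:
        return PENN_TO_CMU[tag]
    for prefix, cmu in PREFIX_RULES:
        if tag.startswith(prefix):
            return cmu
    return None
```

Exact matches are checked before prefixes so that a listed tag is never overridden by a wildcard rule.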
NLP for Social Media
• Lots of NLP research on Twitter
  - Lexical normalisation, text-based geolocation, POS tagging, named entity recognition, sentiment analysis ...
• Lexical semantics?
  - Challenges for WSD: short, noisy text (lack of reliable context)
  - Possible benefits:
    * possible benefits to applications (e.g. sentiment analysis)
    * possible insights into how social media and conventional text differ
Word Usage Patterns
Conventional text
• One sense per discourse [Gale et al., 1992]
• First-sense heuristic [McCarthy et al., 2004]

• One sense per tweeter?
  - documents are too small to consider applying one sense per discourse, but we can possibly address the lack of context with user-level sense priors
• First-sense heuristic?
  - shown to change substantially across domains, so not clear that it will work as well over Twitter
Resources
• Sense inventory: Macmillan Dictionary
  - coarse-grained senses
  - regularly updated
• Target lemmas: 20 nouns
  - high-to-mid frequency
  - medium polysemy: ≥ 3 senses
Source(s): Gella et al. [2014]
Datasets
• 4 datasets: {Twitter, ukWaC} × {rand, user}
  - ukWaC: more-conventional (web) text
• rand: random sample of usages from Twitter/ukWaC
• user: 5 usages of a given word from each user (Twitter) or document (ukWaC)
• 2000 items each: 100 usages of each noun
Source(s): Gella et al. [2014]
Annotation
• Use Amazon Mechanical Turk for annotation
• For each usage, pick the most appropriate sense(s), or “Other”
• Quality control
  - included some gold-standard Macmillan example sentences in each HIT
  - filtered annotations based on accuracy over these items
• Fleiss’ Kappa: 0.47–0.71
Source(s): Gella et al. [2014]
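For reference, Fleiss' kappa can be computed as below. This sketch assumes every item receives exactly one label from each of a fixed number of annotators; the annotation described above also allows multiple senses per usage, and how that case was handled in the cited study is not stated here.

```python
def fleiss_kappa(ratings):
    """ratings[i][j] = number of annotators who assigned item i to category j.

    Every row must sum to the same number of annotators.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cats = len(ratings[0])
    # Overall proportion of assignments falling in each category.
    p_j = [sum(row[j] for row in ratings) / (n_items * n_raters) for j in range(n_cats)]
    # Observed agreement for each item.
    P_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in ratings]
    P_bar = sum(P_i) / n_items        # mean observed agreement
    P_e = sum(p * p for p in p_j)     # agreement expected by chance
    return (P_bar - P_e) / (1 - P_e)
```

A value of 1 means perfect agreement and 0 means agreement no better than chance.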
Analysis
Average proportion of users/documents using a noun in the same sense across all 5 usages
• Twitter_user: 65%
• ukWaC_doc: 63%

One sense per tweeter heuristic is as strong as one sense per discourse
Source(s): Gella et al. [2014]
Analysis: Pairwise Agreement
                    Partition  Agreement (%)
Gale et al. (1992)  document   94.4
Twitter_user        user       95.4
Twitter_user        —          62.9
Twitter_rand        —          55.1
ukWaC_doc           document   94.2
ukWaC_doc           —          65.9
ukWaC_rand          —          60.2
Source(s): Gella et al. [2014]
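The two kinds of statistic reported here can be sketched as follows: the share of users (or documents) whose usages all carry one sense, and pairwise sense agreement computed either within a partition or over pooled usages. The exact procedure of Gella et al. may differ; the data layout (a dict from user or document to a list of sense labels) is an assumption of this example.

```python
from itertools import combinations

def same_sense_proportion(usages_by_unit):
    """Share of users/documents whose usages of a noun all carry a single sense."""
    units = usages_by_unit.values()
    return sum(len(set(senses)) == 1 for senses in units) / len(usages_by_unit)

def pairwise_agreement(senses):
    """Share of usage pairs with the same sense label, ignoring any partition."""
    pairs = list(combinations(senses, 2))
    return sum(a == b for a, b in pairs) / len(pairs)

def within_partition_agreement(usages_by_unit):
    """Pairwise agreement restricted to pairs drawn from the same user/document."""
    agree = total = 0
    for senses in usages_by_unit.values():
        for a, b in combinations(senses, 2):
            agree += a == b
            total += 1
    return agree / total
```

Comparing the within-partition figure with the pooled figure over the same usages shows how much of the agreement is attributable to the user or document.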
Other Lexical Semantic Tales
Comparing Twitter_rand and ukWaC_rand:
• First-sense tagging is less accurate in Twitter data
  - Twitter_rand: 45.3%
  - ukWaC_rand: 55.4%
• Sense distributions are less skewed on Twitter
  - sense entropy lower for ukWaC_rand for 15 nouns
• 8/20 nouns have different first senses
• More “Other” senses in Twitter data
  - Twitter_rand: 12.3%
  - ukWaC_rand: 6.6%
Source(s): Gella et al. [2014]
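Sense entropy and first-sense accuracy for one noun can be computed from its list of annotated senses, as in this small sketch. The base-2 logarithm and the function names are choices made for the example; the source does not specify them.

```python
import math
from collections import Counter

def sense_entropy(senses):
    """Shannon entropy (bits) of the sense distribution; lower means more skewed."""
    counts = Counter(senses)
    n = len(senses)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def first_sense_accuracy(senses, first_sense):
    """Accuracy of always predicting `first_sense` for every usage."""
    return sum(s == first_sense for s in senses) / len(senses)
```

A noun whose usages are spread evenly over its senses has high entropy, which is also the situation in which a first-sense baseline performs worst.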
Other Work on Lexical Semantic Analysis of Social Media
• “Usage similarity” in Twitter [Gella et al., 2013]
• Wikification/babelfication [Mihalcea and Csomai, 2007, Ferragina and Scaiella, 2010, Moro et al., 2014]
• WordNet supersense tagging of Twitter data [Johannsen et al., to appear]
Opportunities for Lexical Semantic Analysis of Social Media
• Impact of time on sense distributions (per user or overall)?
• Interaction between geospatial and sociolinguistic factors on sense preferences?
• Network/thread-level analysis (also for comments associated with given document, user forum threads)
• Word sense (d)evolution over streamed data
• Geospatial word sense dispersal
• Interaction between user profile and word usage?
Summary
• One sense per tweeter?
  - at least as strong as one sense per discourse
• First-sense heuristic?
  - first-sense tagging is less accurate for Twitter
Example Thread
HTML Input Code - CNET Coding & scripting Forums

Post 1 (User A): HTML Input Code
  ...Please can someone tell me how to create an input box that asks the user to enter their ID, and then allows them to press go. It will then redirect to the page ...
Post 2 (User B): Re: html input code
  Part 1: create a form with a text field. See ... Part 2: give it a Javascript action
Post 3 (User C): asp.net c# video
  I’ve prepared for you video.link click ...
Post 4 (User A): Thank You!
  Thanks a lot for that ... I have Microsoft Visual Studio 6, what program should I do this in? Lastly, how do I actually include this in my site? ...
Post 5 (User D): A little more help
  ... You would simply do it this way: ... You could also just ... An example of this is ...

Callouts: External Link; External Video; 500 words in total
Source(s): http://forums.cnet.com/
Discourse Structure of Forum Threads
Post 1 (User A): HTML Input Code
  ...Please can someone tell me how to create an input box that asks the user to enter their ID, and then allows them to press go. It will then redirect to the page ...
  Label: Question-Question (linked to Ø)
Post 2 (User B): Re: html input code
  Part 1: create a form with a text field. See ... Part 2: give it a Javascript action
  Label: Answer-Answer
Post 3 (User C): asp.net c# video
  I’ve prepared for you video.link click ...
  Label: Answer-Answer
Post 4 (User A): Thank You!
  Thanks a lot for that ... I have Microsoft Visual Studio 6, what program should I do this in? Lastly, how do I actually include this in my site? ...
  Labels: Answer-Confirmation, Question-Add
Post 5 (User D): A little more help
  ... You would simply do it this way: ... You could also just ... An example of this is ...
  Label: Answer-Answer
Source(s): Kim et al. [2010]
Research Aim and Contributions
• Aim:
  - jointly classify the discourse structure of forum threads
• Contributions:
  - apply structural learning and dependency parsing
  - in situ classification analysis
Source(s): Wang et al. [2011b]
Dataset
• From Kim et al. [2010], 1332 posts spanning 315 threads from CNET
• Each post is labelled with one or more links, each link is labelled with a dialogue act
  - Question
    * Question, Add, Correction, Confirmation
  - Answer
    * Answer, Add, Objection, Confirmation
  - Resolution
  - Reproduction
  - Other
• Most common label: 1+Answer-Answer (28.4%)
Source(s): Wang et al. [2011b]
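Labels such as `1+Answer-Answer` combine a link with a dialogue act. The sketch below splits such a label into its parts, on the assumption that the number is the distance back to the linked post (with 0 marking a thread-initial post that links to nothing). That reading is consistent with the label format shown here but should be checked against the cited papers; the function names are illustrative.

```python
def parse_label(label):
    """Split e.g. '3+Question-Add' into (3, 'Question', 'Add').

    Dialogue acts without a subclass (e.g. 'Resolution') yield an empty subclass.
    """
    distance, act = label.split("+", 1)
    da_class, _, da_subclass = act.partition("-")
    return int(distance), da_class, da_subclass

def link_target(post_number, label):
    """1-based number of the post this label links to, or None for distance 0.

    Assumes the numeric prefix is a distance counted backwards from the current post.
    """
    distance, _, _ = parse_label(label)
    return post_number - distance if distance > 0 else None
```

Under this reading, `3+Question-Add` on the fourth post of a thread would link back to the first post.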
Recap
Post 1 (User A), HTML Input Code: 0+Question-Question (Ø)
Post 2 (User B), Re: html input code: 1+Answer-Answer
Post 3 (User C), asp.net c# video: 2+Answer-Answer
Post 4 (User A), Thank You!: 1+Answer-Confirmation, 3+Question-Add
Post 5 (User D), A little more help: 4+Answer-Answer
Task Description
• Main task: joint classification of inter-post links (Link) and dialogue acts (DA)
• Explore two different learning approaches to the task
  - a linear-chain CRF (CRFSGD)
  - a dependency parser (MaltParser)
• The task is a natural fit for dependency parsing, with some special properties:
  ⊕ strict reverse-chronological directionality (100%)
  - non-projective dependencies (2%)
  - multi-headedness (6%)
  - disconnected sub-graphs (2%)
Source(s): Wang et al. [2011b]
Features
• Structural features:
  - Initiator: binary feature indicating whether the current post’s author is the thread initiator
  - Position: relative position of the current post
• Semantic features:
  - TitSim: relative location of the post which has the most similar title to the current post.
  - PostSim: relative location of the post which has the most similar content to the current post.
  - Punct: number of question marks (QusCount), exclamation marks (ExcCount) and URLs (UrlCount) in the current post.
  - UserProf: class distribution of the current post’s author
Source(s): Kim et al. [2010]
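The simpler features above can be extracted directly from a thread, as in this sketch. The thread representation (a list of dicts with `author` and `text`), the URL pattern and the normalisation used for Position are assumptions of the example; TitSim, PostSim and UserProf are omitted because they need a similarity measure and training labels.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")  # simplistic URL matcher (assumption)

def post_features(thread, i):
    """Features for the i-th post (0-based) of `thread`.

    thread: list of dicts with 'author' and 'text' keys (hypothetical schema).
    """
    post = thread[i]
    return {
        "Initiator": post["author"] == thread[0]["author"],
        "Position": i / len(thread),  # one possible normalisation of relative position
        "QusCount": post["text"].count("?"),
        "ExcCount": post["text"].count("!"),
        "UrlCount": len(URL_PATTERN.findall(post["text"])),
    }
```

The resulting dict can be fed to any feature-based learner after the usual conversion to a numeric vector.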
Post/thread-level Joint Classification F-scores
Method      CRFSGD (post/thread)  MaltParser (post/thread)
Heuristic   .515/.311
NoFeatures  .508/.394             .533/.356
Joint +ALL  .756/.578             .738/.578

• Post-level analysis
  - Initiator affects MaltParser significantly
• Thread-level analysis
  - the best thread-level F-scores from the two learners are not significantly different
Source(s): Wang et al. [2011b]
Threads Evolve Over Time
The same example thread, shown as it grows (each step adds to the previous one):
1. Post 1 (User A), HTML Input Code: Question-Question (linked to Ø)
2. Post 2 (User B), Re: html input code, and Post 3 (User C), asp.net c# video: Answer-Answer, Answer-Answer
3. Post 4 (User A), Thank You!: Answer-Confirmation, Question-Add
4. Post 5 (User D), A little more help: Answer-Answer
Text Processing of Social Media III Saarland University (17/7/2014)
Threads Evolve Over Time
[Figure: the example thread “HTML Input Code”, built up post by post, with each post labelled by its dialogue act(s)]
Post 1 (User A), Question-Question (Ø): HTML Input Code ... Please can someone tell me how to create an input box that asks the user to enter their ID, and then allows them to press go. It will then redirect to the page ...
Post 2 (User B), Answer-Answer: Re: html input code. Part 1: create a form with a text field. See ... Part 2: give it a Javascript action
Post 3 (User C), Answer-Answer: asp.net c# video. I’ve prepared for you video.link click ...
Post 4 (User A), Answer-Confirmation and Question-Add: Thank You! Thanks a lot for that ... I have Microsoft Visual Studio 6, what program should I do this in? Lastly, how do I actually include this in my site? ...
Post 5 (User D), Answer-Answer: A little more help... You would simply do it this way: ... You could also just ... An example of this is ...
• In situ classification: compare the accuracy of different models when applied to partial threads vs. complete threads.
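The in situ setup can be sketched as follows: cut each thread into prefixes of increasing length, run the model on each prefix, and score it against the gold labels for that prefix. This is an illustrative sketch only. The function names, the `classify` callback (a hypothetical stand-in for any model that maps a list of posts to one label per post) and the rule that a prefix is only produced when it is strictly shorter than the thread are assumptions, not details taken from the experiments.

```python
def thread_prefixes(posts, lengths=(2, 4, 6, 8)):
    """Yield (name, prefix) pairs for partial threads, then the full thread.

    Assumption: a prefix of length n is only produced when the thread is
    strictly longer than n, so that it differs from the full thread.
    """
    for n in lengths:
        if n < len(posts):
            yield f"[1, {n}]", posts[:n]
    yield "[All]", posts


def in_situ_accuracy(posts, gold_labels, classify, lengths=(2, 4, 6, 8)):
    """Accuracy of `classify` on each prefix of a non-empty thread.

    `classify` is hypothetical: it takes a list of posts and returns one
    predicted label per post.
    """
    results = {}
    for name, prefix in thread_prefixes(posts, lengths):
        predicted = classify(prefix)
        gold = gold_labels[:len(prefix)]
        correct = sum(p == g for p, g in zip(predicted, gold))
        results[name] = correct / len(prefix)
    return results


if __name__ == "__main__":
    posts = ["post 1", "post 2", "post 3", "post 4", "post 5"]
    gold = ["Question-Question", "Answer-Answer", "Answer-Answer",
            "Answer-Confirmation", "Answer-Answer"]

    # Toy stand-in model: first post is a question, the rest are answers.
    def toy_classify(prefix):
        return ["Question-Question"] + ["Answer-Answer"] * (len(prefix) - 1)

    print(in_situ_accuracy(posts, gold, toy_classify))
```

The same loop works with any other per-post metric in place of accuracy.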
Classify the “Evolving Threads”
[Figure: the example thread “HTML Input Code” (CNET Coding & scripting Forums), classified at three stages: first 2 posts, first 4 posts, all posts]
Evaluation of In situ Classification
[Figure: the complete example thread shown with its 2-post and 4-post sub-threads, with evaluation over the first 2 posts and then over the first 4 posts]
In Situ Classification
• Link-DA F-score for CRFSGD/MaltParser for in situ classification over sub-threads of different lengths, broken down over different post extents
Test \ Breakdown   [1, 2]      [1, 4]      [1, 6]      [1, 8]      [All]
[1, 2]             .947/.947   -           -           -           -
[1, 4]             .946/.947   .836/.841   -           -           -
[1, 6]             .946/.947   .840/.841   .800/.794   -           -
[1, 8]             .946/.947   .840/.841   .800/.794   .780/.769   -
[All]              .946/.946   .840/.838   .800/.791   .776/.767   .756/.738
  - From this, we conclude that our method can be robustly applied to real-time analysis of dynamically evolving threads.
Source(s): Wang et al. [2011b]
What’s Discourse Parsing got to do with it?
• All well and good, but:
  (a) does discourse parsing actually aid information access over user forums?
  (b) are our models accurate enough to be useful?
• Explore these questions relative to the ancestry.com dataset of Elsas [2011], in the context of IR
• Best IR model of Elsas [2011] = perform IR over individual posts in each thread, score the thread via the geometric mean of the top-k retrieved posts’ scores (k = 5)
Source(s): Wang et al. [2013]
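The thread-scoring step of that baseline can be sketched as below. This is only an illustration of "geometric mean of the top-k post scores"; the function names, the handling of threads with fewer than k retrieved posts (all available scores are used) and the treatment of non-positive scores (thread score 0.0) are assumptions, and the actual system of Elsas [2011] may handle these cases differently.

```python
import math


def thread_score(post_scores, k=5):
    """Geometric mean of the k highest post retrieval scores in a thread.

    Assumptions: scores are positive (e.g. probabilities); a thread with
    fewer than k scored posts uses all of them; an empty thread or any
    non-positive score among the top k gives 0.0.
    """
    top = sorted(post_scores, reverse=True)[:k]
    if not top or min(top) <= 0:
        return 0.0
    return math.exp(sum(math.log(s) for s in top) / len(top))


def rank_threads(thread_to_post_scores, k=5):
    """Return thread IDs ordered from highest to lowest thread score."""
    return sorted(
        thread_to_post_scores,
        key=lambda tid: thread_score(thread_to_post_scores[tid], k),
        reverse=True,
    )


if __name__ == "__main__":
    # Hypothetical per-post retrieval scores for two threads.
    scores = {"thread_a": [0.9, 0.1], "thread_b": [0.4, 0.4, 0.3]}
    print(rank_threads(scores))
```

Compared with an arithmetic mean, the geometric mean penalises a thread whose top-k set contains a very weak post, so it favours threads that are consistently relevant.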
IR Evaluation over Ancestry Dataset
                             DA Subset   mAP_pref   p_pref@10
IR baseline [Elsas, 2011]                .657       .664
DAs                          +ALL        .668       .672
                             –Qq         .674       .678
Source(s): Wang et al. [2013]
Other NLP Research over Web Forums
• Thread classification for topic, “solvedness” etc. [Feng et al., 2006, Baldwin et al., 2007, Wang et al., 2012]
• Thread structure analysis, e.g. for summarisation [Wang and Rose, 2010] or information retrieval [Seo et al., 2009, Wang et al., 2011a]
• Expert finding [Jurczyk and Agichtein, 2007, Bouguessa et al., 2008, Lin et al., 2009]
• Question–answer pair extraction [Cong et al., 2008, Ding et al., 2008, Hong and Davison, 2009]
• Post quality assessment [Weimer et al., 2007, Wanas et al., 2008, Lui and Baldwin, 2009]
The Road Ahead
• Better user support within forums (duplicate question detection, thread recommendation, thread routing)
• Research generally focused on specific forums; much to be done on cross-forum analysis (forum recommendation, cross-forum thread routing)
• General-purpose discourse analyser for forum threads?
• More use of user priors
Summary
• “Discourse parsing” of DA and link structure of web user forum threads, via structured classification and dependency parsing
• Findings:
  - empirically little to separate simple CRF model and dependency parsing
  - in situ classification: our method is robust over dynamically evolving threads
• Demonstration of utility of discourse parsing in an IR context
Data Restrictions: Twitter
• Twitter is famously “open” as a service:
  - possible to crawl any (undeleted) public tweet ever posted to Twitter (in a rate-limited way)
  - possible to access a “random” sub-sample of tweets via the Streaming API (“garden hose” vs. “fire-hose”)
• Each tweet is provided as a JSON object containing a wealth of message and user data (various user meta-data, geotag, language, basic social network data, thread data, ...)
• It is not possible, however, to redistribute Twitter data, other than in the form of tweet IDs (for others to recrawl)
• It is possible to crawl social network data from Twitter, but this is heavily rate-limited
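To make the "tweet IDs only" restriction concrete, a crawled collection stored as one JSON tweet object per line could be reduced to a shareable ID list roughly as follows. This is a sketch under assumptions: the function name and the one-object-per-line storage format are hypothetical, and the only field relied on is `id_str`, the string form of the tweet ID in Twitter's tweet JSON.

```python
import json


def extract_tweet_ids(json_lines):
    """Collect tweet IDs from an iterable of JSON-encoded tweet objects."""
    ids = []
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines (e.g. stream keep-alives)
        message = json.loads(line)
        # Streams can also contain non-tweet messages (e.g. deletion
        # notices), which have no top-level 'id_str' field.
        if "id_str" in message:
            ids.append(message["id_str"])
    return ids


if __name__ == "__main__":
    sample = [
        '{"id": 1, "id_str": "1", "text": "hello"}',
        "",
        '{"delete": {"status": {"id": 2}}}',
    ]
    print(extract_tweet_ids(sample))
```

Anyone receiving the ID list then has to recrawl the tweets themselves, which is why deleted or protected tweets make such datasets shrink over time.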
Data Restrictions: Other Sites
• YouTube terms of use similar to Twitter (but YouTube used much less for social media research)
• Facebook offers very limited access to its data, and there is hence relatively little published research relating to it
• Individual forums vary considerably in the terms of use of their data, with many commercial forums banning crawling, but many community-run forums having relatively permissive licenses
• Wikipedia is perhaps the most open social media site: all content is available via a Creative Commons Attribution-ShareAlike 3.0 Unported License
Research Datasets
• The Wikimedia Foundation provides periodic dumps of Wikipedia, which are heavily used for research purposes (although version data is often not provided in publications)
• Spinn3r made available a large crawl of blog data as part of ICWSM-2011 [Burton et al., 2011], which is widely used
• Various Twitter datasets have been made available, in the form of tweet IDs for others to crawl
  - issues for reproducibility of published results over Twitter; similarly with YouTube datasets
• Large-scale datasets of user-tagged images (e.g. from Flickr)
• Various other individual datasets made available through ICWSM
Ethics of Social Media Research
• In “traditional” NLP datasets, the data is generated by organisations (commercial or otherwise) and published/licensed directly by that organisation, often without information identifying the authors of individual documents (with some exceptions, e.g. BNC, ANC)
• In the case of social media sites, the data is generated by individual users of that site, often for personal use, in some cases with user-specific data privacy settings (e.g. Twitter, Facebook), and other cases with site-specific privacy settings (e.g. forums, Wikipedia)
• User accounts are often associated with publicly-accessible user-declared profile data (e.g. age, gender, date-of-birth, location, ...), as well as site-generated “activity” statistics (e.g. date joined site, number of posts, number of followers, average user rating, ...)
How to be Ethical when Using Social Media Data
• For publicly-available, publicly-crawlable data, surely anything goes?!
• NO!
  - important to get institutional ethics (“IRB”) approval for social media data in cases where there is any interaction with the users or any publication of user-identifying information
  - generally OK to publish “aggregated” models/datasets (within the terms of use of a given site), as long as it is not possible to extract identifying user data from it
• Vulnerability of anonymised social media datasets to “privacy attacks” (as the source data is often public)
• Researchers are often faced with tradeoffs between scientific reproducibility and ethical data use
Ethics of In-Site Social Media Research
• Social media sites are continually rolling out new functionality, or improving existing functionality, as part of which they use user interaction data to validate/A-B test new functionalities
• Ideally, users should be made aware of any A-B testing (as it involves direct user interaction), but there is the subtle question of whether, in drawing users’ attention to the testing, the test is “faithful”
• Infamous recent case of A-B testing relating to Facebook “news feeds”, in looking at the correlation between positive/negative information in a user’s feeds, and the sentiment in their own posts [Kramer et al., 2014]
  - ethical?
Final Words
• Much has been done over social media, but even more remains to be done
• Different social media sources present different challenges, but one common theme is the use of user information and various types of linking information
• There is more to social media than Twitter
• Social media is a many-splendored thing, with lots of room to play for all, and many open challenges for NLP
Acknowledgements
These slides are based heavily on joint work with Paul Cook, Spandana Gella, Bo Han, Su Nam Kim, Marco Lui, Joakim Nivre and Li Wang. The research was supported in part by the Australian Research Council, and in part by NICTA. NICTA is funded by the Australian government as represented by the Department of Broadband, Communication and Digital Economy, and the Australian Research Council through the ICT Centre of Excellence programme.
References
Lars Backstrom, Eric Sun, and Cameron Marlow. Find me if you can: improving geographical prediction with social and spatial proximity. In Proceedings of the 19th International Conference on World Wide Web, pages 61–70, Raleigh, USA, 2010.
Timothy Baldwin, David Martinez, and Richard B. Penman. Automatic thread classification for Linux user forum information access. In Proceedings of the Twelfth Australasian Document Computing Symposium (ADCS 2007), pages 72–79, Melbourne, Australia, 2007.
Mohamed Bouguessa, Benoît Dumoulin, and Shengrui Wang. Identifying authoritative actors in question-answering forums: The case of Yahoo! Answers. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’08), pages 866–874, Las Vegas, Nevada, USA, 2008. URL http://doi.acm.org/10.1145/1401890.1401994.
Kevin Burton, Niels Kasch, and Ian Soboroff. The ICWSM 2011 Spinn3r dataset. In Proceedings of the 5th International Conference on Weblogs and Social Media (ICWSM 2011), Barcelona, Spain, 2011.
Gao Cong, Long Wang, Chin-Yew Lin, Young-In Song, and Yueheng Sun. Finding question-answer pairs from online forums. In Proceedings of 31st International ACM-SIGIR Conference on Research and Development in Information Retrieval (SIGIR’08), pages 467–474, Singapore, 2008.
Leon Derczynski, Alan Ritter, Sam Clark, and Kalina Bontcheva. Twitter part-of-speech tagging for all: Overcoming sparse and noisy data. In Proceedings of RANLP 2013 (Recent Advances in Natural Language Processing), Hissar, Bulgaria, 2013.
Shilin Ding, Gao Cong, Chin-Yew Lin, and Xiaoyan Zhu. Using conditional random fields to extract context and answers of questions from online forums. In Proceedings of the 46th Annual Meeting of the ACL: HLT (ACL 2008), pages 710–718, Columbus, USA, 2008.
Jonathan L. Elsas. Ancestry.com online forum test collection. Technical report, Carnegie Mellon University, 2011.
Donghui Feng, Erin Shaw, Jihie Kim, and Eduard Hovy. Learning to detect conversation focus of threaded discussions. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics (HLT-NAACL ’06), pages 208–215, New York, USA, 2006.
Paolo Ferragina and Ugo Scaiella. TAGME: On-the-fly annotation of short text fragments (by Wikipedia entities). In Proceedings of the 19th ACM Conference on Information and Knowledge Management (CIKM 2010), pages 1625–1628, 2010.
William A. Gale, Kenneth W. Church, and David Yarowsky. One sense per discourse. In Proceedings of the 4th DARPA Speech and Natural Language Workshop, pages 233–237, 1992.
Spandana Gella, Paul Cook, and Bo Han. Unsupervised word usage similarity in social media texts. In Proceedings of the Second Joint Conference on Lexical and Computational Semantics (*SEM 2013), pages 248–253, Atlanta, USA, 2013. URL http://www.aclweb.org/anthology/S13-1036.
Spandana Gella, Paul Cook, and Timothy Baldwin. One sense per tweeter ... and other lexical semantic tales of Twitter. In Proceedings of the 14th Conference of the EACL (EACL 2014), pages 215–220, Gothenburg, Sweden, 2014.
Kevin Gimpel, Nathan Schneider, Brendan O’Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith. Part-of-speech tagging for Twitter: Annotation, features, and experiments. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL HLT 2011), pages 42–47, Portland, USA, 2011. URL http://www.aclweb.org/anthology/P11-2008.
Bo Han, Paul Cook, and Timothy Baldwin. Geolocation prediction in social media data by finding location indicative words. In Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012), pages 1045–1062, Mumbai, India, 2012.
Bo Han, Paul Cook, and Timothy Baldwin. Text-based Twitter user geolocation prediction. Journal of Artificial Intelligence Research, 49:451–500, 2014.
Liangjie Hong and Brian D. Davison. A classification-based approach to question answering in discussion boards. In Proceedings of the 32nd Annual ACM SIGIR Conference (SIGIR 2009), pages 171–178, Boston, Massachusetts, USA, 2009.
Anders Johannsen, Dirk Hovy, Héctor Martínez Alonso, Barbara Plank, and Anders Søgaard. More or less supervised super-sense tagging of Twitter. In Proceedings of the Third Joint Conference on Lexical and Computational Semantics (*SEM 2014), Dublin, Ireland, to appear.
Pawel Jurczyk and Eugene Agichtein. Discovering authorities in question answer communities by using link analysis. In Proceedings of the Sixteenth ACM Conference on Conference on Information and Knowledge Management (CIKM ’07), pages 919–922, Lisbon, Portugal, 2007. URL http://doi.acm.org/10.1145/1321440.1321575.
David Jurgens. That’s what friends are for: Inferring location in online social media platforms based on social relationships. In Proceedings of the 7th International Conference on Weblogs and Social Media (ICWSM 2013), pages 273–282, Boston, USA, 2013.
Su Nam Kim, Li Wang, and Timothy Baldwin. Tagging and linking web forum posts. In Proceedings of the 14th Conference on Natural Language Learning (CoNLL-2010), pages 192–202, Uppsala, Sweden, 2010.
Adam D.I. Kramer, Jamie E. Guillory, and Jeffrey T. Hancock. Experimental evidence of massive-scale emotional contagion through social networks. Proceedings of the National Academy of Sciences, 111(24):8788–8790, 2014.
Chen Lin, Jiang-Ming Yang, Rui Cai, Xin-Jing Wang, Wei Wang, and Lei Zhang. Modeling semantics and structure of discussion threads. In Proceedings of the 18th International Conference on the World Wide Web (WWW 2009), pages 1103–1104, Madrid, Spain, 2009.
Marco Lui and Timothy Baldwin. You are what you post: User-level features in threaded discourse. In Proceedings of the 14th Australasian Document Computing Symposium (ADCS 2009), Sydney, Australia, 2009.
Andrew McCallum and Kamal Nigam. A comparison of event models for Naive Bayes text classification. In Proceedings of the AAAI-98 Workshop on Learning for Text Categorization, Madison, USA, 1998. Available as Technical Report WS-98-05, AAAI Press.
Diana McCarthy, Rob Koeling, Julie Weeds, and John Carroll. Finding predominant senses in untagged text. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL 2004), pages 280–287, Barcelona, Spain, 2004.
Rada Mihalcea and Andras Csomai. Wikify!: Linking documents to encyclopedic knowledge. In Proceedings of the Sixteenth ACM Conference on Conference on Information and Knowledge Management, pages 233–242, Lisbon, Portugal, 2007. URL http://doi.acm.org/10.1145/1321440.1321475.
Andrea Moro, Alessandro Raganato, and Roberto Navigli. Entity linking meets word sense disambiguation: a unified approach. Transactions of the Association for Computational Linguistics, 2:231–244, 2014.
Olutobi Owoputi, Brendan O’Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider, and Noah A. Smith. Improved part-of-speech tagging for online conversational text with word clusters. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2013), pages 380–390, Atlanta, USA, 2013.
Stephen Roller, Michael Speriosu, Sarat Rallapalli, Benjamin Wing, and Jason Baldridge. Supervised text-based geolocation using language models on an adaptive grid. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning 2012 (EMNLP-CoNLL 2012), pages 1500–1510, Jeju Island, Korea, 2012. URL http://www.aclweb.org/anthology/D12-1137.
Jangwon Seo, W. Bruce Croft, and David A. Smith. Online community search using thread structure. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM 2009), pages 1907–1910, Hong Kong, China, 2009.
Nayer Wanas, Motaz El-Saban, Heba Ashour, and Waleed Ammar. Automatic scoring of online discussion posts. In Proceedings of the 2nd ACM Workshop on Information Credibility on the Web (WICOW’08), pages 19–26, Napa Valley, USA, 2008.
Hongning Wang, Chi Wang, ChengXiang Zhai, and Jiawei Han. Learning online discussion structures by conditional random fields. In Proceedings of the 34th Annual International ACM SIGIR Conference (SIGIR 2011), pages 435–444, Beijing, China, 2011a.
Li Wang, Marco Lui, Su Nam Kim, Joakim Nivre, and Timothy Baldwin. Predicting thread discourse structure over technical web forums. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP 2011), pages 13–25, Edinburgh, UK, 2011b.
Li Wang, Su Nam Kim, and Timothy Baldwin. The utility of discourse structure in identifying resolved threads in technical user forums. In Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012), pages 2739–2756, Mumbai, India, 2012.
Li Wang, Su Nam Kim, and Timothy Baldwin. The utility of discourse structure in forum thread retrieval. In Proceedings of the 9th Asian Information Retrieval Societies Conference (AIRS 2013), pages 284–295, Singapore, 2013.
Yi-Chia Wang and Carolyn P. Rose. Making conversational structure explicit: identification of initiation-response pairs within online discussions. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT 2010), pages 673–676, 2010.
Markus Weimer, Iryna Gurevych, and Max Mühlhäuser. Automatically assessing the post quality in online discussions on software. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions (ACL 2007), pages 125–128, Prague, Czech Republic, 2007.