
Text Processing of Social Media III Saarland University (17/7/2014)

Text Processing of Social Media III
User Geolocation; Twitter POS Tagging; Semantic and Discourse Analysis of Social Media; Restrictions and Ethics of Social Media Usage

Timothy Baldwin


Talk Outline

1 Geolocation Prediction (cont.)
  - Results

2 Other Pre-processing Tasks

3 Lexical Semantic Analysis of Twitter

4 Discourse Analysis of Web User Forums
  - Introduction
  - Experimental Setup
  - Experiments and Analysis
  - The Crowning Glory: Enhanced IR
  - Other NLP Research over Web Forums

5 Restrictions and Ethics of Social Media Usage

6 Overall Summary


Experimental Setup

• Datasets:
  - North America dataset (NA, Roller et al. [2012]): 500K users, 38M tweets
  - World dataset (WORLD, Han et al. [2012]): 1.4M users, 192M tweets

• Evaluation metrics:
  - Accuracy (Acc)
  - Accuracy within 161km (Acc@161), e.g., Frankfurt and Darmstadt
  - Country-level accuracy (Acc@C)
  - Median error distance (km)

Source(s): Han et al. [2014]
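The four metrics can be sketched as follows. This is a minimal illustration, not the evaluation code of Han et al. [2014]: the record layout (`city`, `country`, `coord`), the function names and the use of the haversine great-circle distance are all assumptions made for the example.

```python
import math
from statistics import median

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumption for this sketch)

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

def evaluate(predictions, gold):
    """predictions/gold: parallel lists of dicts with 'city', 'country', 'coord'."""
    n = len(gold)
    pairs = list(zip(predictions, gold))
    dists = [haversine_km(p["coord"], g["coord"]) for p, g in pairs]
    return {
        "acc": sum(p["city"] == g["city"] for p, g in pairs) / n,
        "acc@161": sum(d <= 161 for d in dists) / n,
        "acc@c": sum(p["country"] == g["country"] for p, g in pairs) / n,
        "median_km": median(dists),
    }
```

Under this sketch, predicting Darmstadt for a user in Frankfurt is wrong for Acc but correct for Acc@161 and Acc@C, since the two cities are well under 161km apart.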


Experimental Parameters

• Many variables and parameters to explore, including:
  - feature set: all tokens vs. LIWs (vs. L2 regularisation)
  - learner: multinomial naive Bayes (NB), Kullback-Leibler divergence (KL), logistic regression (LR)
  - location representation: city, k-d tree partitioned earth grid (Roller et al. [2012])
  - language: English only, multilingual
  - data: geotagged data, geotagged and non-geotagged data, metadata



Multinomial Naive Bayes

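• The basic formulation for multinomial NB is:

$$P(c_i \mid D) \propto P(c_i) \prod_{j=1}^{|V|} \frac{P(t_j \mid c_i)^{N_{D,t_j}}}{N_{D,t_j}!}$$

where $N_{D,t_j}$ is the frequency of term $t_j$ in test document $D$, $V$ is the set of all terms, and (with additive smoothing):

$$P(t \mid c_i) = \frac{1 + \sum_{k=1}^{|\mathcal{D}|} N_{k,t}\, P(c_i \mid D_k)}{|V| + \sum_{j=1}^{|V|} \sum_{k=1}^{|\mathcal{D}|} N_{k,t_j}\, P(c_i \mid D_k)}$$

with the sums over $k$ ranging over the training documents $D_k \in \mathcal{D}$.

• In practice, use addition of log-likelihoods rather than product of likelihoods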

Source(s): McCallum and Nigam [1998]
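A minimal log-space sketch of the multinomial NB formulation above, assuming hard training labels (so P(c|D_k) is 0 or 1) and add-one smoothing. The function names are hypothetical. The factorial term is dropped because it is the same for every class and so does not affect the argmax.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, city) pairs with hard labels."""
    class_docs = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for tokens, city in docs:
        class_docs[city] += 1
        term_counts[city].update(tokens)
        vocab.update(tokens)
    total = sum(class_docs.values())
    log_prior = {c: math.log(n / total) for c, n in class_docs.items()}
    log_lik = {}
    for c in class_docs:
        # Denominator: |V| plus the total term count observed for class c.
        denom = len(vocab) + sum(term_counts[c].values())
        log_lik[c] = {t: math.log((1 + term_counts[c][t]) / denom) for t in vocab}
    return log_prior, log_lik, vocab

def predict_nb(tokens, model):
    """Return the class with the highest summed log-prior and log-likelihoods."""
    log_prior, log_lik, vocab = model
    counts = Counter(t for t in tokens if t in vocab)  # unseen terms are ignored
    scores = {c: log_prior[c] + sum(n * log_lik[c][t] for t, n in counts.items())
              for c in log_prior}
    return max(scores, key=scores.get)
```

Summing logs avoids the numerical underflow that a product over a large vocabulary would cause.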


Results using LIWs (NB, NA)

Features     Acc     Acc@161   Acc@C   Median

Most Freq.   0.003   0.062     0.947   3089
Full         0.171   0.308     0.831   571

CHI          0.233   0.402     0.850   385
MaxCHI       0.238   0.412     0.848   356
LOGLIKE      0.191   0.343     0.836   489

IG           0.184   0.336     0.838   491
IGR          0.260   0.450     0.811   260
MEW          0.183   0.326     0.836   520

ICF          0.209   0.359     0.841   533
GEO          0.188   0.336     0.834   491
Ripley       0.236   0.432     0.849   306
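As an illustration of one of the feature selection methods in the table, the sketch below scores a word by information gain ratio (IGR), assuming the common definition: information gain of the word's presence/absence over the city distribution, divided by the intrinsic entropy of that presence/absence split. The slides do not spell out the formula, so both the definition and the function names are assumptions.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (bits) of a collection of non-negative counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum((n / total) * math.log2(n / total) for n in counts if n > 0)

def igr(word, docs):
    """docs: list of (set_of_words, city). Returns IG(word) / IV(word)."""
    with_w = Counter(c for words, c in docs if word in words)
    without_w = Counter(c for words, c in docs if word not in words)
    n_with, n_without = sum(with_w.values()), sum(without_w.values())
    n = n_with + n_without
    if n_with == 0 or n_without == 0:
        return 0.0  # the word does not split the data, so it carries no information
    h_city = entropy((with_w + without_w).values())
    info_gain = (h_city
                 - (n_with / n) * entropy(with_w.values())
                 - (n_without / n) * entropy(without_w.values()))
    intrinsic_value = entropy([n_with, n_without])
    return info_gain / intrinsic_value
```

Words would then be ranked by this score and only the top-ranked ones kept as features; the cut-off is a tuning parameter not shown here.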



Models and Location Representation (NA)

Partition   Method    Acc     Acc@161   Acc@C   Median

k-D tree    KL        0.117   0.344     –       469
k-D tree    KL+IGR    0.161   0.437     –       273
k-D tree    NB        0.122   0.367     –       404
k-D tree    NB+IGR    0.153   0.432     –       280

City        NB        0.171   0.308     0.831   571
City        NB+IGR    0.260   0.450     0.811   260
City        LR        0.129   0.232     0.756   878
City        LR+IGR    0.229   0.406     0.842   369

* Acc is not comparable between different class representations



Models and Location Representation (WORLD)

Partition   Method    Acc     Acc@161   Acc@C   Median

City        NB        0.081   0.200     0.807   886
City        NB+IGR    0.126   0.262     0.684   913

k-D tree    KL        0.116   0.283     –       564
k-D tree    KL+IGR    0.121   0.286     –       602
k-D tree    NB        0.119   0.289     –       553
k-D tree    NB+IGR    0.134   0.290     –       577

Summary:

• Feature selection improves geolocation prediction accuracy

• Less impact of model and location representation choice than on NA



Adding Non-geotagged Data

• In addition to the geotagged tweets from each user, we often have non-geotagged tweets, which we can potentially use to expand the training/test user representation (G = geotagged, NG = non-geotagged)

Train   Test       Acc     Acc@161   Acc@C   Median
G       G          0.126   0.262     0.684   913
G+NG    G          0.170   0.323     0.733   615
G       G+NG       0.187   0.366     0.835   398
G+NG    G+NG       0.280   0.492     0.878   170

G       G-small    0.121   0.258     0.675   960
G       NG-small   0.114   0.248     0.666   1057

• Incorporating NG improves the prediction accuracy

• The difference between G-small and NG-small is minor



Exploration of Language Influence

• All results to date on English data; some languages highly predictive of location (e.g. Japanese, Finnish)

• Investigate interaction between language and geolocation accuracy:
  - Partition: city
  - Learner: multinomial naive Bayes
  - Training: IGR on geotagged multilingual data

Method                          Acc     Acc@161   Acc@C   Median
Per-language majority class     0.107   0.189     0.693   2805
Unified multilingual model      0.196   0.343     0.772   466
Monolingual partitioned model   0.255   0.425     0.802   302

Table: WORLD in multilingual setting

Language is a good indicator of location (EN hard!)

Source(s): Han et al. [2014]


User Metadata in Tweets

• Examples of user-declared location in public profile:
  - Calgary, Alberta
  - heat of Arizona
  - north east side of indy
  - -iN A Veryy Dope Place (:
  - hugging my big sister

• Examples of user-declared real names in public profile:
  - Michael Jordan
  - Yuji Matsumoto
  - Hinrich Schütze

Train an NB classifier for each of user-declared location, timezone, self-description and registered real name, each represented as character n-grams



Exploration of User Metadata (WORLD)

Classifier   Acc     Acc@161   Acc@C   Median

text         0.280   0.492     0.878   170
loc          0.405   0.525     0.834   92
tz           0.064   0.171     0.565   1330
desc         0.048   0.117     0.526   2907
rname        0.045   0.109     0.550   2611

• User-declared metadata contained in the tweet JSON object is highly predictive of location, esp. the self-declared location



Stacking Metadata Classifiers

[Figure: Stacking-based geolocation prediction. Five level-0 classifiers (TEXT, LOC, TZ, DESC, RNAME), over tweet text, location, time zone, description and real name respectively, produce level-0 predictions that a level-1 logistic regression classifier combines into the final prediction.]

Features         Acc     Acc@161   Acc@C   Median
0. text          0.280   0.492     0.878   170
1. 0. + loc      0.483   0.653     0.903   14
2. 1. + tz       0.490   0.665     0.917   9
3. 2. + desc     0.490   0.666     0.919   9
4. 3. + rname    0.491   0.667     0.919   9



Temporal Influence

• Can a model trained on “old” data generalise to “new” data?

WORLD: 10K time-homogeneous users

Features        Acc     Acc@161   Acc@C   Median
1. text         0.280   0.492     0.878   170
2. loc          0.405   0.525     0.834   92
3. tz           0.064   0.171     0.565   1330
1. + 2. + 3.    0.490   0.665     0.917   9

LIVE: 32K time-heterogeneous users

Features        Acc     Acc@161   Acc@C   Median
1. text         0.268   0.510     0.901   151
2. loc          0.326   0.465     0.813   306
3. tz           0.065   0.160     0.525   1529
1. + 2. + 3.    0.406   0.614     0.901   40



Prediction Confidence I

More geolocatable user:

• Porting my mobile to Telstra is a brilliant idea, #vodafail

• @USER1 @USER2 @USER3 actually Kevin Rudd also has an active weibo account.

• @USER good memory, I can hardly remember the day I came to Melbourne.

Less geolocatable user:

• happy birthday to me

• i just finished my hw, oooh, too much

• Yes! all things are diffcult before they're easy

Not all users are equally predictable


Prediction Confidence II

• Rank users by confidence: probability (AP), probability ratio of 1st and 2nd prediction (PR), geo-proximity in top-10 predictions (PC), accumulated counts (FN) and weights (FW) of optimised features:

[Figure: Acc@161 (y-axis) plotted against recall (x-axis) for each confidence measure: Absolute Probability (AP), Prediction Coherence (PC), Prediction Ratio (PR), Feature Number (FN), Feature Weight (FW).]
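The idea of trading recall for accuracy can be sketched as below: compute a confidence score per user (here the prediction ratio, PR), keep only the most confident fraction of users, and measure Acc@161 over those. The function names and input layout are hypothetical, and the exact procedure used in the source work may differ.

```python
def prediction_ratio(probs):
    """PR: probability of the top-ranked class divided by that of the second.

    probs: dict mapping class -> probability; needs at least two classes and a
    non-zero second probability.
    """
    first, second = sorted(probs.values(), reverse=True)[:2]
    return first / second

def accuracy_at_recall(users, recall):
    """users: list of (confidence, within_161km) pairs; recall: fraction in (0, 1].

    Keeps the most confident `recall` fraction of users (at least one user) and
    returns the proportion of them predicted within 161km.
    """
    ranked = sorted(users, key=lambda u: u[0], reverse=True)
    kept = ranked[:max(1, round(recall * len(ranked)))]
    return sum(ok for _, ok in kept) / len(kept)
```

If the confidence measure is informative, accuracy should fall as recall rises, which is the shape such a curve is expected to show.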



The Road Ahead

• Network features highly effective in user geolocation (more so than text features: Backstrom et al. [2010], Jurgens [2013]); much work to be done in combining the two

• Message-level geolocation still very much an unsolved task

• How to keep the model temporally-relevant?

• Interaction between lexical normalisation and geolocation


Summary

• User geolocation: supervised text-based multi-class classification problem

• Location Indicative Words improve model effectiveness and efficiency

• Model and location partition choices are less crucial than feature selection

• Adding non-geotagged data, language partitioning, and incorporating user metadata all improve the prediction accuracy




Twitter POS Tagging

• How is POS tagging for social media data (focusing on Twitter) different to POS tagging for any other text source?
  - deterministically taggable tokens (URLs, emoticons)
  - higher proportion of OOV words → lexical normalisation, beef up novel word handling rules, add word cluster information
  - lower reliability of casing → add more gazetteers
  - lots of untagged, little tagged data → incorporate semi-supervised retraining (e.g. bootstrapping)
  - some POS tag distinctions hard to make in social media → tweak POS tagset to remove certain distinctions (and add others)

Source(s): Gimpel et al. [2011], Derczynski et al. [2013], Owoputi et al. [2013]


Penn ↔ CMU POS Tagset

Penn POS tag(s)          CMU POS tag

NN, NNS                  N
PRP, WP                  O
NNP, NNPS                ^
MD, V*                   V
J*                       A
RB, WRB                  R
UH                       !
WDT, DT, WP$, PRP$       D
IN, TO                   P
CC                       &
RP                       T
EX, PDT                  X
CD                       $
-                        # (hashtag)
-                        @ (mention)
-                        U (URL)

...




NLP for Social Media

• Lots of NLP research on Twitter
  - Lexical normalisation, text-based geolocation, POS tagging, named entity recognition, sentiment analysis ...

• Lexical semantics?
  - Challenges for WSD: short, noisy text (lack of reliable context)
  - Possible benefits:
    * possible benefits to applications (e.g. sentiment analysis)
    * possible insights into how social media and conventional text differ



Word Usage Patterns

Conventional text

• One sense per discourse [Gale et al., 1992]

• First-sense heuristic [McCarthy et al., 2004]

Twitter

• One sense per tweeter?
  - documents are too small to consider applying one sense per discourse, but we can possibly address the lack of context with user-level sense priors

• First-sense heuristic?
  - shown to change substantially across domains, so not clear that it will work as well over Twitter



Resources

• Sense inventory: Macmillan Dictionary
  - coarse-grained senses
  - regularly updated

• Target lemmas: 20 nouns
  - high-to-mid frequency
  - medium polysemy: ≥ 3 senses

Source(s): Gella et al. [2014]


Datasets

• 4 datasets: {Twitter, ukWaC} × {rand, user}
  - ukWaC: more-conventional (web) text

• rand: random sample of usages from Twitter/ukWaC

• user: 5 usages of a given word from each user (Twitter) or document (ukWaC)

• 2000 items each: 100 usages of each noun



Annotation

• Use Amazon Mechanical Turk for annotation

• For each usage, pick the most appropriate sense(s), or “Other”

• Quality control
  - included some gold-standard Macmillan example sentences in each HIT
  - filtered annotations based on accuracy over these items

• Fleiss’ Kappa: 0.47–0.71
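For reference, Fleiss' kappa can be computed as in the sketch below, which assumes the textbook setting: every item is rated by the same number of annotators, and each annotator picks exactly one category. Because annotators here could pick more than one sense, the figures above may have been computed with a variant of this; the function name and input layout are hypothetical.

```python
def fleiss_kappa(table):
    """table[i][j]: number of annotators who assigned item i to category j.

    Every row must sum to the same number of annotators (at least 2).
    Undefined (division by zero) when all ratings fall in a single category.
    """
    n_items = len(table)
    n_raters = sum(table[0])
    n_cats = len(table[0])
    # Overall proportion of assignments made to each category.
    p_cat = [sum(row[j] for row in table) / (n_items * n_raters) for j in range(n_cats)]
    # Observed agreement for each item.
    p_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
              for row in table]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    return (p_observed - p_expected) / (1 - p_expected)
```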



Analysis

Average proportion of users/documents using a noun in the same sense across all 5 usages

• Twitter_user: 65%

• ukWaC_doc: 63%

One sense per tweeter heuristic is as strong as one sense per discourse



Analysis: Pairwise Agreement

Dataset               Partition   Agreement (%)
Gale et al. (1992)    document    94.4
Twitter_user          user        95.4
Twitter_user          -           62.9
Twitter_rand          -           55.1
ukWaC_doc             document    94.2
ukWaC_doc             -           65.9
ukWaC_rand            -           60.2
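Two statistics of this kind can be sketched as follows: the proportion of partitions (users or documents) whose usages all share one sense, and pairwise agreement over usage pairs within a partition. This is an illustrative reading of the measures, assuming one sense label per usage; the function names are hypothetical and the exact protocol of Gella et al. [2014] may differ.

```python
from itertools import combinations

def all_same_sense(partitions):
    """Proportion of partitions whose usages all carry a single sense.

    partitions: list of lists of sense labels, one inner list per user/document.
    """
    return sum(len(set(senses)) == 1 for senses in partitions) / len(partitions)

def pairwise_agreement(partitions):
    """Share of within-partition usage pairs that carry the same sense."""
    pairs = [(a, b) for senses in partitions for a, b in combinations(senses, 2)]
    return sum(a == b for a, b in pairs) / len(pairs)
```

Pairwise agreement is more forgiving than the all-usages measure: a user with four usages in one sense and one in another still contributes 6 agreeing pairs out of 10.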



Other Lexical Semantic Tales

Comparing Twitter_rand and ukWaC_rand:

• First-sense tagging is less accurate in Twitter data
  - Twitter_rand: 45.3%
  - ukWaC_rand: 55.4%

• Sense distributions are less skewed on Twitter
  - sense entropy lower for ukWaC_rand for 15 nouns

• 8/20 nouns have different first senses

• More “Other” senses in Twitter data
  - Twitter_rand: 12.3%
  - ukWaC_rand: 6.6%
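Sense entropy and first-sense accuracy for a single noun can be computed as in this sketch, assuming one sense label per usage (function names are hypothetical). A lower entropy means a more skewed sense distribution, which is the setting in which a first-sense heuristic does well.

```python
import math
from collections import Counter

def sense_entropy(sense_labels):
    """Shannon entropy (bits) of the sense distribution over a noun's usages."""
    counts = Counter(sense_labels)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def first_sense_accuracy(sense_labels, first_sense):
    """Accuracy of tagging every usage with the given first (predominant) sense."""
    return sum(s == first_sense for s in sense_labels) / len(sense_labels)
```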



Other Work on Lexical Semantic Analysis of Social Media

• “Usage similarity” in Twitter [Gella et al., 2013]

• Wikification/babelfication [Mihalcea and Csomai, 2007, Ferragina and Scaiella, 2010, Moro et al., 2014]

• WordNet supersense tagging of Twitter data [Johannsen et al., to appear]


Opportunities for Lexical Semantic Analysis of Social Media

• Impact of time on sense distributions (per user or overall)?

• Interaction between geospatial and sociolinguistic factors on sense preferences?

• Network/thread-level analysis (also for comments associated with given document, user forum threads)

• Word sense (d)evolution over streamed data

• Geospatial word sense dispersal

• Interaction between user profile and word usage?


Summary

• One sense per tweeter?
  - at least as strong as one sense per discourse

• First-sense heuristic?
  - first-sense tagging is less accurate for Twitter



Introduction


Example Thread

Thread: HTML Input Code - CNET Coding & scripting Forums

Post 1 (User A), "HTML Input Code": ...Please can someone tell me how to create an input box that asks the user to enter their ID, and then allows them to press go. It will then redirect to the page ...

Post 2 (User B), "Re: html input code": Part 1: create a form with a text field. See ... Part 2: give it a Javascript action

Post 3 (User C), "asp.net c# video": I’ve prepared for you video. link click ...

Post 4 (User A), "Thank You!": Thanks a lot for that ... I have Microsoft Visual Studio 6, what program should I do this in? Lastly, how do I actually include this in my site? ...

Post 5 (User D), "A little more help...": You would simply do it this way: ... You could also just ... An example of this is ...

Source(s): http://forums.cnet.com/


[Figure: the same thread (Posts 1 to 5, Users A to D), annotated to show that it contains an external link and an external video, and runs to 500 words in total.]


Discourse Structure of Forum Threads

[Figure: the example thread annotated step by step with dialogue-act links. Post 1 carries a Question-Question link to the thread root (Ø); the complete five-post annotation adds three Answer-Answer links, one Answer-Confirmation link and one Question-Add link.]

Source(s): Kim et al. [2010]



Research Aim and Contributions

• Aim:

- jointly classify the discourse structure of forum threads

• Contributions:

- apply structural learning and dependency parsing
- in situ classification analysis

Source(s): Wang et al. [2011b]

Experimental Setup


Dataset

• From Kim et al. [2010], 1332 posts spanning 315 threads from CNET

• Each post is labelled with one or more links, and each link is labelled with a dialogue act

- Question

* Question, Add, Correction, Confirmation

- Answer

* Answer, Add, Objection, Confirmation

- Resolution
- Reproduction
- Other

• Most common label: 1+Answer-Answer (28.4%)



Recap

[Figure: the example thread (Posts 1 to 5) with each link label prefixed by its link distance: 0+Question-Question, 2+Answer-Answer, 4+Answer-Answer, 1+Answer-Answer, 1+Answer-Confirmation, 3+Question-Add; Ø marks the thread root.]
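A combined label such as 1+Answer-Answer can be unpacked into a link and a dialogue act as sketched below. The sketch assumes that the number is the distance, in posts, back to the linked post, and that distance 0 denotes a link to the thread root; the function names are hypothetical.

```python
def parse_label(label):
    """Split a label such as '1+Answer-Answer' into (distance, dialogue_act)."""
    distance, act = label.split("+", 1)
    return int(distance), act

def link_target(post_index, label):
    """Return the 1-based index of the post that `post_index` links to.

    Returns None for a distance of 0, which is read here as a link to the
    thread root (an assumption of this sketch).
    """
    distance, _ = parse_label(label)
    return None if distance == 0 else post_index - distance
```

For instance, a post in position 4 labelled 3+Question-Add links back to position 1 under this reading.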


Task Description

• Main task: joint classification of inter-post links (Link) and dialogue acts (DA)

• Explore two different learning approaches to the task

- a linear-chain CRF (CRFSGD)
- a dependency parser (MaltParser)

• The task is a natural fit for dependency parsing, with some special properties:

⊕ strict reverse-chronological directionality (100%)
⊖ non-projective dependencies (2%)
⊖ multi-headedness (6%)
⊖ disconnected sub-graphs (2%)



Features

• Structural features:

- Initiator: binary feature indicating whether the current post’s author is the thread initiator

- Position: relative position of the current post

• Semantic features:

- TitSim: relative location of the post which has the most similar title to the current post.

- PostSim: relative location of the post which has the most similar content to the current post.

- Punct: number of question marks (QusCount), exclamation marks (ExcCount) and URLs (UrlCount) in the current post.

- UserProf: class distribution of the current post’s author
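The simpler features in this list can be extracted as in the sketch below. The thread layout (a list of dicts with `author` and `text`), the URL pattern, and the reading of "relative position" as index divided by thread length are all assumptions; TitSim, PostSim and UserProf are omitted because they need a similarity measure and training-set statistics.

```python
import re

URL_RE = re.compile(r"https?://\S+")  # simple URL pattern (assumption)

def post_features(thread, index):
    """Features for the post at 0-based `index` in `thread`.

    thread: list of dicts, each with 'author' and 'text' keys.
    """
    post = thread[index]
    return {
        "Initiator": post["author"] == thread[0]["author"],
        "Position": index / len(thread),  # assumption: index over thread length
        "QusCount": post["text"].count("?"),
        "ExcCount": post["text"].count("!"),
        "UrlCount": len(URL_RE.findall(post["text"])),
    }
```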


Experiments and Analysis


Post/thread-level Joint Classification F-scores

Method        CRFSGD (post/thread)   MaltParser (post/thread)

Heuristic     .515/.311
NoFeatures    .508/.394              .533/.356
Joint +ALL    .756/.578              .738/.578

• Post-level analysis
  - Initiator affects MaltParser significantly

• Thread-level analysis
  - the best thread-level F-scores from the two learners are not significantly different



Threads Evolve Over Time

[Figure: the example thread growing from Post 1 alone to all five posts, with dialogue-act links (Question-Question, Answer-Answer, Answer-Confirmation, Question-Add) added as each new post arrives.]

• In situ classification: compare the accuracy of different models when applied to partial threads vs. complete threads.


Classify the “Evolving Threads”

[Figure: three snapshots of the example thread, classifying the first 2 posts, the first 4 posts, and all posts.]


Evaluation of In situ Classification

[Figure: predictions made over sub-threads of different lengths (first 2 posts, first 4 posts, all posts), evaluated first over the first 2 posts and then over the first 4 posts.]

In Situ Classification

• Link-DA F-score for CRFSGD/MaltParser for in situ classification over sub-threads of different lengths, broken down over different post extents

Test \ B/down   [1, 2]      [1, 4]      [1, 6]      [1, 8]      [All]

[1, 2]          .947/.947   -           -           -           -
[1, 4]          .946/.947   .836/.841   -           -           -
[1, 6]          .946/.947   .840/.841   .800/.794   -           -
[1, 8]          .946/.947   .840/.841   .800/.794   .780/.769   -
[All]           .946/.946   .840/.838   .800/.791   .776/.767   .756/.738

• From this, we conclude that our method can be robustly applied to real-time analysis of dynamically evolving threads.



What’s Discourse Parsing got to do with it?

• All well and good, but:

(a) does discourse parsing actually aid information access over user forums?

(b) are our models accurate enough to be useful?

• Explore these questions relative to the ancestry.com dataset of Elsas [2011], in the context of IR

• Best IR model of Elsas [2011] = perform IR over individual posts in each thread, score the thread via the geometric mean of the top-k retrieved posts’ scores (k = 5)

Source(s): Wang et al. [2013]
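The thread-scoring step of the baseline can be sketched as follows: take the retrieval scores of a thread's posts, keep the top k, and return their geometric mean. The function name is hypothetical, and two details are assumptions of this sketch rather than of Elsas [2011]: scores are taken to be strictly positive, and a thread with fewer than k retrieved posts is averaged over the posts it has.

```python
import math

def thread_score(post_scores, k=5):
    """Geometric mean of the top-k post retrieval scores for one thread.

    post_scores: non-empty list of strictly positive retrieval scores.
    """
    top = sorted(post_scores, reverse=True)[:k]
    # Mean of logs, then exponentiate: numerically safer than a raw product.
    return math.exp(sum(math.log(s) for s in top) / len(top))
```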


IR Evaluation over Ancestry Dataset

DA subset                    mAP_pref   p_pref@10
IR baseline [Elsas, 2011]    .657       .664
DAs +ALL                     .668       .672
DAs –Qq                      .674       .678

Source(s): Wang et al. [2013]


Other NLP Research over Web Forums

• Thread classification for topic, “solvedness” etc. [Feng et al., 2006, Baldwin et al., 2007, Wang et al., 2012]

• Thread structure analysis, e.g. for summarisation [Wang and Rose, 2010] or information retrieval [Seo et al., 2009, Wang et al., 2011a]

• Expert finding [Jurczyk and Agichtein, 2007, Bouguessa et al., 2008, Lin et al., 2009]

• Question–answer pair extraction [Cong et al., 2008, Ding et al., 2008, Hong and Davison, 2009]

• Post quality assessment [Weimer et al., 2007, Wanas et al., 2008, Lui and Baldwin, 2009]


The Road Ahead

• Better user support within forums (duplicate question detection, thread recommendation, thread routing)

• Research generally focused on specific forums; much to be done on cross-forum analysis (forum recommendation, cross-forum thread routing)

• General-purpose discourse analyser for forum threads?

• More use of user priors


Summary

• “Discourse parsing” of DA and link structure of web user forum threads, via structured classification and dependency parsing

• Findings:

- empirically little to separate simple CRF model and dependency parsing

- in situ classification: our method is robust over dynamically evolving threads

• Demonstration of utility of discourse parsing in an IR context




Data Restrictions: Twitter

• Twitter is famously “open” as a service:
  - possible to crawl any (undeleted) public tweet ever posted to Twitter (in a rate-limited way)
  - possible to access a “random” sub-sample of tweets via the Streaming API (“garden hose” vs. “fire-hose”)

• Each tweet object is provided as a JSON object containing a wealth of message and user data (various user meta-data, geotag, language, basic social network data, thread data, ...)

• It is not possible, however, to redistribute Twitter data, other than in the form of tweet IDs (for others to recrawl)

• It is possible to crawl social network data from Twitter, but heavily rate-limited


Data Restrictions: Other Sites

• YouTube terms of use similar to Twitter (but YouTube used much less for social media research)

• Facebook offers very limited access to its data, and there is hence relatively little published research relating to it

• Individual forums vary considerably in the terms of use of their data, with many commercial forums banning crawling, but many community-run forums having relatively permissive licenses

• Wikipedia is perhaps the most open social media site: all content is available via a Creative Commons Attribution-ShareAlike 3.0 Unported License


Research Datasets

• The Wikimedia Foundation provides periodic dumps of Wikipedia, which are heavily used for research purposes (although version data is often not provided in publications)

• Spinn3r made available a large crawl of blog data as part of ICWSM-2011 (Burton et al. [2011]), which is widely used

• Various Twitter datasets have been made available, in the form of tweet IDs for others to crawl
  - issues for reproducibility of published results over Twitter, and similarly with YouTube datasets

• Large-scale datasets of user-tagged images (e.g. from Flickr)

• Various other individual datasets made available through ICWSM


Ethics of Social Media Research

• In “traditional” NLP datasets, the data is generated by organisations (commercial or otherwise) and published/licensed directly by that organisation, often without information identifying the authors of individual documents (with some exceptions, e.g. BNC, ANC)

• In the case of social media sites, the data is generated by individual users of that site, often for personal use, in some cases with user-specific data privacy settings (e.g. Twitter, Facebook), and in other cases with site-specific privacy settings (e.g. forums, Wikipedia)

• User accounts are often associated with publicly-accessible user-declared profile data (e.g. age, gender, date-of-birth, location, ...), as well as site-generated “activity” statistics (e.g. date joined site, number of posts, number of followers, average user rating, ...)


How to be Ethical when Using Social Media Data

• For publicly-available, publicly-crawlable data, surely anything goes?!

• NO!
  - important to get institutional ethics (“IRB”) approval for social media data in cases where there is any interaction with the users or any publication of user-identifying information
  - generally OK to publish “aggregated” models/datasets (within the terms of use of a given site), as long as it is not possible to extract identifying user data from them

• Vulnerability of anonymised social media datasets to “privacy attacks” (as the source data is often public)

• Researchers are often faced with tradeoffs between scientific reproducibility and ethical data use


Ethics of In-Site Social Media Research

• Social media sites are continually rolling out new functionality, or improving existing functionality, as part of which they use user interaction data to validate/A-B test new functionalities

• Ideally, users should be made aware of any A-B testing (as it involves direct user interaction), but there is the subtle question of whether, once users’ attention has been drawn to the testing, the test is still “faithful”

• Infamous recent case of A-B testing relating to Facebook “news feeds”, looking at the correlation between positive/negative information in a user’s feeds and the sentiment in their own posts [Kramer et al., 2014]
  - ethical?




Final Words

• Much has been done over social media, but even more remains to be done

• Different social media sources present different challenges, but one common theme is the use of user information and various types of linking information

• There is more to social media than Twitter

• Social media is a many-splendored thing, with lots of room to play for all, and many open challenges for NLP


Acknowledgements

These slides are based heavily on joint work with Paul Cook, Spandana Gella, Bo Han, Su Nam Kim, Marco Lui, Joakim Nivre and Li Wang. The research was supported in part by the Australian Research Council, and in part by NICTA. NICTA is funded by the Australian government as represented by the Department of Broadband, Communication and Digital Economy, and the Australian Research Council through the ICT Centre of Excellence programme.


References

Lars Backstrom, Eric Sun, and Cameron Marlow. Find me if you can: improving geographical prediction with social and spatial proximity. In Proceedings of the 19th International Conference on World Wide Web, pages 61–70, Raleigh, USA, 2010.

Timothy Baldwin, David Martinez, and Richard B. Penman. Automatic thread classification for Linux user forum information access. In Proceedings of the Twelfth Australasian Document Computing Symposium (ADCS 2007), pages 72–79, Melbourne, Australia, 2007.

Mohamed Bouguessa, Benoît Dumoulin, and Shengrui Wang. Identifying authoritative actors in question-answering forums: The case of yahoo! answers. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’08), pages 866–874, Las Vegas, Nevada, USA, 2008. URL http://doi.acm.org/10.1145/1401890.1401994.

Kevin Burton, Niels Kasch, and Ian Soboroff. The ICWSM 2011 Spinn3r dataset. In Proceedings of the 5th International Conference on Weblogs and Social Media (ICWSM 2011), Barcelona, Spain, 2011.

Gao Cong, Long Wang, Chin-Yew Lin, Young-In Song, and Yueheng Sun. Finding question-answer pairs from online forums. In Proceedings of 31st International ACM-SIGIR Conference on Research and Development in Information Retrieval (SIGIR’08), pages 467–474, Singapore, 2008.

Leon Derczynski, Alan Ritter, Sam Clark, and Kalina Bontcheva. Twitter part-of-speech tagging for all: Overcoming sparse and noisy data. In Proceedings of RANLP 2013 (Recent Advances in Natural Language Processing), Hissar, Bulgaria, 2013.


Shilin Ding, Gao Cong, Chin-Yew Lin, and Xiaoyan Zhu. Using conditional random fields to extract context and answers of questions from online forums. In Proceedings of the 46th Annual Meeting of the ACL: HLT (ACL 2008), pages 710–718, Columbus, USA, 2008.

Jonathan L. Elsas. Ancestry.com online forum test collection. Technical report, Carnegie Mellon University, 2011.

Donghui Feng, Erin Shaw, Jihie Kim, and Eduard Hovy. Learning to detect conversation focus of threaded discussions. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics (HLT-NAACL ’06), pages 208–215, New York, USA, 2006.

Paolo Ferragina and Ugo Scaiella. TAGME: On-the-fly annotation of short text fragments (by Wikipedia entities). In Proceedings of the 19th ACM Conference on Information and Knowledge Management (CIKM 2010), pages 1625–1628, 2010.

William A. Gale, Kenneth W. Church, and David Yarowsky. One sense per discourse. In Proceedings of the 4th DARPA Speech and Natural Language Workshop, pages 233–237, 1992.

Spandana Gella, Paul Cook, and Bo Han. Unsupervised word usage similarity in social media texts. In Proceedings of the Second Joint Conference on Lexical and Computational Semantics (*SEM 2013), pages 248–253, Atlanta, USA, 2013. URL http://www.aclweb.org/anthology/S13-1036.


Spandana Gella, Paul Cook, and Timothy Baldwin. One sense per tweeter ... and other lexical semantic tales of Twitter. In Proceedings of the 14th Conference of the EACL (EACL 2014), pages 215–220, Gothenburg, Sweden, 2014.

Kevin Gimpel, Nathan Schneider, Brendan O’Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith. Part-of-speech tagging for Twitter: Annotation, features, and experiments. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL HLT 2011), pages 42–47, Portland, USA, 2011. URL http://www.aclweb.org/anthology/P11-2008.

Bo Han, Paul Cook, and Timothy Baldwin. Geolocation prediction in social media data by finding location indicative words. In Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012), pages 1045–1062, Mumbai, India, 2012.

Bo Han, Paul Cook, and Timothy Baldwin. Text-based Twitter user geolocation prediction. Journal of Artificial Intelligence Research, 49:451–500, 2014.

Liangjie Hong and Brian D. Davison. A classification-based approach to question answering in discussion boards. In Proceedings of the 32nd Annual ACM SIGIR Conference (SIGIR 2009), pages 171–178, Boston, Massachusetts, USA, 2009.

Anders Johannsen, Dirk Hovy, Héctor Martínez Alonso, Barbara Plank, and Anders Søgaard. More or less supervised super-sense tagging of Twitter. In Proceedings of the Third Joint Conference on Lexical and Computational Semantics (*SEM 2014), Dublin, Ireland, to appear.


References IVPawel Jurczyk and Eugene Agichtein. Discovering authorities in question answer

communities by using link analysis. In Proceedings of the Sixteenth ACM Conferenceon Conference on Information and Knowledge Management (CIKM ’07), pages919–922, Lisbon, Portugal, 2007. URLhttp://doi.acm.org/10.1145/1321440.1321575.

David Jurgens. That’s what friends are for: Inferring location in online social mediaplatforms based on social relationships. In Proceedings of the 7th InternationalConference on Weblogs and Social Media (ICWSM 2013), pages 273–282, Boston,USA, 2013.

Su Nam Kim, Li Wang, and Timothy Baldwin. Tagging and linking web forum posts. InProceedings of the 14th Conference on Natural Language Learning (CoNLL-2010),pages 192–202, Uppsala, Sweden, 2010.

Adam D.I. Kramer, Jamie E. Guillory, and Jeffrey T. Hancock. Experimental evidence ofmassive-scale emotional contagion through social networks. Proceedings of theNational Academy of Sciences, 111(24):8788–8790, 2014.

Chen Lin, Jiang-Ming Yang, Rui Cai, Xin-Jing Wang, Wei Wang, and Lei Zhang. Modeling semantics and structure of discussion threads. In Proceedings of the 18th International Conference on the World Wide Web (WWW 2009), pages 1103–1104, Madrid, Spain, 2009.

Marco Lui and Timothy Baldwin. You are what you post: User-level features in threaded discourse. In Proceedings of the 14th Australasian Document Computing Symposium (ADCS 2009), Sydney, Australia, 2009.


References V

Andrew McCallum and Kamal Nigam. A comparison of event models for Naive Bayes text classification. In Proceedings of the AAAI-98 Workshop on Learning for Text Categorization, Madison, USA, 1998. Available as Technical Report WS-98-05, AAAI Press.

Diana McCarthy, Rob Koeling, Julie Weeds, and John Carroll. Finding predominant senses in untagged text. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL 2004), pages 280–287, Barcelona, Spain, 2004.

Rada Mihalcea and Andras Csomai. Wikify!: Linking documents to encyclopedic knowledge. In Proceedings of the Sixteenth ACM Conference on Conference on Information and Knowledge Management, pages 233–242, Lisbon, Portugal, 2007. URL http://doi.acm.org/10.1145/1321440.1321475.

Andrea Moro, Alessandro Raganato, and Roberto Navigli. Entity linking meets word sense disambiguation: a unified approach. Transactions of the Association for Computational Linguistics, 2:231–244, 2014.

Olutobi Owoputi, Brendan O’Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider, and Noah A. Smith. Improved part-of-speech tagging for online conversational text with word clusters. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2013), pages 380–390, Atlanta, USA, 2013.


References VI

Stephen Roller, Michael Speriosu, Sarat Rallapalli, Benjamin Wing, and Jason Baldridge. Supervised text-based geolocation using language models on an adaptive grid. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning 2012 (EMNLP-CoNLL 2012), pages 1500–1510, Jeju Island, Korea, 2012. URL http://www.aclweb.org/anthology/D12-1137.

Jangwon Seo, W. Bruce Croft, and David A. Smith. Online community search using thread structure. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM 2009), pages 1907–1910, Hong Kong, China, 2009.

Nayer Wanas, Motaz El-Saban, Heba Ashour, and Waleed Ammar. Automatic scoring of online discussion posts. In Proceedings of the 2nd ACM Workshop on Information Credibility on the Web (WICOW’08), pages 19–26, Napa Valley, USA, 2008.

Hongning Wang, Chi Wang, ChengXiang Zhai, and Jiawei Han. Learning online discussion structures by conditional random fields. In Proceedings of the 34th Annual International ACM SIGIR Conference (SIGIR 2011), pages 435–444, Beijing, China, 2011a.

Li Wang, Marco Lui, Su Nam Kim, Joakim Nivre, and Timothy Baldwin. Predicting thread discourse structure over technical web forums. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP 2011), pages 13–25, Edinburgh, UK, 2011b.


References VII

Li Wang, Su Nam Kim, and Timothy Baldwin. The utility of discourse structure in identifying resolved threads in technical user forums. In Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012), pages 2739–2756, Mumbai, India, 2012.

Li Wang, Su Nam Kim, and Timothy Baldwin. The utility of discourse structure in forum thread retrieval. In Proceedings of the 9th Asian Information Retrieval Societies Conference (AIRS 2013), pages 284–295, Singapore, 2013.

Yi-Chia Wang and Carolyn P. Rosé. Making conversational structure explicit: identification of initiation-response pairs within online discussions. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT 2010), pages 673–676, 2010.

Markus Weimer, Iryna Gurevych, and Max Mühlhäuser. Automatically assessing the post quality in online discussions on software. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions (ACL 2007), pages 125–128, Prague, Czech Republic, 2007.
