collecting twitter data w/ social feed manager


DESCRIPTION

A talk given at ELAG 2013 in Ghent, Belgium, on May 30, 2013.

TRANSCRIPT

collecting twitter data

w/ social feed manager

Daniel Chudnov - @dchud - dchud at gwu edu

ELAG 2013 - 2013-05-30 - Ghent, Belgium

tinyurl.com/dchud-elag-2013

social-feed-manager

• python / django

• user timelines, filter, sample, search

• simple display / export for user timelines

• free software, on github

social feed manager

github.com/gwu-libraries/social-feed-manager

a traditional project

1. expand scope of collection development

2. at-risk e-resource licensing story

3. save the time of the researcher

let’s start with the researcher

“How Mainstream News Outlets Use Twitter” (2011)

• GWU’s Kimberly Gross (SMPA) + students

• Pew Research Center’s Project for Excellence in Journalism

• “news agenda these organizations promoted on Twitter closely matches that of their legacy platforms”

http://www.journalism.org/analysis_report/how_mainstream_media_outlets_use_twitter

how do researchers study social media?

by hand.

• google reader

• copy and paste

• fold, spindle, mutilate

• excel

• ...eventually, SPSS and similar tools

whatever help

they can get

it’s a lot of work for not a lot of data

(1000s of tweets)

copy and paste to excel

doesn’t scale

just ask any student assigned to do this!

first tweet, in native JSON

a strategic disadvantage

5,000+ theses/dissertations

since 2010

(not all CS grad students)

see Leetaru et al., May 2013, First Monday

librarians can help here

what researchers ask for

• specific users, keywords

• historic time periods

• basic values: user, date, text, counts

• 10,000s, not 10,000,000s

• delimited files to import

options for historical data?

Twitter-licensed data providers:

DataSift, Gnip, Topsy

data providers

• friendly

• not cheap

• more than we need

• expensive

• still need tools to collect, process, etc.

what can we do ourselves?

social feed manager

github.com/gwu-libraries/social-feed-manager

what researchers ask for

• specific users, keywords

• historic time periods

• basic values: user, date, text, counts

• 10,000s, not 10,000,000s

• delimited files to import

can do this free w/ public API

twitter api

• user timelines

• filter streams

• spritzer

• search

up to 3,200 most recent tweets

any public user, 200 at a time

and go back again for more later

dev.twitter.com/docs/working-with-timelines
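
A minimal sketch of that paging pattern in Python (SFM is python/django), assuming Twitter API v1.1 and application-only bearer-token auth; the token and screen name are placeholders, and this illustrates the documented endpoint rather than SFM's own harvesting code.

```python
# Page through a public user's timeline, 200 tweets at a time, walking
# backwards with max_id toward the ~3,200-tweet limit, per the Twitter docs
# linked above. BEARER_TOKEN and the screen name are placeholders.
import requests

TIMELINE_URL = "https://api.twitter.com/1.1/statuses/user_timeline.json"
BEARER_TOKEN = "..."  # placeholder app-only token

def fetch_timeline(screen_name):
    """Yield a user's public tweets, newest first, walking back with max_id."""
    headers = {"Authorization": "Bearer " + BEARER_TOKEN}
    max_id = None
    while True:
        params = {"screen_name": screen_name, "count": 200}
        if max_id is not None:
            params["max_id"] = max_id
        resp = requests.get(TIMELINE_URL, headers=headers, params=params)
        resp.raise_for_status()
        tweets = resp.json()
        if not tweets:
            break
        for tweet in tweets:
            yield tweet
        # next page: everything older than the oldest tweet just seen
        max_id = min(t["id"] for t in tweets) - 1

# for tweet in fetch_timeline("some_screen_name"):
#     print(tweet["created_at"], tweet["text"])
```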

1,969,760 tweets from 1,228 users

group users in sets; export by user / set

all at once or time slices
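
A sketch of the kind of delimited export described here; it assumes collected tweets were saved as one JSON object per line and pulls out the basic values researchers ask for (user, date, text, counts). File names and column choices are illustrative, not SFM's actual export format.

```python
# Flatten collected tweet JSON into a CSV that opens in Excel or imports
# into SPSS. Input is assumed to be one tweet JSON object per line.
import csv
import json

def export_tweets(jsonl_path, csv_path):
    """Write one row per tweet with the basic values researchers ask for."""
    with open(jsonl_path) as infile, open(csv_path, "w", newline="") as outfile:
        writer = csv.writer(outfile)
        writer.writerow(["screen_name", "created_at", "text",
                         "retweet_count", "favorite_count"])
        for line in infile:
            tweet = json.loads(line)
            writer.writerow([
                tweet["user"]["screen_name"],
                tweet["created_at"],
                tweet["text"].replace("\n", " "),
                tweet.get("retweet_count", 0),
                tweet.get("favorite_count", 0),
            ])

# export_tweets("media_outlets_set.jsonl", "media_outlets_set.csv")
```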

40+ media outlets, 400+ elected officials

300+ journalists, 300+ GWU groups

filter streams

millions of tweets as they occur

around an event

filter streams

• filter by users, keywords, geo

• about 3,000 tweets / min *

• 10,000,000s of tweets

• political debates, news events

* a little more complicated than that
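
A minimal sketch of consuming the filter stream, assuming OAuth 1.0a user credentials via the requests and requests_oauthlib libraries; keywords and credentials are placeholders, and SFM's real stream handling covers reconnects, rate notices, and storage that this leaves out.

```python
# Consume statuses/filter: tweets matching the tracked keywords, delivered
# as line-delimited JSON as they occur.
import json

import requests
from requests_oauthlib import OAuth1

# placeholder OAuth 1.0a user credentials (the streaming API requires them)
auth = OAuth1("CONSUMER_KEY", "CONSUMER_SECRET",
              "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

def filter_stream(keywords):
    """Yield matching tweets as they arrive."""
    resp = requests.post(
        "https://stream.twitter.com/1.1/statuses/filter.json",
        auth=auth,
        data={"track": ",".join(keywords)},
        stream=True,
    )
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line:  # skip blank keep-alive lines
            yield json.loads(line.decode("utf-8"))

# e.g. collect around an event:
# for tweet in filter_stream(["#elag2013", "ghent"]):
#     print(tweet["user"]["screen_name"], tweet["text"])
```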

spritzer feed

• ~0.5% of all public tweets

• ~3,000,000 tweets / day (growing)

• a useful random sampling
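
The spritzer is the same consumption loop pointed at the sample endpoint; a minimal variant of the sketch above, reusing its placeholder auth object and imports:

```python
# ~0.5% sample of all public tweets (per the slide); same line-delimited
# JSON handling as the filter-stream sketch, just a different endpoint.
resp = requests.get("https://stream.twitter.com/1.1/statuses/sample.json",
                    auth=auth, stream=True)
for line in resp.iter_lines():
    if line:
        tweet = json.loads(line.decode("utf-8"))
```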

search

• after an event

• find users, keywords

• limited - better than nothing
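
A sketch of that after-the-fact lookup against the v1.1 search endpoint, again assuming app-only bearer-token auth; the query and token are placeholders, and as the slide says, coverage is limited.

```python
# Look back for tweets after an event via search/tweets.
import requests

SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"

def search_tweets(query, bearer_token):
    """Return up to 100 recent tweets matching the query."""
    resp = requests.get(
        SEARCH_URL,
        headers={"Authorization": "Bearer " + bearer_token},
        params={"q": query, "count": 100},
    )
    resp.raise_for_status()
    return resp.json().get("statuses", [])

# e.g. after an event: search_tweets("#elag2013", "BEARER_TOKEN")
```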

we can do all this

at no marginal cost for data *

* not really “big data” - GBs, not TBs

this much alone

meets several needs

this much alone

shows at-risk nature

when the Pope resigned

when Congress turned over

• 16+ accounts deleted / hidden

• combined 105,993 followers

• 14,479 tweets saved in SFM no longer public

if a researcher needs more

• support selection, acquisition, accession, storage, transformation

• collect what’s free around it to minimize cost

• plan purchase via grant

• collect prospectively

next steps

improving sfm

• support concurrent per-user filters / streams

• add Sina Weibo, YouTube, others as asked

drive selective, automated web archiving

ensure you can use sfm

you can have it! it’s free to use, copy, modify, redistribute

discovery?

the obvious solution

653 - subject added entry (uncontrolled) for hashtags

700 - name added entries for mentions

856 42 - URL of related resource for included links

500 - note for retweet count

336, 337, 338 - RDA ready!
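
Half in jest, but the mapping is concrete enough to sketch. A hypothetical illustration in plain Python (no MARC library; the field tags come from the slide, the indicators and subfield coding are invented for the example):

```python
# Hypothetical tweet-to-MARC field mapping from the slide above.
def tweet_to_marc_fields(tweet):
    fields = []
    for tag in tweet["entities"].get("hashtags", []):
        fields.append(("653", "  ", "$a #" + tag["text"]))            # uncontrolled subject
    for mention in tweet["entities"].get("user_mentions", []):
        fields.append(("700", "0 ", "$a @" + mention["screen_name"]))  # name added entry
    for url in tweet["entities"].get("urls", []):
        fields.append(("856", "42", "$u " + url["expanded_url"]))      # related resource
    fields.append(("500", "  ", "$a Retweet count: %d"
                   % tweet.get("retweet_count", 0)))                   # general note
    return fields
```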

w/ catmandu, slinging data around is fun and easy!

already indexed piles of tweets in ElasticSearch *

* really!
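
A minimal sketch of how tweet JSON can land in ElasticSearch over its HTTP API, assuming a local 2013-era instance; the index and type names and the input file are placeholders, not details from the talk.

```python
# PUT each tweet into ElasticSearch, one document per tweet, keyed by id_str.
import json
import requests

ES_URL = "http://localhost:9200/tweets/tweet/%s"  # placeholder index/type

def index_tweets(jsonl_path):
    for line in open(jsonl_path):
        tweet = json.loads(line)
        resp = requests.put(ES_URL % tweet["id_str"], data=line.encode("utf-8"))
        resp.raise_for_status()

# index_tweets("filter_stream_dump.jsonl")
```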

we will add 2 - 4 million catalog records per month

WorldCat can handle this

it’s web scale!

augmenting / creating authority records

w/ twitter screen names

already cleared it with a PCC / NACO rep!

Summon can handle this

Andrew is very familiar with growing consortial catalogs!

github.com/gwu-libraries/social-feed-manager

@dchud - dchud at gwu edu
