collecting twitter data w/social feed manager
DESCRIPTION
a talk given at ELAG 2013 in Ghent, Belgium, on May 30, 2013.
TRANSCRIPT
collecting twitter data w/ social feed manager
Daniel Chudnov - @dchud - dchud at gwu edu
ELAG 2013 - 2013-05-30 - Ghent, Belgium
tinyurl.com/dchud-elag-2013
social-feed-manager
• python / django
• user timelines, filter, sample, search
• simple display / export for user timelines
• free software, on github
social feed manager
github.com/gwu-libraries/social-feed-manager
a traditional project
1 expand scope of collection development
2 at-risk e-resource licensing story
3 save the time of the researcher
let’s start with the researcher
“How Mainstream News Outlets Use Twitter” (2011)
• GWU’s Kimberly Gross (SMPA) + students
• Pew Research Center’s Project for Excellence in Journalism
• “news agenda these organizations promoted on Twitter closely matches that of their legacy platforms”
http://www.journalism.org/analysis_report/how_mainstream_media_outlets_use_twitter
how do researchers study social media?
by hand.
• google reader
• copy and paste
• fold, spindle, mutilate
• excel
• ...eventually, SPSS and similar tools
whatever help
they can get
it’s a lot of work for not a lot of data
(1000s of tweets)
copy and paste to excel
doesn’t scale
just ask any student assigned to do this!
first tweet, in native JSON
a strategic disadvantage
5,000+ theses/dissertations
since 2010
(not all CS grad students)
see Leetaru et al., May 2013, First Monday
librarians can help here
what researchers ask for
• specific users, keywords
• historic time periods
• basic values: user, date, text, counts
• 10000s, not 10000000s
• delimited files to import
options for historical data?
Twitter-licensed data providers:
DataSift, Gnip, Topsy
data providers
• friendly
• not cheap
• more than we need
• expensive
• still need tools to collect, process, etc.
what can we do ourselves?
social feed manager
github.com/gwu-libraries/social-feed-manager
what researchers ask for
• specific users, keywords
• historic time periods
• basic values: user, date, text, counts
• 10000s, not 10000000s
• delimited files to import
can do this free w/ public API
twitter api
• user timelines
• filter streams
• spritzer
• search
up to 3,200 most recent tweets
any public user
200 at a time
and go back again for more later
dev.twitter.com/docs/working-with-timelines
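The paging described on this slide (200 tweets per request, up to 3,200 per user) can be sketched as a loop that keeps asking for tweets older than the oldest one seen so far. This is only an illustration, not SFM's actual code: `collect_timeline`, `make_fake_fetcher`, and the `fetch_page` callable are hypothetical names, and the fake fetcher stands in for a real, authenticated API call so the sketch runs offline.

```python
from typing import Callable, Dict, List, Optional

Tweet = Dict[str, object]
# Hypothetical callable: (max_id or None, count) -> tweets, newest first.
FetchPage = Callable[[Optional[int], int], List[Tweet]]


def collect_timeline(fetch_page: FetchPage,
                     page_size: int = 200,
                     limit: int = 3200) -> List[Tweet]:
    """Page backwards through one user's timeline until it runs out or hits the limit."""
    collected: List[Tweet] = []
    max_id: Optional[int] = None
    while len(collected) < limit:
        count = min(page_size, limit - len(collected))
        page = fetch_page(max_id, count)
        if not page:
            break  # nothing older is available
        collected.extend(page)
        # max_id is inclusive, so subtract 1 to avoid refetching the oldest tweet.
        max_id = min(int(t["id"]) for t in page) - 1
    return collected


def make_fake_fetcher(total: int) -> FetchPage:
    """Stand-in for the API (an assumption): serves ids total..1, newest first."""
    ids = list(range(total, 0, -1))

    def fetch(max_id: Optional[int], count: int) -> List[Tweet]:
        eligible = [i for i in ids if max_id is None or i <= max_id]
        return [{"id": i} for i in eligible[:count]]

    return fetch


short_timeline = collect_timeline(make_fake_fetcher(500))
long_timeline = collect_timeline(make_fake_fetcher(5000))
```

"Going back again for more later" would reuse the same loop in the other direction, asking only for tweets newer than the highest id already stored.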
1,969,760 tweets from 1,228 users
group users in sets
export by user / set
all at once or time slices
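An export by user set and time slice might look roughly like the following. It is a minimal sketch, not SFM's exporter: `export_tweets` is a hypothetical name, and it assumes each tweet is a dict in Twitter's JSON shape with `user.screen_name`, `created_at`, `text`, and `retweet_count`. It writes the basic values researchers ask for (user, date, text, counts) as a tab-delimited file.

```python
import csv
import io
from datetime import datetime, timezone

# Date format used in Twitter's JSON, e.g. "Thu May 30 09:00:00 +0000 2013".
TWITTER_DATE = "%a %b %d %H:%M:%S %z %Y"


def export_tweets(tweets, out, screen_names=None, start=None, end=None):
    """Write matching tweets to `out` as tab-delimited rows; return the row count.

    screen_names: optional set of users to keep (case-insensitive).
    start / end: optional timezone-aware datetimes; the slice is [start, end).
    """
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["user", "date", "text", "retweet_count"])
    wanted = {n.lower() for n in screen_names} if screen_names else None
    rows = 0
    for tweet in tweets:
        name = tweet["user"]["screen_name"]
        created = datetime.strptime(tweet["created_at"], TWITTER_DATE)
        if wanted is not None and name.lower() not in wanted:
            continue
        if start is not None and created < start:
            continue
        if end is not None and created >= end:
            continue
        # Collapse newlines and tabs so each tweet stays on one row.
        text = " ".join(tweet["text"].split())
        writer.writerow([name, created.isoformat(), text,
                         tweet.get("retweet_count", 0)])
        rows += 1
    return rows
```

Leaving `start` and `end` unset exports everything at once; passing both gives one time slice, and the resulting file imports directly into excel or SPSS.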
40+ media outlets
400+ elected officials
300+ journalists
300+ GWU groups
filter streams
millions of tweets as they occur around an event
filter streams
• filter by users, keywords, geo
• about 3,000 tweets / min *
• 10,000,000s of tweets
• political debates, news events
* a little more complicated than that
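Filtering by users, keywords, and geo corresponds to the `follow`, `track`, and `locations` parameters of Twitter's streaming filter endpoint, each a comma-separated string. The helper below only assembles those parameters; `build_filter_params` is a hypothetical name, the bounding box in the usage example is made up, and opening the authenticated stream itself is left out.

```python
def build_filter_params(user_ids=(), keywords=(), boxes=()):
    """Assemble request parameters for a filter stream.

    user_ids: numeric user ids to follow.
    keywords: words or phrases to track.
    boxes: bounding boxes as (west_lon, south_lat, east_lon, north_lat),
           i.e. southwest corner first, longitude before latitude.
    """
    params = {}
    if user_ids:
        params["follow"] = ",".join(str(u) for u in user_ids)
    if keywords:
        params["track"] = ",".join(keywords)
    if boxes:
        params["locations"] = ",".join(
            str(coord) for box in boxes for coord in box)
    if not params:
        raise ValueError("need at least one of users, keywords, or geo")
    return params


# Example with invented values: one user id, two keywords, one rough box.
example_params = build_filter_params(
    user_ids=[12],
    keywords=["elag2013", "code4lib"],
    boxes=[(3.6, 51.0, 3.8, 51.1)],
)
```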
spritzer feed
• ~0.5% of all public tweets
• ~3,000,000 tweets / day (growing)
• a useful random sampling
search
• after an event
• find users, keywords
• limited - better than nothing
we can do all this at no marginal cost for data*
* not really “big data” - GBs, not TBs
this much alone meets several needs
this much alone shows at-risk nature
when the Pope resigned
when Congress turned over
• 16+ accounts deleted / hidden
• combined 105,993 followers
• 14,479 tweets saved in SFM no longer public
if a researcher needs more
• support selection, acquisition, accession, storage, transformation
• collect what’s free around it to minimize cost
• plan purchase via grant
• collect prospectively
next steps
improving sfm
• support concurrent per-user filters / streams
• add Sina Weibo, YouTube, others as asked
drive selective, automated web archiving
ensure you can use sfm
you can have it! it’s free to use, copy, modify, redistribute
discovery?
the obvious solution
653 - subject added entry, uncontrolled for hashtags
700 - name added entries for mentions
856 42 - URL of related resource for included links
500 - note for retweet count
336, 337, 338 - RDA ready!
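Taking the slide's mapping at face value, the sketch below pulls hashtags, mentions, links, and the retweet count out of a tweet's JSON and pairs them with the field tags listed above. `tweet_to_fields` is a hypothetical helper and the output is a plain list of tuples, not a real MARC record. Only the 856 indicators come from the slide; the blank indicators on the other fields and the wording of the 500 note are assumptions.

```python
def tweet_to_fields(tweet):
    """Return (tag, indicators, value) tuples for one tweet, in slide order."""
    entities = tweet.get("entities", {})
    fields = []
    # 653: uncontrolled subject terms from hashtags (blank indicators assumed).
    for tag in entities.get("hashtags", []):
        fields.append(("653", "  ", tag["text"]))
    # 700: name added entries from mentions (blank indicators assumed).
    for mention in entities.get("user_mentions", []):
        fields.append(("700", "  ", mention["screen_name"]))
    # 856 42: related resource for each included link.
    for url in entities.get("urls", []):
        fields.append(("856", "42", url["expanded_url"]))
    # 500: general note carrying the retweet count (note wording assumed).
    fields.append(("500", "  ",
                   "Retweet count: %d" % tweet.get("retweet_count", 0)))
    return fields


# Invented example tweet, trimmed to the keys the sketch reads.
example_fields = tweet_to_fields({
    "retweet_count": 3,
    "entities": {
        "hashtags": [{"text": "elag2013"}],
        "user_mentions": [{"screen_name": "dchud"}],
        "urls": [{"expanded_url":
                  "http://github.com/gwu-libraries/social-feed-manager"}],
    },
})
```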
w/ catmandu
slinging data around is fun and easy!
already indexed piles of tweets in ElasticSearch*
* really!
we will add 2 - 4 million catalog records per month
WorldCat can handle this
it’s web scale!
augmenting / creating authority records w/ twitter screen names
already cleared it with a PCC / NACO rep!
Summon can handle this
Andrew is very familiar with growing consortial catalogs!
github.com/gwu-libraries/social-feed-manager
@dchud
dchud @ gwu edu