collecting twitter data

Collecting Twitter data Dr. Cornelius Puschmann School of Library and Information Science Humboldt-University of Berlin / Humboldt Institute for Internet and Society 16 April 2013 Royal Statistical Society


DESCRIPTION

Talk held at the Royal Statistical Society in London as part of the event series "Blurring the boundaries - New social media, new social science?". I thank Grant Blank from the OII for inviting me to this exciting workshop.

TRANSCRIPT

Page 1: Collecting Twitter Data

Collecting Twitter data

Dr. Cornelius Puschmann

School of Library and Information Science, Humboldt-University of Berlin / Humboldt Institute for Internet and Society

16 April 2013

Royal Statistical Society

Page 2: Collecting Twitter Data

Overview

1. Examples of research using Twitter data

2. Twitter's data infrastructure

3. Tools for collecting data

4. Sampling issues

Page 3: Collecting Twitter Data

Examples of research using Twitter data

• Kwak, H., Lee, C., Park, H., & Moon, S. (2010). What is Twitter, a Social Network or a News Media? Proceedings of the 19th International Conference on the World Wide Web (WWW ’10) (pp. 591–600). Raleigh, NC.

• González-Bailón, S., Borge-Holthoefer, J., Rivero, A., & Moreno, Y. (2011). The dynamics of protest recruitment through an online network. Scientific Reports, 1, 197. doi:10.1038/srep00197

• Ausserhofer, J., & Maireder, A. (2013). National politics on Twitter: Structures and topics of a networked public sphere. Information, Communication & Society, 16(3), 291–314. doi:10.1080/1369118X.2012.756050

• Papacharissi, Z., & De Fatima Oliveira, M. (2012). Affective News and Networked Publics: The Rhythms of News Storytelling on #Egypt. Journal of Communication, 62(2), 266–282. doi:10.1111/j.1460-2466.2012.01630.x

Page 4: Collecting Twitter Data

Example questions

Hashtags, keywords, and geography
• How can the discussion of topic X be characterized?
• Who is participating in discussions on X?
• Where are users discussing X?

Twitter as a platform
• How can Twitter's structure be described?

Social graph
• Who follows whom?
• How does information spread?

Page 5: Collecting Twitter Data

Example questions

Prediction/application
• Can election results/flu outbreaks/consumption patterns be reliably predicted?

URLs in Twitter
• How is mass media content discussed?
• How are academic papers cited on Twitter?

Creative approaches
• Where, when, and with what devices do people call taxis?

Page 6: Collecting Twitter Data

#phdchat data set (30k tweets)

Page 7: Collecting Twitter Data

visualization of keywords using Gephi

Page 8: Collecting Twitter Data

Extracting Twitter data

Application Programming Interface (API)

HTTP request: return all data from a given user/hashtag/geolocation/...

Data (usually in a database or spreadsheet)
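The HTTP request step on this slide can be illustrated with a minimal sketch. It only builds the request URL and sends nothing. The endpoint and the `q` and `count` parameters are assumptions based on the v1.1 search API as it was documented around the time of the talk; authentication (OAuth) is left out, and `build_search_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumption: v1.1 search endpoint as documented around 2013.
# Real requests also needed OAuth credentials, omitted here.
SEARCH_ENDPOINT = "https://api.twitter.com/1.1/search/tweets.json"


def build_search_url(query, count=100):
    """Return the URL for one search request (nothing is sent)."""
    params = {"q": query, "count": count}
    # urlencode percent-encodes special characters such as '#'.
    return SEARCH_ENDPOINT + "?" + urlencode(params)


url = build_search_url("#phdchat")
print(url)
```

The response to such a request would then be parsed and written to a database or spreadsheet, as the slide indicates.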

Page 9: Collecting Twitter Data

Tweet in browser

Tweet source via API
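To make the contrast between the rendered tweet and its API source concrete, here is a small sketch that parses a tweet delivered as JSON and flattens it into one row. The tweet is invented for illustration, and the field names (`id_str`, `created_at`, `text`, `user.screen_name`, `entities.hashtags`) are assumed from the v1.1 tweet object; `flatten_tweet` is a hypothetical helper.

```python
import json

# Invented example tweet; field names assumed from the v1.1 tweet object.
RAW_TWEET = """
{
  "id_str": "1",
  "created_at": "Tue Apr 16 09:00:00 +0000 2013",
  "text": "Heading to the workshop #phdchat",
  "user": {"screen_name": "example_user"},
  "entities": {"hashtags": [{"text": "phdchat"}]}
}
"""


def flatten_tweet(raw):
    """Parse one tweet's JSON source and keep a few columns."""
    tweet = json.loads(raw)
    return {
        "id": tweet["id_str"],
        "created_at": tweet["created_at"],
        "user": tweet["user"]["screen_name"],
        "text": tweet["text"],
        # Join all hashtags into one space-separated cell.
        "hashtags": " ".join(h["text"] for h in tweet["entities"]["hashtags"]),
    }


row = flatten_tweet(RAW_TWEET)
print(row)
```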

Page 10: Collecting Twitter Data

Three Twitter APIs

Streaming API
• public, user, and site streams
• provides data in real time and largely unprocessed as it flows through the platform

REST API
• traditionally used by most client software
• v1.0 will be phased out in May 2013
• to be replaced by more restrictive v1.1

Search API
• same functionality as Twitter search
• rate-limited

1) data: tweets, social graph

2) complex tools needed

3) constraints on how much data can be captured
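One way a collection script might respect rate limits is a sliding-window check before each request, sketched below. The default limit (180 requests per 900 seconds) is a placeholder assumption, not a figure from the slides, and `seconds_to_wait` is a hypothetical helper.

```python
def seconds_to_wait(request_times, now, max_requests=180, window_seconds=900):
    """Seconds to pause before another request fits in the window.

    request_times: timestamps (in seconds) of earlier requests.
    The default limit values are placeholders, not Twitter's actual limits.
    """
    # Only requests inside the current window count against the limit.
    recent = sorted(t for t in request_times if now - t < window_seconds)
    if len(recent) < max_requests:
        return 0.0
    # Wait until enough of the oldest requests have left the window.
    blocking = recent[len(recent) - max_requests]
    return blocking + window_seconds - now


print(seconds_to_wait([0, 10, 20], now=30, max_requests=3, window_seconds=60))
```

A collector would sleep for the returned number of seconds before issuing its next request.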

Page 11: Collecting Twitter Data

Legal issues: Twitter's terms of service

"By submitting, posting or displaying Content on or through the Services, you grant us a worldwide, non-exclusive, royalty-free license (with the right to sublicense) to use, copy, reproduce, process, adapt, modify, publish, transmit, display and distribute such Content in any and all media or distribution methods (now known or later developed)."

"You agree that this license includes the right for Twitter to make such Content available to other companies, organizations or individuals who partner with Twitter for the syndication, broadcast, distribution or publication of such Content on other media and services, subject to our terms and conditions for such Content use."

"We encourage and permit broad re-use of Content. The Twitter API exists to enable this."

Page 12: Collecting Twitter Data

Legal issues: API rules

"You will not attempt or encourage others to: sell, rent, lease, sublicense, redistribute, or syndicate access to the Twitter API or Twitter Content to any third party without prior written approval from Twitter. If you provide an API that returns Twitter data, you may only return IDs (including tweet IDs and user IDs). You may export or extract non-programmatic, GUI-driven Twitter Content as a PDF or spreadsheet by using "save as" or similar functionality. Exporting Twitter Content to a datastore as a service or other cloud based service, however, is not permitted."

"Except as permitted through the Services (or these Terms), you have to use the Twitter API if you want to reproduce, modify, create derivative works, distribute, sell, transfer, publicly display, publicly perform, transmit, or otherwise use the Content or Services."

Page 13: Collecting Twitter Data

Tweet Archivist Desktop (Windows desktop software)

Page 14: Collecting Twitter Data

yourTwapperKeeper (runs on a dedicated web server)

Page 15: Collecting Twitter Data

140kit (hosted platform for academic research)

Page 16: Collecting Twitter Data

DataSift/Gnip (social data resellers)

Page 17: Collecting Twitter Data

Sampling approaches

Strategy #1: Sample by hashtag, keyword, user, geographical location, or other filtering parameters

- representativeness unclear on multiple levels

- time frame and parameters have to be carefully chosen

Strategy #2: Use the 1% or 10% sample provided by the Streaming API

+ generally assumed to be representative (of Twitter)

- time frame has to be carefully chosen

Strategy #3: Capture Twitter's entire throughput

+ highly representative (of Twitter)

- technically very difficult/costly
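The difference between filter-based sampling (Strategy #1) and a percentage sample (Strategy #2) can be sketched on an already collected list of tweets. This is an illustration only: the tweets are invented, `filter_sample` and `random_sample` are hypothetical helpers, and the uniform random draw is an assumption, since the slides do not say how the Streaming API sample is actually drawn.

```python
import random

# Invented tweets for illustration.
TWEETS = [
    {"user": "anna", "text": "Writing up my thesis #phdchat"},
    {"user": "ben", "text": "Lunch time"},
    {"user": "anna", "text": "Conference deadline looming #PhDchat!"},
]


def hashtags(text):
    """Return the lower-cased hashtags found in a tweet text."""
    words = (w.strip(".,!?:;") for w in text.split())
    return {w.lower() for w in words if w.startswith("#")}


def filter_sample(tweets, hashtag=None, keyword=None, user=None):
    """Strategy #1: keep tweets matching every filter that is set."""
    def matches(tweet):
        if hashtag and "#" + hashtag.lower() not in hashtags(tweet["text"]):
            return False
        if keyword and keyword.lower() not in tweet["text"].lower():
            return False
        if user and tweet["user"] != user:
            return False
        return True
    return [t for t in tweets if matches(t)]


def random_sample(tweets, fraction=0.01, seed=0):
    """Rough analogue of Strategy #2: keep each tweet with a fixed probability."""
    rng = random.Random(seed)
    return [t for t in tweets if rng.random() < fraction]


print(len(filter_sample(TWEETS, hashtag="phdchat")))
```

With filters, the result depends entirely on the chosen parameters, which is why the slide stresses choosing them carefully.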

Page 18: Collecting Twitter Data

Summary

1. develop a question/general direction

2. collect data using these or other tools

3. store in a database or spreadsheet (CSV)

4. annotate, analyze and visualize using a variety of tools (Excel, Tableau, R, Gephi, NVIVO, ...)
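The storage step could look like the following minimal sketch, which writes flattened tweets to CSV with Python's standard `csv` module. The column names and the example row are assumptions made for illustration; `write_csv` is a hypothetical helper.

```python
import csv
import io


def write_csv(rows, handle, fieldnames=("id", "created_at", "user", "text")):
    """Write tweet rows (dicts) to an open text handle as CSV."""
    writer = csv.DictWriter(handle, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


# Invented example row; an in-memory buffer stands in for a file.
rows = [
    {"id": "1", "created_at": "2013-04-16", "user": "example_user",
     "text": "Hello, #phdchat"},
]
buffer = io.StringIO()
write_csv(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, the handle could come from `open("tweets.csv", "w", newline="", encoding="utf-8")`; the resulting file opens directly in Excel, Tableau, or R.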