slicing big data: gambling, twitter & time sensitive information
DESCRIPTION
Presented at the Internet Researchers conference in Denver, CO -- 26 October 2013. Discusses Gambling, Reality TV, and World Events in the Context of Twitter Data, and selecting usable data from big data.
TRANSCRIPT
Gambling, Twitter & Time Sensitive Information
IR14 - Denver
@dpwoodford
Wednesday, 23 October 13
FORMAT
• Not going to simply repeat the paper.
• I will get to the gambling (& fantasy sports) examples, but want to discuss our wider work with large datasets.
• Happy to answer more specific questions about the use in the gambling industry.
• Examples from Sport, TV, Gambling & Fantasy Sports. A tour-de-force of current research projects
DEALING WITH THE TITLE: TWITTER
• Twitter => Large Data Sets, but specific research questions often require a small data set:
– Australian users
– Users registering on the platform during natural disasters
– ‘Experts’ on Fantasy Sports
– Sporting Participants: Golf, Tennis, NFL, College Football, etc.
– Reality TV ‘fanatics’
– Almost infinite examples
• Goal is to get from “Big Data” to what I’ve been calling “useful data”
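One way to cut a user database down to a subset such as "users registering during a natural disaster" is a date-window filter on the account creation time. The following is a minimal sketch, not the project's actual tooling. It assumes each user record is a dict with a `created_at` string in the format used by Twitter's v1.1 API; the function name, sample records, and window are hypothetical.

```python
from datetime import datetime, timezone

# Timestamp format used by Twitter's v1.1 API,
# e.g. "Wed Oct 23 12:00:00 +0000 2013" (assumes an English locale).
TWITTER_TIME = "%a %b %d %H:%M:%S %z %Y"

def registered_between(users, start, end):
    """Return the users whose account was created in [start, end)."""
    selected = []
    for user in users:
        created = datetime.strptime(user["created_at"], TWITTER_TIME)
        if start <= created < end:
            selected.append(user)
    return selected

# Hypothetical records and window, for illustration only.
users = [
    {"screen_name": "early_bird", "created_at": "Mon Jan 03 09:15:00 +0000 2011"},
    {"screen_name": "flood_watch", "created_at": "Wed Jan 12 02:30:00 +0000 2011"},
    {"screen_name": "latecomer", "created_at": "Tue Mar 01 18:00:00 +0000 2011"},
]
window_start = datetime(2011, 1, 10, tzinfo=timezone.utc)
window_end = datetime(2011, 1, 17, tzinfo=timezone.utc)
subset = registered_between(users, window_start, window_end)
```

The same pattern applies to the other subsets on this slide by swapping the predicate (for example, a location or bio keyword test instead of a date test).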
DEALING WITH THE TITLE: GAMBLING
• Long term interest in the gambling industry (one case study in my prior work on games).
• Many parallels between Gambling and Fantasy Sports (another current research project).
• When I was an ‘active participant’, Twitter was just becoming popular (2006-2010).
• It quickly became a crucial source of information, and websites started aggregating it.
DEALING WITH THE TITLE: TIME SENSITIVE INFORMATION
• Lines move incredibly fast: Just as much a market as day-trading on the stock exchange
WHY IS DATA SLICED?
• Streaming API is limited to ~1% of total tweets per second & Firehose access is expensive.
• Large data sets are not easily malleable, or visually analyzed (e.g. with Tableau):
– Our database of Twitter users is ~3.7TB, and growing.
– A week’s worth of selected TV data (current US shows) is 750MB in JSON format, and 600MB in TSV (selected fields), with millions of rows.
• Analyzing large data sets is slow, if it’s even possible => “Usable Data”
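The JSON-to-TSV reduction mentioned above can be sketched as follows. This is an illustrative example under stated assumptions, not the pipeline used in the project: it assumes one tweet JSON object per line, and the chosen fields (`id_str`, `created_at`, `user.screen_name`, `text`) are simply a plausible selection from the v1.1 tweet object. The function names are hypothetical.

```python
import csv
import io
import json

# Hypothetical selection of fields to keep.
FIELDS = ["id_str", "created_at", "screen_name", "text"]

def tweet_to_row(tweet):
    """Pick a few fields from one tweet object, flattening the nested user."""
    return [
        tweet.get("id_str", ""),
        tweet.get("created_at", ""),
        tweet.get("user", {}).get("screen_name", ""),
        # Collapse tabs and newlines so each tweet stays on one TSV row.
        " ".join(tweet.get("text", "").split()),
    ]

def json_lines_to_tsv(lines, out):
    """Write a header plus one TSV row per non-empty JSON line; return the row count."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(FIELDS)
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        writer.writerow(tweet_to_row(json.loads(line)))
        count += 1
    return count

# Invented sample record, for illustration only.
sample = [
    '{"id_str": "1", "created_at": "Wed Oct 23 12:00:00 +0000 2013", '
    '"user": {"screen_name": "example"}, "text": "Line one\\nline two", "lang": "en"}'
]
buffer = io.StringIO()
rows_written = json_lines_to_tsv(sample, buffer)
```

Writing to a separate file rather than overwriting the JSON keeps the original capture intact, so fields dropped here can still be recovered later.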
HOW IS DATA SLICED: COMPULSORY
HOW IS DATA SLICED: SELECTING FOR AUTHENTICITY -- WTA
HOW IS DATA SLICED: SELECTING FOR AUTHENTICITY -- FANTASY SPORTS
CLIP FROM YAHOO FANTASY FOOTBALL RE: CALVIN JOHNSON INJURY & TWITTER REPORTS
BUT YOU STILL NEED A SANITY CHECK
HOW IS DATA SLICED: RANDOM SAMPLING
Source: Tony Hirst (Open University UK)
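When the full set of items is too large to hold in memory, a uniform random sample can be drawn in one pass with reservoir sampling. The sketch below is a generic illustration of that technique and is not necessarily the method shown on the slide; the function name and parameters are hypothetical.

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Return up to k items chosen uniformly at random from an iterable.

    Uses reservoir sampling (Algorithm R), so the stream is read once
    and only k items are held in memory at a time.
    """
    rng = random.Random(seed)
    sample = []
    for index, item in enumerate(stream):
        if index < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the new item with probability k / (index + 1).
            j = rng.randint(0, index)
            if j < k:
                sample[j] = item
    return sample

# Illustration with invented user IDs.
user_ids = range(100000)
picked = reservoir_sample(user_ids, 500, seed=14)
```

Fixing the seed makes the sample reproducible, which helps when the same slice has to be revisited later.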
BUT SOMETIMES YOU NEED THE FULL SAMPLE & REPEATED CAPTURE
Source: Bruns / Woodford [Mapping Online Publics]
HOW IS DATA SLICED: ONLY A SMALL SAMPLE MATTERS
Floods, Earthquake, Tsunami
Media Coverage
HOW IS DATA SLICED: TV -- SEASONAL DATA VS EPISODIC
Impact of Live Feed
Delayed TV sucks
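The difference between a seasonal and an episodic view is largely a choice of time bucket. A minimal sketch, assuming tweet timestamps have already been parsed into `datetime` objects; the function name and the sample timestamps are invented for illustration.

```python
from collections import Counter
from datetime import datetime

def count_by_bucket(timestamps, fmt):
    """Count timestamps per bucket, where fmt is a strftime pattern naming the bucket."""
    return Counter(ts.strftime(fmt) for ts in timestamps)

# Invented timestamps: three tweets during one broadcast, one a week later.
timestamps = [
    datetime(2013, 7, 3, 20, 0, 15),
    datetime(2013, 7, 3, 20, 0, 45),
    datetime(2013, 7, 3, 20, 1, 5),
    datetime(2013, 7, 10, 20, 0, 30),
]

# Seasonal view: one bucket per day across the run of the show.
per_day = count_by_bucket(timestamps, "%Y-%m-%d")

# Episodic view: one bucket per minute within a broadcast.
per_minute = count_by_bucket(timestamps, "%Y-%m-%d %H:%M")
```

Per-minute buckets are the ones that would expose effects such as a delayed broadcast, since the same peak shows up shifted in time.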
HOW IS DATA SLICED: MOST ACTIVE ≠ REPRESENTATIVE
• Most active (#BB15, #BBLF) users often defend a HM to the death (akin to sporting tribalism), but most users are attackers (forthcoming paper w/ Katie Prowd)
Disclaimer: Scale changed to fit on slide
Source: Woodford / Prowd [Fan Cultures and Hatred in Big Brother 15: Race Rows, Elitism & Sporting Tribalism -- Forthcoming]
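A quick way to see how far the most active accounts dominate a hashtag is to measure their share of total tweet volume. This is an illustrative sketch only and not the analysis from the forthcoming paper; the function name and the sample data are hypothetical.

```python
from collections import Counter

def top_user_share(screen_names, top_n):
    """Return the fraction of all tweets sent by the top_n most active users."""
    counts = Counter(screen_names)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top_total = sum(count for _, count in counts.most_common(top_n))
    return top_total / total

# Invented data: one author name per tweet.
authors = ["a"] * 6 + ["b"] * 2 + ["c", "d"]
share = top_user_share(authors, 1)
```

If a handful of accounts produce most of the volume, analysis of the raw stream mostly describes those accounts, so they are worth separating out before generalizing to all users.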
TIME SLICES OF TWEET CONTENT ARE ENLIGHTENING
Source: Woodford / Prowd [Fan Cultures and Hatred in Big Brother 15: Race Rows, Elitism & Sporting Tribalism -- Forthcoming]
HOW IS DATA SLICED: MOST ACTIVE ≠ REPRESENTATIVE
Source: Woodford / Prowd [Fan Cultures and Hatred in Big Brother 15: Race Rows, Elitism & Sporting Tribalism -- Forthcoming]
• Twitter closed these quickly, yet the BB15 accounts remained active for much of the season...
AND A QUICK NOTE ON NON-TWITTER ANALYTICS
• There’s lots of data out there, but it needs to be sliced to be usable.
• You can work with large, original, data sets, but often this adds extra complexity that isn’t necessary to answer your research questions.
• But don’t delete the data you don’t need!
ACKNOWLEDGEMENTS
• ARC Centre of Excellence for Creative Industries and Innovation (CCI) - http://www.cci.edu.au & http://www.mappingonlinepublics.net
• Social Media Research Group -- http://socialmedia.qut.edu.au
• Queensland University of Technology