Mining the Social Web for Fun and Profit: A Getting Started Guide
DESCRIPTION
A presentation to the Nashville Data Science Meetup that introduces Mining the Social Web as an open source software project and book, its virtual machine experience, the codebase, and a brief primer on data mining with Twitter.
TRANSCRIPT
![Page 1: Mining the Social Web for Fun and Profit: A Getting Started Guide](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554e847fb4c90573338b45e0/html5/thumbnails/1.jpg)
Mining the Social Web for Fun and Profit:
A Getting Started Guide
Matthew A. Russell - @ptwobrussell - http://MiningTheSocialWeb.com
Nashville Data Science Meetup - 10 February 2014
![Page 2: Mining the Social Web for Fun and Profit: A Getting Started Guide](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554e847fb4c90573338b45e0/html5/thumbnails/2.jpg)
Overview
Intro (5 mins)
Virtual Machine Experience (10 mins)
Virtual Machine and IPython Notebook Demonstration (10 mins)
Mining Twitter: A Primer (20 mins)
Wrap Up/Final Q&A (10 mins)
![Page 3: Mining the Social Web for Fun and Profit: A Getting Started Guide](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554e847fb4c90573338b45e0/html5/thumbnails/3.jpg)
Intro
![Page 4: Mining the Social Web for Fun and Profit: A Getting Started Guide](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554e847fb4c90573338b45e0/html5/thumbnails/4.jpg)
Hello, My Name Is ... Matthew
Background in Computer Science
Data mining & machine learning
CTO @ Digital Reasoning Systems
Data mining; machine learning
Author @ O'Reilly Media
5 published books on technology
Principal @ Zaffra
Selective boutique consulting
![Page 5: Mining the Social Web for Fun and Profit: A Getting Started Guide](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554e847fb4c90573338b45e0/html5/thumbnails/5.jpg)
Transforming Curiosity Into Insight
An open source software (OSS) project
http://bit.ly/MiningTheSocialWeb2E
A (rewritten) book
http://bit.ly/135dHfs
Accessible to (virtually) everyone
Virtual machine with turn-key coding templates for data science experiments
Think of the book as "premium" support for the OSS project
![Page 6: Mining the Social Web for Fun and Profit: A Getting Started Guide](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554e847fb4c90573338b45e0/html5/thumbnails/6.jpg)
The Social Web Is All the Rage
World population: ~7B people
Facebook: 1.15B users
Twitter: 500M users
Google+: 343M users
LinkedIn: 238M users
~200M+ blogs (conservative estimate)
![Page 7: Mining the Social Web for Fun and Profit: A Getting Started Guide](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554e847fb4c90573338b45e0/html5/thumbnails/7.jpg)
Table of Contents (1/2)
Chapter 1 - Mining Twitter: Exploring Trending Topics, Discovering What People Are Talking About, and More
Chapter 2 - Mining Facebook: Analyzing Fan Pages, Examining Friendships, and More
Chapter 3 - Mining LinkedIn: Faceting Job Titles, Clustering Colleagues, and More
Chapter 4 - Mining Google+: Computing Document Similarity, Extracting Collocations, and More
Chapter 5 - Mining Web Pages: Using Natural Language Processing to Understand Human Language, Summarize Blog Posts, and More
Chapter 6 - Mining Mailboxes: Analyzing Who's Talking to Whom About What, How Often, and More
![Page 8: Mining the Social Web for Fun and Profit: A Getting Started Guide](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554e847fb4c90573338b45e0/html5/thumbnails/8.jpg)
Table of Contents (2/2)
Chapter 7 - Mining GitHub: Inspecting Software Collaboration Habits, Building Interest Graphs, and More
Chapter 8 - Mining the Semantically Marked-Up Web: Extracting Microformats, Inferencing over RDF, and More
Chapter 9 - Twitter Cookbook
Appendix A - Information About This Book's Virtual Machine Experience
Appendix B - OAuth Primer
Appendix C - Python and IPython Notebook Tips & Tricks
![Page 9: Mining the Social Web for Fun and Profit: A Getting Started Guide](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554e847fb4c90573338b45e0/html5/thumbnails/9.jpg)
Anatomy of Each Chapter
Brief Intro
Objectives
API Primer
Analysis Technique(s)
Data Visualization
Recap
Suggested Exercises
Recommended Resources
![Page 10: Mining the Social Web for Fun and Profit: A Getting Started Guide](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554e847fb4c90573338b45e0/html5/thumbnails/10.jpg)
The Virtual Machine Experience
![Page 11: Mining the Social Web for Fun and Profit: A Getting Started Guide](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554e847fb4c90573338b45e0/html5/thumbnails/11.jpg)
Why do you need a VM?
To save time
Because installation and configuration management is harder than it first appears
So that you can focus on the task at hand instead
So that I can support you regardless of your hardware and operating system
Arguably, it's even a best practice for a dev environment
![Page 12: Mining the Social Web for Fun and Profit: A Getting Started Guide](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554e847fb4c90573338b45e0/html5/thumbnails/12.jpg)
But I can do all of that myself...
True...
If you would rather troubleshoot unexpected installation/configuration issues instead of immediately focusing on the real task at hand
At least give it a shot before resorting to your own devices so that you don't have to install specific versions of ~40 Python packages
Including scientific computing tools that require underlying C/C++ code to be compiled
Which requires specific versions of developer libraries to be installed
You get the idea...
![Page 13: Mining the Social Web for Fun and Profit: A Getting Started Guide](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554e847fb4c90573338b45e0/html5/thumbnails/13.jpg)
The Virtual Machine Experience
Vagrant
A nice abstraction around virtual machine providers
One ring to rule them all
VirtualBox, VMware, AWS, ...
IPython Notebook
The easiest way to program with Python
A better REPL (interpreter)
Great for hacking
![Page 14: Mining the Social Web for Fun and Profit: A Getting Started Guide](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554e847fb4c90573338b45e0/html5/thumbnails/14.jpg)
What happens when you vagrant up?
Vagrant follows the instructions in your Vagrantfile
Starts up a VirtualBox instance
Uses Chef to provision it
Installs OS patches/updates
Installs MTSW software dependencies
Starts IPython Notebook server on port 8888
![Page 15: Mining the Social Web for Fun and Profit: A Getting Started Guide](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554e847fb4c90573338b45e0/html5/thumbnails/15.jpg)
Why Should I Use IPython Notebook?
Because it's great for hacking
And hacking is usually the first step
Because it's great for collaboration
Sharing/publishing results is trivial
Because the UX is as easy as working in a notepad
Think of it as "executable paper"
![Page 16: Mining the Social Web for Fun and Profit: A Getting Started Guide](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554e847fb4c90573338b45e0/html5/thumbnails/16.jpg)
![Page 17: Mining the Social Web for Fun and Profit: A Getting Started Guide](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554e847fb4c90573338b45e0/html5/thumbnails/17.jpg)
![Page 18: Mining the Social Web for Fun and Profit: A Getting Started Guide](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554e847fb4c90573338b45e0/html5/thumbnails/18.jpg)
VM Quick Start Instructions
Go to http://MiningTheSocialWeb.com/quick-start/
Follow the instructions
And watch the screencasts!
Basically:
Install VirtualBox & Vagrant
Run "vagrant up" in a terminal to start a guest VM
Then, go to http://localhost:8888 on your host machine's web browser
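Provisioning can take a while, so it may help to check whether anything is listening on the forwarded port before opening the browser. The following is a minimal sketch, not part of the project's tooling: the helper name `port_is_open` is hypothetical, and the default host and port simply mirror the http://localhost:8888 address above.

```python
import socket


def port_is_open(host="localhost", port=8888, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds.

    This only shows that *something* accepts connections on the port;
    it does not prove that the listener is IPython Notebook.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host not resolvable.
        return False
```

Calling `port_is_open()` from the host after "vagrant up" returns should give `True` once the notebook server is reachable, and `False` while the guest is still provisioning or if the port was not forwarded.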
![Page 19: Mining the Social Web for Fun and Profit: A Getting Started Guide](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554e847fb4c90573338b45e0/html5/thumbnails/19.jpg)
An (AWS) Hosted Virtual Machine
Is it free?
Perhaps...
...Sign up for the AWS free tier at http://aws.amazon.com/free/
But not right now. Do it later
See this blog post for some inspiration on how to easily build your own AMI from Vagrant boxes
http://wp.me/p3QiJd-3T
![Page 20: Mining the Social Web for Fun and Profit: A Getting Started Guide](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554e847fb4c90573338b45e0/html5/thumbnails/20.jpg)
Virtual Machine and IPython Notebook Demonstration
![Page 21: Mining the Social Web for Fun and Profit: A Getting Started Guide](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554e847fb4c90573338b45e0/html5/thumbnails/21.jpg)
Demonstration of Virtual Machine
http://nbviewer.ipython.org
http://MiningTheSocialWeb.com/quick-start/
Your first "vagrant up"
![Page 22: Mining the Social Web for Fun and Profit: A Getting Started Guide](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554e847fb4c90573338b45e0/html5/thumbnails/22.jpg)
Mining Twitter: A Primer
![Page 23: Mining the Social Web for Fun and Profit: A Getting Started Guide](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554e847fb4c90573338b45e0/html5/thumbnails/23.jpg)
Objectives
Be able to identify Twitter primitives
Understand tweet metadata and how to use it
Learn how to extract entities such as user mentions, hashtags, and URLs from tweets
Apply techniques for performing frequency analysis with Python
Be able to plot histograms of Twitter data with IPython Notebook
![Page 24: Mining the Social Web for Fun and Profit: A Getting Started Guide](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554e847fb4c90573338b45e0/html5/thumbnails/24.jpg)
Twitter Primitives
Account Types: "Anything"
"Following" Relationships
Favorites
Retweets
Replies
(Almost) No Privacy Controls
![Page 25: Mining the Social Web for Fun and Profit: A Getting Started Guide](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554e847fb4c90573338b45e0/html5/thumbnails/25.jpg)
API Requests
RESTful requests
Everything is a "resource"
You GET, PUT, POST, and DELETE resources
Standard HTTP "verbs"
Example: GET https://api.twitter.com/1.1/statuses/user_timeline.json?screen_name=SocialWebMining
Streaming API filters
JSON responses
Cursors (not quite pagination)
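To make the request and cursor ideas on this slide concrete, here is a small sketch. It only builds the URL for a GET request and walks a cursored result set; it sends nothing over the network, and a real request would also need OAuth credentials (see Appendix B). The helper names `resource_url` and `walk_cursor`, the `FAKE_PAGES` data, and the `fetch_page` callable are all hypothetical. The cursor convention shown (start at -1, stop when `next_cursor` is 0) reflects my understanding of the v1.1 API rather than anything stated on the slide.

```python
from urllib.parse import urlencode

API_ROOT = "https://api.twitter.com/1.1"


def resource_url(resource, **params):
    """Build the URL for a GET request against a REST resource."""
    url = "{0}/{1}.json".format(API_ROOT, resource)
    if params:
        url += "?" + urlencode(sorted(params.items()))
    return url


def walk_cursor(fetch_page):
    """Yield every item from a cursored resource, one page at a time.

    fetch_page is a hypothetical callable: it takes a cursor value and
    returns a dict with an "ids" list and a "next_cursor" integer.
    """
    cursor = -1  # assumed starting cursor
    while cursor != 0:  # assumed end-of-results marker
        page = fetch_page(cursor)
        for item in page["ids"]:
            yield item
        cursor = page["next_cursor"]


# Hand-made pages standing in for API responses (not real data).
FAKE_PAGES = {
    -1: {"ids": [1, 2, 3], "next_cursor": 42},
    42: {"ids": [4, 5], "next_cursor": 0},
}

print(resource_url("statuses/user_timeline", screen_name="SocialWebMining"))
print(list(walk_cursor(FAKE_PAGES.__getitem__)))
```

The point of "not quite pagination" is visible in `walk_cursor`: the client never asks for "page 2", it just hands back whatever opaque cursor the previous response supplied.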
![Page 26: Mining the Social Web for Fun and Profit: A Getting Started Guide](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554e847fb4c90573338b45e0/html5/thumbnails/26.jpg)
Twitter is an Interest Graph
[Figure: interest graph with nodes labeled Roberto Mercedes, Jorge, Ana, Nina, Johnny Araya, and Rodolfo Hernández]
![Page 27: Mining the Social Web for Fun and Profit: A Getting Started Guide](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554e847fb4c90573338b45e0/html5/thumbnails/27.jpg)
What's in a Tweet?
140 Characters ...
... Plus ~5KB of metadata!
Authorship
Time & location
Tweet "entities"
Replying, retweeting, favoriting, etc.
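As a rough illustration of the metadata categories listed above, the sketch below decodes a hand-written JSON document and pulls out a few fields. The sample is not real API output and is far smaller than an actual tweet; the field names follow my understanding of the v1.1 tweet format, and the function name `describe` is hypothetical.

```python
import json

# Hand-written sample for illustration only, NOT real API output.
SAMPLE_TWEET_JSON = """
{
  "text": "Trying the #MTSW2E virtual machine with @SocialWebMining",
  "created_at": "Mon Feb 10 18:00:00 +0000 2014",
  "user": {"screen_name": "example_user"},
  "coordinates": null,
  "in_reply_to_screen_name": null,
  "retweet_count": 3,
  "favorite_count": 1,
  "entities": {
    "hashtags": [{"text": "MTSW2E"}],
    "user_mentions": [{"screen_name": "SocialWebMining"}],
    "urls": []
  }
}
"""


def describe(tweet):
    """Pull a few metadata fields out of a decoded tweet."""
    return {
        "author": tweet["user"]["screen_name"],                      # authorship
        "created_at": tweet["created_at"],                           # time
        "has_location": tweet["coordinates"] is not None,            # location
        "is_reply": tweet["in_reply_to_screen_name"] is not None,    # replying
        "retweets": tweet["retweet_count"],                          # retweeting
        "favorites": tweet["favorite_count"],                        # favoriting
        "entity_types": sorted(tweet["entities"]),                   # tweet "entities"
    }


tweet = json.loads(SAMPLE_TWEET_JSON)
for field, value in describe(tweet).items():
    print("{0:>12}: {1}".format(field, value))
```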
![Page 28: Mining the Social Web for Fun and Profit: A Getting Started Guide](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554e847fb4c90573338b45e0/html5/thumbnails/28.jpg)
What are Tweet Entities?
Essentially, the "easy to get at" data in the 140 characters
@usermentions
#hashtags
URLs
multiple variations
(financial) symbols
stock tickers
media
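Because entities arrive already parsed, extracting them is mostly a matter of walking a nested dictionary. The sketch below assumes the v1.1 layout as I understand it (`user_mentions`, `hashtags`, `urls`, `symbols`, and `media` lists under `entities`); the function name `extract_entities` and the sample status are made up for illustration.

```python
def extract_entities(status):
    """Collect the entity values from one decoded tweet.

    Missing entity types are treated as empty lists, since not every
    tweet carries every kind of entity.
    """
    entities = status.get("entities", {})
    return {
        "screen_names": [m["screen_name"] for m in entities.get("user_mentions", [])],
        "hashtags": [h["text"] for h in entities.get("hashtags", [])],
        # URLs come in multiple variations; the expanded form is used here.
        "urls": [u["expanded_url"] for u in entities.get("urls", [])],
        "symbols": [s["text"] for s in entities.get("symbols", [])],
        "media": [m["media_url"] for m in entities.get("media", [])],
    }


# Hand-made status for illustration only, not real API output.
status = {
    "text": "$AAPL chatter from @SocialWebMining #datamining http://t.co/abc",
    "entities": {
        "user_mentions": [{"screen_name": "SocialWebMining"}],
        "hashtags": [{"text": "datamining"}],
        "urls": [{
            "url": "http://t.co/abc",
            "expanded_url": "http://MiningTheSocialWeb.com",
            "display_url": "MiningTheSocialWeb.com",
        }],
        "symbols": [{"text": "AAPL"}],
    },
}

print(extract_entities(status))
```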
![Page 29: Mining the Social Web for Fun and Profit: A Getting Started Guide](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554e847fb4c90573338b45e0/html5/thumbnails/29.jpg)
Data Mining Is...
Counting
Comparing
Filtering
Ranking
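The four verbs above map closely onto a few lines of standard-library Python. This is a minimal sketch using `collections.Counter`; the hashtag lists are invented for illustration.

```python
from collections import Counter

# Invented hashtags, standing in for values extracted from tweets.
hashtags = ["python", "datascience", "python", "nashville",
            "Python", "datascience", "python"]
other_sample = ["python", "vagrant"]

# Counting: tally each hashtag, ignoring case.
counts = Counter(tag.lower() for tag in hashtags)

# Filtering: keep only hashtags seen at least twice.
frequent = {tag: n for tag, n in counts.items() if n >= 2}

# Ranking: the two most common hashtags, highest count first.
ranked = counts.most_common(2)

# Comparing: hashtags that also appear in a second sample.
shared = set(counts) & set(other_sample)

print(counts["python"])  # 4
print(frequent)          # {'python': 4, 'datascience': 2}
print(ranked)            # [('python', 4), ('datascience', 2)]
print(shared)            # {'python'}
```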
![Page 30: Mining the Social Web for Fun and Profit: A Getting Started Guide](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554e847fb4c90573338b45e0/html5/thumbnails/30.jpg)
Histograms
A chart that is handy for frequency analysis
They look like bar charts...except they're not bar charts
Each value on the x-axis is a range (or "bin") of values
Not categorical data
Each value on the y-axis is the combined frequency of values in each range
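The binning step is what separates a histogram from a bar chart, and it can be sketched without any plotting library. The example below assumes integer values and fixed-width bins; the names `histogram` and `render` and the retweet counts are made up for illustration. In a notebook you would more likely hand the raw values to a plotting function such as matplotlib's `hist`.

```python
def histogram(values, bin_width):
    """Return {bin_start: count}; each bin covers bin_start..bin_start+bin_width-1.

    Empty bins between the smallest and largest value are kept with a
    count of 0 so that gaps in the data stay visible.
    """
    if not values:
        return {}
    starts = [(v // bin_width) * bin_width for v in values]
    bins = {s: 0 for s in range(min(starts), max(starts) + 1, bin_width)}
    for s in starts:
        bins[s] += 1
    return bins


def render(bins, bin_width):
    """Draw the bins as rows of '#' characters, one row per range."""
    rows = []
    for start in sorted(bins):
        label = "{0:>3}-{1:<3}".format(start, start + bin_width - 1)
        rows.append("{0}| {1}".format(label, "#" * bins[start]))
    return "\n".join(rows)


# Invented retweet counts for illustration only.
retweet_counts = [0, 1, 1, 2, 3, 5, 8, 12, 13, 21, 4, 0, 7]
bins = histogram(retweet_counts, bin_width=5)
print(render(bins, bin_width=5))
```

Each row corresponds to a range on the x-axis, and the length of the row plays the role of the y-axis value, the combined frequency of everything that fell in that range.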
![Page 31: Mining the Social Web for Fun and Profit: A Getting Started Guide](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554e847fb4c90573338b45e0/html5/thumbnails/31.jpg)
Plotting with IPython Notebook
![Page 32: Mining the Social Web for Fun and Profit: A Getting Started Guide](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554e847fb4c90573338b45e0/html5/thumbnails/32.jpg)
Example: Histogram of Retweets
![Page 33: Mining the Social Web for Fun and Profit: A Getting Started Guide](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554e847fb4c90573338b45e0/html5/thumbnails/33.jpg)
Social Media Analysis Framework
A memorable four-step process to guide data science experiments:
Aspire
To test a hypothesis (answer a question)
Acquire
Get the data
Analyze
Count things
Summarize
Plot the results
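One way to keep the four steps visible in a notebook is to give each its own function. The sketch below is only an illustration: the question, the function names, and the sample statuses are invented, `acquire` stands in for a real API call, and `summarize` returns a sentence where the slide suggests a plot.

```python
# Aspire: the question this experiment is meant to answer.
QUESTION = "What share of sampled tweets were retweeted at least once?"


def acquire():
    """Acquire: stand-in for an API call; returns hand-made statuses."""
    return [
        {"text": "first", "retweet_count": 0},
        {"text": "second", "retweet_count": 2},
        {"text": "third", "retweet_count": 7},
        {"text": "fourth", "retweet_count": 1},
    ]


def analyze(statuses):
    """Analyze: count how many statuses have at least one retweet."""
    retweeted = sum(1 for s in statuses if s["retweet_count"] > 0)
    return retweeted, len(statuses)


def summarize(retweeted, total):
    """Summarize: report the result (a plot would also fit here)."""
    if total == 0:
        return "No tweets were sampled"
    return "{0} of {1} sampled tweets ({2:.0%}) were retweeted".format(
        retweeted, total, retweeted / total)


print(QUESTION)
print(summarize(*analyze(acquire())))
```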
![Page 34: Mining the Social Web for Fun and Profit: A Getting Started Guide](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554e847fb4c90573338b45e0/html5/thumbnails/34.jpg)
Recommended Exercises
Review Python idioms in the "Appendix C (Python Tips & Tricks)" notebook
Follow the setup instructions in the "Chapter 1 (Mining Twitter)" notebook
Fill in Example 1-1 with credentials and begin work
Execute each example sequentially
Customize queries
Explore tweet metadata; count tweet entities; plot histograms of results
Explore the "Chapter 9 (Twitter Cookbook)" notebook
Think of it as a collection of building blocks
![Page 35: Mining the Social Web for Fun and Profit: A Getting Started Guide](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554e847fb4c90573338b45e0/html5/thumbnails/35.jpg)
![Page 36: Mining the Social Web for Fun and Profit: A Getting Started Guide](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554e847fb4c90573338b45e0/html5/thumbnails/36.jpg)
Final Q&A and Wrap Up
![Page 37: Mining the Social Web for Fun and Profit: A Getting Started Guide](https://reader033.vdocuments.mx/reader033/viewer/2022060108/554e847fb4c90573338b45e0/html5/thumbnails/37.jpg)
Recommended Resources
http://MiningTheSocialWeb.com
Mining the Social Web 2E Chapter 1 (Chimera)
http://bit.ly/13XgNWR
Source Code (GitHub)
http://bit.ly/MiningTheSocialWeb2E
http://bit.ly/1fVf5ej (numbered examples)
Screencasts (Vimeo)
http://bit.ly/mtsw2e-screencasts