predicting the future with social media
What The Future Holds For Social
Media Data Analysis
Predictive analytics using Twitter data
Peter Wlodarczak
Agenda
Introduction
Research methodology
Applications
Challenges
Conclusions
Introduction I
Shift from publisher-generated to
user-created content
90% of the content on the Internet is now
user generated (Graham et al. 2011)
Unprecedented amount of opinionated
data on the Internet
Online social networks (OSN) are one of
the biggest data sources on the Internet
(Oboler, Welsh & Cruz 2012)
Introduction II
Opinions can be expressed on the
Internet without programming knowledge
(Web 2.0)
Opinions are key influences of human
behavior
People increasingly consult the Internet
before making decisions
Introduction III
OSN give new insights into people's
opinions, interests and views
Social networking Web sites are amassing vast
quantities of data
Computational social science is providing tools to
process this data (Oboler, Welsh & Cruz 2012)
Social computing, a new paradigm of computing
and technology development, has become a
central theme across a number of information and
communication technology fields (Wang et al.
2007, p. 79)
Introduction IV
Growing interest in Social Media Mining
(SMM) in the market
Gnip, Klout, DataSift and Sprout Social specialize
in SM data analysis
Apple bought Topsy for 200 million US dollars
(Harris 2013)
TV stations buy Facebook data to see how
popular their shows are (Rusli 2013)
No surveys necessary
Introduction V
Research in the area of computational
social science and Big Data
Social computing is a cross-disciplinary research
and application field with theoretical underpinnings
including both computational and social sciences
(Wang et al. 2007, p. 80)
Big Data is the ability of society to harness
information in novel ways to produce useful
insights or goods and services of significant value
(Mayer-Schonberger & Cukier 2013, p. 2)
Introduction VI
Analyzing data to:
Understand the underlying structure of it
and gain knowledge
Make predictions from new, unseen
examples
Introduction VII
Current behavior is an indicator of future
decisions
New area of research: predictive
analytics
Machine learning techniques used for
prediction
Learning from experience, “data”, to predict
future behavior of individuals
Support decision making process
Introduction VIII
Big Data
Big Data is usually defined by the three
V’s: volume, velocity and variety (Klein,
Tran-Gia & Hartmann 2013, p. 320)
High volume
Created at high velocity
Structured, semi-structured and unstructured
Introduction IX
Big Data principles
No sample selection, all data analysed
Data doesn’t have to be of high quality
Structured and unstructured data
Introduction X
Data mining
Techniques for finding and describing
structural patterns in data
Tool for helping to explain that data and
make predictions from it (Witten, Frank &
Hall 2011, p. 8)
Used to
gain knowledge
make predictions
Introduction XI
Data analysis steps
Analyze mood by means of sentiment
analysis
Create time series and correlate them with
real-world phenomena
Make predictions based on new data
Support decision making process
Introduction XII
Social Media data has been analysed to
predict
Financial indicators (Bollen, Mao & Zeng
2010)
Elections (Tumasjan et al. 2011)
Box office revenue (Asur & Huberman 2010)
Disease outbreak (Achrekar et al. 2011)
Natural disasters (Sakaki, Okazaki and
Matsuo 2010)
Research methodology I
Predictive analysis of Social Media
consists of two phases
Data conditioning phase
Predictive analysis phase
Research methodology II
Data conditioning phase
  Collection and filtering of raw data
    Determination of time window
    Selection of search terms
    Selection of data extraction method
  Computation of predictor variables
    Selection of prediction variables
    Measurement of predictor variables
Predictive analysis phase
  Creation of predictive model
    Selection of predictive method
    Identification of data for evaluation of prediction
  Evaluation of the predictive performance
    Selection of the evaluation method
    Specification of the prediction baseline
Analysis phases
Research methodology III
Input and output variables
Input: Twitter sentiments, expressed as binary sentiment classification
Input: Share price, expressed in dollars
Output: Future share price, expressed in dollars
Research methodology IV
Mood towards Apple
Number of Tweets
Apple stock price
Data collection and analysis overview
Data collection
• Query Twitter through API
• Store in MongoDB
Preprocessing
• Remove stopwords
• Remove Tweets with links
Model evaluation
• Classification algorithm
• Neural network
Time series
• Twitter volume
• Binary sentiment classification
Correlation
• Correlation between sentiment and financial data
Collection and analysis steps overview
Some steps like model evaluation are
iterative
Data collection I
Data collection II
Data Source
Query API
Firehose API
Gardenhose API
Data Store
MongoDB
Historic data collected through Twitter
APIs
Timestamp, message text, region
Data collection III
Data collected through Twitter query
API
Using the Java programming language
Using the Twitter4j library
Stored as JSON (JavaScript Object
Notation) in a MongoDB
Data collection IV
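import java.util.List;
import twitter4j.Query;
import twitter4j.QueryResult;
import twitter4j.Status;
import twitter4j.Twitter;
import twitter4j.TwitterException;
import twitter4j.TwitterFactory;
import twitter4j.auth.AccessToken;

// ACCESS_TOKEN, ACCESS_TOKEN_SECRET, CUSTOMER_KEY and CUSTOMER_SECRET are
// String constants of the enclosing class that hold the OAuth credentials.
public void runQuery() {
    Twitter twitter = new TwitterFactory().getInstance();
    AccessToken accessToken = new AccessToken(ACCESS_TOKEN, ACCESS_TOKEN_SECRET);
    twitter.setOAuthConsumer(CUSTOMER_KEY, CUSTOMER_SECRET);
    twitter.setOAuthAccessToken(accessToken);
    try {
        // Cashtag for Apple's ticker symbol (AAPL)
        Query query = new Query("$AAPL");
        QueryResult result = twitter.search(query);
        List<Status> tweets = result.getTweets();
        // Print the author and text of every Tweet on the first result page
        for (Status tweet : tweets) {
            System.out.println("@" + tweet.getUser().getScreenName() + " - " + tweet.getText());
        }
    } catch (TwitterException te) {
        te.printStackTrace();
        System.out.println("Failed to search tweets: " + te.getMessage());
        System.exit(-1);
    }
}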
Twitter query algorithm to retrieve Tweets on Apple
Data preprocessing I
Data preprocessing II
Remove stop-words: “the”, “then”, “at” …
Remove punctuation: apostrophes, brackets, colons …
Discard Tweets with no explicit statements
like “Going to the Apple store”
Discard irrelevant Tweets like “I love apples
and pears”
Discard possible spam by discarding Tweets
that match the regular expressions “http:” or
“www”
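The preprocessing rules above can be sketched in a few lines of Java. This is a minimal illustration and not the code used in the study: the class name TweetPreprocessor, the short stop-word list and the extension of the link pattern to "https:" are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

class TweetPreprocessor {
    // Tiny illustrative stop-word list (assumption); a real list is much longer.
    static final Set<String> STOP_WORDS = Set.of("the", "then", "at", "a", "to", "is");

    // "http:" and "www" come from the slide; "https:" is an added assumption.
    static final Pattern LINK = Pattern.compile("https?:|www", Pattern.CASE_INSENSITIVE);

    /** Returns true if the Tweet contains a link and is treated as possible spam. */
    static boolean looksLikeSpam(String tweet) {
        return LINK.matcher(tweet).find();
    }

    /** Lower-cases the Tweet, strips punctuation and removes stop-words. */
    static List<String> tokenize(String tweet) {
        // Keep letters, digits, whitespace and the Twitter markers $, # and @.
        String cleaned = tweet.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9$#@\\s]", " ");
        List<String> tokens = new ArrayList<>();
        for (String token : cleaned.trim().split("\\s+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String tweet = "The new iPhone is great!";
        if (!looksLikeSpam(tweet)) {
            System.out.println(tokenize(tweet));
        }
    }
}
```

Discarding Tweets without an explicit statement or with an irrelevant topic needs more than pattern matching and is not covered by this sketch.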
Data preprocessing III
Machine learning algorithms don’t take text
as input
Create feature vector
Word frequencies
n-grams, unigram, bigram, trigram …
“good”, “very good”, “not very good”
Create sentiment lexicon
Sentiment analysis highly domain specific
“This mattress had a valley after one month”
“This car uses a lot of fuel”
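A feature vector of word and n-gram frequencies, as listed above, could be built as follows. This is a sketch under assumptions: the class name NGramFeatures and its method names are hypothetical, and the Tweet is assumed to be tokenized already.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class NGramFeatures {
    /** Returns all n-grams of the given size, joined by a single space. */
    static List<String> ngrams(List<String> tokens, int n) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            result.add(String.join(" ", tokens.subList(i, i + n)));
        }
        return result;
    }

    /** Counts every unigram, bigram, ... up to maxN-gram in the token list. */
    static Map<String, Integer> frequencies(List<String> tokens, int maxN) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int n = 1; n <= maxN; n++) {
            for (String gram : ngrams(tokens, n)) {
                counts.merge(gram, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // The trigram keeps the negation that the unigram "good" loses.
        System.out.println(frequencies(List.of("not", "very", "good"), 3));
    }
}
```

The example shows why higher-order n-grams matter for sentiment: "good" alone looks positive, while "not very good" does not.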
Model evaluation I
Figure: results for Neural Network, Naïve Bayes and Nearest Neighbor (values shown: 90.2 %, 84.7 %, 97.3 %)
Model evaluation II
Experience shows that no single machine
learning scheme is appropriate to all data
mining problems (Witten, Frank & Hall 2011,
p. 403)
Different algorithms are trained
The best performing algorithm will be
selected
Model evaluation III
Data classification and analysis through
Machine learning techniques
System can learn from data, e.g. detect spam
Finding and describing structural patterns in
data and generalizing from them
Data classification is a supervised
learning problem
Class label is known
Model evaluation IV
Other machine learning models are
Unsupervised learning
Class label is unknown
Used for cluster analysis
Semi-supervised learning
Small amount of labeled data, big volumes of
unlabeled data
Model evaluation V
Model evaluation through iterative supervised
machine learning process
Select classification algorithm: Naïve Bayes,
k-NN, decision tree induction …
Find a function ƒ that classifies Tweets into
positive and negative Tweets
Data is divided into training and test data
Model is trained using the training data
Trained model is verified using the test data
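As an illustration of the supervised process above, the following sketch trains a small multinomial Naïve Bayes classifier on labeled Tweets and verifies it on held-out test Tweets. It is not the model used in the study: the class NaiveBayesSentiment, its method names, the use of Laplace smoothing and the toy Tweets are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class NaiveBayesSentiment {
    // label -> (word -> count), label -> total word count, label -> Tweet count
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> totalWords = new HashMap<>();
    private final Map<String, Integer> tweetCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalTweets = 0;

    /** Adds one labeled, already tokenized Tweet to the training statistics. */
    void train(List<String> tokens, String label) {
        tweetCounts.merge(label, 1, Integer::sum);
        totalTweets++;
        Map<String, Integer> counts = wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
            totalWords.merge(label, 1, Integer::sum);
            vocabulary.add(token);
        }
    }

    /** Returns the label with the highest log-probability (Laplace smoothing). */
    String classify(List<String> tokens) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : tweetCounts.keySet()) {
            double score = Math.log(tweetCounts.get(label) / (double) totalTweets);
            Map<String, Integer> counts = wordCounts.get(label);
            int total = totalWords.getOrDefault(label, 0);
            for (String token : tokens) {
                int count = counts.getOrDefault(token, 0);
                score += Math.log((count + 1.0) / (total + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    /** Fraction of test Tweets whose predicted label equals the true label. */
    double accuracy(List<List<String>> testTweets, List<String> testLabels) {
        int correct = 0;
        for (int i = 0; i < testTweets.size(); i++) {
            if (classify(testTweets.get(i)).equals(testLabels.get(i))) {
                correct++;
            }
        }
        return correct / (double) testTweets.size();
    }

    public static void main(String[] args) {
        NaiveBayesSentiment model = new NaiveBayesSentiment();
        model.train(List.of("love", "my", "iphone"), "positive");
        model.train(List.of("great", "battery"), "positive");
        model.train(List.of("hate", "the", "update"), "negative");
        model.train(List.of("terrible", "battery", "life"), "negative");
        System.out.println(model.classify(List.of("love", "great")));
    }
}
```

The test Tweets must not be part of the training data; otherwise the measured accuracy says little about unseen Tweets.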
Model evaluation VI
Determine through loss function how well the
model performs on future, unseen data
Calculate error:
Training error = fraction of training examples misclassified
Test error = fraction of test examples misclassified
Generalization error = probability of misclassifying a new
random example
Model evaluation VII
Testing determines the classification
accuracy
$$\text{Accuracy} = \frac{\text{number of correct classifications}}{\text{total number of test cases}}$$
Simple but very optimistic since training data
is used for testing
Model evaluation VIII
n-fold cross-validation
Divide data into n folds, where typically 4 < n < 11
Data divided randomly into n folds
n – 1 folds used for training, 1 holdout fold for
testing
Error rate is calculated on the holdout fold
Repeated n times such that each fold is the holdout
fold once
Error estimate is averaged over all n error rates
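The procedure above can be sketched as follows. The code is only an illustration: the class CrossValidation, the Learner interface, the one-dimensional toy data and the simple threshold learner are hypothetical and stand in for whatever classifier is being evaluated. The data set is assumed to contain at least n examples.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.function.DoublePredicate;

class CrossValidation {
    /** A labeled example with a single numeric feature (toy setting). */
    static final class Example {
        final double x;
        final boolean positive;
        Example(double x, boolean positive) { this.x = x; this.positive = positive; }
    }

    /** Trains on the given examples and returns a predictor for "positive". */
    interface Learner {
        DoublePredicate fit(List<Example> training);
    }

    /** Returns the error rate averaged over n holdout folds. */
    static double crossValidate(List<Example> data, int n, Learner learner, long seed) {
        List<Example> shuffled = new ArrayList<>(data);
        Collections.shuffle(shuffled, new Random(seed)); // random assignment to folds
        double errorSum = 0.0;
        for (int fold = 0; fold < n; fold++) {
            List<Example> train = new ArrayList<>();
            List<Example> test = new ArrayList<>();
            for (int i = 0; i < shuffled.size(); i++) {
                (i % n == fold ? test : train).add(shuffled.get(i));
            }
            DoublePredicate model = learner.fit(train);
            int wrong = 0;
            for (Example e : test) {
                if (model.test(e.x) != e.positive) wrong++;
            }
            errorSum += wrong / (double) test.size(); // error on the holdout fold
        }
        return errorSum / n;
    }

    /** Hypothetical learner: threshold halfway between the two class means. */
    static DoublePredicate meanThreshold(List<Example> training) {
        double posSum = 0, negSum = 0;
        int pos = 0, neg = 0;
        for (Example e : training) {
            if (e.positive) { posSum += e.x; pos++; } else { negSum += e.x; neg++; }
        }
        double threshold = (posSum / pos + negSum / neg) / 2.0;
        return x -> x > threshold;
    }

    /** Toy data: ten negatives around 0..9 and ten positives around 100..109. */
    static List<Example> toyData() {
        List<Example> data = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            data.add(new Example(i, false));
            data.add(new Example(100 + i, true));
        }
        return data;
    }

    public static void main(String[] args) {
        double error = crossValidate(toyData(), 5, CrossValidation::meanThreshold, 42L);
        System.out.println("Estimated error rate: " + error);
    }
}
```

Because every example is used for testing exactly once and never in the same round as for training, the averaged error is a less optimistic estimate than the training error.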
Model evaluation IX
Typical data mining task goes through many
iterations
As many iterations as necessary until the result is
satisfactory, i.e. accuracy converges
Best data mining scheme is selected
Used against unseen data for classification
Can be used on real-time data
Model evaluation X
RapidMiner workbench
Model evaluation XI
Training data
name      sex     mask  cape  tie  ears  smokes  class
Batman    male    yes   yes   no   yes   no      Good
Robin     male    yes   yes   no   no    no      Good
Alfred    male    no    no    yes  no    no      Good
Penguin   male    no    no    yes  no    yes     Bad
Catwoman  female  yes   no    no   yes   no      Bad
Joker     male    no    no    no   no    no      Bad
Test data
Batgirl   female  yes   yes   no   yes   no      ?
Riddler   male    yes   no    no   no    no      ?
Model evaluation XII
Description of data:
if sex = male and mask = yes and cape = yes
and tie = no and ears = yes and smokes = no
then character = Good
Generalisation for new examples:
if mask = yes and ears = yes and smokes = no
then character = Good
Model evaluation XIII
tie
  no: cape
    no: bad
    yes: good
  yes: smokes
    no: good
    yes: bad
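Read as code, a decision tree is a set of nested conditions. The sketch below assumes the tree splits on tie first, then on cape (tie = no) or on smokes (tie = yes), which is the leaf assignment that fits all six training examples; the class and method names are hypothetical.

```java
class SuperheroTree {
    /** Classifies a character using only the attributes tie, cape and smokes. */
    static String classify(boolean tie, boolean cape, boolean smokes) {
        if (!tie) {
            // Left subtree: no tie, decide on the cape.
            return cape ? "good" : "bad";
        }
        // Right subtree: wears a tie, decide on smoking.
        return smokes ? "bad" : "good";
    }

    public static void main(String[] args) {
        // Test examples from the table: Batgirl (no tie, cape), Riddler (no tie, no cape).
        System.out.println("Batgirl: " + classify(false, true, false));
        System.out.println("Riddler: " + classify(false, false, false));
    }
}
```

The attributes sex, mask and ears are not needed by this tree, which shows how a tree can generalize by ignoring attributes.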
Model evaluation XIV
Trees must be:
Big enough to fit training data
Big enough to capture true patterns
Not too big (Ockham’s razor):
Overfitting
Capture noise
Find spurious patterns
Model evaluation XIV
Best tree size cannot be determined
from training error
Schapire 2004
Model evaluation XV
Schapire 2004
Model evaluation XVI
For building an accurate classifier:
Enough training examples
Good performance on training set
Classifier that is not too complex
Strategy for controlling tree size:
Build large tree that fully fits training data
Prune back
Model evaluation XVII
Grow on just part of the training data, then
prune using minimum error on held out
data
Classifiers I
Decision trees:
Best known:
C4.5 (Quinlan), successor C5.0
CART for classification and regression trees
(Breiman et al.)
Fast to train and evaluate
Relatively easy to interpret
Accuracy often not satisfactory
Classifiers II
Perceptron (Neuron)
Linear classifier
Data linearly separable using a hyperplane
$$w_0 a_0 + w_1 a_1 + w_2 a_2 + \dots + w_k a_k = 0$$
where w = weights, a = real-valued feature
vector, a_0 = bias input
Binary classifier f(a) that maps its input
vector a to a single, binary output value
Classifiers III
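Figure: perceptron with a bias input 1 (weight w0) and attribute inputs a1, a2, a3 (weights w1, w2, w3)
$$f(a) = \sum_k w_k a_k + b$$
where b is the bias term
Output class depends on whether f(a) > 0 or f(a) < 0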
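A perceptron as described above can be sketched in Java. The weighted sum f(a) follows the slide; the training method uses the classic perceptron learning rule, which the slides do not spell out and which is added here as an assumption. The class name and the learning rate parameter are hypothetical.

```java
class Perceptron {
    // w[0] is the bias weight (its input is fixed to 1), w[k + 1] belongs to a[k].
    final double[] w;

    Perceptron(int numAttributes) {
        w = new double[numAttributes + 1];
    }

    /** Weighted sum of the inputs plus the bias. */
    double f(double[] a) {
        double sum = w[0];
        for (int k = 0; k < a.length; k++) {
            sum += w[k + 1] * a[k];
        }
        return sum;
    }

    /** Binary output: class 1 if f(a) > 0, otherwise class 0. */
    int predict(double[] a) {
        return f(a) > 0 ? 1 : 0;
    }

    /** Perceptron learning rule; returns true once an epoch has no errors. */
    boolean train(double[][] inputs, int[] labels, int maxEpochs, double rate) {
        for (int epoch = 0; epoch < maxEpochs; epoch++) {
            int errors = 0;
            for (int i = 0; i < inputs.length; i++) {
                int delta = labels[i] - predict(inputs[i]);
                if (delta != 0) {
                    errors++;
                    w[0] += rate * delta;
                    for (int k = 0; k < inputs[i].length; k++) {
                        w[k + 1] += rate * delta * inputs[i][k];
                    }
                }
            }
            if (errors == 0) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        double[][] inputs = {{0, 0}, {0, 1}, {1, 0}, {1, 1}};
        Perceptron and = new Perceptron(2);
        System.out.println("AND learned: " + and.train(inputs, new int[] {0, 0, 0, 1}, 100, 1.0));
        Perceptron xor = new Perceptron(2);
        System.out.println("XOR learned: " + xor.train(inputs, new int[] {0, 1, 1, 0}, 100, 1.0));
    }
}
```

The logical AND is linearly separable, so the rule converges. XOR is not linearly separable, so a single perceptron cannot classify all four points correctly however long it is trained.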
Classifiers IV
Multilayer Perceptron
Non-linear classifier
Perceptrons are connected in a
hierarchical structure
Classifiers V
Not all data is linearly separable
Classifiers VI
Figure: multilayer perceptron with a bias input 1 and attribute inputs a1, a2; input layer, hidden layer, output layer
Classifiers VII
Multilayer Perceptron
Perceptrons organized in several layers
Each layer is fully connected to the next
layer
All nodes except the input nodes are perceptrons
Feedforward neural network
Uses backpropagation for training
Error propagated back to minimize loss function
Classifiers VIII
Artificial neural networks (ANN) allow approximate
solutions to very complex problems
Support Vector Machines (SVM) are a
much simpler alternative to ANN
Many more classifiers
k-Nearest Neighbor
Naïve Bayes
…
Data classification I
Data Classification II
Data classification:
Binary mood polarity: positive, negative
Represented graphically as time series
Positive Tweets
Negative Tweets
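Turning classified Tweets into a time series amounts to counting positive and negative Tweets per time bucket. The sketch below uses one bucket per UTC day; the class SentimentTimeSeries, the ClassifiedTweet type and the daily granularity are assumptions made for the illustration.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SentimentTimeSeries {
    /** A Tweet reduced to its timestamp and its binary sentiment label. */
    static final class ClassifiedTweet {
        final Instant timestamp;
        final boolean positive;
        ClassifiedTweet(Instant timestamp, boolean positive) {
            this.timestamp = timestamp;
            this.positive = positive;
        }
    }

    /** Maps each UTC day to {number of positive Tweets, number of negative Tweets}. */
    static Map<LocalDate, int[]> dailyCounts(List<ClassifiedTweet> tweets) {
        Map<LocalDate, int[]> series = new TreeMap<>(); // sorted by date
        for (ClassifiedTweet tweet : tweets) {
            LocalDate day = tweet.timestamp.atZone(ZoneOffset.UTC).toLocalDate();
            int[] counts = series.computeIfAbsent(day, d -> new int[2]);
            counts[tweet.positive ? 0 : 1]++;
        }
        return series;
    }

    public static void main(String[] args) {
        List<ClassifiedTweet> tweets = List.of(
                new ClassifiedTweet(Instant.parse("2014-02-14T09:30:00Z"), true),
                new ClassifiedTweet(Instant.parse("2014-02-14T17:05:00Z"), false),
                new ClassifiedTweet(Instant.parse("2014-02-15T08:00:00Z"), true));
        dailyCounts(tweets).forEach((day, c) ->
                System.out.println(day + " positive=" + c[0] + " negative=" + c[1]));
    }
}
```

A daily bucket matches daily closing prices; intraday analysis would need a finer bucket.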
Correlations I
Sentiment polarity
Share price
Correlations II
Finding correlations:
Binary sentiment classification time series
compared against stock price over same
time frame
Does a rise in the number of positive Tweets
precede a surge in the Apple stock price?
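One simple way to check whether one series precedes another is a lagged Pearson correlation: the sentiment series at day t is compared with the price series at day t + lag. This is a sketch of a linear measure only, not a causality test; the class LaggedCorrelation, its method names and the example numbers are hypothetical. Both arrays are assumed to have the same length and to be non-constant.

```java
import java.util.Arrays;

class LaggedCorrelation {
    /** Pearson correlation coefficient of two equally long series. */
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double cov = 0, varX = 0, varY = 0;
        for (int i = 0; i < n; i++) {
            cov += (x[i] - meanX) * (y[i] - meanY);
            varX += (x[i] - meanX) * (x[i] - meanX);
            varY += (y[i] - meanY) * (y[i] - meanY);
        }
        return cov / Math.sqrt(varX * varY);
    }

    /** Correlates x[t] with y[t + lag], for example Tweets today with the price tomorrow. */
    static double lagged(double[] x, double[] y, int lag) {
        int n = x.length - lag;
        double[] leading = Arrays.copyOfRange(x, 0, n);
        double[] following = Arrays.copyOfRange(y, lag, lag + n);
        return pearson(leading, following);
    }

    public static void main(String[] args) {
        // Made-up numbers: the second series follows the first with a delay of one day.
        double[] positiveTweets = {1, 3, 2, 5, 4, 6, 3};
        double[] price = {0, 12, 16, 14, 20, 18, 22};
        System.out.println("lag 0: " + lagged(positiveTweets, price, 0));
        System.out.println("lag 1: " + lagged(positiveTweets, price, 1));
    }
}
```

A high correlation at a positive lag only suggests that sentiment moves first; it does not show that sentiment drives the price.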
Correlations III
Microsoft stock price (Yahoo! Finance 2014)
Correlations IV
Tweet polarity and MSFT stock price
Correlations V
If there are correlations in historic data,
the trained model is used against real-time
data
Access real-time Tweets using Twitter's
streaming API
Firehose API (100% of real time Tweets)
Gardenhose API (10% of real time Tweets)
Spritzer API (1% of real time Tweets)
Correlations VI
Since correlations are most certainly non-linear,
correlating has to be automated
Bivariate Granger causality test
Determines whether one time series can be
used to predict another
X is said to Granger-cause Y if X provides
statistically significant information about
future values of Y
Correlations VII
Granger test examines linear causality
among bivariate or multivariate time series
Many real-world phenomena are not
linear
Non-linear extensions to Granger have
been developed
Other correlation techniques
Phase Slope Index measures temporal flux
between time series
Correlations VII
Phase Slope Index is more robust than Granger
since it is more immune to noise
Machine learning techniques such as
ANN can be used for finding
correlations
Applications I
Technologies for predictive analysis
have matured
IBM SPSS
Stata
SAS
Applications II
Free open source
WEKA
Partly open source
RapidMiner
Cloud solutions
IBM WatsonAnalytics
Google BigQuery
SAS Cloud Analytics
Challenges I
Real-world data is often of very poor quality
Social Media vast, noisy and
unstructured
Getting relevant posts is challenging
Spam has become a serious issue
Detecting sarcasm very difficult
Political opinions full of irony and sarcasm
Data preprocessing one of the most
important steps
Challenges II
Opinion mining remains challenging
task
Overall statement often difficult to
determine
No ground truth
Not everybody is using Social Media
Self-selection bias
Conclusions I
Predictive analysis poses many
interesting research problems
Many opportunities for future research
Determining the credibility of posts (catfish,
sock puppet)
Better filtering mechanisms
More research in Machine Learning
than feature extraction
Conclusions II
Correlation does not mean causation
Finding causative mechanism for
correlation
Thank you for the attention
Questions?
References I
Achrekar, H, Gandhe, A, Lazarus, R, Ssu-Hsin, Y and Benyuan, L 2011, 'Predicting Flu Trends using Twitter data', Computer
Communications Workshops (INFOCOM WKSHPS), IEEE, pp. 702-7.
Arias, M, Arratia, A & Xuriguera, R 2014, 'Forecasting with twitter data', ACM Trans. Intell. Syst. Technol., vol. 5, no. 1, pp. 1-24.
Asur, S & Huberman, BA 2010, 'Predicting the Future with Social Media', in Web Intelligence and Intelligent Agent Technology
(WI-IAT), 2010 IEEE/WIC/ACM International Conference on, vol. 1, pp. 492-9.
Berman, JJ 2013, Principles of Big Data,
Elsevier Inc., Waltham, USA.
Bollen, J, Mao, H & Zeng, X-J 2010, 'Twitter mood predicts the stock market', Journal of Computational Science, vol. 2, p. 8.
Buhl, H, Röglinger, M, Moser, F & Heidemann, J 2013, 'Big Data', WIRTSCHAFTSINFORMATIK, vol. 55, no. 2, pp. 63-8.
Bulysheva, L & Bulyshev, A 2012, 'Segmentation modeling algorithm: a novel algorithm in data mining', Information Technology
and Management, vol. 13, no. 4, pp. 263-71.
Darwish, A & Lakhtaria, KI 2011, The Impact of the New Web 2.0 Technologies in Communication, Development, and
Revolutions of Societies, vol. 2, 2011.
Goh, KY, Heng, CS & Lin, Z 2012, ‘Social Media Brand Community and Consumer Behavior: Quantifying the Relative Impact of
User- and Marketer-Generated Content’, School of Computing, National University of Singapore, viewed 9 April 2013,
<https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2048614>.
Graham, DM, Hale, SA & Stephens, M 2011, 'User-generated Content in Google', Oxford University, Oxford, UK, viewed 27
October 2013, < http://www.oii.ox.ac.uk/vis/?id=4e3c030d>.
Harris, D 2013, 'DataSift raises $42M', Gigaom, viewed 27 December 2013, <http://gigaom.com/2013/12/03/datasift-raises-42m-
maybe-theres-something-to-this-social-data-after-all/>.
Huang, S, Peng, W, Li, J & Lee, D 2013, 'Sentiment and topic analysis on social media: a multi-task multi-label classification
approach', paper presented to Proceedings of the 5th Annual ACM Web Science Conference, Paris, France.
Kao, A, Ferng, W, Poteet, S, Quach, L & Tjoelker, R 2013, 'TALISON - Tensor analysis of social media data', in Intelligence and
Security Informatics (ISI), 2013 IEEE International Conference on, pp. 137-42.
Klein, D, Tran-Gia, P & Hartmann, M 2013, 'Big Data', Informatik-Spektrum, vol. 36, no. 3, p. 319.
Kumar, P, Nitin, Chauhan, DS & Sehgal, VK 2012, 'Selection of evolutionary approach based hybrid data mining algorithms for
decision support systems and business intelligence', paper presented to Proceedings of the International Conference on
Advances in Computing, Communications and Informatics, Chennai, India.
References II
Kumar, P, Kumar Sehgal, N, Kumar Sehgal, V & Singh Chauhan, D 2012, 'A Benchmark to Select Data Mining Based
Classification Algorithms for Business Intelligence and Decision Support Systems', International Journal of Data Mining &
Knowledge Management Process, vol. 2, no. 5, pp. 25-42.
Lim, E-P, Chen, H & Chen, G 2013, 'Business Intelligence and Analytics: Research Directions', ACM Trans. Manage. Inf. Syst.,
vol. 3, no. 4, pp. 1-10.
Manyika, J, Chui, M, Brown, B, Bughin, J, Dobbs, R, Roxburgh, C & Byers, AH 2011, Big data: The next frontier for innovation,
competition, and productivity, McKinsey Global Institute.
Mayer-Schonberger, V & Cukier, K 2013, Big Data: A Revolution That Will Transform How We Live, Work, and Think, Houghton
Mifflin Harcourt Publishing Company, New York, USA.
Mayer, A 2009, 'Online social networks in economics', Decision Support Systems, vol. 47, no. 3, pp. 169-184, viewed 22
September 2013, < http://sistemas-humano-computacionais.wdfiles.com/local--files/capitulo%3Aredes-sociais/amayer.pdf>.
McKelvey, K, Rudnick, A, Conover, MD & Menczer, F 2012, 'Visualizing Communication on Social Media, Making Big Data
Accessible', Indiana University School of Informatics and Computing, viewed 29 September 2013,
<http://arxiv.org/pdf/1202.1367v1.pdf>.
Neri, F, Aliprandi, C, Capeci, F, Cuadros, M & By, T 2012, 'Sentiment Analysis on Social Media', in Advances in Social Networks
Analysis and Mining (ASONAM), 2012 IEEE/ACM International Conference on, pp. 919-26.
Oboler, A, Welsh, K & Cruz, L 2012, The danger of big data: Social media as computational social science, 2012.
Ostrowski, DA 2011, 'Predictive Semantic Social Media Analysis', in Semantic Computing (ICSC), 2011 Fifth IEEE International
Conference on, pp. 283-90.
Paltoglou, G & Thelwall, M 2012, 'Twitter, MySpace, Digg: Unsupervised Sentiment Analysis in Social Media', ACM Trans. Intell.
Syst. Technol., vol. 3, no. 4, pp. 1-19.
Rusli, EM 2013, Facebook Woos TV Networks With Data, Digits, viewed 15 February 2014,
<http://blogs.wsj.com/digits/2013/09/29/facebook-woos-tv-networks-with-more-data/>.
Smith, MS, Ventura, AD, Dewey, DP, Knutson, CD & Embley, DW 2011, ‘A Computational Framework for Social Capital in Online
Communities’, Brigham Young University, viewed 28 July 2013, <http://posts.smithworx.com/publications/d.pdf>.
References III
Yahoo! Finance 2014, Microsoft Corporation (MSFT), Yahoo, viewed 15 February 2014,
<http://finance.yahoo.com/echarts?s=MSFT+Interactive#symbol=msft;range=20130102,20140214;compare=;indicator=volume;charttype=area;crosshair=on;ohlcvalues=0;logscale=off;source=;>.
Trif, S 2011, 'Using Genetic Algorithms in Secured Business Intelligence Mobile Applications', Informatica economica, vol. 15, no. 1,
pp. 69-79.
Tumasjan, A, Sprenger, TO, Sandner, PG & Welpe, IM 2011, 'Election Forecasts With Twitter: How 140 Characters
Reflect the Political Landscape', Social science computer review, vol. 29, no. 4, pp. 402-18.
Sakaki, T, Okazaki, M and Matsuo, Y 2010, 'Earthquake shakes Twitter users: real-time event detection by social sensors', Proc. of the
19th international conference on World wide web, Raleigh.
Twitter Statistics 2014, Statistic brain, viewed 18 February 2014, <http://www.statisticbrain.com/twitter-statistics/>.
Walton, A 2014, ‘Twitter Usage by Region’, Chron, viewed 18 February 2014, < http://smallbusiness.chron.com/twitter-usage-region-
62762.html>.
Wang, F-Y, Carley, KM, Zeng, D & Mao, W 2007, 'Social Computing: From Social Informatics to Social Intelligence', Intelligent
Systems, IEEE, vol. 22, no. 2, pp. 79-83.
Weka knowledge explorer, viewed 15 February 2014, <http://www.cs.waikato.ac.nz/~ml/weka/gui_explorer.html>.
Witten, IH, Frank, E & Hall, MA 2011, Data Mining, 3 edn, Elsevier, Burlington, MA, USA.
Wlodarczak, P 2014, ‘Big Personal Data’, Social Science Research Network, <http://dx.doi.org/10.2139/ssrn.2514721>.
World Stock Exchanges 2011, viewed 18 February 2014, <http://www.world-stock-exchanges.net/top10.html>.
Wong, FMF, Sen, S & Chiang, M 2012, 'Why Watching Movie Tweets Won’t Tell the Whole Story?', Cornell University, viewed 14 May
2013, <http://arxiv.org/pdf/1203.4642v1.pdf>.
Wu, X, Kumar, V, Ross Quinlan, J, Ghosh, J, Yang, Q, Motoda, H, McLachlan, GJ, Ng, A, Liu, B, Yu, PS, Zhou, Z-H, Steinbach, M,
Hand, DJ & Steinberg, D 2007, 'Top 10 algorithms in data mining', Knowledge and Information Systems, vol. 14, no. 1, pp. 1-37.
Zeng, D, Chen, H, Lusch, R & Li, S-H 2010, 'Social Media Analytics and Intelligence', Intelligent Systems, IEEE, vol. 25, no. 6, pp. 13-
6.
Zeng, L, Li, L & Duan, L 2012, 'Business intelligence in enterprise computing environment', Information Technology and Management,
vol. 13, no. 4, pp. 297-310.