predicting the future with social media

What The Future Holds For Social Media Data Analysis

Predictive analytics using Twitter data

Peter Wlodarczak

Agenda

Introduction

Research methodology

Applications

Challenges

Conclusions

Introduction I

Shift from publisher-generated to user-

created content

90% of the content on the Internet is now

user generated (Graham et al. 2011)

Unprecedented amount of opinionated

data on the Internet

Online social networks (OSN) are one of

the biggest data sources of the internet

(Oboler, Welsh & Cruz 2012)

Introduction II

Opinions can be expressed on the

Internet without programming knowledge

(Web 2.0)

Opinions are key influences of human

behavior

People increasingly consult the Internet

before making decisions

Introduction III

OSN give new insights into people's opinions, interests and views

Social networking Web sites are amassing vast quantities of data

Computational social science is providing tools to

process this data (Oboler, Welsh & Cruz 2012)

Social computing, a new paradigm of computing

and technology development, has become a

central theme across a number of information and

communication technology fields (Wang et al.

2007, p. 79)

Introduction IV

Growing interest in Social Media Mining (SMM) in the market

Gnip, Klout, DataSift and Sprout Social specialize in SM data analysis

Apple bought Topsy for 200 million US dollars

(Harris 2013)

TV stations buy Facebook data to see how

popular their shows are (Rusli 2013)

No surveys necessary

Introduction V

Research in the area of computational social science and Big Data

Social computing is a cross-disciplinary research

and application field with theoretical underpinnings

including both computational and social sciences

(Wang et al. 2007, p. 80)

Big Data is the ability of society to harness

information in novel ways to produce useful

insights or goods and services of significant value

(Mayer-Schonberger & Cukier 2013, p. 2)

Introduction VI

Analyzing data to:

Understand the underlying structure of it

and gain knowledge

Make predictions from new, unseen

examples

Introduction VII

Current behavior as an indication of future decisions

New area of research: predictive

analytics

Machine learning techniques used for

prediction

Learning from experience, “data”, to predict

future behavior of individuals

Support decision making process

Introduction VIII

Big Data

Big Data is usually defined by the three Vs: volume, velocity and variety (Klein, Tran-Gia & Hartmann 2013, p. 320)

High volume

Created at high velocity

Structured, semi-structured and unstructured

Introduction IX

Big Data principles

No sample selection, all data analysed

Data doesn’t have to be of high quality

Structured and unstructured data

Introduction X

Data mining

Techniques for finding and describing

structural patterns in data

Tool for helping to explain that data and

make predictions from it (Witten, Frank &

Hall 2011, p. 8)

Used to

gain knowledge

make predictions

Introduction XI

Data analysis steps

Analyze mood by means of sentiment

analysis

Create a time series and correlate it with a real-world phenomenon

Make predictions based on new data

Support decision making process

Introduction XII

Social Media data has been analysed to

predict

Financial indicators (Bollen, Mao & Zeng

2010)

Elections (Tumasjan et al. 2011)

Box office revenue (Asur & Huberman 2010)

Disease outbreak (Achrekar et al. 2011)

Natural disasters (Sakaki, Okazaki & Matsuo 2010)

Research methodology I

Predictive analysis of Social Media

consists of two phases

Data conditioning phase

Predictive analysis phase

Research methodology II

Data Conditioning Phase

Collection and filtering of raw data

• Determination of time window

• Selection of search terms

• Selection of data extraction method

Computation of Predictor Variables

• Selection of prediction variables

• Measurement of predictor variables

Predictive Analysis Phase

Creation of Predictive Model

• Selection of predictive method

• Identification of data for evaluation of prediction

Evaluation of the Predictive Performance

• Selection of the evaluation method

• Specification of the prediction baseline

Analysis phases

Research methodology III

Input and output variables

Twitter sentiments: expressed as binary sentiment classification

Share price: expressed in dollars

Future share price: expressed in dollars

Research methodology IV

Mood towards

Apple

Number of

Tweets

Apple stock

price

Data collection and analysis overview

Data collection

• Query Twitter through API

• Store in MongoDB

Preprocessing

• Remove stopwords

• Remove Tweets with links

Model evaluation

• Classification algorithm

• Neural network

Time series

• Twitter volume

• Binary sentiment classification

Correlation

• Correlation between sentiment and financial data

Collection and analysis steps overview

Some steps like model evaluation are

iterative

Data collection I

[Figure: pipeline overview (Data collection, Preprocessing, Model evaluation, Time series, Correlation) with the collected Tweets stored in a DB]

Data collection II

Data Source

Twitter

Query API

Firehose API

Gardenhose API

Data Store

MongoDB

Historic data collected through Twitter

APIs

Timestamp, message text, region

Data collection III

Data collected through Twitter query

API

Using the Java programming language

Using the Twitter4j library

Stored as JSON (JavaScript Object

Notation) in a MongoDB

Data collection IV

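import java.util.List;
import twitter4j.Query;
import twitter4j.QueryResult;
import twitter4j.Status;
import twitter4j.Twitter;
import twitter4j.TwitterException;
import twitter4j.TwitterFactory;
import twitter4j.auth.AccessToken;

// Searches Twitter for Tweets on Apple and prints author and text of each hit.
public void runQuery() {
    Twitter twitter = new TwitterFactory().getInstance();
    // OAuth credentials; the four constants are assumed to be defined in the enclosing class.
    AccessToken accessToken = new AccessToken(ACCESS_TOKEN, ACCESS_TOKEN_SECRET);
    twitter.setOAuthConsumer(CUSTOMER_KEY, CUSTOMER_SECRET);
    twitter.setOAuthAccessToken(accessToken);
    try {
        // "$AAPL" is the cashtag of Apple's stock ticker.
        Query query = new Query("$AAPL");
        QueryResult result = twitter.search(query);
        List<Status> tweets = result.getTweets();
        for (Status tweet : tweets) {
            System.out.println("@" + tweet.getUser().getScreenName() + " - " + tweet.getText());
        }
    } catch (TwitterException te) {
        te.printStackTrace();
        System.out.println("Failed to search tweets: " + te.getMessage());
        System.exit(-1);
    }
}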

Twitter query algorithm to retrieve Tweets on Apple

Data preprocessing I

[Figure: pipeline overview (Data collection, Preprocessing, Model evaluation, Time series, Correlation)]

Data preprocessing II

Remove stop words: “the”, “then”, “at” …

Remove punctuation: apostrophes, brackets, colons …

Discard Tweets with no explicit statement, like “Going to the Apple store”

Discard irrelevant Tweets like “I love apples and pears”

Discard possible spam by discarding Tweets that match the regular expression “http:” or “www”
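
The cleaning steps above can be sketched in Java, the language used for data collection in this deck. This is a minimal sketch, not the preprocessing code of the study: the class name `TweetPreprocessor`, the six-word stop list and the inclusion of "https:" in the link pattern are assumptions made for illustration. Relevance filtering (explicit statements, apples versus Apple) is not covered.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

final class TweetPreprocessor {
    // Tiny illustrative stop-word list (assumption; real lists are much longer).
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "then", "at", "a", "to", "is"));

    // Tweets containing a link are treated as possible spam.
    private static final Pattern LINK =
            Pattern.compile("https?:|www", Pattern.CASE_INSENSITIVE);

    static boolean looksLikeSpam(String tweet) {
        return LINK.matcher(tweet).find();
    }

    // Lower-cases the text, strips punctuation and drops stop words.
    static List<String> tokenize(String tweet) {
        String cleaned = tweet.toLowerCase(Locale.ROOT)
                .replaceAll("['\u2019]", "")        // drop apostrophes
                .replaceAll("[^a-z0-9\\s]", " ");   // other punctuation becomes a space
        List<String> tokens = new ArrayList<>();
        for (String token : cleaned.trim().split("\\s+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String tweet = "The new iPhone is great!";
        if (!looksLikeSpam(tweet)) {
            System.out.println(tokenize(tweet)); // prints [new, iphone, great]
        }
    }
}
```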

Data preprocessing III

Machine learning algorithms don’t take text

as input

Create feature vector

Word frequencies

n-grams, unigram, bigram, trigram …

“good”, “very good”, “not very good”

Create sentiment lexicon

Sentiment analysis highly domain specific

“This mattress had a valley after one month”

“This car uses a lot of fuel”
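
A feature vector of word and n-gram frequencies could be built as follows. This is a sketch under stated assumptions: the class and method names are hypothetical, tokens are assumed to be already cleaned, and no sentiment lexicon is applied.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class NGramFeatures {
    // Returns all contiguous n-grams of the token list, joined by a space.
    static List<String> ngrams(List<String> tokens, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            grams.add(String.join(" ", tokens.subList(i, i + n)));
        }
        return grams;
    }

    // Counts every unigram up to maxN-gram; the map is the sparse feature vector.
    static Map<String, Integer> featureVector(List<String> tokens, int maxN) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int n = 1; n <= maxN; n++) {
            for (String gram : ngrams(tokens, n)) {
                counts.merge(gram, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("not", "very", "good");
        // prints {not=1, very=1, good=1, not very=1, very good=1, not very good=1}
        System.out.println(featureVector(tokens, 3));
    }
}
```

The trigram "not very good" carries the negation that the unigram "good" alone would miss, which is why n-grams matter for sentiment features.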

Model evaluation I

[Figure: pipeline overview (Data collection, Preprocessing, Model evaluation, Time series, Correlation)]

[Figure: chart with the labels Neural Network, Naïve Bayes and Nearest Neighbor and the values 90.2 %, 84.7 % and 97.3 %]

Model evaluation II

Experience shows that no single machine

learning scheme is appropriate to all data

mining problems (Witten, Frank & Hall 2011,

p. 403)

Different algorithms are trained

The best performing algorithm will be

selected

Model evaluation III

Data classification and analysis through

Machine learning techniques

System can learn from data, e.g. detect spam

Find and describe structural patterns in data and generalize

Data classification is a supervised

learning problem

Class label is known

Model evaluation IV

Other machine learning models are

Unsupervised learning

Class label is unknown

Used for cluster analysis

Semi-supervised learning

Small amount of labeled data, big volumes of

unlabeled data

Model evaluation V

Model evaluation through iterative supervised

machine learning process

Select classification algorithm: Naïve Bayes, k-NN, decision tree induction …

Find a function ƒ that classifies Tweets into

positive and negative Tweets

Data is divided into training and test data

Model is trained using the training data

Trained model is verified using the test data

Model evaluation VI

Determine through loss function how well the

model performs on future, unseen data

Calculate error:

Training error = fraction of training examples misclassified

Test error = fraction of test examples misclassified

Generalization error = probability of misclassifying a new random example

Model evaluation VII

Testing determines the classification

accuracy

$\text{Accuracy} = \dfrac{\text{number of correct classifications}}{\text{total number of test cases}}$

Simple, but very optimistic if the training data is reused for testing
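
As an illustration, accuracy and error rate for a binary classifier can be computed as below. The class name and the five predictions are made up for the example; `true` stands for a positive Tweet.

```java
final class ClassifierMetrics {
    // Counts how many predictions differ from the known class labels.
    static int misclassified(boolean[] predicted, boolean[] actual) {
        if (predicted.length != actual.length || actual.length == 0) {
            throw new IllegalArgumentException("need two non-empty arrays of equal length");
        }
        int wrong = 0;
        for (int i = 0; i < actual.length; i++) {
            if (predicted[i] != actual[i]) {
                wrong++;
            }
        }
        return wrong;
    }

    // Fraction of test cases classified correctly.
    static double accuracy(boolean[] predicted, boolean[] actual) {
        int wrong = misclassified(predicted, actual);
        return (double) (actual.length - wrong) / actual.length;
    }

    // Fraction of test cases misclassified.
    static double errorRate(boolean[] predicted, boolean[] actual) {
        return (double) misclassified(predicted, actual) / actual.length;
    }

    public static void main(String[] args) {
        boolean[] predicted = {true, true, false, false, true};
        boolean[] actual    = {true, false, false, false, true};
        System.out.println(accuracy(predicted, actual));  // prints 0.8
        System.out.println(errorRate(predicted, actual)); // prints 0.2
    }
}
```

Running the same computation on the training set gives the training error, on held-out data the test error.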

Model evaluation VIII

n-fold cross-validation

Data divided randomly into n folds, where typically 4 < n < 11

n – 1 folds used for training, 1 holdout fold for testing

Error rate is calculated on the holdout fold

Repeated n times such that each fold is the holdout fold once

Error estimate is averaged over all n error rates
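
The procedure can be sketched as follows. To keep the example self-contained, a majority-class baseline stands in for the real classifier; that stand-in, the class name and the fixed random seed are assumptions, and any classifier trained on the n – 1 folds could take its place.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class CrossValidation {
    // Returns the holdout error averaged over n folds.
    static double crossValidate(boolean[] labels, int n, long seed) {
        if (n < 2 || labels.length < n) {
            throw new IllegalArgumentException("need at least n examples and n >= 2");
        }
        // Shuffle the example indices so that fold membership is random.
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < labels.length; i++) {
            order.add(i);
        }
        Collections.shuffle(order, new Random(seed));

        double errorSum = 0.0;
        for (int fold = 0; fold < n; fold++) {
            List<Integer> holdout = new ArrayList<>();
            int trainPositive = 0;
            int trainTotal = 0;
            for (int k = 0; k < order.size(); k++) {
                int example = order.get(k);
                if (k % n == fold) {
                    holdout.add(example);          // 1 fold for testing
                } else {
                    trainTotal++;                  // n - 1 folds for training
                    if (labels[example]) {
                        trainPositive++;
                    }
                }
            }
            // "Training": the baseline always predicts the majority class.
            boolean prediction = 2 * trainPositive >= trainTotal;
            int wrong = 0;
            for (int example : holdout) {
                if (labels[example] != prediction) {
                    wrong++;
                }
            }
            errorSum += (double) wrong / holdout.size();
        }
        return errorSum / n; // average over all n error rates
    }

    public static void main(String[] args) {
        boolean[] labels = {true, true, true, true, false, true, true, false, true, true};
        System.out.println(crossValidate(labels, 5, 42L));
    }
}
```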

Model evaluation IX

Typical data mining task goes through many

iterations

As many iterations as necessary until the result is satisfactory, i.e. accuracy converges

Best data mining scheme is selected

Used against unseen data for classification

Can be used on real-time data

Model evaluation X

RapidMiner workbench

Model evaluation XI

Training data

            sex     mask  cape  tie  ears  smokes  class
Batman      male    yes   yes   no   yes   no      Good
Robin       male    yes   yes   no   no    no      Good
Alfred      male    no    no    yes  no    no      Good
Penguin     male    no    no    yes  no    yes     Bad
Catwoman    female  yes   no    no   yes   no      Bad
Joker       male    no    no    no   no    no      Bad

Test data

            sex     mask  cape  tie  ears  smokes  class
Batgirl     female  yes   yes   no   yes   no      ?
Riddler     male    yes   no    no   no    no      ?

Model evaluation XII

Description of data:

Generalisation for new examples

if sex = male and mask = yes and cape = yes and tie = no and ears = yes and smokes = no then character = Good

if mask = yes and ears = yes and smokes = no then character = Good

Model evaluation XIII

tie
  no:  cape
         no:  bad
         yes: good
  yes: smokes
         no:  good
         yes: bad
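
The tree on this slide can be written as nested conditions. This sketch reads the tree as testing tie at the root, then cape on the "no" branch and smokes on the "yes" branch; the leaf labels are inferred from the training table (an assumption, since the extracted slide does not show which leaf belongs to which branch). The class name is hypothetical.

```java
final class SuperheroTree {
    // Classifies a character from the three attributes the tree actually uses.
    static String classify(boolean tie, boolean cape, boolean smokes) {
        if (tie) {
            return smokes ? "Bad" : "Good";
        }
        return cape ? "Good" : "Bad";
    }

    public static void main(String[] args) {
        // Test data from the previous slide: neither character wears a tie.
        System.out.println("Batgirl: " + classify(false, true, false));  // prints Batgirl: Good
        System.out.println("Riddler: " + classify(false, false, false)); // prints Riddler: Bad
    }
}
```

The tree ignores sex, mask and ears yet classifies all six training examples correctly, which is the kind of compact generalization the slides argue for.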

Model evaluation XIV

Trees must be:

Big enough to fit training data

Big enough to capture true patterns

Not too big (Ockham’s razor):

Overfitting

Capture noise

Find spurious patterns

Model evaluation XIV

Best tree size cannot be determined

from training error

Schapire 2004

Model evaluation XV

Schapire 2004

Model evaluation XVI

For building an accurate classifier:

Enough training examples

Good performance on training set

Classifier that is not too complex

Strategy for controlling tree size:

Build large tree that fully fits training data

Prune back

Model evaluation XVII

Grow on just part of the training data, then

prune using minimum error on held out

data

Classifiers I

Decision trees:

Best known:

C4.5 (Quinlan), successor C5.0

CART, for classification and regression trees (Breiman et al.)

Fast to train and evaluate

Relatively easy to interpret

Accuracy often not satisfactory

Classifiers II

Perceptron (Neuron)

Linear classifier

Data linearly separable using a hyperplane

Binary classifier f(a) that maps its input vector a to a single, binary output value

$w_0 a_0 + w_1 a_1 + w_2 a_2 + \dots + w_k a_k = 0$

where w = weights, a = real-valued feature vector, $a_0$ = bias

Classifiers III

[Figure: perceptron with a bias input of 1 (weight $w_0$) and attribute inputs $a_1$, $a_2$, $a_3$ (weights $w_1$, $w_2$, $w_3$)]

$f(a) = \sum_k w_k a_k + b$

f(a) > 0 or f(a) < 0
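
A perceptron of this form can be sketched in Java. The classic perceptron learning rule used in `train` (add or subtract the misclassified input) is an assumption, since the slides do not name a training rule, and the logical AND data set is a toy example chosen because it is linearly separable.

```java
final class Perceptron {
    // w[0] is the bias weight, paired with the constant input 1.
    final double[] w;

    Perceptron(int numAttributes) {
        w = new double[numAttributes + 1];
    }

    // f(a) = sum over k of w_k * a_k, plus the bias.
    double f(double[] a) {
        double sum = w[0];
        for (int k = 0; k < a.length; k++) {
            sum += w[k + 1] * a[k];
        }
        return sum;
    }

    // The sign of f(a) decides the class.
    boolean predict(double[] a) {
        return f(a) > 0;
    }

    // Perceptron rule: on each mistake, move the weights towards the correct side.
    void train(double[][] inputs, boolean[] labels, int maxEpochs) {
        for (int epoch = 0; epoch < maxEpochs; epoch++) {
            int mistakes = 0;
            for (int i = 0; i < inputs.length; i++) {
                if (predict(inputs[i]) != labels[i]) {
                    double sign = labels[i] ? 1.0 : -1.0;
                    w[0] += sign;
                    for (int k = 0; k < inputs[i].length; k++) {
                        w[k + 1] += sign * inputs[i][k];
                    }
                    mistakes++;
                }
            }
            if (mistakes == 0) {
                return; // every training example is classified correctly
            }
        }
    }

    public static void main(String[] args) {
        double[][] inputs = {{0, 0}, {0, 1}, {1, 0}, {1, 1}};
        boolean[] labels = {false, false, false, true}; // logical AND
        Perceptron p = new Perceptron(2);
        p.train(inputs, labels, 100);
        for (double[] a : inputs) {
            System.out.println((int) a[0] + " AND " + (int) a[1] + " -> " + p.predict(a));
        }
    }
}
```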

Classifiers IV

Multilayer Perceptron

Non-linear classifier

Perceptrons are connected in a

hierarchical structure

Classifiers V

Not all data is linearly separable

Classifiers VI

[Figure: multilayer perceptron with a bias input of 1 and attribute inputs $a_1$, $a_2$, arranged in an input layer, a hidden layer and an output layer]

Classifiers VII

Multilayer Perceptron

Perceptrons organized in several layers

Each layer is fully connected to the next layer

All nodes except the input nodes are perceptrons

Feedforward neural network

Uses backpropagation for training

Error propagated back to minimize loss function
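
To show why a hidden layer helps with data that is not linearly separable, the sketch below wires three threshold units into a feedforward network that computes XOR. The weights are picked by hand for illustration and are not learned; backpropagation would need a differentiable activation instead of the step function used here. All names are hypothetical.

```java
final class XorNetwork {
    // Step activation: the unit fires when its weighted sum is positive.
    static int step(double value) {
        return value > 0 ? 1 : 0;
    }

    // One perceptron with a bias and two weighted inputs.
    static int unit(double bias, double w1, double w2, int a1, int a2) {
        return step(bias + w1 * a1 + w2 * a2);
    }

    // Input layer (a1, a2) -> hidden layer (h1, h2) -> output layer.
    static int xor(int a1, int a2) {
        int h1 = unit(-0.5, 1, 1, a1, a2);  // fires for a1 OR a2
        int h2 = unit(-1.5, 1, 1, a1, a2);  // fires for a1 AND a2
        return unit(-0.5, 1, -1, h1, h2);   // fires for h1 AND NOT h2
    }

    public static void main(String[] args) {
        for (int a1 = 0; a1 <= 1; a1++) {
            for (int a2 = 0; a2 <= 1; a2++) {
                System.out.println(a1 + " XOR " + a2 + " = " + xor(a1, a2));
            }
        }
    }
}
```

No single perceptron can separate the XOR classes with one hyperplane, but two hidden units followed by an output unit can.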

Classifiers VIII

Allows approximate solutions to be found for very complex problems

Support Vector Machines (SVM) are a

much simpler alternative to ANN

Many more classifiers

k-Nearest Neighbor

Naïve Bayes

…

Data classification I

[Figure: pipeline overview (Data collection, Preprocessing, Model evaluation, Time series, Correlation)]

Data Classification II

Data classification:

Binary mood polarity: positive, negative

Represented graphically as time series

Positive Tweets

Negative Tweets
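
Turning classified Tweets into a daily time series of positive and negative counts could look like this. The `ClassifiedTweet` type, the daily granularity and the sample dates are assumptions for illustration; the slides do not state the time resolution used.

```java
import java.time.LocalDate;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class SentimentTimeSeries {
    /** A classified Tweet: posting day and binary polarity (hypothetical type). */
    static final class ClassifiedTweet {
        final LocalDate day;
        final boolean positive;

        ClassifiedTweet(LocalDate day, boolean positive) {
            this.day = day;
            this.positive = positive;
        }
    }

    /** Maps each day to {positive count, negative count}, sorted by date. */
    static Map<LocalDate, int[]> dailyCounts(List<ClassifiedTweet> tweets) {
        Map<LocalDate, int[]> series = new TreeMap<>();
        for (ClassifiedTweet tweet : tweets) {
            int[] counts = series.computeIfAbsent(tweet.day, d -> new int[2]);
            counts[tweet.positive ? 0 : 1]++;
        }
        return series;
    }

    public static void main(String[] args) {
        LocalDate day1 = LocalDate.of(2014, 2, 13);
        LocalDate day2 = LocalDate.of(2014, 2, 14);
        List<ClassifiedTweet> tweets = Arrays.asList(
                new ClassifiedTweet(day1, true),
                new ClassifiedTweet(day1, false),
                new ClassifiedTweet(day2, true),
                new ClassifiedTweet(day2, true));
        for (Map.Entry<LocalDate, int[]> entry : dailyCounts(tweets).entrySet()) {
            System.out.println(entry.getKey() + " positive=" + entry.getValue()[0]
                    + " negative=" + entry.getValue()[1]);
        }
    }
}
```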

Correlations I

[Figure: pipeline overview (Data collection, Preprocessing, Model evaluation, Time series, Correlation)]

[Figure: chart of sentiment polarity and share price]

Correlations II

Finding correlations:

Binary sentiment classification time series

compared against stock price over same

time frame

Does a rise in the number of positive Tweets precede a surge in the Apple stock price?
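
One simple way to probe this question is a lagged Pearson correlation, which pairs the sentiment on day t with the price on day t + lag. This is a sketch, not the method of the study: Pearson only measures linear association, the class name is hypothetical, and the numbers in `main` are invented toy data.

```java
import java.util.Arrays;

final class LaggedCorrelation {
    // Pearson correlation coefficient of two equally long series.
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two equally long series with at least 2 points");
        }
        double meanX = Arrays.stream(x).average().getAsDouble();
        double meanY = Arrays.stream(y).average().getAsDouble();
        double cov = 0.0;
        double varX = 0.0;
        double varY = 0.0;
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        return cov / Math.sqrt(varX * varY);
    }

    // Correlates sentiment[t] with price[t + lag].
    static double laggedCorrelation(double[] sentiment, double[] price, int lag) {
        int n = Math.min(sentiment.length, price.length) - lag;
        double[] x = Arrays.copyOfRange(sentiment, 0, n);
        double[] y = Arrays.copyOfRange(price, lag, lag + n);
        return pearson(x, y);
    }

    public static void main(String[] args) {
        double[] positiveTweets = {1, 3, 2, 5, 4};      // toy daily counts
        double[] price = {11, 12, 16, 14, 20};          // toy closing prices
        System.out.println("lag 0: " + laggedCorrelation(positiveTweets, price, 0));
        System.out.println("lag 1: " + laggedCorrelation(positiveTweets, price, 1));
    }
}
```

A correlation that is stronger at a positive lag than at lag 0 would hint that sentiment leads the price, which is what a predictive model needs.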

Correlations III

Microsoft stock price (Yahoo! Finance 2014)

Correlations IV

Tweet polarity and MSFT stock price

Correlations V

If there are correlations in historic data,

trained model used against real time

data

Access real-time Tweets using Twitter's streaming API

Firehose API (100% of real time Tweets)

Gardenhose API (10% of real time Tweets)

Spritzer API (1% of real time Tweets)

Correlations VI

Since correlations are almost certainly non-linear, correlating has to be automated

Bivariate Granger causality test

Determine whether one time series can be

used to predict another

If time series X helps to predict time series Y, X is said to Granger-cause Y

X provides statistically significant information about Y
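
In its usual textbook form, the bivariate test compares two regressions of Y on lagged values. The notation below (lag order p, coefficients α and β, T usable observations) is introduced here for illustration and does not come from the slides.

```latex
% Restricted model: Y explained by its own past only
Y_t = \alpha_0 + \sum_{i=1}^{p} \alpha_i Y_{t-i} + \varepsilon_t

% Unrestricted model: past values of X added
Y_t = \alpha_0 + \sum_{i=1}^{p} \alpha_i Y_{t-i} + \sum_{i=1}^{p} \beta_i X_{t-i} + u_t

% Null hypothesis: X does not Granger-cause Y
H_0 : \beta_1 = \beta_2 = \dots = \beta_p = 0

% F statistic from the residual sums of squares of both models
F = \frac{(RSS_r - RSS_u) / p}{RSS_u / (T - 2p - 1)}
```

A large F value rejects the null hypothesis, meaning the lagged sentiment series X adds statistically significant information about the price series Y beyond what Y's own history provides.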

Correlations VII

Granger test examines linear causality

among bivariate or multivariate time series

Many real-world phenomena are not linear

Non-linear extensions to Granger have

been developed

Other correlation techniques

Phase Slope Index measures temporal flux

between time series

Correlations VIII

More robust than Granger since more

immune against noise

Machine learning techniques such as

ANN can be used for finding

correlations

Applications I

Technologies for predictive analysis

have matured

IBM SPSS

Stata

SAS

Applications II

Free open source

WEKA

Partly open source

RapidMiner

Cloud solutions

IBM WatsonAnalytics

Google BigQuery

SAS Cloud Analytics

Challenges I

Real-world data is often of very poor quality

Social Media vast, noisy and

unstructured

Getting relevant posts is challenging

Spam has become a serious issue

Detecting sarcasm very difficult

Political opinions full of irony and sarcasm

Data preprocessing one of the most

important steps

Challenges II

Opinion mining remains challenging

task

Overall statement often difficult to

determine

No ground truth

Not everybody is using Social Media

Self-selection bias

Conclusions I

Predictive analysis poses many

interesting research problems

Many opportunities for future research

Determining the credibility of posts (catfish,

sock puppet)

Better filtering mechanisms

More research in Machine Learning

than feature extraction

Conclusions II

Correlation does not mean causation

Finding causative mechanism for

correlation

Thank you for the attention

Questions?

References I

Achrekar, H, Gandhe, A, Lazarus, R, Ssu-Hsin, Y and Benyuan, L 2011, 'Predicting Flu Trends using Twitter data', Computer

Communications Workshops (INFOCOM WKSHPS), IEEE, pp. 702-7.

Arias, M, Arratia, A & Xuriguera, R 2014, 'Forecasting with twitter data', ACM Trans. Intell. Syst. Technol., vol. 5, no. 1, pp. 1-24.

Asur, S & Huberman, BA 2010, 'Predicting the Future with Social Media', in Web Intelligence and Intelligent Agent Technology (WI-IAT), 2010 IEEE/WIC/ACM International Conference on, vol. 1, pp. 492-9.

Berman, JJ 2013, Principles of Big Data, Elsevier Inc., Waltham, USA.

Bollen, J, Mao, H & Zeng, X-J 2010, 'Twitter mood predicts the stock market', Journal of Computational Science, vol. 2, p. 8.

Buhl, H, Röglinger, M, Moser, F & Heidemann, J 2013, 'Big Data', WIRTSCHAFTSINFORMATIK, vol. 55, no. 2, pp. 63-8.

Bulysheva, L & Bulyshev, A 2012, 'Segmentation modeling algorithm: a novel algorithm in data mining', Information Technology

and Management, vol. 13, no. 4, pp. 263-71.

Darwish, A & Lakhtaria, KI 2011, The Impact of the New Web 2.0 Technologies in Communication, Development, and

Revolutions of Societies, vol. 2, 2011.

Goh, KY, Heng, CS & Lin, Z 2012, ‘Social Media Brand Community and Consumer Behavior: Quantifying the Relative Impact of

User- and Marketer-Generated Content’, School of Computing, National University of Singapore, viewed 9 April 2013,

<https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2048614>.

Graham, DM, Hale, SA & Stephens, M 2011, 'User-generated Content in Google', Oxford University, Oxford, UK, viewed 27

October 2013, < http://www.oii.ox.ac.uk/vis/?id=4e3c030d>.

Harris, D 2013, 'DataSift raises $42M', Gigaom, viewed 27 December 2013, <http://gigaom.com/2013/12/03/datasift-raises-42m-

maybe-theres-something-to-this-social-data-after-all/>.

Huang, S, Peng, W, Li, J & Lee, D 2013, 'Sentiment and topic analysis on social media: a multi-task multi-label classification

approach', paper presented to Proceedings of the 5th Annual ACM Web Science Conference, Paris, France.

Kao, A, Ferng, W, Poteet, S, Quach, L & Tjoelker, R 2013, 'TALISON - Tensor analysis of social media data', in Intelligence and

Security Informatics (ISI), 2013 IEEE International Conference on, pp. 137-42.

Klein, D, Tran-Gia, P & Hartmann, M 2013, 'Big Data', Informatik-Spektrum, vol. 36, no. 3, p. 319.

Kumar, P, Nitin, Chauhan, DS & Sehgal, VK 2012, 'Selection of evolutionary approach based hybrid data mining algorithms for

decision support systems and business intelligence', paper presented to Proceedings of the International Conference on

Advances in Computing, Communications and Informatics, Chennai, India.


References II

Kumar, P, Kumar Sehgal, N, Kumar Sehgal, V & Singh Chauhan, D 2012, 'A Benchmark to Select Data Mining Based

Classification Algorithms for Business Intelligence and Decision Support Systems', International Journal of Data Mining &

Knowledge Management Process, vol. 2, no. 5, pp. 25-42.

Lim, E-P, Chen, H & Chen, G 2013, 'Business Intelligence and Analytics: Research Directions', ACM Trans. Manage. Inf. Syst.,

vol. 3, no. 4, pp. 1-10.

Manyika, J, Chui, M, Brown, B, Bughin, J, Dobbs, R, Roxburgh, C & Byers, AH 2011, Big data: The next frontier for innovation,

competition, and productivity, McKinsey Global Institute.

Mayer-Schonberger, V & Cukier, K 2013, Big Data: A Revolution That Will Transform How We Live, Work, and Think, Houghton

Mifflin Harcourt Publishing Company, New York, USA.

Mayer, A 2009, 'Online social networks in economics', Decision Support Systems, vol. 47, no. 3, pp. 169-184, viewed 22

September 2013, < http://sistemas-humano-computacionais.wdfiles.com/local--files/capitulo%3Aredes-sociais/amayer.pdf>.

McKelvey, K, Rudnick, A, Conover, MD & Menczer, F 2012, 'Visualizing Communication on Social Media, Making Big Data

Accessible', Indiana University School of Informatics and Computing, viewed 29 September 2013,

<http://arxiv.org/pdf/1202.1367v1.pdf>.

Neri, F, Aliprandi, C, Capeci, F, Cuadros, M & By, T 2012, 'Sentiment Analysis on Social Media', in Advances in Social Networks

Analysis and Mining (ASONAM), 2012 IEEE/ACM International Conference on, pp. 919-26.

Oboler, A, Welsh, K & Cruz, L 2012, The danger of big data: Social media as computational social science, 2012.

Ostrowski, DA 2011, 'Predictive Semantic Social Media Analysis', in Semantic Computing (ICSC), 2011 Fifth IEEE International

Conference on, pp. 283-90.

Paltoglou, G & Thelwall, M 2012, 'Twitter, MySpace, Digg: Unsupervised Sentiment Analysis in Social Media', ACM Trans. Intell.

Syst. Technol., vol. 3, no. 4, pp. 1-19.

Rusli, EM 2013, Facebook Woos TV Networks With Data, Digits, viewed 15 February 2014,

<http://blogs.wsj.com/digits/2013/09/29/facebook-woos-tv-networks-with-more-data/>.

Smith, MS, Ventura, AD, Dewey, DP, Knutson, CD & Embley, DW 2011, ‘A Computational Framework for Social Capital in Online

Communities’, Brigham Young University, viewed 28 July 2013, <http://posts.smithworx.com/publications/d.pdf>.


References III

Yahoo! Finance 2014, Microsoft Corporation (MSFT), Yahoo, viewed 15 February 2014,

<http://finance.yahoo.com/echarts?s=MSFT+Interactive#symbol=msft;range=20130102,20140214;compare=;indicator=volume;charttype=area;crosshair=on;ohlcvalues=0;logscale=off;source=;>.

Trif, S 2011, 'Using Genetic Algorithms in Secured Business Intelligence Mobile Applications', Informatica economica, vol. 15, no. 1,

pp. 69-79.

Tumasjan, A, Welpe, IM, Sandner, PG, Tumasjan, A & Sprenger, TO 2011, 'Election Forecasts With Twitter: How 140 Characters

Reflect the Political Landscape', Social science computer review, vol. 29, no. 4, pp. 402-18.

Sakaki, T, Okazaki, M and Matsuo, Y 2010, 'Earthquake shakes Twitter users: real-time event detection by social sensors', Proc. of the

19th international conference on World wide web, Raleigh.

Twitter Statistics 2014, Statistic brain, viewed 18 February 2014, <http://www.statisticbrain.com/twitter-statistics/>.

Walton, A 2014, ‘Twitter Usage by Region’, Chron, viewed 18 February 2014, < http://smallbusiness.chron.com/twitter-usage-region-

62762.html>.

Wang, F-Y, Carley, KM, Zeng, D & Mao, W 2007, 'Social Computing: From Social Informatics to Social Intelligence', Intelligent

Systems, IEEE, vol. 22, no. 2, pp. 79-83.

Weka knowledge explorer, viewed 15 February 2014, <http://www.cs.waikato.ac.nz/~ml/weka/gui_explorer.html>.

Witten, IH, Frank, E & Hall, MA 2011, Data Mining, 3 edn, Elsevier, Burlington, MA, USA.

Wlodarczak, P 2014, ‘Big Personal Data’, Social Science Research Network, <http://dx.doi.org/10.2139/ssrn.2514721>.

World Stock Exchanges 2011, viewed 18 February 2014, <http://www.world-stock-exchanges.net/top10.html>.

Wong, FMF, Sen, S & Chiang, M 2012, 'Why Watching Movie Tweets Won’t Tell the Whole Story?', Cornell University, viewed 14 May

2013, <http://arxiv.org/pdf/1203.4642v1.pdf>.

Wu, X, Kumar, V, Ross Quinlan, J, Ghosh, J, Yang, Q, Motoda, H, McLachlan, GJ, Ng, A, Liu, B, Yu, PS, Zhou, Z-H, Steinbach, M,

Hand, DJ & Steinberg, D 2007, 'Top 10 algorithms in data mining', Knowledge and Information Systems, vol. 14, no. 1, pp. 1-37.

Zeng, D, Chen, H, Lusch, R & Li, S-H 2010, 'Social Media Analytics and Intelligence', Intelligent Systems, IEEE, vol. 25, no. 6, pp. 13-

6.

Zeng, L, Li, L & Duan, L 2012, 'Business intelligence in enterprise computing environment', Information Technology and Management,

vol. 13, no. 4, pp. 297-310.
