The process of gathering and analyzing Twitter data to predict stock returns

EC115

Economics

Purpose

Many Americans save for retirement through plans such as 401(k)s and IRAs, which invest in mutual funds, which are groups of stocks. These mutual funds can lose money for prospective retirees: in 2008, employees lost $10,000 on average from their retirement savings (Brandon, 2009).

Investors seeking information about public entities traditionally gather the majority of their data from financial publications and documents filed by a company with the Securities and Exchange Commission. These sources typically contain financial data including revenues, earnings per share, price-earnings ratios, cash flows, dividend yields, product launches, and company management strategies (Rader, 2006). However, this purely financial approach to market prediction neglects the bullish or bearish sentiments of the public, which play a significant role in the movement of the market. Because investors depend so heavily on financial data, stock market predictions are generally inaccurate (Ferri, 2013). By 2008, the CXO Advisory Group had collected and graded more than 5,000 predictions, and the accuracy of these predictions has stabilized at about 48 percent ever since (Figure 1).

Figure 1. Data collected on the accuracy of stock market predictions by the CXO Advisory Group.

Hypothesis

The experiment sought a correlation between the calculated social media sentiment value for a company and the stock returns of that company. The hypothesis was that a positive social media sentiment would result in an increase in stock returns and a negative social media sentiment would result in a decrease in stock returns. The null hypothesis was that social media sentiment would have no impact on stock returns.

Background Literature Review

Social media is an outlet for instant news, which is valuable to investors because accurate news about a company is a reliable predictor of that company's stock value. In one study, principal components analysis reduced the approximately 400 features extracted from social media to about 30 features that captured about 25% of the variance. An ordinary least squares regression then forecast stock price movements from these approximately 30 variables (Sisk, 2013). When such statistical processes are applied to data collected from social media, news metadata can inform short-term future price movements and volatility.
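The two-stage approach described above (dimensionality reduction followed by regression) can be sketched roughly as follows. This is only an illustration on synthetic random data, not the method of the cited study: the function names `pca_reduce` and `ols_fit`, the array sizes, and the use of NumPy are all assumptions made for the example.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project the rows of X onto the top principal components.

    Returns the reduced matrix and the fraction of variance retained.
    """
    X_centered = X - X.mean(axis=0)
    # SVD of the centered data: the rows of Vt are the principal directions.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)
    Z = X_centered @ Vt[:n_components].T
    return Z, float(explained[:n_components].sum())

def ols_fit(Z, y):
    """Ordinary least squares with an intercept; returns (coefficients, intercept)."""
    A = np.column_stack([Z, np.ones(len(Z))])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[:-1], float(beta[-1])

# Synthetic stand-ins (assumed): 500 observations of 400 features,
# and one target value per observation.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 400))
y = rng.normal(size=500)

Z, retained = pca_reduce(X, 30)       # 400 features -> 30 components
coef, intercept = ols_fit(Z, y)       # regress the target on the 30 components
print(Z.shape, coef.shape, round(retained, 3))
```

On real data the retained-variance figure indicates how much information the reduced features keep, which is the quantity the study reports as about 25%.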

In a study by Fisher and Statman (2000), there was no significant relationship between the change in sentiment in one month and stock returns in the following month. For large market capitalization stock returns, the relationship was negative for all sentiment groups but never statistically significant. The relationship between the change in sentiment during a month and the following month's stock returns was positive for the small market capitalization stocks of the Center for Research in Security Prices (CRSP) 9-10, but that relationship also was not statistically significant.

Economists at the University of Rochester, Cornell University, and the University of Texas concluded that individual investor sentiment predicts future stock returns, and that investor sentiment is independent of data involving past returns or past volume (Kaniel et al., 2004). Furthermore, the trading of these individual investors predicts weekly returns for stocks of all market capitalizations, while data involving past returns are only predictive for stocks with a small market capitalization over the same time period. These findings show that the sentiment of the individual investor is highly indicative of the movement of stock price.

When the propensity of investors to speculate is high and when the stock is volatile, sentiment is a significant predictor of the fluctuations in the stock price (Baker and Wurgler, 2007). However, when a stock is stable and investors are calm, sentiment has less bearing on the movement of stock price.

Using Facebook's Gross National Happiness Index (FGNHI), which determined the average optimism and pessimism of its users in a particular region, researchers found a significant positive relation between sentiment and contemporaneous stock market returns, showing that optimistic sentiment is related to gains in the market index and pessimistic sentiment is related to losses in the market index (Siganos et al., 2014). However, the procedure of using public sentiment as an indicator of the market may be flawed, since the relation between sentiment and stock returns is subject to reverse causality. For example, when investors profit or lose in the market, they may express the resulting sentiments through social media. These specific sentiments are not indicative of what the movement of the market will be tomorrow.

Bollen, Mao, and Zeng (2011) investigated whether measurements of collective mood states derived from large-scale Twitter feeds are correlated with the value of the Dow Jones Industrial Average (DJIA) over time. They used OpinionFinder to measure positive vs. negative mood and Google-Profile of Mood States (GPOMS) to measure mood in terms of 6 dimensions (Calm, Alert, Sure, Vital, Kind, and Happy). Through this process, an accuracy of 86.7% was achieved in predicting the daily up and down changes in the closing values of the DJIA, and the Mean Average Percentage Error (MAPE) was reduced by more than 6%.

Research Methodology

Six companies were chosen for the calculations: Kirkland (KIRK), Zynga (ZNGA), Aramark (ARMK), Foot Locker (FL), Verizon (VZ), and Disney (DIS). These companies were chosen from the Russell Index Member Lists. Kirkland and Zynga were chosen from the Russell Microcap List (Russell Investments, 2011), which lists the stocks of highly successful small companies; Aramark and Foot Locker were chosen from the Russell Midcap List (Russell Investments, 2011), which lists the stocks of highly successful midsize companies; and Verizon and Disney were chosen from the Russell 1000 Index List (Russell Investments, 2011), which lists the stocks of highly successful large companies. These companies were chosen for their varying product fields, their importance to consumer markets, and their distinctive names.

A Python program entitled StockPile was developed. The program asked for the name of the company whose stock price was to be predicted. Using the Twitter REST API, it collected all the Tweets from the last eight days that mentioned the name of the specified company. Because the Twitter REST API limits the number of queries allowed per fifteen-minute interval, multiple API accounts were created and cycled through each time one hit the limit. Python's Natural Language Toolkit (NLTK) assigned a sentiment and polarity value to each Tweet, and the Twitter API supplied the number of favorites, the number of retweets, and the number of followers of the author for each Tweet. However, the number-of-followers metric was removed after its statistical insignificance was noted. Each Tweet, with its corresponding data, was stored as a TweetObj object, as shown in Figure 2. The metrics of the TweetObjs were then paired with the stock prices of that company two days in the future (a two-day time shift). The TweetObjs were saved in a Pickle file. This process was repeated for each of the tested companies: Kirkland (KIRK), Zynga (ZNGA), Aramark (ARMK), Foot Locker (FL), Verizon (VZ), and Disney (DIS).

Figure 2. Class Diagram for the TweetObj class.
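A minimal sketch of how a TweetObj record and the two-day pairing with stock prices might look is given below. The field names, the helper `attach_prices`, the sample Tweets, and the closing price are all hypothetical; only the class name TweetObj, the stored metrics, the two-day shift, and the use of Pickle come from the text. The sketch pickles plain dictionaries rather than the objects themselves, which is a simplification chosen so the saved data can be read back without the class definition.

```python
import pickle
from dataclasses import asdict, dataclass
from datetime import date, timedelta
from typing import Dict, List, Optional

@dataclass
class TweetObj:
    # Field names are assumptions; the paper only names the metrics.
    tweet_id: int
    text: str
    posted_on: date
    sentiment: float
    favorites: int
    retweets: int
    price: Optional[float] = None  # stock price paired with this Tweet

def attach_prices(tweets: List[TweetObj], closes: Dict[date, float],
                  shift_days: int = 2) -> List[TweetObj]:
    """Pair each Tweet with the closing price shift_days after it was posted.

    Tweets whose shifted date has no price (weekend, holiday) are dropped.
    """
    paired = []
    for tweet in tweets:
        target = tweet.posted_on + timedelta(days=shift_days)
        if target in closes:
            tweet.price = closes[target]
            paired.append(tweet)
    return paired

# Made-up sample data for illustration only.
tweets = [
    TweetObj(1, "Love the new Disney park", date(2014, 12, 1), 0.8, 12, 3),
    TweetObj(2, "Disney stock looks weak", date(2014, 12, 4), -0.4, 2, 0),
]
closes = {date(2014, 12, 3): 93.10}

paired = attach_prices(tweets, closes)

# Serialize the paired records; pickle.dump(..., file) would write them to disk.
blob = pickle.dumps([asdict(t) for t in paired])
restored = pickle.loads(blob)
print(len(restored), restored[0]["price"])
```

The second Tweet is dropped because its shifted date falls on a day with no closing price, which is one way to handle weekends when pairing Tweets with prices.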

A least squares multiple regression was conducted on the metrics of the TweetObjs and the stock prices at a given time shift. The most accurate time shift was determined by conducting a least squares multiple regression for every time shift from zero days to one week. The regression model with the lowest p-value and the greatest correlation coefficient determined the delay with which Twitter sentiment about a company affected that company's stock price. The predictive model with the most accurate time shift provided an equation to predict the stock price in the following form:

$$P_t = \left[\sum_{i=1}^{n} C_i F_i\right] + K$$

where $n$ is the number of features, $C_i$ is the coefficient of the $i$-th feature, $F_i$ is the value of the $i$-th feature, $K$ is a constant, and $P_t$ is the stock price after a certain time delay.
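The regression and time-shift search described above could be sketched as follows. This is an assumption-laden illustration, not the StockPile code: it uses NumPy's least squares solver, daily rather than per-Tweet data, synthetic values built so that the true delay is two days, and it ranks shifts by in-sample R² only (the p-value criterion is left out for brevity). The names `fit_linear` and `best_time_shift` are hypothetical.

```python
import numpy as np

def fit_linear(F, p):
    """Least-squares fit of p ~ F @ C + K; returns (C, K, r_squared)."""
    A = np.column_stack([F, np.ones(len(F))])
    beta, *_ = np.linalg.lstsq(A, p, rcond=None)
    pred = A @ beta
    ss_res = np.sum((p - pred) ** 2)
    ss_tot = np.sum((p - p.mean()) ** 2)
    return beta[:-1], float(beta[-1]), float(1.0 - ss_res / ss_tot)

def best_time_shift(F, prices, max_shift=7):
    """Try shifts of 0..max_shift days and keep the best in-sample fit."""
    best = None
    for shift in range(max_shift + 1):
        # Features on day d are paired with the price on day d + shift.
        F_s = F[: len(F) - shift]
        p_s = prices[shift:]
        C, K, r2 = fit_linear(F_s, p_s)
        if best is None or r2 > best[3]:
            best = (shift, C, K, r2)
    return best

# Synthetic data (assumed): three features per day; the price two days
# later depends on the first feature only.
rng = np.random.default_rng(1)
F = rng.normal(size=(60, 3))
prices = np.full(60, 50.0)
prices[2:] = 50.0 + 2.0 * F[:-2, 0] + 0.01 * rng.normal(size=58)

shift, C, K, r2 = best_time_shift(F, prices)
print(shift, np.round(C, 2), round(K, 2), round(r2, 4))

# Predicting a price from one day's feature values with the fitted equation.
predicted = float(F[0] @ C + K)
```

Because each shift is fitted on a slightly different number of days, comparing fits across shifts this way is only approximate.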

Figure 3. Data Flow Diagram for the StockPile algorithm.

After the algorithms were developed, a testing algorithm, encapsulated in testing.py, was run on each of the following three days after the markets closed at 3:00 PM EST. This program collected Tweets via the Twitter REST API for a period of one day and ran their values through the predictive equation to determine a predicted stock price for each Tweet. These predicted stock prices were then compared to the actual stock price at the corresponding time, and an R² score was calculated from the predicted and actual values.

Figure 4. Tweets referencing a certain company are collected using the Twitter Streaming API. Each terminal window is collecting Tweets about a certain company.

Figure 5. The output of the computer program that collects the Tweets and associates them with a stock price is shown above. The output shows each Tweet's ID, the properties of that Tweet, and the stock price associated with that Tweet. The last line of the output shows the coefficients and the constant for the predictive model; the coefficients are in the brackets. The companies being run in the terminals, clockwise from the top-left terminal to the bottom-left terminal, are Kirkland, Zynga, Aramark, Foot Locker, Verizon, and Disney.

Figure 6. Using the predictive model provided by the previous program, this program tests the accuracy of the predictive model. The R² value is shown at the bottom of the terminal window. The companies being run in the terminals, clockwise from the top-left terminal to the bottom-left terminal, are Kirkland, Zynga, Aramark, Foot Locker, Verizon, and Disney.

Results

Kirkland (KIRK)

Average volume of Tweets per day: 3442

Equation: 5.16412696e-01a + 2.07975484e-05b + 9.28207328e-02c + 24.4283273762

Where a is the sentiment, b is the number of favorites, and c is the number of retweets.

Day     R² Value
Day 1   -1.70142808328
Day 2   0.958742568133
Day 3   -36.3152009966

Figure 7. Data for Kirkland

Zynga (ZNGA)


Equation: 0.05174558a - 0.00014439b - 0.00452934c + 2.56291473118

Where a is the sentiment, b is the number of favorites, and c is the number of retweets.

Day     R² Value
Day 1   0.0866621636987
Day 2   0.924072874293
Day 3   0.413653843975

Figure 8. Data for Zynga

Aramark (ARMK)


Equation: 0.20719973a + 0.16352239b - 0.01903519c + 31.920903806

Where a is the sentiment, b is the number of favorites, and c is the number of retweets.

Day     R² Value
Day 1   0.0500613156024
Day 2   -19.2256210256
Day 3   -7.89317668631

Figure 9. Data for Aramark

Foot Locker (FL)


Equation: 0.07723532a - 0.08410707b + 0.0054169c + 56.7938719015

Where a is the sentiment, b is the number of favorites, and c is the number of retweets.

Day     R² Value
Day 1   -331.704823394
Day 2   -300.694360902
Day 3   -178.346760997

Figure 10. Data for Foot Locker

Verizon (VZ)


Equation: 1.37212215e-01a - 8.71009324e-05b - 2.69172043e-03c + 47.203529803

Where a is the sentiment, b is the number of favorites, and c is the number of retweets.

Day     R² Value
Day 1   0.981403546164
Day 2   -14.3259053938
Day 3   -2.06868027039

Figure 11. Data for Verizon

Disney (DIS)


Equation: 2.90521402e-01a + 3.97562468e-06b + 4.45964933e-02c + 94.8674834768

Where a is the sentiment, b is the number of favorites, and c is the number of retweets.

Day     R² Value
Day 1   -1242.50846904
Day 2   -7.7987392641
Day 3   -5.52669624237

Figure 12. Data for Disney

Conclusions

The R² value is the coefficient of determination: the proportion of the variance in the dependent variable that is predictable from the independent variables. Its upper bound is 1, which indicates a perfect fit; the lower the R² value, the worse the predictions match the data. A negative R² means the model predicts worse than simply guessing the mean of the actual values. The datasets compared in this experiment are the stock prices predicted by the model from Tweets and their related features and the actual stock prices. All but one of the trials in the experiment had a negative coefficient of determination, which means the model predicted the stock price inaccurately.
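To make the interpretation of a negative coefficient of determination concrete, the short sketch below computes R² by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The prices and predictions are made-up numbers chosen for easy arithmetic, not data from this experiment, and the function name `r_squared` is an assumption.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

actual = [24.0, 24.5, 25.0]          # made-up closing prices
close_guess = [24.1, 24.4, 25.1]     # predictions near the actual prices
mean_guess = [24.5, 24.5, 24.5]      # always predicting the mean price
poor_guess = [26.0, 23.0, 27.0]      # predictions far from the actual prices

print(r_squared(actual, close_guess))  # close to 1
print(r_squared(actual, mean_guess))   # exactly the baseline of 0
print(r_squared(actual, poor_guess))   # far below 0
```

Any model whose squared errors exceed those of the constant mean prediction scores below zero, so R² has an upper bound of 1 but no lower bound.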

The Efficient Market Hypothesis states that information relevant to a stock's price is reflected in that price immediately. Since social media data was highly uncorrelated with the stock price according to the values of the coefficient of determination, the results suggest, under the Efficient Market Hypothesis, that information on social media is not relevant to the fluctuation in stock price.

The results of this experiment are also consistent with stock movement following a Markov chain. In a Markov chain, the next state depends only on the current state and not on the sequence of states that preceded it. A special case of this type of movement is a random walk, which is a succession of random steps. In the case of stocks, each step in price can go either up or down, and these steps represent the randomness of fluctuations in the stock market.
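A random walk of the kind described above can be simulated in a few lines. The sketch below is purely illustrative: the starting price, step size, number of steps, and the function name `random_walk` are assumptions, and real prices do not move in fixed-size steps.

```python
import random

def random_walk(start, steps, step_size=1.0, seed=None):
    """Simulate a walk in which each step goes up or down with equal chance.

    The next value depends only on the current value, never on earlier ones.
    """
    rng = random.Random(seed)
    path = [start]
    for _ in range(steps):
        path.append(path[-1] + rng.choice((-step_size, step_size)))
    return path

# One simulated "year" of trading days starting from a price of 100.
path = random_walk(100.0, 250, seed=42)
ups = sum(1 for a, b in zip(path, path[1:]) if b > a)
print(len(path), ups, path[-1])
```

Because every step is an independent coin flip, knowing the earlier path gives no advantage in guessing the next move, which is the sense in which such a model makes prediction no better than chance.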

Nobel Prize-winning economist Eugene Fama (1965) finds that “empirical evidence provides strong support for the random walk model”. He notes that economists must show that their models can predict prices better than randomly choosing ‘buy’ or ‘sell’. As Figure 1 showed, the average accuracy of stock market predictions is only 48 percent, less than the fifty-fifty chance of choosing randomly. The results of this experiment corroborate his findings. The major finding of this experiment was that social media was irrelevant to determining stock returns.

Further research seeking a correlation between social media and stock returns may include gathering and testing data over longer periods of time, using more sensitive natural language analysis tools, and testing additional time shifts, on the assumption that sentiment data from previous days would impact future days.

Bibliography

Baker, M. and Wurgler, J. (2007). Investor Sentiment in the Stock Market.

Bollen, J., Mao, H. and Zeng, X. (2011). Twitter mood predicts the stock market. Journal of Computational Science, 2(1), pp.1-8.

Brandon, E. (2009, February 12). How Did Your 401(k) Really Stack Up in 2008? [online] US News. Available at: http://money.usnews.com/money/retirement/articles/2009/02/12/howdidyour401kreallystackupin2008

Fama, E. (1965). Random Walks in Stock Market Prices. Financial Analysts Journal, 21(5), pp.55-59.

Ferri, R. (2013). It's Official! Gurus Can't Accurately Predict Markets. [online] Forbes. Available at: http://www.forbes.com/sites/rickferri/2013/01/10/tsofficialguruscantaccuratelypredictmarkets/ [Accessed 28 Dec. 2014].

Fisher, K. and Statman, M. (2000). Investor Sentiment and Stock Returns. Financial Analysts Journal, 56(2), pp.16-23.

Kaniel, R., Saar, G. and Titman, S. (2004). Individual Investor Sentiment and Stock Returns.

Li, X., Xie, H., Chen, L., Wang, J. and Deng, X. (2014). News impact on stock price return via sentiment analysis. Knowledge-Based Systems, 69, pp.14-23.

Minh, D. (2013). Sentiment and Influence Analysis of Twitter Tweets. US20130103667 A1.

Paniagua, J. and Sapena, J. (2014). Business performance and social media: Love or hate?. Business Horizons, 57(6), pp.719-728.

Rader, J. (2006). Method and system for conducting sentiment analysis for securities research. US20060242040 A1.

Russell Investments, (2011). Russell 1000 Index Member List. Washington: Russell Investments.

Russell Investments, (2011). Russell Microcap Index Membership List. Washington: Russell Investments.

Russell Investments, (2011). Russell Midcap Index Member List. Washington: Russell Investments.

Siganos, A., Vagenas-Nanos, E. and Verwijmeren, P. (2014). Facebook's daily sentiment and international stock markets. Journal of Economic Behavior & Organization, 107, pp.730-743.

Sisk, J. (2013). Methods and systems for predicting market behavior based on news and sentiment analysis. US20130138577 A1.

Acknowledgements

We would like to acknowledge and thank several people for providing us with an abundance of support and assistance. We would like to express our gratitude to Dr. Alex Tabarrok of George Mason University, Professor Francis DiTraglia of the University of Pennsylvania, Dr. Andrew Lo of the Massachusetts Institute of Technology, and William Li of the Massachusetts Institute of Technology for their highly supportive feedback on the economics and programming sides of the project. We would also like to thank Dr. Csaba Gabor and Mr. Phillip Ero of the Thomas Jefferson High School for Science and Technology for providing us with support in understanding our statistical analyses. We are grateful to our lab director, Dr. Dan Burden of the Thomas Jefferson High School for Science and Technology, for patiently and proactively ensuring that we had everything we needed to conduct our experiment. Finally, and most importantly, we want to thank our families for their unparalleled support.
