


Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3353–3363, July 5–10, 2020. © 2020 Association for Computational Linguistics


Stock Embeddings Acquired from News Articles and Price History, and an Application to Portfolio Optimization

Xin Du
Department of Advanced Interdisciplinary Studies, Graduate School of Engineering, The University of Tokyo

Kumiko Tanaka-Ishii
Research Center of Advanced Science and Technology, The University of Tokyo

Abstract

Previous works that integrated news articles to better process stock prices used a variety of neural networks to predict price movements. The textual and price information were both encoded in the neural network, and it is therefore difficult to apply this approach in situations other than the original framework of the notoriously hard problem of price prediction. In contrast, this paper presents a method to encode the influence of news articles through a vector representation of stocks called a stock embedding. The stock embedding is acquired with a deep learning framework using both news articles and price history. Because the embedding takes the operational form of a vector, it is applicable to other financial problems besides price prediction. As one example application, we show the results of portfolio optimization using Reuters & Bloomberg headlines, producing a capital gain 2.8 times larger than that obtained with a baseline method using only stock price data. This suggests that the proposed stock embedding can leverage textual financial semantics to solve financial prediction problems.

1 Introduction

News articles influence the dynamics of financial markets. For example, after the release of breaking news, the share prices of related stocks are often observed to move. This suggests the possibility of using natural language processing (NLP) to aid traders by analyzing this influence between news article texts and prices.

Recent studies (Ding et al., 2015; Hu et al., 2018; Chen et al., 2019; Yang et al., 2018) have indeed reported that news articles can be leveraged to improve the accuracy of predicting stock price movements. These previous works have used deep learning techniques. They train neural networks with article texts and financial market prices, attempting to improve price prediction. In these approaches, the overall mutual effect between texts and prices is distributed over the neural network, which makes it difficult to extract this effect and apply it to tasks other than price prediction.

Therefore, we take a new approach by explicitly describing this mutual effect in terms of a vector. A stock is represented by a vector so that its inner product with an embedding of a text produces a larger value when the text is more related to the stock. In the rest of the paper, we call this vector a stock embedding.

The names of stocks, such as "AAPL" (the ticker symbol for Apple Inc.), typically appear in a financial news article text. Because these names form part of the text, usual NLP techniques can be applied to acquire an embedding of a stock. Such a general textual embedding, however, does not incorporate the financial reality of stock price changes. Hence, the proposed stock embedding represents the price as well as the semantics of the text, as we acquire it by training on both news articles and stock prices. Precisely, our stock embedding is trained through a binary classification problem, namely, whether a stock price goes up or down in comparison with the previous day's price. As a result, an acquired stock embedding captures the relation between a stock name and a news article even when the article has no direct mention of the stock. Our stock embedding can be considered as one technique to specialize, or ground, a symbol that has a practical reality outside of text.

Furthermore, two major advantages come with the vector form of our stock embedding. The first is that the training can be carried out for all stocks at once, rather than stock by stock. This is an important advantage to alleviate data sparseness and prevent overfitting, as discussed in Section 4.

The second advantage lies in the portability of a vector. In contrast to previous works, in which stock-specific information was distributed among the parameters of a neural network, a vector representing all the characteristics of a stock is much easier to extract and apply to other uses besides price prediction.

Hence, this paper shows an example of portfolio optimization, one of the most important applications in finance. To the best of our knowledge, this is the first report of incorporating NLP into modern portfolio theory (Markowitz, 1952). Our method differs from previous works that used NLP to enhance investment strategies. Many previous works focused on stock price forecasting only (Ding et al., 2015; Hu et al., 2018) and did not attempt to apply the learned results to other financial tasks. Another previous work (Song et al., 2017) investigated portfolios with texts. It obtained a ranking of stocks from texts by using a neural network technique and then evaluated investment in the highest/lowest ranked stocks. That work was not based on modern portfolio theory, however, nor did it integrate price and text data. In contrast, our method uses NLP in addition to price data to acquire a general representation in the form of an embedding applicable to different targets. In our experiments, a portfolio generated using stock embeddings achieved an annual gain 2.8 times greater than that of a portfolio generated with price data only. This provides evidence that the stock embedding well encodes both text and price information.

2 Related Work

The main idea of this article is based on important techniques of NLP. It is now common to represent discrete entities in natural language by continuous vectors. These vectors are called "embeddings" and are usually obtained from neural network models. Examples include the word embedding (Mikolov et al., 2013), phrase embedding (Zhang et al., 2014), sentence embedding (Lin et al., 2017), and event embedding (Ding et al., 2016).

One advantage of these continuous representations is that the geometry of an embedding system contains rich semantic information, as has been discovered at many levels (Mikolov et al., 2013; Reif et al., 2019). The acquisition of stock embeddings in this paper is based on the original idea developed for linguistic entities. Here, we extend the idea further so that the embeddings reflect the reality of a stock market outside text.

A stock embedding is trained using the attention mechanism (Bahdanau et al., 2015), which is another current NLP technique. The basic idea of the original attention mechanism is to assign higher weights to more relevant word vectors and make the weights adaptive to different contexts.

Our framework is based on the classification task for text-driven stock price movement, which has been studied intensely as follows. Early research on exploiting financial news articles for better stock price prediction dates back to Ou and Penman (1989), in which financial indicators were extracted manually from financial statements. Later, in Fung et al. (2002), NLP methods were adopted for automatic text feature extraction. Since the 2000s, Twitter and other text-centered social media platforms have become essential sources of financial signals. Bollen et al. (2011) found evidence for causality between the public mood extracted from tweets and the Dow Jones Industrial Average index. In Nguyen et al. (2015), post texts collected from the Yahoo! Finance Message Board were used to predict whether the prices of 18 US stocks would rise or drop on the next trading day.

As deep learning methods for NLP have become more common, many papers have reported the use of neural networks for text-driven stock classification (or prediction) tasks. Ding et al. (2015) proposed an event embedding to represent a news headline with a vector and used a convolutional neural network for classification. In that work, all the event embeddings of news articles published on the same day were simply averaged to summarize that day's market information.

Hu et al. (2018) was among the first works that applied the attention mechanism to the task of news-driven stock price movement classification. They developed a dual-level attention framework, in which news articles were assigned different weights depending on the output of a logistic regression component with a bias term, so that the most informative news articles were "highlighted." The method of weighting news articles in this paper is similar to that previous work. The stock-specific information in Hu et al. (2018) was encoded in the neural network, however, making it focused on the price prediction task. In contrast, we represent such stock-specific information by the stock embedding, i.e., a vector, which is easy to interpret geometrically and extract for other applications.

For one such application, we evaluated our stock embedding in terms of portfolio optimization. To the best of our knowledge, this is the first paper applying NLP techniques to modern portfolio theory. We use the mean-variance minimization portfolio model (introduced in Section 7) proposed in Markowitz (1952), which directly led to the capital asset pricing model (Sharpe, 1964).

3 News-Driven Stock Price Classification

In this paper, the stock embedding is trained with a deep learning system through binary classification of price movements. Let p_t be the stock price on day t, and let y_t be the desired output of the system. Here, t ∈ {1, 2, ..., T}, and T is the number of trading days in the considered time period. The binary classification problem indicates that y_t is classified in the following way:

    y_t = \begin{cases} 1, & p_t \geq p_{t-1} \\ 0, & p_t < p_{t-1} \end{cases}        (1)
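As a minimal sketch of this labeling rule (in Python with NumPy; the function and variable names are ours, for illustration only):

    import numpy as np

    def binary_labels(prices):
        # y_t = 1 if p_t >= p_{t-1}, else 0, following formula (1).
        prices = np.asarray(prices, dtype=float)
        return (prices[1:] >= prices[:-1]).astype(int)

    # Five closing prices yield four labels: [1 1 0 1]
    print(binary_labels([100.0, 101.5, 101.5, 99.8, 102.0]))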

To train such a deep learning system, news articles are used as the input. In this work, news articles are considered daily (i.e., treated in units of days). We denote the set of articles published on day t by N_t, and each article by n_i ∈ N_t, with i = 1, ..., |N_t|. This paper considers a time window around day t, denoted as [t − d_1, t + d_2] given two constants d_1, d_2. Let N_{[t−d_1, t+d_2]} be the set of news articles published within the time window.

When d_2 = −1, indicating the use of articles until day t − 1, the task is called prediction, as the training does not use any articles published on or after day t. In general, this task is acknowledged as very hard (Fama, 1970; Basu, 1977; Timmermann and Granger, 2004) according to the efficient-market hypothesis (EMH)¹, and such prediction provides only a limited gain, if any. Note that previous NLP studies concerning stock prices were all aimed at this hard problem (Ding et al., 2015; Hu et al., 2018; Xu and Cohen, 2018; Yang et al., 2018).

On the other hand, when d_2 ≥ 0, this paper refers to the task as classification. The performance on classification shows how well the model understands a news article. Because the prediction problem is too hard and offers limited gain, as shown by many previous works, our target lies in classification. The aims are thus to acquire embeddings that are highly sensitive to textual context and to apply them to tasks other than price prediction. Therefore, in this paper, we set d_1 = 4 and d_2 = 0.

¹ According to the EMH, in an "efficient" market, prices reflect the true values of assets by having incorporated all past information, so nobody can predict the price. The EMH is hypothesized to hold but has also attracted criticism.

Let the classification model be represented by a mapping f. The probability that the price of a stock j, where j = 1, ..., J, goes up on day t is

    \hat{y}_t^j = f(N_{[t-4,t]}).        (2)

In the process of model optimization, the model should reduce the mean cross-entropy loss between every true label y_t^j and its corresponding estimate \hat{y}_t^j, as follows:

    l^j = -\frac{1}{T} \sum_{t=1}^{T} \left( y_t^j \log \hat{y}_t^j + (1 - y_t^j) \log(1 - \hat{y}_t^j) \right).

This function describes the loss for only one stock, but a stock market includes multiple stocks. This work considers all stocks in a market equally important. The overall loss function is therefore a simple average of the cross-entropy loss for all stocks, i.e., l = (\sum_{j=1}^{J} l^j)/J.

4 Method to Acquire Stock Embeddings

Let s_j represent a stock embedding, where j = 1, 2, ..., J. This is initialized as a random vector and then trained via a neural model to obtain s_j, whose inner product with the embedding of a related text becomes large. This section describes the proposed method to acquire stock embeddings by building up a neural network for price movement classification.

The neural network consists of two parts: a text feature distiller and a price movement classifier.

Text feature distiller. The text feature distiller first converts every news article n_i into a pair of vectors (n_i^K, n_i^V) corresponding to "key" and "value" vectors, respectively. Let N_t^K = {n_i^K}_t and N_t^V = {n_i^V}_t denote the sets of key/value vectors of the articles released on day t. Such a dual-vector representation of a text was proposed and adopted successfully in Miller et al. (2016) and Daniluk et al. (2017). The pair of vectors contains the semantic information of the article text at two different levels. Roughly, n_i^K represents the article at the word level, whereas n_i^V represents it at the context level.

The text feature distiller calculates the attention score for every article i published on day t. The attention score between article i and stock j is given by the inner product of the two vectors n_i^K and s_j:

    \mathrm{score}_{i,j} = n_i^K \cdot s_j.

Note that there are other possible definitions of this inner product, such as the cosine similarity or a generalized inner product using some arbitrary function. Because this work focuses on the most basic capability of the stock embedding, it uses the most basic inner product (i.e., the dot product).

Let α_i^j denote the weight put on news article i with respect to stock j, to classify whether the stock price will go up or down. With the use of score_{i,j} defined above, α_i^j is given as the following:

    \alpha_i^j \equiv \frac{\exp(\mathrm{score}_{i,j})}{\sum_{i'} \exp(\mathrm{score}_{i',j})}.

α_i^j is thus acquired as the softmax function of the scores across the articles released on the same day.

By using α_i^j as the weights put on news articles, we compute the market status of stock j on day t as the following, which is the input to the classifier:

    m_t^j = \sum_{n_i^V \in N_t^V} \alpha_i^j \, n_i^V.        (3)

Therefore, m_t^j is computed over a set of n_i^V, representing the context of texts on day t. We call m_t^j the market vector, to which we will return in Section 6.
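Putting the score, the softmax, and formula (3) together, here is a sketch of the computation in PyTorch (the dimensions are illustrative assumptions consistent with Section 5.2):

    import torch

    def market_vector(stock_emb, keys, values):
        # stock_emb: (d_k,) stock embedding s_j
        # keys:   (n, d_k) key vectors n_i^K of the day's articles
        # values: (n, d_v) value vectors n_i^V of the same articles
        scores = keys @ stock_emb             # score_{i,j} = n_i^K . s_j
        alpha = torch.softmax(scores, dim=0)  # weights alpha_i^j over the day's articles
        return alpha @ values                 # m_t^j, the weighted sum of value vectors

    # e.g., 10 articles, 64-d keys, 256-d values:
    m = market_vector(torch.randn(64), torch.randn(10, 64), torch.randn(10, 256))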

Price movement classifier. The input of the price movement classifier is a sequence of vectors, M^j_{[t-4,t]} = [m^j_{t-4}, m^j_{t-3}, ..., m^j_t], with respect to stock j. This is processed by a recurrent neural network using a bidirectional gated recurrent unit (Bi-GRU). The choice of a Bi-GRU was made by considering the model capacity and training difficulty. The classifier estimates the probability \hat{y}_t^j:

    h_t^O = \mathrm{GRU}(M^j_{[t-4,t]}), \qquad \hat{y}_t^j = \sigma(\mathrm{MLP}(h_t^O)),        (4)

where σ(x) = 1/(1 + exp(−x)), and GRU and MLP stand for the Bi-GRU and a multilayer perceptron, respectively. An optional re-weighting technique over the GRU's output vectors h_τ^O (τ ∈ [t − 4, t]) (Hu et al., 2018) can be applied. In this case, after the first line of formula (4), the re-weighting is conducted in the following way:

    h^O = \sum_{\tau = t-4}^{t} \beta_\tau h_\tau^O,

and this h^O becomes the input of the second line instead of h_t^O. Here, β_τ, the weight for day τ, decides how much one day is considered in the classification. In our implementation,

    \beta_\tau = \frac{\exp(v_{\tau-t} \cdot h_\tau^O)}{\sum_{\xi=-4}^{0} \exp(v_\xi \cdot h_{t+\xi}^O)},

where the vector v_ξ differentiates the temporal effects of news articles released around day t. v_ξ is initialized randomly and trained via the neural network. See Hu et al. (2018) for the details.
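A minimal sketch of such a shared classifier in PyTorch (the layer sizes are our assumptions, and the optional day re-weighting is omitted for brevity):

    import torch
    import torch.nn as nn

    class SharedPriceClassifier(nn.Module):
        # One Bi-GRU + MLP shared by every stock (the classifier sharing described below).
        def __init__(self, d_market=256, d_hidden=64):
            super().__init__()
            self.gru = nn.GRU(d_market, d_hidden, batch_first=True, bidirectional=True)
            self.mlp = nn.Sequential(nn.Linear(2 * d_hidden, d_hidden), nn.ReLU(),
                                     nn.Linear(d_hidden, 1))

        def forward(self, m_seq):
            # m_seq: (batch, 5, d_market), the market vectors m^j_{t-4}..m^j_t of any stock j.
            out, _ = self.gru(m_seq)   # (batch, 5, 2 * d_hidden)
            h_last = out[:, -1, :]     # h^O_t, the output at the last step
            return torch.sigmoid(self.mlp(h_last)).squeeze(-1)  # estimated probability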

This formulation of neural network training has the advantage of avoiding overfitting. A common problem in the task of stock movement classification or prediction is small sample sizes, especially when adopting units of days. In contrast, the proposed model does not suffer from small sample sizes, because the price movement classifier can be trained across all the stocks by sharing one classifier, rather than by generating one classifier for each individual stock as in many previous works (Ding et al., 2015; Hu et al., 2018; Xu and Cohen, 2018). We call this a classifier sharing mechanism.

Figure 1 illustrates the difference between models with and without classifier sharing. The upper figure (a) shows the conventional setting without sharing, in which J classifiers are generated, one for each stock. In contrast, the lower figure (b) shows one classifier generated for all stocks. This setting enables learning of the correlation among stocks, in addition to avoiding overfitting and the problem of small sample sizes. Specifically, the classifier is shared across all stocks, thus achieving a sample size about 50 to 100 times larger.

Figure 1: Illustration of the classifier sharing mechanism across stocks on day t: (a) one independent classifier per stock, and (b) a shared classifier across stocks. |N_t| denotes the number of news articles on day t.

5 Dataset and Settings to Acquire Stock Embeddings

5.1 Dataset

We used two news article datasets to build stock embeddings: the Wall Street Journal (WSJ, in the following) dataset and the Reuters & Bloomberg (R&B) dataset², as listed in Table 1. WSJ contains around 400,000 news headlines published across 16 years, whereas R&B contains around 550,000 articles across 7 years. Compared with R&B, WSJ has a relatively more uniform distribution of news articles across time (see the standard deviations listed in parentheses in the fifth column of Table 1). Following previous studies reporting that the main body of a news text produces irrelevant noise (Ding et al., 2015), we extracted only the headlines in both datasets.

Dataset                      Period           # of articles  # of days /    Mean # of articles  Mean # of words
                                                             trading days   per day (std)       per article (std)
Wall Street Journal (WSJ)    2000/01-2015/12  403,207        5,649/4,008    71.4 (41.1)         28.5 (9.0)
Reuters & Bloomberg (R&B)    2006/10-2013/11  551,479        2,605/1,794    211.7 (257.1)       9.34 (1.73)

Table 1: Basic information on the two news article datasets.

As for the stocks, we selected two subsets of the stocks in Standard & Poor's S&P 500 index, one for each of the WSJ and R&B datasets. These subsets consisted only of stocks that were mentioned in no fewer than 100 different news articles, so that mutual effects between the articles and the price history would appear sufficiently often in the texts. More importantly, this ensured that keyword retrieval-based methods that locate related articles by explicit keyword matching could be applied for comparison. For the WSJ and R&B datasets, the subsets had 89 and 50 stocks, respectively. All other stocks were removed from consideration.

As seen in formula (2), the input for the neural network is N_{[t-4,t]}, the set of articles around day t, and the output is \hat{y}_t^j. The label y_t^j is the binarized price movement of stock j at day t. This is measured by the log-return between two subsequent days:

    \mathrm{log\,return}_t^j = \log p_t^j - \log p_{t-1}^j.

The distribution of log-returns is typically bell shaped with a center close to 0, as also mentioned in Hu et al. (2018). The return values of the days were separated into three categories of "negative," "ambiguous," and "positive" by the use of thresholds³. Here, "ambiguous" refers to those samples close to 0.0, which were removed. Thus, by using only the clearly negative and positive days, the returns were binarized.

² This dataset was made open source in Ding et al. (2015).

³ We used the thresholds [−0.0053, 0.0079] for the WSJ dataset and [−0.0059, 0.0068] for the R&B dataset. The margins were asymmetric around 0 because these datasets had slightly more "rising" days than "declining" ones.

Through such filtering, the number of samples for each stock became about two-thirds of the number of all trading days, or around⁴ 2,600 and 1,200 samples for each stock, for the WSJ and R&B datasets, respectively.
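As a sketch of this binarization step (Python/NumPy; the default thresholds below are the WSJ values quoted in footnote 3):

    import numpy as np

    def binarize_returns(prices, lo=-0.0053, hi=0.0079):
        # Compute log-returns, drop "ambiguous" days near 0, binarize the rest.
        p = np.asarray(prices, dtype=float)
        r = np.diff(np.log(p))            # log return_t = log p_t - log p_{t-1}
        keep = (r <= lo) | (r >= hi)      # remove samples close to 0.0
        labels = (r[keep] >= hi).astype(int)
        return labels, keep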

5.2 Deep Learner System Settings

The Adam optimizer (Kingma and Ba, 2015) was used with cosine annealing (Loshchilov and Hutter, 2017) to train the neural network. The initial learning rate was set to 5e-4. The mini-batch size was 64. We stopped the training process when the value of the loss function with respect to the validation set no longer dropped, and then we measured the accuracy on the test set for evaluation.

As for the dual-vector representation of news article texts, introduced in Section 4, the key and value vectors were calculated as described here. The key vector n_i^K is defined as follows⁵ by using word embeddings w_k acquired by Word2vec:

    n_i^K = \frac{\sum_k \gamma_k w_k}{\sum_k \gamma_k},

where γ_k = TF_k · IDF_k is the TFIDF (Manning and Schütze, 2001) score of word k. The dimension of n_i^K equals that of the Word2vec model trained on the news corpus, i.e., 64 in our implementation.
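A sketch of this TFIDF-weighted averaging (Python/NumPy; the dictionaries word_vecs and tfidf stand in for the trained Word2vec model and corpus statistics, an arrangement we assume for illustration):

    import numpy as np

    def key_vector(tokens, word_vecs, tfidf, dim=64):
        # n_i^K: TFIDF-weighted average of the headline's word embeddings.
        pairs = [(tfidf[w], word_vecs[w]) for w in tokens if w in tfidf and w in word_vecs]
        if not pairs:
            return np.zeros(dim)
        gammas = np.array([g for g, _ in pairs])   # gamma_k = TF_k * IDF_k
        vecs = np.stack([v for _, v in pairs])
        return (gammas[:, None] * vecs).sum(axis=0) / gammas.sum()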

As for the value vector n_i^V, we used vectors acquired through a BERT encoder⁶. We used the pretrained BERT model available from Google Research, with 24 layers trained on an uncased corpus. This model outputs vectors of 1024 dimensions, but we reduced the dimensions to 256 by using principal component analysis (PCA), to suppress the number of parameters in the neural network. Along with the effect of the stock embedding, the effect of the dual-vector representation (DVR) is also evaluated in the following section.

⁴ The number of samples after filtering differed slightly among stocks, because the distribution of log-returns differed, while the same thresholds were used.

⁵ We chose this method after examining several options, including the smooth inverse frequency (SIF) (Arora et al., 2017), TFIDF-weighted word embeddings, and several other methods. We found that TFIDF-weighted word embeddings with Word2vec worked best.

⁶ BERT (Bidirectional Encoder Representations from Transformers) is a neural network model (Devlin et al., 2019) that can be used to encode text into vectors with a fixed dimension.
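A sketch of producing such value vectors with the Hugging Face transformers API and scikit-learn PCA (the model name and mean pooling are our assumptions; the paper states only a pretrained 24-layer uncased BERT reduced from 1024 to 256 dimensions):

    import torch
    from sklearn.decomposition import PCA
    from transformers import AutoModel, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("bert-large-uncased")     # 24 layers, 1024-d output
    bert = AutoModel.from_pretrained("bert-large-uncased").eval()

    def encode(headlines):
        batch = tok(headlines, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            out = bert(**batch).last_hidden_state   # (n, seq_len, 1024)
        return out.mean(dim=1).numpy()              # mean-pool each headline to 1024-d

    # After encoding the whole corpus, reduce to 256 dimensions as in the paper:
    # value_vecs = PCA(n_components=256).fit_transform(all_headline_vecs)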

6 Effect of Stock Embedding on Price Movement Classification

The basic effect of the stock embedding was evaluated through the performance on the price movement classification task, as stated in Section 3.

The whole dataset described in Section 5.1 was randomly divided into nonoverlapping training/validation/test sets in the ratios of 0.6/0.2/0.2. The training/validation/test parts did not share any samples from the same dates. Every method below was tested for 10 different random divisions, and the average performance is reported here.

The proposed model is abbreviated as WA+CS+DVR, for weighted average with classifier sharing and dual-vector representation. For an ablation test, four models were considered, which varied the market vector of the day (defined in formula (3) in Section 4 (Ding et al., 2015)) and were with or without the dual-vector representation and classifier sharing (Ding et al., 2015; Hu et al., 2018; Xu and Cohen, 2018; Yang et al., 2018), as follows.

Simple average: The simple average of the text representations of the same day is taken as the market vector of the day, as proposed by Ding et al. (2015).

Weighted average (WA): As stated in formula (3), the market vector of the day is averaged by using the weights from the stock-text inner products, as proposed in Hu et al. (2018). Note again that their work did not apply classifier sharing but instead produced one classifier for each stock, nor did it adopt the dual-vector representation.

WA + classifier sharing (CS): This refers to WA with classifier sharing across stocks. This variant does not adopt the dual-vector representation, i.e., n_i^K is set equal to n_i^V for every news article i. Thus, the same BERT text embedding is used for both n_i^K and n_i^V.

WA + dual-vector representation (DVR): This refers to WA with the dual-vector representation of news texts. This variant does not adopt classifier sharing.

Furthermore, to examine the effect of the data size, we tested different dataset portions: 1 year, 3 years, and the whole dataset. Therefore, the experimental variants involved five methods (four comparison methods + our proposal) and three data sizes, or a total of 15 experiments.

Figure 2: Mean classification accuracy percentages (with SD in parentheses) over 10 replications.

Figure 2 summarizes the complete experimental results. The uppermost bar of each bar group, in red, corresponds to our model with classifier sharing (CS) and the dual-vector representation (DVR). The other bars, in orange, blue, purple, and green, correspond to the four ablation variants. The ablation datasets with only 1-year data contained around 150 training samples and were too small for most variants to work well, yet our proposed model, WA+CS+DVR, could still obtain positive results (classification accuracy over 50%). With the 3-year datasets, our WA+CS+DVR model widened the performance gap, whereas the simple average and weighted average models still failed to work better than random guessing. These results show the superiority of our model in handling the overfitting problem with small datasets.

Finally, the significant differences between WA+CS+DVR (in red) and WA+CS (in blue) and between WA+DVR (in orange) and WA (in purple) strongly supported the advantage of adopting the dual-vector representation (DVR), especially when classifier sharing was combined.

Overall, our model successfully achieved 68.8% accuracy for the R&B dataset, which was significantly better than any of the other four variants.

7 Portfolio Optimization

Thus far, the evaluation on classification has shown the capability of our framework in understanding news articles. For financial applications, however, the task must be in the form of prediction; that is, it must produce some gain ahead of the time when a news article is published. As one such predictive example, we present portfolio optimization, one of the most important financial tasks, and we show how our stock embedding can be applied to it.

A portfolio is essentially a set of weights assigned to stocks, representing the proportions of capital invested in them. Intuitively, a portfolio bears a bigger risk if a large proportion is invested in two highly positively correlated stocks, rather than two uncorrelated or negatively correlated stocks. Based on this idea, the mean-variance minimization model in Markowitz (1952) is formulated as follows:

    \min_w \; \mathrm{risk} = w^T \Sigma w        (5a)
    \text{subject to } w^T r = E,                 (5b)
                       w^T \mathbf{1} = 1,        (5c)
                       0 \le w_j \le 1, \quad j = 1, ..., J,        (5d)

where Σ is the risk matrix; w is the vector of investment weights; r is a vector such that r_j equals the mean historic return of stock j; 1 is a vector of ones; and E, the expected portfolio return, is a parameter decided by an investor's preference. Note that higher E usually means higher risk borne by the investor.
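Given any Σ, problem (5) is a small quadratic program; here is a sketch using SciPy (Σ, r, and E would come from the definitions discussed in this section):

    import numpy as np
    from scipy.optimize import minimize

    def min_risk_portfolio(Sigma, r, E):
        # Minimize w^T Sigma w subject to w^T r = E, w^T 1 = 1, 0 <= w_j <= 1.
        J = len(r)
        constraints = [{"type": "eq", "fun": lambda w: w @ r - E},      # (5b)
                       {"type": "eq", "fun": lambda w: w.sum() - 1.0}]  # (5c)
        result = minimize(lambda w: w @ Sigma @ w, np.full(J, 1.0 / J),
                          method="SLSQP", bounds=[(0.0, 1.0)] * J,      # (5d)
                          constraints=constraints)
        return result.x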

In the original model of Markowitz, Σ is the covariance matrix of the historic return time series of stocks, Σ_ij = Cov({r_i}_t, {r_j}_t) (i, j ∈ {1, ..., J}). According to Markowitz (1952), the solution of this optimization problem, which can be obtained via quadratic programming, gives the portfolio with the smallest risk for an expected overall return E.

Using the covariance matrix as the risk matrix Σ is limited, however, for two reasons. First, the overwhelming noise in price movements prevents accurate estimation of the covariance. More importantly, it ignores the events described in news articles that indeed cause price movements.

On the other hand, the stock embeddings built here provide much more abundant textual information for defining Σ. Concretely,

    \Sigma_{i,j} = \cos(s_i, s_j).

This should work because the stock embedding reflects a stock's responsiveness to a certain class of news events. In other words, close stock embeddings indicate a correlated response pattern to an event described in news articles. Stock embeddings capture this correlation much better than the covariance matrix does, and this correlation is what a good portfolio relies on.
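Computing this text-based risk matrix is straightforward; a sketch (S is a J × d NumPy array stacking the learned stock embeddings row-wise, an arrangement we assume for illustration):

    import numpy as np

    def cosine_risk_matrix(S):
        # Sigma_{i,j} = cos(s_i, s_j) for stock embeddings s_1, ..., s_J.
        U = S / np.linalg.norm(S, axis=1, keepdims=True)
        return U @ U.T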

By solving the same optimization problem but with a different matrix Σ, we get another vector of investment ratios, w, with respect to the stocks. By virtually investing according to w and observing the result within a certain period, Σ can be evaluated. For each of the WSJ and R&B datasets, we ran one investment simulation for various definitions of Σ, as follows.

S&P 500 index: As a market baseline, we used an S&P 500 index portfolio, in which all 505 stocks in the index were considered and the investment weight w_j was in proportion to the market capitalization of stock j. The price history of the portfolio was provided by Dow Jones. This method did not use Σ to form the portfolio.

S&P 89*/50*: This approach was the same as above but with the set of stocks reduced to those tested in our work, as explained in Section 5.1: 89 stocks for the WSJ dataset⁷, and 50 for the R&B dataset.

⁷ The S&P 89* portfolio was evaluated during the period of 2001 to 2016. The market capitalization history of the stocks before the year 2005 is not available, so the record was estimated for this missing period. First, the number of shares outstanding was extrapolated from the data of 2005-2016, in which the values were pretty stable during the whole period. The market capitalization was then acquired by multiplying the price by the shares outstanding.

Covariance matrix of historic stock returns: Σ was the covariance matrix as originally proposed by Markowitz.

Word2vec-general: (text only) Σ was the cosine matrix of the word embeddings trained on general corpora (fastText word embeddings (Bojanowski et al., 2017) were used in our experiments). For each stock, we used the word embedding of its ticker symbol, e.g., the word embedding of "AAPL" for Apple Inc.

Word2vec-news: (text only) Σ was the cosine matrix of the word embedding vectors trained on news text corpora. We used the full text of the R&B dataset for training, in which all mentions of a stock in the text were replaced by the stock's ticker symbol.

Covariance · stock embedding: (text and price) Σ was the result of element-wise multiplication of the covariance matrix and the cosine matrix of the stock embeddings.

Weighted BERT: (text only) Σ was the cosine matrix of stock vectors acquired as follows, where the BERT-based text representation n_i^V was used. For a stock j, the vector was obtained as a weighted average of the n_i^V for which the text mentioned the stock or company. Here, the weight of article i was defined as follows:

    \eta_i \equiv \frac{\#\text{ of mentions of } j \text{ in } i}{\#\text{ of mentions of all stocks in } i}.

Stock embeddings: Σ was the cosine matrix of the stock embeddings.

Figure 3: The process of portfolio generation and evaluation over several years. The vertical axis indicates the times when the portfolio is renewed, and the horizontal axis indicates the data grouped yearly. An average of the realized annual gains is computed to evaluate the portfolio's performance.

The portfolio evaluation was conducted in a yearly setting, as illustrated in Figure 3. At the beginning of each year, given some expected gain E, the portfolio was computed by using all news articles and historic prices until the end of the previous year. In other words, for each year, the training set in the experiment consisted of samples strictly earlier than those constituting the test set. Therefore, the evaluation was conducted in a prediction setting. Then, investments were made according to the yearly renewed portfolio as in Figure 3; that is, capital was allocated to stocks according to w. The realized annual gain of the portfolio followed this equation:

    \text{annual gain} = \sum_{j=1}^{J} w_j \left( \frac{p^j_{\text{end-of-year}}}{p^j_{\text{begin-of-year}}} - 1 \right),

where w_j is the proportion of investment in stock j, and p^j is the price of j.

In this way, for each of the WSJ and R&B, we obtained results over 16 and 7 years, respectively. For different expected gains E ∈ {0.05, 0.06, ..., 0.29}, which cover typical cases in real-world portfolio construction, the average annual gain was computed.
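A sketch of this yearly loop (the helpers sigma_until, returns_until, and solve_portfolio are our placeholders for the risk-matrix construction and the solver sketched earlier in this section):

    import numpy as np

    def average_annual_gain(years, prices_by_year, sigma_until, returns_until,
                            solve_portfolio, E=0.10):
        gains = []
        for year in years:
            # Build Sigma and r strictly from data before this year (no look-ahead).
            w = solve_portfolio(sigma_until(year), returns_until(year), E)
            p_begin, p_end = prices_by_year[year]   # per-stock begin/end-of-year prices
            gains.append(np.sum(w * (p_end / p_begin - 1.0)))
        return float(np.mean(gains))                # average realized annual gain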

Figure 4: Expected and realized annual portfolio gain in the investment simulations on both datasets: (a) results on the WSJ dataset (2000-2015), and (b) results on the R&B dataset (2006-2013).

Figure 4 shows the experimental results. The upper graphs show the annual gain with respect to different values of E (horizontal axes) for (a) the WSJ and (b) the R&B, averaged over years. Every curve corresponds to a different definition of Σ. It can be seen that the proposed stock embedding method outperformed the other methods, except for larger E with WSJ⁸. Especially for the R&B dataset, stock embedding greatly outperformed all other methods at all E.

The lower bar graph summarizes the overall aggregate gain for each method. The values in the bars indicate the average realized annual gains, while those above the bars are the ratios of the gains in comparison with that of the standard covariance method (in blue). The leftmost two bars in each bar graph show the gains of the S&P 500 portfolio and the S&P 89*/50* portfolio, respectively. As described above, the S&P 500 portfolio consisted of an average of around 500 stocks traded in the US, while the S&P 89*/50* portfolio, which was calculated with the same method but on a smaller set of stocks (89 for the WSJ, and 50 for the R&B), achieved higher gains than its S&P 500 sibling did. The values of the S&P portfolios generally went up during the periods of both datasets, and therefore, the gains were positive.

The dashed horizontal line in each bar graph indicates the result for the standard covariance method as a baseline. Its gains were only 12.5% and 12.7% for the WSJ and R&B, respectively, but with stock embeddings, the gains increased to 17.2% and 35.5%, or 1.37 and 2.80 times greater than the baseline results, respectively. This performance largely beat all other variants and gives evidence of how well the stock embeddings integrated both price and textual information.

⁸ Our method did not perform well only for large E. The mean-variance minimization model has been reported to become unstable under the two conditions of large E and low overall market gain (Dai and Wang, 2019). The return of the WSJ period (2000-2015) was lower than that of the R&B period (2006-2013), and therefore, these two conditions were more likely to be met for WSJ.

The results for the method that integrated the covariance matrix and stock embedding (in green) did not much outperform the baselines. A possible reason is that the stock embedding had already integrated the price information. As for the other variants based on pure text (in purple, orange, and brown), the results improved slightly. Among them, weighted BERT outperformed the other methods for both datasets. This indicates the potential of BERT and other recent neural language models for portfolio optimization.

8 Conclusion

This paper has proposed the idea of a stock embedding, a vector representation of a stock in a financial market. A method was formulated to acquire such vectors from stock price history and news articles by using a neural network framework. In the framework, the stock embedding detects news articles that are related to the stock, which is the essence of the proposed method. We trained stock embeddings for the task of binary classification of stock price movements on two different datasets, the WSJ and R&B. The improvements in classification accuracy with our framework, due to the classifier sharing and dual-vector text representation proposed in this paper, implied that the stock embeddings successfully incorporated market knowledge from both the news articles and price history.

Because the stock embedding is a vector that can be separated from the other components of the classification model, it can be applied to other tasks besides price movement classification. As an example, we showed the use of stock embeddings in a portfolio optimization task by replacing the risk matrix in the portfolio objective function with a cosine matrix of stock embeddings. In investment simulations on the R&B dataset, our stock embedding method generated 2.80 times the annual return obtained using the covariance matrix of the historic return series. This significant gain suggests further potential of our stock embedding for modeling the correlation among stocks in a financial market, and for further applications, such as risk control and asset pricing.

Acknowledgments

We sincerely thank the anonymous reviewers for their comments. This paper was supported by the Research Institute of Science and Technology for Society (HITE 17942497), and by the University of Tokyo Gap Fund Program. The paper reflects the view of the authors only.


References

Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2017. A simple but tough-to-beat baseline for sentence embeddings. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

Sanjoy Basu. 1977. Investment performance of common stocks in relation to their price-earnings ratios: A test of the efficient market hypothesis. The Journal of Finance, 32(3):663–682.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Johan Bollen, Huina Mao, and Xiao-Jun Zeng. 2011. Twitter mood predicts the stock market. Journal of Computational Science, 2(1):1–8.

Deli Chen, Yanyan Zou, Keiko Harimoto, Ruihan Bao, Xuancheng Ren, and Xu Sun. 2019. Incorporating fine-grained events in stock movement prediction. In Proceedings of the Second Workshop on Economics and Natural Language Processing, pages 31–40, Hong Kong. Association for Computational Linguistics.

Zhifeng Dai and Fei Wang. 2019. Sparse and robust mean–variance portfolio optimization problems. Physica A: Statistical Mechanics and its Applications, 523:1371–1378.

Michal Daniluk, Tim Rocktäschel, Johannes Welbl, and Sebastian Riedel. 2017. Frustratingly short attention spans in neural language modeling. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Xiao Ding, Yue Zhang, Ting Liu, and Junwen Duan. 2015. Deep learning for event-driven stock prediction. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25-31, 2015, pages 2327–2333.

Xiao Ding, Yue Zhang, Ting Liu, and Junwen Duan. 2016. Knowledge-driven event embedding for stock prediction. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 2133–2142, Osaka, Japan. The COLING 2016 Organizing Committee.

Eugene F. Fama. 1970. Efficient capital markets: A review of theory and empirical work. The Journal of Finance, 25(2):383–417.

Gabriel Pui Cheong Fung, Jeffrey Xu Yu, and Wai Lam. 2002. News sensitive stock trend prediction. In Advances in Knowledge Discovery and Data Mining, 6th Pacific-Asia Conference, PAKDD 2002, Taipei, Taiwan, May 6-8, 2002, Proceedings, pages 481–493.

Ziniu Hu, Weiqing Liu, Jiang Bian, Xuanzhe Liu, and Tie-Yan Liu. 2018. Listening to chaotic whispers: A deep learning framework for news-oriented stock trend prediction. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, WSDM 2018, Marina Del Rey, CA, USA, February 5-9, 2018, pages 261–269.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

Zhouhan Lin, Minwei Feng, Cícero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. 2017. A structured self-attentive sentence embedding. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.

Ilya Loshchilov and Frank Hutter. 2017. SGDR: Stochastic gradient descent with warm restarts. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.

Christopher D. Manning and Hinrich Schütze. 2001. Foundations of Statistical Natural Language Processing. MIT Press.

Harry Markowitz. 1952. Portfolio selection. The Journal of Finance, 7(1):77–91.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, pages 3111–3119.

Alexander Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, and Jason Weston. 2016. Key-value memory networks for directly reading documents. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1400–1409, Austin, Texas. Association for Computational Linguistics.

Thien Hai Nguyen, Kiyoaki Shirai, and Julien Velcin. 2015. Sentiment analysis on social media for stock movement prediction. Expert Systems with Applications, 42(24):9603–9611.

Jane A. Ou and Stephen H. Penman. 1989. Financial statement analysis and the prediction of stock returns. Journal of Accounting and Economics, 11(4):295–329.

Emily Reif, Ann Yuan, Martin Wattenberg, Fernanda B. Viégas, Andy Coenen, Adam Pearce, and Been Kim. 2019. Visualizing and measuring the geometry of BERT. In Advances in Neural Information Processing Systems 32, pages 8592–8600. Curran Associates, Inc.

William F. Sharpe. 1964. Capital asset prices: A theory of market equilibrium under conditions of risk. The Journal of Finance, 19(3):425–442.

Qiang Song, Anqi Liu, and Steve Y. Yang. 2017. Stock portfolio selection using learning-to-rank algorithms with news sentiment. Neurocomputing, 264:20–28.

Allan Timmermann and Clive W. J. Granger. 2004. Efficient market hypothesis and forecasting. International Journal of Forecasting, 20(1):15–27.

Yumo Xu and Shay B. Cohen. 2018. Stock movement prediction from tweets and historical prices. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1970–1979, Melbourne, Australia. Association for Computational Linguistics.

Linyi Yang, Zheng Zhang, Su Xiong, Lirui Wei, James Ng, Lina Xu, and Ruihai Dong. 2018. Explainable text-driven neural network for stock prediction. In 5th IEEE International Conference on Cloud Computing and Intelligence Systems, CCIS 2018, Nanjing, China, November 23-25, 2018, pages 441–445.

Jiajun Zhang, Shujie Liu, Mu Li, Ming Zhou, and Chengqing Zong. 2014. Bilingually-constrained phrase embeddings for machine translation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 111–121, Baltimore, Maryland. Association for Computational Linguistics.