[ieee 2014 ieee conference on computational intelligence for financial engineering & economics...

8
Diversification Improvements Through News Article Co-occurrences John Robert Yaros and Tomasz Imieli´ nski Department of Computer Science Rutgers University Piscataway, NJ 08854-8019 Email: {yaros,imielins}@cs.rutgers.edu Abstract—Intuition suggests that a set of companies mentioned in the same news article are more likely to be related than unrelated. For instance, an article discussing a retailer would more probably mention its competitors or supply chain partners than mention other companies with no economic connection. Correspondingly, we consider using news article co-occurrences as a means to determine company relatedness. We show that companies mentioned together frequently are more likely to have higher future stock-return correlation, and consider using this data source as a means to achieve portfolio diversification by avoiding having pairs of related companies in the portfolio. We find this approach reduces risk and can be used to improve standard approaches to diversification that use expert-defined industry taxonomies, seeking to avoid portfolio concentration in any given economic sector. I. I NTRODUCTION Two main techniques in risk management are hedging and diversification. With hedging, risk in an investment is reduced by making additional investments that should generate a time series of returns that is the opposite of the original investment. That is, hedging can be achieved by finding assets with greatest anti-correlation, or, alternatively, by finding assets with greatest correlation and using those assets for short sales. Conversely, diversification reduces risk by investing in assets with correlation closest to zero. If a single asset in a portfolio loses value, the likelihood of losses in the other assets remains constant, whereas if they were correlated, the likelihood of simultaneous losses is increased. In equity investing, diversification is often achieved by investing in stocks in different “sectors” or “industries.” 1 This approach matches the old proverb “don’t put all your eggs in one basket,” since investments will be spread across different areas of the economy. It is well-established that such an approach will help to reduce risk [1]. Furthermore, industry diversification has increased in importance as other means of diversification have weakened, such as country diversification, due the integration of the global economy [2], [3], [4]. At the same time, what defines industries is not straight- forward and the creation of taxonomies to assign companies to industries is generally left to experts. These experts must consider multiple factors, including similarities in goods pro- duced, inputs to production, market perceptions, sensitivity to consumer income, etc. They must compile all such information 1 Although this article focuses on equities, greater diversification can be achieved by investing in multiple asset classes, such as bonds, real estate, etc. to be able to produce a reasonable taxonomy that accurately groups similar companies. This article focuses on a similar aggregation of information, but by utilizing a separate set of experts: news writers. When a writer creates a news article that contains a company or set of companies, that writer generally communicates rich infor- mation about the companies, albeit only a sliver of the total information. In particular, they often convey much information about the relatedness of companies. Consider the following snippets from New York Times articles involving Wal-Mart (an American discount retailer with many stores in various countries): 1) The [advertising] agency changes were part of a strategy shift at Wal-Mart, the nation’s largest re- tailer, to compete more effectively against rivals like Kohl’s, J. C. Penney and Target [5]. 2) Procter & Gamble, the consumer products com- pany, reached an agreement yesterday to acquire the Gillette Company, the shaving-products and battery maker, ... The move is a bid by two venerable consumer-products giants to strengthen their bargain- ing position with the likes of Wal-Mart and Aldi in Europe, which can now squeeze even the largest suppliers for lower prices [6]. 3) BUSINESS DIGEST ... Federal regulators wrapped up the first set of public hearings on Wal-Mart’s request to open a bank, but gave scant indication of how they might rule on the company’s application. ... Stocks tumbled as strength in the commodities market fed inflation fears and stifled investors’ enthusiasm over upbeat first-quarter earnings from Alcoa [7]. In the first snippet, several companies are identified as competitors to Wal-Mart. In a diversified portfolio, it would make sense to avoid large positions in several of these stocks because the companies face similar risks. For instance, a drop in consumer spending would likely affect all retailers. In the second snippet, we see that Procter & Gamble and Wal-Mart hold different locations in the same supply chain. While the article clearly mentions a battle between the companies to extract more value in the supply chain, the profitability of each company is again linked to similar risks, such as a drop in consumer spending. In the third snippet, we see Wal-Mart is mentioned together with Alcoa (a producer of aluminum), but there is no real 130

Upload: tomasz

Post on 13-Mar-2017

214 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: [IEEE 2014 IEEE Conference on Computational Intelligence for Financial Engineering & Economics (CIFEr) - London, UK (2014.3.27-2014.3.28)] 2014 IEEE Conference on Computational Intelligence

Diversification Improvements Through News ArticleCo-occurrences

John Robert Yaros and Tomasz ImielinskiDepartment of Computer Science

Rutgers University

Piscataway, NJ 08854-8019

Email: {yaros,imielins}@cs.rutgers.edu

Abstract—Intuition suggests that a set of companies mentionedin the same news article are more likely to be related thanunrelated. For instance, an article discussing a retailer wouldmore probably mention its competitors or supply chain partnersthan mention other companies with no economic connection.Correspondingly, we consider using news article co-occurrencesas a means to determine company relatedness. We show thatcompanies mentioned together frequently are more likely to havehigher future stock-return correlation, and consider using thisdata source as a means to achieve portfolio diversification byavoiding having pairs of related companies in the portfolio. Wefind this approach reduces risk and can be used to improvestandard approaches to diversification that use expert-definedindustry taxonomies, seeking to avoid portfolio concentration inany given economic sector.

I. INTRODUCTION

Two main techniques in risk management are hedgingand diversification. With hedging, risk in an investment isreduced by making additional investments that should generatea time series of returns that is the opposite of the originalinvestment. That is, hedging can be achieved by finding assetswith greatest anti-correlation, or, alternatively, by finding assetswith greatest correlation and using those assets for short sales.Conversely, diversification reduces risk by investing in assetswith correlation closest to zero. If a single asset in a portfolioloses value, the likelihood of losses in the other assets remainsconstant, whereas if they were correlated, the likelihood ofsimultaneous losses is increased.

In equity investing, diversification is often achieved byinvesting in stocks in different “sectors” or “industries.”1 Thisapproach matches the old proverb “don’t put all your eggs inone basket,” since investments will be spread across differentareas of the economy. It is well-established that such anapproach will help to reduce risk [1]. Furthermore, industrydiversification has increased in importance as other means ofdiversification have weakened, such as country diversification,due the integration of the global economy [2], [3], [4].

At the same time, what defines industries is not straight-forward and the creation of taxonomies to assign companiesto industries is generally left to experts. These experts mustconsider multiple factors, including similarities in goods pro-duced, inputs to production, market perceptions, sensitivity toconsumer income, etc. They must compile all such information

1Although this article focuses on equities, greater diversification can beachieved by investing in multiple asset classes, such as bonds, real estate, etc.

to be able to produce a reasonable taxonomy that accuratelygroups similar companies.

This article focuses on a similar aggregation of information,but by utilizing a separate set of experts: news writers. Whena writer creates a news article that contains a company or setof companies, that writer generally communicates rich infor-mation about the companies, albeit only a sliver of the totalinformation. In particular, they often convey much informationabout the relatedness of companies. Consider the followingsnippets from New York Times articles involving Wal-Mart(an American discount retailer with many stores in variouscountries):

1) The [advertising] agency changes were part of astrategy shift at Wal-Mart, the nation’s largest re-tailer, to compete more effectively against rivals likeKohl’s, J. C. Penney and Target [5].

2) Procter & Gamble, the consumer products com-pany, reached an agreement yesterday to acquire theGillette Company, the shaving-products and batterymaker, ... The move is a bid by two venerableconsumer-products giants to strengthen their bargain-ing position with the likes of Wal-Mart and Aldiin Europe, which can now squeeze even the largestsuppliers for lower prices [6].

3) BUSINESS DIGEST ... Federal regulators wrappedup the first set of public hearings on Wal-Mart’srequest to open a bank, but gave scant indication ofhow they might rule on the company’s application. ...Stocks tumbled as strength in the commodities marketfed inflation fears and stifled investors’ enthusiasmover upbeat first-quarter earnings from Alcoa [7].

In the first snippet, several companies are identified ascompetitors to Wal-Mart. In a diversified portfolio, it wouldmake sense to avoid large positions in several of these stocksbecause the companies face similar risks. For instance, a dropin consumer spending would likely affect all retailers.

In the second snippet, we see that Procter & Gambleand Wal-Mart hold different locations in the same supplychain. While the article clearly mentions a battle betweenthe companies to extract more value in the supply chain, theprofitability of each company is again linked to similar risks,such as a drop in consumer spending.

In the third snippet, we see Wal-Mart is mentioned togetherwith Alcoa (a producer of aluminum), but there is no real

130

Page 2: [IEEE 2014 IEEE Conference on Computational Intelligence for Financial Engineering & Economics (CIFEr) - London, UK (2014.3.27-2014.3.28)] 2014 IEEE Conference on Computational Intelligence

relation between the companies presented in the article, otherthan the fact they had notable events occurring on the sameday and, therefore, appear together in a business digest.

Previous research (e.g., [8]) has found that strong signals ofcompany relatedness can be captured by simply examining theco-occurrences of companies, despite the presence of “noise,”such as in the case of the third snippet above. Yet, much of theexisting research offers validation by simply drawing or listingcompany relations for the reader to approve, or by demon-strating some agreement with existing industry taxonomies.In contrast, this article seeks to validate the approach by ap-plying news article co-occurrences to the fundamental task ofdiversification. We demonstrate that these articles can improvediversification beyond what is achieved by typical uses ofindustry taxonomies. For instance, in the first snippet above,Target is listed as a competitor to Wal-Mart. The correlationof average daily returns for Target and Wal-Mart is 1.78 timesthe average correlation of pairs of stocks for years 2004 to2010. Yet, the Global Industry Classification System (GICS),a leading commercial industry taxonomy, placed Target andWal-Mart in two separate sectors, Consumer Discretionary andConsumer Staples, respectively, for those years. A portfolioconstructed to make sure that no stocks appear in the samesector could still contain these two highly related companies.By using news co-occurrences, we can avoid such situations.

Additionally, we show that as more news articles areincluded, signals become more robust. Given the abundanceof not only news articles, but corporate filings, blog posts andother publicly available textual information, we believe morerelationships can be captured, perhaps even for small or newcompanies that are not in industry taxonomies.

The remainder of the article is composed as follows. Webegin by describing related work in section II. In section III, wedescribe our datasets and outline the tools and heuristics usedto identify companies in news articles and link them to stockprice and sector/industry datasets. In section IV, we describeour measure to quantify company relatedness through news andevaluate its predictiveness of pairwise company correlation.This measure is used in section V to select portfolios that aremore diversified and, consequently, have less risk. We concludein section VI.

II. RELATED WORK

The study of news and finance have intersected on anumber of topics, including the speed of investor reactionsto news stories [9] and the effects of media coverage (or lackthereof) on stock prices [10], [11]. Another area is sentimentanalysis, which has been applied to measuring the impact ofpessimism on stock price and volumes [12]. Sentiment analysisand use of other textual features have further been applied tocreate numerous signals for trading strategies [13], [14], [15],[16]. Yet, those threads of research tend to focus on the useof news to predict the movements of single stocks, sectorsor markets, rather than the relatedness and consequential co-movements of stocks.

The smaller thread of research more similar to this workis the use of news and textual data to extract inter-companyrelationships. In a seminal work [8], authors use ClearForestsoftware (a pre-cursor to the Calais service used in this article)

to extract entities from a corpus of business news articles,which are then cleaned for deviations in company naming (i.e.,I.B.M. vs IBM). Authors then visualize the data with a networkstructure where edges are drawn between company verticeswherever the count of co-occurrences exceeds a set threshold.Authors highlight clusters in the network that appear to matchcommon perceptions of industries, then develop a notion of“centrality” to measure a company’s importance to an industry.They further develop a cosine measure for the inter-relatednessof two pre-designated industries based on relatedness of theirrespective companies, as determined by news co-occurrences.As authors concede, the work is limited by its small setof articles covering only four months in 1999, where thetechnology bubble led to significantly high numbers of articlescontaining technology companies. Further, the results rely onthe reader to judge whether the industries and relatednessmeasurements are reasonable, rather than offering verificationthrough external measures of company similarity, such asstock-return correlation.

Other researchers have also considered use of news or othertextual information to determine various aspects of relatednessbetween companies. In [17], authors construct a networkderived from news articles appearing on Yahoo! Finance overan eight month period. If the same article appears under thenews pages for two different companies, a link is constructedbetween the two companies in the network. Multiple articlesincrease the weight of each link. Authors then use in-degreeand out-degree measures as features for binary classifiers thatseek to predict which company in a pair has higher revenue.In [18], authors similarly construct networks based on co-occurrences in New York Times articles, but instead study theevolution of networks over time and use network features alongwith regression models to predict future company profitabilityand value. In [19], authors suggest bank interdependencies canbe inferred from textual co-occurrences, rather than the twotraditional data sources, co-movements in market data (e.g.,CDS spreads), which are not always efficient, and interbankasset and liability exposures, which are generally not publiclydisclosed. They exemplify their approach using a Finnishdata set and examine the temporal changes in the network,particularly folloing the Global Financial Crisis. In [20], amethod for extracting competitors and competitive domains(e.g., laptops for Dell and HP) is presented that essentiallyuses a search engine to gather articles and then uses somesentence patterns to identify competitors. In [21], authorsdescribe a system that extracts companies and correspondingrelationships from news articles using predefined rules. Theysuggest an approach to detect events, such as acquisitions, byobserving changes in the strength of relationships over time.

To our knowledge, this work is the first to consider useof news in stock-portfolio risk management. In particular,the task of diversification is a well-suited application becauseit relies on correlation, which is a direct measure of therelatedness of two companies through their stock price co-movements. Further, whereas taxonomies are often regardedas “gold standards” in the literature, we take the view thatimprovements can be made beyond what they achieve in riskmanagement and in determining company relatedness.

131

Page 3: [IEEE 2014 IEEE Conference on Computational Intelligence for Financial Engineering & Economics (CIFEr) - London, UK (2014.3.27-2014.3.28)] 2014 IEEE Conference on Computational Intelligence

1995 2000 2005

100

200

300

400

500

600

700

Cou

nt o

f Art

icle

s (x

1000

)

afpapwcnaltwnytxin

Fig. 1. Articles Per Year

III. DATA

Our collection of news articles is taken from two corporaat the Linguistic Data Consortium (LDC)2. The first is theNew York Times Annotated Corpus3, which contains over 1.8million articles published by the New York Times (NYT) fromJanuary 1, 1987 to June 19, 2007. The second is EnglishGigaword Fourth Edition4, which contains articles from thefollowing five newswire services5:

AFP Agence France-PresseAPW Associated Press WorldstreamCNA Central News Agency of TaiwanLTW Los Angeles Times / Washington PostXIN Xinhua News Agency

The data collection at LDC for most of the newswires wasperformed via dedicated lines that recorded article text andmeta-data in real-time, although some articles were later re-ceived (or recovered) in bulk. Due to various collection issues,there are large gaps in the collection for particular newswires.There are also periods where fewer articles where collected dueto changes in collection methods. Fig. 1 depicts the numberof articles from each source, per year6.

To extract company names from articles, we use Calais,which is developed and maintained by ClearForest7, a groupwithin Thomson Reuters. The free OpenCalais8 web serviceallows users to submit text and receive back annotations.Company name detection is a main feature, which we use here.

To quantify OpenCalais error rates, we randomly selected100 NYT articles and manually marked companies in the text.We then computed precision and recall as shown in Table I.

2https://www.ldc.upenn.edu3http://catalog.ldc.upenn.edu/LDC2008T194http://catalog.ldc.upenn.edu/LDC2009T135The English Gigaword Fourth Edition also contains newswire articles from

NYT, which we exclude since we are already using the separate NYT corpus.6A small number of articles contained non-English characters and could not

be processed by OpenCalais. Fig. 1 depicts only processed articles.7http:\www.clearforest.com8http:\www.opencalais.com

TABLE I. OPENCALAIS PERFORMANCE ON 100 SAMPLE ARTICLES

No. Companies in Text 288True Positives 213 F1 score 0.796False Positives 34 Precision 0.862

False Negatives 75 Recall 0.740

Calais does reasonably well at the difficult task of identifyingcompanies, including differentiating those companies fromnon-profit organizations (e.g., Red Cross) or governmentalagencies (e.g., Air Force). Some examples of false positives areshown in Fig. 2, where the words alleged to be companies byOpenCalais are outlined in boxes. In example (1), a region wasmisidentified as a company. Examples (2) & (3) demonstrate acommon problem, wherein only part of the company name isidentified as a company, or other words are combined into thecompany name (possibly from other companies). We did notattempt to perform any correction, again with the expectationthat such errors will amount to noise and enough articles willovercome any problems. The worst case occurs when a falsepositive has the name of an actual company. For example,if the fruit “apple” was somehow identified by OpenCalaisas the technology company, Apple. We never observed suchsituations, although it is possible they did occur. For most falsepositives, the misidentified text would not be included in ouranalysis because it simply would not link to our universe ofstocks.

Our stock universe is the constituents of the S&P 100,which is an index maintained by Standard and Poor’s (S&P).The index is a subset of 100 U.S. companies from the S&P500 chosen for sector balance and such that the companies arelarger, more stable and have listed options [22]. To identify theS&P 100 constituents each year, we use Compustat, a productof S&P.

We further use Compustat to get sector assignments for theS&P 100 stocks in the Global Industry Classification System(GICS), which is a leading commercial industry taxonomyproduced by S&P and MSCI/Barra. GICS has been shownto offer better risk management characteristics than manyhistorically-used government or academic taxonomies [23]. Wefocus on the 10 sectors, but GICS offers greater granularity bybreaking the sectors into 24 industry groups, then 68 industriesand finally 154 sub-industries. These counts reflect the stateof GICS at the end of 2010, but they have changed over timeto match changes in the economy (although there have alwaysbeen 10 sectors). Additionally, a company’s GICS assignmentmay be updated for changes in the company’s business.For these reasons, it is important to use concurrent GICSassignments when performing historical analysis. Accordingly,Compustat offers these historical assignments and we use themin our analysis.

1) Ray Sullivan, Northeast regional director for AmericanCollege Testing, ...

2) Specialty Products and Insulation Co. , East Petersburg, ...

3) ... Credit Suisse First Boston/Goldman, Sachs & Co.

Fig. 2. Examples of False Positives

132

Page 4: [IEEE 2014 IEEE Conference on Computational Intelligence for Financial Engineering & Economics (CIFEr) - London, UK (2014.3.27-2014.3.28)] 2014 IEEE Conference on Computational Intelligence

Gigaword & NYT CorporaNews Articles

CRSPStock Returns

CompustatGICS Sectors

S&P 100 Constituents

OpenCalaisCompany

Identification

Data Linking

Relatedness Computation and

Evaluation

Fig. 3. Data Processing Overview

For stock returns, we use a dataset from the Center forResearch in Security Prices (CRSP). The dataset includesdaily returns that not only include returns due to daily pricechanges, but also all changes due to corporate actions, likecash dividends.

CRSP and Compustat were linked using a variety ofidentifiers, such as CUSIP and exchange tickers, with attentionto ensuring the identifiers are matching at concurrent timeperiods. Any unmatched constituents from the S&P 100 atany time period were manually linked.

The more significant task was linking the news articlecompanies because the same company may be referenced inmultiple ways. For instance, “DuPont,” “EI DuPont,” “E.I.Du Pont De Nemours” and “E.I. du Pont de Nemours andCompany” are all aliases for the same company. Calais doesoffer company “resolutions,” where these multiple aliases areresolved to a single company name and Reuters InstrumentCode (RIC). However, these resolutions do not account for thetime period of the article. For example, “Mobil” will resolveto ExxonMobil, which is not helpful if the article is priorto the 1998 merger of Exxon and Mobil. Therefore, we useonly the original company names identified by Calais, not theirresolutions.

To link the identified companies to the other datasets, weuse a manually constructed mapping of company aliases toCRSP permno (CRSP’s unique identifier). Each entry of themapping includes beginning and ending dates of validity toavoid mapping to incorrect companies for the given article’sdate, such as in the ExxonMobil example above. We further usea series of standardization rules on the raw company names,such as removal of periods, consolidation of abbreviationcharacters (i.e. “A. T. & T.” → “AT&T”), removal of suffixes(i.e., remove “Co,” “Company,” ”Inc,” etc.) and multiple otherrules to reduce the possible number of alias derivatives for agiven company.

Finally, an important note is that we do not considersubsidiaries in determining co-occurrences. A main reasonis that joint ownership of a subsidiary (by two or morecompanies) is frequent and may easily obscure the strength ofa co-occurrence in news. For example, suppose 32% of Huluis owned by NBCUniversal, which in turn is wholy ownedby Comcast. Further suppose there is a co-occurence in newsbetween Hulu and Google. What should be the strength of therelationship between Google and Comcast? What about theother owners of Hulu? We leave this for future work.

1

10

100

1000

1994

1995

1996

1997

1998

1999

2000

2001

2002

2003

2004

2005

2006

2007

Num

ber

of A

rtic

les

Men

tioni

ng C

ompa

ny

Fig. 4. Frequency Distribution Box Plots for Company Appearances in NewsArticles

IV. RELATEDNESS MEASUREMENT

The goal of this work is to improve portfolio diversification,which itself should result in reduced risk. Our hypothesisis that companies mentioned frequently in news articles arehighly related and, therefore, should not appear together in acandidate portfolio because it will not be diversified. To thisend, the collection of articles containing a specific companycan be represented by a set. To quantify the similarity of twocompanies, one can measure the similarity of their sets.

However, important consideration must be given to thedistribution of sizes of these sets. Fig. 4 displays cross-sectional boxplots of the number of times each company ismentioned in articles, across all datasets. In the figure, thecircle-dot represents the median number of articles a companyappears in for the given year. The top and bottom of each thickbar represent the 75-th and 25-th percentiles, respectively. Thetop and bottom of each thin line represent the maximum andminimum, respectively.

As can be seen in Fig. 4, some companies receive muchgreater news attention than others (note the logarithmic scale).This may occur because some companies are much larger, orperhaps because some companies have generated controversy(e.g., Enron was frequently in the news at the end of 2001due to accounting fraud, but was previously less mentioned.)Therefore, we compute our coefficient of relatedness C for twocompanies c1 and c2 using cosine similarity (also appearingin literature as the Ochiai coefficient [24]):

C(c1, c2) = |Wc1 ∩Wc2 |√|Wc1 | · |Wc2 |where Wc1 and Wc2 are the sets of articles for companiesc1 and c2, respectively. For two vectors, the cosine measuresthe angle between them, but not the magnitude. Likewise, thecosine for two sets reflects the similarity between them ratherthan the sizes of the sets [25]. This favorable property makescosine appropriate for our dataset where some companies mayhave a much greater volume of news than others.

Another favorable aspect of cosine similarity is that ithas the “null addition property,” wherein adding or removing

133

Page 5: [IEEE 2014 IEEE Conference on Computational Intelligence for Financial Engineering & Economics (CIFEr) - London, UK (2014.3.27-2014.3.28)] 2014 IEEE Conference on Computational Intelligence

TABLE II. COMPANIES MOST SIMILAR TO TARGET (YEAR: 2005)

— articles — GICS 2006rank Company all w/ TGT cosine Sector correl

1 Wal-Mart 2099 136 0.215 Staples 0.4622 Limited Brands 38 9 0.106 Discretionary 0.4723 Home Depot 513 24 0.077 Discretionary 0.4174 Black & Decker 18 2 0.034 Discretionary 0.3705 RadioShack 25 2 0.029 Discretionary 0.2606 Dell 699 9 0.024 Technology 0.2537 P&G 366 6 0.022 Staples 0.2408 General Motors 2650 15 0.021 Discretionary 0.1779 Merck 755 8 0.021 Health 0.121

10 ExxonMobil 763 8 0.021 Energy 0.06911 3M 195 4 0.020 Industrial 0.23412 Microsoft 2849 12 0.016 Technology 0.17313 Goldman Sachs 1318 8 0.016 Financial 0.42514 Cisco 352 4 0.015 Technology 0.16015 Medtronic 91 2 0.015 Health 0.055

unrelated data does not a ffect the measure. More precisely,increasing or decreasing the number of articles that containneither c1 nor c2 will not alter C(c1, c2). Many other measures,such as Collective strength, odds ratio, All-confidence, etc., donot have the null addition property [25].

In Table II, we exemplify the similarity ranking for a singlecompany by showing the top 15 similar companies to Targetin 2005, as measured by cosine similarity C. For reference,Target was in 189 articles in 2005 and was in the ConsumerDiscretionary (“Discret.”) GICS sector. As seen in the table,the top companies ranked by the news cosine are intuitivelyrelated to Target. Wal-Mart directly competes with Target as adiscount retailer. Limited Brands, Home Depot and Radioshackare also retailers, but with different focuses and/or store sizesthan Target. Black & Decker and P&G produce goods that arefrequently sold in Target stores. Wal-Mart is selected as mostsimilar, even though GICS has no indication of similarity asit places the company in a different sector, Consumer Staples(“Staples”). We also see that the top stocks tend to be thosemost correlated with Target in the subsequent year, 2006.As the strength of the cosine weakens, so do the intuitiverelatedness of the companies and their correlation.

A. Correlation Prediction

Following Modern Portfolio Theory (MPT) [26], risk canbe measured by the variance of returns. Diversification canbest be achieved by including assets with correlation close tozero. We first measure how well news can predict correlationin order to validate our intuition that stocks mentioned togethermore in news will have higher correlation. Since correlationsvary greatly among stocks (and sectors), we normalize correla-tion for each stock by dividing by its average daily correlationwith all other stocks9:

ψi,j =ρi,jρi

where ρi,j is the correlation between stocks i and j and theaverage correlation ρi for stock i is

ρi =

j∈U,j �=i

ρi,j

|U | − 1

where U is the universe of stocks.

9This relies on the fact that stock correlations are nearly always positive.

0 20 40 60 80 100

1

1.2

1.4

1.6

1.8

Rank of Similar Stock by News

Rat

io S

tock

Cor

rel t

o A

vg C

orre

l (ψ

)

allapwcna

0 5 10 15 20

1

1.2

1.4

1.6

1.8

Fig. 5. Performance of Correlation Prediction by News

Since we ultimately wish to use news for prediction, weuse walk-forward testing [27], [28] where we split our datainto years and use news articles from each year to predictcorrelation in the subsequent year. Results are depicted inFig. 5. Each point along the curves indicates the correlationratio ψ (y-axis) between each stock and its corresponding k-thsimilar stock at the given rank of news similarity (x-axis). Thefigure displays the results averaged over all stocks and overall years (1995 to 2008), but each individual year follows thesame general pattern. For stocks that are most highly similarby news (left-most in the figure), the correlation is highest.The correlation rapidly diminishes with the first 5-10 stocks,then has a slower downward trend towards lower correlation.This indicates that news is strongest at determining the mostcorrelated companies. It can still differentiate at lower levels,but at weaker strength. Also evident in the inset of the figureis that as more articles are included, the better news does atdetermining the most correlated stocks. The smallest source,CNA, has weakest predictive power, while the larger APWdoes better. Using all articles does the best.

V. DIVERSIFICATION

To quantify the risk of a given portfolio, we use a simplemethod of counting the number of ordered pairs of stocks inthe portfolio that are expected to be highly similar throughnews. More specifically, we count the number of occurrencesof an ordered pair (c1, c2) where C(c1, c2) is in the top Khighest values for the given stock c1. Observe that two stocks{ca, cb} may be counted twice - once for C(ca, cb) and once forC(cb, ca). From Fig. 5, we can see news is best at predictingroughly the top 5-10 most similar stocks related to a givenstock, so we set K=10 in the following experiments.

For each year in our dataset, we randomly generate1,000,000 portfolios of five stocks each from the S&P 100.Since each stock in the portfolio may itself have varying levelsof risk, we weight each stock in the portfolio by the inverseof its historical standard deviation of 20-trading-day returns(approx. one calendar month) over the previous three years.That is, the weight wi of stock i in the portfolio is proportionalto 1/σh,i, where σh,i is the historical standard deviation of

134

Page 6: [IEEE 2014 IEEE Conference on Computational Intelligence for Financial Engineering & Economics (CIFEr) - London, UK (2014.3.27-2014.3.28)] 2014 IEEE Conference on Computational Intelligence

TABLE III. RISK IMPROVEMENTS

All No. News Top K Pairs No. GICS Sectors News / GICSYear Portfolios 0 1 to 3 4 to 6 7 to 9 10 to 12 13 to 15 1 2 3 4 5 Combination

1995 0.5253 0.5121 0.5230 0.5370 0.5528 0.5679 0.5768 0.5558 0.5547 0.5408 0.5252 0.5103 0.50291996 0.5951 0.5831 0.5935 0.6038 0.6082 0.6119 0.5984 0.6455 0.6230 0.6081 0.5947 0.5838 0.57781997 0.6727 0.6638 0.6717 0.6789 0.6883 0.6937 0.7034 0.6447 0.6839 0.6813 0.6733 0.6646 0.66031998 0.6064 0.5955 0.6051 0.6150 0.6234 0.6315 0.6369 0.6559 0.6313 0.6176 0.6058 0.5967 0.58971999 0.6053 0.6106 0.6032 0.6048 0.6079 0.6186 0.6283 0.6297 0.6294 0.6172 0.6055 0.5941 0.59732000 0.5387 0.5463 0.5374 0.5341 0.5387 0.5610 0.5923 0.6372 0.5852 0.5561 0.5374 0.5228 0.52902001 0.6594 0.6433 0.6536 0.6708 0.6962 0.7208 0.7390 0.7768 0.7267 0.6886 0.6588 0.6336 0.62662002 0.7313 0.7231 0.7294 0.7364 0.7457 0.7592 0.7601 0.7725 0.7441 0.7352 0.7308 0.7285 0.72422003 0.6090 0.6091 0.6102 0.6079 0.6006 0.5951 0.5726 0.6620 0.6292 0.6154 0.6082 0.6043 0.60762004 0.6515 0.6409 0.6497 0.6580 0.6629 0.6658 0.6616 0.7030 0.6797 0.6629 0.6510 0.6409 0.63782005 0.5835 0.5800 0.5818 0.5871 0.5939 0.5975 0.5840 0.6025 0.5945 0.5877 0.5831 0.5803 0.58132006 0.5480 0.5413 0.5457 0.5530 0.5645 0.5783 0.6088 0.6421 0.5927 0.5664 0.5470 0.5320 0.53032007 0.6901 0.6989 0.6893 0.6866 0.6912 0.7030 0.7113 0.7408 0.7137 0.7009 0.6901 0.6804 0.68492008 0.6887 0.6802 0.6856 0.6958 0.7087 0.7246 0.7556 0.7199 0.6893 0.6866 0.6882 0.6912 0.6858

1995 1997 1999 2001 2003 2005 2007

0.98

1

1.02

1.04

1.06

Ris

k R

atio

of P

ortfo

lios

to R

ando

m

7 to 94 to 61 to 30

Fig. 6. Risk by News Top K Counts

stock i. In the case that there are fewer than twelve 20-trading-day periods of returns in the stock’s history, we give thatstock weight corresponding to the average historical standarddeviation of the other stocks in the S&P 100 that year. Theseweights are normalized such that

∑i wi = 1.

This weighting scheme helps ensure each stock will con-tribute equal risk within the portfolio. However, the portfoliosthemselves may have varying risk depending on their underly-ing stocks (even though each is weighted for equal risk withinthe portfolio). So, to make comparisons between portfolios,we compute the following “risk improvement” measure

R =σA

σH

where σA is the actual standard deviation of 20-trading-dayreturns of the weighted portfolio over the subsequent 12periods (essentially monthly returns over the next year). The“homogeneous” standard deviation σH is a theoretical valuethat measures the variance if the stocks had been perfectlycorrelated (i.e., ρij = 1 for all pairs of stocks i, j)

σ2H =

n∑

i=1

n∑

j=1

wiwjσiσjρij =n∑

i=1

n∑

j=1

wiwjσiσj

Of course, most stocks are not perfectly correlated, so R willrange between zero and one with lower values indicating lessrisk. A value of 0.5 means that the risk (as measured by

1995 1997 1999 2001 2003 2005 20070.95

1

1.05

1.1

1.15

1.2

1.25

Ris

k R

atio

of P

ortfo

lios

to R

ando

m

1 sector2 sectors3 sectors4 sectors5 sectors

Fig. 7. Risk by GICS Sectors

variance) is half the risk that would be encountered if thestocks were perfectly correlated.

Results are shown in Table III, where we display the riskimprovement measure, R, averaged over all sample portfoliosfor the given year. We also show averages for several subsetsof the portfolios, selected by the news cosine, GICS sectors,or the combination method that will be described shortly.

The risk improvements measures for all portfolios aresubstantially less than 1.0, which is unsurprising given thatstocks are rarely perfectly correlated and it is well-known thatincluding more stocks in a portfolio quickly reduces risk [29],[30]. Even using only five stocks offers a dramatic riskimprovement as seen in the second column (“All Portfolios”)of the table.

For news articles, we see that having fewer instances ofpairs of stocks in the top K similarity by news leads to lowerrisk. This is evident in the table, but also in Fig. 6, where thevalues are plotted relative to random (i.e., the “All Portfolios”column). We plot values in this ratio since the random riskreduction changes greatly from year to year and using the ratioprovides a more stable view. As seen in the figure, portfolioswith fewer top K similar stocks by news tend to have lowerrisk. For most years, portfolios with zero top K similar stockshave least risk. Years 1999, 2000 and 2007 are exceptions.Still, zero has lower risk than one to three pairs of top Kstocks at the 5% level of significance under a paired t-test.

135

Page 7: [IEEE 2014 IEEE Conference on Computational Intelligence for Financial Engineering & Economics (CIFEr) - London, UK (2014.3.27-2014.3.28)] 2014 IEEE Conference on Computational Intelligence

0 10 20 30 40 50

0.61

0.615

0.62

0.625

K in "Top K"

Avg

Ris

k Im

prov

emen

t

newsgicsnews / gics combination

Fig. 8. Risk Improvement (as determined byvariance) by News Selectivity

0 10 20 30 40 50

−1400

−1350

−1300

−1250

−1200

−1150

K in "Top K"

Avg

Exp

ecte

d S

hort

fall

(in $

)

newsgicsnews / gics combination

Fig. 9. Expected Shortfall by News Selectivity

0 10 20 30 40 50

100

200

300

400

K in "Top K"

Avg

Num

ber

of P

ortfo

lios

(x10

00)

newsgicsnews / gics combination

Fig. 10. Selectivity Sensitivity of K

TABLE IV. EXPECTED SHORTFALL

All Zero News Five GICS News / GICSYear Portfolios Top K Pairs Sectors Combination

1995 1526.97 1407.94 1765.15 1682.581996 223.32 187.89 205.93 199.111997 209.56 93.81 404.20 352.331998 -1477.71 -1655.06 -1361.68 -1459.751999 -1489.93 -1755.87 -1457.65 -1605.662000 -2629.27 -2215.09 -1941.93 -1719.972001 -3594.49 -3459.68 -3560.99 -3553.362002 -4352.52 -4241.19 -4064.33 -4068.132003 499.00 512.40 514.40 503.212004 -481.12 -300.90 -403.07 -286.132005 -1079.67 -1005.15 -954.97 -885.582006 349.38 354.40 396.71 408.262007 -1247.00 -1139.27 -869.42 -881.322008 -6504.77 -6347.67 -6325.22 -6236.00Avg -1432.02 -1397.39 -1260.92 -1253.60

In Table III and in Fig. 7, results are shown for GICS.Our expectation is that the more sectors the portfolio spans,the more diversified the portfolio should be. Indeed, resultsindicate this is nearly always true. In Fig. 7, we can see thatportfolios with stocks in five sectors should have least risk.Correspondingly, at the 0.1% level of significance under apaired t-test, a portfolio spanning five sectors has less risk thana portfolio spanning only four sectors. In general, we see thatthis GICS approach to diversification offers a more consistentrisk reduction than the news approach.

Still, news can offer some benefit by improving GICSresults. The final column (“News / GICS Combination”) inTable III displays results for portfolios that both span five GICSsectors and have zero top K similar pairs of stocks by news.This method has least risk in many years and only does worsethan both the GICS and news methods in 2005.

Since the use of variance has been criticized as a measureof risk [31], we consider an alternative: expected shortfall (ES),which we compute as the average of the worst 5% of portfolioreturns. This measure is also sometimes called “expectedtail loss.” Unlike the standard deviation of returns, expectedshortfall is more interpretable since it tells the investor “there isa 1 in 20 chance your returns will be $X, on average (althoughthey can be worse).” The measure is usually applied to selectportfolios based on risk, but it can also be used to evaluateportfolio selection methods, as we do here.

For our ES measure, since each randomly generated port-folio may contain stocks that are more volatile than others, we(de)lever each portfolio such that they all have equal volatilityin terms of the historical variances of their stocks, specifically

the standard deviation of the thirty-six periods of 20-trading-day returns (i.e., roughly three years of monthly returns). Theinvestment for a portfolio with average volatility is $10,000and the amount invested will be lower or higher for otherportfolios based on whether they are more or less volatile,respectively. Results are shown in Table IV, where the ex-pected shortfall is shown per year for the given diversificationmethods. We find that the combination method performs onlymarginally better than GICS alone.

However, by increasing K and making the news articleportfolios more selective, the margin between the combinationmethod and GICS increases. This is shown in Fig. 9 forexpected shortfall and in Fig. 8 for the risk improvementmeasure, R. In the figures, the best portfolios for news (zeronews top K pairs), GICS (5 sectors) and the news / GICScombination (satisfying both) are shown, averaged over allyears. For both the ES and R risk measures, we see riskimprovements in the news and combined portfolios as Kincreases. (GICS remains constant because it is unaffected byK.) At the same time, the number of possible portfolios quicklydecreases, as shown in Fig. 10. Hence, trade-off for selectingK is improving risk at the cost of fewer alternatives.

VI. SUMMARY AND FUTURE DIRECTIONS

We evaluate the intuition that companies that frequentlyco-occur in news articles are related. We consider a large setof news articles from several major services over a 1994 to2007 period and use OpenCalais to identify company namesin the articles. To quantify company relatedness through co-occurrences, we use cosine similarity because it is well-suited for the large disparities in the number of articles eachcompany appears in. We find that when ranking companiesby their cosine similarity to a given company, the top rankedcompanies tend to have highest future stock-return correlationwith that given company. We suggest a simple portfoliodiversification method, wherein potential portfolios are avoidedif they contain any pairs of stocks that have high news-articlecosine similarity. Results show that such an approach tends toreduce risk as measured by variance in returns. We find thata standard approach of avoiding stocks in the same sector (asdefined by GICS) is generally more reliable in reducing risk.However, combining the approaches, such that similar stocksas determined both by news and by GICS sectors are avoided,does even better at reducing risk.

136

Page 8: [IEEE 2014 IEEE Conference on Computational Intelligence for Financial Engineering & Economics (CIFEr) - London, UK (2014.3.27-2014.3.28)] 2014 IEEE Conference on Computational Intelligence

Several opportunities exist to be more exact in extractingcompany relatedness in the articles beyond the approach usedin this work. In [18], authors suggest considering the proximityof the company names in the news article may be beneficial.For example, more weight should be given to occurrences inthe same sentence. Further, using natural language processingto try to actually identify the relationship (competitor, supplychain partner, etc.) between two companies may help todetermine the future interrelationship of their stock returns.Conversely, identifying and reducing the weight of “noise”articles, such as market summaries, that do not present strongrelationships may help in improving the similarity signals.

Finally, whereas this article focuses on large companiesbecause industry taxonomies are available to use for baselines,greater potential may exist when examining smaller, new orprivate companies. In such situations it may be difficult todetermine company similarity because traditional methods,such as industry taxonomies or even historical prices, areunavailable. Still, there may be a large desire to be ableto identify “comparables” for use in company valuation orhedging. News articles and other textual documents, such asblog posts, may be able to fill this void, particularly becausethey are becoming more prevalent.

ACKNOWLEDGEMENTS

Authors thank Kenneth Liau and Gerard O’Neill for theircontributions during preliminary experiments.

REFERENCES

[1] B. F. King, “Market and industry factors in stock price behavior,” TheJournal of Business, vol. 39, no. 1, pp. 139–190, 1966.

[2] S. P. Baca, B. L. Garbe, and R. A. Weiss, “The rise of sector effects inmajor equity markets,” Financial Analysts Journal, vol. 56, no. 5, pp.34–40, 2000.

[3] S. Cavaglia, C. Brightman, and M. Aked, “The increasing importance ofindustry factors,” Financial Analysts Journal, vol. 56, no. 5, pp. 41–54,2000.

[4] F. Carrieri, V. Errunza, and S. Sarkissian, “The dynamics of geographicversus sectoral diversification: Is there a link to the real economy?”Quarterly Journal of Finance, vol. 02, no. 04, 2012.

[5] S. Elliott, “Why an agency said no to Wal-Mart,” New York Times,p. C1, Dec 15 2006.

[6] A. R. Sorkin and S. Lohr, “Procter closes $57 billion deal to buyGillette,” New York Times, p. A1, Jan 28 2005.

[7] “Business digest,” New York Times, p. C2, Apr 12 2006.

[8] A. Bernstein, S. Clearwater, S. Hill, C. Perlich, and F. Provost, “Dis-covering knowledge from relational data extracted from business news,”in SIGKDD Workshop on Multi-Relational Data Mining, 2002.

[9] P. Klibanoff, O. Lamont, and T. A. Wizman, “Investor reaction to salientnews in closed-end country funds,” The Journal of Finance, vol. 53,no. 2, pp. 673–699, 1998.

[10] W. S. Chan, “Stock price reaction to news and no-news: drift andreversal after headlines,” Journal of Financial Economics, vol. 70, no. 2,pp. 223–260, 2003.

[11] L. Fang and J. Peress, “Media coverage and the cross-section of stockreturns,” The Journal of Finance, vol. 64, no. 5, pp. 2023–2052, 2009.

[12] P. C. Tetlock, “Giving content to investor sentiment: The role of mediain the stock market,” The Journal of Finance, vol. 62, no. 3, pp. 1139–1168, 2007.

[13] N. Li and D. D. Wu, “Using text mining and sentiment analysisfor online forums hotspot detection and forecast,” Decision SupportSystems, vol. 48, no. 2, pp. 354 – 368, 2010.

[14] W. Zhang and S. Skiena, “Trading strategies to exploit blog and newssentiment,” in International AAAI Conference on Weblogs and SocialMedia, 2010.

[15] M. Hagenau, M. Liebmann, M. Hedwig, and D. Neumann, “Automatednews reading: Stock price prediction based on financial news usingcontext-specific features,” in Proceedings of the 2012 45th HawaiiInternational Conference on System Sciences, 2012, pp. 1040–1049.

[16] D. D. Wu, L. Zheng, and D. L. Olson, “A decision support approach foronline stock forum sentiment analysis,” IEEE Transactions on Systems,Man, and Cybernetics: Systems, 2014, to appear.

[17] Z. Ma, O. R. Sheng, and G. Pant, “Discovering company revenuerelations from news: A network approach,” Decision Support Systems,vol. 47, no. 4, pp. 408 – 414, 2009.

[18] Y. Jin, C.-Y. Lin, Y. Matsuo, and M. Ishizuka, “Mining dynamic socialnetworks from public news articles for company value prediction,”Social Network Analysis and Mining, vol. 2, no. 3, pp. 217–228, 2012.

[19] S. Ronnqvist and P. Sarlin, “From text to bank interrelation maps,”ArXiv e-prints, no. 1306.3856, 2013, working Paper.

[20] S. Bao, R. Li, Y. Yu, and Y. Cao, “Competitor mining with the web,”IEEE Trans. on Knowl. and Data Eng., vol. 20, no. 10, pp. 1297–1310,Oct. 2008.

[21] C. Hu, L. Xu, G. Shen, and T. Fukushima, “Temporal company relationmining from the web,” in Advances in Data and Web Management, ser.Lecture Notes in Computer Science, Q. Li, L. Feng, J. Pei, S. Wang,X. Zhou, and Q.-M. Zhu, Eds., 2009, vol. 5446, pp. 392–403.

[22] McGraw Hill Financial, “S&P U.S. indices methodology,” Sept 2013.

[23] L. K. Chan, J. Lakonishok, and B. Swaminathan, “Industry classificationand return comovement,” Financial Analysts Journal, vol. 63, no. 6, pp.56–70, 2007.

[24] A. Ochiai, “Zoogeographic studies on the soleoid fishes found in Japanand its neighbouring regions,” Bulletin of the Japanese Society forScientific Fisheries, vol. 22, pp. 526–530, 1957.

[25] P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining.Boston: Pearson Addison Wesley, 2005.

[26] H. Markowitz, “Portfolio selection,” The Journal of Finance, vol. 7,no. 1, pp. 77–91, 1952.

[27] D. Meyers, “Walk forward with the XAU bond fund system,” TechnicalAnalysis of Stocks & Commodities, vol. 15, no. 5, pp. 199–205, 1997.

[28] D. Aronson, Evidence-based technical analysis : applying the scientificmethod and statistical inference to trading signals. Hoboken, N.J:John Wiley & Sons, 2007.

[29] J. L. Evans and S. H. Archer, “Diversification and the reduction ofdispersion: An empirical analysis,” The Journal of Finance, vol. 23,no. 5, pp. 761–767, 1968.

[30] E. J. Elton and M. J. Gruber, “Risk reduction and portfolio size: Ananalytical solution,” The Journal of Business, vol. 50, no. 4, pp. 415–437, 1977.

[31] B. Mandelbrot, The (mis)behavior of markets: a fractal view of risk,ruin, and reward. New York: Basic Books, 2004.

137