Transcript
Page 1: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 1/64

A Brief History of the ShortHistory of Text Mining (inFinance)Sanjiv Ranjan Das and Karthik Mokashi

Page 2: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 2/64

Reference monograph

Text expands the universe of data by many-fold. See my monographon text mining in finance at:http://algo.scu.edu/~sanjivdas/Das_TextAnalyticsInFinance.pdf

This covers some of the content of this presentation.

2/64

Page 3: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 3/64

Text as Data

1. Big Text: there is more textual data than numerical data.

2. Text is versatile. Nuances and behavioral expressions that are notconveyed with numbers.

3. Text contains emotive content. Sentiment analysis. Admati-Pfleiderer 2001; DeMarzo et al 2003; Antweiler-Frank 2004, 2005;Das-Chen 2007; Tetlock 2007; Tetlock et al 2008; Mitra et al 2008;Leinweber-Sisk 2010.

4. Text contains opinions and connections. Das et al 2005; Das andSisk 2005; Godes et al 2005; Li 2006; Hochberg et al 2007.

5. Numbers aggregate; text categorizes (topics).

6. Text can be generated.

3/64

Page 4: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 4/64

Definition: Text-Mining

1. Text mining is the large-scale, automated processing of plain textlanguage in digital form to extract data that is converted into usefulquantitative or qualitative information.

2. Text mining is automated on big data that is not amenable tohuman processing within reasonable time frames. It entailsextracting data that is converted into information of many types.

3. Simple: Text mining may be simple as in key word searches andcounts.

4. Complicated: It may require language parsing and complex rules forinformation extraction.

5. Structured text, such as the information in forms and some kinds ofweb pages.

6. Unstructured text is a much harder endeavor.

4/64

Page 5: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 5/64

Data and Algorithms

5/64

Page 6: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 6/64

Text Handling

Text Extraction

String Parsing

String Detection

Text cleanup

XML Package

Regular Expressions

Using APIs (Twitter, Facebook, Yelp)

·

·

·

·

·

·

·

6/64

Page 7: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 7/64

The Response to News (Das, Martinez-Jerez,and Tufano (FM 2005))

7/64

Page 8: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 8/64

Breakdown of News Flow

8/64

Page 9: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 9/64

Frequency of Postings

9/64

Page 10: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 10/64

Weekly Posting

10/64

Page 11: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 11/64

Intraday Posting

11/64

Page 12: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 12/64

Number of Characters per Posting

12/64

Page 13: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 13/64

Text Mining with the "tm" Package

1. Functions such as {readDOC()}, {readPDF()}, etc., for reading DOCand PDF files, the package makes accessing various file formatseasy.

2. Text mining involves applying functions to many text documents, acorpus. Ability to operate on the entire set of documents at one go.

3. Remove Stopwords, Punctuation, Numbers. Create a Bag of Words(BOW)

4. Term Document Matrix (TDM): An entire library of text inside asingle matrix. Analysis and searching documents. In search engines,topic analysis, and classification (spam filtering).

5. Term Frequency - Inverse Document Frequency (TF-IDF)

13/64

Page 14: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 14/64

Wordclouds

Wordlcouds are interesting ways in which to represent text. They givean instant visual summary. The wordcloud package in R may be usedto create your own wordclouds.

14/64

Page 15: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 15/64

Document Similarity

In this segment we will learn some popular functions on text that areused in practice. One of the first things we like to do is to find similartext or like sentences (think of web search as one application). Sincedocuments are vectors in the TDM, we may want to find the closestvectors or compute the distance between vectors.

where , is the dot product of with itself, also known asthe norm of . This gives the cosine of the angle between the twovectors and is zero for orthogonal vectors and 1 for identical vectors.

cos(θ) =A Þ B

||A|| × ||B||

||A|| = A Þ A‾ ‾‾‾‾√ AA

15/64

Page 16: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 16/64

Dictionaries

1. Standard Dictionaries: www.dictionary.com, and www.merriam-webster.com.

2. The Harvard General Inquirer:http://www.wjh.harvard.edu/~inquirer/

3. Computer dictionary: http://www.hyperdictionary.com/computerthat contains about 14,000 computer related words, such as "byte"or "hyperlink".

4. Math dictionary, such ashttp://www.amathsdictionaryforkids.com/dictionary.html.

5. Medical dictionary, see http://www.hyperdictionary.com/medical.

16/64

Page 17: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 17/64

Dictionaries - II

1. Internet lingo dictionaries.http://www.netlingo.com/dictionary/all.php for words such as"2BZ4UQT" which stands for "too busy for you cutey" (LOL).

2. Associative dictionaries are also useful when trying to find context,as the word may be related to a concept, identified using adictionary such as http://www.visuwords.com/. This dictionarydoubles up as a thesaurus.

3. Value dictionaries deal with values and may be useful when onlyaffect (positive or negative) is insufficient for scoring text. TheLasswell Value Dictionaryhttp://www.wjh.harvard.edu/~inquirer/lasswell.htm may be used toscore the loading of text on the eight basic value categories: Wealth,Power, Respect, Rectitude, Skill, Enlightenment, Affection, and Wellbeing.

17/64

Page 18: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 18/64

Lexicons

1. A lexicon is defined by Webster's as "a book containing analphabetical arrangement of the words in a language and theirdefinitions; the vocabulary of a language, an individual speaker orgroup of speakers, or a subject; the total stock of morphemes in alanguage." This suggests it is not that different from a dictionary.

2. A "morpheme" is defined as "a word or a part of a word that has ameaning and that contains no smaller part that has a meaning."

3. In the text analytics realm, we will take a lexicon to be a smaller,special purpose dictionary, containing words that are relevant to thedomain of interest.

4. The computational effort required by text analytics algorithms isdrastically reduced.

18/64

Page 19: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 19/64

Constructing a lexicon

1. By hand. This is an effective technique and the simplest. It calls for ahuman reader who scans a representative sample of textdocuments and culls important words that lend interpretivemeaning.

2. Examine the term document matrix for most frequent words, andpick the ones that have high connotation for the classification taskat hand.

3. Use pre-classified documents in a text corpus. We analyze theseparate groups of documents to find words whose difference infrequency between groups is highest. Such words are likely to bebetter in discriminating between groups.

19/64

Page 20: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 20/64

Lexicons as Word Lists

1. Das and Chen (2007) constructed a lexicon of about 375 words thatare useful in parsing sentiment from stock message boards. Thislexicon also introduced the notion of "negation tagging" into theliterature.

2. Loughran and McDonald (2011):

Taking a sample of 50,115 firm-year 10-Ks from 1994 to 2008, theyfound that almost three-fourths of the words identified as negativeby the Harvard Inquirer dictionary are not typically negative wordsin a financial context.

Therefore, they specifically created separate lists of words by thefollowing attributes of words: negative, positive, uncertainty,litigious, strong modal, and weak modal. Modal words are based onJordan's categories of strong and weak modal words. These wordlists may be downloaded fromhttp://www3.nd.edu/~mcdonald/Word_Lists.html.

·

·

20/64

Page 21: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 21/64

Scoring Text

Mood Scoring using Harvard Inquirer

21/64

Page 22: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 22/64

Language Detection

We may be scraping web sites from many countries and need to detectthe language and then translate it into English for mood scoring. Theuseful package textcat enables us to categorize the language.

library(textcat) text = c("Je suis un programmeur novice.", "I am a programmer who is a novice.", "Sono un programmatore alle prime armi.", "Ich bin ein Anfänger Programmierer", "Soy un programador con errores.") lang = textcat(text) print(lang)

## [1] "french" "english" "italian" "german" "spanish"

22/64

Page 23: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 23/64

Language Translation

library(translate) set.key("AIzaSyDIB8qQTmhLlbPNN38Gs4dXnlN4a7lRrHQ") print(translate(text[1],"fr","en"))

## [[1]] ## [1] "I am a novice programmer."

print(translate(text[3],"it","en"))

## [[1]] ## [1] "I'm a novice programmer."

print(translate(text[4],"de","en"))

## [[1]] ## [1] "I am a beginner programmer"

print(translate(text[5],"es","en"))23/64

Page 24: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 24/64

Text Classification

1. Machine classification is, from a layman's point of view, nothing butlearning by example. In new-fangled modern parlance, it is atechnique in the field of "machine learning".

2. Learning by machines falls into two categories, supervised andunsupervised. When a number of explanatory variables are usedto determine some outcome , and we train an algorithm to do this,we are performing supervised (machine) learning. The outcome may be a dependent variable (for example, the left hand side in alinear regression), or a classification (i.e., discrete outcome).

3. When we only have variables and no separate outcome variable , we perform unsupervised learning. For example, cluster analysis

produces groupings based on the variables of various entities,and is a common example.

XY

Y

XY

X

24/64

Page 25: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 25/64

Classification Algorithms and Metrics

Bayes Classifier: We use the e1071 package.

Support Vector Machines (SVM)

Statistical Significance of the Confusion Matrix

Word count classifiers, adjectives, and adverbs

Fisher's discriminant

Vector-Distance Classifier

Accuracy

Using the RTextTools package

·

·

·

·

·

·

·

·

25/64

Page 26: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 26/64

Sentiment over Time

26/64

Page 27: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 27/64

Stock Sentiment Correlations

27/64

Page 28: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 28/64

Phase Lag Analysis

28/64

Page 29: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 29/64

Disagreement

The metric uses the number of signed buys and sells in the day (basedon a sentiment model) to determine how much difference of opinionthere is in the market. The metric is computed as follows:

where are the numbers of classified buys and sells. Note thatDISAG is bounded between zero and one.

DISAG = 1 +<<<

<<<B + SB + S

<<<<<<

B, S

29/64

Page 30: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 30/64

Precision and Recall

Actual

Predicted Looking for Job Not Looking

Looking for Job 10 2 12

Not Looking 1 16 17

11 18 29

In this case precision is and recall is .

Precision is the fraction of positives identified that are truly positive,and is also known as positive predictive value.

Recall is the proportion of positives that are correctly identified, andis also known as sensitivity.

·

·

10/12 10/11

30/64

Page 31: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 31/64

Readability

"Readability" is a metric of how easy it is to comprehend text.

With a range of 90-100 easily accessible by a 11-year old, 60-70 beingeasy to understand for 13-15 year olds, and 0-30 for universitygraduates.

which gives a number that corresponds to the grade level. - The

Gunning-Fog Index

Flesch Reading Ease Score

·

0.4 Þ [ + 100 Þ ( )]\#words\#sentences

\#complex words\#words

·

206.835 + 1.015 ( ) + 84.6 ( )\#words\#sentences

\#syllables\#words

Flesch-Kincaid Grade Level·

0.39 ( ) + 11.8 ( ) + 15.59\#words

\#sentences\#syllables

\#words 31/64

Page 32: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 32/64

Text Summarization

It is really easy to write a summarizer in a few lines of code. Thefunction below takes in a text array and does the needful. Eachelement of the array is one sentence of the document we wansummarized.

In the function we need to calculate how similar each sentence is toany other one. This could be done using cosine similarity, but here weuse another approach, Jaccard similarity. Given two sentences, Jaccardsimilarity is the ratio of the size of the intersection word set divided bythe size of the union set.

32/64

Page 33: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 33/64

Jaccard Similarity

A document is comprised of sentences , whereeach is a set of words. We compute the pairwise overlap betweensentences using the Jaccard similarity index:

The overlap is the ratio of the size of the intersect of the two word setsin sentences and , divided by the size of the union of the two sets.The similarity score of each sentence is computed as the row sums ofthe Jaccard similarity matrix.

D m , i = 1, 2, . . . , msi

si

= J( , ) = =Jij si sj| B |si sj

| C |si sjJji

si sj

=i ∑j=1

m

Jij

33/64

Page 34: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 34/64

Text Mining Research in Finance - 1

1. Lu, Chen, Chen, Hung, and Li (2010) categorize finance relatedtextual content into three categories: (a) forums, blogs, and wikis;(b) news and research reports; and (c) content generated by firms.

2. Extracting sentiment and other information from messages postedto stock message boards such as Yahoo!, Motley Fool, SiliconInvestor, Raging Bull, etc., see Tumarkin and Whitelaw (2001),Antweiler and Frank (2004), Antweiler and Frank (2005), Das,Martinez-Jerez and Tufano (2005), Das and Chen (2007).

3. Other news sources: Lexis-Nexis, Factiva, Dow Jones News, etc., seeDas, Martinez-Jerez and Tufano (2005); Boudoukh, Feldman, Kogan,Richardson (2012).

34/64

Page 35: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 35/64

Text Mining Research in Finance - 2

1. The Heard on the Street column in the Wall Street Journal has beenused in work by Tetlock (2007), Tetlock, Saar-Tsechansky andMacskassay (2008); see also the use of Wall Street Journal articles byLu, Chen, Chen, Hung, and Li (2010).

2. Thomson-Reuters NewsScope Sentiment Engine (RNSE) based onInfonics/Lexalytics algorithms and varied data on stocks and textfrom internal databases, see Leinweber and Sisk (2011). Zhang andSkiena (2010) develop a market neutral trading strategy using newsmedia such as tweets, over 500 newspapers, Spinn3r RSS feeds, andLiveJournal.

35/64

Page 36: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 36/64

Das and Chen (Management Science 2007)

36/64

Page 37: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 37/64

System

37/64

Page 38: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 38/64

Optimism Score

38/64

Page 39: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 39/64

Ambiguity

39/64

Page 40: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 40/64

Tech Sector Sentiment

40/64

Page 41: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 41/64

Sentiment and Volatility

41/64

Page 42: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 42/64

Sentiment Correlations

42/64

Page 43: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 43/64

Using Twitter and Facebook for MarketPrediction

1. Bollen, Mao, and Zeng (2010): stock direction of DJIA predicted usingtweets, 87.6% accuracy.

2. Bar-Haim, Dinur, Feldman, Fresko and Goldstein (2011) predictstock direction using tweets by detecting and overweighting theopinion of expert investors.

3. Brown (2012) looks at the correlation between tweets and the stockmarket via several measures.

4. Twitter based sentiment developed by Rao and Srivastava (2012) iscorrelated as high as 0.88 for returns.

5. Sprenger and Welpe (2010): tweet bullishness associated withabnormal stock returns; tweet volume predicts trading volume.

43/64

Page 44: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 44/64

Stock Twits Prices

44/64

Page 45: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 45/64

Messages

45/64

Page 46: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 46/64

Sentiment

46/64

Page 47: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 47/64

iSentium iSense S/T

47/64

Page 48: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 48/64

iSentium iSense L/T

48/64

Page 49: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 49/64

Text Mining Corporate Reports

Text analysis is undertaken across companies in a cross-section.

The quality of text in company reports is much better than inmessage postings.

·

·

49/64

Page 50: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 50/64

Using the MD&A

Sharper conclusions may be possible from specific sections of thefiling such as a 10-K.

Loughran and McDonald (2011) examined whether the ManagementDiscussion and Analysis (MD&A) section of the filing was better atproviding tone (sentiment) then the entire 10-K. They found not.

They also showed that using their six tailor-made word lists gavebetter results for detecting tone than did the Harvard Inquirerwords.

Proper word-weighting also improves tone detection.

·

·

·

·

50/64

Page 51: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 51/64

Readability of Financial Reports

Loughran and McDonald (2014) examine the readability of financialdocuments, by surveying at the text in 10-K filings. They computethe Fog index for these documents and compare this to post filingmeasures of the information environment such as volatility ofreturns, dispersion of analysts recommendations.

Fog index does not seem to correlate well with these measures ofthe information environment, the file size of the 10-K is a muchbetter measure and is significantly related to return volatility,earnings forecast errors, and earnings forecast dispersion, afteraccounting for control variates such as size, book-to-market, laggedvolatility, lagged return, and industry effects.

Li (2008) also shows that 10-Ks with high Fog index and longerlength have lower subsequent earnings.

·

·

·

51/64

Page 52: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 52/64

IBM's Midas System

52/64

Page 53: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 53/64

Midas Filing Map

53/64

Page 54: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 54/64

Loan Network

54/64

Page 55: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 55/64

Loan Network

55/64

Page 56: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 56/64

Centrality

56/64

Page 57: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 57/64

Predicting Markets

Wysocki (1999) found that for the 50 top firms in message postingvolume on Yahoo! Finance, message volume predicted next dayabnormal stock returns.

Bagnoli, Beneish, and Watts (1999) examined earnings "whispers",unofficial crowd-sourced forecasts of quarterly earnings from smallinvestors, are more accurate than that of First Call analyst forecasts.

Tumarkin and Whitelaw (2001) examined self-reported sentiment onthe Raging Bull message board and found no predictive content,either of returns or volume.

·

·

·

57/64

Page 58: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 58/64

Bullishness Index

Antweiler and Frank (2004) bullishness index·

B = = ! (+1, +1)+nB nS

+nB nS

R + 1R + 1

The bullishness index does not predict returns, but returns doexplain message posting. More messages are posted in periods ofnegative returns, but this is not a significant relationship.

Message board postings do not predict returns.

Disagreement (measured from postings) induces trading.

Message posting does predict volatility at daily frequencies andintraday.

Messages reflect public information rapidly.

·

·

·

·

·

58/64

Page 59: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 59/64

Possibile Applications for Finance Firms

An illustrative list of applications for finance firms is as follows:

Monitoring corporate buzz. Analyzing textual data to detect, analyze,and understand the more profitable customers or products. Targetingnew clients. Customer retention, which is a huge issue. Text miningcomplaints to prioritize customer remedial action makes a hugedifference, especially in the insurance business. Lending activity -automated management of profiling information for lendingscreening. Market prediction and trading. Risk management.Automated financial analysts. Financial forensics to prevent rogueemployees from inflicting large losses. Fraud detection. Detectingmarket manipulation. Social network analysis of clients. Measuringinstitutional risk from systemic risk.

59/64

Page 60: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 60/64

Topic Modeling

Latent Semantic Analysis (LSA) is an approach for reducing thedimension of the Term-Document Matrix (TDM).

And Latent Dirichlet Allocation (LDA), what does it have to do withLSA? The topicmodels package.

Latent Dirichlet Allocation (LDA) was created by David Blei, AndrewNg, and Michael Jordan in 2003, see their paper titled "LatentDirichlet Allocation" in the Journal of Machine Learning Research, pp993–1022.

·

·

·

60/64

Page 61: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 61/64

LDA as a Probability Model

The simplest way to think about LDA is as a probability model thatconnects documents with words and topics.

Likelihood of the entire Corpus·

p(D) = ∫ p( |α) p( | )p( | ) d∏j=1

M

θj

⎛⎝⎜⎜∏

l=1

K

∑tjl

tl θj wl tl

⎞⎠⎟⎟ θj

The goal is to maximize this likelihood by picking the vector andthe probabilities in the matrix . (Note that were a Dirichletdistribution not used, then we could directly pick values in Matrices and .)

The computation is undertaken using MCMC with Gibbs sampling asshown in the example earlier.

· αB

AB

·

61/64

Page 62: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 62/64

Examples in Finance

62/64

Page 63: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 63/64

Topics

63/64

Page 64: A Brief History of the Short History of Text Mining (in ...files.meetup.com/1225993/A Brief History of the Short History of... · 6/24/2016 A Brief History of the Short History of

6/24/2016 A Brief History of the Short History of Text Mining (in Finance)

file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 64/64

End Note!

Biblio at:http://algo.scu.edu/~sanjivdas/Das_TextAnalyticsInFinance.pdf

64/64


Top Related