A Brief History of the Short History of Text Mining (in ...files. Brief History of the Short History of...6/24/2016 A Brief History of the Short History of Text Mining ... Text as Data 1. Big Text: ... extracting data that is converted into information of many types. 3

Download A Brief History of the Short History of Text Mining (in ...files.  Brief History of the Short History of...6/24/2016 A Brief History of the Short History of Text Mining ... Text as Data 1. Big Text: ... extracting data that is converted into information of many types. 3

Post on 20-Mar-2018

214 views

Category:

Documents

2 download

Embed Size (px)

TRANSCRIPT

<ul><li><p>6/24/2016 A Brief History of the Short History of Text Mining (in Finance)</p><p>file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 1/64</p><p>A Brief History of the ShortHistory of Text Mining (inFinance)Sanjiv Ranjan Das and Karthik Mokashi </p></li><li><p>6/24/2016 A Brief History of the Short History of Text Mining (in Finance)</p><p>file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 2/64</p><p>Reference monograph</p><p>Text expands the universe of data by many-fold. See my monographon text mining in finance at:http://algo.scu.edu/~sanjivdas/Das_TextAnalyticsInFinance.pdf</p><p>This covers some of the content of this presentation.</p><p>2/64</p><p>http://algo.scu.edu/~sanjivdas/Das_TextAnalyticsInFinance.pdf</p></li><li><p>6/24/2016 A Brief History of the Short History of Text Mining (in Finance)</p><p>file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 3/64</p><p>Text as Data</p><p>1. Big Text: there is more textual data than numerical data.</p><p>2. Text is versatile. Nuances and behavioral expressions that are notconveyed with numbers.</p><p>3. Text contains emotive content. Sentiment analysis. Admati-Pfleiderer 2001; DeMarzo et al 2003; Antweiler-Frank 2004, 2005;Das-Chen 2007; Tetlock 2007; Tetlock et al 2008; Mitra et al 2008;Leinweber-Sisk 2010.</p><p>4. Text contains opinions and connections. Das et al 2005; Das andSisk 2005; Godes et al 2005; Li 2006; Hochberg et al 2007.</p><p>5. Numbers aggregate; text categorizes (topics).</p><p>6. Text can be generated.</p><p>3/64</p></li><li><p>6/24/2016 A Brief History of the Short History of Text Mining (in Finance)</p><p>file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 4/64</p><p>Definition: Text-Mining</p><p>1. Text mining is the large-scale, automated processing of plain textlanguage in digital form to extract data that is converted into usefulquantitative or qualitative information.</p><p>2. Text mining is automated on big data that is not amenable tohuman processing within reasonable time frames. It entailsextracting data that is converted into information of many types.</p><p>3. Simple: Text mining may be simple as in key word searches andcounts.</p><p>4. Complicated: It may require language parsing and complex rules forinformation extraction.</p><p>5. Structured text, such as the information in forms and some kinds ofweb pages.</p><p>6. Unstructured text is a much harder endeavor.</p><p>4/64</p></li><li><p>6/24/2016 A Brief History of the Short History of Text Mining (in Finance)</p><p>file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 5/64</p><p>Data and Algorithms</p><p>5/64</p></li><li><p>6/24/2016 A Brief History of the Short History of Text Mining (in Finance)</p><p>file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 6/64</p><p>Text Handling</p><p>Text Extraction</p><p>String Parsing</p><p>String Detection</p><p>Text cleanup</p><p>XML Package</p><p>Regular Expressions</p><p>Using APIs (Twitter, Facebook, Yelp)</p><p>6/64</p></li><li><p>6/24/2016 A Brief History of the Short History of Text Mining (in Finance)</p><p>file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 7/64</p><p>The Response to News (Das, Martinez-Jerez,and Tufano (FM 2005))</p><p>7/64</p></li><li><p>6/24/2016 A Brief History of the Short History of Text Mining (in Finance)</p><p>file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 8/64</p><p>Breakdown of News Flow</p><p>8/64</p></li><li><p>6/24/2016 A Brief History of the Short History of Text Mining (in Finance)</p><p>file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 9/64</p><p>Frequency of Postings</p><p>9/64</p></li><li><p>6/24/2016 A Brief History of the Short History of Text Mining (in Finance)</p><p>file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 10/64</p><p>Weekly Posting</p><p>10/64</p></li><li><p>6/24/2016 A Brief History of the Short History of Text Mining (in Finance)</p><p>file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 11/64</p><p>Intraday Posting</p><p>11/64</p></li><li><p>6/24/2016 A Brief History of the Short History of Text Mining (in Finance)</p><p>file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 12/64</p><p>Number of Characters per Posting</p><p>12/64</p></li><li><p>6/24/2016 A Brief History of the Short History of Text Mining (in Finance)</p><p>file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 13/64</p><p>Text Mining with the "tm" Package</p><p>1. Functions such as {readDOC()}, {readPDF()}, etc., for reading DOCand PDF files, the package makes accessing various file formatseasy.</p><p>2. Text mining involves applying functions to many text documents, acorpus. Ability to operate on the entire set of documents at one go.</p><p>3. Remove Stopwords, Punctuation, Numbers. Create a Bag of Words(BOW)</p><p>4. Term Document Matrix (TDM): An entire library of text inside asingle matrix. Analysis and searching documents. In search engines,topic analysis, and classification (spam filtering).</p><p>5. Term Frequency - Inverse Document Frequency (TF-IDF)</p><p>13/64</p></li><li><p>6/24/2016 A Brief History of the Short History of Text Mining (in Finance)</p><p>file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 14/64</p><p>Wordclouds</p><p>Wordlcouds are interesting ways in which to represent text. They givean instant visual summary. The wordcloud package in R may be usedto create your own wordclouds.</p><p>14/64</p></li><li><p>6/24/2016 A Brief History of the Short History of Text Mining (in Finance)</p><p>file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 15/64</p><p>Document Similarity</p><p>In this segment we will learn some popular functions on text that areused in practice. One of the first things we like to do is to find similartext or like sentences (think of web search as one application). Sincedocuments are vectors in the TDM, we may want to find the closestvectors or compute the distance between vectors.</p><p>where , is the dot product of with itself, also known asthe norm of . This gives the cosine of the angle between the twovectors and is zero for orthogonal vectors and 1 for identical vectors.</p><p>cos() = A B||A|| ||B||</p><p>||A|| = A A AA</p><p>15/64</p></li><li><p>6/24/2016 A Brief History of the Short History of Text Mining (in Finance)</p><p>file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 16/64</p><p>Dictionaries</p><p>1. Standard Dictionaries: www.dictionary.com, and www.merriam-webster.com.</p><p>2. The Harvard General Inquirer:http://www.wjh.harvard.edu/~inquirer/</p><p>3. Computer dictionary: http://www.hyperdictionary.com/computerthat contains about 14,000 computer related words, such as "byte"or "hyperlink".</p><p>4. Math dictionary, such ashttp://www.amathsdictionaryforkids.com/dictionary.html.</p><p>5. Medical dictionary, see http://www.hyperdictionary.com/medical.</p><p>16/64</p><p>http://www.wjh.harvard.edu/~inquirer/http://www.hyperdictionary.com/computerhttp://www.amathsdictionaryforkids.com/dictionary.htmlhttp://www.hyperdictionary.com/medical</p></li><li><p>6/24/2016 A Brief History of the Short History of Text Mining (in Finance)</p><p>file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 17/64</p><p>Dictionaries - II</p><p>1. Internet lingo dictionaries.http://www.netlingo.com/dictionary/all.php for words such as"2BZ4UQT" which stands for "too busy for you cutey" (LOL).</p><p>2. Associative dictionaries are also useful when trying to find context,as the word may be related to a concept, identified using adictionary such as http://www.visuwords.com/. This dictionarydoubles up as a thesaurus.</p><p>3. Value dictionaries deal with values and may be useful when onlyaffect (positive or negative) is insufficient for scoring text. TheLasswell Value Dictionaryhttp://www.wjh.harvard.edu/~inquirer/lasswell.htm may be used toscore the loading of text on the eight basic value categories: Wealth,Power, Respect, Rectitude, Skill, Enlightenment, Affection, and Wellbeing.</p><p>17/64</p><p>http://www.netlingo.com/dictionary/all.phphttp://www.visuwords.com/http://www.wjh.harvard.edu/~inquirer/lasswell.htm</p></li><li><p>6/24/2016 A Brief History of the Short History of Text Mining (in Finance)</p><p>file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 18/64</p><p>Lexicons</p><p>1. A lexicon is defined by Webster's as "a book containing analphabetical arrangement of the words in a language and theirdefinitions; the vocabulary of a language, an individual speaker orgroup of speakers, or a subject; the total stock of morphemes in alanguage." This suggests it is not that different from a dictionary.</p><p>2. A "morpheme" is defined as "a word or a part of a word that has ameaning and that contains no smaller part that has a meaning."</p><p>3. In the text analytics realm, we will take a lexicon to be a smaller,special purpose dictionary, containing words that are relevant to thedomain of interest.</p><p>4. The computational effort required by text analytics algorithms isdrastically reduced.</p><p>18/64</p></li><li><p>6/24/2016 A Brief History of the Short History of Text Mining (in Finance)</p><p>file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 19/64</p><p>Constructing a lexicon</p><p>1. By hand. This is an effective technique and the simplest. It calls for ahuman reader who scans a representative sample of textdocuments and culls important words that lend interpretivemeaning.</p><p>2. Examine the term document matrix for most frequent words, andpick the ones that have high connotation for the classification taskat hand.</p><p>3. Use pre-classified documents in a text corpus. We analyze theseparate groups of documents to find words whose difference infrequency between groups is highest. Such words are likely to bebetter in discriminating between groups.</p><p>19/64</p></li><li><p>6/24/2016 A Brief History of the Short History of Text Mining (in Finance)</p><p>file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 20/64</p><p>Lexicons as Word Lists</p><p>1. Das and Chen (2007) constructed a lexicon of about 375 words thatare useful in parsing sentiment from stock message boards. Thislexicon also introduced the notion of "negation tagging" into theliterature.</p><p>2. Loughran and McDonald (2011):</p><p>Taking a sample of 50,115 firm-year 10-Ks from 1994 to 2008, theyfound that almost three-fourths of the words identified as negativeby the Harvard Inquirer dictionary are not typically negative wordsin a financial context.</p><p>Therefore, they specifically created separate lists of words by thefollowing attributes of words: negative, positive, uncertainty,litigious, strong modal, and weak modal. Modal words are based onJordan's categories of strong and weak modal words. These wordlists may be downloaded fromhttp://www3.nd.edu/~mcdonald/Word_Lists.html.</p><p>20/64</p><p>http://www3.nd.edu/~mcdonald/Word_Lists.html</p></li><li><p>6/24/2016 A Brief History of the Short History of Text Mining (in Finance)</p><p>file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 21/64</p><p>Scoring Text</p><p>Mood Scoring using Harvard Inquirer</p><p>21/64</p></li><li><p>6/24/2016 A Brief History of the Short History of Text Mining (in Finance)</p><p>file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 22/64</p><p>Language Detection</p><p>We may be scraping web sites from many countries and need to detectthe language and then translate it into English for mood scoring. Theuseful package textcat enables us to categorize the language.</p><p>library(textcat) text = c("Je suis un programmeur novice.", "I am a programmer who is a novice.", "Sono un programmatore alle prime armi.", "Ich bin ein Anfnger Programmierer", "Soy un programador con errores.") lang = textcat(text) print(lang)</p><p>## [1] "french" "english" "italian" "german" "spanish"</p><p>22/64</p></li><li><p>6/24/2016 A Brief History of the Short History of Text Mining (in Finance)</p><p>file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 23/64</p><p>Language Translation</p><p>library(translate) set.key("AIzaSyDIB8qQTmhLlbPNN38Gs4dXnlN4a7lRrHQ") print(translate(text[1],"fr","en"))</p><p>## [[1]] ## [1] "I am a novice programmer."</p><p>print(translate(text[3],"it","en"))</p><p>## [[1]] ## [1] "I'm a novice programmer."</p><p>print(translate(text[4],"de","en"))</p><p>## [[1]] ## [1] "I am a beginner programmer"</p><p>print(translate(text[5],"es","en"))23/64</p></li><li><p>6/24/2016 A Brief History of the Short History of Text Mining (in Finance)</p><p>file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 24/64</p><p>Text Classification</p><p>1. Machine classification is, from a layman's point of view, nothing butlearning by example. In new-fangled modern parlance, it is atechnique in the field of "machine learning".</p><p>2. Learning by machines falls into two categories, supervised andunsupervised. When a number of explanatory variables are usedto determine some outcome , and we train an algorithm to do this,we are performing supervised (machine) learning. The outcome may be a dependent variable (for example, the left hand side in alinear regression), or a classification (i.e., discrete outcome).</p><p>3. When we only have variables and no separate outcome variable , we perform unsupervised learning. For example, cluster analysis</p><p>produces groupings based on the variables of various entities,and is a common example.</p><p>XY</p><p>Y</p><p>XY</p><p>X</p><p>24/64</p></li><li><p>6/24/2016 A Brief History of the Short History of Text Mining (in Finance)</p><p>file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 25/64</p><p>Classification Algorithms and Metrics</p><p>Bayes Classifier: We use the e1071 package.</p><p>Support Vector Machines (SVM)</p><p>Statistical Significance of the Confusion Matrix</p><p>Word count classifiers, adjectives, and adverbs</p><p>Fisher's discriminant</p><p>Vector-Distance Classifier</p><p>Accuracy</p><p>Using the RTextTools package</p><p>25/64</p></li><li><p>6/24/2016 A Brief History of the Short History of Text Mining (in Finance)</p><p>file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 26/64</p><p>Sentiment over Time</p><p>26/64</p></li><li><p>6/24/2016 A Brief History of the Short History of Text Mining (in Finance)</p><p>file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 27/64</p><p>Stock Sentiment Correlations</p><p>27/64</p></li><li><p>6/24/2016 A Brief History of the Short History of Text Mining (in Finance)</p><p>file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 28/64</p><p>Phase Lag Analysis</p><p>28/64</p></li><li><p>6/24/2016 A Brief History of the Short History of Text Mining (in Finance)</p><p>file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 29/64</p><p>Disagreement</p><p>The metric uses the number of signed buys and sells in the day (basedon a sentiment model) to determine how much difference of opinionthere is in the market. The metric is computed as follows:</p><p>where are the numbers of classified buys and sells. Note thatDISAG is bounded between zero and one.</p><p>DISAG = 1 </p><p>B SB + S</p><p>B, S</p><p>29/64</p></li><li><p>6/24/2016 A Brief History of the Short History of Text Mining (in Finance)</p><p>file:///Users/srdas/Google%20Drive/Papers/TextMining/BriefHistoryTextMining_inFinance.html#20 30/64</p><p>Precision and Recall</p><p>Actual</p><p>Predicted Looking for Job Not Looking</p><p>Looking for Job 10 2 12</p><p>N...</p></li></ul>