Text Mining Lab
Adrian and Shawndra
December 4, 2012 (version 1)
Outline
1. Download and Install Python
2. Download and Install NLTK
3. Download and Unzip Project Files
4. Simple Naïve Bayes Classifier
5. Demo: collecting tweets --> evaluation
6. Other things you can do …
Download and Install Python
http://www.python.org/getit/ (latest version 2.7.3)
http://pypi.python.org/pypi/setuptools (install the setup tools for 2.7)
Download and Install NLTK
Install PyYAML: http://pyyaml.org/wiki/PyYAML
Install NUMPY: http://numpy.scipy.org/
Install NLTK: http://pypi.python.org/pypi/nltk
Install MatPlotLib: http://matplotlib.org/
Test Installation
Run python
At the prompt type
>> import nltk
>> import matplotlib
Downloading Models
>> nltk.download()
Open GUI downloader
Select “Models” tab and download:
maxent_ne_chunker
maxent_treebank_pos_tagger
hmm_treebank_pos_tagger
Select “Corpora” tab and download:
stopwords
Alternatively, select “Collections”, click on “all”, and click the button to download everything
Getting Started
Unzip project directory (lab1.zip)
Change to the lab1 directory
Open command window in the “lab1” directory
Windows 7 and later – Hold SHIFT; right-click in directory, select “Open command window here”
Unix/Mac – Open terminal; cd PATH/TO/lab1
Type “python” and then <enter> in terminal
>> import text_processing as tp
>> import nltk
Note: text_processing comes from your lab1 folder
Note: You must work from your lab1 directory
Simple NB Sentiment Classifier
Read in tweets
CALL
>> paths = ['neg_examples.txt', 'pos_examples.txt']
>> documentClasses = ['neg', 'pos']
>> tweetSet = [tp.loadTweetText(p) for p in paths]
SAMPLE OUTPUT
>> len(tweetSet[0]), len(tweetSet[1])
(20000, 40000)
>> tweetSet[1][50]
"@davidarchie hey david !me and my bestfriend are forming a band .could you give us any advice please? it's means a lot for us :)"
Read in tweets (Code)
Reads in a file and treats each line as a tweet, lower-casing the text
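The helper itself isn't transcribed on the slide; a minimal sketch of what loadTweetText might look like (hypothetical name load_tweet_text, modern Python 3 syntax rather than the lab's 2.7) is:

```python
def load_tweet_text(path):
    """Read a file of tweets, one per line, skipping blank lines
    and lower-casing the text."""
    with open(path) as f:
        return [line.strip().lower() for line in f if line.strip()]
```

This mirrors the described behavior: each non-empty line becomes one lower-cased tweet string.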
Tokenize
CALL
>> tokenSet = [tp.tokenizeTweets(tweets) for tweets in tweetSet]
SAMPLE OUTPUT
>> len(tokenSet[1][50])
31
>> tokenSet[1][50]
['@', 'davidarchie', 'hey', 'david', '!', 'me', 'and', 'my', 'bestfriend', 'are', 'forming', 'a', 'band', '.', 'could', 'you', 'give', 'us', 'any', 'advice', 'please', '?', 'it', "'", 's', 'means', 'a', 'lot', 'for', 'us', ':)']
Tokenize (Code)
For each tweet, splits the text on whitespace and splits off punctuation as separate tokens
(nltk.WordPunctTokenizer)
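WordPunctTokenizer's behavior can be approximated with a single regular expression (essentially the alphanumeric-vs-punctuation pattern it is documented to use), which reproduces the sample output above, including keeping ':)' as one token:

```python
import re

def word_punct_tokenize(text):
    """Split text into runs of alphanumeric characters and runs of
    punctuation, approximating nltk.WordPunctTokenizer."""
    return re.findall(r"\w+|[^\w\s]+", text)
```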
Filter out Non-English
CALL
>> englishSet = [tp.filterOnlyEnglish(tokens)
for tokens in tokenSet]
SAMPLE OUTPUT
>> len(englishSet[1][50])
22
>> englishSet[1][50]
['hey', 'david', 'me', 'and', 'my', 'are', 'forming', 'a', 'band', 'could', 'you', 'give', 'us', 'any', 'advice', 'please', 'it', 'means', 'a', 'lot', 'for', 'us']
Filter out Non-English (Code)
Reads in a dictionary file of English words – “wordsEn.txt” – and only keeps tokens in that dictionary
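Once the word list is loaded into a set, the filter itself is one comprehension. A sketch (hypothetical name; the real helper loads “wordsEn.txt” itself, here the dictionary is passed in):

```python
def filter_only_english(tokens, dictionary):
    """Keep only tokens that appear in an English word list.
    `dictionary` is the set of words loaded from a file like wordsEn.txt."""
    return [t for t in tokens if t in dictionary]
```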
Filter out Stopwords
CALL
>> noStopSet = [tp.removeStopwords(tokens,
[':)', ':(']) for tokens in englishSet]
SAMPLE OUTPUT
>> len(noStopSet[1][50])
12
>> noStopSet[1][50]
['hey', 'david', 'forming', 'band', 'could', 'give', 'us', 'advice', 'please', 'means', 'lot', 'us']
Filter out Stopwords (Code)
Loads the stop-word list and removes tokens that appear in it. Additional words can be passed as stop words
via the “addtlStopwords” argument
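A sketch of the removal step (hypothetical name; the real helper loads NLTK's stopwords corpus, here the base list is passed in):

```python
def remove_stopwords(tokens, stopwords, addtl_stopwords=()):
    """Drop tokens found in the stop-word list; addtl_stopwords lets
    the caller supply extra words (e.g. emoticons) to remove as well."""
    stop = set(stopwords) | set(addtl_stopwords)
    return [t for t in tokens if t not in stop]
```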
Stem
CALL
>> stemmedSet = [tp.stemTokens(tokens) for tokens in noStopSet]
SAMPLE OUTPUT
>> len(stemmedSet[1][50])
12
>> stemmedSet[1][50]
['hey', 'david', 'form', 'band', 'could', 'give', 'us', 'advic', 'pleas', 'mean', 'lot', 'us']
Stem (Code)
Loads a Porter stemmer implementation to remove suffixes from tokens. See http://nltk.org/api/nltk.stem.html for more
information on NLTK's stemmers.
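To give a flavor of what suffix stripping does, here is a crude toy sketch. This is NOT the Porter algorithm the lab actually uses (nltk.stem.PorterStemmer); it only strips a few common suffixes:

```python
def toy_stem(token):
    """Very crude suffix stripping, for illustration only.
    The lab's stemTokens uses nltk.stem.PorterStemmer instead."""
    for suffix in ("ing", "ed", "s"):
        # only strip when a reasonably long stem remains
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token
```

Note how 'forming' becomes 'form' but short words like 'us' are left alone; the real Porter stemmer handles many more cases (e.g. 'advice' -> 'advic').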
Make Bags of Words
CALL
>> bagsOfWords = [tp.makeBagOfWords(
tokens, documentClass=docClass)
for docClass, tokens in zip(documentClasses,
stemmedSet)]
SAMPLE OUTPUT
>> bagsOfWords[1][50][0].items()
[('us', 2), ('advic', 1), ('band', 1), ('could', 1), ('david', 1), ('form', 1), ('give', 1), ('hey', 1), ('lot', 1), ('mean', 1), ('pleas', 1)]
Make Bags of Words (Code)
For each tweet, constructs a bag of words (FreqDist) that counts the number of times each token occurs. Setting the bigrams
argument to True will also include bigrams in the bags.
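A sketch of the idea using the standard library's Counter in place of NLTK's FreqDist (names are hypothetical; the real helper returns FreqDist objects):

```python
from collections import Counter

def make_bag_of_words(tweets_tokens, document_class, bigrams=False):
    """For each token list, build a (Counter, class) pair.
    With bigrams=True, adjacent token pairs are counted as well."""
    bags = []
    for tokens in tweets_tokens:
        features = list(tokens)
        if bigrams:
            features += list(zip(tokens, tokens[1:]))
        bags.append((Counter(features), document_class))
    return bags
```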
Make Train and Test
CALL
>> trainSet, testSet = tp.makeTrainAndTest(
reduce(lambda x, y: x + y, bagsOfWords),
cutoff=0.9)
SAMPLE OUTPUT
>> len(trainSet), len(testSet)
(50697, 5633)
Make Train and Test (Code)
Given all of your examples, randomly selects a proportion cutoff of the examples for training and the remaining
1 - cutoff for testing.
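The split can be sketched in a few lines (hypothetical name; the `seed` parameter is an addition here for reproducibility, not something the lab's helper necessarily exposes):

```python
import random

def make_train_and_test(examples, cutoff=0.9, seed=None):
    """Shuffle the examples, then split: the first `cutoff` fraction
    goes to training, the rest to testing."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    split = int(len(examples) * cutoff)
    return examples[:split], examples[split:]
```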
Train Classifier
CALL
>> import nltk.classify.util
>> nbClassifier = tp.trainNBClassifier(trainSet,
testSet)
SAMPLE OUTPUT
>> nbClassifier.show_most_informative_features(n=20)
…..
Train Classifier (Code)
Trains a Naive Bayes classifier over the input training set. Prints out the accuracy over the test set and
the most discriminating tokens.
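The lab wraps nltk.NaiveBayesClassifier; to see what is going on underneath, here is a minimal multinomial Naive Bayes over (bag, label) pairs with add-one smoothing. This is an illustrative sketch, not the NLTK implementation:

```python
import math
from collections import Counter, defaultdict

def train_nb(train_set):
    """train_set: list of (Counter_of_tokens, label) pairs.
    Returns log-priors and add-one-smoothed log-likelihoods."""
    label_counts = Counter(label for _, label in train_set)
    token_counts = defaultdict(Counter)  # label -> token -> count
    vocab = set()
    for bag, label in train_set:
        token_counts[label].update(bag)
        vocab.update(bag)
    n = len(train_set)
    priors = {l: math.log(c / n) for l, c in label_counts.items()}
    V = len(vocab)
    likelihoods = {}
    for label in label_counts:
        total = sum(token_counts[label].values())
        likelihoods[label] = {
            t: math.log((token_counts[label][t] + 1) / (total + V))
            for t in vocab
        }
    return priors, likelihoods

def classify_nb(model, bag):
    """Pick the label maximizing log-prior + sum of token log-likelihoods.
    Tokens unseen at training time contribute nothing."""
    priors, likelihoods = model
    scores = {}
    for label, prior in priors.items():
        scores[label] = prior + sum(
            count * likelihoods[label].get(token, 0.0)
            for token, count in bag.items()
        )
    return max(scores, key=scores.get)
```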
Twitter Collection Demo
Directions for Collection Demo
Now, try out twitter_kw_stream.py to collect more tweets over a couple of different “classes”.
Some possible tokens (with high volume)
apple google cat dog pizza “ice cream”
Open a new terminal window in the same directory
For each keyword KW you search for, type:
python twitter_kw_stream.py --keywords=KW
Wait a minute or so (until you retrieve about 100 tweets)
Accuracy given training data size
Assuming keywords searched for were:
apple google cat dog pizza “ice cream”
In the terminal already running the Python interpreter:
>> paths = ['apple.txt', 'google.txt', 'cat.txt', 'dog.txt',
'pizza.txt', 'ice cream.txt']
>> addtlStopwords = ['apple', 'google', 'cat', 'dog', 'pizza', 'ice', 'cream']
>> cutoffs, uniAccs, biAccs = tp.plotAccuracy(paths,
addtlStopwords=addtlStopwords)
If matplotlib installed correctly, this should display the accuracy of the
NB classifier while varying the amount of training data, with and without using bigrams. The plot is saved to “nbPlot.png”
Other things you can do
Get Document Similarity
CALL
>> docs = tp.loadDocuments(paths)
>> sims = tp.getDocumentSimilarities(paths,
[p.replace('.txt', '') for p in paths])
SAMPLE OUTPUT
>> sims[('apple', 'dog')]
0.30735795122824466
>> sims[('apple', 'google')]
0.44204540065105324
Get Document Similarity (Code)
Calculates the cosine similarity for each pair of bags of words: the dot product of the two frequency vectors after
normalizing each to a unit vector.
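The computation can be sketched directly over Counter bags (hypothetical name; using a Counter means missing tokens count as zero in the dot product):

```python
import math

def cosine_similarity(bag_a, bag_b):
    """Cosine of the angle between two term-frequency vectors:
    dot product divided by the product of the vector lengths."""
    dot = sum(count * bag_b[token] for token, count in bag_a.items())
    norm_a = math.sqrt(sum(c * c for c in bag_a.values()))
    norm_b = math.sqrt(sum(c * c for c in bag_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```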
Calculate TF-IDF
CALL
>> tfIdfs = tp.getTfIdfs(docs)
SAMPLE OUTPUT
>> for path, tfIdf in zip(paths, tfIdfs):
… print 'Top 10 TF-IDF for %s: %s' %(path,
'\n'.join([str(t) for t in tfIdf]))
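A sketch of the underlying computation, assuming the common tf * log(N/df) weighting (the exact variant tp.getTfIdfs uses is not shown on the slides, and sorting out the top 10 is omitted here):

```python
import math
from collections import Counter

def tf_idf(docs_tokens):
    """docs_tokens: list of token lists, one per document.
    Returns, per document, a dict token -> tf * idf,
    where idf = log(N / document frequency)."""
    n = len(docs_tokens)
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))  # count each token once per document
    result = []
    for tokens in docs_tokens:
        tf = Counter(tokens)
        result.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return result
```

Tokens that occur in every document get idf = 0, so only distinctive tokens receive positive weight.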
Part-of-Speech Tag
CALL
>> posSet = [[tp.partOfSpeechTag(ts) for ts in
classTokens[:100]]
for classTokens in tokenSet]
SAMPLE OUTPUT
>> posSet[1][50]
[('@', 'IN'), ('davidarchie', 'NN'), ('hey', 'NN'), ('david', 'VBD'), ('!', '.'), ('me', 'PRP'), ('and', 'CC'), ('my', 'PRP$'), ('bestfriend', 'NN'), ('are', 'VBP'), ('forming', 'VBG'), ('a', 'DT'), ('band', 'NN'), ('.', '.'), ('could', 'MD'), ('you', 'PRP'), ('give', 'VB'), ('us', 'PRP'), ('any', 'DT'), ('advice', 'NN'), ('please', 'NN'), ('?', '.'), ('it', 'PRP'), ("'", "''"), ('s', 'VBZ'), ('means', 'NNS'), ('a', 'DT'), ('lot', 'NN'), ('for', 'IN'), ('us', 'PRP'), (':)', ':')]
Part-of-Speech Tag (Code)
Very simple: as long as you have a list of tokens, you can just call nltk.pos_tag(tokens) to tag them with parts of speech.
Find Named-Entities
CALL
>> neSet = [[tp.getNamedEntityTree(ts) for ts in
classTokens[:100]]
for classTokens in tokenSet]
SAMPLE OUTPUT
>> neSet[1][50]
Tree('S', [('@', 'IN'), ('davidarchie', 'NN'), ('hey', 'NN'), ('david', 'VBD'), ('!', '.'), ('me', 'PRP'), ('and', 'CC'), ('my', 'PRP$'), ('bestfriend', 'NN'), ('are', 'VBP'), ('forming', 'VBG'), ('a', 'DT'), ('band', 'NN'), ('.', '.'), ('could', 'MD'), ('you', 'PRP'), ('give', 'VB'), ('us', 'PRP'), ('any', 'DT'), ('advice', 'NN'), ('please', 'NN'), ('?', '.'), ('it', 'PRP'), ("'", "''"), ('s', 'VBZ'), ('means', 'NNS'), ('a', 'DT'), ('lot', 'NN'), ('for', 'IN'), ('us', 'PRP'), (':)', ':')])
Find Named-Entities (Code)
Similarly simple: just call two NLTK functions. Note, however, that the performance of the POS tagger and NE chunker is
quite poor on Twitter messages.