text analysis using python
TRANSCRIPT
Fools Rush In?”The fact is, that to do anything in the world worth doing, we must not stand back shivering and thinking of the cold and danger, but jump in and scramble through as well as we can.” - Robert Cushing
Text Mining
Extract high quality information from text
Typically, trends and patterns are analysed using statistical methods – Machine Learning
Common Tasks – entity recognition, sentiment analysis, categorization, clustering
Why Python?
Short, concise text processing NLTK Scipy, numpy, scikit.learn Integration with other languages!
Because when you start your company, YOU get to decide!
Pre-processing Lower casing, stripping extra characters
”Realyyyyyyyyyy!!!!!” Tokenisation (sentences, words)
>>>pktst = nltk.data.load('tokenizers/punkt/english.pickle')
>>>sentences = pktst.tokenize(tweet)
>>>words = nltk.word_tokenize(sent)
Handling Entities
>>> re.sub(r'(^| )@[^ ]+','',tweet).strip()
Removing stopwords >>> stopwords = set([”a”, ”an”, ”the”, ”by”])
>>> ' '.join([w for w in words if w not in stopwords])
Leveraging a Corpus
Simple techniques to analyze a domain Term Frequency to find important entities
”low light photography”, ”travel photography”
tf/idf to find representative terms across domains ”gaming” for TVs
GND to find aliases e.g., ”e700” and ”Samsung e700” vs ”e310” and
”Samsung e310” Yahoo BOSS is great!
Bayes Theorem
Conditional Probability Bayesian classifiers - given features, find
Probability of Class
P C∣F i , ... , F n=P C ∏
iP F i∣C
Features
Characteristics of to-be-classified object For text, typically unigrams, ”n”-grams, POS,
CHUNK, presence in gazeteer
Can be numerical or boolean Crucial to performance of classifier! In NLTK, a dict of feature name to value
>>> {”word1” : 2, ”word2” : 1, ”word1-word2” : 1,
”word2-word3” : 3, ”word1-in-monuments” : False}
Training
”Teach” the classifier how to classify, using training data
>>> from nltk import NaiveBayesClassifier as nbc
>>> trg = generate_features(training_samples)
>>> random.shuffle(trg)
>>> train, test = trg[:int(0.9*len(trg)], trg[int(0.9*len(trg):]
>>> clf = nbc.train(train)
>>> nltk.classify.accuracy(clf, test)
Training, part 2
Measure, tune, iterate
Cross Validation Aim is to find a balance between Precision and
Recall
A Gender Classifier
def gender_features(word):
return {'last_letter': word[-1]}
names = ([(name, 'male') for name in names.words('male.txt')] +
[(name, 'female') for name in names.words('female.txt')])
featuresets = [(gender_features(n), g) for (n,g) in names]
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)
nltk.classify.accuracy(classifier, test_set)
Advice on Training
Its TEDIOUS! Cut the cognitive load
e.g.: ”I love this camera!”
BAD:
0 I PRP B-NP -
1 love VBP B-VP -
2 this DT B-NP -
3 camera NN I-NP 1
Good: (I, love) – NO, (love, this camera) – YES
Dealing with ambiguity
Mechanical Turk, jsonwidget
Other Classifiers
Maximum Entropy Support Vector Machines Conditional Random Fields
All follow the same workflow!
More Examples
Find questions in Tweets unigrams, bigrams, trigrams, parse distance
Recognize ”contextual questions” in discussions ”reco” words, ”thanks” words
”Use-type” recognizer POS, CHUNK, special words, verbs within 3 words
of phrase e.g., ”I want to compose the perfect landscape shot”
Eats, shoots and leaves
Common basic step! POS, chunk, parse
tree Hairy theory
FSA, Morphology, Phonology
n-Grams, Probabilistic models, CFGs
Don't Worry, Be Happy!
No License Required!
What to use? NLTK, or ? Docking with the Evil MotherShip
Jepp, JPype? ► use RPC
>>> from .stanford_corenlp import jsonrpc
>>> server = jsonrpc.TransportTcpIp(...)
>>> result = loads(server.parse(paragraph))
POS, Parse tree, and more (but no chunks?)
Wordnet®
”a large lexical database of English” synsets, hypernyms, hyponyms, gloss synonyms, antonyms
Sentiwordnet Start with candidate ”good” and ”bad” words Expand by recursively following edges Classify using definition
e.g., ”good” → ”better”, ”good” → ”impressive”
Love or Hate?
”I love the screen, but the battery life is poor” Shallow?
3 class classifier
Or Deep? Relationship classifier Extracting candidate subjects Lots of unsolved problems – co-reference, multiple
subjects, negation, etc.
Summary ratings for ”Executive Summaries”
Gender, revisited
solr - search semi-structured text Out of the box text processing utilities – stemming,
tokenising
Highly configurable relevancy fields, weights Sorting! ”sort(term, field, edit) desc” for Levenshtein edit
distance
Gender, revisited
Schema
<field name="name" type="string" />
<field name="name_phoneme" type="phonetic" />
Search: add ”sort=strdist(unknown_name, name, edit) desc)
Python:
for namerec in results:
If namerec.gender == 'Male':
male_score += namerec.match_score
else:
female_score += namerec.match_score
Correctly guesses ”Sheena” and ”Ashish”!!
80/80 precision/recall
Miles to go before I sleep!
Machine Learning on coursera