Text Analysis Using Python

Fools Rush In?

"The fact is, that to do anything in the world worth doing, we must not stand back shivering and thinking of the cold and danger, but jump in and scramble through as well as we can." - Robert Cushing


Machine Learning everywhere

Users' expectations of a "standard experience"

Many Resources!

Text Mining

Extract high quality information from text

Typically, trends and patterns are analysed using statistical methods – Machine Learning

Common Tasks – entity recognition, sentiment analysis, categorization, clustering

Why Python?

Short, concise text processing

NLTK

SciPy, NumPy, scikit-learn

Integration with other languages!

Because when you start your company, YOU get to decide!

Pre-processing

Lower casing, stripping extra characters

    "Realyyyyyyyyyy!!!!!"

Tokenisation (sentences, words)

>>> import nltk
>>> pktst = nltk.data.load('tokenizers/punkt/english.pickle')
>>> sentences = pktst.tokenize(tweet)
>>> words = [w for sent in sentences for w in nltk.word_tokenize(sent)]
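The lower-casing and character-stripping step can be sketched with the standard library alone. This is a minimal illustration, not the talk's own code: the function name `normalise` and the `max_repeat` parameter are assumptions made for the example.

```python
import re

def normalise(text, max_repeat=2):
    """Lower-case text and squeeze long runs of one repeated character."""
    text = text.lower()
    # Match any character followed by max_repeat or more copies of itself,
    # i.e. a run longer than max_repeat, and cut the run down to max_repeat.
    pattern = re.compile(r'(.)\1{%d,}' % max_repeat)
    text = pattern.sub(lambda m: m.group(1) * max_repeat, text)
    # Collapse any remaining whitespace runs to single spaces.
    return re.sub(r'\s+', ' ', text).strip()

print(normalise("Realyyyyyyyyyy!!!!!"))  # prints: realyy!!
```

Keeping two copies of a repeated character rather than one is a judgment call: it avoids turning "cool" into "col" while still taming emphatic spellings.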

Handling Entities

>>> import re
>>> re.sub(r'(^| )@[^ ]+', '', tweet).strip()

Removing stopwords

>>> stopwords = set(["a", "an", "the", "by"])
>>> ' '.join([w for w in words if w not in stopwords])
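The two snippets above (dropping @mentions, then dropping stopwords) can be chained into one helper. A rough sketch only: `clean_tweet` is a name invented for this example, the sample tweet is made up, and whitespace splitting stands in for a real tokeniser.

```python
import re

STOPWORDS = {"a", "an", "the", "by"}  # the same toy stopword list as above

def clean_tweet(tweet, stopwords=STOPWORDS):
    """Drop @mentions, then drop stopwords from the remaining tokens."""
    no_mentions = re.sub(r'(^| )@[^ ]+', '', tweet).strip()
    tokens = no_mentions.lower().split()  # crude whitespace tokenisation
    return [t for t in tokens if t not in stopwords]

print(clean_tweet("@alice The photo by @bob is a stunner"))
# prints: ['photo', 'is', 'stunner']
```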

Leveraging a Corpus

Simple techniques to analyze a domain

Term Frequency to find important entities

    "low light photography", "travel photography"

tf/idf to find representative terms across domains

    "gaming" for TVs

GND to find aliases

    e.g., "e700" and "Samsung e700" vs "e310" and "Samsung e310"

Yahoo BOSS is great!
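As a rough illustration of the term-frequency and tf/idf ideas, here is a small pure-Python sketch. It treats each domain as one bag of tokens and uses the common `tf * log(N / df)` weighting; that weighting, the function names, and the toy data are all assumptions for the example, not details from the talk.

```python
import math
from collections import Counter

def term_frequencies(tokens):
    """Relative frequency of each term within one domain."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: n / total for term, n in counts.items()}

def tf_idf(domains):
    """domains: dict mapping a domain name to its list of tokens."""
    n_domains = len(domains)
    # Document frequency: in how many domains does each term appear?
    df = Counter()
    for tokens in domains.values():
        df.update(set(tokens))
    scores = {}
    for name, tokens in domains.items():
        tf = term_frequencies(tokens)
        scores[name] = {t: f * math.log(n_domains / df[t]) for t, f in tf.items()}
    return scores

# Toy corpora, invented for the example.
domains = {
    "tv": "gaming picture gaming screen".split(),
    "camera": "photography picture lens screen".split(),
}
scores = tf_idf(domains)
top_tv = max(scores["tv"], key=scores["tv"].get)
print(top_tv)  # prints: gaming
```

Terms that occur in every domain ("picture", "screen") score zero, which is what lets a domain-specific term such as "gaming" stand out.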

Supervised Classification

Bayes Theorem

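Conditional Probability

Bayesian classifiers - given features, find the Probability of Class

$$P(C \mid F_1, \ldots, F_n) \propto P(C) \prod_{i=1}^{n} P(F_i \mid C)$$

(features assumed independent given the class; the normalising term is the same for every class, so it is dropped)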
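The rule above can be turned into a tiny classifier by hand. The sketch below is a from-scratch multinomial variant with add-one smoothing, working in log space to avoid underflow; it is meant only to make the formula concrete and is not NLTK's implementation. The function names and training data are invented for the example.

```python
import math
from collections import Counter, defaultdict

def train_nb(samples):
    """samples: list of (list_of_features, label) pairs."""
    label_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocab = set()
    for features, label in samples:
        label_counts[label] += 1
        feature_counts[label].update(features)
        vocab.update(features)
    return label_counts, feature_counts, vocab

def classify_nb(model, features):
    label_counts, feature_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, count in label_counts.items():
        score = math.log(count / total)  # log P(C)
        denom = sum(feature_counts[label].values()) + len(vocab)
        for f in features:
            # log P(F_i | C), with add-one smoothing for unseen features
            score += math.log((feature_counts[label][f] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train_nb([
    (["love", "great", "screen"], "pos"),
    (["great", "battery"], "pos"),
    (["poor", "battery"], "neg"),
    (["hate", "poor", "screen"], "neg"),
])
print(classify_nb(model, ["love", "great"]))  # prints: pos
```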


Features

Characteristics of the to-be-classified object

For text, typically unigrams, n-grams, POS, CHUNK, presence in gazetteer

Can be numerical or boolean

Crucial to performance of classifier!

In NLTK, a dict of feature name to value

>>> {"word1": 2, "word2": 1, "word1-word2": 1,
...  "word2-word3": 3, "word1-in-monuments": False}
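A feature dict of this shape could be produced by a small extractor like the one below. This is a sketch under assumptions: the function name `extract_features`, the `MONUMENTS` gazetteer, and the key naming scheme are hypothetical and merely mirror the dict shown above.

```python
from collections import Counter

MONUMENTS = {"stonehenge", "colosseum"}  # hypothetical single-word gazetteer

def extract_features(tokens, gazetteer=MONUMENTS):
    """Build a feature-name -> value dict from a list of tokens."""
    features = dict(Counter(tokens))  # unigram counts
    # Bigram counts, keyed "w1-w2".
    for w1, w2 in zip(tokens, tokens[1:]):
        key = w1 + "-" + w2
        features[key] = features.get(key, 0) + 1
    # Boolean gazetteer-membership feature for each distinct token.
    for w in set(tokens):
        features[w + "-in-monuments"] = w in gazetteer
    return features

print(extract_features(["visit", "stonehenge", "visit"]))
```

Each (features, label) pair built this way can be passed straight to an NLTK classifier's `train` method, since NLTK accepts plain dicts as feature sets.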


Training

"Teach" the classifier how to classify, using training data

>>> import random
>>> import nltk
>>> from nltk import NaiveBayesClassifier as nbc
>>> trg = generate_features(training_samples)
>>> random.shuffle(trg)
>>> split = int(0.9 * len(trg))  # 90% train, 10% test
>>> train, test = trg[:split], trg[split:]
>>> clf = nbc.train(train)
>>> nltk.classify.accuracy(clf, test)

Training, part 2

Measure, tune, iterate

Cross Validation

Aim is to find a balance between Precision and Recall
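To make "measure" concrete, here is a small standard-library sketch of precision/recall for one class, plus a simple k-fold splitter for cross validation. The function names and sample labels are made up for the example; a real run would plug in the classifier and feature sets from the training step.

```python
def precision_recall(gold, predicted, positive):
    """Precision and recall for one class, given parallel label lists."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if p == positive and g == positive)
    fp = sum(1 for g, p in pairs if p == positive and g != positive)
    fn = sum(1 for g, p in pairs if p != positive and g == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def k_fold(samples, k):
    """Yield (train, test) splits; each sample is tested exactly once."""
    for i in range(k):
        test = samples[i::k]
        train = [s for j, s in enumerate(samples) if j % k != i]
        yield train, test

gold = ["q", "q", "q", "other", "other"]
pred = ["q", "q", "other", "q", "other"]
p, r = precision_recall(gold, pred, "q")
print(round(p, 3), round(r, 3))  # prints: 0.667 0.667
```

Shuffle the samples before splitting, as in the training snippet, so that each fold contains a mix of classes.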


A Gender Classifier

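import random
import nltk
from nltk.corpus import names

def gender_features(word):
    # Use only the final letter of the name as a feature.
    return {'last_letter': word[-1]}

labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])
random.shuffle(labeled_names)  # otherwise the first 500 names are all male

featuresets = [(gender_features(n), g) for (n, g) in labeled_names]
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)
nltk.classify.accuracy(classifier, test_set)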

Advice on Training

It's TEDIOUS!

Cut the cognitive load

e.g.: "I love this camera!"


0  I       PRP  B-NP  -
1  love    VBP  B-VP  -
2  this    DT   B-NP  -
3  camera  NN   I-NP  1

Good: (I, love) – NO, (love, this camera) – YES
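One way to present annotators with whole phrases rather than raw token rows is to group the B-/I- chunk tags first. The sketch below assumes the five whitespace-separated columns shown above (index, word, POS, chunk tag, and a last column whose meaning the slide does not explain); the function name `iob_chunks` is invented for the example.

```python
ROWS = """\
0 I PRP B-NP -
1 love VBP B-VP -
2 this DT B-NP -
3 camera NN I-NP 1
"""

def iob_chunks(rows):
    """Group tokens into (chunk_type, phrase) pairs using B-/I- tags."""
    chunks = []
    for line in rows.strip().splitlines():
        _idx, word, _pos, tag, _extra = line.split()
        prefix, _, kind = tag.partition("-")
        if prefix == "I" and chunks and chunks[-1][0] == kind:
            chunks[-1][1].append(word)      # continue the open chunk
        elif prefix in ("B", "I"):
            chunks.append((kind, [word]))   # start a new chunk
        # any other tag (e.g. "O") is outside a chunk and is skipped
    return [(kind, " ".join(words)) for kind, words in chunks]

print(iob_chunks(ROWS))
# prints: [('NP', 'I'), ('VP', 'love'), ('NP', 'this camera')]
```

With chunks in hand, candidate pairs such as (love, this camera) can be shown as single yes/no questions.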

Dealing with ambiguity

Mechanical Turk, jsonwidget

Other Classifiers

Maximum Entropy

Support Vector Machines

Conditional Random Fields

All follow the same workflow!

More Examples

Find questions in Tweets

    unigrams, bigrams, trigrams, parse distance

Recognize "contextual questions" in discussions

    "reco" words, "thanks" words

"Use-type" recognizer

    POS, CHUNK, special words, verbs within 3 words of phrase

    e.g., "I want to compose the perfect landscape shot"

Eats, shoots and leaves

Common basic step!

POS, chunk, parse tree

Hairy theory

FSA, Morphology, Phonology

n-Grams, Probabilistic models, CFGs

Don't Worry, Be Happy!

No License Required!

What to use? NLTK, or ?

Docking with the Evil MotherShip

Jepp, JPype? ► use RPC

>>> from json import loads
>>> from .stanford_corenlp import jsonrpc
>>> server = jsonrpc.TransportTcpIp(...)
>>> result = loads(server.parse(paragraph))

POS, Parse tree, and more (but no chunks?)


WordNet

"a large lexical database of English"

synsets, hypernyms, hyponyms, gloss

synonyms, antonyms

SentiWordNet

Start with candidate "good" and "bad" words

Expand by recursively following edges

Classify using definition

e.g., "good" → "better", "good" → "impressive"
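The seed-and-expand idea can be sketched on a toy graph. Everything here is illustrative: the `EDGES` dict is a hand-made stand-in for relations that would really be read from WordNet, the traversal is an iterative breadth-first search rather than literal recursion, and the "classify using definition" step is left out.

```python
from collections import deque

# Hypothetical "related word" edges; a real run would pull these from WordNet.
EDGES = {
    "good": ["better", "impressive"],
    "better": ["superior"],
    "bad": ["poor", "awful"],
    "poor": ["inferior"],
}

def expand(seeds, edges, max_depth=2):
    """Collect every word reachable from the seeds within max_depth hops."""
    seen = set(seeds)
    queue = deque((s, 0) for s in seeds)
    while queue:
        word, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow edges beyond the depth limit
        for neighbour in edges.get(word, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, depth + 1))
    return seen

positive = expand(["good"], EDGES)
negative = expand(["bad"], EDGES)
print(sorted(positive))  # prints: ['better', 'good', 'impressive', 'superior']
```

The depth limit matters in practice: following edges too far tends to drift away from the seed's polarity.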

Love or Hate?

"I love the screen, but the battery life is poor"

Shallow?

    3-class classifier

Or Deep?

    Relationship classifier

    Extracting candidate subjects

    Lots of unsolved problems – co-reference, multiple subjects, negation, etc.

Summary ratings for "Executive Summaries"

Gender, revisited

Solr - search semi-structured text

Out of the box text processing utilities – stemming,

Highly configurable relevancy

    fields, weights

Sorting!

    "sort(term, field, edit) desc" for Levenshtein edit distance


Gender, revisited


<field name="name" type="string" />

<field name="name_phoneme" type="phonetic" />

Search: add "sort=strdist(unknown_name, name, edit) desc"


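male_score = female_score = 0.0
for namerec in results:
    # Each similar known name votes for its gender, weighted by match score.
    if namerec.gender == 'Male':
        male_score += namerec.match_score
    else:
        female_score += namerec.match_score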
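The same voting scheme can be tried without a Solr instance. The sketch below computes Levenshtein distance directly and assumes the match score is a normalised similarity, 1 - distance / length of the longer string; that normalisation, the `KNOWN` name list, and the function names are assumptions for illustration, not the talk's actual setup.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def similarity(a, b):
    """Assumed score: 1.0 for identical strings, falling towards 0.0."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

# Hypothetical index of names with known gender.
KNOWN = [("sheela", "Female"), ("meena", "Female"),
         ("ashok", "Male"), ("manish", "Male")]

def guess_gender(name, known=KNOWN):
    male_score = female_score = 0.0
    for known_name, gender in known:
        score = similarity(name.lower(), known_name)
        if gender == "Male":
            male_score += score
        else:
            female_score += score
    return "Male" if male_score > female_score else "Female"

print(guess_gender("Sheena"), guess_gender("Ashish"))  # prints: Female Male
```

Unlike the last-letter classifier, this approach uses the whole name, which is why a name ending in "a" or "h" is not automatically pushed to one class.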

Correctly guesses "Sheena" and "Ashish"!!

80/80 precision/recall

Thank You!!

