Machine Learning as a Service: Making Sentiment Predictions in Realtime with ZMQ and NLTK
DESCRIPTION
I am a Machine Learning (ML) and Natural Language Processing enthusiast. For my university dissertation I created a realtime sentiment analysis classifier for Twitter. My talk will be about the experience and the lessons learned. I will explain how to build a scalable machine learning software-as-a-service, consumable through a REST API. The purpose of this talk is not to dig into the mathematics behind machine learning (as I do not have that background), but rather to show how easy it can be to build an ML SaaS using some of the amazing libraries, such as NLTK, ZMQ and MrJob, that helped me throughout development. This talk offers several benefits: users with no ML background will get a great introduction to the subject and will be able to replicate my project at home; more experienced users will gain new ideas to put into practice and (most probably) build a better system than mine! Finally, I will attach a GitHub project with the slides and a finished product.
TRANSCRIPT
MACHINE LEARNING AS A SERVICE
MAKING SENTIMENT PREDICTIONS IN REALTIME WITH ZMQ AND NLTK
ABOUT ME
DISSERTATION
Let's make something cool!
SOCIAL MEDIA
+
MACHINE LEARNING
+
API
SENTIMENT ANALYSIS AS A SERVICE
A STEP-BY-STEP GUIDE
Fundamental Topics
Machine Learning
Natural Language Processing
Overview of the platform
The process: Prepare, Analyze, Train, Use, Scale
MACHINE LEARNING
WHAT IS MACHINE LEARNING?
A method of teaching computers to make and improve predictions or behaviors based on some data.
It allows computers to evolve behaviors based on empirical data.
Data can be anything:
Stock market prices
Sensors and motors
Email metadata
SUPERVISED MACHINE LEARNING
SPAM OR HAM
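The original slides illustrate this step with pictures; as a stand-in, here is a toy sketch (my own, not from the talk) of supervised spam/ham classification with NLTK, using word-presence features:

from nltk.classify import NaiveBayesClassifier

# Each labelled training example is a (features, label) pair
train = [
    ({'free': True, 'offer': True}, 'spam'),
    ({'viagra': True, 'free': True}, 'spam'),
    ({'meeting': True, 'tomorrow': True}, 'ham'),
    ({'lunch': True, 'tomorrow': True}, 'ham'),
]

classifier = NaiveBayesClassifier.train(train)
print(classifier.classify({'free': True, 'prize': True}))  # -> 'spam'

The classifier learns, from the labelled examples alone, which words make a message more likely to be spam; that is the whole idea of supervised learning.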
NATURAL LANGUAGE PROCESSING
WHAT IS NATURAL LANGUAGE PROCESSING?
Interactions between computers and human languages
Extract information from text
Some NLTK features:
Bigrams
Part-of-speech tagging
Tokenization
Stemming
WordNet lookup
NATURAL LANGUAGE PROCESSING
SOME NLTK FEATURES
Tokenization
Stopword Removal
>>> phrase = "I wish to buy specified products or service">>> phrase = nlp.tokenize(phrase)>>> phrase['I', 'wish', 'to', 'buy', 'specified', 'products', 'or', 'service']
>>> phrase = nlp.remove_stopwords(phrase)
>>> phrase
['I', 'wish', 'buy', 'specified', 'products', 'service']
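The nlp module above is the talk's own helper; a minimal sketch of how such helpers could be written with NLTK (assuming the 'punkt' and 'stopwords' NLTK data packages are downloaded):

from nltk import word_tokenize
from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words('english'))

def tokenize(text):
    # Split a raw phrase into word tokens
    return word_tokenize(text)

def remove_stopwords(tokens):
    # Drop common function words ('to', 'or', ...) that carry little sentiment
    return [token for token in tokens if token not in STOPWORDS]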
SENTIMENT ANALYSIS
CLASSIFYING TWITTER SENTIMENT IS HARD
Improper language use
Spelling mistakes
140 characters to express sentiment
Different types of English (US, UK, Pidgin)
Donnie McClurkin (@Donnieradio), 21 Apr 2014:
"Gr8 picutre..God bless u RT @WhatsNextInGosp: Resurrection Sunday Service @PFCNY with @Donnieradio pic.twitter.com/nOgz65cpY5"
BACK TO BUILDING OUR API... FINALLY!
CLASSIFIER
3 STEPS
THE DATASET
SENTIMENT140
160,000 labelled tweets
CSV format
Polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
The text of the tweet ("Lyx is cool")
FEATURE EXTRACTION
How are we going to find features from a phrase?
"Bag of Words" representation
my_phrase = "Today was such a rainy and horrible day"
In [12]: from nltk import word_tokenize
In [13]: word_tokenize(my_phrase)
Out[13]: ['Today', 'was', 'such', 'a', 'rainy', 'and', 'horrible', 'day']
FEATURE EXTRACTION
CREATE A PIPELINE OF FEATURE EXTRACTORS
FORMATTER = formatting.FormatterPipeline(
    formatting.make_lowercase,
    formatting.strip_urls,
    formatting.strip_hashtags,
    formatting.strip_names,
    formatting.remove_repetitons,
    formatting.replace_html_entities,
    formatting.strip_nonchars,
    functools.partial(
        formatting.remove_noise,
        stopwords=stopwords.words('english') + ['rt']
    ),
    functools.partial(
        formatting.stem_words,
        stemmer=nltk.stem.porter.PorterStemmer()
    )
)
FEATURE EXTRACTION
PASS THE REPRESENTATION DOWN THE PIPELINE
In [11]: feature_extractor.extract("Today was such a rainy and horrible day")
Out[11]: {'day': True, 'horribl': True, 'raini': True, 'today': True}
The result is a dictionary of variable length, with the features as keys and True as every value.
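feature_extractor is the talk's own object; a hypothetical sketch of what its extract method could do, assuming the FORMATTER pipeline above returns a list of cleaned, stemmed tokens:

def extract(phrase):
    # Run the raw phrase through the formatting pipeline, then
    # mark each surviving token as a present ("bag of words") feature
    tokens = FORMATTER.process(phrase)  # hypothetical method name
    return dict((token, True) for token in tokens)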
DIMENSIONALITY REDUCTION
Remove features that are common across all classes (noise)
Increase the performance of the classifier
Decrease the size of the model: less memory usage and more speed
DIMENSIONALITY REDUCTION
CHI-SQUARE TEST
NLTK gives us BigramAssocMeasures.chi_sq
from nltk.metrics import BigramAssocMeasures

# Calculate the number of words for each class
pos_word_count = label_word_fd['pos'].N()
neg_word_count = label_word_fd['neg'].N()
total_word_count = pos_word_count + neg_word_count

# For each word and its total occurrence count
for word, freq in word_fd.iteritems():
    # Calculate a score for the positive class
    pos_score = BigramAssocMeasures.chi_sq(
        label_word_fd['pos'][word],
        (freq, pos_word_count),
        total_word_count)

    # Calculate a score for the negative class
    neg_score = BigramAssocMeasures.chi_sq(
        label_word_fd['neg'][word],
        (freq, neg_word_count),
        total_word_count)

    # The sum of the two gives the word's total score
    word_scores[word] = pos_score + neg_score
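word_scores can then be used to keep only the top-scoring words. A minimal sketch (my own; the 10,000 cutoff and the reduce_features helper used later are assumptions, not from the slides):

# Rank words by chi-square score and keep the N best as our vocabulary
best = sorted(word_scores.iteritems(), key=lambda pair: pair[1], reverse=True)
best_features = set(word for word, score in best[:10000])

def reduce_features(feature_vector, best_features):
    # Drop every feature that did not survive the chi-square cut
    return dict((word, value) for word, value in feature_vector.iteritems()
                if word in best_features)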
TRAINING
Now that we can extract features from text, we can train a classifier. The simplest and most flexible learning algorithm for text classification is Naive Bayes:
P(label | features) = P(label) * P(features | label) / P(features)
Simple to compute = fast
Assumes feature independence = easy to update
Supports multiclass = scalable
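The independence assumption is what keeps the computation cheap: the likelihood factorizes into one term per feature,

P(features | label) = P(f1 | label) * P(f2 | label) * ... * P(fn | label)

so each word contributes an independently estimated probability, and adding a new feature never forces us to re-estimate the others.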
TRAINING
NLTK provides built-in components
1. Train the classifier
2. Serialize classifier for later use
3. Train once, use as much as you want
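train_feats below is a list of (feature dict, label) pairs. A sketch of how it could be built from the Sentiment140 CSV (the file name is a placeholder; the column layout, polarity first and text last, follows the dataset description above):

import csv

train_feats = []
with open('sentiment140.csv') as csv_file:
    for row in csv.reader(csv_file):
        polarity, text = row[0], row[-1]
        # Map the dataset's polarity codes onto the two labels we train on
        label = 'pos' if polarity == '4' else 'neg'
        feature_vector = feature_extractor.extract(text)
        train_feats.append((reduce_features(feature_vector, best_features), label))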
>>> from nltk.classify import NaiveBayesClassifier
>>> nb_classifier = NaiveBayesClassifier.train(train_feats)
... wait a long time ...
>>> nb_classifier.labels()
['neg', 'pos']
>>> serializer.dump(nb_classifier, file_handle)
USING THE CLASSIFIER
# Load the classifier from the serialized file
classifier = pickle.loads(classifier_file.read())

# Pick a new phrase
new_phrase = "At Pycon Italy! Love the food and this speaker is so amazing"

# 1) Preprocessing
feature_vector = feature_extractor.extract(new_phrase)

# 2) Dimensionality reduction, best_features is our set of best words
reduced_feature_vector = reduce_features(feature_vector, best_features)

# 3) Classify!
print classifier.classify(reduced_feature_vector)
# prints "pos"
BUILDING A CLASSIFICATION API
The classifier is slow, no matter how much optimization is done
Classification is a blocking process, so the API must be event-driven
BUILDING A CLASSIFICATION API
SCALING TOWARDS INFINITY AND BEYOND
BUILDING A CLASSIFICATION API
ZEROMQ
Fast, uses native sockets
Promotes horizontal scalability
Language-agnostic framework
BUILDING A CLASSIFICATION API
ZEROMQ
...
socket = context.socket(zmq.REP)
...
while True:
    message = socket.recv()
    phrase = json.loads(message)["text"]

    # 1) Feature extraction
    feature_vector = feature_extractor.extract(phrase)

    # 2) Dimensionality reduction, best_features is our set of best words
    reduced_feature_vector = reduce_features(feature_vector, best_features)

    # 3) Classify!
    result = classifier.classify(reduced_feature_vector)
    socket.send(json.dumps(result))
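The slides show only the worker (REP) side; a minimal sketch of a matching client, assuming the worker is bound to tcp://localhost:5555 (the address is a placeholder):

import json
import zmq

context = zmq.Context()
socket = context.socket(zmq.REQ)
socket.connect("tcp://localhost:5555")

# Send a phrase and block until the worker replies with a label
socket.send_string(json.dumps({"text": "I love this conference"}))
print(socket.recv_string())  # e.g. "pos"

Because REQ/REP pairs are language-agnostic and location-transparent, more workers can be started behind a broker to scale horizontally without touching the client.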
DEMO
POST-MORTEM
Real-time sentiment analysis APIs can be implemented, and they can scale
What if we used Redis instead of serialized classifiers?
Deep learning is giving very good results in NLP, let's try it!
FIN
QUESTIONS