machine learning as a service: making sentiment predictions in realtime with zmq and nltk

Upload: daniel-pyrathon

Post on 09-May-2015

DESCRIPTION

I am a Machine Learning (ML) and Natural Language Processing enthusiast. For my university dissertation I created a realtime sentiment analysis classifier for Twitter. My talk will be about that experience and the lessons learned. I will explain how to build scalable machine learning software as a service, consumable through a REST API. The purpose of this talk is not to dig into the mathematics behind machine learning (I do not have that experience), but to show how easy it can be to build an ML SaaS using some of the amazing libraries, such as NLTK, ZMQ and MrJob, that helped me throughout development. The talk offers several benefits: attendees with no ML background will get a solid introduction to the subject and will be able to replicate my project at home, while more experienced attendees will gain new ideas to put into practice and will (most) probably build a better system than mine! Finally, I will attach a GitHub project with the slides and a finished product.

TRANSCRIPT

Page 1: Machine Learning as a Service: making sentiment predictions in realtime with ZMQ and NLTK

MACHINE LEARNING AS A SERVICE

MAKING SENTIMENT PREDICTIONS IN REALTIME WITH ZMQ AND NLTK

Page 2: Machine Learning as a Service: making sentiment predictions in realtime with ZMQ and NLTK

ABOUT ME

Page 3: Machine Learning as a Service: making sentiment predictions in realtime with ZMQ and NLTK

DISSERTATION

Let's make something cool!

Page 4: Machine Learning as a Service: making sentiment predictions in realtime with ZMQ and NLTK

SOCIAL MEDIA

+

MACHINE LEARNING

+

API

Page 5: Machine Learning as a Service: making sentiment predictions in realtime with ZMQ and NLTK

SENTIMENT ANALYSIS AS A SERVICE

A STEP-BY-STEP GUIDE

Page 6: Machine Learning as a Service: making sentiment predictions in realtime with ZMQ and NLTK

Fundamental Topics
    Machine Learning
    Natural Language Processing

Overview of the platform

The process
    Prepare
    Analyze
    Train
    Use
    Scale

Page 7: Machine Learning as a Service: making sentiment predictions in realtime with ZMQ and NLTK

MACHINE LEARNING
WHAT IS MACHINE LEARNING?

A method of teaching computers to make and improve predictions or behaviors based on some data.
It allows computers to evolve behaviors based on empirical data.
Data can be anything:
    Stock market prices
    Sensors and motors
    Email metadata

Page 8: Machine Learning as a Service: making sentiment predictions in realtime with ZMQ and NLTK

SUPERVISED MACHINE LEARNING
SPAM OR HAM

Page 16: Machine Learning as a Service: making sentiment predictions in realtime with ZMQ and NLTK

NATURAL LANGUAGE PROCESSING
WHAT IS NATURAL LANGUAGE PROCESSING?

Interactions between computers and human languages
Extract information from text
Some NLTK features:
    Bigrams
    Part-of-speech tagging
    Tokenization
    Stemming
    WordNet lookup

Page 17: Machine Learning as a Service: making sentiment predictions in realtime with ZMQ and NLTK

NATURAL LANGUAGE PROCESSING
SOME NLTK FEATURES

Tokenization

>>> phrase = "I wish to buy specified products or service"
>>> tokenized_phrase = nlp.tokenize(phrase)
>>> tokenized_phrase
['I', 'wish', 'to', 'buy', 'specified', 'products', 'or', 'service']

Stopword Removal

>>> phrase = nlp.remove_stopwords(tokenized_phrase)
>>> phrase
['I', 'wish', 'buy', 'specified', 'products', 'service']
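
The two steps on this slide can be tried without any setup using the rough stand-in below. It is only a sketch: the `tokenize` and `remove_stopwords` functions here are plain-Python approximations written for illustration, not the `nlp` helpers used in the talk, and the stopword set is a tiny hand-picked assumption (NLTK ships a much longer list per language).

```python
import re

# Tiny hand-picked stopword set, assumed for illustration only.
STOPWORDS = {"to", "or", "a", "an", "the", "and", "of"}


def tokenize(phrase):
    """Split a phrase into word tokens (runs of letters, digits and apostrophes)."""
    return re.findall(r"[A-Za-z0-9']+", phrase)


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop every token whose lowercase form is in the stopword set."""
    return [token for token in tokens if token.lower() not in stopwords]


tokens = tokenize("I wish to buy specified products or service")
print(tokens)
print(remove_stopwords(tokens))
```

A real tokenizer also has to deal with punctuation, contractions and emoticons, which is where a library such as NLTK earns its keep.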

Page 18: Machine Learning as a Service: making sentiment predictions in realtime with ZMQ and NLTK

SENTIMENT ANALYSIS

Page 19: Machine Learning as a Service: making sentiment predictions in realtime with ZMQ and NLTK

CLASSIFYING TWITTER SENTIMENT IS HARD

Improper language use
Spelling mistakes
140 characters to express sentiment
Different types of English (US, UK, Pidgin)

Example tweet from Donnie McClurkin (@Donnieradio), 7:04 PM - 21 Apr 2014:

Gr8 picutre..God bless u RT @WhatsNextInGosp: Resurrection Sunday Service @PFCNY with @Donnieradio pic.twitter.com/nOgz65cpY5

Page 20: Machine Learning as a Service: making sentiment predictions in realtime with ZMQ and NLTK

BACK TO BUILDING OUR API.. FINALLY!

Page 21: Machine Learning as a Service: making sentiment predictions in realtime with ZMQ and NLTK

CLASSIFIER
3 STEPS

Page 22: Machine Learning as a Service: making sentiment predictions in realtime with ZMQ and NLTK

THE DATASET
SENTIMENT140

160,000 labelled tweets
CSV format
Polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
The text of the tweet (Lyx is cool)
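
As a rough illustration of how such a file could be turned into labelled examples, the sketch below reads CSV rows and maps the polarity codes from the slide to string labels. It assumes the polarity is the first column and the tweet text is the last one; the `read_labelled_tweets` name, the in-memory sample rows and the "neutral" label are made up for the example ("neg" and "pos" are the labels the classifier reports later in the talk).

```python
import csv
import io

# Polarity codes as listed on the slide; the label strings are an assumption.
POLARITY_TO_LABEL = {"0": "neg", "2": "neutral", "4": "pos"}


def read_labelled_tweets(csv_file):
    """Yield (text, label) pairs from an open CSV file object."""
    for row in csv.reader(csv_file):
        if not row:
            continue  # skip blank lines
        polarity, text = row[0], row[-1]
        yield text, POLARITY_TO_LABEL[polarity]


# Two made-up rows standing in for the real dataset file.
sample = io.StringIO(
    '"4","Lyx is cool"\n'
    '"0","Today was such a rainy and horrible day"\n'
)
labelled = list(read_labelled_tweets(sample))
print(labelled)
```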

Page 23: Machine Learning as a Service: making sentiment predictions in realtime with ZMQ and NLTK

FEATURE EXTRACTION
How are we going to find features from a phrase?

"Bag of Words" representation

my_phrase = "Today was such a rainy and horrible day"

In [12]: from nltk import word_tokenize

In [13]: word_tokenize(my_phrase)
Out[13]: ['Today', 'was', 'such', 'a', 'rainy', 'and', 'horrible', 'day']

Page 24: Machine Learning as a Service: making sentiment predictions in realtime with ZMQ and NLTK

FEATURE EXTRACTION
CREATE A PIPELINE OF FEATURE EXTRACTORS

import functools

import nltk
from nltk.corpus import stopwords

# formatting is the project's own module of text-cleaning helpers
FORMATTER = formatting.FormatterPipeline(
    formatting.make_lowercase,
    formatting.strip_urls,
    formatting.strip_hashtags,
    formatting.strip_names,
    formatting.remove_repetitons,
    formatting.replace_html_entities,
    formatting.strip_nonchars,
    # Drop stopwords, plus the retweet marker 'rt'
    functools.partial(
        formatting.remove_noise,
        stopwords=stopwords.words('english') + ['rt']
    ),
    # Reduce each remaining word to its stem
    functools.partial(
        formatting.stem_words,
        stemmer=nltk.stem.porter.PorterStemmer()
    )
)
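
The slide does not show how `FormatterPipeline` works internally. One plausible reading is that it simply applies each callable in order and hands the result to the next one. The sketch below is a guess along those lines, not the project's code: the `process` method name, the simplified formatter functions, the `tokenize` step and the small stopword set are all assumptions.

```python
import functools
import re


class FormatterPipeline:
    """Apply a sequence of callables, feeding each result to the next one."""

    def __init__(self, *formatters):
        self.formatters = formatters

    def process(self, value):
        for formatter in self.formatters:
            value = formatter(value)
        return value


def make_lowercase(text):
    return text.lower()


def strip_urls(text):
    return re.sub(r"https?://\S+", "", text)


def strip_names(text):
    # Remove @mentions
    return re.sub(r"@\w+", "", text)


def tokenize(text):
    return re.findall(r"[a-z']+", text)


def remove_noise(tokens, stopwords):
    return [token for token in tokens if token not in stopwords]


pipeline = FormatterPipeline(
    make_lowercase,
    strip_urls,
    strip_names,
    tokenize,
    functools.partial(remove_noise, stopwords={"rt", "a", "and", "was", "such"}),
)
cleaned = pipeline.process(
    "RT @friend Today was such a rainy and horrible day http://example.com"
)
print(cleaned)
```

Keeping every step a plain function makes it cheap to reorder, add or drop steps, and `functools.partial` covers the steps that need extra configuration.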

Page 25: Machine Learning as a Service: making sentiment predictions in realtime with ZMQ and NLTK

FEATURE EXTRACTION
PASS THE REPRESENTATION DOWN THE PIPELINE

In [11]: feature_extractor.extract("Today was such a rainy and horrible day")
Out[11]: {'day': True, 'horribl': True, 'raini': True, 'today': True}

The result is a dictionary of variable length, with the features as keys and every value set to True.
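
The last step of such an extractor, turning cleaned tokens into that dictionary, might look like the sketch below. The `bag_of_words` name is hypothetical; the input tokens are the stems shown in the output above.

```python
def bag_of_words(tokens):
    """Map every distinct token to True; repeated tokens collapse into one key."""
    return {token: True for token in tokens}


features = bag_of_words(["today", "raini", "horribl", "day", "day"])
print(features)
```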

Page 26: Machine Learning as a Service: making sentiment predictions in realtime with ZMQ and NLTK

DIMENSIONALITY REDUCTION

Remove features that are common across all classes (noise)
Increase performance of the classifier
Decrease the size of the model: less memory usage and more speed
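
In practice, reducing a feature vector can be as simple as keeping only the keys that belong to a precomputed set of informative words. The `reduce_features` function is called later in the talk but never shown, so the body below is an assumed implementation, and the `best_features` set is made up.

```python
def reduce_features(feature_vector, best_features):
    """Keep only the features that appear in the set of best features."""
    return {
        word: value
        for word, value in feature_vector.items()
        if word in best_features
    }


best_features = {"horribl", "raini", "love"}  # made-up example set
reduced = reduce_features(
    {"day": True, "horribl": True, "raini": True, "today": True},
    best_features,
)
print(reduced)
```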

Page 27: Machine Learning as a Service: making sentiment predictions in realtime with ZMQ and NLTK

DIMENSIONALITY REDUCTION
CHI-SQUARE TEST

Page 32: Machine Learning as a Service: making sentiment predictions in realtime with ZMQ and NLTK

DIMENSIONALITY REDUCTION
CHI-SQUARE TEST

NLTK gives us BigramAssocMeasures.chi_sq

from nltk.metrics import BigramAssocMeasures

word_scores = {}

# Calculate the number of words for each class
pos_word_count = label_word_fd['pos'].N()
neg_word_count = label_word_fd['neg'].N()
total_word_count = pos_word_count + neg_word_count

# For each word and its total occurrence count
for word, freq in word_fd.items():

    # Calculate a score for the positive class
    pos_score = BigramAssocMeasures.chi_sq(label_word_fd['pos'][word], (freq, pos_word_count), total_word_count)

    # Calculate a score for the negative class
    neg_score = BigramAssocMeasures.chi_sq(label_word_fd['neg'][word], (freq, neg_word_count), total_word_count)

    # The sum of the two gives the word's total score
    word_scores[word] = pos_score + neg_score
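
To make the scoring less of a black box, here is a sketch of the standard Pearson chi-square statistic for a 2x2 contingency table, taking the same four counts as the calls above (word count in the class, total word count, class size, corpus size). It is a plain-Python stand-in rather than NLTK's own code, and the `best_words` helper that keeps the highest-scoring words is a hypothetical follow-up step with made-up scores.

```python
def chi_sq(n_ii, n_ix, n_xi, n_xx):
    """Chi-square score for a word/class pair.

    n_ii: occurrences of the word inside the class
    n_ix: occurrences of the word in the whole corpus
    n_xi: number of words in the class
    n_xx: number of words in the whole corpus
    """
    a = n_ii                        # word, in class
    b = n_ix - n_ii                 # word, outside class
    c = n_xi - n_ii                 # other words, in class
    d = n_xx - n_ix - n_xi + n_ii   # other words, outside class
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0
    return n_xx * (a * d - b * c) ** 2 / denominator


def best_words(word_scores, limit):
    """Return the set of the `limit` highest-scoring words."""
    ranked = sorted(word_scores.items(), key=lambda item: item[1], reverse=True)
    return {word for word, _ in ranked[:limit]}


# A word seen only in one class scores high; an evenly spread word scores 0.
print(chi_sq(10, 10, 10, 20))
print(chi_sq(5, 10, 10, 20))
print(best_words({"horribl": 12.0, "today": 0.5, "love": 9.0}, limit=2))
```

Words that occur about equally often in both classes end up near zero, which is exactly the noise the previous slides want to remove.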

Page 33: Machine Learning as a Service: making sentiment predictions in realtime with ZMQ and NLTK

TRAINING

Now that we can extract features from text, we can train a classifier. The simplest and most flexible learning algorithm for text classification is Naive Bayes.

P(label | features) = P(label) * P(features | label) / P(features)

Simple to compute = fast
Assumes feature independence = easy to update
Supports multiclass = scalable
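
The formula can be turned into a few lines of code. The sketch below is a deliberately tiny Naive Bayes written for illustration, not NLTK's implementation: it works in log space, uses add-one smoothing (an assumption, NLTK uses its own probability estimators), and skips P(features) because it is the same for every label and so does not change which label wins. The `TinyNaiveBayes` name and the training data are made up.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Minimal Naive Bayes over bag-of-words feature dictionaries."""

    def train(self, labelled_features):
        self.label_counts = Counter()
        self.word_counts = defaultdict(Counter)
        self.vocabulary = set()
        for features, label in labelled_features:
            self.label_counts[label] += 1
            for word in features:
                self.word_counts[label][word] += 1
                self.vocabulary.add(word)
        return self

    def log_scores(self, features):
        total = sum(self.label_counts.values())
        scores = {}
        for label, count in self.label_counts.items():
            score = math.log(count / total)  # log P(label)
            denominator = sum(self.word_counts[label].values()) + len(self.vocabulary)
            for word in features:
                if word in self.vocabulary:  # ignore words never seen in training
                    # log P(word | label) with add-one smoothing
                    score += math.log((self.word_counts[label][word] + 1) / denominator)
            scores[label] = score
        return scores

    def classify(self, features):
        scores = self.log_scores(features)
        return max(scores, key=scores.get)


train_feats = [
    ({"love": True, "amaz": True}, "pos"),
    ({"love": True, "food": True}, "pos"),
    ({"horribl": True, "raini": True}, "neg"),
    ({"horribl": True, "day": True}, "neg"),
]
classifier = TinyNaiveBayes().train(train_feats)
print(classifier.classify({"love": True}))
print(classifier.classify({"horribl": True, "day": True}))
```

Because each word contributes an independent term, updating the model with new examples only means incrementing counters, which is the "easy to update" point on the slide.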

Page 34: Machine Learning as a Service: making sentiment predictions in realtime with ZMQ and NLTK

TRAINING
NLTK provides built-in components

1. Train the classifier

>>> from nltk.classify import NaiveBayesClassifier
>>> nb_classifier = NaiveBayesClassifier.train(train_feats)
... wait a lot of time
>>> nb_classifier.labels()
['neg', 'pos']

2. Serialize classifier for later use

>>> import pickle
>>> pickle.dump(nb_classifier, file_handle)

3. Train once, use as much as you want
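
The "train once, use as much as you want" idea is a serialization round trip. The sketch below shows it with `pickle` and an in-memory buffer; the `model` dictionary is a made-up stand-in for a trained classifier, and a real service would write to a file on disk instead of `io.BytesIO`.

```python
import io
import pickle

# Made-up stand-in for a trained classifier object.
model = {"labels": ["neg", "pos"], "best_features": {"horribl", "love"}}

# Serialize once, after the slow training step...
buffer = io.BytesIO()
pickle.dump(model, buffer)

# ...then load it back whenever a worker starts.
buffer.seek(0)
restored = pickle.load(buffer)
print(restored["labels"])
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.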

Page 35: Machine Learning as a Service: making sentiment predictions in realtime with ZMQ and NLTK

USING THE CLASSIFIER

# Load the classifier from the serialized file
classifier = pickle.loads(classifier_file.read())

# Pick a new phrase
new_phrase = "At Pycon Italy! Love the food and this speaker is so amazing"

# 1) Preprocessing
feature_vector = feature_extractor.extract(new_phrase)

# 2) Dimensionality reduction, best_features is our set of best words
reduced_feature_vector = reduce_features(feature_vector, best_features)

# 3) Classify!
print(classifier.classify(reduced_feature_vector))
# Output: pos

Page 36: Machine Learning as a Service: making sentiment predictions in realtime with ZMQ and NLTK

BUILDING A CLASSIFICATION API

Classifier is slow, no matter how much optimization is made
Classifier is a blocking process, API must be event-driven
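
One way to picture the problem: an event loop must never wait on the classifier directly, so the blocking call has to run somewhere else. The sketch below uses `asyncio` with a thread pool purely as an illustration; the talk itself moves the classifier into separate ZeroMQ workers, and the `slow_classify` function here is a fake that only sleeps and checks for one keyword.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def slow_classify(text):
    """Fake blocking classifier: sleeps, then applies a trivial keyword rule."""
    time.sleep(0.05)
    return "pos" if "love" in text.lower() else "neg"


async def classify_many(texts, executor):
    """Run the blocking calls in the executor so the event loop stays free."""
    loop = asyncio.get_running_loop()
    futures = [loop.run_in_executor(executor, slow_classify, text) for text in texts]
    return await asyncio.gather(*futures)


def run_demo(texts):
    with ThreadPoolExecutor(max_workers=4) as executor:
        return asyncio.run(classify_many(texts, executor))


print(run_demo(["Love the food", "Horrible day"]))
```

Threads are enough to show the idea, but a CPU-bound classifier in Python is better served by separate processes, which is the direction the following slides take.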

Page 37: Machine Learning as a Service: making sentiment predictions in realtime with ZMQ and NLTK

BUILDING A CLASSIFICATION API
SCALING TOWARDS INFINITY AND BEYOND

Page 38: Machine Learning as a Service: making sentiment predictions in realtime with ZMQ and NLTK

BUILDING A CLASSIFICATION API
ZEROMQ

Fast, uses native sockets
Promotes horizontal scalability
Language-agnostic framework

Page 39: Machine Learning as a Service: making sentiment predictions in realtime with ZMQ and NLTK

BUILDING A CLASSIFICATION API
ZEROMQ

...
socket = context.socket(zmq.REP)
...
while True:
    # Block until a request arrives, then read the phrase to classify
    message = socket.recv()
    phrase = json.loads(message)["text"]

    # 1) Feature extraction
    feature_vector = feature_extractor.extract(phrase)

    # 2) Dimensionality reduction, best_features is our set of best words
    reduced_feature_vector = reduce_features(feature_vector, best_features)

    # 3) Classify!
    result = classifier.classify(reduced_feature_vector)

    # Reply with the label as JSON text
    socket.send_string(json.dumps(result))
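
The other side of that REP socket is a REQ client, which the slides do not show. A minimal sketch is below, assuming the worker listens on `tcp://localhost:5555` (the address and port are made up) and speaks the same JSON format: a `{"text": ...}` object in, a JSON string label out. The helper names are hypothetical, and `zmq` is imported inside the function so the JSON helpers can be tried without pyzmq installed.

```python
import json


def build_request(text):
    """Encode a phrase in the JSON shape the worker expects."""
    return json.dumps({"text": text})


def parse_reply(raw):
    """Decode the worker's JSON reply into a label string."""
    return json.loads(raw)


def classify_remote(text, endpoint="tcp://localhost:5555"):
    """Send one phrase to a running worker and wait for its label."""
    import zmq  # requires pyzmq

    context = zmq.Context.instance()
    socket = context.socket(zmq.REQ)
    socket.connect(endpoint)
    try:
        socket.send_string(build_request(text))
        return parse_reply(socket.recv_string())
    finally:
        socket.close()


print(build_request("Love the food"))
```

With a worker running, `classify_remote("Love the food")` would block until the reply arrives, since REQ/REP sockets strictly alternate send and receive.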

Page 40: Machine Learning as a Service: making sentiment predictions in realtime with ZMQ and NLTK

DEMO

Page 41: Machine Learning as a Service: making sentiment predictions in realtime with ZMQ and NLTK

POST-MORTEM

Real-time sentiment analysis APIs can be implemented, and can be scalable
What if we use Redis instead of having serialized classifiers?
Deep learning is giving very good results in NLP, let's try it!

Page 42: Machine Learning as a Service: making sentiment predictions in realtime with ZMQ and NLTK

FIN

QUESTIONS