Machine Learning as a Service: Making Sentiment Predictions in Realtime with ZMQ and NLTK
DESCRIPTION
I am a Machine Learning (ML) and Natural Language Processing enthusiast. For my university dissertation I created a realtime sentiment analysis classifier for Twitter. My talk will be about the experience and the lessons learned. I will explain how to build a scalable machine learning software-as-a-service, consumable through a REST API. The purpose of this talk is not to dig into the mathematics behind machine learning (as I do not have that background), but rather to show how easy it can be to build an ML SaaS using some of the amazing libraries, such as NLTK, ZMQ and MrJob, that helped me throughout development. This talk offers several benefits: users with no ML background will get a great introduction to the subject and will be able to replicate my project at home; more experienced users will gain new ideas to put into practice and (most probably) build a better system than mine! Finally, I will attach a GitHub project with the slides and a finished product.
TRANSCRIPT
MACHINE LEARNING AS A SERVICE
MAKING SENTIMENT PREDICTIONS IN REALTIME WITH ZMQ AND NLTK
ABOUT ME
DISSERTATION
Let's make something cool!
SOCIAL MEDIA
+
MACHINE LEARNING
+
API
SENTIMENT ANALYSIS AS A SERVICE
A STEP-BY-STEP GUIDE
Fundamental Topics
Machine Learning
Natural Language Processing
Overview of the platform
The process: Prepare, Analyze, Train, Use, Scale
MACHINE LEARNING
WHAT IS MACHINE LEARNING?
A method of teaching computers to make and improve predictions or behaviors based on some data.
It allows computers to evolve behaviors based on empirical data.
Data can be anything:
Stock market prices
Sensors and motors
Email metadata
SUPERVISED MACHINE LEARNING
SPAM OR HAM
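The original slides illustrate this step with pictures; as a stand-in, here is a toy sketch (my own, not from the talk) of supervised spam/ham classification with NLTK, using word-presence features:

from nltk.classify import NaiveBayesClassifier

# Each labelled training example is a (features, label) pair
train = [
    ({'free': True, 'offer': True}, 'spam'),
    ({'viagra': True, 'free': True}, 'spam'),
    ({'meeting': True, 'tomorrow': True}, 'ham'),
    ({'lunch': True, 'tomorrow': True}, 'ham'),
]

classifier = NaiveBayesClassifier.train(train)
print(classifier.classify({'free': True, 'prize': True}))  # -> 'spam'

The classifier learns, from the labelled examples alone, which words make a message more likely to be spam; that is the whole idea of supervised learning.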
NATURAL LANGUAGE PROCESSING
WHAT IS NATURAL LANGUAGE PROCESSING?
Interactions between computers and human languages
Extract information from text
Some NLTK features:
Bigrams
Part-of-speech tagging
Tokenization
Stemming
WordNet lookup
NATURAL LANGUAGE PROCESSING
SOME NLTK FEATURES
Tokenization
Stopword Removal
>>> phrase = "I wish to buy specified products or service">>> phrase = nlp.tokenize(phrase)>>> phrase['I', 'wish', 'to', 'buy', 'specified', 'products', 'or', 'service']
>>> phrase = nlp.remove_stopwords(phrase)
>>> phrase
['I', 'wish', 'buy', 'specified', 'products', 'service']
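The nlp module above is the talk's own helper; a minimal sketch of how such helpers could be written with NLTK (assuming the 'punkt' and 'stopwords' NLTK data packages are downloaded):

from nltk import word_tokenize
from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words('english'))

def tokenize(text):
    # Split a raw phrase into word tokens
    return word_tokenize(text)

def remove_stopwords(tokens):
    # Drop common function words ('to', 'or', ...) that carry little sentiment
    return [token for token in tokens if token not in STOPWORDS]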
SENTIMENT ANALYSIS
CLASSIFYING TWITTER SENTIMENT IS HARD
Improper language use
Spelling mistakes
140 characters to express sentiment
Different types of English (US, UK, Pidgin)
Donnie McClurkin (@Donnieradio), 21 Apr 2014:
"Gr8 picutre..God bless u RT @WhatsNextInGosp: Resurrection Sunday Service @PFCNY with @Donnieradio pic.twitter.com/nOgz65cpY5"
BACK TO BUILDING OUR API... FINALLY!
CLASSIFIER
3 STEPS
THE DATASET
SENTIMENT140
160,000 labelled tweets
CSV format
Polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
The text of the tweet ("Lyx is cool")
FEATURE EXTRACTION
How are we going to find features from a phrase?
"Bag of Words" representation
my_phrase = "Today was such a rainy and horrible day"
In [12]: from nltk import word_tokenize
In [13]: word_tokenize(my_phrase)
Out[13]: ['Today', 'was', 'such', 'a', 'rainy', 'and', 'horrible', 'day']
FEATURE EXTRACTION
CREATE A PIPELINE OF FEATURE EXTRACTORS
FORMATTER = formatting.FormatterPipeline(
    formatting.make_lowercase,
    formatting.strip_urls,
    formatting.strip_hashtags,
    formatting.strip_names,
    formatting.remove_repetitons,
    formatting.replace_html_entities,
    formatting.strip_nonchars,
    functools.partial(
        formatting.remove_noise,
        stopwords=stopwords.words('english') + ['rt']
    ),
    functools.partial(
        formatting.stem_words,
        stemmer=nltk.stem.porter.PorterStemmer()
    )
)
FEATURE EXTRACTION
PASS THE REPRESENTATION DOWN THE PIPELINE
In [11]: feature_extractor.extract("Today was such a rainy and horrible day")
Out[11]: {'day': True, 'horribl': True, 'raini': True, 'today': True}
The result is a dictionary of variable length, with the features as keys and True as every value.
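feature_extractor is the talk's own object; a hypothetical sketch of what its extract method could do, assuming the FORMATTER pipeline above returns a list of cleaned, stemmed tokens:

def extract(phrase):
    # Run the raw phrase through the formatting pipeline, then
    # mark each surviving token as a present ("bag of words") feature
    tokens = FORMATTER.process(phrase)  # hypothetical method name
    return dict((token, True) for token in tokens)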
DIMENSIONALITY REDUCTION
Remove features that are common across all classes (noise)
Increase the performance of the classifier
Decrease the size of the model: less memory usage and more speed
DIMENSIONALITY REDUCTION
CHI-SQUARE TEST
NLTK gives us BigramAssocMeasures.chi_sq
from nltk.metrics import BigramAssocMeasures

# Calculate the number of words for each class
pos_word_count = label_word_fd['pos'].N()
neg_word_count = label_word_fd['neg'].N()
total_word_count = pos_word_count + neg_word_count

# For each word and its total occurrence count
for word, freq in word_fd.iteritems():
    # Calculate a score for the positive class
    pos_score = BigramAssocMeasures.chi_sq(
        label_word_fd['pos'][word],
        (freq, pos_word_count),
        total_word_count)

    # Calculate a score for the negative class
    neg_score = BigramAssocMeasures.chi_sq(
        label_word_fd['neg'][word],
        (freq, neg_word_count),
        total_word_count)

    # The sum of the two gives the word's total score
    word_scores[word] = pos_score + neg_score
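word_scores can then be used to keep only the top-scoring words. A minimal sketch (my own; the 10,000 cutoff and the reduce_features helper used later are assumptions, not from the slides):

# Rank words by chi-square score and keep the N best as our vocabulary
best = sorted(word_scores.iteritems(), key=lambda pair: pair[1], reverse=True)
best_features = set(word for word, score in best[:10000])

def reduce_features(feature_vector, best_features):
    # Drop every feature that did not survive the chi-square cut
    return dict((word, value) for word, value in feature_vector.iteritems()
                if word in best_features)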
TRAINING
Now that we can extract features from text, we can train a classifier. The simplest and most flexible learning algorithm for text classification is Naive Bayes:
P(label | features) = P(label) * P(features | label) / P(features)
Simple to compute = fast
Assumes feature independence = easy to update
Supports multiclass = scalable
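The independence assumption is what keeps the computation cheap: the likelihood factorizes into one term per feature,

P(features | label) = P(f1 | label) * P(f2 | label) * ... * P(fn | label)

so each word contributes an independently estimated probability, and adding a new feature never forces us to re-estimate the others.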
TRAINING
NLTK provides built-in components
1. Train the classifier
2. Serialize classifier for later use
3. Train once, use as much as you want
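train_feats below is a list of (feature dict, label) pairs. A sketch of how it could be built from the Sentiment140 CSV (the file name is a placeholder; the column layout, polarity first and text last, follows the dataset description above):

import csv

train_feats = []
with open('sentiment140.csv') as csv_file:
    for row in csv.reader(csv_file):
        polarity, text = row[0], row[-1]
        # Map the dataset's polarity codes onto the two labels we train on
        label = 'pos' if polarity == '4' else 'neg'
        feature_vector = feature_extractor.extract(text)
        train_feats.append((reduce_features(feature_vector, best_features), label))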
>>> from nltk.classify import NaiveBayesClassifier
>>> nb_classifier = NaiveBayesClassifier.train(train_feats)
... wait a long time ...
>>> nb_classifier.labels()
['neg', 'pos']
>>> serializer.dump(nb_classifier, file_handle)
USING THE CLASSIFIER
# Load the classifier from the serialized file
classifier = pickle.loads(classifier_file.read())

# Pick a new phrase
new_phrase = "At Pycon Italy! Love the food and this speaker is so amazing"

# 1) Preprocessing
feature_vector = feature_extractor.extract(new_phrase)

# 2) Dimensionality reduction, best_features is our set of best words
reduced_feature_vector = reduce_features(feature_vector, best_features)

# 3) Classify!
print classifier.classify(reduced_feature_vector)
# prints "pos"
BUILDING A CLASSIFICATION API
The classifier is slow, no matter how much optimization is done
Classification is a blocking process, so the API must be event-driven
BUILDING A CLASSIFICATION API
SCALING TOWARDS INFINITY AND BEYOND
BUILDING A CLASSIFICATION API
ZEROMQ
Fast, uses native sockets
Promotes horizontal scalability
Language-agnostic framework
BUILDING A CLASSIFICATION API
ZEROMQ
...
socket = context.socket(zmq.REP)
...
while True:
    message = socket.recv()
    phrase = json.loads(message)["text"]

    # 1) Feature extraction
    feature_vector = feature_extractor.extract(phrase)

    # 2) Dimensionality reduction, best_features is our set of best words
    reduced_feature_vector = reduce_features(feature_vector, best_features)

    # 3) Classify!
    result = classifier.classify(reduced_feature_vector)
    socket.send(json.dumps(result))
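The slides show only the worker (REP) side; a minimal sketch of a matching client, assuming the worker is bound to tcp://localhost:5555 (the address is a placeholder):

import json
import zmq

context = zmq.Context()
socket = context.socket(zmq.REQ)
socket.connect("tcp://localhost:5555")

# Send a phrase and block until the worker replies with a label
socket.send_string(json.dumps({"text": "I love this conference"}))
print(socket.recv_string())  # e.g. "pos"

Because REQ/REP pairs are language-agnostic and location-transparent, more workers can be started behind a broker to scale horizontally without touching the client.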
DEMO
POST-MORTEM
Real-time sentiment analysis APIs can be implemented, and they can scale
What if we used Redis instead of serialized classifiers?
Deep learning is giving very good results in NLP, let's try it!
FIN
QUESTIONS