NLP & Machine Learning - An Introductory Talk

NLP & Machine Learning Vijay Ganti

Upload: vijay-ganti

Post on 13-Apr-2017


Page 1: NLP & Machine Learning - An Introductory Talk

NLP & Machine Learning

Vijay Ganti

Page 2: NLP & Machine Learning - An Introductory Talk

About Me

• I am an amateur programmer and ML enthusiast

• I am developing NLP prototype systems for problems that I find interesting, and have used models like Naive Bayes and LDA for topic modeling of HTML data

• I code in Python

• I have developed a deep love for solving stimulating problems, and since I also like writing, I am intrigued by the question “can good/great writing be detected, or one day created, by ML/AI?”

• An amateur is someone who does something for love

Page 3: NLP & Machine Learning - An Introductory Talk

–Laozi (604 BC - 531 BC), a contemporary of Confucius

“A journey of a thousand miles begins with a single step”

Page 4: NLP & Machine Learning - An Introductory Talk

Agenda

Why NLP & ML?

What is NLP?

Getting started with NLP & ML

Why Python?

Making it real with an NLP & ML coding demo

A program that predicts gender given name(s) as input

Some glimpses into some practical issues

Next Steps

Page 5: NLP & Machine Learning - An Introductory Talk

NLP powered by ML is ripe for changing the way business gets done!

• Conversational agents are becoming an important form of human-computer communication (Customer support interactions using chat-bots)

• Much of human-human communication is now mediated by computers (Email, Social Media, Messaging)

• An enormous amount of knowledge is now available in machine readable form as natural language text (web, proprietary enterprise content)

Page 6: NLP & Machine Learning - An Introductory Talk

My meet up calendar is

buzzing with NLP & ML

Page 7: NLP & Machine Learning - An Introductory Talk

So is theirs

Page 8: NLP & Machine Learning - An Introductory Talk

and his

Page 9: NLP & Machine Learning - An Introductory Talk

and all these folks

Page 10: NLP & Machine Learning - An Introductory Talk

So what is NLP?

Get machines to understand human language

Segmentation (words, sentences, stemming)

Part of speech tagging

Named Entity Recognition

Disambiguation (Semantics and Context)

Document/Text Classification like topic modeling…
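
As a rough illustration of the first item in the list above, segmentation, here is a minimal sketch using only Python's standard `re` module. The function names `split_sentences` and `split_words` are hypothetical, and the regular expressions are naive assumptions (they will mis-split abbreviations such as "Dr."). A library like NLTK handles these cases far better.

```python
import re

def split_sentences(text):
    """Naively split text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def split_words(sentence):
    """Naively extract lowercase word tokens (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", sentence.lower())

text = "I ate spaghetti with a fork. Did you eat too? Yes!"
sentences = split_sentences(text)
tokens = [split_words(s) for s in sentences]
print(sentences)
print(tokens)
```

Sentence and word boundaries are the input to every later step (tagging, entity recognition, classification), so mistakes made here carry through the rest of the pipeline.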

Page 11: NLP & Machine Learning - An Introductory Talk

Disambiguation in language is easy for us but hard for machines

Sentence | Relation
I ate spaghetti with meatballs | ingredient
I ate spaghetti with salad | side dish
I ate spaghetti with abandon | feeling
I ate spaghetti with a fork | instrument
I ate spaghetti with a friend | company

Page 12: NLP & Machine Learning - An Introductory Talk

A few years back we faced the disambiguation problem with images. This was one time I wanted polarization, but the machines couldn't tell the difference!

Page 13: NLP & Machine Learning - An Introductory Talk

Old vs New NLP

Rule Based | Machine Learning Based
Deterministic | Probabilistic
Hard boundaries | Soft boundaries
Fixed | Malleable

Page 14: NLP & Machine Learning - An Introductory Talk

What do you need to become good at NLP & ML based on experience & ?

Pick Machine Learning & Distributed Computing stuff, as needed

ref: https://www.linkedin.com/pulse/20141114072915-11846569-what-it-takes-to-be-a-data-scientist-advice-from-a-non-data-scientist?trk=mp-reader-card

• Coding

• Probability Theory & Statistical Inference Theory

• Algorithm theory for both tweaking models and building scalable implementations

• Look for problems to solve end-to-end and soak in large amounts of data (data are everywhere)

Page 15: NLP & Machine Learning - An Introductory Talk

Why should I study probability? We have all tossed coins and played card games!

Outcomes are highly non-intuitive

Required to combat our primitive intuition & build sophisticated “intuition”

Examples? Google “Birthday Problem” to see an example
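
To make the Birthday Problem concrete, here is a small sketch that computes the probability that at least two people in a group share a birthday. It assumes birthdays are independent and uniformly spread over 365 days (no leap years), which is the usual textbook simplification. The function name `prob_shared_birthday` is made up for this illustration.

```python
def prob_shared_birthday(n, days=365):
    """Probability that at least two of n people share a birthday,
    assuming independent, uniformly distributed birthdays."""
    if n > days:
        return 1.0  # pigeonhole: more people than days
    p_all_distinct = 1.0
    for i in range(n):
        # the (i+1)-th person must avoid the i birthdays already taken
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct

for n in (10, 23, 50):
    print(n, round(prob_shared_birthday(n), 3))
```

Under these assumptions the probability crosses 50% at just 23 people, far fewer than most people guess. That gap is exactly the kind of non-intuitive outcome the slide refers to.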

Page 16: NLP & Machine Learning - An Introductory Talk

Why Python for NLP & ML

Easy to get productive quickly

Easy to access and “pre-process” text data

Interpreted so great for research productivity

Support for higher order abstractions and programming paradigms (declarative/functional, object oriented)

Rich eco-system with tons of modules for data science and NLP

Page 17: NLP & Machine Learning - An Introductory Talk
Page 18: NLP & Machine Learning - An Introductory Talk
Page 19: NLP & Machine Learning - An Introductory Talk

Getting started with NLP & ML & some foundational probability theory in Python

• Coursera course on Python Data Structures

• Some basic Python - Google Lectures on Python (https://developers.google.com/edu/python/)

• NLTK - nltk.org

• Get other packages as needed like NumPy, Matplotlib, Scikit-learn, PyBrain, pandas, IPython

• Natural Language Processing with Python (book)

• http://norvig.com/ngrams/ch14.pdf

• Azure Text Analytics API (I haven't tried it but looks promising)

• http://stats.stackexchange.com/

• https://www.quora.com

Page 20: NLP & Machine Learning - An Introductory Talk

Coding time to demonstrate the ML workflow

Simple gender prediction problem, solved interactively with a Naive Bayes classifier, to show the ML workflow & the importance of feature engineering
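
The demo code itself is not part of this transcript, so what follows is only a sketch of how such a gender predictor could look, not the code shown in the talk. It is written in plain Python rather than with NLTK so that it runs without downloads. The feature extractor `name_features`, the class `TinyNaiveBayes` and the tiny hand-written training list are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def name_features(name):
    """Hypothetical feature extractor: last letter and last two letters."""
    name = name.lower()
    return {"last1": name[-1], "last2": name[-2:]}

class TinyNaiveBayes:
    """Minimal Naive Bayes over categorical features with add-one smoothing."""

    def fit(self, feature_dicts, labels):
        self.label_counts = Counter(labels)
        self.counts = defaultdict(Counter)  # (label, feature) -> value counts
        self.values = defaultdict(set)      # feature -> distinct values seen
        for feats, label in zip(feature_dicts, labels):
            for fname, fval in feats.items():
                self.counts[(label, fname)][fval] += 1
                self.values[fname].add(fval)
        return self

    def log_scores(self, feats):
        total = sum(self.label_counts.values())
        scores = {}
        for label, n_label in self.label_counts.items():
            score = math.log(n_label / total)  # log prior
            for fname, fval in feats.items():
                seen = self.counts[(label, fname)][fval]
                vocab = len(self.values[fname]) + 1  # +1 slot for unseen values
                score += math.log((seen + 1) / (n_label + vocab))  # smoothed likelihood
            scores[label] = score
        return scores

    def predict(self, feats):
        scores = self.log_scores(feats)
        return max(scores, key=scores.get)

# Toy, hand-written training data for illustration only (not a real dataset).
TRAIN = [
    ("Anna", "female"), ("Maria", "female"), ("Sophia", "female"),
    ("Olivia", "female"), ("Emma", "female"), ("Laura", "female"),
    ("John", "male"), ("Peter", "male"), ("David", "male"),
    ("Robert", "male"), ("Kevin", "male"), ("Mark", "male"),
]

clf = TinyNaiveBayes().fit(
    [name_features(name) for name, _ in TRAIN],
    [gender for _, gender in TRAIN],
)
for name in ("Julia", "Martin"):
    print(name, "->", clf.predict(name_features(name)))
```

Most of the leverage is in `name_features`: swapping the suffix features for, say, the first letter changes the predictions without touching the classifier, which is the feature engineering point the demo is meant to make.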

Page 21: NLP & Machine Learning - An Introductory Talk

Supervised classification workflow

Training: Training Data -> Feature Extraction -> ML Algo

Prediction: Prediction Data -> Feature Extraction -> ML Algo -> Prediction

Page 22: NLP & Machine Learning - An Introductory Talk

Practical issues seen in our example

Curse of Dimensionality (too many features isn’t good)

Overfitting (sparse data for some features)

Scaling

More data vs better algorithms

Page 23: NLP & Machine Learning - An Introductory Talk

More data is better than a better algorithm

Source - "Scaling to Very Very Large Corpora for Natural Language Disambiguation", Michele Banko and Eric Brill, Microsoft Research, 1 Microsoft Way, Redmond, WA 98052 USA

Page 24: NLP & Machine Learning - An Introductory Talk

Practical lessons learned so far

Data preparation is 70% of the work

Feature Engineering is 70% of the rest of the work

Domain expertise critical for feature engineering

Modeling is more about understanding the concepts so that you use the models correctly.

The theory is hard to understand, so don’t try to learn it all at once. Instead, pick up topics as needed and ask for help.

Page 25: NLP & Machine Learning - An Introductory Talk

Next Steps

Think of use cases that will add the most value for a customer

Think deeply about the domain, not the models

Think about the data deeply (acquisition, format, processing etc.)

Contact me for discussing problems worth solving - we can hack together or [email protected]

tweet to @vijayganti if you liked the talk and want more

Page 26: NLP & Machine Learning - An Introductory Talk

“Ars longa, vita brevis”

which in English is

“Life is short, [the] craft long”

Hippocrates’ Parting Words of Caution

Page 27: NLP & Machine Learning - An Introductory Talk

Backup Slides

Page 28: NLP & Machine Learning - An Introductory Talk

Naive Bayes Classifier

P(A|B) = P(B|A) x P(A) / P(B)

P(Class|Feature) = P(Feature|Class) x P(Class) / P(Feature)

Posterior = Likelihood x Prior / Evidence
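
A worked numeric example may help tie the terms of Bayes' rule together. The sketch below uses made-up numbers (a 50/50 prior and invented likelihoods for "name ends in 'a'"), chosen only for illustration and not measured from any data. The evidence term is obtained from the law of total probability.

```python
def posterior(prior, likelihood, evidence):
    """Bayes' rule: P(Class|Feature) = P(Feature|Class) * P(Class) / P(Feature)."""
    return likelihood * prior / evidence

# Hypothetical numbers, for illustration only.
p_female = 0.5           # prior P(Class = female)
p_a_given_female = 0.6   # likelihood P(name ends in 'a' | female)
p_a_given_male = 0.1     # likelihood P(name ends in 'a' | male)

# Evidence P(name ends in 'a'), summed over both classes.
p_a = p_a_given_female * p_female + p_a_given_male * (1 - p_female)

p_female_given_a = posterior(p_female, p_a_given_female, p_a)
print(round(p_female_given_a, 3))
```

With these assumed numbers, seeing the feature moves the belief in "female" from the prior of 0.5 to a posterior of 6/7. A classifier only needs to compare posteriors across classes, so the shared evidence term can be dropped in practice.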

Page 29: NLP & Machine Learning - An Introductory Talk

Naive Bayes Classifier

What is independence?

Naive Bayes assumes that the features are independent of each other given the class. In NLP, say you are using word frequency as a feature. The words in pairs like

United States

Damn good

Stainless steel

aren’t independent. They often occur together, so you can get better classification accuracy if your initial processing detects such “collocations” and treats each one as a single unit.
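
Treating a collocation as one unit can be sketched as a small preprocessing step that runs before word counting. The pair list here is hand-picked from the slide and the function name `merge_collocations` is hypothetical. A real system would discover the pairs from data, for example with the collocation finders in NLTK's `nltk.collocations` module.

```python
from collections import Counter

# Hand-picked pairs taken from the slide; detection from data is out of scope here.
COLLOCATIONS = {("united", "states"), ("damn", "good"), ("stainless", "steel")}

def merge_collocations(tokens, pairs=COLLOCATIONS):
    """Join each adjacent pair found in `pairs` into a single token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in pairs:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2  # skip both halves of the collocation
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = "the united states makes damn good stainless steel".split()
merged = merge_collocations(tokens)
print(merged)
print(Counter(merged))
```

After merging, the word-frequency features count `united_states` once instead of counting `united` and `states` as two separate, strongly correlated features.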