natural language processing (python)

Natural Language Processing Using Python Presented by:- Sumit Kumar Raj 1DS09IS082 ISE,DSCE-2013

Upload: sumit-raj

Post on 09-May-2015


DESCRIPTION

A brief overview of Natural Language Processing using the Python module NLTK. The demonstration code can be found at the GitHub link given in the references slide.

TRANSCRIPT

Page 1: Natural language processing (Python)

Natural Language Processing Using Python

Presented by: Sumit Kumar Raj, 1DS09IS082

ISE,DSCE-2013

Page 2: Natural language processing (Python)

Table of Contents

•Introduction
•History
•Methods in NLP
•Natural Language Toolkit
•Sample Codes
•Feeling Lonely?
•Building a Spam Filter
•Applications
•References


Page 3: Natural language processing (Python)


What is Natural Language Processing ?

•Computer-aided text analysis of human language.

•The goal is to enable machines to understand human language and extract meaning from text.

•It is a field of study at the intersection of artificial intelligence and linguistics, closely tied to machine learning and, more specifically, computational linguistics.


Page 4: Natural language processing (Python)


History

•1948 – First NLP application – dictionary look-up system – developed at Birkbeck College, London

•1949 – American interest – Warren Weaver, drawing on WWII code breaking, viewed Russian text as English in code.

•1966 – Over-promised, under-delivered – machine translation worked only word by word

  – NLP triggered the first backlash against research funding
  – NLP gave AI a bad name before AI had a name.


Page 5: Natural language processing (Python)

Search engines

Site recommendations

Spam filtering

Knowledge bases and expert systems

Automated customer support systems

Sentiment analysis

Consumer behavior analysis

Natural language processing is heavily used throughout all web technologies


Page 6: Natural language processing (Python)

Context

Little sister: What’s your name?

Me: Uhh….Sumit..?

Sister: Can you spell it?

Me: Yes. S-U-M-I-T…..

Page 7: Natural language processing (Python)

Sister: WRONG! It’s spelled “I-T”


Page 8: Natural language processing (Python)

Ambiguity

“I shot the man with ice cream.”
- A man with ice cream was shot
- A man had ice cream shot at him


Page 9: Natural language processing (Python)

Methods :-

1) POS Tagging :-

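•In corpus linguistics, part-of-speech (POS) tagging is also called grammatical tagging or word-category disambiguation.
•It is the process of marking up a word in a text as corresponding to a particular part of speech.
•POS tagging is harder than just having a list of words and their parts of speech, because the same word can take different parts of speech in different contexts.
•Consider the example:

  The sailor dogs the barmaid.

  Here “dogs” is a verb, although it is more commonly a plural noun.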
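The point about word lists can be illustrated with a minimal sketch. The tiny lexicon below is hypothetical (it is not NLTK data, and the helper names are made up for this example): a pure lookup can only return every tag a word may take, so it cannot decide that “dogs” is a verb in the example sentence.

```python
# Hypothetical toy lexicon: each word maps to every part of speech it can take.
LEXICON = {
    "the": {"DT"},
    "sailor": {"NN"},
    "dogs": {"NNS", "VBZ"},  # plural noun or third-person singular verb
    "barmaid": {"NN"},
}


def lookup_tags(sentence):
    """Return (word, possible_tags) pairs using nothing but the word list."""
    words = sentence.lower().rstrip(".").split()
    return [(w, sorted(LEXICON.get(w, {"UNKNOWN"}))) for w in words]


def ambiguous_words(sentence):
    """Words for which the lexicon alone cannot settle on a single tag."""
    return [w for w, tags in lookup_tags(sentence) if len(tags) > 1]


if __name__ == "__main__":
    for word, tags in lookup_tags("The sailor dogs the barmaid."):
        print(word, tags)
    print("needs context:", ambiguous_words("The sailor dogs the barmaid."))
```

A real tagger resolves such cases by looking at the neighbouring words, which is what the NLTK demo later in the deck relies on.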


Page 10: Natural language processing (Python)

2) Parsing :-

•In the context of NLP, parsing may be defined as the process of assigning structural descriptions to sequences of words in a natural language.
•Applications of parsing include:

  – Simple phrase finding, e.g. for proper name recognition
  – Full semantic analysis of text, e.g. information extraction or machine translation
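As a rough illustration of “simple phrase finding”, the sketch below treats runs of capitalized words as candidate proper names. This regular-expression heuristic is an assumption made for the example; it is far cruder than a real parser and is not taken from the talk.

```python
import re

# Two or more consecutive capitalized words, e.g. "Warren Weaver".
NAME_PATTERN = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b")


def find_name_candidates(text):
    """Return candidate proper names found by the capitalization heuristic."""
    return NAME_PATTERN.findall(text)


if __name__ == "__main__":
    sample = ("In 1949 Warren Weaver wrote a memo while "
              "Birkbeck College worked on dictionaries.")
    print(find_name_candidates(sample))
```

The heuristic misses single-word names and is confused by capitalized sentence openers, which is exactly where structural analysis starts to matter.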


Page 11: Natural language processing (Python)

3) Speech Recognition:-

•It is concerned with mapping a continuous speech signal into a sequence of recognized words.
•The main problems are variation in pronunciation and homophones (words that sound alike).
•In the sentence “the boy eats”, a bi-gram model is sufficient to model the relationship between “boy” and “eats”.
•In “The boy on the hill by the lake in our town… eats”, the two words are too far apart for a bi-gram model to relate them.
•Bi-gram and tri-gram models have proven extremely effective for obvious, short-range dependencies.
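A minimal sketch of a bi-gram model makes the limitation concrete. The three-sentence corpus and the function names are made up for illustration; probabilities are plain maximum-likelihood estimates with no smoothing.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count how often each word is immediately followed by each other word."""
    follows = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            follows[prev][nxt] += 1
    return follows


def bigram_prob(follows, prev, nxt):
    """Maximum-likelihood estimate of P(nxt | prev); 0.0 if prev was never seen."""
    total = sum(follows[prev].values())
    return follows[prev][nxt] / total if total else 0.0


# Toy corpus, invented for this example.
corpus = [
    "the boy eats",
    "the boy eats rice",
    "the boy on the hill by the lake in our town eats",
]
model = train_bigrams(corpus)

if __name__ == "__main__":
    # Adjacent words: the dependency is captured directly.
    print("P(eats | boy)  =", bigram_prob(model, "boy", "eats"))
    # In the long sentence the model only ever sees "town eats";
    # the link between "boy" and "eats" is invisible to it.
    print("P(eats | town) =", bigram_prob(model, "town", "eats"))
```

In the long sentence the only evidence for “eats” comes from the word right before it, so the subject “boy” plays no part in the prediction.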


Page 12: Natural language processing (Python)

4) Machine Translation:-

•It involves translating text from one natural language (NL) to another.
•Approaches:

  – Simple word substitution, with some changes in ordering to account for grammatical differences
  – Translating the source language into an underlying meaning representation, or interlingua
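The first approach can be sketched in a few lines. The French-to-English dictionary, the word classes, and the single reordering rule below are all hypothetical and chosen only to make the idea visible; this is not how a production translation system works.

```python
# Hypothetical toy dictionary: source word -> (target word, coarse word class).
DICTIONARY = {
    "le": ("the", "DET"),
    "chat": ("cat", "NOUN"),
    "noir": ("black", "ADJ"),
    "mange": ("eats", "VERB"),
    "poisson": ("fish", "NOUN"),
}


def translate(sentence):
    """Word-by-word substitution followed by one reordering rule."""
    # Step 1: substitute each word; unknown words are passed through unchanged.
    tagged = [DICTIONARY.get(w, (w, "UNK")) for w in sentence.lower().split()]
    # Step 2: French puts many adjectives after the noun, English before it,
    # so swap every NOUN ADJ pair.
    i = 0
    while i < len(tagged) - 1:
        if tagged[i][1] == "NOUN" and tagged[i + 1][1] == "ADJ":
            tagged[i], tagged[i + 1] = tagged[i + 1], tagged[i]
            i += 2
        else:
            i += 1
    return " ".join(word for word, _ in tagged)


if __name__ == "__main__":
    print(translate("le chat noir mange le poisson"))
```

Every new grammatical difference needs another hand-written rule, which is one reason word-by-word systems fell short of the early promises mentioned in the history slide.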


Page 13: Natural language processing (Python)

5) Stemming:-

•In linguistic morphology and information retrieval, stemming is the process of reducing inflected words to their stem.

•The stem need not be identical to the morphological root of the word.

•Many search engines treat words with the same stem as synonyms, as a kind of query broadening; this process is called conflation.
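A toy suffix-stripping stemmer shows both points: stems need not be real words, and matching on stems broadens a query. The suffix list and function names are invented for this sketch; this is not the Porter algorithm or any NLTK stemmer.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # toy list, tried in this order


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def conflate(query, documents):
    """Return the documents containing any word that shares the query's stem."""
    target = toy_stem(query)
    return [d for d in documents if any(toy_stem(w) == target for w in d.split())]


if __name__ == "__main__":
    print([toy_stem(w) for w in ("fishing", "fished", "fishes", "argued")])
    docs = ["he fished all day", "fresh fish", "stock prices"]
    print(conflate("fishing", docs))
```

Here “argued” is reduced to a stem that is not itself an English word, and a query for “fishing” also retrieves documents that only mention “fished” or “fish”.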


Page 14: Natural language processing (Python)

Natural Language Toolkit

•NLTK is a leading platform for building Python programs to work with human language data.
•It provides a suite of text-processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.
•At the time of this talk it was available only for Python 2.5 – 2.6 (current NLTK releases target Python 3): http://www.nltk.org/download
•Install with `easy_install nltk` (or `pip install nltk` on current setups)
•Prerequisites

  – NumPy
  – SciPy


Page 15: Natural language processing (Python)

Let’s dive into some code!


Page 16: Natural language processing (Python)

Part of Speech Tagging

    # word_tokenize and pos_tag need NLTK's tokenizer and tagger data,
    # which can be fetched with nltk.download()
    from nltk import pos_tag, word_tokenize

    sentence1 = 'this is a demo that will show you how to detects parts of speech with little effort using NLTK!'

    # Split the sentence into tokens, then tag each token
    tokenized_sent = word_tokenize(sentence1)
    print(pos_tag(tokenized_sent))

Output:

    [('this', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('demo', 'NN'), ('that', 'WDT'), ('will', 'MD'), ('show', 'VB'), ('you', 'PRP'), ('how', 'WRB'), ('to', 'TO'), ('detects', 'NNS'), ('parts', 'NNS'), ('of', 'IN'), ('speech', 'NN'), ('with', 'IN'), ('little', 'JJ'), ('effort', 'NN'), ('using', 'VBG'), ('NLTK', 'NNP'), ('!', '.')]

Page 17: Natural language processing (Python)

Fun things to Try


Page 18: Natural language processing (Python)

Eliza is there to talk to you all day! What human could ever do that for you??

Feeling lonely?

    from nltk.chat import eliza
    eliza.eliza_chat()  # starts the chatbot

Output:

    Therapist
    ---------
    Talk to the program by typing in plain English, using normal upper-
    and lower-case letters and punctuation.  Enter "quit" when done.
    ========================================================================
    Hello.  How are you feeling today?


Page 19: Natural language processing (Python)

Let’s build something even cooler


Page 20: Natural language processing (Python)

Let's write a spam filter!

A program that analyzes legitimate emails (“ham”) as well as “spam” and learns the features that are associated with each.

Once trained, we should be able to run this program on incoming mail and have it reliably label each one with the appropriate category.


Page 21: Natural language processing (Python)

1. Extract one of the archives from the site into your working directory.

2. Create a Python script; let's call it “spambot.py”.

3. Your working directory should contain the “spambot” script and the folders “spam” and “ham”.

    from nltk import (word_tokenize, WordNetLemmatizer, NaiveBayesClassifier,
                      classify, MaxentClassifier)
    from nltk.corpus import stopwords
    import random
    import os, glob, re
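The transcript does not show how the emails are read into the `spamtexts` and `hamtexts` lists used on the following slide. A minimal sketch, assuming each email is a separate `.txt` file inside the “spam” and “ham” folders (the file extension, the encoding handling, and the helper name `load_texts` are assumptions):

```python
import glob
import os


def load_texts(folder):
    """Read every .txt file in `folder` and return the contents as a list of strings."""
    texts = []
    for path in sorted(glob.glob(os.path.join(folder, "*.txt"))):
        # errors="ignore" drops bytes that are not valid UTF-8 instead of failing
        with open(path, "r", encoding="utf-8", errors="ignore") as handle:
            texts.append(handle.read())
    return texts


# Folder names follow step 3 above; a missing folder simply yields an empty list.
spamtexts = load_texts("spam")
hamtexts = load_texts("ham")

if __name__ == "__main__":
    print(len(spamtexts), "spam and", len(hamtexts), "ham emails loaded")
```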

“Spambot.py” (continued)

Page 22: Natural language processing (Python)

    # Label each item with the appropriate label and store them as a list of tuples
    mixedemails = [(email, 'spam') for email in spamtexts]
    mixedemails += [(email, 'ham') for email in hamtexts]

    # Let's give them a nice shuffle
    random.shuffle(mixedemails)

From this list of shuffled but labeled emails, we will define a “feature extractor” which outputs a feature set that our program can use to statistically compare spam and ham.

“Spambot.py” (continued)


Page 23: Natural language processing (Python)

    # Assumed setup, not shown on the slide: a lemmatizer and a stop-word list
    wordlemmatizer = WordNetLemmatizer()
    commonwords = stopwords.words('english')

    def email_features(sent):
        features = {}
        # Normalize words
        wordtokens = [wordlemmatizer.lemmatize(word.lower()) for word in word_tokenize(sent)]
        for word in wordtokens:
            # If the word is not a stop-word then let's consider it a "feature"
            if word not in commonwords:
                features[word] = True
        return features

    # Let's run each email through the feature extractor and collect it in a "featureset" list
    featuresets = [(email_features(n), g) for (n, g) in mixedemails]
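The transcript uses a `classifier` object on the next slide without showing how it is built. Given the imports, it was presumably trained with NLTK's `NaiveBayesClassifier.train(featuresets)`, but that is an assumption. The sketch below is a from-scratch stand-in, not NLTK's implementation, showing what a Naive Bayes classifier does with feature sets of this shape; the class name and the four hand-made training examples are hypothetical.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Illustrative Naive Bayes over {word: True} feature dicts."""

    def __init__(self, featuresets):
        # How many training emails carry each label
        self.label_counts = Counter(label for _, label in featuresets)
        # For each label, in how many emails each word appears
        self.word_counts = defaultdict(Counter)
        for features, label in featuresets:
            self.word_counts[label].update(features.keys())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}

    def classify(self, features):
        total = sum(self.label_counts.values())
        best_label, best_score = None, float("-inf")
        for label, n_docs in self.label_counts.items():
            # log prior plus log likelihood of each known word (add-one smoothing)
            score = math.log(n_docs / total)
            for word in features:
                if word in self.vocab:
                    seen = self.word_counts[label][word]
                    score += math.log((seen + 1) / (n_docs + 2))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical hand-made feature sets in the shape email_features() returns.
featuresets = [
    ({"free": True, "viagra": True, "offer": True}, "spam"),
    ({"free": True, "money": True, "click": True}, "spam"),
    ({"meeting": True, "tomorrow": True, "agenda": True}, "ham"),
    ({"lunch": True, "tomorrow": True, "offer": True}, "ham"),
]
classifier = TinyNaiveBayes(featuresets)

if __name__ == "__main__":
    print(classifier.classify({"free": True, "click": True}))
    print(classifier.classify({"agenda": True, "lunch": True}))
```

With real data it is common to hold back part of the shuffled feature sets for testing, so that accuracy is measured on emails the classifier has not seen.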

“Spambot.py” (continued)


Page 24: Natural language processing (Python)

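    # `classifier` is assumed to be an already trained classifier
    # (training is not shown in this transcript)
    while True:
        # Python 3; on Python 2 use raw_input() instead of input()
        featset = email_features(input("Enter text to classify: "))
        print(classifier.classify(featset))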

We can now directly input new email and have it classified as either Spam or Ham

“Spambot.py” (continued)


Page 25: Natural language processing (Python)

Applications :-

•Conversion from natural language to computer language and vice versa.
•Translation from one human language to another.
•Automatic checking of grammar and writing technique.
•Spam filtering
•Sentiment analysis


Page 26: Natural language processing (Python)

Conclusion:-

NLP plays a very important role in new human-machine interfaces. Products built on NLP technologies are already very advanced and very useful.

But there are many limitations. For example, the language we speak is highly ambiguous, which makes it very difficult to understand and analyze. Also, with so many languages spoken all over the world, it is very difficult to design a system that is 100% accurate.

These problems get more complicated when we think of different people speaking the same language with different styles.

Intelligent systems are being experimented with right now. We can expect to see improved applications of NLP in the near future.


Page 27: Natural language processing (Python)

References :-

•http://en.wikipedia.org/wiki/Natural_language_processing
•An overview of Empirical Natural Language Processing, by Eric Brill and Raymond J. Mooney
•Investigating classification for natural language processing tasks, by Ben W. Medlock, University of Cambridge
•Natural Language Processing and Machine Learning using Python, by Shankar Ambady
•http://www.slideshare.net
•http://www.doc.ic.ac.uk/~nd/surprise_97/journal/vol1/hks/index.html
•http://googlesystem.blogspot.in/2012/10/google-improves-results-for-natural/
•Code from: https://github.com/shanbady/NLTK-Boston-Python-Meetup


Page 28: Natural language processing (Python)

Any Questions ???


Page 29: Natural language processing (Python)

Thank You...


Reach me @:

facebook.com/sumit12dec


9590 285 524