introduction to nltk

Getting Started with NLTK

An Introduction to NLTK

Sreejith [email protected]

@tweet2sree

FOSSMeet 2011, NIC Calicut

06 February 2011

Sreejith S Getting Started with NLTK

Just a word about me !!

Working in Natural Language Processing (NLP), Machine Learning and Text Mining

Active member of ilugcbe, http://ilugcbe.techstud.org

Works for 365Media Pvt. Ltd., Coimbatore, India.

@tweet2sree , [email protected]

Introduction - NLP

Natural Language Processing

NLP is an interdisciplinary subject

Computer Science

Linguistics

Statistics etc...

NLP is a subfield of Artificial Intelligence

NLP - Any kind of computer manipulation of natural language.

It is a rapidly developing field of study

Everyday applications of NLP

Handwriting recognition, machine translation, question-answering systems, spell checkers, grammar checkers etc...

Introduction - NLP

Natural Language Toolkit (NLTK)

A collection of Python programs, modules, data sets and tutorials to support research and development in Natural Language Processing (NLP)

Written by Steven Bird, Edward Loper and Ewan Klein

NLTK is

Free and Open source

Easy to use

Modular

Well documented

Simple and extensible

http://www.nltk.org

What You Will Learn

How simple programs can help you manipulate and analyze language data, and how to write these programs

How key concepts from NLP and linguistics are used to describe and analyze language

How data structures and algorithms are used in NLP

How language data is stored in standard formats, and how data can be used to evaluate the performance of NLP techniques

Installation of NLTK

Make sure that Python 2.4, 2.5 or 2.6 is available on your system

Install the Python Tkinter package

Install NumPy, Matplotlib, Prover9, MaltParser and MegaM

Download NLTK and install it

If you are installing NLTK from source:

Download http://nltk.googlecode.com/files/nltk-2.0b9.zip

Unzip it; this creates the folder nltk-2.0b9

Open a terminal and cd into this folder

As super user, run python setup.py install

To install data

Start the Python interpreter

>>> import nltk

>>> nltk.download()

Now you are ready to play with NLTK !!!

NLTK Modules

NLTK Modules                    Functionality
nltk.corpus                     Corpus
nltk.tokenize, nltk.stem        Tokenizers, stemmers
nltk.collocations               t-test, chi-squared, mutual information
nltk.tag                        n-gram, backoff, Brill, HMM, TnT
nltk.classify, nltk.cluster     Decision tree, naive Bayes, k-means
nltk.chunk                      Regex, n-gram, named entity
nltk.parse                      Parsing
nltk.sem, nltk.inference        Semantic interpretation
nltk.metrics                    Evaluation metrics
nltk.probability                Probability & estimation
nltk.app, nltk.chat             Applications

Let us start the game

To access the data needed to work through the examples in the book

Some basic exercises from the book

Concordance

>>> from nltk.book import *

>>> text1.concordance("monstrous")
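
To make concrete what a concordance view shows, here is a minimal sketch in plain Python 3 that needs no NLTK. The function name `concordance_lines`, the sample tokens and the context width are assumptions made for illustration; NLTK's own `concordance` method formats its output differently.

```python
def concordance_lines(tokens, target, width=3):
    """Return each occurrence of `target` with `width` tokens of context per side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

# Made-up sample text, split into tokens on whitespace
tokens = "the whale is a monstrous beast and a monstrous size it has".split()

for left, word, right in concordance_lines(tokens, "monstrous"):
    # Right-align the left context so the keyword lines up in one column
    print("{:>20} [{}] {}".format(left, word, right))
```

Every hit is shown with its neighbours, which is why a concordance is useful for seeing how a word is actually used.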

Similar

>>> text1.similar("monstrous")

Dispersion plot - Positional information

>>> text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])

>>> text4.dispersion_plot(["and", "to", "of", "with", "the"])

What is it !!! Why ???
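
A dispersion plot is drawn from word offsets: for each target word, the positions at which it occurs in the token sequence. The sketch below computes only that positional information in Python 3, without plotting. The helper `word_offsets` and the toy token list are assumptions for illustration, not NLTK's implementation.

```python
def word_offsets(tokens, targets):
    """Map each target word to the list of token positions where it occurs."""
    offsets = {target: [] for target in targets}
    for position, token in enumerate(tokens):
        if token in offsets:
            offsets[token].append(position)
    return offsets

# Made-up sample text
tokens = "we the people value freedom and the people defend freedom".split()

offsets = word_offsets(tokens, ["freedom", "people", "duties"])
for word, positions in offsets.items():
    print(word, positions)
```

Content words tend to cluster in particular stretches of a text, while function words such as "the" or "of" are spread evenly, which is what the second plot in the slide is meant to show.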

Continued...

Generate

>>> text3.generate()

Counting Vocabulary

>>> len(text3)

List of distinct words, sorted in dictionary order.

>>> sorted(set(text3))

Count occurrence of a particular word in a text

>>> text3.count("and")

What percentage of the text is taken up by a specific word

>>> 100.0 * text3.count("and") / len(text3)
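
The same counts can be reproduced on any list of tokens. This is a minimal Python 3 sketch using only the standard library; the sample sentence is made up, and `Counter` stands in for the counting that the text object does. Under Python 2, `/` between two integers truncates, so the percentage is computed with a float.

```python
from collections import Counter

# Made-up sample text, split into tokens on whitespace
tokens = ("in the beginning god created the heaven and the earth "
          "and the earth was without form").split()

total = len(tokens)                # number of tokens, like len(text3)
vocabulary = sorted(set(tokens))   # distinct words in dictionary order
counts = Counter(tokens)           # word -> number of occurrences

and_count = counts["and"]          # like text3.count("and")
and_percent = 100.0 * and_count / total   # float arithmetic, no truncation

print(total, len(vocabulary), and_count, and_percent)
```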

Collocation & Bigram

Collocation

A collocation is a sequence of words that occur together unusually often, e.g. red wine, strong tea. But strong computer is not a collocation.

>>> text4.collocations()

Bigrams

List of word pairs

>>> text = "sreejith is talking about NLTK"

>>> wordlist = text.split()

>>> bigrams(wordlist)

What happens if we pass the raw string instead of the word list?

>>> bigrams(text)
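
A bigram function pairs each item of a sequence with its successor, so the result depends on what the sequence contains. The sketch below is Python 3, and `pairs` is a hypothetical stand-in for `bigrams` built on `zip`. It shows that a list of words yields word pairs, while a raw string yields character pairs.

```python
def pairs(sequence):
    """Return consecutive pairs of items, the way a bigram function does."""
    items = list(sequence)
    return list(zip(items, items[1:]))

text = "sreejith is talking about NLTK"

word_pairs = pairs(text.split())   # list of words -> word bigrams
char_pairs = pairs(text)           # a string is a sequence of characters

print(word_pairs)
print(char_pairs[:3])
```

So splitting (or tokenizing) the text first is what makes the bigrams meaningful.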

Work with our own data

Populate our own corpus with NLTK and analyse it

>>> from nltk.corpus import PlaintextCorpusReader as ptr

>>> corpus = "/home/developer/Desktop/Sreejith"

>>> wordlist = ptr(corpus, ".*")

>>> wordlist.fileids()

Let us find out how to count the number of characters, words and sentences in the corpus

>>> for fid in wordlist.fileids():
...     print len(wordlist.raw(fid))
...     print len(wordlist.words(fid))
...     print len(wordlist.sents(fid))

Continued...

Plotting a conditional frequency distribution

>>> words = text.split()

>>> big = bigrams(words)

>>> gd = nltk.ConditionalFreqDist(big)

>>> gd.plot()

Tabulate CFD

>>> gd.tabulate()

Plot frequency distribution

>>> fdist = FreqDist(text1)

>>> fdist.plot(50,cumulative=True)
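
A conditional frequency distribution built from bigrams counts, for each first word (the condition), how often each second word follows it. Here is a minimal sketch in Python 3 using only the standard library. `cfd` is a plain dictionary of counters, an assumption standing in for `nltk.ConditionalFreqDist`, and the sample sentence is made up.

```python
from collections import Counter, defaultdict

text = "the cat sat on the mat and the cat slept"
words = text.split()

# condition (first word) -> Counter of the words that follow it
cfd = defaultdict(Counter)
for first, second in zip(words, words[1:]):
    cfd[first][second] += 1

# Tabulate: one line per condition, in alphabetical order
for condition in sorted(cfd):
    print(condition, dict(cfd[condition]))
```

Plotting or tabulating the distribution is then just a matter of presenting these per-condition counts.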

Normalizing Text

Stemming

Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base or root form, generally a written word form

>>> porter = nltk.PorterStemmer()

>>> word = "running"

>>> porter.stem(word)

>>> lancaster = nltk.LancasterStemmer()

>>> lancaster.stem(word)
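
To compare the two stemmers side by side, a short script along these lines can be used. It is a sketch written for Python 3 and assumes NLTK is installed (no corpus data is needed for these stemmers); the word list is an arbitrary illustration.

```python
from nltk.stem import LancasterStemmer, PorterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

# Arbitrary sample words; each stemmer applies its own suffix-stripping rules
for word in ["running", "runs", "cats"]:
    print(word, porter.stem(word), lancaster.stem(word))
```

A stem is not guaranteed to be a real word; the two algorithms can also disagree, with Lancaster generally the more aggressive of the two.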

Normalizing Text

Lemmatization

Stemming + make sure that the resulting form is a known word in a dictionary

>>> wnl = nltk.WordNetLemmatizer()

>>> wnl.lemmatize(word)
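
The idea of "strip a suffix, then check the result against a dictionary" can be sketched without WordNet. The following Python 3 toy is not NLTK's `WordNetLemmatizer`: the suffix rules and the tiny `LEXICON` are assumptions invented purely for illustration.

```python
# Hypothetical mini-dictionary of known base forms
LEXICON = {"walk", "cat", "study", "church"}

# Hypothetical (suffix, replacement) rules, tried in order
SUFFIX_RULES = [("ies", "y"), ("es", ""), ("s", ""), ("ing", "")]

def toy_lemmatize(word):
    """Return the first rule result found in LEXICON, else the word unchanged."""
    if word in LEXICON:
        return word
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[:-len(suffix)] + replacement
            if candidate in LEXICON:
                return candidate
    return word

for word in ["walking", "studies", "churches", "cats", "bus"]:
    print(word, "->", toy_lemmatize(word))
```

Unlike a pure stemmer, the dictionary check leaves "bus" alone, because "bu" is not a known word.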

POS Tagging

The process of classifying words into their parts of speech and labeling them accordingly is known as part-of-speech tagging, or POS tagging

>>> text = nltk.word_tokenize("we are attending FOSS meet at NIC calicut")

>>> nltk.pos_tag(text)

Parsing

Sentence Parsing

Analyzing sentence structure and creating a parse tree

>>> sentence = [("the", "DT"), ("little", "JJ"),
... ("yellow", "JJ"), ("dog", "NN"), ("barked", "VBD"),
... ("at", "IN"), ("the", "DT"), ("cat", "NN")]

>>> grammar = "NP: {<DT>?<JJ>*<NN>}"

>>> cp = nltk.RegexpParser(grammar)

>>> result = cp.parse(sentence)

>>> print result

>>> result.draw()
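
Besides printing or drawing the tree, the chunks can be pulled out programmatically. The sketch below is written for Python 3 and assumes a current NLTK 3 release, where subtrees expose `label()` (an assumption that does not hold for the NLTK 2.0 beta used in the slides). The chunk rule reads: an optional determiner, any number of adjectives, then a noun.

```python
from nltk import RegexpParser

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

# NP chunk: optional determiner, any number of adjectives, then a noun
grammar = "NP: {<DT>?<JJ>*<NN>}"
tree = RegexpParser(grammar).parse(sentence)

# Collect the words of every NP subtree, skipping the root node
noun_phrases = [" ".join(word for word, tag in subtree.leaves())
                for subtree in tree.subtrees()
                if subtree.label() == "NP"]
print(noun_phrases)
```

Words that match no rule, such as "barked" and "at", stay directly under the root of the tree.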

Machine Translation

Babelizer Shell

Translating a sentence from its source language to a specified language. NLTK provides the babelize shell

>>> babelize_shell()

Babel> hello how are you?

Babel> german

Babel> run

Just try Google Translator, Yahoo babelfish

What can you do?

Contribute to NLTK

GSOC

NLP Training

Real time research

Reference

Steven Bird, Edward Loper and Ewan Klein, Natural Language Processing with Python

Jacob Perkins, Python Text Processing with NLTK 2.0 Cookbook

http://www.nltk.org

Questions

And finally...

Sreejith.S
