
Page 1

Natural Language Processing using PYTHON

(with NLTK, scikit-learn and Stanford NLP APIs)

VIVA Institute of Technology, 2016

Instructor: Diptesh Kanojia, Abhijit Mishra

Supervisor: Prof. Pushpak Bhattacharyya
Center for Indian Language Technology

Department of Computer Science and Engineering

Indian Institute of Technology Bombay

email: {diptesh,abhijitmishra,pb}@cse.iitb.ac.in

URL: http://www.cse.iitb.ac.in/~{diptesh,abhijitmishra,pb}

http://www.cfilt.iitb.ac.in

Page 2

Session 1 (Introduction to NLP, Shallow Parsing and Deep Parsing)

Introduction to python and NLTK

Text Tokenization, POS tagging and chunking using NLTK.

Constituency and Dependency Parsing using NLTK and Stanford Parser

Session 2 (Named Entity Recognition, Coreference Resolution)

NER using NLTK

Coreference Resolution using NLTK and Stanford CoreNLP tool

Session 3 (Meaning Extraction, Deep Learning)

WordNets and WordNet-API

Other Lexical Knowledge Networks – VerbNet and FrameNet

Roadmap


Page 3

SESSION 1 (INTRODUCTION TO NLP, SHALLOW PARSING AND DEEP PARSING)


• Introduction to Python and NLTK
• Text Tokenization, Morphological Analysis, POS tagging and chunking using NLTK
• Constituency and Dependency Parsing using NLTK

Expected duration: 15 mins

Page 4

Why Python?


As far as I know, you cannot execute a Python program (compiled to bytecode) on every machine, such as on Windows or on Linux, without modification.

You are incorrect. Python bytecode is cross-platform. See "Is Python bytecode version-dependent? Is it platform-dependent?" on Stack Overflow.

However, it is not compatible across versions: Python 2.6 cannot execute Python 2.5 files. So while it is cross-platform, it is not generally useful as a distribution format.

But why does Python need both a compiler and an interpreter?

Speed. Strict interpretation is slow. Virtually every "interpreted" language actually compiles the source code into some sort of internal representation so that it doesn't have to repeatedly parse the code. In Python's case, it saves this internal representation to disk so that it can skip the parsing/compiling step the next time it needs the code.
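A minimal sketch of the compile-and-cache behaviour described above (the file name hello.py is our own hypothetical example, not part of the course material):

import py_compile

# Compile a source file to bytecode explicitly; the interpreter does the same
# thing implicitly the first time a module is imported.
py_compile.compile('hello.py')   # assumes a hello.py exists in this directory

# On Python 2.7 this leaves hello.pyc next to the source; on Python 3 the cached
# bytecode goes into a __pycache__/ directory. Later imports reuse the cached
# bytecode instead of re-parsing the source.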

Page 5

A programming language with strong similarities to Perl and C, with powerful typing and object-oriented features.

Commonly used for producing HTML content on websites (e.g. Instagram, Bitbucket, Mozilla and many more websites are built on the Python/Django framework).

Great for text processing (e.g. Powerful RegEx tools).

Useful built-in types (lists, dictionaries, generators, iterators).

Parallel computing (Multi-processing, Multi-threading APIs)

Map-Reduce facilities, lambda functions

Clean/Readable syntax, Lots of open-source standard libraries

Code Reusability

DEMO (basics.py)
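The basics.py script itself is not reproduced in these slides; the following is a rough, hypothetical sketch of the constructs such a demo would cover (lists, dictionaries, generators, lambda, regular expressions):

import re

# Built-in container types
langs = ['Hindi', 'Marathi', 'English']    # list
counts = {'Hindi': 2, 'English': 5}        # dictionary

# Generator expression: values are produced lazily
squares = (n * n for n in range(5))
print(list(squares))

# lambda with map (the "Map-Reduce facilities" mentioned above)
lengths = list(map(lambda w: len(w), langs))
print(lengths)

# Regular expressions for text processing
print(re.findall(r'\d+', 'Session 1 starts at 10 am on 9 January 2016'))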

Introduction to python


Page 6

Download and installation instructions at:

https://www.python.org/download/

Windows/Mac systems require installation

Pre-installed on recent Linux distributions (Ubuntu, Fedora, SUSE, etc.)

We will use python 2.7.X versions.

Installing Python


Page 7

“Dive into Python” http://diveintopython.org/

The Official Python Tutorial: https://docs.python.org/2/tutorial/

The Python Quick Reference http://rgruet.free.fr/PQR2.3.html

Python Tutorials


Page 8

IDLE (Windows)

Vi/Emacs (Linux)

Geany (Windows/Linux/Mac)

Pydev plugin for Eclipse IDE (Windows/Linux/Mac)

Notepad++ (Windows)

Useful IDE/Text-Editors


Page 9

NumPy (Mathematical Computing, Advanced mathematical functionalities)

Matplotlib (Numerical plotting library, useful in data analysis)

Scipy (Library for scientific computation)

Scikit-learn (Machine Learning/Data-mining library)

PIL (Python library for Image Processing)

PySpeech (Library for speech processing and text-to-speech conversion)

XML/LXML (XML Parsing and Processing)

NLTK (Natural Language Processing)

And many more…

https://wiki.python.org/moin/UsefulModules

Useful Python Libraries


Page 10

Developed by Steven Bird, Edward Loper and colleagues at the University of Pennsylvania (see the 2002 and 2006 papers below).

Open source python modules, datasets and tutorials

Papers:
Bird, Steven. "NLTK: The Natural Language Toolkit." Proceedings of the COLING/ACL Interactive Presentation Sessions. Association for Computational Linguistics, 2006.
Loper, Edward, and Steven Bird. "NLTK: The Natural Language Toolkit." Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, Volume 1. Association for Computational Linguistics, 2002.

The Natural Language Toolkit (NLTK)


Page 11

1. Code: corpus readers, tokenizers, stemmers, taggers, chunkers, parsers, WordNet, ... (50k lines of code)

2. Corpora: >30 annotated data sets widely used in natural language processing (>300 MB of data)

3. Documentation: a 400-page book, articles, reviews, API documentation

Components of NLTK (Bird et al., 2006)


Page 12

Corpus Readers

Tokenizers

Stemmers

Taggers

Parsers

WordNet

Semantic Interpretation

Clusterers

Evaluation Metrics

1. Code


Page 13

Brown Corpus

Carnegie Mellon Pronouncing Dictionary

CoNLL 2000 Chunking Corpus

Project Gutenberg Selections

NIST 1999 Information Extraction: Entity Recognition Corpus

US Presidential Inaugural Address Corpus

Indian Language POS-Tagged Corpus

Floresta Portuguese Treebank

Prepositional Phrase Attachment Corpus

SENSEVAL 2 Corpus

Sinica Treebank Corpus Sample

Universal Declaration of Human Rights Corpus

Stopwords Corpus

TIMIT Corpus Sample

Treebank Corpus Sample

2. Corpora

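Most of these data sets are accessed through the corpus readers in nltk.corpus; a minimal sketch, assuming the relevant packages have already been fetched with nltk.download():

from nltk.corpus import brown, stopwords

print(brown.words()[:10])                        # raw tokens from the Brown Corpus
print(brown.tagged_sents(categories='news')[0])  # one POS-tagged sentence
print(stopwords.words('english')[:5])            # entries from the Stopwords Corpus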

Page 14

Books:

Natural Language Processing with Python - Steven Bird, Edward Loper, Ewan Klein

Python Text Processing with NLTK 2.0 Cookbook – Jacob Perkins

Included in NLTK:

Installation instructions

API Documentation: describes every module, interface, class, and method

3. Documentation


Page 15

Install NLTK

Follow instructions at http://www.nltk.org/install.html

Installers for Windows, Linux and Mac OS available

Check installation

Execute python command through “shell” (Linux/Mac) or command prompt “cmd”

(Windows).

• $ python
• >>> import nltk

The interpreter should import “nltk” without showing any error.

Download NLTK data (corpora):

>>> nltk.download()
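For a non-interactive setup, individual data packages can also be fetched by name; a small sketch (the package names below are the standard NLTK identifiers used by the later demos):

import nltk

nltk.download('punkt')      # sentence/word tokenizer models (Session 1)
nltk.download('wordnet')    # WordNet data (lemmatizer, Session 3)
print(nltk.__version__)     # confirm that the installation works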

NLTK- How to?


Page 16

NLTK modules


NLP Task | NLTK Modules | Functionality
Accessing Corpora | nltk.corpus | Standardized interfaces to corpora and lexicons
String Processing | nltk.tokenize, nltk.stem | Tokenizers, sentence tokenizers, stemmers
Collocation Discovery | nltk.collocations | t-test, chi-squared, point-wise mutual information
POS Tagging | nltk.tag | n-gram, backoff, Brill, HMM, TnT
Chunking | nltk.chunk | Regular expression, n-gram, named entity
Parsing | nltk.parse | Chart, feature-based, unification, probabilistic, dependency
Classification | nltk.classify, nltk.cluster | Decision tree, maximum entropy, naive Bayes, EM, k-means
Semantic Interpretation | nltk.sem, nltk.inference | Lambda calculus, first-order logic, model checking
Evaluation Metrics | nltk.metrics | Precision, recall, agreement coefficients
Probability Estimation | nltk.probability | Frequency distributions, smoothed probability distributions
Applications | nltk.app | Graphical concordancer, parsers, WordNet browser
Linguistic Fieldwork | nltk.toolbox | Manipulate data in SIL Toolbox format

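As an illustration of one row of the table (nltk.probability), a frequency distribution over tokens; a minimal sketch, not one of the course scripts:

from nltk import FreqDist
from nltk.tokenize import word_tokenize

text = "the cat sat on the mat and the dog sat too"
fd = FreqDist(word_tokenize(text))
print(fd.most_common(3))    # the three most frequent tokens with their counts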

Page 17

The process of splitting a string into a list of tokens (words, punctuation marks, etc.).

For most languages, whitespace separates two adjacent words.

Exceptions (source: Wikipedia):

For Chinese and Japanese, sentences are delimited but words are not.

For Thai, phrases and sentences are delimited but words are not.

Text Tokenization

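A minimal tokenization sketch with NLTK (requires the 'punkt' models; the example sentence is our own):

from nltk.tokenize import sent_tokenize, word_tokenize

text = "Mr. Smith went to Washington. He arrived at 9 p.m."
print(sent_tokenize(text))   # two sentences, despite the period in "Mr."
print(word_tokenize(text))   # words and punctuation marks as separate tokens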

Page 18

Description: http://www.nltk.org/api/nltk.tokenize.html

Demo:

LineTokenizer – Tokenize string into lines

PunktWordTokenizer (statistical) –
• Tokenizing based on unsupervised ML algorithms.
• Model parameters are learnt through training on a large corpus of abbreviations, collocations, and words that start sentences.

RegexpTokenizer – Tokenization based on regular expressions

SExprTokenizer – To find parenthesized expressions in a string

TreebankWordTokenizer – Tokenization as per Penn-Treebank standards

Text Tokenization – NLTK Tokenizers

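A short sketch of the tokenizer classes listed above (the course's own demo script is not reproduced here):

from nltk.tokenize import LineTokenizer, RegexpTokenizer, TreebankWordTokenizer

text = "Don't split this badly.\nSecond line here."
print(LineTokenizer().tokenize(text))            # one token per line
print(RegexpTokenizer(r'\w+').tokenize(text))    # keep runs of word characters only
print(TreebankWordTokenizer().tokenize(text))    # Penn Treebank conventions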

Page 19

DEMO- Stemmers

Lancaster Stemmer

Porter Stemmer

Regexp Stemmer

Snowball Stemmer

DEMO - Lemmatizers

WordNet based Lemmatizers

Script: morphological_analyzer.py

NLTK Morphological Analyzers

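A sketch of what morphological_analyzer.py plausibly demonstrates (the script itself is not shown here; the lemmatizer needs the 'wordnet' data package):

from nltk.stem import PorterStemmer, LancasterStemmer, WordNetLemmatizer

word = 'running'
print(PorterStemmer().stem(word))                    # 'run'
print(LancasterStemmer().stem(word))                 # 'run' (generally more aggressive)
print(WordNetLemmatizer().lemmatize(word, pos='v'))  # 'run', looked up via WordNet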

Page 20

The process of sequentially labeling words in a sentence with

their corresponding part of speech tags.

Demo – NLTK POS Taggers (pos_tagger.py)
Unigram Tagger (based on prior probability)

Brill Tagger (Rule Based)

Regexp Tagger (Using regular expressions for tagging)

HMM based Tagger (HMM-Viterbi based)

Stanford tagger (Using Stanford module)

NLTK recommended tagger

NLTK- Part of Speech Tagging

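Two of the routes listed above, sketched minimally: nltk.pos_tag uses NLTK's recommended tagger, and a UnigramTagger is trained on the bundled treebank sample (the 'punkt', tagger and 'treebank' data packages are assumed to be downloaded):

import nltk

sent = nltk.word_tokenize("The quick brown fox jumps over the lazy dog")
print(nltk.pos_tag(sent))              # NLTK's recommended tagger

train = nltk.corpus.treebank.tagged_sents()[:3000]
unigram = nltk.UnigramTagger(train)    # tags each word with its most frequent tag
print(unigram.tag(sent))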

Page 21

Writing grammar

Rule based constituency parsing

RecursiveDescent Parser

ShiftReduce Parser

DEMO- Statistical Parsers

Probabilistic Context Free Grammar (PCFG)

• Stanford parser

Probabilistic Dependency Parsing

• Malt Parser

• Stanford Parser

Script: parser_demo.py

Parsers

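A toy constituency-parsing sketch in the spirit of parser_demo.py (the grammar below is our own illustrative CFG, not the course grammar):

import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the'
N -> 'dog' | 'cat'
V -> 'chased'
""")
parser = nltk.RecursiveDescentParser(grammar)
for tree in parser.parse("the dog chased the cat".split()):
    print(tree)   # prints the parse tree(s) licensed by the grammar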

Page 22

SESSION 2 (NAMED ENTITY RECOGNITION, COREFERENCE RESOLUTION)


Expected duration: 15 mins

• NER using NLTK
• Coreference Resolution using NLTK and Stanford CoreNLP tool

Page 23

DEMO – Named entity chunking using NLTK

Using Stanford CoreNLP tool for Coreference Resolution

Download and installation instruction at

http://stanfordnlp.github.io/CoreNLP/

Python Wrapper for CoreNLP

https://github.com/dasmith/stanford-corenlp-python

• DEMO – Coreference Resolution using Stanford CoreNLP

• Scripts: coreference_resolution.py, named_entity_chunking.py

NER and Coreference Resolution using NLTK

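A minimal named-entity chunking sketch in the spirit of named_entity_chunking.py (the script itself is not shown; the 'punkt', POS-tagger, 'maxent_ne_chunker' and 'words' data packages are assumed to be installed):

import nltk

sent = "Barack Obama visited the Indian Institute of Technology Bombay in Mumbai."
tokens = nltk.word_tokenize(sent)
tagged = nltk.pos_tag(tokens)
print(nltk.ne_chunk(tagged))   # a Tree whose subtrees mark PERSON / ORGANIZATION / GPE chunks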

Page 24

SESSION 3 (MEANING EXTRACTION, DEEP LEARNING)


Expected duration: 10 mins

• WordNets and WordNet API
• Other Lexical Knowledge Networks – VerbNet and FrameNet

Page 25

DEMO - NLTK WordNet (wordnet.py)

Finding all the synonym sets (SynSets) of a word for all possible POS tags.

Finding all the SynSets if the POS tag is known.

Finding hypernyms and hyponyms of a synset.

Finding similarities between two words.

WordNet

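A condensed sketch of these WordNet operations through nltk.corpus.wordnet (the word choices are our own; the 'wordnet' data must be downloaded):

from nltk.corpus import wordnet as wn

print(wn.synsets('bank'))                # all synsets of 'bank', across POS tags
print(wn.synsets('bank', pos=wn.NOUN))   # restricted to noun senses
dog = wn.synset('dog.n.01')
print(dog.hypernyms())                   # more general synsets
print(dog.hyponyms()[:3])                # a few more specific synsets
print(dog.path_similarity(wn.synset('cat.n.01')))   # similarity between two synsets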

Page 26

DEMO- Conceptnet (conceptnet.py)

Using Divisi API (since it is not available in NLTK)

Using ConceptNet for finding attributes of a concept.

Using ConceptNet for word similarity computation

DEMO- VerbNet (framenet_verbnet.py)

Obtaining verb classes from VerbNet

DEMO – FrameNet

Listing all the frames in FrameNet

Obtaining properties of a particular frame.

Other Lexical Networks – ConceptNet, VerbNet and FrameNet

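VerbNet and FrameNet are also exposed as NLTK corpora; a minimal sketch in the spirit of framenet_verbnet.py (requires the 'verbnet' and 'framenet_v15' data packages; ConceptNet is omitted here because, as noted above, it needs the separate Divisi API):

from nltk.corpus import verbnet
from nltk.corpus import framenet as fn

print(verbnet.classids('run'))        # VerbNet class ids that contain the verb 'run'
print(len(fn.frames()))               # number of frames in FrameNet
frame = fn.frames(r'(?i)motion')[0]   # frames whose name matches 'motion'
print(frame.name)
print(sorted(frame.FE.keys())[:5])    # a few frame elements of that frame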

Page 27

THANK YOU

9th January, 2016