introduction to nlp

I256

Applied Natural Language

Processing

Fall 2009

Lecture 1

Introduction

Barbara Rosario

Introductions

• Barbara Rosario

– iSchool alumni (class 2005)

– Intel Labs

• Gopal Vaswani

– iSchool master student (class 2010)

• You?

Today

• Introductions

• Administrivia

• What is NLP

• NLP Applications

• Why is NLP difficult

• Corpus-based statistical approaches

• Course goals

• What we’ll do in this course

Administrivia

• http://courses.ischool.berkeley.edu/i256/f09/index.html

• Books:– Foundations of Statistical NLP, Manning and Schuetze, MIT press

– Natural Language Processing with Python, Bird, Klein & Loper, O'Reilly. (also on line)

– See Web site for additional resources

• Work:– Individual coding assignments (Python & NLTK-Natural Language

Toolkit) (4 or 5)

– Final group project

– Participation

• Office hours:– Barbara: Thursday 2:00-3:00 in Room 6

– Gopal: Tuesday 2:00-3:00 in Room 6 (to be confirmed)

http://courses.ischool.berkeley.edu/i256/f09/index.html

http://nlp.stanford.edu/fsnlp/

http://oreilly.com/catalog/9780596516499/

Administrivia

• Communication:– My email: [email protected]

– Gopal : [email protected]

– Mailing list: [email protected]• Send an email to [email protected] with subscribe

i256 in the body

• Through intranet

– Announcements: webpage and/or mailing list and/or Bspace (TBA)

– Public discussion: Bspace(?)

• Related course: Statistical Natural Language Processing, Spring 2009, CS 288 – http://www.cs.berkeley.edu/~klein/cs288/sp09/

– Instructor: Dan Klein

– Much more emphasis on statistical algorithms

• Questions?

mailto:[email protected]




http://www.cs.berkeley.edu/~klein/cs288/sp09/

Natural Language Processing

• Fundamental goal: deep understand of broad language

– Not just string processing or keyword matching!

• End systems that we want to build:

– Ambitious: speech recognition, machine translation, question answering…

– Modest: spelling correction, text categorization…

Slide taken from Klein’s course: UCB CS 288 spring 09

Example: Machine Translation

NLP applications

• Text Categorization– Classify documents by topics, language, author, spam filtering,

information retrieval (relevant, not relevant), sentiment classification (positive, negative)

• Spelling & Grammar Corrections

• Information Extraction

• Speech Recognition

• Information Retrieval– Synonym Generation

• Summarization

• Machine Translation

• Question Answering

• Dialog Systems– Language generation

Why NLP is difficult

• A NLP system needs to answer the question “who did what to whom”

• Language is ambiguous – At all levels: lexical, phrase, semantic

– Iraqi Head Seeks Arms• Word sense is ambiguous (head, arms)

– Stolen Painting Found by Tree• Thematic role is ambiguous: tree is agent or location?

– Ban on Nude Dancing on Governor’s Desk• Syntactic structure (attachment) is ambiguous: is the ban or

the dancing on the desk?

– Hospitals Are Sued by 7 Foot Doctors• Semantics is ambiguous : what is 7 foot?


• Language is flexible

– New words, new meanings

– Different meanings in different contexts

• Language is subtle

– He arrived at the lecture

– He chuckled at the lecture

– He chuckled his way through the lecture

– **He arrived his way through the lecture

• Language is complex!


• MANY hidden variables

– Knowledge about the world

– Knowledge about the context

– Knowledge about human communication techniques

• Can you tell me the time?

• Problem of scale

– Many (infinite?) possible words, meanings, context

• Problem of sparsity

– Very difficult to do statistical analysis, most things

(words, concepts) are never seen before

• Long range correlations


• Key problems:

– Representation of meaning

– Language presupposes knowledge about the

world

– Language only reflects the surface of

meaning

– Language presupposes communication

between people

Meaning

• What is meaning?– Physical referent in the real world

– Semantic concepts, characterized also by relations.

• How do we represent and use meaning– I am Italian

• From lexical database (WordNet)

• Italian =a native or inhabitant of Italy Italy = republic in southern Europe [..]

– I am Italian• Who is “I”?

– I know she is Italian/I think she is Italian• How do we represent “I know” and “I think”

• Does this mean that I is Italian? What does it say about the “I” and about the person speaking?

– I thought she was Italian• How do we represent tenses?

Today

• Introductions

• Administrivia

• What is NLP

• NLP Applications

• Why is NLP difficult

• Corpus-based statistical approaches

• Course goals

• What we’ll do in this course

Corpus-based statistical

approaches to tackle NLP problem– How can a can a machine understand these

differences? • Decorate the cake with the frosting

• Decorate the cake with the kids

– Rules based approaches, i.e. hand coded syntactic constraints and preference rules:

• The verb decorate require an animate being as agent

• The object cake is formed by any of the following, inanimate entities (cream, dough, frosting…..)

– Such approaches have been showed to be time consuming to build, do not scale up well and are very brittle to new, unusual, metaphorical use of language

• To swallow requires an animate being as agent/subject and a physical object as object

– I swallowed his story

– The supernova swallowed the planet


approaches to tackle NLP problem

• A Statistical NLP approach seeks to solve these problems by automatically learning lexical and structural preferences from text collections (corpora)

• Statistical models are robust, generalize well and behave gracefully in the presence of errors and new data.

• So: – Get large text collections

– Compute statistics over those collections

– (The bigger the collections, the better the statistics)


approaches to tackle NLP problem• Decorate the cake with the frosting

• Decorate the cake with the kids

• From (labeled) corpora we can learn that:#(kids are subject/agent of decorate) > #(frosting is subject/agent of

decorate)

• From (UN-labeled) corpora we can learn that:#(“the kids decorate the cake”) >> #(“the frosting decorates the cake”)

#(“cake with frosting”) >> #(“cake with kids”)

etc..

• Given these “facts” we then need a statistical model for the attachment decision

Corpus-based statistical approaches

to tackle NLP problem

• Topic categorization: classify the document

into semantics topicsDocument 1

The U.S. swept into the Davis Cup final

on Saturday when twins Bob and Mike

Bryan defeated Belarus's Max Mirnyi

and Vladimir Voltchkov to give the

Americans an unsurmountable 3-0 lead

in the best-of-five semi-final tie.

Topic = sport

Document 2

One of the strangest, most relentless

hurricane seasons on record reached

new bizarre heights yesterday as the

plodding approach of Hurricane

Jeanne prompted evacuation orders

for hundreds of thousands of

Floridians and high wind warnings

that stretched 350 miles from the

swamp towns south of Miami to the

historic city of St. Augustine.

Topic = disaster

Corpus-based statistical approaches

to tackle NLP problem

• Topic categorization: classify the document

into semantics topicsDocument 1 (sport)

The U.S. swept into the Davis

Cup final on Saturday when twins

Bob and Mike Bryan …

Document 2 (disasters)

One of the strangest, most

relentless hurricane seasons on

record reached new bizarre heights

yesterday as….

• From (labeled) corpora we can learn that:#(sport documents containing word Cup) > #(disaster documents

containing word Cup) -- feature

• We then need a statistical model for the topic

assignment


approaches to tackle NLP problem

• Feature extractions (usually linguistics

motivated)

• Statistical models

• Data (corpora, labels, linguistic

resources)

Goals of this Course

• Learn about the problems and possibilities of natural language analysis:– What are the major issues?

– What are the major solutions?

• At the end you should:– Agree that language is difficult, interesting and important

– Be able to assess language problems

• Know which solutions to apply when, and how

• Feel some ownership over the algorithms

– Be able to use software to tackle some NLP language tasks

– Know language resources

– Be able to read papers in the field

What We’ll Do in this Course

• Linguistic Issues

– What are the range of language phenomena?

– What are the knowledge sources that let us

disambiguate?

– What representations are appropriate?

• Applications

• Software (Python and NLTK)

• Statistical Modeling Methods

What We’ll Do in this Course

• Read books, research papers and tutorials

• Final project

– Your own ideas or chose from some suggestions I will

provide

– We’ll talk later during the couse about ideas/methods

etc. but come talk to me if you have already some

ideas

• Learn Python

• Learn/use NLTK (Natural Language ToolKit) to

try out various algorithms

Python - Simple yet powerful

The zen of python : http://www.python.org/dev/peps/pep-0020/

• Very clear, readable syntax

• Strong introspection capabilities

– http://www.ibm.com/developerworks/linux/library/l-pyint.html(recommended)

• Intuitive object orientation

• Natural expression of procedural code

• Full modularity, supporting hierarchical packages

• Exception-based error handling

• Very high level dynamic data types

• Extensive standard libraries and third party modules for virtually every task

– Excellent functionality for processing linguistic data.

– NLTK is one such extensive third party module.

Source : python.org

Python

http://www.python.org/dev/peps/pep-0020/

http://www.ibm.com/developerworks/linux/library/l-pyint.html

Source : nltk.org

Language processing task NLTK modules Functionality

Accessing corpora nltk.corpus standardized interfaces to corpora and lexicons

String processing nltk.tokenize, nltk.stem tokenizers, sentence tokenizers, stemmers

Collocation discovery nltk.collocations t-test, chi-squared, point-wise mutual information

Part-of-speech tagging nltk.tag n-gram, backoff, Brill, HMM, TnT

Classification nltk.classify, nltk.cluster decision tree, maximum entropy, naive Bayes, EM, k-means

Chunking nltk.chunk regular expression, n-gram, named-entity

Parsing nltk.parse chart, feature-based, unification, probabilistic, dependency

This is not the complete list

NLTK• NLTK defines an infrastructure that can be used to build NLP programs in Python.

• It provides basic classes for representing data relevant to natural language

processing.

• Standard interfaces for performing tasks such as part-of-speech tagging, syntactic

parsing, and text classification.

• Standard implementations for each task which can be combined to solve complex

problems.

Resources:

• Download at http://www.nltk.org/download

• Getting started with NLTK Chapter 1

• NLP and NLTK talk at google http://www.youtube.com/watch?v=keXW_5-llD0

http://www.nltk.org/download

Topics

• Text corpora & other resources

• Words (Morphology, tokenization, stemming, part-of-speech, WSD, collocations, lexical acquisition, language models)

• Syntax: chunking, PCFG & parsing

• Statistical models (esp. for classification)

• Applications – Text classification

– Information extraction

– Machine translation

– Semantic Interpretation

– Sentiment Analysis

– QA / Summarization

– Information retrieval

Next Assignment

• Due before next class Tue Sep 1– No turn-in

• Download and install Python and NLTK

• Download the NLTK Book Collection, as described at the beginning of chapter 1 of the book Natural Language Processing with Python

• Readings:

– Chapter 1 of the book Natural Language Processing with Python

– Chapter 3 of Foundations of Statistical NLP

• Next class:

– Linguistic Essentials

– Python Introduction

http://nltk.googlecode.com/svn/trunk/doc/book/ch01.html

http://www.nltk.org/book

http://nltk.googlecode.com/svn/trunk/doc/book/ch01.html

http://www.nltk.org/book

introduction to nlp

Engineering