i256: applied natural language processing

40
1 I256: Applied Natural Language Processing Marti Hearst August 28, 2006

Upload: graceland

Post on 25-Feb-2016

47 views

Category:

Documents


1 download

DESCRIPTION

I256: Applied Natural Language Processing. Marti Hearst August 28, 2006. Today. Motivation: SIMS student projects Course Goals Why NLP is difficult How to solve it? Corpus-based statistical approaches What we’ll do in this course. ANLP Motivation: SIMS Masters Projects. HomeSkim (2005) - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: I256:  Applied Natural Language Processing

1

I256: Applied Natural Language Processing

Marti HearstAugust 28, 2006 

 

Page 2: I256:  Applied Natural Language Processing

2

Today

Motivation: SIMS student projectsCourse GoalsWhy NLP is difficultHow to solve it? Corpus-based statistical approachesWhat we’ll do in this course

Page 3: I256:  Applied Natural Language Processing

3

ANLP Motivation:SIMS Masters Projects

HomeSkim (2005)Chan, Lib, Mittal, PoonApartment search mashupExtracted fields from Craigslist listingshttp://www.ischool.berkeley.edu/programs/masters/projects/2006/homeskim

Orpheus (2004)Maury, Viswanathan, YangTool for discovering new and independent recording artistsExtracted artists, links, reviews from music websiteshttp://groups.sims.berkeley.edu/orpheus/demo/orpheus_demo.swf

Breaking Story (2002)Reffell, Fitzpatrick, AydelottSummarize trends in news feedsCategories and entities assigned to all news articles

http://dream.sims.berkeley.edu/newshound/

Page 4: I256:  Applied Natural Language Processing

4

Page 5: I256:  Applied Natural Language Processing

5

Page 6: I256:  Applied Natural Language Processing

6

HomeSkim Craigslist Analysis

Page 7: I256:  Applied Natural Language Processing

7

Page 8: I256:  Applied Natural Language Processing

8

Page 9: I256:  Applied Natural Language Processing

9

Page 10: I256:  Applied Natural Language Processing

10

Page 11: I256:  Applied Natural Language Processing

11

Goals of this CourseLearn about the problems and possibilities of natural language analysis:

What are the major issues?What are the major solutions?

– How well do they work?– How do they work (but to a lesser extent than CS 295-4)?

At the end you should:Agree that language is subtle and interesting!Feel some ownership over the algorithmsBe able to assess NLP problems

– Know which solutions to apply when, and howBe able to read papers in the field

Page 12: I256:  Applied Natural Language Processing

12

Today

Motivation: SIMS student projectsCourse GoalsWhy NLP is difficult.How to solve it? Corpus-based statistical approaches.What we’ll do in this course.

Page 13: I256:  Applied Natural Language Processing

13

We’ve past the year 2001,but we are not closeto realizing the dream(or nightmare …)

Page 14: I256:  Applied Natural Language Processing

Dave Bowman: “Open the pod bay doors, HAL”

HAL 9000: “I’m sorry Dave. I’m afraid I can’t do that.”

Page 15: I256:  Applied Natural Language Processing

15

Why is NLP difficult?Computers are not brains

There is evidence that much of language understanding is built-in to the human brain

Computers do not socializeMuch of language is about communicating with people

Key problems:Representation of meaningLanguage presupposed knowledge about the worldLanguage only reflects the surface of meaningLanguage presupposes communication between people

Page 16: I256:  Applied Natural Language Processing

16Adapted from Robert Berwick's 6.863J

Hidden Structure

English plural pronunciationToy + s toyz ; add zBook + s books ; add sChurch + s churchiz ; add iz

Box + s boxiz ; add izSheep + s sheep ; add nothing

What about new words?Bach + ‘s boxs ; why not boxiz?

Page 17: I256:  Applied Natural Language Processing

17

Language subtleties Adjective order and placement

A big black dogA big black scary dogA big scary dogA scary big dogA black big dog

AntonymsWhich sizes go together?

– Big and little– Big and small– Large and small

Large and little

Page 18: I256:  Applied Natural Language Processing

18Adapted from Robert Berwick's 6.863J

World Knowledge is subtle

He arrived at the lecture.He chuckled at the lecture.

He arrived drunk.He chuckled drunk.

He chuckled his way through the lecture.He arrived his way through the lecture.

Page 19: I256:  Applied Natural Language Processing

19Adapted from Robert Berwick's 6.863J

Words are ambiguous(have multiple meanings)

I know that.

I know that block.

I know that blocks the sun.

I know that block blocks the sun.

Page 20: I256:  Applied Natural Language Processing

20

How can a machine understand these differences?Get the cat with the gloves.

Page 21: I256:  Applied Natural Language Processing

21

How can a machine understand these differences?Get the sock from the cat with the gloves.Get the glove from the cat with the socks.

Page 22: I256:  Applied Natural Language Processing

22

How can a machine understand these differences?Decorate the cake with the frosting.Decorate the cake with the kids.Throw out the cake with the frosting.Throw out the cake with the kids.

Page 23: I256:  Applied Natural Language Processing

23Adapted from Robert Berwick's 6.863J

Headline AmbiguityIraqi Head Seeks ArmsJuvenile Court to Try Shooting DefendantTeacher Strikes Idle KidsKids Make Nutritious SnacksBritish Left Waffles on Falkland IslandsRed Tape Holds Up New BridgesBush Wins on Budget, but More Lies AheadHospitals are Sued by 7 Foot Doctors

Page 24: I256:  Applied Natural Language Processing

24

The Role of MemorizationChildren learn words quickly

Around age two they learn about 1 word every 2 hours.(Or 9 words/day)Often only need one exposure to associate meaning with word

– Can make mistakes, e.g., overgeneralization“I goed to the store.”

Exactly how they do this is still under studyAdult vocabulary

Typical adult: about 60,000 wordsLiterate adults: about twice that.

Page 25: I256:  Applied Natural Language Processing

25

The Role of Memorization

Dogs can do word association too!Rico, a border collie in GermanyKnows the names of each of 100 toys Can retrieve items called out to him with over 90% accuracy. Can also learn and remember the names of unfamiliar toys after just one encounter, putting him on a par with a three-year-old child.

http://www.nature.com/news/2004/040607/pf/040607-8_pf.html

Page 26: I256:  Applied Natural Language Processing

26Adapted from Robert Berwick's 6.863J

But there is too much to memorize!

establishestablishment

the church of England as the official state church.disestablishmentantidisestablishmentantidisestablishmentarianantidisestablishmentarianism

is a political philosophy that is opposed to the separation of church and state.

Page 27: I256:  Applied Natural Language Processing

27

Rules and MemorizationCurrent thinking in psycholinguistics is that we use a combination of rules and memorization

However, this is very controversialMechanism:

If there is an applicable rule, apply itHowever, if there is a memorized version, that takes precedence. (Important for irregular words.)

– Artists paint “still lifes” Not “still lives”

– Past tense of think thought blink blinked

This is a simplification; for more on this, see Pinker’s “Words and Rules” and “The Language Instinct”.

Page 28: I256:  Applied Natural Language Processing

28

Representation of Meaning

I know that block blocks the sun.How do we represent the meanings of “block”?How do we represent “I know”? How does that differ from “I know that.”? Who is “I”?How do we indicate that we are talking about earth’s sun vs. some other planet’s sun?When did this take place? What if I move the block? What if I move my viewpoint? How do we represent this?

Page 29: I256:  Applied Natural Language Processing

29

How to tackle these problems?The field was stuck for quite some time.A new approach started around 1990

Well, not really new, but the first time around, in the 50’s, they didn’t have the text, disk space, or GHz

Main idea: combine memorizing and rulesHow to do it:

Get large text collections (corpora)Compute statistics over the words in those collections

Surprisingly effectiveEven better now with the Web

Page 30: I256:  Applied Natural Language Processing

30

Example ProblemGrammar checker example:Which word to use? <principal> <principle>

Solution: look at which words surround each use:I am in my third year as the principal of Anamosa High School. School-principal transfers caused some upset.This is a simple formulation of the quantum mechanical uncertainty principle. Power without principle is barren, but principle without power is futile. (Tony Blair)

Page 31: I256:  Applied Natural Language Processing

31

Using Very, Very Large CorporaKeep track of which words are the neighbors of each spelling in well-edited text, e.g.:

Principal: “high school”Principle: “rule”

At grammar-check time, choose the spelling best predicted by the surrounding words.

Surprising results:Log-linear improvement even to a billion words!Getting more data is better than fine-tuning algorithms!

Page 32: I256:  Applied Natural Language Processing

32

The Effects of LARGE DatasetsFrom Banko & Brill ‘01

Page 33: I256:  Applied Natural Language Processing

33Adapted from Robert Berwick's 6.863J

Real-World Applications of NLPSpelling Suggestions/CorrectionsGrammar CheckingSynonym GenerationInformation ExtractionText CategorizationAutomated Customer ServiceSpeech Recognition (limited)Machine TranslationIn the (near?) future:

Question AnsweringImproving Web Search Engine resultsAutomated Metadata AssignmentOnline Dialogs

Page 34: I256:  Applied Natural Language Processing

34

Automatic Help Desk Translation at Microsoft

Page 35: I256:  Applied Natural Language Processing

35

Synonym Generation

Page 36: I256:  Applied Natural Language Processing

36

Synonym Generation

Page 37: I256:  Applied Natural Language Processing

37

What We’ll Do in this CourseRead research papers and tutorialsUse NLTK-lite (Natural Language ToolKit) to try out various algorithms

Some homeworks will be to do some NLTK exercisesWe’ll do some of this in class

Adopt a large text collectionUse a wide range of NLP techniques to process itWork together to build a useful resource.

Final projectEither extend work on the collection we’ve been using, or chose from some suggestions I provide.Your own idea only if I think it is very likely to work well.

Page 38: I256:  Applied Natural Language Processing

38

Assignment for ThursdayLoad python 2.4.3 and NLTK-lite onto your computersRead Chapter 1 of Jurafsky & MartinRead NLTK-lite tutorial sections 2.1-2.4

Page 39: I256:  Applied Natural Language Processing

39

PythonA terrific programming language

InterpretedObject-orientedEasy to interface to other things (web, DBMS, TK)Good stuff from: java, lisp, tcl, perlEasy to learn

– I learned it this summer by reading Learning PythonFUN!

Assignment for Thursday:Load python 2.4.3 and NLTK-lite onto your computersRead Chapter 1 of Jurafsky & MartinRead NLTK-lite tutorial Chapter 2 sections

Page 40: I256:  Applied Natural Language Processing

40

Questions?