CSCI 5832 Natural Language Processing
Lecture 2, Jim Martin, 1/18/08
(martin/Csci5832/Slides/lecture_02.pdf)





Today 1/17

• Wrap up last time
• Knowledge of language
• Ambiguity
• Models and algorithms
• Generative paradigm
• Finite-state methods

Course Material

• We’ll be intermingling discussions of:
  Linguistic topics
    E.g. morphology, syntax, discourse structure
  Formal systems
    E.g. regular languages, context-free grammars
  Applications
    E.g. machine translation, information extraction


Linguistics Topics

• Word-level processing
• Syntactic processing
• Lexical and compositional semantics
• Discourse processing

Topics: Techniques

• Finite-state methods
• Context-free methods
• Augmented grammars
  Unification
  Lambda calculus
• First-order logic
• Probability models
• Supervised machine learning methods

Topics: Applications

• Small
  Spelling correction
  Hyphenation
• Medium
  Word-sense disambiguation
  Named entity recognition
  Information retrieval
• Large
  Question answering
  Conversational agents
  Machine translation
• Stand-alone
• Enabling applications
• Funding/business plans


Just English?

• The examples in this class will for the most part be English, only because it happens to be what I know.
• This leads to an (over?)-emphasis on certain topics (syntax) to the detriment of others (morphology), due to the properties of English.
• We’ll cover other languages primarily in the context of machine translation.

Commercial World

• Lots of exciting stuff going on…

Google Translate


Web Q/A


Summarization

• Current web-based Q/A is limited to returning simple fact-like (factoid) answers (names, dates, places, etc.).
• Multi-document summarization can be used to address more complex kinds of questions. Circa 2002: What’s going on with the Hubble?


NewsBlaster Example

The U.S. orbiter Columbia has touched down at the Kennedy Space Center after an 11-day mission to upgrade the Hubble observatory. The astronauts on Columbia gave the space telescope new solar wings, a better central power unit and the most advanced optical camera. The astronauts added an experimental refrigeration system that will revive a disabled infrared camera. "Unbelievable that we got everything we set out to do accomplished," shuttle commander Scott Altman said. Hubble is scheduled for one more servicing mission in 2004.

Weblog Analytics

• Text mining weblogs, discussion forums, message boards, user groups, and other forms of user-generated media.
  Product marketing information
  Political opinion tracking
  Social network analysis
  Buzz analysis (what’s hot, what topics are people talking about right now)

Web Analytics


Categories of Knowledge

• Phonology
• Morphology
• Syntax
• Semantics
• Pragmatics
• Discourse

Each kind of knowledge has associated with it an encapsulated set of processes that make use of it. Interfaces are defined that allow the various levels to communicate. This usually leads to a pipeline architecture.


Ambiguity

• I made her duck
• Sources
  Lexical (syntactic)
    Part of speech
    Subcat
  Lexical (semantic)
  Syntactic
    Different parses


Dealing with Ambiguity

• Four possible approaches:
  1. Tightly coupled interaction among processing levels; knowledge from other levels can help decide among choices at ambiguous levels.
  2. Pipeline processing that ignores ambiguity as it occurs and hopes that other levels can eliminate incorrect structures.


  3. Probabilistic approaches based on making the most likely choices.
  4. Don’t do anything; maybe it won’t matter.
     We’ll leave when the duck is ready to eat.
     The duck is ready to eat now.
     Does the ambiguity matter?

Models and Algorithms

• By models I mean the formalisms that are used to capture the various kinds of linguistic knowledge we need.
• Algorithms are then used to manipulate the knowledge representations needed to tackle the task at hand.


Models

• State machines
• Rule-based approaches
• Logical formalisms
• Probabilistic models

Algorithms

• Many of the algorithms that we’ll study will turn out to be transducers: algorithms that take one kind of structure as input and output another.
• Unfortunately, ambiguity makes this process difficult. This leads us to employ algorithms that are designed to handle ambiguity of various kinds.

Paradigms

• In particular…
  State-space search
    To manage the problem of making choices during processing when we lack the information needed to make the right choice
  Dynamic programming
    To avoid having to redo work during the course of a state-space search
    • CKY, Earley, Minimum Edit Distance, Viterbi, Baum-Welch
  Classifiers
    Machine-learning-based classifiers that are trained to make decisions based on features extracted from the local context


State Space Search

• States represent pairings of partially processed inputs with partially constructed representations.
• Goals are inputs paired with completed representations that satisfy some criteria.
• As with most interesting problems, the spaces are normally too large to exhaustively explore.
  We need heuristics to guide the search
  Criteria to trim the space

Dynamic Programming

• Don’t do the same work over and over.
• Avoid this by building and making use of solutions to sub-problems that must be invariant across all parts of the space.
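Minimum edit distance (listed on the Paradigms slide) is the standard first example: the cheapest way to turn one prefix into another is invariant across the rest of the computation, so each sub-problem is solved exactly once and stored in a table. A minimal sketch, using the substitution cost of 2 from the textbook:

```python
def min_edit_distance(source, target, ins=1, dele=1, sub=2):
    """Edit distance via dynamic programming. D[i][j] holds the
    cheapest cost of turning source[:i] into target[:j]; each cell
    is computed once and reused by the cells below and to its right."""
    n, m = len(source), len(target)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + dele          # delete everything
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + ins           # insert everything
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if source[i - 1] == target[j - 1] else sub
            D[i][j] = min(D[i - 1][j] + dele,      # delete
                          D[i][j - 1] + ins,       # insert
                          D[i - 1][j - 1] + cost)  # substitute/copy
    return D[n][m]
```

With these costs the textbook's intention/execution example comes out to 8.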


Administrative Stuff

• Mailing list
  If you’re registered, you’re on it with your CU account
  I sent out mail this morning. Check to see if you’ve received it
• The textbook is now in the bookstore


First Assignment

• Two parts
  1. Answer the following question: How many words do you know?
  2. Write a Python program that takes a newspaper article (plain text that I will provide) and returns the number of:
     Words
     Sentences
     Paragraphs
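For the counting part, a rough sketch of one possible starting point (the unit definitions here are naive on purpose; deciding what should count is the real exercise):

```python
import re

def count_units(text):
    """Naive counts for the three units. Each definition is debatable
    (is 'Dr.' a sentence end? is '$12.50' one word?), which is part of
    what the assignment is meant to surface."""
    words = len(text.split())                      # whitespace-separated tokens
    sentences = len(re.findall(r'[.!?]+(?=\s|$)', text))
    paragraphs = len([p for p in re.split(r'\n\s*\n', text.strip())
                      if p.strip()])               # blank-line-separated blocks
    return words, sentences, paragraphs
```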


First Assignment Details

• For the first part I want…
  An actual number and an explanation of how you arrived at the answer
  Hardcopy. Bring to class.
• For the second part, email me your code and your answers to the test text that I will send out shortly before the HW is due.

First Assignment

• In doing this assignment you should think ahead… having access to the words, sentences and paragraphs will be useful in future assignments.


Getting Going

• The next two lectures will cover material from Chapters 2 and 3
  Finite state automata
  Finite state transducers
  English morphology

Regular Expressions and Text Searching

• Everybody does it
  Emacs, vi, Perl, grep, etc.
• Regular expressions are a compact textual representation of a set of strings representing a language.

Example

• Find me all instances of the word “the” in a text.
  /the/
  /[tT]he/
  /\b[tT]he\b/
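A quick check of the three attempts in Python's `re` (any of the tools above would behave similarly; the sample sentence is mine):

```python
import re

text = "The other day the theater near them was then closed."

v1 = re.findall(r'the', text)          # matches inside 'other', 'theater', 'them', 'then'
v2 = re.findall(r'[tT]he', text)       # now also matches sentence-initial 'The'
v3 = re.findall(r'\b[tT]he\b', text)   # word boundaries keep only the real articles
```

The first two patterns over-match (other, theater, them, then); the anchored version returns only the two genuine occurrences of the word.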


Errors

• The process we just went through was based on fixing two kinds of errors:
  Matching strings that we should not have matched (there, then, other)
    False positives (Type I)
  Not matching things that we should have matched (The)
    False negatives (Type II)

Errors

• We’ll be telling the same story for many tasks, all semester. Reducing the error rate for an application often involves two antagonistic efforts:
  Increasing accuracy, or precision (minimizing false positives)
  Increasing coverage, or recall (minimizing false negatives)
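These two efforts are quantified as precision and recall. The tallies below are hypothetical, just to make the trade-off concrete:

```python
def precision(tp, fp):
    """Of the strings we matched, what fraction were right?
    High precision = few false positives (Type I errors)."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of the strings we should have matched, what fraction did we get?
    High recall = few false negatives (Type II errors)."""
    return tp / (tp + fn)

# Hypothetical tallies for the /[tT]he/ pattern on some text:
# 40 real 'the's matched, 10 spurious hits (then, there, other),
# 2 occurrences missed.
p = precision(40, 10)   # 0.8
r = recall(40, 2)
```

Tightening the pattern (adding \b) raises precision but risks lowering recall; loosening it does the reverse.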


Finite State Automata

• Regular expressions can be viewed as a textual way of specifying the structure of finite-state automata.
• FSAs and their probabilistic relatives are at the core of what we’ll be doing all semester.
• They also conveniently (?) correspond to exactly what linguists say we need for morphology and parts of syntax. Coincidence?


FSAs as Graphs

• Let’s start with the sheep language from the text: /baa+!/

Sheep FSA

• We can say the following things about this machine
  It has 5 states
  b, a, and ! are in its alphabet
  q0 is the start state
  q4 is an accept state
  It has 5 transitions

But note

• There are other machines that correspond to this same language
• More on this one later


More Formally

• You can specify an FSA by enumerating the following things:
  The set of states: Q
  A finite alphabet: Σ
  A start state
  A set of accept/final states
  A transition function that maps Q×Σ to Q
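That five-tuple maps directly onto a plain data structure. A sketch (the dict layout is my own, not from the slides), using the sheep machine /baa+!/:

```python
# An FSA as a five-tuple: Q, Sigma, q0, F, and delta.
sheep_fsa = {
    "states":   {0, 1, 2, 3, 4},                 # Q
    "alphabet": {"b", "a", "!"},                 # Sigma
    "start":    0,                               # q0
    "accept":   {4},                             # F
    # delta maps (state, symbol) -> next state; pairs that are
    # absent simply have no transition (the machine rejects).
    "delta": {(0, "b"): 1, (1, "a"): 2, (2, "a"): 3,
              (3, "a"): 3, (3, "!"): 4},
}
```

Note the counts match the earlier slide: 5 states, 5 transitions, alphabet {b, a, !}.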


About Alphabets

• Don’t take that word too narrowly; it just means we need a finite set of symbols in the input.
• These symbols can and will stand for bigger objects that can have internal structure.

Dollars and Cents


Yet Another View

• The guts of FSAs are ultimately represented as tables

  State |  b  |  a  |  !  |  ε
    0   |  1  |     |     |
    1   |     |  2  |     |
    2   |     | 2,3 |     |
    3   |     |     |  4  |
    4   |     |     |     |

  (Blank cells mean no transition; state 4 is the accept state.)

Recognition

• Recognition is the process of determining if a string should be accepted by a machine
• Or… it’s the process of determining if a string is in the language we’re defining with the machine
• Or… it’s the process of determining if a regular expression matches a string
• Those all amount to the same thing in the end

Recognition

• Traditionally (Turing’s idea), this process is depicted with a tape.


Recognition

• Simply a process of starting in the start state
• Examining the current input
• Consulting the table
• Going to a new state and updating the tape pointer
• Until you run out of tape

D-Recognize

1/18/0848

Key Points

• Deterministic means that at each point in processing there is always one unique thing to do (no choices).
• D-Recognize is a simple table-driven interpreter.
• The algorithm is universal for all unambiguous regular languages. To change the machine, you just change the table.
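The D-RECOGNIZE figure itself did not survive extraction, but a minimal table-driven interpreter in the same spirit might look like this (a sketch; the dict-of-pairs table encoding is an assumption on my part):

```python
def d_recognize(tape, delta, start, accept):
    """Deterministic recognition: one current state, one left-to-right
    pass over the tape, never a choice. A missing table entry means
    there is no legal move, so reject immediately."""
    state = start
    for symbol in tape:
        if (state, symbol) not in delta:
            return False
        state = delta[(state, symbol)]
    return state in accept           # accept only at end of tape

# Transition table for the sheep language /baa+!/.
# To change the machine, change only the table, not the interpreter.
SHEEP = {(0, "b"): 1, (1, "a"): 2, (2, "a"): 3,
         (3, "a"): 3, (3, "!"): 4}
```

Here `d_recognize("baaa!", SHEEP, 0, {4})` accepts, while "baa" (tape exhausted before the accept state) and "ba!" (no legal move) are rejected.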


Key Points

• Crudely, therefore… matching strings with regular expressions (à la Perl, grep, etc.) is a matter of
  translating the regular expression into a machine (a table), and
  passing the table to an interpreter

Recognition as Search

• You can view this algorithm as a trivial kind of state-space search.
• States are pairings of tape positions and state numbers.
• Operators are compiled into the table.
• Goal state is a pairing of the end-of-tape position and a final accept state.
• It’s trivial because?

Generative Formalisms

• Formal languages are sets of strings composed of symbols from a finite set of symbols.
• Finite-state automata define formal languages (without having to enumerate all the strings in the language).
• The term generative is based on the view that you can run the machine as a generator to get strings from the language.


Generative Formalisms

• FSAs can be viewed from two perspectives:
  Acceptors that can tell you if a string is in the language
  Generators to produce all and only the strings in the language
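The generator perspective can be made literal: walk every path from the start state, emitting the current prefix whenever it lands on an accept state. A sketch (the length cap is my addition, needed because the sheep language is infinite):

```python
def generate(delta, start, accept, max_len):
    """Enumerate every string of length <= max_len accepted by the
    machine, by exhaustively walking paths from the start state."""
    out = []

    def walk(state, prefix):
        if state in accept:
            out.append(prefix)           # this prefix is in the language
        if len(prefix) == max_len:
            return                       # stop extending this path
        for (s, sym), nxt in delta.items():
            if s == state:
                walk(nxt, prefix + sym)

    walk(start, "")
    return sorted(out, key=len)

# Deterministic sheep table for /baa+!/.
SHEEP = {(0, "b"): 1, (1, "a"): 2, (2, "a"): 3,
         (3, "a"): 3, (3, "!"): 4}
```

Running `generate(SHEEP, 0, {4}, 6)` produces all and only the sheep strings up to length 6: baa!, baaa!, baaaa!.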


Non-Determinism


Non-Determinism cont.

• Yet another technique: epsilon transitions
  Key point: these transitions do not examine or advance the tape during recognition


Equivalence

• Non-deterministic machines can be converted to deterministic ones with a fairly simple construction.
• That means that they have the same power; non-deterministic machines are not more powerful than deterministic ones in terms of the languages they can accept.
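That construction is the subset construction: each deterministic state stands for the set of non-deterministic states the machine could be in. A sketch for machines without epsilon transitions (the names are mine):

```python
def subset_construction(delta, start, accept, alphabet):
    """Build a deterministic table from a non-deterministic one.
    The NFA delta maps (state, symbol) -> set of states; each DFA
    state is a frozenset of NFA states."""
    d_start = frozenset([start])
    d_delta, seen, todo = {}, {d_start}, [d_start]
    while todo:
        S = todo.pop()
        for sym in alphabet:
            # Union of everywhere any member of S can go on sym.
            T = frozenset(q for s in S for q in delta.get((s, sym), set()))
            if not T:
                continue                 # no transition on sym
            d_delta[(S, sym)] = T
            if T not in seen:
                seen.add(T)
                todo.append(T)
    d_accept = {S for S in seen if S & accept}   # any member accepts
    return d_delta, d_start, d_accept

# Non-deterministic sheep machine: in state 2, 'a' may loop or advance.
ND_SHEEP = {(0, "b"): {1}, (1, "a"): {2}, (2, "a"): {2, 3}, (3, "!"): {4}}
D_DELTA, D_START, D_ACCEPT = subset_construction(ND_SHEEP, 0, {4}, {"b", "a", "!"})
```

The resulting machine has a state {2, 3} that loops on a and exits on !, which is exactly the deterministic sheep machine in disguise.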


ND Recognition

• Two basic approaches (used in all major implementations of regular expressions)
  1. Either take an ND machine and convert it to a D machine and then do recognition with that.
  2. Or explicitly manage the process of recognition as a state-space search (leaving the machine as is).
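A sketch of approach 2, treating recognition as a search over (state, tape-position) pairs kept on an agenda; here `delta` maps a (state, symbol) pair to a set of successor states (epsilon transitions could be handled the same way, minus the position increment):

```python
def nd_recognize(tape, delta, start, accept):
    """Breadth-first search over (state, position) pairs. Success is
    any path that consumes the whole tape and ends in an accept state;
    failure means every path on the agenda has died out."""
    agenda = [(start, 0)]
    seen = set()                       # avoid re-exploring a pairing
    while agenda:
        state, pos = agenda.pop(0)
        if (state, pos) in seen:
            continue
        seen.add((state, pos))
        if pos == len(tape) and state in accept:
            return True
        if pos < len(tape):
            for nxt in delta.get((state, tape[pos]), set()):
                agenda.append((nxt, pos + 1))
    return False

# Non-deterministic sheep table: in state 2, 'a' may loop or move on.
ND_SHEEP = {(0, "b"): {1}, (1, "a"): {2},
            (2, "a"): {2, 3}, (3, "!"): {4}}
```

On "baaa!" the search explores both choices in state 2; one path (q0 q1 q2 q2 q3 q4) succeeds, so the string is accepted.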


Implementations


Non-Deterministic Recognition: Search

• In an ND FSA there exists at least one path through the machine for a string that is in the language defined by the machine.
• But not all paths through the machine for an accepted string lead to an accept state.
• No paths through the machine lead to an accept state for a string not in the language.

Non-Deterministic Recognition

• So success in a non-deterministic recognition occurs when a path is found through the machine that ends in an accept state.
• Failure occurs when all of the possible paths lead to failure.

Example

One successful path through the machine for the input baaa!:

  Input:  b    a    a    a    !
  Path:   q0   q1   q2   q2   q3   q4


Key Points

• States in the search space are pairings of tape positions and states in the machine.
• By keeping track of as-yet-unexplored states, a recognizer can systematically explore all the paths through the machine given an input.


Next Time

• Finish reading Chapter 2, start on 3.
  Make sure you have the book
• Make sure you have access to Python