Introduction to Computational Linguistics Parsing I

Upload: della-hall

Post on 14-Dec-2015

Page 1: CSA2050 Introduction to Computational Linguistics Parsing I

CSA2050 Introduction to Computational

Linguistics

Parsing I

Page 2: CSA2050 Introduction to Computational Linguistics Parsing I

Apr 2008 -- MR CSA2050 - Parsing I 2

Why Is Syntax Important?

The presidential candidate who was extremely popular smiled broadly.

How many presidential candidates are implied?

1 or >1?

Page 3: CSA2050 Introduction to Computational Linguistics Parsing I

Why Is Syntax Important?

The presidential candidate, who was extremely popular, smiled broadly.

How many presidential candidates are implied?

1 or >1?

Page 4: CSA2050 Introduction to Computational Linguistics Parsing I

Why Is Syntax Important?

The presidential candidate, who was extremely popular, smiled broadly.

The presidential candidate who was extremely popular smiled broadly.

…because the syntactic structure has an important bearing on the meaning

Page 5: CSA2050 Introduction to Computational Linguistics Parsing I

PP Attachment

The policemen saw a burglar with a gun
The policemen saw a burglar with a telescope

PP can modify V or N
In the first case, it modifies N (the burglar has the gun)
In the second, it modifies V (the telescope is used for seeing)

Page 6: CSA2050 Introduction to Computational Linguistics Parsing I

PP modifies V

D    N          V    D    N        P     D  N
The  policemen  saw  the  burglar  with  a  telescope

(S (NP The policemen) (VP saw (NP the burglar) (PP with (NP a telescope))))

Page 7: CSA2050 Introduction to Computational Linguistics Parsing I

PP modifies N

D    N          V    D  N        P     D  N
The  policemen  saw  a  burglar  with  a  gun

(S (NP The policemen) (VP saw (NP a burglar (PP with (NP a gun)))))

Page 8: CSA2050 Introduction to Computational Linguistics Parsing I

Issue

In general, how can we determine whether a prepositional phrase modifies the preceding noun or verb?

Knowledge-based approach: must encode, for example,
– burglars often have guns
– people can see things with a telescope
– and a lot of other things

Statistical approach

Page 9: CSA2050 Introduction to Computational Linguistics Parsing I

PP Attachment – Statistical Approach

The Prepositional Phrase Attachment Corpus, included with NLTK as ppattach, makes it possible for us to study this question systematically.

Derived from the IBM-Lancaster Treebank of Computer Manuals and the Penn Treebank.

Distils only the essential information about PP attachment.

Page 10: CSA2050 Introduction to Computational Linguistics Parsing I

Corpus Example Sentence

Original: Four of the five surviving workers have asbestos-related diseases, including three with recently diagnosed cancer.

including three with recently diagnosed cancer
versus
including three by adding two and one

Page 11: CSA2050 Introduction to Computational Linguistics Parsing I

Distilled Information in Corpus

Original: Four of the five surviving workers have asbestos-related diseases, including three with recently diagnosed cancer.

ppattach corpus entry:

i/d   head verb   head of obj   prep   head of PP's NP   N or V
16    including   three         with   cancer            N

Page 12: CSA2050 Introduction to Computational Linguistics Parsing I

Further examples

47830 allow visits between families N
47830 allow visits on peninsula V
42457 acquired interest in firm N
42457 acquired interest in 1986 V

Etc.

Page 13: CSA2050 Introduction to Computational Linguistics Parsing I

Minimal Pair Extraction

NLTK contains primitives that allow us to extract minimal pairs, where we hold NP1, PREP and NP2 constant and get different attachments depending on the verb, e.g.

received (NP offer) (PP from group)    V
rejected (NP offer (PP from group))    N

Compare the verb frames: receive x from y, but reject x
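The grouping behind minimal pair extraction can be sketched in plain Python. This is only an illustration and does not use NLTK's own primitives: the `Instance` record, the `minimal_pairs` function and the `SAMPLE` list are hypothetical names, and the data are just the example tuples quoted on these slides (without their i/d numbers).

```python
from collections import defaultdict, namedtuple

# Hypothetical record mirroring the distilled format:
# head verb, head of object, preposition, head of the PP's NP, attachment.
Instance = namedtuple("Instance", "verb noun1 prep noun2 attachment")

# Sample data copied from the slides.
SAMPLE = [
    Instance("allow", "visits", "between", "families", "N"),
    Instance("allow", "visits", "on", "peninsula", "V"),
    Instance("acquired", "interest", "in", "firm", "N"),
    Instance("acquired", "interest", "in", "1986", "V"),
    Instance("received", "offer", "from", "group", "V"),
    Instance("rejected", "offer", "from", "group", "N"),
]

def minimal_pairs(instances):
    """Group by (noun1, prep, noun2) and keep groups with both attachments."""
    table = defaultdict(lambda: defaultdict(set))
    for inst in instances:
        key = (inst.noun1, inst.prep, inst.noun2)
        table[key][inst.attachment].add(inst.verb)
    return {
        key: {att: sorted(verbs) for att, verbs in by_att.items()}
        for key, by_att in table.items()
        if len(by_att) > 1  # the verb alone changes the attachment
    }

for key, by_att in minimal_pairs(SAMPLE).items():
    print(key, by_att)
```

Only "offer from group" survives the filter, because it is the only context in the sample where changing the verb changes the attachment.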

Page 14: CSA2050 Introduction to Computational Linguistics Parsing I

Why Syntactic Structure?

Helps to make explicit how a sentence says who did what to whom:

The fierce dog bit the man

The key idea is to identify the noun phrases around the verb:

<noun group> <verb> <noun group>

We can do this in terms of sequences of POS tags, e.g. D JJ* N

But there are limitations to this approach:

The child with a fierce dog bit the man

Here the child is doing the biting, but D JJ* N still immediately precedes “bit”, so the pattern wrongly picks out “a fierce dog” as the thing doing the biting.
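The limitation can be reproduced with a small sketch that matches the pattern D JJ* N V D JJ* N over a string of POS tags. The function name `noun_group_before_verb` and the hand-tagged sentences are assumptions made for this illustration; no tagger is involved.

```python
import re

# Noun group over POS tags: determiner, any number of adjectives, noun.
NOUN_GROUP = r"\bD (?:JJ )*N\b"
# <noun group> <verb> <noun group>
PATTERN = re.compile(rf"({NOUN_GROUP}) V ({NOUN_GROUP})")

def noun_group_before_verb(tagged):
    """Return the words of the noun group that the pattern puts before the verb."""
    tags = " ".join(tag for _, tag in tagged)
    match = PATTERN.search(tags)
    if match is None:
        return None
    # Tags are separated by single spaces, so counting spaces gives token indices.
    first = tags[:match.start(1)].count(" ")
    length = match.group(1).count(" ") + 1
    return [word for word, _ in tagged[first:first + length]]

# Hand-tagged examples (tags assigned by hand for illustration).
simple = [("The", "D"), ("fierce", "JJ"), ("dog", "N"),
          ("bit", "V"), ("the", "D"), ("man", "N")]
tricky = [("The", "D"), ("child", "N"), ("with", "P"), ("a", "D"),
          ("fierce", "JJ"), ("dog", "N"), ("bit", "V"),
          ("the", "D"), ("man", "N")]

print(noun_group_before_verb(simple))
print(noun_group_before_verb(tricky))
```

For the second sentence the pattern returns "a fierce dog" rather than the child, which is exactly the failure described above.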

Page 15: CSA2050 Introduction to Computational Linguistics Parsing I

Constituency

We could repair this with a more complex regular expression such as

DT JJ* NN (IN DT JJ* NN)*

But this is defeated by

The seagull that attacked the child with the fierce dog bit the man

The basic problem is that we need a richer notion of constituency: how the words fit together to form a noun phrase.

Page 19: CSA2050 Introduction to Computational Linguistics Parsing I

Recursion – Central Embedding

The dog barked.
The dog the cat scratched barked.
The dog the cat the horse liked scratched barked.
The dog the cat the horse the man rode liked scratched barked.
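The embedding pattern in these sentences can be generated by a short recursive function: each clause consists of a noun phrase, optionally a nested clause, and then its own verb. The names `NOUNS`, `VERBS` and `clause` are introduced here for illustration, and the word lists are taken from the examples above.

```python
NOUNS = ["the dog", "the cat", "the horse", "the man"]
VERBS = ["barked", "scratched", "liked", "rode"]

def clause(level, depth):
    """Build the clause at `level`, embedding deeper clauses up to `depth`."""
    words = [NOUNS[level]]
    if level < depth:
        # The embedded clause sits between the noun phrase and its verb.
        words += clause(level + 1, depth)
    words.append(VERBS[level])
    return words

for depth in range(len(NOUNS)):
    print(" ".join(clause(0, depth)))
```

The nouns appear in one order and the matching verbs in the reverse order, so every noun must be paired with a verb that may be arbitrarily far away.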

Page 20: CSA2050 Introduction to Computational Linguistics Parsing I

Chomsky Hierarchy

Page 21: CSA2050 Introduction to Computational Linguistics Parsing I

CFG Review

A CFG is a 4-tuple (N, Σ, P, S), where:

N is a set of non-terminal symbols (the category labels);
Σ is a set of terminal symbols (e.g., lexical items);
P is a set of productions of the form A → α, where
– A is a non-terminal, and
– α is a string of symbols from (N ∪ Σ)* (i.e., a string of terminals and/or non-terminals);
S is the start symbol.

A derivation of a string from a non-terminal A in N is the result or trace of successively applying individual productions in P, starting from A.

Page 22: CSA2050 Introduction to Computational Linguistics Parsing I

Different Derivations for the Same Sentence

Derivation 1
NP
Det N PP
the N PP
the dog PP
the dog P NP
the dog with NP
the dog with Det N
the dog with a N
the dog with a telescope

Derivation 2
NP
Det N PP
Det N P NP
Det N with NP
The N with NP
The N with a N

Page 23: CSA2050 Introduction to Computational Linguistics Parsing I

What Does Context Free Mean?

LHS of rule is just one symbol.

Can have: NP -> Det N

Cannot have: X NP Y -> X Det N Y

Page 24: CSA2050 Introduction to Computational Linguistics Parsing I

Grammar Symbols

Symbols of the grammar fall into three categories:

1. Non Terminal Symbols

2. Terminal Symbols

3. Parts of Speech

We will sometimes not distinguish between 2 and 3

Page 25: CSA2050 Introduction to Computational Linguistics Parsing I

Technical Aspects of CFGs

Rules of the form LHS -> RHS
LHS comprises exactly one NT symbol
RHS is any combination of NT and T symbols

Finite State (type 3) grammars have different restrictions:
LHS comprises exactly one NT symbol
RHS is a combination of T symbols with at most one NT

Right linear grammar: the NT must come at the extreme right of the RHS
Left linear grammar: the NT must come at the extreme left of the RHS
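The restrictions on type 3 grammars are easy to check mechanically. The sketch below assumes productions are stored as (LHS, RHS) pairs with the RHS as a tuple of symbols; the function names and the two sample grammars are made up for illustration.

```python
def is_right_linear(productions, nonterminals):
    """At most one NT per RHS, and only as the last (rightmost) symbol."""
    return all(
        all(sym not in nonterminals for sym in rhs[:-1])
        for _, rhs in productions
    )

def is_left_linear(productions, nonterminals):
    """At most one NT per RHS, and only as the first (leftmost) symbol."""
    return all(
        all(sym not in nonterminals for sym in rhs[1:])
        for _, rhs in productions
    )

# Right linear grammar for noun groups such as "the fierce fierce dog".
NOUN_GROUP_NTS = {"NP", "REST"}
NOUN_GROUP_RULES = [
    ("NP", ("the", "REST")),
    ("REST", ("fierce", "REST")),
    ("REST", ("dog",)),
]

# A context-free rule with two NTs on the RHS is neither left nor right linear.
SENTENCE_NTS = {"S", "NP", "VP"}
SENTENCE_RULES = [("S", ("NP", "VP"))]

print(is_right_linear(NOUN_GROUP_RULES, NOUN_GROUP_NTS))
print(is_left_linear(NOUN_GROUP_RULES, NOUN_GROUP_NTS))
print(is_right_linear(SENTENCE_RULES, SENTENCE_NTS))
```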

Page 26: CSA2050 Introduction to Computational Linguistics Parsing I

A Simple Grammar + Lexicon

grammar:

S -> NP VP
NP -> N
VP -> V NP

lexicon:

V -> kicks
N -> John
N -> Bill

(S (NP (N John)) (VP (V kicks) (NP (N Bill))))

Page 27: CSA2050 Introduction to Computational Linguistics Parsing I

Grammar versus Parser

A grammar/lexicon defines a relation between sentences generated by the grammar and their respective syntactic structures.

The grammar does not tell us how to actually go about discovering the structure of a sentence.

A parsing algorithm is an effective procedure for carrying out that discovery.

A parser implements a parsing algorithm, for example recursive descent parsing.
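To make the grammar/parser distinction concrete, here is a minimal sketch of a backtracking recursive descent parser for a small grammar that generates "John kicks Bill". It is one possible formulation, not a reference implementation: the dictionary layout and the names `GRAMMAR`, `parse`, `expand` and `parse_sentence` are assumptions made for this example.

```python
# The grammar is pure data: it says nothing about how to find a structure.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["N"]],
    "VP": [["V", "NP"]],
    "V": [["kicks"]],
    "N": [["John"], ["Bill"]],
}

def parse(symbol, tokens, pos):
    """Yield (tree, next_pos) for every way `symbol` can start at `pos`."""
    if symbol not in GRAMMAR:
        # Terminal: it must match the next input word.
        if pos < len(tokens) and tokens[pos] == symbol:
            yield symbol, pos + 1
        return
    for rhs in GRAMMAR[symbol]:
        # Top-down: predict each production for this non-terminal in turn.
        yield from expand(symbol, rhs, tokens, pos, [])

def expand(symbol, rhs, tokens, pos, children):
    """Match the remaining RHS symbols left to right, collecting subtrees."""
    if not rhs:
        yield (symbol, *children), pos
        return
    for child, next_pos in parse(rhs[0], tokens, pos):
        yield from expand(symbol, rhs[1:], tokens, next_pos, children + [child])

def parse_sentence(sentence):
    """Return every parse tree that covers the whole sentence."""
    tokens = sentence.split()
    return [tree for tree, end in parse("S", tokens, 0) if end == len(tokens)]

print(parse_sentence("John kicks Bill"))
```

The same `GRAMMAR` could be handed to a different algorithm without change, which is the point of separating grammar from parser. Note that this simple top-down strategy does not terminate on left-recursive rules such as NP -> NP PP.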