november 2009hlt - sentence grammar1 human language technology sentence grammar

58
November 2009 HLT - Sentence Grammar 1 Human Language Technology Sentence Grammar

Upload: deirdre-payne

Post on 26-Dec-2015

243 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: November 2009HLT - Sentence Grammar1 Human Language Technology Sentence Grammar

November 2009 HLT - Sentence Grammar 1

Human Language Technology

Sentence Grammar

Page 2: November 2009HLT - Sentence Grammar1 Human Language Technology Sentence Grammar

November 2009 HLT - Sentence Grammar 2

Introduction

This lecture has several themes:

1. Crash course in sentence-level grammar• Jurafsky and Martin 2nd ed. Chapter 12

• Internet Grammar of Englishhttp://www.ucl.ac.uk/internet-grammar/

2. Show how different linguistic phenomena can be captured by grammar rules.

3. Dependency Parsing

4. Tagsets and Treebanks

Page 3: November 2009HLT - Sentence Grammar1 Human Language Technology Sentence Grammar

November 2009 HLT - Sentence Grammar 3

Part 1

Grammar of English

Page 4: November 2009HLT - Sentence Grammar1 Human Language Technology Sentence Grammar

November 2009 HLT - Sentence Grammar 4

Different Kinds of Rule

• Morphological rules.. govern how words may be composed: re+invest+ing = reinvesting.

• Syntactic rules .. govern how words and constituents combine to form grammatical sentences.

• Semantic rules .. govern how meanings may be combined.

Page 5: November 2009HLT - Sentence Grammar1 Human Language Technology Sentence Grammar

November 2009 HLT - Sentence Grammar 5

Syntax: Why?

• You need knowledge of syntax in many applications:– Parsing– Grammar checkers– Question answering/database access– Information extraction– Generation– Translation

• Full versus superficial analysis?

Page 6: November 2009HLT - Sentence Grammar1 Human Language Technology Sentence Grammar

November 2009 HLT - Sentence Grammar 6

Levels of Grammar Organisation• Word Classes: different parts of speech (POS).• Phrase Classes: sequences of words inheriting the

characteristics of certain word classes.• Clause Classes: sequences of phrases containing

at least one verb phrase.On the basis of these one may define:• Grammatical Relations: role played by

constitutents e.g. subject; predicate; object• Syntax-Semantics interface: mapping between

syntactic structures and meaning

Page 7: November 2009HLT - Sentence Grammar1 Human Language Technology Sentence Grammar

November 2009 HLT - Sentence Grammar 7

Word Classes• Closed classes.

– determiners : the, a, an, four.– pronouns : it, he etc.– prepositions : by, on, with .– conjunctions : and, or, but.

• Open classes.– nouns refer to objects or concepts: cat , beauty , Coke.– adjectives describe or qualify nouns: fried chickens.– verbs describe what the noun does: John jumps.– adverbs describe how it is done: John runs quickly.

Page 8: November 2009HLT - Sentence Grammar1 Human Language Technology Sentence Grammar

November 2009 HLT - Sentence Grammar 8

Word Class Characteristics• Different word classes have characteristic

subclasses and propertiesSubclasses Properties

Nounproper; mass;

count

number;

gender

Verbtransitive;

intransitive

Number; gender;

person, tense

Adjectivedimension; age;

colour

Number,

gender

Page 9: November 2009HLT - Sentence Grammar1 Human Language Technology Sentence Grammar

November 2009 HLT - Sentence Grammar 9

Phrases

• Longer phrases may be used rather than a single word, but fulfilling the same role in a sentence.– Noun phrases refer to objects: four fried chickens.– Verb phrases state what the noun phrase does: kicks the

dog.– Adjective phrases describe/qualify an object:

sickly sweet.– Adverbial phrases describe how actions are done:

very carefully.– prepositional phrases: add information to a verb

phrase: on the table

Page 10: November 2009HLT - Sentence Grammar1 Human Language Technology Sentence Grammar

November 2009 HLT - Sentence Grammar 10

Phrases can be Complexe.g. Noun Phrases

• Proper Name or Pronoun: Monday; it

• Specifier, noun: the day

• Specifiers, premodifier, noun:the first wet day

• Specifiers, premodifier, noun, postmodifier:The first wet day that I enjoyed in June

Page 11: November 2009HLT - Sentence Grammar1 Human Language Technology Sentence Grammar

November 2009 HLT - Sentence Grammar 11

But they all fit the same context

• Monday• It• The day• The first wet day• The first wet day that I

enjoyed in June

was sunny.

Page 12: November 2009HLT - Sentence Grammar1 Human Language Technology Sentence Grammar

November 2009 HLT - Sentence Grammar 12

Clauses• A clause is a combination of noun phrases and

verb phrases• Clauses can exist at the top level (main clause) or

can be embedded (subordinate clause)– Top level clause is a sentence. E.g.

The cat ate the mouse.– Embedded clause is subordinate e.g.

John said that Sandy is sick.

• Unlike phrases, whole sentences can be used to say something complete, e.g. to state a fact or ask a question.

Page 13: November 2009HLT - Sentence Grammar1 Human Language Technology Sentence Grammar

November 2009 HLT - Sentence Grammar 13

Different Kinds of Sentences

• Assertion: John ate the cat.

• Yes/No question: Did John eat the cat?

• Wh- question: What did John eat?

• Command: Eat the cat John!

• NB. All these forms share the same underlying semantic proposition.

Page 14: November 2009HLT - Sentence Grammar1 Human Language Technology Sentence Grammar

November 2009 HLT - Sentence Grammar 14

Part II

Context Free Grammar Rules

Page 15: November 2009HLT - Sentence Grammar1 Human Language Technology Sentence Grammar

November 2009 HLT - Sentence Grammar 15

Formal Grammar

• A formal grammar consists of– Terminal Symbols (T)– Non Terminal Symbols (NT, disjoint from TS)– Start Symbol (a distinguished NT)– Rewrite rules of the form , where and

are strings of symbols

Page 16: November 2009HLT - Sentence Grammar1 Human Language Technology Sentence Grammar

November 2009 HLT - Sentence Grammar 16

Classes of Grammar

Type Grammars Languages Machines

0 Phrase StructureRecursivelyEnumerable

TM

1 Context SensitiveContextSensitive

LBA

2 Context Free Context Free PDA

3 Regular Regular FSA

Page 17: November 2009HLT - Sentence Grammar1 Human Language Technology Sentence Grammar

November 2009 HLT - Sentence Grammar 17

Classes of Grammar

• Learnability

• Different classes of grammar result from various restrictions on the form of rules

Page 18: November 2009HLT - Sentence Grammar1 Human Language Technology Sentence Grammar

November 2009 HLT - Sentence Grammar 18

Restrictions on Rules

For all rules • Type 0 (unrestricted): no restrictions

• Type 1 (context sensitive): ||||

• Type 2 (context free): is a single NT symbol

• Type 3 (regular)– Every rule is of the form A aB or A a

where A,B NT and aT

Page 19: November 2009HLT - Sentence Grammar1 Human Language Technology Sentence Grammar

November 2009 HLT - Sentence Grammar 19

Which Class for NLP?

• Type 3 (Regular). Good for morphology. Cannot handle central embedding of sentences.The man that John saw eating died.

• Type 2 (Context Free). OK but problems handling certain phenomena e.g. agreement.

• Type 1 (Context Sensitive). Computational properties not well understood. Too powerful.

• Type 0 (Turing). Too powerful.

Page 20: November 2009HLT - Sentence Grammar1 Human Language Technology Sentence Grammar

November 2009 HLT - Sentence Grammar 20

Weak versus Strong

• Grammar class that is too restrictive– cannot characterise/discriminate exactly NL

sentence structures.

• Grammar class that is too general– has the power to characterise/discriminate

structures that don't exist in human languages.

• More general, higher complexity→ less efficient computations.

Page 21: November 2009HLT - Sentence Grammar1 Human Language Technology Sentence Grammar

November 2009 HLT - Sentence Grammar 21

Example Grammar

Cabinet discusses police chief’s caseFrench gunman kills fours np vpnp nnp adj nnp n npvp v np

Page 22: November 2009HLT - Sentence Grammar1 Human Language Technology Sentence Grammar

November 2009 HLT - Sentence Grammar 22

Classifying the Symbols• NT – symbols appearing on the left• Start – symbol appearing only on the left from

which every other symbol can be derived.• T – symbols appearing only on the right• To include words we also need special rules

such as n [police]n [gunman]n [four]

• Latter rules define the lexicon or “dictionary interface”.

Page 23: November 2009HLT - Sentence Grammar1 Human Language Technology Sentence Grammar

November 2009 HLT - Sentence Grammar 23

Grammar InducesPhrase Structure

French gunman kills four

adj n v n

np np

vp s

Page 24: November 2009HLT - Sentence Grammar1 Human Language Technology Sentence Grammar

November 2009 HLT - Sentence Grammar 24

Phrase Structure

• PS includes information about – precedence between constituents– dominance between constituents

• PS constitutes a trace of the rule applications used to derive a sentence

• PS does not tell you the order in which the rules were used

Page 25: November 2009HLT - Sentence Grammar1 Human Language Technology Sentence Grammar

November 2009 HLT - Sentence Grammar 25

Procedural versus Declarative

• A grammar induces a structure but does not tell you how to discover that structure

• A grammar is declarative

• A parser is a procedure that, given a suitable representation of a grammar and a sentence, actually discovers the structure(s).

• A parser is procedural

Page 26: November 2009HLT - Sentence Grammar1 Human Language Technology Sentence Grammar

November 2009 HLT - Sentence Grammar 26

Handling Linguistic Phenomena

• Different sentence-types

• Nested structures

• Agreement

• Multiwords

• Subcategories of verb

Page 27: November 2009HLT - Sentence Grammar1 Human Language Technology Sentence Grammar

November 2009 HLT - Sentence Grammar 27

Different Sentence Types........Different Grammar Rules

• DeclarativesJohn left.S → NP VP

• ImperativesLeave!S →VP

• Yes-No QuestionsDid John leave?S →Aux NP VP

• WH QuestionsWhen did John leave?S →Wh-word Aux NP VP

Page 28: November 2009HLT - Sentence Grammar1 Human Language Technology Sentence Grammar

November 2009 HLT - Sentence Grammar 28

Recursively NestedStructures handled by ....

• Flights to Miami• Flights to Miami from Boston• Flights to Miami from Boston in April• Flights to Miami from Boston in April on

Friday• Flights to Miami from Boston in April on

Friday under $300.• Flights to Miami from Boston in April on

Friday under $300 with lunch.

Page 29: November 2009HLT - Sentence Grammar1 Human Language Technology Sentence Grammar

November 2009 HLT - Sentence Grammar 29

Recursive Rules

NP → N

NP → NP PP

PP → Preposition NP

Flights from miami to boston

Page 30: November 2009HLT - Sentence Grammar1 Human Language Technology Sentence Grammar

November 2009 HLT - Sentence Grammar 30

Ambiguity

np np pppp prep np

(the man) (on the hill with a telescope by the sea)

(the man on the hill) (with a telescope by the sea)

(the man on the hill with a telescope)( by the sea)

etc.

Page 31: November 2009HLT - Sentence Grammar1 Human Language Technology Sentence Grammar

November 2009 HLT - Sentence Grammar 31

Handling Agreement

• NP → Determiner N• Include these days, this day• Exclude this days, these day

NP → NPSingNP → NPPlurNPPlur → DetSing NSingNPPlur → DetPlur NPlur

• Agreement also includes number, gender, case.• Danger: proliferation of categories/rules.

Page 32: November 2009HLT - Sentence Grammar1 Human Language Technology Sentence Grammar

November 2009 HLT - Sentence Grammar 32

Handling Multiwords

• John ran up the stairs

• John rang up the doctor

• John ran the stairs up*

• John rang the doctor up

• John rang the doctor who lives in Paris up

Page 33: November 2009HLT - Sentence Grammar1 Human Language Technology Sentence Grammar

November 2009 HLT - Sentence Grammar 33

Ordinary CF rules don’t work

John rang up the doctor• VP → V NP• here V is multiwordJohn rang the doctor up• VP → V NP particle_from _V• here, multiword has split into two parts• challenge is to express the relation between the

parts

Page 34: November 2009HLT - Sentence Grammar1 Human Language Technology Sentence Grammar

November 2009 HLT - Sentence Grammar 34

Subcategorisation

• Intransitive verb: no objectJohn disappearedJohn disappeared the cat*

• Transitive verb: one objectJohn opened the windowJohn opened*

• Ditransitive verb: two objectsJohn gave Mary the bookJohn gave Mary*

Page 35: November 2009HLT - Sentence Grammar1 Human Language Technology Sentence Grammar

November 2009 HLT - Sentence Grammar 35

Subcategorisation Rules

• Intransitive verb: no objectVP → V

• Transitive verb: one objectVP → V NP

• Ditransitive verb: two objectsVP → V NP NP

• If you take account of the category of items following the verb, there are about 40 different patterns like this in English.

Page 36: November 2009HLT - Sentence Grammar1 Human Language Technology Sentence Grammar

November 2009 HLT - Sentence Grammar 36

Overgeneration

• A grammar should generate only sentences in the language.

• It should exclude sentences not in the language.

s n vpvp vn [John]v [snore]v [snores]

Page 37: November 2009HLT - Sentence Grammar1 Human Language Technology Sentence Grammar

November 2009 HLT - Sentence Grammar 37

Undergeneration

• A grammar should generate all sentences in the language.

• There should not be sentences in the language that are not generated by the grammar.

s n vpvp vn [John]n [gold]v [found]

Page 38: November 2009HLT - Sentence Grammar1 Human Language Technology Sentence Grammar

November 2009 HLT - Sentence Grammar 38

Appropriate Stuctures

– A grammar should assign linguistically plausible structures.

n v d a nJohn ate a juicy hamburger

vp

s

n v d a nJohn ate a juicy hamburger

np vp

s

np

vs.

Page 39: November 2009HLT - Sentence Grammar1 Human Language Technology Sentence Grammar

November 2009 HLT - Sentence Grammar 39

Criteria for Evaluating Grammars

• Does it undergenerate? • Does it overgenerate?• Does it assign appropriate structures to sentences

it generates?• Is it simple to understand? How many rules are

there?• Does it contain generalisations or is it just a

collection of special cases?• How ambiguous is it?

Page 40: November 2009HLT - Sentence Grammar1 Human Language Technology Sentence Grammar

November 2009 HLT - Sentence Grammar 40

Tagsets

• The main parts of speech reflect naturally occurring occurrence data.

• Practical applications often make use of special tags which include additional information such as number and case.

• One of the most commonly used tagsets is the 45-tag Penn Treebank tagset, used for the Brown corpus.

Page 41: November 2009HLT - Sentence Grammar1 Human Language Technology Sentence Grammar

November 2009 HLT - Sentence Grammar 41

Penn Treebank Tagset

Page 42: November 2009HLT - Sentence Grammar1 Human Language Technology Sentence Grammar

November 2009 HLT - Sentence Grammar 42

POS Tagging

The grand jury commented on a number of other topics

The/DT grand/JJ jury/NN commmented/VBD

on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.

POS Tagger

Page 43: November 2009HLT - Sentence Grammar1 Human Language Technology Sentence Grammar

November 2009 HLT - Sentence Grammar 43

Treebanks• Treebanks are corpora in which each sentence has been

paired with a parse tree (presumably the right one).• These are generally created

– By first parsing the collection with an automatic parser– And then having human annotators correct each parse

as necessary.• This requires detailed annotation guidelines that provide a

– POS tagset,

– a grammar and

– instructions for how to deal with particular grammatical constructions.

Page 44: November 2009HLT - Sentence Grammar1 Human Language Technology Sentence Grammar

November 2009 HLT - Sentence Grammar 44

Penn Treebank• Penn Treebank is a widely used treebank

maintained by the Linguistic Data Consortium.• The Penn Treebank Project annotates naturally-

occurring text for linguistic structure.• Contains skeletal parses showing rough syntactic

and semantic information -- a bank of linguistic trees.

• Most well known is the Wall Street Journal section containing 1M words from the 1987-1989 Wall Street Journal.

Page 45: November 2009HLT - Sentence Grammar1 Human Language Technology Sentence Grammar

November 2009 HLT - Sentence Grammar 45

Penn Treebank Example

Page 46: November 2009HLT - Sentence Grammar1 Human Language Technology Sentence Grammar

November 2009 HLT - Sentence Grammar 46

Treebank Grammars

• Treebanks implicitly define a grammar for the language covered in the treebank.

• Simply take the local rules that make up the sub-trees in all the trees in the collection and you have a grammar.

• Not complete, but if you have decent size corpus, you’ll have a grammar with decent coverage.

Page 47: November 2009HLT - Sentence Grammar1 Human Language Technology Sentence Grammar

November 2009 HLT - Sentence Grammar 47

Treebank Grammars

• Such grammars tend to be very flat due to the fact that they tend to avoid recursion.

• For example, the Penn Treebank has 4500 different rules for VPs. Among them...

Page 48: November 2009HLT - Sentence Grammar1 Human Language Technology Sentence Grammar

November 2009 HLT - Sentence Grammar 48

Heads in Trees

• Finding heads in treebank trees is a task that arises frequently in many applications.– Particularly important in statistical

parsing

• We can visualize this task by annotating the nodes of a parse tree with the heads of each corresponding node.

Page 49: November 2009HLT - Sentence Grammar1 Human Language Technology Sentence Grammar

November 2009 HLT - Sentence Grammar 49

Lexically Decorated Tree

Page 50: November 2009HLT - Sentence Grammar1 Human Language Technology Sentence Grammar

November 2009 HLT - Sentence Grammar 50

Head Finding

• The standard way to do head finding is to use a simple set of tree traversal rules specific to each non-terminal in the grammar.

Page 51: November 2009HLT - Sentence Grammar1 Human Language Technology Sentence Grammar

November 2009 HLT - Sentence Grammar 51

Noun Phrases

Page 52: November 2009HLT - Sentence Grammar1 Human Language Technology Sentence Grammar

November 2009 HLT - Sentence Grammar 52

Treebank Uses

• Treebanks (and headfinding) are particularly critical to the development of statistical parsers.

• Also valuable for Corpus Linguistics – Investigating the empirical details of various

constructions in a given language

Page 53: November 2009HLT - Sentence Grammar1 Human Language Technology Sentence Grammar

November 2009 HLT - Sentence Grammar 53

Dependency Grammars• In CFG-style phrase-structure grammars the main

focus is on constituents.• But it turns out you can get a lot done with just

binary relations among the words in an utterance.• In a dependency grammar framework, a parse is a

tree where – the nodes stand for the words in an utterance– The links between the words represent dependency

relations between pairs of words.– Relations may be typed (labeled), or not.

Page 54: November 2009HLT - Sentence Grammar1 Human Language Technology Sentence Grammar

November 2009 HLT - Sentence Grammar 54

Dependency Relations

Page 55: November 2009HLT - Sentence Grammar1 Human Language Technology Sentence Grammar

November 2009 HLT - Sentence Grammar 55

Dependency Parse

They hid the letter on the shelf

Page 56: November 2009HLT - Sentence Grammar1 Human Language Technology Sentence Grammar

November 2009 HLT - Sentence Grammar 56

Dependency Parsing

• The dependency approach has a number of advantages over full phrase-structure parsing.– Deals well with free word order languages

where the constituent structure is quite fluid– Parsing is much faster than CFG-based parsers– Dependency structure often captures the

syntactic relations needed by later applications

• CFG-based approaches often extract this same information from trees anyway.

Page 57: November 2009HLT - Sentence Grammar1 Human Language Technology Sentence Grammar

November 2009 HLT - Sentence Grammar 57

Dependency Parsing

• There are two modern approaches to dependency parsing– Optimization-based approaches that search a

space of trees for the tree that best matches some criteria

– Shift-reduce approaches that greedily take actions based on the current word and state.

Page 58: November 2009HLT - Sentence Grammar1 Human Language Technology Sentence Grammar

November 2009 HLT - Sentence Grammar 58

Summary• Context-free grammars can be used to model various facts

about the syntax of a language.• When paired with parsers, such grammars consititute a

critical component in many applications.• Constituency is a key phenomenon easily captured with

CFG rules.– But certain linguistic phenomena pose significant

problems• Dependency parsing is suitable for languages with free

word order• Treebanks pair sentences in corpus with their

corresponding trees.