
  • Slide 1/91

    Machine Learning Methods in Natural Language Processing

    Michael Collins, MIT CSAIL

  • Slide 2/91

    Some NLP Problems

    Information extraction

    Named entities

    Relationships between entities

    Finding linguistic structure

    Part-of-speech tagging

    Parsing

    Machine translation

  • Slide 3/91

    Common Themes

    Need to learn a mapping from one discrete structure to another:

    Strings to hidden state sequences: named-entity extraction, part-of-speech tagging

    Strings to strings: machine translation

    Strings to underlying trees: parsing

    Strings to relational data structures: information extraction

    Speech recognition is similar (and shares many techniques)

  • Slide 4/91

    Two Fundamental Problems

    TAGGING: Strings to Tagged Sequences

    a b e e a f h j → a/C b/D e/C e/C a/D f/C h/D j/C

    PARSING: Strings to Trees

    d e f g → (A (B (D d) (E e)) (C (F f) (G g)))

  • Slide 5/91

    Information Extraction

    Named Entity Recognition

    INPUT: Profits soared at Boeing Co., easily topping forecasts on Wall Street, as their CEO Alan Mulally announced first quarter results.

    OUTPUT: Profits soared at [Company Boeing Co.], easily topping forecasts on [Location Wall Street], as their CEO [Person Alan Mulally] announced first quarter results.

    Relationships between Entities

    INPUT: Boeing is located in Seattle. Alan Mulally is the CEO.

    OUTPUT:

    Relationship = Company-Location; Company = Boeing; Location = Seattle

    Relationship = Employer-Employee; Employer = Boeing Co.; Employee = Alan Mulally

  • Slide 6/91

  • Slide 7/91

    Named Entity Extraction as Tagging

    INPUT: Profits soared at Boeing Co., easily topping forecasts on Wall Street, as their CEO Alan Mulally announced first quarter results.

    OUTPUT: Profits/NA soared/NA at/NA Boeing/SC Co./CC ,/NA easily/NA topping/NA forecasts/NA on/NA Wall/SL Street/CL ,/NA as/NA their/NA CEO/NA Alan/SP Mulally/CP announced/NA first/NA quarter/NA results/NA ./NA

    NA = No entity
    SC = Start Company
    CC = Continue Company
    SL = Start Location
    CL = Continue Location
    SP = Start Person
    CP = Continue Person

  • Slide 8/91

    Parsing (Syntactic Structure)

    INPUT: Boeing is located in Seattle.

    OUTPUT: (S (NP (N Boeing)) (VP (V is) (VP (V located) (PP (P in) (NP (N Seattle))))))

  • Slide 9/91

    Machine Translation

    INPUT: Boeing is located in Seattle. Alan Mulally is the CEO.

    OUTPUT:

    Boeing ist in Seattle. Alan Mulally ist der CEO.

  • Slide 10/91

    Techniques Covered in this Tutorial

    Generative models for parsing

    Log-linear (maximum-entropy) taggers

    Learning theory for NLP

  • Slide 11/91

    Data for Parsing Experiments

    Penn WSJ Treebank = 50,000 sentences with associated trees. Usual set-up: 40,000 training sentences, 2,400 test sentences.

    An example tree, for the sentence:

    Canadian Utilities had 1988 revenue of C$ 1.16 billion , mainly from its natural gas and electric utility businesses in Alberta , where the company serves about 800,000 customers .

    [Treebank parse of the sentence, with part-of-speech tags (NNP, NNPS, VBD, CD, NN, IN, RB, JJ, CC, NNS, WRB, DT, VBZ, PUNC) and constituent labels (NP, VP, PP, QP, ADVP, WHADVP, SBAR, S, TOP).]

  • Slide 12/91

  • Slide 13/91

    2) Phrases

    (S (NP (DT the) (N burglar)) (VP (V robbed) (NP (DT the) (N apartment))))

    Noun Phrases (NP): the burglar, the apartment

    Verb Phrases (VP): robbed the apartment

    Sentences (S): the burglar robbed the apartment

  • Slide 14/91

  • Slide 15/91

    An Example Application: Machine Translation

    English word order is subject verb object

    Japanese word order is subject object verb

    English: IBM bought Lotus
    Japanese: IBM Lotus bought

    English: Sources said that IBM bought Lotus yesterday
    Japanese: Sources yesterday IBM Lotus bought that said

  • Slide 16/91

    Context-Free Grammars

    [Hopcroft and Ullman 1979] A context-free grammar is a 4-tuple G = (N, Σ, R, S) where:

    N is a set of non-terminal symbols

    Σ is a set of terminal symbols

    R is a set of rules of the form X → Y_1 Y_2 ... Y_n, for n ≥ 0, X ∈ N, Y_i ∈ (N ∪ Σ)

    S ∈ N is a distinguished start symbol

  • Slide 17/91

    A Context-Free Grammar for English

    N = {S, NP, VP, PP, D, Vi, Vt, N, P}

    S = S

    Σ = {sleeps, saw, man, woman, telescope, the, with, in}

    R =
    S → NP VP
    VP → Vi
    VP → Vt NP
    VP → VP PP
    NP → D N
    NP → NP PP
    PP → P NP
    Vi → sleeps
    Vt → saw
    N → man
    N → woman
    N → telescope
    D → the
    P → with
    P → in

    Note: S=sentence, VP=verb phrase, NP=noun phrase, PP=prepositional phrase,

    D=determiner, Vi=intransitive verb, Vt=transitive verb, N=noun, P=preposition

  • Slide 18/91

    Left-Most Derivations

    A left-most derivation is a sequence of strings s_1 ... s_n, where

    s_1 = S, the start symbol

    s_n ∈ Σ*, i.e. s_n is made up of terminal symbols only

    each s_i for i = 2 ... n is derived from s_{i-1} by picking the left-most non-terminal X in s_{i-1} and replacing it by some β, where X → β is a rule in R

    For example: [S], [NP VP], [D N VP], [the N VP], [the man VP], [the man Vi], [the man sleeps]

    Representation of a derivation as a tree: (S (NP (D the) (N man)) (VP (Vi sleeps)))
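    A minimal runnable sketch of this derivation (the encoding and function names below are illustrative, not from the tutorial), in Python:

```python
# Hypothetical encoding of the example grammar; each non-terminal maps to its right-hand sides.
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "VP": [["Vi"], ["Vt", "NP"], ["VP", "PP"]],
    "NP": [["D", "N"], ["NP", "PP"]],
    "PP": [["P", "NP"]],
    "Vi": [["sleeps"]], "Vt": [["saw"]],
    "N":  [["man"], ["woman"], ["telescope"]],
    "D":  [["the"]], "P": [["with"], ["in"]],
}

def expand_leftmost(s, rule_index):
    """Rewrite the left-most non-terminal in s with the chosen rule."""
    for i, sym in enumerate(s):
        if sym in GRAMMAR:
            return s[:i] + GRAMMAR[sym][rule_index] + s[i + 1:]
    return s  # all terminals: the derivation is complete

s = ["S"]
for choice in [0, 0, 0, 0, 0, 0]:  # S->NP VP, NP->D N, D->the, N->man, VP->Vi, Vi->sleeps
    s = expand_leftmost(s, choice)
print(s)  # ['the', 'man', 'sleeps']
```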

  • Slide 19/91

    The Problem with Parsing: Ambiguity

    INPUT: She announced a program to promote safety in trucks and vans

    POSSIBLE OUTPUTS:

    [Six candidate parse trees, differing in the attachment of the PP "in trucks and vans" and in the scope of the coordination "trucks and vans".]

    And there are more...

  • Slide 20/91

    An Example Tree

    Canadian Utilities had 1988 revenue of C$ 1.16 billion , mainly from its natural gas and electric utility businesses in Alberta , where the company serves about 800,000 customers .

    [The Treebank parse tree for this sentence, as shown earlier.]

  • Slide 21/91

    A Probabilistic Context-Free Grammar

    S → NP VP 1.0
    VP → Vi 0.4
    VP → Vt NP 0.4
    VP → VP PP 0.2
    NP → D N 0.3
    NP → NP PP 0.7
    PP → P NP 1.0
    Vi → sleeps 1.0
    Vt → saw 1.0
    N → man 0.7
    N → woman 0.2
    N → telescope 0.1
    D → the 1.0
    P → with 0.5
    P → in 0.5

    Probability of a tree with rules α_i → β_i, for i = 1 ... n, is ∏_i P(α_i → β_i | α_i)

    Maximum Likelihood estimation: P(VP → V NP | VP) = Count(VP → V NP) / Count(VP)
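    As a concrete sketch (the dictionary encoding and function below are assumptions, not part of the tutorial), the toy PCFG above and the tree-probability product fit in a few lines of Python:

```python
# Rule probabilities P(alpha -> beta | alpha) for the toy grammar above.
RULE_PROB = {
    ("S", ("NP", "VP")): 1.0,
    ("VP", ("Vi",)): 0.4, ("VP", ("Vt", "NP")): 0.4, ("VP", ("VP", "PP")): 0.2,
    ("NP", ("D", "N")): 0.3, ("NP", ("NP", "PP")): 0.7,
    ("PP", ("P", "NP")): 1.0,
    ("Vi", ("sleeps",)): 1.0, ("Vt", ("saw",)): 1.0,
    ("N", ("man",)): 0.7, ("N", ("woman",)): 0.2, ("N", ("telescope",)): 0.1,
    ("D", ("the",)): 1.0, ("P", ("with",)): 0.5, ("P", ("in",)): 0.5,
}

def tree_prob(tree):
    """tree = (label, child, ...); a leaf child is a plain string.
    Returns the product of rule probabilities over the whole tree."""
    label, children = tree[0], tree[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = RULE_PROB[(label, rhs)]
    for c in children:
        if not isinstance(c, str):
            p *= tree_prob(c)
    return p

# "the man sleeps": 1.0 * 0.3 * 1.0 * 0.7 * 0.4 * 1.0 = 0.084
t = ("S", ("NP", ("D", "the"), ("N", "man")), ("VP", ("Vi", "sleeps")))
print(tree_prob(t))
```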

  • Slide 22/91

    PCFGs

    [Booth and Thompson 73] showed that a CFG with rule probabilities correctly defines a distribution over the set of derivations provided that:

    1. The rule probabilities define conditional distributions over the different ways of rewriting each non-terminal.

    2. A technical condition on the rule probabilities ensures that the probability of the derivation terminating in a finite number of steps is 1. (This condition is not really a practical concern.)

  • Slide 23/91

    (TOP (S (NP (N IBM)) (VP (V bought) (NP (N Lotus)))))

    PROB = P(TOP → S) × P(S → NP VP | S) × P(NP → N | NP) × P(N → IBM | N) × P(VP → V NP | VP) × P(V → bought | V) × P(NP → N | NP) × P(N → Lotus | N)

  • Slide 24/91

    The SPATTER Parser [Magerman 95; Jelinek et al. 94]

    For each rule, identify the head child:

    S → NP VP (head = VP)
    VP → V NP (head = V)
    NP → DT N (head = N)

    Add the head word to each non-terminal:

    (S(questioned) (NP(lawyer) (DT the) (N lawyer)) (VP(questioned) (V questioned) (NP(witness) (DT the) (N witness))))

  • Slide 25/91

    A Lexicalized PCFG

    S(questioned) → NP(lawyer) VP(questioned) ??
    VP(questioned) → V(questioned) NP(witness) ??
    NP(lawyer) → D(the) N(lawyer) ??
    NP(witness) → D(the) N(witness) ??

    The big question: how to estimate the rule probabilities (??)

  • Slide 26/91

    CHARNIAK (1997)

    Decompose the probability of a lexicalized rule into two steps:

    P(S(questioned) → NP(lawyer) VP(questioned)) =

    P(NP VP | S(questioned))               (choose the rule)
    × P(lawyer | S, VP, NP, questioned)    (choose the head word of the non-head child)

  • Slide 27/91

    Smoothed Estimation

    P(NP VP | S(questioned)) =
    λ_1 × Count(S(questioned) → NP VP) / Count(S(questioned))
    + λ_2 × Count(S → NP VP) / Count(S)

    where 0 ≤ λ_1, λ_2 ≤ 1, and λ_1 + λ_2 = 1

  • Slide 28/91

    Smoothed Estimation

    P(lawyer | S, NP, VP, questioned) =
    λ_1 × Count(lawyer, S, NP, VP, questioned) / Count(S, NP, VP, questioned)
    + λ_2 × Count(lawyer, S, NP, VP) / Count(S, NP, VP)
    + λ_3 × Count(lawyer, NP) / Count(NP)

    where 0 ≤ λ_1, λ_2, λ_3 ≤ 1, and λ_1 + λ_2 + λ_3 = 1

  • Slide 29/91

    P(NP(lawyer) VP(questioned) | S(questioned)) =

    [λ_1 × Count(S(questioned) → NP VP) / Count(S(questioned)) + λ_2 × Count(S → NP VP) / Count(S)]

    × [λ_1 × Count(lawyer, S, NP, VP, questioned) / Count(S, NP, VP, questioned) + λ_2 × Count(lawyer, S, NP, VP) / Count(S, NP, VP) + λ_3 × Count(lawyer, NP) / Count(NP)]
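    A minimal sketch of this interpolated estimate (the counts, weights, and function name below are made up for illustration):

```python
def smoothed_estimate(counts, lambdas):
    """counts: (numerator, denominator) pairs, most specific level first.
    lambdas: non-negative weights summing to 1."""
    assert abs(sum(lambdas) - 1.0) < 1e-9
    return sum(l * num / den for l, (num, den) in zip(lambdas, counts) if den > 0)

# e.g. P(lawyer | S,NP,VP,questioned) blended with P(lawyer | S,NP,VP) and P(lawyer | NP):
print(smoothed_estimate([(3, 10), (40, 200), (500, 10000)], [0.6, 0.3, 0.1]))  # 0.245
```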

  • Slide 30/91

    Lexicalized Probabilistic Context-Free Grammars

    Transformation to lexicalized rules:

    S → NP VP vs. S(questioned) → NP(lawyer) VP(questioned)

    Smoothed estimation techniques blend different counts

    Search for most probable tree through dynamic programming

    Perform vastly better than PCFGs (88% vs. 73% accuracy)


  • Slide 31/91

    Independence Assumptions

    PCFGs:
    (S (NP (DT the) (N lawyer)) (VP (V questioned) (NP (DT the) (N witness))))

    Lexicalized PCFGs:
    (S(questioned) (NP(lawyer) (DT the) (N lawyer)) (VP(questioned) (V questioned) (NP(witness) (DT the) (N witness))))

  • Slide 32/91

    Results

    Method                                               Accuracy
    PCFGs (Charniak 97)                                  73.0%
    Conditional Models: Decision Trees (Magerman 95)     84.2%
    Lexical Dependencies (Collins 96)                    85.5%
    Conditional Models: Logistic (Ratnaparkhi 97)        86.9%
    Generative Lexicalized Model (Charniak 97)           86.7%
    Generative Lexicalized Model (Collins 97)            88.2%
    Logistic-inspired Model (Charniak 99)                89.6%
    Boosting (Collins 2000)                              89.8%

    Accuracy = average recall/precision

  • Slide 33/91

    Parsing for Information Extraction: Relationships between Entities

    INPUT: Boeing is located in Seattle.

    OUTPUT:
    Relationship = Company-Location
    Company = Boeing
    Location = Seattle

  • Slide 34/91

    A Generative Model [Miller et al. 2000]

    [Miller et al. 2000] use non-terminals to carry lexical items and semantic tags:

    (S[is, CL]
      (NP[Boeing, COMPANY] Boeing)
      (VP[is, CL-LOC] (V is)
        (VP[located, CL-LOC] (V located)
          (PP[in, CL-LOC] (P in)
            (NP[Seattle, LOCATION] Seattle)))))

    For PP[in, CL-LOC]: in = lexical head, CL-LOC = semantic tag

  • Slide 35/91

    A Generative Model [Miller et al. 2000]

    We're now left with an even more complicated estimation problem:

    P(S[is, CL] → NP[Boeing, COMPANY] VP[is, CL-LOC])

    See [Miller et al. 2000] for the details.

    The parsing algorithm recovers annotated trees: it simultaneously recovers syntactic structure and named-entity relationships.

    Accuracy (precision/recall) is greater than 80% in recovering relations.

  • Slide 36/91

    Techniques Covered in this Tutorial

    Generative models for parsing

    Log-linear (maximum-entropy) taggers

    Learning theory for NLP

  • Slide 37/91

    Tagging Problems

    TAGGING: Strings to Tagged Sequences

    a b e e a f h j → a/C b/D e/C e/C a/D f/C h/D j/C

    Example 1: Part-of-speech tagging

    Profits/N soared/V at/P Boeing/N Co./N ,/, easily/ADV topping/V

    forecasts/N on/P Wall/N Street/N ,/, as/P their/POSS CEO/N Alan/N

    Mulally/N announced/V first/ADJ quarter/N results/N ./.

    Example 2: Named Entity Recognition

    Profits/NA soared/NA at/NA Boeing/SC Co./CC ,/NA easily/NA

    topping/NA forecasts/NA on/NA Wall/SL Street/CL ,/NA as/NA their/NA

    CEO/NA Alan/SP Mulally/CP announced/NA first/NA quarter/NA

    results/NA ./NA

  • Slide 38/91

    Log-Linear Models

    Assume we have sets X and Y

    Goal: define P(y | x) for any x ∈ X, y ∈ Y

    A feature-vector representation is Φ: X × Y → R^d

    Parameters W ∈ R^d

    Define P(y | x; W) = e^{Φ(x,y)·W} / Z(x, W), where Z(x, W) = Σ_{y' ∈ Y} e^{Φ(x,y')·W}
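    A small Python sketch of this definition (the feature function phi and the label set Y are assumed inputs supplied by the caller):

```python
import math

def log_linear_prob(phi, W, x, y, Y):
    """P(y | x; W) = exp(phi(x, y) . W) / Z(x, W); phi returns a list as long as W."""
    score = lambda y_: sum(w * f for w, f in zip(W, phi(x, y_)))
    Z = sum(math.exp(score(y_)) for y_ in Y)  # normalization over all labels
    return math.exp(score(y)) / Z
```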

  • Slide 39/91

    Log-Linear Taggers: Independence Assumptions

  • Slide 40/91

    Log-Linear Taggers: Independence Assumptions

    The basic idea:

    Chain rule: P(t_1 ... t_n | w_1 ... w_n) = ∏_{j=1}^n P(t_j | w_1 ... w_n, t_1 ... t_{j-1})

    Independence assumptions: P(t_j | w_1 ... w_n, t_1 ... t_{j-1}) = P(t_j | w_1 ... w_n, t_{j-2}, t_{j-1})

    Two questions:

    1. How to parameterize P(t_j | w_1 ... w_n, t_{j-2}, t_{j-1})?

    2. How to find argmax_{t_1 ... t_n} P(t_1 ... t_n | w_1 ... w_n)?

  • Slide 41/91

    The Parameterization Problem

    Hispaniola/NNP quickly/RB became/VB an/DT

    important/JJ base/?? from which Spain expanded

    its empire into the rest of the Western Hemisphere .

    There are many possible tags in the position ??

    Need to learn a function from (context, tag) pairs to a probability

  • Slide 42/91

    Representation: Histories

    A history is a 4-tuple ⟨t_{-2}, t_{-1}, w_{[1:n]}, i⟩ where:

    t_{-2}, t_{-1} are the previous two tags

    w_{[1:n]} are the n words in the input sentence

    i is the index of the word being tagged

    Example: Hispaniola/NNP quickly/RB became/VB an/DT important/JJ base/?? from which Spain expanded its empire into the rest of the Western Hemisphere .

    Here t_{-2}, t_{-1} = DT, JJ; w_{[1:n]} = ⟨Hispaniola, quickly, became, ...⟩; i = 6

  • Slide 43/91

    Feature Vector Representations

    Take a history/tag pair (h, t).

    φ_s(h, t) for s = 1 ... d are features representing tagging decision t in context h.

    Example: POS Tagging [Ratnaparkhi 96]

    Word/tag features:
    φ_1(h, t) = 1 if current word w_i is base and t = VB; 0 otherwise
    φ_2(h, t) = 1 if current word w_i ends in ing and t = VBG; 0 otherwise

    Contextual features:
    φ_3(h, t) = 1 if ⟨t_{-2}, t_{-1}, t⟩ = ⟨DT, JJ, VB⟩; 0 otherwise

  • Slide 44/91

    Part-of-Speech (POS) Tagging [Ratnaparkhi 96]

    Word/tag features:
    φ_1(h, t) = 1 if current word w_i is base and t = VB; 0 otherwise

    Spelling features:
    φ_2(h, t) = 1 if current word w_i ends in ing and t = VBG; 0 otherwise
    φ_3(h, t) = 1 if current word w_i starts with pre and t = NN; 0 otherwise

  • Slide 45/91

    Ratnaparkhi's POS Tagger

    Contextual features:
    φ_4(h, t) = 1 if ⟨t_{-2}, t_{-1}, t⟩ = ⟨DT, JJ, VB⟩; 0 otherwise
    φ_5(h, t) = 1 if ⟨t_{-1}, t⟩ = ⟨JJ, VB⟩; 0 otherwise
    φ_6(h, t) = 1 if ⟨t⟩ = ⟨VB⟩; 0 otherwise
    φ_7(h, t) = 1 if previous word w_{i-1} = the and t = VB; 0 otherwise
    φ_8(h, t) = 1 if next word w_{i+1} = the and t = VB; 0 otherwise
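    A sketch of such indicator features over a history (the tuple layout and feature names below are assumptions, not Ratnaparkhi's actual feature set):

```python
def features(h, t):
    """h = (t_minus2, t_minus1, words, i); t = proposed tag for words[i]."""
    t2, t1, words, i = h
    w = words[i]
    return {
        "word=base,t=VB":   1 if w == "base" and t == "VB" else 0,          # word/tag
        "suffix=ing,t=VBG": 1 if w.endswith("ing") and t == "VBG" else 0,   # spelling
        "prefix=pre,t=NN":  1 if w.startswith("pre") and t == "NN" else 0,  # spelling
        "tags=DT,JJ,t=VB":  1 if (t2, t1, t) == ("DT", "JJ", "VB") else 0,  # contextual
        "prev=the,t=VB":    1 if i > 0 and words[i - 1] == "the" and t == "VB" else 0,
    }

h = ("DT", "JJ", "Hispaniola quickly became an important base".split(), 5)
print(features(h, "VB"))  # the word/tag and tag-trigram features fire
```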

  • Slide 46/91

    Log-Linear (Maximum-Entropy) Models

    Take a history/tag pair (h, t).

    φ_s(h, t) for s = 1 ... d are features

    W_s for s = 1 ... d are parameters

    The parameters define a conditional distribution

    P(t | h; W) = e^{Φ(h,t)·W} / Z(h, W), where Z(h, W) = Σ_{t' ∈ T} e^{Φ(h,t')·W}

  • Slide 47/91

    Log-Linear (Maximum-Entropy) Models

    P(t_{[1:n]} | w_{[1:n]}) = ∏_{j=1}^n P(t_j | h_j), with:

    word sequence w_{[1:n]}; tag sequence t_{[1:n]}; histories h_j = ⟨t_{j-2}, t_{j-1}, w_{[1:n]}, j⟩

    Equivalently, log P(t_{[1:n]} | w_{[1:n]}) = Σ_j Φ(h_j, t_j)·W (linear score) − Σ_j log Z(h_j, W) (local normalization terms)

  • Slide 48/91

    Parameter estimation: maximize the likelihood of the training data through gradient descent or iterative scaling.

    Search for argmax_{t_{[1:n]}} P(t_{[1:n]} | w_{[1:n]}): dynamic programming (Viterbi), O(n|T|^3) complexity.

    Experimental results:

    Almost 97% accuracy for POS tagging [Ratnaparkhi 96]

    Over 90% accuracy for named-entity extraction [Borthwick et al. 98]

    Better results than an HMM for FAQ segmentation [McCallum et al. 2000]

  • Slide 49/91

    Linear Models for Classification

  • Slide 50/91

    Goal: learn a function F: X → {−1, +1}

    Training examples (x_i, y_i) for i = 1 ... m

    A representation Φ: X → R^d, a parameter vector W ∈ R^d

    The classifier is defined as F(x) = sign(Φ(x)·W)

    Unifying framework for many results: support vector machines, boosting, kernel methods, logistic regression, margin-based generalization bounds, online algorithms (perceptron, winnow), mistake bounds, etc.

    How can these methods be generalized beyond classification problems?

  • Slide 51/91

    Linear Models for Parsing and Tagging

    Goal: learn a function F: X → Y

    Training examples (x_i, y_i) for i = 1 ... m

    Three components to the model:

    A function GEN(x) enumerating candidates for x

    A representation Φ: X × Y → R^d

    A parameter vector W ∈ R^d

    The function is defined as F(x) = argmax_{y ∈ GEN(x)} Φ(x, y)·W
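    The three components combine into a one-line decision rule; a sketch, assuming GEN, Phi, and W are supplied by the caller:

```python
def F(x, GEN, Phi, W):
    """F(x) = argmax over y in GEN(x) of Phi(x, y) . W."""
    return max(GEN(x), key=lambda y: sum(w * f for w, f in zip(W, Phi(x, y))))
```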

  • Slide 52/91

    Component 1: GEN

    GEN(x) enumerates a set of candidates for a sentence x.

    She announced a program to promote safety in trucks and vans

    [GEN produces six candidate parse trees for this sentence, differing in the attachment of the PP "in trucks and vans" and in the scope of the coordination.]

  • Slide 53/91

    Examples of GEN:

    A context-free grammar

    A finite-state machine

    The top N most probable analyses from a probabilistic grammar

  • Slide 54/91

    Features

  • Slide 55/91

    A feature is a function on a structure, e.g.,

    h(y) = number of times the rule (A → B C) is seen in y

    [Three example trees on which the feature is evaluated.]

  • Slide 56/91

    Feature Vectors

    A set of functions h_1 ... h_d define a feature vector Φ(y) = ⟨h_1(y), h_2(y), ..., h_d(y)⟩

    [The same example trees, each mapped to its feature vector.]

  • Slide 57/91

    Component 3: W

    W is a parameter vector, W ∈ R^d.

    Φ and W together map a candidate to a real-valued score Φ(y)·W.

    [Example: a candidate parse tree for "She announced a program to promote safety in trucks and vans" is mapped to its feature vector and then to the score Φ(y)·W.]

  • Slide 58/91

    Putting it all Together

    X is a set of sentences, Y is a set of possible outputs (e.g. trees)

    Need to learn a function F: X → Y

    Given GEN, Φ, W, define F(x) = argmax_{y ∈ GEN(x)} Φ(y)·W

    Choose the highest-scoring tree as the most plausible structure

    Given examples (x_i, y_i), how do we set W?

    She announced a program to promote safety in trucks and vans

  • Slide 59/91

    [The six candidate parses for the sentence are mapped to scores 13.6, 12.2, 12.1, 3.3, 9.4, 11.1 under Φ(y)·W; the highest-scoring parse (13.6) is returned as F(x).]

  • Slide 60/91

    Markov Random Fields [Johnson et al. 1999, Lafferty et al. 2001]

    Parameters define a conditional distribution over candidates:

    P(y | x; W) = e^{Φ(y)·W} / Σ_{y' ∈ GEN(x)} e^{Φ(y')·W}

    Gaussian prior: log P(W) = −C‖W‖² + constant

    MAP parameter estimates maximise Σ_i log P(y_i | x_i; W) + log P(W)

    Note: this is a globally normalised model

  • Slide 61/91

    A Variant of the Perceptron Algorithm

    Inputs: training set (x_i, y_i) for i = 1 ... n

    Initialization: W = 0

    Define: F(x) = argmax_{y ∈ GEN(x)} Φ(y)·W

    Algorithm: for t = 1 ... T, i = 1 ... n:
      z_i = F(x_i)
      if z_i ≠ y_i then W = W + Φ(y_i) − Φ(z_i)

    Output: parameters W
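    A minimal sketch of this perceptron variant (sparse feature dictionaries are an assumed encoding, not part of the slides):

```python
from collections import defaultdict

def train_perceptron(examples, GEN, Phi, T=5):
    """examples: list of (x, y) pairs; Phi(x, y) returns a sparse dict of feature counts."""
    W = defaultdict(float)
    score = lambda feats: sum(W[k] * v for k, v in feats.items())
    for _ in range(T):
        for x, y in examples:
            z = max(GEN(x), key=lambda c: score(Phi(x, c)))  # current best candidate
            if z != y:  # on an error, move W toward the gold candidate
                for k, v in Phi(x, y).items():
                    W[k] += v
                for k, v in Phi(x, z).items():
                    W[k] -= v
    return W
```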

  • Slide 62/91

    Results that Carry Over from Classification

  • Slide 63/91

    [Freund and Schapire 99] define the voted perceptron, and prove results for the inseparable case.

    Compression bounds [Littlestone and Warmuth, 1986]:

    Say the perceptron algorithm makes m mistakes when run to convergence over a training set of size n.

    Then for all distributions D, with probability at least 1 − δ over the choice of a training set of size n drawn from D, if h is the hypothesis at convergence, the generalization error of h is bounded by a function of m, n, and δ (a compression bound: the hypothesis depends only on the m mistake examples).

  • Slide 64/91

    Define:

  • Slide 65/91

    The margin on the i-th training example: m_i = Φ(x_i, y_i)·W − max_{y ∈ GEN(x_i), y ≠ y_i} Φ(x_i, y)·W

    Generalization Theorem: For all distributions D generating examples, for all parameter vectors W with ‖W‖ = 1, for all margins γ > 0, with probability at least 1 − δ over the choice of a training set of size n drawn from D, the expected error is bounded by the fraction of training examples with margin m_i < γ, plus a term that grows with R/γ, C, and log(1/δ) and shrinks with n. Here R is a constant such that, for all x and all y ∈ GEN(x), ‖Φ(x, y)‖ ≤ R; the variable C is the smallest positive integer such that |GEN(x_i)| ≤ C.

    Proof: See [Collins 2002b]. Based on [Bartlett 1998, Zhang, 2002]; closely related to multiclass bounds in e.g., [Schapire et al., 1998].

  • Slide 66/91

    Perceptron Algorithm 1: Tagging

    Going back to tagging:

    Inputs x are sentences w_{[1:n]}

    GEN(w_{[1:n]}) = T^n, i.e. all tag sequences of length n

    Global representations Φ are composed from local feature vectors φ:

    Φ(w_{[1:n]}, t_{[1:n]}) = Σ_{j=1}^n φ(h_j, t_j), where h_j = ⟨t_{j-2}, t_{j-1}, w_{[1:n]}, j⟩

  • Slide 67/91

    Perceptron Algorithm 1: Tagging

    Typically, local features are indicator functions, e.g.,

    φ_s(h, t) = 1 if current word w_j ends in ing and t = VBG; 0 otherwise

    and global features are then counts:

    Φ_s(w_{[1:n]}, t_{[1:n]}) = number of times a word ending in ing is tagged as VBG in (w_{[1:n]}, t_{[1:n]})

  • Slide 68/91

    Perceptron Algorithm 1: Tagging

    Score for a (w_{[1:n]}, t_{[1:n]}) pair is Φ(w_{[1:n]}, t_{[1:n]})·W = Σ_{j=1}^n φ(h_j, t_j)·W

    The Viterbi algorithm computes argmax_{t_{[1:n]} ∈ T^n} Φ(w_{[1:n]}, t_{[1:n]})·W

    Log-linear taggers (see the earlier part of the tutorial) are locally normalised models:

    log P(t_{[1:n]} | w_{[1:n]}) = Σ_{j=1}^n log P(t_j | h_j) = Σ_{j=1}^n φ(h_j, t_j)·W (linear model) − Σ_{j=1}^n log Z(h_j, W) (local normalization)
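    A sketch of Viterbi search for this linear tagging model, simplified to condition on a single previous tag (the slides condition on two; the extension tracks tag pairs as states):

```python
def viterbi(words, tags, local_score):
    """local_score(prev_tag, words, i, tag) -> float. Returns the argmax tag sequence."""
    n = len(words)
    delta = {"<s>": 0.0}  # best score of any partial sequence ending in each tag
    backs = []            # backpointers, one dict per position
    for i in range(n):
        new_delta, bp = {}, {}
        for t in tags:
            prev, s = max(
                ((p, delta[p] + local_score(p, words, i, t)) for p in delta),
                key=lambda ps: ps[1],
            )
            new_delta[t], bp[t] = s, prev
        delta = new_delta
        backs.append(bp)
    t = max(delta, key=delta.get)   # best final tag
    seq = [t]
    for bp in reversed(backs[1:]):  # follow backpointers right to left
        t = bp[t]
        seq.append(t)
    return list(reversed(seq))
```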

  • Slide 69/91

    Training the Parameters

    Inputs: training set (w^i_{[1:n_i]}, t^i_{[1:n_i]}) for i = 1 ... n

    Initialization: W = 0

    Algorithm: for t = 1 ... T, i = 1 ... n:
      z = argmax_{u ∈ T^{n_i}} Φ(w^i, u)·W   (output on the i-th sentence with current parameters, via Viterbi)
      if z ≠ t^i then W = W + Φ(w^i, t^i) − Φ(w^i, z)   (add the correct tags' feature values, subtract the incorrect tags' feature values)

    Output: parameter vector W

  • Slide 70/91

    An Example

    Say the correct tags for the i-th sentence are:

    the/DT man/NN bit/VBD the/DT dog/NN

    Under the current parameters, the output is:

    the/DT man/NN bit/NN the/DT dog/NN

    Assume also that the features track: (1) all tag bigrams; (2) word/tag pairs

    Parameters incremented: ⟨NN, VBD⟩, ⟨VBD, DT⟩, ⟨VBD → bit⟩

    Parameters decremented: ⟨NN, NN⟩, ⟨NN, DT⟩, ⟨NN → bit⟩

  • Slide 71/91

    Experiments

    Wall Street Journal part-of-speech tagging data:
    Perceptron = 2.89% error, Max-ent = 3.28% error (11.9% relative error reduction)

    [Ramshaw and Marcus 95] NP chunking data:
    Perceptron = 93.63% F-measure, Max-ent = 93.29% F-measure (5.1% relative error reduction)

    See [Collins 2002a]

  • Slide 72/91

    Perceptron Algorithm 2: Reranking Approaches

    GEN(x) is the top N most probable candidates from a base model:

    Parsing: a lexicalized probabilistic context-free grammar

    Tagging: a maximum-entropy tagger

    Speech recognition: an existing recogniser

  • Slide 73/91

    Parsing Experiments

    Beam search is used to parse training and test sentences: around 27 parses for each sentence.

    Representation: Φ_1(x, y) is the log-likelihood of (x, y) from the first-pass parser; Φ_2(x, y) ... Φ_d(x, y) are indicator functions, e.g.

    Φ_s(x, y) = 1 if y contains a particular rule or structural fragment; 0 otherwise

    [Example candidate parse for "She announced a program to promote safety in trucks and vans", with such fragments marked.]
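    A sketch of the resulting reranker (the candidate layout and names below are assumptions):

```python
def rerank(candidates, indicator_fns, W):
    """candidates: (tree, base_log_prob) pairs from the first-pass parser.
    Feature 1 is the base model's log-probability; the rest are 0/1 indicators."""
    def score(tree, logp):
        feats = [logp] + [f(tree) for f in indicator_fns]
        return sum(w * v for w, v in zip(W, feats))
    return max(candidates, key=lambda c: score(*c))
```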

  • Slide 74/91

    "

    '

    ,

    if

    contains a boundary = [The

    otherwise

    Whether youre an aging flower child or a clueless

    [Gen-Xer], [The Day They Shot John Lennon], playing at the

    [Dougherty Arts Center], entertains the imagination.

    "

    '

    Whether youre an aging flower child or a clueless

    [Gen-Xer], [The Day They Shot John Lennon], playing at the

    [Dougherty Arts Center] entertains the imagination

  • Slide 75/91

    Whether you're an aging flower child or a clueless [Gen-Xer], [The Day They Shot John Lennon], playing at the [Dougherty Arts Center], entertains the imagination.

    Whether you're an aging flower child or a clueless Gen-Xer, The Day [They Shot John Lennon], playing at the [Dougherty Arts Center], entertains the imagination.

    Whether you're an aging flower child or a clueless [Gen-Xer], The Day [They Shot John Lennon], playing at the [Dougherty Arts Center], entertains the imagination.

  • Slide 76/91

    Experiments

    Parsing the Wall Street Journal Treebank:
    Training set = 40,000 sentences; test = 2,416 sentences
    State-of-the-art parser: 88.2% F-measure
    Reranked model: 89.5% F-measure (11% relative error reduction)
    Boosting: 89.7% F-measure (13% relative error reduction)

    Recovering Named Entities in Web Data:
    Training data = 53,609 sentences (1,047,491 words); test data = 14,717 sentences (291,898 words)
    State-of-the-art tagger: 85.3% F-measure
    Reranked model: 87.9% F-measure (17.7% relative error reduction)
    Boosting: 87.6% F-measure (15.6% relative error reduction)

  • Slide 77/91

    Perceptron Algorithm 3: Kernel Methods (work with Nigel Duffy)

    It is simple to derive a dual form of the perceptron algorithm.

    If we can compute inner products Φ(y)·Φ(y') efficiently, we can learn efficiently with the representation Φ.

  • Slide 78/91

    All Subtrees Representation [Bod 98]

    Given: non-terminal symbols {A, B, C, ...} and terminal symbols {a, b, c, ...}

    There is an infinite set of subtrees, e.g.:
    (A B C), (A (B b) E), (A (B b) C), (A B), ...

    Step 1: choose an (arbitrary) mapping from subtrees to integers, and define

    h_i(y) = number of times subtree i is seen in y

    Φ(y) = ⟨h_1(y), h_2(y), h_3(y), ...⟩

  • Slide 79/91

    All Subtrees Representation

    Φ is now huge. But the inner product Φ(y_1)·Φ(y_2) can be computed efficiently using dynamic programming. See [Collins and Duffy 2001, Collins and Duffy 2002].

  • Slide 80/91

  • Slide 81/91

    An Example

    Consider two trees:

    y_1 = (A (B (D d) (E e)) (C (F f) (G g)))
    y_2 = (A (B (D d) (E e)) (C (F h) (G i)))

    Φ(y_1)·Φ(y_2) is a sum, over pairs of nodes (one from each tree), of the number of common subtrees rooted at the pair.

    Most of these terms are 0 (e.g. node pairs with different productions).

    Some are non-zero: the two B nodes share the subtrees (B D E), (B (D d) E), (B D (E e)), and (B (D d) (E e)).

  • Slide 82/91

    Recursive Definition of C(n_1, n_2)

    C(n_1, n_2) = the number of common subtrees rooted at nodes n_1 and n_2:

    If the productions at n_1 and n_2 are different: C(n_1, n_2) = 0

    Else if n_1, n_2 are pre-terminals: C(n_1, n_2) = 1

    Else: C(n_1, n_2) = ∏_{j=1}^{nc(n_1)} (1 + C(ch(n_1, j), ch(n_2, j)))

    where nc(n_1) is the number of children of node n_1, and ch(n_1, j) is the j-th child of n_1.
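    A direct transcription of this recursion (trees encoded as (label, child, ...) tuples with terminal leaves as strings; nodes are assumed to be either pre-terminals or to have only tree-valued children):

```python
def C(n1, n2):
    """Number of common subtrees rooted at n1 and n2."""
    prod1 = (n1[0], tuple(c if isinstance(c, str) else c[0] for c in n1[1:]))
    prod2 = (n2[0], tuple(c if isinstance(c, str) else c[0] for c in n2[1:]))
    if prod1 != prod2:
        return 0                        # different productions: no common subtrees
    if all(isinstance(c, str) for c in n1[1:]):
        return 1                        # pre-terminals
    result = 1
    for c1, c2 in zip(n1[1:], n2[1:]):  # same production, so children line up
        result *= 1 + C(c1, c2)
    return result

b = ("B", ("D", "d"), ("E", "e"))
print(C(b, b))  # 4: (B D E), (B (D d) E), (B D (E e)), (B (D d) (E e))
```

    The full inner product Φ(y_1)·Φ(y_2) is then the sum of C(n_1, n_2) over all node pairs.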

  • Slide 83/91

    Illustration of the Recursion

    [The same two trees as above.]

    How many subtrees do the two B nodes have in common? i.e., what is C(n_1, n_2) for the B nodes?

    Four: (B D E), (B (D d) E), (B D (E e)), (B (D d) (E e)). The two C nodes (with children F and G) share only the bare production subtree (C F G), since their pre-terminals dominate different words.

  • Slide 84/91

    Similar Kernels Exist for Tagged Sequences

  • Slide 85/91

    Whether you're an aging flower child or a clueless [Gen-Xer], [The Day They Shot John Lennon], playing at the [Dougherty Arts Center], entertains the imagination.

    [Features track sub-fragments of the tagged sequence, e.g. "Whether ... [Gen-Xer], ... Day They ... John Lennon], playing".]

  • Slide 86/91

  • Slide 87/91

    Open Questions

    Can the large-margin hypothesis be trained efficiently, even when GEN(x) is huge? (see [Altun, Tsochantaridis, and Hofmann, 2003])

    Can the large-margin bound be improved, to remove the C factor?

    Which representations lead to good performance on parsing, tagging etc.?

    Can methods with global kernels be implemented efficiently?

  • Slide 88/91

    Conclusions

    Some Other Topics in Statistical NLP:

    Machine translation
    Unsupervised/partially supervised methods
    Finite state machines
    Generation
    Question answering
    Coreference
    Language modeling for speech recognition
    Lexical semantics
    Word sense disambiguation
    Summarization

  • Slide 89/91

    References

    [Altun, Tsochantaridis, and Hofmann, 2003] Altun, Y., Tsochantaridis, I., and Hofmann, T. (2003). Hidden Markov Support Vector Machines. In Proceedings of ICML 2003.

    [Bartlett 1998] Bartlett, P. L. (1998). The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2):525-536.

    [Bod 98] Bod, R. (1998). Beyond Grammar: An Experience-Based Theory of Language. CSLI Publications/Cambridge University Press.

    [Booth and Thompson 73] Booth, T., and Thompson, R. (1973). Applying probability measures to abstract languages. IEEE Transactions on Computers, C-22(5):442-450.

    [Borthwick et al. 98] Borthwick, A., Sterling, J., Agichtein, E., and Grishman, R. (1998). Exploiting Diverse Knowledge Sources via Maximum Entropy in Named Entity Recognition. In Proceedings of the Sixth Workshop on Very Large Corpora.

    [Collins and Duffy 2001] Collins, M., and Duffy, N. (2001). Convolution Kernels for Natural Language. In Proceedings of NIPS 14.

    [Collins and Duffy 2002] Collins, M., and Duffy, N. (2002). New Ranking Algorithms for Parsing and Tagging: Kernels over Discrete Structures, and the Voted Perceptron. In Proceedings of ACL 2002.

    [Collins 2002a] Collins, M. (2002a). Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with the Perceptron Algorithm. In Proceedings of EMNLP 2002.

    [Collins 2002b] Collins, M. (2002b). Parameter Estimation for Statistical Parsing Models: Theory and Practice of Distribution-Free Methods. To appear as a book chapter.

    [Crammer and Singer 2001a] Crammer, K., and Singer, Y. (2001a). On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines. Journal of Machine Learning Research, 2(Dec):265-292.

  • Slide 90/91

    [Crammer and Singer 2001b] Crammer, K., and Singer, Y. (2001b). Ultraconservative Online Algorithms for Multiclass Problems. In Proceedings of COLT 2001.

    [Freund and Schapire 99] Freund, Y., and Schapire, R. (1999). Large Margin Classification using the Perceptron Algorithm. Machine Learning, 37(3):277-296.

    [Helmbold and Warmuth 95] Helmbold, D., and Warmuth, M. (1995). On Weak Learning. Journal of Computer and System Sciences, 50(3):551-573.

    [Hopcroft and Ullman 1979] Hopcroft, J. E., and Ullman, J. D. (1979). Introduction to Automata Theory, Languages, and Computation. Reading, Mass.: Addison-Wesley.

    [Johnson et al. 1999] Johnson, M., Geman, S., Canon, S., Chi, S., and Riezler, S. (1999). Estimators for stochastic unification-based grammars. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics. San Francisco: Morgan Kaufmann.

    [Lafferty et al. 2001] Lafferty, J., McCallum, A., and Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of ICML 2001, pages 282-289.

    [Littlestone and Warmuth, 1986] Littlestone, N., and Warmuth, M. (1986). Relating data compression and learnability. Technical report, University of California, Santa Cruz.

    [MSM93] Marcus, M., Santorini, B., and Marcinkiewicz, M. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19:313-330.

    [McCallum et al. 2000] McCallum, A., Freitag, D., and Pereira, F. (2000). Maximum entropy Markov models for information extraction and segmentation. In Proceedings of ICML 2000.

    [Miller et al. 2000] Miller, S., Fox, H., Ramshaw, L., and Weischedel, R. (2000). A Novel Use of Statistical Parsing to Extract Information from Text. In Proceedings of ANLP 2000.

  • Slide 91/91

    [Ramshaw and Marcus 95] Ramshaw, L., and Marcus, M. P. (1995). Text Chunking Using Transformation-Based Learning. In Proceedings of the Third ACL Workshop on Very Large Corpora.

    [Ratnaparkhi 96] Ratnaparkhi, A. (1996). A maximum entropy part-of-speech tagger. In Proceedings of the Empirical Methods in Natural Language Processing Conference.

    [Schapire et al., 1998] Schapire, R., Freund, Y., Bartlett, P., and Lee, W. S. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651-1686.

    [Zhang, 2002] Zhang, T. (2002). Covering Number Bounds of Certain Regularized Linear Function Classes. Journal of Machine Learning Research, 2(Mar):527-550.