Statistical Decision-Tree Models for Parsing


Page 1: Statistical Decision-Tree Models for Parsing

Statistical Decision-Tree Models for Parsing

NLP lab, POSTECH

김 지 협

Page 2: Statistical Decision-Tree Models for Parsing


Contents

Abstract

Introduction

Decision-Tree Modeling

SPATTER Parsing

Statistical Parsing Models

Decision-Tree Growing & Smoothing

Decision-Tree Training

Experiment Results

Conclusion

Page 3: Statistical Decision-Tree Models for Parsing


Abstract

Syntactic NL parsers: not adequate for highly ambiguous, large-vocabulary text (e.g. the Wall Street Journal)

Premises for developing a new parser:

grammars are too complex to develop manually for most domains

parsing models must rely heavily on contextual information

existing n-gram models are inadequate for parsing

SPATTER: a statistical parser based on decision-tree models, better than a grammar-based parser

Page 4: Statistical Decision-Tree Models for Parsing


Introduction

Parsing as making a sequence of disambiguation decisions

The probability of a complete parse tree T of a sentence S:

$$P(T \mid S) = \prod_{i} P(d_i \mid d_1 d_2 \ldots d_{i-1}, S)$$

where $d_1 d_2 \ldots d_n$ is the sequence of decisions that builds T

Automatically discovering the rules for disambiguation

Producing a parser without a complicated grammar

Long-distance lexical information is crucial to disambiguate interpretations accurately
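A minimal sketch of the factorization above; the decision sequence and its probabilities are invented for illustration, not taken from the paper:

```python
import math

# A parse tree's probability as the product of its disambiguation
# decisions: P(T|S) = prod_i P(d_i | d_1 ... d_{i-1}, S).
def parse_probability(decision_probs):
    # Sum logs for numerical stability, then exponentiate.
    return math.exp(sum(math.log(p) for p in decision_probs))

# Hypothetical per-decision probabilities P(d_i | d_1 ... d_{i-1}, S),
# e.g. as returned by a decision-tree model queried at each step.
print(parse_probability([0.9, 0.7, 0.95, 0.8]))  # ~0.4788
```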

Page 5: Statistical Decision-Tree Models for Parsing


Decision-Tree Modeling

Comparison

Grammarian: two crucial tasks for parsing

identifying the features relevant to each decision

deciding which choice to select based on the values of the features

Decision tree: the above two tasks plus a third

assigning a probability distribution to the possible choices, and providing a ranking system

Page 6: Statistical Decision-Tree Models for Parsing


Continued

What is a Statistical Decision Tree?

A decision-making device that assigns a probability to each of the possible choices based on the context of the decision:

P ( f | h ), where f is an element of the future vocabulary and h is a history (the context of the decision)

The probability is determined by asking a sequence of questions, where the i-th question is determined by the answers to the i - 1 previous questions.

Example: the part-of-speech tagging problem (Figure 1)
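A minimal sketch of such a device, assuming toy questions and leaf distributions (none of these values come from the paper): internal nodes ask questions about the history h, and the leaf reached holds the distribution P(f | h):

```python
# A minimal statistical decision tree: internal nodes ask a question about
# the history h; leaves hold a probability distribution over futures f.
class Node:
    def __init__(self, question=None, children=None, distribution=None):
        self.question = question          # function: history -> answer
        self.children = children or {}    # answer -> child Node
        self.distribution = distribution  # leaf: dict future -> P(f | h)

    def prob(self, f, history):
        node = self
        while node.distribution is None:  # ask questions until a leaf
            node = node.children[node.question(history)]
        return node.distribution.get(f, 0.0)

# Toy POS-tagging tree: the first question looks at the word, the second
# at the previous tag (each question may depend on earlier answers).
leaf_det = Node(distribution={"DT": 0.95, "NN": 0.05})
leaf_nn  = Node(distribution={"NN": 0.7, "VB": 0.3})
leaf_vb  = Node(distribution={"VB": 0.8, "NN": 0.2})

prev_tag_node = Node(question=lambda h: h["prev_tag"] == "DT",
                     children={True: leaf_nn, False: leaf_vb})
root = Node(question=lambda h: h["word"] == "the",
            children={True: leaf_det, False: prev_tag_node})

print(root.prob("NN", {"word": "dog", "prev_tag": "DT"}))  # 0.7
```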

Page 7: Statistical Decision-Tree Models for Parsing


Continued

Decision Trees vs. n-grams

Equivalent to an interpolated n-gram model in expressive power

Model parameterization

n-gram model:

$$P(f \mid h_1 h_2 \ldots h_n), \qquad \text{number of parameters: } |F| \cdot |H|^n$$

An n-gram model can be represented by a decision-tree model (n - 1 questions). Example: part-of-speech tagging, with $P(t_i \mid w_i\, t_{i-1}\, t_{i-2})$ as a 4-gram model:

1. What is the word being tagged?
2. What is the tag of the previous word?
3. What is the tag of the word two words back?
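To see why the full conditional table is unmanageable, here is a back-of-the-envelope parameter count for this 4-gram tagging model; the tag-set size matches the WSJ experiments later in the talk, while the vocabulary size is an assumed round number:

```python
# Rough size of the full table for P(t_i | w_i, t_{i-1}, t_{i-2}):
# |F| * |H| parameters, with F the tag set and H = W x T x T the histories.
num_tags, vocab_size = 46, 10_000   # 46 tags as in WSJ; vocab is assumed
num_params = num_tags * (vocab_size * num_tags * num_tags)
print(f"{num_params:,}")  # 973,360,000
```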

Page 8: Statistical Decision-Tree Models for Parsing


Continued

Model estimation

n-gram model:

1st: counting the number:

$$\tilde P(f \mid h_1 h_2 \ldots h_n) = \frac{\mathrm{Count}(h_1 h_2 \ldots h_n\, f)}{\mathrm{Count}(h_1 h_2 \ldots h_n)}$$

2nd: smoothing (interpolating all sub-histories, shown for n = 3):

$$\begin{aligned}
P(f \mid h_1 h_2 h_3) ={} & \lambda_1(h_1 h_2 h_3)\,\tilde P(f \mid h_1 h_2 h_3) \\
 & + \lambda_2(h_1 h_2 h_3)\,\tilde P(f \mid h_1 h_2) + \lambda_3(h_1 h_2 h_3)\,\tilde P(f \mid h_1 h_3) \\
 & + \lambda_4(h_1 h_2 h_3)\,\tilde P(f \mid h_2 h_3) + \lambda_5(h_1 h_2 h_3)\,\tilde P(f \mid h_1) \\
 & + \lambda_6(h_1 h_2 h_3)\,\tilde P(f \mid h_2) + \lambda_7(h_1 h_2 h_3)\,\tilde P(f \mid h_3)
\end{aligned}$$
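A runnable sketch of both steps for a three-item history; uniform interpolation weights stand in for the trained lambdas, and the tiny corpus is invented:

```python
from collections import Counter, defaultdict
from itertools import combinations

# (1) Relative-frequency counts for every sub-history of (h1, h2, h3);
# (2) smoothing by linearly interpolating the seven sub-history estimates.
def train(data):
    """data: iterable of ((h1, h2, h3), f). Returns count tables keyed by
    the index subset of the history that was kept."""
    joint = defaultdict(Counter)   # (subset, sub_history) -> Counter over f
    total = Counter()              # (subset, sub_history) -> history count
    for hist, f in data:
        for r in range(1, 4):
            for subset in combinations(range(3), r):
                key = (subset, tuple(hist[i] for i in subset))
                joint[key][f] += 1
                total[key] += 1
    return joint, total

def p_smoothed(f, hist, joint, total):
    terms = []
    for r in range(1, 4):
        for subset in combinations(range(3), r):  # 7 sub-histories in all
            key = (subset, tuple(hist[i] for i in subset))
            if total[key]:
                terms.append(joint[key][f] / total[key])  # P~(f | sub-history)
    return sum(terms) / len(terms) if terms else 0.0      # uniform lambdas

data = [(("the", "DT", "NN"), "VB"), (("the", "DT", "NN"), "NN"),
        (("a", "DT", "NN"), "VB")]
joint, total = train(data)
print(p_smoothed("VB", ("the", "DT", "NN"), joint, total))  # ~0.571
```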

Page 9: Statistical Decision-Tree Models for Parsing


Continued

decision-tree model:

A decision-tree model can be represented by an interpolated n-gram:

$$P(f \mid h_1 h_2 \ldots h_n) = \sum_{k} \lambda_k(h_1 h_2 \ldots h_n)\, P(f \mid h_{k_1} h_{k_2} \ldots h_{k_m}), \qquad m \le n,\ k_i \le n$$

$$\lambda_k(h_1 h_2 \ldots h_n) = \begin{cases} 1 & \text{if } h_{k_1} h_{k_2} \ldots h_{k_m} \text{ is a leaf} \\ 0 & \text{otherwise} \end{cases}$$

where each $h_{k_i}$ is the answer to one of the questions asked on the path from the root to the leaf.

Page 10: Statistical Decision-Tree Models for Parsing


Continued

Why use a decision tree?

As n grows, the parameter space for an n-gram model grows exponentially.

On the other hand, the decision-tree learning algorithm increases the size of a model only as the training data allows.

So it can consider much more contextual information.

Page 11: Statistical Decision-Tree Models for Parsing


SPATTER Parsing

SPATTER Representation

Parse: as a geometric pattern

4 features in each node: words, tags, labels, and extensions (Figure 3)

The Parsing Algorithm

Starting with the sentence's words as leaves (Figure 3)

Gradually tagging, labeling, and extending nodes

Constraints: bottom-up, left-to-right; no new node is constructed until its children are completed; using derivational window constraints (DWC), the number of active nodes is restricted

A single-rooted, labeled tree is constructed
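A heavily simplified, greedy sketch of this loop; the stub models, strictly binary nodes, and absence of the derivational window constraint and of any search are illustrative choices, not the paper's algorithm:

```python
# Start from the words as leaves, tag them, then repeatedly build, label,
# and extend nodes bottom-up, left-to-right until a single root remains.
class ParseNode:
    def __init__(self, word=None, children=()):
        self.word, self.children = word, list(children)
        self.tag = self.label = self.extension = None

def parse(words, tag_model, label_model, extend_model):
    nodes = [ParseNode(word=w) for w in words]
    for node in nodes:                    # tag the leaves first
        node.tag = tag_model(node, nodes)
    while len(nodes) > 1:                 # grow until single-rooted
        # greedy: always combine the leftmost pair of completed nodes
        parent = ParseNode(children=nodes[:2])
        parent.label = label_model(parent, nodes)
        parent.extension = extend_model(parent, nodes)
        nodes = [parent] + nodes[2:]
    return nodes[0]

# Trivial stub models, for illustration only.
tree = parse("the dog barks".split(),
             tag_model=lambda n, ns: "VB" if n.word == "barks" else "NN",
             label_model=lambda n, ns: "NP" if len(ns) > 2 else "S",
             extend_model=lambda n, ns: "up")
print(tree.label)  # S
```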

Page 12: Statistical Decision-Tree Models for Parsing


Statistical Parsing Models

The Tagging Model:

$$P(t_i \mid \mathrm{context}) = P(t_i \mid w_i\, w_{i-1}\, w_{i-2}\, w_{i+1}\, w_{i+2},\ t_{i-1}\, t_{i-2}\, t_{i+1}\, t_{i+2},\ N_{k-1}\, N_{k-2}\, N_{k+1}\, N_{k+2})$$

The Extension Model:

$$P(N_k^e \mid \mathrm{context}) = P(N_k^e \mid N_k^w,\ N_k^t,\ N_k^l,\ N_{k-1}\, N_{k-2}\, N_{k+1}\, N_{k+2},\ N_{c_1}\, N_{c_2}\, N_{c_3}\, N_{c_4})$$

The Label Model:

$$P(N_k^l \mid \mathrm{context}) = P(N_k^l \mid Q,\ N_{k-1}\, N_{k-2}\, N_{k+1}\, N_{k+2},\ N_{c_1}\, N_{c_2}\, N_{c_3}\, N_{c_4})$$

The Derivation Model:

$$P(\mathrm{active} \mid \mathrm{context}) = P(\mathrm{active} \mid Q,\ N_k,\ N_{k-1}\, N_{k-2}\, N_{k+1}\, N_{k+2})$$

The Parsing Model:

$$P(T, d \mid S) = \prod_{j=1}^{|d|} P(\mathrm{active} = N_{x_j} \mid \mathrm{context}(d_j))\, P(N_{x_j} \mid \mathrm{context}(d_j)), \qquad N_{x_j} \in T$$

$$P(T \mid S) = \sum_{d} P(T, d \mid S)$$
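A small sketch of how these combine: each derivation step contributes a derivation-model factor and a feature-model (tagging, label, or extension) factor, and P(T|S) sums over the derivations of T. The probabilities are placeholders, not real model outputs:

```python
import math

# steps: (P(active node | context), P(feature value | context)) per step.
def derivation_prob(steps):
    return math.prod(p_active * p_value for p_active, p_value in steps)

# P(T|S) sums P(T, d|S) over the derivations d that yield the tree T.
def tree_prob(derivations_of_T):
    return sum(derivation_prob(d) for d in derivations_of_T)

# Two hypothetical derivations of the same tree T:
print(tree_prob([[(0.6, 0.9), (0.8, 0.7)],
                 [(0.4, 0.9), (0.7, 0.7)]]))  # ~0.4788
```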

Page 13: Statistical Decision-Tree Models for Parsing


Decision-Tree Growing & Smoothing

3 main models (tagging, extension, and label)

Dividing the training corpus into two sets: 90% for growing, 10% for smoothing

Growing & smoothing algorithm: Figure 3.5
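A minimal sketch of the corpus split, assuming sentence-level shuffling; the shuffle, seed, and rounding are illustrative choices:

```python
import random

# Split the training corpus: ~90% grows the decision trees, the held-out
# ~10% is used to estimate the smoothing parameters.
def split_corpus(sentences, grow_ratio=0.9, seed=0):
    shuffled = sentences[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * grow_ratio)
    return shuffled[:cut], shuffled[cut:]   # (growing set, smoothing set)

grow_set, smooth_set = split_corpus([f"sentence {i}" for i in range(30_800)])
print(len(grow_set), len(smooth_set))  # 27720 3080
```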

Page 14: Statistical Decision-Tree Models for Parsing


Decision-Tree Training

The parsing model cannot be estimated by direct frequency counts because the model contains a hidden component: the derivation model.

The corpus contains no information about the order of derivations, so the training process must discover which derivations assign higher probability to the parses.

Forward-backward reestimation is used.

Page 15: Statistical Decision-Tree Models for Parsing


Continued

Training Algorithm

$s$: a state; $\bar{s}$: a state which precedes $s$ in the state lattice

$f(\bar{s}, s)$: the feature value assignment made to get from state $\bar{s}$ to state $s$

Expected counts, from the forward scores $\alpha$ and backward scores $\beta$ of the lattice:

$$\mathrm{count}(f(\bar{s}, s)) = \frac{\alpha(\bar{s})\,\beta(s)\,P(f(\bar{s}, s) \mid \bar{s})}{\alpha(s_{\mathrm{goal}})}$$

Reestimated leaf distributions, summing over the state pairs $(\bar{s}_h, s_h)$ whose context reaches the leaf $h$:

$$p_{\mathrm{new}}(f \mid h) = \frac{\sum_{\bar{s}_h,\, s_h} \mathrm{count}(f(\bar{s}_h, s_h))}{\sum_{\bar{s}_h,\, s'_h} \mathrm{count}(f(\bar{s}_h, s'_h))}$$
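A toy forward-backward pass over a small lattice, showing how expected counts of feature-value assignments are accumulated from the α and β scores; the states, feature values, and probabilities are invented:

```python
from collections import defaultdict

# Each edge (s_prev, s, f, p) carries a feature-value assignment f with
# current model probability p = P(f | s_prev). Edges are listed in
# topological order.
edges = [
    ("start", "a", "f1", 0.6), ("start", "b", "f2", 0.4),
    ("a", "goal", "f2", 1.0), ("b", "goal", "f1", 1.0),
]

alpha = defaultdict(float, {"start": 1.0})
for s_prev, s, f, p in edges:              # forward pass
    alpha[s] += alpha[s_prev] * p

beta = defaultdict(float, {"goal": 1.0})
for s_prev, s, f, p in reversed(edges):    # backward pass
    beta[s_prev] += beta[s] * p

counts = defaultdict(float)
for s_prev, s, f, p in edges:              # expected counts per feature value
    counts[f] += alpha[s_prev] * p * beta[s] / alpha["goal"]

# Both derivations use one f1 and one f2, so each expected count is 1.0.
print(dict(counts))  # {'f1': 1.0, 'f2': 1.0}
```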

Page 16: Statistical Decision-Tree Models for Parsing


Experiment Results

IBM Computer Manual

annotated by the University of Lancaster

195 part-of-speech tags and 19 non-terminal labels

trained on 30,800 sentences, and tested on 1,473 new sentences

0-crossing-brackets score:

IBM's rule-based, unification-style PCFG parser: 69%

SPATTER: 76%

Page 17: Statistical Decision-Tree Models for Parsing


Continued

Wall Street Journal

To test the ability to accurately parse a highly ambiguous, large-vocabulary domain

Annotated in the Penn Treebank, version 2

46 part-of-speech tags, and 27 non-terminal labels

Trained on 40,000 sentences, and tested on 1,920 new sentences

Using PARSEVAL: Crossing Brackets, Recall, and Precision

$$\text{Recall} = \frac{\text{no. of correct constituents in SPATTER parse}}{\text{no. of constituents in treebank parse}}$$

$$\text{Precision} = \frac{\text{no. of correct constituents in SPATTER parse}}{\text{no. of constituents in SPATTER parse}}$$
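A short sketch of the precision and recall computations on labeled constituent spans; the spans below are toy values:

```python
# PARSEVAL-style precision and recall for one sentence, with constituents
# represented as (label, start, end) spans.
def parseval(spatter, treebank):
    correct = len(set(spatter) & set(treebank))
    precision = correct / len(spatter)   # correct / constituents in SPATTER parse
    recall = correct / len(treebank)     # correct / constituents in treebank parse
    return precision, recall

spatter_parse = [("NP", 0, 2), ("VP", 2, 4), ("S", 0, 4)]
treebank_parse = [("NP", 0, 2), ("VP", 1, 4), ("S", 0, 4)]
print(parseval(spatter_parse, treebank_parse))  # (0.666..., 0.666...)
```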

Page 18: Statistical Decision-Tree Models for Parsing


Conclusion

Large amounts of contextual information can be incorporated into a statistical model for parsing by applying decision-tree learning algorithms.

Automatically discovering disambiguation rules is possible.