Probabilistic and Lexicalized Parsing
Probabilistic CFGs

• Weighted CFGs
– Attach weights to the rules of a CFG
– Compute weights of derivations
– Use weights to pick preferred parses
• Utility: pruning and ordering the search space, disambiguation, language model for ASR
• Parsing with weighted grammars (like weighted FAs): T* = argmax_T W(T, S)
• Probabilistic CFGs are one form of weighted CFGs.
Probability Model

• Rule probability:
– Attach probabilities to grammar rules
– Expansions for a given non-terminal sum to 1
R1: VP → V        .55
R2: VP → V NP     .40
R3: VP → V NP NP  .05
– Estimate the probabilities from annotated corpora: P(R1) = count(R1) / count(VP)
• Derivation probability:
– Derivation T = {R1 … Rn}
– Probability of a derivation
– Most probable parse
– Probability of a sentence
• Sum over all possible derivations for the sentence
• Note the independence assumption: a rule's probability does not change based on where in the derivation the rule is expanded.
P(T) = ∏_{i=1..n} P(R_i)

T* = argmax_T P(T)

P(S) = Σ_T P(T, S)
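A minimal sketch of this probability model in Python, using the rule probabilities from the slide above (the string encoding of rules is our own):

```python
# Minimal sketch of the PCFG probability model.
# Rule probabilities are from the slide; in practice they are
# estimated from a treebank: P(R) = count(R) / count(LHS non-terminal).
rule_prob = {
    "VP -> V":       0.55,   # R1
    "VP -> V NP":    0.40,   # R2
    "VP -> V NP NP": 0.05,   # R3  (expansions of VP sum to 1)
}

def derivation_prob(rules):
    """P(T) = product of P(R_i) over the rules R_i used in derivation T."""
    p = 1.0
    for r in rules:
        p *= rule_prob[r]
    return p

print(derivation_prob(["VP -> V NP"]))  # 0.40
```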
Structural Ambiguity

• S → NP VP
• VP → V NP
• NP → NP PP
• VP → VP PP
• PP → P NP
• NP → John | Mary | Denver
• V → called
• P → from

John called Mary from Denver
Parse 1 (PP attached to the VP):
(S (NP John) (VP (VP (V called) (NP Mary)) (PP (P from) (NP Denver))))

Parse 2 (PP attached to the NP):
(S (NP John) (VP (V called) (NP (NP Mary) (PP (P from) (NP Denver)))))
Cocke-Younger-Kasami Parser
• Bottom-up parser with top-down filtering
• Start state(s): (A, i, i+1) for each A → w_{i+1}
• End state: (S, 0, n), where n is the input size
• Next-state rule:
– (B, i, k), (C, k, j) ⇒ (A, i, j) if A → B C
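A minimal CKY recognizer following these state rules (the grammar encoding and function name are our own; the grammar must be in Chomsky normal form):

```python
from collections import defaultdict

def cky_recognize(words, lexical, binary, start="S"):
    """CKY recognition for a CNF grammar.
    lexical: word -> set of A such that A -> word
    binary:  (B, C) -> set of A such that A -> B C
    A chart entry (A, i, j) means A derives words[i:j]."""
    n = len(words)
    chart = defaultdict(set)                    # (i, j) -> non-terminals
    for i, w in enumerate(words):               # start states (A, i, i+1)
        chart[i, i + 1] |= lexical.get(w, set())
    for span in range(2, n + 1):                # fill longer spans bottom-up
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):           # (B,i,k) + (C,k,j) => (A,i,j)
                for B in chart[i, k]:
                    for C in chart[k, j]:
                        chart[i, j] |= binary.get((B, C), set())
    return start in chart[0, n]                 # end state (S, 0, n)

# The grammar from the structural-ambiguity slide:
lexical = {"John": {"NP"}, "Mary": {"NP"}, "Denver": {"NP"},
           "called": {"V"}, "from": {"P"}}
binary = {("NP", "VP"): {"S"}, ("V", "NP"): {"VP"}, ("NP", "PP"): {"NP"},
          ("VP", "PP"): {"VP"}, ("P", "NP"): {"PP"}}
print(cky_recognize("John called Mary from Denver".split(), lexical, binary))  # True
```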
Example
John called Mary from Denver
Base case: A → w
After the base case, the chart diagonal holds:
NP(John, 0-1), V(called, 1-2), NP(Mary, 2-3), P(from, 3-4), NP(Denver, 4-5)
Recursive cases: A → BC
The chart then fills in order of span length (X marks spans that yield no constituent):

– Span 2: [0,2] = X; VP(called Mary) from V + NP; [2,4] = X; PP(from Denver) from P + NP
– Span 3: S(John called Mary) from NP + VP; [1,4] = X; NP(Mary from Denver) from NP + PP
– Span 4: [0,4] = X; over "called Mary from Denver": VP1 from VP + PP and VP2 from V + NP(Mary from Denver)
– Span 5: S(John called Mary from Denver) from NP + VP

Final chart (cell [i,j] lists constituents spanning words i+1 .. j):

         1          2           3          4          5
0   NP(John)       X           S          X          S
1              V(called)      VP          X       VP1, VP2
2                          NP(Mary)       X         NP
3                                      P(from)      PP
4                                               NP(Denver)
Probabilistic CKY

• Assign probabilities to constituents as they are completed and placed in the table
• Computing the probability:
– Since we are interested in the max P(S, 0, n), use the max probability for each constituent
• Maintain back-pointers to recover the parse (see the sketch after the formulas).
For each rule A → BC and split point k:

P(A→BC, i, j) = P(B, i, k) × P(C, k, j) × P(A → BC)

P(A, i, j) = max over rules A → BC and split points k of P(A→BC, i, j)
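A sketch of probabilistic CKY implementing the formulas above, with back-pointers (the dict encodings and names are ours; unary rules and smoothing are omitted):

```python
def pcky_parse(words, lexical, binary, start="S"):
    """Probabilistic CKY.
    lexical: word -> {A: P(A -> word)}
    binary:  (B, C) -> {A: P(A -> B C)}
    Returns (max probability, best parse tree) for the start symbol."""
    n = len(words)
    # best[i][j]: non-terminal -> (max prob, back-pointer)
    best = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        for A, p in lexical.get(w, {}).items():
            best[i][i + 1][A] = (p, w)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for B, (pb, _) in best[i][k].items():
                    for C, (pc, _) in best[k][j].items():
                        for A, pr in binary.get((B, C), {}).items():
                            p = pr * pb * pc   # P(A->BC) * P(B,i,k) * P(C,k,j)
                            if p > best[i][j].get(A, (0.0, None))[0]:
                                best[i][j][A] = (p, (k, B, C))   # keep the max
    if start not in best[0][n]:
        return 0.0, None
    def build(A, i, j):                        # follow back-pointers
        _, back = best[i][j][A]
        if isinstance(back, str):
            return (A, back)
        k, B, C = back
        return (A, build(B, i, k), build(C, k, j))
    return best[0][n][start][0], build(start, 0, n)
```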
Problems with PCFGs

• The probability model we're using is based only on the rules in the derivation.
• Lexical insensitivity:
– Doesn't use the words in any real way
– But structural disambiguation is lexically driven
• PP attachment often depends on the verb, its object, and the preposition
• I ate pickles with a fork.
• I ate pickles with relish.
• Context insensitivity of the derivation:
– Doesn't take into account where in the derivation a rule is used
• Pronouns are more often subjects than objects
• She hates Mary.
• Mary hates her.
• Solution: lexicalization
– Add lexical information to each rule
An example of lexical information: Heads
• Make use of the notion of the head of a phrase:
– The head of an NP is its noun
– The head of a VP is its main verb
– The head of a PP is its preposition
• The LHS of each rule in the PCFG carries a lexical item (its head)
• Each RHS non-terminal carries a lexical item
– One of the RHS lexical items (the head child's) is shared with the LHS
• If R is the number of binary branching rules in the CFG and ∑ is the vocabulary, the lexicalized CFG has O(2·|∑|·|R|) rules
• Unary rules: O(|∑|·|R|)
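A toy sketch of head percolation over bracketed trees (the head table, the tree-as-tuples encoding, and the PP object "bins" are our own illustrative choices; real parsers use much richer head-finding rules):

```python
# Toy head table: which child category supplies the head word.
HEAD_CHILD = {"NP": "N", "VP": "V", "PP": "P", "S": "VP"}

def lexicalize(tree):
    """Annotate every node label with its head word.
    Leaves are (pos, word); internal nodes are (label, child, ...).
    Toy assumption: sibling labels are distinct."""
    if isinstance(tree[1], str):                      # leaf (pos, word)
        pos, word = tree
        return f"{pos}({word})", word
    label, children = tree[0], tree[1:]
    lexed, head_of = [], {}
    for child in children:
        sub, h = lexicalize(child)
        lexed.append(sub)
        head_of[child[0]] = h                         # child label -> its head word
    # The head word percolates up from the designated head child
    # (fall back to the first child if the label is not in the table).
    head = head_of.get(HEAD_CHILD.get(label), next(iter(head_of.values())))
    return (f"{label}({head})", *lexed), head

tree = ("VP", ("V", "dumped"), ("NP", ("N", "sacks")),
        ("PP", ("P", "in"), ("NP", ("N", "bins"))))
print(lexicalize(tree)[0])
# ('VP(dumped)', 'V(dumped)', ('NP(sacks)', 'N(sacks)'), ('PP(in)', 'P(in)', ('NP(bins)', 'N(bins)')))
```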
Example (correct parse)
[lexicalized parse-tree figure not preserved in the transcript]

Attribute grammar
[figure not preserved in the transcript]

Example (less preferred)
[lexicalized parse-tree figure not preserved in the transcript]
Computing Lexicalized Rule Probabilities

• We started with rule probabilities:
– VP → V NP PP    P(rule | VP)
• E.g., the count of this rule divided by the number of VPs in a treebank
• Now we want lexicalized probabilities:
– VP(dumped) → V(dumped) NP(sacks) PP(in)
– P(rule | VP ∧ dumped is the verb ∧ sacks is the head of the NP ∧ in is the head of the PP)
– Not likely to have significant counts in any treebank (see the sketch below)
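A sketch of the count-based estimate and why it is sparse (all counts here are made up for illustration):

```python
from collections import Counter

# Hypothetical treebank counts (illustrative, not from a real corpus).
rule_head_counts = Counter({
    ("VP -> V NP PP", "dumped", "sacks", "in"): 1,   # fully lexicalized event
})
vp_head_counts = Counter({"dumped": 9})              # VPs headed by "dumped"

def lexicalized_prob(rule, verb, np_head, pp_head):
    """MLE sketch of P(rule, NP head, PP head | VP headed by verb).
    With this many conditioning words most counts are 0 or 1, so real
    systems decompose the event and back off (e.g., condition on the
    verb alone) rather than using the raw ratio below."""
    denom = vp_head_counts[verb]
    return rule_head_counts[(rule, verb, np_head, pp_head)] / denom if denom else 0.0

print(lexicalized_prob("VP -> V NP PP", "dumped", "sacks", "in"))  # 1/9 ≈ 0.111
```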
Another Example

• Consider the VPs:
– ate spaghetti with gusto
– ate spaghetti with marinara
• The disambiguating dependency is not a mother-child relation in the tree.
"ate spaghetti with gusto" (PP attaches to the VP):
(VP(ate) (VP(ate) (V ate) (NP spaghetti)) (PP(with) with gusto))

"ate spaghetti with marinara" (PP attaches to the NP):
(VP(ate) (V ate) (NP(spaghetti) (NP spaghetti) (PP(with) with marinara)))
Log-linear Models for Parsing

• Why restrict the conditioning to the elements of a rule?
– Use even larger context
– Word sequence, word types, sub-tree context, etc.
• In general, compute P(y|x), where each feature function f_i(x, y) tests a property of the context and λ_i is the weight of that feature.
• Use these as scores in the CKY algorithm to find the best-scoring parse.
P(y | x) = exp(Σ_i λ_i · f_i(x, y)) / Σ_{y' ∈ Y} exp(Σ_i λ_i · f_i(x, y'))
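A minimal sketch of this computation with made-up PP-attachment features and weights (real weights are learned from data):

```python
import math

def loglinear_prob(x, y, candidates, features, weights):
    """P(y|x) = exp(sum_i lambda_i * f_i(x, y)) /
               sum_{y' in Y} exp(sum_i lambda_i * f_i(x, y'))"""
    def score(cand):
        return sum(lam * f(x, cand) for f, lam in zip(features, weights))
    z = sum(math.exp(score(c)) for c in candidates)   # normalizer over Y
    return math.exp(score(y)) / z

# Toy PP-attachment features; the weights here are invented.
features = [
    lambda x, y: 1.0 if y == "np-attach" and x["pp_obj"] == "marinara" else 0.0,
    lambda x, y: 1.0 if y == "vp-attach" and x["pp_obj"] == "gusto" else 0.0,
]
weights = [1.5, 2.0]
x = {"verb": "ate", "pp_obj": "marinara"}
print(loglinear_prob(x, "np-attach", ["np-attach", "vp-attach"], features, weights))
# e^1.5 / (e^1.5 + e^0) ≈ 0.82
```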
Supertagging: Almost parsing
Poachers now control the underground trade
[Figure: each word is paired with its candidate supertags (elementary trees encoding the word's local syntactic context):
– poachers: an NP tree (NP (N poachers)) and clause trees placing the NP as subject (S (NP (N poachers)) VP …)
– now: adverbial trees adjoining to S or VP, from the left or the right
– control: transitive clause trees (S NP (VP (V control) NP)), in several variants
– the: a determiner tree (NP (Det the) NP)
– underground: adjectival modifier trees (N (Adj underground) N)
– trade: noun trees (N trade), compound-noun trees (N (N N) trade), and NP trees
Selecting one supertag per word nearly determines the parse, hence "almost parsing".]
Summary

• Parsing context-free grammars
– Top-down and bottom-up parsers
– Mixed approaches (CKY, Earley parsers)
• Preferences over parses using probabilities
– Parsing with PCFG and PCKY algorithms
• Enriching the probability model
– Lexicalization
– Log-linear models for parsing