a statistical approach to anaphora resolution
8/6/2019 A Statistical Approach to Anaphora Resolution
http://slidepdf.com/reader/full/a-statistical-approach-to-anaphora-resolution 1/10
A Statistical Approach to Anaphora Resolution
Niyu Ge, John Hale and Eugene Charniak
Dept. of Computer Science, Brown University,
[ n g e I h [ c ] ~ c s . b r o w n , e d u
Abstract
This paper presents an algorithm for identifying pronominal anaphora and
two experiments based upon this algorithm. We incorporate multiple anaphora
resolution factors into a statistical framework, specifically the distance
between the pronoun and the proposed antecedent, gender/number/animaticity
of the proposed antecedent, governing head information and noun phrase
repetition. We combine them into a single probability that enables us to
identify the referent. Our first experiment shows the relative contribution
of each source of information and demonstrates a success rate of 82.9% for
all sources combined. The second experiment investigates a method for
unsupervised learning of gender/number/animaticity information. We present
some experiments illustrating the accuracy of the method and note that with
this information added, our pronoun resolution method achieves 84.2%
accuracy.
1 Introduction
We present a statistical method for determining pronoun anaphora. This
program differs from earlier work in its almost complete lack of
hand-crafting, relying instead on a very small corpus of Penn Wall Street
Journal Tree-bank text (Marcus et al., 1993) that has been marked with
co-reference information. The first sections of this paper describe this
program: the probabilistic model behind it, its implementation, and its
performance.
The second half of the paper describes a method for using (portions of) the
aforementioned program to learn automatically the typical gender of English
words, information that is itself used in the pronoun resolution program.
In particular, the scheme infers the gender of a referent from the gender
of the pronouns that refer to it and selects referents using the pronoun
anaphora program. We present some typical results as well as the more
rigorous results of a blind evaluation of its output.
2 A Probabilistic Model
There are many factors, both syntactic and semantic, upon which a pronoun
resolution system relies. (Mitkov (1997) does a detailed study on factors
in anaphora resolution.) We first discuss the training features we use and
then derive the probability equations from them.
The first piece of useful information we consider is the distance between
the pronoun and the candidate antecedent. Obviously the greater the
distance the lower the probability. Secondly, we look at the syntactic
situation in which the pronoun finds itself. The most well studied
constraints are those involving reflexive pronouns. One classical approach
to resolving pronouns in text that takes some syntactic factors into
consideration is that of Hobbs (1976). This algorithm searches the parse
tree in a left-to-right, breadth-first fashion that obeys the major
reflexive pronoun constraints while giving a preference to antecedents that
are closer to the pronoun. In resolving inter-sentential pronouns, the
algorithm searches the previous sentence, again in left-to-right,
breadth-first order. This implements the observed preference for subject
position antecedents.
Next, the actual words in a proposed noun-phrase antecedent give us
information regarding the gender, number, and animaticity of the proposed
referent. For example:
Marie Giraud carries historical significance as one of the last women to
be executed in France. She became an abortionist because it enabled her to
buy jam, cocoa and other war-rationed goodies.
Here it is helpful to recognize that "Marie" is probably female and thus is
unlikely to be referred to by "he" or "it". Given the words in the proposed
antecedent we want to find the probability that it is the referent of the
pronoun in question. We collect these probabilities on the training data,
which are marked with reference links. The words in the antecedent
sometimes also let us test for number agreement. Generally, a singular
pronoun cannot refer to a plural noun phrase, so that in resolving such a
pronoun any plural candidates should be ruled out. However, a singular noun
phrase can be the referent of a plural pronoun, as illustrated by the
following example:
"I think if I tell Viacom I need more time, they will take 'Cosby' across
the street," says the general manager of a network affiliate.
It is also useful to note the interaction between the head constituent of
the pronoun p and the antecedent. For example:
A Japanese company might make television picture tubes in Japan, assemble
the TV sets in Malaysia and export them to Indonesia.
Here we would compare the degree to which each possible candidate
antecedent (A Japanese company, television picture tubes, Japan, TV sets,
and Malaysia in this example) could serve as the direct object of "export".
These probabilities give us a way to implement selectional restriction. A
canonical example of selectional restriction is that of the verb "eat",
which selects food as its direct object. In the case of "export" the
restriction is not as clear-cut. Nevertheless it can still give us guidance
on which candidates are more probable than others.
The last factor we consider is referents' mention count. Noun phrases that
are mentioned repeatedly are preferred. The training corpus is marked with
the number of times a referent has been mentioned up to that point in the
story. Here we are concerned with the probability that a proposed
antecedent is correct given that it has been repeated a certain number of
times. In effect, we use this probability information to identify the topic
of the segment with the belief that the topic is more likely to be referred
to by a pronoun. The idea is similar to that used in the centering approach
(Brennan et al., 1987) where a continued topic is the highest-ranked
candidate for pronominalization.
Given the above possible sources of information, we arrive at the following
equation, where $F(p)$ denotes a function from pronouns to their
antecedents:

$$F(p) = \arg\max_a P(A(p) = a \mid p, h, \vec{W}, t, l, s_p, \vec{d}, \vec{M})$$

where $A(p)$ is a random variable denoting the referent of the pronoun $p$
and $a$ is a proposed antecedent. In the conditioning events, $h$ is the
head constituent above $p$, $\vec{W}$ is the list of candidate antecedents
to be considered, $t$ is the type of phrase of the proposed antecedent
(always a noun-phrase in this study), $l$ is the type of the head
constituent, $s_p$ describes the syntactic structure in which $p$ appears,
$\vec{d}$ specifies the distance of each antecedent from $p$ and $\vec{M}$
is the number of times the referent is mentioned. Note that $\vec{W}$,
$\vec{d}$ and $\vec{M}$ are vector quantities in which each entry
corresponds to a possible antecedent. When viewed in this way, $a$ can be
regarded as an index into these vectors that specifies which value is
relevant to the particular choice of antecedent.
This equation is decomposed into pieces that correspond to all the above
factors but are more statistically manageable. The decomposition makes use
of Bayes' theorem and is based on certain independence assumptions
discussed below.

$$
\begin{aligned}
& P(A(p) = a \mid p, h, \vec{W}, t, l, s_p, \vec{d}, \vec{M}) \\
&= \frac{P(a \mid \vec{M})\, P(p, h, \vec{W}, t, l, s_p, \vec{d} \mid a, \vec{M})}{P(p, h, \vec{W}, t, l, s_p, \vec{d} \mid \vec{M})} && (1) \\
&\propto P(a \mid \vec{M})\, P(p, h, \vec{W}, t, l, s_p, \vec{d} \mid a, \vec{M}) && (2) \\
&= P(a \mid \vec{M})\, P(s_p, \vec{d} \mid a, \vec{M})\, P(p, h, \vec{W}, t, l \mid a, \vec{M}, s_p, \vec{d}) && (3) \\
&= P(a \mid \vec{M})\, P(s_p, \vec{d} \mid a, \vec{M})\, P(h, t, l \mid a, \vec{M}, s_p, \vec{d})\, P(p, \vec{W} \mid a, \vec{M}, s_p, \vec{d}, h, t, l) && (4) \\
&\propto P(a \mid \vec{M})\, P(s_p, \vec{d} \mid a, \vec{M})\, P(p, \vec{W} \mid a, \vec{M}, s_p, \vec{d}, h, t, l) && (5) \\
&= P(a \mid \vec{M})\, P(s_p, \vec{d} \mid a, \vec{M})\, P(\vec{W} \mid a, \vec{M}, s_p, \vec{d}, h, t, l)\, P(p \mid a, \vec{M}, s_p, \vec{d}, h, t, l, \vec{W}) && (6) \\
&\propto P(a \mid m_a)\, P(d_H \mid a)\, P(\vec{W} \mid h, t, l, a)\, P(p \mid w_a) && (7)
\end{aligned}
$$
Equation (1) is simply an application of Bayes' rule. The denominator is
eliminated in the usual fashion, resulting in equation (2). Selectively
applying the chain rule results in equations (3) and (4). In equation (4),
the term $P(h, t, l \mid a, \vec{M}, s_p, \vec{d})$ is the same for every
antecedent and is thus removed. Equation (6) follows when we break the last
component of (5) into two probability distributions. In equation (7) we
make the following independence assumptions:
• Given a particular choice of the antecedent candidates, the distance is
  independent of distances of candidates other than the antecedent (and the
  distance to non-referents can be ignored):

  $$P(s_p, \vec{d} \mid a, \vec{M}) \propto P(s_p, d_a \mid a, M_a)$$

• The syntactic structure $s_p$ and the distance from the pronoun $d_a$ are
  independent of the number of times the referent is mentioned. Thus

  $$P(s_p, d_a \mid a, M_a) = P(s_p, d_a \mid a)$$

  Then we combine $s_p$ and $d_a$ into one variable $d_H$, Hobbs distance,
  since the Hobbs algorithm takes both the syntax and distance into account.

• The words in the antecedent depend only on the parent constituent $h$, the
  type of the words $t$, and the type of the parent $l$. Hence

  $$P(\vec{W} \mid a, \vec{M}, s_p, \vec{d}, h, t, l) = P(\vec{W} \mid h, t, l, a)$$

• The choice of pronoun depends only on the words in the antecedent, i.e.

  $$P(p \mid a, \vec{M}, s_p, \vec{d}, h, t, l, \vec{W}) = P(p \mid a, \vec{W})$$

• If we treat $a$ as an index into the vector $\vec{W}$, then
  $(a, \vec{W})$ is simply the $a$th candidate in the list $\vec{W}$. We
  assume the selection of the pronoun is independent of the candidates other
  than the antecedent. Hence

  $$P(p \mid a, \vec{W}) = P(p \mid w_a)$$
Since $\vec{W}$ is a vector, we need to normalize
$P(\vec{W} \mid h, t, l, a)$ to obtain the probability of each element in
the vector. It is reasonable to assume that the antecedents in $\vec{W}$
are independent of each other; in other words,
$P(w_{a+1} \mid w_a, h, t, l, a) = P(w_{a+1} \mid h, t, l, a)$. Thus,

$$P(\vec{W} \mid h, t, l, a) = \prod_{i=1}^{n} P(w_i \mid h, t, l, a)$$

where

$$P(w_i \mid h, t, l, a) = P(w_i \mid t) \quad \text{if } i \neq a$$

and

$$P(w_i \mid h, t, l, a) = P(w_a \mid h, t, l) \quad \text{if } i = a$$

Then we have,

$$P(\vec{W} \mid h, t, l, a) = P(w_1 \mid t) \cdots P(w_a \mid h, t, l) \cdots P(w_n \mid t)$$

To get the probability for each candidate, we divide the above product by
$P(w_1 \mid t) \cdots P(w_a \mid t) \cdots P(w_n \mid t)$, which is the same
for every choice of $a$:

$$
P(\vec{W} \mid h, t, l, a)
\propto \frac{P(w_1 \mid t) \cdots P(w_a \mid h, t, l) \cdots P(w_n \mid t)}{P(w_1 \mid t) \cdots P(w_a \mid t) \cdots P(w_n \mid t)}
= \frac{P(w_a \mid h, t, l)}{P(w_a \mid t)}
$$
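The cancellation in the last step can be checked numerically. The sketch below is only an illustration: the probabilities are invented, and the names `p_t`, `p_h` and `joint` are assumptions introduced here, not part of the original system. It builds the product for each choice of a and divides by the product of the P(w_i | t) terms.

```python
from math import isclose

# Toy values, invented for illustration only (not taken from the paper).
p_t = [0.1, 0.2, 0.4]   # p_t[i] stands for P(w_i | t)
p_h = [0.3, 0.1, 0.2]   # p_h[i] stands for P(w_i | h, t, l)

def joint(a):
    """P(W | h, t, l, a): candidate a uses p_h, every other candidate uses p_t."""
    product = 1.0
    for i in range(len(p_t)):
        product *= p_h[i] if i == a else p_t[i]
    return product

# Product of P(w_i | t) over all candidates; it does not depend on a.
baseline = 1.0
for value in p_t:
    baseline *= value

# After dividing, only the ratio for the chosen candidate survives.
ratios = [joint(a) / baseline for a in range(len(p_t))]
```

Each entry of `ratios` agrees with `p_h[a] / p_t[a]` up to floating-point rounding, which is the ratio that survives in the derivation above.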
Now we arrive at the final equation for computing the probability of each
proposed antecedent:

$$P(A(p) = w_a) \propto P(d_H \mid a)\, P(p \mid w_a)\, \frac{P(w_a \mid h, t, l)}{P(w_a \mid t, l)}\, P(a \mid m_a) \qquad (8)$$
We obtain $P(d_H \mid a)$ by running the Hobbs algorithm on the training
data. Since the training corpus is tagged with reference information, the
probability $P(p \mid w_a)$ is easily obtained. In building a statistical
parser for the Penn Tree-bank various statistics have been collected
(Charniak, 1997), two of which are $P(w_a \mid h, t, l)$ and
$P(w_a \mid t, l)$. To avoid the sparse-data problem, the heads $h$ are
clustered according to how they behave in $P(w_a \mid h, t, l)$. The
probability of $w_a$ is then computed on the basis of $h$'s cluster $c(h)$.
Our corpus also contains referents' repetition information, from which we
can directly compute $P(a \mid m_a)$. The four components in equation (8)
can be estimated in a reasonable fashion. The system computes this product
and returns the antecedent $w_a$ for a pronoun $p$ that maximizes this
probability. More formally, we want the program to return our antecedent
function $F(p)$, where

$$
\begin{aligned}
F(p) &= \arg\max_a P(A(p) = a \mid p, h, \vec{W}, t, l, s_p, \vec{d}, \vec{M}) \\
&= \arg\max_{w_a} P(d_H \mid a)\, P(p \mid w_a)\, \frac{P(w_a \mid h, t, l)}{P(w_a \mid t, l)}\, P(a \mid m_a)
\end{aligned}
$$
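A minimal sketch of how the four factors might be combined at resolution time is given below. It is not the authors' implementation: the lookup-table layout, the candidate dictionaries, the function names `score` and `resolve`, and all numbers are assumptions made for illustration. The `head` argument stands in for the full context (h, t, l).

```python
def score(candidate, pronoun, head, tables):
    """Product of the four factors of equation (8) for one candidate."""
    word = candidate["word"]
    p_dist = tables["hobbs"].get(candidate["hobbs_distance"], 0.0)    # P(d_H | a)
    p_pron = tables["pronoun_given_word"].get((pronoun, word), 0.0)   # P(p | w_a)
    p_head = tables["word_given_head"].get((word, head), 0.0)         # P(w_a | h, t, l)
    p_prior = tables["word_prior"].get(word, 0.0)                     # P(w_a | t, l)
    p_ment = tables["mention"].get(candidate["mention_bucket"], 0.0)  # P(a | m_a)
    ratio = p_head / p_prior if p_prior > 0 else 0.0
    return p_dist * p_pron * ratio * p_ment

def resolve(pronoun, head, candidates, tables):
    """Return the candidate with the highest score (the arg max over a)."""
    return max(candidates, key=lambda c: score(c, pronoun, head, tables))

# Invented toy tables; real values would be estimated from the training corpus.
toy_tables = {
    "hobbs": {1: 0.5, 2: 0.3, 3: 0.2},
    "pronoun_given_word": {("she", "Marie"): 0.9, ("she", "France"): 0.05},
    "word_given_head": {("Marie", "became"): 0.02, ("France", "became"): 0.01},
    "word_prior": {"Marie": 0.01, "France": 0.02},
    "mention": {1: 0.2, 2: 0.6},
}
toy_candidates = [
    {"word": "France", "hobbs_distance": 1, "mention_bucket": 1},
    {"word": "Marie", "hobbs_distance": 2, "mention_bucket": 2},
]
best = resolve("she", "became", toy_candidates, toy_tables)
```

With these toy numbers "Marie" outscores "France" even though "France" is closer in Hobbs distance, because the gender and mention-count factors dominate. Unseen events are simply scored as zero here; a real system would need some treatment of unseen words.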
3 The Implementation
We use a small portion of the Penn Wall Street Journal Tree-bank as our
training corpus. From this data, we collect the three statistics detailed
in the following subsections.
3.0.1 The Hobbs algorithm
The Hobbs algorithm makes a few assumptions about the syntactic trees upon
which it operates that are not satisfied by the tree-bank trees that form
the substrate for our algorithm. Most notably, the Hobbs algorithm depends
on the existence of an $\bar{N}$ parse-tree node that is absent from the
Penn Tree-bank trees. We have implemented a slightly modified version of
Hobbs algorithm for the Tree-bank parse trees. We also transform our trees
under certain conditions to meet Hobbs' assumptions as much as possible. We
have not, however, been able to duplicate exactly the syntactic structures
assumed by Hobbs.
Once we have the trees in the proper form (to the degree this is possible)
we run Hobbs' algorithm repeatedly for each pronoun until it has proposed
$n$ (= 15 in our experiment) candidates. The $i$th candidate is regarded as
occurring at "Hobbs distance" $d_H = i$. Then the probability
$P(d_H = i \mid a)$ is simply:

$$P(d_H = i \mid a) = \frac{|\,\text{correct antecedent at Hobbs distance } i\,|}{|\,\text{correct antecedents}\,|}$$

We use $|x|$ to denote the number of times $x$ is observed in our training
set.
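The counting estimate above can be sketched as follows. This is only an illustration under stated assumptions: the input is taken to be a list holding, for each training pronoun, the Hobbs distance at which its correct antecedent was proposed, and the function name `hobbs_distance_probs` is hypothetical.

```python
from collections import Counter

def hobbs_distance_probs(correct_distances, n=15):
    """Estimate P(d_H = i | a) for i = 1..n by relative frequency.

    correct_distances: one entry per training pronoun, giving the Hobbs
    distance at which the correct antecedent was proposed (assumed input).
    """
    total = len(correct_distances)
    if total == 0:
        return {}
    counts = Counter(correct_distances)
    # Counter returns 0 for distances that were never observed.
    return {i: counts[i] / total for i in range(1, n + 1)}

probs = hobbs_distance_probs([1, 1, 2, 1, 3], n=3)
```

For the toy list above, three of the five correct antecedents sit at distance 1, so `probs[1]` is 0.6 and the other two distances each receive 0.2.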
3.1 The gender/animaticity statistics
After we have identified the correct antecedents it is a simple counting
procedure to compute $P(p \mid w_a)$ where $w_a$ is in the correct
antecedent for the pronoun $p$ (note the pronouns are grouped by their
gender):

$$P(p \mid w_a) = \frac{|\,w_a \text{ in the antecedent for } p\,|}{|\,w_a \text{ in the antecedent for any pronoun}\,|}$$

When there are multiple relevant words in the antecedent we apply the
likelihood test designed by Dunning (1993) on all the words in the
candidate NP. Given our limited data, the Dunning test tells which word is
the most informative, call it $w_i$, and we then use $P(p \mid w_i)$.
3.1.1 The mention count statistics
The referents range from being mentioned only once to being mentioned 120
times in the training examples. Instead of computing the probability for
each one of them we group them into "buckets", so that $m_a$ is the bucket
for the number of times that $a$ is mentioned. We also observe that the
position of a pronoun in a story influences the mention count of its
referent. In other words, the nearer the end of the story a pronoun occurs,
the more probable it is that its referent has been mentioned several times.
We measure position by the sentence number, $j$. The method to compute this
probability is:

$$P(a \mid m_a, j) = \frac{|\,a \text{ is antecedent}, m_a, j\,|}{|\,m_a, j\,|}$$

(We omitted $j$ from equations (1-7) to reduce the notational load.)
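A rough sketch of this bucketed estimate is shown below. The bucket boundaries in `bucket` are invented for illustration, since the text does not give them, and the observation format and function names are likewise assumptions.

```python
from collections import Counter

def bucket(mentions):
    """Map a raw mention count to a bucket index (hypothetical boundaries)."""
    if mentions <= 1:
        return 0
    if mentions <= 3:
        return 1
    if mentions <= 7:
        return 2
    return 3

def mention_probs(observations):
    """Estimate P(a | m_a, j) by relative frequency.

    observations: iterable of (mention_count, sentence_number, is_antecedent)
    triples, one per candidate seen in training (assumed input format).
    """
    seen, hits = Counter(), Counter()
    for mentions, sentence, is_antecedent in observations:
        key = (bucket(mentions), sentence)
        seen[key] += 1
        if is_antecedent:
            hits[key] += 1
    return {key: hits[key] / seen[key] for key in seen}

toy = [(1, 1, False), (1, 1, True), (5, 2, True), (5, 2, True), (10, 2, False)]
estimates = mention_probs(toy)
```

In practice the sentence number j would probably also need to be grouped into ranges to avoid sparse counts; the sketch keeps it raw for simplicity.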
3.2 Resolving Pronouns
After collecting the statistics on the training examples, we run the
program on the test data. For any pronoun we collect $n$ (= 15 in the
experiment) candidate antecedents proposed by Hobbs' algorithm. It is quite
possible that a word appears in the test data that the program never saw in
the training data and for which it hence has no $P(p \mid w_a)$
probability. In this case
erroneous candidates. Sparse data also causes a problem in this statistic.
Consequently, we observe a relatively small enhancement to the system.
The mention information gives the system some idea of the story's focus.
The more frequently an entity is repeated, the more likely it is to be the
topic of the story and thus to be a candidate for pronominalization. Our
results show that this is indeed the case. References by pronouns are
closely related to the topic or the center of the discourse. NP repetition
is one simple way of approximately identifying the topic. The more
accurately the topic of a segment can be identified, the higher the success
rate we expect an anaphora resolution system can achieve.
5 Unsupervised Learning of Gender Information
The importance of gender information as revealed in the previous
experiments caused us to consider automatic methods for estimating the
probability that nouns occurring in a large corpus of English text denote
inanimate, masculine or feminine things. The method described here is based
on simply counting co-occurrences of pronouns and noun phrases, and thus
can employ any method of analysis of the text stream that results in
referent/pronoun pairs (cf. (Hatzivassiloglou and McKeown, 1997) for
another application in which no explicit indicators are available in the
stream). We present two very simple methods for finding referent/pronoun
pairs, and also give an application of a salience statistic that can
indicate how confident we should be about the predictions the method makes.
Following this, we show the results of applying this method to the
21-million-word 1987 Wall Street Journal corpus using two different pronoun
reference strategies of varying sophistication, and evaluate their
performance using honorifics as reliable gender indicators.
The method is a very simple mechanism for harvesting the kind of gender
information present in discourse fragments like "Kim slept. She slept for a
long time." Even if Kim's gender was unknown before seeing the first
sentence, after the second sentence, it is known.
The probability that a referent is in a particular gender class is just the
relative frequency with which that referent is referred to by a pronoun $p$
that is part of that gender class. That is, the probability of a referent
$ref$ being in gender class $gc_i$ is

$$P(ref \in gc_i) = \frac{|\,\text{refs to } ref \text{ with } p \in gc_i\,|}{\sum_j |\,\text{refs to } ref \text{ with } p \in gc_j\,|} \qquad (9)$$
In this work we have considered only three gender classes, masculine,
feminine and inanimate, which are indicated by their typical pronouns, HE,
SHE, and IT. However, a variety of pronouns indicate the same class:

pronoun                    gender class
he, himself, him, his      HE
she, herself, her, hers    SHE
it, itself, its            IT

Plural pronouns like "they" and "us" reveal no gender information about
their referent and consequently aren't useful, although this might be a way
to learn pluralization in an unsupervised manner.
In order to gather statistics on the gender of referents in a corpus, there
must be some way of identifying the referents. In attempting to bootstrap
lexical information about referents' gender, we consider two strategies,
both completely blind to any kind of semantics.
One of the most naive pronoun reference strategies is the "previous noun"
heuristic. On the intuition that pronouns closely follow their referents,
this heuristic simply keeps track of the last noun seen and submits that
noun as the referent of any pronouns following. This strategy is certainly
simple-minded but, as noted earlier, it achieves an accuracy of 43%.
In the present system, a statistical parser is used (see (Charniak, 1997))
simply as a tagger. This apparent parser overkill is a control to ensure
that the part-of-speech tags assigned to words are the same when we use the
previous noun heuristic and the Hobbs algorithm, to which we wish to
compare the previous noun method. In fact, the only part-of-speech tags
necessary are those indicating nouns and pronouns.
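The "previous noun" heuristic can be sketched in a few lines. The version below assumes the input is a list of (word, tag) pairs using Penn Treebank part-of-speech tags (noun tags starting with `NN`, pronoun tags `PRP` and `PRP$`); the function name is hypothetical and the sketch is not the authors' code.

```python
def previous_noun_pairs(tagged_tokens):
    """Pair every pronoun with the most recent noun seen before it.

    tagged_tokens: list of (word, tag) pairs with Penn Treebank tags
    (an assumed input format).
    """
    pairs = []
    last_noun = None
    for word, tag in tagged_tokens:
        if tag.startswith("NN"):            # NN, NNS, NNP, NNPS
            last_noun = word
        elif tag in ("PRP", "PRP$") and last_noun is not None:
            pairs.append((word, last_noun))
    return pairs

tokens = [("Kim", "NNP"), ("slept", "VBD"), (".", "."),
          ("She", "PRP"), ("slept", "VBD"), (".", ".")]
pairs = previous_noun_pairs(tokens)
```

On the "Kim slept. She slept." fragment this yields the single pair `("She", "Kim")`. Pronouns that occur before any noun has been seen are skipped.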
Obviously a much superior strategy would be to apply the anaphora-resolution
strategy from previous sections to finding putative referents. However, we
chose to use only the Hobbs distance portion thereof. We do not use the
"mention" probabilities $P(a \mid m_a)$, as they are not given in the
unmarked text. Nor do we use the gender/animaticity information gathered
from the much smaller hand-marked text, both because we were interested in
seeing what unsupervised learning could accomplish, and because we were
concerned with inheriting strong biases from the limited hand-marked data.
Thus our second method of finding the pronoun/noun co-occurrences is simply
to parse the text and then assume that the noun-phrase at Hobbs distance
one is the antecedent.
Given a pronoun resolution method and a corpus, the result is a set of
pronoun/referent pairs. By collating by referent and abstracting away to
the gender classes of pronouns, rather than individual pronouns, we have
the relative frequencies with which a given referent is referred to by
pronouns of each gender class. We will say that the gender class for which
this relative frequency is the highest is the gender class to which the
referent most probably belongs.
However, any syntax-only pronoun resolution strategy will be wrong some of
the time: these methods know nothing about discourse boundaries,
intentions, or real-world knowledge. We would like to know, therefore,
whether the pattern of pronoun references that we observe for a given
referent is the result of our supposed "hypothesis about pronoun reference"
(that is, the pronoun reference strategy we have provisionally adopted in
order to gather statistics) or whether it is the result of some other
unidentified process.
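The collation step and equation (9) can be sketched as follows. The pronoun-to-class mapping follows the table given earlier in this section; the function names and the pair format are assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Pronoun to gender class, following the table in this section.
GENDER_CLASS = {
    "he": "HE", "himself": "HE", "him": "HE", "his": "HE",
    "she": "SHE", "herself": "SHE", "her": "SHE", "hers": "SHE",
    "it": "IT", "itself": "IT", "its": "IT",
}

def gender_distribution(pairs):
    """Relative frequency of each gender class per referent, as in equation (9).

    pairs: iterable of (pronoun, referent) strings (assumed input format).
    """
    counts = defaultdict(Counter)
    for pronoun, referent in pairs:
        gender = GENDER_CLASS.get(pronoun.lower())
        if gender is not None:      # plural pronouns carry no gender information
            counts[referent][gender] += 1
    return {
        ref: {g: n / sum(c.values()) for g, n in c.items()}
        for ref, c in counts.items()
    }

def most_probable_class(distribution):
    """Pick the gender class with the highest relative frequency per referent."""
    return {ref: max(probs, key=probs.get) for ref, probs in distribution.items()}

toy_pairs = [("She", "Kim"), ("her", "Kim"), ("he", "Kim"),
             ("it", "company"), ("they", "company")]
dist = gender_distribution(toy_pairs)
labels = most_probable_class(dist)
```

In the toy data "Kim" is referred to twice by SHE-class pronouns and once by a HE-class pronoun, so the estimate is 2/3 for SHE; the plural "they" is ignored.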
T h i s d e c i s i o n i s m a d e b y r a n k i n g t h e r e f e r -
e n t s b y l o g - li k e l ih o o d r a t i o , t e r m e d s a l i e n c e , f o r
e a c h r e f e r e n t . T h e l ik e l i h o o d r a t i o is a d a p t e df r o m D u n n i n g ( 1 9 9 3 , p a g e 6 6 ) a n d u s e s t h e r a w
f r e q u e n c i e s o f e a c h p r o n o u n c l a ss i n t h e c o r -
p u s a s t h e n u l l h y p o t h e s i s , P r ( g c 0 i ) a s w e l l a s
P r ( r e f E gci) f r o m e q u a t i o n 9 .
s a l i e n c e ( r e / )
= - 2 l og
Making the unrealistic simplifying assumption that references of one gender class are completely independent of references for other classes¹, the likelihood function in this case is just the product over all classes of the probabilities of each class of reference to the power of the number of observations of this class.
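As an illustration of the likelihood ratio described above, the following Python sketch computes a salience score from per-referent class counts. It is a sketch under stated assumptions rather than the authors' implementation: the function name `salience`, the corpus-wide class probabilities in `NULL_PROBS`, and the two example referents are all made up for illustration, and the natural logarithm is used because this section does not state a base.

```python
import math

# Assumed corpus-wide pronoun class frequencies (the null hypothesis).
# These values are invented for the example, not taken from the paper.
NULL_PROBS = {"HE": 0.36, "SHE": 0.04, "IT": 0.60}


def salience(counts, null_probs):
    """-2 * log of (likelihood under the null) / (likelihood under the
    referent's own relative frequencies). Each likelihood is a product
    over classes of probability ** number_of_observations."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("referent has no observed pronoun references")
    log_null = 0.0
    log_ref = 0.0
    for cls, n in counts.items():
        if n == 0:
            continue  # probability ** 0 contributes a factor of 1
        log_null += n * math.log(null_probs[cls])
        log_ref += n * math.log(n / total)
    return -2.0 * (log_null - log_ref)


# Hypothetical referents and counts, ranked by decreasing salience.
observed = {
    "company": {"HE": 5, "SHE": 0, "IT": 195},
    "spokesman": {"HE": 12, "SHE": 1, "IT": 8},
}
ranked = sorted(observed.items(),
                key=lambda item: salience(item[1], NULL_PROBS),
                reverse=True)
```

A referent whose counts mirror the null distribution scores zero, while one with many references concentrated in a single class scores high, which is why frequent and consistently referenced noun phrases rise to the top of the ranking.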
6 Evaluation
We ran the program on 21 million words of Wall Street Journal text. One can judge the program informally by simply examining the results and determining if the program's gender decisions are correct (occasionally looking at the text for difficult cases). Figure 1 shows the 43 noun phrases with the highest salience figures (run using the Hobbs algorithm). An examination of these shows that all but three are correct. (The three mistakes are "husband," "wife," and "years." We return to the significance of these mistakes later.)
As a measure of the utility of these results, we also ran our pronoun-anaphora program with these statistics added. This achieved an accuracy rate of 84.2%. This is only a small improvement over what was achieved without the data. We believe, however, that there are ways to improve the accuracy of the learning method and thus increase its influence on pronoun anaphora resolution.
Finally we attempted a fully automatic direct test of the accuracy of both pronoun methods for gender determination. To that end, we devised a more objective test, useful only for scoring the subset of referents that are names of people. In particular, we assume that any noun phrase with the honorifics "Mr.", "Mrs." or "Ms." may be confidently assigned to gender classes HE, SHE, and SHE, respectively. Thus we compute precision as follows:

$$\mathrm{precision} = \frac{|r \text{ attrib. as HE} \wedge \text{Mr.} \in r| + |r \text{ attrib. as SHE} \wedge \text{Mrs. or Ms.} \in r|}{|\text{Mr., Mrs., or Ms.} \in r|}$$

Here r varies over referent types, not tokens. The precision score computed over all phrases containing any of the target honorifics is 66.0%
¹ In effect, this is the same as admitting that a referent can be in different gender classes across different observations.

Word              Count  Salience  p(he)     p(she)  p(it)
COMPANY           7052   1629.39   0.0764    0.0060  0.9174
WOMAN             250    368.267   0.172     0.708   0.12
PRESIDENT         931    356.539   0.8206    0.0139  0.1654
GROUP             1096   287.319   0.0602    0.0054  0.9343
MR. REAGAN        534    270.8     0.882022  0.0037  0.1142
MAN               441    202.102   0.8480    0.0385  0.1133
PRESIDENT REAGAN  455    194.928   0.8439    0.0043  0.1516
GOVERNMENT        1220   194.187   0.1172    0.0122  0.8704
U.S.              969    188.468   0.1021    0.0041  0.8937
BANK              816    161.23    0.0955    0.0073  0.8970
MOTHER            113    161.204   0.3008    0.6548  0.0442
COL. NORTH        258    158.692   0.9263    0.0077  0.0658
MOODY             383    152.405   0.0078    0.0052  0.9869
SPOKESWOMAN       139    145.627   0.1223    0.5827  0.2949
MRS. AQUINO       73     142.223   0.0958    0.8356  0.0684
MRS. THATCHER     68     128.306   0.0735    0.8235  0.1029
GM                513    119.664   0.0779    0.0038  0.9181
PLAN              514    111.134   0.0856    0.0058  0.9085
MR. GORBACHEV     205    108.776   0.8926    0.0048  0.1024
JUDGE BORK        212    108.746   0.8820    0       0.1179
HUSBAND           91     107.438   0.3626    0.5714  0.0659
JAPAN             450    100.727   0.0755    0.0111  0.9133
AGENCY            476    97.4016   0.0840    0.0147  0.9012
WIFE              153    93.7485   0.6143    0.2875  0.0980
DOLLAR            621    90.8963   0.1304    0.0096  0.8599
STANDARD POOR     200    90.1062   0         0       1
FATHER            146    89.4178   0.8082    0.1438  0.0479
UTILITY           242    87.1821   0.0247    0       0.9752
MR. TRUMP         129    86.5345   0.9457    0.0077  0.0465
MR. BAKER         187    84.2796   0.8556    0.0053  0.1390
IBM               316    82.4361   0.0696    0       0.9303
MAKER             224    82.252    0.0223    0       0.9776
YEARS             1055   82.1632   0.5298    0.0815  0.3886
MR. MEESE         166    82.1007   0.8734    0       0.1265
BRAZIL            285    79.7311   0.0596    0       0.9403
SPOKESMAN         665    78.3441   0.6075    0.0045  0.3879
MR. SIMON         105    72.6446   0.9523    0       0.0476
DAUGHTER          47     71.3863   0.2340    0.7021  0.0638
FORD              249    71.3603   0.0562    0       0.9437
MR. GREENSPAN     120    68.7807   0.9083    0       0.0916
AT&T              198    67.9668   0.0252    0.0050  0.9696
MINISTER          125    67.7475   0.864     0.064   0.072
JUDGE             239    67.5899   0.7154    0.0836  0.2008

Figure 1: Top 43 noun phrases according to salience


[Plot not reproduced: horizontal axis "Number of references", with ticks at 10 and 100; vertical axis with ticks up to 1.0.]
Figure 2: Precision using honorific scoring scheme with syntactic Hobbs algorithm

for the last-noun method and 70.3% for the Hobbs method.
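The honorific-based precision defined in this section can be sketched as follows. This is a hedged illustration, not the authors' scoring script: the names `HONORIFIC_CLASS`, `expected_class`, and `honorific_precision` are hypothetical, the referent "Ms. Smith" is invented for the example, and the matching is case-sensitive on whitespace-separated tokens, which is an assumption about how referent strings are stored.

```python
# Gender class implied by each target honorific, as stated in the text.
HONORIFIC_CLASS = {"Mr.": "HE", "Mrs.": "SHE", "Ms.": "SHE"}


def expected_class(referent):
    """Class implied by the first target honorific found in the referent
    string, or None if the referent contains none of them."""
    for token in referent.split():
        if token in HONORIFIC_CLASS:
            return HONORIFIC_CLASS[token]
    return None


def honorific_precision(assigned):
    """assigned maps each referent type (not token) to the gender class
    learned for it. Only referents containing an honorific are scored."""
    scored = 0
    correct = 0
    for referent, learned_class in assigned.items():
        expected = expected_class(referent)
        if expected is None:
            continue  # not a name with Mr., Mrs., or Ms.
        scored += 1
        if learned_class == expected:
            correct += 1
    if scored == 0:
        raise ValueError("no referent contains a target honorific")
    return correct / scored


# Hypothetical learner output: two of the three scored referents are right.
example = {
    "Mr. Reagan": "HE",
    "Mrs. Thatcher": "SHE",
    "Ms. Smith": "IT",
    "company": "IT",  # ignored: no honorific
}
example_precision = honorific_precision(example)
```

Because the dictionary is keyed by referent type, a name mentioned a thousand times counts exactly as much as a name mentioned once, matching the statement that r varies over types rather than tokens.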
There are several things to note about these results. First, as one might expect given the already noted superior performance of the Hobbs scheme over last-noun, Hobbs also performs better at determining gender. Secondly, at first glance, the 70.3% accuracy of the Hobbs method is disappointing, only slightly superior to the 65.3% accuracy of Hobbs at finding correct referents. It might have been hoped that the statistics would make things considerably more accurate.
In fact, the statistics do make things considerably more accurate. Figure 2 shows average accuracy as a function of number of references for a given referent. It can be seen that there is a significant improvement with increased referent count. The reason that the average over all referents is so low is that the counts on referents obey Zipf's law, so that the mode of the distribution on counts is one. Thus the 70.3% overall accuracy is a mix of relatively high accuracy for referents with counts greater than one, and relatively low accuracy for referents with counts of exactly one.
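The argument above, that a type-level average is dominated by the many referents seen only once, can be made concrete with a small sketch. The data below are invented purely to show the arithmetic and do not come from the experiment; `precision_by_count` and `overall_precision` are hypothetical helper names.

```python
from collections import defaultdict


def precision_by_count(referents):
    """referents: iterable of (reference_count, is_correct) pairs, one per
    referent type. Returns {reference_count: fraction judged correct}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for count, is_correct in referents:
        totals[count] += 1
        hits[count] += int(is_correct)
    return {c: hits[c] / totals[c] for c in sorted(totals)}


def overall_precision(referents):
    """Unweighted average over referent types."""
    referents = list(referents)
    return sum(int(ok) for _, ok in referents) / len(referents)


# Invented toy sample: most referent types occur once (Zipf-like shape).
toy = [(1, True), (1, False), (1, True), (1, False), (1, True), (1, False),
       (10, True), (10, True), (100, True)]
by_count = precision_by_count(toy)
overall = overall_precision(toy)
```

In this toy sample the referents with more than one reference are all correct, yet the overall figure is only 6/9 because six of the nine types have a count of one and half of those are wrong.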
7 Previous Work
The literature on pronoun anaphora is too extensive to summarize, so we concentrate here on corpus-based anaphora research.
Aone and Bennett (1996) present an approach to an automatically trainable anaphora resolution system. They use Japanese newspaper articles tagged with discourse information as training examples for a machine-learning algorithm which is the C4.5 decision-tree algorithm by Quinlan (1993). They train their decision tree using (anaphora, antecedent) pairs together with a set of feature vectors. Among the 66 features are lexical, syntactic, semantic, and positional features. Their Machine Learning-based Resolver (MLR) is trained using decision trees with 1971 anaphoras (excluding those referring to multiple discontinuous antecedents) and they report an average success rate of 74.8%.
Mitkov (1997) describes an approach that uses a set of factors as constraints and preferences. The constraints rule out implausible candidates and the preferences emphasize the selection of the most likely antecedent. The system is not entirely "statistical" in that it consists of various types of rule-based knowledge: syntactic, semantic, domain, discourse, and heuristic. A statistical approach is present in the discourse module only where it is used to determine the probability that a noun (verb) phrase is the center of a sentence. The system also contains domain knowledge including the domain concepts, specific list of subjects and verbs, and topic headings. The evaluation was conducted on 133 paragraphs of annotated Computer Science text. The results show an accuracy of 83% for the 512 occurrences of it.
Lappin and Leass (1994) report on an (essentially non-statistical) approach that relies on salience measures derived from syntactic structure and a dynamic model of attentional state. The system employs various constraints for NP-pronoun non-coreference within a sentence. It also uses person, number, and gender features for ruling out anaphoric dependence of a pronoun on an NP. The algorithm has a sophisticated mechanism for assigning values to several salience parameters and for computing global salience values. A blind test was conducted on manual text containing 360 pronoun occurrences; the algorithm successfully identified the antecedent of the pronoun in 86% of these pronoun occurrences. The addition of a module that contributes statistically measured lexical preferences to the range of factors the algorithm considers improved the performance by 2%.
8 Conclusion and Future Research
We have presented a statistical method for pronominal anaphora that achieves an accuracy of 84.2%. The main advantage of the method is its essential simplicity. Except for implementing the Hobbs referent-ordering algorithm, all other system knowledge is imbedded in tables giving the various component probabilities used in the probability model.
We believe that this simplicity of method will translate into comparative simplicity as we improve the method. Since completing the research described herein, we have thought of other influences on anaphora resolution and their statistical correlates. We hope to include some of them in future work.
Also, as indicated by the work on unsupervised learning of gender information, there is a growing arsenal of learning techniques to be applied to statistical problems. Consider again the three high-salience words to which our unsupervised learning program assigned incorrect gender: "husband", "wife", and "years." We suspect that had our pronoun-assignment method been able to use the topic information used in the complete method, these might well have been decided correctly. That is, we suspect that "husband", for example, was decided incorrectly because the topic of the article was the woman, there was a mention of her "husband," but the article kept on talking about the woman and used the pronoun "she." While our simple program got confused, a program using better statistics might not have. This too is a topic for future research.
9 Acknowledgements
The authors would like to thank Mark Johnson and other members of the Brown NLP group for many useful ideas and NSF and ONR for support (NSF grants IRI-9319516 and SBR-9720368, ONR grant N0014-96-1-0549).
References
Chinatsu Aone and Scott William Bennett. 1996. Evaluating Automated and Manual Acquisition of Anaphora Resolution Strategies, pages 302-315. Springer.
Susan E. Brennan, Marilyn Walker Friedman, and Carl J. Pollard. 1987. A centering approach to pronouns. In Proc. 25th Annual Meeting of the ACL, pages 155-162. Association of Computational Linguistics.
Eugene Charniak. 1997. Statistical parsing with a context-free grammar and word statistics. In Proceedings of the 14th National Conference on Artificial Intelligence, Menlo Park, CA. AAAI Press/MIT Press.
Ted Dunning. 1993. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1), March.
Vasileios Hatzivassiloglou and Kathleen R. McKeown. 1997. Predicting the semantic orientation of adjectives. In Proc. 35th Annual Meeting of the ACL, pages 174-181. Association of Computational Linguistics.
Jerry R. Hobbs. 1976. Pronoun resolution. Technical Report 76-1, City College, New York.
Shalom Lappin and Herbert J. Leass. 1994. An algorithm for pronominal anaphora resolution. Computational Linguistics, pages 535-561.
Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19:313-330.
Ruslan Mitkov. 1997. Factors in anaphora resolution: they are not the only things that matter, a case study based on two different approaches. In Proceedings of the ACL '97/EACL '97 Workshop on Operational Factors in Practical, Robust Anaphora Resolution.
J. Ross Quinlan. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers.