eye movement evidence that readers maintain and act on … · 2009. 11. 24. · 1 summary of...

Eye movement evidence that readers maintain and act on uncertainty about past linguistic input

(Supporting Information)

Roger Levy, Klinton Bicknell, Tim Slattery, and Keith Rayner

1 Summary of uncertain-input sentence-comprehension model

Levy 2008 [1] introduced a model of noisy-channel sentence comprehension under uncertain

input in which a comprehender uses a probabilistic grammar which defines a joint probability

distribution over word sequences w and structural representations, together with perceptual

input I obtained from reading a sentence w∗ incrementally, to form posterior inferences about

what the sentence and its structure may be. As researchers who know the true sentence w∗

being read by an experimental participant but not the perceptual input I obtained at any

point during reading, we marginalize over perceptual input to obtain the comprehender’s

expected inferences about the sentence being read:

\[
P(w \mid w^*) = \int_I P_C(w \mid I, w^*)\, P_T(I \mid w^*)\, dI \tag{1}
\]

where PC is the comprehender’s probability distribution and PT is the true noise distribution.

We can apply Bayes’ rule to obtain

\begin{align}
P(w \mid w^*) &= P_C(w) \int_I \frac{P_C(I \mid w)\, P_T(I \mid w^*)}{P_C(I)}\, dI \tag{2} \\
&\propto P_C(w)\, Q(w, w^*) \tag{3}
\end{align}

where Q(w,w∗) is proportional to the integral in Equation (2) and represents the average

effect of perceptual noise. For a given partial sentence w∗, we represent Q(w,w∗) as a

function over w by constructing a weighted finite-state automaton in the log (base-2) semiring

[2] that recognizes only w∗ and gives it zero cost, then adding edit, insertion, and deletion

arcs with costs equal to a noise parameter λ times the Levenshtein edit distance between

the original arc’s label and the new arc’s label (for full details see [1]).
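As a rough illustration of how the noise term reweights the prior, the sketch below scores a few candidate word sequences against a true sentence, charging λ bits per unit of character-level Levenshtein distance between each candidate word and the true word in the same position. This is a simplification and not the weighted finite-state automaton of [1]: it handles only word-for-word substitutions in sequences of equal length, and the function names, candidate set, and prior costs are hypothetical.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Character-level Levenshtein edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]


def noise_cost_bits(candidate, true_words, lam):
    """Noise cost in bits: lam times the summed word-by-word edit distance.

    Assumption: candidate and true sentence have the same number of words.
    """
    if len(candidate) != len(true_words):
        raise ValueError("this sketch only compares equal-length sequences")
    return lam * sum(levenshtein(c, t) for c, t in zip(candidate, true_words))


def posterior(prior_bits, true_words, lam):
    """Normalized P(w | w*) over a finite candidate set.

    prior_bits maps each candidate (a tuple of words) to its prior cost
    -log2 P_C(w); the noise cost plays the role of -log2 Q(w, w*).
    """
    scores = {w: 2.0 ** -(cost + noise_cost_bits(w, true_words, lam))
              for w, cost in prior_bits.items()}
    z = sum(scores.values())
    return {w: s / z for w, s in scores.items()}
```

With a larger λ the edit-distance penalty dominates, so candidates that differ from the true words lose posterior mass even when the prior favors them.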

To model the behavioral consequences of reading a new word $w^*_i$ in a sentence, we assume
that if $w^*_i$ dramatically changes the comprehender’s beliefs about the earlier content of a sentence,
then the comprehender will tend to respond behaviorally with longer fixation times and
possibly regressive saccades. We define $P_i(w_{[0,j)})$ to be the probability distribution
over the sequence of words starting at the beginning of the sentence and continuing up to but
not including the position occupied by $w^*_j$, conditioning on the perceptual input obtained
from words $w^*_{1 \ldots i}$, and use the Kullback-Leibler (K-L) divergence
$D\left(P_i(w_{[0,i)}) \,\|\, P_{i-1}(w_{[0,i)})\right)$ to quantify the change in this probability
distribution induced by reading $w^*_i$. This

quantity is shown in main-submission Figure 2 as a function of λ. The probabilistic context-

free grammar [3] used for the main submission consisted of the non-terminal rewrite rules

given in Table 1 plus all terminal rewrite rules (of the form part-of-speech→word) found in

the parsed Brown corpus; rule probabilities are estimated from the parsed Brown corpus

[4, 5].
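The K-L divergence itself is simple to compute once the belief distributions before and after a word are available over the same set of candidate prefixes. The following is a minimal sketch in bits (base-2 logarithms, matching the log semiring used above); the function name and the dictionary representation are assumptions made for illustration.

```python
import math


def kl_divergence_bits(p, q):
    """D(p || q) in bits, for distributions given as {outcome: probability}."""
    total = 0.0
    for outcome, p_x in p.items():
        if p_x == 0.0:
            continue  # a zero-probability outcome contributes nothing
        q_x = q.get(outcome, 0.0)
        if q_x == 0.0:
            return math.inf  # p puts mass where q has none
        total += p_x * math.log2(p_x / q_x)
    return total
```

To mirror the quantity defined above, pass the updated beliefs as `p` and the earlier beliefs as `q`; the divergence is zero when the new word leaves beliefs about the earlier words unchanged.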

2 Orthographic neighbors and grammatical analysis

The syntactic analysis of the sentence differs dramatically between the true sentence and the

variants in which at has been replaced by an orthographically similar near-neighbor word.

Figure 1 illustrates the difference between the analyses for the at→and substitution, using

the categories of the grammar in Table 1. The mapping from these analyses to analyses

within mainstream syntactic frameworks involves straightforward tree transformations of the

type widely used in computational linguistics [6].

Levy, Bicknell, Slattery, Rayner – Supporting Information 2

Rule                      Weight    Rule                          Weight
ROOT → S                  0.00      VP/NP → V                     0.1
S → S-base CC S-base      7.3       VP → V PP                     2.0
S → S-base                0.01      VP → V NP                     0.7
S-base → NP-base VP       0         VP → V                        2.9
NP → NP-base RC           4.1       RC → WP S/NP                  0.5
NP → NP-base              0.5       RC → VP-pass/NP               2.0
NP → NP-base PP           2.0       RC → WP FinCop VP-pass/NP     4.9
NP-base → DT N N          4.7       PP → IN NP                    0
NP-base → DT N            1.9       S/NP → VP                     0.7
NP-base → DT JJ N         3.8       S/NP → NP-base VP/NP          1.3
NP-base → PRP             1.0       VP-pass/NP → VBN NP           2.2
NP-base → NNP             3.1       VP-pass/NP → VBN              0.4
VP/NP → V NP              4.0

Table 1: The probabilistic grammar used to compute K-L divergences in the main submission.

Rule weights given as negative log-probabilities in bits.
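Because the weights are negative log-probabilities in bits, a rule with weight c has probability 2^(-c), and the rules sharing a left-hand side should sum to roughly 1. The sketch below transcribes the weights from Table 1 and checks this; since the printed weights are rounded, the sums are only approximately 1. The variable and function names are illustrative.

```python
from collections import defaultdict

# (left-hand side, right-hand side, weight in bits), transcribed from Table 1.
RULES = [
    ("ROOT", "S", 0.00),
    ("S", "S-base CC S-base", 7.3),
    ("S", "S-base", 0.01),
    ("S-base", "NP-base VP", 0.0),
    ("NP", "NP-base RC", 4.1),
    ("NP", "NP-base", 0.5),
    ("NP", "NP-base PP", 2.0),
    ("NP-base", "DT N N", 4.7),
    ("NP-base", "DT N", 1.9),
    ("NP-base", "DT JJ N", 3.8),
    ("NP-base", "PRP", 1.0),
    ("NP-base", "NNP", 3.1),
    ("VP/NP", "V NP", 4.0),
    ("VP/NP", "V", 0.1),
    ("VP", "V PP", 2.0),
    ("VP", "V NP", 0.7),
    ("VP", "V", 2.9),
    ("RC", "WP S/NP", 0.5),
    ("RC", "VP-pass/NP", 2.0),
    ("RC", "WP FinCop VP-pass/NP", 4.9),
    ("PP", "IN NP", 0.0),
    ("S/NP", "VP", 0.7),
    ("S/NP", "NP-base VP/NP", 1.3),
    ("VP-pass/NP", "VBN NP", 2.2),
    ("VP-pass/NP", "VBN", 0.4),
]


def lhs_probability_mass(rules):
    """Sum 2**(-weight) over all rules sharing a left-hand side."""
    mass = defaultdict(float)
    for lhs, _rhs, bits in rules:
        mass[lhs] += 2.0 ** -bits
    return dict(mass)


if __name__ == "__main__":
    for lhs, total in lhs_probability_mass(RULES).items():
        print(f"{lhs:12s} {total:.3f}")
```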

3 Experiment

3.1 Materials & Design

Our experimental design involved crossing two factors: first, the use of at versus toward

as the post-verbal preposition early in the sentence; second, whether the critical participial
verb used in the object-modifying reduced relative clause had the same orthographic form

(ambiguous) or different orthographic form (unambiguous) as the simple-past member of

the verb’s paradigm. We used 24 experimental items in the study (given in Appendix A); a

sample item is shown below.


Left (true words):

(S
  (S-base
    (NP (NP-base (DT The) (N coach)))
    (VP (V smiled)
        (PP (IN at)
            (NP (NP-base (DT the) (N player))
                (RC (VP-pass/NP (VBN tossed)
                                (NP (NP-base (DT the) (N frisbee))))))))))

Right (at→and substitution):

(S
  (S-base
    (NP (NP-base (DT The) (N coach)))
    (VP (V smiled)))
  (CC and)
  (S-base
    (NP (NP-base (DT the) (N player)))
    (VP (V tossed)
        (NP (NP-base (DT the) (N frisbee))))))

Figure 1: Syntactic analyses of sentence with true words (left) and near-neighbor at→and
substitution (right) under the probabilistic grammar

(1) a. at, ambiguous:

The coach smiled at the player tossed a frisbee by the opposing team.

b. at, unambiguous:

The coach smiled at the player thrown a frisbee by the opposing team.

c. toward, ambiguous:

The coach smiled toward the player tossed a frisbee by the opposing team.

d. toward, unambiguous:

The coach smiled toward the player thrown a frisbee by the opposing team.

We constructed four stimulus lists, rotating items among these four conditions in a Latin

Square. These 24 experimental stimuli were interleaved with 36 fillers. Order of presenta-

tion was randomized differently for each participant, subject to the constraint that no two

experimental items appeared consecutively.
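The list construction and ordering constraint can be sketched as follows. The source does not describe how the rotation or the randomization was implemented, so this is only one way to do it: items are rotated through conditions by index, and a presentation order is built by shuffling the fillers and dropping each experimental item into a distinct gap between them, which guarantees that no two experimental items are adjacent. All names are hypothetical.

```python
import random

CONDITIONS = ["at/ambiguous", "at/unambiguous",
              "toward/ambiguous", "toward/unambiguous"]


def latin_square_lists(n_items, conditions):
    """One {item: condition} mapping per list; each item rotates through all conditions."""
    n = len(conditions)
    return [{item: conditions[(item + shift) % n] for item in range(n_items)}
            for shift in range(n)]


def presentation_order(n_exp, n_fill, rng):
    """Random order of trials with no two experimental items in a row."""
    if n_exp > n_fill + 1:
        raise ValueError("not enough fillers to separate experimental items")
    fillers = [("filler", i) for i in range(n_fill)]
    exps = [("exp", i) for i in range(n_exp)]
    rng.shuffle(fillers)
    rng.shuffle(exps)
    # Gap g sits just before filler g; gap n_fill is the end of the list.
    chosen_gaps = set(rng.sample(range(n_fill + 1), n_exp))
    exp_iter = iter(exps)
    order = []
    for gap in range(n_fill + 1):
        if gap in chosen_gaps:
            order.append(next(exp_iter))
        if gap < n_fill:
            order.append(fillers[gap])
    return order
```

With 24 items and four conditions, each list contains six items per condition, and each item appears in every condition exactly once across the four lists.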

3.2 Procedure

Forty undergraduate students at UC San Diego, all native speakers of English, participated in the experiment.
All had normal or corrected-to-normal vision, and were naive as to the

purpose of the experiment. Participants read each sentence while their eye movements were

monitored by an SR Eyelink 2000 eye-tracker, obtaining one eye-position sample every 1/2

millisecond with a spatial resolution of 0.01 degrees (binocular viewing, recording right-eye


only). Each sentence was presented on a single line in 14 point Courier New font on a 19 inch

LCD monitor positioned 55 cm in front of the participants (1 degree of visual angle ≈ 3 char-

acters). The eye-tracker was calibrated prior to beginning the experiment and subsequently

was recalibrated between trials as necessary.

3.3 Regions of analysis and data processing

We divided each experimental item into seven regions of analysis, as follows:

Subj         MV       Prep          Obj          Critical          Spill       Final
/The coach/  smiled/  {at,toward}/  the player/  {tossed,thrown}/  a frisbee/  by the opposing team./

Each trial was inspected by hand using the University of Massachusetts EyeDoctor

software suite (http://www.psych.umass.edu/eyelab/software/). We discarded any trial

in which there was track loss prior to some fixation in any region other than the Final region.

This resulted in loss of 15.3% of trials. Most of these track losses were due to the participant

blinking.

We examined a number of standard eye movement measures [7] including: (1) the fre-

quency with which a region was skipped on first reading, (2) first fixation duration (the

duration of the first fixation on a region when no material to the right of the region has yet

been fixated), (3) first pass reading time (the total fixation time on a region the first time

it is entered, when no material to the right of the region has yet been fixated; also called

gaze duration for regions consisting of only one word), (4) go-past time (the accumulated

time from when a reader first fixates on a region until their first fixation to the right of

the region; this measure includes any regressions the reader makes prior to moving forward

past the word), (5) total reading time (the summed time of all fixations on a region), (6)

regressions out of a region immediately after first-pass reading, and (7) regressions

into a region. These measures were computed for the regions of the sentence described

above; we do not report results for the Subject and Final regions as there were no significant

results on measures meaningful for these regions.
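The measure definitions above can be made concrete with a short sketch that derives them from one trial's fixation sequence. It assumes each fixation has already been assigned to a region (numbered left to right) and is not the EyeDoctor implementation; in particular, treating "regression into a region" as any fixation on the region that arrives from a region further right is an assumption, as are all names.

```python
from typing import Dict, List, Tuple

Fixation = Tuple[int, int]  # (region index, duration in ms), in temporal order


def region_measures(fixations: List[Fixation], region: int) -> Dict[str, object]:
    """Compute the seven measures for one region on one trial."""
    first = next((k for k, (r, _) in enumerate(fixations) if r == region), None)
    # Skipped on first reading: never fixated, or material to the right came first.
    skipped = first is None or any(r > region for r, _ in fixations[:first])
    measures: Dict[str, object] = {
        "skipped": skipped,
        "first_fixation": None,
        "first_pass": None,
        "go_past": None,
        "total": sum(d for r, d in fixations if r == region),
        "regression_out": None,
        "regression_in": any(prev_r > region and r == region
                             for (prev_r, _), (r, _) in zip(fixations, fixations[1:])),
    }
    if skipped:
        return measures  # first-pass measures are undefined for skipped regions

    # First pass: the run of consecutive fixations on the region from first entry.
    end = first
    while end < len(fixations) and fixations[end][0] == region:
        end += 1
    measures["first_fixation"] = fixations[first][1]
    measures["first_pass"] = sum(d for _, d in fixations[first:end])
    # Regression out: the first pass ends with a saccade to the left.
    measures["regression_out"] = end < len(fixations) and fixations[end][0] < region

    # Go-past: all time from first entry until the first fixation to the right.
    stop = first
    while stop < len(fixations) and fixations[stop][0] <= region:
        stop += 1
    measures["go_past"] = sum(d for _, d in fixations[first:stop])
    return measures
```

Go-past time is never shorter than first-pass time, because it adds any time spent re-reading earlier regions (and returning to the region) before the reader moves on.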

3.4 Statistical analysis method

We report most results using traditional by-participants (F1) and by-items (F2) ANOVAs.

Consistent with standard practice, reading times more than four standard deviations from
the mean for each condition in each region were discarded as outliers. In cases where the

assumptions of ANOVA are badly violated (heavily imbalanced data and/or binary responses

with by-subject or by-item means close to 0 or 1), we use mixed-effects models with crossed

random effects of subject and item [8] using the lme4 package in R [9]. For experimental

psycholinguistic data such as ours, precisely what random-effects structure to
specify for a multi-level model for inference on the fixed effects remains an open question. In

principle, for an n-condition experiment it could be appropriate to use a full n×n covariance

matrix (that is, arbitrary random interactions) for each of the by-subject and by-item random

effects. In practice, however, it is often difficult to obtain reliable convergence with such

complex random-effects structure for psycholinguistic datasets of our size. Therefore we

adopted the following principles, based on discussion in [8]. For each analysis, we began

by fitting a model with random intercepts by-subject and by-item. We then fit one model

with random intercepts by-subject and random interactions by-item, and another model

with random interactions by-subject and random intercepts by-item. We used likelihood-

ratio tests to compare each of these models with the random intercepts-only model. If

neither of these models yielded a significant improvement in log-likelihood, we report fixed-

effects results based on the random intercepts-only model. If at least one of these models

yielded a significant improvement in log-likelihood, we attempted to fit a final model with

random interactions by-subject and by-item. If this model converged and yielded a significant

improvement over the better of the two intermediate models by the likelihood-ratio test, we

report fixed-effects results based on this model; otherwise, we report fixed-effects results

based on the better of the two intermediate models. For linear models, random-effects

model comparisons were done using restricted maximum-likelihood estimation; fixed-effects

results are reported based on maximum-likelihood estimation. For logit models, Laplace

approximation of maximum likelihood was always used. Statistical significance for linear

models is reported as a t-statistic associated with the parameter estimate—for our datasets,

a t-statistic of 2 or greater corresponds approximately to p < 0.05 significance [8], and a

t-statistic of 1.65 or greater corresponds approximately to marginal p < 0.1 significance; for

logit models, as a p-value based on the Wald statistic [10]. Our factorial contrasts (which are

all two-way) were converted to a centered numeric representation to eliminate correlations

among main effects and lower-order interactions.
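The decision procedure for choosing a random-effects structure can be summarized in code. The analyses reported here were run with lme4 in R; the Python sketch below only mirrors the selection logic, taking already-fitted models as plain records of log-likelihood, parameter count, and convergence status. The `Fit` record, the function names, the 0.05 threshold, and the reading of "better" as "higher log-likelihood among the significantly improved models" are assumptions.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Fit:
    name: str
    loglik: float        # maximized log-likelihood
    n_params: int        # number of free parameters
    converged: bool = True


def chi2_sf(x: float, df: int) -> float:
    """Upper tail of the chi-square distribution for integer df (x > 0)."""
    h = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-h)
    else:
        a, q = 0.5, math.erfc(math.sqrt(h))
    while a < df / 2.0:  # step the regularized upper incomplete gamma upward
        q += h ** a * math.exp(-h) / math.gamma(a + 1.0)
        a += 1.0
    return q


def lrt_p(simple: Fit, richer: Fit) -> float:
    """p-value of a likelihood-ratio test of nested models."""
    stat = 2.0 * (richer.loglik - simple.loglik)
    df = richer.n_params - simple.n_params
    if stat <= 0.0 or df <= 0:
        return 1.0
    return chi2_sf(stat, df)


def select_model(intercepts: Fit, by_item: Fit, by_subject: Fit,
                 full: Optional[Fit] = None, alpha: float = 0.05) -> Fit:
    """Pick the model on which fixed-effects results are reported."""
    improved = [m for m in (by_item, by_subject)
                if m.converged and lrt_p(intercepts, m) < alpha]
    if not improved:
        return intercepts
    best = max(improved, key=lambda m: m.loglik)
    if full is not None and full.converged and lrt_p(best, full) < alpha:
        return full
    return best
```

Here `by_item` stands for the model with random interactions by item and random intercepts by subject, `by_subject` for the reverse, and `full` for the model with random interactions for both.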


3.5 Results

For measures 1–7 described in Section 3.3 above, condition-specific means can be found in

Table 2; results of statistical analysis are shown in Table 3. Our results include a number

of main effects of preposition type at/toward on fixation times and regressive-saccade be-

havior that are presumably driven by the dramatic difference in length between these two

prepositions and its effect on first-pass reading behavior. These differences include main

effects on first-fixation (by items), first-pass, go-past, and total reading times, as well as

outward first-pass regression frequency, on the preposition region, and on first-fixation,

go-past time, total reading time, and outward first-pass regression frequency on the object

region. Readers skipped at far more often than they skipped toward, took more time to

read toward than at, regressed from toward more than at, had longer first fixations on the

region immediately following toward (the object region) than on the region following at,

and regressed more from the object region into at than into toward.

For skip probability, we found a highly significant effect of at/toward on the preposi-

tion region, with far more skipping of at than of toward. We also found a significant effect

of ambiguity on skip probability on the critical region, with more frequent skipping in the

unambiguous condition than in the ambiguous condition (p < 0.05 in a mixed-effects logit

model). This is almost certainly due to the fact that mean word length was shorter in the un-

ambiguous condition (5.63 characters) than in the ambiguous condition (6.42 characters).1

For spillover region skip probability, ANOVAs found a significant main effect (p < 0.05)

of ambiguity by participants, but these skip probabilities were close to 0, such that tradi-

tional ANOVA results are unreliable. Mixed-effects logit models found no reliable effects of

condition on spillover-region skip probability.

Our key results involve fixation times and first-pass regressions out involving the crit-

ical region, and first-pass regressions into the preposition region. On the critical region

we find a main effect of ambiguity on first-pass reading times, with longer times in the am-

biguous conditions, plus a numerical interactive trend for effect size to be larger in the at

conditions than in the toward conditions. More crucially, we found significant interactions

on go-past times and first-pass regressions out (GoPast, RegOut), with the at+ambig
condition having the longest times and the most regressions out. On the prepo-

1 A mixed-effects logit model with fixed effects of preposition × ambiguity plus word length found a highly
significant effect of word length on critical-region skip probability (p < 0.001) and no effects of condition.


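Table 2: Means and standard errors (in parentheses) for eye-movement measures. Skip, RegOut, and RegIn are percentages of trials; FirstFix, FirstPass, GoPast, and Total are in milliseconds.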

MV Prep Obj Crit Spill

Skip

at ambig 0 (0) 59 (4) 2 (1) 3 (1) 1 (1)

at unambig 1 (1) 58 (4) 2 (1) 8 (2) 1 (1)

toward ambig 1 (1) 1 (1) 2 (1) 2 (1) 3 (1)

toward unambig 0 (0) 3 (1) 1 (1) 5 (2) 1 (1)

FirstFix

at ambig 245 (7) 235 (15) 215 (5) 283 (11) 251 (8)

at unambig 243 (7) 232 (13) 219 (6) 279 (12) 273 (10)

toward ambig 250 (9) 243 (7) 235 (8) 286 (10) 265 (10)

toward unambig 264 (9) 257 (11) 233 (8) 285 (10) 261 (8)

FirstPass

at ambig 332 (11) 251 (19) 363 (18) 355 (15) 426 (21)

at unambig 329 (12) 241 (14) 374 (15) 322 (15) 442 (20)

toward ambig 305 (12) 294 (10) 358 (14) 359 (15) 443 (24)

toward unambig 333 (11) 288 (12) 358 (13) 343 (15) 453 (19)

GoPast

at ambig 391 (19) 294 (24) 502 (24) 476 (24) 679 (40)

at unambig 409 (17) 303 (22) 568 (23) 399 (20) 681 (33)

toward ambig 393 (18) 330 (14) 399 (18) 399 (20) 660 (44)

toward unambig 406 (20) 346 (19) 420 (16) 409 (18) 652 (34)

Total

at ambig 573 (35) 362 (22) 768 (47) 596 (29) 813 (52)

at unambig 566 (32) 364 (23) 758 (42) 626 (42) 818 (42)

toward ambig 574 (38) 490 (23) 641 (38) 640 (39) 776 (45)

toward unambig 605 (34) 505 (31) 659 (41) 616 (35) 829 (49)

RegOut

at ambig 11 (3) 6 (2) 24 (4) 21 (3) 28 (4)

at unambig 14 (3) 4 (1) 26 (4) 12 (2) 33 (4)

toward ambig 15 (3) 10 (2) 8 (2) 10 (2) 25 (3)

toward unambig 12 (3) 13 (3) 9 (2) 14 (2) 26 (3)

RegIn

at ambig 34 (5) 36 (4) 53 (4) 35 (4) 32 (4)

at unambig 38 (4) 31 (4) 44 (4) 44 (5) 32 (4)

toward ambig 40 (4) 31 (4) 42 (4) 37 (4) 33 (4)

toward unambig 40 (5) 31 (4) 36 (5) 42 (4) 34 (4)


Table 3: F-statistics for main effects and interactions for the eye-movement measures in
Table 2 (. p < 0.1, ∗ p < 0.05, † p < 0.01, ‡ p < 0.001)

MV Prep Obj Crit Spill

F1 F2 F1 F2 F1 F2 F1 F2 F1 F2

Skip

at <1 <1 269.79‡ 359.69‡ <1 <1 3.71. 2.25 1.67 2.22

ambig <1 <1 <1 <1 <1 <1 5.74∗ 4.37∗ 6.03∗ <1

at:ambig 1.01 <1 <1 <1 <1 <1 <1 <1 <1 1.90

FirstFix

at 3.72. 5.01∗ 1.03 4.46∗ 5.81∗ 6.93∗ <1 <1 <1 <1

ambig 1.11 1.07 1.15 <1 <1 <1 <1 <1 1.65 1.30

at:ambig 2.01 <1 <1 1.49 <1 <1 <1 1.17 2.59 1.14

FirstPass

at 1.35 <1 8.99† 14.76‡ <1 2.15 <1 2.11 <1 <1

ambig 1.82 1.41 <1 <1 <1 <1 5.15∗ 3.52. <1 <1

at:ambig 3.30. 2.92 <1 <1 <1 <1 <1 1.08 <1 <1

GoPast

at <1 <1 4.68∗ 4.58∗ 41.48‡ 47.50‡ 3.15. 1.86 <1 <1

ambig <1 <1 <1 <1 4.87∗ 6.90∗ 3.16. 3.33. <1 <1

at:ambig <1 <1 <1 <1 2.26 <1 4.77∗ 6.99∗ <1 <1

Total

at 1.08 3.11. 34.22‡ 47.12‡ 16.61‡ 11.10† <1 <1 <1 <1

ambig <1 <1 <1 <1 <1 <1 <1 <1 <1 1.18

at:ambig <1 1.15 <1 <1 <1 <1 1.11 <1 <1 <1

RegOut

at <1 <1 10.47† 8.42† 18.75‡ 46.21‡ 4.60∗ 6.82∗ 2.86. 1.50

ambig <1 <1 <1 <1 <1 <1 <1 1.55 <1 <1

at:ambig 1.21 1.80 1.65 1.62 <1 <1 11.42† 5.67∗ <1 <1

RegIn

at 2.10 1.77 <1 1.28 9.34† 5.53∗ <1 <1 <1 1.09

ambig <1 <1 <1 <1 4.85∗ 7.36∗ 4.58∗ 3.91. <1 <1

at:ambig <1 <1 1.14 1.79 <1 <1 <1 <1 <1 <1


Table 4: Means and standard errors for eye-movement measures in trials where the preposition
was fixated

MV Prep Obj Crit Spill

GoPast

at ambig 344 (24) 294 (24) 437 (37) 473 (34) 625 (52)

at unambig 417 (29) 303 (22) 419 (32) 405 (33) 696 (56)

toward ambig 389 (18) 330 (14) 398 (18) 397 (19) 660 (44)

toward unambig 402 (19) 346 (19) 416 (16) 404 (18) 653 (34)

RegOut

at ambig 8 (4) 13 (5) 9 (3) 21 (5) 23 (6)

at unambig 15 (5) 10 (4) 9 (3) 11 (3) 30 (6)

toward ambig 14 (3) 10 (2) 8 (2) 9 (2) 25 (3)

toward unambig 12 (3) 14 (3) 8 (2) 13 (3) 26 (3)

sition region, we found a main effect of preposition type on frequency of inward regressive

saccades (RegIn), driven by a numerical interactive trend: inward regressive saccades were

most common in the at+ambig condition.

We obtained two significant main effects that we believe are unlikely to be relevant

to the present study. These include a significant main effect of preposition type on first-

fixation reading time at the main-clause verb (MV), with reading times higher in the toward

condition than in the at condition; and a significant main effect of ambiguity on go-past time

on the object region (the player), with go-past time longer in the unambiguous condition

than in the ambiguous condition. It is possible that these are preview effects related to

superficial properties of the subsequent region. Given that these effects are significant only

at the p < 0.05 level, are not seen in related eye-movement measures on the region in question

(e.g., we see no effect of ambiguity on outward first-pass regressions from the object region),

are not interactive, and that Table 3 involves 210 main-effect hypothesis tests, we do not

consider these two effects to be of immediate concern in interpreting the key results of our

study.


Table 5: F-statistics for main effects and interactions for the eye-movement measures in
trials where the preposition was fixated (Table 4) (. p < 0.1, ∗ p < 0.05, † p < 0.01, ‡ p < 0.001)

MV Prep Obj Crit Spill

F1 F2 F1 F2 F1 F2 F1 F2 F1 F2

GoPast

at <1 <1 4.68∗ 4.58∗ <1 <1 1.46 2.17 <1 <1

ambig 3.20. <1 <1 <1 <1 <1 2.39 3.23. 1.19 <1

at:ambig 2.67 <1 <1 <1 <1 <1 3.24. 3.22. 2.71 <1

RegOut

at <1 <1 <1 <1 <1 <1 1.29 3.41. <1 <1

ambig <1 <1 <1 <1 <1 <1 1.28 1.20 2.14 <1

at:ambig 2.24 <1 <1 1.42 <1 <1 6.12∗ 3.62. <1 <1

3.5.1 Trials in which the preposition was fixated

Because of the high frequency of skipping the preposition in the at conditions, we also

analyzed go-past and regressions-out measurements in the subset of trials on which the

preposition region was not skipped. The means and standard errors for these trials are

shown in Table 4, and F -statistics are presented in Table 5. The qualitative patterns for these

measures are identical to those patterns observed for all fixations, although the significance

levels on all effects have decreased due to the loss of over half the data. At the critical region,

mixed-effects models found the interactions to be significant for go-past time (t = 2.16) and

marginal (p = 0.055) for regressions out.

3.5.2 Regressive saccades in detail

We also examined in greater detail the distribution of first-pass regressive saccades between

regions of the sentence. Table 6 shows the regression matrix of the relative frequency of

first-pass regressive saccades into each region of the sentence, as a function of the region

from which the saccade originated. It is quite clear that most first-pass regressive saccades

are short and do not skip over regions of analysis. There are no major differences across

conditions, with the exception that first-pass regressions from the spill-over region jump over

the critical region and reach the object region more frequently in the ambiguous conditions,

and most frequently in the at+ambig condition. In mixed-effects logit models, there was

a significant main effect of ambiguity on this pattern (p < 0.001), but the interaction was

marginal (p > 0.08).

In addition, we examined first-pass regressive-saccade behavior beyond the first regressive


Subj MV Prep Obj Crit

MV 1.00 0.00 0.00 0.00 0.00

Prep 0.12 0.88 0.00 0.00 0.00

Obj 0.00 0.33 0.67 0.00 0.00

Crit 0.00 0.02 0.08 0.90 0.00

Spill 0.00 0.00 0.02 0.37 0.61

(a) at/tossed


MV 1.00 0.00 0.00 0.00 0.00

Prep 0.07 0.93 0.00 0.00 0.00

Obj 0.02 0.23 0.75 0.00 0.00

Crit 0.00 0.07 0.10 0.83 0.00

Spill 0.00 0.02 0.00 0.09 0.89

(b) toward/tossed


MV 1.00 0.00 0.00 0.00 0.00

Prep 0.07 0.93 0.00 0.00 0.00

Obj 0.00 0.17 0.83 0.00 0.00

Crit 0.00 0.00 0.05 0.95 0.00

Spill 0.00 0.02 0.00 0.19 0.80

(c) at/thrown


MV 1.00 0.00 0.00 0.00 0.00

Prep 0.04 0.96 0.00 0.00 0.00

Obj 0.05 0.19 0.76 0.00 0.00

Crit 0.03 0.00 0.09 0.88 0.00

Spill 0.00 0.00 0.00 0.10 0.90

(d) toward/thrown

Table 6: First-pass regression matrix. Rows denote region of departure, columns denote

entry region. Numbers are proportions.

saccade from a region. As seen in Table 6, it was rare for readers to regress from the critical

region or beyond directly back to the preposition region in a single saccade. However, in

many cases the first regressive saccade was not immediately followed by a sequence of forward

saccades, but rather by a series of overall backward-moving saccades. To quantify this, we

computed what we will call here go-past regressions. We define a reader to have had a go-past

regression from region Y to region X if s/he had a first-pass regression from region Y and

subsequently fixated on region X before saccading past region Y. Go-past regression counts

are shown in Table 7; we analyzed these using mixed logit models. As seen, there are three

salient patterns in these data. First, there is a main effect of at/toward in go-past regressions

from the object to the preposition and to the main-clause verb, presumably driven by the

difference in length between the two prepositions (both p < 0.01). Second, there is a main

effect of critical-word ambiguity in go-past regressions from the spillover region to the object

and preposition regions (both p < 0.05 in a mixed logit model); in these cases, there are


Table 7: Frequency of go-past regressions

                from Object            from Critical                 from Spillover
                Subj←  MV←    Prep←    Subj←  MV←    Prep←  Obj←     Subj←  MV←    Prep←  Obj←   Crit←
at/tossed       0      15     47       1      2      11     43       0      4      10     33     50
at/thrown       4      19     48       2      2      4      26       1      1      1      17     65
toward/tossed   1      4      13       2      2      2      20       3      3      12     21     49
toward/thrown   2      8      18       2      2      5      29       4      4      7      13     54

Ambiguous Unambiguous

At 62% 72%

Toward 70% 69%

Table 8: Question-answering accuracy

also numerical interactions such that there are superadditively many go-past regressions in

the at+ambig condition, but neither interaction coefficient reached significance. Finally,

there is an interactive pattern in go-past regressions from the critical region to the object

and preposition regions, with the most such regressions in the at+ambig condition (object

region: p < 0.05, preposition region: p = 0.087).
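The go-past regression definition given above translates directly into a check on the sequence of fixated regions within a trial. The sketch below assumes regions are numbered left to right and that a first-pass regression from region Y means the first run of fixations on Y (entered before anything to its right was fixated) ends with a leftward saccade; the function name is hypothetical.

```python
from typing import Sequence


def go_past_regression(regions: Sequence[int], source: int, target: int) -> bool:
    """True if the trial contains a go-past regression from source to target.

    regions lists the region index of each fixation in temporal order.
    """
    if target >= source:
        raise ValueError("target must lie to the left of source")
    regions = list(regions)
    if source not in regions:
        return False
    first = regions.index(source)
    if any(r > source for r in regions[:first]):
        return False  # source was skipped on first reading
    # Find the end of the first pass on the source region.
    end = first
    while end < len(regions) and regions[end] == source:
        end += 1
    if end == len(regions) or regions[end] > source:
        return False  # first pass did not end in a regression
    # Follow the reader until they move past the source region.
    for r in regions[end:]:
        if r > source:
            break
        if r == target:
            return True
    return False
```

Unlike the regression matrix in Table 6, this counts a target region as reached even when the reader arrives there through a chain of several backward saccades.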

3.5.3 Question-answering accuracy

Average question-answering accuracy on fillers, at 89.6%, was considerably higher than for

experimental items, and no subject answered filler questions below 72% accuracy; this lowest

accuracy level of 72% is significantly above chance (p < 0.01) by a two-tailed binomial test.

On experimental items, in contrast, participants’ question-answering accuracy was relatively

low (68.5% overall). Condition-specific accuracies are given in Table 8; in no experimental

condition did accuracy exceed 72%. We interpret this pattern as indicating that participants

were reading attentively, but that ditransitive reduced relative clauses involving passivization

on the first object (e.g., tossed the player the frisbee) can make sentences quite difficult to
comprehend indeed.

As Table 8 indicates, accuracy was lowest in the at+ambig condition. In 2 × 2 ANOVA

analyses we found no significant main effects on accuracy and an interaction significant only

by items (F1(1, 39) = 1.52, p = 0.226; F2(1, 23) = 5.46, p = 0.029). Our mixed logit model

analysis revealed a marginal main effect of ambiguity (p = 0.08), and a significant interaction


(p < 0.05).

3.5.4 Question subtypes

Sixteen of our twenty-four questions queried some property of the reduced relative clause,

including whether the main-clause object was the agent or the goal of the RRC verb. There

were four such types of questions, illustrated in (2) below, flanked by codes used in Ap-

pendix A and correct answers.

(2) [O Vred] Did the player toss/throw a frisbee? NO

[So Vred O] Did someone toss/throw the player a frisbee? YES

[O Vred PP] Did the player toss/throw the opposing team a frisbee? NO

[PP Vred O] Did the opposing team toss/throw the player a frisbee? YES

Each type of RRC-directed question was used in four items. (The question type and by-

condition accuracy for each item can be found in Appendix A.) Mean question-answering

accuracies for RRC-directed and non-RRC-directed question types in each condition are given

in Table 9. Because these data are unbalanced, we analyzed them only with a mixed-effects

logit model.2 This model found a significant main effect of question type (p < 0.01) and

significant interactions between preposition and ambiguity (p < 0.05) and between question

type and ambiguity (p < 0.01). That is, readers were systematically worse at answering

RRC-directed questions than at answering non-RRC-directed questions. Although there

is a numerical trend for at+ambig-condition RRC-directed questions to be answered less

accurately than any other question type, this three-way interaction was not statistically

significant.

3.6 Analyses contingent on participant question-answering accuracy

One possible concern raised by the relatively low overall question-answering accuracy

(as stated in Section 3.5.3, 68.5% overall for experimental items) is that the reaction-time

2 Specifying separate random effects of subject & item for each of the eight condition types proved
computationally prohibitive, so we collapsed the original four conditions to +/−at+ambig, which did not
appreciably lower model likelihood for the four-condition analysis of the previous section, and also indicated
qualitatively similar conclusions about fixed effects.


Table 9: Question-answering accuracy by question type

at ambig at unambig toward ambig toward unambig

Not RRC Question 0.812 0.800 0.875 0.762

RRC Question 0.531 0.688 0.612 0.656

and regressive-saccade measurements during sentence reading might reflect processes that

have little to do with normal language-comprehension, such as guessing.3 To address this

possibility, we conducted separate analyses of our crucial online measures (first-pass and go-past
durations, and first-pass and go-past regressions) for two separate participant subgroups:

those whose question-answering accuracy was above median participant accuracy, and those

below median participant accuracy. The logic behind these analyses is that if processes

such as guessing underlie the crucial eye-movement patterns found in this experiment, these

patterns should be at least as strongly evident in low-accuracy participants as in high-accuracy
participants.

We used two different participant accuracy scores to determine our subgroups: accuracy

on filler-item questions and accuracy on experimental-item questions. In each case, it hap-

pened that seventeen participants lay above the median (the high-accuracy group), eighteen

lay below the median (the low-accuracy group), and five lay on the median and were thus

excluded from the analysis. We present results based on filler-item accuracy first; the two

accuracies are correlated at r = 0.479 (p < 0.01), but the filler-item accuracy has the advan-

tage of being logically independent of experimental-item online behavior. Table 10 presents

by-condition means for each of these cases among high- and low-accuracy comprehenders.

Because these data are unbalanced, we analyze them with mixed-effects models (see Sec-

tion 3.4), conducting separate analyses for high-accuracy and low-accuracy participants. In

first-pass reading times, high-accuracy participants had a marginally significant main effect

of ambiguity (t = 1.74) and a marginally significant interaction between preposition and

ambiguity (t = 1.76), whereas low-accuracy participants had no significant main effects or

interactions (all t < 1.62). In go-past reading times, high-accuracy participants had a significant
interaction between preposition and ambiguity (t = 2.50), whereas low-accuracy

3 We thank an anonymous reviewer for raising this point.


Table 10: Crucial measures as a function of comprehender accuracy on filler questions

High-accuracy comprehenders (n = 17) Low-accuracy comprehenders (n = 18)

FirstPass GoPast RegOut GPReg QA FirstPass GoPast RegOut GPReg QA

at ambig 349 (19) 455 (38) 18 ( 4) 8 70 ( 6) 364 (27) 498 (38) 23 ( 4) 0 56 ( 5)

at unambig 298 (19) 372 (29) 10 ( 3) 0 80 ( 3) 340 (23) 433 (33) 14 ( 4) 5 64 ( 4)

toward ambig 330 (22) 356 (26) 7 ( 3) 0 77 ( 3) 375 (24) 411 (30) 9 ( 3) 2 67 ( 4)

toward unambig 332 (21) 395 (24) 14 ( 4) 3 75 ( 4) 358 (26) 418 (33) 12 ( 4) 0 60 ( 4)

participants had no significant main effects or interactions (all t < 1.62). In first-pass re-

gressions, high-accuracy participants had a marginal interaction between preposition and

ambiguity (pz = 0.06) whereas low-accuracy participants had a numerical trend toward an

interaction that was not significant (pz = 0.16). Go-past regressions were too rare in either

participant subgroup to analyze reliably, but the numerical trend was toward the predicted

interaction only in the high-accuracy group. On experimental-item question-answering accu-

racy, both groups had marginal interactions between preposition and ambiguity (pz = 0.094

and pz = 0.089 for high-accuracy and low-accuracy participants respectively).

Table 11 presents by-condition means for high- and low-accuracy comprehenders as de-

termined by experimental-item accuracy. In first-pass reading times, both groups had a

marginally significant main effect of ambiguity (t = 1.9 and t = 1.97 respectively). In

go-past reading times, high-accuracy participants had a significant main effect of preposition
(t = 2.6) and a significant interaction between preposition and ambiguity (t = 2.50),

whereas low-accuracy participants had a numerical trend toward the predicted interaction,

but no significant main effects or interactions (all t < 0.8). In first-pass regressions, high-

accuracy participants had a marginal main effect of preposition (pz = 0.09) and a marginal

interaction between preposition and ambiguity (pz = 0.08) whereas low-accuracy partici-

pants had a marginal interaction between preposition and ambiguity (pz = 0.07). As with

the filler-accuracy split, go-past regressions were too rare in either participant subgroup

to analyze reliably, but the numerical trend was toward the predicted interaction only in

the high-accuracy group. On experimental-item question-answering accuracy, high-accuracy

comprehenders had no significant effects of condition, whereas low-accuracy comprehenders

had a significant main effect of ambiguity (pz = 0.03) and a significant interaction be-

tween preposition and ambiguity (pz = 0.01). Because question-answering accuracy was

the criterion by which the two participant groups were determined, it is not surprising that

question-answering accuracy shows different qualitative patterns across the two groups.


Table 11: Crucial measures as a function of comprehender accuracy on experimental questions

High-accuracy comprehenders (n = 17)

                 FirstPass   GoPast     RegOut   GPReg   QA
at ambig         349 (22)    519 (48)   25 (5)   7       80 (4)
at unambig       313 (19)    396 (35)   13 (4)   2       83 (4)
toward ambig     314 (16)    355 (26)   11 (3)   0       79 (3)
toward unambig   307 (17)    391 (28)   16 (4)   4       81 (4)

Low-accuracy comprehenders (n = 18)

                 FirstPass   GoPast     RegOut   GPReg   QA
at ambig         350 (21)    452 (35)   18 (4)   0       44 (3)
at unambig       336 (27)    415 (29)   10 (3)   0       64 (3)
toward ambig     393 (26)    404 (26)    7 (3)   0       60 (4)
toward unambig   347 (25)    403 (27)   13 (3)   0       59 (4)

In sum, there is no clear evidence that the crucial interactions in online measurements

found in our study are disproportionately strong among low-accuracy participants, as one

would expect if inability to understand the sentences were driving these online interactions.

To the contrary, the numerical patterns suggest that these crucial interactions are at least as

strong, if not stronger, when comprehension accuracy is high. The clearest of these results are

that (1) in first-pass durations, high-accuracy participants (based on the filler-accuracy split)

showed a marginally significant interaction between preposition and ambiguity, whereas this

interaction was not significant when either all participants or only low-accuracy participants were

considered; and (2) in go-past durations, high-accuracy participants (based on either split)

showed a significant interaction between preposition and ambiguity, whereas low-accuracy

participants did not.

3.7 Plausibility norming

We also conducted a plausibility norming study on the main-clause portions of our items

(i.e. The coach smiled at/toward the player for (1)) in order to address a possible confound.4

If the toward-condition main clauses are overall less plausible than the at-condition main

clauses, it is possible that the interactive pattern of greatest difficulty in the at+tossed con-

dition could arise from initial misanalysis of tossed as a main verb with player as its subject,

followed by rapid reanalysis into a reduced-relative or coordinate-verb analysis whose diffi-

culty is greater the more plausible the main-clause structure is. While we believe that this

is an unlikely explanation even if there are systematic differences in main-clause plausibility,

because the reanalysis would not involve any change to the structure of the main clause

itself, we addressed the point empirically by reanalyzing our results on the basis of plausibility

4We thank Lyn Frazier for pointing out this possible confound to us.

norms. Thirty native-English-speaking UC San Diego undergraduates, none of whom had

participated in the eye-tracking study, took part in the plausibility norming study. The main

clauses of the 24 items were split equally into two blocks and interleaved among 36 fillers.

Each sentence was rated for plausibility on a scale of 1 (implausible) to 7 (plausible), with

presentation order randomized separately for each participant.

Analysis revealed that at-condition main clauses did indeed have higher overall average

plausibility, at 5.94, than toward-condition main clauses, at 5.49 (by participants: t(29) =

3.7, p < 0.001; by items: t(23) = 4.4, p < 0.001). To address the potential confound that this

difference in mean plausibility presents, we ranked our items by the difference in plausibility

rating between the at condition and the toward condition, and removed items in rank order

until the mean plausibility in the remaining item set in the toward condition was not lower

than that in the at condition. This left us with 10 of our 24 items, with a mean at-condition

plausibility rating of 5.75 and a mean toward-condition rating of 5.78. We then reran the analysis of

critical-region go-past time and first-pass regressions out using only these 10 items. The

results are shown in Tables 12 and 13. This subset of plausibility-matched items shows no

qualitative differences from the full item set in go-past time or first-pass regressions out;

in fact, interaction sizes are numerically larger in this subset. Although the regressions-out

interaction fails to reach statistical significance in this reduced item set,

the go-past time interaction is more highly significant here than in the full item set. Because

the remaining set of 10 items was not fully counterbalanced, we also analyzed go-past times

and regressions out using linear and logit mixed models; the go-past time interaction was

confirmed as highly significant (t = 3.3), though the regressions out interaction was not

(p = 0.19). We conclude that the plausibility differential between the at and toward conditions does

not explain the interactive difficulty pattern observed in the at+tossed condition.
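The item-removal procedure described above can be sketched as follows. This is a minimal illustration, not the authors' analysis script: the function name `match_plausibility` and the six toy ratings are hypothetical, and the stopping rule is taken directly from the prose (remove items in rank order of the at-minus-toward difference until the mean toward rating is no lower than the mean at rating).

```python
from statistics import mean


def match_plausibility(ratings):
    """Return the ids of items retained after plausibility matching.

    ratings maps an item id to a pair (at_rating, toward_rating).
    Items are dropped, largest at-minus-toward difference first, until
    the mean toward rating is no lower than the mean at rating.
    """
    # Rank items from the largest to the smallest (at - toward) difference.
    ranked = sorted(ratings,
                    key=lambda i: ratings[i][0] - ratings[i][1],
                    reverse=True)
    kept = list(ranked)
    while kept:
        at_mean = mean(ratings[i][0] for i in kept)
        toward_mean = mean(ratings[i][1] for i in kept)
        if toward_mean >= at_mean:
            break
        kept.pop(0)  # drop the item with the largest remaining difference
    return sorted(kept)


# Hypothetical ratings for six items (NOT the norming data from this study).
toy_ratings = {
    1: (6.5, 5.0),
    2: (6.0, 5.5),
    3: (5.8, 5.7),
    4: (5.5, 5.6),
    5: (6.2, 6.4),
    6: (5.0, 4.7),
}
kept_items = match_plausibility(toy_ratings)
print(kept_items)
```

Because items are removed strictly in rank order, the retained subset is always the set of items with the smallest at-minus-toward differences; as noted in the text, such a subset need not remain counterbalanced.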

3.8 Analysis based on trial order

Another possible confound in interpretation of our experimental results is that the crucial

interactions found in our experiment (on first-pass and go-past durations, first-pass and go-

past regressions, and question-answering accuracy) could be driven by a learning effect.5 For

example, since toward is a less frequent word than at, and only the latter word appears in

filler sentences, it is possible that participants noticed the contingency that the NP after

5We thank an anonymous reviewer for raising this point.


                 GoPast     RegOut
at ambig         492 (30)   23 (5)
at unambig       356 (25)    9 (3)
toward ambig     405 (28)   11 (3)
toward unambig   453 (29)   11 (3)

Table 12: Mean and standard error at critical region for at–toward plausibility-matched item subset

             GoPast             RegOut
             F1       F2        F1       F2
at           <1       <1        2.73     3.08
ambig        2.09     2.87      4.40∗    2.90
at:ambig     9.47†    11.18†    1.92     1.55

Table 13: F-statistics at critical region for at–toward plausibility-matched item subset

toward was always followed by a reduced relative clause, whereas they did not learn such

a contingency involving at. This could allow participants to become increasingly effective

at processing the toward conditions, which could drive an interactive pattern of the sort we

see here if knowledge of this contingency could only be usefully applied to the ambiguous

conditions. To test for this possibility, we conducted analyses of our crucial measures based on

trial order. These analyses took three forms: (1) division of the experiment into four blocks

based on trial order, and inspection of condition means and standard errors for each block;

(2) for time measurements, non-parametric regression fits of duration against trial order

in each condition; (3) mixed-effect model analyses to test for the presence of interactions

between trial order and condition.

Analysis (1)—means and standard errors by block—is presented in Table 14. In all four

online measures, the numerical size of the interaction in question (as measured by the sum of

the at+ambiguous and toward+unambiguous condition means, minus the at+unambiguous

and toward+ambiguous condition means) is largest in the first of the four blocks. Question-

answering accuracy behaves differently over the course of the experiment: in the at+ambiguous

condition it seems to fluctuate throughout the course of the experiment, whereas in the other

three conditions it clearly rises through the course of the experiment.
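As a worked illustration of the interaction contrast defined above, the sketch below computes it for each trial-order block using the go-past means reported in Table 14. The dictionary layout and the function name `interaction_contrast` are illustrative choices, not part of the original analysis.

```python
# Go-past condition means for blocks 1-4, copied from Table 14.
go_past = {
    ("at", "ambig"): [504, 550, 483, 494],
    ("at", "unambig"): [312, 423, 345, 388],
    ("toward", "ambig"): [373, 445, 404, 436],
    ("toward", "unambig"): [375, 408, 388, 523],
}


def interaction_contrast(means, block):
    """(at+ambig + toward+unambig) - (at+unambig + toward+ambig) for one block."""
    return (means[("at", "ambig")][block]
            + means[("toward", "unambig")][block]
            - means[("at", "unambig")][block]
            - means[("toward", "ambig")][block])


contrasts = [interaction_contrast(go_past, b) for b in range(4)]
print(contrasts)  # [194, 90, 122, 193]
```

For go-past times the contrast is largest in the first block (194), with the fourth block nearly as large (193).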

Analysis (2), non-parametric regression of duration against trial order in each condition

using R’s non-parametric regression function lowess(), is presented in Figures 2 and 3 for

first-pass and go-past durations respectively. In neither case do durations fall more

quickly in the toward+ambig condition than in the at+ambig condition.

Table 14: Crucial measures in first, second, third, and fourth quartiles of trial order

FirstPass
                 1          2          3          4
at ambig         405 (28)   346 (27)   349 (30)   375 (34)
at unambig       330 (23)   311 (23)   305 (23)   331 (22)
toward ambig     321 (21)   416 (35)   352 (18)   343 (23)
toward unambig   304 (24)   366 (33)   327 (21)   401 (37)

GoPast
                 1          2          3          4
at ambig         504 (44)   550 (65)   483 (66)   494 (47)
at unambig       312 (43)   423 (42)   345 (38)   388 (37)
toward ambig     373 (41)   445 (46)   404 (42)   436 (80)
toward unambig   375 (43)   408 (47)   388 (35)   523 (91)

RegOut
                 1          2          3          4
at ambig         21 (7)     30 (7)     18 (6)     23 (6)
at unambig       11 (5)     18 (7)     12 (4)      7 (3)
toward ambig      9 (5)      6 (4)      8 (3)     12 (5)
toward unambig   18 (5)     12 (5)     17 (6)      6 (4)

GPReg
                 1          2          3          4
at ambig         4          4          2          1
at unambig       0          1          1          2
toward ambig     0          0          1          1
toward unambig   2          1          0          2

QA
                 1          2          3          4
at ambig         62 (7)     55 (7)     68 (7)     54 (7)
at unambig       51 (7)     72 (7)     73 (7)     84 (5)
toward ambig     52 (7)     75 (6)     75 (5)     83 (6)
toward unambig   64 (7)     64 (7)     70 (7)     83 (6)
filler           89 (2)     86 (3)     91 (2)     91 (2)

For analysis (3), we used mixed-effects models due to their ability to handle continuous

covariates as well as imbalance (trial order was fully randomized and thus slightly imbalanced

across conditions), with trial order (defined as one plus the number of experimental

items the participant had already seen) as a real-valued predictor variable, standardizing it

to eliminate correlation with other predictors (preposition and ambiguity) and to facilitate

interpretation. On first-pass times we found a significant main effect of ambiguity (t = 2.6),

a marginal main effect of preposition (t = 1.76), and a significant interaction between ambi-

guity and order (t = 2.6) such that durations in the ambiguous conditions became shorter

relative to the unambiguous conditions over the course of the experiment. No other effects

were significant, most crucially the three-way interaction between preposition, ambiguity,

and trial order (t = 0.36). On go-past times we found a significant main effect of ambiguity

(t = 2.17) and a significant interaction between preposition and ambiguity (t = 2.4); no

order effects were significant. On regressions out we found a marginal main effect of prepo-

sition (pz = 0.08), a significant interaction between preposition and ambiguity (pz = 0.02),


Figure 2: First-pass times as a function of trial order (panels: at ambig, toward ambig, at unambig, toward unambig; x-axis: order, 0–25; y-axis: fpass, 100–400)

Figure 3: Go-past times as a function of trial order (panels: at ambig, toward ambig, at unambig, toward unambig; x-axis: order, 0–25; y-axis: gopast, 200–800)

and no significant order effects. Counts are too small to ensure that analysis of go-past

regressions is completely reliable, but the analysis revealed a marginal interaction between

preposition and ambiguity (p = 0.10) consistent with the other findings we obtained on

this measure. Finally, on question-answering accuracy we did find a three-way interaction

between preposition, ambiguity, and trial order (pz = 0.048). To clarify the precise nature of

this three-way interaction, we conducted an equivalent mixed-effects analysis with the fixed

effects recoded as interactions between condition and (scaled) trial order, with no intercept or

main effect of order. This coding assigns a separate learning rate to each condition, allowing

us to investigate the extent to which there is evidence for learning in each of the four con-

ditions. On this analysis, there were significant learning effects—with question-answering

accuracy improving over the course of the experiment—in the at+unambiguous condition

(β = 0.48, pz < 0.01), the toward+ambiguous condition (β = 0.62, pz < 0.001), and the

toward+unambiguous condition (β = 0.48, pz < 0.01). Only the at+ambiguous condition

showed no significant learning effects in either direction (β = −0.01, pz = 0.93).6 For
completeness, we also analyzed trial order effects on filler-question accuracy (here, trial order
is defined as how many fillers have already been seen). A mixed-effects logit model also
found a significant learning effect such that participants improved during the course of the
experiment (β = 0.3, pz < 0.01).

6Although there is a suggestion from the coefficient estimates that the learning effect might be largest
in the toward+ambiguous condition, this possibility was not supported by likelihood-ratio tests between a
two-learning-rate model (one rate for at+ambiguous and one for the rest) and a model with one learning rate for
each condition (p = 0.91), nor by a test between the two-rate model and a model with a single learning rate
for the unambiguous conditions, one for the at+ambiguous condition, and one for the toward+ambiguous
condition (p = 0.73).

Two major points emerge from the analyses presented in this section.
First, the analytic techniques employed here are sensitive enough to pick up on effects of
trial order, including two- and three-way interactions between trial order and experimental
manipulations. This can be seen from the significant interaction between trial order and
ambiguity in first-pass time, and from the significant three-way interaction on question-answering
accuracy. Second, despite the sensitivity of the analytic techniques, no effects of
trial order were obtained that could explain the crucial online interactions in our experiment.
The only relevant online effect of trial order was its interaction with ambiguity in first-pass durations;
furthermore, this effect ran opposite to the overall trend in the experiment (i.e., durations
dropped over time in the ambiguous conditions), and both non-parametric plots
and block-by-block means suggest that this learning effect was, if anything, driven more by
the at+ambiguous condition than by the toward+ambiguous condition.

The relationship of trial order with question-answering accuracy differed from its relationship with
online measures: over the course of the experiment, participants got better at answering
questions in all conditions (including on fillers) except in the at+ambiguous condition. Although
this pattern bears some resemblance to the possible confound in which participants
get differentially better at the toward+ambiguous condition, this hypothesis provides no account
of why participants’ accuracy improves across all other conditions (crucially including both
unambiguous conditions) at approximately the same rate. We believe that the most likely
account of the observed relationship between trial order, condition, and question-answering
accuracy is that, as indicated by all our crucial online measurements, the at+ambiguous condition is
indeed the most difficult of the four conditions, and that this great difficulty prevents participants
from making consistent improvements in sentence interpretation over the course of
this short experiment.
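The trial-order predictor used in the mixed-effects analyses of this section can be sketched as follows: order is one plus the number of experimental items already seen, and it is then standardized. This is only an illustration under stated assumptions: the trial labels, the helper names `trial_order` and `standardize`, and the use of the population standard deviation are hypothetical choices, since the text does not specify how the standardization was computed.

```python
from statistics import mean, pstdev


def trial_order(sequence, is_experimental):
    """Trial order of each experimental trial in a presentation sequence.

    Order is one plus the number of experimental items already seen;
    filler trials are skipped and do not advance the count.
    """
    orders = []
    seen = 0
    for trial in sequence:
        if is_experimental(trial):
            orders.append(seen + 1)
            seen += 1
    return orders


def standardize(values):
    """Center to mean 0 and scale to unit (population) standard deviation."""
    m = mean(values)
    sd = pstdev(values)  # assumption: population SD; a sample SD would also serve
    return [(v - m) / sd for v in values]


# Hypothetical presentation sequence: "E" = experimental item, "F" = filler.
sequence = ["F1", "E7", "F2", "F3", "E2", "E9", "F4", "E4"]
orders = trial_order(sequence, lambda t: t.startswith("E"))
z_orders = standardize(orders)
print(orders)
print(z_orders)
```

Centering the predictor in this way is what reduces its correlation with the interaction terms that involve it, which is one common reason for standardizing a continuous covariate before crossing it with categorical factors.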


3.9 Ruling out a categorical misidentification account

One point that must be emphasized is that these results are not compatible with an account

that simply allows for occasional categorical misidentification of the word at. The reason

for this can be seen when we consider the four experimental conditions plus at-condition

variants with misidentification as a near-neighbor word:

(3) a. The coach smiled toward the player. . . tossed

b. The coach smiled at the player. . . tossed

c. The coach smiled {as/and} the player. . . tossed

(4) a. The coach smiled toward the player. . . thrown

b. The coach smiled at the player. . . thrown

c. The coach smiled {as/and} the player. . . thrown

On such an account, critical-region reading in at+tossed trials should reflect some mixture of

the critical-region behavior that would be obtained in reading correctly-identified (3b) and

(3c) sentences. We would expect critical-region difficulty in (3b) to be similar to that of (3a),

since the only difference between the two is the preposition that was used. The critical-region

difficulty of (3c), on the other hand, should be substantially smaller than that of either (3a)

or (3b), since a finite-verb reading is now available for tossed. The difficulty in the at+tossed

condition should thus be less, if anything, than in the toward+tossed condition. (Note that

any overall increase in difficulty associated with the use of toward in comparison with at

should show up as a main effect, not as an interaction.) In the unambiguous conditions of

(4), in contrast, no corresponding facilitation should occur as a result of categorical misiden-

tification as in (4c), since thrown cannot be a finite main verb. Therefore, any interaction

in a categorical-misidentification model should be facilitatory in the at+tossed condition,

which is the opposite of what our results indicate.
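The argument of this section can be written out as a short derivation. The notation is an assumption introduced here for illustration and does not appear in the original: d(x) is the expected critical-region difficulty of sentence x when its words are correctly identified, ε is the probability that at is categorically misread as a near-neighbor word, and δ is an additive cost of toward relative to at.

```latex
% Assumed notation (not in the original): d(x) = expected critical-region
% difficulty of sentence x; \epsilon = probability of misidentifying "at";
% \delta = additive main-effect cost of "toward" relative to "at".
\begin{align*}
D_{\text{at+tossed}} &= (1-\epsilon)\,d(3b) + \epsilon\,d(3c), &
D_{\text{toward+tossed}} &= d(3a) = d(3b) + \delta,\\
D_{\text{at+thrown}} &= (1-\epsilon)\,d(4b) + \epsilon\,d(4c), &
D_{\text{toward+thrown}} &= d(4a) = d(4b) + \delta.
\end{align*}
% Interaction contrast: the ambiguity effect under "at" minus that under "toward".
\begin{align*}
I &= \bigl(D_{\text{at+tossed}} - D_{\text{at+thrown}}\bigr)
   - \bigl(D_{\text{toward+tossed}} - D_{\text{toward+thrown}}\bigr)\\
  &= \epsilon\,\bigl[\bigl(d(3c) - d(3b)\bigr) - \bigl(d(4c) - d(4b)\bigr)\bigr].
\end{align*}
```

Under these assumptions δ cancels, so it contributes only a main effect. Because a finite-verb reading makes d(3c) < d(3b), while misidentification yields no facilitation in the unambiguous conditions (d(4c) ≥ d(4b)), the contrast I is negative for any ε > 0: the at+tossed condition would be relatively facilitated, which is the opposite of the observed pattern.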

References

[1] Levy R (2008) A noisy-channel model of rational human sentence comprehension under

uncertain input. EMNLP 13 pp 234–243.


[2] Mohri M (1997) Finite-state transducers in language and speech processing. Comput

Linguist 23:269–311.

[3] Manning CD, Schütze H (1999) Foundations of Statistical Natural Language Processing

(MIT Press).

[4] Kučera H, Francis WN (1967) Computational Analysis of Present-day American English

(Providence, RI: Brown University Press).

[5] Marcus MP, Santorini B, Marcinkiewicz MA (1994) Building a large annotated corpus

of English: The Penn Treebank. Comput Linguist 19:313–330.

[6] Collins M (2003) Head-driven statistical models for natural language parsing. Comput

Linguist 29:589–637.

[7] Rayner K (1998) Eye movements in reading and information processing: 20 years of

research. Psychol Bull 124:372–422.

[8] Baayen RH, Davidson DJ, Bates DM (2008) Mixed-effects modeling with crossed ran-

dom effects for subjects and items. J Mem Lang 59:390–412.

[9] Bates D (2005) Fitting linear mixed models in R. R News 5:27–30.

[10] Jaeger TF (2008) Categorical data analysis: Away from ANOVAs (transformation or

not) and towards logit mixed models. J Mem Lang 59:434–446.

A Experimental items

After each experimental item, question type, main-clause plausibility ratings, and question-

answering accuracy by condition are given. Question type codings are as given in (2), plus

as follows:

[S Vm] Did the coach smile? YES

[Vm O] Did someone smile at the player? YES

[S was Vred] Was the coach {tossed/thrown} a frisbee? NO

[O Vm] Did the player toss a frisbee? NO


Main-clause plausibility ratings are given in mean±standard-error format in the order at/toward.

Question-answering accuracy is given in the order

at+ambig/at+unambig/toward+ambig/toward+unambig

1. The students sighed at the professor {taught/given} a dancing lesson by the experi-

enced instructor. [S Vm] (6.38±0.18/5.41±0.36; 1.0/1.0/1.0/0.9)

2. The kindergartner grinned at the little girl {brought/chosen} a toy by her parents on

the first day of Chanukah. [S Vm] (6.18±0.40/5.85±0.27; 0.9/1.0/0.9/1.0)

3. The hostess shrugged at the customer {allowed/forbidden} the pleasure of eating sweets

by his doctor. [Vm O] (6.23±0.32/5.35±0.34; 0.8/0.9/1.0/0.8)

4. The nurse grimaced at a student {grabbed/stolen} a muffin by her friends from the

dining hall. [S was Vred] (5.82±0.31/4.69±0.44; 0.9/1.0/0.9/0.9)

5. The hotel owner scowled at the guest {brought/taken} a drink by the bellboy. [O Vm]

(6.15±0.34/5.59±0.23; 0.9/0.8/0.9/0.9)

6. The benchwarmers cheered at the player {tossed/thrown} a frisbee by the opposing

team. [O Vred] (5.35±0.45/4.85±0.46; 0.7/0.5/0.7/0.2)

7. The priest frowned at the woman {offered/given} a beer by the hostess. [O Vred]

(6.31±0.31/5.47±0.40; 0.4/0.9/0.6/0.9)

8. The foreman cried out at a carpenter {cut/sawn} a board by his buddy. [O Vm]

(5.18±0.37/5.15±0.45; 0.4/0.1/0.7/0.5)

9. The manager cursed at the waiter {served/given} pea soup by a trainee. [So Vred

O] (6.38±0.21/5.65±0.27; 0.3/0.7/0.6/0.7)

10. The receptionist winked at the young man {rented/shown} an apartment by his uncle.

[So Vred O] (6.76±0.14/5.85±0.27; 0.9/0.9/0.9/1.0)

11. The anthropologist looked on at the woman {knitted/woven} a shawl by her mother.

[O Vred PP] (5.46±0.35/5.12±0.35; 0.7/0.8/0.5/0.9)

12. James stared at the children {dyed/hidden} Easter eggs by their teachers. [PP Vred

O] (6.88±0.08/5.23±0.43; 0.6/0.8/0.6/0.8)


13. The soldiers fired at the sergeant {presented/shown} a list of charges by the judge the

previous day. [O Vred PP] (5.08±0.33/4.94±0.35; 0.7/0.7/0.6/0.6)

14. The town drunk snorted at the innkeeper {recited/sung} a verse by a traveling monk.

[PP Vred O] (5.82±0.32/4.23±0.50; 0.3/0.6/0.7/0.6)

15. The taxi driver signaled at the woman {tossed/thrown} a silver dollar by the passerby.

[O Vred PP] (5.92±0.33/5.94±0.20; 0.5/0.7/0.7/0.5)

16. The mime gestured at the artist {painted/drawn} a picture by her father while he was

on his deathbed. [S was Vred] (6.65±0.15/6.38±0.24; 0.8/0.6/0.7/0.3)

17. The trader sneered at the banker {clipped/given} a coupon by her boss. [So Vred

O] (5.77±0.28/5.71±0.29; 0.7/0.9/0.8/0.5)

18. The logger glared at the activist {planted/grown} a tree by his daughter. [O Vred]

(6.12±0.36/6.08±0.31; 0.7/0.8/0.7/0.9)

19. The little boy reached out at the girl {knitted/woven} a hat by her grandmother. [Vm

O] (6.00±0.32/6.47±0.17; 0.8/1.0/0.9/0.8)

20. The lobbyist smiled at the congressman {mailed/written} a letter by the CEO. [PP

Vred O] (6.24±0.32/5.69±0.35; 0.8/1.0/0.9/0.8)

21. The referee motioned at the athlete {hurled/thrown} a pass by the quarterback during

the third quarter. [O Vred] (6.15±0.25/6.59±0.15; 0.2/0.5/0.1/0.4)

22. The people in line rubbernecked at the man {removed/withdrawn} some money by his

wife from the uncooperative ATM. [So Vred O] (4.53±0.45/4.46±0.53; 0.4/0.5/0.3/0.5)

23. The landlord squinted at the tenant {carried/driven} a load of books by her boyfriend

from her office. [O Vred PP] (6.31±0.21/5.71±0.27; 0.4/0.5/0.7/0.7)

24. The actor coughed at the journalist {asked/chosen} a question by the editor for the

interview. [PP Vred O] (5.18±0.33/4.62±0.40; 0.3/0.4/0.8/0.5)


Filler sentences

Items 37–44 are practice sentences and were presented at the beginning of the experiment.

1. Two elementary school students were doing their homework in the adjacent room.

2. The leftovers in the fridge are starting to smell.

3. A tall glass full of apple juice spilled on the coffee table.

4. The architect didn’t recognize the old blueprints from college.

5. The stray dog sniffed at the garbage can in apparent search of food.

6. The limosine arrived at the party completely full of passengers.

7. A group of seagulls settled on the power lines lining the avenue.

8. Pierre just purchased a new cat from the pet store in the next town.

9. The monitor turned itself off after a thirty minutes of inactivity.

10. The last woman in line tapped her foot and stared at her watch impatiently.

11. An aspiring young model from Nebraska moved to Los Angeles and immediately started

looking for work.

12. The accountant went to his boss and complained that the office was too stuffy.

13. Brad tripped on the telephone cord and banged his knee on the table.

14. The cyclist hit a patch of ice and lost control of his bike.

15. She sharpened the scissors and started cutting out her Valentine’s card.

16. Josephine grabbed the trunk of the car and pulled hard to get it open.

17. The news anchor stared off into space and sipped her coffee.

18. The bar was thick with smoke and plenty of men in their sixties.

19. The accountant watched the manager search the desk for the missing check.

20. The teller saw the teenagers enter the bank before the robbery.

21. The violin instructor observed her students work their way through the difficult music.


22. A street musician witnessed two hoodlums attempt to break into a station wagon last night.

23. The receptionist noticed her boss go home extra early on Wednesday.

24. The docent scrutinized the intern cleaning the Florentine vase in the museum hall.

25. The jeweler spotted what he thought were three young men casing his store.

26. The jailed prostitute overheard two police officers discussing her case.

27. A janitor discovered a stray dog scratching at the cafeteria door after school.

28. The mechanical engineer who formerly consulted for Daniel’s startup has now started his

own company.

29. A woman who was wearing a straw hat rummaged in her purse as the bus pulled to a halt.

30. Six protesters who were carrying signs proclaiming opposition to the death penalty marched

up the street.

31. A salesman who tried to sell Adam a magazine subscription yesterday showed up at his door

again today.

32. The worker who was experiencing mood swings quit his job last week.

33. Paula’s sister in London knows at least three people who are vegan.

34. A congressional page who worked for a freshman congressman from Ohio stopped by the

office with tea.

35. None of the farmers who lived in the area expected the season to be so favorable to squash.

36. The novel that most appealed to Simon was unfortunately sold out at his favorite bookstore.

37. The judge heard the bailiff chuckle under his breath.

38. The night watchman detected an intruder tugging at the glass door on the balcony.

39. A gardener who dabbled in hybridizing tomato strains planted some imported seeds in his

newest plot.

40. Lauren worked with an editor who strongly disagreed with her usage of semicolons.

41. She inspected the grassy knoll for remnants of bullets from a high-powered rifle.


42. The old shawl had been passed down to her from her great-grandmother from Ukraine.

43. The woman was severely overweight and had a history of medical problems because of it.

44. A cab driver arrived at the scene and picked up all four of the waiting businessmen.

