end of award report background - amazon s3 · end of award report background although the primary...

26
End of Award Report Background Although the primary aim of this project was technical (the development of a corpus of data) the work needs to be seen against the background of an important theoretical issue, namely, what properties of text influence eye movement control and when is this influence exerted? Since reading involves successive fixations, accurately located on text elements, it is obvious that some low-level control must take place (e.g. information about the physical location of target words). Indeed, a great deal of the variance (possibly as much as 90 percent) in fixation location is accounted for by a combination of low-level visual processing, coupled with oculomotor constraints. The 'Strategy-Tactics' conjecture 1 of O'Regan and co-workers (O'Regan 1990, 1992; O’Regan. & Lévy-Schoen, 1987; O'Regan, Vitu, Radach, & Kerr, 1994) and the Mr Chips model of Legge, Klitz & Tjan (1997) capitalise on this fact and have sought to provide a comprehensive theory of eye movement control in reading couched in purely oculomotor terms, but both approaches are plainly inadequate. The 'where?' of eye movement control in reading may be substantially under low-level control, but the 'when?' is clearly not. It is now beyond dispute that theoretical accounts of the complexities of temporal control in reading must make reference to cognitive factors (Rayner, Sereno, & Raney, 1996). Both the number of fixations and their individual duration are tightly coupled to linguistic properties of the text, including orthographic, lexical, syntactic, pragmatic and discourse levels of description (see Rayner, 1998, for a comprehensive review). However, establishing that cognitive factors play a key role in eye movement control is several steps short of providing a comprehensive model of the processes involved. The first realistic computational model to achieve this was the E-Z Reader model of Reichle, Pollatsek, Fisher & Rayner (1998) (see also, Reichle, Rayner, & Pollatsek, 2003). In summary, this model provides a good account of the effects of the lexical frequency, predictability and physical eccentricity of text items on fixation duration and refixation rate. It can also elegantly account for a wide range of well-established outcomes, such as the preview effect, short-duration fixations, word-skipping and spillover effects. It is important to stress that E-Z Reader is not simply a mathematical model; indeed, as such, it might be seen as relatively 'expensive', given its number of free parameters (Jacobs, 2000). Rather, it is an explicit instantiation of a particular class of theories which seek to account for the reader's behaviour in terms of the deployment of overt and covert attention (Morrison, 1984; see also Henderson, & Ferreira, 1993). The theoretical background to the present Project has this linkage as its focus. Morrison (1984) proposed that the reader's attention shifts in a serial fashion as lexical access 2 is achieved for each successive fixated word (see Henderson, 1992, for a review). That is, once processing of a word in the fovea is complete, the reader can begin work on the next word before actually looking at it. This distinction between overt eye movements and covert shifts of attention provides a convincing account of well-documented parafoveal preview advantage (Rayner, 1998). It was initially proposed that covert attentional shifts and overt inter-word 1 Although often referred to as a 'model' the term 'conjecture' is possibly more appropriate, because there has been no attempt at formal or computational instantiation of this set of ideas. 2 Morrison did not use this term, but it is clear that the trigger event is seen as word identification. 1

Upload: others

Post on 26-Mar-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: End of Award Report Background - Amazon S3 · End of Award Report Background Although the primary aim of this project was technical (the development of a corpus of data) the work

End of Award Report Background

Although the primary aim of this project was technical (the development of a corpus of data) the work needs to be seen against the background of an important theoretical issue, namely, what properties of text influence eye movement control and when is this influence exerted? Since reading involves successive fixations, accurately located on text elements, it is obvious that some low-level control must take place (e.g. information about the physical location of target words). Indeed, a great deal of the variance (possibly as much as 90 percent) in fixation location is accounted for by a combination of low-level visual processing, coupled with oculomotor constraints. The 'Strategy-Tactics' conjecture1 of O'Regan and co-workers (O'Regan 1990, 1992; O’Regan. & Lévy-Schoen, 1987; O'Regan, Vitu, Radach, & Kerr, 1994) and the Mr Chips model of Legge, Klitz & Tjan (1997) capitalise on this fact and have sought to provide a comprehensive theory of eye movement control in reading couched in purely oculomotor terms, but both approaches are plainly inadequate. The 'where?' of eye movement control in reading may be substantially under low-level control, but the 'when?' is clearly not. It is now beyond dispute that theoretical accounts of the complexities of temporal control in reading must make reference to cognitive factors (Rayner, Sereno, & Raney, 1996). Both the number of fixations and their individual duration are tightly coupled to linguistic properties of the text, including orthographic, lexical, syntactic, pragmatic and discourse levels of description (see Rayner, 1998, for a comprehensive review).

However, establishing that cognitive factors play a key role in eye movement control is several steps short of providing a comprehensive model of the processes involved. The first realistic computational model to achieve this was the E-Z Reader model of Reichle, Pollatsek, Fisher & Rayner (1998) (see also, Reichle, Rayner, & Pollatsek, 2003). In summary, this model provides a good account of the effects of the lexical frequency, predictability and physical eccentricity of text items on fixation duration and refixation rate. It can also elegantly account for a wide range of well-established outcomes, such as the preview effect, short-duration fixations, word-skipping and spillover effects. It is important to stress that E-Z Reader is not simply a mathematical model; indeed, as such, it might be seen as relatively 'expensive', given its number of free parameters (Jacobs, 2000). Rather, it is an explicit instantiation of a particular class of theories which seek to account for the reader's behaviour in terms of the deployment of overt and covert attention (Morrison, 1984; see also Henderson, & Ferreira, 1993). The theoretical background to the present Project has this linkage as its focus.

Morrison (1984) proposed that the reader's attention shifts in a serial fashion as lexical access2 is achieved for each successive fixated word (see Henderson, 1992, for a review). That is, once processing of a word in the fovea is complete, the reader can begin work on the next word before actually looking at it. This distinction between overt eye movements and covert shifts of attention provides a convincing account of well-documented parafoveal preview advantage (Rayner, 1998). It was initially proposed that covert attentional shifts and overt inter-word

1 Although often referred to as a 'model' the term 'conjecture' is possibly more appropriate, because there has been no attempt at formal or computational instantiation of this set of ideas. 2 Morrison did not use this term, but it is clear that the trigger event is seen as word identification.

1

Page 2: End of Award Report Background - Amazon S3 · End of Award Report Background Although the primary aim of this project was technical (the development of a corpus of data) the work

saccades were triggered by the same cognitive event (word identification), leading to the prediction that preview advantage should be independent of foveal load and relatively constant (i.e. pre-processing of a word can only take place during the time needed to prepare a saccade). However, it is now known that preview advantage actually varies as a function of foveal load (Henderson & Ferreira, 1990): substantial foveal-on-parafoveal interactions exist in normal reading. The only convincing way to account for these pre-processing effects in a serial model is by postulating separate sources of cognitive control over overt and covert processes. For example, in the E-Z Reader model the covert switch of attention is linked to word identification and eye movements to a condition which is typically satisfied earlier than lexical access (e.g. a 'familiarity check').

An alternative, and in some respects more obvious, response to the observation that preview advantage changes as a function of foveal load is to suggest that foveal and parafoveal words are processed in parallel (Engbert, Longtin & Kliegl,2002). Indeed, at the level of the 'where?' decision, some parallel processing must occur. For example, the E-Z Reader model allows for this by suggesting that saccade preparation to a specific target may take place before the attentional shift. But while this is a not unreasonable way of dealing with the preview effect (i.e. it is a processing advantage which can only be cashed in later, after the programmed fixation has been executed), strong parafoveal-on-foveal effects prove much more embarrassing to serial attention-switching models.

It is a critical feature of all serial-sequential models that properties of a word in the parafovea cannot influence current foveal processing, because information from an extra-foveal source only becomes available after attention has been switched away from the fovea. At the time attention-switching models were being developed this potential embarrassment was avoided because the evidence in favour of parafoveal-on-foveal effects was lacking (Carpenter & Just, 1983; Henderson & Ferreira, 1993; Rayner, Fischer & Pollatsek, 1998; but see Kennedy, 2000 for a discussion). More recently, however, the situation has changed and a lively dispute has arisen following reports that properties of an uninspected parafoveal item may have an immediate effect on current foveal processing (Inhoff, Radach, Starr & Greenberg, 2000; Inhoff, Starr & Shindler, 2000; Kennedy, 1995, 1998, 2000; Kennedy, Murray & Boissiere, 2004; Murray, 1998; Pynte, Kennedy & Ducrot, 2004; Pynte, Kennedy & Ducrot, 2004; Underwood, Binns & Walker, 2000). Parafoveal effects on foveal inspection time implicate physical, orthographic, sub-lexical, lexical, and pragmatic properties of the as-yet uninspected parafoveal word. Nonetheless, a number of important caveats need to be entered with regard to these studies: (1) the direction of the obtained effects varies between studies; (2) the absolute size of any modulation of foveal inspection time, even if reliable, is small; (3) there continue to be reports of null effects (Altarriba, Kambe, Pollatsek, & Rayner, 2001; Rayner, Balota & Pollatsek, 1986; Rayner, White, Kambe, Miller & Liversedge, 2003; Schroyens, Vitu, Brysbaert, & d’Ydewalle, 1999; White & Liversedge, 2004); (4) in many cases (albeit, not invariably) the tasks employed in laboratory studies have borne little resemblance to normal reading. With regard to (1), Kennedy, Pynte and Ducrot (2002) point out that foveal and parafoveal length has rarely been fully controlled in laboratory studies. This may explain some of the obtained inconsistencies. With regard to (2) it is important to note that obtained parafoveal

2

Page 3: End of Award Report Background - Amazon S3 · End of Award Report Background Although the primary aim of this project was technical (the development of a corpus of data) the work

effects (e.g. of lexical frequency) are substantially less than their foveal equivalents and this establishes the context against which experimental data have to be interpreted. It may, for example, account for the null effects reported by Carpenter and Just (1983). The present project was, in large part, motivated by points (3) and (4). In the original Application the question was raised as to whether parafoveal-on-foveal effects might only be found in artificial laboratory tasks. Of course, 'artificiality' in this context is a somewhat graded term: eye movement studies of reading in genuinely naturalistic settings are, for the foreseeable future, impossible. Indeed, in the final analysis, the distinction between the properties of normal reading and other 'reading-like' tasks may turn out to be unhelpful (Kennedy, Murray & Boissiere, 2004). A reasonable working hypothesis is that normal reading must engage the component processes isolated in the course of many years of research using laboratory tasks (e.g. the identification of letters and letter sequences, lexical processing, the assignment of syntactic structure etc). But instances where robust effects obtainable in the laboratory are attenuated or absent in normal reading stand as a challenge to the eye movement researcher, if only in the sense that little is know about why such attenuation occurs. Until this deeper question is resolved it is plainly desirable to make use of reading tasks with demands which, as far as practicable, approximate those imposed on the 'normal reader'. It was this approach that motivated the second phase of the present Project. Objectives (1) To develop a corpus of eye movement data in the English and French languages, derived from a normal reading task (reading newspaper editorials for comprehension). (2) To examine the derived data sets for evidence of parafoveal-on-foveal processing interactions. Objective (1) has been achieved. Preliminary versions of the data sets have been prepared and the corpus will shortly be available to the eye movement community either through the Data Archive or from the applicant. Objective (2) has been completed for the English data set and the results are being prepared for publication. Analysis of the French data will be undertaken over the period October - December, 2003, in collaboration with Dr Pynte, Laboratoire "Langage et Parole", Université de Provence. Because of delays in the data-collection phase of the project it was decided to address the parafoveal-on-foveal question in parallel, using small-scale laboratory tasks which might, nonetheless, capture some aspects of the dynamics of normal reading. Two studies were conducted. In the first, participants read short sentences for comprehension. Typographical errors were introduced into the materials, but these were only present parafoveally: a contingent display procedure ensured that an error was corrected by the time the relevant text item was inspected foveally. The results show substantial parafoveal-on-foveal effects on both fixation duration and saccade targeting (Pynte, Kennedy & Ducrot, In Press). The second study involved a replication and extension of the experiment reported by Murray (1998) which purported to show parafoveal-on-foveal pragmatic effects. That is, pragmatic relations between Noun-Verb combinations in a sentential context were reflected in processing time on the noun, before the relevant verb had been directly inspected. The results of these experiments are appended to this report (Kennedy, Murray & Boissiere, In Press).

3

Page 4: End of Award Report Background - Amazon S3 · End of Award Report Background Although the primary aim of this project was technical (the development of a corpus of data) the work

Methods and Results I - Corpus collection and analysis With regard to the technical phase of the project (the bulk of the work) it is sensible to consider method and results together 1. Preparation of the texts. English texts were taken, with permission, from editorials in The Independent newspaper. A series of 20 text files were prepared for display, each comprising 40 5-line 'screens'. Text files were then analysed into a 'word object' files, (TX01WRD to TX20WRD) the format of which is shown in the attached Appendix. French texts were taken, with permission, from editorials in Le Monde newspaper. A similar series of 20 text files were prepared for display, each comprising 40 5-line 'screens'. These Text files were analysed into similar a 'Word Object' files, (FR01WRD to FR20WRD) the format of which is also shown in the Appendix. 2. Experimental Procedures

These were identical for both English- and French-speaking participants. English participants were tested in Dundee; eight of the French participants were tested in Aix-en-Provence and the remaining two were tested (in the French language throughout) in Dundee. In both laboratories the materials were presented using a high-resolution (8x16) monopitch font in white-on-black polarity on a monochrome monitor with a high-speed phosphor, running at 100Hz frame rate. Presentation of successive five-line 'screens' began with a fixation marker three characters to the left of the initial word on the first line. Once inspected, this was replaced by display of the five lines of text, double-spaced. Participants were instructed to read the materials normally and then press a button when they had finished the page of text. They were told beforehand that each text (which comprised several 'screens') would be followed by a short multiple-choice comprehension test. For a given participant, testing took place over a period of between 8-10 days, depending on the ease with which calibration could be completed.

The display monitor was interfaced to a Control Systems Artist 1 graphics card mounted in an IBM compatible computer. At the viewing distance of 500 mm, one character subtended approximately 0.3 degrees of visual angle. Eye movements were recorded from the right eye using a Dr Bouis pupil-centre computation Oculometer interfaced to a 12-bit A-D device sampling X and Y position every 1 ms. This eye tracker has a resolution of better than 0.25 characters over the 60-character calibrated range (Beauvillain & Beauvillain, 1995). A dental wax bite bar and chin rest were used to minimise head movements. The eye movement recording system was calibrated prior to the presentation of each set of three screens.

4

Page 5: End of Award Report Background - Amazon S3 · End of Award Report Background Although the primary aim of this project was technical (the development of a corpus of data) the work

Measures of fixation duration, fixation position, and intra- and inter-word saccade extent were computed off-line. The data reduction technique employed an X-Y vector version of the standard X-only system employed in Dundee for many years. This uses statistical algorithms based on the resolution of the data for each individual participant with respect to the obtained noise in a given data set. The algorithm involves three steps. An initial pass through the data for one calibration trial arrives at an estimate of point-to-point variation. This value is then used to test the separation of successive clusters of stable data (in effect, running t-tests are conducted on successive clusters). Clusters which cannot be statistically separated in this way are aggregated. Finally the data are smoothed to aggregate potential within-letter saccades3. The effective resolution of the eye-tracking system was better than one character position. Blinks lead the Bouis eye tracker to produce a systematic deviation in X and Y with many of the characteristics of a saccade, although blinks can be distinguished from true saccades by testing for a change in direction. After initial analysis of some of the data it was discovered that the combination of the high sampling rate employed and vector arithmetic (the blink artifact is more pronounced in Y than X) meant that some blinks were being treated as glides (i.e. a special class of fixation). Since it was thought helpful to retain information on blink duration, the software was modified. This was computationally complex and delayed progress on the Project for several weeks. Information on blink 'location' and duration is coded in the "???MA1.DAT" matrices discussed below. 3. Data analysis. Scores in the Comprehension Tests were > 90% for all participants. Off-line analysis of the eye movement data took place in a series of steps.

1. Binding raw X-Y data to data from calibration trials to produce an output matrix giving fixation duration in ms, X location (in an 81-chacater space), Y location (in a 61-line space) and an end-of-file flag.

2. Analysis of the Step 1 matrix to assign Y data to lines 1 - 5. In summary, this was

achieved by locating return sweeps to arrive at an initial Y estimate and then checking Y values to identify 5 or less lines (each of which might be inspected more than once).

3. Step 3 involved binding the output of the Step 2 analysis to the Word Object files

to produce a set of appropriately coded matrix files for each participant, ????MA1.DAT and ????MA2.DAT. The prefix codes participant and text (e.g. SA04MA1.DAT represents the set of data for participant A, processing text number 4). ????MA1.DAT files code the data in the order in which 'events' took place, an event being a fixation, a blink or unanalysed off-screen data. Examples are shown in the Appendix ????MA2.DAT files code the data in word order. It is important to note that the ‘fixation number’ field in this matrix may be used to cross-index data in the equivalent ????MA1.DAT file.

3 The procedure involves the detection of periods of stability, rather than the alternative, involving the detection of saccades.

5

Page 6: End of Award Report Background - Amazon S3 · End of Award Report Background Although the primary aim of this project was technical (the development of a corpus of data) the work

4. The ????MA2.DAT files produced in Step 3 are intermediate data files. Statistical analyses of these were carried out to identify temporal eye movement measures (first fixation duration, initial gaze, second pass gaze, number of fixations etc). These analyses resulted in the final ????MA3.DAT files, samples of which are to be found in the Appendix.

Results II

This section of the Report is divided into two parts. The first part briefly presents some overall parametric data on the corpus; the second deals with the question of parafoveal-on-foveal effects.

1. Parametric Data

Figure 1 shows frequency distributions for measured fixation duration in the English and French data sets. Although the French data are more variable, in both cases the values are broadly comparable with other large-scale data sets (e.g. Radach, 1996).

English

0

4000

8000

12000

16000

20000

100 200 400 600

Cas

es

All FixationsFirst Fixations

French

0

4000

8000

12000

16000

20000

100 200 400 600

Cas

es

First fixationsAll fixations

Figure 1: Distribution of fixation durations in the English and French data sets. First fixation data are plotted separately.

6

Page 7: End of Award Report Background - Amazon S3 · End of Award Report Background Although the primary aim of this project was technical (the development of a corpus of data) the work

Landing Position English (from -3 chars)

0

100

200

300

400

1 2 3 4 5 6 7 8 9 10

Landing Position

Cas

es5-char7-char9-char

Landing Position French (from -3)

0

100

200

1 2 3 4 5 6 7 8 9 10

Landing Position

Cas

es

5-char7-char9-char

Figure 2: Initial landing position distributions on 5-, 7- and 9-character 'objects' in the English and French data sets. Launch position was fixed at -3 characters. Objects comprised words in the case of the English data and words together with additional characters (e.g. <d'>, <l'>, <s'>), if any, in the case of the French data sets.

Figure 2 shows illustrative landing position distributions from a launch position of -3 characters towards 5- 7- and 9-character 'objects'. Again, these data are comparable to those reported using other large-scale data sets.

7

Page 8: End of Award Report Background - Amazon S3 · End of Award Report Background Although the primary aim of this project was technical (the development of a corpus of data) the work

2. Parafoveal-on-foveal effects. To date, analyses have been confined to the English data set. Analysis of word-on-word interactions in French are greatly complicated by the presence of numerous initial contractions (e.g. reflexive verbs, contracted articles). This raises issues with regard to eye movement control which are interesting in their own right but which need to to be systematically controlled for in the final analysis. For the purposes of this Report a single overall analysis on the English corpus will be discussed. It should be emphasised that these data are preliminary, in the sense that separate analyses at different foveal and parafoveal word lengths have yet to be completed. The data should not be cited prior to publication. Following the model adopted using Radach’s Gulliver copus in Kennedy (1998), a set of parameters was derived defining valid cases in the corpus. To be valid the following conditions were satisfied:

• Two successive words N and N+1 appeared on the same line • Word N in a pair was not the first word on a line • Neither word was associated with punctuation • Words N and N+1 differed by no more than three letters in length • Both Word N and N+1 were longer than 4 letters

The Corpus was then searched for cases where Word N was processing with exactly one fixation. These single fixation durations on Word N were categorised ‘high’ or ‘low’ with respect to median values of the Frequency, Initial Letter Informativeness and Familiarity of Word N+1. Informativeness was defined in terms of the number of words in the language (or in the Corpus) sharing a word’s initial three letters. Familiarity was assessed by computing Cumulative Lexical Frequency, i.e. the summed lexical frequency of the set of words identified in the Informativeness measure (Kennedy, Pynte & Ducrot, 2002). Categorisation was by reference to the words in the language of the same length, plus or minus one letter. Where this value was not available in the external norms, reference was made to the equivalent value for the corpus itself (the correlation between these two measures was > 0.9). The by-Items analyses in data sets of this kind present problems. One solution is to use a randomisation procedure, sampling for unique item pairs. The By-Items analyses reported here involved averaging across pairs of items sets. Although possibly not ideal, this is an adequate control for spurious effects arising from particular peculiar word-combinations. The results are summarised in Figure 1. There was a main effect of parafoveal frequency. Foveal single fixation duration overall was modulated by the lexical frequency of an unfixated parafoveal word, F1 (1,9) = 15.87, p = 0.003, F2 (1,9) = 5.69, p = 0.04. There was also a main effect of the Informativeness of the parafoveal word: when the parafoveal word shared its initial letters with few other words, foveal inspection time was shorter. The absolute size of the effct is small (6 ms), but is reliable, F1 (1,9) = 8.28, p = 0.02, F2 (1,9) = 5.24, p = 0.05. The largest and most reliable effect relates to parafoveal Familiarity. Saccades following a single fixation are launched 12 ms earlier when the cumulative lexical frequency of the associated parafoveal word was high, F1 (1,9) = 19.78, p < 0.01, F2 (1,9) = 19.26, p <0.01. However, this effect was confined to low

8

Page 9: End of Award Report Background - Amazon S3 · End of Award Report Background Although the primary aim of this project was technical (the development of a corpus of data) the work

frequency parafoveal items, with a significant Frequency x Familiarity interaction, F1 (1,9) = 5.33, p = 0.04, F2 (1,9) = 15.86, p <0.01.

180

200

220

240

Unfam Fam Unfam FamLF HF

Fix

dura

tion

UninformativeInformative

Activities During the course of the Project the principal Grant-holder spent three months at the Laboratory of Experimental Psychology, Blaise Pascal University, Clermond-Ferrand. In December 2002 he attended the CNRS symposium on "The Lexicon" held in Clermont-Ferrand. The outcome of the project was reported at the meeting of ECEM12 held in Dundee, August 20 - 24, 2003. The applicant is in regular contact with Professor Reinhold Kliegl's group in the University of Potsdam and with Dr Radach's group in the Technical University of Aachen. Outputs Kennedy, A., Murray, W. S. and Boissiere, C. (2004). Parafoveal Pragmatics re-visited.

European Journal of Cognitive Psychology. In Press. Pynte, J., Kennedy, A., & Ducrot, S. (2004). Error detection in the parafovea. European

Journal of Cognitive Psychology. In Press. Kennedy, A. (2004). On keeping word order straight. Commentary on E. D. Reichle, K.

Rayner and A. Pollatsek "The E-Z Reader Model of Eye Movement Control in Reading: Comparisons to Other Models". Behavioral and Brain Sciences. In Press.

9

Page 10: End of Award Report Background - Amazon S3 · End of Award Report Background Although the primary aim of this project was technical (the development of a corpus of data) the work

Radach, R. and Kennedy, A. (2004). Theoretical perspectives on eye movements in reading: past controversies, current deficits and an agenda for future research. European Journal of Cognitive Psychology. In Press.

Kennedy, A., Hill, R. L. & Pynte, J. (2003). The Dundee Corpus. Poster presented at 12th European Conference on Eye Movements, Dundee, August 2003.

Pynte, J., Ducrot, S. and Kennedy, A. (2003). Within-word eye movements and the lexical decision task. Poster presented at 12th European Conference on Eye Movements, Dundee, August 2003.

Kennedy, A., Pynte, J. and Hill, R. L. (In preparation). Exploring parafoveal-on-foveal effects in normal reading.

Impacts Following ECEM12, there have been initial expressions of interest in the Corpus from Dr Engbert and Professor Kleigl, Potsdam; Dr. McDonald, Edinburgh; Dr Beauvillain and Dr Vitu, Paris; and Dr Yang, Brain Science Institute RIKEN. Future Research Priorities 1. Completion of the investigation of parafoveal-on-foveal effects in the French corpus. 2. Analysis of the modulation of landing position as a function of punctuation in French and English. 3. Collaborative work with Dr Pynte, capitalising on the fact that the French Corpus has been syntactically tagged. 4. Extension of the corpus to include German and Spanish data. Ethics and Confidentiality Participants were paid to take part in the study. It was emphasised that they were free to discontinue the work at any time. No personal data on any participant were retained. No other ethical issues arose in the course of the work. Data Archive and Regard The full data sets (two CD disks) have been offered to the Data Archive. Regard entries are up to date.

10

Page 11: End of Award Report Background - Amazon S3 · End of Award Report Background Although the primary aim of this project was technical (the development of a corpus of data) the work

References Altarriba, J., Kambe, G., Pollatsek, A. & Rayner, K. (2001). Semantic codes are not used

in integrating information across eye fixations in reading: evidence from fluent Spanish English bilinguals. Perception & Psychophysics, 63, 875-890.

Beauvillain, C. & Beauvillain, P. (1995). Calibration of an eye movement system for use in reading. Behavior Research Methods, Instruments and Computers. 55, 1- 17.

Carpenter, P. A. & Just, M. A. (1983). What your eyes do while your mind is reading. In K. Rayner (Ed.), Eye Movements in Reading: Perceptual and Language Processes (pp. 275-305). New York: Academic Press.

Engbert, R., Longtin, A., & Kliegl, R. (2002). A dynamical model of saccade generation in reading based on spatially distributed lexical processing. Vision Research 42, 621-636.

Henderson, J. M. (1992).Visual attention and eye movement control during reading and picture viewing. In K. Rayner (Ed.). Eye Movements and Visual Cognition: Scene Perception and Reading ( pp. 260 - 283). New York: Springer-Verlag.

Henderson, J. M. & Ferreira, F. (1990). Effects of foveal processing difficulty on the perceptual span in reading: Implications for attention and eye movement control. Journal of Experimental Psychology: Learning, Memory and Cognition, 16, 417 - 429.

Henderson, J. M. & Ferreira, F. (1993). Eye movement control during reading: Fixation measures foveal but not parafoveal processing difficulty. Canadian Journal of Experimental Psychology, 47, 201 - 221.

Inhoff, A. W., Radach, R., Starr, M. & Greenberg, S. (2000). Allocation of Visuo-Spatial Attention and Saccade Programming during Reading. In A. Kennedy, R. Radach, D. Heller and J. Pynte (Eds.). Reading as a Perceptual Process. Oxford: Elsevier.

Inhoff, A. W., Starr, M.& Shindler, K .L. Is the processing or words in reading strictly serial? Perception & Psychophysics, 40, 431-439.Jacobs, A. M. (2000). Five questions about cognitive models and some answers from three models of reading. In A. Kennedy, R. Radach, D. Heller and J. Pynte (Eds.), Reading as a Perceptual Process, Oxford: Elsevier, 721-732.

Kennedy, A. (1995). The influence of parafoveal words on foveal inspection time. AMLaP-95 Conference, Edinburgh, 1995.

Kennedy, A. (1998). The influence of parafoveal words on foveal inspection time: evidence for a processing trade-off. In G. Underwood (Ed.). Eye guidance in reading and scene perception ( pp. 149-223). Oxford: Elsevier.

Kennedy, A. (2000). Parafoveal processing in word recognition. Quarterly Journal of Experimental Psychology, 53A, 429 - 455.

Kennedy, A., Pynte, J. & Ducrot, S. (2002). Parafoveal-on-foveal interactions in word recognition. Quarterly Journal of Experimental Psychology, 55A (4), 1307-1337.

Kennedy, A., Murray, W. S. & Boissiere (2004). Parafoveal pragmatics revisted. European Journal of Cognitive Psychology, In Press.

Legge, G. E., Klitz, T. S. & Tjan, B. S. (1997). Mr Chips: an Ideal-Observer model of reading. Psychological Review, 104, 524-553.

Morrison, R. E. (1984). Manipulation of stimulus onset delay in reading: Evidence for parallel programming of saccades. Journal of Experimental Psychology: Human Perception and Performance, 10, 667 - 682.

11

Page 12: End of Award Report Background - Amazon S3 · End of Award Report Background Although the primary aim of this project was technical (the development of a corpus of data) the work

Murray, W. S. (1998). Parafoveal pragmatics. In G. Underwood (Ed.). Eye guidance in reading and scene perception ( pp. 181 - 199). Oxford: Elsevier.

O'Regan, J. K. (1990). Eye movements and reading. In E. Kowler (Ed.), Reviews of Oculomotor Research, Vol. 4: Eye Movements and their Role in Visual and Cognitive Processes. Amsterdam: Elsevier, 395-453.

O’Regan, J. K. (1992). Optimal viewing position in words and the strategy-tactics theory of eye movements in reading. In K. Rayner (Ed.), Eye movements and visual cognition: Scene perception and reading. New York: Springer. 333-354.

O’Regan, J. K. & Lévy-Schoen, A. (1987). Eye-movement strategy and tactics in word recognition and reading. In M. Coltheart (Ed.) Attention and Performance, Vol. 12, The Psychology of Reading. London: Erlbaum. 363-383).

O'Regan, J. K., Vitu, F., Radach, R., & Kerr, P.W. (1994). Effects of local processing and oculomotor factors in eye movement guidance in reading. In J. Ygge & G. Lennerstrand (Eds.), Eye Movements in Reading. Oxford, Pergamon Press. 329-348.

Pynte, J., Kennedy, A. & Ducrot, S. (2004). Error-detection in the parafovea. European Journal of Cognitive psychology, In Press.

Radach, R. (1996). Blickbewegungen beim Lesen: Psychologische Aspekte der Determination von Fixationspositionen (Eye movements in Reading). Münster/New York: Waxmann.

Rayner, K. (1998). Eye movements in reading and information processing: 20 years of research. Psychological Bulletin, 124, 372-422.

Rayner, K, Balota, D. A. & Pollatsek, A. (1986). Against parafoveal semantic preprocessing during eyefixations in reading. Canadian Journal of Psychology, 40, 473-483.

Rayner, K., Fischer, M. F. & Pollatsek, A. (1998). Unspaced text interferes with both word identification and eye movement control, Vision Research, 38, 1129 - 1144.

Rayner, K., Sereno, S. C. & Raney, G. E. (1996). Eye movement control in reading: A comparison of two types of models. Journal of Experimental Psychology: Human Perception and Performance, 22, 1188-1200.

Rayner, K., White, S., Kambe, G., Miller, B. & Liversedge, S. (2003). On the processing of meaning from parafoveal vision during eye fixations in reading. In Hyönä, J., Radach, R. & Deubel, H. (Eds.) (2003). The Mind's Eye: Cognitive and Applied Aspects of Eye Movement Research. (pp. 213-234). Elsevier Science, Oxford.

Reichle, E.D., Pollatsek, A., Fisher, D.L., & Rayner, K. (1998). Toward a model of eye movement control in reading. Psychological Review, 105, 125-157.

Reichle, E. D., Rayner, K., & Pollatsek, A. (2003, in press). The E-Z reader model of eye movement control in reading: comparisons to other models. Behavioral and Brain Sciences.

Schroyens, W., Vitu, F., Brysbaert, M. & d’Ydewalle, G. (1999). Visual attention and eye-movement control during reading: The case of parafoveal processing. Quarterly Journal of Experimental Psychology, 52A, 1021 - 1046.

Underwood, G., Binns, A. & Walker, S. (2000). Attentional demands on the processing of neighbouring words. In A. Kennedy, R. Radach, D. Heller and J. Pynte (Eds.). Reading as a Perceptual Process ( pp. 247 - 268). Oxford: Elsevier.

White, S. & Liversedge, S. (2004). Orthographic familiarity influences initial fixations in reading. European Journal of Cognitive Psychology, In Press.

12

Page 13: End of Award Report Background - Amazon S3 · End of Award Report Background Although the primary aim of this project was technical (the development of a corpus of data) the work

APPENDIX - SAMPLE MATERIALS (a) English Fragment of Text:

Are tourists enticed by these attractions threatening their very existence? > The two young sea-lions took not the slightest interest in our arrival. They > were playing on the jetty, rolling over and tumbling into the water together, > entirely ignoring the human beings edging awkwardly round them. Our party then > had to step gingerly over hundreds of marine iguanas basking on the rocks, > strange creatures with richly coloured velvet coats, looking like something > between a giant lizard...

NB: The mark " > " was not displayed. It was used to trigger CR/LF on the Manitron display (which is a point-plotting device). The format of the Word Object file for this fragment, together with field identifiers, follows:

13

Page 14: End of Award Report Background - Amazon S3 · End of Award Report Background Although the primary aim of this project was technical (the development of a corpus of data) the work

Derived 'Word Object' File "TX01WRD.DAT" Are 1 1 1 1 1 0 3 3 0 0 0 1 351 4393 -99 -99 -99 -99 tourists 1 1 1 2 2 4 8 8 0 0 0 2 3 12 15 137 2 13 enticed 1 1 1 3 3 13 7 7 0 0 0 3 1 -99 24 573 2 99 by 1 1 1 4 4 21 2 2 0 0 0 4 252 5306 -99 -99 -99 -99 these 1 1 1 5 5 24 5 5 0 0 0 5 73 1573 28 14064 3 4352 attractions 1 1 1 6 6 30 11 11 0 0 0 6 2 9 20 175 3 14 threatening 1 1 1 7 7 42 11 11 0 0 0 7 3 26 34 248 1 26 their 1 1 1 8 8 54 5 5 0 0 0 8 225 2670 28 14064 1 2670 very 1 1 1 9 9 60 4 4 0 0 0 9 56 798 -99 -99 -99 -99 existence? 1 1 1 10 10 65 10 9 5 0 1 10 4 107 4 174 1 107 The 1 1 2 1 11 0 3 3 0 0 0 11 3511 69971 -99 -99 -99 -99 two 1 1 2 2 12 4 3 3 0 0 0 12 43 1412 -99 -99 -99 -99 young 1 1 2 3 13 8 5 5 0 0 0 13 43 385 10 1777 1 385 sea-lions 1 1 2 4 14 14 9 9 0 0 0 14 1 -99 29 88 1 2 took 1 1 2 5 15 24 4 4 0 0 0 15 17 426 -99 -99 -99 -99 not 1 1 2 6 16 29 3 3 0 0 0 16 277 4609 -99 -99 -99 -99 the 1 1 2 7 17 33 3 3 0 0 0 17 3511 69971 -99 -99 -99 -99 slightest 1 1 2 8 18 37 9 9 0 0 0 18 3 13 13 123 2 14 interest 1 1 2 9 19 47 8 8 0 0 0 19 17 330 70 1081 2 332 in 1 1 2 10 20 56 2 2 0 0 0 20 1091 21345 -99 -99 -99 -99 our 1 1 2 11 21 59 3 3 0 0 0 21 84 1252 -99 -99 -99 -99 arrival. 1 1 2 12 22 63 8 7 1 0 1 22 2 23 21 253 1 23 They 1 1 2 13 23 72 4 4 0 0 0 23 258 3620 -99 -99 -99 -99 were 1 1 3 1 24 0 4 4 0 0 0 24 106 3284 -99 -99 -99 -99 playing 1 1 3 2 25 5 7 7 0 0 0 25 6 101 71 1254 2 128 on 1 1 3 3 26 13 2 2 0 0 0 26 417 6745 -99 -99 -99 -99 the 1 1 3 4 27 16 3 3 0 0 0 27 3511 69971 -99 -99 -99 -99 jetty, 1 1 3 5 28 20 6 5 2 0 1 28 1 -99 1 4 0 0 rolling 1 1 3 6 29 27 7 7 0 0 0 29 1 19 9 78 1 19 over 1 1 3 7 30 35 4 4 0 0 0 30 76 1236 -99 -99 -99 -99 and 1 1 3 8 31 40 3 3 0 0 0 31 1283 28860 -99 -99 -99 -99 tumbling 1 1 3 9 32 44 8 8 0 0 0 32 1 3 6 21 1 3 into 1 1 3 10 33 53 4 4 0 0 0 33 82 1793 -99 -99 -99 -99 the 1 1 3 11 34 58 3 3 0 0 0 34 3511 69971 -99 -99 -99 -99 water 1 1 3 12 35 62 5 5 0 0 0 35 6 442 6 615 1 442 together, 1 1 3 13 36 68 9 8 2 0 1 36 5 267 1 267 1 267 entirely 1 1 4 1 37 0 8 8 0 0 0 37 9 91 25 421 3 100

14

Page 15: End of Award Report Background - Amazon S3 · End of Award Report Background Although the primary aim of this project was technical (the development of a corpus of data) the work

ignoring 1 1 4 2 38 9 8 8 0 0 0 38 1 4 10 76 1 4 the 1 1 4 3 39 18 3 3 0 0 0 39 3511 69971 -99 -99 -99 -99 human 1 1 4 4 40 22 5 5 0 0 0 40 25 299 15 396 1 299 beings 1 1 4 5 41 28 6 6 0 0 0 41 4 36 6 753 1 36 edging 1 1 4 6 42 35 6 6 0 0 0 42 1 5 5 52 1 5 awkwardly 1 1 4 7 43 42 9 9 0 0 0 43 1 5 1 5 1 5 round 1 1 4 8 44 52 5 5 0 0 0 44 11 81 11 216 1 81 them. 1 1 4 9 45 58 5 4 1 0 1 45 80 1789 -99 -99 -99 -99 Our 1 1 4 10 46 64 3 3 0 0 0 46 84 1252 -99 -99 -99 -99 party 1 1 4 11 47 68 5 5 0 0 0 47 22 216 33 1269 3 219 then 1 1 4 12 48 74 4 4 0 0 0 48 71 1377 -99 -99 -99 -99 had 1 1 5 1 49 0 3 3 0 0 0 49 130 5133 -99 -99 -99 -99 to 1 1 5 2 50 4 2 2 0 0 0 50 1597 26162 -99 -99 -99 -99 step 1 1 5 3 51 7 4 4 0 0 0 51 5 131 -99 -99 -99 -99 gingerly 1 1 5 4 52 12 8 8 0 0 0 52 1 2 6 13 1 2 over 1 1 5 5 53 21 4 4 0 0 0 53 76 1236 -99 -99 -99 -99 hundreds 1 1 5 6 54 26 8 8 0 0 0 54 6 44 14 283 2 45 of 1 1 5 7 55 35 2 2 0 0 0 55 1757 36411 -99 -99 -99 -99 marine 1 1 5 8 56 38 6 6 0 0 0 56 3 55 88 1019 3 77 iguanas 1 1 5 9 57 45 7 7 0 0 0 57 2 -99 0 0 0 0 basking 1 1 5 10 58 53 7 7 0 0 0 58 2 2 27 169 2 4 on 1 1 5 11 59 61 2 2 0 0 0 59 417 6745 -99 -99 -99 -99 the 1 1 5 12 60 64 3 3 0 0 0 60 3511 69971 -99 -99 -99 -99 rocks, 1 1 5 13 61 68 6 5 2 0 1 61 1 23 9 136 1 23 strange 1 2 1 1 1 0 7 7 0 0 0 62 8 84 88 1835 2 85 creatures 1 2 1 2 2 8 9 9 0 0 0 63 3 20 28 230 5 27 with 1 2 1 3 3 18 4 4 0 0 0 64 393 7291 -99 -99 -99 -99 richly 1 2 1 4 4 23 6 6 0 0 0 65 2 5 12 103 2 6 coloured 1 2 1 5 5 30 8 8 0 0 0 66 1 1 79 737 4 4 velvet 1 2 1 6 6 39 6 6 0 0 0 67 1 4 7 12 1 4 coats, 1 2 1 7 7 46 6 5 2 0 1 68 2 10 13 205 2 18 looking 1 2 1 8 8 53 7 7 0 0 0 69 13 173 19 602 3 186 like 1 2 1 9 9 61 4 4 0 0 0 70 99 1290 -99 -99 -99 -99 something 1 2 1 10 10 66 9 9 0 0 0 71 29 450 18 960 1 450 between 1 2 2 1 11 0 7 7 0 0 0 72 42 730 13 1179 1 730 a 1 2 2 2 12 8 1 1 0 0 0 73 1434 23248 -99 -99 -99 -99

15

Page 16: End of Award Report Background - Amazon S3 · End of Award Report Background Although the primary aim of this project was technical (the development of a corpus of data) the work

Fields: All Columns Missing values coded as -99 Column 1 Text word with punctuation, if any, attached Column 2 The Text Number (1 – 20) in which this particular Text Word appears Column 3 Screen number. The screen display (1- 40) for this Text Number, usually

containing 5 lines, in which the Text Word is displayed Column 4 The line number on the screen display containing this Text Word Column 5 The position of the Text Word on its line Column 6 The serial number of the Text Word in a screen Column 7 The initial letter position of the Text Word on an 81-character line (the

space to the left of each word counting as its initial letter) Column 8 The length of the displayed Object. An ‘Object’ is the displayed Text

Word together with any initial or final punctuation Column 9 The legth of the displayed Text Word (exclusive of attached puntuation) Column 10 Punctuation coded numerically: 1 = full stop 2 = comma 3 = semicolon 4 = colon 5 = question mark 6 = exclamation mark 7 = apostrophe (e.g. word contains a contraction) 8 = opening double quotation mark 9 = closing double quotation mark 10 = not currently used 11 = opening bracket 12 = closing bracket Column 11 Number of additional characters before Text Word (e.g. opening

punctuation) Column 12 Number of additional characters following Text Word (e.g. closing

punctuation) Column 13 Index number of Text Word in this Text Column 14 Frequency of occurrence of this Text Word in all texts (i.e. ‘local

frequency’) Column 15 Estimate of Lexical Frequency using Kucera-Francis count Column 16 ‘Informativeness’ defined as number of words in the Kucera-Francis count

(plus or minus one character in length) sharing initial trigram (only computed for words > 4 characters)

Column 17 Cumulative lexical frequency (‘Familiarity') of words identified in Column 16

Column 18 Number of words in the Kucera-Francis count (plus or minus one character in length) sharing initial trigram and final letter (only computed for words > 4 characters in length )

Column 19 Cumulative lexical frequency (‘Familiarity') of words identified in Column 18

16

Page 17: End of Award Report Background - Amazon S3 · End of Award Report Background Although the primary aim of this project was technical (the development of a corpus of data) the work

(b) French Fragment of Text:

Après quarante ans d'encadrement, l'heure a peut-être sonné pour l'industrie > pharmaceutique française de retrouver bientôt sa liberté de manoeuvre. Enfermés > dans le carcan des prix, les laboratoires recherchent à l'étranger la > rémunération de leurs efforts en y délocalisant leurs activités. Le phénomène > s'accélère. Au cours de sa conférence traditionnelle de début d'année, mardi 16 >

NB: French accents were coded in the ARTIST display code using non-standard code (e.g. the accents are not displayed correctly in editing software). This caused some inconvenience in developing search routines. The solution was to convert the non-standard text only at the output stage. Relatively low-level editors (e.g. Microsoft WORDPAD) operating on the output matrices will honour the accented characters, although some word-processing software may not. The format of the Word Object file for part of the above fragment, together with field identifiers, follows:

17

Page 18: End of Award Report Background - Amazon S3 · End of Award Report Background Although the primary aim of this project was technical (the development of a corpus of data) the work

Derived 'Word Object' File "FR01WRD.DAT" après 1 1 1 1 1 0 5 5 0 0 0 1 62 -2 1 -2 1 -2 quarante 1 1 1 2 2 6 8 8 0 0 0 2 8 -2 21 29486 3 13 ans 1 1 1 3 3 15 3 3 0 0 0 3 85 -99 -99 -99 -99 -99 d'encadrement, 1 1 1 4 4 19 14 11 2 22 1 4 -99 361 26 7400 1 361 l'heure 1 1 1 5 5 34 7 5 0 22 0 5 7 86928 3 88407 1 86928 a 1 1 1 6 6 42 1 1 0 0 0 6 408 2531 -99 -99 -99 -99 peut-être 1 1 1 7 7 44 9 9 0 0 0 7 10 72315 7 73845 1 72315 sonn‚ 1 1 1 8 8 54 5 5 0 0 0 8 3 987 11 37323 1 987 pour 1 1 1 9 9 60 4 4 0 0 0 9 371 -2 -99 -99 -99 -99 l'industrie 1 1 1 10 10 65 11 9 0 22 0 10 2 4326 61 48557 11 8978 pharmaceutique 1 1 2 1 11 0 14 14 0 0 0 11 4 170 0 0 0 0 française 1 1 2 2 12 15 9 9 0 0 0 12 10 -1 36 20922 8 4135 de 1 1 2 3 13 25 2 2 0 0 0 13 2779 -2 -99 -99 -99 -99 retrouver 1 1 2 4 14 28 9 9 0 0 0 14 4 27640 20 60871 4 43418 bientôt 1 1 2 5 15 38 7 7 0 0 0 15 5 16013 6 17135 1 16013 sa 1 1 2 6 16 46 2 2 0 0 0 16 106 340691 -99 -99 -99 -99 liberté 1 1 2 7 17 49 7 7 0 0 0 17 11 20369 12 24203 2 20411 de 1 1 2 8 18 57 2 2 0 0 0 18 2779 -2 -99 -99 -99 -99 manoeuvre. 1 1 2 9 19 60 10 9 1 0 1 19 2 3573 55 17157 16 6737 enfermés 1 1 2 10 20 71 8 8 0 0 0 20 1 -99 19 29348 0 0 dans 1 1 3 1 21 0 4 4 0 0 0 21 425 838996 -99 -99 -99 -99 le 1 1 3 2 22 5 2 2 0 0 0 22 1198 -2 -99 -99 -99 -99 carcan 1 1 3 3 23 8 6 6 0 0 0 23 2 93 51 20199 6 1618 des 1 1 3 4 24 15 3 3 0 0 0 24 874 -2 -99 -99 -99 -99 prix, 1 1 3 5 25 19 5 4 2 0 1 25 15 13396 -99 -99 -99 -99 les 1 1 3 6 26 25 3 3 0 0 0 26 1103 -2 -99 -99 -99 -99 laboratoires 1 1 3 7 27 29 12 12 0 0 0 27 14 -99 2 2590 0 0 recherchent 1 1 3 8 28 42 11 11 0 0 0 28 1 -99 25 46447 3 777 … 1 1 3 9 29 54 1 1 0 0 0 29 1029 2010176 -99 -99 -99 -99 l'étranger 1 1 3 10 30 56 10 8 0 22 0 30 2 -2 17 28250 3 125 la 1 1 3 11 31 67 2 2 0 0 0 31 1532 -2 -99 -99 -99 -99 rémunération 1 1 4 1 32 0 12 12 0 0 0 32 4 276 3 914 1 276 de 1 1 4 2 33 13 2 2 0 0 0 33 2779 -2 -99 -99 -99 -99 leurs 1 1 4 3 34 16 5 5 0 0 0 34 60 -99 2 202 0 0 efforts 1 1 4 4 35 22 7 7 0 0 0 35 7 -99 26 50974 0 0 en 1 1 4 5 36 30 2 2 0 0 0 36 696 -2 -99 -99 -99 -99 y 1 1 4 6 37 33 1 1 0 0 0 37 64 -2 -99 -99 -99 -99 délocalisant 1 1 4 7 38 35 12 12 0 0 0 38 1 -99 16 4973 4 1509 leurs 1 1 4 8 39 48 5 5 0 0 0 39 60 -99 2 202 0 0 activités. 1 1 4 9 40 54 10 9 1 0 1 40 6 -99 9 19646 0 0 le 1 1 4 10 41 65 2 2 0 0 0 41 1198 -2 -99 -99 -99 -99 phénomène 1 1 4 11 42 68 9 9 0 0 0 42 5 8449 4 8872 2 8461 s'accélère. 1 1 5 1 43 0 11 8 1 22 1 43 -99 -99 45 68424 3 530

18

Page 19: End of Award Report Background - Amazon S3 · End of Award Report Background Although the primary aim of this project was technical (the development of a corpus of data) the work

au 1 1 5 2 44 12 2 2 0 0 0 44 295 532059 -99 -99 -99 -99 cours 1 1 5 3 45 15 5 5 0 0 0 45 17 14719 43 170466 2 14719 de 1 1 5 4 46 21 2 2 0 0 0 46 2779 -2 -99 -99 -99 -99 sa 1 1 5 5 47 24 2 2 0 0 0 47 106 340691 -99 -99 -99 -99 conférence 1 1 5 6 48 27 10 10 0 0 0 48 11 3688 309 406105 43 68273 traditionnelle 1 1 5 7 49 38 14 14 0 0 0 49 3 1629 0 0 0 0 de 1 1 5 8 50 53 2 2 0 0 0 50 2779 -2 -99 -99 -99 -99 début 1 1 5 9 51 56 5 5 0 0 0 51 23 12341 6 18254 3 16496 d'année, 1 1 5 10 52 62 8 5 2 22 1 52 15 32490 5 34533 1 32490 mardi 1 1 5 11 53 71 5 5 0 0 0 53 24 1995 35 60277 2 2054

19

Page 20: End of Award Report Background - Amazon S3 · End of Award Report Background Although the primary aim of this project was technical (the development of a corpus of data) the work

Fields All Columns Missing values coded as -99 Column 1 Text word with punctuation, if any, attached Column 2 The Text Number (1 – 20) in which this particular Text Word appears Column 3 Screen number. The screen displays, usually containing 5 lines, in which

the Text Word is displayed Column 4 The line number on the screen display containing the this Text Word Column 5 The position of the Text Word on its line Column 6 The serial number of the Text Word in a screen Column 7 The initial letter position of the Text Word on an 81-character line (the

space to the left of each word counts as its initial letter) Column 8 The length of the displayed Object. An ‘object’ is the displayed Text

Word together with any initial or final punctuation Column 9 The legth of the displayed Text Word (exclusive of attached puntuation) Column 10 Punctuation coded numerically: 1 = full stop 2 = comma 3 = semicolon 4 = colon 5 = question mark 6 = exclamation mark 7 = apostrophe (e.g. word contains a contraction) 8 = opening double quotation mark 9 = closing double quotation mark 10 = not currently used 11 = opening bracket 12 = closing bracket Column 11 Number of additional characters before Text Word (e.g. opening

punctuation etc). The special cases of < d’ > , <l’> , <s'> coded with a value > 20. I.e. 22 codes two additional characters, taking the form <letter> plus <’>

Column 12 Number of additional characters following Text Word (e.g. closing punctuation etc)

Column 13 Index number of Text Word in this Text Column 14 Frequency of occurrence of this Text Word in all texts (i.e. ‘local

frequency’ Column 15 Estimate of Lexical Frequency using the Brulex frequency count Column 16 ‘Informativeness’, defined as number of words (plus or minus one

character in length) in the Brulex count sharing initial trigram (only computed for words > 4 characters)

Column 17 Cumulative lexical frequency (‘Familiarity') of words identified in Column 16

Column 18 Number of words in the Brulex count (plus or minus one character in length) sharing initial trigram and final letter (only computed for words > 4 characters in length)

20

Page 21: End of Award Report Background - Amazon S3 · End of Award Report Background Although the primary aim of this project was technical (the development of a corpus of data) the work

Column 19 Cumulative lexical frequency (‘Familiarity') of words identified in Column 18

21

Page 22: End of Award Report Background - Amazon S3 · End of Award Report Background Although the primary aim of this project was technical (the development of a corpus of data) the work

Format of SA01MA1.DAT - English Data WORD TEXT LINE OLEN WLEN XPOS WNUM FDUR OBLP WDLP LAUN TXFR BRFR INF1 FAM1 INF2 FAM2 Are 1 1 3 3 1 1 216 1 1 0 351 4393 -99 -99 -99 -99 tourists 1 1 8 8 6 2 156 2 2 -5 3 12 15 137 2 13 enticed 1 1 7 7 17 3 227 4 4 -11 1 -99 24 573 2 99 these 1 1 5 5 25 5 187 1 1 -8 73 1573 28 14064 3 4352 attractions 1 1 11 11 33 6 182 3 3 -8 2 9 20 175 3 14 threatening 1 1 11 11 44 7 96 2 2 -11 3 26 34 248 1 26 *Blink 1 -99 11 11 -99 -99 82 -99 -99 -99 -99 -99 -99 -99 -99 -99 threatening 1 1 11 11 52 7 232 10 10 -99 3 26 34 248 1 26 very 1 1 4 4 62 9 335 2 2 -10 56 798 -99 -99 -99 -99 their 1 1 5 5 57 8 168 3 3 5 225 2670 28 14064 1 2670 existence? 1 1 10 9 65 10 173 0 0 -8 4 107 4 174 1 107 existence? 1 1 10 9 71 10 188 6 6 -6 4 107 4 174 1 107 existence? 1 1 10 9 72 10 88 7 7 -1 4 107 4 174 1 107 enticed 1 1 7 7 19 3 174 6 6 53 1 -99 24 573 2 99 these 1 1 5 5 29 5 168 5 5 -10 73 1573 28 14064 3 4352 these 1 1 5 5 28 5 170 4 4 1 73 1573 28 14064 3 4352 attractions 1 1 11 11 36 6 271 6 6 -8 2 9 20 175 3 14 Format of SA01MA1.DAT - French Data WORD TEXT LINE OLEN WLEN XPOS WNUM FDUR OBLP WDLP LAUN TXFR BRFR INF1 FAM1 INF2 FAM2 quarante 1 1 8 8 7 2 136 1 1 0 8 -2 21 29486 3 13 quarante 1 1 8 8 14 2 132 8 8 -7 8 -2 21 29486 3 13 encadrement, 1 1 14 11 27 4 130 10 8 -13 -99 361 26 7400 1 361 encadrement, 1 1 14 11 28 4 74 11 9 -1 -99 361 26 7400 1 361 heure 1 1 7 5 40 5 231 8 6 -12 7 86928 3 88407 1 86928 encadrement, 1 1 14 11 33 4 448 16 14 7 -99 361 26 7400 1 361 encadrement, 1 1 14 11 20 4 403 3 1 13 -99 361 26 7400 1 361 heure 1 1 7 5 34 5 203 2 0 -14 7 86928 3 88407 1 86928 peut-être 1 1 9 9 44 7 135 0 0 -10 10 72315 7 73845 1 72315 heure 1 1 7 5 40 5 238 8 6 4 7 86928 3 88407 1 86928 sonné 1 1 5 5 57 8 160 3 3 -17 3 987 11 37323 1 987 *Blink 1 -99 5 5 -99 -99 141 -99 -99 -99 -99 -99 -99 -99 -99 -99 année, 1 5 8 5 66 52 134 6 4 -99 15 32490 5 34533 1 32490 pour 1 1 4 4 64 9 75 4 4 -99 371 -2 -99 -99 -99 -99 industrie 1 1 11 9 66 10 91 3 1 -2 2 4326 61 48557 11 8978 industrie 1 1 11 9 66 10 109 3 1 0 2 4326 61 48557 11 8978 *Off-screen 1 -99 11 9 -99 -99 241 -99 -99 -99 -99 -99 -99 -99 -99 -99 *Off-screen 1 -99 11 9 -99 -99 129 -99 -99 -99 -99 -99 -99 -99 -99 -99 *Off-screen 1 -99 11 9 -99 -99 354 -99 -99 -99 -99 -99 -99 -99 -99 -99

22

Page 23: End of Award Report Background - Amazon S3 · End of Award Report Background Although the primary aim of this project was technical (the development of a corpus of data) the work

Fields in ????MA1.DAT matrices. The initial two characters of the file name codes the participant contributing the data. For the English data set these are sa, sb, sc, sd, se, sf, sg, sh, si, sj . For the French data set these are sa,sb,sc,sd,se,sf,sg,sh, sx,sy (sx and sy were French participants tested in Dundee in the French language) The second two characters codes the Text File processed (01 to 20). The remaining characters identify the particular format of the data matrix. E.g. File SB04MA1.DAT contains the processed data for participant ‘B’ reading Text File 4, with the data in 'event' order (MA1). In ????MA1.DAT files, data are recorded in the order in which successive events (blinks, fixations, ‘OFF-SCREEN’ etc) occurred. ????MA1.DAT files contain the following Header line: WORD TEXT LINE OLEN WLEN XPOS WNUM FDUR OBLP WDLP LAUN TXFR BRFR INF1 FAM1 INF2 FAM2 WORD Text Word (displayed only if fixated, in the order fixated, and displayed as

many times as fixated) TEXT Screen Number (1 - 40) for the identified Text File 1 – 20 LINE Line processed OLEN Length of fixated ‘object’ i.e. word plus any additional characters WLEN Length of the fixated Text Word XPOS Position of fixation in characters WNUM Index number of word in relevant Text File FDUR Duration of this fixation OBLP Landing position on the defined ‘object’ WDLP Landing position on the defined word LAUN Length of the saccade resulting in this landing position (NB – this is not

‘Launch position’, which is usually defined relative to the initial character of the word fixated. Neither is it invariably ‘entry saccade extent’.)

TXFR Local text frequency BRFR Estimated Frequency from norms INF1 ‘Informativeness’ (see Column 16 in Text File) FAM1 ‘Familiarity’ (see Column 17 in Text File) INF2 ‘Informativeness’ (see Column 18 in Text File) FAM2 ‘Familiarity’ (see Column 19 in Text File)

23

Page 24: End of Award Report Background - Amazon S3 · End of Award Report Background Although the primary aim of this project was technical (the development of a corpus of data) the work
Page 25: End of Award Report Background - Amazon S3 · End of Award Report Background Although the primary aim of this project was technical (the development of a corpus of data) the work

Format of SA01MA3.DAT - English Data WORD TEXT LINE OLEN WLEN XPOS WNUM FDUR OBLP WDLP FXNO TXFR EXFR INF1 FAM1 INF2 FAM2 FFIX GAZ1 GAZ2 GAZ3 Are 1 1 3 3 1 1 216 1 1 1 351 4393 -99 -99 -99 -99 216 0 0 0 tourists 1 1 8 8 6 2 156 2 2 2 3 12 15 137 2 13 156 0 0 0 enticed 1 1 7 7 17 3 227 4 4 3 1 -99 24 573 2 99 227 227 174 0 enticed 1 1 7 7 19 3 174 6 6 14 1 -99 24 573 2 99 0 0 0 0 by -99 0 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 these 1 1 5 5 25 5 187 1 1 4 73 1573 28 14064 3 4352 187 187 170 0 these 1 1 5 5 28 5 170 4 4 16 73 1573 28 14064 3 4352 0 0 0 0 attractions 1 1 11 11 33 6 182 3 3 5 2 9 20 175 3 14 182 182 359 0 attractions 1 1 11 11 36 6 271 6 6 17 2 9 20 175 3 14 0 0 0 0 attractions 1 1 11 11 34 6 88 4 4 18 2 9 20 175 3 14 0 0 0 0 threatening 1 1 11 11 44 7 96 2 2 6 3 26 34 248 1 26 96 96 232 232 threatening 1 1 11 11 52 7 232 10 10 8 3 26 34 248 1 26 0 0 0 0 threatening 1 1 11 11 46 7 232 4 4 19 3 26 34 248 1 26 0 0 0 0 their 1 1 5 5 57 8 168 3 3 10 225 2670 28 14064 1 2670 168 0 0 0 very 1 1 4 4 62 9 335 2 2 9 56 798 -99 -99 -99 -99 335 335 202 0 very 1 1 4 4 61 9 202 1 1 20 56 798 -99 -99 -99 -99 0 0 0 0 Format of SA01MA3.DAT - French Data WORD TEXT LINE OLEN WLEN XPOS WNUM FDUR OBLP WDLP FXNO TXFR EXFR INF1 FAM1 INF2 FAM2 FFIX GAZ1 GAZ2 GAZ3 après 1 -99 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 quarante 1 1 8 8 7 2 136 1 1 1 8 -2 21 29486 3 13 136 268 0 0 quarante 1 1 8 8 14 2 132 8 8 2 8 -2 21 29486 3 13 0 0 0 0 ans 1 -99 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 d'encadrement, 1 1 14 11 27 4 130 10 8 3 -99 361 26 7400 1 361 130 204 851 0 d'encadrement, 1 1 14 11 28 4 74 11 9 4 -99 361 26 7400 1 361 0 0 0 0 d'encadrement, 1 1 14 11 33 4 448 16 14 6 -99 361 26 7400 1 361 0 0 0 0 d'encadrement, 1 1 14 11 20 4 403 3 1 7 -99 361 26 7400 1 361 0 0 0 0 l'heure 1 1 7 5 40 5 231 8 6 5 7 86928 3 88407 1 86928 231 231 203 238 l'heure 1 1 7 5 34 5 203 2 0 8 7 86928 3 88407 1 86928 0 0 0 0 l'heure 1 1 7 5 40 5 238 8 6 10 7 86928 3 88407 1 86928 0 0 0 0 a 1 -99 0 0 0 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 peut-être 1 1 9 9 44 7 135 0 0 9 10 72315 7 73845 1 72315 135 0 0 0 sonné 1 1 5 5 57 8 160 3 3 11 3 987 11 37323 1 987 160 0 0 0 pour 1 1 4 4 64 9 75 4 4 14 371 -2 -99 -99 -99 -99 75 0 0 0 l'industrie 1 1 11 9 66 10 91 3 1 15 2 4326 61 48557 11 8978 91 200 0 0

Page 26: End of Award Report Background - Amazon S3 · End of Award Report Background Although the primary aim of this project was technical (the development of a corpus of data) the work

Fields in ????MA3.DAT matrices. Data are recorded in Text Word order. Words fixated more than once are coded separately. The initial two characters of the file name codes the participant contributing the data. For the English data set these are sa, sb, sc, sd, se, sf, sg, sh, si, sj . For the French data set these are sa,sb,sc,sd,se,sf,sg,sh, sx,sy (sx and sy were French participants tested in Dundee in the French language) The second two characters codes the Text File processed (01 to 20). The remaining characters identify the particular format of the data matrix. E.g. File SB04MA3.DAT contains the processed data for participant ‘B’ reading Text File 4, with the data in Fixation order (MA3). MA3 files contain the following Header line (NB in the example here, fields relating to number of fixations have been blanked to allow formatted printing): WORD TEXT LINE OLEN WLEN XPOS WNUM FDUR OBLP WDLP FXNO TXFR EXFR INF1 FAM1 INF2 FAM2 FFIX GAZ1 GAZ2 GAZ3 WORD Text Word TEXT Screen Number (1 - 40) for the identified Text File 1 – 20. If the word

was not fixated, this value is replaced by the missing data code -99 LINE Line processed OLEN Length of fixated ‘object’ i.e. word plus any additional characters WLEN Length of the fixated Text Word XPOS Position of fixation in characters WNUM Index number of word in relevant Text File FDUR Duration of this fixation OBLP Landing position on the defined ‘object’ WDLP Landing position on the defined word FXNO Fixation number as defined in MA1 resulting in this fixation TXFR Local text frequency EXFR Estimated Frequency from norms INF1 ‘Informativeness’ (see Column 16 in Text File) FAM1 ‘Familiarity’ (see Column 17 in Text File) INF2 ‘Informativeness’ (see Column 18 in Text File) FAM2 ‘Familiarity’ (see Column 19 in Text File) FFIX First fixation duration on this word GAZ1 Cumulative fixation duration until first exit from word (first pass) NGA1 Number of fixations comprising GAZ1 (Not shown here) GAZ2 Cumulative fixation duration of second pass on this word NGA2 Number of fixations comprising GAZ2 (Not shown here) GAZ3 Cumulative fixation duration of third pass on this word NGA3 Number of fixations comprising GAZ2 (Not shown here)