Olli Philippe Lautenbacher, Department of Romance Languages, French Translation Studies, University of Helsinki, Finland. [email protected]
From Still Pictures to Moving Pictures: Eye-Tracking Text and Image
Abstract: In visual documents, two kinds of information often co-occur: the pictorial and the textual. The question addressed here concerns their relative impact on the viewer, since they can trigger both bottom-up readings caused by visual saliencies and top-down readings linked to information search. A third type of reading also appears, linked to the human propensity to look not only at text, but also at human faces and gazes. In the reception process, these readings might create contradictions due to semiotic overshadowing of the perceived elements or, on the contrary, be cohesive, strengthening integrated meaning in the document through semiotic cross points between visual and textual elements. Using eye-tracking tests on still pictures, this study aims at addressing questions concerning subtitled films.
Key words: Multimodal document reception, bottom-up and top-down reading, semiotic overshadowing, semiotic cross points, subtitling
Film is multimodal by nature (Baldry and Thibault 2004; Baumgarten 2008). It
combines multiple semiotic codes through visual image and sound. Furthermore, in
subtitled film, two different kinds of visual information co-occur with the original
soundtrack, i.e. the pictorial filmic elements and the subtitles’ textual elements.
Because of the simultaneity of these visual elements, the question is whether this
combination of pictures and text creates a contradiction in the perception process or not.
More specifically, with subtitled film, the question is how this combination affects the
reception of films, and what implications this might have for subtitling strategies.
In order to study the relation between text and image in reception, I conducted eye-tracking tests at the French Translation Department of Turku University in 2007–2009.
These tests were made on still pictures which combined text and image in different
manners. The subjects were only informed that they would have to look at documents containing both picture and text and answer a brief questionnaire afterwards. The
fuzziness of the task was an intentional choice, since one of my interests lay in the
perception of visual saliencies. Nevertheless, the task given to the subjects probably
triggered some orientation in the viewing/reading process, since the terms ‘image’ and
‘text’ were mentioned. Even though the groups were small in size, consisting of only 9
and 10 translation students, the tests led to some interesting findings that should be
taken into account in future eye-tracking studies on subtitle-reading in film documents.
I will first consider what appear to be general tendencies in reading and picture viewing,
then I will go into questions concerning their combination in terms of contradictory
versus cohesive readings, and finally, I will address some methodological questions.
1. General tendencies in reading text
In linear text, many studies underline our propensity to begin reading at the upper left
corner of a text (e.g. Baccino 2004). This, of course, concerns languages written from
left to right. These studies also show how readers jump from one gaze location to
another, following certain tendencies: for instance, they do not fixate every word of a sentence, and they tend to land their fixations near the middle of a word, rarely on its first letter (Baccino 2004: 140). This has been observed in many eye-tracking tests and
was confirmed in ours. In our first example (see Figures 1 & 2), subjects were only
asked to read what they wanted to and could quit the test when they felt like doing so.
The previous slides were shown to them for 6 seconds each, with no specific task. The
slide shown in Figures 1 & 2 was the last of a series of 14 pictures, but the first to
present plain linear text.
Figures 1 & 2 – Reading a linear text. Gaze plots and hot spots in a group of 9 subjects.
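The "hot spot" maps in these figures are, in essence, fixation-density maps: recorded fixations are accumulated and smoothed into a heat map. The sketch below is only an illustration of that principle, not the software actually used in our tests; the grid size, the smoothing width and the duration weighting are all assumptions.

```python
import numpy as np

def fixation_heatmap(fixations, width, height, sigma=20.0):
    """Accumulate (x, y, duration) fixations into a Gaussian-smoothed
    density map, the basis of typical 'hot spot' visualizations."""
    ys, xs = np.mgrid[0:height, 0:width]
    heat = np.zeros((height, width))
    for x, y, dur in fixations:
        # each fixation adds a Gaussian bump weighted by its duration (ms)
        heat += dur * np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return heat

# toy data: a long fixation at (30, 40) and a short one at (100, 10)
heat = fixation_heatmap([(30, 40, 300), (100, 10, 100)], width=120, height=60)
peak_y, peak_x = np.unravel_index(np.argmax(heat), heat.shape)
print(peak_x, peak_y)  # prints: 30 40 -- the hottest spot is the longest fixation
```

Averaging such maps over a group of subjects, as in Figure 2, simply means summing the individual maps before visualization.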
One could make three observations about Figures 1 & 2. Firstly, it appears fairly
obvious in these recordings that the first point of attention is located at the upper left corner of the left-hand text1. Similar observations have been made about Web page
reading, where F-shaped viewing patterns (see Figure 3) were observed by Jakob
Nielsen (2006).
Figure 3 – F-shaped reading patterns in Web text observed by Jakob Nielsen (see: http://www.useit.com/alertbox/reading_pattern.html).
Secondly, this more or less linear reading strategy can be altered by the general form of
the text itself. On Web-pages, multiple textual regions are often scattered throughout the
screen, and the less clearly they are organized, the more readers hesitate over the order – or hierarchy – of reading. Consider, for example, the second picture of Nielsen’s three
slides in Figure 3 and what happens in Figure 4, from our own test.
1 Of course, this should be the upper right corner in the case of Arabic text, for instance. In the slide shown in Figures 1 & 2, the text in the right column is a Finnish translation of the French excerpt in the left column. Notice that this also triggered different reading strategies in the subject group, some reading the totality of both texts, others skipping across at different points to the text in their mother tongue, etc.
Figure 4 – Hesitation caused by the complexity of a web page (analysed picture from: http://www.le-dadaisme.com).
Obviously, the linearity of reading is also broken when the text itself is presented in a
more visual manner as in Dadaist poetry (see Figure 5).
Figure 5 – The broken linearity of Dadaist poetry. Paysage, by Guillaume Apollinaire (analysed picture from: http://www.readingroom.spl.org.uk/classic_poems/img/paysage350.jpg).
One could also add at this point that the viewing angle when sitting in front of a computer screen differs from that of a person reading a novel or a newspaper, which raises the question of whether we should be wary of overly computer-centric eye-tracking tests.
Thirdly, it would also seem that any type of completion of the reading process2 needs a
specific motivation, given for instance through specific content-related tasks in the test
situation.
2. General tendencies in viewing pictures
My second point concerns general tendencies in picture viewing, where similar top-
down effects have been observed. There are two well-known, opposing types of gaze-location selection: bottom-up readings, which are saliency-driven and thus in a way
caused by the picture itself, and top-down readings, which are driven by the viewer’s
consciousness and thus depend on what the viewer is looking for in a picture
(Henderson and Hollingworth 1998). I will also present what I believe is a third
configuration, which somehow stands between these two opposites.
2.1. Bottom-up readings
The first important thing about pictures is that they can contain certain salient regions
able to catch the attention of the human viewer. Visual saliencies are factors of visual
informativeness rather than semantic informativeness (Henderson and Hollingworth
1998). Following Carmi and Itti (2006: 4333), I will “use the term ‘saliency’ to refer to
2 Reading ‘completion’ can occur in different forms, since one can read a text in order to learn a story, analyse its language or structure, recall its main characters, look for specific information, study its political context, etc.
any bottom-up measure of conspicuity”. These visual saliencies can be mapped by a
combination of factors such as luminance, contrast, colour, or contour density, for
instance (Henderson and Hollingworth 1998), but they can also be created by object
movements, which is the case with the smooth pursuit phenomenon, for example, where
the eyes tend to follow a moving target on the screen and adjust to its pace, thus not
being a succession of saccades. (For a more detailed analysis of smooth pursuit eye
movement, see Krauzlis 2004.) These bottom-up principles of pictorial guidance are the
basic toolkit for creative professionals in visual communication (artists, photographers,
advertisers...). As an example, eye-movement on still pictures can be instigated through
shape similarity or repetition of objects as one can observe in Figures 6 and 7. This kind
of composition can be considered to some extent like the still-picture counterpart of
object movement in filmic scenes.
Figures 6 & 7 – Creating movement with shape repetition in still pictures. M.C. Escher, Cycle, 1938 (analysed picture from: http://www.mcescher.com/Gallery/switz-bmp/LW305.jpg).
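Saliency in Carmi and Itti's sense can be made concrete with a toy model. The sketch below approximates conspicuity as a centre-surround luminance contrast, one of the feature channels mentioned above; real models combine many channels over multiple scales, so the window sizes here are arbitrary assumptions.

```python
import numpy as np

def box_blur(img, r):
    """Mean over a (2r+1) x (2r+1) neighbourhood (edge-padded)."""
    h, w = img.shape
    padded = np.pad(img, r, mode='edge')
    out = np.zeros((h, w))
    for dy in range(2 * r + 1):
        for dx in range(2 * r + 1):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (2 * r + 1) ** 2

def saliency(luminance):
    """Centre-surround difference: a fine-scale local mean compared with
    a coarse-scale one. High values mark regions that differ from their
    surroundings -- conspicuity in the bottom-up sense."""
    return np.abs(box_blur(luminance, 1) - box_blur(luminance, 5))

# a dark scene containing one bright square: its border stands out
img = np.zeros((40, 40))
img[15:25, 15:25] = 1.0
sal = saliency(img)
# the square's edge (e.g. row 20, column 15) is far more salient
# than the flat background (e.g. row 5, column 5)
```

The point of such a map is precisely that it is computed from the picture alone, before any task or meaning enters the process.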
2.2. Top-down readings
On the other hand, we know that those concrete bottom-up factors – or what we might
call ‘true visual saliencies’, since they are not necessarily linked to any semantic
informativeness – can in fact lose their significance or even be completely ignored by
the viewer, since the actual location of eye-fixation can be strongly determined by top-
down tendencies present in the task, as pointed out by Itti (2005: 1095). This was already shown by Yarbus (1967) with his attention-driven test, where the viewers
looked at very different parts of the same picture, depending on what they were asked to
find out. A context- or task-driven information search, i.e. logical analysis (which was not part of our eye-tracking study), is in this sense a true top-down reading.
2.3. Readings of a third kind?
What our tests revealed, however, was the actual importance of another type of reading.
The attention-capturing strength of certain visual elements seems to be so strongly linked
to human communicational behaviour that they could almost be considered bottom-up
factors, even though they are not of an intrinsic pictorial nature. This is the case, for
instance, with human faces and their gaze directions in pictures. Human faces are very
powerful gaze-catchers, as has been clearly pointed out by Birmingham et al. (2008).
This can also be seen in Figure 8.
Figure 8 – Hot spots on Da Vinci’s Mona Lisa (analysed picture from: http://www.abm-enterprises.net/monalisa.jpg).
An interesting finding in our tests was that this tendency seems so strong that it actually
creates an expectation in the viewer even when there is no face to be seen, but where
there logically should be one (see Figures 9 & 10).
Figure 9 – A Swatch advertisement.
Figure 10 – Hot spots on the Swatch advertisement, viewed by 10 subjects.
One can also see in the hot spot recording of Figure 10 that subjects’ attention can be
partly guided by the gaze of the faces they are looking at. This is also what Birmingham
et al. suggest, noticing that “[one] normally finds that response time (RT) to detect a
target is shorter when the target appears at the gazed-at location than when it appears at
the non-gazed-at location.” (Birmingham et al. 2008: 986) In Figures 9 & 10, the woman
is looking backwards, thus leading the viewers’ gaze towards the text and the watch. Of
course this direction is supported at the same time by the visual saliency of the
horizontal lines created by the arms of the human shaped shadow and by the text in the
middle of the advertisement. Nevertheless, the coercive force of the viewed human gaze
is obvious: consider Robert Doisneau’s famous photograph (Figures 11 & 12), where
the horizontal link between the man’s face and what he is looking at appears clearly.
Figure 11 – The faces we see lead our gaze. Gaze plots on Doisneau’s Un regard oblique, 1948 (analysed picture from: http://www.photographersgallery.com/i/full/un_regard_oblique.jpg).
Figure 12 – Hot spots on the same photograph.
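One way to quantify this "coercive force" would be to test whether fixations recorded on a picture fall on the side a depicted face is looking toward. The following is only a hypothetical analysis sketch, not an analysis performed in our tests: the half-plane criterion and the example coordinates are assumptions.

```python
import numpy as np

def gaze_following_share(face_xy, gaze_dir, fixations):
    """Fraction of fixations falling in the half-plane the depicted face
    looks toward. face_xy: face position; gaze_dir: a direction vector;
    fixations: (x, y) points recorded while viewing the picture."""
    face = np.asarray(face_xy, dtype=float)
    d = np.asarray(gaze_dir, dtype=float)
    hits = sum(1 for f in fixations
               if np.dot(np.asarray(f, dtype=float) - face, d) > 0)
    return hits / len(fixations)

# hypothetical data: a face at (100, 50) looking to the right;
# three fixations land to its right, one to its left
share = gaze_following_share((100, 50), (1, 0),
                             [(150, 60), (200, 40), (180, 55), (80, 50)])
print(share)  # prints: 0.75
```

A share well above 0.5 across subjects would support the claim that the viewed gaze directs the viewers' own.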
The eye-tracking recordings for the Bourjois advertisement (Figures 13 & 14) quite
clearly indicate that the repetitive figure of the lips, salient not only by contrast,
luminosity and brilliance, but also because it seems to be in front of the TV-screen in a
3-dimensional manner, is actually overshadowed during reception by the woman’s face,
even if both visual regions lead the viewers’ gaze towards the trademark at the top right
corner of the picture.
Figures 13 & 14 – Creating movement in still pictures, by combining repetition and gaze direction in a Bourjois advertisement.
In a sense, our test-slides suggest that we do not necessarily look for information, as
such, while viewing pictures containing human bodies, but rather for potential areas that
usually support human communication (mouth, eyes, gaze direction…). Otherwise, the
eye-tracking tests should logically reveal much more attention on the object that is
actually being marketed in the Bourjois advertisement, i.e. the lipstick at the bottom
right of the picture (see Figures 13 & 14).
Obviously enough, our tests tend to show that when no task is given to the viewer, the
default point of observation will be the human face and the gaze directions it suggests,
thus putting the human face somewhere between a bottom-up visual saliency and a top-
down search object. I will not address the question whether this is a biological or
cultural matter, but it is interesting to notice that, for instance, in Levinas’ philosophy,
the human face is seen as being significant in itself3, and that in another field,
Birmingham et al. (2008) analyse the phenomenon as a question of “social attention”.
Nevertheless, the idea of considering this a communicational reaction has the advantage
3 “Le visage est signification, et signification sans contexte. […] le visage est sens à lui seul.” (Levinas, 2000: XII)
of integrating all these elements: significance and social attention triggered by human
face and text. Obviously, another interesting thing about the Bourjois slide is that the
text at the bottom of the picture has clearly been read by the viewers in quite a standard,
linear manner (see Figure 15). This somehow confirms the idea that viewers tend to
consider posters above all as a form of communication and, hence, read almost any text
that appears on a given document (poster, screen, etc.)4.
Figure 15 – Standard linear reading of the text, at the bottom of the picture.
The same observations can be made on the slide in Figures 16 & 17, where both face
and text are clearly the elements that caught the audience’s eye, more than other parts of
the human body.
4 Obviously, there might be differences between audiences, which will have to be taken into account in further research. This particular test group solely consisted of 4th-year university students specializing in translation studies in Finland, aged 22 to 24 and mainly women. Would an artist or a child look at the same elements?
Figures 16 & 17 – Gaze catching power of face and text. (A Guy Laroche advertisement)
This leads to a series of questions concerning the combination of text reading and
picture viewing tendencies.
3. Combining text reading and picture viewing tendencies
What happens when our basic linear reading technique, with its more or less systematic
left to right gaze-trajectory, meets pictorial bottom-up saliencies or areas of human
interest (such as faces or gazes) in the same scene? This combination could cause both
contradictory and cohesive readings.
3.1. Contradictory reading (semiotic overshadowing)
There seems to be a human propensity to read whatever text appears on the screen, as has been shown for subtitled television by d’Ydewalle et al. (1991) in an article tellingly subtitled “Automatic Reading Behaviour”5. This also seems to be true for
still pictures, as another version of Mona Lisa presented in our test would suggest (see
Figure 18).
Figure 18 – The gaze catching power of written text. L.H.O.O.Q., by Marcel Duchamp (analysed picture from: http://www.marcelduchamp.net/images/L.H.O.O.Q.jpg).
In the situation of simultaneous codes, a question arises: to what extent does the picture
alter the standard reading of the (subtitle) text or, conversely, to what extent does the
presence of text overshadow our reception of the picture?
In the first case, the presence of an image in itself, and, moreover, specific salient
pictorial elements, might systematically alter the hypothetically linear reading of the
subtitle’s text. As d’Ydewalle et al. noticed, in the case of two-line subtitles, the viewers
“might stay longer in the subtitle or switch back and forth between the visual image and
subtitle” (d’Ydewalle et al. 1991: 662). Such a broken linearity due to pictorial
saliencies might reduce the semantic impact of subtitles.
5 Of course, questions such as the language competences or socio-cultural background of the viewer remain to be addressed.
On the other hand (and this might be even more important from the film makers’ point
of view), can the presence of subtitles overshadow important semiotic pictorial elements
for meaning construction or diegesis in film? To what extent then should film makers
take into account the differences between subtitled and dubbed versions of their films?
This question would be of particular importance in the case of a director such as Alfred
Hitchcock, who was famously precise about the visual details he put on screen. In fact,
should the choice between subtitling and dubbing be made according to film genre
rather than general translation policies, as Hollander (2001) suggests?6
Obviously, when two different kinds of visual elements are coupled together, i.e. image
and text, it takes time for them to be perceived and taken into account cognitively. Their
respective semiotic overshadowing could probably be studied, for instance, through
comparative eye-tracking tests with non-native and native speaker viewers.
3.2. Cohesive reading (semiotic cross points)
Conversely, since pictures can carry exophoric references that somehow confirm the
semantic reading of the subtitle, the combination can also be seen as a reinforcement of
cohesion in film viewing. In the field of multimodal transcriptions, such as those
suggested by Baldry and Thibault (2006: 165 – 249; Appendix I), this simply means
6 This would then create the problem of defining genres: is film genre, as film makers or producers understand it, an adequate definition for the translator’s purposes?
that subtitles should be taken into consideration just as any other building block of the
overall meaning of an audiovisual document.
In the visual-verbal cohesion system suggested by Baumgarten (2008), the author
compares original soundtracks of James Bond films (in English) to their dubbed
soundtracks in German. She shows that the translated versions tend to create a stronger
and more explicit link between picture and (spoken) verbal text through German lexical
choices. Hence, with subtitles, one might think that the kind of ‘semiotic cross points’
that enhance the integration of visual objects and co-occurring textual elements on
screen would result in specific dynamic shifts in the spectator’s gaze and create a
stronger perception of those semiotically connected elements.
As an example of this kind of semiotic cross point, let us consider a scene from Astérix
et Obélix – Mission Cléopâtre7, where the humour of the scene is built upon the
semantic ambiguity of the French word dalle, meaning both ‘hunger’ and ‘flagstone’,
and the picture, which immediately follows, of Obélix – who is always hungry –
carrying the said flagstone (Figure 19). Here there is a strong link between the visual
meaning and both spoken verbal meanings. Astérix is asking Obélix “How are you?”
and Obélix answers “I’m hungry” with a French expression that also means “I have the flagstone”. The difference between the French soundtrack and its subtitled version in
Swedish is obvious, since the Swedish version of the answer does not carry that second
meaning. The original multimodal link is missing, and part of the humour is lost in
translation, which might cause different viewing paths in different audiences
7 A film by Alain Chabat, 2002.
(Lautenbacher forthcoming). This shows clearly how translation can alter the semiotic
reading of a scene by changing these cross points.
Figure 19 – Integrated visual and verbal information in the original dialogue. Compare “Ça va, Obé? – Ben j’ai la dalle.” to the Swedish translation:
“Hur går det? – Jag är hungrig.” (‘How are you? – I am hungry’)
So to what extent should translators take into account the relationship between visual elements in their work? Maybe their first question should be: (how) do people actually
read subtitles? An eye-tracking tool demonstration test on subtitled film made at the
Lund Eye-Tracking Academy in October 2008 seemed to show that, in fact, very little
might be read, even with a language completely foreign to the viewers. This test was not subjected to in-depth analysis, and it had very few subjects, all of whom were (too) conscious of the test, being participants at the eye-tracking academy. Nevertheless, one hypothesis
that emerged in subsequent discussion was that filmic context and intrinsic logic (i.e.
the film genre, mentioned earlier) might partly prepare the viewer for plot-decoding at
each stage of the story and in this way diminish the necessity for actual subtitle reading.
4. Some methodological issues
There are also many practical questions about the measurable specificities of subtitles
related to this combined reading strategy that will have to be addressed in future
research, such as the positioning of text on the screen (both physically and temporally
speaking), the number of lines used in subtitling, and the impact of font type on
visibility and legibility. All these factors affect the time and speed of linear reading in
itself and, furthermore, the combined reading of multimodal documents.
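Such timing questions are often reasoned about in terms of reading speed. As a hypothetical sketch, assuming the common industry rule of thumb of roughly 12–15 characters per second (a value not measured in this study), a minimum display time per subtitle can be estimated as follows.

```python
def min_display_time(subtitle, cps=13.0, floor=1.0):
    """Minimum on-screen duration (in seconds) for a subtitle, given an
    assumed reading speed in characters per second (cps) and a minimum
    display floor. The 12-15 cps range is a common industry rule of
    thumb, not a value taken from this study."""
    chars = len(subtitle.replace("\n", ""))
    return max(floor, chars / cps)

# the two-line Swedish subtitle from Figure 19
two_liner = "Hur går det?\nJag är hungrig."
print(round(min_display_time(two_liner), 2))  # prints: 2.08
```

Any disruption of linear reading by pictorial saliencies would, in effect, lower the usable cps value, which is precisely why such constants should be validated with eye-tracking rather than assumed.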
Another important methodological issue for further eye-tracking tests concerns the
visual areas covered by parafoveal vision: screen size and the distance between viewer
and screen in eye-tracking test situations must be the same as in normal viewing
conditions. The usual distance (and common screen size) while viewing media on a
computer differs from both television and cinema contexts. In our test, this appeared
quite clearly with some of the observed documents, which were originally made to be
seen as larger images and from further away. None of the 9 subjects noticed the human
face hiding in the slide shown in Figure 20 during the test made on a standard office
computer screen, but everybody saw it immediately when it was projected onto the wide
classroom screen afterwards. So here again, maybe one should avoid being too
computer-centric.8
8 The question of the viewing angle can also be of great importance, especially in cinemas, where the viewers can sometimes be very far from the ideal central viewing point. (This appears to be a sensitive issue even from a technical perspective concerning 3D films.)
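The underlying geometry here is the visual angle, 2·arctan(size / (2·distance)). A quick calculation, with assumed dimensions and viewing distances, illustrates why an image on an office monitor and the same image projected on a classroom wall can cover quite different portions of the visual field.

```python
import math

def visual_angle_deg(size, distance):
    """Visual angle (in degrees) subtended by an object of a given size
    seen from a given distance (both in the same units)."""
    return math.degrees(2 * math.atan(size / (2 * distance)))

# assumed dimensions: a 33 cm-wide image on an office monitor at 60 cm,
# versus the same picture projected 10 m wide and viewed from 10 m
monitor = visual_angle_deg(0.33, 0.60)
projection = visual_angle_deg(10.0, 10.0)
print(round(monitor, 1), round(projection, 1))  # prints: 30.8 53.1
```

Matching this angle between the test situation and the intended viewing situation is what keeps parafoveal coverage comparable.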
Figure 20 – None of the subjects saw the face in the waves, because of insufficient distance from the screen during the test (analysed picture from: http://www.szilagyivallalat.hu/blog/wp-
content/uploads/2006/04/ClubMed-04.jpg).
Some conclusions
The present introductory study led to the following conclusions. First, in random picture
viewing, there is a strong human tendency to systematically fixate relevant
communicative elements such as faces, mouths, eyes, gazes and text. So when there is
no specific top-down search, the audience tends to seek a source of communication
on the screen.
Secondly, in an integrated multimodal meaning construction system such as film,
subtitle reading is not necessarily linear, but, rather, it varies according to visual
saliencies, top-down information searches, contradictory or cohesive elements, co-text
or narration understanding, and numerous other factors.
Thirdly, multimodal cross points (integrating picture, sound and text) are important in
the meaning construction process. Film directors make use of this visual-verbal
cohesion in order to obtain the expected readings. Hence, one important task for the
translator is to avoid overshadowing these cohesive structures, and this implies a global,
integrated approach to the translation of audiovisual documents.
References
Baccino, Thierry. 2004. La lecture électronique. Presses Universitaires de Grenoble. 254 p.
Baldry, Anthony and Thibault, Paul J. 2004. Multimodal Transcription and Text Analysis. London/Oakville: Equinox. 270 p.
Baumgarten, Nicole. 2008. “Yeah, that’s it!: Verbal Reference to Visual Information in Film Texts and Film Translations.” Meta, LIII, 1, p. 6–25.
Birmingham, Elina, Bischof, Walter F. and Kingstone, Alan. 2008. “Social attention and real-world scenes: The roles of action, competition and social content.” The Quarterly Journal of Experimental Psychology, 61, 7, p. 986–998.
Carmi, Ran and Itti, Laurent. 2006. “Visual causes versus correlates of attentional selection in dynamic scenes.” Vision Research, 46, p. 4333–4345.
d’Ydewalle, Géry, Praet, Caroline, Verfaillie, Karl and Van Rensbergen, Johan. 1991. “Watching Subtitled Television. Automatic Reading Behaviour.” Communication Research, 18, 5, p. 650–666.
Henderson, John M. and Hollingworth, Andrew. 1998. “Eye Movements during Scene Viewing: An Overview.” In G. Underwood (ed.), Eye Guidance in Reading and Scene Perception. Elsevier Science, p. 269–293.
Hollander, Régine. 2001. “Doublage et sous-titrage – Étude de cas: Natural Born Killers (Tueurs nés).” Revue française d’études américaines, 88, p. 79–88. On line: http://www.cairn.info/article.php?ID_REVUE=RFEA&ID_NUMPUBLIE=RFEA_088&ID_ARTICLE=RFEA_088_0079&REDIR=1 [accessed: 14.9.2009].
Itti, Laurent. 2005. “Quantifying the contribution of low-level saliency to human eye movements in dynamic scenes.” Visual Cognition, 12, 6, p. 1093–1123.
Krauzlis, Richard J. 2004. “Recasting the Smooth Pursuit Eye Movement System.” Journal of Neurophysiology, 91, p. 591–603.
Lautenbacher, Olli Philippe. Forthcoming (on line). “Film et sous-titrage. Pour une définition de l’unité de sens en tradaptation.” Actes du Colloque Représentation du sens linguistique IV (28.–30.5.2008), University of Helsinki.
Levinas, Emmanuel. 2000 [1961]. Totalité et Infini. Paris: Le livre de poche, p. XII.
Nielsen, Jakob. 2006. “F-Shaped Pattern for Reading Web Content.” On line: http://www.useit.com/alertbox/reading_pattern.html [accessed: 14.9.2009].
Yarbus, A. L. 1967 [1965]. Eye Movements and Vision. New York: Plenum Press. 222 p.