Olli Philippe Lautenbacher, Department of Romance Languages, French Translation Studies, University of Helsinki, Finland. [email protected]
From Still Pictures to Moving Pictures: Eye-Tracking Text and Image
Abstract: In visual documents, two kinds of information often co-occur: the pictorial and the textual. The question addressed here concerns their relative impact on the viewer, since they can trigger both bottom-up readings caused by visual saliencies and top-down readings linked to information search. A third type of reading also appears, linked to the human propensity to look not only at text, but also at human faces and gazes. In the reception process, these readings might create contradictions due to semiotic overshadowing of the perceived elements or, on the contrary, be cohesive, strengthening integrated meaning in the document through semiotic cross points between visual and textual elements. Using eye-tracking tests on still pictures, this study aims at addressing questions concerning subtitled films.
Key words: Multimodal document reception, bottom-up and top-down reading, semiotic overshadowing, semiotic cross points, subtitling
Film is multimodal by nature (Baldry and Thibault 2004; Baumgarten 2008). It
combines multiple semiotic codes through visual image and sound. Furthermore, in
subtitled film, two different kinds of visual information co-occur with the original
soundtrack, i.e. the pictorial filmic elements and the subtitles’ textual elements.
Because of the simultaneity of these visual elements, the question is whether this
combination of pictures and text creates a contradiction in the perception process or not.
More specifically, with subtitled film, the question is how this combination affects the
reception of films, and what implications this might have for subtitling strategies.
In order to study the relation between text and image in reception, I conducted eye-tracking tests at the French Translation Department of Turku University in 2007–2009.
These tests were made on still pictures which combined text and image in different
manners. The subjects were only informed that they would have to look at documents containing both picture and text and answer a brief questionnaire afterwards. The
fuzziness of the task was an intentional choice, since one of my interests lay in the
perception of visual saliencies. Nevertheless, the task given to the subjects probably
triggered some orientation in the viewing/reading process, since the terms ‘image’ and
‘text’ were mentioned. Even though the groups were small in size, consisting of only 9
and 10 translation students, the tests led to some interesting findings that should be
taken into account in future eye-tracking studies on subtitle-reading in film documents.
I will first consider what appear to be general tendencies in reading and picture viewing,
then I will go into questions concerning their combination in terms of contradictory
versus cohesive readings, and finally, I will address some methodological questions.
1. General tendencies in reading text
In linear text, many studies underline our propensity to begin reading at the upper left
corner of a text (e.g. Baccino 2004). This, of course, concerns languages written from
left to right. These studies also show how readers jump from one gaze location to
another, following certain tendencies: for instance, they do not fixate every word of a sentence, and they tend to land their fixations near the middle of a word, rarely on its first letter (Baccino 2004: 140). This has been observed in many eye-tracking tests and
was confirmed in ours. In our first example (see Figures 1 & 2), subjects were only
asked to read what they wanted to and could quit the test when they felt like doing so.
The previous slides were shown to them for 6 seconds each, with no specific task. The
slide shown in Figures 1 & 2 was the last of a series of 14 pictures, but the first to
present plain linear text.
Figures 1 & 2 – Reading a linear text. Gaze plots and hot spots in a group of 9 subjects.
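The "hot spot" maps in these figures are, in essence, fixation-density maps: recorded fixations are accumulated and smoothed into a heat map. The sketch below is only an illustration of that principle, not the software actually used in our tests; the grid size, the smoothing width and the duration weighting are all assumptions.

```python
import numpy as np

def fixation_heatmap(fixations, width, height, sigma=20.0):
    """Accumulate (x, y, duration) fixations into a Gaussian-smoothed
    density map, the basis of typical 'hot spot' visualizations."""
    ys, xs = np.mgrid[0:height, 0:width]
    heat = np.zeros((height, width))
    for x, y, dur in fixations:
        # each fixation adds a Gaussian bump weighted by its duration (ms)
        heat += dur * np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return heat

# toy data: a long fixation at (30, 40) and a short one at (100, 10)
heat = fixation_heatmap([(30, 40, 300), (100, 10, 100)], width=120, height=60)
peak_y, peak_x = np.unravel_index(np.argmax(heat), heat.shape)
print(peak_x, peak_y)  # prints: 30 40 -- the hottest spot is the longest fixation
```

Averaging such maps over a group of subjects, as in Figure 2, simply means summing the individual maps before visualization.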
One could make three observations about Figures 1 & 2. Firstly, it appears fairly
obvious in these recordings that the first point of attention is located at the upper left corner of the left-hand text1. Similar observations have been made about Web page
reading, where F-shaped viewing patterns (see Figure 3) were observed by Jakob
Nielsen (2006).
Figure 3 – F-shaped reading patterns in Web text observed by Jakob Nielsen (see: http://www.useit.com/alertbox/reading_pattern.html).
Secondly, this more or less linear reading strategy can be altered by the general form of
the text itself. On Web-pages, multiple textual regions are often scattered throughout the
screen, and the less clearly they are organized, the more readers hesitate over the order – or hierarchy – of reading. Consider, for example, the second picture of Nielsen’s three
slides in Figure 3 and what happens in Figure 4, from our own test.
1 Of course, this should be the upper right corner in the case of Arabic text, for instance. In the slide shown in Figures 1 & 2, the text in the right column is a Finnish translation of the French excerpt in the left column. Notice that this also triggered different reading strategies in the subject group, some reading the totality of both texts, others skipping across at different points to the text in their mother tongue, etc.
Figure 4 – Hesitation caused by the complexity of a web page (analysed picture from: http://www.le-dadaisme.com).
Obviously, the linearity of reading is also broken when the text itself is presented in a
more visual manner as in Dadaist poetry (see Figure 5).
Figure 5 – The broken linearity of Dadaist poetry. Paysage, by Guillaume Apollinaire (analysed picture from: http://www.readingroom.spl.org.uk/classic_poems/img/paysage350.jpg).
One could also add at this point that the viewing angle when sitting in front of a computer screen differs from that of a person reading a novel or a newspaper, which raises the question of whether we should be wary of overly computer-centric eye-tracking tests.
Thirdly, it would also seem that any type of completion of the reading process2 needs a
specific motivation, given for instance through specific content-related tasks in the test
situation.
2. General tendencies in viewing pictures
My second point concerns general tendencies in picture viewing, where similar top-
down effects have been observed. There are two well-known, opposing types of gaze-location selection: bottom-up readings, which are saliency-driven and thus in a way
caused by the picture itself, and top-down readings, which are driven by the viewer’s
consciousness and thus depend on what the viewer is looking for in a picture
(Henderson and Hollingworth 1998). I will also present what I believe is a third
configuration, which somehow stands between these two opposites.
2.1. Bottom-up readings
The first important thing about pictures is that they can contain certain salient regions
able to catch the attention of the human viewer. Visual saliencies are factors of visual
informativeness rather than semantic informativeness (Henderson and Hollingworth
1998). Following Carmi and Itti (2006: 4333), I will “use the term ‘saliency’ to refer to
2 Reading ‘completion’ can occur in different forms, since one can read a text in order to learn a story, analyse its language or structure, recall its main characters, look for specific information, study its political context, etc.
any bottom-up measure of conspicuity”. These visual saliencies can be mapped by a
combination of factors such as luminance, contrast, colour, or contour density, for
instance (Henderson and Hollingworth 1998), but they can also be created by object
movements, which is the case with the smooth pursuit phenomenon, for example, where
the eyes tend to follow a moving target on the screen and adjust to its pace, thus not
being a succession of saccades. (For a more detailed analysis of smooth pursuit eye
movement, see Krauzlis 2004.) These bottom-up principles of pictorial guidance are the
basic toolkit for creative professionals in visual communication (artists, photographers,
advertisers...). As an example, eye-movement on still pictures can be instigated through
shape similarity or repetition of objects as one can observe in Figures 6 and 7. This kind
of composition can be considered to some extent like the still-picture counterpart of
object movement in filmic scenes.
Figures 6 & 7 – Creating movement with shape repetition in still pictures. M.C. Escher, Cycle, 1938 (analysed picture from: http://www.mcescher.com/Gallery/switz-bmp/LW305.jpg).
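Saliency in Carmi and Itti's sense can be made concrete with a toy model. The sketch below approximates conspicuity as a centre-surround luminance contrast, one of the feature channels mentioned above; real models combine many channels over multiple scales, so the window sizes here are arbitrary assumptions.

```python
import numpy as np

def box_blur(img, r):
    """Mean over a (2r+1) x (2r+1) neighbourhood (edge-padded)."""
    h, w = img.shape
    padded = np.pad(img, r, mode='edge')
    out = np.zeros((h, w))
    for dy in range(2 * r + 1):
        for dx in range(2 * r + 1):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (2 * r + 1) ** 2

def saliency(luminance):
    """Centre-surround difference: a fine-scale local mean compared with
    a coarse-scale one. High values mark regions that differ from their
    surroundings -- conspicuity in the bottom-up sense."""
    return np.abs(box_blur(luminance, 1) - box_blur(luminance, 5))

# a dark scene containing one bright square: its border stands out
img = np.zeros((40, 40))
img[15:25, 15:25] = 1.0
sal = saliency(img)
# the square's edge (e.g. row 20, column 15) is far more salient
# than the flat background (e.g. row 5, column 5)
```

The point of such a map is precisely that it is computed from the picture alone, before any task or meaning enters the process.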
2.2. Top-down readings
On the other hand, we know that those concrete bottom-up factors – or what we might
call ‘true visual saliencies’, since they are not necessarily linked to any semantic
informativeness – can in fact lose their significance or even be completely ignored by
the viewer, since the actual location of eye-fixation can be strongly determined by top-
down tendencies present in the task, as pointed out by Itti (2005: 1095). This was already shown by Yarbus (1967) with his attention-driven test, where the viewers
looked at very different parts of the same picture, depending on what they were asked to
find out. A context- or task-driven information search, i.e. logical analysis (which was not part of our eye-tracking study), is in this sense a true top-down reading.
2.3. Readings of a third kind?
What our tests revealed, however, was the actual importance of another type of reading.
The attention-capturing strength of certain visual elements seems to be so strongly linked
to human communicational behaviour that they could almost be considered bottom-up
factors, even though they are not of an intrinsic pictorial nature. This is the case, for
instance, with human faces and their gaze directions in pictures. Human faces are very
powerful gaze-catchers, as has been clearly pointed out by Birmingham et al. (2008).
This can also be seen in Figure 8.
Figure 8 – Hot spots on Da Vinci’s Mona Lisa (analysed picture from: http://www.abm-enterprises.net/monalisa.jpg).
An interesting finding in our tests was that this tendency seems so strong that it actually
creates an expectation in the viewer even when there is no face to be seen, but where
there logically should be one (see Figures 9 & 10).
Figure 9 – A Swatch advertisement.
Figure 10 – Hot spots on the Swatch advertisement, viewed by 10 subjects.
One can also see in the hot spot recording of Figure 10 that subjects’ attention can be
partly guided by the gaze of the faces they are looking at. This is also what Birmingham
et al. suggest, noticing that “[one] normally finds that response time (RT) to detect a
target is shorter when the target appears at the gazed-at location than when it appears at
the non-gazed-at location.” (Birmingham et al. 2008: 986) In Figures 9 & 10, the woman
is looking backwards, thus leading the viewers’ gaze towards the text and the watch. Of
course this direction is supported at the same time by the visual saliency of the
horizontal lines created by the arms of the human shaped shadow and by the text in the
middle of the advertisement. Nevertheless, the coercive force of the viewed human gaze
is obvious: consider Robert Doisneau’s famous photograph (Figures 11 & 12), where
the horizontal link between the man’s face and what he is looking at appears clearly.
Figure 11 – The faces we see lead our gaze. Gaze plots on Doisneau’s Un regard oblique, 1948 (analysed picture from: http://www.photographersgallery.com/i/full/un_regard_oblique.jpg).
Figure 12 – Hot spots on the same photograph.
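One way to quantify this "coercive force" would be to test whether fixations recorded on a picture fall on the side a depicted face is looking toward. The following is only a hypothetical analysis sketch, not an analysis performed in our tests: the half-plane criterion and the example coordinates are assumptions.

```python
import numpy as np

def gaze_following_share(face_xy, gaze_dir, fixations):
    """Fraction of fixations falling in the half-plane the depicted face
    looks toward. face_xy: face position; gaze_dir: a direction vector;
    fixations: (x, y) points recorded while viewing the picture."""
    face = np.asarray(face_xy, dtype=float)
    d = np.asarray(gaze_dir, dtype=float)
    hits = sum(1 for f in fixations
               if np.dot(np.asarray(f, dtype=float) - face, d) > 0)
    return hits / len(fixations)

# hypothetical data: a face at (100, 50) looking to the right;
# three fixations land to its right, one to its left
share = gaze_following_share((100, 50), (1, 0),
                             [(150, 60), (200, 40), (180, 55), (80, 50)])
print(share)  # prints: 0.75
```

A share well above 0.5 across subjects would support the claim that the viewed gaze directs the viewers' own.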
The eye-tracking recordings for the Bourjois advertisement (Figures 13 & 14) quite
clearly indicate that the repetitive figure of the lips, salient not only by contrast,
luminosity and brilliance, but also because it seems to be in front of the TV-screen in a
3-dimensional manner, is actually overshadowed during reception by the woman’s face,
even if both visual regions lead the viewers’ gaze towards the trademark at the top right
corner of the picture.
Figures 13 & 14 – Creating movement in still pictures, by combining repetition and gaze direction in a Bourjois advertisement.
In a sense, our test-slides suggest that we do not necessarily look for information, as
such, while viewing pictures containing human bodies, but rather for potential areas that
usually support human communication (mouth, eyes, gaze direction…). Otherwise, the
eye-tracking tests should logically reveal much more attention on the object that is
actually being marketed in the Bourjois advertisement, i.e. the lipstick at the bottom
right of the picture (see Figures 13 & 14).
Obviously enough, our tests tend to show that when no task is given to the viewer, the
default point of observation will be the human face and the gaze directions it suggests,
thus putting the human face somewhere between a bottom-up visual saliency and a top-
down search object. I will not address the question whether this is a biological or
cultural matter, but it is interesting to notice that, for instance, in Levinas’ philosophy,
the human face is seen as being significant in itself3, and that in another field,
Birmingham et al. (2008) analyse the phenomenon as a question of “social attention”.
Nevertheless, the idea of considering this a communicational reaction has the advantage
3 “Le visage est signification, et signification sans contexte. […] le visage est sens à lui seul.” (Levinas, 2000: XII)
of integrating all these elements: significance and social attention triggered by human
face and text. Obviously, another interesting thing about the Bourjois slide is that the
text at the bottom of the picture has clearly been read by the viewers in quite a standard,
linear manner (see Figure 15). This somehow confirms the idea that viewers tend to
consider posters above all as a form of communication and, hence, read almost any text
that appears on a given document (poster, screen, etc.)4.
Figure 15 – Standard linear reading of the text, at the bottom of the picture.
The same observations can be made on the slide in Figures 16 & 17, where both face
and text are clearly the elements that caught the audience’s eye, more than other parts of
the human body.
4 Obviously, there might be differences between audiences, which will have to be taken into account in further research. This particular test group solely consisted of 4th-year university students specializing in translation studies in Finland, aged 22 to 24 and mainly women. Would an artist or a child look at the same elements?
Figures 16 & 17 – Gaze catching power of face and text. (A Guy Laroche advertisement)
This leads to a series of questions concerning the combination of text reading and
picture viewing tendencies.
3. Combining text reading and picture viewing tendencies
What happens when our basic linear reading technique, with its more or less systematic
left to right gaze-trajectory, meets pictorial bottom-up saliencies or areas of human
interest (such as faces or gazes) in the same scene? This combination could cause both
contradictory and cohesive readings.
3.1. Contradictory reading (semiotic overshadowing)
There seems to be a human propensity to read whatever text appears on the screen, as has been shown for subtitled television by d’Ydewalle et al. (1991) in an article tellingly subtitled “Automatic Reading Behaviour”5. This also seems to be true for
still pictures, as another version of Mona Lisa presented in our test would suggest (see
Figure 18).
Figure 18 – The gaze catching power of written text. L.H.O.O.Q., by Marcel Duchamp (analysed picture from: http://www.marcelduchamp.net/images/L.H.O.O.Q.jpg).
In the situation of simultaneous codes, a question arises: to what extent does the picture
alter the standard reading of the (subtitle) text or, conversely, to what extent does the
presence of text overshadow our reception of the picture?
In the first case, the presence of an image in itself, and, moreover, specific salient
pictorial elements, might systematically alter the hypothetically linear reading of the
subtitle’s text. As d’Ydewalle et al. noticed, in the case of two-line subtitles, the viewers
“might stay longer in the subtitle or switch back and forth between the visual image and
subtitle” (d’Ydewalle et al. 1991: 662). Such a broken linearity due to pictorial
saliencies might reduce the semantic impact of subtitles.
5 Of course, questions such as the language competences or socio-cultural background of the viewer remain to be addressed.
On the other hand (and this might be even more important from the film makers’ point
of view), can the presence of subtitles overshadow important semiotic pictorial elements
for meaning construction or diegesis in film? To what extent then should film makers
take into account the differences between subtitled and dubbed versions of their films?
This question would be of particular importance in the case of a director such as Alfred
Hitchcock, who was famously precise about the visual details he put on screen. In fact,
should the choice between subtitling and dubbing be made according to film genre
rather than general translation policies, as Hollander (2001) suggests?6
Obviously, when two different kinds of visual elements are coupled together, i.e. image
and text, it takes time for them to be perceived and taken into account cognitively. Their
respective semiotic overshadowing could probably be studied, for instance, through
comparative eye-tracking tests with non-native and native speaker viewers.
3.2. Cohesive reading (semiotic cross points)
Conversely, since pictures can carry exophoric references that somehow confirm the
semantic reading of the subtitle, the combination can also be seen as a reinforcement of
cohesion in film viewing. In the field of multimodal transcriptions, such as those
suggested by Baldry and Thibault (2006: 165 – 249; Appendix I), this simply means
6 This would then create the problem of defining genres: is film genre, as film makers or producers understand it, an adequate definition for the translator’s purposes?
that subtitles should be taken into consideration just as any other building block of the
overall meaning of an audiovisual document.
In the visual-verbal cohesion system suggested by Baumgarten (2008), the author
compares original soundtracks of James Bond films (in English) to their dubbed
soundtracks in German. She shows that the translated versions tend to create a stronger
and more explicit link between picture and (spoken) verbal text through German lexical
choices. Hence, with subtitles, one might think that the kind of ‘semiotic cross points’
that enhance the integration of visual objects and co-occurring textual elements on
screen would result in specific dynamic shifts in the spectator’s gaze and create a
stronger perception of those semiotically connected elements.
As an example of this kind of semiotic cross point, let us consider a scene from Astérix
et Obélix – Mission Cléopâtre7, where the humour of the scene is built upon the
semantic ambiguity of the French word dalle, meaning both ‘hunger’ and ‘flagstone’,
and the picture, which immediately follows, of Obélix – who is always hungry –
carrying the said flagstone (Figure 19). Here there is a strong link between the visual
meaning and both spoken verbal meanings. Astérix is asking Obélix “How are you?”
and Obélix answers “I’m hungry” with a French expression that also means “I have the flagstone”. The difference between the French soundtrack and its subtitled version in
Swedish is obvious, since the Swedish version of the answer does not carry that second
meaning. The original multimodal link is missing, and part of the humour is lost in
translation, which might cause different viewing paths in different audiences
7 A film by Alain Chabat, 2002.
(Lautenbacher forthcoming). This shows clearly how translation can alter the semiotic
reading of a scene by changing these cross points.
Figure 19 – Integrated visual and verbal information in the original dialogue. Compare “Ça va, Obé? – Ben j’ai la dalle.” to the Swedish translation:
“Hur går det? – Jag är hungrig.” (‘How are you? – I am hungry’)
So to what extent should translators take into account the relationship between visual elements in their work? Maybe their first question should be: (how) do people actually
read subtitles? An eye-tracking tool demonstration test on subtitled film made at the
Lund Eye-Tracking Academy in October 2008 seemed to show that, in fact, very little
might be read, even with a language completely foreign to the viewers. This test was not subjected to in-depth analysis, and it had very few subjects, all of whom were (too) conscious of the test, being participants at the eye-tracking academy. Nevertheless, one hypothesis
that emerged in subsequent discussion was that filmic context and intrinsic logic (i.e.
the film genre, mentioned earlier) might partly prepare the viewer for plot-decoding at
each stage of the story and in this way diminish the necessity for actual subtitle reading.
4. Some methodological issues
There are also many practical questions about the measurable specificities of subtitles
related to this combined reading strategy that will have to be addressed in future
research, such as the positioning of text on the screen (both physically and temporally
speaking), the number of lines used in subtitling, and the impact of font type on
visibility and legibility. All these factors affect the time and speed of linear reading in
itself and, furthermore, the combined reading of multimodal documents.
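Such timing questions are often reasoned about in terms of reading speed. As a hypothetical sketch, assuming the common industry rule of thumb of roughly 12–15 characters per second (a value not measured in this study), a minimum display time per subtitle can be estimated as follows.

```python
def min_display_time(subtitle, cps=13.0, floor=1.0):
    """Minimum on-screen duration (in seconds) for a subtitle, given an
    assumed reading speed in characters per second (cps) and a minimum
    display floor. The 12-15 cps range is a common industry rule of
    thumb, not a value taken from this study."""
    chars = len(subtitle.replace("\n", ""))
    return max(floor, chars / cps)

# the two-line Swedish subtitle from Figure 19
two_liner = "Hur går det?\nJag är hungrig."
print(round(min_display_time(two_liner), 2))  # prints: 2.08
```

Any disruption of linear reading by pictorial saliencies would, in effect, lower the usable cps value, which is precisely why such constants should be validated with eye-tracking rather than assumed.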
Another important methodological issue for further eye-tracking tests concerns the
visual areas covered by parafoveal vision: screen size and the distance between viewer
and screen in eye-tracking test situations must be the same as in normal viewing
conditions. The usual distance (and common screen size) while viewing media on a
computer differs from both television and cinema contexts. In our test, this appeared
quite clearly with some of the observed documents, which were originally made to be
seen as larger images and from further away. None of the 9 subjects noticed the human
face hiding in the slide shown in Figure 20 during the test made on a standard office
computer screen, but everybody saw it immediately when it was projected onto the wide
classroom screen afterwards. So here again, maybe one should avoid being too
computer-centric.8
8 The question of the viewing angle can also be of great importance, especially in cinemas, where the viewers can sometimes be very far from the ideal central viewing point. (This appears to be a sensitive issue even from a technical perspective concerning 3D films.)
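The underlying geometry here is the visual angle, 2·arctan(size / (2·distance)). A quick calculation, with assumed dimensions and viewing distances, illustrates why an image on an office monitor and the same image projected on a classroom wall can cover quite different portions of the visual field.

```python
import math

def visual_angle_deg(size, distance):
    """Visual angle (in degrees) subtended by an object of a given size
    seen from a given distance (both in the same units)."""
    return math.degrees(2 * math.atan(size / (2 * distance)))

# assumed dimensions: a 33 cm-wide image on an office monitor at 60 cm,
# versus the same picture projected 10 m wide and viewed from 10 m
monitor = visual_angle_deg(0.33, 0.60)
projection = visual_angle_deg(10.0, 10.0)
print(round(monitor, 1), round(projection, 1))  # prints: 30.8 53.1
```

Matching this angle between the test situation and the intended viewing situation is what keeps parafoveal coverage comparable.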
Figure 20 – None of the subjects saw the face in the waves, because of insufficient distance from the screen during the test (analysed picture from: http://www.szilagyivallalat.hu/blog/wp-
content/uploads/2006/04/ClubMed-04.jpg).
Some conclusions
The present introductory study led to the following conclusions. First, in random picture
viewing, there is a strong human tendency to systematically fixate relevant
communicative elements such as faces, mouths, eyes, gazes and text. So when there is
no specific top-down search, the audience tends to seek a source of communication
on the screen.
Secondly, in an integrated multimodal meaning construction system such as film,
subtitle reading is not necessarily linear, but, rather, it varies according to visual
saliencies, top-down information searches, contradictory or cohesive elements, co-text
or narration understanding, and numerous other factors.
Thirdly, multimodal cross points (integrating picture, sound and text) are important in
the meaning construction process. Film directors make use of this visual-verbal
cohesion in order to obtain the expected readings. Hence, one important task for the
translator is to avoid overshadowing these cohesive structures, and this implies a global,
integrated approach to the translation of audiovisual documents.
References
Baccino, Thierry. 2004. La lecture électronique. Presses Universitaires de Grenoble. 254 p.
Baldry, Anthony and Thibault, Paul J. 2004. Multimodal Transcription and Text Analysis. London/Oakville: Equinox. 270 p.
Baumgarten, Nicole. 2008. “Yeah, that’s it!: Verbal Reference to Visual Information in Film Texts and Film Translations.” Meta, LIII, 1, p. 6–25.
Birmingham, Elina, Bischof, Walter F. and Kingstone, Alan. 2008. “Social attention and real-world scenes: The roles of action, competition and social content.” The Quarterly Journal of Experimental Psychology, 61, 7, p. 986–998.
Carmi, Ran and Itti, Laurent. 2006. “Visual causes versus correlates of attentional selection in dynamic scenes.” Vision Research, 46, p. 4333–4345.
d’Ydewalle, Géry, Praet, Caroline, Verfaillie, Karl and Van Rensbergen, Johan. 1991. “Watching Subtitled Television. Automatic Reading Behaviour.” Communication Research, 18, 5, p. 650–666.
Henderson, John M. and Hollingworth, Andrew. 1998. “Eye Movements during Scene Viewing: An Overview.” In G. Underwood (ed.), Eye Guidance in Reading and Scene Perception. Elsevier Science, p. 269–293.
Hollander, Régine. 2001. “Doublage et sous-titrage – Étude de cas: Natural Born Killers (Tueurs nés).” Revue française d’études américaines, 88, p. 79–88. On line: http://www.cairn.info/article.php?ID_REVUE=RFEA&ID_NUMPUBLIE=RFEA_088&ID_ARTICLE=RFEA_088_0079&REDIR=1 [accessed: 14.9.2009].
Itti, Laurent. 2005. “Quantifying the contribution of low-level saliency to human eye movements in dynamic scenes.” Visual Cognition, 12, 6, p. 1093–1123.
Krauzlis, Richard J. 2004. “Recasting the Smooth Pursuit Eye Movement System.” Journal of Neurophysiology, 91, p. 591–603.
Lautenbacher, Olli Philippe. Forthcoming (on line). “Film et sous-titrage. Pour une définition de l’unité de sens en tradaptation.” Actes du Colloque Représentation du sens linguistique IV (28.–30.5.2008), University of Helsinki.
Levinas, Emmanuel. 2000 [1961]. Totalité et Infini. Paris: Le livre de poche, p. XII.
Nielsen, Jakob. 2006. “F-Shaped Pattern for Reading Web Content.” On line: http://www.useit.com/alertbox/reading_pattern.html [accessed: 14.9.2009].
Yarbus, A. L. 1967 [1965]. Eye Movements and Vision. New York: Plenum Press. 222 p.