Eye Movements



<ul><li><p>Contextual guidance of eye movements and attention in real-world scenes: The role of global features in object search</p><p>Antonio Torralba, Computer Science and Artificial Intelligence Laboratory,</p><p>Massachusetts Institute of Technology</p><p>Aude Oliva, Department of Brain and Cognitive Sciences,</p><p>Massachusetts Institute of Technology</p><p>Monica S. Castelhano, Department of Psychology and</p><p>Cognitive Science Program, Michigan State University</p><p>John M. Henderson, Department of Psychology and</p><p>Cognitive Science Program, Michigan State University</p><p>Behavioral experiments have shown that the human visual system makes extensive use of contextual information to facilitate object search in natural scenes. However, the question of how to formally model contextual influences is still open. Based on a Bayesian framework, we present an original approach to attentional guidance by global scene context. The model comprises two parallel pathways; one pathway computes local features (saliency) and the other computes global (scene-centered) features. The Contextual Guidance model of attention combines bottom-up saliency, scene context and top-down mechanisms at an early stage of visual processing, and predicts the image regions likely to be fixated by human observers performing natural search tasks in real-world scenes.</p><p>Keywords: attention, eye movements, visual search, context, scene recognition</p><p>Introduction</p><p>According to feature-integration theory (Treisman &amp; Gelade, 1980), the search for objects requires slow serial scanning, since attention is necessary to integrate low-level features into single objects. Current computational models of visual attention based on saliency maps have been inspired by this approach, as it allows a simple and direct implementation of bottom-up attentional mechanisms that are not task specific. 
Computational models of image saliency (Itti, Koch &amp; Niebur, 1998; Koch &amp; Ullman, 1985; Parkhurst, Law &amp; Niebur, 2002; Rosenholtz, 1999) provide some predictions about which regions are likely to attract observers' attention. These models work best in situations where the image itself provides little semantic information and when no specific task is driving the observer's exploration. In real-world images, the semantic content of the scene, the co-occurrence of objects, and task constraints have been shown to play a key role in modulating where attention and eye movements go (Chun &amp; Jiang, 1998; Davenport &amp; Potter, 2004; De Graef, 1992; Henderson, 2003; Neider &amp; Zelinsky, 2006; Noton &amp; Stark, 1971; Oliva, Torralba, Castelhano &amp; Henderson, 2004; Palmer, 1975; Tsotsos, Culhane, Wai, Lai, Davis &amp; Nuflo, 1995; Yarbus, 1967). Early work by Biederman, Mezzanotte &amp; Rabinowitz (1982) demonstrated that the violation of typical item configuration slows object detection in a scene (e.g., a sofa floating in the air; see also De Graef, Christiaens &amp; d'Ydewalle, 1990; Henderson, Weeks &amp; Hollingworth, 1999). Interestingly, human observers need not be explicitly aware of the scene context to benefit from it. Chun, Jiang and colleagues have shown that repeated exposure to the same arrangement of random elements produces a form of learning that they call contextual cueing (Chun &amp; Jiang, 1998, 1999; Chun, 2000; Jiang &amp; Wagner, 2004; Olson &amp; Chun, 2002). When repeated configurations of distractor elements serve as predictors of target location, observers are implicitly cued to the position of the target in subsequent viewing of the repeated displays. 
Observers can also be implicitly cued to a target location by global properties of the image, such as background color (Kunar, Flusberg &amp; Wolfe, 2006), and when learning the backgrounds of meaningful scenes (Brockmole &amp; Henderson, 2006; Brockmole, Castelhano &amp; Henderson, in press; Hidalgo-Sotelo, Oliva &amp; Torralba, 2005; Oliva, Wolfe &amp; Arsenio, 2004).</p><p>One common conceptualization of contextual information is based on exploiting the relationship between co-occurring objects in real-world environments (Bar, 2004; Biederman, 1990; Davenport &amp; Potter, 2004; Friedman, 1979; Henderson, Pollatsek &amp; Rayner, 1987). In this paper we discuss an alternative representation of context that does not require parsing a scene into objects, but instead relies on global statistical properties of the image (Oliva &amp; Torralba, 2001). The proposed representation provides the basis for feedforward processing of visual context that can be performed in parallel with object processing. Global context can thus benefit object search mechanisms by modulating the use of the features provided by local image analysis. In our Contextual Guidance model, we show how contextual information can be integrated prior to the first saccade, thereby reducing the number of image locations that need to be considered by object-driven attentional mechanisms.</p><p>Recent behavioral and modeling research suggests that early scene interpretation may be influenced by global image properties that are computed by processes that do not require selective visual attention (Spatial Envelope properties of a scene, Oliva &amp; Torralba, 2001; statistical properties of object sets, Ariely, 2001; Chong &amp; Treisman, 2003). 
Behavioral studies have shown that complex scenes can be identified from a coding of spatial relationships between components like geons (Biederman, 1995) or low spatial frequency blobs (Schyns &amp; Oliva, 1994). Here we show that the structure of a scene can be represented by the mean of global image features at a coarse spatial resolution (Oliva &amp; Torralba, 2001, 2006). This representation is free of segmentation and object recognition stages while providing an efficient shortcut for object detection in the real world. Task information (searching for a specific object) modifies the way that contextual features are used to select relevant image regions.</p><p>The Contextual Guidance model (Fig. 1) combines both local and global sources of information within the same Bayesian framework (Torralba, 2003). Image saliency and global-context features are computed in parallel, in a feed-forward manner, and are integrated at an early stage of visual processing (i.e., before initiating image exploration). Top-down control is represented by the specific constraints of the search task (looking for a pedestrian, a painting, or a mug) and it modifies how global-context features are used to select relevant image regions for exploration.</p><p>Model of object search and contextual guidance</p><p>Scene context recognition without object recognition</p><p>Contextual influences can arise from different sources of visual information. On the one hand, context can be framed as the relationship between objects (Bar, 2004; Biederman, 1990; Davenport &amp; Potter, 2004; Friedman, 1979; Henderson et al., 1987). According to this view, scene context is defined as a combination of objects that have been associated over time and are capable of priming each other to facilitate scene categorization. 
To acquire this type of context, the observer must perceive a number of diagnostic objects within the scene (e.g., a bed) and use this knowledge to infer the probable identities and locations of other objects (e.g., a pillow). Over the past decade, research on change blindness has shown that in order to perceive the details of an object, one must attend to it (Henderson &amp; Hollingworth, 1999; Hollingworth, Schrock &amp; Henderson, 2001; Hollingworth &amp; Henderson, 2002; Rensink, 2000; Rensink, O'Regan &amp; Clark, 1997; Simons &amp; Levin, 1997). In light of these results, object-to-object context would be built as a serial process that first requires perception of diagnostic objects before inferring associated objects. In theory, this process could take place within an initial glance, with attention being able to grasp 3 to 4 objects within a 200 msec window (Vogel, Woodman &amp; Luck, in press; Wolfe, 1998). Contextual influences induced by co-occurrence of objects have been observed in cognitive neuroscience studies. Recent work by Bar and collaborators (2003, 2004) demonstrates that specific cortical areas (a subregion of the parahippocampal cortex and the retrosplenial cortex) are involved in the analysis of contextual associations (e.g., a farm and a cow) and not merely in the analysis of scene layout.</p><p>Alternatively, research has shown that scene context can be built in a holistic fashion, without recognizing individual objects. The semantic category of most real-world scenes can be inferred from their spatial layout alone (e.g., an arrangement of basic geometrical forms such as simple geon clusters, Biederman, 1995; the spatial relationships between regions or blobs of particular size and aspect ratio, Oliva &amp; Schyns, 2000; Sanocki &amp; Epstein, 1997; Schyns &amp; Oliva, 1994). 
A blurred image in which object identities cannot be inferred based solely on local information can be very quickly interpreted by human observers (Oliva &amp; Schyns, 2000). Recent behavioral experiments have shown that even low-level features, like the spatial distribution of colored regions (Goffaux et al., 2005; Oliva &amp; Schyns, 2000; Rousselet, Joubert &amp; Fabre-Thorpe, 2005) or the distribution of scales and orientations (McCotter, Gosselin, Cotter &amp; Schyns, 2005), can reliably predict the semantic classes of real-world scenes. Scene comprehension, and more generally recognition of objects in scenes, can occur very quickly, without much need for attentional resources. This rapid understanding phenomenon has been observed under different experimental conditions where the perception of the image is difficult or degraded, like during RSVP tasks (Evans &amp; Treisman, 2006; Potter, 1976; Potter, Staub &amp; O'Connor, 2004), very short presentation times (Thorpe et al., 1996), backward masking (Bacon-Mace, Mace, Fabre-Thorpe &amp; Thorpe, 2005), dual-task conditions (Li, VanRullen, Koch &amp; Perona, 2002) and blur (Schyns &amp; Oliva, 1994; Oliva &amp; Schyns, 1997). Cognitive neuroscience research has shown that these recognition events can occur within 150 msec of image onset (Delorme, Rousselet, Mace &amp; Fabre-Thorpe, 2003; Goffaux et al., 2005; Johnson &amp; Olshausen, 2005; Thorpe, Fize &amp; Marlot, 1996). This establishes an upper bound on how fast natural image recognition can be performed by the visual system, and suggests that natural scene recognition can be implemented within a feed-forward mechanism of information processing. 
The global features approach described here may be part of a feed-forward mechanism of semantic scene analysis (Oliva &amp; Torralba, 2006).</p><p>Correspondingly, computational modeling work has shown that real-world scenes can be interpreted as a member of a basic-level category based on holistic mechanisms, without the need for segmentation and grouping stages (Fei-Fei &amp; Perona, 2005; Oliva &amp; Torralba, 2001; Walker Renninger &amp; Malik, 2004; Vogel &amp; Schiele, in press). This scene-centered approach is consistent with a global-to-local image analysis (Navon, 1977) where the processing of the global structure and the spatial relationships among components precedes the analysis of local details. Cognitive neuroscience studies have acknowledged the possible independence between processing a whole scene and processing local objects within an image. The parahippocampal place area (PPA) is sensitive to the scene layout and remains unaffected by the visual complexity of the image (Epstein &amp; Kanwisher, 1998), a virtue of the global feature coding proposed in the current study. The PPA is also sensitive to scene processing that does not require attentional resources (Marois, Yi &amp; Chun, 2004). Recently, Goh, Siong, Park, Gutchess, Hebrank &amp; Chee (2004) showed activation in different brain regions when a picture of a scene background was processed alone, compared to backgrounds that contained a prominent and semantically consistent object. 
Whether the two approaches to scene context, one based on holistic global features and the other based on object associations, recruit different brain regions (for reviews, see Bar, 2004; Epstein, 2005; Kanwisher, 2003) or instead recruit a similar mechanism processing spatial and conceptual associations (Bar, 2004) is a challenging question for insights into scene understanding.</p><p>A scene-centered approach to context would not preclude a parallel object-to-object context; rather, it would serve as a feed-forward pathway of visual processing, describing spatial layout and conceptual information (e.g., scene category, function) without the need to segment objects. In this paper, we provide a computational implementation of a scene-centered approach to scene context, and show its performance in predicting eye movements during a number of ecological search tasks.</p><p>In the next section we present a contextual model for object search that incorporates global features (a scene-centered context representation) and local image features (salient regions).</p><p>Model of object search and contextual guidance</p><p>We summarize a probabilistic framework of attentional guidance that provides, for each image location, the probability of target presence by integrating global and local image information and task constraints. Attentional mechanisms such as image saliency and contextual modulation emerge as a natural consequence of such a model (Torralba, 2003).</p><p>There has been extensive research on the relationship between eye movements and attention, and it is well established that shifts of attention can occur independently of eye movements (for reviews see Henderson, 2005; Liversedge &amp; Findlay, 2000; Rayner, 1998). 
Furthermore, the planning of an eye movement is itself thought to be preceded by a shift of covert attention to the target location before the actual movement is deployed (Deubel &amp; Schneider, 1996; Hoffman &amp; Subramaniam, 1995; Kowler, Anderson, Dosher &amp; Blaser, 1995; Rayner, McConkie &amp; Ehrlich, 1978; Rayner, 1998; Remington, 1980). However, previous studies have also shown that with natural scenes and other complex stimuli (such as reading), the cost of moving the eyes to shift attention is less than that of shifting attention covertly, which has led some to posit that studying covert and overt attention as separate processes in these cases is misguided (Findlay, 2004). The model proposed in the current study attempts to predict the image regions that will be explored by covert and overt attentional shifts, but performance of the model is evaluated with overt attention as measured with eye movements.</p><p>In the case of a search task in which we have to look for a target embedded in a scene, the goal is to identify whether the target is present or absent, and if present, to indicate where it is located. An ideal observer will fixate the image locations that have the highest probability of containing the target object given the available image information. Therefore, detection can be formulated as the evaluation of the probability function p(O, X | I), where I is the set of features extracted from the image, O is a binary variable where O = 1 denotes target present, and O =...</p></li></ul>
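The ideal-observer formulation above can be illustrated with a minimal numerical sketch: a bottom-up saliency map is combined with a task-dependent context prior over image locations, and the location with the highest resulting probability is taken as the predicted first fixation. This is not the authors' implementation; the multiplicative combination with a flattening exponent `gamma`, the 8x8 grid, and the lower-field prior (as one might use for a pedestrian search) are all illustrative assumptions.

```python
import numpy as np

def contextual_guidance_map(saliency, context_prior, gamma=0.3):
    """Combine a bottom-up saliency map with a scene-context prior.

    Hypothetical sketch of the Bayesian combination described in the
    text: p(X | image) is taken proportional to
    saliency(X)**gamma * p(X | global context),
    where gamma < 1 flattens the saliency term (an illustrative
    choice, not a fitted value).
    """
    s = np.clip(saliency, 1e-12, None) ** gamma   # local (saliency) pathway
    posterior = s * context_prior                  # modulated by global context
    return posterior / posterior.sum()             # normalize to a distribution

# Toy example: an 8x8 "image"; for a pedestrian search the context
# prior might concentrate probability on the lower half of the scene.
rng = np.random.default_rng(0)
saliency = rng.random((8, 8))
prior = np.ones((8, 8))
prior[4:, :] = 4.0                                 # favor the lower rows
prior /= prior.sum()

guided = contextual_guidance_map(saliency, prior)
row, col = np.unravel_index(np.argmax(guided), guided.shape)
print(row, col)  # predicted first-fixation location (row, col)
```

Because the two terms are multiplied, a salient region in an implausible location (e.g., the sky, for a pedestrian target) is suppressed by the context prior, which is the qualitative behavior the Contextual Guidance model is designed to capture.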

