Post on 18-Dec-2015
Embed Size (px)
Contextual guidance of eye movements and attention in real-worldscenes: The role of global features on object search
Antonio TorralbaComputer Science and Artificial Intelligence Laboratory,
Massachusetts Institute of Technology
Aude OlivaDepartment of Brain and Cognitive Sciences,
Massachusetts Institute of Technology
Monica S. CastelhanoDepartment of Psychology and
Cognitive Science Program,Michigan State University
John M. HendersonDepartment of Psychology and
Cognitive Science Program,Michigan State University
Behavioral experiments have shown that the human visual system makes extensive use ofcontextual information for facilitating object search in natural scenes. However, the questionof how to formally model contextual influences is still open. Based on a Bayesian framework,we present an original approach of attentional guidance by global scene context. Two parallelpathways comprise the model; one pathway computes local features (saliency) and the othercomputes global (scene-centered) features. The Contextual Guidance model of attentioncombines bottom-up saliency, scene context and top-down mechanisms at an early stage ofvisual processing, and predicts the image regions likely to be fixated by human observersperforming natural search tasks in real world scenes.
Keywords: attention, eye movements, visual search, context, scene recognition
According to feature-integration theory (Treisman &Gelade, 1980) the search for objects requires slow serialscanning since attention is necessary to integrate low-levelfeatures into single objects. Current computational modelsof visual attention based on saliency maps have been inspiredby this approach, as it allows a simple and direct implemen-tation of bottom-up attentional mechanisms that are not taskspecific. Computational models of image saliency (Itti, Kock& Niebur, 1998; Koch & Ullman, 1985; Parkhurst, Law &Niebur, 2002; Rosenholtz, 1999) provide some predictionsabout which regions are likely to attract observers attention.These models work best in situations where the image it-self provides little semantic information and when no spe-cific task is driving the observers exploration. In real-worldimages, the semantic content of the scene, the co-occurrenceof objects, and task constraints have been shown to play akey role in modulating where attention and eye movement go(Chun & Jiang, 1998; Davenport & Potter, 2004; DeGraef,1992; Henderson, 2003; Neider & Zelinski, 2006; Noton &Stark, 1971; Oliva, Torralba, Castelhano & Henderson, 2004;Palmer, 1975; Tsotsos, Culhane, Wai, Lai, Davis & Nuflo,1995; Yarbus, 1967). Early work by Biederman, Mezzan-otte & Rabinowitz (1982) demonstrated that the violation oftypical item configuration slows object detection in a scene(e.g., a sofa floating in the air, see also DeGraef, Christianens& dYdewalle, 1990; Henderson, Weeks & Hollingworth,1999). Interestingly, human observers need not be explicitly
aware of the scene context to benefit from it. Chun, Jiang andcolleagues have shown that repeated exposure to the samearrangement of random elements produces a form of learn-ing that they call contextual cueing (Chun & Jiang, 1998,1999; Chun, 2000; Jiang, & Wagner, 2004; Olson & Chun,2002). When repeated configurations of distractor elementsserve as predictors of target location, observers are implic-itly cued to the position of the target in subsequent viewingof the repeated displays. Observers can also be implicitlycued to a target location by global properties of the imagelike color background (Kunar, Flusberg & Wolfe, 2006) andwhen learning meaningful scenes background (Brockmole &Henderson, 2006; Brockmole, Castelhano & Henderson, inpress; Hidalgo-Sotelo, Oliva & Torralba, 2005; Oliva, Wolfe& Arsenio, 2004).
One common conceptualization of contextual informationis based on exploiting the relationship between co-occurringobjects in real world environments (Bar, 2004; Biederman,1990; Davenport & Potter, 2004; Friedman, 1979; Hender-son, Pollatsek & Rayner, 1987). In this paper we discussan alternative representation of context that does not requireparsing a scene into objects, but instead relies on global sta-tistical properties of the image (Oliva & Torralba, 2001). Theproposed representation provides the basis for feedforwardprocessing of visual context that can be performed in par-allel with object processing. Global context can thus bene-fit object search mechanisms by modulating the use of thefeatures provided by local image analysis. In our Contex-
2 CONTEXTUAL GUIDANCE OF EYE MOVEMENTS AND ATTENTION IN REAL-WORLD SCENES
tual Guidance model, we show how contextual informationcan be integrated prior to the first saccade, thereby reducingthe number of image locations that need to be considered byobject-driven attentional mechanisms.
Recent behavioral and modeling research suggests thatearly scene interpretation may be influenced by global imageproperties that are computed by processes that do not requireselective visual attention (Spatial Envelope properties of ascene, Oliva & Torralba, 2001, statistical properties of objectsets, Ariely, 2001; Chong & Treisman, 2003). Behavioralstudies have shown that complex scenes can be identifiedfrom a coding of spatial relationships between componentslike geons (Biederman, 1995) or low spatial frequency blobs(Schyns & Oliva, 1994). Here we show that the structure ofa scene can be represented by the mean of global image fea-tures at a coarse spatial resolution (Oliva & Torralba, 2001,2006). This representation is free of segmentation and objectrecognition stages while providing an efficient shortcut forobject detection in the real world. Task information (search-ing for a specific object) modifies the way that contextualfeatures are used to select relevant image regions.
The Contextual Guidance model (Fig. 1) combines bothlocal and global sources of information within the sameBayesian framework (Torralba, 2003). Image saliency andglobal-context features are computed in parallel, in a feed-forward manner and are integrated at an early stage of visualprocessing (i.e., before initiating image exploration). Top-down control is represented by the specific constraints of thesearch task (looking for a pedestrian, a painting, or a mug)and it modifies how global-context features are used to selectrelevant image regions for exploration.
Model of object search andcontextual guidance
Scene Context recognition without object recogni-tion
Contextual influences can arise from different sources ofvisual information. On the one hand, context can be framedas the relationship between objects (Bar, 2004; Biederman,1990; Davenport & Potter, 2004; Friedman, 1979; Hender-son et al., 1987). According to this view, scene context isdefined as a combination of objects that have been associ-ated over time and are capable of priming each other to fa-cilitate scene categorization. To acquire this type of context,the observer must perceive a number of diagnostic objectswithin the scene (e.g., a bed) and use this knowledge to inferthe probable identities and locations of other objects (e.g., apillow). Over the past decade, research on the change blind-ness has shown that in order to perceive the details of an ob-ject, one must attend to it (Henderson & Hollingworth, 1999;Hollingworth, Schrock & Henderson, 2001; Hollingworth& Henderson, 2002; Rensink, 2000; Rensink, ORegan &Clark, 1997; Simons & Levin, 1997). In light of these re-sults, object-to-object context would be built as a serial pro-cess that will first require perception of diagnostic objectsbefore inferring associated objects. In theory, this process
could take place within an initial glance with attention beingable to grasp 3 to 4 objects within a 200 msec window (Vo-gel, Woodman & Luck, in press; Wolfe, 1998). Contextualinfluences induced by co-occurrence of objects have been ob-served in cognitive neuroscience studies. Recent work byBar and collaborators (2003, 2004) demonstrates that spe-cific cortical areas (a subregion of the parahippocampal cor-tex and the retrosplenial cortex) are involved in the analysisof contextual associations (e.g., a farm and a cow) and notmerely in the analysis of scene layout.
Alternatively, research has shown that scene context canbe built in a holistic fashion, without recognizing individualobjects. The semantic category of most real-world scenescan be inferred from their spatial layout only (e.g., an ar-rangement of basic geometrical forms such as simple Geonsclusters, Biederman, 1995; the spatial relationships betweenregions or blobs of particular size and aspect ratio, Oliva &Schyns, 2000; Sanocki & Epstein, 1997; Schyns & Oliva,1994). A blurred image in which object identities cannotbe inferred based solely on local information, can be veryquickly interpreted by human observers (Oliva & Schyns,2000). Recent behavioral experiments have shown that evenlow level features, like the spatial distribution of coloredregions (Goffaux & et al., 2005; Oliva & Schyns, 2000;Rousselet, Joubert & Fabre-Thorpe, 2005) or the distribu-tion of scales and orientations (McCotter, Gosselin, Cotter& Schyns, 2005) can reliably predict the semantic classes ofreal world scenes. Scene comprehension and more gener-ally recognition of objects in scenes can occur very quickly,without much need for attentional resources. This rapid un-derstanding phenomenon has been observed under differentexperimental conditions, where the perception of the imageis difficult or degraded, like during RSVP tasks (Evans &Treisman, 2006; Potter, 1976; Potter, Staub & OConnor,2004), very short presentation time (Thorpe et al., 1996),