
Vision needs non-conceptual connections to objects in the world (just as concepts do)

Zenon Pylyshyn

Rutgers Center for Cognitive Science

Introduction to a theory of visual indexes (aka FINSTs)

Plan of talk: Visual Indexes

• Theoretical motivations behind the FINST theory
  – The need for a primitive mechanism of individuation
  – Because individuation must be of distal objects, we have the Correspondence Problem: when do two proximal tokens correspond to the same distal object?
  – A special case: incremental construction of visual representations

• Empirical studies of individuation and indexing
  – Object-specific effects (static & moving objects)
  – The Multiple Object Tracking technique

• What, if any, encoded properties are used to individuate, index, and track objects?

• Visual Indexes (FINSTs) and what they mean for vision science and cognitive science

The need for a mechanism that individuates objects

Individuating distal objects requires solving the correspondence problem.

A special case of the correspondence problem occurs when visual representations are constructed over time.

Examples of solving the correspondence problem

Object-based allocation of visual attention

Multiple Object Tracking and Visual Indexes: what it means for connecting vision and the world

An important function of early vision is to individuate and select token elements (let’s call them “objects” for now)

The most basic perceptual operation is the individuation and selection that precedes the formulation of perceptual judgments.

Making visual judgments presupposes that the things (objects) that judgments are about have been individuated and selected (or indexed – i.e., made accessible). Another way to put this is that the arguments of perceptual predicates P(x, y, z, …) must be bound to things in the world in order for the judgment to have perceptual content.

Several objects must be picked out at once in relational judgments

For example, when we judge that certain objects are collinear, we must select (and the visual system must be able to refer to) the relevant individual objects.

Several objects must be picked out at once in relational judgments

The same is true for other relational predicates, such as inside or on-the-same-contour: we must pick out the relevant individual objects first.

Several objects must be picked out at once in numerical judgments

In subitizing, the cardinality of sets of 4 or fewer items can be judged rapidly and accurately; enumerating more than 4 is slow and error-prone.

Subitizing only occurs if items can be automatically individuated.

Enumerating different layouts of squares
(Trick, L. M., & Pylyshyn, Z. W. (1994). Why are small and large numbers enumerated differently? A limited capacity preattentive stage in vision. Psychological Review, 101(1), 80-102.)

Another property that cannot be used to subitize: on-same-contour

Individuation is different from discrimination

How do we select (and index) objects in our field of view?

The principal way we select individual objects is by foveating them – by looking directly at them (Notice that this results in a deictic reference).

We can also select with focal attention, which is independent of direction of gaze.

Focal attention appears to be unitary, yet we can select more than one thing at a time (e.g., in making a relational judgment). So it seems that we need to distinguish attending from selecting: That’s where Visual Indexes or FINSTs come in.

A question for later: In virtue of what properties are primitive objects individuated and indexed?

Indexes must individuate and select objects in the world. This leads to the ubiquitous correspondence problem in vision

Apparent motion, stereo vision, tracking, and very many visual computations face the problem of identifying which proximal image-features correspond to the same individual distal object.

Less well known is the correspondence problem faced when a single visual representation is constructed incrementally over time.

The way the correspondence problem is solved determines what the vision system counts as an individual. These primitive individuals (called “objects”) are thus mind-dependent.

Example of the correspondence problem for apparent motion

The gray disks correspond to the first flash and the black ones to the second flash. Which of the 24 possible matches will the visual system select as the solution to this correspondence problem? What principle does it use? (Dawson & Pylyshyn, 1988)

[Figure: two candidate solutions – curved matches vs. linear matches]
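To make the combinatorics concrete, here is a brute-force sketch (an illustration with made-up coordinates, not code from the study). With four elements per flash there are 4! = 24 candidate matchings; the sketch scores them by total displacement, which is just one simple candidate principle. Which principle the visual system actually uses is what the experiments were designed to reveal.

```python
import itertools
import math

# Brute-force enumeration of the apparent-motion correspondence problem.
# Coordinates are made up for illustration; "minimize total displacement"
# is one candidate principle, not the study's conclusion.
first = [(0, 0), (2, 1), (4, 0), (6, 1)]    # gray disks (first flash)
second = [(1, 2), (3, 3), (5, 2), (7, 3)]   # black disks (second flash)

def total_displacement(matching):
    # Sum of motion distances if each first-flash disk moved to its match.
    return sum(math.dist(a, b) for a, b in zip(first, matching))

candidates = list(itertools.permutations(second))   # all 4! = 24 matchings
best = min(candidates, key=total_displacement)
print(len(candidates), "candidate matchings")
print("minimum-displacement solution:", list(zip(first, best)))
```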

One of the most troubling forms of the correspondence problem occurs because visual representations are constructed incrementally over time

It is clear that when vision requires eye movements, a visual representation is constructed incrementally. But there is also evidence that percepts are built up over time even for the automatic perception of simple forms. So this type of correspondence problem is routine in vision. Why does it constitute a special problem?

Example: Drawing a diagram and noticing its properties

Some of the distinct “views” while exploring the diagram

The correspondence problem for incremental construction of a visual representation

When a property F of some particular individual (token) object O is noticed or encoded, the visual system must check whether object O is already represented. If it is, the new property must be associated with the existing representation of O.

If the only way to identify a particular individual object O is by its description, then the way to solve this correspondence problem is to find an object in memory that bears a particular description (one that had been unique at the time). Which description? If objects can change their properties, we don’t know under what description the object was last stored. Perhaps we look for an object with a description that overlaps the present one, or perhaps we construct a description that somehow incorporates time.

The correspondence problem for incremental construction of a visual representation

Even if it were otherwise feasible to solve the correspondence problem by searching for a unique past description, this would in general be computationally intractable (technically, matching descriptions is an NP-hard problem). In any case it is unlikely that this is what our visual system does, for many reasons – e.g., we do not in general find it more difficult to perceive a scene that has many identical parts, as this technique would predict (since it would then be more difficult to find a unique descriptor for each object, and the correspondence problem would quickly grow in complexity).
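A toy sketch of the point (my illustration of the argument, not code from the talk): as soon as a scene contains identical parts, a stored description stops picking out a unique individual, and description-based correspondence becomes underdetermined.

```python
# Objects represented so far, each stored under a description.
stored = {
    "obj1": {"shape": "square", "color": "red"},
    "obj2": {"shape": "square", "color": "red"},   # an identical part
    "obj3": {"shape": "circle", "color": "blue"},
}

def find_by_description(desc):
    """Return every stored object whose description matches desc."""
    return [name for name, d in stored.items() if d == desc]

# A new property is noticed on *some particular* red square. To which
# stored representation should it be attached? The description cannot say.
print(find_by_description({"shape": "square", "color": "red"}))
# -> ['obj1', 'obj2']: correspondence is underdetermined
```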

In virtue of what visual properties are objects individuated?

The most plausible property used in selecting and accessing an object is its location (this is often the only unique property available). The notion of a pointer suggests the use of location-as-access.

Virtually all theories of visual attention and property detection assume that we access an object’s properties by first retrieving its location.

But… although there is a great deal of evidence for the priority of encoding location, this does not show that properties must be accessed via their location.

In studies in which objects remain stationary, location is confounded with individuality since in these cases being at a particular location is coextensive with being a particular individual.

But there is also recent evidence that we can access an object’s properties solely by virtue of the object’s persistence qua individual. This is referred to as object-based attention.

Unconfounding location and individuality

There are at least two possible ways to unconfound location and individuality:
1. Use moving objects
2. Use "objects" whose identity and/or 'motion' is independent of their spatial location

1. Moving objects
   • Object-specific priming (Object Files)
   • Object-specific Inhibition of Return *
   • Simultanagnosia & Visual Neglect *
   • Multiple Object Tracking (MOT)

2. Spatially coincident objects
   • Single-object advantage *
   • Tracking in "feature space"

Distinguishing access-by-location and access-by-individual

* Some of these may be omitted for lack of time

Object-specific Priming (object-file theory)
(Kahneman, D., Treisman, A., & Gibbs, B. J. (1992). The reviewing of object files: Object-specific integration of information. Cognitive Psychology, 24(2), 175-219.)

Moving object studies…

Sequence of displays in a simple Object-Priming experiment

Object File example: Wrong letter in box

Object File example: Correct letter in box

Inhibition of Return (Tipper et al., 1991)

Moving object studies…

When the cue-target interval is between 300 ms and 900 ms, it takes longer to detect a cued target than an uncued one – this is called Inhibition of Return.

“Inhibition of Return” moves with the object that is inhibited

Moving object studies…

If the cued object moves, the Inhibition of Return moves with it.

How do we do it? What properties of individual objects do we use?

Multiple Object Tracking Experiments

People can track 5 or more objects under a wide variety of conditions

Objects don’t even have to avoid collisions!

Objects can even disappear from view, as long as they do it in the right way

There must be local evidence of an occluding surface.

A possible location-updating tracking algorithm

1. While the targets are visually distinct, scan attention to each target and encode its location on a list. When the targets begin to move:

2. For n = 1 to 4: check the n'th position in the list and retrieve the location Loc(n) stored there.

3. Go to location Loc(n). Find the element closest to Loc(n).

4. Update the n'th position on the list with the actual location of the element found in step 3. This becomes the new value of Loc(n).

5. Move attention to the location encoded in the next list position, Loc(n+1).

6. Repeat from step 2 until the elements stop moving.

7. Go to each Loc(n) in turn and report the elements located there.

We compared the above algorithm with human performance on the very same displays. The simulation assumes that (1) focal attention is required to encode locations (i.e., encoding is not parallel) and (2) focal attention is unitary and has to be scanned from location to location. It assumes no encoding (or dwell) time at each element.

Predicted performance of the location updating algorithm as a function of attention scanning speed
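A minimal, self-contained sketch of the serial location-updating algorithm above (an illustration, not the simulation actually used in the study; the display parameters and the attention scan speed are assumed values). Tracking accuracy falls as objects move faster or the scan-speed budget shrinks, which is the relationship the predicted-performance curve summarizes.

```python
import math
import random

NUM_OBJECTS = 8      # elements in the display (assumed)
NUM_TARGETS = 4      # targets to track
OBJ_SPEED = 5.0      # object speed, pixels per time step (assumed)
SCAN_SPEED = 150.0   # attention travel budget, pixels per time step (assumed)
ARENA = 400.0        # square display, pixels (assumed)
STEPS = 200          # trial length in time steps

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def move(p):
    # Brownian-style motion; positions are clamped to the display.
    ang = random.uniform(0.0, 2.0 * math.pi)
    p[0] = min(max(p[0] + OBJ_SPEED * math.cos(ang), 0.0), ARENA)
    p[1] = min(max(p[1] + OBJ_SPEED * math.sin(ang), 0.0), ARENA)

def run_trial():
    objects = [[random.uniform(0, ARENA), random.uniform(0, ARENA)]
               for _ in range(NUM_OBJECTS)]
    # Step 1: encode each target's starting location on a list.
    locs = [obj[:] for obj in objects[:NUM_TARGETS]]
    attn = locs[0][:]          # current position of (unitary) focal attention
    n = 0                      # list position to be updated next
    for _ in range(STEPS):
        for obj in objects:
            move(obj)
        travel = SCAN_SPEED    # no dwell time: budget spent only on movement
        for _ in range(4 * NUM_TARGETS):        # bounded visits per time step
            d = dist(attn, locs[n])             # step 2: retrieve Loc(n)
            if d > travel:
                break                           # can't reach Loc(n) this step
            travel -= d
            attn = locs[n][:]                   # step 3: go to Loc(n)
            nearest = min(objects, key=lambda o: dist(attn, o))
            locs[n] = nearest[:]                # step 4: update Loc(n)
            n = (n + 1) % NUM_TARGETS           # step 5: next list position
    # Step 7: count how many list entries still sit on actual targets.
    return sum(
        min(range(NUM_OBJECTS), key=lambda i: dist(loc, objects[i])) < NUM_TARGETS
        for loc in locs)

if __name__ == "__main__":
    random.seed(1)
    scores = [run_trial() for _ in range(200)]
    print("mean targets still tracked:", sum(scores) / len(scores), "of", NUM_TARGETS)
```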

What properties are used in(a) selecting objects, and (b) tracking objects?

Notice that these are different operations and need not involve the same properties

What properties can be used to select (index) an object in MOT?

We have evidence that, under certain conditions, selecting objects can be done either automatically or voluntarily.

• Automatic selection requires "popout" features (sudden appearance, motion, stereo depth, etc.).

• Voluntary selection can use any discriminable property, but the objects must be attended serially and the property must be available long enough for this to occur (Annan study).

Role of object properties

Role of object properties (continued)

What properties can be used to track indexed objects?

We have some evidence that observers do not encode or use intrinsic object properties (e.g., color, shape) during tracking: when we stop and ask, observers cannot tell us what properties the objects had, and they do not notice when properties like color or shape change during occlusion;

There is some evidence that tracking occurs (at least for small numbers of objects) even if it is not task-relevant (e.g., object-based priming and IOR);

We have some evidence that when objects differ in non-identifying (asynchronously changing) properties, they are not tracked any better than if they do not differ in these properties.

Role of object properties (continued)

What properties can be used to track indexed objects?

We have some evidence that observers do not use an encoding of objects' trajectories in tracking – i.e., tracking is not predictive (Brian Keane). In the tested conditions, all objects disappeared for t milliseconds (up to half a second) and then reappeared:

• where they would have been at that time (worst)
• where they were when they disappeared (best)
• where they were t ms previously (almost as good as the above)
• all shifted left, right, up, or down by the same distance

Targets are tracked most poorly when they reappear where they would have been at that time, best when they reappear where they disappeared, and at intermediate levels for the other conditions.

Role of object properties (continued)

Do observers use some version of object locations for tracking?

It has been suggested that perhaps instead of using the location-updating method to track, observers respond to the objects’ “spatiotemporal trajectory” property (e.g., to their “space-time worms”).

Spatiotemporal continuity as a property that is used in tracking

Could a mechanism respond to spatiotemporal continuity without responding to object identity? The notion of a spatiotemporal trajectory presupposes that it is the trajectory of a single individual object, and not a sequence of time-slices of different objects. Therefore it assumes that the individual object has been selected and tracked. Responding to a spatiotemporal trajectory may be the same as tracking an object's identity.

Another way to unconfound individuality and location

Can we attend to objects that are not distinguished by their location? Single-object advantage studies

Can we track (generalized) objects that do not move through real space, but move through some other property space?

Observers can track non-spatial ‘virtual objects’ that move through a ‘property space’: Tracking superimposed surfaces

Blaser, Pylyshyn & Holcombe (2000)

Two superimposed Gabor patches that vary in spatial frequency, color and angle

[Figures: the feature dimensions along which the surfaces change; the two surfaces move randomly through "feature-space" (snapshots taken every 250 ms)]

Such generalized ‘objects’ can be tracked individually, and they also show single-object superiority for change detection.

Some speculations about what vision needs and what the Early Vision module may provide (1)

1. We need a mechanism that puts us in causal contact with distal objects in a visual scene – a contact that does not depend on the object satisfying a certain (conceptual) description, but on a brute causal connection.

We need such a connection in order to connect vision and action.

We need such a connection in order to ground concepts to their instances.*

Speculations on what vision needs and what the visual module may provide (2)

2. We need a mechanism that keeps track of the identity of distal objects without using their encoded properties – this happens whenever the correspondence problem is solved. Such a mechanism realizes a rudimentary identity-tracker, with its own internal 'rules'.

3. This is not a general identity-maintenance process; it will not allow you to recognize the identity of a person in a picture and a person on the street. But it may provide a way to maintain same-objecthood within the modular early vision system. There is also this tantalizing fact: there is evidence for such a mechanism in babies as young as 4 months (Leslie, Spelke)!

Other studies: Implications for visually-controlled action, infant cognition, and robotics

A short tour of research in which the notion of deictic (or indexical) reference has been appealed to.

Ballard, Hayhoe et al.’s proposal for a “deictic strategy”

People appear to use their direction-of-gaze as a reference point in encoding patterns and would prefer to make more eye movements rather than devote extra effort to memorizing a simple pattern.

Ballard, D. H., Hayhoe, M. M., et al. (1997). Deictic codes for the embodiment of cognition. Behavioral and Brain Sciences, 20(4), 723-767.

Use of deictic pointers in the Ballard et al. study

The task is to copy the model by getting blocks from the resource and constructing a copy of the model in the workspace

If subjects memorized and copied 2-block patterns, it would take them 4 glances. Instead, subjects made 18 fixations into the model and did not memorize more than they needed for the next basic one-block action. The most common sequence was: fixate model; fixate and pick up block; fixate model; fixate workspace and drop off block (M-P-M-D). If the color or location of a block changed during the sequence, the change went undetected, showing that the color/location of the other blocks was not encoded.
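A self-contained sketch (not Ballard et al.'s task code) contrasting the two strategies; the fixation counts and the 2×2 pattern are illustrative assumptions. The point it makes is the trade-off: the deictic agent makes many more fixations but holds at most one item in memory at a time.

```python
# Contrast the deictic M-P-M-D strategy with a memorize-the-whole-pattern
# strategy in the block-copying task. Fixation counting is schematic.

model = {(0, 0): "red", (1, 0): "blue", (0, 1): "green", (1, 1): "red"}

def deictic_copy(model):
    workspace, fixations = {}, 0
    for pos in model:                 # one block per M-P-M-D cycle
        fixations += 1                # M: fixate model, encode one color
        color = model[pos]
        fixations += 1                # P: fixate resource, pick up that color
        fixations += 1                # M: fixate model again, encode position
        fixations += 1                # D: fixate workspace, drop the block
        workspace[pos] = color
    return workspace, fixations

def memorizing_copy(model):
    workspace, fixations = {}, 0
    fixations += 1                    # one long fixation to memorize everything
    memory = dict(model)              # requires storing the full pattern
    for pos, color in memory.items():
        fixations += 1                # fixate workspace and place each block
        workspace[pos] = color
    return workspace, fixations

_, f = deictic_copy(model)
print("deictic:  ", f, "fixations, memory load ~1 item at a time")
_, f = memorizing_copy(model)
print("memorize: ", f, "fixations, memory load =", len(model), "items")
```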

The strategy of using where the eye points as the reference for the memory representation is inefficient from the perspective of the number of eye movements required, but it appears to be more efficient from the point of view of the memory cost.

This result illustrates the habitual use of a deictic strategy, wherein pointing into a real scene takes precedence over memorization.

Ballard, Hayhoe et al. call this method of exercising perceptual motor skills the “deictic strategy”

People appear to use direction-of-gaze as the reference point in encoding patterns and would prefer to make more eye movements than memorize even a very simple pattern.

But notice that subjects need to be able to move their gaze back to where they left off, so they need more than one deictic reference pointer, just as FINST theory postulates.

The use of deictic references is a very general strategy, not only because of the cost of storing a complex spatial representation, but also because the information is then in the right form for action – for the command "pick that up", where the demonstrative refers to what is being attended or foveated, which may remain unrecognized and indeed even unconceptualized in any way.

Relation to work on infants’ sensitivity to the cardinality of sets of objects

Alan Leslie’s “Object Indexes”

Infants as young as 4 months of age show surprise (longer looking time) when they watch two things being placed behind a screen and the screen is then lifted to reveal only one thing. Below 10 months of age they are in general not surprised when the screen is lifted to reveal two things that are different from the ones they saw being placed behind the screen, so long as the numerosity is correct.

In some cases, infants (age 10 months) use the difference in color of the objects they are shown one-at-a-time to infer their numerosity, but they do not record the colors and use them to identify the objects that are revealed when the screen is lifted.

Leslie & Tremoulet: Infants aged 10 and 12 months are shown a red and then a green object, which are then hidden behind a screen. The 10-month-old is surprised if raising the screen reveals the wrong number of objects, but not if it reveals objects of the wrong color. Color is used to individuate objects, but not to keep track of them! At 12 months, children can use color to keep track of how many objects went behind the screen.

Object Indexes in infant enumeration
(Leslie, A. M., Xu, F., Tremoulet, P. D., & Scholl, B. J. (1998). Indexing and the object concept: Developing 'what' and 'where' systems. Trends in Cognitive Sciences, 2(1), 10-18.)

Leslie used the notion of an “object index” (which is the same as a FINST) to explain these results. According to his account, babies set up an index to each object they attend to. When the object disappears behind the screen, the index remains active. When objects reappear and the indexes are not matched one-one, it creates a failure of expectation, which leads to longer looking times.

Much remains unspecified (e.g. what do the indexes point to when the objects are hidden?) but the appeal to indexes is consistent with the apparent abstraction to the numerical identity of objects.
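A toy sketch of the index-based account (my illustration, not Leslie's model): one index is assigned per attended object, surprise arises when the revealed objects cannot be matched one-one with the active indexes, and there is no sensitivity to property changes.

```python
# Indexes track numerical identity only, not properties like color,
# so a one-one match succeeds even when every property has changed.

def expectation_violated(active_indexes, revealed_objects):
    """True when revealed objects can't be matched one-one to indexes."""
    return len(active_indexes) != len(revealed_objects)

hidden = ["index-1", "index-2"]   # two indexes assigned before occlusion

print(expectation_violated(hidden, ["duck"]))
# -> True: wrong numerosity, so longer looking time

print(expectation_violated(hidden, ["duck", "truck"]))
# -> False: the objects' properties changed, but the indexes match one-one
```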

Mental representation of space:The core of the imagery debate

It appears that some forms of thought (i.e., those accompanied by the phenomenology of “seeing with the mind’s eye”) have spatial properties in a way that other forms of thought do not.

It is, of course, possible to encode spatial relations in any form of representation, but what do we do about such properties of imagery as S-R compatibility, eye movements, visual-motor adaptation, and image-superposition findings (scanning, interference, illusions, …)? These all suggest that images have spatial properties, and this has led to the picture-in-the-head neuroscience program.

The good news is: We don’t need a spatial display in our head if we have the right kind of deictic contact with real (perceived) space

None of the experiments that are alleged to show the existence of a spatial display (in visual cortex) need to appeal to anything more than a small number of imagined locations.

If we can index a small number of (occupied) locations in real space (using FINSTs) we can use them to allocate attention or to program motor commands.

If these indexed objects are also bound to objects of thought, this will result in our thoughts (i.e., images) having persisting spatial relations.

Some related trends in artificial intelligence: Situated Robots

Some people in Artificial Intelligence have embraced (and in some cases been overcome by) a recognition of the need for a special indexical relation between representations and the world. While some of this "situated" movement has become a fad, there is an important point behind it, and it is the same point the Visual Index theory has been making: we need some nonconceptual connections between representations and things.

Forms of representation for a robot: using indexicals

Pylyshyn, Z. W. (2000). Situating vision in the world. Trends in Cognitive Sciences, 4(5), 197-207.

Indexes play a role very similar to that of demonstratives. Are demonstratives essential for characterizing beliefs and for explaining the connection between beliefs and actions? Here is an example due to John Perry*:

“The author of the book Hiker’s Guide to the Desolation Wilderness stands in the wilderness beside Gilmore Lake, looking at the Mt. Tallac trail as it leaves the lake and climbs the mountain. He desires to leave the wilderness. He believes that the best way out from Gilmore Lake is to follow the Mt. Tallac trail up the mountain … But he doesn’t move. He is lost. He is not sure whether he is standing beside Gilmore Lake, looking at Mt. Tallac, or beside Clyde Lake, looking at the Maggie peaks. Then he begins to move along the Mt. Tallac trail. If asked, he would have to explain the crucial change in his beliefs in this way: ‘I came to believe that this is the Mt. Tallac trail and that is Gilmore Lake’.”

* Perry, J. (1989). The problem of the essential indexical. In J. Almog, J. Perry, & H. Wettstein (Eds.), Themes from Kaplan. New York: Oxford University Press.

A unique description of the Mt. Tallac trail might help bring the person to the right belief, but the problem of connecting the belief to an action would remain unsolved until the person had a deictic or demonstrative thought such as "That is the Mt. Tallac trail", or perhaps "The trail I am now looking at is the Mt. Tallac trail".

Perry’s example is intended to show that in order to understand and explain the action of the lost author it is essential to use demonstratives such as this and that in expressing the author’s beliefs.

Summary: FINSTs keep us connected with the world

Selected references related to this talk

• Annan, V., & Pylyshyn, Z. W. (2002). Can indexes be voluntarily assigned in multiple object tracking? Paper presented at Vision Sciences 2002, Sarasota, FL.

• Ballard, D. H., Hayhoe, M. M., Pook, P. K., & Rao, R. P. N. (1997). Deictic codes for the embodiment of cognition. Behavioral and Brain Sciences, 20(4), 723-767.

• Blaser, E., Pylyshyn, Z. W., & Holcombe, A. O. (2000). Tracking an object through feature-space. Nature, 408, 196-199.

• Burkell, J., & Pylyshyn, Z. W. (1997). Searching through subsets: A test of the visual indexing hypothesis. Spatial Vision, 11(2), 225-258.

• Dawson, M., & Pylyshyn, Z. W. (1988). Natural constraints in apparent motion. In Z. W. Pylyshyn (Ed.), Computational processes in human vision: An interdisciplinary perspective (pp. 99-120). Stamford, CT: Ablex Publishing.

• Intriligator, J., & Cavanagh, P. (2001). The spatial resolution of visual attention. Cognitive Psychology, 43(3), 171-216.

• Leslie, A. M., Xu, F., Tremoulet, P. D., & Scholl, B. J. (1998). Indexing and the object concept: Developing 'what' and 'where' systems. Trends in Cognitive Sciences, 2(1), 10-18.

• Nissen, M. J. (1985). Accessing features and objects: Is location special? In M. I. Posner & O. S. Marin (Eds.), Attention and performance XI (pp. 205-219). Hillsdale, NJ: Lawrence Erlbaum.

• Pylyshyn, Z. W. (1989). The role of location indexes in spatial perception: A sketch of the FINST spatial-index model. Cognition, 32, 65-97.

• Pylyshyn, Z. W. (1994). Some primitive mechanisms of spatial attention. Cognition, 50, 363-384.

• Pylyshyn, Z. W. (2000). Situating vision in the world. Trends in Cognitive Sciences, 4(5), 197-207.

• Pylyshyn, Z. W. (2001). Visual indexes, preconceptual objects, and situated vision. Cognition, 80(1/2), 127-158.

• Pylyshyn, Z. W. (submitted). Tracking without keeping track: Some puzzling findings concerning multiple object tracking.

• Pylyshyn, Z. W., Burkell, J., Fisher, B., Sears, C., Schmidt, W., & Trick, L. (1994). Multiple parallel access in visual attention. Canadian Journal of Experimental Psychology, 48(2), 260-283.

• Pylyshyn, Z. W., & Storm, R. W. (1988). Tracking multiple independent targets: Evidence for a parallel tracking mechanism. Spatial Vision, 3(3), 179-197.

• Scholl, B. J., & Pylyshyn, Z. W. (1999). Tracking multiple items through occlusion: Clues to visual objecthood. Cognitive Psychology, 38(2), 259-290.

• Scholl, B. J., Pylyshyn, Z. W., & Feldman, J. (2001). What is a visual object? Evidence from target merging in multiple-object tracking. Cognition, 80, 159-177.

• Scholl, B. J., Pylyshyn, Z. W., & Franconeri, S. L. (submitted). The relationship between property-encoding and object-based attention: Evidence from multiple-object tracking.

• Sears, C. R., & Pylyshyn, Z. W. (2000). Multiple object tracking and attentional processes. Canadian Journal of Experimental Psychology, 54(1), 1-14.

• Tipper, S., Driver, J., & Weaver, B. (1991). Object-centered inhibition of return of visual attention. Quarterly Journal of Experimental Psychology, 43A, 289-298.

The End . . . (except for appendices)

Appendix 1: Other findings concerning the Multiple Object Tracking task

The question of how the correspondence problem in vision is solved in general

The question of whether objects are individuated and/or tracked by their locations. Almost everyone (except me) believes that they are.

Appendix: Some other findings concerning object tracking (1)

Detection of events on targets is better than on nontargets, but this does not generalize to locations between targets;

Objects can continue to be tracked when they disappear completely behind occluders, as long as the mode of disappearance is compatible with there being an occluding surface;

Objects can all disappear from view for as long as 330 ms without impairing tracking;

When objects disappear behind an occluder and come out a different color or shape, the change is unnoticed;

Appendix: Some other findings concerning object tracking (2)

Not all distinct feature clusters can be tracked; some, like the endpoints of a line, cannot;

People can track items that automatically attract attention, or they can decide which items to track; but in the latter case it appears that they may have to visit each object serially.

Successful tracking of an object entails keeping track of it as a particular individual, yet people are poor at keeping track of which successfully tracked (initially numbered) item is which. This may be because, when observers make errors, they are more likely to switch the identity of a target with that of another target than with that of a nontarget.

The whole truth about multiple object tracking

And many more demos ….

How do we do it? What properties of individual objects do we use?

• MOT with occlusion
• MOT with virtual occluders
• MOT with implosion/explosion
• MOT of the endpoints of a line
• MOT with squares with rubber-band connections
• MOT with IDs (which is which?)
• Track non-flashed targets (3 blinks)
• Track non-flashed targets (one flash)

Most theories of attention assume that objects are accessed by the prior encoding of their location:

1. Theories of visual search, including Treisman’s Feature Integration Theory, assume that location provides the means for detecting property-conjunctions. To find a conjunction of properties one finds the first property, determines its location on the master feature map, and checks to see whether the second property is also located there.


The case for the prior encoding of location

2. It has been frequently reported that when people detect certain properties (e.g., color) in a display, they very often also know where these properties are located – even when they fail to report any other properties (e.g., shape).

* There are also many reports of the detection of properties without the observer being able to report where the properties occurred. This happens mainly when a second, attention-distracting task is being performed, and it leads to such errors as conjunction illusions.


The case for the prior encoding of location …

3. Some people have explicitly tested the location-mediation hypothesis by cuing a search display with one property and examining the resulting joint probabilities of detecting various other properties. For example, Mary Jo Nissen cued a search display with a color C and measured (or estimated) the probability of detecting shape S, location L, and both shape and location (S & L). She showed that

P(L & S | C) = P(L | C) × P(S | L)

which is what one would expect if location mediated the joint detection.
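A Monte Carlo sketch of the location-mediation model (my illustration, not Nissen's procedure; the detection probabilities are assumed values). In the model, shape can be read out only after the cued item's location is found; the simulated frequencies then satisfy the factorization, with P(S | L) estimated from a separate run.

```python
import random

random.seed(0)
P_LOC = 0.7            # P(find cued item's location | color cue) -- assumed
P_SHAPE_AT_LOC = 0.8   # P(read out shape | location in hand)     -- assumed

def trial():
    loc = random.random() < P_LOC
    shape = loc and (random.random() < P_SHAPE_AT_LOC)  # mediation step
    return loc, shape

data = [trial() for _ in range(200_000)]
p_L = sum(l for l, s in data) / len(data)          # estimate P(L | C)
p_LS = sum(l and s for l, s in data) / len(data)   # estimate P(L & S | C)

# P(S | L) estimated independently: readout success when location is given.
p_S_given_L = sum(random.random() < P_SHAPE_AT_LOC
                  for _ in range(200_000)) / 200_000

print(f"P(L & S | C)        = {p_LS:.3f}")                # ~0.56
print(f"P(L | C) * P(S | L) = {p_L * p_S_given_L:.3f}")   # ~0.56: holds
```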


The correspondence problem for incremental construction of a visual representation

We are interested in solutions that could be carried out by the vision module (as opposed to the cognitive mind), so the solution should meet certain criteria – e.g., capitalize on a natural constraint, as it does in apparent motion and other early vision phenomena.

It would make sense if early vision kept track of individual objects using only "local support" evidence, without relying on specific encoded properties. We will see that it is unlikely that locations or other object properties are stored and used in solving the general correspondence problem.

I will consider some proposals for how our visual system solves the correspondence problem – e.g., the proposal that it uses spatiotemporal information.

If object properties are not used in solving the general correspondence problem, where does this leave us?

It leaves us needing a primitive indexing mechanism that picks out individual objects qua individuals, and that keeps track of these objects as they move around and change their properties.

We do not need to assume an unlimited capacity for indexing. Indeed it seems that there might not be more than 4 or 5 of these indexes available.

Index maintenance favors continuous movements, but it can track objects that disappear when local cues are compatible with certain phenomena that hold in our kind of world (e.g., occlusion by opaque surfaces, blinks, saccadic eye movements, etc.).

Such a mechanism was proposed in 1978 and was called a FINST.
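To gather the theoretical commitments in one place, here is a minimal data-structure sketch of a FINST-like index pool (my illustration of the theory, not an implementation from the talk): a small fixed pool of indexes, binding to object tokens rather than to descriptions or locations, and identity maintained while the bound object moves and changes its properties.

```python
class IndexPool:
    """A FINST-like pool: a few indexes that bind directly to object tokens."""

    def __init__(self, size=4):                 # ~4-5 indexes available
        self.slots = [None] * size              # each slot may hold one token

    def grab(self, obj):
        """Bind a free index to an object token (e.g., on feature popout)."""
        for i, bound in enumerate(self.slots):
            if bound is None:
                self.slots[i] = obj
                return i
        return None                             # pool exhausted: nothing indexed

    def resolve(self, i):
        """Access the object via its index -- no description is consulted."""
        return self.slots[i]

class ObjectToken:
    """A distal object token; its properties may change freely."""
    def __init__(self, pos):
        self.pos = pos
        self.color = "red"

pool = IndexPool()
thing = ObjectToken(pos=(10, 20))
idx = pool.grab(thing)

thing.pos = (200, 30)      # the object moves...
thing.color = "green"      # ...and changes color behind an occluder
assert pool.resolve(idx) is thing   # the index still picks out the same individual
print("index", idx, "still bound to the same object token")
```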