a neglected problem in the computational theory of mind object tracking and the mind-world gap

A neglected problem in the computational theory of mind

Object Tracking and the Mind-World Gap

Zenon Pylyshyn

Rutgers Center for Cognitive Science

Before I begin I would like you to see a ‘video game’ that will figure in the last part of my talk

The demonstration shows a task called “Multiple Object Tracking”

Track the initially-distinct (flashing) items through the trial (here 10 secs) and indicate at the end which items are the “targets”

After each example I’d like you to ask yourself, “How do I do it?”

If you are like most of our subjects you will have no idea, or a false idea…

Keep track of the objects that flash


How do we do it? What properties of individual objects do we use?

Going behind occluding surfaces does not disrupt tracking

Scholl, B. J., & Pylyshyn, Z. W. (1999). Tracking multiple items through occlusion: Clues to visual objecthood. Cognitive Psychology, 38(2), 259-290.

Not all well-defined features can be tracked: track the endpoints of these lines

Endpoints move exactly as the squares did!

What determines our behavior is not how the world is, but how we represent it as being

As Chomsky pointed out in his review of Skinner, if we describe behavior in relation to the objective properties of the world, we would have to conclude that behavior is essentially stimulus-independent

Every naturally-occurring behavioral regularity is cognitively penetrable

Any information that changes beliefs can systematically and rationally change behavior

The basic problem of cognitive science

Representation and Mind

Why representations are essential

Do representations only come into play in “higher level” mental activities, such as reasoning?

Even at early stages of perception many of the states that must be postulated are representations (i.e. what they are about plays a role in explanations).

Examples from vision (1): Intrapercept constraints

Epstein, W. (1982). Percept-percept couplings. Perception, 11, 75-83.

Examples from vision (2): The Poggendorff illusion depends on perceived contours – they need not be physical edges

The rules of color mixing apply to perceived color

‘Red light and yellow light mix to produce orange light’

This “law” holds regardless of how the red light and yellow light are produced;

The yellow may be light of 580 nanometer wavelength, or it may be a mixture of light of 530 nm and 650 nm wavelengths.

So long as one light looks yellow and the other looks red the “law” will hold – the mixture will look orange.

Another example of a classical representation

Other forms of representation….

a) Lines FG, BC are parallel and equal.

b) Lines EH, AD are parallel and equal.

c) Lines FB, GC are parallel and equal.

d) Lines EA, HD are parallel and equal.

e) Vertices EF, HG, DC and AB are joined....

f) Part-Of{Cube, Top-Face(EFGH), Bottom-Face(ABCD), Front-Face(FGCB), Back-Face(EHDA)}

g) Part-Of{Top-Face(Front-Edge(FG), Back-Edge(EH), Left-Edge(EF), Right-Edge(HG)},…

What’s wrong with this picture?

What’s wrong is that the CTM is incomplete — it does not address a number of fundamental questions

It fails to specify how representations connect with what they represent – it’s not enough to use English words in the representation (that’s been a common confusion in AI) or to draw pictures (a common confusion in theories of mental imagery)

English labels and pictures may help the theorist recall which objects are being referred to …

But what makes it the case that a particular mental symbol refers to one thing rather than another?

How are concepts grounded? (Symbol Grounding Problem)

Another way to look at what the Computational Theory of Mind lacks

The missing function in the CTM is a mechanism that allows perception to refer to individual things in the visual field directly and nonconceptually: Not as “whatever has properties P1, P2, P3, ...”, but as a singular term that refers directly to an individual and does not appeal to a representation of the individual’s properties.

Such a reference is like a proper name or a pointer in a computer data structure, or like a demonstrative term (like this or that) in natural language.

Note that in a computer a pointer does not refer via a location, despite what the term “pointer” suggests

An example from personal history: Why we need to pick out individual things without referring to their properties

We wanted to develop a computer system that would reason about geometry by actually drawing a diagram and noticing adventitious properties of the diagram from which it would conjecture lemmas to prove

We wanted the system to be as psychologically realistic as possible so we assumed that it had a narrow field of view and noticed only limited, spatially-restricted information as it examined the drawing

This immediately raised the problem of coordinating noticings and led us to the idea of visual indexes to keep track of previously encoded parts of the diagram.

Begin by drawing a line….

[Figure: a line labeled L1]

Now draw a second line….

[Figure: a second line labeled L2]

And draw a third line….

[Figure: a third line labeled L3]

Notice what you have so far….(noticings are local – you encode what you attend to)

There is an intersection of two lines…

But which of the two lines you drew are they?

There is no way to indicate which individual things are seen again without a way to refer to individual (token) things

[Figure labels: L1, L2, V6]

Look around some more to see what is there ….

Here is another intersection of two lines…

Is it the same intersection as the one seen earlier?

Without a special way to keep track of individuals the only way to tell would be to encode unique properties of each of the lines. Which properties should you encode?

[Figure labels: L5, L2, V12]

In examining a geometrical figure one only gets to see a sequence of local glimpses

The incremental construction of visual representations requires solving a correspondence problem over time

We have to determine whether a particular individual element seen at time t is identical to another individual element seen at some previous time. This is one manifestation of the correspondence problem.

Solving the correspondence problem is equivalent to picking out and tracking the identity of token individuals as they change their appearance, their location or the way they are encoded or conceptualized

To do that we need the capacity to refer to token individuals (I will call them objects) without doing so by appealing to their properties. This requires a special form of demonstrative reference I call a Visual Index.
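The contrast between re-identifying by description and re-identifying by index can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Line` class, its two encoded properties, and the helper `same_by_description` are hypothetical names, and a Python object reference merely stands in for a Visual Index. It is not a model of how vision implements one.

```python
from dataclasses import dataclass


@dataclass(eq=False)  # eq=False: two Line tokens are equal only if they are the same token
class Line:
    length: float           # hypothetical encoded property
    orientation_deg: float  # hypothetical encoded property


def same_by_description(a: Line, b: Line) -> bool:
    """Descriptive re-identification: compare the encoded properties."""
    return (a.length, a.orientation_deg) == (b.length, b.orientation_deg)


# Two distinct token lines that happen to share every encoded property.
l1 = Line(length=10.0, orientation_deg=45.0)
l2 = Line(length=10.0, orientation_deg=45.0)

# An index is just a reference to the token; it stores no properties.
index = l1

# Descriptions cannot tell the two tokens apart ...
confused = same_by_description(l1, l2)          # matches, although l1 is not l2

# ... and they lose the token when its appearance changes.
snapshot = Line(l1.length, l1.orientation_deg)  # what was encoded at the earlier glimpse
l1.orientation_deg = 60.0                       # the same line is later seen differently
lost = not same_by_description(snapshot, l1)    # the old description no longer matches

# The index still picks out the same individual in both cases.
still_tracked = index is l1 and index is not l2
```

The description fails in both directions (it merges two individuals and then loses one), while the reference is unaffected by what is or is not encoded about the line.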

A note about the use of labels in this example

There are two purposes for figure labels. One is to specify what type of individual it is (line, vertex,..). The other is to specify which individual it is so it is individuated and thus can be selected or bound to the argument of a predicate.

The second of these is what I am concerned with because indicating which individual it is is essential in vision.

Many people (e.g., Marr, Yantis) have suggested that individuals may be marked by tags, but that won’t do since one cannot literally place a tag on an object and even if we could it would not obviate the need to individuate and index, just as labels don’t help.

Labeling things in the world is not enough because to refer to the line labeled L1 you would have to be able to think “this is line L1” and you could not think that unless you had a way of first picking out the referent of “this”.

The difference between a direct (demonstrative) and a descriptive way of picking something out has produced many “You are here” cartoons.

It is also illustrated in this recent New Yorker cartoon…

The difference between descriptive and demonstrative ways of picking something out

(illustrated in this New Yorker cartoon by Sipress)

‘Picking out’

Picking out entails individuating, in the sense of separating something from a background (what Gestalt psychologists called a figure-ground distinction)

This sort of picking out has been studied in psychology under the heading of focal or selective attention. Focal attention appears to pick out and adhere to objects rather than places

In addition to a unitary focal attention there is also evidence for a mechanism of multiple references (about 4 or 5), that I have called a visual index or a FINST

Indexes are different from focal attention in many ways that we have studied in our laboratory (I will mention a few later)

A visual index is like a pointer in a computer data structure – it allows access but does not itself tell you anything about what is being pointed to

The requirements for picking out and keeping track of several individual things reminded me of an early comic book character called Plastic Man

Imagine being able to place several of your fingers on things in the world without recognizing their properties while doing so. You could then refer to those things (e.g. ‘what finger #2 is touching’) and could move your attention to them. You would then be said to possess FINgers of INSTantiation (FINSTs)

FINST Theory postulates a limited number of pointers in early vision that are elicited by certain events in the visual field and that enable vision to refer to those things without doing so under a concept or a description
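A minimal sketch of such a limited pool of pointers, written in Python purely for illustration. The class name `FinstPool`, its methods, and the capacity of 4 are assumptions (the capacity is taken from the "about 4 or 5" figure mentioned earlier); nothing here is meant as the actual mechanism.

```python
class FinstPool:
    """A fixed pool of indexes. Each index refers to a thing and stores nothing about it."""

    def __init__(self, capacity: int = 4) -> None:  # capacity is an assumed value
        self._slots = [None] * capacity

    def grab(self, thing):
        """Bind a free index to `thing`; return the index number, or None if none is free."""
        for i, bound in enumerate(self._slots):
            if bound is None:
                self._slots[i] = thing
                return i
        return None

    def referent(self, i):
        """Return 'what finger #i is touching' without consulting any description."""
        return self._slots[i]

    def release(self, i) -> None:
        """Free index i so that it can be grabbed by another thing."""
        self._slots[i] = None


# Five featureless things compete for four indexes.
things = [object() for _ in range(5)]
pool = FinstPool()
handles = [pool.grab(t) for t in things]  # the fifth thing gets no index
```

Here `pool.referent(1)` returns the second thing itself, which corresponds to referring to ‘what finger #2 is touching’ rather than to ‘whatever has properties P1, P2, ...’.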

FINSTs and Object Files form the link between the world and its conceptualization

Object File contents are conceptual!

Information (causal) link

FINST Demonstrative reference link

The only nonconceptual contents in this picture are FINST indexes!

A note on terminology

A FINST provides a reference to an individual visible ‘thing’

I sometimes call this referent a FING by analogy with FINST and sometimes an object to conform with usage in psychology, but FINGs are nonconceptual so they do not pick out something as an object, because OBJECT is a concept. Maybe “proto object”?

I have also called it a pointer, but that erroneously suggests that it “points to” the location of an object, as opposed to the object itself. In a computer, a pointer is the name of a stored datum.

I have said that a FINST is a visual demonstrative like ‘this’ or ‘that’, but that too is misleading because the reference of a demonstrative depends on the intentions of the speaker

I have also noted that a FINST is like a proper name but that won’t do since a name can pick out something not in sensory contact whereas a FINST can only refer to a visible item (or one that is briefly out of sight).

A quick tour of some evidence for FINSTs

• The correspondence problem

• The binding problem

• Evaluating multi-place visual predicates (recognizing multi-element patterns)

• Operating over several visual elements at once without having to search for them first

• Subitizing

• Subset search

• Multiple-Object Tracking

• Cognizing space without requiring a spatial display in the head



Individual objects and the binding problem

We can distinguish scenes that differ by conjunctions of properties, so early vision must somehow keep track of how properties co-occur – conjunction must not be obscured. This is called the binding problem

The most common proposal is that vision keeps track of properties according to their location and binds together co-located properties.


The proposal of binding conjunctions by the location of conjuncts does not work when feature location is not punctate and becomes even more problematic if they are co-located – e.g., if their relation is “inside”

Binding as object-based

The proposal that properties are conjoined by virtue of their common location has many problems

In order to assign a location to a property you need to know its boundaries, which requires distinguishing the object that has those properties from its background (figure-ground individuation)

Properties are properties of objects, not of locations – which is why properties move when objects move. Empty locations have no causal properties.

The alternative to conjoining-by-location is conjoining by object. According to this view, solving the binding problem requires first selecting individual objects and then keeping track of each object’s properties (in its object file)

If only properties of selected objects are encoded and if those properties are recorded in object files specific to each object, then all conjoined properties will be recorded in the same object file, thus solving the binding problem
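A small Python sketch of why per-object recording preserves conjunctions. The scene encoding (one dictionary of properties per object) and the helper names `feature_bag` and `object_files` are assumptions made for illustration only.

```python
from collections import Counter


def feature_bag(scene):
    """Record which properties are present, ignoring which object has them."""
    return Counter((dim, val) for obj in scene for dim, val in obj.items())


def object_files(scene):
    """Record properties per object: one (hypothetical) object file per selected object."""
    return Counter(frozenset(obj.items()) for obj in scene)


# Two scenes that differ only in how colour and shape are conjoined.
scene_a = [{"color": "red", "shape": "square"}, {"color": "green", "shape": "circle"}]
scene_b = [{"color": "green", "shape": "square"}, {"color": "red", "shape": "circle"}]

# Without binding the two scenes are indistinguishable ...
same_features = feature_bag(scene_a) == feature_bag(scene_b)

# ... but object files keep the conjunctions apart.
same_bindings = object_files(scene_a) == object_files(scene_b)
```

The sketch takes the selection of individual objects for granted (each dictionary already corresponds to one object), which is exactly the step that the object-based view says must come first.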

Attention spreads over perceived objects

Using a priming method, Egly, Driver & Rafal (1994) showed that the effect of a prime spreads to other parts of the same visual object more than to equally distant parts of different objects.

[Figure: displays with locations A, B, C and D. In one arrangement the prime spreads to B and not C; in the other it spreads to C and not B.]


Being able to pick out and refer to individual distal elements is essential for encoding patterns

Encoding relational predicates; e.g., Collinear (x,y,z,..); Inside (x, C); Above (x,y); Square (w,x,y,z), requires simultaneously binding the arguments of n-place predicates to n elements in the visual scene

Evaluating such visual predicates requires individuating and referring to the objects over which the predicate is evaluated: i.e., the arguments in the predicate must be bound to individual elements in the scene.
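The binding of predicate arguments to indexed elements can be sketched as follows in Python. Everything named here is hypothetical: the `indexed` table, the `evaluate` helper, and the use of (x, y) coordinates with y increasing upward. The coordinates only stand in for whatever information becomes accessible once an element has been picked out.

```python
def collinear(p, q, r, tol=1e-9):
    """Collinear(x, y, z): the three points lie on one line (zero cross product)."""
    (x1, y1), (x2, y2), (x3, y3) = p, q, r
    return abs((x2 - x1) * (y3 - y1) - (y2 - y1) * (x3 - x1)) <= tol


def above(p, q):
    """Above(x, y): p is higher than q, assuming y increases upward."""
    return p[1] > q[1]


# Hypothetical index table: index number -> the scene element it is bound to.
indexed = {0: (0.0, 0.0), 1: (1.0, 1.0), 2: (2.0, 2.0), 3: (2.0, 3.0)}


def evaluate(predicate, *index_numbers):
    """Bind each argument of an n-place predicate to an indexed element, then evaluate it."""
    args = [indexed[i] for i in index_numbers]
    return predicate(*args)


first_three_collinear = evaluate(collinear, 0, 1, 2)
with_fourth_collinear = evaluate(collinear, 0, 1, 3)
fourth_above_third = evaluate(above, 3, 2)
```

The point of the design is the order of operations: the arguments are bound to individual elements before the predicate is evaluated, so the predicate itself never has to search the scene for its arguments.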

Several objects must be picked out at once in making relational judgments

When we judge that certain objects are collinear, we must first pick out the relevant objects while ignoring their properties

Several objects must be picked out at once in making relational judgments

The same is true for other relational judgments like inside or on-the-same-contour… etc. We must pick out the relevant individual objects first. Are dots Inside-same contour? On-same contour?



More functions of FINSTs

Further experimental explorations using different paradigms

Recognizing the cardinality of small sets of things: Subitizing vs counting (Trick, 1994)

Searching through subsets – selecting items to search through (Burkell, 1997)

Selecting subsets and maintaining the selection during a saccade (Currie, 2002)

Application of FINST index theory to infant cardinality studies (Carey, Spelke, Leslie, Uller, etc)

Indexes explain how children are able to acquire words for objects by ostension without suffering Quine’s Gavagai problem.








Another example of MOT: With self occlusion

Self occlusion does not seriously impair tracking

Basic finding: Most people can track at least 4 targets that move randomly among identical non-target objects (even 5 year old children can track 3 objects)

Object properties do not appear to be recorded during tracking and tracking is not improved if all objects are visually distinct (no two objects have the same color, shape or size)

How is it done?

We showed that it is unlikely that the tracking is done by keeping a record of the targets’ locations and updating them by serially visiting the objects (Pylyshyn & Storm, 1998)

Other strategies may be employed (e.g., tracking a single deforming pattern), but they do not explain tracking

Hypothesis: FINST Indexes get assigned to targets. At the end of the trial these pointers can be used to move attention to the targets and hence to select them
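The hypothesis can be illustrated with a toy simulation in Python. This is a sketch under strong assumptions, not a model of the mechanism: the `Dot` class, the function `run_trial`, and all parameter values are hypothetical, and the program gets object identity for free from Python references, which simply stand in for whatever keeps an index attached to its target.

```python
import random


class Dot:
    """A visually identical item: it has a position but no distinguishing feature."""

    def __init__(self, x: float, y: float) -> None:
        self.x, self.y = x, y


def run_trial(n_items=8, n_targets=4, steps=200, seed=0):
    """Move identical dots at random, then select the targets by their indexes alone."""
    rng = random.Random(seed)
    dots = [Dot(rng.uniform(0, 100), rng.uniform(0, 100)) for _ in range(n_items)]
    targets = dots[:n_targets]  # the items that "flash" at the start
    indexes = list(targets)     # indexes hold references only, no stored locations

    for _ in range(steps):
        for d in dots:
            d.x += rng.uniform(-2, 2)
            d.y += rng.uniform(-2, 2)
        rng.shuffle(dots)       # display order carries no information

    # At the end of the trial the indexes are used to go to the targets and select them.
    selected = [d for d in dots if any(d is i for i in indexes)]
    return targets, selected


targets, selected = run_trial()
all_targets_selected = {id(d) for d in targets} == {id(d) for d in selected}
```

Note that selection never consults a record of where the targets started or what they look like; it only follows the references.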

Some findings of Multiple Object Tracking

What role do visual properties play in MOT?

Certain properties may have to be present in order for an object to be indexed, and certain properties (probably different properties) may be required in order for the index to keep track of the object, but this does not mean that such properties are encoded, stored, or used in tracking.

Compare this with Kripke’s distinction between properties that fix the referent of a proper name and the property that the name refers to. The former only plays a role at the name’s initial “baptism.”

Is there something special about location? Do we record and track properties-at-locations?

Location in time & space may be essential for individuating objects, but locations need not be encoded or made cognitively available

The fact that an object is actually at some location or other does not mean that it is represented as such. Representing property ‘P’ (where P happens to be at location L) ≠ Representing property ‘P-is-at-L’.

A way of viewing what goes on in MOT

According to Kahneman & Treisman’s Object File theory, the appearance of a new visual object causes a new Object File to be created. Each object file is associated with its respective object – presumably through a FINST Index.

The object file may contain information about the object to which it is attached. But according to FINST Theory, keeping track of the object’s identity does not require the use of this information. The evidence suggests that in MOT, little or nothing is stored in the object file except maybe in special cases (e.g., when the object suddenly changes or disappears).

What makes something the same object over time is that it remains connected to the same object-file (by the same FINST). Thus, for vision to treat something as the same enduring individual does not require appeal to properties or concepts.

Why is this relevant to foundational questions in the philosophy of mind?

According to Quine, Strawson, and most philosophers, you cannot pick out or track individuals without concepts (sortals)

But you also cannot pick out individuals with only concepts

Sooner or later you have to pick out individuals using non-conceptual causal connections between thoughts and things

The present proposal is that FINSTs provide the needed non-conceptual mechanism for individuating objects and for tracking their identity, which works most of the time in our kind of world. It relies on a natural constraint (Marr)

FINST indexes provide the right sort of connection for predicating properties of the world by allowing the arguments of predicates to be bound to objects prior to the predicates being evaluated. They may thus be the basis for early vocabulary learning.

But there must be some properties that cause indexes to be grabbed!

Of course there are properties that are causally responsible for indexes being grabbed, and also properties (probably different ones) that make it possible for objects to be tracked;

But these properties need not be represented (encoded) and used in tracking

The distinction between object properties that cause indexes to be assigned and those that are represented (in Object Files) is similar to Kripke’s distinction between properties that are needed to pick out or name an object and those that constitute its meaning

Effect of target properties on MOT

Changes of target properties are not reported nor even noticed during MOT

Keeping all targets at different color, size, or shape does not improve tracking

Observers do not use target speed or direction in tracking (e.g., by anticipating where the targets will be when they reappear after occlusion)

Some open questions

We have arrived at the view that only properties of selected (indexed) objects enter into subsequent conceptualization and perception-based thought (i.e., only information in object files is made available to cognition)

So what happens to the rest of the visual information?

Visual information seems rich and fine-grained while this theory only allows for the properties of 4 or 5 objects to be encoded!

The present view leaves no room for nonconceptual representations whose content corresponds to the content of conscious experience

According to the present view, the only content that nonconceptual representations have is the demonstrative content of indexes that refer to perceptual objects

Question: Why do we need any more than that?

An intriguing possibility….

Maybe the theoretically relevant information we take in is less than (or at least different from) what we experience

This possibility has received attention recently with the discovery of various “blindnesses” (e.g., change-blindness, inattentional blindness, blindsight…) as well as the discovery of independent-vision systems (e.g., recognition and motor control)

The qualitative content of conscious experience may not play a role in explanations of cognitive processes

Even if unconceptualized information enters into causal processes (e.g., motor control) it may not be represented or made available to the cognitive mind – not even as a nonconceptual representation

For something to be a representation its content must figure in explanations – it must capture generalizations. It must have truth conditions and therefore allow for misrepresentation. It is an empirical question whether current proposals do (e.g., primal sketch, scenarios). cf. Devitt: Pylyshyn’s Razor

Vision science has always been deeply ambivalent about the role of conscious experience

Isn’t how things appear one of the things that our theories must explain? Answer: There is no a priori ‘must explain’!

● The content of subjective experience is a major type of evidence. But it may turn out not to be the most reliable source for inferring the relevant functional states. It competes with other types of evidence.

● How things appear cannot be taken at face value: it carries substantive theoretical assumptions. It also draws on many levels of processing.

It was a serious obstacle to early theories of vision (Kepler)

It has been a poor guide in the case of theories of mental imagery (e.g., color mixing, image size, image distances). ‘Reading X off an image’ is an illusion.

● It seems likely that vision science will use evidence of conscious experience the way linguistics uses evidence of grammatical intuitions – only as it is filtered through developing theories.

The questions a science is expected to answer cannot be set in advance – they change as the science develops.

What next?

This picture leaves many unanswered questions, but it does provide a mechanism for solving the binding problem and also explaining how mental representations could have a nonconceptual connection with objects in the world (something required if mental representations are to connect with actions)

For a copy of these slides see: http://ruccs.rutgers.edu/faculty/pylyshyn/SelectionReference.ppt

Or MIT Press Paperback
