
What is not where: the challenge of integrating spatial representations into deep learning architectures

John D. Kelleher
ADAPT Centre for Digital Content Technology
Dublin Institute of Technology, Ireland
[email protected]

Simon Dobnik
CLASP and FLOV
University of Gothenburg, Sweden
[email protected]

Abstract

This paper examines to what degree current deep learning architectures for image caption generation capture spatial language. On the basis of the evaluation of examples of generated captions from the literature we argue that systems capture what objects are in the image data but not where these objects are located: the captions generated by these systems are the output of a language model conditioned on the output of an object detector that cannot capture fine-grained location information. Although language models provide useful knowledge for image captions, we argue that deep learning image captioning architectures should also model geometric relations between objects.

1 Introduction

There is a long tradition in Artificial Intelligence (AI) of developing computational models that integrate language with visual information, see inter alia (Winograd, 1973; McKevitt, 1995; Kelleher, 2003; Gorniak and Roy, 2004; Kelleher and Kruijff, 2005a; Brenner et al., 2007; Dobnik, 2009; Tellex, 2010; Sjöö, 2011; Dobnik and Kelleher, 2016; Schütte et al., 2017). The goal of this paper is to situate and critically examine recent advances in computational models that integrate visual and linguistic information. One of the most exciting developments in AI in recent years has been the development of deep learning (DL) architectures (LeCun et al., 2015; Schmidhuber, 2015). Deep learning models are neural network models that have multiple hidden layers. The advantage of these architectures is that they have the potential to learn useful high-level features from raw data. For example, Lee et al. (2009) report how their convolutional deep belief network "learns useful high-level visual features, such as object parts, from unlabelled images of objects and natural scenes". In brief, Lee et al. show how a deep network trained to perform face recognition learns a hierarchical sequence of feature abstractions: neurons in the early layers of the network learn to act as edge detectors, neurons in later layers react to the presence of meaningful parts of a face (e.g., nose, eye, etc.), and the neurons in the last layers of the network react to sensible configurations of body parts (e.g., nose and eyes and sensible (approximate) offsets between them).

Deep learning models have improved on the state-of-the-art across a range of image and language modelling tasks. The typical deep learning architecture for image modelling is a convolutional neural network (CNN) (Lecun et al., 1998) and for language modelling is a recurrent neural network (RNN), often using long short-term memory (LSTM) units (Hochreiter and Schmidhuber, 1997). However, from the perspective of research into the interface between language and vision perhaps the most exciting aspect of deep learning is the fact that all of these models (both language and vision processing architectures) use a vector based representation. A consequence of this is that deep learning models have the potential to learn multi-modal representations that integrate linguistic and visual information. Indeed, inspired by sequence-to-sequence neural machine translation research (Sutskever et al., 2014), deep learning image captioning systems have been developed that use a CNN to process and encode image data and then pass this vector based encoding of the image to an RNN that generates a caption for the image. Figure 1 illustrates the components and flow of data in an encoder-decoder CNN-RNN image captioning architecture.



[Figure: photo of a bird over water → CNN encoder → vector-based representation → RNN decoder → caption "A bird flying over water".]

Figure 1: A schematic of a typical encoder-decoder deep learning image captioning architecture in (Xu et al., 2015). The photo is different from but similar to the photo in this example. The current photo is by Jerry Kirkhart (originally posted to Flickr as Osprey Hunting) and is sourced via Wikimedia Commons. It is used here under the Creative Commons Attribution 2.0 generic licence.

The performance of these deep learning image captioning systems is impressive. However, the question posed by this paper is whether these systems are actually grounding the semantics of the entire linguistic caption in the image, and in particular whether these systems ground the semantics of spatial relations in the image. The rest of the paper is structured as follows: Section 2 introduces the components of the deep learning image captioning architecture in more detail; following this, Section 3 reviews the challenges of grounding language in perception, with a particular focus on spatial language; Section 4 then examines whether current image captioning systems ground spatial language in the images they describe; the paper concludes in Section 5 by posing the question of whether deep learning image captioning architectures as currently constituted are capable of doing justice to the complexity and diversity of factors that affect the production and interpretation of spatial language in visually situated dialogues.

2 The Standard DL Image Captioning Architecture

As mentioned in Section 1 there are two major components within current standard deep-learning image captioning systems: a CNN that processes the image input and encodes some of the information from the image as a vector, and an RNN that takes the vector representation of the image as an input and generates a caption for the image. This section provides an explanation of how each of these components works: Section 2.1 introduces the basic architecture of a CNN and Section 2.2 introduces the basic architecture of an RNN and explains how RNNs can be used to create visually grounded language models.
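Before looking at each component in turn, the following minimal sketch (in PyTorch-style Python) shows the overall data flow just described: a CNN encoder compresses the image into a single vector, and an RNN language model conditioned on that vector predicts the caption word by word. It is not the architecture of any particular published system; the layer sizes, the toy encoder, and the use of the image vector as the initial LSTM state are illustrative assumptions.

import torch
import torch.nn as nn

class CaptionModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        # Encoder: a toy CNN that maps an image to a single feature vector.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                     # pooling discards fine-grained location
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),             # collapse the remaining spatial grid
            nn.Flatten(),
            nn.Linear(32, hidden_dim),
        )
        # Decoder: an RNN language model conditioned on the image vector.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        v = self.encoder(images)                 # (batch, hidden_dim) image encoding
        h0 = v.unsqueeze(0)                      # image vector as the initial hidden state
        c0 = torch.zeros_like(h0)
        e = self.embed(captions)                 # (batch, seq_len, embed_dim)
        outputs, _ = self.rnn(e, (h0, c0))
        return self.out(outputs)                 # next-word scores at every position

# Illustrative usage: next-word scores for two images and 7-word caption prefixes.
model = CaptionModel(vocab_size=1000)
scores = model(torch.rand(2, 3, 64, 64), torch.randint(0, 1000, (2, 7)))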

2.1 Convolutional Neural Networks

CNNs are specifically designed for image recognition tasks, such as handwritten digit recognition (Le Cun, 1989). A well-recognised approach to image recognition is to extract local visual features and combine these features to form higher-order features. A local feature is a feature whose extent within an image is constrained to a small set of neighbouring pixels. For example, for face recognition a system might first learn to identify features such as patches of line or curve segments, and then learn patterns across these low-level features that correspond to features such as eyes or a mouth, and finally learn how to combine these body-part features to identify a face.

A key challenge in image recognition is creating a model that is able to recognise if a visual feature has occurred in the image irrespective of the location of the feature in the image:

"it seems useful to have a set of feature detectors that can detect a particular instance of a feature anywhere on the input plane. Since the precise location of a feature is not relevant to the classification, we can afford to lose some position information in the process" (Le Cun, 1989, p. 14)

For example, a face recognition network should recognise the shape of an eye whether the eye is in the top right corner of the image or in the centre of the image. CNNs achieve this translation-invariant detection of local visual features using two techniques:

1. weight (parameter) sharing, and
2. pooling.

Recall that each neuron in a network learns a function that maps from a set of inputs to an output activation. The function is defined by the set of weights the neuron applies to the inputs it receives, and learning the function involves updating the weights from a set of randomly initialised values to a set of values that define a function that the network found useful during training in terms of predicting the correct output value. In the context of image recognition a function can be understood as a feature detector which takes a set of pixel values as input and outputs a high activation score if the visual feature is present in the set of input pixels and a low activation score if the feature is not present. Furthermore, neurons that share (or use) the same weights implement the same function and hence implement the same feature detector.

Given that a set of weights for a neuron defines a feature detector and that neurons with the same weights implement the same feature detector, it is possible to design a network to check whether a visual feature occurs anywhere in an image by making multiple neurons share the same set of weights but having each of these neurons inspect a different portion of the image, in such a way that together the neurons cover the whole image.

For example, imagine we wish to train a network to identify digits in images of 10 × 10 pixels. In this scenario we may design a network so that one of the neurons in the network inspects the pixels in the top-left corner of the figure to check if a visual feature is present. The image at the top of Figure 2 illustrates such a neuron. This neuron inspects the pixels (0,0), ..., (2,2) and applies the function defined by the weight vector <w0, ..., w8>. This neuron will return a high activation if the appropriate pixel pattern is present in the pixels input to the function and a low activation otherwise. We can now create a copy of this neuron that uses the same weights <w0, ..., w8> but which inspects a different set of pixels in the image: the image at the bottom of Figure 2 illustrates such a neuron; this particular neuron inspects the pixels (0,1), ..., (2,3). If the visual feature detected by the function defined by the weight vector <w0, ..., w8> occurs in either of the image patches inspected by these two neurons, then one of the neurons will fire.
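The following numpy sketch illustrates the weight-sharing idea in this example; the random image, the 3 × 3 weight vector and the tanh activation are purely illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
image = rng.random((10, 10))           # a toy 10 x 10 grey-scale image
w = rng.standard_normal((3, 3))        # the shared weights w0, ..., w8

def neuron(image, w, row, col):
    """Apply the shared 3 x 3 feature detector to the patch starting at (row, col)."""
    patch = image[row:row + 3, col:col + 3]
    return np.tanh(np.sum(w * patch))  # one neuron's activation

a1 = neuron(image, w, 0, 0)   # inspects pixels (0,0) ... (2,2)
a2 = neuron(image, w, 0, 1)   # inspects pixels (0,1) ... (2,3)
# Same function, two receptive fields: if the feature appears in either patch,
# the corresponding neuron produces a high activation.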

Extending the idea of a set of neurons with shared weights inspecting a full image results in the concept of a feature map. In a CNN a feature map consists of a group of neurons that share the same set of weights on their inputs. This means that each group of neurons that share their weights learns to identify a particular visual feature and each neuron in the group acts as a detector for that feature.

[Figure: two 3 × 3 receptive fields, (0,0)–(2,2) and (0,1)–(2,3), within a 10 × 10 pixel grid, each feeding a neuron that applies the same weights w0, ..., w8.]

Figure 2: Two neurons connected to different areas in the input image (i.e., with different receptive fields) but which share the same weights: w0, w1, w2, w3, w4, w5, w6, w7, w8

In a CNN the neurons within each group are arranged so that each neuron examines a different local region in the image. The set of pixels that each neuron in the feature map inspects is known as the receptive field of that neuron. The neurons and the related receptive fields are arranged so that together the receptive fields cover the entire input image. Consequently, if the visual feature the group detects occurs anywhere in the image one of the neurons in the group will identify it. Figure 3 illustrates a feature map and how each neuron in the feature map has a different receptive field and how the neurons and fields are organised so that taken together the receptive fields of the feature map cover the entire input image. Note that the receptive fields of neighbouring neurons typically overlap. In the architecture illustrated in Figure 3 the receptive fields of neighbouring neurons will overlap by either two columns (for horizontal neighbours) or by two rows (for vertical neighbours). However, an alternative organisation would be to reduce the number of neurons in the feature map and reduce the amount of overlap in the receptive fields. For example, if the receptive fields only overlapped by one row or one column then we would only need half the number of neurons in the feature map to cover the entire input image. This would of course result in a "two-to-one under-sampling in each direction" (Le Cun, 1989).
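As a sketch of how a full feature map covers the image, the toy function below applies one shared 3 × 3 detector at every receptive-field position; the stride parameter mimics the reduced-overlap organisation mentioned above. The weights, image and activation function are again illustrative assumptions.

import numpy as np

def feature_map(image, w, stride=1):
    """Apply one shared detector at every receptive-field position of the image."""
    k = w.shape[0]
    rows = (image.shape[0] - k) // stride + 1
    cols = (image.shape[1] - k) // stride + 1
    out = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            patch = image[i * stride:i * stride + k, j * stride:j * stride + k]
            out[i, j] = np.tanh(np.sum(w * patch))   # same weights at every position
    return out

rng = np.random.default_rng(0)
image, w = rng.random((10, 10)), rng.standard_normal((3, 3))
dense = feature_map(image, w)       # 8 x 8 map: neighbouring fields overlap by two rows/columns
coarse = feature_map(image, w, 2)   # 4 x 4 map: the "two-to-one under-sampling in each direction"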



[Figure: a 10 × 10 input image with overlapping 3 × 3 receptive fields, each feeding one neuron in a feature map.]

Figure 3: A feature map

The idea of applying the same function repeatedly across an input space, by defining a set of neurons where each neuron applies the function to a different part of the input, is very general and can be used irrespective of the type of function being applied. CNNs often use the same technique to under-sample the output from a feature map. The motivation for under-sampling is to discard locational information in favour of generalising the network's ability to identify visual features in a shift-invariant manner. The standard way to implement under-sampling on the output of a feature map is to use a pooling layer, so called because it pools information from a number of neurons in a feature map. Each neuron in a pooling layer inspects the outputs of a subset of the neurons in a feature map, in a very similar way to how each neuron in the feature map has a receptive field in the input. Often the function used by neurons in a pooling layer is the max function. Essentially, a max pooling neuron outputs the maximum activation value of any of the neurons in the preceding layer that it inspects. Figure 4 illustrates the extension of the feature map in Figure 3 with a pooling layer. The output of the highlighted neuron in the pooling layer is simply the highest activation across the 4 neurons in the feature map that it inspects. Pooling obviously discards locational information at a local level: after pooling the network knows that a visual feature occurred in a region of the image but does not know where precisely within the region the feature occurred.
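A minimal sketch of max pooling over a feature map (2 × 2 non-overlapping regions, an illustrative choice) makes the loss of local location information concrete: only the strongest activation in each region survives.

import numpy as np

def max_pool(fmap, size=2):
    """Each pooling neuron outputs the maximum activation of one size x size region."""
    rows, cols = fmap.shape[0] // size, fmap.shape[1] // size
    out = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            out[i, j] = fmap[i * size:(i + 1) * size, j * size:(j + 1) * size].max()
    return out

fmap = np.arange(64, dtype=float).reshape(8, 8)   # stand-in for a feature map's activations
pooled = max_pool(fmap)   # 4 x 4: the network keeps "a feature fired in this region",
                          # but not where in the region it fired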

[Figure: the feature map from Figure 3 extended with a pooling layer; each pooling neuron applies max() to the outputs φ(Σ wi·pi) of a group of feature map neurons.]

Figure 4: Applying pooling to a feature map

A CNN is not restricted to only one feature map or one pooling layer. A CNN can consist of multiple feature maps and pooling layers working in parallel, where the outputs of these different streams of processing are finally merged into one or more fully connected layers (see Figure 5). Furthermore, these basic building blocks of feature maps and pooling layers can be sequenced in many different ways: the output of one feature map layer can be used as the input to another feature map layer, and the output of a pooling layer may be the input to a feature map layer. Consequently, a CNN architecture is very flexible and can be composed of multiple layers of feature maps and pooling layers. For example, a CNN could include a feature map that is fed into a pooling layer which in turn acts as the input for a second feature map layer which itself is down-sampled using another pooling layer, and so on, until the outputs of a layer are eventually fed into a fully-connected feed-forward layer where the final prediction is calculated: feature map → pooling → feature map → pooling → … → fully-connected layer. Obviously, with each extra layer of pooling the network discards more and more location information.

[Figure: a 10 × 10 input image feeding N parallel feature maps (Feature Map 1 … Feature Map N), each followed by its own pooling layer; the pooled outputs feed a single fully connected layer with ten output units (digits 0–9).]

Figure 5: A CNN architecture containing N parallel streams of feature maps and pooling layers feeding into a single fully connected feed-forward layer
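The stacked pipeline described above (feature map → pooling → feature map → pooling → fully-connected layer) can be sketched as follows in PyTorch-style Python for the earlier 10 × 10 digit example; all layer sizes are illustrative assumptions rather than a recommended configuration.

import torch
import torch.nn as nn

digit_classifier = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),    # feature maps
    nn.MaxPool2d(2),                                          # pooling
    nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(),   # feature maps
    nn.MaxPool2d(2),                                          # pooling
    nn.Flatten(),
    nn.Linear(16 * 2 * 2, 10),                                # fully connected layer
)
scores = digit_classifier(torch.rand(1, 1, 10, 10))  # ten output scores, one per digit

Note that each pooling stage halves the spatial resolution, so by the final layer the network retains very little information about where in the image a feature occurred.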

2.2 Recurrent Neural Network Language Models

Recurrent Neural Networks (RNNs)1 are an ideal neural network architecture for processing sequential data such as language. Generally, RNN models are created by extending a feed-forward neural network that has just one hidden layer with a memory buffer, as shown in Figure 6a.

1 This introduction to Recurrent Neural Networks is based on (Kelleher, 2016).



RNNs process sequential data one input at a time. In an RNN the outputs of the neurons in the hidden layer of the network for one input are fed back into the network as part of the next input. Each time an input from a sequence is presented to the network, the outputs of the hidden units for that input are stored in the memory buffer, overwriting whatever was in the memory (Figure 6b). At the next time step, when the next data point in the sequence is considered, the data stored in the memory buffer is merged with the input for that time step (Figure 6c). Consequently, as the network moves through the sequence there is a recurrent cycle of storing the state of the network and using that state at the next time step (Figure 6d).

In order to simplify the following figures we do not draw the individual neurons and connections but represent each layer of neurons as a rounded box and show the flow of information between layers with arrows. Also, we refer to the input layer as x_t, the hidden layer as h_t, the output layer as y_t, and the memory layer as h_{t-1}. Figure 7a illustrates the use of this schematic representation of layers of neurons and the flow of information through an RNN and Figure 7b shows the same network using the shorter naming convention.

Figure 8 demonstrates the flow of information through an RNN as it processes a sequence of inputs. An interesting thing to note is that there is a path connecting each h (the hidden layer for each input) to all the previous hs. Thus, the hidden layer in an RNN at each point in time is dependent on its past. In other words, the network has a memory so that when it is making a decision at time step t it can remember what it has seen previously. This allows the model to take into account data that depends on previous data, for example in sequences. This is the reason why an RNN is useful for language processing: having a memory of the previous words that have been observed in a sentence is predictive of the words that follow them.
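The recurrence just described can be written compactly: the hidden state at time t is a function of the current input and the previous hidden state, h_t = f(W_xh · x_t + W_hh · h_{t-1}). The numpy sketch below is an illustrative implementation of this single step; the sizes, random weights and tanh non-linearity are assumptions, not taken from any of the cited systems.

import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3
W_xh = rng.standard_normal((hidden_dim, input_dim))   # input -> hidden weights
W_hh = rng.standard_normal((hidden_dim, hidden_dim))  # memory buffer -> hidden weights

def rnn_step(x_t, h_prev):
    """Merge the current input with the stored previous hidden state."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev)

h = np.zeros(hidden_dim)                 # empty memory before the sequence starts
for x_t in rng.random((5, input_dim)):   # a toy sequence of five inputs
    h = rnn_step(x_t, h)                 # h is written back to the "memory buffer"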

A language model is a computational model that takes a sequence of words as input and returns a probability distribution from which the probability of each vocabulary word being the next word in the sequence can be predicted. An RNN language model can be trained to predict the next word in a sequence. Figure 9 illustrates how information flows through an RNN language model as it processes a sequence of words and attempts to predict the next word in the sequence after each input. The * indicates the next word as predicted by the system. All going well *Word2 = Word2, but if the system makes a mistake this will not be the case.
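A sketch of the language-model read-out: the hidden state is mapped to a score per vocabulary word and the scores are turned into a probability distribution with a softmax. The five-word vocabulary and random weights below are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(1)
vocab = ["a", "bird", "flying", "over", "water"]
hidden_dim = 3
W_hy = rng.standard_normal((len(vocab), hidden_dim))   # hidden -> output weights

def next_word_distribution(h_t):
    scores = W_hy @ h_t
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()                 # probability of each vocabulary word being next

p = next_word_distribution(rng.random(hidden_dim))
predicted = vocab[int(np.argmax(p))]       # the predicted *Word of Figure 9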

When we have trained a language model we can make it "hallucinate", or generate language, by giving it an initial word and then feeding the word that the language model predicts as the most likely next word back in as the following input word, and so on. Figure 10 shows how we can use an RNN language model to generate text by feeding the words the language model predicts back into the model.
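A sketch of this greedy generation loop is given below; the embeddings, weights, vocabulary and end-of-sequence convention are all illustrative assumptions, so the generated sequence is meaningless, but the control flow (feeding each prediction back in as the next input) is the point.

import numpy as np

rng = np.random.default_rng(2)
vocab = ["a", "bird", "flying", "over", "water", "<end>"]
dim = 8
embed = rng.standard_normal((len(vocab), dim))                  # one vector per word
W_xh, W_hh = rng.standard_normal((dim, dim)), rng.standard_normal((dim, dim))
W_hy = rng.standard_normal((len(vocab), dim))

def generate(first_word, max_len=10):
    words, h = [first_word], np.zeros(dim)
    for _ in range(max_len):
        x = embed[vocab.index(words[-1])]             # embed the most recent word
        h = np.tanh(W_xh @ x + W_hh @ h)              # update the hidden state
        next_word = vocab[int(np.argmax(W_hy @ h))]   # most likely next word
        if next_word == "<end>":
            break
        words.append(next_word)                       # feed the prediction back in
    return " ".join(words)

print(generate("a"))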

3 Grounding Spatial Language in Perception

The symbol grounding problem is the problem of how the meaning of a symbol can be grounded in anything other than other meaningless symbols. Harnad (1990) argues that symbolic representations must be grounded bottom-up from two forms of non-symbolic sensor representations: iconic representations, which can be understood as sensory experience of objects and events, and categorical representations, which are feature detectors that are triggered by invariant features of objects and events within these sensory experiences. Given these two foundational non-symbolic representations, a grounded symbolic system can be built up, with the elementary symbols of this system being the symbolic names or labels of the object and event categories that are distinguished within the categorical representations of the agent. Essentially, the meaning of an elementary symbol is the categorisation of sensor-grounded experience.

The description "meaningless" in the definition above (originally made by Harnad (1990)) should be discussed in relation to the work in distributional semantics (Turney et al., 2010), which has been used very successfully for computational representation of meaning. The reason why distributional semantic representations work is that word contexts capture indirectly the latent situations that co-occurring words are all referring to. Distributional semantic models are built from grounded language (which is therefore not "meaningless"); it is only that grounded representations are not included in the model. Grounding is expressed indirectly through word co-occurrences.


[Figure: four panels showing an input layer, hidden layer, memory buffer, and output layer, with arrows indicating the flow of data. (a) Adding a memory buffer to a feed-forward neural network with one hidden layer. (b) Writing the activation of the hidden layer to the memory buffer after processing the input at time t. (c) Merging the memory buffer with the next input at t+1. (d) The cycle of writing to memory and merging with the next input as the network processes a sequence.]

Figure 6: The flow of data between the memory buffer and the hidden layer in a recurrent neural network


Roy (2005) extends Harnad's approach by setting out a framework of semiotic schemas that ground symbolic meaning in a causal-predictive cycle of action and perception. Within Roy's framework meaning is ultimately grounded in schemas, where each schema is a belief network that connects action, perception, attention, categorisation, inference, and prediction. These schemas can be understood as the interface between the external world (reached through the action and perception components of the schemas) and the agent's internal cognitive processes (attention, categorisation, inference, and prediction).

Spatial language is an interesting case study in grounding language in perception because linguistic descriptions of perceived spatial relations between objects are intrinsically about the world and as such should be grounded within an agent's perception of that world (Dobnik, 2009; Kelleher and Costello, 2009). The most common form of spatial language discussed in the literature is a locative expression. A locative expression is composed of a noun phrase modified by a prepositional phrase that specifies the location of the referent of the noun phrase relative to another object. We will use the term target object to refer to the object whose position is being described and the term landmark object to refer to the object that the target object's location is described relative to.2

2 The literature on locative expressions uses a variety of terms to describe the target and the landmark objects; for a review see (Kelleher, 2003; Dobnik, 2009). Other terms found in the literature for the target object include: trajector, local object, and figure object. Other terms used to describe the landmark object include: reference object, relatum, and ground.


[Figure: (a) an RNN drawn with labelled Input, Hidden, Memory, and Output layers; (b) the same network in the compact notation x_t, h_t, h_{t-1}, y_t.]

Figure 7: Recurrent Neural Network (RNN)

The annotations on the following example locative expression illustrate this terminology:

[The big red book]_Target [on [the table]_Landmark]_Prepositional Phrase

Here "the table" is the landmark within the prepositional phrase "on the table", and the whole noun phrase "The big red book on the table" is the locative expression.
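For later reference, this terminology can be captured in a simple data structure; the Python dataclass below is our illustration, not a representation used in any of the cited systems.

from dataclasses import dataclass

@dataclass
class LocativeExpression:
    target: str        # the object whose position is being described
    preposition: str   # the spatial term
    landmark: str      # the object the target's location is described relative to

example = LocativeExpression(target="the big red book",
                             preposition="on",
                             landmark="the table")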

Previous work on spatial language has revealed a range of factors that impinge on the interpretation of locative expressions. An obvious component in the grounding of a spatial description is the scene geometry and the size and shape of the region described by the spatial term within that geometry. The concept of a spatial template is used to describe these regions, and several experiments have revealed how these templates vary across spatial terms and languages, e.g., (Logan and Sadler, 1996; Kelleher and Costello, 2005; Dobnik and Åstbom, 2017).

[Figure: an RNN unrolled over time steps 1 to t+1, with inputs x_1 ... x_{t+1}, hidden states h_1 ... h_{t+1}, and outputs y_1 ... y_{t+1}.]

Figure 8: An RNN unrolled in time

[Figure: an RNN language model unrolled in time; inputs Word_1 ... Word_t, hidden states h_1 ... h_t, and predicted outputs *Word_2 ... *Word_{t+1}.]

Figure 9: RNN language model unrolled in time

It has also been shown that the geometry of a spatial template for a given preposition is affected by a number of cognitive and contextual factors, including:

• perceptual attention (Regier and Carlson, 2001; Burigo and Knoeferle, 2015; Kluth and Schultheis, 2014),

• the viewer's perspective on the landmark(s) (Kelleher and van Genabith, 2006),

• object occlusion (Kelleher et al., 2011),
• frame of reference ambiguity and alignment between the landmark object and reference frames (Carlson-Radvansky and Logan, 1997; Burigo and Coventry, 2004; Kelleher and Costello, 2005),

• the location of other distractor objects in the scene (Kelleher and Kruijff, 2005b; Costello and Kelleher, 2006; Kelleher et al., 2006),
• and the richness of the perceptual context (Dobnik and Åstbom, 2017).

[Figure: generation with an RNN language model: only Word_1 is given as input; each predicted word *Word_2, *Word_3, ... is fed back in as the next input.]

Figure 10: Using an RNN language model to generate a word sequence



The last point is related to the fact that the factors affecting the semantics of spatial descriptions go beyond scene geometry and include the functional relationships between the target and the landmark (Coventry, 1998; Coventry et al., 2001; Coventry and Garrod, 2004) and force dynamics within the scene (Coventry et al., 2005; Sjöö, 2011). These functional relations can be captured as meanings induced from word distributions (Dobnik and Kelleher, 2013, 2014). Another important factor of (projective) spatial descriptions is their contextual underspecification in terms of the assigned frame of reference, which is coordinated through dialogue interaction between conversational participants (Dobnik et al., 2015) and is therefore based on their coordinated intentions in their interaction.

The research in spatial language semantics highlights its multifaceted nature. Spatial language draws upon (i) geometric concepts, (ii) world knowledge (i.e., an understanding of functional relationships and force dynamics), and (iii) perceptual and discourse cues. Thus in order for a computational system to adequately model spatial language semantics it should accommodate all or most of these factors.
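To make the geometric component (i) concrete, the toy sketch below scores the acceptability of "above" purely from the angle between the upward vertical and the direction from the landmark to the target, in the spirit of a spatial template; the linear drop-off and the 90-degree cut-off are illustrative assumptions, not fitted to any of the experimental data cited above. None of the other factors (ii) and (iii) appear in it, which is precisely the limitation discussed in the next section.

import math

def above_acceptability(target_xy, landmark_xy):
    """Toy acceptability of 'above': 1 directly above the landmark, 0 at or below horizontal."""
    dx = target_xy[0] - landmark_xy[0]
    dy = target_xy[1] - landmark_xy[1]               # y grows upwards in this toy scene
    angle = abs(math.degrees(math.atan2(dx, dy)))    # deviation from straight up
    return max(0.0, 1.0 - angle / 90.0)

print(above_acceptability((0.0, 2.0), (0.0, 0.0)))   # directly above      -> 1.0
print(above_acceptability((2.0, 2.0), (0.0, 0.0)))   # diagonally above    -> 0.5
print(above_acceptability((2.0, 0.0), (0.0, 0.0)))   # level with landmark -> 0.0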

4 Spatial Language in DL

The question that this paper addresses is whether deep learning image captioning architectures as currently constituted are capable of grounding spatial language within the images they are captioning. The outputs of these systems are impressive and often include spatial descriptions. Figure 1 (based on an example from (Xu et al., 2015)) provides an indicative example of the performance of these systems. The generated caption in this case is accurate and what is particularly interesting is that it includes a spatial description: over water. Indeed, the vast majority of generated captions listed in (Xu et al., 2015) include spatial descriptions, some of which include:3

• "A woman is throwing a frisbee in a park"
• "A dog is standing on a hardwood floor"
• "A group of people sitting on a boat in the water".

3 The emphasis on the spatial descriptions was added here.

The fact that these example captions include spatial descriptions and that the captions are often correct descriptions of the input image raises the question of whether image captioning systems are actually learning to ground the semantics of these spatial terms in the images. The nature of neural network systems makes it difficult to directly analyse what a system is learning; however, there are a number of reasons why it would be surprising to find that these systems were grounding spatial language. First, recall from the review of grounding in Section 3 that spatial language draws on a variety of information, including:

• scene geometry,
• perceptual cues such as object occlusion,
• world knowledge including functional relationships and force dynamics,
• and coordinated intentions of interacting agents.

Considering only scene geometry, these image captioning systems use CNNs to encode the representation of the input image. Recall from Section 2.1 that CNNs discard locational information through the (down-sampling) pooling mechanism and that such down-sampling may be applied several times within a CNN pipeline. Although it is possible that the encoding generated by a CNN may capture rough relative positions of objects within a scene, it is likely that this encoding is too rough to accommodate the level of sensitivity of spatial descriptions to location that experimental studies of spatial language have found to be relevant (cf. the changes in acceptability ratings in (Logan and Sadler, 1996; Kelleher and Costello, 2005; Dobnik and Åstbom, 2017) as a target object moved position relative to the landmark). The architecture of CNNs also points to the fact that these systems are unlikely to be modelling perceptual cues. CNNs essentially work by identifying what an object is through a hierarchical merging of local visual features that are predictive of the object type. These local visual features are likely to be features that are parts of the object and therefore frequently co-occur with the object label. Consequently, CNNs are unlikely to learn to identify an object type via context- and viewpoint-dependent cues such as occlusion.


Finally, neither a CNN nor an RNN as currently used in the image description task provides mechanisms to learn force dynamics or functional relationships between objects (cf. (Coventry et al., 2005; Battaglia et al., 2013)), nor do they take into account agent interaction. Viewed in this light the current image captioning systems appear to be missing most of the key factors necessary to ground spatial language. And yet they do appear to generate reasonably accurate captions that include spatial descriptions.

There are a number of factors that may be contributing to this apparent ability. First, an inspection of the spatial descriptions used in the generated captions reveals that they tend to include topological rather than projective spatial prepositions (e.g., on and in rather than to the left of and above): in a forest, in a park, in a field, in the field with trees, in the background, on a bed, on a road, on a skateboard, at a table. These spatial descriptions are more underspecified with regard to the location of the target object relative to the landmark object than projective descriptions, which also require grounding of direction within a frame of reference. Topological descriptions are semantically adequate already if the target is just proximal to the landmark and hence it is more likely that a caption will be acceptable. Furthermore, it is frequently the case that given a particular label for a landmark it is possible to guess the appropriate preposition irrespective of the image and/or the target object type or location. Essentially the task posed to these systems is to fill the blanks with one of at, on, in:

• TARGET ___ a field,
• TARGET ___ the background,
• TARGET ___ a road,
• TARGET ___ a table.

Although the system may get some of the blanks wrong, it is likely to get many of them right. This is because the system can use distributional knowledge of words, which captures some grounding indirectly as discussed in Section 3. Indeed, recent research has shown that the co-occurrence of nouns with a preposition within a corpus of spatial descriptions can reveal functional relations between the objects referred to by the nouns (Dobnik and Kelleher, 2013, 2014). Word co-occurrence is thus highly predictive of the correct preposition. Consequently, language models trained on image description corpora indirectly model partially grounded functional relations, at least within the scope of the co-occurrence likelihood of prepositions and nouns.
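The following toy sketch illustrates this point: with nothing more than co-occurrence counts between landmark nouns and prepositions, the blanks above can often be filled correctly without any access to the image. The counts are invented for illustration and do not come from any corpus.

cooccurrence = {                    # invented (landmark noun, preposition) counts
    "field":      {"in": 120, "on": 15, "at": 5},
    "background": {"in": 90,  "on": 3,  "at": 2},
    "road":       {"in": 8,   "on": 70, "at": 6},
    "table":      {"in": 4,   "on": 30, "at": 55},
}

def guess_preposition(landmark):
    counts = cooccurrence[landmark]
    return max(counts, key=counts.get)   # most frequent preposition for this noun

phrases = {"field": "a field", "background": "the background",
           "road": "a road", "table": "a table"}
for noun, phrase in phrases.items():
    print(f"TARGET {guess_preposition(noun)} {phrase}")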

The implication of this is that current image captioning systems do not ground spatial descriptions in the images they take as input. Instead, the apparent ability of these systems to frequently correctly use spatial prepositions to describe spatial relations within the image is the result of the RNN language model learning to predict the most likely preposition to be used given the target and landmark nouns, where these nouns are predicted from the image by the CNN.

There is a negative and a positive side to this conclusion. Let's start with the negative side. The distinction between cognitive representations of what something is versus where something is has a long tradition in spatial language and spatial cognition research (Landau and Jackendoff, 1993). These image captioning systems would appear to be learning representations that allow them to ground the semantics of what. But they are not learning representations that enable them to ground the semantics of where. Instead, they rely on the RNN language model to make good guesses of the appropriate spatial terms to use based on word distributions. The latter point introduces the positive side. It is surprising how much and how robustly semantic information can be captured by distributional language models. Of course, language models cannot capture the geometric relations between objects; for example, they are not able to distinguish successfully the difference in semantics between the chair is to the left of the table and the chair is to the right of the table, as left and right would occur in exactly the same word contexts. However, as we argued in Section 3, spatial language is not only spatial but also affected by other sources of knowledge that leave an imprint in the word distributions, which capture relations between higher-level categorical representations built upon the elementary grounded symbols (Harnad, 1990). It follows that some categorical representations will be closer to and therefore more grounded in elementary symbols, something that has been shown for spatial language (Coventry, 1998; Coventry et al., 2001; Coventry and Garrod, 2004; Dobnik and Kelleher, 2013, 2014). In conclusion, it follows that successful computational models of spatial language require both kinds of knowledge.


5 Conclusions

In this paper we examined the current architecture for generating image captions with deep learning and argued that in its present setup it fails to ground the meaning of spatial descriptions in the image but nonetheless achieves good performance in generating spatial language, which is surprising given the constraints of the architecture these systems are working with. The information that they are using to generate spatial descriptions is not spatial but distributional, based on word co-occurrence in a sequence as captured by a language model. While such information is required to successfully predict spatial language, it is not sufficient. We see at least two useful areas of future work. On one hand, it should be possible to extend the deep learning configurations for image description to take into account and specialise to learn geometric representations of objects, just as the current deep learning configurations are specialised to learn visual features that are indicative of objects. The work on modularity of neural networks such as (Andreas et al., 2016; Johnson et al., 2017) may be relevant in this respect. On the other hand, we want to study how much information can be squeezed out of language models to successfully model spatial language and what kind of language models can be built to do so.

Acknowledgements

The research of Kelleher was supported by the ADAPT Research Centre. The ADAPT Centre for Digital Content Technology is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Funds.

The research of Dobnik was supported by a grant from the Swedish Research Council (VR project 2014-39) for the establishment of the Centre for Linguistic Theory and Studies in Probability (CLASP) at the Department of Philosophy, Linguistics and Theory of Science (FLoV), University of Gothenburg.

References

Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 2016. Learning to compose neural networks for question answering. CoRR abs/1601.01705:1–10.

Peter W. Battaglia, Jessica B. Hamrick, and Joshua B. Tenenbaum. 2013. Simulation as an engine of physical scene understanding. Proceedings of the National Academy of Sciences 110(45):18327–18332.

Michael Brenner, Nick Hawes, John D. Kelleher, and Jeremy L. Wyatt. 2007. Mediating between qualitative and quantitative representations for task-orientated human-robot interaction. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI). AAAI, pages 2072–2077.

Michele Burigo and Kenny Coventry. 2004. Reference frame conflict in assigning direction to space. In International Conference on Spatial Cognition. Springer, pages 111–123.

Michele Burigo and Pia Knoeferle. 2015. Visual attention during spatial language comprehension. PloS one 10(1):e0115758.

L.A. Carlson-Radvansky and G.D. Logan. 1997. The influence of reference frame selection on spatial template construction. Journal of Memory and Language 37:411–437.

Fintan Costello and John D. Kelleher. 2006. Spatial prepositions in context: The semantics of Near in the presence of distractor objects. In Proceedings of the 3rd ACL-Sigsem Workshop on Prepositions. pages 1–8.

Kenny R. Coventry, Angelo Cangelosi, Rohanna Rajapakse, Alison Bacon, Stephen Newstead, Dan Joyce, and Lynn V. Richards. 2005. Spatial prepositions and vague quantifiers: Implementing the functional geometric framework. In Christian Freksa, Markus Knauff, Bernd Krieg-Brückner, Bernhard Nebel, and Thomas Barkowsky, editors, Spatial Cognition IV. Reasoning, Action, Interaction, Springer Berlin Heidelberg, volume 3343 of Lecture Notes in Computer Science, pages 98–110.

Kenny R. Coventry, Mercè Prat-Sala, and Lynn Richards. 2001. The interplay between geometry and function in the apprehension of Over, Under, Above and Below. Journal of Memory and Language 44(3):376–398.

K.R. Coventry. 1998. Spatial prepositions, functional relations, and lexical specification. In P. Olivier and K.P. Gapp, editors, Representation and Processing of Spatial Expressions, Lawrence Erlbaum Associates, pages 247–262.

K.R. Coventry and S. Garrod. 2004. Saying, Seeing and Acting. The Psychological Semantics of Spatial Prepositions. Essays in Cognitive Psychology Series. Lawrence Erlbaum Associates.

Simon Dobnik. 2009. Teaching mobile robots to use spatial words. Ph.D. thesis, University of Oxford: Faculty of Linguistics, Philology and Phonetics and The Queen's College, Oxford, United Kingdom. 289 pages. http://www.dobnik.net/simon/documents/thesis.pdf.


Simon Dobnik and Amelie Åstbom. 2017. (Perceptual) grounding as interaction. In Volha Petukhova and Ye Tian, editors, Proceedings of Saardial – Semdial 2017: The 21st Workshop on the Semantics and Pragmatics of Dialogue. Saarbrücken, Germany, pages 17–26.

Simon Dobnik, Christine Howes, and John D. Kelleher. 2015. Changing perspective: Local alignment of reference frames in dialogue. In Christine Howes and Staffan Larsson, editors, Proceedings of goDIAL – Semdial 2015: The 19th Workshop on the Semantics and Pragmatics of Dialogue. Gothenburg, Sweden, pages 24–32.

Simon Dobnik and John D. Kelleher. 2013. Towards an automatic identification of functional and geometric spatial prepositions. In Proceedings of PRE-CogSci 2013: Production of referring expressions – bridging the gap between cognitive and computational approaches to reference. Berlin, Germany, pages 1–6.

Simon Dobnik and John D. Kelleher. 2014. Exploration of functional semantics of prepositions from corpora of descriptions of visual scenes. In Anja Belz, Marie-Francine Moens, and Alan F. Smeaton, editors, Proceedings of the 1st Technical Meeting of the European Network on Integrated Vision and Language (V&L Net), a Workshop at the 25th International Conference on Computational Linguistics (COLING). Association for Computational Linguistics, Dublin, Ireland, pages 33–37.

Simon Dobnik and John D. Kelleher. 2016. A model for attention-driven judgements in type theory with records. In Julie Hunter, Mandy Simons, and Matthew Stone, editors, JerSem: The 20th Workshop on the Semantics and Pragmatics of Dialogue. New Brunswick, NJ, USA, volume 20, pages 25–34.

Peter Gorniak and Deb Roy. 2004. Grounded semantic composition for visual scenes. Journal of Artificial Intelligence Research 21:429–470.

Stevan Harnad. 1990. The symbol grounding problem. Physica D 42:335–346.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.

Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Judy Hoffman, Fei-Fei Li, C. Lawrence Zitnick, and Ross B. Girshick. 2017. Inferring and executing programs for visual reasoning. CoRR abs/1705.03633.

John D. Kelleher. 2003. A Perceptually Based Computational Framework for the Interpretation of Spatial Language. Ph.D. thesis, Dublin City University.

John D. Kelleher. 2016. Fundamentals of machine learning for neural machine translation. In Proceedings of the Translating Europe Forum 2016: Focusing on Translation Technologies (doi:10.21427/D78012). European Commission Directorate-General for Translation.

John D. Kelleher and Fintan Costello. 2005. Cognitive representations of projective prepositions. In Proceedings of the Second ACL-Sigsem Workshop on The Linguistic Dimensions of Prepositions and their Use in Computational Linguistic Formalisms and Applications.

John D. Kelleher and Fintan J. Costello. 2009. Applying computational models of spatial prepositions to visually situated dialog. Computational Linguistics 35(2):271–306.

John D. Kelleher and Geert-Jan M. Kruijff. 2005a. A context-dependent algorithm for generating locative expressions in physically situated environments. In Proceedings of the Tenth European Workshop on Natural Language Generation (ENLG-05).

John D. Kelleher and Geert-Jan M. Kruijff. 2005b. A context-dependent model of proximity in physically situated environments. In Proceedings of the 2nd ACL-SIGSEM Workshop on The Linguistic Dimensions of Prepositions and their Use in Computational Linguistics Formalisms and Applications, Colchester, UK.

John D. Kelleher, Geert-Jan M. Kruijff, and Fintan Costello. 2006. Proximity in context: an empirically grounded computational model of proximity for processing topological spatial expressions. In Proceedings of ACL/Coling 2006. Sydney, Australia.

John D. Kelleher, Robert J. Ross, Colm Sloan, and Brian Mac Namee. 2011. The effect of occlusion on the semantics of projective spatial terms: A case study in grounding language in perception. Cognitive Processing 12(1):95–108.

John D. Kelleher and Josef van Genabith. 2006. A computational model of the referential semantics of projective prepositions. In P. Saint-Dizier, editor, Syntax and Semantics of Prepositions, Kluwer Academic Publishers, Dordrecht, The Netherlands, Speech and Language Processing.

Thomas Kluth and Holger Schultheis. 2014. Attentional distribution and spatial language. In International Conference on Spatial Cognition. Springer, pages 76–91.

Barbara Landau and Ray Jackendoff. 1993. "What" and "where" in spatial language and spatial cognition. Behavioral and Brain Sciences 16:217–265.

Yann Le Cun. 1989. Generalization and network design strategies. Technical Report CRG-TR-89-4, University of Toronto Connectionist Research Group.

Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324.


Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature 521(7553):436–444.

Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y. Ng. 2009. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the 26th Annual International Conference on Machine Learning. ACM, New York, NY, USA, ICML '09, pages 609–616.

G.D. Logan and D.D. Sadler. 1996. A computational analysis of the apprehension of spatial relations. In P. Bloom, M.A. Peterson, L. Nadel, and M. Garrett, editors, Language and Space, MIT Press, pages 493–529.

P. McKevitt, editor. 1995. Integration of Natural Language and Vision Processing (Vols. I-IV). Kluwer Academic Publishers, Dordrecht, The Netherlands.

T. Regier and L. Carlson. 2001. Grounding spatial language in perception: An empirical and computational investigation. Journal of Experimental Psychology: General 130(2):273–298.

Deb Roy. 2005. Semiotic schemas: A framework for grounding language in action and perception. Artificial Intelligence 167(1-2):170–205.

Jürgen Schmidhuber. 2015. Deep learning in neural networks: An overview. Neural Networks 61:85–117.

Niels Schütte, Brian Mac Namee, and John D. Kelleher. 2017. Robot perception errors and human resolution strategies in situated human–robot dialogue. Advanced Robotics 31(5):243–257.

Kristoffer Sjöö. 2011. Functional understanding of space: Representing spatial knowledge using concepts grounded in an agent's purpose. Ph.D. thesis, KTH, Computer Vision and Active Perception (CVAP), Centre for Autonomous Systems (CAS), Stockholm, Sweden.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems. pages 3104–3112.

Stefanie Tellex. 2010. Natural language and spatial reasoning. Ph.D. thesis, Massachusetts Institute of Technology.

Peter D. Turney, Patrick Pantel, et al. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research 37(1):141–188.

T. Winograd. 1973. A procedural model of language understanding. In R.C. Schank and K.M. Colby, editors, Computer Models of Thought and Language, W. H. Freeman and Company, pages 152–186.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning. pages 2048–2057.
