Language Grounding Model: Connecting Utterances and Visual Attributions

Wei Zhang and Xiaojie Wang

Abstract—The task of language grounding is to study the relationship between language and external physical stimuli. In this paper, we build a language grounding model which is an extension of the hidden Markov model. In a show-and-tell experiment, we use this model to learn word meanings and a simple bi-gram syntax, and finally generate natural language descriptions of specific 2-D scenes automatically. The experimental results show the validity of our model in word categorization, semantic learning, and phrase generation.

Manuscript received July 11, 2011. This research has been supported by NSFC 90920006 and RFDP 20090005110005.

Wei Zhang is a Ph.D. candidate at the Center for Intelligence Science and Technology, Beijing University of Posts and Telecommunications, No. 10, Xitucheng Road, Haidian District, Beijing, China 100876 (phone: 00-86-10-57171688; e-mail: [email protected]). She is also a lecturer at the School of Automation, Wuhan University of Technology.

Xiaojie Wang is with the Center for Intelligence Science and Technology, Beijing University of Posts and Telecommunications, No. 10, Xitucheng Road, Haidian District, Beijing, China 100876 (e-mail: [email protected]).

I. INTRODUCTION

Traditional semantic models tend to represent semantics using linguistic knowledge. They are concerned with the relations between different language units and study how meaning attaches to larger chunks of text, usually as a result of the composition of smaller units of meaning. Different from traditional language models, the language grounding model focuses on the relationship between language and external physical stimuli, such as visual information.

Here we explore the language cognition process by building a semantic grounding model. Word grounding originated from the concept of symbol grounding [1] presented by Harnad in 1990. He held that the grounding of a word in our head is the medium bridging the word we read (and understand) in a book and its external referent. The idea of word grounding has been used for word semantic representations based on external physical stimuli [2]–[4], leading to language cognition methods different from traditional semantic models. The external physical stimuli perceived by humans include sight, smell, sound, taste, touch, and so on. Note that language grounding is not the ultimate goal but a method, which can be applied to word learning [5], automatic scene description [6], and natural language comprehension [7] and generation [8].

In order to test the validity of our model, we apply it to an MLA (miniature language acquisition) task. The MLA task was put forward by Feldman et al. [9] and has been called a touchstone of cognitive science.


The system is given examples of pictures paired with true statements about those pictures in an arbitrary natural language, and it must learn the relevant portion of the language well enough that, given a new sentence of that language, it can tell whether or not the sentence is true of the accompanying picture. A lot of work [10], [11] has been inspired by this show-and-tell task and has extended it to word grounding. For example, Steels and Kaplan embedded a name-learning system in the hardware platform of the Sony robot dog AIBO [12]. Dominey and Voegtlin designed a system that can learn the meanings of new words from events happening in dynamic video and the synchronous annotations made by a human experimenter [13].

Notably, DESCRIBER [6] and Newt [14], built by Roy and his team, are two typical show-and-tell systems; they explored the noun and adjective grounding and generation problem based on 8 visual features. DESCRIBER is based on a rectangle description task. Synthetic images, each containing 10 colored rectangles whose widths, heights, positions, and RGB colors are randomly generated, are used for training and testing. Each image is augmented with an indicator arrow which selects one of the ten rectangles as the target object. The description task consists of generating phrases which best describe the target objects. As the inverted version of DESCRIBER, Newt is a trainable, visually grounded, spoken language understanding system. Once trained, a person can interact with the system by verbally describing objects placed in front of it, and the system points, in real time, to the object which best fits the visual semantics of the spoken description.

DESCRIBER provides many useful ideas. First, it demonstrates the effect of language learning by generating descriptions for new images. Second, by using computer-generated images, visual feature extraction in DESCRIBER is greatly simplified compared with camera images, which lets the investigators focus on modeling the language grounding system and ignore the image processing step. Moreover, the concept of the semantic association vector, which measures how strongly a word is related to each visual feature, is very enlightening. Last but not least, syntactic and contextual constraints are combined in one score formula measuring the fitness between an image and a description utterance.

The task of this paper is to some extent similar to that of DESCRIBER. Like DESCRIBER, we use computer-generated images, K-L divergence, and the hybrid greedy clustering method for grouping words. Admittedly, the visual scene of DESCRIBER is more complex than ours, since each of our images contains only one object.



However, we have extended the task in several directions. We introduce more shapes, such as circle, triangle, pentagon, and hexagon, and consequently adopt more complex visual shape features to ground shape words. In DESCRIBER, Roy used a forward-searching algorithm based on multivariate K-L divergence for feature selection; this algorithm is time consuming and performs poorly when training samples are sparse. In this paper, we present a feature selection method based on the mean semantic association vector (MSAV). Because of its simplicity, this method overcomes the sparse-sample problem to some extent and, as demonstrated in the experimental results, selects more appropriate features for words than Roy's method.

Another limitation of DESCRIBER is that it lacks an explicit language grounding model and the corresponding representation. Although Roy used a score formula to measure the fitness between an image and a description utterance, he did not give a generative model for the utterance grounding problem. A key concern of this paper is therefore to combine text semantics and visual features in a single language grounding model. This grounding model is an extension of the HMM whose structure encodes bi-gram syntax and a word-meaning representation based on visual features.

The rest of this paper is organized as follows: Section II presents the show-and-tell task we undertake; Section III describes the language grounding model; Section IV gives the experimental details, the evaluation method, and the result analysis; finally, Section V draws conclusions and offers some discussion.

II. THE SHOW-AND-TELL TASK

The show-and-tell task we undertake is a 2-D scene description task. In the "show" step, it takes image-utterance pairs as input. The images are generated by computer, and each contains one colored geometric object whose location, size, and color are all randomized. Each image is paired with a labeling utterance describing the object in natural language. Because of the free style of natural language, different people may describe the same scene in very different ways, in terms of vocabulary, grammar, language style, or even focus. Therefore, two constraints are imposed on the labeling. First, the annotation must focus on the colored geometric object. Second, the attributes described should cover only the categories of color, location, size, and shape.
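For concreteness, a minimal sketch of how such a training image could be generated is given below; the image size, value ranges, shape set, and the Pillow drawing calls (regular_polygon requires Pillow 8.0 or later) are our own illustrative assumptions, not details specified in the paper.

```python
# Illustrative sketch: generate one 2-D scene containing a single colored
# geometric object with random location, size, and color (the "show" input).
import random
from PIL import Image, ImageDraw

def random_scene(width=320, height=240):
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    color = tuple(random.randint(0, 255) for _ in range(3))        # random RGB
    cx, cy = random.randint(60, width - 60), random.randint(60, height - 60)
    r = random.randint(15, 55)                                      # object size
    shape = random.choice(["square", "circle", "triangle", "pentagon", "hexagon"])
    if shape == "circle":
        draw.ellipse([cx - r, cy - r, cx + r, cy + r], fill=color)
    elif shape == "square":
        draw.rectangle([cx - r, cy - r, cx + r, cy + r], fill=color)
    else:
        sides = {"triangle": 3, "pentagon": 5, "hexagon": 6}[shape]
        draw.regular_polygon((cx, cy, r), n_sides=sides, fill=color)
    return img, {"shape": shape, "color": color, "center": (cx, cy), "size": r}

image, attributes = random_scene()
image.save("scene.png")
```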

In the "tell" step, given new images, the system automatically describes them in natural language with proper syntax and word choice. Fig. 1 illustrates the overall procedure of this show-and-tell task.

III. LANGUAGE GROUNDING MODEL

A. Corpus Processing

Before building the grounding model, some corpus processing steps are needed to bridge the gap between an image and its description utterance. Fig. 2 gives an overview of corpus processing. For each image, we extract 13 visual features, chosen intuitively for our language learning task: 7 Hu invariant moments [15] representing the shape of the geometric object, the horizontal and vertical positions x and y of the object center, the area feature, and the r, g, b color features.
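As an illustration of this extraction step, the sketch below computes the 13 features with OpenCV, assuming a single object on a white background; the thresholding strategy and the use of the binary mask's moments are our assumptions rather than the paper's exact procedure.

```python
# Illustrative sketch: 7 Hu invariant moments, object-center x and y, area,
# and mean r, g, b of the object region, packed into a 13-dimensional vector.
import cv2
import numpy as np

def extract_features(image_path):
    img = cv2.imread(image_path)                        # BGR image
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Assume a white background: everything darker than the background is object.
    _, mask = cv2.threshold(gray, 250, 255, cv2.THRESH_BINARY_INV)
    m = cv2.moments(mask, binaryImage=True)
    hu = cv2.HuMoments(m).flatten()                     # phi_1 ... phi_7 (shape)
    cx, cy = m["m10"] / m["m00"], m["m01"] / m["m00"]   # object center
    area = m["m00"]                                     # object area in pixels
    b, g, r = (float(img[:, :, c][mask > 0].mean()) for c in range(3))
    return np.concatenate([hu, [cx, cy, area, r, g, b]])
```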

Unlike English, Chinese text needs to be segmented into a word sequence before analysis; an automatic segmentation program is applied in our system.

Thus, each image-utterance pair is transformed into a "feature vector-word sequence" pair; that is, each 13-dimensional feature vector extracted from an image corresponds to a sequence of labeling words. However, this correspondence alone is not enough for word grounding.

Fig. 1. The show-and-tell task. In the "show" step, the system learns syntax and semantic rules from image-utterance pairs (e.g., "the deep green square", "the bottom-left deep salmon circle"); in the "tell" step, it predicts utterances for new images.

Fig. 2. The overview of corpus processing: image feature extraction (φ1~φ7, x, y, area, r, g, b), utterance segmentation (e.g., "the deep green square" → the / deep / green / square), and feature selection for words.



Before building a word semantic model, we should select a related feature set for each word, i.e., carry out a feature selection step. In some word learning systems, such as the AIBO system [12], this step is omitted because the feature sets were assigned by humans. For each word, we compute a semantic association vector [6], which quantifies the closeness of the relationship between the word and each feature. Subsequently, a hybrid greedy clustering algorithm [6], which integrates word distribution patterns and semantic association vectors, is used to group words into separate classes.
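The semantic association vector itself is computed following [6], and the paper does not restate its formula. The sketch below is therefore only one plausible formulation, assumed for illustration: each component scores a feature by the K-L divergence between the feature's distribution over images whose utterances contain the word and its distribution over all images, using 1-D Gaussian approximations; the exact definition in [6] may differ.

```python
# Hedged sketch: a word-feature association score based on 1-D Gaussian KL
# divergence, normalized so the 13 components of each word's vector sum to 1.
import numpy as np

def gauss_kl(mu0, var0, mu1, var1):
    """KL( N(mu0,var0) || N(mu1,var1) ) for one-dimensional Gaussians."""
    return 0.5 * (np.log(var1 / var0) + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0)

def semantic_association_vector(features, word_mask):
    """features: (N, 13) array of image features; word_mask: boolean (N,)
    marking images whose utterance contains the word."""
    assoc = np.zeros(features.shape[1])
    for j in range(features.shape[1]):
        f_all, f_w = features[:, j], features[word_mask, j]
        assoc[j] = gauss_kl(f_w.mean(), f_w.var() + 1e-6,
                            f_all.mean(), f_all.var() + 1e-6)
    return assoc / assoc.sum()
```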

Based on the word classes and the semantic association vector of each word, we compute the mean semantic association vector (MSAV) of each word class. By setting a threshold T0, we implement feature selection for each word class: if the jth component of a class's MSAV exceeds T0, the jth visual feature is added to that class's feature subset. Among all 13 visual features, we consider the features selected for each word class to be the ones most tightly connected to the class in meaning. The feature selection results will be used for word grounding in the next step.
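A minimal sketch of this MSAV-based selection, assuming the 13 features are ordered as listed in Section III-A and using the threshold T0 = 0.02 reported in Section IV:

```python
# Average the semantic association vectors of the words in each class, then
# keep every feature whose mean component exceeds the threshold T0.
import numpy as np

FEATURE_NAMES = ["phi1", "phi2", "phi3", "phi4", "phi5", "phi6", "phi7",
                 "x", "y", "area", "r", "g", "b"]

def select_features(assoc_vectors, word_classes, t0=0.02):
    """assoc_vectors: dict word -> 13-dim association vector;
    word_classes: dict class_id -> list of words in that class."""
    subsets = {}
    for cid, words in word_classes.items():
        msav = np.mean([assoc_vectors[w] for w in words], axis=0)   # the MSAV
        subsets[cid] = [f for j, f in enumerate(FEATURE_NAMES) if msav[j] > t0]
    return subsets
```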

B. Word Semantic Grounding

For each word to be grounded, we then use the multivariate Gaussian function to model the word-conditional distribution over the selected features, denoted $p(x_w \mid w)$, where $w$ denotes the word and $x_w$ denotes the feature vector composed of the features selected for $w$'s class. Because this word-conditional distribution embodies the word meaning in terms of visual features, we call it the word semantic grounding model based on visual attributions. The model is estimated using the feature vector observations that co-occur with $w$: after calculating the unbiased estimators of the mean vector μ and covariance matrix Σ of $p(x_w \mid w)$, the word $w$ is called visually grounded.
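Concretely, this estimation amounts to fitting one multivariate Gaussian per word to the selected feature vectors that co-occur with it; a short sketch using NumPy and SciPy (allow_singular is a safeguard we add for nearly singular covariances) could look like this:

```python
# Fit the word-conditional density p(x | w) as a multivariate Gaussian.
import numpy as np
from scipy.stats import multivariate_normal

def ground_word(cooccurring_features):
    """cooccurring_features: (N, d) array of selected feature vectors taken from
    the images whose utterances contain the word."""
    mu = cooccurring_features.mean(axis=0)               # mean vector estimate
    sigma = np.cov(cooccurring_features, rowvar=False)   # unbiased covariance
    return multivariate_normal(mean=mu, cov=sigma, allow_singular=True)

# p_x_given_w = ground_word(X_word)      # the word is now "visually grounded"
# p_x_given_w.pdf(x_new)                 # density of a new image's features
```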

We choose the multivariate Gaussian model to represent single-word semantics because its explicit form and mathematical tractability make it suitable for the phrase semantic grounding model discussed in Section III-C.

C. HMM-Based Phrase Semantic Grounding Model Combining Word Semantics and Bi-gram Syntax

In Sections III-A and III-B, the word semantic models are built through feature selection and word grounding. The next step is to combine phrase syntax into the whole grounding model. Because the syntax rules are relatively simple in this scene description task, a bi-gram over word classes is used in the model. We introduce the modified hidden Markov model shown in Fig. 3 to model the semantics of description utterances. This model differs from a traditional HMM in that, between the hidden-state word class $C_{q_t}$ and the observed variable $x_{q_t}$, there exists another hidden state, the word $w_{q_t}(j_t)$; that is, there are two levels of hidden states in the phrase grounding model.

In Fig. 3, the gray nodes correspond to the observation feature subset sequence $X = (x_{q_1}, \ldots, x_{q_T})$, and the hollow latent nodes correspond to the two levels of hidden states. The hidden states in the first level are word class variables forming the word class sequence $C = (C_{q_1}, \ldots, C_{q_T})$.

The conditional transition probability from word class $C_{q_{t-1}}$ to $C_{q_t}$ is denoted $P(C_{q_t} \mid C_{q_{t-1}})$. The hidden states in the second level represent the words, which compose the word sequence $u = \big(w_{q_1}(j_1), \ldots, w_{q_T}(j_T)\big)$. $P\big(w_{q_t}(j_t) \mid C_{q_t}\big)$ denotes the probability of choosing the $j_t$th word from word class $C_{q_t}$, and $p\big(x_{q_t} \mid w_{q_t}(j_t)\big)$ is the word-conditional pdf over the selected visual feature subset $x_{q_t}$. It should be noted that the selected visual features contained in $x_{q_t}$ depend on the word class $C_{q_t}$, so the subscripts of the observation feature subset sequence $X$ are the same as those of $C$.

The conditional transition probability $P(C_{q_t} \mid C_{q_{t-1}})$ and the word-choice probability $P\big(w_{q_t}(j_t) \mid C_{q_t}\big)$ can both be easily obtained by counting over the training data. Moreover, $p\big(x_{q_t} \mid w_{q_t}(j_t)\big)$ is exactly the word semantic grounding model of $w_{q_t}(j_t)$ discussed in Section III-B. Thus, the parameters of this phrase semantic grounding model are all known.
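The counting just mentioned could be implemented as follows; the start symbol "&lt;s&gt;" and the dictionary layout are our own conventions, not notation from the paper.

```python
# Estimate the class-bigram transition probabilities and the word-given-class
# probabilities by counting over the segmented training utterances.
from collections import Counter, defaultdict

def estimate_probabilities(utterances, word_to_class):
    """utterances: list of word lists; word_to_class: word -> class id."""
    trans = defaultdict(Counter)       # counts for P(C_t | C_{t-1})
    emit = defaultdict(Counter)        # counts for P(w | C)
    for words in utterances:
        prev = "<s>"                   # start-of-utterance symbol
        for w in words:
            c = word_to_class[w]
            trans[prev][c] += 1
            emit[c][w] += 1
            prev = c
    P_trans = {p: {c: n / sum(cs.values()) for c, n in cs.items()}
               for p, cs in trans.items()}
    P_emit = {c: {w: n / sum(ws.values()) for w, n in ws.items()}
              for c, ws in emit.items()}
    return P_trans, P_emit
```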

Fig. 3. The HMM-based phrase semantic grounding model. The gray nodes correspond to the observation feature subset sequence, and the hollow latent nodes correspond to two levels of hidden states.

As in Fig. 3, the joint probability of $X$ and $u$ is given by:

$$P(X, u) = P(C_{q_1}) \prod_{t=2}^{T} P(C_{q_t} \mid C_{q_{t-1}}) \prod_{t=1}^{T} P\big(w_{q_t}(j_t) \mid C_{q_t}\big)\, p\big(x_{q_t} \mid w_{q_t}(j_t)\big) \qquad (1)$$

in which $T$ is the length of the utterance $u$. As $X$ derives from the observation data, this probability can be considered a function of the word class sequence $C$ and the word sequence $u$. The log probability is given by:

$$\log P(X, u) = \log P(C_{q_1}) + \sum_{t=2}^{T} \log P(C_{q_t} \mid C_{q_{t-1}}) + \sum_{t=1}^{T} \Big[\log P\big(w_{q_t}(j_t) \mid C_{q_t}\big) + \log p\big(x_{q_t} \mid w_{q_t}(j_t)\big)\Big] \qquad (2)$$

The optimal word sequence $u^{*}$ satisfies:

$$u^{*} = \arg\max_{u} \log P(X, u) \qquad (3)$$

In order to separate the effects of the syntactic and semantic constraints, we define two components:

$$S_{syn}(u) = \log P(C_{q_1}) + \sum_{t=2}^{T} \log P(C_{q_t} \mid C_{q_{t-1}}) \qquad (4)$$

$$S_{sem}(u) = \sum_{t=1}^{T} \Big[\log P\big(w_{q_t}(j_t) \mid C_{q_t}\big) + \log p\big(x_{q_t} \mid w_{q_t}(j_t)\big)\Big] \qquad (5)$$

Here (4) corresponds to syntactic patterns, and (5) captures word frequency and semantic information. Therefore, (3) can be written as:

$$u^{*} = \arg\max_{u} \big[S_{syn}(u) + S_{sem}(u)\big] \qquad (6)$$

We further introduce an interpolation parameter $\lambda$ into (6) to compensate for the difference in scale between (4) and (5):

$$u^{*} = \arg\max_{u} \big[\lambda\, S_{syn}(u) + S_{sem}(u)\big] \qquad (7)$$

Thus, for a given image, (4), (5), and (7) can be used to obtain the optimal description utterance of length $T$.

Given the model parameters and an observation sequence, the Viterbi algorithm is usually used to find the most probable hidden-state sequence of a traditional HMM. Here we adopt a simplified method: we enumerate all possible utterances and keep the one that maximizes (7). We choose this method because the maximum utterance length in our system is small, so the computational complexity is affordable.
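A sketch of this enumeration-based decoder is given below. It scores candidates with the λ-weighted criterion of (7) using the components (4) and (5), skips candidates containing duplicate word classes, and enumerates word-class sequences rather than raw word strings: for a fixed class sequence, the best word of each class can be chosen independently because a word only affects the semantic term. The function and variable names (and the "&lt;s&gt;" start symbol) are our own conventions.

```python
# Enumerate class sequences, pick the best word per class, and keep the
# utterance maximizing lambda * S_syn + S_sem, i.e., criterion (7).
import itertools
import math

def describe(features_by_class, words_in_class, P_trans, P_emit, pdfs, lam=1.0):
    """features_by_class: class -> feature subvector of the new image;
    words_in_class: class -> grounded words; pdfs: word -> Gaussian of III-B."""
    best, best_score = None, -math.inf
    classes = list(words_in_class)
    for T in range(1, len(classes) + 1):
        for seq in itertools.permutations(classes, T):    # no duplicate classes
            utt, s_syn, s_sem, prev, ok = [], 0.0, 0.0, "<s>", True
            for c in seq:
                if c not in P_trans.get(prev, {}):         # unseen class bigram
                    ok = False
                    break
                s_syn += math.log(P_trans[prev][c])
                # Best word of this class for the observed features.
                def sem(wd):
                    return math.log(P_emit[c][wd]) + pdfs[wd].logpdf(features_by_class[c])
                w = max(words_in_class[c], key=sem)
                s_sem += sem(w)
                utt.append(w)
                prev = c
            if ok and lam * s_syn + s_sem > best_score:
                best, best_score = " ".join(utt), lam * s_syn + s_sem
    return best
```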

IV. EXPERIMENT AND RESULTS

A. Word Modeling Results

The training data contain 1000 randomly generated images paired with utterances annotated by humans. Five human volunteers participated in labeling the 1000 images. Table I shows some of the utterances they provided.

The resulting 1000 image-utterance pairs are then fed into the learning stage. To make the learning results reliable, words whose frequencies are below 13 are discarded.

We construct the semantic association vector for each word according to Section III-A, and cluster all the words into 7 classes by using the hybrid greedy clustering method. The clustering result is shown in Table II.

According to Section III-A, the mean semantic association vector (MSAV) of each class is calculated. The feature selection threshold T0 is set to 0.02. For each word class, every feature whose corresponding MSAV component exceeds the threshold is selected and added to the feature subset. Although T0 is set manually, which reduces the automation of the system, this method is very fast and simple to implement. Most importantly, it is more robust than the forward-searching feature selection algorithm used by Roy when training samples are scarce. Table III shows the feature subsets selected for all classes by these two methods.

TABLE I
SOME UTTERANCES LABELED BY EXPERIMENTERS

Index   Description Utterance
0003    the deep blue square
0158    the bottom purplish-red square
0249    the blue circle
0260    the deep salmon circle

TABLE II
CLUSTERED WORD CLASSES

Word Class Index   Words
0                  blue, green, yellow, pink, purple, amaranthine, salmon, red, brown, golden, sky-blue, gray, orange, white
1                  an auxiliary word in Chinese, like "of" in English
2                  square, circle, triangle, pentagon, hexagon
3                  light, deep, dark
4                  big, small
5                  bottom, left, up-right, bottom-right, right, bottom-left, up, up-left
6                  bright

TABLE III
FEATURE SETS SELECTED FOR WORD CLASSES

Word Class Index   MSAV Algorithm   Forward Searching Algorithm
0                  r, g, b          φ2~φ7, area, r, g, b
1                  none             φ1~φ7, x, y, area, r, g, b
2                  φ1~φ7            φ1~φ7
3                  g                φ4~φ7, g, b
4                  x, area          φ5~φ7, x, area
5                  x, y             φ2~φ7, x, y
6                  g                φ5, φ7, g



We can see that the results of the MSAV method are more reasonable than those of the forward-searching algorithm.

With the feature set selected for each word class by the MSAV method, we model word semantics with the multivariate Gaussian model described in Section III-B. Fig. 4 plots the contours of equal probability density for "bottom-left", "left", "up-right", and "up" over the features x and y. We can see that the distributions of the four location words are consistent with their intuitive semantics. For example, the distribution of "bottom-left" is mainly located in the region with small x and large y coordinates, and that of "up" is mainly located in the region with small y and moderate x.

B. Phrase Generation Results and Evaluation

According to (6), we calculate the output description utterance for each new image. Table IV shows two test images and the utterances generated for them. For each image, utterances are output for all possible lengths $T$; the maximum $T$ does not exceed the number of word classes, i.e., 7. If a candidate utterance contains words from duplicate classes, it is omitted and not output. In the output utterances the word order is proper. However, for a great number of images, the utterances with the longest length are not syntactically complete. This is because the effect of the semantic component (5) is so strong that it overwhelms that of the syntax component (4). We therefore adopt (7) to calculate the output utterances, in which the interpolation parameter $\lambda$ compensates for the difference in scale between (4) and (5); choosing an appropriate value of $\lambda$ enhances the influence of the syntax component. The generation results obtained with (7) are also shown in Table IV (experiment 1), where we only output the longest utterance without duplicate word classes. It can be observed that these utterances are proper in syntax and context.

Moreover, in order to investigate the effects of color features on the generation accuracy of color and color-modifier words, we introduce additional color features, namely SI (saturation and intensity) and LUV. Three combinations of color features, LUV, SI+RGB, and SI+LUV, are used in place of the RGB features in experiments 2, 3, and 4.

We evaluated the experimental results by using a method originating from machine translation evaluation [16]. Accuracies at the word level and at the utterance level were counted separately. Three human experimenters manually annotated the 100 test images, and their utterances were subsequently used as references in the automatic evaluation process. Unlike the utterances used for training, the reference utterances here must be complete in syntax and semantics so that the evaluation is comprehensive; each utterance therefore includes the five categories of location, color modifier, color, size, and shape. Furthermore, the words used must be chosen from the grounded vocabulary.

At the word level, a voting method was adopted to judge each word: for a word in the utterance generated for a test image, if it can be found in at least two of the reference utterances for that image, it is considered correct. We counted the word accuracies of the five categories separately, as well as the average accuracy. At the utterance level, an utterance is correct only if all of its words are correct; otherwise, it is wrong. The evaluation results are shown in Table V, where experiments 1~4 are based on the MSAV method and experiments 5~8 are based on DESCRIBER's method.
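The word-level voting rule described above can be written compactly; the data layout assumed here (one generated utterance and three reference utterances per test image, all as word lists) is ours and only illustrative.

```python
# A generated word counts as correct if it appears in at least two of the
# reference utterances for its image; accuracy is correct words / total words.
def word_accuracy(generated, references):
    correct = total = 0
    for hyp, refs in zip(generated, references):
        for w in hyp:
            total += 1
            if sum(w in ref for ref in refs) >= 2:
                correct += 1
    return correct / total if total else 0.0
```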

Fig. 4. Contours of the Gaussian distributions for words grounded in the x and y features.

TABLE IV
UTTERANCES GENERATED FOR TWO TEST IMAGES

Test Image   Output Utterances with (6)              Output Utterance with (7)
Image 1      T=1: bright                             the bottom big light sky-blue square
             T=2: bright light
             T=3: bright light blue
             T=4: small bright light blue
Image 2      T=1: bright                             the left big light yellow circle
             T=2: yellow circle
             T=3: bright yellow circle
             T=4: big bright yellow circle
             T=5: big bright light yellow circle



V. DISCUSSION AND CONCLUSIONS

A. Discussion

In Table V, taking experiments 1 and 5 as examples, in which the original 13-dimensional features are used, the word average accuracy of our model is 0.70, higher than the 0.682 of DESCRIBER. The utterance accuracy of our model is 0.13, and that of DESCRIBER is 0.11. In general, the improvement of our model in description generation accuracy is mainly due to the good performance of the MSAV-based feature selection method. This demonstrates that the feature selection results of the MSAV method capture word semantics better than those of the forward-searching algorithm.

Of all the word categories, the shape category is the most accurate. Its accuracy is as high as 0.99 because the shape types of geometric objects in this task are limited to five and the meanings of shape words are unambiguous; for example, a triangle will almost never be mistaken for a circle or anything else.

The second most accurate is the color category. Although the number of color words is the largest among all categories, their meanings are relatively precise, so the accuracy of the color category (0.79) is still fairly high.

Because the size category contains only two different words, its accuracy is above 0.70. However, it is more ambiguous than the shape and color categories. For a middle-sized object, for example, one annotator may label it "big" while another uses "small". This subjectivity in labeling the training and reference sets degrades both learning and evaluation. The location category faces the same problem and, in addition, contains more words, so its accuracy is the lowest.

From the results of experiments 1~4, it can be observed that the introduced LUV and SI color features do not improve the accuracy of color-category words, but they do improve the accuracy of color-modifier words in experiments 3 and 4.

B. Conclusions

In this paper, we have studied some crucial issues in Chinese semantic grounding by building a system that transforms 2-D images into natural language descriptions. The key components of this system include the visual-feature-based representation of word semantics, the MSAV feature selection method, and the HMM-based Chinese phrase semantic grounding model. Using an evaluation technique originally developed for machine translation, the accuracy of this model is 70% at the word level and 13% at the utterance level.

From the evaluation results we can see that feature selection plays an important role in the grounding model. The word average accuracy and the utterance accuracy of our system are both higher than those of DESCRIBER, which is mainly owed to the better feature selection performance of MSAV compared with the K-L-based forward-searching algorithm.

The contribution of the HMM-based phrase semantic grounding model is that it provides an explicit generative model for image features and can also be used for the converse task, that is, generating images from descriptions.

We have also discussed the relation between a word category's accuracy and both the number of words it contains and its level of semantic ambiguity: the fewer words a category has, and the less ambiguous its semantics, the more accurately it is generated.

Similar to some existing word grounding systems, the visual features in this system are well defined in order to greatly improve system efficiency by avoiding the semantic gap between word semantics and visual features. However, for images captured in real scenes, such features are difficult to extract precisely because current segmentation techniques are not robust enough.

TABLE V
EVALUATION RESULTS

                          This Paper                                 DESCRIBER's Method
Category                  1. RGB   2. LUV   3. SI+RGB   4. SI+LUV    5. RGB   6. LUV   7. SI+RGB   8. SI+LUV
Color                     0.79     0.76     0.75        0.75         0.74     0.74     0.69        0.65
Shape                     0.99     0.99     0.99        0.99         0.99     0.99     0.99        0.99
Location                  0.50     0.50     0.50        0.50         0.43     0.43     0.44        0.44
Size                      0.71     0.71     0.71        0.71         0.76     0.76     0.76        0.76
Color Modifier            0.51     0.49     0.56        0.56         0.49     0.47     0.56        0.56
Word Average Accuracy     0.70     0.69     0.702       0.702        0.682    0.678    0.688       0.680
Utterance Accuracy        0.13     0.14     0.15        0.16         0.11     0.10     0.11        0.11

Eight experiments were run: experiments 1~4 are based on our model, and experiments 5~8 on DESCRIBER's method. In all eight experiments the same visual features are used, apart from the different color features.



Therefore, if we want to extend this work to real, complex image description tasks, better representations of word semantics will be needed.

This system is our first step toward grounding Chinese natural language in visual attributes, and we have obtained some valuable conclusions. Future work may focus on increasing the complexity of the images and language used in the system, adding verbs and adverbs to the vocabulary to be learned, and adding more objects to the scene, which currently contains only one object.

ACKNOWLEDGMENT

This research was supported by the National Natural Science Foundation of China (No. 90920006) and the Research Fund for the Doctoral Program of Higher Education of China (No. 20090005110005).

REFERENCES

[1] S. Harnad, "The symbol grounding problem," Physica D, vol. 42, pp. 335-346, 1990.
[2] K. Bonawitz, A. Kim, and S. Tardiff, "An architecture for word learning using bidirectional multimodal structural alignment," in Proc. HLT-NAACL 2003 Workshop on Learning Word Meaning from Non-Linguistic Data, Morristown, NJ, USA: Association for Computational Linguistics, 2003, pp. 30-37.
[3] D. Roy, "Semiotic schemas: a framework for grounding language in action and perception," Artificial Intelligence, vol. 167, no. 1-2, pp. 170-205, 2005.
[4] D. Roy and E. Reiter, "Connecting language to the world," Artificial Intelligence, vol. 167, pp. 1-12, 2005.
[5] K. Barnard and D. Forsyth, "Learning the semantics of words and pictures," in Proc. Eighth IEEE International Conference on Computer Vision, 2001, pp. 408-415.
[6] D. K. Roy, "Learning visually-grounded words and syntax for a scene description task," Computer Speech and Language, vol. 16, pp. 353-386, 2002.
[7] P. Gorniak and D. Roy, "Situated language understanding as filtering perceived affordances," Cognitive Science: A Multidisciplinary Journal, vol. 31, pp. 197-231, 2007.
[8] A. Cangelosi, T. Riga, B. Giolito, and D. Marocco, "Language emergence and grounding in sensorimotor agents and robots," in Proc. First International Workshop on Emergence and Evolution of Linguistic Communication, Kanazawa, Japan, 2004.
[9] J. Feldman, G. Lakoff, A. Stolcke, and S. Weber, "Miniature language acquisition: a touchstone for cognitive science," in Proc. 12th Annual Conference of the Cognitive Science Society, Cambridge, MA, 1990, pp. 686-693.
[10] D. Roy, "Grounding words in perception and action: computational insights," Trends in Cognitive Sciences, vol. 9, pp. 389-396, 2005.
[11] C. Yu and D. H. Ballard, "On the integration of grounding language and learning objects," in Proc. Nineteenth National Conference on Artificial Intelligence, 2004.
[12] L. Steels and F. Kaplan, "AIBO's first words, the social learning of language and meaning," Evolution of Communication, vol. 4, pp. 3-32, 2000.
[13] P. F. Dominey and T. Voegtlin, "Learning word meaning and grammatical constructions from narrated video events," in Proc. HLT-NAACL 2003 Workshop on Learning Word Meaning from Non-Linguistic Data, Morristown, NJ, USA: Association for Computational Linguistics, 2003, pp. 38-45.
[14] D. Roy, P. Gorniak, N. Mukherjee, and J. Juster, "A trainable spoken language understanding system for visual object selection," in Proc. 7th International Conference on Spoken Language Processing, Denver, CO, USA, 2002, pp. 593-596.
[15] R. C. Gonzales and R. E. Woods, Digital Image Processing. Addison Wesley, 1993, pp. 514-518.
[16] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: a method for automatic evaluation of machine translation," in Proc. 40th Annual Meeting of the ACL, Philadelphia, 2002, pp. 311-318.
