
Exploiting Semantic Contextualization for Interpretation of Human Activity in Videos

Sathyanarayanan Aakur, Fillipe DM de Souza, and Sudeep Sarkar
University of South Florida

Tampa, Florida 33620
[email protected], [email protected], [email protected]

Abstract

We use large-scale commonsense knowledge bases, e.g. ConceptNet, to provide context cues to establish semantic relationships among entities directly hypothesized from the video signal, such as putative object and action labels, and to infer a deeper interpretation of events than what is directly sensed. One approach is to learn semantic relationships between objects and actions from training annotations of videos; such approaches depend largely on statistics of the vocabulary in these annotations. However, the use of prior encoded commonsense knowledge sources alleviates this dependence on large annotated training datasets. We represent interpretations using a connected structure of basic detected (grounded) concepts, such as objects and actions, that are bound by semantics to other background concepts not directly observed, i.e., contextualization cues. We mathematically express this using the language of Grenander's pattern generator theory. Concepts are basic generators and the bonds are defined by the semantic relationships between concepts. We formulate an inference engine based on energy minimization using an efficient Markov Chain Monte Carlo process that uses ConceptNet in its move proposals to find these structures. Using three different publicly available datasets, Breakfast, CMU Kitchen, and MSVD, whose distribution of possible interpretations spans more than 150,000 possible solutions for over 5,000 videos, we show that the proposed model can generate video interpretations whose quality is comparable to or better than those reported by discriminative approaches, hidden Markov models, context free grammars, deep learning models, and prior pattern theory approaches, all of which rely on learning from domain-specific training data.

1. Introduction

There has been tremendous progress in object and action category recognition. Effort has shifted to generating simple sentence descriptions [26, 27, 31, 18] or answering simple yes or no questions about image content [30, 25, 13, 2]. The next step is to go beyond what is directly observable in the given image or video. Generating semantically coherent interpretations for a given video involves establishing semantic relationships between the atomic elements (or concepts) of the activity. This requires the use of prior contextual information. We have to contextualize the actions and objects detected, or rather hypothesized, from the image and video signal to arrive at coherent semantic interpretations.

As defined by Gumperz [11], primarily for linguistics, contextualization refers to the use of knowledge acquired from past experience to retrieve the presuppositions required to maintain involvement in the current task. We adapt this concept to computer vision to refer to the integration of past knowledge of concepts to aid in achieving the objective of the current task, i.e., interpreting activities observed in videos. To make it more concrete, in human activity recognition, "concept" includes the actions and objects that constitute an activity; "presuppositions" refers to the background knowledge of concepts, their properties, and the relationships among such concepts. Note that the goal is to generate interpretations of a given activity rather than just recognition.

Extant approaches have explored the use of context and semantic knowledge in different ways. Graphical approaches [7, 14] attempt to explicitly model the semantic relationships that characterize human activities [15, 14] using a variety of methods such as context-free grammars [12], Markov networks [17], and And-Or graphs [29, 1], to name a few. These approaches require labeled training data whose size increases non-linearly with the different semantic combinations of possible actions and objects in the scene.

Deep learning models [18, 27, 20, 31] learn the semantic relationships among actions and objects from training annotations. They are restricted by the quality, quantity, and vocabulary of the annotations present in the dataset. A balanced vocabulary is necessary for these models to acquire a strong ontology and to handle variations in activity labels within the test and validation sets.

1

arX

iv:1

708.

0372

5v1

[cs

.CV

] 1

1 A

ug 2

017


[Figure 1 depicts the pipeline for an input video "Put Egg on Plate": putative action labels (cut, add, put, pour, stir, crack, smear, butter) and object labels (oil, orange, plate, bun, eggs, fruit, pan, bowl), contextualization cues from ConceptNet (e.g., orange → fruit, food; oil → liquid, flammable; box → container; milk → liquid, dairy, food), a pattern theory based inference process, and the output inference "Put Egg on Plate".]

Figure 1: Overall architecture proposed in this paper. Deep learning or machine learning-based approaches result in putative object and action labels; there are multiple possible labels associated with each image/video location. The pattern theory formalism integrates information from ConceptNet to arrive at an interpretive representation, a connected structure expressed using Grenander's language of generators and bonds.

Models such as [5, 28, 3] have explored the use of contextual knowledge by leveraging spatial and situational knowledge.

We define an interpretation as an intermediate representation that forms the basis for the generation of more well-formed expressions, such as sentences, or can be the basis for question-and-answer systems. Specifically, an interpretation is a connected structure of basic concepts, such as objects and actions, that are bound by semantics. We mathematically express this using the language of Grenander's pattern generator theory [10]. Concepts are basic generators and the bonds are defined by the semantic relationships between concepts. Some concepts in this interpretative structure have direct evidence from video, i.e., grounded concepts, and some are inferred concepts that bind grounded concepts, i.e., contextualization cues. We formulate an inference engine based on energy minimization using an efficient Markov Chain Monte Carlo process that uses ConceptNet [16, 22] in its move proposals to find these structures.

Our work is a significant departure from other recent approaches that also use Grenander's pattern theory [9, 8], which rely on labeled training data to capture semantics about a domain, such as phrases describing what happened in the video segment. Unlike them, we do not need to learn action and object combination priors from such annotations, so the applicability of our approach is not restricted to the training domain. The use of a general human knowledge database such as ConceptNet as the source of past knowledge alleviates the need for data annotations. It allows us to leverage knowledge gleaned from a variety of sources and hence does not restrict the knowledge to a particular domain and/or dataset.

The contribution of this paper is three-fold: (1) we are, to the best of our knowledge, among the first to introduce the notion of contextualization in video activity recognition, (2) we show that the introduction of contextualization improves upon the performance reported by state-of-the-art methods, and (3) we reduce the training requirements by negating the need for large amounts of annotations for capturing semantic relationships.

We start with a discussion of how semantic knowledge is captured within the ConceptNet semantic network and how contextualization cues can be constructed. We then explain how interpretations are represented using pattern theory formulations, followed by a discussion on establishing semantic relationships among concepts using the ConceptNet semantic network. The Monte Carlo inference process, which relies on a Markov Chain induced by the structure of ConceptNet, is explained, followed by a demonstration of the efficacy of the approach on three large datasets.

2. Contextualization using ConceptNet

Contextualization cues can be of two types: semantic knowledge of basic concepts, and relational connection of the current activity to similar ones in the recent past. The former, explored further in this paper, provides background knowledge about detected objects and actions to aid the interpretation of the video content. It leads to an understanding of why entities interact with each other. The latter, on the other hand, allows us to anticipate the entities that may be present in the current context based on relational and temporal cues from experience in the recent past. Combining both types of knowledge derived from contextualization allows us to propose models that are able to express their inferred knowledge through meaningful use of language, which leads to better human-machine interaction.

We propose the use of a commonsense knowledge base as a source for contextualization. ConceptNet, proposed by Liu and Singh [16] and expanded to ConceptNet 5 [22], is one such knowledge base that maps concepts and their semantic relationships in a traversable semantic network structure. Spanning more than 3 million concepts, the ConceptNet framework serves as a source of cross-domain semantic information from general human knowledge while supporting commonsense knowledge as expressed by humans in natural language. Technically, it encodes and expresses knowledge in a hypergraph, with the nodes representing concepts and edges representing semantic assertions.

There are more than 25 relations (also referred to as assertions) by which the different nodes are connected, with each of these relations contributing to the semantic relationship between the two concepts; examples include HasProperty, IsA, and RelatedTo, to name a few.


[The snippet shows concepts such as egg, plate, bowl, container, food, pour, add, and put, connected by weighted edges such as IsA, AtLocation, UsedFor, and RelatedTo.]

Figure 2: ConceptNet is a semantic network of commonsense knowledge. Illustrated here is a very small snippet of the concepts present in the ConceptNet framework to show how various semantic relationships between concepts are expressed.

Each relation has a weight that determines the degree of validity of the assertion given its sources and hence provides a quantifiable measure of the semantic relation between concepts. Positive values indicate assertions and negative values indicate the opposite. Figure 2 illustrates these ideas; for example, the edge between the nodes egg and plate represents an assertion with the relation AtLocation to indicate that eggs can be placed or found on plates. While ConceptNet has several assertions that represent different semantic relationships between the different concepts, we currently utilize a subset, more specifically: RelatedTo, IsA, HasA, HasProperty, CapableOf, UsedFor, and Similarity. We also use these relations to link concepts detected in videos via contextual cues to form semantically coherent interpretations.

Two concepts that do not have a direct relationship can be correlated using contextualization cues. Relations, represented by the assertions, such as IsA and HasProperty, connect a particular concept with other concepts that represent their contextual properties. For example, in Figure 2, there is no direct semantic relationship between the concepts egg and put, but the contextualization cue plate connects them and provides a semantic context, i.e., eggs can be put on plates. Contextualization cues allow us to capture semantic relationships among concepts which we could not have obtained if we relied only on semantic assertions that directly connect concepts.

Formally, let concepts be represented by $g_i$ for $i = 1, \ldots, N$ and let $g_i \, R \, g_j$ represent a relation (or assertion) between two concepts. Then a contextualization cue $g_k$ satisfies the following expression:
$$\neg (g_i \, R \, g_j) \;\wedge\; (g_i \, R \, g_k) \;\wedge\; (g_k \, R \, g_j)$$
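To make the idea concrete, the sketch below shows how such a cue could be looked up with the public ConceptNet 5 web API (api.conceptnet.io). The endpoint usage and the simple shared-neighbor search are illustrative assumptions, not a description of the implementation used in this paper.

```python
# Hypothetical sketch: find contextualization cues g_k between two concepts
# g_i and g_j that lack a direct ConceptNet edge. The web-API usage and the
# ranking heuristic are assumptions made for illustration only.
import requests

API = "http://api.conceptnet.io"

def neighbors(concept, limit=100):
    """Return {neighbor_label: edge_weight} for English edges touching `concept`."""
    edges = requests.get(f"{API}/c/en/{concept}", params={"limit": limit}).json()["edges"]
    out = {}
    for e in edges:
        for node in (e["start"], e["end"]):
            label = node["label"].lower()
            if label != concept and node.get("language") == "en":
                out[label] = max(out.get(label, 0.0), e["weight"])
    return out

def contextualization_cues(gi, gj):
    """Concepts g_k such that not(g_i R g_j), g_i R g_k, and g_k R g_j hold."""
    ni, nj = neighbors(gi), neighbors(gj)
    if gj in ni:                      # a direct relation exists; no cue needed
        return []
    shared = set(ni) & set(nj)
    # rank candidate cues by the weaker of their two edge weights
    return sorted(shared, key=lambda k: min(ni[k], nj[k]), reverse=True)

# e.g., contextualization_cues("egg", "put") might surface "plate" (cf. Figure 2)
```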

[The example configuration links the grounded concepts pour and oil to ungrounded concepts such as liquid, fuel, and black through relations including IsA, HasProperty, HasA, UsedFor, and RelatedTo, along with feature bonds.]

Figure 3: Representation of an interpretation using the language of pattern theory. Black circles are generators that represent grounded concepts and red generators represent ungrounded concepts that provide contextualization cues. The red links represent contextual bonds. The dashed links represent the shortest path between two grounded concepts. Note that there are open links for each of the generators present in the configuration, which allow the expansion of the concept, if desired. Note that only a selection of the bond structure of each generator is shown in this example.

3. Starting Point: Grounded Concepts

The starting point consists of putative object and action labels for each region in the image or video, which are our concepts with direct evidence from data. There might be multiple possible labels for each region. We allow for labeling errors and consider the top-k possible labels for each instance. We have experimented with both handcrafted and automated features using well-known methods.

Handcrafted features consist of Histogram of Optical Flow (HOF) [4] for generating action labels and Histogram of Oriented Gradients (HOG) [6] for object labels. HOF features were extracted by computing dense optic flow frames from three temporally sequential segments, each representing the start, development, and end of the action sequence respectively. A histogram of optic flow (weighted by magnitude) is then constructed for these temporal segments to characterize the integral stages of the action. The composite feature for action recognition is then composed by ordered concatenation of the individual HOFs. Action and object labels were generated using linear support vector machines (SVM).
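For illustration, a minimal sketch of such a segment-wise HOF descriptor is given below using OpenCV's dense optical flow; the equal-length segmentation, bin count, and normalization are our own assumptions rather than the exact settings used in the experiments.

```python
# Sketch (with assumed parameters) of a magnitude-weighted HOF descriptor
# built from three temporal segments of a clip, then concatenated in order.
import cv2
import numpy as np

def hof_descriptor(gray_frames, num_segments=3, bins=9):
    """gray_frames: list of grayscale uint8 frames of one video segment."""
    frame_pairs = np.array_split(np.arange(len(gray_frames) - 1), num_segments)
    descriptor = []
    for pairs in frame_pairs:
        hist = np.zeros(bins)
        for t in pairs:
            flow = cv2.calcOpticalFlowFarneback(
                gray_frames[t], gray_frames[t + 1], None,
                0.5, 3, 15, 3, 5, 1.2, 0)
            mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
            h, _ = np.histogram(ang, bins=bins, range=(0, 2 * np.pi), weights=mag)
            hist += h
        descriptor.append(hist / (np.linalg.norm(hist) + 1e-8))  # per-segment L2 norm
    return np.concatenate(descriptor)  # ordered concatenation of the segment HOFs
```

A linear SVM (e.g., scikit-learn's LinearSVC) could then be trained on these descriptors to produce the putative action labels and their confidence scores.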

We experimented with two different strategies for extracting automated features. First, we used deep learning models such as Convolutional Neural Networks (CNN) on images to capture the feature descriptions for objects and CNNs based on optic flow for actions (CNN-Flow), as was done in [9, 8]. In the second strategy, we follow the work in [27] and use mean-pooled values extracted from the fc7 layer for each frame from a CNN model pre-trained on a subset


of the ImageNet dataset [21]. In addition to video-based inference, we allow for audio-based inference using bag-of-audio-words (BoAW) and spectrogram features as auditory feature descriptors.

4. Video Interpretation Representation

Interpreting video content consists of constructing a semantically coherent composition of basic, atomic elements of knowledge, called concepts, detected from videos. These concepts represent the individual actions and objects that are required to form an interpretation of an activity. We use Grenander's pattern theory language [10] to represent interpretations. Unlike prior uses of Grenander's pattern theory for video description [9, 8], we do not need the manual specification of the alphabets and associated rules of this language. They are naturally determined by ConceptNet.

4.1. Representing Concepts: Generators

Following Grenander's notation [10], each concept is represented by a single, atomic component called a generator $g_i \in G_S$, where $G_S$ is the generator space. The generator space represents a finite collection of all possible generators that can exist in a given environment. In the proposed approach, the collection of all concepts present within ConceptNet serves as the generator space.

The generator space ($G_S$) consists of three disjoint subsets that represent three kinds of generators: feature generators ($F$), grounded concept generators ($G$), and ungrounded context generators ($C$). Feature generators ($g_{f_1}, g_{f_2}, g_{f_3}, \ldots, g_{f_q} \in F$) correspond to the features extracted from videos and are used to infer the presence of the basic concepts (actions and objects) called grounded concept generators ($g_1, g_2, g_3, \ldots, g_k \in G$). Individual units of information that represent the background knowledge of these grounded concept generators are called ungrounded context generators ($g_1, g_2, g_3, \ldots, g_q \in C$). The term grounding is used to distinguish between generators with direct evidence in the video data and those that define the background knowledge of these concepts.

4.2. Connecting Generators: Bonds

Each generator $g_i$ has a fixed number of bonds, called the arity of the generator ($w(g_i)\ \forall g_i \in G_S$). These bonds represent the semantic assertions from ConceptNet. Bonds are differentiated at a structural level by the direction of information flow that they represent, in-bonds and out-bonds, as can be seen from Figure 4(a), where the bonds representing RelatedTo and feature are in-bonds and HasProperty and IsA are out-bonds for the generator put. Each bond is identified by a unique coordinate and bond value taken from a set $B$, such that the $j$-th bond of a generator $g_i \in G_S$ is denoted as $\beta^j_{dir}(g_i)$, where $dir$ denotes the direction of the bond.

[Panel (a) shows individual generators such as put, egg, and food; panel (b) shows bonded pairs such as put-HOF, egg-HOG, and egg-food; panel (c) shows a complete configuration over put, egg, plate, food, and chicken.]

Figure 4: An illustration of generators and their bond structures. Case (a) gives the structure of individual generators. The generators in black represent grounded generators and those in red represent ungrounded generators. Case (b) represents bonded pairs of generators. Case (c) represents a complete configuration representing an interpretation for the video "Put egg on plate".

Types of Bonds: There exist two types of bonds: semantic bonds and support bonds. The direction of a semantic bond signifies the semantics of a concept and the type of relationship a particular generator shares with its bonded generator or concept. These bonds are analogous to the assertions present in the ConceptNet framework. For example, Figure 3 illustrates an example configuration with pour, oil, liquid, etc. representing generators and the connections between them, given by RelatedTo, IsA, etc., representing the semantic bonds. The bonds highlighted in red indicate the presence of ungrounded context generators, representing the presence of contextual knowledge. Support bonds connect (grounded) concept generators to feature generators that represent direct image evidence. These bonds are quantified using confidence scores from classification models.

Bond Compatibility: The viability of a closed bond between two generators is determined by the bond relation function $\rho$.


This function determines whether two bonds $\beta(g_i)$ and $\beta(g_j)$ between two generators, $g_i$ and $g_j$, are compatible and is denoted by $\rho[\beta(g_i), \beta(g_j)]$ or simply as $\rho[\beta', \beta'']$. This function represents whether a given bond $\beta^j_{dir}(g_i)$ is either closed or open. The bond relation function is given by
$$\rho[\beta(g_i), \beta(g_j)] = \{\text{TRUE}, \text{FALSE}\}; \quad \forall g_i, g_j \in G_S \quad (1)$$
A bond is said to be closed if it is connected to another generator through a bond, i.e., an out-bond of a generator $g_i$ is connected to a generator $g_j$ through one of its in-bonds or vice versa; otherwise, the bond is open. For example, take the first case from Figure 4(b), representing the bonded generator pair {put, HOF}. The bonds representing HasProperty, IsA, and RelatedTo are considered to be open, whereas the bond representing feature is a closed bond.

Quantifying Bond Strength: The strength of the semantic relationship between two bonded generators is quantified by the bond energy function:
$$a(\beta'(g_i), \beta''(g_j)) = q(g_i, g_j)\, \tanh(f(g_i, g_j)) \quad (2)$$
where $f(\cdot)$ is the weight associated with the relation in ConceptNet between concepts $g_i$ and $g_j$ through their respective bonds $\beta'$ and $\beta''$. The tanh function normalizes the score output by $f(\cdot)$ to the range $-1$ to $1$. The factor $q(g_i, g_j)$ weights the score output by the tanh function according to the type of bond connection (e.g., semantic or support) that $\beta'$ and $\beta''$ form. It lets us trade off between direct evidence and prior knowledge, i.e., context: if we want direct evidence to play a larger role in the inference, we choose a larger value of $q(g_i, g_j)$ for support bonds than for semantic bonds, and vice versa.
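As a small worked illustration of Equation 2, the sketch below computes bond energies with an assumed choice of the weighting factor q; the specific q values for support versus semantic bonds are ours, not taken from the paper.

```python
# Minimal sketch of the bond energy in Eq. 2: a = q(g_i, g_j) * tanh(f(g_i, g_j)).
# The concrete q values below (favoring direct evidence) are assumptions.
import math

Q_BY_BOND_TYPE = {"support": 2.0, "semantic": 1.0}

def bond_energy(bond_type, f_weight):
    """bond_type: 'support' (classifier confidence) or 'semantic' (ConceptNet
    edge weight); f_weight is the raw weight f(g_i, g_j)."""
    return Q_BY_BOND_TYPE[bond_type] * math.tanh(f_weight)

# e.g., a semantic RelatedTo edge with weight 2.261 contributes
# bond_energy("semantic", 2.261) ≈ 0.978 to the negated sum in Eq. 5.
```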

4.3. Interpretations: Configurations of Generators

Generators can be combined through their local bond structures to form structures called configurations $c$, which, in our case, represent semantic interpretations of video activities. Each configuration has an underlying graph topology, specified by a connector graph $\sigma$. The set of all feasible connector graphs $\sigma$ is denoted by $\Sigma$, also known as the connection type. Formally, a configuration $c$ is a connector graph $\sigma$ whose sites $1, 2, \ldots, n$ are populated by a collection of generators $g_1, g_2, \ldots, g_n$, expressed as
$$c = \sigma(g_1, g_2, \ldots, g_i); \quad g_i \in G_S. \quad (3)$$
The collection of generators $g_1, g_2, \ldots, g_i$ represents the semantic content of a given configuration $c$. For example, the collection of generators from the configuration in Figure 3 gives rise to the semantic content "pour oil (liquid) (fuel) (black)".

Configuration Probability: The probability of a particular configuration $c$ is determined by its energy as given by the relation
$$P(c) \propto e^{-E(c)} \quad (4)$$
where $E(c)$ represents the total energy of the configuration $c$. The energy $E(c)$ of a configuration $c$ is the sum of the bond energies formed by the bond connections that combine the generators in the configuration, as described in Equation 5:
$$E(c) = -\sum_{(\beta', \beta'') \in c} a\!\left(\beta'(g_i), \beta''(g_j)\right) + Q(c) \quad (5)$$
where $Q$ is the cost factor associated with using ungrounded context generators that do not contribute to the overall description, as seen from Equation 6; i.e., any ungrounded context generator that possesses bonds that do not connect to grounded generators increases the energy of the configuration and hence reduces its probability.

The cost factor $Q$ of a given configuration $c$ is given by
$$Q(c) = k \sum_{g_i \in G'} \sum_{\beta^j_{out} \in g_i} \left[ D\!\left(\beta^j_{out}(g_i)\right) \right] \quad (6)$$
where $G'$ is the collection of ungrounded contextual generators present in the configuration $c$, $\beta_{out}$ represents each out-bond of each generator $g_i$, and $D(\cdot)$ is a function that returns true if the given bond is open. $k$ is an arbitrary constant that quantifies the extent of the detrimental effect that the ungrounded context generators have on the quality of the interpretation.

The cost factor $Q(c)$ restricts the inference process from constructing configurations with degenerate cases, such as those composed of unconnected or isolated generators that do not have any closed bonds and as such do not connect to other generators in the configuration. It also prevents the inference from spawning generators that do not add semantic value to the overall quality of the configuration, thereby reducing the search space to arrive at an optimal interpretation.
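A compact sketch of how Equations 5 and 6 could be accumulated over a configuration is shown below; the representation of a configuration as a list of closed bonds plus per-generator open out-bond counts, and the value of k, are assumptions made for illustration.

```python
# Sketch of the configuration energy E(c) of Eq. 5 with the cost factor Q(c)
# of Eq. 6, reusing bond_energy() from the previous sketch. The data layout
# and k = 0.5 are illustrative assumptions.
def configuration_energy(closed_bonds, open_out_bonds_per_ungrounded, k=0.5):
    """closed_bonds: iterable of (bond_type, f_weight) pairs, one per closed
    bond in c; open_out_bonds_per_ungrounded: {ungrounded generator: number
    of its open out-bonds}."""
    bond_sum = sum(bond_energy(btype, f) for btype, f in closed_bonds)
    q_cost = k * sum(open_out_bonds_per_ungrounded.values())
    return -bond_sum + q_cost   # lower energy means higher probability (Eq. 4)
```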

5. Inferring Interpretations

Searching for the best semantic description of a video involves minimizing the energy function $E(c)$ and constitutes the inference process. The solution space spanned by the generator space is very large, as both the number of generators and the structures can vary. For example, the combination of a single connector graph $\sigma$ and a generator space $G_S$ gives rise to a space of feasible configurations $C(\sigma)$. While the structure of the configurations $c \in C(\sigma)$ is identical, their semantic content varies due to the different assignments of generators to the sites of the connector graph. A feasible optimization solution for such an exponentially large


space is to use a sampling strategy. We follow the work in [9] and employ a Markov Chain Monte Carlo (MCMC) based simulated annealing process. The MCMC-based simulation method requires two types of proposal functions: global and local proposal functions.

Algorithm 1: Local Proposal Function
1: localProposal(c, m)
2: Randomly select $g_k \in c$
3: Form a set $G'$ of $m$ generators $g_i$ such that $g_i \in G_S$ and $g_j \neq g_i$
4: Form a set $C'$ of generators $\{g_j\}$ such that $g_j \in C$ and $g_i \, R \, g_j$ exists $\forall g_i \in G'$
5: Remove $g_j$ from $c$
6: Remove $\{g_j\} \in C'$ from $c$ such that there exists $g_i \, R \, g_j$
7: $c' \leftarrow c$
8: Select the generator $g_i$ that minimizes $E(\sigma(c', g_i))$
9: Add $g_i$ to $c'$
10: Add $\{g_k\} \in C'$ to $c$ such that there exists $g_i \, R \, g_k$
11: $c'' \leftarrow c'$
12: return $c''$

A connector graph $\sigma$ is given by a global proposal function, which makes structural changes to the configuration that are reflected as jumps from one subspace to another. To this end, we follow the global proposal from [9]. A swapping transformation is applied to switch the generators within a configuration to change the semantic content of a given configuration $c$. Algorithm 1 shows the local proposal function, which induces the swapping transformation. This results in a new configuration $c'$, thus constituting a move in the configuration space $C(\sigma)$.

Initially, the global proposal function introduces a set of grounded concept generators derived from machine learning classifiers. Then, a set of ungrounded context generators, representing the contextualization cues, is populated for each grounded concept within the initial configuration. Bonds are established between compatible generators when each generator is added to the configuration. Each jump given by the local proposal function gives rise to a configuration whose semantic content represents a possible interpretation for the given video. Interpretations with the least energy are considered to have a higher probability of possessing greater semantic coherence. Hence, an additional optimization constraint is to minimize the cost factor $Q$ given in Equation 6 by ensuring the least number of open bonds for each ungrounded contextual generator in the configuration.
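The sketch below outlines the kind of annealing loop this process implies: local-proposal moves are accepted with a Metropolis-style probability derived from Equation 4. The geometric cooling schedule and acceptance rule are standard simulated-annealing choices assumed here for illustration; they are not taken from [9] or from this paper.

```python
# Illustrative simulated-annealing loop over configurations. `local_proposal`
# stands in for Algorithm 1 and `energy` for Eqs. 5-6; the cooling schedule
# is an assumption.
import math
import random

def anneal(initial_config, local_proposal, energy, iters=5000, t0=1.0, cooling=0.999):
    current, e_curr = initial_config, energy(initial_config)
    best, e_best = current, e_curr
    temp = t0
    for _ in range(iters):
        candidate = local_proposal(current)        # swap move (Algorithm 1)
        e_cand = energy(candidate)
        # always accept downhill moves; accept uphill moves with Boltzmann probability
        if e_cand < e_curr or random.random() < math.exp((e_curr - e_cand) / temp):
            current, e_curr = candidate, e_cand
            if e_curr < e_best:
                best, e_best = current, e_curr
        temp *= cooling                            # geometric cooling
    return best                                    # lowest-energy interpretation found
```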

6. Results

We begin with a discussion of the three publicly available datasets that we use, followed by a presentation of qualitative and quantitative results on them. Performance is quantified using measures that facilitate comparison with other approaches, such as precision, F-score, and BLEU score [19]. We compare performance against a variety of competing approaches: discriminative methods (DM) [14], hidden Markov models (HMM) [14], context free grammars (CFG+HMM) [14], factor graphs [26], different manifestations of deep recurrent neural networks [27, 18, 31], and generic pattern theory (PT) approaches [9]. We also consider a variation of the pattern theory approach in [9], where we use the ConceptNet edge weights instead of training from a domain ontology, called "PT+weights".
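For reference, BLEU against the multiple ground-truth sentences available per MSVD video can be computed along the lines of the sketch below; the use of NLTK's corpus_bleu and its smoothing function is our own choice for illustration and not necessarily the evaluation code used here.

```python
# Hedged sketch: corpus-level BLEU between generated interpretations and the
# multiple reference sentences per video. Tokenization and smoothing are assumptions.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def bleu(references_per_video, hypotheses):
    """references_per_video: list (one entry per video) of lists of reference
    strings; hypotheses: list of generated description strings."""
    refs = [[r.lower().split() for r in video_refs] for video_refs in references_per_video]
    hyps = [h.lower().split() for h in hypotheses]
    return corpus_bleu(refs, hyps, smoothing_function=SmoothingFunction().method1)
```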

6.1. Datasets

The Breakfast Actions dataset consists of more than 1000 recipe videos covering different scenarios with a combination of 10 recipes, 52 subjects, and differing viewpoints captured from up to 5 cameras, which provides videos of differing quality with varying amounts of clutter and occlusion. The units of interpretation are temporal video segments of these videos, given by the video annotation provided along with the dataset. We used the handcrafted features (HOG and HOF) discussed in Section 3.

The Carnegie Mellon University Multimodal Activity dataset (CMU) contains multimodal measures of human activities such as cooking and food preparation. The dataset contains five different recipes: brownies, pizza, sandwich, salad, and scrambled eggs. Spriggs et al. [23] generated the ground truth for some videos and recipes. The experiments were performed on the brownie recipe videos for performance comparison with [8] and used the same deep learned features as used by them.

The Microsoft Video Description Corpus (MSVD) is a publicly available dataset that contains 1,970 videos taken from YouTube. On average, there are 40 English descriptions available per video. We follow the split proposed in prior works [26, 27, 31, 18] and use 1,200 videos for training, 100 for validation, and 670 for testing. We used the deep learning features that came with the dataset, i.e., the second deep learning strategy mentioned in Section 3. This is a reasonable representation of each video considering that the durations of the videos are relatively short and hence provide an acceptable visual summary of the video.

6.2. Qualitative Results

Figure 5 shows four example interpretations generated by our approach (last column) for each of the videos represented in the first column. These can be compared against the results from the prior pattern theory approach that requires training (PT+training, second column) and its variation where we replaced training with Similarity weights from ConceptNet (PT+weights, third column). As is evident, the representation produced by our approach is richer, with contextualization cues that go beyond what is seen in the image. For example, take a video with ground truth "read brownie box".


(a) Ground truth: Put egg on plate | PT+training: Put salt-pepper on glass | PT+weights: Add egg to plate | Our approach: Put (chicken) egg (food) on plate
(b) Ground truth: Put fruit into bowl | PT+training: Stir tea, dough | PT+weights: Put tea in bowl | Our approach: Put (food) fruit (orange) in bowl
(c) Ground truth: Pour dough on pan | PT+training: Pour juice to dough | PT+weights: Put from glass to bowl | Our approach: Pour (liquid) (semisolid) dough on pan
(d) Ground truth: Read brownie box | PT+training: Take brownie box | PT+weights: Stir brownie box | Our approach: Read (food) (recipe) on brownie box (container)

Figure 5: Video descriptions generated by the pattern theory method in [9] (PT+training), this pattern theory model but with ConceptNet weights instead of training (PT+weights), and the approach of this paper. The corresponding equivalent text is shown below each figure. The text in parentheses represents contextualization cues. Black circles denote grounded concepts and red circles denote ungrounded contextualization cues.

Our approach was able to contextualize the presence of the grounded concepts read and brownie box through the ungrounded contextual generators container, food, and recipe. Without contextualization cues, not only are there errors in the interpretations of the prior pattern theory approaches, but they are also not as rich and descriptive.

6.3. Quantitative Results

Breakfast Actions Dataset: The proposed approach outperforms the HMM-based and Context Free Grammar models reported by Kuehne et al. [14], as well as the pattern theory model [9], as seen from Table 1.


It is important to note that the performance of the proposed approach is remarkable considering that the model is neither trained specifically for the kitchen domain nor on the dataset itself, other than for obtaining the starting grounded action and object labels. Other methods are restricted by the vocabulary of the training data when building their descriptions. For example, the Context Free Grammar method makes use of temporal information, such as states and transitions between states, to build the final descriptions, which is not the case with our approach.

Approach                     Precision
DM [14]                      6.10%
HMM [14]                     14.90%
CFG + HMM [14]               31.8%
PT+training (Top 10) [9]     38.60%
Our Approach (Top 10)        39.64%

Table 1: Results on the Breakfast Actions dataset. "Top 10" means that we consider the best 10 interpretations generated by the approach.

CMU Dataset: We explore the use of multi-modal features to leverage the auditory features present in the CMU activities dataset. Table 2 compares our performance with the best performing feature set reported in [8] (PT+training), and its variation that replaces training with ConceptNet Similarity edge weights (PT+weights).

Approach          F-Score
PT+training [8]   69.9%
PT+weights        64.6%
Our Approach      70.7%

Table 2: Results on the CMU kitchen dataset using audio + video features.

MSVD Dataset: On the Microsoft Video Description Corpus (MSVD) dataset, we report the BLEU score to compare against other comparable approaches. We choose the two best performing models from [27] for comparison. The proposed approach outperforms most of the other approaches and is very close to the top performing approach, as seen from Table 3. An important aspect to note is that the other state-of-the-art approaches take into account more complex features that are more descriptive of the concepts within the video; by contrast, we take into account only the mean-pooled CNN features across frames for generating candidate labels for all concepts. For example, the HRNE model [18] makes use of temporal characteristics in the video; the Temporal Attention model [31] leverages the frame-level representation from GoogLeNet [24] as well as a video-level representation from a 3-D Convolutional Neural Network trained on hand-crafted descriptors, in addition to an attention model that provides dynamic attention to specific temporal regions of the video for generating descriptions.

Approach                                   BLEU Score
Thomason et al. [26]                       13.68%
Venugopalan et al. (COCO) [27]             33.29%
Venugopalan et al. (COCO + FLICKR) [27]    33.29%
Yao et al. [31]                            41.92%
Pan et al. [18]                            43.6%
Our Approach (Best)                        34.93%
Our Approach (Top 10)                      42.98%

Table 3: Results on the Microsoft Video Description Corpus (MSVD) dataset.

Immunity to Imbalance in Training Data: Not all labels are equally represented in most training data. This is particularly acute as the number of labels increases. To demonstrate that our method is immune to this effect, we partitioned the activity class labels into 4 different categories based on the amount of training data available. Table 4 shows the performance of our approach as compared to the prior pattern theory approaches, PT+training and PT+weights. As expected, the performance of PT+training, which relies on annotations [9], increases with more training data, whereas our approach is stable.

                   Number of training samples
Approach           < 10      10-20     20-40     ≥ 40
PT+training [8]    11.81%    17.36%    22.98%    35.47%
PT+weights         25.41%    36.57%    37.10%    34.16%
Our Approach       34.76%    40.07%    38.23%    38.35%

Table 4: Impact of training data imbalance. We show the performance (F-Score) of our approach, the prior pattern theory approach (PT+training), and its variation (PT+weights) for four different categories of labels that differ in the number of training samples.

7. Conclusion

Contextualization based on commonsense knowledge, such as ConceptNet, can be used to generate rich semantic interpretations of video (and audio) data that go beyond what is directly observable. This also helps us break the ever-increasing demand for large training data that grows with the size of the output descriptions. There is no training required beyond that needed for the starting object and action labels. We demonstrated how the language of Grenander's pattern theory can be used to naturally capture the semantics and context in ConceptNet and infer rich interpretative structures. The inference process allows for multiple putative object and action labels for each video event to overcome errors in classification. These structures are intermediate representations that can be the basis for the generation of sentences or even used


for visual question answering. Extensive experiments demonstrate the applicability of the approach to a wide set of domains and its highly competitive performance.


References

[1] M. R. Amer, S. Todorovic, A. Fern, and S.-C. Zhu. Monte Carlo tree search for scheduling activity recognition. In IEEE International Conference on Computer Vision (ICCV), pages 1353–1360, 2013.

[2] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh. VQA: Visual question answering. In IEEE International Conference on Computer Vision (ICCV), December 2015.

[3] G. Cai. Contextualization of geospatial database semantics for human–GIS interaction. Geoinformatica, 11(2):217–237, 2007.

[4] R. Chaudhry, A. Ravichandran, G. Hager, and R. Vidal. Histograms of oriented optical flow and Binet-Cauchy kernels on nonlinear dynamical systems for the recognition of human actions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1932–1939. IEEE, 2009.

[5] Q. Chen, Z. Song, J. Dong, Z. Huang, Y. Hua, and S. Yan. Contextualizing object detection and classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(1):13–27, 2015.

[6] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, pages 886–893. IEEE, 2005.

[7] P. Das, C. Xu, R. F. Doell, and J. J. Corso. A thousand frames in just a few words: Lingual description of videos through latent topics and sparse object stitching. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2634–2641, 2013.

[8] F. D. de Souza, S. Sarkar, and G. Camara-Chavez. Building semantic understanding beyond deep learning from sound and vision. In International Conference on Pattern Recognition (ICPR), 2016.

[9] F. D. de Souza, S. Sarkar, A. Srivastava, and J. Su. Spatially coherent interpretations of videos using pattern theory. International Journal of Computer Vision, pages 1–21, 2016.

[10] U. Grenander. Elements of Pattern Theory. JHU Press, 1996.

[11] J. J. Gumperz. Contextualization and understanding. Rethinking Context: Language as an Interactive Phenomenon, 11:229–252, 1992.

[12] S.-W. Joo and R. Chellappa. Recognition of multi-object events using attribute grammars. In International Conference on Image Processing (ICIP), pages 2897–2900. IEEE, 2006.

[13] K. Kafle and C. Kanan. Answer-type prediction for visual question answering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.

[14] H. Kuehne, A. Arslan, and T. Serre. The language of actions: Recovering the syntax and semantics of goal-directed human activities. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 780–787, 2014.

[15] T. Lan, L. Sigal, and G. Mori. Social roles in hierarchical models for human activity recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1354–1361. IEEE, 2012.

[16] H. Liu and P. Singh. ConceptNet — a practical commonsense reasoning tool-kit. BT Technology Journal, 22(4):211–226, 2004.

[17] V. I. Morariu and L. S. Davis. Multi-agent event recognition in structured scenarios. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3289–3296. IEEE, 2011.

[18] P. Pan, Z. Xu, Y. Yang, F. Wu, and Y. Zhuang. Hierarchical recurrent neural encoder for video representation with application to captioning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.

[19] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: a method for automatic evaluation of machine translation. In Annual Meeting of the Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics, 2002.

[20] M. Rohrbach, W. Qiu, I. Titov, S. Thater, M. Pinkal, and B. Schiele. Translating video content to natural language descriptions. In IEEE International Conference on Computer Vision, pages 433–440, 2013.

[21] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

[22] R. Speer and C. Havasi. ConceptNet 5: A large semantic network for relational knowledge. In The People's Web Meets NLP, pages 161–176. Springer, 2013.

[23] E. H. Spriggs, F. D. L. Torre, and M. Hebert. Temporal segmentation and activity classification from first-person sensing. In IEEE Workshops on Computer Vision and Pattern Recognition (CVPRW), pages 17–24, June 2009.

[24] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–9, 2015.

[25] M. Tapaswi, Y. Zhu, R. Stiefelhagen, A. Torralba, R. Urtasun, and S. Fidler. MovieQA: Understanding stories in movies through question-answering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.

[26] J. Thomason, S. Venugopalan, S. Guadarrama, K. Saenko, and R. J. Mooney. Integrating language and vision to generate natural language descriptions of videos in the wild. In International Conference on Computational Linguistics (COLING), volume 2, page 9, 2014.

[27] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko. Translating videos to natural language using deep recurrent neural networks. arXiv preprint arXiv:1412.4729, 2014.

[28] Y. Wang, D. M. Krum, E. M. Coelho, and D. A. Bowman. Contextualized videos: Combining videos with environment models to support situational understanding. IEEE Transactions on Visualization and Computer Graphics, 13(6):1568–1575, 2007.

[29] P. Wei, Y. Zhao, N. Zheng, and S.-C. Zhu. Modeling 4D human-object interactions for event and object recognition. In IEEE International Conference on Computer Vision, pages 3272–3279. IEEE, 2013.

[30] Q. Wu, P. Wang, C. Shen, A. Dick, and A. van den Hengel. Ask me anything: Free-form visual question answering based on knowledge from external sources. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.

[31] L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville. Describing videos by exploiting temporal structure. In IEEE International Conference on Computer Vision (ICCV), pages 4507–4515, 2015.