


A Causal And-Or Graph Model for Visibility Fluent Reasoning in Human-Object Interactions*

Lei Qin³*, Yuanlu Xu¹*, Xiaobai Liu², Song-Chun Zhu¹

¹Dept. Computer Science and Statistics, University of California, Los Angeles (UCLA)
²Dept. Computer Science, San Diego State University (SDSU)
³Inst. Computing Technology, Chinese Academy of Sciences

[email protected], [email protected], [email protected], [email protected]

Abstract

Tracking humans that are interacting with other subjects or the environment remains unsolved in visual tracking, because the visibility of the humans of interest in videos is unknown and might vary over time. In particular, it is still difficult for state-of-the-art human trackers to recover complete human trajectories in crowded scenes with frequent human interactions. In this work, we consider the visibility status of a subject as a fluent variable, whose changes are mostly attributed to the subject's interactions with the surroundings, e.g., crossing behind another object, entering a building, or getting into a vehicle. We introduce a Causal And-Or Graph (C-AOG) to represent the causal-effect relations between an object's visibility fluents and its activities, and develop a probabilistic graph model to jointly reason about the visibility fluent changes (e.g., from visible to invisible) and track humans in videos. We formulate the above joint task as an iterative search for a feasible causal graph structure that enables fast search algorithms, e.g., dynamic programming. We apply the proposed method to challenging video sequences to evaluate its capabilities of estimating visibility fluent changes of subjects and tracking subjects of interest over time. Results with comparisons demonstrate that our method clearly outperforms alternative trackers and can recover complete trajectories of humans in complicated scenarios with frequent human interactions.

1 Introduction

Tracking objects of interest in videos is a fundamental computer vision task that has great potential in many video-based applications, e.g., security surveillance, disaster response, and border patrol. In these applications, a critical problem is how to obtain the complete trajectory of the object of interest while observing it moving through the scene in camera views. This is a challenging problem since an object of interest might undergo frequent interactions with the surroundings, e.g., entering a vehicle or a building, or with other objects, e.g., passing behind another subject. With these interactions, the visibility status of a subject will vary over time, e.g., change from invisible to visible and vice versa. In the literature, most state-of-the-art trackers employ appearance cues or motion cues to localize subjects in video

*Equal contributors.

Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Figure 1: Illustration of visibility fluent changes. There are three states: visible, occluded, contained. When a person approaches a vehicle, his/her state changes from "visible" to "occluded" to "contained", such as person 1 and person 2 in (a-e). When a vehicle passes, person 4 is occluded; the state of person 4 changes from "visible" to "occluded" in (d-e). (f) shows the corresponding top-view trajectories. The numbers are person IDs. The arrows indicate the moving directions. While traditional trackers are likely to fail when the objects become invisible, our proposed method can jointly infer objects' locations and visibility fluent changes, which helps to recover the complete trajectories. Best viewed in color.

sequences and are likely to fail to track subjects whose visibility status changes. To deal with the above challenges, in this work we propose to explicitly reason about subjects' visibility status over time while tracking the subjects of interest in surveillance videos. The developed techniques, with slight modifications, can be generalized to other scenarios, e.g., hand-held cameras and driverless vehicles.

The key idea of our method is to introduce a fluent variable for each subject of interest to explicitly indicate his/her visibility status in videos. Fluent was first used by Newton to denote the time-varying status of an object. It is also used to represent varying object status in commonsense reasoning (Mueller 2014). In this paper, the visibility status of objects is described as fluents varying over time. As illustrated in Fig. 1, person 3 and person 5 are walking through the parking lot, while person 1 and person 2 are entering a sedan. The visibility status of person 1 and person 2 changes first from "visible" to "occluded", and



Figure 2: An illustration of a person's actions and his/her visibility fluent changes when he/she enters a vehicle.

then to "contained". This group of video frames directly demonstrates how an object's visibility fluent changes over time along with its interactions with the surroundings.

We introduce a graphical model to represent the causal relationships between an object's activities (actions/sub-events) and its visibility fluent changes.

Fig. 3 visualizes the action-fluent relationship using a Causal And-Or Graph (C-AOG). The occlusion status of an object might be caused by multiple actions, and we need to reason about the actual causality from videos. These actions are alternative choices that lead to the same occlusion status and form the Or-nodes. Each leaf node indicates an action or sub-event that can be described by And-nodes. Taking the videos shown in Fig. 1 for instance, the status "occluded" can be caused by the following actions: i) walking behind a vehicle; ii) walking behind a person; or iii) an inertial action that keeps the fluent unchanged.

The basic hypothesis of this model is that, for a particular scenario (e.g., a parking lot), there are only a limited number of actions that can cause the fluent to change. Given a video sequence, we need to create the optimal C-AOG and select the best choice for each Or-node in order to get the optimal causal parse graph, shown as red lines in Fig. 3.

We develop a probabilistic graph model to reason about an object's visibility fluent changes using the C-AOG representation. Our formulation integrates object tracking as well, to enable a joint solution of tracking and fluent change reasoning, which are mutually beneficial. In particular, for each subject of interest, our method uses two variables to represent i) the subject's positions in videos; and ii) the visibility status as well as the best causal parse graph. We utilize a Markov chain prior model to describe the transitions of these variables, i.e., the current state of a subject depends only on the previous state. We then reformulate the problem as an Integer Linear Programming model and utilize dynamic programming to search the optimal states over time. In evaluations, we apply the proposed method to a set of challenging sequences that include frequent human-vehicle or human-human interactions. Results show that our method can readily predict

the correct visibility statuses and correctly recover the complete trajectories. In contrast, most of the alternative trackers can only recover parts of the trajectories due to occlusion/containment.

Contributions. There are three major contributions of the proposed framework: i) a Causal And-Or Graph (C-AOG) model to represent object visibility fluents varying over time; ii) a joint probabilistic formulation for object tracking and fluent reasoning; and iii) a new occlusion reasoning dataset covering objects with diverse fluent changes.

2 Related Work

The proposed research is closely related to the following three research streams in computer vision and AI.

Multiple object tracking has been extensively studied in the last decades. In the past literature, tracking-by-detection has become the mainstream framework (Wen et al. 2014; Dehghan et al. 2015). Specifically, a general detector (Felzenszwalb et al. 2010; Ren et al. 2015) is first applied to obtain detection proposals, and then data association techniques (Berclaz et al. 2011; Dehghan, Assari, and Shah 2015; Yu et al. 2016) are employed to link detection proposals over time in order to obtain object trajectories. Our approach follows this pipeline as well but focuses on the reasoning of object visibility status.

Occlusion handling is perhaps the most fundamental problem in visual tracking and has been extensively studied in the literature. Existing methods can be roughly divided into three categories: i) using depth information (Ess et al. 2009; Yilmaz, Xin, and Shah 2004), ii) modeling partial occlusion relations (Tang, Andriluka, and Schiele 2014), and iii) modeling object appearance and disappearance globally (Wang et al. 2014; 2016; Maksai, Wang, and Fua 2016). These methods all make strong assumptions on appearance or motion cues and mainly deal with short-term occlusions. In contrast, this paper presents a principled way to explicitly reason about object visibility status during both short-term interactions, e.g., passing behind another object, and long-term interactions, e.g., entering a vehicle and moving together.

Causal-effect reasoning is a popular topic in AI but has not received much attention in the field of computer vision. It studies, for instance, the differences between co-occurrence and causality, and aims to learn causal knowledge automatically from low-level observations, e.g., images or videos. There are two popular causality models: Bayesian networks (Griffiths and Tenenbaum 2005; Pearl 2009) and grammar models (Griffiths and Tenenbaum 2007; Liang et al. 2016). Notably, Fire and Zhu (Fire and Zhu 2016) introduced a causal grammar to infer the causal-effect relationships between an object's status, e.g., door open/closed, and an agent's actions, e.g., pushing the door. They studied this problem using manually designed rules and video sequences in lab settings. In this work, we extend causal grammar models to infer objects' visibility fluents and ground the task on challenging videos from surveillance systems.


Figure 3: (a) The proposed Causal And-Or Graph (C-AOG) model for the visibility fluent. We use a C-AOG to represent the visibility status of a subject. Each Or-node indicates a possible choice and an arrow shows how the visibility fluent transits among states. (b) A series of atomic actions that could possibly cause visibility fluent changes. Each atomic action describes interactions among people and interacting objects. "P", "D", "T", "B" denote "person", "door", "trunk", "bag", respectively. The dashed triangle denotes a fluent. The corresponding fluent could be "visible", "occluded" or "contained" for a person; "open", "closed" or "occluded" for a vehicle door or trunk. See text for more details.

3 Representation

In this paper, we define three states for visibility fluent reasoning: visible, (partially/fully) occluded, and contained. Most multiple object tracking methods are based on the tracking-by-detection framework, which obtains good performance in visible and partially occluded situations. However, when full occlusions take place, these trackers usually regard the disappearing-and-reappearing objects as new objects. Although objects in the fully occluded and contained states may be invisible, there is still evidence for inferring the object's location and filling in the complete trajectory. We can distinguish an object being fully occluded from an object being contained based on three empirical observations.

Firstly, motion independence. In the fully occluded state, such as a person staying behind a pillar, the motion of the person is independent of the pillar. In the contained state, such as a person in a vehicle or a bag in a trunk, the position and motion of the person/bag are the same as those of the vehicle. It is therefore important to infer the visibility fluent of an object if we want to track objects accurately in a complex environment.

Secondly, coupling between actions and object fluent changes. For example, as illustrated in Fig. 2, if a person gets into a vehicle, the related sequential atomic actions are: approaching the vehicle, opening the vehicle door, getting into the vehicle, and closing the vehicle door; the related object fluent changes are vehicle door closed → open → closed. The fluent change is a consequence of agent actions. If the fluent-changing actions do not happen, the object should maintain its current fluent. For example, a person who is contained in a vehicle will remain contained unless he/she opens the vehicle door and gets out of the vehicle.

Thirdly, visibility in alternative camera views. In the fully occluded state, such as a person occluded by a pillar, although the person cannot be observed from the current viewpoint, he/she could be seen from other viewpoints; in the contained state, such as a person in a vehicle, the person cannot be seen from any viewpoint.

In this work, we mainly study the interactions of humans, and the developed methods can be extended to other objects (e.g., animals) as well.

3.1 Causal And-Or Graph

In this paper, we propose a Causal And-Or Graph (C-AOG) to represent the action-fluent relationship, as illustrated in Fig. 3(a). A C-AOG has two types of nodes: i) Or-nodes that represent variations or choices, and ii) And-nodes that capture the decomposition of the top-level entity. An arrow represents the causal relation between an action and a fluent transition. For example, we can use a C-AOG to expressively model a series of action-fluent relations.

The C-AOG is capable of representing multiple alternative causes of occlusion and potential transitions. There are four levels in our C-AOG: visibility fluents, possible states, state transitions and agent actions. Or-nodes represent alternative causes at the visibility fluent and state levels; that is, one fluent can have multiple states and one state can have multiple transitions. An event can be decomposed into possible atomic actions and represented by And-nodes, e.g., opening the vehicle door, opening the vehicle trunk, loading baggage.
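For concreteness, the sketch below shows one way to encode this four-level structure in Python; the node classes and action names are illustrative choices of ours, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)

class OrNode(Node):
    """Alternatives: a parse graph selects exactly one child."""

class AndNode(Node):
    """Decomposition: all children hold, e.g., an event's atomic actions."""

# Fluent -> states -> transitions -> actions, following Fig. 3(a).
visibility_fluent = OrNode("visibility", children=[
    OrNode("occluded", children=[
        AndNode("walk_behind_vehicle"),
        AndNode("walk_behind_person"),
        AndNode("inertial_action"),  # fluent remains unchanged
    ]),
    OrNode("contained", children=[
        AndNode("enter_vehicle", children=[
            Node("approach"), Node("open_door"),
            Node("get_in"), Node("close_door"),
        ]),
    ]),
    OrNode("visible"),
])
```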

Given a video sequence I with length T, we represent the scene R as

R = \{O_t : t = 1, 2, \dots, T\}, \quad O_t = \{o_t^i : i = 1, 2, \dots, N_t\}, \qquad (1)

where O_t denotes all the objects at time t and N_t is the size of O_t, i.e., the number of objects at time t. N_t is unknown and will be inferred from observations. Each object o_t^i is represented by its location l_t^i (i.e., a bounding box in the image) and appearance features \phi_t^i. To study the visibility fluent of a subject, we further incorporate a state variable s_t^i and an action label a_t^i, that is,

o_t^i = (l_t^i, \phi_t^i, s_t^i, a_t^i). \qquad (2)

Thus, the state of a subject is defined as

s_t^i \in S = \{\text{visible}, \text{occluded}, \text{contained}\}. \qquad (3)

We define a set of atomic actions \Omega = \{a_i : i = 1, \dots, N_a\} that could possibly change the visibility status, e.g., walking, opening a vehicle door. As illustrated in Fig. 3(b), a small set of actions \Omega covers the most common interactions among people and vehicles.

Our task is to jointly find subject locations in video frames and estimate their visibility fluents M from the video sequence I. Formally, we have

M = \{pg_t : t = 1, 2, \dots, T\}, \quad pg_t = \{o_t^i = (l_t^i, \phi_t^i, s_t^i, a_t^i) : i = 1, 2, \dots, N_t\}, \qquad (4)

where pg_t is determined by the optimal causal parse graph at time t.
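Read literally, Eqs. (1)-(4) amount to the following data layout; this is a minimal sketch with field names mirroring the paper's symbols (the concrete types are our assumption).

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple

class State(Enum):        # s_t^i in Eq. (3)
    VISIBLE = "visible"
    OCCLUDED = "occluded"
    CONTAINED = "contained"

@dataclass
class TrackedObject:      # o_t^i = (l, phi, s, a) in Eq. (2)
    l: Tuple[float, float, float, float]  # bounding box (x, y, w, h)
    phi: List[float]                      # appearance feature vector
    s: State                              # visibility state
    a: str                                # atomic action label from Omega

ParseGraph = List[TrackedObject]  # pg_t: the N_t objects at frame t
Solution = List[ParseGraph]       # M = {pg_t : t = 1..T} in Eq. (4)
```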

4 Problem Formulation

According to Bayes' rule, we can solve our joint object tracking and fluent reasoning problem by maximizing a posterior (MAP):

M^* = \arg\max_M p(M|I; \theta) \propto \arg\max_M p(I|M; \theta)\, p(M; \theta) = \arg\max_M \frac{1}{Z} \exp\{-E(M; \theta) - E(I|M; \theta)\}. \qquad (5)

The prior term E(M; \theta) measures the temporal consistency between successive parse graphs. Assuming G is a Markov chain structure, we can decompose E(M; \theta) as

E(M; \theta) = \sum_{t=1}^{T-1} E(pg_{t+1} | pg_t) = \sum_{t=1}^{T-1} \sum_{i=1}^{N_t} \Phi(l_{t+1}^i, l_t^i, s_t^i) + \Psi(s_{t+1}^i, s_t^i, a_t^i). \qquad (6)

The first term \Phi(\cdot) measures the location displacement. It calculates the transition distance between two successive frames and is defined as:

\Phi(l_{t+1}^i, l_t^i, s_t^i) = \begin{cases} \delta(D_s(l_{t+1}^i, l_t^i) > \tau_s), & s_t^i = \text{visible}, \\ 1, & s_t^i \in \{\text{occluded}, \text{contained}\}, \end{cases} \qquad (7)

where D_s(\cdot, \cdot) is the Euclidean distance between two locations, \tau_s is the speed threshold and \delta(\cdot) is an indicator function. The location displacement term measures the motion consistency of an object in successive frames.
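A direct transcription of Eq. (7), assuming 2D box centers as locations; the value of tau_s here is purely illustrative.

```python
import math

def phi_displacement(l_next, l_cur, state, tau_s=50.0):
    """Location displacement energy Phi of Eq. (7).

    l_next, l_cur: (x, y) locations in successive frames.
    tau_s is the speed threshold (the default is an assumed value).
    """
    if state == "visible":
        # indicator delta(D_s(l_next, l_cur) > tau_s)
        return 1.0 if math.dist(l_next, l_cur) > tau_s else 0.0
    # occluded or contained: constant energy, motion cue unreliable
    return 1.0
```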

The second term \Psi(\cdot) measures the state transition energy and is defined as:

\Psi(s_{t+1}^i, s_t^i, a_t^i) = -\log p(s_{t+1}^i | s_t^i, a_t^i), \qquad (8)


Figure 4: An illustration of the Hierarchical And-Or Graph. The vehicle is decomposed into different views, semantic parts and fluents. Some detection results are drawn below, with different colored bounding boxes denoting different vehicle parts, and solid/dashed boxes denoting the states "closed"/"open".

where p(s_{t+1}^i | s_t^i, a_t^i) is the action-state transition probability, which can be learned from the training data.
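The paper does not spell out the estimator; one plausible reading is maximum-likelihood counting over annotated training triples, e.g. with add-one smoothing (the smoothing is our assumption).

```python
import math
from collections import Counter

STATES = ("visible", "occluded", "contained")

def learn_transition_table(triples):
    """Estimate p(s_next | s, a) from (s, a, s_next) training triples."""
    counts = Counter(triples)
    table = {}
    for s, a, _ in counts:
        total = sum(counts[(s, a, x)] for x in STATES)
        for x in STATES:
            table[(s, a, x)] = (counts[(s, a, x)] + 1) / (total + len(STATES))
    return table

def psi(s_next, s, a, table):
    """State transition energy Psi = -log p(s_next | s, a), Eq. (8)."""
    return -math.log(table.get((s, a, s_next), 1e-6))
```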

The likelihood term E(I|M; \theta) measures how well each parse graph explains the data, and can be decomposed as

E(I|M; \theta) = \sum_{t=1}^{T} E(I_t | pg_t) = \sum_{t=1}^{T} \sum_{i=1}^{N_t} \Upsilon(l_t^i, \phi_t^i, s_t^i) + \Gamma(l_t^i, \phi_t^i, a_t^i), \qquad (9)

where \Upsilon(\cdot) measures the likelihood between data and object fluents and \Gamma(\cdot) measures the likelihood between data and object actions. Given each object o_t^i, the energy function \Upsilon(\cdot) is defined as:

\Upsilon(l_t^i, \phi_t^i, s_t^i) = \begin{cases} 1 - h_o(l_t^i, \phi_t^i), & s_t^i = \text{visible}, \\ \sigma(D_\varsigma(\varsigma_1^i, \varsigma_2^i)), & s_t^i = \text{occluded}, \\ 1 - h_c(l_t^i, \phi_t^i), & s_t^i = \text{contained}, \end{cases} \qquad (10)

where h_o(\cdot) denotes the object detection score, h_c(\cdot) denotes the container (i.e., vehicle) detection score and \sigma(\cdot) is the sigmoid function. When an object is in either the visible or the contained state, appearance information can describe the probability of the existence of the object itself, or of the object containing it (i.e., the container), at this location. When an object is occluded, there is no visual evidence to determine its state. Therefore, we utilize temporal information to generate candidate locations. We employ the SSP algorithm (Pirsiavash, Ramanan, and Fowlkes 2011) to generate trajectory fragments (i.e., tracklets). The candidate locations are identified as misses in complete object trajectories. The energy is thus defined as the cost of generating a virtual trajectory at this location. We compute this energy from the visual discrepancy between a neighboring tracklet \varsigma_1^i before this moment and a neighboring tracklet \varsigma_2^i after this moment. The appearance descriptor of a tracklet is computed as the average pooling of image descriptors over time. If the distance is below a threshold \tau_\varsigma, a virtual path is generated to connect these two tracklets using B-spline fitting.
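The linking test described above might look as follows; the cosine distance is our assumption, since the paper specifies only average pooling of descriptors and a threshold test against \tau_\varsigma.

```python
import numpy as np

def tracklet_descriptor(frame_features):
    """Average pooling of per-frame appearance descriptors over time."""
    return np.mean(np.stack(frame_features), axis=0)

def should_generate_virtual_path(tracklet_before, tracklet_after, tau=0.8):
    """Compare the tracklets flanking a detection gap (Eq. (10), occluded case)."""
    a = tracklet_descriptor(tracklet_before)
    b = tracklet_descriptor(tracklet_after)
    dist = 1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
    # if the appearance discrepancy is small, bridge the gap with a B-spline
    return dist < tau
```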

The term \Gamma(l_t^i, \phi_t^i, a_t^i) is defined over the object actions observed from data. In this work, we study the fluents of humans and vehicles, that is,

\Gamma(l_t^i, \phi_t^i, a_t^i) = \sigma(D_h(l_t^i, \phi_t^i | a_t^i)) + \sigma(D_v(l_t^i, \phi_t^i | a_t^i)), \qquad (11)

where \sigma(\cdot) is the sigmoid function. The definitions of the two data-likelihood terms D_h and D_v are introduced in the rest of this section.

A human is represented by his/her skeleton, which consists of multiple joints estimated by sequential prediction technology (Wei et al. 2015). The feature of each joint is defined as the relative distances of this joint to four saddle points (the two shoulders, the center of the body, and the middle point between the two hipbones). The relative distances are normalized by the head length to eliminate the influence of scale. A feature vector \omega_t^h concatenating the features of all joints is extracted, which is assumed to follow a Gaussian distribution:

D_h(l_t^i, \phi_t^i | a_t^i) = -\log \mathcal{N}(\omega_t^h; \mu_{a_t^i}, \Sigma_{a_t^i}), \qquad (12)

where \mu_{a_t^i} and \Sigma_{a_t^i} are the mean and covariance of the action a_t^i, respectively, which are obtained from the training data.

A vehicle is described by its viewpoint, semantic vehicle parts, and vehicle part fluents. The vehicle fluent is represented by a Hierarchical And-Or Graph, as illustrated in Fig. 4. The feature vector of the vehicle fluent \omega^v is obtained by computing fluent scores on each vehicle part and concatenating them together. We compute the average pooling feature \varpi_{a_i} for each action a_i over the training data as the vehicle fluent template. Given the vehicle fluent \omega_t^v computed on image I_t, the distance D_v(l_t^i, \phi_t^i | a_t^i) is defined as

D_v(l_t^i, \phi_t^i | a_t^i) = \|\omega_t^v - \varpi_{a_t^i}\|_2. \qquad (13)

5 Inference

We cast the intractable optimization of Eq. (5) as an Integer Linear Formulation (ILF) in order to derive a scalable and efficient inference algorithm. We denote by V the candidate object locations and by E the edges between all pairs of nodes that are temporally consecutive and spatially close. The whole transition graph G = (V, E) is shown in Fig. 5. The energy function of Eq. (5) can then be rewritten as:

f^* = \arg\max_f \sum_{mn \in E_o} c_{mn} f_{mn},
c_{mn} = -\Phi(l_n, l_m, s_m) - \Psi(s_n, s_m, a_m) - \Upsilon(l_m, \phi_m, s_m) - \Gamma(l_m, \phi_m, a_m),
\text{s.t.}\ f_{mn} \in \{0, 1\}, \quad \sum_m f_{mn} \le 1, \quad \sum_m f_{mn} = \sum_k f_{nk}, \qquad (14)

where f_{mn} indicates whether an object moves from node V_m to node V_n and c_{mn} is the corresponding cost.

Since the subject of interest can only enter a nearby container (e.g., a vehicle), to discover the optimal causal parse graph we need to jointly track the container and the subject of interest. Similar to Eq. (14), the energy function of the container is as follows:

g^* = \arg\max_g \sum_{mn \in E_c} d_{mn} g_{mn}, \quad d_{mn} = h_c(l_m, \phi_m) - 1,
\text{s.t.}\ g_{mn} \in \{0, 1\}, \quad \sum_m g_{mn} \le 1, \quad \sum_m g_{mn} = \sum_k g_{nk}, \qquad (15)

where h_c(l_m, \phi_m) is the container detection score at location l_m. We then add the containment constraints:

\sum_{mn \in E_c} g_{mn} \ge \sum_{ij \in E_o} f_{ij}, \quad \text{s.t.}\ t_n = t_j, \ \|l_n - l_j\|_2 < \tau_c, \qquad (16)

where \tau_c is the distance threshold. Finally, we combine Eqs. (14)-(16):

f^*, g^* = \arg\max_{f, g} \sum_{mn \in E_o} c_{mn} f_{mn} + \sum_{mn \in E_c} d_{mn} g_{mn},
\text{s.t.}\ f_{mn} \in \{0, 1\}, \quad \sum_m f_{mn} \le 1, \quad \sum_m f_{mn} = \sum_k f_{nk},
g_{mn} \in \{0, 1\}, \quad \sum_m g_{mn} \le 1, \quad \sum_m g_{mn} = \sum_k g_{nk},
t_n = t_j, \quad \|l_n - l_j\|_2 < \tau_c. \qquad (17)

This is the objective function we want to optimize. As illustrated in Fig. 5, the reformulated graph structure is a directed acyclic graph (DAG). Thus we can adopt dynamic programming to efficiently search for the optimal solution of Eq. (17).
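To illustrate the search principle (not the full coupled objective of Eq. (17)), here is a sketch of a dynamic program on the layered transition DAG of Fig. 5, where nodes are (time, location, state) hypotheses and edge scores play the role of c_mn; the data layout is our assumption.

```python
def best_path(nodes_by_time, successors, score):
    """Highest-scoring path through a layered DAG.

    nodes_by_time: list of node lists, one per frame, in temporal order.
    successors[m]: nodes reachable from node m.
    score[(m, n)]: edge score (the role of c_mn in Eq. (14)).
    """
    best = {n: (0.0, None) for n in nodes_by_time[0]}  # score, back-pointer
    for layer in nodes_by_time[:-1]:
        for m in layer:
            if m not in best:
                continue
            for n in successors.get(m, ()):
                cand = best[m][0] + score[(m, n)]
                if n not in best or cand > best[n][0]:
                    best[n] = (cand, m)
    # trace back from the best terminal node
    end = max(nodes_by_time[-1],
              key=lambda n: best.get(n, (float("-inf"), None))[0])
    path = [end]
    while best[path[-1]][1] is not None:
        path.append(best[path[-1]][1])
    return path[::-1]
```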

6 Experiments

We apply the proposed method to two multiple object tracking datasets and evaluate the improvement in visual tracking brought by the outcomes of visibility status reasoning.

6.1 Implementation Details

For person and suitcase detection, we utilize the Faster R-CNN model (Ren et al. 2015) trained on the MS COCO dataset. The network used is VGG-16.


Figure 5: Transition graph utilized to formulate the integer linear programming. Each node m has its location l_m, state s_m, and time instant t_m. Black solid arrows indicate possible transitions within the same state. Red dashed arrows indicate possible transitions between different states.

The detection score threshold is set to 0.4 and the NMS threshold to 0.3. The tracklet similarity threshold \tau_\varsigma is set to 0.8. The containment distance threshold \tau_c is set to the width of the container. For the appearance descriptors \phi, we employ the densely sampled ColorNames descriptor (Zheng et al. 2015), which applies the square root operator (Arandjelovic and Zisserman 2012) and Bag-of-Words encoding to the original ColorNames descriptors. For human skeleton estimation, we use the public implementation of (Wei et al. 2015). For vehicle detection and semantic part status estimation, we use the implementation provided by (Li et al. 2016) with the default parameters mentioned in their paper.

We adopt the widely used CLEAR metrics (Kasturi et al. 2009) to measure the performance of tracking methods. They include four metrics, i.e., Multiple Object Detection Accuracy (MODA), Multiple Object Detection Precision (MODP), Multiple Object Tracking Accuracy (MOTA) and Multiple Object Tracking Precision (MOTP), which take into account three kinds of tracking errors: false positives, false negatives and identity switches. We also report the number of false positives (FP), false negatives (FN), identity switches (IDS) and fragments (Frag). Higher values are better for MOTA and MOTP, while lower values are better for FP, FN, IDS and Frag. If the Intersection-over-Union (IoU) ratio of a tracking result to the ground truth is above 0.5, we accept the tracking result as a correct hit.
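The hit criterion reduces to a standard IoU computation on (x, y, w, h) boxes:

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

# A tracking result counts as a correct hit when iou(result, gt) > 0.5.
assert iou((0, 0, 10, 10), (0, 0, 10, 10)) == 1.0
```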

6.2 Datasets

People-Car dataset (Wang et al. 2014)¹. This dataset consists of sequences captured in a parking lot by two synchronized bird-view cameras, with lengths of 300-5100 frames. In this dataset, there are many instances of people getting in and out of cars. This dataset is challenging due to frequent interactions, light variation and low object resolution.

Tracking Interacting Objects (TIO) dataset. In current popular multiple object tracking datasets (e.g., PETS09 (Ferryman and Shahrokni 2009), the KITTI dataset (Geiger, Lenz, and Urtasun 2012)), most tracked objects are pedestrians and there are no evident interaction-induced visibility fluent changes. Thus we collect two new scenarios with typical human-object interactions, involving persons, suitcases, and vehicles, in several places.

¹Available at cvlab.epfl.ch/research/surv/interacting-objects

        Method    MODA   FP     FN     IDS
Seq.0   Our-full  0.69   0.18   0.08   0.05
        Our-1     0.62   0.21   0.12   0.05
        TIF-MIP   0.67   0.07   0.25   0.04
        KSP       0.49   0.10   0.41   0.07
        LP2D      0.47   0.05   0.48   0.06
        POM       0.47   0.06   0.47   N/A
        SSP       0.20   0.04   0.76   0.04
Seq.1   Our-full  0.61   0.23   0.12   0.04
        Our-1     0.53   0.29   0.14   0.04
        TIF-MIP   0.58   0.17   0.25   0.04
        KSP       0.04   0.71   0.25   0.12
        LP2D      0.02   0.77   0.21   0.17
        POM      -0.21   0.98   0.23   N/A
        SSP       0.00   0.75   0.25   0.12

Table 1: Quantitative results and comparisons of MODA, false positive (FP) rate, false negative (FN) rate and identity switch (IDS) rate on the People-Car dataset.


Plaza. We capture 22 video sequences in a plaza that describe people walking around and getting in/out of vehicles.

ParkingLot. We capture 3 video sequences in a parking lot that show vehicles entering/exiting the parking lot, people getting in/out of vehicles, and people interacting with trunks/suitcases.

All the sequences are captured by a GoPro camera at a frame rate of 30 fps and a resolution of 1920×1080. The total number of frames in the TIO dataset is more than 30K. There exist severe occlusions and large scale changes, making this dataset very challenging for traditional tracking methods.

Besides the above testing data, we collect another set of video clips for training. To avoid over-fitting, we use different camera positions, people and vehicles from the testing settings. The training data consists of 380 video clips covering 9 events: walking, opening vehicle door, entering vehicle, exiting vehicle, closing vehicle door, opening vehicle trunk, loading baggage, unloading baggage, and closing vehicle trunk. Each action category contains 42 video clips on average.

Both datasets and the short clips are annotated with bounding boxes for people, suitcases and vehicles, and with the visibility fluents of people and suitcases. The types of status are "visible", "occluded", and "contained". We utilize VATIC (Vondrick, Patterson, and Ramanan 2013) to annotate the videos.

6.3 Results and Comparisons

For the People-Car dataset, we compare our proposed method with 5 state-of-the-art methods: the successive shortest path algorithm (SSP) (Pirsiavash, Ramanan, and Fowlkes 2011), the K-Shortest Paths algorithm (KSP) (Berclaz et al. 2011), the Probability Occupancy Map (POM) (Fleuret et al. 2008), Linear Programming (LP2D) (Leal-Taixe et al. 2014), and Tracklet-Based Intertwined Flows (TIF-MIP) (Wang et al. 2016).


Plaza       MOTA    MOTP    FP     FN     IDS   Frag
Our-full    46.0%   76.4%   99     501    5     8
Our-1       32.5%   75.3%   75     605    25    30
MHT_D       34.3%   73.8%   56     661    15    18
MDP         32.9%   73.2%   24     656    9     7
DCEM        32.3%   76.5%   2      675    2     2
SSP         31.7%   72.1%   19     678    21    25
DCO         29.5%   76.4%   22     673    6     2
JPDA_m      13.5%   72.2%   163    673    6     3

ParkingLot  MOTA    MOTP    FP     FN     IDS   Frag
Our-full    38.6%   78.6%   418    1954   6     5
Our-1       28.9%   78.4%   544    2203   14    16
MDP         30.1%   76.4%   397    2296   26    22
DCEM        29.4%   77.5%   383    2346   16    15
SSP         28.9%   75.0%   416    2337   12    14
MHT_D       25.6%   75.7%   720    2170   15    12
DCO         24.3%   78.1%   536    2367   38    10
JPDA_m      12.3%   74.2%   1173   2263   28    17

Table 2: Quantitative results and comparisons of MOTA, MOTP, false positives (FP), false negatives (FN), identity switches (IDS), and fragments (Frag) on the TIO dataset.

The quantitative results are reported in Table 1. From the results, the proposed method obtains better performance than the baseline methods.

For the TIO dataset, we compare the proposed method with 6 state-of-the-art methods: the successive shortest path algorithm (SSP) (Pirsiavash, Ramanan, and Fowlkes 2011), multiple hypothesis tracking with a distinctive appearance model (MHT_D) (Kim et al. 2015), Markov Decision Processes with reinforcement learning (MDP) (Xiang, Alahi, and Savarese 2015), Discrete-Continuous Energy Minimization (DCEM) (Milan, Schindler, and Roth 2016), Discrete-Continuous Optimization (DCO) (Andriyenko, Schindler, and Roth 2012) and Joint Probabilistic Data Association (JPDA_m) (Hamid Rezatofighi et al. 2015). We use the public implementations of these methods. We set up a baseline, Our-1, to analyze the effectiveness of different components: Our-1 only utilizes the human data-likelihood term, while Our-full utilizes both the human and vehicle data-likelihood terms. Note that we also utilize the estimated human joint key-points to refine the detected people bounding boxes.

We report quantitative results and comparisons on the TIO dataset in Table 2. From the results, we can observe that our method obtains performance superior to the other methods on most metrics. This validates that the proposed method can not only track visible objects correctly, but also reason about the locations of occluded or contained objects. The alternative methods do not work well, mainly due to the lack of the ability to track objects under long-term occlusion or containment in other objects. From the comparison of Our-1 and Our-full, we can also conclude that each type of fluent plays its role in improving the final tracking results. Some qualitative results are visualized in Fig. 6.

Empirical Studies. We further report fluent estimation results on the TIO-Plaza and TIO-ParkingLot sequences in Fig. 8.

Figure 6: Sampled qualitative results of our proposed method on the TIO dataset (TIO-Plaza, TIO-ParkingLot) and the People-Car dataset. Each color represents an object. A solid bounding box means the object is visible. A dashed bounding box denotes that the object is contained by another scene entity. Best viewed in color and zoomed in.

Figure 7: Sampled failure cases. The first row shows a person entering a vehicle; the second row shows a person exiting a vehicle.

Figure 8: Quantitative results and comparisons of visibility fluent estimation accuracy (Our-full vs. Our-1, over the visible, occluded, and contained states) on the TIO dataset.

From the results, we can see that our method can successfully reason about the visibility status of subjects. Note that the precision of containment estimation is not high, since some people get in/out of the vehicle from the side opposite to the camera, as shown in Fig. 7. In such situations, a multi-view setting might be a better way to reduce the ambiguities.

7 Conclusion

In this paper, we propose a Causal And-Or Graph model to represent the causal-effect relations between object visibility fluents and various human interactions. By jointly modeling short-term occlusions and long-term occlusions, our method can explicitly reason about the visibility of subjects as well as their locations in videos. Our method clearly outperforms the alternative methods in complicated scenarios with frequent human interactions. In this work, we focus on human interactions as a running case of the proposed technique, which can be easily extended to other types of objects (e.g., animals, drones).

References

Andriyenko, A.; Schindler, K.; and Roth, S. 2012. Discrete-continuous optimization for multi-target tracking. In IEEE Conference on Computer Vision and Pattern Recognition.
Arandjelovic, R., and Zisserman, A. 2012. Three things everyone should know to improve object retrieval. In IEEE Conference on Computer Vision and Pattern Recognition.
Berclaz, J.; Fleuret, F.; Turetken, E.; and Fua, P. 2011. Multiple object tracking using k-shortest paths optimization. IEEE Transactions on Pattern Analysis and Machine Intelligence 33(9):1806-1819.
Dehghan, A.; Assari, S.; and Shah, M. 2015. GMMCP-Tracker: Globally optimal generalized maximum multi clique problem for multiple object tracking. In IEEE Conference on Computer Vision and Pattern Recognition.
Dehghan, A.; Tian, Y.; Torr, P.; and Shah, M. 2015. Target identity-aware network flow for online multiple target tracking. In IEEE Conference on Computer Vision and Pattern Recognition.
Ess, A.; Schindler, K.; Leibe, B.; and Gool, L. V. 2009. Improved multi-person tracking with active occlusion handling. In IEEE ICRA Workshop.
Felzenszwalb, P.; Girshick, R. B.; McAllester, D.; and Ramanan, D. 2010. Object detection with discriminatively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence 32(9):1627-1645.
Ferryman, J., and Shahrokni, A. 2009. PETS2009: Dataset and challenge. In IEEE International Workshop on Performance Evaluation of Tracking and Surveillance.
Fire, A., and Zhu, S.-C. 2016. Learning perceptual causality from video. ACM Transactions on Intelligent Systems and Technology 7(2).
Fleuret, F.; Berclaz, J.; Lengagne, R.; and Fua, P. 2008. Multicamera people tracking with a probabilistic occupancy map. IEEE Transactions on Pattern Analysis and Machine Intelligence 30(2):267-282.
Geiger, A.; Lenz, P.; and Urtasun, R. 2012. Are we ready for autonomous driving? The KITTI vision benchmark suite. In IEEE Conference on Computer Vision and Pattern Recognition.
Griffiths, T., and Tenenbaum, J. 2005. Structure and strength in causal induction. Cognitive Psychology 51(4):334-384.
Griffiths, T., and Tenenbaum, J. 2007. Two proposals for causal grammars. Causal Learning: Psychology, Philosophy, and Computation 323-345.
Hamid Rezatofighi, S.; Milan, A.; Zhang, Z.; Shi, Q.; Dick, A.; and Reid, I. 2015. Joint probabilistic data association revisited. In IEEE International Conference on Computer Vision.
Kasturi, R.; Goldgof, D.; Soundararajan, P.; Manohar, V.; Garofolo, J.; Bowers, R.; Boonstra, M.; Korzhova, V.; and Zhang, J. 2009. Framework for performance evaluation of face, text, and vehicle detection and tracking in video: Data, metrics, and protocol. IEEE Transactions on Pattern Analysis and Machine Intelligence 31(2):319-336.
Kim, C.; Li, F.; Ciptadi, A.; and Rehg, J. M. 2015. Multiple hypothesis tracking revisited. In IEEE International Conference on Computer Vision.
Leal-Taixe, L.; Fenzi, M.; Kuznetsova, A.; Rosenhahn, B.; and Savarese, S. 2014. Learning an image-based motion context for multiple people tracking. In IEEE Conference on Computer Vision and Pattern Recognition.
Li, B.; Wu, T.; Xiong, C.; and Zhu, S.-C. 2016. Recognizing car fluents from video. In IEEE Conference on Computer Vision and Pattern Recognition.
Liang, W.; Zhao, Y.; Zhu, Y.; and Zhu, S.-C. 2016. What are where: Inferring containment relations from videos. In International Joint Conference on Artificial Intelligence.
Maksai, A.; Wang, X.; and Fua, P. 2016. What players do with the ball: A physically constrained interaction modeling. In IEEE Conference on Computer Vision and Pattern Recognition.
Milan, A.; Schindler, K.; and Roth, S. 2016. Multi-target tracking by discrete-continuous energy minimization. IEEE Transactions on Pattern Analysis and Machine Intelligence 38(10):2054-2068.
Mueller, E. T. 2014. Commonsense Reasoning: An Event Calculus Based Approach. Morgan Kaufmann.
Pearl, J. 2009. Causality: Models, Reasoning and Inference. Cambridge University Press.
Pirsiavash, H.; Ramanan, D.; and Fowlkes, C. C. 2011. Globally-optimal greedy algorithms for tracking a variable number of objects. In IEEE Conference on Computer Vision and Pattern Recognition.
Ren, S.; He, K.; Girshick, R.; and Sun, J. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Conference on Neural Information Processing Systems.
Tang, S.; Andriluka, M.; and Schiele, B. 2014. Detection and tracking of occluded people. International Journal of Computer Vision 110(1):58-69.
Vondrick, C.; Patterson, D.; and Ramanan, D. 2013. Efficiently scaling up crowdsourced video annotation. International Journal of Computer Vision 101(1):184-204.
Wang, X.; Turetken, E.; Fleuret, F.; and Fua, P. 2014. Tracking interacting objects optimally using integer programming. In European Conference on Computer Vision.
Wang, X.; Turetken, E.; Fleuret, F.; and Fua, P. 2016. Tracking interacting objects using intertwined flows. IEEE Transactions on Pattern Analysis and Machine Intelligence 38(11):2312-2326.
Wei, S.-E.; Ramakrishna, V.; Kanade, T.; and Sheikh, Y. 2015. Convolutional pose machines. In IEEE Conference on Computer Vision and Pattern Recognition.
Wen, L.; Li, W.; Yan, J.; and Lei, Z. 2014. Multiple target tracking based on undirected hierarchical relation hypergraph. In IEEE Conference on Computer Vision and Pattern Recognition.
Xiang, Y.; Alahi, A.; and Savarese, S. 2015. Learning to track: Online multi-object tracking by decision making. In IEEE International Conference on Computer Vision.


Yilmaz, A.; Xin, L.; and Shah, M. 2004. Contour based object tracking with occlusion handling in video acquired using mobile cameras. IEEE Transactions on Pattern Analysis and Machine Intelligence 26(11):1532-1536.
Yu, S.-I.; Meng, D.; Zuo, W.; and Hauptmann, A. 2016. The solution path algorithm for identity-aware multi-object tracking. In IEEE Conference on Computer Vision and Pattern Recognition.
Zheng, L.; Shen, L.; Tian, L.; Wang, S.; Wang, J.; and Tian, Q. 2015. Scalable person re-identification: A benchmark. In IEEE International Conference on Computer Vision.