
Video Labeling for Automatic Video Surveillance in Security Domains

Elizabeth Bondi, Debarun Kar, Venil Noronha, Donnabell Dmello, Milind Tambe
University of Southern California
Los Angeles, CA
bondi, dkar, vnoronha, ddmello, [email protected]

Fei Fang
Carnegie Mellon University
Pittsburgh, PA
[email protected]

Arvind Iyer, Robert Hannaford
AirShepherd
Berkeley Springs, WV
[email protected], [email protected]

ABSTRACT

Beyond traditional security methods, unmanned aerial vehicles (UAVs) have become an important surveillance tool used in security domains to collect the required annotated data. However, collecting annotated data from videos taken by UAVs efficiently, and using these data to build datasets that can be used for learning payoffs or adversary behaviors in game-theoretic approaches and security applications, is an under-explored research question. This paper presents VIOLA, a novel labeling application that includes (i) a workload distribution framework to efficiently gather human labels from videos in a secured manner; (ii) a software interface with features designed for labeling videos taken by UAVs in the domain of wildlife security. We also present the evolution of VIOLA and analyze how the changes made in the development process relate to the efficiency of labeling, including when seemingly obvious improvements did not lead to increased efficiency. VIOLA enables collecting massive amounts of data with detailed information from challenging security videos such as those collected aboard UAVs for wildlife security. VIOLA will lead to the development of new approaches that integrate deep learning for real-time detection and response.

1. INTRODUCTION

With the recent use of unmanned aerial vehicle (UAV) technology in security domains, videos taken by UAVs have become an emerging source of massive data [9], especially in the domain of wildlife protection. For example, in security games, detecting wildlife from UAV videos can help estimate the animal distribution density, which determines the payoff structure of a security game. Detecting poachers and their movement patterns could also lead to successful learning of attackers' behavioral models, which is an important topic in security games [18, 10]. In addition, data collected from UAVs can enable the development of a new generation of game-theoretic tools for security. In particular, the data can be used to train or fine-tune a deep neural network to automatically detect attackers from the video taken by the UAVs in real time.

Bloomberg Data for Good Exchange Conference. 24-Sep-2017, Chicago, IL, USA.

Unfortunately, collecting labeled data from videos taken by UAVs can be a labor-intensive, time-consuming task. To our knowledge, there is no existing application that focuses on assisting the labeling work for videos taken by UAVs in security domains. Existing applications for labeling images [6, 7] cannot be directly applied to labeling videos, as treating each frame as a separate image can lead to inefficiency, since it does not exploit the correlation between frames. Video labeling applications such as VATIC [26] attempt to choose key frames for labeling, or track objects through the video. However, in UAV videos with camera motion, possibly collected using a different wavelength, these methods may not apply and may lead to inaccurate results or extra work for labelers, since the position of the objects in the video may change abruptly and the lack of color bands makes the tracking much more difficult. Furthermore, these applications are often paired with Amazon Mechanical Turk (AMT) to get labeled video datasets from online workers. However, in a security domain with sensitive data, meaning data that would provide attackers with some knowledge of defenders' strategies should it be shared, it may be undesirable to use AMT. This would then require finding labelers and organizing labeling assignments.

In this paper, we focus on better collection of labeled data from UAVs to provide input for game-theoretic approaches for security, and in particular to security game applications for wildlife conservation such as PAWS [8]. There has been work on labeling tools in domains such as computer vision and cyber security [6, 5], but there exists no work on labeling tools for game-theoretic approaches.

In particular, we will focus on labeling videos taken by long wave thermal infrared (hereafter referred to as thermal infrared) cameras installed on UAVs, in the domain of wildlife security. We present VIOLA (VIdeO Labeling Application), a novel application that assists labeling objects of interest such as wildlife and poachers. VIOLA includes a workload distribution framework to efficiently gather human labels from videos in a secured manner. We distribute the work of labeling the videos and reviewing the labels amongst a small group of labelers to ensure efficiency and data security. VIOLA also provides an easy-to-use interface, with a set of features designed for UAV videos in the wildlife security domain, such as allowing for moving multiple bounding boxes simultaneously and tracking bright regions automatically. We will also discuss the various stages of development to create VIOLA, and we will analyze the impact of different labeling procedures and versions of the labeling application on efficiency, with a particular emphasis on the surprising results that showed some changes did not improve efficiency.

2. RELATED WORK

Game-theoretic approaches have been widely used in infrastructure and green security domains [25]. In green security domains such as protecting wildlife from poaching, multiple research efforts in artificial intelligence and conservation biology have attempted to estimate wildlife distribution and poacher activities [8]; such efforts often rely on months or years of recorded data [17, 11]. With the recent advances in unmanned aerial vehicle (UAV) technology, there is an opportunity to provide detailed data about wildlife and poachers for game-theoretic approaches. Since a poacher is rewarded for successfully poaching wildlife, the wildlife distribution determines the payoff structure of the game. Poachers' behavioral models can be inferred from poaching activities and be used to design better patrol strategies with game-theoretic reasoning. In addition, game-theoretic patrolling with alarm systems [1, 4] has been studied. UAVs can provide input for such systems in real time using computer vision, particularly by detecting attackers or suspicious human beings in the UAV videos.

Detecting attackers in the UAV videos is related to object detection. Recently, great progress has been achieved in computer vision by deep learning in object detection and recognition [23, 22]. However, state-of-the-art detectors cannot be directly applied to our aerial videos because most methods focus on detection in high resolution, visible spectrum images. An alternative approach to this detection is to track moving objects throughout videos. Tracking of both single and multiple objects in videos has been studied extensively [28]. These methods also rely on high resolution visible spectrum videos. Single object trackers use discriminant features from high resolution videos to establish correspondences [12]. Much of multi-object tracking research is directed towards pedestrians [3, 29, 15], and primarily focuses on visible spectrum videos with high resolution, or videos taken from a fixed camera (except [15]).

Simpler and more general tracking algorithms exist that do not necessarily have these dependencies, such as the Lucas-Kanade tracker for optical flow [13], popular in the OpenCV package, and general correlation-based tracking [14]. Small moving objects can also be detected by a background subtraction method after applying video stabilization [20]. Because these methods are more general, they are applicable to our domain and were explicitly tested, but they still did not perform well in many cases. For example, since the video stabilization and background subtraction method assumes a planar surface, there were many noisy detections in the case of more complex terrain. Instead of using tracking for detection, we therefore decided to focus on deep learning.
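For reference, below is a minimal sketch of how the pyramidal Lucas-Kanade tracker mentioned above is typically invoked through OpenCV; the frame filenames and parameter values are illustrative only, not those used in our tests.

import cv2

# Two consecutive grayscale frames (hypothetical filenames).
prev = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)
curr = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)

# Pick corner-like feature points in the previous frame.
pts = cv2.goodFeaturesToTrack(prev, maxCorners=100, qualityLevel=0.3,
                              minDistance=7)

if pts is not None:  # low-contrast thermal frames may yield no corners
    # Follow the points into the current frame (pyramidal Lucas-Kanade).
    new_pts, status, err = cv2.calcOpticalFlowPyrLK(
        prev, curr, pts, None, winSize=(15, 15), maxLevel=2)
    moved = new_pts[status.flatten() == 1]  # successfully tracked points

On low-resolution thermal imagery with few stable corners, the feature detector may return few or no points, which is consistent with the difficulties reported above.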

In order to use deep learning-based detection methods with aerial, thermal infrared data, hand-labeled training data are required to fine-tune the networks or even train them from scratch. In addition to video labeling applications such as VATIC [26], there has been work on semi-automatic labeling [27] and label propagation [2], which combines the effort of human labelers and algorithms to speed up the labeling process for videos. This work often focuses on how to select the frames for human labelers to label and how to propagate the labels for the remaining frames. This is difficult for our domain because of the motion of UAVs, and because it is often hard for humans to tell which objects are of interest without seeing the object's motion. As a result, we sought to develop our own labeling application, VIOLA. The first key component of the application is a workload distribution framework. A common framework for image and video labeling is a majority voting framework [16, 21, 19, 24]. VIOLA uses a framework based upon [7] to efficiently gather labels from a small group of labelers. We examine the framework further in Sec. 5 and Sec. 6.

3. DOMAIN

There has recently been increased use of UAVs for security surveillance. UAVs are able to cover more ground than a stationary camera and can provide the defenders more advanced notice of a potential threat. To detect suspicious human activities at night, the UAVs can be equipped with thermal infrared cameras. This is the type of UAV video we deal with in our domain, since poaching often occurs at night. We will specifically be able to use these types of data to detect poachers and provide advanced notice to park rangers, and use these detections to provide input for patrol generation tools such as PAWS.

In order to accomplish this, we need labeled data from the thermal infrared, UAV videos in the form of rectangular "bounding boxes" for objects of interest (animals and poachers) in each frame, with a color corresponding to their classification. However, the movement of UAVs and the thermal infrared images make it extremely difficult to label videos in this domain. First, thermal infrared cameras are low-resolution, and typically show warmer objects as brighter pixels in the image, although the polarity could be reversed occasionally. Different phenomena could also cause brighter pixels without a warm object. For example, the ground warms during the day, and then emits heat at night, which can be reflected under a tree canopy and lead to an amplified signal that might look like a human or animal. Furthermore, vegetation often looks bright and similar to objects of interest, as in Fig. 1, where there are three humans labeled with bounding boxes, amongst many other bright objects. Second, since the data are captured aboard a moving UAV, these data often vary drastically. For example, the resolution, and therefore size of targets, is very different throughout the dataset because the UAV flies at varying altitudes.

In addition to difficult, variable video data to begin with, some videos may have many objects of interest in them, whereas some videos may not have any objects of interest at all. It sometimes takes a long time to determine if there are any objects of interest, and it also often takes a long time to label when there are many objects of interest. To illustrate the variation in the number of objects of interest, we analyze the historical videos we get from our collaborators. Fig. 2 shows a histogram of the average number of labels per frame, meaning that all frames in the video were counted, regardless of whether or not they were labeled, and a histogram of the average number of labels per labeled frame, meaning frames that had at least one label were counted.

Figure 1: An example of a thermal infrared frame, where the three humans outlined by the white boxes look very similar to the surrounding vegetation.

[Figure: two histograms; x-axes: Average Number of Labels Per Frame (bins 0, 0-1, 1-2, 2-3, 3+) and Average Number of Labels Per Labeled Frame (bins 0, 0-3, 3-6, 6-9, 9+); y-axes: Number of Videos]

Figure 2: A histogram with the number of videos for average objects of interest per frame (left), and the average objects of interest per labeled frame (right).


Although we focus on UAV videos in wildlife security domains, similar challenges can be expected in UAV videos in other security domains. Therefore, the application VIOLA that we introduce in this paper can potentially be applied to other security domains to provide input for game-theoretic approaches.

4. VIOLA

The main contribution of this paper is VIOLA, an application we developed for labeling UAV videos in wildlife security domains. VIOLA includes an easy-to-use interface for labelers and a basic framework to enable efficient usage of the application. In this section, we first discuss the user interface and then the framework for work distribution and the training process for labelers.

4.1 User Interface of VIOLA

The user interface of VIOLA was written in Java and JavaScript, and hosted on a server through a cloud computing service so that it could be accessed via a URL from anywhere with an internet connection.

Before labeling, labelers were asked to log in to ensure data security (Fig. 3a). The first menu that appears after login (Fig. 3b) asks the labeler which mode they would like: labeling a new video or reviewing a previous submission. Then, after choosing "Label", the second menu (Fig. 3c) asks them to choose a video to label. Fig. 4 is an example of the next screen used for labeling, also with sample bounding boxes that might be drawn at this stage. Along the top of the screen is an indication of the mode and the current video name, and along the bottom of the screen is a toolbar. First, in the bottom left corner, is a percentage indicating progress through the video. Then, there are four buttons used to navigate through the video. The two arrows move backwards or forwards, the play button advances frames at a rate of one frame per second, and the square stop button returns to the first frame of the video. The next button is the undo button, which removes the bounding boxes just drawn in the current frame, in case they were too tiny to easily delete. Also, to help with the nuisance of accidentally creating tiny boxes while drawing a new bounding box or while moving existing bounding boxes, there is a filter on bounding box size. The trash can button deletes the labeler's progress and takes them back to the first menu after login (Fig. 3b). Otherwise, work is automatically saved after each change and re-loaded each time the browser is closed and re-opened. The application asks for confirmation before deleting the labeler's progress and undoing bounding boxes to prevent accidental loss of work. The check-mark button is used to submit the labeler's work, and is only pressed when the whole video is finished. Again, there is a confirmation screen to avoid accidentally submitting half of a video. The copy button and the slider will be described further in Sec. 5. The eye button allows the labeler to toggle the display of the bounding boxes on the frame, which is often helpful during review to check that the labels are correct. Finally, the question mark button provides a help menu with a similar summary of the controls of the application. Notice that the bounding boxes surrounding the animals in this video are colored red; humans would be colored blue. This is also included in the help menu.

To draw bounding boxes, the labeler can simply click and drag a box around the object of interest, then click the box until the color reflects the class. Deleting a bounding box is done by pressing SHIFT and clicking, and selecting multiple bounding boxes is done by pressing CTRL and clicking, which allows the labeler to move multiple bounding boxes at once. Finally, while advancing frames, bounding boxes drawn in the current frame are moved to the next frame. This only happens the first time a frame is viewed, since it could otherwise add redundant bounding boxes or replace the bounding boxes originally added by the labeler.
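A minimal sketch of this copy-on-first-view rule follows; the dictionary-based state is our own illustration rather than VIOLA's actual code.

visited = {0}        # frame indices the labeler has already seen
labels = {0: []}     # frame index -> list of bounding boxes

def advance(frame_idx, boxes_current):
    """Move from frame_idx to frame_idx + 1, carrying boxes forward
    only the first time the next frame is viewed."""
    nxt = frame_idx + 1
    if nxt not in visited:
        # First visit: seed the next frame with the current boxes.
        labels[nxt] = list(boxes_current)
        visited.add(nxt)
    # On a revisit, keep whatever labels the next frame already has, so
    # we neither duplicate boxes nor overwrite the labeler's own edits.
    return labels[nxt]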

If "Review" is chosen in the first menu after login, the second menu also asks the labeler to choose a video to review, and then a third menu (Fig. 3d) asks them to choose a labeling submission to review. It finally displays the video with the labels from that particular submission, and they may begin reviewing the submission. The two differences between the labeling and review modes in the application are (i) that the review mode displays an existing set of labels and (ii) that labels are not moved to the next frame in review mode.

4.2 Use of VIOLA

Our goal in labeling the challenging videos in the wildlife security domain is first to keep the data secure, and second, to collect more usable labels to provide input for game-theoretic tools for security. In addition, we aim for exhaustive labels with high accuracy and consistency. To achieve these goals, we securely distribute the work among a small group of labelers, assign labelers to either provide or review others' labels, and provide guidelines and training for the labelers.

Figure 3: The menus to begin labeling.

Figure 4: An example of a frame (left) and labeled frame (right) in a video. This is the next screen displayed after all of the menus and allows the labeler to navigate through the video and manipulate or draw bounding boxes throughout.


Distribution of Work To keep the data (historical videos from our collaborators) secure, instead of using AMT, we recruit a small group of labelers (13 in this work). Labelers are given a username and password to access the labeling interface, and the images on the labeling interface cannot be downloaded.

In order to achieve label accuracy, we use a framework of label and review. The idea is simply that one person labels a video, and another person checks, or reviews, the labels of the first person. By checking the work of the labeler, the reviewer must agree or disagree with the original set of labels instead of creating their own. Upon disagreement, the reviewer can change the original labels. This was primarily chosen because it is a clean approach, leading to one set of final labels per video. This choice is described further in Sec. 5. Assignments are organized using shareable spreadsheets.

Guidelines and Training for Labelers In order to achieve accuracy and consistency of labels, we provide guidelines and training for the labelers. During the training, we show the labelers several examples of the videos and point out the objects of interest. We provide them with general guidelines on how to start labeling a video, as below (review is similar).

In general, the process for labeling should be:

• Watch the video once all the way through and try to decide what you see.

• Once you have an idea of what is happening in the video by going through it, return to the beginning of the video and start labeling.

• Make and move bounding boxes.

• Send screenshots if you need help.

We provide special instructions for the videos in our domain. For example, animals tend to be in herds, obviously shaped like animals, and/or significantly brighter than the rest of the scene, and humans tend to be moving.

5. DEVELOPMENT

Thanks in large part to feedback provided by the labelers, we were able to make improvements throughout the development of the application to the current version discussed in Sec. 4.1. In the initial version of the application, we had five people label a single video, and then automatically checked for a majority consensus among these five sets of labels. We used the Intersection over Union (IoU) metric to check for overlap with a threshold of 0.5 [7]. If at least three out of five sets of labels overlapped, it was deemed to be consensus, and we took the bounding box coordinates of the first labeler. Our main motivation for having five opinions per video was to compensate for the difficulty of labeling thermal infrared data, though we also took into account the work of [16] and [21]. The interface of the initial version allowed the user to draw and manipulate bounding boxes, navigate through the video, save work automatically, and submit the completed video. Boxes were copied to the next frame and could be moved individually. To get where we are today, the changes were as listed in Table 1.
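As a concrete sketch of the consensus check just described, the following computes IoU for axis-aligned boxes given as (x, y, w, h) and applies the three-out-of-five rule; the helper names are ours, not those of the original implementation.

def iou(a, b):
    """Intersection over Union for two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix0, iy0 = max(ax, bx), max(ay, by)
    ix1, iy1 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(ix1 - ix0, 0) * max(iy1 - iy0, 0)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def has_consensus(box, other_labelers, threshold=0.5, needed=3):
    """True if `box` (from the first labeler) is matched at IoU >= threshold
    by enough other labelers' box sets to reach `needed` votes in total
    (the first labeler counts as one vote)."""
    votes = 1
    for boxes in other_labelers:   # one list of boxes per labeler
        if any(iou(box, b) >= threshold for b in boxes):
            votes += 1
    return votes >= needed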

The most significant change made during the development process was the transition from five labelers labeling the same video and using majority voting to get the final labels (referred to as "MajVote") to having one labeler label the video followed by a reviewer reviewing the labels (referred to as "LabelReview"). We realized that having five people label a single video was very time consuming, and the quality of the labels was still not perfect because of the ambiguity of labeling thermal infrared data, which led to little consensus. Furthermore, when there was consensus, there were three to five different sets of coordinates to consider. Switching to LabelReview eliminated this problem, providing a cleaner and also time-saving solution. Another change, "Labeling Days", consisted of meeting together in one place for several hours per week so labelers were able to discuss ambiguities with us or their peers during labeling. Finally, the tracking algorithm (Alg. 1) was added to automatically track the bounding boxes when the labeler moves to a new frame to improve labeling efficiency, as the labelers would be able to label a single frame, then check that the labels were correct.

Table 1: Changes made throughout development.

Version | Change                  | Date    | Brief Description
1       | -                       | -       | Draws and edits boxes, navigates video, copies boxes to next frame
2       | Multiple Box Selection  | 3/23/17 | Moves multiple boxes at once to increase labeling speed
3       | Five Majority to Review | 3/24/17 | Requires only two people per video instead of five to improve overall efficiency
4       | Labeling Days           | 4/12/17 | Has labelers assemble to discuss difficult videos
5       | Tracking                | 6/17/17 | Copies and automatically moves boxes to next frame

Algorithm 1 Basic Tracking Algorithm

1: buffer ← userInput
2: for all boxesPreviousFrame do
3:   if boxSize > sizeThreshold then
4:     newCoords ← coords
5:   else
6:     searchArea ← newFrame[coords + buffer]
7:     threshIm ← Threshold(searchArea, thresh)
8:     components ← CCA(threshIm)
9:     if numberComponents > 0 then
10:      newCoords ← GetLargest(components)
11:    else
12:      newCoords ← coords
13:    end if
14:  end if
15:  CopyAndMoveBox(newFrame, newCoords)
16: end for
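Below is a minimal Python/OpenCV sketch of Algorithm 1. The paper does not specify its thresholds (and notes the brightness threshold is not set automatically), so size_threshold and intensity_threshold here are placeholder values and the helper names are ours; buffer_px corresponds to the user-adjustable search size from the toolbar slider.

import cv2
import numpy as np

def track_boxes(new_frame, boxes, buffer_px,
                size_threshold=2500, intensity_threshold=200):
    """Propagate (x, y, w, h) boxes from the previous frame into
    new_frame by re-locating the largest bright blob near each box."""
    new_boxes = []
    for (x, y, w, h) in boxes:
        if w * h > size_threshold:
            # Large boxes are copied unchanged, as in Algorithm 1.
            new_boxes.append((x, y, w, h))
            continue
        # Search window around the old location, clipped to the frame.
        x0, y0 = max(x - buffer_px, 0), max(y - buffer_px, 0)
        x1 = min(x + w + buffer_px, new_frame.shape[1])
        y1 = min(y + h + buffer_px, new_frame.shape[0])
        search = new_frame[y0:y1, x0:x1]
        # Threshold bright (warm) pixels, then run connected components.
        _, thresh_im = cv2.threshold(search, intensity_threshold, 255,
                                     cv2.THRESH_BINARY)
        n, _, stats, _ = cv2.connectedComponentsWithStats(
            thresh_im.astype(np.uint8))
        if n > 1:  # component 0 is the background
            largest = 1 + int(np.argmax(stats[1:, cv2.CC_STAT_AREA]))
            new_boxes.append((x0 + stats[largest, cv2.CC_STAT_LEFT],
                              y0 + stats[largest, cv2.CC_STAT_TOP],
                              stats[largest, cv2.CC_STAT_WIDTH],
                              stats[largest, cv2.CC_STAT_HEIGHT]))
        else:
            new_boxes.append((x, y, w, h))  # nothing found: just copy
    return new_boxes

Taking the largest bright component is this sketch's weak point, consistent with the analysis in Sec. 6: when several bright objects fall inside the same search window, the largest component may not be the labeled object.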


An example of the tracking process in use is shown in Fig. 5. First, the labeler drew two bounding boxes around the animals (Fig. 5a), then adjusted the search size for the tracking algorithm using the slider in the toolbar (Fig. 5b). The tracking algorithm was applied to produce the new bounding box location (Fig. 5c). In contrast, the copy feature, activated when the copy button was selected on the toolbar, only copied the boxes to the same location (Fig. 5d). In this case, since there was movement, and the animals were large and far from one another, the tracking algorithm correctly identified the animals in consecutive frames. If several bright objects were in the search region, however, it could track the wrong object, and copying could be better. One direction of future work is to improve the tracking algorithm by setting thresholds automatically and accounting for close objects.

6. ANALYSIS

Table 2: Versions tested in additional tests.

Version | 1       | 2       | 3      | 4   | 5
Name    | Basic   | MB      | Review | LD  | Track
FW      | MajVote | MajVote | LR     | LR  | LR
Order   | 4th     | 3rd     | 1st    | 2nd | 5th

In this section, we analyze how the changes we made during the development of VIOLA affect the labeling efficiency by examining two questions: (i) how the changes affect the overall efficiency of the data collection process, which is measured by the total person time needed to get a final label, that is, a label confirmed by five-labeler majority voting or by review, which can be used for game-theoretic analysis or deep learning algorithms; (ii) how the changes affect the individual efficiency, i.e., the person time needed for an individual labeler or reviewer to provide or to check a label. In addition, we examine whether other desired properties of the data collection process, such as exhaustiveness, have been achieved.
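Stated as code, the two metrics are simply ratios; the argument layout below (per-person times in seconds and a final-label count) is our own illustration.

def overall_efficiency(person_times_seconds, num_final_labels):
    """Total person time (labeling plus reviewing) per final label."""
    return sum(person_times_seconds) / num_final_labels

def individual_efficiency(person_time_seconds, num_labels_handled):
    """Person time per label provided or checked by one labeler or
    reviewer, whether or not those labels end up as final labels."""
    return person_time_seconds / num_labels_handled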

To analyze efficiency, we first went through the person time data collected during the development of VIOLA. Any changes made in VIOLA were deployed immediately to make faster progress in labeling the videos. These person time data came from different videos and labelers, and the videos inherently took different amounts of time to label regardless of the application, since they varied in their content. To mitigate this intrinsic heterogeneity for analysis, we divide the videos into four groups, (0, 1), [1, 2), [2, 3), and [3, +∞), based on the average number of labels per frame, since it is an important indicator of the difficulty of labeling a video. There were other factors affecting the difficulty of labeling videos, so videos in the same group may still have had high variation. Because of this, we remove the top and bottom 5% of time-per-label entries.
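A small sketch of this trimming step, using NumPy; `times` stands in for an array of time-per-label entries.

import numpy as np

def trim_extremes(times, pct=5):
    """Drop entries below the pct-th and above the (100 - pct)-th percentile."""
    lo, hi = np.percentile(times, [pct, 100 - pct])
    return times[(times >= lo) & (times <= hi)]

# Example: trim_extremes(np.array([3.0, 4.1, 4.4, 5.0, 60.0]))
# drops the extreme entries 3.0 and 60.0.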

Also due to these concerns, we collected additional person time data in a more controlled environment. We gave six unique videos that contained animals but no poachers to the labelers to label. The labelers had not seen these videos previously. We distributed the work among the labelers so as to get one set of final labels for each video under each of the versions of VIOLA (as shown in Table 1). We asked the labelers to label for no more than 15 minutes on each video. To accommodate the labelers' schedules and coordinate meetings, which were necessary for LabelDays and Tracking, we gave the labelers 2 to 4 days to label the videos under each version. As such, it was difficult to get multiple sets of labels for each video or to get labels for more videos. Labelers may not have checked all the frames in the video within 15 minutes, so we use the minimum checked frame among labelers for each video under each version, and analyze efficiency using person time data up until that frame only. Also, note that since some labelers were asked to label the same video multiple times under different versions, the labelers likely got faster as time went on. To mitigate these effects, we randomly ordered the five versions of VIOLA for them to label. The order is shown in Table 2, where Framework is abbreviated as FW, MultiBox as MB, LabelDays as LD, Tracking as Track, and LabelReview as LR.

We will proceed in this section by first focusing on the impact of the key change in the labeling framework from MajVote to LabelReview on the overall efficiency. We will then check each version of VIOLA to understand the impact of other changes. Because of the surprising results, we will also examine videos in which these features did not help.


Figure 5: A sample labeling process.

Figure 6: Overall efficiency (time in seconds per label) with different labeling frameworks. (a) Data from development process. (b) Data from additional tests.


6.1 From MajVote to LabelReview

Fig. 6a and Fig. 6b show the comparison of overall efficiency between MajVote and LabelReview. The total person time per final label is lower on average when we use LabelReview, based on data collected through both the development process and additional tests. From development, there are only seven videos for which we get final labels using MajVote, two of which do not produce any consensus labels. There are more than 70 videos for which we get final labels through LabelReview.

During additional tests, we tested two versions using MajVote and three versions using LabelReview, which means the value of each bar is averaged over two or three samples. We exclude one sample for Video C where no consensus labels were achieved through MajVote. The LabelReview efficiency for Video D is 0.63 with a standard error of 0.09.

In addition to having more labelers involved, one reason that MajVote leads to a higher person time per final label is the lack of consensus. Fig. 7 shows that there were large discrepancies in the number of labels between individual labelers, which led to fewer consensus labels (zero in Videos I and M). Fig. 8 shows that MajVote leads to fewer final labels than LabelReview in the additional tests as well. This indicates that using the LabelReview framework can get us closer to the goal of exhaustive labels when compared to MajVote.

Figure 7: Number of labels per frame for individual labelers and for consensus. (a) For the seven videos with five sets of labels during development process. (b) For the six videos used in the additional tests under version Basic.

Figure 8: Number of final labels for MajVote and LabelReview in additional tests.


6.2 Impact of Other Changes

In this section, we examine the individual and overall efficiency of each version of VIOLA to analyze the impact of the other changes made during the development of VIOLA. For individual efficiency, we calculate person time spent per label for each individual labeler or reviewer, regardless of whether that label has been confirmed to be a final label.

We first show results of individual efficiency based on person time data collected during the development process in Fig. 9, which shows the mean labeling and reviewing time per label within the timespan of each change during development. We then examine the individual efficiency for labeling and reviewing in the additional tests (Fig. 10). The results of each test have been shown by video, since there are only five sets of labels in the tests with MajVote (Versions 1-2) and only one set of labels in the tests with LabelReview (Versions 3-5). The five sets of labels in the MajVote tests are averaged by video, and the standard error bars are included. Fig. 10 shows that each change resulted in an improvement in the individual efficiency for some but not all of the videos.

Figure 9: Average individual efficiency of labeling (left) and review (right) with data collected during development.

Figure 10: Individual efficiency for each submission and average efficiency of labeling (left) and review (right) with data collected from additional tests.


Multiple Box Selection The feature of multiple box selection was added to improve the individual efficiency of labeling. Checking the first two groups in Fig. 9 and Fig. 10, we notice that, surprisingly, this feature improved individual efficiency for some of the videos (e.g., Video F), but not all the videos. One possible explanation is that in videos where there are many animals that did not move much over time, the changing position of the bounding boxes is mainly due to the movement of the camera. In this case, using multiple box selection is helpful. However, in other videos with only one or two animals in each frame, it may be faster to move the boxes separately, particularly if an animal moves.

Labeling Days Labeling days were introduced with the aim of increasing the overall efficiency. Fig. 11 shows that the average person time per final label decreased from Review to LabelDays during additional tests, and that the time per final label decreased for Videos A, C, and F. Fig. 11 also shows that the number of final labels remained the same on average.

The results indicate that introducing labeling days may help improve the efficiency and exhaustiveness of labeling, at least for some more complex videos. Subjective feedback from the labelers also indicated that introducing labeling days made it easier for them to deal with ambiguous cases, where it is difficult to maintain consistency and accuracy despite the guidelines. However, Fig. 9 and Fig. 10 show that introducing labeling days does not lead to an improvement in individual efficiency in all cases. It is possible that it increased the individual labeling time due to extra discussion time, while saving time during review. We plan to analyze the effects of labeling days in more detail in the future.

Figure 11: Overall efficiency (left) and number of final labels (right) with Review and LabelDays during additional tests.


Tracking We included the tracking feature in the additional test Tracking, but it has not been deployed for the labelers to use. During the tests, we received positive feedback from labelers, particularly on videos in which animals were far apart and bright. In addition, the tracking feature was able to successfully track two animals in the first 10% of Video B, as shown in Fig. 5. Unexpectedly, the initial results from the additional test do not show a positive effect on time per label or number of labels. We believe this is due to the fact that it does not find a brightness threshold automatically, and is likely to track the wrong object when multiple objects are within the same search region. We plan to continue developing this feature given its promise in the cases where animals were far apart and bright.

Summary This section thus shows that while some of our proposed improvements led to increased efficiency, particularly the switch from MajVote to LabelReview, other changes (e.g., multiple box selection) surprisingly increased efficiency only in some videos. This result indicates that we must not add features on the intuition that they must improve performance, as they may only apply to some videos.

7. CONCLUSIONS

In conclusion, we presented VIOLA, which provides a labeling and reviewing framework to gather labeled data from a small group of people in a secure manner, and a labeling interface with both general features for difficult video data, and specific features for our green security domain to track wildlife and poachers. We analyzed the impact of the framework and the features on labeling efficiency, and found that some changes did not improve efficiency in general, but worked only in particular types of videos. We plan to utilize the data we acquired in this work to estimate animal distributions, automatically detect wildlife and poachers in real time, and predict poachers' movement patterns, which are important for game-theoretic approaches such as PAWS.

8. ACKNOWLEDGMENTS

This research was supported by UCAR N00173-16-2-C903, with the primary sponsor being the Naval Research Laboratory (Z17-19598). It was also partially supported by the Harvard Center for Research on Computation and Society Fellowship and the Viterbi School of Engineering Ph.D. Merit Top-Off Fellowship.

9. REFERENCES

[1] T. Alpcan and T. Basar. A game theoretic approach to decision and analysis in network intrusion detection. In Proceedings of the 42nd IEEE Conference on Decision and Control, volume 3, pages 2595–2600. IEEE, 2003.

[2] V. Badrinarayanan, F. Galasso, and R. Cipolla. Label propagation in video sequences. In CVPR. IEEE, 2010.

[3] S.-H. Bae and K.-J. Yoon. Robust online multi-object tracking based on tracklet confidence and online discriminative appearance learning. In CVPR, 2014.

[4] N. Basilico, G. D. Nittis, and N. Gatti. A security game combining patrolling and alarm-triggered responses under spatial and detection uncertainties. In AAAI, pages 397–403, 2016.

[5] C. A. Catania, F. Bromberg, and C. G. Garino. An autonomous labeling approach to support vector machines algorithms for network traffic anomaly detection. Expert Systems with Applications, 39(2):1822–1829, 2012.

[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR. IEEE, 2009.

[7] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, June 2010.

[8] F. Fang, T. H. Nguyen, R. Pickles, W. Y. Lam, G. R. Clements, B. An, A. Singh, M. Tambe, and A. Lemieux. Deploying PAWS: Field optimization of the Protection Assistant for Wildlife Security. In AAAI, 2016.

[9] J. C. Hodgson, S. M. Baylis, R. Mott, A. Herrod, and R. H. Clarke. Precision wildlife monitoring using unmanned aerial vehicles. Scientific Reports, 6, 2016.

[10] D. Kar, F. Fang, F. D. Fave, N. Sintov, and M. Tambe. "A Game of Thrones": When human behavior models compete in repeated Stackelberg security games. In AAMAS, 2015.

[11] D. Kar, B. Ford, S. Gholami, F. Fang, A. Plumptre, M. Tambe, M. Driciru, F. Wanyama, A. Rwetsiba, M. Nsubaga, and J. Mabonga. Cloudy with a chance of poaching: Adversary behavior modeling and forecasting with real-world poaching data. In AAMAS, 2017.

[12] M. Kristan, J. Matas, A. Leonardis, M. Felsberg, L. Cehovin, G. Fernandez, T. Vojir, G. Hager, G. Nebehay, and R. Pflugfelder. The visual object tracking VOT2015 challenge results. In ICCV Workshops, 2015.

[13] B. D. Lucas, T. Kanade, et al. An iterative image registration technique with an application to stereo vision. 1981.

[14] C. Ma, X. Yang, C. Zhang, and M.-H. Yang. Long-term correlation tracking. In CVPR, pages 5388–5396, 2015.

[15] A. Milan, L. Leal-Taixe, I. Reid, S. Roth, and K. Schindler. MOT16: A benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831, 2016.

[16] P. Nguyen, J. Kim, and R. C. Miller. Generating annotations for how-to videos using crowdsourcing. In CHI '13 Extended Abstracts on Human Factors in Computing Systems, pages 835–840, 2013.

[17] T. H. Nguyen, A. Sinha, S. Gholami, A. Plumptre, L. Joppa, M. Tambe, M. Driciru, F. Wanyama, A. Rwetsiba, R. Critchlow, et al. CAPTURE: A new predictive anti-poaching tool for wildlife protection. In AAMAS, pages 767–775, 2016.

[18] T. H. Nguyen, R. Yang, A. Azaria, S. Kraus, and M. Tambe. Analyzing the effectiveness of adversary modeling in security games. In AAAI, pages 718–724, 2013.

[19] L.-V. Nguyen-Dinh, C. Waldburger, D. Roggen, and G. Troster. Tagging human activities in video by crowdsourcing. In ICMR, 2013.

[20] C.-H. Pai, Y.-P. Lin, G. G. Medioni, and R. R. Hamza. Moving object detection on a runway prior to landing using an onboard infrared camera. In CVPR, pages 1–8. IEEE, 2007.

[21] S. Park, G. Mohammadi, R. Artstein, and L.-P. Morency. Crowdsourcing micro-level multimedia annotations: The challenges of evaluation and interface. In CrowdMM, pages 29–34, 2012.

[22] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In CVPR, 2016.

[23] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, pages 91–99, 2015.

[24] V. S. Sheng, F. Provost, and P. G. Ipeirotis. Get another label? Improving data quality and data mining using multiple, noisy labelers. In KDD, pages 614–622, 2008.

[25] M. Tambe. Security and Game Theory: Algorithms, Deployed Systems, Lessons Learned. Cambridge University Press, 2011.

[26] C. Vondrick, D. Patterson, and D. Ramanan. Efficiently scaling up crowdsourced video annotation. International Journal of Computer Vision, pages 1–21.

[27] R. Yan, J. Yang, and A. Hauptmann. Automatically labeling video data using multi-class active learning. In ICCV, 2003.

[28] A. Yilmaz, O. Javed, and M. Shah. Object tracking: A survey. ACM Computing Surveys (CSUR), 38(4):13, 2006.

[29] L. Zhang, Y. Li, and R. Nevatia. Global data association for multi-object tracking using network flows. In CVPR, pages 1–8. IEEE, 2008.