
Design and Assessment of an Intelligent Activity Monitoring Platform

Alberto Avanzi
BULL SA, i-DTV group
av. Jean Jaurès
78340 Les Clayes sous bois — France
[email protected]

François Brémond, Christophe Tornieri, Monique Thonnat
INRIA Sophia Antipolis, ORION group
2004, route des Lucioles, BP93
06902 Sophia Antipolis Cedex — France
[email protected]

ABSTRACT

We are interested in designing a reusable and robust activity monitoring platform. In this paper we propose three good properties that an activity monitoring platform should have to enable its reusability for different applications and to ensure performance quality: 1) modularity and flexibility of the architecture, 2) separation between the algorithms and the a priori knowledge they use, and 3) automatic evaluation of algorithm results. We then propose a development methodology to fulfill the last two properties. The methodology consists in the interaction between end-users and developers during the whole development of a specific monitoring system for a new application. To validate our approach, we first present a platform that we use to generate activity monitoring systems dedicated to specific applications. Then we explain how we have given concrete expression to these properties in the platform. Finally we describe in detail the technical validation and the end-user assessment of an automatic metro monitoring system built with the platform. We also give the corresponding validation results for two other systems built with the same platform: a bank agency monitoring system and a building lock chamber access control system.

KEYWORDS

Intelligent vision platform, video surveillance, autonomous system, evaluation, generic platform.

1 INTRODUCTION

The task of developing algorithms able to recognize human activities in video sequences has been an active field of research for the last ten years. Nevertheless, the lack of genericity and robustness of the proposed solutions is still an open problem. To break down this challenging problem into smaller and easier ones, a possible approach is to limit the field of application to specific activities in well-delimited environments. The scientific community has thus conducted research on automatic traffic surveillance on highways, on pedestrian and vehicle interaction analysis in parking lots or roundabouts [1], and on human activity monitoring in outdoor (streets and public places) or indoor (metro stations, bank agencies, houses) environments [2, 3, 4].

We believe that a single global and sophisticated algorithm is not suited to obtaining a reusable and efficient activity monitoring platform, because it cannot handle the large diversity of real-world applications. Such a platform can, however, be achieved if many algorithms can be easily combined and integrated to handle this diversity. We therefore propose to use software engineering and knowledge engineering techniques to meet these major requirements.

To illustrate what we mean when we speak of "an activity monitoring platform", we first describe the platform we have developed during the last ten years, called VSIP — Video Surveillance Intelligent Platform. VSIP is a toolbox helping a developer to build Activity Monitoring Systems (AMS) dedicated to specific applications.

Then we address three general properties of an activity monitoring platform to ensure performance quality and platform reusability. Our goals are (1) to have a platform which allows the building of new activity monitoring systems dedicated to different applications and (2) to ensure the quality of the results given by any system built with the platform. While defining and describing each property, we show how it is fulfilled in VSIP.

The first property is modularity and flexibility of the architecture. This is a classical software engineering property. To use a platform for deriving new systems for specific applications, it is often necessary to add new algorithms, to remove existing ones or to replace some of them with others which have the same functionality but are able to cope with more challenging situations. For example, when addressing for the first time an application where the light can be switched on and off, it is necessary to develop an algorithm able to handle instantaneous illumination changes. This algorithm then has to be integrated into the platform in order to be used, without requiring additional



Fig. 1. This figure shows six Activity Monitoring Systems (AMS) derived from VSIP to handle different applications. Image 1(a) illustrates a metro monitoring system running on black and white cameras of the YZER station in Brussels. Image 1(b) illustrates the same system analyzing images from a color surveillance camera of the SAGRADA FAMILIA station in Barcelona. Image 1(c) illustrates a system for detecting unruly behaviors inside trains. Images 1(d) and 1(e), taken with 2 synchronized cameras with overlapping fields of view working in a cooperative way, illustrate a bank agency monitoring system detecting an abnormal "bank attack" scenario. Image 1(f) illustrates a single-camera system for a lock chamber access control application for building entrances. Images 1(g) and 1(h) illustrate an application for apron monitoring on airports; this application combines a total of 8 surveillance cameras with overlapping fields of view. Finally, image 1(i) illustrates a highway traffic monitoring application.


development, by any AMS derived from the platform. To allow this kind of "plugging-unplugging" feature, the platform has to be developed keeping in mind a well-defined modular architecture, based for example upon clear interfaces between modules (in order to ensure information sharing and exchange between all the system modules). At the same time, modules have to be flexible in order to be reused in different situations. A natural way to obtain flexibility is to outsource parameters (to allow automatic parameter tuning from the highest level). In our platform we have decided to use the same interface type between modules and the same data organization from the lowest level to the highest one, as detailed in section 4. Moreover, the data manager we have developed provides the system with feedback channels going from high-level modules towards low-level modules to allow closed-loop configurations, even if these channels are not used by all systems.

A second property is the separation between the algorithms and the a priori knowledge they use. Using a priori knowledge is not new, but keeping it separate from the algorithms enables reusability. Complex systems performing activity monitoring use a huge amount of knowledge of different types. Knowledge is often application dependent and, for the same application, camera dependent, so it should never be embedded into the algorithms. In our case, we have decided to use 3D descriptions of the observed empty scenes as well as predefined scenarios as the a priori knowledge available to the system. The 3D descriptions change when the observed scenes change, and their separation from the algorithms makes it possible to adapt the system to different video cameras. The predefined scenarios change when the application changes, but thanks to this separation we can reuse the same algorithms for a different system without modifying them.

A third property is automatic evaluation, whose goal is to make it possible to evaluate the results of the different AMS built with the platform. This property is important after the integration of a new algorithm or the modification of an existing one. When addressing a new application, it is normal to face new problems which require handling new situations and using more a priori knowledge (or refining the already existing knowledge). The development and integration into the platform of such new algorithms is made possible by the modularity and flexibility of the architecture (first property). The difficulty is then to ensure that the new algorithm preserves the quality of results previously obtained by the AMS dedicated to other applications. A solution is the development of an automatic evaluation framework based on ground-truth data, which is able to evaluate the performance of a set of AMS on a wide set of predefined sequences. Thanks to this, it is possible to evaluate the impact of a new algorithm on the platform, ensuring that it globally increases the quality of the results. Moreover, a framework of this type makes it possible to apply statistical learning methods for parameter tuning, useful to find the best parameter set for a given application.

Finally, a development methodology to fulfill the last two properties consists in the interaction between end-users and developers (the end-users are for example metro security or bank agency operators). This interaction is useful because end-users provide the a priori knowledge (the predefined scenario models) used by the system (second property) and the scenario-level ground truth used to perform the automatic evaluation (third property). There are also three other important reasons for developers to interact with end-users. The first reason is that end-users, helped by system developers, can find out which activities are interesting to monitor and how to describe them precisely. This avoids building a system which does not meet user needs. The second reason is the frequent necessity to ask professional actors to act out a set of scenes showing both normal activities and the activities to monitor. These video sequences are necessary during the development of the system to tune and test algorithms. Actors are needed because there are often too few recorded sequences showing abnormal activities. Only end-users can explain to actors: 1) how to act in a realistic way, 2) which activities are to be monitored and 3) how to describe them precisely. The third reason is that end-users can perform, at the end of the development, an assessment of the system, measuring its efficiency and evaluating its utility.

In our case, we have been working closely with end-users of different application domains. For example, we have built with VSIP three AMS which have been validated by end-users: an activity monitoring system for metro stations, a bank agency monitoring system and a lock chamber access control system for building security. These applications present some characteristics which make them interesting for research purposes: the observed scenes vary from large open spaces (like metro halls) to small and closed spaces (corridors and lock chambers); cameras can have both non-overlapping (as in the metro station and lock chamber systems) and overlapping fields of view (metro stations and bank agencies); humans can interact with the equipment (like ticket vending machines or access control barriers, bank safes and lock chamber doors) either in simple ways (open/close) or in more complex ones (such as the interaction occurring in the vandalism against equipment or jumping over the barrier scenarios). All these AMS have been validated, and an end-user assessment has either been done or is scheduled for the beginning of 2005.

We are currently building three other applications with VSIP. All these applications are illustrated in figure 1. The first application is apron monitoring on an airport¹, where vehicles of various types evolve in a cluttered scene. The dedicated system has been able to successfully detect at the same time vehicles and people getting in and out on several videos lasting twenty minutes. A second application consists in detecting abnormal behaviors inside moving

¹ In the framework of the AVITRACK European Project, see [5] and [6].


trains. The dedicated system is able to handle situations in which people are partially occluded by train equipment such as seats. A third application is traffic monitoring on highways; the dedicated system has been built in a few weeks to show the adaptability of the platform. These systems are currently under development and validation, and end-user assessment will be done in the near future.

The next section presents a state of the art of the activity monitoring research done during the last ten years. We then give an overview of the global architecture of VSIP in section 3. Section 4 presents how we addressed the first property, the modularity and flexibility of the architecture. Section 5 describes how we have managed to obtain a separation between the algorithms and the a priori knowledge. In section 6 we address the third property, the automatic evaluation of the results. Section 7 addresses the platform development methodology based on the interaction with end-users. Then we present in section 8 the validation and the end-user assessment of a system built with the VSIP platform and applied to metro stations. In section 9 we give the validation results for two other systems, applied to bank agencies and building entrances. Finally, in section 10 we give some concluding remarks and present ongoing work.

2 STATE OF THE ART

Video Understanding is now a mature scientific domain which started in the eighties. Early research on video understanding concentrated on vehicle tracking, because vehicle shapes are relatively easy to model and to detect in videos. These works included the monitoring of vehicles in roundabouts for traffic control (ESPRIT VIEWS — [7]). 3D vehicle modeling [8] was later extended to include models that can be distorted and which are parameterizable, and to include appearance modeling to improve robustness.

The last decade has witnessed a more practical and user-centered development of Vision and Cognitive Vision research. The main achievement has been the development of Activity Monitoring, usually focusing on the low-level video processing aspect and on people tracking. People tracking is more difficult than vehicle tracking because the human body is non-rigid and people's motions have more degrees of freedom and are less predictable than vehicle motions. At present, real-time tracking of people is mainly achieved using appearance-based models. For example, Haritaoglu et al. [9] use shape analysis and tracking to locate people and their parts (head, hands, feet, torso) in image sequences. Oliver et al. [10] use Bayesian analysis to identify human interactions using trajectories obtained from a monocular camera. Other examples include the Leeds People Tracker [11], which was combined with the Reading Vehicle Tracker to produce a single 3D integrated tracker for pedestrians and vehicles in the same scene. The VSAM (Visual Surveillance and Monitoring) project (1997-2000) [12, 13] involved twelve research laboratories in the implementation

of systems to segment and track people and vehicles in image sequences, locate them in a 3D model of the scene environment using prior camera calibration, and visualize them in a plan-view dynamic scene. More recently, the VACE program (Video Analysis and Contents Exploitation [14]) and the Homeland Security ARPA program [15] have organized research in the USA on video understanding. For VACE the goal is to recognize events of interest from any type of video source.

In general, video understanding systems rely on careful camera positioning and a dense camera network. The multi-camera tracking of Mubarak Shah et al. [16] uses multiple views to rebuild the trajectory of people between non-overlapping cameras, linking the different fields of view being observed. Routes followed by pedestrians through the scene are learnt by observing a large number of motion trajectories, and allow the construction of a geometric and probabilistic trajectory model for long-term prediction.

Scene modeling is used to increase the reliability of tracking and behavior interpretation. For example, people in the field of view are likely to be on the ground plane, and moving vehicles are likely to be on a road rather than on the pavement.

A new trend in video understanding systems is to use evaluation and program supervision techniques to improve robustness. The creation of PETS [17] reinforces the idea that evaluation techniques are needed to assess the reliability of existing tracking algorithms. But these workshops are mostly intended to test various algorithms on the same video inputs: algorithm comparison is mostly qualitative, and quantitative comparison based on precise criteria is missing. Nevertheless, we can mention an interesting theoretical work on performance evaluation in [18]. It first discusses the importance of testing algorithms on real video sequences, for instance outdoor sequences with various weather conditions. Second, it presents the pros and cons of ground-truth techniques and their alternatives. However, the repair and tuning stage of these video understanding systems is realized manually. Only a few works [19] try to optimize the performance of these systems.

Moreover, few of these systems are able to perform complex reasoning (i.e. spatio-temporal reasoning) and to understand all the interactions between people in real-world applications. In the Video Understanding domain, two main approaches are used to recognize temporal events from video: one based on a probabilistic/neural network and one based on a symbolic network. For the computer vision community, a natural approach consists in using a probabilistic/neural network. The nodes of this network usually correspond to events that are recognized at a given instant with a computed probability. For example, Hongeng et al. [20] proposed an event recognition method that uses concurrent Bayesian threads to estimate the likelihood of potential events. For the artificial intelligence community, a natural way to recognize an event is to use a symbolic network whose nodes usually correspond to the symbolic


recognition of events. For example, some artificial intelligence researchers used a declarative representation of events defined as a set of spatio-temporal and logical constraints. Some of them used traditional constraint resolution or temporal constraint propagation [21] techniques to recognize events.

In spite of all these achievements, no Activity Monitoring System can be said to be robust or generic enough to be used in a real-world application. An adapted design, development and evaluation methodology is still needed to achieve a generic intelligent video understanding platform.

This paper proposes three good properties that an activity monitoring platform should have to enable its reusability for different applications, together with a development methodology. As a concrete expression of these properties, we present a complete platform including human and vehicle detection and tracking, scene modeling and spatio-temporal reasoning capabilities. To underline the reusability of the platform allowed by these properties, we present validation and end-user assessment results for three systems built with the platform.

3 PLATFORM OVERVIEW

To demonstrate the feasibility of our approach and to illustrate with concrete examples the application of the proposed properties, we present an activity monitoring platform, named VSIP, whose global structure is shown in figure 2. We use this platform to build activity monitoring systems for specific applications.

The input images are color or black and white, digitized at a variable frame rate (typically between 4 and 25 frames/sec).

The segmentation algorithm detects the moving regions by subtracting the current image from the reference image (a background image built from images taken under different lighting conditions). These moving regions, associated with a set of 2D features like density or position, are called BLOBS. A noise tracking algorithm discriminates blobs between real moving regions and regions of persistent change in the image (like a new poster on the wall or a newspaper on a table). Depending on the type of application, a door detection algorithm can be activated, which handles the opening and closing of doors that have been specified in the 3D description of the scene. This algorithm removes the moving pixels corresponding to a door being opened or closed. A set of 3D features, like 3D position, width and height, is computed for each blob. Then the blobs are classified into several predefined classes (for example PERSON, GROUP, NOISE, CAR, TRUCK, AIRCRAFT, UNKNOWN...) by the classification algorithm.
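As a minimal sketch of this step, the following Python code subtracts a grayscale reference image from the current frame and extracts blobs as connected components; the thresholds, the class bounds and all function names are illustrative assumptions, not the actual VSIP implementation.

import numpy as np
from scipy import ndimage

def detect_blobs(frame, reference, diff_threshold=25, min_area=50):
    # Reference-image subtraction on grayscale images (the real platform
    # also handles color and updates the reference image over time).
    diff = np.abs(frame.astype(np.int16) - reference.astype(np.int16))
    motion_mask = diff > diff_threshold
    labels, _ = ndimage.label(motion_mask)        # connected moving regions
    blobs = []
    for sl in ndimage.find_objects(labels):
        h, w = sl[0].stop - sl[0].start, sl[1].stop - sl[1].start
        if h * w >= min_area:                     # discard tiny noise regions
            blobs.append({"x": sl[1].start, "y": sl[0].start, "w": w, "h": h})
    return blobs

def classify(width_3d, height_3d):
    # Toy classifier on 3D blob dimensions; the bounds below are made up.
    if height_3d > 2.2:
        return "GROUP"
    if 1.4 <= height_3d <= 2.2 and width_3d < 1.2:
        return "PERSON"
    return "UNKNOWN"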

The blobs with their associated class and a set of 3D features are called MOBILE OBJECTS. A split and merge algorithm corrects some detection errors, like a person separated into two different mobile objects. A 3D repositioning algorithm corrects the 3D position of the mobile objects classified as PERSON that have been located at a wrong place (such as outside the boundary of the observed scene or behind a wall). This happens when the bottom part of the person is not correctly detected (for example, the legs can be occluded by an object or badly segmented). If useful for the application, a chair management algorithm can be activated, which helps differentiate a mobile object corresponding to a chair from a mobile object corresponding to a person. A background updating algorithm uses the discrimination between real mobile objects and regions of persistent change in the image (done by the noise tracking algorithm) to update the reference image by integrating the environment changes [22].

The set of previously described algorithms is generally called the "motion detection module". The output of this module is, for each frame, the list of the mobile objects (with their 3D features and their class).

The motion detection module is followed by the frame to frame tracking module. The goal of this module is to link from frame to frame all the mobile objects computed by the motion detection module. The output of the frame to frame tracking module is a graph containing the detected mobile objects updated over time and a set of links between blobs detected at time t and blobs detected at time t-1. A mobile object with temporal links towards mobile objects of the previous frame is called a TRACKED MOBILE OBJECT. This graph provides all the possible trajectories of a mobile object and constitutes the input of the following long term tracking module.
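A possible reading of the link computation, sketched below; compatibility is reduced here to a single 3D distance test, whereas the platform combines several criteria.

def link_frame_to_frame(prev_objects, curr_objects, max_dist=1.0):
    # Build the temporal links of the tracking graph: an edge (j, i) means
    # that object j at time t-1 is a possible antecedent of object i at t.
    links = []
    for i, cur in enumerate(curr_objects):
        for j, prev in enumerate(prev_objects):
            dx = cur["x3d"] - prev["x3d"]
            dy = cur["y3d"] - prev["y3d"]
            if (dx * dx + dy * dy) ** 0.5 <= max_dist:
                links.append((j, i))
    return links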

The lists of mobile objects coming from different cameras with overlapping fields of view are then fused together by a fusion algorithm to give a unique representation of the mobile objects. The algorithm uses combination matrices (combining several compatibility criteria) to establish the right association between the different views of the same mobile object observed by different cameras. A mobile object detected by a camera may be fused with one or more mobile objects seen by other cameras, or can simply be kept alone or destroyed if classified as noise. After fusion, the resulting FUSED TRACKED MOBILE OBJECTS combine all the temporal links of the mobile objects which have been fused together. The 3D features of the resulting fused objects are the weighted mean of the 3D features of the original mobile objects. Weights are computed as a function of the distances of the original mobile objects from the corresponding camera. In this way the resulting 3D features are more accurate than the original ones.
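The weighted fusion of 3D features can be pictured as follows; the text only states that weights depend on the distance to the camera, so the inverse-distance weighting below is an assumption.

def fuse_positions(observations):
    # observations: list of ((x, y, z), distance_to_camera) for the views
    # of one mobile object seen by several cameras.
    weights = [1.0 / max(dist, 1e-6) for _, dist in observations]
    total = sum(weights)
    return tuple(
        sum(w * pos[k] for (pos, _), w in zip(observations, weights)) / total
        for k in range(3))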

Depending on the scenarios to recognize, one or more long term trackers can be used. All of them rely on the same idea: they first compute a set of paths representing the possible trajectories of the PHYSICAL OBJECTS to track (isolated individuals, groups of people, crowds, cars, trucks, airplanes...). Then they track the physical objects with a predefined delay T to compare the evolution of the different paths. The trackers choose, at each frame, the best path to update the physical object characteristics [23].
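The delayed decision of the long term trackers can be sketched as below; the path representation and the scoring function are placeholders, see [23] for the actual algorithm.

def choose_best_path(candidate_paths, frame_index, delay_t=10, score=len):
    # Compare only paths that have evolved for at least delay_t frames,
    # then commit to the best one to update the physical object.
    mature = [p for p in candidate_paths
              if frame_index - p["start_frame"] >= delay_t]
    if not mature:
        return None   # keep accumulating hypotheses
    return max(mature, key=lambda p: score(p["objects"]))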

The fused physical objects are then processed by the behavior recognition algorithms to recognize the predefined scenarios.



Fig. 2. This figure shows the global structure of the activity monitoring platform. First, a motion detection step followed by a frame to frame tracking step is performed for each camera. Then the tracked mobile objects (objects from camera i) coming from different cameras with overlapping fields of view are fused into a unique representation for the whole scene. Depending on the chosen application, a combination of one or more of the available trackers (individual, group and crowd trackers) is used. The results are passed to the behavior recognition algorithms, which combine one or more of the following algorithms, depending on the scenarios to recognize: automaton-based, Bayesian-network-based, AND/OR-tree-based and temporal-constraints-based recognition algorithms. Finally the system generates the alerts corresponding to the predefined recognized scenarios.

Depending on the type of scenarios to recognize, different behavior recognition algorithms (based on automata, Bayesian networks, AND/OR trees and temporal constraints) can be used. These algorithms use the concepts of "state", "event" and "scenario". A state is a spatio-temporal property valid at a given instant or stable over a time interval. An event is a change of state. A scenario is any combination of states and events. Scenarios corresponding to a sequence of events are represented as automata where events correspond to transitions (changes of state) within the automaton. When the correct chain of events occurs, the scenario is said to be recognized. For scenarios dealing with uncertainty, Bayesian networks can be used. For scenarios with a large variety of visual invariants (for example fighting), AND/OR trees can be used [24]. Visual invariants are visual features which characterize a given scenario independently of the scene and of the algorithms used. For example, for a fighting scenario, some visual invariants are an erratic trajectory of the group of fighters, one person lying down on the ground, or important relative dynamics inside the group, as shown in figure 3.
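A minimal automaton-based recognizer might look as follows; the event names and the purely sequential structure are illustrative assumptions.

class ScenarioAutomaton:
    # The scenario is a fixed chain of events; each matching event fires
    # a transition, and reaching the final state means recognition.
    def __init__(self, event_sequence):
        self.sequence = event_sequence
        self.state = 0                # index of the next expected event

    def on_event(self, event):
        if self.state < len(self.sequence) and event == self.sequence[self.state]:
            self.state += 1
        return self.state == len(self.sequence)   # True when recognized

# Hypothetical usage for a "jumping over the barrier" chain of events:
# jump = ScenarioAutomaton(["enters_zone", "close_to_barrier", "over_barrier"])
# for event in event_stream: alert = jump.on_event(event)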

For scenarios with multiple physical objects involved in complex temporal relationships, we use a recognition algorithm based on a constraint network whose nodes correspond to sub-scenarios and whose edges correspond to temporal constraints. Temporal constraints are propagated inside the network to avoid an exponential combination of the recognized sub-scenarios. The scenarios are modeled in terms of "physical objects" (people, static scene objects, zones of interest, etc.), "components" (which can be primitive states, composite states, primitive events or composite events) and "constraints" between the physical objects and/or the components (constraints can be temporal, spatial or logical). For each frame, scenarios are recognized incrementally, starting from the simplest ones (for example, "an individual is close to") up to the more complex ones. The temporal constraints are checked at each frame. This algorithm uses a declarative language to specify scenarios (see figure 4 for an example, and [21]). An ontology for video events [25] has been developed in the framework of the ARDA workshop on video events.

The whole processing chain runs in real time for two cameras on one off-the-shelf PC.

composite-event(vandalism_against_ticket_machine_one_man,
  physical-objects( (p: Person), (eq1: Ticket_Vending_Machine),
                    (z1: Ticket_Vending_Machine_Zone) )
  components( (c1: primitive-event  Enters_zone(p, z1))
              (c2: primitive-event  Move_close_to(p, eq1))
              (c3: composite-event  Stays_at(p, eq1))
              (c4: primitive-event  Goes_away_from(p, eq1))
              (c5: primitive-event  Move_close_to(p, eq1))
              (c6: composite-event  Stays_at(p, eq1)) )
  constraints( (c1; c2; c3; c4; c5; c6) ) )   // Sequence

Fig. 4. This is the description of a vandalism scenario using our declarative language. It describes the degradation of a piece of equipment by an individual: first the person moves close to the equipment. He/she stays close to the equipment, then he/she moves away from the equipment to avoid being seen. He/she then goes back close to the equipment, etc. The terms corresponding to the video event ontology are in bold.



Fig. 3. This figure shows 3 examples of visual invariants used to recognize a fighting scenario: (a) an erratic trajectory (shown in green on the image) of the group of fighters; (b) one of the fighters lying on the ground; (c) and (d) important relative dynamics inside the group (measured as the distance variation over time of the people composing the group).

4 MODULARITY AND FLEXIBILITY OF THE ARCHITECTURE

In software engineering, a classical property for a platform is its modularity and flexibility. We agree that this property has to be considered during the whole development of an automatic interpretation platform in order to ensure the reusability of the algorithms.

For a platform, modularity is the property of being composed of sub-units (modules), each of them achieving a particular and well-defined task. Modularity makes it possible to create systems that can be adapted to various applications (metro surveillance, bank agency surveillance, people counting, ...) in various environments (different metro stations, for example). Indeed, systems can be composed by carefully combining the modules corresponding to the particular requirements of the applications. Nevertheless, a problem is still open: the management of the data exchanges between modules. Our solution to this issue is based on the notion of a shared data manager. A shared data manager is a data structure where modules read and write input/output data. As we have seen, a module represents a platform functionality: in our case, for example, the video acquisition, segmentation or frame to frame tracking functionality. Input/output data are all the data exchanged between modules. For example, the acquisition module does not take any input data and outputs an image. The shared data manager can be thought of as a module which performs the "data management and distribution" task, following the modularity philosophy. The shared data manager manages the way data are exchanged between modules. A module is only connected to the shared data manager. To put some data in the shared data manager, the module calls the appropriate method of the shared data manager. The module is not aware of how and when the data will be used. Separating data management from module functionality allows, for instance, an application to be distributed on different machines by changing only the shared data manager implementation. This organization provides a homogeneous vision of the platform. Thus, building an application is a systematic process that consists in creating a shared data manager and selecting one or several modules to connect to it. If an additional development is needed (for example, when addressing an outdoor application for the first time), it is limited to the newly encountered problem (for instance, "illumination changes due to weather conditions") without affecting the other modules of the platform. Thanks to the shared data manager, we have the possibility to reuse the same algorithms with different architectures (for example distributed or multi-threaded architectures, or code embedded into cameras). To develop an Activity Monitoring System on a distributed architecture, a shared data manager has to be created on each computer. The role of the data managers is to automatically maintain and update the shared data. The distribution has no other effects on the platform. Moreover, the data types which are handled by the platform have precise and clear definitions; a piece of information is unique and has the same meaning over the whole platform. For example, a BLOB is a connected set of pixels detected by the segmentation (they can be moving or stationary) with an associated set of 2D descriptors like size, position and density. A MOBILE OBJECT is defined as a set of blobs "merged together" because it globally corresponds to the perception of a physical object in the image. It is characterized by a class (like PERSON or AIRPLANE) and by a set of 3D features (position and size). A TRACKED MOBILE OBJECT is a mobile object with (potentially) one or more temporal links to mobile object(s) of the previous frame.
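The shared data manager can be reduced to the following sketch; the method names are ours, and the real implementation also handles distribution across machines and the feedback channels mentioned in section 1.

class SharedDataManager:
    # Modules exchange data only through named slots of the manager and
    # never reference each other directly, so a module can be plugged,
    # unplugged or moved to another machine by changing only the manager.
    def __init__(self):
        self._slots = {}

    def put(self, name, value):       # e.g. "image", "blobs", "mobile_objects"
        self._slots[name] = value

    def get(self, name, default=None):
        return self._slots.get(name, default)

# A segmentation module would only see the manager:
# manager.put("blobs", detect_blobs(manager.get("image"), reference))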

We call flexibility the property of having a set of tunable parameters. This property implies the possibility to configure algorithms and to define different scenarios without changing the source code. To fulfill this property, we have decided to make all the internal parameters of every module tunable. To do that, parameter values are defined in separate files outside the code (i.e. outsourcing of parameters),


using a description formalism as proposed in [26]. These files are handled by the shared data manager as regular input/output data. The parameters can be changed during processing, enabling the parameter optimization explained in section 6.
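A deliberately simplified stand-in for such an outsourced parameter file and its loader (the actual description formalism is the one proposed in [26]):

def load_parameters(path):
    # One "name = value" entry per line; '#' starts a comment.
    params = {}
    with open(path) as f:
        for line in f:
            line = line.split("#", 1)[0].strip()
            if line:
                name, value = line.split("=", 1)
                params[name.strip()] = float(value)
    return params

# A hypothetical segmentation.cfg could contain:
#   diff_threshold = 25
#   min_blob_area  = 50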

Thanks to the shared data manager and the outsourcing of parameters, we have achieved a platform architecture which fulfills the modularity and flexibility properties.

5 SEPARATION BETWEEN ALGORITHMS AND A PRIORI KNOWLEDGE

This section focuses on the second property, the separation between algorithms and a priori knowledge. The VSIP platform uses a large amount of a priori knowledge, for two main reasons. First, it is often useful to design specific routines using additional a priori knowledge for correcting imprecise and uncertain data. For instance, we correct wrong 3D positions due to partial occlusion in a cluttered environment by adding a correcting step which uses the information coming from the 3D scene description (position and dimensions of context objects which can occlude people, and the type of occlusion they can cause). Thanks to this approach, we manage to recognize the jumping over the validation barrier scenario, which requires computing precisely the 3D position of people behind the barrier. Second, by providing an algorithm with knowledge it is possible to reduce the processing time. For example, on a sidewalk where only pedestrians can be observed, we will not try to classify mobile objects as vehicles.

The a priori knowledge is composed of two different types of information: image acquisition knowledge and a priori models. The first one is composed of the following information:

• Camera calibration parameters are used to compute the real position in the 3D scene of the 2D objects detected in the image.

• Hardware information contains the features of each piece of equipment (frame rate of the camera, network configuration, data compression rate, ...).

• Reference images are a set of predefined images representing the appearance (night or day) of the empty scene (the scene without mobile objects).

Models are of 2 types:

• 3D scene model contains the 3D geometry of the scene observed by the camera and of the objects present in the scene. These physical objects are of two types: contextual objects found in the empty scene (trash cans, benches, stamping machines, zones of interest...) and mobile objects which can evolve in the scene (persons, groups of people, airplanes, trains...). A piece of semantic information is associated with each object, like "occluding" for an object which can occlude people (with "on top" or "on bottom" to specify the type of occlusion), or "in/out" for a zone corresponding to an entry or an exit. A hypothetical extract of such a model is sketched after this list.

• Scenario models library. This information is independent of the camera. It consists in a set of predefined scenarios to recognize. These scenarios are described using a special declarative user-oriented language (see section 7).
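To fix ideas, here is a hypothetical extract of a 3D scene model, written as a Python structure purely for illustration (the real files use their own dedicated formalism, and all names below are made up):

SCENE_MODEL = {
    "contextual_objects": [
        {"name": "bench_1",                    # found in the empty scene
         "position_3d": (4.0, 7.5, 0.0), "size_3d": (1.8, 0.5, 0.45),
         "semantics": ["occluding", "on bottom"]},
    ],
    "zones": [
        {"name": "exit_zone_1",
         "polygon": [(0.0, 0.0), (3.0, 0.0), (3.0, 2.0), (0.0, 2.0)],
         "semantics": ["in/out"]},             # an entry or an exit
    ],
}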

While the use of a priori knowledge makes it possible to solve the interpretation problems better and more efficiently, only the separation between knowledge and algorithms enables the algorithms to be independent of the application and to be reusable in other situations. All the a priori knowledge in VSIP is stored in specific configuration files, independently of the code. For example, all cameras observing the same scene are processed by computers having the same 3D scene model. Also, applying an AMS to a new scene only requires changing the 3D scene model. We have also proposed an adapted formalism to describe each type of knowledge. For example, the scenario models are described using a special declarative user-oriented language, as shown in section 7.

Thanks to this knowledge organization, we have managed to separate the a priori knowledge from the algorithms.

6 AUTOMATIC EVALUATION

When facing new applications, it is often necessary to add to the platform new algorithms able to handle situations encountered for the first time. For example, for the bank agency monitoring application [27] we have developed a chair management module, which has been integrated into VSIP and is currently used by AMS for indoor applications dealing with chairs.

Our experience in building AMS has shown that, to handle real-world diversity, a reusable platform should usually contain a combination of simple algorithms dedicated to each type of situation rather than one very sophisticated algorithm handling all situations. Robustness in activity monitoring is then achieved when many algorithms can be easily combined in the same platform.

Once validated on a specific application, these new algorithms have to be integrated into the platform. To preserve the reusability of the platform and its robustness with respect to the whole set of applications, two problems arise:

• it is necessary to ensure that the new algorithms do not lower the quality of the results obtained by the other AMS built with the platform. In other words, it is important to be able to measure the impact of new algorithms on the quality of the results obtained by all AMS on a predefined set of sequences representative of the applications;

• we have to be able to find the good set of parameters which guarantees, for each new application, the best quality of results. It sometimes happens that, after the introduction of a new algorithm, the initial set


of parameters does not give satisfactory results anymore. Thus we have to be able to recompute them for each application (one set per application) in an automatic way.

To solve both problems we have developed an evaluation framework. This framework is based upon:

• a set of ground-truth sequences for each given application. With the term "ground truth" we describe a set of sequences for which a human operator has given the "best results" (the truth) that a system would have given if it had worked perfectly. Ground truth can be specified at each step of the platform: motion detection, frame to frame tracking, fusion, long term tracking and scenario recognition. For example, at the motion detection level, ground truth means drawing for each image a bounding box surrounding each mobile object evolving in the scene and labeling it with its type. At the tracking levels, it means correctly tracking the mobile objects by hand, even when a mobile object is partially or totally occluded. Finally, at the scenario level it means recognizing by inspection the scenarios depicted by the image sequences.

• a clear definition of the high-level data types used by the platform as interfaces between modules (detailed in section 4), and an XML format for each of these data types allowing their manipulation even outside VSIP. For example, figure 5 shows the XML format used to represent the annotation data type which is generated when a scenario is recognized.

<annotation_activity id="31" priority="2" class="alarm"
                     sub_class="jumping_over_barrier">
  <list_video_frames best_camera_area="HALL01" best_camera_id="C11">
    <video_frame id="42535" camera_area="HALL01" camera_id="C11">
    </video_frame>
  </list_video_frames>
  <time start_time_hour="2" start_time_min="21"
        start_time_sec="46" start_time_ms="799">
  </time>
  <list_activity_physical_objects>
    <physical_object id="104" role="source">
    </physical_object>
    <physical_object id="16" role="stat_reference">
    </physical_object>
  </list_activity_physical_objects>
</annotation_activity>

Fig. 5. XML annotation of a video: the recognized scenario is "jumping over the barrier". It involves two physical objects, one person (ID 104) and one validation barrier (ID 16). The scenario is best viewed on camera C11.

When a new algorithm is added to the platform, the set of ground-truth sequences is run automatically by each AMS. The evaluation results are compared with those obtained before the integration of the new algorithm. In case

of lower quality results, a first possibility is to recompute the set of parameters separately for each AMS. The framework allows a statistical learning technique to be applied for parameter tuning. For all ground-truth sequences, the algorithm is run with a modified set of parameters. If the results improve, the parameters are validated; if not, they are modified using an algorithm that heuristically explores the N-dimensional space of parameters (N being the number of parameters).
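The heuristic exploration is not specified in detail here; a naive coordinate-descent stand-in conveys the idea (run_on_ground_truth is an assumed callback returning a global evaluation score over the ground-truth sequences):

def tune_parameters(run_on_ground_truth, params, step=0.1, iterations=20):
    # Perturb one parameter at a time and keep the change only if the
    # score on the ground-truth sequences improves (validation step).
    best_score = run_on_ground_truth(params)
    for _ in range(iterations):
        for name in sorted(params):
            for direction in (+1, -1):
                trial = dict(params)
                trial[name] = trial[name] * (1.0 + direction * step)
                score = run_on_ground_truth(trial)
                if score > best_score:
                    best_score, params = score, trial
    return params, best_score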

If the improved set of parameters still gives worse results than the one used before the introduction of the new algorithm, then this algorithm is said to be not generic enough to be used for all applications. The next step is to understand precisely why the new algorithm fails and under which hypotheses it can be used.

For example, the repositioning algorithm (developed for the bank agency monitoring system) was designed to correct the position of individuals when they are wrongly detected behind a wall (in situations where the legs are not detected). The algorithm gives incorrect results in the train surveillance system because it wrongly corrects the position of people who are behind walls containing a window (see figure 6). Thus two algorithms have to be developed to handle scenes containing walls both with and without windows.

Fig. 6. This image shows a person (on the right) who is seen through a window in the wall. In this case the repositioning algorithm has to take the particular situation into account and avoid repositioning the person inside the train when he/she is actually outside.

Today this choice of algorithm is made manually in VSIP when building an AMS. We are currently working on extending the evaluation framework to determine automatically in which situations an algorithm can be used.

7 INTERACTION BETWEEN THE END-USERS AND THE DEVELOPERS

As we have seen in section 1, the interaction between the end-users and the developers is a development methodology useful to fulfill the separation between the platform and the


a priori knowledge (as described in section 5) and to perform automatic evaluation of the results (as described in section 6).

Moreover, system design is often driven by technical limitations rather than user requirements. The proposed approach consists in integrating not only user needs but also user knowledge in the development process in order to address real-life problems. This integration has three main interests. The first one is to provide a system well adapted to end-user needs. The second one is to provide a framework to assess the usefulness of the system on video sequences representing real-life situations. The last interest is the possibility to improve the efficiency and robustness of the system by using user knowledge. For example, detecting a pickpocket theft is impossible, but with the help of users we found some typical precursor events (e.g. blocking a passenger in an exit zone) which are easier to recognize. Collaboration with users is an incremental process during which different types of knowledge are taken into account. The collaboration is composed of three phases.

In the first phase, end-user motivations are collected to define goals and their priorities. In the case of metro stations, three goals were specified by the users: traffic free flow, passenger and employee security, and equipment protection. Based on these goals and on the importance given to each situation (frequency, gravity in terms of physical loss and costs), several scenarios are chosen. In the Barcelona (Spain) metro, for example, one of the major issues is fraud. The Barcelona metro stations are not equipped with efficient devices to control platform access: there are only simple barriers which are easy to jump over. Thus, metro managers decided that it would be interesting for the video surveillance system to automatically detect people jumping over the barriers. In the Brussels (Belgium) metro, fraud is not relevant because there are no validation barriers. However, access blocking is a real problem for different reasons pointed out by metro managers. The first one is the degradation of the traffic free flow. The second one concerns accidents that may occur when one or several individuals are blocking the escalators. In this case, people may pack together and fall. The last one is less obvious and concerns pickpocket activities. Actually, while a few individuals are blocking a passenger in an exit, an accomplice can take advantage of the situation to rob the passenger.

In a second phase, users who have a visual or ground experience (video surveillance operators and security agents, for example) specify precisely the course of each scenario: how the individuals present in the scene behave before, during and after the event. Users may also provide visual invariants which are characteristic of each behavior wherever it occurs, as shown in figure 3. Based on the detailed description of the course of each scenario and on the visual invariants, a set of video sequences representing abnormal situations and closely related but normal ones are recorded, with the help of actors if necessary. Using an XML language, each video is then annotated by end-users with the scenarios it represents.

A video annotation describes three pieces of information, as shown in figure 5: on which camera/frame we can see the scenario (tagged "video frame"), when the scenario occurs (tagged "time") and who is involved in the scenario (tagged "physical object"). As the formalism is the same for both kinds of information (end-user descriptions of scenario models and VSIP output), we are able first to make sure that recognized scenarios match user descriptions and second to automatically evaluate system efficiency by comparing user annotations and system results.
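A sketch of the scenario-level comparison; the matching rule and the temporal tolerance below are our assumptions, not the project's actual evaluation criteria.

def annotation_matches(user_ann, system_ann, tolerance_ms=2000):
    # A system result counts as correct if it reports the same scenario
    # class as the end-user annotation at roughly the same time.
    same_class = user_ann["sub_class"] == system_ann["sub_class"]
    close_in_time = abs(user_ann["start_ms"] - system_ann["start_ms"]) <= tolerance_ms
    return same_class and close_in_time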

The third phase corresponds to scenario modeling and recognition. It is a sensitive step because scenario models must be understood in the same way by the users and the system. Actually, users may want to easily modify or extend scenarios. The usual approach is to hard-code each scenario in the system. For example, given that in any application we are interested in recognizing a limited number of scenarios, it is often easier for developers to hard-code the scenario recognition routines (automata, AND/OR trees...) instead of developing a more reusable and complex algorithm able to automatically generate the routines corresponding to a textual description of a scenario. But this approach is not satisfactory, because it heavily limits the reusability of the developed routines and prevents non-developer users from being able to modify or extend by themselves the set of scenarios that can be recognized by the AMS. Our proposed approach introduces a new scenario representation language based on a video event ontology [25]. An ontology is the set of all the concepts and the relations between concepts shared by the community in a given domain. The ontology first facilitates the communication between the domain experts (end-users) and the developers. The ontology makes the video understanding systems user-centered and enables the end-users to fully understand the terms used to describe scenario models without being concerned by the low-level processing of the system. Moreover, the ontology is useful to evaluate the AMS and to understand exactly what type of events a particular system can recognize. This ontology is also useful for developers of AMS to share and reuse scenario models dedicated to the recognition of a specific event.

This video event ontology has been built in the framework of the ARDA workshops. It ensures that the terms are shared by several laboratories specialized in video analysis. Events are decomposed into different abstraction levels and into a hierarchical structure, with the aim of making the model generic and applicable to a wide range of applications. There are two main types of concepts to be represented: the physical objects of the observed scene and the video events occurring in the scene. A physical object can be a contextual object (e.g. a desk, a door) or a mobile object detected by a vision routine (e.g. a person, a car). A video event can be a primitive state, a composite state, a primitive event or a composite event. Primitive states are the atoms used to build the other concepts of the knowledge base of an AMS. A composed concept (i.e. a composite state or a composite


event) is represented by a combination of its sub-concepts (called components) and an optional set of events that cannot occur during the recognition of this concept.

The language based on this ontology makes it possible to describe in an intuitive and declarative way all the knowledge necessary to recognize scenarios (see figure 4).

Furthermore, to improve the incremental development process, we have developed a visualization tool that generates 3D animations and video sequences from scenario models. These sequences are useful both for users and for developers. Users can visually check that the scenario model corresponds to the scenario they want to specify. Developers have a tool to generate test sequences for debugging their code. The use of the video event ontology, of an adapted language for scenario modeling and of the visualization tool has made this collaboration efficient by keeping the knowledge coherent and accessible to all participants (end-users and developers).

8 END-USER ASSESSMENT AND VALIDATION FOR METRO STATIONS MONITORING APPLICATION

Using the VSIP platform we have built several AMS for different applications, as described in section 1 and illustrated in figure 1: a bank agency surveillance application, a metro activity monitoring system, a lock chamber access control application and so on. In this section we present the results of the end-user assessment and the technical validation of the metro activity monitoring system installed in the Sagrada Familia station of the Barcelona metro at the end of the European ADVISOR project (March 2003). In section 9 we present the corresponding validation results for two other applications (bank agency monitoring and lock chamber access control), for which the end-user assessment is scheduled for the beginning of 2005.

The AMS built for metro monitoring was the Activity Monitoring kernel of the final demonstrator of the ADVISOR project. Besides the AMS, the demonstrator includes the following components (a sketch of how they could be wired together is given after the list):

• a capture system, which digitizes the images coming from live cameras and plays back recorded sequences;

• a crowd monitoring system, delivering additional information about the crowd (like the direction of the crowd motion flow);

• an archive system, which records the input video sequences together with annotations describing the recognized scenario, if any. A second functionality of the archive is the possibility to act as a playback system, allowing the easy search and retrieval of specified sequences and/or recognized scenarios;

• a Human-Computer Interface allowing the operators to visualize the results, to control system parameters, and to access the archive system.
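Purely as an illustration of the data flow between these components, the sketch below shows one plausible way they could be connected around the AMS; all class and method names are hypothetical, as the paper does not specify the demonstrator's interfaces.

```python
# Hypothetical data flow of the ADVISOR demonstrator (names are ours).
class Demonstrator:
    def __init__(self, capture, crowd_monitor, ams, archive, hci):
        self.capture = capture              # digitizes live/recorded video
        self.crowd_monitor = crowd_monitor  # crowd density and motion flow
        self.ams = ams                      # activity monitoring kernel
        self.archive = archive              # video + scenario annotations
        self.hci = hci                      # operator interface

    def step(self):
        frame = self.capture.next_frame()
        crowd_info = self.crowd_monitor.analyze(frame)
        alerts = self.ams.recognize(frame, crowd_info)
        self.archive.store(frame, alerts)   # searchable by scenario later
        self.hci.display(frame, alerts)     # operator acknowledges/rejects
```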

The demonstrator has been presented to security operators from STIB (Brussels, Belgium) and TMB (Barcelona, Spain), two metro companies, followed by a tutorial explaining how to use it.

The end-user assessment and the technical validation were conducted using both live and recorded data. Four closed-circuit cameras at Sagrada Familia were connected to the AMS, providing live data from the metro station. In addition, four prerecorded sequences were also fed into the system. These sequences are composed of:

• scenes played by actors containing the various human behaviors to recognize. These sequences were intended to demonstrate the capability of the system to recognize predefined scenarios, such as fighting, that were unlikely to occur live during the evaluation and the validation;

• normal scenes coming from recordings made by security operators and showing normal behaviors. These sequences were intended to demonstrate the robustness of the system with respect to false alerts (i.e. alerts generated even though no predefined scenario is happening in the video).

8.1 End-user assessment

The end-user assessment consists, for the end-users (video surveillance operators), in establishing how useful the system is. During the end-user assessment, the end-users were asked to use the system, as part of their regular surveillance task, for a few hours a day during a week and to evaluate its performance and usefulness. The results were documented by the completion of a comprehensive questionnaire, which pointed out the following remarks.

The operators found that the AMS worked correctly and recognized with enough precision the predefined scenarios (fighting, blocking, overcrowding, jumping over the barrier and vandalism against equipment).

The scenarios corresponded to the following situations (a sketch of one such rule, the blocking test, is given after the list):

• blocking occurs when a group of at least 2 people is stopped in a predefined zone for at least 4 seconds and can potentially block the path of other people;

• fighting occurs when a group of people (at least 2 persons) are pushing, kicking or grasping each other for at least 2 seconds;

• overcrowding occurs when the density of the people in an image is greater than a specified threshold;

• jumping over the barrier occurs when a person jumps over a specified ticket validation barrier;

• vandalism against equipment occurs when an individual is damaging a piece of equipment in the image.
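The sketch announced above: a minimal Python rendering of the blocking rule, assuming the vision modules already provide group tracks as timestamped member positions. The data layout and the Zone class are our assumptions for illustration, not the platform's actual scenario engine.

```python
from typing import List, Tuple

MIN_GROUP_SIZE = 2   # people, from the blocking definition above
MIN_DURATION = 4.0   # seconds, from the blocking definition above

class Zone:
    """Axis-aligned predefined zone on the ground plane."""
    def __init__(self, x_min, y_min, x_max, y_max):
        self.x_min, self.y_min = x_min, y_min
        self.x_max, self.y_max = x_max, y_max

    def contains(self, p: Tuple[float, float]) -> bool:
        return self.x_min <= p[0] <= self.x_max and self.y_min <= p[1] <= self.y_max

def is_blocking(track: List[Tuple[float, List[Tuple[float, float]]]],
                zone: Zone) -> bool:
    """track: (timestamp, member positions) samples of one stopped group.
    For brevity, the 'stopped' test (low group speed) is assumed done upstream."""
    since = None
    for t, members in track:
        if len(members) >= MIN_GROUP_SIZE and all(zone.contains(p) for p in members):
            since = t if since is None else since
            if t - since >= MIN_DURATION:   # group held the zone long enough
                return True
        else:
            since = None                    # condition broken, reset the timer
    return False
```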

False alerts happened rarely and were not a problem, because the operators had the time to acknowledge or to reject the generated alert. Operators pointed out that some effort should be made on the system ergonomics, like easing the acknowledgment of an alert or automating the replay on screen of the videos corresponding to a recognized scenario. They concluded by stating that the AMS was a real help to the surveillance task and that it should be used by metro companies to ease the security operators' work.

8.2 Technical validation

The technical validation consists, for technical people, in determining whether the system recognizes the specified scenarios. A technical validation of the AMS was performed at the Sagrada Familia metro station. For the validation task, the system was tested using four input channels in parallel, the four channels being composed of three recorded sequences and one live input stream. The validation of the scenario recognition involved playing the sequences through the system and reporting the resulting alerts generated by the AMS. The sequences used for validation were annotated with ground-truth corresponding to the type and the occurrence time of the scenarios. The results obtained when a sequence was played through the system were then compared with the ground-truth. If the system generated the correct scenario recognition, an estimate of the accuracy of the recognition was obtained by measuring the overlapping length between the observed scenario (ground-truth) and the occurrence of the scenario recognized by the AMS. So, for example, if the AMS reported a sequence as showing fighting for 45 seconds when the ground-truth shows that 60 seconds of fighting occurred, then a score of 75 % was awarded. The score also included true negative periods of the sequence: if nothing happens and no alerts are generated, then the sequence is considered as correctly recognized. A delay of 5 seconds between the beginning of the scenario in the ground-truth and the beginning of the recognition is permitted in the measurement, as this is the delay necessary for the scenario recognition algorithm to start recognizing the scenarios.
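This scoring rule can be written down compactly. The Python function below is our reading of it (interval overlap as a fraction of the ground-truth duration, with the 5-second start-up delay discounted); it is a sketch, not the evaluators' actual tool.

```python
DELAY_TOLERANCE = 5.0  # seconds allowed before the recognition starts

def timing_accuracy(gt_start: float, gt_end: float,
                    rec_start: float, rec_end: float) -> float:
    """Fraction of the ground-truth interval covered by the recognized alert,
    not penalizing a recognition that starts within the permitted delay."""
    effective_start = max(gt_start, rec_start - DELAY_TOLERANCE)
    overlap = max(0.0, min(gt_end, rec_end) - effective_start)
    return overlap / (gt_end - gt_start)

# Example from the text: 45 s of reported fighting against 60 s of
# ground-truth fighting yields a score of 0.75 (75 %).
assert timing_accuracy(0.0, 60.0, 0.0, 45.0) == 0.75
```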

The live channel was validated visually by the evaluators and was used mainly to check the rate of false alarms.

The results of the validation are presented and analyzed in sections 8.2.1 and 8.2.2.

8.2.1 Ground-truth of the validation data

The sequences used in the validation of the AMS are composed of 29 different subsequences containing behaviors played by actors (corresponding to the fighting, blocking, overcrowding, jumping over the barrier and vandalism against equipment scenarios) and 3 long subsequences showing people with no behavior of interest. The 32 subsequences were duplicated several times at different places in the video test sequences, giving a total of 81 occurrences of scenarios supposed to be recognized by the AMS and 22 occurrences of “normal” activities (supposed to generate no alerts).

We define as “ground-truth” the set of three pieces of information: the type, the starting time and the duration of the scenarios recognized by a competent authority (technical people different from both the end-users and the system developers). The ground-truth data is created by visual inspection, that is, the competent authority examines the sequences and decides which behaviors have occurred. This process is subjective: a scenario classified as overcrowding by an operator A could be considered as “normal” by a different operator B. This fact has no major consequences, because even in the case of end-users (that is, the people who have to use the system and judge its utility) the definition could change from one person to another.

8.2.2 Scenario recognition validation results

Overall, the system was validated for over four hours using three recorded videos and one live camera, giving a total of more than 16 hours of validation. Table 1 details the results of the validation process.

The results of the validation for the fighting scenario show a success rate of 95 %, and the reports were found to be 61 % accurate in the timing and duration of the alert report. Note that the accuracy is subject to the human interpretation of when fighting begins, which is not always clear. For example, two people might begin fighting by pushing each other, so it is unclear whether fighting has begun at that point or when they actually come to blows.

The blocking scenario was detected with a detection rate of 78 % and an average accuracy of 60 %. One false blocking report was generated during the validation, when there was only one person standing by the exit barriers, whereas at least two people are required to be blocking a predefined area to constitute a blocking event.

The vandalism against equipment scenario contains an actor repeatedly going to a piece of equipment and attempting to break it open. As people approach, he moves away from the equipment and returns to it later. The system recognizes this as one long act of vandalism rather than several individual acts and, therefore, has been scored as such. The main problem of this scenario was not to lose the tracks of the people when they cross other people during the scenario. This scenario gives a success rate of 100 % with an accuracy of 71 %.

The jumping over the barrier scenario gives a success rate of 88 %. The main difficulty of this scenario was to handle occlusions and to correctly compute the position of people relative to the ticket validation machine (in front of/behind).

The overcrowding scenario shows a success rate of 100 %, with an overall accuracy of 80 %. The ground-truthing of an overcrowding alert is also somewhat subjective, since it is not exactly obvious at which point the scene becomes overcrowded.

The AMS was provided with a live feed from the Sagrada Familia station. The camera was situated in the main hall and overlooked the escalator from one of the platforms. Therefore, during busy periods, a large number of people disembark from the train, go up the escalator and enter the field of view of the camera. The relatively high density of people caused the AMS to trigger an overcrowding alert.


SCENARIO NAME    NUMBER OF   NUMBER OF    % OF         ACCURACY   NUMBER OF
                 BEHAVIORS   RECOGNIZED   RECOGNIZED              FALSE ALERTS
                             INSTANCES    INSTANCES
fighting             21          20          95 %        61 %         0
blocking              9           7          78 %        60 %         1
vandalism             2           2         100 %        71 %         0
jumping o.t.b.       42          37          88 %       100 %         0
overcrowding          7           7         100 %        80 %         0
TOTAL                81          73          90 %        85 %         1

Table 1. This table shows the results of the technical validation of the AMS. For each scenario, we report in particular the percentage of recognized instances of this scenario (fourth column) and the accuracy in time of the recognition, that is, the percentage of the duration of the shown behavior which is “covered” by the generation of the corresponding alert by the system; this value is an average over all the scenario instances (fifth column).

This is demonstrated by the fact that many such alerts were triggered on the busy Friday afternoon, whereas only two were generated on the much quieter Saturday morning. The overcrowding alerts have been scored as correct because they were generated by a relatively high density of people emerging from the escalator after getting off a train. Nevertheless, the high number of overcrowding alerts suggests that it would be interesting to synchronize the overcrowding scenario detection with the train arrivals, to avoid generating an alert when the crowd is only disembarking from a train. No other behaviors were observed during the validation on this live channel, except the blocking false alert detailed previously.

Both the validation and the assessment scored the monitoring system as satisfactory. The next step is to test its performance and usability on larger camera networks and over longer periods of time.

9 SOME OTHER VALIDATION RESULTS

9.1 Bank agency monitoring system

As for the previous application, many discussions with domain experts were needed in order to define the scenarios, corresponding to interesting human behaviors, which have to be recognized in bank agencies. A bank scenario can be modeled in two parts: the attack precursor part (i.e. the robber approach) and the attack part.

Today, classical bank agencies gradually evolve towards agencies with one or several counters without money, an ATM (Automatic Teller Machine), a safe room and offices for commercial employees. The safe room is then the most significant zone inside the bank agency, since all the available money is stored inside. As a consequence, all irregular behaviors or bank protocol infringements (involving either robbers or maintenance and cleaning employees) must be detected near the safe entrance. The protocol can be different for each bank. For instance, one of these rules is that only one person can enter the safe room at a time. In this case, the system must raise an alert when more than one person is inside the safe room (a sketch of such a rule is given below). For bank experts, this part of the scenario (the number of people inside the safe) must be recognized with a very high confidence.

Moreover, it is interesting to recognize a robber approaching the safe entrance. Modeling all bank attack precursors is a difficult task due to their large number and variety. We list here some examples:

• employee attack: frequent, often stealthy, rapid and hardly observable even for human beings. The bank employee is threatened, but it is generally difficult to tell the difference from a classical customer request.

• safe attack: not frequent. Bank employees and customers are threatened. People are shocked and things can take a bad turn.

• aggressive attack: bank employees and customers are threatened. The robber has lost his/her self-control, money is not the main motivation and the robbery usually leads to a drama.

This scenario part is optional for bank attack detection but important in order to anticipate potential actions and prevent any drama. Therefore, we have modeled a large set of scenarios to take into account the variety of bank robberies.

The behavior recognition assessment was realized in live conditions inside a bank agency during one hour, together with end-users. The assessment was based on the following scenarios:

• scenarios with 2 persons: the bank employee is behind the counter. The robber enters the bank agency, goes to the counter and threatens the employee. Both people go to the safe and the safe gate is opened.

• scenarios with 3 persons: the bank employee is behind the counter. A customer enters the bank agency, goes to the counter and stays in front of it. Then the robber enters the bank, joins the customer and threatens both the employee and the customer. The employee and the robber go to the safe and the safe gate is opened. The customer stays behind the counter or leaves the agency.

A true positive corresponds to an alert raised when a real bank attack happens (simulated by actors), a false negative is a missed alert when a real bank attack happens, and a false positive is an alert raised when no real bank attack happens. The bank attack scenario with 3 persons was played 16 times: we obtained 93.75 % true positives (15 of the 16 runs), 6.25 % false negatives and 0 % false positives. The scenario with 2 persons was played more than 10 times and we obtained 100 % true positives. These results are summarized in table 2.

SCENARIO         NUMBER OF   TRUE       FALSE      FALSE
                 INSTANCES   POSITIVE   NEGATIVE   POSITIVE
with 3 persons       16       93.75 %     6.25 %     0 %
with 2 persons       10      100 %        0 %        0 %

Table 2. Validation results for a live installation of the bank agency monitoring system.

The main reason we obtained a good true positive percentage is, first, that the scenarios were precisely modeled thanks to the interaction with domain experts through an incremental process. The second reason for this success is the cooperation of two cameras monitoring the agency, which gives better results thanks to the redundancy of information.

A second end-user assessment and validation phase will be held in a different bank agency, with other scenarios, at the beginning of 2005.

9.2 Lock chamber access monitoring system

Buildings with lock chambers at their entrances often face the problem of controlling how many people enter or exit the building. Sometimes these chambers are activated with a personal pass which allows the passage of its owner only. Nothing (but a human operator or a CCTV camera) can prevent the owner of a pass from letting a second person enter at the same time. Another motivation for this application is to be able to know exactly the number of people inside the building in case of fire alarms.

With the VSIP platform we built a lock chamber access monitoring system which is able to count the number of people passing through a general lock chamber, defined as a closed space. The AMS can monitor the trajectories of people (where they come from and where they go); this feature is particularly useful in the case of lock chambers with several access points.

This application uses automaton-based scenario recognition algorithms to monitor the trajectories of people and to count them. The limited field of view of the cameras (for example, see figure 1(f)) and the high number of people that can be present at the same time in the field of view make this application challenging.
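A minimal sketch of such an automaton, under assumed interfaces: each tracked person drives a small state machine from an entrance to an exit, and completed traversals are counted per route. This illustrates the idea rather than VSIP's actual recognition algorithms.

```python
from collections import Counter

OUTSIDE, INSIDE = "outside", "inside"

class LockChamberCounter:
    """Counts passages through a lock chamber with several access points."""
    def __init__(self):
        self.state = {}            # person id -> (state, entrance access point)
        self.passages = Counter()  # (entrance, exit) -> completed traversals

    def on_crossing(self, person_id, access_point):
        """Called by the tracker when a person crosses an access point."""
        state, entrance = self.state.get(person_id, (OUTSIDE, None))
        if state == OUTSIDE:       # entering the chamber
            self.state[person_id] = (INSIDE, access_point)
        else:                      # leaving: record the (entrance, exit) route
            self.passages[(entrance, access_point)] += 1
            del self.state[person_id]

    def occupancy(self) -> int:
        return len(self.state)     # people currently inside the chamber
```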

We have validated the lock chamber access AMS in two different cases: in the first one, a camera monitors a small lock chamber with two transparent doors on opposite sides; in the second one, a camera monitors a larger lock chamber with 6 entrances on its four sides, five of these entrance points being provided with doors.

For each sequence showing one or several persons passing from one entrance to another, we classify the result into four different classes:

• good detection: the entrance and the exit points of each person passing through the lock chamber have been correctly detected for all persons;

• bad detection: the entrance or exit point (or both) of one or more persons is incorrectly detected, or the number of detected persons is wrong;

• misdetection: someone is passing through the lock chamber but the system does not detect the person;

• false alarm: a person is detected as passing from an entrance to an exit when there is nobody in the field of view of the camera.

Table 3 summarizes the validation results obtained by our AMS in both cases. Percentages are computed using the following formula:

FAP = FA / (GD + BD + WD)

where FAP stands for “false alarm percentage” and FA, GD, BD and WD are the total numbers of false alarms, good detections, bad detections and misdetections over all the instances. Analogous formulas are used for the good detection, bad detection and misdetection percentages. The sequences used for the validation are everyday-life sequences, showing the normal passage of people in the small and large lock chambers during normal work activities (the lock chambers are located in a company).
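In code, the four percentages are derived from the raw counts as below; this is a direct transcription of the formula above, with a function name of our own choosing.

```python
def detection_percentages(gd: int, bd: int, wd: int, fa: int) -> dict:
    """gd, bd, wd, fa: totals of good detections, bad detections,
    misdetections and false alarms over all instances."""
    total = gd + bd + wd
    return {
        "good detections": 100.0 * gd / total,
        "bad detections": 100.0 * bd / total,
        "misdetections": 100.0 * wd / total,
        "false alarms": 100.0 * fa / total,  # FAP = FA / (GD + BD + WD)
    }

# For instance, counts of 16 good and 1 bad detection over 17 instances
# roughly reproduce the first row of table 3 (94.1 % / 5.9 %).
print(detection_percentages(16, 1, 0, 0))
```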

We are currently extending the validation of this application using a larger set of sequences, and a live end-user assessment of this AMS is scheduled for the beginning of 2005.

10 CONCLUSION

Our goal is to obtain a reusable and performant activity monitoring platform (called VSIP). To achieve this goal, we believe that a unique global and sophisticated algorithm is not adapted, because it cannot handle the large diversity of real world applications. However, such a platform can be achieved if it can easily combine and integrate many algorithms. Therefore we have presented three properties that an activity monitoring platform should have to enable its reusability for different applications and to insure performance quality. We have defined these properties as follows: modularity and flexibility, separation between algorithm code and a priori knowledge, and automatic evaluation. We have then proposed a development methodology to fulfill the last two properties, which consists in the interaction between end-users and developers during the whole development of a new activity monitoring system for a specific application.

We have then explained how we managed to develop VSIP following the given properties. We have shown how a shared data manager, the outsourcing of parameters and the use of clear definitions of data structures make it possible to achieve modularity and flexibility. We have explained how the organization of knowledge through description files and a language dedicated to the description of scenarios permits a clear separation between the algorithms and the a priori knowledge provided to the platform. We have shown that automatic evaluation allows developers to insure that new algorithms fulfill their specifications and to keep platform performance over a set of selected applications. The evaluation framework also allows learning techniques to be applied to tune the parameters of an AMS dedicated to a specific application. We have underlined that the interaction between end-users and developers was made possible by the definition of a video event ontology, an adapted language for scenario modeling and a tool to visualize the specified scenario models.

To illustrate the feasibility of our approach, we have presented VSIP, an activity monitoring platform fulfilling the three properties. This platform has been used to build activity monitoring systems dedicated to different applications, taking advantage of a deep interaction with end-users. We have described three systems which have been validated and three other systems currently under development, whose validation will be completed in the near future.

The activity monitoring platform still presents some limitations, the most important being the difficulty, when adding a new algorithm to the platform, of understanding what the algorithm's weaknesses are and how to fix them. So we are currently developing tools to extend the evaluation framework. The goal is to help developers analyze algorithm shortcomings automatically, in order to understand precisely under which hypotheses the algorithms can be used.

11 REFERENCES

[1] Y. Ivanov, C. Stauffer, A. Bobick, and W. E. L. Grimson. Video surveillance of interactions. In Proceedings of the Second IEEE International Workshop on Visual Surveillance (VS1999), Fort Collins, Colorado, USA, pages 82–89, June 26th, 1999.

[2] J. Menendez and S. A. Velastin. A method for obtaining neural network training sets in video sequences. In Proceedings of the Third IEEE International Workshop on Visual Surveillance (VS2000), Dublin, Ireland, pages 69–75, July 1st, 2000.

[3] R. Cucchiara, C. Grana, M. Piccardi, and A. Prati. Detecting moving objects, ghosts and shadows in video streams. IEEE Transactions on PAMI, 25(10):1337–1342, 2003.

[4] R. Cucchiara, C. Grana, A. Prati, G. Tardini, and R. Vezzani. Using computer vision techniques for dangerous situation detection in domotic applications. In Proceedings of the International Conference on Intelligent Distributed Surveillance Systems (IDSS04), London, Great Britain, pages 1–5, February 23rd, 2004.

[5] AVITRACK European Research Project.http://www.avitrack.net.

[6] M. Borg, D. Thirde, J. Ferryman, F. Fusier, F. Bremond, and M. Thonnat. An integrated vision system for aircraft activity monitoring. In Proceedings of the 6th IEEE International Workshop on Performance Evaluation of Tracking and Surveillance (PETS05), Breckenridge, Colorado, USA, January 7th, 2005.

[7] H. Buxton and S. Gong. Visual surveillance in a dynamic and uncertain world. Artificial Intelligence, 78(1-2):431–459, 1995.

[8] H.-H. Nagel. Image sequence evaluation: 30 years and still going strong. In A. Sanfeliu, J. J. Villanueva, et al., editors, Invited Lecture in Proc. of the 15th International Conference on Pattern Recognition (ICPR2000), Barcelona, Spain, volume 1, pages 149–158, September 3rd–7th, 2000.

[9] I. Haritaoglu, D. Harwood, and L. S. Davis. W4: Real-time surveillance of people and their activities. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):809–830, 2000.

[10] N. M. Oliver, B. Rosario, and A. P. Pentland. A Bayesian computer vision system for modelling human interactions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):831–843, 2000.

[11] N. Johnson and D. C. Hogg. Learning the distribution of object trajectories for event recognition. Image and Vision Computing, 14:609–615, 1996.


TYPE OF SEQUENCE                  NUMBER OF   GOOD         BAD          MISDETECTIONS   FALSE
                                  INSTANCES   DETECTIONS   DETECTIONS                   ALARMS
small lock chamber, 1 person
passing alone                         17        94.10 %      5.90 %         0.00 %       0.00 %
small lock chamber, 2 or more
people passing together               25        96.00 %      4.00 %         0.00 %       0.00 %
large lock chamber, 1 person
passing alone                         72        94.50 %      5.50 %         0.00 %       2.00 %
large lock chamber, 2 or more
people passing together               28        92.90 %      7.10 %         0.00 %       2.00 %

Table 3. Validation results for a lock chamber access monitoring system.

[12] Collins, Lipton, Kanade, Fujiyoshi, Duggins, Tsin, Tolliver, Enomoto, and Hasegawa. A system for video surveillance and monitoring: VSAM final report. Technical Report CMU-RI-TR-00-12, Robotics Institute, Carnegie Mellon University, May 2000.

[13] DARPA, editor. Image Understanding Workshop on Video Surveillance and Monitoring, Monterey, California, November 1998.

[14] ARDA VACE. http://www.ic-arda.org/InfoExploit/vace/index.html.

[15] Homeland Security ARPA (Advanced Research Projects Agency). http://www.hsarpabaa.com.

[16] O. Javed, Z. Rasheed, K. Shafique, and M. Shah. Tracking across multiple cameras with disjoint views. In Proceedings of the 9th IEEE International Conference on Computer Vision (ICCV03), Nice, France, October 2003.

[17] IEEE Computer Society, editor. IEEE International Series of Workshops on Performance Evaluation of Tracking and Surveillance (PETS). IEEE Computer Society, 2002. http://visualsurveillance.org.

[18] T. Ellis. Performance metrics and methods for tracking in surveillance. In Proceedings of the 3rd IEEE Workshop on PETS, Copenhagen, Denmark, June 2002.

[19] B. Georis, F. Bremond, M. Thonnat, and B. Macq. Use of an evaluation and diagnosis method to improve tracking performances. In Proceedings of the 3rd IASTED International Conference on Visualization, Imaging and Image Processing (VIIP03), Benalmadena, Spain, September 2003.

[20] S. Hongeng, F. Bremond, and R. Nevatia. Representation and optimal recognition of human activities. In IEEE Proceedings of the International Conference on Computer Vision and Pattern Recognition (CVPR00), South Carolina, USA, 2000.

[21] T.-V. Vu, F. Bremond, and M. Thonnat. Automatic video interpretation: a novel algorithm for temporal scenario recognition. In Proceedings of the 18th International Joint Conference on Artificial Intelligence (IJCAI03), Acapulco, Mexico, August 2003.

[22] C. Tornieri, F. Bremond, and M. Thonnat. Updating of the reference image for visual surveillance systems. In Proceedings of the International Conference on Intelligent Distributed Surveillance Systems (IDSS03), London, Great Britain, September 2002.

[23] A. Avanzi, F. Bremond, and M. Thonnat. Tracking multiple individuals for video communication. In Proceedings of the International Conference on Image Processing (ICIP01), Thessaloniki, Greece, October 2001.

[24] F. Cupillard, F. Bremond, and M. Thonnat. Behaviour recognition for individuals, groups of people and crowd. In Proceedings of the International Conference on Intelligent Distributed Surveillance Systems (IDSS03), London, Great Britain, September 2002.

[25] F. Bremond, N. Maillot, M. Thonnat, and T. V. Vu. Ontologies for video events. Technical Report RR-5189, ORION Team, Institut National de Recherche en Informatique et Automatique (INRIA), May 2004.

[26] S. Moisan and M. Thonnat. What can program supervision do for program reuse. IEE Proceedings - Software, volume 147, pages 179–185, October 2000.

[27] B. Georis, M. Maziere, F. Bremond, and M. Thonnat. A video interpretation platform applied to bank agency monitoring. In Proceedings of the International Conference on Intelligent Distributed Surveillance Systems (IDSS04), London, Great Britain, pages 46–50, February 23rd, 2004.
