
Object Detection and Tracking for Intelligent Video Surveillance

Kyungnam Kim and Larry S. Davis

1 HRL Laboratories, LLC, Malibu, CA, [email protected]

2 Computer Science Dept., University of Maryland, College Park, MD, [email protected]

Abstract. As CCTV/IP cameras and network infrastructure become cheaper and more affordable, today's video surveillance solutions are more effective than ever before, providing new surveillance technology that is applicable to a wide range of end-users in retail sectors, schools, homes, office campuses, industrial/transportation systems, and government sectors. Vision-based object detection and tracking, especially for video surveillance applications, is studied from algorithms to performance evaluation. This chapter is composed of three topics: (1) background modeling and detection, (2) performance evaluation of sensitive target detection, and (3) multi-camera segmentation and tracking of people.

Keywords: video surveillance, object detection and tracking, background subtraction, performance evaluation, multi-view people tracking, CCTV/IP cameras.

Overview

This book chapter describes vision-based object detection and tracking for video surveillance applications. It is organized into three sections. In Section 1, we describe a codebook-based background subtraction (BGS) algorithm used for foreground detection. We show that the method is suitable for both stationary and moving backgrounds in different types of scenes, and applicable to compressed videos such as MPEG. Important improvements to the above algorithm are presented - automatic parameter estimation, layered modeling/detection and adaptive codebook updating. In Section 2, we describe a performance evaluation technique, named PDR analysis. It measures the sensitivity of a BGS algorithm without assuming knowledge of the actual foreground distribution. Then PDR evaluation results for four different background subtraction algorithms are presented along with some discussions. In Section 3, a multi-view multi-target multi-hypothesis tracker is proposed. It segments and tracks people on a ground plane. Human appearance models are used to segment foreground pixels obtained from background subtraction. We developed a method to effectively integrate segmented blobs across views on a top-view reconstruction, with the help of ground plane homography. The multi-view tracker is extended efficiently to a multi-hypothesis framework (M3Tracker) using particle filtering.

W. Lin et al. (Eds.): Multimedia Analysis, Processing & Communications, SCI 346, pp. 265–288. springerlink.com © Springer-Verlag Berlin Heidelberg 2011


1 Background Modeling and Foreground Detection

Background subtraction algorithm

The codebook (CB) background subtraction algorithm we describe in this section adopts a quantization/clustering technique [5] to construct a background model (see [28] for more details). Samples at each pixel are clustered into a set of codewords. The background is encoded on a pixel-by-pixel basis.

Let X be a training sequence for a single pixel consisting of N RGB-vectors: X = {x_1, x_2, ..., x_N}. Let C = {c_1, c_2, ..., c_L} represent the codebook for the pixel consisting of L codewords. Each pixel has a different codebook size based on its sample variation. Each codeword c_i, i = 1...L, consists of an RGB vector v_i = (R_i, G_i, B_i) and a 6-tuple aux_i = ⟨Ǐ_i, Î_i, f_i, λ_i, p_i, q_i⟩. The tuple aux_i contains intensity (brightness) values and temporal variables described below.

Ǐ, Î : the min and max brightness, respectively, that the codeword accepted;

f : the frequency with which the codeword has occurred;

λ : the maximum negative run-length (MNRL), defined as the longest interval during the training period that the codeword has NOT recurred;

p, q : the first and last access times, respectively, that the codeword has occurred.

In the training period, each value, x_t, sampled at time t is compared to the current codebook to determine which codeword c_m (if any) it matches (m is the matching codeword's index). We use the matched codeword as the sample's encoding approximation. To determine which codeword will be the best match, we employ a color distortion measure and brightness bounds. The detailed pseudo-algorithm is given below.

Algorithm for Codebook Construction

I. L ← 0 (← means assignment), C ← ∅ (empty set)
II. for t = 1 to N do
  i. x_t = (R, G, B), I ← R + G + B
  ii. Find the codeword c_m in C = {c_i | 1 ≤ i ≤ L} matching to x_t based on two conditions (a) and (b):
      (a) colordist(x_t, v_m) ≤ ε1
      (b) brightness(I, ⟨Ǐ_m, Î_m⟩) = true
  iii. If C = ∅ or there is no match, then L ← L + 1. Create a new codeword c_L by setting
      v_L ← (R, G, B)
      aux_L ← ⟨I, I, 1, t − 1, t, t⟩.
  iv. Otherwise, update the matched codeword c_m, consisting of v_m = (R_m, G_m, B_m) and aux_m = ⟨Ǐ_m, Î_m, f_m, λ_m, p_m, q_m⟩, by setting
      v_m ← ( (f_m R_m + R)/(f_m + 1), (f_m G_m + G)/(f_m + 1), (f_m B_m + B)/(f_m + 1) )
      aux_m ← ⟨ min{I, Ǐ_m}, max{I, Î_m}, f_m + 1, max{λ_m, t − q_m}, p_m, t ⟩.
  end for
III. For each codeword c_i, i = 1...L, wrap around λ_i by setting λ_i ← max{λ_i, (N − q_i + p_i − 1)}.

The two conditions (a) and (b) are satisfied when the pure colors of x_t and c_m are close enough and the brightness of x_t lies between the acceptable brightness bounds of c_m. Instead of finding the nearest neighbor, we just find the first codeword to satisfy these two conditions. ε1 is the sampling threshold (bandwidth).

We refer to the codebook obtained from the previous step as the fat codebook. In the temporal filtering step, we refine the fat codebook by separating the codewords that might contain moving foreground objects from the true background codewords, thus allowing moving foreground objects during the initial training period. The true background, which includes both static pixels and moving background pixels, usually is quasi-periodic (values recur in a bounded period). This motivates the temporal criterion of MNRL (λ), which is defined as the maximum interval of time that the codeword has not recurred during the training period.

Let M denote the background model (a new codebook after temporal filtering):

M = {c_m | c_m ∈ C ∧ λ_m ≤ T_M}.    (1)

Usually, the threshold T_M is set equal to half the number of training frames, N/2.
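For concreteness, the per-pixel construction and temporal filtering can be sketched as below. This is a minimal, illustrative Python sketch, not the authors' implementation: colordist() follows Eq. 2 below, while the brightness bounds (the α and β factors) follow the fuller description in [28] and are assumptions here, as are all names and default values.

```python
# Minimal sketch of per-pixel codebook construction (Steps I-III) and temporal filtering
# (Eq. 1). Illustrative only: colordist() follows Eq. 2; the brightness test below uses
# the alpha/beta bounds of [28] and I = R+G+B as the brightness value, a simplification
# of the brightness function of Eq. 3.
import math

class Codeword:
    def __init__(self, rgb, I, t):
        self.v = list(rgb)                       # RGB vector v_i
        self.I_min, self.I_max = I, I            # min/max brightness accepted
        self.f, self.lam, self.p, self.q = 1, t - 1, t, t

def colordist(x, v):
    """Color distortion of Eq. 2: distance from x to the line through the origin and v."""
    xx = sum(c * c for c in x)
    vv = sum(c * c for c in v) + 1e-12
    xv = sum(a * b for a, b in zip(x, v))
    p2 = (xv * xv) / vv
    return math.sqrt(max(xx - p2, 0.0))

def brightness_ok(I, cw, alpha=0.5, beta=1.25):
    # Acceptable brightness range derived from the stored bounds (alpha, beta as in [28]).
    low = alpha * cw.I_max
    hi = min(beta * cw.I_max, cw.I_min / alpha)
    return low <= I <= hi

def build_codebook(samples, eps1):
    """samples: list of (R, G, B) values of one pixel over N training frames."""
    C = []
    for t, x in enumerate(samples, start=1):
        I = sum(x)
        match = next((c for c in C
                      if colordist(x, c.v) <= eps1 and brightness_ok(I, c)), None)
        if match is None:                        # Step II-iii: create a new codeword
            C.append(Codeword(x, I, t))
        else:                                    # Step II-iv: update the matched codeword
            f = match.f
            match.v = [(f * vi + xi) / (f + 1) for vi, xi in zip(match.v, x)]
            match.I_min, match.I_max = min(I, match.I_min), max(I, match.I_max)
            match.f, match.lam = f + 1, max(match.lam, t - match.q)
            match.q = t
    N = len(samples)
    for c in C:                                  # Step III: wrap around lambda
        c.lam = max(c.lam, N - c.q + c.p - 1)
    return [c for c in C if c.lam <= N // 2]     # Eq. 1 with T_M = N/2
```

build_codebook() would be run independently for every pixel of the training sequence.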

To cope with the problem of illumination changes such as shading and highlights, we utilize a color model [28] separating the color and brightness components. When we consider an input pixel x_t = (R, G, B) and a codeword c_i where v_i = (R_i, G_i, B_i), we have

‖x_t‖² = R² + G² + B², ‖v_i‖² = R_i² + G_i² + B_i², ⟨x_t, v_i⟩² = (R_i R + G_i G + B_i B)².

Fig. 1. The proposed color model - separate evaluation of color distortion and brightness distortion


The color distortion δ can be calculated by

p² = ‖x_t‖² cos²θ = ⟨x_t, v_i⟩² / ‖v_i‖²,
colordist(x_t, v_i) = δ = √(‖x_t‖² − p²).    (2)

The logical brightness function is defined as

brightness(I, ⟨Ǐ, Î⟩) = true if I_low ≤ ‖x_t‖ ≤ I_hi, false otherwise.    (3)

Subtracting the current image from the background model is straightforward. Unlike Mixture-of-Gaussians (MOG) [2] or the Kernel method [4], which compute probabilities using costly floating point operations, our method does not involve probability calculation. Indeed, the probability estimate in [4] is dominated by the nearby training samples. We simply compute the distance of the sample from the nearest cluster mean. This is very fast and shows little difference in detection compared with the probability estimate. The subtraction operation BGS(x) for an incoming pixel value x in the test set is defined as:

Algorithm for Background Subtraction (Foreground Detection)

I. x = (R, G, B), I ← R + G + B
II. For all codewords in M in Eq. 1, find the codeword c_m matching to x based on two conditions:
    colordist(x, v_m) ≤ ε2
    brightness(I, ⟨Ǐ_m, Î_m⟩) = true
III. BGS(x) = foreground if there is no match, background otherwise.

ε2 is the detection threshold.
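A matching sketch of the subtraction test, reusing the Codeword, colordist(), and brightness_ok() helpers from the construction sketch above (again illustrative names, not the authors' code):

```python
def bgs(x, M, eps2):
    """BGS(x): 'background' iff some codeword in the filtered model M matches x."""
    I = sum(x)
    for cw in M:
        if colordist(x, cw.v) <= eps2 and brightness_ok(I, cw):
            return "background"
    return "foreground"
```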

Detection Results and Comparison

This section demonstrates the performance of the proposed algorithm compared with MOG [2] and Kernel [4].

Fig. 2(a) is an image extracted from the MPEG video encoded at 70 kbits/sec. Fig. 2(b) depicts a 20-times scaled image of the standard deviations of green (G)-channel values in the training set. The distribution of pixel values has been affected by the blocking effects of MPEG. The unimodal model in Fig. 2(c) suffers from these effects. For compressed videos having very abnormal distributions, CB eliminates most compression artifacts - see Fig. 2(c)-2(f).

To test unconstrained training, we applied the algorithms to a video in which people are almost always moving in and out of a building (see Fig. 3(a)-3(d)). By λ-filtering, CB was able to obtain the most complete background model.

Multiple backgrounds moving over a long period of time cannot be well trained with techniques having limited memory constraints. A sequence of 1000 frames recorded at 30 frames per second (fps) was trained. It contains trees moving irregularly over that period. The number of Gaussians allowed for MOG was 20. A sample of size 300 was used to represent the background for Kernel. Fig. 4(a)-4(d) shows that CB captures most multiple background events. This is due to a compact background model represented by quantized codewords. The implementation of our approach is straightforward and it is faster than MOG and Kernel.

(a) original image (b) standard deviations (c) unimodal model in [1]

(d) MOG (e) Kernel (f) CB

Fig. 2. Detection results on a compressed video

(a) original image (b) MOG (c) Kernel (d) CB

Fig. 3. Detection results on training of non-clean backgrounds

(a) original image (b) MOG (c) Kernel (d) CB

Fig. 4. Detection results on very long-time backgrounds


Automatic Parameter Estimation - ε1 and ε2

Automatic parameter selection is an important goal for visual surveillance systems, as addressed in [7]. Two of our parameters, ε1 and ε2, are automatically determined. Their values depend on variation within a single background distribution, and are closely related to false alarm rates. First, we find a robust measure of background variation computed over a sequence of frames (of at least 90 consecutive frames, about 3 seconds of video data). In order to obtain this robust measure, we calculate the median color consecutive-frame difference over pixels. Then we calculate Θ (median color frame difference) which is the median over time of these median differences over space. For example, suppose we have a sequence of N images. We consider the first pair of frames, calculate the color difference at each pixel, and take the median over space. We do this for all N−1 consecutive pairs, until we have N−1 medians. Then, Θ is the median of the N−1 values. In fact, an over-space median of medians over time is almost the same as Θ, while Θ is much easier to calculate with limited memory. Θ will be proportional to the within-class variance of a single background. In addition, it will be a robust estimate, which is insensitive to the presence of relatively small areas of moving foreground objects. The same color difference metric should be used as in the background modeling and subtraction.

Finally, we multiply a constant k by this measure to obtain ε1 (= kΘ). The default value of k is 4.5, which corresponds approximately to a false alarm rate of detection between .0001 - .002. ε2 can be set to k′Θ, where (k − 1) < k′ < (k + 1) but usually k′ = k. Experiments on many videos show that these automatically chosen threshold parameters ε1 and ε2 are sufficient. However, they are not always acceptable, especially for highly compressed videos where we cannot always measure the robust median accurately.
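A sketch of this estimate is given below, assuming the frames are stored as a (T, H, W, 3) array and using a plain Euclidean RGB difference as the color metric (in practice the same color distortion measure as in the model should be substituted); the function and variable names are illustrative.

```python
import numpy as np

def estimate_thresholds(frames, k=4.5):
    """frames: array of shape (T, H, W, 3), at least ~90 consecutive frames."""
    frames = np.asarray(frames, dtype=np.float32)
    diffs = np.linalg.norm(frames[1:] - frames[:-1], axis=-1)   # per-pixel consecutive-frame difference
    per_frame_median = np.median(diffs, axis=(1, 2))            # median over space, one value per pair
    theta = np.median(per_frame_median)                         # median over time of the spatial medians
    eps1 = k * theta                                            # sampling threshold
    eps2 = eps1                                                 # detection threshold (usually k' = k)
    return theta, eps1, eps2
```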

Layered modeling and detection - Model maintenance

The scene can change after initial training, for example, by parked cars, displaced books, etc. These changes should be used to update the background model. We achieve this by defining an additional model H called a cache and three parameters described below:

• T_H: the threshold for MNRL of the codewords in H;
• T_add: the minimum time period required for addition, during which the codeword must reappear;
• T_delete: a codeword is deleted if it has not been accessed for a period of this long.

The periodicity of an incoming pixel value is filtered by T_H, as we did in the background modeling. The values re-appearing for a certain amount of time (T_add) are added to the background model as short-term background. Some parts of a scene may remain in the foreground unnecessarily long if adaptation is slow, but other parts will disappear too rapidly into the background if adaptation is fast. Neither approach is inherently better than the other. The choice of this adaptation speed is problem dependent.


We assume that the background obtained during the initial background modeling is long-term. This assumption is not necessarily true, e.g., a chair can be moved after the initial training, but, in general, most long-term backgrounds are obtainable during training. Background values not accessed for a long time (T_delete) are deleted from the background model. Optionally, the long-term codewords are augmented with permanent flags indicating they are not to be deleted. The permanent flags can be applied otherwise depending on specific application needs.

Thus, a pixel can be classified into four subclasses - (1) background found in the long-term background model, (2) background found in the short-term background model, (3) foreground found in the cache, and (4) foreground not found in any of them. The overview of the approach is illustrated in Fig. 5. This adaptive modeling capability allows us to capture changes to the background scene.
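The four-way decision can be sketched as a simple cascade over the long-term model, the short-term model, and the cache; matches() stands for the codeword test of Section 1, and all names are illustrative.

```python
def classify_pixel(x, long_term, short_term, cache, matches):
    """Return one of the four subclasses for pixel x."""
    if any(matches(x, cw) for cw in long_term):
        return "background (long-term)"
    if any(matches(x, cw) for cw in short_term):
        return "background (short-term)"
    if any(matches(x, cw) for cw in cache):
        return "foreground (found in cache)"
    return "foreground (not found)"
```

Cache maintenance with T_H, T_add, and T_delete would run alongside this test, promoting recurring cache codewords to the short-term model and pruning stale ones.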

Fig. 5. The overview of our approach with short-term background layers: the foreground and the short-term backgrounds can be interpreted in a different temporal order (the short-term backgrounds are color-labeled based on 'first-access-time' and form layers in a 2.5D-like space). The diagram items in dotted line, such as Tracking, are added to complete a video surveillance system.

Adaptive codebook updating - detection under global illumination changes

Global illumination changes (for example, due to moving clouds) make it difficult to conduct background subtraction in outdoor scenes. They cause over-detection, false alarms, or low sensitivity to true targets. Good detection requires equivalent false alarm rates over time and space. We discovered from experiments that variations of pixel values are different (1) at different surfaces (shiny or muddy), and (2) under different levels of illumination (dark or bright). Codewords should be adaptively updated during illumination changes. Exponential smoothing of the codeword vector and variance with suitable learning rates is efficient in dealing with illumination changes. It can be done by replacing the updating formula of v_m with

v_m ← γx_t + (1 − γ)v_m

and appending

σ²_m ← ρδ² + (1 − ρ)σ²_m

to Step II-iv of the algorithm for codebook construction. γ and ρ are learning rates. Here, σ²_m is the overall variance of color distortion in the color model, not the variance of RGB. σ_m is initialized when the algorithm starts. Finally, the function colordist() in Eq. 2 is modified to

colordist(x_t, v_i) = δ / σ_i.
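A sketch of the modified update, assuming the Codeword/colordist helpers from the earlier sketch plus a var attribute for σ²_m; the learning-rate values are illustrative, not the ones used in the experiments.

```python
import math

def adaptive_update(cw, x, gamma=0.05, rho=0.05):
    """Exponential smoothing of the codeword vector and of the color-distortion variance.
    Assumes cw.var holds sigma_m^2 and is initialized when the algorithm starts."""
    delta = colordist(x, cw.v)                                 # color distortion of Eq. 2
    cw.v = [gamma * xi + (1 - gamma) * vi for xi, vi in zip(x, cw.v)]
    cw.var = rho * delta ** 2 + (1 - rho) * cw.var             # smoothed sigma_m^2
    return delta / math.sqrt(cw.var)                           # normalized distortion for detection
```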

We tested a PETS'2001 sequence (IEEE International Workshop on Performance Evaluation of Tracking and Surveillance 2001, http://www.visualsurveillance.org/PETS2001), which is challenging in terms of multiple targets and significant lighting variation. Fig. 6(a) shows two sample points (labelled 1 and 2) which are significantly affected by illumination changes and Fig. 6(b) shows the brightness changes of those two points. As shown in Fig. 6(d), adaptive codebook updating eliminates the false detection which occurs on the roof and road in Fig. 6(c).

(a) original image - frame 1  (b) brightness changes √((R²+G²+B²)/3) over frames - blue on roof, gray on road  (c) before adaptive updating  (d) after adaptive updating

Fig. 6. Results of adaptive codebook updating for detection under global illumination changes. Detected foregrounds on frame 1105 are labelled with green color.

2 Performance Evaluation of Sensitive Target Detection

In this section, we propose a methodology, called Perturbation Detection Rate (PDR) Analysis [6], for measuring performance of BGS algorithms, which is an alternative to the common method of ROC analysis. The purpose of PDR analysis is to measure the detection sensitivity of a BGS algorithm without assuming knowledge of the actual foreground distribution. In PDR, we do not need to know exactly what the distributions are. The basic assumption made is that the shape of the foreground distribution is locally similar to that of the background distribution; however, a foreground distribution of small ("just-noticeable") contrast will be a shifted or perturbed version of the background distribution. This assumption is fairly reasonable because, in modeling video, any object with its color could be either background or foreground, e.g., a parked car could be considered as a background in some cases; in other cases, it could be considered a foreground target. Furthermore, by varying algorithm parameters we determine not a pair of error rates but a relation among the false alarm and detection rates and the distance between the distributions.

Given the parameters to achieve a certain fixed FA-rate, the analysis is performed by shifting or perturbing the entire BG distributions by vectors in uniformly random directions of RGB space with fixed magnitude Δ, computing an average detection rate as a function of contrast Δ. It amounts to simulating possible foregrounds at certain color distances. In the PDR curve, we plot the detection rate as a function of the perturbation magnitude Δ given a particular FA-rate.

First, we train each BGS algorithm on N training background frames, adjusting parameters as best we can to achieve a target FA-rate which would be practical in processing the video. Typically this will range from .01% to 1% depending on video image quality. To obtain a test foreground at color contrast Δ, we pass through the N background frames again. For each frame, we perturb a random sample of M pixel values (R_i, G_i, B_i) by a magnitude Δ in uniformly random directions.

The perturbed, foreground color vectors (R′, G′, B′) are obtained by generating points randomly distributed on the color sphere with radius Δ. Then we test the BGS algorithms on these perturbed, foreground pixels and compute the detection rate for that Δ. By varying the foreground contrast Δ, we obtain a monotone increasing PDR graph of detection rates. In some cases, one algorithm will have a graph which dominates that of another algorithm for all Δ. In other cases, one algorithm may be more sensitive only in some ranges of Δ. Most algorithms perform very well for a large contrast Δ, so we are often concerned with small contrasts (Δ < 40) where differences in detection rates may be large.
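The perturbation-and-test loop can be sketched as follows; is_foreground() stands for a trained BGS algorithm's per-pixel test, and all names and the clipping to [0, 255] are assumptions of this sketch.

```python
import numpy as np

def detection_rate(frames, is_foreground, delta, M=1000, seed=0):
    """Average detection rate (%) at perturbation magnitude delta over the given frames."""
    rng = np.random.default_rng(seed)
    hits = trials = 0
    for frame in frames:                                   # the same N background frames
        h, w, _ = frame.shape
        ys = rng.integers(0, h, size=M)
        xs = rng.integers(0, w, size=M)
        dirs = rng.normal(size=(M, 3))                     # uniformly random directions
        dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
        perturbed = np.clip(frame[ys, xs].astype(np.float32) + delta * dirs, 0, 255)
        for x, y, p in zip(xs, ys, perturbed):
            hits += bool(is_foreground(p, x, y))
            trials += 1
    return 100.0 * hits / trials
```

Sweeping delta over, say, 0-40 and plotting the returned rates reproduces one PDR curve for one algorithm at the chosen FA-rate.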

In this study, we compare four algorithms shown in Table 1. Since the algorithm in [4] accepts normalized colors (KER) or RGB colors (KER.RGB) as inputs, it has two separate graphs. Figure 7 shows the representative images from the four test videos.

To generate PDR curves, we collected 100 empty consecutive frames from each video. 1000 points are randomly selected at each frame. That is, for each Δ, (100)×(1000) perturbations and detection tests were performed. Those 100 empty frames are also used for training background models. During testing, no updating of the background model is allowed. For the non-parametric model in KER and KER.RGB, a sample size of 50 was used to represent the background.

Table 1. Four algorithms used in PDR performance evaluation.

Name            Background subtraction algorithm
CB              codebook-based method described in Section 1
MOG             mixture of Gaussians described in [2]
KER, KER.RGB    non-parametric method using kernels described in [4]
UNI             unimodal background modeling described in [1]

(a) indoor office (b) outdoor woods (c) red-brick wall (d) parking lot

Fig. 7. The sample empty-frames of the videos used for the experiments


The maximum number of Gaussians allowed in MOG is 4 for the video having stationary backgrounds and 10 for moving backgrounds. We do not use a fixed FA-rate for all four videos. The FA-rate for each video is determined by these three factors - video quality, whether it is indoor or outdoor, and good real foreground detection results for most algorithms. The FA-rate chosen this way is practically useful for each video. The threshold value for each algorithm has been set to produce a given FA-rate. In the case of MOG, the learning rate, α, was fixed to 0.01 and the minimum portion of the data for the background, T, was adjusted to give the desired FA-rate. Also, the cluster match test statistic was set to 2 standard deviations. Unless noted otherwise, the above settings are used for the PDR analysis.

Evaluation Results

Figures 9(a) and 9(b) show the PDR graphs for the videos in Figures 7(a) and 7(b) respectively. For the indoor office video, consisting almost entirely of stationary backgrounds, CB and UNI perform better than the others. UNI, designed for unimodal backgrounds, has good sensitivity as expected. KER performs intermediately. MOG and KER.RGB do not perform as well for small contrast foreground Δ, probably because, unlike the other algorithms, they use original RGB variables and don't separately model brightness and color. MOG currently does not model covariances which are often large and caused by variation in brightness. It is probably best to explicitly model brightness. MOG's sensitivity is consistently poor in all our test videos, probably for this reason.

For the outdoor video, all algorithms perform somewhat worse even though the FA-rate has been increased to 1% from .01%. CB and KER, both of which model mixed backgrounds and separate color/brightness, are most sensitive, while, as expected, UNI does not perform as well as in the indoor case. KER.RGB and MOG are also less sensitive outdoors, as before indoors.

Figure 8 depicts a real example of foreground detection, showing real differences in detection sensitivity for two algorithms. These real differences reflect performance shown in the PDR graph in Figure 9(c). The video image in Figure 8(a) shows someone with a red sweater standing in front of a brick wall of somewhat different reddish color. There are detection holes through the sweater (and face) in the MOG result (Figure 8(b)). The CB result in Figure 8(c) is much better for this small contrast.

(a) original frame of a person in a red sweater  (b) detection using MOG  (c) detection using CB

Fig. 8. Sensitive detection at small delta


After inspection of the image, the magnitude of contrast Δ was determined to be about 16 in the missing spots. This was due to difference in color balance and not overall brightness. Figure 9(c) shows a large difference in detection for this contrast, as indicated by the vertical line.

Figures 9(d), 9(e), 9(f) show how sensitively the algorithms detect foregrounds against a scene containing moving backgrounds (trees) as well as stationary surfaces. In order to sample enough moving background events, 300 frames are allowed for training. As for previous videos, a PDR graph for the 'parking lot' video is given in Figure 9(d). Two windows are placed to represent 'stationary' and 'moving backgrounds' as shown in Figure 7(d). PDR analysis is performed on each window with the FA-rate obtained only within the window - a 'window' false alarm rate (instead of a 'frame' false alarm rate).

Since most of the frame is stationary background, as expected, the PDR graph (Figure 9(e)) for the stationary background window is very close to that for the entire frame. On the other hand, the PDR graph (Figure 9(f)) for the moving background window is generally shifted right, indicating reduced sensitivity of all algorithms for moving backgrounds. Also, it shows differences in performance among algorithms, with CB and KER performing best. These results are qualitatively similar to those for the earlier example of outdoor video shown in Figure 9(b). We can offer the same explanation as before: CB and KER were designed to handle mixed backgrounds, and they separately model brightness and color. In this video experiment, we had to increase the background sample size of KER to 270 frames from 50 in order to achieve the target FA-rate in the case of the moving background window. It should be noted that CB, like MOG, usually models background events over a longer period than KER.

3 Multi-camera Segmentation and Tracking of People

A multi-view multi-hypothesis approach, named M3Tracker, to segmenting and tracking multiple (possibly occluded) persons on a ground plane is presented. During tracking, several iterations of segmentation are performed using information from human appearance models and ground plane homography. The full algorithm description is available in [30].

Survey on people tracking techniques

Table 2 lists different single-camera and multi-camera algorithms for people tracking along with their characteristics.

Human appearance model

First, we describe an appearance color model as a function of height that assumes that people are standing upright and are dressed, generally, so that consistently colored or textured color regions are aligned vertically. Each body part has its own color model represented by a color distribution.


(a) PDR for 'indoor office' in Figure 7(a) (false alarm rate = .01%)
(b) PDR for 'outdoor woods' in Figure 7(b) (false alarm rate = 1%)
(c) PDR for 'red-brick wall' in Figure 7(c) (false alarm rate = .01%)
(d) PDR for 'parking lot' in Figure 7(d) ('frame' false alarm rate = .1%)
(e) PDR for window on stationary background (Figure 7(d)) ('window' false alarm rate = .1%)
(f) PDR for window on moving background (Figure 7(d)) ('window' false alarm rate = .1%)

Fig. 9. PDR graphs - detection rate (%) versus perturbation magnitude Δ for CB, MOG, KER, KER.RGB, and UNI


Table 2. Characteristics of people tracking algorithms

Comparison chart 1

                 Algorithm        Tracking            Segmentation  Occlusion Analysis  Human Appearance Model
Single-camera    Haritaoglu [8]   Heuristic           Yes           No                  Temporal template
                 Elgammal [9]     n/a                 Yes           Yes                 Kernel density est.
                 Zhao [10]        MCMC                No            Yes                 Shape, histogram
                 Rabaud [11]      KLT tracker         No            No                  Feature-based
Multi-camera     Yang [12]        No (counting)       No            Yes                 None
                 Khan [13]        Look-ahead          No            Yes                 None
                 Kang [14]        JPDAF, Kalman       No            Yes                 Polar color distrib.
                 Javed [15]       Voting-based        No            No                  Dynamic color histo.
                 Mittal [16]      Kalman              Yes           Yes                 Kernel density
                 Eshel [17]       Score-based         No            Yes                 None
                 Jin [18]         Kalman              No            Yes                 Color histogram
                 Black [19]       Kalman              No            Yes                 Unknown
                 Xu [20]          Kalman              No            No                  Histogram intersect.
                 Fleuret [21]     Dynamic prog.       No            Yes                 Color distrib.
                 Ours             Particle filtering  Yes           Yes                 Kernel density

Comparison chart 2

                 Algorithm        Calibration  Area  Sensors     Background Subtraction  Initialization
Single-camera    Haritaoglu [8]   n/a          O     B/W         Yes                     Auto
                 Elgammal [9]     n/a          I     Color       Yes                     Manual
                 Zhao [10]        Yes          O     Color       Yes                     Auto
                 Rabaud [11]      No           O     Color       No                      Auto
Multi-camera     Yang [12]        Yes          I     Color       Yes                     Auto
                 Khan [13]        Homography   O     Color       Yes                     n/a
                 Kang [14]        Homography   O     Color       Yes                     Unknown
                 Javed [15]       No           I,O   Color       Yes                     Manual
                 Mittal [16]      Stereo       I     Color       Yes                     Auto
                 Eshel [17]       Homography   I,O   B/W         Yes                     Unknown
                 Jin [18]         Homography   I     Color, IR   Yes                     Manual
                 Black [19]       Tsai's       O     Color       Yes                     Auto
                 Xu [20]          Tsai's       O     Color       Yes                     Auto
                 Fleuret [21]     Homography   I,O   Color       Yes                     Unknown
                 Ours             Homography   I,O   Color       Yes                     Auto

(MCMC: Markov chain Monte Carlo, KLT: Kanade-Lucas-Tomasi, JPDAF: Joint Probabilistic Data Association Filter, Tsai's: [22], I: indoor, O: outdoor, n/a: not applicable.)


To allow multimodal densities inside each part, we use kernel density estimation.

Let M = {c_i}, i = 1...N_M, be a set of pixels from a body part with colors c_i. Using Gaussian kernels and an independence assumption between the d color channels, the probability that an input pixel c = {c_1, ..., c_d} is from the model M is estimated as

p_M(c) = (1/N_M) Σ_{i=1}^{N_M} Π_{j=1}^{d} (1/(√(2π) σ_j)) exp( −(1/2) ((c_j − c_{i,j})/σ_j)² )    (4)

In order to handle illumination changes, we use normalized color (r = R/(R+G+B), g = G/(R+G+B), s = (R+G+B)/3) or Hue-Saturation-Value (HSV) color space with a wider kernel for 's' and 'V' to cope with the higher variability of these lightness variables. We used both the normalized color and HSV spaces in our experiments and observed similar performances.

Viewpoint-independent models can be obtained by viewing people from different perspectives using multiple cameras. A related calibration issue was addressed in [24, 26], since each camera's output for the same scene point taken at the same or different times may vary slightly depending on camera types and parameters.

Multi-camera Multi-person Segmentation and Tracking

Foreground segmentation. Given image sequences from multiple overlapping views including people to track, we start by performing detection using background subtraction to obtain the foreground maps in each view. The codebook-based background subtraction algorithm is used.

Each foreground pixel in each view is labelled as the best matching person (i.e., the most likely class) by Bayesian pixel classification as in [16]. The posterior probability that an observed pixel x (containing both color c and image position (x, y) information) comes from person k is given by

P(k|x) = P(k) P(x|k) / P(x)    (5)

We use the color model in Eq. 4 for the conditional probability P(x|k). The color model of the person's body part to be evaluated is determined by the information of x's position as well as the person's ground point and full-body height in the camera view (see Fig. 10(a)). The ground point and height are determined initially by the method defined subsequently in Sec. 3.

The prior reflects the probability that person k occupies pixel x. Given the ground point and full-body height of the person, we can measure x's height from the ground and its distance to the person's center vertical axis. The occupancy probability is then defined by


O_k(h_k(x), w_k(x)) = P[w_k(x) < W(h_k(x))] = 1 − cdf_{W(h_k(x))}(w_k(x))    (6)

where h_k(x) and w_k(x) are the height and width of x relative to the person k. h_k and w_k are measured relative to the full height of the person. W(h_k(x)) is the person's height-dependent width and cdf_W(.) is the cumulative density function for W. If x is located at a distance W(h_k(x)) from the person's center, the occupancy probability is designed so that it will be exactly 0.5 (while it increases or decreases as x moves towards or away from the center).

The prior must also incorporate possible occlusion. Suppose that some person l has a lower ground point than a person k in some view. Then the probability that l occludes k depends on their relative positions and l's (probabilistic) width. Hence, the prior probability P(k) that a pixel x is the image of person k, based on this occlusion model, is

P(k) = O_k(h_k, w_k) Π_{g_y(k) < g_y(l)} (1 − O_l(h_l, w_l))    (7)

where g_y(k) is the y-location of the ground point of k and x is omitted for simplicity (i.e., h_k = h_k(x) and w_k = w_k(x)).

The best class k* is determined by maximum a posteriori (MAP) estimation: k* = arg max_k P(k) P(x|k). Finally, the foreground maps are segmented into the best matching persons based on their appearance models and occlusion information.
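Putting Eqs. 5-7 together, the per-pixel labelling can be sketched as below; occupancy(k, x) stands for O_k(h_k(x), w_k(x)) of Eq. 6, appearance(k, x) for the body-part color likelihood of Eq. 4, ground_y(k) for g_y(k), and image y is assumed to grow downward (so a "lower" ground point has a larger y). All names are illustrative.

```python
def label_pixel(x, persons, occupancy, appearance, ground_y):
    """MAP assignment k* = argmax_k P(k) P(x|k) for one foreground pixel x."""
    best_k, best_score = None, -1.0
    for k in persons:
        prior = occupancy(k, x)                          # O_k term of Eq. 7
        for l in persons:
            if l != k and ground_y(l) > ground_y(k):     # l may occlude k
                prior *= 1.0 - occupancy(l, x)
        score = prior * appearance(k, x)                 # P(k) * P(x|k)
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```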

Model initialization and update. Fully automatic tracking is enabled by initializing the human appearance model when a person is detected in a view by searching for isolated foreground blobs (see Fig. 10(b)). In order to get a bounding box of a person from the foreground map, we used the object detection technique in [25]. The bounding boxes in the figure were created when the blobs were isolated earlier. For the case when a person does not constitute an isolated blob, a manual selection is employed. The full-body height of a person is initialized upon model creation and is updated during segmentation.

Fig. 10. (a) Illustration of the appearance model - person k is divided into head, torso, and bottom parts with height-dependent widths W_head (low variance), W_torso (medium variance), and W_bottom (high variance); h_k and w_k are measured for the pixel to be evaluated relative to the ground point. (b) Bounding box detection


When the unclassified pixels (those having a probability in Eq. 4 lower than a given threshold) constitute a connected component of non-negligible size, a new appearance model should be created.

Multi-view integration. Ground plane homography: The segmented blobs across views are integrated to obtain the ground plane locations of people. The correspondence of a human across multiple cameras is established by the geometric constraints of planar homographies. For N_V camera views, N_V(N_V − 1) homography matrices can possibly be calculated for correspondence; but in order to reduce the computational complexity we instead reconstruct the top-view of the ground plane on which the hypotheses of peoples' locations are generated.
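A minimal sketch of the homography mapping onto the common top-view; the 3x3 matrix H_v and the example values are assumptions of this sketch (in practice it is estimated offline from ground-plane correspondences).

```python
import numpy as np

def to_top_view(H_v, points):
    """Map (N, 2) image ground points of view v to top-view coordinates."""
    pts = np.asarray(points, dtype=np.float64)
    homog = np.hstack([pts, np.ones((len(pts), 1))])   # homogeneous coordinates
    mapped = homog @ H_v.T
    return mapped[:, :2] / mapped[:, 2:3]              # divide out the scale

# Example with an arbitrary (assumed) homography:
H_v = np.array([[1.2, 0.1, -30.0],
                [0.0, 1.5, -80.0],
                [0.0, 0.001, 1.0]])
print(to_top_view(H_v, [[320.0, 240.0]]))
```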

Integration by vertical axes: Given the pixel classification results from Sec. 3, a ground point of a person could be simply obtained by detecting the lowest point of the person's blob. However, those ground points are not reliable due to the errors from background subtraction and segmentation.

We, instead, develop a localization algorithm that employs the center vertical axis of a human body, which can be estimated more robustly even with poor background subtraction [29]. Ideally, a person's body pixels are arranged more or less symmetrically about the person's central vertical axis. An estimate of this axis can be obtained by Least Mean Squares of the perpendicular distance between the body pixels and the axis, as in ③ in Fig. 11. Alternatively, the Least Median of Squares could be used since it is more robust to outliers.

The homographic images of all the vertical axes of a person across different views intersect at (or are very close to) a single point (the location of that person on the ground) when mapped to the top-view (see [29], [30]). In fact, even when the ground point of a person in some view is occluded, the top-view ground point integrated from all the views is obtainable if the vertical axis is estimated correctly. This intersection point can be calculated by minimizing the perpendicular distances to the axes. Fig. 11 depicts an example of reliable detection of the ground point from the segmented blobs of a person. The N_V vertical axes are mapped to the top-view and transferred back to each image view.

Let each axis L_i be parameterized by two points {(x_{i,1}, y_{i,1}), (x_{i,2}, y_{i,2})}, i = 1...N_V. When mapped to the top-view by homography as in ④ in Fig. 11, we obtain the corresponding top-view endpoints of the axes. The distance of a ground point (x, y) to the axis L_i is written as

d((x, y), L_i) = |a_i x + b_i y + c_i| / √(a_i² + b_i²)

where a_i = y_{i,1} − y_{i,2}, b_i = x_{i,2} − x_{i,1}, and c_i = x_{i,1} y_{i,2} − x_{i,2} y_{i,1}. The solution is the point that minimizes a weighted sum of square distances:

(x*, y*) = arg min_{(x,y)} Σ_{i=1}^{N_V} w_i² d²((x, y), L_i)    (8)

The weight w_i is determined by the segmentation quality (confidence level) of the body blob of L_i (we used the pixel classification score in Eq. 5).
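Eq. 8 is a small weighted linear least-squares problem; one way to solve it in closed form (a sketch, assuming at least two non-parallel axes are available; names are illustrative) is:

```python
import numpy as np

def intersect_axes(axes, weights):
    """Eq. 8: top-view point minimizing the weighted squared distances to the mapped axes.
    axes: list of ((x1, y1), (x2, y2)) top-view endpoints; weights: list of w_i."""
    A = np.zeros((2, 2))
    b = np.zeros(2)
    for ((x1, y1), (x2, y2)), w in zip(axes, weights):
        a_i, b_i = y1 - y2, x2 - x1
        c_i = x1 * y2 - x2 * y1
        norm = np.hypot(a_i, b_i)
        n = np.array([a_i, b_i]) / norm          # unit normal of the axis line
        A += (w ** 2) * np.outer(n, n)
        b -= (w ** 2) * (c_i / norm) * n
    return np.linalg.solve(A, b)                 # (x*, y*); needs >= 2 non-parallel axes
```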


Fig. 11. All vertical axes of a person across views intersect at (or are very close to) a single point when mapped to the top-view (the diagram shows the segmented blobs and vertical axes in image views 1...N_V, their ground plane homography mapping to the top-view, and the initial and moved points obtained by Least Mean (Median) Squares)

If a person is occluded severely by others in a view (i.e., the axis information is unreliable), the corresponding body axis from that view will not contribute heavily to the calculation in Eq. 8. When only one axis is found reliably, then the lowest body point along the axis is chosen.

To obtain a better ground point and segmentation result, we can iterate the segmentation and ground-point integration process until the ground point converges to a fixed location within a certain bound ε. That is, given a set of initial ground-point hypotheses of people as in ① in Fig. 11, segmentation in Sec. 3 is performed (②), and then newly moved ground points are obtained based on multi-view integration (④ and ⑤). These new ground points are an input to the next iteration. 2-3 iterations gave satisfactory results for our data sets.

There are several advantages of our approach. Even though a person's ground point is invisible or there are segmentation and background subtraction errors, the robust final ground point is obtainable once at least two vertical axes are correctly detected. When total occlusion occurs from one view, robust tracking is possible using the other views' information if available; visibility of a person can be maximized if cameras are placed at proper angles. Since the good views for each tracked person change over time, our algorithm maximizes the effective usage of all available information across views. By iterating the multi-view integration process, a ground point moves to the optimal position that explains the segmentation results of all views. This nice property allows, in the next section, a small number of hypotheses to explore a large state space that incorporates multiple persons and multiple views.

Extension to Multi-hypothesis Tracker

Next, we extend our single-hypothesis tracker to one with multiple hypotheses. A single-hypothesis tracker, while computationally efficient, can be easily distracted by occlusion or nearby similarly colored objects.


The iterative segmentation-searching presented in Sec. 3 is naturally incorporated with a particle filtering framework. There are two advantages - (1) by searching for a person's ground point from a segmentation, a set of a few good particles can be identified, resulting in low computational costs; (2) even if all the particles are away from the true ground point, some of them will move towards the true one as long as they are initially located nearby. This does not happen generally with particle filters, which need to wait until the target "comes to" the particles.

Our final M3Tracker algorithm of segmentation and tracking is presented with a particle filter overview and our state space, dynamics, and observation model.

Overview of particle filter, state space, and dynamics. The key idea of particle filtering is to approximate a probability distribution by a weighted sample set S = {(s^(n), π^(n)) | n = 1...N}. Each sample, s, represents one hypothetical state of the object, with a corresponding discrete sampling probability π, where Σ_{n=1}^{N} π^(n) = 1. Each element of the set is then weighted in terms of the observations and N samples are drawn with replacement, by choosing a particular sample with probability π_t^(n) = P(z_t | x_t = s_t^(n)).

In our particle filtering framework, each sample of the distribution is simply given as s = (x, y), where x, y specify the ground location of the object in the top-view. For multi-person tracking, a state s_t = (s_{1,t}, ..., s_{N_p,t}) is defined as a combination of N_p single-person states. Our state transition dynamic model is a random walk where a new predicted single-person state is acquired by adding a zero-mean Gaussian with a covariance Σ to the previous state. Alternatively, the velocity (ẋ, ẏ) or the size variables height and width can be added to the state space and then a more complex dynamic model can be applied if relevant.
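The state and the random-walk prediction step can be written in a few lines; N, N_p, Σ, and the initialization below are illustrative values, not the settings used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
N, N_p = 15, 4                                     # number of particles and of persons
Sigma = np.diag([5.0 ** 2, 5.0 ** 2])              # per-person random-walk covariance

particles = rng.uniform(0, 300, size=(N, N_p, 2))  # (x, y) top-view hypotheses per person
noise = rng.multivariate_normal(np.zeros(2), Sigma, size=(N, N_p))
predicted = particles + noise                      # prediction step of the dynamics
```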

Observation. Each person is associated with a reference color model q* which is obtained by histogram techniques [27]. The histograms are produced using a function b(c_i) ∈ {1, ..., N_b} that assigns the color vector c_i to its corresponding bin. We used the color model defined in Sec. 3 to construct the histogram of the reference model in the normalized color or HSV space using N_b (e.g., 10 × 10 × 5) bins to make the observation less sensitive to lighting conditions.

The histogram q(C) = {q(u; C)}, u = 1...N_b, of the color distribution of the sample set C is given by

q(u; C) = η Σ_{i=1}^{N_C} δ[b(c_i) − u]    (9)

where u is the bin index, δ is the Kronecker delta function, and η is a normalizing constant ensuring Σ_{u=1}^{N_b} q(u; C) = 1. This model associates a probability to each of the N_b color bins.

If we denote q* as the reference color model and q as a candidate color model, q* is obtained from the stored samples of person k's appearance model as mentioned before, while q is specified by a particle s_{k,t} = (x, y). The sample set C in Eq. 9 is replaced with the sample set specified by s_{k,t}. The top-view point (x, y) is transformed to an image ground point for a certain camera view v, H_v(s_{k,t}), where H_v is a homography mapping the top-view to the view v. Based on the ground point, a region to be compared with the reference model is determined. The pixel values inside the region are drawn to construct q. Note that the region can be constrained from the prior probability in Eq. 7, including the occupancy and occlusion information (i.e., by picking pixels such that P(k) > Threshold, typically 0.5). In addition, as done in pixel classification, the color histograms are separately defined for each body part to incorporate the spatial layout of the color distribution. Therefore, we apply the likelihood as the sum of the histograms associated with each body part.

Then, we need to measure the data likelihood between q* and q. The Bhattacharyya similarity coefficient is used to define a distance d on color histograms:

d[q*, q(s)] = [ 1 − Σ_{u=1}^{N_b} √(q*(u) q(u; s)) ]^{1/2}.

Thus, the likelihood (π_{v,k,t}) of person k consisting of N_r body parts at view v, the actual view-integrated likelihood (π_{k,t}) of a person s_{k,t}, and the final weight (π_t) of the particle formed by a concatenation of N_p person states are respectively given by:

π_{v,k,t} ∝ exp( Σ_{r=1}^{N_r} −λ d²[q*_r, q_r(H_v(s_{k,t}))] ),   π_{k,t} = Π_{v=1}^{N_V} π_{v,k,t},   π_t = Π_{k=1}^{N_p} π_{k,t}    (10)

where λ is a constant which can be experimentally determined.
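The per-person observation weight of Eq. 10 can be sketched as below; histograms are assumed already normalized to sum to one, and λ and the names are illustrative.

```python
import numpy as np

def bhattacharyya_distance(q_ref, q_cand):
    """Distance d[q*, q] between two normalized color histograms (numpy arrays)."""
    return float(np.sqrt(max(1.0 - float(np.sum(np.sqrt(q_ref * q_cand))), 0.0)))

def person_likelihood(ref_parts, cand_parts_per_view, lam=20.0):
    """ref_parts: N_r reference histograms; cand_parts_per_view: per view, N_r candidate
    histograms built from the non-occluded region around H_v(s_{k,t})."""
    pi_k = 1.0
    for cand_parts in cand_parts_per_view:                 # product over views
        d2 = sum(bhattacharyya_distance(qr, qc) ** 2       # sum over body parts
                 for qr, qc in zip(ref_parts, cand_parts))
        pi_k *= float(np.exp(-lam * d2))                   # pi_{v,k,t}
    return pi_k                                            # pi_{k,t}
```

The final particle weight π_t is then the product of person_likelihood() over all N_p persons in the joint state.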

The M3Tracker algorithm. Iteration of segmentation and multi-view integration moves a predicted particle to a better position on which all the segmentation results of the person agree. The transformed particle is re-sampled for processing of the next frames.

Algorithm for Multi-view Multi-target Multi-hypothesis tracking

I. From the "old" sample set S_{t−1} = {s_{t−1}^(n), π_{t−1}^(n)}, n = 1,...,N, at time t − 1, construct the new samples as follows.
II. Prediction: for n = 1, ..., N, draw s_t^(n) from the dynamics. Iterate Steps III to IV for each particle s_t^(n).
III. Segmentation & Search
   s_t = {s_{k,t}}, k = 1...N_p, contains all persons' states. The superscript (n) is omitted through the Observation step.
   i. for v ← 1 to N_V do
      (a) For each person k (k = 1...N_p), transform the top-view point s_{k,t} into the ground point in view v by homography, H_v(s_{k,t}).
      (b) Perform segmentation on the foreground map in view v with the occlusion information according to Sec. 3.
      end for
   ii. For each person k, obtain the center vertical axes of the person across views, then integrate them on the top-view to obtain a newly moved point s*_{k,t} as in Sec. 3.
   iii. For all persons, if ‖s_{k,t} − s*_{k,t}‖ < ε, then go to the next step. Otherwise, set s_{k,t} ← s*_{k,t} and go to Step III-i.
IV. Observation
   i. for v ← 1 to N_V do
      For each person k, estimate the likelihood π_{v,k,t} in view v according to Eq. 10. s_{k,t} needs to be transferred to view v by mapping through H_v for evaluation. Note that q_r(H_v(s_{k,t})) is constructed only from the non-occluded body region.
      end for
   ii. For each person k, obtain the person likelihood π_{k,t} by Eq. 10.
   iii. Set π_t ← Π_{k=1}^{N_p} π_{k,t} as the final weight for the multi-person state s_t.
V. Selection: Normalize {π_t^(n)} so that Σ_{n=1}^{N} π_t^(n) = 1. For n = 1, ..., N, sample index a(n) from the discrete probability {π_t^(n)} over {1...N}, and set s_t^(n) ← s_t^{a(n)}.
VI. Estimation: the mean top-view position of person k is Σ_{n=1}^{N} π_t^(n) s_{k,t}^(n).

People Tracking Results

The results on the indoor sequences are depicted in Fig. 12. The bottom-most row shows how the persons' vertical axes intersect on the top-view to obtain their ground points. Small orange box markers are overlaid on the images of frame 198 for determination of the camera orientations. Note that, in the figures of 'vertical axes', the axis of a severely occluded person does not contribute to localization of the ground point. When occlusion occurs, the ground points being tracked are displaced a little from their correct positions but are restored to the correct positions quickly. Only 5 particles (one particle is a combination of 4 single-person states) were used for robust tracking. The indoor cameras could easily be placed properly in order to maximize the effectiveness of our multi-view integration and the visibility of the people.

Fig. 13(a) depicts the graph of the total distance error of people's tracked ground points to the ground truth points. It shows the advantage of multiple views for tracking of people under severe occlusion.

Fig. 13(b) visualizes the homographic top-view images of possible vertical axes. A vertical axis in each indoor image view can range from 1 to the maximum image width of that view. 7 transformed vertical axes for each view are depicted for visualization. It helps to understand how the vertical axis location obtained from segmentation affects ground point (intersection) errors on the top-view. When the angular separation is close to 180 degrees (although visibility is maximized), the intersection point of two vertical axes transformed to the top-view may not be reliable because a small amount of angular perturbation makes the intersection point move dramatically.

The outdoor sequences (3 views, 4 persons) are challenging in that three people are wearing similarly-colored clothes and the illumination conditions change over time, making segmentation difficult. In order to demonstrate the advantage of our approach, a single-hypothesis (deterministic search only) tracker, a general particle filter, and a particle filter with deterministic search by segmentation (our proposed method) are compared in Fig. 14. The number of particles used is 15.


Fig. 12. The tracking results of the 4-view indoor sequences (cam1-cam4, Views 1-4) from Frame 138 to 198 are shown with the segmentation result of Frame 138

(a) Total distance error of persons' tracked ground points to the ground truth points (error in pixels vs. frame, for 1 to 4 views)  (b) Homographic images of all different vertical axes (possible intersections on the top-view)

Fig. 13. Graphs for indoor 4 camera views


(rows: deterministic search only, general particle filter, our method; columns: frames 362 and 407; initially, all methods are good)

Fig. 14. Comparison of three methods: While the deterministic search with a single hypothesis (persons 2 and 4 are good, cannot recover lost tracks) and the general particle filter (only person 3 is good, insufficient observations during occlusion) fail in tracking all the persons correctly, our proposed method succeeds with a minor error. Only view 2 is shown here. The proposed system tracks the ground positions of people afterwards over nearly 1000 frames.

Conclusions

All the topics described in this book chapter are closely related and geared toward intelligent video surveillance.

Our adaptive background subtraction algorithm, which is able to model a background from a long training sequence with limited memory, works well on moving backgrounds, illumination changes (using our color distortion measures), and compressed videos having irregular intensity distributions.

We presented a perturbation method for measuring sensitivity of BGS algorithms. PDR analysis has two advantages over the commonly used ROC analysis: (1) it does not depend on knowing foreground distributions, (2) it does not need the presence of foreground targets in the video in order to perform the analysis, while this is required in the ROC analysis. Because of these considerations, PDR analysis provides practical general information about the sensitivity of algorithms applied to a given video scene over a range of parameters and FA-rates.

A framework to segment and track people on a ground plane is presented. The multi-view tracker is extended efficiently to a multi-hypothesis framework (M3Tracker) using particle filtering. To tackle the explosive state space due to multiple targets and views, the iterative segmentation-searching is incorporated with a particle filtering framework. By searching for the ground point from segmentation, a set of a few good particles can be identified, resulting in low computational costs.

References

1. Horprasert, T., Harwood, D., Davis, L.S.: A statistical approach for real-time robust background subtraction and shadow detection. In: IEEE Frame-Rate Applications Workshop, Kerkyra, Greece (1999)

2. Stauffer, C., Grimson, W.E.L.: Adaptive background mixture models for real-time tracking. In: Int. Conf. Computer Vision and Pattern Recognition, vol. 2, pp. 246–252 (1999)

3. Harville, M.: A framework for high-level feedback to adaptive, per-pixel, mixture-of-gaussian background models. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2352, pp. 543–560. Springer, Heidelberg (2002)

4. Elgammal, A., Harwood, D., Davis, L.: Non-parametric model for background subtraction. In: Vernon, D. (ed.) ECCV 2000. LNCS, vol. 1843, pp. 751–767. Springer, Heidelberg (2000)

5. Kohonen, T.: Learning vector quantization. Neural Networks 1, 3–16 (1988)

6. Chalidabhongse, T.H., Kim, K., Harwood, D., Davis, L.: A Perturbation Method for Evaluating Background Subtraction Algorithms. In: Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, VS-PETS (2003)

7. Scotti, G., Marcenaro, L., Regazzoni, C.: A S.O.M. based algorithm for video surveillance system parameter optimal selection. In: IEEE Conference on Advanced Video and Signal Based Surveillance (2003)

8. Haritaoglu, I., Harwood, D., Davis, L.S.: W4: real-time surveillance of people and their activities. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(8), 809–830 (2000)

9. Elgammal, A., Davis, L.S.: Probabilistic Framework for Segmenting People Under Occlusion. In: IEEE International Conference on Computer Vision, Vancouver, Canada, July 9-12 (2001)

10. Zhao, T., Nevatia, R.: Tracking Multiple Humans in Complex Situations. IEEE Trans. Pattern Analysis Machine Intell. 26(9) (September 2004)

11. Rabaud, V., Belongie, S.: Counting Crowded Moving Objects. In: IEEE Conf. on Comp. Vis. and Pat. Rec. (2006)

12. Yang, D., Gonzalez-Banos, H., Guibas, L.: Counting People in Crowds with a Real-Time Network of Image Sensors. In: IEEE ICCV (2003)

13. Khan, S.M., Shah, M.: A Multiview Approach to Tracking People in Crowded Scenes Using a Planar Homography Constraint. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3954, pp. 133–146. Springer, Heidelberg (2006)

14. Kang, J., Cohen, I., Medioni, G.: Multi-Views Tracking Within and Across Uncalibrated Camera Streams. In: Proceedings of the ACM SIGMM 2003 Workshop on Video Surveillance (2003)

15. Javed, O., Rasheed, Z., Shafique, K., Shah, M.: Tracking Across Multiple Cameras With Disjoint Views. In: The Ninth IEEE International Conference on Computer Vision, Nice, France (2003)

16. Mittal, A., Davis, L.S.: M2Tracker: A Multi-View Approach to Segmenting and Tracking People in a Cluttered Scene. International Journal of Computer Vision 51(3) (February/March 2003)

17. Eshel, R., Moses, Y.: Homography Based Multiple Camera Detection and Tracking of People in a Dense Crowd. In: Computer Vision and Pattern Recognition, CVPR (2008)

18. Jin, H., Qian, G., Birchfield, D.: Real-Time Multi-View Object Tracking in Mediated Environments. In: ACM Multimedia Modeling Conference (2008)

19. Black, J., Ellis, T.: Multi Camera Image Tracking. In: 2nd IEEE International Workshop on Performance Evaluation of Tracking and Surveillance (2001)

20. Xu, M., Orwell, J., Jones, G.A.: Tracking football players with multiple cameras. In: ICIP 2004 (2004)

21. Fleuret, F., Berclaz, J., Lengagne, R., Fua, P.: Multi-Camera People Tracking with a Probabilistic Occupancy Map. IEEE Transactions on Pattern Analysis and Machine Intelligence 30(2), 267–282 (2008)

22. Tsai, R.Y.: An Efficient and Accurate Camera Calibration Technique for 3D Machine Vision. In: IEEE Conference on Computer Vision and Pattern Recognition (1986)

23. Tu, Z., Zhu, S.-C.: Image segmentation by data-driven Markov chain Monte Carlo. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(5), 657–673 (2002)

24. Javed, O., Shafique, K., Shah, M.: Appearance Modeling for Tracking in Multiple Non-overlapping Cameras. In: IEEE CVPR 2005, San Diego, June 20-26 (2005)

25. Senior, A.W.: Tracking with Probabilistic Appearance Models. In: Proceedings ECCV Workshop on Performance Evaluation of Tracking and Surveillance Systems, June 1, pp. 48–55 (2002)

26. Chang, T.H., Gong, S., Ong, E.J.: Tracking Multiple People Under Occlusion Using Multiple Cameras. In: BMVC (2000)

27. Perez, P., Hue, C., Vermaak, J., Gangnet, M.: Color-based probabilistic tracking. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2350, pp. 661–675. Springer, Heidelberg (2002)

28. Kim, K., Chalidabhongse, T.H., Harwood, D., Davis, L.: Real-time foreground-background segmentation using codebook model. Real-Time Imaging 11(3), 172–185 (2005)

29. Hu, M., Lou, J., Hu, W., Tan, T.: Multicamera correspondence based on principal axis of human body. In: International Conference on Image Processing (2004)

30. Kim, K., Davis, L.S.: Multi-camera tracking and segmentation of occluded people on ground plane using search-guided particle filtering. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3953, pp. 98–109. Springer, Heidelberg (2006)

