
2011 International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS) December 7-9, 2011

Object Tracking by Detection for Video Surveillance Systems based on Modified Codebook Foreground Detection and Particle Filter

Jiu XU, Graduate School of Information, Production and Systems, Waseda University, Japan. [email protected]

Chenyuan ZHANG, Graduate School of Information, Production and Systems, Waseda University, Japan. [email protected]

Satoshi GOTO, Graduate School of Information, Production and Systems, Waseda University, Japan. [email protected]

Abstract-In this paper, a novel approach is proposed to achieve multi-object tracking in video surveillance systems using a tracking-by-detection method. For the foreground object detection part, we implement a modified codebook model. First, the block-based model upgrades the pixel-based codebook model to the block level, improving the processing speed and reducing memory. Moreover, by adding the orientation and magnitude of the block gradient, the codebook model contains not only color information but also a texture feature, further reducing noise and refining more complete foreground regions. For the tracking aspect, we further utilize the data from the foreground detection: a color-edge-texture histogram is built by calculating the local binary pattern along the edges of the foreground objects, which performs well in describing the shape and texture of the objects. Finally, occlusion handling strategies are applied in order to overcome occlusion problems during tracking. Experimental results on different data sets show that our method has better performance and good real-time ability.

Keywords: object tracking; codebook; particle filter; color-texture histogram

I. INTRODUCTION

Real-time foreground object detection and tracking is the most critical and fundamental step in video surveillance systems. The main target of object tracking is to detect and track moving objects such as humans and vehicles in all kinds of situations, in spite of the complications caused by occlusion, shadows and reflections. There are two major tracking approaches: one is target representation and localization, the other is filtering and data association.

Target representation and localization is mostly a bottom-up process. These methods mainly depend on feature correspondence using point detectors such as SIFT [1] and SURF [2], which were published in recent years. Alternatively, foreground detection and background modeling algorithms are used to achieve the tracking; the most famous are GMM [3] and Codebook [4].

Filtering and data association is mostly a top-down approach, which relies on prior information about the scene or object, deals with object dynamics, and evaluates different hypotheses. Some common methods perform tracking with pre-initialized trackers based on the Kalman filter [5] or the particle filter [6].

This research was supported by the Waseda University Global COE Program 'International Research and Education Center for Ambient SoC' sponsored by MEXT, Japan.

So far, many related algorithms have already been published, such as [7] and [8]. However, several important problems remain.

For example, when we use a target representation and localization method and detect the objects frame by frame, the real-time ability is quite low. Besides, the objects may undergo significant size and shape variations. Meanwhile, if we apply a filtering and data association method such as the particle filter, one of the intractable problems is that the target is often occluded by other objects, frequently and instantaneously. To solve these problems, an algorithm that detects the occlusion and quickly recaptures the object when it reappears is necessary. We also need to select powerful features to avoid hijacking problems when tracking similar objects.

The target of this paper is to build a powerful tracking system for real-time surveillance. We combine object detection and tracking, performing tracking by detection, in order to achieve high accuracy and low time consumption together with occlusion handling.

II. TRACKING SYSTEM OVERVIEW

The overview of the method is shown in Figure 1. The input is a video sequence captured by a fixed camera. First we need to locate the moving objects, such as humans or vehicles, in each frame. After extracting the moving objects, we manually select the objects of interest and keep tracking them in the following frames. In order to associate the information between detection and tracking, in this paper we develop a tracking-by-detection system using foreground detection together with object tracking.

For the foreground object detection part, two improvements have been adopted on the traditional codebook method [4] so that the foreground detection is more suitable for use during tracking. First, since the original method is pixel-wise, it is difficult to apply directly to high-resolution video in real-time systems. Therefore, a block-based down-sampling is performed to significantly reduce the time consumption. Also, in order to compensate for the lost information, a texture feature is applied to enhance the performance.

Figure 1. System overview. The codebook model is extended with two proposed improvements, (1) a block feature and (2) a texture feature; the particle filter is extended with two proposed improvements, (1) a color-edge-texture histogram and (2) an occlusion handling strategy.

After getting the result from the detection, for the tracking aspect, we also apply two modifications to the color-based particle filter [9] and take the information of the foreground regions into account to generate the weights for tracking. The color-edge-texture histogram overcomes the hijacking, size, and shape problems, while the occlusion handling strategies deal with the occlusion problems.

III. PROPOSED MODIFIED CODEBOOK MODEL FOR FOREGROUND DETECTION

In 2005, Kim et al. [4] presented a new kind of non-parametric algorithm for background subtraction. This model captures periodic motion and handles illumination variations, and it is efficient in memory and speed. For each pixel, it builds a codebook consisting of one or more codewords. Samples at each pixel are clustered into a set of codewords based on a color distortion metric together with brightness bounds.

A. Block-based Model

We decided to use a block-based model, which exploits the local color dependency between neighboring pixels. In this block strategy, two block sizes, 2x2 and 3x3, are considered as the basic units of the codeword.

First of all, we set up a grid on each frame according to the frame size. For sequences whose frame size is less than 320x240, we use the 2x2 block size, which means that one block contains 4 pixels; for sequences whose frame size is greater than or equal to 320x240, we apply the 3x3 block size. The reason we use different block sizes is that the density and weight of a certain pixel differ according to the frame size. In other words, we cannot use a big block to handle a frame of small size, because that would lead to coarse edges and noise.

Figure 2. The frame is divided into blocks; each block is one codeblock.

In the block-based model, each frame is divided into blocks, and the blocks do not overlap each other, as can be seen in Figure 2. Each block is regarded as one codeblock, and we replace the codebook of one single pixel with a codeblock. A certain pixel, the key pixel, stands for each block; its position within the block depends on the block size, as in Figure 3. We use the position of the key pixel to represent the codeblock.

Figure 3. Key pixels (red point) in the 2x2 block and the 3x3 block

When the block is of size $M \times M$ and $X = x_1, x_2, \dots, x_{M \times M}$, the mean value of the key pixel is $\bar{x} = \frac{1}{M \times M} \sum_{i=1}^{M \times M} x_i$. In this way, the former YUV values of the input pixel $x_t$ are replaced by the $(\bar{Y}, \bar{U}, \bar{V})$ values calculated as the average over the block. In other words, only the key pixels need to be modeled, and each key pixel carries the mean value of its block.

Accordingly, the foreground detection strategy is that if an input block cannot find a matching codeword in the codeblock, then all of the members of this block are considered foreground pixels. The codeblock has spatial ability, assuming that neighboring pixels are locally dependent in space. It also provides a morphological function to shrink small noise. Furthermore, only one-fourth or one-ninth of the pixels are modeled, which greatly reduces the memory required to train on a long video sequence and thereby also improves the speed significantly.
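A minimal sketch of this block-based down-sampling, assuming a NumPy frame in YUV order (the helper names, the border handling, and the reading of the block-size rule as an area comparison are our own illustration, not fixed by the paper):

```python
import numpy as np

def block_means(frame_yuv, block):
    """Average each non-overlapping block x block region per channel.

    Returns one (Y, U, V) mean vector per block: the value modeled for
    that block's key pixel, so only 1/block^2 of the pixels enter the
    codebook model.
    """
    h, w, c = frame_yuv.shape
    h -= h % block  # drop the ragged border for simplicity (assumption)
    w -= w % block
    f = frame_yuv[:h, :w].astype(np.float32)
    # Reshape into (rows, block, cols, block, channels) and average
    # over the two block axes.
    f = f.reshape(h // block, block, w // block, block, c)
    return f.mean(axis=(1, 3))

def pick_block(width, height):
    """Block size rule from the paper: 2x2 below 320x240, 3x3 otherwise."""
    return 2 if width * height < 320 * 240 else 3
```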

B. Codebook Model with Oriented-Gradient Feature

Through the block-based down-sampling, we only use the average value of the block and consider the block as the basic unit, which might lower the precision of the foreground objects. Therefore, in order to compensate for the lost information and recover more complete foreground targets, in this section we incorporate both the block feature and a texture feature into foreground detection.

The texture feature used in this paper is motivated by the Histogram of Oriented Gradients (HOG) [10], in which, for each pixel $I(x, y)$, the orientation $\theta(x, y)$ and the magnitude $m(x, y)$ of the gradient are calculated by:

$dx = I(x + 1, y) - I(x - 1, y)$ (3-1)

$dy = I(x, y + 1) - I(x, y - 1)$ (3-2)

$\theta(x, y) = \tan^{-1}(dy/dx)$ (3-3)

$m(x, y) = \sqrt{dx^2 + dy^2}$ (3-4)
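A small sketch of Eqs. (3-1) to (3-4) with central differences (NumPy; the function name is ours, and atan2 is used in place of $\tan^{-1}$ to handle $dx = 0$):

```python
import numpy as np

def gradient_orientation_magnitude(gray):
    """Per-pixel gradient orientation and magnitude, Eqs. (3-1)-(3-4)."""
    g = gray.astype(np.float32)
    dx = np.zeros_like(g)
    dy = np.zeros_like(g)
    dx[:, 1:-1] = g[:, 2:] - g[:, :-2]   # I(x+1, y) - I(x-1, y)
    dy[1:-1, :] = g[2:, :] - g[:-2, :]   # I(x, y+1) - I(x, y-1)
    theta = np.arctan2(dy, dx)           # orientation; atan2 avoids dx = 0
    m = np.sqrt(dx ** 2 + dy ** 2)       # magnitude
    return theta, m
```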

In this way, we first divide the input frames into blocks and set up the key pixels. Consider $X$ to be a training sequence for a single key pixel consisting of $N$ $(\bar{Y}, \bar{U}, \bar{V})$ vectors calculated as the mean values over its block: $X = \{x_1, x_2, \dots, x_N\}$. Let $\mu$ represent the codeblock for this pixel, consisting of $L$ codewords: $\mu = \{c_1, c_2, \dots, c_L\}$. Each codeword $c_i$, $1 \le i \le L$, is composed of $c_i = \langle \check{Y}_i, \hat{Y}_i, \bar{U}_i, \bar{V}_i, \bar{\theta}_i, \bar{m}_i, f_i, \lambda_i, p_i, q_i \rangle$, where, if the position of the key pixel of this block is $(x, y)$, then $\theta(x, y) = \tan^{-1}(dy/dx)$ and $m(x, y) = \sqrt{dx^2 + dy^2}$. The main procedure of the background construction is as follows:


Initialize: $\mu = \{\phi\}$, $L = 0$.

For $t = 1$ to $N$:

Calculate the $\theta$ and $m$ values, let $x_t = (\bar{Y}, \bar{U}, \bar{V})$, and find the codeword $c_m$ in $\mu = \{c_1, c_2, \dots, c_L\}$ matching $x_t$ based on three conditions:

a) $\mathrm{colordist} = (U_t - \bar{U}_i)^2 + (V_t - \bar{V}_i)^2 \le \varepsilon_1$

b) $\mathrm{brightness}(Y_t, \langle \check{Y}_i, \hat{Y}_i \rangle) = \mathrm{true}$

c) orientation-magnitude deviation ratio $\le \varepsilon_2$

If $\mu = \{\phi\}$ or there is no match, then let $L = L + 1$ and create a new codeword $c_L$ by setting $c_L = \langle Y, Y, \bar{U}, \bar{V}, \theta, m, 1, t - 1, t, t \rangle$.

Otherwise, update the matched codeword $c_m$ as:

$c_m \leftarrow \langle \min(Y_t, \check{Y}_m),\ \max(Y_t, \hat{Y}_m),\ \frac{f_m \bar{U}_m + U_t}{f_m + 1},\ \frac{f_m \bar{V}_m + V_t}{f_m + 1},\ \frac{f_m \bar{\theta}_m + \theta_t}{f_m + 1},\ \frac{f_m \bar{m}_m + m_t}{f_m + 1},\ f_m + 1,\ \max(\lambda_m, t - q_m),\ p_m,\ t \rangle$ (3-5)

During the model construction, we calculate the orientation and the magnitude of each key pixel and use a matching strategy for the update. In addition to the two traditional conditions, brightness and colordist, a third condition called the orientation-magnitude deviation ratio (OMDR) is applied to match the texture information of the codeblock. This condition describes the variations of the orientation and the magnitude in one function, as the amount of orientation per magnitude, by using the ratio of orientation to magnitude; thus it is not necessary to match the orientation and the magnitude separately. The condition is shown below:

$\mathrm{OMDR} = \left| \frac{\theta_t}{m_t} - \frac{\bar{\theta}_i}{\bar{m}_i} \right|$ (3-6)

For the background differencing part, after we finish constructing the block-based codebook model with the texture feature: in the traditional model, we check whether a pixel belongs to the background through two conditions, brightness and colordist. If it meets both requirements, the pixel is assumed to be a background pixel, and vice versa. But in our proposal, since we use three conditions to construct the background model, the requirements might be too strict and could bring a lot of noise to low-quality video sequences, because some pixels cannot match all three conditions and might be assumed to be foreground ones.

In order to solve this problem, a matching strategy has been utilized to improve the performance.

For each input block, calculate its $\theta$ and $m$ values and the key-pixel average value $\bar{x} = (\bar{Y}, \bar{U}, \bar{V})$.

For all codewords $c_i$ in $\mu = \{c_1, c_2, \dots, c_L\}$, initialize match_channel = 0 and find the codeword matching $\bar{x}$ based on three conditions:

If colordist $\le \varepsilon_1$, match_channel++;

If brightness = true, match_channel++;

If OMDR $\le \varepsilon_2$, match_channel++;

$\mathrm{BGS}(\bar{x}) = \begin{cases} \text{foreground} & \text{if match\_channel} < 2 \\ \text{background} & \text{if match\_channel} \ge 2 \end{cases}$

In this way, matching any two of the three conditions is sufficient to decide whether a block belongs to the background. This method has several advantages:
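A sketch of this two-out-of-three vote, assuming a codeword object with the fields defined above; the brightness bound follows the original codebook model, the default alpha and beta values are illustrative, and the OMDR form matches Eq. (3-6) as reconstructed above:

```python
def classify_block(x_mean, theta, m, codewords,
                   eps1, eps2, alpha=0.5, beta=1.3):
    """Return True if the block is background: at least two of the
    three conditions must match some codeword (match_channel >= 2)."""
    y, u, v = x_mean
    for cw in codewords:
        match_channel = 0
        # (a) color distortion on the chroma channels
        if (u - cw.u) ** 2 + (v - cw.v) ** 2 <= eps1:
            match_channel += 1
        # (b) brightness bound, as in the original codebook model
        if alpha * cw.y_max <= y <= min(beta * cw.y_max, cw.y_min / alpha):
            match_channel += 1
        # (c) orientation-magnitude deviation ratio (OMDR), Eq. (3-6)
        if abs(theta / max(m, 1e-6) - cw.theta / max(cw.m, 1e-6)) <= eps2:
            match_channel += 1
        if match_channel >= 2:
            return True   # background
    return False          # foreground
```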

If parts of the background in the video sequence undergo sudden illumination variations, the pixels of these regions suffer significant changes in brightness value. In the traditional model, these pixels are regarded as foreground pixels, since the brightness condition cannot be matched due to the large variation. But in our approach, we use YUV space and consider the texture information: though the brightness condition might not be matched, colordist and OMDR can still meet the requirements, so these pixels remain background, which eliminates the influence of shadow and illumination change.

Another improvement is that we can obtain a more complete object region from the segmentation. By slightly reducing the value of $\varepsilon$ in the brightness condition during modeling, more object pixels can be detected without increasing the noise, thanks to the extended condition (OMDR).

IV. PROPOSED MODIFIED HISTOGRAM-BASED PARTICLE FILTER FOR OBJECT TRACKING

The particle filter [6] is based on the Bayes principle and is a sequential Monte-Carlo simulation method in which the probability density is represented by particles. Its basic idea is to find a set of random samples spread over the state space that approximates the posterior probability distribution, and then to use the sample mean instead of the integral operator, thereby obtaining a minimum-variance estimate of the state.
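As a minimal generic sketch of this resample-predict-weight loop (NumPy; the motion noise and the observe callback are placeholders for the transition and observation models defined below, not the paper's exact models):

```python
import numpy as np

def particle_filter_step(particles, weights, observe, motion_sigma=5.0):
    """One sequential Monte-Carlo step: resample, predict, re-weight.

    particles: (N, 2) array of candidate object positions;
    observe:   callable returning the likelihood of one position.
    """
    n = len(particles)
    # Resample proportionally to the previous weights.
    idx = np.random.choice(n, size=n, p=weights / weights.sum())
    particles = particles[idx]
    # Predict: diffuse particles with the transition model p(x_t | x_{t-1}).
    particles = particles + np.random.normal(0, motion_sigma, particles.shape)
    # Update: weight each particle by the observation model p(y_t | x_t).
    weights = np.array([observe(p) for p in particles])
    # The state estimate is the weighted sample mean (replacing the integral).
    estimate = np.average(particles, axis=0, weights=weights)
    return particles, weights, estimate
```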

A. Tracker Initialization

For implementation of particle filter we need the following mathematical model:

The transition model $p(x_t \mid x_{t-1})$ represents how objects move between frames.

The observation model $p(y_t \mid x_t)$ specifies the likelihood of an object being in a specific state.

The initial state $p(x_0)$ describes the initial distribution of object states (such as in Figure 4).

Figure 4. Random particle positions generated by a Gaussian distribution

In this way, after the foreground detection (Figure 5(a)), we apply several steps in order to initialize the tracker positions. First of all, we find contours from connected components: the morphological operation open shrinks areas of small noise to 0, and the morphological operation close then rebuilds the area of the surviving components that was lost in opening. Finally, we take the minimum convex hulls of the segments and thus get the whole shape of the objects (Figure 5(b)).


Figure 5. Foreground object selection: (a) result after foreground detection; (b) result after finding components; (c) result after selecting objects

Then we perform the particle tracker initialization by selecting foreground objects manually according to our objects of interest and drawing the bounding rectangles of the components (Figure 5(c)). We record the positions of the rectangles as the initial positions of the object trackers.

Having finished initializing the trackers, we generate the positions of a number of particles from a Gaussian distribution. In the next step, we calculate the likelihood through the observation model.
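A sketch of this initialization chain with OpenCV (open, close, contours, convex hull, bounding box) followed by Gaussian particle spawning; the kernel size is illustrative, the particle count and sigma match the non-occlusion values of Section V, and scaling the spread by the box size is our interpretation of the searching range:

```python
import cv2
import numpy as np

def init_trackers(fg_mask, n_particles=75, sigma=0.2):
    """From a binary foreground mask, return bounding boxes and, per
    box, particle centers drawn from a Gaussian around the box center."""
    kernel = np.ones((3, 3), np.uint8)                     # illustrative size
    m = cv2.morphologyEx(fg_mask, cv2.MORPH_OPEN, kernel)  # shrink small noise
    m = cv2.morphologyEx(m, cv2.MORPH_CLOSE, kernel)       # rebuild lost area
    contours, _ = cv2.findContours(m, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    trackers = []
    for c in contours:
        hull = cv2.convexHull(c)                 # whole shape of the object
        x, y, w, h = cv2.boundingRect(hull)      # initial tracker position
        cx, cy = x + w / 2.0, y + h / 2.0
        # Spread particles around the center, scaled by the box size.
        particles = np.random.normal([cx, cy],
                                     [sigma * w, sigma * h],
                                     size=(n_particles, 2))
        trackers.append(((x, y, w, h), particles))
    return trackers
```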

B. Color-Edge-Texture Histogram

We propose a color-edge-texture histogram to generate the weights for the observation model.

First of all, we choose the HSV color space to generate our color histogram. The main reason is that the components of RGB space are highly correlated with each other, which is not suitable for tracking under brightness variations, whereas the HSV color space better separates brightness from the other components. However, since there is a high possibility that the moving objects are influenced by shadows during tracking, which leads to significant changes in brightness, we keep only the H and S components of HSV space to describe the color information and ignore the influence of intensity. Moreover, we add an edge local binary pattern to describe the shape texture of the moving objects.

Local binary pattern (LBP) [11] is an effective texture description operator that can measure and extract texture information from the local neighborhood in a gray image. Consider a pixel $(x_c, y_c)$; the LBP value of this pixel is calculated by:

$LBP_{P,R} = \sum_{p=0}^{P-1} s(g_p - g_c) 2^p, \quad s(x) = \begin{cases} 1 & x \ge 0 \\ 0 & \text{otherwise} \end{cases}$ (4-1)

Here $g_p$ and $g_c$ denote the grey values of the corresponding neighboring and center pixels.

Figure 6 also shows how the LBP value is calculated.

Figure 6. Calculation of the LBP value: the 3x3 neighborhood is thresholded against the center pixel, multiplied by powers of two, and summed (in the example, LBP = 8 + 16 + 32 = 56).
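A sketch of the 8-neighbor, radius-1 case of Eq. (4-1) (pure Python; the neighbor ordering is one common convention, which the paper does not fix):

```python
def lbp_8_1(gray, x, y):
    """LBP value of pixel (x, y): threshold the 8 neighbors at radius 1
    against the center and pack the sign bits, per Eq. (4-1)."""
    gc = int(gray[y, x])
    # Clockwise from the top-left neighbor; any fixed ordering works
    # as long as it is used consistently for all pixels.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    value = 0
    for p, (dy, dx) in enumerate(offsets):
        if int(gray[y + dy, x + dx]) >= gc:   # s(g_p - g_c)
            value += 1 << p                    # weight 2^p
    return value
```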

However, suppose we simply calculated the LBP value inside the whole region of the trackers and used it together with the H-S color information to compute the models' weights. Though this might improve the performance slightly, since we consider not only the color data but also the texture, two problems arise.

The first one is time consumption. Since LBP is a pixel-wise coding, if we calculate all the LBP values inside the whole regions of the predicted particle positions, the amount of computation is very large, which severely degrades the real-time ability.

Meanwhile, because the LBP values of the background parts of the regions within the trackers are also calculated, the tracking rate drops greatly when the size of the object changes and the portion of background becomes larger and larger, since the weight of the background increases.

In order to solve these problems, we use the concept of an edge LBP, which focuses only on the edge points of the foreground objects.

We use the Canny edge detector to get the edge points of the foreground parts only, taken from the foreground detection, and calculate the LBP values of these edge points alone. Figure 7 shows examples of such points, which describe the shape of the objects well.

Figure 7. Edge detection for foreground objects

Through the calculation of the foreground edge LBP, our method associates both edge and texture information. Furthermore, since only the foreground regions are taken into consideration, we still refer to the result of the foreground detection during tracking, and only the moving objects contribute to the likelihood of the edge LBP (Figure 8 shows an example).

Figure 8. Original frame and foreground edges

In this way, we use an H-S-ForegroundEdgeLBP histogram with 8 bins per component, i.e. 8x8x8 bins in total (see Figure 9). Particles are then weighted according to the similarity between the target histogram distribution $q(u)$ and the histogram distributions $p(u)$ given by the particles.

Figure 9. H-S-LBP histogram
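A sketch of building this 8x8x8 histogram, reusing the lbp_8_1 helper above; treating each foreground edge pixel as one joint (hue bin, saturation bin, LBP bin) vote is our interpretation, and the Canny thresholds are illustrative:

```python
import cv2
import numpy as np

def hs_lbp_histogram(bgr, fg_mask, region, canny_lo=50, canny_hi=150):
    """8x8x8 joint H-S-edge-LBP histogram of one candidate region,
    normalized to sum to 1."""
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, canny_lo, canny_hi)
    x, y, w, h = region
    hist = np.zeros((8, 8, 8), np.float64)
    for py in range(max(y, 1), min(y + h, bgr.shape[0] - 1)):
        for px in range(max(x, 1), min(x + w, bgr.shape[1] - 1)):
            if edges[py, px] and fg_mask[py, px]:   # foreground edge point
                hb = int(hsv[py, px, 0]) * 8 // 180  # OpenCV hue in [0, 180)
                sb = int(hsv[py, px, 1]) * 8 // 256  # saturation in [0, 256)
                lb = lbp_8_1(gray, px, py) // 32     # 256 LBP codes -> 8 bins
                hist[hb, sb, lb] += 1
    s = hist.sum()
    return hist / s if s else hist
```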

According to the target histogram distribution $q(u)$ and the predicted histogram distributions $p(u)$, the Bhattacharyya coefficient is calculated as in Figure 10.


Bhattacharyya coefficient: $\rho[p, q] = \int \sqrt{p(u) q(u)}\, du$

Distance: $d = \sqrt{1 - \rho[p, q]}$

Figure 10. Calculation of the Bhattacharyya coefficient
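A sketch of the discrete form of this coefficient and of the Gaussian particle weight defined in Eq. (4-2) just below (NumPy; histograms are assumed already normalized, and the default sigma is illustrative):

```python
import numpy as np

def bhattacharyya(p, q):
    """Discrete Bhattacharyya coefficient of two normalized histograms."""
    return float(np.sum(np.sqrt(p * q)))

def particle_weight(p, q, sigma=0.1):
    """Gaussian weight of Eq. (4-2), using d^2 = 1 - rho(p, q)."""
    d2 = 1.0 - bhattacharyya(p, q)
    return np.exp(-d2 / (2.0 * sigma ** 2)) / (np.sqrt(2.0 * np.pi) * sigma)
```

With the 8x8x8 histograms above, p and q are simply the flattened arrays (hist.ravel()).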

Let $q_k$ denote the color histogram of the target at time $k$ and $p_{X_k^{[i]}}$ the histogram of the $i$-th particle $X_k^{[i]}$; the weight of the $i$-th particle is defined as

$w_k^{[i]} = \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{d(X_k^{[i]}, q_k)^2}{2\sigma^2}} = \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{1 - \rho(p_{X_k^{[i]}}, q_k)}{2\sigma^2}}$ (4-2)

C. Occlusion Handling Strategy

In traditional particle filter tracking, if the object undergoes a partial or total occlusion, the observation model turns to the occluder and no longer tracks the previous object. This case is the so-called occlusion or hijacking problem in the tracking area. Therefore, in our proposal, an occlusion handling strategy has been added to improve the performance.

After calculating the maximum weight of the N particles through the previous color-edge-texture histogram, we define a threshold on this weight. If the tracker moves out of the margin of the frame, the object very likely moved out of the sight of the camera (Figure 11(a)), so we delete this tracker. If the tracker is still inside the frame and the maximum weight is greater than the threshold, we update the observation model to the particle with the highest weight and output its position. If the tracker is still inside the frame and the maximum weight is less than the threshold, we keep the previous model of the target; meanwhile, we increase the number of particles together with the searching range in order to find the object in the following frames (Figure 11(b)).
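A sketch of this per-frame decision (the weight threshold of 0.001 and the 75/150 particle counts with sigma 0.2/0.6 follow Section V; the tracker field names are ours, not the paper's):

```python
import numpy as np

def occlusion_step(tracker, particles, weights, frame_w, frame_h,
                   weight_thresh=0.001):
    """One step of the occlusion handling strategy (sketch).

    tracker carries .pos, .model, .n_particles, .sigma (names ours)."""
    if not (0 <= tracker.pos[0] < frame_w and 0 <= tracker.pos[1] < frame_h):
        return None                          # object left the scene: delete
    best = int(np.argmax(weights))
    if weights[best] > weight_thresh:
        tracker.model = particles[best]      # confident: update model/position
        tracker.pos = tuple(particles[best])
        tracker.n_particles, tracker.sigma = 75, 0.2
    else:
        # Likely occluded: keep the previous model, search wider next frame.
        tracker.n_particles, tracker.sigma = 150, 0.6
    return tracker.pos
```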

Figure 11. Special cases in tracking

V. EXPERIMENTAL ANALYSIS

The experiment environment is as follows: Intel Core 2 CPU 6300 @ 1.86 GHz, 3.25 GB RAM. The development tool is VS2008, and the threshold parameters are set as $\varepsilon_2 = 0.35$ and max weight = 0.001. In non-occlusion cases, 75 particles are used and the searching range is generated by an $x \sim N(0, 0.2)$ Gaussian random number; in occlusion cases, 150 particles are used and the searching range is generated by an $x \sim N(0, 0.6)$ Gaussian random number. During our experiments, several methods have been used to evaluate the performance of the proposed algorithm, and we show some representative ones.

Figure 12(f) is the result of our proposed approach for foreground detection. It is obvious that, compared with the original method (see Figure 12(c)) and other improvements based on the original method (see Figures 12(d) and (e)), our method achieves better performance, with very low noise and a more complete object.

Figure 12. Comparisons of foreground detection: (a) original frame in [12]; (b) ground truth; (c) standard codebook; (d) result in [13]; (e) result in [14]; (f) proposed method

Figure 13 shows the ROC curve on the waving tree test sequence. The area under the ROC curve (AUC) of our method is closer to 1 than those of the standard codebook and of another similar work [13], which indicates that our method performs relatively better. Moreover, our work reduces calculation time by more than 40% thanks to the down-sampling.

Figure 13. ROC curve (true positive rate vs. false positive rate) on the waving tree database, comparing the standard codebook with the proposed block-based codebook with texture feature

Then we test the performance of our tracking system. We start with single-object cases in a crowd, using a PETS 2009 test sequence. As can be seen in the first row of Figure 14, several occlusions and crossings happen during this tracking. Moreover, there are other objects with similar colors, but our system shows good robustness when occlusions and crossings happen, compared to the other method [8] (last image of the first row). The tracking rate of our system can almost reach 100% when tracking a single object.

We also test multi-object cases with double crossings and occlusions. These test sequences were captured by ourselves on the Waseda campus.

In the two-person crossing case in the second row of Figure 14, the shape of the man dressed in white changes significantly because he falls. The trackers of the original methods [8][9] get lost due to this size variation, while our method still keeps a high precision.

The third row of Figure 14 shows the multi-object case with double crossings and occlusions. The previous method (last image of this row) suffers hijacking problems, while our result is still acceptable.


Figure 14. Tracking objects by the proposed method and by the color-based method (last row)

Finally, we test the real-time ability of our system. We measure the processing speed of our algorithm for foreground detection and for tracking one to four objects.

In the first case, the video size is 720x576 and the trackers' size is about 40x100; the tracking rates for four objects are around 30 frames per second. In the second case, we test a high-resolution video of size 1280x720 with trackers of about 50x100; the tracking rates are around 20 fps for four objects. The time consumption of our method is much lower than that of [7][8][9]. The timing chart is shown below:

Figure 15. Time consumption per frame (left: 720x576; right: 1280x720) for foreground detection and for tracking one, two, three, and four objects

REFERENCES

[1] D. G. Lowe, "Distinctive Image Features from Scale-Invariant Keypoints", International Journal of Computer Vision, 2004.

[2] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, "SURF: Speeded Up Robust Features", Computer Vision and Image Understanding (CVIU), vol. 110, no. 3, pp. 346-359, 2008.

[3] Z. Zivkovic and F. van der Heijden, "Efficient Adaptive Density Estimation per Image Pixel for the Task of Background Subtraction", Pattern Recognition Letters, vol. 27, no. 7, pp. 773-780, 2006.

[4] K. Kim, T. H. Chalidabhongse, D. Harwood, and L. Davis, "Real-time Foreground-Background Segmentation using Codebook Model", Elsevier Real-Time Imaging, vol. 11, no. 3, pp. 167-256, June 2005.

[5] P. Gutman and M. Velger, "Tracking Targets Using Adaptive Kalman Filtering", IEEE Transactions on Aerospace and Electronic Systems, vol. 26, no. 5, pp. 691-699, 1990.

[6] B. Ristic, "Beyond the Kalman Filter: Particle Filters for Tracking Applications", Artech House, 2004.

[7] L. M. Fuentes and S. A. Velastin, "People Tracking in Surveillance Applications", Image and Vision Computing, pp. 1165-1171, 2006.

[8] T. Yang, Q. Pan, J. Li, and S. Z. Li, "Real-time Multiple Objects Tracking with Occlusion Handling in Dynamic Scenes", in CVPR, vol. 1, pp. 970-975, 2005.

[9] R. Hess and A. Fern, "Discriminatively Trained Particle Filters for Complex Multi-Object Tracking", in CVPR, 2009.

[10] N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection", in CVPR, 2005.

[11] T. Ojala, M. Pietikainen, and D. Harwood, "A Comparative Study of Texture Measures with Classification Based on Feature Distributions", Pattern Recognition, vol. 29, pp. 51-59, 1996.

[12] K. Toyama, J. Krumm, B. Brumitt, and B. Meyers, "Wallflower: Principles and Practice of Background Maintenance", in Proc. ICCV, 1999.

[13] Q. Tu, Y. Xu, and M. Zhou, "Box-based Codebook Model for Real-time Objects Detection", in WCICA 2008.

[14] M. H. Sigari and M. Fathy, "Real-time Background Modeling/Subtraction using Two-Layer Codebook Model", in IMECS 2008, vol. 1, pp. 717-720, Hong Kong.