
Pattern Recognition 108 (2020) 107355

Contents lists available at ScienceDirect

Pattern Recognition

journal homepage: www.elsevier.com/locate/patcog

Abnormal event detection in surveillance videos based on low-rank and compact coefficient dictionary learning

Ang Li a,b, Zhenjiang Miao a,b, Yigang Cen a,b,∗, Xiao-Ping Zhang c, Linna Zhang d, Shiming Chen e

a Institute of Information Science, Beijing Jiaotong University, Beijing, 100044, China
b Beijing Key Laboratory of Advanced Information Science and Network Technology, Beijing, 100044, China
c Department of Electrical, Computer, and Biomedical Engineering, Ryerson University, Toronto, ON M5B 2K3, Canada
d College of Mechanical Engineering, Guizhou University, Guiyang, 550025, China
e School of Electrical & Electronic Engineering, East China Jiaotong University, Nanchang, 330013, China

Article history: Received 30 July 2019; Revised 27 March 2020; Accepted 29 March 2020; Available online 11 July 2020

Keywords: LRCCDL; Reconstruction cost; Abnormal event detection; Crowded scenes; Surveillance videos

Abstract

In this paper, a novel approach to abnormal event detection in crowded scenes is presented based on a new low-rank and compact coefficient dictionary learning (LRCCDL) algorithm. First, based on the background subtraction and binarization of surveillance videos, we construct a feature space by extracting the histogram of maximal optical flow projection (HMOFP) feature of the foreground from a normal training frame set. Second, in the training stage, a new joint optimization of the nuclear norm and l2,1-norm is applied to obtain a compact coefficient low-rank dictionary. Third, in the detection stage, l2,1-norm optimization is utilized to obtain the reconstruction coefficient vectors of the testing samples. Note that the l2,1-norm forces the reconstruction vectors of all the testing samples to compactly surround the same center as in the training stage, such that the reconstruction errors of abnormal testing samples are different from those of normal ones. Finally, a reconstruction cost (RC) is introduced to detect abnormal frames. Experimental results on both global and local abnormal event detection show the effectiveness of our algorithm. Based on comparisons with state-of-the-art methods employing various criteria, the proposed algorithm achieves comparable detection results.

© 2020 Elsevier Ltd. All rights reserved.

1. Introduction

In recent years, abnormal event detection has become a research hotspot in the fields of computer vision (CV) and pattern recognition (PR). As a result of the reduction of surveillance equipment cost and the significant improvement of public safety awareness, it has become very common for surveillance cameras to be deployed in public areas, such as train stations, airports, museums, stadiums, and markets. In most surveillance systems, cameras in public areas are monitored by human operators who closely observe the monitor screens to identify abnormal events. With the amount of surveillance equipment increasing, the demand for accurate, automated anomaly detection methods grows, since manual monitoring is very inefficient. Additionally, spending a long time watching monitors is tedious work. Moreover, the staff cannot always focus their attention on the monitors, so it is easy to miss some anomalous events [1].

∗ Corresponding author.
E-mail addresses: [email protected] (A. Li), [email protected] (Z. Miao), [email protected] (Y. Cen), [email protected] (X.-P. Zhang), [email protected] (L. Zhang), [email protected] (S. Chen).

https://doi.org/10.1016/j.patcog.2020.107355
0031-3203/© 2020 Elsevier Ltd. All rights reserved.

Depending on the areas of occurrence of the abnormal behaviors of a crowd, the detection objects can be separated into two main classes, i.e., global abnormal events (GAE) and local abnormal events (LAE). GAE denotes that the whole detection scene is abnormal, and LAE denotes that the abnormal events occur in some local parts of the detection scene.

The crowd usually has a high density in crowded scenes, and traditional crowd analysis algorithms are usually confronted with difficult situations because of the serious overlapping of pedestrians. Depending on the different established models, crowd video analysis methods fall into three main classes: (1) microscopic modeling, such as frameworks based on particle filters; (2) macroscopic modeling based on low-level features such as the spatial-temporal gradient and optical flow; and (3) crowd event detection [2,3]. Based on the developments in related fields, such as data mining (DM), artificial intelligence (AI), computational intelligence (CI), soft computing (SC), image signal processing (ISP), mathematical modeling (MM), CV, and PR, the research on abnormal event detection has shown a positive evolution in the last decade [4]. In particular, researchers have proposed a substantial number of automated techniques for crowd analysis in the CV and PR fields, such as tracking and video concept detection models and models to estimate the density of people and to understand the behaviors of crowds [5–7].

Generally, based on treating the crowd as a single entirety in a specific scene, researchers analyze the motion of the crowd and update the status of the crowd as abnormal or normal depending on the dynamics emanating from the entire crowd. Nevertheless, in conditions where the motion of a crowd is random and the crowd motion pattern is unstructured, the methods proposed for structured crowded scenes, such as [8], show a lack of effectiveness [9].

In addition, despite the considerable developments achieved in the field of human activity analysis, the task of modeling and understanding the behaviors of a crowd remains immature.

Nowadays, the development of low-rank matrix theory is significant and attracts more and more researchers' attention [10–14], and this theory is utilized in our work. In this paper, we propose a novel solution to detect both global and local anomalies in surveillance systems. Our main contributions in this paper are summarized as follows:

(1) To remove the low variations and noise of objects in the background, we extract the motion descriptor of the foreground by integrating background subtraction with binarization of surveillance videos.

(2) In the training stage, to obtain a low-rank dictionary based on the similarity of normal training samples and, at the same time, a compact cluster of reconstruction coefficient vectors surrounding a center, we propose a new joint optimization of the nuclear norm and l2,1-norm.

(3) In the detection stage, to obtain a large gap between the reconstruction errors of abnormal testing samples and those of normal testing samples, we force the reconstruction coefficient vectors of abnormal frames to distribute like those of normal ones by solving an l2,1-norm optimization problem.

The work in this paper is an extension of our previous method published in [15]. The improvements over [15] are as follows: (1) To avoid the extra preprocessing for denoising and to enhance the robustness of our model, we improve the model by adding new terms representing noise in real situations and the related parameters. Moreover, we present the process of solving the optimization problems in detail and analyze the parameters in Algorithms 1 and 2, which was not presented in [15]. (2) Besides the UMN dataset [16], we add the PETS2009 dataset [17] in the experiments to detect anomalies at the global scale, and the UCSD [18] and CUHK Avenue [19] datasets to validate the effectiveness of our new method for local abnormal event detection and localization. (3) We provide a more comprehensive introduction. Also, a new section of related work elaborating on previous related works and another new section of problem formulation and motivation introducing our solution and explaining the rationale of our method are added to the paper.

We organize the rest of this paper as follows. Section 2 briefly reviews the related works. Section 3 describes the problem formulation. Section 4 presents the algorithm we propose for anomaly detection in detail. We provide the experimental results of abnormal event detection at both global and local scales and the comparisons with the state-of-the-art methods in Section 5. Finally, some conclusions are presented in Section 6.

2. Related work

In recent years, many works have been undertaken and much progress has been achieved in the area of video surveillance. Kosmopoulos and Chatzis [20] described a pixel-level model by utilizing holistic visual behavior understanding methods. Mehran et al. [21] and Yen and Wang [22] introduced an anomaly detection method for crowded scenes, which was named the social force model. In the social force model, based on optical flow analysis, individuals were treated as moving particles, and the social force was the interaction force between every two particles. Furthermore, Zhang et al. [23,24] proposed an extended model named the social attribute-aware force model, and Chaker et al. [25] proposed an unsupervised approach for crowd scene anomaly detection and localization using a social network model. Lee et al. [26] devised a motion influence map algorithm to describe human activities and detect abnormal events. Amraee et al. [27] utilized the histogram of oriented gradients (HOG) descriptor and a Gaussian model to detect anomalies. Depending on the spatial pyramid matching (SPM) kernel-based BoW model, Hung et al. [28] extracted the SIFT feature to represent the motion of a crowd. Sandhan et al. [29] proposed an unsupervised learning algorithm for anomaly detection based on the fact that, in general human perception, normal events occur frequently while rarely occurring events are abnormal. By leveraging both labeled and unlabeled segments, Tziakos et al. [30] discovered the projection subspace associated with detectors to tackle the problem that information about abnormal events was not available and labeled information about normal events was limited. Haque and Murshed [31] presented an algorithm to detect abnormal events without using any motion or tracking feature. Shi et al. [32] proposed a model for abnormal event detection utilizing the developed spatiotemporal co-occurrence Gaussian mixture models (STCOG). Based on the characteristics of the dynamics and density of a crowd, Yin et al. [33] introduced a method to increase the information content by increasing the dimension of the motion feature. Mahadevan et al. [34] leveraged the dynamic texture of the normal behaviors of a crowd to form a mixture model. Singh and Mohan [35] proposed an approach for abnormal activity recognition based on graph formulation of video activities and a graph kernel support vector machine.

Some models that detect abnormal events using the concept of entropy are emerging. Taking advantage of the Gaussian mixture model (GMM) and particle entropy, Gu et al. [36] presented a method to represent the distribution of the crowd in crowded scenes. Due to the random motion patterns of a crowd in abnormal situations, Lee et al. [37] described a general-purpose human motion analysis (HMA) method based on statistics and entropy.

The low-level optical flow feature can reflect the relative distance of moving objects in a specific scene at two different moments at the pixel level, which is useful and important in anomaly detection for video surveillance. Wang and Snoussi [38] described a global optical flow orientation histogram-based model. Based on the motion feature denoted as the histogram of maximal optical flow projection (HMOFP), Li et al. [15,39–42] proposed models to describe the crowd motion status and detect anomalies in crowded scenes. Patil and Biswa [43] utilized the histogram of the magnitude and orientation of optical flow to capture the motion of a crowd. Furthermore, Colque et al. [44] proposed a similar spatiotemporal motion descriptor named the histogram of optical flow orientation, magnitude and entropy, based on the information of optical flow and entropy. Zhang et al. [45] presented an anomaly detection framework integrating the motion feature in terms of optical flow and appearance cues. Based on the divergence and curl of the optical flow field, Chen and Lai [46] proposed a divergence-curl-driven framework for the perception of crowd motion states.


Recently, redundant dictionary-based sparse representation has attracted ever-increasing attention and is different from most existing anomaly detection methods. By applying sparse subspace clustering, Ren and Moeslund [47] proposed a dictionary learning-based algorithm to detect anomalies in crowded scenes. Cong et al. [48] described a reconstruction model based on a dictionary and utilized the sparse reconstruction cost (SRC) to detect abnormal events. For abnormal event detection in crowded scenes, Yuan et al. [49] optimized a structured dictionary learning framework and sparse representation coefficients through an iterative updating strategy. Moreover, to accomplish the dictionary construction, some dictionary learning methods were presented, such as nonnegative matrix factorization (NMF) [50], the K-SVD algorithm [51,52], latent dictionary learning (LDL) [53], and Fisher discrimination dictionary learning (FDDL) [54].

In recent times, in addition to the hand-crafted feature-based methods above, some researchers have established new frameworks in the field of anomaly detection with the general application of deep learning-based methods. Huang et al. [55] presented a multimodal fusion scheme based on convolutional restricted Boltzmann machines. Ravanbakhsh et al. [56] proposed a plug-and-play convolutional neural network (CNN)-based method for crowd motion analysis. Autoencoder models based on CNNs for abnormal event detection were introduced in [57–60]. Furthermore, an end-to-end deep network based on an autoencoder framework, called a fully convolutional network (FCN), was presented in [61,62]. Liu et al. [63] proposed a new baseline for anomaly detection named future frame prediction. In addition, Wang et al. [64] utilized the extreme learning machine (ELM), a single-layer neural network, to detect and localize abnormal events. For anomaly detection, Sun et al. [65] introduced a neural network-based model called online growing neural gas (online GNG) to perform unsupervised learning.

In these previous works, the hand-crafted feature-based methods can be classified into two types. The first type addresses the representation of the motion descriptor of a crowd. The second type addresses the model used to decide whether an event is normal. These methods focused solely on one part of the detection framework; in other words, in the process of modeling, the information of the motion descriptor was not fully exploited. In the deep learning-based methods, most anomaly detection is based on the reconstruction of regular training data. Even though these methods assume that abnormal events would correspond to larger reconstruction errors due to the good capacity and generalization of a deep neural network, this assumption does not necessarily hold. Therefore, the reconstruction errors of normal and abnormal events can be similar, resulting in less discrimination [63]. On the other hand, these models perform extremely well in domains with large amounts of training data. With limited training data, however, they are prone to overfitting. This limitation arises often in the abnormal event detection task, where the scarcity of real-world training examples is a major constraint [56].

In this paper, in the process of constructing the detection model, we utilize the characteristics of the training samples, i.e., we propose an algorithm to conduct low-rank dictionary learning based on the similarity of the features of the training data. Note that the amount of training data is far less than that of deep learning-based methods. In addition, we add the l2,1-norm to constrain the reconstruction coefficient vectors to obtain a compact cluster in the training stage. Different from the detection models described in previous work, in the detection stage we force the reconstruction coefficient vectors of all the testing samples to have a distribution similar to that of the training samples. Thus, abnormal samples have large reconstruction errors, and we can detect anomalies by the value of the reconstruction cost.

3. Problem formulation and motivation

Considering that anomaly detection is applied in different scenes, we define the abnormal event detection problem as follows. Note that, in this paper, a sample denotes the motion feature of an original frame in global abnormal event detection or of one patch of a frame in local abnormal event detection. For convenience, a sample denotes the motion feature vector of a frame in this section.

Assume that we have a training frame set denoted as F = [f_1, f_2, ..., f_{N_0}], where N_0 denotes the number of training frames. The corresponding training sample set is denoted as H = [H_1, H_2, ..., H_{N_0}], where H_i ∈ R^M denotes the motion feature vector describing a normal training sample and M is the dimension of the motion feature. Suppose that we have a testing frame f_t, where the subscript "t" denotes "testing". To obtain the detection result of f_t, we should design a discrimination function as follows:

f : f_t → {normal, abnormal}   (1)

Based on our previous works in [15,40–42], this can be realized by sparse representation with an overcomplete dictionary D ∈ R^{M×K}.

Suppose that the motion feature of f_t is H_t, and its sparse representation coefficient vector over D is z_t. Then, H_t can be reconstructed by Ĥ_t = D z_t. In general, the reconstruction cost can be used to determine whether a testing sample is abnormal or not. It is usually expressed by the reconstruction error, i.e., ‖Ĥ_t − H_t‖_2 = ‖H_t − D z_t‖_2. If the reconstruction cost is larger than a threshold value, f_t is detected as an abnormal frame; otherwise, it is a normal frame.

As we know, the distribution of the coefficient vectors of abnormal testing samples is different from that of normal testing samples, which have small reconstruction errors. To enlarge the gap between the reconstruction errors of abnormal testing samples and those of normal testing samples, i.e., to obtain larger reconstruction errors when testing samples are abnormal, we can force the reconstruction coefficient vectors of abnormal frames to distribute similarly to those of normal ones by solving an l2,1-norm optimization problem. Thus, we can adopt the value of the reconstruction cost as an abnormal event measurement to tackle the binary classification problem.

More concretely, for an abnormal sample, when we calculate the reconstruction coefficient vectors over the dictionary trained on normal samples, we can force the obtained coefficient vectors to be closer to the center of the normal samples' coefficient vectors through some special constraint conditions. In fact, according to the nature of sparse representation, once a dictionary is trained, the dictionary will represent any sample (normal or abnormal) as well as it can, no matter what kind of sample is input; the difference is just the value of the reconstruction error. For an abnormal sample, the reconstruction error based on a dictionary trained only on normal samples will be large. If we then force the sparse representation coefficient vector of the abnormal sample to be closer to the center of the normal samples' coefficient vectors, this causes a severe distortion of the reconstructed sample. As a result, the reconstruction error of the abnormal sample becomes even larger. In this way, the accuracy of abnormal event detection can be improved.
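As a minimal illustration of this detection rule, the sketch below codes a test feature over a fixed dictionary while pulling the coefficient vector toward the training-stage center, then thresholds the reconstruction error. The function names, the ridge-style penalty `lam`, and the threshold are hypothetical; the paper's actual coefficients come from its l2,1-norm optimization, which is only approximated here by a regularized least-squares pull toward the center c.

```python
import numpy as np

def reconstruction_cost(H_t, D, c, lam=0.1):
    """Code H_t over dictionary D while pulling the coefficient vector
    toward the training-stage center c, then return ||H_t - D z||_2.
    (A sketch, not the paper's exact l2,1-norm solver.)"""
    K = D.shape[1]
    # Regularized least squares: min_z ||H_t - D z||^2 + lam ||z - c||^2
    A = D.T @ D + lam * np.eye(K)
    b = D.T @ H_t + lam * c
    z = np.linalg.solve(A, b)
    return np.linalg.norm(H_t - D @ z)

def classify(H_t, D, c, threshold, lam=0.1):
    # Frames whose reconstruction cost exceeds the threshold are abnormal.
    return "abnormal" if reconstruction_cost(H_t, D, c, lam) > threshold else "normal"
```

A feature well spanned by the normal-event dictionary yields a small cost, while one outside its column space yields a cost near its own norm, which is the gap the detector thresholds.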

4. Proposed method

4.1. Motion feature extraction

The motion information of any two consecutive frames can be reflected by the optical flow field, which describes the directions and amplitudes of the moving objects in a scene. In our paper, the Horn-Schunck (HS) method is adopted to obtain the optical flow


Fig. 1. (a) The original frame. (b) The corresponding binary frame after the process of background subtraction. (c) The process to compute the HMOFP feature.


field of frame images. As shown in Fig. 1(a) and (b), we first obtain the optical flow field of the original frame. At the same time, background subtraction with the nonparametric kernel density estimation method [66] and a binarization operation are applied to the original frame to obtain the corresponding binary frame. Then, the optical flow vector's amplitude at each pixel is modified according to the binary frame, i.e., if the corresponding pixel's gray value in the binary frame is 255, the optical flow vector remains unchanged; otherwise, the optical flow vector is set to a zero vector. This processing can eliminate the influence of low variations and noise from the background, such as the optical flow vectors caused by changes of illumination in the background areas. Based on the optical flow field, we use the optimized HMOFP [15,39–42] as the motion feature descriptor, which is computed from the binary frame. The whole process to obtain HMOFP is illustrated in Fig. 1(c).
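The foreground-masking step can be sketched as follows. Here `flow_u`/`flow_v` are assumed dense optical-flow components; the histogram shown is only a rough reading of HMOFP inferred from its name (per orientation bin, keep the maximal flow amplitude), since the exact computation is defined in [15,39–42] and Fig. 1(c), not reproduced here.

```python
import numpy as np

def mask_flow(flow_u, flow_v, binary_frame):
    """Zero out optical flow everywhere outside the foreground
    (pixels whose gray value in the binary frame is 255)."""
    fg = (binary_frame == 255)
    return flow_u * fg, flow_v * fg

def hmofp_like(flow_u, flow_v, num_bins=16):
    """Hypothetical HMOFP-style descriptor: for each orientation bin,
    keep the maximal amplitude among flow vectors falling in that bin."""
    ang = np.arctan2(flow_v, flow_u)   # orientation of each flow vector
    mag = np.hypot(flow_u, flow_v)     # amplitude of each flow vector
    bins = ((ang + np.pi) / (2 * np.pi) * num_bins).astype(int) % num_bins
    hist = np.zeros(num_bins)
    np.maximum.at(hist, bins.ravel(), mag.ravel())  # per-bin maximum
    return hist
```

The masking makes background pixels contribute zero amplitude, so illumination flicker in the background cannot dominate any orientation bin.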

As shown in Fig. 2, we introduce two spatial bases, Type A [48] and Type B, for the detection of global and local abnormal events, respectively. The relationship between Type A and Type B is also shown in Fig. 2. The spatial basis Type A is chosen to represent the global motion feature of a frame, and we can obtain m_1 × n_1 patches of the frame. We compute the HMOFP feature of each patch and concatenate the m_1 × n_1 feature vectors to construct the total HMOFP of the whole frame. For LAE, abnormal detection is based on the image patches of a frame. Similar to the detection of GAE, a patch is divided into m_2 × n_2 cells, and the way to extract the HMOFP feature based on this basis is the same as that of the Type A basis. In our framework, when we deal with local abnormal event detection, we treat a patch as a frame of small size. In other words, we convert local abnormal event detection into a global-scale problem. Note that in the detection of LAE, we only consider the patches in the foreground, corresponding to the locations where the pixels' gray values are 255.
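The Type A feature construction described above amounts to splitting a frame into a grid of patches and concatenating per-patch descriptors. A minimal sketch (the function name, the grid sizes `m1`/`n1`, and the caller-supplied `patch_feature` callback are illustrative assumptions):

```python
import numpy as np

def grid_concat_feature(frame, m1, n1, patch_feature):
    """Split `frame` into an m1 x n1 grid of patches and concatenate
    the per-patch feature vectors (e.g., HMOFP) into one descriptor."""
    feats = []
    for row in np.array_split(frame, m1, axis=0):      # m1 bands of rows
        for patch in np.array_split(row, n1, axis=1):  # n1 patches per band
            feats.append(patch_feature(patch))
    return np.concatenate(feats)
```

For LAE, the same routine would be applied to a single patch with an m2 x n2 cell grid, which is how the paper reduces local detection to the global case.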

4.2. Anomaly detection based on LRCCDL

4.2.1. Training stage

Considering the initial training data set

T R = [ t r 1 , t r 2 , ..., t r N 0 ] (2)

where tr i (1 ≤ i ≤ N 0 ) denotes a single frame in global abnormal

event detection or a set of patches of the i th frame in local abnor-

mal event detection. Since local event abnormal detection can be

treated as a special kind of global abnormal event detection, we

introduce the anomaly detection of global abnormal events. The

corresponding feature pool of TR is H. We leverage the method

in [42] to obtain the optimized feature pool, which is denoted as

H

∗ ∈ R

M×K 0 ( K < N ) . H

∗ is such a set that the columns never uti-

0 0

ized to represent the others in H are deleted. H

∗is a compact set

nd has a better ability to represent normal events. In the training

rocess, since the original frames in the initial training data set

ave a similar visual appearance except the areas of background,

he motion feature vectors of normal events are similar in the fea-

ure pool H

∗. Our dictionary leaning is completed based on such a

raining sample set, and the output dictionary of dictionary learn-

ng has a low-rank characteristic. Furthermore, we control the dis-

ribution of the reconstruction coefficient vectors and make them

ompact. Based on the previous work in [15] , we utilize the fol-

owing model, i.e., the low-rank and compact coefficient dictionary

earning (LRCCDL) method, to obtain a low-rank dictionary and a

ompact cluster of reconstruction coefficient vectors at the same

ime:

in

D,Z,E ‖

D ‖ ∗ + α‖

Z − C ‖ 2 , 1 + β‖

E ‖ 2 , 1

s . t . H

∗ = DZ + E (3)

where D ∈ R^{M×K} is the reconstruction dictionary with a low-rank structure and K is the number of columns of D. Z ∈ R^{K×K_0} is the reconstruction coefficient matrix. C ∈ R^{K×K_0} is a cluster center matrix, and each column of C is the mean vector of the columns of Z, which is denoted as c. E ∈ R^{M×K_0} is the reconstruction error matrix, and α and β are two regularization parameters. ‖·‖_* denotes the nuclear norm of a matrix, i.e., the sum of the matrix's singular values, which approximates the rank of the matrix. ‖A‖_{2,1} = Σ_j ‖[A]_{:,j}‖_2 = Σ_j √(Σ_i ([A]_{i,j})²) is defined as the l_{2,1}-norm of matrix A; it encourages each column of A to be zero [67]. The aim of ‖D‖_* is to enforce the low-rank structure of D. Each column in H* has a corresponding vector in Z and in E. ‖Z − C‖_{2,1} makes the reconstruction coefficient vectors of the columns in H* similar so that they compactly surround the center c, and ‖E‖_{2,1} regularizes the reconstruction error of each column in H* so that it is as close to zero as possible.
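To make the norms in (3) concrete, the objective value can be evaluated directly in NumPy. This is an illustrative sketch of ours, not code from the paper; the function name is our choice:

```python
import numpy as np

def lrccdl_objective(D, Z, E, C, alpha, beta):
    """Value of the objective in Eq. (3): ||D||_* + alpha*||Z - C||_{2,1} + beta*||E||_{2,1}."""
    nuclear = np.linalg.norm(D, ord='nuc')             # sum of singular values
    l21 = lambda A: np.sum(np.linalg.norm(A, axis=0))  # sum of column l2 norms
    return nuclear + alpha * l21(Z - C) + beta * l21(E)
```

Minimizing the l_{2,1} terms drives whole columns of Z − C and E toward zero, which is what makes the coefficient cluster compact.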

To solve (3), we first convert it as follows:

min_{D,Z,E,J_1,J_2} ‖J_1‖_* + α‖J_2 − C‖_{2,1} + β‖E‖_{2,1}    s.t. H* = DZ + E, D = J_1, Z = J_2    (4)

(4) can be solved through the following equivalent augmented Lagrange multiplier (ALM) problem:

min_{D,Z,E,J_1,J_2,Y_1,Y_2,Y_3} ‖J_1‖_* + α‖J_2 − C‖_{2,1} + β‖E‖_{2,1} + tr[Y_1^T(H* − DZ − E)] + tr[Y_2^T(D − J_1)] + tr[Y_3^T(Z − J_2)] + (μ/2)(‖H* − DZ − E‖_F² + ‖D − J_1‖_F² + ‖Z − J_2‖_F²)    (5)

where Y_1, Y_2 and Y_3 are Lagrange multipliers, μ > 0 is a penalty parameter, and (·)^T denotes matrix transposition. (5) can be solved by the inexact ALM algorithm [68] as follows. We can rewrite (5) as

L = ‖J_1‖_* + α‖J_2 − C‖_{2,1} + β‖E‖_{2,1} + tr[Y_1^T(H* − DZ − E)] + tr[Y_2^T(D − J_1)] + tr[Y_3^T(Z − J_2)] + (μ/2)(‖H* − DZ − E‖_F² + ‖D − J_1‖_F² + ‖Z − J_2‖_F²)    (6)

To solve (5), we differentiate (6) and update one variable at a time with the others fixed at their most recent values.

A. Li, Z. Miao and Y. Cen et al. / Pattern Recognition 108 (2020) 107355

Fig. 2. The Type A basis corresponding to the detection of GAE and the Type B basis corresponding to the detection of LAE.

Algorithm 1 Solving Problem (3) by Inexact ALM.
Input: matrix H*, initial dictionary D = U (where U comes from [U, Σ, V] = svd(H*)), parameters α, β
Output: D, C
Initialize: Z = 0, E = 0, J_1 = 0, J_2 = 0, Y_1 = 0, Y_2 = 0, Y_3 = 0, μ = 10^{−6}, μ̄ = 10^{30}, ρ = 1.1, ε = 10^{−6}
while not converged do
  1. Fix the others and update J_1 by Eq. (9)
  2. Fix the others and update Z by Eq. (11)
  3. Fix the others and update C using Z by Eq. (12)
  4. Fix the others and update J_2 by Eq. (15)
  5. Fix the others and update E by Eq. (18)
  6. Fix the others and update D by Eq. (20)
  7. Update the multipliers:
     Y_1 = Y_1 + μ(H* − DZ − E)
     Y_2 = Y_2 + μ(D − J_1)
     Y_3 = Y_3 + μ(Z − J_2)
  8. Update the parameter μ by μ = min(ρμ, μ̄)
  9. Check the convergence conditions:
     ‖H* − DZ − E‖_∞ < ε and ‖D − J_1‖_∞ < ε and ‖Z − J_2‖_∞ < ε
end while

Step 1: update J_1.

∂L/∂J_1 = ∂‖J_1‖_*/∂J_1 + ∂tr[Y_2^T(D − J_1)]/∂J_1 + (μ/2) ∂‖D − J_1‖_F²/∂J_1
        = ∂‖J_1‖_*/∂J_1 + μ(J_1 − (D + Y_2/μ)) = 0    (7)

Integrating with respect to J_1, we obtain the equivalent subproblem

(1/μ)‖J_1‖_* + (1/2)‖J_1 − (D + Y_2/μ)‖_F²    (8)

Therefore, we can obtain

J_1 = argmin_{J_1} (1/μ)‖J_1‖_* + (1/2)‖J_1 − (D + Y_2/μ)‖_F²    (9)
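Subproblem (9) is the proximal operator of the nuclear norm and has a closed-form solution via the singular value thresholding operator [69]. A minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def svt(M, tau):
    """Prox of tau*||.||_* at M: shrink each singular value of M by tau."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt
```

For Eq. (9), J_1 = svt(D + Y_2/μ, 1/μ); small singular values are zeroed, which is what keeps J_1 (and hence D) low-rank.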

Step 2: update Z.

∂L/∂Z = ∂tr[Y_1^T(H* − DZ − E)]/∂Z + ∂tr[Y_3^T(Z − J_2)]/∂Z + (μ/2)(∂‖H* − DZ − E‖_F²/∂Z + ∂‖Z − J_2‖_F²/∂Z)
      = −D^T Y_1 + Y_3 − μ(D^T H* − D^T E + J_2) + μ(D^T D + I)Z = 0    (10)

Therefore, we can obtain

Z = (D^T D + I)^{−1}[D^T H* − D^T E + J_2 + (D^T Y_1 − Y_3)/μ]    (11)

Step 3: update C.

We update each column of C as follows:

[C]_{:,j} = (1/K_0) Σ_t [Z]_{:,t}    (12)

where (:, j) denotes the jth column of matrix C. (12) implies that all the columns of matrix C are equal.

Step 4: update J_2.

∂L/∂J_2 = α ∂‖J_2 − C‖_{2,1}/∂J_2 + ∂tr[Y_3^T(Z − J_2)]/∂J_2 + (μ/2) ∂‖Z − J_2‖_F²/∂J_2
        = α ∂‖J_2 − C‖_{2,1}/∂J_2 + μ[J_2 − (Z + Y_3/μ)] = 0    (13)

Integrating with respect to J_2, we obtain the equivalent subproblem

(α/μ)‖J_2 − C‖_{2,1} + (1/2)‖J_2 − (Z + Y_3/μ)‖_F²    (14)

Therefore, we can obtain

J_2 = argmin_{J_2} (α/μ)‖J_2 − C‖_{2,1} + (1/2)‖J_2 − (Z + Y_3/μ)‖_F²    (15)
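Subproblems (15) and (18) are both proximal operators of the l_{2,1}-norm, which shrink each column of the input toward zero by its l_2 norm; for (15), the shift by C is handled by substituting X = J_2 − C. A minimal sketch of ours (the function name is an assumption, not from the paper):

```python
import numpy as np

def prox_l21(Q, tau):
    """argmin_X tau*||X||_{2,1} + 0.5*||X - Q||_F^2: column-wise shrinkage."""
    X = np.zeros_like(Q)
    for j in range(Q.shape[1]):
        n = np.linalg.norm(Q[:, j])
        if n > tau:
            X[:, j] = (1.0 - tau / n) * Q[:, j]  # shrink; columns with n <= tau become zero
    return X
```

For Eq. (15), J_2 = C + prox_l21(Z + Y_3/μ − C, α/μ); for Eq. (18), E = prox_l21(H* − DZ + Y_1/μ, β/μ).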

Step 5: update E.

∂L/∂E = β ∂‖E‖_{2,1}/∂E + ∂tr[Y_1^T(H* − DZ − E)]/∂E + (μ/2) ∂‖H* − DZ − E‖_F²/∂E
      = β ∂‖E‖_{2,1}/∂E + μ[E − (H* − DZ + Y_1/μ)] = 0    (16)

Integrating with respect to E, we obtain the equivalent subproblem

(β/μ)‖E‖_{2,1} + (1/2)‖E − (H* − DZ + Y_1/μ)‖_F²    (17)

Therefore, we can obtain

E = argmin_E (β/μ)‖E‖_{2,1} + (1/2)‖E − (H* − DZ + Y_1/μ)‖_F²    (18)

Step 6: update D.

∂L/∂D = ∂tr[Y_1^T(H* − DZ − E)]/∂D + ∂tr[Y_2^T(D − J_1)]/∂D + (μ/2)(∂‖H* − DZ − E‖_F²/∂D + ∂‖D − J_1‖_F²/∂D)
      = μD(ZZ^T + I) − [Y_1 Z^T − Y_2 + μ(H* Z^T + J_1 − E Z^T)] = 0    (19)

Therefore, we can obtain

D = [H* Z^T + J_1 − E Z^T + (Y_1 Z^T − Y_2)/μ](ZZ^T + I)^{−1}    (20)

Our dictionary learning method, i.e., LRCCDL, is described in detail in Algorithm 1. When this iterative process ends, we obtain the low-rank dictionary D and the mean vector of Z, denoted as c, i.e., any column of C.

We can solve optimization problem (3) via either exact or inexact ALM [68]. We choose the inexact ALM algorithm for its efficiency and outline the method in Algorithm 1. We can utilize the singular value thresholding operator to solve the problem in Step 1 [69]. Step 4 and Step 5 are solved by the alternating minimization algorithm in [70,71].

Fig. 3. (a) The curves of Con1-Con3 along with the number of iterations. (b) The curves of Con4-Con6 along with the number of iterations.
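For concreteness, the whole of Algorithm 1 can be sketched in NumPy as follows. This is our illustrative translation, not the authors' code: the function names, default α and β values, and the vectorized l_{2,1} shrinkage are our choices; the update order, multiplier steps, and stopping rules follow the algorithm box.

```python
import numpy as np

def svt(M, tau):
    """Eq. (9): singular value thresholding, the prox of tau*||.||_*."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def prox_l21(Q, tau):
    """Column-wise l2,1 shrinkage (vectorized), used for Eqs. (15) and (18)."""
    norms = np.maximum(np.linalg.norm(Q, axis=0), 1e-12)
    return Q * np.maximum(1.0 - tau / norms, 0.0)

def lrccdl_train(H, alpha=0.5, beta=0.5, rho=1.1, mu=1e-6,
                 mu_bar=1e30, eps=1e-6, max_iter=500):
    """Inexact ALM for problem (3), following the steps of Algorithm 1."""
    M, K0 = H.shape
    D, _, _ = np.linalg.svd(H, full_matrices=False)   # initial dictionary D = U
    K = D.shape[1]
    Z = np.zeros((K, K0)); E = np.zeros((M, K0))
    J1 = np.zeros_like(D); J2 = np.zeros_like(Z); C = np.zeros_like(Z)
    Y1 = np.zeros_like(H); Y2 = np.zeros_like(D); Y3 = np.zeros_like(Z)
    for _ in range(max_iter):
        J1 = svt(D + Y2 / mu, 1.0 / mu)                                # Eq. (9)
        Z = np.linalg.solve(D.T @ D + np.eye(K),
                            D.T @ H - D.T @ E + J2
                            + (D.T @ Y1 - Y3) / mu)                    # Eq. (11)
        C = np.tile(Z.mean(axis=1, keepdims=True), (1, K0))            # Eq. (12)
        J2 = C + prox_l21(Z + Y3 / mu - C, alpha / mu)                 # Eq. (15)
        E = prox_l21(H - D @ Z + Y1 / mu, beta / mu)                   # Eq. (18)
        D = (H @ Z.T + J1 - E @ Z.T
             + (Y1 @ Z.T - Y2) / mu) @ np.linalg.inv(Z @ Z.T + np.eye(K))  # Eq. (20)
        Y1 = Y1 + mu * (H - D @ Z - E)        # step 7: update the multipliers
        Y2 = Y2 + mu * (D - J1)
        Y3 = Y3 + mu * (Z - J2)
        mu = min(rho * mu, mu_bar)            # step 8
        if (np.max(np.abs(H - D @ Z - E)) < eps and                    # step 9
                np.max(np.abs(D - J1)) < eps and
                np.max(np.abs(Z - J2)) < eps):
            break
    return D, C, Z, E
```

Any column of the returned C is the cluster center c used later in the detection stage.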

The convergence of the exact ALM algorithm for problems with a smooth objective function has been proven in [72]. As a variation of exact ALM, inexact ALM is also widely used, and its convergence has been well studied when the number of blocks is at most two [68,73]. To date, ensuring the convergence of inexact ALM with three or more blocks remains difficult [12,70,73]. In Algorithm 1, the objective function of problem (3) is not smooth and there are six blocks, i.e., D, J_1, Z, J_2, E, and C (since [C]_{:,j} = (1/K_0)Σ_t [Z]_{:,t}, i.e., Eq. (12), we mainly analyze the convergence of the first five blocks). It is therefore difficult to prove the convergence of Algorithm 1 in theory. However, there are some guarantees for the convergence of Algorithm 1. Based on the theoretical results in [74], two conditions are sufficient for the convergence of Algorithm 1: (a) the dictionary D is of full column rank; and (b) the gap between the solution obtained in each iteration after a certain number of iterations, denoted as (D_k, J_{1k}, Z_k, J_{2k}) at the kth iteration, and the ideal solution obtained by minimizing the Lagrange function, denoted as argmin_{D,J_1,Z,J_2} L, is monotonically decreasing. The gap can be described as η_k = ‖(D_k, J_{1k}, Z_k, J_{2k}) − argmin_{D,J_1,Z,J_2} L‖_F². Condition (a) is easy to satisfy. From Theorem 1 of [70], we have the following result: for any optimal solution of problem (3), Z* ∈ span(D*^T), where Z* and D* are the optimal solutions of Z and D, respectively. This theorem shows that the optimal solution Z* of problem (3) always lies within the subspace spanned by the rows of D*. This means that Z* can be expressed as Z* = P*Z̄*, where P* can be computed by orthogonalizing the columns of D*^T. So problem (3) can be converted into the following equivalent problem by replacing Z with P*Z̄:

min_{A,Z̄,E} ‖A‖_* + α‖Z̄ − C̄‖_{2,1} + β‖E‖_{2,1}    s.t. H* = AZ̄ + E    (21)

where A = DP*. Based on the optimal solution (A*, Z̄*, E*), we can obtain the optimal solution of problem (3) as (A*P*^{−1}, P*Z̄*, E*). The number of rows of Z̄ is at most the rank of D, so the sufficient condition (a) can be satisfied. As for condition (b), although it is not easy to prove strictly, the convexity of the Lagrange function guarantees its validity to some extent [74]. Based on the sufficient conditions (a) and (b), good convergence properties can be expected. Moreover, as shown in [73], the inexact ALM algorithm generally performs well in practice.

Moreover, we illustrate the training stage on the UMN dataset, which is also used in the experiments in the next section. We choose one scene in the dataset, i.e., the indoor scene, to demonstrate the convergence behavior of Algorithm 1. For convenience, we define the following variables: Con1 = (‖D_i‖_F − ‖D_{i−1}‖_F)/‖D_i‖_F, Con2 = (‖Z_i‖_F − ‖Z_{i−1}‖_F)/‖Z_i‖_F, Con3 = (‖E_i‖_F − ‖E_{i−1}‖_F)/‖E_i‖_F, Con4 = ‖H* − DZ − E‖_∞, Con5 = ‖D − J_1‖_∞, and Con6 = ‖Z − J_2‖_∞. The two panels of Fig. 3 show the convergence analysis of Algorithm 1.

Fig. 3(a) shows that Con1, Con2, and Con3 monotonically decrease to zero after a certain number of iterations, which indicates that D, Z, and E have good convergence properties. From Fig. 3(b), based on the convergence conditions Con4-Con6, especially Con5 and Con6, we can infer that J_1 = D and J_2 = Z when the values of Con5 and Con6 reach zero. So J_1 and J_2 also have good convergence properties. Moreover, [C]_{:,j} = (1/K_0)Σ_t [Z]_{:,t}; thus, the matrix C converges after a certain number of iterations. In summary, the six terms, i.e., D, J_1, Z, J_2, C, and E, converge as shown in Algorithm 1.

4.2.2. Detecting stage

In the detection stage, our aim is to distinguish normal from abnormal samples. Our solution is to force the reconstruction coefficient vectors of all the testing samples (both normal and abnormal) to distribute compactly around the center of Z in (3), i.e., all the reconstruction coefficient vectors are made similar to those of normal samples. Therefore, the reconstruction error of a normal testing sample will be small, while the reconstruction error of an abnormal sample will be large. By solving (3), we obtain the low-rank dictionary D and the mean vector c. Given a testing sample set Y_t, the reconstruction coefficient vectors are obtained by solving the following optimization problem:

min_{Z_t,E_t} ‖Z_t − C_t‖_{2,1} + γ‖E_t‖_{2,1}    s.t. Y_t = DZ_t + E_t    (22)

where Z_t is the reconstruction coefficient set. Each column of C_t is denoted as c_t, and c_t = c. E_t is the reconstruction error set of Y_t, and γ is a regularization parameter. The above problem can be converted into the equivalent problem:

min_{Z_t,E_t,W_t} ‖W_t − C_t‖_{2,1} + γ‖E_t‖_{2,1}    s.t. Y_t = DZ_t + E_t, Z_t = W_t    (23)

(23) can be solved through the following ALM problem:

min_{Z_t,E_t,W_t,L_{1,t},L_{2,t}} ‖W_t − C_t‖_{2,1} + γ‖E_t‖_{2,1} + tr[L_{1,t}^T(Y_t − DZ_t − E_t)] + tr[L_{2,t}^T(Z_t − W_t)] + (μ_t/2)(‖Y_t − DZ_t − E_t‖_F² + ‖Z_t − W_t‖_F²)    (24)

where L_{1,t} and L_{2,t} are Lagrange multipliers and μ_t > 0 is a penalty parameter. We can solve (23) by the inexact ALM algorithm. The update steps are similar to those in the training stage and are omitted here; the iteration steps are shown in Algorithm 2.


Algorithm 2 Solving Problem (22) by Inexact ALM.
Input: matrix Y_t, dictionary D, parameter γ
Output: Z_t
Initialize: Z_t = 0, E_t = 0, W_t = 0, L_{1,t} = 0, L_{2,t} = 0, μ_t = 10^{−6}, μ̄_t = 10^{30}, ρ_t = 1.1, ε_t = 10^{−6}
while not converged do
  1. Fix the others and update W_t by
     W_t = argmin (1/μ_t)‖W_t − C_t‖_{2,1} + (1/2)‖W_t − (Z_t + L_{2,t}/μ_t)‖_F²
  2. Fix the others and update Z_t by
     Z_t = (D^T D + I)^{−1}[D^T Y_t − D^T E_t + W_t + (D^T L_{1,t} − L_{2,t})/μ_t]
  3. Fix the others and update E_t by
     E_t = argmin (γ/μ_t)‖E_t‖_{2,1} + (1/2)‖E_t − (Y_t − DZ_t + L_{1,t}/μ_t)‖_F²
  4. Update the multipliers:
     L_{1,t} = L_{1,t} + μ_t(Y_t − DZ_t − E_t)
     L_{2,t} = L_{2,t} + μ_t(Z_t − W_t)
  5. Update the parameter μ_t by μ_t = min(ρ_t μ_t, μ̄_t)
  6. Check the convergence conditions:
     ‖Y_t − DZ_t − E_t‖_∞ < ε_t and ‖Z_t − W_t‖_∞ < ε_t
end while
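Algorithm 2 keeps the learned dictionary D fixed and only solves for the coefficients. The NumPy sketch below is our illustration, not the authors' code; the function names and the default γ are ours, and the W_t subproblem is solved by shifting the column-wise l_{2,1} shrinkage by C_t:

```python
import numpy as np

def prox_l21(Q, tau):
    """Column-wise l2,1 shrinkage used in steps 1 and 3 of Algorithm 2."""
    norms = np.maximum(np.linalg.norm(Q, axis=0), 1e-12)
    return Q * np.maximum(1.0 - tau / norms, 0.0)

def lrccdl_detect_coeffs(Yt, D, c, gamma=0.5, rho=1.1, mu=1e-6,
                         mu_bar=1e30, eps=1e-6, max_iter=500):
    """Inexact ALM for problem (22) with the learned dictionary D held fixed."""
    M, n = Yt.shape
    K = D.shape[1]
    Ct = np.tile(c.reshape(-1, 1), (1, n))    # every column of C_t equals the learned center c
    Zt = np.zeros((K, n)); Et = np.zeros((M, n)); Wt = np.zeros((K, n))
    L1 = np.zeros((M, n)); L2 = np.zeros((K, n))
    G = np.linalg.inv(D.T @ D + np.eye(K))    # D is fixed, so invert once
    for _ in range(max_iter):
        Wt = Ct + prox_l21(Zt + L2 / mu - Ct, 1.0 / mu)               # step 1
        Zt = G @ (D.T @ Yt - D.T @ Et + Wt + (D.T @ L1 - L2) / mu)    # step 2
        Et = prox_l21(Yt - D @ Zt + L1 / mu, gamma / mu)              # step 3
        L1 = L1 + mu * (Yt - D @ Zt - Et)     # step 4: update the multipliers
        L2 = L2 + mu * (Zt - Wt)
        mu = min(rho * mu, mu_bar)            # step 5
        if (np.max(np.abs(Yt - D @ Zt - Et)) < eps and                # step 6
                np.max(np.abs(Zt - Wt)) < eps):
            break
    return Zt, Et
```

Because D never changes here, the matrix inverse in step 2 is computed once outside the loop, unlike in the training stage.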

Fig. 4. (a) A clip of a testing surveillance video, i.e., frame 850 to frame 1350 (the first abnormal frame is approximately frame 1295). (b) The RC values corresponding to normal/abnormal frames.


In traditional abnormal event detection based on dictionary learning, whether the reconstruction coefficient vector of a testing sample is sparse is the key to judging whether the sample is normal, as in [48]. Because the dictionary is learned from normal training samples, it lacks the ability to represent abnormal testing samples. Assume that H_{t1} is a normal testing sample and H_{t2} is an abnormal testing sample, and that their sparse representation coefficient vectors over D are z_{t1} and z_{t2}, respectively. We can find that z_{t2} is denser than z_{t1}. In our LRCCDL method, the dictionary is also trained on normal training samples, so the dictionary has a strong ability to sparsely represent normal testing samples. Furthermore, the second term ‖Z − C‖_{2,1} of the objective function of (3) encourages the reconstruction coefficient vectors of training samples to surround their mean vector compactly, so the reconstruction coefficient vectors of abnormal testing samples should be far away from the mean vector. In the process of anomaly detection, we utilize (22) to force the reconstruction coefficient vectors of all the testing samples to distribute around the mean vector compactly (c_t = c); such reconstruction coefficient vectors of abnormal testing samples are similar to those of normal testing samples, which leads to a severe distortion for the abnormal testing samples. Assume that H_{t3} is a normal testing sample and H_{t4} is an abnormal testing sample, with reconstruction coefficient vectors z_{t3} and z_{t4} over D, respectively. In the anomaly detection stage, the gap between the normal testing sample's reconstruction error, i.e., ‖H_{t3} − Dz_{t3}‖_2, and the abnormal testing sample's reconstruction error, i.e., ‖H_{t4} − Dz_{t4}‖_2, becomes large, which improves the ability of our algorithm to distinguish normal from abnormal samples.

By solving (22), we can obtain the reconstruction coefficient set Z_t of Y_t over the low-rank dictionary D. Given a sample H_t in Y_t, we define the reconstruction cost (RC) as follows:

RC = ‖H_t − Dz_t‖_2 + λ‖z_t‖_1    (25)

where z_t is the coefficient vector of H_t in Z_t, and λ is the multiplier parameter that balances ‖H_t − Dz_t‖_2 and ‖z_t‖_1. H_t is determined to be normal if the value of RC satisfies the following criterion:

RC < τ    (26)

where τ is a manually defined threshold that controls the sensitivity of the algorithm to abnormal events.

Fig. 4 shows an example of abnormal event detection with the values of RC.
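The decision rule in (25)-(26) is a one-liner per sample. A minimal sketch of ours (function names are our choices; λ and τ are user-chosen parameters):

```python
import numpy as np

def reconstruction_cost(h, D, z, lam):
    """RC of Eq. (25): l2 reconstruction error plus an l1 penalty on the coefficients."""
    return np.linalg.norm(h - D @ z) + lam * np.sum(np.abs(z))

def is_normal(h, D, z, lam, tau):
    """Criterion (26): a sample is declared normal when RC < tau."""
    return reconstruction_cost(h, D, z, lam) < tau
```

Raising τ makes the detector more tolerant; lowering it flags more frames as abnormal.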

5. Experimental results

To validate our proposed methods, we conduct experiments on four public datasets, i.e., PETS 2009, UMN, UCSD, and CUHK Avenue. Specifically, in the experiments on the PETS 2009 dataset, we validate the significance of our first contribution by comparing our LRCCDL method with the method that extracts the HMOFP feature directly from the optical flow field of the original images, and in the experiments on the UMN dataset, we validate the significance of our second and third contributions by comparing our LRCCDL method with the traditional sparse reconstruction method.

Fig. 5. (a) The normal scene of people walking toward all directions. (b) The abnormal scene of people moving toward one direction. (c) The classification results of sequence Time 14-(55, 17). Top: the detection result of LRCCDL; Middle: the detection result of LRCCDL_RAW. Bottom: the ground truth of the testing set.

5.1. Anomaly detection at the global scale on the PETS 2009 dataset

In this section, we choose the PETS 2009 dataset to evaluate our algorithm by abnormal event detection at the global scale. In the following experiments, some specific scenes are chosen as the detection targets, i.e., abnormal events. In the PETS 2009 dataset, the resolution of a frame image is 576 × 768. We set the size of an image patch as 144 × 192, with no overlap between neighboring patches. We evenly divide 0°−360° into 36 bins (the number of bins is a parameter in the motion descriptor extraction process [15,39-42]). The length of the HMOFP feature vector is 576 based on spatial basis Type A, as shown in Fig. 2. In the experiments, we compare our method LRCCDL with the method that extracts the HMOFP feature directly from the optical flow field of the original images, which is denoted as LRCCDL_RAW.

5.1.1. Detection of crowd movement direction

In this part, the 0th frame to the 399th frame of Time 14–55 are chosen as the training set. The 400th frame to the 488th frame of Time 14–55 are chosen as the normal testing set, including 89 frames. The abnormal testing set also includes 89 frames, i.e., the 0th frame to the 88th frame of Time 14–17. In the training set and the normal testing set, the people of the crowd are walking in several directions. In the abnormal testing set, the crowd is moving in only one direction. For convenience, the two testing video sequences are denoted as Time 14-(55, 17) in this section. Fig. 5(a) and (b) show the normal and abnormal scenes. The accuracy values of LRCCDL and LRCCDL_RAW are 93.21% and 91.97%, respectively. Fig. 5(c) shows the detection results.

5.1.2. Detection of people running

In this part, the training set contains two parts: the 0th frame to the 49th frame of Time 14–31 and the 0th frame to the 60th frame of Time 14–17. The 0th frame to the 37th frame and the 108th frame to the 173rd frame of Time 14–16 are chosen as the normal testing set, including 104 frames. The abnormal testing set includes the 38th frame to the 107th frame and the 174th frame to the 222nd frame of Time 14–16. In the training set and the normal testing set, the people of the crowd are walking from right to left and walking back in the opposite direction. In the abnormal testing set, the crowd is running in one direction. Fig. 6(a) and (b) show the normal and abnormal scenes. The accuracy values of LRCCDL and LRCCDL_RAW are 96.96% and 92.34%, respectively. Fig. 6(c) shows the detection results.

5.1.3. Detection of people splitting

In this part, the 0th frame to the 40th frame of Time 14–16 are chosen as the training set. The 0th frame to the 63rd frame of Time 14–31 are chosen as the normal testing set, including 64 frames. The abnormal testing set includes 66 frames, i.e., the 64th frame to the 129th frame of Time 14–31. In the training set and the normal testing set, the people of the crowd are walking in the same direction. In the abnormal testing set, the people of the crowd are splitting in several directions. Fig. 7(a) and (b) show the normal and abnormal scenes. The accuracy values of LRCCDL and LRCCDL_RAW are 95.59% and 94.65%, respectively. Fig. 7(c) shows the detection results.

5.1.4. Detection of people scattering

In this part, the 0th frame to the 222nd frame of Time 14–16 are chosen as the training set. The 48th frame to the 93rd frame of Time 14–17 are chosen as the normal testing set, including 46 frames. The abnormal testing set includes 36 frames, i.e., the 342nd frame to the 377th frame of Time 14–33. In the training set and the normal testing set, the people of the crowd are running or walking in one direction. In the abnormal testing set, the crowd is scattering in all directions. For convenience, the two testing video sequences are denoted as Time 14-(17, 33) in this section. Fig. 8(a) and (b) show the normal and abnormal scenes. The accuracy values of LRCCDL and LRCCDL_RAW are 99.75% and 98.75%, respectively. Fig. 8(c) shows the detection results.

5.1.5. Performance comparison

The experimental results above show that the LRCCDL algorithm with the HMOFP descriptor based on background subtraction and binarization obtains better performance in general. Therefore, we can obtain a better motion descriptor based on our first contribution. Table 1 shows the detection results on the PETS 2009 dataset. In addition to LRCCDL_RAW, our algorithm is also compared with the histogram of optical flow orientation (HOFO) method presented in [38]. As shown in the table, the detection accuracy of our proposed LRCCDL algorithm is better than that of the other methods.

Fig. 6. (a) The normal scene of people walking toward one direction. (b) The abnormal scene of people running toward one direction. (c) The classification results of sequence Time 14–16. Top: the detection result of LRCCDL; Middle: the detection result of LRCCDL_RAW. Bottom: the ground truth of the testing set.

Fig. 7. (a) The normal scene of people walking toward the same direction. (b) The abnormal scene of people splitting in some directions. (c) The classification results of sequence Time 14–31. Top: the detection result of LRCCDL; Middle: the detection result of LRCCDL_RAW. Bottom: the ground truth of the testing set.

Fig. 8. (a) The normal scene of people walking or running toward one direction. (b) The abnormal scene of people scattering in all directions. (c) The classification results of sequence Time 14-(17, 33). Top: the detection result of LRCCDL; Middle: the detection result of LRCCDL_RAW. Bottom: the ground truth of the testing set.

Table 1
Different detection results on the PETS 2009 dataset.

Method          Time 14-(55,17)  Time 14–16  Time 14–31  Time 14-(17,33)
LRCCDL (Ours)   93.21%           96.96%      95.59%      99.75%
LRCCDL_RAW      91.97%           92.34%      94.65%      98.75%
HOFO [38]       90%              93.24%      94.61%      97.5%

5.2. Anomaly detection at the global scale on the UMN dataset

In this section, the UMN dataset is chosen to evaluate our algorithm LRCCDL by anomaly detection at the global scale. The UMN dataset includes three different crowded scenes, i.e., lawn, indoor, and plaza. The number of frame images is 7739 in total, and the resolution of a frame image is 240 × 320. In the dataset, the scenes where people are walking randomly are normal events, and the scenes where people are running away simultaneously are abnormal events. We set the size of an image patch as 60 × 80, with no overlap between neighboring patches. We evenly divide 0°−360° into 18 bins. The length of the HMOFP feature vector is 288 based on spatial basis Type A shown in Fig. 2. In each scene, we choose the first 400 normal frames to train the low-rank dictionary. Traditional anomaly detection methods based on a learned dictionary use sparse reconstruction. As a contrast to LRCCDL, we also use the sparse reconstruction method to obtain the reconstruction coefficient vectors of the testing samples; the reconstruction cost is the same as (25). This method for anomaly detection is denoted as LRCCDL_SR. LRCCDL_SR is based on the method described in [42], with two differences: one is that the motion feature is obtained based on binary frames; the other is that the reconstruction cost is replaced by (25). The detailed algorithm of LRCCDL_SR is presented in the Appendix. Our experimental results are as follows.

5.2.1. Abnormal event detection in the lawn scene

In the lawn scene, there are 1453 frames in the video sequence in total. Fig. 9(a) and (b) show two representative frames that exhibit the normal event and the abnormal event. Fig. 9(c) and (d) show the detection results and the receiver operating characteristic (ROC) curves in the lawn scene. The area under the ROC curve (AUC) of the method LRCCDL is 99.94%, and the AUC of the method LRCCDL_SR is 98.07%.

5.2.2. Abnormal event detection in the indoor scene

In the indoor scene, there are 4144 frames in the video sequence in total. Fig. 10(a) and (b) show two representative frames that exhibit the normal event and the abnormal event. Fig. 10(c) and (d) show the detection results and the ROC curves in the indoor scene. The AUC of the method LRCCDL is 99.55%, and the AUC of the method LRCCDL_SR is 94.69%.

5.2.3. Abnormal event detection in the plaza scene

In the plaza scene, there are 2142 frames in the video sequence in total. Fig. 11(a) and (b) show two representative frames that exhibit the normal event and the abnormal event. Fig. 11(c) and (d) show the detection results and the ROC curves in the plaza scene. The AUC of the method LRCCDL is 99.93%, and the AUC of the method LRCCDL_SR is 97.65%.

5.2.4. Performance comparison

The experimental results above show that, compared with the traditional sparse reconstruction, the LRCCDL algorithm based on our second and third contributions obtains better performance. Our solution based on the low-rank dictionary for anomaly detection, which enlarges the gap between the reconstruction errors of normal testing samples and those of abnormal ones, is effective and robust. Besides LRCCDL_SR, our algorithm is also compared with several state-of-the-art methods. The performance comparison results are shown in Table 2. The hand-crafted feature-based methods are listed above the dotted line, and the deep learning-based methods are listed below the dotted line. In the remaining tables, we also use the dotted line to distinguish these two types of methods. As shown in Table 2, for the lawn and plaza scenes, the AUC


Fig. 9. (a) The normal event in the lawn scene. (b) The abnormal event in the lawn scene. (c) The classification results of the lawn scene. Top: the detection result of LRCCDL; Middle: the detection result of LRCCDL_SR. Bottom: the ground truth of the testing set. (d) The ROC curves in the lawn scene.

Fig. 10. (a) The normal event in the indoor scene. (b) The abnormal event in the indoor scene. (c) The classification results of the indoor scene. Top: the detection result of LRCCDL; Middle: the detection result of LRCCDL_SR. Bottom: the ground truth of the testing set. (d) The ROC curves in the indoor scene.

Fig. 11. (a) The normal event in the plaza scene. (b) The abnormal event in the plaza scene. (c) The classification results of the plaza scene. Top: the detection result of LRCCDL; Middle: the detection result of LRCCDL_SR. Bottom: the ground truth of the testing set. (d) The ROC curves in the plaza scene.


Fig. 12. The normal scene (the 1st frame) and the abnormal scenes (the 2nd frame to the 5th frame) in dataset UCSD Ped1.

Fig. 13. The normal scene (the 1st frame) and the abnormal scenes (the 2nd frame to the 4th frame) in dataset UCSD Ped2.

Table 2
Comparison of LRCCDL with other methods on the UMN dataset.

Method             Lawn    Indoor  Plaza
LRCCDL (Ours)      99.94%  99.55%  99.93%
LRCCDL_SR          98.07%  94.69%  97.65%
HOFO [38]          98.45%  90.37%  98.15%
Sparse [48]        99.5%   97.5%   96.4%
STCOG [32]         93.62%  77.59%  96.61%
Lee et al. [26]    99.4%   90.9%   98.1%
HOIF [33]          99.94%  99.18%  99.88%
Patil et al. [43]  98.67%  93.68%  97.11%
Zhang et al. [45]  99.3%   96.9%   98.8%
NN [48]            84%
Optical Flow [21]  93%
SF [21]            96%
------------------------------------------
AVID [61]          99.6%
TCP [62]           98.8%
Wang et al. [64]   99.0%


Table 3
Different EER values and AUC values of the detection of LAE on UCSD Ped1.

Method                EER     AUC
LRCCDL (Ours)         17.05%  90.01%
Sparse [48]           19%     86%
Ren et al. [47]       46.44%  54.26%
Adam et al. [34]      38%     56.63%
MDT [34]              25%     81.8%
SF-MPPCA [34]         32%     67.25%
MPPCA [34]            40%     59%
SF [21]               30%     67.5%
Lee et al. [26]       24.1%   80%
HOFME [44]            33.1%   72.7%
--------------------------------------
AVID [61]             12.3%   –
Conv-WTA + SVM [58]   14.8%   91.6%
ConvLSTM-AE [60]      –       75.5%
Liu et al. [63]       –       83.1%
TCP [56]              8%      95.7%
Wang et al. [64]      18%     88.5%
Conv-AE [59]          27.9%   81.0%
Huang et al. [55]     11.2%   92.6%
Xu et al. [57]        12%     95.7%


of our proposed LRCCDL based on the HMOFP feature outperforms those of the other methods. For the indoor scene, the method in [61] obtains the best result; however, our LRCCDL algorithm is comparable to it.

5.3. Anomaly detection at the local scale on the UCSD dataset

In this section, the UCSD dataset is chosen to evaluate our algorithm LRCCDL by anomaly detection at the local scale, which includes experiments on object localization. There are two sub-datasets in the UCSD dataset, i.e., UCSD Ped1 and UCSD Ped2. The first frames in Figs. 12 and 13 show that the frames only contain pedestrians in the normal scenes. The other frames in Figs. 12 and 13 show that cars, wheelchairs, skaters and bikes commonly occur in the abnormal scenes. The evaluation contains two parts: (1) frame-level groundtruth-based local anomaly detection, i.e., compared with the frame-level groundtruth annotation, a frame should be determined as abnormal if at least one pixel of the frame is abnormal; and (2) pixel-level groundtruth-based anomaly localization, i.e., compared with the groundtruth annotation at the pixel level, an abnormal frame is determined as a correctly detected frame if at least 40% of the truly abnormal pixels are detected; otherwise, the frame is considered to be a false positive.
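The two evaluation criteria reduce to simple mask operations. A minimal sketch with boolean masks (function names are ours):

```python
import numpy as np

def frame_level_abnormal(det_mask):
    """Frame-level criterion: abnormal if at least one pixel is flagged."""
    return bool(det_mask.any())

def pixel_level_hit(det_mask, gt_mask, ratio=0.4):
    """Pixel-level criterion: a detection counts as correct only if it
    covers at least `ratio` (40%) of the truly abnormal pixels."""
    gt_count = int(gt_mask.sum())
    if gt_count == 0:
        return False
    covered = int((det_mask & gt_mask).sum())
    return covered >= ratio * gt_count
```

The pixel-level rule is the stricter of the two: a frame can pass the frame-level check while failing localization.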

.3.1. Anomaly detection

In the UCSD Ped1 dataset, the training set contains 34 short clips, and the testing set contains 36 short clips. The frame size is 158 × 238 and there are 200 frames in each clip. In the UCSD Ped2 dataset, the training set contains 16 short clips, and the testing set contains 12 short clips. There are 150 to 180 frames in each clip with a 240 × 360 resolution. In addition, a subset of 10 clips for UCSD Ped1 and 12 clips for UCSD Ped2 is provided with manually generated pixel-level binary masks, which identify the regions containing anomalies. In the experiments, we only choose the first short clip in the training set during the training stage. The size of a frame is reset to 240 × 320 and the image patch size is set to 10 × 10 without overlap between two neighboring patches. In addition, 0°−360° is divided into 18 bins. The HMOFP feature vector is 72-dimensional, based on spatial basis Type B, as shown in Fig. 2.
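As an illustration of the orientation binning described above, the following sketch bins the flow orientations of one spatial cell into 18 magnitude-weighted bins. This is a simplified stand-in: the full HMOFP descriptor uses the maximal optical flow projection defined earlier in the paper, and concatenating the cells of the Type B spatial basis yields the 72-dimensional feature.

```python
import numpy as np

def orientation_histogram(flow, n_bins=18):
    """Bin optical-flow orientations (0-360 degrees) into n_bins bins,
    weighted by flow magnitude, for one spatial cell."""
    dx, dy = flow[..., 0], flow[..., 1]
    mag = np.hypot(dx, dy)
    ang = np.degrees(np.arctan2(dy, dx)) % 360.0       # orientation in [0, 360)
    bins = (ang // (360.0 / n_bins)).astype(int) % n_bins
    hist = np.zeros(n_bins)
    np.add.at(hist, bins.ravel(), mag.ravel())         # magnitude-weighted counts
    return hist
```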

Figs. 14(a) and 15(a) show the detection ROC curves of the frame-level groundtruth-based local abnormal event detection. We compare the detection result of our LRCCDL algorithm with those of the state-of-the-art methods. Tables 3 and 5 show the AUC values and equal error rate (EER) values of our algorithm and other methods as a quantitative comparison. Figs. 14(b) and 15(b) show the detection ROC curves of the pixel-level groundtruth-based local abnormal event localization. As another quantitative comparison, the AUC values and equal detected rate (EDR) values of our algorithm and the state-of-the-art methods are shown in Tables 4 and 6. Examples of localization are presented in Fig. 16.


12 A. Li, Z. Miao and Y. Cen et al. / Pattern Recognition 108 (2020) 107355

Fig. 14. (a) The ROC curves of the local abnormal event detection using frame-level groundtruth on UCSD Ped1. (b) The ROC curves of the local abnormal event localization

using pixel-level groundtruth on UCSD Ped1.

Fig. 15. (a) The ROC curves of the local abnormal event detection using frame-level groundtruth on UCSD Ped2. (b) The ROC curves of the local abnormal event localization

using pixel-level groundtruth on UCSD Ped2.

Fig. 16. (a) The localization results on UCSD Ped1. (b) The localization results on UCSD Ped2.

5.3.2. Performance comparison

The quantitative comparisons are displayed in Tables 3 to 6. The compared state-of-the-art methods include both hand-crafted feature-based methods and deep learning-based methods. The criteria include two items: AUC and EER (EDR = 1 − EER). The AUC of a classifier is equivalent to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance [75]. Depending on the thresholds, the false reject rate, i.e., the rate of positive samples that are wrongly detected as negative samples, and the false accept rate, i.e., the rate of negative samples that are wrongly detected as positive samples, change along the ROC curve. When the false reject rate equals the false accept rate, the common value is denoted as the EER, which corresponds to the abscissa of the intersection point of the ROC curve and the back-diagonal dotted line in Figs. 14 and 15. Regarding the problem of binary classification, the EER corresponds to the classification result under one special threshold value, whereas the AUC reflects the classification results under all threshold values. The classifier with a greater area has a better average performance. Based on this analysis of AUC and EER, AUC is the main criterion used to evaluate the performance of binary classifiers.
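The relationship between the ROC curve, AUC, and EER described here can be made concrete with a short sketch (hypothetical scores and labels; a higher score means more abnormal; this is our own illustration, not the authors' evaluation code):

```python
import numpy as np

def auc_and_eer(scores, labels):
    """Compute AUC and EER from anomaly scores.
    The false reject rate is 1 - TPR and the false accept rate is FPR;
    EER is taken where the two are closest to equal."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos, neg = (labels == 1).sum(), (labels == 0).sum()
    tpr, fpr = [0.0], [0.0]
    for t in np.sort(np.unique(scores))[::-1]:   # sweep thresholds, high to low
        pred = scores >= t
        tpr.append((pred & (labels == 1)).sum() / pos)
        fpr.append((pred & (labels == 0)).sum() / neg)
    tpr, fpr = np.array(tpr), np.array(fpr)
    # trapezoidal area under the ROC curve
    auc = float(np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2.0))
    i = int(np.argmin(np.abs((1.0 - tpr) - fpr)))  # point where FRR ~= FAR
    return auc, float(fpr[i])
```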

Page 13: Abnormal event detection in surveillance videos based on ...xzhang/publications/PR... · [46] X. Chen, J. Lai, Detecting abnormal crowd behaviors based on the div-curl char- acteristics

A. Li, Z. Miao and Y. Cen et al. / Pattern Recognition 108 (2020) 107355 13

Table 4
EDR and AUC values of the localization of LAE on UCSD Ped1.

Method                 EDR      AUC
LRCCDL (Ours)          63.98%   76.09%
Sparse [48]            46%      46.1%
Ren et al. [47]        48.49%   48.23%
Adam et al. [34]       24%      13.3%
MDT [34]               45%      44.1%
SF-MPPCA [34]          28%      21.3%
MPPCA [34]             18%      20.5%
SF [21]                21%      17.9%
Lee et al. [26]        60%      64.9%
Zhang et al. [45]      62.5%    65%
--- deep learning-based methods ---
AVID [61]              85.6%    –
Conv-WTA + SVM [58]    64.2%    66.1%
TCP [56]               59.2%    64.5%
Wang et al. [64]       67%      68.9%
Huang et al. [55]      61.3%    69.71%
Xu et al. [57]         –        69.9%

Table 5
EER and AUC values of the detection of LAE on UCSD Ped2.

Method                 EER      AUC
LRCCDL (Ours)          9.44%    95.20%
Adam et al. [34]       42%      64%
MDT [34]               25%      85%
SF-MPPCA [34]          36%      71%
MPPCA [34]             30%      77%
SF [21]                42%      63%
Lee et al. [26]        9.8%     92%
HOFME [44]             20%      87.5%
Zhang et al. [45]      16%      90%
Amraee et al. [27]     21%      85.5%
--- deep learning-based methods ---
AVID [61]              14%      –
Conv-WTA + SVM [58]    8.9%     96.6%
ConvLSTM-AE [60]       –        88.1%
Liu et al. [63]        –        95.4%
TCP [56]               18%      88.4%
Wang et al. [64]       12%      91.3%
Sabokrou et al. [62]   11%      –
Conv-AE [59]           21.7%    90%
Xu et al. [57]         13%      92.3%


Table 6
EDR and AUC values of the localization of LAE on UCSD Ped2.

Method                 EDR      AUC
LRCCDL (Ours)          76.42%   82.72%
Adam et al. [34]       20%      22%
MDT [34]               45%      42%
SF-MPPCA [34]          25%      20%
MPPCA [34]             18%      22%
SF [21]                22%      28%
Lee et al. [26]        76%      81.5%
Zhang et al. [45]      68%      75%
Amraee et al. [27]     71%      80%
--- deep learning-based methods ---
AVID [61]              85%      –
Conv-WTA + SVM [58]    83.1%    89.3%
Wang et al. [64]       83%      80.1%
Sabokrou et al. [62]   85%      –

Table 7
AUC values on CUHK Avenue.

Method                 AUC
LRCCDL (Ours)          88.68%
Conv-WTA + SVM [58]    82.1%
Conv-AE [59]           70.2%
ConvLSTM-AE [60]       77%
DeepAppearance [63]    84.6%
Unmasking [63]         80.6%
Stacked RNN [63]       81.7%
Liu et al. [63]        85.1%
Huang et al. [55]      76.8%


From Tables 3 to 6, compared with the hand-crafted feature-based methods, the EER (or EDR) and AUC values of our proposed method are the best. Compared with the deep learning-based methods, the AUC of our proposed method is the best in Table 4. Based on the AUC values, the performance of our method is better than those of [59,60,63,64] in Table 3, [56,57,59,60,64] in Table 5 and [64] in Table 6. Moreover, the AUC values of our method are only slightly lower than those of the best-performing deep learning-based methods. Taking the EER and AUC values from the four tables together, we find that no deep learning-based method is always better than ours under the criteria of EER and AUC on the whole UCSD dataset. In conclusion, our method achieves a good performance on these two datasets.

5.4. Anomaly detection at the local scale on the CUHK Avenue dataset

5.4.1. Anomaly detection and performance comparison

In this section, we chose the CUHK Avenue dataset to evaluate our method. There are 16 training video clips and 21 testing video clips in the dataset. The resolution of the frames in each video is 360 × 640. The abnormal events include wrong direction, strange action, and abnormal objects, which are shown in Fig. 17.

In the experiment, we again only choose the first clip in the training set during the training stage. The HMOFP feature is extracted in the same manner as that for the UCSD dataset. AUC is chosen as the criterion for the performance evaluation. Our proposed method is compared with the state-of-the-art methods, including deep learning-based methods from recent years. The AUC values are shown in Table 7. It can be seen that our proposed method obtains the highest AUC value.

5.4.2. Analysis of parameters in the algorithm

There are two parameters in Algorithm 1 and one parameter in Algorithm 2, i.e., α, β and γ. The relationship between γ and the other two parameters is γ = β/α. We fix β and change α to obtain different AUC values; in addition, α is fixed and the value of β is changed. The AUC curves obtained in these cases are shown in Fig. 18(a). It can be seen that when α = 5, β = 0.1, and γ = 0.02, a better AUC is achieved. Additionally, we illustrate the AUC performance with the patch size set to {4, 8, 10, 20} in Fig. 18(b), and we find that the best AUC is achieved when the patch size is 10 × 10. We mainly present the results on the CUHK Avenue dataset, but our experimental results on the above datasets show that the conclusion regarding the parameters can be generalized to other datasets.
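The sweep just described (fix β and vary α, then fix α and vary β, with γ tied to both by γ = β/α) can be sketched as a hypothetical configuration generator; the function name and default values are ours:

```python
def make_configs(alphas, betas, fixed_alpha=5.0, fixed_beta=0.1):
    """Enumerate (alpha, beta, gamma) triples for the two parameter sweeps,
    with gamma = beta / alpha enforced by construction."""
    configs = [(a, fixed_beta, fixed_beta / a) for a in alphas]     # fix beta, vary alpha
    configs += [(fixed_alpha, b, b / fixed_alpha) for b in betas]   # fix alpha, vary beta
    return configs
```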

Fig. 17. The normal scene (the 1st frame) and the abnormal scenes (the 2nd to the 4th frame) in the CUHK Avenue dataset.

Fig. 18. (a) The AUC performance under different parameter values of the algorithm on the CUHK Avenue dataset. (b) The AUC performance under different patch sizes.

6. Conclusions

In this paper, we present a novel algorithm to detect abnormal events based on dictionary learning. Unlike the previous hand-crafted feature-based methods in the literature that focus only on the representation of the motion descriptor of a crowd or the model to detect whether an event is normal, our new method fully uses the motion descriptor information to build the anomaly detection framework.

Based on the background subtraction and binarization of

surveillance videos, we remove the low variations and noise com-

ing from the background objects and obtain the motion descrip-

tor HMOFP. The motion descriptor can describe the motion of a

crowd in the foreground more precisely. In the training stage, to

make full use of the low-rank structure of the training sample set

and restrict the reconstruction coefficient vectors, we propose the

LRCCDL solution by a joint optimization of the nuclear-norm and

the l 2, 1 -norm. The joint optimization achieves two results: one is

the learned low-rank dictionary and the other is the compact re-

construction coefficient vectors of the training samples, which are

surrounding a mean center. The dictionary and the center are used

in the detection stage. In the detection stage, we utilize the l 2, 1 -norm to force the reconstruction coefficient vectors of the testing samples toward the same distribution as those of the normal training samples, which yields a large gap between the reconstruction errors of abnormal testing samples and those of normal ones.

As a result, abnormal testing samples obtain larger reconstruction

errors than normal testing samples. Finally, a reconstruction cost

(RC) function is developed to detect the frame abnormality based

on the combination of the reconstruction error and the sparsity of

the reconstruction coefficient vector.

In the experiments, compared with the deep learning-based methods, some results of our proposed method are not superior. However, the amount of training data required by our method is far smaller than that of the deep learning-based methods, especially for the UCSD and CUHK Avenue datasets. Moreover, the detection results of our method are comparable to those of the deep learning-based methods and superior to those of the hand-crafted feature-based methods.


In future work, one important aspect is how to improve the performance of anomaly detection at the local scale. Moreover, optimizing our proposed method to decrease its running time is another important task of our forthcoming research. Considering the wide use of deep learning, anomaly detection based on deep learning is a very attractive and valuable area. Although some explorations have arisen in recent years, more robust and efficient algorithms for different scenes need to be studied further. In addition, combining deep learning with traditional methods such as sparse representation may produce surprising results. All of these will be our future research directions.

Declaration of Competing Interest

None.

Acknowledgement

This work is supported by the National Key R&D Program of China (no. 2019YFB2204200), the NSFC (nos. 61572067, 61872034, 61672089, 61703436, and 61572064), CELFA, the Beijing Municipal Natural Science Foundation under Grant 4202055, the Natural Science Foundation of Guizhou Province ([2019]1064), and the Science and Technology Program of Guangzhou (201804010271). This work is also supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC), Grant No. RGPIN239031.

Appendix

The complete algorithm for the LRCCDL_SR method.

Step 1: Depending on the binary frames, we can get the optimized motion feature set of TR, which is denoted as H∗.

Step 2: The optimized dictionary D_T can be obtained by the online dictionary learning algorithm.




The online dictionary learning algorithm

Require: H∗ ∈ R^(M×K0) (training sample set), λ ∈ R (regularization parameter), D_0 ∈ R^(M×K) (initial dictionary), T (number of iterations).
1: A_0 ∈ R^(K×K) ← 0, B_0 ∈ R^(M×K) ← 0 (reset the "past" information).
2: for t = 1 to T do
   (1) Draw H∗_t from H∗.
   (2) Sparse coding: use the Lasso algorithm to compute
       α_t = argmin_{α ∈ R^K} (1/2)‖H∗_t − D_{t−1} α‖²_2 + λ‖α‖_1.
   (3) A_t ← A_{t−1} + α_t α_t^T = [a_1, a_2, ..., a_K]. The element in the n-th row and the n-th column of the matrix A_t is denoted as A_t(n, n).
   (4) B_t ← B_{t−1} + H∗_t α_t^T = [b_1, b_2, ..., b_K].
   (5) Compute the dictionary D_t:
       (a) for n = 1 to K do
           Update the n-th column of D_{t−1} = [d_1, d_2, ..., d_K]:
           if A_t(n, n) = 0 then
               u_n ← d_n
           else
               u_n ← (1/A_t(n, n)) (b_n − D_{t−1} a_n) + d_n
           end if
           d_{t,n} ← u_n / max(‖u_n‖_2, 1) (the n-th column of D_t).
       (b) end for
       (c) Return D_t = [d_{t,1}, d_{t,2}, ..., d_{t,K}].
3: end for
4: Return D_T (learned dictionary).
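The listing above can be sketched in Python as follows. This is our own minimal implementation, not the authors' code: the Lasso step is replaced by a plain ISTA solver so the sketch stays self-contained, and samples are drawn at random.

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_ista(h, D, lam, n_iter=200):
    """Solve min_a 0.5*||h - D a||_2^2 + lam*||a||_1 by ISTA
    (a simple stand-in for the Lasso solver in the algorithm)."""
    L = np.linalg.norm(D, 2) ** 2 + 1e-12   # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - h)
        a = soft_threshold(a - grad / L, lam / L)
    return a

def online_dictionary_learning(H, D0, lam, T):
    """Online dictionary learning as listed above.
    H: (M, K0) training samples as columns; D0: (M, K) initial dictionary."""
    D = D0.copy()
    M, K = D.shape
    A, B = np.zeros((K, K)), np.zeros((M, K))
    rng = np.random.default_rng(0)
    for t in range(T):
        h = H[:, rng.integers(H.shape[1])]   # draw a sample H*_t
        alpha = lasso_ista(h, D, lam)        # sparse coding step
        A += np.outer(alpha, alpha)          # accumulate "past" information
        B += np.outer(h, alpha)
        for n in range(K):                   # block-coordinate dictionary update
            if A[n, n] == 0:
                u = D[:, n]
            else:
                u = (B[:, n] - D @ A[:, n]) / A[n, n] + D[:, n]
            D[:, n] = u / max(np.linalg.norm(u), 1.0)   # project to unit ball
    return D
```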

Step 3: Extract the HMOFP feature of the testing frame f_t, i.e., H_t, and calculate its sparse reconstruction coefficient vector z_t by

min ‖z_t‖_1 s.t. H_t = D_T z_t,

which can be solved by the OMP method.

Step 4: Then the RC value of f_t is computed by

RC = ‖H_t − D_T z_t‖_2 + λ‖z_t‖_1.

Step 5: The frame f_t is detected as normal if the following criterion is satisfied:

RC < τ,

where τ is a user-defined threshold that controls the sensitivity of the algorithm.

Note: Step 1 and Step 4–Step 5 are the same as in our proposed method LRCCDL.
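Steps 3 to 5 can be sketched as follows (a minimal illustration with a plain greedy OMP in place of a library solver; function names are ours, not from the paper):

```python
import numpy as np

def omp(h, D, n_nonzero=5, tol=1e-6):
    """Plain orthogonal matching pursuit: greedily pick atoms of D to
    approximate h, refitting by least squares on the chosen support."""
    residual, support = h.copy(), []
    z = np.zeros(D.shape[1])
    for _ in range(n_nonzero):
        n = int(np.argmax(np.abs(D.T @ residual)))   # most correlated atom
        if n not in support:
            support.append(n)
        coef, *_ = np.linalg.lstsq(D[:, support], h, rcond=None)
        residual = h - D[:, support] @ coef
        if np.linalg.norm(residual) < tol:
            break
    z[support] = coef
    return z

def reconstruction_cost(h, D, z, lam):
    """RC = ||h - D z||_2 + lam * ||z||_1 (Step 4)."""
    return np.linalg.norm(h - D @ z) + lam * np.abs(z).sum()

def is_normal(h, D, lam, tau):
    """Step 5: the frame is declared normal when RC < tau."""
    z = omp(h, D)
    return bool(reconstruction_cost(h, D, z, lam) < tau)
```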

References

[1] H. Keval, CCTV control room collaboration and communication: does it work? in: Proc. Human Centred Technol. Workshop, 2006, pp. 11–12.
[2] M. Thida, Y.L. Yong, P. Climent-Pérez, H-l. Eng, P. Remagnino, A literature review on video analytics of crowded scenes, Intell. Multimed. Surveill. (2013) 17–36.
[3] B. Zhan, D.N. Monekosso, P. Remagnino, et al., Crowd analysis: a survey, Mach. Vision Appl. (2008) 345–357.
[4] N.N.A. Sjarif, S.M. Shamsuddin, S.Z. Hashim, Detection of abnormal behaviors in crowd scene: a review, Int. J. Advance. Soft Comput. Appl. 4 (1) (2012) 1–33.
[5] J.C.S. Junior, S.R. Musse, C.R. Jung, Crowd analysis using computer vision techniques, IEEE Signal Process. Mag. 27 (5) (2010) 66–77.
[6] C. Ma, Z. Miao, X. Zhang, M. Li, A saliency prior context model for real-time object tracking, IEEE Trans. Multimed. 19 (11) (2017) 2415–2424.
[7] J. Geng, Z. Miao, X. Zhang, Efficient heuristic methods for multimodal fusion and concept fusion in video concept detection, IEEE Trans. Multimed. 17 (4) (2015) 498–511.
[8] S. Ali, M. Shah, Floor fields for tracking in high density crowd scenes, in: Eur. Conf. Comput. Vis. (ECCV), 2008, pp. 1–14.
[9] M. Rodriguez, S. Ali, T. Kanade, Tracking in unstructured crowded scenes, in: Proc. IEEE Int. Conf. Comput. Vis., 2009, pp. 1389–1396.
[10] Y. Cen, L. Zhang, K. Wang, et al., Iterative reweighted minimization for generalized norm/quasi-norm difference regularized unconstrained nonlinear programming, IEEE Access 7 (2019) 153102–153122.
[11] Y. Cen, Y. Cen, K. Wang, et al., Energy-efficient nonuniform content edge pre-caching to improve quality of service in fog radio access networks, Sensors 19 (6) (2019) 1422.
[12] H. Wang, Y. Cen, Z. He, et al., Robust generalized low-rank decomposition of multimatrices for image recovery, IEEE Trans. Multimed. 19 (5) (2016) 969–983.
[13] H. Wang, Y. Cen, Z. He, et al., Reweighted low-rank matrix analysis with structural smoothness for image denoising, IEEE Trans. Image Process. 27 (4) (2017) 1777–1792.
[14] H. Wang, Y. Li, Y. Cen, et al., Multi-matrices low-rank decomposition with structural smoothness for image denoising, IEEE Trans. Circuits Syst. Video Technol. (2019).
[15] A. Li, Z. Miao, Y. Cen, Global abnormal event detection based on compact coefficient low-rank dictionary learning, in: Asian Conf. Pattern Recognit. (ACPR), 2017, pp. 483–487.
[16] Available: http://mha.cs.umn.edu/movies/crowd-activity-all.avi.
[17] Available: http://www.cvg.reading.ac.uk/PETS2009/data.html.
[18] Available: http://www.svcl.ucsd.edu/projects/anomaly/dataset.html.
[19] Available: http://www.cse.cuhk.edu.hk/leojia/projects/detectabnormal/dataset.html.
[20] D. Kosmopoulos, S.P. Chatzis, Robust visual behavior recognition, IEEE Signal Process. Mag. 27 (5) (2010) 34–45.
[21] R. Mehran, A. Oyama, M. Shah, Abnormal crowd behavior detection using social force model, in: IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2009, pp. 935–942.
[22] S. Yen, C. Wang, Abnormal event detection using HOSF, in: IEEE Int. Conf. IT Converg. Secur. (ICITCS), 2013, pp. 1–4.
[23] Y. Zhang, L. Qin, H. Yao, Q. Huang, Abnormal crowd behavior detection based on social attribute-aware force model, in: IEEE Int. Conf. Image Process. (ICIP), 2012, pp. 2689–2692.
[24] Y. Zhang, L. Qin, H. Yao, Q. Huang, Social attribute-aware force model: exploiting richness of interaction for abnormal crowd detection, IEEE Trans. Circuits Syst. Video Technol. 25 (7) (2015) 1231–1245.
[25] R. Chaker, Z. Al Aghbari, I.N. Junejo, Social network model for crowd anomaly detection and localization, Pattern Recognit. 61 (2017) 266–281.
[26] D.G. Lee, H.I. Suk, S.K. Park, et al., Motion influence map for unusual human activity detection and localization in crowded scenes, IEEE Trans. Circuits Syst. Video Technol. 25 (10) (2015) 1612–1623.
[27] S. Amraee, A. Vafaei, K. Jamshidi, et al., Anomaly detection and localization in crowded scenes using connected component analysis, Multimed. Tools Appl. 77 (12) (2018) 14767–14782.
[28] T. Hung, J. Lu, Y. Tan, Cross-scene abnormal event detection, in: IEEE Int. Symp. Circuits Syst. (ISCAS), 2013, pp. 2844–2847.
[29] T. Sandhan, T. Srivastava, A. Sethi, Y. Jin, Unsupervised learning approach for abnormal event detection in surveillance video by revealing infrequent patterns, in: IEEE Int. Conf. Image Vis. Comput. N. Z. (IVCNZ), 2013, pp. 494–499.
[30] I. Tziakos, A. Cavallaro, L. Xu, Local abnormality detection in video using subspace learning, in: IEEE Int. Conf. Adv. Video Signal Based Surveill. (AVSS), 2010, pp. 519–525.
[31] M. Haque, M. Murshed, Panic-driven event detection from surveillance video stream without track and motion features, in: IEEE Int. Conf. Multimed. Expo (ICME), 2010, pp. 173–178.
[32] Y. Shi, Y. Gao, R. Wang, Real-time abnormal event detection in complicated scenes, in: IEEE Int. Conf. Pattern Recognit. (ICPR), 2010, pp. 3653–3656.
[33] Y. Yin, Q. Liu, S. Mao, Global anomaly crowd behavior detection using crowd behavior feature vector, Int. J. Smart Home 9 (12) (2015) 149–160.
[34] V. Mahadevan, W. Li, V. Bhalodia, et al., Anomaly detection in crowded scenes, in: IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2010, pp. 1975–1981.
[35] D. Singh, C.K. Mohan, Graph formulation of video activities for abnormal activity recognition, Pattern Recognit. 65 (2017) 265–272.
[36] X. Gu, J. Cui, Q. Zhu, Abnormal crowd behavior detection by using the particle entropy, Opt.-Int. J. Light Electron Opt. 125 (14) (2014) 3428–3433.
[37] C.P. Lee, K.M. Lim, W.L. Woon, Statistical and entropy based abnormal motion detection, in: IEEE Stud. Conf. Res. Dev. (SCOReD), 2010, pp. 192–197.
[38] T. Wang, H. Snoussi, Detection of abnormal visual events via global optical flow orientation histogram, IEEE Trans. Inf. Forensics Secur. 9 (6) (2014) 988–998.
[39] A. Li, Z. Miao, Y. Cen, T. Wang, V. Voronin, Histogram of maximal optical flow projection for abnormal events detection in crowded scenes, Int. J. Distrib. Sens. Netw. (2015) 1–12.
[40] A. Li, Z. Miao, Y. Cen, Q. Liang, Abnormal event detection based on sparse reconstruction in crowded scenes, in: IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), 2016, pp. 1786–1790.
[41] A. Li, Z. Miao, Y. Cen, Global anomaly detection in crowded scenes based on optical flow saliency, in: IEEE Int. Workshop Multimed. Signal Process. (MMSP), 2016, pp. 1–5.
[42] A. Li, Z. Miao, Y. Cen, Y. Cen, Anomaly detection using sparse reconstruction in crowded scenes, Multimed. Tools Appl. 76 (24) (2017) 26249–26271.
[43] N. Patil, P.K. Biswa, Global abnormal events detection in surveillance video - a hierarchical approach, in: IEEE Int. Symp. Embed. Comput. Syst. Des., 2017, pp. 217–222.
[44] R.V.H.M. Colque, C. Caetano, M.T.L. de Andrade, et al., Histograms of optical flow orientation and magnitude and entropy to detect anomalous events in videos, IEEE Trans. Circuits Syst. Video Technol. 27 (3) (2017) 673–682.
[45] Y. Zhang, H. Lu, L. Zhang, X. Ruan, Combining motion and appearance cues for anomaly detection, Pattern Recognit. 51 (2016) 443–452.
[46] X. Chen, J. Lai, Detecting abnormal crowd behaviors based on the div-curl characteristics of flow fields, Pattern Recognit. 88 (2019) 342–355.
[47] H. Ren, T.B. Moeslund, Abnormal event detection using local sparse representation, in: IEEE Int. Conf. Adv. Video Signal Based Surveill. (AVSS), 2014, pp. 125–130.
[48] Y. Cong, J. Yuan, J. Liu, Sparse reconstruction cost for abnormal event detection, in: IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2011, pp. 3449–3456.
[49] Y. Yuan, Y. Feng, X. Lu, Structured dictionary learning for abnormal event detection in crowded scenes, Pattern Recognit. 73 (2018) 99–110.



[50] X. Zhu, J. Liu, J. Wang, C. Li, H. Lu, Sparse representation for robust abnormality detection in crowded scenes, Pattern Recognit. 47 (5) (2014) 1791–1799.
[51] M. Aharon, M. Elad, A. Bruckstein, K-SVD: an algorithm for designing overcomplete dictionaries for sparse representation, IEEE Trans. Signal Process. 54 (11) (2006) 4311–4322.
[52] C. Bi, H. Wang, R. Bao, SAR image change detection using regularized dictionary learning and fuzzy clustering, in: IEEE Int. Conf. Cloud Comput. Intell. Syst. (CCIS), 2014, pp. 327–330.
[53] M. Yang, D. Dai, L. Shen, et al., Latent dictionary learning for sparse representation based classification, in: IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2014, pp. 4138–4145.
[54] M. Yang, L. Zhang, X. Feng, et al., Sparse representation based Fisher discrimination dictionary learning for image classification, Int. J. Comput. Vis. 109 (3) (2014) 209–232.
[55] S. Huang, D. Huang, X. Zhou, Learning multimodal deep representations for crowd anomaly event detection, Math. Probl. Eng. (2018) 1–13.
[56] M. Ravanbakhsh, M. Nabi, H. Mousavi, E. Sangineto, N. Sebe, Plug-and-play CNN for crowd motion analysis: an application in abnormal event detection, in: IEEE Winter Conf. Appl. Comput. Vis. (WACV), 2018, pp. 1689–1698.
[57] M. Xu, X. Yu, D. Chen, et al., An efficient anomaly detection system for crowded scenes using variational autoencoders, Appl. Sci. 9 (16) (2019) 3337.
[58] H.T.M. Tran, D. Hogg, Anomaly detection using a convolutional winner-take-all autoencoder, in: Proc. Br. Mach. Vis. Conf. (BMVC), 2017, pp. 1–12.
[59] M. Hasan, J. Choi, J. Neumann, A.K. Roy-Chowdhury, L.S. Davis, Learning temporal regularity in video sequences, in: IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2016, pp. 733–742.
[60] W. Luo, W. Liu, S. Gao, Remembering history with convolutional LSTM for anomaly detection, in: IEEE Int. Conf. Multimed. Expo (ICME), 2017, pp. 439–444.
[61] M. Sabokrou, M. Pourreza, M. Fayyaz, R. Entezari, et al., AVID: adversarial visual irregularity detection, in: Asian Conf. Comput. Vis. (ACCV), 2018, pp. 1–18.
[62] M. Sabokrou, M. Fayyaz, M. Fathy, Z. Moayed, R. Klette, Deep-anomaly: fully convolutional neural network for fast anomaly detection in crowded scenes, Comput. Vis. Image Underst. 172 (2018) 88–97.
[63] W. Liu, W. Luo, D. Lian, S. Gao, Future frame prediction for anomaly detection - a new baseline, in: IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2018, pp. 6536–6545.
[64] S. Wang, E. Zhu, J. Yin, F. Porikli, Video anomaly detection and localization by local motion based joint video representation and OCELM, Neurocomputing 277 (2018) 161–175.
[65] Q. Sun, H. Liu, T. Harada, Online growing neural gas for anomaly detection in changing surveillance scenes, Pattern Recognit. 64 (2017) 187–201.
[66] A. Elgammal, R. Duraiswami, D. Harwood, et al., Background and foreground modeling using nonparametric kernel density estimation for visual surveillance, Proc. IEEE 90 (7) (2002) 1151–1163.
[67] G. Liu, Z. Lin, Y. Yu, Robust subspace segmentation by low-rank representation, in: Int. Conf. Mach. Learn. (ICML), 2010, pp. 663–670.
[68] Z. Lin, M. Chen, Y. Ma, The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices, UIUC Technical Report UILU-ENG-09-2215, 2009.
[69] J.F. Cai, E.J. Candès, Z. Shen, A singular value thresholding algorithm for matrix completion, SIAM J. Optim. 20 (4) (2010) 1956–1982.
[70] G. Liu, Z. Lin, S. Yan, et al., Robust recovery of subspace structures by low-rank representation, IEEE Trans. Pattern Anal. Mach. Intell. 35 (1) (2013) 171–184.
[71] J. Yang, W. Yin, Y. Zhang, Y. Wang, A fast algorithm for edge-preserving variational multichannel image restoration, SIAM J. Imag. Sci. 2 (2) (2009) 569–592.
[72] D. Bertsekas, Constrained Optimization and Lagrange Multiplier Methods, Academic Press, 1982.
[73] Y. Zhang, Recent advances in alternating direction methods: practice and theory, tutorial, 2010.
[74] J. Eckstein, D. Bertsekas, On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators, Math. Progr. 55 (1992) 293–318.
[75] T. Fawcett, An introduction to ROC analysis, Pattern Recognit. Lett. 27 (8) (2006) 861–874.

Ang Li received the Bachelor degree in 2011 from Harbin Institute of Technology. He is now a doctoral student at Beijing Jiaotong University. His research interests include compressed sensing, video processing, abnormal event detection, sparse reconstruction, low-rank matrix reconstruction, etc.

Zhenjiang Miao received the B.E. degree from Tsinghua University, Beijing, China, in 1987, and the M.E. and Ph.D. degrees from Northern Jiaotong University, Beijing, China, in 1990 and 1994, respectively. From 1995 to 1998, he was a Post-Doctoral Fellow with the Ecole Nationale Superieure d'Electrotechnique, d'Electronique, d'Informatique, d'Hydraulique et des Telecommunications, Institut National Polytechnique de Toulouse, Toulouse, France. From 1998 to 2004, he was with the Institute of Information Technology, National Research Council Canada, Nortel Networks, Ottawa, ON, Canada. He joined Beijing Jiaotong University, Beijing, China, in 2004. He is currently a Professor and the Director of the Media Computing Center, Beijing Jiaotong University, and Director of the Institute for Digital Culture Research, Center for Ethnic and Folk Literature and Art Development, Ministry of Culture, Beijing, China. His current research interests include image and video processing, multimedia processing, and intelligent human/machine interaction.

Yigang Cen received the Ph.D. degree in control science engineering from the Huazhong University of Science and Technology, Wuhan, China, in 2006. In 2006, he joined the Signal Processing Centre, School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, as a Research Fellow. From 2014 to 2015, he was a Visiting Scholar with the Department of Computer Science, University of Missouri, Columbia, MO, USA. He is currently a Professor and a Supervisor of doctoral students with the School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China. His research interests include compressed sensing, sparse representation, low-rank matrix reconstruction, and wavelet construction theory.

Xiao-Ping Zhang received the B.S. and Ph.D. degrees in electronic engineering from Tsinghua University, Beijing, China, in 1992 and 1996, respectively, and the MBA degree (with honors) in finance, economics, and entrepreneurship from the University of Chicago Booth School of Business, Chicago, IL, USA. Since Fall 2000, he has been with the Department of Electrical and Computer Engineering, Ryerson University, Toronto, ON, Canada, where he is currently a Professor and the Director of the Communication and Signal Processing Applications Laboratory. He has served as Program Director of Graduate Studies. He is cross-appointed to the Finance Department, Ted Rogers School of Management, Ryerson University. He is Cofounder and CEO of EidoSearch, Toronto, ON, Canada. His research interests include statistical signal processing, multimedia content analysis, sensor networks and electronic systems, computational intelligence, and applications in bioinformatics, finance, and marketing. Dr. Zhang is a Registered Professional Engineer in the Province of Ontario, Canada. He is a member of the Beta Gamma Sigma Honor Society.

Linna Zhang received the M.S. degree in the College of Mechanical Engineering from Guizhou University, Guiyang, China, in 2010. She is currently a lecturer with the College of Mechanical Engineering, Guizhou University. Her research interests include signal processing, fault diagnosis, etc.

Shiming Chen received the Ph.D. degree in control science engineering from the Huazhong University of Science and Technology in 2006. He is currently a Supervisor of doctoral students with the School of Electrical and Automation Engineering, East China Jiaotong University. His research interests include signal processing, multi-robot control, complex networks, etc.